
Machine Learning and NLP For Arabic: Part Of Speech Tagging

March 4, 2024



This Omdena project is about building open source NLP libraries and tools for the Arabic language. Arabic is the fifth most spoken language in the world, yet it poses several challenges for NLP, such as complicated grammar and a wide range of dialects.

The goal of the project is to build an open source library that supports Arabic speakers in natural language processing applications. In this article, we talk about part-of-speech tagging for Arabic.

  • An Overview
  • Data Collection and Preprocessing
  • Tokenization
  • Padding Sequences
  • Word Embedding
  • Building Models
  • Model Evaluation
  • Predicting with our model
  • Conclusion

What is part-of-speech tagging?

Part-of-speech (POS) tagging means labeling each word with its appropriate part of speech, which describes how the word is used in a sentence.


The most basic models in natural language processing are based on Bag of Words, which is not ideal because it fails to capture any syntactic relations between words.

POS tagging is one technique we can use to improve on this bag-of-words representation.

Some applications of POS tagging:

  • Named entity recognition (NER)
  • Lemmatization, since reducing a word to its root form depends on its part of speech
  • Sentiment analysis

Assigning a word its POS tag depends on its context, so the task is not straightforward: the same word can take different tags in different sentences. For example, the Arabic word ذهب can be a verb ("went") or a noun ("gold") depending on the surrounding words.

In this article, we will look at using deep learning methods, specifically recurrent neural networks, for POS tagging.

Data Collection and Preprocessing

At first we used the open source Arabic dataset UD_Arabic-PADT, a well-known benchmark for POS tagging, but we then decided to generate an additional dataset in order to have a larger and more diverse training set.

The data collection subteam scraped text from the web and used existing libraries to generate annotated data: sentences in which each word is assigned a POS tag. The final dataset consists of about 36,000 sentences.
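The preprocessing code below reads the annotations from a CSV file with sentence_id, word, and tag columns. The exact layout of the project's file is an assumption here, but an illustrative fragment would look roughly like this:

sentence_id,word,tag
1,جون,proper noun
1,يحب,verb
1,البيت,noun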

We then applied several preprocessing steps to put the data in the right form to be fed to a machine learning model.

import pandas as pd
from tqdm import tqdm

def process_csv(csv):
    # Group the flat (sentence_id, word, tag) rows back into one list
    # of words and one list of tags per sentence
    df = pd.read_csv(csv)
    train_text, train_tags = [], []
    for i in tqdm(df['sentence_id'].unique()):
        train_text.append(df[df['sentence_id'] == i]['word'].tolist())
        train_tags.append(df[df['sentence_id'] == i]['tag'].tolist())
    return train_text, train_tags
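Calling the function on the annotated file gives one list of words and one list of tags per sentence (the filename below is just a placeholder):

train_text, train_tags = process_csv('arabic_pos_dataset.csv')  # hypothetical filename
print(train_text[0])  # words of the first sentence
print(train_tags[0])  # their corresponding tags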

 

The next preprocessing step was to remove diacritics (tashkeel) and letter elongation and to normalize some characters, so that we end up with clean, normalized sentences.

import re

def clean_str(text):
    # remove tashkeel (Arabic diacritics)
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = re.sub(p_tashkeel, "", text)
    # remove longation (collapse runs of a repeated character to two)
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)
    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')
    # normalize alef and alef maqsura variants
    text = text.replace('أ', 'ا')
    text = text.replace('إ', 'ا')
    text = text.replace('آ', 'ا')
    text = text.replace('ى', 'ي')
    return text.split()

for i in range(len(train_text)):
    train_text[i] = clean_str(' '.join(train_text[i]))
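As a quick illustration (not from the original pipeline), a diacritized phrase comes back as a list of clean, undiacritized tokens:

clean_str('الْكِتَابُ جَمِيلٌ')
# ['الكتاب', 'جميل']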

Tokenization

A machine learning model doesn't understand raw words, so both the input and output data need to be encoded: we give a unique integer id to each word in the input data and, likewise, a unique id to each tag in the output data.

We used the Tokenizer class from the Keras library to encode each text sequence as an integer sequence that can be used as input to a machine learning model for training.

from tensorflow.keras.preprocessing.text import Tokenizer

# oov_tok is the placeholder token used for out-of-vocabulary words, e.g. '<OOV>'
word_tokenizer = Tokenizer(oov_token=oov_tok)
word_tokenizer.fit_on_texts(train_text)
VOCABULARY_SIZE = len(word_tokenizer.word_index) + 1
X_encoded_train = word_tokenizer.texts_to_sequences(train_text)

tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(train_tags)
Y_encoded_train = tag_tokenizer.texts_to_sequences(train_tags)

Padding sequences

Our sentences do not all have the same length, so we plotted a histogram of sentence lengths to find a suitable maximum length. Shorter sequences are then padded with zeros up to that length so that every sequence has the same shape.


We used the pad_sequences function from Keras with a sequence length of 50 and padding_type = 'post', which means the padding zeros are added at the end of each sequence.

Our data is now in the right form to be used by a machine learning model, so finally we split it, keeping 20% for validation and about 15% of the remainder for testing.

from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_encoded_train, maxlen=MAX_SEQUENCE_LENGTH, padding=padding_type, truncating=trunc_type)
Y_train = pad_sequences(Y_encoded_train, maxlen=MAX_SEQUENCE_LENGTH, padding=padding_type, truncating=trunc_type)
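One step the snippets above don't show is turning the padded tag sequences into one-hot vectors, which the model needs because it is compiled with categorical_crossentropy and the evaluation code later takes an argmax over Y_test. A minimal sketch of that step, assuming NUM_CLASSES is the tag vocabulary size plus one for the padding id:

from tensorflow.keras.utils import to_categorical

NUM_CLASSES = len(tag_tokenizer.word_index) + 1  # assumed definition
Y_train = to_categorical(Y_train, num_classes=NUM_CLASSES)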

 

Split data into Training, Validation and Testing Datasets

from sklearn.model_selection import train_test_split

X_train, X_valid, Y_train, Y_valid = train_test_split(X_train, Y_train, test_size=0.20, random_state=41)
X_train, X_test, Y_train, Y_test = train_test_split(X_train, Y_train, test_size=0.15, random_state=41)

Word embedding

Word embeddings are a type of word representation that allows words with similar meanings to have similar representations. We used a pretrained embedding matrix from AraVec in our model: it is trained on a large corpus, so the words in our vocabulary that also exist in the downloaded model get meaningful vectors, which helps the model capture the semantics of each word.

!wget https://bakrianoo.ewr1.vultrobjects.com/aravec/full_grams_cbow_300_twitter.zip
!unzip full_grams_cbow_300_twitter.zip

import gensim
import numpy as np

embedding_dim = 300  # the AraVec CBOW Twitter vectors are 300-dimensional

embedding_model = gensim.models.Word2Vec.load('full_grams_cbow_300_twitter.mdl')
embeddings = {}
for word, vector in zip(embedding_model.wv.vocab, embedding_model.wv.vectors):
    coefs = np.array(vector, dtype='float32')
    embeddings[word] = coefs

# Build the embedding matrix: one row per word id in our tokenizer,
# filled with the AraVec vector when the word is found
embeddings_weights = np.zeros((VOCABULARY_SIZE, embedding_dim))
for word, i in word_tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embeddings_weights[i] = embedding_vector
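It can also be useful to check how much of our vocabulary is covered by the pretrained vectors; this quick check is an addition here, not part of the original pipeline:

covered = sum(1 for row in embeddings_weights if row.any())
print(f'{covered} of {VOCABULARY_SIZE} words have a pretrained AraVec vector')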

Building the Models

Building the model is the key step in training any machine learning system, so we chose to build several neural network models and compare them.

The neural network models we chose to build are a recurrent neural network (RNN), a gated recurrent unit (GRU) network, a long short-term memory (LSTM) network, and a bidirectional LSTM (BiLSTM).

For all models we used categorical cross-entropy as the loss function, which measures the cross-entropy between the output tensor and the target tensor, and the Adam optimizer, an adaptive replacement for plain stochastic gradient descent that is widely used for training deep learning models.

All models were built with the Keras library, which we also used for tokenization, in addition to the libraries used earlier to prepare the data: Pandas for preprocessing and Gensim for the word embeddings.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Embedding, Bidirectional, LSTM, TimeDistributed, Dense

model = Sequential()
model.add(InputLayer(input_shape=(MAX_SEQUENCE_LENGTH,)))
model.add(Embedding(input_dim=VOCABULARY_SIZE,
                    output_dim=embedding_dim,
                    input_length=MAX_SEQUENCE_LENGTH,
                    weights=[embeddings_weights],
                    trainable=True))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(NUM_CLASSES, activation='softmax')))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

 

Train Model

The model is now ready to be trained. We trained it for 50 epochs with a batch size of 128. We also used the ReduceLROnPlateau callback from Keras, so the learning rate is reduced when the accuracy does not improve for 6 epochs, and we saved the model weights from the epoch with the highest validation accuracy.
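The training call itself is not shown above; a minimal sketch of what it could look like with those settings (the reduction factor and checkpoint filename below are assumptions):

from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

callbacks = [
    ReduceLROnPlateau(monitor='val_accuracy', factor=0.5, patience=6),  # factor is assumed
    ModelCheckpoint('best_pos_model.h5', monitor='val_accuracy',
                    save_best_only=True, save_weights_only=True),  # hypothetical filename
]
result = model.fit(X_train, Y_train,
                   validation_data=(X_valid, Y_valid),
                   epochs=50, batch_size=128,
                   callbacks=callbacks)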

After that, we built a support vector machine (SVM) model.

For the regularization parameter, whose strength is inversely proportional to C, we chose C = 10.0.

For the kernel, which can be 'linear', 'poly', 'rbf', 'sigmoid', or 'precomputed' (default 'rbf'), we chose kernel = 'rbf'.

At first the SVM took about an hour to train, but we found the thundersvm library, which provides GPU acceleration for SVMs. With thundersvm, training took around 5 minutes instead of an hour.

! git clone https://github.com/Xtra-Computing/thundersvm.git
! cd thundersvm && mkdir build && cd build && cmake .. && make -j
! python /content/thundersvm/python/setup.py install

from importlib.machinery import SourceFileLoader
thundersvm = SourceFileLoader("thundersvm", "/content/thundersvm/python/thundersvm/thundersvm.py").load_module()
from thundersvm import SVC

clf = SVC(C=10)
clf.fit(x_train, y_train)
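The snippet above does not show how x_train and y_train for the SVM were prepared. Since an SVM classifies one fixed-length vector at a time, one plausible setup (an assumption here, not necessarily what the team did) is to treat each word as a separate example, using its AraVec embedding row as the features and its tag id as the label:

# Hypothetical per-word features and labels for the SVM (not from the article)
x_train = np.array([embeddings_weights[w] for sent in X_encoded_train for w in sent])
y_train = np.array([t for tags in Y_encoded_train for t in tags])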

SVM

A support vector machine (SVM) is a supervised learning method that can be used for classification, regression, and outlier detection.

sklearn.svm.SVC(C=10.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)

 

Additionally, we tried BERT as well, but its accuracy was not as good and it took a long time to run. So we decided to use the BiLSTM, as it had the best accuracy.

Model Evaluation

Plotting the training curves is useful: it can reveal problems during training, such as overfitting, and it shows the difference in model performance between the training data and the unseen validation data.

import matplotlib.pyplot as plt

plt.plot(result.history['accuracy'])
plt.plot(result.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

 

Model Accuracy

After training, we evaluate the model on the test set and compute its accuracy.

loss, accuracy = model.evaluate(X_test, Y_test, verbose=1)
print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))

 

We then calculated the F1 score for the BiLSTM and the SVM.

from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

# Flatten the per-word predictions and labels before computing the macro F1 score
y_pred1 = model.predict(X_test)
y_pred = np.argmax(y_pred1, axis=-1)
y_pred = y_pred.reshape((y_pred.shape[0] * y_pred.shape[1],))
Y_test = np.argmax(Y_test, axis=-1)
Y_test = Y_test.reshape((Y_test.shape[0] * Y_test.shape[1],))
print(f1_score(Y_test, y_pred, average="macro"))

Predicting with our model

Now we can use the trained model to tag a new sentence, but first the sentence must go through the same preprocessing steps: removing diacritics, converting the sentence into a sequence of word ids, and padding that sequence so it has the shape the model expects.

def classify(sentence):
    # Apply the same cleaning, encoding, and padding used for the training data
    sentence = clean_str(sentence)
    seq = [word_tokenizer.texts_to_sequences(sentence)]
    pad_seq = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH, padding=padding_type, truncating=trunc_type)
    pad_seq = np.squeeze(pad_seq, axis=-1)
    pred = np.squeeze(model.predict(pad_seq).argmax(-1))
    # Map predicted tag ids back to tag names, skipping the padding id 0
    output = [tag_tokenizer.index_word[tag] for tag in pred if tag != 0]
    return output

sentence = “جون يحب البيت الأزرق في نهاية الشارع”
output = classify(sentence)
word_tag = [(sentence.split()[i],output[i]) for i in range(len(sentence.split()))]
print(word_tag)

Output

[('جون', 'proper noun'), ('يحب', 'verb'), ('البيت', 'noun'), ('الأزرق', 'noun'), ('في', 'preposition'), ('نهاية', 'noun'), ('الشارع', 'noun')]

Conclusion

To sum up, the model that achieved the best accuracy is the BiLSTM, with an accuracy of 97.94% and an F1 score of 90.19%. In this article, we presented the process of preparing and preprocessing the dataset, going through tokenization, building the embedding matrix, and padding the sequences. The models we implemented are an RNN, a GRU, an LSTM, a BiLSTM, and an SVM. Finally, we evaluated every model, compared the results, and selected the model with the best accuracy.

Comparison of results across the different models

The Final Results

Developed by: Eslam Saleh and Omar Nabil Fathy

This article is written by Omar Nabil.

Ready to test your skills?

If you’re interested in collaborating, apply to join an Omdena project at: https://www.omdena.com/projects
