Building “Yarub” Library for Arabic NLP Purposes

How Omdena built Yarub, an open-source Arabic NLP toolkit with datasets for sentiment analysis, NER, embeddings, and dialect modeling.

Omdena

August 12, 2021

7 minutes read


Omdena developed “Yarub,” an open-source Arabic NLP library that provides essential tools for tasks such as sentiment analysis, NER, POS tagging, and morphological processing. By scraping, cleaning, and labeling high-quality Modern Standard Arabic datasets—where resources are historically limited—the project established a scalable foundation for advancing Arabic language AI applications in research, industry, and education.

Introduction

In this Omdena project, our goal was to develop open-source Python NLP libraries for the Arabic language that can be easily used across key Natural Language Processing tasks such as morphological analysis, named entity recognition (NER), sentiment analysis, word embeddings, dialect identification, part-of-speech tagging, and more.

As with any machine learning initiative, data quality plays a critical role in achieving strong model performance. However, for this project—Building Open Source NLP Libraries & Tools for the Arabic Language—the required data was not readily available. The complexity of the Arabic language, its rich morphology, and the existence of multiple dialects made data collection especially challenging.

To address this, we decided to focus the first phase solely on Modern Standard Arabic (MSA). Yet, this introduced an immediate barrier: there is a limited supply of pure MSA datasets, as much written Arabic blends MSA with classical and regional dialect forms.

This article outlines our data collection journey. At the beginning, the path was not entirely clear. However, thanks to Omdena’s bottom-up collaborative development approach, the pieces gradually aligned into a complete and effective strategy. Below, we summarize the main steps and insights gained throughout the process.

Collecting Modern Standard Arabic data

Training data is the data used to train an algorithm or machine learning model to predict the outcome defined by our model design.

Test data is held out to measure the performance, such as the accuracy or efficiency, of the trained model on examples it has not seen during training.
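As a minimal illustration of this split (the texts, labels, and scikit-learn usage below are illustrative placeholders, not part of Yarub):

```python
from sklearn.model_selection import train_test_split

# Illustrative placeholders: a list of MSA sentences and their labels
texts = ["نص عربي فصيح للتوضيح", "جملة أخرى للتوضيح"]
labels = ["positive", "negative"]

# Hold out 20% of the data to measure model performance after training
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
```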

We aimed to collect MSA datasets specified for the various models of our Arabic NLP toolkit, which are:

  •  Sentiment analysis
  •  Morphological modeling
  •  Named Entity Recognition (NER)
  •  Dialect identification
  •  Word embeddings
  •  Lemmatization
  •  Part-of-speech (POS) tagging

Our approach to building an Arabic NLP library

  1. Search for available suitable datasets.
  2. Scrape MSA text from various sources.
  3. Prepare the scraped data to be suitable for various models.

Using open-source NLP datasets

Where suitable open-source datasets already existed, we incorporated them into the Yarub training datasets.

Pros:

  • Existing datasets are easy to expand and integrate.

Challenges:

  • Many datasets contain mixed classical Arabic and MSA.
  • Labels differ across sources and require standardization.
  • Preprocessing and validation were necessary to ensure quality.

Web scraping and data acquisition

Arabic NLP library - Source: Omdena

Before scraping websites, it is important to ensure ethical and legal compliance. Website permissions can be checked using:

example.com/robots.txt

If the permissions allow scraping, data may be extracted responsibly. For more details, see the Google Search Central documentation on robots.txt files.
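As a minimal sketch, Python's standard-library robotparser can check whether a given path may be fetched before any scraping starts (the domain, page path, and user agent below are placeholders):

```python
from urllib import robotparser

# Parse the site's robots.txt (example.com is a placeholder domain)
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check whether a generic crawler may fetch a given page before scraping it
if rp.can_fetch("*", "https://example.com/some-page"):
    print("Scraping this path is allowed by robots.txt")
else:
    print("robots.txt disallows this path; do not scrape it")
```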

Pros

  • Enables precise control over the dataset content.
  • Allows targeting of pure MSA sources.

Challenges

  • Annotation and labeling require significant time and teamwork.

Scraping news from newspaper websites

We collected news articles from Arabic newspaper websites, which publish almost entirely in MSA, following the directions provided in the documentation of the scraping tools described below.

Scraping Arabic books quotation website

We scraped noor-book.com, which contains ~80,000 user-contributed quotes.
Because the site uses infinite scrolling, we used Selenium + BeautifulSoup to load content dynamically, scroll the page, and extract text.

OS module:

Python's os module provides a miscellaneous operating-system interface.

Here we used os.environ, a mapping object representing the string environment, to make the WebDriver binary discoverable on the system path.

Selenium library:

We import webdriver from the Selenium library. First, however, you need to add the folder containing the WebDriver binaries to your system's path, as described in the Selenium documentation here.

Time library:

We used its sleep function to give the server time to respond to requests without being overloaded.

BeautifulSoup library:

Using it requires some knowledge of a web page's structure and its HTML tags. Once we identify where the parts we need to scrape are located, BeautifulSoup can parse the page and extract them.
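Putting these pieces together, the scraping loop for an infinitely scrolling page looked roughly like the sketch below (the driver path, page URL, and the "quote" CSS class are assumptions for illustration, not the site's actual markup):

```python
import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Make the WebDriver binary discoverable via the PATH environment variable
# ("/path/to/webdriver" is a placeholder for your local driver folder)
os.environ["PATH"] += os.pathsep + "/path/to/webdriver"

driver = webdriver.Chrome()
driver.get("https://www.noor-book.com")  # placeholder: the actual quotes page was used

# Scroll several times, sleeping between scrolls so the server is not overloaded
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# Parse the fully loaded page and extract the quote text
# ("quote" is an assumed CSS class, not the site's real markup)
soup = BeautifulSoup(driver.page_source, "html.parser")
quotes = [tag.get_text(strip=True) for tag in soup.find_all("div", class_="quote")]

driver.quit()
print(f"Collected {len(quotes)} quotes")
```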

Scraping tweets from Twitter

The idea behind scraping tweets is to target accounts that mainly use MSA in their tweets, such as:

  1. Official authorities' accounts.
  2. Politicians.
  3. Newspaper accounts.

We used Tweepy to query Twitter's API, which requires a Twitter developer account. To obtain one:

  1. First, have a Twitter account.
  2. Then follow the steps provided here to apply; you will be guided through the process and asked to describe in your own words what you are building.

You can get tweets from whatever account you want, but only the latest 3,200 tweets per account, and you cannot scrape more than 18,000 tweets per 15-minute window.

We first manually reviewed the selected accounts to make sure they use MSA in their tweets exclusively, or at least mostly, since it is impossible to be 100 percent certain. After determining which account you want to get tweets from, you can follow the Tweepy documentation to start scraping, as in the sketch below.
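A minimal sketch of pulling a user timeline with Tweepy is shown below (the credentials and screen name are placeholders, and exact method names can vary slightly between Tweepy versions):

```python
import tweepy

# Placeholder credentials from a Twitter developer account
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# wait_on_rate_limit pauses automatically when the 15-minute window is exhausted
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fetch up to the API limit of ~3,200 most recent tweets from an MSA-heavy account
tweets = []
for status in tweepy.Cursor(
    api.user_timeline, screen_name="example_account", tweet_mode="extended", count=200
).items(3200):
    tweets.append(status.full_text)

print(f"Collected {len(tweets)} tweets")
```

Setting wait_on_rate_limit spares you from handling the 15-minute rate-limit window manually: the client simply sleeps until the window resets.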

Data cleaning and processing

As most of the scraped data contained features undesirable for training the various NLP models, some cleaning became necessary, such as removing emojis, slashes, dashes, digits, and, in our case, Latin letters. For that we used the re module for regular expression operations, as shown in its documentation and in the sketch below.

A structured EDA process also helps uncover hidden formatting issues, language noise, and annotation gaps; exploring the data step by step in Python strengthens preprocessing before model training.
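A minimal sketch of the kind of cleaning function we applied is shown below (the exact character ranges kept or removed depend on the target task):

```python
import re

def clean_arabic_text(text: str) -> str:
    """Remove Latin letters, digits, and symbol noise from scraped Arabic text."""
    # Drop Latin letters and ASCII/Arabic-Indic digits
    text = re.sub(r"[A-Za-z0-9\u0660-\u0669]", " ", text)
    # Drop slashes, dashes, underscores, and other frequent symbol noise
    text = re.sub(r"[/\\\-_#@|]+", " ", text)
    # Drop anything outside the Arabic Unicode blocks and whitespace (this also strips emojis)
    text = re.sub(r"[^\u0600-\u06FF\u0750-\u077F\s]", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_arabic_text("مرحبا 123 hello 😀 بالعالم /-"))  # -> "مرحبا بالعالم"
```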

Using Doccano for data labeling

We tried the Doccano software for labeling the web-scraped datasets, but we faced problems with annotation consistency and accuracy.

Using Doccano for labeling scraped data – Source: Omdena

After this iterative process, we successfully obtained a scraped MSA dataset labeled according to the requirements.
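For downstream use, the labels can be read back from a Doccano JSONL export; the sketch below assumes one JSON object per line with "text" and "label" fields, which may differ between Doccano versions and export settings:

```python
import json

# Each line of a Doccano JSONL export is one annotated document
examples = []
with open("doccano_export.jsonl", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        record = json.loads(line)
        # Assumed fields: "text" plus a list of [start, end, label] spans
        examples.append((record["text"], record.get("label", [])))

print(f"Loaded {len(examples)} labeled examples")
```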

PyPI Yarub Library

Arabic NLP Yarub logo - Source: Omdena

Python packages are most often hosted at the Python Package Index (PyPI), historically known as the Cheese Shop. At PyPI, you can find everything from Hello World examples to advanced deep learning libraries.

Here you can find our Yarub library on PyPI.

Arabic NLP Library Yarub – Source: Omdena

Conclusion

We want to point out that the most important step was understanding the data specifications required by each task in the project before applying the techniques described above. That was only possible through communication and careful listening to the members of the other tasks, and by repeatedly going back to them to confirm that we were on the right track.

This project is not the end of the road: we are going to develop further functionality related to our training datasets, including an Arabic image training dataset for computer vision challenges and research topics.

In the end, enjoy this video that will take you on a short journey through our project.

This article is written by Reham Rafee Ahmed and Mastane Lael Abdul Gaffar Qureshi.


FAQs

What is Yarub?
Yarub is an open-source Python library designed to support core Arabic NLP tasks such as sentiment analysis, NER, lemmatization, and word embeddings.

Why focus on Modern Standard Arabic (MSA) first?
MSA offers a standardized linguistic structure, making it easier to model before expanding to regional dialects.

Why is collecting Arabic NLP data challenging?
Arabic has rich morphology and multiple dialects, and available datasets often mix forms of Arabic, requiring careful filtering and data cleaning.

How was the data gathered?
Data was gathered through open datasets, web scraping from MSA-focused sites, and tweets from official or verified Arabic accounts.

How was data quality ensured?
The team cleaned text to remove noise and manually checked sources to avoid dialect contamination and mislabeled entries.

Which NLP tasks does Yarub cover?
Sentiment analysis, morphological modeling, NER, dialect identification, word embeddings, lemmatization, and POS tagging.

How was the data labeled?
Doccano allowed collaborative annotation, though consistency checks and manual refinements were needed to ensure accuracy.

Will Yarub be extended beyond MSA?
Yes. Future phases include expanding datasets and models to cover major dialect groups and adding image-based datasets for multimodal use.