Building “Yarub” Library for Arabic NLP Purposes
How Omdena built Yarub, an open-source Arabic NLP toolkit with datasets for sentiment analysis, NER, embeddings, and dialect modeling.

August 12, 2021
7 minute read

Omdena developed “Yarub,” an open-source Arabic NLP library that provides essential tools for tasks such as sentiment analysis, NER, POS tagging, and morphological processing. By scraping, cleaning, and labeling high-quality Modern Standard Arabic datasets—where resources are historically limited—the project established a scalable foundation for advancing Arabic language AI applications in research, industry, and education.
Introduction
In this Omdena project, our goal was to develop open-source Python NLP libraries for the Arabic language that can be easily used across key Natural Language Processing tasks such as morphological analysis, named entity recognition (NER), sentiment analysis, word embeddings, dialect identification, part-of-speech tagging, and more.
As with any machine learning initiative, data quality plays a critical role in achieving strong model performance. However, for this project—Building Open Source NLP Libraries & Tools for the Arabic Language—the required data was not readily available. The complexity of the Arabic language, its rich morphology, and the existence of multiple dialects made data collection especially challenging.
To address this, we decided to focus the first phase solely on Modern Standard Arabic (MSA). Yet, this introduced an immediate barrier: there is a limited supply of pure MSA datasets, as much written Arabic blends MSA with classical and regional dialect forms.
This article outlines our data collection journey. At the beginning, the path was not entirely clear. However, thanks to Omdena’s bottom-up collaborative development approach, the pieces gradually aligned into a complete and effective strategy. Below, we summarize the main steps and insights gained throughout the process.
Collecting Modern Standard Arabic data
Training data is the data used to fit an algorithm or machine learning model so that it can predict the outcome we designed it for.
Test data is used to measure the performance of the trained model, such as its accuracy.
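For illustration, a common way to produce such a split in Python is scikit-learn's train_test_split; the toy sentences below are placeholders, not the project's actual data.

```python
from sklearn.model_selection import train_test_split

# Toy labeled examples (text, sentiment) standing in for a real MSA dataset.
texts = ["جميل جدا", "سيء للغاية", "عادي", "رائع"]
labels = ["positive", "negative", "neutral", "positive"]

# Hold out 25% of the examples as test data.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)
print(len(X_train), "training examples,", len(X_test), "test examples")
```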
We aimed to collect MSA datasets specified for the various models of our Arabic NLP toolkit, which are:
- Sentiment analysis
- Morphological modeling
- Named Entity Recognition (NER)
- Dialect identification
- Word embeddings
- Lemmatization
- Part-of-speech (POS) tagging
Our approach to building an Arabic NLP library
- Search for available suitable datasets.
- Scrape MSA text from various sources.
- Prepare the scraped data to be suitable for various models.
Using open-source NLP datasets
Existing open-source datasets were incorporated into the Yarub training datasets where available.
Pros:
- Existing datasets are easy to expand and integrate.
Challenges:
- Many datasets contain mixed classical Arabic and MSA.
- Labels differ across sources and require standardization (see the sketch after this list).
- Preprocessing and validation were necessary to ensure quality.
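Label standardization is a recurring chore when merging datasets. The sketch below uses hypothetical source label schemes mapped onto one unified sentiment scheme; the actual labels in the collected datasets may differ.

```python
# Hypothetical label schemes from two sentiment datasets; the real source
# labels encountered in the project may differ.
SOURCE_A = {"POS": "positive", "NEG": "negative", "OBJ": "neutral"}
SOURCE_B = {1: "positive", 0: "neutral", -1: "negative"}

def standardize(label, mapping):
    """Map a source-specific label onto the unified scheme."""
    if label not in mapping:
        raise ValueError(f"Unmapped label: {label!r}")
    return mapping[label]

print(standardize("POS", SOURCE_A))  # -> positive
print(standardize(-1, SOURCE_B))     # -> negative
```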
Web scraping and data acquisition

Before scraping websites, it is important to ensure ethical and legal compliance. Website permissions can be checked using:
example.com/robots.txt
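As a minimal sketch, Python's standard-library urllib.robotparser can check whether a given path may be fetched; the URL below is a placeholder for the site you intend to scrape.

```python
from urllib.robotparser import RobotFileParser

# Placeholder site; substitute the site you actually intend to scrape.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if the robots.txt rules allow a generic crawler to fetch this path.
print(rp.can_fetch("*", "https://example.com/some/page"))
```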
Pros
- Enables precise control over the dataset content.
- Allows targeting of pure MSA sources.
Challenges
- Annotation and labeling require significant time and teamwork.
Scraping news from newspaper websites
We also scraped several Arabic newspaper websites to gather MSA news text, following the directions provided on the documentation pages of the scraping tools described in the next sections.
Scraping Arabic books quotation website
We scraped noor-book.com, which contains ~80,000 user-contributed quotes.
Because the site uses infinite scrolling, we used Selenium + BeautifulSoup to load content dynamically, scroll the page, and extract text.
The main tools we used were the following (a sketch combining them follows this list):
- The 'os' module, a miscellaneous operating system interface; we used 'os.environ', a mapping object representing the string environment.
- 'webdriver' from the Selenium library; first, you need to add the folder containing the WebDriver binaries to your system's path, as explained in the Selenium documentation here.
- 'time.sleep', used to give the server the time it needs to handle our requests without being overloaded.
- BeautifulSoup, used to parse the loaded HTML; this requires some knowledge of the web page's structure and its HTML tags so you can define where the parts you need to scrape are located.
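Below is a minimal sketch combining these pieces, assuming ChromeDriver is on your PATH; the starting URL and the "quote-text" CSS class are placeholders that have to be read off the actual page's HTML.

```python
import time
from selenium import webdriver
from bs4 import BeautifulSoup

URL = "https://www.noor-book.com/"  # placeholder starting page; the quotes URL may differ

driver = webdriver.Chrome()  # assumes the ChromeDriver binary is on your system PATH
driver.get(URL)

# Scroll several times so the infinite-scroll page loads more content,
# pausing between scrolls so the server is not overloaded.
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# "quote-text" is a placeholder class name; inspect the page to find the real one.
quotes = [el.get_text(strip=True) for el in soup.find_all("div", class_="quote-text")]
print(len(quotes), "quotes scraped")
```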
Scraping tweets from Twitter
The idea behind scraping tweets was to target accounts that mainly use MSA in their tweets, such as:
- Official authorities' accounts.
- Politicians.
- Newspaper accounts.
We used Tweepy to query Twitter's API, which requires a Twitter developer account. To get one, you have to:
- First, have a Twitter account.
- Second, follow the steps provided here to apply for developer access; you will be guided through the process and asked to describe in your own words what you are building.

You can fetch tweets from any public account, but only the latest 3,200 tweets per account, and no more than 18,000 tweets per 15-minute window.
We first manually reviewed the selected accounts to make sure they use MSA in their tweets exclusively, or at least predominantly, since it is impossible to be 100 percent certain. Once you have decided which accounts to collect tweets from, you can follow the Tweepy documentation to start scraping; a minimal sketch follows.
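The sketch below shows the general pattern, assuming Tweepy 4.x; the credentials and the account name are placeholders.

```python
import tweepy

# Placeholder credentials from your Twitter developer account.
auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"
)
# wait_on_rate_limit makes Tweepy sleep through the 15-minute rate-limit windows.
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fetch up to the latest 3,200 tweets of one account ("SomeMSAAccount" is a placeholder).
tweets = tweepy.Cursor(
    api.user_timeline,
    screen_name="SomeMSAAccount",
    tweet_mode="extended",   # return the full, untruncated tweet text
    count=200,               # maximum number of tweets per API request
).items(3200)

texts = [tweet.full_text for tweet in tweets]
print(len(texts), "tweets collected")
```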
Data cleaning and processing
Most of the scraped data contained features undesirable for training the various NLP models, so some data cleaning became necessary, such as removing emojis, slashes, dashes, digits, and, in our case, Latin letters. For that, we used the 're' module (regular expression operations), as shown in its documentation and in the sketch below.
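A minimal sketch of this kind of cleaning follows; the exact patterns used in the project may differ.

```python
import re

def clean_msa_text(text: str) -> str:
    """Remove characters we do not want in the MSA training data."""
    text = re.sub(r"[A-Za-z]", "", text)            # Latin letters
    text = re.sub(r"[0-9\u0660-\u0669]", "", text)  # Western and Arabic-Indic digits
    text = re.sub(r"[/\\_\-]", " ", text)           # slashes, dashes, and underscores
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)  # common emoji ranges
    text = re.sub(r"\s+", " ", text).strip()        # collapse leftover whitespace
    return text

print(clean_msa_text("مرحبا 🙂 hello 123 بالعالم"))  # -> "مرحبا بالعالم"
```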
A structured exploratory data analysis (EDA) process also helps uncover hidden formatting issues, language noise, and annotation gaps, strengthening preprocessing before model training.
Using Doccano for data labeling
We tried the Doccano tool for labeling the web-scraped datasets, but it was not accurate enough for our purposes and we faced problems with consistency.
After these ups and downs, we successfully produced a scraped and labeled MSA dataset that met the project's requirements.
PyPI Yarub Library

Python packages are most often hosted on the Python Package Index (PyPI), historically known as the Cheese Shop. On PyPI, you can find everything from Hello World examples to advanced deep learning libraries.
Here you can find our Yarub library on PyPI.
Conclusion
We want to point out that the most important step was understanding the data specifications required by each task in the project before applying the techniques described above. That was only possible through communication: carefully listening to the members of the other task teams and repeatedly going back to them to make sure we were on the right track.
This project is also not the end of the road: we are going to develop further functionality related to our training datasets, including an Arabic image training dataset for computer vision challenges and research topics.
Finally, enjoy this video, which takes you on a short journey through our project.