Building “Yarub” Library for Arabic NLP Purposes

How Omdena built Yarub, an open-source Arabic NLP toolkit with datasets for sentiment analysis, NER, embeddings, and dialect modeling.

Omdena

August 12, 2021

7 minutes read


Omdena developed “Yarub,” an open-source Arabic NLP library that provides essential tools for tasks such as sentiment analysis, NER, POS tagging, and morphological processing. By scraping, cleaning, and labeling high-quality Modern Standard Arabic datasets—where resources are historically limited—the project established a scalable foundation for advancing Arabic language AI applications in research, industry, and education.

Introduction

In this Omdena project, our goal was to develop open-source Python NLP libraries for the Arabic language that can be easily used across key Natural Language Processing tasks such as morphological analysis, named entity recognition (NER), sentiment analysis, word embeddings, dialect identification, part-of-speech tagging, and more.

As with any machine learning initiative, data quality plays a critical role in achieving strong model performance. However, for this project—Building Open Source NLP Libraries & Tools for the Arabic Language—the required data was not readily available. The complexity of the Arabic language, its rich morphology, and the existence of multiple dialects made data collection especially challenging.

To address this, we decided to focus the first phase solely on Modern Standard Arabic (MSA). Yet, this introduced an immediate barrier: there is a limited supply of pure MSA datasets, as much written Arabic blends MSA with classical and regional dialect forms.

This article outlines our data collection journey. At the beginning, the path was not entirely clear. However, thanks to Omdena’s bottom-up collaborative development approach, the pieces gradually aligned into a complete and effective strategy. Below, we summarize the main steps and insights gained throughout the process.

Collecting Modern Standard Arabic data

Training data is the data used to train an algorithm or machine learning model to predict the outcome defined by our model design.

Test data is held out to measure the performance, such as the accuracy or efficiency, of the trained model on examples it has not seen during training.
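As a minimal illustration of this split (the texts, labels, and scikit-learn usage below are illustrative placeholders, not part of Yarub):

```python
from sklearn.model_selection import train_test_split

# Illustrative placeholders: a list of MSA sentences and their labels
texts = ["نص عربي فصيح للتوضيح", "جملة أخرى للتوضيح"]
labels = ["positive", "negative"]

# Hold out 20% of the data to measure model performance after training
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
```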

We aimed to collect MSA datasets specified for the various models of our Arabic NLP toolkit, which are:

  •  Sentiment analysis
  •  Morphological modeling
  •  Named Entity Recognition (NER)
  •  Dialect identification
  •  Word embeddings
  •  Lemmatization
  •  Part-of-speech (POS) tagging

Our approach to building an Arabic NLP library

  1. Search for available suitable datasets.
  2. Scrape MSA text from various sources.
  3. Prepare the scraped data to be suitable for various models.

Using open-source NLP datasets

Where suitable open-source datasets already existed, we incorporated them into the Yarub training datasets.

Pros:

  • Existing datasets are easy to expand and integrate.

Challenges:

  • Many datasets contain mixed classical Arabic and MSA.
  • Labels differ across sources and require standardization.
  • Preprocessing and validation were necessary to ensure quality.

Web scraping and data acquisition

Arabic NLP library - Source: Omdena

Before scraping websites, it is important to ensure ethical and legal compliance. Website permissions can be checked using:

example.com/robots.txt

If the permissions allow scraping, data may be extracted responsibly. For more details, see the Google Search Central documentation on robots.txt files.
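As a minimal sketch, Python's standard-library robotparser can check whether a given path may be fetched before any scraping starts (the domain, page path, and user agent below are placeholders):

```python
from urllib import robotparser

# Parse the site's robots.txt (example.com is a placeholder domain)
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check whether a generic crawler may fetch a given page before scraping it
if rp.can_fetch("*", "https://example.com/some-page"):
    print("Scraping this path is allowed by robots.txt")
else:
    print("robots.txt disallows this path; do not scrape it")
```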

Pros

  • Enables precise control over the dataset content.
  • Allows targeting of pure MSA sources.

Challenges

  • Annotation and labeling require significant time and teamwork.

Scraping news from newspaper websites

We collected news articles from Arabic newspaper websites, which publish almost entirely in MSA, following the directions provided in the documentation of the scraping tools described below.

Scraping Arabic books quotation website

We scraped noor-book.com, which contains ~80,000 user-contributed quotes.
Because the site uses infinite scrolling, we used Selenium + BeautifulSoup to load content dynamically, scroll the page, and extract text.

OS module:

Python's os module provides a miscellaneous operating-system interface.

Here we used os.environ, a mapping object representing the string environment, to make the WebDriver binary discoverable on the system path.

Selenium library:

We import webdriver from the Selenium library. First, however, you need to add the folder containing the WebDriver binaries to your system's path, as described in the Selenium documentation here.

Time library:

We used its sleep function to give the server time to respond to requests without being overloaded.

BeautifulSoup library:

Using it requires some knowledge of a web page's structure and its HTML tags. Once we identify where the parts we need to scrape are located, BeautifulSoup can parse the page and extract them.
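Putting these pieces together, the scraping loop for an infinitely scrolling page looked roughly like the sketch below (the driver path, page URL, and the "quote" CSS class are assumptions for illustration, not the site's actual markup):

```python
import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Make the WebDriver binary discoverable via the PATH environment variable
# ("/path/to/webdriver" is a placeholder for your local driver folder)
os.environ["PATH"] += os.pathsep + "/path/to/webdriver"

driver = webdriver.Chrome()
driver.get("https://www.noor-book.com")  # placeholder: the actual quotes page was used

# Scroll several times, sleeping between scrolls so the server is not overloaded
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# Parse the fully loaded page and extract the quote text
# ("quote" is an assumed CSS class, not the site's real markup)
soup = BeautifulSoup(driver.page_source, "html.parser")
quotes = [tag.get_text(strip=True) for tag in soup.find_all("div", class_="quote")]

driver.quit()
print(f"Collected {len(quotes)} quotes")
```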

Scraping tweets from Twitter

The idea behind scraping tweets is to target accounts that mainly use MSA in their tweets, such as:

  1. Official authorities' accounts.
  2. Politicians.
  3. Newspaper accounts.

We used Tweepy to query Twitter's API, which requires a Twitter developer account. To obtain one:

  1. First, have a Twitter account.
  2. Then follow the steps provided here to apply; you will be guided through the process and asked to describe in your own words what you are building.

You can get tweets from whatever account you want, but only the latest 3,200 tweets per account, and you cannot scrape more than 18,000 tweets per 15-minute window.

We first manually reviewed the selected accounts to make sure they use MSA in their tweets exclusively, or at least mostly, since it is impossible to be 100 percent certain. After determining which account you want to get tweets from, you can follow the Tweepy documentation to start scraping, as in the sketch below.
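A minimal sketch of pulling a user timeline with Tweepy is shown below (the credentials and screen name are placeholders, and exact method names can vary slightly between Tweepy versions):

```python
import tweepy

# Placeholder credentials from a Twitter developer account
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# wait_on_rate_limit pauses automatically when the 15-minute window is exhausted
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fetch up to the API limit of ~3,200 most recent tweets from an MSA-heavy account
tweets = []
for status in tweepy.Cursor(
    api.user_timeline, screen_name="example_account", tweet_mode="extended", count=200
).items(3200):
    tweets.append(status.full_text)

print(f"Collected {len(tweets)} tweets")
```

Setting wait_on_rate_limit spares you from handling the 15-minute rate-limit window manually: the client simply sleeps until the window resets.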

Data cleaning and processing

As most of the scraped data contained features undesirable for training the various NLP models, some cleaning became necessary, such as removing emojis, slashes, dashes, digits, and, in our case, Latin letters. For that we used the re module for regular expression operations, as shown in its documentation and in the sketch below.

A structured EDA process also helps uncover hidden formatting issues, language noise, and annotation gaps; exploring the data step by step in Python strengthens preprocessing before model training.
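A minimal sketch of the kind of cleaning function we applied is shown below (the exact character ranges kept or removed depend on the target task):

```python
import re

def clean_arabic_text(text: str) -> str:
    """Remove Latin letters, digits, and symbol noise from scraped Arabic text."""
    # Drop Latin letters and ASCII/Arabic-Indic digits
    text = re.sub(r"[A-Za-z0-9\u0660-\u0669]", " ", text)
    # Drop slashes, dashes, underscores, and other frequent symbol noise
    text = re.sub(r"[/\\\-_#@|]+", " ", text)
    # Drop anything outside the Arabic Unicode blocks and whitespace (this also strips emojis)
    text = re.sub(r"[^\u0600-\u06FF\u0750-\u077F\s]", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_arabic_text("مرحبا 123 hello 😀 بالعالم /-"))  # -> "مرحبا بالعالم"
```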

Using Doccano for data labeling

We tried the Doccano software for labeling the web-scraped datasets, but we faced problems with annotation consistency and accuracy.

Using Doccano for labeling scraped data – Source: Omdena

After this iterative process, we successfully obtained a scraped MSA dataset labeled according to the requirements.
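For downstream use, the labels can be read back from a Doccano JSONL export; the sketch below assumes one JSON object per line with "text" and "label" fields, which may differ between Doccano versions and export settings:

```python
import json

# Each line of a Doccano JSONL export is one annotated document
examples = []
with open("doccano_export.jsonl", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        record = json.loads(line)
        # Assumed fields: "text" plus a list of [start, end, label] spans
        examples.append((record["text"], record.get("label", [])))

print(f"Loaded {len(examples)} labeled examples")
```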

PyPI Yarub Library

Arabic NLP Yarub logo - Source: Omdena

Python packages are most often hosted at the Python Package Index (PyPI), historically known as the Cheese Shop. At PyPI, you can find everything from Hello World examples to advanced deep learning libraries.

Here you can find our Yarub library on PyPI.

Arabic NLP Library Yarub – Source: Omdena

Conclusion

We want to point out that the most important step was understanding the data specifications required by each task in the project before applying the techniques described above. That was only possible through communication and careful listening to the members of the other tasks, and by repeatedly going back to them to confirm that we were on the right track.

This project is not the end of the road: we are going to develop further functionality related to our training datasets, including an Arabic image training dataset for computer vision challenges and research topics.

In the end, enjoy this video that will take you on a short journey through our project.

This article is written by Reham Rafee Ahmed and Mastane Lael Abdul Gaffar Qureshi.


FAQs

What is Yarub?
Yarub is an open-source Python library designed to support core Arabic NLP tasks such as sentiment analysis, NER, lemmatization, and word embeddings.

Why focus on Modern Standard Arabic (MSA) first?
MSA offers a standardized linguistic structure, making it easier to model before expanding to regional dialects.

Why is collecting Arabic NLP data challenging?
Arabic has rich morphology and multiple dialects, and available datasets often mix forms of Arabic, requiring careful filtering and data cleaning.

How was the data gathered?
Data was gathered through open datasets, web scraping from MSA-focused sites, and tweets from official or verified Arabic accounts.

How was data quality ensured?
The team cleaned text to remove noise and manually checked sources to avoid dialect contamination and mislabeled entries.

Which NLP tasks does Yarub cover?
Sentiment analysis, morphological modeling, NER, dialect identification, word embeddings, lemmatization, and POS tagging.

How was the data labeled?
Doccano allowed collaborative annotation, though consistency checks and manual refinements were needed to ensure accuracy.

Will Yarub be extended beyond MSA?
Yes. Future phases include expanding datasets and models to cover major dialect groups and adding image-based datasets for multimodal use.