Using AI to Translate Data Science Content into Arabic

Start Date: November 8, 2021

Challenge Background

Data science has made tremendous progress in the last few years, which makes it hard for translation efforts to keep up. While learning in a second language is possible, it isn’t as effective as learning in your native language.

The Problem

Use Deep Learning to help humans translate more Data Science content into Arabic.

The results should help begin an effort to translate more data science content into Arabic, helping students understand complex topics faster and more easily. The translation model can be continuously improved, gradually decreasing the effort humans need to edit the machine-translated articles and allowing more content to be available in Arabic.

Goal of the Project

Collect data about available Arabic resources explaining data science. Decide on one or a few under-represented topics in data science to work on translating into Arabic.
Apply Neural Machine Translation to translate data science blogs, articles, and lecture notes in the chosen under-represented topics from English to Arabic.
Collect parallel corpora consisting of text content in the chosen field that has been translated from English to Arabic by an expert human. Use these corpora to further improve the model's performance (e.g. by fine-tuning a pre-trained model).
Create a website to host the translated articles.

(Ideally, the website would have Wikipedia-like features that let users improve machine-translated articles. The improved articles could then be used as input for retraining the Neural Machine Translation model.)
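
Before a parallel corpus is used for fine-tuning, it usually helps to clean it. The following is a minimal sketch of one possible cleaning pass, not the project's actual pipeline. The function names, the length-ratio threshold, and the sample sentence pairs are all assumptions made for illustration.

```python
import re
import unicodedata

# Matches any character in the basic Arabic Unicode block.
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_parallel_corpus(pairs, max_len_ratio=3.0):
    """Return normalized, deduplicated (english, arabic) pairs.

    A pair is dropped when either side is empty, the Arabic side contains
    no Arabic letters, or the word counts differ by more than max_len_ratio
    (an arbitrary threshold chosen for this sketch).
    """
    seen = set()
    cleaned = []
    for english, arabic in pairs:
        english, arabic = normalize(english), normalize(arabic)
        if not english or not arabic:
            continue  # one side is missing
        if not ARABIC_CHARS.search(arabic):
            continue  # target side is probably untranslated
        n_en, n_ar = len(english.split()), len(arabic.split())
        if max(n_en, n_ar) / min(n_en, n_ar) > max_len_ratio:
            continue  # lengths differ too much, likely misaligned
        key = (english.lower(), arabic)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        cleaned.append((english, arabic))
    return cleaned


# Made-up sample data for illustration only.
raw_pairs = [
    ("Data science is important.", "علم البيانات مهم."),
    ("Data science  is important.", "علم البيانات مهم."),  # duplicate
    ("Overfitting", "Overfitting"),                          # untranslated
    ("", "نص"),                                              # empty source
]
cleaned_pairs = clean_parallel_corpus(raw_pairs)
print(len(cleaned_pairs))
```

Only the first pair survives here. Real corpora would likely need further checks, such as language identification and alignment scoring.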

Project Timeline

Collecting data about Arabic data science content.
Researching Neural Machine Translation and selecting a model architecture.
Exploring keyword extraction for technical terms.
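
As an illustration of keyword extraction for technical terms, here is a minimal TF-IDF sketch in plain Python. The project page does not say which extraction method was explored, so TF-IDF is an assumption, as are the stopword list, the function names, and the sample documents.

```python
import math
import re
from collections import Counter

# A tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "for", "on", "with", "that", "this", "it", "as", "by"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z][a-z\-]+", text.lower())
            if t not in STOPWORDS]


def top_keywords(documents, doc_index, k=3):
    """Rank the terms of one document by TF-IDF against the whole collection."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    counts = Counter(tokenized[doc_index])
    total = sum(counts.values())
    scores = {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Sort by descending score, then alphabetically to break ties.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Made-up sample documents for illustration only.
documents = [
    "Backpropagation computes gradients. Backpropagation trains the network.",
    "The network is trained on data.",
    "Data is split into training and test sets.",
]
print(top_keywords(documents, 0, k=3))
```

Terms that are frequent in one document but rare across the collection score highest, which tends to surface domain-specific vocabulary.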

Choosing a field in data science that’s underrepresented in Arabic.
Applying the chosen Neural Machine Translation model to the chosen field.
Collecting parallel corpora in English and Arabic for the chosen field.

Fine-tuning the model using collected parallel corpora.
Trying keyword extraction for improved translation.
Researching different options for hosting and editing the translated articles.
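
One possible way to use extracted technical terms during translation is to protect them with placeholders, translate the rest, and then substitute a fixed Arabic rendering. The sketch below assumes this approach; the project page does not state how keyword extraction was combined with translation. The glossary entries and the `translate` callable are hypothetical, and a real translation model may alter placeholders, so this would need testing against the chosen model.

```python
import re


def protect_terms(text, glossary):
    """Replace glossary terms with numbered placeholders.

    Returns the masked text and a mapping from placeholder to Arabic term.
    """
    mapping = {}
    # Longest terms first so multi-word terms win over their sub-terms.
    for term in sorted(glossary, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        if pattern.search(text):
            placeholder = f"__TERM{len(mapping)}__"
            text = pattern.sub(placeholder, text)
            mapping[placeholder] = glossary[term]
    return text, mapping


def restore_terms(text, mapping):
    """Swap each placeholder for its Arabic glossary entry."""
    for placeholder, arabic_term in mapping.items():
        text = text.replace(placeholder, arabic_term)
    return text


def translate_with_glossary(text, glossary, translate):
    """Mask glossary terms, call the translator, then restore the terms."""
    masked, mapping = protect_terms(text, glossary)
    return restore_terms(translate(masked), mapping)


# Hypothetical glossary; a real one would be curated by human translators.
glossary = {
    "neural network": "شبكة عصبية",
    "data science": "علم البيانات",
}

# Stand-in translator that returns its input unchanged, used only so the
# sketch runs without a model. Replace it with a real translation function.
def identity_translate(sentence):
    return sentence

masked_text, term_mapping = protect_terms(
    "A neural network is used in data science.", glossary
)
result = translate_with_glossary(
    "A neural network is used in data science.", glossary, identity_translate
)
print(masked_text)
print(result)
```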

Comparing model performance to alternatives and publishing results.
Integrating and documenting the system.
Deploying the articles on a web app.
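
Comparing translation models requires an automatic metric. The page does not name the metric used, so the sketch below assumes BLEU, a common choice for machine translation. It is a simplified corpus-level version with one reference per hypothesis, whitespace tokenization, and no smoothing. The function names and sample sentences are made up for illustration. For published results, an established tool such as sacreBLEU is usually preferable, and Arabic text generally needs more careful tokenization than a whitespace split.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Made-up outputs from two hypothetical systems, scored on one reference.
references = ["the model learns from labeled training data"]
system_a = ["the model learns from labeled training data"]
system_b = ["the model learns from data"]

score_a = corpus_bleu(system_a, references)
score_b = corpus_bleu(system_b, references)
print(round(score_a, 3), round(score_b, 3))
```

A higher score indicates closer n-gram overlap with the reference. Automatic scores are only a rough proxy, so human review of the translated articles remains useful.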

What you'll learn

1. Quantifying the state of Arabic content in data science and the fields that are still lacking content.

2. Learning, using, and improving Neural Machine Translation models for domain-specific data.

3. Learning how to deploy the results to a website for everyone to benefit.

First Omdena Local Chapter Project?

Beginner-friendly, but also welcomes experts

Education-focused

Duration: 4 to 8 weeks

Open-source

Your Benefits

Address a significant real-world problem with your skills

Build your project portfolio

Access paid projects (as an Omdena Top Talent)

Get hired at top organizations

Requirements

Good English

Suitable for AI/Data Science beginners as well as more senior collaborators

Learning mindset

Application Form

Application Closed.