Resources for Low Resource Machine Translation
(This blog post is adapted from a set of resources I put together for the Masakhane project, which is working on machine translation for African languages.)
There is a wide variety of techniques to employ when trying to create a new machine translation model for a low-resource language or to improve an existing baseline. The applicability of these techniques generally depends on what parallel and monolingual corpora exist for the target language, and what parallel corpora exist for related languages and domains.
Common scenarios
Scenario #1 - The data you have is super noisy (e.g., scraped from the web), and you aren't sure which sentence pairs are "good"
Papers:
Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
Resources/ examples:
Implementation - fast_align: creates word alignments that can be used to score sentence pairs
Implementation - LASER (Language-Agnostic SEntence Representations): multilingual sentence embeddings for scoring pairs, as in the sketch below
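To make the embedding-based filtering idea concrete, here is a minimal sketch in the spirit of LASER: embed both sides of every pair into the shared multilingual space and keep pairs whose cosine similarity clears a threshold. It assumes the `laserembeddings` Python package (a community wrapper around LASER), and the 0.8 threshold is an arbitrary placeholder you would tune on a held-out sample of your data.

```python
import numpy as np
from laserembeddings import Laser  # pip install laserembeddings

def filter_pairs(src_sents, tgt_sents, src_lang, tgt_lang, threshold=0.8):
    """Keep pairs whose sentence embeddings are close in the shared space."""
    laser = Laser()
    # Embed both sides into LASER's language-agnostic space (N x 1024).
    src_emb = laser.embed_sentences(src_sents, lang=src_lang)
    tgt_emb = laser.embed_sentences(tgt_sents, lang=tgt_lang)
    # Cosine similarity between each aligned pair of embeddings.
    src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt_emb /= np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    scores = (src_emb * tgt_emb).sum(axis=1)
    return [(s, t) for s, t, score in zip(src_sents, tgt_sents, scores)
            if score >= threshold]

# A well-aligned pair should score high; a misaligned one should be dropped.
kept = filter_pairs(
    ["The cat sat on the mat.", "Stock prices fell sharply."],
    ["Le chat est assis sur le tapis.", "Je voudrais un café."],
    src_lang="en", tgt_lang="fr",
)
```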
Scenario #2 - You don't have any parallel data for the source-target language pair, only monolingual target data
Papers:
Resources/ examples:
Scenario #3 - You only have a small amount of parallel data for the source-target language pair, but you have lots of parallel data for a related source-target language pair
Papers:
Rapid Adaptation of Neural Machine Translation to New Languages
Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation
Transfer Learning for Low-Resource Neural Machine Translation
Trivial Transfer Learning for Low-Resource Neural Machine Translation
Pivot-based Transfer Learning for Neural Machine Translation between Non-English Languages
Resources/ examples:
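None of the papers above prescribes a single implementation, but the common recipe is: train a parent model on the high-resource related pair, copy its weights into a child model, and fine-tune on the small corpus. The PyTorch sketch below shows only the weight-transfer step with a toy model; `TinyNMT` and its layer names are stand-ins I made up, not any real toolkit's architecture.

```python
import torch
import torch.nn as nn

class TinyNMT(nn.Module):
    """Toy stand-in for an encoder-decoder NMT model (not a real toolkit)."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(vocab_size, dim)
        self.tgt_embed = nn.Embedding(vocab_size, dim)
        self.body = nn.GRU(dim, dim, batch_first=True)  # inner translation layers
        self.output_proj = nn.Linear(dim, vocab_size)

parent = TinyNMT(vocab_size=32000)  # pretend this was trained on the related pair
child = TinyNMT(vocab_size=8000)    # fresh vocabulary for the low-resource pair

# Transfer every parent weight except the vocabulary-dependent layers,
# which must be re-initialized because the child vocabulary differs.
transferred = {k: v for k, v in parent.state_dict().items()
               if not k.startswith(("src_embed", "tgt_embed", "output_proj"))}
result = child.load_state_dict(transferred, strict=False)
print(result.missing_keys)  # exactly the vocab-specific layers left untouched
# ...now fine-tune `child` on the small source-target parallel corpus.
```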
Scenario #4 - You only have a small amount of parallel data for the source-target language pair, but you have lots of monolingual data for the target and/or source language
Papers:
Improving Neural Machine Translation Models with Monolingual Data
Improving Back-Translation with Uncertainty-based Confidence Estimation
Neural Machine Translation of Low-Resource and Similar Languages with Backtranslation
Resources/ examples:
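Back-translation, introduced in the first paper above, augments the real parallel data with synthetic pairs: a reverse (target-to-source) model translates your monolingual target sentences, and the machine-generated source sentences are paired with the genuine target sentences. The sketch below shows only the data-assembly step; `reverse_translate` is a placeholder for whatever decoding function your NMT toolkit provides.

```python
def back_translate(mono_target, reverse_translate, real_pairs):
    """Assemble an augmented training set for a source->target model.

    mono_target       -- monolingual sentences in the target language
    reverse_translate -- a target->source translation function (placeholder
                         for your toolkit's decoder)
    real_pairs        -- (source, target) tuples from the genuine parallel data
    """
    synthetic_pairs = []
    for tgt in mono_target:
        synthetic_src = reverse_translate(tgt)  # machine-generated source side
        # The target side stays human-written, so the forward model still
        # learns to produce fluent target-language output.
        synthetic_pairs.append((synthetic_src, tgt))
    # In practice the real data is often upsampled so synthetic pairs
    # don't dominate; here we simply concatenate.
    return real_pairs + synthetic_pairs
```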
Scenario #5 - You have a small amount of parallel data for the source-target language pair, but you also have a lot of parallel data for other language pairs
Papers:
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Multilingual Neural Machine Translation With Soft Decoupled Encoding
Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies
Resources/ examples:
Blog - Exploring Massively Multilingual, Massive Neural Machine Translation
Blog - Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System
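A core trick behind Google's multilingual system (described in the zero-shot blog post above) is sharing one model across all language pairs and prepending an artificial token that tells the model which target language to produce. A minimal sketch of that preprocessing step follows; the `<2xx>` token format is illustrative, as toolkits differ.

```python
def tag_for_target_language(src_sentence, target_lang):
    """Prepend an artificial token telling the model what to produce.

    One shared model then serves every language pair, and the same tag can
    steer it toward pairs never seen together in training (zero-shot).
    """
    return f"<2{target_lang}> {src_sentence}"

# Hypothetical training examples for a many-to-many model:
training_examples = [
    (tag_for_target_language("How are you?", "sw"), "Habari yako?"),
    (tag_for_target_language("Good morning", "yo"), "E kaaro"),
]
```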
Want to learn more?
If you're interested in leveling up your AI skills even further (maybe you want to transition from software engineer to AI practitioner), consider joining us for the AI Classroom event. AI Classroom is an immersive, three-day virtual training event for anyone with at least some programming experience and a foundational understanding of mathematics. The training provides a practical baseline for realistic AI development using Python and open source frameworks like TensorFlow and PyTorch. After completing the course, participants will have the confidence to start developing and deploying their own AI solutions.