Resources for Low Resource Machine Translation
(This blog post is adapted from a set of resources I put together for the Masakhane project, which is working on machine translation for African languages.)
There is a wide variety of techniques to employ when trying to create a new machine translation model for a low-resource language or to improve an existing baseline. The applicability of these techniques generally depends on what parallel and monolingual corpora exist for the target language, and what parallel corpora exist for related languages and domains.
Common scenarios
Scenario #1 - The data you have is super noisy (e.g., scraped from the web), and you aren't sure which sentence pairs are "good"
Papers:
Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
Resources/ examples:
Implementation - fast_align: creates word alignments that can be used to score sentence pairs
Implementation - LASER (Language-Agnostic SEntence Representations): multilingual sentence embeddings for scoring pairs, as in the sketch below
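To make the embedding-based filtering idea concrete, here is a minimal sketch in the spirit of LASER: embed both sides of every pair into the shared multilingual space and keep pairs whose cosine similarity clears a threshold. It assumes the `laserembeddings` Python package (a community wrapper around LASER), and the 0.8 threshold is an arbitrary placeholder you would tune on a held-out sample of your data.

```python
import numpy as np
from laserembeddings import Laser  # pip install laserembeddings

def filter_pairs(src_sents, tgt_sents, src_lang, tgt_lang, threshold=0.8):
    """Keep pairs whose sentence embeddings are close in the shared space."""
    laser = Laser()
    # Embed both sides into LASER's language-agnostic space (N x 1024).
    src_emb = laser.embed_sentences(src_sents, lang=src_lang)
    tgt_emb = laser.embed_sentences(tgt_sents, lang=tgt_lang)
    # Cosine similarity between each aligned pair of embeddings.
    src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt_emb /= np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    scores = (src_emb * tgt_emb).sum(axis=1)
    return [(s, t) for s, t, score in zip(src_sents, tgt_sents, scores)
            if score >= threshold]

# A well-aligned pair should score high; a misaligned one should be dropped.
kept = filter_pairs(
    ["The cat sat on the mat.", "Stock prices fell sharply."],
    ["Le chat est assis sur le tapis.", "Je voudrais un café."],
    src_lang="en", tgt_lang="fr",
)
```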
Scenario #2 - You don't have any parallel data for the source-target language pair, only monolingual target data
Papers:
Resources/ examples:
Scenario #3 - You only have a small amount of parallel data for the source-target language pair, but you have lots of parallel data for a related source-target language pair
Papers:
Rapid Adaptation of Neural Machine Translation to New Languages
Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation
Transfer Learning for Low-Resource Neural Machine Translation
Trivial Transfer Learning for Low-Resource Neural Machine Translation
Pivot-based Transfer Learning for Neural Machine Translation between Non-English Languages
Resources/ examples:
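None of the papers above prescribes a single implementation, but the common recipe is: train a parent model on the high-resource related pair, copy its weights into a child model, and fine-tune on the small corpus. The PyTorch sketch below shows only the weight-transfer step with a toy model; `TinyNMT` and its layer names are stand-ins I made up, not any real toolkit's architecture.

```python
import torch
import torch.nn as nn

class TinyNMT(nn.Module):
    """Toy stand-in for an encoder-decoder NMT model (not a real toolkit)."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(vocab_size, dim)
        self.tgt_embed = nn.Embedding(vocab_size, dim)
        self.body = nn.GRU(dim, dim, batch_first=True)  # inner translation layers
        self.output_proj = nn.Linear(dim, vocab_size)

parent = TinyNMT(vocab_size=32000)  # pretend this was trained on the related pair
child = TinyNMT(vocab_size=8000)    # fresh vocabulary for the low-resource pair

# Transfer every parent weight except the vocabulary-dependent layers,
# which must be re-initialized because the child vocabulary differs.
transferred = {k: v for k, v in parent.state_dict().items()
               if not k.startswith(("src_embed", "tgt_embed", "output_proj"))}
result = child.load_state_dict(transferred, strict=False)
print(result.missing_keys)  # exactly the vocab-specific layers left untouched
# ...now fine-tune `child` on the small source-target parallel corpus.
```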
Scenario #4 - You only have a small amount of parallel data for the source-target language pair, but you have lots of monolingual data for the target and/or source language
Papers:
Improving Neural Machine Translation Models with Monolingual Data
Improving Back-Translation with Uncertainty-based Confidence Estimation
Neural Machine Translation of Low-Resource and Similar Languages with Backtranslation
Resources/ examples:
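Back-translation, introduced in the first paper above, augments the real parallel data with synthetic pairs: a reverse (target-to-source) model translates your monolingual target sentences, and the machine-generated source sentences are paired with the genuine target sentences. The sketch below shows only the data-assembly step; `reverse_translate` is a placeholder for whatever decoding function your NMT toolkit provides.

```python
def back_translate(mono_target, reverse_translate, real_pairs):
    """Assemble an augmented training set for a source->target model.

    mono_target       -- monolingual sentences in the target language
    reverse_translate -- a target->source translation function (placeholder
                         for your toolkit's decoder)
    real_pairs        -- (source, target) tuples from the genuine parallel data
    """
    synthetic_pairs = []
    for tgt in mono_target:
        synthetic_src = reverse_translate(tgt)  # machine-generated source side
        # The target side stays human-written, so the forward model still
        # learns to produce fluent target-language output.
        synthetic_pairs.append((synthetic_src, tgt))
    # In practice the real data is often upsampled so synthetic pairs
    # don't dominate; here we simply concatenate.
    return real_pairs + synthetic_pairs
```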
Scenario #5 - You have a small amount of parallel data for the source-target language pair, but you also have a lot of parallel data for other language pairs
Papers:
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Multilingual Neural Machine Translation With Soft Decoupled Encoding
Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies
Resources/ examples:
Blog - Exploring Massively Multilingual, Massive Neural Machine Translation
Blog - Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System
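A core trick behind Google's multilingual system (described in the zero-shot blog post above) is sharing one model across all language pairs and prepending an artificial token that tells the model which target language to produce. A minimal sketch of that preprocessing step follows; the `<2xx>` token format is illustrative, as toolkits differ.

```python
def tag_for_target_language(src_sentence, target_lang):
    """Prepend an artificial token telling the model what to produce.

    One shared model then serves every language pair, and the same tag can
    steer it toward pairs never seen together in training (zero-shot).
    """
    return f"<2{target_lang}> {src_sentence}"

# Hypothetical training examples for a many-to-many model:
training_examples = [
    (tag_for_target_language("How are you?", "sw"), "Habari yako?"),
    (tag_for_target_language("Good morning", "yo"), "E kaaro"),
]
```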
Want to learn more?
If you're interested in leveling up your AI skills even further (maybe you want to transition from software engineer to AI practitioner), consider joining us for the AI Classroom event. AI Classroom is an immersive, three-day virtual training event for anyone with at least some programming experience and a foundational understanding of mathematics. The training provides a practical baseline for realistic AI development using Python and open source frameworks like TensorFlow and PyTorch. After completing the course, participants will have the confidence to start developing and deploying their own AI solutions.