Machine Translation using Transformers in Python

Learn how to use Hugging Face transformer models to perform machine translation between various languages using the transformers and PyTorch libraries in Python.
  · 9 min read · Updated Nov 2021 · Machine Learning · Natural Language Processing


Open In Colab

Machine translation is the process of using Machine Learning to automatically translate text from one language to another without any human intervention during the translation.

Neural machine translation has emerged in recent years and now outperforms previous approaches. More specifically, attention-based neural networks called transformers perform very well on this task.

In this tutorial, you will learn how to perform machine translation without any training. In other words, we'll be using pre-trained models from the Hugging Face transformers library.

The Helsinki-NLP models we are going to use are mostly trained on the OPUS dataset, a freely available collection of translated texts from the web.

To get started, create a new Python notebook or file. You can also follow along with the Colab notebook linked from this article. First, let's install the required libraries:

$ pip install transformers==4.12.4 sentencepiece

Importing what we need from transformers:

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

Using Pipeline API

Let's get started with the pipeline API that the library offers. We'll be using the models trained by Helsinki-NLP; you can check their page to see the available models:

# source & destination languages
src = "en"
dst = "de"

task_name = f"translation_{src}_to_{dst}"
model_name = f"Helsinki-NLP/opus-mt-{src}-{dst}"

translator = pipeline(task_name, model=model_name, tokenizer=model_name)

src and dst are the source and destination languages respectively; feel free to change them for your needs. We build task_name and model_name from the source and destination languages, then initialize the pipeline by passing the model and tokenizer arguments. Let's test it out:

translator("You're a genius.")[0]["translation_text"]

Output:

Du bist ein Genie.

The pipeline API is pretty straightforward: we simply pass the text to the translator pipeline object and get the translation back.
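The pipeline object can also be called with a list of strings, which is handy when you have several sentences to translate. The helper below is a minimal sketch: translate_many is a name made up for this example, and it only assumes that translator behaves like the pipeline created above, returning one dictionary with a "translation_text" key per input.

```python
def translate_many(translator, texts):
    """Translate several strings in one call and return plain strings.

    `translator` is assumed to behave like the pipeline object created above:
    called with a list of strings, it returns a list of dictionaries that
    each contain a "translation_text" key.
    """
    texts = list(texts)
    if not texts:
        # nothing to translate, so skip the call entirely
        return []
    results = translator(texts)
    return [result["translation_text"] for result in results]
```

With the pipeline from above, you could call it as translate_many(translator, ["You're a genius.", "Good morning."]).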

Alright, let's test a longer text taken from Wikipedia:

article = """
Albert Einstein ( 14 March 1879 – 18 April 1955) was a German-born theoretical physicist, widely acknowledged to be one of the greatest physicists of all time. 
Einstein is best known for developing the theory of relativity, but he also made important contributions to the development of the theory of quantum mechanics. 
Relativity and quantum mechanics are together the two pillars of modern physics. 
His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been dubbed "the world's most famous equation". 
His work is also known for its influence on the philosophy of science.
He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum theory. 
His intellectual achievements and originality resulted in "Einstein" becoming synonymous with "genius"
"""
translator(article)[0]["translation_text"]

Output:

Albert Einstein (* 14. März 1879 – 18. April 1955) war ein deutscher theoretischer Physiker, der allgemein als einer der größten Physiker aller Zeiten anerkannt wurde. 
Einstein ist am besten für die Entwicklung der Relativitätstheorie bekannt, aber er leistete auch wichtige Beiträge zur Entwicklung der Quantenmechaniktheorie. 
Relativität und Quantenmechanik sind zusammen die beiden Säulen der modernen Physik. 
Seine Massenenergieäquivalenzformel E = mc2, die aus der Relativitätstheorie hervorgeht, wurde als „die berühmteste Gleichung der Welt" bezeichnet. 
Seine Arbeit ist auch für ihren Einfluss auf die Philosophie der Wissenschaft bekannt. 
Er erhielt 1921 den Nobelpreis für Physik „für seine Verdienste um die theoretische Physik und vor allem für seine Entdeckung des Gesetzes über den photoelektrischen Effekt", einen entscheidenden Schritt in der Entwicklung der Quantentheorie. 
Seine intellektuellen Leistungen und Originalität führten dazu, dass „Einstein" zum Synonym für „Genius" wurde.

I translated this output back to English with Google Translate, and it appears to be a good translation!

Manually Loading the Model

Since the pipeline doesn't give us much flexibility during translation generation, let's load the model and tokenizer ourselves:

def get_translation_model_and_tokenizer(src_lang, dst_lang):
  """
  Given the source and destination language codes, returns the matching
  Helsinki-NLP model and its tokenizer.
  See the language codes here: https://developers.google.com/admin-sdk/directory/v1/languages
  For the 3-character language codes, you can search for the code online.
  """
  # construct our model name from the function arguments (not the globals)
  model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{dst_lang}"
  # initialize the tokenizer & model
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
  # return them for use
  return model, tokenizer

The above function returns the appropriate model and tokenizer given src_lang and dst_lang, the source and destination language codes respectively. For a list of language codes, consider checking this page. For instance, let's try English to Chinese:

# source & destination languages
src = "en"
dst = "zh"

model, tokenizer = get_translation_model_and_tokenizer(src, dst)
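Each call to get_translation_model_and_tokenizer() loads the model again, which takes a while if you switch back and forth between language pairs. One possible way to avoid that is a small in-memory cache. This is only a sketch: get_cached and _loaded are names invented for this example, and the loader argument is assumed to be a function like the one defined above.

```python
_loaded = {}  # maps (src_lang, dst_lang) -> whatever the loader returned


def get_cached(src_lang, dst_lang, loader):
    """Call loader(src_lang, dst_lang) once per language pair and reuse the result."""
    key = (src_lang, dst_lang)
    if key not in _loaded:
        # first request for this pair: load and remember it
        _loaded[key] = loader(src_lang, dst_lang)
    return _loaded[key]
```

For example, model, tokenizer = get_cached("en", "zh", get_translation_model_and_tokenizer) would load the en-zh model on the first call and return the same objects on later calls. Keep in mind that every cached model stays in memory.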

To translate our previous paragraph, we first need to tokenize the text:

# encode the text into tensor of integers using the appropriate tokenizer
inputs = tokenizer.encode(article, return_tensors="pt", max_length=512, truncation=True)
print(inputs)

Output:

tensor([[32614, 53456,    22,   992,   776,   822,  4048,     8,  3484,   822,
           820, 50940,    17,    43,    13,  8214,    16, 32941, 34899, 60593,
             2,  5514,  7131,     9,    34,   141,     4,     3,  7680, 60593,
            24,     4,    61,   220,     6, 53456,    32,  1109,  3305,    15,
           320,     3, 19082,     4,  1294, 24030, 28453,     2,   187,   172,
            81,   157,   435,  1061,     9,     3,    92,     4,     3, 19082,
             4, 52682, 54813,     6, 45978, 28453,     7, 52682, 54813,    46,
          1105,     3,   263, 12538,     4,  6683, 46089,     6,  1608,  3196,
          3484, 45425, 50560, 14655,   509,     8,  6873,  4374,   149,  9132,
            62, 22703,    51,  1294, 24030, 28453, 19082,     2,    66,    74,
         16044, 18553,   258,    40,  1862,   431,    23,    24,   447, 23761,
         47364, 10594,  1608,   119,    32,    81,  3305,    15,    45,  6748,
            19,     3, 34857,     4,  4102,     6,   250,   948,     3,   912,
           774, 38354, 33321,    11, 58505,    40,  4161,   175,   307,     9,
         34899, 46089,     2,     7,   978,    15,   175, 34026,     4,     3,
           191,     4,     3, 17952, 57867,  1766, 19622,    13, 29632,  2827,
            11,     3,    92,     4, 52682, 19082,     6,  1608,  6875,  5710,
             7,  5099,  2665,  3897,    11,    40,   338,   767, 40272,   480,
          6588, 57380,    29,    40,  9994, 20506,   480,     0]])

The tokenizer.encode() method splits the text into tokens and converts them to IDs. We set return_tensors to "pt" so it returns a PyTorch tensor. We also set max_length to 512 and truncation to True, so any input longer than 512 tokens is cut off at that length.
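Keep in mind that with truncation enabled, anything beyond max_length is dropped rather than translated. For longer documents, one option is to split the text into smaller chunks and translate them one by one. The function below is a rough sketch of that idea: chunk_lines is a hypothetical helper, and by default it counts whitespace-separated words, which is only an approximation of the number of tokens the model sees. For an exact count you could pass something like count_units=lambda s: len(tokenizer.encode(s)).

```python
def chunk_lines(text, max_units=400, count_units=None):
    """Group the non-empty lines of `text` into chunks of at most `max_units`.

    `count_units` measures the size of a line. The default counts
    whitespace-separated words, which is only a rough proxy for model tokens.
    A single line larger than `max_units` is kept as its own chunk.
    """
    if count_units is None:
        count_units = lambda line: len(line.split())
    chunks, current, current_size = [], [], 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        size = count_units(line)
        # start a new chunk when adding this line would exceed the budget
        if current and current_size + size > max_units:
            chunks.append(" ".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be encoded and passed to the model separately, and the translated pieces joined back together.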

Let's now use greedy search to generate the translation for this:

# generate the translation output using greedy search
greedy_outputs = model.generate(inputs)
# decode the output and ignore special tokens
print(tokenizer.decode(greedy_outputs[0], skip_special_tokens=True))

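We simply use the model.generate() method to get the outputs. Since the outputs are token IDs, we need to decode them back to human-readable text; we also set skip_special_tokens to True so we don't see tokens such as <pad>. Note that generate() takes its default settings from the model's configuration, so if that configuration sets num_beams above 1, pass num_beams=1 explicitly to force greedy search. Here is the output: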

阿尔伯特·爱因斯坦(1879年3月14日至1955年4月18日)是德国出生的理论物理学家,被广泛承认是有史以来最伟大的物理学家之一。爱因斯坦以发展相对论闻名,但他也为量子力学理论的发展做出了重要贡献。相对论和量子力学是现代物理学的两大支柱。他的质量 — — 能源等值公式E = mc2来自相对论,被称作“世界最著名的方程 ” 。 他的工作也因其对科学哲学的影响而著称。 他获得了1921年诺贝尔物理奖,“因为他对理论物理学的服务,特别是他发现了光电效应法 ”, 这是量子理论发展的关键一步。 他的智力成就和创举导致“Einstein”成为“genius”的同义词。

You can also use beam search instead of greedy search, which may generate better translations:

# generate the translation output using beam search
beam_outputs = model.generate(inputs, num_beams=3)
# decode the output and ignore special tokens
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))

We set num_beams to 3. For more information about beam search, I suggest you read this blog post or our tutorials on text summarization and conversational AI chatbot. The output:

阿尔伯特·爱因斯坦(1879年3月14日至1955年4月18日)是德国出生的理论物理学家,被广泛承认是有史以来最伟大的物理学家之一。爱因斯坦以发展相对论闻名,但他也为量子力学理论的发展做出了重要贡献。相对论和量子力学是现代物理学的两大支柱。来自相对论的其质量 — — 能源等值公式E=mc2被称作“世界上最著名的方程式 ” 。他的工作也因其对科学哲学的影响而著称。他获得了1921年诺贝尔物理奖,“因为他对理论物理学的服务,特别是他发现了光电效应法 ”, 这是量子理论发展的关键一步。他的智力成就和原创性导致了“Einstein”与“genius”的同义。

The translation is slightly different, and both appear to be good translations when translated back to English using Google Translate.
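To build some intuition for why beam search can find a better sequence than greedy search, here is a toy illustration. It is not how model.generate() is implemented: the NEXT table is an invented "model" in which the next token depends only on the previous one, and greedy_search and beam_search are simplified functions written for this example.

```python
import math

# Toy "model": probability of the next token given only the previous token.
NEXT = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def greedy_search(max_steps=5):
    """Always pick the single most probable next token."""
    seq, score = ["<s>"], 0.0
    for _ in range(max_steps):
        if seq[-1] == "</s>":
            break
        token, prob = max(NEXT[seq[-1]].items(), key=lambda item: item[1])
        seq.append(token)
        score += math.log(prob)
    return seq, score


def beam_search(num_beams=2, max_steps=5):
    """Keep the num_beams best partial sequences at every step."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "</s>":
                candidates.append((seq, score))  # finished, carry over as-is
                continue
            for token, prob in NEXT[seq[-1]].items():
                candidates.append((seq + [token], score + math.log(prob)))
        # keep only the highest-scoring candidates (scores are log-probabilities)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(seq[-1] == "</s>" for seq, _ in beams):
            break
    return beams


greedy_seq, greedy_score = greedy_search()
best_seq, best_score = beam_search(num_beams=2)[0]
print(greedy_seq, round(math.exp(greedy_score), 2))
print(best_seq, round(math.exp(best_score), 2))
```

In this table, greedy search commits to "A" because it is the most probable first token, and ends up with a sequence of probability 0.33. With two beams, the path starting with "B" is kept alive and wins with probability 0.36.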

We can also generate more than one translation in one go:

# let's change target language
src = "en"
dst = "ar"

# get en-ar model & tokenizer
model, tokenizer = get_translation_model_and_tokenizer(src, dst)
# yet another example
text = "It can be severe, and has caused millions of deaths around the world as well as lasting health problems in some who have survived the illness."
# tokenize the text
inputs = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)
# this time we use 5 beams and return 5 sequences and we can compare!
beam_outputs = model.generate(
    inputs, 
    num_beams=5, 
    num_return_sequences=5,
    early_stopping=True,
)
for beam_output in beam_outputs:
  print(tokenizer.decode(beam_output, skip_special_tokens=True))
  print("="*50)

We set num_return_sequences to 5 to generate the 5 most probable translations. Make sure that num_beams >= num_return_sequences. Output:

ويمكن أن تكون حادة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض الذين نجوا من المرض.
==================================================
ويمكن أن تكون خطيرة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض الذين نجوا من المرض.
==================================================
ويمكن أن تكون حادة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة لدى بعض الذين نجوا من المرض.
==================================================
ويمكن أن تكون حادة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض من نجوا من المرض.
==================================================
ويمكن أن تكون حادة، وقد تسببت في وفاة ملايين الأشخاص في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض الذين نجوا من المرض.
==================================================
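If you experiment with different values, it can help to check the num_beams >= num_return_sequences rule before calling the model. The helper below is a small sketch of such a guard: beam_generate_kwargs is a made-up name, and it only builds a dictionary with the three parameters used in the example above.

```python
def beam_generate_kwargs(num_beams, num_return_sequences=1, early_stopping=True):
    """Build keyword arguments for model.generate() and check them first."""
    if num_beams < 1 or num_return_sequences < 1:
        raise ValueError("num_beams and num_return_sequences must be at least 1")
    if num_return_sequences > num_beams:
        raise ValueError(
            f"num_return_sequences ({num_return_sequences}) must not exceed "
            f"num_beams ({num_beams})"
        )
    return {
        "num_beams": num_beams,
        "num_return_sequences": num_return_sequences,
        "early_stopping": early_stopping,
    }
```

You could then write model.generate(inputs, **beam_generate_kwargs(5, 5)) and get an early, readable error when the two values don't fit together.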

Conclusion

That's it for this tutorial! I suggest you try your own pair of languages and your own text to see which parameters of the model.generate() method work best for you.

The model.generate() method has a lot of parameters. Most of them are explained in the Hugging Face blog post, or in our tutorials on text summarization and conversational AI chatbot.

Also, on the Helsinki-NLP page, there are 1300+ pre-trained models, so there is a good chance your native language is covered!

Check the full code here.

Learn also: Conversational AI Chatbot with Transformers in Python
