Neural networks have significantly improved the quality of automatic translation, but for some language pairs the output still contains many errors.


A brief history of automated translation

Work on automatic translators began in the middle of the 20th century. After one successful experiment, newspapers wrote that manual translation would soon be unnecessary and that human translators would be replaced by machines. About 70 years have passed since then, but automatic translation still makes crude, glaring mistakes. What's wrong with it?

Why online translators used to be impossible to use without laughing

Even 5-7 years ago, any online translator produced strings of phrases whose meaning was hard to grasp. When you translated from a foreign language into your native language, you could correct the result yourself. But when you translated from your native language into a foreign one, it was immediately obvious that Google Translate or a similar tool had done the work. The technology itself was to blame: statistical machine translation.

To better understand why translators used to be so clumsy, let's take a quick look at the main technologies that have been used to process text in different languages. The earliest systems, from the middle of the 20th century, used rules drawn up by linguists. The number of rules was huge, but the results were still poor: these translators could not cope with ambiguous words and did not understand fixed expressions.

IBM model

The disappointment with the first translation systems was so great that no one invested serious money in the field for almost 30 years. This changed in the early 1990s, when an IBM research team developed a new translation model. The key idea of the technology is the noisy channel (a channel with errors): text in language A is treated as text in language B that has been enciphered, and the translator's task is to decipher it.
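The channel idea is usually written as a probability formula. The version below is a sketch in notation introduced here, not taken from the article: a is the observed sentence in language A, b is a candidate sentence in language B, and the best translation is the b that most plausibly produced a.

```latex
\hat{b} = \arg\max_{b} P(b \mid a) = \arg\max_{b} P(a \mid b)\, P(b)
```

The second form follows from Bayes' rule, because P(a) is the same for every candidate b. P(a | b) measures how well the candidate matches the source sentence, and P(b) measures how natural the candidate sounds in language B.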

The IBM model was built on Canadian government documents written in English and French, so this was the first language pair that specialists worked on. They collected the probabilities of all word combinations of a certain length in one language, along with the probability that each such combination corresponds to a combination in the other language. In effect, the algorithm looks for the most frequent phrase in language A that has at least some relation to the phrase in language B.
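The counting idea described above can be sketched in a few lines of Python. This is a minimal illustration, not IBM's actual system: it swaps in the simplest word-level scheme of that family (an expectation-maximization loop in the style of IBM Model 1, without the usual NULL word), whereas the text describes statistics over word combinations. The three-sentence corpus and the names `CORPUS`, `train_word_model`, and `best_french_word` are made up for the example.

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical): (French words, English words)
CORPUS = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
    (["une", "fleur"], ["a", "flower"]),
]


def train_word_model(pairs, iterations=10):
    """Estimate t[(f, e)] ~ P(French word f | English word e) by EM."""
    french_vocab = {f for fr, _ in pairs for f in fr}
    uniform = 1.0 / len(french_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference

    for _ in range(iterations):
        count = defaultdict(float)  # expected (f, e) pairings
        total = defaultdict(float)  # expected pairings per English word
        for fr, en in pairs:
            for f in fr:
                # Share one unit of credit for f among the English words
                norm = sum(t[(f, e)] for e in en)
                for e in en:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # Re-estimate probabilities from the expected counts
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_french_word(t, english_word):
    """Return the French word with the highest probability for english_word."""
    candidates = {f: p for (f, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)


model = train_word_model(CORPUS)
print(best_french_word(model, "house"))
```

No dictionary is given to the program. "la" appears with "the" in two sentence pairs, so it gradually absorbs the credit for "the", which leaves "maison" as the best explanation for "house".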

IBM's statistical machine translation system was a breakthrough. With the advent of the Internet, specialists gained access to a huge amount of data in different languages. Researchers focused on collecting corpora of parallel texts: identical documents written in different languages, such as the proceedings of international organizations, scientific materials, and journalism. By studying them, the system establishes correspondences between sentences and words. For example, when comparing texts in two languages, it learns that "cat" and the word for "cat" in the other language are likely translations of each other. Nowadays the situation is different with projects like

In the statistical machine translation model, each word and phrase is assigned a numeric score that reflects how frequently it is used in the language. During translation, the sentence is split into independent parts, and a potential translation is selected for each part. The system then assembles several variants of the sentence in the other language and selects the one in which the words fit together best.
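The steps above (split, propose candidates, assemble variants, pick the best fit) can be sketched as a toy program. Everything in it is an assumption made for illustration: the phrase table, the probabilities, the word-pair scores, and the names `PHRASE_TABLE`, `BIGRAM_LOGPROB`, `fluency`, and `translate`. The sentence is also assumed to be split already, and the parts are not reordered.

```python
import itertools
import math

# Hypothetical phrase table: source part -> [(candidate, P(candidate | part))]
PHRASE_TABLE = {
    "le chat": [("the cat", 0.7), ("cat the", 0.3)],
    "dort": [("sleep", 0.55), ("sleeps", 0.45)],
}

# Hypothetical log-probabilities of word pairs in the target language
BIGRAM_LOGPROB = {
    ("the", "cat"): -0.5,
    ("cat", "sleeps"): -0.7,
    ("cat", "sleep"): -2.5,
}
UNSEEN_LOGPROB = -5.0  # penalty for word pairs never observed


def fluency(words):
    """Score how well neighbouring words fit together."""
    return sum(
        BIGRAM_LOGPROB.get(pair, UNSEEN_LOGPROB)
        for pair in zip(words, words[1:])
    )


def translate(parts):
    """Pick the best combination of candidates for pre-split source parts."""
    best_text, best_score = None, -math.inf
    for combo in itertools.product(*(PHRASE_TABLE[p] for p in parts)):
        text = " ".join(candidate for candidate, _ in combo)
        score = sum(math.log(prob) for _, prob in combo) + fluency(text.split())
        if score > best_score:
            best_text, best_score = text, score
    return best_text


print(translate(["le chat", "dort"]))  # prints "the cat sleeps"
```

On its own, the phrase table slightly prefers "sleep" for "dort". The word-pair scores overrule it, because "cat sleeps" is rated as a much more likely pair than "cat sleep". That is the word-compatibility check the paragraph describes.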

Further development

But machine translation still didn't work perfectly. The main problem was that words and phrases were translated independently: the translators did not take context into account and did not even make the parts of a sentence agree with each other. Another problem is the lack of parallel texts for many language pairs, which makes it difficult to match words and phrases directly. For this reason, statistical machine translation uses English as a universal bridging language.
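Using English as the intermediate language amounts to chaining two translation steps. The sketch below shows only that chaining, with tiny made-up dictionaries standing in for real translation models. The names `FRENCH_TO_ENGLISH`, `ENGLISH_TO_GERMAN`, and `translate_via_english` are hypothetical.

```python
# Hypothetical one-word dictionaries; English is the only shared language
FRENCH_TO_ENGLISH = {"chat": "cat", "maison": "house"}
ENGLISH_TO_GERMAN = {"cat": "Katze", "house": "Haus"}


def translate_via_english(word, to_english, from_english):
    """Translate in two hops: source -> English -> target."""
    english = to_english[word]
    return from_english[english]


print(translate_via_english("chat", FRENCH_TO_ENGLISH, ENGLISH_TO_GERMAN))  # prints "Katze"
```

No French-German parallel texts are needed here, only French-English and English-German data.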