Machine translation
Machine translation, sometimes referred to by the acronym
MT, is a sub-field of
computational linguistics that investigates the use of
computer software to
translate text or speech between
natural languages. At its basic level, MT performs simple
substitution of atomic words in one natural language for words in another. Using corpus techniques, more complex translations can be performed, allowing for better handling of differences in
linguistic typology, phrase
recognition, and translation of
idioms, as well as the isolation of anomalies.
Current machine translation software often allows for customisation by domain or
profession (such as
weather reports), improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used. It follows that machine translation of government and legal documents more readily produces usable output than machine translation of conversation or less standardised text.
Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has
unambiguously identified which words in the text are names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators, and in some cases can even produce output that can be used "as is". However, current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses casual language.
The translation process can be stated simply as:
1. Decoding the meaning of the source text, and
2. Re-encoding this meaning in the target language.
Behind this simple procedure lies a complex cognitive operation. For example, to decode the meaning of the source text in its entirety, the translator must interpret and analyse all the features of the text, a process that requires in-depth knowledge of the
grammar,
semantics,
syntax,
idioms, and the like of the source language, as well as the
culture of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the target language.
Therein lies the challenge in machine translation: how to program a computer to "understand" a text as a human being does and also to "create" a new text in the target language that "sounds" as if it has been written by a human.
This problem can be tackled in a number of ways.
Figure: Pyramid showing comparative depths of intermediary representation, with interlingual machine translation at the peak, followed by transfer-based, then direct translation.
Machine translation can use a method based on
linguistic rules, which means that words are translated in a linguistically motivated way: the most suitable (orally speaking) words of the target language replace the ones in the source language.
It is often argued that the success of machine translation requires the problem of
natural language understanding to be solved first.
Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated. According to the nature of the intermediary representation, an approach is described as
interlingual machine translation or transfer-based machine translation. These methods require extensive
lexicons with
morphological,
syntactic, and
semantic information, and large sets of rules.
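The parse, intermediate representation, and generation stages described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the three-word English-to-Spanish lexicon, the part-of-speech tags, and the single reordering rule. It is not taken from any particular rule-based system.

```python
# Hypothetical toy lexicon: source word -> (part of speech, target word).
LEXICON = {
    "the": ("DET", "la"),
    "white": ("ADJ", "blanca"),
    "house": ("NOUN", "casa"),
}

def analyse(sentence):
    """Analysis: turn the source text into (pos, source_word) pairs.

    Raises KeyError for words missing from the toy lexicon.
    """
    return [(LEXICON[w][0], w) for w in sentence.lower().split()]

def transfer(tagged):
    """Transfer: apply one structural rule, ADJ NOUN -> NOUN ADJ."""
    result = list(tagged)
    i = 0
    while i < len(result) - 1:
        if result[i][0] == "ADJ" and result[i + 1][0] == "NOUN":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip past the pair that was just reordered
        else:
            i += 1
    return result

def generate(tagged):
    """Generation: look up the target word for each item."""
    return " ".join(LEXICON[w][1] for _, w in tagged)

def translate(sentence):
    return generate(transfer(analyse(sentence)))

print(translate("the white house"))  # la casa blanca
```

A realistic system would need the extensive lexicons and large rule sets mentioned above; the sketch only shows how the three stages fit together.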
Given enough data, machine translation programs often work well enough for a
native speaker of one language to get the approximate meaning of what has been written by a native speaker of another. The difficulty is getting enough data of the right kind to support the particular method. For example, the large multilingual
corpus of data needed for statistical methods to work is not necessary for the grammar-based methods. But then, the grammar methods need a skilled linguist to carefully design the grammar that they use.
To translate between closely related languages, a technique referred to as
shallow-transfer machine translation may be used.
Dictionary-based machine translation
Machine translation can use a method based on
dictionary entries, which means that the words will be translated as a dictionary does: word by word, usually without much correlation of meaning between them.
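Word-by-word substitution is simple enough to sketch in a few lines. The lexicon below is a hypothetical four-entry English-to-Spanish table, and passing unknown words through unchanged is an assumption of this example rather than a documented behaviour of any system.

```python
# Hypothetical toy English-to-Spanish lexicon.
LEXICON = {
    "the": "el",
    "cat": "gato",
    "eats": "come",
    "fish": "pescado",
}

def translate_word_by_word(sentence, lexicon=LEXICON):
    """Replace each source word with its dictionary entry, keeping word order.

    Words missing from the lexicon are passed through unchanged.
    """
    words = sentence.lower().split()
    return " ".join(lexicon.get(word, word) for word in words)

print(translate_word_by_word("The cat eats fish"))  # el gato come pescado
```

Because each word is looked up in isolation, the sketch cannot reorder words, choose between senses, or enforce agreement, which is the limitation the paragraph above describes.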
Statistical machine translation
Statistical machine translation tries to generate translations using
statistical methods based on bilingual text corpora, such as the
Canadian Hansard corpus, the English-French record of the Canadian Parliament, and
EUROPARL, the record of the
European Parliament. Where such corpora are available, impressive results can be achieved translating texts of a similar kind, but such corpora are still very rare. The first statistical machine translation software was
CANDIDE from
IBM.
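As a rough illustration of how translation knowledge can be estimated from a bilingual corpus, the sketch below runs the expectation-maximisation procedure of IBM Model 1, a classic word-alignment model, on a hypothetical three-sentence English-French corpus. The text above does not say which models any given system uses, so the choice of Model 1, the toy corpus, and the function names are all assumptions of this example.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus of (English, French) sentence pairs.
CORPUS = [
    ("the house", "la maison"),
    ("the book", "le livre"),
    ("a book", "un livre"),
]

def train_model1(corpus, iterations=10):
    """Estimate t(f | e) with the EM procedure of IBM Model 1 (no NULL word)."""
    pairs = [(e.split(), f.split()) for e, f in corpus]
    f_vocab = {f for _, fs in pairs for f in fs}
    # Start from a uniform value for every co-occurring word pair.
    t = {(f, e): 1.0 / len(f_vocab) for es, fs in pairs for e in es for f in fs}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def best_translation(t, e, candidates):
    """Pick the candidate target word with the highest t(f | e)."""
    return max(candidates, key=lambda f: t.get((f, e), 0.0))

t = train_model1(CORPUS)
print(best_translation(t, "book", ["le", "livre", "un"]))  # livre
```

"livre" wins for "book" because it is the only French word that appears in both sentences containing "book". With only three sentence pairs the estimates mean little, which is one way to see why the availability of large corpora matters so much for this family of methods.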
Example-based machine translation
The example-based machine translation (EBMT) approach is often characterised by its use of a bilingual
corpus as its main knowledge base at run-time. It is essentially translation by
analogy and can be viewed as an implementation of the
case-based reasoning approach of
machine learning.
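A minimal sketch of the retrieval step in translation by analogy follows: given a new sentence, find the most similar stored example and reuse its translation. The two-entry English-German example base is hypothetical, and using the character-level similarity of Python's standard `difflib` module is an assumption of this example, not something EBMT systems are known to do.

```python
import difflib

# Hypothetical toy example base of (source, target) sentence pairs.
EXAMPLES = {
    "how much is the umbrella": "wie viel kostet der regenschirm",
    "where is the station": "wo ist der bahnhof",
}

def translate_by_analogy(sentence, examples=EXAMPLES, cutoff=0.6):
    """Return the stored translation of the most similar example, or None."""
    matches = difflib.get_close_matches(
        sentence.lower(), list(examples), n=1, cutoff=cutoff
    )
    if not matches:
        return None  # nothing in the example base is similar enough
    return examples[matches[0]]

print(translate_by_analogy("Where is the train station"))
```

The sketch only retrieves the nearest example. It does not adapt the retrieved translation to the parts of the input that differ, so the extra word "train" is silently ignored.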
Interlingual machine translation
Interlingual machine translation is one instance of the rule-based machine translation approaches. In this approach, the source text, i.e. the text to be translated, is transformed into an interlingual representation, i.e. one that is independent of both the source and target languages. The target language text is then generated from the interlingua.
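The idea of a language-independent representation can be sketched with a toy frame of concept identifiers. The concept names, the two-word "subject verb" grammar, and the small vocabularies below are all hypothetical; the sketch shows only that a single analysis step can feed generators for several target languages.

```python
# Hypothetical concept identifiers standing in for an interlingua.
EN_CONCEPTS = {"anna": "PERSON_ANNA", "sleeps": "SLEEP"}

# One hypothetical generator vocabulary per target language.
GENERATORS = {
    "es": {"PERSON_ANNA": "anna", "SLEEP": "duerme"},
    "fr": {"PERSON_ANNA": "anna", "SLEEP": "dort"},
}

def analyse(sentence):
    """Map an English 'subject verb' sentence to a language-independent frame."""
    subject, verb = sentence.lower().split()
    return {"agent": EN_CONCEPTS[subject], "action": EN_CONCEPTS[verb]}

def generate(frame, language):
    """Produce target text from the frame alone, without the source words."""
    words = GENERATORS[language]
    return f"{words[frame['agent']]} {words[frame['action']]}"

frame = analyse("Anna sleeps")
print(generate(frame, "es"))  # anna duerme
print(generate(frame, "fr"))  # anna dort
```

Adding a further target language in this sketch means adding one generator vocabulary, with no change to the analysis step.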
Word sense disambiguation
Word sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. The problem was first raised in the
1950s by
Yehoshua Bar-Hillel. He pointed out that without a "universal encyclopaedia", a machine would never be able to distinguish between the two meanings of a word. Today there are numerous approaches to overcoming this problem. They can be roughly divided into "shallow" approaches and "deep" approaches.
Shallow approaches assume no knowledge of the text; they simply apply statistical methods to the words surrounding the ambiguous word. Deep approaches presume a comprehensive knowledge of the word. So far, shallow approaches have been more successful.
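A shallow approach can be sketched as counting how often the words around the ambiguous word have been seen with each sense. The sense labels, the four training sentences, and the stop-word list below are hypothetical, and the raw overlap score is a deliberately crude stand-in for the statistical methods used in practice.

```python
from collections import Counter

# Hypothetical sense-labelled contexts for the ambiguous English word "bank".
TRAINING = [
    ("river", "we sat on the bank and watched the water flow"),
    ("river", "the boat drifted toward the muddy bank of the river"),
    ("finance", "she deposited money at the bank"),
    ("finance", "the bank approved the loan and the account"),
]

# Words ignored when scoring, including the ambiguous word itself.
STOPWORDS = {"the", "a", "an", "at", "on", "of", "and", "to",
             "we", "she", "he", "was", "bank"}

def train(examples):
    """Count context words per sense, skipping stop words."""
    counts = {}
    for sense, text in examples:
        words = [w for w in text.split() if w not in STOPWORDS]
        counts.setdefault(sense, Counter()).update(words)
    return counts

def disambiguate(context, counts):
    """Pick the sense whose training contexts share the most words with the context.

    Ties go to the sense that appears first in the training data.
    """
    words = [w for w in context.lower().split() if w not in STOPWORDS]
    return max(counts, key=lambda sense: sum(counts[sense][w] for w in words))

counts = train(TRAINING)
print(disambiguate("He opened an account at the bank to save money", counts))
```

No meaning is modelled anywhere in the sketch; the decision rests entirely on co-occurring words, which is what distinguishes a shallow approach from a deep one.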
Named entities
The handling of names in machine translation is related to
named entity recognition in
information extraction.
The history of machine translation generally starts in the 1950s after
the Second World War. The
Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The experiment was a great success and ushered in an era of significant funding for machine translation research. The authors claimed that within three to five years, machine translation would be a solved problem.
However, real progress was much slower, and after the
ALPAC report in 1966, which found that ten years of research had failed to fulfil expectations, funding was dramatically reduced. Starting in the late 1980s, as computational power increased and became less expensive, more interest began to be shown in
statistical models for machine translation.
Today there are many software programs for translating natural language, several of them online, such as the
SYSTRAN system which powers both
Google Translate and
AltaVista's Babel Fish. Although no system provides the holy grail of "fully automatic high quality machine translation" (FAHQMT), many systems provide reasonable output.
Despite their inherent limitations, MT programs are currently used by various organisations around the world. Probably the largest institutional user is the
European Commission, which uses a highly customised version of the commercial MT system
SYSTRAN to handle the automatic translation of a large volume of preliminary drafts of documents for internal use.
A Danish translation agency,
Lingtech A/S, has been translating patent applications from English to Danish since 1993 using a proprietary rule-based machine translation system,
PaTrans, working together with the translation-memory-based
Trados commercial CAT tool. The system requires both manual pre- and post-editing, but the monthly output is still approximately 400,000 words per operator.
The
Spanish daily newspaper
Periódico de Catalunya is translated from
Spanish into
Catalan with an MT system.
Google has reported that promising results were obtained using a proprietary statistical machine translation engine. This engine is currently used in the Google translation tools for Arabic <-> English and Chinese <-> English, with more language pairs soon to be migrated from the
SYSTRAN engine to the Google engine.
With the recent focus on terrorism, military sources in the US have invested significant amounts of money in natural language engineering.
In-Q-Tel (a
venture capital fund, largely funded by the US Intelligence Community, to stimulate new technologies through private sector entrepreneurs) has fostered companies like
Language Weaver. Currently the military community is interested in translation and processing of languages like
Arabic,
Pashto, and
Dari. The Information Processing Technology Office in
DARPA hosts programs like
TIDES and
Babylon. The US Air Force has awarded a $1 million contract to develop a language translation technology.
There are various methods for evaluating the performance of machine translation systems. The oldest is the use of human judges to assess the quality of a translation. Newer automated methods include
BLEU,
NIST and
METEOR.
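To give a sense of how an automated metric works, the sketch below computes a simplified, sentence-level score in the style of BLEU: clipped n-gram precision against a single reference, combined by a geometric mean and multiplied by a brevity penalty. Limiting it to unigrams and bigrams, using one reference, and returning zero when any precision is zero are simplifying assumptions of this example; the function name `simple_bleu` is hypothetical, and the result is not comparable to official BLEU scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU-style score for one candidate against one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(simple_bleu("the cat sat", "the cat sat"))  # 1.0
print(simple_bleu("the cat", "the cat sat"))      # below 1.0 due to brevity
```

In the second call every n-gram of the candidate appears in the reference, so both precisions are 1 and only the brevity penalty lowers the score.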
Currently, the product of machine translation is sometimes called a "gisting translation": MT will often produce only a rough translation that at best allows a reader who is not proficient in both languages to "get the gist" of the source text, but that is unlikely to convey a complete understanding of it. Even so, the user may find the raw translation sufficiently useful as it is.
In the words of the
European Association for Machine Translation (EAMT): "Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains." (1997)
* Artificial Intelligence
* Computer-assisted translation
* Distributed Language Translation
* Eurotra
* List of Machine Translation software
* Parallel text alignment
* Translation
* Universal Networking Language
* Machine Translation, an introductory guide to MT by D.J. Arnold et al. (1994)
* Machine Translation Archive by John Hutchins. An electronic repository (and bibliography) of articles, books and papers in the field of machine translation and computer-based translation technology
* Machine translation (computer-based translation): Publications by John Hutchins (includes PDFs of several books on machine translation)
* NIST 2005 Machine Translation Evaluation Official Results
* Machine Translation and Minority Languages