Machine Translation
MT – Research Landscape
Stephan Vogel
Spring Semester 2011

Overview
  Some influential projects
  Open source toolkits
  Conferences
  MT evaluations
  Literature and general resources
  Disclaimer: all of this is incomplete, subjective, and biased!

11-711 Machine Translation

MT Projects
  Verbmobil
    Large speech translation project in Germany
    Different translation paradigms
    Success story for SMT
  TIDES
    DARPA-funded US MT project
    SMT widely used, small and large data track evaluations
    Chinese-English and Arabic-English
  GALE
    DARPA-funded
    Follow-up to TIDES
  TransTac
    DARPA-funded
    Speech-to-speech translation
    Targeted towards force protection

MT Projects (cont.)
  TC-Star
    European project with partners from different universities
    Technology and Corpora for Speech-to-Speech Translation
    http://tcstar.org/
  EuroMatrix 2006-2009, EuroMatrixPlus 2009-2012
    Translate all European languages
    Spin-offs: WMT evaluations, MT Marathon
    euromatrix.net
  Quaero
    French-German project
    Kind of a TC-Star follow-up
    http://www.quaero.org/modules/movie/scenes/home/index.php?FUSEBOX_LANG=2

Open Source Toolkits: Word Alignment
  Game changer
    Lower barrier to enter the field
    Transparency
  Word alignment
    GIZA++
      Started out at a JHU workshop, subsequently extended by Franz Josef Och (at RWTH and ISI)
      Most widely used alignment toolkit
    mGIZA++
      Multi-threaded/multi-core extension of GIZA++
      By Qin Gao: http://geek.kyloo.net/software/doku.php/mgiza:overview
    Berkeley Aligner
      Word alignment via quadratic assignment
      http://code.google.com/p/berkeleyaligner/
    PostCAT (Posterior Constrained Alignment Toolkit)
      http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html

Open Source Toolkits: Word Alignment (cont.)
  Word alignment tools
    Alignment Set
      Set of tools to manipulate and display alignments
      From the TALP research group
      http://www.talp.upc.edu/talp/index.php/en/resources/tools/alingment-set

Open Source Toolkits: Decoders
  Decoders
    Moses (Edinburgh): phrase-based and recently also hierarchical
    Joshua (JHU): Hiero reimplementation
      sourceforge.net/projects/joshua
    Jane (RWTH Aachen): hierarchical
      http://www-i6.informatik.rwth-aachen.de/web/Software/index.html
    cdec (UMD -> CMU): hierarchical and phrase-based
    Marie (TALP): n-gram-based (kind of phrase-based)
      www.talp.upc.edu/talp/index.php/en/resources/tools/marie
    Apertium (University of Alicante): rule-based
    Phrasal (Stanford): phrase-based
      http://www-nlp.stanford.edu/wiki/Software/Phrasal

Open Source Toolkits: LMs
  SRILM
    Most widely known and used LM toolkit
  SALM
    Written by Joy Ying Zhang (while at LTI)
    http://projectile.sv.cmu.edu/research/public/tools/salm/salm.htm
  IRSTLM
    http://sourceforge.net/projects/irstlm/
  KenLM
    Smaller footprint than SRILM
    Written by Kenneth Heafield (LTI PhD student)
    http://kheafield.com/code/kenlm/

Conferences
  Specific MT conferences
    MT Summit (every 2 years)
    AMTA (US)
    EAMT (Europe)
    TMI
    Translating and the Computer (organized by Aslib)
    IWSLT (organized by C-Star consortium)
    …
  General CL conferences
    ACL
    HLT
    EMNLP
    Coling
    IJCNLP: Int. Joint Conf. on NLP
    LREC: Language Resources and Evaluation
    RANLP: Recent Advances in NLP
    SALTMIL: Speech and Language Technology for Minority Languages
  MT workshops
    WMT: Workshop on Machine Translation
    SSST: Syntax, Semantics, and Structure in SMT
    …

Evaluations
  It all started with TIDES
    Comparative evaluations
    Defined training and test data
    Automatic evaluation metrics (NIST mteval, BLEU)
    Organized by NIST
  NIST Open MT Evaluations
    Continuation and expansion of the TIDES MT evaluations
    Chinese-English, Arabic-English, Urdu-English
    Restricted and unrestricted tracks
    Originally every year, now moving to a two-year cycle
    http://www.itl.nist.gov/iad/mig/tests/mt/2009/

Evaluations (cont.)
  WMT Evaluations
    Organized in connection with EuroMatrix
    Based on Europarl corpora
    Many languages
    Automatic and manual evaluation
    http://www.statmt.org/wmt11/translation-task.html
  IWSLT Evaluations
    Spoken language
    Languages vary: Chinese, Japanese, Arabic, Italian, …
    Speech 1-best and lattices provided
    Based on the (small) BTEC corpus (Basic Travel Expression Corpus)
    Last time also included lecture translation
    http://iwslt2010.fbk.eu/node/15

Evaluations (cont.)
  Specific projects have evaluations
    GALE
      Arabic-English and Chinese-English
      Broadcast news and broadcast conversations, newswire and blogs
      Human evaluation (HTER)
      Go/No-Go
    Quaero
      European languages, also Arabic-French
      This year the WMT evaluation was used as the Quaero evaluation

Journals
  Machine Translation
    Springer Science, formerly Kluwer Academic Publishers, from vol. 4 (1989) onward
    Articles available online from Springer (abstracts free, full texts for a fee)
    Chief editor: Andy Way
    http://www.springer.com/computer/ai/journal/10590
  Computational Linguistics
    MIT Press
    Now open access
    http://www.mitpressjournals.org/loi/coli
  ACM TSLP
    Online publication
    Started in 2005
    http://tslp.acm.org/

Journals (cont.)
  IEEE Transactions on Audio, Speech, and Language Processing
    http://www.signalprocessingsociety.org/publications/periodicals/taslp/
  The Prague Bulletin of Mathematical Linguistics
    Has papers from recent MT Marathons, in particular descriptions of open source packages
    http://ufal.mff.cuni.cz/pbml.html

Literature
  MT-Archive: http://www.mt-archive.info/
    Compiled by John Hutchins for the EAMT
    One-stop shop!
    Also links to books, journals, conferences
    Papers listed by author, language, organization
  ACL Anthology: http://www.aclweb.org/anthology/