Language and Speech Technology: Introduction
Jan Odijk
January 2011
LOT Winter School 2011
1

Overview
• What is language and speech technology (LST)? (3-7)
• Major Subfields of LST (8-25)
• Characterization of the last 30 years (26-27)
  – 80s (28-36), 90s (37-49), 00s (50-56)
  – Current Status (57-69)
• CLARIN infrastructure (70-75)
• This week’s programme (76)
2

Language Technology
• Language Technology is the study of computational systems that process natural language
• Alternative names:
  – Human Language Technology (HLT)
  – Natural Language Processing (NLP)
3

Speech Technology
• Speech Technology is the study of computational systems that process speech
• It is a part of Language Technology
• Often, however:
  – The term “Language Technology” is reserved for the study of computational systems that process written language
4

Computational Linguistics
• Computational Linguistics (CL) is the study of language from a computational perspective
• Often used interchangeably with language technology
• Often grouped under Artificial Intelligence (AI), although CL predates AI
  – AI: the study and design of intelligent systems
5

Computational Systems
• Computational systems to process natural language do not exist naturally (except in the human brain)
  – They must be designed, implemented, and evaluated
  – Therefore LST is a kind of engineering
6

Computational Systems
• LST is NOT the study of the processing of natural language by humans, as in
  – cognition
  – (cognitive) psychology
  – (psycho)linguistics
  – phonetics
7

Language Technology Subfields
• Orthographic processing
  – Text = sequence of characters
  – Tokenization
    • Text => sequence of tokens
    • Token = occurrence of a word form
    • Relatively simple for languages that use spaces and punctuation (dot, comma, etc.) to separate tokens
    • More difficult for languages such as Chinese, Thai, etc.
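The tokenization step described above (text => sequence of tokens) can be sketched in a few lines of Python. This is only a minimal illustration, not a production tokenizer: the pattern TOKEN_PATTERN and the function name tokenize are assumptions made for this example, and the rule "a token is a run of word characters or a single punctuation mark" is a simplification that only suits languages that separate tokens with spaces and punctuation.

```python
import re

# Hypothetical, simplified token pattern (an assumption for illustration):
#   - a run of word characters, optionally joined by internal apostrophes
#     or hyphens (e.g. "don't", "man-machine"), or
#   - a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Map a text (sequence of characters) to a sequence of tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("The man bought a present for his wife."))
    print(tokenize("Zij graven een graf, toch?"))
    # Text written without spaces is not split by this rule at all:
    print(tokenize("我喜欢语言技术"))
```

For text written without spaces between words, such as the Chinese example, this rule returns the whole string as a single token, which illustrates why tokenization is harder for such languages and typically needs a lexicon or a statistical segmenter.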
8

Language Technology Subfields
• Orthographic processing
  – Orthographic normalization
  – Token => (token, normalized token)
  – Normalized token = canonical orthographic representation for a set of orthographic variants
  – Examples:
    • Contemporary spelling variants: aktie => actie
    • Older spelling variants: vleesch => vlees
    • Typos: actei => actie
    • OCR errors: raarn => raam
9

Language Technology Subfields
• Morphological processing
  – Lemmatization: token => (token, lemma)
    • Lemma = canonical orthographic representation for an inflectional paradigm
    • Often ambiguous
    • Examples
      – lemma(walked) = walk; lemma(men) = man
      – lemma(graven) = {graf, graaf, graven} (Dutch)
10

Language Technology Subfields
• Morphological processing
  – Inflection analysis/generation
    • Word form <=> (lemma, inflectional features)
    • Examples:
      – graven <=> (graf, PoS=Noun, number=plural)
      – graven <=> (graaf, PoS=Noun, number=plural)
      – graven <=> (graven, PoS=Verb, form=infinitive)
      – graven <=> (graven, PoS=Verb, form=indicative, tense=present, number=plural)
11

Language Technology Subfields
• Morphological processing
  – Compound processing
  – word form => ((word form, affix?)+, word form)
  – lemma => ((word form, affix?)+, lemma)
  – Example:
  – vleeskoeienhouders => ([vlees, koeien], houders) ‘meat cow farmers’
  – gebiedsbepaling => ([(gebied, s)], bepaling)
12

Language Technology Subfields
• Morphological processing
  – Derivational morphology processing
  – word form => (prefix*, lemma, suffix*)
  – Example:
    • characterization => ([], characterize, [ation])
13

Language Technology Subfields
• (PoS-)tagging
  – Assignment of a grammatical tag to a token in context (tag = label for grammatical properties)
  – Token => (token, tag) in context
  – Usually assignment of PoS-tags
  – Often more detailed grammatical (inflectional) tags
14

Language Technology Subfields
• (PoS-)tagging
  – Context, usually:
    • Some words and/or tags preceding
    • Some words following
  – Examples:
    • (graven, Zij __ een graf) => Vindprespl
    • (graven, De __ zijn boos) => Npl
15

Language Technology Subfields
• Chunking
  – Identifying major phrases in a sentence
  – Example
    • The man bought a present for his wife =>
    • [NP The man] bought [NP a present] [PP for his wife]
16

Language Technology Subfields
• Parsing
  – Assign a syntactic structure to a sentence
  – Example: The man bought a present for his wife =>
    [S [subj/NP The man] [pred/VP bought [obj/NP a present] [pobj/PP for [obj/NP his wife]] ] ]
17

Language Technology Subfields
• Machine Translation
  – Automatic translation of an input text
  – Example
    • The man bought a present for his wife =>
    • L’homme a acheté un cadeau pour sa femme
18

Language Technology Subfields
• Content extraction and processing
  – Named entity recognition
  – Question-answering
  – Information retrieval
  – Information extraction
  – Sentiment / opinion mining
  – Reasoning/Inference on semantic representation
  – …
19

Speech Technology Subfields
• Speech Synthesis
  – Artificial production of human speech
  – Text => speech
  – Often called Text-To-Speech (TTS)
  – A TTS system usually contains two components
    • Grapheme to Phoneme (G2P) component
      – Text => symbolic speech representation (phonetic representation)
    • Speech Synthesis component
      – Symbolic speech representation => speech
20

Speech Technology Subfields
• Speech Synthesis (cont.)
  – The term Speech Synthesis is often reserved for this second component
  – Meaning => speech is usually called Speech Generation, Concept-To-Speech, or Data-to-Speech
21

Speech Technology Subfields
• Speech Recognition
  – Recognition of human speech
  – Audio containing speech => text
  – Often called automatic speech recognition (ASR)
• Speech Understanding
  – Understanding of human speech
  – Audio containing speech => meaning or action
22

Speech Technology Subfields
• Speaker Recognition
  – Recognition of a speaker given a speech signal
  – Speech => person identity
• Speaker Verification
  – Verification of the identity of a person
  – Speech + claimed identity => Boolean
23

Speech Technology Subfields
• Speech Compression
  – Reduction of the size of speech representations (speech encoding), or
  – Time-compression of speech representations (so that they sound faster to the listener)
24

Related fields
• Speech is often used in dialogues
  – Study of spoken dialogues (human-human, human-machine)
• Speech is often combined with other modalities
  – Study of Multimodal Interaction
• Speech is part of a man-machine interface
  – Study of Human-Machine Interaction
25

Introduction
• Three decades:
  – “80s” = 1980-1994
  – “90s” = 1990-2005
  – “00s” = 2000-2011
26

Overview
• 80s: Language Technology
• 80s: Speech Technology
• 90s: Language and Speech Technology
• 90s: Commercial Activity
• 90s: Importance of Data
• 00s: Language and Speech Technology
27

80s: Language Technology
• Focus on MT (in Europe)
  – Eurotra (Europe)
  – Rosetta (Philips, Netherlands)
  – Distributed Translation (BSO, Netherlands)
28

80s: Language Technology
• Linguistic “Research Approach”
• Focus on Research
  – not/less on Technology Development
• Knowledge-based approach
  – hand-crafted lexicons and rules
  – based on a theory / grammatical formalism
• Focus on linguistically interesting complex phenomena
  – less on phenomena that occur often
  – not strongly data-driven
29

80s: Language Technology
• Focus on an idealized language
  – not on actual language use
  – no focus on robustness
• Computational approach seen (in research) as a way to gain insight into language, grammar and grammar formalisms
  – no focus on developing a working system
  – no pragmatic solutions
30

80s: Language Technology
• Little formal (quantitative) evaluation
  – only with test suites
    • constructed sentences illustrating linguistic phenomena
    • e.g. the HP Test Suite (Flickinger et al. 1987)
• Computational linguistics rather than language technology
31

80s: Language Technology
Major problems (from a technology point of view):
• Ambiguity
  – Real
  – Temporary
• Computational Complexity
  – computation-intensive grammar formalisms
• Complexity of language
  – handcrafting lexicons and rules
    • requires linguistic and computational expertise
    • requires a lot of effort and time
32

80s: Language Technology
• Major problems (cont.):
• Idealized language vs. actual language use
• Large and rich lexicons, suited to the application domain, are required: they take a large effort to build and to tune (adapt) to specific domains
33

80s: Speech Technology
• Automatic Speech Recognition (ASR)
• Statistical “Engineering Approach”
• Approach based on the Noisy Channel Model
• Derive acoustic models from a lot of annotated speech examples
• Derive statistical language models from large text corpora (n-gram probabilities)
34

80s: Speech Technology
• Focus on making (small) working systems
• Statistical approach: system uses probabilities derived from data
• Focus initially on limited, “simple” tasks (e.g. digit recognition), and increasingly on more complex tasks
35

80s: Speech Technology
• Focus on real language use under realistic conditions
• Progress made by making concrete systems and evaluating them rigorously
36

90s: Language Technology
• Statistical MT
  – derive language models from monolingual corpora (probabilities of word (sequence)s)
  – align “sentences” with their translations
  – derive translation model from parallel corpora:
    • estimate translation probabilities for words and word sequences from the aligned “sentences”
    • use these probabilities to compute translations for new “sentences”
37

90s: Language Technology
• Ambiguity: resolved by probabilities based on statistics
• Computational Complexity
  – computationally feasible formalisms
  – proven in speech recognition
• Complexity of language
  – language and translation model automatically derived from data
• Strong focus on actual language use
  – Highly data-driven
• Lexicons can be simpler and are derived automatically from the data; adaptation to specific domains is easy once the data are available
38

90s: Language Technology
• Rise of the Internet
• Increasing need for information retrieval
• Approximated by search for word and word sequence strings
• Information Retrieval
  – strongly statistically based
  – limited linguistics
  – formal evaluation (recall, precision, F-score)
39

90s: Language Technology
• Resulted in
  – strongly data-driven approach in language technology
  – increasing use of machine learning techniques
  – explicit focus on formal, esp. quantitative evaluation
  – re-examination of simpler/computationally less intensive formalisms (finite-state) for syntax
40

90s: Speech Technology
• Continued working under the established paradigm
• Steadily improving performance and extending environments and application areas
41

90s: Companies
• Many companies active in Speech technology
  – IBM, Microsoft, Siemens, Nokia, Philips, Motorola, Matra Nortel, Nortel, ...
  – Dragon, Kurzweil, Lernout & Hauspie, SpeechWorks, Nuance, Babel, Loquendo, Rhetorical, Vocalis, Telisma, Elan, ...
42

90s: Companies
• Many companies in Language technology
  – IBM, Microsoft, INSO, Novell, ...
  – GMS, Apptek, Globalink, Lernout & Hauspie, Systran, LANT (Xplanation), ...
43

90s: Companies
• MT systems:
  – knowledge-based systems,
  – developed under an engineering approach
    • simple grammatical formalism, or pruning of the search space
      – to reduce ambiguity
      – to reduce computational resource requirements
      – to reduce hand-crafting of rules
44

90s: Companies
• Resulted in low-quality MT systems
  – still useful in many circumstances
• Differentiating factors
  – rapid adaptation to (multi-word) terms / vocabulary of a new domain
  – good performance on named entity recognition
45

90s: Data
• The knowledge-based NLP community realized that cooperation on lexicons was required
• ASR methodology requires a lot of data:
  – “There is no data like more data”
• This led to
  – Data creation projects
  – Set-up of data distribution centers
  – Projects for developing standards for data
46

90s: Data
• Projects
  – Lexicon projects
    • Multilex
    • Genelex
    • Acquilex
    • Parole
    • WordNet, EuroWordNet
  – SpeechDat projects
    • SpeechDat, SpeechDat-Car, SpeechDat-East, SPEECON, Orientel
  – National / Local projects
    • Spoken Dutch Corpus (Netherlands and Flanders)
47

90s: Data
• Data distribution centers are set up
  – LDC (1993)
  – ELRA (1995)
• Standards:
  – TEI for text corpora
    • CES, XCES
  – Eagles, ISLE for grammatical properties
48

Automating Data Production
• Usually existing (imperfect) tools are used to create data (semi-)automatically
  – G2P for creating phonetic dictionaries
  – PoS-tagging for PoS-tagged text corpora
  – Parsers for treebanks
• For bootstrapping annotations
  – Faster and more consistent results
• Followed by (partial) manual correction
49

00s
• Early 00s
  – Many data and research initiatives, nationally
  – Netherlands
    • IMIX 2001-2008
    • STEVIN 2004-2011
    • TST-Centrale (HLT Agency) 2005-..
  – France
    • EVALDA
    • Technolangue
50

00s
• Early 00s
  – International
    • TREC
    • CLEF
    • TC-STAR 2004-2007
    • EuroMatrix 2006-2009
    • EuroMatrixPlus 2009-2012
    • ECESS
    • PASCAL / PASCAL2
    • ACE
51

00s
• Early 00s
  – International
    • TAC (US)
    • DUC (US)
    • GALE (US)
    • NTCIR (Japan)
    • RTE
    • SemEval
    • SensEval
52

00s
• More recent projects
• FLaReNet
• META-NET
53

00s
• Companies offer services via the internet and via mobile (smart) phones
  – Search: Google, Bing, Yahoo!, etc.
  – Social networks: Facebook, LinkedIn, YouTube
  – Cloud Computing: Amazon, Google, Salesforce
• Companies gain access to huge amounts of data (text, pictures, movies, etc.), including user behavior
54

00s
• Data are used
  – to improve existing services
  – to create new services
  – to personalize services and advertisements
55

00s
• New services relevant for LST
  – Google: Translation, search by voice, open platform for mobile devices (Android)
  – Amazon: Mechanical Turk
    • Allows large-scale distribution of work, e.g. on manual annotation of language resources
  – Apple: several iPhone Apps
    • Dragon Dictate (for SMS, e-mail)
    • Jibbigo
  – ReCaptcha: transcription of (hand-written) documents (now part of Google)
56

Current Status
• Language and Speech Technology in 2011:
  – Exciting area!
    • A lot of commercial activity, and expanding
    • A large and active research community
    • A lot of interesting topics are open for research
57

Commercial Activity
• Many companies in Language technology
  – Google, Yahoo!, IBM, Microsoft, ...
  – Apptek, Linguatec, Systran, Knowledge Concepts, Q-go, ...
• Applications
  – MT, content management, information retrieval, dealing with customer questions, sentiment and opinion mining, ...
58

Commercial Activity
• Many companies in Speech technology
  – Google, IBM, Microsoft, Motorola, Nokia, ...
  – Nuance, Loquendo, Acapela, SVOX, Telisma, ...
• Even more in application development and system integration
59

Commercial Activity
• Applications
  – Network IVR applications (call centers, banking, information services, ...)
  – Embedded applications
    • in-car applications, e.g. voice-activated dialing, navigation (voice destination entry)
    • mobile phone/PDA applications
      – multimodal output, e.g. for navigation
      – command and control
      – (SMS) dictation coming soon
60

Commercial Activity
• Applications
  – Office Applications
    • Dictation, horizontal and vertical (medical, legal)
    • Language learning
  – Audiomining
    • information retrieval from recorded speech (possibly incl. other modalities): Radio/TV-broadcasts, parliamentary sessions, ...
61

Research Topics?
• Speech Technology (Recognition)
  – New paradigms?
    • cf. FLaVoR project http://www.esat.kuleuven.be/psi/spraak/projects/FLaVoR/
  – Combination with other modalities
    • AMI http://www.amiproject.org
    • CHIL http://chil.server.de/servlet/is/101/
    • IMIX (Interactive Multimodal Information eXtraction)
62

Research Topics?
• Speech Technology (Recognition)
  – Robustness against noise and other speakers
    • increasing use in car and in public places on PDAs and mobile phones
    • MIDAS project
  – Pronunciation of names
    • Autonomata I and TOO (incl. Nuance, Ghent, Nijmegen and Utrecht)
63

Research Topics?
• Speech Technology (Text-to-Speech)
  – Better control over prosody in corpus-based TTS?
  – Combination with other modalities
64

Research Topics?
• Language Technology
  – Semantic lexical databases created
  – WordNet and EuroWordNet
  – Cornetto
65

Research Topics?
• Language Technology
  – Focus now on semantic annotation of corpora
    • OntoNotes http://www.isi.edu/naturallanguage/people/hovy/papers/06HLT-NAACLOntoNotes-short.pdf
    • STEVIN D-COI and SONAR
    • DutchSemCor
  – How to use this semantic annotation in practical systems?
66

Research Topics?
• Language Technology
  – (Semi-)automatic lexicon creation/adaptation
  – Sophisticated information retrieval
    • Information extraction, summarization and merging, opinion and sentiment mining
67

Research Topics?
• Language and Speech Technology
  – Speech-to-Speech Translation
    • TC-STAR http://www.tc-star.org/
68

Research Topics?
• Dutch-Flemish STEVIN programme
  – running from 2004-2011
  – 11.4M€ budget
    • resources
    • research
    • applications
    • demonstration projects
  – Most projects finished
  – Some projects are still running
  – http://www.taalunieversum.nl/stevin
69

CLARIN
• Aims to design, construct, validate, and exploit
  – a research infrastructure that is needed to provide a sustainable and persistent eScience working environment
  – for researchers in the Social Sciences & Humanities
  – who want to make use of language data and tools
70

CLARIN
• Make data and tools at different locations easily accessible
  – via web interfaces and services
  – CLARIN-portal(s) with intelligent searching, browsing, viewing and querying services
• Make it possible for non-technical researchers to extract / combine / enrich data (supported by dissemination and training)
71

CLARIN
• Will make available interoperable data and tools based on existing standards and best practices
  – Formal interoperability and
  – Semantic interoperability
72

CLARIN
• For researchers who work with language data and tools
  – Humanities and Social Sciences
    • Linguistics (broadly construed)
    • Literary and Theatrical Studies
    • Media and Culture
    • History
    • Political Sciences
    • …
73

CLARIN
• Preparatory Project (CLARIN-prep)
  – Funded by EU
  – 2008-2011
  – >33 partners from >23 countries
  – Goals
    • Get commitments from EU countries to contribute to the CLARIN infrastructure after CLARIN-prep
    • Investigate needs, requirements
    • Make initial specification (and prototype implementations)
74

CLARIN
• Current Status
  – Most countries are in the process of committing
  – CLARIN infrastructure to start in mid 2011
  – The Netherlands has committed and has a leading role
• CLARIN-NL
  – Funded by NWO
  – 2009-2015
  – Many subprojects running
  – Focus on Humanities
75

This week’s Programme
• Tuesday: Parsing
• Wednesday: Machine Learning
• Thursday: Speech Recognition
  – Guest lecturer: Arjan van Hessen
• Friday: Machine Translation
76

Thanks for Your Attention!
77

References
• Flickinger, D., Nerbonne, J., Sag, I., Wasow, T., "Toward Evaluation of NLP Systems", Hewlett-Packard Laboratories, Palo Alto, CA, 1987.
78