Language Technology

Language and Speech Technology: Introduction
Jan Odijk
January 2011
LOT Winter School 2011

Overview
• What is language and speech technology (LST)?
• Major Subfields of LST
• Characterization of the last 30 years
  – 80s, 90s, 00s
  – Current Status
• CLARIN infrastructure
• This week’s programme

Language Technology
• Language Technology is the study of computational systems that process natural language
• Alternative names:
  – Human Language Technology (HLT)
  – Natural Language Processing (NLP)

Speech Technology
• Speech Technology is the study of computational systems that process speech
• Is a part of Language Technology
• Often
  – Term “Language technology” reserved for the study of computational systems that process written language

Computational Linguistics
• Computational Linguistics (CL) is the study of language from a computational perspective
• Often used interchangeably with language technology
• Often grouped under Artificial Intelligence (AI), although CL predates AI
  – AI: the study and design of intelligent systems

Computational Systems
• Computational systems to process natural language do not exist naturally (except in the human brain)
  – They must be designed, implemented, and evaluated
  – Therefore it is a kind of engineering

Computational Systems
• LST is NOT the study of processing of natural language by humans in
  – cognition
  – (cognitive) psychology
  – (psycho)linguistics
  – phonetics

Language Technology Subfields
• Orthographic processing
  – Text = sequence of characters
  – Tokenization
    • Text => sequence of tokens
    • Token = occurrence of a word form
    • Relatively simple for languages that use spaces and punctuation (dot, comma, etc.) to separate tokens
    • More difficult for languages such as Chinese, Thai, etc.
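The tokenization step above (text => sequence of tokens) can be sketched for a language that separates tokens with spaces and punctuation. This is a minimal illustration, not a method from the slides: the pattern and the name `tokenize` are assumptions chosen for the example, and real tokenizers handle many more cases (abbreviations, numbers, URLs).

```python
import re

# Illustrative pattern (an assumption, not from the slides):
# either a word, optionally with internal hyphens or apostrophes,
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Return the tokens (word forms and punctuation marks) found in text."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    # Space- and punctuation-delimited text splits into separate tokens.
    print(tokenize("The man bought a present, didn't he?"))
    # Text written without spaces comes back as one undivided token.
    print(tokenize("我爱北京"))
```

The second call shows why such a pattern is not enough for languages like Chinese or Thai: with no spaces between words, the whole string is returned as a single token, so word boundaries have to be found by other means (for example a lexicon or a statistical segmenter).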
Language Technology Subfields
• Orthographic processing
  – Orthographic normalization
  – Token => (token, normalized token)
  – Normalized token = canonical orthographic representation for a set of orthographic variants
  – Examples:
    • Contemporary spelling variants: aktie => actie
    • Older spelling variants: vleesch => vlees
    • Typos: actei => actie
    • OCR errors: raarn => raam

Language Technology Subfields
• Morphological processing
  – Lemmatization: token => (token, lemma)
    • Lemma = canonical orthographic representation for an inflectional paradigm
    • Often ambiguities
    • Examples
      – lemma(walked) = walk; lemma(men) = man
      – lemma(graven) = {graf, graaf, graven} (Dutch)

Language Technology Subfields
• Morphological processing
  – Inflection analysis/generation
    • Word form => (lemma, inflectional features)
    • Examples:
      – graven => (graf, PoS=Noun, number=plural)
      – graven => (graaf, PoS=Noun, number=plural)
      – graven => (graven, PoS=Verb, form=infinitive)
      – graven => (graven, PoS=Verb, form=indicative, tense=present, number=plural)

Language Technology Subfields
• Morphological processing
  – Compound processing
  – word form => ((word form, affix?)+, word form)
  – lemma => ((word form, affix?)+, lemma)
  – Examples:
    • Vleeskoeienhouders => ([vlees, koeien], houders) ‘meat cow farmers’
    • gebiedsbepaling => ([(gebied, s)], bepaling)

Language Technology Subfields
• Morphological processing
  – Derivational morphology processing
  – word form => (prefix*, lemma, suffix*)
  – Example:
    • Characterization => ([], characterize, [ation])

Language Technology Subfields
• (PoS-)tagging
  – Assignment of a grammatical tag to a token in context (tag = label for grammatical properties)
  – Token => (token, tag) in context
  – Usually assignment of PoS-tags
  – Often more detailed grammatical (inflectional) tags

Language Technology Subfields
• (PoS-)tagging
  – Context: usually
    • Some words and/or tags preceding
    • Some words following
  – Examples:
    • (graven, Zij __ een graf) => Vindprespl
    • (graven, De __ zijn boos) => Npl

Language Technology Subfields
• Chunking
  – Identifying major phrases in a sentence
  – Example
    • The man bought a present for his wife =>
    • [NP The man] bought [NP a present] [PP for his wife]

Language Technology Subfields
• Parsing
  – Assign a syntactic structure to a sentence
  – Example:
    The man bought a present for his wife =>
    [S [subj/NP The man] [pred/VP bought [obj/NP a present] [pobj/PP for [obj/NP his wife]] ] ]

Language Technology Subfields
• Machine Translation
  – Automatic translation of an input text
  – Example
    • The man bought a present for his wife =>
    • L’homme a acheté un cadeau pour sa femme

Language Technology Subfields
• Content extraction and processing
  – Named entity recognition
  – Question-answering
  – Information retrieval
  – Information extraction
  – Sentiment/opinion mining
  – Reasoning/Inference on semantic representation
  – …

Speech Technology Subfields
• Speech Synthesis
  – Artificial production of human speech
  – Text => speech
  – Often called Text-To-Speech (TTS)
  – TTS system usually contains two components
    • Grapheme to Phoneme (G2P) component
      – Text => symbolic speech representation (phonetic representation)
    • Speech Synthesis component
      – Symbolic speech representation => speech

Speech Technology Subfields
• Speech Synthesis (cont.)
  – Term Speech Synthesis often reserved for this second component
  – Meaning => speech
  – Usually called Speech Generation, or Concept-To-Speech, or Data-to-Speech

Speech Technology Subfields
• Speech Recognition
  – Recognition of human speech
  – Audio containing speech => text
  – Often called automatic speech recognition (ASR)
• Speech Understanding
  – Understanding of human speech
  – Audio containing speech => meaning or action

Speech Technology Subfields
• Speaker Recognition
  – Recognition of a speaker given a speech signal
  – Speech => person identity
• Speaker Verification
  – Verification of the identity of a person
  – Speech + claimed identity => Boolean

Speech Technology Subfields
• Speech Compression
  – Reduction of the size of speech representations (speech encoding), or
  – Time-compression of speech representations (so that they sound faster to the listener)

Related fields
• Speech often used in dialogues
  – Study of spoken dialogues (human-human, human-machine)
• Speech often combined with other modalities
  – Study of Multimodal Interaction
• Speech part of a man-machine interface
  – Study of Human-Machine Interaction

Introduction
• Three decades:
  – “80s” = 1980-1994
  – “90s” = 1990-2005
  – “00s” = 2000-2011

Overview
• 80s: Language Technology
• 80s: Speech Technology
• 90s: Language and Speech Technology
• 90s: Commercial Activity
• 90s: Importance of Data
• 00s: Language and Speech Technology

80s: Language Technology
• Focus on MT (in Europe)
  – Eurotra (Europe)
  – Rosetta (Philips, Netherlands)
  – Distributed Translation (BSO, Netherlands)

80s: Language Technology
• Linguistic “Research Approach”
• Focus on Research
  – not/less on Technology Development
• Knowledge-based approach
  – hand-crafted lexicons and rules
  – based on a theory / grammatical formalism
• Focus on linguistically interesting complex phenomena
  – less on phenomena that occur often
  – not strongly data-driven

80s: Language Technology
• Focus on an idealized language
  – not on actual language use
  – no focus on robustness
• Computational approach seen (in research) as a way to gain insight into language, grammar and grammar formalisms
  – no focus on developing a working system
  – no pragmatic solutions

80s: Language Technology
• Little formal (quantitative) evaluation
  – only with test suites
    • constructed sentences illustrating linguistic phenomena
    • E.g. the HP Test Suite (Flickinger et al. 1987)
• computational linguistics rather than language technology

80s: Language Technology
Major Problems (from a technology point of view):
• Ambiguity
  – Real
  – Temporary
• Computational Complexity
  – computation-intensive grammar formalisms
• Complexity of language
  – handcrafting lexicons and rules
    • requires linguistic and computational expertise
    • requires a lot of effort and time

80s: Language Technology
• Major problems (cont.):
• Idealized Language vs. actual Language Use
• Require large and rich lexicons, suited to the application domain: difficult / large effort to make them, and to tune (adapt) to specific domains

80s: Speech Technology
• Automatic Speech Recognition (ASR)
• Statistical “Engineering Approach”
• approach based on Noisy Channel Model
• derive acoustic models from a lot of annotated speech examples
• derive statistical language models from large text corpora (n-gram probabilities)

80s: Speech Technology
• Focus on making (small) working systems
• Statistical approach: system uses probabilities derived from data
• Focus initially on limited, “simple” tasks (e.g. digit recognition), and increasingly on more complex tasks

80s: Speech Technology
• Focus on real language use under realistic conditions
• Progress made by making concrete systems and evaluating them rigorously

90s: Language Technology
• Statistical MT
  – derive language models from monolingual corpora (probabilities of word (sequence)s)
  – align “sentences” with their translations
  – derive translation model from parallel corpora:
    • estimate translation probabilities for words and word sequences from the aligned “sentences”
    • use these probabilities to compute translations for new “sentences”

90s: Language Technology
• Ambiguity: resolved by probabilities based on statistics
• Computational Complexity
  – computationally feasible formalisms
  – proven in speech recognition
• Complexity of language
  – language and translation model automatically derived from data
• Strong focus on actual language use
  – Highly data driven
• Lexicons can be simpler and are derived automatically from the data; adaptation to specific domains easy once the data are available

90s: Language Technology
• Rise of Internet
• increasing need for information retrieval
• approximated by search for word and word sequence strings
• Information Retrieval
  – strongly statistically based
  – Limited linguistics
  – formal evaluation (recall, precision, F-score)

90s: Language Technology
• Resulted in
  – strongly data-driven approach in language technology
  – increasing use of machine learning techniques
  – explicit focus on formal, esp. quantitative, evaluation
  – re-examination of simpler/computationally less intensive formalisms (finite-state) for syntax

90s: Speech Technology
• Continued working under the established paradigm
• increasingly improving performance and extending environments and application areas

90s: Companies
• many companies active in Speech technology
  – IBM, Microsoft, Siemens, Nokia, Philips, Motorola, Matra Nortel, Nortel, ...
  – Dragon, Kurzweil, Lernout & Hauspie, SpeechWorks, Nuance, Babel, Loquendo, Rhetorical, Vocalis, Telisma, Elan, ...

90s: Companies
• many companies in Language technology
  – IBM, Microsoft, INSO, Novell, ...
  – GMS, Apptek, Globalink, Lernout & Hauspie, Systran, LANT (Xplanation), ...

90s: Companies
• MT systems:
  – knowledge based systems,
  – developed under an engineering approach
    • grammatical formalism simple or pruning in search space
      – to reduce ambiguity
      – to reduce computational resource requirements
      – to reduce hand-crafting of rules

90s: Companies
• resulted in low quality MT systems
  – still useful in many circumstances
• Differentiating factors
  – rapid adaptation to (multi-word) terms / vocabulary of new domain
  – good performance on named entity recognition

90s: Data
• Knowledge Based NLP realized cooperation on lexicons was required
• ASR Methodology requires a lot of data:
  – “There is no data like more data”
• This led to
  – Data creation projects
  – Set-up of data distribution centers
  – Projects for developing standards for data

90s: Data
• Projects
  – Lexicon projects
    • Multilex
    • Genelex
    • Acquilex
    • Parole
    • WordNet, EuroWordNet
  – SpeechDat projects
    • SpeechDat, SpeechDat-Car, SpeechDat-East, SPEECON, Orientel
  – National / Local projects
    • Spoken Dutch Corpus (Netherlands and Flanders)

90s: Data
• Data distribution Centers are set up
  – LDC (1993)
  – ELRA (1995)
• Standards:
  – TEI for text corpora
    • CES, XCES
  – Eagles, ISLE for grammatical properties

Automating Data Production
• Usually existing (imperfect) tools are used to create data (semi-)automatically
  – G2P for creating phonetic dictionaries
  – PoS-tagging for PoS-tagged text corpora
  – Parsers for treebanks
• For bootstrapping annotations
  – Faster and more consistent results
• Followed by (partial) manual correction

00s
• Early 00s
  – Many data and research initiatives, nationally
  – Netherlands
    • IMIX 2001-2008
    • STEVIN 2004-2011
    • TST-Centrale (HLT Agency) 2005-..
  – France
    • EVALDA
    • Technolangue

00s
• Early 00s
  – International
    • TREC
    • CLEF
    • TC-STAR 2004-2007
    • EuroMatrix 2006-2009
    • EUROMATRIXPlus 2009-2012
    • ECESS
    • PASCAL / PASCAL2
    • ACE

00s
• Early 00s
  – International
    • TAC (US)
    • DUC (US)
    • GALE (US)
    • NTCIR (Japan)
    • RTE
    • SemEval
    • SensEval

00s
• More recent projects
• FLaReNet
• META-NET

00s
• Companies offer services via the internet and via mobile (smart) phones
  – Search: Google, Bing, Yahoo!, etc.
  – Social networks: Facebook, LinkedIn, YouTube
  – Cloud Computing: Amazon, Google, Salesforce
• Companies gain access to huge amounts of data (text, pictures, movies, etc.) including user behavior

00s
• Data are used
  – to improve existing services
  – to create new services
  – to personalize services and advertisements

00s
• New Services relevant for LST
  – Google: Translation, search by voice, open platform for mobile devices (Android)
  – Amazon: Mechanical Turk
    • Allows large scale distribution of work, e.g. on manual annotation of language resources
  – Apple: several iPhone Apps
    • Dragon Dictate (for SMS, e-mail)
    • Jibbigo
  – reCAPTCHA: transcription of (hand-written) documents (now part of Google)

Current Status
• Language and Speech Technology in 2011:
  – Exciting area!
    • A lot of commercial activity, and expanding
    • A large and active research community
    • A lot of interesting topics are open for research

Commercial Activity
• many companies in Language technology
  – Google, Yahoo!, IBM, Microsoft, ...
  – Apptek, Linguatec, Systran, Knowledge Concepts, Q-go, ...
• applications
  – MT, content management, information retrieval, dealing with customer questions, sentiment and opinion mining, ...

Commercial Activity
• many companies in Speech technology
  – Google, IBM, Microsoft, Motorola, Nokia, ...
  – Nuance, Loquendo, Acapela, SVOX, Telisma, ...
• even more in application development and system integration

Commercial Activity
• applications
  – Network IVR applications (Call centers, banking, information services, ...)
  – Embedded applications
    • in-car applications, e.g. voice activated dialing, navigation (voice destination entry)
    • mobile phone/PDA applications
      – multimodal output e.g. for navigation
      – command and control
      – (SMS) dictation coming soon

Commercial Activity
• applications
  – Office Applications
    • Dictation, horizontal and vertical (medical, legal)
    • Language learning
  – Audiomining
    • information retrieval from recorded speech (possibly incl. other modalities): Radio/TV-broadcasts, parliamentary sessions, ...

Research Topics?
• Speech Technology (Recognition)
  – new paradigms?
    • cf. FLaVoR project http://www.esat.kuleuven.be/psi/spraak/projects/FLaVoR/
  – Combination with other modalities
    • AMI http://www.amiproject.org
    • CHIL http://chil.server.de/servlet/is/101/
    • IMIX (Interactive Multimodal Information eXtraction)

Research Topics?
• Speech Technology (Recognition)
  – robustness against noise and other speakers
    • increasing use in car and in public places on PDAs and mobile phones
    • MIDAS project
  – pronunciation of names
    • Autonomata I and TOO (incl. Nuance, Ghent, Nijmegen and Utrecht)

Research Topics?
• Speech technology (Text-to-Speech)
  – better control over prosody in corpus-based TTS?
  – Combination with other modalities

Research Topics?
• Language Technology
  – Semantic Lexical databases created
  – WordNet and EuroWordNet
  – Cornetto

Research Topics?
• Language Technology
  – Focus now on Semantic Annotation of Corpora
    • OntoNotes http://www.isi.edu/naturallanguage/people/hovy/papers/06HLT-NAACLOntoNotes-short.pdf
    • STEVIN D-COI and SONAR
    • DutchSemCor
  – How to use this semantic annotation in practical systems?

Research Topics?
• Language Technology
  – (Semi-)automatic lexicon creation/adaptation
  – Sophisticated information retrieval
    • Information extraction, summarization and merging, opinion and sentiment mining

Research Topics?
• Language And Speech Technology
  – Speech to Speech Translation
    • TC-STAR http://www.tc-star.org/

Research Topics?
• Dutch-Flemish STEVIN programme
  – running from 2004-2011
  – 11.4M€ budget
    • resources
    • research
    • applications
    • demonstration projects
  – Most projects finished
  – some projects are still running
  – http://www.taalunieversum.nl/stevin

CLARIN
• aims to design, construct, validate, and exploit
  – a research infrastructure that is needed to provide a sustainable and persistent eScience working environment
  – for researchers in the Social Sciences & Humanities
  – who want to make use of language data and tools

CLARIN
• Make data and tools on different locations easily accessible
  – via web interfaces and services
  – CLARIN-portal(s) with intelligent searching, browsing, viewing and querying services
• make it possible for non-technical researchers to extract / combine / enrich data (supported by dissemination and training)

CLARIN
• Will make available interoperable data and tools based on existing standards and best practices
  – Formal interoperability and
  – Semantic interoperability

CLARIN
• For researchers that work with language data and tools
  – Humanities and Social Sciences
    • Linguistics (broadly construed)
    • Literary and Theatrical Studies
    • Media and Culture
    • History
    • Political Sciences
    • …

CLARIN
• Preparatory Project (CLARIN-prep)
  – Funded by EU
  – 2008-2011
  – >33 partners from >23 countries
  – Goals
    • Get commitments from EU countries to contribute to the CLARIN infrastructure after CLARIN-prep
    • Investigate needs, requirements
    • Make initial specification (and prototype implementations)

CLARIN
• Current Status
  – Most countries in the process
  – CLARIN infrastructure to start in mid 2011
  – Netherlands committed and has leading role
• CLARIN-NL
  – Funded by NWO
  – 2009-2015
  – Many subprojects running
  – Focus on Humanities

This week’s Programme
• Tuesday: Parsing
• Wednesday: Machine Learning
• Thursday: Speech Recognition
  – Guest lecturer: Arjan van Hessen
• Friday: Machine Translation

Thanks for Your Attention!

References
• Flickinger D., Nerbonne J., Sag I., Wasow T., "Toward Evaluation of NLP Systems", Hewlett-Packard Laboratories, Palo Alto, CA, 1987.
