Ontologically-based Searching for Jobs in Linguistics
  Deryle Lonsdale
  lonz@byu.edu
  Funded by:
  DLLS 2003

The BYU Data Extraction Group
  Group of faculty (5) and students (15) from CS, Linguistics, SOAIS
  Goal: ontology-based data extraction
  NSF funding: CISE/IIS/IDM TIDIE
  Website: www.deg.byu.edu/
    Papers, presentations
    Tools
    Demos

Overview
  Ontology-based extraction
  Building knowledge sources
  Jobs in linguistics (Sproat)
  Putting it all together
  Some sample results

Ontologies and IE
  [Figure: Source, Target]

Document-based IE
  [Figure]

Conceptual modeling (OSM)
  [Figure: OSM diagram in which Car "has" Year, Make, Model, Mileage, Price, and Feature, PhoneNr "is for" Car, and PhoneNr "has" Extension, with participation constraints such as 0..1, 0..*, and 1..*]

Recognition and Extraction
  Car   Year  Make    Model      Mileage  Price  PhoneNr
  0001  1989  Subaru  SW                  $1900  (336)835-8597
  0002  1998          Elantra                    (336)526-5444
  0003  1994  HONDA   ACCORD EX  100K            (336)526-1081

  Car   Feature (one row per feature on the slide)
  0001  Auto AC
  0002  Black 4 door tinted windows Auto pb ps cruise am/fm cassette stereo a/c
  0003  Auto jade green gold

Car-Ads Ontology (textual)
  Car [->object];
  Car [0..1] has Year [1..*];
  Car [0..1] has Make [1..*];
  Car [0..1] has Model [1..*];
  Car [0..1] has Mileage [1..*];
  Car [0..*] has Feature [1..*];
  Car [0..1] has Price [1..*];
  PhoneNr [1..*] is for Car [0..*];
  PhoneNr [0..1] has Extension [1..*];
  Year matches [4]
    constant
      { extract "\d{2}";
        context "([^\$\d]|^)[4-9]\d[^\d]";
        substitute "^" -> "19"; },
      …
  …
  End;

The data-frame library
  Low-level patterns implemented as regular expressions
  Match items such as email addresses, phone numbers, names, etc.
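The Year rule in the Car-Ads ontology above can be read as three steps: find a context match, pull the extract pattern out of it, then apply the substitution. The following is a minimal Python sketch of that reading, not the group's actual implementation; the function name `extract_years` and the way the three steps are chained are assumptions made for illustration.

```python
import re

# Patterns copied from the Year rule; how the three steps are chained
# below is an assumption, not the documented behavior of the DEG tools.
CONTEXT = re.compile(r"([^$\d]|^)[4-9]\d[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years found in context."""
    years = []
    for ctx in CONTEXT.finditer(text):
        # Pull the two-digit year out of the matched context window.
        hit = EXTRACT.search(ctx.group(0))
        if hit:
            # substitute "^" -> "19": prefix the century.
            years.append(re.sub(r"^", "19", hit.group(0)))
    return years


print(extract_years("'89 Subaru SW $1900"))  # ['1989']
```

The price "$1900" is not picked up as a year because the context pattern rejects digits preceded by a dollar sign or by another digit.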
  Mileage matches [8]
    constant
      { extract "\b[1-9]\d{0,2}k";
        substitute "[kK]" -> "000"; },
      { extract "[1-9]\d{0,2}?,\d{3}";
        context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
        substitute "," -> ""; },
      { extract "[1-9]\d{0,2}?,\d{3}";
        context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
        substitute "," -> ""; },
      { extract "[1-9]\d{3,6}";
        context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)"; },
      { extract "[1-9]\d{3,6}";
        context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
    keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
  end;

Lexicons
  Repositories of enumerable classes of lexical information
  FirstNames, LastNames, USstates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.

Accessing the output
  Extracted information is stored in a relational database
  Results can be queried using SQL
  A wide range of views is possible

Finding jobs in linguistics
  Linguistlist.org, LSA
  Email distribution lists (corpora, langage naturelle, CAAL/ACLA, etc.)
  Usual commercial sites (monster.com, flipdog.com, dice.com)
  Word-of-mouth sources

Sproat's analysis
  Random sample (224/2250) of LinguistList postings, 1994-2001
  Development vs. research, academic vs. industrial
  Linguists are most often (approx. 80% of the time) offered development jobs
  Linguists are hired more for specific tasks (e.g. grammar, lexicon development) than for general research-oriented tasks (e.g. creating new technological approaches)

The banner years
  Year        Academia  Industry  % Industry
  1994        27        2         7%
  1995        45        5         10%
  1996        52        3         5%
  1997        48        3         6%
  1998        57        3         5%
  1999        56        14        20%
  2000        55        43        39%
  2001 (mid)  22        10        31%

  Dramatic rise in 1999, 2000
  Steep drop-off since 2001
  Rising demand for technical, computational skills

Linguistic jobs ontology
  Why?
    User-specifiable constraints
  Somewhat closely follows existing ontologies (e.g. jobs, software)

Data frames and lexicons
  Language names
    Ethnologue
  (Sub)fields of linguistics
    Linguistlist.org
  Tools, toolkits
  Software components, programming languages
  Linguistics-related job titles
  Activities
  Responsibilities
  Country names

The corpus
  3237 postings (LinguistList, Corpora, LN, WoM):
  Year      1998  1999  2000  2001  2002
  Postings  541   575   871   952   788

  Some noise (non-English, factored, program descriptions, attachments, etc.)
  Semi-automatic edits (boilerplate, publicity blurbs about institutions, etc.)

Sample output
  [Link: sample output]

Observations
  270 postings do not contain linguist* (!)
  Demand for knowledge of English equals that for all other languages combined (G, F, S, J, C)
  Computer/computational background required for almost 1/3 (1116)
  Noticeable amount of headhunting, particularly in the Seattle and DC areas

Programming languages
  [Bar chart, vertical axis 0-700. Categories: C/C++, Java/Jscript, Prolog, VB, CGI, Lisp/Python, SQL, XML/XSLT, HTML/SGML, Perl, Tcl]

Popular subfields
  [Bar chart, vertical axis 0-700. Categories: IE/IR, Phonology, Semantics, Morpho, Pragmatics, MT, NLP, Speech, TESOL/EFL, Phonetics, Syntax, Translation]

Subfields (another perspective)
  [Bar chart, vertical axis 0-800. Categories: Psycho, Typological, Socioling, Philosophy, Neuro, Acquisition, Lexicography, Anthropo, Historical, Cognition, Philology]

An engineering discipline?
  160 linguistics jobs ending in "engineer"
  Groupings on the slide: software development cycle, specific subfields
  Titles (e. = engineer):
    research e., software design e.
    development e., software e.
    software quality e., linguistic test e., linguistic quality e.
    linguistic support e., user experience e.
    presales e., technical sales e.
    web site e.
    speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e.
    dialog e.
    tools e.
    AI e., NLP e.
    knowledge e.
    linguist e., natural language e.
    staff e.
    human factors e., user interface e.
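Since the extracted information is stored in a relational database, a figure such as the number of job titles ending in "engineer" could in principle come from a single SQL query. The sketch below uses Python's built-in sqlite3 module to illustrate the idea. The `posting` table, its `job_title` column, and the function name are hypothetical (the talk does not show the actual schema), and the sample titles are illustrative only.

```python
import sqlite3


def count_engineer_titles(titles):
    """Load titles into a throwaway table and count those ending in 'engineer'."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per posting, with its extracted job title.
    conn.execute(
        "CREATE TABLE posting (id INTEGER PRIMARY KEY, job_title TEXT)"
    )
    conn.executemany(
        "INSERT INTO posting (job_title) VALUES (?)",
        [(t,) for t in titles],
    )
    # A leading % wildcard with no trailing one matches titles that END
    # in 'engineer'; lower() makes the comparison case-insensitive.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM posting "
        "WHERE lower(job_title) LIKE '%engineer'"
    ).fetchone()
    conn.close()
    return count


# Illustrative titles only, not rows from the actual corpus.
sample = [
    "Speech Engineer",
    "NLP engineer",
    "knowledge engineer",
    "computational linguist",
    "engineering manager",
]
print(count_engineer_titles(sample))  # 3
```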

Paradigms
  [Bar chart, vertical axis 0-300. Categories: Machine learning, Statistical, Math, Field Methods, Finite-state, Stoch/Prob, Generative]

Other observations
  Often a job title is not even listed (!)
  More internationalization (i18n) of data frames is needed (e.g. email, phone numbers)
  Great need for (preferably hierarchical) lexical repositories related to linguistics
    job titles
    theoretical frameworks, subfields
    typical linguist job activities
    linguistic research/development venues
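
One way to picture a hierarchical lexical repository of the kind called for above is a nested mapping from categories to terms. This is only a sketch: the category names and groupings are assumptions made for illustration, the terms are taken from the chart labels in this talk, and `path_to` is a hypothetical helper.

```python
# Hypothetical hierarchy; the groupings are illustrative assumptions,
# not a classification proposed in the talk.
LEXICON = {
    "subfield": {
        "computational": ["NLP", "MT", "IE/IR", "Speech"],
        "core": ["Syntax", "Semantics", "Phonology", "Phonetics"],
    },
    "paradigm": {
        "quantitative": ["Machine learning", "Statistical", "Stoch/Prob"],
        "formal": ["Finite-state", "Generative"],
    },
}


def path_to(term, node=LEXICON, trail=()):
    """Return the list of categories leading to term, or None if absent."""
    if isinstance(node, dict):
        # Interior node: search each child category in turn.
        for name, child in node.items():
            found = path_to(term, child, trail + (name,))
            if found is not None:
                return found
        return None
    # Leaf node: a list of terms, compared case-insensitively.
    if term.lower() in (t.lower() for t in node):
        return list(trail)
    return None


print(path_to("MT"))  # ['subfield', 'computational']
```

With a hierarchy like this, a match on a specific term can be rolled up to its broader category, which is the practical advantage of a hierarchical repository over a flat list.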