Ontologically-based Searching for Jobs in Linguistics
Deryle Lonsdale
lonz@byu.edu
DLLS 2003

The BYU Data Extraction Group
Group of faculty (5) and students (15) from CS, Linguistics, SOAIS
Goal: ontology-based data extraction
NSF funding: CISE/IIS/IDM TIDIE
Website: www.deg.byu.edu/
  Papers, presentations
  Tools
  Demos

Overview
Ontology-based extraction
Building knowledge sources
Jobs in linguistics (Sproat)
Putting it all together
Some sample results

Ontologies and IE
[Figure with panels labeled Source and Target]

Document-based IE
Conceptual modeling (OSM)
[Figure: OSM diagram of the car-ads model. Object sets: Car, Year, Make, Model, Mileage, Price, Feature, PhoneNr, Extension. Relationship sets: "has" and "is for", annotated with participation constraints 0..1, 0..*, and 1..*.]

Recognition and Extraction
Car   Year  Make    Model      Mileage  Price  PhoneNr
0001  1989  Subaru  SW                  $1900  (336)835-8597
0002  1998          Elantra                    (336)526-5444
0003  1994  HONDA   ACCORD EX  100K            (336)526-1081

Car   Feature
0001  Auto
0001  AC
0002  Black
0002  4 door
0002  tinted windows
0002  Auto
0002  pb
0002  ps
0002  cruise
0002  am/fm
0002  cassette stereo
0002  a/c
0003  Auto
0003  jade green
0003  gold

Car-Ads Ontology (textual)
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
constant { extract "\d{2}";
context "([^\$\d]|^)[4-9]\d[^\d]";
substitute "^" -> "19"; },
…
…
End;
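
The relationship declarations above can be read mechanically. A minimal Python sketch, assuming the constraint written next to Car gives how many times a single car may take part in that relationship: object sets with an upper bound of 1 fit in one row per car, while those with an upper bound of * need their own Car/value table. The parser and the names `parse_relationships` and `max_per_car` are illustrative only and are not part of the DEG tools.

```python
import re

ONTOLOGY = """\
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
"""

# One declaration: <Name> [min..max] <relationship words> <Name> [min..max];
DECL = re.compile(
    r"(\w+) \[(\d+)\.\.(\d+|\*)\] ([a-z ]+?) (\w+) \[(\d+)\.\.(\d+|\*)\];"
)

def parse_relationships(text):
    """Return a list of (left, left_max, relationship, right, right_max)."""
    rels = []
    for m in DECL.finditer(text):
        left, _lmin, lmax, rel, right, _rmin, rmax = m.groups()
        rels.append((left, lmax, rel, right, rmax))
    return rels

def max_per_car(rels):
    """Map each object set related to Car to Car's upper participation bound."""
    bounds = {}
    for left, lmax, _rel, right, rmax in rels:
        if left == "Car":
            bounds[right] = lmax
        elif right == "Car":
            bounds[left] = rmax
    return bounds

bounds = max_per_car(parse_relationships(ONTOLOGY))
single_valued = [name for name, upper in bounds.items() if upper == "1"]
multi_valued = [name for name, upper in bounds.items() if upper == "*"]
print(single_valued)  # ['Year', 'Make', 'Model', 'Mileage', 'Price']
print(multi_valued)   # ['Feature', 'PhoneNr']
```

Under this reading, Feature comes out as multi-valued, which matches the separate Car/Feature listing in the sample extraction output.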
The data-frame library
Low-level patterns implemented as regular expressions
Match items such as email addresses, phone numbers, names, etc.
Mileage matches [8]
constant { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
{ extract "[1-9]\d{0,2}?,\d{3}";
  context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
{ extract "[1-9]\d{0,2}?,\d{3}";
  context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]"; substitute "," -> ""; },
{ extract "[1-9]\d{3,6}";
  context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)"; },
{ extract "[1-9]\d{3,6}";
  context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
end;
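
To make the extract/context/substitute mechanics concrete, here is a minimal Python sketch of three of the Mileage rules. It rests on several assumptions that the slides do not confirm: a context pattern first selects a region of the text, the extract pattern is then applied inside that region, matching is case-insensitive, and the alternation at the end of the last context pattern is read as `(\.|\b|les\b)`. The function name `extract_mileage` is hypothetical.

```python
import re

# Each rule: (extract pattern, context pattern or None, substitution or None).
# Patterns follow the Mileage data frame; the evaluation order is an assumption.
MILEAGE_RULES = [
    (r"\b[1-9]\d{0,2}k", None, (r"[kK]", "000")),
    (r"[1-9]\d{0,2}?,\d{3}", r"[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]", (r",", "")),
    (r"[1-9]\d{3,6}", r"[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)", None),
]

def extract_mileage(text):
    """Return normalized mileage strings found in text, in rule order."""
    found = []
    for extract, context, substitute in MILEAGE_RULES:
        # Without a context pattern, the whole text is the search region.
        if context is None:
            regions = [text]
        else:
            regions = [
                m.group(0) for m in re.finditer(context, text, re.IGNORECASE)
            ]
        for region in regions:
            for m in re.finditer(extract, region, re.IGNORECASE):
                value = m.group(0)
                if substitute is not None:
                    pattern, replacement = substitute
                    value = re.sub(pattern, replacement, value)
                found.append(value)
    return found

print(extract_mileage("HONDA ACCORD EX 100K, auto"))     # ['100000']
print(extract_mileage("Only 45,000 miles, $8,500 obo"))  # ['45000']
print(extract_mileage("runs great, 98000 miles"))        # ['98000']
```

In the second example the context pattern keeps the price out of the result, because `[^\$\d]` rejects a number that directly follows a dollar sign.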
Lexicons
Repositories of enumerable classes of lexical information
FirstNames, LastNames, USstates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.

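A lexicon lookup can be sketched as matching text against a set of known entries. The Python below is only an illustration: the lexicon names come from the slide, but the entries, the in-memory representation, and the function `tag_lexicon_entries` are assumptions, not the group's actual file format or matcher.

```python
import re

# Hypothetical miniature lexicons; real ones would be much larger.
LEXICONS = {
    "CarMakes": {"honda", "subaru"},
    "USstates": {"utah", "north carolina"},
}

def tag_lexicon_entries(text, lexicons):
    """Return (lexicon name, matched text) pairs for every entry found."""
    hits = []
    for name, entries in lexicons.items():
        # Longest entries first, so multiword entries win over shorter overlaps.
        alternatives = sorted(entries, key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(e) for e in alternatives) + r")\b",
            re.IGNORECASE,
        )
        hits.extend((name, m.group(0)) for m in pattern.finditer(text))
    return hits

print(tag_lexicon_entries("1994 HONDA Accord, Provo, Utah", LEXICONS))
# [('CarMakes', 'HONDA'), ('USstates', 'Utah')]
```

Unlike a data frame, a lexicon needs no hand-written pattern per item: adding an entry to the set is enough.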
Accessing the output
Extracted information is stored in a relational database
Results can be queried using SQL
Wide range of views is possible

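As an illustration of querying extracted records with SQL, the sketch below loads a few rows into an in-memory SQLite database from Python. The schema, the table names `Car` and `CarFeature`, and the choice of SQLite are assumptions; the slides do not say which database or schema the group used. The rows follow one plausible reading of the car-ads sample output, with only a subset of the features.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Car (id TEXT PRIMARY KEY, year INTEGER, make TEXT,
                      model TEXT, mileage TEXT, price TEXT, phone TEXT);
    CREATE TABLE CarFeature (car_id TEXT REFERENCES Car(id), feature TEXT);
""")
conn.executemany("INSERT INTO Car VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("0001", 1989, "Subaru", "SW", None, "$1900", "(336)835-8597"),
    ("0002", 1998, None, "Elantra", None, None, "(336)526-5444"),
    ("0003", 1994, "HONDA", "ACCORD EX", "100K", None, "(336)526-1081"),
])
conn.executemany("INSERT INTO CarFeature VALUES (?, ?)", [
    ("0001", "Auto"), ("0001", "AC"), ("0002", "Auto"), ("0002", "cruise"),
    ("0003", "Auto"), ("0003", "jade green"),
])

# Cars from 1990 or later that list an automatic transmission.
rows = conn.execute("""
    SELECT c.year, c.model, c.phone
    FROM Car AS c JOIN CarFeature AS f ON f.car_id = c.id
    WHERE f.feature = 'Auto' AND c.year >= 1990
    ORDER BY c.year
""").fetchall()
for row in rows:
    print(row)
```

Different views of the same data are then a matter of writing different queries, for example grouping by make or filtering on price.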
Finding jobs in linguistics
Linguistlist.org, LSA
Email distribution lists (Corpora, langage naturelle, CAAL/ACLA, etc.)
Usual commercial sites (monster.com, flipdog.com, dice.com)
Word-of-mouth sources

Sproat’s analysis
Random sample (224/2250) of LinguistList postings, 1994-2001
Development vs. research, academic vs. industrial
Linguists are most often (approx. 80% of the time) offered development jobs
Linguists are hired for specific tasks (e.g. grammar, lexicon development) more often than for general research-oriented tasks (e.g. creating new technological approaches)

The banner years
Year        Academia  Industry  % Industry
1994        27        2         7%
1995        45        5         10%
1996        52        3         5%
1997        48        3         6%
1998        57        3         5%
1999        56        14        20%
2000        55        43        39%
2001 (mid)  22        10        31%

Dramatic rise in 1999, 2000
Steep drop-off since 2001
Rising demand for technical, computational skills

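The "% Industry" column can be recomputed from the two count columns. The short Python sketch below assumes the percentage is industry postings divided by all postings for that year, rounded to a whole number; the slides do not state the formula. Under that assumption every row reproduces the listed figure except 2000, where the formula gives about 44% against the listed 39%.

```python
# (year, academia, industry) as listed in the table above.
POSTINGS = [
    ("1994", 27, 2), ("1995", 45, 5), ("1996", 52, 3), ("1997", 48, 3),
    ("1998", 57, 3), ("1999", 56, 14), ("2000", 55, 43), ("2001 (mid)", 22, 10),
]

def industry_share(academia, industry):
    """Industry postings as a whole-number percentage of all postings."""
    return round(100 * industry / (academia + industry))

shares = {year: industry_share(a, i) for year, a, i in POSTINGS}
for year, pct in shares.items():
    print(f"{year}: {pct}%")
```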
Linguistic jobs ontology
Why?
  user-specifiable constraints
Somewhat closely follows existing ontologies (e.g. jobs, software)

Data frames and lexicons
Language names
  Ethnologue
(Sub)fields of linguistics
  Linguistlist.org
Tools, toolkits
Software components, programming languages
Linguistics-related job titles
Activities
Responsibilities
Country names

The corpus
3237 postings (LinguistList, Corpora, LN, WoM):
  Year  Postings
  1998  541
  1999  575
  2000  871
  2001  952
  2002  788
Some noise (non-English, factored, program descriptions, attachments, etc.)
Semi-automatic edits (boilerplate, publicity blurbs about institutions, etc.)

Sample output

Here
Observations
270 postings do not contain linguist* (!)
Demand for knowledge of English equals that for all other languages combined (G, F, S, J, C)
Computer/computational background required for almost 1/3 (1116)
Noticeable amount of headhunting, particularly in Seattle, DC areas

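A count such as "postings without linguist*" amounts to a prefix search over the corpus. A minimal Python sketch, assuming the wildcard means any word beginning with "linguist" and that matching ignores case; the sample postings and the function name `count_without` are invented for illustration.

```python
import re

# Prefix match: linguist, linguists, linguistic, linguistics, ...
LINGUIST = re.compile(r"\blinguist\w*", re.IGNORECASE)

def count_without(postings, pattern=LINGUIST):
    """Count postings in which the pattern never occurs."""
    return sum(1 for text in postings if not pattern.search(text))

# Hypothetical postings, not taken from the corpus.
sample = [
    "Computational Linguist wanted for grammar development.",
    "Speech recognition engineer, ASR tuning experience required.",
    "Lecturer in Linguistics, tenure track.",
]
print(count_without(sample))  # 1
```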
Programming languages
[Bar chart, vertical axis 0 to 700. Categories: C/C++, Java/Jscript, Prolog, VB, CGI, Lisp/Python, SQL, XML/XSLT, HTML/SGML, Perl, Tcl]

Popular subfields
[Bar chart, vertical axis 0 to 700. Categories: IE/IR, Phonology, Semantics, Morpho, Pragmatics, MT, NLP, Speech, TESOL/EFL, Phonetics, Syntax, Translation]

Subfields (another perspective)
[Bar chart, vertical axis 0 to 800. Categories: Psycho, Typological, Socioling, Philosophy, Neuro, Acquisition, Lexicography, Anthropo, Historical, Cognition, Philology]

An engineering discipline?
160 linguistics job titles ending in “engineer” (abbreviated "e." below)
Software development cycle
  research e., software design e.
  development e., software e.
  software quality e., linguistic test e., linguistic quality e.
  linguistic support e., user experience e.
  presales e., technical sales e.
  web site e.
Specific subfields
  speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e.
  dialog e.
  tools e.
  AI e., NLP e.
  knowledge e.
  linguist e., natural language e.
  staff e.
  human factors e., user interface e.

Paradigms
[Bar chart, vertical axis 0 to 300. Categories: Machine learning, Statistical, Math, Field Methods, Finite-state, Stoch/Prob, Generative]

Other observations
Often a job title is not even listed (!)
More i18n of data frames needed (e.g. email, phone number)
Great need for (preferably hierarchical) lexical repositories related to linguistics
  job titles
  theoretical frameworks, subfields
  typical linguist job activities
  linguistic research/development venues