Slide 1 - CWN - Academia Sinica

From Synergy to Knowledge:
Integrating multiple language resources
Part I: Language Resources and Tools
Chu-Ren Huang
Academia Sinica
http://cwn.ling.sinica.edu.tw/huang/huang.htm
4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007
Outline: Language Resources and Tools
• Introduction: 10 Years in Chinese Language Processing, a Mirror for Other Asian Languages
• The Starting Point: Resources and Resource Sharing
• OLAC: The Open Language Archives Community
• Asian Language Resources Committee of AFNLP
• Standards: ISO TC37 Language Resources Management
• Language Archives Project of Taiwan
• Tools: Getting Started in NLP with NLTK
C.R. Huang, 4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007
p. 2
Why Resources and Tools
Language Resources
• Foundation and empirical basis of scientific studies of natural languages
• The only reliable source for language-specific features
• Infrastructure for knowledge representation and knowledge engineering
• Essential to preserve linguistic and cultural diversity
Tools
• Needed to ‘process’ language resources
• General enough for multilingual processing and cross-lingual comparison
• Robust enough to deal with language-specific issues
Chinese Language Processing as a Mirror
For the development of Asian Language Processing
• Unlike Japanese, which has enjoyed being one of the leaders in technological innovation
• The development of Chinese language processing coincided with the developing economies of Taiwan and China
• Especially the availability of Chinese-language PCs
• Similar to the situation of many Asian languages now
CLP in the past 10 years
A review of what happened in the past ten years in
Chinese Language Processing (1992-2002)
from a somewhat personal perspective
1992 – Corpora
Completion of the first Chinese corpus for linguistic research (Huang and Chen, COLING ’92, 1214-1217)
- untagged, non-segmented
- but searchable
CLP 1992 –1993
1992 – Segmentation Standard
Announcement of the first national standard for word segmentation by the PRC government:
GB 13715, "Contemporary Chinese Word Segmentation Specification for Information Processing" (《信息處理用現代漢語分詞規範》).
1993 –Lexicon
Completion and Release of the first version of CKIP
lexicon (with the category set and ICG thematic roles)
First version of K. Chen’s parser for Chinese
CLP Corpus 1994 – 1995
1994
Tenth anniversary of the automation of Chinese historical text databases.
Completion of the pre-Qin Classical Chinese corpus at Academia Sinica.
1995
Completion of Sinica Corpus (v. 1.0, 1 million words), the first balanced and tagged Chinese corpus.
CLP 1996
–Research Institutes
10th Anniversary of the Institute of Computational Linguistics at
Peking University
10th Anniversary of the Chinese Knowledge Information Processing
Group at Academia Sinica
–Anthology of Papers
Readings in Chinese Natural Language Processing (Journal of
Chinese Linguistics Monograph)
Editors: Huang, Chen, and T’sou
CLP 1996 November-1997
Sinica Corpus on Web
One of the first fully searchable language corpora on the WWW
http://www.sinica.edu.tw/ftms-bin/kiwi.sh (old webpage in web archives)
http://www.sinica.edu.tw/SinicaCorpus/ (current page)
1997
Publication of the first Chinese dictionary compiled directly from a
corpus (Huang et al.’s Mandarin Daily Classifier Dictionary and
Noun-Classifier Collocation Dictionary)
The Tenth Annual ROCLING conference
CLP 1998
–KnowledgeNet
Release of HowNet, the first full-fledged Chinese and
English-Chinese LKB
http://www.keenage.com/
-Segmentation Standard
Official announcement of CNS14366 for Taiwan
CLP 2000 –Treebanks
Simultaneous completion and announcement of two
Chinese Treebanks:
*Penn Chinese Treebank
*Sinica Treebank
ACL Workshop on Chinese Language Processing
CLP 2001-2002
2001 –Society
Formal approval of the formation of
ACL SigHAN, the first international organization on Chinese
Language Processing
2002
First SigHAN workshop on Chinese Language Processing
Formal launch of Hsieh’s Intelligent Character Encoding System (a
sustainable solution to the missing character problem)
COLING2002 in Taipei
CLP 2003
2003
• The First International Chinese Word Segmentation Bakeoff
http://www.sighan.org/bakeoff2003/
2002-2005
• Chinese Proposition Bank
http://www.cis.upenn.edu/~chinese/cpb/
2003, 2005, 2007
• Chinese Gigaword Corpus v.1, v.2, and tagged version
What Did CLP Development Show?
• Resources lead
• When tools and standards complete a comprehensive infrastructure
• Research will bloom
Resources Development
• Towards a Sharable and Sustainable Model of Resources Development
OLAC
Open Language Archives Community
http://www.language-archives.org
OLAC Aims
OLAC, the Open Language Archives Community, is an
international partnership of institutions and individuals
who are creating a worldwide virtual library of
language resources by:
• developing consensus on best current practice for the
digital archiving of language resources;
• developing a network of interoperating repositories and
services for housing and accessing such resources.
OLAC Organization
Coordinators: Steven Bird & Gary Simons
Council: Anthony Aristar (Linguist List), Christopher
Cieri (LDC), Gary Holton (Alaska Native Language
Center), Chu-Ren Huang (Academia Sinica), Heidi
Johnson (Archive of the Indigenous Languages of
Latin America), Laurent Romary (ATILF, University of
Nancy), Joan Spanne (SIL), Martin Wynne (Oxford
Text Archive)
Participating Archives & Services: 39 archives including
LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL,
APS, Utrecht, Academia Sinica, TalkBank, Rosetta, MPI
Individual Members: ~120
Types of Language Resource
DATA: any information which documents or
describes a language, such as a:
• monograph, data file, shoebox of index cards,
unanalyzed recordings, heavily annotated texts,
complete descriptive grammar
TOOLS: computational resources that facilitate
creating, viewing, querying, or otherwise using
language data
• includes fonts, stylesheets, DTDs, Schemas
ADVICE: any information about:
• reliable data sources, appropriate tools and
practices
The Gap
Coordinated Approach
OLAC
OAI
"A shared architectural vision, having many
components, and implemented in stages by
the community, will bridge the gap"
Analogies: federated databases; semantic web
OLAC
[Figure: OLAC architecture diagram. Layers: User Services, OLAC Services, OLAC Repositories. Columns: Initiatives, Standards, Recommendations, Software. Other labels: OAI, DC, OLAC, OAI MHP, OLAC MS, Metadata, Proc, Create, Content, Convert, Format, Export, Deliver.]
The Foundation: 3 initiatives
Dublin Core Metadata Initiative (DC)
• founded in 1995 (Dublin, Ohio)
• conventions for resource discovery on the web
Open Archives Initiative (OAI)
• founded in 1999 (Santa Fe)
• interoperability of e-print services
Open Language Archives Community (OLAC)
• founded in 2000 (Philadelphia)
• a partnership of institutions and individuals
• creating a worldwide virtual library of language resources
Foundation 1: DC Elements
15 metadata elements:
• broad interdisciplinary consensus
• each element is optional and repeatable
• applies to digital and traditional formats
• Title, Creator, Subject, Description, Publisher,
Contributor, Date, Type, Format, Identifier, Source,
Language, Relation, Coverage, Rights.
dublincore.org
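As a rough illustration of "optional and repeatable", a record can be modelled as a mapping from element name to a list of values. This is only a sketch in Python: the helper name `make_dc_record` and the sample values are assumptions made for illustration, not part of the Dublin Core specification.

```python
# The 15 Dublin Core elements listed on the slide.
DC_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier", "Source",
    "Language", "Relation", "Coverage", "Rights",
}


def make_dc_record(**fields):
    """Build a record as {element: [values]} (hypothetical helper).

    Every element is optional (it may be absent) and repeatable
    (it may carry several values).
    """
    record = {}
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        values = value if isinstance(value, list) else [value]
        record[name] = [str(v) for v in values]
    return record


# Sample values are placeholders chosen for this sketch.
record = make_dc_record(
    Title="Sinica Corpus",
    Creator=["Creator A", "Creator B"],  # repeated element
    Language="zh",
)
print(record)
```

Elements that are not supplied, such as Publisher here, are simply absent from the record.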
Foundation 1: DC Qualifiers
Encoding Schemes:
• a controlled vocabulary or notation used to express the
value of an element
• helps a client system to interpret the element content
• e.g. Language = "en" (not "English", "Anglais", ...)
Refinements:
• makes the meaning of an element more specific
• e.g. Subject.language, Type.linguistic
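The two kinds of qualifier can be sketched in a few lines of Python. The vocabulary below is a tiny illustrative subset of two-letter language codes, and the function names `encode_language` and `split_refinement` are hypothetical.

```python
# Tiny illustrative subset of a controlled vocabulary (two-letter codes).
LANGUAGE_SCHEME = {"en": "English", "zh": "Chinese", "tl": "Tagalog"}


def encode_language(value):
    """Accept only values drawn from the encoding scheme (hypothetical helper)."""
    if value not in LANGUAGE_SCHEME:
        raise ValueError(f"{value!r} is not a code in the scheme")
    return value


def split_refinement(element):
    """Split a refined element name such as 'Subject.language'."""
    base, _, refinement = element.partition(".")
    return base, (refinement or None)


print(encode_language("en"))
print(split_refinement("Subject.language"))
print(split_refinement("Title"))
```

A client that sees "en" can interpret it without guessing; free-text values such as "English" or "Anglais" are rejected by `encode_language`.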
Foundation 2: OAI Repository
Foundation 2: OAI Standards
To implement the OAI infrastructure, an archive must
comply with two standards:
1. The OAI Shared Metadata Set
• Dublin Core
• interoperability across all repositories
2. The OAI Metadata Harvesting Protocol
• HTTP requests - 6 verbs: Identify, ListIdentifiers, ListMetadataFormats, ListSets, ListRecords, GetRecord
• XML responses
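A harvesting request is an ordinary HTTP URL carrying a `verb` parameter, and the reply is XML. The Python sketch below builds such a URL and parses a hand-written, simplified response without touching the network. The base URL and the identifiers are hypothetical; the `metadataPrefix` argument and the XML namespace follow version 2.0 of the protocol, which may differ in detail from the version current when these slides were written.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# The six verbs named on the slide.
OAI_VERBS = {
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
}


def oai_request_url(base_url, verb, **args):
    """Build a harvesting request URL (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown verb: {verb}")
    return f"{base_url}?{urlencode({'verb': verb, **args})}"


BASE_URL = "http://archive.example.org/oai"  # hypothetical repository
url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)

# Simplified, hand-written response used only for this sketch.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:archive.example.org:item-1</identifier></header>
    <header><identifier>oai:archive.example.org:item-2</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
root = ET.fromstring(SAMPLE_RESPONSE)
identifiers = [e.text for e in root.iterfind(".//oai:identifier", NS)]
print(identifiers)
```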
Foundation 2: OAI Service Providers and
Data Providers
Foundation 3: OLAC & OAI
Recall: OAI data providers must support:
• Dublin Core Metadata
• OAI Metadata harvesting protocol
BUT: OAI data providers can support:
• a more specialized metadata format
• a more specialized harvesting protocol
What OLAC does:
• specialized metadata for language resources
• specialized harvesting (extra validation)
OLAC Standards
Aside:
• standards = the protocols and interfaces that allow the
community to function
• recommendations = "standards" for representing
linguistic content
OLAC has three primary standards:
• OLACMS: the OLAC Metadata Set (Qualified DC)
• OLAC MHP: refinements to the OAI protocol
• OLAC Process: a procedure for identifying Best
Common Practice Recommendations
The OLAC Metadata Set
The three categories of metadata:
• Work language: describes information entities and their intellectual attributes
- e.g. names of works and their creators
• Document language: describes and provides access to the physical manifestation of information
- e.g. format, publisher, date, rights
• Subject language: describes what a document is about
- e.g. subject, description
OLACMS and Controlled Vocabularies
Language:
A language of the intellectual content of the resource
(OLAC-Language)
Subject.language:
A language which the content of the resource describes or
discusses (OLAC-Language)
OLAC-Language:
A vocabulary for identifying the language(s) that the data
is in, or that a piece of linguistic description is about, or
that a particular tool can process
Summary: With the software in place, we have a complete platform
[Figure: OLAC architecture diagram. Columns: Initiatives, Standards, Recommendations, Software. Other labels: OAI, DC, OLAC, OAI MHP, OLAC MS, Metadata, Proc, Create, Content, Convert, Format, Export, Deliver.]
Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources
[Figure: OLAC architecture diagram with the OLAC Repositories layer added. Columns: Initiatives, Standards, Recommendations, Software. Other labels: OAI, DC, OLAC, OAI MHP, OLAC MS, Metadata, Proc, Create, Content, Convert, Format, Export, Deliver.]
OLAC
[Figure: complete OLAC architecture diagram. Layers: User Services, OLAC Services, OLAC Repositories. Columns: Initiatives, Standards, Recommendations, Software. Other labels: OAI, DC, OLAC, OAI MHP, OLAC MS, Metadata, Proc, Create, Content, Convert, Format, Export, Deliver.]
Acknowledgements: ISLE and TalkBank projects (NSF), participants of the Philadelphia workshop, Eva Banik (programmer), Hernando de Soto (the analogy)
OLACMS helps archive versatility
Given a shared metadata standard:
• New language archives can be created on the fly by harvesting existing archives
• Rich information can be inferred by establishing temporal and geographic anchors for each document
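One way to picture both points is the Python sketch below: records harvested from several archives are filtered into a new "virtual" archive, then indexed by a temporal or geographic element. All records, archive names and place labels are invented for illustration, and the helper names are hypothetical; the language codes reuse the Ethnologue-style codes that appear later in these slides.

```python
from collections import defaultdict

# Invented records standing in for metadata harvested from two archives.
HARVESTED = {
    "archive-a": [
        {"Identifier": "a-1", "Language": "ALV", "Date": "1998", "Coverage": "place-1"},
        {"Identifier": "a-2", "Language": "en", "Date": "2001", "Coverage": "place-2"},
    ],
    "archive-b": [
        {"Identifier": "b-1", "Language": "AIS", "Date": "1998", "Coverage": "place-2"},
    ],
}


def virtual_archive(harvested, language_codes):
    """Collect matching records from every archive (hypothetical helper)."""
    return [
        dict(record, Archive=archive)
        for archive, records in harvested.items()
        for record in records
        if record.get("Language") in language_codes
    ]


def by_anchor(records, element):
    """Index record identifiers by a temporal or geographic element."""
    index = defaultdict(list)
    for record in records:
        index[record.get(element)].append(record["Identifier"])
    return dict(index)


amis = virtual_archive(HARVESTED, {"ALV", "AIS"})
print(by_anchor(amis, "Date"))
print(by_anchor(amis, "Coverage"))
```

This only works because every archive describes its holdings with the same element names, which is the point of a shared metadata standard.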
OLAC Infrastructure
Helps to Solve Language Archive Problems
such as
Language Identification
and
Metadata Set for Multi-lingual Language
Archives
The Language Identification Problem
The DC code (e.g. ‘en’ for English) is not enough to describe all the languages in the world
Ethnologue (http://www.ethnologue.org) is comprehensive but not complete
Potential problems of using Ethnologue (or any existing language list):
• over-splitting
• over-chunking
• omission
A Fundamental Solution to Language
Identification Problems
Registering language groups with an OLAC registration service
An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes)
AS:Amis = {ALV, AIS}
ALV = Amis, AIS = Nataoran
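An extensional definition is simply a named set of codes, so the registry idea can be sketched in Python as below. The group and codes are the ones on the slide; the function names and the in-memory dictionary are assumptions, since the slides do not describe how the proposed server would be implemented.

```python
# Codes and names taken from the slide.
CODE_NAMES = {"ALV": "Amis", "AIS": "Nataoran"}

# Registry of user-defined groups: group name -> set of codes.
LANGUAGE_GROUPS = {"AS:Amis": {"ALV", "AIS"}}


def register_group(name, codes):
    """Register a group by listing its member codes (hypothetical helper)."""
    unknown = set(codes) - set(CODE_NAMES)
    if unknown:
        raise ValueError(f"unknown codes: {sorted(unknown)}")
    LANGUAGE_GROUPS[name] = set(codes)


def expand(group):
    """Return a copy of the set of codes that defines a group."""
    return set(LANGUAGE_GROUPS[group])


def in_group(code, group):
    """Check whether a resource tagged with `code` belongs to `group`."""
    return code in LANGUAGE_GROUPS[group]


print(sorted(expand("AS:Amis")))
print(in_group("AIS", "AS:Amis"))
```

A search for "AS:Amis" can then be expanded into a search over both member codes, whichever way an individual archive chose to split or lump the varieties.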
Describing Multi-Lingual Resources in OLACMS
• Directionality is crucial in multilingual resources
• However, OLAC metadata is flat and unordered
Bi-directional MT
<Language code="X"/>
<Language code="Y"/>
<Subject.language code="X"/>
<Subject.language code="Y"/>
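Why flatness matters can be shown with a small Python sketch. It assumes the convention that the source language goes in Language and the target language in Subject.language, and it models a record as an unordered set of (element, code) pairs; the `describe` helper is hypothetical.

```python
def describe(directions):
    """Flatten (source, target) pairs into an unordered set of elements."""
    elements = set()
    for source, target in directions:
        elements.add(("Language", source))
        elements.add(("Subject.language", target))
    return frozenset(elements)


# A bi-directional MT resource: X -> Y and Y -> X.
bidirectional_mt = describe([("X", "Y"), ("Y", "X")])

# A different resource whose two parts each stay within one language.
two_monolingual = describe([("X", "X"), ("Y", "Y")])

print(sorted(bidirectional_mt))
print(bidirectional_mt == two_monolingual)
```

Both resources flatten to the same four elements, so the pairing of source and target cannot be recovered from the record alone.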
Multi-lingual Resources II
Text: language
Bitext (bilingual aligned corpus)
• There is always a directionality
• Original: language
• Translation: Subject.language
Language Description (Field Notes)
• Elicitation, transcription, translation, notes
Multiple related resources
Language Archives Project of Taiwan
• Part of the National Digital Archives Project (NDAP)
• Pilot Stage: 2000-2001
• First Phase: 2002-2006
• Both Language Archives
• And Linguistic Anchor
Language and Digital Archives
[Figure: diagram linking language and digital archives. Labels: When, Where, How and What, Historical Maps, Digital Archives, Language, Language Changes, Language Variations.]
Digital Archives are
Linguistically Anchored
• Archives are anchored with a Lexical Knowledge Base (LKB)
- because the LKB, as the collection of lexical types instantiated in an archive, uniquely defines each archive
- and each lexical item is the conceptual atom projecting knowledge from archive to archive
Multi-anchor Knowledge Linking
• Geographical anchor based on GIS (geographic information system)
- Ecology (fauna, weather, geology, etc.)
- Socio-anthropological classification
• Linguistic anchor based on LKB
- etymology, language grouping, loan words
Institute of Linguistics
Language Archives
Two branch projects:
1. Chinese Archives -- 5 sub-projects:
• Early Mandarin Chinese Lexicon
• Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
• Modern Chinese Corpus and Treebank
• New Age Corpus: Linguistic Representations and Archives of Multimedia Data
• Southern-Min Archive: A Database of Historical Change in Language Distribution
2. Formosan Language Archives
Early Mandarin Chinese Lexicon
GOAL:
1. Collect the corpus and the lexicon of the Early Mandarin Chinese period.
2. Provide a systematic knowledge thesaurus as well as a powerful instrument for the study of grammatical development.
Archives Description:
1. Digitization of texts (10,000,000 characters).
2. Tagging of grammatical markers (3,500,000 characters).
3. Construction of the lexical database.
• http://www.sinica.edu.tw/Early_Mandarin
Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
Archives Description:
• To digitize the bronze inscriptions from the Shang to the Eastern Chou dynasties.
• To construct a typological lexicon of bronze inscriptions and bamboo scripts, with accurate encoding and analysis of the bronze inscriptions and Chu scripts.
Achievement:
• Proofread bronze inscriptions (12,113 pieces of bronze inscriptions).
http://Inscription.sinica.edu.tw
Modern Chinese Corpus and Treebank
Achievement:
• Segmented words tagged with their part of speech (10-million-word version in 2006).
• Syntactic tree structures: 30,000.
• http://www.sinica.edu.tw/SinicaCorpus
• http://treebank.sinica.edu.tw
Treebank
New Age Corpus: Linguistic Representations and
Archives of Multimedia Data
Archives Description:
1. A multimodal corpus of spoken Mandarin in
Taiwan.
2. By means of different designs of tasks and
scenarios.
3. Combining data format of written transcripts
with digital technology of video and audio
processing.
New Age Corpus: Linguistic Representations
and Archives of Multimedia Data
Achievement:
• Transcribed and transformed 11 hours of digital data.
• Tagged 5 hours of speech data.
• http://mmc.sinica.edu.tw
Southern-Min Archive: A Database of
Historical Change in Language Distribution
Archives Description:
1. From the perspectives of historical change and
geographical distribution.
2. A tagged corpus of Southern Min written documents
from 16th century to 20th century.
3. A linguistic Geographic Information System
displaying the distribution of languages in Hsinfeng.
Formosan Language Archives
Archives Description:
1. Preserve the endangered Formosan Austronesian languages
1.1 corpora, lexicons and grammars
1.2 integration of linguistic information with GIS.
2. Fifteen extant Formosan languages
2.1 Rukai, Yami, Saisiyat, Tsou, Atayal, Bunun, Paiwan, Amis and Puyuma
• http://formosan.sinica.edu.tw/
Sinica BOW: Bilingual Ontological Wordnet
• To construct a Chinese WordNet as the linguistic ontology for knowledge representation;
• To provide linguistic anchoring grounded with temporal information by building a synchronic lexicon for all historical periods; and
• To provide linguistic anchoring reference and implementation services.
Asian Language Resources Committee
Mailing list: alr@cl.cs.titech.ac.jp
• Affiliated with AFNLP
• Cataloguing Asian Language Resources
• Will adopt OLACMS and search engine
• Hosting ALR Workshops (5 so far)
• Asian Language Processing Special Issues in Language Resources and Evaluation
• Co-Chairs: Tokunaga (take@cl.cs.titech.ac.jp), Huang (churen@sinica.edu.tw)
http://www.cl.cs.titech.ac.jp/alr/
An overview of the
Natural Language Toolkit
http://nltk.sourceforge.net
Project Leaders: Steven Bird, Edward Loper, Ewan Klein
Acknowledgement: I would like to thank Steven Bird for agreeing to let me use these
slides on NLTK
Summary
• NLTK is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing
• Download NLTK from nltk.sourceforge.net
• A truly multilingual toolkit accessible to beginning researchers in NLP
• A good way to attract international scholars to research on your language
• Also a good stepping stone for a developing HLT language to test a full range of NLP applications
Components of NLTK
1. Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, wordnet, ... (50k lines of code)
2. Corpora: 20+ annotated data sets widely used in natural language processing (300 MB of data)
3. Documentation: a 360-page book, articles, reviews, API documentation
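To give a feel for the code component, here is a minimal Python sketch that tokenizes a sentence and counts word frequencies. It uses `RegexpTokenizer` and `FreqDist` as found in current NLTK releases, whose API may differ from the 2007 version described in these slides. The fallback to the standard library is an assumption added so the sketch runs even where NLTK is not installed, and the sample sentence is invented.

```python
import re
from collections import Counter

try:
    from nltk import FreqDist
    from nltk.tokenize import RegexpTokenizer
    HAVE_NLTK = True
except ImportError:  # fall back to the standard library
    HAVE_NLTK = False

text = ("Language resources lead; tools and standards follow. "
        "Resources lead research.")

if HAVE_NLTK:
    # Split on runs of word characters, then count the tokens.
    tokens = RegexpTokenizer(r"\w+").tokenize(text.lower())
    freq = FreqDist(tokens)
else:
    # Same tokenization and counting without NLTK.
    tokens = re.findall(r"\w+", text.lower())
    freq = Counter(tokens)

print(tokens)
print(freq.most_common(2))
```

The same two steps, tokenizing and counting, apply to text in any language for which a suitable tokenization pattern can be written.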
1. Code
• corpus readers
• tokenizers
• stemmers
• taggers
• parsers
• wordnet
• semantic interpretation
• clusterers
• evaluation metrics
• …
2. Corpora
• Brown Corpus
• Carnegie Mellon Pronouncing Dictionary
• CoNLL 2000 Chunking Corpus
• Project Gutenberg Selections
• NIST 1999 Information Extraction: Entity Recognition Corpus
• US Presidential Inaugural Address Corpus
• Indian Language POS-Tagged Corpus
• Prepositional Phrase Attachment Corpus
• SENSEVAL 2 Corpus
• Sinica Treebank Corpus Sample
• Universal Declaration of Human Rights Corpus
• Stopwords Corpus
• TIMIT Corpus Sample
• Treebank Corpus Sample
• …
3. Documentation
• a 360-page book about natural language processing in Python and NLTK
• teaches Python and NLP
• provides numerous examples and exercises
• installation instructions
• presentation slides for some of the book chapters
• API Documentation: describes every module, interface, class, and method
Parser demonstrations
Interactive session (WordNet)
Adoption in NLP courses
Amsterdam, Ben-Gurion, Brown, Bryn Mawr,
CDAC-Mumbai, Coruña, Edinburgh, Erlangen,
Georgetown, Helsinki, IIT-Bombay, Iowa State,
Konstanz, MIT, Macquarie, Magdeburg, Malta,
Marquette, Melbourne, Nancy, Naval
Postgraduate School, Northeastern, Ohio
State, Pitt, San Diego State, Simon Fraser,
Stanford, Syracuse University, Tsuda College,
U Colorado, UC Berkeley, UMass Amherst,
UNAM, U Penn, UT Austin, Warsaw
Contribute…
• NLTK is an open source project
• all code, data, and documentation are free
• dozens of people have contributed over the past 6 years
• please visit the website for project ideas
• sign up on the NLTK-Announce mailing list to hear about new releases