
Digital Talent Scholarship AI NLTK

DIGITAL TALENT
SCHOLARSHIP
2019
NLP with NLTK
Dr. Indriana Hidayah, S.T., M.T.
digitalent.kominfo.go.id
Digital Talent Scholarship
Introduction to Text mining
Dept. Teknik Elektro dan Teknologi Informasi
Fakultas Teknik UGM
Text Mining
Text mining extracts useful and important information from documents in
various formats, such as web pages, emails, social media posts, and
journal articles.
Examples
Text/document classification
Text summarization
Sentiment analysis
and others
Text Mining Process
NATURAL LANGUAGE PROCESSING
Natural Language Processing (NLP)
• Natural Language Understanding (NLU)
– Understanding the written/spoken language
• Natural Language Generation
– Machine expresses human language naturally
NLP Applications
• Machine Translation
• Extracting data from text
– Converting unstructured text into structured data
• Spoken language control systems
• Spelling and grammar checkers
Natural language understanding
• Raw speech signal
• Speech recognition
• Sequence of words spoken
• Syntactic analysis using knowledge of the grammar
• Structure of the sentence
• Semantic analysis using information about the meaning of words
• Partial representation of the meaning of the sentence
• Pragmatic analysis using information about context
• Final representation of meaning of sentence
Speech Recognition
[Figure: a microphone records the voice as an analog signal over time, which is converted into a frequency spectrogram (Hz), e.g. by a Fourier transform]
Syntactic Analysis
• Rules of syntax (grammar) specify the possible organization of
words in sentences and allow us to determine a sentence’s
structure(s)
– “John saw Mary with a telescope”
• John saw (Mary with a telescope)
• John (saw Mary with a telescope)
• Parsing: given a sentence and a grammar
– Checks that the sentence is correct according to the grammar and, if
so, returns a parse tree representing the structure of the sentence
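The ambiguity of “John saw Mary with a telescope” can be reproduced with a small parser. This is a minimal sketch: the toy grammar below is an assumption written for this one sentence (it is not from the slides), and it uses NLTK's `CFG.fromstring` and `ChartParser`, which need no extra data downloads.

```python
import nltk

# Hypothetical toy grammar: a prepositional phrase (PP) may attach
# either to a noun phrase (NP) or to a verb phrase (VP).
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'John' | 'Mary' | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'a'
N -> 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
tokens = "John saw Mary with a telescope".split()

# Each tree is one possible structure of the sentence.
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

The parser returns one tree per reading: in one, “with a telescope” belongs to “Mary”; in the other, it belongs to the seeing.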
Syntactic Analysis - Grammar
• sentence -> noun_phrase, verb_phrase
• noun_phrase -> proper_noun
• noun_phrase -> determiner, noun
• verb_phrase -> verb, noun_phrase
• proper_noun -> [mary]
• noun -> [apple]
• verb -> [ate]
• determiner -> [the]
Syntactic Analysis - Parsing
sentence
    noun_phrase
        proper_noun: “Mary”
    verb_phrase
        verb: “ate”
        noun_phrase
            determiner: “the”
            noun: “apple”
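The parse of “Mary ate the apple” can be reproduced in NLTK. This is a minimal sketch that rewrites the grammar rules from the previous slide in NLTK's own CFG notation (an assumption about how one might encode them, with lowercase terminals as in the rules) and parses the sentence with `ChartParser`.

```python
import nltk

# The slide grammar, rewritten in NLTK's CFG syntax.
# The left-hand side of the first rule is the start symbol.
grammar = nltk.CFG.fromstring("""
sentence -> noun_phrase verb_phrase
noun_phrase -> proper_noun | determiner noun
verb_phrase -> verb noun_phrase
proper_noun -> 'mary'
noun -> 'apple'
verb -> 'ate'
determiner -> 'the'
""")

parser = nltk.ChartParser(grammar)
tokens = ["mary", "ate", "the", "apple"]

# This grammar is unambiguous for the sentence, so one tree is expected.
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```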
Semantic Analysis
• Generates (partial) meaning/representation of the
sentence from its syntactic structure(s)
• Compositional semantics: meaning of the sentence
from the meaning of its parts:
– Sentence: A tall man likes Mary
– Representation: ∃x man(x) & tall(x) & likes(x,mary)
• Grammar + Semantics
– sentence(Smeaning) ->
noun_phrase(NPmeaning), verb_phrase(VPmeaning),
combine(NPmeaning,VPmeaning,Smeaning)
NATURAL LANGUAGE PROCESSING
USING NLTK
NLTK (natural language toolkit)
Getting Started with NLTK
• Open Anaconda Prompt and navigate to your working directory
• Run Jupyter Notebook:
>> jupyter notebook
– NOTE: To check that NLTK is installed on your system, run the
following in a notebook cell:
>> import nltk
If this does not raise an error, NLTK is installed on your
computer
Getting Started with NLTK
Take a closer look at an example NLTK corpus
Print the first 500 words of the corpus text as a paragraph
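A minimal sketch of this step follows. The slide does not name the corpus, so the Gutenberg corpus and the file `shakespeare-hamlet.txt` are assumptions, and `first_words` is a hypothetical helper. The corpus has to be downloaded once with `nltk.download('gutenberg')`; the helper returns None when it is missing.

```python
from nltk.corpus import gutenberg

def first_words(fileid, n=500):
    """Return the first n tokens of a Gutenberg text joined into one
    string, or None if the corpus data has not been downloaded."""
    try:
        words = gutenberg.words(fileid)[:n]
        return " ".join(words)
    except LookupError:
        return None

# Assumed example file; any id from gutenberg.fileids() would do.
paragraph = first_words("shakespeare-hamlet.txt")
if paragraph is None:
    print("Corpus not found. Run nltk.download('gutenberg') first.")
else:
    print(paragraph)
```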
Let’s start with Tokenization
What is Tokenization?
“Given a character sequence and a defined
document unit, tokenization is the task of chopping
it up into pieces, called tokens, perhaps at the same
time throwing away certain characters, such as
punctuation.” – NLP Stanford
Example:
• Sentence tokenization:
– I ate Bubur Ayam this morning. That was the most
delicious Bubur Ayam I have ever had.
Example 1: Sentence Tokenization
• The input document:
I ate Bubur Ayam this morning. That was the
most delicious Bubur Ayam I have ever had.
• The output after sentence tokenization
1st sentence: I ate Bubur Ayam this morning.
2nd sentence: That was the most delicious Bubur
Ayam I have ever had.
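Sentence tokenization of this input can be sketched as follows. The usual NLTK entry point is `sent_tokenize`, which loads a pretrained Punkt model that must be downloaded first. To stay self-contained, this sketch instead uses an untrained `PunktSentenceTokenizer` with default parameters, which is an assumption that only holds for simple text without abbreviations such as this one.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = ("I ate Bubur Ayam this morning. "
        "That was the most delicious Bubur Ayam I have ever had.")

# Untrained tokenizer: knows no abbreviations, so a period followed by
# whitespace is treated as a sentence boundary.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)

for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```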
Example 2: Word Tokenization
• The input document:
I ate Bubur Ayam this morning. That was the most delicious Bubur
Ayam I have ever had.
• The output after word tokenization
• I
• ate
• Bubur
• Ayam
• this
• morning
• .
• That
• was
• the
• most
• delicious
• Bubur
• Ayam
• I
• have
• ever
• had
• .
Now, how do we do tokenization
using NLTK?
Tokenization with NLTK
• Take this paragraph as an example and store it in a variable named “AI”
>> AI="""According to the father of Artificial
Intelligence, John McCarthy, it is “The science
and engineering of making intelligent machines,
especially intelligent computer programs”.
Artificial Intelligence is a way of making a
computer, a computer-controlled robot, or a
software think intelligently, in the similar
manner the intelligent humans think.
AI is accomplished by studying how human brain
thinks, and how humans learn, decide, and work
while trying to solve a problem, and then using
the outcomes of this study as a basis of
developing intelligent software and systems."""
Tokenization with NLTK
• Check the variable type
>> type(AI)
• Word tokenizer using NLTK
>> from nltk import word_tokenize
>> AI_tokens = word_tokenize(AI)
>> print(AI_tokens)
• Checking the length of AI_tokens
>> len(AI_tokens)
Tokenization with NLTK
• Import blankline_tokenize
• Split the text into paragraphs at blank lines
• Show the number of paragraphs
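These three steps can be sketched as below. The sample text is a made-up stand-in (an assumption) containing one blank line; `blankline_tokenize` is regex-based and needs no extra data.

```python
from nltk.tokenize import blankline_tokenize

# Hypothetical sample text: two paragraphs separated by a blank line.
text = ("First paragraph, line one.\n"
        "Still the first paragraph.\n"
        "\n"
        "Second paragraph.")

# Split wherever a blank line occurs.
paragraphs = blankline_tokenize(text)

print(len(paragraphs))
for paragraph in paragraphs:
    print(repr(paragraph))
```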
Tokenization with NLTK
Important parts of tokenization:
● Bigram: a token consisting of two consecutive words
● Trigram: a token consisting of three consecutive words
● N-gram: a token consisting of n consecutive words
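Bigrams, trigrams, and general n-grams can be built from a token list with the helpers in `nltk.util`. A minimal sketch follows; the token list is taken from the earlier example sentence and n = 4 is an arbitrary choice.

```python
from nltk.util import bigrams, trigrams, ngrams

tokens = ["I", "ate", "Bubur", "Ayam", "this", "morning"]

# Each helper returns a generator of tuples, so wrap it in list().
token_bigrams = list(bigrams(tokens))
token_trigrams = list(trigrams(tokens))
token_fourgrams = list(ngrams(tokens, 4))

print(token_bigrams)
print(token_trigrams)
print(token_fourgrams)
```

A list of k tokens yields k - n + 1 n-grams, so the six tokens here give five bigrams, four trigrams, and three 4-grams.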
Now, we will move on to Stemming
Stemming
● Normalizes words into their base or root form.
● Example:
● The words:
○ Affectation
○ Affects
○ Affections
○ Affected
○ Affection
○ Affecting
● all originate from a single root word:
○ Affect
Stemming with NLTK
● Other stemmers in NLTK:
○ LancasterStemmer is more aggressive than PorterStemmer.
○ It is usually used for counting word occurrences.
○ SnowballStemmer requires you to specify the language being used.
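The three stemmers can be compared side by side on the “affect” word family. This is a minimal sketch; the choice of "english" for SnowballStemmer is an assumption matching the example words. None of these stemmers needs extra data.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["affectation", "affects", "affections",
         "affected", "affection", "affecting"]

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")  # the language must be given

# Keep the Porter results in a dict for later inspection.
porter_stems = {word: porter.stem(word) for word in words}

# Print one row per word: original, Porter, Lancaster, Snowball.
for word in words:
    print(word, porter_stems[word],
          lancaster.stem(word), snowball.stem(word))
```

Note that a stem is not guaranteed to be a dictionary word; the three algorithms can also disagree on the same input.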
Lemmatization
(reducing words in a given language to their base form by recognizing the
function of each word in the sentence)
• Lemmatization relies on morphological analysis of words, so it needs
a detailed dictionary that the algorithm can look through to link each
form back to its original or root word, also known as the lemma.
• Key points of lemmatization:
– Groups together the different inflected forms of a word under one base form, called the lemma.
– Somewhat similar to stemming, as it maps several words to one
common root.
– The output of lemmatization is a proper word.
• For example:
• A lemmatizer should map gone, going, and went to go.
Lemmatization with NLTK
• Import wordnet and WordNetLemmatizer
• Lemmatization with WordNetLemmatizer.
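A minimal sketch of lemmatization with `WordNetLemmatizer` follows. It assumes the WordNet data has been downloaded once with `nltk.download('wordnet')`; the hypothetical helper `lemmatize_verbs` returns None when the data is missing. The `pos="v"` argument tells the lemmatizer to treat each word as a verb; without it, words are treated as nouns by default.

```python
from nltk.stem import WordNetLemmatizer

def lemmatize_verbs(words):
    """Lemmatize each word as a verb.
    Returns None if the WordNet data is not installed."""
    try:
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(word, pos="v") for word in words]
    except LookupError:
        return None

lemmas = lemmatize_verbs(["gone", "going", "went"])
if lemmas is None:
    print("WordNet not found. Run nltk.download('wordnet') first.")
else:
    print(lemmas)
```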
Stopwords
○ Stop words are words such as “i, me, my, myself, above, below”, etc.,
in English or other languages.
○ They are very useful in forming sentences; without them a
sentence would not make sense.
○ They usually dominate documents and appear everywhere.
○ But they do not carry significant meaning for the NLP process,
so they are usually removed.
○ NLTK has a list of stop words that we can import and use for
stop word removal.
StopWords Removal with NLTK
● Import NLTK stop words
● List the English stop words
● Check the length of the English stop word list
StopWords Removal with NLTK
● Remove stop words
● Check the number of words after removing the stop words
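Stop word removal can be sketched as below. It assumes the stop word corpus has been downloaded once with `nltk.download('stopwords')`; `remove_stop_words` is a hypothetical helper that returns None when the corpus is missing, and the token list is a shortened, made-up example.

```python
from nltk.corpus import stopwords

def remove_stop_words(tokens, language="english"):
    """Drop tokens that appear in NLTK's stop word list.
    Returns None if the stop word corpus is not installed."""
    try:
        stop_set = set(stopwords.words(language))
    except LookupError:
        return None
    # Compare in lowercase, since the NLTK list is lowercase.
    return [token for token in tokens if token.lower() not in stop_set]

tokens = ["AI", "is", "a", "way", "of", "making",
          "a", "computer", "think", "intelligently"]
filtered = remove_stop_words(tokens)

if filtered is None:
    print("Stop words not found. Run nltk.download('stopwords') first.")
else:
    print(len(tokens), "tokens before,", len(filtered), "after")
    print(filtered)
```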
POS Tags
● The grammatical category of a word, such as verb, noun, adjective,
or adverb, is called its part of speech (POS), and the label assigned
to it is a POS tag.
● It indicates how a word functions in meaning as well as
grammatically within the sentence.
SENTENCE
The/DT wolf/NN killed/VBD the/DT lion/NN
POS Tags
● NLTK POS Tags for part of speech
POS Tags
● Part of speech example 1
● Part of speech example 2
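POS tagging with NLTK can be sketched as follows. `nltk.pos_tag` needs a pretrained tagger model, downloaded once with `nltk.download('averaged_perceptron_tagger')` (the exact resource name can differ between NLTK versions, so treat it as an assumption). `tag_tokens` is a hypothetical helper that returns None when the model is missing, and the tokens are given as a ready-made list so no tokenizer data is required.

```python
import nltk

def tag_tokens(tokens):
    """Return a list of (word, tag) pairs,
    or None if the tagger model is not installed."""
    try:
        return nltk.pos_tag(tokens)
    except LookupError:
        return None

tokens = ["The", "wolf", "killed", "the", "lion"]
tagged = tag_tokens(tokens)

if tagged is None:
    print("Tagger model not found. Download it with nltk.download().")
else:
    for word, tag in tagged:
        print(word, tag)
```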
POS Tags
● Another example:
“Google” something on the internet
● In this case, “Google” is used as a verb.
● The solution is “Named Entity Recognition”, covered in the next chapter.
EXERCISE
• Install nltk
• Then try various sentences such as the following:
– Hello, my name is…
– Would you like a tea?
– Where is your address?
Named Entity Recognition
● It is the process of detecting named entities such as person
names, company names, quantities, or location names.
● The steps of NER:
○ Noun phrase identification.
○ Phrase classification.
○ Entity disambiguation.
Google’s CEO Sundar Pichai introduced the new Pixel 3 at New York Central Mall
[Figure: the sentence annotated with the entity labels Organization, Person, Location, Organization]
Named Entity Recognition
● Import the named entity chunker and the tokenizer
>>> import nltk
>>> from nltk import ne_chunk, word_tokenize
● Define the sentence in which to recognize named entities
>>> NE_sent = "The US President stays in the WHITE HOUSE"
● Tokenization
>>> NE_tokens = word_tokenize(NE_sent)
● Get POS tags
>>> NE_tags = nltk.pos_tag(NE_tokens)
● Chunk the tagged tokens and show the named entities
>>> NE_NER = ne_chunk(NE_tags)
>>> print(NE_NER)
Syntax Tree
A syntax tree is a representation of the syntactic structure of sentences or strings.
Chunking
● Picking up individual pieces of information and grouping them
into bigger pieces.
We/PRP caught/VBD the/DT pink/JJ panther/NN
NP chunks: [We] and [the pink panther]
Chunking
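Chunking can be sketched with NLTK's `RegexpParser`, which groups POS-tagged tokens according to tag patterns. The chunk grammar below is an assumption written for this example: a noun phrase (NP) is either a pronoun, or an optional determiner followed by any number of adjectives and a noun. The tokens are tagged by hand, so no tagger model is needed.

```python
import nltk

# Hand-tagged tokens (word, POS tag) for the example sentence.
tagged = [("We", "PRP"), ("caught", "VBD"), ("the", "DT"),
          ("pink", "JJ"), ("panther", "NN")]

# Hypothetical chunk grammar with two NP rules, applied in order.
grammar = r"""
NP: {<PRP>}
    {<DT>?<JJ>*<NN>}
"""

chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)
print(tree)

# Collect the words of every NP chunk found in the tree.
noun_phrases = [[word for word, tag in subtree.leaves()]
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
print(noun_phrases)
```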
Group Project
PROJECT 1: TEXT SUMMARIZATION
Group Project – General Instruction
1. Form groups of 4 to 5 students.
2. Your group will carry out a text summarization task.
3. Later, the tutor will choose several groups to
present their work and results.
4. At the end of today’s session, each group
must submit the .ipynb file and a one-page
report using the given template.
Group Project – The Task
1. Discuss cases in which text summarization
may help provide a solution. Your group only needs to
pick one case to focus on.
2. Discuss an example of the case from step 1.
This step will help you decide what data
you need.
3. Discuss and pick one text source (one or more documents) as the
input for the summarization task you will perform.
4. Perform the text summarization (preprocessing) using
the techniques you have just learned.
5. Discuss the results: what did you find and learn
from this task?
References
• Mani, Inderjeet. Advances in Automatic Text Summarization. MIT Press, 1999.
• https://nlp.stanford.edu/IRbook/html/htmledition/tokenization-1.html
Digital Talent Scholarship 2019
Pusat Pengembangan Profesi dan Sertifikasi
Badan Penelitian dan Pengembangan SDM
Kementerian Komunikasi dan Informatika
Jl. Medan Merdeka Barat No. 9
(Gd. Belakang Lt. 4 - 5)
Jakarta Pusat, 10110
THANK YOU