DIGITAL TALENT SCHOLARSHIP 2019
NLP with NLTK
Dr. Indriana Hidayah, S.T., M.T.
Dept. Teknik Elektro dan Teknologi Informasi, Fakultas Teknik UGM
digitalent.kominfo.go.id

Introduction to Text Mining

Text Mining
• Text mining extracts useful and important information from documents in various formats, such as web pages, emails, social media posts, and journal articles.
• Examples:
  – Text/document classification
  – Text summarization
  – Sentiment analysis
  – and so on

Text Mining Process

NATURAL LANGUAGE PROCESSING

Natural Language Processing (NLP)
• Natural Language Understanding (NLU)
  – Understanding written or spoken language
• Natural Language Generation
  – The machine expresses human language naturally

NLP Applications
• Machine translation
• Extracting data from text
  – Converting unstructured text to structured data
• Spoken language control systems
• Spelling and grammar checkers

Natural Language Understanding
• Raw speech signal
• Speech recognition
• Sequence of words spoken
• Syntactic analysis using knowledge of the grammar
• Structure of the sentence
• Semantic analysis using information about the meaning of words
• Partial representation of the meaning of the sentence
• Pragmatic analysis using information about context
• Final representation of the meaning of the sentence

Speech Recognition
Input (microphone records voice) → analog signal over time → frequency (Hz) spectrogram (e.g.
Fourier transform)

Syntactic Analysis
• Rules of syntax (grammar) specify the possible organization of words in sentences and allow us to determine a sentence’s structure(s)
  – “John saw Mary with a telescope”
    • John saw (Mary with a telescope)
    • John (saw Mary with a telescope)
• Parsing: given a sentence and a grammar
  – Checks that the sentence is correct according to the grammar and, if so, returns a parse tree representing the structure of the sentence

Syntactic Analysis - Grammar
• sentence -> noun_phrase, verb_phrase
• noun_phrase -> proper_noun
• noun_phrase -> determiner, noun
• verb_phrase -> verb, noun_phrase
• proper_noun -> [mary]
• noun -> [apple]
• verb -> [ate]
• determiner -> [the]

Syntactic Analysis - Parsing
sentence
  noun_phrase
    proper_noun  “Mary”
  verb_phrase
    verb  “ate”
    noun_phrase
      determiner  “the”
      noun  “apple”

Semantic Analysis
• Generates a (partial) meaning/representation of the sentence from its syntactic structure(s)
• Compositional semantics: the meaning of the sentence is built from the meaning of its parts:
  – Sentence: A tall man likes Mary
  – Representation: ∃x man(x) & tall(x) & likes(x,mary)
• Grammar + Semantics
  – sentence(Smeaning) -> noun_phrase(NPmeaning), verb_phrase(VPmeaning), combine(NPmeaning, VPmeaning, Smeaning)

NATURAL LANGUAGE PROCESSING USING NLTK
NLTK (Natural Language Toolkit)

Getting Started with NLTK
• Open Anaconda Prompt and navigate to the directory you are working in
• Run Jupyter Notebook:
>> jupyter notebook
  – NOTE: To make sure NLTK is installed on your system, run the following in a notebook cell:
>> import nltk
    If this does not raise an error, NLTK is installed on your computer.
• Take a closer look at an example of an NLTK corpus
• Print the first 500 words of the corpus text as a paragraph

Let’s start with Tokenization

What is Tokenization?
“Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.” – NLP Stanford

Example 1: Sentence Tokenization
• The input document:
  I ate Bubur Ayam this morning. That was the most delicious Bubur Ayam I have ever had.
• The output after sentence tokenization:
  1st sentence: I ate Bubur Ayam this morning.
  2nd sentence: That was the most delicious Bubur Ayam I have ever had.

Example 2: Word Tokenization
• The input document:
  I ate Bubur Ayam this morning. That was the most delicious Bubur Ayam I have ever had.
• The output after word tokenization:
  • I • ate • Bubur • Ayam • this • morning • . • That • was • the • most • delicious • Bubur • Ayam • I • have • ever • had • .

Now, how do we do tokenization using NLTK?

Tokenization with NLTK
• Take this paragraph as an example and store it in a variable named “AI”
>> AI="""According to the father of Artificial Intelligence, John McCarthy, it is “The science and engineering of making intelligent machines, especially intelligent computer programs”. Artificial Intelligence is a way of making a computer, a computer-controlled robot, or a software think intelligently, in the similar manner the intelligent humans think.
AI is accomplished by studying how human brain thinks, and how humans learn, decide, and work while trying to solve a problem, and then using the outcomes of this study as a basis of developing intelligent software and systems."""

Tokenization with NLTK
• Check the variable type
>> type(AI)
• Word tokenizer using NLTK
>> from nltk import word_tokenize
>> AI_tokens = word_tokenize(AI)
>> print(AI_tokens)
• Check the length of AI_tokens
>> len(AI_tokens)
• Import blankline_tokenize
• Count the paragraphs separated by blank lines
• Show the number of such paragraphs

Important parts of tokenization:
● Bigram: a token consisting of two consecutive words
● Trigram: a token consisting of three consecutive words
● N-gram: a token consisting of n consecutive words

Now, we will move on to Stemming

Stemming
● Normalizes words into their base or root form.
● Example: the words
  ○ Affectation
  ○ Affects
  ○ Affections
  ○ Affected
  ○ Affection
  ○ Affecting
● all originate from a single root word:
  ○ Affect

Stemming with NLTK
● Other stemmers in NLTK:
  ○ LancasterStemmer is more aggressive than PorterStemmer.
  ○ It is usually used for counting word occurrences.
  ○ SnowballStemmer, which requires you to specify the language being used.

Lemmatization
(reducing words in a given language to their base form by recognizing the function of each word in the sentence)
• Lemmatization takes the morphological analysis of words into account, so it needs a detailed dictionary that the algorithm can look through to link a word form back to its original or root word, also known as the lemma.
• Key points of lemmatization:
  – Groups together the different inflected forms of a word under a common form, called the lemma.
  – Somewhat similar to stemming, as it maps several words onto one common root.
  – The output of lemmatization is a proper word.
• For example:
  – A lemmatiser should map gone, going and went to go.
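The contrast between dictionary lookup and suffix stripping can be sketched in plain Python. This is only a toy illustration, not NLTK's implementation: the table LEMMA_TABLE and the helpers lemmatize and crude_stem are hypothetical names introduced here, and the three-entry table stands in for the detailed dictionary a real lemmatizer would consult.

```python
# Toy illustration only: a hand-written lookup table stands in for the
# detailed dictionary a real lemmatizer consults. It is NOT WordNet.
LEMMA_TABLE = {
    "gone": "go",
    "going": "go",
    "went": "go",
}


def lemmatize(word, table=LEMMA_TABLE):
    """Return the lemma of `word`, or the lowercased word if it is unknown."""
    key = word.lower()
    return table.get(key, key)


def crude_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a deliberately naive stemmer."""
    key = word.lower()
    for suffix in suffixes:
        # Keep at least two characters so very short words are left alone.
        if key.endswith(suffix) and len(key) > len(suffix) + 1:
            return key[: -len(suffix)]
    return key


words = ["gone", "going", "went"]
lemmas = [lemmatize(w) for w in words]   # dictionary lookup
stems = [crude_stem(w) for w in words]   # suffix stripping only
print(lemmas)
print(stems)
```

The lookup maps all three forms to go, while the suffix stripper only handles going and leaves the irregular forms gone and went unchanged, which is why lemmatization needs a dictionary.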
Lemmatization with NLTK
• Import wordnet and WordNetLemmatizer
• Lemmatize with WordNetLemmatizer

Stopwords
○ Words in English or other languages such as “i, me, my, myself, above, below”, etc.
○ They are very useful in forming sentences; without them a sentence would not make sense.
○ These words usually dominate documents and appear everywhere.
○ However, they do not provide significant meaning to help the NLP process, so they are usually removed.
○ NLTK has a list of stop words that we can import and use for stop word removal.

StopWords Removal with NLTK
● Import NLTK stop words
● List the English stop words
● Check the number of English stop words
● Remove the stop words
● Check the number of words after removing the stop words

POS Tags
● The grammatical type of a word, such as verb, noun, adjective, or adverb, is referred to as its part of speech (POS) tag.
● It indicates how a word functions in meaning as well as grammatically within the sentence.
● Example sentence:
  The/DT wolf/NN killed/VBD the/DT lion/NN
● NLTK POS tags for part of speech
● Part of speech example 1
● Part of speech example 2
● Other example: “Google” something on the internet
  ○ In this case, google is used as a verb.
  ○ The solution is “Named Entity Recognition” in the next chapter.

EXERCISE
• Install nltk
• Then try various sentences such as the following:
  – Hello, my name is…
  – Would you like a tea?
  – Where is your address?

Named Entity Recognition
● It is the process of detecting named entities such as person names, company names, quantities, or location names.
● The steps of NER:
  ○ Noun phrase identification.
  ○ Phrase classification.
  ○ Entity disambiguation.
Example: Google’s CEO Sundar Pichai introduced the new Pixel 3 at New York Central Mall
  ○ Entity labels in the figure, in order of appearance: Organization, Person, Location, Organization

Named Entity Recognition
● Import the named entity library
>>> from nltk import ne_chunk
● Sentence in which to recognize named entities
>>> NE_sent="The US President stays in the WHITE HOUSE"
● Tokenization
>>> NE_tokens=word_tokenize(NE_sent)
● Get POS tags
>>> NE_tags=nltk.pos_tag(NE_tokens)
● Chunk and show the named entities
>>> NE_NER=ne_chunk(NE_tags)
>>> print(NE_NER)

Syntax Tree
A syntax tree is a representation of the syntactic structure of sentences or strings.

Chunking
● Picking up individual pieces of information and grouping them into bigger pieces.
● Example:
  We/PRP caught/VBD the/DT Pink/JJ Panther/NN
  NP chunks: [We] and [the Pink Panther]

Group Project
PROJECT 1: TEXT SUMMARIZATION

Group Project – General Instructions
1. Form a group of 4 – 5 students.
2. Your group will perform a text summarization task.
3. Later, the tutor will choose several groups to present their work and results.
4. At the end of today’s session, each group must submit the .ipynb file and a one-page report using the given template.

Group Project – The Task
1. Discuss the cases in which text summarization may help to provide a solution. Your group only needs to pick one case to focus on.
2. Discuss an example of the case from step 1. This step will help you decide what data you need.
3. Discuss and pick one text source (document(s)) as the input for the summarization task you will perform.
4. Perform the text summarization (preprocessing) using the techniques you have just learned.
5. Discuss the result: what did you find and learn from this task?

References
• Mani, Inderjeet. Advances in Automatic Text Summarization.
MIT Press, 1999
• https://nlp.stanford.edu/IRbook/html/htmledition/tokenization-1.html

FOLLOW US
digitalent.kominfo
DTS_kominfo

Digital Talent Scholarship 2019
Pusat Pengembangan Profesi dan Sertifikasi
Badan Penelitian dan Pengembangan SDM
Kementerian Komunikasi dan Informatika
Jl. Medan Merdeka Barat No. 9 (Gd. Belakang Lt. 4 - 5), Jakarta Pusat, 10110
digitalent.kominfo.go.id

THANK YOU