Speaker: Chen Zhang
Oct 16, 2015

Society of tourism informatics
◦ Tourism is the main industry for some local cities
◦ The Web contains much information about tourism
    Contents: impressions, sentiments, sightseeing
    Sources: blogs, tweets, etc.
◦ The quality of the information is a mixture of good and bad
◦ The need for tourism informatics is rising (city, person)

Problem
◦ How to extract, analyze, and visualize tourism information?

Linguistics
◦ MeCab: a Japanese morphological analyzer
    E.g. “Iizuka”: [person’s name], [location’s name]
◦ Okapi-BM25 [7]: a word weighting approach that shows the importance of a word
    D: document
    qi: a word
    f(qi, D): frequency of qi in D
    n(qi): # of occurrences in D
    |D|: length of the document
    avgdl: average length of documents
    n: number of documents
    b, k: constant weighting factors

P/N classification for movie reviews [6]
◦ Shows the polarity of movie reviews
◦ Involves several machine learning techniques
◦ Reveals the effectiveness of the methods

Sentiment lexicon [1,2]
◦ Dictionary showing word polarity (P/N)
◦ Holds a real number for the polarity of each word
◦ Its scoring method serves as a comparison for the paper’s results

Tweets tend to be posted in real time; weblogs tend not to be.
Sentiment analysis is one of the hottest topics in natural language processing.
Linguistic/dictionary-based methods lack the ability to handle informal expressions such as (; orz ♩♪♫♬

Tourism Information Extraction
◦ Acquisition of basic queries
◦ Selection of related words
◦ Query generation and retrieval
◦ Filtering: get the final testing dataset

Sentiment Analysis
◦ Pseudo training dataset
◦ Naïve Bayes P/N classification model
◦ Accuracy compared with the scoring method
◦ Discussion of seed word selection

[Figure: system overview. Extraction: target city description (sentences) → MeCab, Okapi-BM25, select by hand → Basics (words) → query → tweets related to target → filtering (abbr., same, contextual) → 116 tweets (64 pos; 52 neg). Sentiment analysis: 10 million tweets → seeds (exactly match) → 100,000x2 P/N pseudo training data → divide into words, vector space → Naïve Bayes training → model → Naïve Bayes classification (P/N) → our method result, with comparison & explanation against the scoring method result.]

Basics
◦ Words extracted to best represent the target city
◦ Tweets query

Selection
◦ Divide sentences into words
◦ Calculate the weight of each word
◦ Select the most related ones
◦ Get a Basics list

Query
◦ Retrieve tweets with the Twitter API
◦ Feed in the Basics list from the previous step

Filtering
◦ Abbr.
◦ Rule-based
◦ Contextual
◦ Finally retrieve 116 tweets as the testing dataset

Query by seed words
◦ Source: a corpus of over 10 million non-tagged tweets
◦ Seed definition: [P: ♪, N: orz]
◦ Get a pseudo positive set and a pseudo negative set (each with 100,000 tweets)

Vectorization and training
◦ Divided into words
◦ Selected and labeled as P/N
◦ Results in the Naïve Bayes model

Naïve Bayes Classification
◦ Two classes: Positive and Negative
◦ Model parameters: likelihood and prior probability
    Obtained from the training dataset in the previous step

Target city: Iizuka city, Fukuoka, Japan
◦ Medium-sized city with a population of approx. 130,000

116 tweets as testing dataset
◦ Target city is small
◦ 64 tweets as positive and 52 as negative
100,000x2 tweets as training dataset

Compared with another scoring method
◦ Uses a linguistic sentiment dictionary
◦ Real number for the polarity of each word

The accuracy rate is computed by:
Results:
Output and scenarios:

Accuracy rates rise
◦ By using other seeds
◦ By using a combination of seed words

Important future work
◦ Statistical model for abbr.
◦ Contextual filtering
◦ Temporal
◦ Location-based
◦ Other resources

Extracted useful features from tweets and developed a tourism information analysis system.
Helps people understand and organize significant information about the target city.

Shortcomings:
◦ Still need to pick Basics words by hand
◦ Cannot deal with abbreviations
◦ Cannot deal with cities that share the same name

Assumption
◦ Since users tend to post tweets in real time, tweets often contain significant information about tourism events as lifelog data.

Main contributions
◦ Showed the effectiveness on tweet data with tourism targets or events.
◦ Showed the limitations of the sentiment dictionary.

Limitations and extensions
◦ Neutral is usually the third class.
◦ Consider other resources such as weblogs.

How could it be related to our research?
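
Supplementary sketch for the Selection step (calculate the weight of each word, select the most related ones). The slides list the Okapi-BM25 symbols but the formula itself is not preserved, so the code below assumes the commonly used BM25 form, with n(qi) taken as the number of documents containing qi and a "+1" inside the logarithm to keep weights positive. The function names `bm25_weights` and `top_words`, the values k=1.2 and b=0.75, and the pre-tokenized toy documents (standing in for MeCab output) are all illustrative assumptions, not the paper's settings.

```python
import math
from collections import Counter

def bm25_weights(docs, k=1.2, b=0.75):
    """Return one dict per document mapping each word to a BM25-style weight.

    docs: list of token lists (tokenization, e.g. by MeCab, is assumed done).
    k, b: constant weighting factors (common defaults, assumed here).
    """
    n = len(docs)                                  # n: number of documents
    avgdl = sum(len(d) for d in docs) / n          # avgdl: average length
    df = Counter(w for d in docs for w in set(d))  # n(qi): docs containing qi
    weights = []
    for d in docs:
        tf = Counter(d)                            # f(qi, D)
        norm = k * (1 - b + b * len(d) / avgdl)    # length normalization
        w = {}
        for q, f in tf.items():
            # Assumed IDF variant; the "+ 1.0" avoids negative weights.
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            w[q] = idf * f * (k + 1) / (f + norm)
        weights.append(w)
    return weights

def top_words(weight, m):
    """Pick the m highest-weighted words as candidate Basics words."""
    return sorted(weight, key=weight.get, reverse=True)[:m]

# Toy documents; real input would be sentences describing the target city.
docs = [
    ["iizuka", "city", "festival"],
    ["iizuka", "city", "hall"],
    ["city", "weather"],
]
weights = bm25_weights(docs)
print(top_words(weights[0], 2))
```

A word that appears in every document ("city" here) gets a low weight, while a word specific to one description ranks first. In the slides the final choice of Basics words is still made by hand, so a ranking like this would only produce candidates.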
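
Supplementary sketch for the sentiment analysis steps (query by seed words, training, Naïve Bayes classification, accuracy). It shows how tweets containing the seeds [P: ♪, N: orz] could become pseudo training data for a two-class model whose parameters are a prior and per-word likelihoods. Several details are assumptions rather than the paper's method: whitespace splitting stands in for MeCab, seed tokens are dropped from the features, add-one smoothing is used, and accuracy is taken as the fraction of correctly classified test tweets. All function names and the toy tweets are hypothetical.

```python
import math
from collections import Counter

SEEDS = {"P": "♪", "N": "orz"}  # seed words given in the slides

def pseudo_label(tweets):
    """Keep tweets containing exactly one kind of seed, labeled P or N."""
    labeled = []
    for text in tweets:
        hits = [c for c, s in SEEDS.items() if s in text]
        if len(hits) == 1:
            labeled.append((text, hits[0]))
    return labeled

def tokenize(text):
    # Whitespace split stands in for MeCab; seeds are removed (assumption).
    for s in SEEDS.values():
        text = text.replace(s, " ")
    return text.split()

def train(labeled):
    """Count classes (for priors) and words per class (for likelihoods)."""
    class_counts = Counter()
    word_counts = {"P": Counter(), "N": Counter()}
    for text, c in labeled:
        class_counts[c] += 1
        word_counts[c].update(tokenize(text))
    vocab = set(word_counts["P"]) | set(word_counts["N"])
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the class with the higher log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for c in ("P", "N"):
        score = math.log(class_counts[c] / total)           # prior
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in tokenize(text):
            if w in vocab:                                   # skip unseen words
                score += math.log((word_counts[c][w] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)

def accuracy(model, test_set):
    """Fraction of test tweets whose predicted label matches the gold label."""
    correct = sum(1 for text, gold in test_set if classify(text, model) == gold)
    return correct / len(test_set)

# Toy corpus; the slides use over 10 million non-tagged tweets.
tweets = [
    "great festival today ♪",
    "lovely park ♪",
    "train delayed again orz",
    "rain all day orz",
    "just a normal tweet",
]
model = train(pseudo_label(tweets))
test_set = [("festival in the park", "P"), ("train delayed by rain", "N")]
print(classify("festival in the park", model), accuracy(model, test_set))
```

The sketch assumes both classes occur in the pseudo training data; with an empty class the prior would be zero and the logarithm undefined. Swapping the entries of `SEEDS` is the simplest way to try other seed words or combinations, which the slides report as raising accuracy.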