Analyzing Tourism Information on Twitter for a Local City

Speaker: Chen Zhang
Oct 16, 2015

• Society of tourism informatics
  ◦ Tourism is the main industry for some local cities
  ◦ The Web contains a great deal of tourism information
    • Contents: impressions, sentiments, sightseeing
    • Sources: blogs, tweets, etc.
  ◦ The quality of this information is a mixture of good and bad
  ◦ Demand for tourism informatics is rising (from cities and from individuals)
• Problem
  ◦ How to extract, analyze, and visualize tourism information?

• Linguistics
  ◦ MeCab
    • Japanese morphological analyzer
    • E.g. "Iizuka" can be tagged as [person's name] or [location's name]
  ◦ Okapi BM25 [7]
    • Indicates the importance of a word
    • Word weighting approach, with symbols:
      D: document
      qi: a word
      f(qi, D): frequency of qi in D
      n(qi): number of documents containing qi
      |D|: length of the document
      avgdl: average length of documents
      n: number of documents
      b, k: constant weighting factors
• P/N classification for movie reviews [6]
  ◦ Classifies the polarity of movie reviews
  ◦ Involves several machine learning techniques
  ◦ Demonstrates the effectiveness of these methods
• Sentiment lexicon [1,2]
  ◦ Dictionary giving word polarity (P/N)
  ◦ Holds a real-valued polarity for each word
  ◦ Its scoring method serves as the comparison for the paper's results

• Tweets tend to be posted in real time; weblogs tend not to be.
• Sentiment analysis is one of the hottest topics in natural language processing.
• Linguistic and dictionary-based methods lack the ability to handle informal expressions such as (; orz ♩♪♫♬

• Tourism Information Extraction
  ◦ Acquisition of basic queries
  ◦ Selection of related words
  ◦ Query generation and retrieval
  ◦ Filtering to get the final testing dataset
• Sentiment Analysis
  ◦ Pseudo training dataset
  ◦ Naïve Bayes P/N classification model
  ◦ Accuracy compared with the scoring method
  ◦ Discussion of seed word selection

[Figure: system overview diagram. Labels include: Target city description (Sentences); MeCab; Okapi-BM25; Select by hand; Abbr.]
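The Okapi BM25 word weighting listed above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: it assumes documents are already tokenized into word lists (for Japanese text a morphological analyzer such as MeCab would do this), it uses the common smoothed IDF variant log((n - n(qi) + 0.5) / (n(qi) + 0.5) + 1), and the default values k = 1.2 and b = 0.75 are conventional choices rather than values given in the slides. The function names `bm25_weights` and `top_words` are hypothetical.

```python
import math
from collections import Counter


def bm25_weights(docs, k=1.2, b=0.75):
    """Return one dict per document mapping each word to its BM25 weight.

    docs: non-empty list of non-empty token lists (assumed pre-tokenized).
    k, b: constant weighting factors (conventional defaults, an assumption).
    """
    n = len(docs)                              # number of documents
    avgdl = sum(len(d) for d in docs) / n      # average document length
    df = Counter()                             # n(qi): documents containing qi
    for d in docs:
        df.update(set(d))

    weights = []
    for d in docs:
        tf = Counter(d)                        # f(qi, D): frequency of qi in D
        w = {}
        for q, f in tf.items():
            # Smoothed, non-negative IDF variant (an assumption, see lead-in).
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            norm = f + k * (1 - b + b * len(d) / avgdl)
            w[q] = idf * f * (k + 1) / norm
        weights.append(w)
    return weights


def top_words(weight, m):
    """Return the m highest-weighted words of one document, best first."""
    return [q for q, _ in sorted(weight.items(), key=lambda x: -x[1])[:m]]


# Toy, made-up documents standing in for tokenized city descriptions.
docs = [
    ["iizuka", "festival", "street", "festival"],
    ["tokyo", "station", "street"],
    ["osaka", "castle", "street"],
]
ranked = top_words(bm25_weights(docs)[0], 3)
print(ranked)
```

In the toy data, "street" occurs in every document and so receives a low weight, while "festival" is frequent only in the first document and ranks first. Candidates ranked this way would still need the hand selection described in the slides.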
• Basics
  ◦ Words extracted to best represent the target city
  ◦ Used as tweet queries
• Selection
  ◦ Divide sentences into words
  ◦ Calculate the weight of each word
  ◦ Select the most related ones
  ◦ Get a Basics list
• Query
  ◦ Retrieve tweets with the Twitter API
  ◦ Feed in the Basics list from the previous step
• Filtering
  ◦ Abbr.
  ◦ Rule-based
  ◦ Contextual
  ◦ Finally, 116 tweets are retrieved as the testing dataset

• Query by seed words
  ◦ Source: a corpus of over 10 million untagged tweets
  ◦ Seed definitions: [P: ♪, N: orz]
  ◦ Get a pseudo-positive set and a pseudo-negative set (100,000 tweets each)
• Vectorization and training
  ◦ Tweets are divided into words
  ◦ Words are selected and labeled as P/N
  ◦ The result is the Naïve Bayes model
• Naïve Bayes classification
  ◦ Two classes
    • Positive and Negative
  ◦ Model parameters
    • Likelihood and prior probability
    • Obtained from the training dataset in the previous step

[Figure: system flow diagram. Labels: Target city description (Sentences); MeCab; Okapi-BM25; Select by hand; Basics (Words); query; Tweets; Abbr.; Same; Contextual; Related to target; Exactly match; 116 tweets (64 pos; 52 neg); Seeds; query; Tweets, 10 million; 100,000x2 P/N pseudo training data; Divide into words; Vector space; Naïve Bayes Training; model; Naïve Bayes Classification (P/N); Our method result; Scoring method result; Comparison & Explanation]

• Target city: Iizuka City, Fukuoka, Japan
  ◦ Medium-sized city with a population of approx. 130,000
• 116 tweets as the testing dataset
  ◦ The target city is small
  ◦ 64 tweets are positive and 52 are negative
• 100,000x2 tweets as the training dataset
• Compared with a scoring method
  ◦ Uses a linguistic sentiment dictionary
  ◦ Real number for the polarity of each word
• The accuracy rate is computed by:
• Results:
• Output and scenarios:

• Accuracy rates rise
  ◦ By using other seeds
  ◦ By using combinations of seed words
• Important future work
  ◦ Statistical model for abbreviations
  ◦ Contextual filtering: temporal, location-based
  ◦ Other resources

• Extracted useful features from tweets and developed a tourism information analysis system.
• Helps people understand and organize significant information about the target city.
• Shortcomings:
  ◦ Basic words still need to be picked by hand
  ◦ Cannot deal with abbreviations
  ◦ Cannot deal with cities that share the same name

• Assumption
  ◦ Since users tend to post tweets in real time, tweets often contain significant information about tourism events, as lifelog data.
• Main contributions
  ◦ Showed the method's effectiveness on tweet data with tourism targets or events.
  ◦ Showed the limitations of the sentiment dictionary.
• Limitations and extensions
  ◦ Neutral is usually the third class.
  ◦ Consider other resources such as weblogs.
• How could it be related to our research?
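The seed-word pseudo-labeling and Naïve Bayes P/N classification described in the slides can be sketched as follows. This is a minimal sketch under stated assumptions, not the speaker's code: tweets are assumed to be already split into words, seed words are removed from the features so the model cannot simply memorize them, add-one smoothing is applied to both the likelihood and the prior, and accuracy is taken to be correct predictions divided by total tweets (the slide's own formula is not preserved). The function names and the toy tweets are hypothetical.

```python
import math
from collections import Counter

POS_SEED, NEG_SEED = "♪", "orz"   # seed definitions from the slides


def pseudo_label(tweets):
    """Build pseudo P/N training data from tokenized tweets via seed words."""
    data = []
    for toks in tweets:
        has_p, has_n = POS_SEED in toks, NEG_SEED in toks
        if has_p == has_n:          # no seed or both seeds: skip as ambiguous
            continue
        label = "P" if has_p else "N"
        # Drop the seeds themselves from the features (an assumption).
        feats = [t for t in toks if t not in (POS_SEED, NEG_SEED)]
        data.append((feats, label))
    return data


def train_nb(data):
    """Collect class counts (prior) and per-class word counts (likelihood)."""
    class_counts = Counter(label for _, label in data)
    word_counts = {"P": Counter(), "N": Counter()}
    for toks, label in data:
        word_counts[label].update(toks)
    vocab = set(word_counts["P"]) | set(word_counts["N"])
    return class_counts, word_counts, vocab


def classify(model, toks):
    """Return 'P' or 'N' by the larger smoothed log posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c in ("P", "N"):
        lp = math.log((class_counts[c] + 1) / (total + 2))   # smoothed prior
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in toks:
            if t in vocab:                                   # ignore unseen words
                lp += math.log((word_counts[c][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best


def accuracy(model, labeled):
    """Fraction of hand-labeled tweets classified correctly (assumed metric)."""
    correct = sum(classify(model, toks) == gold for toks, gold in labeled)
    return correct / len(labeled)


# Toy, made-up tweets standing in for the untagged corpus.
corpus = [
    ["festival", "fun", "♪"],
    ["great", "food", "♪"],
    ["rain", "tired", "orz"],
    ["train", "late", "orz"],
    ["no", "seed", "here"],
]
model = train_nb(pseudo_label(corpus))
print(classify(model, ["fun", "festival"]))
```

In the setup described in the slides, the training data would be the two pseudo-labeled sets of 100,000 tweets each, and `accuracy` would be evaluated on the 116 hand-labeled test tweets.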
