Analyzing Tourism Information on Twitter for a Local City

Speaker: Chen Zhang
Oct 16, 2015

Society of tourism informatics
◦ Tourism is the main industry for some local cities
◦ The Web contains much information about tourism
 Contents: impressions, sentiments, sightseeing
 Sources: blogs, tweets, etc.
◦ The quality of this information is mixed, good and bad
◦ Demand for tourism informatics is rising (cities, individuals)

Problem
◦ How to extract, analyze and visualize the tourism
information?

Linguistics
◦ MeCab
 Japanese morphological analyzer
 E.g. “Iizuka” – [person’s name], [location name]
◦ Okapi-BM25 [7]
 Shows the importance of a word
 Word weighting approach:
D: document
qi: a word
f(qi, D): frequency of qi in D
n(qi): number of documents containing qi
|D|: length of document D
avgdl: average length of documents
n: number of documents
b, k: constant weighting factors
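
The weighting formula itself did not survive in the slide text. Assuming the slide used the standard Okapi BM25 form, it can be written with the symbols listed above as follows. The query set Q (the words being scored) is a symbol introduced here for illustration, and the exact IDF variant used in the talk is not known.

```latex
\[
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k + 1)}
       {f(q_i, D) + k\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \log\frac{n - n(q_i) + 0.5}{n(q_i) + 0.5}
\]
```

Each summand is the weight of one word qi in document D: it grows with the word's frequency in D, shrinks for long documents, and is scaled down for words that occur in many documents.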

P/N classification for movie reviews [6]
◦ Determines the polarity of movie reviews
◦ Involves several machine learning techniques
◦ Demonstrates the effectiveness of these methods

Sentiment lexicon [1,2]
◦ Dictionary giving the polarity (P/N) of words
◦ Holds a real number for the polarity of each word
◦ Basis of the scoring method used for comparison with
the paper’s results



Tweets tend to be posted in real time; weblogs
tend not to be.
Sentiment analysis is one of the hottest topics
in natural language processing.
Linguistic and dictionary-based methods cannot
handle informal expressions, e.g. (; orz ♩♪♫♬

Tourism Information Extraction
◦ Acquisition of basic queries
◦ Selection of related words
◦ Query generation and retrieval
◦ Filtering – get final testing dataset

Sentiment Analysis
◦ Pseudo training dataset
◦ Naïve Bayes P/N classification model
◦ Accuracy compared with scoring method
◦ Seed words selection discussion

[Figure: system flow diagram. Extraction side: target city description (sentences); MeCab, Okapi-BM25, select by hand; Basics (words); query; tweets related to target; filtering (abbr., same, contextual); 116 tweets (64 pos; 52 neg). Sentiment side: seeds; exact-match query over 10 million tweets; 100,000x2 P/N pseudo training data; divide words into vector space; Naïve Bayes training; model; Naïve Bayes classification (P/N). Outputs: our method result and scoring method result; comparison & explanation.]

Basics
◦ Words extracted to best represent the target city
◦ Used as queries for tweets

Selection
◦ Divide sentences into words
◦ Calculate the weight of each word
◦ Select the most related ones
◦ Get a Basics list
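
The selection steps above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: whitespace splitting stands in for MeCab, the per-word weight is the standard Okapi BM25 term, the defaults k=1.2 and b=0.75 are common choices rather than values from the slides, and the function names and the background documents are hypothetical.

```python
import math
from collections import Counter


def bm25_word_weights(docs, k=1.2, b=0.75):
    """For each tokenized document, return a dict mapping word -> BM25 weight."""
    n = len(docs)                                  # number of documents
    avgdl = sum(len(d) for d in docs) / n          # average document length
    # n(q): number of documents that contain each word
    doc_freq = Counter(w for d in docs for w in set(d))
    all_weights = []
    for d in docs:
        weights = {}
        for word, f in Counter(d).items():         # f: frequency of word in d
            idf = math.log((n - doc_freq[word] + 0.5) / (doc_freq[word] + 0.5))
            norm = f + k * (1 - b + b * len(d) / avgdl)
            weights[word] = idf * f * (k + 1) / norm
        all_weights.append(weights)
    return all_weights


def top_words(weights, top_k):
    """Pick the top_k highest-weighted words (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


# Hypothetical toy data: the first document plays the role of the
# target city description, the others are background documents.
docs = [
    "iizuka is a city with the kaho theater".split(),
    "the river is a long river".split(),
    "a city is a place".split(),
    "the theater is a stage".split(),
]
candidates = top_words(bm25_word_weights(docs)[0], top_k=3)
```

On this toy data a function word such as "with" can still rank highly simply because it is rare in the background documents, which is one reason a final selection by hand is plausible.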

Query
◦ Retrieve tweets with the Twitter API
◦ Use the Basics list from the previous step as the query

Filtering
◦ Abbr.
◦ Rule-based
◦ Contextual
◦ Finally, 116 tweets are retained as the testing dataset.

Query by Seed words
◦ Source: a corpus of over 10 million untagged tweets
◦ Seed definition: [P: ♪, N: orz]
◦ Get a pseudo-positive set and a pseudo-negative set
(each with 100,000 tweets)
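
A rough sketch of how seed words could produce the two pseudo-labeled sets is shown below. The seeds ♪ and orz and the 100,000 cap come from the slide; the function name, the plain substring match, and the choice to skip tweets that contain both seeds are assumptions made for illustration.

```python
POSITIVE_SEED = "♪"    # seed from the slide
NEGATIVE_SEED = "orz"  # seed from the slide


def build_pseudo_sets(tweets, limit_per_class=100_000):
    """Split untagged tweets into pseudo-positive and pseudo-negative sets."""
    positive, negative = [], []
    for text in tweets:
        has_pos = POSITIVE_SEED in text
        has_neg = NEGATIVE_SEED in text
        if has_pos == has_neg:
            # Assumption: tweets with neither seed or with both are ambiguous, so skip.
            continue
        if has_pos and len(positive) < limit_per_class:
            positive.append(text)
        elif has_neg and len(negative) < limit_per_class:
            negative.append(text)
    return positive, negative


# Hypothetical toy corpus
corpus = [
    "Festival tonight ♪",
    "Missed the last train orz",
    "Just arrived at the station",
    "Rain again orz but the food was great ♪",
]
pseudo_pos, pseudo_neg = build_pseudo_sets(corpus)
```

Whether the seed itself should be stripped from the text before training is not stated in the slides and is left out here.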

Vectorization and training
◦ Tweets are divided into words
◦ Words are selected and labeled as P/N
◦ Results in a Naïve Bayes model
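
As a minimal sketch of this training step, the code below fits a multinomial Naïve Bayes model on tokenized pseudo-positive and pseudo-negative tweets and classifies a new tweet. The add-one (Laplace) smoothing, the function names, and the toy data are assumptions; the slides do not say which Naïve Bayes variant or feature selection was used.

```python
import math
from collections import Counter


def train_nb(pos_docs, neg_docs):
    """Estimate log priors and log likelihoods from tokenized P/N documents."""
    counts = {"P": Counter(), "N": Counter()}
    for doc in pos_docs:
        counts["P"].update(doc)
    for doc in neg_docs:
        counts["N"].update(doc)
    vocab = set(counts["P"]) | set(counts["N"])
    total_docs = len(pos_docs) + len(neg_docs)
    model = {"vocab": vocab, "prior": {}, "likelihood": {}}
    for label, docs in (("P", pos_docs), ("N", neg_docs)):
        model["prior"][label] = math.log(len(docs) / total_docs)
        # Add-one smoothing so unseen (word, class) pairs keep non-zero probability
        denom = sum(counts[label].values()) + len(vocab)
        model["likelihood"][label] = {
            w: math.log((counts[label][w] + 1) / denom) for w in vocab
        }
    return model


def classify(model, doc):
    """Return 'P' or 'N', whichever class has the higher log posterior."""
    scores = {}
    for label in ("P", "N"):
        score = model["prior"][label]
        for w in doc:
            if w in model["vocab"]:      # words never seen in training are ignored
                score += model["likelihood"][label][w]
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data
pos = [["great", "trip"], ["lovely", "festival"]]
neg = [["tired", "rain"], ["crowded", "train", "tired"]]
nb_model = train_nb(pos, neg)
label = classify(nb_model, ["lovely", "trip"])
```

Working in log space avoids numeric underflow when tweets contain many words.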

Naïve Bayes Classification
◦ Two classes
 Positive and Negative
◦ Model parameters
 Likelihood and prior probability
 Estimated from the training dataset in the previous step

Target city – Iizuka city, Fukuoka, Japan
◦ Medium-sized city with a population of approx. 130,000

116 tweets as testing dataset
◦ Target city is small
◦ 64 tweets labeled positive and 52 labeled negative

100,000x2 tweets as training dataset

Compared with a scoring method
◦ Uses a linguistic sentiment dictionary
◦ The dictionary holds a real number for the polarity of each word
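
A dictionary-based scoring baseline of this kind might look like the sketch below. The slides do not say how word scores are combined, so summing them and treating a total of zero as positive are assumptions, and the lexicon entries are invented for illustration.

```python
def score_tweet(words, lexicon):
    """Sum the polarity values of the words; words not in the lexicon count as 0."""
    return sum(lexicon.get(w, 0.0) for w in words)


def classify_by_score(words, lexicon):
    """Label a tweet 'P' if its total score is non-negative, otherwise 'N'."""
    return "P" if score_tweet(words, lexicon) >= 0 else "N"


# Hypothetical lexicon entries (not taken from any real dictionary)
toy_lexicon = {"beautiful": 0.9, "tired": -0.8}

label_a = classify_by_score(["beautiful", "shrine"], toy_lexicon)
label_b = classify_by_score(["tired", "orz"], toy_lexicon)
informal_only = score_tweet(["orz"], toy_lexicon)
```

An informal token such as orz contributes nothing here because it is absent from the lexicon, which illustrates the limitation of dictionary-based methods noted earlier in the slides.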

The accuracy rate is computed by:
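
The formula is missing from the slide text. Assuming the usual definition (correctly classified tweets divided by all tweets in the testing dataset), a small sketch is given below; the function name and the toy labels are hypothetical and are not the reported results.

```python
def accuracy(predicted, gold):
    """Fraction of items whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy labels
rate = accuracy(["P", "N", "P", "N"], ["P", "N", "N", "N"])
```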

Results:

Output and scenarios:

Accuracy rates rise
◦ By using other seeds
◦ By using combinations of seed words

Important future work
 Statistical model for abbreviations
 Contextual filtering (temporal, location-based)
 Other resources

Extracted useful features from tweets and
developed a tourism information analysis
system.

Helps people understand and organize
significant information about the target city.

Shortcomings:
◦ Still need to pick up Basic words by hand
◦ Cannot deal with abbreviations
◦ Cannot deal with cities with same names

Assumption
◦ Since users tend to post tweets in real time, tweets often
contain significant information about tourism events, serving as
lifelog data.

Main contributions
◦ Showed the method’s effectiveness on tweet data about tourism
targets or events.
◦ Showed the limitations of the sentiment dictionary.

Limitations and extensions
◦ Neutral is usually treated as a third class.
◦ Consider other resources such as weblogs.

How could it be related to our research?