DEMOCRATS, REPUBLICANS AND STARBUCKS AFFICIONADOS: USER CLASSIFICATION IN TWITTER
KDD '11
Utku Şirin
1560838

USER CLASSIFICATION IN TWITTER, AUTHORS
Marco Pennacchiotti
  Research Scientist at Yahoo! Labs
  PhD from the University of Rome
  Studied at Saarland University
  Large-scale text mining, information extraction, and natural language processing
Ana-Maria Popescu
  Research Scientist at Yahoo! Labs
  Graduated from the University of Washington
  Social media research and analytics, user modelling, and sentiment analysis

SOCIAL MEDIA
A rapidly growing phenomenon
Everyone is there, the conservatives and the revolutionaries!
As data miners, what interests us is the very large amount of available data about social media users
A basic and important task: classification of the users
  Authoritative user extraction
  Post reranking in web search (KDD Cup '12, Track #2)
  User recommendation
How to do the classification?

CLASSIFICATION TASK
The starting point is to fill in incomplete user attributes by classifying each user with respect to the missing attribute
  Most users do not explicitly state their political view, for example
There are various methods for solving the user classification problem
What do we have in the social media domain?
  User attributes
    Users have many attributes, such as age, gender, etc.
    Based on the attributes a classifier may be trained/constructed
  Social network
    Users have friends that they follow
How can we define the classification task so that it combines these two types of information, user attributes and social network?
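To make the two types of information concrete, here is a minimal Python sketch of how a user could be held in memory. The names `TwitterUser`, `attributes`, `friends` and `users_missing` are hypothetical, chosen only for illustration; the paper does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class TwitterUser:
    """Hypothetical container for the two information types on the slide."""
    user_id: str
    # User-centric information: attribute name -> value (may be incomplete).
    attributes: dict = field(default_factory=dict)
    # Social network information: ids of the accounts this user follows.
    friends: list = field(default_factory=list)


def users_missing(users, attribute):
    """Return the users whose given attribute is absent, i.e. the ones
    a classifier would have to label."""
    return [u for u in users if u.attributes.get(attribute) is None]


users = [
    TwitterUser("a", {"political_view": "democrat"}, friends=["b"]),
    TwitterUser("b", {"age": 30}),
]
to_classify = users_missing(users, "political_view")
```

In this sketch the classification task amounts to filling the missing entry of `attributes` for every user in `to_classify`, using both the known attributes and the `friends` lists.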

MACHINE LEARNING MODEL
A novel architecture combining user-centric information and social network information
  User-centric information is the attributes of the users, called features hereafter
  Social network information is the information about the friends of the users
  This combination is the main contribution of the paper
Uses the Gradient Boosted Decision Trees (GBDT) framework as the classification algorithm
  Train the GBDT on the given labeled input data
  Label the users with the built classifier
  Then apply the same classifier model to the friends of the users and label the friends as well
  Lastly, update each user's label with respect to her friends' labels using an update formula

USER-CENTRIC INFORMATION
User-centric information is represented as features. The feature set is very large and mainly comprises four parts
Profile features (PROF)
  User name, use of avatar picture, date of account creation, etc.
Tweeting behavior features (BEHAV)
  Average number of tweets per day, number of replies, etc.
Linguistic content features
  Richest feature set, comprised of five sub-feature sets
  Uses Latent Dirichlet Allocation (LDA) as the language model
  Prototypical words (LING-WORD):
    Proto words are words that are typical of the users in a class, found probabilistically from the data
    First partition the users into n classes, then find the most frequent words for each class and keep the k most used words per class
  Prototypical hashtags (LING-HASH):
    Hashtags (#) denote topics
    Same technique as for proto words
  Generic LDA (LING-GLDA):
    Topics are extracted with the LDA model and each user is represented as a distribution over topics
    LDA is trained on all sets of users
  Domain-specific LDA (LING-DLDA):
    Same as Generic LDA, but trained on a specific training set, such as users that are only Democrats and Republicans
  Sentiment words (LING-SENT):
    Manually collected small set of terms, e.g. Ronald Reagan: good or bad?
    The OpinionFinder tool gives the sentiment as positive, negative, or neutral

USER-CENTRIC INFORMATION
Social network features
  Combination of two different features
  Friend accounts (SOC-FRIE):
    Informs about which friends are shared by users with a given label, such as Democrats or Republicans
  Prototypical replied (SOC-REP) and retweeted (SOC-RET) users:
    Find the most frequently mentioned (@) and retweeted (RT) users for each label
That is all for the user-centric information. A very large feature set, indeed…

LABEL UPDATE USING SOCIAL NETWORK
Now each user in the test set is labeled by the classifier trained with the features just mentioned
The label update step then updates the labels with respect to the labels of the users' friends, as follows:
  Label each user and all of her friends using the built classifier. Labels are numbers in [-1, +1]; higher values show higher confidence
  Then update the label of each user u_i with the following formula, where F_i is the set of friends of u_i:

  \[ \mathrm{label}'(u_i) = \lambda \cdot \mathrm{label}(u_i) + (1 - \lambda) \cdot \frac{\sum_{j \in F_i} w_{ij} \cdot \mathrm{label}(u_j)}{|F_i|} \]

  \[ w_{ij} = \frac{1}{2} \cdot \frac{\mathrm{mentions}_{ij}}{\sum_{k \in F_i} \mathrm{mentions}_{ik}} + \frac{1}{2} \cdot \frac{\mathrm{numberOfFollowers}_j / \mathrm{numberOfFriends}_j}{\sum_{k \in F_i} \mathrm{numberOfFollowers}_k / \mathrm{numberOfFriends}_k} \]

  Here \lambda \in [0, 1] weights the user's own label against the friends' labels

EXPERIMENTAL EVALUATION
Three binary classification tasks:
  Detecting political affiliation
    Democrat or Republican
    5169 Democrats and 5169 Republicans
    1.2 million friends
  Ethnicity
    African American or not
    3000 African Americans and 3000 non-African Americans
    508K friends
  Following a business
    Following Starbucks or not
    5000 Starbucks followers and 5000 non-followers
    981K friends

EXPERIMENTAL RESULTS, POLITICAL AFFILIATION TASK
The combined HYBRID model achieves its best result among the three tasks here; however, the increase over the single ML model is not significant
Social network features are very successful. This is because users with a particular political view are friends with users holding similar views.
Supporting this, the graph-based label update is also very successful on its own

EXPERIMENTAL RESULTS, STARBUCKS FANS TASK
The social graph update is not as successful as in the political affiliation task, since following Starbucks does not create friendships
Profile features are very successful alone
Linguistic features are also successful
The HYBRID method still does not significantly improve on the ML system alone

EXPERIMENTAL RESULTS, ETHNICITY TASK
The HYBRID method fails: it performs worse than the ML model alone
Social network features perform very badly! As in the Starbucks task, ethnicity does not form a community. Hence, social network features and the graph-based update give very low results
The best single-feature results come from the linguistic features. Linguistic features always have a point!

OVERALL COMMENTS
#1 The ML method is mostly good enough, and the update part of the architecture does not bring significant improvement. If the task lets users form a community, the update function works; otherwise, it may even hurt the ML system alone, as in the ethnicity case
#2 Linguistic features are always reliable

REVIEW#1
The novelty of combining the types of information is attractive; however, there are serious points that should be criticized
First of all, the classifier does only binary classification, and nothing is said about multi-class classification. Doing multi-class classification with binary classifiers is time-consuming and weakens the claim about scalability.
As said, the novel architecture idea is attractive; however, the results show that the label update does not work well. Why? They did not give any appreciable comment on why the label update does not work well. This, I believe, shows that the feature set and the novel architecture are not well studied.
There are too many features, but the reasons why these features were selected are not given. Moreover, applying the same ML model to the users and their friends replicates the information. Obviously, connected users will have some common and some different attributes, so what is the point?
The social graph should be used more effectively. I think it should not be used to update the labels but as a heavily weighted feature in the ML model. This is because we should superpose different information types instead of using one to compensate for the other. You can see the difference by thinking in terms of a vector space: updating means spanning the same vector again, while superposing means using both vectors concurrently. For example, proto words could have been extracted using the network, somehow.

REVIEW#2
They mention Gradient Boosted Decision Trees (GBDT) but give no details about this classification algorithm; an explanation of GBDT is expected, at least in principle.
The same holds for the Latent Dirichlet Allocation (LDA) language model. It is the first time I hear of this language model, and they say nothing about LDA beyond that it is used as the language model and associated with topics. But what is LDA, and how is it associated with topics?
There is no data analysis, a very crucial shortcoming of the paper: everything is data! They only give the number of users used in training, but what about the test set? The development set? Any other statistics about the data?
Moreover, they use a different number of samples for each task. The label update is much less successful on the ethnicity task than on the political affiliation task; however, there are 1.2M friends for the political affiliation task but fewer than half as many, 508K, for the ethnicity task. Hence the cross-task comparisons are not reliable.
The system they built has a strong constraint, indeed: it is language dependent (English). For example, the features based on frequencies of proto words will not work for Turkish due to its agglutinative nature, with many inflected forms of the same word: masayı, masada, masanın, masalardakilerin, etc. (a stemmer will most probably be needed)
Experiments are not done in a structured way. They have just run the experiments and shown the results. There is no useful commentary.
Besides, they did not explain why they chose these experiments. For example, I would want to see results for feature subsets: since single features mostly give very good results alone, some subset may increase the overall HYBRID result.

ANY COMMENTS OR QUESTIONS?
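
As a closing illustration for the discussion, the label update step from the LABEL UPDATE USING SOCIAL NETWORK slide can be sketched in Python. Each user's classifier score is interpolated with the weighted sum of her friends' scores divided by the number of friends, where a friend's weight is the average of her share of the user's mentions and her share of the follower-to-friend ratios. The names `Friend`, `friend_weights`, `update_label` and the parameter `lam` are hypothetical, and the default `lam=0.5` is an arbitrary illustrative value, not one reported on the slides.

```python
from dataclasses import dataclass


@dataclass
class Friend:
    """Hypothetical record for one friend u_j of a user u_i."""
    label: float     # classifier score of the friend, in [-1, +1]
    mentions: int    # how often u_i mentions this friend
    followers: int   # number of followers of the friend
    friends: int     # number of accounts the friend follows


def friend_weights(friends):
    """Compute w_ij for every friend: half mention share, half
    follower-to-friend ratio share. Empty sums contribute 0 (assumption)."""
    total_mentions = sum(f.mentions for f in friends)
    ratios = [f.followers / f.friends if f.friends else 0.0 for f in friends]
    total_ratio = sum(ratios)
    weights = []
    for f, ratio in zip(friends, ratios):
        mention_share = f.mentions / total_mentions if total_mentions else 0.0
        ratio_share = ratio / total_ratio if total_ratio else 0.0
        weights.append(0.5 * mention_share + 0.5 * ratio_share)
    return weights


def update_label(own_label, friends, lam=0.5):
    """Interpolate the user's own label with the friends' weighted labels.
    A user without friends keeps her label unchanged (assumption)."""
    if not friends:
        return own_label
    weights = friend_weights(friends)
    social = sum(w * f.label for w, f in zip(weights, friends)) / len(friends)
    return lam * own_label + (1 - lam) * social


example_friends = [Friend(1.0, 3, 100, 50), Friend(-1.0, 1, 10, 10)]
new_label = update_label(0.2, example_friends, lam=0.5)
```

Because the weights of one user sum to one and the sum is additionally divided by the number of friends, the social term in this sketch shrinks as the friend list grows, which may be one thing to check when asking why the update helps on some tasks and not on others.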