Research on the Methods of Micro-blog Centralization for Online-Advertising Targeting

Shuqin Cai1, Feng Liang1,2, Lei Duan1, Qian Yuan1
1 School of Management, Huazhong University of Science and Technology, Wuhan, China
2 Weibo, Sina
caishuqin@sina.com, fliang76@gmail.com, hustyq@hust.edu.cn

Abstract: Online-advertising targeting provides a useful segmentation strategy that is critical in marketing. However, decentralized UGC on micro-blog platforms gives rise to fragmentation of content, space and time. Prior research only predicts customers' advertising preferences from a statistical perspective and cannot meet marketing needs because it fails to consider the features of UGC. This paper centers on the micro-blog centralization process to improve the effect of advertising, treats different types of micro-blog information as raw material for processing, and explores effective information processing methods of advertising targeting for micro-blog service providers. Our work is shown to be valuable in the micro-blog context with decentralized UGC.

Keywords: UGC, Targeting Advertising

Supported by projects of the National Natural Science Foundation of China (71071066, 7073101) and a project of the Ministry of Education Research of Social Sciences of China (11YJA630098).

Introduction

With the boom of micro-blogs, the micro-blog has become the most important platform on which companies design online-advertising targeting as a marketing strategy, because it offers great advantages in transmission, interactivity and accuracy. Advertising has been the major revenue source of micro-blog websites. The huge amount of UGC in micro-blogs, including the preference, behavior and context information of customers, creates the data context for online-advertising targeting, which has developed from the individual level (property-centric) to the personal level (customer-centric) and the contextual level (context-centric). Nevertheless, the vast majority of websites are unsatisfied with their advertising revenue, and so are audiences with the ads [1].
Fragmented and decentralized micro-blogs impede the use of content in advertising strategies. From the perspective of marketing, extracting information from fragmented, decentralized micro-blog UGC is both critical and urgent for online targeting advertising.

The micro-blog is a representative UGC (user generated content) website. There are two main advertising approaches for advertisers in the UGC context: publishing professional ads alongside the content generated by users, or asking users to create ads for the company's brand [2]. The former is the focus here. Targeting advertising can be categorized into content-oriented advertising and behavior-oriented advertising. From the content-oriented perspective, research mainly focuses on keyword extraction [3], short-text similarity calculation [4], text-matching strategies [5], etc. From the behavior-oriented perspective, the main solution is to collect feedback information on users' click actions in order to match ads and search keywords, in combination with content-oriented targeting solutions [6, 7].

To improve the level of Web services, Goethals et al. [8] suggest paying more attention to the centralization and decentralization of data and procedures. The work of Paltoglou et al. [9], which coalesces the results of search engines such as Google from the viewpoint of centralization, is very meaningful for increasing the accuracy and relevance of search results. Goldszmidt et al. put forward four indicators by comparing the centralization paradigm with the decentralization paradigm and suggest that the two paradigms can and should coexist in online marketing [10].
The research mentioned above provides insights into centralization; however, some problems remain: 1) the research objects of prior studies are mainly traditional online content, without considering the new characteristics of micro-blogs; 2) research on centralization is mainly positioned at the qualitative, macroscopic level, and its limited operability makes it difficult to meet micro-level micro-blog information processing needs; 3) there are many studies of measurement algorithms, but few systematic studies from the marketing perspective that consider customers' demands. From the perspective of marketing, studies that consider companies' marketing needs, customers' specific demands and the requirements of online advertising targeting are urgently needed to explore systematic centralization methods by combining information/knowledge management with marketing science. Therefore, taking a micro-blog service website as the background, we propose a UGC centralization method to meet companies' online-advertising targeting needs on micro-blog platforms by summarizing advertising targeting needs in terms of location, time, behavior and content, extracting centralization features in terms of user, content and trends, and presenting a three-layer (person level, group level and platform level) framework for centralizing UGC.

0. Space and Features of Centralization for Advertising Targeting

Online-advertising targeting in our study is based on the marketing demand of advertising targeting and the features of UGC on micro-blog platforms. It delivers the right ads with the right content to the right micro-blog users at the right time and in the right position on the right page, learning from customers' personalized information and their context.
Business marketing generally targets customers based on the demography related to products or services, geographic position, customer psychographics and behavior factors [11]. The new UGC context, however, provides advertisers with new opportunities to realize this. We categorize the needs of targeting as location-oriented, time-oriented, behavior-oriented and content-oriented targeting needs. We propose a micro-blog centralization space to support the marketing aim of online advertising targeting, as illustrated in Fig 1. Although micro-blog websites attract huge amounts of content on the one hand, they also fragment the information of users and content, which results in the decentralization of micro-blogs on the other hand. Micro-blog centralization is a user-centric procedure which treats micro-blog content as corpuses in order to obtain users' space-time related, behavior related and subject term related information according to the demands of advertising targeting.

[Fig 1 Centralization space for micro-blog advertising targeting. Axes: behavior-oriented targeting (behavior strength, behavior frequency); content-oriented targeting (number of subject terms, frequency of subject terms); space-time-oriented targeting (time, space).]

Based on the centralization space above, we analyze three kinds of features here. There are different kinds of micro-blog content on a micro-blog website, but not all of them are suitable for targeting advertising. For example, the homepage is not a good choice for ads, as its content is not convergent and cannot represent customers' preferences. For this reason, we try to estimate customers' potential preferences by integrating different levels of UGC as corpuses. Corpuses are split into three levels here: personal micro-blogs (X1), micro-group blogs (X2) and whole-level micro-blogs (X3), as illustrated in Fig 2.
X1 is the set of micro-blogs posted by the user himself/herself on personal pages, including the set of original micro-blog content, the user's profile information, the time sequence of the user's posting actions and the set of locations where the user posts micro-blogs. X2 is posted by users in a micro-group and shown on the micro-group pages; it includes the original micro-blog content generated by users, and its features are the topics and categories of the group. X3 is all micro-blogs on the micro-blog website, including micro-blog content, temporal information and emerging topics. Based on the marketing requirements of online advertising targeting, they can be transformed into user features, content features and trend features.

[Fig 2 The levels of micro-blog: Personal Micro-blog (X1), Micro-Group Blog (X2), Whole-level Micro-blog (X3).]

User features

User features (Y1) are derived from personal micro-blogs and relate to the user's behavior. The behavior of an online user is divided into short-term and long-term behavior. Short-term behaviors are occasional and unpredictable; they occur more frequently within a specific period because the user's interest in a specific entity or event is relatively unstable. Long-term behaviors come from the user's stable interests, so related behaviors appear more frequently over a long period. We characterize user features as Y1={(K,F,S),CT,CL}, where K is the set of keywords related to ads in the micro-blogs, F is the value of the keywords' frequency, S is the value of the keywords' strength, CT is the set of centralized timestamps of the micro-blogs, and CL is the set of centralized locations of the micro-blogs.

Content features

Content features (Y2) reflect the features mined from micro-group content related to ad themes. The topics in a micro-group blog are convergent. Joining a particular group implies potential interest in the content posted to that group.
The type and domain of a micro-group can be recognized from the group topics and group profile information included on the micro-group page. Delivering ads to a user performs well only if he/she comments on a product mainly with positive sentiment [12]. This accelerates the realization of marketing strategies. We define content features as Y2={K,N,E}, where K is the collection of keywords related to the ads in micro-group blogs, N is the number of times the keywords appear, and E is the sentiment polarity of the related keywords.

Trends features

Trends features (Y3) are the trending topics appearing in whole-level micro-blogs. Micro-blog service providers usually present hot topics to users in ranked order. Topics related to ads can draw users' attention and encourage them to join the discussion, which is an excellent opportunity to enhance the interaction between companies and customers. In Twitter's profit model, the “promoted trends” strategy charges the highest price by placing brand promotion information directly at the top of the hot topics in the sidebar. The trends centralization method proposed here combines ad delivery with current hot topics and guides users into topic discussions to promote companies' marketing strategies, because hot topics are noticed by many users and the exposure rate is critical for companies. Trends features are defined as Y3={K,N,ST}, where K is the collection of keywords related to the ad theme in whole-level micro-blogs, N is the number of times the keywords appear, and ST represents the stage of the keywords' life cycle within a hot topic.

1. Centralization Processing Procedure and Method of Micro-blog

1.1 Centralization Processing Procedure of Micro-blog

Micro-blog centralization processing is a procedure that produces different centralized results from personal micro-blogs, micro-group blogs and whole-level micro-blogs, and matches them with the different demands of customers by using a suitable approach to advertising targeting.
Centralized results are matched with customers' advertising targeting needs according to the matching degree, as illustrated in Fig 3. Personal micro-blogs (X1) are turned into user features (Y1) by the user-centric centralization process (P1). Micro-group blogs (X2) are turned into content features (Y2) by the content-centric centralization process (P2). Whole-level micro-blogs (X3) are turned into trends features (Y3) by the trends-centric centralization process (P3).

[Fig 3 Centralization processing of micro-blogs. Input: decentralized micro-blogs at the personal level (X1), group level (X2) and whole level (X3). Centralized processing procedures: P1, P2, P3. Output: user feature (Y1), content feature (Y2), trend feature (Y3), mapped onto the centralization space of targeting advertising: space-time (A), behavior (B), content (C).]

1.2 Centralization Processing Methods of Micro-blogs

1.2.1 User-Centric Processing Methods

Centralization based on the behavior's frequency and strength features

Frequency features are mainly reflected in the freshness of short-term behavior and the dispersion of long-term behavior. Freshness is the average posting time in an effective cluster after normalization. Dispersion is the mean-square deviation of the posting time in an effective cluster after normalization. Micro-blogs can be clustered based on the similarity between their keywords. When the similarity between two micro-blogs is greater than a pre-given threshold (α), they are assigned to the same cluster. A cluster can reflect the user's behavior characteristics only if its share of the total number of micro-blogs reaches a certain level. This ratio is defined as the behavior factor, denoted by ξ. Thus, only if the number of micro-blogs in a cluster is greater than ξ×m, where m is the total number of micro-blogs, is the cluster considered an effective cluster reflecting the user's behavior characteristics; otherwise it is a noise cluster.
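The clustering and effective-cluster rule above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Jaccard overlap of keyword sets stands in for the unspecified similarity measure, clusters are formed as connected components of the "similarity greater than α" relation, posting times are assumed to be already normalized to [0, 1], and "mean-square deviation" is read as the population standard deviation. The function names and the dictionary layout of a post are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def jaccard(a, b):
    """Keyword-set similarity (assumed measure; the paper does not fix one)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_posts(posts, alpha):
    """Put two posts in the same cluster when their similarity exceeds alpha."""
    parent = list(range(len(posts)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(posts)), 2):
        if jaccard(posts[i]["keywords"], posts[j]["keywords"]) > alpha:
            parent[find(i)] = find(j)

    clusters = {}
    for i, post in enumerate(posts):
        clusters.setdefault(find(i), []).append(post)
    return list(clusters.values())


def effective_clusters(clusters, xi, m):
    """Keep clusters with more than xi * m posts; the others are noise clusters."""
    return [c for c in clusters if len(c) > xi * m]


def freshness(cluster):
    """Average normalized posting time of a cluster."""
    return mean([p["time"] for p in cluster])


def dispersion(cluster):
    """Population standard deviation of the normalized posting times."""
    return pstdev([p["time"] for p in cluster])
```

With m set to the total number of a user's micro-blogs, `effective_clusters(cluster_posts(posts, alpha), xi, m)` returns the clusters that would be kept for that user, and `freshness` and `dispersion` can then be computed for each of them.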
The strength features depend on how many times an action is repeated, which is reflected in the number of times that keywords appear. Behavior strength is divided into three categories: strong, neutral and weak. The minimum and maximum numbers of appearances are taken as the two ends of an interval, and the interval is divided into three equal parts that define the ranges of the three categories. Strength is a necessary attribute of every keyword.

Centralization based on the behavior's space-time features

The space and time features of a user's behavior, which are often used in marketing practice, are obtained by centralization processing of the location and time attributes of all micro-blogs posted by the user. The purpose is to choose the right time slot in which to deliver ads for the advertiser, based on the user's logging-in and posting actions and on the location and time of posting. Location features are the result of centralization processing of the location information attached to the micro-blogs. Keywords related to current location information are recognized by NLP (Natural Language Processing). Time features are obtained by dividing the posting times of the micro-blogs related to a particular keyword by day of the week and hour of the day, and then determining which section those micro-blogs belong to, as in the case illustrated in Fig 4.

[Fig 4 Centralization of User's Time Feature: a grid of days of the week (Mon to Sun) by hours of the day (1 to 24).]

1.2.2 Content-Centric Processing Methods

Feature extraction is the first step of content-centric processing; the next is constructing a keyword polarity dictionary for a specific domain in order to recognize the opinion trends of users. The centralized micro-blog content characteristics can then be mined.

Feature extraction based on micro-blog content

Feature extraction is defined as automatically extracting related content features from large numbers of micro-blogs through machine learning. Such extraction technologies are widely used in the opinion analysis of customer product reviews.
Micro-blogs are considered here as one kind of product review, except that their pertinence is low. The main steps of association rule mining based on the content features are: 1) tagging the polarity of keywords; 2) composing a transaction file of nouns and noun phrases; 3) extracting frequent keywords by association rule mining; 4) pruning the feature rules based on adjacent rules; 5) pruning the feature rules based on the dependent support; 6) forming the collection of micro-blog characteristics composed of frequent items; and 7) finally complementing the domain or product characteristics with the infrequent items in the micro-blogs.

Sentiment analysis based on micro-blog content

Sentiment analysis further analyzes the sentiment trends toward the product characteristics referred to in the micro-blogs. It is defined as classifying a user's sentiment polarity, positive or negative, toward a characteristic given by feature mining. The sentiment trends of users can be summarized by classifying the sentiment in a huge number of micro-blogs. Polarity dictionary construction and sentiment classification are very important in sentiment analysis. The authors manually picked out 445 positive adjective keywords and 337 negative adjective keywords from the dataset used. Those keywords are meaningful in marketing. Because Chinese is a high-context language, some polarity keywords may reverse their polarity when describing certain features in a particular context; these are defined as abnormal polarity keywords. Some work has been done to collect the polarity keywords of the specific domain and to complement the polarity keyword dictionary.

Table 1 The formulas for calculating the text sentiment classification indexes

Index       Positive                 Negative
Precision   PP = a1 / b1 × 100%      PN = a2 / b2 × 100%
Recall      RP = a1 / c1 × 100%      RN = a2 / c2 × 100%

Sentiment classification in micro-blogs is measured by two indexes borrowed from text theme classification: precision and recall.
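As a minimal sketch of the Table 1 formulas, the function below computes precision and recall for one class. Its parameter names are illustrative and not taken from the paper: `correct` plays the role of a1 or a2, `predicted` of b1 or b2, and `actual` of c1 or c2. The example calls use the counts reported in Table 4 of the case study.

```python
def precision_recall(correct, predicted, actual):
    """Return (precision, recall) in percent for one sentiment class.

    correct   -- texts of the class that the classifier labelled correctly
    predicted -- texts that the classifier assigned to the class
    actual    -- texts that really belong to the class
    """
    return 100.0 * correct / predicted, 100.0 * correct / actual


# Counts from Table 4: 32 and 9 predicted positive, 18 and 41 predicted negative.
pp, rp = precision_recall(correct=32, predicted=32 + 9, actual=32 + 18)
pn, rn = precision_recall(correct=41, predicted=18 + 41, actual=9 + 41)

print(f"positive: precision {pp:.0f}%, recall {rp:.0f}%")
print(f"negative: precision {pn:.0f}%, recall {rn:.0f}%")
```

Rounded to whole percentages, these values agree with the precision and recall figures quoted in the case study.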
Assume that a1 is the number of correctly classified texts among the texts recognized as positive by the classifier, a2 is the number of correctly classified texts among the texts recognized as negative by the classifier, b1 is the number of texts recognized as positive by the classifier, b2 is the number of texts recognized as negative by the classifier, c1 is the actual number of positive texts and c2 is the actual number of negative texts. The formulas are listed in Table 1.

1.2.3 Trends-Centric Processing Methods

Trends-centric processing is divided into three stages: short text clustering, evolution cycle filtering and hot-topic cluster ranking. The key processing task is filtering out those clusters that are related to the needs of the advertiser and are in the up-going stage. A series of hot-topic micro-blog clusters can be obtained by micro-blog text clustering. It is reasonable to assume that there is an evolutionary cycle, as illustrated in Fig 5. Within the evolutionary period, the life cycle of a hot topic goes through birth, growth, maturity and finally death; it rises quickly from valley to peak and declines slowly to the valley again.

[Fig 5 Evolution Cycle of Hot Topic Clusters: hotness (Hot) plotted against time (t), with hotness thresholds h1 and h2 and time bounds t1 and t2.]

Filtering by evolutionary period keeps the clusters inside the rectangle where the hotness h is in (h1, h2) and the time t is in (t1, t2). The value of h1 should depend on the actual situation. If h1 is set too small, the content may not be convergent enough. If h1 is set too large, it becomes too difficult to filter out enough content, and advertising opportunities are missed from the perspective of marketing. Based on the features of the evolutionary curve of a hot-topic event, clusters that do not fit the evolutionary curve can be removed even though the similarities within them are high enough. The candidate hot-topic clusters can be ranked after the evolutionary cycle filtering. Clusters with more corpuses and higher weights in the feature vector are more likely to act as hot corpus clusters.
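A minimal sketch of the rectangular evolution-cycle filter and the ranking of candidate clusters is given below. The paper does not specify how the number of corpuses and the feature weight are combined, so sorting by size first and weight second is an assumption, as are the dictionary keys and function names.

```python
def in_growth_window(cluster, h1, h2, t1, t2):
    """True when the cluster lies inside the rectangle (h1, h2) x (t1, t2)."""
    return h1 < cluster["hot"] < h2 and t1 < cluster["t"] < t2


def rank_candidates(clusters, h1, h2, t1, t2):
    """Filter clusters by the evolution window, then rank the survivors.

    Assumed ranking rule: larger clusters first, ties broken by the
    weight of the feature item.
    """
    kept = [c for c in clusters if in_growth_window(c, h1, h2, t1, t2)]
    return sorted(kept, key=lambda c: (c["size"], c["weight"]), reverse=True)
```

Raising h1 in this sketch shrinks the candidate list, which mirrors the trade-off described above between convergent content and missed advertising opportunities.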
Clusters whose ranking position is beyond a pre-given threshold are filtered out. The chosen clusters that fit the evolutionary rules are then ranked again according to the weights of their feature items. A feature item with a larger weight is more valuable for delivering related ads.

2. Cases and Conclusions

2.1 Cases

Weibo (http://www.weibo.com), a highly open platform with active users that is regarded as the counterpart of Twitter in mainland China, is chosen as the source of the original corpuses. Content-centric and user-centric processing procedures are executed based on the research described above to realize advertising targeting as a marketing strategy. The micro-group named “car”, which has the most users among car-related micro-groups, is chosen as the information source. As of November 1, 2012, there were 39255 members and 23096 micro-blogs in “car”. About 150 micro-blogs are posted per day on average in our dataset.

Content-centric processing procedure

According to the feature extraction results for car-related product micro-blogs, the top 8 keywords related to car attributes are identified by attention ranking. Using the classification model trained for the product category, sentiment trends are recognized at the sentence level. The results of the sentiment analysis, obtained by combining the polarity dictionary with a sentiment analysis tool, are displayed in Table 2.

Table 2 Attributes and polarity of car-related micro-blogs (amounts are numbers of micro-blogs)

Rank  Attribute                Positive  Negative  Total  Rate of positive micro-blogs /%
1     新款 (new model)          893       490       1383   65
2     兰博基尼 (Lamborghini)     605       137       742    82
3     改装 (modification)       557       160       717    78
4     宝马 (BMW)                487       207       694    70
5     奥迪 (Audi)               433       237       670    65
6     跑车 (sports car)         391       227       618    63
7     发动机 (engine)           311       200       511    61
8     车型 (car model)          288       213       501    57

User-centric processing

Users who show interest in cars can be found through the tag “car”. A user with a high rate of original posts, named “疯狂的石头”, is chosen randomly. He is a VIP and is authenticated as a “well-known micro-blog” on Weibo.
His profile shows 293 followees, 49388 followers and 1270 micro-blogs, and he has tagged himself as “After 80s”, “Car”, “Subaru”, “Sport”, “Car racer” and “Essay”. He is obviously a typical car enthusiast, and according to the content of his micro-blogs, no business promotion is involved in his profile.

Table 3 Attributes and polarity of car-related micro-blogs posted by the user (amounts are numbers of micro-blogs)

Rank  Attribute              Positive  Negative  Total  Rate of positive micro-blogs /%
1     奥迪 (Audi)             55        12        67     82
2     赛车 (racing car)       41        9         50     82
3     大众 (Volkswagen)       35        14        49     71
4     沙漠 (desert)           33        16        49     67
5     宝马 (BMW)              23        17        40     58
6     法拉利 (Ferrari)        29        5         34     85
7     德国 (Germany)          14        14        28     50
8     动力 (power)            6         14        20     30

898 micro-blogs are collected through LocoySpider. After preprocessing, feature keywords related to cars are extracted if the keyword's frequency is larger than ten. Some synonyms are merged, such as “沙漠” (desert) and “戈壁滩” (Gobi desert), “法拉利” (Ferrari) and “F430”, “宝马” and “BMW”, “赛车” (racing car) and “赛道” (race track), and “奥迪” (Audi) and “A4L”. The top 8 characteristic keywords are obtained and their sentiment polarity is recognized. The results are displayed in Table 3.

Because of the huge amount of data, it is too difficult to check every result of sentiment recognition. 100 sentences are chosen as the training set and 50 sentences are chosen as the testing set. The results are listed in Table 4. The recall for positive comment sentences is 64% and the precision is 78%. The recall for negative comment sentences is 82% and the precision is 69%. The total precision is 73%.

Table 4 Frequency statistic result of keywords

Micro-blogs   APCM   ANCM
PPCM          32     9
PNCM          18     41

In Table 4, APCM: Actual Positive Comment Micro-blogs, ANCM: Actual Negative Comment Micro-blogs, PPCM: Predicted Positive Comment Micro-blogs, PNCM: Predicted Negative Comment Micro-blogs.

The influence of the behavior factor ξ on ad delivery is displayed in Table 5.
A larger ξ clearly removes more noise clusters and leaves those effective clusters that reflect the short-term and long-term behavior of the user. The pertinence of the ads is thereby raised.

Table 5 The influence of the behavior factor on ad delivery

ξ     Themes of effective clusters
4%    奥迪 (Audi), 赛车 (racing car), 大众 (Volkswagen), 沙漠 (desert), 宝马 (BMW), 法拉利 (Ferrari), 德国 (Germany)
5%    奥迪, 赛车, 大众, 沙漠, 德国
6%    奥迪, 赛车, 德国
7%    奥迪, 德国

This case combines product feature mining in Chinese micro-blogs with sentiment analysis technology and validates that domain features can be summarized from a huge amount of micro-blog content. Centralized information has been extracted successfully, such as the number of comment sentences, the number of positive opinions, the number of negative opinions, the rate of positive opinions and the domain-related features of the micro-blogs. The feature mining methods proposed in the case perform well and provide decision support for delivering targeted advertising.

2.2 Conclusions

Based on the framework of information centralization, this paper takes Weibo as the research background and the delivery of targeted ads as the entry point, and studies the methods and models of micro-blog centralization based on web micro-content information processing theory at the intersection of marketing and information systems. This research constructs theories and methods aimed at solving the marketing problem of online advertising targeting caused by information fragmentation on micro-blog websites, and seeks breakthroughs in information science and marketing for Internet companies.
Since this study involves marketing strategy, knowledge management, information technology and other research fields, some problems remain to be solved in actual marketing practice and in future research: weighting each user's micro-blogs based on the relationships between users, extending the centralization model to mine micro-blogs from the social network perspective, increasing the application of real-time processing, and exploring more effective characteristic mining and sentiment analysis methods.

References:
[1] JS Beuscart, Kevin Mellet. Business Models of the Web 2.0: Advertising or the Tale of Two Stories. Communications & Strategies, 2009.
[2] Sandeep Krishnamurthy, Wenyu Dou. Note From Special Issue Editors: Advertising with User-Generated Content: A Framework and Research Agenda. Journal of Interactive Advertising, Vol 8, 1-4 (2008)
[3] Wen-tau Yih, Joshua Goodman, Vitor R. Carvalho. Finding Advertising Keywords on Web pages. In WWW, pp. 213-222 (2006)
[4] Sahami M., Heilman T. A web-based kernel function for matching short text snippets. Proceedings of the Workshop on Learning in Web Search located at the 22nd International Conference on Machine Learning, pp. 377-386 (2005)
[5] Berthier Ribeiro-Neto, Marco Cristo, Paulo B. Golgher, Edleno Silva de Moura. Impedance Coupling in Content-targeted Advertising. In SIGIR, pp. 15-19 (2005)
[6] Chakrabarti D., Agarwal D., Josifovski V. Contextual advertising by combining relevance with click feedback. International World Wide Web Conference, pp. 417-426 (2008)
[7] Ciaramita M., Murdock V., Plachouras V. Online learning from click data for sponsored search. International World Wide Web Conference, pp. 227-236 (2008)
[8] Goethals R. G., Snoeck M., Lemahieu W., et al. Considering (de)centralization in a Web Services World. Second international conference on Internet and Web applications and services. In: Morne, 22 (2007)
[9] Paltoglou G., Salampasis M., Satratzemi M.
A comparison of Centralized and Distributed Information Retrieval approaches. In: Samos. Panhellenic conference on informatics, pp. 21-25 (2008)
[10] Goldszmidt G., Yemini Y. Distributed management by delegation. In ICDCS, 15th IEEE International Conference on Distributed Computing Systems, pp. 0333 (1995)
[11] Chandra A. Targeted Advertising: The Role of Subscriber Characteristics in Media Markets. The Journal of Industrial Economics, Vol 57(1), 58-84 (2009)
[12] Chatterjee P., D.L. Hoffman, T.P. Novak. Modeling the Clickstream: Implications for Web-Based Advertising Efforts. Marketing Science, Vol 22(4), 520-541 (2003)