Relationship Analysis between User’s Contexts and Real Input Words through Twitter Yutaka Arakawa, Shigeaki Tagashira and Akira Fukuda Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan 744, Motooka, Nishi-ku, Fukuoka, Fukuoka, JAPAN, 819-0395 Email: arakawa, shigeaki, [email protected] Abstract—In this paper, we propose a method to evaluate effectiveness of our proposed context-aware text entry by using Twitter. We focus on ”geo-tagged” public tweets because they include user’s important contexts, real location and time. We also focus on TV program listing because 50% traffic of iPhone in Japan is generated from our home, in which I often tweets in watching a TV. Cyclical collecting system based on Streaming API and Search API of Twitter is proposed for gathering the target tweets efficiently. In order to find the relationship between user’s contexts and really used words, we compare really-tweeted words with words obtained from Local Search API of Yahoo! Japan that is used for our context-aware text entry and words obtained from TV program listing. We analyze 471274 tweets that have been collected from 15 December 2009 to 10 June 2010 for specifying the relationship to landmark information and TV program. As a result, we show that 5.1% of tweets include landmark words, and 9% of tweets include TV program words. Additionally, we bring out that there are location dependent words and time dependent words. I. I NTRODUCTION In a recent research in Japan, it is turned out that over fifty percent of Internet users access the Internet from mobile devices. And among them, over eighty percent users access the information by using not hierarchical menu in official site but search engines such as Google. Moreover, current mobile devices can use not only text messaging but also web-mail such as Gmail. It indicates that one has an opportunity to input a long text. The increase of text input on mobile devices drives the demand for improving a text input method. Recent mobile phones generally equip clever text entry which have a function of predictive transform. This function consists of dictionaries, syntactic parsing, and learning. When a user inputs “a”, it picks up the words which start with “a” from a dictionary, and recommends some candidate words which seems appropriate according to syntactic parsing. In addition, based on user’s input history, such as frequency or time stamp of latest use, it sorts the order of the candidates. In these days, iWnn, one of Japanese text entry adopted many kinds of mobile phones, suggests more appropriate words according to current seasons, time, body of received e-mail, relationships between superior and inferior. As another approach that differs from text entry, “Google Suggest” provides oftenused keyword combination for optimizing search terms and reducing keystrokes. Meanwhile we have proposed context-aware text entry which can suggest useful words based on user’s context such as location, presence and time. We mainly focus on how to make dictionary dynamically among above-mentioned three component of predictive transform in text entry. The reason is that the word not included in the dictionary cannot be recommended, and we are not specialists of philological syntactic parsing. In our proposed system, based on user’s current contexts, the dictionary in the mobile phone is updated periodically in cooperation with the dictionary creation server on the Internet, which generates user’s current dictionary dynamically by using several public Web APIs. In our current prototype, “location” and “time” are adopted as a user’s context, and landmark names surrounding user are added based on user’s location, and TV programs’ title and performers’ name are added based on “time”. The reason why we use TV program is that 50% iPhone traffic in Japan is transmitted through home WiFi networks. As a result, if you input “H” at neighbor venue of Globecom2010, the system may suggest “Hyatt Regency Miami” as one of candidates. If you input “J” or “K” in watching 24, the system may return “Jack Bauer” or “Kiefer Sutherland” respectively. We have already constructed the OpenWnn-based prototype system on Android terminal. We have already tried several system architectures, and have confirmed that one of them can achieve enough response time. Also, we are evaluating our system through demonstration and questionnaire with some persons. Although our system looks effective, there is an important remaining issue. Since we started this research with the assumption that such kind of system will be convenient for us, there is no evidence or quantitative evaluation for representing the effectiveness. It is hard to gather large amount of results through questionnaire-based evaluation. In addition, if we log really inputted sentences in a mobile phone, we must consider privacy protection in relation to personal data. In this paper, we propose a method for evaluating contextaware text entry by using Twitter. Twitter is a micro blogging service on the Internet, where a short message of up to 140 characters, called tweet, can be posted. And these tweets are generally open for the public. The reason why we focus on Twitter is 1) we can obtain huge amount of public strings of various users, 2) Geotagging API released at November 2009 enables users to add geo code to each tweet. It means that we can extensively and publicly collect real input sentences that include users’ real location. Therefore, we think that to analyze collected data clears up the relationship between user’s Train Transit Application E-mail Application Map Application Internet Departure Destination Roppongi I took a Yamanote train Hyatt Regency Miami Local device External server from Shinbashi. Soon, I Tokyo will arrive Shibuya. Meet Acceleration sensor Dictionary is dynamically generated by using public APIs on the Internet Bank of America, Hyatt, Roppongi, Tokyo, etc. Shinbashi, Hachiko James L. Knight Center Asynchronous Roman character Typical Effective Examples of context-aware text entry Direct plugin IME ATOK contexts and real input words. As a result, it can show the effectiveness of our proposed text entry quantitatively. First, we construct the tweet collecting system that obtains Japanese tweets with geocode, where we effectively combine two APIs of Twitter, Streaming API and Search API to gather huge amount of tweets. Our system has already gathered halfmillion tweets since 15 December 2009. Next, we analyze collected data by comparing with the data that obtained from other APIs. In this paper, we use “Yahoo! Local search API” for obtaining landmark information, and use “TV program listing on the Internet” for obtaining TV programs’ title and performers’ name. These APIs are the same as APIs for making dictionary in our context-aware text entry. In our relationship analysis, both data are separated into some words by using “Yahoo! Japanese language morphological analysis API” and “Yahoo! Key phrase extraction API”. As a result, we show that 5.1% of tweets include landmark words, and 9% of tweets include TV program words. Additionally, we bring out that there are location dependent words and time dependent words. The rest of the paper is organized as follows. We present our context-aware input method editor proposed previously in Section 2. In section 3, we explain about Twitter. And following section explains relationship analysis. Finally, results are shown in Section 5. II. C ONTEXT-AWARE T EXT E NTRY FOR M OBILE P HONE Fig.1 shows a typical service examples in which our proposed context-aware text entry will work effective. It indicates the importance of words varies with a location (i.e., user’s context). For example, nearby station name is used at stations, landmark name is used at a new places, product name is used at bookstores and electronic retail stores. The most characteristic point is that dictionary is automatically and dynamically updated by mashing up public Web APIs in the Internet. Nearest station API can be used for obtaining the station name near here. Also, Landmark information API can be adopted to search landmark names surrounding the user. In addition, we introduced learning process into this system. If a user selects the word in suggested candidates at the station, the system judges it may be used at the same place in the future. If the word is not used, it judges that the word is not useful in this place. By repeating these learning Japanese language morphological analysis (MeCab) Context-Aware Select & Sort Engine or Location API Yahoo API Input Hiragana Schedule API Presense API Google API Schedule API Amazon API Tabelog API For making dictionary Landmark Info. API Train, Go, Take, Ride, Other sensors API access module (XML parser) Schedule/Calender API Shibuya, Shinjuku, Global context Feedback Nearest station API Context updater Local context For estimation Context Estimation Engine GPS sensor at Statue of Hachiko. API access module (XML parser) Route Search Fig. 1. Internal server GuruNavi API Output "Personal Context Dictionary" mixture of Chinese characters and Japanese phonetic characters As a general dictionary kana-kanji conversion API Fig. 2. The architecture of prototype system processes cyclically, the words related to a certain place become suggested, and normal words will be suggested in other place. A. System architecture The architecture of prototype system is shown in Figure 2． It is composed of three parts, local device, our server on the Internet, and general web services on the Internet. The local device has various sensors such as GPS and acceleration. In our prototype system, we use a PC as local device and adopt the Google Maps API as GPS sensor for setting user ’ s location visually. The internal server in the center of Figure 2 is a main part of our proposed system. It collects information and estimates of user’s context, creates the dynamic dictionary, and suggests the words by utilizing user’s context. These functions are possible to construct on local device. However, we set it into the server over the global network because it is important not only accuracy of estimation algorithm but also processing speed. Besides, we architect it works asynchronously to collect sensor information by the system and to input text on local device. The dictionary is updated whenever location is varied. As a result, local device only searches pre-constructed database when text is input. This architecture enable the system to prevent the processing speed from slowing down when web external servers increase. The external servers in the right of Figure 2 are not our servers but provided by several companies. In the case of this prototype system, it cooperates with the Yahoo Local Search API, Google Maps API, and Gurunavi API. Some words provided by these APIs are materials of personal context-aware dictionary. We develop the two prototype systems. One is the extension of OpenWnn of Android, another is ATOK Direct Plug-in for Collect realtime tweets from Streaming API （10∼15% of all tweets） ・xxxxxxxx Collect past tweets from Search API Filtering Japanese & Geotagged （Users who once geotagged） ・xxxxxxxx Twitter ID ・xxxxxxxxxx Filtering Japanese & Geotagged Time ・xxxxxxxxxx ・xxxxxx ・xxxxxx ・xxxxxxxxxxx ・xxxxxxxxxxx Performer's name Jack Bauer Kiefer Sutherland Keisuke Honda Daisuke Matsui etc. ・xxxxxxx ・xxxxxx ・xxxxxxxxx ・xxxxxxxxxxx ・xxxxx （less than 1%） ・xxxxxxxxx ・xxxxx Database Many tweets Fig. 3. Real inputted texts Yahoo! Local Search API Landmark name Hyatt Regency Miami Bank of America James L. Knight Center Miami Convention Center etc. Language morphological analysis to parse sentences ・xxxxxxxxx Tweets Location TV program listing ・xxx ・xxxxxxxxxxx 2010-06-28T17:04:25, 34.54324, 131.234234, Honda and Matsui, Good job!! #worldcup 2010-06-28T17:05:32, 33.59723, 130.217793, I am staying Hyatt Regency Miami． Jack Bauer Kiefer Sutherland Keisuke Honda Daisuke Matsui Hyatt Regency Miami Bank America James Knight Center Convention Honda Matsui Good Worldcup Hyatt Regency Miami staying Cyclical collecting system based on Streaming API and Search API Fig. 4. Windows and Mac. “ATOK” is one of the major text entry in Japan as well as Microsoft text entry. B. Remaining Issue Flow of relationship analysis These APIs are the same as APIs for making dictionary in our previously proposed context-aware text entry. A. Cyclical collecting system for Twitter Since we started this research with the assumption that such kind of system will be convenient for us, there is no evidence or quantitative evaluation for representing the effectiveness. It is hard to gather large amount of results through questionnaire-based evaluation. In addition, if we log really inputted sentences in a mobile phone, we must consider privacy protection in relation to personal data. III. T WITTER As you know, Twitter is one of the major micro blogging and social networking service today, in which a short message of up to 140 characters, called tweet, can be posted. Tweets are generally open for the public as a “public timeline”. Since Twitter releases many kinds of API for general users, we can obtain other user’s tweets through these APIs. In this paper, we use Streaming API and Search API for obtaining tweets. Streaming API that was officially released January 2010 allows near-realtime access to the user’s tweets timeline. Tweets created by a public account are candidates for inclusion in the Streaming API. However, Streaming API only provides randomly sampled tweets which is about 10% of all the tweets. Search API allows us to search Tweets with a query in which we can set some parameters such as target text, language, user id, geocode, time spam, etc. In this paper, we use this API for collecting the past tweets of user who have once posted with geocode. How to combine these two APIs is described in the following section. IV. R ELATIONSHIP A NALYSIS For analyzing the relationship between user’s contexts and really inputted sentences, we construct the collecting system of Twitter and compare collected tweets with landmark information gotten from Yahoo! Local Search API and TV information obtained from online TV program listing. We construct the tweet collecting system that obtains Japanese tweets tagged with geocode, where we effectively combine two APIs of Twitter, Streaming API and Search API to gather huge amount of tweets. Fig.3 shows the cyclical tweet collecting system based on Streaming API and Search API. The reason to use two APIs is as follows. Since tweets obtained through Streaming API consist of various languages, we need to filter and pick up target tweets which are written in Japanese and have geocode as shown in the left side of Fig.3. As a result, we obtain only less than 1% of tweets. If a user want to add geocode to own tweets, he must have a client that can tag user’s current location through Twitter API. In other words, a user once tagged is possible to post other tagged tweets. Therefore, we pick up user IDs who posted a geo-tagged tweet, and we collect their past tweets through Search API cyclically. B. Matching Process A tweet consists of thee data, time information, location information, and inputted text as shown in Fig.4. From location information and “Yahoo! Local Search API”, we pick up the surrounding landmarks’ name within one kilometer of user’s current location. Examples of typical landmarks are station, city hall, school, hospital, post office, and so on. From time information and TV program listing, we pick up performers’ name and TV programs’ title. As a target channel to be collected, we adopt 12 key stations in Tokyo area and Fukuoka area. Fukuoka is the one of major cities located at west side of Japan, where our university exists. Since it is hard to collect past TV program listing and obtainable data is extremely large, we only analyze data of about one-month (between 7 January 2010 to 2 February 2010). All the data are separated into some words by using “Yahoo! Japanese language morphological analysis API” and “Yahoo! 130.42 141.35 ᮐᖠ 43.06 Only Streaming API ༤ከ 33.58 Start: 15 Dec. 2009 Fig. 5. End: 10 June 2010 Distribution of collected tweets per day Fig. 6. Geographical distribution of a location independent word: Noodle 139.7 Key phrase extraction API”. After that we compare these words with each other and evaluate the matching rate. Bunkyo-ku V. R ESULTS Fig.5 shows a distribution of collected 471274 tweets that have been collected from 15 December 2009 to 10 June 2010. Geographical scope of tweets is limited to Japan area, which is equal to the area from latitude 24 north and longitude 123 east to latitude 46 north and longitude 146. This limitation is due to the limitation of Yahoo! Local search API, which is only provided by Yahoo! Japan. Since we used only Streaming API at first, the number of tweets of the first one month is 10 or 100 times less than those of subsequent terms. It points out that our proposed cyclical tweet collecting system is very effective. Average word count of collected tweets is 48.8 characters, and tweets of about 30 characters are majority. From these results, we think that an abbreviated notation is often used in tweets. Average and maximum number of landmarks obtained at a certain position from Yahoo! Local Search API is 22.9 and 71 respectively. 10.2% of position can’t obtain any landmark information from this API. Maximum number of landmarks per position is 66. Meanwhile, average and maximum number of words gotten from TV program listing is 149.1 words/hour and 790 words/hour respectively, which is about 10 times larger than landmark information. The percentage of tweets including the words obtained according to the tweeted position is 5 Finally, we refer the dependency of time and location. Fig.6 shows a geographical distribution of tweets which include “noodle”. Since plots are widely distributed all over Japan, the Nakano-ku Shinjuku-ku Shinjuku station 35.69 Chiyoda-ku Shibuya-ku Shibuya station 35.66 Minato-ku Meguro-ku Fig. 7. Geographical distribution of location dependent words: Shibuya, Shinjuku word “noodle” can be determined as a location independent word. On the other hand, we notice that each plot (circle and plus) in Fig.7 is concentrated in certain areas respectively. In this figure, circle plots and plus plots show the geographical distribution of tweets which incude “Shinjuku” and “Shibuya” respectively. Centers of concentrated areas are Shinjuku station Sunday “Ryoma-den” is a TV drama broadcasting in NHK at 20 o'clock on every Sunday now. “Ryoma-den” is a TV drama broadcasting in NHK at 20 o'clock on every Sunday now. Sunday Sunday Sunday Sunday 9 May Fig. 8. 16 May 23 May 30 May 6 June Distribution of a time dependent word: Ryoma-den (per day) and Shibuya station of JR (Japan Railways). From this result, the word “Shinjuku” and “Shibuya” can be defined as location dependent words. Fig.8 and Fig.9 show a distribution of tweets which include “Ryoma-den” per day and per hour respectively. “Ryoma-den” is a popular TV program broadcasting in Japan Broadcasting Corporation (NHK) at 20 o’clock on every Sunday now. As shown in Fig.8, the number of tweets on every Sunday is obviously larger than those on other day of the week. Also, we can notice that the number of tweets at 20 o’clock is remarkably larger than other time slots. As a result, the word “Ryoma-den” highly depends on time. We are now picking up other typical words that highly depend on either time or location. We hope that by picking up such context-aware words, our context-aware text entry system will be improved. VI. C ONCLUSION AND FUTURE WORK In this paper, we have proposed cyclical tweet collecting system and have collected over half-million geo-tagged tweets written in Japanese for analyzing the relationship between users’ context (location and time) and real inputted words. We have collected 471274 tweets from 15 December 2009. Statistical analysis shows that 5.1% of tweets include landmark words, and 9% of tweets include TV program words. Addtionally, Geographical mapping indicates the evidence of location/time dependence of real inputted words. As a first step, we have focused on Japanese tweets, but this relationship must exist regardless of language. We are now trying to pick up the words with high location dependency by calculating the geographical distribution ratio. Fig. 9. Distribution of a time dependent word: Ryoma-den (per hour) ACKNOWLEDGMENT The work is carried out by joint research program of the NTT Service Integration Laboratories and the National Institute of Informatics. It is performed using the facilities provided by them. R EFERENCES  rTYPE. (2009) Survey of mobile web site. (in Japanese). [Online]. Available: http://release.center.jp/2008/11/0502.html (last access:2009/12/1)  OMRON SOFTWARE. (2009) iwnn. (in Japanese). [Online]. Available: http://www.omronsoft.co.jp/SP/mobile/iwnn/ (last access:2009/12/1)  S. Suematsu, Y. Arakawa, S. Tagashira, and A. Fukuda, “Network-based context-aware input method editor,” in The Sixth International Conference on Networking and Services (ICNS 2010), 7 March 2010, pp. 1–6.  Y. Arakawa, S. Suematsu, S. Tagashira, Y. Yamaguchi, Y. Tanaka, and A. Fukuda, “Implementation of network-based context-aware japanese input method editor,” in IEICE Technical Reports, ser. MoMuC2009-58, vol. 109, no. 380, 21 January 2010, pp. 31–34, (in Japanese).  S. Suematsu, Y. Arakawa, S. Tagashira, and A. Fukuda, “On improvement of response time for network-based context-aware japanese input method editor,” in IEICE General Conference, no. B-15-18, 19 March 2010, (in Japanese).  Yahoo Japan Corporation., “Yahoo! developer network- map,” http:/ /developer.yahoo.co.jp/webapi/map/(last access:2009/12/1), 2009, (in Japanese).  JustSystems Corporation, “Atok.com,” http://www.atok.com/(last access:2009/12/1), 2009, (in Japanese).  Twitter, “Twitter,” http://twitter.com/.  Toshiba, “Net de navi,” http://tvsurf.jp/tv/.