Mining the Query Logs of a Chinese Web Search Engine for Character Usage Analysis

Michael Chau
School of Business, The University of Hong Kong, Pokfulam, Hong Kong
mchau@business.hku.hk

Yan Lu
School of Business, The University of Hong Kong, Pokfulam, Hong Kong
isabellu@business.hku.hk

Xiao Fang
College of Business Admin., The University of Toledo, Toledo, Ohio 43606, USA
xiao.fang@utoledo.edu

Abstract

The use of non-English Web search engines has become prevalent. Given the popularity of Chinese Web searching and the unique characteristics of the Chinese language, it is imperative to conduct studies that focus on the analysis of Chinese Web search queries. In this paper, we report our research on character usage in Chinese search logs from a Web search engine in Hong Kong. By examining the distribution of search query terms, we found that people tended to use more diversified terms and that the usage of characters in search queries was quite different from the character usage of both general online information in Chinese and Web search queries in English. We believe the findings from this study provide some insights for further research in non-English Web searching and will assist in the design of more effective Chinese Website search engines.

Keywords: web searching, character usage, search log analysis

1. Introduction

With the increasing popularity of the World Wide Web as a major information resource for people worldwide, research interest in analyzing search engine logs to understand Web users’ search behavior has rapidly increased. Although many studies have been conducted on the query logs of search engines that are primarily English-based (e.g., Excite and AltaVista), character usage in Web search engines for non-English languages has been less investigated.
As the number of non-English resources on the Web increases dramatically, it is of great importance to study users’ Web searching behavior for non-English content using non-English search engines. Such research will provide important guidance for improving the design of non-English search engines. Moreover, different languages have different characteristics, so previous findings on English search engines may not be applicable to non-English search engines. For example, unlike English, Chinese is a character-based language. In Chinese, most meaningful words are built by combining single characters, and an individual word may not accurately indicate its real meaning in all the search queries it belongs to. Because of this characteristic, traditional data processing methods for English search queries cannot be applied directly to Chinese search queries. In this study, we chose a character-based processing method for analyzing Chinese search queries, an approach that has been widely used in the analysis of Chinese text (Chau et al., 2005b; Li et al., 2002).

In this paper, we report our analysis of the search query logs collected from a Chinese Website search engine called Timway. The rest of the paper is organized as follows. Previous studies on search log analysis and online Chinese are reviewed in Section 2. We propose our research questions in Section 3. The data and the methods we used in this research are discussed in Section 4. The findings of our analysis are presented in Section 5. We conclude the paper in Section 6 with a summary of our study and some future directions.

2. Related Studies

2.1 Studies on Web Search Queries

Large-scale studies on general-purpose English search engines such as Excite and AltaVista began at the end of the 1990s (Jansen et al., 1998; Jansen et al., 2000; Spink et al., 2001; Wolfram et al., 2001; Silverstein et al., 1999).
Interesting findings of these studies include topic trends in Web searching (Spink et al., 2002), sex-related information searching on the Web (Spink et al., 2004), and characteristics of question-format Web queries (Spink & Ozmutlu, 2002).

Another category of Web search analysis examined the search logs of a specific Website or information system. Croft et al. (1995) analyzed search data obtained from THOMAS and found that 88% of all queries contained three or fewer words, far fewer than the number of words contained in queries to a traditional information retrieval system. Jones et al. (1998) studied transaction logs of the New Zealand Digital Library and obtained results similar to those reported in Croft et al. (1995): almost 82% of queries were composed of three or fewer words. Chau et al. (2005a) and Wang et al. (2003) studied search queries submitted to a search engine on a government Website and to a search engine on a university Website, respectively. Both studies found that information search behavior when using a general-purpose search engine was quite different from information search behavior when using a Website-specific search engine.

Few studies have focused on analyzing user information seeking behavior when using a non-English search engine. Hölscher (1998) analyzed search data from Fireball (http://www.fireball.de), a German Web information retrieval system. The data set used in the study contained about 16 million queries and 27 million non-unique terms. Some summary statistics, such as the average length of queries, the use of Boolean operators, and the use of phrase searching, were discussed in the paper. Huang et al. (2001; 2003) analyzed query logs from two Chinese search engines in Taiwan, namely GAIS (gais.cs.ccu.edu.tw) and Dreamer (no longer available).
Instead of studying users’ information needs and searching behaviors from the search logs, they used query logs to suggest terms to users. They also proposed a method to extract search sessions and search queries from proxy server logs. They found that 74% of search sessions contained only one query, which was close to the number reported in the AltaVista study (Silverstein et al., 1999). As with the Fireball study, the major drawback of the Taiwan search engine study is that only limited statistics were provided, and there was no in-depth analysis of query terms and search topics.

Previous studies on Web search queries used a set of similar statistics, focusing on three different levels: sessions, queries, and terms (Spink et al., 2001; 2002; Silverstein et al., 1999; Jansen et al., 1998; Jansen et al., 2000). These statistics allow researchers to compare their findings across different types of search engines at different times.

3. Research Questions

As discussed above, few studies have focused on character usage in Chinese search engines. Most previous Web searching studies focused on English search engines. Although some of them analyzed the characteristics of search terms, their findings may not be applicable to Chinese search engines because of the great differences between Chinese and English. The purpose of our study was to explore the character usage of search terms submitted by Chinese search engine users, and to assist in the design of more effective Chinese Website search engines. We sought to answer the following research questions in our study:

(1) What are the characteristics of the character usage of the search queries submitted to Chinese Website search engines?
(2) How does this character usage compare to that of general online Chinese texts and to that of an English search engine such as Excite?

4. Data and Methods

We collected query logs of the Timway Search Engine (http://www.timway.com).
Timway, a Chinese search engine established in 1997, is primarily designed for searching Web sites in Hong Kong. It supports search queries in both Chinese and English, and indexes Web pages in both languages. The query log used in our study covers a three-month period from December 1, 2003 to March 2, 2004. It consists of 1,255,633 records in total. Each record represents a search query submitted to the search engine and consists of four fields: search query, number of hits, user’s IP address, and timestamp (Chau et al., forthcoming). Out of the 1,255,633 queries, 536,814 are Chinese queries, 641,169 are English queries, and 77,650 are mixed queries. We focus on analyzing Chinese queries in this study. As most of the Chinese queries in our data were originally in Big 5, we used an open-source Java program to convert all queries in GB2312 and GBK into Big 5.

5. Analysis Results

This section begins by calculating the number of Chinese characters per query and comparing it with the number of English words per query reported for Excite by Spink et al. (2001). Next, we compare the character usage of search queries from Timway with that of two other Chinese corpora as well as an English Web search corpus. This is followed by an analysis of the distribution of n-grams (n = 1 to 6) extracted from the Chinese search logs.

5.1 Number of Tokens per Query

The mean number of characters in Chinese queries is 3.380, which is much larger than the mean number of words in English queries reported for Excite (2.16) by Spink et al. (2001).

[Figure 1: Comparison of Number of Tokens per Query. Horizontal axis: Number of Tokens per Query; vertical axis: Percentage of Occurrences; series: pure Chinese, pure English.]

Figure 1 compares the number of tokens per query for the pure Chinese corpus with that for the pure English corpus. Search queries in both languages are generally short.
The curve for pure Chinese queries fluctuates with two crests, which represent two and four characters per query respectively. This indicates that most pure Chinese queries consist of two, three, or four characters. The main reason for this phenomenon lies in the characteristics of the Chinese language. Each single Chinese character corresponds to a morpheme, the smallest meaningful linguistic unit that cannot be divided into smaller meaningful parts. Although some meaningful words consist of one character, most meaningful words in Chinese consist of at least two characters. As shown in Figure 1, bigrams, trigrams, and 4-grams are the most frequently used search terms in Chinese.

5.2 Character Usage of Chinese Search Queries

5.2.1 Comparison with Chinese corpora

To study the character usage of Chinese search queries, we ranked the unigram characters in the search log queries by frequency and compared the top 50 with those of the MTSU corpus (Da, 2004) and those of the Usenet Newsgroups corpus (Tsai, 1996).

[Figure 2: Comparison of Top 50 Terms Distribution with the Timway Search Log Data, the Usenet Newsgroup Corpus, and the MTSU Corpus. Horizontal axis: Number of Top Terms; vertical axis: Percentage of Total Occurrences; series: Timway, Newsgroup, MTSU.]

Figure 2 illustrates the distribution of the top 50 unigram characters from Timway, Newsgroup, and MTSU. A comparison of the three curves shows that the cumulative percentage of total occurrences for Newsgroup grows at a slightly faster rate than that of MTSU and at an obviously faster rate than that of Timway. The Timway curve has the lowest growth rate among them, indicating that Web search queries contain a much larger variety of terms than large Chinese texts in general, even though the general Chinese texts were also collected from online resources.
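The statistics discussed so far (characters per query, character n-grams, and the cumulative share of the top-ranked unigrams plotted in Figure 2) can be illustrated with a minimal Python sketch. This is not the program used in the study: the function names, the choice of overlapping n-grams, and the toy queries are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import accumulate


def char_ngrams(query: str, n: int) -> list[str]:
    """Return overlapping character n-grams of a query, ignoring whitespace.

    Overlapping windows are an assumption; the paper does not state
    how its n-grams were segmented.
    """
    chars = [c for c in query if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def cumulative_share(queries: list[str], n: int, top_k: int) -> list[float]:
    """Cumulative fraction of all n-gram occurrences covered by the top-k n-grams."""
    counts = Counter(g for q in queries for g in char_ngrams(q, n))
    total = sum(counts.values())
    if total == 0:
        return []
    top = [count for _, count in counts.most_common(top_k)]
    return [running / total for running in accumulate(top)]


# Toy queries invented for illustration; they are NOT taken from the Timway log.
sample_queries = ["中國", "中國電話", "香港電話", "香港"]

# Mean number of characters per query (the statistic reported in Section 5.1).
lengths = [len(char_ngrams(q, 1)) for q in sample_queries]
mean_length = sum(lengths) / len(lengths)

# Bigram frequencies across all queries.
bigram_counts = Counter(g for q in sample_queries for g in char_ngrams(q, 2))

# Cumulative share of the top 3 unigrams (Figure 2 uses the top 50).
coverage = cumulative_share(sample_queries, n=1, top_k=3)
```

A slowly rising `coverage` list corresponds to a flatter curve in Figure 2, that is, to occurrences spread across a wider variety of characters.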
We identified the overlapping unigrams and bigrams between the Timway corpus and the MTSU corpus separately, within the top 5,000 unigrams and the top 5,000 bigrams of each corpus. Figure 3 presents the result.

[Figure 3: Overlapping Unigrams, Bigrams between Timway and MTSU. Horizontal axis: Number of Top Characters; vertical axis: Number of Overlapping Characters; series: Overlapping Bigrams, Overlapping Unigrams.]

For unigram characters, the two corpora have 3,056 overlapping characters, or 61.12% of the 5,000 single characters. The curve for overlapping bigrams increases much more gently than the one for overlapping unigrams. Only 698 of the top 5,000 bigrams from Timway (13.96%) appear in the top 5,000 bigrams from MTSU. The great discrepancy between these two curves implies that although the two corpora share many unigrams, they have few bigrams in common, and bigrams are more topic-related. Among the top 100 most frequently occurring bigrams of the two corpora, there are only two overlapping words: “中國” (China), which ranks 9th in the Timway bigram list and 60th in the MTSU list, and “電話” (telephone), which ranks 28th in the Timway list and 78th in the MTSU list. The extremely limited overlap between these two corpora further indicates that the language of search engine queries has its own characteristics, especially for Chinese search engines, which accept a large number of queries in Chinese. This warrants further study of the character usage of Chinese search queries.

5.2.2 Comparison with English corpus

To further our analysis, we compare the distribution of the top 50 unigrams of Timway search queries with that of Excite. The Excite data used for comparison is reported in Spink et al. (2001). The data they analyzed consisted of a transaction log of 1,025,910 queries submitted by 211,063 users on 16 September 1997.
There were 1,277,763 search terms in all unique queries. As shown in Figure 4, the top 50 unigrams from Timway represented 25.25% (510,570) of all unigram occurrences, approximately 2.5 times the share in the Excite data, in which the top 50 unigrams represented only 10.15% of all unigram occurrences. (The percentage we obtained here differs from the percentage reported by Spink et al. (2001) because they used unique queries for the calculation while we use all the queries in this study.) This visible discrepancy is largely due to the limited number of Chinese characters. We found that the 536,814 Chinese queries in our data contain only 7,303 unique Chinese characters. This finding is very different from the Excite corpus, which has 140,279 unique English terms in approximately 1 million queries in total (Spink et al., 2001).

[Figure 4: Comparison of Accumulation Curves for Top 50 Unigrams with the Timway Search Log Data and the Excite Search Log Data. Horizontal axis: Number of Top Unigrams; vertical axis: Percentage of Total Occurrences; series: Timway, Excite.]

6. Discussions

This article is among the few studies investigating the characteristics of Web searching in Chinese, and it reveals some valuable findings on non-English information search behavior. By comparing the character usage of our corpus with that of two corpora obtained from general online texts, we found that our data had the lowest growth rate in the cumulative distribution of occurrences. We speculate that people tend to use more diversified terms when searching online. The limited overlap in high-frequency bigrams between our corpus and the MTSU corpus further suggests that the usage of characters in search queries is quite different from that in general online information. The research also identified a discrepancy between the distribution of our corpus and that of the Excite corpus (Spink et al., 2001).
The high share of occurrences accounted for by the most frequent terms in the Timway corpus (25.25%), compared with the relatively low share in the Excite corpus (10.15%), implies that search engine designers could organize the classification of Website content and the index items for quick searching in a more precise and easy-to-use way. In general, the language of Chinese Web queries has its own unique characteristics, so it is more advisable for search engine designers to extract potential search topics by analyzing the search terms submitted by users than by analyzing the texts available online.

7. Conclusion and Future Directions

In this paper, we report our study on the character usage of a Chinese Website search engine. Our research has identified several characteristics of the search terms in our corpus by comparing it with other corpora from different perspectives. Because of the lack of studies on search logs in Chinese, we were unable to undertake comparisons with analogous corpora, which limited our research scope. More studies on search logs in Chinese and other non-English languages are highly desirable.

Acknowledgements

This research has been supported in part by a Seed Funding for Basic Research (PI: M. Chau) granted by the University of Hong Kong. We would like to thank Timmy Yu from Timway Hong Kong Search Engine Limited for his help in providing the search log data used in this study. We also thank Jackey Ng and Raygen Lam from the University of Hong Kong for their help in data processing.

References

Chau, M., Fang, X., and Liu Sheng, O. R., “Analysis of the Query Logs of a Web Site Search Engine”, Journal of the American Society for Information Science and Technology, (56:13), 2005a, pp. 1363-1376.
Chau, M., Qin, J., Zhou, Y., Tseng, C., and Chen, H., “SpidersRUs: Automated Development of Vertical Search Engines in Different Domains and Languages”, in Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, Colorado, USA, June 7-11, 2005b.
Chau, M., Fang, X., and Yang, C., “Web Searching in Chinese: A Study of a Search Engine in Hong Kong”, Journal of the American Society for Information Science and Technology, forthcoming.
Croft, W. B., Cook, R., and Wilder, D., “Providing Government Information on the Internet: Experiences with THOMAS”, in Proceedings of the Digital Libraries ’95 Conference, Austin, Texas, 1995, pp. 19-24.
Da, J., Chinese Text Computing. [Online], retrieved from: http://lingua.mtsu.edu/chinese-computing on Nov 05, 2005.
Hölscher, C., “How Internet Experts Search for Information on the Web”, in Proceedings of the World Conference of the World Wide Web, Internet, and Intranet, Orlando, Florida, USA, 1998.
Huang, C. K., Chien, L. F., and Oyang, Y. J., “Relevant Term Suggestion in Interactive Web Search Based on Contextual Information in Query Session Logs”, Journal of the American Society of Information Science and Technology, (54:7), 2003, pp. 638-649.
Huang, C. K., Oyang, Y. J., and Chien, L. F., “A Contextual Term Suggestion Mechanism for Interactive Search”, in Proceedings of The First Web Intelligence Conference (WI’2001), Japan, 2001, pp. 272-281.
Jansen, B. J. and Pooch, U., “Web User Studies: A Review and Framework for Future Work”, Journal of the American Society of Information Science and Technology, (52:3), 2000, pp. 235-246.
Jansen, B. J., Spink, A., Bateman, J., and Saracevic, T., “Real Life Information Retrieval: A Study of User Queries on the Web”, ACM SIGIR Forum, (32:1), 1998, pp. 5-17.
Jansen, B. J., Spink, A., and Saracevic, T., “Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web”, Information Processing and Management, (36), 2000, pp. 207-227.
Jones, S., Cunningham, S. J., and McNab, R., “Usage Analysis of a Digital Library”, in Proceedings of the Third ACM Conference on Digital Libraries, Pittsburgh, PA, USA, June 1998, pp. 293-294.
Li, Y., Ding, X., and Tan, C. L., “Combining Character-based Bigrams with Word-based Bigrams in Contextual Postprocessing for Chinese Script Recognition”, ACM Transactions on Asian Language Information Processing, (1:4), 2002, pp. 297-309.
Silverstein, C., Henzinger, M., Marais, H., and Moricz, M., “Analysis of a Very Large Web Search Engine Query Log”, ACM SIGIR Forum, (33:1), 1999, pp. 6-12.
Spink, A., and Ozmutlu, H. C., “Characteristics of Question Format Web Queries: An Exploratory Study”, Information Processing and Management, (38), 2002, pp. 453-471.
Spink, A., Ozmutlu, H. C., and Lorence, D. P., “Web Searching for Sexual Information: An Exploratory Study”, Information Processing and Management, (40), 2004, pp. 113-123.
Spink, A., Wolfram, D., Jansen, B. J., and Saracevic, T., “Searching the Web: The Public and Their Queries”, Journal of the American Society for Information Science and Technology, (52:3), 2001, pp. 226-234.
Tsai, C.-H., “Frequency and Stroke Counts of Chinese Characters”. [Online], retrieved from: http://technology.chtsai.org/charfreq/ on May 20, 2005.
Wang, P., Berry, M. W., and Yang, Y., “Mining Longitudinal Web Queries: Trends and Patterns”, Journal of the American Society for Information Science and Technology, (54:8), 2003, pp. 743-758.
Wolfram, D., Spink, A., Jansen, B. J., and Saracevic, T., “Vox Populi: The Public Searching of the Web”, Journal of the American Society for Information Science and Technology, (52:12), 2001, pp. 1073-1074.