International Research Journal of Computer Science and Information Systems (IRJCSIS) Vol. 2(2) pp. 34-39, March, 2013
Available online http://www.interesjournals.org/IRJCSIS
Copyright © 2013 International Research Journals

Full Length Research Paper

Arabic users’ attitudes toward Web searching using paraphrasing mechanisms

Abeer ALDayel1 and Mourad Ykhlef2
1 Department of Information Technology, CCIS, King Saud University Riyadh, Kingdom of Saudi Arabia
2 Department of Information System, CCIS, King Saud University Riyadh, Kingdom of Saudi Arabia

Accepted March 29, 2013

The main purpose of this article is to study the Arabic web users’ behaviors toward search engines and the need for query paraphrasing as an enhanced tool for Arabic information retrieval. We need to know more about how Arabic users search the Web and what information they need from search engines in order to improve the effectiveness of Arabic information retrieval. Hundreds of online Arabic users responded to an online survey. The users provided information on the type of search engines they used (Arabic and/or Multilanguage), search frequency, and the factors affecting their usage of search engines. The results show that when searching the Web to find information in Arabic, users tend to use query paraphrasing while searching and that they mostly use Multilanguage search engines to find information in Arabic instead of using Arabic-specific search engines. As far as we know, this study is the first to investigate Arabic Web users’ searching needs and behaviours from a user perspective. We believe that the findings from this study will provide some insight for further research regarding Arabic Web searching.

Keywords: Arabic Web searching, information retrieval, query paraphrasing.

INTRODUCTION

The Arabic language is the religious language of more than 1.5 billion Muslims worldwide.
It is ranked 6th among the top 10 languages on the Internet, including English, Spanish, Japanese, German, French, Russian, Korean, and Chinese (see Figure 1). Figure 2 shows that about 17.5% of all Internet users speak Arabic (Internet World Stats, 2011). However, while the number of Arabic Web users is growing rapidly, Arabic language search engines, which would target this growing audience, are still very underdeveloped.

The information retrieval process is affected by the language used and how a search engine handles the characteristics of this language (Moukdad, 2004). The Arabic language has around 5 million words that are derived from approximately 11,300 roots, compared to 400,000 key words and a total of 1.3 million words in English. Therefore, the Arabic language has almost three times more words than English because it has hundreds of derivatives that can be taken from a single root (Al-Maimani et al., 2011).

*Corresponding Author Email: aabeer@ksu.edu.sa

Another distinct property of the Arabic language is polysemy, which means that words can have multiple meanings. This raises many challenges in the retrieval process, as users may not have an effective search experience or may quickly become discouraged and not invest additional effort into thinking of other key words that would help their search. Another problem of using the Arabic language online is that different regions within the Arab World use different words for the same thing, which makes it difficult to find and retrieve relevant documents on the Web. For example, for the term “newspaper,” the word “صحيفة” is used in some Arabic countries while the word “جريدة” is used in others (Al-Maimani et al., 2011).

Information retrieval researchers have realized that it is hard for Web users to formulate effective search requests.
Also, most of the available search engines often return search results that have no relation to the user’s request, meaning that many of the results that contain the words used in the query are not relevant. A hypothesis emerged stating that “the search is likely to be more accurate and precise if it is based on meanings rather than on words” (Apresjan et al., 2009). Paraphrasing is one way to improve the retrieval process and to move the querying to a new level, “the meaning level”. Query paraphrasing for information retrieval is also known as query reformulation and can be described as a restatement of a text. This can be accomplished by replacing words with their synonyms, using hyponyms, or changing the order of the words.

In this article, we investigate the need for Arabic search engines and users' overall perceptions of query paraphrasing as a query refinement approach. A total of 463 survey responses were collected from an online questionnaire, and the answers of the respondents helped us to gain an understanding of how Arabic users’ experiences affect the individual use of search engines.

The remainder of this article is structured as follows: First, we present some background information about the Arabic language and Web search engines. Next, we present the related research. Then, we outline the methods and data used in this research. Next, we discuss the results and the analysis strategy. Finally, we summarize our study and suggest some future directions for research.

Arabic language specifications

Figure 1. Top 10 languages on the Internet (Internet World Stats, 2011)

Figure 2. Internet use of Arabic speakers (Internet World Stats, 2011)

Certainly, the current search engines such as Google or Yahoo offer an efficient way to browse Arabic web content. However, the retrieval quality is highly reliant upon the users developing potential query paraphrases and searching the retrieved documents to determine which are relevant.
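The paraphrasing operations mentioned in the Introduction (replacing words with synonyms and changing word order) can be illustrated with a minimal Python sketch. It is not a method proposed in this article: `SYNONYMS` is a hypothetical toy lexicon that contains only the صحيفة/جريدة ("newspaper") pair cited above, `paraphrase_query` is a name introduced here for illustration, and the reordering step is naive, since changing word order does not always preserve meaning in Arabic.

```python
from itertools import permutations, product

# Hypothetical toy lexicon: maps a word to its interchangeable regional variants.
# Only the "newspaper" pair from the Introduction is included.
SYNONYMS = {
    "صحيفة": ["جريدة"],
    "جريدة": ["صحيفة"],
}


def paraphrase_query(query, synonyms=SYNONYMS, reorder=False):
    """Return candidate paraphrases of a whitespace-tokenized query.

    Each word may be replaced by any of its listed synonyms. If reorder is
    True, every word order of each variant is produced as well. The original
    query itself is excluded from the result.
    """
    words = query.split()
    original = " ".join(words)
    # For each position: the original word followed by its synonyms.
    choices = [[w] + synonyms.get(w, []) for w in words]
    candidates = []
    for variant in product(*choices):
        orders = permutations(variant) if reorder else [variant]
        for order in orders:
            text = " ".join(order)
            if text != original and text not in candidates:
                candidates.append(text)
    return candidates
```

For the illustrative two-word query "صحيفة الرياض", synonym replacement alone yields the single candidate "جريدة الرياض". The number of reorderings grows factorially with query length, so a practical system would need to restrict or rank the candidates.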
Arabic is the fourth most widely spoken language worldwide after Chinese, Spanish, and English. There are more than 221 million speakers of Arabic from 57 different countries, which is about 3.2% of the world’s population (Lewis et al., 2009). With the spread of Islam at the beginning of the 7th century, Arabic also spread rapidly throughout the Middle East and North Africa, as it is the language of Islam and occupies a sacrosanct place in the religious psyche of Muslims. The Arabic script evolved from the Nabataean–Aramaic model, and its vocabulary is continuing to evolve through the integration of words from various dialects and the creation of new words necessary to express new concepts or objects (European Commission, 2011).

Arabic is considered to be the largest member of the Semitic language branch. The Semitic languages are distinguished by their nonconcatenative morphology, in which word roots are not themselves syllables or words, but instead are isolated sets of consonants. Words are formed from roots by filling in the vowels between the root consonants. For example, in Arabic, the root meaning "eat" has the form A – K – L. From this root, words such as AKLAH ("food") and yaAKL ("he eats") are formed (European Commission, 2011).

There are three forms of Arabic: classical, modern standard, and colloquial (spoken). Classical and modern standard Arabic are both called al-fusha (الفصحى), which means “the most eloquent.” In pre-Islamic times in Arabia, classical Arabic was the most prevalent form of the language. With the rise of Islam, classical Arabic became the prominent language of scholarship and religious devotion as it was the language of the Qur'an. Modern standard Arabic is derived from classical Arabic and is often used in conversations between Arabic speakers of different dialects but not in ordinary conversation. Most printed matter in Arabic books and newspapers is written in modern standard Arabic.
Colloquial (spoken) Arabic refers to different dialects of Arabic that vary from country to country or between regions within a country (Lewis et al., 2009).

Arabic consists of 28 letters and is written from right to left. The written form is based on consonants (letters) and vowel signs (diacritics), which are marks placed above and below the letters to indicate proper pronunciation (Moukdad and Large, 2007). One of the most challenging characteristics of Arabic is that some conjunctions are not separated from the following word by a space, which results in a large number of entries being clustered together alphabetically in index files. Another distinct characteristic is that Arabic plurals are formed irregularly through a complete reformulation of the word (Moukdad and Large, 2007).

Web search engines

Information retrieval technology provides knowledge management that guarantees access to large corpora of unstructured text and is the basic technology behind Web search engines (Mandl, 2008). Web search engines emerged in response to the rapidly growing number of Web sites in the mid-1990s. Alan Emtage created Archie, the first tool for Internet searching. Archie was based on creating a database of filenames for all files located on public FTP sites. Mark McCahill, at the University of Minnesota, then created Gopher, which was based on indexing documents’ plain text instead of the file name. In 1993, Matthew Gray developed the first real search engine, called Wandex. This program was the first to crawl the Web, index files, and allow users to search for them. Another one of the first search engines was Aliweb, which was developed by the United Kingdom’s Martijn Koster in 1993 and is still in use today (Inkpen, 2007). By analyzing various points of Website entry, it can be seen that search engines receive the most Internet traffic.
Figure 3 compares the types of visitor sources for Websites: search engines, direct entry, and referring site links. Eleven percent of Internet users type a specific Website address directly into their browser. Six percent of Internet users access Websites via a link from a different site. Eighty-three percent of Internet users use search engines to search for a Website; this percentage demonstrates the necessity of search engines as the main portal for the Internet (OWL STAT, 2013a).

Figure 3. Analysis of various points of Website entry

Figure 4. Search engine usage statistics

In 2000, Google's search engine was dominating the Internet search market. With an innovation called “page ranking” and the concept of “link popularity,” Google was able to provide better results compared with other search engines. Based on Inktomi's search engine model, Yahoo established a search engine in 2004 using the combined technologies of its acquisitions. Microsoft had also launched the MSN search engine in the fall of 1998 using search results from Inktomi. In 2004, Microsoft started using its own crawler called MSNBOT. In 2009, Bing was launched by Microsoft (Kuyoro Shade, Okolie Samuel, and Kanu Richmond, 2012). Figure 4 shows the market share for the top search engines in use today. This analysis shows that Google is the most utilized search engine, with 81.11% of the market (OWL STAT, 2013b).

Related work

The need for analyzing the quality of the current search engines is important and has been discussed in many studies. These studies can be divided into three categories: analyzing the search engines’ log files, examining Web users’ behaviours, and evaluating the performance of the existing search engines. The most popular approach is to analyze the log file of each search engine. This approach falls under Web usage mining and focuses on how users use the search engines on the Web to satisfy their information needs.
Several papers have been published in this area, including the work of Pu, Chuang, and Yang (2002) and Chau, Fang, and Yang (2007). The study conducted by Chau et al. (2007) analyzed the search-query logs of a search engine that focused on Chinese Web users.

Many researchers have investigated the searching behaviour of Web users. A number of studies have looked at English search engines, such as the work of Spink, Bateman, and Jansen (1999), who investigated Web users’ searching behaviours on the EXCITE Web search engine. They focused on successive searching behaviour, in which a user conducts related searches over time on the same or an evolving topic, using an interactive survey accessed through EXCITE’s homepage. In addition, Liaw and Huang (2003) developed an individual attitude model toward search engines and examined its effectiveness via a questionnaire distributed to 120 search engine users. They found that computer experience and quality of the search systems can predict individual motivation and technology acceptance toward search engines. In 2006, these same researchers used a questionnaire with 161 responses to study individual information retrieval behaviours and their evaluations of the usability and effectiveness of search engines.

Another line of investigation has been the non-English searching behaviour of Web users. Zhang and Lin (2007) and Bar-Ilan and Gutman (2003) investigated the multiple language support features of specific Internet search engines. Bar-Ilan and Gutman (2003) examined four languages: Russian, French, Hungarian, and Hebrew. For each language, a set of search terms was run on three general search engines (AltaVista, FAST, and Google) as well as some local search engines. In addition, a few studies have investigated the searching behaviour of Arabic Web users.
Moukdad (2004) compared the performance of three general Arabic search engines based on their ability to retrieve morphologically related Arabic terms. Tawileh et al. (2010) evaluated the performance of Arabic information retrieval by using 50 randomly selected Arabic queries to test 5 Web search engines: Araby, Ayna, Google, MSN, and Yahoo.

Most of the studies to date have focused on English search engines. However, the information needs and search behaviours of Arabic users can be very different from those of English users because of the different natures of the languages and cultural differences. To address this gap in the research, we present a study of overall Arabic Web user behaviour toward search engines.

RESEARCH DESIGN

The online questionnaire was written in Arabic and directed at Arabic online users. The questionnaire was composed of three components: search engine usability, search engine effectiveness, and the need for query paraphrasing in Arabic information retrieval. It also collected demographic information such as gender, age, and educational status. To measure how satisfied Arabic users are with current search engines, the users were also asked about the type of search engine they usually used (i.e. Arabic-specific or Multilanguage) and how frequently they used search engines to search for Arabic content.

The questionnaire was designed to collect data related to possible factors affecting the usage of search engines. In general, an effective information retrieval system should perform well on two measures: precision and recall. Precision can be defined as the fraction of retrieved documents that are relevant to a user’s query, and recall is the fraction of the relevant documents that are successfully retrieved. Prior researchers have offered evidence that these measurements are important factors for determining the efficiency of a search engine’s retrieval capabilities.
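The two measures defined above can be made concrete with a short Python sketch. It follows the standard set-based definitions; the function name `precision_recall` and the document identifiers in the usage note are hypothetical and are not taken from the questionnaire data.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: iterable of document ids returned by the search engine.
    relevant:  iterable of document ids judged relevant to the query.
    Returns a (precision, recall) tuple; an empty set yields 0.0 for
    the corresponding measure to avoid division by zero.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if an engine returns four documents (d1 to d4) and the relevant set is {d2, d4, d7}, two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).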
The questionnaire also included two multiple choice questions to examine the third component (the need for query paraphrasing in Arabic information retrieval). The first question focused on discovering the user’s behaviour if (s)he did not obtain the search result (s)he was looking for. The second question was to discover the user’s paraphrasing capability if (s)he wanted to conduct a search using the term “عدد سكان منطقة الرياض”, meaning the “number of people in Riyadh area”.

Data collection

To calculate the required sample size for the survey, we used the approach proposed by Cochran (1977). The sample size n and the margin of error E are given by Equations 2 and 3, with the intermediate quantity X defined in Equation 1:

$$X = Z\left(\frac{c}{100}\right)^{2} r\,(100 - r) \tag{1}$$

$$n = \frac{N X}{(N - 1)E^{2} + X} \tag{2}$$

$$E = \sqrt{\frac{(N - n)\,X}{n\,(N - 1)}} \tag{3}$$

where N is the population size, r is the fraction of responses (in percent), and Z(c/100) is the critical value of the standard normal distribution for the confidence level c. We set the fraction of responses to the most conservative assumption, which is 50%.

The goal of the questionnaire was to measure the Arabic users’ perceptions toward search engines; in order to fulfill this goal, we used Equations 1 and 2 with a population size equal to the number of Arabic online users, which is 347,002,991 (Internet World Stats, 2011). The confidence level was 95%, for which Z equals 1.96. The margin of error was 4.56%; the recommended sample size was 463. According to this sample, there is a 95% chance that we are within the margin of error of the correct answer.

RESULTS

All data were analyzed using the Statistical Package for Social Sciences (SPSS, 16.0). Overall, 463 Arabic online users participated in the study; 31% of the participants were between the ages of 26 and 36, as shown in Figure 5. The users were 23% female and 77% male.
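As a numerical check on the sample-size formulas from the Data collection section, Equations 1 to 3 can be written as two short Python functions. This is a sketch rather than the authors' actual calculation: the function names are introduced here for illustration, and it assumes that r and E are expressed in percent, as the term r(100 − r) in Equation 1 suggests.

```python
import math


def sample_size(population, margin_pct, z=1.96, response_pct=50.0):
    """Equations 1 and 2: recommended sample size for a finite population.

    margin_pct and response_pct are in percent (assumption, see lead-in).
    """
    x = z ** 2 * response_pct * (100.0 - response_pct)  # Equation 1
    return population * x / ((population - 1) * margin_pct ** 2 + x)  # Equation 2


def margin_of_error(population, n, z=1.96, response_pct=50.0):
    """Equation 3: margin of error, in percentage points, for sample size n."""
    x = z ** 2 * response_pct * (100.0 - response_pct)  # Equation 1
    return math.sqrt((population - n) * x / (n * (population - 1)))


# Population of Arabic online users reported in the Data collection section.
ARABIC_ONLINE_USERS = 347_002_991
```

Calling `margin_of_error(ARABIC_ONLINE_USERS, 463)` and `sample_size(ARABIC_ONLINE_USERS, 4.56)` gives values that can be compared with the margin of error and sample size reported in the Data collection section. Because the population is so large, the finite-population terms have almost no effect on the result.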
Table 1 presents a detailed summary of the questionnaire responses for the first component (search engine usability), represented by frequency and percentage. From the statistical analyses, the results show that 73% of the Arabic Web users used search engines on a daily basis. This indicates the importance of search engines as an information retrieval tool. Also, most Arabic Web users were not very satisfied with the Arabic search engines: 88% of users tended to use Multilanguage search engines to search for Arabic content, and only 12% of the users used Arabic-specific search engines. In addition, 55% of users mainly used the search engines to search for Arabic content, while 40% were looking for combined Arabic and English Web content. When asked to evaluate the current search engines’ support for the Arabic language, 47% of users rated it as good, while 59% stated that they only sometimes found what they were looking for when using search engines to search for Arabic content.

Figure 6 represents the Arabic user’s behavior when (s)he received no relevant search results. Query paraphrasing was the most common strategy (it was used by 84% of Arabic Web users). The questionnaire asked the user to determine possible query paraphrases for the original query “عدد سكان منطقة الرياض”. Only 12% of the users were able to come up with possible options for query paraphrasing for the original query (Figure 7).

From the Arabic Web user’s point of view, the most important factors of search engine usage are shown in Figure 8. The factors that the Arabic Web users gave are listed in the following order of priority:
(1) Large number of retrieved documents
(2) Precision of results
(3) Fame and reputation of search engine name
(4) Availability of search support tools (e.g., correction, auto prediction, etc.)

Figure 5. Age structure of questionnaire respondents

Table 1. Summary of questionnaire results

Question and options                                      Response count   Percentage
Search engine usage rate
  Daily                                                   339              73%
  Every 2 or 3 days                                       77               17%
  Sometimes                                               47               10%
  Never                                                   0                0%
Type of search engine used to search for Arabic content
  Arabic search engine                                    56               12%
  Multilanguage search engine                             407              88%
Searching for Arabic content frequency rate
  Always                                                  352              76%
  Sometimes                                               104              22%
  Never                                                   7                2%
Type of content Arabic users mostly use search engine to search for
  Arabic language                                         255              55%
  English language                                        11               2%
  Arabic and English languages                            185              40%
  Multiple languages                                      12               3%
The results of search engine match for what the user was looking for
  Always                                                  180              39%
  Sometimes                                               271              59%
  Never                                                   7                2%
  Don’t know                                              5                1%
Evaluation of the services that the search engines offered to support the Arabic content search
  Excellent                                               182              39%
  Good                                                    216              47%
  Acceptable                                              58               13%
  Weak                                                    7                2%

Figure 6. User behavior when (s)he does not retrieve suitable results

Figure 7. Query paraphrasing for the original query (عدد سكان منطقة الرياض)

Figure 8. Factors affecting the usage of a search engine

CONCLUSION

This article has shed some light on Arabic Web searching from the user’s perspective. A total of 463 Arabic user responses were collected by an online questionnaire. Based on the questionnaire results, the users tend to use query paraphrasing during the search session. The task of paraphrasing queries is still complicated for the majority of non-expert users, who have difficulty expressing their needs through an accurate query. This gives insight into the need for automatic query paraphrasing in search engines. It is clear that the user sees a large number of retrieved documents as an important factor for search engine usage.
Also, this article indicates that Arabic users rarely use Arabic-specific search engines to retrieve Arabic content; rather, they use multilanguage search engines. This result illustrates the need for improvement of the Arabic search engines. In the future, we plan to build a query paraphrasing technique that will improve information retrieval and search capabilities for Arabic Web documents.

ACKNOWLEDGMENTS

This work was supported by the Research Center of the College of Computer and Information Sciences at King Saud University. The authors are grateful for this support.

REFERENCES

Al-Maimani MR, Naamany AA, Bakar AZA (2011). Arabic Information Retrieval: Techniques, tools and challenges. In GCC Conference and Exhibition (GCC), 2011 IEEE (pp. 541-544).
Apresjan J, Boguslavsky I, Iomdin L, Cinman L, Timoshenko S (2009). Semantic Paraphrasing for Information Retrieval and Extraction. Flexible Query Answering Systems, 5822/2009, 512-523.
Bar-Ilan J, Gutman T (2003). How do search engines handle non-English queries? A case study. In (pp. 415-424).
Chau M, Fang X, Yang CC (2007). Web searching in Chinese: A study of a search engine in Hong Kong. JASIST, 58, 1044-1054.
Cochran WG (1977). Sampling Techniques. John Wiley and Sons.
European Commission (2011). Lingua Franca: Chimera or Reality. Studies on Translation and Multilingualism.
Inkpen D (8-12-2007). Information Retrieval on the Internet. http://www.si-te.uottawa.ca/-diana/csi4107L 1.
Internet World Stats (2011). Internet World Stats. Miniwatts Marketing Group. http://internetworldstats.com/
Kuyoro Shade O, Okolie Samuel O, Kanu Richmond U (2012). Trends in Web-Based Search Engine. CIS Journal, 3, 942-948.
Lewis MP, Grimes BF, Simons GF, Huttar G (2009). Ethnologue: Languages of the world. (vols. 9) SIL International, Dallas, TX.
Liaw SS, Huang HM (2003). An investigation of user attitudes toward search engines as an information retrieval tool. CHB, 19, 751-765.
Mandl T (2008). Recent developments in the evaluation of information retrieval systems: Moving towards diversity and practical relevance. INFORMATICA-LJUBLJANA-, 32, 27.
Moukdad H (2004). Lost in cyberspace: How do search engines handle Arabic queries. In the 32nd Annual Conference of the Canadian Association for Information Science (pp. 1-7).
Moukdad H, Large A (2007). Information Retrieval from Full-Text Arabic Databases: Can Search Engines Designed for English Do the Job? Libri, 51, 63-74.
OWL STAT (2013a). Analysis of various points of web site entry. Retrieved 1-2-2013, from http://www.statowl.com/network_visitor_source.php?1=1andtimeframe=last_6andinterval=monthandchart_id=4andfltr_br=andfltr_os=andfltr_se=andfltr_cn=andtimeframe=last_12
OWL STAT (2013b). Search Engine Usage Statistics. Retrieved 2-2-2013, from http://www.statowl.com/search_engine_market_share.php?1=1andtimeframe=last_6andinterval=monthandchart_id=4andfltr_br=andfltr_os=andfltr_se=andfltr_cn=andtimeframe=last_12
Pu HT, Chuang SL, Yang C (2002). Subject categorization of query terms for exploring Web users' search interests. JASIST, 53, 617-630.
Spink A, Bateman J, Jansen BJ (1999). Searching the Web: A survey of Excite users. Internet Research, 9, 117-128.
Tawileh W, Mandl T, Griesbaum J (2010). Evaluation of Five Web Search Engines in Arabic Language. In Proceedings of LWA (pp. 221-228).
Zhang J, Lin S (2007). Multiple language supports in search engines. Online Information Review, 31, 516-532.