Spelling and Grammar Checking Using the Web as a Text Repository

Kai A. Olsen
Molde College, Norway, kai.olsen@himolde.no

James G. Williams
School of Information Sciences, University of Pittsburgh, USA, jimw@sis.pitt.edu, 412-980-3276

1. Introduction

Natural languages are both complex and dynamic. They are in part formalized through dictionaries and grammar. Dictionaries attempt to provide definitions and examples of various usages for all the words in a language. Grammar, on the other hand, is the system of rules that defines the structure of a language and is concerned with the correct use and application of the language in speaking or writing.

The fact that these two mechanisms lag behind the language as currently used is not a serious problem for those living in a language culture and speaking their native language. However, the correct choice of words, expressions and word relationships is much more difficult when speaking or writing in a foreign language. The basics of the grammar of a language may have been learned in school decades ago, and even then there were always several choices for the correct expression of an idea, fact, opinion or emotion.

Although many different parts of speech and their relationships can make for difficult language decisions, prepositions tend to be especially problematic for non-native speakers of English and, in reality, prepositions are a major problem in most languages. Does a speaker or writer say “in the West Coast” or “on the West Coast”, or perhaps “at the West Coast”? In Norwegian we are “in” a city, but “at” a place. But the distinction between cities and places is vague. To be absolutely correct, one really has to learn the right preposition for every single place.

A simple way of resolving these language issues is to ask a native speaker. But even native speakers may disagree about the right choice of words.
If there is disagreement, one will have to ask more than one native speaker, treat each response as a vote for a particular choice and perhaps choose the majority choice as the best possible alternative. In real life such a procedure may be impossible or impractical, but in the electronic world, as we shall see, it is quite easy to achieve. Using the vast text repository of the Web, we may get a significant voting base for even the most detailed and distinct phrases.

We shall start by introducing a set of examples to present our idea of using the text repository on the Web to aid in making the best word selection, especially for the use of prepositions. Then we will present a more general discussion of the possibilities and limitations of using the Web as an aid for correct writing.

2. Examples – using the number of references returned

For a non-native writer a typical problem will be to choose the right preposition in a sentence such as: “We were living in/on/at the West Coast”. A grammar checker will accept all alternatives. Even if any of these choices can be used, one of them may be more natural, or at least more acceptable, than the others. To get a vote we can present the different alternatives to a Web search engine, putting each phrase in quotation marks so that we only get instances where the words appear in the given order. Using Google [1] we get the following results:

Phrase                   “Vote”
“on the West Coast”      468,000
“in the West Coast”       26,100
“at the West Coast”       10,500

[1] Google™ is found at www.google.com

Thus, the preposition “on” seems to be the best choice in this example, based on the Web usage of the phrase.

We have an office “in a building”, but do we stay “in” a hotel or “at” a hotel? The Web gives the answer:

Phrase                   “Vote”
“in a hotel”             334,000
“at a hotel”             115,100

Here both alternatives seem to be in use, but there is a 3 to 1 vote for “in” a hotel. If we want to be sure, we can check a longer phrase such as “stayed in a hotel” (Table 3).
Phrase                   “Vote”
“stayed in a hotel”      6,820
“stayed at a hotel”      3,770

These results show a vote of approximately 2 to 1 for “in”. As we could expect, the more complex phrase reduces the number of votes. However, the Web offers such a vast text repository that we can check even more detailed phrases and still get statistically useful answers:

Phrase                   “Vote”
“in New York Hilton”      76
“at New York Hilton”     574

In this case we see that “at” seems to be the best alternative when using a specific hotel name, whereas “in” is the better alternative when using the more generic type of place, hotel.

These possibilities are not limited to the English language. The Web repository contains large quantities of text in many languages. This may, for example, be exploited to find the right preposition in Norwegian for being “in” (“i”) or “at” (“på”). As stated in the introduction, Norwegians are always “in” a big city and always “at” a small place, but this may cause quite a lot of confusion for foreigners when choosing prepositions for smaller cities or bigger places. Some examples are shown in the table below, with the correct preposition and Web results for each.

City/place   No. of inhabitants   Correct prep.   Vote for “i” (“in”)   Vote for “på” (“at”)
Oslo         500,000              “i”             566,000               25,600
Vadsø        6,100                “i”             4,200                 227
Voss         13,800               “på”            2,300                 8,100

As the results show, the Web gives us the correct preposition in each of these three cases. Note that hits for the less common preposition may still reflect perfectly correct usage. For example, while we are “i” Oslo, it is correct to say “på Oslo teater” (“at Oslo theater”). To avoid these disturbances in the Web results we may try longer phrases, e.g., “bor i Oslo” (“live in Oslo”).

Checking for the correct use of prepositions, as shown, is an ideal application of this method.
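The voting procedure used in the examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name majority_vote is hypothetical, no search engine is queried, and the hit counts are typed in by hand from the West Coast and Voss examples in the text.

```python
from typing import Dict, Tuple


def majority_vote(hit_counts: Dict[str, int]) -> Tuple[str, float]:
    """Return the alternative with the most hits and its share of all hits.

    hit_counts maps each candidate (a phrase or a preposition) to the
    number of references a search engine reported for it.
    """
    if not hit_counts:
        raise ValueError("need at least one candidate")
    total = sum(hit_counts.values())
    if total == 0:
        raise ValueError("no candidate received any votes")
    winner = max(hit_counts, key=hit_counts.get)
    return winner, hit_counts[winner] / total


# Hit counts copied by hand from the West Coast example in the text.
west_coast = {
    "on the West Coast": 468_000,
    "in the West Coast": 26_100,
    "at the West Coast": 10_500,
}

# Votes for the two Norwegian prepositions in the Voss example.
voss = {"i": 2_300, "på": 8_100}

for name, counts in (("West Coast", west_coast), ("Voss", voss)):
    winner, share = majority_vote(counts)
    print(f"{name}: '{winner}' wins with {share:.0%} of the votes")
```

Reporting the winner's share alongside the winner makes it easier to tell a clear majority, as in the West Coast example, from a close vote where both alternatives are in common use.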
However, the method may also be used for other cases, such as the use of pronouns, catching spelling errors, and checking the most common word combinations, all of which would be helpful for non-native speakers and writers. Errors such as “I went home in there car” or “we had ice for desert” may easily be found by asking the Web:

Phrase                   “Vote”
“their car”              352,000
“there car”                7,250

Phrase                   “Vote”
“ice for dessert”        171
“ice for desert”          23

In this case it is interesting to note that even incorrect alternatives get some votes. Selection of the correct homonym will always remain a difficult issue, especially on the Web, where the examples come from all types of people, from all cultures.

3. Examples – using the references themselves

In the examples above, we have used only the size of the result set, i.e., the frequency, without examining the references returned by the Web search. This has the advantage that when we let the larger number determine the result, we are largely independent of all the alternative uses of the words, mistakes and bad phrasing. However, the disadvantage of this approach is that we are limited to asking questions in the form of a choice between given alternatives: X or Y?

To a limited degree we may also let the Web provide the words that we need. For example, we may not be sure which verb to use when getting a train ticket. Do we reserve, book, order or perhaps use a completely different word? We can then submit the phrase “train ticket” to the search engine, and study the references themselves to see which word is most commonly used. Of course, this has the disadvantage that the first reference returned may not be an example of good English. So here it becomes necessary to check the nationality of the reference (US, English or foreign) and its credibility (e.g., home page of an official organization, private home page, etc.). This may be simpler if we restrict the search to a particular Web domain, for example the site of a newspaper, government or university.
Then we can expect that the examples returned are more likely to be in the correct language form. The disadvantage of this restriction is that we may get fewer votes, perhaps so few that checking detailed phrases becomes impossible. Alternatively, we can take the verbs offered, include these in complete phrases for a Web search and then look at the frequency of the results.

4. Languages

The methods described here are largely language independent. Submissions to the Web for “votes” can be restricted to documents in a given language or, even simpler, to documents on sites with the right language identifier. For example, if we were to check a text in Norwegian we could limit voting to Web pages in this language or to the “.no” (for Norway) domain. The limitation is that the text repository may be small for some countries, especially for underdeveloped countries with only a few Web sites.

The usefulness of these grammar and spell checking ideas may perhaps be even greater in other languages. When writing Spanish, for example, we very quickly find out that there are great differences between the language spoken and written in Spain and the South American variants, which in turn differ from the Cuban and Mexican variants. While a dictionary can tell you which variants of words are used in which countries, the Web can tell you much more.

5. Current Grammar, Style and Spell Checker Technology

Grammar, style and spell checker software is one category of tools available for people who use computers for writing. The problems that these tools can detect are:

i. Mechanics: e.g., capitalization, punctuation and spelling.
ii. Grammar: e.g., parts of speech and subject/verb agreement.
iii. Style: e.g., words and expressions which 'set a tone' for the types of writing preset in the program, such as general, a business letter, a report, fiction or a technical document.
The messages generated by these software tools about the nature and types of problems in a text are flagged and presented under these three categories, which are further divided into additional sub-categories. Although several tools of this type are available, either integrated into word processing packages or as stand-alone products, none that we examined takes advantage of the text repository on the Web in a dynamic and adaptive manner.

The basic approach to spell checking is to perform a dictionary lookup, but spell checkers do little to deal with issues such as the use of homonyms, for example “desert” versus “dessert”. A spell checker will let you eat a desert as well as die from thirst in the dessert. Grammar checkers work from a set of rules that are defined as the grammar of the language. Thus, they can determine when a plural noun is used with a singular verb in typical cases, e.g., “is” versus “are” usage, but they can also fail on or misdiagnose many cases. Style checking is also based on a set of rules as to what constitutes good style. Thus, very long sentences are flagged as needing attention. These tools have improved over the years, but due to the informality, openness and dynamics of natural language all such tools are likely to have limitations in certain circumstances and for certain uses.

6. Discussion

Grammar, spelling and style checking is especially useful for those who are writing in a non-native language. These writers may be divided into two groups. Group one consists of those who are proficient in the language, and group two of those who are novices and have problems writing correct sentences. While the first group may be interested in the more complex aspects of writing correctly in a language, e.g., “on” or “in” the West Coast, group two has as its main goal to write simple, straightforward English as correctly as possible. We feel that the methods described here can support both groups.
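The homonym limitation noted in Section 5 can be made concrete with a small sketch that contrasts a plain dictionary lookup with a lookup of phrase votes. Everything here is an illustrative assumption rather than a description of any existing checker: the toy word list, the function names misspelled_words and preferred_phrase, and the use of hand-entered hit counts (those reported for the dessert example in Section 2) in place of a live Web query.

```python
import re

# Toy dictionary: both homonyms are valid words, as in any real lexicon.
DICTIONARY = {"we", "had", "ice", "for", "desert", "dessert"}

# Hit counts reported in Section 2, entered by hand (no search is run).
PHRASE_VOTES = {"ice for dessert": 171, "ice for desert": 23}


def misspelled_words(sentence, dictionary=DICTIONARY):
    """Dictionary lookup: return the words that are not in the dictionary."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [word for word in words if word not in dictionary]


def preferred_phrase(candidates, votes=PHRASE_VOTES):
    """Vote lookup: return the candidate phrase with the most recorded hits."""
    return max(candidates, key=lambda phrase: votes.get(phrase, 0))


# The dictionary lookup finds nothing wrong with the wrong homonym ...
print(misspelled_words("We had ice for desert"))
# ... while the vote between the two phrases favours the intended word.
print(preferred_phrase(["ice for desert", "ice for dessert"]))
```

The dictionary lookup judges each word in isolation, so it cannot object to “desert”; the vote compares whole phrases and therefore carries the context that separates the two homonyms.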
Using the text repository on the Web can be seen as an extension to an ordinary grammar checker, not as a replacement. In standard cases, e.g., “he is” or “he are”, a grammar/spell checker will be more efficient, since the answer or suggestion can be given by applying rules instead of performing a more time-consuming Web search. Of course, there is also the possibility that the grammar checkers themselves can utilize the Web text repository, either by the direct voting method described here, or by “learning” correct use of language by traversing the Web. Using the Web’s text repository we may be able to create more dynamic spell checkers, where new words are added automatically when they appear in the language.

Some of the “failings” of natural language, such as ambiguity, may produce what could be considered errors in our results, so it is especially important to provide phrases that define the right context. If we search for the phrase “on the West Coast”, the results may differ depending on whether the complete sentence is “we are living on the West Coast” or “we stayed at the West Coast hotel”. Synonyms can cause similar problems.

One effect of the methods described here may be a form of text standardization, or conformism in writing, i.e., if we always use the majority vote. But this is just what many who write in a second language want to achieve, i.e., to write like everyone else. Here standardization is a virtue. On the other hand, more proficient non-native writers may use the Web to see whether it is possible to use a term in a specific manner. Thus, they will feel more secure when they use terms in more unconventional ways. In this respect, these Web-based methods may lead to greater variability.

7. Conclusion

We have suggested using the vast text repository on the Web as a tool to aid in making difficult language choices and have demonstrated with some empirical evidence that it can be highly effective.
We also suggest that the producers of grammar checkers, style checkers and spell checkers incorporate the text resources on the Web into their products. Users, especially those writing in a non-native language, would find this highly useful.