Spelling and Grammar Checking Using the Web as a Text Repository

Kai A. Olsen
Molde College, Norway, kai.olsen@himolde.no
James G. Williams
School of Information Sciences
University of Pittsburgh, USA, jimw@sis.pitt.edu
412-980-3276
1. Introduction
Natural languages are both complex and dynamic. They are in part formalized through
dictionaries and grammar. Dictionaries attempt to provide definitions and examples of various
usages for all the words in a language. Grammar, on the other hand, is the system of rules that
defines the structure of a language and is concerned with the correct use and application of the
language in speaking or writing.
The fact that these two mechanisms lag behind the language as currently used is not a serious
problem for those living in a language culture and speaking their native language. However, the
correct choice of words, expressions and word relationships is much more difficult when
speaking or writing in a foreign language. The basics of the grammar of a language may have
been learned in school decades ago, and even then there were always several choices for the
correct expression for an idea, fact, opinion or emotion. Although many different parts of speech
and their relationships can make for difficult language decisions, prepositions tend to be
problematic for non-native speakers of English and, in reality, prepositions are a major problem
in most languages. Does a speaker or writer say “in the West Coast” or “on the West Coast”, or
perhaps “at the West Coast”? In Norwegian we are “in” a city, but “at” a place. But the
distinction between cities and places is vague. To be absolutely correct, one really has to learn
the right preposition for every single place.
A simplistic way of resolving these language issues is to ask a native speaker. But even native
speakers may disagree about the right choice of words. If there is disagreement then one will
have to ask more than one native speaker, treat their response as a vote for a particular choice and
perhaps choose the majority choice as the best possible alternative. In real life such a procedure
may be impossible or impractical, but in the electronic world, as we shall see, this is quite easy to
achieve. Using the vast text repository of the Web, we may get a significant voting base for even
the most detailed and distinct phrases.
We shall start by introducing a set of examples to present our idea of using the text repository on
the Web to aid in making the best word selection, especially for the use of prepositions. Then we
will present a more general discussion of the possibilities and limitations of using the Web as an
aid for correct writing.
2. Examples – using the number of references returned
For a non-native writer a typical problem is choosing the right preposition in a sentence such
as: “We were living in/on/at the West Coast”. A grammar checker will accept all alternatives,
and even if any of them can be used, one may be more natural, or at least more
acceptable, than the others. To get a vote we can present the different alternatives to
a Web search engine, putting each phrase in quotation marks so that only instances where the
words appear in this exact order are counted. Using Google (www.google.com) we get the
following results:

Phrase                   “Vote”
“on the West Coast”      468,000
“in the West Coast”       26,100
“at the West Coast”       10,500

Thus, the preposition “on” seems to be the best choice in this example based on the Web usage
of the phrase.
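
The voting step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `majority_vote` is hypothetical, and the counts are the ones reported above, entered by hand rather than fetched from a search engine.

```python
def majority_vote(votes):
    """Return the phrase with the highest hit count and its share of all votes."""
    total = sum(votes.values())
    if total == 0:
        raise ValueError("no votes: none of the phrases was found")
    winner = max(votes, key=votes.get)
    return winner, votes[winner] / total


# Hit counts reported in the text for the three candidate phrases.
west_coast = {
    "on the West Coast": 468_000,
    "in the West Coast": 26_100,
    "at the West Coast": 10_500,
}

winner, share = majority_vote(west_coast)
print(f"{winner}: {share:.0%} of the votes")
```

Returning the share alongside the winner makes it possible to tell a clear majority from a near tie.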
We have an office “in a building”, but do we stay “in” a hotel or “at” a hotel? The Web gives the
answer:

Phrase           “Vote”
“in a hotel”     334,000
“at a hotel”     115,100

Here both alternatives seem to be in use, but there is a 3 to 1 vote for “in” a hotel. If we want to
be sure, we can check a longer phrase such as “stayed in a hotel” (Table 3).

Phrase                  “Vote”
“stayed in a hotel”     6,820
“stayed at a hotel”     3,770

These results show a vote of approximately 2 to 1 for “in”. As we could expect, the more
complex phrase reduces the number of votes. However, the Web offers such a vast text
repository that we can check on even more detailed phrases, and still get statistically good
answers:

Phrase                   “Vote”
“in New York Hilton”      76
“at New York Hilton”     574

In this case we see that “at” seems to be the best alternative when using a specific hotel name
whereas “in” is the better alternative when using the more generic type of place, hotel.
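
The strength of each vote in the hotel examples can be made explicit by computing the ratio between the winner and the runner-up. The sketch below is only an illustration: `vote_ratio` is a hypothetical helper, and the counts are the ones reported in the text.

```python
def vote_ratio(votes):
    """Return (winner, ratio of the winner's count to the runner-up's count)."""
    ranked = sorted(votes.items(), key=lambda item: item[1], reverse=True)
    (winner, top), (_, second) = ranked[0], ranked[1]
    ratio = top / second if second else float("inf")
    return winner, ratio


# Counts reported in the text for the three hotel comparisons.
examples = [
    {"in a hotel": 334_000, "at a hotel": 115_100},
    {"stayed in a hotel": 6_820, "stayed at a hotel": 3_770},
    {"in New York Hilton": 76, "at New York Hilton": 574},
]

for votes in examples:
    winner, ratio = vote_ratio(votes)
    print(f"{winner}: {ratio:.1f} to 1")
```

A ratio close to 1 suggests that both forms are in common use, while a large ratio points to a clear preference.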
These possibilities are not limited to the English language alone. The Web repository contains
large quantities of text in many languages. This may, for example, be exploited to find the right
preposition in Norwegian for being “in” (“i”) or “at” (“på”). As stated in the introduction,
Norwegians are always “in” a big city and always “at” a small place, but this may cause quite a
lot of confusion for foreigners in the choice of prepositions for smaller cities or bigger places.
Some examples are shown in the table below, with the correct preposition and Web results for
each.

City/place   No. of inhabitants   Correct prep.   Vote for “i” (“in”)   Vote for “på” (“at”)
Oslo         500,000              “i”             566,000               25,600
Vadsø        6,100                “i”             4,200                 227
Voss         13,800               “på”            2,300                 8,100

As the results show, the Web gives us the correct preposition in each of these three cases. Note that the
results for the less common preposition may still reflect quite correct use of language. For example,
while we are “i” Oslo, it is correct to say “på Oslo teater” (“at Oslo theater”). To avoid these
disturbances in the Web results we may try longer phrases, e.g., “bor i Oslo” (“live in Oslo”).
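
The comparison for the Norwegian place names can be written out as a short check. This is a sketch only: the figures are those reported in the text, and the helper names `web_preposition` and `longer_phrases` are hypothetical.

```python
# (place, correct preposition, votes for "i", votes for "på") as reported in the text.
places = [
    ("Oslo", "i", 566_000, 25_600),
    ("Vadsø", "i", 4_200, 227),
    ("Voss", "på", 2_300, 8_100),
]


def web_preposition(votes_i, votes_pa):
    """Pick the preposition with the larger number of votes."""
    return "i" if votes_i > votes_pa else "på"


def longer_phrases(place, verb="bor"):
    """Build the longer candidate phrases, e.g. "bor i Oslo" and "bor på Oslo"."""
    return [f"{verb} {prep} {place}" for prep in ("i", "på")]


results = {place: web_preposition(vi, vp) for place, _, vi, vp in places}
agrees = all(results[place] == correct for place, correct, _, _ in places)
print(results, "matches the correct prepositions:", agrees)
print(longer_phrases("Oslo"))
```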
Checking for the correct use of prepositions, as shown, is an ideal application of this method.
However, it may also be used for other cases such as the use of pronouns or to catch spelling
errors and to check the most common word combinations that would be helpful for non-native
speakers/writers. Errors such as “I went home in there car” or “we had ice for desert” may easily
be found by asking the Web:

Phrase          “Vote”
“their car”     352,000
“there car”       7,250

Phrase                “Vote”
“ice for dessert”     171
“ice for desert”       23

In this case it is interesting to note that even incorrect alternatives get some votes. Selection of
correct homonyms will always remain a difficult issue, especially on the Web, where the
examples come from all types of people, from all cultures.
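
One way to apply the vote to such errors is to generate variants of a phrase by swapping easily confused words and to keep the variant with the most votes. The following is a minimal sketch under stated assumptions: the confusable word groups and the function names are hypothetical, and a dictionary holding the counts reported above stands in for a real search engine.

```python
# Hypothetical groups of easily confused words.
CONFUSABLE = [{"their", "there"}, {"dessert", "desert"}]

# Counts reported in the text; a stand-in for querying a search engine.
REPORTED_COUNTS = {
    "their car": 352_000,
    "there car": 7_250,
    "ice for dessert": 171,
    "ice for desert": 23,
}


def alternatives(phrase):
    """Yield the phrase and every variant obtained by swapping one confusable word."""
    words = phrase.split()
    yield phrase
    for i, word in enumerate(words):
        for group in CONFUSABLE:
            if word in group:
                for other in sorted(group - {word}):
                    yield " ".join(words[:i] + [other] + words[i + 1:])


def suggest(phrase, count=REPORTED_COUNTS.get):
    """Return the variant of the phrase with the highest count."""
    return max(alternatives(phrase), key=lambda p: count(p, 0))


print(suggest("there car"))
print(suggest("ice for desert"))
```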
3. Examples – using the references themselves
In the examples above, we have used only the size of the result set, i.e., the frequency, without
examining the references returned by the Web search. This has the advantage that when we let
the larger number determine the result, we are practically independent of all the alternative uses
of the words, mistakes and bad phrasing. However, the disadvantage of this approach is that we
are limited to asking questions in the form of two alternatives – X or Y?
To a limited degree we may also let the Web provide the words that we need. For example, we
may not be sure which verb to use when getting a train ticket. Do we reserve, book, order or
perhaps use a completely different word? We can then submit the phrase “train ticket” to the
search engine, and then study the references themselves to see what word is most commonly
used. Of course, this has the disadvantage that the first reference returned may not be an example
of good English. So, here it becomes necessary to check the nationality of the reference (US,
English or foreign) and its credibility (e.g., home page of official organization, private home
page, etc.). This may be simpler if we restrict the search to only use a special Web domain, for
example the site of a newspaper, government or university. Then we can expect that the
examples returned are more likely to be in the correct language form. The disadvantage of this
restriction is that we may get fewer votes, perhaps so few that detailed phrasing is impossible.
Alternatively, we can take the verbs offered, include these in complete phrases for a Web search
and then look at the frequency of the results.
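
Studying the references themselves can be partly mechanized by counting the word that precedes the phrase in the returned texts. The sketch below assumes the texts have already been retrieved; the function `words_before` is hypothetical, and the sample sentences are invented for illustration and are not real search results.

```python
import re
from collections import Counter


def words_before(phrase, texts):
    """Count the word that precedes `phrase`, allowing one determiner in between."""
    pattern = re.compile(
        r"\b(\w+)\s+(?:a|an|the|your|my)\s+" + re.escape(phrase),
        re.IGNORECASE,
    )
    counts = Counter()
    for text in texts:
        counts.update(m.group(1).lower() for m in pattern.finditer(text))
    return counts


# Invented sentences for illustration only; NOT real search results.
snippets = [
    "You can book a train ticket online.",
    "Book your train ticket in advance.",
    "I had to reserve a train ticket by phone.",
]

counts = words_before("train ticket", snippets)
print(counts.most_common())
```

The candidate verbs found this way can then be placed in complete phrases and submitted for a frequency vote, as described above.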
4. Languages
The methods described here are largely language independent. Submissions to the Web for
“votes” can be restricted to documents in a given language or, even simpler, to documents on
sites with the right language identifier. For example, if we were to check a text in Norwegian we
could limit voting to Web pages in this language or to the “.no” (for Norway) domain. The
limitation is that the text repository may be small for some countries, especially for underdeveloped countries with only a few Web sites.
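
Restricting a vote to a domain can be expressed in the query string itself. The sketch below assumes a search engine that accepts double quotes for exact phrases and a `site:` operator for domain restriction; how a given engine treats a bare top-level domain such as `.no` is an assumption, and `build_query` is a hypothetical helper.

```python
def build_query(phrase, site=None):
    """Quote the phrase for exact-order matching and optionally restrict the domain."""
    query = f'"{phrase}"'
    if site:
        # Assumption: the engine supports a "site:" restriction operator.
        query += f" site:{site}"
    return query


print(build_query("bor i Oslo", site=".no"))
print(build_query("train ticket"))
```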
The usefulness of these grammar and spell checking ideas may perhaps be even greater in other
languages. When writing Spanish, for example, we very quickly find out that there are great
differences between the language spoken and written in Spain and the South
American variants, which in turn differ from the Cuban and Mexican variants. While a dictionary
can tell you which variants of words are used in which countries, the Web can tell you much
more.
5. Current Grammar, Style and Spell Checker Technology
Grammar, style and spell checker software is one category of tools available for people who use
computers for writing. The problems that these tools can detect are:
i. Mechanics: e.g., capitalization, punctuation and spelling.
ii. Grammar: e.g., parts of speech and subject/verb agreement.
iii. Style: e.g., words and expressions which 'set a tone' for the types of writing preset in the program,
such as general writing, a business letter, a report, fiction or a technical document.
The messages generated by these software tools about the nature and types of problems in a text
are flagged and presented under these three categories, which are further divided into additional
sub-categories. Although there are several of these types of tools available either integrated into
word processing packages or as stand-alone products, none that we examined take advantage of
the text repository on the Web in a dynamic and adaptive manner.
The basic approach to spell checking is to perform a dictionary lookup, but spell checkers do little
to deal with issues such as homonyms, for example “desert” versus “dessert”. A spell checker
will let you eat a desert as well as die from thirst in the dessert. Grammar checkers work from a
set of rules that are defined as the grammar of the language. Thus, they can determine when a
plural noun is used with a singular verb in typical cases, e.g., “is” versus “are” usage, but they
can also fail or misdiagnose many cases as well. Style checking is also based on a set of rules as
to what constitutes good style. Thus, very long sentences are flagged as needing attention.
These tools have improved over the years, but due to the informality, openness and dynamics of
natural language all such tools are likely to have limitations in certain circumstances and for
certain uses.
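
The limitation of a pure dictionary lookup is easy to reproduce. The following toy checker is a sketch with a tiny hand-made word list, not a model of any particular product: it flags unknown words but accepts “desert” wherever “dessert” was intended, because both are valid words.

```python
# Tiny hand-made word list for illustration only.
DICTIONARY = {"we", "had", "ice", "for", "dessert", "desert"}


def misspelled(sentence, dictionary=DICTIONARY):
    """Return the words of the sentence that are not in the dictionary."""
    return [word for word in sentence.lower().split() if word not in dictionary]


print(misspelled("we had ice for desrt"))   # the typo is flagged
print(misspelled("we had ice for desert"))  # the wrong word passes unnoticed
```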
6. Discussion
Grammar, spelling and style checking is especially useful for those who are writing in a non-native
language. These writers may be divided into two groups: group one consists of those who are
proficient in the language, and group two of those who are novices and have problems writing
correct sentences. While the first group may be interested in the more complex aspects of writing
correctly in a language, e.g., “on” or “in” the West coast, the second group has as its main goal to
write simple, straightforward English as correctly as possible. We feel that the methods described here
can support both groups. Using the text repository on the Web can be seen as an extension to an
ordinary grammar checker, not as a replacement. In standard cases, e.g. “he is” or “he are”, a
grammar/spell checker will always be more efficient as the answer or suggestion may be given by
employing rules instead of a more time-consuming Web search. Of course, there is also the
possibility that the grammar checkers themselves can utilize the Web text repository, either by
the direct voting method described here, or by “learning” correct use of language by traversing
the Web.
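
The idea of the Web as an extension to a rule-based checker can be sketched as a two-stage decision: the fast rules decide when they can, and the Web vote is consulted only when they leave more than one alternative. Everything here is an illustrative assumption: the function `choose`, the toy rule, and the use of the counts reported earlier as a stand-in for a live search.

```python
def choose(alternatives, rule_check, web_count):
    """Let the rule-based checker decide if it can; otherwise fall back to a Web vote."""
    accepted = [a for a in alternatives if rule_check(a)]
    if len(accepted) == 1:
        return accepted[0], "rules"
    candidates = accepted or list(alternatives)
    return max(candidates, key=web_count), "web"


def toy_rule(phrase):
    """Toy grammar rule: reject the subject/verb disagreement 'he are'."""
    return "he are" not in phrase


# Counts reported earlier in the text; a stand-in for a live Web search.
counts = {
    "on the West Coast": 468_000,
    "in the West Coast": 26_100,
    "at the West Coast": 10_500,
}


def web_count(phrase):
    return counts.get(phrase, 0)


print(choose(["he is", "he are"], toy_rule, web_count))
print(choose(list(counts), toy_rule, web_count))
```

Passing the rule checker and the counting function as parameters keeps the sketch independent of any specific grammar checker or search engine.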
Using the Web’s text repository we may be able to create more dynamic spell checkers, where
new words are added automatically when they appear in the language.
Some of the “failings” of natural language may produce what could be considered errors in our
results, so it is especially important to provide phrases that define the right context. If we
search for the phrase “on the West coast”, the results may differ depending on whether the
complete sentence is “we are living on the West coast” or “we stayed at the West coast hotel”.
Synonyms can also cause similar problems.
One effect of the methods described here may be a form of text standardization, or conformism
in writing, i.e., if we always use the majority vote. But this is just what many writing in a second
language want to achieve, i.e., to write like everyone else. Here standardization is a virtue.
On the other hand, the more proficient non-native writers may
use the Web to see whether it is possible to use a term in a specific manner. Thus, they will feel more
secure when they use terms in more unconventional ways. In this respect, these Web based
methods may lead to greater variability.
7. Conclusion
We have suggested using the vast text repository on the Web as a tool to aid in making difficult
language choices and have demonstrated with some empirical evidence that it can be highly
effective. We also suggest that the producers of grammar checkers, style checkers and spell
checkers incorporate the text resources on the Web into their products. Users would find this
highly useful, especially those writing in a non-native language.