IT has become Crucial

Chapter 60, Case: Proofreading
How Information Technology Is Conquering the World:
Workplace, Private Life, and Society
Professor Kai A. Olsen, University of Bergen and Molde University College
A semantic proofreading tool for all
languages based on a text repository
Kai A. Olsen
Molde University College and Department of Informatics, University of Bergen
Norway
kai.olsen@himolde.no
Bård Indredavik
Technical Manager, Oshaug Metall AS, Molde, Norway
bard@oshaug.no
Proofreading is important

When we write in a foreign language
 If we are not proficient in our own language
 To find typos and other mistakes
 Errors can make the text unreadable and give a very
bad impression:

I am a student of MSc Logistics and Supply Chain
Management from Westminitser University, London. Last weel
I had the presentation regarding Molde College University and
I heart that you are the module leader of Management of
value. I am wondering if you may write me back more about
that module, because it not really clear for me? In particular,
when I am considering to go foe the second semestr to Molde.
I will be really approciate for it.
Kai A. Olsen, 22.03.2016
Manual proofreading
When we are in doubt about an
expression we could ask a language
proficient colleague
 However,

we may not have anybody to ask
 it may be too much to ask somebody to
proofread everything that we write


Can we do it automatically?
Automatic language processing





An important research area since the nineteen sixties
The results have been far from what many envisioned
Natural languages seem to be too complex to be
formalized (some argue that you have to be a human
being to understand natural language)
But, due to faster computers we have workable
spelling checkers and studies of syntax have offered
grammar checkers that handle at least some types of
mistakes
Still, clear limitations, e.g., the language tools in Office
2003 will not find these errors:




“I have a red far”
“A forest has many threes”
“I live at London”
“We had ice cream for desert”
For our student
If she had used a spelling and grammar checker in
Office only a few mistakes would have been found:
Another approach
Instead of asking another person to
proofread, we could ask the whole world
 That is, use the Web as a text repository
and compare our sentences to those of
everyone else
 For example, by using Google:

”we live at the west coast” – 0
 ”we live on the west coast” – 3,500,000
 ”we live in the west coast” – 5,960,000
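The comparison above can be sketched in a few lines of Python. This is only an illustration: the three-sentence corpus is made up and stands in for the Web, the function names are hypothetical, and the counts it produces have nothing to do with the Google numbers shown on the slide.

```python
def phrase_count(phrase: str, corpus: list[str]) -> int:
    """Count case-insensitive occurrences of `phrase` across all corpus texts."""
    needle = phrase.lower()
    return sum(text.lower().count(needle) for text in corpus)


def rank_candidates(candidates: list[str], corpus: list[str]) -> list[tuple[str, int]]:
    """Return (candidate, count) pairs, most frequent first."""
    counts = [(c, phrase_count(c, corpus)) for c in candidates]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Tiny made-up corpus (an assumption for illustration), standing in for the Web.
corpus = [
    "We live on the west coast and we like it here.",
    "They said: we live on the west coast of Norway.",
    "We live in the west coast region.",
]
ranking = rank_candidates(
    [
        "we live at the west coast",
        "we live on the west coast",
        "we live in the west coast",
    ],
    corpus,
)
for candidate, count in ranking:
    print(f"{candidate!r}: {count}")
```

The candidate with the highest count is the one most writers in the repository have used, which is the only evidence this approach relies on.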

Background paper (2004)
Journal of the American Society for Information Science and
Technology, Volume 55, Issue 11, September 2004
What if the alternatives are unknown?
We can use a wild card (*)
 Example: ”we live * the west coast”
 Study the alternatives, and check the
complete sentence with each candidate
to get a frequency number
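A minimal sketch of the wild card step, assuming a small local corpus instead of a Web search engine. The function name and the corpus are hypothetical, and the sketch assumes a single * placed between two known parts of the sentence.

```python
import re
from collections import Counter


def wildcard_alternatives(pattern: str, corpus: list[str]) -> Counter:
    """Count the words that fill the single '*' in `pattern` across the corpus."""
    before, after = pattern.lower().split("*")
    # One word (\w+) stands in for the wild card; the rest must match literally.
    regex = re.compile(
        r"\b" + re.escape(before.strip()) + r"\s+(\w+)\s+" + re.escape(after.strip()) + r"\b"
    )
    fillers = Counter()
    for text in corpus:
        fillers.update(regex.findall(text.lower()))
    return fillers


# Made-up corpus for illustration only.
corpus = [
    "We live on the west coast and we like it here.",
    "They said: we live on the west coast of Norway.",
    "We live in the west coast region.",
]
alternatives = wildcard_alternatives("we live * the west coast", corpus)
print(alternatives.most_common())
```

Each word found this way is a candidate; the complete sentence with that candidate inserted can then be counted to get its frequency.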

A tedious process
Disadvantages
A lot of work
 We have to know where we are in doubt
 It can be difficult to find all the
alternatives
 But we can make a tool that can do this
job automatically

Prototype

Consists of:
1. A spider that collects text from the Web
 2. An index builder that creates an index
structure
 3. An analyzing program that finds
alternatives for each word in the user’s
sentence

1. Spider






Starts with a list of seeds, e.g., links to Web
sites of universities, newspapers, state
organizations, etc.
Retrieves text from these sites
“Cleans” the text of formatting data
Stores all links that are found (.html, .pdf and
.doc) if these have not been encountered
previously
Follows HTML links recursively (we have
separate spiders to parse .pdf and .doc files).
Stores the text in files, numbered
consecutively.
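The spider steps listed above might look roughly like the following sketch. It is not the prototype's code: the class and function names are hypothetical, the "Web" is an in-memory dictionary so the example runs without network access, links are followed with a queue rather than by recursion, and .pdf/.doc handling and script filtering are left out.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects text and href links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep only the text between tags, dropping the formatting itself.
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(seeds, fetch):
    """Visit pages reachable from `seeds`; return cleaned texts in visiting order."""
    seen = set(seeds)
    queue = list(seeds)
    texts = []
    while queue:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        texts.append(" ".join(parser.text_parts))
        for link in parser.links:
            if link not in seen:  # store only links not encountered previously
                seen.add(link)
                queue.append(link)
    return texts


# Hypothetical in-memory pages standing in for real Web sites.
pages = {
    "a.html": '<p>We live on the <b>west coast</b>.</p><a href="b.html">next</a>',
    "b.html": '<p>I live in London.</p><a href="a.html">back</a>',
}
texts = crawl(["a.html"], pages.get)
print(texts)
```

The position of each text in the returned list plays the role of the consecutive file number.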
2. Index builder
Word → File
For each word we get the files that contain at least O occurrences
of the word. If O is 1 all words are included, but we may use a
higher value to avoid (at least some) misspelled words.
File → Word → Lines
For each file we have a list of all words in the file, each word
giving the lines in the file where the word occurs

All structures are represented as Boolean arrays stored as .txt
files.
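The two index structures can be illustrated with ordinary Python dictionaries. This sketch makes several assumptions: it uses sets and lists instead of Boolean arrays stored as .txt files, the name build_indexes is hypothetical, and the threshold O (here min_occurrences) is read as a per-file count.

```python
import re
from collections import Counter, defaultdict


def build_indexes(files: dict[int, str], min_occurrences: int = 1):
    """Build word -> files and file -> word -> line numbers from numbered text files."""
    word_to_files = defaultdict(set)
    file_to_word_lines = {}
    for file_no, text in files.items():
        word_lines = defaultdict(list)
        counts = Counter()
        for line_no, line in enumerate(text.splitlines(), start=1):
            for word in re.findall(r"\w+", line.lower()):
                counts[word] += 1
                if line_no not in word_lines[word]:
                    word_lines[word].append(line_no)
        file_to_word_lines[file_no] = dict(word_lines)
        # Only words that reach the threshold O are entered in the word index.
        for word, n in counts.items():
            if n >= min_occurrences:
                word_to_files[word].add(file_no)
    return dict(word_to_files), file_to_word_lines


# Two made-up files, numbered consecutively.
files = {
    1: "we live on the west coast\nwe like the coast",
    2: "I live in London",
}
word_to_files, file_to_word_lines = build_indexes(files)
strict_index, _ = build_indexes(files, min_occurrences=2)
print(word_to_files["live"], file_to_word_lines[1]["coast"])
```

With a threshold of 2, a word such as "live", which occurs once per file, drops out of the word index, in the same way a higher O filters out rare misspellings.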
In English
2.5 GB text
 2,500 files (1 MB each) for raw text
 200,000 words (O=10, includes only
words with a frequency of 10 or higher)
and the same number of text files to
show in which files the word occurs
 43 million text files with line references
(for each word in each file)
 No problem for Windows 7

In Norwegian
1 GB text
 10,000 files (0.1 MB each) for raw text
 550,000 words (O=1, all words) and the
same number of text files to show in
which files the word occurs
 42 million text files with line references
(for each word in each file)

3. The analyzer

Finds the frequency of the complete sentence (N
words) offered by the user
 Parses the files where at least N-1 words of the
sentence occur
 Replaces one word at a time with a wild card
 Collects alternatives
 Checks the frequency of each alternative
 Calculates a confidence value based on the ratio of
frequencies and the similarity between the original
word and the alternative (Hirschberg’s algorithm)
 Suggests improvements where the alternative
sentence gets a higher score than the original
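The slides do not give the exact confidence formula, so the following is only one possible reading. It multiplies the share of the frequency evidence that favours the alternative by a word similarity. As a stand-in for Hirschberg's algorithm it uses difflib.SequenceMatcher from the Python standard library, which is a different technique. The function names, the formula and the threshold of 0.5 are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(original: str, alternative: str) -> float:
    """Similarity in [0, 1]; difflib stands in for Hirschberg's alignment here."""
    return SequenceMatcher(None, original, alternative).ratio()


def confidence(orig_freq: int, alt_freq: int, original: str, alternative: str) -> float:
    """Hypothetical score: frequency share of the alternative times word similarity."""
    if alt_freq == 0:
        return 0.0
    frequency_share = alt_freq / (alt_freq + orig_freq)
    return frequency_share * similarity(original, alternative)


def best_suggestion(original: str, orig_freq: int, alternatives: dict[str, int],
                    threshold: float = 0.5):
    """Return the highest-scoring alternative above `threshold`, or None."""
    scored = [
        (confidence(orig_freq, freq, original, alt), alt)
        for alt, freq in alternatives.items()
    ]
    scored = [pair for pair in scored if pair[0] > threshold]
    return max(scored)[1] if scored else None


# Made-up frequencies for "We had ice cream for *".
suggestion = best_suggestion("desert", 2, {"dessert": 40, "dinner": 60})
print(suggestion)
```

The similarity term is what keeps a frequent but unrelated word ("dinner") from outranking a near-identical one ("dessert"): a writer's mistake is more likely to be close to the word they typed.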
Analyzer (example)
I live at London
changed to:
I live in London

Analyzer (example 2)
We had ice cream for desert
changed to:
We had ice cream for dessert

What kind of errors can be found

Typos, as in:
  I have a red far
Spelling, using the wrong word:
  e.g., mixing desert and dessert
Grammar, using the wrong preposition, verb, etc.:
  e.g., mixing in/at/on
Facts:
  Beethoven was born in 1970, corrected to 1770.
Punctuation
That is, most types of mistakes that we make when writing.
When the system fails

Examples:
  “We eat avocado” may be corrected to “We eat apples”
  “Neptune is the outer planet in the solar system” may be
  corrected to “Pluto is the outer planet…”
  Date-specific data, as in the sentence “the prime minister
  of Great Britain is”
In practice these failures will seldom be problematic, as
they often address an area where the user is competent.
A learning system can also reduce some of these cases.
In addition, a system that takes dates into
consideration should help.
The prototype

Is only a prototype:
1 or 2.5 GB is not enough to get a wide
range of sentences
 Collecting data from the Web gives a
repository with many spelling and grammar
errors (and a lot of repeated text)
 The system is too slow to handle many
users


Still, it can correct many types of
mistakes, e.g., all the examples that we
used in our 2004 paper.
What we need in order to

improve the text repository:
A text quality checker that ignores text with
too many errors
 Or, perhaps better, text repositories based
on books, company reports, government
reports, scientific papers, …


improve speed:
A site with many thousands (millions) of
simple computers (i.e., a “Google” setup)
 The task is ideal for parallel computing

Parallel computing: MapReduce

A programming model introduced by Dean and Ghemawat
at Google
 Idea: algorithms that work in parallel on large
data sets
 In our case:


The map operation could be applied to each file,
offering the frequency of each alternative sentence
(one computer can work on one file at a time)
The reduce operation could then combine these intermediate
results to compute the final frequencies.
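A small sketch of how the map and reduce steps could divide this work. It is not Google's MapReduce implementation: a thread pool on one machine stands in for many computers, the files and candidate sentences are made up, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Candidate sentences whose frequencies we want (illustrative only).
CANDIDATES = ["i live at london", "i live in london"]


def map_file(text: str) -> Counter:
    """Map step: count each candidate sentence within one file."""
    lowered = text.lower()
    return Counter({c: lowered.count(c) for c in CANDIDATES})


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial frequency tables."""
    return left + right


# Made-up file contents; in the prototype these would be the stored text files.
files = [
    "I live in London. My sister said: I live in London too.",
    "I live in London, but I work elsewhere.",
    "Nobody writes I live at London.",
]
with ThreadPoolExecutor(max_workers=3) as executor:
    partial_counts = list(executor.map(map_file, files))
totals = reduce(reduce_counts, partial_counts, Counter())
print(totals)
```

Because each file is counted independently, the map step needs no coordination between workers, which is why the task suits parallel computing so well.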
Discussion

Do we want to write as the majority?
  Yes, when we write in a foreign language
  When we are not too proficient writers
Can we leave everything to the proofing tool?
  No, as with other types of proofing tools what we get is a suggestion
  only
  What the tool really does is help the user apply reading
  competency when writing
Will the system find examples of all sentences?
  No
Why don't Google and others offer this tool?
  Perhaps because it will be very resource demanding (or because
  they are not smart enough)
What about false positives?
  This (the system indicating expressions that are correct) may be a
  problem.
Conclusion
With a multicomputer setup and a large
repository many mistakes can be
indicated
 Works in any language that can be
digitized
 Can be an offline or online tool (perhaps
online is achievable some time in the
future?)
 We could have repositories that reflect
style (academic, business, social…)?

Big data is becoming important
To analyze buying patterns of customers
 Recommendation systems
 Traffic patterns for planning new flights
or new roads (Norwegian to Molde)
 In science (meteorology, medicine,
physics, astronomy…)
 In many areas

Data is available
From the Web
 From user actions on the Web (keywords
entered for searching, pages visited…)
 From automatic sensors, modern
equipment (such as better telescopes),
online activities, cameras…
 The computers and software are here to
analyze the data

That is
BIG DATA can be used to understand
many complex processes
 Will become an important issue in the
next ten years of computing
