Abstract

Nidzel’sky Alexey
student of the faculty of the applied mathematics
natural language processing

The vast amount of data on the Web and in social media has made it possible to build fantastic new applications. The most important task is information extraction: extracting that information from text, creating a new calendar entry, and then populating a calendar with this kind of structured information, with the event, date, start, and end times, for a calendar program. Modern email and calendar programs are capable of doing this from text.
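
As a minimal sketch of this kind of extraction, here is a hand-written regular expression that pulls an event name, date, and start/end times out of one invented email sentence; the pattern and field names are purely illustrative, not how any real mail or calendar program does it.

    import re

    # A hand-written pattern for one narrow style of announcement; the field
    # names (event, date, start, end) are illustrative only.
    PATTERN = re.compile(
        r"(?P<event>[A-Z][\w ]+?) on (?P<date>\w+ \d{1,2}) "
        r"from (?P<start>\d{1,2}(?::\d{2})? ?[ap]m) to (?P<end>\d{1,2}(?::\d{2})? ?[ap]m)"
    )

    text = "Reminder: Project review on May 14 from 2pm to 3pm in room 120."
    match = PATTERN.search(text)
    if match:
        print({key: match.group(key) for key in ("event", "date", "start", "end")})
        # {'event': 'Project review', 'date': 'May 14', 'start': '2pm', 'end': '3pm'}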

Another application of this kind of information extraction involves sentiment analysis. Machine translation is another important application, and machine translation can be fully automatic: for example, we might have a source sentence in Chinese, and Stanford's Phrasal MT system translating it into English. But MT can also be used to help human translators. So we might have an Arabic text, and the human translator rendering it into English might need some help from the MT system, for example a collection of possible next words that the MT system can build automatically to help the human translator.

Let's look at the state of the art in language technology. Like every field, NLP is divided up into specialties and sub-specialties, and a number of these problems are pretty close to solved. For example, spam detection: while it's very hard to completely detect the spam in our email boxes, we don't have 99 percent spam, and that's because spam detection is a relatively easy classification task.
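
To illustrate how spam filtering reduces to text classification, here is a minimal sketch using scikit-learn's CountVectorizer and MultinomialNB on a few invented messages; the data and labels are made up for illustration, not from any real filter.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny invented training set: 1 = spam, 0 = not spam.
    messages = [
        "win a free prize now",
        "cheap loans click here",
        "meeting moved to 3pm",
        "lunch tomorrow with the team",
    ]
    labels = [1, 1, 0, 0]

    vectorizer = CountVectorizer()          # bag-of-words features
    X = vectorizer.fit_transform(messages)
    classifier = MultinomialNB().fit(X, labels)

    test = vectorizer.transform(["free prize click here"])
    print(classifier.predict(test))         # expect [1] on this toy data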

A couple of important component tasks are part-of-speech tagging and named entity tagging, and those work at pretty high accuracies; we can get about 97 percent accuracy in part-of-speech tagging.
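
A minimal sketch of part-of-speech tagging with NLTK's off-the-shelf tagger, assuming the standard NLTK resources have been downloaded (resource names vary by NLTK version):

    import nltk

    # One-time model downloads; exact resource names differ across NLTK versions.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("Fed raises interest rates half a percent")
    print(nltk.pos_tag(tokens))
    # e.g. [('Fed', 'NNP'), ('raises', 'VBZ'), ('interest', 'NN'), ...]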

We talked about sentiment analysis, the task of deciding thumbs up or thumbs down on a sentence or a product, and about component technologies like word sense disambiguation: deciding whether we're talking about a rodent or a computer mouse when people mention mice in a search. We'll talk about parsing, which is good enough now to be used in lots of applications, and about machine translation, which is usable on the web.

A number of applications, however, are still quite hard. For example, answering hard questions like how effective this medicine is in treating that disease, by looking at the web or by summarizing the information we find, is quite hard. Similarly, while we have made some progress on deciding that the sentence "XYZ company acquired ABC company yesterday" means something similar to "ABC has been taken over by XYZ", the general problem of detecting that two phrases or sentences mean the same thing, the paraphrase task, is still quite hard. Even harder is the task of summarization: reading a number of documents, for example ones reporting that housing prices rose, and aggregating that to give the user information like "in summary, the economy is good". And finally, one of the hardest tasks in natural language processing is carrying on a complete human-machine dialogue. Here's a simple example: asking what movie is playing when, and buying movie tickets; you can get applications that do that today. But the general problem of understanding everything the user might ask for, and returning a sensible response, is quite difficult.

Why is natural language processing so difficult? One cute example is the kind of ambiguity problem called a crash blossom. What are crash blossoms? The headline "Violinist Linked to JAL Crash Blossoms" gave the name to this phenomenon, because in the interpretation the headline writer intended, the main verb was "blossoms": who does the blossoming? The violinist. And the fact about being linked to the JAL crash was a modifier of "violinist". There are similar kinds of syntactic ambiguities. In "Teacher Strikes Idle Kids", the writer intended the main verb to be "idle", that is, the strikes caused the kids to be idle; but of course the humorous interpretation is that the teacher is striking: "strikes" is the verb, and we have a teacher striking idle kids. Another important kind of ambiguity is word sense ambiguity. In our third example, "Red Tape Holds Up New Bridges", the writer intended "holds up" to mean something like delay; call that sense one of "holds up". But the amusing interpretation uses the second sense of "holds up", which we might write down as "to support", and now we get the interpretation that literal red tape, as opposed to bureaucratic red tape, is actually supporting a bridge. We can see lots of other kinds of ambiguities in these actual headlines.

Now, it turns out that it's not just amusing headlines that have ambiguity; ambiguity is pervasive throughout natural language text. Let's look at a sensible, non-ambiguous-looking headline from the New York Times. The headline, shortened a bit here, is "Fed raises interest rates", and that seems unambiguous. We have a verb here, "raises". What gets raised? A noun phrase, "interest rates". That gives us a verb phrase, "raises interest rates", and then "Fed" makes a little noun phrase. So we say this is a sentence that has a noun phrase, "Fed", and a verb phrase headed by "raises", and what gets raised is "interest rates". This is called a phrase structure parse; we'll talk about phrase structure later in the course. We could also write a dependency parse: we say the head verb "raises" has an argument, "Fed", and another dependent, "rates", and "rates" itself has a dependent, "interest". So we can see the main verb is "raises". Well, another interpretation of the very same sentence, one that people don't see but that parsers see right away, is that the main verb of the sentence is not "raises" but "interest": somebody interests something, and the thing that gets interested is "rates". And what is doing the interesting? The Fed raises, raises by the Fed. So it's a completely different sentence with a different interpretation, that something is interesting the rates, whatever that could mean, and it seems an unlikely interpretation to people. But of course, for a parser, this is a perfectly reasonable interpretation that we have to learn how to rule out. In fact, the sentence can get even more difficult. The actual headline was somewhat longer: "Fed raises interest rates half a percent". Here we could imagine that "rates" is the verb, and what does the rating is "Fed raises interest", the interest in Fed raises, rating something half a percent; so we might have a dependency structure in which "rates" is the head, "interest" is its subject, the raises are what do the interesting, and "Fed" is a modifier of "raises". So, whether with a phrase structure parse or a dependency parse, and even more so as we add more words, we get more and more ambiguities that have to be resolved in order to build a parse for each sentence.
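
To make the point concrete, here is a minimal sketch using NLTK with a toy grammar invented for illustration, in which "Fed raises interest rates" receives both readings; a real broad-coverage parser faces the same kind of ambiguity on a much larger scale.

    import nltk

    # A toy grammar (invented for illustration) in which both "raises" and
    # "interest" can act as the main verb of the sentence.
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> N | N N
        VP -> V NP
        N  -> 'Fed' | 'raises' | 'interest' | 'rates'
        V  -> 'raises' | 'interest'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("Fed raises interest rates".split()):
        print(tree)  # two distinct parse trees are printed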

Now, about the format of the course: you're going to have in-video quizzes, and most lectures will include a little quiz. They're there just to check basic understanding; they're simple multiple-choice questions, and you can retake them if you get them wrong. Let's see one right now.

A number of other things make natural language understanding difficult. One of them is the non-standard English that we frequently see in text like Twitter feeds, with unusual capitalization, unusual spelling of words, hashtags, user IDs, and so on. All of the parsers and part-of-speech taggers that we're going to make use of are often trained on very clean newspaper English, but the actual English in the wild will cause us a lot of problems. We'll have a lot of segmentation problems: for example, if we see the string "New York-New Haven", how do we know the correct segmentation is "New York" and "New Haven", as in the New York-New Haven railroad, and not something like "York-New"? "York-New" is not a word the way "in-law" is; we have to solve the segmentation problem correctly. We have problems with idioms, and with new words that haven't been seen before. And we'll also have problems with entity names, like the movie "A Bug's Life", which has English words in it, so it's often difficult to know where the movie name starts and ends. This comes up very often in biology, where genes and proteins are named with English words. The task of natural language understanding is very difficult.
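
As a minimal sketch of handling non-standard text, NLTK's TweetTokenizer keeps hashtags and emoticons together where a naive split on whitespace and punctuation would break them; the example tweet is invented, and the options shown are that class's documented parameters.

    from nltk.tokenize import TweetTokenizer

    # Invented example of non-standard social-media English.
    tweet = "LoOove the new #NLP course!!! thx @prof :-)"

    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
    print(tokenizer.tokenize(tweet))
    # the hashtag '#nlp' and the emoticon ':-)' survive as single tokens;
    # the user handle '@prof' is stripped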

What tools do we need? We need knowledge about language, knowledge about the world, and a way to combine these knowledge sources. Generally, the way we do this is to use probabilistic models built from language data. For example, if we see the word "maison" in French, we are very likely to translate it as the word "house" in English; on the other hand, if we see the phrase "avocat général" in French, we are very unlikely to translate it as "general avocado". Training these probabilistic models in general can be very hard, but it turns out that we can do an approximate job with rough text features, and we'll introduce those rough text features as we go on.

Our goal in the class is to teach the key theory and methods for statistical natural language processing. We'll talk about the Viterbi algorithm, Naive Bayes, and MaxEnt classifiers. We'll introduce N-gram language modeling and statistical parsing. We'll talk about the inverted index, TF-IDF, and vector models of meaning that are important in information retrieval. And we'll do this for practical, robust, real-world applications: we'll talk about information extraction, about spelling correction, and about information retrieval.

As for the skills you'll need for the course: you'll need simple linear algebra, so you should know what a vector is and what a matrix is; you should have some basic probability theory; and you need to know how to program in either Java or Python, because there will be weekly programming assignments and you have your choice of those two languages. We're very happy to welcome you to our course on Natural Language Processing, and we look forward to seeing you in the following lectures.