Transcript: What is Text Analytics? Presenter: Jayatheerthan

Welcome to the BigData University's online course on Text Analytics.
In this section of the course we will introduce you to the concept of Text Analytics.
Here is a problem statement as food for thought: the requirement is to develop a software
application that can read movie reviews from various blogs and review sites; highlight the best and worst
parts of the movie; and suggest a movie that meets your expectations.
Do you think web search can do the job? Not really.
There are quite a lot of challenges in solving this problem that web search is not equipped to handle.
A web search can only retrieve pages that contain certain keywords of your interest. However, it cannot
analyze the contents of web pages.
So what are the challenges involved?
Understanding human language involves the ability to differentiate positive and negative connotations;
the ability to understand the emotion associated with a piece of text; the ability to identify sarcasm; and
a lot more. In general, it means interpreting text the way human beings do.
Nearly 80% of the world's data is unstructured, since it arrives from sources such as the World Wide Web,
emails, social media such as Facebook and Twitter, blogs, online newspapers, call centre notes, ebooks, etc.
In order to make sense of such data, the challenge lies in distilling huge volumes of unstructured
text and interpreting the relevant context.
Text analytics applications are equipped to do the job of extracting business intelligence from
unstructured or structured text.
For example, given a news item like the one posted here, a text extraction engine is capable of identifying
entities, such as person names, organizations, titles, etc.
A text extraction engine takes an input document and runs a set of domain-specific rules to label certain
portions of text as annotations, which are then fed into business analytics engines for further analysis.
In this example, you would notice that person names and phone numbers have been extracted from an
input document. The result of extraction is a table of records with each record containing the labelled text
along with its start and end indexes.
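
To make the idea of extraction records concrete, here is a minimal sketch in Python. The two rules and the sample sentence are made up for illustration; they are not the rules of any particular text analytics engine. Each rule is a regular expression, and each match becomes a record holding the label, the matched text, and its start and end indexes.

```python
import re

# Hypothetical rules for illustration only: a US-style phone number
# and a naive "title followed by capitalized words" person pattern.
RULES = {
    "Phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "Person": re.compile(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)?"),
}

def extract(text):
    """Return records of (label, matched text, start index, end index)."""
    records = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            records.append((label, match.group(), match.start(), match.end()))
    # Sort by start index so records follow document order.
    return sorted(records, key=lambda record: record[2])

doc = "Call Dr. Jane Smith at 555-123-4567 today."
for record in extract(doc):
    print(record)
```

The end index here is exclusive, so `doc[start:end]` gives back the labelled text.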
How does text extraction work? What is under the hood?
A text analytics engine performs several steps before it can distil information from an unstructured data
source.
The first step is identifying the language of the document.
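
As a rough illustration of language identification, the sketch below counts how many words of the input appear in a small stopword list for each language. The stopword lists and the function name `guess_language` are assumptions made for this example; real engines rely on much richer models.

```python
# Tiny, hypothetical stopword lists; chosen only for this illustration.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "french": {"le", "la", "et", "est", "de", "dans"},
    "german": {"der", "die", "und", "ist", "von", "nicht"},
}

def guess_language(text):
    """Pick the language whose stopwords occur most often in the text."""
    words = text.lower().split()
    scores = {
        language: sum(word in stopwords for word in words)
        for language, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no stopword matched at all, do not guess.
    return best if scores[best] > 0 else "unknown"

print(guess_language("The movie is one of the best in years"))
print(guess_language("Le film est dans la salle"))
```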
Secondly, it performs segmentation which is essentially splitting the text document into words. This step is
also known as tokenization since it involves breaking down the text into tokens.
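
Tokenization can be sketched with a single regular expression. This is a simplified example that assumes whitespace-separated text; it treats runs of word characters as tokens and each punctuation mark as its own token, and it keeps the start and end index of every token.

```python
import re

# A word is a run of word characters; any other non-space character
# (such as punctuation) becomes a token of its own.
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into (token, start index, end index) triples."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]

for token in tokenize("Great movie, loved it!"):
    print(token)
```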
Some text analytics engines perform normalization which involves expansion of acronyms and in some
cases replacing multiple words that mean the same with a single synonymous word.
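
Here is a small sketch of normalization, assuming two lookup tables: one that expands acronyms and one that maps synonyms to a single word. Both tables are illustrative examples, not part of any specific engine.

```python
# Illustrative lookup tables; a real engine would load much larger ones.
ACRONYMS = {"NYC": "New York City", "IBM": "International Business Machines"}
SYNONYMS = {"film": "movie", "flick": "movie", "picture": "movie"}

def normalize(tokens):
    """Expand known acronyms and replace synonyms with one canonical word."""
    normalized = []
    for token in tokens:
        if token in ACRONYMS:
            # An acronym may expand into several tokens.
            normalized.extend(ACRONYMS[token].split())
        else:
            normalized.append(SYNONYMS.get(token.lower(), token))
    return normalized

print(normalize(["The", "flick", "opened", "in", "NYC"]))
```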
The next important step is classification of tokens into categories such as person names, organizations,
dates, email IDs, etc. Domain-specific rules perform this classification.
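
Rule-based classification might look like the sketch below. It assumes two simple patterns (an email address and an ISO-style date) and two small name lists; all of them are hypothetical and stand in for the domain-specific rules a real application would define.

```python
import re

# Hypothetical rules: two patterns and two small dictionaries of known names.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
PERSON_NAMES = {"Alice", "Bob"}
ORGANIZATIONS = {"IBM", "Acme"}

def classify(token):
    """Assign a single token to a category using the rules above."""
    if EMAIL.fullmatch(token):
        return "email"
    if DATE.fullmatch(token):
        return "date"
    if token in PERSON_NAMES:
        return "person"
    if token in ORGANIZATIONS:
        return "organization"
    return "other"

for token in ["Alice", "joined", "IBM", "2024-01-15", "alice@example.com"]:
    print(token, "->", classify(token))
```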
Some text analytics applications perform the task of disambiguation, which is essentially a
context-sensitive interpretation of words.
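
One simple way to picture disambiguation is to compare the words around an ambiguous term with cue words for each possible meaning. The sketch below uses that context-overlap heuristic; the cue-word table is invented for this example, and real applications use far more sophisticated methods.

```python
# Hypothetical cue words for each sense of an ambiguous word.
SENSES = {
    "apple": {
        "company": {"iphone", "shares", "ceo", "stock"},
        "fruit": {"eat", "pie", "juice", "tree"},
    }
}

def disambiguate(word, context):
    """Pick the sense whose cue words overlap most with the context words."""
    senses = SENSES.get(word.lower())
    if not senses:
        return None
    context_words = {w.lower() for w in context}
    scores = {sense: len(cues & context_words) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # Without any overlapping cue word there is no basis for a choice.
    return best if scores[best] > 0 else None

print(disambiguate("Apple", "Apple shares rose after the CEO spoke".split()))
print(disambiguate("apple", "I baked an apple pie".split()))
```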
The next important step is relationship extraction: identifying the relationships between the extracted
entities.
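
As a final illustration, relationships between entities can be found with a pattern that links a person to an organization. The pattern below is a deliberately naive assumption (two capitalized words, a linking phrase, then a capitalized organization name) and is only meant to show the shape of the output.

```python
import re

# Naive, hypothetical pattern: "<First Last> works at|joined <Organization>".
RELATION = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) "
    r"(?P<verb>works at|joined) "
    r"(?P<org>[A-Z][A-Za-z]+)"
)

def extract_relations(text):
    """Return (person, relation, organization) triples found in the text."""
    return [(m["person"], m["verb"], m["org"]) for m in RELATION.finditer(text)]

text = "Jane Smith works at Acme. Raj Patel joined Globex last year."
for relation in extract_relations(text):
    print(relation)
```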
We hope you have now understood Text Analytics at a high level.
Watch out for more videos in this series to learn more about the field of Text Analytics.