Transcript:
Text Analytics Workflow (part-1)
Presenter:
Rakhi Arora
In this chapter on Text Analytics Workflow, we will learn how to extract person names and phone
numbers from a sample email data set using AQL.
Annotation Query Language, or AQL, is a domain-specific language that comes with its own set of
challenges and best practices, including the challenge that we currently face: How do we approach this
text analytics problem and extract entities from unstructured text?
The Text Analytics Workflow is intended to address this challenge and guide the developer in acquiring
the best practices in writing AQL.
In this video, we will speak about the BigInsights project setup for the workflow and the importing of
input documents and dictionaries.
Once the BigInsights Eclipse tooling has been installed, we'll create a new BigInsights project.
In the BigInsights Task Launcher, select the Analyze Text tab and then click Create an extractor in a new
BigInsights project. We'll give it a name "PersonPhone". Click OK in the popup to switch to the BigInsights
perspective. This will create a BigInsights project named PersonPhone.
Two key views open up in the text analytics workflow perspective: Extraction Task (shown on the top left)
and Extraction Plan (shown on the right).
Currently the plan shows a top level project named PersonPhone and it is empty.
A project structure is also created automatically. See the structure in Package explorer view next to the
Extraction task view.
To verify the project properties, right-click the project name in the Package explorer and select Properties.
In the Properties dialog, go to the Text Analytics entry. We see that the default entries for the location of
"main.aql" and the data path are already set. The Project builder compiles only the main AQL file and its
dependencies.
Once the text analytics properties are set, the next step is to copy the input text files that are to be
analyzed and the dictionaries that will be used in extraction.
Create a new project folder under project PersonPhone by right-clicking on the project name and
selecting "New" > "Folder". We will store the input files here.
Copy into this folder the sample email dataset from which you want to extract person names and phone numbers.
Verify that the encoding of your input documents is UTF-8 by right-clicking in Project explorer and selecting
Properties. In the Properties dialog, select Resource and, under text file encoding, select the radio button
"Other", and make sure UTF-8 is selected from the drop-down list.
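The Eclipse setting only tells the tooling how to read the files; it does not confirm that the files are really stored as UTF-8. As a rough check outside the tooling, a small Python sketch like the one below can list the files that do not decode as UTF-8. The function name find_non_utf8_files and the folder name in the usage note are assumptions made for illustration; they are not part of BigInsights.

```python
from pathlib import Path


def find_non_utf8_files(folder):
    """Return the sorted names of files under `folder` that do not decode as UTF-8.

    Hypothetical helper for illustration; not part of the BigInsights tooling.
    """
    bad = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            try:
                # Decoding the raw bytes raises an error on invalid UTF-8.
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path.name)
    return sorted(bad)
```

For example, calling find_non_utf8_files("input") on a hypothetical input folder returns an empty list when every file decodes cleanly as UTF-8.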
Now to complete the project setup, create a dictionary folder under the project.
Copy a set of dictionaries into the dictionary folder under the project. These dictionaries contain a common
set of names: a list of first names and a list of last names. Set the dictionary data path in the text analytics
properties as I showed earlier.
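To give a feel for what the first-name and last-name dictionaries are for, here is a minimal Python sketch of dictionary matching. It is not AQL and not how the BigInsights runtime works internally; the tiny name sets and the function extract_persons are assumptions made up for this illustration. It simply reports a person wherever a first-name entry is directly followed by a last-name entry.

```python
import re

# Hypothetical dictionary contents; real dictionaries would be loaded
# from the files copied into the dictionary folder.
FIRST_NAMES = {"john", "rakhi"}
LAST_NAMES = {"smith", "arora"}

WORD = re.compile(r"[A-Za-z]+")


def extract_persons(text):
    """Return 'First Last' pairs where adjacent words match the two dictionaries."""
    tokens = [m.group() for m in WORD.finditer(text)]
    persons = []
    # Look at each pair of neighbouring words (punctuation between them is ignored).
    for first, last in zip(tokens, tokens[1:]):
        if first.lower() in FIRST_NAMES and last.lower() in LAST_NAMES:
            persons.append(f"{first} {last}")
    return persons
```

A name that is missing from either dictionary is simply not found, which is one reason the quality of the dictionaries matters for the extraction results.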
The project setup is now done.
We will now move on to the six steps involved in the Text Analytics Workflow:
Step 1: Select Document Collection
Step 2: Label Examples and Clues
Step 3: Develop the Extractor
Step 4: Test the Extractor
Step 5: Profile the Extractor
Step 6: Export the Extractor
Ultimately, to get the results that you want, you must perform steps 2 to 4 in an iterative fashion. Each time
you perform these steps, your results become more refined.
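The iterative loop over steps 2 to 4 can be pictured with a small Python sketch: label a few examples, run an extractor, compare the output with the labels, then refine and test again. Everything here is an assumption for illustration (the sample text, the labeled set, and the two phone patterns V1 and V2); in the actual workflow the extractor is written in AQL and tested in the tooling.

```python
import re


def extract_phones(text, pattern):
    """Run one version of a regex-based phone extractor."""
    return set(re.findall(pattern, text))


def precision_recall(found, labeled):
    """Compare extractor output with the hand-labeled examples."""
    true_pos = len(found & labeled)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(labeled) if labeled else 0.0
    return precision, recall


# Hypothetical document and the phone numbers a person labeled in it.
DOC = "Call 555-123-4567 or (555) 987-6543. Order no. 111-222-3333-9."
LABELED = {"555-123-4567", "(555) 987-6543"}

# First attempt: misses the parenthesized form and matches part of an order number.
V1 = r"\d{3}-\d{3}-\d{4}"
# Refined attempt: allows "(ddd) " and rejects matches glued to more digits or dashes.
V2 = r"(?<![\d-])(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}(?![\d-])"

for name, pattern in (("V1", V1), ("V2", V2)):
    print(name, precision_recall(extract_phones(DOC, pattern), LABELED))
```

Each pass through label, develop, and test exposes the misses and false matches of the previous version, which is what drives the refinement.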
We will speak more about these steps in the subsequent videos on Workflow.