Transcript: Text Analytics Workflow (part-1)
Presenter: Rakhi Arora

In this chapter on Text Analytics Workflow, we will learn how to extract person names and phone numbers from a sample email data set using AQL. Annotation Query Language, or AQL, is a domain-specific language that comes with its own set of challenges and best practices, including the challenge that we currently face: How do we approach this text analytics problem and extract entities from unstructured text? The Text Analytics Workflow is intended to address this challenge and guide the developer toward best practices in writing AQL.

In this video, we will speak about the BigInsights project setup for the workflow and the importing of input documents and dictionaries.

Once the BigInsights Eclipse tooling has been installed, we'll create a new BigInsights project. In the BigInsights Task Launcher, select the Analyze Text tab and then click Create an extractor in a new BigInsights project. We'll name it "PersonPhone". Click OK in the popup to switch to the BigInsights perspective. This will create a BigInsights project named PersonPhone.

Two key views open up in the Text Analytics Workflow perspective: Extraction Task (shown on the top left) and Extraction Plan (shown on the right). Currently the plan shows a top-level project named PersonPhone, and it is empty. A project structure is also created automatically. See the structure in the Package Explorer view next to the Extraction Task view.

To verify the project properties, right-click the project name in the Package Explorer and select Properties. In the Properties dialog, go to the Text Analytics entry. We see that the default entries for the location of "main.aql" and the data path are already set. The project builder compiles only the main AQL file and its dependencies.

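To make the goal of this chapter concrete, here is a rough sketch of what "extracting person names and phone numbers" means. It is written in plain Python, not AQL, and it is not how the BigInsights tooling works internally. The name list, the phone pattern, and the function name extract_person_phone are all assumptions made up for illustration.

```python
import re

# Hypothetical stand-in for a first-name dictionary (assumption, not from the project).
FIRST_NAMES = {"john", "mary", "rakhi"}

# Assumed phone format: optional 3-digit area code, then 3 digits, a separator, 4 digits.
PHONE_RE = re.compile(r"\b(?:\d{3}[-.\s])?\d{3}[-.]\d{4}\b")

# Any capitalized word; it is kept as a name only if the dictionary contains it.
WORD_RE = re.compile(r"\b[A-Z][a-z]+\b")


def extract_person_phone(text):
    """Return (names, phones) found in text via dictionary lookup and a regex."""
    names = [w for w in WORD_RE.findall(text) if w.lower() in FIRST_NAMES]
    phones = PHONE_RE.findall(text)
    return names, phones


if __name__ == "__main__":
    sample = "Please call Mary at 555-123-4567 or John at 555-9876."
    print(extract_person_phone(sample))
```

A dictionary match finds known names and a pattern match finds phone numbers. The same two ideas, dictionaries and patterns, are what the project setup in this video prepares for.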
Once the text analytics properties are set, the next step is to copy in the input text files that are to be analyzed and the dictionaries that will be used in extraction. Create a new folder under the project PersonPhone by right-clicking the project name and selecting "New" > "Folder". We will store the input files here. Copy in the sample email dataset from which you want to extract person names and phone numbers.

Verify that the encoding of your input documents is UTF-8 by right-clicking in the Project Explorer and selecting Properties. In the Properties dialog, select Resource, and under text file encoding, select the radio button "Other" and make sure UTF-8 is selected from the drop-down.

Now, to complete the project setup, create a dictionary folder under the project and copy a set of dictionaries into it. These dictionaries contain a common set of names: a list of first names and a list of last names. Set the dictionary data path in the text analytics properties as I showed earlier.

The project setup is now done. We will now move on to the six steps involved in the Text Analytics Workflow:

Step 1: Select Document Collection
Step 2: Label Examples and Clues
Step 3: Develop the Extractor
Step 4: Test the Extractor
Step 5: Profile the Extractor
Step 6: Export the Extractor

Ultimately, to get the results that you want, you must perform steps 2 to 4 iteratively. Each time you perform these steps, your results become more refined. We will speak more about these steps in the subsequent videos on Workflow.
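The UTF-8 check on the input documents described earlier can also be done outside Eclipse. The following is a minimal Python sketch, assuming the input files sit directly inside one folder. The function name find_non_utf8_files and the folder name "input" are hypothetical; the transcript does not name the folder.

```python
from pathlib import Path


def find_non_utf8_files(folder):
    """Return the names of files directly under folder that do not decode as UTF-8."""
    bad = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # skip subfolders
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad


if __name__ == "__main__":
    # "input" is a hypothetical folder name; replace it with your own input folder.
    if Path("input").is_dir():
        print(find_non_utf8_files("input"))
```

An empty list means every file decoded cleanly as UTF-8. Any file that is listed may need to be re-saved as UTF-8 before it is used as input.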