Transcript:
Text Analytics Workflow (part-4)
Presenter:
Shyamala Gowri
In this video, we will continue with Step 3 - Develop Extractors.
We will first extract the concepts for the complete person name, which combines the first name and the
last name. You will make use of some of the AQL constructs: the "select" statement, "union all" and
"consolidate".
In the previous chapter, we extracted first name and last name. Now let's extract the combinations of first
and last name. For this we will create another view called FirstLast:
In the Extraction Plan, select PersonPhone > Labels > Candidate Generation.
Right-click, choose Change AQL File.
(PersonPhone > aql > PersonPhone) Select concepts.aql from the browse window.
You will get a message to include concepts.aql in the main AQL file. Click OK, as it has already been
added by default.
In the Extraction Plan, select PersonPhone > Labels > PersonName > AQL Statements > Candidate
Generation.
Right-click and choose Add AQL Statement.
View Name: FirstLast.
And choose Type as Select.
Click OK.
The "select" operator takes as input a set of tuples and a predicate to apply to the tuples. It outputs all
the tuples that satisfy the predicate.
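To make the idea of "tuples plus a predicate" concrete, here is a minimal Python sketch. It is only a model of the behavior just described, not AQL and not how the Text Analytics runtime is implemented; the names Row, select and the sample data are made up for illustration.

```python
from typing import Callable, List, NamedTuple


class Row(NamedTuple):
    """One tuple of a view; 'match' is a hypothetical attribute name."""
    match: str


def select(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Return every input tuple for which the predicate holds."""
    return [row for row in rows if predicate(row)]


# Illustrative input tuples (not taken from the tutorial's documents).
candidates = [Row("Lorraine"), Row("smith"), Row("Smith")]

# Predicate: keep only tuples whose text starts with an upper-case letter.
capitalized = select(candidates, lambda r: r.match[:1].isupper())
print(capitalized)
```

The tuples that fail the predicate are simply dropped; the ones that pass are returned unchanged.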
The new view appears in the Extraction Plan and a template is added in the PersonPhone/concepts.aql.
For the "select" clause, we want to select the span that combines the spans of the first name and the last name.
In this view, the match attribute of FirstLast should be a combination of the match attributes of the
FirstName and LastName views:
select CombineSpans (FN.match, LN.match) as match
from FirstName FN, LastName LN
We also add a condition that these two attributes should not have anything between them. That
is, there should be no token between the first name and the last name. For that we add the following:
where FollowsTok(FN.match, LN.match, 0, 0);
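To see what this join does, here is a rough Python sketch of the two functions used above. It is a model under stated assumptions, not the actual AQL implementation: spans are modeled as (begin, end) character offsets, and the tokenizer (words and single punctuation marks, with whitespace not counted as a token) is a simplification chosen for this example. The sample sentence is also made up.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    begin: int
    end: int


def tokenize(text: str) -> List[Span]:
    # Assumption: a token is a run of word characters or one punctuation mark.
    return [Span(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def follows_tok(text: str, first: Span, second: Span, min_tok: int, max_tok: int) -> bool:
    """True if 'second' starts after 'first' with min_tok..max_tok tokens between them."""
    if second.begin < first.end:
        return False
    between = [t for t in tokenize(text)
               if t.begin >= first.end and t.end <= second.begin]
    return min_tok <= len(between) <= max_tok


def combine_spans(first: Span, second: Span) -> Span:
    """Span running from the start of 'first' to the end of 'second'."""
    return Span(first.begin, second.end)


def find_all(text: str, word: str) -> List[Span]:
    return [Span(m.start(), m.end()) for m in re.finditer(re.escape(word), text)]


text = "Call Lorraine Smith or Lorraine J. Smith today."
first_names = find_all(text, "Lorraine")   # stands in for the FirstName view
last_names = find_all(text, "Smith")       # stands in for the LastName view

# Join every first name with every last name, keeping adjacent pairs only.
first_last = [combine_spans(fn, ln)
              for fn in first_names
              for ln in last_names
              if follows_tok(text, fn, ln, 0, 0)]

print([text[s.begin:s.end] for s in first_last])
```

In this sketch only "Lorraine Smith" qualifies; "Lorraine J. Smith" is rejected because the tokens "J" and "." sit between the two names.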
Now let's combine the results of these three views, FirstName, LastName and FirstLast, into a single view.
I will use the "union all" operator to combine the match attribute of the three views into a
single attribute, person, of the view Person. The "union all" statement merges the output of two or more
"select" or "extract" statements.
For that, in the Extraction Plan, select PersonPhone > Labels > PersonName > AQL Statements > Candidate
Generation.
Click on Add AQL Statement.
And enter Person for View Name.
And choose Union all from the drop down.
In the generated template, we make the following changes:
(select P.match as person from FirstName P)
union all
(select P.match as person from LastName P)
union all
(select P.match as person from FirstLast P)
This unions the results from FirstName, LastName and FirstLast.
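As a rough Python picture of what this union does: each branch renames its match attribute to person, and the rows are then appended one after another. This is an illustrative model only; the dictionaries and the union_all helper are hypothetical, and the assumption that nothing is deduplicated at this stage is based on the three separate records we see for the same name when the extractor runs.

```python
from itertools import chain
from typing import Dict, Iterable, List

# Hypothetical contents of the three candidate views for one document.
first_name = [{"match": "Lorraine"}]
last_name = [{"match": "Smith"}]
first_last = [{"match": "Lorraine Smith"}]


def as_person(view: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Model of 'select P.match as person from <view> P'."""
    return [{"person": row["match"]} for row in view]


def union_all(*branches: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Append all branches; no rows are merged or removed."""
    return list(chain.from_iterable(branches))


person = union_all(as_person(first_name), as_person(last_name), as_person(first_last))
print([row["person"] for row in person])
```

All three rows survive the union, which is why a later step is needed to remove the unwanted ones.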
Now that we have made the changes, let's run the extractor.
As you can see in the Annotation Explorer, it has extracted FirstName, LastName and FirstLast.
We can also view the result in the Result viewer by clicking on the Person view. As you can see here,
from the same document, it has extracted Lorraine Smith, Lorraine and Smith; three different records for
the same name. However, ideally, it should have extracted only one entity. For example, Lorraine Smith.
To remove these unwanted entities, Lorraine and Smith, from the extracted results, we will use the Filter
and Consolidate feature available in the Extraction Plan.
For that we will create another view named PersonFinal. I will use the "consolidate" keyword on the person
attribute of the Person view. It removes overlapping duplicates: for each group of overlapping spans, it keeps
the span selected by the consolidation policy and removes all the other unwanted entities.
In the Extraction Plan, select PersonPhone > Labels > PersonName > AQL Statements > Filter and Consolidate.
Right-click and choose Change AQL File.
(PersonPhone > aql > PersonPhone) Select refinement.aql from the browse window.
You will again get a message to include refinement.aql in the main AQL file. Click OK, as it has
already been added by default.
In the Extraction Plan, select PersonPhone > Labels > PersonName > AQL Statements > Filter and
Consolidate.
Click on Add AQL Statement.
Type in PersonFinal for View Name.
And choose Consolidate for Type.
Now let's fill in the generated template and run the extractor again:
select P.person as person
from Person P
consolidate on P.person using 'LeftToRight';
The <consolidation policy> placeholder in the template is set to 'LeftToRight' here.
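Here is a small Python sketch of how a left-to-right consolidation can behave. It assumes the policy scans spans from left to right and, where spans overlap, keeps the leftmost, longest one; treat this as an approximation for illustration rather than the exact AQL implementation. The helper names and the sample sentence are made up.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets


def span_of(text: str, phrase: str) -> Span:
    """Offsets of the first occurrence of 'phrase' in 'text'."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase))


def consolidate_left_to_right(spans: List[Span]) -> List[Span]:
    """Keep the leftmost, longest span of each overlapping group."""
    kept: List[Span] = []
    last_end = -1
    # Sort by start offset; for equal starts, put the longer span first.
    for begin, end in sorted(spans, key=lambda s: (s[0], -s[1])):
        if begin >= last_end:      # no overlap with the span kept before it
            kept.append((begin, end))
            last_end = end
    return kept


text = "Lorraine Smith met John."
person = [span_of(text, "Lorraine"),        # from FirstName
          span_of(text, "Smith"),           # from LastName
          span_of(text, "Lorraine Smith"),  # from FirstLast
          span_of(text, "John")]            # an unrelated first name

person_final = consolidate_left_to_right(person)
print([text[b:e] for b, e in person_final])
```

"Lorraine" and "Smith" both overlap the longer "Lorraine Smith" span and are dropped, while "John" overlaps nothing and is kept.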
Now that we have the output view PersonFinal, which consolidates the Person view (itself a union of
FirstName, LastName and FirstLast), let's run the extractor again.
As you can see now in the Annotation Explorer, all the duplicates have been removed.
This concludes the steps for developing extractors for complete person name from unstructured text.
In the next chapter, we will see the remaining Steps 4, 5 and 6 of the workflow, that is, Test, Profile and Export
Extractors.