In this video, let's look at some of the statement constructs that we used to extract person names and phone numbers from an input document collection.
In the AQL demo, our approach was to use one set of dictionaries to extract first names and another to extract last names. We then used a simple rule that a last name should follow a first name; the combination is the person name. For extracting phone numbers, on the other hand, we used regular expressions.
To combine these extracted entities, we again defined a rule that phone number should be followed by a person name.
Before moving on to the statement constructs that we used in the PersonPhone example, let us look at some of the most commonly used statements for extracting entities and how they can be classified.
The data model of AQL is defined by the top-level component of an AQL annotator, that is, the view.
Views are logical statements that define a set of tuples.
The "create view" statement creates a new named view and defines the tuples inside it. A view, when created, is a non-output view; that is, it might not get computed.
The "output view" statement specifies that the resulting tuples of that view will be computed when the AQL annotator executes.
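As a minimal sketch of these two statements, assuming the built-in Document view with its text column: the view name FourDigits, the column name num, and the regular expression are hypothetical and are not taken from the demo.

```aql
-- Hypothetical view: each tuple holds one span covering a 4-digit number.
create view FourDigits as
extract regex /\d{4}/
    on D.text as num
from Document D;

-- Without this statement, FourDigits stays a non-output view
-- and might not get computed.
output view FourDigits;
```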
AQL statements can be classified into three broad categories: Extraction Constructs, Relational or SQL-style Constructs, and Functions.
The "extract" statement provides a variety of functionality for extracting basic features directly from text. For example, phone numbers, PIN codes, and other patterns can be extracted using regular expressions, while company names, person names, movie names, and so on can be extracted using dictionaries of strings. AQL also supports other types of extraction, such as parts of speech, sequence patterns, and blocks.
AQL's "select" statement provides a mechanism for constructing complex patterns out of simpler building blocks. In AQL, a "select" statement is used to select attributes, or columns, from views. The syntax is very similar to SQL's "select" statement: just as in SQL we perform join, union, minus, and other operations on tables, in AQL we can perform similar operations on views based on their attributes.
A number of functions are also supported in AQL which can be used in AQL statements. These can be grouped in "Predicate Functions", "Scalar Functions" and "Aggregate Functions". All these functions can be used along with AQL statements to perform efficient information extraction.
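To make the three groups a little more concrete, here is a small sketch with one function from each group. The view names, column names, and regular expressions are hypothetical; MatchesRegex, GetLength, and Count are used here as I understand them from the AQL reference, so check the exact signatures there before relying on them.

```aql
-- Hypothetical base view: capitalized words in the document text.
create view CapsWord as
extract regex /[A-Z][a-z]+/
    on D.text as word
from Document D;

-- Predicate function (MatchesRegex) filters tuples in the where clause;
-- scalar function (GetLength) computes one value per tuple.
create view LongCapsWord as
select W.word as word, GetLength(W.word) as len
from CapsWord W
where MatchesRegex(/[A-Z][a-z]{4,}/, W.word);

-- Aggregate function (Count) computes one value over all tuples.
create view CapsWordCount as
select Count(*) as total
from CapsWord W;

output view LongCapsWord;
output view CapsWordCount;
```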
Now let's take a look at the statements that we used to solve the PersonPhone problem.
The regular expression extraction specification has the following structure:
The first part of the specification lists one or more regular expressions. Each regular expression is enclosed between two forward slash characters.
The second part of the specification is an optional "flags string". This string specifies a combination of flags that control regular expression matching, for example, flags for multi-line or case-insensitive matching.
The third part of the specification deals with the …to match regular expression only on token boundaries.
The location of the token boundary depends on the tokenizer the runtime engine is using to tokenize the document.
The final part of the specification tells the system how to handle capturing groups in the regular expression.
Capturing groups are regions in a regular expression identified by parentheses.
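The four parts described above can be sketched in one statement. This is an illustration under assumptions, not the statement from the demo: the regular expression, the flag value, the token range, and the column names areaCode, restOfNumber, and fullNumber are all chosen for the example.

```aql
create view PhoneNumber as
extract
    -- Part 1: one or more regular expressions between forward slashes.
    regex /(\d{3})-(\d{3}-\d{4})/
    -- Part 2: optional flags string (assumed value, for illustration).
    with flags 'CASE_INSENSITIVE'
    -- Part 3: token specification; match only on token boundaries.
    on between 4 and 5 tokens in D.text
    -- Part 4: how to handle capturing groups; group 0 is the whole match.
    return group 1 as areaCode
       and group 2 as restOfNumber
       and group 0 as fullNumber
from Document D;

output view PhoneNumber;
```

Whether a given number falls inside the 4-to-5 token range depends on how the tokenizer splits digits and hyphens, so the range may need adjusting for a different tokenizer.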
Now let's look at the example where we used the "extract" statement to extract phone numbers using a regular expression. As you can see, we created a view definition followed by an extract statement to define the tuples. We used a regular expression for certain phone numbers provided in the document text, and below you can see the result that has been highlighted in the document.
Extraction of entities from unstructured text can also be done using dictionaries of strings.
To find matches of a dictionary, use a dictionary extraction specification. In this specification, we specify the external dictionary that contains the strings to match.
The "flags string" controls how dictionary matching is performed. Two options are supported for the "flags string": "Exact" for case-sensitive matching and "IgnoreCase" for case-insensitive matching ("IgnoreCase" is the default if no flag is specified).
In this example, you can see that FirstName is extracted using a dictionary, "strictFirst", which contains a list of common first names. Below, you can see the entities that have a match in the dictionary content.
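A dictionary extraction along these lines might look like the sketch below. To keep it self-contained, it uses a small inline dictionary as a stand-in for the external "strictFirst" dictionary; the dictionary name FirstNamesDict, its three entries, and the column name are hypothetical.

```aql
-- Hypothetical inline dictionary standing in for the external one.
create dictionary FirstNamesDict as
    ('Lorraine', 'John', 'Maria');

create view FirstName as
extract dictionary 'FirstNamesDict'
    -- Case-sensitive matching; drop this line to use the default.
    with flags 'Exact'
    on D.text as name
from Document D;

output view FirstName;
```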
As I said on a previous slide, SQL-like operations can be performed on views in AQL. This is the specification for a "union all" statement, with which you can perform a union of several extract or select statements. In this slide you can see that we used the "union all" statement to combine the FirstName, LastName, and FirstLast views in order to get one unified person name, defined by the view named Person.
Below in the result, you can see that unified names have been highlighted.
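A rough sketch of such a union is shown here. The view names FirstName, LastName, FirstLast, and Person come from the transcript, but their definitions below, including the one-entry dictionaries and the column name, are simplified assumptions so that the example stands on its own.

```aql
create dictionary FirstNamesDict as ('Lorraine');
create dictionary LastNamesDict as ('Smith');

create view FirstName as
extract dictionary 'FirstNamesDict' on D.text as name from Document D;

create view LastName as
extract dictionary 'LastNamesDict' on D.text as name from Document D;

-- A first name immediately followed by a last name.
create view FirstLast as
select CombineSpans(F.name, L.name) as name
from FirstName F, LastName L
where FollowsTok(F.name, L.name, 0, 0);

-- Each operand of "union all" is a parenthesized select or extract
-- statement with the same output schema.
create view Person as
    (select F.name as name from FirstName F)
    union all
    (select L.name as name from LastName L)
    union all
    (select FL.name as name from FirstLast FL);

output view Person;
```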
Now let's look at the "consolidate" clause, an optional clause that tells the system what to do if the other parts of the statement produce spans that overlap. The following is the structure of the consolidate clause, where target represents a column and policy is one of the supported consolidation policies, for example, ContainedWithin or LeftToRight. This clause helps remove overlaps from the output tuples.
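As an illustration of the target and policy parts, here is a small sketch. The NameCandidate view and its three regular expressions are hypothetical and are only there to produce overlapping spans; the demo itself builds its candidates from dictionaries.

```aql
-- Hypothetical candidates: three regular expressions whose matches
-- overlap on the text "Lorraine Smith".
create view NameCandidate as
extract regexes /Lorraine/ and /Smith/ and /Lorraine Smith/
    on D.text as name
from Document D;

-- target is the column C.name; policy is 'LeftToRight'.
create view PersonConsolidated as
select C.name as name
from NameCandidate C
consolidate on C.name using 'LeftToRight';

output view PersonConsolidated;
```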
As you can see in this slide, we used the "consolidate" clause to remove overlaps from the resulting tuples. In this example, we used the "LeftToRight" policy, which processes the spans in order from left to right; when an overlap occurs, it retains the leftmost, longest, non-overlapping span. Below you can see that for a single name, three values were extracted, that is, Lorraine, Smith and Lorraine Smith. Using "consolidate" removes the overlapping spans and keeps the longest span, that is, Lorraine Smith.
Now let's look at the functions that we used in the PersonPhone example:
In the above statement you can see that the "CombineSpans" function is used to combine the person name and the phone number. That is, the PersonPhone view will contain all the results where a phone number is followed by a person name. We also use a predicate condition, "FollowsTok", which specifies that there should be zero tokens between the person name and phone number spans.
The result of the computation can be seen below where combinations of person names and phone numbers have been highlighted in the document.
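A sketch of this kind of statement follows. It takes the wording above literally, with the phone number first and the person name directly after it. The simplified Person and PhoneNumber definitions, the regular expressions, and the column names are assumptions for illustration and are not the ones from the demo.

```aql
-- Simplified stand-ins for the demo's Person and PhoneNumber views.
create view Person as
extract regex /[A-Z][a-z]+ [A-Z][a-z]+/
    on D.text as name
from Document D;

create view PhoneNumber as
extract regex /\d{3}-\d{4}/
    on D.text as num
from Document D;

-- FollowsTok(a, b, 0, 0): span b starts zero tokens after span a ends.
-- CombineSpans(a, b): one span from the start of a to the end of b.
create view PersonPhone as
select CombineSpans(Ph.num, P.name) as personphone
from PhoneNumber Ph, Person P
where FollowsTok(Ph.num, P.name, 0, 0);

output view PersonPhone;
```

If the name comes before the number in your documents, swap the two arguments in both function calls.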
This concludes the summary of the statements that I used in the PersonPhone demo. For the complete list of statements, along with the functions and other AQL capabilities, you can refer to the AQL reference manual.