Uploaded by Mark Tinkham


AI-Based Textual Analysis vs. Other Approaches for Indexing Mortgage Documents
Today there are many technology options available to assist in the automation of mortgage loan processing.
Some solutions are well marketed and low cost, with claims of vast rule libraries and tremendous results.
Other approaches claim that OCR is an antiquated technology, yet rely on very low-cost labor, older
technologies, or a combination of the two.
Alternative Approaches
There are three typical methodologies applied to document classification. Below, we provide a high-level
overview of each, along with some discussion of data extraction with each approach:
Full Page OCR for Document Classification
This approach to document classification is distinct from most other classification technologies in that it uses a
full-page OCR pass for every page of every document presented to it. Ideally, an entire page is read in less
than half a second and then a set of rules is applied to determine which document type each page belongs
to. While this would seem to be an obvious way to approach the task of identifying the very diverse
documents found in the mortgage industry, most technology providers are unable to deliver the speed
necessary to successfully scale with this approach.
Advantages of this approach include:
• Ability to index document versions that the system may never have seen before, provided they are
  lexically similar (same words and phrases found throughout)
• Ability to accurately distinguish between leading pages and following pages, thus eliminating the need
  to include separator sheets in the scanning process
• Ability to “discover” data for capture in much the same way a human being does, using words and
  phrases across the entire document to find key data
• High-speed OCR allows for almost infinite scalability with a relatively small hardware footprint
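The rule step described above can be illustrated with a minimal sketch. It assumes the OCR engine has already returned the page text as a string; the phrase table, the `classify_page` function, and the `min_hits` threshold are hypothetical and are not taken from any vendor's product.

```python
import re

# Hypothetical rule table: document type -> phrases expected on its pages.
# The phrases are illustrative assumptions, not rules from any product.
RULES = {
    "Promissory Note": ["promise to pay", "maturity date", "note holder"],
    "Appraisal": ["appraised value", "subject property", "comparable sales"],
    "Closing Disclosure": ["closing disclosure", "loan terms", "cash to close"],
}


def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def classify_page(ocr_text, rules=RULES, min_hits=2):
    """Return the document type whose phrases match most often, or None."""
    page = normalize(ocr_text)
    best_type, best_hits = None, 0
    for doc_type, phrases in rules.items():
        hits = sum(1 for phrase in phrases if phrase in page)
        if hits > best_hits:
            best_type, best_hits = doc_type, hits
    # Require a minimum number of matching phrases before committing to a type.
    return best_type if best_hits >= min_hits else None
```

Because the match depends only on the words found on the page, a layout the system has never seen can still be classified if it uses the same phrases.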
Visual Classification (Also Known as Fingerprinting)
This is an older approach that some vendors have remarketed and renamed as AI for use in the
mortgage industry. While it does recognize documents and has the advantage of sub-second speed, it is
NOT an OCR solution. Instead, an image analysis (non-text-based) approach is used to identify documents
and page types.
www.paradatec.com/mortgage
This solution attempts to differentiate between document type A and document type B largely by examining
the distribution of ink on samples of each document type. This is like a thumbprint analysis: a graphical
signature of each document type is learned and remembered.
Advantages of this methodology include:
• Performance (for the images successfully processed by the image signature method)
Disadvantages of this methodology include:
• The layout-specific configurations needed for each document variation can take a long time to set
  up if the number of document variations/types is high.
• These layout-specific configurations need to change if the layout of a document ever changes.
• The graphical signature approach tends to be less reliable with more than one hundred document
  variations/types to compare. This can affect accuracy in some cases.
• The time to process images tends to be linearly related to the number of document
  variations/types.
• This approach presents challenges when attempting to detect document boundaries for multiple
  page documents and does not provide an ability to extract data from the documents once
  identified.
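The ink-distribution comparison can be sketched as follows. This is a simplified illustration under stated assumptions, not any vendor's algorithm: a page is assumed to be a binary image given as rows of 0/1 pixels, and the function names, the grid size, and the squared-distance measure are all hypothetical.

```python
def ink_signature(page, grid=2):
    """Split a binary page image (rows of 0/1) into grid x grid cells and
    return the fraction of inked pixels in each cell."""
    height, width = len(page), len(page[0])
    signature = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * height // grid, (gy + 1) * height // grid)
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            ink = sum(page[r][c] for r in rows for c in cols)
            signature.append(ink / (len(rows) * len(cols)))
    return signature


def distance(a, b):
    """Squared Euclidean distance between two signatures."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_by_signature(page, learned, grid=2):
    """Return the name of the closest learned signature.
    Every learned layout is compared, so work grows with their number."""
    sig = ink_signature(page, grid)
    return min(learned, key=lambda name: distance(sig, learned[name]))
```

In this sketch the loop over `learned` is why processing time rises with the number of document variations, and a changed layout produces a different signature that no longer matches the stored one. No text is read at any point, so there is nothing to extract data from.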
Dynamic Learning
This approach does NOT have the advantage of a sub-second OCR solution, but it does use OCR as part of
its document classification and data extraction methodology to enhance its results. In general, the system
is a mix of preconfigured rules, a learned knowledge base, and layout-specific configurations. The rules are
configured through a GUI, but more complex operations require scripting. The technology is typically
configured for mailroom and Accounts Payable environments.
Learning is achieved by running real production data through the system to a human verification step. The
system attempts to learn from the document classification and data extraction decisions made by the
verification operator.
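A minimal sketch of this feedback loop is shown below. It assumes each page has already been reduced to some layout key, and it uses an exact-match lookup as a simplification; the class name, method names, and keys are hypothetical and do not describe any specific product.

```python
class LearnedKnowledgeBase:
    """Stores document types confirmed by a verification operator."""

    def __init__(self):
        # layout key -> document type confirmed by an operator
        self.templates = {}

    def predict(self, layout_key):
        """Return the learned type for this layout, or None if it is unknown."""
        return self.templates.get(layout_key)

    def verify(self, layout_key, operator_label):
        """Record the operator's decision and report whether the system's
        own prediction already agreed with it."""
        was_correct = self.predict(layout_key) == operator_label
        self.templates[layout_key] = operator_label
        return was_correct
```

Note that each call to `verify` changes what the system will predict next, directly in production, which is the behavior the regression-testing concern below relates to.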
An advantage of this technology is:
• In-production learning allows rapid use of layout-specific information. Unfortunately, this
  advantage is also a disadvantage. Many higher-volume sites require regression testing prior to
  promotion of any configuration change into production. This methodology assumes that such
  testing is not necessary.
Other disadvantages include:
• As the system adds layout-specific templates, the system gets proportionately slower
• Separator sheets between multi-page documents are required
• Production errors occur if layouts change
• Focused mostly on just mailroom and accounts payable (possibly an advantage to an AP customer,
  but mortgage documents do not lend themselves well to this approach)
About Paradatec
Paradatec’s Advanced OCR solutions offer significant efficiencies for classifying large quantities of differing
document types and extracting key data elements from those documents. For more information, please visit
www.paradatec.com/....(Link to Whitepaper).