lecture notes on Watson - Rensselaer Polytechnic Institute

WATSON @ RPI
INSIDE DEEPQA
Managing complex unstructured data with UIMA
Simon Ellis
22nd November, 2013
WATSON TECHNOLOGIES
Professor Jim Hendler
Simon Ellis
and
Kate McGuire, Nicole Negedly, Avi Weinstock, Matt Klawonn, Jenn Chan, Sarabeth Jaffe
OPEN ARCHITECTURE QUESTION ANSWERING
WATSON
RPI
INTRODUCTION
IBM Watson
Watson is…

… a piece of software that will run on your laptop

Though very slowly

Specialised hardware and control platform

… an implementation of the DeepQA concept

… the first iteration of the ‘cognitive computing’ platform

… a very clever artificial intelligence

A very clever application of human intelligence
Inside Watson
Watson pipeline as published by IBM; see IBM J Res & Dev 56 (3/4), May/July 2012, p. 15:2
QUESTION ANALYSIS
Nicole Negedly
Question Analysis
Question analysis

What is the question asking for?

Which terms in the question refer to the answer?

Given any natural language question, how can Watson
accurately discover this information?
Example
Question: Who is the president of Rensselaer Polytechnic Institute?
Focus terms: “Who”, “president of Rensselaer Polytechnic Institute”
Answer types: Person, President
Parsing and semantic analysis

What information about a previously unseen piece of
English text can Watson determine?

How is this information useful?
Natural Language Parsing
- grammatical structure
- parts of speech
- relationships between words
- ...etc.
Semantic Analysis
- meanings of words, phrases, etc.
- synonyms, entailment
- hypernyms, hyponyms
- ...etc.
Parsing

Stanford’s NLP toolset is used
Semantic relations in WordNet

Princeton University’s WordNet

Words are grouped into groups of synonyms called
synsets

Relationships exist between noun synsets

hypernym/hyponym: type-of relation


e.g. Canine is a hypernym of dog
holonym/meronym: part-of relation

e.g. Building is a holonym of window
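
The type-of relation can be sketched with a toy hypernym table. The entries below are hand-written for illustration and are not read from WordNet; the class and method names (HypernymSketch, isA) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of hypernym (type-of) lookups; this is not the WordNet API.
class HypernymSketch {
    // Hand-written child -> parent links standing in for noun synsets (assumed data).
    static final Map<String, String> HYPERNYM = new HashMap<>();
    static {
        HYPERNYM.put("dog", "canine");
        HYPERNYM.put("canine", "carnivore");
        HYPERNYM.put("carnivore", "mammal");
        HYPERNYM.put("mammal", "animal");
    }

    // True if 'ancestor' can be reached from 'word' by following hypernym links.
    static boolean isA(String word, String ancestor) {
        String current = HYPERNYM.get(word);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = HYPERNYM.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isA("dog", "canine"));  // true
        System.out.println(isA("dog", "animal"));  // true
        System.out.println(isA("canine", "dog"));  // false
    }
}
```

Walking the chain upwards is what lets a question asking for an "animal" accept "dog" as a candidate of the right type.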
How is this useful?

This information can be used to “understand” a question

Current Question Analysis work with RPI’s version of
Watson

Creating and training machine learning classifiers
Parse trees, dependency relations, coreferences, named entities, semantic relations + manually annotated questions → classifiers
New question → classifiers → critical elements of question
Question analysis pipeline
Unstructured question text → parsing & semantic analysis → machine learning classifiers → structured annotations of question (focus, answer types, useful search queries)
CANDIDATE GENERATION
Kate McGuire
Search Result Processing and Candidate Generation
Primary Search

Primary Search is used to generate our corpus of
information from which to take candidate answers,
passages, supporting evidence, and essentially all textual
input to the system

It formulates queries based on the results of Question
Analysis

These queries are passed into a search engine which
returns a set number of highly relevant documents and
their ranks.
Search Result Processing

Search Result Processing restructures the information
in the document so that later stages can use it.

HTML tags are cleaned from the document

Passage Retrieval/Chunking

Breaks the document down into smaller pieces

Adds information, such as the HTML text, length, place in the
document, etc.

Passage Parsing

Parse trees are formed for each passage
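
The cleaning and chunking steps can be sketched as follows. This is a minimal illustration in plain Java, assuming a crude regular expression for tag removal and sentence-sized passages; the names PassageSketch, stripTags and chunk are hypothetical and do not come from Watson.

```java
import java.util.ArrayList;
import java.util.List;

class PassageSketch {
    // A passage plus bookkeeping: its place in the document and its length.
    static final class Passage {
        final String text;
        final int index;   // position of the passage in the document (0-based)
        final int length;  // number of characters in the passage

        Passage(String text, int index) {
            this.text = text;
            this.index = index;
            this.length = text.length();
        }
    }

    // Crude tag removal: replace anything between '<' and '>' with a space,
    // then collapse runs of whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Split cleaned text into sentence-sized passages after '.', '?' or '!'.
    static List<Passage> chunk(String text) {
        List<Passage> passages = new ArrayList<>();
        for (String piece : text.split("(?<=[.?!])\\s+")) {
            if (!piece.isEmpty()) {
                passages.add(new Passage(piece, passages.size()));
            }
        }
        return passages;
    }

    public static void main(String[] args) {
        String html = "<p>RPI is in Troy.</p><p>It was founded in 1824.</p>";
        for (Passage p : chunk(stripTags(html))) {
            System.out.println(p.index + " (" + p.length + " chars): " + p.text);
        }
    }
}
```

A real system would use an HTML parser rather than a regular expression; the regex keeps the sketch dependency-free.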
Candidate Generation

Candidate Generation generates a wide net of possible
answers for the question from each document.

Using each document, and the passages created by
Search Result Processing, we generate candidates using
three techniques:

Title of Document (T.O.D.): Adds the title of the document as a
candidate.

Wikipedia Title Candidate Generation: Adds any noun phrases
within the document’s passage texts that are also the titles of
Wikipedia articles.

Anchor Text Candidate Generation: Adds candidates based on
the hyperlinks and metadata within the document.
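
The first two techniques can be sketched in a few lines. This is an illustration only: the set of known titles is passed in by hand rather than taken from Wikipedia, and noun phrases are approximated by every span of one to three words, whereas a real implementation would take them from the passage parse trees. CandidateSketch and generate are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class CandidateSketch {
    // Adds the document title, then every 1- to 3-word span of the passage
    // that exactly matches an entry in knownTitles.
    static Set<String> generate(String docTitle, String passage, Set<String> knownTitles) {
        Set<String> candidates = new LinkedHashSet<>();
        candidates.add(docTitle);  // Title of Document technique

        // Drop punctuation so that "Institute." can match "Institute".
        String[] words = passage.replaceAll("[^\\p{L}\\p{N} ]", "").split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder span = new StringBuilder();
            for (int len = 0; len < 3 && start + len < words.length; len++) {
                if (len > 0) {
                    span.append(' ');
                }
                span.append(words[start + len]);
                if (knownTitles.contains(span.toString())) {
                    candidates.add(span.toString());  // title-match technique
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Set<String> titles = new LinkedHashSet<>();
        titles.add("Shirley Ann Jackson");
        titles.add("Rensselaer Polytechnic Institute");
        String passage = "Shirley Ann Jackson is the president of Rensselaer Polytechnic Institute.";
        System.out.println(generate("Rensselaer Polytechnic Institute", passage, titles));
    }
}
```

Because the result is a set, a phrase that is both the document title and a matching span is kept only once.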
SCORING & RANKING
Matt Klawonn
Scoring & Ranking
Scoring

Analyzes how well a candidate answer relates to the
question

Two basic types of scoring algorithm

Context-independent scoring

Context-dependent scoring
Types of scorers


Context-independent

Question Analysis

Ontologies (DBpedia, YAGO, etc)

Reasoning
Context-dependent

Analyzes natural language that candidates appear in

Relies on “passages” found during search
Scorers

Examples of scorers include

Passage Term Match

Textual Alignment

Skip-Bigram

Each of these scores supportive evidence

Scores are then merged to produce a single candidate
score
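
Two of these scorers, and the merging step, can be sketched in simplified form. The definitions below are assumptions made for illustration: passage term match is taken as the fraction of question terms found in the passage, skip-bigram as the fraction of ordered question term pairs (with at most maxSkip terms between them) that also occur in the passage, and merging as a plain average. None of this is Watson's actual scoring code, and ScorerSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ScorerSketch {
    // Lower-case the text, drop punctuation and split on whitespace.
    static List<String> tokens(String text) {
        return Arrays.asList(
            text.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+"));
    }

    // Fraction of distinct question terms that also occur in the passage.
    static double passageTermMatch(String question, String passage) {
        return overlap(new HashSet<>(tokens(question)), new HashSet<>(tokens(passage)));
    }

    // Ordered term pairs with at most maxSkip terms between the two members.
    static Set<String> skipBigrams(List<String> toks, int maxSkip) {
        Set<String> pairs = new HashSet<>();
        for (int i = 0; i < toks.size(); i++) {
            for (int j = i + 1; j < toks.size() && j - i - 1 <= maxSkip; j++) {
                pairs.add(toks.get(i) + " " + toks.get(j));
            }
        }
        return pairs;
    }

    // Fraction of the question's skip-bigrams that also occur in the passage.
    static double skipBigram(String question, String passage, int maxSkip) {
        return overlap(skipBigrams(tokens(question), maxSkip),
                       skipBigrams(tokens(passage), maxSkip));
    }

    // Share of the elements of 'wanted' that are present in 'available'.
    static double overlap(Set<String> wanted, Set<String> available) {
        if (wanted.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String item : wanted) {
            if (available.contains(item)) {
                hits++;
            }
        }
        return (double) hits / wanted.size();
    }

    // Stand-in for score merging: an unweighted average of the scores.
    static double merge(double... scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return scores.length == 0 ? 0.0 : sum / scores.length;
    }

    public static void main(String[] args) {
        String q = "president of RPI";
        String p = "Shirley Ann Jackson is the president of RPI";
        System.out.println(merge(passageTermMatch(q, p), skipBigram(q, p, 1)));
    }
}
```

The two scorers disagree in useful ways: a passage such as "RPI has a president" matches two of the three question terms but shares none of the question's skip-bigrams, because the word order differs.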
THE TAO OF UIMA
Simon Ellis
UIMA

‘Unstructured Information Management Architecture’

A platform for the analysis of unstructured information and
its integration with search technologies

Permits multi-modal analysis of collections or archives
UIMA
http://uima.apache.org/d/uimaj-2.4.0/
‘Unstructured information’


The most rapidly-growing source of information in
existence

The internet

Print media

Video recordings

Audio recordings

...
“Unstructured information is just information that doesn’t
have the kind of structure you need it to have for what
you’re doing.” [Peter Fox, X-Informatics class]
UIMA (again)

The UIMA platform can be thought of in four ways:

A specification for component interfaces for, and in, an
analytics pipeline

A specification of certain design patterns for that pipeline

An outline of two data representations: in-memory annotations
for local analysis and an XML representation for remote web
integration

An outline for possible development roles allowing tools to be
CAS


Common Analysis Structure (CAS)

Object-based structure

Allows representation of objects, properties and values

Stores arbitrary data structures

Annotations

Types

Object types may be related by single-inheritance

Contains document being analysed, either physically or
logically
Results of analysis are shared and recorded in a CAS
Annotator

Core UIMA component type

Contains analysis algorithms designed to work on data
contained in a CAS


Original document

Annotation

Search evidence

Candidate score

...
Form the building blocks of Analysis Engines
Analysis Engine

Building blocks of a UIMA pipeline

Section of code containing 1 or more annotators

Analyses source document(s) and provides analysis
results


Results typically represent metadata about the source
Analysis Engines are effectively software agents that
discover and record metadata
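
The relationship between a CAS, annotators and an analysis engine can be sketched in plain Java. The classes below (ToyCas, ToyAnnotation, ToyAnnotator and the two example annotators) are hypothetical stand-ins written for this illustration; they are not the UIMA API, which is considerably richer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipelineSketch {
    // A typed span over the document text.
    static final class ToyAnnotation {
        final String type;
        final int begin;
        final int end;

        ToyAnnotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Stand-in for a CAS: the document plus the analysis results recorded so far.
    static final class ToyCas {
        final String document;
        final List<ToyAnnotation> annotations = new ArrayList<>();

        ToyCas(String document) {
            this.document = document;
        }
    }

    // Stand-in for an annotator: an algorithm that works on data in a CAS.
    interface ToyAnnotator {
        void process(ToyCas cas);
    }

    // Marks every whitespace-delimited word as a "Token".
    static final class TokenAnnotator implements ToyAnnotator {
        @Override
        public void process(ToyCas cas) {
            Matcher m = Pattern.compile("\\S+").matcher(cas.document);
            while (m.find()) {
                cas.annotations.add(new ToyAnnotation("Token", m.start(), m.end()));
            }
        }
    }

    // Reads the Token annotations left by an earlier annotator and marks
    // those that start with an upper-case letter as "Capitalised".
    static final class CapitalisedAnnotator implements ToyAnnotator {
        @Override
        public void process(ToyCas cas) {
            List<ToyAnnotation> found = new ArrayList<>();
            for (ToyAnnotation a : cas.annotations) {
                if (a.type.equals("Token")
                        && Character.isUpperCase(cas.document.charAt(a.begin))) {
                    found.add(new ToyAnnotation("Capitalised", a.begin, a.end));
                }
            }
            cas.annotations.addAll(found);
        }
    }

    // A toy analysis engine: runs its annotators in order over one CAS.
    static ToyCas run(String document, List<ToyAnnotator> annotators) {
        ToyCas cas = new ToyCas(document);
        for (ToyAnnotator annotator : annotators) {
            annotator.process(cas);
        }
        return cas;
    }

    // Number of annotations of the given type recorded in the CAS.
    static int count(ToyCas cas, String type) {
        int n = 0;
        for (ToyAnnotation a : cas.annotations) {
            if (a.type.equals(type)) {
                n++;
            }
        }
        return n;
    }

    static ToyCas demo(String document) {
        return run(document,
            Arrays.<ToyAnnotator>asList(new TokenAnnotator(), new CapitalisedAnnotator()));
    }

    public static void main(String[] args) {
        ToyCas cas = demo("Watson runs on UIMA");
        System.out.println(count(cas, "Token"));        // 4
        System.out.println(count(cas, "Capitalised"));  // 2
    }
}
```

The second annotator never sees the first one directly; it only reads what was recorded in the shared structure, which is the sense in which results of analysis are shared through a CAS.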
Example
http://uima.apache.org/d/uimaj-2.4.0/
Sofas and CAS Views



Sofa

Subject of Analysis

A piece of data intended for analysis by UIMA components
CAS View

A section of a CAS dedicated to one Sofa

Shares the same name as its Sofa

May be dynamically created as needed by applications or AEs
Each Sofa permits a different perspective of an artefact
Example
Artefact: Dr Shirley Ann Jackson
Perspectives:
Teacher of physics
Chairman, USNRC
Researcher at Bell Labs
IBM Board of Directors
President, RPI
Descriptors


All components consist of two parts

Code

Descriptor (declaration)
Functions of the descriptor


Contains metadata about the code block

Name

Structure

Behaviour
Used in component discovery, reuse, and tool composition
UIMA (again, again)


Highly reliant on XML

Flexible

Extensible
XML...

... describes components and their behaviour

... controls data (CAS) flow through the pipeline

... is used to create larger components from subcomponents

Aggregate Analysis Engines
Aggregate Analysis Engine

A complex analysis engine made up of other components

May contain simple AEs or other AAEs

Components further down the pipeline may rely on the output of earlier components

Performs a larger, complete task, e.g. named entity recognition, built from:

language detection and tokenisation

part-of-speech detection

deep grammatical parsing

named entity recognition
CAS Multiplier

Creates 0 or more new CAS objects from an input CAS

May be used to duplicate or merge CAS objects

e.g....

... creating alternative versions of an input Sofa

... breaking a large input CAS into multiple smaller pieces

... aggregating multiple input CAS into a single output
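
The splitting and aggregating cases can be sketched with a toy structure. ToyCas here holds only document text, and MultiplierSketch, split and merge are hypothetical names; the real UIMA CAS Multiplier interface is not shown.

```java
import java.util.ArrayList;
import java.util.List;

class MultiplierSketch {
    // Stand-in for a CAS that holds only its document text.
    static final class ToyCas {
        final String document;

        ToyCas(String document) {
            this.document = document;
        }
    }

    // Breaks one input into one output per non-blank paragraph.
    // An input with no text yields zero outputs.
    static List<ToyCas> split(ToyCas input) {
        List<ToyCas> out = new ArrayList<>();
        for (String paragraph : input.document.split("\\n\\s*\\n")) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                out.add(new ToyCas(trimmed));
            }
        }
        return out;
    }

    // The opposite direction: aggregates several inputs into a single output,
    // separating their texts with a blank line.
    static ToyCas merge(List<ToyCas> inputs) {
        StringBuilder sb = new StringBuilder();
        for (ToyCas cas : inputs) {
            if (sb.length() > 0) {
                sb.append("\n\n");
            }
            sb.append(cas.document);
        }
        return new ToyCas(sb.toString());
    }

    public static void main(String[] args) {
        List<ToyCas> parts = split(new ToyCas("one\n\ntwo\n\n\nthree"));
        System.out.println(parts.size());  // 3
    }
}
```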
UIMA, once more

UIMA runs in the Java Runtime Environment

Uses XML descriptors to configure and run the system

The UIMA framework reads the XML dynamically and creates
objects from it

Only the UIMA framework itself is compiled
SO HOW DOES IT WORK?
How it works

Abstract class prototyping


UIMA Framework objects are usually derived from a base
class
Function signature


UIMA Framework objects each have certain functions which
can or must be overridden

initialize()

process()
This ensures all classes are of known supertypes and
have a recognisable function signature for all key functions
How it works

Reflection



The ability of a computer program to examine and modify the
structure and behavior (specifically the values, meta-data,
properties and functions) of an object at runtime.
XML descriptors define the nature of objects

class name

constructor parameters

...
UIMA dynamically creates objects using reflection
The ‘magic code’
import java.lang.reflect.Constructor;

// declare a variable of the type of object we want
JCasAnnotator ann = null;
// use Java's built-in reflection to load the class by name
Class<?> annClass = Class.forName("com.ibm.tutorial.tycor");
// look up the constructor that takes the given parameter types
Constructor<?> cons = annClass.getConstructor(<param types>);
// create the instance; newInstance() returns Object, so cast it
ann = (JCasAnnotator) cons.newInstance(<params>);
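
A self-contained version of the same idea, runnable without UIMA, might look like this. ToyAnnotator and UpperCaseAnnotator are hypothetical classes invented for the sketch; the class name and constructor argument passed to create() stand in for what an XML descriptor would supply.

```java
import java.lang.reflect.Constructor;

class ReflectionSketch {
    // Hypothetical base class: every component exposes the same two methods.
    public abstract static class ToyAnnotator {
        public void initialize() { }                   // may be overridden
        public abstract String process(String text);  // must be overridden
    }

    // One concrete component, chosen at runtime by name.
    public static class UpperCaseAnnotator extends ToyAnnotator {
        private final String prefix;

        public UpperCaseAnnotator(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public String process(String text) {
            return prefix + text.toUpperCase();
        }
    }

    // Builds a component from a class name and one String constructor argument.
    static ToyAnnotator create(String className, String arg) {
        try {
            // Load the class by name.
            Class<?> cls = Class.forName(className);
            // Find the public constructor that takes a single String.
            Constructor<?> cons = cls.getConstructor(String.class);
            // newInstance() returns Object; the cast relies on the known supertype.
            ToyAnnotator ann = (ToyAnnotator) cons.newInstance(arg);
            ann.initialize();
            return ann;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        ToyAnnotator ann = create(UpperCaseAnnotator.class.getName(), "> ");
        System.out.println(ann.process("watson"));  // > WATSON
    }
}
```

The caller only needs to know the base class and its method signatures; which subclass is built is decided by a string at runtime, which is why a known supertype with fixed signatures matters.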
UIMA, finally

Effectively an interpreter for code ‘scripted’ in XML and
Java

Component-oriented design makes scaling easy


Blue J (the Jeopardy! hardware) had 2,880 cores
Most easily written in Java

Java runs in the Java Runtime Environment

Dynamic class loading & reflection are therefore possible

Could not have been written this way in C++ (no built-in reflection)
QUESTIONS & ANSWERS