Oracle Database 11g New Search Features and Roadmap

Roger Ford
Senior Principal Product Manager
Contents
• Oracle’s Search Products
• Oracle Text 11g New Features
• Oracle Text 11.2.0.2 New Features
– Entity Extraction
– Name Search
– Result Set Interface
• Search Product Roadmap
– Oracle Text
– Secure Enterprise Search
Oracle’s Search Products
• Oracle Text
– A SQL and PL/SQL based toolkit for creating full-text search
applications
– Free with all database versions
– Previously known as Context Option, interMedia Text
• Secure Enterprise Search
– A complete search product built on Oracle Text capabilities
– Crawlers for datasources such as web, email, document
repositories, databases
– End-user query application and APIs for embedding
Oracle Text 11g New Features
• Composite Domain Indexes and SDATA sections
– Allows storage of structured info (eg numbers, dates) within
text index
– Makes for much faster “mixed” queries
• Auto Lexer
– Automatic Language Recognition
– Segmentation and Stemming for 32 languages
– Context-sensitive stemming for 23 of these languages
• Off-line and time-limited index creation
– Enables rebuild of indexes offline in quiet periods for true
24x7 operation
Demo: Auto Lexer
11.2.0.2 New Features - Summary
1. Entity Extraction
– Find “entities” such as people, countries, cities, states, zip codes,
phone numbers etc. in the text
– Use default dictionary and rules or define your own dictionary and
rules based on regular expressions
2. Name Search (NDATA sections)
– Inexact searches, copes with mis-spellings, segmentation errors,
contractions and word reversal
– Useful for many searches, but particularly good for names
3. ResultSet Interface
– Query request in XML and results returned as XML
– Avoids SQL layer and requirement to work within “SELECT”
semantics
Entity Extraction
• Identify names, places, dates, times, etc.
• Tag each occurrence with type and subtype
• Entities are defined by DICTIONARY and RULES
• Implemented by CTX_ENTITY package
– create_extract_policy – create a policy to which you can add extract
rules
• Choose to use/not use built in rules and dictionary
– add_extract_rule – create an XML-based rule to define an entity
– add_stop_entity – prevent defined entities from being used
– compile – build the policy with its rules
– extract – get an XML-based list of entities for a doc
• Can also use ctxload to load a user dictionary
Demo: Entity Extraction
Entities: built-in types
• building
• city
• company
• country
• currency
• date
• day
• email_address
• geo_political
• holiday
• location_other
• month
• non_profit
• organization_other
• percent
• person_jobtitle
• person_name
• person_other
• phone_number
• postal_address
• product
• region
• ssn
• state
• time_duration
• tod
• url
• zip_code
Entity Extraction – Example 1: Defaults
ctx_entity.create_extract_policy('mypolicy');
ctx_entity.compile('mypolicy');
ctx_entity.extract('mypolicy', mydoc, mylang, myresults);
• Output in "myresults":
<entities>
<entity id="0" offset="75" length="8" source="SuppliedDictionary">
<text>New York</text>
<type>city</type>
</entity>
<entity id="1" offset="55" length="16" source="SuppliedRule">
<text>Hupplewhite Inc.</text>
<type>company</type>
</entity>
</entities>
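The XML returned in "myresults" can be read with any XML parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree on the client side (the slides only show the PL/SQL calls, and the helper name parse_entities is hypothetical):

```python
import xml.etree.ElementTree as ET

# Sample output copied from the slide above.
RESULT_XML = """<entities>
<entity id="0" offset="75" length="8" source="SuppliedDictionary">
<text>New York</text>
<type>city</type>
</entity>
<entity id="1" offset="55" length="16" source="SuppliedRule">
<text>Hupplewhite Inc.</text>
<type>company</type>
</entity>
</entities>"""


def parse_entities(xml_text):
    """Return one dict per <entity> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    entities = []
    for node in root.findall("entity"):
        entities.append({
            "id": int(node.get("id")),
            "offset": int(node.get("offset")),
            "length": int(node.get("length")),
            "source": node.get("source"),
            "text": node.findtext("text"),
            "type": node.findtext("type"),
        })
    return entities


if __name__ == "__main__":
    # Sort by offset so entities are listed in document order.
    for ent in sorted(parse_entities(RESULT_XML), key=lambda e: e["offset"]):
        print(ent["offset"], ent["type"], ent["text"])
```

The "source" attribute is kept so that dictionary hits and rule hits can be told apart later.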
Entity Extraction – Example 2: User rule
ctx_entity.create_extract_policy('mypolicy');
ctx_entity.add_extract_rule('mypolicy', 5,
'<rule>
<expression>((North|South)? America)</expression>
<type refid="1">xContinent</type>
</rule>');
ctx_entity.compile('mypolicy');
ctx_entity.extract('mypolicy', mydoc, mylang, myresults);
• Note the parentheses around the expression. refid="1" selects the first parenthesized group, here the whole expression, so the entity text is "North America", "South America" or just "America".
• User-defined types must be prefixed with an "x", hence "xContinent"
<entities>
<entity id="0" offset="75" length="13" source="UserRule">
<text>North America</text>
<type>xContinent</type>
</entity>
</entities>
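The refid idea can be tried outside the database. The sketch below assumes Python's re module behaves like the rule engine for this simple expression, which the slides do not state; extract_group is a hypothetical helper and the sample sentence is made up.

```python
import re

# The expression from the user rule above.
RULE_EXPRESSION = r"((North|South)? America)"


def extract_group(text, refid=1):
    """Return (matched_text, offset, length) for capture group `refid`, or None."""
    match = re.search(RULE_EXPRESSION, text)
    if match is None or match.group(refid) is None:
        return None
    matched = match.group(refid)
    return matched, match.start(refid), len(matched)


if __name__ == "__main__":
    sample = "Exports to North America rose."  # hypothetical document text
    print(extract_group(sample, refid=1))  # outer group: the whole phrase
    print(extract_group(sample, refid=2))  # inner group: only the qualifier
```

Group 1 is the outer pair of parentheses and group 2 is the inner (North|South) alternative, which is why the rule refers to refid="1" to capture the full continent name.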
Entity Extraction: Adding a user dictionary
• Create file ud.xml:
<dictionary> <entities>
<entity> <value>Dow Jones Industrial Average</value> <type>xIndex</type> </entity>
<entity> <value>S&amp;P 500</value> <type>xIndex</type> </entity>
</entities> </dictionary>
• Create the policy with CTXLOAD (can add rules later)
ctxload -user scott/tiger -extract -name pol1 -file ud.xml
• Compile the policy
ctx_entity.compile('pol1');
• Results
<entity id="69" offset="1010" length="7" source="UserDictionary">
<text>S&P 500</text>
<type>xIndex</type>
</entity>
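A dictionary file like ud.xml can also be generated rather than typed by hand, which takes care of escaping reserved XML characters such as "&". A minimal sketch, assuming Python's xml.etree.ElementTree; build_dictionary is a hypothetical helper and only the element names come from the slide:

```python
import xml.etree.ElementTree as ET


def build_dictionary(entries):
    """Build <dictionary><entities>... XML from (value, type) pairs."""
    root = ET.Element("dictionary")
    entities = ET.SubElement(root, "entities")
    for value, entity_type in entries:
        entity = ET.SubElement(entities, "entity")
        ET.SubElement(entity, "value").text = value
        ET.SubElement(entity, "type").text = entity_type
    # ElementTree escapes reserved characters such as "&" when serializing.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_dictionary([
        ("Dow Jones Industrial Average", "xIndex"),
        ("S&P 500", "xIndex"),
    ]))
```

The printed text can be saved as ud.xml and loaded with the ctxload command shown above.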
Entity Extraction – other stuff
• Extracting only certain entity types:
– ctx_entity.extract('p1', mydoc, null, myresults,
'city,company,xContinent');
Name Search
• Searching names has many difficulties
– Spelling (steven = stephen)
– Alternate Names (fred = alfred, chuck = charles)
– Transcription (copying from spoken to written form)
– Transliteration (copying from one writing system to another)
– Segmentation (Mary Jane, Maryjane)
– First, Middle, and Last Name Classification
• Name search does intelligent matching across all
these issues
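To make three of these difficulties concrete (spelling, alternate names, segmentation), here is a toy matcher. It is only an illustration and not Oracle's name search algorithm: the nickname table, the 0.7 threshold, and the use of difflib are all assumptions.

```python
from difflib import SequenceMatcher

# Tiny illustrative nickname table; a real system would use a thesaurus.
ALTERNATE_NAMES = {"fred": "alfred", "chuck": "charles"}


def normalize(name):
    """Lowercase, drop whitespace (segmentation), and map known alternates."""
    collapsed = "".join(name.lower().split())
    return ALTERNATE_NAMES.get(collapsed, collapsed)


def names_match(left, right, threshold=0.7):
    """True when two names are equal after normalization or close in spelling."""
    a, b = normalize(left), normalize(right)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold


if __name__ == "__main__":
    for pair in [("Steven", "Stephen"), ("Chuck", "Charles"), ("Mary Jane", "Maryjane")]:
        print(pair, names_match(*pair))
```

Transcription, transliteration and name-part classification need language knowledge that a string-similarity measure like this cannot provide.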
Demo: Name Search
NDATA section type
• Basic implementation for name search
• Limitations
– 511 characters
– 255 whitespace-delimited terms
– No offset information, therefore no:
• Highlighting / Markup
• NEAR or phrase search with NDATA
• Uses WORDLIST preference attributes:
– NDATA_ALTERNATE_SPELLING
– NDATA_BASE_LETTER
– NDATA_THESAURUS (for alternate names – default thesaurus provided)
– NDATA_JOIN_PARTICLES (list such as 'de:du:mc:mac')
• Query Syntax
– NDATA(fieldname, search terms [, order [, proximity ] ] )
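An application that builds NDATA queries from user input might check the limits and assemble the operator text as sketched below. Both helpers are hypothetical, and the values accepted for order and proximity are not given on the slide, so they are passed through unchanged.

```python
MAX_NDATA_CHARS = 511  # character limit listed above
MAX_NDATA_TERMS = 255  # whitespace-delimited term limit listed above


def fits_ndata_limits(value):
    """Check a candidate field value against the two NDATA limits."""
    return len(value) <= MAX_NDATA_CHARS and len(value.split()) <= MAX_NDATA_TERMS


def ndata_query(fieldname, search_terms, order=None, proximity=None):
    """Build NDATA(fieldname, search terms [, order [, proximity ] ] )."""
    if proximity is not None and order is None:
        # The nested brackets mean proximity can only follow order.
        raise ValueError("proximity requires order")
    parts = [fieldname, search_terms]
    if order is not None:
        parts.append(order)
    if proximity is not None:
        parts.append(proximity)
    return "NDATA(" + ", ".join(parts) + ")"


if __name__ == "__main__":
    print(ndata_query("fullname", "jon smith"))
```

The resulting string would be used as the text query against an index that defines the named NDATA section.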
Result Set Interface
• Some queries are difficult to express in SQL:
– eg "Give me the top 5 hits in each category"
• Result set interface uses a simple text query and an
XML result set descriptor
• Hitlist is returned in XML according to result set
descriptor
• Uses SDATA sections for
– Grouping
– Counting
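For reference, this is what "top 5 hits in each category" means when computed by hand over a hit list. The sketch is plain client-side Python with made-up tuples; it says nothing about how the result set interface computes its answer.

```python
from collections import defaultdict


def top_hits_per_category(hits, n=5):
    """Group (category, score, doc_id) hits and keep the n best-scoring per category."""
    by_category = defaultdict(list)
    for category, score, doc_id in hits:
        by_category[category].append((score, doc_id))
    return {
        category: sorted(scored, key=lambda item: item[0], reverse=True)[:n]
        for category, scored in by_category.items()
    }


if __name__ == "__main__":
    sample_hits = [  # hypothetical data
        ("news", 10, "d1"), ("news", 30, "d2"), ("blog", 5, "d3"), ("news", 20, "d4"),
    ]
    print(top_hits_per_category(sample_hits, n=2))
```

Doing this on the client means fetching every hit first, which is the cost a server-side grouped result avoids.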
Result Set Example Query
ctx_query.result_set('docidx', 'oracle',
'<ctx_result_set_descriptor>
<count/>
<hitlist start_hit_num="1" end_hit_num="2" order="pubDate desc, score desc">
<score/> <rowid/>
<sdata name="author"/>
<sdata name="pubDate"/>
</hitlist>
<group sdata="pubDate">
<count/>
</group>
<group sdata="author">
<count/>
</group>
</ctx_result_set_descriptor> ', rs);
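When the paging window or the grouped fields vary per request, the descriptor can be generated instead of hard-coded. A minimal sketch, assuming Python's xml.etree.ElementTree; build_descriptor and its parameters are hypothetical, and only the element and attribute names come from the example above.

```python
import xml.etree.ElementTree as ET


def build_descriptor(start_hit, end_hit, order, sdata_fields, group_by):
    """Return a ctx_result_set_descriptor document as a string."""
    root = ET.Element("ctx_result_set_descriptor")
    ET.SubElement(root, "count")
    hitlist = ET.SubElement(root, "hitlist", {
        "start_hit_num": str(start_hit),
        "end_hit_num": str(end_hit),
        "order": order,
    })
    ET.SubElement(hitlist, "score")
    ET.SubElement(hitlist, "rowid")
    for name in sdata_fields:
        ET.SubElement(hitlist, "sdata", {"name": name})
    for name in group_by:
        # One <group> with a <count/> per SDATA section to aggregate on.
        group = ET.SubElement(root, "group", {"sdata": name})
        ET.SubElement(group, "count")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_descriptor(1, 2, "pubDate desc, score desc",
                           ["author", "pubDate"], ["pubDate", "author"]))
```

The returned string would be passed as the third argument of ctx_query.result_set.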
Result Set Output
<ctx_result_set>
<hitlist>
<hit>
<score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>
<sdata name="AUTHOR">John</sdata>
<sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
</hit>
<hit>
<score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>
<sdata name="AUTHOR">John</sdata>
<sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
</hit>
</hitlist>
<count>100</count>
Result Set Output - Continued
<groups sdata="PUBDATE">
<group value="2001-01-01 00:00:00"><count>25</count></group>
<group value="2001-01-02 00:00:00"><count>50</count></group>
<group value="2001-01-03 00:00:00"><count>25</count></group>
</groups>
<groups sdata="AUTHOR">
<group value="John"><count>50</count></group>
<group value="Mike"><count>25</count></group>
<group value="Steve"><count>25</count></group>
</groups>
</ctx_result_set>
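On the client, the returned XML can be reduced to a total count, a hit list and per-group counts. A minimal sketch, assuming Python's xml.etree.ElementTree; summarize_result_set is a hypothetical helper and the sample is the output shown above, joined into one document.

```python
import xml.etree.ElementTree as ET

RESULT_SET_XML = """<ctx_result_set>
<hitlist>
<hit>
<score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>
<sdata name="AUTHOR">John</sdata>
<sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
</hit>
<hit>
<score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>
<sdata name="AUTHOR">John</sdata>
<sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
</hit>
</hitlist>
<count>100</count>
<groups sdata="PUBDATE">
<group value="2001-01-01 00:00:00"><count>25</count></group>
<group value="2001-01-02 00:00:00"><count>50</count></group>
<group value="2001-01-03 00:00:00"><count>25</count></group>
</groups>
<groups sdata="AUTHOR">
<group value="John"><count>50</count></group>
<group value="Mike"><count>25</count></group>
<group value="Steve"><count>25</count></group>
</groups>
</ctx_result_set>"""


def summarize_result_set(xml_text):
    """Return the total count, the hits, and the per-group counts."""
    root = ET.fromstring(xml_text)
    hits = [
        {
            "score": int(hit.findtext("score")),
            "rowid": hit.findtext("rowid"),
            "sdata": {s.get("name"): s.text for s in hit.findall("sdata")},
        }
        for hit in root.findall("hitlist/hit")
    ]
    groups = {
        node.get("sdata"): {
            g.get("value"): int(g.findtext("count")) for g in node.findall("group")
        }
        for node in root.findall("groups")
    }
    # The top-level <count> is the total number of hits for the query.
    return {"count": int(root.findtext("count")), "hits": hits, "groups": groups}


if __name__ == "__main__":
    summary = summarize_result_set(RESULT_SET_XML)
    print(summary["count"], summary["groups"]["AUTHOR"])
```

Note that the SDATA names come back in upper case (AUTHOR, PUBDATE) even though the descriptor used mixed case.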
Preview
Roadmap – merging Text and SES
Oracle Text: Full Control
• Fine-grained Index Options
• Data Storage Options
• Lexer Options
• Stoplists
• Use existing database
• RAC, Exadata
Secure Enterprise Search: Full Featured
• Built in database and mid-tier
• Crawlers for many sources
• Simple Query Interface
• End user GUI / API
• Embedded security
Coming Search Features
• Natural Language Processing enhancements
– Ontology based classification
– Question answering
• Automatic Partitioning
– Query load balancing
• Full support for facetted navigation (MVDATA sections)
• Functional completeness for Result Set Interface
– Result Iterator – streaming support
– Parallel Query
• Replication Support
– Golden Gate / Logical Standby / Streams
• Operator improvements
– NEAR2 – best query in one operator
– MNOT – mild not, eg YORK mnot NEW YORK
– Nested near
• Substring index and query performance improvements
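The "mild not" example (YORK mnot NEW YORK) can be read as: match "york" wherever it is not part of the phrase "new york". The token-level sketch below follows that reading, which is an interpretation of the one-line example and not the operator's specification; mild_not is a hypothetical function.

```python
def mild_not(tokens, term, phrase):
    """Return positions of `term` that are not inside an occurrence of `phrase`."""
    width = len(phrase)
    covered = set()
    for start in range(len(tokens) - width + 1):
        if tokens[start:start + width] == phrase:
            covered.update(range(start, start + width))
    return [i for i, token in enumerate(tokens) if token == term and i not in covered]


if __name__ == "__main__":
    words = "new york and york".split()
    print(mild_not(words, "york", ["new", "york"]))
```

An ordinary NOT would reject the whole document because it contains "new york"; under this reading the mild form still matches on the standalone "york".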
Coming Search Features - Continued
• Multiple enhancements to query performance
– BIGIO leverages Secure Files CLOBs
– Automatic optimization of indexes with “stage index”
– Two level index – keep common search terms in memory
• Partition maintenance without reindexing
• Off-load filtering from database server
• Section specific index options
– Choose different options, eg language, stopwords, PRINTJOINS for
each section
• Regular expression based stopwords
• Forward Index
– Hugely improved performance for highlighting, snippets
• PDF “Native” Highlighting
• Unlimited SDATA, MDATA and Field Sections
The preceding is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.