Search, plus
building taxonomy, autoclassification and media
monitoring tools with open source software
Charlie Hull
Managing Director, Flax
1st November 2012
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
Who are Flax?
Search engine specialists with decades of experience
Developers, innovators and strategists based in
Cambridge, UK
Technology agnostic – but open source exponents
UK Authorized Partner of Lucid Imagination
Customers include Reed Specialist Recruitment, Mydeco,
NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen,
Accenture, University of Cambridge, Cabinet Office...
Come to
Who am I?
Wrote my first saleable software at age 14
Electronic engineer, Windows device driver developer, mobile
networks, pro sound, racing cars....
Muscat (Bayesian search)
Helped build a half-billion-page web search engine
Co-founder and CEO of Flax
What I'll cover today
Search – the state of play
Clade – an open source taxonomy based classifier
Flax Media Monitor
Some other crazy ideas
Conclusions
Search – the state of play
Types of search project:
– Website search
– Intranet search
– Database search
Closed source engines either:
– Sold up
– Repositioned
– In trouble!
Open source engines:
– Apache Lucene/Solr
– ElasticSearch
– (ish!) Attivio/Lucidworks/...
Let's talk about something more
interesting...
Clade: classifying data into a taxonomy with a search
engine
– Developed as a proof of concept
– Based on Apache Solr & Stanford NLP
– Written in jQuery & Python
Caveats:
– We don't know much about library science!
– Something like this may already exist (not that we
could find it...)
– This is an alpha version only
Clade demo....
What Clade doesn't do (yet)
Talk standard taxonomy formats
Output anything
Multiple users
Rules-based classification
Look pretty
http://www.flax.co.uk/the_software to try it out...
Media monitoring
Standard search – few queries over many documents
Monitoring search – many queries over each document
Customers' interests manually turned into queries
Humans probably still have the final say on relevance
Eventual result is a list of articles emailed to (or even
printed for) customers
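The "many queries over each document" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real monitoring product works: the `Monitor` class and its methods are hypothetical names, and each stored query is simplified to a set of words that must all appear in the article. Each query is indexed under one of its own terms, so an incoming article is only checked against queries it could possibly match.

```python
from collections import defaultdict


class Monitor:
    """Toy monitor (hypothetical): a stored query is a set of required words."""

    def __init__(self):
        self.queries = {}                # query id -> frozenset of required words
        self.by_term = defaultdict(set)  # word -> ids of queries indexed under it

    def add_query(self, query_id, terms):
        terms = frozenset(t.lower() for t in terms)
        if not terms:
            raise ValueError("a stored query needs at least one term")
        self.queries[query_id] = terms
        # Index the query under a single required word: an article that
        # lacks this word cannot match, so the query is never even tested.
        self.by_term[min(terms)].add(query_id)

    def match(self, text):
        """Return the ids of all stored queries that match one article."""
        words = set(text.lower().split())
        candidates = set()
        for word in words:
            candidates |= self.by_term.get(word, set())
        # Full check only for the candidate queries.
        return sorted(q for q in candidates if self.queries[q] <= words)


monitor = Monitor()
monitor.add_query("tourism", ["palm", "beach", "hotel"])
monitor.add_query("telecoms", ["mobile", "profit"])
print(monitor.match("New hotel opens on Palm Beach"))
```

With tens of thousands of stored queries, the candidate step is what keeps the per-article cost down: most queries are never evaluated at all.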
Media monitoring - parameters
Tens of thousands of stored expressions or keywords
– Can't rewrite these so must use same syntax!
Hundreds of thousands of articles to monitor every day
Source data can sometimes be scanned & OCR'd
False positives cost human operator time: false negatives
cost customers!
Traditional approach is brute force using standard search
engine software
Media monitoring – a Keyword
("PALM BEACH COUNTY" W/48 (("TOURIS*" OR "TOUR" OR "TOURS" OR "TRAVEL*" OR
"HOLIDAY*" OR "HOL" OR "HOLS" OR "HOTEL*" OR "VISIT*" OR "TRIP" OR
"TRIPS" OR "DAYTRIP*" OR "BEACH" OR "!BEACHES" OR "COAST" OR "!COASTLINE*"
OR "ABTA" OR "DAY TRIP*" OR "SUITE" OR "SUITES" OR "A%CCOMMODATION" OR "BED
AND !BREAKFAST" OR "B&B" OR "BED & !BREAKFAST" OR "!BREAKFAST AND BOARD"
OR "FULL BOARD" OR "HALF BOARD" OR "ALL !INCLUSIVE" OR "THINGS TO DO" OR
"HOSP?TALITY" OR "SHORT BREAK*" OR "!WEEKEND BREAK" OR "CITY BREAK*" OR
"!SIGHTSEE*" OR "!VACATION*" OR "E%XCURSION*" OR "FLY* WITH" OR "FLY* THERE"
OR "FLY* DRIVE" OR "!GETAWAY" OR "!BACKPACK*" OR "BACK PACK*" OR
"!ECOTOURIS*" OR "!WATERSPORT*" OR "WATER SPORT*" OR "FESTIVAL*" OR "RESORT*
& SPA" OR "RESORT* AND SPA" OR "WHALE WATCH*" OR "GET THERE" OR "WHERE TO
STAY" OR "GETTING THERE" OR "STAYCATION*" OR "VILLA" OR "VILLAS" OR
"AIRPORT*" OR "SPA" OR "SPAS" OR "OUTDOOR EVENT*" OR "OUTDOOR ADVENTURE*" OR
"OUTDOOR PURSUIT*" OR "OUTDOOR ACTIVIT*" OR "CLIMBING WALL*" OR "CLIMBING
CENTRE*" OR "ROCK CLIMB*" OR "WHITE WATER RAFTING") OR ("PLACES" W/4 ("TO
STAY" OR "TO SEE" OR "TO EAT")) OR (("FLIGHT*" OR "FLY" OR "FLYING" OR
"CRUISE*") W/4 ("OFFER" OR "AVAILABLE" OR "DEPART*" OR "FROM" OR
"TRANSFER*"))))
Media monitoring – another Keyword
((("!MOBILE PHONE*" OR "PHONE MAST*" OR "HANDSET*" OR "CELL* PHONE*" OR "3G" OR "GPRS" OR
"G.P.R.S" OR "!GENERAL !RADIO PACKET SERVICE*" OR "GSM" OR "G.S.M" OR "!GLOBAL SYSTEM FOR
!MOBILE COMM*" OR "HSDPA" OR "H.S.D.P.A" OR "HIGH SPEED DOWNLINK !PACKET ACCESS" OR "HSUPA"
OR "H.S.U.P.A" OR "HIGH SPEED !UPLINK !PACKET ACCESS" OR "UMTS" OR "U.M.T.S" OR "MVNO" OR
"M.V.N.O" OR "SMS" OR "SHORT MESSAGE !SERVICE*" OR "MMS" OR "!MULTIMEDIA MESSAGE !SERVICE*"
OR "!MOBILES" OR "!CELLPHONE*" OR "!TELECOM*" OR "!LANDLINE*" OR "!TELEPHONE*" OR
"PHONE*" OR "!TELEKOM*" OR "TELCO*" OR "VODAFONE" OR "T-MOBILE" OR "TMOBILE" OR
"!TELEFONICA" OR "BT" OR "!MOBILE USER*" OR "TEXT MESSAG*" OR "SMARTPHONE*" OR "!VIRGIN
!MEDIA*" OR "CABLE & !WIRELESS" OR "CABLE AND !WIRELESS") W/48 (("PROFIT*" OR "LOSS*" OR
"BAN" OR "BANNED" OR "PREMIUM RATE*" OR "FINANC*" OR "!REFINANC*" OR "OFFICE OF FAIR
TRADING" OR "MERGER*" OR "!ACQUISIT*" OR "ACQUIR*" OR "TAKEOVER*" OR "BUYOUT*" OR "BUYOUT*" OR "NEW PRODUCT*" OR "INVEST*" OR "SHARES" OR "MARKET*" OR "ACCOUNT*" OR "MONEY"
OR "CASH*" OR "SECURIT*" OR "!ENTERPRIS*" OR "!BUSINESS*" OR "PRICE*" OR "JOINT*" OR
"NEW VENTURE*" OR "PRICING" OR "COST*" OR "CHAIRM?N" OR "APPOINT*" OR "!EXECUTIVE" OR
"SALE*" OR "SELL*" OR "FULL YEAR" OR "REGULAT*" OR "!DIRECTIVE*" OR "LAW" OR "LAWS" OR
"!LEGISLAT*" OR "GREEN PAPER" OR "WHITE PAPER*" OR "!MEDIAWATCH" OR "MORAL*" OR "ETHIC*"
OR "ADVERT*" OR "AD" OR "ADS" OR "MARKETING" OR "!COMPLAIN*" OR "MIS-SOLD" OR "MISSELL*" OR "SPONSOR" OR "COSTCUT*" OR "COST CUT*" OR "CUT* COST*" OR "FIBRE OPTIC*" OR
"TAX" OR "TAXES" OR "TAXED" OR "EXPAND*" OR "!EXPANSION" OR "EMPLOY*" OR "STAFF" OR
"WORKER*" OR "SPOKESM?N" OR "DEBUT" OR "BRAND*" OR "DIRECTOR*") OR (("FAIR" OR "UNFAIR"
OR "%UNSCRUPULOUS" OR "NOT FAIR" OR "UNJUST*" OR "!PENALISE*") W/12 ("CHARG*" OR
"TARIFF*" OR "PRICE PLAN*" OR "GLOBAL")))) AND NOT ("EXPRESS OFFER" OR "TIMES OFFER" OR
"READER OFFER" OR (("CALLS COST") W/6 ("FROM A LANDLINE" OR "FROM LANDLINE*" OR "BT
LANDLINE*"))))
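To give a feel for what evaluating one of these Keywords involves, here is a minimal Python sketch of a single proximity clause. It rests on assumptions the slides do not confirm: that `A W/n B` means "some term from A occurs within n words of some term from B, in either order", and that a trailing `*` is a prefix wildcard. It handles single-word terms only and ignores the `!`, `%` and `?` operators and multi-word phrases entirely. The function names `positions` and `within` are hypothetical.

```python
import re


def positions(tokens, pattern):
    """Positions of tokens matching one single-word pattern.

    A trailing * is treated as a prefix wildcard (assumed semantics).
    """
    pattern = pattern.lower()
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return [i for i, tok in enumerate(tokens) if tok.startswith(prefix)]
    return [i for i, tok in enumerate(tokens) if tok == pattern]


def within(text, left, right, n):
    """True if any `left` term is within n words of any `right` term.

    Models a clause such as ("FLIGHT*" OR "FLY") W/4 ("OFFER" OR "FROM"),
    under the assumed meaning of W/n described above.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    left_pos = [p for term in left for p in positions(tokens, term)]
    right_pos = [p for term in right for p in positions(tokens, term)]
    return any(abs(a - b) <= n for a in left_pos for b in right_pos)


print(within("Cheap flights available from London",
             ["FLIGHT*", "FLY"], ["OFFER", "AVAILABLE", "FROM"], 4))
```

A full Keyword nests many such clauses with OR, AND NOT and phrase terms, which is part of why applying tens of thousands of them to every article is expensive with a brute-force approach.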
Flax Media Monitor
Based on a modification of Apache Lucene/Solr
Runs a separate Solr server for archiving
Consumes XML articles & keywords
Outputs matches as XML
REST API for status & configuration
Allows you to test new Keywords on old content
Flax Media Monitor demo...
Flax Media Monitor - performance
For simple keywords (<20 terms):
– 70,000 keywords applied per second to an article
– Tested on a MacBook
– 20 times faster than previous implementation
For more complex keywords (some run to three pages!):
– 20,000 keywords applied in 0.5 seconds
– Approx 2000 docs/hour
Can be scaled horizontally for high load (and needs a lot
less hardware)
Archive can store tens to hundreds of millions of articles
Some other crazy ideas...
Combine media monitoring with Clade: very fast
expression-based classification!
We can parse syntax from other search engines...
How to store rapidly changing classification data in a
search engine index:
1. Re-index all documents affected (expensive)
2. Store the classifications somewhere else: how about
a Lucene codec backed by a NoSQL Database?
http://www.flax.co.uk/blog/2012/06/22/updating-individualfields-in-lucene-with-a-redis-backed-codec/
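Option 2 can be illustrated at application level with a short Python sketch. This is not the Redis-backed Lucene codec described in the linked post; it only shows the general idea of keeping fast-changing classifications outside the index so that reclassifying a document does not force a re-index. Plain dictionaries stand in for both the search index and the key-value store, and the `SideStoreIndex` class and its methods are hypothetical.

```python
class SideStoreIndex:
    """Toy illustration: document text and classifications live separately."""

    def __init__(self):
        self.docs = {}     # doc id -> set of words (stands in for the search index)
        self.classes = {}  # doc id -> set of labels (stands in for a key-value store)
        self.reindex_count = 0

    def index(self, doc_id, text):
        """Expensive path: (re)build the index entry for a document."""
        self.docs[doc_id] = set(text.lower().split())
        self.reindex_count += 1

    def classify(self, doc_id, labels):
        """Cheap path: update the side store only, the index is untouched."""
        self.classes[doc_id] = set(labels)

    def search(self, term, category=None):
        """Find documents containing `term`, optionally filtered by category."""
        hits = [d for d, words in self.docs.items() if term.lower() in words]
        if category is not None:
            hits = [d for d in hits if category in self.classes.get(d, set())]
        return sorted(hits)


store = SideStoreIndex()
store.index("a1", "Vodafone reports profit")
store.index("a2", "Hotel profit rises in Palm Beach")
store.classify("a1", ["telecoms"])
store.classify("a1", ["telecoms", "finance"])  # reclassified, not re-indexed
print(store.search("profit", category="finance"), store.reindex_count)
```

The trade-off is that every filtered search now consults the side store, so the lookup has to be fast enough not to dominate query time.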
Conclusions
Search isn't just about “search”
Taxonomy management is ready for open source
Media monitoring can be done at low cost for high volume
with open source
– Classification maybe as a special case of monitoring?
It's all much more fun than 'vanilla' search!
Thank you!
Any questions?
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch