Search, plus building taxonomy, autoclassification and media monitoring tools with open source software

Charlie Hull
Managing Director, Flax
1st November 2012
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch

Who are Flax?
Search engine specialists with decades of experience
Developers, innovators and strategists based in Cambridge, UK
Technology agnostic – but open source exponents
UK Authorized Partner of Lucid Imagination
Customers include Reed Specialist Recruitment, Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen, Accenture, University of Cambridge, Cabinet Office...
Come to

Who am I?
Wrote my first saleable software at age 14
Electronic engineer, Windows device driver developer, mobile networks, pro sound, racing cars....
Muscat (Bayesian search)
Helped build a half-billion-page web search engine
Co-founder and CEO of Flax

What I'll cover today
Search – the state of play
Clade – an open source taxonomy based classifier
Flax Media Monitor
Some other crazy ideas
Conclusions

Search – the state of play
Types of search project:
  – Website search
  – Intranet search
  – Database search
Closed source engines either:
  – Sold up
  – Repositioned
  – In trouble!
Open source engines:
  – Apache Lucene/Solr
  – ElasticSearch
  – (ish!) Attivio/Lucidworks/...

Let's talk about something more interesting...

Clade: classifying data into a taxonomy with a search engine
  – Developed as a proof of concept
  – Based on Apache Solr & Stanford NLP
  – Written in jQuery & Python
Caveats:
  – We don't know much about library science!
  – Something like this may already exist (not that we could find it...)
  – This is an alpha version only

Clade demo....

What Clade doesn't do (yet)
Talk standard taxonomy formats
Output anything
Support multiple users
Rules-based classification
Look pretty
http://www.flax.co.uk/the_software to try it out...

Media monitoring
Standard search – few queries over many documents
Monitoring search – many queries over each document
Customers' interests manually turned into queries
Humans probably still have the final say on relevance
Eventual result is a list of articles emailed to (or even printed for) customers

Media monitoring - parameters
Tens of thousands of stored expressions or keywords
  – Can't rewrite these so must use same syntax!
Hundreds of thousands of articles to monitor every day
Source data can sometimes be scanned & OCR'd
False positives cost human operator time: false negatives cost customers!
Traditional approach is brute force using standard search engine software

Media monitoring – a Keyword
("PALM BEACH COUNTY" W/48 (("TOURIS*" OR "TOUR" OR "TOURS" OR "TRAVEL*" OR "HOLIDAY*" OR "HOL" OR "HOLS" OR "HOTEL*" OR "VISIT*" OR "TRIP" OR "TRIPS" OR "DAYTRIP*" OR "BEACH" OR "!BEACHES" OR "COAST" OR "!COASTLINE*" OR "ABTA" OR "DAY TRIP*" OR "SUITE" OR "SUITES" OR "A%CCOMMODATION" OR "BED AND !BREAKFAST" OR "B&B" OR "BED & !BREAKFAST" OR "!BREAKFAST AND BOARD" OR "FULL BOARD" OR "HALF BOARD" OR "ALL !INCLUSIVE" OR "THINGS TO DO" OR "HOSP?TALITY" OR "SHORT BREAK*" OR "!WEEKEND BREAK" OR "CITY BREAK*" OR "!SIGHTSEE*" OR "!VACATION*" OR "E%XCURSION*" OR "FLY* WITH" OR "FLY* THERE" OR "FLY* DRIVE" OR "!GETAWAY" OR "!BACKPACK*" OR "BACK PACK*" OR "!ECOTOURIS*" OR "!WATERSPORT*" OR "WATER SPORT*" OR "FESTIVAL*" OR "RESORT* & SPA" OR "RESORT* AND SPA" OR "WHALE WATCH*" OR "GET THERE" OR "WHERE TO STAY" OR "GETTING THERE" OR "STAYCATION*" OR "VILLA" OR "VILLAS" OR "AIRPORT*" OR "SPA" OR "SPAS" OR "OUTDOOR EVENT*" OR "OUTDOOR ADVENTURE*" OR "OUTDOOR PURSUIT*" OR "OUTDOOR ACTIVIT*" OR "CLIMBING WALL*" OR "CLIMBING CENTRE*" OR "ROCK CLIMB*" OR "WHITE WATER RAFTING") OR ("PLACES" W/4 ("TO STAY" OR "TO SEE" OR "TO EAT")) OR (("FLIGHT*" OR "FLY" OR "FLYING" OR "CRUISE*") W/4 ("OFFER" OR "AVAILABLE" OR "DEPART*" OR "FROM" OR "TRANSFER*"))))

Media monitoring – another Keyword
((("!MOBILE PHONE*" OR "PHONE MAST*" OR "HANDSET*" OR "CELL* PHONE*" OR "3G" OR "GPRS" OR "G.P.R.S" OR "!GENERAL !RADIO PACKET SERVICE*" OR "GSM" OR "G.S.M" OR "!GLOBAL SYSTEM FOR !MOBILE COMM*" OR "HSDPA" OR "H.S.D.P.A" OR "HIGH SPEED DOWNLINK !PACKET ACCESS" OR "HSUPA" OR "H.S.U.P.A" OR "HIGH SPEED !UPLINK !PACKET ACCESS" OR "UMTS" OR "U.M.T.S" OR "MVNO" OR "M.V.N.O" OR "SMS" OR "SHORT
MESSAGE !SERVICE*" OR "MMS" OR "!MULTIMEDIA MESSAGE !SERVICE*" OR "!MOBILES" OR "!CELLPHONE*" OR "!TELECOM*" OR "!LANDLINE*" OR "!TELEPHONE*" OR "PHONE*" OR "!TELEKOM*" OR "TELCO*" OR "VODAFONE" OR "T-MOBILE" OR "TMOBILE" OR "!TELEFONICA" OR "BT" OR "!MOBILE USER*" OR "TEXT MESSAG*" OR "SMARTPHONE*" OR "!VIRGIN !MEDIA*" OR "CABLE & !WIRELESS" OR "CABLE AND !WIRELESS") W/48 (("PROFIT*" OR "LOSS*" OR "BAN" OR "BANNED" OR "PREMIUM RATE*" OR "FINANC*" OR "!REFINANC*" OR "OFFICE OF FAIR TRADING" OR "MERGER*" OR "!ACQUISIT*" OR "ACQUIR*" OR "TAKEOVER*" OR "BUYOUT*" OR "BUYOUT*" OR "NEW PRODUCT*" OR "INVEST*" OR "SHARES" OR "MARKET*" OR "ACCOUNT*" OR "MONEY" OR "CASH*" OR "SECURIT*" OR "!ENTERPRIS*" OR "!BUSINESS*" OR "PRICE*" OR "JOINT*" OR "NEW VENTURE*" OR "PRICING" OR "COST*" OR "CHAIRM?N" OR "APPOINT*" OR "!EXECUTIVE" OR "SALE*" OR "SELL*" OR "FULL YEAR" OR "REGULAT*" OR "!DIRECTIVE*" OR "LAW" OR "LAWS" OR "!LEGISLAT*" OR "GREEN PAPER" OR "WHITE PAPER*" OR "!MEDIAWATCH" OR "MORAL*" OR "ETHIC*" OR "ADVERT*" OR "AD" OR "ADS" OR "MARKETING" OR "!COMPLAIN*" OR "MIS-SOLD" OR "MISSELL*" OR "SPONSOR" OR "COSTCUT*" OR "COST CUT*" OR "CUT* COST*" OR "FIBRE OPTIC*" OR "TAX" OR "TAXES" OR "TAXED" OR "EXPAND*" OR "!EXPANSION" OR "EMPLOY*" OR "STAFF" OR "WORKER*" OR "SPOKESM?N" OR "DEBUT" OR "BRAND*" OR "DIRECTOR*") OR (("FAIR" OR "UNFAIR" OR "%UNSCRUPULOUS" OR "NOT FAIR" OR "UNJUST*" OR "!PENALISE*") W/12 ("CHARG*" OR "TARIFF*" OR "PRICE PLAN*" OR "GLOBAL")))) AND NOT ("EXPRESS OFFER" OR "TIMES OFFER" OR "READER OFFER" OR (("CALLS COST") W/6 ("FROM A LANDLINE" OR "FROM LANDLINE*" OR "BT LANDLINE*"))))

Flax Media Monitor
Based on a modification of Apache Lucene/Solr
Runs a separate Solr server for archiving
Consumes XML articles & keywords
Outputs matches as XML
REST API for status & configuration
Allows you to test new Keywords on old content

Flax Media Monitor demo...

Flax Media Monitor - performance
For simple keywords (<20 terms):
  – 70,000 keywords applied per second to an article
  – Tested on a MacBook
  – 20 times faster than previous implementation
For more complex keywords (some run to three pages!)
  – 20,000 keywords applied in 0.5 seconds
  – Approx 2000 docs/hour
Can be scaled horizontally for high load (and needs a lot less hardware)
Archive can store tens to hundreds of millions of articles

Some other crazy ideas...
Combine media monitoring with Clade: very fast expression-based classification!
We can parse syntax from other search engines...
How to store rapidly changing classification data in a search engine index:
1. Re-index all documents affected (expensive)
2. Store the classifications somewhere else: how about a Lucene codec backed by a NoSQL Database?
http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/

Conclusions
Search isn't just about “search”
Taxonomy management is ready for open source
Media monitoring can be done at low cost for high volume with open source
  – Classification maybe as a special case of monitoring?
It's all much more fun than 'vanilla' search!

Thank you! Any questions?
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
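
Appendix sketch: monitoring search in miniature
The media monitoring slides contrast standard search (few queries over many documents) with monitoring search (many queries over each document). The toy Python below illustrates only that inversion: stored queries are indexed by the first word of each phrase, so an incoming article is checked against just the queries that could possibly match. It is not how Flax Media Monitor works (that is a modification of Apache Lucene/Solr). The class name StoredQueryMatcher, its methods and the sample queries are all hypothetical, and wildcards, W/n proximity and AND NOT are deliberately left out.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric tokens."""
    return TOKEN.findall(text.lower())


class StoredQueryMatcher:
    """Hypothetical toy matcher: each stored query is an OR of exact phrases."""

    def __init__(self):
        # first token of a phrase -> list of (query_id, phrase_tokens)
        self._by_first_token = defaultdict(list)

    def add_query(self, query_id, phrases):
        """Store a query; it matches a document if any one phrase occurs in it."""
        for phrase in phrases:
            tokens = tuple(tokenize(phrase))
            if tokens:
                self._by_first_token[tokens[0]].append((query_id, tokens))

    def match(self, document):
        """Return the ids of all stored queries that match this one document."""
        tokens = tokenize(document)
        matched = set()
        for i, token in enumerate(tokens):
            # Only phrases starting with this token are candidates at position i.
            for query_id, phrase in self._by_first_token.get(token, ()):
                if query_id in matched:
                    continue
                if tuple(tokens[i:i + len(phrase)]) == phrase:
                    matched.add(query_id)
        return matched


matcher = StoredQueryMatcher()
matcher.add_query("travel", ["day trip", "city break", "hotel"])
matcher.add_query("telecoms", ["mobile phone", "text message", "handset"])

article = "A city break with a good hotel beats staring at your mobile phone."
print(sorted(matcher.match(article)))  # ['telecoms', 'travel']
```

The cost per article here depends on the article's length and on how many stored phrases share a first word with it, not on the total number of stored queries, which is the property that makes the "many queries over each document" direction tractable.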