What’s in the News? Web Scraping Technology as a Cost-Effective Solution for News Alerting
David A. Breiner and Raul Rodriguez-Esteban
Boehringer Ingelheim Pharmaceuticals, Inc.
Ridgefield, CT USA
SLA PHT 2013

Question #1
Who here is involved with News Alerting activities in their jobs?

Topics
• About Us
• Project Background & Critical Issues
• Organizational Drivers
• Technical & Process Overviews
• Demonstration (pre-recorded)
• Continuing Challenges
• Lessons Learned
• User Feedback
• Next Steps
• Q&A / Discussion

About Us (RAUL)
Scientific Knowledge Discovery (SKD) is made up of Computational Biology professionals and Knowledge Management experts who support BI* Pharmaceutical Research and Corporate areas in the US by supplying relevant information and analysis. We focus our work on:
• Delivering data and information in a short timeframe
• Streamlining information gathering and processing through computational methods including Text Mining
• Turning information into knowledge that drives impact
* BI will refer to “Boehringer Ingelheim” throughout this presentation

Project Background
BI US Library has been involved with news alerting for ~20 years:
• 1990s: Library staff generated daily & weekly electronic newsletters on various therapeutic area & business topics
• Early 2000s: Executives & Competitive Intelligence (CI) requested a more systematic, early morning alerting of significant news; “Code Red Alert” developed & managed by 1 Info Scientist (~1-1.5 hours per day manual curation time)
• Late 2000s: Service evolved to include many sources (fee & free) but not as time-critical; executives alerted by other routes; CI no longer part of workflow; distribution list broadened to include Public Affairs & Communications group; work distributed among 3 Library staff for various weekdays after lead Info Scientist retired (~1-1.5 hours per day manual curation time); renamed to “Daily News Brief” in 2010
A very valuable service…but extremely time-consuming

Vendor Products: Critical Issues
Ongoing search for a tool to assist in newsletter generation for many years; various vendor products* tested & used, but none met all requirements for success:
* No names will be disclosed
• Duplication: Similar stories from various sources
• Timeliness: Sometimes 24 hour delay experienced
• Cost: Some aggregators required fees for each recipient in addition to base annual subscription
Other Issues:
• Some subscription sources had limited user access
• Some products lacked focus on particular areas of interest to BI
• Implementation always more challenging than anticipated
• Technical issues usually required much interaction with vendors
No significant time savings realized (~1+ hour per day curation)

2010: First In-House Tool developed
BI News
Strengths: Fast & free to build; simple to maintain (HTML page with links); customizable; comprehensive coverage
Page sections: Global News (Fee & Free), Local News, Blogs, Press Releases
Weaknesses: No newsletter-generating tool; much manual scanning of many websites; required much manual curation (i.e. copying/pasting/formatting into email template); duplication among sources
After 2 years, still no significant time savings (~1+ hour per day curation)

2011: Organizational Drivers
• Major departmental reorganization in Q4 2011
• Limited staff to support news monitoring; needed to significantly reduce time spent on Daily News Brief
• Unsuccessful paid trial with vendor product
• New management prefers automated computational methods over manual processes
• Clients desire human filtering due to their lack of time
“A perfect storm”

Q1 2012: Daily News Brief re-launched
• Provide a daily morning “snapshot” of BI and pharmaceutical industry headlines with a US focus
• Minimize curation time to under 30 minutes per day
• Leverage internal expertise in Web Scraping
• Utilize cost-effective news sources whenever possible:
• BI Press Releases (US & DE)
• Google News
• Yahoo News
• Elsevier Business Intelligence
• FirstWord
• FiercePharma / FierceBiotech
• Reuters
• Bloomberg
• Medical Marketing & Media
*global subscription

Question #2
Who here knows what Web Scraping is?
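
For readers new to the term: web scraping means programmatically fetching a page or feed and extracting structured items from it. The slides do not reproduce the actual Daily News Brief scripts (which were written in Perl, with cURL as the crawling agent), so the following is only a minimal Python sketch of the parsing step for an RSS feed. The feed content, the URLs, and the `parse_rss` function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed; a real run would download this text from a news source.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Pharma News</title>
  <item>
    <title>Example Co. announces Phase 3 results</title>
    <link>http://news.example.com/phase3</link>
    <pubDate>Mon, 03 Jun 2013 07:15:00 GMT</pubDate>
  </item>
  <item>
    <title>Regulator schedules advisory meeting</title>
    <link>http://news.example.com/advisory</link>
    <pubDate>Sun, 02 Jun 2013 18:30:00 GMT</pubDate>
  </item>
</channel></rss>"""


def parse_rss(xml_text, source):
    """Extract title, link, and publication date from each <item> of an RSS feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "source": source,
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            # RSS dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


news_items = parse_rss(SAMPLE_FEED, source="Example Pharma News")
for entry in news_items:
    print(entry["published"].date(), entry["title"], entry["link"])
```

In a live setting the feed text would come from an HTTP request (for example via `urllib.request.urlopen`) rather than a string constant, and sites without RSS would need HTML parsing instead.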

Typical Content
• BI press releases & major news on all BI marketed Pharma products
• BI & subsidiaries (Vetmedica, Roxane Labs, Ben Venue Labs, Bedford Labs) in major & local news sources
• Competitor products: Phase 3 trial announcements, major trial published studies, approvals, launches
• Major Competitor, FDA, & Conference announcements
• Pharma & Healthcare industry trends
GOAL: Select & distribute ~12 relevant news items each business day before 8:00 am ET

Technical Overview
• Real Time
• Sources gathered “on the fly”
• Multiple input formats
• Manages RSS feeds, news websites, online newsletters
• Handles password-protected sites
• Automatic login
• Uses “lightweight” code
• Adaptable script language (Perl)
• Copyright compliant
• Only scraping/extracting content that is free or globally licensed by BI
Workflow (from slide diagram): RSS Feeds / News Websites / Newsletters → Web crawling agent (cURL) → Parse news items & components → Filter (Relevancy & Minimum date) → Standardized display for selection → Manual selection / curation → Output presentation (HTML)

Technical Overview: Perl Scripting

Process Overview: Curation
1) Login to DNB interface on internal BI server; Enter # of days to review
2) Select categories for news items to include using drop-down menus
3) Select SUBMIT to publish all selected news items to HTML output file

Process Overview: Publishing & Distribution
4) Copy HTML output
5) Paste into email, edit, & distribute
~15 minutes from start to finish!

Demonstration
BI DAILY NEWS BRIEF (2 minutes)

Continuing Challenges
• Duplication among sources, especially between Google News & Yahoo News
• Some sources don’t always scrape properly, requiring minor edits before distribution
• Technical changes on source websites can affect results
• BI still running IE7; migrating to IE9 in 2013
• Keeping it simple for us & our clients, i.e. “Daily News BRIEF” not “Daily News OVERLOAD”
Stay tuned…

Lessons Learned
• Have a focused objective (i.e. “snapshot” instead of “all news for everyone”)
• Look within your organization first for expertise before looking externally
• Change is inevitable; accept it as opportunity
• Regularly seek out user feedback (see next slide)
To eat an elephant, you must take one bite at a time

User Feedback
• “The Daily News Brief has become my primary source of competitive & marketplace information. Outstanding!” (from an Executive Director in Marketing)
• “I read the DNB every morning. I prefer the current format to the previous one; it’s succinct & provides a good overview of top industry stories that I can view on my Blackberry.” (from a Director in Public Affairs & Communications)
• “I really like the new simplified look of the Daily News Brief, especially the clean lines and simplicity! Nice work!” (from an Associate Director in Public Affairs & Communications)
• “I really enjoy reading the Daily News Brief. It helps me to prepare for my day.” (from an Associate Director in Business Intelligence)

Next Steps
• Currently underway:
• Use underlying code to develop news interfaces for monitoring other domains of interest (e.g. Therapeutic Areas, BI Products)
• Expand distribution list to include more senior-level management in US (currently ~125 recipients)
• Develop RSS feed for internal portals (recently completed)
• Attempt to remove duplication among sources wherever possible
• Explore options for delivery to mobile platforms

Acknowledgements
• Dr. Raul Rodriguez-Esteban
• Dr. Will Loging
• Amy Shortlidge-Cox
• Yirong Wang

Thank You
David A. Breiner, MS
LinkedIn: http://www.linkedin.com/in/davidbreiner
Raul Rodriguez-Esteban, PhD
LinkedIn: http://www.linkedin.com/pub/raul-rodriguez-esteban/0/36b/3bb
Now at Roche in Basel
Boehringer Ingelheim Pharmaceuticals, Inc.
Scientific Knowledge Discovery
Ridgefield, Connecticut USA

Q&A / Discussion
What are your companies doing for News Alerting? Please share!
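
Appendix sketch: the filter step of the Technical Overview (relevancy and minimum date) and the duplication challenge noted for Google News & Yahoo News can be illustrated in a few lines. This is an illustrative Python sketch, not the Perl code used for the Daily News Brief; the keyword list, the title-normalization rule, the sample headlines, and the names `normalize` and `filter_items` are all assumptions.

```python
import re
from datetime import date, timedelta


def normalize(title):
    """Lowercase a headline and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def filter_items(items, keywords, today, days_back):
    """Keep items that are recent, match at least one keyword, and are not repeat headlines."""
    cutoff = today - timedelta(days=days_back)
    kept, seen = [], set()
    for item in items:
        if item["published"] < cutoff:
            continue  # older than the minimum date
        text = item["title"].lower()
        if not any(k.lower() in text for k in keywords):
            continue  # not relevant to any keyword
        key = normalize(item["title"])
        if key in seen:
            continue  # same headline already kept from another source
        seen.add(key)
        kept.append(item)
    return kept


# Hypothetical parsed items from two aggregators plus unrelated and stale stories.
SAMPLE_ITEMS = [
    {"source": "Aggregator A", "title": "Example Co. Announces Phase 3 Results", "published": date(2013, 6, 3)},
    {"source": "Aggregator B", "title": "Example Co announces Phase 3 results!", "published": date(2013, 6, 3)},
    {"source": "Aggregator A", "title": "Local sports roundup", "published": date(2013, 6, 3)},
    {"source": "Aggregator B", "title": "Phase 3 trial halted", "published": date(2013, 5, 20)},
]

selected = filter_items(SAMPLE_ITEMS, keywords=["phase 3"], today=date(2013, 6, 3), days_back=2)
for item in selected:
    print(item["source"], "|", item["title"])
```

Matching on normalized titles only catches headlines that are identical apart from case and punctuation; the same story reworded by different outlets would still need the manual curation step described in the Process Overview.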