20130415-Breiner

What’s in the News?
Web Scraping Technology as a
Cost-Effective Solution for News Alerting
David A. Breiner and Raul Rodriguez-Esteban
Boehringer Ingelheim Pharmaceuticals, Inc.
Ridgefield, CT USA
SLA PHT 2013
Question #1
Who here is involved with News
Alerting activities in their jobs?
2
Topics
• About Us
• Project Background & Critical Issues
• Organizational Drivers
• Technical & Process Overviews
• Demonstration (pre-recorded)
• Continuing Challenges
• Lessons Learned
• User Feedback
• Next Steps
• Q&A / Discussion
3
About Us
RAUL
Scientific Knowledge Discovery (SKD) is made up of Computational
Biology professionals and Knowledge Management experts who
support BI* Pharmaceutical Research and Corporate areas in the US
by supplying relevant information and analysis.
We focus our work on:
• Delivering data and information in a short timeframe
• Streamlining information gathering and processing through computational methods,
including Text Mining
• Turning information into knowledge that drives impact
* BI will refer to “Boehringer Ingelheim” throughout this presentation
4
Project Background
BI US Library has been involved with news alerting for ~20 years:
• 1990s: Library staff generated daily & weekly electronic newsletters on
various therapeutic area & business topics
• Early 2000s: Executives & Competitive Intelligence (CI) requested a
more systematic, early morning alerting of significant news; “Code Red
Alert” developed & managed by 1 Info Scientist (~1-1.5 hours per day
manual curation time)
• Late 2000s: Service evolved to include many sources (fee & free) but not
as time-critical; executives alerted by other routes; CI no longer part of
workflow; distribution list broadened to include Public Affairs &
Communications group; work distributed among 3 Library staff for various
weekdays after lead Info Scientist retired (~1-1.5 hours per day manual
curation time); renamed to “Daily News Brief” in 2010
A very valuable service…but extremely time-consuming
5
Vendor Products: Critical Issues
Ongoing search for a tool to assist in newsletter generation
for many years; various vendor products* tested & used,
but none met all requirements for success:
* No names will be disclosed
• Duplication: Similar stories from various sources
• Timeliness: 24-hour delay sometimes experienced
• Cost: Some aggregators required fees for each recipient
in addition to base annual subscription
Other Issues:
• Some subscription sources had limited user access
• Some products lacked focus on particular areas of interest to BI
• Implementation always more challenging than anticipated
• Technical issues usually required much interaction with vendors
No significant time savings realized
(~1+ hour per day curation)
6
2010: First In-House Tool developed
BI News
Source categories linked:
• Global News (Fee & Free)
• Local News
• Blogs
• Press Releases
Strengths:
Fast & free to build; simple to maintain
(HTML page with links); customizable;
comprehensive coverage
Weaknesses:
No newsletter-generating tool; much manual
scanning of many websites; required much
manual curation (i.e. copying/pasting/formatting
into email template); duplication among sources
After 2 years, still no significant time savings
(~1+ hour per day curation)
7
2011: Organizational Drivers
• Major departmental reorganization in Q4 2011
• Limited staff to support news monitoring; needed to
significantly reduce time spent on Daily News Brief
• Unsuccessful paid trial with vendor product
• New management prefers automated computational
methods over manual processes
• Clients desire human filtering due to their lack of time
“A perfect storm”
8
Q1 2012: Daily News Brief re-launched
• Provide a daily morning “snapshot” of BI and
pharmaceutical industry headlines with a US focus
• Minimize curation time to under 30 minutes per day
• Leverage internal expertise in Web Scraping
• Utilize cost-effective news sources whenever possible:
• BI Press Releases (US & DE)
• Google News
• Yahoo News
• Elsevier Business Intelligence
• FirstWord
• FiercePharma / FierceBiotech
• Reuters
• Bloomberg
• Medical Marketing & Media
*global subscription
9
Question #2
Who here knows what Web Scraping is?
10
Typical Content
• BI press releases & major news on all BI marketed
Pharma products
• BI & subsidiaries (Vetmedica, Roxane Labs, Ben Venue
Labs, Bedford Labs) in major & local news sources
• Competitor products: Phase 3 trial announcements,
major trial published studies, approvals, launches
• Major Competitor, FDA, & Conference announcements
• Pharma & Healthcare industry trends
GOAL: Select & distribute ~12 relevant news items
each business day before 8:00 am ET
11
Technical Overview
Inputs: RSS Feeds, News Websites, Newsletters
• Real Time
• Sources gathered “on the fly”
• Multiple input formats
• Manages RSS feeds, news websites, online newsletters
• Handles password-protected sites
• Automatic login
• Uses “lightweight” code
• Adaptable script language (Perl)
• Copyright compliant
• Only scraping/extracting content that is free or globally licensed by BI
Workflow:
1) Web crawling agent (cURL)
2) Parse news items & components
3) Filter: Relevancy & Minimum date
4) Standardized display for selection
5) Manual selection / curation
6) Output presentation (HTML)
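The parse and filter stages of this workflow can be illustrated with a minimal sketch. The production tool was written in Perl with cURL as the crawling agent; the Python below is only an illustration of the idea, not the authors' code. The sample feed, the keyword list, and the function names `parse_items` and `filter_items` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; a real run would fetch this XML from each source.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example Pharma News</title>
<item><title>FDA approves new anticoagulant</title>
<link>http://example.com/a</link>
<pubDate>Mon, 15 Apr 2013 06:30:00 +0000</pubDate></item>
<item><title>Local bakery wins award</title>
<link>http://example.com/b</link>
<pubDate>Mon, 15 Apr 2013 07:00:00 +0000</pubDate></item>
<item><title>Phase 3 trial results announced</title>
<link>http://example.com/c</link>
<pubDate>Mon, 01 Apr 2013 09:00:00 +0000</pubDate></item>
</channel></rss>"""

# Illustrative relevancy terms only; the real filter criteria are not in the slides.
KEYWORDS = ("fda", "phase 3", "approval", "launch")


def parse_items(rss_text):
    """Split an RSS 2.0 document into dicts holding title, link, and date."""
    root = ET.fromstring(rss_text)
    items = []
    for node in root.iter("item"):
        items.append({
            "title": node.findtext("title", default="").strip(),
            "link": node.findtext("link", default="").strip(),
            "published": parsedate_to_datetime(node.findtext("pubDate")),
        })
    return items


def filter_items(items, keywords, now, max_age_days):
    """Keep items that are recent enough and mention at least one keyword."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        item for item in items
        if item["published"] >= cutoff
        and any(word in item["title"].lower() for word in keywords)
    ]


# "Now" is fixed so the example behaves the same on every run.
now = datetime(2013, 4, 15, 8, 0, tzinfo=timezone.utc)
selected = filter_items(parse_items(SAMPLE_RSS), KEYWORDS, now, max_age_days=2)
for item in selected:
    print(item["published"].date(), item["title"])
```

Here `max_age_days` plays the role of the "# of days to review" setting; the keyword match is a deliberately simple stand-in for whatever relevancy rules the real tool applies.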
12
Technical Overview: Perl Scripting
13
Process Overview: Curation
1) Login to DNB interface on internal
BI server; Enter # of days to review
2) Select categories for news items to
include using drop-down menus
3) Select SUBMIT to publish all selected
news items to HTML output file
14
Process Overview: Publishing & Distribution
4) Copy HTML output
5) Paste into email, edit, & distribute
~15 minutes from start to finish!
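The publishing step, in which the curator's selections are written to an HTML output file, might look roughly like the sketch below. It is in Python rather than the Perl used for the actual tool, and the category names, sample items, and the function `render_brief` are assumptions made for illustration.

```python
from html import escape


def render_brief(selections, date_label):
    """Group (category, title, url) selections by category and emit HTML."""
    by_category = {}
    for category, title, url in selections:
        by_category.setdefault(category, []).append((title, url))

    parts = [f"<h2>Daily News Brief: {escape(date_label)}</h2>"]
    # Categories appear in the order in which they were first selected.
    for category, entries in by_category.items():
        parts.append(f"<h3>{escape(category)}</h3>")
        parts.append("<ul>")
        for title, url in entries:
            parts.append(f'  <li><a href="{escape(url)}">{escape(title)}</a></li>')
        parts.append("</ul>")
    return "\n".join(parts)


# Hypothetical selections, as if chosen from the drop-down menus.
selections = [
    ("Company News", "Example press release on R&D site", "http://example.com/pr"),
    ("Competitor News", "Phase 3 trial announced", "http://example.com/p3"),
    ("Company News", "Local coverage of Ridgefield campus", "http://example.com/local"),
]
brief_html = render_brief(selections, "April 15, 2013")
print(brief_html)
```

Escaping titles and URLs keeps characters such as `&` from breaking the markup when the output is pasted into an email.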
15
Demonstration
BI DAILY NEWS BRIEF
(2 minutes)
16
Continuing Challenges
• Duplication among sources, especially between
Google News & Yahoo News
• Some sources don’t always scrape properly, requiring
minor edits before distribution
• Technical changes on source websites can affect results
• BI still running IE7; migrating to IE9 in 2013
• Keeping it simple for us & our clients, i.e. “Daily News
BRIEF” not “Daily News OVERLOAD”
Stay tuned…
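One possible way to approach the duplication challenge is to compare headlines by word overlap. The sketch below uses Jaccard similarity on title tokens. This is an assumed technique for illustration, not something the slides say the tool does; the threshold of 0.6 and the sample headlines are hypothetical.

```python
import re


def title_tokens(title):
    """Lowercase a headline and reduce it to a set of alphanumeric words."""
    return frozenset(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(titles, threshold=0.6):
    """Keep the first headline of each group of near-identical headlines."""
    kept, kept_tokens = [], []
    for title in titles:
        tokens = title_tokens(title)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(title)
            kept_tokens.append(tokens)
    return kept


# Hypothetical headlines, as two aggregators might report the same story.
headlines = [
    "FDA approves new diabetes drug",
    "FDA Approves New Diabetes Drug - Example Wire",
    "Pharma mergers slow in first quarter",
]
unique = drop_near_duplicates(headlines)
print(unique)
```

A lower threshold removes more near-duplicates but risks merging distinct stories about the same product, so the curator would still want to see what was dropped.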
17
Lessons Learned
• Have a focused objective (i.e. “snapshot”
instead of “all news for everyone”)
• Look within your organization first for expertise
before looking externally
• Change is inevitable; accept it as opportunity
• Regularly seek out user feedback (see next
slide)
To eat an elephant, you must take one bite at a time 
18
User Feedback
• The Daily News Brief has become my primary source of
competitive & marketplace information. Outstanding!
from an Executive Director in Marketing
• I read the DNB every morning. I prefer the current format to
the previous one; it’s succinct & provides a good overview
of top industry stories that I can view on my Blackberry.
from a Director in Public Affairs & Communications
• I really like the new simplified look of the Daily News Brief,
especially the clean lines and simplicity! Nice work!
from an Associate Director in Public Affairs & Communications
• I really enjoy reading the Daily News Brief. It helps me to
prepare for my day.
from an Associate Director in Business Intelligence
19
Next Steps
• Currently underway:
• Use underlying code to develop news interfaces for
monitoring other domains of interest (e.g. Therapeutic
Areas, BI Products)
• Expand distribution list to include more senior-level
management in US (currently ~125 recipients)
• Develop RSS feed for internal portals (recently completed)
• Attempt to remove duplication among sources
wherever possible
• Explore options for delivery to mobile platforms
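For the internal-portal RSS feed, a minimal sketch of turning curated items into an RSS 2.0 document is shown below. It is written in Python for illustration only; the channel details, URLs, and the function `build_rss` are hypothetical and do not describe the feed BI actually built.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(channel_title, channel_link, description, items):
    """Build an RSS 2.0 document from (title, link, published) tuples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = description
    for title, link, published in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        # RSS 2.0 expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(published)
    return ET.tostring(rss, encoding="unicode")


# Hypothetical curated items for one day's brief.
curated = [
    ("FDA approves new anticoagulant", "http://example.com/a",
     datetime(2013, 4, 15, 6, 30, tzinfo=timezone.utc)),
    ("Phase 3 trial announced", "http://example.com/p3",
     datetime(2013, 4, 15, 7, 0, tzinfo=timezone.utc)),
]
feed_xml = build_rss("Daily News Brief", "http://intranet.example.com/dnb",
                     "Curated pharmaceutical industry headlines", curated)
print(feed_xml)
```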
20
Acknowledgements
• Dr. Raul Rodriguez-Esteban
• Dr. Will Loging
• Amy Shortlidge-Cox
• Yirong Wang
21
Thank You
David A. Breiner, MS
LinkedIn: http://www.linkedin.com/in/davidbreiner
Raul Rodriguez-Esteban, PhD
LinkedIn: http://www.linkedin.com/pub/raul-rodriguez-esteban/0/36b/3bb
Now at Roche in Basel 
Boehringer Ingelheim Pharmaceuticals, Inc.
Scientific Knowledge Discovery
Ridgefield, Connecticut USA
22
Q&A / Discussion
What are your companies doing for
News Alerting? Please share!
23