Unlocking New Value from Content: Mining JSTOR Usage Data
Ron Snyder, Director of Advanced Technology, JSTOR
NFAIS 2013 Annual Conference
February 25, 2013

Who we are
ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We pursue this mission by providing innovative services that aid in the adoption of these technologies and that create lasting impact.
JSTOR is a research platform that enables discovery, access, and preservation of scholarly content.

JSTOR archive stats
» Started in 1997
» Journals online: 1,700+
» Documents online: 8.4 million
» Includes journals, books, and primary sources
» 50 million pages (2.5 miles of shelf space)
» Disciplines covered: 70+
» Participating institutions: 8,000+
» Countries with participating institutions: 164

JSTOR site activity
» User sessions (visits): 661K per day, 1.3M peak
» New visits per hour: 38K average, 70K peak
» Simultaneous users: 21K average, 44K peak
» Page views: 3.4M per day, 7.9M peak
» Content accesses: 430K per day, 850K peak
» Searches: 456K per day, 1.13M peak

Accumulated logging data
• 2 billion visits/sessions
• 10 billion page views
• 1.25 billion searches
• 600 million Google & Google Scholar searches
• 580 million PDF downloads

Using logging data to better understand our users and their needs
Our goal is to go from a fragmented and incomplete understanding of users to a clear(er) understanding of user behaviors and needs.

A few things we want to know more about…
• How many distinct users are there?
• Where are they coming from?
• What content are they accessing?
• What content are they unsuccessful in accessing?
• How effective are our discovery tools at helping users find content?
• How are external discovery tools used?
• How do users arrive at content items?
• What are users' content-consumption behaviors and preferences?
• …and more

Why?
• User expectations and a more competitive environment have raised the bar significantly on the need for analytics
• Successful enterprises need to use data effectively
• The growing volume and diversity of our content have made it harder for users to find what they need

Data reporting/analysis capabilities – 2010
Predefined reports from the delivery platform
» COUNTER reports, primarily intended for participating institutions and publishers
» Limited in scope, batch oriented
Ad-hoc querying and reporting
» SQL queries against the delivery system's RDBMS
» Limited capabilities; difficult to produce reports/analyses that combine usage data with content/business data
» Turnaround time on requests was typically days or weeks (when bandwidth was available at all)
Limited capacity for longitudinal studies and trend analysis
» 2 delivery systems with incompatible logging
  Legacy system: 1997 – April 2008
  Current system: April 2008 – present

The problem to be addressed
• Our ability to improve services for our users was hampered by weak and inefficient analysis capabilities
• Analytics tools and staffing had not kept pace with the volume and complexity of our usage data

What we're doing about it…
• Initiated a data warehouse project at the end of 2010
  • Initial focus on ingesting, normalizing, and enriching usage data (a sketch of one typical enrichment step follows the Challenges list below)
  • Key objective: improve access to dimensioned usage data by staff, both technical and non-technical
  • Designed for flexibility and scalability
• Increased analytics staffing
  • Formed an Analytics team
  • Hired a Data Scientist
    • What is a Data Scientist? Someone skilled in statistical methods/tools, data modeling, data visualization, programming, …

Benefits of improved data and analytics
• Personalized, user-centric features
• Content development
  • Informed decisions on content additions and enrichment
• Outreach and participation
  • Better matching of subscriptions with institution needs
• Improved discovery
  • Better search relevancy and recommendations
• Improved platform features
  • Fewer actions required by users to find the content they need
• Improved access for users (both affiliated and unaffiliated)
  • More avenues for content access
  • Tools assisting affiliated users in accessing content when off campus

Our data warehouse project
The project consists of 4 main areas of work, largely performed in parallel:
1. Infrastructure building
» Building a flexible and scalable infrastructure for storing and processing Big Data
2. Data ingest, normalization, and enrichment
» ETL (extract, transform, load) of data from many data sources
3. Tool development
» Integrating and building tools supporting a wide range of uses and user capabilities
4. Analysis and reporting
» Using the infrastructure, data, and tools while they are still under development

Challenges
Big Data problems
» Many billions of usage events with dozens of attributes (some with thousands of unique values)
» Conventional approaches using RDBMS technologies did not prove well suited to this scale and complexity
Locating and building feeds from authoritative data sources
» Redundant and sometimes contradictory data
» Poor or non-existent history in many sources
Domain knowledge
Data validation and integrity
» No ground truth in many instances
Budget!
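As a concrete illustration of the enrichment step mentioned above, the sketch below parses a raw access-log line and attaches a geographic location to the client IP. This is a minimal sketch, not JSTOR's actual pipeline: the log format, field names, and choice of the MaxMind geoip2 library are all illustrative assumptions.

```python
# Minimal sketch of one ETL enrichment step: parse a raw log line and
# geo-enrich the client IP. Log format and library choice are assumptions.
import re
import geoip2.database  # MaxMind GeoIP2 reader; any IP-geolocation source would do
import geoip2.errors

# Common "combined" log format; the real logs may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def enrich(line, geo_reader):
    """Turn one raw log line into a normalized, geo-enriched event dict."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None                      # route unparseable lines to a reject file
    event = m.groupdict()
    try:
        loc = geo_reader.city(event["ip"])
        event.update(
            country=loc.country.iso_code,
            region=loc.subdivisions.most_specific.name,
            city=loc.city.name,
            lat=loc.location.latitude,
            lon=loc.location.longitude,
        )
    except geoip2.errors.AddressNotFoundError:
        pass                             # keep the event; location stays unknown
    return event

# File names are placeholders for the real log feed and GeoIP database.
with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
    with open("access.log") as logfile:
        events = [e for e in (enrich(l, reader) for l in logfile) if e]
```

In a production ETL the same idea would run as a distributed job over HDFS rather than a single-file loop, but the per-event transformation is the same.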
Not only Big… but Rich Data as well
For each session/action we want to know things like:
» IP address and user agent
» Referrer
» Action type
» Status – successful or denied access (and reasons for denial)
» Geographic location (country, region, city, lat/lon)
» User identities (institution, society, association, MyJSTOR)
» User affiliations
» Attributes of the accessed content: journal, article, collections, disciplines, publication date, language, release date, publisher, authors
» Time (at different resolutions: year, month, day, hour)
» Preceding/succeeding actions, linked for click-stream analysis
» Search terms used
» and many more…

ITHAKA usage data warehouse
[Diagram: data sources – logging data, content metadata, licensing data, geolocation & DNS data, beta search data, eCommerce data, and Register & Read data – feed the data warehouse, which in turn supports analysis and reporting.]

Our first attempt…
• Our initial approach consisted of:
  • An RDBMS (MySQL)
  • A star schema implementing a multi-dimensional database
  • A BI tool, such as Pentaho, for OLAP cube generation
• Problems encountered included:
  • Generally poor performance in ETL processing
  • Table locking issues
  • Long processing timelines
  • Relatively poor query performance (though still much better than the operational DB)
  • Concerns about the long-term scalability and flexibility of the approach

Our technology solution: open source

Architecture: Why Hadoop?
• Rapidly becoming the architecture of choice for Big Data
  • Modeled after Google's GFS and BigTable infrastructures
  • Used by Facebook, Yahoo, and many others
  • Open source, with a large and vibrant developer community
• Designed from the ground up for scalability and robustness
  • Fault-tolerant data storage
  • High scalability
• Large (and growing) ecosystem
  • HDFS – distributed, fault-tolerant file system
  • Hive – supports high-level queries similar to SQL (a sample batch query follows below)
  • HBase – column-oriented data store (like BigTable), supporting billions of rows with millions of columns
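To make the Hive layer concrete, the sketch below runs a batch query the way a programmer or power user might, via the standard `hive -e` command-line interface. The table and column names (usage_events, action_type, discipline, and so on) are hypothetical stand-ins, not the warehouse's actual schema.

```python
# Minimal sketch of a batch Hive query over enriched usage data.
# Table and column names are hypothetical.
import subprocess

QUERY = """
SELECT discipline,
       year,
       month,
       COUNT(*) AS accesses
FROM   usage_events
WHERE  action_type = 'content_access'
GROUP  BY discipline, year, month
ORDER  BY accesses DESC
LIMIT  25;
"""

# `hive -e` compiles the SQL-like query into MapReduce jobs and streams
# tab-separated rows back on stdout. These jobs can take minutes to hours,
# which is why the interactive web tool is backed by a SOLR index instead.
result = subprocess.run(
    ["hive", "-e", QUERY], capture_output=True, text=True, check=True
)
for row in result.stdout.splitlines():
    discipline, year, month, accesses = row.split("\t")
    print(f"{year}-{int(month):02d}  {discipline}: {accesses}")
```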
Architecture: Why SOLR?
• HDFS and Hive provide significant improvements in query times over an equivalent relational database representation
  • But still not good enough for an interactive application
• SOLR is a highly scalable search server supporting fast queries of large datasets
  • Uses Lucene under the covers
  • Provides faceting and statistical aggregations
  • Scales linearly
• Hadoop + SOLR is a logical fit for managing Big Data
• SOLR is proven technology with a vibrant OSS community behind it
  • We've been using it for a number of years on R&D projects such as our Data for Research (DfR) site – http://dfr.jstor.org

Usage data warehouse architecture
[Diagram: the public delivery system at www.jstor.org writes log data to a master logging/reporting DB; a daily ETL moves that data, along with daily updates of content metadata and business data, into the data warehouse – HDFS data nodes with Hive on top, plus a SOLR index. Legacy-system logging data (1997–2008) is a separate feed. A web app backed by the SOLR index gives non/semi-technical users tools optimized for web-oriented interactive use, while programmers and data analysts work through Hive and other off-line, batch-oriented tools.]

Progress to date
• Production Hadoop and SOLR clusters are in place
  • 25 Hadoop servers
  • 11 SOLR servers (~3.6 billion index records)
• Usage data from April 2008 through the present has been ETL'd and is available in the warehouse
  • Represents about 55% of JSTOR's historical log data
• Hive tables containing enriched usage data have been developed
  • Highly de-normalized for optimal performance
  • Can be queried from a web interface
• A web tool has been developed for data exploration, filtering, aggregation, and export
  • Usable by non-technical users
  • Intended to provide an 80% solution for common organizational usage-data needs

Analytics environment – 2012
Query performance improvements:
• Platform RDBMS – hours to days (programmer required)
• HDFS/MapReduce – minutes to hours (programmer required)
• Hive – minutes to hours (power user)
• Web tool (backed by SOLR index) – seconds to minutes (casual user)

[Screenshot slides: the Usage Data Explorer (UDE) tool and its faceting; Explorer tool views of top accessed disciplines, top accessed articles, and a geomapping visualization; usage trends by location and by referrer; user navigation path analyses; and a referrer analysis of where JSTOR "sessions" originated, Jan 2011 – Dec 2011. A sketch of the kind of SOLR facet query behind these views appears after the search analyses below.]

Site search activity, by type
6.3M sessions, 19.8M searches (from Mar 5 to Apr 16, 2012):
Basic 78.20% | Advanced 21.49% | Locator 0.27% | Saved 0.05%

Search type   2009    2010    2011    2012
Basic         68.8%   71.3%   77.4%   78.2%
Advanced      30.5%   28.1%   22.3%   21.5%
Locator        0.6%    0.6%    0.2%    0.3%

Search results pages viewed
1 page: 85.1% | 2 pages: 9.4% | 3 pages: 2.6% | 4 pages: 1.2% | 5+ pages: 1.7%
On average, 1.3 search-results pages are viewed per search.

Click-through rates by search position
JSTOR – 20M searches from March 5 to April 16
[Chart: click-through rate by search-result position, positions 1–25]
• First 2 pages of search results
[Chart: click-through rate by search-result position, positions 1–50]
• Positions 10–75
• Notice the relative increase in CTR for the last 2 items on each page
[Chart: click-through rate by search-result position, positions 10–75]
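The faceted, top-N views in the UDE screenshots above map naturally onto SOLR's facet API: one HTTP request can count events per discipline and per country without retrieving any documents. Below is a minimal sketch of such a request; the host, core name, and field names are illustrative assumptions, not the warehouse's actual index schema.

```python
# Minimal sketch of a faceted SOLR query of the kind a UDE-style tool
# might issue. Host, core, and field names are assumptions.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode(
    {
        "q": "action_type:content_access",
        "rows": 0,            # we want facet counts only, no documents
        "facet": "true",
        "facet.limit": 10,
        "wt": "json",
    }
)
# facet.field may be repeated, so append those parameters explicitly.
url = ("http://solr.example.org:8983/solr/usage/select?"
       + params + "&facet.field=discipline&facet.field=country")

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# SOLR returns each facet as a flat [value, count, value, count, ...] list.
fields = data["facet_counts"]["facet_fields"]
top_disciplines = list(zip(fields["discipline"][::2], fields["discipline"][1::2]))
print(top_disciplines)
```

Because queries like this return in seconds, they can sit directly behind an interactive web tool, which is the role SOLR plays in the architecture above.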
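The click-through-rate charts above reduce to a simple computation over search and click events: for each SERP position, divide the number of clicks at that position by the total number of searches. The toy sketch below shows the shape of that computation; the input data is fabricated for illustration, not drawn from the production analysis.

```python
# Toy sketch of CTR-by-position: clicks at each position / total searches.
# The sample input is made up for illustration.
from collections import Counter

def ctr_by_position(click_positions, total_searches):
    """Map each result position to its click-through rate."""
    clicks = Counter(click_positions)
    return {pos: clicks[pos] / total_searches for pos in sorted(clicks)}

# Positions 1 and 2 dominate, with a small bump at position 10 (the last
# item on a 10-result page), echoing the end-of-page effect noted above.
sample_clicks = [1] * 220 + [2] * 120 + [3] * 70 + [10] * 30
for pos, rate in ctr_by_position(sample_clicks, total_searches=1000).items():
    print(f"position {pos:2d}: {rate:.1%}")
```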
Examples – Click-through position
[Chart: average SERP click-through position, Jan 2011 – Oct 2012]

Examples – Click-through position
[Chart: average SERP click-through position plotted against content-access traffic levels, Jan 2011 – Oct 2012]

What's next?
• Integrate a Business Intelligence reporting tool with Hadoop/Hive
  • Good open source options such as Pentaho, Jasper, and BIRT
  • Commercial options also under consideration
• Ingest and normalize the pre-2008 log data
• Expand data sharing with institutions and publishers
• More "push" reporting and dashboards
• More data modeling and deep analysis

Thank you