Unlocking New Value from Content
Mining JSTOR Usage Data
Ron Snyder
Director of Advanced Technology, JSTOR
NFAIS 2013 Annual Conference
February 25, 2013
Who we are
ITHAKA is a not-for-profit organization that helps the
academic community use digital technologies to
preserve the scholarly record and to advance research
and teaching in sustainable ways.
We pursue this mission by providing innovative
services that aid in the adoption of these technologies
and that create lasting impact.
JSTOR is a research platform that enables
discovery, access, and preservation of
scholarly content.
JSTOR Archive Stats
JSTOR archive
» Started in 1997
» Journals online: 1,700+
» Documents online: 8.4 million
 Includes Journals, Books, and Primary Sources
» 50 million pages (2.5 miles of shelf space)
» Disciplines covered: 70+
» Participating institutions: 8,000+
» Countries with participating institutions: 164
JSTOR site activity
User Sessions (visits)
» 661K per day, 1.3M peak
» New visits per hour:
 38K average, 70K peak
» Simultaneous users:
 21K average, 44K peak
Page Views
» 3.4M per day, 7.9M peak
Content Accesses
» 430K per day, 850K peak
Searches
» 456K per day, 1.13M peak
Accumulated logging data
• 2 billion visits/sessions
• 10 billion page views
• 1.25 billion searches
• 600 million Google & Google Scholar searches
• 580 million PDF downloads
Using logging data to better understand
our users and their needs
Our goal is to go from this… a fragmented and incomplete understanding of users
…to this: a clear(er) understanding of user behaviors and needs
A few things we want to know more about…
• How many distinct users?
• Where are they coming from?
• What content are they accessing?
• What content are they unsuccessful in accessing?
• How effective are our discovery tools at helping users find content?
• How are external discovery tools used?
• How do users arrive at content items?
• What are users' content consumption behaviors and preferences?
• etc., etc. …
Why?
• User expectations and a more competitive
environment have raised the bar
significantly on the need for analytics
• Successful enterprises need to effectively use data
• The volume and diversity of our content
have increased the difficulty of finding
content
Data reporting/analysis capabilities - 2010
Predefined reports from delivery platform
» COUNTER reports, primarily intended for participating
institutions and other participants
» Limited in scope, batch oriented
Ad-hoc querying and reporting
» SQL queries from delivery system RDBMS database
» Limited capabilities and difficult to produce reports/analyses
involving usage data and content/business data
» Turnaround time on requests was typically days or weeks (if
there was even bandwidth available)
Limited capacity for longitudinal studies and trend
analysis
» 2 delivery systems with incompatible logging
 Legacy system: 1997 – April 2008
 Current system: April 2008 – Present
The problem to be addressed
• Our ability to improve services for our
users was hampered by weak and
inefficient analysis capabilities
• Analytics tools and staffing had not kept pace with
the volume and complexity of our usage data
What we’re doing about it…
• Initiated a data warehouse project at the end
of 2010
• Initial focus on ingesting, normalizing, and enriching
usage data
• Key objective has been to improve access to
dimensioned usage data by staff, both technical and
non-technical
• Designed for flexibility and scalability
• Increased analytics staffing
• Formed Analytics team
• Hired a Data Scientist
• What is a Data Scientist? Someone skilled in statistical methods/tools,
data modeling, data visualization, programming, …
Benefits of improved data and analytics
• Personalized user-centric features
• Content development
• Informed decisions on content additions and enrichment
• Outreach and participation
• Better matching of subscriptions with institution needs
• Improved discovery
• Better search relevancy, recommendations
• Improved platform features
• Fewer actions required by users to find the content they need
• Improved access for users (both affiliated and
unaffiliated)
• More avenues for content access
• Tools assisting affiliated users in accessing content when off
campus
Our data warehouse project
The project consists of 4 main areas of work, largely
performed in parallel:
1. Infrastructure building
 » Building a flexible and scalable infrastructure for storing and processing Big Data
2. Data ingest, normalization, and enrichment
 » ETL (extract, transform, load) of data from many data sources (a minimal sketch follows this list)
3. Tool development
 » Integrating and building tools supporting a wide range of uses and user capabilities
4. Analysis and reporting
 » Using the infrastructure, data, and tools while under development
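A minimal sketch of the kind of per-record transform implied by area 2, shown in Python under assumed inputs: the tab-delimited field layout, the stub geo lookup, and the action names are all hypothetical, not JSTOR's actual log format or pipeline.

```python
# Minimal ETL sketch (hypothetical field layout, not JSTOR's actual log format).
# Extract: read tab-delimited access-log lines from stdin.
# Transform: normalize the timestamp and derive a country from a stub lookup.
# Load: emit JSON lines suitable for bulk loading into HDFS / a Hive table.
import json
import sys
from datetime import datetime, timezone

GEO_STUB = {"160.39.": "US", "128.232.": "GB"}  # stand-in for a real geo-IP service

def transform(raw_line):
    ip, ts, action, doi, referrer = raw_line.rstrip("\n").split("\t")
    when = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S").replace(tzinfo=timezone.utc)
    country = next((c for prefix, c in GEO_STUB.items() if ip.startswith(prefix)), "unknown")
    return {
        "ip": ip,
        "timestamp": when.isoformat(),
        "action": action,        # e.g. "pdf_download", "search" (hypothetical values)
        "doi": doi,
        "referrer": referrer,
        "country": country,
    }

if __name__ == "__main__":
    for line in sys.stdin:
        print(json.dumps(transform(line)))
```

A script of this shape can also run as a Hadoop streaming mapper, reading raw lines on stdin and emitting enriched JSON on stdout.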
Challenges
Big Data problems
» Many billions of usage events with dozens of
attributes (some with thousands of unique values)
» Conventional approaches using RDBMS
technologies did not prove well suited for this scale
and complexity
Locating and building feeds from authoritative data
sources
» Redundant and sometimes contradictory data
» Poor/non-existent history in many sources
Domain knowledge
Data validation and integrity
» No ground truth in many instances
Budget!
Not only Big… but Rich Data as well
For each session / action we want to know things like (a record sketch follows this list):
» IP address, user agent
» Referrer
» Action type
» Status – successful or denied access (and reasons for deny)
» Geographic location (country, region, city, lat/lon)
» User identities (institution, society, association, MyJSTOR)
» User affiliations
» Attributes of accessed content
 Journal, article, collections, disciplines, publication date, language, release date, publisher, authors
» Time (at different resolutions: year, month, day, hour)
» Preceding/succeeding actions linked (for click stream analysis)
» Search terms used
» and many more…
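To make the list above concrete, here is a sketch of what one enriched usage event might look like as a single record; the field names are illustrative assumptions, not the warehouse's actual schema.

```python
# Illustrative shape of one enriched usage event (field names are assumptions,
# not the warehouse's actual column names).
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class UsageEvent:
    session_id: str
    ip_address: str
    user_agent: str
    referrer: str
    action_type: str                       # view, pdf_download, search, ...
    status: str                            # "success" or "denied"
    deny_reason: Optional[str] = None
    country: str = ""
    region: str = ""
    city: str = ""
    lat_lon: Optional[Tuple[float, float]] = None
    institution: Optional[str] = None
    journal: Optional[str] = None
    article_doi: Optional[str] = None
    disciplines: List[str] = field(default_factory=list)
    publication_year: Optional[int] = None
    timestamp: str = ""                    # queried at year/month/day/hour resolutions
    prev_action_id: Optional[str] = None   # for click-stream linking
    next_action_id: Optional[str] = None
    search_terms: Optional[str] = None
```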
ITHAKA usage data warehouse
[Diagram: many source feeds – logging data, content metadata, licensing data, geolocation & DNS data, Search Beta data, eCommerce data, and Register & Read data – flow into the Data Warehouse, which supports analysis and reporting.]
Our first attempt…
• Our initial approach consisted of:
• RDBMS (MySQL)
• Star schema to implement a multi-dimensional database (see the sketch below)
• Use of BI tool, such as Pentaho, for OLAP cube generation
• Problems encountered included:
• Generally poor performance in ETL processing
• Table locking issues
• Long processing timelines
• Relatively poor query performance (still much better than the operational DB though)
• Concerns about the long-term scalability and flexibility of the approach
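For context on the star-schema bullet above, a minimal sketch of the pattern we attempted: a central fact table keyed to small dimension tables. The table and column names are illustrative assumptions, and SQLite is used here only so the snippet is self-contained (the actual attempt used MySQL).

```python
# Minimal star-schema sketch (illustrative names only; the actual attempt used MySQL).
# One central fact table references small dimension tables, which is the shape
# OLAP/BI tools such as Pentaho expect when building cubes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date        (date_id INTEGER PRIMARY KEY, year INT, month INT, day INT);
CREATE TABLE dim_content     (content_id INTEGER PRIMARY KEY, journal TEXT, discipline TEXT);
CREATE TABLE dim_institution (institution_id INTEGER PRIMARY KEY, name TEXT, country TEXT);

CREATE TABLE fact_usage (
    date_id        INTEGER REFERENCES dim_date(date_id),
    content_id     INTEGER REFERENCES dim_content(content_id),
    institution_id INTEGER REFERENCES dim_institution(institution_id),
    action_type    TEXT,     -- view, pdf_download, search, ...
    event_count    INTEGER   -- additive measure rolled up by the OLAP layer
);
""")
conn.close()
```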
Our technology solution
Open source
Architecture: Why Hadoop?
• Rapidly becoming the architecture of choice for Big
Data
• Modeled after Google's GFS and BigTable infrastructures
• Used by Facebook, Yahoo, and many others
• Open Source
• Large and vibrant developer community
• Designed from the ground up for scalability and robustness
• Fault-tolerant data storage
• High scalability
• Large (and growing) ecosystem
• HDFS – distributed, fault-tolerant file system
• Hive – Supports high-level queries similar to SQL (see the example query below)
• HBase – Column-based data store (like BigTable), supporting billions of rows with millions of columns
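As an illustration of the "queries similar to SQL" point, a sketch of a HiveQL aggregation submitted from Python. The table and column names (usage_events, country, action_type, year, month) are assumptions, not the warehouse's actual schema; `hive -e` is simply one common way to run a single query from the command line.

```python
# Sketch: submit a HiveQL aggregation over a (hypothetical) denormalized usage table.
import subprocess

QUERY = """
SELECT country, action_type, COUNT(*) AS events
FROM   usage_events
WHERE  year = 2012 AND month = 4
GROUP  BY country, action_type
ORDER  BY events DESC
LIMIT  20;
"""

# `hive -e` runs a single query string and prints the result to stdout.
result = subprocess.run(["hive", "-e", QUERY], capture_output=True, text=True, check=True)
print(result.stdout)
```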
Architecture: Why SOLR?
• HDFS and Hive provide significant improvements
in query times over an equivalent relational
database representation
• But still not good enough for an interactive application
• SOLR is a highly-scalable search server
supporting fast queries of large datasets
• Uses Lucene under the covers
• Provides faceting and statistical aggregations (see the example query below)
• Scales linearly
• Hadoop + SOLR is a logical fit for managing Big Data
• SOLR is proven technology with a vibrant OSS
community behind it
• We’ve been using it for a number of years for R&D projects
such as our Data for Research site (DfR) – http://dfr.jstor.org
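A sketch of the kind of faceted query a web tool can send to SOLR's standard /select handler. The host, core name (usage), and field names are assumptions; the parameters (q, fq, rows, facet, facet.field, wt) are standard SOLR query options.

```python
# Sketch: faceted SOLR query against a (hypothetical) usage-events core.
import requests

params = {
    "q": "action_type:pdf_download",
    "fq": "year:2012",
    "rows": 0,                      # only facet counts are needed, not documents
    "facet": "true",
    "facet.field": ["discipline", "country"],
    "facet.limit": 10,
    "wt": "json",
}
resp = requests.get("http://solr.example.org:8983/solr/usage/select", params=params)
resp.raise_for_status()

# SOLR returns facet_fields as flat lists alternating value and count.
facets = resp.json()["facet_counts"]["facet_fields"]
print(facets["discipline"])
```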
Usage data warehouse architecture
[Architecture diagram: JSTOR users interact with the public delivery system (www.jstor.org); its log data lands in a master logging/reporting DB and is loaded into the data warehouse by a daily ETL, alongside daily updates of content metadata and business data. Legacy system logging data (1997–2008) is still to be ingested (TODO). The warehouse comprises HDFS data nodes, Hive, and a SOLR index. Non/semi-technical users work through a web app with tools optimized for web-oriented interactive use; programmers and data analysts (power users) use off-line, batch-oriented tools. Ease of use increases toward the web-oriented tools.]
Progress to date
• Production Hadoop and SOLR clusters in-place
• 25 Hadoop servers
• 11 SOLR servers (~3.6 billion index records)
• Usage data from April ‘08 through present has been
ETL’d and is available in the warehouse
• Represents about 55% of JSTOR historical log data
• Hive tables containing enriched usage data have been
developed
• Highly de-normalized for optimal performance
• Hive tables can be queried from Web interface
• Web tool developed for data exploration, filtering,
aggregation, and export
• Usable by non-technical users
• Intended to provide an 80% solution for common organizational usage data needs
Analytics environment - 2012
Query performance improvements:
• Platform RDBMS – hours to days (programmer
required)
• HDFS/MapReduce – minutes to hours (programmer
required)
• Hive – minutes to hours (power user)
• Web tool (backed by SOLR index) – seconds to
minutes (casual user)
Examples – Usage Data Explorer (UDE) tool
Examples – UDE Faceting
Explorer tool – Top accessed disciplines
Explorer tool – Top accessed articles
Explorer tool – Geomapping visualization
Examples – Usage trends by location
Examples – Usage trends by referrer
Analysis Samples
Examples – User navigation path analysis
Referrer analysis
Where JSTOR ‘sessions’ originated | Jan 2011 – Dec 2011
Site Search Activity, by type
[Pie chart: Basic 78.20%, Advanced 21.49%, Locator 0.27%, Saved 0.05% – 6.3M sessions, 19.8M searches (Mar 5 – Apr 16, 2012)]
Search type share by year:
             2009    2010    2011    2012
 Basic       68.8%   71.3%   77.4%   78.2%
 Advanced    30.5%   28.1%   22.3%   21.5%
 Locator      0.6%    0.6%    0.2%    0.3%
Search Pages Viewed
[Pie chart: 1 page 85.1%, 2 pages 9.4%, 3 pages 2.6%, 4 pages 1.2%, 5+ pages 1.7%]
1.3 search results pages viewed per search
Click Thru Rates by Search Position
JSTOR – 20M searches from March 5 – Apr 16
[Chart: click-through rate (%) by search result position, positions 1–25]
Click Thru Rates by Search Position
• First 2 pages of search results
[Chart: click-through rate (%) by search result position, positions 1–50]
Click Thru Rates by Search Position
• Positions 10-75
• Notice the relative increase in CTR for the last 2 items on each page
[Chart: click-through rate (%) by search result position, positions 10–75]
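A sketch of how a click-through-rate-by-position curve like the ones above can be computed, assuming two hypothetical event streams: one impression record per result shown and one click record per result clicked.

```python
# Sketch: compute click-through rate by SERP position from (hypothetical) event
# streams of impressions and clicks, each an iterable of (search_id, position).
from collections import Counter

def ctr_by_position(impressions, clicks):
    shown = Counter(pos for _, pos in impressions)
    clicked = Counter(pos for _, pos in clicks)
    return {pos: clicked[pos] / shown[pos] for pos in sorted(shown)}

# Toy example: two searches showing positions 1-25, three clicks total.
impressions = [("s1", p) for p in range(1, 26)] + [("s2", p) for p in range(1, 26)]
clicks = [("s1", 1), ("s2", 1), ("s2", 3)]
for pos, rate in list(ctr_by_position(impressions, clicks).items())[:3]:
    print(pos, f"{rate:.1%}")
```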
Examples – Click-through position
Avg SERP Click-Through Position
[Chart: average SERP click-through position, Jan 2011 – Oct 2012]
Examples – Click-through position
Avg SERP Click-Through Position at Traffic Levels
[Chart: average SERP click-through position (left axis, 0–20) and content accesses (right axis, 0–20M), Jan 2011 – Oct 2012]
What’s next?
• Integrate Business Intelligence Reporting
tool with Hadoop/Hive
• Good open source options such as Pentaho, Jasper, BIRT
• Commercial options also under consideration
• Ingest and normalize the pre-2008 log data
• Expand data sharing with institutions and
publishers
• More “push” reporting and dashboards
• More data modeling and deep analysis
Thank you