Code4lib 2014 • Raleigh, NC
Under the Hood of Hadoop Processing at OCLC Research
Roy Tennant
Senior Program Officer
The world’s libraries. Connected.
Apache Hadoop
• A family of open source technologies for parallel processing:
    • Hadoop core, which implements the MapReduce algorithm
    • Hadoop Distributed File System (HDFS)
    • HBase – Hadoop Database
    • Pig – A high-level data-flow language
    • Etc.
MapReduce
• “…a programming model for processing large data
sets with a parallel, distributed algorithm on a
cluster.” – Wikipedia
• Two main parts implemented in separate programs:
    • Mapping – filtering and sorting
    • Reducing – merging and summarizing
• Hadoop marshals the servers, runs the tasks in
parallel, manages I/O, and provides fault tolerance
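The two-part model above can be illustrated with a minimal in-memory sketch. This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, emit_words, sum_counts, run_job) are hypothetical, and the word-count task is only an assumed example. On a real cluster the grouping step in the middle is handled by the framework rather than by your own program.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs it emits."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to summarize that key's values."""
    return [reducer(key, values) for key, values in groups]


def emit_words(line):
    """Example mapper: emit (word, 1) for each whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]


def sum_counts(key, values):
    """Example reducer: add up the counts for one key."""
    return (key, sum(values))


def run_job(records):
    """Run map, group, and reduce in sequence on a small in-memory data set."""
    return reduce_phase(shuffle(map_phase(records, emit_words)), sum_counts)
```

Calling run_job(["Cats and dogs", "cats and birds"]) returns a list of (word, count) pairs ordered by word.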
Quick History
• OCLC has been doing MapReduce processing
on a cluster since 2005, thanks to Thom Hickey
and Jenny Toves
• In 2012, we moved to a much larger cluster using
Hadoop and HBase
• Our longstanding experience doing parallel
processing made the transition fairly quick and
easy
Meet “Gravel”
• 1 head node, 40 processing nodes
• Per processing node:
    • Two AMD 2.6 GHz processors
    • 32 GB RAM
    • Three 2 TB drives
    • 1 dual-port 10 Gb NIC
• Several copies of WorldCat, both “native” and “enhanced”
Using Hadoop
• Hadoop is natively Java
• Can use any language you want if you use the
“streaming” option
• Streaming jobs require a lot of parameters, best
kept in a shell script
• Mappers and reducers don’t even need to be in
the same language (mix and match!)
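To give a sense of how many parameters a streaming job takes, here is a rough sketch that assembles a streaming command line. It is written in Python rather than as the shell script the talk recommends, and it is not the script shown in the talk. The helper name build_streaming_command, the jar location, and the script names are assumptions; the options used (-files, -input, -output, -mapper, -reducer, -numReduceTasks) are taken from the Hadoop Streaming documentation and may differ on older Hadoop releases.

```python
import os
import shlex


def build_streaming_command(streaming_jar, input_path, output_path, mapper, reducer=None):
    """Return the argument list for a Hadoop Streaming job (the job is not run here)."""
    scripts = [mapper] + ([reducer] if reducer else [])
    command = [
        "hadoop", "jar", streaming_jar,
        # Generic option: ship the local scripts to the cluster nodes.
        "-files", ",".join(scripts),
        "-input", input_path,
        "-output", output_path,
        # Shipped scripts are referenced by base name; this assumes they are
        # executable and start with an interpreter line.
        "-mapper", os.path.basename(mapper),
    ]
    if reducer:
        command += ["-reducer", os.path.basename(reducer)]
    else:
        # Map-only job: ask for zero reduce tasks.
        command += ["-numReduceTasks", "0"]
    return command


def as_shell_line(command):
    """Render the argument list as a single shell-quoted line."""
    return shlex.join(command)
```

For instance, as_shell_line(build_streaming_command("/path/to/hadoop-streaming.jar", "/path/to/data", "/path/to/output", "./mapper.py", "./reducer.pl")) yields one long line pairing a Python mapper with a Perl reducer, which is the mix-and-match case mentioned above.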
Using HDFS
• The Hadoop Distributed File System (HDFS) takes care of
distributing your data across the cluster
• You can reference the data using a canonical address; for
example: /path/to/data
• There are also various standard file system commands open to
you; for example, to test a script before running it against all
the data:
    hadoop fs -cat /path/to/data/part-00001 | head | ./SCRIPT.py
• Also, data written to disk is similarly distributed and accessible
via HDFS commands; for example:
    hadoop fs -cat /path/to/output/* > data.txt
Using HBase
• Useful for random access to data elements
• We have dozens of tables, including the entirety
of WorldCat
• Individual records can be fetched by OCLC
number
Browsing HBase
Our “HBase Explorer”
MARC Record
MapReduce Processing
• Some jobs only have a “map” component
• Examples:
    • Find all the WorldCat records with a 765 field
    • Find all the WorldCat records with the string “Tar Heels” anywhere in them
    • Find all the WorldCat records with the text “online” in the 856 $z
• Output is written to disk in the Hadoop filesystem
(HDFS)
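A map-only job such as the “Tar Heels” search can be sketched as a small streaming mapper. This is a minimal sketch, not the mapper used at OCLC: it assumes each input line holds one complete record, and the names matching_records and main are hypothetical.

```python
import sys


def matching_records(lines, needle="Tar Heels"):
    """Yield every input line (assumed to be one record) that contains the string."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming entry point: read records on stdin, write matches to stdout."""
    for record in matching_records(stdin):
        stdout.write(record + "\n")

# When saved as a streaming mapper script, end the file with a call to main().
```

Because no reducer is involved, whatever the mapper writes to standard output becomes the job output.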
Mapper Process Only
[Diagram: Shell Script, Mapper, Data, HDFS]
MapReduce Processing
• Some also have a “reduce” component
• Example:
    • Find all of the text strings in the 650 $a (map) and count them up (reduce)
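The 650 $a example can be sketched in the streaming style, where the mapper writes tab-separated key and value lines and the reducer receives them sorted by key. This is a sketch under stated assumptions: extracting the 650 $a strings from a MARC record is left out, so map_headings simply takes strings that have already been extracted, and both function names are hypothetical.

```python
from itertools import groupby


def map_headings(headings):
    """Mapper side: emit one 'heading<TAB>1' line per 650 $a string."""
    for heading in headings:
        yield f"{heading.strip()}\t1"


def reduce_counts(sorted_lines):
    """Reducer side: sum the counts for each run of identical keys.

    The input must already be sorted by key, which the framework does
    between the map and reduce steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for heading, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{heading}\t{total}"
```

Locally, sorted(map_headings(...)) can stand in for the sort that happens between the two steps, so list(reduce_counts(sorted(map_headings(["Cooking", "Basketball", "Cooking"])))) gives one line per distinct heading with its count.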
Mapper and Reducer Process
[Diagram: Shell Script, Mapper, Data, Reducer, Summarized Data, HDFS]
The JobTracker
Sample Shell Script
• Set up variables
• Remove earlier output
• Call Hadoop with parameters and files
Sample Mapper
Sample Reducer
Running the Job
• Shell screenshot
The Blog Post
[Screenshot: front page of an online news magazine]
WorldCat Identities
Kindred Works
Cookbook Finder
VIAF
MARC Usage in WorldCat
Contents of the 856 $3 subfield
Work Records
WorldCat Linked Data Explorer
Roy Tennant
tennantr@oclc.org
@rtennant
facebook.com/roytennant
roytennant.com