Code4lib 2014 • Raleigh, NC

Under the Hood of Hadoop Processing at OCLC Research
Roy Tennant
Senior Program Officer

The world’s libraries. Connected.

Apache Hadoop
• A family of open source technologies for parallel processing:
  • Hadoop core, which implements the MapReduce algorithm
  • Hadoop Distributed File System (HDFS)
  • HBase – Hadoop Database
  • Pig – A high-level data-flow language
  • Etc.

MapReduce
• “…a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.” – Wikipedia
• Two main parts implemented in separate programs:
  • Mapping – filtering and sorting
  • Reducing – merging and summarizing
• Hadoop marshals the servers, runs the tasks in parallel, manages I/O, and provides fault tolerance

Quick History
• OCLC has been doing MapReduce processing on a cluster since 2005, thanks to Thom Hickey and Jenny Toves
• In 2012, we moved to a much larger cluster using Hadoop and HBase
• Our longstanding experience doing parallel processing made the transition fairly quick and easy

Meet “Gravel”
• 1 head node, 40 processing nodes
• Per processing node:
  • Two AMD 2.6 GHz processors
  • 32 GB RAM
  • Three 2 TB drives
  • 1 dual-port 10Gb NIC
• Several copies of WorldCat, both “native” and “enhanced”

Using Hadoop
• Hadoop is natively Java
• Can use any language you want if you use the “streaming” option
• Streaming jobs require a lot of parameters, best kept in a shell script
• Mappers and reducers don’t even need to be in the same language (mix and match!)
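The map and reduce parts described above, and the contract a “streaming” job follows, can be sketched in a few lines of Python. This is a minimal local simulation, not OCLC's code: the token-counting task and the names mapper, reducer, and run_local are assumptions chosen for illustration. In a streaming job the mapper and reducer are separate programs that read lines on standard input and write tab-separated key/value lines on standard output; here the sort in run_local stands in for the sort-by-key step that Hadoop performs between them.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit one 'key<TAB>1' line per whitespace-separated token."""
    for line in lines:
        for token in line.split():
            yield f"{token}\t1"


def key_of(line):
    """Return the key, i.e. the text before the first tab."""
    return line.split("\t", 1)[0]


def reducer(sorted_lines):
    """Reduce step: sum the values of adjacent lines that share a key."""
    for key, group in groupby(sorted_lines, key=key_of):
        total = sum(int(line.split("\t", 1)[1]) for line in group)
        yield f"{key}\t{total}"


def run_local(lines):
    """Stand-in for the framework (an assumption): map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines), key=key_of)))


if __name__ == "__main__":
    for out in run_local(["a b a", "b c"]):
        print(out)
```

Because the two steps only exchange lines of text, either one could be swapped for a program in another language, which is what makes the mix-and-match approach possible.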

Using HDFS
• The Hadoop Distributed File System (HDFS) takes care of distributing your data across the cluster
• You can reference the data using a canonical address; for example: /path/to/data
• There are also various standard file system commands open to you; for example, to test a script before running it against all the data:
  hadoop fs -cat /path/to/data/part-00001 | head | ./SCRIPT.py
• Also, data written to disk is similarly distributed and accessible via HDFS commands; for example:
  hadoop fs -cat /path/to/output/* > data.txt

Using HBase
• Useful for random access to data elements
• We have dozens of tables, including the entirety of WorldCat
• Individual records can be fetched by OCLC number

Browsing HBase
• Our “HBase Explorer”

MARC Record

MapReduce Processing
• Some jobs only have a “map” component
• Examples:
  • Find all the WorldCat records with a 765 field
  • Find all the WorldCat records with the string “Tar Heels” anywhere in them
  • Find all the WorldCat records with the text “online” in the 856 $z
• Output is written to disk in the Hadoop filesystem (HDFS)

Mapper Process Only
[Diagram: Shell Script, Mapper, Data, HDFS]

MapReduce Processing
• Some also have a “reduce” component
• Example:
  • Find all of the text strings in the 650 $a (map) and count them up (reduce)

Mapper and Reducer Process
[Diagram: Shell Script, Mapper, Data, Reducer, Summarized Data, HDFS]

The JobTracker

Sample Shell Script
• Setup Variables
• Remove earlier output
• Call Hadoop with parameters and files

Sample Mapper

Sample Reducer

Running the Job
• Shell screenshot
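The two job shapes above (map only, and map plus reduce) might look roughly like this as Python streaming steps. This is a sketch under loud assumptions, not the mapper and reducer shown in the talk: the on-disk record format is not given in these slides, so the code assumes a hypothetical layout of one record per line, fields separated by "|", each field written as a tag, a space, and "$"-prefixed subfields (for example "650 $aCooking$xHistory"). The function names and the "map"/"reduce" command-line argument are also hypothetical.

```python
import sys
from itertools import groupby


def filter_mapper(lines, needle="Tar Heels"):
    """Map-only job: pass through every record that contains the string."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


def subject_mapper(lines):
    """Map step: emit '<650 $a text><TAB>1' for each 650 $a subfield.

    Assumes the hypothetical 'tag $a...$x...|tag ...' record layout.
    """
    for line in lines:
        for field in line.rstrip("\n").split("|"):
            if not field.startswith("650 "):
                continue
            # Subfields are introduced by '$' plus a one-letter code.
            for subfield in field[4:].split("$")[1:]:
                if subfield.startswith("a"):
                    yield f"{subfield[1:].strip()}\t1"


def count_reducer(sorted_lines):
    """Reduce step: input is sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Hypothetical command-line use: pass 'map' or 'reduce' as the only argument.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = subject_mapper if sys.argv[1] == "map" else count_reducer
    for out in step(sys.stdin):
        print(out)
```

A mapper written this way can be tried on a handful of records first, in the same spirit as the hadoop fs -cat … | head | ./SCRIPT.py test shown under Using HDFS, before it is run against all the data.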

The Blog Post

[Screenshot: The Atlantic home page, including the article “In Books, Movies, and Media, the Most Popular Title Word Is 'New'” by Robinson Meyer]

WorldCat Identities

Kindred Works

Cookbook Finder

VIAF

MARC Usage in WorldCat
• Contents of the 856 $3 subfield

Work Records

WorldCat Linked Data Explorer

Roy Tennant
tennantr@oclc.org
@rtennant
facebook.com/roytennant
roytennant.com