Document 13462172

advertisement
On day 4, we saw how to process text data using the Enron email dataset. In reality, we only processed a small fraction of
the entire dataset: about 15 megabytes of Kenneth Layʹs emails. The entire dataset containing many Enron employeesʹ
mailboxes is 1.3 gigabytes, about 87 times than what we worked with. And what if we worked on GMail, Yahoo! Mail, or
Hotmail? Weʹd have several petabytes worth of emails, at least 71 million times the size of the data we dealt with.
All that data would take a while to process, and it certainly couldnʹt fit on or be crunched by a single laptop. Weʹd have to
store the data on many machines, and weʹd have to process it (tokenize it, calculate tf‐idf) using multiple machines. There
are many ways to do this, but one of the more popular recent methods of parallelizing data computation is based on a
programming framework called MapReduce, an idea that Google presented to the world in 2004. Luckily, you do not have to
work at Google to benefit from MapReduce: an open‐source implementation called Hadoop is available for your use!
You might worry that we donʹt have hundreds of machines sitting around for us to use them. Actually, we do! Amazon Web
Services offers a service called Elastic MapReduce (EMR) that gives us access to as many machines as we would like for
about 10 cents per hour of machine we use. Use 100 machines for 2 hours? Pay Amazon aroud $2.00. If youʹve ever heard
the buzzword cloud computing, this elastic service is part of the hype.
Letʹs start with a simple word count example, then rewrite it in MapReduce, then run MapReduce on 20 machines using
Amazonʹs EMR, and finally write a big‐person MapReduce workflow to calculate TF‐IDF!
Setup
Weʹre going to be using two files, dataiap/day5/term_tools.py and dataiap/day5/package.tar.gz . Either write your code in the
dataiap/day5 directory, or copy these files to the directory where your work lives.
Counting Words
Weʹre going to start with a simple example that should be familiar to you from day 4ʹs lecture. First, unzip the
JSON‐encoded Kenneth Lay email file:
unzip dataiap/datasets/emails/kenneth_json.zip
This will result in a new file called lay‐k.json , which is JSON‐encoded. What is JSON? You can think of it like a text
representation of python dictionaries and lists. If you open up the file, you will see on each line something that looks like
this:
{"sender": "rosalee.fleming@enron.com", "recipients": ["lizard_ar@yahoo.com"], "cc": [], "text": "Liz, I don't know how the
address shows up when sent, but they tell us it's \nkenneth.lay@enron.com.\n\nTalk to you soon, I hope.\n\nRosie", "mid":
"32285792.1075840285818.JavaMail.evans@thyme", "fpath": "enron_mail_20110402/maildir/lay‐k/_sent/108.", "bcc": [], "to":
["lizard_ar@yahoo.com"], "replyto": null, "ctype": "text/plain; charset=us‐ascii", "fname": "108.", "date": "2000‐08‐10
03:27:00‐07:00", "folder": "_sent", "subject": "KLL's e‐mail address"}
Itʹs a dictionary representing an email found in Kenneth Layʹs mailbox. It contains the same content that we dealt with on
day 4, but encoded into JSON, and rather than one file per email, we have a single file with one email per line.
Why did we do this? Big data crunching systems like Hadoop donʹt deal well with lots of small files: they want to be able to
send a large chunk of data to a machine and have to crunch on it for a while. So weʹve processed the data to be in this
format: one big file, a bunch of emails line‐by‐line. If youʹre curious how we did this, check out dataiap/day5
/emails_to_json.py .
Aside from that, processing the emails is pretty similar to what we did on day 4. Letʹs look at a script that counts the words
in the text of each email (Remember: it would help if you wrote and ran your code in dataiap/day5/... today, since several
modules like term_tools.py are available in that directory).
MapReduce skills just yet: you will likely need them one day.
Analyzing the output
Hopefully your first mapreduce is done by now. There are two bits of output we should check out. First, when the
MapReduce job finishes, you will see something like the following message in your terminal window:
Counters from step 1:
FileSystemCounters:
FILE_BYTES_READ: 499365431
FILE_BYTES_WRITTEN: 61336628
S3_BYTES_READ: 1405888038
S3_BYTES_WRITTEN: 8354556
Job Counters :
Launched map tasks: 189
Launched reduce tasks: 85
Rack‐local map tasks: 189
Map‐Reduce Framework:
Combine input records: 0
Combine output records: 0
Map input bytes: 1405888038
Map input records: 516893
Map output bytes: 585440070
Map output records: 49931418
Reduce input groups: 232743
Reduce input records: 49931418
Reduce output records: 232743
Reduce shuffle bytes: 27939562
Spilled Records: 134445547
Thatʹs a summary of, on your 20 machines, how many Mappers and Reducers ran. You can run more than one of each on a
physical machine, which explains why more than 20 of each ran in our tasks. Notice how many reducers ran your task. Each
reducer is going to receive a set of words and their number of occurrences, and emit word counts. Reducers donʹt talk to
one‐another, so they end up writing their own files.
With this in mind, go to the S3 console, and look at the output directory of the S3 bucket to which you output your words.
Notice that there are several files in the output directory named part‐00000 , part‐00001 . There should be as many files as
there were reducers, since each wrote the file out. Download some of these files and open them up. You will see the various
word counts for words across the entire Enron email corpus. Life is good!
(Optional) Exercise : Make a directory called copied . Copy the output from your script to copied using dataiap/resources
/s3_util.py with a command like python ../resources/s3_util get s3://dataiap‐YOURUSERNAME‐testbucket/output copied . Once
youʹve got all the files downloaded, load them up and sort the lines by their count. Do the popular terms across the entire
dataset make sense?
TF‐IDF
This section is going to further exercise our MapReduce‐fu.
On day 4, we learned that counting words is not enough to summarize text: common words like the and and are too
popular. In order to discount those words, we multiplied by the term frequency of wordX by log(total # documents/# documents
with wordX) . Letʹs do that with MapReduce!
Weʹre going to emit a per‐sender TF‐IDF. To do this, we need three MapReduce tasks:
The first will calculate the number of documents, for the numerator in IDF.
The second will calculate the number of documents each term appears in, for the denominator of IDF, and emits the
IDF ( log(total # documents/# documents with wordX) ).
The third calculates a per‐sender IDF for each term after taking both the second MapReduceʹs term IDF and the email
OK, now that youʹve seen the motivation behind the MapReduce technique, letʹs actually try it out.
MapReduce
Say we have a JSON‐encoded file with emails (3,000,000 emails on 3,000,000 lines), and we have 3 computers to compute
the number of times each word appears.
In the map phase (figure below), we are going to send each computer 1/3 of the lines. Each computer will process their
1,000,000 lines by reading each of the lines and tokenizing their words. For example, the first machine may extract ʺenron,
call,...ʺ, while the second machine extracts ʺconference, call,...ʺ.
From the words, we will create (key, value) pairs (or (word, 1) pairs in this example). The shuffle phase will assigned each key
to one of the 3 computer, and all the values associated with the same key are sent to the keyʹs computer. This is necessary
because the whole dictionary doesnʹt fit in the memory of a single computer! Think of this as creating the (key, value) pairs
of a huge dictionary that spans all of the 3,000,000 emails. Because the whole dictionary doesnʹt fit into a single machine, the
keys are distributed across our 3 machines. In this example, ʺenronʺ is assigned to computer 1, while ʺcallʺ and ʺconferenceʺ
are assigned to computer 2, and ʺbankruptʺ is assigned to computer 3.
Finally, once each machine has received the values of the keys itʹs responsible for, the reduce phase will process each keyʹs
value. It does this by going through each key that is assigned to the machine and executing a reducer function on the values
associated with the key. For example, ʺenronʺ was associated with a list of three 1ʹs, and the reducer step simply adds them
up.
MapReduce is more general‐purpose than just serving to count words. Some people have used it to do exotic things like
process millions of songs, but we want you to work through an entire end‐to‐end example.
Without further ado, hereʹs the wordcount example, but written as a MapReduce application:
import sys
from mrjob.protocol import JSONValueProtocol
from mrjob.job import MRJob
from term_tools import get_terms
class MRWordCount(MRJob):
INPUT_PROTOCOL = JSONValueProtocol
OUTPUT_PROTOCOL = JSONValueProtocol
def mapper(self, key, email):
for term in get_terms(email['text']):
yield term, 1
def reducer(self, term, occurrences):
yield None, {'term': term, 'count': sum(occurrences)}
if __name__ == '__main__':
MRWordCount.run()
Letʹs break this thing down. Youʹll notice the term MRJob in a bunch of places. MRJob is a python package that makes
writing MapReduce programs easy. The developers at Yelp (they wrote the mrjob module) wrote a convenience class called
MRJob that you will extend. When itʹs run, it automatically hooks into the MapReduce framework, reads and parses the
input files, and does a bunch of other things for you.
What we do is create a class MRWordCount that extends MRJob , and implement the mapper and reducer functions. If the
program is run from the command line (the if __name__ == '__main__': part), it will execute the MRWordCount MapRedce
program.
Looking inside MRWordCount , we see INPUT_PROTOCOL being set to JSONValueProtocol . By default, map functions expect a line of
text as input, but weʹve encoded our emails as JSON, so we let MRJob know that. Similarly, we explain that our reduce tasks
will emit dictionaries by setting OUTPUT_PROTOCOL appropriately.
The mapper function handles the functionality described in the first image of the last section. It takes each email, tokenizes it
into terms, and yield s each term. You can yield a key and a value ( term and 1 ) in a mapper (notice ʺyieldʺ arrows in the
second figure above). We yield the term with the value 1 , meaning one instance of the word term was found. yield is a
python keyword that turns functions into iterators (stack overflow explanation). In the context of writing mapper and
reducer functions, you can think of it as return .
The reducer function implements the third image of the last section. We are given a word (the key emitted from mappers),
and a list occurrences of all of the values emitted for each instance of term . Since we are counting occurrences of words, we
yield a dictionary containing the term and a sum of the occurrences weʹve seen.
Note that we sum instead of len the occurrences . This allows us to change the mapper implementation to emit the number
of times each word occurs in a document, rather than 1 for each word.
Both the mapper and reducer offer us the parallelism we wanted. There is no loop through our entire set of emails, so
MapReduce is free to distribute the emails to multiple machines, each of which will run mapper on an email‐by‐email basis.
We donʹt have a single dictionary with the count of every word, but instead have a reduce function that has to sum up the
occurrences of a single word, meaning we can again distribute the work to several reducing machines.
Run It!
Enough talk! Letʹs run this thing.
python mr_wordcount.py ‐o 'wordcount_test' ‐‐no‐output '../datasets/emails/lay‐k.json'
The ‐o flag tells MRJob to output all reducer output to the wordcount_test directory. The ‐‐no‐output flag says not to print
the output of the reducers to the screen. The last argument ( '../datasets/emails/lay‐k.json' ) specifies which file (or files) to
read into the mappers as input.
Take a look at the newly created wordcount_test directory. There should be at least one file ( part‐00000 ), and perhaps more.
There is one file per reducer that counted words. Reducers donʹt talk to one‐another as they do their work, and so we end
up with multiple output files. While the count of a specific word will only appear in one file, we have no idea which reducer
file will contain a given word.
The output files (open one up in a text editor) list each word as a dictionary on a single line ( OUTPUT_PROTOCOL =
JSONValueProtocol in mr_wordcount.py is what caused this).
You will notice we have not yet run tasks on large datasets (weʹre still using lay‐k.json ) and we are still running them locally
on our computers. We will soon learn to movet his work to Amazonʹs cloud infrastructure, but running MRJob tasks locally
to test them on a small file is forever important. MapReduce tasks will take a long time to run and hold up several tens to
several hundreds of machines. They also cost money to run, whether they contain a bug or not. Test them locally like we just
did to make sure you donʹt have bugs before going to the full dataset.
Show off What you Learned
Exercise Create a second version of the MapReduce wordcounter that counts the number of each word emitted by each
sender. You will need this for later, since weʹre going to be calculating TF‐IDF implementing terms per sender. You can
accomplish this with a sneaky change to the term emitted by the mapper . You can either turn that term into a dictionary, or
into a more complicated string, but either way you will have to encode both sender and term information in that term . If you
get stuck, take a peak at dataiap/day5/mr_wc_by_sender.py .
(Optional) Exercise The grep command on UNIX‐like systems allows you to search text files for some term or terms. Typing
grep hotdogs file1 will return all instances of the word hotdogs in the file file1 . Implement a grep for emails. When a user
uses your mapreduce program to find a word in the email collection, they will be given a list of the subjects and senders of
all emails that contain the word. You might find you do not need a particularly smart reducer in this case: thatʹs fine. If
youʹre pressed for time, you can skip this exercise.
We now know how to write some pretty gnarly MapReduce programs, but they all run on our laptops. Sort of boring. Itʹs
time to move to the world of distributed computing, Amazon‐style!
Amazon Web Services
Amazon Web Services (AWS) is Amazonʹs gift to people who donʹt own datacenters. It allows you to elastically request
computation and storage resources at varied scales using different services. As a testiment to the flexibility of the services,
companies like NetFlix are moving their entire operation into AWS.
In order to work with AWS, you will need to set up an account. If youʹre in the class, we will have given you a username,
password, access key, and access secret. To use these accounts, you will have to login through a special class‐only webpage.
The same instructions work for people who are trying this at home, only you need to log in at the main AWS website.
The username and password log you into the AWS console so that you can click around its interface. In order to let your
computer identify itself with AWS, you have to tell your computer your access key and secret. On UNIX‐like platforms
(GNU/Linux, BSD, MacOS), type the following:
export AWS_ACCESS_KEY_ID='your_key_id'
export AWS_SECRET_ACCESS_KEY='your_access_id'
On windows machines, type the following at the command line:
set AWS_ACCESS_KEY_ID=your_key_id
set AWS_SECRET_ACCESS_KEY=your_access_id
Replace your_key_id and your_access_id with the ones you were assigned.
Thatʹs it! There are more than a dayʹs worth of AWS services to discuss, so letʹs stick with two of them: Simple Storage
Service (S3) and Elastic MapReduce (EMR).
AWS S3
S3 allows you to store gigabytes, terabytes, and, if youʹd like, petabytes of data in Amazonʹs datacenters. This is useful,
because laptops often donʹt crunch and store more than a few hundred gigabytes worth of data, and storing it in the
datacenter allows you to securely have access to the data in case of hardware failures. Itʹs also nice because Amazon tries
harder than you to have the data be always accessible.
In exchange for nice guarantees about scale and accessibility of data, Amazon charges you rent on the order of 14 cents per
gigabyte stored per month.
Services that work on AWS, like EMR, read data from and store data to S3. When we run our MapReduce programs on
EMR, weʹre going to read the email data from S3, and write word count data to S3.
S3 data is stored in buckets . Within a bucket you create, you can store as many files or folders as youʹd like. The name of
your bucket has to be unique across all of the people that store their stuff in S3. Want to make your own bucket? Letʹs do
this!
Log in to the AWS console (the website), and click on the S3 tab. This will show you a file explorer‐like interface, with
buckets listed on the left and files per bucket listed on the right.
Click ʺCreate Bucketʺ near the top left.
Enter a bucket name. This has to be unique across all users of S3. Pick something like dataiap‐YOURUSERNAME‐
testbucket. Do not use underscores in the name of the bucket .
Click ʺCreateʺ
This gives you a bucket, but the bucket has nothing in it! Poor bucket. Letʹs upload Kenneth Layʹs emails.
Select the bucket from the list on the left.
Click ʺUpload.ʺ
Click ʺAdd Files.ʺ
Select the lay‐k.json file on your computer.
Click ʺStart Upload.ʺ
Right click on the uploaded file, and click ʺMake Public.ʺ
Verify the file is public by going to ht tp://dataiap‐YOURUSERNAME‐testbucket.s3.amazonaws. com/lay‐k.json.
Awesome! We just uploaded our first file to S3. Amazon is now hosting the file. We can access it over the web, which
means we can share it with other researchers or process it in Elastic MapReduce. To save time, weʹve uploaded the entire
enron dataset to https://dataiap‐enron‐json.s3.amazonaws.com/ . Head over there to see all of the different Enron
The complete dataset is no longer available here, since it was only intended for in-class use. For alternate instructions, please see the Lectures and Labs page.
employeeʹs files listed (the first three should be allen‐p.json , arnold‐j.json , and arora‐h.json ).
Two notes from here. First, uploading the file to S3 was just an exercise‐‐‐weʹll use the dataiap‐enron‐json bucket for our
future exercises. Thatʹs because the total file upload is around 1.3 gigs, and we didnʹt want to put everyone through the pain
of uploading it themselves. Second, most programmers donʹt use the web interface to upload their files. They instead opt to
upload the files from the command line. If you have some free time, feel free to check out dataiap/resources/s3_util.py for a
script that copies directories to and downloads buckets from S3.
Letʹs crunch through these files!
AWS EMR
Weʹre about to process the entire enron dataset. Letʹs do a quick sanity check that our mapreduce wordcount script still
works. Weʹre about to get into the territory of spending money on the order of 10 cents per machine‐hour, so we want to
make sure we donʹt run into preventable problems that waste money.
python mr_wordcount.py ‐o 'wordcount_test2' ‐‐no‐output '../datasets/emails/lay‐k.json'
Did that finish running and output the word counts to wordcount_test2 ? If so, letʹs run it on 20 machines (costing us $2,
rounded to the nearest hour). Before running the script, weʹll talk about the parameters:
python mr_wordcount.py ‐‐num‐ec2‐instances=20 ‐‐python‐archive package.tar.gz ‐r emr ‐o 's3://dataiap‐YOURUSERNAME‐testbucket/out
The parameters are:
num‐ec2‐instances: we want to run on 20 machines in the cloud. Snap!
python‐archive: when the script runs on remote machines, it will need term_tools.py in order to tokenize the email
text. We have packaged this file into package.tar.gz.
‐r emr: donʹt run the script locally‐‐‐run it on AWS EMR.
‐o 's3://dataiap‐YOURUSERNAME‐testbucket/output': write script output to the bucket you made when playing
around with S3. Put all files in a directory called output in that bucket. Make sure you change dataiap‐
YOURUSERNAME‐testbucket to whatever bucket name you picked on S3.
‐‐no‐output: donʹt print the reducer output to the screen.
's3://dataiap‐enron‐json/*.json': perform the mapreduce with input from the dataiap‐enron‐json bucket
that the instructors created, and use as input any file that ends in .json. You could have named a specific file, like
lay‐k.json here, but the point is that we can run on much larger datasets.
Check back on the script. Is it still running? It should be. You may as well keep reading, since youʹll be here a while. In total,
our run took three minutes for Amazon to requisition the machines, four minutes to install the necessary software on them,
and between 15 adn 25 minutes to run the actual MapReduce tasks on Hadoop. That might strike some of you as weird, and
weʹll talk about it now.
Understanding MapReduce is about understanding scale . Weʹre used to thinking of our programs as being about
performance , but thatʹs not the role of MapReduce. Running a script on a single file on a single machine will be faster than
running a script on multiple files split amongst multiple machines that shuffle data around to one‐another and emit the data
to a service (like EMR and S3) over the internet is not going to be fast. We write MapReduce programs because they let us
easily ask for 10 times more machines when the data weʹre processing grows by a factor of 10, not so that we can achieve
sub‐second processing times on large datasets. Itʹs a mental model switch that will take a while to appreciate, so let it brew
in your mind for a bit.
What it does mean is that MapReduce as a programming model is not a magic bullet. The Enron dataset is not actually so
large that it shouldnʹt be processed on your laptop. We used the dataset because it was large enough to give you an
appreciation for order‐of‐magnitue file size differences, but not large enough that a modern laptop canʹt process the data. In
practice, donʹt look into MapReduce until you have several tens or hundreds of gigabytes of data to analyze. In the world
that exists inside most companies, this size dataset is easy to stumble upon. So donʹt be disheartened if you donʹt need the
MapReduce skills just yet: you will likely need them one day.
Analyzing the output
Hopefully your first mapreduce is done by now. There are two bits of output we should check out. First, when the
MapReduce job finishes, you will see something like the following message in your terminal window:
Counters from step 1:
FileSystemCounters:
FILE_BYTES_READ: 499365431
FILE_BYTES_WRITTEN: 61336628
S3_BYTES_READ: 1405888038
S3_BYTES_WRITTEN: 8354556
Job Counters :
Launched map tasks: 189
Launched reduce tasks: 85
Rack‐local map tasks: 189
Map‐Reduce Framework:
Combine input records: 0
Combine output records: 0
Map input bytes: 1405888038
Map input records: 516893
Map output bytes: 585440070
Map output records: 49931418
Reduce input groups: 232743
Reduce input records: 49931418
Reduce output records: 232743
Reduce shuffle bytes: 27939562
Spilled Records: 134445547
Thatʹs a summary of, on your 20 machines, how many Mappers and Reducers ran. You can run more than one of each on a
physical machine, which explains why more than 20 of each ran in our tasks. Notice how many reducers ran your task. Each
reducer is going to receive a set of words and their number of occurrences, and emit word counts. Reducers donʹt talk to
one‐another, so they end up writing their own files.
With this in mind, go to the S3 console, and look at the output directory of the S3 bucket to which you output your words.
Notice that there are several files in the output directory named part‐00000 , part‐00001 . There should be as many files as
there were reducers, since each wrote the file out. Download some of these files and open them up. You will see the various
word counts for words across the entire Enron email corpus. Life is good!
(Optional) Exercise : Make a directory called copied . Copy the output from your script to copied using dataiap/resources
/s3_util.py with a command like python ../resources/s3_util get s3://dataiap‐YOURUSERNAME‐testbucket/output copied . Once
youʹve got all the files downloaded, load them up and sort the lines by their count. Do the popular terms across the entire
dataset make sense?
TF‐IDF
This section is going to further exercise our MapReduce‐fu.
On day 4, we learned that counting words is not enough to summarize text: common words like the and and are too
popular. In order to discount those words, we multiplied by the term frequency of wordX by log(total # documents/# documents
with wordX) . Letʹs do that with MapReduce!
Weʹre going to emit a per‐sender TF‐IDF. To do this, we need three MapReduce tasks:
The first will calculate the number of documents, for the numerator in IDF.
The second will calculate the number of documents each term appears in, for the denominator of IDF, and emits the
IDF ( log(total # documents/# documents with wordX) ).
The third calculates a per‐sender IDF for each term after taking both the second MapReduceʹs term IDF and the email
corpus as input.
HINT Do not run these MapReduce tasks on Amazon. You saw how slow it was to run, so make sure the entire TF‐IDF
workflow works on your local machine with lay‐k.json before moving to Amazon.
MapReduce 1: Total Number of Documents
Eugene and I are the laziest of instructors. We donʹt like doing work where we donʹt have to. If youʹd like a mental exercise
as to how to write this MapReduce, you can do so yourself, but itʹs simpler than the wordcount example. Our dataset is
small enough that we can just use the wc UNIX command to count the number of lines in our corpus:
wc ‐l lay‐k.json
Kenneth Lay has 5929 emails in his dataset. We ran wc ‐l on the entire Enron email dataset, and got 516893. This took a few
seconds. Sometimes, itʹs not worth overengineering a simple task!:)
MapReduce 2: Per‐Term IDF
We recommend you stick to 516893 as your total number of documents, since eventually weʹre going to be crunching the
entire dataset!
What we want to do here is emit log(516893.0 / # documents with wordX) for each wordX in our dataset. Notice the decimal on
516893.0: thatʹs so we do floating point division rather than integer division. The output should be a file where each line
contains {'term': 'wordX', 'idf': 35.92} for actual values of wordX and 35.92 .
Weʹve put our answer in dataiap/day5/mr_per_term_idf.py , but try your hand at writing it yourself before you look at ours. It
can be implemented with a three‐line change to the original wordcount MapReduce we wrote (one line just includes
math.log !).
MapReduce 3: Per‐Sender TF‐IDFs
The third MapReduce multiplies per‐sender term frequencies by per‐term IDFs. This means it needs to take as input the
IDFs calculated in the last step and calculate the per‐sender TFs. That requires something we havenʹt seen yet: initialization
logic. Letʹs show you the code, then tell you how itʹs done.
import os
from mrjob.protocol import JSONValueProtocol
from mrjob.job import MRJob
from term_tools import get_terms
DIRECTORY = "/path/to/idf_parts/"
class MRTFIDFBySender(MRJob):
INPUT_PROTOCOL = JSONValueProtocol
OUTPUT_PROTOCOL = JSONValueProtocol
def mapper(self, key, email):
for term in get_terms(email['text']):
yield {'term': term, 'sender': email['sender']}, 1
def reducer_init(self):
self.idfs = {}
for fname in os.listdir(DIRECTORY): # look through file names in the directory
file = open(os.path.join(DIRECTORY, fname)) # open a file
for line in file: # read each line in json file
term_idf = JSONValueProtocol.read(line)[1] # parse the line as a JSON object
self.idfs[term_idf['term']] = term_idf['idf']
def reducer(self, term_sender, howmany):
tfidf = sum(howmany) * self.idfs[term_sender['term']]
yield None, {'term_sender': term_sender, 'tfidf': tfidf}
If you did the first exercise, the mapper and reducer functions should look a lot like the per‐sender word count mapper and
reducer functions you wrote for that. The only difference is that reducer takes the term frequencies and multiplies them by
self.idfs[term] , to normalize by each wordʹs IDF. The other difference is the addition of reducer_init , which we will
describe next.
self.idfs is a dictionary containing term‐IDF mappings from the second MapReduce. Say you ran the IDF‐calculating
MapReduce like so:
python mr_per_term_idf.py ‐o 'idf_parts' ‐‐no‐output '../datasets/emails/lay‐k.json'
10
The individual terms and IDFs would be emitted to the directory idf_parts/ . We would want to load all of these term‐idf
mappings into self.idfs . Set DIRECTORY to the filesystem path that points to the idf_parts/ directory.
Sometimes, we want to load some data before running the mapper or the reducer. In our example, we want to load the IDF
values into memory before executing the reducer, so that the values are available when we compute the tf‐idf. The function
reducer_init is designed to perform this setup. It is called before the first reducer is called to calculate TF‐IDF. It opens all of
the output files in DIRECTORY , and reads them into self.idfs . This way, when reducer is called on a term, the idf for that term
has already been calculated.
To verify youʹve done this correctly, compare your output to ours. There were some pottymouths that emailed Kenneth Lay:
{ʺtfidfʺ: 13.155591168821202, ʺterm_senderʺ: {ʺtermʺ: ʺa‐holeʺ, ʺsenderʺ: ʺjustinsitzman@hotmail.comʺ}}
Why is it OK to Load IDFs Into Memory?
You might be alarmed at the moment. Here we are, working with BIG DATA, and now weʹre expecting the TF‐IDF
calculation to load the entirety of the IDF data into memory on EVERY SINGLE reducer. Thatʹs crazytown.
Itʹs actually not. While the corpus weʹre analyzing is large, the number of words in the English language (roughly the
amount of terms we calculate IDF for) is not. In fact, the output of the per‐term IDF calculation was around 8 megabytes,
which is far smaller than the 1.3 gigabytes we processed. Keep this in mind: even if calculating something over a large
amount of data is hard and takes a while, the result might end up small.
Optional: Run The TF‐IDF Workflow
We recommend running the TF‐IDF workflow on Amazon once class is over. The first MapReduce script (per‐term IDF)
should run just fine on Amazon. The second will not. The reducer_init logic expects a file to live on your local directory. You
will have to modify it to read the output of the IDF calculations from S3 using boto . Take a look at the code to implement
get in dataiap/resources/s3_util.py for a programmatic view of accessing files in S3.
Where to go from here
We hope that MapReduce serves you well with large datasets. If this kind of work excites you, here are some things to read
up on.
As you can see, writing more complex workflows for things like TF‐IDF can get annoying. In practice, folks use
higher‐level languages than map and reduce to build MapReduce workflows. Some examples are Pig, Hive, and
Cascading.
If you care about making your MapReduce tasks run faster, there are lots of tricks you can play. One of the easiest
things to do is to add a combiners between your mapper and reducer. A combiner has similar logic to a reducer, but
runs on a mapper before the shuffle stage. This allows you to, for example, pre‐sum the words emitted by the map
stage in a wordcount so that you donʹt have to shuffle as many words around.
MapReduce is one model for parallel programming called data parallelism . Feel free to read about others.
When MapReduce runs on multiple computers, itʹs an example of distributed computing, which has a lot of
interesting applications and problems to be solved.
S3 is a distributed storage system and is one of many. It is built upon Amazonʹs Dynamo technology. Itʹs one of many
distributed file systems and distributed data stores.
MIT OpenCourseWare
http://ocw.mit.edu
Resource: How to Process, Analyze and Visualize Data
Adam Marcus and Eugene Wu
The following may not correspond to a particular course on MIT OpenCourseWare, but has been
provided by the author as an individual learning resource.
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Download