CST334
Project 1: Unix Media Analysis
Due at 11:30 PM, March 15
Worth 25 Points
Late projects will be discounted 50%!
The power of the media to influence public perception is well known but difficult to characterize, since most of our information about the world comes through media sources with vested interests in certain institutions or policies. In addition, the most powerful media, such as television and radio, are audio-visual, ephemeral, and rapidly changing, which makes them difficult to track for content and subtext. However, some media sources are archived as text files or transcripts and are
available on the web. In this project, we are going to apply our Unix tool set to analyze a
collection of weekly radio addresses from President Bush spanning the period from 2003 to the
present.
In the /home/cst334/bush directory you'll find a set of more than 100 files time coded by date, as
shown in the following excerpt:
20010915.htm 20030607.htm 20031115.htm 20040424.htm 20041002.htm
20030104.htm 20030614.htm 20031122.htm 20040501.htm 20041009.htm
...
The date of the files forms the file name. For example, 20040501.htm represents the radio address
given on May 1, 2004. Note there is also a file from September 15, 2001 for historical purposes. As
indicated by the .htm suffix, the files are coded in HTML format. Although this is not really a problem for reading, we will spend part of this project translating the files into plain text. To do this we will use the program html2text, which appears at the bottom of the directory listing.
Your primary mission will be to compile statistics on the use of certain words in Bush's radio
addresses, and examine how these may be changing over time.
Part A Getting Familiar with the Content:
Visit the bush directory and spend some time getting familiar with the contents. Use
more to examine the 20010915.htm file and read what President Bush said four days after
the 9/11 attack. Notice the HTML formatting, which is distracting but does not prevent reading
the content.
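For instance, one minimal way to do this (assuming the course directory path given above) is:
cd /home/cst334/bush
more 20010915.htm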
One tool that we will use more than any other is grep, which searches files for different string
patterns. You can enter the command
grep 'terrorist' 20010915.htm
to see the lines in this file containing the word terrorist. The command grep 'new york' 20010915.htm will look for the phrase 'new york' in the same file. However, since grep matches patterns based on letter case, there will be no hits. You can ignore case using the -i option. In other words,
grep -i 'new york' 20010915.htm
will search for both lowercase and uppercase versions of the search string.
The grep command returns the entire line for every occurrence of the search string. If you would
like to obtain more context on what grep finds, you can use the -C option. The command
grep -iC3 'new york' 20010915.htm
displays three lines before and after the line containing the search string. You'll notice in this case a fair amount of HTML coming into view. Don't worry, we will remove that in Part B.
Another thing we will be using grep for is to count the number of times a word is used. We can use this technique to characterize the various interests and passions of the President. The more a word is used, we can (perhaps) assume, the more important a role it plays in the President's thinking. You can get a count of the number of hits a search uncovers by piping the output of grep into the wc (word count) utility. Try the command
grep -i 'new york' 20010915.htm | wc
to count the number of words in the output from grep. The following command counts lines
grep -i 'new york' 20010915.htm | wc -l
(that's l as in line, not the number 1!)
which is the same as counting the number of hits grep returns. Notice that you don't want to use the -C option on grep when you're counting hits, because the context lines will be included in the count and corrupt the results. You can further extend the power of grep by using the
wildcard characters (*, ?, [] ) to specify a larger group of files. Try this command
grep -i 'new york' * | wc -l
to determine the number of times President Bush mentions New York in the entire file set. Or, to gain some understanding of President Bush's feelings on abortion, you can use the command:
grep -iC3 'abortion' *
which will return some context on what the President was saying when the word was used. If you want more context, change the number after C, for example grep -iC10 'abortion' *
If you don't want to view the filename of each hit, use -h: grep -ihC10 'abortion' *
Have fun with this and feel free to explore other topics and keywords that might interest you.
Finally, experiment with translating the 20010915.htm file into text using the command
./html2text 20010915.htm
Part B Translating HTM files into TXT
For this part, we will take up the challenging task of writing a script to translate all of the .htm
files into .txt format. The tedious way to do this would be to translate each file one at a time. Instead, we're going to use various capturing and editing techniques to create a script that converts all 100+ files at once.
1) The first step in this process is to make a complete copy of the bush directory into your home
directory. You may have to refresh your memory on how to copy a directory. Feel free to use
your textbook, or the online man pages. After you copy the directory, change its access
permission so that you can write to the directory.
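A minimal sketch of this step (assuming the recursive copy and permission options covered in the text; adjust to what you've learned) might look like:
cd
cp -r /home/cst334/bush bush
chmod -R u+w bush    # make your copy writable so new files can be created in it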
2) Once you have a copy of the directory, make a listing of the contents using ls and redirect the
output into a file called htm. Then make a copy of htm called txt.
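A sketch of this step (assuming you are now working inside your copied bush directory):
cd ~/bush
ls > htm
cp htm txt
Note that the listing will also pick up entries such as html2text (and possibly htm itself); those are the extra lines you will delete in the next step.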
3) You are going to edit these two files to build a script file that accomplishes the task of translating the files into txt. Your goal is to create a file that looks like this:
./html2text 20010915.htm > 20010915.txt
./html2text 20030104.htm > 20030104.txt
./html2text 20030111.htm > 20030111.txt
./html2text 20030118.htm > 20030118.txt
./html2text 20030125.htm > 20030125.txt
…
for the entire contents of the directory. Using vi, open the htm file. This is going to be the start of all the commands you will create. First go to the bottom of the file and delete the last 2 or 3 lines (html2text, and any non-200XXXXX.htm files) because we won't be converting these files!
Your job is to formulate a substitution command in vi to replace the string 200 with the string ./html2text 200 (use \ to escape the '/' in the replacement string, like so: .\/html2text 200 ).
This will give you the first two columns of the script file, i.e.,
./html2text 20010915.htm
./html2text 20030104.htm
./html2text 20030111.htm
…
all the way down.
When you finish this step, save the file and exit vi.
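As a sketch, the substitution (typed in vi command mode, and assuming every remaining line begins with 200) could look something like:
:%s/^200/.\/html2text 200/
:wq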
4) You are going to build the second half of the command by editing the file txt. Again, visit the bottom of the file and delete the last two or three lines (the non-200XXXXX.htm entries) as above. Then use a substitution command to replace htm with txt.
Then save and exit vi.
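Again as a sketch, this substitution might be:
:%s/htm/txt/
:wq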
5) Another nifty Unix command is paste, which allows you to paste two files together in a column format. Try the command
paste htm txt
and you'll see the two files appearing side by side in the output. You are almost finished. A variation of paste allows you to set the delimiter (field separator) to '>' so you can redirect the output of every html2text command to a .txt file. This makes the command
paste -d '>' htm txt
and you should get something that looks like the goal displayed under step 3. Don't worry if there are no spaces before or after the '>' symbol; Unix will handle this just fine. What you need to do now is redirect the output of this command into a file called convert. Then change the access permission of convert so that you can execute it as well as read/write it.
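Putting that together, one possible command sequence (a sketch; the exact chmod mode is up to you) is:
paste -d '>' htm txt > convert
chmod u+rwx convert    # make the script readable, writable, and executable by you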
6) Cross your fingers. Type the command ./convert. When the command finishes, type ls and examine the contents of any .txt file. It should be cleaned of HTML formatting.
Part C Analyzing Files for Content
C1. Total Analysis
In this part we're going to do the actual analysis of the text files. Before you get started, read
the Important Notes below. You are going to analyze the presence of 10 key words in the collection of files. Five of these are required (shown below) and five are of your own choosing. Your five choices must be different from those of anyone else in the class.
Required key text phrases for the analysis
energy
iraq
social security
taxes
weapons of mass destruction
(+ 5 more UNIQUE phrases of your own choosing)
Your task is to produce a single file called bushstats that contains the number of lines in the
entire set of transcripts containing each key word. In other words, the file bushstats should
look something like this (actual counts will be different, but sample counts will be provided later to test your solution).
energy
10
iraq
13
social security
26
taxes
93
...(etc)
You can do this a number of ways. The best way is to grep each keyword separately on all the
files, pipe into wc (to count lines) then redirect the output to a file with the same name as the
keyword. (For example, grep -i 'abortion' * | wc -l > abortion counts the number of hits for the phrase 'abortion' and stores this number in a file called abortion.)
When you have all the keyword files with their line counts, you can merge them together using
head -10 *
(Using head will print each filename as a heading before the hit count in that file.) If you like your results, repeat the head command and redirect its output into your final file bushstats.
Important Notes
1) In order that you do not process htm files in your keyword analysis, make a new directory in
your home directory called bushtxt and copy all of the .txt files from bush into that directory.
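A sketch of this step (assuming your copied bush directory sits in your home directory):
cd
mkdir bushtxt
cp bush/*.txt bushtxt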
2) In order that the head -10 * command only includes your keyword files, you should create a subdirectory in the bushtxt directory called totals. Assuming you start your grep commands in the directory bushtxt, your grep commands will be modified to send the keyword hit output into the totals subdirectory, for example grep -i 'abortion' * | wc -l > totals/abortion
3) Your final bushstats file should reside in the totals subdirectory in the bushtxt directory (i.e.
/home/YOURID/bushtxt/totals).
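Pulling these notes together, a sketch of the C1 workflow for a few of the keywords might look like the following (the remaining keywords follow the same pattern; *.txt is used so the totals subdirectory itself is not searched, and output filenames containing spaces are quoted):
cd ~/bushtxt
mkdir totals
grep -i 'energy' *.txt | wc -l > totals/energy
grep -i 'iraq' *.txt | wc -l > totals/iraq
grep -i 'social security' *.txt | wc -l > 'totals/social security'
# ...and so on for the remaining keywords, then:
cd totals
head -10 *
head -10 * > bushstats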
C2. 2003/2004/2005 Analysis
Having performed the above task, you now need to repeat the whole process using more precise
wildcards to analyze the files in three distinct sets: those from 2003, those from 2004, and those
from 2005. Start by creating three new subdirectories in your bushtxt directory: totals2003, totals2004, and totals2005.
Then repeat the process described in C1 above using a modified wildcard to specify only radio
addresses given in 2003 and put the results in the totals2003 directory (your final stats for 2003
will go in file bushstats2003). Then repeat the process two more times for 2004 and 2005. The
repetition will serve two goals: 1) help you memorize the command sequence used to do the
processing, and 2) make you appreciate the need for a "meta" script to handle multiple analyses,
which we will explore in project 2.
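For instance, the 2003 pass might look like the following sketch (run from the bushtxt directory; 2004 and 2005 follow the same pattern with their own wildcards and totals directories):
mkdir totals2003 totals2004 totals2005
grep -i 'energy' 2003*.txt | wc -l > totals2003/energy
grep -i 'iraq' 2003*.txt | wc -l > totals2003/iraq
# ...repeat for the remaining keywords, then:
cd totals2003
head -10 * > bushstats2003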
Part D Turn in your Results
Email me (trebold@mpc.edu) your three summary files by the due date. To retrieve the files from mlc104 you can cat them to the screen, select them with your mouse (selected text will appear with a white background), and type Ctrl-C. Then open Microsoft Word (or another editor on your computer) and paste the text into a Word file. Then email me that Word file as an attachment.
For various reasons, this project needs to be completed on time. Late work will be substantially reduced (by 50% or more).
ACADEMIC HONESTY: Working with a partner on this project is encouraged; however, I draw the line at providing your solutions to somebody else. Please use this assignment to develop your own Unix skills using the knowledge gleaned from the textbook or your class notes. I would like you to think of this as a task assigned to you on the job, requiring minimal input from your supervisor. As a further incentive, Test 2 (after spring break) will refer heavily to the skills you develop for this project.