CST334 Project 1: Unix Media Analysis
Due at 11:30 PM, March 15
Worth 25 Points
Late projects will be discounted 50%!

The power of the media to influence public perception is well known but difficult to characterize, since most of the information about our world comes through media sources holding vested interests in certain institutions or policies. In addition, the most powerful media are audio-visual, ephemeral, and rapidly changing, like television and radio, which are difficult to track for content and subtext. However, some media sources are archived as text files or transcripts and are available on the web.

In this project, we are going to apply our Unix tool set to analyze a collection of weekly radio addresses from President Bush spanning the period from 2003 to the present. In the /home/cst334/bush directory you'll find a set of more than 100 files time-coded by date, as shown in the following excerpt:

    20010915.htm  20030607.htm  20031115.htm  20040424.htm  20041002.htm
    20030104.htm  20030614.htm  20031122.htm  20040501.htm  20041009.htm
    ...

The date of the address forms the file name. For example, 20040501.htm represents the radio address given on May 1, 2004. Note there is also a file from September 15, 2001, kept for historical purposes. As indicated by the .htm suffix, the files are coded in HTML format. Although this is not really a problem, we will spend a part of this project translating the files into plain text. To do this we will use the program html2text, which appears at the bottom of the directory listing. Your primary mission will be to compile statistics on the use of certain words in Bush's radio addresses and examine how these may be changing over time.

Part A  Getting Familiar with the Content

Visit the bush directory and spend some time getting familiar with the contents. Use more to examine the 20010915.htm file and read what President Bush said four days after the 9/11 attack. Notice the HTML formatting, which is distracting but does not prevent reading the content.

One tool that we will use more than any other is grep, which searches files for different string patterns. You can enter the command

    grep 'terrorist' 20010915.htm

to see the lines in this file containing the word terrorist. The command

    grep 'new york' 20010915.htm

will look for the phrase 'new york' in the same file. However, since grep matches patterns based on letter case, there will be no hits. You can ignore case using the -i option. In other words,

    grep -i 'new york' 20010915.htm

will search for lowercase and uppercase versions of the search string.

The grep command returns the entire line for every occurrence of the search string. If you would like more context on what grep finds, you can use the -C option. The command

    grep -iC3 'new york' 20010915.htm

displays three lines before and after each line containing the search string. You'll notice in this case a fair amount of HTML coming into view. Don't worry, we will remove that in Part B.

Another thing we will use grep for is counting the number of times a word occurs. We can use this technique to characterize the various interests and passions of the President. The more a word is used, we can (perhaps) assume, the more important a role it plays in the President's thinking. You can get a count of the number of hits a search uncovers by piping the output of grep into the wc (word count) utility. Try the command

    grep -i 'new york' 20010915.htm | wc

to count the number of words in the output from grep.
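A quick aside on reading wc's output, since the next command relies on picking out just one of its counts. This is only a sketch; the actual numbers depend on the file you search:

    grep -i 'new york' 20010915.htm | wc
    # wc with no options prints three counts on a single line:
    #     lines   words   characters
    # The options -l, -w, and -c select just the line, word, or character count.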
The following command counts lines:

    grep -i 'new york' 20010915.htm | wc -l

(that's l as in line, not the number 1!), which is the same as counting the number of hits grep returns. Notice that you don't want to use the -C option on grep when you're counting hits, because the context lines will be included in the count and corrupt the results.

You can further extend the power of grep by using the wildcard characters (*, ?, []) to specify a larger group of files. Try this command

    grep -i 'new york' * | wc -l

to determine the number of times President Bush mentions New York in the entire file set. Or, to gain some understanding of President Bush's feelings on abortion, you can use the command

    grep -iC3 'abortion' *

which will return some context on what the President was saying when the word was used. If you want more context, change the number after C, for example

    grep -iC10 'abortion' *

If you don't want to view the filename of each hit, use -h:

    grep -ihC10 'abortion' *

Have fun with this and feel free to explore other topics and keywords that might interest you. Finally, experiment with translating the 20010915.htm file into text using the command

    ./html2text 20010915.htm

Part B  Translating HTM files into TXT

For this part, we will take up the challenging task of writing a script to translate all of the .htm files into .txt format. The tedious way to do this would be to translate each file one at a time. Instead, we're going to use various capturing and editing techniques to create a script that handles all 100+ files at once.

1) The first step in this process is to make a complete copy of the bush directory into your home directory. You may have to refresh your memory on how to copy a directory; feel free to use your textbook or the online man pages. After you copy the directory, change its access permissions so that you can write to the directory.

2) Once you have a copy of the directory, make a listing of its contents using ls and redirect the output into a file called htm. Then make a copy of htm called txt.

3) You are going to edit these two files to build a script file that translates the files into txt. Your goal is to create a file that looks like this:

    ./html2text 20010915.htm > 20010915.txt
    ./html2text 20030104.htm > 20030104.txt
    ./html2text 20030111.htm > 20030111.txt
    ./html2text 20030118.htm > 20030118.txt
    ./html2text 20030125.htm > 20030125.txt
    ...

for the entire contents of the directory. Using vi, open the htm file. This is going to be the start of all the commands you will create. First go to the bottom of the file and delete the last two or three lines (html2text, and any files that are not date-coded .htm files), because we won't be converting these files! Your job is to formulate a substitution command in vi to replace the string 200 with the string ./html2text 200 (use \ to escape the '/' in the replacement string, like so: .\/html2text 200). This will give you the first two columns of the script file, i.e.

    ./html2text 20010915.htm
    ./html2text 20030104.htm
    ./html2text 20030111.htm
    ...

all the way down. When you finish this step, save the file and exit vi.

4) You are going to build the second half of each command by editing the file txt. Again, visit the bottom of the file and delete the last two or three lines (the files that are not date-coded .htm files) as above. Then use a substitution command to replace htm with txt. Then save and exit vi.
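If the substitution commands in steps 3 and 4 give you trouble, here is one possible form they could take. This is only a sketch; other equivalent patterns (for example, without the ^ anchor) work just as well:

    In the htm file:   :%s/^200/.\/html2text 200/
    In the txt file:   :%s/htm/txt/

The first command prefixes every date-coded line with ./html2text (the ^ anchors the match to the start of the line, and the backslash escapes the / in the replacement); the second changes the htm in each name to txt.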
5) Another nifty Unix command is paste, which pastes two files together in column format. Try the command

    paste htm txt

and you'll see the two files appearing side by side in the output. You are almost finished. A variation of paste allows you to set the delimiter (field separator) to '>', so that the output of every html2text command is redirected to a .txt file. This makes the command

    paste -d '>' htm txt

You should get something that looks like the goal displayed under step 3. Don't worry that there are no spaces before or after the '>' symbol; Unix will handle this just fine. What you need to do now is redirect the output of this command into a file called convert. Then change the access permissions of convert so that you can execute it as well as read/write it.

6) Cross your fingers. Type the command

    ./convert

When the command finishes, type ls and examine the contents of any .txt file. It should be cleaned of HTML formatting.

Part C  Analyzing Files for Content

C1. Total Analysis

In this part we're going to do the actual analysis of the text files. Before you get started, read the Important Notes below. You are going to analyze the presence of 10 key words in the collection of files. Five of these are required (shown below) and five are of your own choosing; your five must not duplicate those chosen by anyone else in the class.

Required key text phrases for the analysis:

    energy
    iraq
    social security
    taxes
    weapons of mass destruction
    (+ 5 more UNIQUE phrases of your own choosing)

Your task is to produce a single file called bushstats that contains the number of lines in the entire set of transcripts containing each key word. In other words, the file bushstats should look something like this (actual counts will be different, but sample counts will be provided later so you can test your solution):

    energy 10
    iraq 13
    social security 26
    taxes 93
    ...(etc)

You can do this a number of ways. The best way is to grep each keyword separately on all the files, pipe into wc (to count lines), then redirect the output to a file with the same name as the keyword. For example,

    grep -i 'abortion' * | wc -l > abortion

counts the number of hits for the phrase 'abortion' and stores this number in a file called abortion. When you have all the keyword files with their line counts, you can merge them together using

    head -10 *

(Using head will print each filename as a heading before the hit count in that file.) If you like your results, repeat the head command and redirect the output into your final file, bushstats. A sketch of the whole sequence appears after the Important Notes.

Important Notes

1) So that you do not process .htm files in your keyword analysis, make a new directory in your home directory called bushtxt and copy all of the .txt files from bush into that directory.

2) So that the head -10 * command only includes your keyword files, you should create a subdirectory of bushtxt called totals. Assuming you start your grep commands in the directory bushtxt, your grep commands will be modified to send the keyword hit output into the totals subdirectory, for example

    grep -i 'abortion' * | wc -l > totals/abortion

3) Your final bushstats file should reside in the totals subdirectory of the bushtxt directory (i.e. /home/YOURID/bushtxt/totals).
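To pull the pieces of C1 together, here is a minimal sketch of what the full sequence might look like, assuming you are working inside your bushtxt directory and have already created the totals subdirectory. It uses *.txt rather than * only to be explicit about which files are searched; substitute your own five unique phrases for the placeholder comment:

    cd ~/bushtxt
    mkdir totals                      # only needed the first time

    grep -i 'energy' *.txt | wc -l > totals/energy
    grep -i 'iraq' *.txt | wc -l > totals/iraq
    grep -i 'social security' *.txt | wc -l > 'totals/social security'
    grep -i 'taxes' *.txt | wc -l > totals/taxes
    grep -i 'weapons of mass destruction' *.txt | wc -l > 'totals/weapons of mass destruction'
    # ...repeat for your five unique phrases...

    cd totals
    head -10 *                        # check the results on screen first
    head -10 * > bushstats            # then capture them in bushstats

For the year-by-year analysis in C2 below, the same pattern applies with a narrower wildcard (for example 2003*.txt) and with results going to totals2003, totals2004, and totals2005.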
C2. 2003/2004/2005 Analysis

Having performed the above task, you now need to repeat the whole process using more precise wildcards to analyze the files in three distinct sets: those from 2003, those from 2004, and those from 2005. Start by creating three new subdirectories in your bushtxt directory: totals2003, totals2004, and totals2005. Then repeat the process described in C1 above using a modified wildcard to specify only radio addresses given in 2003, and put the results in the totals2003 directory (your final stats for 2003 will go in the file bushstats2003). Then repeat the process two more times for 2004 and 2005. The repetition serves two goals: 1) it helps you memorize the command sequence used to do the processing, and 2) it makes you appreciate the need for a "meta" script to handle multiple analyses, which we will explore in Project 2.

Part D  Turn in your Results

Email me (trebold@mpc.edu) your three summary files by the due date. To retrieve the files from mlc104 you can cat them on the screen, select them with your mouse (selected text will appear with a white background), and type Ctrl-C. Then open Microsoft Word (or another editor on your computer) and paste the text into a Word file. Then email me that Word file as an attachment. For various reasons, this project needs to be completed on time. Late work will be substantially reduced (by 50% or more).

ACADEMIC HONESTY: Working with a partner on this project is encouraged; however, I draw the line at providing your solutions to somebody else. Please use this assignment to develop your own Unix skills using the knowledge gleaned from the textbook or your class notes. I would like you to think of this as a task assigned to you on the job, requiring minimal input from your supervisor. As a further incentive, Test 2 (after spring break) will refer heavily to the skills you develop for this project.