Common Linux Commands for Data Manipulation

Introduction

In this lecture we will see how some common Linux command line programs can be used to perform simple, though powerful, manipulation of data, for example searching, sorting and selecting data. This provides us with an effective and time-saving approach to manipulating data.

The Linux (Unix) operating system

Linux is a variant of the Unix operating system. Most likely, these days, your central computer server runs the Linux operating system, as do many of the mail, web and name servers on the internet. Linux provides, as standard, many file manipulating programs; for example, 'sort' sorts the contents of a file into alphabetical or numerical order (you therefore do not need to write your own sort program).

Data manipulation programs

We will look at the following data manipulation programs:

    cat    - concatenate files
    sort   - sort lines of text files
    uniq   - remove duplicate lines from a sorted file
    diff   - find differences between two files
    echo   - display a line of text
    sed    - a Stream EDitor
    tr     - translate or delete characters
    grep   - search for lines in a file matching a pattern
    head   - output the first part of files
    tail   - output the last part of files
    split  - split a file into pieces
    wc     - print the number of bytes, words, and lines in files
    cut    - remove sections from each line of files

For detailed information about these commands type:

    $ man 'command'
or
    $ info 'command'
and
    $ 'command' --help

Additionally, we can use "output redirection" (the '>' symbol) to redirect the output of a program to a file instead of the screen, "append" (the '>>' symbol) to add (append) output to the end of a file, and "piping" (the '|' symbol) to "pipe" the output of one program into another program. In this way we can combine the functions of more than one program and direct the final output to a file.

cat - concatenate files

The 'cat' command can be used to combine (concatenate) files into a single file.
For example, consider the following three files:

    file1.dat    file2.dat    file3.dat
    ---------    ---------    ---------
    cake         house        bed
    hat          pen          fence
    pool         fish         tool
    ten          cake         one
    tool                      comb

The command

    $ cat file1.dat file2.dat file3.dat

results in the following output:

    cake
    hat
    pool
    ten
    tool
    house
    pen
    fish
    cake
    bed
    fence
    tool
    one
    comb

or, using output redirection,

    $ cat file1.dat file2.dat file3.dat > file4.dat

the output is stored in file4.dat.

sort - sort lines of text files

The contents of file4.dat can be sorted into alphabetical order with the 'sort' command:

    $ sort file4.dat
    bed
    cake
    cake
    comb
    fence
    fish
    hat
    house
    one
    pen
    pool
    ten
    tool
    tool

or, to redirect the output to 'file5.dat':

    $ sort file4.dat > file5.dat

Note that the words 'cake' and 'tool' both appear twice in the list.

If you wish to sort a list into numerical order then use the '-n' option. For example, sorting the list in file list.dat:

    12
    3
    -7
    14

    $ sort list.dat        (sort into alphabetical order)
    -7
    12
    14
    3

    $ sort -n list.dat     (sort in numerical order)
    -7
    3
    12
    14

uniq - remove duplicate lines from a sorted file

We can now remove duplicate lines from the list:

    $ uniq file5.dat
    bed
    cake
    comb
    fence
    fish
    hat
    house
    one
    pen
    pool
    ten
    tool

or

    $ uniq file5.dat > file6.dat

Now the words 'cake' and 'tool' only appear once in the list. Note that 'uniq' only removes duplicate lines that are adjacent, which is why the file is sorted first.

We can combine two or more commands by using the "pipe" symbol '|'; for example

    $ sort file4.dat | uniq > file6.dat

This compound command sorts file4.dat and feeds the output into the command 'uniq', which then outputs to file6.dat; it is equivalent to

    $ sort file4.dat > file5.dat
    $ uniq file5.dat > file6.dat

except that file5.dat is not created when piping.

diff - find differences between two files

We can look at the difference between file5.dat and file6.dat as follows:

    $ diff file5.dat file6.dat
    3d2
    < cake
    13d11
    < tool

The output indicates that the files differ by the words 'cake' and 'tool' (these words occur twice in file5.dat but only once in file6.dat).

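The cat, sort, uniq and pipe steps above can be tried out with a short script. This is only a sketch under a few assumptions: it works in a scratch directory created with 'mktemp -d' so that no existing files are overwritten, and it uses 'sort -u', 'uniq -d' and 'cmp -s', which are standard options but are not covered in this lecture.

```shell
# Work in a scratch directory (an assumption of this sketch, not part of the lecture).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the three example files, one word per line.
printf '%s\n' cake hat pool ten tool > file1.dat
printf '%s\n' house pen fish cake > file2.dat
printf '%s\n' bed fence tool one comb > file3.dat

# Concatenate, sort, and drop duplicates in a single pipeline.
cat file1.dat file2.dat file3.dat | sort | uniq > file6.dat

# 'sort -u' sorts and removes duplicate lines in one step.
cat file1.dat file2.dat file3.dat | sort -u > file6_alt.dat

# 'cmp -s' prints nothing and succeeds only if the two files are identical.
if cmp -s file6.dat file6_alt.dat; then
    echo "sort | uniq and sort -u agree"
fi

# 'uniq -d' prints only the lines that were repeated in the sorted input.
duplicates=$(cat file1.dat file2.dat file3.dat | sort | uniq -d)
echo "$duplicates"
```

The last command should list the two repeated words, 'cake' and 'tool', which are the same words reported by 'diff' above.
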
echo - display a line of text

If we wish to add a word to file6.dat we can use the echo command as follows:

    $ echo dog >> file6.dat

Note the '>>' symbol (append); a single '>' would overwrite the file with the single word 'dog'! The new word is added at the end of the file, so we sort the file again (output to file7.dat):

    $ sort file6.dat > file7.dat

We can view the new file with the cat command:

    $ cat file7.dat
    bed
    cake
    comb
    dog
    fence
    fish
    hat
    house
    one
    pen
    pool
    ten
    tool

The word 'dog' is now included (between 'comb' and 'fence').

sed - a Stream EDitor

The command 'sed' provides basic text transformations. Sed has many options; here we will use sed as a utility to replace one word with another. The command for this is:

    $ sed "s/old_word/new_word/g" filename

For example, we will replace, in file7.dat, the word 'house' with the word 'home':

    $ sed "s/house/home/g" file7.dat
    bed
    cake
    comb
    dog
    fence
    fish
    hat
    home
    one
    pen
    pool
    ten
    tool

or

    $ sed "s/house/home/g" file7.dat > file8.dat

to redirect the output to file8.dat. The word 'house' has been replaced with the word 'home'; we can check the difference between the files as follows:

    $ diff file7.dat file8.dat
    8c8
    < house
    ---
    > home

tr - translate or delete characters

The command 'tr' can be used to translate characters to other characters. For example, if we wish to translate all lowercase characters to upper case we issue the command:

    $ cat file8.dat | tr a-z A-Z
    BED
    CAKE
    COMB
    DOG
    FENCE
    FISH
    HAT
    HOME
    ONE
    PEN
    POOL
    TEN
    TOOL

Note that the 'tr' command does not take a file name; it only reads from its standard input, hence we first 'cat' the file and then pipe the output to the 'tr' program. The translated output can be stored in file9.dat as follows:

    $ cat file8.dat | tr a-z A-Z > file9.dat

grep - search for lines in a file matching a pattern

We can search for words in a file using the 'grep' command. For example, does the word 'home' exist in file9.dat?

    $ grep home file9.dat

No output is given (no search matches).

    $ grep HOME file9.dat
    HOME

The output tells us that the word 'HOME' exists in the file ('grep' is case-sensitive, like most Linux commands). Case-sensitivity can be switched off with the '-i' option:

    $ grep -i home file9.dat
    HOME

And the option '-n' gives the line number of the matched word:

    $ grep -i -n home file9.dat
    8:HOME

head - output the first part of files

The command 'head' can be used to output the first "n" lines of a file; for example, we know that 'HOME' is the 8th line in file9.dat, so we can output the first 8 lines as follows:

    $ head -n8 file9.dat
    BED
    CAKE
    COMB
    DOG
    FENCE
    FISH
    HAT
    HOME

tail - output the last part of files

Similarly, the command 'tail' can be used to output the last "n" lines of a file; for example, the five remaining lines after HOME can be output with:

    $ tail -n5 file9.dat
    ONE
    PEN
    POOL
    TEN
    TOOL

We can use 'head' and 'tail' together to output m lines starting from line n: 'head' keeps the first n+m-1 lines and 'tail' then keeps the last m of them. For example, to list the 8th, 9th and 10th lines in file9.dat:

    $ head -n10 file9.dat | tail -n3
    HOME
    ONE
    PEN

or to output the 8th line only:

    $ head -n8 file9.dat | tail -n1
    HOME

split - split a file into pieces

The 'split' command can be used to split a file into smaller files. With the option '-l8' each piece contains at most 8 lines, so the 13 lines of file9.dat are split into two files. By default the new files are called 'xaa' and 'xab'. For example:

    $ split -l8 file9.dat

    $ cat xaa
    BED
    CAKE
    COMB
    DOG
    FENCE
    FISH
    HAT
    HOME

    $ cat xab
    ONE
    PEN
    POOL
    TEN
    TOOL

wc - print the number of bytes, words, and lines in files

The command 'wc' outputs the number of lines, words, and bytes in a file, in that order. For example:

    $ wc file9.dat
    13 13 60 file9.dat

    $ wc xaa
    8 8 38 xaa

    $ wc xab
    5 5 22 xab

So file9.dat has 13 lines, 13 words, and 60 characters (bytes, including the newline at the end of each line); xaa has 8 lines and xab has 5 lines.

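The sed, tr, grep, head, tail and wc steps can be chained in one script. The following is a minimal sketch, assuming a scratch directory created with 'mktemp -d' and a file7.dat recreated with 'printf'; the 'sed -n' range form, 'cut -d: -f1' and 'wc -l' are standard options that are not covered in the text above and are used here only for illustration.

```shell
# Work in a scratch directory (an assumption of this sketch, not part of the lecture).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate file7.dat from the text, one word per line.
printf '%s\n' bed cake comb dog fence fish hat house one pen pool ten tool > file7.dat

# Replace 'house' with 'home' and convert to upper case in one pipeline.
sed "s/house/home/g" file7.dat | tr 'a-z' 'A-Z' > file9.dat

# grep -n prints "8:HOME"; cut keeps the part before the colon (the line number).
line=$(grep -i -n home file9.dat | cut -d: -f1)

# Lines 8 to 10 using head and tail, as in the text.
by_head_tail=$(head -n 10 file9.dat | tail -n 3)

# The same range with sed: -n suppresses normal output, '8,10p' prints lines 8 to 10.
by_sed=$(sed -n '8,10p' file9.dat)

# wc -l counts lines only; reading from standard input omits the file name.
total=$(wc -l < file9.dat)

echo "HOME is on line $line of $((total))"
echo "$by_head_tail"
```

Both 'by_head_tail' and 'by_sed' should hold the same three lines; which form to use is mostly a matter of taste, although the sed form does not need the n+m-1 arithmetic.
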
cut - remove sections from each line of files

The command 'cut' can be used to select a column from a file; for example, the first three characters from each line of file9.dat are selected as follows:

    $ cut -b1-3 file9.dat
    BED
    CAK
    COM
    DOG
    FEN
    FIS
    HAT
    HOM
    ONE
    PEN
    POO
    TEN
    TOO

or characters 2 to 4:

    $ cut -b2-4 file9.dat
    ED
    AKE
    OMB
    OG
    ENC
    ISH
    AT
    OME
    NE
    EN
    OOL
    EN
    OOL

Command options

I have shown only the most basic functions of the above commands; many more operations can be obtained with their options.
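
As an illustration of what the options make possible, the sketch below combines 'cut' with 'sort' and 'uniq' to count how many words start with each letter. It assumes a scratch directory and a recreated file9.dat; 'uniq -c' and 'sort -u' are standard options not described in this lecture.

```shell
# Work in a scratch directory (an assumption of this sketch, not part of the lecture).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate file9.dat from the text, one word per line.
printf '%s\n' BED CAKE COMB DOG FENCE FISH HAT HOME ONE PEN POOL TEN TOOL > file9.dat

# Take the first character of every line, sort the letters, then count them.
# 'uniq -c' prefixes each distinct line with the number of times it occurred.
cut -b1 file9.dat | sort | uniq -c

# Number of distinct first letters: 'sort -u' removes duplicates, 'wc -l' counts lines.
distinct=$(cut -b1 file9.dat | sort -u | wc -l)
echo "distinct first letters: $((distinct))"
```

The same pattern (select a column, sort it, count the duplicates) is a common way to build a quick frequency table from a data file.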