Common Linux Commands for Data Manipulation

Introduction
In this lecture we will see how some common Linux command line programs can be used
to perform simple, though powerful, manipulation of data, for example, searching, sorting
and selecting data. This provides us with an effective and time-saving approach to
manipulating data.
The Linux (Unix) operating system
Linux is a variant of the Unix operating system. Most likely, these days, your central
computer server runs the Linux operating system, as do many of the mail, web and
name servers on the internet. Linux provides, as standard, many file manipulation
programs; for example, 'sort' sorts the contents of a file into alphabetical or numerical
order (so you do not need to write your own sort program).
Data manipulation programs
We will look at the following data manipulation programs:
cat    -  concatenate files
sort   -  sort lines of text files
uniq   -  remove duplicate lines from a sorted file
diff   -  find differences between two files
echo   -  display a line of text
sed    -  a Stream EDitor
tr     -  translate or delete characters
grep   -  search for lines in a file matching a pattern
head   -  output the first part of files
tail   -  output the last part of files
split  -  split a file into pieces
wc     -  print the number of bytes, words, and lines in files
cut    -  remove sections from each line of files
For detailed information about these commands type:
$ man 'command'
or
$ info 'command'
and
$ 'command' --help
Additionally, we can use "output redirection" (the '>' symbol) to redirect the output of a
program to a file instead of the screen, "append" (the '>>' symbol) to add (append) to files,
and "piping" (the '|' symbol) to "pipe" the output of one program into another program. In
this way we can combine the functions of more than one program and direct the final
output to a file.
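The three operators can be tried out safely in a scratch directory. The following is only a minimal sketch: the file names words.txt and unique.txt and the words used are illustrative and are not part of the lecture files.

```shell
# Work in a temporary directory so no existing files are touched.
cd "$(mktemp -d)"

echo pear > words.txt      # '>' creates (or overwrites) words.txt
echo apple >> words.txt    # '>>' appends a second line
echo pear >> words.txt     # append a duplicate line

# '|' feeds the output of sort into uniq; '>' stores the final result.
sort words.txt | uniq > unique.txt
cat unique.txt
```

Here words.txt ends up with three lines, while unique.txt holds the two distinct words in sorted order.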
cat - concatenate files
The 'cat' command can be used to combine (concatenate) files into a
single file. For example consider the following three files:
file1.dat
--------
cake
hat
pool
ten
tool

file2.dat
--------
house
pen
fish
cake

file3.dat
--------
bed
fence
tool
one
comb
The command
$ cat file1.dat file2.dat file3.dat
results in the following output:
cake
hat
pool
ten
tool
house
pen
fish
cake
bed
fence
tool
one
comb
or using output redirection
$ cat file1.dat file2.dat file3.dat > file4.dat
the output is stored in file4.dat
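To follow the examples on your own machine, the three input files can be recreated first. This is a sketch that assumes a POSIX shell and uses 'printf' (not one of the commands covered in this lecture) to write one word per line; it works in a temporary directory so nothing existing is overwritten.

```shell
cd "$(mktemp -d)"

# Recreate the three example files, one word per line.
printf '%s\n' cake hat pool ten tool > file1.dat
printf '%s\n' house pen fish cake > file2.dat
printf '%s\n' bed fence tool one comb > file3.dat

# Concatenate them in the order given on the command line.
cat file1.dat file2.dat file3.dat > file4.dat

# Count the lines of the combined file (5 + 4 + 5 lines).
wc -l < file4.dat
```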
sort - sort lines of text files
The contents of file4.dat can be sorted into alphabetical order
with the 'sort' command:
$ sort file4.dat
bed
cake
cake
comb
fence
fish
hat
house
one
pen
pool
ten
tool
tool
or to redirect the output to 'file5.dat':
$ sort file4.dat > file5.dat
Note that the words 'cake' and 'tool' both appear twice in the list.
If you wish to sort a list into numerical order then use
the '-n' option. For example sorting the list in file list.dat:
12
3
-7
14
$ sort list.dat        (sort into alphabetical order)
-7
12
14
3
$ sort -n list.dat     (sort in numerical order)
-7
3
12
14
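The order produced by plain 'sort' depends on the locale settings of the system, so the alphabetical result may differ on some machines. The sketch below, which assumes a POSIX shell, forces the byte-wise 'C' locale for the alphabetical sort and also shows the '-r' (reverse) option; the output file names are illustrative.

```shell
cd "$(mktemp -d)"
printf '%s\n' 12 3 -7 14 > list.dat

# Byte-wise comparison: '-' sorts before the digits, and "12" before "3".
LC_ALL=C sort list.dat > alpha.txt

# Numerical order, smallest first.
sort -n list.dat > numeric.txt

# Numerical order reversed (largest first) with '-r'.
sort -n -r list.dat > reversed.txt

cat alpha.txt numeric.txt reversed.txt
```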
uniq - remove duplicate lines from a sorted file
We can now remove duplicate lines from the list:
$ uniq file5.dat
bed
cake
comb
fence
fish
hat
house
one
pen
pool
ten
tool
or
$ uniq file5.dat > file6.dat
Now words 'cake' and 'tool' only appear once in the list.
We can combine two or more commands by using the "pipe" symbol '|';
for example
$ sort file4.dat | uniq > file6.dat
this compound command sorts file4.dat and feeds the output into
the command 'uniq' which then outputs to file6.dat; it is equivalent to
$ sort file4.dat > file5.dat
$ uniq file5.dat > file6.dat
except file5.dat is not created when piping.
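Two related variations may be useful here. In this sketch, which uses a small illustrative word list (words.txt is a made-up file name), 'sort -u' sorts and removes duplicates in one step, and 'uniq -c' prefixes each line with the number of times it occurred. For simple input like this the result of 'sort -u' matches that of 'sort | uniq'.

```shell
cd "$(mktemp -d)"
printf '%s\n' tool cake bed cake tool cake > words.txt

# For this input 'sort -u' gives the same result as 'sort | uniq'.
sort words.txt | uniq > piped.txt
sort -u words.txt > direct.txt
cmp piped.txt direct.txt && echo "identical"

# 'uniq -c' counts how often each line occurs in the sorted input.
sort words.txt | uniq -c
```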
diff - find differences between two files
We can look at the difference between file5.dat and file6.dat as
follows:
$ diff file5.dat file6.dat
3d2
< cake
13d11
< tool
The output indicates that the files differ by the words 'cake' and 'tool'
(these words occur twice in file5.dat but only once in file6.dat).
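In scripts it is often enough to know whether two files differ at all. A minimal sketch, using two small illustrative files (with_dup.txt and no_dup.txt are made-up names): 'diff' exits with status 0 when the files are identical and 1 when they differ, and the '-u' option prints the differences in "unified" format.

```shell
cd "$(mktemp -d)"
printf '%s\n' bed cake cake comb > with_dup.txt
printf '%s\n' bed cake comb > no_dup.txt

# The exit status of diff can be tested directly in an 'if'.
if diff with_dup.txt no_dup.txt > /dev/null; then
    echo "files are identical"
else
    echo "files differ"
fi

# '-u' shows removed lines with '-' and added lines with '+'.
# ('|| true' only stops the non-zero status from ending a strict script.)
diff -u with_dup.txt no_dup.txt || true
```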
echo - display a line of text
If we wish to add a word to file6.dat we can use the echo command as
follows:
$ echo dog >> file6.dat
Note the '>>' symbol (append); a single '>' would overwrite the file with the
single word 'dog'!
We then sort the file again (output to file7.dat):
$ sort file6.dat > file7.dat
we can view the new file with the cat command:
$ cat file7.dat
bed
cake
comb
dog
fence
fish
hat
house
one
pen
pool
ten
tool
the word 'dog' is now included (between 'comb' and 'fence').
sed - a Stream EDitor
The command 'sed' provides basic text transformations. Sed has many
options; here we will use it only as a utility to replace one word with
another. The command for this is:
$ sed "s/old_word/new_word/g" filename
For example we will replace, in file7.dat, the word 'house'
with the word 'home':
$ sed "s/house/home/g" file7.dat
bed
cake
comb
dog
fence
fish
hat
home
one
pen
pool
ten
tool
or
$ sed "s/house/home/g" file7.dat > file8.dat
to redirect the output to file8.dat.
The word 'house' has been replaced with the word 'home';
we can check the difference between the files as follows:
$ diff file7.dat file8.dat
8c8
< house
---
> home
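Because the example files hold one word per line, the effect of the trailing 'g' (global) flag is not visible there. The short sketch below uses made-up input lines to show what the flag does, and that the pattern also matches inside longer words.

```shell
# Without 'g' only the first match on each line is replaced.
echo "house boat house" | sed "s/house/home/"     # home boat house

# With 'g' every match on the line is replaced.
echo "house boat house" | sed "s/house/home/g"    # home boat home

# The pattern is matched as text, so it is also found inside longer words.
echo "greenhouse" | sed "s/house/home/g"          # greenhome
```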
tr - translate or delete characters
The command 'tr' can be used to translate characters to other
characters.
For example, if we wish to translate all lowercase characters to
uppercase we issue the command:
$ cat file8.dat | tr a-z A-Z
BED
CAKE
COMB
DOG
FENCE
FISH
HAT
HOME
ONE
PEN
POOL
TEN
TOOL
Note that the 'tr' command does not take a file name as an argument; it
reads its input from standard input, hence we first 'cat' the file and
then pipe the output to the 'tr' program.
The translated output can be stored in file9.dat as follows:
$ cat file8.dat | tr a-z A-Z > file9.dat
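As an alternative to 'cat', the shell's input redirection (the '<' symbol) can supply a file on standard input. The sketch below shows this, together with the '-d' option of 'tr', which deletes characters instead of translating them. The file names and sample text are illustrative only.

```shell
cd "$(mktemp -d)"
printf '%s\n' bed cake comb > words.dat

# '<' feeds words.dat to tr on standard input, so 'cat' is not needed.
tr 'a-z' 'A-Z' < words.dat > upper.dat
cat upper.dat

# '-d' deletes every occurrence of the listed characters.
echo "c.a.k.e" | tr -d '.'
```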
grep - search for lines in a file matching a pattern
We can search for words in a file using the 'grep' command.
For example does the word 'home' exist in file9.dat?
$ grep home file9.dat
no output is given (no search matches).
$ grep HOME file9.dat
HOME
the output tells us that the word 'HOME' exists in the file ('grep' is
case-sensitive like most Linux commands).
Case-sensitivity can be switched off with the '-i' option:
$ grep -i home file9.dat
HOME
And the option '-n' gives the line number of the matched word.
$ grep -i -n home file9.dat
8:HOME
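A few further 'grep' options are sketched below on a small illustrative file (words.dat is a made-up name): '-c' counts matching lines, '-v' inverts the match, and '-q' suppresses output so that only the exit status is used.

```shell
cd "$(mktemp -d)"
printf '%s\n' BED CAKE HOME POOL TOOL > words.dat

# '-c' prints the number of matching lines instead of the lines themselves.
grep -c OOL words.dat          # POOL and TOOL match

# '-v' inverts the search and prints the lines that do NOT match.
grep -v OOL words.dat

# '-q' prints nothing; the exit status is 0 when at least one line matches.
if grep -q -i home words.dat; then
    echo "found"
fi
```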
head - output the first part of files
The command 'head' can be used to output the first "n" lines of a file;
for example we know that 'HOME' is the 8th line in file9.dat so we can
output the first 8 lines as follows:
$ head -n8 file9.dat
BED
CAKE
COMB
DOG
FENCE
FISH
HAT
HOME
tail - output the last part of files
Similarly the command 'tail' can be used to output the last "n" lines
of a file; for example the five remaining lines after HOME can be output
with:
$ tail -n5 file9.dat
ONE
PEN
POOL
TEN
TOOL
We can use 'head' and 'tail' together to output m lines starting from
line n. For example to list the 8th, 9th and 10th line in file9.dat:
$ head -n10 file9.dat | tail -n3
HOME
ONE
PEN
or to output the 8th line only:
$ head -n8 file9.dat | tail -n1
HOME
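The general pattern is that 'head' must keep everything up to the last wanted line (n + m - 1), and 'tail' then keeps the final m lines of that. A minimal sketch wrapping this in a shell function follows; the name print_lines is hypothetical, not a standard command, and the function assumes the file has at least n + m - 1 lines.

```shell
# print_lines FILE N M : print M lines of FILE starting at line N.
# (Hypothetical helper; assumes FILE has at least N + M - 1 lines.)
print_lines() {
    file=$1; n=$2; m=$3
    last=$((n + m - 1))              # number of the last line wanted
    head -n "$last" "$file" | tail -n "$m"
}

cd "$(mktemp -d)"
printf '%s\n' BED CAKE COMB DOG FENCE FISH HAT HOME ONE PEN POOL TEN TOOL > file9.dat

print_lines file9.dat 8 3            # lines 8 to 10
```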
split - split a file into pieces
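The 'split' command can be used to split a file into pieces of a given size.
With the option '-l8' each piece holds at most 8 lines, so the 13-line
file9.dat is split into two files. By default the new files are called
'xaa' and 'xab'. For example: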
$ split -l8 file9.dat
$ cat xaa
BED
CAKE
COMB
DOG
FENCE
FISH
HAT
HOME
$ cat xab
ONE
PEN
POOL
TEN
TOOL
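The names of the pieces can be changed by giving a prefix after the input file name, and the pieces can be joined again with 'cat'. A sketch, assuming the prefix part_ (an arbitrary choice) and pieces of 5 lines each:

```shell
cd "$(mktemp -d)"
printf '%s\n' BED CAKE COMB DOG FENCE FISH HAT HOME ONE PEN POOL TEN TOOL > file9.dat

# 5 lines per piece; the 13 lines give part_aa, part_ab and part_ac.
split -l 5 file9.dat part_
ls part_*

# Concatenating the pieces in name order restores the original file.
cat part_* > rejoined.dat
cmp file9.dat rejoined.dat && echo "rejoined file matches the original"
```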
wc - print the number of bytes, words, and lines in files
The command 'wc' outputs the number of lines, words, and bytes
in a file (in that order). For example:
$ wc file9.dat
13 13 60 file9.dat
$ wc xaa
 8  8 38 xaa
$ wc xab
 5  5 22 xab
so file9.dat has 13 lines, 13 words, and 60 characters,
xaa has 8 lines and xab has 5 lines.
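Each count can also be requested on its own with '-l' (lines), '-w' (words) or '-c' (bytes). The sketch below recreates a file with the same contents as xab in a scratch directory and stores the line count in a shell variable; the variable name lines is illustrative.

```shell
cd "$(mktemp -d)"
printf '%s\n' ONE PEN POOL TEN TOOL > xab

wc -l xab        # lines only
wc -w xab        # words only
wc -c xab        # bytes only

# Reading from standard input omits the file name, which is handy in scripts.
# (Some systems pad the number with spaces, so they are stripped with tr.)
lines=$(wc -l < xab | tr -d ' ')
echo "xab has $lines lines"
```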
cut - remove sections from each line of files
The command 'cut' can be used to select a column from a file;
for example the first three characters from each line of file9.dat
are selected as follows:
$ cut -b1-3 file9.dat
BED
CAK
COM
DOG
FEN
FIS
HAT
HOM
ONE
PEN
POO
TEN
TOO
or characters 2 to 4:
$ cut -b2-4 file9.dat
ED
AKE
OMB
OG
ENC
ISH
AT
OME
NE
EN
OOL
EN
OOL
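Besides character positions, 'cut' can select whole fields when the columns are separated by a delimiter. A minimal sketch with made-up comma-separated data (items.csv and its contents are illustrative only):

```shell
cd "$(mktemp -d)"
# Illustrative data: name,colour,quantity
printf '%s\n' "cake,brown,3" "hat,red,12" "pool,blue,1" > items.csv

# '-d' sets the field delimiter and '-f' selects fields by number.
cut -d, -f1 items.csv        # first column only
cut -d, -f1,3 items.csv      # first and third columns, still comma-separated
```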
Command options
I have shown only the most basic functions of the above commands; many more operations
are available through their options (see the 'man' page of each command).