Tools for Data Analysis CS910: Foundations of Data Analytics

CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk
Tools for Data Analysis
Objectives
 Introduce a broad collection of tools used in data analytics
 Outline the capabilities and uses of each tool
– Provide examples of tool usage
 Allow you to select the appropriate tools to work with
– Based on your preferences, e.g. GUI or command line
 Very quick introduction to each tool
– More information on the web, in the library
The philosophy of tool choice
 A huge array of tools is available, with overlapping functionality
 A good data analyst knows a good selection of tools
– Can pick the right one for the job
 Many people know only one or two (often unsuitable) tools
– Hence, much of the world’s analytics is performed in spreadsheets
 Knowing that a tool exists and what it can do is often enough
– Can decide if learning to use it is time-effective
– Will introduce some options here
– Will not give formal training
 No single answer to tool selection
– Often a matter of personal choice
Unix Tools
 “Unix tools” covers many simple tools developed as part of the Unix operating system
– They manipulate data files represented as lines of text: flat files, comma separated value (CSV) files
– Allow simple analysis and data preparation
– Widely available in Linux, MacOS, Windows
 “I use all these nearly every day. The best part is, once you know they exist, these tools are available on every unix machine you will ever use. Nothing else (except maybe perl) is as universal – you don’t have to worry about versions or anything. Being comfortable with these tools means you can get work done anywhere – any EC2 instance you boot up will have them, as will any unix server you ssh into.”
Unix History in a nutshell
 Developed at AT&T Bell Labs in the late 1960s, first on the PDP-7 and then the PDP-11
– Made available in the mid-1970s
– Developed and sold by AT&T in the 1980s
– Commercial variants emerged: Solaris, SCO…
 Standardized via POSIX in 1989
– POSIX: Portable Operating System Interface based on Unix
 GNU project launched free implementations in the 1980s
– Linux started in 1991 as a free POSIX-compliant OS kernel
– Many Linux distributions available: Ubuntu, Fedora, Debian…
Tool availability
 Available on any Unix machine
 Available on any Linux machine
– Such as those in DCS, e.g. joshua
 Available on any modern Mac
– Built on a BSD-derived Unix base
– Open the Terminal (‘console’) application and type away
 On Windows:
– Various ports of individual tools or collections of tools
– Cygwin, open source port of many linux tools to Windows: http://cygwin.com/install.html
Command line tools
 These are command line tools – no fancy GUI
 Each tool performs a single simple function
– Additional functionality has crept in over time
– Now some are more like a swiss army knife
 Can be combined via scripts, piping
 Information available on each tool:
– Via ‘man’ command: e.g. man cat
– Via program itself: sort --help
– Via the web: many instructions/examples online
 Short course on unix tools from Cambridge:
– http://www.cl.cam.ac.uk/teaching/1213/UnixTools/materials.html
Example Data Set
 Show examples using the “adult census data”
– http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
– File: adult.data
 ~32K individuals, one per line
– Age, Gender, Employment Type, Years of Education…
 Widely studied in Machine Learning community
– Prediction task: is income > 50K?
Standard input, output and Piping
 Unix commands can read and write files
– Special case: standard input (stdin) and standard output (stdout)
– By default, a command reads from stdin, writes to stdout
 Some commonly used tools are ‘wc’ and ‘cat’
– wc counts lines, words and characters
– cat reads a file, writes it to stdout
– Pipe ‘|’ connects the stdout of one command to stdin of next
 Examples:
– cat adult.data | wc
   Output: 32562 488415 3974305 [lines words characters]
– cat adult.data | wc | wc
   Output: 1 3 24
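The pipe examples can be tried without downloading the data. The sketch below is only an illustration: it builds a three-line stand-in for adult.data (the first six fields of the first three rows of the real file), and the temporary file and variable names are assumptions, not part of the course material.

```shell
# Stand-in for adult.data: first six fields of three rows from the real file
sample=$(mktemp)
cat > "$sample" <<'EOF'
39, State-gov, 77516, Bachelors, 13, Never-married
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse
38, Private, 215646, HS-grad, 9, Divorced
EOF

# cat writes the file to stdout; the pipe connects that to the stdin of wc
line_count=$(cat "$sample" | wc -l | tr -d ' ')

# wc prints three numbers (lines, words, characters), so counting the
# words of its output with a second wc gives 3
numbers_printed=$(cat "$sample" | wc | wc -w | tr -d ' ')

echo "lines: $line_count, numbers printed by wc: $numbers_printed"
rm -f "$sample"
```

The tr -d ' ' step only strips the padding that some versions of wc put in front of the count.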
Redirection
 Can use < to redirect a file to stdin, and > to redirect stdout
– >> appends to an existing file
 Examples:
– wc < adult.data
   32562 488415 3974305
– wc < adult.data > wordcount
  cat wordcount
   32562 488415 3974305
– cat adult.data | wc >> wordcount
 wc options:
– -l / -w / -c : print number of lines / words / characters
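A minimal sketch of the three redirection operators, assuming a two-line stand-in for adult.data (first six fields of two real rows) written to a temporary directory. The directory and variable names are hypothetical.

```shell
workdir=$(mktemp -d)
# Stand-in for adult.data (two rows, first six fields only)
cat > "$workdir/sample.data" <<'EOF'
39, State-gov, 77516, Bachelors, 13, Never-married
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse
EOF

# < feeds the file to stdin; > creates (or overwrites) the output file
wc -l < "$workdir/sample.data" > "$workdir/wordcount"
# >> appends a second result instead of overwriting the first
wc -w < "$workdir/sample.data" >> "$workdir/wordcount"

# The output file now holds one line per wc run
results=$(wc -l < "$workdir/wordcount" | tr -d ' ')
first_result=$(head -n 1 "$workdir/wordcount" | tr -d ' ')
last_result=$(tail -n 1 "$workdir/wordcount" | tr -d ' ')
cat "$workdir/wordcount"
rm -rf "$workdir"
```

Using > a second time instead of >> would have left only the word count in the file.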
Basic Commands
 ls: list files in a directory
– ls adult
   adult.data
   adult.names
   adult.test
 Options to commands are often single letters preceded by -
– ls -l adult
   total 5852
   -rwx------ 1 grahamc dcsstaff 3974305 Oct 8 18:03 adult.data
   -rwx------ 1 grahamc dcsstaff    5229 Oct 8 18:03 adult.names
   -rwx------ 1 grahamc dcsstaff 2003153 Oct 8 18:04 adult.test
– ls -la public_html
   total 5860
   drwx------  2 grahamc dcsstaff    4096 Oct 8 18:04 .
   drwx------ 39 grahamc dcsstaff    4096 Oct 8 18:04 ..
   -rwx------  1 grahamc dcsstaff 3974305 Oct 8 18:03 adult.data
   -rwx------  1 grahamc dcsstaff    5229 Oct 8 18:03 adult.names
   -rwx------  1 grahamc dcsstaff 2003153 Oct 8 18:04 adult.test
Viewing files: cat, head, tail
 cat file shows contents of file
 head shows first few lines of a file
– head adult.data
   39, State-gov, 77516, Bachelors, 13, Never-married,
   50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse,
   38, Private, 215646, HS-grad, 9, Divorced,
   53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners,
   28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty,
   37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial,
   49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service,
   52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse,
   31, Private, 45781, Masters, 14, Never-married, Prof-specialty,
   42, Private, 159449, Bachelors, 13, Married-civ-spouse, Exec-managerial,
 tail shows last few lines of a file
– tail -n 5 adult.data
   40, Private, 154374, HS-grad, 9, Married-civ-spouse, Machine-op-inspct,
   58, Private, 151910, HS-grad, 9, Widowed, Adm-clerical, Unmarried, White,
   22, Private, 201490, HS-grad, 9, Never-married, Adm-clerical, Own-child,
   52, Self-emp-inc, 287927, HS-grad, 9, Married-civ-spouse, Exec-managerial,
Viewing files: more or less
 more lets you page through a file
– Page down/space to advance
 less is a more flexible replacement
– Can page up to go back
– Q to quit
The sort command
 sort: sorts the input
 Default: sort lines by alphabetic order
– sort adult.data
 Configurable
– -r: reverse sort
– -n: numeric sort
– -k: column on which to sort (assume space separates fields)
   sort adult.data -k5 | less
   sort adult.data -n -k5 | less
– -f: ignore (upper/lower) case
– -m: merge multiple sorted files together
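To see the difference between text and numeric sorting on one column, here is a small sketch on a four-line stand-in for adult.data (first six fields of four real rows). It uses -t, to make the comma the field separator, an option of sort that the list above does not cover. The file and variable names are assumptions for illustration.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
39, State-gov, 77516, Bachelors, 13, Never-married
38, Private, 215646, HS-grad, 9, Divorced
53, Private, 234721, 11th, 7, Married-civ-spouse
37, Private, 284582, Masters, 14, Married-civ-spouse
EOF

# -t, sets the field separator; -k5,5n sorts numerically on field 5 only
fewest_years_age=$(sort -t, -k5,5n "$sample" | head -n 1 | cut -d, -f1)
# adding r to the key reverses the order, so the largest value comes first
most_years_age=$(sort -t, -k5,5nr "$sample" | head -n 1 | cut -d, -f1)
# without n the key is compared as text, so "13" sorts before "7"
text_first_age=$(sort -t, -k5,5 "$sample" | head -n 1 | cut -d, -f1)

echo "fewest years: age $fewest_years_age, most years: age $most_years_age"
echo "first row under a text sort: age $text_first_age"
rm -f "$sample"
```

The text sort puts the 13-year row first even though 7 is the smallest value, which is why -n matters for numeric columns.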
Cut
 cut: select certain columns from the file
– Default: assume tab separates columns
– -f: specify which fields to select
– -c: specify which character positions in each line to select
   cut -c1-9 adult.data | head
   cut -c1,3,5,7,9 adult.data | head
– -d: specify the field delimiter
   cut -f1 adult.data | head
   cut -f1,2 -d, adult.data | head
   cut -f1,2 -d' ' adult.data | head
   cut -f1,3,5,7,9 -d, adult.data | head
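The effect of the delimiter choice is easiest to see on a couple of lines. A minimal sketch, assuming a two-line stand-in for adult.data (first six fields of two real rows); the file and variable names are hypothetical.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
39, State-gov, 77516, Bachelors, 13, Never-married
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse
EOF

# With -d, the comma separates fields; selected fields keep the comma between them
first_two=$(cut -d, -f1,2 "$sample" | head -n 1)
# -c selects character positions instead of fields
age_chars=$(cut -c1-2 "$sample" | head -n 1)
# The default delimiter is tab; a line containing no tab is passed through whole
no_tab=$(cut -f1 "$sample" | head -n 1)

echo "$first_two"
echo "$age_chars"
echo "$no_tab"
rm -f "$sample"
```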
Uniq
 uniq: omit (or report) repeated lines
– Only adjacent repeats are detected, so sort the input first
   cut -f1 -d, adult.data | uniq | head
   cut -f1 -d, adult.data | sort -n | uniq | head
– Count the number of occurrences with -c
   cut -f1 -d, adult.data | sort -n | uniq -c | head
   cut -f1 -d, adult.data | sort -n | uniq -c | sort -rn | head
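The cut, sort, uniq -c, sort -rn pipeline is a general recipe for frequency counts. The sketch below applies it to the education column of a ten-line stand-in for adult.data (first four fields of the first ten rows of the real file). The file and variable names are assumptions for illustration.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
39, State-gov, 77516, Bachelors
50, Self-emp-not-inc, 83311, Bachelors
38, Private, 215646, HS-grad
53, Private, 234721, 11th
28, Private, 338409, Bachelors
37, Private, 284582, Masters
49, Private, 160187, 9th
52, Self-emp-not-inc, 209642, HS-grad
31, Private, 45781, Masters
42, Private, 159449, Bachelors
EOF

# uniq only collapses adjacent duplicates, so unsorted input leaves repeats behind
unsorted_groups=$(cut -d, -f4 "$sample" | uniq | wc -l | tr -d ' ')
# sorting first brings equal values together, giving the number of distinct values
distinct_values=$(cut -d, -f4 "$sample" | sort | uniq | wc -l | tr -d ' ')
# uniq -c prefixes each value with its count; sort -rn puts the most frequent first
top=$(cut -d, -f4 "$sample" | sort | uniq -c | sort -rn | head -n 1)
# squeeze the padding so the count and the value can be picked out with cut
top_count=$(echo "$top" | tr -s ' ' | cut -d' ' -f2)
top_value=$(echo "$top" | tr -s ' ' | cut -d' ' -f3)

echo "$distinct_values distinct values; most common: $top_value ($top_count rows)"
rm -f "$sample"
```

The same pipeline with a different field number is one way to approach a count by country.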
grep
 grep: search for lines that match some text
   grep Masters adult.data | head
   grep Masters adult.data | wc -l
– -i: ignore case
– -v: invert behaviour, select non-matching lines
– -An, -Bn: print n lines of context appearing After / Before the match
   grep -A1 -B2 Hungary adult.data | less
 Can handle regular expressions for flexible matching
– Quote the pattern so the shell does not expand it
   grep 'Married.*England' adult.data | less
   grep '^90' adult.data | less
Grep + regular expressions
 Grep’s regular expression syntax:
– ^ : start of line
– $ : end of line
– \ : “escape” next character: \$ to match a $ sign
– [abc] : match any character of abc
– [a-z] : match any character in range a to z
– . (dot) : match any character
– * : match 0 or more occurrences of preceding expression
– \{n\} : match n instances of preceding expression
   Example: grep '\(21\)\{2\}' adult.data
 egrep (or grep -E) for “extended” regular expressions:
   egrep 'England|Mexico' adult.data | head
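A few of these patterns can be checked on a small sample. The sketch below uses a ten-line stand-in for adult.data (first six fields of the first ten rows of the real file) and grep -c, which prints the number of matching lines rather than the lines themselves. The file and variable names are assumptions for illustration.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
39, State-gov, 77516, Bachelors, 13, Never-married
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse
38, Private, 215646, HS-grad, 9, Divorced
53, Private, 234721, 11th, 7, Married-civ-spouse
28, Private, 338409, Bachelors, 13, Married-civ-spouse
37, Private, 284582, Masters, 14, Married-civ-spouse
49, Private, 160187, 9th, 5, Married-spouse-absent
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse
31, Private, 45781, Masters, 14, Never-married
42, Private, 159449, Bachelors, 13, Married-civ-spouse
EOF

masters=$(grep -c 'Masters' "$sample")                  # plain text match
fifties=$(grep -c '^5' "$sample")                       # ^ anchors to the start of the line
private_married=$(grep -c 'Private.*Married' "$sample") # .* matches anything in between
ignoring_case=$(grep -ci 'private.*married' "$sample")  # -i also matches "Never-married"
double_three=$(grep -c '3\{2\}' "$sample")              # two consecutive 3s
either=$(grep -Ec 'Masters|9th' "$sample")              # extended syntax: alternation
not_private=$(grep -vc 'Private' "$sample")             # -v counts non-matching lines

echo "$masters $fifties $private_married $ignoring_case $double_three $either $not_private"
rm -f "$sample"
```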
sed
 sed: stream editor
– Most commonly used to substitute some text for others
– sed 's/expression/replacement/g'
   sed 's/Private/Secret/g' adult.data | head
   sed 's/, /\t/g' adult.data | head
   sed 's/, /\n/g' adult.data | head
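A minimal sketch of substitution with sed, assuming a three-line stand-in for adult.data (first six fields of three real rows). It replaces ", " with ";" rather than \t, since not every sed implementation interprets \t in the replacement text. The file and variable names are hypothetical.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
39, State-gov, 77516, Bachelors, 13, Never-married
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse
38, Private, 215646, HS-grad, 9, Divorced
EOF

# s/old/new/g replaces every occurrence on each line (without g, only the first)
first_line=$(sed 's/, /;/g' "$sample" | head -n 1)
# sed writes the edited text to stdout, so it can be piped onwards
secret_rows=$(sed 's/Private/Secret/g' "$sample" | grep -c 'Secret')
# the input file itself is left untouched
private_rows_in_file=$(grep -c 'Private' "$sample")

echo "$first_line"
rm -f "$sample"
```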
join
 join: do a database-style join on two sorted text files
– -1 n -2 m: try to match n’th field of first file with m’th field of second
– Output all combinations of matches
– e.g. join list of people + postcodes with average income in postcode
 Example:
   grep United-States -v adult.data | head -n 20 | cut -f 4,14 -d, | sort -k 2 > adult.join1
   grep United-States -v adult.data | head -n 20 | cut -f 1,14 -d, | sort -k 2 > adult.join2
   join -1 2 -2 2 adult.join1 adult.join2
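The "all combinations of matches" behaviour is easier to see on a handful of rows. This sketch is an illustration only: it uses a four-line stand-in for adult.data (first six fields of four real rows) and joins on marital status instead of country, because the country field is not in the stand-in. The file names and the LC_ALL=C setting are assumptions made so that sort and join agree on ordering.

```shell
export LC_ALL=C   # sort and join must use the same ordering
workdir=$(mktemp -d)
cat > "$workdir/sample.data" <<'EOF'
39, State-gov, 77516, Bachelors, 13, Never-married
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse
38, Private, 215646, HS-grad, 9, Divorced
53, Private, 234721, 11th, 7, Married-civ-spouse
EOF

# Turn ", " into a single space so join's default blank separator applies
sed 's/, / /g' "$workdir/sample.data" > "$workdir/flat"
# File 1: education + marital status; file 2: age + marital status; both sorted on field 2
cut -d' ' -f4,6 "$workdir/flat" | sort -k2,2 > "$workdir/join1"
cut -d' ' -f1,6 "$workdir/flat" | sort -k2,2 > "$workdir/join2"

# Match field 2 of each file; the join field is printed first on every output line
joined=$(join -1 2 -2 2 "$workdir/join1" "$workdir/join2")
pairs=$(printf '%s\n' "$joined" | wc -l | tr -d ' ')
first_pair=$(printf '%s\n' "$joined" | head -n 1)

printf '%s\n' "$joined"
rm -rf "$workdir"
```

Two rows share the status Married-civ-spouse in each file, so that key alone contributes 2 x 2 = 4 output lines.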
Editors: nano, pico, emacs
 Unix editors were once notoriously unfriendly
– vi, vim, and ed all required memorizing complex commands
 Modern editors are now much more usable
– pico and nano are easy to pick up and use
– emacs is very powerful and configurable
 If working on a GUI based system, many options
– Local text editors in Windows, Macs, Linux
http://xkcd.com/378/
Scripting
 Don’t have to write ever longer command lines
 Can put sequences of commands into scripts
– With loop controls: automate processing, reduce errors
 Example script:
  #!/bin/bash
  # outer loop over the two files adult.join1 and adult.join2
  for i in 1 2
  do
    wc adult.join$i
    # inner C-style loop (a bash feature) printing i+j
    for ((j=1; j<=2; j++))
    do
      echo $((i+j))
    done
  done
  date  # print the current date and time once the loops finish
Programming
 Can write programs in your language of choice
– Java: powerful, general purpose language
– Python: popular, mathematical language
– Perl: popular for processing text
 Teaching a language is definitely out of scope of this module
– Foundations (CS917) module gives crash course in Java
– You can use any language you know for homeworks, project
   Data Analytics is about getting an answer, less about how
 Will give a brief introduction to R, a statistical tool/language
Tools for working with statistical data: R
 R: flexible language with a lot of support for statistical operations
– Successor to ‘S’ language
– Open-source, available in Windows, Mac, Linux, Cygwin
 Inbuilt support for many data manipulation operations
– Read in data from CSV (comma-separated values) format
– Compute sample mean, variance, quantiles
– Find line of best fit (linear regression)
– Flexible plotting tools, output to screen or file
– Lots more statistical tools available as libraries
 Steep learning curve, but GUIs and help are available
– Will use the R Studio GUI: https://www.rstudio.com/products/rstudio/download/
Quick example in R
 data <- read.csv("adult.test", header=F)
  # read in data in comma-separated value format
  summary(data) # show a summary of all attributes
  summary(data[5]) # show a summary of years of education
  d <- table(data[5]) # tabulate the data
  plot(d) # plot the frequency distribution
  plot(ecdf(data[5]$V5)) # plot the (empirical) CDF
 data2 <- read.csv("adult.data", header=F)
  qqplot(data[5]$V5, data2[5]$V5, type="l")
  # make a quantile-quantile plot of two (empirical) distributions
  pdf(file="qq.pdf") # send output to a PDF file
  qqplot(data[5]$V5, data2[5]$V5, type="l")
  dev.off() # close the file!
  quit() # quit!
Spreadsheets
 Many options: Excel, OpenOffice, Google Spreadsheets
 Great for quick viewing, exploration and plotting of small data
– Excel 2003: 65536 rows
– Excel 2007, 2010, 2013, 2016: 1M rows
– Google sheets: up to 256 columns, or up to 200,000 cells
 Quick plotting tools:
– Select data to plot, hit ‘plot’ button, fiddle with options
– Sometimes takes a long time to make plots how you want
– Tricky to get multiple plots with the same formatting
[Figure: adult.test years of education plotted against adult.data years of education, both axes 0 to 20]
Data Processing in Spreadsheets
 Decent data manipulation functionality
– Sort, selection, reformatting
– Some tasks more difficult within the spreadsheet metaphor
 Limitations of data processing in spreadsheets
– Capacity limits (row limits, cell limits)
– Can’t always keep a record of what was done (repeatability)
   By contrast, a sequence of unix tool commands can be kept in a script
– Prone to errors: may select wrong range of cells etc.
   An economics paper argued in favour of austerity measures
   Missed out Australia, Austria, Belgium, Canada, and Denmark from calculations, skewing the conclusion
   theconversation.com/economists-an-excel-error-and-the-misguided-push-for-austerity-13584
Data Processing in Spreadsheets
 Sort: select data and click on ‘sort’
 Aggregation:
– =sum(range), =count(range), =average(range), =median(range)
   =if(test, [value if true], [value if false])
– “Smart filling” lets you drag to extend
   =countif(range, condition)
 Pivot tables let you explore the data cube
 Exercise: compute the number of people from each country in adult.data
– Compare to the effort to do this with unix tools (cut, sort, uniq)
Plotting in Excel
 Scatter plot of age vs years of education
– Select columns
– Insert - ‘scatter plot’
 Bar chart of gender breakdown
– Derive necessary counts
– Insert - ‘Column’
[Figures: scatter plot of age (0 to 100) against years of education (0 to 18); column chart of counts for Male and Female (0 to 25000)]
Gnuplot
 Powerful plotting tool, driven by a script
– Easier to generate multiple, consistent plots
– Write script as a text file
– Call gnuplot scriptname
 Pros and cons:
– Flexible output: create PDF, JPG, PNG, EPS, EMF…
– Plot data and functions
– Configure almost every aspect of the output
– Sometimes arcane commands, cryptic abbreviations
Gnuplot function plotting
set term emf enhanced font "Calibri,18" size 600,400
set output "pareto.emf"
set log y
set log x
set xrange [1: 1e6]
set yrange [1e-6: 1]
set format y "10^{%L}"
set format x "10^{%L}"
unset key
plot x**(-1.0)
– set output "exp.emf"
  plot x**(-1.0)*exp(-0.0001*x)
– cdf_lognormal(x)=0.5+0.5*erf((x)/sqrt(2.0))
  set output "lognorm.emf"
  plot 1.0-cdf_lognormal(0.5*log(0.01*x))
[Figures: the three resulting log-log plots, x axis from 10^0 to 10^6, y axis from 10^-6 to 10^0]
Gnuplot data plotting
 Scatter plot of age versus years of education:
  set term emf enhanced font "Calibri,18"
  set output "ageeducation.emf"
  set title "Age versus Education"
  set xlabel "Age"
  set ylabel "Years of Education"
  set key under
  plot "adult/adult.data" using 1:5 \
    with points title 'Adult data'
 Add a line of best fit:
  y(x)=a*x+b
  fit y(x) "adult/adult.data" using 1:5 via a,b
  plot "adult/adult.data" u 1:5 w p t 'Adult', y(x) w l t 'Fit'
[Figure: scatter plot titled "Age versus Education", x axis Age (10 to 90), y axis Years of Education (0 to 16), series 'Adult data']
Gnuplot data plotting
 Bar chart of gender breakdown:
– Process data to generate sums:
   cut -f 10 -d, adult/adult.data | sort | uniq -c > gendercount.txt
– Gnuplot script:
  set term emf enhanced font "Calibri,18"
  set output "gender.emf"
  set style data histograms
  set style histogram cluster gap 1
  set style fill solid border -1
  set yrange [0:]
  plot "gendercount.txt" using 1:xticlabel(2) title " "
[Figure: bar chart of counts for Female and Male, y axis 0 to 25000]
Report writing: Word processors
 Many options: MS Word, OpenOffice Writer, Google Docs
 Adequate for report writing (e.g. project report)
– Nice GUI, configurable
– Can be difficult if you have many figures
– 3rd party support for bibliographic data (Endnote)
Report writing: LaTeX
 LaTeX: a scientific document preparation system
 Describe how you want your document to be, and compile it
 More of a learning curve, but very powerful
– Stops you getting too involved in fine details
– Support for producing beautiful mathematical formulae
– Produce PDF output easily from LaTeX (text) source file:
   pdflatex myfile.tex
– Support automatic bibliography creation via bibtex
– Automatic updating cross-references via \label and \ref
 Covered in more detail in CS908 Research Methods
Is this on the test?
 From 2014 exam:
– Many acceptable answers for each question (and also poor/wrong answers…)
Background reading
 Warwick past papers
– http://www2.warwick.ac.uk/services/exampapers?q=cs910&department=Any&year=Any
 http://www.cl.cam.ac.uk/teaching/1213/UnixTools/materials.html
LaTeX example
\documentclass{article}
\usepackage[margin=2cm]{geometry}
\usepackage{graphicx}
\title{This is my report}
\author{Your name}
\begin{document}
\maketitle
\begin{abstract}
This is an abstract for the document
\end{abstract}
\section{Introduction}
This is the
introduction to my document
\begin{figure}
\includegraphics{figure.pdf}
\caption{This is a figure}
\label{fig:first}
\end{figure}
Please see figure~\ref{fig:first}.
\end{document}