CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk

Tools for Data Analysis

Objectives

Introduce a broad collection of tools used in data analytics
Outline the capabilities and uses of each tool
  – Provide examples of tool usage
Allow you to select the appropriate tools to work with
  – Based on your preferences, e.g. GUI or command line
Very quick introduction to each tool
  – More information on the web, in the library

The philosophy of tool choice

A huge array of tools is available, with overlapping functionality
A good data analyst knows a good selection of tools
  – Can pick the right one for the job
Many people know only one or two (often unsuitable) tools
  – Hence, much of the world's analytics is performed in spreadsheets
Knowing that a tool exists and what it can do is often enough
  – Can decide if learning to use it is time-effective
  – Will introduce some options here
  – Will not give formal training
No single answer to tool selection
  – Often a matter of personal choice

Unix Tools

"Unix tools" covers many simple tools developed as part of the Unix operating system
They manipulate data files represented as lines of text
  – Flat files, comma separated value (CSV) files
  – Allow simple analysis and data preparation
  – Widely available in Linux, MacOS, Windows

"I use all these nearly every day. The best part is, once you know they exist, these tools are available on every unix machine you will ever use. Nothing else (except maybe perl) is as universal – you don't have to worry about versions or anything. Being comfortable with these tools means you can get work done anywhere – any EC2 instance you boot up will have them, as will any unix server you ssh into."

Unix History in a nutshell

Developed at AT&T Bell Labs in the late 1960s, initially on a PDP-7 and soon ported to the PDP-11
  – Made available in mid-1970s
  – Developed and sold by AT&T in the 1980s
  – Commercial variants emerged: Solaris, SCO…
Standardized via POSIX in 1989
  – POSIX: Portable Operating System Interface based on Unix
GNU project launched free implementations in the 1980s
  – Linux started in 1991 as a free POSIX-compliant OS kernel
  – Many Linux distributions available: Ubuntu, Fedora, Debian…

Tool availability

Available on any Unix machine
Available on any Linux machine
  – Such as those in DCS, e.g. joshua
Available on any modern Mac
  – Based on BSD Unix
  – Open the Terminal ('console') and type away
On Windows:
  – Various ports of individual tools or collections of tools
  – Cygwin, open source port of many Linux tools to Windows: http://cygwin.com/install.html

Command line tools

These are command line tools: no fancy GUI
Each tool performs a single simple function
  – Additional functionality has crept in over time
  – Now some are more like a Swiss army knife
  – Can be combined via scripts, piping
Information available on each tool:
  – Via 'man' command: e.g. man cat
  – Via program itself: sort --help
  – Via the web: many instructions/examples online
  – Short course on unix tools from Cambridge: http://www.cl.cam.ac.uk/teaching/1213/UnixTools/materials.html
[Image credit: nmap.org/movies/]

Example Data Set

Show examples using the "adult census data"
  – http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
  – File: adult.data
  – ~32K individuals, one per line
  – Age, Gender, Employment Type, Years of Education…
Widely studied in Machine Learning community
  – Prediction task: is income > 50K?
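A first sanity check on a copy of the data set might look like the sketch below. So that it runs without a download, it writes a tiny stand-in file; the name sample.data and its four rows are hypothetical, made up to mimic the 15-field layout of adult.data, whose last field is the income label (<=50K or >50K). The tools used (wc, grep, head) are the ones introduced in the slides that follow.

```shell
#!/bin/sh
# Hypothetical stand-in for adult.data: four illustrative rows in the same
# comma-separated layout (age, employment type, ..., country, income label).
cat > sample.data <<'EOF'
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
31, Private, 45781, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 14084, 0, 50, United-States, >50K
EOF

# One individual per line, so the line count is the number of records.
total=$(wc -l < sample.data | tr -d ' ')

# Count the records labelled with income above 50K.
high=$(grep -c '>50K' sample.data)

echo "records: $total, income >50K: $high"

# Eyeball the first record to confirm the field layout.
head -n 1 sample.data
```

With the real file, replacing sample.data by adult.data in the last four commands should give the corresponding counts for the full data set.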
Standard input, output and Piping

Unix commands can read and write files
  – Special case: standard input (stdin) and standard output (stdout)
  – By default, a command reads from stdin, writes to stdout
Some commonly used tools are 'wc' and 'cat'
  – wc does a simple wordcount
  – cat reads a file, writes it to stdout
Pipe '|' connects the stdout of one command to stdin of next
Examples:
    cat adult.data | wc
    Output: 32562 488415 3974305 [lines words characters]
    cat adult.data | wc | wc
    Output: 1 3 24

Redirection

Can use < to redirect a file to stdin, and > to redirect stdout
  – >> appends to an existing file
Examples:
    wc < adult.data
    32562 488415 3974305
    wc < adult.data > wordcount
    cat wordcount
    32562 488415 3974305
    cat adult.data | wc >> wordcount
wc options:
  – -l / -w / -c : print number of lines / words / characters (bytes)

Basic Commands

ls: list files in a directory
    ls adult
    adult.data adult.names adult.test
Options to commands are often single letters preceded by -
    ls -l adult
    total 5852
    -rwx------  1 grahamc dcsstaff 3974305 Oct 8 18:03 adult.data
    -rwx------  1 grahamc dcsstaff    5229 Oct 8 18:03 adult.names
    -rwx------  1 grahamc dcsstaff 2003153 Oct 8 18:04 adult.test
    ls -la public_html
    total 5860
    drwx------  2 grahamc dcsstaff    4096 Oct 8 18:04 .
    drwx------ 39 grahamc dcsstaff    4096 Oct 8 18:04 ..
    -rwx------  1 grahamc dcsstaff 3974305 Oct 8 18:03 adult.data
    -rwx------  1 grahamc dcsstaff    5229 Oct 8 18:03 adult.names
    -rwx------  1 grahamc dcsstaff 2003153 Oct 8 18:04 adult.test

Viewing files: cat, head, tail

cat file shows contents of file
head shows first few lines of a file (output below is cut off at the right-hand edge of the slide)
    head adult.data
    39, State-gov, 77516, Bachelors, 13, Never-married, …
    50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, …
    38, Private, 215646, HS-grad, 9, Divorced, …
    53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, …
    28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, …
    37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, …
    49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, …
    52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, …
    31, Private, 45781, Masters, 14, Never-married, Prof-specialty, …
    42, Private, 159449, Bachelors, 13, Married-civ-spouse, Exec-managerial, …
tail shows last few lines of a file
    tail -n 5 adult.data
    40, Private, 154374, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, …
    58, Private, 151910, HS-grad, 9, Widowed, Adm-clerical, Unmarried, White, …
    22, Private, 201490, HS-grad, 9, Never-married, Adm-clerical, Own-child, …
    52, Self-emp-inc, 287927, HS-grad, 9, Married-civ-spouse, Exec-managerial, …

Viewing files: more or less

more lets you page through a file
  – Page down/space to advance
less is a more flexible replacement
  – Can page up to go back
  – Q to quit

The sort command

sort: sorts the input
  – Default: sort lines by alphabetic order
    sort adult.data
Configurable:
  – -r: reverse sort
  – -n: numeric sort
  – -k: column on which to sort (assumes whitespace separates fields; -t sets a different separator)
    sort -k5 adult.data | less
    sort -n -k5 adult.data | less
  – -f: ignore (upper/lower) case
  – -m: merge multiple sorted files together

Cut

cut: select certain columns from the file
  – Default: assume tab separates columns
  – -f: specify which fields to select
  – -c: specify which character positions in each line to select
    cut -c1-9 adult.data | head
    cut -c1,3,5,7,9 adult.data | head
  – -d: specify the field delimiter
    cut -f1 adult.data | head
    cut -f1,2 -d, adult.data | head
    cut -f1,2 -d' ' adult.data | head
    cut -f1,3,5,7,9 -d, adult.data | head

Uniq

uniq: omit (or report) repeated lines
  – Only adjacent repeats are detected, so sort the input first
    cut -f1 -d, adult.data | uniq | head
    cut -f1 -d, adult.data | sort -n | uniq | head
  – Count the number of occurrences with -c
    cut -f1 -d, adult.data | sort -n | uniq -c | head
    cut -f1 -d, adult.data | sort -n | uniq -c | sort -rn | head

grep

grep: search for lines that match some text
    grep Masters adult.data | head
    grep Masters adult.data | wc -l
  – -i: ignore case
  – -v: invert behaviour, select non-matching lines
  – -An, -Bn: print n lines of context appearing After / Before the match
    grep -A1 -B2 Hungary adult.data | less
Can handle regular expressions for flexible matching (quote the pattern to protect it from the shell)
    grep 'Married.*England' adult.data | less
    grep '^90' adult.data | less

Grep + regular expressions

Grep's regular expression syntax:
  – ^ : start of line
  – $ : end of line
  – \ : "escape" next character: \$ to match a $ sign
  – [abc] : match any character of abc
  – [a-z] : match any character in range a to z
  – . (dot) : match any character
  – * : match 0 or more occurrences of preceding expression
  – \{n\} : match n instances of preceding expression
Example:
    grep '\(21\)\{2\}' adult.data
egrep (equivalently grep -E) for "extended" regular expressions:
    egrep 'England|Mexico' adult.data | head

sed

sed: stream editor
Most commonly used to substitute some text for others
  – sed 's/expression/replacement/g'
    sed 's/Private/Secret/g' adult.data | head
    sed 's/, /\t/g' adult.data | head
    sed 's/, /\n/g' adult.data | head
  – \t and \n in the replacement are understood by GNU sed; other versions may differ

join

join: do a database-style join on two sorted text files
  – -1 n -2 m: try to match n'th field of first file with m'th field of second
  – Output all combinations of matches
  – e.g. join list of people + postcodes with average income in postcode
Example:
    grep -v United-States adult.data | head -n 20 | cut -f 4,14 -d, | sort -k 2 > adult.join1
    grep -v United-States adult.data | head -n 20 | cut -f 1,14 -d, | sort -k 2 > adult.join2
    join -1 2 -2 2 adult.join1 adult.join2

Editors: nano, pico, emacs

Unix editors were once notoriously unfriendly
  – vi, vim, and ed all required memorizing complex commands
Modern editors are now much more usable
  – pico and nano are easy to pick up and use
  – emacs is very powerful and configurable
If working on a GUI based system, many options
  – Local text editors in Windows, Macs, Linux
[Image credit: http://xkcd.com/378/]

Scripting

Don't have to write ever longer command lines
  – Can put sequences of commands into scripts
  – With loop controls: automate processing, reduce errors

    #!/bin/bash
    for i in 1 2
    do
      wc adult.join$i
      for ((j=1; j<=2; j++))
      do
        echo $((i+j))
      done
    done
    date

Programming

Can write programs in your language of choice
  – Java: powerful, general purpose language
  – Python: popular, mathematical language
  – Perl: popular for processing text
Teaching a language is definitely out of scope of this module
  – Foundations (CS917) module gives crash course in Java
  – You can use any language you know for homeworks, project
  – Data Analytics is about getting an answer, less about how
Will give a brief introduction to R, a statistical tool/language

Tools for working with statistical data: R

R: flexible language with a lot of support for statistical operations
  – Successor to 'S' language
  – Open-source, available in Windows, Mac, Linux, Cygwin
Inbuilt support for many data manipulation operations
  – Read in data from CSV (comma-separated values) format
  – Compute sample mean, variance, quantiles
  – Find line of best fit (linear regression)
  – Flexible plotting tools, output to screen or file
  – Lots more statistical tools available as libraries
Steep learning curve, but GUIs and help are available
  – Will use the R Studio GUI: https://www.rstudio.com/products/rstudio/download/

Quick example in R

    data <- read.csv("adult.test", header=F)    # read in data in comma-separated value format
    summary(data)                               # show a summary of all attributes
    summary(data[5])                            # show a summary of years of education
    d <- table(data[5])                         # tabulate the data
    plot(d)                                     # plot the frequency distribution
    plot(ecdf(data[5]$V5))                      # plot the (empirical) CDF
    data2 <- read.csv("adult.data", header=F)
    qqplot(data[5]$V5, data2[5]$V5, type="l")   # make a quantile-quantile plot of two (empirical) distributions
    pdf(file="qq.pdf")                          # send output to a PDF file
    qqplot(data[5]$V5, data2[5]$V5, type="l")
    dev.off()                                   # close the file!
    quit()                                      # quit!

Spreadsheets

Many options: Excel, OpenOffice, Google Spreadsheets
Great for quick viewing, exploration and plotting of small data
  – Excel 2003: 65536 rows
  – Excel 2007, 2010, 2013, 2016: 1M rows
  – Google sheets: up to 256 columns, or up to 200,000 cells
Quick plotting tools:
  – Select data to plot, hit 'plot' button, fiddle with options
  – Sometimes takes a long time to make plots how you want
  – Tricky to get multiple plots with the same formatting
[Figure: quantile-quantile plot of years of education, adult.data (x axis) against adult.test (y axis)]

Data Processing in Spreadsheets

Decent data manipulation functionality
  – Sort, selection, reformatting
  – Some tasks more difficult within the spreadsheet metaphor
Limitations of data processing in spreadsheets
  – Capacity limits (row limits, cell limits)
  – Can't always keep a record of what was done (repeatability)
    By contrast, a sequence of unix tool commands can be put in a script
  – Prone to errors: may select wrong range of cells etc.
    Example: an economics paper argued in favour of austerity measures
    It missed out Australia, Austria, Belgium, Canada, and Denmark from calculations, skewing the conclusion
    (theconversation.com/economists-an-excel-error-and-the-misguided-push-for-austerity-13584)

Data Processing in Spreadsheets

Sort: select data and click on 'sort'
Aggregation:
  – =sum(range), =count(range), =average(range), =median(range)
  – =countif(range, condition)
=if(test, [value if true], [value if false])
  – "Smart filling" lets you drag to extend
Pivot tables let you explore the data cube
Exercise: compute the number of people from each country in adult.data
  – Compare to the effort to do this with unix tools (cut, sort, uniq)

Plotting in Excel

Scatter plot of age vs years of education
  – Select columns
  – Insert - 'scatter plot'
Bar chart of gender breakdown
  – Derive necessary counts
  – Insert - 'Column'
[Figures: scatter plot of age against years of education; column chart of counts for Male and Female]

Gnuplot

Powerful plotting tool, driven by a script
  – Easier to generate multiple, consistent plots
  – Write script as a text file
  – Call gnuplot scriptname
Pros and cons:
  – Flexible output: create PDF, JPG, PNG, EPS, EMF…
  – Plot data and functions
  – Configure almost every aspect of the output
  – Sometimes arcane commands, cryptic abbreviations

Gnuplot function plotting

    set term emf enhanced font "Calibri,18" size 600,400
    set output "pareto.emf"
    set log y
    set log x
    set xrange [1: 1e6]
    set yrange [1e-6: 1]
    set format y "10^{%L}"
    set format x "10^{%L}"
    unset key
    plot x**(-1.0)

    set output "exp.emf"
    plot x**(-1.0)*exp(-0.0001*x)

    cdf_lognormal(x)=0.5+0.5*erf((x)/sqrt(2.0))
    set output "lognorm.emf"
    plot 1.0-cdf_lognormal(0.5*log(0.01*x))

[Figures: the three resulting log-log plots, x axis from 10^0 to 10^6 and y axis from 10^-6 to 10^0]

Gnuplot data plotting

Scatter plot of age versus years of education:

    set term emf enhanced font "Calibri,18"
    set output "ageeducation.emf"
    set title "Age versus Education"
    set xlabel "Age"
    set ylabel "Years of Education"
    set key under
    plot "adult/adult.data" using 1:5 \
         with points title 'Adult data'

Add a line of best fit:

    y(x)=a*x+b
    fit y(x) "adult/adult.data" using 1:5 via a,b
    plot "adult/adult.data" u 1:5 w p t 'Adult', y(x) w l t 'Fit'

[Figure: scatter plot titled "Age versus Education", Age on the x axis, Years of Education on the y axis, key "Adult data"]

Gnuplot data plotting

Bar chart of gender breakdown:
  – Process data to generate sums:
    cut -f 10 -d, adult/adult.data | sort | uniq -c > gendercount.txt
  – Gnuplot script:

    set term emf enhanced font "Calibri,18"
    set output "gender.emf"
    set style data histograms
    set style histogram cluster gap 1
    set style fill solid border -1
    set yrange [0:]
    plot "gendercount.txt" using 1:xticlabel(2) title " "

[Figure: bar chart of counts for Female and Male]

Report writing: Wordprocessors

Many options: MS Word, OpenOffice Writer, Google Docs
Adequate for report writing (e.g. project report)
  – Nice GUI interface, configurable
  – Can be difficult if you have many figures
  – 3rd party support for bibliographic data (Endnote)

Report writing: LaTeX

LaTeX: a scientific document preparation system
  – Describe how you want your document to be, and compile it
More of a learning curve, but very powerful
  – Stops you getting too involved in fine details
  – Support for producing beautiful mathematical formulae
  – Produce PDF output easily from LaTeX (text) source file: pdflatex myfile.tex
  – Support automatic bibliography creation via bibtex
  – Automatic updating cross-references via \label and \ref
Covered in more detail in CS908 Research Methods

Is this on the test?

From 2014 exam:
  – Many acceptable answers for each question (and also poor/wrong answers…)
Background reading
  – Warwick past papers: http://www2.warwick.ac.uk/services/exampapers?q=cs910&department=Any&year=Any
  – http://www.cl.cam.ac.uk/teaching/1213/UnixTools/materials.html

LaTeX example

    \documentclass{article}
    \usepackage[margin=2cm]{geometry}
    \usepackage{graphicx}
    \title{This is my report}
    \author{Your name}
    \begin{document}
    \maketitle
    \begin{abstract}
    This is an abstract for the document
    \end{abstract}
    \section{Introduction}
    This is the introduction to my document
    \begin{figure}
    \includegraphics{figure.pdf}
    \caption{This is a figure}
    \label{fig:first}
    \end{figure}
    Please see figure~\ref{fig:first}.
    \end{document}
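Returning to the spreadsheet exercise (the number of people from each country in adult.data), the unix-tools route with cut, sort and uniq can be sketched as below. This is one possible pipeline, not a model answer. It assumes the country is field 14, as in the join example, and it runs on a small hypothetical file, sample.data, whose five rows are made up to mimic the adult.data layout. The sed step that strips the space left after each comma is an added convenience.

```shell
#!/bin/sh
# Hypothetical stand-in for adult.data: five illustrative rows, country in field 14.
cat > sample.data <<'EOF'
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
31, Private, 45781, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 14084, 0, 50, United-States, >50K
40, Private, 121772, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Mexico, <=50K
EOF

# 1. cut:  keep only field 14 (country), using comma as the delimiter
# 2. sed:  strip the leading space left behind by the ", " separator
# 3. sort: bring equal countries onto adjacent lines, as uniq requires
# 4. uniq -c: count each run of identical lines
# 5. sort -rn: order by count, largest first
cut -f14 -d, sample.data | sed 's/^ //' | sort | uniq -c | sort -rn > countrycount.txt

cat countrycount.txt
```

Pointing the pipeline at adult.data instead of sample.data gives the per-country counts for the full data set, and the one-line command can be saved in a script so that the analysis is repeatable.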