Hands-on Introduction to R
Why Learn Programming?
• We live in oceans of data. Computers are
essential to record and help analyse it.
• Competent scientists speak C/C++, Java,
MATLAB, Python, Perl, R and/or Mathematica
• Data collection and analysis have been very important in
forensic science since the 2009 NAS report
• Code written in the above languages can easily be
made available for review/discovery
Getting a computer to do anything useful
• All machines understand is on/off!
  • High/low voltage
  • High/low current
  • High/low charge
  • 1/0 binary digits (bits)
• To make a computer do anything, you have to
speak machine language to it:
000000 00001 00010 00110 00000 100000
Add the values in registers 1 and 2. Store the result. (Example from Wikipedia)
Getting a computer to do anything useful
• Machine language is not intuitive and can vary
a great deal over designs
• The basic operations, however, are
the same, e.g.:
  • Move data here
  • Combine these values
  • Store this data
  • Etc.
• “Human readable” language for basic machine
operations: assembly language
Getting a computer to do anything useful
• Assembly is still cumbersome for (most)
humans
  • A machine encoding: 10110000 01100001
  • Assembly: MOV AL, 61h
  • Meaning: move the number 97 (61 in hexadecimal) over to “storage area” AL
Getting a computer to do anything useful
• Better yet is a more “Englishy”, “high-level”
language
• Enter: C, C++, Fortran, Java, …
• Higher level languages like these are translated
(“compiled”) to machine language
• Not exactly true for Java, but it’s something
analogous…
Getting a computer to do anything useful
• Even more “Englishy” and “high-level” are
interpreted languages
• Enter: R, MATLAB, Perl, Python, Mathematica,
Maple, …
• The “code” of these languages is “interpreted”
as commands by a program that is already
running
• They make many assumptions behind the scenes
• Much easier to program with
• Much slower than compiled languages
Why R?
• R is not a black box!
• Code is available for review; totally transparent!
• R is maintained by a professional group of
statisticians and computational scientists
• From very simple to state-of-the-art procedures
available
• Very good graphics for exhibits and papers
• R is extensible (it is a full scripting language)
• Coding/syntax similar to Python and MATLAB
• Easy to link to C/C++ routines
Why R?
• Where to get information on R:
  • R: http://www.r-project.org/
    • Just need the base
  • RStudio: http://rstudio.org/
    • A great IDE for R
    • Works on all platforms
    • Sometimes slows down performance…
  • CRAN: http://cran.r-project.org/
    • Library repository for R
    • Click on Search on the left of the website to search for
    packages or information on packages
Finding our way around R/RStudio
Handy Commands:
• Basic Input and Output
  • Numeric input: x <- 4
  • Variables (here x) store information
  • <- is the assignment operator
  • Text (character) input: x <- "text goes in quotes"
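A minimal sketch of the two kinds of input described above. The variable name y and the use of print and class are illustrative choices, not part of the original slide.

```r
# Numeric input: store the number 4 in the variable x.
x <- 4
print(x)

# Text (character) input: text goes in quotes.
y <- "text goes in quotes"
print(y)

# class() reports what kind of information a variable stores.
class(x)
class(y)
```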
Handy Commands:
• Get help on an R command:
  • If you know the name: ?commandname
    • ?plot brings up the HTML help page for the plot command
  • If you don’t know the name:
    • Use Google (my favorite)
    • ??keyword
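As a rough illustration of the help commands above, typed at the R console. The search term histogram is only an example keyword; help and help.search are the long forms of ? and ??.

```r
# Open the help page for a command whose name you know.
?plot
help("plot")               # equivalent long form

# Search the installed help pages when you only know a keyword.
??histogram
help.search("histogram")   # equivalent long form
```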
Handy Commands:
• R is driven by functions: func(argument1, argument2)
  • func is the function name
  • Input to the function goes in parentheses
  • The function returns something; here it gets dumped into x:
x <- func(arg1, arg2)
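For instance, the func(arg1, arg2) pattern could look like this with two built-in functions. The choice of sqrt and round, and the variable name r, are only for illustration.

```r
# sqrt() takes one argument and returns its square root.
x <- sqrt(16)
x

# round() takes two arguments: a number and how many digits to keep.
r <- round(3.14159, digits = 2)
r
```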
Handy Commands:
• Input from Excel
• Save spreadsheet as a CSV file
• Use read.csv function
• Needs the path to the file
Mac e.g.:
"/Users/npetraco/latex/papers/data.csv"
Windows e.g. (double the backslashes in R strings, or use forward slashes):
"C:\\Users\\npetraco\\latex\\papers\\data.csv"
*Exercise: basicIO.R
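A small self-contained sketch of reading a CSV file. Since the actual spreadsheet is not available here, the example first writes a tiny made-up file to a temporary location; the column names fragment and nD and the variable names csv.path and dat are assumptions for illustration.

```r
# Write a small CSV file to a temporary location so the example is self-contained.
# In practice csv.path would be the path to a spreadsheet saved as CSV.
csv.path <- tempfile(fileext = ".csv")
writeLines(c("fragment,nD",
             "1,1.52005",
             "2,1.52003",
             "3,1.52001"), csv.path)

# read.csv returns a data frame; the first line of the file gives the column names.
dat <- read.csv(csv.path)
print(dat)
dim(dat)   # number of rows and columns
```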
Handy Commands:
• Matrices: X
• X[,1] returns column 1 of matrix X
• X[3,] returns row 3 of matrix X
• Handy functions for data frames and matrices:
• dim, nrow, ncol, rbind, cbind
• User defined functions syntax:
func.name <- function(arguments) {
  do something
  return(output)
}
• To use it: func.name(values)
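The matrix indexing and function syntax above might be tried out as follows. The matrix contents and the function name value.spread are hypothetical, chosen only to keep the example short.

```r
# Build a 3 x 2 matrix, filled column by column.
X <- matrix(1:6, nrow = 3, ncol = 2)

X[, 1]   # column 1 of X
X[3, ]   # row 3 of X
dim(X)   # number of rows and columns

# rbind adds a row, cbind adds a column.
X.more.rows <- rbind(X, c(10, 20))
X.more.cols <- cbind(X, c(7, 8, 9))

# A user-defined function (hypothetical): largest value minus smallest value.
value.spread <- function(v) {
  output <- max(v) - min(v)
  return(output)
}

# To use it:
value.spread(X[, 1])
```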
First Thing: Look at your Data
• Explore the Glass dataset of the mlbench
package
• Source (load) all_data_source.R
• *visualize_with_plots.r
• Scatter plots: plot any two variables against each
other
[Figure: scatter plot of Na versus RI]
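A scatter plot of two Glass variables could be produced along these lines. This is a sketch that assumes the mlbench package is installed; it loads the data directly rather than through all_data_source.R, whose contents are not shown here.

```r
# Assumes the mlbench package is installed: install.packages("mlbench")
library(mlbench)
data(Glass)

# Quick look at the data before plotting.
dim(Glass)
head(Glass)

# Scatter plot of sodium (Na) against refractive index (RI).
plot(Glass$RI, Glass$Na, xlab = "RI", ylab = "Na")
```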
First Thing: Look at your Data
• Pairs plots: do many scatter plots at once
[Figure: pairs plot (scatter plot matrix) of several Glass variables, including Si, K and Ca]
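One possible way to draw a pairs plot with base R. It assumes mlbench is installed, and the three columns chosen here are an assumption; any set of numeric columns would work.

```r
# Assumes the mlbench package is installed.
library(mlbench)
data(Glass)

# Pick a few numeric variables and plot every pair against each other.
glass.subset <- Glass[, c("Si", "K", "Ca")]
pairs(glass.subset)
```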
First Thing: Look at your Data
• Histograms: “bin” a variable and plot frequencies
[Figure: histogram of RI; y-axis: Percent of Total]
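A histogram with a percent-of-total axis can be sketched with the lattice package, which ships with R. Whether the original script used lattice or base hist for this slide is not stated, so treat that choice as an assumption, as is the availability of mlbench.

```r
# Assumes the mlbench package is installed; lattice ships with R.
library(mlbench)
library(lattice)
data(Glass)

# Bin the refractive index and plot the percentage of observations per bin.
h <- histogram(~ RI, data = Glass, xlab = "RI")
print(h)   # lattice plots must be printed when run from a script
```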
First Thing: Look at your Data
• Histograms conditioned on other variables: use
lattice package
[Figure: histograms of RI conditioned on glass group membership (groups 1, 2, 3, 5, 6, 7); y-axis: Percent of Total]
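Conditioning in lattice is written with the | operator in the formula. A minimal sketch, assuming mlbench is installed and that the group membership is the Type column of Glass:

```r
# Assumes the mlbench package is installed; lattice ships with R.
library(mlbench)
library(lattice)
data(Glass)

# One histogram panel of RI per glass group (the Type factor).
h.by.type <- histogram(~ RI | Type, data = Glass, xlab = "RI")
print(h.by.type)
```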
First Thing: Look at your Data
• Probability density plots: also needs lattice
[Figure: probability density plot of RI; y-axis: Density]
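A density plot can be sketched with densityplot from lattice, again assuming mlbench is installed:

```r
# Assumes the mlbench package is installed; lattice ships with R.
library(mlbench)
library(lattice)
data(Glass)

# Smoothed estimate of the probability density of RI.
d <- densityplot(~ RI, data = Glass, xlab = "RI")
print(d)
```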
First Thing: Look at your Data
• Empirical probability distribution plots: also
called empirical cumulative distribution function (CDF) plots
[Figure: empirical CDF of RI; y-axis: Empirical CDF, from 0.0 to 1.0]
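An empirical CDF can be computed and plotted with the base function ecdf. This sketch assumes mlbench is installed; the name F.hat and the value 1.52 are illustrative.

```r
# Assumes the mlbench package is installed.
library(mlbench)
data(Glass)

# ecdf() returns a function giving the fraction of observations <= a value.
F.hat <- ecdf(Glass$RI)
plot(F.hat, xlab = "RI", ylab = "Empirical CDF", main = "")

# Fraction of fragments with RI at or below 1.52.
F.hat(1.52)
```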
First Thing: Look at your Data
• Box and Whiskers plots:
[Figure: box-and-whiskers plot of RI, annotated with the range, possible outliers, the 25th percentile (1st quartile), the median (50th percentile) and the 75th percentile (3rd quartile)]
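The quantities a box-and-whiskers plot displays can be checked numerically next to the plot itself. A sketch using RI from the Glass data, assuming mlbench is installed:

```r
# Assumes the mlbench package is installed.
library(mlbench)
data(Glass)

# The box spans the 1st to 3rd quartile; the line inside it is the median.
q <- quantile(Glass$RI, probs = c(0.25, 0.50, 0.75))
q

# Horizontal box-and-whiskers plot; points beyond the whiskers are possible outliers.
boxplot(Glass$RI, horizontal = TRUE, xlab = "RI")
```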
Visualizing Data
• Note the relationship:
First Thing: Look at your Data
• Box and Whiskers plots:
[Figure: box-whiskers plots for actual variable values (Al, Ba, Ca, Fe, K, Mg, Na, RI, Si)]
[Figure: box-whiskers plots for scaled variable values (Al, Ba, Ca, Fe, K, Mg, Na, RI, Si)]
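Side-by-side box plots of actual and scaled values might be drawn as below. This assumes mlbench is installed and takes "scaled" to mean centering each variable to mean 0 and standard deviation 1 with the base function scale, which is an assumption about what the original script did.

```r
# Assumes the mlbench package is installed.
library(mlbench)
data(Glass)

# Keep only the numeric measurement columns (drops the Type factor).
glass.num <- Glass[, sapply(Glass, is.numeric)]

# Actual values: variables on very different scales are hard to compare.
boxplot(glass.num, ylab = "values")

# Scaled values: each column centered to mean 0 and scaled to sd 1.
glass.scaled <- scale(glass.num)
boxplot(glass.scaled, ylab = "values")
```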
Confidence Intervals
• A confidence interval (CI) gives a range in which
a true population parameter may be found.
• Specifically, (1 – α)×100% CIs for a parameter,
constructed from random samples (of a given sample
size), will contain the true value of the parameter
approximately (1 – α)×100% of the time.
• Different from tolerance and prediction intervals
Confidence Intervals
• Caution: IT IS NOT CORRECT to say that there is a
(1 - α)×100% probability that the true value of a
parameter is between the bounds of any given CI.
[Figure: graphical representation of 90% CIs for a parameter. Take a sample and compute a CI, repeatedly; here 90% of the CIs contain the true value of the parameter.]
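The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under assumed settings (normal data, true mean 0, samples of size 11, 1000 repetitions); none of these numbers come from the slides.

```r
set.seed(1)          # make the simulation reproducible
n.samples <- 1000    # how many times we take a sample and compute a CI
n <- 11              # size of each sample
true.mean <- 0       # the true value of the parameter (known here by construction)

covered <- logical(n.samples)
for (i in 1:n.samples) {
  x <- rnorm(n, mean = true.mean, sd = 1)       # take a sample
  ci <- t.test(x, conf.level = 0.90)$conf.int   # compute a 90% CI for the mean
  covered[i] <- ci[1] <= true.mean && true.mean <= ci[2]
}

# Fraction of the CIs that contain the true mean; should be close to 0.90.
mean(covered)
```

Any single interval either contains the true mean or it does not; the 90% describes how often the procedure succeeds over many samples.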
Confidence Intervals
• Construction of a CI for a mean depends on:
  • Sample size n
  • Standard error for means: $s_{\bar{x}} = \dfrac{s}{\sqrt{n}}$
  • Level of confidence 1-α
    • α is significance level
    • Use α to compute tc-value
• (1-α)×100% CI for population mean using a sample
average and standard error is:
$\left[\, \bar{x} - t_c\, s_{\bar{x}},\ \bar{x} + t_c\, s_{\bar{x}} \,\right]$
Confidence Intervals
• Compute a 99% confidence interval for the mean using this sample
set:
Fragment #   Fragment nD
1            1.52005
2            1.52003
3            1.52001
4            1.52004
5            1.52000
6            1.52001
7            1.52008
8            1.52011
9            1.52008
10           1.52008
11           1.52008
$\bar{x} = 1.52005$
$s = 0.00004$
$s_{\bar{x}} = 0.00001$
$\alpha = 0.01$
($\alpha/2 = 0.005$, $n - 1 = 10$ degrees of freedom) $t_c = 3.17$
Putting this together:
[1.52005 - (3.17)(0.00001), 1.52005 + (3.17)(0.00001)]
99% CI for the population mean = [1.52002, 1.52009] (endpoints computed from the unrounded mean and standard error)
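The same calculation can be reproduced in R. This is a sketch, not the contents of confidence_intervals.R; the variable names are illustrative. It builds the interval by hand and then compares it with the built-in t.test.

```r
# The 11 fragment refractive indices from the table above.
nD <- c(1.52005, 1.52003, 1.52001, 1.52004, 1.52000, 1.52001,
        1.52008, 1.52011, 1.52008, 1.52008, 1.52008)

n     <- length(nD)     # sample size
x.bar <- mean(nD)       # sample average
s     <- sd(nD)         # sample standard deviation
se    <- s / sqrt(n)    # standard error of the mean

alpha <- 0.01                          # significance level for a 99% CI
t.c   <- qt(1 - alpha / 2, df = n - 1) # critical t-value with n - 1 degrees of freedom

# CI by hand: average plus or minus t.c standard errors.
ci <- c(x.bar - t.c * se, x.bar + t.c * se)
round(ci, 5)

# Same interval from the built-in one-sample t-test.
t.test(nD, conf.level = 0.99)$conf.int
```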
*Try out confidence_intervals.R