Uploaded by Piyush Chauhan

L01 5002 14a 4pp

9/03/2015
Stat 5002
An Introduction to Statistics with Applications
in Computing
Lecture 1
Introduction to Statistical Thinking
• Samples and Populations;
• Sample Statistics and Population Parameters;
• Graphical summaries of Data;
• Numerical summaries of Data.
https://elearning.sydney.edu.au/webapps/
Objectives of STAT5002
To introduce students to
• basic statistical concepts and methods for further studies.
• methodologies related to statistical data analysis and Data Mining.
• a number of useful statistical models
• computer oriented estimation procedures
• smoothing and nonparametric concepts
• analysis of large data sets.
• the R computing language for all computational aspects in the course
©Sydney University
Objective of Statistics
We use information from a SAMPLE to answer questions or discover features about a target POPULATION.
Samples and Populations
Populations (ALL)
• Define the target population: the population to which we want to generalize our findings.
• Specify characteristics that identify the members of the population. Who/What? Where? When?
Example: Characteristics such as age, income, education, gender and marital status are typically used in studies concerning people.
• A sampling frame is a list or rule defining the population. A complete frame is usually unachievable, and we often need to restrict our studies to the population to which we can gain access.
Samples: Some of All
A random sample is one where each member of the population has the same chance of being selected.
It is often difficult, or even impossible, to obtain a random sample.
Representative Sample
• Individual observations should be selected independently!
• Samples need to be representative of the population (not biased).
• Sample size needs to be large enough!
Independent observations ∴ random sample: Representative of population
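A simple random sample, as defined above, can be sketched in R with `sample()`, which by default draws without replacement and gives every member the same chance of selection. The population of 1000 numbered members, the sample size of 50 and the seed are made-up values for illustration only.

```r
# Hypothetical population: members labelled 1 to 1000 (illustrative only)
population <- 1:1000

set.seed(5002)  # fix the random seed so the draw can be repeated

# Simple random sample of 50 members, drawn without replacement,
# so each member has the same chance of being selected
srs <- sample(population, size = 50)

length(srs)         # number of members selected
anyDuplicated(srs)  # 0 means no member was selected twice
```

Whether such a sample is representative still depends on the sampling frame: `sample()` can only choose from the members that were listed in `population`.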
Samples need to be representative of the target population.
Observations within samples must be independent of each other.
Samples must not be biased!
Bias
Bias may be defined as any systematic error (i.e. not occurring randomly) which results in incorrect conclusions about the target population.
Some types of bias include
• selection bias
• measurement bias
• response bias
• confounding
Types of Bias
• Selection Bias
Selection bias refers to any systematic differences occurring in the way that subjects are selected for a study.
• Measurement Bias
Measurement bias refers to systematic differences in the measurement of variables.
• Response Bias
Response bias can occur when the response rate to a survey is too low.
• Confounding
A confounder is a variable that distorts (increases or decreases) the apparent effect of one variable (determinant) on another variable (outcome).
Two Schools of Thought
Frequentist
• Population is fixed
• Samples vary (somewhat)
Bayesian
• Population varies
• Sample is fixed
Scope of Statistics/Data Mining
Understand Problem!!
Scope of Statistics (Study): Design Study → Collect Sample → Organise Data → Exploratory Data Analysis → Interpretation of Results → Report Results
Data Mining: Obtain Data → Organise Data → Data Analysis → Interpretation of Results → Report Results?
Where do data come from?
Types of Statistical Studies
Statistical Studies
• Observational Studies
• Experimental Studies
Data Mining
• from Databases
An observational study is one in which there is no intervention by the investigator nor is there any treatment imposed.
An experimental study is one in which the investigator has some control over the determinant.
Obtaining Data
Statistical Studies
• Observational Studies
• Experimental Studies
Stanford prison experiment
http://archive.bio.ed.ac.uk/jdeacon/statistics/tress2.html
http://www.med.uottawa.ca/sim/data/Study_Designs_e.htm
Data Mining
• from Databases,
http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
Experimental Studies
[Figure: Population → Sampling → Sample → Randomisation → Experimental Group and Control Group. Each group has a First Data Collection (Before); the Experimental Group then receives Treatment and the Control Group No Treatment; each group has a Second Data Collection (After). The two groups are compared at each stage (Compare!).]
CRISP Data Mining
Cross Industry Standard Process for Data Mining
Aim: To develop an industry, tool and application neutral process for conducting Knowledge Discovery (KD).
(c)Sydney University, 2014
Data: Evidence from Samples
Variables
Measurements taken on subjects in a study vary amongst subjects.
These measurements (data) are usually organised in a spreadsheet consisting of rows and columns.
• The rows contain information about individual subjects or records.
• The columns contain the values of the measurements that vary: the variables.
Variables usually take on specific roles: determinants influence outcomes.
Determinants: Predictors, Explanatory variable/s, Input, independent variable/s
Outcomes: Response variable/s, Output, dependent variable/s
A Spreadsheet
http://www.bom.gov.au/climate/data/
BOM station number  Month  Day  Max_1913  Min_1913  Max_2013  Min_2013  >34_1913  >34_2013  <9_1913  <9_2013  Diff Max  Diff Min
66062               1      1    25        19.1      26.2      20.2      0         0         0        0        7.1       1.1
66062               1      2    27.1      17.1      22.9      20.3      0         0         0        0        5.8       3.2
66062               1      3    32.6      20.7      24.8      18.4      0         0         0        0        4.1       -2.3
66062               1      4    21.9      17.5      26.6      18.3      0         0         0        0        9.1       0.8
66062               1      5    23.1      15.8      28.3      20.9      0         0         0        0        12.5      5.1
66062               1      6    24.6      15.4      28        21.6      0         0         0        0        12.6      6.2
66062               1      7    23.9      18.9      27.5      21.4      0         0         0        0        8.6       2.5
66062               1      8    23.8      18.6      42.3      20.9      0         1         0        0        23.7      2.3
66062               1      9    23.9      16.8      25        21.1      0         0         0        0        8.2       4.3
66062               1      10   25.2      16        25.4      20.2      0         0         0        0        9.4       4.2
66062               1      11   26.3      19.8      29.6      21.2      0         0         0        0        9.8       1.4
66062               1      12   26.9      20.1      31.2      23.5      0         0         0        0        11.1      3.4
66062               1      13   31.3      19.8      23.8      20.7      0         0         0        0        4         0.9
66062               1      14   25.2      20        23.7      17.1      0         0         0        0        3.7       -2.9
66062               1      15   25.9      20.1      24.9      16.8      0         0         0        0        4.8       -3.3
66062               1      16   27.1      20.9      27.2      19.1      0         0         0        0        6.3       -1.8
66062               1      17   27.8      20.4      29        21.4      0         0         0        0        8.6       1
66062               1      18   30.6      19.4      45.8      21.7      0         1         0        0        26.4      2.3
66062               1      19   27.7      20.2      24.8      21.5      0         0         0        0        4.6       1.3
66062               1      20   21.7      19.6      24.3      20.2      0         0         0        0        4.7       0.6
66062               1      21   22.2      16.9      26.6      20.7      0         0         0        0        9.7       3.8
66062               1      22   24.6      15        29.6      20.9      0         0         0        0        14.6      5.9
Types of Data
[Diagram: Variable Types. Categorical/Group: nominal, ordinal. Numerical/Quantitative: discrete, continuous.]
Categorical variables
Categorical variables are variables where each observation falls into one of a finite number of groups.
Nominal variables: named variables with no implicit order.
Examples: Type of cancer, Personality type
Ordinal variables: grouped variables with implicit order.
Examples: Level of education, grade
If there are two groups the variable is often referred to as being binary or dichotomous (having two possible values).
Binary variables can be either
• nominal, such as sex, or
• ordinal, such as age group, e.g. < 20 years, ≥ 20 years.
[Figure: examples. Colour (nominal): White, Green, Purple. Size (ordinal): Small, Medium, Large.]
Numerical / Quantitative Variables
Numerical variables are measured variables and can be either discrete or continuous.
Discrete variables are variables that take discrete values: e.g. number of children, number of people in a store.
Continuous variables are those that can assume any value within a certain range or interval: e.g. height, weight, pulse rate.
Numerical variables are also referred to as interval or scale variables.
'Numerical' data
• cover a range of values
• usually measured with an instrument or along some scale, or counted (but a large number).
Discrete: can only take some values
For example:
• Marks in a test (max half mark accuracy only)
• Number of steps walked in a day (whole numbers)
Continuous: can take any values
Example:
• Distance walked in a day (2.6km, 2.67km, 2.675km, etc.)
• University entrance scores
The Variables in the Spreadsheet
Variable            Description                              Data Type
BOM station number  Bureau of Meteorology station number     Categorical, Nominal
Month               Month of Year (1:12)                     Categorical, Ordinal
Day                 Day of Month (1:31)                      Numeric, discrete
Max_1913            1913 Daily max temp (°C)                 Numeric, continuous
Min_1913            1913 Daily min temp (°C)                 Numeric, continuous
Max_2013            2013 Daily max temp (°C)                 Numeric, continuous
Min_2013            2013 Daily min temp (°C)                 Numeric, continuous
Very Hot?_1913      1: 1913 max temp > 34; 0: otherwise      Categorical, Ordinal
Very Hot?_2013      1: 2013 max temp > 34; 0: otherwise      Categorical, Ordinal
Very Cold?_1913     1: 1913 min temp < 9; 0: otherwise       Categorical, Ordinal
Very Cold?_2013     1: 2013 min temp < 9; 0: otherwise       Categorical, Ordinal
Diff Max            Max_2013 - Max_1913                      Numeric, continuous
Diff Min            Min_2013 - Min_1913                      Numeric, continuous
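The indicator variables described above (1 if the condition holds, 0 otherwise) can be derived from the temperature columns with a logical comparison. The sketch below uses the 2013 values for days 1 to 10 copied from the spreadsheet; the vector names are chosen here for illustration and are not the column names used in the course files.

```r
# 2013 daily maximum and minimum temperatures for days 1 to 10,
# copied from the spreadsheet (vector names are illustrative)
max_2013 <- c(26.2, 22.9, 24.8, 26.6, 28.3, 28, 27.5, 42.3, 25, 25.4)
min_2013 <- c(20.2, 20.3, 18.4, 18.3, 20.9, 21.6, 21.4, 20.9, 21.1, 20.2)

# A comparison gives TRUE/FALSE; as.integer() turns that into 1/0
very_hot_2013  <- as.integer(max_2013 > 34)  # 1: max temp > 34, 0: otherwise
very_cold_2013 <- as.integer(min_2013 < 9)   # 1: min temp < 9,  0: otherwise

very_hot_2013
very_cold_2013
```

Only day 8 (42.3) exceeds 34 in these ten days, which agrees with the >34_2013 column of the spreadsheet.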
Summarising Data: Graphical Methods
All graphs need
• A title
• Clearly labelled axes
• Appropriate comments
• to have clarity
• to be aesthetically satisfying
Displaying Categorical Data: One Variable: Bar Chart
[Figure: bar chart "Number of Very Hot Days in 1913"; x-axis Temperature (Not so Hot, >34 C); y-axis 0 to 400.]
[Figure: bar chart "Number of Very Cold Days 2013"; categories Not so Cold, <9C; y-axis 0 to 350.]
Contingency Table: Showing counts for Two Categorical Variables
Year   < 9C  Not so extreme  > 34C  Total
1913   65    295             5      365
2013   33    326             6      365
Total  98    621             11     730
Clustered bar chart
A Clustered Bar Chart is a visual display showing associations between two categorical variables.
[Figure: clustered bar chart "Numbers of Very Hot, and Very Cold Days in 1913 and 2013"; x-axis Temperatures (< 9C, Not so extreme, > 34C); y-axis Number of Days, 0 to 350; one bar per year (1913, 2013) in each category.]
It appears that
• on most days the temperatures were not extreme in both 1913 and 2013
• there was a larger proportion of extremely cold days in 1913 than in 2013
• the proportion of very hot days was low in both years
Numerical Summary: Categorical Data
For categorical data we simply tabulate the counts and/or proportions of data (denoted p in a sample, or π in a population) in the categories of interest.
Counts of Days in each Year
Year   < 9C  Not so extreme  > 34C
1913   65    295             5
2013   33    326             6
Percentages of Days in each Year
Year   < 9C    Not so extreme  > 34C
1913   17.81%  80.82%          1.37%
2013   9.04%   89.32%          1.64%
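The percentages of days in each year follow from the counts by dividing each row by its total. A minimal sketch in R, typing the counts in by hand rather than reading them from the course data file:

```r
# Counts of days by year and temperature category, entered by hand
counts <- as.table(matrix(c(65, 295, 5,
                            33, 326, 6),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Year = c("1913", "2013"),
                                          Temperature = c("< 9C", "Not so extreme", "> 34C"))))

addmargins(counts)  # adds the row and column totals

# Row percentages: each count divided by its year total, times 100
pct <- round(100 * prop.table(counts, margin = 1), 2)
pct
```

Using `margin = 1` conditions on the rows (years), so each row of `pct` sums to 100 up to rounding; `margin = 2` would instead give percentages within each temperature category.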
Displaying Numerical Data
Histogram
A histogram is a simple and effective display, useful for displaying the distribution of numerical data.
A histogram shows the number of observations that fall into each of several non-overlapping groups or bins.
The bins of a histogram adjoin each other so there are no gaps between bins, unless a bin is empty.
[Figure: histogram "Daily Minimum Temperatures in 2013"; x-axis 7 to 24; y-axis 0 to 40.]
Median and quartiles
A boxplot displays a five-number summary of a numerical set of data. These numbers are
Minimum: the smallest value
Lower Quartile: separates the lower 25% of values from the rest
Median: the half-way point of the data
Upper Quartile: separates the upper 25% of values from the rest
Maximum: the largest value
A boxplot also identifies any unusually large or small values in a dataset, called outliers.
Structure of a Box Plot
[Figure: box plot with its parts labelled: whiskers, lower quartile, median, upper quartile, outliers.]
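The five-number summary and the outliers flagged by a box plot can be computed directly in R. The sketch below uses the 2013 daily maximum temperatures for the first 22 days of January from the spreadsheet shown earlier. Note that `fivenum()` uses Tukey's hinges for the quartiles, which is one of several quartile conventions and may differ slightly from a hand calculation.

```r
# 2013 daily maximum temperatures, days 1 to 22 (from the spreadsheet)
max_2013 <- c(26.2, 22.9, 24.8, 26.6, 28.3, 28, 27.5, 42.3, 25, 25.4, 29.6,
              31.2, 23.8, 23.7, 24.9, 27.2, 29, 45.8, 24.8, 24.3, 26.6, 29.6)

# Five-number summary: minimum, lower quartile, median, upper quartile, maximum
five <- fivenum(max_2013)
names(five) <- c("Minimum", "Lower Quartile", "Median", "Upper Quartile", "Maximum")
five

# Values that boxplot() would draw as outliers
# (more than 1.5 box lengths beyond the quartiles)
out <- boxplot.stats(max_2013)$out
out

# The plot itself
boxplot(max_2013, horizontal = TRUE,
        main = "Daily Maximum Temperatures, 1-22 January 2013",
        xlab = "Temperature (C)")
```

In this small extract the two very hot days (42.3 and 45.8) are the values flagged as outliers.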
Comparative Box Plots
Box plots enable the comparison of several samples of data simultaneously.
Comparing box plots
When making comparisons using box plots compare
• centres
• spreads and
• mention unusual observations
[Figure: comparative box plots "Daily Minimum and Maximum Temperatures, 1913 and 2013"; groups 1913_Min, 2013_Min, 1913_Max, 2013_Max; x-axis Temperature °C, 0 to 50.]
It appears that both minimum and maximum daily temperatures in 2013 were slightly higher than those in 1913.
See: http://freedom.indiemaps.com/
Scatter Plot
A scatter plot shows the relation between two numerical variables.
The two variables, X and Y, are referred to as the predictor and response variable respectively, although they do have other names.
X: predictor, determinant, independent
Y: response, outcome, dependent
If X increases and Y increases then a POSITIVE relation exists.
If X increases and Y decreases then a NEGATIVE relation exists.
Construction of scatter plot
• Draw X and Y axes to cover the range of the two variables.
• Plot one point for each observation, i.e. (x, y).
• Label the axes and mark the scale.
• Comment on the plot.
Scatterplots of Temperatures
[Figure: two scatter plots. "Minimum Temperatures: 1913 and 2013": x-axis Minimum Temperatures 1913, y-axis Minimum Temperatures 2013. "Maximum Temperatures: 1913 and 2013": x-axis Maximum Temperatures 1913, y-axis Maximum Temperatures 2013. All axes run from 4 to 44, and each plot has a diagonal line.]
The points on the diagonal lines represent days where the minimum (or maximum) temperature in 1913 was the same as in 2013.
Is there a sensible message here??
http://www.gapminder.org/videos/the-joy-of-stats/
Displaying Data
Data Type    One Variable Only  Categorical            Numerical
Categorical  Bar Chart          Clustered bar chart    Comparative Box plots
Numerical    Histograms         Comparative Box plots  Scatter plots
http://www.gapminder.org/world
Wordle
http://www.oceancalendars.com.au
Data summaries: Numerical Data
Measures of Centre
Mode:
The most frequently occurring value in the dataset.
The data may be nominal, ordinal or numeric.
Median:
The middle value when all the data are placed in order.
The data must be ordinal or numerical.
For an even number of values the median is the average of the two middle values.
Mean:
The Arithmetic Average.
The data must be either discrete or continuous.
The mean is calculated by dividing the 'sum of the values' by the 'number of the values'.
http://www.youtube.com/watch?v=oNdVynH6hcY
The Mean
The mean is calculated by dividing the 'sum of the values' by the 'number of the values'.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
$x_i$: the $i$-th value of the data
$\bar{x}$: the average or the 'mean' of the x values
$\Sigma$ (sigma): 'the sum of'.
The mean is the centre of gravity (point of balance) of the data.
Medians and Means
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
Mean versus median
The median cuts the data into two sections with the same number of observations in each (50% and 50%).
Symmetric Data: Mean = Median
The mean is affected by outliers, the median is not.
Mean: Centre of balance
Data: 1, 3, 6, 10
Mean = (1 + 3 + 6 + 10)/4 = 5
[Figure: the four data values on a number line from 0 to 10, balanced at the mean, 5.]
Samples and Populations
We use Sample Statistics to estimate Population Parameters
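The worked example with the data 1, 3, 6, 10 can be checked in R, and changing one value shows how differently the mean and the median react to an outlier. The replacement value of 100 is a made-up outlier used only for illustration.

```r
# Data from the worked example
x <- c(1, 3, 6, 10)

mean_x   <- mean(x)    # (1 + 3 + 6 + 10) / 4
median_x <- median(x)  # average of the two middle values, 3 and 6

# Replace the largest value with a hypothetical outlier
x_out <- c(1, 3, 6, 100)

mean_out   <- mean(x_out)    # pulled towards the outlier
median_out <- median(x_out)  # unchanged: the middle values are still 3 and 6

c(mean = mean_x, median = median_x)
c(mean = mean_out, median = median_out)
```

The mean moves from 5 to 27.5 while the median stays at 4.5.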
Sample Statistics estimate Population Parameters
         Sample Statistic  Population Parameter
Mean     $\bar{x}$         $\mu$
Median   $\tilde{x}$       $\tilde{\mu}$
Measures of spread
Numeric data is often described, or summarised, using two statistics
• a measure of centrality, or location, and
• a measure of spread, or dispersion.
A measure of variability is important
[Figure: box plots "Daily Minimum and Maximum Temperatures, 2013"; groups Minimum, Maximum; axis Temperature °C, 0 to 50.]
The inter-quartile range
The inter-quartile range (IQR) is the difference between the upper and lower quartiles in an ordered set of numerical data.
IQR = UQ - LQ
The IQR gives the range of the middle 50% of a set of data, so is sometimes called the midspread.
The inter-quartile range is rarely influenced by outliers in the data.
[Figure: box plots "Daily Minimum and Maximum Temperatures, 2013"; groups Minimum, Maximum; axis Temperature °C, -10 to 50.]
For the minimum temperatures in 2013:
IQR ≈ 18 - 11 = 7
For the maximum temperatures in 2013:
IQR ≈ 26.5 - 21 = 5.5
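As a small check of the claim that the IQR is rarely influenced by outliers, the sketch below computes it for a made-up data set before and after one value is replaced by an extreme one. R's `IQR()` uses the default quartile rule of `quantile()` (type 7), which is one of several conventions, so results on other data may differ slightly from a hand calculation.

```r
# Illustrative data (not from the temperature file)
x     <- c(1, 3, 5, 7, 9)
x_out <- c(1, 3, 5, 7, 90)  # largest value replaced by a hypothetical outlier

# Upper and lower quartiles, then IQR = UQ - LQ
q      <- quantile(x, probs = c(0.25, 0.75))
iqr_x  <- unname(q[2] - q[1])

iqr_builtin <- IQR(x)      # same calculation done by R
iqr_out     <- IQR(x_out)  # the outlier does not change the quartiles here

c(by_hand = iqr_x, builtin = iqr_builtin, with_outlier = iqr_out)
```

All three values are 4, because the quartiles (3 and 7) do not depend on how extreme the largest value is.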
The range
The range is the difference between the maximum value and the minimum value in an ordered set of numerical data.
Range = max - min
The range will be influenced by outliers in the data.
[Figure: box plots "Daily Min and Max Temps, 2013"; groups Minimum, Maximum; axis Temperature °C, 0 to 50.]
For the minimum temperatures in 2013:
Range ≈ 24 - 7 = 17
For the maximum temperatures in 2013:
Range ≈ 46 - 13 = 33
The Standard Deviation
The standard deviation is a measure of how closely the data are grouped about the mean.
• The larger the standard deviation the greater the spread.
It is defined in terms of the deviations of the data from the mean (called residuals).
Residual = $x_i - \bar{x}$, i.e. observed value - sample mean.
The sample standard deviation, s, is the square root of the average (sort of) squared residual.
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
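The definition of the sample standard deviation can be followed step by step in R and compared with the built-in `sd()`, which also divides by n - 1. The data values 1, 3, 5, 7, 9 are used here purely as a small illustration.

```r
# Small illustrative data set
x <- c(1, 3, 5, 7, 9)
n <- length(x)

# Residuals: observed value minus sample mean
residuals <- x - mean(x)

# Square root of the "average (sort of)" squared residual:
# the sum of squared residuals is divided by n - 1, not n
s_by_hand <- sqrt(sum(residuals^2) / (n - 1))

# Built-in sample standard deviation for comparison
s_builtin <- sd(x)

c(by_hand = s_by_hand, builtin = s_builtin)
```

Both values agree (about 3.16), and the residuals sum to zero, which is why the mean is the point of balance of the data.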
Deviations of points from the mean
[Figure: points on a number line (axis -1 to 9) with their deviations from the mean marked: 2.5, 1.5, -0.5, -3.5.]
Standard deviation (s)
A measure of how much the data are spread around the mean
Data           Mean  sd
5 5 5 5 5 5 5  5     0
1 3 5 7 9      5     3.16
0 5 15 34 86   28    34.94
Sample Statistics estimate Population Parameters
          Sample Statistic  Population Parameter
Mean      $\bar{x}$         $\mu$
Median    $\tilde{x}$       $\tilde{\mu}$
Std.dev   $s$               $\sigma$
Variance  $s^2$             $\sigma^2$
The variance, $\sigma^2$, is the square of the standard deviation and is estimated by $s^2$.
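The three small data sets from the standard deviation table can be used to check both the tabulated values and the relation between variance and standard deviation. This is a sketch with the data typed in by hand; the list name `datasets` is arbitrary.

```r
# The three data sets from the standard deviation table
datasets <- list(c(5, 5, 5, 5, 5, 5, 5),
                 c(1, 3, 5, 7, 9),
                 c(0, 5, 15, 34, 86))

means <- sapply(datasets, mean)  # sample means
sds   <- sapply(datasets, sd)    # sample standard deviations, s
vars  <- sapply(datasets, var)   # sample variances, s^2

round(sds, 2)

# The sample variance is the square of the sample standard deviation
all.equal(vars, sds^2)
```

The first data set has no spread at all, so both its standard deviation and its variance are 0.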
Doing it with R!
The data in Excel
The data we have been using this week are stored in an Excel workbook named Daily MaxMin Temp_1859-2013.xlsx.
The data we will be using are stored in the spreadsheet called 1913; 2013.
Use File…Save as … and save the data in Text (Tab-delimited) (*.txt) format, named Daily MaxMin Temp_1859-2013.txt
Reading in the Data into R
From the File Drop Down menu in R select Change dir… and change the working directory in R to the directory and folder where your data are stored in Excel.
• Read in the data, type
# header=TRUE: the first row of the dataset contains the names of each variable
temp.dat <- read.table("Daily MaxMin Temp_1859-2013.txt", header=TRUE)
• To look at the first 10 rows of data, type
temp.dat[1:10, ]
• To edit the data, type
fix(temp.dat)
(Make changes directly on the spreadsheet)
Renaming variables
You can rename variables programmatically or interactively.
# rename interactively
fix(mydata) # results are saved on close
# rename programmatically
# you can re-enter all the variable names in order
# changing the ones you need to change.
# the limitation is that you need to enter all of them!
names(mydata) <- c("x1","age","y", "ses")
# Recoding a continuous variable into a categorical variable
# Mark those days whose maximum temperature is >34 as "VeryHot", and those
# with <=34 as "NotVeryHot":
temp.dat$VHot2013[temp.dat$Max_2013 > 34] <- "VeryHot"
temp.dat$VHot2013[temp.dat$Max_2013 <= 34] <- "NotVeryHot"
# Convert the column to a factor!!!
temp.dat$VHot2013 <- factor(temp.dat$VHot2013)
Graphing data in R
Some Graphics commands
R command  Outcome
plot()     2-D scatterplot
barplot()  Bar graph
hist()     Histogram
lines()    Adds lines to a plot
points()   Adds points to a plot
legend()   Adds a legend to the plot
axis()     Adds an axis to the plot
Setting the Graphing Parameters
The par() function defines the settings for subsequent commands.
Arguments within other graphics functions can also be used.
http://www.statmethods.net/advgraphs/parameters.html
http://research.stowers-institute.org/efg/R/Graphics/Basics/maroma/index.htm?utm_source=twitterfeed&utm_medium=twitter
Example:
par(mfrow=c(1,1), mar=c(3.0,3.0,3.0,3.0), mgp=c(1.1,0.1,0),
    oma=c(0,2,1.4,0), las=1, tcl=0.2, cex=0.8)
Bar Charts in R
• To construct a bar chart of the categorical variable VHot2013, type
# barplot() needs the counts in each category, not the raw factor
counts <- table(temp.dat$VHot2013)
barplot(counts)
## Detail:
# the factor levels are in alphabetical order: "NotVeryHot", then "VeryHot",
# so the bar labels in names.arg must be given in that order
barplot(counts, main="Number of Very Hot Days in 2013",
        names.arg=c("34C or less","More than 34C"),
        xlab="Maximum Temperature",
        ylab="Counts",
        col="darkred")
[Figure: bar chart "Number of Very Hot Days in 2013"; x-axis Maximum Temperature; y-axis Counts, 0 to 350.]
Numerical summaries in R
Presentation of Numerical data
Present numerical summaries of data in neatly organised tables, with column and row headings
• Easy to read!!!
          n    median  mean   std.dev
Min_1913  365  13.9    13.73  4.35
Max_1913  365  21.3    21.52  5.13
Min_2013  365  14.9    15.03  4.2
Max_2013  365  23.6    23.71  4.36
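A table of this shape, with one row per variable and columns n, median, mean and std.dev, can be built in R by applying the summary functions to each column of a data frame. The sketch below uses only the first five days of January from the spreadsheet, typed in by hand, so its numbers will not match the full-year table above; the data frame name `temps` is an assumption made for this example.

```r
# First five days of January from the spreadsheet (illustrative extract only)
temps <- data.frame(
  Min_1913 = c(19.1, 17.1, 20.7, 17.5, 15.8),
  Max_1913 = c(25, 27.1, 32.6, 21.9, 23.1),
  Min_2013 = c(20.2, 20.3, 18.4, 18.3, 20.9),
  Max_2013 = c(26.2, 22.9, 24.8, 26.6, 28.3)
)

# sapply() returns one column per variable; t() turns variables into rows
summary_table <- t(sapply(temps, function(v) {
  c(n = length(v), median = median(v), mean = mean(v), std.dev = sd(v))
}))

round(summary_table, 2)
```

With the full data set read in as described earlier, the same call could be applied to the corresponding four columns of that data frame.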
Tables
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell proportions (multiply by 100 for percentages)
prop.table(mytable, 1) # row proportions
prop.table(mytable, 2) # column proportions
More examples in the Tutorial!
References
Introductory Statistics Lecture Notes, Macquarie University
Susan Imberman: notes on Data Mining vs. Statistics
Wasserman: Chapter 1
R
http://www.statmethods.net/
http://www.statmethods.net/graphs/
http://addictedtor.free.fr/graphiques/
http://www.rseek.org
http://www.cookbook-r.com/Graphs/Shapes_and_line_types/
http://rprogramming.net/
http://it-ebooks.info/book/537/
http://www.ats.ucla.edu/stat/r/