STATS 330: Lecture 2
4/8/2015
330 Lecture 2
1
Housekeeping matters
• STATS 762
– An extra test (details to be provided). An extra
assignment
• Class rep – ????
• Office hours
– Alan: 10:30 – 12:00 Tuesday and Thursday in Rm
265, Building 303S
– Tutors: TBA
• Assignment 1: Due 2 August
Today’s lecture:
Exploratory graphics
Aim of the lecture:
To give you a quick overview of the kinds
of graphs that can be helpful in exploring
data. Some of this material has been
covered in STATS 201/208. (We will discuss the R
code used to make the plots in Lecture 4.)
Exploratory Graphics: topics
• Exploratory graphics for a single variable: aim is to show the
distribution of values
– Histograms
– Kernel density estimators
– QQ plots
• For 2 variables: aim is to show relationships
– Both continuous: scatter plots
– One of each: side-by-side boxplots
– Both categorical: mosaic plots (see Ch 5)
• 3 variables
– Pairs plot
– Rotating plot
– Coplots
– 3D plots, contour plots
Single variable:
Exchange rate data
• The data of interest: daily changes in
log(exchange rate) for the US$/Kiwi.
– Monthly data from June 1986 to May 2012
– Source: Reserve Bank
• Questions:
– What is the distribution of the daily changes
in the logged exchange rate?
– Is it normal? If not, how is it different?
y_t = exchange rate at time t
Difference in logs is
log(y_t) - log(y_{t-1}) = log(y_t / y_{t-1})
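The difference in logs can be computed in one step in R. A minimal sketch, assuming the exchange rates sit in a numeric vector; the name `rate` and its values are made up for illustration and are not the Reserve Bank data:

```r
# Hypothetical exchange rates y_1, ..., y_5 (illustrative values only)
rate <- c(0.80, 0.82, 0.81, 0.79, 0.80)

# log(y_t) - log(y_{t-1}) for t = 2, ..., 5
diff.in.logs <- diff(log(rate))

# The same quantity written as the log of the ratio y_t / y_{t-1}
log.of.ratio <- log(rate[-1] / rate[-length(rate)])

# Both forms agree up to rounding error
all.equal(diff.in.logs, log.of.ratio)
```

A series of n rates gives n - 1 differences.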
Data Analysis
Suppose we have the data (3374 differences in
the logs) in an R vector, Diff.in.Logs:
hist(Diff.in.Logs, nclass=100, freq=FALSE)
# add density estimate
lines(density(Diff.in.Logs), col="blue", lwd=2)
# add fitted normal density
xvec = seq(-0.1, 0.1, length=100)
lines(xvec, dnorm(xvec, mean=mean(Diff.in.Logs),
                  sd=sd(Diff.in.Logs)), col="red", lwd=2)
Normal plot
> qqnorm(Diff.in.Logs)
Normal data?
No: the QQ plot indicates that the differences have longer
tails than normal, since the plotted points are lower than
the line for small values and higher for big ones.
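The long-tailed pattern can be reproduced with simulated data. This is only a sketch: the vector `x.sim` is drawn from a t distribution with 3 degrees of freedom as an assumed stand-in for long-tailed data, not the exchange rate data. `qqline` adds the reference line through the quartiles.

```r
set.seed(330)
# Simulated long-tailed sample (hypothetical stand-in for Diff.in.Logs)
x.sim <- 0.01 * rt(3374, df = 3)

qq <- qqnorm(x.sim)          # returns the plotted coordinates invisibly
qqline(x.sim, col = "red")   # reference line through the quartiles

# Slope and intercept of the same reference line, computed by hand
probs <- c(0.25, 0.75)
slope <- unname(diff(quantile(x.sim, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(x.sim, 0.25)) - slope * qnorm(0.25)

# Long tails: the largest point lies above the line, the smallest below it
upper.above <- max(qq$y) > intercept + slope * max(qq$x)
lower.below <- min(qq$y) < intercept + slope * min(qq$x)
c(upper.above, lower.below)
```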
Two Variables: Rats!
Of interest: growth rates of 16 rats
i.e. relationship between weight and time
• Want to explore the relationship graphically.
• Each rat was measured (roughly) every
week for 11 weeks
• For weeks 1-6, all rats were on a fixed diet.
Diet was changed after week 6.
Two Variables: Rats!
Data set rats.df has variables
– rat (1-16)
– growth (weight in grams)
– day (day since start of study; 11 values, at
approximately weekly intervals)
– group (litter, one of 3)
– change (has values 1 or 2: diet was changed
after 6 weeks, so diet 1 for weeks 1-6 and diet 2 for
weeks 7-11)
Rats: the data
> rats.df
   growth group rat change day
1     240     1   1      1   1
2     250     1   1      1   8
3     255     1   1      1  15
4     260     1   1      1  22
5     262     1   1      1  29
6     258     1   1      1  36
7     266     1   1      2  43
8     266     1   1      2  44
9     265     1   1      2  50
10    272     1   1      2  57
11    278     1   1      2  64
12    225     1   2      1   1
13    230     1   2      1   8
... More data
Rats (cont)
• Could plot weight (i.e. the variable
growth) versus the variable day:
with(rats.df, plot(day, growth))
BUT….
[Figure: scatter plot of growth against day for all rats]
Criticisms
• Can’t tell which points belong to which rat
• Seem to be 2 groups of points
• In fact, the rats came from 3
different litters. Is this relevant?
• Could do better
More rats: improvements
• Join points representing the same rat with
a line
• Use different colours (or different line
types e.g. dashed or dotted) for the
different litters
• Use a legend
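These three improvements can be sketched in base R. The real rats.df is not reproduced here, so the code builds a simulated stand-in called `rats.sim`; the litter sizes, starting weights and growth rate are all assumptions made only to have something to plot.

```r
set.seed(330)
# Hypothetical stand-in for rats.df: 16 rats, 3 litters, 11 measurement days
days <- c(1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64)
litter <- rep(1:3, times = c(8, 4, 4))   # assumed litter sizes
rats.sim <- do.call(rbind, lapply(1:16, function(r) {
  start <- c(250, 450, 500)[litter[r]] + rnorm(1, sd = 15)  # assumed start weights
  data.frame(rat = r, group = litter[r], day = days,
             growth = start + 0.5 * days + rnorm(length(days), sd = 3))
}))

litter.cols <- c("black", "red", "blue")

# Set up empty axes, then join the points for each rat with a line
plot(growth ~ day, data = rats.sim, type = "n", main = "Growth rate of rats")
for (r in unique(rats.sim$rat)) {
  one.rat <- rats.sim[rats.sim$rat == r, ]
  lines(one.rat$day, one.rat$growth, col = litter.cols[one.rat$group[1]])
}

# Legend linking colours to litters
legend("topleft", legend = paste("Litter", 1:3), col = litter.cols, lty = 1)
```

Passing `lty` instead of `col` to `lines` and `legend` would give dashed or dotted lines per litter in place of colours.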
[Figure: "Growth rate of rats". Growth against day, one line per rat, with a legend for Litter 1, Litter 2 and Litter 3]
More improvements
• Plot is too cluttered
• Could plot each rat on a different graph –
important to use same scales (axes) for
each graph
• This leads to the idea of “Trellis graphics”
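One way to draw such a display is the lattice package, which implements Trellis graphics in R. The sketch below uses simulated data in a hypothetical data frame `rats.sim` (the weights are invented), and the call shown is not necessarily the one used for the lecture's figures.

```r
library(lattice)

set.seed(330)
# Hypothetical stand-in for rats.df: 16 rats measured on 11 days
days <- c(1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64)
rats.sim <- expand.grid(day = days, rat = 1:16)
rats.sim$growth <- 250 + 15 * rats.sim$rat + 0.5 * rats.sim$day +
  rnorm(nrow(rats.sim), sd = 3)

# One panel per rat; by default every panel shares the same x and y scales
rat.plot <- xyplot(growth ~ day | factor(rat), data = rats.sim, type = "l",
                   xlab = "day", ylab = "growth")
print(rat.plot)   # lattice plots must be printed to be drawn in a script
```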
[Figure: trellis plot of growth against day, one panel per rat, all panels on the same scales]
[Figure: trellis plot of growth against day, panels indexed by group and rat.within.group]
Two variables:
one continuous, one categorical
• Insurance data: data on 14,000 insurance
claims. Want to explore relationship
between the amount of the claim (a
continuous variable) and the type of car
(a categorical variable).
• Use side-by-side boxplots.
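A sketch of side-by-side boxplots in base R. The insurance data are not available here, so the code simulates a hypothetical data frame `claims.sim`. The names CARGROUP and ADINCUR are taken from the axis labels of the lecture's figure; reading ADINCUR as the claim amount, the 20 car groups and the lognormal amounts are all assumptions.

```r
set.seed(330)
n.claims <- 2000
# Simulated claims: a categorical car group and a positive, skewed amount
claims.sim <- data.frame(
  CARGROUP = factor(sample(1:20, n.claims, replace = TRUE)),
  ADINCUR  = rlnorm(n.claims, meanlog = 6, sdlog = 1)
)

# One boxplot of the logged amount per car group, drawn on a common scale
bp <- boxplot(log(ADINCUR) ~ CARGROUP, data = claims.sim,
              xlab = "CARGROUP", ylab = "Log(ADINCUR)")
```

Taking logs first keeps a few very large claims from squashing the boxes.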
[Figure: side-by-side boxplots of Log(ADINCUR) by car group (CARGROUP), with a loess smooth added]
More than 2 variables:
• If all variables are continuous, we can
explore the relationships between them
using a pairs plot
• If we have 3 variables, a rotating plot is a
very useful tool
Example: Cherry trees
> cherry.df
   diameter height volume
1       8.3     70   10.3
2       8.6     65   10.3
3       8.8     63   10.2
4      10.5     72   16.4
5      10.7     81   18.8
6      10.8     83   19.7
7      11.0     66   15.6
8      11.0     75   18.2
9      11.1     80   22.6
10     11.2     75   19.9
... more data (31 trees in all)
Cherry trees: pairs plots
> pairs(cherry.df)
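To try this without the course files: R ships with a data set `trees` whose first ten rows match the values listed for cherry.df. Treating the two as the same data is an assumption, as is the renaming of the columns below.

```r
# Assumed equivalent of cherry.df, built from R's built-in trees data
cherry.df <- setNames(trees, c("diameter", "height", "volume"))

# Scatter plot of every pair of variables, with a smooth curve in each panel
pairs(cherry.df, panel = panel.smooth)

# Numerical companion to the plot: pairwise correlations
cherry.cor <- cor(cherry.df)
round(cherry.cor, 2)
```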
[Figure: pairs plot of diameter, height and volume]
3-d Rotating plots
• The challenge: to represent a 3-dimensional object on a 2-dimensional
surface (a TV screen, computer screen,
etc.)
• Traditional method uses projection,
perspective
• A powerful idea is to use motion, looking
at the 3-d scene from different angles
Perspective
Projection
[Figure: projections of the 3-d point cloud: diameter-height view, volume-height view, diameter-volume view and an arbitrary view]
Cherry trees: rotating plot
Dynamic motion
• By dynamically changing the angle of
view, we get a better impression of the 3-dimensional structure of the data
• “Dynamic graphics” is a very powerful tool
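The idea behind a rotating plot can be sketched in base R by projecting the 3-d point cloud onto the screen from several viewing angles. This is an illustration only, not the software used in the lecture. The helper `project` is hypothetical, and R's built-in `trees` data is assumed to stand in for cherry.df.

```r
# Assumed equivalent of cherry.df, standardised so all axes are comparable
cherry.df <- setNames(trees, c("diameter", "height", "volume"))
X <- scale(as.matrix(cherry.df))

# Rotate the points about the volume axis by theta radians, then keep the
# horizontal screen coordinate and the (unchanged) volume coordinate
project <- function(X, theta) {
  R <- matrix(c(cos(theta), -sin(theta), 0,
                sin(theta),  cos(theta), 0,
                0,           0,          1), nrow = 3, byrow = TRUE)
  rotated <- X %*% t(R)
  rotated[, c(1, 3)]
}

# Four views: angle 0 is the diameter-volume view,
# angle 90 is the (mirrored) volume-height view
op <- par(mfrow = c(2, 2))
for (theta in c(0, pi / 6, pi / 3, pi / 2)) {
  plot(project(X, theta), xlab = "screen x", ylab = "volume (standardised)",
       main = sprintf("angle = %.0f degrees", theta * 180 / pi))
}
par(op)
```

Stepping through many closely spaced angles in sequence gives the impression of motion.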
A powerful idea: Coplots
• Coplot shows relationship between x and y
for selected values of z (usually a narrow
range of z’s)
• By showing separate plots for different z
ranges, we can see how the relationship
between x and y changes as z changes
• Coplot: conditioning plot, shows
relationship between x and y conditional
on z (i.e. for fixed z)
Cherry trees: coplots
• To show the relationship between height
and volume for different values of
diameter:
• Divide the range of diameter (8.3 to 20.6)
up into 6 overlapping subranges: 8-11, 10.5-11.5, etc.
• Draw 6 plots, the first using all data whose
diameter is between 8 and 11, the second
using all data whose diameter is between
10.5 and 11.5, and so on
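Base R's `coplot` carries out these steps, choosing the overlapping subranges itself. A sketch, assuming R's built-in `trees` data stands in for cherry.df; the panel function that adds a least squares line to each panel is an illustrative choice, not necessarily what the lecture used.

```r
# Assumed equivalent of cherry.df
cherry.df <- setNames(trees, c("diameter", "height", "volume"))

# volume against height, one panel for each of 6 overlapping diameter ranges
coplot(volume ~ height | diameter, data = cherry.df, number = 6,
       panel = function(x, y, ...) {
         points(x, y, ...)
         if (length(x) > 1) abline(lm(y ~ x), col = "red")  # per-panel line
       })

# The diameter subranges that coplot uses (one row per panel)
intervals <- co.intervals(cherry.df$diameter, number = 6)
intervals
```

The subranges R chooses need not be the rounded ones quoted in the bullet points.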
[Figure: coplot of volume against height, given diameter]
Interpretation
• Note that the lines are not of the same slope
• This implies that the point configuration is not
“planar”
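The slopes can also be compared numerically: if the points lay on a plane, the slope of volume on height would be about the same in every diameter subrange. A sketch, again assuming R's built-in `trees` data stands in for cherry.df and using the subranges from `co.intervals`:

```r
# Assumed equivalent of cherry.df
cherry.df <- setNames(trees, c("diameter", "height", "volume"))

# Six overlapping diameter subranges, as used by coplot
intervals <- co.intervals(cherry.df$diameter, number = 6)

# Least squares slope of volume on height within each subrange
slopes <- apply(intervals, 1, function(iv) {
  in.range <- cherry.df$diameter >= iv[1] & cherry.df$diameter <= iv[2]
  fit <- lm(volume ~ height, data = cherry.df[in.range, ])
  unname(coef(fit)["height"])
})
round(slopes, 2)
```

Clearly different values across the subranges point to a non-planar configuration.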
[Figure: 3-d surface plot with axes row, column and z]
Other 3-d graphs
3-d scatter plot
plot of surface
Both can be rotated