Sec. 4.1 Part 1 PowerPoint

SECTION 4.1 – TRANSFORMING
RELATIONSHIPS

Linear regression using the LSRL is not the only
model for describing data.

Some data are not best described linearly.
In some cases the removal of outliers from data may cause a
drop in the correlation so that a linear model no longer does a
satisfactory job of describing the data.
In some cases the bulk of the data may not be linear at all.


Non-linear relationships between two
quantitative variables can sometimes
be changed into linear relationships
by transforming one or both variables.




Transforming can be thought of as re-expressing
the data.
We may want to transform either the
explanatory variable x, or the response variable y
in a scatter plot, or maybe even both.
We will call the transformed variable "t" when
talking about transforming in general.
Many variables take only 0 or positive values, so
we are particularly interested in how functions
behave for positive values of t.
The following models are common functions whose shapes and
equations you should be familiar with. These are
models in which t > 0.
$a + bt$, slope $b > 0$
The following scatterplot represents brain weight
against body weight for 96 species of mammals.

The scatterplot is not very satisfactory since most
mammals are so small relative to elephants and
hippos



The lower-left corner of the plot shows that most of
the species overlap, forming a “blob.”
The correlation with all 96 species is r = .86, but
removing the elephant drops it to r = .50.
To get a closer look at the observations that are
in the lower-left corner, the 4 outliers were
removed.
This scatterplot represents the 92 observations with the 4 outliers
removed.
Instead of a linear relationship, you can see that as body weight
increases, the graph bends to the right which is representative of
a logarithmic function.
The following plot includes the original 96 observations, but instead of
plotting the y-value against the x-value, the logarithm of the brain
weight (y-value) was plotted against the logarithm of the body
weight (x-value).
There are no longer any extreme outliers or very influential observations
and the pattern is very linear with r = .96
The ladder of power functions is in the form:
$(t^p - 1)/p$
[Figure: ladder of power functions, with curves labeled square, linear, reciprocal, square root, logarithmic, inverse]
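To get a feel for the ladder, the form $(t^p - 1)/p$ can be evaluated for a few powers. This is only a sketch: the function name `ladder` and the particular powers (2, 1, 0.5, 0, -0.5, -1) are assumptions chosen for illustration, not values taken from the slides. The case p = 0 is handled as the natural logarithm, which is the limit of $(t^p - 1)/p$ as p approaches 0.

```python
import math

def ladder(t, p):
    """Evaluate (t**p - 1) / p for t > 0; p = 0 is treated as ln(t)."""
    if t <= 0:
        raise ValueError("t must be positive")
    if p == 0:
        return math.log(t)  # limiting case as p -> 0
    return (t ** p - 1) / p

t = 4.0
for p in (2, 1, 0.5, 0, -0.5, -1):  # illustrative powers only
    print(f"p = {p:>4}: {ladder(t, p):.4f}")

# A very small power should land close to the natural log of t
near_zero = ladder(t, 1e-6)
print(f"p = 1e-6: {near_zero:.6f}  vs  ln(t) = {math.log(t):.6f}")
```

Moving down the ladder (smaller p) gives smaller values for the same t > 1, which matches the idea that lower powers pull in large values more strongly.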
CONCAVITY OF POWER FUNCTIONS
EXPONENTIAL GROWTH

Exponential growth occurs when a variable is multiplied by
a fixed number in each time period.



Ex. – consider a population of bacteria in which each bacterium splits into
two each hour. Beginning with 1, we have 2 after one hour, 4 after two
hours, 8 after three hours, 16 after four hours, 32, 64, 128 and so on. After
one day of doubling there are $2^{24}$ or 16,777,216 bacteria in the population.
Exponential growth increases by a fixed percentage of the
previous total whereas linear growth increases by a fixed
amount in each equal time period.
If a variable grows exponentially, its logarithm grows linearly.
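The bacteria example can be checked numerically. The sketch below (variable names are illustrative) builds the hourly counts, confirms that each count is a fixed multiple of the previous one, and shows that the base-10 logarithm of the count goes up by the same amount every hour.

```python
import math

# Bacteria count at hours 0..24, starting from 1 and doubling every hour
counts = [2 ** hour for hour in range(25)]
after_one_day = counts[24]

# Exponential growth: each count is a fixed multiple of the previous one
ratios = [counts[i + 1] / counts[i] for i in range(24)]

# The logarithm grows linearly: it rises by a fixed amount each hour
logs = [math.log10(c) for c in counts]
log_steps = [logs[i + 1] - logs[i] for i in range(24)]

print(after_one_day)
print(set(ratios))
print(round(log_steps[0], 5), round(log_steps[-1], 5))
```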
TRANSFORMING DATA
REVIEW PROPERTIES OF LOGARITHMS
$\log_b x = y \iff b^y = x$
$\log(ab) = \log(a) + \log(b)$
$\log\left(\frac{a}{b}\right) = \log(a) - \log(b)$
$\log(x^p) = p \log(x)$
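A quick numerical spot check of these properties, using arbitrary positive values chosen only for illustration:

```python
import math

a, b, x, p = 8.0, 2.0, 5.0, 3.0  # arbitrary positive test values

# Definition: log_b(x) = y means b**y = x
definition = math.isclose(b ** math.log(x, b), x)

# Product, quotient, and power rules (base 10 here; any base works)
product_rule = math.isclose(math.log10(a * b), math.log10(a) + math.log10(b))
quotient_rule = math.isclose(math.log10(a / b), math.log10(a) - math.log10(b))
power_rule = math.isclose(math.log10(x ** p), p * math.log10(x))

print(definition, product_rule, quotient_rule, power_rule)
```

Floating-point arithmetic is inexact, so the comparison uses `math.isclose` rather than `==`.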
EXAMPLE 1 – GROWTH OF CELL PHONE USE

The cell phone industry enjoyed substantial growth in the
1990s. One way to measure cell phone growth is to look at the
number of subscribers. Find a linear model to predict the
number of subscribers in the year 2000.
Year    Subscribers (thousands)
1990     5,283
1993    16,009
1994    24,134
1995    33,786
1996    44,043
1997    55,312
1998    69,209
1999    86,047
There is an increasing trend, but the overall pattern is not linear.
The pattern looks like an exponential curve. Is this exponential growth?



While the curve may appear to be exponential growth, we can’t simply
depend on what our eyes see.
If you suspect exponential growth, first calculate the ratios of consecutive
terms to see if they are the same fixed percentage of the previous total.
To avoid overflow in the calculator it is good practice to code the years (let
1990 = 1)

Don’t use 0 since you can’t take the log of 0
A ratio of 1.51 means that the number of subscribers in 1994
is 151% of, or 1.51 times, the number of subscribers in 1993.
Could also say that it’s a 51% increase.
Year   Subscribers   Ratio   log(y)
 1        5,283       --     3.72288
 4       16,009       --     4.20436
 5       24,134      1.51    4.38263
 6       33,786      1.40    4.52874
 7       44,043      1.30    4.64388
 8       55,312      1.26    4.74282
 9       69,209      1.25    4.84016
10       86,047      1.24    4.93474
On average,
subscribers were
increasing around
35% each year.
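The ratio and log(y) columns can be reproduced with a short script. This is a sketch with illustrative variable names; it codes the years so that 1990 = 1 and only forms a ratio when two rows are exactly one year apart.

```python
import math

years = [1990, 1993, 1994, 1995, 1996, 1997, 1998, 1999]
subscribers = [5283, 16009, 24134, 33786, 44043, 55312, 69209, 86047]  # thousands

coded = [y - 1989 for y in years]  # code the years so that 1990 = 1

ratios = [None]  # the first row has no previous year
for i in range(1, len(years)):
    if years[i] - years[i - 1] == 1:
        ratios.append(round(subscribers[i] / subscribers[i - 1], 2))
    else:
        ratios.append(None)  # 1990 to 1993 is a three-year gap

log_y = [round(math.log10(s), 5) for s in subscribers]

for c, s, r, ly in zip(coded, subscribers, ratios, log_y):
    ratio_text = "--" if r is None else f"{r:.2f}"
    print(f"{c:>2}  {s:>6,}  {ratio_text:>5}  {ly:.5f}")
```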
Now that you have verified that the ratios are
similar, the next step is to apply a mathematical
transformation that changes exponential growth
into linear growth.
 We had hypothesized that an exponential model
of the form $y = ab^x$ represented the cell phone
growth, therefore we need to use properties of
logarithms to transform:

$\log y = \log(ab^x)$
$\log y = \log a + \log b^x$
$\log y = \log a + x(\log b)$
$\log y = \log a + (\log b)x$



Since $\log y = \log a + (\log b)x$ looks like $y = a + bx$, we can
plot $\log y$ versus $x$, and if the data are linear we would have
better reason to believe that the cell phone growth is
exponential.
The plot appears slightly concave down, but certainly more
linear than the original scatterplot.
Applying least squares regression we get:
$\log y = 3.66 + .134x$
$r^2 = .982$

This means that 98.2% of the variation in $\log y$ is explained by the least
squares regression of $\log y$ on $x$.
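The fit can be reproduced from the table with the standard least-squares formulas. The following is a minimal sketch in plain Python; the helper name `least_squares` is illustrative, and a calculator or statistics package would normally do this step.

```python
import math

x = [1, 4, 5, 6, 7, 8, 9, 10]  # coded years, 1990 = 1
subscribers = [5283, 16009, 24134, 33786, 44043, 55312, 69209, 86047]
log_y = [math.log10(s) for s in subscribers]

def least_squares(xs, ys):
    """Return (intercept, slope, r_squared) for the LSRL of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    syy = sum((yi - y_bar) ** 2 for yi in ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

intercept, slope, r_squared = least_squares(x, log_y)
print(f"log y = {intercept:.2f} + {slope:.3f} x,  r^2 = {r_squared:.3f}")
```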


Although the model appears to be useful for
prediction purposes because the $r^2$ is so high, you
should always check the residual plot.
The purpose of finding a linear model is to be able to
predict the number of subscribers in 2000. One
approach would be to discard the first 4 data points
since they are the oldest and furthest removed from
the year 2000.
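One way to carry out the residual check is sketched below. It uses the rounded fitted line $\log y = 3.66 + .134x$ quoted earlier rather than refitting, so the residuals are approximate; the variable names are illustrative.

```python
import math

x = [1, 4, 5, 6, 7, 8, 9, 10]  # coded years, 1990 = 1
subscribers = [5283, 16009, 24134, 33786, 44043, 55312, 69209, 86047]

# Residual = observed log10(y) minus the value predicted by the fitted line
residuals = [math.log10(s) - (3.66 + 0.134 * xi) for xi, s in zip(x, subscribers)]

for xi, r in zip(x, residuals):
    print(f"x = {xi:>2}: residual = {r:+.4f}")
```

The residuals run negative, then positive, then negative again. A curved pattern like this suggests that a single straight line through all eight points misses some structure, which is one reason to consider fitting only the most recent years.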
By removing the first 4 points, the $r^2$ improves to
.999897, which is even better than the first.
 The LSRL is represented by:


$\log(\text{NewY}) = 3.966 + .097(\text{NewX})$
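A minimal sketch that refits the line using only the last four coded years (7 through 10). The helper name `fit_line` is illustrative; it applies the same least-squares formulas a calculator would use.

```python
import math

new_x = [7, 8, 9, 10]                     # coded years 1996-1999
new_subs = [44043, 55312, 69209, 86047]   # subscribers in thousands
new_y = [math.log10(s) for s in new_subs]

def fit_line(xs, ys):
    """Return (intercept, slope, r_squared) for the LSRL of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    syy = sum((yi - y_bar) ** 2 for yi in ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy ** 2 / (sxx * syy)

intercept, slope, r_squared = fit_line(new_x, new_y)
print(f"log NewY = {intercept:.3f} + {slope:.3f} NewX,  r^2 = {r_squared:.6f}")
```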

Now that we have the linear model, we can use it to predict
the number of subscribers in the year 2000 by substituting
an 11 in for “NewX” and then “undoing” the logarithm

$\log(\text{NewY}) = 3.966 + .097(\text{NewX})$
$\log(\text{subscribers}) = 3.966 + .097(\text{year})$
$10^{\log(\text{subscribers})} = 10^{3.966 + .097(\text{year})}$
$\text{subscribers} = (10^{3.966})(10^{.097(\text{year})})$
$\text{subscribers} = (9246.9817)(1.2503^{\text{year}})$
$\text{subscribers} = (9246.9817)(1.2503^{11})$
$\text{subscribers} = 107{,}933.6$
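The back-transformation can be checked in a few lines. This sketch uses the rounded coefficients 3.966 and .097 from the fitted line; the variable names are illustrative. It computes the prediction twice: once carrying full precision through the exponent, and once with the growth factor rounded to 1.2503 as in the hand calculation. Because the subscriber counts are recorded in thousands, the result is also in thousands.

```python
intercept, slope = 3.966, 0.097  # rounded coefficients from the fitted line
year = 11                        # coded value for the year 2000 (1990 = 1)

a = 10 ** intercept              # multiplier, about 9246.98
b = 10 ** slope                  # yearly growth factor, about 1.2503

prediction = a * b ** year                        # full precision
prediction_rounded = 9246.9817 * 1.2503 ** year   # growth factor rounded to 4 places

print(round(prediction, 1), round(prediction_rounded, 1))
```

The two results differ slightly because rounding the growth factor is magnified when it is raised to the 11th power. Either way the prediction is roughly 108 million subscribers.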

Homework: p.212-213 #’s 6-8