Project proposal
1. Project description
In this project, we propose to identify the factors that affect the city-cycle fuel consumption of
different vehicles, measured in miles per gallon (MPG), and to build a multiple regression model of
their relationship. We focus on the relationship between the MPG of vehicles of different makes and
models and the following seven predictor variables: number of cylinders, engine displacement,
horsepower, weight of the vehicle, acceleration time, age of the vehicle and country of origin. A
total of 398 observations were obtained from the CMU StatLib library. The horsepower variable has
6 missing values.
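The missing horsepower values matter because a regression or correlation routine typically uses only complete observations. The following is a minimal sketch of that filtering step, not the project's actual data handling: the three records and the use of None to mark a missing value are assumptions made for illustration.

```python
# Hypothetical miniature of the data set: each record is one vehicle.
# None marks a missing horsepower value (this encoding is an assumption).
records = [
    {"mpg": 18.0, "cylinders": 8, "horsepower": 130.0},
    {"mpg": 25.0, "cylinders": 4, "horsepower": None},
    {"mpg": 32.0, "cylinders": 4, "horsepower": 65.0},
]

def complete_cases(rows, required):
    """Keep only the rows in which every required field is present."""
    return [row for row in rows if all(row[name] is not None for name in required)]

usable = complete_cases(records, ["mpg", "cylinders", "horsepower"])
print(len(records), "observations,", len(usable), "complete cases")
```

Applied to the full data set, the same filter would leave 398 - 6 = 392 complete cases, which matches the N = 392 reported with the correlation coefficients.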
In this project, the proposed response variable is the miles per gallon (MPG) of the vehicle, which
is a well-known indicator of fuel efficiency. There are seven predictor variables:
1) Number of cylinders. This variable is discrete. Most vehicles have 4, 6 or 8 cylinders, although
the observed values range from 3 to 8. The more cylinders an engine has, the more fuel it
consumes, but the more power it provides.
2) Engine displacement (cubic inches), a continuous variable: the total volume swept by all the
pistons inside the cylinders of a reciprocating engine in a single movement from top to bottom.
An engine with a larger displacement consumes more fuel and provides more power.
3) Mechanical horsepower (hp), a continuous variable that measures the power of the engine.
Although horsepower is usually positively related to engine displacement, technologies such as
turbocharging can change this relationship, so we still treat horsepower as a potentially
significant predictor of fuel efficiency.
4) Weight of the vehicle (pounds). A heavy car consumes more fuel than a light car covering the
same distance.
5) Acceleration time (seconds). This is the average time needed to accelerate the vehicle from 0 to
60 miles/hour. Acceleration time is an indicator of the overall performance of the engine.
Usually a car with larger horsepower and displacement has a shorter acceleration time, but, as
mentioned above, some technologies can break this pattern.
6) Age of the vehicle (years). The age of a vehicle may affect its fuel consumption. Every car has a
service life, measured either in years or in miles, and because we cannot tell whether a given car
has exceeded its service life, we cannot simply say that an older car consumes more fuel over the
same distance. However, we do expect fuel efficiency to decline the further a car is past its
service life.
7) Origin of the vehicle. This variable represents where the vehicle was produced: 1 = the US, 2 =
Europe and 3 = Japan. Cars from different countries differ in quality and in target consumers. We
expect Japanese cars to be more efficient, since they are usually light.
2. Preliminary analysis
1) Descriptive statistics: First, we generated descriptive statistics for the 7 predictor variables
and the response variable (Table 1). The results indicate that none of the variables is badly
skewed; the largest skewness, 1.09, belongs to the horsepower variable. The kurtosis values are
also moderate (between -1.38 and 0.70), which suggests that the distributions are close enough to
normal.
Statistic                      MPG  Cylinders  Displacement  Horsepower      Weight  Acceleration      Age  Origin
Mean                         23.51       5.45        193.43      104.47     2970.42         15.57     6.99    1.57
Standard Error                0.39       0.09          5.23        1.94       42.45          0.14     0.19    0.04
Median                       23.00       4.00        148.50       93.50     2803.50         15.50     7.00    1.00
Mode                         13.00       4.00         97.00      150.00     2130.00         14.50    10.00    1.00
Standard Deviation            7.82       1.70        104.27       38.49      846.84          2.76     3.70    0.80
Sample Variance              61.09       2.89      10872.20     1481.57   717140.99          7.60    13.67    0.64
Kurtosis                     -0.51      -1.38         -0.75        0.70       -0.79          0.42    -1.18   -0.82
Skewness                      0.46       0.53          0.72        1.09        0.53          0.28    -0.01    0.92
Range                        37.60       5.00        387.00      184.00     3527.00         16.80    12.00    2.00
Minimum                       9.00       3.00         68.00       46.00     1613.00          8.00     1.00    1.00
Maximum                      46.60       8.00        455.00      230.00     5140.00         24.80    13.00    3.00
Sum                        9358.80    2171.00      76983.50    40952.00  1182229.00       6196.10  2782.00  626.00
Count                          398        398           398         392         398           398      398     398
Confidence Level (95.0%)      0.77       0.17         10.28        3.82       83.45          0.27     0.36    0.08

Table 1: Descriptive statistics of the 8 variables in the study.
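To make the skewness and kurtosis figures easier to interpret, the sketch below shows one way such statistics can be computed. It assumes the bias-adjusted sample formulas that spreadsheet and statistics packages commonly report (the source does not state which formulas were used), and the function name describe and the sample values are hypothetical.

```python
import math

def describe(values):
    """Return mean, standard error, sample skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1).
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z = [(v - mean) / s for v in values]
    # Bias-adjusted sample skewness (needs n >= 3); 0 for symmetric data.
    skew = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    # Bias-adjusted excess kurtosis (needs n >= 4); near 0 for normal data.
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {"mean": mean, "std_error": s / math.sqrt(n),
            "skewness": skew, "kurtosis": kurt}

# Made-up MPG-like values, for illustration only.
sample = [9.0, 13.0, 18.0, 23.0, 23.0, 29.0, 46.6]
stats = describe(sample)
print(stats)

# The standard error is the standard deviation divided by sqrt(n);
# with the reported MPG values (7.82 and 398) this gives about 0.39.
se_mpg = 7.82 / math.sqrt(398)
print(round(se_mpg, 2))
```

Skewness and excess kurtosis near zero are consistent with a roughly normal distribution, which is the sense in which the values in Table 1 are read above.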
2) Correlation of variables: We then analyzed the correlations between the variables (Table 2 and
Table 3). Table 2 shows the correlation between the response variable and each predictor
variable. The number of cylinders, displacement and vehicle weight all have strong, significant
negative correlations with MPG (|r| > 0.7, P-value < 0.0001): the larger these quantities are, the
fewer miles the vehicle travels per gallon, which is what we expect. Acceleration and origin are
also significantly correlated with MPG, although more weakly (|r| about 0.2, P-value < 0.0001).
All of these predictor variables should probably be included in the regression model. On the
other hand, the correlations between MPG and horsepower and between MPG and age are not
significant (|r| < 0.1, P-value > 0.1). Whether these two variables should be included in the
regression model needs further testing.
Pearson Correlation Coefficients, N = 392
Prob > |r| under H0: Rho=0

                cylinder  displacement  horsepower    weight  acceleration       age    origin
MPG             -0.77372      -0.80285    -0.04318  -0.78588       0.20203  -0.07253   0.20478
                  <.0001        <.0001      0.3939    <.0001        <.0001    0.1518    <.0001

Table 2: Correlation between MPG and the predictor variables.
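The P-values reported with each coefficient come from testing H0: rho = 0 with the statistic t = r * sqrt(N - 2) / sqrt(1 - r^2). The sketch below illustrates the calculation. It replaces the exact t distribution with a normal approximation (reasonable for N = 392, but an assumption of this example), and the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided P-value for H0: rho = 0 (normal approximation to t)."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    # erfc(|t| / sqrt(2)) equals 2 * (1 - Phi(|t|)) for a standard normal.
    return math.erfc(abs(t) / math.sqrt(2))

# Two of the weak coefficients reported above, with N = 392.
p_first = approx_p_value(-0.04318, 392)
p_second = approx_p_value(-0.07253, 392)
print(p_first, p_second)
```

Under this approximation the two results land close to the reported P-values of 0.3939 and 0.1518, so neither coefficient is significant at the 0.05 level.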
Pearson Correlation Coefficients, N = 392
Prob > |r| under H0: Rho=0

                cylinder  displacement  horsepower    weight  acceleration       age    origin
cylinder               1       0.95034     0.02915   0.85281      -0.23707   0.08278   -0.2115
                                <.0001       0.565    <.0001        <.0001    0.1017    <.0001
displacement     0.95034             1     0.04488   0.88426      -0.24458   0.07754   -0.2124
                  <.0001                    0.3755    <.0001        <.0001    0.1254    <.0001
horsepower       0.02915       0.04488           1  -0.28579       0.89674  -0.97458   0.91017
                   0.565        0.3755                <.0001        <.0001    <.0001    <.0001
weight           0.85281       0.88426    -0.28579         1      -0.49753   0.39659   0.49301
                  <.0001        <.0001      <.0001                  <.0001    <.0001    <.0001
acceleration    -0.23707      -0.24458     0.89674  -0.49753             1   0.90981  -0.94911
                  <.0001        <.0001      <.0001    <.0001                  <.0001    <.0001
age              0.08278       0.07754    -0.97458   0.39659       0.90981         1   0.94236
                  0.1017        0.1254      <.0001    <.0001        <.0001              <.0001
origin           -0.2115       -0.2124     0.91017   0.49301      -0.94911   0.94236         1
                  <.0001        <.0001      <.0001    <.0001        <.0001    <.0001

Table 3: Correlation between the predictor variables.
Table 3 shows the pairwise correlations among the predictor variables, together with their
significance levels. In addition to the correlations between the response variable and the
predictor variables, correlations also exist among the predictor variables themselves. Some of
the predictor variables are highly correlated with each other, such as number of cylinders and
displacement, horsepower and age, and origin and acceleration. This information may help in
reducing the regression model, since highly correlated predictors carry largely redundant
information.
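One common way to express how much a high pairwise correlation can hurt a regression is the variance inflation factor (VIF). The sketch below is an illustration only and is not part of the analysis plan. It assumes the simplest case, a model containing just the two correlated predictors, where VIF = 1 / (1 - r^2); the function name is hypothetical.

```python
def vif_from_pairwise_r(r):
    """VIF of a predictor whose only companion predictor has correlation r with it."""
    return 1.0 / (1.0 - r * r)

# Reported correlation between number of cylinders and displacement.
vif_cyl_disp = vif_from_pairwise_r(0.95034)
print(round(vif_cyl_disp, 1))
```

A VIF of this size (slightly above 10, a frequently quoted rule-of-thumb threshold) suggests that cylinders and displacement carry nearly the same information, so keeping only one of them is a natural candidate when reducing the model.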
3. Preliminary plans of analysis
In order to study the contribution of each predictor to the city-cycle fuel efficiency of vehicles,
we propose to conduct the following studies.
A) Use PROC REG to predict MPG with all 7 predictor variables.
B) Use PROC REG to predict MPG with all predictor variables except horsepower and age.
C) Use PROC REG to predict MPG with only horsepower and age.
D) Use PROC REG to predict MPG with the discrete variables: cylinders, origin and age.
E) Use PROC REG to predict MPG with the continuous variables: displacement, horsepower, weight and
acceleration.
F) Use significance tests to check each model built above:
   • F test.
   • Determine which predictor variables should be included and which should not.
   • Build the regression model.
G) Calculate the regression parameters.
H) Diagnostics:
   • Look at the distribution of each variable.
   • Look at the relationship between pairs of variables.
   • Plot the residuals versus:
       • the predicted/fitted values
       • each predictor variable
       • time (if available)
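The steps above can be sketched in a few lines of code. The example below is a plain-Python stand-in for what PROC REG would do, not SAS code and not the project's implementation: it fits a full and a reduced model by least squares, forms the partial F statistic for the dropped predictor, and keeps the residuals for diagnostic plots. The data values, variable scaling and function names are hypothetical, and the P-value of the F statistic (which needs the F distribution) is not computed.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares fit with an intercept; returns (coefficients, residuals, SSE)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return beta, resid, sum(e * e for e in resid)

def partial_f(sse_reduced, sse_full, n, p_full, q):
    """F statistic for dropping q predictors from a full model with p_full predictors."""
    return ((sse_reduced - sse_full) / q) / (sse_full / (n - p_full - 1))

# Hypothetical data: (weight in 1000 lb, horsepower in 100 hp) and MPG.
predictors = [(2.1, 0.7), (2.4, 0.9), (2.8, 0.9), (3.2, 1.1),
              (3.6, 1.5), (4.0, 1.4), (4.3, 1.9), (4.7, 1.6)]
mpg = [33.0, 30.5, 27.0, 24.0, 19.5, 18.0, 14.5, 14.0]

beta_full, resid_full, sse_full = fit_ols(predictors, mpg)
beta_red, resid_red, sse_red = fit_ols([(w,) for w, _ in predictors], mpg)

# Does adding horsepower to a weight-only model reduce the error enough?
f_stat = partial_f(sse_red, sse_full, n=len(mpg), p_full=2, q=1)
print("full-model coefficients:", beta_full)
print("partial F for horsepower:", f_stat)
```

The same comparison applies to the nested models in the plan, for example model A against model B, where q = 2 predictors (horsepower and age) are dropped. The residuals returned by fit_ols are the values that would be plotted against the fitted values and against each predictor in step H.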