Project proposal


1. Project description

In this project, we propose to identify the factors that affect the city-cycle fuel consumption, in miles per gallon, of different vehicles, and to build a multivariate regression model that describes their relationship. We focus on the correlation between the miles per gallon of vehicles of different makes and models and the following seven predictor variables: number of cylinders, engine displacement, horsepower, weight of the vehicle, acceleration time, age of the vehicle and country of origin. A total of 398 observations were gathered from the CMU Statistics library. The horsepower variable has 6 missing values.
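One simple way to handle the 6 missing horsepower values is listwise deletion, which would leave 398 - 6 = 392 complete observations. The sketch below illustrates the idea on made-up rows; the data values and the `drop_missing` helper are assumptions for illustration and are not taken from the study.

```python
# Hypothetical (mpg, horsepower) rows; None marks a missing horsepower value.
rows = [
    (18.0, 130.0),
    (25.0, None),
    (24.0, 95.0),
    (21.0, None),
    (27.0, 88.0),
]


def drop_missing(records):
    """Listwise deletion: keep only records in which every field is present."""
    return [r for r in records if all(v is not None for v in r)]


complete = drop_missing(rows)
n_dropped = len(rows) - len(complete)
print(f"kept {len(complete)} of {len(rows)} rows, dropped {n_dropped}")
```

Listwise deletion is only one option; whether it is appropriate depends on why the values are missing.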

The proposed response variable is the miles per gallon of the vehicle (miles/gallon), which is a well-known indicator of fuel efficiency. There are seven predictor variables:

1) Number of cylinders. This variable is discrete; most vehicles have 4, 6 or 8 cylinders, although Table 1 reports a minimum of 3. An engine with more cylinders generally consumes more fuel but also delivers more power.

2) Engine displacement (cubic inches), a continuous variable that measures the volume swept by all the pistons inside the cylinders of a reciprocating engine in a single movement from top to bottom. An engine with a larger displacement consumes more fuel and delivers more power.

3) Mechanical horsepower (hp), a continuous variable that measures the power of the engine. Horsepower is usually positively related to engine displacement, but technologies such as turbocharging can change this relationship, so we still treat horsepower as a potentially significant parameter in measuring fuel efficiency.

4) Weight of the vehicle (pounds). A heavy car consumes more fuel than a light car covering the same distance.

5) Acceleration time (seconds). This is the average time a driver needs to accelerate the vehicle from 0 to 80 miles/hour. Acceleration time is an indicator of the overall performance of the vehicle's engine. A car with more horsepower and a larger displacement usually has a shorter acceleration time but, as mentioned above, some technologies break this pattern.

6) Age of the vehicle (years). The age of a vehicle may affect its fuel consumption. Every car has a service life, measured either in time or in miles, and we cannot tell whether a given car has exceeded it, so we cannot simply say that an older car consumes more fuel over the same distance. However, we do expect fuel efficiency to fall the further a car is past its service life.

7) Origin of the vehicle. This variable represents the country of the vehicle's manufacturer: 1 = the US, 2 = Germany and 3 = Japan. Cars from different countries differ in quality and target consumers. We expect Japanese cars to be more efficient because they are usually light.

2. Preliminary analysis

1) Descriptive statistics: First, we generated descriptive statistics for the 7 predictor variables and the 1 response variable (Table 1). The results indicate that none of the variables is badly skewed; the largest skewness, 1.09, comes from the horsepower variable. The kurtosis values are also in a reasonable range, which suggests that all variables are reasonably close to a normal distribution.

Statistic                 MPG       Cylinders  Displacement  Horsepower  Weight      Acceleration  Age      Origin
Mean                      23.51     5.45       193.43        104.47      2970.42     15.57         6.99     1.57
Standard Error            0.39      0.09       5.23          1.94        42.45       0.14          0.19     0.04
Median                    23.00     4.00       148.50        93.50       2803.50     15.50         7.00     1.00
Mode                      13.00     4.00       97.00         150.00      2130.00     14.50         10.00    1.00
Standard Deviation        7.82      1.70       104.27        38.49       846.84      2.76          3.70     0.80
Sample Variance           61.09     2.89       10872.20      1481.57     717140.99   7.60          13.67    0.64
Kurtosis                  -0.51     -1.38      -0.75         0.70        -0.79       0.42          -1.18    -0.82
Skewness                  0.46      0.53       0.72          1.09        0.53        0.28          -0.01    0.92
Range                     37.60     5.00       387.00        184.00      3527.00     16.80         12.00    2.00
Minimum                   9.00      3.00       68.00         46.00       1613.00     8.00          1.00     1.00
Maximum                   46.60     8.00       455.00        230.00      5140.00     24.80         13.00    3.00
Sum                       9358.80   2171.00    76983.50      40952.00    1182229.00  6196.10       2782.00  626.00
Count                     398.00    398.00     398.00        392.00      398.00      398.00        398.00   398.00
Confidence Level 95.0%    0.77      0.17       10.28         3.82        83.45       0.27          0.36     0.08

Table 1: descriptive statistics of 8 variables in the study.
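For reference, the skewness, kurtosis and standard-error rows of Table 1 can be computed as sketched below. The formulas are the sample-adjusted versions commonly used by spreadsheet software; that the table was produced with exactly these formulas is an assumption, and the two small data sets are made up for illustration.

```python
import math
import statistics


def sample_skewness(xs):
    """Sample-adjusted skewness: n / ((n-1)(n-2)) * sum of cubed z-scores."""
    n = len(xs)
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)


def sample_excess_kurtosis(xs):
    """Sample-adjusted excess kurtosis (0 for a normal distribution)."""
    n = len(xs)
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)
    z4 = sum(((x - mean) / s) ** 4 for x in xs)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * z4 - correction


def standard_error(xs):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(xs) / math.sqrt(len(xs))


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical, symmetric
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]  # hypothetical, long right tail

print("skewness (symmetric):   ", sample_skewness(symmetric))
print("skewness (right-skewed):", sample_skewness(right_skewed))
print("excess kurtosis:        ", sample_excess_kurtosis(symmetric))
print("standard error:         ", standard_error(symmetric))
```

A positive skewness, as reported for horsepower, indicates a longer right tail; a negative excess kurtosis indicates a flatter distribution than the normal.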

2) Correlation of variables: We then analyzed the correlations between variables (Table 2 and Table 3). Table 2 shows the correlation between the response variable and each predictor variable. From this table, we can conclude that the number of cylinders, displacement and vehicle weight all have a significant negative correlation with MPG (|r| > 0.7, P-value < 0.0001). This means that the larger these parameters become, the fewer miles the vehicle travels per gallon, which is what we expect. The acceleration and origin variables also affect fuel efficiency; their |r| values are about 0.2, with P-value < 0.0001. All of the above predictor variables should probably be included in the regression model. On the other hand, the correlations between MPG and both horsepower and age do not appear to be significant: their |r| values are smaller than 0.1 and their P-values are greater than 0.1. Whether these two variables should be included in the regression model needs further testing.

Pearson Correlation Coefficients, N = 392
Prob > |r| under H0: Rho=0

              cylinder  displacement  horsepower  weight    acceleration  age       origin
MPG  r        -0.77372  -0.80285      -0.04318    -0.78588  0.20203       -0.07253  0.20478
     Prob     <.0001    <.0001        0.3939      <.0001    <.0001        0.1518    <.0001

Table 2: Correlation between MPG and predictor variables.

Pearson Correlation Coefficients, N = 392
Prob > |r| under H0: Rho=0

                    cylinder  displacement  horsepower  weight    acceleration  age       origin
cylinder      r     1         0.95034       0.02915     0.85281   -0.23707      0.08278   -0.2115
              Prob            <.0001        0.565       <.0001    <.0001        0.1017    <.0001
displacement  r     0.95034   1             0.04488     0.88426   -0.24458      0.07754   -0.2124
              Prob  <.0001                  0.3755      <.0001    <.0001        0.1254    <.0001
horsepower    r     0.02915   0.04488       1           -0.28579  0.89674       -0.97458  0.91017
              Prob  0.565     0.3755                    <.0001    <.0001        <.0001    <.0001
weight        r     0.85281   0.88426       -0.28579    1         -0.49753      0.39659   -0.49301
              Prob  <.0001    <.0001        <.0001                <.0001        <.0001    <.0001
acceleration  r     -0.23707  -0.24458      0.89674     -0.49753  1             -0.94911  0.90981
              Prob  <.0001    <.0001        <.0001      <.0001                  <.0001    <.0001
age           r     0.08278   0.07754       -0.97458    0.39659   -0.94911      1         -0.94236
              Prob  0.1017    0.1254        <.0001      <.0001    <.0001                  <.0001
origin        r     -0.2115   -0.2124       0.91017     -0.49301  0.90981       -0.94236  1
              Prob  <.0001    <.0001        <.0001      <.0001    <.0001        <.0001

Table 3: Correlation between different variables.

Table 3 shows the pairwise correlations between the variables in this study, together with their significance levels. Besides the correlations between the response variable and the predictor variables, correlations among the predictor variables themselves can also be observed. Some of the predictor variables are highly correlated with each other, such as number of cylinders and displacement, horsepower and age, and origin and acceleration. This information may be helpful in reducing the regression model.
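The significance level of a Pearson correlation can be cross-checked from r and N alone through the statistic t = r * sqrt(N - 2) / sqrt(1 - r^2). The sketch below applies it to three of the MPG correlations with N = 392. It replaces the exact t distribution with a standard normal approximation, which is an assumption that is reasonable only because the 390 degrees of freedom are large; the function names are illustrative.

```python
import math
from statistics import NormalDist


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, given sample correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def approx_two_sided_p(r, n):
    """Two-sided p-value using a normal approximation to the t distribution."""
    t = correlation_t_statistic(r, n)
    return 2 * (1 - NormalDist().cdf(abs(t)))


N = 392
# Correlations with MPG as reported in the correlation table above.
reported = {"acceleration": 0.20203, "horsepower": -0.04318, "age": -0.07253}

for name, r in reported.items():
    t = correlation_t_statistic(r, N)
    p = approx_two_sided_p(r, N)
    print(f"{name:>12}: r = {r:+.5f}, t = {t:+.3f}, approx p = {p:.4f}")
```

Under this approximation the p-values for horsepower and age come out at roughly 0.39 and 0.15, in line with the reported 0.3939 and 0.1518, while acceleration falls below 0.0001.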

3. Preliminary plans of analysis

To study the contribution of each predictor to the city-cycle fuel efficiency of vehicles, we propose to conduct the following analyses.

A) Use PROC REG to predict mpg with all 7 predictor variables.

B) Use PROC REG to predict mpg with all predictor variables except horsepower and age.

C) Use PROC REG to predict mpg with only horsepower and age.

D) Use PROC REG to predict mpg with discrete variables: cylinders, origin and age.

E) Use PROC REG to predict mpg with continuous variables: displacement, horsepower, weight and acceleration.

F) Use significance tests to check each model built above:

   F test.

   Determine which predictor variables should be included and which should not.

   Build the regression model.

G) Calculate the regression parameters.

H) Diagnosis:

   Look at the distribution of each variable.

   Look at the relationship between pairs of variables.

   Plot the residuals versus:

      The predicted/fitted values

      Each predictor variable

      Time (if available)
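The plan above relies on SAS PROC REG. As a language-neutral illustration of what steps A to H compute, the following Python sketch fits an ordinary least squares model through the normal equations, then derives the residuals and the overall F statistic. It is not SAS code and not the study's analysis: the eight data rows (weight in thousands of pounds, acceleration time in seconds, mpg) are made up, and all function names are assumptions.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(xs, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return (beta, fitted, residuals)."""
    design = [[1.0] + list(row) for row in xs]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return beta, fitted, residuals


def overall_f_statistic(y, residuals, n_predictors):
    """F = (SSR / k) / (SSE / (n - k - 1)) for the overall regression."""
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum(e * e for e in residuals)
    return ((sst - sse) / n_predictors) / (sse / (n - n_predictors - 1))


# Hypothetical observations: (weight in 1000 lb, acceleration time in s) -> mpg.
xs = [(3.5, 12.0), (3.7, 11.5), (3.4, 11.0), (2.4, 15.0),
      (2.8, 15.5), (2.6, 16.0), (4.3, 10.0), (2.1, 14.5)]
y = [18.0, 15.0, 18.0, 24.0, 22.0, 21.0, 14.0, 27.0]

beta, fitted, residuals = fit_ols(xs, y)
f_stat = overall_f_statistic(y, residuals, n_predictors=2)
print("coefficients (intercept first):", [round(b, 3) for b in beta])
print("overall F statistic:", round(f_stat, 2))
# The (fitted, residual) pairs are what a residual-versus-fitted plot would show.
print([(round(f, 2), round(e, 2)) for f, e in zip(fitted, residuals)])
```

Comparing a full model with a reduced one, as in steps A to E, uses the same sums of squares: the drop in SSE when predictors are added is tested against the residual variance of the full model.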
