Project proposal

1. Project description

In this project, we propose to identify the factors that affect the city-cycle fuel consumption, in miles per gallon, of different vehicles and to build a multiple regression model to analyze their relationship. We focus on the correlation between the miles per gallon of vehicles of different makes and models and the following seven predictor variables: number of cylinders, engine displacement, horsepower, weight of the vehicle, acceleration time, age of the vehicle and country of origin. A total of 398 observations were gathered from the CMU Statistics library. The horsepower variable has 6 missing values.

The proposed response variable is the miles per gallon of the vehicle (miles/gallon), a well-known indicator of fuel efficiency. There are seven predictor variables:

1) Number of cylinders. This variable is discrete and takes mainly the values 4, 6 and 8 (Table 1 shows an observed range of 3 to 8). The more cylinders an engine has, the more fuel it consumes, but the more energy it provides.
2) Engine displacement (cubic inches). This is a continuous variable: the volume swept by all the pistons inside the cylinders of a reciprocating engine in a single movement from top to bottom. An engine with a larger displacement consumes more fuel and provides more energy.
3) Mechanical horsepower (hp). This is a continuous variable that measures the power of the engine. Although horsepower is usually positively related to engine displacement, some technologies, such as turbocharging, change this relationship, so we still assume that horsepower is a significant parameter in measuring fuel efficiency.
4) Weight of the vehicle (pounds). A heavy car consumes more fuel than a light car covering the same distance.
5) Acceleration time (seconds). This represents the average time a driver needs to accelerate the vehicle from 0 to 80 miles/hour. Acceleration time is an indicator of the overall performance of the vehicle engine.
Usually, a car with greater horsepower and displacement accelerates faster but, as mentioned above, some technologies can break this pattern.
6) Age of the vehicle (years). The age of a vehicle may affect its fuel consumption. Every car has a service life, measured either in years or in miles, and we cannot tell whether a given car has exceeded it, so we cannot simply claim that an older car consumes more fuel over the same distance. We do expect, however, that fuel efficiency drops the further a car is past its service life.
7) Origin of the vehicle. This variable represents the country of the vehicle producer: 1 = the US, 2 = Germany and 3 = Japan. Cars from different countries differ in quality and target consumers. We expect Japanese cars to be more efficient since they are usually light.

2. Preliminary analysis

1) Descriptive statistics: First of all, we generated the descriptive statistics of the 7 predictor variables and the 1 response variable (Table 1). The results indicate that none of the variables is badly skewed; the largest skewness, 1.09, comes from the horsepower variable. The kurtosis values are also in a reasonable range, which suggests that each variable is reasonably close to a normal distribution.
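As a rough illustration of how the summary measures in Table 1 can be obtained, the Python sketch below computes the count, mean, standard error, standard deviation, skewness and kurtosis of a single variable. It is a sketch under stated assumptions, not the procedure actually used for Table 1: the function name `describe`, the use of `None` to mark a missing value (as for the 6 missing horsepower values) and the toy list are all hypothetical, and the skewness and kurtosis use the common bias-adjusted sample formulas, which may or may not match the software that produced the table.

```python
import math
from statistics import mean, median, stdev


def describe(values):
    """Summary statistics for one variable; None marks a missing value.

    ASSUMPTION: skewness and (excess) kurtosis use the bias-adjusted
    sample formulas; at least 4 non-missing values are required.
    """
    x = [v for v in values if v is not None]  # drop missing observations
    n = len(x)
    m = mean(x)
    s = stdev(x)  # sample standard deviation (n - 1 denominator)
    z3 = sum(((v - m) / s) ** 3 for v in x)
    z4 = sum(((v - m) / s) ** 4 for v in x)
    skewness = n / ((n - 1) * (n - 2)) * z3
    kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {
        "count": n,
        "mean": m,
        "std_error": s / math.sqrt(n),
        "median": median(x),
        "std_dev": s,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "range": max(x) - min(x),
    }


# Made-up values for illustration only (NOT taken from the dataset).
toy_values = [1, 2, None, 3, 4, 5]
summary = describe(toy_values)
```

Applied to each of the 8 variables in turn, a function of this kind yields one column of the table; a skewness near 0 and a kurtosis near 0 are what a normally distributed variable would give.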
Statistic                    MPG  Cylinders  Displacement  Horsepower      Weight  Acceleration     Age  Origin
Mean                       23.51       5.45        193.43      104.47     2970.42         15.57    6.99    1.57
Standard Error              0.39       0.09          5.23        1.94       42.45          0.14    0.19    0.04
Median                     23.00       4.00        148.50       93.50     2803.50         15.50    7.00    1.00
Mode                       13.00       4.00         97.00      150.00     2130.00         14.50   10.00    1.00
Standard Deviation          7.82       1.70        104.27       38.49      846.84          2.76    3.70    0.80
Sample Variance            61.09       2.89      10872.20     1481.57   717140.99          7.60   13.67    0.64
Kurtosis                   -0.51      -1.38         -0.75        0.70       -0.79          0.42   -1.18   -0.82
Skewness                    0.46       0.53          0.72        1.09        0.53          0.28   -0.01    0.92
Range                      37.60       5.00        387.00      184.00     3527.00         16.80   12.00    2.00
Minimum                     9.00       3.00         68.00       46.00     1613.00          8.00    1.00    1.00
Maximum                    46.60       8.00        455.00      230.00     5140.00         24.80   13.00    3.00
Sum                      9358.80    2171.00      76983.50    40952.00  1182229.00       6196.10 2782.00  626.00
Count                     398.00     398.00        398.00      392.00      398.00        398.00  398.00  398.00
Confidence Level 95.0%      0.77       0.17         10.28        3.82       83.45          0.27    0.36    0.08

Table 1: Descriptive statistics of the 8 variables in the study.

2) Correlation of variables: We then analyzed the correlations between the variables (Table 2 and Table 3). Table 2 shows the correlation between the response variable and each predictor variable. From this table, we conclude that the number of cylinders, displacement and vehicle weight all have significant negative correlations with MPG (|r| > 0.7, P-value < 0.0001): the larger these values are, the fewer miles the vehicle travels per gallon, which is what we expect. The acceleration and origin variables are also correlated with fuel efficiency, with |r| values of about 0.2 and P-values < 0.0001. All of the above predictor variables should probably be included in the regression model. On the other hand, the correlations between MPG and horsepower and between MPG and age are not significant: their |r| values are smaller than 0.1 and their P-values are greater than 0.1. Whether these two variables should be included in the regression model needs further testing.
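The P-values quoted with each correlation test the null hypothesis that the true correlation is zero. A minimal Python sketch of that calculation follows; it is an illustration rather than the statistical package's own routine. The function names `pearson_r` and `approx_p_value` and the toy lists are hypothetical, and the P-value replaces the t distribution with n - 2 degrees of freedom by a standard normal distribution, an approximation that is reasonable only for large samples such as N = 392.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation of two equally long lists without missing values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided P-value for H0: rho = 0 from t = r * sqrt((n - 2) / (1 - r^2)).

    ASSUMPTION: large-sample normal approximation to the t distribution,
    so results differ slightly from exact t-based P-values.
    """
    if abs(r) >= 1.0:
        return 0.0  # perfect correlation: the t statistic is unbounded
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Made-up values for illustration only (NOT taken from the dataset).
r_toy = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

# Reported correlation between MPG and horsepower, with N = 392.
p_horsepower = approx_p_value(-0.04318, 392)
```

For the reported MPG and horsepower correlation, the approximation gives a P-value of about 0.39, in line with the reported 0.3939, while a correlation as strong as -0.80285 gives a P-value far below 0.0001.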
Pearson Correlation Coefficients, N = 392; Prob > |r| under H0: Rho=0

              cylinder  displacement  horsepower    weight  acceleration       age    origin
MPG           -0.77372      -0.80285    -0.04318  -0.78588       0.20203  -0.07253   0.20478
Prob > |r|      <.0001        <.0001      0.3939    <.0001        <.0001    0.1518    <.0001

Table 2: Correlation between MPG and the predictor variables.

Pearson Correlation Coefficients, N = 392; Prob > |r| under H0: Rho=0

cylinder displacement horsepower weight acceleration age cylinder displacement 1 0.95034 <.0001 1 0.95034 <.0001 horsepower weight 0.02915 0.85281 0.565 <.0001 0.04488 0.88426 0.3755 <.0001 1 0.28579 <.0001 0.02915 0.04488 0.565 0.3755 0.85281 0.88426 -0.28579 <.0001 0.23707 <.0001 <.0001 <.0001 -0.24458 0.89674 <.0001 0.49753 <.0001 <.0001 0.08278 0.07754 -0.97458 0.39659 0.1017 0.1254 <.0001 1 acceleration age origin -0.23707 0.08278 -0.2115 <.0001 0.1017 <.0001 -0.24458 0.07754 -0.2124 <.0001 0.1254 <.0001 0.89674 0.91017 0.97458 <.0001 <.0001 <.0001 -0.49753 0.39659 0.49301 <.0001 <.0001 <.0001 1 0.90981 0.94911 <.0001 <.0001 -0.94911 1 0.94236 <.0001 <.0001 0.90981 1 0.94236 <.0001 <.0001 <.0001 -0.2115 -0.2124 0.91017 origin 0.49301 <.0001 <.0001 <.0001 <.0001

Table 3: Correlation between different variables.

Table 3 shows the correlations between the variables in this study, together with their significance levels. Besides the correlations between the response variable and the predictor variables, correlations also exist among the predictor variables themselves. Some of the predictor variables are highly correlated with each other, such as number of cylinders and displacement, horsepower and age, origin and acceleration, and so on. This information may help in reducing the regression model, since highly correlated predictors carry overlapping information.

3. Preliminary plans of analysis

In order to study the contribution of each predictor to the city-cycle fuel efficiency of vehicles, we propose to conduct the following analyses.

A) Use PROC REG to predict mpg with all 7 predictor variables.
B) Use PROC REG to predict mpg with all predictor variables except horsepower and age.
C) Use PROC REG to predict mpg with only horsepower and age.
D) Use PROC REG to predict mpg with the discrete variables: cylinders, origin and age.
E) Use PROC REG to predict mpg with the continuous variables: displacement, horsepower, weight and acceleration.
F) Check each model built above with a significance test (F test) to decide which predictor variables should be included and which should not, then build the regression model.
G) Calculate the regression parameters.
H) Diagnosis:
- Look at the distribution of each variable.
- Look at the relationship between pairs of variables.
- Plot the residuals versus:
  - the predicted/fitted values
  - each predictor variable
  - time (if available)
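The fitting, F test and residual steps above can be prototyped outside SAS. The Python sketch below is not PROC REG and makes no claim to reproduce its output: it fits a least-squares model through the normal equations, computes the overall F statistic and returns the residuals needed for the diagnostic plots. The function names `fit_ols` and `overall_f` and the toy rows (weight in thousands of pounds, number of cylinders, mpg) are hypothetical and chosen only for illustration.

```python
def fit_ols(rows, y):
    """Least-squares fit of y on the predictors in rows, with an intercept.

    Returns (coefficients, fitted values, residuals);
    coefficients[0] is the intercept.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)]
         + [sum(xi[j] * yi for xi, yi in zip(X, y))]
         for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[j][p] / A[j][j] for j in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, xi)) for xi in X]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return beta, fitted, residuals


def overall_f(y, fitted, n_predictors):
    """Overall F statistic: model mean square over error mean square."""
    n = len(y)
    y_bar = sum(y) / n
    ss_model = sum((f - y_bar) ** 2 for f in fitted)
    ss_error = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
    return (ss_model / n_predictors) / (ss_error / (n - n_predictors - 1))


# Made-up rows for illustration only (NOT taken from the dataset):
# [weight in 1000 lb, number of cylinders] and the matching mpg values.
toy_x = [[2.1, 4], [2.6, 4], [3.0, 6], [3.4, 6], [4.0, 8], [4.4, 8]]
toy_mpg = [33.0, 28.5, 22.0, 20.5, 15.0, 13.5]

beta, fitted, residuals = fit_ols(toy_x, toy_mpg)
f_stat = overall_f(toy_mpg, fitted, n_predictors=2)

# (fitted, residual) pairs are what a residual-versus-fitted plot displays.
diagnostic_pairs = list(zip(fitted, residuals))
```

Comparing models A) to E) then amounts to calling the fit with different predictor columns and comparing the resulting F statistics. The normal equations become numerically unstable when predictors are nearly collinear, as cylinders and displacement are here, so a production analysis would rely on the statistical package instead.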
