Ch11 Curve Fitting
Dr. Deshi Ye
yedeshi@zju.edu.cn

Outline
The Method of Least Squares
Inferences Based on the Least Squares Estimators
Curvilinear Regression
Multiple Regression

11.1 The Method of Least Squares
Study the case where a dependent variable is to be predicted in terms of a single independent variable.
The random variable Y depends on a random variable X.
Regression curve of Y on x: the relationship between x and the mean of the corresponding distribution of Y.

Linear regression
Linear regression: for any x, the mean of the distribution of the Y’s is given by
$$\alpha + \beta x$$
In general, Y will differ from this mean, and we denote this difference by $\varepsilon$:
$$Y = \alpha + \beta x + \varepsilon$$
$\varepsilon$ is a random variable, and we can choose $\alpha$ so that the mean of the distribution of this random variable is equal to zero.

EX
x |  1   2   3   4   5   6    7    8    9   10   11   12
y | 16  35  45  64  86  96  106  124  134  156  164  182

Analysis
Fitted line: $\hat{y} = a + bx$
Error at the i-th point: $e_i = y_i - \hat{y}_i$
$\sum_{i=1}^{n} e_i$ should be as close as possible to zero.

Principle of least squares
Choose a and b so that
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - (a + b x_i) \right)^2$$
is minimum.
The procedure of finding the equation of the line which best fits a given set of paired data is called the method of least squares.
Some notations:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n}$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left( \sum_{i=1}^{n} x_i \right)\left( \sum_{i=1}^{n} y_i \right)}{n}$$

Least squares estimators
$$a = \bar{y} - b\bar{x} \quad \text{and} \quad b = \frac{S_{xy}}{S_{xx}},$$
where $\bar{x}$, $\bar{y}$ are the means of x and y.
Fitted (or estimated) regression line: $\hat{y} = a + bx$
Residuals: observation minus fitted value $= y_i - (a + b x_i)$
The minimum value of the sum of squares is called the residual sum of squares or error sum of squares. We will show that
$$SSE = \text{residual sum of squares} = \sum_{i=1}^{n} (y_i - a - b x_i)^2 = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$

EX solution
Y = 14.8 X + 4.35

X-and-Y
X-axis: independent, predictor, carrier, input
Y-axis: dependent, predicted, response, output

Example
You’re a marketing analyst for Hasbro Toys.
You gather the following data:
Ad $   Sales (Units)
1      1
2      1
3      2
4      2
5      4
What is the relationship between sales & advertising?

Scattergram
Sales vs. Advertising (figure: Sales on the vertical axis, Advertising on the horizontal axis)

11.2 Inference based on the Least Squares Estimators
We assume that the regression is linear in x and, furthermore, that the n random variables $Y_i$ are independently normally distributed with the means $\alpha + \beta x_i$.
Statistical model for straight-line regression:
$$Y_i = \alpha + \beta x_i + \varepsilon_i$$
The $\varepsilon_i$ are independent normally distributed random variables having zero means and the common variance $\sigma^2$.

Standard error of estimate
From the i-th deviations $y_i - (a + b x_i)$, the estimate of $\sigma^2$ is
$$s_e^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left[ y_i - (a + b x_i) \right]^2$$
The estimate of $\sigma^2$ can also be written as follows:
$$s_e^2 = \frac{S_{yy} - (S_{xy})^2 / S_{xx}}{n-2}$$

Statistics for inferences: based on the assumption made concerning the distribution of the values of Y, the following theorem holds.
Theorem. The statistics
$$t = \frac{(a - \alpha)}{s_e} \sqrt{\frac{n S_{xx}}{S_{xx} + n(\bar{x})^2}} \quad \text{and} \quad t = \frac{(b - \beta)}{s_e} \sqrt{S_{xx}}$$
are values of random variables having the t distribution with n-2 degrees of freedom.
Confidence intervals:
$$\alpha: \quad a \pm t_{\alpha/2} \, s_e \sqrt{\frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}}$$
$$\beta: \quad b \pm t_{\alpha/2} \, s_e \frac{1}{\sqrt{S_{xx}}}$$

Example
The following data pertain to the number of computer jobs per day and the central processing unit (CPU) time required.
Number of jobs x |  1  2  3  4   5
CPU time y       |  2  5  4  9  10

EX
1) Obtain a least squares fit of a line to the observations on CPU time.
$$b = \frac{S_{xy}}{S_{xx}} = \frac{20}{10} = 2, \quad a = \bar{y} - b\bar{x} = 6 - 2 \cdot 3 = 0$$
$$\hat{y} = 2x$$

Example
2) Construct a 95% confidence interval for α.
$$s_e^2 = \frac{S_{yy} - S_{xy}^2 / S_{xx}}{n-2} = \frac{46 - 400/10}{3} = 2$$
With n-2 = 3 degrees of freedom, $t_{\alpha/2} = t_{0.025} = 3.182$.
The 95% confidence interval of α is
$$a \pm t_{\alpha/2} \, s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 0 \pm 3.182 \cdot \sqrt{2} \cdot \sqrt{\frac{1}{5} + \frac{9}{10}} = 0 \pm 4.72$$

Example
3) Test the null hypothesis $\beta = 1$ against the alternative hypothesis $\beta > 1$ at the 0.05 level of significance.
Solution: the t statistic is given by
$$t = \frac{(b - \beta)}{s_e} \sqrt{S_{xx}} = \frac{2 - 1}{\sqrt{2}} \sqrt{10} = 2.236$$
Criterion: reject the null hypothesis if $t > t_{0.05} = 2.353$.
Decision: since 2.236 < 2.353, we cannot reject the null hypothesis.

11.3 Curvilinear Regression
The regression curve is nonlinear.
Polynomial regression:
$$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p$$
Exponential regression: if the regression of Y on x is exponential, the mean of the distribution of values of Y is given by
$$y = \alpha \beta^x$$
Taking logarithms, we have
$$\log y = \log \alpha + x \log \beta$$
Thus, we can estimate $\alpha$ and $\beta$ by fitting a line to the pairs of values $(x_i, \log y_i)$.

Polynomial regression
If there is no clear indication about the functional form of the regression of Y on x, we assume it is a polynomial regression:
$$Y = a_0 + a_1 x + a_2 x^2 + \cdots + a_k x^k$$

Polynomial Fitting
• Really just a generalization of the previous case
• Exact solution
• Just big matrices

11.4 Multiple Regression
The mean of Y on x is given by
$$b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
Minimize
$$\sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik}) \right]^2$$
We can solve it when k = 2 by the following equations:
$$\sum y = n b_0 + b_1 \sum x_1 + b_2 \sum x_2$$
$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2$$
$$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2$$

Example
P365.

Multiple Linear Fitting
$X_1(x), \ldots, X_M(x)$ are arbitrary fixed functions of x (can be nonlinear), called the basis functions.
The normal equations of the least squares problem can be put in matrix form and solved.

Correlation Models
1. How strong is the linear relationship between 2 variables?
2. Coefficient of correlation used
Population correlation coefficient denoted $\rho$
Values range from -1 to +1

Correlation
Standardized observation:
$$\frac{\text{Observation} - \text{Sample mean}}{\text{Sample standard deviation}} = \frac{x_i - \bar{x}}{s_x}$$
The sample correlation coefficient r:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

Coefficient of Correlation Values
(figure: scale of r from -1.0 to +1.0; 0 marks no correlation, values toward -1.0 show an increasing degree of negative correlation, values toward +1.0 show an increasing degree of positive correlation)
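
Worked check in code
The formulas of this chapter can be checked numerically. The Python sketch below is an illustration only, not part of the original slides: the function name `fit_line` and the dictionary keys are hypothetical, and the critical value 3.182 is simply copied from the CPU-time example rather than computed. It recomputes the least squares line, the estimate of the error variance, the half-width of the 95% confidence interval for α, and the sample correlation coefficient for the CPU-time data, using the identity r = S_xy / sqrt(S_xx S_yy), which is algebraically the same as the standardized-observation formula for r.

```python
import math


def fit_line(xs, ys):
    """Least squares fit of y = a + b*x with the chapter's summary statistics.

    Hypothetical helper written for illustration; assumes len(xs) == len(ys) > 2
    and that the x values are not all equal.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sums of squares and cross products about the means.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b = s_xy / s_xx                 # slope estimate
    a = y_bar - b * x_bar           # intercept estimate
    sse = s_yy - s_xy ** 2 / s_xx   # residual (error) sum of squares
    se2 = sse / (n - 2)             # estimate of the error variance
    r = s_xy / math.sqrt(s_xx * s_yy)  # sample correlation coefficient

    return {"n": n, "x_bar": x_bar, "a": a, "b": b,
            "s_xx": s_xx, "se2": se2, "r": r}


# CPU-time example from section 11.2.
jobs = [1, 2, 3, 4, 5]
cpu_time = [2, 5, 4, 9, 10]
result = fit_line(jobs, cpu_time)

# Half-width of the 95% confidence interval for alpha.
# T_CRIT is t_0.025 with 3 degrees of freedom, copied from the example.
T_CRIT = 3.182
half_width = T_CRIT * math.sqrt(result["se2"]) * math.sqrt(
    1 / result["n"] + result["x_bar"] ** 2 / result["s_xx"]
)

print("a =", result["a"], "b =", result["b"])
print("se^2 =", result["se2"])
print("CI half-width for alpha =", round(half_width, 2))
print("r =", round(result["r"], 4))
```

With these data the sketch should reproduce the values of the example (b = 2, a = 0, s_e^2 = 2, half-width about 4.72). A table lookup or a statistics library would be needed to obtain the critical value for other sample sizes.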