Chapter 4: Moments – Linear Regression

Chapter 4: Elements of Statistics
4-6 Curve Fitting and Linear Regression
4-7 Correlation Between Two Sets of Data

Concepts
How close are the sample values to the underlying pdf values?
Practical curve fitting, using an NTC resistor to measure temperature.

Statistics
Definition: The science of assembling, classifying, tabulating, and analyzing data or facts.
Descriptive statistics: collecting, grouping, and presenting data in a way that can be easily understood or assimilated.
Inductive statistics (statistical inference): using data to draw conclusions about, or estimate parameters of, the environment from which the data came.

Notes and figures are based on or taken from materials in the course textbook: Probabilistic Methods of Signal and System Analysis (3rd ed.) by George R. Cooper and Clare D. McGillem; Oxford Press, 1999. ISBN: 0-19-512354-9. B.J. Bazuin, Spring 2016, ECE 3800.

4-6 Curve Fitting and Linear Regression

Fitting lines/curves to scatter plots. Data are provided as (x, y) pairs.

Is there a function that goes through all the points? Yes, if you are willing to use a polynomial of degree n-1 for n pairs! But we usually want simple curves to represent the data, like lines or parabolas, where

$$y = a + b x \quad \text{or} \quad y = a + b x + c x^2$$

To fit the curve we want to minimize the following function (the squared error):

$$\sum_{i=1}^{n} \left[ y_i - \left( a + b x_i + c x_i^2 \right) \right]^2$$

For a linear regression (a line), we have

$$\mathrm{err}^2 = \sum_{i=1}^{n} \left[ y_i - \left( a + b x_i \right) \right]^2$$

To minimize over the values a and b, take the derivatives and set them equal to zero. Then solve for a and b:

$$\frac{d\,\mathrm{err}^2}{da} = -2 \sum_{i=1}^{n} \left[ y_i - \left( a + b x_i \right) \right] = 0$$

$$\frac{d\,\mathrm{err}^2}{db} = -2 \sum_{i=1}^{n} \left[ y_i - \left( a + b x_i \right) \right] x_i = 0$$

Solving results in

$$a = \frac{\sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad b = \frac{n \sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$

What happens when we take expected values?

Proof:

Starting from the derivative with respect to a,

$$\frac{d\,\mathrm{err}^2}{da} = -2 \sum_{i=1}^{n} \left[ y_i - \left( a + b x_i \right) \right] = 0$$

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \left( a + b x_i \right) = n a + b \sum_{i=1}^{n} x_i$$

$$n a = \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i$$

$$a = \frac{1}{n} \sum_{i=1}^{n} y_i - b \, \frac{1}{n} \sum_{i=1}^{n} x_i$$

Working on b,

$$\frac{d\,\mathrm{err}^2}{db} = -2 \sum_{i=1}^{n} \left[ y_i - \left( a + b x_i \right) \right] x_i = 0$$

$$\sum_{i=1}^{n} y_i x_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2$$

Substituting the expression for a,

$$\sum_{i=1}^{n} y_i x_i = \left[ \frac{1}{n} \sum_{i=1}^{n} y_i - b \, \frac{1}{n} \sum_{i=1}^{n} x_i \right] \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2$$

$$\sum_{i=1}^{n} y_i x_i - \frac{1}{n} \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i = b \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$

Isolating b,

$$b = \frac{\sum_{i=1}^{n} y_i x_i - \frac{1}{n} \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} = \frac{\frac{1}{n} \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \sum_{i=1}^{n} y_i \cdot \frac{1}{n} \sum_{i=1}^{n} x_i}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2}$$

Now that b is determined from the data values, return to a. Substituting the expression for b into a,

$$a = \frac{1}{n} \sum_{i=1}^{n} y_i - \frac{\frac{1}{n} \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \sum_{i=1}^{n} y_i \cdot \frac{1}{n} \sum_{i=1}^{n} x_i}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2} \cdot \frac{1}{n} \sum_{i=1}^{n} x_i$$

Placing both terms over the common denominator,

$$a = \frac{\frac{1}{n} \sum_{i=1}^{n} y_i \left[ \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2 \right] - \left[ \frac{1}{n} \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \sum_{i=1}^{n} y_i \cdot \frac{1}{n} \sum_{i=1}^{n} x_i \right] \frac{1}{n} \sum_{i=1}^{n} x_i}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2}$$

The two terms containing the squared mean of x cancel. Therefore a becomes

$$a = \frac{\frac{1}{n} \sum_{i=1}^{n} y_i \cdot \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \sum_{i=1}^{n} x_i \cdot \frac{1}{n} \sum_{i=1}^{n} y_i x_i}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2}$$

Alternate formulation using the computed sample means of x and y, $\hat{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\hat{Y} = \frac{1}{n} \sum_{i=1}^{n} y_i$:

$$b = \frac{\frac{1}{n} \sum_{i=1}^{n} y_i x_i - \hat{Y} \hat{X}}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \hat{X}^2}$$

$$a = \frac{\hat{Y} \, \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \hat{X} \, \frac{1}{n} \sum_{i=1}^{n} y_i x_i}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \hat{X}^2}$$

Linear regression example, p. 180, Figure 4-5:
    %%
    % Figure 4_5
    %
    clear;
    close all;

    x = (0:0.5:10)';

    % Reference line y = a + b*x (a is the intercept, b is the slope)
    a = 2;
    b = 4;
    yref = a + b*x;

    % Random noise added to the line
    ydata = yref + 5*randn(size(x));

    figure
    plot(x,ydata,'x',x,yref)
    legend('Data','Ref Line')

    % Sample moments of the data
    meanx = mean(x);
    meany = mean(ydata);
    meanxsq = mean(x.^2);
    meanysq = mean(ydata.^2);
    meancorr = mean(x.*ydata);

    % Closed-form estimates of a and b from the sample moments
    aest_equ = (meany*meanxsq-meanx*meancorr)/(meanxsq-meanx^2);
    best_equ = (meancorr-meany*meanx)/(meanxsq-meanx^2);
    yest_equ = aest_equ + best_equ*x;

    % Same fit using polyfit (coefficients are returned highest power first)
    p = polyfit(x,ydata,1);
    aest = p(2);
    best = p(1);
    yest = polyval(p,x);

    figure
    plot(x,ydata,'bo',x,yref,'k',x,yest,'r',x,yest_equ,'m');
    legend('Data','Ref Line','Polyfit Line','Equ Line')

    fprintf('Computation error\n')
    max(abs(yest_equ-yest))

    % Normalized correlation without removing the means (not Pearson's r)
    rxy = meancorr/sqrt(meanxsq*meanysq)
    % Correlation coefficient with the means removed (Pearson's r)
    rhoxy = (meancorr-meanx*meany)/sqrt((meanxsq-meanx^2)*(meanysq-meany^2))

Non-linear estimation using a polynomial fit

Example: Taking the data from Table 4-3 on p. 180.
     i    T, x_i    V_B, y_i
     1      10        425
     2      20        400
     3      30        366
     4      40        345
     5      50        283
     6      60        298
     7      70        205
     8      80        189
     9      90         83
    10     100         22

[Figure 4-6: Breakdown Voltage versus Temperature (in C), showing the data points and the fitted curve V = 426.05 - 0.654015 x - 0.0333712 x^2.]

    % Table 4-3 data: temperature (x) and breakdown voltage (y)
    x = (10:10:100)';
    y = [425 400 366 345 283 298 205 189 83 22]';

    % Second-order fit; polyfit returns the highest power first
    p = polyfit(x,y,2);
    a = p(3);
    b = p(2);
    c = p(1);
    z = a + b*x + c*x.^2;

    figure
    plot(x,y,'bo',x,z,'r');
    xlabel('Temperature (in C)')
    ylabel('Breakdown Voltage')
    title('Figure 4-6')
    grid
    atxt = sprintf('V=%g+%g*x+%g*x^2',a,b,c);
    text(50,375,atxt);

4-7 Correlation of discrete random variables

For a single random variable, we have defined measures of the relationship between one sample or event and the next. These are the mean, the moments, and the variance. With n equally likely samples, the pdf is a sum of impulses of weight 1/n at the sample values.

Mean or 1st moment:

$$\bar{X} = E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx = \int_{-\infty}^{\infty} x \sum_{i=1}^{n} \frac{1}{n} \delta(x - x_i) \, dx = \frac{1}{n} \sum_{i=1}^{n} x_i$$

2nd moment:

$$\overline{X^2} = E[X^2] = \int_{-\infty}^{\infty} x^2 \, f(x) \, dx = \frac{1}{n} \sum_{i=1}^{n} x_i^2 = R_{XX}$$

2nd central moment:

$$\sigma_X^2 = E\left[ (X - \bar{X})^2 \right] = \int_{-\infty}^{\infty} (x - \bar{X})^2 \sum_{i=1}^{n} \frac{1}{n} \delta(x - x_i) \, dx = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{X})^2$$

$$E\left[ (X - \bar{X})^2 \right] = \frac{1}{n} \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{X} + \bar{X}^2 \right) = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - 2 \bar{X} \, \frac{1}{n} \sum_{i=1}^{n} x_i + \bar{X}^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{X}^2$$

$$C_{XX} = E\left[ (X - \bar{X})^2 \right] = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2$$

The variance is a measure of the similarity of successive samples or events with each other. How close to, or correlated with, the others would an event be expected to be?

$$\sigma_X^2 = C_{XX} = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2 = R_{XX} - \bar{X}^2$$

Correlation between discrete random variables X and Y

Consider two sequences or paired groupings (x, y). If we assume that every (x, y) pair is equally likely, the pmf has the same value for every pair. Repeated pairs simply sum the probability at that point. So, for correlation,

$$E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \, y \, f(x, y) \, dx \, dy$$

For the pairs $(x_i, y_i)$, i = 1 to n, we can define the pmf at each sample point as 1/n. Therefore,

$$E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \, y \sum_{i=1}^{n} \frac{1}{n} \delta(x - x_i) \, \delta(y - y_i) \, dx \, dy$$

$$R_{XY} = E[XY] = \frac{1}{n} \sum_{i=1}^{n} x_i y_i$$

Defining the covariance (the cross-correlation with the means removed):

$$E\left[ (X - \bar{X})(Y - \bar{Y}) \right] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \bar{X})(y - \bar{Y}) \, f(x, y) \, dx \, dy = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})$$

$$E\left[ (X - \bar{X})(Y - \bar{Y}) \right] = \frac{1}{n} \sum_{i=1}^{n} \left( x_i y_i - x_i \bar{Y} - y_i \bar{X} + \bar{X} \bar{Y} \right)$$

$$E\left[ (X - \bar{X})(Y - \bar{Y}) \right] = \frac{1}{n} \sum_{i=1}^{n} x_i y_i - \frac{1}{n} n \bar{X} \bar{Y} - \frac{1}{n} n \bar{X} \bar{Y} + \frac{1}{n} n \bar{X} \bar{Y}$$

$$C_{XY} = E\left[ (X - \bar{X})(Y - \bar{Y}) \right] = \frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{X} \bar{Y}$$

The Discrete Correlation Coefficient

Again take two sequences or paired groupings (x, y), with every pair equally likely so that the pmf is 1/n for every pair and repeated pairs sum the probability at that point. So,

$$\rho_{XY} = E\left[ \left( \frac{X - \bar{X}}{\sigma_X} \right) \left( \frac{Y - \bar{Y}}{\sigma_Y} \right) \right] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left( \frac{x - \bar{X}}{\sigma_X} \right) \left( \frac{y - \bar{Y}}{\sigma_Y} \right) f(x, y) \, dx \, dy$$

$$\rho_{XY} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{X}}{\sigma_X} \right) \left( \frac{y_i - \bar{Y}}{\sigma_Y} \right) = \frac{1}{\sigma_X \sigma_Y} \cdot \frac{1}{n} \sum_{i=1}^{n} \left( x_i y_i - x_i \bar{Y} - y_i \bar{X} + \bar{X} \bar{Y} \right)$$

$$\rho_{XY} = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{X} \bar{Y}}{\sigma_X \sigma_Y} = \frac{C_{XY}}{\sigma_X \sigma_Y}$$

or, making it fully data driven,

$$r_{XY} = \rho_{XY} = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \cdot \frac{1}{n} \sum_{i=1}^{n} y_i}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2} \; \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} y_i \right)^2}}$$

The text defines this as Pearson's r statistical measure, the linear correlation coefficient between two sets of data!
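As a concrete check of the sample-moment formulas above, the following is a minimal sketch in Python (standard library only, rather than the MATLAB used elsewhere in these notes). It uses the Table 4-3 data from these notes and fits a straight line purely for illustration; the notes themselves fit that data with a second-order polynomial. The function names `mean` and `sample_stats` and the dictionary keys are illustrative choices, not taken from the text.

```python
import math

# Table 4-3 from these notes: temperature x_i and breakdown voltage y_i.
x = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
y = [425, 400, 366, 345, 283, 298, 205, 189, 83, 22]


def mean(values):
    """Sample mean with every point weighted 1/n."""
    return sum(values) / len(values)


def sample_stats(x, y):
    """Sample moments, least-squares line y = a + b*x, and correlation coefficient."""
    mx, my = mean(x), mean(y)
    rxx = mean([xi * xi for xi in x])               # R_XX, 2nd moment of x
    ryy = mean([yi * yi for yi in y])               # R_YY, 2nd moment of y
    rxy = mean([xi * yi for xi, yi in zip(x, y)])   # R_XY, correlation
    cxx = rxx - mx ** 2                             # C_XX, variance of x
    cyy = ryy - my ** 2                             # C_YY, variance of y
    cxy = rxy - mx * my                             # C_XY, covariance
    return {
        "mean_x": mx, "mean_y": my,
        "Rxx": rxx, "Ryy": ryy, "Rxy": rxy,
        "Cxx": cxx, "Cyy": cyy, "Cxy": cxy,
        "b": cxy / cxx,                             # slope
        "a": (my * rxx - mx * rxy) / cxx,           # intercept
        "rho": cxy / math.sqrt(cxx * cyy),          # Pearson's r
    }


stats = sample_stats(x, y)
for name, value in stats.items():
    print(f"{name:>6s} = {value:.6g}")
```

On this data the slope b is negative and the correlation coefficient is close to -1, which is consistent with the breakdown voltage falling as temperature rises.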
See also Wikipedia, "Pearson product-moment correlation coefficient": http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Based on the discrete terms, linear estimation becomes

$$a = \frac{\hat{Y} R_{XX} - \hat{X} R_{XY}}{R_{XX} - \hat{X}^2} = \frac{\hat{Y} R_{XX} - \hat{X} R_{XY}}{C_{XX}} \quad \text{and} \quad b = \frac{R_{XY} - \hat{Y} \hat{X}}{R_{XX} - \hat{X}^2} = \frac{C_{XY}}{C_{XX}}$$

Pavlovian conditioning for sampled data: always compute the following with data.

x: mean, 2nd moment, variance ($\bar{X}$, $R_{XX}$, and $\sigma_X^2$)
y: mean, 2nd moment, variance ($\bar{Y}$, $R_{YY}$, and $\sigma_Y^2$)
x and y: $R_{XY}$, $C_{XY}$, and $\rho_{XY}$

$$\bar{X} = E[X] = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\overline{X^2} = E[X^2] = R_{XX} = \frac{1}{n} \sum_{i=1}^{n} x_i^2$$

$$\sigma_X^2 = C_{XX} = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2 = R_{XX} - \bar{X}^2$$

$$R_{XY} = E[XY] = \frac{1}{n} \sum_{i=1}^{n} x_i y_i$$

$$C_{XY} = E\left[ (X - \bar{X})(Y - \bar{Y}) \right] = \frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{X} \bar{Y}$$

$$\rho_{XY} = \frac{C_{XY}}{\sigma_X \sigma_Y}$$

For more information: Alberto Leon-Garcia, "Probability, Statistics, and Random Processes For Electrical Engineering, 3rd ed.", Pearson Prentice Hall, Upper Saddle River, NJ, 2008. Chap. 8.

Practical Example: NTC Resistor Temperature Measurements

Sunseeker

Based on Vishay BCComponents, Resistor Products Application Note, Document Number: 29053, 24 May, 2012.

Note: The B constant will be called a K constant in the following material.
The data is an exponential curve with respect to temperature. NTC resistors are typically referenced to 25° C, or 298.15 K. For the 1st order approximation, assume (with T1 and T2 as absolute temperatures in kelvin)

$$R_{T_2} = R_{T_1} \exp\left[ K \left( \frac{1}{T_2} - \frac{1}{T_1} \right) \right]$$

Plotting the resistance versus temperature based on the data and some approximations gives the following.

[Figure: Sunseeker NTC Resistor Temperature Curves. Resistance (ohms) versus Temperature (deg C) from 0 to 100, with three curves: Data Sheet, K25/85, and K25/60.]

See Excel Spread Sheet for values.

Typically, data sheets provide K values based on 25° C and 85° C:

$$K = \frac{\ln\left( \dfrac{R_{T_2}}{R_{T_1}} \right)}{\dfrac{1}{T_2} - \dfrac{1}{T_1}} = 4190$$

For better accuracy within a critical region, K can be computed from temperatures that bound the desired operating points. For Sunseeker, the key temperatures for battery operation are 45° C and 60° C. Therefore a K based on 25° C and 60° C is sufficient for operation. This resulted in a portion of the spread sheet analysis.

Designing with an NTC Thermistor

EPCOS NTC Thermistor Application Notes, Feb. 2009.

A reference current or voltage is required. In this case a known voltage is provided to a resistor divider and the output voltage is indicative of the temperature.
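The first-order model and the K computation above can be sketched as follows. This is a minimal Python sketch, not the spread sheet analysis referenced in these notes. K = 4190 is the 25/85 value quoted above; the 25° C resistance `R25_ASSUMED` is a hypothetical value chosen only for illustration, and the function names are likewise illustrative.

```python
import math

KELVIN_OFFSET = 273.15  # 0 deg C expressed in kelvin


def ntc_resistance(r_ref, k, t_ref_c, t_c):
    """First-order model R(T) = R_ref * exp(K * (1/T - 1/T_ref)), T in kelvin."""
    t_ref = t_ref_c + KELVIN_OFFSET
    t = t_c + KELVIN_OFFSET
    return r_ref * math.exp(k * (1.0 / t - 1.0 / t_ref))


def k_from_two_points(r1, t1_c, r2, t2_c):
    """K = ln(R_T2 / R_T1) / (1/T2 - 1/T1), temperatures given in deg C."""
    t1 = t1_c + KELVIN_OFFSET
    t2 = t2_c + KELVIN_OFFSET
    return math.log(r2 / r1) / (1.0 / t2 - 1.0 / t1)


K_25_85 = 4190.0       # K value quoted in these notes for 25 C and 85 C
R25_ASSUMED = 100e3    # ASSUMPTION: hypothetical 25 C resistance, not from the notes

# Predict the 85 C resistance from the model, then recover K from the two points.
r85 = ntc_resistance(R25_ASSUMED, K_25_85, 25.0, 85.0)
k_recovered = k_from_two_points(R25_ASSUMED, 25.0, r85, 85.0)

print(f"Modelled R(85 C) = {r85:.1f} ohms")
print(f"Recovered K      = {k_recovered:.3f}")
```

With real data sheet values, `k_from_two_points` would be called with the tabulated resistances at 25° C and 60° C to obtain the K25/60 constant for the region of interest.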
The resulting curve is highly non-linear due to the exponential nature of the device. To "linearize the curve" and reduce the steepest part of the curve, place the NTC in parallel with a large resistor.

For Sunseeker: A 2.5 V reference (Vref) drives the resistor divider. The upper resistor is 100 kΩ and the resistor in parallel with the NTC thermistor is 330 kΩ. An inverting op-amp is not used; the divider output is connected directly to a 24-bit ADC.

The resulting voltage to temperature curve is

[Figure "NTC": divider output voltage versus temperature, with the temperature axis running from 0 to 100 and the voltage axis from 0 to 1.8.]

See the spread sheet for the expected ADC outputs and hexadecimal digital values.
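The divider described above can be sketched numerically as follows. This is a rough Python sketch under stated assumptions, not the spread sheet referenced in these notes. Vref, the 100 kΩ upper resistor, the 330 kΩ parallel resistor, K = 4190, and the 24-bit width come from the notes. The 25° C resistance `R25_ASSUMED`, the choice to take the output across the lower leg (NTC in parallel with 330 kΩ), and the unipolar ADC scaling with full scale equal to Vref are all assumptions made for illustration.

```python
import math

V_REF = 2.5            # volts, from the notes
R_UPPER = 100e3        # ohms, upper divider resistor, from the notes
R_PARALLEL = 330e3     # ohms, resistor in parallel with the NTC, from the notes
K = 4190.0             # 25/85 constant quoted in the notes
ADC_BITS = 24          # from the notes
R25_ASSUMED = 100e3    # ASSUMPTION: hypothetical NTC resistance at 25 C
KELVIN_OFFSET = 273.15


def ntc_resistance(t_c):
    """First-order NTC model referenced to 25 C (298.15 K)."""
    t = t_c + KELVIN_OFFSET
    return R25_ASSUMED * math.exp(K * (1.0 / t - 1.0 / 298.15))


def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)


def divider_voltage(t_c):
    """ASSUMPTION: output is taken across the lower leg (NTC || 330k)."""
    r_lower = parallel(ntc_resistance(t_c), R_PARALLEL)
    return V_REF * r_lower / (R_UPPER + r_lower)


def adc_code(voltage):
    """ASSUMPTION: unipolar straight-binary ADC with full scale equal to V_REF."""
    full_scale = 2 ** ADC_BITS - 1
    return min(int(voltage / V_REF * full_scale), full_scale)


for t_c in range(0, 101, 20):
    v = divider_voltage(t_c)
    print(f"{t_c:3d} C  {v:.4f} V  0x{adc_code(v):06X}")
```

Under these assumptions the output voltage falls monotonically as temperature rises, because the NTC resistance, and therefore the lower leg of the divider, decreases with temperature.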