Linear Regression Example

This example uses the truck sales dataset to illustrate ordinary least-squares (OLS) linear regression. The plot shows the line that linear regression learns, which minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. We also compute the mean squared error and the variance score for the model.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model

# Get data
df = pd.read_csv(filepath_or_buffer='trucks.csv', header=None)
data = df.iloc[:, :].values
X = data[:, 0].reshape(-1, 1)
Y = data[:, 1].reshape(-1, 1)

# Train the model using the training sets
regr = linear_model.LinearRegression()
regr.fit(X, Y)
slope = regr.coef_[0][0]
intercept = regr.intercept_[0]
print("y = %f + %fx" % (intercept, slope))
print("Mean squared error: %f" % np.mean((regr.predict(X) - Y) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %f' % regr.score(X, Y))

# Plot outputs
plt.scatter(X, Y, color='black')
plt.plot(X, regr.predict(X), color='red', linewidth=1)
plt.show()

y = 0.434585 + 0.851144x
Mean squared error: 0.011812
Variance score: 0.997083

[Plot: scatter of the truck data (black points) with the fitted regression line (red).]
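For a single predictor, the OLS fit above also has a closed form: the slope is cov(x, y) / var(x) and the intercept is ȳ − slope · x̄. The following is a minimal sketch, not part of the original notebook, that checks sklearn's coefficients against the normal equations; it reuses X, Y, and np from the cell above, and the _cf names are purely illustrative:

# Closed-form OLS for one predictor: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x). Reuses X, Y from the cell above.
x = X.ravel()
y = Y.ravel()
slope_cf = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_cf = y.mean() - slope_cf * x.mean()
print("closed form: y = %f + %fx" % (intercept_cf, slope_cf))
# Should agree with regr above: y = 0.434585 + 0.851144x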
In the cell below, we load a subset of the Iris dataset from UCI, specifically all the samples for the "Iris Setosa" flower. The function model finds the OLS model for a pair of features in the data and computes the sum of squared errors (SSE) and the variance score for that model. The parameters v1 and v2 are the names of the X and Y variables.

In [2]:
import numpy as np
import pandas as pd
from sklearn import linear_model

df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None)
data = df.iloc[:, :].values
data = data[data[:, 4] == "Iris-setosa"][:, :4]

def model(X, Y, v1="A", v2="B"):
    X = X.reshape(-1, 1)
    Y = Y.reshape(-1, 1)
    regr = linear_model.LinearRegression()
    regr.fit(X, Y)
    slope = regr.coef_[0][0]
    intercept = regr.intercept_[0]
    print("%s = %f + %fx%s" % (v2, intercept, slope, v1))
    sse = np.sum((regr.predict(X) - Y) ** 2)
    print("Sum of squared errors: %f" % sse)
    # Explained variance score: 1 is perfect prediction
    print('Variance score: %f' % regr.score(X, Y))
    return slope, intercept, sse, v1, v2

Exercise

The samples have 4 features. For each combination of features (each pair of features), consider one of the variables as the predictor and the other as the response, and use the function model to find the OLS model that best fits the data. Report the model with the smallest SSE score.

In [3]:
one = data[:, 0]
two = data[:, 1]
three = data[:, 2]
four = data[:, 3]

print(model(one, two))
print(model(one, three))
print(model(one, four))
print(model(two, three))
print(model(two, four))
print(model(three, four))
print(model(two, one))
print(model(three, one))
print(model(four, one))
print(model(three, two))
print(model(four, two))
print(model(four, three))

B = -0.623012 + 0.807234xA
Sum of squared errors: 3.146569
Variance score: 0.557681
(0.807233665122696, -0.623011727604216, 3.146569429387999, 'A', 'B')
B = 0.813768 + 0.129891xA
Sum of squared errors: 1.372483
Variance score: 0.069630
(0.12989060806149594, 0.8137676160441513, 1.3724825071449684, 'A', 'B')
B = -0.180937 + 0.084886xA
Sum of squared errors: 0.519331
Variance score: 0.077892
(0.08488551624453856, -0.18093689432016008, 0.5193311652048225, 'A', 'B')
B = 1.188976 + 0.080463xA
Sum of squared errors: 1.429143
Variance score: 0.031221
(0.08046332480530793, 1.1889763558154574, 1.4291427928814417, 'A', 'B')
B = -0.025258 + 0.078776xA
Sum of squared errors: 0.519054
Variance score: 0.078385
(0.07877646265006044, -0.025257949337906593, 0.5190536703309061, 'A', 'B')
B = -0.033080 + 0.189262xA
Sum of squared errors: 0.510358
Variance score: 0.093825
(0.18926247288503256, -0.03308026030368766, 0.5103579175704992, 'A', 'B')
B = 2.644660 + 0.690854xA
Sum of squared errors: 2.692927
Variance score: 0.557681
(0.6908543956816328, 2.6446596755601792, 2.6929269869830503, 'A', 'B')
B = 4.221204 + 0.536063xA
Sum of squared errors: 5.664281
Variance score: 0.069630
(0.536062906724512, 4.221203904555315, 5.664281453362259, 'A', 'B')
B = 4.782102 + 0.917614xA
Sum of squared errors: 5.613977
Variance score: 0.077892
(0.9176136363636359, 4.782102272727273, 5.613977272727274, 'A', 'B')
B = 2.849946 + 0.388015xA
Sum of squared errors: 6.891700
Variance score: 0.031221
(0.3880151843817785, 2.8499457700650765, 6.891700108459869, 'A', 'B')
B = 3.175213 + 0.995028xA
Sum of squared errors: 6.556186
Variance score: 0.078385
(0.9950284090909087, 3.1752130681818183, 6.556186079545453, 'A', 'B')
B = 1.343040 + 0.495739xA
Sum of squared errors: 1.336790
Variance score: 0.093825
(0.4957386363636363, 1.3430397727272727, 1.3367897727272722, 'A', 'B')

In [4]:
sse = [0] * 12
print("One Two")
a, b, sse[0], d, e = model(one, two)
print("one,three")
a, b, sse[1], d, e = model(one, three)
print("one,four")
a, b, sse[2], d, e = model(one, four)
print("two,three")
a, b, sse[3], d, e = model(two, three)
print("two,four")
a, b, sse[4], d, e = model(two, four)
print("three,four")
a, b, sse[5], d, e = model(three, four)
print("two,one")
a, b, sse[6], d, e = model(two, one)
print("three,one")
a, b, sse[7], d, e = model(three, one)
print("four,one")
a, b, sse[8], d, e = model(four, one)
print("three,two")
a, b, sse[9], d, e = model(three, two)
print("four,two")
a, b, sse[10], d, e = model(four, two)
print("four, three")
a, b, sse[11], d, e = model(four, three)

One Two
B = -0.623012 + 0.807234xA
Sum of squared errors: 3.146569
Variance score: 0.557681
one,three
B = 0.813768 + 0.129891xA
Sum of squared errors: 1.372483
Variance score: 0.069630
one,four
B = -0.180937 + 0.084886xA
Sum of squared errors: 0.519331
Variance score: 0.077892
two,three
B = 1.188976 + 0.080463xA
Sum of squared errors: 1.429143
Variance score: 0.031221
two,four
B = -0.025258 + 0.078776xA
Sum of squared errors: 0.519054
Variance score: 0.078385
three,four
B = -0.033080 + 0.189262xA
Sum of squared errors: 0.510358
Variance score: 0.093825
two,one
B = 2.644660 + 0.690854xA
Sum of squared errors: 2.692927
Variance score: 0.557681
three,one
B = 4.221204 + 0.536063xA
Sum of squared errors: 5.664281
Variance score: 0.069630
four,one
B = 4.782102 + 0.917614xA
Sum of squared errors: 5.613977
Variance score: 0.077892
three,two
B = 2.849946 + 0.388015xA
Sum of squared errors: 6.891700
Variance score: 0.031221
four,two
B = 3.175213 + 0.995028xA
Sum of squared errors: 6.556186
Variance score: 0.078385
four, three
B = 1.343040 + 0.495739xA
Sum of squared errors: 1.336790
Variance score: 0.093825
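With the twelve SSE values collected in the sse list, the smallest one can be located programmatically instead of by eye. Below is a minimal sketch, not part of the original notebook; the pairs list is an illustrative helper whose entries mirror the order of the model calls in cell [4]:

# Hypothetical list of (predictor, response) labels, in the same order
# as the model() calls above so that indices line up with sse.
pairs = ["one,two", "one,three", "one,four", "two,three", "two,four",
         "three,four", "two,one", "three,one", "four,one", "three,two",
         "four,two", "four,three"]
best = int(np.argmin(sse))
print("Smallest SSE: %f for pair %s" % (sse[best], pairs[best]))
# Expected: Smallest SSE: 0.510358 for pair three,four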
In [5]:
# The least SSE is for the third and fourth columns, where the third
# column is used as input; 0.510 is the least sum of squared errors.
print("The minimum SSE is obtained for columns three as independent and four as dependent:", sse[5])

The minimum SSE is obtained for columns three as independent and four as dependent: 0.5103579175704992

In [6]:
print("Model with the least SSE")
model(three, four)

Model with the least SSE
B = -0.033080 + 0.189262xA
Sum of squared errors: 0.510358
Variance score: 0.093825

Out[6]:
(0.18926247288503256, -0.03308026030368766, 0.5103579175704992, 'A', 'B')
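Cells [3] and [4] enumerate the twelve ordered feature pairs by hand; the same sweep can be written as a loop. A minimal sketch under the same assumptions (it reuses model and data from cell [2]; the names list is purely an illustrative set of labels):

from itertools import permutations

# Sweep all ordered (predictor, response) feature pairs and track the
# model with the smallest sum of squared errors.
names = ["one", "two", "three", "four"]
best = None
for i, j in permutations(range(4), 2):
    slope, intercept, sse_ij, v1, v2 = model(data[:, i], data[:, j], names[i], names[j])
    if best is None or sse_ij < best[0]:
        best = (sse_ij, names[i], names[j])
print("Smallest SSE: %f (%s as predictor, %s as response)" % best)
# Expected: Smallest SSE: 0.510358 (three as predictor, four as response)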