
Lab1_Solution

September 13, 2018

In [1]:
# Make plots inline
%matplotlib inline
# Make inline plots vector graphics instead of raster graphics
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg')
# Import modules for plotting and data analysis
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn

0.1 Problem 1

Create 1000 samples from a Gaussian distribution with mean -10 and standard deviation 5. Create another 1000 samples from another, independent Gaussian with mean 10 and standard deviation 5. (a) Take the sum of these 2 Gaussians by adding the two sets of 1000 points, point by point, and plot the histogram of the resulting 1000 points. What do you observe? (b) Estimate the mean and the variance of the sum.

In [2]:
mu, sigma = 10, 5
sampleA = np.random.normal(mu, sigma, 1000)   # 1000 samples from the N(10, 5) distribution
sampleB = np.random.normal(-mu, sigma, 1000)  # 1000 samples from the N(-10, 5) distribution
sumMatrix = np.add(sampleA, sampleB)  # element-wise sum of the two sets of samples

# Plot each of the sets of samples as histograms
plt.hist(sampleA, 100)    # orange
plt.hist(sampleB, 100)    # blue
plt.hist(sumMatrix, 100)  # green
plt.title('Samples from 2 Gaussians and their sum')
plt.xlabel('Sample value')
plt.ylabel('Bin occurrence')

# Mean and variance of the summed set
print("Mean:", np.mean(sumMatrix))
print("Variance:", np.var(sumMatrix))

Mean: 0.15540869606444302
Variance: 50.75917718250388

[Figure: histograms "Samples from 2 Gaussians and their sum"; x-axis: Sample value, y-axis: Bin occurrence]

a) The mean of the summed set of samples is quite close to 0. Since the expected value of a Normal distribution is its mean, the expected value (or mean) of the summed set is E[A + B] = E[A] + E[B] = 10 + (-10) = 0. Similarly, when two independent random variables are summed, their variances also add: Var(X + Y) = Var(X) + Var(Y).
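The additivity of the mean and the variance for independent samples can also be checked numerically; a minimal sketch, using a fixed seed and a larger sample count so the estimates are stable:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(10, 5, 100_000)    # samples from N(10, 5)
b = rng.normal(-10, 5, 100_000)   # independent samples from N(-10, 5)
s = a + b

# mean adds: 10 + (-10) = 0; variance adds: 25 + 25 = 50
print("mean:", s.mean())
print("variance:", s.var())
```

With more samples than the 1000 used above, the empirical mean and variance land much closer to the theoretical values of 0 and 50.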
Since Var(A) = σ_a², the new variance will be Var(A + B) = σ_a² + σ_b² = 5² + 5² = 50.

b) Since these samples are drawn randomly from the distributions, the mean and variance won't be exactly 0 and 50 (respectively), but as the number of samples increases they approach these true values.

0.2 Problem 2

Central Limit Theorem. Let the Xi be iid Bernoulli random variables with values {-1, 1}. Look at the random variable

Zn = (1/n) Σ_{i=1}^{n} Xi

By taking 1000 draws from Zn, plot its histogram. Check that for small n (say, 5-10) Zn does not look that much like a Gaussian, but when n is bigger (already by the time n = 30 or 50) it looks much more like a Gaussian. Check also for much bigger n: n = 250, to see that at this point one can really see the bell curve.

In [3]:
def plotBernoulliSum(n):
    data = []
    # Take 1000 samples from Zn
    for x in range(0, 1000):
        # Create a sample of Zn by averaging n size-1 binomial samples
        samples = np.random.binomial(1, 0.5, n)
        samples[samples == 0] = -1  # change all the 0 samples to -1
        data.append(samples.mean())
    # Display the data for this n
    plt.title("n = %d" % n)
    plt.xlabel('Sample value')
    plt.ylabel('Bin count')
    plt.hist(data, bins=100)
    plt.show()

# Create plots for each value of n
plotBernoulliSum(10)
plotBernoulliSum(50)
plotBernoulliSum(250)

[Figures: histograms of Zn for n = 10, n = 50, and n = 250; x-axis: Sample value, y-axis: Bin count]

As can be seen, for fewer samples drawn the random variable Zn appears more triangular: while more values of Zn tend to be close to 0, the histogram does not follow the typical Gaussian curve. As n gets much larger, we see the Gaussian bell curve, with very few extreme values and a much smaller variance as well.

0.3 Problem 3

Estimate the mean and standard deviation from 1-dimensional data: generate 25,000 samples from a Gaussian distribution with mean 0 and standard deviation 5.
Then estimate the mean and standard deviation of this Gaussian using elementary numpy commands, i.e., addition, multiplication, division (do not use a command that takes data and returns the mean or standard deviation).

In [4]:
mu, sigma, samples = 0, 5, 25000
# 25000 samples from a Normal distribution with mean 0 and standard deviation 5
data = np.random.normal(mu, sigma, samples)

# Compute the mean of the data
mean = np.sum(data) / samples

# Compute the variance of the data
variance = np.sum((data - mean)**2)
variance /= (samples - 1)

# Compute the standard deviation from the variance of the data
deviation = np.sqrt(variance)

print("Mean:", mean)
print("Deviation:", deviation)

Mean: -0.042206667255083805
Deviation: 5.030464517832782

The mean is simply computed by finding the expected value of the data (aka the average):

μ_x = (1/N) Σ_{j=1}^{N} X_j

The deviation is found by first finding the variance of the data and then taking its square root:

σ = sqrt( (1/(N-1)) Σ_{j=1}^{N} (X_j - μ_x)² )

These two equations come from Monte Carlo sampling, which seeks to estimate an expectation or integral via a mean, E[X] ≈ (1/N) Σ_{j=1}^{N} X_j, where X_j ~ P(X), the probability distribution of the random variable X. This value approaches the true expectation as N approaches infinity. Also note that for the variance above we divide by N-1 and not N. This is Bessel's correction, which gives an unbiased estimate of the variance when the population mean is itself estimated from the data. For large enough N, though, we tend to just divide by N, since the difference is marginal.

0.4 Problem 4

Estimate the mean and covariance matrix for multi-dimensional data: generate 10,000 samples of 2-dimensional data (Xi, Yi) from the Gaussian distribution with mean [-5, 5] and covariance matrix [[20, 0.8], [0.8, 30]]. Then estimate the mean and covariance matrix for this multi-dimensional data using elementary numpy commands, i.e., addition, multiplication, division (do not use a command that takes data and returns the mean or standard deviation).

In [5]:
samples = 10000
mu = [-5, 5]
cov = [[20, 0.8], [0.8, 30]]
# 10000 samples from a bivariate Normal distribution with the given mean and covariance matrix
x, y = np.random.multivariate_normal(mu, cov, samples).T

# Compute the means
meanX = np.sum(x) / samples
meanY = np.sum(y) / samples
mean = [meanX, meanY]

# Compute the variances as before
varX = np.sum((x - meanX)**2) / (samples - 1)
varY = np.sum((y - meanY)**2) / (samples - 1)

# Compute the covariance between x and y; as n increases this should approach 0.8
cov = np.sum((x - meanX)*(y - meanY)) / (samples - 1)
covMatrix = np.array([[varX, cov], [cov, varY]])

# Print out the mean and covariance matrix
print("Mean:\n", mean, "\n")
print("Covariance:\n", covMatrix)

Mean:
[-5.008183705275261, 4.98693417994575]

Covariance:
[[20.39674574  0.85205106]
 [ 0.85205106 29.53026744]]

We can find the mean and variances as before (albeit with 2 dimensions instead of 1). To find the covariance between two random variables, we use the following equation:

Cov(X, Y) = (1/(N-1)) Σ_{j=1}^{N} (X_j - μ_x)(Y_j - μ_y)

0.5 Problem 5

Download from Canvas/Files the dataset PatientData.csv. Each row is a patient and the last column is the condition that the patient has. Do data exploration using Pandas and other visualization tools to understand what you can about the data set. For example: (a) How many patients and how many features are there? (b) What is the meaning of the first 4 features? See if you can understand what they mean. (c) Are there missing values? Replace them with the average of the corresponding feature column. (d) How could you test which features strongly influence the patient condition and which do not? List what you think are the three most important features.
In [6]:
# Read the data, specifying the values to be floats and missing values to be '?'
# There is no header (names of columns)
data = pd.read_csv('PatientData.csv', header=None, dtype=np.float64, na_values='?')

# Find the number of features and patients
features = len(data.iloc[0]) - 1  # subtract one because the last column is the output
patients = len(data)
print("Features: ", features)
print("Patients: ", patients)

# Replace any missing values with the mean of the corresponding column/feature
newData = data.fillna(data.mean())

# Plot histograms of the first four features
for i, bins in zip(range(4), [50, 10, 100, 100]):
    plt.hist(data.T.iloc[i], bins=bins)
    plt.title('Feature %d' % i)
    plt.xlabel('Sample values')
    plt.ylabel('Bin counts')
    plt.show()

# Write the new data back to a new file
newData.to_csv('PatientDataFixed.csv')

Features: 279
Patients: 452

[Figures: histograms of Feature 0 through Feature 3; x-axis: Sample values, y-axis: Bin counts]

a) There are 279 features (plus the condition column) and 452 patients (aka samples).

b) With these being medical records and the first column being in the range of 0 to 80, it can be assumed that this first feature is very likely the age of the patient. The second feature is binary and fairly evenly distributed, so it is most likely gender. The third and fourth features are more difficult to ascertain without being given foreknowledge, but reasonable guesses could be blood pressure and heart rate.

c) There are several missing values, specified as "?" in the data.
They can be filled in with one line of code using pandas, as seen above.

d) There are various ways to determine feature importance, and this is an active area of machine learning research. Possible approaches include decision trees (e.g. xgboost feature selection), F-scores, PCA/LDA (the eigenvectors with the largest eigenvalues), highest correlation or mutual information with the output variable, standardized regression coefficients, the change in R² or other metrics as a variable is added to the rest of the variables, and others. Depending on which methods you use, you can determine that different features are most important. At the end of the day, the goal is to predict the output variable while minimizing (or maximizing) some metric, so pick all or some features that help with this.

0.6 Written Questions

0.7 Problem 1

Consider two random variables X, Y that are not independent. Their probabilities are given by the following table:

        Y=0   Y=1
X=0     1/4   1/6
X=1     1/4   1/3

(a) What is the probability that X = 1? (b) What is the probability that X = 1 conditioned on Y = 1? (c) What is the variance of the random variable X? (d) What is the variance of the random variable X conditioned on Y = 1? (e) What is E[X³ + X² + 3Y⁷ | Y = 1]?

a) P(X = 1) = P(X = 1 ∩ Y = 1) + P(X = 1 ∩ Y = 0) = 1/3 + 1/4 = 7/12

This is true by the Total Probability Theorem: P(X = x) = Σ_{y ∈ Y} P(X = x ∩ Y = y).

b) P(X = 1 | Y = 1) = P(X = 1 ∩ Y = 1) / P(Y = 1) = (1/3) / (1/6 + 1/3) = 2/3

This is true by the definition of conditional probability: P(X = x | Y = y) = P(X = x ∩ Y = y) / P(Y = y).

c) E[X] = Σ_{x ∈ X} x P(X = x) = 1 · P(X = 1) + 0 · P(X = 0) = 7/12

Var(X) = E[(X - E[X])²] = Σ_{x ∈ X} (x - 7/12)² P(X = x) = (1 - 7/12)² · 7/12 + (0 - 7/12)² · 5/12 = 35/144

d) E[X | Y = 1] = Σ_{x ∈ X} x P(X = x | Y = 1) = 1 · P(X = 1 | Y = 1) + 0 · P(X = 0 | Y = 1) = 2/3

Var(X | Y = 1) = E[(X - E[X | Y = 1])² | Y = 1] = Σ_{x ∈ X} (x - 2/3)² P(X = x | Y = 1) = (1 - 2/3)² · 2/3 + (0 - 2/3)² · 1/3 = 2/9

e) E[X³ + X² + 3Y⁷ | Y = 1] = E[X³ | Y = 1] + E[X² | Y = 1] + 3E[Y⁷ | Y = 1]
   = Σ_{x ∈ X} x³ P(X = x | Y = 1) + Σ_{x ∈ X} x² P(X = x | Y = 1) + 3
   = 1³ · P(X = 1 | Y = 1) + 0 + 1² · P(X = 1 | Y = 1) + 0 + 3
   = 2 · P(X = 1 | Y = 1) + 3 = 2 · (2/3) + 3 = 13/3

Note that E[Y⁷ | Y = y] = y⁷ = y, since y ∈ {0, 1}. You can derive this yourself.

0.8 Problem 2

Consider the vectors v1 = [1, 1, 1] and v2 = [1, 0, 0]. These two vectors define a 2-dimensional subspace of R³. Project the points P1 = [3, 3, 3], P2 = [1, 2, 3], P3 = [0, 0, 1] onto this subspace. Write down the coordinates of the three projected points. (You can use numpy or a calculator to do the arithmetic if you want.)

Let <a, b> be the dot product of a and b. Let V = span{v1, v2} and V⊥ the orthogonal complement (i.e. all vectors orthogonal to the span V). Then Pi = y_V + y_V⊥, where y_V ∈ V is the projection in V we are looking for and y_V⊥ ∈ V⊥.

One way to solve this problem is by calculating y_V directly, by projecting onto 2 orthogonal vectors that span V. If you want to project a vector onto a subspace, you need to find an orthogonal basis of that subspace first. This consists of finding n orthogonal vectors (<v_i, v_j> = 0) which together have the same span as V above. Take w2 = v2 = [1, 0, 0] as your first vector (since it's easier to work with), and let

w1 = v1 - Proj_{w2} v1 = v1 - (<v1, w2> / <w2, w2>) w2 = [1, 1, 1] - [1, 0, 0] = [0, 1, 1]

as the second vector, which is orthogonal to v2.
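As a sanity check, this Gram-Schmidt step and the resulting orthogonality can be verified with a few lines of numpy:

```python
import numpy as np

v1 = np.array([1.0, 1.0, 1.0])
v2 = np.array([1.0, 0.0, 0.0])

# Gram-Schmidt: keep w2 = v2, and subtract from v1 its projection onto w2
w2 = v2
w1 = v1 - (v1 @ w2) / (w2 @ w2) * w2

print(w1)       # [0. 1. 1.]
print(w1 @ w2)  # 0.0, i.e. w1 is orthogonal to w2
```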
Then V = span{w1, w2}. Note that w1 and w2 are orthogonal, w2 = v2, and w1 is a linear combination of v1 and v2. Thus w1 is also in the span V, making {w1, w2} a valid orthogonal basis of V.

Proj_V(P1) = (<P1, w1> / <w1, w1>) w1 + (<P1, w2> / <w2, w2>) w2 = (6/2) w1 + (3/1) w2 = [0, 3, 3] + [3, 0, 0] = [3, 3, 3]. Note that P1 is already in the span V, so it remains unchanged.

Proj_V(P2) = (<P2, w1> / <w1, w1>) w1 + (<P2, w2> / <w2, w2>) w2 = (5/2) w1 + (1/1) w2 = [0, 5/2, 5/2] + [1, 0, 0] = [1, 5/2, 5/2]

Proj_V(P3) = (<P3, w1> / <w1, w1>) w1 + (<P3, w2> / <w2, w2>) w2 = (1/2) w1 + (0/1) w2 = [0, 1/2, 1/2] + [0, 0, 0] = [0, 1/2, 1/2]

A second approach would be to find the projection y_V⊥ of Pi onto V⊥ = span(N), where N is the normal vector to V (you can calculate it using the cross product). Then y_V = Pi - y_V⊥.

0.9 Problem 3

Consider a coin such that the probability of heads is 2/3. Suppose you toss the coin 100 times. Estimate the probability of getting 50 or fewer heads. You can do this in a variety of ways. One way is to use the Central Limit Theorem. Be explicit in your calculations and tell us what tools you are using in these.

Let X_i ~ Bernoulli(2/3), where the X_i are iid. The question essentially asks you to compute P(S_n ≤ 50), where

S_n = Σ_{i=1}^{100} X_i

Let μ = E[X_i] and σ = sqrt(Var(X_i)). Then the question can be rewritten as

P(Z_n ≤ (50 - nμ) / (σ√n))

where Z_n = (S_n - nμ) / (σ√n). The CLT tells us that as n → ∞, S_n approaches the Gaussian distribution with mean and variance equal to the mean and variance of the sum of the random variables. Thus, by normalizing the random variable S_n, we obtain a standard normal distribution. Therefore the solution is approximately Φ(z), where z = (50 - nμ) / (σ√n), which is simply the tail probability of the standard normal distribution (available on most standard scientific calculators). Using μ = p = 2/3 and σ = sqrt(p(1 - p)), you can calculate this value.

Solution: ≈ 0.0203%
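The CLT estimate can be checked numerically; a short sketch (assuming scipy is available, with the exact binomial tail included for comparison):

```python
import numpy as np
from scipy import stats

p, n = 2/3, 100
mu, sigma = p, np.sqrt(p * (1 - p))

# CLT approximation: P(S_n <= 50) is roughly Phi((50 - n*mu) / (sigma*sqrt(n)))
z = (50 - n * mu) / (sigma * np.sqrt(n))
approx = stats.norm.cdf(z)

# exact binomial tail, for comparison
exact = stats.binom.cdf(50, n, p)

print("z:", z)                  # about -3.54
print("CLT estimate:", approx)  # about 2.0e-4, i.e. roughly 0.02%
print("exact:", exact)
```

The normal approximation can be sharpened with a continuity correction (using 50.5 instead of 50 in the numerator), which moves it closer to the exact binomial value.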