Lab1 Solution
September 13, 2018
In [1]: # Make plots inline
%matplotlib inline
# Make inline plots vector graphics instead of raster graphics
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg')
# import modules for plotting and data analysis
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn
0.1 Problem 1
Create 1000 samples from a Gaussian distribution with mean -10 and standard deviation 5. Create another 1000 samples from an independent Gaussian with mean 10 and standard deviation 5.
(a) Take the sum of these 2 Gaussians by adding the two sets of 1000 points, point by point, and plot the histogram of the resulting 1000 points. What do you observe?
(b) Estimate the mean and the variance of the sum.
In [2]: mu, sigma = 10, 5
sampleA = np.random.normal(mu, sigma, 1000)   # 1000 samples with mean 10, std 5
sampleB = np.random.normal(-mu, sigma, 1000)  # 1000 samples with mean -10, std 5
sumMatrix = np.add(sampleA, sampleB)  # element-wise sum of the two sets of samples
# Plot each of the sets of samples, and their sum, as histograms
plt.hist(sampleA, 100)
plt.hist(sampleB, 100)
plt.hist(sumMatrix, 100)
plt.title('Samples from 2 Gaussians and their sum')
plt.xlabel('Sample value')
plt.ylabel('Bin occurrence')
# Mean and variance of the summed set
print("Mean:", np.mean(sumMatrix))
print("Variance:", np.var(sumMatrix))
Mean: 0.15540869606444302
Variance: 50.75917718250388
[Figure: histograms of the two Gaussian samples and their sum. Title: "Samples from 2 Gaussians and their sum"; x-axis: Sample value; y-axis: Bin occurrence.]
a) The mean of the summed set of samples is close to 0. Since the expected value of a Normal distribution is its mean, the expected value (or mean) of the summed set will be
$$E[A + B] = E[A] + E[B] = 10 + (-10) = 0.$$
Similarly, when two independent random variables are summed, their variances also add up:
$$Var(X + Y) = Var(X) + Var(Y).$$
Since $Var(A) = \sigma_a^2$, the new variance will be
$$Var(A + B) = \sigma_a^2 + \sigma_b^2 = 5^2 + 5^2 = 50.$$
b) Since these samples are drawn randomly from the distributions, the sample mean and variance won't be exactly 0 and 50 (respectively), but as the number of samples increases they approach these true values.
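As a quick numerical sanity check (a sketch added here, not part of the graded solution), we can repeat the experiment with increasing sample sizes and watch the empirical mean and variance approach the theoretical 0 and 50; this reuses the numpy import from the first cell:

# Sanity-check sketch: larger samples should give estimates closer to 0 and 50
for n in [100, 10000, 1000000]:
    s = np.random.normal(10, 5, n) + np.random.normal(-10, 5, n)
    print(n, np.mean(s), np.var(s))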
0.2 Problem 2
Central Limit Theorem.
Let $X_i$ be iid Bernoulli random variables with values {-1, 1}. Look at the random variable
$$Z_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
By taking 1000 draws from $Z_n$, plot its histogram. Check that for small n (say, 5-10) $Z_n$ does not look that much like a Gaussian, but when n is bigger (already by the time n = 30 or 50) it looks much more like a Gaussian. Check also for much bigger n: n = 250, to see that at this point one can really see the bell curve.
In [3]: def plotBernoulliSum(n):
    data = []
    # take 1000 draws from Z_n
    for x in range(0, 1000):
        # create one draw of Z_n by averaging n single-trial binomial (Bernoulli) samples
        samples = np.random.binomial(1, 0.5, n)
        samples[samples == 0] = -1  # change all the 0 samples to -1
        data.append(samples.mean())
    # display a histogram of the draws, labeled with n
    plt.title("n = %d" % n)
    plt.xlabel('Sample value')
    plt.ylabel('Bin count')
    plt.hist(data, bins=100)
    plt.show()

# Create plots for each value of n
plotBernoulliSum(10)
plotBernoulliSum(50)
plotBernoulliSum(250)
[Figure: histogram of 1000 draws of Z_n for n = 10; x-axis: Sample value; y-axis: Bin count.]
[Figure: histogram of 1000 draws of Z_n for n = 50; x-axis: Sample value; y-axis: Bin count.]
[Figure: histogram of 1000 draws of Z_n for n = 250; x-axis: Sample value; y-axis: Bin count.]
As can be seen, for small n the distribution of $Z_n$ appears more triangular: while values of $Z_n$ still tend to concentrate near 0, the histogram does not follow the typical Gaussian curve. As n gets much larger, we see the Gaussian bell curve emerge, with very few extreme values and a much smaller variance as well.
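To make the convergence concrete, one could overlay the CLT-predicted density on the histogram. For a fair ±1 coin, $Var(X_i) = 1$, so the CLT predicts $Z_n \approx N(0, 1/n)$. A minimal sketch (an addition to the original solution, assuming a matplotlib version that supports the density= keyword):

# Sketch: overlay the CLT-predicted density N(0, 1/n) on the histogram of Z_n
def plotWithGaussian(n, draws=1000):
    flips = np.random.choice([-1, 1], size=(draws, n))
    z = flips.mean(axis=1)
    plt.hist(z, bins=100, density=True)
    # predicted density of N(0, 1/n): sqrt(n / 2*pi) * exp(-n x^2 / 2)
    xs = np.linspace(z.min(), z.max(), 200)
    plt.plot(xs, np.sqrt(n / (2 * np.pi)) * np.exp(-n * xs**2 / 2))
    plt.title("n = %d with N(0, 1/n) overlay" % n)
    plt.show()

plotWithGaussian(250)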
0.3 Problem 3
Estimate the mean and standard deviation from 1-dimensional data: generate 25,000 samples from a Gaussian distribution with mean 0 and standard deviation 5. Then estimate the mean and standard deviation of this Gaussian using elementary numpy commands, i.e., addition, multiplication, division (do not use a command that takes data and returns the mean or standard deviation).
In [4]: mu, sigma, samples = 0, 5, 25000
# 25000 samples from a Normal distribution with mean 0 and standard deviation 5
data = np.random.normal(mu, sigma, samples)
# compute the mean of the data
mean = np.sum(data) / samples
# compute the (Bessel-corrected) variance of the data
variance = np.sum((data - mean)**2)
variance /= (samples - 1)
# compute the standard deviation from the variance of the data
deviation = np.sqrt(variance)
print("Mean:", mean)
print("Deviation:", deviation)
Mean: -0.042206667255083805
Deviation: 5.030464517832782
The mean is computed by finding the expected value of the data (i.e., the average):
$$\mu_x = \frac{1}{N} \sum_{j=1}^{N} X_j$$
The deviation is found by first computing the variance of the data and then taking its square root:
$$\sigma = \sqrt{\frac{1}{N-1} \sum_{j=1}^{N} (X_j - \mu_x)^2}$$
These two equations come from Monte Carlo sampling, which seeks to estimate an expectation or integral via a mean, $E[X] \approx \frac{1}{N} \sum_{j=1}^{N} X_j$ where $X_j \sim P(X)$, the probability distribution of the random variable X. This value approaches the true expectation as N approaches infinity.
Also note that for the variance above we divide by N − 1 and not N. This is Bessel's correction, which gives an unbiased estimate of the variance when the population mean is itself estimated from the data. For large enough N, though, we tend to just divide by N since the difference is marginal.
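As a cross-check (the problem forbids the built-ins for the solution itself, but they are fine for verification), the manual estimates should agree with numpy's routines; note that ddof=1 applies the same Bessel correction:

# Verification sketch: compare against numpy's built-in estimators
print("numpy mean:", np.mean(data))
print("numpy std:", np.std(data, ddof=1))  # ddof=1 divides by N-1 (Bessel)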
0.4 Problem 4
Estimate the mean and covariance matrix for multi-dimensional data: generate 10,000 samples of 2-dimensional data from the Gaussian distribution
$$\begin{pmatrix} X_i \\ Y_i \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} -5 \\ 5 \end{pmatrix}, \begin{pmatrix} 20 & 0.8 \\ 0.8 & 30 \end{pmatrix} \right)$$
Then estimate the mean and covariance matrix for this multi-dimensional data using elementary numpy commands, i.e., addition, multiplication, division (do not use a command that takes data and returns the mean or standard deviation).
In [5]: samples = 10000
mu = [-5, 5]
cov = [[20, 0.8], [0.8, 30]]
# 10000 samples from a bivariate Normal distribution with the given mean and covariance matrix
x, y = np.random.multivariate_normal(mu, cov, samples).T
# compute the means
meanX = np.sum(x) / samples
meanY = np.sum(y) / samples
mean = [meanX, meanY]
# compute the variances as before
varX = np.sum((x - meanX)**2) / (samples - 1)
varY = np.sum((y - meanY)**2) / (samples - 1)
# compute the covariance between x and y; as n increases this should approach 0.8
cov = np.sum((x - meanX)*(y - meanY)) / (samples - 1)
covMatrix = np.array([[varX, cov], [cov, varY]])
# print out the mean and covariance matrix
print("Mean:\n", mean, "\n")
print("Covariance:\n", covMatrix)
Mean:
[-5.008183705275261, 4.98693417994575]
Covariance:
[[20.39674574 0.85205106]
[ 0.85205106 29.53026744]]
We can find the mean and variance as before (albeit in 2 dimensions instead of 1). To find the covariance between the two random variables, we use the following equation:
$$\mathrm{Cov}(X, Y) = \frac{1}{N-1} \sum_{j=1}^{N} (X_j - \mu_x)(Y_j - \mu_y)$$
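As a verification sketch (an addition, not part of the original solution), np.cov applies the same N − 1 normalization by default, so it should closely match the covMatrix computed above:

# Verification sketch: numpy's covariance estimator uses ddof=1 by default
print(np.cov(x, y))  # should closely match covMatrix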
0.5 Problem 5
Download from Canvas/Files the dataset PatientData.csv. Each row is a patient and the last column is the condition that the patient has. Do data exploration using Pandas and other visualization tools to understand what you can about the data set. For example:
(a) How many patients and how many features are there?
(b) What is the meaning of the first 4 features? See if you can understand what they mean.
(c) Are there missing values? Replace them with the average of the corresponding feature column.
(d) How could you test which features strongly influence the patient condition and which do not? List what you think are the three most important features.
In [6]: # Read the data: float values, '?' marks missing values, no header row (column names)
data = pd.read_csv('PatientData.csv', header=None, dtype=np.float64, na_values='?')
# Find the number of features and patients
features = len(data.iloc[0]) - 1  # subtract one because the last column is the output
patients = len(data)
print("Features:", features)
print("Patients:", patients)
# Replace any missing values with the mean of the corresponding column/feature
newData = data.fillna(data.mean())
# Plot histograms of the first four features (dropna avoids histogram errors on missing values)
for i, bins in enumerate([50, 10, 100, 100]):
    plt.hist(data.iloc[:, i].dropna(), bins=bins)
    plt.title('Feature %d' % i)
    plt.xlabel('Sample values')
    plt.ylabel('Bin counts')
    plt.show()
# Write the cleaned data back to a new file
newData.to_csv('PatientDataFixed.csv')
Features: 279
Patients: 452
[Figure: histogram of Feature 0 (50 bins); x-axis: Sample values; y-axis: Bin counts.]
[Figure: histogram of Feature 1 (10 bins); x-axis: Sample values; y-axis: Bin counts.]
[Figure: histogram of Feature 2 (100 bins); x-axis: Sample values; y-axis: Bin counts.]
[Figure: histogram of Feature 3 (100 bins); x-axis: Sample values; y-axis: Bin counts.]
a) There are 279 features (plus the condition column) and 452 patients (i.e., samples).
b) With these being medical records and the first column being in the range of 0 to 80, it can be assumed that this first feature is very likely the age of the patient. The second feature is binary and fairly evenly distributed, so it is most likely gender. The third and fourth features are more difficult to ascertain without foreknowledge, but reasonable guesses could be blood pressure and heart rate.
c) There are several missing values, marked as "?" in the data. They can be filled in with one line of pandas code, as seen above.
d) There are various ways to determine feature importance, and this is an active area of machine learning research. Possible approaches include decision trees (e.g., xgboost feature selection), F-scores, PCA/LDA (eigenvectors with the largest eigenvalues), highest correlation or mutual information with the output variable, standardized regression coefficients, the change in R^2 (or other metrics) as a variable is added to the rest, and others. Depending on which method you use, you may find that different features are most important. At the end of the day, the goal is to predict the output variable while minimizing (or maximizing) some metric, so pick all or some features that help with this. One simple option is sketched below.
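A minimal sketch of the correlation heuristic from the list above, assuming the integer column labels produced by header=None (so the column labeled `features` is the condition). This is a crude linear measure, not a definitive importance ranking:

# Sketch: rank features by absolute Pearson correlation with the condition column
corrs = newData.corr()[features].drop(features).abs()
print(corrs.sort_values(ascending=False).head(3))  # three most correlated features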
0.6 Written Questions
0.7 Problem 1
Consider two random variables X, Y that are not independent. Their joint probabilities are given by the following table:

       Y=0   Y=1
X=0    1/4   1/6
X=1    1/4   1/3
(a) What is the probability that X = 1?
(b) What is the probability that X = 1 conditioned on Y = 1?
(c) What is the variance of the random variable X?
(d) What is the variance of the random variable X conditioned on Y = 1?
(e) What is $E[X^3 + X^2 + 3Y^7 \mid Y = 1]$?
a) $P(X = 1) = P(X = 1 \cap Y = 1) + P(X = 1 \cap Y = 0) = \frac{1}{3} + \frac{1}{4} = \frac{7}{12}$
This is true by the Total Probability Theorem:
$$P(X = x) = \sum_{y \in Y} P(X = x \cap Y = y)$$
b) $P(X = 1 \mid Y = 1) = \frac{P(X = 1 \cap Y = 1)}{P(Y = 1)} = \frac{1/3}{1/6 + 1/3} = \frac{2}{3}$
This is true by the definition of conditional probability (the identity underlying Bayes' theorem):
$$P(X = x \mid Y = y) = \frac{P(X = x \cap Y = y)}{P(Y = y)}$$
c) $E[X] = \sum_{x \in X} x P(X = x) = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = \frac{7}{12}$
$$Var(X) = E[(X - E[X])^2] = \sum_{x \in X} \left(x - \tfrac{7}{12}\right)^2 P(X = x) = \left(1 - \tfrac{7}{12}\right)^2 \cdot \tfrac{7}{12} + \left(0 - \tfrac{7}{12}\right)^2 \cdot \tfrac{5}{12} = \frac{35}{144}$$
d) $E[X \mid Y = 1] = \sum_{x \in X} x P(X = x \mid Y = 1) = 1 \cdot P(X = 1 \mid Y = 1) + 0 \cdot P(X = 0 \mid Y = 1) = 1 \cdot \frac{2}{3} = \frac{2}{3}$
$$Var(X \mid Y = 1) = E[(X - E[X \mid Y = 1])^2 \mid Y = 1] = \sum_{x \in X} \left(x - \tfrac{2}{3}\right)^2 P(X = x \mid Y = 1) = \left(1 - \tfrac{2}{3}\right)^2 \cdot \tfrac{2}{3} + \left(0 - \tfrac{2}{3}\right)^2 \cdot \tfrac{1}{3} = \frac{2}{9}$$
e) $E[X^3 + X^2 + 3Y^7 \mid Y = 1] = E[X^3 \mid Y = 1] + E[X^2 \mid Y = 1] + 3E[Y^7 \mid Y = 1]$
$= 1^3 \cdot P(X = 1 \mid Y = 1) + 0 + 1^2 \cdot P(X = 1 \mid Y = 1) + 0 + 3 = 2 \cdot P(X = 1 \mid Y = 1) + 3 = 2 \cdot \frac{2}{3} + 3 = \frac{13}{3}$
Note that $E[Y^7 \mid Y = y] = y^7$, so $E[Y^7 \mid Y = 1] = 1$. You can derive this yourself.
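As a numerical sanity check (an added sketch, not part of the original solution), one can sample directly from the joint table and compare against the answers above:

# Monte Carlo check of the table-based answers
outcomes = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])  # (x, y) pairs
probs = [1/4., 1/6., 1/4., 1/3.]
draws = outcomes[np.random.choice(4, size=100000, p=probs)]
X, Y = draws[:, 0], draws[:, 1]
print("P(X=1)       ~", X.mean())           # expect 7/12 ~ 0.583
print("P(X=1|Y=1)   ~", X[Y == 1].mean())   # expect 2/3 ~ 0.667
print("Var(X)       ~", X.var())            # expect 35/144 ~ 0.243
print("Var(X|Y=1)   ~", X[Y == 1].var())    # expect 2/9 ~ 0.222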
0.8 Problem 2
Consider the vectors $v_1 = [1, 1, 1]$ and $v_2 = [1, 0, 0]$. These two vectors define a 2-dimensional subspace of $\mathbb{R}^3$. Project the points $P_1 = [3, 3, 3]$, $P_2 = [1, 2, 3]$, $P_3 = [0, 0, 1]$ onto this subspace. Write down the coordinates of the three projected points. (You can use numpy or a calculator to do the arithmetic if you want.)
Let $\langle a, b \rangle$ be the dot product of a and b. Let $V = \mathrm{span}\{v_1, v_2\}$ and $V^\perp$ its orthogonal complement (i.e., all vectors orthogonal to the span V).
Then $P_i = y_V + y_{V^\perp}$, where $y_V \in V$ is the projection in V we are looking for and $y_{V^\perp} \in V^\perp$.
One way to solve this problem is to calculate $y_V$ directly by projecting onto 2 orthogonal vectors that span V. To project a vector onto a subspace, you first need an orthogonal basis of that subspace: n mutually orthogonal vectors ($\langle v_i, v_j \rangle = 0$) which together have the same span as V above.
Take $w_2 = v_2 = [1, 0, 0]$ as your first vector (since it's easier to work with), and let
$$w_1 = v_1 - \mathrm{Proj}_{w_2} v_1 = v_1 - \frac{\langle v_1, w_2 \rangle}{\langle w_2, w_2 \rangle} w_2 = [1, 1, 1] - [1, 0, 0] = [0, 1, 1]$$
as the second vector, which is orthogonal to $v_2$. Then $V = \mathrm{span}\{w_1, w_2\}$. Note that $w_1$ and $w_2$ are orthogonal, $w_2 = v_2$, and $w_1$ is a linear combination of $v_1$ and $v_2$. Thus $w_1$ is also in the span V, making $w_1$ and $w_2$ a valid orthogonal basis of V.
$\mathrm{Proj}_V(P_1) = \frac{\langle P_1, w_1 \rangle}{\langle w_1, w_1 \rangle} w_1 + \frac{\langle P_1, w_2 \rangle}{\langle w_2, w_2 \rangle} w_2 = \frac{6}{2} w_1 + \frac{3}{1} w_2 = [0, 3, 3] + [3, 0, 0] = [3, 3, 3]$. Note that $P_1$ is already in the span V, so it remains unchanged.
$\mathrm{Proj}_V(P_2) = \frac{\langle P_2, w_1 \rangle}{\langle w_1, w_1 \rangle} w_1 + \frac{\langle P_2, w_2 \rangle}{\langle w_2, w_2 \rangle} w_2 = \frac{5}{2} w_1 + \frac{1}{1} w_2 = [0, 5/2, 5/2] + [1, 0, 0] = [1, 5/2, 5/2]$
$\mathrm{Proj}_V(P_3) = \frac{\langle P_3, w_1 \rangle}{\langle w_1, w_1 \rangle} w_1 + \frac{\langle P_3, w_2 \rangle}{\langle w_2, w_2 \rangle} w_2 = \frac{1}{2} w_1 + \frac{0}{1} w_2 = [0, 1/2, 1/2] + [0, 0, 0] = [0, 1/2, 1/2]$
A second approach would be to find the projection $y_{V^\perp}$ of $P_i$ onto $V^\perp = \mathrm{span}(N)$, where N is the normal vector to V (you can calculate it using the cross product, e.g., $N = v_1 \times v_2$). Then you know that $y_V = P_i - y_{V^\perp}$.
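A short numpy sketch of the first approach (an addition, reusing the orthogonal basis $w_1$, $w_2$ derived above):

# Sketch: compute the three projections numerically
v1, v2 = np.array([1., 1., 1.]), np.array([1., 0., 0.])
w2 = v2
w1 = v1 - (v1 @ w2) / (w2 @ w2) * w2        # Gram-Schmidt step: w1 = [0, 1, 1]
for p in ([3., 3., 3.], [1., 2., 3.], [0., 0., 1.]):
    p = np.array(p)
    proj = (p @ w1) / (w1 @ w1) * w1 + (p @ w2) / (w2 @ w2) * w2
    print(proj)  # expect [3, 3, 3], [1, 2.5, 2.5], [0, 0.5, 0.5]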
0.9 Problem 3
Consider a coin such that the probability of heads is 2/3. Suppose you toss the coin 100 times. Estimate the probability of getting 50 or fewer heads. You can do this in a variety of ways. One way is to use the Central Limit Theorem. Be explicit in your calculations and tell us what tools you are using.
Let
$$X_i \sim \mathrm{Bernoulli}\left(\frac{2}{3}\right)$$
where the $X_i$ are iid. The question essentially asks you to compute:
$$P(S_n \leq 50), \quad \text{where } S_n = \sum_{i=1}^{100} X_i$$
Let $\mu = E[X_i]$ and $\sigma = \sqrt{Var(X_i)}$. So the question can be rewritten as:
$$P\left(Z_n \leq \frac{50 - n\mu}{\sigma \sqrt{n}}\right)$$
where $Z_n = \frac{S_n - n\mu}{\sigma \sqrt{n}}$. The CLT tells us that as $n \to \infty$, $S_n$ approaches a Gaussian distribution with mean and variance equal to the mean and variance of the sum of the random variables. Thus, by normalizing the random variable $S_n$, we obtain (approximately) a standard normal distribution. Therefore the solution is approximately $\Phi(z)$ where $z = \frac{50 - n\mu}{\sigma \sqrt{n}}$, which is simply a tail probability of the standard normal distribution (tabulated and available on most scientific calculators). Using $\mu = p$ and $\sigma = \sqrt{p(1-p)}$, you can calculate this value: $z = \frac{50 - 100 \cdot 2/3}{\sqrt{2/9} \cdot 10} \approx -3.54$.
Solution: ~0.0203%
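A short verification sketch (assuming scipy is available) comparing the CLT estimate with the exact binomial tail probability:

# Verification sketch: CLT approximation vs. exact binomial CDF
from scipy.stats import binom, norm
n, p = 100, 2/3.
z = (50 - n * p) / np.sqrt(p * (1 - p) * n)
print("CLT estimate:  ", norm.cdf(z))         # ~2.03e-4, i.e. ~0.0203%
print("Exact binomial:", binom.cdf(50, n, p))  # the CLT is only an approximation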