Statistics and Machine Learning for Life Science - self-assessment

18/08/2022, 17:25
Total points 6/8
This quiz is meant for course participants to assess their knowledge level in Python as well
as statistics.
IMPORTANT NOTE : regarding the usage of some functions from specific modules,
we do not necessarily ask that you are able to answer the questions off the top of your
head. Rather, you should be able to provide an answer in a reasonable amount of time after
a quick internet search, or a call to help(), to consult the documentation.
If you are not able to answer most of the questions correctly, it is an indication that you
may need to gain a deeper understanding of the corresponding topic(s) before you are able
to fully benefit from the course.
Some pointers to self-study materials are given below, but note that there are plenty of
other excellent materials available as well; this is by no means an exhaustive list.
- On Python: http://rosalind.info/problems/list-view/?location=python-village
- On pandas: https://pandas.pydata.org/docs/getting_started/index.html
- On data visualization using seaborn: https://seaborn.pydata.org/tutorial.html
- On statistics, it is harder to recommend specific sources, but you can check out
https://www.statisticshowto.com/
0 of 0 points
Questions
6 of 8 points
What should the following Python code print: *
1/1
[1,2,3,4,5,8,7,8]
[0,9,36]
an error
[0,3,6]
[6,3,0]
Feedback
Yes, well played. Your base Python level should be enough for you to follow the course.
The pandas library is a useful library to manipulate data in Python. We
want to use it to read a data file where fields are separated by tabs
(\t) and where there is a header. What is (are) proper command(s) to do
this, provided we have imported pandas as pd : *
0/1
df <- pd.read_table( "myfile.txt" , sep='\t' , header=T )
df = pd.read_csv("myfile.txt")
df = pd.read_csv("myfile.txt", sep='\t')
df = pd.read_table("myfile.txt" , header = 0)
df = pd.read_table("myfile.txt" , header = 1)
Correct answer
df = pd.read_csv("myfile.txt", sep='\t')
df = pd.read_table("myfile.txt" , header = 0)
Feedback
In this course we will use the pandas library to read and manipulate our data. We
recommend you brush up on base operations (reading files, filtering data, computing
means/medians, ...) for this library to fully benefit from the course.
If you answered :
* answer 1 : this is R syntax. In Python, <- is not an assignment operator, and the header
option does not work that way here
* answer 2 : here the file is tab-separated, but by default read_csv considers commas (,) to
be the field separator
* answer 5 : the header option indicates which line is the header. Indexing starts
at 0, so header=1 means the header would be the 2nd line of the file...
Answers 3 and 4 were both correct. Full marks if you got at least one right.
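The differences between the answers can be checked on a small in-memory example. This is only a sketch: the file content below is made up and stands in for "myfile.txt", and io.StringIO is used so that no real file is needed.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for "myfile.txt".
raw = "gene\tpvalue\nA\t0.01\nB\t0.20\n"

# Answer 3: read_csv with an explicit tab separator (the header row is inferred).
df_csv = pd.read_csv(io.StringIO(raw), sep="\t")

# Answer 4: read_table uses tabs by default; header=0 marks the first line as header.
df_table = pd.read_table(io.StringIO(raw), header=0)

# Answer 2: with the default comma separator, each line stays one unsplit field.
df_comma = pd.read_csv(io.StringIO(raw))

# Answer 5: header=1 uses the 2nd line as header and drops the real header line.
df_shifted = pd.read_table(io.StringIO(raw), header=1)

print(list(df_csv.columns))      # ['gene', 'pvalue']
print(df_csv.equals(df_table))   # True
print(df_comma.shape)            # (2, 1)
print(list(df_shifted.columns))  # ['A', '0.01']
```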
The pandas library is a useful library to manipulate data in Python. We
have imported pandas as pd, and we have a DataFrame named df which,
among others, contains column "pvalue". Which command gives us the
subset of df where "pvalue" is < 0.05 ? *
1/1
df[df['pvalue']<0.05]
sub = pd.subset( df , pvalue<0.05 )
df['pvalue']<0.05
Feedback
Correct, well played.
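As an illustration of why the first option returns the subset while the third only returns a boolean mask, here is a minimal sketch. The DataFrame content is invented for the example.

```python
import pandas as pd

# Hypothetical DataFrame with a "pvalue" column.
df = pd.DataFrame({"gene": ["A", "B", "C"], "pvalue": [0.01, 0.20, 0.04]})

# df['pvalue'] < 0.05 alone gives a boolean Series, one value per row.
mask = df["pvalue"] < 0.05

# Indexing df with that mask keeps only the rows where it is True.
subset = df[mask]

print(mask.tolist())            # [True, False, True]
print(subset["gene"].tolist())  # ['A', 'C']
```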
Visual inspection of data is an important diagnostic tool in data analysis,
as evidenced by example datasets such as the Datasaurus Dozen
(https://www.autodesk.com/research/publications/same-stats-different-graphs).
In our case, we use either the matplotlib or seaborn library to
create nice plots. Assume we have imported matplotlib.pyplot as plt and
seaborn as sns. We want to plot the histogram of some data contained in
a list named myDATA. Find at least 1 command that would allow this : *
1/1
hist(myDATA)
plt.hist(myDATA)
sns.hist(myDATA)
sns.distplot(myDATA)
plt.histogram(myDATA)
Feedback
Answers 2 and 4 were both correct.
Full mark if you got at least 1 answer correct.
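A minimal sketch of the matplotlib option follows. The values in myDATA and the choice of 4 bins are assumptions made for the example, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up values standing in for the quiz's myDATA list.
myDATA = [1.2, 1.9, 2.1, 2.4, 2.5, 3.0, 3.3, 4.8]

# plt.hist draws the histogram and returns the bin counts and bin edges.
counts, bin_edges, _ = plt.hist(myDATA, bins=4)
plt.xlabel("value")
plt.ylabel("count")

print(counts.sum())    # every data point falls in exactly one bin
print(len(bin_edges))  # 4 bins are delimited by 5 edges
```

Note that in recent seaborn releases sns.distplot is deprecated in favour of sns.histplot and sns.displot, so the seaborn option may emit a warning depending on the installed version.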
scipy.stats is a library of choice when performing statistical testing in
Python. Which function call would we use if we wanted to perform a
Student's t-test on the two independent datasets X and Y ? (NB : don't
hesitate to go to the functions' documentation if you are not sure) *
1/1
scipy.stats.ttest_ind( X , Y )
scipy.stats.ttest_ind( X , Y , equal_var=False)
scipy.stats.ttest_rel( X , Y )
scipy.stats.t( X , Y )
Feedback
This is the correct answer. Well played.
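To make the difference between the options concrete, here is a short sketch. The values of X and Y are invented. With its default equal_var=True, ttest_ind performs the classic Student's t-test with a pooled variance, while equal_var=False gives Welch's t-test; ttest_rel is for paired samples of equal length and is not called here.

```python
import math
import statistics
from scipy import stats

# Made-up independent samples (they do not need to have the same length).
X = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
Y = [4.1, 4.5, 4.0, 4.8, 4.3]

# Student's t-test: assumes both groups share the same variance.
student = stats.ttest_ind(X, Y)

# Welch's t-test: does not assume equal variances.
welch = stats.ttest_ind(X, Y, equal_var=False)

# Manual Student statistic from the pooled variance, for comparison.
n1, n2 = len(X), len(Y)
pooled_var = ((n1 - 1) * statistics.variance(X)
              + (n2 - 1) * statistics.variance(Y)) / (n1 + n2 - 2)
t_manual = (statistics.mean(X) - statistics.mean(Y)) / math.sqrt(
    pooled_var * (1 / n1 + 1 / n2))

print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)
print(t_manual)
```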
What is a p-value ? (several answers possible) *
0/1
the probability of obtaining the observed test statistic under the null hypothesis.
the probability that we are making an error
the probability that the null hypothesis is not false
it corresponds to 1 - the probability that the alternative hypothesis is true
none of the above
Correct answer
none of the above
Feedback
The correct answer was none of the above.
The actual definition is :
"The p-value is the probability of obtaining a test statistic at least as extreme as the
observed one, if the null hypothesis is true"
Depending on your answer:
* answer 1 : almost there. You missed the "at least as extreme" part.
* answers 2, 3, 4 : wrong. These three answers make the error of neglecting the type II error
(also called beta), which corresponds to the probability of failing to reject H0 when it
is not true.
Knowing the proper definition of a p-value is crucial if you want to get anywhere
in (frequentist) statistics.
If you answered with answer 2, 3 or 4, we strongly recommend you brush up on the base
concepts of p-value, errors and error types in statistics before you attend the course.
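The gap between answer 1 and the actual definition can be shown with a small exact calculation. This is a sketch on an invented example (8 heads in 10 coin flips, null hypothesis of a fair coin), using only the standard library.

```python
from math import comb

n, observed = 10, 8  # hypothetical experiment: 8 heads out of 10 flips
p0 = 0.5             # null hypothesis: the coin is fair

def pmf(k):
    """Probability of exactly k heads in n flips under the null hypothesis."""
    return comb(n, k) * p0**k * (1 - p0)**(n - k)

# Answer 1: probability of exactly the observed statistic.
prob_exact = pmf(observed)

# One-sided p-value: the observed value or anything more extreme (8, 9 or 10 heads).
p_one_sided = sum(pmf(k) for k in range(observed, n + 1))

# Two-sided p-value: every outcome that is at most as likely as the observed one.
p_two_sided = sum(pmf(k) for k in range(n + 1) if pmf(k) <= pmf(observed))

print(prob_exact)   # 0.0439453125
print(p_one_sided)  # 0.0546875
print(p_two_sided)  # 0.109375
```

Here the probability of the exact observed outcome is below 0.05, while both p-values are above it, so the "as or more extreme" part changes the conclusion.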
Given the following table, which test could you apply to check the
independence of rows and columns ? *
1/1
Fisher's exact test
Chi-square test
t-test of independence
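The table itself is not reproduced in this copy, so the following is only a sketch on an invented 2x2 table of counts. It shows how the two contingency-table tests listed above are called in scipy.stats.

```python
import math
from scipy import stats

# Hypothetical 2x2 contingency table of counts (rows x columns).
table = [[12, 5],
         [3, 10]]

# Fisher's exact test: exact, commonly used for small 2x2 tables.
odds_ratio, p_fisher = stats.fisher_exact(table)

# Chi-square test of independence: relies on large-sample expected counts.
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(odds_ratio, p_fisher)
print(chi2, p_chi2, dof)
print(expected)  # expected counts under independence
```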
Given the following plot of two variables, which metric would be the most
appropriate to evaluate their correlation : *
1/1
Spearman's rank correlation coefficient
Pearson's correlation coefficient
R-squared, the coefficient of determination
Feedback
Correct answer : the non-linearity of the relationship between the two variables means that
Spearman's rho is the most appropriate choice here.
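The plot is not reproduced in this copy. As a sketch of the same idea, the invented data below follow a monotonic but strongly non-linear relationship (y = exp(x)), which is an assumption chosen for illustration.

```python
import math
from scipy import stats

# Made-up data: y increases with x, but not linearly.
x = list(range(10))
y = [math.exp(v) for v in x]

# Spearman works on ranks, so any strictly increasing relationship gives rho = 1.
rho, _ = stats.spearmanr(x, y)

# Pearson measures linear association and is pulled down by the curvature.
r, _ = stats.pearsonr(x, y)

print(rho)
print(r)
```

On these values Spearman's rho is 1 while Pearson's r is noticeably lower, even though y is fully determined by x.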
We observed 150 patients affected by the same disease. Out of these 150, 73
exhibit a fever. We consider each patient independent and their symptoms as
well. Which distribution would be most appropriate if we wanted to model the
number of patients with a fever in a sample of a given size?
a normal distribution
a chi-square distribution
a poisson law
a binomial law
Feedback
Indeed, here the best answer was the binomial distribution, which models the probabilities
of obtaining a number of "successes" (fever in our case) out of a fixed number of trials
(patients in our case).
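A short sketch of this model with scipy.stats.binom follows. The fever probability is estimated from the 73 out of 150 patients in the question; the sample size of 20 is an arbitrary value chosen for the example.

```python
import math
from scipy import stats

p_fever = 73 / 150  # estimated probability of fever, from the observed patients
sample_size = 20    # hypothetical size of a new sample

# Number of patients with a fever in the sample, modelled as Binomial(n, p).
fever_count = stats.binom(n=sample_size, p=p_fever)

print(fever_count.mean())   # expected number of patients with a fever
print(fever_count.pmf(10))  # probability of exactly 10 patients with a fever
print(fever_count.cdf(5))   # probability of at most 5 patients with a fever

# The probabilities over all possible counts (0 to sample_size) add up to 1.
total = sum(fever_count.pmf(k) for k in range(sample_size + 1))
```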