18/08/2022, 17:25 Statistics and Machine Learning for Life Science - self-assessment
https://docs.google.com/forms/d/e/1FAIpQLSeIws7P59zzoXTNAGACqEkPGTewv9sQE7sODhPjBo4LuFW2wA/viewscore?viewscore=AE0zAgAMFRpEwP…

Statistics and Machine Learning for Life Science - self-assessment
Total points 6/8

This quiz is meant for course participants to assess their knowledge level in Python as well as statistics.

IMPORTANT NOTE: regarding the usage of some functions from specific modules, we do not necessarily ask that you are able to answer the questions off the top of your head. Rather, you should be able to provide an answer in a reasonable amount of time after a quick internet search, or a call to help(), to consult the documentation.

If you are not able to answer most of the questions correctly, it is an indication that you may need to gain a deeper understanding of the corresponding topic(s) before you are able to fully benefit from the course. Some pointers to self-study materials are given below, but note that there are plenty of other excellent materials available as well; this is by no means an exhaustive list.

- On Python: http://rosalind.info/problems/list-view/?location=python-village
- On pandas: https://pandas.pydata.org/docs/getting_started/index.html
- On data visualization using seaborn: https://seaborn.pydata.org/tutorial.html
- On statistics, it is harder to recommend specific sources, but you can check out https://www.statisticshowto.com/

0 of 0 points

Questions 6 of 8 points

What should the following Python code print: * 1/1
- [1,2,3,4,5,8,7,8]
- [0,9,36]
- an error
- [0,3,6]
- [6,3,0]

Feedback
Yes, well played. Your base Python level should be enough for you to follow the course.

The pandas library is a useful library to manipulate data in Python. We want to use it to read a data file where fields are separated by tabs (\t) and where there is a header. What is (are) the proper command(s) to do this, provided we have imported pandas as pd: * 0/1
- df <- pd.read_table( "myfile.txt" , sep='\t' , header=T )
- df = pd.read_csv("myfile.txt")
- df = pd.read_csv("myfile.txt", sep='\t')
- df = pd.read_table("myfile.txt" , header = 0)
- df = pd.read_table("myfile.txt" , header = 1)

Correct answer
- df = pd.read_csv("myfile.txt", sep='\t')
- df = pd.read_table("myfile.txt" , header = 0)

Feedback
In this course we will use the pandas library to read and manipulate our data. We recommend you brush up on base operations (reading files, filtering data, computing means/medians, ...) for this library to fully benefit from the course.

If you answered:
* answer 1: this is R syntax. In Python, <- is not an assignment operator. In addition, the header option does not work that way here.
* answer 2: here the file is tab-separated, but by default read_csv considers commas (,) to be the field separator.
* answer 5: the header option indicates which line is the header. Indexing starts at 0, so header=1 means the header would be the 2nd line of the file.

Answers 3 and 4 were both correct. Full marks if you got at least one right.

The pandas library is a useful library to manipulate data in Python. We have imported pandas as pd, and we have a DataFrame named df which, among others, contains a column "pvalue". * 1/1
Which command gives us the subset of df where "pvalue" is < 0.05?
- df[df['pvalue']<0.05]
- sub = pd.subset( df , pvalue<0.05 )
- df['pvalue']<0.05

Feedback
Correct, well played.

Visual inspection of data is an important diagnostic tool in data analysis, as evidenced by example datasets such as the Datasaurus Dozen (https://www.autodesk.com/research/publications/same-stats-different-graphs). In our case, we use either the matplotlib or seaborn library to create nice plots. Assume we have imported matplotlib.pyplot as plt and seaborn as sns. We want to plot the histogram of some data contained in a list named myDATA. Find at least 1 command that would allow this: * 1/1
- hist(myDATA)
- plt.hist(myDATA)
- sns.hist(myDATA)
- sns.distplot(myDATA)
- plt.histogram(myDATA)

Feedback
Answers 2 and 4 were both correct. Full marks if you got at least 1 answer correct.

scipy.stats is a library of choice when performing statistical testing in Python. Which function call would we use if we wanted to perform a Student's t-test on the two independent datasets X and Y? (NB: don't hesitate to go to the functions' documentation if you are not sure) * 1/1
- scipy.stats.ttest_ind( X , Y )
- scipy.stats.ttest_ind( X , Y , equal_var=False)
- scipy.stats.ttest_rel( X , Y )
- scipy.stats.t( X , Y )

Feedback
This is the correct answer. Well played.

What is a p-value?
(several answers possible) * 0/1
- the probability of obtaining the observed test statistic under the null hypothesis.
- the probability that we are making an error
- the probability that the null hypothesis is not false
- it corresponds to 1 - the probability that the alternative hypothesis is true
- none of the above

Correct answer
- none of the above

Feedback
The correct answer was none of the above. The actual definition is: "The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the observed one, if the null hypothesis is true."

Depending on your answer:
* answer 1: almost there. You missed the "as or more extreme" part.
* answers 2, 3, 4: wrong. These three answers make the error of neglecting the type II error rate (also called beta), which corresponds to the probability of spuriously accepting H0 while it is not true.

Knowing the proper definition of a p-value is crucial if you want to get anywhere in (frequentist) statistics. If you answered with answer 2, 3 or 4, we strongly recommend you brush up on the base concepts of p-value, errors and error types in statistics before you attend the course.

Given the following table, which test could you apply to check the independence of rows and columns?
* 1/1
- Fisher's exact test
- Chi-square test
- t-test of independence

Given the following plot of two variables, which metric would be the most appropriate to evaluate their correlation: * 1/1
- Spearman's rank correlation coefficient
- Pearson's correlation coefficient
- R-squared, the coefficient of determination

Feedback
Correct answer: the non-linearity of the relationship between the two variables means Spearman's rho is the most appropriate choice here.

We observed 150 patients affected by the same disease. Out of these 150, 73 exhibit a fever. We consider each patient independent and their symptoms as well. Which distribution would be most appropriate if we wanted to model the number of patients with a fever in a sample of a given size?
- a normal distribution
- a chi-square distribution
- a Poisson law
- a binomial law

Feedback
Indeed, here the best answer was the binomial distribution, which models the probabilities of obtaining a number of "successes" (fever in our case) out of a fixed number of trials (patients in our case).
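
As an illustration of the binomial feedback above, the following minimal sketch computes the binomial probabilities directly from the formula, using only the Python standard library. The fever probability is estimated from the 73 out of 150 patients given in the question; the sample size of 20 and the helper name binomial_pmf are assumptions chosen for illustration and are not part of the quiz. With scipy installed, scipy.stats.binom.pmf(k, n, p) should give the same probabilities.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# From the question: 73 of the 150 observed patients had a fever.
p_hat = 73 / 150

# Hypothetical sample size, chosen only for illustration.
sample_size = 20

# Probability of each possible fever count, from 0 to sample_size.
pmf = [binomial_pmf(k, sample_size, p_hat) for k in range(sample_size + 1)]

expected_fevers = sample_size * p_hat  # mean of a binomial: n * p
prob_at_most_5 = sum(pmf[:6])          # P(X <= 5)

print(f"estimated fever probability: {p_hat:.3f}")
print(f"expected fever cases among {sample_size} patients: {expected_fevers:.2f}")
print(f"P(at most 5 fever cases): {prob_at_most_5:.4f}")
```

The probabilities over all possible counts sum to 1, and the expected count is simply the sample size times the estimated fever probability.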