COGS 9 Fall 2020 Exam 2 Study Guide

Our exam will focus on all the material covered in all lectures, assignments, and readings, from Data Visualization I through Geospatial Analysis. Below, we provide some of the key concepts and terms to know for the exam. The exam will feature a variety of multiple choice and true/false questions. There will be 50 questions in total.

The best way to prepare is to review all the lecture slides, ensure that you have met each learning objective, and re-acquaint yourself with the assignments. If something is not clear in your mind, re-watch the videos or come to a discussion section/office hour to ask a question.

You will be expected to know the definitions of the terms, the larger context in which they fit, and how they can be applied. In general, this is a class about big concepts. The examples provided in lecture are given to illustrate those larger concepts. The goal is not to memorize specific details about these examples.

Standard disclaimer: This is not a complete list. It is merely intended to give you an idea of what types of things you are expected to know.

Data Visualization
● Which graphs are appropriate for various types of data (e.g., when would you make a histogram? A scatterplot? etc.)
● What are important things to keep in mind when generating visualizations (including graphs and tables)?
● When interpreting visualizations, what do you do? (think: look for general patterns and things that are outliers/weird/odd)
● What colors are good to use? Which colors and color combinations should you avoid?
● What’s the difference between an exploratory and an explanatory visualization?
● What is the “data to ink ratio” and how does changing that ratio aid in understanding data visualization?
● What does it mean to iteratively improve a visualization?
● What makes a good exploratory visualization? What takes a visualization from exploratory to explanatory?
● What differs between visualizations on paper and those presented using slides?
● What does visualization have to do with storytelling and/or effective communication?
● What experiment was carried out in Data Is Personal (R3)? What did the authors learn from this work?

Descriptive Analysis
● What is a descriptive analysis?
● What are the best practices for sampling from a population?
● How do we describe a dataset? What measurements do we use? (size, missingness, shape, central tendency, variability)
● What are the differences between the mean and the median? In what situations would you favor one over the other?

Exploratory Data Analysis (EDA)
● What is EDA? Why do EDA?
● What does it mean that EDA is an iterative process?
● What are the general steps in an exploratory analysis?
● What types of plots are helpful during EDA?
● When is EDA helpful? When is it not the right approach?
● What were the main lessons from Tidy Data and Organizing Data in Spreadsheets (R3)?
● How do the lessons learned in Tidy Data and Organizing Data in Spreadsheets (R3) help in EDA and analysis?

Inference
● What is inferential analysis? What types of questions are appropriate for inference?
● What does sampling have to do with inference?
● What is Anscombe’s Quartet? What is interesting about Anscombe’s Quartet? Note here that these four sets of X,Y pairs have very similar, if not identical, summary statistics: means and variances for both X and Y, the X,Y correlation, as well as the equation for the regression line. Yet they are clearly very different upon cursory visual inspection.
● Why is it problematic to set the p-value cutoff at an arbitrary threshold such as 0.05 (1-in-20)? (The p-value is the probability of observing a result at least as extreme as the one seen, assuming that the null hypothesis is true.) If you run 20 independent tests when there is no real effect, you’re likely to get at least one that results in a “significant” effect by chance alone!
● What does correlation demonstrate? What does it mean when people say correlation does not equal causation?
● What does a comparison of means demonstrate? What statistical test could you use for this? What assumptions must be met for this statistical test?
● What is regression? What are the assumptions of linear regression?
● What are confounders?
● What was the vertebroplasty example? What did this example demonstrate?

A|B Testing
● What is A|B testing? When is it used?
● What considerations are made when designing an experiment?
● What considerations are made after the A|B test has been started?
● Why are p-values less informative than confidence intervals?
● What are subgroups when it comes to A|B testing?
● What is bucketing skew? What could cause bucketing skew?
● What could cause noise in an A|B test?
● What does it mean to change one variable at a time?

Machine Learning
● What is Machine Learning (ML)?
● What are the four basic steps to prediction? What does each of them mean?
● What are the differences between supervised and unsupervised learning?
● What are the differences between regression and classification?
● What are decision trees? When would you use them?
● What does it mean to assess machine learning models?
● How are deep learning and machine learning related?
● What is overfitting?
● What does ethics have to do with predictive analyses?
● What is predictive policing and what does it have to do with risk scores and jail sentences? (R5)

Algorithms & Computability
● What is the definition of an algorithm?
● How can the performance of an algorithm be measured?
● At Stitch Fix, at what points in the process are algorithms used?
● What is meant by “What the average user thinks they are doing is what is actually being done?”
● What does it mean for an algorithm to be FAT?
● What was the algorithm proposed in A Mulching Proposal (R2)? What does this have to do with ethics?

Text Analysis
● What is the “TF” in TF-IDF? What is the “IDF”?
● What is a “token”?
● What is sentiment analysis? What is a lexicon?
● What are some limitations of sentiment analysis?
● What are word clouds?
● How can we analyze text in novels?
● What are a few ways you can deal with the issue of context in text data?

Geospatial Analysis
● What is wrong with plotting raw count histogram-type distributions of many real-world geospatial features, such as population, economic activity, number of website users, etc.?
● How can plotting certain features, such as disease prevalence rates by geographic region, help scientists form new hypotheses about disease causes (or cures)?
● What is a choropleth map?
● In what ways do spatial data violate the assumptions of conventional statistics?
● What is spatial autocorrelation, and in what way are many real-world geospatial features highly spatially autocorrelated?
● The modifiable areal unit problem (MAUP) is important for things such as gerrymandering: how we divide features across space influences aggregate statistics within those subdivisions! Be sure you understand this.
● What is the ecological fallacy?

How to be wrong
● How can measurements be systematically or randomly wrong (accuracy/precision)? What happens when our proxy measurement doesn’t match what we care about (lamppost)? How can we correct for both these kinds of problems?
● What are errors of analysis? Understand and be able to explain the violations of assumptions that go into the analytic screw-ups known as Anscombe’s Quartet and Simpson’s Paradox. Be able to explain the file-drawer problem and how the missing p-values are a result of bad analytic procedure.
● Know and be able to explain some famous examples of how broken tools lead to broken results. And don’t use Excel unless you know it’s appropriate.
● Be able to explain confirmation bias, and why it’s particularly important for an analyst to be aware of and fight this bias. Be familiar with the general concepts of the other cognitive biases.
● Understand how poor communication can lead to bad decision making even if the correct analysis is made.
Be able to explain how two space shuttles and their crews were lost due to poor communication.
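
Worked example for the p-value question in the Inference section: the "run 20 tests at 0.05" point can be checked with a few lines of arithmetic. This is a minimal sketch, not course material. It assumes all 20 null hypotheses are true and the tests are independent, and the function names are hypothetical, chosen only for illustration.

```python
def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent tests comes out 'significant'
    when every null hypothesis is actually true (illustrative helper)."""
    # Each test avoids a false positive with probability (1 - alpha);
    # independence lets us multiply those probabilities across tests.
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of 'significant' results expected by chance alone."""
    return n_tests * alpha


if __name__ == "__main__":
    # 20 tests at the conventional 0.05 threshold
    print(round(prob_at_least_one_false_positive(20), 3))  # 0.642
    print(round(expected_false_positives(20), 3))          # 1.0
```

Under these assumptions, about one "significant" result is expected by chance, and there is roughly a 64% chance of seeing at least one, which is why a single p < 0.05 among many tests is weak evidence on its own.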