COGS 9
Fall 2020
Exam 2 Study Guide
Our exam will focus on all the material covered in all lectures, assignments, and readings, from
Data Visualization I through Geospatial Analysis. Below, we provide some of the key concepts
and terms to know for the exam.
The exam will feature a variety of multiple choice and true/false questions. The best way to
prepare is to review all the lecture slides, ensure that you have met each learning objective, and
re-acquaint yourself with the assignments. If something is not clear in your mind, re-watch the
videos or come to a discussion section/office hour to ask a question. There will be 50 questions in
total.
You should know the definition of each term, the larger context in which it fits, and how it can
be applied. In general, this is a class about big
concepts. The examples provided in lecture are given to illustrate those larger concepts. The goal
is not to memorize specific details about these examples.
Standard disclaimer: This is not a complete list. It is merely intended to give you an idea of what
types of things you are expected to know.
Data Visualization
● Which graphs are appropriate for various types of data? (e.g., when would
you make a histogram? A scatterplot? etc.)
● What are important things to keep in mind when generating visualizations
(including graphs and tables)?
● When interpreting visualizations, what do you do? (think: look for general
patterns and things that are outliers/weird/odd)
● What colors are good to use? Which colors and color combinations should
you avoid?
● What’s the difference between an exploratory and an explanatory
visualization?
● What is the “data to ink ratio” and how does changing that ratio aid in
understanding data visualization?
● What does it mean to iteratively improve a visualization?
● What makes a good exploratory visualization? What takes a visualization
from exploratory to explanatory?
● What differs between visualizations on paper and those presented using
slides?
● What does visualization have to do with storytelling and/or effective
communication?
● What experiment was carried out in Data Is Personal (R3)? What did the
authors learn from this work?
Descriptive Analysis
● What is a descriptive analysis?
● What are the best practices for sampling from a population?
● How do we describe a dataset? What measurements do we use? (size,
missingness, shape, central tendency, variability)
● What are the differences between the mean and the median? In what
situations would you favor one over the other?
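As a small illustration of the mean-versus-median question above, the sketch below uses made-up incomes (hypothetical values, not course data) to show how one extreme value pulls the mean while barely moving the median.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars (illustrative values only).
incomes = [42, 45, 48, 50, 52]
incomes_with_outlier = incomes + [1000]  # add a single extreme value

# Without the outlier, mean and median are close together.
print("no outlier:  ", mean(incomes), median(incomes))

# With the outlier, the mean jumps but the median barely changes.
print("with outlier:", mean(incomes_with_outlier), median(incomes_with_outlier))
```

This is one reason the median is often favored for skewed data or data with outliers.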
Exploratory Data Analysis (EDA)
● What is EDA? Why do EDA?
● What does it mean that EDA is an iterative process?
● What are the general steps in an exploratory analysis?
● What types of plots are helpful during EDA?
● When is EDA helpful? When is it not the right approach?
● What were the main lessons from Tidy Data and Organizing Data in
Spreadsheets (R3)?
● How do the lessons learned in Tidy Data and Organizing Data in
Spreadsheets (R3) help in EDA and analysis?
Inference
● What is inferential analysis? What types of questions are appropriate for
inference?
● What does sampling have to do with inference?
● What is Anscombe’s Quartet? What is interesting about Anscombe’s
Quartet? Note here that these four sets of X,Y pairs have very similar, if not
identical, summary statistics: means and variances for both X and Y, the
X,Y correlation, as well as the equation for the regression line. Yet they are
clearly very different upon cursory visual inspection.
● Why is comparing a p-value (the probability of observing a result at least as
extreme as yours, assuming that the null hypothesis is true) against an
arbitrary threshold such as 0.05 (1-in-20) problematic? If you run 20
independent tests when no real effect exists, you are more likely than not to
get at least one that results in a “significant” effect!
● What does correlation demonstrate? What does it mean when people say
correlation does not equal causation?
● What does a comparison of means demonstrate? What statistical test
could you use for this? What assumptions must be met for this statistical
test?
● What is regression? What are the assumptions of linear regression?
● What are confounders?
● What was the vertebroplasty example? What did this example
demonstrate?
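The 20-tests point in the p-value question above can be checked with a little arithmetic. The sketch below assumes the tests are independent and that the null hypothesis is true in every one of them; the variable names are just for illustration.

```python
alpha = 0.05   # significance threshold
n_tests = 20   # independent tests, each with a true null hypothesis (assumption)

# Chance that none of the tests is a false positive, and its complement.
p_no_false_positive = (1 - alpha) ** n_tests
p_at_least_one = 1 - p_no_false_positive

# On average, alpha * n_tests tests come out "significant" by chance alone.
expected_false_positives = alpha * n_tests

print(f"P(at least one false positive) = {p_at_least_one:.3f}")
print(f"Expected false positives = {expected_false_positives:.1f}")
```

Under these assumptions the chance of at least one spurious "significant" result is roughly 64%, even though each individual test has only a 5% false-positive rate.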
A|B Testing
● What is A|B testing? When is it used?
● What considerations are made when designing an experiment?
● What considerations are made after the A|B test has been started?
● Why are p-values less informative than confidence intervals?
● What are subgroups when it comes to A|B testing?
● What is bucketing skew? What could cause bucketing skew?
● What could cause noise in an A|B test?
● What does it mean to change one variable at a time?
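To make the confidence-interval question above concrete, here is a minimal sketch of a 95% interval for the difference in conversion rates between two variants. The counts are hypothetical, the helper name `diff_ci` is made up for this example, and the interval uses the simple normal (Wald) approximation, which is only one of several possible methods.

```python
from math import sqrt

def diff_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Normal-approximation (Wald) 95% CI for rate_b - rate_a."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    # Standard error of the difference between two independent proportions.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return diff, diff - z * se, diff + z * se

# Hypothetical counts: variant A converts 200 of 5000, variant B 250 of 5000.
diff, lower, upper = diff_ci(200, 5000, 250, 5000)
print(f"difference = {diff:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Unlike a bare p-value, the interval shows both the direction of the effect and a plausible range for its size.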
Machine Learning
● What is Machine Learning (ML)?
● What are the four basic steps to prediction? What does each of them
mean?
● What are the differences between supervised and unsupervised learning?
● What are the differences between regression and classification?
● What are decision trees? When would you use them?
● What does it mean to assess machine learning models?
● How are deep learning and machine learning related?
● What is overfitting?
● What does ethics have to do with predictive analyses?
● What is predictive policing and what does it have to do with risk scores
and jail sentences? (R5)
Algorithms & Computability
● What is the definition of an algorithm?
● How can the performance of an algorithm be measured?
● At Stitch Fix, at what points in the process are algorithms used?
● What is meant by “What the average user thinks they are doing is what is
actually being done?”
● What does it mean for an algorithm to be FAT?
● What was the algorithm proposed in A Mulching Proposal (R2)? What
does this have to do with ethics?
Text Analysis
● What is the “TF” in TF-IDF? What is the “IDF”?
● What is a “token”?
● What is sentiment analysis? What is a lexicon?
● What are some limitations of sentiment analysis?
● What are word clouds?
● How can we analyze text in novels?
● What are a few ways you can deal with the issue of context in text data?
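For the TF-IDF question above, the sketch below computes one common variant by hand: term frequency as a share of the document's tokens, and inverse document frequency as log(N / number of documents containing the term). The three toy documents and the function names are assumptions for illustration, and libraries often use smoothed variants of these formulas.

```python
from collections import Counter
from math import log

# Toy corpus (hypothetical documents); tokens are whitespace-separated words.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return log(len(corpus) / n_containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

for term in ("the", "cat"):
    print(term, round(tf_idf(term, tokenized[0], tokenized), 3))
```

In the first document, "the" is the most frequent token but appears in two of the three documents, so its score ends up lower than that of "cat", which appears only there.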
Geospatial Analysis
● What is wrong with plotting raw count histogram-type distributions of
many real-world geospatial features, such as population, economic
activity, number of website users, etc.?
● How can plotting certain features, such as disease prevalence rates by
geographic region, help scientists form new hypotheses about disease
causes (or cures)?
● What is a choropleth map?
● In what ways do spatial data violate conventional statistics?
● What is spatial autocorrelation, and in what way are many real-world
geospatial features highly spatially autocorrelated?
● The modifiable areal unit problem (MAUP) is important for things such as
gerrymandering: how we divide features across space influences
aggregate statistics within those subdivisions! Be sure you understand
this.
● What is the ecological fallacy?
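The MAUP point above can be seen in a toy example. The sketch below places nine hypothetical voters on a 3x3 grid and compares two ways of drawing three districts (by rows or by columns); the grid, the party labels, and the function name are all made up for illustration.

```python
from collections import Counter

# Hypothetical voters on a 3x3 grid; each cell is one voter's party.
grid = [
    ["A", "A", "B"],
    ["B", "A", "A"],
    ["B", "B", "B"],
]

def district_winners(districts):
    """Return the majority party in each district (3 voters each, so no ties)."""
    return [Counter(district).most_common(1)[0][0] for district in districts]

row_districts = grid                                # one district per row
col_districts = [list(col) for col in zip(*grid)]   # one district per column

row_winners = district_winners(row_districts)
col_winners = district_winners(col_districts)

print("overall votes:   ", Counter(v for row in grid for v in row))
print("row districts:   ", row_winners)
print("column districts:", col_winners)
```

The voters never change (4 for A, 5 for B), yet A wins two of three districts when the boundaries follow rows and B wins two of three when they follow columns.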
How to be wrong
● How can measurements be systematically or randomly wrong
(accuracy/precision)? What happens when our proxy measurement doesn’t
match what we care about (lamppost)? How can we correct for both these
kinds of problems?
● What are errors of analysis? Understand and be able to explain the
violations of assumptions that go into the analytic screw-ups known as
Anscombe’s Quartet and Simpson’s Paradox. Be able to explain the
file-drawer problem and how the missing p-values are a result of bad
analytic procedure.
● Know and be able to explain some famous examples of how broken tools
lead to broken results. And don’t use Excel unless you know it’s
appropriate.
● Be able to explain confirmation bias, and why it’s particularly important for
an analyst to be aware of and fight this bias. Be familiar with the general
concepts of the other cognitive biases.
● Understand how poor communication can lead to bad decision making
even if the correct analysis is made. Be able to explain how two space
shuttles and their crews were lost due to poor communication.
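To see how Simpson’s Paradox (mentioned above) can arise, the sketch below compares two treatments within subgroups and then overall. The counts are illustrative assumptions for this sketch, not data from the lectures, and the treatment and group names are placeholders.

```python
# (successes, patients) per treatment and subgroup; illustrative counts only.
results = {
    "A": {"mild": (81, 87), "severe": (192, 263)},
    "B": {"mild": (234, 270), "severe": (55, 80)},
}

def rate(successes, total):
    return successes / total

def overall_rate(groups):
    """Pool all subgroups of one treatment into a single success rate."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return rate(successes, total)

# Within each subgroup, treatment A has the higher success rate...
for group in ("mild", "severe"):
    print(group, round(rate(*results["A"][group]), 3), round(rate(*results["B"][group]), 3))

# ...but after pooling the subgroups, treatment B looks better.
overall_a = overall_rate(results["A"])
overall_b = overall_rate(results["B"])
print("overall", round(overall_a, 3), round(overall_b, 3))
```

The reversal happens because the treatments were applied to very different mixes of mild and severe cases, so the pooled comparison is confounded by severity.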