Uploaded by Julissa

Statistics Test 1 Study Guide

To bring with you:
1) Something to write with
2) Calculator (Separate from your phone or any other device that can log on to the
internet. If you don’t have one, don’t go out and buy one just for this class–I’ll bring
a few to the test and you can use one of those.)
Topics on Test 1:
1) Math by hand
a) Mean
- X-bar
- The average
- Ask if we need to memorize the equation
- \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
n = the number of observations/cases
x_i = the value for observation/case i
Σ (sigma) means “add”
In other words, the mean is equal to the sum of the values of x
from 1 to n, divided by n
- Add up all the values of a variable & divide by the number of quantities
you added
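As a quick check on the arithmetic, here is a minimal Python sketch of the mean. Python is only an illustration here (the course uses Stata and hand calculation), the function name `mean` is made up for this example, and the scores are the five depression scores from the standard deviation example in this guide.

```python
# Depression scores from the worked example in this guide
scores = [15, 35, 80, 3, 60]

def mean(values):
    """Add up all the values, then divide by how many values were added."""
    return sum(values) / len(values)

x_bar = mean(scores)
print(x_bar)
```

The result should match the 38.6 worked out by hand in the standard deviation example.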
b) Variance
- Essential to making statistical claims
- Can think of as variability
- Example:
- Say you are doing a study of how depression changes over the life
course, & the data look like this:
- (Data table: each participant’s depression score and age)
- There is variance (or variability) in depression score: Different
participants have different scores
- There is no variance in age: Every participant is 21 years old
- This is a terrible dataset for studying life course change in
depression! It only represents one part of the life course
- In other words: Age cannot possibly be the cause of the differences
in depression scores bc everyone in the study is the same age
- Related to range
- Measures of central tendency in relation to variance:
- Mode: no way to discuss variance
- Median: percentiles in the distribution
- Mean: standard deviation
Variance Calculation (Words)
- 1st, you calculate the difference between each person’s value & the
mean
- 15 - 38.6 = -23.6
- 35 - 38.6 = -3.6
- 80 - 38.6 = 41.4
- 3 - 38.6 = -35.6
- 60 - 38.6 = 21.4
- 2nd, you square each of those differences
- (-23.6)^2 = 556.96
- (-3.6)^2 = 12.96
- (41.4)^2 = 1713.96
- (-35.6)^2 = 1267.36
- (21.4)^2 = 457.96
- 3rd, you add them up
- 556.96 + 12.96 + 1713.96 + 1267.36 + 457.96 = 4009.2
- 4th, you divide the sum by one less than the number of cases
- 4009.2/(5-1) = 1002.3
➢ Can’t have negative variance
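The four steps above can be sketched in Python as a way to check hand calculations. This is an illustration only, not part of the course materials; the function name `sample_variance` is made up for this example, and the data are the five depression scores used in this guide.

```python
scores = [15, 35, 80, 3, 60]

def sample_variance(values):
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # 1st: difference from the mean
    squared = [d ** 2 for d in deviations]    # 2nd: square each difference
    total = sum(squared)                      # 3rd: add them up
    return total / (n - 1)                    # 4th: divide by one less than n

variance = sample_variance(scores)
print(round(variance, 2))
```

Because every squared difference is zero or positive, the total (and so the variance) can never be negative.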
c) Standard deviation
- Standard deviation is a statistic that pairs w/ the mean & helps describe the
data by telling us about variance
- Ex:
- The mean of depression is (15+35+80+3+60)/5 = 38.6
- But 38.6 isn’t a great characterization of all 5 of these
people
- The standard deviation = square-root of the variance
- √1002.3 = 31.66
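A short Python sketch of the same step, assuming the five depression scores above. It uses the standard library's `statistics.variance` and `statistics.stdev`, which both divide by n - 1 like the hand calculation; Python is used only as an illustration.

```python
import math
import statistics

scores = [15, 35, 80, 3, 60]

variance = statistics.variance(scores)  # sample variance (divides by n - 1)
sd = math.sqrt(variance)                # standard deviation = square root of variance

# statistics.stdev does both steps at once and should agree with sd
sd_direct = statistics.stdev(scores)
print(round(sd, 2))
```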
2) Measurement
a) Identify a response scale as ratio/interval/ordinal/nominal and explain why
(explanation is important; we may award partial credit)
- Continuous: Ratio
- Has numeric units (in the “real” world outside of stats)
- Different observers can agree that it means something objective to
have zero of that quantity
- For example, counts, durations, amounts, frequencies, distances,
lengths…
- How many friends do you have? (0,1,2,3,4…)
- How many years old are you?
- How many minutes did you wait to speak with a customer
service representative?
- How much money did you earn last month?
- What percentage of answers did you get correct on the test?
- How many days last week did you talk to your mother?
- How many miles do you travel from your home to the
nearest grocery store?
➢ Ideal in mathematical sense
- Continuous: Interval
- Has numeric units (in the “real” world outside of stats)
- Zero of that thing/idea is not meaningful, or is arbitrary, or is a
social construct
- For example
- Dates (Lots of calendars don’t recognize the birth of Christ
as Year 0 or January 1 as the 1st day of the year)
➢ Arbitrary bc based on historic events
- Times (Greenwich mean time is the 0, but it’s entirely
arbitrary)
- Temperatures, historically
➢ For Fahrenheit & Celsius (the zero points are arbitrary; e.g., 0 Celsius
is tied to the freezing point of water)
- Categorical: Ordinal
- Categories w/ no fixed distance between them
➢ Moved off the number line
➢ Don’t know that the distance is strictly equal
➢ Varies systematically from person to person based on
experiences
- You can rank or order the categories according to some criterion
- For example, agreement, frequency, rating, ranking…
- How much pain are you in? (1-5 where 5 is the worst
imaginable)
- How satisfied are you with your purchase? (Extremely,
very, some, a little, not at all)
- Is your socioeconomic status low, medium, or high?
- How often do you go jogging? (Never, rarely, sometimes,
often)
- Categorical: Nominal
- Has categorical units, w/ unknown distance between them
- There’s no inherent order to the categories
- For example,
- Anything w/ a yes/no answer
- Sex/gender
- Race/ethnicity
- Marital status
- Choice of breakfast cereal
b) Write a response scale that is ratio/interval/ordinal/nominal
- Look up examples for inspo.
3) Central tendency (Mode, median, mean/standard deviation)
a) Where to find them in Stata output
- Often, you need to do something with the cases that said “don’t know,” or “no answer”
before you can calculate a measure of central tendency. You have multiple options,
including:
- You can add “if” to your sum or tab command (this is a temporary fix, in that
you’d need to add it every time you run the command)
- sum CIG30AV if CIG30AV<91, detail
- Replace: Allows you to replace values with other values, for example, when you
want to tell Stata which values indicate missing or not applicable data
- replace varname=newvalue if varname==oldvalue
- replace COCUS30A=. if COCUS30A>30
- Here’s another example: Right now, nonsmokers get the code 91, and past
smokers get the code 93. But if you’re interested in days smoked in the past
month, you don’t have to set those people aside: They’re all people who smoked
0 days in the past month. So you could do this:
- replace CIG30USE=0 if CIG30USE==91|CIG30USE==93
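The logic behind these recoding options can be sketched outside Stata. The Python below is only an illustration with made-up responses: codes 91 and 93 follow the guide, while 94 is an assumed “don’t know” code invented for this example.

```python
# Hypothetical answers to "days smoked in the past 30 days"
# 91 = nonsmoker, 93 = past smoker (from the guide); 94 = assumed "don't know"
responses = [30, 5, 91, 93, 0, 12, 94]

# Same idea as: replace CIG30USE=0 if CIG30USE==91|CIG30USE==93
recoded = [0 if r in (91, 93) else r for r in responses]

# Same idea as: replace varname=. if varname>30 (None plays the role of missing)
cleaned = [None if r > 30 else r for r in recoded]

# Same idea as adding "if" to a command: skip missing cases for this one calculation
valid = [r for r in cleaned if r is not None]
mean_days = sum(valid) / len(valid)
print(recoded, cleaned, round(mean_days, 2))
```

Recoding 91 and 93 to 0 keeps those participants in the calculation, while the assumed “don’t know” case is set aside as missing.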
- Mode: tabulate, as above
- Tab (the abbreviated version of “tabulate”): Presents a
frequency distribution
- tab varname
- tab CIG30USE
- To find the mode really easily:
- tab CIG30USE, sort
- Median:
- Sum (the abbreviated version of “summarize”): Presents the
median (50th percentile) and mean
- sum varname, detail
- sum COCUS30A, detail
- Mean:
- summarize (same as for median)
- Tip: If you leave off the “, detail” option, you get simpler output that
includes only the mean, standard deviation, and range (i.e., no
percentiles or information about skewness/kurtosis)
- sum varname
- sum COCUS30A
- Standard deviation:
- Same as above
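To see what each statistic means apart from Stata’s output, here is a small Python sketch using made-up values for a “days used” variable. The data are hypothetical, and Python’s `collections.Counter` and `statistics` module stand in for `tab` and `sum`; this is not how the course expects you to compute them.

```python
import statistics
from collections import Counter

# Hypothetical "days used in the past 30 days" for seven participants
days = [0, 0, 0, 2, 5, 5, 30]

# Frequency distribution sorted from most to least common (like tab varname, sort)
freq = Counter(days).most_common()
mode = freq[0][0]                 # the value at the top of the sorted table

median = statistics.median(days)  # 50th percentile
mean = statistics.mean(days)      # the average
sd = statistics.stdev(days)       # sample standard deviation (n - 1)
print(freq, mode, median, mean, round(sd, 2))
```

Note how the single value of 30 pulls the mean well above the median in this made-up example.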
b) Interpret each in a sentence
- LOOK @ HW 1 ANSWER KEY WHEN UPLOADED FOR SPECIFIC
EXAMPLES
- Mode:
- The most common value in the dataset
- Note: It’s possible to have more than 1 mode, when 2+
values are equally most common– But I’m not going to
give you any problems where that’s true
- What’s the most typical or most frequent value for this variable
- Median:
- The middle value
- The 50th percentile
- 50% of the sample is higher, & 50% of the sample is lower
- Mean:
- The average
- Standard Deviation
- Standard deviation is a statistic that pairs w/ the mean & helps
describe the data by telling us about variance
c) Relationship to measurement–which stats apply to which types of
measurement
- Mode: All types of measurement
- Median: Ordinal, interval, & ratio
- Mean: Interval & ratio
- In a dataset, we often want to know the characteristics of a
“normal” participant– What is typical of the people in this study
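The pairing of statistics with measurement types listed above can be written as a small lookup table. This Python sketch is only a study aid; the names `ALLOWED` and `can_use` are made up for this example.

```python
# Which measures of central tendency apply to which type of measurement,
# as listed in this guide
ALLOWED = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}

def can_use(statistic, measurement):
    """Return True if the statistic applies to that type of measurement."""
    return statistic in ALLOWED[measurement]

print(can_use("mean", "ordinal"))
print(can_use("median", "ordinal"))
```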
4) Sampling
a) Census vs. sample
- A census is any study in which every single member of a group of interest
is a participant
- THE U.S. Census happens every 10 years & is a study of every person
living in the U.S.
➢ THE census is an example of a census
- THE Census is a massive, expensive, time-consuming undertaking
- For these reasons, most studies use a sample of the group of interest– this
is even what Census does in “off” years
➢ Polls aren’t a census, but are a sample
- Population: The larger group of interest
- In Census, every person living in the U.S. in 2020
- Could be all undergraduate students at BC in 2022
- Could be U.S. registered voters in the lead-up to an election
- Sample: A subset of participants drawn from a population for the purposes
of making inferences about that population
➢ Useful when the population is large or hard to reach completely
b) Types of random sampling
- Any method where I can calculate the probability that each member of the
population gets selected into the sample
- The laws of probability dictate that if my sample is large enough & I use
probability sampling, my sample will tend to be representative of my population
- We like random sampling because we can estimate sampling error: how
much the sample might differ from the population
■ If you can’t avoid it, it’s good to establish what it is
i) Simple
- Every member of the population has an equal probability of being
in the sample, n/N, where n is the number of units to be selected &
N is the total number of units in the population
➢ n = the number in sample
- Lots of populations are already enumerated– that is there’s a list of
all the units in the population (a sampling frame)
- Pull names from a hat
- Use a random number generator
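A simple random sample can be sketched in Python with the standard library’s random number generator. The sampling frame of 100 students and the seed are made up for this example.

```python
import random

# Hypothetical sampling frame: an enumerated list of N = 100 students
population = [f"student_{i}" for i in range(1, 101)]
n = 10  # number of units to select

rng = random.Random(2022)            # fixed seed so the draw can be repeated
sample = rng.sample(population, n)   # draws n units without replacement

# Every unit has the same chance of selection: n/N
inclusion_probability = n / len(population)
print(sample, inclusion_probability)
```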
ii) Systematic: a fixed interval between draws
- Sample of the population of people who eat at Hillside
- We might survey every 3rd person who walks in for some period
of time
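The “every 3rd person” rule can be sketched in Python with list slicing. The list of 30 diners is made up for this example, listed in the order they walk in.

```python
# Hypothetical list of people in the order they walk in
arrivals = [f"diner_{i}" for i in range(1, 31)]

k = 3  # fixed interval between draws

# Start at the 3rd person (index k - 1), then take every k-th person after that
surveyed = arrivals[k - 1::k]
print(surveyed)
```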
iii) Stratified
- People w/ like characteristics = strata
- We might stratify the BC population by class year
- Then, conduct a simple random sample within each stratum (i.e.,
freshmen, sophomores, juniors, seniors)
➢ Ensures the sample includes students from each class year
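A stratified sample can be sketched as one simple random sample per stratum. In this Python illustration the frame of 50 students per class year, the sample size of 5 per stratum, and the seed are all made up.

```python
import random

# Hypothetical frame: 50 students in each class year (the strata)
years = ["freshman", "sophomore", "junior", "senior"]
frame = {year: [f"{year}_{i}" for i in range(1, 51)] for year in years}

rng = random.Random(0)  # fixed seed so the draw can be repeated
per_stratum = 5

# Simple random sample within each stratum
sample = {year: rng.sample(members, per_stratum) for year, members in frame.items()}
print(sample)
```

Because each class year is sampled separately, no year can be left out by chance.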
iv) Multistage
- A series of simple random samples
- Example: Americans in states
- First, I take a simple random sample of states
- Second, I take a simple random sample of counties within selected states
- Third, I take a simple random sample of residents within selected
counties
➢ Makes sampling a large population easier
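The three stages can be sketched in Python as nested simple random samples. The frame (10 states, 5 counties per state, 20 residents per county), the sample sizes at each stage, and the seed are made up for this example.

```python
import random

rng = random.Random(1)  # fixed seed so the draw can be repeated

# Hypothetical nested frame: state -> county -> list of residents
frame = {
    f"state_{s}": {
        f"county_{s}_{c}": [f"resident_{s}_{c}_{r}" for r in range(1, 21)]
        for c in range(1, 6)
    }
    for s in range(1, 11)
}

# First stage: simple random sample of states
states = rng.sample(sorted(frame), 3)

selected = []
for state in states:
    # Second stage: simple random sample of counties within each selected state
    counties = rng.sample(sorted(frame[state]), 2)
    for county in counties:
        # Third stage: simple random sample of residents within each selected county
        selected.extend(rng.sample(frame[state][county], 4))

print(len(selected))
```

Only the selected counties need a list of residents, which is what makes this design easier than listing everyone in the population.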
Overall:
● Review slides and your notes
● Reread book sections
To practice math by hand:
● Write out any small set of numbers. Calculate!
To practice measurement:
● Open the General Social Survey data or the National Survey on Drug Use and Health.
Tabulate variables and see if you can classify them. Tip: If you’re not certain, write out
your reasoning–if you convince me on the test, I may give partial or full credit.
● You can also look at the text version of the General Social Survey online. Given that the
Stata codebook for GSS isn’t very good, this might work better. Note that you need to
scroll to page 30 or so before you start seeing questions.
● Click on “Topline Questionnaire” (upper right) for the results of a series of weekly
surveys about coronavirus.