Research Methods II Notes

Scientific Thinking & Psychological Science
- the scientific approach is preferred if you are interested in causality
- there are many ways of fixing beliefs - faith, the scientific approach, etc.
- the scientific approach is best if you are looking at whether something causes something else

Historical examples where people have failed to use the scientific method:
a) Benjamin Rush (1793) - yellow fever research and bloodletting
- he thought yellow fever was caused by too much blood; he saw many people die
- he didn't look for contrary evidence: if he drew blood and it didn't work, then he hadn't drawn enough blood or people hadn't come to him soon enough - he couldn't lose
b) Joseph Goldberger - pellagra
- looks at the difference between correlation and true causation
- pellagra is a disease characterised by open sores, vomiting, and a runny nose - it was causing 100,000 deaths in the US per year
- most physicians said that the people who got pellagra lived in poor areas - open sewers, etc.
- Goldberger didn't believe it; he thought it was due to the poor diet the same poor people were living on - they couldn't afford to eat healthily
- he did an experiment: he and volunteers ingested the feces, blood, urine, etc. of pellagra patients and none of them got pellagra, so it isn't a bacterium - he thought he was right about the diet (the protein idea), so he took volunteers from a prison and put them on low-protein diets; they developed pellagra, which showed that it actually was the low-protein diet

Characteristics of a Scientific Experiment
Manipulation (of the independent variable)
Control (of extraneous variables - potential confounds)
Measurement [of the dependent variable(s)]
Comparison (of the measurements with appropriate statistics)
- there will be appropriate and inappropriate statistics in every study

Scientific Observations
Two rules:
1. Operational Definitions: things you are measuring must be operationally defined
2. No Distortion: no distortions introduced by you when you are doing your measurements

Why operational definitions?
- avoids confusion - do we all mean the same thing when we talk about personality or intelligence?
- makes the study reproducible
- it's ok to have mistakes; the fact that we use statistics acknowledges that we might make a mistake (alpha levels)
- makes measurements reliable (the same results every time you do the study)
- But there is a trade-off between the precision of an operational definition and its construct validity
- talking to someone in the street, the topics would be broader; if you asked "does pornography influence negative views of men towards women?" you would have to operationalize what you are talking about, yet it's interesting that on TV or in discussions people will have long discussions without operationalization

Sources of Distortion?
a) Instruments: when they were first looking at planets, the cameras showed that the planets had coloured rings around them, but it was just a problem with the camera
b) Sampling: it is impossible to get a truly random sample of people; you will always have some sort of bias
c) Observer bias: a big problem; you tend to see what you expect to see

Observer Bias
Two historical examples
René Blondlot's N-rays
- he was one of the world's preeminent physicists, at the French University of Nancy
- shortly after X-rays were discovered he wanted to know whether X-rays could be polarized; he thought he had discovered a new ray, called N-rays
- he created special materials and, by certain methods, others could also see the N-rays
- one researcher, Wood, wasn't able to replicate Blondlot's findings, and the review board of Nature didn't believe it either
- Wood secretly took out the aluminum prism (the critical piece) and the N-rays still "worked"; he put it back in when they thought it wasn't there - observers saw what they expected to see either way

Clever Hans
- a horse that could apparently answer mathematical problems, but what he really knew was how to read people - they would look a different way when he was close to the answer

Current examples
- Female mate choice in wolves: they didn't believe it before… men doing the research, biases
- Alex - the talking parrot
- Koko - the talking gorilla: uses American Sign Language; IQ between 70-95; 2000 words

"You can only see what you believe" - anonymous

Flotsam and Jetsam assignment (stuff to watch for when surfing)
- challenge the scientific basis for claims, not the truthfulness of the claim
- proper attribution of causal relations (control for placebo effects?)
- look where people got their PhDs from
- data is not always information, which is not always knowledge

Characteristics of a Scientific Experiment
• Manipulation (of the independent variable) - is the manipulation done within subjects or between subjects?
• Control (of extraneous variables - potential confounds)
• Measurement [of the dependent variable(s)]
• Comparison (of the measurements with appropriate statistics)

Hypothesis generation
- the hardest part can be coming up with the right question
• interested in amnesia (e.g., read about the "Lost Mariner" - Oliver Sacks) - interviewed someone with no memory - a guy in the prairies with a memory of about 1 second
• read "The Man Who Mistook His Wife for a Hat"
• effects of "shock" on memory - evidence of shock affecting memory (electro-convulsive therapy - ECT)
• Wonder if an emotional shock will cause retrograde amnesia (forgetting things from before the incident)?

Converting the question to an analytical experiment
• operational definitions
• subject selection
• subject assignment
• variety of "mechanical details"

Experimental Hypothesis
• An emotional shock will disrupt memory for events that occur immediately prior to its occurrence - it will influence information you received a few seconds before the emotional shock
• Which words require operational definitions?
• An emotional shock will disrupt memory for events that occur immediately prior to occurrence.
• Final, operationally valid version: An unexpected 15-second scene that portrays a mutilated body at the end of a 10 minute travel film will disrupt a person's memory of the price of 10 items listed in the film during the one-minute period immediately preceding the final scene - you could come up with other valid operationally defined versions of the same question Why operational definitions? • • • avoid confusion makes study reproducible makes measurements reliable- redo and get similar results But trade off between precision of an operational definition and it’s construct validity Research Methods II Notes • • • • 4 Converting question to an analytical experiment operational definitions subject selection population sampling subject assignment variety of “mechanical details” Using Introductory Psychology Students as subjects (participants) 1. Conclusions based on undergraduate subjects are not wrong, just require further tests to ensure generalizability. If the results don't replicate, still not wrong, just incomplete (once you have the results from the undergraduate student data you can compare that data to other groups like “retired military people”) 2. Much of psychological research is so basic subject sampling is irrelevant. 3. There is considerable replication among different undergraduate populations. This gives geographical and SES variability that helps support generalizability. (you can generalize to other university student populations at other universities that might differ in such ways as whether or not they have to pay for university, etc.) • This is a legitimate problem, but it is often overstated. • • Converting question to an analytical experiment operational definitions subject selection- undergraduates, volunteers (can also have problems) population sampling subject assignment variety of “mechanical details” • • • • Converting question to an analytical experiment operational definitions subject selection subject assignment variety of “mechanical details” • • Final Experiment 20 intro psychology subjects, 10 randomly assigned to each group- bivariate experiment (one group will see the shocking material, one won’t) - with random assignment, people have equal chances of being in each group - random assignment might be a confound because you will most likely get about 15 females or more and then males - you could end up that you get all males in one group which could be a confoundproblem with small groups, but we will do random assignment anyway Experimental group • • Male experimenter written instructions Research Methods II Notes • • • • • - • • • • • • • 5 10 min travel film 1 o’clock testing group testing 15 s final scene version ‘a’ - mutilated body seated in chair (operational defn’ of emotional shock) - I.V. Measure memory - D.V. you should make these things consistent- fast and easy is always good also to keep in mind Control group Male experimenter written instructions 10 min travel film 1 o’clock testing group testing 15 s final scene version ‘b’ - craftsperson weaving a basket - I.V. Measure memory - D.V. YOU WANT TO STATISTICALLY COMPARE DIFFERENCES IN MEASURED MEMORY Loftus and Burns 1982 Loftus, E., & Burns, T. Mental shock can produce retrograde amnesia. Memory and Cognition, 10, 318-323. 
- she used a video that told bank tellers what to do in case of a robbery
- the violent version: a boy being shot in the face
- the nonviolent version: a bank teller telling people to remain calm
- she also took confidence ratings; most of the data shows that confidence isn't really related to accuracy
- she found that most of the things in the video were remembered less well in the violent condition than in the nonviolent one
- what was the number on the jersey?

Remember the model of human memory
- does the information not get into long-term memory? Or is it there but we can't bring it back into working memory?
- people might not know the number on the jersey because it was never encoded, or because it was encoded but can't be retrieved - it is hard to know which
- she redid the experiment using a multiple-choice question, to see where people are being affected
- she used another nonviolent condition of staying in the alley (got the same results as going back into the bank)
- she had another control in which a couple walked on the beach at the end, to see whether it was surprise alone that produced the memory loss

Difficulties with Probabilistic Reasoning (from Stanovich: How to Think Straight About Psychology)
1. Salience of atypical cases ("man-who statistics") e.g., Hamill, Wilson & Nisbett (1980) prison guard experiment
2. Insufficient use of probabilistic information e.g., Bayes' theorem
3. Cognitive illusions
4. Failure to use sample size information
5. Tendency to explain chance events
6. Gambler's Fallacy
7. Conjunction Fallacy

January 14, 1999
Characteristics of a scientific experiment
Manipulation (of the independent variable)
Control (of extraneous variables - potential confounds)
Measurement (of the dependent variable(s))
Comparison of the measurements with appropriate statistics

Difficulties with probabilistic reasoning (from Stanovich: How to Think Straight About Psychology)
1. Salience of atypical cases ("man-who statistics") i.e. Hamill, Wilson, & Nisbett (1980) prison guard experiment
Media
- you may know people that violate the rules
- depending on how that information is presented, the salience of the weird cases can override the main way of the world
- statistical summaries of relevant information are less effective in changing opinion than one face-to-face opinion
- prison guard experiment: participants saw either an abusive guard (talking about how prisoners should be punished) or a nice guard
- they were told either that the guard in the video they watched was typical, or that he was very unusual in his attitudes - so you have a 2 X 2 factorial design (four different groups)
- the main finding was that when asked what guards were typically like, they all talked about the video they saw (even when they were told that what they saw wasn't typical)
- students choosing courses?
- If you have a friend who hated the course but everyone else liked it, most of us will be influenced by the friend
- Vietnam: weekly fatality statistics (200-300); Life magazine
- every day they would report the death rate like the stock market; people passed it off the way they do the daily summary of stocks being up and down
- Life magazine ran a photo spread of as many as possible of the roughly two hundred soldiers who had died the week before - it looked like a yearbook
- some people have argued that this was the turning point in the anti-war movement: people became outraged that 250 American men were dying over in Vietnam - they had already heard the numbers for years, but they hadn't seen it (people instead of numbers)
- the anecdotes used in media (politicians, etc.)
- politicians will tell little anecdotes like the welfare mom going to the spa with the colour television, etc., and everyone will think "cut welfare!"
- no images of the Gulf War - no pictures of Americans dying; in the weeks before the Americans went in, a woman testified that she was an eye-witness to Iraqis going into the hospitals and killing children - this was a major catalyst for going to war (it turned out that she was lying, which only came out after the war was already underway)

2. Insufficient use of probabilistic information (e.g. Bayes' theorem)
In a city where 15% of the taxi cabs are blue and the remaining 85% are green, the only eye witness to a hit-and-run accident involving a taxi says the taxi involved was blue. She has been shown to be accurate 80% of the time in making such identifications. What is the probability that the taxi was in fact blue, given that she said it was?

Bayes' Theorem
Assuming there is no reason for one taxi colour to have been involved in an accident more than the other:
- base rate probability of the taxi being blue = .15; she would identify it as blue 80% of the time: .8 * .15 = .12
- base rate probability of the taxi being green = .85; she would erroneously identify it as blue 20% of the time: .2 * .85 = .17
- proportion of taxis identified as blue that actually were blue = .12 / (.12 + .17) = .41

Bayes' Theorem (another example)
- Casscells et al. (1978): Harvard Medical School
- suppose there is a disease that occurs at a rate of 1/1000
- a cheap, safe test is developed that has a false positive rate of 5% (falsely identifying someone as having the disease when they do not)
- this test becomes routinely done
- you, with no symptoms of the disease, test positive during a routine screening
- what is the probability that you actually have the disease?
- it turns out to be 1/51, about a 2% chance that you have the disease (p = 0.0196)

3. Cognitive illusions (the line test) - see handouts
Group A: Question 1: how many chose option a (route 1) or option b (route 2)? Question 2: how many chose option a (sure gain) or option b (80% chance to gain)?
Group B: Question 1: how many chose option a (route 1) or option b (route 2)? Question 2: how many chose option a (sure loss) or option b (80% chance to lose)?
- the questions are the same but the wording is different; people can be manipulated depending on how information is worded
- one of the characteristics of humans is that we tend to be loss-averse; we don't like to lose something that we already own

4. Failure to use sample size information
- people do not consider sampling information that much
- sample size example: in which would you expect to have more days during which over 60% of the babies born are female?
(large or small hospital) - it is the small hospital: you are more likely to get extreme results in smaller samplesmore likely to be biased Research Methods II Notes - 8 in which cases is it more likely that more coin tosses will be heads- two or 10 times? (two times) 5. Tendency to explain chance occurrences - stock market analysis - illusion of control (lottery tickets- picking numbers, books, etc.) - people spend forever trying to pick numbers - the chance of someone winning twice was 1/billions - personal coincidences: we like to explain these, we tend to ignore all the noncoincidences when they happen - everyone is connected- seven degrees of separation - wouldn’t is be, like, really weird karma if, like, two people in this class had the same birthday? - bet that two people in this class were born on the same day of the year (not necessarily the same year)? - In a group of 23, chances are 50/50 - coincidences do not need explanation (you forget that everyone else you met didn’t have your same birthday when you meet someone who does) 6. - - Gambler’s fallacy: tendency for people to see unrelated things as being related if you have T, H, H, H, H, H, ____? (coin toss) people will say more likely to be tails the probability is always equal, however we like to explain chance events one of the places this happens is that in sports: if they are making shots in a row, the commentators say they are hot, etc.- but in reality it is the same chance that they will either get it in or not Stanovich’s point is that we are not statistical thinkers, but we are pattern thinkers Linda Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy and was an activist in anti-war movement. Which of the following statements is most likely true about linda? - examples like “Linda works in a bookstore and takes yoga” and things like “she is an insurance salesperson” and “she is a feminist”, “she is a bank teller” and “she is a bank teller and active in the feminist movement” - when you look at the conjunction fallacy (Linda- representative heuristic)- the idea that is less likely to be the last one- the overlap of two things can’t possible be more likely than either one by themselves - we are wired to pick things like that Stanovich: while many scientists sincerely wish to make scientific knowledge accessible to the general public, it is intellectually irresponsible to suggest that a deep understanding of the particular subject can be obtained by the layperson when that understanding is crucially dependent on certain technical information that is only available through formal study - such is the case with statistics and psychology no one will be able to teach me reasonably why the earth is so old…half lives, etc. Research Methods II Notes - 9 he says that the study of behaviour and thinking is the same- statistical knowledge is taught by formal knowledge and psychology is understood by statistics Innumeracy (John Allen Paulos) talking about numbers one of the simplest parts of innumeracy is the problem people have with large numbers “thousands and thousands of them…almost 300” we are faced with numbers all the time we hear numbers and use them but not too many of us have a good understanding of 5, 10, 100- 5 billion? 
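A quick check of the conversions listed next - a minimal Python sketch, nothing beyond the arithmetic in the notes:

    seconds_per_day = 60 * 60 * 24
    print(1_000_000 / seconds_per_day)                  # about 11.6 days
    print(1_000_000_000 / (seconds_per_day * 365.25))   # about 32 years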
- 1 million seconds ….. 11.5 days
- 1 billion seconds ….. 32 years
- modern Homo sapiens: about 10 trillion seconds old (10,000,000,000,000 seconds)

Stock market scam:
- the only investment is stamps: send out 32,000 letters predicting stock changes (a 50:50 guess), half predicting a rise and half a fall
- you will have been right for half of them; send the next prediction to those 16,000, and keep going until you have 500 people playing the stock market who have seen you make 6 correct predictions in a row
- the next thing you send out is: "for 500 dollars I will send you my next prediction" - you would end up with $250,000!
- it is not easy to understand probabilistic reasoning

Inferential Statistics (Ch. 12)
Used to infer the characteristics of the population
Some terminology:
Parameters - a characteristic of the population
Statistic - a characteristic of your sample
Parametric statistics estimate values of the population from characteristics of a sample
These estimates are based on the following assumptions:
1. the population values are normally distributed
2. interval or ratio measurements are made
Nonparametric statistics make no assumptions about the distribution of scores in your sample and can be used with ordinal or nominal data
Inferential statistics are used to estimate the probability that observed samples come from the same population. You calculate the observed value of the statistic (based on sample data) and compare this to critical values (or, with computer programs, you are simply given the probability that the calculated statistic, or an even more unlikely one, occurred by chance).
Parametric inferential statistics make assumptions that:
1. scores have been sampled randomly
2. the sampling distributions are normal
3. the within-groups variances are homogeneous
Violations of these assumptions will bias the test.

Inferential Statistics
- this is some of the information from chapter 12
- used to "infer" the characteristics of the "population"
- some terminology:
  - parameters: a characteristic of the population
  - statistic: a characteristic of your sample

Parametric statistics
- estimating values of a population from characteristics of a sample
- these estimates are based on several assumptions:
1. the population values are normally distributed
2. interval or ratio measurements are made
3. samples were drawn randomly from the population(s)
   - this is a nice theory; we often violate it
   - sometimes questionnaires that are treated as interval are really just ordinal
   - Beck Depression Inventory: the difference between a score of 5 and a score of 8 is not really what you think
4. the samples have equal variances

Nonparametric statistics
- these statistics don't make assumptions about the distribution of scores in your sample and can be used with ordinal or nominal data
- we have another set of tools for when we don't have interval or ratio data

Characteristics of Psychological Science (Summary of Stanovich)
1. Psychology progresses by investigating empirical problems.
2. Psychologists propose testable hypotheses.
(Freud wouldn’t be a psychologist or a scientist- he was a medical doctor (a neurologist) - - - Falsifiability and Folk Wisdom people don’t have a lot of ideas about human behaviour but many people use folk wisdom- proverbs come up “decisions should be made only after consideration of all the alternatives” folk wisdom has “look before you leap” and also “he who hesitates is lost”- they cover all the basis “out of sight, out of mind” but also “absence makes the heart grow fonder” - these proverbs will explain human behaviour there is a lot of talk in education about co-learning and working together (two heads are better than one) - money and policies directed toward this- i.e. in research you might not get money unless you have a partner for collaborative research - but you also have the proverb- “too many cooks spoil the broth” “opposites attract” and “birds of a feather flock together” - do people seek out people who are similar to them- this seems to be more researchbased the point of these examples is that folk wisdom tends to be contradictory- you can have little proverbs to explain almost anything Research Methods II Notes - 11 it would be like the thinking of Benjamin rush with the bleeding example- it is not falsifiable- the theories can’t be proven wrong (if they don’t get better then you didn’t bleed enough…) Folk Wisdom & the Benefits of Work Experience for Youth - many folk beliefs of the benefits of youth getting work experience a) the money can be used to pay for future education (this theory has a lot of face validityyou will be earning money instead of goofing off) b) the experience will help develop a work ethic c) working will instill respect for the economy (learn about business, appreciate the value of a dollar) d) having been in the “real world” working youth will become more motivated students so they don’t have to flip burgers for the rest of their life a) b) c) d) - Psychological Studies on the Effects of Work Experience for Youth earnings are spent on luxury items workers become more cynical and less respectful of work employment has harmful effects on education it turns out that the workers do less well working also seems to promote some delinquent behaviour the folk arguments make sense to other people- good face validity, but they really don’t pan out Characteristics of Psychological Science (Summary of Stanovitch) 3. The concepts in these hypotheses are operationally defined. - i.e. it should not be illegal to be in possession of child pornography because it violates the charter of human rights and freedoms - the more severe forms of pornography have been associated with bad human behaviour, but there must be clear operational definitions of child pornography - child pornography has to be operationally defined 4. Psychologists use many different empirical methods. 5. Most conclusions are arrived at only after slowly accumulating data from many experiments. (people want the “ureka!” people who’ve “found the answer” but this does not happen) 6. The behavioral principals eventually uncovered at are almost always probabilistic. 7. Psychological data and theories are only acceptable after publication in peer reviewed scientific journals. you can publish any nonsense you want- as long as you pay them to publish whatever you want) - we have to pay to get the journals of our own data…? 
Stanovich While many scientists sincerely wish to make scientific knowledge accessible to the general public, it is intellectually irresponsible to suggest that a deep understanding of a particular subject can be obtained by the layperson when that understanding is crucially dependent on certain technical information that is only available through formal study.Such is the case with statistics and psychology. Characteristics of a Scientific Experiment • Manipulation (of the independent variable) Research Methods II Notes 12 • • Control (of extraneous variables - potential confounds) Measurement [of the dependent variable(s)] • Comparison (of the measurements with appropriate statistics) - we are finally on “comparison” (of the measurements with appropriate statistics) Hypothesis Testing - in truth, there could be: 1) a true difference between populations sampled (for instance, the video of memorythere is a true difference in memory recall when you are emotionally shocked) - but all we have are samples of the populations - we have a set of data from experimental and control groups and that data is differencethey are differences in samples because there are differences in populations 2) no difference (samples came from a single population) - maybe the two results would be the same for both groups in theory- from the population - but just say you have a sample where you end up with a difference when everyone was taken from the same population - you have to decide whether the samples came from a single population or two separate populations - based on your samples, you have to decide which of these to believe, make your best guess - four possible outcomes: a.) you guess (based on your samples) that the observed difference is real, and be correct: Correct rejection of the null hypothesis b.) guess (based on your samples) that the observed difference is real, and be wrong: Type I (alpha) error c.) guess (based on your samples) that the observed difference is not real, and be correct: Correct failure to reject the null hypothesis d.) guess (based on samples) that there is no difference between the groups and you are wrong: Type II (beta) error - know table with four things of reject or do not reject Ho with Ho is true or Ho is false - we can put values to how often we will be wrong or right - the key to all inferential statistics is the ratio between the between-group (between sample) variance (difference in group means) and the withingroup variance - imagine there is no overlap from the two populations- if you get a biased sample that is high and another that is unusually low- you could say that this confirmed - we want to look at the variance between groups and the variance within groups Research Methods II Notes - Inferential Statistics A key aspect of science is that it deals with testable hypotheses The key tool for testing these hypotheses are inferential statistics Inferential Statistics can be either parametric or non-parametric - statistics that assess the reliability of your findings are called inferential statistics - - - - - 13 Inferential Statistics the key to all inferential statistics between the between-group (between sample) variance (difference in group means) and the within-group variance or, the extent to which the two sample distributions overlap different from descriptive statistics which were calculated on you sample distribution (such as the mean and standard deviation) descriptive statistics may or may not agree with the population values- i.e. 
the mean of the sample may or may not agree with the mean of the population (thus inferential statistics are used – t-test, etc. which looks at the sample as coming from the population and the chance that your result happened by chance, etc.) “sampling distribution of the mean”: variance in the means of all possible samples of a given size from a population - this distribution has well-defined characteristics- normally distributed and the mean of the sample distribution is the same as the mean of the population from which the samples drawn from - central limit theorum says that even if the population is not normally distributed, the distribution of sample means will tend to be normal- you can make statements that depend on normality using the samples even if the population is not normally distributed standard error of the mean (or “standard error”): estimate of the amount of variability in the expected samples means across series of samples (standard deviation of sample divided by the square root of sample n) degrees of freedom: for a single sample will always be n –1 (because if you have say 10 scores and a known mean, once you have selected 9 scores the value of the 10th can not vary) in analysis of experiment it will be A –1 (where A is the # of levels of the independent variable) inferential statistics are either “parametric” or “nonparametric” - parameter: characteristic of a population - statistic: characteristic of sample - parametric statistic: estimates the value of a population parameter from the characteristics of a sample (when you use a parametric statistic, you are making certain assumptions about the population from which the sample was drawn- i.e. key Research Methods II Notes - - - - 14 assumption of parametric test is that sample was taken from a normally distributed population) - nonparametric statistics used if data does not meet the assumptions of a parametric test each sample statistic provides an independent estimate of the population parameter (i.e. Xbar is an estimate of u) - if two means for example were drawn from the same population you would expect them to differ only because of sampling error- you can calculate the probability that the two sample means would differ as much as or more than they do simply because of chance factors (this probability is the obtained “p”) - i.e. 
in an experiment where a drug would actually have no affect on subjects, you can work out what the chance would be of getting the two means you did by chance - as a researcher, you don’t know if a drug will have an effect or not- just say you get the following results: you must decide whether the two sample means were drawn from the same population (the treatment had no effect on the sample scores) or from two different populations (the treatment shifted the scores relative to scores from the control group) there are many possibilities- see hypothesis testing table hypothesis that means drawn from same population: null hypothesis hypothesis that means were drawn from different populations: alternative hypothesis inferential statistics measures how likely it would be to get the different sample means by chance if they really did come from one population- if the probability is small then the difference between the sample means is said to be “statistically significant” and null is rejected know hypothesis testing table: four options you calculate an observed value of a statistic and compare it to a critical value on a table (make decision based on whether or not observed value meets or exceeds critical value)- want to reduce possibility of committing a Type I error - probability of committing type I error dependent on the alpha level you chose - alpha level: represents the probability that a difference at least as large as the observed difference between your sample means could have occurred purely through sampling error (smaller you make alpha, less likely you are to commit type I error) - significance level: particular level of alpha you chose - if the obtained p is less than or equal to alpha, your comparison is statistically significant- REJECT NULL One tailed vs. 
two tailed - one tailed test is when you have a directional hypothesis - two tailed test is when you do not have a directional hypothesis (just that something will produce a change in either direction of something else) Research Methods II Notes - - 15 in this test if you had .05 then each critical region (portion of sampling distribution within which observed values will be statistically significant) must contain 2.5% of the cases implication: you must obtain a greater difference between means of groups in either direction to reach statistical significance if use a two-tailed test than a one-tailed test Parametric Statistics - three assumptions for parametric inferential tests: a) scores have been sampled randomly from population b) sampling distributions of the mean is normal c) variances between groups are highly similar - the t-test - used when experiment includes only two levels of independent variable - “t-test of independent samples”: used when you have data from two groups of participants and those participants were assigned at random to the two groups - has two versions- pooled (computes error term based on the two samples combined under the assumptions that both samples come from populations having the same variance) and unpooled (computes an error term based on the standard error of the mean provided separately by each sample) - “t-test for correlated samples”: used when two means being compared come from samples that are not independent of one another- produces a larger t value than the t test for independent samples when applied to same data if scores from two samples are at least moderately correlated - matched-groups and within-subject designs use this - - - - Factors Contributing to Group Mean Differences strength of the I.V.’s influence on the D.V. - you are hoping that you don’t make type II error level of the treatment of the I.V. - just say you have a conclusion but because the level of treatment was wrong your conclusion would actually be wrong sensitivity of operational definition of the D.V. - you must have a sensitive measure or you might not be able to detect the influence - just say you were testing a drug: there could be altered mean differences because of the strength of the iv or because of the level of treatment you give them (maybe you didn’t give them enough) subject differences: if you were doing a memory test and you had good memory people in emotional shock group- you might not find anything even though your hypothesis was right Factors Contributing to Within-Group Differences degree of control of extraneous variables subject differences- if everyone has the same tolerance for emotional shock for instance, you will probably get pretty consistent results sample size Ways of Increasing Power What can you do? 
Research Methods II Notes - 16 note that it is easier to control the factors that influence between groups variance than the factors that determine within group variance but some factors that reduce within group variance also tend to decrease external validity Power - power is the probability of rejecting the null hypothesis when it is false - more precisely: power is the probability of rejecting the null hypothesis AND correctly identifying a TRUE alternative hypothesis (avoiding type III error) - inferential statistics designed to help you determine the validity of the null hypothesis- you want statistics to detect differences in your data that are inconsistent with the null (the “power” of a statistical test is its ability to detect these differences) - power is statistic’s ability to correctly reject the null hypothesis - issue of power is important: rejection of null implies that your independent variable affected your dependent variable- want to make sure that you aren’t incorrectly rejecting or correctly rejecting null due to power of test and not the real reasons - power is affected by… chosen alpha level: as you reduce alpha level you reduce probability of making a type I error- also reduce power unfortunately as it is more difficult to reject the null hypothesis (you need a larger difference between means to reject it) size of sample: power increases as size increases because larger sample provides more stable estimates of population parameters whether you use one-tailed or two-tailed: two-tailed test is less powerful than onetailed test - size effect produced by your independent variable: Research Methods II Notes 17 “effect size”: degree to which manipulation of independent variable changes the value of the dependent variable - effect size estimates the amount of overlap between the two population distributions from which samples were drawn- large effect sizes indicate little overlap (greater power) too much power can be as bad as too little power - - - Types of Inferential Statistics there are many…very many! Each is intended, and appropriate, for use in certain specific conditions that depend upon: - Scale of measurement: nominal, ordinal, interval, ratio - Shape of the sample distributions - Experimental design: # of I.V.’s, # of levels of each I.V., how manipulated (within, between, mixed), # D.V.’s • Nonparametric Statistics these make no assumptions about the distribution of scores in your sample and can be used with nominal or ordinal data an example… Bivalent Between Subjects Memory Experiment 20 intro psychology subjects, 10 randomly assigned to each group Experimental group Male experimenter written instructions 10 min travel film 1 o’clock testing group testing 15 s final scene version ‘a’ - mutilated body seated in chair (operational def. of emotional shock) - I.V. Measure # items (of 10) correctly remembered - D.V. • • • • • • • Control group Male experimenter written instructions 10 min travel film 1 o’clock testing group testing 15 s final scene version ‘b’ - craftsperson weaving a basket - I.V. Measure # items (of 10) correctly remembered - D.V. 
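Before the comparison step below, a minimal sketch (mine, not from the lecture) of how the random assignment described above - 20 intro psychology subjects, 10 per group - might be done in Python; the subject IDs are made up:

    import random

    subjects = list(range(1, 21))     # the 20 intro psych subjects (hypothetical IDs 1-20)
    random.shuffle(subjects)          # each subject gets an equal chance of either group
    experimental = subjects[:10]      # these see final scene 'a' (mutilated body)
    control = subjects[10:]           # these see final scene 'b' (basket weaving)
    print(experimental, control)

Note that a plain shuffle like this does nothing about the concern raised earlier that, with small groups, one group could end up mostly male or mostly female by chance.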
- STATISTICALLY COMPARE DIFFERENCES IN MEASURED MEMORY

Hypothetical Results of Bivalent Between Subjects Memory Experiment #1 - Mann-Whitney U-Test
- Rank order all memory scores from both groups; in case of ties assign the average rank
- simple: you take the scores and rank order them - put the best score at the top and go down; when you have ties you average the rank (both get 7.5 if ranks 7 and 8 had the same score)
- indicate from which group each score came (circle)
- Count the # of scores from the group hypothesized to have lower scores that are higher than scores from the other group (ours is a directional hypothesis)
- the hypothesis is that emotional shock will disrupt memory… we have a directional hypothesis, looking at the extent to which the data fit it - if there is a difference between your groups, then the ranks for the scores in one group should be consistently above the ranks from the other group rather than being randomly distributed
- add up these counts
- basic logic: you get a quantitative measure of the mixing of the scores
- table of the Mann-Whitney U-test: all the scores for the control group were above all the scores for the experimental group - that's why you have the frequency of 0 - then you add up the 5 0's
- from text: the Mann-Whitney U test is used when the dependent variable is scaled on at least an ordinal scale; it is a good alternative to the t test when the data do not meet its assumptions
- this is analogous to dealing 5 black cards in a row from a randomly shuffled deck of 10 cards, 5 of which are black & 5 red; the probability is about 0.004

Mann-Whitney U Test
- this is analogous to dealing 5 black cards in a row from a randomly shuffled deck of 10 cards, 5 of which are black & 5 red
- this gives us an extreme value of the Mann-Whitney U of 0
- p = (5/10) * (4/9) * (3/8) * (2/7) * (1/6) = 0.00396

Hypothetical Results of Bivalent Between Subjects Memory Experiment #2 - Mann-Whitney U-test
- Mann-Whitney U-test = 8
- what you want to find out is whether or not the results you got from an experiment could have resulted from chance
- the old way of doing things was to use statistical tables
- on the tables: we had two groups of 10 so our critical value was 24; in a test, just say we got a value of 33 when doing the test on data about memory - we added up the ranks and got 33
- the most unlikely score for the Mann-Whitney is 0 - no mixing of groups (one is all high, one is all low)
- we want to find out how unlikely a value of 33 is - thus we would reject the null

Review of when to accept and reject the null:
- Printouts give you the p value, telling you the exact probability of getting your results by chance - the probability that the null is true (if you had .34, for example, you would accept the null if your alpha was 0.05 - you are not willing to be wrong 34 times out of a hundred); support for the experimental hypothesis would be a value less than the alpha level
- the one-tailed probability will always be half of the two-tailed probability
- Critical value tables tell you the maximum value you can have to reject the null, or the minimum value you can have to reject the null - at that number or beyond you would reject the null (like what we did with the 33)

Between-subjects t-test
- when is it appropriate?
- Comparing means from two independent groups - “means” mean that you have interval level data - only appropriate for two group comparison - the groups are independent - assumes subjects were randomly sampled - population values are normally distributed- if your data is not normally distributedyou would run Mann-Whitney- the between-subjects t-test is a parametric version of the Mann-Whitney Research Methods II Notes why wouldn’t you do that anyway?- parametric (in general) are more powerful- will give you smaller values- less likely to result in type II error - parametric tests are your choice what does it take into account? - difference between the groups-how far apart the group means are and how variable they are within their own group - computation: - you must be able to understand this because the computer might prompt you or encourage you to do things wrong - it is really the “sample mean difference” divided by the “variability (standard error)” bigger number indicates larger difference between groups (taking into account variability and sample size) - as you increase difference in means- t will go up (the numerator getting larger) - as denominator goes down (less variance): t will go up - very distinct set of scores- have combination of big differences between groups, low variance within groups, high t- high values not typically occuring by chance (the t)this is unlike the Mann-Whitney because the Mann-Whitney has low numbers as less likely - - - 20 Between- subjects t-test: Calculating standard error - we are using sample to infer population values - the bigger the sample the better 1) pooled- assumes both sample are form populations with equal variance. This is a weighted average of the two sample’s variances- if you’re two samples have the same variance it would be ok to assume that the two populations they come from have the same variance- this is more powerful - use definitions from other sheets/ power points - pooled variance is really the sum of the standard deviations over the sum of the degrees of freedoms - note that it is a weighted average because of sample size (we will put more emphasis on bigger Research Methods II Notes 21 samples- you do this by multiplying them by the degrees of freedoms) if you have equal sample sizes it will just be the average of the two- but you must remember that bigger sample sizes will have bigger reliability 2) Unpooled: adds the error terms from the two samples. Use this formula when the homogeneity of the variance assumption is violated (when the samples have significantly different variances) - the real world variances are probably different (bigger probably) - we will add the error terms together from the two samples but we will only do this when the samples have significantly different variances - you are adding the sample variances rather than the pooled variances - Evaluating the significance of t - how big is big enough? - We know what t-distribution looks like - people have calculated the possible t values- you get a normal distribution - because we know what this looks like: we know for a given df we know how likely a t value is to have occurred by chance - big t’s are more unlikely - generate a large # of samples; see shape almost normal; can know how likely a specific t-value (or higher) would occur by chance Evaluating the significance of t What counts as chance? 
- probability of making a Type I error
- you determine it (tradition is .05)
- it is really just tradition - it is a reasonable balance between Type I and Type II error
- in stats class you use tables (look for critical values)
- printouts give the probability of getting a t-value as big as, or bigger than, the one observed by chance (i.e. if the null hypothesis is true)

Between-subjects t-test exercise
- independent samples means between-groups: independent people
- the test variable in SPSS is the dependent variable
- the Levene test says whether or not the variances are equal enough; the number must be bigger than alpha - you don't want this to be significant; if it is higher than .05 you can accept the null that the two variances are equal, then look at the appropriate line

Group Statistics
            GROUP                 N     Mean     Std. Deviation   Std. Error Mean
MEMSCOR     Experimental group    10    3.7000   2.4518           .7753
            Control group         10    7.2000   2.8597           .9043

Independent Samples Test (MEMSCOR)
Levene's Test for Equality of Variances: F = .901, Sig. = .355
t-test for Equality of Means:
  Equal variances assumed:      t = -2.938, df = 18,     Sig. (2-tailed) = .009, Mean Difference = -3.5000, Std. Error Difference = 1.1912, 95% CI of the Difference (Lower) = -6.0026
  Equal variances not assumed:  t = -2.938, df = 17.590, Sig. (2-tailed) = .009, Mean Difference = -3.5000, Std. Error Difference = 1.1912, 95% CI of the Difference (Lower) = -6.0067

Types of inferential statistics
- now we are on "experimental design: # of I.V.'s, # of levels of each I.V., how manipulated (within, between, or mixed), # of D.V.'s"

Within-subjects t-test
- also called the t-test for correlated samples, matched samples t-test, or paired t-test
- when is it appropriate?
- when comparing two means in a within-subjects experimental design (everyone gets both levels of the independent variable) or when you have matched-group designs (put them in groups based on pretest scores)
- if you put twins or siblings or roommates or littermates in different levels, use this
- with married couples or dating couples the unit might be the couple - split them up
- within-subjects designs - when is this type of design appropriate?
- mean difference (change scores) - variance computation: - average difference is on top - standard error of the average difference on bottom - how do you calculate standard error- remember that you don’t have the source of error variance that differences in people could have affected the differences in average scores (the problem that all the smartest people might have been in the same group)- but in this test you have the same people so that is eliminated calculating the denominator- the “standard error” of the difference you are looking at summing the differences, etc.- the sigma D (summing the differences of all the people don’t memorize but recognize computational formulae becomes… so…. - D = difference scores (husband/ wife differences and same person differences in both groups)- the Dbar is the mean difference scores- like ybar – ybar but since you don’t’ have two different groups you don’ t have two different distributions - Df’s adjust for the chance - df for between-subjects design would be n-1 and n-1 would be n2 but for within-subjects design is n-1 Another version of the formula: 23 Research Methods II Notes 24 My phenomenal E.S.P. abilities - - - when using spss for paired samples t-test - you don’t need the column telling you what groups (the 0 an 1) - every row is a new group (since every person is in each group) - we had a 0.0075- so, we had significance…we reject the null hypothesis he used a biased sample- just looked at the ones that had done poorly…not everyone which wasn’t random subject selection - you must have this as a criterion for doing these tests understand D- it is the difference scores (for husband, wife for examples) instead of group differences computation formula used to save steps and calculate by hand Sign Test - a very simple non-parametric test for within-subjects or situations when you have paired data - SPSS: help box: a nonparametric procedure used with two related samples to test the hypothesis that two variables have the same distribution. The differences between the variables for all cases are computed and classified as either positive, negative, or tied. If the two variables are similarly distributed, the numbers of positives and negative differences will not be significantly different. 
- you should get just as many positives as negatives if your test was bogus (esp for instance wouldn’t be any more likely to increase than decrease responses) - just say you had 4 +’s and 1 - the sign test tells you the probability of getting this by chance - you get a p of .188- it is greater than 0.05 because it is not smaller than alpha- fail to reject null- ACCEPT NULL - the old way used tables- you look at the # of nontied values you have- along the x is how many –‘s (violations of experimental hypothesis) and the y is how many +’s- you just go to the point on the table and find 0.188 - three values to keep in mind- if all five had remembered more items in control than shock group- the probability would have been 0.031- reject null - interesting to note that 5 out of 6 isn’t significant even though 5 out of 5 is - 7 out of 8 makes it- everyone has to be consistent with hypothesis until you have 8 participants (one can violate experimental hypothesis) - when you have 11 you can have two people violating - down the y axis is actually the N- not the number of positives - sign test can be used in the esp data - every sample should have a mean of about 5 - programs like weight watchers- use the best results on posters - you can use sign test with esp data- out of 7, 6 of them did score lower- do a sign test on that it is not significant- p should be 0.062 Sign-Test Results of Bivalent Within-Subjects E.S.P. Intervention Experiment How to use sign test on spss - it is a nonparametric test for “Related samples”- what are your paired scores- the before and after test scores Research Methods II Notes - - 25 we had the .125 because it was double .0625 (because on spss was a two-tailed test)the one tailed probability is always going to be half of the two-tailed test- so we reject the null- they didn’t do significantly worse after intervention we could also do a parametric paired t-test - you don’t need the levine for paired t-tests - your t ends up being 3.548 - p was 0.012- their scores did decrease significantly according to this test (we rejected the null) - you can lie with statistics- it was legit to use both of these teststhe sign test gave different result from the paired t-test - in general you want to use parametric tests: they are more powerful- they are more likely to lead you to the conclusion that will support your hypothesis - with the sign test: we are treating the difference between 7 and 6 the same as the difference between 7 and 3 (either have plus or minus- no more differences other than that) - the one plus was only a difference of 1 but we are treating it as the same as the minuses of 3 or 4! - advantage to use interval or ratio scales - when you take the magnitude of the changes into consideration you get more sensitive results - we never test for insignificant results- too easy to do that- sloppy results, not enough subjects, etc. 
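The sign-test probabilities quoted above all come from the binomial distribution with p = 0.5; a small sketch (standard library only) that reproduces them:

    from math import comb

    def sign_test_one_tailed(plusses, n):
        # P(this many or more '+' signs out of n non-tied pairs) if each sign is a 50:50 coin flip
        return sum(comb(n, k) for k in range(plusses, n + 1)) / 2 ** n

    print(sign_test_one_tailed(5, 5))   # 0.03125 -> 5 of 5 is significant
    print(sign_test_one_tailed(4, 5))   # 0.1875  -> the .188 above, not significant
    print(sign_test_one_tailed(6, 7))   # 0.0625  -> the .062 above, not significant
    print(sign_test_one_tailed(7, 8))   # 0.0352  -> 7 of 8 just makes it

Doubling any of these one-tailed values gives the two-tailed value that SPSS reports.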
Wilcoxon Signed ranks test- looks at ranked values- more powerful than sign, less powerful than paired t-test - if you think about the scale of measurement as influencing power - sign test is basically dealing with nominal data- lowest end of data - ranks- dealing with ordinal data- dealing with ranked differences- if you use Wilcoxon you get 0.031- significant- reject the null (people actually did decrease upon intervention Power and inferential statistics - all the factors that are involved in determining which inferential statistic is appropriate can also influence statistical power - remember that power is the probability of correctly rejecting the null (1- type II error rate) - this is why you always use the most powerful test you can - you want to use the t-test - if data is not normally distributed, you go to most powerful nonparametric test - power is ability to correctly support experimental hypothesis- it is 1- type II error rate - probability of correctly rejecting the null hypothesis when it is false - power is an ordinal scale- it can increase and decrease - factors that determine which test is appropriate - scale of measurement: as you go from nominal ratio you gain statistical power - shape of sample distributions: parametric tests more powerful Research Methods II Notes - 26 experimental design: within-subjects designs more powerful than between-subjects designs under certain circumstances Within subjects design - when is it advantageous? - when participants are scarce? (people with a rare disorder for example; like firefighters for instance) - you want to get a lot of information from them - when effect size is small- when the independent variable doesn’t effect the dependent variable very much; too few measures - when you are interested in effects over time - when error variance is due to subject-related factors is high: when a lot of variability in scores is due to difference in people (i.e. 
people differed in how shocked they were in experiment so you had a lot of variance in memory recall) - when you have a between-subjects design you control for these differences - another version of the formula: - key thing to this formula is the fact that there is a correlation coefficiant in this formula- 2r12 (correlation of the scores of the - as correlation gets more positive: you subtract it and denominator gets small- t goes up - if husband and wife scores not correlated at all- it will be no different from between- subjects design - if you have lot of differences- t will go up - if correlation is 0- the denominator is pretty much the same as it is in the betweengroups design - back in spss output: look at “paired samples correlations” box- it tells you the correlation between the two: shows you the extent to which these values differ from a between subjects design - big different between between and within- has the correlation - disadvantages: - increased fatigue, boredom - subject attrition- different levels might have to be tested on different days- they might not come back for the other days and you wasted time; subjects drop out - carryover effects: - Learning: if a subject learns how to perform a task in the first treatment, performance is likely to be better if the same or similar tasks are used in subsequent treatments.- now they have learned a little bit - Fatigue: if performance leads to fatigue, then performance in later treatments may deteriorate regardless of any effects of the independent variable- in all cases the influence on the dependent variable is not related to the independent variable (CONFOUNDS) - Habituation: under some conditions repeated exposure to a stimulus leads to reduced responsiveness to that stimulus. This reduction is termed habituation - it is when you get used to the stimulus- it doesn’t produce the same effects that it did in the beginning- to study perception - Sensitization: sometimes exposure to one stimulus can cause subjects to respond more strongly to another stimulus (e.g. potentiated startle) - classic example: potentiated startle: if you shock an animal before they are given another aversive sound stimulus they will be even more on edge Research Methods II Notes 27 - - Adaptation: if subjects go through a period of adaptation then earlier results may differ from later results because of the adaptive changes (e.g. adjusting to the dark) - example: adjusting to the dark: if you wait to give a carrot to people after a while in the dark you will be able to see better but only because in takes about 20 minutes to become adjusted; alcohol as an example and if the correlation between across treatments is low, these designs will be less powerful due to d.f Carryover Effects: Contrast - contrast: because of contrast, exposure to one condition may alter responses of subjects in other conditions (i.e. if you pay some people 10 first and some others 5 and then ask them to do a boring task again for 8- groups will respond differently) Applying What You’ve Learned: Contrast Effects Dear Mother and Dad: Since I left for college I have been remiss in writing and I am sorry for my thoughtlessness in not having written before. I will bring you up to date now, but before you read on, please sit down. You are not to read further unless you are siting down, okay? Well then, I am getting along pretty well now. 
The skull fracture and the concussion I got when I jumped out the window of my dormitory when it caught on fire shortly after my arrival here is pretty well healed now. I only spent two weeks in the hospital and now I can see almost normally and only get those sick headaches once a day. Fortunately, the fire at the dorm, and my jump, were witnessed by an attendant at the gas station nearby and he called the fire department and ambulance. He also visited me in the hospital and since I had nowhere to live because of the burntout dormitory, he was kind enough to invite me to share his apartment with him. It’s really a basement room, but it’s kind of cute. He is a very fine boy and we have fallen deeply in love and are planning to get married. We haven’t got the exact date yet, but it will be before my pregnancy begins to show. Yes, Mother and Dad, I am pregnant. I know how much you are looking forward to being grandparents and I know you will welcome the baby and give it the same love and devotion and tender care you gave me when I was a child. The reason for the delay in our marriage is that my boyfriend has a minor infection which prevents us from passing our pre-marital blood test and I carelessly caught it from him. I know you will welcome him into our family with open arms. He is kind and, although not well educated, he is ambitious. Although he is of a different race and religion then ours, I know your often well expressed tolerance will not permit you to be bothered by that. Now that I have brought you up to date, I want to tell you that there was no dormitory fire, I did not have a concussion or skull fracture, I was not in the hospital, I am not pregnant, I am not engaged, I am not infected, and there is no boyfriend. However I am getting a “D” in Statistics and an “F” in Chemistry and I want you to see those marks in their proper perspective. Your loving daughter, Sharon Research Methods II Notes 28 Carryover Effects can be serious - whenever you see a within-subjects design: question of possibility of order-effects! Dealing with them - counterbalancing: giving different groups of people the treatments in different orders - complete and partial - latin square designs - minimising carryover - problems of unequal carryover Example 1: Stress induced analgesia (insensitivity to pain) - requires 1) stress inducing procedure: ultra-light flying lessons 2) measure of pain sensitivity: finger vise; water as being hot; the toe in the bath test 3) willing participants: 10 people taking the lessons; 7 male and 3 female 4) ethical approval - how can you do this? - Between-subjects (5/group), gender?- you wouldn’t be able to counterbalance for gender and you would only have 5 people per group- Not good! 
- Within-subjects: increased statistical power (2 times as much data so increased statistical power); subject variance no longer contributing to between treatment differences- MORE POWERFUL STUDY Stress-induced analgesia - hypothetical data- 6 minutes for no-stress and 10 minutes for stress (no-stress first, stress next) - could be carryover affects- they could be more used to the thing on the finger (think they were wimpy in the beginning) - what you should do is to split subjects into groups and give different groups the treatments in different orders (half have one first, half have other first)……do counterbalancing because no one had the stress first- order is confounded Stress- induced analgesia - hypothetical data- now order is not a problem- if you have all possible orders it is complete counterbalancing - you have same data even with counterbalancing- you could conclude more about your test Another pattern of results: differences- opposite results - if you are in no stress first- 10 minutes in vice, second trial 6 minutes - stress first- 10 minutes first, 6 in second trial - stress vs. no stress doesn’t make any effect- what seems to make a difference is the order in which you get the trials - here, counterbalancing you get data suggesting that independent variable is irrelevant but that order is influential Complete Counterbalancing- you can ignore order or look at it as an independent variable - split subjects into groups and give the different groups the treatments in all possible orders - assuming the effects of order will be balanced equally across treatment conditions (levels of the IV) order effects will not confound your results Research Methods II Notes 29 - - - if there is some benefit of going first then that will be balanced across the two conditions can test this assumption by having order as an independent variable (becomes a mixed factorial design, with order manipulated between- and within-subjects) - you can analyze order as a manipulation and see if it actually does- have order as an independent variable- you would have 2 x 2- two levels of each independent variable- the point is that YOU CAN TEST THE ASSUMPTION if you have three treatment conditions- there are 6 possible orders (you will need a multiple of 6 subjects for complete counterbalancing)- if you have 10 subjects you wouldn’t have the same number of people in each group- you either ignore some or get more - when you keep getting more treatment conditions, this is not always good because you probably didn’t have that many to begin with of you can do subset of possible orders (partial counterbalancing)- only use some of the possibilities, but which ones? Example 2: - you only have five participants - I will give them all the drug twice and the placebo twice- I will get 4 bits of data from each of my participants - What order do you give them? 
- Assume the following order effects: suppose you knew that if you gave people the finger vise, they would keep it on longer and longer (with less anxiety, getting used to it)
- If you knew this, you could create treatment orders that balance that out- if you were from a drug company you would want to use the placebo first and then the drug (subjects would get more used to the vise as the trials went on, so the drug would look better than the placebo just because of order effects)
- you could give the drug first and it would show no effect
- if you put placebo, drug, placebo, drug: you could create two groups- PDDP and DPPD- but the problem is that we usually don't even know what the order effects are
- would these be the best orders to use?
Order to use? What sequence of drug and placebo trials to use? Could do half the subjects in each of:
Group 1: P D D P
Group 2: D P P D
This distributes the (assumed) order effects equally to the two conditions. Problem: we don't usually know for certain whether there are order effects, or the exact nature (function) that describes them.
Partial counterbalancing
- with only some of the possible orders, there is no guarantee that the order effects are distributed equally to the two conditions
- partial counterbalancing is risky unless you know the nature of the order effects
- better to randomize treatment order across subjects (this works especially if you have big numbers- it doesn't work with the small numbers typical of within-subjects designs)
- create all possible orders and randomly use some of them: you assume that order will be taken care of
- this also assumes that any order effects are equally distributed across treatment conditions (so order is not confounding)
- with partial counterbalancing you can no longer analyze order as an independent variable- that is the disadvantage
Latin Square
- one restriction on using it: the number of treatment orders has to equal the number of treatments you have (i.e. with five different doses of a drug, you have five different orders in which subjects can receive the doses)
- this would work well if you had 10 subjects- if you had 12 you would either not use two of them or try to find three more
- so, if you choose to make the number of treatment orders in a partially counterbalanced design equal to the number of treatments, you can use the Latin square design
- a Latin square is a k x k table of items in which each item appears exactly once in each column and row
- use it when you have four or more treatments
Creating a 4 x 4 Latin Square- steps (put in order of slides; a code sketch of these steps follows after this block)
- each treatment is in each group ONCE (A)
- then shuffle the rows: randomly order the rows, e.g. by drawing the numbers 1-4 out of a hat (or use a die and ignore 5 and 6) (B)
- as an illustration, suppose the numbers were drawn in the order 2, 3, 1, 4: row 2 becomes 1, 3 becomes 2, 1 becomes 3, and 4 stays 4
- then shuffle the columns: randomly order the columns, e.g. by drawing the numbers i-iv out of a hat- as an illustration, suppose the numbers were drawn in the order ii, i, iv, iii: column ii becomes column i, column i becomes column ii, etc. (C)
- after both columns and rows are randomized, randomly assign treatment conditions to your letter codes
- plug them into the square- every treatment will appear only once in each position; you can assume that order effects will be balanced across treatments
Creating a Latin Square
After you have randomly ordered columns and rows, randomly assign treatment conditions to your letter codes: e.g., treatment level 1 is assigned to C, treatment level 2 is assigned to A, treatment level 3 is assigned to B, treatment level 4 is assigned to D
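Here is a minimal sketch of the shuffle-rows-then-shuffle-columns procedure just described, using only Python's standard library; the function name and the dose labels are illustrative, not from the course materials.

```python
# Minimal sketch (hypothetical helper): build a cyclic k x k square, shuffle its
# rows, shuffle its columns, then randomly assign real treatments to the letters.
import random
import string

def latin_square(k, treatments, seed=None):
    rng = random.Random(seed)
    letters = list(string.ascii_uppercase[:k])          # A, B, C, D, ...
    # (A) cyclic square: each letter appears exactly once per row and per column
    square = [[letters[(row + col) % k] for col in range(k)] for row in range(k)]
    rng.shuffle(square)                                  # (B) shuffle the rows
    col_order = list(range(k))
    rng.shuffle(col_order)                               # (C) shuffle the columns
    square = [[row[c] for c in col_order] for row in square]
    # randomly assign treatment conditions to the letter codes
    assigned = dict(zip(letters, rng.sample(treatments, k)))
    return [[assigned[cell] for cell in row] for row in square]

# Usage: four treatment orders for four drug doses; each row is the order given
# to one group of subjects, and each dose appears once in every position.
for order in latin_square(4, ["0 mg", "5 mg", "10 mg", "20 mg"], seed=1):
    print(order)
```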
Minimizing Carryover
- you should try to minimize carryover effects
- doing so increases the power of your design by decreasing error variance (making it more likely you can reject the null hypothesis)
- it is not always possible to get rid of carryover effects- practice sessions or breaks can reduce some of them (e.g. learning, and fatigue or habituation, respectively)
- a learning curve goes up and up and then levels off- if you gave everyone practice trials you would get up to the level part (learning will not be a problem after the second trial)- if you are dealing with fatigue or habituation you want to give people a break (if they are doing physical exercise, for example, give them a break between bouts)
- if you can use complete counterbalancing, you can make treatment order an independent variable- this allows you to MEASURE order effects (but only with complete counterbalancing or a Latin square)
Counterbalancing
- if the carryover induced by the different orders is approximately equal, counterbalancing can control carryover
- However: if carryover is different for different orders of treatment, then counterbalancing may not work (if, for instance, A coming after B has a different effect than C coming after B)
- also, some treatments are irreversible- if you teach people how to use a mnemonic device, they will always have the training- you can't go back
- it would be hard to use counterbalancing to see the effects on personality of irreversible interventions like psychosurgery (frontal lobotomy); the logic of experimental designs is relatively new
Inferential Statistics
- remember: the key to all inferential statistics is the RATIO between the between-group (between-sample) variance (the difference in group means) and the within-group variance, or, the extent to which the two sample distributions overlap
- another issue related to power, where you look at overlap, is one-tailed vs. two-tailed tests
- if you have a non-directional hypothesis- "the IV will have an influence on the DV"- it will change it in some way; the null that you are testing against is that there is no change- a two-tailed test MUST BE USED
- a two-tailed test divides alpha in half, placing half in each tail. The null hypothesis in this case is no difference, and there are two alternative hypotheses, one positive and one negative.
The critical value of t, tcrit, is written with both a plus and a minus sign - for example, the critical value of t when there are 10 degrees of freedom and alpha is set to .05 is tcrit = +-2.228 - see “non-directional hypothesis graph - the sampling distribution model used in a two-tailed t-test is illustrated below: - if you have a directional hypothesis: specifies a priori (specifies before you collect data) the direction of change caused by the IV, then the null to be tested is not that there is no difference but that there is not an increase/decrease - a one-tailed test may be used- will be more powerful - there are really two different one-tailed t-test, one for each tail. In a one-tailed t-test, all the area associated with alpha is placed in either one tail or the other. Selection of the tail depends upon which direction tobs would be plus or minus if the results of the experiment came out as expected - the selection of the tail MUST be made before the experiment is conducted and analyzed and the directional hypothesis must be well justified - a one-tailed t-test in the + direction is illustrated below: you don’t need as big a t to reject the null if you have a directional hypothesis - Multivalent Designs - why use designs with more than two levels of the IV? a) when you need more than one control group (multiple control conditions)- a control group that is really getting nothing and then a placebo group for example looking at drugs - i.e. when you consume alcohol, more likely to reduce inhibitions- rate movies as erotics, but they also had a control group- rated the movies the same as the people who consumed alcohol Research Methods II Notes 33 - without the placebo group you could make different conclusions- it is a social thing (you are allowed to let your guard down?) b) you might be interested in more than two qualitatively different groups of people- like homosexuals, heterosexuals and bisexuals for example c) when you want to map out the function of the effect of the IV - 1986: show graph on the blood sugar levels of rats and learning - with increased sugar you should increase learning for mazes - NO BIVALENT STUDY CAN SHOW NON-LINEAR FUNCTIONS - if you gave the rats too much glucose they actually became dumber - it is the same graph for the effects of alcohol - depending on the points he picked to use as two points- there is not a single bivalent study that would tell you the true picture of the effects of glucose on rats (he would always get a wrong straight line) - how would you analyze it?- you could use multiple t-tests - what would be the problem of conducting multiple t-tests? - Type I error: if you are going to be wrong 1 in 20 and are making 20 comparisons- you will make at least one mistake - “probability pyramiding”: as you do more comparisons, the probability that you will have a mistake somewhere is greater Analysis of Variance: ANOVA - is a way of analyzing the total variability amoung dependent variable scores and dividing or “partitioning” this total variability amoung the factors assumed to cause it - tries to account for the fact that everyone will not remember the same amount of numbers for instance - tries to figure out how much of variability is due to different “sources of variance” (factors assumed to cause the variability) - e.g we have three groups of emotional shock videos- mild, strong, none - the “total variability” is the complete range of scores you found (even though you have three graphs- the lowest of the low left graph vs. 
the highest of the graph furthest to the right)
- things contributing to sources of variance: 1) subject variables 2) experimental error (any variability you can't account for)- measurement or recording error; even if we don't make mistakes, our scales of measurement might be too crude 3) the level of the IV
- you have to find out how much variability is accounted for by each of these factors
- ANOVA calculates an F-ratio, which is the ratio of between-group variability to within-group variability
- within-group variability is due to differences among subjects and/or experimental error
- everyone in the same group got the same level of the IV- but the other two sources have an impact here- subject factors and error contribute to variability within each level
- between-group variability can also be due to differences among subjects (in a between-subjects design) and/or experimental error, plus variability caused by treatment effects (the IV)
- when you work out F, the numerator will be the between-group variability (subject variance, error, and IV effects) and the denominator will be the within-group variability (subject variance and error)- the subject variance and error effectively cancel out, so you are left with F reflecting the IV effect- the treatment level can only influence the numerator
- use ANOVA when you are comparing three or more means
- a significant F-ratio tells you that at least some of the differences were unlikely to have occurred by chance- among these groups there is at least one difference that is unlikely to have occurred by chance
- it considers between-groups variation (deviations of the group means from the grand mean) and within-groups variation (deviations of each score from its group mean- the estimate of error variance), and these variances are measured by mean squares
Assumptions of ANOVA
- it is a parametric test; its assumptions:
1) the population is assumed to be normally distributed (you don't know the true population, but you should examine your samples; the test is fairly robust, but you can do data correction or non-parametric analyses)- you can have non-normally distributed data and still use it- it is robust- if you had a bimodal distribution you can't use it, however
2) homogeneity of variance- robust as long as the samples are about equal in size (largest no more than 1.5x the smallest); when samples are unequal, it can be sensitive to this- if you don't have equal samples, even a small difference in variance can throw your results off- you want the same number of people in each of your groups
3) independence of error terms- your measurements are independent- it assumes that all measurements are independent; random assignment can start you off that way; watch for differential "subject mortality"- it assumes that a measurement you take from a group will not be related to another measurement you take from the same group- the problem with "subject mortality" (subjects dropping out of the study) is that if you start out with 20 people in 3 groups and 10 people drop out of one of the groups (maybe the no-alcohol group), the people that are left might not be a representative sample
Evaluating the Significance of F
- How big is big enough? We know what the F-distribution looks like- generate a large number of F-values, see the shape, and you can know how likely a specific F-value (or higher) would be to occur by chance (see the sketch below)
- What counts as chance?- the probability of making a Type I error- you determine it (tradition is .05)
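A minimal sketch of the "generate a large number of F-values" idea, assuming NumPy and SciPy (the group sizes and the observed F are made up): simulate many one-way ANOVAs on data where the null hypothesis is true, then ask how often an F at least as large as the observed one turns up by chance alone.

```python
# Minimal sketch (assumed setup, not class data): simulate the F distribution
# under a true null, then compare the simulated tail area to the theoretical one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, n = 3, 10                      # 3 groups, 10 subjects per group
f_values = []
for _ in range(10_000):
    groups = [rng.normal(0, 1, n) for _ in range(k)]   # null: identical populations
    f_values.append(stats.f_oneway(*groups).statistic)
f_values = np.array(f_values)

f_obs = 4.0                       # a hypothetical observed F-ratio
sim_p = np.mean(f_values >= f_obs)               # proportion of chance F's this big
exact_p = stats.f.sf(f_obs, k - 1, k * (n - 1))  # theoretical tail area, df = 2, 27

print(f"simulated p = {sim_p:.3f}, theoretical p = {exact_p:.3f}")
```

The simulated tail proportion should land close to the theoretical value from the F distribution with 2 and 27 degrees of freedom.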
- critical values are based on the df for the numerator and denominator and the alpha level
- print-outs give the probability of getting an F-value as big as, or bigger than, the value observed by chance
ANOVA
- protects against probability pyramiding
- it asks whether, somewhere among the different treatment conditions, there are differences that were unlikely to occur by chance- you don't know where the differences are
- a significant F-ratio tells you that at least some of the differences among the means were unlikely to have occurred by chance and you should reject Ho
- you therefore have tentative support for your experimental hypothesis that some differences were caused by the IV
Interpreting ANOVAs
- how do you determine where the significant differences lie after getting a significant F-ratio?
- you don't know which groups are significantly different from each other- only after you have the F-ratio can you go in and compare the different groups- you have either planned or unplanned comparisons (i.e. a priori vs. post hoc)
- with planned comparisons: specific a priori hypotheses (you should have a minimum of two planned comparisons)
- you can go in and do t-tests only after you have a significant ANOVA
- use separate (1 degree of freedom) F-ratios or t-tests for the pairs of means you have hypothesized to differ
- the number that conveys new information (called orthogonal comparisons) will be equal to the number of groups (means) minus 1
- with unplanned (post hoc) comparisons (fishing expeditions):
- you have to consider two types of error: the error rate per comparison and the familywise error rate (which takes probability pyramiding into consideration)
- see table 12-4 in the book: some tests are more conservative than others
- know the formula for calculating familywise error rates- it tells you what the error rate will be for all the t-tests put together- it is roughly the per-comparison alpha levels added up (four times if you are making 4 comparisons)- the probability that somewhere in the four comparisons I will make a mistake is greater than the probability of making a mistake in any one by itself
Calculating Familywise Error Rate
FW = 1 - (1 - alpha)^c
where c = the number of comparisons and alpha = the per-comparison error rate
(e.g., if making 4 comparisons and alpha is set at .05: FW = 1 - (1 - .05)^4 = 1 - .95^4 = .185)
- summarize Stanovich's arguments in …list the 7 arguments…etc.; know the material from chapter 8 even though it was not covered in class
First Midterm: Material to be Covered From the Text
Note: not all material is from the text
• Chapter 8 pp. 216-233; 240-245
• Chapter 9 pp. 246-268; 278-280
• Chapter 12 pp. 361-380 (stop at Two-factor ANOVA); 388-393
First Midterm: Lectures based on Stanovich
- How to Think Straight About Psychology
- using undergraduate psychology students
- human difficulties with probabilistic reasoning (seven listed)
- the more general innumeracy problem: e.g., difficulties with large numbers
- characteristics of psychological science
First Midterm: General Topic Headings
- review stuff: advantages of experimental research, between-subjects designs
- characteristics of experiments
- terminology (IV, DV, control, extraneous, and confounding variables)
- importance of operational definitions
- sources of distortion
- principles of hypothesis testing: Type I and Type II errors
First Midterm: Inferential Statistics
- parametric and non-parametric tests
- logic: how they work, etc.
factors determining which one is most appropriate t-tests, both between and within Mann-Whitney U-test, Sign-test one-tail vs. two-tail tests First Midterm: Within Subject Designs advantages/disadvantages carryover effects- potential causes counterbalancing - complete, partial , Latin squares First Midterm: Multilevel Designs Why do them? How to analyze them? probability pyramiding ANOVA interpreting significant F-ratios, a priori and post hoc comparisons error rates per comparison and familywise Research Methods II Notes 37 Class 10 - more complicated experimental designs: Factorial designs web page creation Factorial designs - where more than one independent variable influences your dependent variable - if you want to assess the effects of several independent variables on a given dependent variable, one solution would be to conduct a separate experiment for each independent variable of interest- but you gain more info using a factorial design - factorial experiment from the textbook: Glass, Singer, Friedman- measuring sound intensity and predictability on number of tries to solve an insoluble puzzle (measuring irritation) - in a factorial design there is a separate group for each possible combination of the levels of the independent variable - when manipulating two independent variables you get four groups (2 X 2)- wanted to vary the two independent variables in such a way as to identify the separate effects of each variable on tolerance for frustration- wanted to avoid confounding the two variables - once you have your four groups, participants are randomly assigned to the different groups- not confounded because you can statistically separate the effects of the independent variables - experiment on sensory modality effects on memory 20 subjects, 10 randomly assigned to each group Auditory Group - 10 min travel film - no emotional shock - written instructions, 1 o’clock group testing, etc. - IV: Auditory Price Information - Measure memory- D.V. Visual Group - 10 min travel film - no emotional shock - written instructions, 1 o’clock group testing, etc. - IV: visual price information - Measure memory- D.V. 
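A minimal sketch of this bivalent setup, assuming NumPy and SciPy are available (the recall scores are invented for illustration): randomly assign 20 subjects to the two modality groups, then compare the groups with an independent-samples (between-subjects) t-test.

```python
# Minimal sketch (hypothetical scores): random assignment of 20 subjects to the
# auditory or visual group, then a between-subjects comparison of recall.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
subjects = np.arange(1, 21)
assignment = rng.permutation(subjects)
auditory_ids, visual_ids = assignment[:10], assignment[10:]

# made-up numbers of price items recalled by each group
auditory_recall = np.array([5, 4, 6, 5, 5, 4, 6, 5, 4, 6])
visual_recall   = np.array([7, 6, 8, 7, 7, 6, 8, 7, 6, 8])

t, p = stats.ttest_ind(auditory_recall, visual_recall)
print(f"auditory ids: {sorted(auditory_ids)}")
print(f"visual ids:   {sorted(visual_ids)}")
print(f"t = {t:.2f}, p = {p:.4f}")
```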
You want to statistically compare differences in measured memory- answers the question of whether auditory or visual information is remembered better - if we were to do the hypothetical data- we might get data like the following graph: Research Methods II Notes 38 Hypothetical Data - notes on drawing graphs: - look at differences between group means and variability within the groups (those two bits of info are important- one is nominator and denominator of t-test) - you present the means in the graph (graphic representation of the means)- this is not very meaningfulwe don’t have any information about the denominator (no info about spread or variance) - so the above graph is not that helpful - SPSS does not have good graphs- so we are only presenting the mean values (recognise that they are not good figures but we are using them anyway) - for our purposes: to simplify: we will treat the differences that we plot as being real (significant differences) if they are numerically different (assume that the variances are not coming into play) - so for the above graph: 5 items on average for auditory and 7 items on average for visual (we will assume this is significant)- if you ran this bivalent study Experiment on emotional shock effects on memory: Hypothetical Data Experiment on Emotional Shock Effects on Memory 20 subjects, 10 randomly assigned to each group Shock Group • • • • 10 min travel film Auditory Price Information written instructions, 1 o’clock goup testing, etc. IV: Emotional Shock Measure memory - D.V. No Shock Group • • • 10 min travel film Auditory Price Information written instructions, 1 o’clock goup testing, etc. • Measure memory - D.V. - we are going to measure memory again- compare the difference statistically will answer question of whether emotional shock effects memory IV: No Emotional Shock Research Methods II Notes - 39 we have done the second bivalent experiment- does emotional shock disrupt memory for info presented orally? Hypothetical Data here we have results that would lead you to reject the null hypothesis (only 3 items with shock, 5 items with no shock) - we have answered two questions at this point- I wonder if that is the same effect if you presented the information visually Advantages of the Factorial design - might wonder if the effect of emotional shock on memory is equivalent for information presented aurally or visually - you could do another bivalent study - with one factorial design you can answer all three questions- this is why people use them - answers three questions: - is auditory or visual information remembered better? - Does emotional shock disrupt memory? - Does the effect of emotional shock depend upon whether the information was received via auditory or visual channel? (looks at the interaction of the two variables) - Or does the effects of modality of presentation depend upon the presence of emotional shock? 
- this third question addressees the generalizability of the main effects (you are wondering whether the results of the first group are generalizable to the second group - with one factorial design you can answer all three questions effectively and efficiently - in psychology, we often have informal hypothesis (we think emotional shock disrupts memory as opposed to formal- anything expressed mathematically) in part because there are more than one causal factor determining behaviour (this is why we couldn’t really have a formal hypothesis- factorial design’s necessary here - when a combination of IV’s acting together simultaneously determines the outcome (DV), a factorial design is uniquely suited - more complex (and accurate) causal explanations can be tested with these designs- in psychology, the more complicated explanations often work better Designing a Factorial experiment - three steps 1) identify each hypothesized causal factor (independent variable) of interest (both shock and modality of presentation as examples) 2) decide how many levels of each factor you want (simplest case being two levels of each)- i.e. how many levels of shock (no shock, low shock, high shock) 3) determine all possible combinations (create a matrix of treatment combinations). Designing a Factorial experiment- tables Research Methods II Notes - 40 if you have simplest possible casetwo causal factors at each level we have created four groups- each with one level of each independent variable - this one with words: 2 x 2 factorial experiment - still only have 40 subjects, 10 randomly assigned to each of 4 groups - we can then compare the means from the 4 treatment combinations Factorial designs - it may seem in some ways like this experiment is confounded (because altering two things at once), but this is still an analytic experiment, the “gold standard” of science - ideally we still have random assignment of subjects to treatment groups (any differences won’t be due to subject factors) - good things: 1.) random selection of subjects from defined population 2.) random assignment of subjects to treatment groups 3.) concurrent control of extraneous variables and contrast (statistical comparison) of the measured DV Factorial Designs - all conditions are still being held constant except that in the factorial designed it is the combination of treatments that are manipulated - they may seem confounded but we can mathematically look at their effects - because all possible combinations of levels of the IVs are represented, we can statistically separate their effects- this is done by ANOVA. 
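A minimal sketch of that statistical separation, assuming pandas and statsmodels are installed (the recall numbers and factor labels are invented): fit a 2 x 2 between-subjects model and read off an F and p for each main effect and for the interaction.

```python
# Minimal sketch (hypothetical data): a 2 x 2 between-subjects factorial ANOVA
# separating the modality and shock main effects and their interaction.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "modality": ["auditory"] * 8 + ["visual"] * 8,
    "shock":    (["no"] * 4 + ["yes"] * 4) * 2,
    "recall":   [5, 6, 5, 4, 3, 4, 3, 2,     # auditory: no shock, then shock
                 7, 8, 7, 6, 7, 6, 7, 8],    # visual:   no shock, then shock
})

model = smf.ols("recall ~ C(modality) * C(shock)", data=df).fit()
print(anova_lm(model, typ=2))   # F and p for each main effect and the interaction
```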
- Analysis of Variance: ANOVA ANOVA is a way of analyzing the total variability amoung dependent scores and dividing or partitioning this total variability amoung the factors assumed to have caused it- called sources of variance Sources of variance 1) subject variables 2) experimental error (measurement or recording error) 3) combination of IV treatments – do this in a two-way anova Research Methods II Notes 41 2- Way ANOVA - you will get an F-ratio for each of the three effects and also an F-value for each interaction with each other - for main effects, F is the ratio of between-group variability to withingroup variability - Within group variability is due to: differences between the subjects and experimenter error - Between group variability can also be due to: differences amount subjects, experimental error, and/or variability caused by the combined treatment effects of the IVs - Evaluate the significance of F print-outs: give probabilities of getting an F-value as big, or bigger, than the observed by chance for each main effect and interaction possible in the factorial design for a 2-way ANOVA (2 factor design), probabilities are given for both main effects and the interaction Assumptions of ANOVA - remember that ANOVA is a parametric test: as such it assumes: - normally distributed populations - homogeneity of variance (same #’s of people in each group) - independence of error terms Terminology of Factorial Designs - the term “factor” is used interchangeably with “independent variable” - a number is given for each IV (factor) - for each IV you will have a number- it will be number by number by number, etc. factorial designs (depending on how many numbers you have) - the exact # tells you have many levels of the IV you have (i.e. 2 x 2- two IV’s (there are 2 levels- 2 “twos” and each has two levels)) - the number used refers to the number of levels that factor has Modality of information ex. - design? 
2 X 2 factorial the two digits indicate that there are two Ivs that both digits are “2”’s indicate each IV has two levels Terminology of Factorial Designs - in addition, how the variables are manipulated can affect the deign- within subjects, between subjects or both (called a mixed design)- one may be manipulated within subjects and the other manipulated between Research Methods II Notes 42 Types of inferential statistics - there are many- each intended for and appropriate under specific conditions like shape, scale, experimental design of study - remember that the experimental design was important- # of IV, # of levels of IV’s, how manipulated- this is same as above Interpreting Factorial data: from tables Interpreting Factorial data: from tables (again) - - - letters A, B, C, D to denote the mean scores for the different groups when looking at main effects- you just look at one independent variable at a time (you ignore independent variable 2) – combine the scores of all the groups (collapse across IV1) to see if the first IV had a MAIN effect main effects: the separate effects of each independent variable how to calculate main effects of IV’s: a) average group means in first column and write result under the first column b) do the same for the group means in the second columnyou now have two “column means” c) average the group means across the first row and write the result to the right of the first row d) do the same for the second rownow you have two “row means” e) compare column meansrepresent the main effect for one IV averaged over the effects of the other IV f) do the same for the row means- represent the main effects for the other IV averaged over the first IV g) reliable difference in means indicates an effect of the IV independent of the other IV next table looking at all the different main effects- if you don’t have equal # of people, you get weighted averages Research Methods II Notes - - 43 if you compare the (A+C)/2 and the (B+D)/2 different averages then you can determine the main effect of IV #1 if you compare the other two you can determine the main effect of IV #2 comparing values to determine the main effects of Ivs 1 and 2 requires statistical analyses to determine if the observed differences are greater than would be expected by chance for a given, a priori research determined, alpha error rate- this is done using ANOVA you require stats! Interpreting factorial data: from tables - you must also interpret interactions - an interaction is present when the effect of one independent variable changes across the levels of another independent variable - i.e. Glass et. 
al: found that the # of attempts to solve the insoluble puzzle was greater when the noise was soft than when it was loud but this relationship held only when the noise was unpredictable (when noise was predictable the # of attempts was about 26 regardless of noise intensity) - parallel lines will not have any interaction - look at the difference between A and B (A-B) and C and D (C-D) - if there is a difference in the difference you have a significant interaction - if you compare the two differences, and there is a difference- you have an interaction effect between Ivs 1&2 - you can also have A-C and B-D to compare these values to determine interaction- look for a difference in these differences – you can do it the other way (C-A, etc.)- just be consistent - comparing any of the differences in same coloured values determines the interactive effect of IV’s 1&2 - see table with colours- slide 39 - comparing the values to determine the interaction between IV’s 1 and 2 also requires statistical analyses to determine if the observed differences in differences are greater than would be expected by chance… Some (many) examples Interpreting factorial data: from tables (slide 42) Interpreting factorial data: from figures - good idea to graph results (plot results) - to answer question about main effects- do the math - in one low block compare the two blocks (make an average)- in this table they are the same- we do not have a main effect Research Methods II Notes - 44 there is a main effect for the manipulation of the second IV- the red bars are higher than the blue bars Research Methods II Notes 45 Hypothetical Data: Stress-induced analgesia: Revisited - - using order as another independent variable (other than stress vs. no stress) mixed design- order is between, stress is within Complete Counterbalancing - split subjects into groups and give the different groups and give the treatments in all possible orders - assuming the effects or order will be balanced equally across treatment conditions (levels of the IV), order effects will not confound - your results you can test this assumption by having the order in which treatments are received an independent variable (becomes a mixed factorial design with order manipulated between and treatment within subjects) we have three questions- main effect for stress? Main effect for order? Interaction? 
- Is there a main effect for stress?yes, when tested under stressful conditions, subjects showed …blah, blah, blah - No main effect for order- the average for both stress and no stress scores is the same for both no-stress first and stress first - Is there an interaction?- does the effect of stress- is it the same whether they receive stress or no stress first- there is no difference in differences is 0 Another example- slide 54 - no main effect for stressthere is an effect for order no interaction Another example (slide 56) no effect for stress- the average between the blue and blue and red and red is the same - no effect for order Research Methods II Notes - interaction effect- does the effect of stress depend upon the order that they receive the treatment- it will be 2 and –2 between the two no-stress first and stress first and also 2 and –2 between the two red bars and blue bars (one will be + and one will be - either one shows the interaction effect Interpreting Factorial data: from figures (slide 58-64) line graphs are not as good to look at - - 46 no effect for IV1 to find average for either blue or redit will be the middle of the line- they do not differ so there is not an effect for IV 2- there is an effect no interaction effect- if you see a parallel line there will be no interaction- line data help determine whether things are parallel Interpreting Data from tables- slide 64 Hypothetical data (slide 65) Research Methods II Notes - - 47 is there a main effect for modality?- yes, average for top red line is 7 and average for bottom is 4- yes yes, there is a main effect for shock- the difference between 7-5 (6) is different from 7-3 (5) yes there is an interaction because the difference between 6 and 5 is different from the difference between 7 and 4- no! there would be an interaction between them because of looking at the four raw data points- not the averages- #’s used in determining the main effects are not used to determine the interaction effects for this you would look at the difference between 7 and 7 and between 5 and 3 Hypothetical data (slide 66) - there is a main effect for auditory vs. visual (difference of two) - main effect for emotional shock - no interaction- lines are parallel- people will always remember more in the visual group regardless of whether they are shocked or not (no interaction Hypothetical data *67* - main effect for modality no main effect for emotional shock there is an interaction Problems with “higher order factorial designs” 1) not enough subjects per group: you should have at least 5 subjects per group for a reasonable ability to detect the effects of the independent variable 2) numbers and complexity of the resulting interactions- if you have a 2 x 2 x 2 interaction for example you would get three main effects, three 2 way interactions and a three-way interaction….could be a headache after a while with all the numbers February 9, 1999 - figure out how to paste the figures into the notes Research Methods II Notes 48 Main effects & interactions are independent - are the three questions that you can answer independent?- yes! 
- the three questions that can be answered with a two-factor design are independent - this means that there are 8 (2X2X2) possible outcomes from the data analysis - main effect for IV1: yes or no - main effect for IV2: yes or no - interaction: yes or no Interpreting factorial data: from tables - this table shows main effect for IV1 and no main effect for IV2 and no interaction there is a main effect for IV1, no interaction, no main effect for IV2 Interpreting factorial data: from figures no main effect for IV2, no interaction (lines are parallel), - no main effect for IV1, no main effect for IV2, interaction - when you are looking at the interaction, it is like saying, if you were to do two different - bivalent data would you data be different depending on what level of one IV you usedlike, would you get different data for IV2 depending on whether you used high or low values for IV1 intersecting lines Crossover interaction - in a crossover interaction occurs when the first IV has one effect at one level of a second Research Methods II Notes - 49 IV but it has the opposite effect at the other level of the second IV see figure with the crossover lines Glass, Singer, & Friedman, 1969 (from text) - interested if effects of exposure to irritating noise on several behavioural measures of tolerance for frustration - operational definition of D.V.: time spent trying to solve an unsolvable problem (deception)- how long a person will work at it - factors (Ivs): - noise predictability (2 levels) - noise intensity (2 levels) - this would be a 2x2 factorial design Glass et al. 1969 - if the samples are equal, they just added the two numbers together (sums instead of averagings)- you can only use this method if equal numbers of people there is a big difference for one and not so much for the other (big for predictability and not so big for intensity) Glass et al. data if you were to plot it- line graph - use line graph because softloud is continuous significant interaction- lines are not parallel worst possible scenario is a roommate who plays very loud noise randomly Research Methods II Notes - 50 throughout the day (unpredictable and loud) predictability is a key aspect of control and this is related to psychological factors Tomilson, Hicks & Pellegrini (1978) - looked at whether pupil size served a role in communication - attributions of female college students to variations in pupil size - there were problems in the literature about this topic: inconsistent findings re. Effects of pupil size in non-verbal communication - Hess talked about innate releaser- things that trigger behaviours that they can’t help - do humans do the same thing with pupil size? - Janisse- thought pupil size was a trivial variable - studies often come about because the literature is confused - all the studies had used bivalent studies- two types of pupil size - “digitally retouching” to examine this in a multivalent experiment Capilano Canyon: North Vancouver - it is a suspension bridge with a cable- it swings and sways and 30 feet down to rapids - low light is romantic- why? - your eyes dilate under low light - pupils dilate when excited- your pupils will be dilated on the bridge - the people in the study were approached by the male or female and then later asked by a confederate to rate how attractive they thought the person was Digital Manipulation- they manipulated a picture for an add, etc. that didn’t exist because they had manipulated the picture - OJ Newsweek picture Tomlinson et. 
al - limited previous literature also suggest women rate males and females differently re. Pupil size (i.e., there is an interaction)- women rate a women with big pupils as less attractive and males with big pupils as more attractive - subjects: 246 female undergraduates in courses “primarily of interests to women” - method: 10 photos of males and females of “moderate attractiveness” - 35 judges rate these (7 pt scale): picked 1 male and 1 female closest to median with least variance - retouched the photos- made the pupils either smaller or larger - design? - 2 x 3 within subjects factorial design - 2 is the two sexes - 3 different pupil sizes - so you have one group- because it is within groups - IVs? - sex of model - pupil size - dealing with carryover? - presentation order randomized Data - this is a good way to show data because it shows variance as well as just the means Research Methods II Notes - all three differences have to be the same or you might have an interaction Tomlinson et. al - line figure - - - - - you have to look at both mean differences and variance you can add the variance terms in excel- indicates the standard deviation value (gives you a better idea about how the distributions are distinct from each other) you tend to just show the error bars on one side so it’s not as cluttered- you know the error bars are symmetrical there is a main effect of gender there is an interaction it is hard to tell for the other main effectprobably is when you have 246 subjects- gender effect and pupil size and interaction all significant at the .001 level rating of attractiveness depends upon the sex of the person you are rating and how big the pupil size is maybe this says that large is TOO large for males The same graph as above worked out for main effects, etc. Tutorial interpreting factorial data 51 Research Methods II Notes - 52 Fazio & Backler: topics in research methods: main effects and interactions’ This program provides an excellent opportunity to practice interpreting main effects and interpretations I highly recommend everyone do this Advantages of the Factorial design - when a combination of IV’s acting together simultaneously determines the outcome (DV), a factorial design is uniquely suited - more complex and accurate causal explanations can be tested - Mischel- history of personality research has increasingly become the study of higherorder interactions - E.g. 
delay of gratification: depends upon, age x gender, object x consequences x models x prior experience x … - ask a 3-year old if they want three smarties now or 10 smarties when they wait - when they get to be 4: who’s smartest person in preschool- they will point to boy- and say that they will do the smart thing but then they will not choose the smart thing TutorialAt dos prompt- type in “effects” You will get the main screen -hit enter and go to the main screenFebruary 11, 1999 - 2 X 3 factorial designs the question is whether speed increases similarly for both 2D and 3D- practice has the same effect for both 2D and 3D for the main effect- does it take longer to do a 2D rotation or a 3D rotation you can add up the total seconds- 24 for the 2D rotation and 42 for the 3D rotation this is the same as the 2 X 2 except that we are averaging across three numbers instead of two numbers three level analysis for practice- for main effect?- just some have to be different you can just look at two and if they are different there is a main effect for practice all of them have to be the same for there to be no main effect for interaction- with no practice,3D tasks take 10 seconds longer, with some practice 3D takes 6 seconds longer and with lots of practice 3D rotations take only 2 seconds longer if there is a difference somewhere in the differences you have an interaction this is done with ANOVA which tells you that there is a difference somewhere although we don’t know where when you ask a question about main effect- you are showing that the IV you are talking about has an influence on the DV regardless of practice, 3D will take longer the interaction looks at whether your performance increases similarly for both difference in the difficulty depends upon the practice that you have look at the differences across none, some, lots (10-20, 8-14, 6-8) then the differences between 10, 8, 6 and 20, 14, 8 (do the differences get smaller with practice?) 2 X3 Factorial designs (graph) Research Methods II Notes - we have a main effect for task type, practice, and an interaction 2 X 3 factorial designs - - - - the lines are not parallel….remember that there is probably an interaction then main effect for practice: difference for each bar None, some, lots, etc. and if there is any difference 53 Research Methods II Notes - 54 anywhere you will have a main effect- all of them are different 2 X 3 Factorial designs - - no main effect for rotation type (10 + 6 +8)/2 compared with the other three main effect for practice there is an interaction 2 X3 Factorial designs (graph) - also look at the line graph: - when the two lines cross over then you have an interaction Modification of Stroop Lab - stroop (1935): Stroop colour and word test - stroop effect: the difficulty in naming the colour of an object when the colour conflicts with the name of the object (i.e. when the word blue is written in red ink) - cognitive interference between the naming process (consciously instructed) and the reading process (automatic verbal processing response) - tests the ability of separating the word and colour naming stimlui Modification of stroop - procedure: within subjects: each ind. 
receives all levels of the IV (within-subjects design)
- counterbalance
- time each condition and record it in the data table
Class data (SPSS output, 11 Feb 99; condensed from the printout)
Descriptives (syntax: DESCRIPTIVES VARIABLES=reading plussign countdig /STATISTICS=MEAN STDDEV MIN MAX):
- READING: N = 14, min = 8, max = 16, M = 10.29, SD = 2.09
- PLUSSIGN: N = 14, min = 9, max = 17, M = 12.57, SD = 2.17
- COUNTDIG: N = 14, min = 12, max = 17, M = 15.14, SD = 1.51
General Linear Model (syntax: GLM reading plussign countdig /WSFACTOR = factor1 3 Polynomial; a one-way repeated-measures ANOVA with FACTOR1 = reading, plussign, countdig):
- Multivariate tests (Pillai's Trace, Wilks' Lambda, Hotelling's Trace, Roy's Largest Root): all give F(2, 12) = 18.51, p < .001
- Mauchly's test of sphericity: W = .807, approx. chi-square(2) = 2.58, p = .276 (sphericity not violated; Greenhouse-Geisser epsilon = .838, Huynh-Feldt = .948)
- Tests of within-subjects effects: FACTOR1 SS = 165.33, df = 2, MS = 82.67, F = 23.53, p < .001; error SS = 91.33, df = 26, MS = 3.51 (the Greenhouse-Geisser, Huynh-Feldt, and lower-bound corrections give the same F and p)
- Within-subjects contrasts: linear SS = 165.14, F(1, 13) = 35.28, p < .001; quadratic SS = 0.19, F(1, 13) = 0.08, p = .780
- Tests of between-subjects effects: intercept SS = 6738.67, df = 1, F(1, 13) = 1545.93, p < .001; error SS = 56.67, df = 13, MS = 4.36
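For reference, here is a minimal sketch of how a repeated-measures ANOVA like the GLM output above could be run outside SPSS, assuming pandas and statsmodels are available; the subject numbers and times are invented, and the data must be in long format (one row per subject per condition).

```python
# Minimal sketch (made-up data, not the class file): a one-way repeated-measures
# ANOVA on long-format data, one row per subject per condition.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

long_df = pd.DataFrame({
    "subject":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "condition": ["reading", "plussign", "countdig"] * 4,
    "time":      [9, 12, 15, 11, 13, 16, 8, 11, 14, 12, 14, 17],
})

res = AnovaRM(data=long_df, depvar="time", subject="subject",
              within=["condition"]).fit()
print(res)   # F-value, numerator/denominator df, and p for the condition factor
```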
T-Test (syntax: T-TEST PAIRS= plussign reading WITH countdig countdig (PAIRED); condensed from the printout)
- Paired samples statistics (N = 14 for each variable): PLUSSIGN M = 12.57, SD = 2.17, SE = .58; COUNTDIG M = 15.14, SD = 1.51, SE = .40; READING M = 10.29, SD = 2.09, SE = .56
- Paired samples correlations: PLUSSIGN & COUNTDIG r = .44, p = .114; READING & COUNTDIG r = -.43, p = .127
- Paired samples tests: Pair 1 (PLUSSIGN - COUNTDIG): mean difference = -2.57, SD = 2.03, SE = .54, 95% CI [-3.74, -1.40], t(13) = -4.75, p < .001; Pair 2 (READING - COUNTDIG): mean difference = -4.86, SD = 3.06, SE = .82, 95% CI [-6.62, -3.09], t(13) = -5.94, p < .001
- we got a significant F-ratio, so we are now doing paired t-tests (because it is a within-subjects design) to test the orthogonal comparisons
- if it had been a between-subjects design, you would need a fourth column to say which group each person was in (a nominal variable to define group membership)
- then you would have another column with the scores- again one row for each person, and each person now has only one score; in the dialog window, the DV would be the time scores and the "factor" (IV) would be group membership
- this also would have been significant, although we would have a lower F-value in the between-groups design, because between groups the differences among the subjects can contribute to the differences in the scores
Maximizing Chances of Finding Significance
- you can increase between-groups differences:
- use a stronger variable that has a bigger effect in the real world (like a more frustrating manipulation if you are measuring frustration)
- change the level of the treatment of the IV (a stronger dosage of a drug, etc.)
- use a more sensitive operational definition for the DV (if the Stroop lists had been shorter, it would have been a less sensitive measure)
- you can also decrease within-group variance:
- increase control over extraneous factors
- decrease subject variance (change the design to a matched or within-subjects design)
- increase sample size
Craik & Lockhart ('72): Levels of Processing
- we have discussed memory in the past; these authors proposed that the more deeply you think about something, the better you will remember it
- three types of understanding:
- the graphemic level of processing focuses on surface features: 5 letters, 1 vowel, all uppercase
- the phonological level of processing is based on sound patterns (e.g. rhymes with hot)
- the semantic level of processing gets at the meaning of the word
For instance- BEAR- different people would be asked different questions that tap different levels:
- is it all in uppercase? Graphemic level of processing
- does it rhyme with chair? Phonological level of processing
- is it an animal?
Semantic level of processing - asked to pick out 60 words they had seen before from a list of 180 Craik & Tulving: processing and memory - if you understand and processes the meaning of the word you will remember it more Types of memory and how they are assessed Research Methods II Notes Memory system 1. procedural 2. priming 3. semantic 4. primary (working/short term) 5. episodic - 60 Retrieval system implicit implicit implicit explicit explicit people with damage to mammillary bodies have memory problems explicit memory- you have to remember remembering a word implicit memory- you are shown _L_P_A_T (this would be elephant you might not recall having heard the word elephant you could also have ELE_______ : some people might say elephant and some elegantyou can measure what they don’t really remember when they remembered it kangaroo elephant conference How do amnestics do on tests? - is amnestics memory bad for both implicit and explicit effects- would Jimmy show priming effects? - You could get all sorts of data like crossover or not - In reality, the amnestics show the same priming effects as non-amnestic people Three-factor designs - you want to integrate the different information - with a three-factor design, there are seven questions that can be answered - there are three main effects (one for each IV) - there are three 2-way interactions - the influence of one IV may depend on anther IV - there is one three way interaction - this is where the two way interactions depend upon the level of the third variable you are looking at - examples based on a study of gender, practice and plane rotation… Hypothetical higher-order study of memory (graph) - it is not within because one subject can not give you all the bits of data - whenever you look at two way interactions, you will collapse across the third IV Research Methods II Notes - the three way interaction assesses whether there are differences in differences of differences - when looking at main effects, you ignore the other two IVs - there is main effect for male and female there is a main effect for practice also main effect for task rotation- 3D tasks take longer three two-way interactions 1: whether males and females improve the same amount with - - practice (don’t care about the 2 vs 3 D) 61 Research Methods II Notes - 62 for gender X dimension- combine the two dimensional data for males with the three dimensional data for males compared with the same for gender- they are the same so there is no two-way interaction next graph: Research Methods II Notes 63 “To steal ideas from one person is plagiarism; to steal from many is research” Three-factor design - remember that you have three questions that you can answer - with one analysis you end up with 7 f and p values to answer these questions - there are main effects (one for each IV) - there are three 2-way interactions - there is one three-way interaction - example he gave was the study on gender, practice, and mental rotation speed - with a 2x2x2 it is safer to go back to the numbers, rather than tables or bar graphs - - for main effect for gender- doesn’t talk about practice or 2D and 3D differences in rotation tasks- single score for males and single score for females- just total them up you get 36 (male) and 52 (female): you must describe: males, overal, are faster than females (give description) practice main effect- collapse across 2D and 3D and male/female 48 vs 40- main effect for practice then add up all the 2D’s with all the 3D’s- there is a main effect for rotation- 2D 
tasks do not take as long as 3D tasks to do
Two-way interactions:
- we are still just dealing with four numbers
- there are three of them- G X P, G X D, P X D
- for G X P, you will collapse across the variable that is irrelevant- collapse across dimension, and get a total value for males with practice, males without practice, females with practice, females without practice
- 18, 18, 30, 22
- you can subtract these values in any way
- 18-18 = 0
- 30-22 = 8
- we have a difference in our differences (males do not improve with practice but females do)
- G X D: don't care about practice; collapse across the numbers for practice
- males on 2D, males on 3D, females on 2D, females on 3D
- 16, 20, 24, 28 respectively
- to examine the gender by dimension interaction, collapse across practice and examine the differences in differences
- these differences are not different- both are (-4)
- third two-way interaction, P X D: don't care about gender; collapse across gender
- practice with 2D, no practice with 2D, practice with 3D, no practice with 3D
- we do not have an interaction here (both are (-4))
Interpreting 3-way interactions
- there is one 3-way interaction among gender, practice, and dimension (GxPxD) to be analyzed
- the three-way interaction asks whether the two-way interactions differ depending upon the level of the third variable that is examined
- in other words, three-way interactions require you to compare the difference in the difference of difference scores!
- because there are 3 two-way interactions, you can assess the 3-way interaction in three different ways
Interpreting 3-way interactions
- the easiest way to show/do this with a 2x2x2 data table is to simplify the data to create a 2x2 table containing the effects of the third variable at each combination of the two variables remaining
- e.g.
create a table of gender x dimension and fill the four cells with the effect of practice on that combination of gender and dimension you want to make a table that has only four cells- you put in the effect of the other variable (practice) on males doing 2 dimensional tasks (instead of collapsing across the other variables you put in their effect so that all variables are accounted for)- you put the difference in - males on 2D tasks: 0 sec (the difference between males with and without practice) males on 3D tasks: 0 sec females on 2D tasks: 4 sec females on 3D tasks: 4 sec look at the effect of practice at each combination of gender and dimension (similar to what we did for the GxD interaction but instead of averaging, look at differences put in the tables showing how to collapse big one into little ones now 0-0=0 compared to 4-4=0 or (-4)-(-4)=0 Research Methods II Notes - 66 no significant three way interaction Higher order factorial designs (bar chart) this is a little difficult to see Has other examples- best to do handout “2x2x2 factorial data interpretation exercises” Next: Chapter 10: Single-subject designs - also called small n designs lots of situations when you want to see about causality- this is harder when you are a psychotherapist, you want to know if just your client will improve, not whole groups - focus is on the behaviour of a few, even a single, individual - termed small-n or single-subject designs - common in clinical research - generally avoids inferential statistics (different approach to determining reliability) Small N-designs - historically: many studies- psychophysics Fechner and Weber and memory, Ebbinghaus based on very small subjects- even 1 - Ebbinghaus tried to remember words like BAV and GOM- did free recall to see how many he got right - all done prior to the invention of inferential statistics! - Sir Francis Galton & Karl Pearson- correlation coefficient, late 1800’s - Sir Ronals Fisher- F test; 1920-1930 - People realised very quickly how important this was - by 1950’s had to use statistics to get published, animal learning folk (Skinner etc.) created their own journal - we still have an obsession for statistics- we get infatuated about p-values; his problem with the editor who wanted stats done comparing animals who didn’t live at same time Baseline vs. 
Discrete Trials Designs - single-subject designs come in a variety of forms- all forms can be categorized into either baseline designs or discrete trials designs - today, when we say single-subject designs we usually mean baseline designs Baseline designs - becoming more popular again in clinical research - essential feature is the establishment of a behavioural baseline during the “baseline phase” of the experiment - this baseline establishes level of performance on the dependent measure before introduction of experimental treatment - following baseline phase, subject exposed to experimental treatment and behaviour measured again- second phase called “intervention phase” - requires: - at least one baseline period and - at least one intervention period - you can use biofeedback- if you can calm your physiological arousal at the same time as having the needle, you can train your physiological responses, maybe you could cure problems with needles Research Methods II Notes - - 67 you have phobic needle person come in- rate anxiety over series of minutes and then start biofeedback training and their anxiety is reduced example in the book was of a subset from a larger study using rats and electric shockssubjects tested in a operant conditioning box baseline phase each one got a series of training sessions to framiliarize it with the shock schedule and establish behavioural baseline; either had light on or off (light on meant shocks were predictable and were preceeded by warning tone; darkness meant unpredictable)- kept doing this until the points on the baseline remained within a 10% range; bar had no effect on anything; measured # of times rat pressed bar intervention phase rat could “buy” 1 minute of time in the light predictable condition by pressing the bar; continued to record bar pressing until baseline within 10% range results: bar pressing went from 10 to 85% in intervention phase- had this twice with baseline phase in between and it jumped down and then back up again response rates only high when responses produced the signalled schedule did follow up studies the baseline data would be called “baseline (A)” intervention (B) there is a confound- passage of time overhead: Fig 10-1: example in book: Badia & Culbertson 1972- shock in rats: either signal or not- (note- will return to idea of stress and control?) 
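- a minimal sketch (Python; the session scores, the three-session window, and the function name are made up for illustration and are not taken from the Badia & Culbertson study) of how a "stays within a 10% range" stability criterion could be checked after each baseline session:

    # Sketch: has the baseline stabilized? Here "stable" means the most recent
    # sessions stay within a 10% band around their mean. Window size and the
    # scores below are hypothetical.

    def is_stable(scores, window=3, tolerance=0.10):
        recent = scores[-window:]
        if len(recent) < window:
            return False                       # not enough sessions yet
        mean = sum(recent) / len(recent)
        spread = max(recent) - min(recent)     # range of the recent sessions
        return spread <= tolerance * mean

    baseline = [12, 31, 56, 78, 82, 85, 84]    # bar presses per session (made up)
    print(is_stable(baseline))                 # True: last three sessions within 10%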
tet test: trained eyeball test: do you need statistical analysis to tell you that there is a change from the baseline to intervention period but there is still the possible phobia for the passage of time table 10-1: more detail about characteristics about the baseline stuff; characteristics of single-subject designs Functions of baseline - feature of the baseline design is its use of the “behavioural baseline” : record of a subject’s performance across time within a phase - has two important functions: 1) establishes level of the DV prior to intervention 2) allows assessment of variability in DV - may involve a “stability criterion” to minimize error variance so that any effect of the intervention will be more apparent - stability criterion: defined to identify when the baseline no longer shows any systematic trends; used to ensure that baseline accurately represents the level of behaviour produced by a given treatment - trends of either increasing or decreasing values usually occur right after change to a new phase- stability criterion used when there are these systematic changes - keep baseline up to date or you can run the risk of having your subject in a certain phase for too long- waste of time, violation of experimental procedure - you can determine whether baseline has met stability criterion by updating plot after each session and examining it - if stability criterion too stringent baseline may never achieve it; not be able to proceed to next phase - too lax may proceed to next phase before subject’s performance stabilized - to develop good one may require pilot work; experience with % measure used indicated that the baseline was likely to remain stable if it varied only within 10% over three successive trials - the less variance within groups, the easier it is to tell the difference between the two groups Research Methods II Notes 68 Intrasubject replication - baseline and intervention phases each repeated - each subject’s performance thus assessed twice under each phase - look and see whether behaviour was repeatable or “replicable” - ability to replicate reliable data Assessing Generalizability - we are interested in generalizability - use intersubject replication (between-subject replication): use two animals instead of one - should you have 20 rats…you don’t need to have that many - example: ABAB design - these are like within-subjects designs when we don’t counterbalance - each time you change if the changes are consistent, the chances of changing by chance becomes smaller and smaller - this is different from group designs in which inferential statistics are used to assess reliability after a fairly complex chain of logical inferences - note that in both cases, the key is the degree of overlap of distributions Rationale of the baseline design - four areas in which the baseline design and group-designs differ - dealing with random variability - handling error variance - assessing the reliability of findings - determining the generality of results Dealing with random variability - when you manipulate IV you hope to show that it causes changes in behaviour - systematic variance: the changes that vary systematically with the level of the IV - error variance: changes in behaviour not related to variation of the IV (unsystematic variation) - shows up as unsystematic fluctuations of a baseline within a phase - also shows up as variations in final, stable levels of responding across same treatment - also occurs between subjects undergoing same treatment - you control for this in 
group designs by holding extraneous variables constant and averaging data across subjects - with single-subject approach, error variance handled by tight experimental control - statistical control measures avoided - researchers make an effort to identify the possible sources of variance- graph data and look for error variance (high levels of instability) - must decide how much variance is acceptable just as variance- you never statistically control for error variance you just continue to try and find sources of it - by identifying the sources of variance you can better understand the behaviour in question (as the sources of variance are contributing to different behavioural responses)in the group approach the effects of sources of variance are hidden with the averaging process Dealing with error variance - philosophical difference between single-subject approach and group approach in dealing with error variance Research Methods II Notes - - 69 with a group approach, you control error variance a little by randomization, etc., but there are also statistical ways- if effects are significant (can reject the null) little further efforts is made to determine the sources of variance with small n designs- you try and eliminate it completely; control as much as possible experimentally take repeated measures and implement a stability criterion taking repeated measures of the DV within a treatment allows you to determine how successful you have been at controlling extraneous variables to increase stability of measure use more careful control of other variables like food deprivation, amount of exercise, etc. stability criterion removes transitional data from analysis and eliminates unnecessary time spent collecting data after stability achieved Assessing the reliability of findings - group approach assesses reliability through chain of logical inferences from data of a single experiment - used to estimate population parameters - single-subject design assesses reliability more directly through actual replication- each subject experiences each phase two or more times - performances across multiple exposures are compared- if data replicate then they are reliable - whether you can say the data has been replicated or not depends on the degree of control you have over the DV and on the question the experimenter is looking at - high degree of control variation in baseline will be minimal - low degree of control baseline will be variable- variability can occur within and between replications of the same conditions - see pg. 
292 Determining the Generality of Findings - group approach establishes generality by averaging across large numbers of subjects - assumes that the average performance will be “representative” of the population from which the subjects of the experiment were sampled - averages represent a blend of different patterns into an average that is unrepresentative of the individual behaviour underlying the average - single-subject approach: avoids problem by using intersubject replication instead of averaging - direct comparisons of the behaviour of different subjects provides a measure of intersubject variation - if all subjects show similar patterns you can be confident that similar results would be obtained with most subjects from the same population - Reynolds: showed failure of intersubject replication; pigeons trained to peck at translucent disk on which was projected a triangle against a red background - the response of each bird was controlled by a different stimuli - each pigeon responded differently to the intervention time - this showed that there were probably underlying determining factors at work that wouldn’t have been noticed in a group design when averaging used - in addition to intersubject replication, the generality of results in single-subject designs evaluated by double-checking results in new experiments - these experiments build on previous research while extending the range of variables that are assessed- use different kinds of reinforcers and subjects Research Methods II Notes - 70 systematic replication: extensions that incorporate aspects of the original experiment while adding new wrinkles direct replication: exact replications Time-series (Small-N) Designs - don’t involve random assignment - control and manipulate variables sequentially (not simultaneously as in analytic experiments) - anything else that occurred in this time is therefore a potential confound - remember… - phrase used to highlight the fact that your manipulation occurs over time- like in ABABtakes time to do that Types of replication 1) intrasubject- to assess reliability (given time confound) 2) intersubject- to assess generalizability also have… - distinction between systematic replication (introduces some extension or variation on the orignial research- often to assess generalizability of the phenomenon) and direct replication Variability in baseline - variability can be due to: a) chance variation b) carryover effects- can also lead to unstable baselines - after you have established your baseline for the baseline phase and the intervention phase you return to the baseline phase to eliminate the confound of time - i.e. the kids in the grade 10 remedial math class seemed to pay attention more in the intervention phase but this could also be due to the passage of time (more maturity, better parental relations)- because of this possibility you go back to the baseline phase (remove the treatment, intervention, etc.) 
and watch the baseline- if it returns to normal it is unlikely that the changes occurred because of the passage of time - this is when you use an ABAB design or just an ABA design - when you go back to the baseline phase at any point this is called “reversal strategy” - if the behaviour returns to baseline levels during the second baseline phase and then returns to the previous treatment levels during the second intervention phase you can be confident that the treatment (and not a time variable) caused the observed changes Problematic baselines - a good baseline has: - little unsystematic variation - no systematic changes with time once stable levels of performance are reached - this is not always achieved - problems: 1. unsystematic baseline variability - if the variability is high within a phase this variability is caused by uncontrolled factors - if you can’t bring the factors under control, you can deal with the variability by extending the number of observations you make within a phase - the more observations made closer you are to “true” baseline 2. drifting baselines- baseline shows a trend (doesn’t go down as low as the first baseline in reversal time) Research Methods II Notes 71 it might be impossible to stabilize a baseline against slow, systematic changes (called “drift”) - if you can’t control the drift you can effectively subtract it out - an example of drifting baseline would be when the baseline slowly drifts upward in an intervention phase - you can see effect of treatment if you allow for the drift 3. unrecoverable baselines (due to carryover effects) - this is when baseline levels of performance cannot be recovered during reversal - these changes are considered “carryover effects” - you need to use special designs to deal with completely irreversible changes - i.e. this happens when learning develops during treatment conditions (if you have an ABAB type of design for example) - i.e. rats and bar pressing they are not naive to the fact that in the intervention phase bar pressing got results so they will never go back to not pressing it like in baseline phase 4. unequal baselines between subjects - sometimes the baselines of different subjects can vary right from the beginning or in the intervention phase (i.e. one rat may press the bar a heck of a lot more than another rat) - even if they were given all the same conditions - can result in different functional relationships - with this you would have intrasubject replication but not intersubject replication - you can differ the amount of treatment between rats to make them have the same baseline - you might then be able to achieve intersubject replication 5. inappropriate baselines- floor and ceiling effects - e.g. 
add: want to see if diet disrupts attentiveness; if they only sit in seat 20% and you want to see if food can make it worse; maybe they are already at the baseline, can’t go any lower; or someone who sits in seat all the time and want to know if something can improve- ceiling effect - low baseline desirable if you expect the treatment to increase the level of responding but undesirable if you expect the treatment to decrease the level of responding - solution adjust experimental conditions to produce the desired baseline levels - many times you do ABAB studies- baseline, intervention, reversal, intervention again - plotting every data point for both subjects- helps generalizability if using two subjects - you can have failure to replicate small N-designs: the pigeon study - the finegold diet: A-baseline, B-placebo cooke, C-artificial colouring in cookie - this data worked - February 18, 1999 Small-n Designs: Terminology - terminology is not really standardized but we can still think of several things that are important: like: types of single-subject baseline designs: a) single-factor (e.g. AB, ABA, ABAB) - baseline condition (A) - intervention condition (B) - ABA design deals with time confound - you can have multiple levels of the independent variable- “parametric design” - but you can’t have completely counterbalanced order of treatments Research Methods II Notes within the same subject you would want to have transitions between close values (A – B) of the independent variable as well as distant values (A – C) – if you had A, B, C as levels - you can also return to the baseline after each level to deal with drift b) multi-factor designs (> 1 IV) - when you have more than one independent variable- assess interactions as in factorial designs - because it is not subjected to statistical analysis, you can omit some cells of the factorial matrix and this does not present any analytical problems - you can just have data points at regular intervals instead of every possible interval if the functional relationships between IV and DV follow regular patterns c) multiple baseline designs (> 1 DV) - these are used when variables cause irreversible changes in behaviour - these designs simultaneously sample several behaviours within the experimental context get multiple behavioural baselines - i.e. 
people who have two habits they wish to kick- get a baseline for each and then start treatment- you then introduce treatment but apply it to only one of the behaviours - once one behaviour is stabilized at the intervention phase then you attack the next behaviour - it doesn’t matter which behaviours are treated first - the design uses the untreated behaviour as a partial control for time-correlated changes that could confound the results - the behaviours should be independent of one another another way of getting at reliability of data rather than generalizability you introduce therapy at different times for three behaviours you want to get rid of- you have three different baselines for different interventions which tells you that it really is your intervention that is decreasing the behaviour (and baseline) also used for systematic desensitization for phobias- used when you have gradual introduction to thing they are afraid of; just say someone had four phobias- use multiple baseline study-get a baseline for original, and only after the systematic desensitization you get the different baselines (telling you that it is unlikely that her self-reported reduction of fear is caused by anything other than intervention)- if it was the therapist for example that made her less afraid, all of the baselines for the different types of phobias would all change at once… - - - - 72 AB (ABA) Type Non-experimental studies useful when you don’t time the start of the intervention yourself it is a quasi-analytical study example: Drunk driving in Michigan if you change the legal drinking age, would you get more drunk driving accidents legal age 21 18 in 1972, then 18 21 in 1978 % alcohol related traffic accidents: 15% 22% in ’72 decreased again in ’78 does a lower drinking age CAUSE alcohol related traffic deaths? you must still watch for confounds? Wider alcohol availability in ’72 Oil crisis in ‘78- 55 mph limit imposed in 1973/72 (people driving more carefully) Other examples 1.) TV effects on children (Tannis MacBeth-Williams- social psychologist) - very few studies done on this Research Methods II Notes 73 Wilson’s beach in BC- before they got a satellite they had lots of organized and unorganised physical activities and bingo games and dances, etc.- studied the community before they came, and shortly after lots of the extra-curricular activities stopped - this was an AB design- didn’t turn off the satellite; helped us get an idea of the effects of tv on children 2.) David Phillips Research (see overheads): - motor vehicle accidents after publicized suicides: if people see a well-publicized suicides, more accidents (suicides to get insurance?)- 3 days after (time to think?) - airplane accidents after publicized murder- suicides (non-equivalent control groups) - murder-suicides people who want to kill themselves; many people own planes in states; private, business, corporate-executive, etc.- same as cars but you get more insurance with airplanes - also on day 3 (peak)- time to plan again? - With non-experimental AB designs you always have to be careful interpreting results - homicides after heavyweight prize fights: - looked at whether JFK assassination caused more violent crimes - if it is violence (murder, suicide): lets look at other things that are just violent (heavy weight prize fights) - when you measure things you can’t control, you can statistically control for them- people are going to kill someone more on holidays, etc. 
on christmas and holidays - when you see all three of them showing the same pattern, you become a bit convinced… - Non-Equivalent Control Groups - different than AB type non-experimental groups… Mining safety study… look at scanned graph - an industrial psychologist was called in to help out in Lucky Friday Mine – it had a higher rate of accidents than other mines - collected data on two other mines other than the other mines- the intervention only showed on the Lucky Minenot the others; more evidence for the intervention External and Ecological Validity Stress and Cancer looks at psychoneuroimmunology Research Methods II Notes - 74 ad lib means they can eat whenever they want between subjects design- it would be multivalent (more than two levels of IV) you also have yoked control- who is going to be more stressed?- if the person in charge presses the bar no one gets shocked – if they fail then it is the psychological stress of not having control- when they thought it was the people in charge had more stress Results - not being able to get away from itthe anxiety that we put them under that could reasonably account for data Factors influencing external validity - you want to use animal models to humans - do they extend beyond the specific conditions of the rats- we aren’t really interested in whether rats get cancer or not… - external validity: beyond that experiment Factors influencing external validity Population sampled - always be careful generalizing beyond population studies - look for converging evidence that there is nothing importantly unique (e.g. rat vs human physiology and susceptibility to cancers) - look at converging validity- many studies all pointing in the same direction - one thing will always be less convincing than many together - if there is, information is still useful - sometimes unusual populations are sought out (HIV resistant individuals, spotted hyenas- extreme examples of prostitutes in Africa)- 10 or 20 customers a day for 20 years, prostitutes never use condoms- had thousands of partners and they don’t have aids - hyenas- high testosterone, females more aggressive, clitoris as big as penis, give birth through the elongated clitoris (high mortality rate)- they are bizarre animals and can tell us about the effects of hormones on behaviour Research Methods II Notes 75 Operational definitions - look at construct validity- stress: inescapable electric shock - unavoidable shock- unnatural but are effects unique?- look for converging evidence Operational definition of stress: Phenomena associated with stress Parameter values - values selected for each variable - both independent and controlled - i.e. 
food additives on ADHD kidsyou better administer the dyes WITH FOOD so you can have external validity Demand characteristics - subtle cues in a research procedure that influence the participants - if you said “man, this is really hard, etc., you might tell the people what you are looking for - serious problems in social sciences (characteristics of volunteers) - can influence both internal and external validity - students holding “poisonous” snakes throwing “acid” in another’s face Ecological validity - how generalizable are the experimental results to the specific set of conditions- those of the natural context in which the phenomena usually occurs Scientific explanations - science is a self correcting process - single research findings are seldom conclusive - problem especially in applied areas where answers are needed quickly March 2, 1999 - - motivation is a factor in learning another language look at placebo effects in surgery- research being sponsored by the big funding research; you need a placebo control group in surgical research; allows you to have better scientific data also do this with double-blind research Stress and the nervous system - things happen in your body if you read erotic material- blood pressure increase, pancreas starts secreting hormones, etc. - can also be frightening thoughts or things you are worried about (exams) - we have wonderfully responsive response systems - survival of zebra depends on response system – get serious sympathetic activity Research Methods II Notes - 76 pupils dilate inhibits saliva heart rate accelerates digestion inhibited sexual arousal goes steroids released from adrenal gland remember about type A personality think about cardiac disease- heart attacks is the number one killer in the US once you have heart problems- type A personality just as much of a problem as smoking influence of social support bombing of London- people that were getting bombed every night were less likely to get stomach disorders than the people in the suburbs- weren’t bombed every night you need a way of studying things when you can’t have the contorls, etc. from the beginning Ex-Post Facto (after the fact) Designs - much less able to determine causality than true experiments but necessary and important designs - necessary for: 1. ethical reasons or 2. an interest in organismic variables (interested in something the person brings to the study; i.e. gender) - in either case- common to both reasons is the fact that you can’t randomly assigns people to treatment conditions Types of Ex-Post Facto Designs - two main types 1) Prospective study- looks forward in time, predicting 2) Retrospective study- look backward in time - in both cases, you are obliged to find naturally occurring groups (thus “after the fact”) and follow them forward (prospective) or trace their histories (retrospective). 
- problem: even if you randomly sample from those populations, they will be confounded because the populations differ- it is unlikely that the variable you are looking at is the only difference between the two groups - - - high stress job- we’ll use air traffic controller low stress job- clerks in a store in both- this is prospective design: you wait and look forward in time just say you have 20 in one group with cancer and 5 in the other group after the fact in solution 2, it is retrospective- here you select people after the fact on the basis of variabletake people with cancer and Research Methods II Notes - - 77 those without and ask about stress in lives the air traffic controllers have certain other variables VDT: video display terminals Is the difference in cancer rate due to stress; we don’t know; it could be but it could also be due to the smoking, VDT exposure, etc. even though we have random samples of the groups If you did an unethical study here you could have causal relationships Problems with Prospective ex-post-facto designs 1.) subjects are not randomly assigned to treatments - there will be inherent confounds in the populations studied (this is the most serious problem) 2.) sampling problems (often a convenient sample) - may be impossible to identify all members of the populations 3.) dropouts in prospective studies 4.) detection bias - these designs assume that detecting and measuring the variable will be the same for both groups - if you visit doctor more, you might be more likely to detect cancer earlier - all of these influence the internal validity of the study but as well the external validity - internal is what we are mostly worried about- trying to find out if there is indeed a causal relationship - external comes in when you have confounds, etc. Partial solutions (1) matching: - two kinds: a) subject for subject (preferable but difficult- finding people that are the same in every way except for the independent variable- as the list gets longer for the criteria that you have to match for, it gets harder to find people that will match) b) distribution by distribution- you want the average age, amount of smoking, etc. to be the same for the two groups- it is only the averages that are the same - if the risk factors interact- maybe this could be an issue - in both cases, you will selectively drop individuals and bias the sample further (if you drop someone who is hard to match you have a biased sample) - maybe you forgot to match for something (2) measuring: a) know if potential confounds (uncontrolled or extraneous variables)- you measure things that are potential confounds and see if they are different between groups b) to statistically control for these variables- even if you have confounded variables, if they have been measured you can mathematically control for the variables Retrospective studies - these have additional problems in that they rely on memory so the partial solutions are more difficult to employ successfully - the advantage (over prospective designs) is that they are more efficient (cheaper and faster); may be necessary with very rare groups or variables of interest (e.g. rare diseases) - note that even with measurement and matching, internal validity is still questionable Research Methods II Notes - 78 you don’t have dropout Partial solutions for retrospective designs - same problems, same (partial) solutions! - matching and measuring - they are much more difficult to employ! 
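- just to make the matching idea concrete, here is a rough sketch (Python; the participants, the age tolerance, and the matching variables are hypothetical, not from any study discussed here) of subject-for-subject matching, including the way unmatched cases get dropped and can bias the sample further:

    # Sketch: subject-for-subject matching on age and sex.
    # Unmatched cases are dropped, which is how matching can shrink and
    # further bias the sample. All data are hypothetical.

    cases = [{"id": 1, "age": 34, "sex": "F"},
             {"id": 2, "age": 52, "sex": "M"},
             {"id": 3, "age": 47, "sex": "F"}]
    controls = [{"id": 101, "age": 35, "sex": "F"},
                {"id": 102, "age": 51, "sex": "M"},
                {"id": 103, "age": 29, "sex": "M"}]

    def match(cases, controls, max_age_diff=2):
        pairs, unused = [], list(controls)
        for case in cases:
            # find an unused control of the same sex within max_age_diff years
            found = next((c for c in unused
                          if c["sex"] == case["sex"]
                          and abs(c["age"] - case["age"]) <= max_age_diff), None)
            if found:
                pairs.append((case["id"], found["id"]))
                unused.remove(found)       # each control is used only once
        return pairs

    print(match(cases, controls))   # [(1, 101), (2, 102)] -- case 3 is dropped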
- a real example would be a study of life events and the occurrence of cancer of children - t.j. jacobs & e. charles (1980) - retrospective ex post facto design - case studies suggested experiencing separation or loss related to childhood leukemia - method: semi-structured interview with children and parents - had control group (“variety of physical complaints draw from a medical facility”)- sick kids but not sick with cancer - relates to specificity of hypothesis - they matched on age, sex, SES - measured: - detailed medical history - personality measures - relationships measures - psychological history - marriage assessment - changes in life one year prior to disease onset Characteristics of samples (matched variables)- used subject by subject matching Measured variables in samples - these are already confounds (with the little square) Research Methods II Notes Holmes & Rahe: Social Readjustment Rating scale - have rank order Paired data on LCU measure kids actually did have more stressors Frequency of life-events by group - had more stressful events in the group of kids with cancer D.V.s used is ex-post-facto studies - two special dependent variables used in these studies a) relative risk ratio (prospective studies) - it is a dv that you can study from prospective studies only - illustrated by breast cancer data b) relative odds ratio (approximates the relative risk) - only calculated in retrospective studies - you must be careful - If you are only presented with the ratio you don’t know what the numerator and denominator were - problem with both is that absolute risks are hidden - both absolute and relative risks should be reported Calculating relative risk ratio - prospective ex post facto design - convert it into a contingency table - relative risk ratio is the risk of developing cancer in high stress group over the risk of developing cancer in the low stress group you will get a 4 here: indicates that the risk of someone with a highly stressful life developing cancer in the future is four times that of someone with a low stress lifestyle - 79 Research Methods II Notes Hypothetical data with same relative risk but very different absolute risks Here you get a 5 - it was interesting that there were just as many men as women that developed toxic shock syndromeyet, they targeted tampons 80 Research Methods II Notes - - - 81 if you only give the relative risk ratio you don’t get the whole picture in both of these cases you get a number of 5- it hides the absolute risk Calculating relative odds ratio retrospective ex post facto design the math has to be a little different- not a representative sample; half of the world does not have cancer yet half of your study is cancer patients relative odds ratio = 4.7 this is the odds of people in the cancer group reporting high stress in their past compared to their odds of reporting low stress in their past. 
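- a small sketch of both D.V.s (Python; the cell counts are invented, and only the first example is meant to echo the 20-vs-5 figures above) showing how each ratio comes out of a 2x2 contingency table, and how identical relative risks can hide very different absolute risks:

    # Sketch: relative risk ratio (prospective design) and relative odds ratio
    # (retrospective design) from 2x2 contingency tables. Counts are hypothetical.

    def relative_risk(exposed_ill, exposed_well, unexposed_ill, unexposed_well):
        # risk = proportion of each naturally occurring group that develops the disease
        risk_exposed = exposed_ill / (exposed_ill + exposed_well)
        risk_unexposed = unexposed_ill / (unexposed_ill + unexposed_well)
        return risk_exposed / risk_unexposed

    def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
        # odds of reporting the exposure among cases vs. among controls
        odds_cases = cases_exposed / cases_unexposed
        odds_controls = controls_exposed / controls_unexposed
        return odds_cases / odds_controls

    # Prospective example: 20/100 high-stress vs 5/100 low-stress develop cancer.
    print(relative_risk(20, 80, 5, 95))      # 4.0

    # Same relative risk, very different absolute risks (2 per 1000 vs 0.5 per 1000).
    print(relative_risk(2, 998, 1, 1999))    # 4.0

    # Retrospective example: high stress reported by cases vs controls.
    print(odds_ratio(35, 15, 20, 40))        # ~4.67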
The relative odds ratio provides an estimate of the relative risk ratio know how these are used, when used, how calculated Cautions about D.V.’s used in ex-post facto studies - statistically, both measures assume random sampling (unlikely or impossible) - both the relative risk and the relative odds ratios hide absolute risks Lifetime risks of developing breast cancer - - some of the types of cancer- early detection in some cancers was the argument being made for the fact that people seemed to be getting cancer it seems to be that exposure to estrogens is a risk factor for cancer Putting risk of breast cancer in perspective - one women in 9 in whom breast cancer will develop, has a 50% chance of receiving the Research Methods II Notes 82 - diagnosis after age 65 and a 60% chance of surviving that cancer and dying of other causes risk of breast cancer in any given year never exceeds 1 in 34 (at 30 it is 1 in 250) - most of us will die from cardiovascular disorders March 4, 1999 - we will be looking at the comparison between prospective and retrospective data volunteers review for midterm How many believe their moods vary with: - days of the week.? - Lunar cycle? - some health care workers say they will be extra busy on days with a full moon - Menstrual cycle? - How would you study this? - psychologists are interested in moods- it is an area of study - you could do either a prospective or a retrospective study - an experiment isn’t available to you Raging female hormones in the courts Macleans, June 15, 1981 Research Methods II Notes - - 83 if a woman was premenstrual she was given lenient sentencing for things even as bad as murder treatment for PMS ordered as stabber put on probation (Globe and Mail, Feb 10, 1987) women’s violence blamed on period (Toronto Star, August 25, 1978) this is interesting for women’s groups- could they wait till they are premenstrual or fake other symptoms in premenstrual women have lost custody because men say their ex’s suffer from wild mood swings because of premenstrual- isn’t always to the advantage of the women woman’s syndrome brings leniency Politics of PMS 1929 term “premenstrual tension” Dr. Robert Frank since 1970’s Dr. Katherine Dalton- says women are victims of “raging hormones”; she prescribes progesterone therapy estimates of prevalence rates go from 6-95% (it either exists in no one or everyone it seems!; bad operational definitions?) 
we aren’t talking about premenstrual dysphoric disorder- just the normal pms 150 somatic and psychological symptoms associated with pms according to some, PMS is a social and political construct- there is no such thing in other cultures; it seems to be an invention of western societies Hormonal changes over menstrual cycle - sharp drop in progesterone when you first have your period is thought to be by some the cause of pms Research Methods II Notes 84 Criteria for premenstrual dysphoric disorder A: present during last week of luteal phase, remit within days of follicular phase onset and absent week postmenses (1) markedly depressed mood, feelings of hopelessness, or self deprecating thoughts (2) marked anxiety, tension, feelings of being “keyed up” or “on edge” (3) marked affective lability (feeling suddenly tearful or sad or increased sensitivity to rejection) (4) persistent or marked anger or irritability or increased interpersonal conflicts (5) decreased interest in usual activities (work, school, friends, hobbies) (6) subjective sense of difficulty in concentrating (7) lethargy, fatigue, marked lack of energy (8) marked change in appetite, overeating, or specific food cravings (9) subjective sense of being overwhelmed or out of control (10) other physical symptoms such as breast tenderness, headaches, joint or muscle pain, headaches, “bloating”, weight gain Criteria B D Research Methods II Notes - 85 D is interesting because you have to use prospective data- retrospective data will not due Jessica McFarland did a study on “women vs men and menstrual versus other cycles” (mood fluctuations)- comparing retrospective and prospective data McFarland et al. - Methodological problems with existing literature: - demand characteristics (expectations of participants, bias, volunteers)- know what the researchers want - measured negative moods (truncated range)- sometimes the only possible things they could talk about was grumpy or negative moods; didn’t allow them to talk about happy moods - early ones all based on retrospective reports- remember two weeks ago on Monday, how did you feel? - no control groups- you could compare against contraceptive users - no assessment of the “normal” range of moods- normal has always been men; Research Methods II Notes do women show the classic mood fluctuations and are they less normal than men? 
her subjects were deceived- told them that she was concerned with emotional, physical, and behavioural patterns- they were blind to the study followed participants for 70 days to get two cycles had three groups- normally cycling women, women on contraceptives, men (men variations on the month are good controls) she used a mood grid- - 86 Mood pleasantness graph there seemed to be significant differencesbut not the classic way; this data showed that they were happier than the rest of us during the follicular and menstrual phase Research Methods II Notes 87 Arousal levels the same women- difference between how they remembered they felt and how they actually felt - they reported a drop premenstrually that wasn’t really there - difference for arousal and mood pleasantness - see two graphs here we see a retrospective bias - also did it for days of the weekseems to be more variability in the days of the week than in the menstrual cycle - - important to realize that these women didn’t show it does not mean that some women do not show the pattern (dysphoric disorder pattern) Data on prevalence of PMS symptoms - in surveys, most women report being more emotional premenstrualy - with prospective studies, most women do not show any relationship between mood and “time of month” - of those who report PMS symptoms, only 50% actually have these mood fluctuations - subsequent studies have shown that significant positive correlation between a women’s belief in PMS prevalence and the extent of her retrospective bias- the women who believe that it exists there will be more of a different between your prospective and retrospective data - this shows biases Research Methods II Notes 88 Characteristics of Volunteers From text book pp. 122-128 - ethical treatment says participants must be informed of the nature, purpose, requirements of your study - must be given opportunity to decline participation - are there differences between people that agree to participate in research? - Volunteer bias: validity of experiment can be affected if you have a sample made up entirely of volunteers - Issue is whether volunteers differ in meaningful ways from non-volunteers - Do any differences affect the “external” validity of research? - Do any differences affect the “internal” validity of research? Characteristics of Research Volunteers - you can have maximum confidence that volunteers… 1. tend to be more highly educated than nonvolunteers 2. tend to come from higher social class than nonvolunteers 3. are of higher intelligence in general, but not when volunteering for atypical research (such as hypnosis, sex research) 4. have a higher need for approval than nonvolunteers 5. 
tend to be more social than nonvolunteers - know these characteristics for exam - other characteristics tend to be true but not always: volunteers… 1) are more “arousal seeking” than nonvolunteers (especially when the research involves sex) 2) are more likely to be females 3) are more unconventional than nonvolunteers 4) less authoritarian than nonvolunteers 5) Jews more likely to volunteer than Protestants, however Protestants more likely than Catholics 6) have a tendency to be less conforming than nonvolunteers except where volunteers are female and the research is clinically oriented - looking a volunteers and nonvolunteers in people who were profraternity and antifraternity - had 42 undergraduate women who were given an attitude questionnaire about college fraternities - had volunteers and nonvolunteers - a week later the women were randomly assigned to either hear a profraternity, antifraternity, or neutral talk - the volunteers were more affected by the antifraternity communication than nonvolunteers- they have higher need for approval; tend to see experimenter as being Research Methods II Notes - 89 antifraternity themselves; more motivated to please experimenter- this might have caused the observed attitude change and not the content of the persuasive measure this affects internal validity of the experiment (the aspect of the differences between volunteers and nonvolunteers Rosenthal & Rosnow, 1975 From Horowitz - looked at psychedelics - graph “from horowitz”this is a 2 x 2 x 2 design - if you give people negative info about drugs, they become a little change in attitude; big scarebigger change in attitude - volunteerism may not only affect the internal validity of any research, but can also affect the ability to generalize (external validity) - Horowitz: looked at the relationship between the level of fear aroused by a persuasive communication and attitude change- also examined the impact of fear arousal along a second variable (whether they had volunteered or not for the study) - High fear group: read pamphlet; saw film affected volunteers more than nonvolunteers - Low fear group: only read pamphlet affected nonvolunteers more than volunteers - got a bigger change with low fear than high fear in the nonvolunteers could this be because they were mad at being forced to participate in the study? - Remedies for Volunteerism (p. 127-128) make the appeal interesting make the appeal nonthreatening state the importance of the research state why the target population is relevant pay them. Gifts. have request come from a high status (preferable female) person avoid stressful research communicate the normative nature of participating make the appeal come from a known person if possible depending, commitment to volunteer might be better made privately or publically Research Methods II Notes 90 Midterm review - on powerpoints - know mcfarland study- talk about how it illustrates problems with retrospective studies Midterm Review • ANOVA • Factorial Experiments: Advantages • Factorial designs: terminology – #levels IV1 x #levels IV2 x #levels IV3 etc. 
• Interpreting data from 3 factor experiments and/or experiments with three or more levels of one or more IV (e.g., in class • Ex-post-facto designs: prospective and retrospective designs - advantages disadvantages • Problems & Partial solutions: Midterm Review • DVs used in Ex-post-facto studies • Problem with both in that absolute risks are hidden, both should be reported • Time-series designs, small-n designs – A-B studies (e.g., homicides after prize fights, JFK’s assassination, TV effects etc.) – multiple baseline designs • internal validity – non-equivalent control group – replication within-subjects • generalizability • Assessing external and ecological validity • Volunteers Research Methods II Notes 91 March 11, 1999 - - - Correlations and Regressions in some cases you want to evaluate the direction and degree of relationship (correlation) between scores in two distributions for this you must use a “measure of association” remember Pearson product-moment correlation coefficient or “Pearson r” used when dependent measures are scaled on interval or ratio scale (provides index of direction of the relationship between two sets of numbers) can be direct relationships or inverse relationships (+ or -) magnitude tells you degree of “linear relationship” (straight line) between two variables factors that affect the magnitude and sign of Pearson correlation coefficient: - range of scores - presence of outliers - shapes of score distributions can use correlation/regression with either manipulated predictor variables or natural variation you can use it to look at true experimental data; isn’t only used in quasi-experimental studies correlational statistics can be applied to any type of design (including experimental) correlational design occurs when we do not randomly assign participants to the level of either variable- i.e. levels of variables are not manipulated Steps in conducting correlational designs - these are quasi-analytic experiments: 1) select population and subjects of interest 2) measure variables of interest (at least two) 3) calculate the extent to which the variables are systematically related once you have measured your data it is a good idea to plot it (get a scatterplot) - predictor (assumed causal or IV) variable on abscissa (X-axis) - criterion or DV on ordinate (Y-axis) - you “regress Y on X” (predict Y from X) - you should always look at the scatterplots Pearson’s product moment correlation coefficient - the basis of almost all statistics including ANOVA - appropriate for interval or ratio data only- it is like a - parametric test measures the direction and degree of association Research Methods II Notes 92 Correlational designs: concern - Pearson’s r (based on means) is very sensitive to the presence of “outliers” (not Spearmans, etc.) - - - - - sensitive to heteroscedasticity (rxy relationship may vary across levels of X), and can be biased by having a “restricted range” combining group data can also influence the size of the correlation (in either direction) examine scatterplots to detect these potential problems! 
http: //www.ruf.rice.educ/~lane/stat_sim/comp_t/index.html as a reference as the slope changes, the blue box gets bigger in either directionmore variability able to be explained by the relationship and less variability unable to be explained by Y when you have a perfect correlation, you will have no unexplained variability (when dots are not on the line) having a restricted range is not good- just the presence of one outlier can make a relationship Research Methods II Notes - - - - 93 nonsignificant correlation is a measure of the average relationship between the two variables homoscedasticity is also if you were to take graphs showing relationships between the two- heteroscedasticity shows less variance for one of them if you combined them it would be a significant negative correlation even though the groups don’t have those correlations this shows that sometimes combining group data can alter your correlation in other ways as well Other diagrams: Statistical inference - there are tables of critical values (overhead) - for a given sample size, larger the absolute value of r, the less likely it is to have occurred by chance - bigger numbers less likely to have occurred by chance - for a given value of r, the larger the sample, the less likely it was to have occurred by chance- you can have significance without something being meaningful Research Methods II Notes 94 Power - the power of a correlational design is increased by: - minimizing error variance - avoiding restricting the range of scores - increasing the sample- if you want to make a big deal out of nothing you need a big sample – i.e. women that have smaller size brains than men (14, 000 participants) - with very powerful correlational designs (many subjects) significant, yet probably meaningless results can be found - e.g. correlation between height and IQ is about r=.1 and this was found to be statistically significant [study based on 14, 000 children: with N=102 (100 d.f.) r +- 16, p<.05] Statistical inference - r2 (coefficient of determination)= estimate of the proportion of variance shared by the two variables; extent to which the co-vary. (can be used as a measure of effect size) - it is the square of the correlation coefficient - i.e. if variation in score x actually caused variations to occur in score y, the coefficient of determination would indicate what proportion of the total variance in score y was caused by variation in score x - 1-r2: coefficient of nondetermination (also called coefficient of alienation or error variance) - continuing about coefficient of nondetermination: - gives proportion of variance in one variable “not accounted for” by variance in the other variable- the “unexplained” variance caused by unmeasured factors (if this is large your measured variables are not having a lot of impact on each other - have a circle showing total variance in x and another showing total variance in y - no shared variance- no overlap - other examples of venn diagrams - 90% of variance shared pretty much overlapped Drawing conclusions from correlational designs - all the same concerns as with experiments (valid, reliable measures, etc.) - additionally, have concerns with directionality, and there are usually many potential confounds (uncontrolled extraneous variables- the 3rd variable problem)- i.e. 
the lice and health example (it turned out that when you are sick your body is too hot for lice to live on; fever)
- sometimes it's hard to know if they are significant or not
- causality cannot be inferred
- correlational designs can be used to:
- discover relations
- solve ethical and practical problems
- provide greater external/ecological validity
Linear Regression
- linear regression looks at correlation in terms of predictability
- r2 is a measure of the proportion of variance in Y accounted for (predicted by) X
- the idea behind "bivariate linear regression" is to find the line that best fits the data plotted on a scatterplot: Y' = a + bX [or X' = ax + bxY]
- Y' = predicted score of Y
- b = slope of the regression line (or "regression weight")
- X = value of the x-variable
- a = y-intercept
- this is called the "least squares" regression line
- this straight line is the one that minimizes the sum of the squared distances between each data point and the line (as measured on the y-axis)
- it minimizes the sum of squared deviations; the sum of the deviations between the predicted values of Y' and the actual observed values of Y = 0
- in other words, at any given value of x found in the data, the position of the line tells you the predicted y according to the linear relationship- you can then compare these predicted values to the actual values- the best fitting straight line minimizes these differences between predicted and observed values (as it would fit the points the best)
- the deviations are called "residuals" (residuals will be low when the regression equation predicts scores on Y accurately)
- using your line, you have values of X and then predicted values of Y- but since you have your scatterplot you also have your actual raw scores of Y- you are in a position to see how accurately your regression equation predicts scores on Y (the difference between the values of Y and Yhat (Y – Yhat) is called a "residual")
- you will have lower residuals when the regression equation generates values of Yhat that are close to the actual values of Y
- unless the variables are perfectly correlated, you will get a "standard error of estimate"
- if the regression is based on raw scores, the "regression weight" is known as the "raw score regression weight". If you use standardized scores you will obtain a different regression equation and a "standardized regression weight" or "beta weight"
- note that if you were to then plot the residuals Yres against X, there would be no linear relation; the correlation would be 0
- the linear regression line, Y' = a + bX, can be thought of as the straight line that summarizes the linear relationship in a scatterplot by, on average, passing through the average of the Y scores for each X
- when the variables are perfectly correlated there is no error in prediction
- however, when your correlation is less than perfect there will be error in predicting Y from X (estimate the amount of error in prediction by calculating the "standard error of estimate")
Lab 2
- it takes time to rotate objects
- we want to look at things in the manner we are used to
- do mental images work the same way?
- measured time
Linear regression:
- looks at correlation in terms of predictability
- linear regression finds the best fitting line: Y' = a + bX [or X' = ax + bxY]
- this is called the least squares regression line [where Y' is the predicted value of Y]
- minimizes the sum of squared deviations; the sum of deviations between predicted values of Y' and actual observed values of Y = 0
- these deviations are called residuals.
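- a minimal sketch of these ideas (Python with numpy; the X and Y values are invented for illustration): it fits the least squares line, computes the residuals Y - Yhat, checks the point above that the residuals are uncorrelated with X, and also computes r as the mean of the z-score cross-products (a formula that comes up again just below):

    # Sketch: least squares line Y' = a + bX, residuals (Y - Yhat), and two checks.
    # The data are invented.
    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

    b, a = np.polyfit(x, y, 1)             # slope (b) and y-intercept (a)
    y_hat = a + b * x                       # predicted scores Y'
    residuals = y - y_hat                   # Y - Yhat

    r = np.corrcoef(x, y)[0, 1]             # Pearson r
    print(round(b, 3), round(a, 3))         # slope and intercept of the best-fit line
    print(round(r, 3), round(r ** 2, 3))    # r and r^2 (coefficient of determination)

    # r as the mean of the z-score cross-products: r = sum(Zx*Zy)/N
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    print(round(float((zx * zy).mean()), 3))                    # same value as r

    # residuals are uncorrelated with X (no linear relation left over)
    print(round(float(np.corrcoef(x, residuals)[0, 1]), 10))    # ~0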
[Note that if you were to then plot the residuals Yres against X, there would be no linear relation; the correlation would be 0.] [The linear regression line can be thought of as the straight line that summarizes the linear relationship in a scatterplot by, on average, passing through the average of the Y scores for each X.]
For perfect correlations (r = ±1.0):
1) Every participant who obtained a given value of X obtained one, and only one, value of Y: there are no differences in Y scores for a given X
2) Y scores are perfectly predictable from X scores: the data points for a given X are all on top of one another and all data points fall along the regression line.
For intermediate correlations:
1) There are different values of Y for each X; however, these different Ys are relatively close in value (the variability in Y associated with a given X is less than the overall variability in Y)
2) Knowing X allows prediction of approximately what Y will be: data points will fall near the regression line but not on it.
For zero correlation:
1) Y scores are as variable at a given value of X as in the overall sample
2) The best prediction of Y, regardless of X, will be the average of Y, and there will be no regression solution.
Using standard scores and 2 variables [1 IV], the regression coefficient (b) [or raw score regression weight] = standardized regression weight (beta) = correlation coefficient (r). As the correlation grows less strong, Y' moves less in response to a given change in X (the slope, b, approaches 0). If standard scores (z scores) are plotted, the slope of the least squares regression line = r [r = the change in S.D. units in Y' (the predicted value of Y) associated with a change of 1 S.D. in X]. If r = 0, the best predictor of Y from X is the mean of Y, and the best predictor of X from Y is the mean of X. If r = ±1.0, the regression line from regressing Y on X and the regression line from regressing X on Y are the same (and pass through the point (mean of X, mean of Y)). As the correlation between X and Y weakens, the predicted value of Y' for a Zx = 1 will be Zy' < 1 and the predicted value of X' for a Zy = 1 will be Zx' < 1. The regression lines predicting Y' from X and X' from Y diverge with decreasing correlation until, at r = 0.0, they are perpendicular: horizontal and vertical lines passing through the means of Y and X respectively. This can lead to regression artifact (e.g., Rushton: women less brainy than men). And remember cautions - same as for correlations: assumes linear relations among variables; truncated ranges can reduce correlations or regressions; outliers, heteroscedasticity, etc.
Correlations and Regressions
Can use correlation/regression with either manipulated predictor variables or natural variation. Correlational statistics can be applied to any type of design (including experimental), but a correlational design occurs when we do not randomly assign participants to the level of either variable - i.e., levels of variables are not manipulated.
Quasi-analytic experiments: Steps in conducting correlational designs
1) select population and subjects of interest;
2) measure variables of interest;
3) calculate the extent to which the variables are systematically related
Pearson's product moment correlation coefficient (for interval or ratio data) measures the direction and degree of association. r is the mean of the z-score cross-products: r = Σ(ZxZy)/N, the extent to which deviations from the average on each measure are similar for each subject sampled.
r2 (coefficient of determination) = estimate of the proportion of variance shared by the two variables; extent to which they co-vary.
(Can be used as a measure of effect size.) 1-r2: coefficient of nondetermination (also called coefficient of alienation or error variance) Statistical inference: for a given sample size: larger the absolute value of r, the less likely it is to have occurred by chance, similarly, for a given value of r, the larger the sample, the less likely it was to have occurred by chance. The power of a correlational design is increased by minimizing error variance, avoiding restricting the range of scores, and increasing the sample. Pearson's r (based on means) is very sensitive to the presence of outliers, heteroscedasticity (rXY relationship may vary across levels of X), and can be biased by having a restricted range. Combining group data can also influence the size of the correlation (in either direction). So: examine scatterplots to detect these potential problems!! Visual inspection of data Graph data (scatterplot): predictor (assumed causal or IV) variable on abscissa (X-axis) and criterion or DV on ordinate (Y-axis) March 16, 1999 Calculating I.Q. and GPA Correlation - using cross product as an average of correlation - - - - - we are looking for relationship between x and y the Z thing says that it is 1.45 times below the standard IQ- we also have Z scores for the y variable if you were to plot it, you can see the correlation- the regression line is drawn on there, given by .0593x1.5885 look at just the data we have- use a truncated range: you can see the correlation and the best fit least squares regression line Statistical inference - review r2 and 1-r2 Research Methods II Notes - 99 coefficient of determination is an estimate of the proportion of variance shared by the two variables review venn diagrams Drawing conclusions from correlational designs - all the same concerns as with experiments (valid, reliable measures, etc.) 
- additionally, have concerns with directionality, and there are usually many potential confounds (uncontrolled extraneous variables- the 3rd variable problem) - causality can not be inferred - correlational designs can be used to: - discover relations - to solve ethical and practical problems - to provide greater external/ecological validity Linear Regression - we are looking at correlation in terms of predictability - r2 is a measure of the proportion of variance in Y accounted for (predicted by) the amount of variance in X - remember that we use the least squares regression line - linear regression finds the best fitting line: this is called the “least squares” regression line Y’= a + bX [or X’= ax +bxY] - minimizes the sum of squared deviations, sum of deviations between predicted values of Y’ and actual observed values of y=0 - these deviations are called “residuals” - note that if you were to plot the residuals Yres against X, there would be no linear relation, the correlation would be 0 graph again: - - values are above and below the line- yellow arrows- if you take the residuals (the differences) and plot them against the predictor- you will not have any relationship- you have wiped out the linear relationship; basis for how we can use measured variables to control them statistically by removing them if you plot the residuals against predictor and do a regression it should be zero Research Methods II Notes - see graph: there is no linear regression line Linear Regression - it is the straight line that summarizes the linear relationship in a scatterplot by, on average, passing through… Formulas - using standard scores and 2 variables (1 IV), regression coefficient (b) [or raw score regression weight] = standardized regression weight (or beta = correlation coefficient (r) - for standardized scores, the slope is equal to the correlation Implications of formulas - the correlation coefficient (r) is going to tell you the change in standard deviation units in Y’ (the predicted value of Y) associated with a change of 1 S.D. 
in X
- if standard scores (Z-scores) are plotted, the slope of the least squares regression line = r
graph:
- when you convert the data to standardized scores, the regression weight and the correlation coefficient are the same
For perfect correlations (r = ±1.0):
1) every participant who obtained a given value of X obtained one, and only one, value of Y: there are no differences in Y scores for a given X
2) Y scores are perfectly predictable from X scores: the data points for a given X are all on top of one another and all data points fall along the regression line
Regression Lines: r = 1
- if you know one of the values, you can predict the other variable perfectly with perfect correlations
For intermediate correlations: 0 < r < 1
1) there are different values of Y for each X; however, these different Ys are relatively close in value (the variability in Y associated with a given X is less than the overall variability in Y)
2) knowing X allows prediction of approximately what Y will be: data points will fall near the regression line but not on it
Regression line: 0 < r < 1
- the best fit regression line for predicting X from Y diverges from the best fit regression line for predicting Y from X; both lines go through the means, however
For zero correlation: r = 0
1) Y scores are as variable at a given value of X as in the overall sample
2) the best prediction of Y, regardless of X, will be the average of Y, and there will be no regression solution (i.e. if we were told a student's IQ and the correlation was 0, you couldn't predict GPA except that they are more likely to be average than non-average)
Implications of formulas
- as the correlation grows less strong, Y' moves less in response to a given change in X (the slope, b, approaches 0)
- if r = 0, the best predictor of Y from X is the mean of Y, and the best predictor of X from Y is the mean of X
- if r = ±1.0, then the regression line from regressing Y on X and the regression line from regressing X on Y are the same
- as the correlation between X and Y weakens, the predicted value of Y' for a Zx = 1 will be Zy' < 1 and the predicted value of X' for a Zy = 1 will be Zx' < 1
- the regression lines predicting Y' from X and X' from Y diverge with decreasing correlation until, at r = 0.0, they are perpendicular: horizontal and vertical lines passing through the means of Y and X respectively
- this can lead to regression artifact: i.e. Rushton - women less "brainy" than men
question for the exam: suppose we have a mean X score of 8 and a standard deviation of 2, a mean Y score of 10 with sd = 4, the correlation between the two is perfect and positive, and X = 6; what is Y'?
- the standardized score for that X is -1 (one standard deviation below the mean), so the predicted Y will also be one standard deviation below its mean (Y' = 6)
- if the correlation were negative, you would have Y going up instead (one standard deviation above the mean, Y' = 14)
- i.e. xmean = 8, sd = 4, ymean = 10, sd = 3, correlation (rxy) = +.7, and if X was 4, what is Y?
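Worked out explicitly, using the standardized-score prediction rule Z_{Y'} = r·Z_X and converting back to raw units with Y' = Ȳ + Z_{Y'}·s_Y (the informal reasoning just below reaches the same 7.9):

```latex
% Perfect positive correlation: \bar{X}=8,\ s_X=2,\ \bar{Y}=10,\ s_Y=4,\ r=+1,\ X=6
Z_X = \frac{6-8}{2} = -1, \qquad Z_{Y'} = rZ_X = -1, \qquad Y' = 10 + (-1)(4) = 6
% (if r were -1, the prediction flips to one SD above the mean: Y' = 14)

% Second version: \bar{X}=8,\ s_X=4,\ \bar{Y}=10,\ s_Y=3,\ r=+.7,\ X=4
Z_X = \frac{4-8}{4} = -1, \qquad Z_{Y'} = (.7)(-1) = -.7, \qquad Y' = 10 + (-.7)(3) = 7.9
```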
if correlation was 1- y would be 7 if correlation was 0- y would be 10 real answer will be somewhere in between there- 7.9 (.7x3) minus ymean the linear relationship is very unlikely to have occurred by chance - we have two lines that describe 1000’s of data points that say that brains seem to be bigger- control for body size - so you can take any body size you want and take the regression line and see how big her brain will be, take same body size and predict male brain - - - - - 102 for any given brain size, men should have smaller body size than women- if you regress body size on brain size with correlations that are not perfect- the two regression lines are diverging; try it both ways: so, for a given brain size, women have smaller bodies- they are potentially more brainy what is going on? as correlations weaken, predictions tend to go to mean- tend to go to average (if someone is 7 feetaverage brain most likely not huge) it is just the fact that the relationship is not great Research Methods II Notes 103 Cautions for regression data - same as correlations: - regression assumes linear relations - truncated ranges - outliers heteroscedasticity - combining data from different groups - also, (if a correlational design) 1) subjects not randomly assigned 2) no attempt (in correlation designs) to control variables 3) different levels of the IV are not contrasted while concurrently holding other variables constant Correlation versus ex-post-facto designs - these are very similar and you can convert one to the other - e.g. assign dummy coding to the categorical (nominal- 1’s and 0’s) variable (if there is one) and calculate a “point-biserial correlation” coefficient - interpretation problems are not related to the statistical choice, rather due to the design Correlation versus ex-post facto designs These are very similar quasi-analytic designs and it is possible to convert one to the other [e.g., assign dummy coding to the categorical (nominal) variable (if there is one and it has 2 levels) and calculate a point-biserial correlation coefficient instead of doing a between groups t-test] Interpretation problems are not related to the choice of statistical analysis, rather they are due to the nature of these designs Remember that unlike true analytic experiments: 1) subjects are not randomly assigned, 2) there is no attempt (in correlational designs) to control variables, and 3) different levels of the IV are not contrasted while concurrently holding all other variables constant. Drawing conclusions from correlational designs -- we have all the same concerns as with experiments (valid, reliable measures, etc.) but in addition, have concerns with directionality, and there are usually many potential confounds (uncontrolled extraneous variables in correlational designs - the 3rd variable problem). Although causality can not be inferred from a single correlational design, correlational designs can be used to discover relations, to solve ethical and practical problems, and to provide greater external/ecological validity (by being more easily applicable outside laboratory settings) Causation is not a simple concept. To infer it from correlational studies, we want to have: 1) an association between variables that recurs in different contexts (replication, convergent evidence), 2) have a plausible explanation showing how the predictor variable could cause the criterion variable, and 3) have no equally plausible 3rd variable that could cause the variance in the criterion variable. 
While correlation doesn't imply causation, causation does imply correlation Research Methods II Notes 104 Death sentences for murder in southern U.S. - example for problems with combining data in group data, etc. - - - the white man who lynched the black man was sentenced to death there is a paradox- white’s more likely to be sentenced, yet for both black victims and white victims, blacks more likely to be sentenced to death- but if you combine them together than whites are more likely to be sentenced to death if you look at victims race- in terms of sentencing, murdering a black person is a less serious crime than murdering a white person- victim data Paradox - whites are more likely to be sentenced to death than are Blacks once convicted of murder - yet for both black and white victims, blacks are more likely to be sentenced to death Explaining the paradox - how does this help us explain the paradox? - victims race is a confound - people tend to murder members of their own race - whites are more likely to murder whites and this is treated as a more serious crime, at least in terms of the death penalty - relative risk ratio = (30/214)/ (6/112) = 2.6 - murders are 2.6 times as likely to be sentenced to death for killing a white vs. a black victim Simpson’s paradox- 2nd example - classify two groups with respect to the incidence of one attribute - if the groups are then separated into several categories or subgroups the group with the higher overall incidence can have lower incidence within each category or subgroup - there is a negative correlation between starting salary for people with economics degrees and the level of degree they have obtained (i.e. PhD’s earn less than M.A.’s, how earn less than BA’s) - does this make sense?…no! Research Methods II Notes - - 105 break down this data in terms of the type of employment (industry, government, teaching). In every type of job: private industry, government, or teaching: there was a positive correlation between degree and starting salary employment selection is the confounding third variable influencing these resultsteachers get paid less than government workers who get paid less than those in private industry people with higher degrees are more likely to end up teaching and those with B.A.’s are very unlikely to be teachers this is similar to the white, black data these are all examples of the danger of combining data from several distinct groups (with respect to the relation between two variables) in calculating correlations maybe it is that the people of different degrees were choosing different professionsones with ph’d’s were more likely to be teachers? one way to have avoided the initial erroneous correlation would have been to use “stratified” sampling if equal numbers of people are sampled from the categories, the overall relationship will be an average of the relations in the subcategories Notes from text about sampling: - at the heart of all sampling techniques is random sampling - every member of the population has an equal chance of appearing in your sample - other methods include 1) Stratified sampling - divide population into segments or “strata” - next select a separate random sample of equal size from each stratum- because individuals are selected from each stratum, you guarantee that each segment of the population is represented in the sample 2) Proportionate sampling - variant of stratified sampling - problem with stratified sampling is that it may lead to certain groups being overrepresented in the sample (e.g. 
consider a community of 5000 that has 500 Hispanics, 1500 blacks, 3000 whites- and you randomly select 400 people from each segmentHispanics would be overrepresented - in this, the proportions of people in the population are reflected in the sample- you would sample so that there would be 10% Hispanics, 30% blacks, 60% whites just like it is in reality 3) Systematic sampling - sampling every kth element after a random start 4) Cluster sampling - surveying all people within certain clusters - this is used if populations are too large to allow cost-effective random sampling or even systematic sampling - you identify naturally occurring groups and randomly select certain clusters (like one class within a school) - advantage: saves time - but limits sample to those participants found in the clusters 5) Multistage sampling - First stage: identify large clusters and randomly select from amoung them - Second stage: randomly select individual elements - for all of these sampling procedures, you must take sample size into consideration Research Methods II Notes 106 - you want an “economic sample”: one that includes enough participants to ensure valid survey but no more take into consideration the amount of acceptable error and expected magnitude of population parameters sampling error: deviation of sample characteristics from those of the population - survey research used to evaluate behaviour and attitudes of participants falls into the category of correlational research cannot draw causal inferences - Data analysis - doing two simple regressions- one with each dependent variable - have to find means for the variables - within-subjects design- use compare means – choose dependent ones - give you a table with all means Simple regression - use regression equation to predict the outcome of y from score on x - y=bX +a - Y’ is the predicted value of Y - b is the slope (amt of change in Y for a unit change in X) - a is the y –intercept - things to note on spss - check that the angle of rotation is a significant predictor - relationship is linear - R square (proportion of variance accounted for) - (Constant) is your y intercept - how to decipher spss output: - what do we want to do first: you have anova and the t- these are both the same, you only have one iv and one dv in each one - first see if it is significant - now we want to look at specifics- y intercept and slope; the intercept is the number under b in the first one, the slope is the number under that is under that - r2 value is the R Square value- 6.1% variability predicted by variability in the other variable- i.e. angle of … is accounting for 6.1% of the variability in reaction time - what else could account for the variability - it is significantly predicting 8.1% of the variability in the dependent measure - error is accounting for other % discussion: did median RT increase with increased angle of rotation? My much? What about errors? 
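For reference, the same kind of simple regression described in the SPSS walk-through above can be sketched in a few lines of Python; the angle and median-RT values here are invented for illustration, and scipy's linregress is used in place of SPSS:

```python
from scipy import stats
import numpy as np

# Hypothetical lab data (invented): angle of rotation (degrees) and median RT (ms)
angle     = np.array([0, 40, 80, 120, 160], dtype=float)
median_rt = np.array([520, 610, 640, 750, 800], dtype=float)

# Simple regression: RT' = a + b * angle
fit = stats.linregress(angle, median_rt)

print(f"slope b     = {fit.slope:.2f} ms per degree of rotation")
print(f"intercept a = {fit.intercept:.2f} ms")
print(f"R-square    = {fit.rvalue**2:.3f}  # proportion of RT variance predicted by angle")
print(f"p for slope = {fit.pvalue:.4f}     # the same test the t (or F) in the SPSS output reports")
```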
Findings consistent with literature?- not causal relationship, but predictive Method: see handout for details, describe program used, type of stimuli presented, procedure followed - two dependent measures- error and reaction time - ideas for future research - if you add up the error % and the rotation % it is about 15% and the change in the numbers is less than 15%- so the two combined account for all the variance Research Methods II Notes 107 Class 20 • • • • • - - - - Sampling techniques Part & Partial correlations Multiple correlation and regression Types of Multiple regression Return Midterms Mental Rotation: Shephard & Metzler it seems that we are hard wired the 20% improvement might be because with a lot of practice you know exactly what to do, whereas with no practice you might have to wait and make a decision first this was done a long time ago- used statistiscope they don’t present error data- i.e. speed-accuracy trade off for our lab, if you had accuracy increasing with bigger rotation angles and reaction time increasing with bigger rotation angles- this is the worst case scenario because you could say nothing about the angle of rotation and you could just be seeing an accuracy-speed trade-off some programs just eliminate data that is too sloppy or took too long if you had speed and accuracy both with positive slope- maybe you are not even mentally rotating something? you look at t-value to determine the significance of the slope- it is the t value in the above example- shows that it makes sense you can use correlations to test the significance of the slope Simpson's Paradox Classify two groups with respect to the incidence of one attribute; if the groups are then separated into several categories (subgroups), the group with the higher overall incidence can have lower incidence within each category (sub-group). • Simpson's Paradox - Examples Sentencing of blacks and whites in southern U.S. Research Methods II Notes • • 108 Starting salary and level of education Illustrate the danger of combining data from several distinct groups in calculating correlations. • Could have avoided the initial erroneous correlation by using stratified sampling. - you should have had whites killing whites, blacks killing blacks, whites killing blacks, and blacks killing whites - unless you have these categories you can think things are ok (no racism, for example) when there actually is • If equal numbers of people were sampled from the subcategories, the overall relationship will be an average of the relations in the subcategories. - you should look at subgroups for example like gender- they sometimes throw in the fact of looking for gender differences- sometimes people argue that this is not fair to look at just later - Phillip Rushton- should a person be looking for what he is looking for- racial differences and intelligence- what possible good can come from this- you could find negative evidence that could argue against racism but what if you find evidence that supports the view? 
- others argue that information is knowledge Simple Random Sampling - everyone in the population has equal chance of being in the group - random is not simple - mathematicians have a hard time trying to figure out how to generalize numbers - wonderful because reduces possibility of bias but you have more cost - the bigger your sample, the more random your design- you want many many people but sometimes this gets ridiculous- do you want to have to subject people to hard procedures even when you already know what the answer is (if you already have significance) Randomness? - - which one is random? the one on the right is random- the other one is a pattern of bug behaviour (biological pattern- it is spaced out)- randomness has clusters, etc. imagine the dots are incidence rats of cancer in an area- imagine you live where the cluster is, there will always be clusters somewhere just do to random fluctuations Research Methods II Notes 109 - there will be clusters just due to chance but you can’t tell where they are- the chance of there being clusters somewhere is more likely than there being a cluster in a certain spot - where you have pre-existing groups and you randomly sample groups for example, classes in school- you don’t need to randomly sample from all students, just randomly sample 4 classes and then just sample kids out of just those classes they do this to test many products- if it flies in Winnipeg it will fly elsewhere you can combine this with multistage sampling- interested in attitudes of high school students sexual practices; figure out the school boards (10) – randomly select three of the school boards; each school board has 5 high schools- randomly select 2 schools- each has 10 classrooms; randomly select 2 classes- in there you sample everybody- you have more than one “stage”- it can make it more practical problem is if clusters differ- if school boards differ or if classrooms differ in terms of other factors - - - - - - - • • – • • – - divide world into strata and ensure you have equal numbers from each strata (even if there aren’t equal numbers in the real world) example of blacks and whites killing people- there aren’t equal numbers in reality but you sample same number- get rid of Simpson’s paradox i.e. want male and female- have two stratas and want same #’s of each Sampling Methods Simple random sampling Cluster sampling can be sued with multistage sampling Stratified sampling Proportional sampling very popular, e.g., based on S.E.S. for voting, cultural background etc. (can do later) this is where you want the ratio of different subgroups in same to reflect ratio in real world used in survey studying in attitudes i.e. who will win next election? if you have 60% of upper class- you have to take a sample that has 60% of upper class if looking at upper and lower class differences in voting and want to combine to see who will win Research Methods II Notes - • – - • • • • • - 110 example of the people who didn’t have phones so a phone interview is not a representative sample Systematic sampling every kth person, cheap, easy every X person from a student list, every 5th person that comes in the sub, convenient, easy, probably fine fine unless there is a pattern underlying survey Problems in Causal Interpretations Problems interpreting the results of this type of research will also be impeded by: third variable problem directionality (not always an issue) regression artefact (e.g., Rushton) limited range (floor and ceiling effects), look for converging evidence. 
in all of these situations you want to look for converging evidence Causation Causation not a simple concept. To show causation, you want to have: 1) an association between variables that recurs in different contexts (replication, convergent evidence) - word recurs is important- you have to have more than one study- no one study will give it to you - you need association to appear repeatedly 2) have a plausible explanation showing how the predictor variable could cause the criterion variable - a good thing is to have a large sample size 3) have no equally plausible 3rd variable that could cause the variance in the criterion variable. While correlation doesn't imply causation, causation does imply correlation - a cross correlation is a “time lag”- male and female data- smoking causes cancer, the association keeps recurring - is there a plausible explanation?- tars, etc. cause cancer in animals so it makes sense - is there an equally plausible 3rd variable- genetic predisposition to cancer, you can’t pin it on smoking, added nicotine to cigarettes so that they will be more addictive • - - • Partial Correlations Partial correlation allows you to examine the relationship between two variables with the effect of the third (or third & fourth etc.) removed from both. very powerful mathematical technique used when two variables are both influenced by a third variable- if the variable was not held constant when the data was collected it could affect the relationship between the two variables of interest but if it was recorded along with the other two, it’s impact on the third variable can be statistically evaluated this method determines the correlation between two variables while statistically controlling for the effects of a third Can be viewed as the average of the simple bivariate correlations across levels of the third, "nuisance" variable (the one that has been "partialed out"). Research Methods II Notes 111 • rYX2X1- here the 2 and 1 are subscripts and the YX and X are multiplied together Partial Correlations removes the systematic relationship with the 3rd variable statistically, by removing the linear trend, then correlate residuals. - Remember the regression line gives you the best equation that describes the relationship between the variables - in a nut shell just say you were looking at SAT and GPA scores but you thought that PE (parental education) would be a third confounded variable - correlate (draw a graph) of SAT and PE and then find residuals for this graph - correlate GPA and PE and then find residuals- both of these residual graphs will have a correlation of 0 (with the effects of one partialed out you no longer actually have a correlation) - to get the correlation of just SAT and GPA you correlate the residual scores - this is done to see if SAT and GPA will still have a significant relationship even when the effects of PE are partialed out any number of measured variables can be partialed out as if controlled for experimentally partial correlation can be tested for statistical significance with nj d.f. 
(where j= number of variables) - - - this is easy to use if you have a correlation matrix if you have three correlations from data you can get the partial correlation, and also to compare it with the semi-partial correlation formula example from our bookthe SAT test is used universally in states as a measure of knowledge- used to predict how well student will do (whether they will get into a university) we want to know whether given the info from the sats’ whether this is good data for determining how well kids will do in university but there is a third variable- parental education Research Methods II Notes - 112 we want to control for parental education the partial correlation matrix- regress GPA on parental education and then take the residuals- GPA and sat scores have no relationship with parental education when you have residual scores- the partial correlation looks at the two residual scores and tries to find a correlation partials out the effects by regressing the confounded variable instead of the simple two circles, we have three - the coefficients of determinationdegree to which the two circles overlap that you are looking at - coefficients of non-determination are 1-those values - mathematically we are getting rid of X2- the amount that is shared between X1 and Y is determined and the amount that is not shared is also looked at - we use the formula with Y (a+f) and not (a+g) because we are trying to determine y looking at amount of remaining variance so it’s not just c- have to account for the variance that is taken away by the green circle Partial Correlations - an Example Effects of alcohol consumption during pregnancy on fetal outcome Significant negative bivariate correlation But, alcohol consumption tends to correlate with tobacco (nicotine) and caffeine consumption So, partial out effects of nicotine and caffeine, alcohol still has a significant negative partial-correlation with fetal outcome. Other hypothetical example?? – overhead- smoking, stress, and colds (you want to see if smoking causes colds but seems to be confounded with stress- use stress as the one that is held constant) Partial Correlations Major improvement (over simple correlation) Research Methods II Notes 113 Problem … can be other unmeasured variables random assignment, in theory, gets them all Semi-partial (Part) Correlations Allows you to examine the relationship between two variables with the effect of the third removed from one. - - - here the effect of the third is just removed from one of the variables (not both)- the numerator is exactly the same, variables to the right of the dot have been removed from the variables to the left of that variable (the bracket tells you that the effects of x2 have only been removed from X1 and not Y (if you had no brackets- partial) doesn’t tend to be used a whole lot by itself in semipartial you are correlating residuals with raw whereas with partial you are correlating residuals with residuals - • we are only removing the nuisance variable from one Multiple Regression When looking to make predictions, third (or fourth) variables that correlate with the criterion are not nuisances, but may provide Research Methods II Notes • • • • • • • - additional information. 
With several potential predictor variables, use Multiple Regression analyses Built upon multiple correlation (just as with bivariate regression and correlation are related) Multiple Regression Equation regardless of the number of predictors, the multiple linear regression equation is: Y'=a+b1x1+b2x2+b3x3+ …bkxk where Y’ = is the predicted value of Y a = the regression constant b1 - bk = regression weights or coefficients and x1 - xk = predictor variables it is harder to draw the line- it is hard to illustrate hyperspace Multiple Regression Plane - 114 you look for the best plane that goes through the data sets rather than the best line that goes through the data points An Example- IQ, GPA and Study Time An Example: IQ, GPA & Study Time An Example- IQ, GPA and Study Time Multiple Regression Research Methods II Notes 115 two variable are unlikely to have a correlation of 0 just because of chance - how much of the variability in the y can we predict from the variability in both the x’s - a + b + c – proportion of variance you can account for by knowing the values of x1 and x2 Multiple Regression - Multiple Regression - which variable should get credit for b? this is a problem Considerations about R2 • • • - - • • - • • • R2 Coefficient of determination (total proportion of variance you can account for. 1-R2: coefficient of non determination. R2 cannot be less than the largest single bivariate r2yx the amount of variance accounted for by two predictors can not be less than the variance accounted for by one predictor? (could be the same, but can’t be less) the extent to which a new variable will improve ability to predict y will depend on how correlated it is with the variables that are already correlated Considerations about R2 With additional predictors, R2 will increase only to the extent that the new predictor is not correlated with other predictors already in the equation. Additional variables entering the equation may simply be fitting noise. two variables unlikely to have a 0 correlation- on average it would be zero, but for any one of them it might actually be high you have to… Test the significance of improved fit between observed Y and predicted Y’ after each step in the multiple regression (for most methods) Shares all the assumptions and potential pitfalls of simple linear regression Multiple Regression Methods Direct (Simple) Regression Research Methods II Notes • • • • 116 Forward regression Backward regression Stepwise Regression Hierarchical Regression Multiple Regression methods: (Note that different terminology can be used by different authors) Direct (Simple) Regression: All available predictor variable are put into the equation at once and they are assessed as if they had been entered last i.e., are assessed on the basis of the proportion of variance in the criterion variable (Y) they uniquely account for. (called simple regression in Bordens and Abbott) Forward regression: sequentially add variables, one at a time based on the strength of their squared semi-partial correlations (or simple bivariate correlation in the case of the first variable to be entered into the equation) Backward regression: start with them all then delete them on the bases of smallest change in the R2 Stepwise Regression: a combination of forward and backward: at each step one can be entered (on basis of greatest improvement in R2 but one also may be removed if the change (reduction) in R2 is not significant. 
(In the Bordens and Abbott text, it sounds like they use this term to mean Forward regression.) Hierarchical Regression: The researcher assumes control over the analyses. On basis of theory or practicality (e.g., economics). Note: this is equivalent to doing semi-partial correlations> Class 20 - multiple correlation and regression - types of multiple regression - other multivariate designs and analyses Multiple Regression - the key thing about it (the distinguishing factor) instead of worrying about correlations, we are trying to make the best prediction possible- the more variables the better - the math is a little different than a simple regression- it builds upon simple regression or simple correlation - remember that regardless the number of predictors, the multiple regression equation is Y’= a+b1x1+b2x2+etc. - we want to figure out what proportion of variance in y is shared with the two predictors Multiple regression coefficient equation Research Methods II Notes 117 R2 - R2 Coefficient of determination (total proportion of variance you can account for) 1-R2: coefficient of non determination R2 cannot be less than the largest single bivariate r2yx Considerations about R2 - with addition predictors, R2 will increase only to the extent that the new predictor is not correlated with predictors already in the equation - if two are perfectly correlated (x1 and x3)- prediction will not be improved because that bit of the square is already accounted for in x1- they can’t correlate perfectly if you want to improve predictability - increase dependent upon the squared semi-partial correlation coefficient: R2Y•X1X2X3…Xn= r2YX1+ r2Y(X2•X1) + r2Y(X3•X1X2)+… + r2Y(Xn•X1X2X3…Xn-1) - • - • - • - • - • - additional variables entering the equation may simply be fitting noise test the significance of improved fit between observed Y and predicted Y’ after each step (except simple). shares all assumptions that linear regression does Multiple Regression Methods Direct (Simple) Regression: all available predictor variables are put into the equation at once and they are assessed as if they had all been entered last how much of the variance in y can this variable and this variable alone account for? 
this is the method of choice; it is conservative; they are assessed on the basis of the proportion of variance in the criterion variable (Y) they uniquely account for (Squared Semi-Partial Correlations) called simple regression in Bordens and Abbott Forward regression: sequentially add variables, one at a time based on the strength of their squared semi-partial correlations (or simple bivarate correlation in the case of the first variable to be entered into the equation it is mindless enters them into the equation depending on how much of the variance they account for Backward regression: start with them all in the equation then delete them on the bases of smallest change in the R2 if the change isn’t greater than expected by chance than this variable is just fitting noise, so get rid of it Stepwise Regression: a combination of forward and backward at each step one variable can be entered (on basis of greatest improvement in R2) but one may also be removed if the change (reduction) in R2 is not significant in the Bordens and Abbott text, it sounds like they use this term to mean forward regression the order in which the variables are entered is based on a statistical decision, not on theory Hierarchical Regression: this is the only method with which the researcher assumes control over the analyses on basis of theory or practicality you use this is you have a well-developed theory or model suggesting a certain causal order Research Methods II Notes - - 118 this is especially important when “multicollinearity” is a problem (when your predictor variables are highly correlated with each other)- problem of which one gets credit for the overlap? when you think that one variable has relative importance over another, use this method Multicolinearity - results when variables in analysis are highly correlated - impact of this is complex and beyond the scope of this chapter - if two variables are highly correlated- one of them should be eliminated from the analysis - the high correlation means the two variables are measuring essentially the same thing Multiple regression - increased it so that ab is smaller than bc x2 would enter equation first because it is accounting for more variance in y than x1 bc is 45% of y ab is 15% of y a is .06 c is .36 at the top, the first .51 said that both of them are important (hierarchical) the second one says that only one might be important (stepwise) Research Methods II Notes 119 Multicollinearity examples - With one predictor variable- you can predict .66, with two it is .67, with three it is still .67 and fourth is still only .67- these new ones have considerable overlap with the first variable Layton & Swanson - done a while ago, thousands of school kidsgrade 9, trying to predict how well they would do in high school - verbal reasoning gets put in first- it accounts for .31 of how well they will do in grade 11, numerical ability is less, abstract reasoning accounts for 0% - if abstract reasoning put in first, accounts for 20% - how can everything change like that? - what do you think the correlation is between verbal reasoning and abstract reasoning?- it would be high - final line- if you are second in, the effects of the first one has been partialled out, if you are third, the effects of the second and first are partialed out, if you are last- not much variance left Note: Give an example like on the final exam - perhaps as homework Coleman Report - explain school achievement inequalities: - 1 important DV was verbal abilities, 60 IV’s! 
5 chosen (based on assumed importance)- blacks and whites examine separately - 3 different orders of entry - if you have a big correlation even if it goes in later in the equation- it would be really important - for whites, self-concept is consistently important - for blacks, less so: control of the environment is more important Research Methods II Notes - 120 interesting that both are quite subjective attitude measures Warsaw study - assessing massive efforts by government to achieve educational equality - spread individuals around the city (mixing SES groups that might differ in ability and achievement) - 1300 kids tested on nonverbal IQ - 3 orders of entry - successfully removed any school or area effects - family consistently significant - concluded that societal changes over a generation failed to override forces that determine social class distribution in mental performance MR is powerful and easy to (mis)use - it is extremely powerful - Anton de Mann’s talk: suicide ideation (how frequently you think about killing yourselfpeople are unlikely to commit suicide if they haven’t talked about it) - Anonymous support service, n.s. but after demographics and things that could not be controlledd - he said there were confounds that could not be controlled for- age, SES - his conclusion was that they didn’t work when there wasn’t a big change in data- a relevant question would be- do these have an affect at all?- if you put it in first, would it account for more?- he didn’t even try that- program put it in 12th - another colloquia on adolescent diabetics: - after demographic information added, education (about dangers- potential fatal- of not taking insulin regularly and maintaining strict diets) was n.s. – the programs they had had didn’t affect- the one thing that you can control for- even though data says you can’t control the fitting if it does help, why get rid of it?- the guy didn’t figure out what the % was of just the intervention - multiple R is the correlation between the predicted values of Y and the observed values of Y - R-square: the square of multiple R and provides index of the amount of variability in the dependent variable accounted for by the predictor variable Research Methods II Notes 121 Regression Weights - for each predictor variable, a table of data will provide a raw regression weight and a standardized regression weight (calculated after values of measures have been standardized) - for most research, use standardized regression weights (beta weights) because they can be compared directly even if the variables to which they apply were measured on different scales- only when variables measured on the same scale can you use the raw score regression weights Class 22 • Other Multivariate Designs • Developmental Designs • Project data analysis • • - - - Multivariate Designs and Analyses Multiple Regression: goal is to explain as much of the variance in the criterion variable (Y - the DV) based on a set of predictor variables (Xs). Discriminant Analysis: basically Multiple regression, with a categorical dependent variable. Activism Among Black South Africans: C. Motjuwadi M.Sc. this was how Clement showed the relationships between variables like friend’s support and social activism- more so than support from family did discriminate analysis used stepwise multiple regression Motjuwadi’s Discriminant Analyses - discriminant analysis is a special case of multiple regression - used when dependent variable is categorical (i.e. 
male-female or Democrati-RepublicanIndependent) and you have several predictor variables Predicting Protest Participation • gender, friend support, personal power, perceptions of injustice, & area Predicting political Membership (who would join the ANC or other parties?) • participation, gender Predicting Detention Research Methods II Notes 122 • participants, gender, area - - • - - - • - - • - this guy was supposed to get people to sign consent forms and then also fill out forms about political activism they wouldn’t do this, however- could get thrown in jail in the homelands- you have right of passage at 12- asking a 13 year old man who is head of household to get permission from his mother? She gets his permission to do things-he had to watch how he treated the people his presence on the school grounds was also illegal- risked his life for information for his masters degree Multivariate Designs and Analyses Canonical Correlation: looks at the relationship between a set of predictor variables and a set of dependent variables by creating one new predictor variable and one new dependent variable and relates these canonical variates. if you have a number of predictor variables and dependent variables- it amalgamates all your predictor variables into one and then correlates the summary variable with the mathematical summary of the dependent variables works by creating two new variables for each subject called “canonical variates”- one of these is computed for both independent and dependent variables and then the two are correlated- this correlation is called the “canonical correlation” it tends not to be used a lot- it is not clear what they will map on to its development is a purely descriptive strategy can’t be used to infer causal relationships Multivariate Analysis of Variance (MANOVA). Used when you have more than one independent variable and more than one dependent variable that you believe are related (i.e., not independent). this is anova when you have more than one independent variable we do anova to avoid probability pyramiding- if you have 20 dependent variables and do 20 different anova’s you will end up with probability pyramiding again- it controls for the familywise error rates, etc. – allows you to look for interactions amoung dependent variables just like interactions between independent variables it makes sense statistically, not used practically that much print-outs are awkward – it is easier to do anova separately- this is less difficult to interpret operates by forming a new linear combination of dependent variables for each effect in you design- a different linear combination of scores is formed for each of the two main effects and for the interaction (examples 469-473) Log-linear analysis. This non-parametric statistic is basically a multivariate Chisquared. you are dealing with frequencies of categories and dealing with contingency tables this is like chi-square but you have more than two variables chi-square is nonparametric version of t tests log-linear analysis is like a multivariate chi-square Research Methods II Notes - when you have variables that are measured categorically Log-Linear Example - looking at social behaviour between coyotes - adult-pup interactions looked at - we crudely looked at whether interactions were affiliative, rebuff, aggressive, etc. 
- we were interested in quality of interaction was dependent on whether adult or pup initiated it - it is a 2x2x4 chi-squareeverything was significant - frequency data • - - - Multivariate Designs and Analyses Path Analysis. Uses multiple regression methods to examine hypothesized causal relationships among variables with only correlational data. See how well your theoretically derived model describes relationships among variables. Can also compare competing theories about the relationships among variables. trying to figure out how correlated variables are causally related it isn’t taught so much in stats programs idea behind it- create a model of how you think variables are related and then test with your correlations- the best model “wins” simplest causal relationships would be A causing B and then A and C both causing B and then A and C both predict B but A and C are also related READ PG. 474 Possible Causal Relationships - this figure: parental education predicts scholastic achievement scores 123 Research Methods II Notes 124 One model might be that the messages children get at home could cause motivation which could be related to sat scores - or you could also have work habits directly relating to sat scores- that DIRECTLY related to scholastic achievement- all of these could be tested using correlations - both A and B relate through C and D, and D influences C which both influence E - correlation between A and B, A and C, B and C, C and D, A and D and B and D (these would be weaker because they have to go through C Causal Antecedents of Attachment Research Methods II Notes 125 Cross-correlation in Developmental Research - preference for tv violence and aggression is correlated in third grade but not thirteenth grade, but watching it at a younger age correlated more with aggression in thirteenth grade Multivariate Designs and Analyses • Factor analysis is a multivariate form of data reduction. Factor analysis is typically use to extract a relatively small number of underlying dimensions or factors that can account for relationships among measures (see example from text) - know about the different techniques and know when they are appropriate - the following table- what makes a person attracted to another person?- physical attractedness - had large number of questions rating individuals; many are correlated to each other - wanted a simpler version- did factor analysis- extracts the mathematically created factors and it tells you how your measures are related to each other - factor one- how kind someone is (this was most highly related) - had three main factors from all the factors to summarize what is going on- how good they think the other person is, how socially vital and how personally strong they are- but still only accounting for 50% of the variance - this is a method of extracting what is really relevant Research Methods II Notes 126 Multivariate Designs and Analyses are all very powerful and some are easy statistics to use, and misuse. To use these the techniques appropriately depends upon careful research design and thought. 
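To make the R² bookkeeping behind these cautions concrete, here is a small Python sketch (all data simulated; the names x1, x2, and y are invented). The gain in R² from adding a predictor is its squared semi-partial correlation, so with correlated predictors the "credit" a variable gets depends on its order of entry - the pattern behind the Layton & Swanson and multicollinearity examples above:

```python
import numpy as np

def r_squared(Y, *predictors):
    """Proportion of variance in Y accounted for by a least-squares fit
    on the given predictors (intercept included)."""
    X = np.column_stack([np.ones(len(Y)), *predictors])
    Y_hat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    return 1 - ((Y - Y_hat) ** 2).sum() / ((Y - Y.mean()) ** 2).sum()

# Simulated data (invented): two correlated predictors of one criterion
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=200)   # x2 overlaps heavily with x1
y  = 0.5 * x1 + 0.3 * x2 + rng.normal(size=200)

R2_x1, R2_x2 = r_squared(y, x1), r_squared(y, x2)
R2_both      = r_squared(y, x1, x2)

print("R2 with x1 alone:", round(R2_x1, 3))
print("R2 with x2 alone:", round(R2_x2, 3))
print("R2 with both:    ", round(R2_both, 3))
# The increments are the squared semi-partial correlations, and they depend on order of entry:
print("gain from adding x2 after x1:", round(R2_both - R2_x1, 3))
print("gain from adding x1 after x2:", round(R2_both - R2_x2, 3))
```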
- Data Collections Methods in Developmental Psychology if you are interested in evaluating changes in behaviour that relate to changes in a person’s chronological age use developmental design they are quasi-experimental designs person’s age usually serves as quasi-independent variable- age is often a variable Naturalistic Observations Interviews • structured – questionnaires – surveys • unstructured – clinical Case Studies Experimental: • lab • field Quasi-experimental • correlational • ex post facto Experimental Designs in Developmental Psychology • Longitudinal Designs : take a group and follow them over time • Cross-sectional Designs • Cohort-Sequential (Cross-sequential, time-sequential) Designs Longitudinal Designs Examine developmental changes in one cohort followed over time Cohort: people of same age that have had the same cultural experience (5 year olds in Canada and 5 year olds in Africa are not in the same cohort) Within-Subjects Quasi-analytic design Advantages: • Process of development can be followed with individuals Disadvantages: • Large investment of time and money is required (especially if the age span of interest is large) • Subject attrition can be a problem – you can lose subjects • Carryover effects (e.g., learning) can be a problem • Differences among cohorts are not addressed Research Methods II Notes 127 Cross-sectional Designs Examine two (or more) ages (or cohorts) at one time Between-Subjects Quasi-analytic design Advantages: • Fast and cheap • No subject attrition Disadvantages: • Confounds age and cohort effects – “generation” effect (influence of generational differences in experience which become confounded with the effects of age per se) • Unable to examine the process of development within individuals Cohort-Sequential Designs Combination of cross-sectional & longitudinal designs • two (or more) cohorts, each studied at two (or more) ages. (Sometimes with additional groups tested once to "fill in" the design.) Mixed Quasi-analytic design Advantages & Disadvantages • This is a compromise solution with some of the advantages and disadvantages of crosssectional & longitudinal designs • depending upon the length of the within cohort component and the number of different cohorts. - read the stuff in the text book 172-177 - result of study done cross-sectional study looking at age and IQ- if you plot IQ as a function of age - people get dumber as they get older- increases for a while - the 50 year olds here could have been that they were in the depression times-not as much schooling - the second one shows year of education and IQ- it is a third variable and it is almost perfectly correlated with age Research Methods II Notes 128 Age, Education and I.Q. intelligence increases until 50 or 60 years of age- the results you get depend on the type of measure you use and the cohort effects - Research Projects Due: Next Wed (A)/Thurs(B) • .ppt presentations due Monday (A) , Tuesday (B) • hand in on disk with your names on it to Jill Research Project Report • in APA format • all materials should be included as appendices • review verb tense, SPSS data file, SPSS output • All consent forms, raw data and/or coding sheets to be handed in (separate bundle) Research Project Presentations • Approx. 
10 minutes • all partners must participate to receive credit What we want to hear and understand • Research context - existing relevant literature (your approach to the study, why you are • – • • • • studying what you did) your hypotheses and design to test Hexp (IV/DVs, how & what you used to measure them) results (stats, figure) discuss what your results mean, relate to literature thoughtful suggestions for improvements/future research be ready for questions Don’t… - present things that will be consistent across situations (consent, method sectionrecruiting procedures, debriefed) Research Methods II Notes 129 March 30, 1999 Class 23 Discrete Trials Designs • Psychophysics • Signal Detection Theory • Course evaluations • Hand in .ppt files for presentations Characteristics of Discrete Trials Designs 1) individual subjects receive each treatment condition of the experiment dozens (perhaps hundreds) of times. Each exposure to the treatment, or trial, produces one data point for each dependent variable measured. 2) Extraneous variables that might introduce unwanted variability in the DV are tightly controlled. 3) If feasible, the order of presenting the treatments is randomized or counterbalanced to control order effects. 3) The behaviour of individual subjects undergoing the same treatment may be compared to provide intersubject replication. Analysis of Data from Discrete Trials Designs - begins by averaging the responses across the repeated presentations of a particular treatment - large number of presentations helps to ensure the resulting mean provides a stable and representative estimate of the population mean (of the mean that would be obtained if an infinite number of trials could be given to the subject under the treatment conditions) - means obtained from different treatment conditions may then be compared to determine whether they appear to differ (may or may not include inferential statistics) - analysis usually guided by theory or model of behaviour being examined - these analyses yield a small number of descriptive statistics such as the d’ (measure of sensitivity) and B (measure of response bias) in signal detection Psychophysics - this is the branch of psychology that is interested in signal detection trials Concerned with the four perceptual problems of: 1. Detection 2. Identification 3. Discrimination 4. Scaling- how big is it, how many of them are there, how far away is it - you would be asked- is there something there, what is it, might be one or two, - once you get the detection it is easier to continue - think about what the people are doing in former Yugoslavia- you think you see something, what is it, discrimination- one of ours, one of there’s, scalling- how far away is it, do we want to fire? Psychophysics Absolute thresholds are often used as the index of an individuals sensitivity to a specific stimuli, or differences between stimuli. Gustav Theodor Fechner (1860) defined the absolute threshold as the stimulus that "lifted the sensation or sensory difference over the threshold of consciousness" Research Methods II Notes - 130 amount of mechanical energy that has to be out there before you are consciously aware of something (visual or auditorally) in theory, these absolute thresholds look like this: The Absolute Threshold - we could be looking at touch, visual, etc. 
you play a soft note- start out really quiet, if they can’t hear it you snap it out a notch, at some point she still can’t hear it but you will be close to the absolute threshold- bump it up one notch and she can suddenly hear it (grandma at the hearing doctor) Method of Limits - you have trials that either increase in stimulus strength or decrease - so in this one you indicate whether you have heard it or not (no “I think something might be there”) - so at 7, they suddenly say yes - but you can’t stop here- you then try with a loud one and then stop it - how can they not hear it at 8 when they could hear it at 7 - 4 is a decreasing one- here they detect it at 7 and 8 - so their data is not always consistent- but we can get an average - just look at descending trials - just look at ascending trials - overall mean- of the total six trials - this method is sometimes used by audiologists - it is inconsistent- the absolute threshold isn’t all that absolute - first of all you have alternating – this is good - you can have anticipation and perservation errors- if on the weak trials they couldn’t hear it till the fourth one they might keep that in mind for the other trials - when you have many many trials when they can’t hear it- perservation errors where they go with what they just said in the last trial when they aren’t sure Research Methods II Notes 131 Staircase Method - the staircase method- once they say no you increase the strength and once they say yes you decrease the strength - you are toggling around the absolute threshold - you can track the person’s perceptual sensitivity to the perceptual stimuli - it should track around the persons threshold - it allows you to do experimental manipulations if you use this as a stable baseline- you could introduce a drug treatment, etc. - you can track changes over time or introduction of drugs or other experimental treatments - remember that this is all for the same stimuli- fixed wavelength or tone Why do Thresholds Seem to Vary? - why isn’t it absolute? Stimuli being presented is not the only one that the subject is experiencing (you are never without sensory activity)- cave example Constant background stimulation for any signal Endogenous noise Noise - any background stimulus other than the one to be detected. Can be visual, chemical, mechanical, thermal, or auditory. - in this research there are differences between what you are trying to look at and then everything else you aren’t looking at (noise) Can also be lapses of attention, fatigue, and other psychological changes. - - - Determining the “Absolute” Threshold: Method of Constant Stimuli so with real data, trying to determine the absolute threshold use “Method of constant stimuli”- very precise method; good idea of persons perceptual abilities takes a lot of effort you have one particular stimulus- 8 different intensities for each intensity you have a measure of stimulus strengthone way you can do this is how likely the person is to say they heard it Research Methods II Notes - - - - 132 for 1 and 2 you had 6%each of the data points might be 100 trials or 50 , etc. – so you have 800 trials (for one frequency) the plots are called “Ogive’s”- when you plot perceptual perceived against stimulus intensity so we’ve run 800 trials on grandma- what is her absolute threshold somewhere between 1 and 8 Psychophysics Basic assumption in doing psychophysics is that any type of behaviour has some strength. In Psychophysics the measure of strength most often used is response probability. 
Psychophysics
The basic assumption in doing psychophysics is that any type of behaviour has some strength. In psychophysics the measure of strength most often used is response probability:
p(yes) = # yes responses / (# yes + # no responses)
- they have to assume this because you can't measure the actual brain of people (putting electrodes on the brain, for example)
- we had likelihood of perception on the y-axis, but here you use response probability (the probability with which they say they detected it - the equation above)

Determining the "Absolute" Threshold: Method of Constant Stimuli
- our absolute threshold will be somewhat arbitrary because there is no abrupt increase
- traditional response: the point at which they say they can detect it 50% of the time (called the 50% threshold)

Approximate Thresholds
Vision: a candle flame from 48 km on a dark, clear night
Audition: a wristwatch from 6 m in a quiet room
Taste: 1 tsp of sugar in 7.5 litres of water
Olfaction: 1 drop of perfume in a 3-room apartment
Touch: a bee's wing falling on your cheek from 1 cm

Signal Detection Theory
A mathematical, theoretical system that recognises that individuals are not merely passive receivers of stimuli. Participants are also engaged in the process of deciding whether they are confident enough to say "Yes, I detect that stimulus" when engaged in psychophysics experiments.
- participants are engaged in decision-making processes
- will they be biased towards saying yes or no?
- e.g. fighter pilots - you need perfect vision, so you might be biased towards saying yes; or if you want to be compensated for auditory loss you might be biased towards saying no
- on some trials you are not going to be sure

Signal Detection Theory
Problem: subjects may wish to appear sensitive (or insensitive). Subject bias.
To account for the decision-making component, we can introduce "catch trials".
- they put in catch trials to see if people are lying - if they say they are sensing something when nothing was really there
- there are only two options: a catch trial or an experimental trial
- you either say yes or no
- it's either going to be there or it's not
With two possible experimental trials (signal present or absent) and two possible participant responses ("yes" it is present or "no" it isn't there), there are four possible outcomes on each of many trials.
Participants' responses on each trial are going to be consequences of both their perceptual sensitivity to the stimuli presented and their decision strategy, or bias toward saying something is there or not when they are in doubt.

Manipulating Bias
By varying the conditions of the experiment, bias can be altered:
• alter expectations
• or alter the relative importance of the two types of error
(Payoff matrix)
- I'll give you a loonie every time you get it right, and take back a quarter every time you have a false alarm
- pretty good odds - but if it were the other way around it would be different; it wouldn't change your sensitivity, but you would probably get different results

Outcome Matrix: Signal Present 50% of Trials
- you tell the person that there is only going to be a signal 50% of the time
Outcome Matrix: Signal Present 90% of Trials
Outcome Matrix: Signal Present 10% of Trials
- all of this variability is from us messing around with the decision-making process

Isosensitivity (ROC) Curve
- each of the data points summarizes one of the outcome matrices - plotting the probability of a hit against the probability of a false alarm
- the guess line is the diagonal - there she'd just be guessing
- d' is the index of sensitivity
- get an idea of the difference between guessing and the other line
- the person who generated the red line wasn't as sensitive as the person who generated the blue data, because they were closer to guessing
- the same person can generate both lines if they are for different frequencies

Calculating d' From a Single Outcome Matrix
The data required for each point on an isosensitivity (ROC) curve require hundreds of trials (to get accurate probabilities for hits and false alarms). With a few assumptions, d' can be calculated from a single outcome matrix using Signal Detection Theory.
- you need many, many trials to get more accurate probabilities
- we can calculate d' from one single outcome matrix

Signal Detection Theory Assumptions
1) Noise is normally distributed.
- sometimes it's high, sometimes low, most likely it is in the middle (a normal distribution)
- presenting the signal adds a fixed amount on top of the background noise, so you shift the amount of sensory activity by a constant amount
Presenting a signal on top of that noise will therefore shift the amount of sensory activity to the right (higher), by an amount equal to that sensory system's sensitivity to that signal. The difference between the mean amount of sensory activity generated by the noise-alone trials and the signal+noise trials will equal sensitivity (d'), measured in z-score (standard deviation) units.
- some levels are more likely to occur, some are less likely to occur (the tops of the curves are more likely to occur)
- you shift the whole distribution to the right - now that system is going to show an increase proportional to its sensitivity to the signal
- d' becomes the indication of the sensitivity of the system

Stronger Signal (or More Sensitive Receiver)
- crank up the signal, test a more sensitive person, or change the frequency - any of these will result in this
- still shifted over by a fixed amount that is equal to the sensitivity to the signal

Signal Detection Theory Assumptions
2) Participants adopt a criterion (β) for dealing with those values of sensory activity that could result from either noise alone or signal plus noise (the area where the noise and signal+noise distributions overlap). If the amount of sensory activity exceeds that amount, the participant will say they detected the signal; any amount less than that and they will say they did not detect the signal.
- this adoption of a criterion is not conscious - they are setting a cut-off (a kind of threshold) for the decisions that occur in the presence or absence of a signal
- the issue is the overlap, where they have to guess
- if the amount of sensory activity is greater than their criterion they will say yes; if it is lower, they will say no, there is nothing there
- it is not conscious, remember
- they are acting as if they are doing this, but it is not conscious
- if we make this assumption we can save ourselves hundreds of trials

Manipulation of Bias
We can now interpret the manipulation of a receiver's motivation to say "yes" when in doubt (by changing either expectations or payoffs) as affecting the placement of the criterion.
- you are introducing a lax or liberal criterion when you do the payoff thing; you are not influencing the actual sensitivity, only the decision-making process
- all that this manipulation of bias has done is move where you put your criterion within the range of overlap
- we haven't changed sensitivity
- where the criterion is put has no influence on sensitivity - it is only a reflection of bias

Sensitivity
Criterion location has no effect on sensitivity.
Sensitivity refers to the average amount of sensory activity generated by a signal compared with the average amount of noise-generated sensory activity.
- the difference in the means of the distributions
- look on the links page about Fechner

Class 24
• Remaining presentation(s)
• Signal Detection Theory - conclusion
– tutorial at the DOS prompt: type "percept"; at the menu pick "E. Theory and Methodology"; at the next menu pick "B. Signal Detection Theory" (the introduction works, part B usually doesn't)
• Using theory
• Review for Final (April 17th, Auxiliary Gym, 9-12)
• Pick a date for the review class

Signal Detection Theory
Calculating d' From a Single Outcome Matrix
The data required for each point on an isosensitivity (ROC) curve require hundreds of trials (to get accurate probabilities for hits and false alarms). With a few assumptions, d' can be calculated from a single outcome matrix using Signal Detection Theory.

Signal Detection Theory Assumptions
1) Noise is normally distributed. Presenting a signal on top of that noise will therefore shift the amount of sensory activity to the right (higher), by an amount equal to that sensory system's sensitivity to that signal. The difference between the mean amount of sensory activity generated by the noise-alone trials and the signal+noise trials will equal sensitivity (d'), measured in z-score (standard deviation) units.
- d' is the increase in sensory activity between the two - d' is a measure of sensitivity
2) Participants adopt a criterion (β) for dealing with those values of sensory activity that could result from either noise alone or signal plus noise (the area where the noise and signal+noise distributions overlap). If the amount of sensory activity exceeds that amount, the participant will say they detected the signal; any amount less than that and they will say they did not detect the signal.
- you can't possibly know - there is this range of sensory activity in which you have to guess; you can't know for certain whether there was a signal
- your system sets the criterion - as long as the level exceeds the criterion you will say yes; if it doesn't, you will say no
- in the orange (overlap) area, you could either be hearing a signal with relatively low background noise, or you could just be hearing more background noise

We can now interpret the manipulation of a receiver's motivation to say "yes" when in doubt (by changing either expectations or payoffs) as affecting the placement of the criterion.
- if a person is quite willing to say yes, they have a lax or liberal criterion - they say yes on most of the trials when the signal is there, but also when it isn't
- very high false alarms
- a person with a strict (conservative) criterion requires more sensory activity to say yes
- they will hardly have any false alarms - only on those signal-absent trials on which the background noise is greatest
- they are conservative about saying yes

Sensitivity
Criterion location has no effect on sensitivity.
Sensitivity refers to the average amount of sensory activity generated by a signal compared with the average amount of noise-generated sensory activity.
- with these assumptions, the four cells in our matrix can be related to the areas under the two normal curves, with the criterion dividing them

Signal Detection Theory
With two assumptions: 1) noise is normally distributed, and 2) participants adopt a criterion (β) for dealing with those values of sensory activity that could result from either noise alone or signal plus noise, the four cells of an outcome matrix (Hits, Misses, False Alarms & Correct Negatives) can be represented as areas under the two normal distributions.
- when you say yes and the signal is there: the proportion of the signal+noise curve to the right of the criterion will be the proportion of hits
- here the person has a miss
- false alarm: this part corresponds to the false alarm rate
- the signal is absent and the person says no: a correct negative
- d' is the difference between the two means
- we can measure how far apart the means are

Signal Detection Theory
d' can then be measured in z-score units by:
d' = ZFA - ZHit
- we use this because of the areas under the curve - you want to know how many SDs apart the two curves are
Tables of the z-score distribution, or of the percent area under the normal curve, typically present the z-score distance between the mean and the criterion value (β). If you are using such a table, ZFA can be found by looking up the z-score associated with (50 - False Alarm %).
- the tables give the area under the curve between the mean and the z-score
d' = ZFA - ZHit
If this number (50 - FA%) is positive, then the z-score to be put into the above formula will also be positive; if it is negative, the z-score value for the formula will also be negative. It is essential that the proper signs be used. A good way of checking is to draw the distributions and the criterion and see the relationship between d' and the two z-scores. Similarly, to find ZHit, look up (50 - Hit %); again, the resulting sign will be the same as the sign used for the z-score in the formula.
Example
- the tradition is to look at hits and false alarms
- the hit rate is 60% - over half
- the false alarm rate is 20%
- 20% of the signal-absent curve is to the right of the criterion
- we are interested in finding d' - the difference between the two means

                 Signal Present   Signal Absent
"Yes" response        .60              .20
"No" response         .40              .80

- z-scores to the right of the mean are positive, to the left of the mean negative
d' = ZFA - ZHit = Z(50-20) - Z(50-60)
- the z-score associated with 50-20 = 30% of the normal curve is .842; for 50-60 = -10% it is -.253
d' = .842 - (-.253) = .842 + .253 = 1.095 z-score units

Second Example

                 Signal Present   Signal Absent
"Yes" response        .95              .75
"No" response         .05              .25

Did this person have a lax or strict criterion? Lax - pretty willing to say yes.
d' = ZFA - ZHit = Z(50-75) - Z(50-95)
- the z-score associated with 50-75 = -25% of the normal curve is -.675; for 50-95 = -45% it is -1.645
d' = -.675 - (-1.645) = .970 z-score units

- start everything with just the graph and β (the person's criterion), then start putting the distributions in when you know how often they say yes, etc.
- d' is the increase in sensory activity level from one distribution to the other; we are looking at normal distributions; z units are used
- when you have Phit = .5 and Pfa = .3, the z-score associated with Phit is 0 and the z-score associated with Pfa is .524
- let's do another one, with someone with a very strict criterion: Phit = .15 and Pfa = .05
- Phit is the proportion of times they say the signal is there when it is - so you subtract this from 50 (35% must lie between the mean and the criterion)
- Pfa is when they say it is there when it isn't
- remember we are always looking at when they say it is there (the areas of both curves to the right of the criterion)
- you also have to look up 45%; both numbers are positive - the criterion is located to the right of both means (so both z-scores are positive) - look up the numbers, etc.
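These table lookups can be checked with software. Below is a minimal sketch (it assumes scipy is available; the function name d_prime is mine) using the inverse normal CDF. It is written as d' = z(Hit) - z(FA), which is algebraically the same as the ZFA - ZHit table method above, and it reproduces the worked examples. The loop at the end shows the point from the bias slides: moving the criterion changes hits and false alarms, but the recovered d' stays the same.

```python
# Sketch: d' from a single outcome matrix, checking the worked examples above.
# Assumes scipy is installed; norm.ppf is the inverse of the normal CDF.
from scipy.stats import norm

def d_prime(p_hit, p_fa):
    """Sensitivity (d') from hit and false-alarm probabilities."""
    return norm.ppf(p_hit) - norm.ppf(p_fa)

print(round(d_prime(0.60, 0.20), 3))   # ~1.095 - first example
print(round(d_prime(0.95, 0.75), 3))   # ~0.97  - lax-criterion example
print(round(d_prime(0.15, 0.05), 3))   # ~0.61  - the strict-criterion case left unfinished above

# Criterion location does not affect sensitivity: fix d' = 1.0 and slide the criterion.
for c in (0.0, 0.5, 1.0):              # criterion, measured from the noise-alone mean
    hit = 1 - norm.cdf(c - 1.0)        # area of the signal+noise curve right of the criterion
    fa = 1 - norm.cdf(c)               # area of the noise-alone curve right of the criterion
    print(round(hit, 2), round(fa, 2), round(d_prime(hit, fa), 2))   # d' comes back as 1.0 every time
```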
- all of this is a nice solution for dealing with bias - a way to test people even when they don't themselves know whether they are right

Sensitivity to Pain: An Experimental Study of Acupuncture
- what happens: Lx is the measure of bias - the person receives the same amount of thermal energy but is less likely to say it is painful
- just remember: if the criterion is to the right of the mean, you subtract the % from 50; if the criterion is to the left of the mean, you subtract the % from 50 as well, but the number you eventually get from the table you make negative

Using Theory: Chapter 15
- a scientific theory is one that goes beyond the level of a simple hypothesis, deals with potentially verifiable phenomena, and is highly ordered and structured
- it consists of a set of interrelated propositions that attempt to specify the relationship between a variable and some behaviour
- characteristics of scientific theories:
a) describes a scientific relationship (one established through observation and logic) that indicates how variables interact within the system to which the theory applies
b) the described relationship cannot be observed directly - its existence must be inferred from the data (if you could observe the relationship directly there would be no need for a theory)
c) the statement is only partially verified (the theory has passed some tests, but not all relevant tests have been conducted)

Distinctions among hypotheses, laws, models, & scientific theories:
• Hypotheses - less complex than theories; look at only one variable at a time, unlike theories, which can look at a system of variables
• Laws - a theory that has been substantially verified; not subject to disconfirmation the way theories are
- laws may idealize real-world relationships (such as the law of "ideal" gases, when in reality there are no ideal gases) but hold well enough for most purposes
• Models - refers to a range of concepts; in most cases "model" refers to a specific implementation of a more general theoretical viewpoint (like classical conditioning)
- can represent an application of a general theory to a specific situation

Ways to Distinguish Among Theories
1) Quantitative vs. Qualitative
- a quantitative theory is expressed in mathematical terms; it specifies the variables and constants with which it deals numerically and relates the numerical states of the variables and constants to one another (e.g. the theory of information integration)
- a qualitative theory is any theory that is not quantitative; they tend to be stated in verbal terms; they state which variables are important and how they interact; they can describe quantitative relationships, but not measured on a scale higher than ordinal (e.g. Noam Chomsky's theory of language acquisition)
2) Levels of Description
- some theories are primarily designed to describe a phenomenon, while others attempt to explain relationships among variables
• Descriptive theories: a theory that merely describes a relationship; does not give explanations (e.g.
Kepler's theory that the planets move around the sun - it described the motion but did not say why it happened); the trap is to think you have explained a phenomenon when all you have really done is given it a name
• Analogical theories: explain a relationship through analogy; you must equate each variable in the physical system with a variable in the behavioural system to be modeled; you can then plug in values and apply the rules of the original theory to generate predictions
• Fundamental theories - really explain what is going on: a theory that proposes a new structure or underlying process to explain how variables and constants relate; includes assumptions; rare in psychology (e.g. Festinger's cognitive dissonance theory - when two attitudes or behaviours are inconsistent, cognitive dissonance is aroused)
3) Domain of a Theory - whether it applies to just certain situations or to all animals
- domain = scope
- this dimension looks at the range of situations to which the theory may be legitimately applied
- a theory with a wide scope can be applied to a wider range of situations than can a theory with a more limited scope
- the chances of dealing adequately with a range of phenomena are better for a small area of behaviour than they are for a large area
- most psychological theories are very specific

Some Roles of Theory in Science
• Understanding: theories represent a particular way to understand the phenomena they deal with
• Prediction: even when theories do not provide fundamental insight into the mechanisms of a behaving system, they can provide a way to predict behaviour
• Organizing & interpreting data: theories can provide a sound framework for organizing and interpreting research results; research results can be interpreted in light of a theory
• Generating research: theories provide ideas for new research; this is known as the "heuristic value" of a theory (often independent of its validity)

Characteristics of Good Theories
- whether a theory falls by the wayside or stands the test of time is determined by whether theories:
• Account for data: to be of value, a theory must account for most of the existing data within its domain; not all data, because some data might be unreliable
• Have "explanatory relevance": the explanation for a phenomenon provided by a theory must offer good grounds for believing the phenomenon would occur under the specified conditions
• Are testable: a theory is testable if it is capable of failing some empirical test; e.g. the problem with Freud's theories (they are not testable - they cannot be proven wrong)
• Predict novel events: a good theory should predict new phenomena beyond those for which the theory was originally designed; e.g. Einstein's theory of relativity accounted for the same data and produced the same predictions for a wide range of phenomena as Newtonian mechanics did, but it went beyond Newton's to predict phenomena not expected to occur from Newton's point of view
• Are parsimonious (simple): a theory that makes relatively few assumptions

Steps in Developing Theories
1. Defining the scope of your theory: defining the domain or scope of the theory; you want to develop a theory that will provide explanations for the observed relationships
2. Knowing the literature: become thoroughly familiar with current and past research in the area the theory will cover; know what lawful relationships have already been discovered, etc.
3. Formulating the theory: requires effort, insight, inspiration, and luck; you need to make random pieces fit together neatly to reveal a coherent picture
– preparedness: for theories that come almost by accident, you must already be prepared (Archimedes example)
– using analogy: ideas for a theory may develop from knowledge of other well-understood systems that behave similarly to the system you are trying to understand
– using introspection: at least when dealing with your own behaviour, you have access to private information concerning perceptions, problem-solving strategies, memories, etc.
– problems with this method:
– important aspects of what you are looking at might not be conscious
– the act of attempting to examine one's own mental processes might interfere with the processes themselves
– observable mental events may have no causal role in the operation of the system
– for theories used to explain the behaviour of animals you can't use introspection or "anthropomorphizing" (attributing human characteristics to them)
4. Establishing predictive value: check the theory's predictiveness against existing data; the theory should be adequate to account for the relationships already discovered (you can modify the theory, or ignore discrepancies, if predictions do not fit) - should you ignore discrepancies? (the world is not ideal, experimental conditions only approximate the conditions specified by the theory; use your own judgment)
5. Testing your theory empirically: set up the specified conditions and observe whether the outcome agrees with the predictions

What Rule Generates This Sequence? 2 4 8 16 32 ?
Formulate your hypothesis, then test it; I will tell you if your number is acceptable as the next number in the sequence.
Numbers get larger.
- you could just say 64, which would be correct, but the real rule is simply that the numbers get larger
Hexp: All "X" also have a star

Testing Theories
Both confirmation and disconfirmation strategies can be used to test theories.
- confirmation of theories
- if relationships predicted by a theory are observed in empirical data, then the theory has been supported
- when a theory is supported you have more confidence in the ability of the theory to explain and predict phenomena within its domain - it does not mean the theory has been proven
- disconfirmation of theories
- although you can never prove a theory correct, you can prove it wrong
- this is hard, however, because when theories are disproved experimenters can often blame other factors for the outcome
- strategies for testing theories:
- Kuhn (1970): the history of science reveals that most theories continue to be defended and elaborated by their supporters even after convincing evidence to the contrary has been amassed
- the new view only takes hold when supporters of the old view die off
- how can this be avoided? the process of "strong inference": a strategy for testing a theory in which a sequence of research studies is systematically carried out to rule out alternative explanations for a phenomenon
- confirmation strategy: looking for confirmation of the theory's predictions
- disconfirmation strategy: looking for evidence that will disconfirm the prediction
It is best to use both.
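A toy sketch of the 2-4-8-16-32 game above (my own illustration, not from the course): suppose the hidden rule really is just "each number is larger than the last." A purely confirmatory tester who only proposes doubling sequences never finds that out; a disconfirmation test, one that the narrower "doubling" hypothesis says should fail, is what separates the two hypotheses.

```python
# Sketch of the sequence game: the hidden rule is only "numbers get larger",
# even though 2, 4, 8, 16, 32 also fits the narrower "double each time" hypothesis.
def acceptable(seq):
    """Hidden rule: each number must be larger than the one before it."""
    return all(b > a for a, b in zip(seq, seq[1:]))

# Confirmation strategy: only test cases the "doubling" hypothesis predicts should pass.
print(acceptable([2, 4, 8, 16, 32, 64]))   # True - consistent with both hypotheses
print(acceptable([3, 6, 12, 24]))          # True - still can't tell the hypotheses apart

# Disconfirmation strategy: test a case the "doubling" hypothesis says should fail.
print(acceptable([2, 4, 8, 16, 32, 33]))   # True - the doubling hypothesis is disconfirmed
print(acceptable([5, 3, 1]))               # False - probes the boundary of the real rule
```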
- you need both to adequately test a theory
- usually you pursue a confirmation strategy when the theory is fresh and relatively untested
- if the theory survives these tests, you pursue disconfirmation strategies
- the object during this phase of testing is to determine whether outcomes that are expected from the point of view of the theory always occur
- you don't want to have only confirmational testing because it is biased - you have to have testing that tries to disprove the theory as well as prove it

Review: Since Last Midterm...
• Quasi-analytic experiments: bivalent correlation designs
• Calculate the extent to which the two variables are systematically related
• Graph data (scatterplot): the predictor (assumed causal or IV) variable on the abscissa (x-axis) and the criterion or DV on the ordinate (y-axis)
• Pearson's product-moment correlation coefficient (for interval or ratio data) measures the direction and degree of association
• r is the mean of the z-score cross-products (see the sketch after this list)
• Cautions - assumes linear relations; truncated ranges, outliers, heteroscedasticity, combining group data - so: examine scatterplots!!
• Problems interpreting the results of this type of research:
– third variable problem
– directionality (not always an issue)
– regression artifact (e.g., Rushton)
– floor and ceiling effects
– look for converging evidence
• Correlation versus ex-post-facto design
• Interpretation problems are not related to the statistical choice; rather they are due to the design
• Causation is not a simple concept
• Simpson's Paradox
• Partial correlation
• Remember: there can be other confounding variables not measured
• Semipartial correlations (sometimes called part correlations)
• Multiple correlation and regression
– formulae, types, importance of order of entry
– considerations about R²
• Developmental designs
• longitudinal, cross-sectional, & cohort-sequential
• Cohort: a group with common experiences
• Sampling types: random, stratified, proportional, systematic, cluster (multistage)
• Volunteers
• Discrete trials designs - psychophysics
• Signal Detection Theory
• teasing apart sensory ability and the decision to say "yes"
• Isosensitivity curves, also called ROC (receiver operating characteristic) curves
• calculate d' from hit and false alarm probabilities (using tables of areas under the normal curve)
• Scientific theories: types of theories, functions of theories
• Evaluation on the basis of: parsimony, testability, precision
• Confirming vs. disconfirming strategies (confirmational bias)
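A minimal sketch of the "mean of z-score cross-products" bullet, with made-up numbers. It uses population-style SDs (dividing by N), which is what makes the mean cross-product equal r exactly.

```python
# Sketch: Pearson's r computed as the mean of z-score cross-products.
# The data are made up; z-scores use population SDs (divide by N) so the mean
# of the cross-products equals r exactly.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = (sum((v - mx) ** 2 for v in x) / n) ** 0.5   # population SD of x
    sy = (sum((v - my) ** 2 for v in y) / n) ** 0.5   # population SD of y
    zx = [(v - mx) / sx for v in x]
    zy = [(v - my) / sy for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / n     # mean of the cross-products

print(round(pearson_r([1, 2, 3, 4], [2, 4, 5, 9]), 3))   # ~0.965 for these made-up values
```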