Lecture Notes for EDM6401, Quantitative Methods in Educational Research
Chang Lei, Ph.D., Professor

1. An Overview of Educational Research

Four Ways of Knowing

Method of Tenacity
Truth is true because one believes it, even in the face of contradicting evidence. Superstition.

Method of Authority
Truth is true because an authority says so. Religion.

Method of Intuition
Truth is true because it is logical. It derives from reasoning but does not bear empirical support. Philosophy.

Method of Science
Science is a method of seeking truth. This method only accounts for solvable problems that have empirical solutions based on observable events. Some major components of the scientific method include:
empirical evidence vs. refusal of contradicting evidence
random sampling vs. isolated or selected events
countering rival explanations
replication and public inquiry
Truth is routinely challenged, tested, and retested by the public. There is no truth, only a temporary state in which a claim has not yet been disproved or rejected as untrue.

The "Low" Status of Social Sciences

Difficult to replicate
The more developed a discipline, the higher the probability of detection and replication.

Value judgement
Science is considered "pure" and free from value judgement.

Too "obvious"
People tend to be uninterested in questioning what is already "obvious" to them. Nor do people want to be bothered with what they do not know. Thorndike: "That is the fate of educational research. If it comes out the way people thought it should, they ask, 'What is the point?' If startling conclusions emerge, people say, 'I do not believe it'."

Norms of Science

Universal Standards
The quality of research is judged by universal standards regardless of the experience, race, sex, affiliation, or other characteristics of the researcher. e.g., the blind review process.

Common Ownership of Information
Scientific information is not proprietary but is owned and freely shared by all; publication is not only a right but an obligation of a researcher. e.g., data are to be shared on request. "Publish or perish" enforces this norm.

Integrity in Gathering and Interpreting Data
The researcher displays disinterestedness and an impersonal tone when gathering data or presenting a point of view. e.g., cite the researcher's own name or use "the author" to minimize personability.

Organized Scepticism
It is the responsibility of the community of scientists to be sceptical of each new knowledge claim, to test it, to try to think of reasons the claim might be false, and to think of alternative explanations. This challenge to new knowledge is sought in science, e.g., conference debates, validation studies.

The Roles and Outcomes of Research

Exploratory
Discover new phenomena and relationships among phenomena that are missed by others. Qualitative research plays an important role here. e.g., a counselling psychologist wants to know what things make an effective counsellor.

Explanatory
Develop new theories or use existing theories to account for the observations. e.g., Dollard and Doob (1939) theorized that frustration leads to aggression from the observation that a child strikes out when deprived of a toy. According to social learning theory, a good role model is important for school achievement.

Validation
Validating and replicating existing research and theory is an important part of science, using different samples, populations, and research methods.

Three Components of Educational Research Methodology
1. MEASUREMENT (PSYCHOMETRICS)
Instrumentation, Reliability, Validity

2. RESEARCH DESIGN
Sampling, Designs, Internal and External Validity
Experimental, quasi-experimental, and non-experimental research.

3. DATA ANALYSIS (STATISTICS)
Descriptive statistics, Hypothesis testing, and Various analytical techniques

The Process of Scientific Inquiry
1. Identification of a research problem. (why)
2. Consult the literature for a solution. (find out why)
3. Formulation of testable hypotheses on the basis of existing theory and/or experience. (a tentative solution)
4. Design a study with efforts to minimize extraneous factors that may contribute to the same phenomenon or relationship you hypothesized. (design a study)
5. Data collection. When the behaviour of subjects is measured or observed, the measurements or observations become empirical data. (carry out the study)
6. Data analysis. Data are summarized in such a way that the summary bears on the research questions and hypotheses. Statistics are used to generalize from sample to population. (report the findings)
7. Interpretation of data, adding to the existing body of knowledge. (this is why)

Research Report:
* Title Page
* Abstract
* Introduction
  Problem, Significance, Justifications, and Hypotheses, which are integrated in the literature review
* Method
  population, sample, procedures, measurement, designs
* Results
  Present the results in the order of the hypotheses
* Discussion
* References
* Tables and Figures

How the steps of the research process map onto the report sections:

Steps 1-3 -- identification of a research problem (why); consulting the literature for a solution (find out why); formulation of testable hypotheses on the basis of existing theory and research (here is a solution) -- belong to the
1. INTRODUCTION SECTION: objectives and significance of the study, literature review, research questions and hypotheses, independent and dependent variables.

Steps 4-5 -- designing a study to minimize extraneous factors that affect the same phenomenon or relationship you hypothesized (a plan to test the solution); data collection, where the behaviours are experimentally manipulated or observed and the outcomes become data (carry out the plan) -- belong to the
2. METHOD SECTION: sample, design (experimental, quasi-experimental, non-experimental), procedure, validity threats, measurements, reliability and validity.

Step 6 -- data analysis, where data are summarized in such a way that the summary bears on the research questions and hypotheses (report it) -- belongs to the
3. RESULTS SECTION: ANOVA vs. regression framework, significance tests, confidence intervals.

Step 7 -- interpretation of data, adding to the existing body of knowledge (why? this is why) -- belongs to the
4. DISCUSSION SECTION: theory and explanation, limitations and future directions.

Ways to Locate a Research Problem
1. Identify broad areas that are closely related to your interests and professional goals and write them down.
2. Then choose, among the areas that relate to your future career, an area or a research topic that is feasible.
3. Collaborate with other people; join on-going projects.
4. Read textbooks, where rather comprehensive topics in a field are summarized and problems and future research needs are identified; journal articles, for the state of the art of the field and the authors' recommendations; review articles, for both.
5. Test a theory.
6. Replication. Replicate a major milestone study. Replicate studies using different populations, samples, and methods.
7. Observations. Observe carefully the existing practices in your area of interest.
8. Develop research ideas from advanced courses you take.
9. Get ideas from newspapers and popular magazines.
Variable and Constant

A variable is an attribute or characteristic of a person or object that varies from person to person, object to object: student achievement, motivation, blood pressure, etc. A constant is an attribute that does not vary from person to person: pi, for example. In relation to a particular population, age is a constant for 6th graders, religion for parochial schools, gender for male prisoners, etc.

When you ask a research question, you ask about variables; you want to know the relationship among variables. Why don't students learn? Is it because they are in poor health, not motivated, distracted by family problems, by crime, or because the teachers are not qualified? You end up with a question regarding the relationship among variables: Is there a relationship between motivation and achievement?

The independent variable represents the research interest and is manipulated (experiment) or measured (non-experiment) to see the results of its change on the dependent variable.

The dependent variable is the observed outcome in response to the independent variable. It is used to evaluate the independent variable.

A control variable is a variable that is either made into a constant or is included in the study (even though it is not of interest) to control or neutralize factors extraneous to the research question.

Operational Definitions

Assign meaning to a construct or a variable by specifying the activities or "operations" necessary to MEASURE or MANIPULATE it. Redefine a concept in terms of clearly observable operations that anyone can see and repeat. These observable and replicable operations can take the form of an experiment or of a measurement instrument. As Cronbach and Meehl (1955) point out, it is rare for a construct to receive one commonly endorsed operational definition. To some researchers, hunger is defined in an animal experiment as "amount of time since last feeding." It may also be defined by others as "amount of energy an animal would expend to seek food." Thus, it is important to be operationally clear about a particular construct, so that other researchers understand, for example, what the construct "hunger" is intended to mean.

Measured operational definitions (more often used): Intelligence is defined as scores on the Woodcock-Johnson Tests of Cognitive Abilities. Vagueness of lecturing is defined as using the following words: a couple, a few, sometimes, all of this, something like that, pretty much. School achievement is defined as one's GPA. Socioeconomic status is defined by the number of years of education and the amount of salary the head of a family receives. Popularity is defined operationally by the number of friendship nominations a student receives from his/her schoolmates.

Experimental operational definitions: Recall is defined by asking subjects to recite items shown to them from a stimulus list and assigning a point for each item that matches one on the list. Recognition is defined by showing subjects items and asking them to decide whether they were part of the stimulus list. Aggression is defined as the number of times a child hits a toy doll after watching a violent TV show.

Distinguishing Between Two General Types of Literature Reviews

There are two general types of literature reviews, each possessing unique as well as common characteristics. Making the distinction prior to embarking on the review is important to both your own mental health and the quality of the product.
The two types are:
1. A critical review of a literature
2. A review of literature relevant to a research proposal

Following are some "is" and "is not" characteristics of each type:

Critical Review of a Literature
Is a place where you may review the body of literature that bears on a problematic area -- or even examine all the research that relates to the specific question raised in a research proposal.
Is an activity the product of which is devoted to critical retrospective on scholarship -- and publishable in such journals as Psychological Bulletin and Psychological Review.
Is not encumbered with supporting the conceptual framework of a proposed study or with justifying study design and methodology decisions.

Review of Literature Relevant to a Research Proposal
Is an obligation to place the question or hypothesis in the context of previous work in such a way as to explain and justify the decisions made.
Is a product that reflects a step-by-step explanation of decisions, punctuated by references to studies that support the conceptual framework and ongoing argument.
Is not a product to educate the reader concerning the state of science in the problem area; nor is it to display the thoroughness with which the author pursued a comprehensive understanding of the literature.

How to Write a Lit Review
Break up the review into several topic areas. Organize all the findings under the various topics into a unified picture of the state of knowledge in the area reviewed. The process of combining and interpreting the literature is more difficult than merely reviewing what has been done.
Use two to three studies that are most pertinent and well done as the foundations of your review topics. Use similar studies as support.
Write the review as if you are expressing your own thoughts and developing and building your own arguments and themes, not as if you are reporting others' work. Don't list articles one by one, and don't use the same format throughout, e.g., "Baker found...". Rather than citing everything in an article in one place, cite an article multiple times to fit your different themes.
Write down your thoughts and paraphrase important points of the articles as you read. It may not be a good idea to read all the articles first and then write. Look over the articles before copying them. Read several carefully before looking for more.

2: Hypothesis Testing

Criteria for a good hypothesis:
1. The hypothesis should state an expected relationship between two or more variables.
2. The researcher should have definite reasons, based on either theory or evidence, for considering the hypothesis worthy of testing.
3. The hypothesis should be testable. The relationship or difference stated in a hypothesis should be such that measurement of the variables involved can be made and the necessary statistical comparisons carried out, in order to determine whether the hypothesis as stated is or is not supported by the research.
4. The hypothesis should be as brief as possible.

Examples:
There is a gender difference in the perception of body sensations. Women and men use physiological cues (internal) and situational factors (external) differently in defining bodily state. Women, compared to men, make greater use of external cues in defining their body sensations.
There is a relationship between information processing techniques and subsequent recall of information. Visual imagery has a greater enhancing effect on recall than verbal recitation.
People tend to apply dispositional attribution to account for the behaviours of others and use situational attribution to explain behaviours of their own. Induced self-consciousness enhances recall of personal information. Teachers who use specific feedback during lectures obtain higher pupil achievement gains than teachers who use general feedback. High intimacy self-disclosing statements would be more effective in counselling than low intimacy self-disclosing statements.

Hypothesis Testing

A hypothesis is always about a population. Testing a hypothesis means drawing an inference from a random sample to the population from which the sample is taken.

1. The research hypothesis reflects your verbal reasoning. The wording often reflects the research design. e.g., There is a relationship between motivation to learn and math achievement. Girls have higher math achievement than boys. The effect of induced public self-consciousness is stronger for adolescents than for adults.

2. The statistical hypothesis reflects the statistic used to summarize your observations. e.g., ρ > 0: There is a positive correlation between motivation to learn and math achievement; the statistic of correlation is used to summarize the data. μg > μb: The mean math achievement of girls is higher than the mean of boys; the mean is used to summarize the data.

3. The null hypothesis represents a way to test the statistical hypothesis. μg = μb: The mean math achievement of girls is the same as the mean of boys. ρ = 0: There is no correlation between motivation to learn and math achievement.

4. Statistical tests are conducted with the assumption that the null hypothesis is true. What is the probability of finding a positive correlation when the truth is that there is no correlation? What is the probability of finding a difference between the two means when there is no difference?

Statistical Significance
The probability level at which you will reject the null hypothesis, or at which you will allow yourself the risk of wrongly rejecting the null hypothesis.

Type I Error
The significance level is also the Type I error rate. It is the probability of rejecting the null hypothesis when the null hypothesis is true. You make such an error only when the null is rejected.

Type II Error
It is the probability of not rejecting the null hypothesis when the null hypothesis is false. You make such an error only when you fail to reject the null hypothesis.

Sampling distribution of means (or of any statistic)
An imagined or theoretical distribution of an infinite number of means computed from random samples of the same size. Because of the central limit theorem, this distribution is used as a probability distribution to determine the probability of obtaining a mean larger than or as large as (in absolute value) the one computed from your sample.

Central limit theorem
1. If repeated random samples of size n are drawn from a normally distributed population, the distribution of the sample means is normal.
2. As the sample size increases, regardless of the shape of the population distribution, the sampling distribution of means approximates normality.
3. The mean of the sampling distribution of means equals the population mean.
4. The standard deviation of the sampling distribution of means equals the population standard deviation divided by the square root of the sample size (σM = σ/√n). This is called the standard error of means. If the population variance is not known, the sample variance can be used as an estimate of the population variance in computing the standard error.
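As a numerical illustration of points 2-4 of the central limit theorem (not part of the original notes), the following minimal Python sketch draws repeated samples from a deliberately non-normal population; the population shape, sample size, and number of samples are arbitrary choices for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A decidedly non-normal population: exponential, arbitrary scale.
population = rng.exponential(scale=4.0, size=100_000)
sigma = population.std()

n = 30              # size of each random sample
n_samples = 10_000  # number of repeated samples

# Draw repeated random samples of size n and record each sample mean.
means = np.array([rng.choice(population, size=n).mean()
                  for _ in range(n_samples)])

print("mean of sample means:", means.mean())        # ~ population mean
print("population mean:     ", population.mean())
print("sd of sample means:  ", means.std())         # ~ sigma / sqrt(n),
print("sigma / sqrt(n):     ", sigma / np.sqrt(n))  # the standard error
```

The histogram of the 10,000 means would also look approximately normal despite the skewed population, which is point 2 of the theorem.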
Four steps in hypothesis testing (see the sketch after the worked example below):

1: State the null and alternative hypotheses. e.g., H0: μ1 - μ2 = 0; H1: μ1 - μ2 > 0.
2: Set the level of statistical significance, which is the probability at which you'll reject the null, or at which you'll allow yourself to make the Type I error. e.g., α = .05; critical value t(.05, 28) = 1.70.
3: Compute the test statistic, which can be a t-test, z-test, F-test, chi-square, etc. e.g., t(28) = 2.85.
4: Make a decision about the null. If you reject the null, you may make a Type I error, the probability of which was set at step 2. If you do not reject the null, you are running the risk of making a Type II error, the probability of which can be calculated if you know certain parameters. Here, since 2.85 > 1.70, reject the null and support your research (alternative) hypothesis.

An Example

Hypothesis: People high in public self-consciousness are more conforming to perceived social norms on gender roles (than those low in public self-consciousness). There is a relationship between public self-consciousness and gender role conformity. The independent variable is public self-consciousness. The dependent variable is gender role conformity.

Operational definitions: Public self-consciousness is measured by the Self-Consciousness Scale (Fenigstein, Scheier, & Buss, 1975; Scheier & Carver, 1985). Gender role conformity is defined by the following operations: ten gender role attitude questions were used to first determine participants' own standings on these gender role questions. The participants were then informed of the mean ratings of their peers on these gender role questions and were asked to re-assess their attitudes toward these gender roles. Conformity to social norms on gender roles is measured by the difference score between the two self-assessments on the ten gender role questions.

Statistical hypothesis: μhigh - μlow > 0 (high vs. low public self-consciousness). This implies that the statistic, the mean, is used to summarize the sampled data that bear on the hypothesis. Or ρ > 0, implying that the statistic, the correlation, is used to summarize the data.

Null hypothesis: μhigh - μlow = 0, or ρ = 0.

Significance level: α ≤ .05.

Hypothesis testing rationale: The hypothesis is about populations. The null assumes that there is no mean difference between the two populations (groups). My hypothesis assumes that there is a mean difference (in the direction hypothesized) between the two populations. The purpose of hypothesis testing is to make the qualitative decision regarding whether my samples are taken from the populations defined by the null (decision: accept the null, and your research hypothesis is not supported) or from the populations defined by the alternative (research) hypothesis (decision: reject the null, and your research hypothesis is supported). Hypothesis testing starts with the assumption that the null is true. Even if the null is true, there is a good chance that, due to sampling fluctuation, you will find some small magnitude of difference in your samples. The chance of finding large differences, however, should be very small. In fact, such a chance is so small that you should no longer attribute the difference to sampling fluctuations but to the possibility that the null is not true. This chance is the probability associated with your computed sample statistic. As this probability gets smaller, you grow more doubtful about the truth of the null, to the point that you make the qualitative decision that the null is not true (reject the null). This point is your significance level, and your decision is associated with the possibility of a Type I error.
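The sketch below walks through the four steps with the numbers used above (α = .05, df = 28, i.e., two groups of 15). The group scores are made up for illustration; only the procedure, not the data, comes from the notes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Step 1: H0: mu1 - mu2 = 0 versus H1: mu1 - mu2 > 0 (one-tailed).
# Step 2: alpha = .05; with df = 28 the critical value is t(.05, 28) = 1.70.
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha, df=28)   # about 1.701

# Two invented groups of 15 each (df = 15 + 15 - 2 = 28).
group1 = rng.normal(loc=75, scale=10, size=15)
group2 = rng.normal(loc=65, scale=10, size=15)

# Step 3: compute the test statistic (independent-samples t-test).
t_stat, p_two_sided = stats.ttest_ind(group1, group2)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

# Step 4: decision about the null.
df = len(group1) + len(group2) - 2
print(f"t({df}) = {t_stat:.2f}, critical value = {t_crit:.2f}, "
      f"one-sided p = {p_one_sided:.4f}")
print("Reject H0" if t_stat > t_crit else "Fail to reject H0")
```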
Another example: Induced state of public self-consciousness increases gender role conformity. The state of public self-consciousness is the independent variable. It is induced by videotaping the participants while they are assessing their gender role attitudes in relation to these attitudes of their peers. The knowledge that they are on camera induces the state of public self-consciousness.

3: Measurement and Testing

Reliability

Classical test theory, also known as true score theory, is mostly concerned with test reliability, or the reliability of the observed scores of a test in measuring the underlying true abilities or true scores. Reliability can be defined as the strength of the relationship between observed scores and true scores.

If we were to administer a test to the same person under all different conditions, at different times, using different items, there would be different observed scores. The mean of all these observed scores is the person's true score, or true ability or personality. In reality, we only give the person one test and there is only one observed score. This score can be seen as a random variable, or as a randomly sampled observation from a distribution of all possible observed scores. The observed score can be seen as consisting of the mean of the distribution (the true score) and a deviation from the mean, which is called the error or error score. Thus, X = T + E.

The extent to which an observed score represents the true score is reliability. We can use the Pearson product moment correlation, ρ, to describe the strength of the relationship between observed scores and true scores, i.e., reliability. Thus, ρxt is called the reliability index. (Note that ρxt is not the reliability coefficient.) Of course, we do not know the true scores and thus cannot solve for ρxt. But assumptions can be made that enable the solving of ρxt. These assumptions, which are not discussed here, make up classical test theory. With the assumptions, we can numerically estimate the reliability of a test without knowing its true scores.

First, we define the square of the reliability index, numerically, as the ratio between true score variance and observed score variance. This is called the reliability coefficient. (Note that this is ρxx or ρxx'.) Second, we can estimate the reliability coefficient by simply correlating two parallel tests (two forms of a test, two halves of a test, or two administrations of a test). The result is a Pearson correlation coefficient, r, which is an estimate of the reliability coefficient. Keep in mind that the meaning of a Pearson r, when used as an estimate of the reliability coefficient, is really that of an r², representing the proportion of the observed score variance that is true score variance. r ranges from -1 to 1 whereas r² ranges from 0 to 1. That is why r, as a reliability estimate, ranges from 0 to 1 and not from -1 to 1.

Depending on which kinds of two tests are being correlated to arrive at the reliability estimate, these estimates are given different names, as shown below (a computational sketch follows):

Test-retest reliability (coefficient of stability): correlate two administrations of the same test.
Parallel form reliability (coefficient of equivalence): correlate two forms of the same test.
Split-half reliability (Spearman-Brown prophecy formula): correlate two halves of the test.
Internal consistency reliability (Cronbach α): correlate every item with every other item.
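The following minimal sketch (my illustration, not from the notes) computes the last two estimates for an invented 200-examinee, 10-item data set generated according to X = T + E. Cronbach's α is computed with the standard variance formula, and the split-half correlation is stepped up with the Spearman-Brown formula:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (persons x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def spearman_brown(r_half: float) -> float:
    """Step a split-half correlation up to full-test reliability."""
    return 2 * r_half / (1 + r_half)

# Made-up data: 200 examinees, 10 items driven by one true score plus error.
rng = np.random.default_rng(2)
true_score = rng.normal(size=(200, 1))
items = true_score + rng.normal(scale=1.0, size=(200, 10))

print("Cronbach alpha:        ", round(cronbach_alpha(items), 3))

# Split-half: correlate the odd and even item halves, then step up.
odd, even = items[:, ::2].sum(axis=1), items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
print("split-half reliability:", round(spearman_brown(r_half), 3))
```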
When ρxx' = 1:
1. the measurement has been made without error (E = 0 for all examinees).
2. X = T for all examinees.
3. all observed score variance reflects true-score variance.
4. all differences between observed scores are true-score differences.
5. the correlation between observed scores and true scores is 1.
6. the correlation between observed scores and errors is zero.

When ρxx' = 0:
1. only random error is included in the measurement.
2. X = E for all examinees.
3. all observed score variance reflects error variance.
4. all differences between observed scores are errors of measurement.
5. the correlation between observed scores and true scores is 0.
6. the correlation between observed scores and errors is 1.

When ρxx' is between 0 and 1:
1. the measurement includes some error and some truth.
2. X = T + E.
3. observed score variance includes true-score and error variance.
4. differences between scores reflect true-score differences and error.
5. the correlation between observed scores and true scores is the reliability index (the square root of the reliability coefficient).
6. the correlation between observed scores and errors is the square root of 1 - reliability.

Validity

The validity of the use of a test refers to the extent to which the test truly measures what it is expected to measure. For example, the use of a bathroom scale to measure weight is valid whereas the use of a bathroom scale to measure height is invalid. Commonly discussed validities include content, construct, and predictive validity.

Kinds of Validity Evidence

Content validity refers to the extent to which the items on a test are representative of a specified content domain. For example, a test that is intended to measure the content of this course should contain items about reliability, validity, intelligence, and personality tests. If the test is made up of items on calculus or matrix algebra, it will have no content validity. Achievement and aptitude (but not personality and attitude) tests are concerned with content validity.

Construct validity refers to the extent to which items on a test are representative of the underlying construct, e.g., a personality trait or attribute. Personality and attitude tests are concerned with construct validity. The process of establishing construct validity is referred to as construct validation. Construct validation is complicated, involving testing hypotheses concerning the theories from which the test is derived. A common practice is to compare high scorers with low scorers on the test with respect to some external behavior which is hypothesized to correlate with the test. Construct validity has often been narrowly interpreted as providing evidence for the internal structure of a test. Another term for this narrow definition of construct validity is factorial validity, because such validity evidence is gathered through factor analysis. The correct definition of construct validity refers to gathering evidence for a broad nomological network of relations.

In 1959, Campbell and Fiske published an important paper, which has become one of the most cited papers in psychology. In this paper, they conceptualize validity issues in a Multitrait-Multimethod (MTMM) correlation matrix. In this MTMM matrix, convergent validity (CV) is the correlation between different methods of measuring the same trait; this correlation coefficient should be high. Discriminant validity (DV) is the correlation between different traits obtained by the same method; a low value indicates evidence of validity. The correlation between different traits obtained by different methods (HH, heterotrait-heteromethod) should be the lowest in value.
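As a rough numerical illustration of the MTMM logic (my sketch, not from the notes), the code below simulates two uncorrelated traits, each measured by two methods that contribute shared method variance, and prints the three kinds of correlations just described. All numbers are invented for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Two uncorrelated traits; two methods that each add shared method bias.
trait_a, trait_b = rng.normal(size=n), rng.normal(size=n)
method1, method2 = rng.normal(size=n), rng.normal(size=n)

a1 = trait_a + 0.5 * method1 + rng.normal(scale=0.5, size=n)  # trait A, method 1
a2 = trait_a + 0.5 * method2 + rng.normal(scale=0.5, size=n)  # trait A, method 2
b1 = trait_b + 0.5 * method1 + rng.normal(scale=0.5, size=n)  # trait B, method 1
b2 = trait_b + 0.5 * method2 + rng.normal(scale=0.5, size=n)  # trait B, method 2

r = np.corrcoef([a1, a2, b1, b2])
print("CV (same trait, different methods), should be high:", round(r[0, 1], 2))
print("DV (different traits, same method), should be low: ", round(r[0, 2], 2))
print("HH (different traits, different methods), lowest:  ", round(r[0, 3], 2))
```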
This MTMM concept has since become an important part of construct validity, in terms of both its narrow definition and the broad definition.

Criterion-related validity, including predictive validity and concurrent validity, refers to the extent to which a test correlates with the future behaviors which the test is intended to predict. For example, the HKALE is intended to select students who are capable of university studies. For the HKALE to have predictive validity, its test scores should correlate with undergraduate performance, such as GPA. Predictive validity is simply the correlation between the test and a criterion measure the test is intended to predict. The correlation coefficient is also called the validity coefficient. Sometimes, for lack of a criterion, one correlates the test with another test that purports to measure the same thing. This is called concurrent validity.

Making Sense of Validity

Validation is an ongoing process in which one keeps accumulating validity evidence. But one is not capable of obtaining all the evidence at once, and certain evidence is never attempted. For different uses of a test, the gathering of some evidence becomes more important than that of others, giving rise to different validity concepts and procedures. Over time, this discriminating use of evidence in relation to different types of tests becomes a tradition, so that a certain test and its use are routinely associated with one kind of validity evidence but not with other kinds. For example, most educational achievement tests are only concerned with content validity, i.e., whether the test items are representative of what has been taught within a specified content domain. One purpose of education is to create a competent work force, and thus a valid test of such educational achievement should be correlated with future job performance. However, such criterion-related validity evidence is seldom gathered for an educational achievement test, in part because there is also a strong public mentality that students must be assessed on what they have learned.

As another example, the content validity of a personality test is never questioned. The validity concern with a personality test lies in establishing the linkage between the test items and the underlying trait structure as defined by a theory within which the test is conceptualized. Factor analysis is often used to see if the items form clusters according to the theory and to see if the items correlate with measures of other constructs according to the specified patterns of relationships defined by the theory. These efforts are referred to as construct validation.

To be simplistic, content validity is associated with an achievement test as construct validity is with a personality or attitude test. The former is concerned with the representativeness of the items with respect to a specified content domain; the content validation procedures are qualitative or judgmental. The latter is concerned with the representativeness of the items with respect to a defined theoretical construct; the procedures are referred to as construct validation, which may involve different data collection techniques and strategies, some of which could be thought of as a different validity procedure, e.g., criterion-related validity.

Finally, some tests do not need validity evidence.
For example, a test of typing which is administered to measure the status of an individual's typing skills is its own criterion. Such "obviously valid" tests do not need additional validity evidence.

An important issue associated with criterion-related validity is what is referred to as the restriction of range effect. The validity coefficient is a correlation coefficient, the magnitude of which depends on the ranges of scores for the predictor variable (the test) and the criterion (a future behavior to be predicted by the test). A correlation based on the full range of scores will always be higher than one based on a restricted range of scores, independent of the true predictability of the test. In reality, most such tests are used to make selections (of students or employees, for example). The validity study is always conducted on those who passed the test and were thus given the opportunity to demonstrate the future performance. Thus, the validity estimate is based on a restricted range of scores rather than the full range. The real predictive validity of a test should always be higher than what is obtained from such a validity study. There are, however, statistical procedures to adjust the validity coefficient. A simulation illustrating this effect appears at the end of this section.

Another important issue is the relationship between validity and reliability. Remember that ρxx' = ρ²xt; the reliability coefficient is the squared correlation between the observed scores and the true scores. If ρxx' = .81, then ρxt = .90 (the correlation between the observed scores and true scores is .90). In general, ρxt > ρxx'. That is, an observed score will correlate more highly with its own true score than with an observed score on a parallel test. Because a test cannot correlate more highly with any other test or variable than with its own true score, the maximum correlation between an observed score and another variable is ρxt, the square root of the reliability coefficient.

Sometimes, the distinction between reliability and validity is blurred. For instance, Lindquist (1942) defined validity as the correlation between the fallible (the test to be validated) and infallible measures of a trait. Assuming the infallible criterion to be perfectly reliable and the (fallible) test to be perfectly representative of the criterion, the maximum validity of the (fallible) test would have to be the correlation of the observed test scores with the true scores on the test itself, which is the square root of the reliability of the fallible test. This is basically criterion-related validity -- the correlation between the observed scores of the test and the observed scores of the criterion.
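Here is a minimal simulation of the restriction of range effect (my illustration; the true validity of .60 and the 30% selection rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

# Test and criterion (later performance) built to correlate about .60.
test = rng.normal(size=n)
criterion = 0.6 * test + np.sqrt(1 - 0.6**2) * rng.normal(size=n)

r_full = np.corrcoef(test, criterion)[0, 1]

# Selection: only the top 30% on the test are admitted, so the validity
# study can observe the criterion only for this restricted range.
admitted = test > np.quantile(test, 0.70)
r_restricted = np.corrcoef(test[admitted], criterion[admitted])[0, 1]

print(f"validity coefficient, full range:       {r_full:.2f}")        # ~ .60
print(f"validity coefficient, restricted range: {r_restricted:.2f}")  # lower
```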
4: Research Validity and Sampling Techniques

Research Validity

Research validity can be simply understood as the quality of a research study. There are two kinds of quality issues, referred to as internal validity and external validity. When we try to pinpoint the cause of a phenomenon or behavior, either by experimentally manipulating the independent variable or by measuring or observing the independent variable, we are trying to demonstrate that it is the independent variable that "causes" changes in the outcome or dependent variable. Internal validity is about the extent to which we can make this causal inference. Internal validity is the extent to which the outcomes of a study result from the variables which were manipulated, measured, or selected in the study, rather than from other variables not systematically treated. e.g., there might not be a relationship between televised violence and aggressive behavior; rather, children who watch violent programs may be more aggressive in the first place. Or: different brands of deodorants were tested under the left and right arms, which provide unequal testing conditions; an expectancy effect will also make people feel the advocated brand lasts longer.

In other words, internal validity is about how confident we are about the stated "causal" relationship between the independent variables and the dependent variable. That is, the dependent variable is due to the independent variable and not to something else. To improve internal validity, and thus the quality of a research study, we need to be able to rule out alternative "causes" that may have done the same thing to the dependent variable as our independent variable did. Some of the commonly encountered alternative causes include:

History: Events take place during the study that might affect its outcome in the same way that the independent variable is hypothesized to affect the outcome. e.g., a study examined whether a certain leadership training program was effective in enhancing students' sense of competitiveness; during the three months of the experimental study, the TV show Survivor was on, which could have made the students more aware of competition and more experienced with some competition strategies.

Maturation: A threat especially for developmental studies, where children grow with the passage of time and become more mature in certain developmentally related abilities. e.g., a study showing students' vocabulary increase from a dialogic reading program may not have internal validity due to maturation, because children's vocabulary increases with time independent of the training program.

Testing: When people are measured repeatedly, e.g., pretest-posttest, they become better not because of the independent variable but because they become test-smarter. e.g., a study showing math scores improved over the pretest due to a new teaching method might not have internal validity because students work much faster or better because of their experience with the pretest.

Instrumentation: The effect on the dependent variable is not due to the independent variable but to aspects of the instrument used in the study. Like the testing threat, this one only operates in the pretest-posttest situation. e.g., an observed change from pretest to posttest is due not to the math program being experimented with but to a change in the test that was used; the posttest could simply be easier. Instrumentation threats are especially likely when the "instrument" is a human observer. The observers may get tired or bored with the observations over time and thus become more lenient or more stringent, more conservative or more liberal, in their ratings. Conversely, they might get better at making the observations as they practice more and become more accurate than they were at the beginning. In either event, it is the change in instrumentation, not the independent variable, that leads to the observed change in the dependent variable.

Regression towards the mean: Particularly problematic when subjects are chosen because of extreme scores: high scoring individuals are more likely to score lower, and low scoring individuals are more likely to score higher, the next time they are tested, merely due to random measurement error. (A simulation sketch appears at the end of this list of threats.)
e.g., a study showing a teacher is effective in improving the scores of students who were at the bottom in the first term may not have internal validity, due to the regression towards the mean artifact, because by chance alone these students at the bottom will tend to increase (there is a greater chance for top students to fall towards the mean, while bottom students have a greater chance to rise towards the mean).

Selection: Results are due to the assignment of subjects to different treatment or control groups, not to the independent variable that distinguishes the groups. e.g., you want to compare a new coaching method against the existing method used in PE lessons and recruit volunteers to participate in three days of training in the sport. You then compare the results with those of some existing PE lessons. Volunteers are simply more motivated than those attending regular PE classes and thus may produce better training outcomes (dependent variable), which has little to do with the training method (independent variable).

Mortality: Especially in longitudinal studies that last for an extended period of time, attrition, or dropping out of the study in a non-random manner, may affect the outcome of the study. That is, some participants no longer want to continue with the study, and the results are based on those who stayed, who are different from those who dropped out in some fundamental ways. e.g., a special intervention may be tried out in a school, but the results show the students' average HKCE results to be much worse than those from regular schools; this study may not have internal validity due to mortality, because the weakest students may have dropped out of the regular schools.

Diffusion or imitation of treatment: The control group or one of the treatment groups somehow ends up receiving some of the same treatment as the other groups, resulting in few differences among the treatments. e.g., a study shows that a new teaching method (tried out in one class) does not lead to better achievement than the traditional method (in another class), in part because students from the two classes constantly compare notes and exchange information.

Hawthorne effect: When participants receive unusual treatment in a field experiment, they may temporarily change their behavior or performance, not because of the manipulation of the independent variable but because of the special attention they received during the experimentation. The term gets its name from a factory called the Hawthorne Works, where a series of experiments on factory workers was carried out between 1924 and 1932. Among many types of experiments, one that stands out and is often talked about is the illumination experiment, where researchers came to the factory to change the lights all the time (and probably made casual conversation with the workers). In comparison to the control group, which did not experience lighting changes, the productivity of the experimental group seemed to increase independent of how the lighting was adjusted.

John Henry effect: Whereas the Hawthorne effect is due to unusual performance by the experimental group, sometimes the control group may also put up an extraordinary performance to outperform the experimental group, out of a sense of demoralization for not being included in the "special treatment" experimental group. The term comes from the story of a railroad worker, John Henry, who probably felt threatened when a spiking machine (used to drive the spikes that stabilize the rails) was introduced to replace the manual work; he outperformed the machine but later died of a heart attack.
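The regression towards the mean threat listed above is easy to demonstrate numerically. In this sketch (my illustration, with invented score distributions), students are retested with no intervention at all, yet the bottom group's mean rises:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000

# Each student has a stable true score; each test adds independent error.
true_score = rng.normal(loc=50, scale=10, size=n)
test1 = true_score + rng.normal(scale=8, size=n)
test2 = true_score + rng.normal(scale=8, size=n)   # no intervention at all

# Select the bottom 10% on test 1, as a remedial study might.
bottom = test1 < np.quantile(test1, 0.10)

print("bottom group, test 1 mean:", round(test1[bottom].mean(), 1))
print("bottom group, test 2 mean:", round(test2[bottom].mean(), 1))
# The test 2 mean is higher purely because of measurement error:
# extreme test 1 scores were partly bad luck, which does not repeat.
```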
External Validity

External validity is the extent to which the findings of a particular study can be generalized to people or situations other than those observed in the study. Can findings from the laboratory be applied to the real world, where behavior is influenced by many factors that were controlled in the lab? All the factors threatening internal validity can be controlled in a lab, but the results may become less generalizable outside the lab. e.g., a study on a new teaching method with a certain group of S.4 students does not have external validity if the results cannot be generalized to other topics taught by the same new method, or to other S.4 students. Many threats to external validity can be understood in the form of an interaction:

Treatment-attribute interaction: Certain personality and other characteristics may interact with the independent variable, so that the effect of the independent variable differs for people with different characteristics. e.g., a study showing that a democratic parenting style improves academic achievement may lack external validity if we can argue and show that less parental demand and supervision only works with highly motivated students but not with other students.

Treatment-setting interaction: The independent variable may interact with other external factors or contexts, resulting in different effects in different settings, so that the effect cannot be generalized to all settings. e.g., the positive effect of democratic parenting on school achievement may lack external validity if democratic parenting (the independent variable) works only when schools provide a clear structure and strict supervision of student learning.

Pretest sensitization: The effect of the independent variable may be partly due to the pretest, which serves to sensitize the subjects, whereas in the population (the real world to which the findings are to be generalized) there is no pretest and thus the treatment (independent variable) may not work as it did in the study. e.g., a study showing that a math teaching method worked because the posttest improved over the pretest may lack external validity if the pretest helped the teachers and students identify learning difficulties or made them more aware of certain weaknesses, and the teaching method focused on those weaknesses.

Posttest sensitization: The effect of the independent variable is partly due to the sensitizing or exercising effect of the posttest, which is not available in the population to which the results are to be generalized. e.g., in the above study on a new math teaching method, the posttest served to reinforce the effect of the teaching method, and thus the students in the study improved their math performance because they had both the new teaching method and the posttest.

Sampling Techniques

An element is an object on which a measurement is taken. It is not the person or thing itself but a particular measurement of the person or thing that is of interest. e.g., a person's height, a class size.

A population is all the elements in a defined set about which we wish to make an inference. An example of a target population vs. an experimentally accessible population: the heights of Chinese people vs. the heights of Shanghai residents.
Sampling units are non-overlapping collections of elements from the population. The sampling unit can be the element itself. A sampling frame is a list of sampling units. A sample is a collection of sampling units drawn from a frame.

Simple random sampling
A sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected. M = Σxi / n is the sample estimate of the population mean, μ.

Stratified random sampling
The population of N units is divided into non-overlapping subpopulations of N1, N2, ..., Nh units, so that N1 + N2 + ... + Nh = N. The subpopulations are called strata. A sample is drawn from each stratum; the sample sizes are denoted n1, n2, ..., nh. If a simple random sample is taken from each stratum, the whole procedure is called stratified random sampling. When the population is heterogeneous, stratified sampling increases the precision of the estimates of population parameters. Breaking up the population makes each stratum homogeneous (measurements vary little among units), so that only a small sample is needed to estimate the population characteristics of each stratum. These strata estimates can then be combined into a precise estimate for the whole population (see the sketch below).

Wh = Nh / N is the stratum weight.
fh = nh / Nh is the sampling fraction in stratum h.
Mst = Σ Nh Xh / N = Σ Wh Xh is the stratified sample estimate of the population mean, μ, where Xh is the sample mean in stratum h.
If in every stratum nh / n = Nh / N, the sampling fraction is the same in all strata. Such stratification is called stratification with proportional allocation of the nh. Using proportional allocation, Mst = Σ nh Xh / n.
N Mst = N1 X1 + N2 X2 + ... + Nh Xh is the stratified sample estimate of the population total, τ.

Cluster sampling
A cluster sample is a simple random sample in which each sampling unit is a collection, or cluster, of elements. It is used when (1) a good frame listing the population elements is unavailable, unreliable, or costly to obtain; or (2) the cost of obtaining observations increases as the distance separating the elements increases. For example, when sampling in the field in agricultural research, it is hard to do random sampling by running around. In quality inspection of light bulbs contained in boxes, the light bulbs are the elements and the boxes can be the sampling units. Travelling within a city (to obtain a simple random sample of city residents) is more expensive than travelling within a city block (to get a cluster sample with city blocks as sampling units). The rationale for choosing the unit size for cluster sampling is a decision about which unit gives the smaller sampling variance for a given cost, or the lower cost for a prescribed variance. As a general rule, the number of elements within a cluster should be small relative to the population size, and the number of clusters in the sample should be reasonably large.

Mcl = Σxi / Σmi is the cluster sample estimate of the population mean, μ, where xi is the total of the observations (the sum of the elements) in the ith cluster and Σxi sums these totals over the i = 1 to n sampled clusters; mi is the size of the ith cluster and Σmi sums the cluster sizes over the n sampled clusters.
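The following sketch computes the stratified estimate Mst = Σ Wh Xh with proportional allocation for an invented three-stratum population (my illustration; the stratum sizes and means are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)

# A hypothetical population with three strata of sizes 5000, 3000, 2000.
strata = {
    "stratum1": rng.normal(60, 5, size=5_000),
    "stratum2": rng.normal(75, 5, size=3_000),
    "stratum3": rng.normal(90, 5, size=2_000),
}
N = sum(len(units) for units in strata.values())

# Stratified random sampling with proportional allocation (nh / n = Nh / N).
n = 300
M_st = 0.0
for units in strata.values():
    Nh = len(units)
    nh = round(n * Nh / N)                       # proportional allocation
    sample = rng.choice(units, size=nh, replace=False)
    M_st += (Nh / N) * sample.mean()             # accumulate Wh * Xh

population = np.concatenate(list(strata.values()))
print("stratified estimate of the mean:", round(M_st, 2))
print("true population mean:           ", round(population.mean(), 2))
```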
Systematic sampling
Randomly selecting one element from the first k elements in the frame and every kth element thereafter is called a one-in-k systematic sample. It is useful in field studies, e.g., select every 10th tree, every 20th file, or every 15th shopper who passes by an aisle in a supermarket, until a predetermined n is achieved. If different persons originally handled different sections of the files, or each clerk deals with a different set of surnames, then systematic sampling provides more accurate information than simple random sampling, which could, by random chance, select n files filed by one clerk. Parameter estimates are the same as those of simple random sampling.

Sample size
The method used to select the sample is of utmost importance in judging the validity of the inference made from the sample to the population. The representativeness of the sample is more important than the size of the sample. A representative sample of 100 may be preferable to an unrepresentative sample of 100,000. The size of a sample can never compensate for a lack of representativeness (bias). Having established sample representativeness, using the right sample size becomes an important economic decision. Each observation taken from the population contains a certain amount of information about the population parameter or parameters of interest. Since obtaining the information costs money, one decides how much information should be sampled. Too little prevents good estimates; too much may be a waste of limited economic resources. The quality of information obtained from a sample depends upon the number of elements sampled (sample size) and the amount of variation in the information (population variance).

Let's look at an example. What proportion of people are left-handed? How big a random sample is needed to answer the question? Or, in other words, how big an error, or margin of error, is to be tolerated? Let's set the margin of error at no bigger than 10%. If the sample estimate is 20%, you will at least be confident that the population proportion is between 10% and 30%. But you cannot guarantee that every sample, including the one you draw, will have this margin of error unless the whole population is sampled. Sometimes a sample may have a sampling error higher than 10%; sometimes a sample may have a sampling error lower than 10%. Of course, you are concerned only with having an error higher than 10%. Then the question becomes: how unlikely do you want this unlucky sample, having a higher than 10% error, to be? Assume you want the unlikelihood to be 5 out of 100 samples, so that you can be confident, 95% of the time, that a sample does not exceed the specified sampling error of 10%. That is, Pr(|p - P| > 10%) = 5%, or Pr(|p - P| ≤ 10%) = 95%. You can then use some basic statistics to estimate the margin of error, by first estimating the population variance (of the proportion of left-handers) and the standard deviation, which is a function of sample size, so that you can estimate how big a sample you need for the chance of making an estimation error bigger than 10% not to exceed 5%. A sketch of this calculation follows.
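One standard way to carry out this calculation uses the normal approximation, n = z² p(1 - p) / e², where e is the margin of error and z is the normal deviate for the chosen confidence. A minimal sketch (my illustration of that textbook formula, applied to the left-handedness example):

```python
import math
from scipy import stats

def sample_size_for_proportion(margin: float, confidence: float = 0.95,
                               p: float = 0.5) -> int:
    """Smallest n such that a sample proportion falls within `margin` of
    the population proportion with the stated confidence.
    p = 0.5 gives the most conservative (largest) variance p(1 - p)."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)   # e.g. 1.96 for 95%
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Margin of error no bigger than 10%, unlucky samples limited to 5 in 100.
print(sample_size_for_proportion(margin=0.10))          # about 97, worst case
print(sample_size_for_proportion(margin=0.10, p=0.10))  # about 35 if p is near .10
```

Note how the required n shrinks when the population variance p(1 - p) is smaller, which is the point made above about sample size depending on population variance.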
5: Experimental, Quasi-Experimental and Non-Experimental Designs

To ensure research validity, or internal validity, researchers develop various ways to identify, isolate, or nullify variability among subjects in a dependent variable that is presumably "caused" by one or more independent variables extraneous to the particular relation or relations under study. Such a research effort is called control, control of variance, or control of extraneous variables. The most powerful way of controlling extraneous variables is experimentation, where subjects are randomly assigned to experimental versus control groups. Other things being equal, if random assignment has been used, the groups can be assumed to be equal in all possible characteristics except those due to the manipulation of the independent variable. In other words, variation among subjects due to anything other than the independent variable is scattered or spread out evenly across the randomly assigned groups. The variability due to the manipulation of the independent variable is called systematic variance. The purpose of experimental research is to maximize this source of variance, minimize error variance, and control extraneous variance. Other means of controlling extraneous variance include matching, including the extraneous variable in the study, making the extraneous variable a constant, and using statistical methods to decompose different sources of variance. It is the extent to which extraneous variables are controlled that distinguishes research designs into experimental, quasi-experimental, and non-experimental designs.

Experimental design

The single most important feature of experimental research is the manipulation of the independent variable. Researchers create changes, called treatment conditions, in the variables being researched, called independent variables, to examine the impact of these manipulations on some outcome behaviour or phenomenon, called the dependent variable. Another important feature of experimental research is the ability to control extraneous variables, so that subjects receiving different manipulations of the independent variable are equal except for the manipulation.

Experimental research in social science developed from both physics and biological research models. In physics research, the primary means of controlling extraneous factors is through artificial controls, such as isolation, insulation, sterilization, strong steel chamber walls, soundproofing, lead shielding, etc. These methods ensure the reproduction of similar conditions and the consequent production of certain effects. As biological research moved from the laboratory to the open field, the modern theory of experimental control through randomized assignment to treatment emerged. Agricultural research compares yield per acre for different crops or for different fertilizer, raking, or plowing methods. One of the greatest breakthroughs in experimental design was the realization that random assignment provided a means of comparing the outcomes of different treatments in a manner that ruled out most alternative interpretations. Random assignment requires experimental units, which can be plots of land in agriculture, individual persons in social psychology experiments, intact classrooms in education studies, or neighborhoods in some criminal justice research. Treatments are then assigned to these units by some equivalent of a coin toss.
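As a small illustration of that "coin toss" (an invented example, not from the notes), the following sketch randomly assigns 20 hypothetical subjects to treatment and control groups:

```python
import random

def randomly_assign(subjects: list,
                    groups: tuple = ("treatment", "control")) -> dict:
    """Assign each subject to a group by chance alone, so that in the long
    run the groups can be assumed equal on all extraneous variables."""
    shuffled = subjects[:]
    random.shuffle(shuffled)                    # the "coin toss"
    return {g: shuffled[i::len(groups)] for i, g in enumerate(groups)}

subjects = [f"S{i:02d}" for i in range(1, 21)]  # 20 hypothetical subjects
assignment = randomly_assign(subjects)
print(assignment["treatment"])
print(assignment["control"])
```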
Quasi-Experimental Design

Quasi-experiments have treatments, outcome measures, and experimental units, but do not use random assignment to create the comparisons from which treatment-caused change is inferred. Instead, the comparisons depend on nonequivalent groups that differ from each other in many ways other than the presence of the treatment whose effects are being tested. The task is one of separating the effects of the treatment from those due to the initial noncomparability between the average units in each treatment group. In a sense, quasi-experiments require making explicit the irrelevant causal forces hidden within the intact groups.

The advantages of experimental control for inferring causation have to be weighed against the disadvantages that arise because we do not always want to learn about causation in controlled settings. Instead, we would like to be able to generalize to causal relationships in complex field settings, and we cannot easily assume that findings from the laboratory will hold in the field. For lack of randomization, a pretest is an integral part of a quasi-experiment, enabling comparisons among the nonequivalent groups, whereas in most experiments a pretest is often unnecessary or undesirable. When a pretest is not available, researchers may look for proxy variables to use in an attempt to equate the treatment and control groups. For instance, previous academic records can be used as proxies to equate the intact classes in which different instructional methods are implemented, with post-tests used to evaluate or compare the different teaching methods.

Non-Experimental or Ex Post Facto Research

In experimental research, there is a temporal sequence in the order of the independent variable (which occurs first) and the dependent variable (which is observed subsequently) that allows a causal inference. Specifically, the researcher manipulates the independent variable to "create" changes in the dependent variable. This expected change (in the dependent variable as a consequence of the manipulation of the independent variable) represents a causal relationship between the two variables. This causal relationship is stated in the hypothesis. Thus, experimental research is guided by a hypothesis stated a priori. In experimental and quasi-experimental research, inferences are made from the independent variables (the causes) to the dependent variable (the effect). In non-experimental research, also called "ex post facto research", inferences are generally made in the opposite direction. That is, beginning with the observation of the dependent variable, attempts are made to uncover, detect, or find the reasons (independent variables) for the existing variations. The point is that the variations are not the result of the manipulation of the independent variables but are pre-existing.

When an experimental researcher manipulates a variable (e.g., administers different treatments), the researcher has some expectations regarding the effect of the manipulation on the dependent variable. These expectations are expressed in the form of hypotheses to be tested. In non-experimental research, a researcher would not have such expectations of the independent variables, or sometimes would not even know, prior to data collection, what "independent" variables are tenable to explain the pre-existing variations in the "dependent" variable. Often there are no hypotheses associated with a non-experimental study, and researchers adopt the position of "letting the data speak for themselves." This design is most vulnerable to internal validity threats. Two general strategies to protect internal validity are (1) using large samples to compensate for the lack of random assignment and (2) using large numbers of "independent" variables to eliminate rival explanations. The latter strategy is intended to overcome the weakness of having no experimental manipulation of the independent variables.

Summary of Research Designs

Design          Manipulation of Ind. V.   Random Assignment   Sample Size   Variable Number
Experimental    Yes                       Yes                 Small         Small
Quasi-Exp       Yes                       No
Non-Exp         No                        No                  Large         Large
Non-Experimental Design: Causal Comparative Study

Non-experimental designs refer to efforts at causal inference based on measures taken all at one time, with differential levels of both effects and exposures to presumed causes being measured as they occur naturally, without any experimental intervention. According to some authors, there are two kinds of non-experimental designs: causal comparative and correlational studies. (Others, including me, do not think such a distinction is necessary.) The difference lies in the measurement of the "independent" variable, which can be either categorical or continuous. The quotation marks indicate that an "independent" variable in non-experimental research is not truly what the term stands for, since it is not manipulated. For this reason, a categorical or continuous "independent" variable is also called a grouping variable in causal comparative studies and an exogenous variable in correlational studies. The causal comparative design differs from experimental and quasi-experimental designs in that there is no manipulation of the grouping variable and no random assignment of subjects into different groups. For example, the question inspiring gender studies is whether males and females differ with respect to a particular behaviour or trait. Here, the independent or grouping variable is gender, which is not and cannot be manipulated, and there is no way to randomly assign subjects to either of the two groups. Another difference is that, as stated earlier, groups in causal comparative research are often formed on the basis of the dependent variable. Researchers are often interested in finding out why, for example, some children are less motivated to learn than others, do not achieve as well as others, have more behavioural problems than others, are more aggressive than others, or turn out to be criminals. The behaviours about which such questions are raised are the results, outcomes, or dependent variables. Researchers try to explain the individual differences on these variables by grouping subjects on them into, say, high versus low achievers, or students with and without behavioural problems or criminal records, and then comparing the two groups of subjects on some suspected causes, e.g., parental supervision, peer influence, TV viewing, etc. Such logical thinking and research processes are almost the opposite of those in experimental research. In experimental research, TV viewing as an independent variable would be manipulated: subjects would be randomly assigned to groups with different amounts or different kinds of TV exposure, and their subsequent aggressive, antisocial, or criminal behaviours would be observed. In this experimental example, because of randomization, the different experimental groups can be considered equal except for the manipulation of TV viewing. Thus, different validity threats can reasonably be ruled out, and the observed difference among the groups on the dependent variable, aggressive behaviour, can be attributed to the independent variable, TV viewing. In the non-experimental example, however, one cannot say that TV viewing causes aggressive behaviour even though the two groups of aggressive versus non-aggressive children were found to differ on TV viewing. The reason is that the two groups cannot be assumed to be equal: they may differ in many things in addition to aggressiveness, e.g., family background, hormonal levels, personalities, etc. The same statistical analyses, such as ANOVA, are used in causal comparative studies, but the interpretation of the results should be far more cautious in causal comparative research.
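In code, the causal comparative logic runs backwards from the outcome: group subjects on the dependent variable, then compare the groups on a suspected cause. A minimal sketch with fabricated numbers (Python):

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Groups formed on the dependent variable (aggressive vs. non-aggressive),
# not created by the researcher.
tv_aggressive = rng.normal(4.0, 1.5, 40)     # daily TV hours, group 1
tv_nonaggressive = rng.normal(3.0, 1.5, 40)  # daily TV hours, group 2

t, p = stats.ttest_ind(tv_aggressive, tv_nonaggressive)
print(t, p)

# A significant t says only that the groups differ on TV viewing; without
# random assignment, no causal claim about TV viewing follows.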
Non-Experimental Design: Correlational Study

Partial correlation. The calculation is complicated, but the idea of partial correlation is simple. It is an estimate of the correlation between two variables in a population that is homogeneous on the variable (or variables) being controlled, whose effects are being removed, or whose variability is made constant. For example, a correlation between height and intelligence computed from a sample that is heterogeneous in age, say ranging from 4 to 15, is a simple correlation, and it is high and positive. A partial correlation would be the average correlation between height and intelligence within each age group, where age is a constant. This partial correlation, which is likely zero, more truly depicts the relationship between height and intelligence.

Spurious effect. When two variables are correlated solely because they are both affected by the same cause, the correlation between them is spurious. For example, the tobacco industry argues that the correlation between cigarette smoking and lung disease is not causal but spurious, in that both variables may be caused by a common third factor such as stress or an unhappy mental state. Another example is the positive correlation between height and intelligence often observed in children. Here, the correlation is spurious because both variables have the common cause of chronological age.

Mediating variable. The correlation between two variables can be the result of a mediating variable. For example, a strong correlation between SES and academic achievement is often observed, which makes some people believe that there is a causal relationship between how rich the parents are and how well the kids do in school. However, such a relationship is found to be mediated by a third variable, achievement motivation. That is, rich people's children are more motivated to study (by their parents' success), and this motivation leads to good academic performance. This latter finding is obtained by correlating SES and achievement while statistically partialling out motivation; the partial correlation is almost zero. The important implication of this statistical insight is that the key lies in motivating the poor kids (providing them with role models), whereas giving them material incentives may not make them study.

Suppressor variable. A special case, in which a partial correlation is larger than its zero-order correlation, is called a suppressor variable effect. A suppressor variable has a zero, or close to zero, correlation with the criterion or dependent variable but is correlated with the predictor or independent variable. When such a suppressor variable is not taken into consideration, the correlation between the independent and dependent variables may be "suppressed" or reduced by this uncontrolled suppressor. For example, a paper-and-pencil pilot test as a predictor was found to predict little of the criterion, flying. The correlation was suppressed by a third variable, verbal ability, which has little to do with flying but a lot to do with test taking. When this suppressor variable was partialled out, the correlation between the pilot test and piloting increased significantly. This is a real example from pilot training during World War II.
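The partialling idea running through all four examples can be sketched in a few lines (Python; simulated data following the age-height-intelligence example above):

import numpy as np

rng = np.random.default_rng(seed=4)

# Children aged 4 to 15: age drives both height and mental ability.
age = rng.uniform(4, 15, 500)
height = 80 + 6 * age + rng.normal(0, 5, 500)
ability = 10 + 4 * age + rng.normal(0, 5, 500)

# Zero-order correlation: high and positive, but spurious.
print(np.corrcoef(height, ability)[0, 1])

def residuals(y, x):
    # Remove the linear effect of x from y.
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial correlation: correlate what is left of each variable
# after age is held constant; it should be close to zero.
print(np.corrcoef(residuals(height, age), residuals(ability, age))[0, 1])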
Validity Issues

The only way to enhance the research validity of correlational studies, which are ex post facto or after the fact, is through careful logical deduction and well-thought-out statistical analyses. The first part requires a strong theory to map out the relations among variables and careful thinking to include all the possible extraneous variables that may contribute to the observed variations. The second part involves the use of various rather complicated analytic techniques, such as multiple regression, path analysis, and structural equation modeling. Partial correlation is the basic idea behind these analyses. Thus, correlational studies often involve more variables than experimental research.

The variables in a correlational study are not distinguished as independent and dependent variables. First, since there is no experimental manipulation of the variable of research interest, there is no independent variable in the strict sense. Second, the inference is often not made from the manipulation of an independent variable to the outcome of a dependent variable; on the contrary, the ex post facto outcome is observed first, and the research purpose is to account for the observations, so the order of what is independent and dependent is the opposite of that in experiments. Third, unlike experimental studies, where there is usually one dependent variable, there can be, and usually are, several outcome variables in a complex pattern of associations. In correlational studies, variables are instead distinguished as exogenous and endogenous. An exogenous variable is one whose variability is assumed to be determined by causes outside the model or study under consideration. It is not the interest of the study to explain the variability of an exogenous variable or its causal relations with other exogenous variables. An endogenous variable is one whose variation is to be explained by the exogenous and/or other endogenous variables in the model. Exogenous and endogenous variables are like independent and dependent variables in experimental studies, except that there is no manipulation of the exogenous variables and there is usually more than one endogenous variable in a correlational study.

In correlational studies, one needs to think hard to include as many relevant variables as possible. Omission of relevant variables that are correlated with the exogenous or endogenous variables in the model constitutes what is called a specification error, which leads to biased estimates of the relations among the variables in the model. Specification errors are almost unavoidable in correlational research. All one can do is attempt to minimize them by including the major relevant variables in the design. Although it is hard to say how many relevant or extraneous variables need to be included in a model, it is fairly safe to say that a correlational study involving only one "independent" variable is bound to be misspecified in virtually any instance that comes to mind.
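The SES-motivation-achievement example above doubles as a demonstration of why omitted variables matter. A minimal sketch (Python; the path coefficients are invented):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=5)
n = 1000

ses = rng.normal(0, 1, n)
motivation = 0.6 * ses + rng.normal(0, 1, n)           # the mediator
achievement = 0.8 * motivation + rng.normal(0, 1, n)   # no direct SES effect

df = pd.DataFrame({"ses": ses, "motivation": motivation,
                   "achievement": achievement})

# With motivation omitted, SES looks like a cause of achievement.
print(smf.ols("achievement ~ ses", data=df).fit().params["ses"])

# With the mediator included, the SES coefficient drops toward zero.
# Omitting a relevant variable such as motivation is one form of
# specification error.
print(smf.ols("achievement ~ ses + motivation", data=df).fit().params["ses"])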
An Evaluation Checklist for Quantitative Studies
Adapted from McMillan, J. H. (2004). Educational research: Fundamentals for the consumer (4th ed.). Boston: Pearson.

1.0 Research Problem
1.1 What are the independent and dependent variables?
1.2 Is the problem researchable?
1.3 Is the problem significant?
1.4 Will the results have practical or theoretical importance?
1.5 Is the problem stated clearly and succinctly?
1.6 Does the problem communicate whether the study is descriptive, relational, or experimental?
1.7 Does the problem indicate the population studied?
1.8 Does the problem indicate the variables in the study?

2.0 Review of Literature
2.1 Does the review of literature seem comprehensive? Are all important previous studies included?
2.2 Are primary sources emphasized?
2.3 Is the review up to date?
2.4 Have studies been critically reviewed, and flaws noted, and have the results been summarized? (I disagree with 2.1 and 2.4.)
2.5 Does the review emphasize studies directly related to the problem?
2.6 Does the review explicitly relate previous studies to the problem?
2.7 If appropriate, does the review establish a basis for research hypotheses?
2.8 Does the review establish a theoretical framework for the significance of the study?
2.9 Is the review well organized?

3.0 Research Hypothesis
3.1 Is the hypothesis stated in declarative form?
3.2 Does the hypothesis follow from the literature?
3.3 Does the hypothesis state expected relationships or differences?
3.4 Is the hypothesis testable?
3.5 Is the hypothesis clear and concise?

4.0 Selection of Participants
4.1 Are the participants clearly described?
4.2 Is the population clearly defined?
4.3 Is the method of sampling clearly described?
4.4 Is probability sampling used? If so, is it proportional or disproportional?
4.5 What is the return rate in a survey study?
4.6 Are volunteers used?
4.7 Is there an adequate number of participants?

5.0 Instrumentation
5.1 Is evidence for validity and reliability clearly stated and adequate? Is the instrument appropriate for the participants?
5.2 Are the instruments clearly described? If an instrument is designed for a study by the researchers, is there a description of its development?
5.3 Are the procedures for gathering data clearly described?
5.4 Are norms appropriate if norm-referenced tests are used?
5.5 Are standard-setting procedures appropriate if criterion-referenced tests are used?
5.6 Do the scores distort the reality of the findings?
5.7 Does response set or faking influence the results?
5.8 Are observers and interviewers adequately trained?
5.9 Are there observer or interviewer effects?

6.0 Design
6.1 Descriptive and Correlational
6.1a If descriptive, are relationships inferred?
6.1b Do graphic presentations distort the findings?
6.1c If comparative, are criteria for identifying different groups clear?
6.1d Are causative conclusions reached from correlational findings?
6.1e Is the correlation affected by restriction in range and by the reliability of the instruments?
6.1f If predictions are made, are they based on a different sample?
6.1g Is the size of the correlation large enough?
6.1h If causal-comparative, has the causal condition already occurred? How comparable are the participants in the groups being compared?
6.2 Experimental
6.2a Is there direct manipulation of an independent variable?
6.2b Is the design clearly described? Is random assignment used?
6.2c What extraneous variables are not controlled in the design?
6.2d Are the treatments very different from one another?
6.2e Is each replication of the treatment independent of other replications? Is the number of participants equal to the number of treatment replications?

7.0 Results
7.1 Is there an appropriate descriptive statistical summary?
7.2 Is statistical significance confused with practical significance?
7.3 Is statistical significance confused with internal or external validity?
7.4 Are appropriate statistical tests used?
7.5 Are levels of significance interpreted correctly?
7.6 How clearly are the results presented?
7.7 Is there a sufficient number of participants to give valid statistical results?
7.8 Are data clearly and accurately presented in graphs and tables?

8.0 Discussion and Conclusions
8.1 Is interpretation of the results separate from reporting of the results?
8.2 Are the results discussed in relation to previous research, methodology, and the research problem?
8.3 Do the conclusions address the research problem?
8.4 Do the conclusions follow from the interpretation of the results?
8.5 Are the conclusions appropriately limited by the nature of the participants, treatments, and measures?
8.6 Is lack of statistical significance properly interpreted?
8.7 Are the limitations of the findings reasonable?
8.8 Are the recommendations and implications specific?
8.9 Are the conclusions consistent with what is known from previous research?

Quantitative versus Qualitative Research

The purpose of research is to draw some causal inference, A leads to B, A affects B, so that intervention can be introduced. A particular study may or may not be able to draw a causal conclusion, but the eventual goal of research in any discipline is to draw such conclusions. There is a fundamental difference in interpreting a causal relationship that distinguishes the quantitative from the qualitative approach. (The chair example.) The quantitative approach focuses on the specific or most salient causal link, whereas the qualitative approach takes into consideration the whole chain of events as contributing to a specific social process. By tradition, however, relationship, process, and context are talked about in qualitative research, whereas "cause and effect" is considered quantitative terminology. More formally, the two philosophies are distinguished as follows:

Alternative conditions: sufficient but not necessary
The presence of the condition is associated with the presence of the outcome, but the absence of the condition is not associated with the absence of the outcome. The condition is sufficient by itself but not necessary. The flu virus is a sufficient but not necessary condition for headache.

Contingent conditions: necessary but not sufficient
The absence of the condition indicates the absence of the outcome, but the presence of the condition does not indicate the presence of the outcome. The condition is necessary but not sufficient by itself. The ability to discriminate letters is necessary but not sufficient for reading.

Conclusions drawn from the two philosophies:

Quantitative: To be able to infer causality, conditions have to be both sufficient and necessary; i.e., the presence of the condition is accompanied by the presence of the outcome, and the absence of the condition is accompanied by the absence of the outcome. The only way to test a causal relationship is an experiment, in which the independent variable (the cause) is manipulated to observe the corresponding change in the dependent variable (the outcome). There are different variations of the experiment to fit the constraints of social science research, e.g., quasi-experimental, causal comparative, and correlational designs, but the ideas behind them are the same and derive from physical science research, particularly physics.

Qualitative: A constellation of conditions that are individually insufficient but necessary, and jointly unnecessary but sufficient (INUS), brings about the outcome. (Meehl's example.)
Any social phenomenon resembles the INUS situation. Every social factor by itself is insufficient, even though necessary, but together the factors are sufficient to bring about an effect, even though that particular combination is not necessary. The emphasis is on different factors, angles, points of view, culture, the big picture, the chain of events, which are necessary but not sufficient. Since the combination of them, which is sufficient, is not necessary, and there can be other combinations, there is no pattern of relationship that is uniformly true. Here the emphasis is on contexts and situations: they have to be taken into consideration, or findings are context dependent, whereas the quantitative approach emphasizes generalization and statistical inference from sample to population. Qualitative researchers try to study a process (rather than an isolated event) by taking into consideration the different factors contributing to the process. Within the qualitative approach, there are also disciplinary emphases in educational research. An anthropological orientation (ethnography) emphasizes the role of culture in influencing behaviour. Researchers with a sociological bent tend to emphasize symbolic interaction: people are seen as acting according to the meaning of things and persons to them; their reality is socially constructed. People act not according to what the school is supposed to be, but according to how they see it.

Quan:
Philosophy: Isolated causal link.
Method: Experiment.
Effort: Control extraneous variables to isolate the particular linkage (e.g., attitude of subjects, history, instrumentation, testing, maturation); standardize data collection. Following the physical science tradition, study only what is observable, measurable, and testable. Latent constructs have to be operationalized; many topics are therefore never attempted as research.

Qual:
P: INUS, causal chain.
M: Field study. Consider combinations of factors. Look at context. Use different data collection schemes to obtain all sorts of information. Social phenomena and human behaviour include what is not directly observable: intentions, feelings, and aspirations influenced by norms, culture, and values. Observable behaviours are no more real than internal phenomena.

Quan:
P: Following the physical science tradition, study the social phenomenon or human behaviour as an objective and impartial observer.
M: Control internal validity threats such as observer bias and observer characteristics. Keep the subjects unaware of the research purpose. Hire data collectors and standardize the data collection conditions by training them. Structured interviews. Let the data speak.
Terminology: Subject, researcher.

Qual:
P: Take the perspective of the people being studied. See the world the way they see it.
M: Participation. The researcher is the only or the major source of data collection. Subjects play a role in data interpretation, e.g., by reading the report and suggesting modifications afterwards.
T: Informants, collaborators, teachers, participants.

Quan:
P: Deductive reasoning; formulate theory from previous research and conduct specific empirical tests.
M: Hypothesis testing. Ask questions before data are collected. Use standardized tests. Confirmatory or explanatory studies.

Qual:
P: Inductive reasoning; theory grounded in observation. From pieces of specific events and observations, develop an explanation.
M: Start from scratch. Extensive and prolonged observation. Go back and forth between data and explanation until a theory is fully grounded in observation. No measurement in the usual sense: measurement is not just asking questions but knowing what to ask. Exploratory or discovery-oriented research.
Quan:
P: Generalization.
M: Random sampling, hypothesis testing, inferential statistics.

Qual:
P: Context dependent; generalization with caution.
M: Purposive samples to gather data from the most representative situations from which to draw generalizations. Informants are selected for their willingness to talk, their sensitivity, knowledge, and insights into a situation, and their ability and influence to gain access to new situations. No intention to use statistical inference. Lengthy text reports; text analysis.

Quan vs. qual is more of an approach and philosophical difference than a methodological difference. Techniques that were traditionally used more often by one approach than the other are now adopted by both. A case study can be used to triangulate the findings from large-scale surveys. Field notes compiled through participant observation and personal interviews can be quantified and statistically analyzed to draw inferences to the population.

Some Details about Research Designs

Randomly assign 3 subjects to each of the 3 treatments, Pu, Pr, and Control.

                     Treatment
        Pu          Pr          Control
        X1,X2,X3    X4,X5,X6    X7,X8,X9
        MPu         MPr         MC          MGrand

Factorial ANOVA: Randomly assign 3 subjects from each gender to each of the 3 treatments, Pu, Pr, and Control.

                            Treatment
              Pu            Pr            Control
  Male        X1,X2,X3      X4,X5,X6      X7,X8,X9
              MPuM          MPrM          MCM           MMale
  Female      X10,X11,X12   X13,X14,X15   X16,X17,X18
              MPuF          MPrF          MCF           MFemale
              MPu           MPr           MC            MGrand

RB ANOVA: Assign every subject to each of the 3 treatments, Pu, Pr, and Control, in a random order.

                   Treatment
  Subject     Pu      Pr      Control   Mean
  1           X1      X1      X1        M1
  2           X2      X2      X2        M2
  3           X3      X3      X3        M3
  4           X4      X4      X4        M4
  Mean        MPu     MPr     MC        MGrand

SPANOVA: Assign every subject from each gender to each of the 3 treatments, Pu, Pr, and Control, in a random order.

  Sex (Between-              Treatment (Within-Subject Factor)
  Subject Factor)  Subject   Pu      Pr      Control   Mean
  Male             1         X1      X1      X1        M1
                   2         X2      X2      X2        M2
                   3         X3      X3      X3        M3
                   4         X4      X4      X4        M4
                             MPuM    MPrM    MCM       MMale
  Female           5         X5      X5      X5        M5
                   6         X6      X6      X6        M6
                   7         X7      X7      X7        M7
                   8         X8      X8      X8        M8
                             MPuF    MPrF    MCF       MFemale
                             MPu     MPr     MC        MGrand

Pu: The experimental manipulation of public self-consciousness is achieved by having the subjects respond to Spence's Attitudes Towards Women Scale with the awareness that their answers will be evaluated by other people.
Pr: The experimental manipulation of private self-consciousness is achieved by having the subjects respond to Spence's Attitudes Towards Women Scale in front of a mirror.
Control: Subjects respond to Spence's Attitudes Towards Women Scale without experimental manipulations.
Spence's Attitudes Towards Women Scale is scaled so that higher values indicate more stereotyped attitudes towards women.

Between Subject Design

Between-subjects designs are used to draw inferences about treatment effects for several populations. This statement does not mean that subjects are drawn from different populations to start with; in fact, random assignment is used to create equal groups. Rather, it means that the treatment (or experiment) is so effective that, after the treatment, the behaviour of the subjects in each condition represents that of a different population. Although referred to as an experimental design, the between-subjects design is also used for non-experimental comparisons of different population means; as a non-experimental design, subjects are sampled from different populations to start with. Commonly used between-subjects designs are (1) the completely randomized design and (2) the factorial design.
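A completely randomized design with the three conditions above can be analyzed with a one-way ANOVA; a minimal sketch (Python; the attitude scores are fabricated):

from scipy import stats

# Attitude scores (higher = more stereotyped) for randomly assigned groups.
pu = [12, 15, 14]       # public self-consciousness condition
pr = [9, 8, 10]         # private self-consciousness condition
control = [11, 12, 10]  # no manipulation

f, p = stats.f_oneway(pu, pr, control)
print(f, p)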
The factorial design deals with more than one independent variable. In the case of two independent variables, the design is called a two-way factorial. The two variables can both be experimental variables. More often, one of the two variables is experimental and the other is a measured variable. A special case is the aptitude-treatment interaction design, where one of the two variables is a state variable, which is experimentally manipulated, and the other is a trait variable, which is measured. All of these are experimental designs. Of course, when both variables are measured, e.g., gender by race, the design is non-experimental. Three hypotheses are explored in a two-way factorial: (1) the main effect of one of the variables, (2) the main effect of the other variable, and (3) the interaction between the two variables. Factorials are also distinguished as "fixed" or "random" depending on whether the categories of the independent variables are assumed to exhaust those in the defined population (fixed design) or are a random sample from the population (random design).

Within Subject Design

In a between design, each subject is observed under one condition, and inferences are drawn between subjects from different conditions. In a within design, each subject is observed under more than one condition, and inferences are drawn within the same subjects across different conditions. There are usually three ways a within design can be carried out; all three are generally referred to as randomized blocks (RB) designs (see the sketch after this list).
(1) Subjects are observed under all treatment conditions, and the order of the conditions assigned to each subject must be random. For example, each subject experiences three dosage conditions of 1 mg, 5 mg, and 10 mg; some subjects have 5 mg first and some have 10 mg first. The key is that the three conditions are assigned to each subject in a random order. As another example, have the same teachers rate content-identical essays bearing names that represent different immigration statuses and genders to see whether there is discrimination in the rating of essays. Each teacher rates 4 essays (female-HK, male-HK, female-Mainland, male-Mainland), the order of which is random.
(2) Matched subjects are randomly assigned to the different treatments. K matched subjects form a block; within a block, subjects are randomly assigned to the treatment conditions, one subject per condition. Subjects within a block are considered identical as far as the dependent variable is concerned. The design's strength lies in the within-block variability being smaller than the between-block variability. In the dosage example, we can match subjects by, for example, age, gender, and physical condition (all of which have to be related to the dependent variable) and randomly assign each matched triplet to the three dosage conditions.
(3) A special case of the first design is the repeated measures design, where each subject undergoes repeated observations. For example, in a test-retest design, a subject is observed twice, before and after the treatment. The repeated measures design can also be used in non-experimental studies involving multiple observations.
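A randomized blocks/repeated measures analysis can be sketched with statsmodels' AnovaRM (Python; the long-format data below are invented to mirror the dosage example):

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long format: each of 4 subjects is observed under all three dosages.
df = pd.DataFrame({
    "subject":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "dose":     ["1mg", "5mg", "10mg"] * 4,
    "response": [5, 7, 9, 4, 6, 8, 6, 7, 10, 5, 6, 9],
})

# The subject factor plays the role of the blocks; the F test for dose
# is computed against the subject-by-dose error term.
print(AnovaRM(df, depvar="response", subject="subject",
              within=["dose"]).fit())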
Mixed Design

If you add a between-factor to the randomized blocks design, you have the split-plot (SP) design, which is a mixed design. There are now two independent variables: one varies within subjects and is thus called the within-factor; the other varies between subjects and is called the between-factor. This basic mixed design can be extended to include more than one within-factor and/or more than one between-factor. All the requirements for the RB design apply to the SP design, and the within-factor of an SP design can be formed in the same three ways that make up an RB design:
1. The same subjects are observed under all treatment conditions, the order of which is random.
2. Matched subjects form a block, within which subjects are considered equivalent and are randomly assigned to treatments.
3. The same subjects undergo repeated measures, where a random order does not exist.

Either of the two factors that make up an SP design can be an experimental or a measured variable, creating four possible scenarios:
1. The between-factor is an experimental (state) variable and the within-factor is a measured variable. For example, the between-factor is Pu (public self-consciousness induced by the camera condition), Pr (private self-consciousness induced by the mirror condition), and a control group, with subjects randomly assigned to these three conditions. The within-factor is pre-test (before the experimental conditions) and post-test (under the experimental conditions). This is the typical pretest-posttest experimental design. (One typical RB design is the test-retest design, which is not experimental because there is no control group.) In this example, the within-factor represents Situation 3 above.
2. The between-factor is a measured variable and the within-factor is an experimental variable. In the above example, the Pu, Pr, and Control conditions can be made a within-factor by randomly assigning these conditions either to the same subjects or to blocks of matched triplets. The between-factor can be gender. In this example, the within-factor represents Situation 1 or 2 above.
3. Both factors are experimental. In our earlier Prozac example, the within-factor is the three experimental conditions of 10 mg Prozac, 5 mg Prozac, and placebo. We can add a between-factor of whether or not subjects receive counselling to cope with depression. The research question is whether the combination of counselling and Prozac is more effective than either counselling or Prozac alone. In this case, both the between- and within-factors are experimental variables. The between-factor conditions are created by randomly assigning subjects to either the counselling or the no-counselling condition. The within-factor can be created in two ways: corresponding to Situation 1, within each of the two between-factor conditions, subjects undergo all three Prozac conditions in a random order; corresponding to Situation 2, matched triplets are randomly assigned to the three Prozac conditions.
4. Finally, both factors can be measured variables in a non-experimental study. For example, I am obtaining repeated measures on the withdrawn and aggressive behaviours of primary school children by giving the children three questionnaires at the end of three consecutive semesters. The within-factor is the three waves of questionnaire data. One between-factor could be gender. Another between-factor could be the popularity classification of these children: through peer nomination, children can be classified as popular, rejected, or neglected.
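One way to sketch the analysis of such a split-plot layout is a mixed-effects model with a random intercept per subject, an approximation of the classical SP ANOVA rather than the textbook computation (Python; the data are simulated and the factor names follow the gender-by-condition example above):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=6)

# Long format: 8 subjects (4 per gender), each observed under Pu, Pr, Control.
subject = np.repeat(np.arange(8), 3)
gender = np.repeat(["male"] * 4 + ["female"] * 4, 3)
condition = np.tile(["Pu", "Pr", "Control"], 8)
score = rng.normal(10, 2, 24) + (condition == "Pu") * 2

df = pd.DataFrame({"subject": subject, "gender": gender,
                   "condition": condition, "score": score})

# The random intercept absorbs the between-subject variability; the
# gender-by-condition interaction is the split-plot question.
fit = smf.mixedlm("score ~ C(gender) * C(condition)",
                  data=df, groups=df["subject"]).fit()
print(fit.summary())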
Analysis of Covariance (ANCOVA)

ANCOVA is ANOVA with a covariate or covariates. The covariate summarizes individual differences on the dependent variable, differences that would otherwise be allocated to error in an ANOVA. Naturally, the covariate should be correlated with the dependent variable. By removing variation due to persons from the error term, ANCOVA achieves the same goal as a within-subject or mixed design (RB and SP ANOVA).

Three statistical assumptions:
1. Linearity of regression. The relationship between the dependent variable and the covariate is linear. Simply check the scatter plot to examine this assumption.
2. Homogeneity of regression. The regression of the dependent variable on the covariate is the same across the different treatment groups. The most intuitive check is to conduct separate regressions within the treatment groups and see whether the regression coefficients are similar.
3. The covariate is independent of the treatment variable. The way to ensure this assumption is to obtain the covariate before conducting the experiment, that is, to measure the covariate before randomly assigning subjects to the different treatment conditions.

The purpose of the covariate is to reduce the within-group variance, that is, the differences among people within groups. If the covariate is highly related to the independent variable, the difference between groups will also be reduced by the part of the covariate that is related to the independent variable. However, in non-experimental and quasi-experimental studies (e.g., teaching methods compared using intact classes that may differ in ability, with ability used as the covariate), the covariate is often correlated with the independent variable, and ANCOVA is still used. In this situation, a different question is asked in non-experimental research: how much of the difference is purely due to the grouping variable after the covariate is accounted for? Here researchers may deliberately use a covariate that is related to the grouping variable so that "pure" group differences can be identified after controlling for the covariate. For example, in comparing gender-role attitudes among several age groups, one may want to control for education, which is related to age group. In this design, one can find out how much of the difference is due to education and how much is due to age itself.
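In code, ANCOVA is just a linear model with the grouping variable and the covariate, and the interaction model provides a check on the homogeneity-of-regression assumption. A minimal sketch (Python; the data are simulated and the effect sizes invented):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=7)
n = 90

# Covariate measured before random assignment, hence independent of
# the treatment by design.
ability = rng.normal(100, 15, n)
group = rng.permutation(np.repeat(["A", "B", "C"], n // 3))
score = 0.5 * ability + (group == "A") * 4 + rng.normal(0, 8, n)

df = pd.DataFrame({"ability": ability, "group": group, "score": score})

# ANCOVA: the covariate soaks up individual differences, shrinking the
# error term against which the group effect is tested.
ancova = smf.ols("score ~ C(group) + ability", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))

# Homogeneity of regression: the group-by-covariate interaction should
# be nonsignificant if the slopes are the same across groups.
check = smf.ols("score ~ C(group) * ability", data=df).fit()
print(sm.stats.anova_lm(check, typ=2))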