Notes: Chapters 3 and 4, Litosseliti
http://www.city.ac.uk/people/academics/evangelia-litosseliti

Chapter 3, by Sebastian M. Rasinger
About the author: http://www.sebastianrasinger.com/?page_id=5
"Quantitative Methods: Concepts, Frameworks and Issues"

Overview/Outline (repeated in 3.1 Introduction):
1. General characteristics of quantitative research, based on the key differences between quantitative and qualitative methodology
2. Real linguistic examples for a discussion of the quantifiability of data (the quality of being measurable), comparing it to qualitative approaches, such as many discourse analytic frameworks
3. (Section 3.2) The concepts of quantitative linguistic variables, hypotheses, theories and laws, reliability and validity
4. (Section 3.3) Critical evaluation of the most frequently used research designs in quantitative research: longitudinal, cross-sectional, and experimental designs
5. (Section 3.4) Using questionnaires in quantitative research: design features, phrasing and sequencing questions, and measuring tools for different variables of interest in linguistic studies. Questionnaire coding is also discussed.

3.2 Quantitative versus qualitative methods

To explain the difference, the discussion begins with an exchange of utterances between a mom and a 2.5-year-old. The section distinguishes a qualitative discussion of the exchange (what the mom does, etc.: looking at patterns, sequences, and the characteristics of both the mom's and the child's utterances) from a quantitative one, and then it introduces MLU (mean length of utterance). I don't know why she doesn't cite Roger Brown, whose metric this first was.

Qualitative vs. Quantitative
Qualitative: concerned with structures and patterns; "how" something is,
by their very nature, inductive. Rampton's 1995 study of "crossing" was inductive. (Note that below we have a more recent meta-study: Rampton and Charalambous 2010, http://www.kcl.ac.uk/sspp/departments/education/research/ldc/publications/workingpapers/58.pdf)

Quantitative: "how much" or "how many" or... Allows us to compare large numbers of instances statistically. Typically presumed to be deductive: based on a theoretical framework, we develop hypotheses, which we then TEST, against the idea of a null hypothesis (which we don't learn about until Chapter 4). So, "a good hypothesis must have the potential of being wrong" (p. 53, citing Scott and Marshall 2005, Dictionary of Sociology).

Example of results in quantitative question asking: onset of SLA, or the Critical Period Hypothesis? Mention of Birdsong and Molis, 2001; Johnson and Newport, 1991. Two values are extracted: proficiency level (using some metric) and age.

"Talking about quantitative methods inevitably means talking about variables." From the OED (p. 53): "a variable is something which is liable to vary or change; a changeable factor, feature, or element." Gender is a frequent variable in linguistics, but we need to rethink our assumptions about how it can be measured.

Important feature (p. 54): "whenever we want to quantitatively measure something—that is, assign a variable value to a particular case, we need to thoroughly think about a reliable way to make this decision. We need a set of clear and objective definitions for each category or outcome. Moreover, our measure must be designed in such a way that it comprises as many cases as possible."

Dialectology and sociolinguistics have looked at the presence, absence, or different realization of certain linguistic features (Milroy 1987). It comes down to counting the number of different realizations of the vowels being studied: how often does some vowel appear vs. another, and what vowel value do we get?
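The kind of counting described here (how often does one realization of a variable appear vs. another) can be sketched in a few lines of Python; the tokens below are invented for illustration, not Milroy's data:

```python
# Sketch: counting realizations of a (socio)linguistic variable.
# The transcribed tokens are hypothetical, standing in for coded
# realizations of some vowel variable in a corpus.
from collections import Counter

tokens = ["ʊ", "ə", "ʊ", "u", "ə", "ʊ"]
counts = Counter(tokens)

# most_common() lists realizations from most to least frequent.
print(counts.most_common())  # [('ʊ', 3), ('ə', 2), ('u', 1)]
```

Assigning each token a category label like this is exactly the "translation of properties into values" step that the next passage names operationalization.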
"The procedure that leads to the 'translation' of (physical) properties of a case into a numerical value is known as operationalization."

Reliability and Validity
Reliability refers to our measure repeatedly delivering the same or nearly the same results; replication should give us the same results. But we can't really replicate with the same population; people learn. So there's the "split-half" method: take a measure across a group, then split the group randomly and see whether the two halves yield comparable results.

Validity refers to our measure actually measuring what it is supposed to measure; hence measurement or instrument validity.

Research design, theoretical background and actual methods used are inseparably linked and form the overall framework for any study. These three parts must work well together (p. 57).

One type of design is cross-sectional: a selection from a relatively homogeneous population at one point in time, a "snapshot."

Another is longitudinal. These can be panel designs or cohort designs. Bryman 2004 is cited as saying that the crucial difference between the two is that panel designs allow us to observe both cohort and aging effects, but cohort studies will only identify aging effects, hence allowing us to control for third variables. Longitudinal studies require attention to sample retention. A way to circumvent this challenge is the "apparent time" study: Woods 2000 collected data from three generations of a family at about the same time.

Data collection issues
Is language observed in its natural environment? (Think of sociolinguistic research in longitudinal and cross-sectional studies.) But we can only observe (look for and hope to find) variables: does a particular speech community show a particular linguistic feature?

What about experimental design? An experimental design manipulates the variables.
Experimental groups (EG): treatment
Control groups (CG): no treatment

You can test within-subject effects: before and after getting the treatment ("pre- and post-stimulus").
And between-subject effects: some get it and some don't (EG and CG, measured once).
Between-subject designs pose the challenge of making sure that the two groups are similar enough not to introduce other uncontrolled/uncontrollable variables.

3.4 Panacea Questionnaires: Design, Use, and Abuse

Questionnaires are frequently used to measure people's attitudes to and perceptions of languages (matched guise tests, described in Chapter 4, for example), the concept of "ethnolinguistic vitality" (Bourhis et al. 1981; Giles et al. 1977), and language use (Extra and Yagmur 2004; Rasinger 2007).

The data from questionnaires are easy to use. Questionnaires are difficult to design well, so that they reliably generate valid data. How many questions? What data do we need from the questionnaire? Which questions are directed to the research questions? And how do we design them to focus on these?

Questions:
Are your students interested in learning xyz?
To what extent are students interested in learning xyz?
On a scale from 1 to 5, where 5 indicates "very interested" and 1 indicates "not interested at all," to what extent do you think your students are interested in learning English?  1 2 3 4 5

Likert scales are about a spectrum of agreement vs. disagreement. Notice how important it is to know what you're looking for, so you can figure out how to find it (p. 62).

Comprehensive (comprehensible?) and "objective" questions: people have to understand the question. Questionnaires should be (merely) scientific tools that help us to measure different aspects of "reality," very similar to a voltmeter measuring an electric potential. And as such, a questionnaire must measure neutrally and objectively.

Open versus closed questions and multiple-item responses: multiple choice / Likert scale? Or open answer? Interviews are like this. Can we get smaller samples? Focus groups?
What about response sets and acquiescence responses? People may respond "agree" when they think they're supposed to be in agreement, rather than when they actually are. One must be very clever in posing the questions.

The Oviatt has an electronic copy of Rasinger's 2007 book Bengali-English in East London: A Study in Urban Multilingualism.

Coding the questionnaire = turning things into numbers. Creating data matrices. The coding of all other variables works analogously: every potential variable value is assigned a particular numerical value. Likert scales get numbers. Look at Table 3.3.

Section 3.5 Summary

Chapter 4, by Erez Levon
About the author: http://webspace.qmul.ac.uk/elevon/

4.1 What quantitative analyses do

They're about counting and measuring. For something to be counted, two conditions are "normally considered to be necessary":
a. what we want to count must be countable
b. what we want to count must have the potential to be variable (in other words, the results we expect to get when we count must not be constant; who would care otherwise? And see below)

What does this mean? "The condition of quantifiability requires that you operationalize the possible set of responses so that they can be counted in a clear and coherent way." (And we're referred back to 3.2.)

An example is given of creating categories for the question about which issues most affected voters' choice of candidates, the categories being environment, economy, and education. Using categories gives structure to the diversity of responses; we count with a purpose and enter the numbers we reap into the structured categories (with outliers in other places). This piece of operationalization is called coding.

Think of the range of research questions you could ask in the context of this example, and see what the implications for the design might be.
And more on variability: the variability requirement is about the possibility of variation. If we actually find none, that's fine; variability is a requirement about the possible existence of variation. It must be assumed to exist in principle.

Statistics fall into two general categories: descriptive statistics and inferential statistics.

Descriptive statistics provide indices of the general shape or quality of the data; they include the mean and median. Note the claim: "What these calculations (mean/median) allow us to do is identify potential patterns in our data set." We need to find out if the pattern has any weight to it.

So, we must turn to inferential statistics to learn whether what looks like a pattern (a correlation) is one, that is, whether the apparent dependency is justifiable. The earrings vs. shoes example works: does the result that more people without earrings bought red shoes mean anything, or is it the consequence of the fact that more people overall weren't wearing earrings, so people with no earrings bought more of everything, not just red shoes? To test whether or not there may be a pattern, we turn to inferential statistics. And we need a hypothesis to test in the context of collecting and evaluating the data.

An experimental hypothesis is a(n educated) guess about what might be going on with the data. Note that many linguists argue that the hypothesis should be active even before the data are collected. We don't wander out and collect lots of data just to see what might be there.

An experimental hypothesis suggests that the variation in our data (dependent variability) will depend on some other factors (factors that we select and vary independently). And the null hypothesis avers that there is no relationship: whatever pattern we (think we) see is accidental. What we're doing in the analysis of data we collect in the presence of a hypothesis is, in fact, trying to falsify the null hypothesis. And that's the result we're looking for.
Levon makes a point of stressing that "inferential statistics provide a probabilistic measure that allows us to gauge the extent to which the null hypothesis is true—or false." What we want is for the likelihood that it is true to be very, very, very small. That's why we like those "p-values" with lots of zeros following the lone decimal point after a zero: 0.05 (at least, or should I say at most). We like to find statements such as p ≤ 0.01.

Let's look at the summary on p. 72. So, how do we get those "p-values"?

4.2 What Quantitative Method to Use

In other words: "So, how do we get those 'p-values'?" There are "hundreds of different inferential statistical tests." Really? Goodness. The choice, however, depends on the kind and number of variables we consider.

Categorical variables: is it an x or a y or a z? Are they wearing earrings or not?
Continuous variables: anything scalar. How old (on a scale)? 35 or 40?

In this chapter, we're looking only at studies with one independent and one dependent variable. But either of these can be categorical or continuous.

When the independent variable is continuous, correlation analyses are used. When there are combinations of categorical and continuous independent variables, other kinds of statistical tests are used (Generalized Linear Models, Linear Mixed Models).

Many studies have more than one independent and more than one dependent variable, and these can get matched in a variety of different ways. ANOVAs (ANalyses Of VAriance), MANOVAs (Multivariate ANOVAs, used when there is more than one dependent variable, to find out more about the relationships among the independent variables themselves, between them as a group and the dependent variables as a group, and to compare the effects of the multiple independent variables on the multiple dependent ones), and linear regressions are given as examples.

When the independent variable is categorical, the dependent variable can still be either categorical or continuous.
The chapter looks at both, as the character of the dependent variable determines the choice of statistical test:
chi-square (χ2) is selected when the dependent variable is categorical
the t-test is used when the dependent variable is continuous

Chi-square tests calculate what the distribution of variable values would be if the null hypothesis were true for the sample being studied. This distribution is compared to what is actually found, so that the null hypothesis can be tested (and falsified). They cannot be used to examine data from continuous dependent variables. t-tests are used with continuous dependent variables.

Descriptive methods are also used, and are the first steps in t-tests. The example, with the varying heights of 10 people, has heights as the variables, of course. They could be a set of dependent variables (we want to see what relationship, if any, there is between taking some vitamin and the height of some group of people) or of independent variables (we want to see what relationship height has to shoe preference: tie or slip-on). In either case, we can describe the variation in height using the mean, the median, and the standard deviation.

mean: the average value in a given set of values
median: the midpoint in a set of values, where half of the values are below it and half are above
standard deviation: the extent to which any single value tends to deviate from the mean. This tells us how well the mean actually represents the set of values; experimenters usually take most values falling within 1 standard deviation of the mean as a "good sign" that the mean is representative.

t-tests examine the means and standard deviations of two sample populations. The goal is to find out whether the respective means are significantly (or not significantly) different from one another (beyond what their respective face values say).
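The three descriptive measures just defined are in Python's standard library; a minimal sketch, using invented heights for 10 people rather than the chapter's data:

```python
# Descriptive statistics over a hypothetical sample of 10 heights (cm).
import math
import statistics

heights = [160, 165, 170, 175, 180, 160, 165, 170, 175, 180]

mean = statistics.mean(heights)      # average value: 170
median = statistics.median(heights)  # midpoint of the sorted values: 170
sd = statistics.pstdev(heights)      # population standard deviation, ~7.07

# The SD is the square root of the mean squared deviation from the mean:
variance = sum((h - mean) ** 2 for h in heights) / len(heights)
assert math.isclose(sd, math.sqrt(variance))
```

Note that `pstdev` is the population standard deviation (divide by n); `stdev` is the sample version (divide by n − 1), which matters for the t-test later.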
A note: not to belabor the obvious, but "significant" and "significantly" have formal definitions here. We are talking about computed numerical values that are considered reliable in deeming results significant or not. It seems important to remind ourselves of that.

4.3 Processing the data
4.3.1 Chi-square tests

Jason Baldridge on definiteness and indefiniteness in Hindi: http://www.ling.upenn.edu/~jason2/papers/hindidef.htm
Note: LING and Computers course at UT Austin (Jason Baldridge): http://lnc-s11.utcompling.com/course-notes

The Hindi indefiniteness example. Note the term calquing. What does it mean? How is it used here?

Note the dependent variable: use of Hindi-derived articles [null article vs. overt article]. Independent variable: speakers' educational and functional level in English [Groups 1, 2, and 3]. One of each, both categorical. So chi-square is the evaluative measure of choice. It works on "raw data": raw numbers. It requires at least five tokens per cell (what's a cell?).

Steps:
1. Create a table of observed data, with cells holding the data. Notice how the data from the independent and dependent variables are arranged.
2. Make sure that there are at least 5 tokens in each cell, or a total of tokens from the study equal to at least five times the total number of cells. We have 6 cells in Table 4.1, and our total of tokens (null and overt articles found in all sentences that needed articles) is 380. That's way more. And we already have more than 5 tokens (null vs. overt articles found) in each cell. This is about the robustness of the data.
3. Construct a table of expected values. The expected values reflect what we'd get if the independent variable had no effect on the dependent variable. Computing expected values: for every cell, multiply that cell's column total by that cell's row total, and then divide that number by the grand total of values (the grand total of all collected tokens; it's 380 for us). So, for cell one, it's 42 × 177 / 380. Do that for each cell.
4. Next, for each cell, we compute the difference between the observed value and the expected value, square that difference, and then divide the square by the expected value:

   (Observed − Expected)² / Expected

5. Sum these values across all the cells to get the chi-square statistic; based on that value, the determination of the p-value is undertaken.

Cowart, Wayne (1997). Experimental Syntax: Applying Objective Methods to Sentence Judgments. Thousand Oaks, CA: SAGE. P291 .C68 1997eb

6. The next step is to determine the degrees of freedom (df). DF may be thought of as setting the general parameters under which the statistical test holds true:

   df = (# of chart rows − 1) × (# of chart columns − 1)

7. We look at a significance chart next, to find out what p-value is associated with our chi-square statistic at its degrees of freedom. [Look at the chart on p. 81.] What we want to find is a high probability that the null hypothesis is false. In the book's example, we can be 95% confident that it is; that is what a p-value of p = 0.05 means. In fact, the data presented yield a p-value of 0.0001; our chapter author stopped at what is "usually expected in the humanities and social sciences."

Another example is given on pp. 81ff., using data about African American English. Note that from the table on p. 82, the first "expected value" would be 62 (column total) × 23 (row total) / 88 (study total) [= 16.2]. Next we begin computing the chi-square statistic, which is the result of adding up the computed values of (ov − ev)² / ev for each cell. So, for the first cell we have (20 − 16.2)² / 16.2.

Make certain we understand the summary on p. 83. Note that Blake and Cutler's article on AAE is available for looking at; put it up on Moodle.

4.3.2 t-tests

Used for data with one independent and one dependent variable, where the independent variable is categorical but the dependent variable is continuous. Carmen Fought's 1999 analysis of vowel fronting among Latina/o speakers in Los Angeles.
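Before going on with t-tests, the whole chi-square procedure from 4.3.1 (expected values, the (O − E)²/E sum, degrees of freedom) fits in a short sketch. The observed table below is hypothetical, not the Hindi or AAE data, and the final significance-chart lookup for the p-value is not reproduced:

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected value
            stat += (o - e) ** 2 / e                         # (O - E)^2 / E
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# Invented table whose rows and columns are perfectly independent:
# observed equals expected everywhere, so the statistic is 0.
stat, df = chi_square([[10, 20], [20, 40]])
print(stat, df)  # 0.0 1
```

With the statistic and df in hand, step 7 would send us to the significance chart (or a function like SciPy's chi-square survival function) for the p-value.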
Very roughly: F1 corresponds to the pharynx, F2 to the mouth. See the table on p. 84.

The independent variable: class. The dependent variable: ratios of /u/ to /i/ F2s.

Step 1: Calculate the mean and standard deviation for both categories of F2 ratios.
The mean is the average. The SD is calculated as follows: first, calculate the difference of each data point from the mean, and square each result; next, calculate the mean of these squared values, and take the square root. This quantity is the population standard deviation, and is equal to the square root of the variance.

Step 2: Calculate the t-test statistic. Knowing which of the many t-tests to use depends on two things:
a. We need to know whether we have paired or unpaired data. Paired data come from experiments where there is some natural relationship between subjects in the two groups before the data are even collected. The most common example of paired data is what is called a repeated measures experiment, e.g., "before and after" tests.
b. Determine whether the two groups in the analysis are of equal or unequal size.

The formula for independent (unpaired) equal-sized samples:

   t = (difference between the means of the two groups' values) / (pooled standard deviation)

The pooled standard deviation is computed this way:

   pooled SD = √((s1² + s2²) / n)

(where n = the number in each group; they're equal, and it's 13 in this example). Follow the steps on pp. 85-86.

Then calculate the degrees of freedom. For t-tests of independent samples with equal sample size, we calculate df by taking the total number of subjects in both groups and subtracting 2. In these data, the total was 26, so 26 − 2 = 24. Next, we take this value to a significance chart, one designed for t-tests.

Let's look a bit at the discussion on p. 87. The last example given is from a classic study by Wallace Lambert et al. in Montreal: the matched guise experiment. The groups of subjects were of different sizes, so a different t-test calculation is needed.
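The equal-sample formula above can be sketched directly (the unequal-sample variant that Lambert's data require is not shown); the two groups of F2 ratios below are invented, not Fought's measurements:

```python
import statistics

def t_equal_samples(group1, group2):
    """Unpaired t statistic for two equal-sized samples, plus df."""
    assert len(group1) == len(group2), "this formula assumes equal n"
    n = len(group1)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)  # sample SDs
    pooled = ((s1 ** 2 + s2 ** 2) / n) ** 0.5  # pooled standard deviation
    t = (m1 - m2) / pooled
    df = 2 * n - 2  # total subjects in both groups, minus 2
    return t, df

# Hypothetical /u/-to-/i/ F2 ratios for two groups (not Fought's values).
group_a = [0.9, 1.0, 1.1, 1.2, 1.0]
group_b = [0.7, 0.8, 0.9, 0.8, 0.8]
t, df = t_equal_samples(group_a, group_b)  # t is about 4.0, df is 8
```

As with chi-square, the final step (taking t and df to a significance chart to read off the p-value) is not reproduced here.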
Research Methods in Sociolinguistics: A Practical Guide. Edited by Janet Holmes and Kirk Hazen. Wiley-Blackwell, 2014. P40.3 .R47 2014

The t-test is summarized on p. 89.

4.4 Interpreting the results

Looking again at Carmen Fought's results: statistical significance and "real world significance" are not always the same thing. But you'd have to do a matched guise test to see how sensitive speakers are to the differences, as small as they seem. Or introduce and discuss qualitative data. Did she do this?

"The basic point is that quantitative methods can only take you so far. They can act as a crucial first step in mapping out the sociolinguistic terrain and in telling you what people are doing with language."

The Blackwell Guide to Research Methods in Bilingualism and Multilingualism. Li Wei and Melissa G. Moyer, eds. Blackwell, 2008. P115 .B575 2008