Enrico Giai
BA in Translating and Interpreting
MA Student in Translation Studies
Turin University
Email: enrico.giai@gmail.com
Data Collection and
Analysis in Sociolinguistics
Practical elements for research methods in sociolinguistics
Turin, 07-08 April 2014
2
Tuesday, April 8th
Main topics
• Inferential statistics
  • Variables
  • Hypothesis
  • Null hypothesis
  • Likelihood
  • Chi-square test
  • ANOVA
• Rbrul for inferential and multivariate statistics
3
Inferential Statistics – Variables
• Two types of variables
  • Dependent
  • Independent
• The independent variable(s) affect the dependent variable in some predictable way
• Another classification (for questions):
  • Category type variables (usually dependent variables)
  • Ordinal type variables
  • Continuous type variables (usually independent variables)
4
Inferential Statistics – Experimental and Null Hypothesis
• Experimental hypothesis
  • The hypothesis that a certain variable is affected in a predictable and systematic way by some other variable
  • Must be tested
• Null hypothesis: the opposite of the experimental hypothesis, i.e. that the variable is not affected by the other variable
5
Inferential Statistics – Likelihood and Statistical Significance
• Likelihood, or statistical significance
  • Roughly, how likely it is that the null hypothesis is true; more precisely, the probability (p) of obtaining results at least as extreme as those observed if the null hypothesis were true
  • Expressed as a percentage
  • As a convention in the humanities and social sciences, we take 5% sure that the null hypothesis is true (p = 0.05) as a cut-off point. Greater than 5% sure (p > 0.05), we cannot reject the null hypothesis; less than or equal to 5% sure (p ≤ 0.05), we reject the null hypothesis (Levon 2010:71)
6
Inferential Statistics – Chi Square Test (1)
• Used with 2 category type questions
• The test compares the observed frequencies with the expected ones, in order to establish whether the null hypothesis can be rejected
• How to
  • Calculate the observed frequencies
  • Calculate the expected frequencies
  • Calculate the chi-squared value for each cell
  • Sum the chi-squared values
  • Calculate the degrees of freedom
  • If the resulting p-value is higher than 0.05 (that is, the summed chi-squared value is lower than the critical value for p = 0.05 at those degrees of freedom), the null hypothesis cannot be rejected
• You can use RBRUL or the TEST.CHI.QUAD Excel formula (the Italian-language name of CHISQ.TEST)
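The "How to" steps above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's actual data: the counts in `observed` are hypothetical, and the critical values are the standard table values for p = 0.05.

```python
# Hypothetical observed counts: rows = age brackets, columns = code-switching yes/no.
observed = [
    [20, 10],
    [15, 15],
    [5, 25],
]

# Standard critical chi-squared values for p = 0.05, keyed by degrees of freedom.
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square(table):
    """Return (statistic, degrees_of_freedom, expected) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected frequency of a cell = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Chi-squared value of a cell = (observed - expected)^2 / expected; sum them up.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


statistic, dof, expected = chi_square(observed)
# The null hypothesis is rejected only if the statistic exceeds the critical value.
reject_null = statistic > CRITICAL_005[dof]
print(f"chi-squared = {statistic:.3f}, df = {dof}, reject null: {reject_null}")
```

With these made-up counts the statistic is well above the critical value for 2 degrees of freedom, so the null hypothesis would be rejected.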
7
Inferential Statistics – Chi Square Test (2)
• You can use the TEST.CHI.QUAD Excel formula
• Example: occurrences of code-switching in relation to age brackets in the Filipino language survey
• N.B.: Age counts as a category type question here because we consider age brackets!
8
Inferential Statistics – Chi Square Test (3)
• Expected frequencies, computed as row total × column total / grand total:
  • =(E3*B5)/E5 in H3
  • =(E3*C5)/E5 in I3
  • =(E3*D5)/E5 in J3
9
Inferential Statistics – Chi Square Test (4)
• Chi Square Test in J5:
  • =TEST.CHI.QUAD(B3:D3;H3:J3)
The returned value is the p-value. It is > 0.05, so the null hypothesis cannot be rejected: the results may be due to chance (NO STATISTICAL SIGNIFICANCE)
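For readers without Excel, the sketch below shows roughly what a chi-square test function does with an observed and an expected range: it sums the chi-squared values and converts the sum into a p-value. The counts are hypothetical, the function names are illustrative, and the degrees of freedom are passed in explicitly as a simplifying assumption (Excel derives them from the shape of the ranges).

```python
import math


def chi2_p_value(statistic, dof):
    """Upper-tail probability of the chi-squared distribution (closed-form series)."""
    if statistic < 0 or dof < 1:
        raise ValueError("statistic must be >= 0 and dof >= 1")
    half = statistic / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{k=0}^{dof/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, dof // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_k x^(k-1) / (1*3*...*(2k-1))
    term, total = 1.0, 0.0
    for k in range(1, (dof - 1) // 2 + 1):
        if k > 1:
            term *= statistic / (2 * k - 1)
        total += term
    tail = math.sqrt(2 * statistic / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + tail


def chi_test(observed, expected, dof):
    """Sum the per-cell chi-squared values and return the p-value."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2_p_value(statistic, dof)


# Hypothetical single row of three observed counts against equal expected counts.
p = chi_test([30, 25, 20], [25.0, 25.0, 25.0], dof=2)
significant = p <= 0.05
print(f"p = {p:.4f}, significant: {significant}")
```

Here p is above 0.05, so, as on the slide, the null hypothesis cannot be rejected.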
10
Inferential Statistics – Scatterplot
• Used with 2 continuous type questions
• Shows the correlation between two variables
  • Positive correlation
  • Negative correlation
• You can use RBRUL – see slide #56
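The direction of a correlation can also be expressed as a number. The sketch below computes Pearson's correlation coefficient r, which is positive for a positive correlation and negative for a negative one. The ages and language counts are invented for illustration and are not taken from the survey.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical respondents: age and number of known languages.
ages = [20, 30, 40, 50, 60]
languages = [4, 4, 3, 2, 2]

r = pearson_r(ages, languages)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction} correlation)")
```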
11
Inferential Statistics – ANOVA
• ANalysis Of VAriance
• Bi/Multivariate Regression Analysis
• Used with more than one category type question and more than one continuous type question
• You can use RBRUL – see e-book
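To give an idea of what "analysis of variance" means, here is a minimal sketch of the classical one-way ANOVA F statistic: the variance between group means divided by the variance within groups. It is only an illustration with invented numbers; it is not the regression procedure that Rbrul runs.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the single values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical scores of a continuous variable in three categories.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_between, df_within = one_way_anova_f(groups)
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```

A large F value means that the groups differ more from each other than the values differ inside each group.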
12
Inferential and multivariate statistics
• Inferential statistics
  • Formulating and testing hypotheses
  • Key concepts: likelihood, dependent and independent variables, hypothesis and null hypothesis
• Multivariate statistics, or statistical modelling
  • How a dependent variable changes in relation to two or more independent variables
  • Key concept: the three lines of evidence (see Tagliamonte 2012)
    • Statistical significance (p<0.05)
    • Factor weight (FW→1)
    • Strength of factor group
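Factor weights and log-odds are two scales for the same information. The sketch below assumes the usual logistic (inverse-logit) relation between them: a log-odds of 0 corresponds to a factor weight of 0.5, positive log-odds give weights closer to 1, and negative log-odds give weights closer to 0. The function names are illustrative.

```python
import math


def factor_weight(log_odds_value):
    """Convert log-odds to a factor weight between 0 and 1 (inverse logit)."""
    return 1.0 / (1.0 + math.exp(-log_odds_value))


def log_odds(weight):
    """Convert a factor weight (strictly between 0 and 1) back to log-odds (logit)."""
    if not 0.0 < weight < 1.0:
        raise ValueError("weight must be strictly between 0 and 1")
    return math.log(weight / (1.0 - weight))


for value in (-1.0, 0.0, 1.0):
    print(f"log-odds {value:+.1f} -> factor weight {factor_weight(value):.3f}")
```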
13
Rbrul and multivariate statistics
• Rbrul
  • Based on R
  • Tool for multivariate statistics
  • Input: Excel worksheet
  • Output: numbers
• What for?
  • Formulating hypotheses after descriptive analysis of a questionnaire/corpus
  • Testing hypotheses with inferential multivariate analysis
• What do we need?
  • Excel worksheet in .csv format
  • R
Converting .xlsx format to .csv (1)
14
Let’s consider the Filipino language survey (.xls format)
1. Go to http://www.docspal.com (or another online converter)
Converting .xlsx format to .csv (2)
15
2. Upload .xls or .xlsx Excel file and select .csv in “convert to”
Converting .xlsx format to .csv (3)
16
3. Click on “Convert”
Converting .xlsx format to .csv (4)
17
4. Click on output file
Converting .xlsx format to .csv (5)
18
5. Click on “Salva pagina con nome” (“Save page as”)
Converting .xlsx format to .csv (6)
19
6. .csv output file
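Before loading the converted file, it can be worth checking that every row has as many columns as the header. The sketch below does this with Python's standard `csv` module. The sample text and its column names are hypothetical, and the semicolon delimiter is an assumption: some converters and Italian-locale exports separate fields with `;` instead of `,`.

```python
import csv
import io


def check_csv(text, delimiter=","):
    """Return (header, number_of_data_rows, line_numbers_with_wrong_column_count)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        raise ValueError("the file is empty")
    header = rows[0]
    # Line numbers start at 2 because line 1 is the header.
    bad_lines = [
        line_no
        for line_no, row in enumerate(rows[1:], start=2)
        if len(row) != len(header)
    ]
    return header, len(rows) - 1, bad_lines


# Hypothetical content of a converted questionnaire file.
sample = "id;age;code_switching\n1;34;yes\n2;51;no\n"
header, n_rows, bad_lines = check_csv(sample, delimiter=";")
print(header, n_rows, bad_lines)
```

For a real file, the text can be read first with `open(path, newline="", encoding="utf-8").read()`, where `path` is the location of the .csv output file.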
Rbrul: Installation step-by-step (1)
20
1. Download R: http://cran.r-project.org/bin/windows/base/
Rbrul: Installation step-by-step (2)
21
2. Press “Avanti” (“Next”) until the installation process finishes.
Rbrul: Installation step-by-step (3)
22
3. Open R. If you have trouble, right-click the R icon and choose “Esegui come amministratore” (“Run as administrator”).
Rbrul: Installation step-by-step (4)
23
4. Open R.
Rbrul: Installation step-by-step (5)
24
5. Write: source("http://www.danielezrajohnson.com/Rbrul.R")
Rbrul: Installation step-by-step (6)
25
6. Hit the Enter key
Rbrul: Installation step-by-step (7)
26
7. Write rbrul()
Rbrul: Installation step-by-step (8)
27
8. Hit the Enter key.
Now you are in Rbrul.
28
Rbrul: Loading data (1)
1. Write 1 and press the Enter key
29
Rbrul: Loading data (2)
2. Write c and press the Enter key
30
Rbrul: Loading data (3)
3. Open the questionnaire in .csv
Rbrul: Loading data (4)
31
4. Now you are ready
32
Example: Linguistic survey and RBRUL (1)
• Considered variables:
  • Code-switching (category type variable/question)
  • Who speaks what language(s) at work, with friends, & with family in IT & PH (continuous type question)
  • Who uses what language(s) when watching TV, reading, dreaming, & thinking (category type question)
  • Number of known languages (continuous type question)
  • Age (continuous type question)
33
Example: Linguistic survey and RBRUL (2)
• Hypotheses:
  • Code-switching & who speaks what language(s) at work, with friends, & with family in IT & PH (cat+con: bivariate analysis)
  • Code-switching & who uses what language(s) when watching TV, reading, dreaming, & thinking (cat+cat: cross tabulation)
  • Number of known languages & age (con+con: scatterplot)
Hypothesis: Code-switching and language use (1)
34
Formulate a hypothesis on code-switching and language use with friends in IT/PH using bivariate analysis.
Is there a relation between the number of languages used to talk with friends in PH and in IT & the occurrences of code-switching?
Average PH: 1.43
Average IT: 1.31
Category+continuous: bivariate analysis
1. Press 5 for bivariate analysis and hit the Enter key.
Hypothesis: Code-switching and language use (2)
35
2. Choose variables (1)
Hypothesis: Code-switching and language use (3)
36
3. Dependent variable (50)
Hypothesis: Code-switching and language use (4)
37
4. Type of response (Enter)
Hypothesis: Code-switching and language use (5)
38
5. Choose application (2 + Enter x3)
Hypothesis: Code-switching and language use (6)
39
6. Choose independent variable (# lang used with friends in IT/PH) (42 Enter 46 Enter x2)
Hypothesis: Code-switching and language use (7)
40
7. Choose continuous variable (42 Enter 46 Enter x2)
Hypothesis: Code-switching and language use (8)
41
8. Modelling (5 Enter)
Hypothesis: Code-switching and language use (9)
42
8. Modelling (5 Enter)
Hypothesis: Code-switching and language use (10)
43
Log-odds: 0.571 vs 0.292 (a positive value means the variable favours the application value)
Deviance: 142.818 vs 144.821 (the larger the deviance, the worse the model fits the data)
P value: 0.0644 vs 0.234 (both > 0.05)
Therefore: the correlation between code-switching and language use with friends is NOT SIGNIFICANT
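One way to read the log-odds of a continuous variable is to turn them into odds ratios with the exponential function: an odds ratio above 1 means that each additional unit raises the odds of the application value. The sketch below applies this to the two reported log-odds and checks the two reported p-values against the 0.05 cut-off. The variable names are illustrative, and the values are kept in the order in which the slide reports them.

```python
import math

# Values reported on the slide, in the same order ("first vs second").
reported_log_odds = [0.571, 0.292]
reported_p_values = [0.0644, 0.234]

# exp(log-odds) = odds ratio per additional unit of the independent variable.
odds_ratios = [math.exp(value) for value in reported_log_odds]

# A result counts as significant only if p <= 0.05.
significant = [p <= 0.05 for p in reported_p_values]

for value, ratio, p, sig in zip(reported_log_odds, odds_ratios, reported_p_values, significant):
    print(f"log-odds {value:.3f} -> odds ratio {ratio:.2f}, p = {p}, significant: {sig}")
```

Both odds ratios are above 1, but neither p-value reaches the cut-off, which is why the relation is reported as not significant.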
44
Hypothesis: Code-switching and language
use (11)
The same procedure can be adopted in analysing the relation between
code-switching & language used with family & at work in Italy and in the
Philippines
45
Hypothesis: Code-switching and language
use – TV (1)
Formulate hypothesis on code-switching and language use when
watching TV using cross tabulation and Chi Square Test.
Is there a relation between the languages used to watch TV & the
occurrences of code-switching?
Category+category: cross tabulation
1. Press 4 for cross tabulation and hit Enter key.
46
Hypothesis: Code-switching and language
use – TV (2)
2. Choose factors for columns (50 Enter)
47
Hypothesis: Code-switching and language
use – TV (3)
3. Choose factors for rows (51 Enter x3)
48
Hypothesis: Code-switching and language
use – TV (4)
4. Cross tabulation
Do those who watch TV in Italian code-switch more?
49
Hypothesis: Code-switching and language
use – TV (5)
5. Chi Square Test in Excel
• Observed frequency of Italian/Code-switching: 45
• Expected frequency of Italian/Code-switching: 32.09
• Multiply the total of the observed frequencies for the first value of the independent variable (=45) by the total of the observed frequencies for the corresponding value of the dependent variable (=87). The product is then divided by the grand total of the frequencies (=122).
$$\text{Exp. freq.} = \frac{87 \times 45}{122} = 32.09$$
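The same calculation can be checked in a couple of lines. The sketch uses only the figures given on this slide (45, 87, 122 and the observed frequency 45); the variable names are illustrative, and the last step, the chi-squared value of this single cell, is added for illustration.

```python
# Figures taken from the slide.
row_total = 45       # total for the first value of the independent variable
column_total = 87    # total for the corresponding value of the dependent variable
grand_total = 122    # total of all frequencies
observed = 45        # observed frequency of Italian/Code-switching

# Expected frequency = row total * column total / grand total.
expected = row_total * column_total / grand_total

# Chi-squared value of this one cell = (observed - expected)^2 / expected.
contribution = (observed - expected) ** 2 / expected

print(f"expected = {expected:.2f}, cell chi-squared = {contribution:.2f}")
```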
50
Hypothesis: Code-switching and language
use – TV (6)
5. Chi Square Test in Excel
51
Hypothesis: Code-switching and language
use – TV (7)
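5. Degrees of freedom: (3-1)*(8-1)=2*7=14
0.1863 is the p-value returned by the chi-square test.
It is greater than 0.05 (for reference, the critical chi-squared value for p=0.05 with 14 degrees of freedom is 23.685), so the null hypothesis cannot be rejected.
Therefore: no statistically significant relationship was found between code-switching and watching TV in other languages.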
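The two checks on this slide, the degrees of freedom and the decision at the 0.05 cut-off, can be written as two small functions. This is a sketch with illustrative function names; it assumes that 0.1863 is the p-value returned by the chi-square test function.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom of a cross tabulation: (rows - 1) * (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)


def decide(p_value, alpha=0.05):
    """Apply the conventional cut-off: reject the null hypothesis only if p <= alpha."""
    if p_value <= alpha:
        return "reject null hypothesis"
    return "cannot reject null hypothesis"


dof = degrees_of_freedom(3, 8)
decision = decide(0.1863)
print(f"df = {dof}, decision: {decision}")
```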
52
Hypothesis: Age and number of known
languages (1)
Formulate hypothesis on age and number of known languages using
scatterplot.
Is there a relation between the number of known languages & age?
Continuous+continuous: scatterplot
1. Press 6 for scatterplot and hit Enter key.
53
Hypothesis: Age and number of known
languages (2)
2. Press 1 to enter scatterplot menu and select y-axis variable (2 Enter)
54
Hypothesis: Age and number of known
languages (3)
3. Press 12 to enter scatterplot menu and select x-axis variable (12 Enter)
55
Hypothesis: Age and number of known
languages (4)
4. Select standard layout and features (Enter x8)
56
Hypothesis: Age and number of known
languages (5)
5. Scatterplot appears
Hypothesis: Age and number of known languages (6)
57
The line is higher at the beginning and lower at the end: negative relation
The older the people, the higher the number of known languages?
NO: a negative correlation means that the number of known languages decreases as age increases
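The line drawn through a scatterplot can be described by its slope: a negative slope means that the y-axis variable decreases as the x-axis variable increases. The sketch below fits a least-squares line to invented data (not the survey's) to show how the sign of the slope is read.

```python
def trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    if spread_x == 0:
        raise ValueError("all x values are identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / spread_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical respondents: age (x-axis) and number of known languages (y-axis).
ages = [20, 30, 40, 50, 60]
languages = [4, 4, 3, 2, 2]

slope, intercept = trend_line(ages, languages)
relation = "negative" if slope < 0 else "positive" if slope > 0 else "none"
print(f"slope = {slope:.3f}, intercept = {intercept:.2f}, relation: {relation}")
```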
58
To sum up
1. Descriptive analysis – Hypothesis → Null Hypothesis
  • 1 continuous + 1 category type question: comparison of means
  • 2 category type questions: crosstabs
  • 2 continuous type questions: scatterplot
2. Inferential analysis – Null hypothesis test → Significance
  • 1 continuous + 1 category type question: bi/multivariate analysis (ANOVA)
  • 2 category type questions: cross tabulation + chi squared test
  • 2 continuous type questions: correlation
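The summary above works like a lookup table: the types of the two questions decide which descriptive and which inferential analysis to use. A minimal sketch of that lookup follows; the function and dictionary names are illustrative.

```python
# (descriptive analysis, inferential analysis) for each pair of question types.
ANALYSES = {
    ("category", "continuous"): (
        "comparison of means",
        "bi/multivariate analysis (ANOVA)",
    ),
    ("category", "category"): (
        "crosstabs",
        "cross tabulation + chi squared test",
    ),
    ("continuous", "continuous"): (
        "scatterplot",
        "correlation",
    ),
}


def choose_analysis(first_type, second_type):
    """Return (descriptive, inferential) analysis for two question types."""
    key = tuple(sorted((first_type, second_type)))
    if key not in ANALYSES:
        raise ValueError(f"no analysis listed for {first_type} + {second_type}")
    return ANALYSES[key]


descriptive, inferential = choose_analysis("continuous", "category")
print(f"descriptive: {descriptive}; inferential: {inferential}")
```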
59
References
BLOOMER, A. & WRAY, A., 2006, Projects in Linguistics, 2nd Edition, Hodder Arnold, London and New York.
HON, K., 2013, “An Introduction to Statistics”, retrievable from the World Wide Web: http://www.artofproblemsolving.com/LaTeX/Examples/statistics_firstfive.pdf
JOHNSON, D. E., 2009, “Getting off the GoldVarb Standard: Introducing Rbrul for Mixed-Effects Variable Rule Analysis”, in Language and Linguistics Compass 3/1: 359–383.
TAGLIAMONTE, S. A., 2012, “Quantitative Analysis”, in TAGLIAMONTE, S. A., 2012, Variationist Sociolinguistics: Change, Observation, Interpretation, Wiley-Blackwell, Chichester.
TAMMINGA, M., 2011, “Getting started with Rbrul”, retrievable from the World Wide Web: http://www.danielezrajohnson.com/Getting_started_with_Rbrul.pdf
SUNDERLAND, J., 2010, “Research Questions in Linguistics”, in LITOSSELITI, L. (ed.), Research Methods in Linguistics, Continuum, London and New York: 9–28.
See e-book in my blog
60
Thank you