Measuring Linguistic Complexity

Kristopher Kyle
3-5-2015
Who is this guy?
 Interested in:
 L2 Writing Quality/Development
 Assessment
 Natural Language Processing
 Productive Vocabulary
 Productive Syntax
Outline of Workshop
 Why measure linguistic complexity?
 How can linguistic complexity measures be conceptualized?
 How do we actually measure linguistic complexity?
 Hands-on workshop I: Measuring syntactic complexity
 Hands-on workshop II: From raw data to findings (if time)
Why measure linguistic complexity?
 In the 1970s, SLA researchers (e.g., Larsen-Freeman, 1978) wanted to
measure language development
 Larsen-Freeman proposed three constructs of development:
 complexity
 accuracy
 fluency
 The general hypothesis (with regard to complexity) has been: As
language learners develop, their language will become more
complex.
 How complexity is measured has been the subject of much debate
(e.g., Bulté & Housen, 2012)
How can linguistic complexity measures
be conceptualized?
 Wolfe-Quintero et al. (1998) provide a compendium of complexity,
accuracy, and fluency (CAF) measures used up to the late 1990s
 Lexical Complexity:
 a variety of general and part-of-speech-specific type/token
ratios
 Syntactic Complexity:
 a variety of clause, sentence, and T-unit measures that focus on
clausal complexity.
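As a rough illustration of the simplest lexical measure mentioned above, the sketch below computes a general type/token ratio in Python. The tokenizer is an assumption made for this example (lowercased runs of letters and apostrophes); published studies define tokens and types in different ways, so treat this as a starting point rather than a standard.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    """Number of unique word forms (types) divided by number of running words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "I eat pizza because pizza is delicious."
# 7 tokens, 6 types ("pizza" occurs twice)
print(round(type_token_ratio(sample), 3))  # 0.857
```

A part-of-speech-specific version would apply the same ratio to only the nouns, verbs, and so on, which requires a tagger and is not shown here.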
How can linguistic complexity measures
be conceptualized?
 Most syntactic complexity indices are ratio scores:
(Structure A)/(Structure B).
 The denominator (Structure B) is one of:
 clause: a main verb and its dependents (I eat pizza.)
 T-unit: an independent clause and any attached dependent
clauses (I eat pizza because it is delicious.)
 sentence: a string of words that starts with a capital letter
and ends with sentence-ending punctuation (I think you know
what a sentence is.)
How can linguistic complexity measures
be conceptualized?
 The numerator (Structure A) has included many structures:
 clauses
 dependent clauses
 adverbial clauses
 T-units
 complex T-units
 coordinate phrases
 complex nominals
 verb phrases
 passives
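Once the structures have been counted, each ratio index is simple arithmetic: (Structure A)/(Structure B). The sketch below shows this in Python. The counts and the index names are made up for illustration; they are not taken from any real essay or from a particular tool.

```python
def ratio(numerator_count, denominator_count):
    """Return (Structure A) / (Structure B), or 0.0 when the denominator is zero."""
    if denominator_count == 0:
        return 0.0
    return numerator_count / denominator_count

# Hypothetical hand counts for one short essay (illustrative numbers only).
counts = {"clauses": 12, "dependent_clauses": 4, "t_units": 8, "sentences": 6}

indices = {
    "clauses_per_t_unit": ratio(counts["clauses"], counts["t_units"]),
    "dependent_clauses_per_clause": ratio(counts["dependent_clauses"], counts["clauses"]),
    "t_units_per_sentence": ratio(counts["t_units"], counts["sentences"]),
}

for name, value in indices.items():
    print(f"{name}: {value:.2f}")
```

The hard part is the counting, not the division: the result depends entirely on how a clause or T-unit is defined.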
How can linguistic complexity measures
be conceptualized?
 Length of unit measures have also been prominent (e.g.,
Ortega, 2003; Lu, 2011).
 Mean length of clause (MLC)
 Mean length of T-unit (MLTU)
 Mean length of sentence (MLS)
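Mean length of sentence is the easiest of these to approximate without a parser. The sketch below assumes a naive sentence splitter (split after ".", "!" or "?") and a simple word pattern; both are assumptions for this example. MLC and MLTU follow the same arithmetic (words divided by units) but need clause and T-unit counts from a parser or from hand coding.

```python
import re

def split_sentences(text):
    """Naively split after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def mean_length_of_sentence(text):
    """Average number of words per sentence, using the naive splitter above."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    word_counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)

sample = "I eat pizza. I eat pizza because it is delicious."
# 3 words + 7 words over 2 sentences
print(mean_length_of_sentence(sample))  # 5.0
```

Abbreviations such as "e.g." will be split incorrectly by this splitter, which is one reason automatic tools rely on a trained sentence segmenter.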
How can linguistic complexity measures
be conceptualized?
 The rise of phrasal complexity:
 Biber, Gray, and Poonpon (2011) suggested that clausal
subordination (i.e., what most syntactic complexity indices
measure) is NOT a prominent feature of academic writing
 Informal speech includes many dependent clauses, but
academic writing includes many dependent phrases
(especially noun phrases).
How can linguistic complexity measures
be conceptualized?
 Some important issues:
 Definition of measures
 What counts as a clause?
 Prominence of broad indices
 What does MLC really tell us about development?
 Often, only a limited range of measures is used.
How do we actually measure linguistic
complexity?
 To measure linguistic complexity, we have two options.
 Option #1: Count features by hand
 Option #2: Count features using a computer
How do we actually measure linguistic
complexity?
 Advantages of Option 1:
 Researcher has full control over how syntactic complexity is
measured.
 Human counts may be more accurate
 Disadvantages of Option 1:
 Expensive!
 Intra-rater reliability
 Inter-rater reliability – who is qualified?
How do we actually measure linguistic
complexity?
 Advantages of Option 2:
 Very cheap
 Reliable (same results every time)
 Usually accurate
 Biber (e.g., 2004) and Lu (2010, 2011) report accuracies above 90%
 Can analyze a broad range of indices at once.
 Disadvantages of Option 2:
 Researcher has less control (is at the mercy of available programs)
 Some data is not well-suited to automatic analysis
 Some linguistic features cannot be reliably captured
Hands-on workshop I: Measuring
syntactic complexity
 Go to www.kristopherkyle.com/workshop/ and download the
“short_samples.zip” file.
 Without talking with your neighbor(s), fill in the included Excel
sheet for examples 1-5.
 What were your answers?
 Any issues with example 5?
 Now do the same for example 6…
Hands-on workshop I: Measuring
syntactic complexity
 Tool for the Automatic Analysis of Syntactic Complexity
(TAASC)
 Prototype!!!
 Includes indices created by Xiaofei Lu (Syntactic Complexity
Analyzer; Lu, 2011)
 Also includes some replications of the Biber Tagger
Hands-on workshop I: Measuring
syntactic complexity
 How TAASC works:
 Reads file
 Splits file into sentences
 Parses each sentence
 uses Stanford Parser
 Uses regular expressions (a way to search for patterns) to
identify particular structures in the parse tree.
 uses Stanford Tregex (regular expressions for parse trees)
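The steps above can be sketched in Python as a small pipeline. This is not TAASC's code. The parser is passed in as a function so that the sketch runs without the Stanford Parser; `toy_parser` is a hypothetical stand-in that returns the same canned parse for every sentence, and `count_label` is a crude regular-expression count over the bracketed parse string rather than a real Tregex search.

```python
import re

def split_sentences(text):
    """Naively split after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def count_label(parse, label):
    """Count how many constituents with this label open in a bracketed parse string."""
    return len(re.findall(r"\(" + re.escape(label) + r"(?=[\s(])", parse))

def analyze(text, parse_sentence, labels):
    """Split text into sentences, parse each one, and total the label counts."""
    totals = {label: 0 for label in labels}
    for sentence in split_sentences(text):
        parse = parse_sentence(sentence)
        for label in labels:
            totals[label] += count_label(parse, label)
    return totals

def toy_parser(sentence):
    """Hypothetical stand-in for a real parser: returns one canned Penn-style parse."""
    return "(ROOT (S (NP (PRP I)) (VP (VBP eat) (NP (NN pizza))) (. .)))"

print(analyze("I eat pizza. I eat pizza.", toy_parser, ["S", "VP", "NP"]))
# {'S': 2, 'VP': 2, 'NP': 4}
```

In a real run, the text would come from reading each file in the input folder, and `parse_sentence` would call an actual parser.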
Hands-on workshop I: Measuring
syntactic complexity
 Now, let’s check to see if your computer is set up correctly.
 First, search for Terminal (Mac) or Command Prompt
(Windows)
 Then type: java -version
 Then type: python
 Go to www.kristopherkyle.com/workshop/ and download the
appropriate version of TAASC (Windows or Mac).
 Extract it to your Desktop
 Copy the example files to the “to_process_2” folder
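The Java and Python checks above can also be done from a short Python script, which may help when troubleshooting several machines. This is a minimal sketch: it only tests whether commands named `java` and `python` are on the PATH. On some systems Python is installed as `python3` instead, so a "NOT found" result for `python` does not always mean Python is missing.

```python
import shutil
import sys

def check_tools(names=("java", "python")):
    """Map each command name to True if it is found on the PATH, else False."""
    return {name: shutil.which(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in check_tools().items():
        print(f"{name}: {'found' if found else 'NOT found'}")
    # The interpreter running this script, for comparison
    print("Running under Python", sys.version.split()[0])
```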
Hands-on workshop I: Measuring
syntactic complexity
 Now, in Terminal/Command Prompt type:
 cd [location of TAASC folder] (then press “return”)
 python [name of the appropriate TAASC program] (“return”)
 Your results should now be in a file called “results.csv”
 If you want to examine the accuracy of the parse trees, open
the files in the “parsed_files” folder with Tregex
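Once “results.csv” exists, it can be opened in Excel or read with a few lines of Python. The sketch below assumes the file has a header row. The column names used in the demo (`filename`, `MLS`) and the values are made up for illustration; check the actual header of your results file.

```python
import csv
import os
import tempfile

def load_results(path):
    """Read a CSV file with a header row into a list of dictionaries (one per row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))

def column_mean(rows, column):
    """Average a numeric column, skipping blank or missing cells."""
    values = [float(row[column]) for row in rows if (row.get(column) or "").strip()]
    return sum(values) / len(values) if values else 0.0

# Demo with a tiny made-up file; the real results.csv columns may differ.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "results.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("filename,MLS\nessay1.txt,12.0\nessay2.txt,15.0\n")
    rows = load_results(demo_path)

print(len(rows), column_mean(rows, "MLS"))  # 2 13.5
```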
Hands-on workshop I: Measuring
syntactic complexity
 Some simple patterns:
 VP
 VP<S
 Some important patterns:
 clause: S|SINV|SQ <<# MD|VBP|VBZ|VBD
 T-unit: S|SBARQ|SINV|SQ > ROOT | [$-- S|SBARQ|SINV|SQ !>> SBAR|VP]
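To make the two simple patterns concrete, here is a plain-Python sketch of what they look for in a parse tree. It is not Tregex and does not parse Tregex syntax. It reads a Penn-style bracketed tree and counts (a) nodes labeled VP and (b) VP nodes that have an S as an immediate child, which is the relation that `VP<S` expresses. The example tree for "I want to eat pizza" was written by hand for this illustration.

```python
def parse_tree(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append((tokens[position], []))  # a word (leaf)
                position += 1
        position += 1                      # skip ")"
        return (label, children)

    return read_node()

def iter_nodes(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node[1]:
        yield from iter_nodes(child)

def count_label(tree, label):
    """Count constituents (non-leaf nodes) with the given label."""
    return sum(1 for n in iter_nodes(tree) if n[0] == label and n[1])

def count_parent_child(tree, parent, child):
    """Count `parent` nodes with at least one immediate constituent labeled `child`."""
    return sum(
        1 for n in iter_nodes(tree)
        if n[0] == parent and any(c[0] == child and c[1] for c in n[1])
    )

tree = parse_tree(
    "(ROOT (S (NP (PRP I)) (VP (VBP want) "
    "(S (VP (TO to) (VP (VB eat) (NP (NN pizza))))))))"
)
print(count_label(tree, "VP"))              # 3
print(count_parent_child(tree, "VP", "S"))  # 1
```

The clause and T-unit patterns combine several such relations (head of, sister of, not dominated by), which is why a dedicated tool like Tregex is used in practice.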
Hands-on workshop II: From raw data to
findings
 Go to www.kristopherkyle.com/workshop/ and download the
“Workshop_Data.zip” file.
 58 participants, three timed essays over 1 year.
 IEP Levels 3-4 (Intermediate/Advanced)
 Now let’s analyze some data!
 NOTE: We didn’t get to this in class…