SIGN LANGUAGE WORD LIST COMPARISONS:
TOWARD A REPLICABLE CODING AND SCORING METHODOLOGY
by
Jason Parks
Bachelor of Arts, Bethel University, 2000
A Thesis
Submitted to the Graduate Faculty
of the
University of North Dakota
in partial fulfillment of the requirements
for the degree of
Master of Arts
Grand Forks, North Dakota
December
2011
Copyright 2011 Jason Parks
This thesis, submitted by Jason Parks in partial fulfillment of the requirements for the
Degree of Master of Arts from the University of North Dakota, has been read by the Faculty
Advisory Committee under whom the work has been done and is hereby approved.
_____________________________________
Chairperson
_____________________________________
_____________________________________
This thesis meets the standards for appearance, conforms to the style and format
requirements of the Graduate School of the University of North Dakota, and is hereby approved.
_______________________________
Dean of the Graduate School
_______________________________
Date
PERMISSION
Title
Sign Language Word List Comparisons: Toward a Replicable Coding and
Scoring Methodology
Department
Linguistics
Degree
Master of Arts
In presenting this thesis in partial fulfillment of the requirements for a graduate degree
from the University of North Dakota, I agree that the library of this University shall make it freely
available for inspection. I further agree that permission for extensive copying for scholarly
purposes may be granted by the professor who supervised my thesis work or, in his absence, by
the chairperson of the department or the dean of the Graduate School. It is understood that any
copying or publication or other use of this thesis or part thereof for financial gain shall not be
allowed without my written permission. It is also understood that due recognition shall be given
to me and to the University of North Dakota in any scholarly use which may be made of any
material in my thesis.
Signature ___________________________
Date ___________________________
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT
CHAPTER
1 INTRODUCTION
1.1 Analyzing word lists for lexical similarity
1.2 Previous sign language word list comparison studies
1.3 The problem
2 HYPOTHESIS AND METHODOLOGY PROPOSAL
2.1 Methodology proposal for the coding system
2.1.1 Synchronic analysis
2.1.2 Phonological basis of coding system
2.1.3 Identifying a sign token for coding
2.2 Handshape parameter values
2.2.1 Description of codes used for handshape values
2.2.2 Identifying variants of a handshape parameter value
2.3 Location parameter values
2.4 Joint movement parameter values
2.5 Palm orientation parameter values
3 PROCEDURE
3.1 Participants
3.2 Elicitation procedure
3.3 Word list video data coding procedure
3.4 Assessing similarity using Levenshtein distance
3.4.1 Calculating Levenshtein distance
3.4.2 Levenshtein distance applied to sign language word list comparisons
4 RESULTS
4.1 Identifying similarity groupings based on Levenshtein distance results
4.2 Validity of Levenshtein distance results
4.3 Evaluation of parameters
4.3.1 Individual parameters
4.3.2 Parameter sets
4.4 Evaluation of handshape parameter values
4.5 Evaluation of word list items
4.5.1 Comparison of item subsets
4.5.2 Items with elicitation problems
4.6 Similarity results using refined parameters, values, and word list items
5 CONCLUSION
5.1 Refining the parameters for comparison
5.2 Refining parameter values
5.3 Refining the word list items
5.4 Final methodology proposal
5.5 Areas and considerations for future research
APPENDICES
Appendix A Word list items
Appendix B Rank and frequency of parameter values
Appendix C Levenshtein distances between each variety pairing
REFERENCES
LIST OF FIGURES
Figure
1. Signs that would be considered similar—identical in two out of three parameters
2. Handshape parameter value inventory—99 values with codes and images
3. Location parameter value inventory—25 body and 6 spatial location values
4. Examples of body contact coded as initial or final location parameter
5. Location coding examples where non-dominant hand contact is disregarded
6. Joint movement parameter coding example for "Fingers" value
7. Joint movement parameter coding example for "Wrist" value
8. Joint movement parameter coding example for "Elbow" value
9. Joint movement parameter coding example for "Shoulder" value
10. Annotating word list videos using ELAN
11. Calculating the Levenshtein distance between two signs for “cat”
12. Dendrogram of Levenshtein distance similarity groupings based on six parameters
13. Correlation of mean Levenshtein distance to mean RTT-R intelligibility score between countries
14. Visual comparison of Levenshtein results of individual parameters for variety groupings
15. Levenshtein distances of variety groupings for parameter sets
16. Levenshtein distances of variety groupings for four sets of word list items
17. Dendrogram of Levenshtein distance similarity groupings for 4P-215-74 data set
LIST OF TABLES
Table
1. Similarity grouping example based on Blair’s lexical similarity criteria
2. Handshape coding suffixes for finger variations
3. Handshape coding suffixes for thumb variations
4. Unique code suffixes for handshapes
5. Handshape values with variants
6. Participant metadata
7. Levenshtein distance between two pronunciations of "afternoon"
8. Levenshtein distance between two signs for "cat"
9. Levenshtein distances of variety groupings based on the six parameters of the initial coding system
10. Levenshtein distances and RTT-R intelligibility scores for three country comparisons
11. Levenshtein distances of variety groupings based on individual parameters
12. General statistics of individual parameter Levenshtein distance results
13. Levenshtein distances of variety groupings based on parameter sets
14. Handshape values that occur least frequently to combine with similar values
15. Handshape values to merge because they are hard to distinguish
16. Levenshtein distance results for four sets of word list items
17. 12 word list items with the most missing data entries
18. 14 word list items that elicit the most sign tokens
19. Levenshtein distance results of sets with reduced word list items and handshape parameter values
20. Word list items
21. Rank and frequency of the combined initial and final handshape parameter values
22. Rank and frequency of initial handshape parameter values
23. Rank and frequency of final handshape parameter values
24. Rank and frequency of the combined initial and final location parameter values
25. Rank and frequency of initial and final location parameter values
26. Rank and frequency of the two palm orientation parameter values
27. Rank and frequency of the five joint movement parameter values
28. Levenshtein distances between each pair of sign language varieties
ACKNOWLEDGMENTS
This word list comparison study is the result of the work, participation, and support of many
people over several years of fieldwork and research. First, I thank my wife and coworker,
Elizabeth Parks, who provided valuable input on the word list coding and methodology
development and has consistently encouraged me during the coding, analysis, and writing of this
thesis. I am also grateful to my advisory committee members who provided vital guidance and
timely feedback during this thesis project: Dr. John Clifton, Dr. Albert Bickford, and Dr. Mark
Karan.
I thank the various SIL International survey team members (Beth Brown, Julia Ciupek-Reed,
Christina Epley, Elizabeth Parks, Bettina Revilla, Audrey Stone, and Holly Williams) who helped
elicit the word lists used in this study. The data analysis would not have been possible without the
enthusiastic involvement of Chad White, who wrote the programs and designed the software to
convert the sign language data for analysis using the Levenshtein distance metric. Michael
Lastufka also developed helpful programs to evaluate various scoring systems and parameter
value frequencies. In addition, Dr. Nelson Fong provided timely assistance with the ANOVA
statistical calculations.
Finally, I acknowledge and thank the numerous deaf and hearing people who graciously
welcomed our survey teams and assisted us in our survey fieldwork—especially the deaf
participants who shared their knowledge, experience, and time with us during the word list
elicitations.
ABSTRACT
This study describes and evaluates a methodology for sign language word list comparisons.
The purpose of this sociolinguistic research tool is to identify similarity relationships among sign
language varieties by assessing similarities of lexical items. Similarities are calculated using the
Levenshtein distance metric, which measures the number of differences between signs.
In this study, the methodology was refined for optimal efficiency through an analysis of:
which parameters of a sign should be compared, which values should be included in each
parameter value inventory, and which items should be used in the word list. As a result of the
study, I propose both an efficient coding system and a methodology that is replicable and
relatively objective, easily merges multiple data sets, and identifies similarities among sign
language varieties. The validity of the methodology is supported by similarity grouping results
that highly correlate with intelligibility testing results of other studies.
The word list data for this study comes from video data archived with SIL International that
represents 50 sign language varieties from 13 countries, mostly in Latin America and the
Caribbean.
CHAPTER 1
INTRODUCTION
Research in language variation can offer helpful insights to organizations and individuals
involved in education planning, language policy, and language development. In language
variation studies, the use of multiple research instruments that explore a broad range of
sociolinguistic and linguistic factors in variation can reinforce conclusions by describing the
language situation from a variety of perspectives. One relatively straightforward research
instrument used to assess language relatedness is comparison of word lists. There are two general
methodological approaches that have been applied to word list comparisons of spoken languages:
comparing cognates (forms that have descended from a common historical form) and comparing
similar forms regardless of the historical relationships.
Within the approach comparing cognates, the historical-comparative method (Campbell 2004,
16-27, 188-197) compares language varieties to identify shared innovations and groups the
varieties based on these shared innovations. In the absence of a historical-comparative analysis of
the varieties, phonostatistic and lexicostatistic methods can be used to determine the relatedness
of the varieties being studied. Phonostatistic methods do this by measuring phonological
differences between forms (Simons 1977). Early practitioners of lexicostatistics identified
apparent/probable cognates based on phonetic similarity, and cognate percentages were used to
determine language relatedness (similarity groups were based on both shared innovations and
shared retentions) (Gudschinsky 1956, 180-81). More recently, some practitioners have proposed
that related forms should be identified purely on the basis of phonetic similarity, regardless of the
actual historical relationship between the forms (Sanders 1977, 32-37).
A variety of methods have been used to calculate the phonetic similarity of forms.
McElhannon (1967) judged forms as similar if 50% or more of the phonemes corresponded.
Deibler and Trefry (1963) calculated similarity by scoring comparisons on a scale of zero to four
based on the number of phoneme differences between the two forms. Blair (1990) outlined what
has become a common methodology to assess lexical similarity. When comparing two forms, all
pairs of phones are classified into one of three categories; and forms are considered as similar or
non-similar depending on the number of phone pairs in each category and the word length. Using
this method, language varieties are grouped based on the overall percentage of similar forms. For
a rough simplification of the scoring criteria, two forms are considered similar if at least half of
the phones are identical or very similar, another 25% are at least somewhat similar, and only 25%
of the phones can be different (Blair 1990, 31-33). In the past decade, the Levenshtein distance
metric (minimum number of edits required to convert one form into another) has been used to
calculate similarities between forms on a gradient scale using a more nuanced measurement than
the similar vs. non-similar categorization (Heeringa et al. 2006).
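As a minimal sketch of the Levenshtein distance described above (not the software used in this study), the following Python function counts the minimum number of insertions, deletions, and substitutions needed to convert one sequence of segments into another. The function name and the equal cost of one for every edit are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b.

    a and b may be strings or lists of segments (for example, phones).
    Every edit is assumed to cost 1.
    """
    # previous[j] is the distance between the items of a processed so far
    # (initially none) and the first j items of b.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # converting i items of a into an empty sequence
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` evaluates to 3. Passing lists of segments rather than plain strings keeps a phone written with more than one character from being counted as several segments.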
From the earliest research in the late 1970s to the present, sign language researchers using
word list comparisons have generally followed the lexical similarity tradition. In the following
three sections, I briefly describe an example of lexical similarity analysis in spoken
languages, how previous studies have analyzed lexical similarities among sign languages, and a
problem in previous studies that is the focus of this study.
1.1 Analyzing word lists for lexical similarity
For an example of a lexical similarity analysis in spoken languages, Kluge (2000; 2005)
describes a study of 49 Gbe language varieties in West Africa. For one set of similarity judgment
criteria, Kluge followed Blair’s methodology (1990) with a few modifications based on a
comparison approach by Schooling (1981) that ignores reduplication and apparently affixed
morphemes occurring in the same position. For an example of how these similarity criteria would
consider words as similar or non-similar among selected Gbe language varieties for the item
“cow”, see Table 1 (Kluge 2000, 19). With focus on the morpheme ɲĩ, the Arohun, Ayizo, and Be
variety forms are considered similar since they share two identical phonetic segments (ɲ and ĩ)
and the additional affixed morphemes (bu and n ) in the Ayizo and Be variety forms are
disregarded since they occur in the same position. The Dogbo and Be variety forms are
considered non-similar since the additional affixed morphemes (n and xwe) do not occur in the
same position.
Table 1: Similarity grouping example based on Blair’s lexical similarity criteria
Similar words
ɲĩ (Arohun variety)
ɲĩbu (Ayizo variety)
ɲĩn (Be variety)
Non-similar words
xweŋĩ (Dogbo variety)
ɲĩn (Be variety)
Using these criteria for identifying similar forms, Kluge's Gbe study identified three main
clusters of the 49 language varieties. The lexical similarity percentages ranged from 71-100%
between any two language varieties within one of the three main clusters, the average similarity
among all varieties within a cluster ranged from 82-91%, and the average lexical similarity
between clusters ranged from 64-70% (Kluge 2005, 34).
1.2 Previous sign language word list comparison studies
Over the last few decades, dozens of sign language researchers have used percentages of
lexically similar words in word list comparisons as a research instrument for sign language
identification, making meaningful contributions to cross-linguistic and variation studies. In
general, to evaluate lexical similarity, these studies each identified a set of sign parameters to
compare and developed scoring criteria; unfortunately, the scoring criteria and the set of
parameters often differed from study to study.
In four of the previous studies, three parameters have been used for comparison: handshape,
location, and movement. Guerra Currie et al. (2002) and Aldersson and McEntee-Atalianis (2008)
scored signs as similar if at least two out of the three parameters were identical. Bickford (2005)
grouped signs as similar if the locations were the same and either the handshape or movement
parameter was also the same. For example, these three studies would consider the two signs for
“water” shown in Figure 1 as similar since they differ in just the handshape parameter and the
location and movement parameters are the same.
Figure 1: Signs that would be considered similar—identical in two out of three parameters
Hendriks (2008) used these same three parameters, but focused on the initial location of a
sign for the location parameter. Hendriks’ scoring criteria gave one point if all three parameters
matched, half of a point if two out of three matched, and zero points if fewer than two parameters
matched.
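The three-parameter scoring rules summarized above can be sketched in Python as follows. This is an illustrative reading of the prose descriptions, not code from any of the cited studies. The dictionary representation of a sign, the function names, and the placeholder values such as "HS1" and "LOC1" are all hypothetical.

```python
PARAMETERS = ("handshape", "location", "movement")

def matching_parameters(sign_a, sign_b):
    """Return the set of parameters whose values are identical in both signs."""
    return {p for p in PARAMETERS if sign_a[p] == sign_b[p]}

def similar_two_of_three(sign_a, sign_b):
    # Guerra Currie et al. (2002); Aldersson and McEntee-Atalianis (2008):
    # similar if at least two of the three parameters are identical.
    return len(matching_parameters(sign_a, sign_b)) >= 2

def similar_bickford(sign_a, sign_b):
    # Bickford (2005): same location, plus the same handshape or movement.
    matched = matching_parameters(sign_a, sign_b)
    return "location" in matched and bool(matched & {"handshape", "movement"})

def score_hendriks(sign_a, sign_b):
    # Hendriks (2008): 1 point for three matches, 0.5 for two, 0 otherwise.
    # Here "location" would hold the initial location of the sign.
    return {3: 1.0, 2: 0.5}.get(len(matching_parameters(sign_a, sign_b)), 0.0)

# Hypothetical pair of signs that differ only in handshape.
water_a = {"handshape": "HS1", "location": "LOC1", "movement": "MOV1"}
water_b = {"handshape": "HS2", "location": "LOC1", "movement": "MOV1"}
print(similar_two_of_three(water_a, water_b),
      similar_bickford(water_a, water_b),
      score_hendriks(water_a, water_b))
```

The sketch makes one difference between the rules visible: two signs that share handshape and movement but differ in location count as similar under the two-out-of-three rule, but not under the rule that requires an identical location.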
Vanhecke and De Weerdt (2004, 30) compared four parameters (handshape, location,
movement, and orientation), and identified four types of similarity in their scoring system:
identical (four out of four parameters identical), similar (one small difference in just one
parameter), related (differences in one or two parameters), and different (more than two
parameter differences). Johnson and Johnson (2008) compared signs based on these same four
parameters, and in some cases a fifth non-manual parameter. For each parameter that was
identical they gave one-fourth or one-fifth of a point depending on whether four or five
parameters were compared. Sasaki (2007) evaluated word lists based on five parameters:
handshape, location, movement, orientation, and one/two hands. Sasaki used scoring criteria that
categorized signs into three groups: identical, similar (four out of five parameters identical), and
distinct. Xu (2006) compared signs based on the following five parameters: handshape, location,
movement, palm orientation, and iconic motivation. In Xu's scoring criteria, at least three out of
the five parameters needed to be identical to be scored as similar. In addition to the five
parameters, Xu also considered iconicity and handedness when evaluating similarity. Hurlbut
(2007) compared signs based on seven parameters, and weighted more heavily certain parameters
considered to be of extra importance. Hurlbut scored signs as similar if at least two parameters
were identical.
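Two of the scoring schemes above are specified precisely enough to sketch: the fractional scoring of Johnson and Johnson (2008) and the three categories of Sasaki (2007). The following Python is only an illustrative reading of those descriptions. The function names, the dictionary representation of a sign, and the parameter labels are hypothetical.

```python
def count_identical(sign_a, sign_b, parameters):
    """Number of the listed parameters with identical values in both signs."""
    return sum(sign_a[p] == sign_b[p] for p in parameters)

def score_fractional(sign_a, sign_b, parameters):
    # Johnson and Johnson (2008): each identical parameter earns an equal
    # fraction of a point (one-fourth or one-fifth, depending on how many
    # parameters are compared).
    return count_identical(sign_a, sign_b, parameters) / len(parameters)

def categorize_sasaki(sign_a, sign_b, parameters):
    # Sasaki (2007): identical, similar (all but one parameter identical,
    # which is four out of five in a five-parameter set), or distinct.
    identical = count_identical(sign_a, sign_b, parameters)
    if identical == len(parameters):
        return "identical"
    if identical == len(parameters) - 1:
        return "similar"
    return "distinct"
```

With four parameters, two signs that differ in a single parameter receive a fractional score of 0.75. With five parameters, the same single difference places the pair in the "similar" category.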
Woodward (1977, 337-340; 1993) calculated lexical similarity and listed the percentage of
similar forms between word list items of sign varieties. However, Woodward does not describe the
scoring criteria used to identify similar forms or which parameters, if any, were identified for
comparison. Parkhurst and Parkhurst (2007, 12) used scoring criteria in which one point was given
if signs were identical, half of a point if judged as similar, and zero points if judged as completely
different, but did not identify specific parameters used for comparison.
In the first word list comparison study using data gathered by our SIL International survey
team during fieldwork in Guatemala in 2007, E. Parks and I (with input from Bickford) identified
four parameters and developed parameter inventories to explore various scoring systems (Parks
and Parks 2008). In that preliminary study, we chose scoring criteria that required an identical
handshape in either the initial or final sign positions and an identical location in either the initial
or final sign positions for lexical items to be considered as similar. We coded signs using an
inventory of 48 handshape parameter values and 23 location parameter values (2008, 24-25). The
word list comparison analysis of the Guatemala sign varieties provided a catalyst for the
methodology proposal of this study.
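The similarity criterion of that preliminary study can be sketched as a single Boolean test. The sketch below assumes that "identical in either the initial or final sign positions" means a position-by-position match (initial with initial, or final with final); the description above does not settle this, so the reading is an assumption, as are the field names and placeholder codes.

```python
def similar_preliminary(sign_a, sign_b):
    """Similar if a handshape matches AND a location matches.

    Assumption: values are compared position by position
    (initial with initial, final with final).
    """
    handshape_match = (
        sign_a["initial_handshape"] == sign_b["initial_handshape"]
        or sign_a["final_handshape"] == sign_b["final_handshape"]
    )
    location_match = (
        sign_a["initial_location"] == sign_b["initial_location"]
        or sign_a["final_location"] == sign_b["final_location"]
    )
    return handshape_match and location_match

# Hypothetical coded signs: same initial handshape, same final location.
sign_a = {"initial_handshape": "HS1", "final_handshape": "HS2",
          "initial_location": "LOC1", "final_location": "LOC2"}
sign_b = {"initial_handshape": "HS1", "final_handshape": "HS3",
          "initial_location": "LOC4", "final_location": "LOC2"}
print(similar_preliminary(sign_a, sign_b))
```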
1.3 The problem
In general, previous sign language word list comparison studies lack a detailed description of
any parameter values that were used to code sign parameters, and in some studies the criteria for
similarity judgments were largely subjective (or not made explicit). Consequently, it would not be
possible to accurately replicate the results of these studies given the methodology description
available in the reports. The difficulty of evaluating and comparing various similarity criteria sets
is accentuated by the lack of reporting of the raw data. Nor is it currently possible to compare the
similarity percentage results between studies since the studies do not share a common similarity
criteria set, the number of parameter values and possible distinctions within a sign parameter have
never been described, and the sets of word list items have been different. Also, it is not possible
to add any additional word list data from other sign varieties to an existing study and obtain
results for the combined data set since the similarity criteria set is not sufficiently described and
the raw data used to make similarity judgments is not reported. Any of these factors could
conceivably affect the similarity percentages that are calculated by a study, and thus the
percentages from different studies are not comparable.
In response to the problems identified from previous sign language word list comparison
research, in this study I propose a word list comparison methodology that justifies which
parameters should be used, clearly defines a set of possible parameter values for each parameter
being coded and compared, and uses a scoring system based on Levenshtein distances rather than
lexical similarity judgments. With the use of a computer software package developed for
Levenshtein distance analysis of word lists, and another program written specifically to convert
sign language word list data for Levenshtein distance analysis, the proposed methodology is less
subjective and requires much less time to analyze, is replicable by other researchers, is relatively
easy to learn, and allows results to be compared among various studies that follow the proposed
methodology.
With this research focus, in the next chapter I will describe my research hypothesis, a sign
language coding system methodology including a description of sign parameters and possible
parameter values, and the Levenshtein distance similarity metric. In the third chapter, I will
discuss the procedure used for eliciting and coding sign language word lists. The fourth chapter
will present the comparison results and an assessment of their validity, based on word list data that
has been archived with SIL International. In the final chapter, I discuss my interpretation of
the results and propose a refined methodology for sign language word list comparisons, followed
by a conclusion and suggestions for future research.
CHAPTER 2
HYPOTHESIS AND METHODOLOGY PROPOSAL
The main research goal for this study is to find an appropriate selection of parameters for
comparison, possible values that may be assigned for each parameter, and lexical items to include
in an optimal word list, so that word list data can be efficiently analyzed to produce a similarity
matrix and a dendrogram (a tree diagram) that reflect relationships between pairs of language
varieties and among clusters of language varieties. In order to determine an appropriate word list
comparison methodology to meet my research goal, I worked to adapt previous coding and
scoring systems. The coding system of this study had two stages of development. In the first
stage, I developed an initial coding system and applied it to the data set. In the second stage,
based on observations of the results using the initial coding system, I propose a final refined
coding system for application in future sign language word list comparison studies.
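To make the idea of a similarity matrix concrete, the following Python sketch builds a table of pairwise distances between word lists. It is not the software used in this study. It assumes that the distance between two varieties is the mean of the item-level distances over the items present in both lists, and it uses a simple count of mismatched codes as a placeholder item distance. All names and data values are hypothetical.

```python
from itertools import combinations

def mismatch_count(sign_a, sign_b):
    # Placeholder item distance: number of positions whose codes differ.
    return sum(x != y for x, y in zip(sign_a, sign_b))

def mean_distance(list_a, list_b, item_distance):
    """Average item_distance over the word list items present in both lists."""
    shared = [item for item in list_a if item in list_b]
    if not shared:
        return None  # nothing to compare
    return sum(item_distance(list_a[i], list_b[i]) for i in shared) / len(shared)

def distance_matrix(word_lists, item_distance):
    """Map each unordered pair of variety names to their mean distance."""
    return {
        (a, b): mean_distance(word_lists[a], word_lists[b], item_distance)
        for a, b in combinations(sorted(word_lists), 2)
    }

# Hypothetical word lists: variety -> item -> tuple of parameter codes.
word_lists = {
    "A": {"cat": ("HS1", "LOC1"), "water": ("HS2", "LOC2")},
    "B": {"cat": ("HS1", "LOC3"), "water": ("HS2", "LOC2")},
    "C": {"cat": ("HS4", "LOC3")},  # "water" is missing for this variety
}
matrix = distance_matrix(word_lists, mismatch_count)
```

A matrix of this kind is the usual input to a hierarchical clustering routine, which in turn produces a dendrogram. Skipping items that are missing from either list is one possible way to handle incomplete word lists.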
In the initial coding system, I identified six parameters of a sign for comparison: initial
handshape, final handshape, initial location, final location, palm orientation change, and joint
movement. Signs were coded for each of the six parameters using a detailed inventory of unique
values with descriptions of how to consistently apply the coding system. These sign parameters
and the parameter coding values were not meant to be an exhaustive inventory of every possible
phonetic component of a sign, but rather an easy-to-follow coding system that was sufficiently
detailed to provide valid similarity grouping results for word list comparisons. This coding
system was tested on a video data set of 50 word lists (most lists contained 241 lexical items)
representing sign language varieties from 13 countries. Then, similarities among the language
varieties were evaluated using the Levenshtein distance metric, which calculates the similarities of
lexical items. In this chapter I discuss the methodological basis for the coding system, and then
give a description of the values developed for each parameter of the initial coding system.
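One way to picture a sign coded under the initial coding system is as a record with one field per parameter. The Python sketch below is illustrative only. The field names follow the six parameters listed above, but the handshape, location, and palm orientation values are hypothetical placeholders rather than codes from the inventories described in this chapter.

```python
from typing import NamedTuple

class CodedSign(NamedTuple):
    """One coded sign token: a value for each of the six parameters."""
    initial_handshape: str
    final_handshape: str
    initial_location: str
    final_location: str
    palm_orientation_change: str
    joint_movement: str

def differing_parameters(sign_a, sign_b):
    """Names of the parameters whose values differ between two coded signs."""
    return [
        name
        for name, value_a, value_b in zip(CodedSign._fields, sign_a, sign_b)
        if value_a != value_b
    ]

# Hypothetical codes for two signs elicited for the same word list item.
sign_a = CodedSign("HS1", "HS1", "LOC1", "LOC2", "no change", "Elbow")
sign_b = CodedSign("HS1", "HS3", "LOC1", "LOC2", "no change", "Wrist")
print(differing_parameters(sign_a, sign_b))
```

Storing each sign as a fixed sequence of codes means that two signs can be compared position by position, and that word lists coded by different researchers can be merged as long as they use the same inventories.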
2.1 Methodology proposal for the coding system
This section describes the basis for the proposed methodology: it is a synchronic, not a
diachronic, analysis (section 2.1.1), sign parameters are selected that reflect both the simultaneity
and sequentiality of sign language phonology (section 2.1.2), and criteria are developed to
identify sign tokens (or utterances) in the word list video data in a consistent manner (section
2.1.3).
2.1.1 Synchronic analysis
The proposed methodology is a synchronic analysis of the elicited items—the analysis
compares sign language varieties at one point in time without reference to historical development.
In contrast, a diachronic analysis would determine whether items share a common historical form.
Therefore, this synchronic analysis does not claim to identify signs that can be traced back to a
common ancestral form (cognates). In addition, it makes no claims of genetic relationships and
does not distinguish between inherited or borrowed signs (loans). Kessler (2001, 5) states,
"whether language elements share certain properties because they are inherited from a common
ancestor language, or whether they share them through borrowing, the language and the elements
in question can be said to be historically connected." So despite not making these distinctions, the
results of this type of synchronic analysis could prompt questions and suggest areas of focus for
future studies of historical relationships among sign language varieties.
2.1.2 Phonological basis of coding system
The sign language coding system for word list comparisons that I recommend is based on a
phonological framework that includes both the simultaneity and sequentiality of sign language. In
early sign language linguistics, Stokoe et al. (1965) identified three parameters of a sign that they
regarded for analytical purposes as occurring simultaneously: place of articulation or location,
handshape, and movement. The sequentiality of sign language is described in the Move-Hold
phonological model of Liddell and Johnson (1989, 208-210). In this model, signs are regarded as
consisting of sequences of segments. The coding system I propose presupposes this richer
conception of sign language phonology, which recognizes both simultaneity and sequentiality in
the structure of a sign—an assumption that is held in most subsequent theorizing about sign
language phonology (Brentari 1998; Sandler 1989).
In the initial coding system for this study, six parameters were chosen to describe both the
sequential and simultaneous phonetic components of a sign. To represent simultaneity, both the
handshape and location features were identified. To represent sequentiality, the handshapes and
locations were each identified twice, once at the initial position of the sign, and once at the final
position of the sign. These parameters of handshape and location are two of the most common
parameters identified for transcription and analysis in previous word list comparisons and have
been the focus of many other sign language linguistic studies. Another common parameter that I
wanted to include in the coding system was movement, but previous transcription systems for
movement have varied widely and some aspects of movement can be captured by identifying
changes in handshape and location. In an effort to focus on only a few easily distinguishable
aspects of movement, I chose two parameters to represent various movements throughout the
duration of a sign token: palm orientation change (marking if the palm orientation changes by at
least 45 degrees or not) and joint movement (fingers, wrist, elbow, or shoulder). For the
handshape, location, and two movement parameters, a set of phonetic value inventories was
created with the goal of developing a well-defined and user-friendly coding system that also
described enough phonetic values to provide clear distinctions when comparing sign language
varieties.
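As an illustration only, the six parameters described above can be pictured as one record per sign token. The sketch below is not part of the original coding workflow; the class name, field names, and the example values are hypothetical, while the parameter roles and the "P+"/"P-" and joint movement values follow sections 2.4 and 2.5.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class SignToken:
    """One coded sign token: six parameters (field names are hypothetical)."""

    initial_handshape: str
    final_handshape: str
    initial_location: str
    final_location: str
    palm_orientation_change: str  # "P+" or "P-" (section 2.5)
    joint_movement: str  # "Fingers", "Wrist", "Elbow", "Shoulder", or "Hold"

    def as_sequence(self) -> tuple:
        """Return the six parameter values in a fixed order for comparison."""
        return astuple(self)


# Hypothetical example values, chosen only to show the shape of a record.
example = SignToken("B", "B", "SN", "Palm", "P-", "Elbow")
print(example.as_sequence())
```

Keeping the parameters in a fixed order means two tokens can later be compared position by position.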
Signs were coded based on phonetic, not phonemic, contrast. I took this coding approach for
two reasons: sign language linguists have not developed a standard methodology for identifying
phonemic contrast, and elicitation sessions during fieldwork often took place under time
constraints that would not have allowed a thorough investigation of phonemic contrast.
Non-manual mouthing features of a sign were not included for comparison, for three reasons.
First, written words were used during elicitation. Second, participants’ exposure to oral training
varied (some participants mouthed almost every written word, while others used much less
mouthing). Third, in some cases hearing people were present during elicitation, and participants
may have mouthed words for the hearing audience even if the mouthing was not natural to their sign language. Due
to these factors, mouth movements in the data appear to have been strongly influenced by spoken
languages in idiosyncratic ways that make them unreliable for lexical comparison.
Distinctions were not made between one-handed and two-handed signs. This approach
follows the argument made by Johnston (2003, 61) that variation that is not likely to be
phonemically different should be disregarded. For example, during fieldwork in many
communities it appeared that the difference between one-handed signs and two-handed signs was
often only a contrast between citational and non-citational forms without a change in meaning.
Some participants signed very formally during the elicitation sessions (preferring two-handed
signs) while others were much more casual and tended to prefer one-handed signs. Disregarding
this type of variation in the coding system, I also only coded the handshape of the dominant hand.
The non-dominant hand was only represented in the coding system if it was a point of contact
(location parameter value) for the dominant hand.
2.1.3 Identifying a sign token for coding
In order for other researchers to easily add to the existing word list corpus or replicate the
results of the study, I developed the following criteria to identify and consistently code sign
tokens in the video data. Some signs had one easily recognizable token and the parameter coding
was straightforward. However, in some cases, signs appeared to be multimorphemic forms with
more than one distinct sign token. For these situations, if there was a quick and smooth transition
between just two locations, the sign was coded as one token. Other signs that appeared to be
multimorphemic signs were coded into two separate sign tokens if the participant made a
significant pause between locations. To determine if a pause was long enough to separate a sign
into more than one token, the pause duration was compared to the participant's usual signing
speed and tempo for other elicited items. If a sign contained three distinct locations for what
appeared to be one sign, the sign was coded into separate tokens so that there would be at most
two locations in one token: one initial and one final. For example, several sign varieties in Latin
America have the signs for man or male, and woman or female used as an affix for many
concepts relating to people or kinship (e.g. boy, girl, son, daughter, grandfather, grandmother,
brother, sister, and others). In other cases, participants may fingerspell the letter "o" or "a" at the
end of a sign corresponding to the last letter in the written Spanish word. These additional sign
components were coded as separate tokens representing the item, unless there was a total of only
two distinct locations in the sign with a quick and smooth transition movement, in which case
the sign would be coded as one token.
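The token criteria above can be summarized as a small decision rule. This is a minimal sketch, not a tool used in the study: the function name and arguments are hypothetical, and whether a pause counts as "significant" remains the coder's judgment relative to the participant's usual tempo.

```python
def should_split_into_tokens(distinct_locations: int, significant_pause: bool) -> bool:
    """Return True if an elicited form should be coded as more than one token.

    distinct_locations: number of distinct locations observed in the form.
    significant_pause: the coder's judgment that the pause between locations
        is long compared with the participant's usual signing tempo.
    """
    # A token holds at most two locations (one initial, one final).
    if distinct_locations >= 3:
        return True
    # With one or two locations, only a significant pause separates tokens.
    return significant_pause


print(should_split_into_tokens(2, significant_pause=False))
print(should_split_into_tokens(3, significant_pause=False))
```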
A fingerspelled sign was included in comparisons and coded as one token. The first manual
alphabet form was coded as the initial handshape and the last manual alphabet form was coded as
the final handshape. The intermediary manual alphabet forms were disregarded since many forms
in fast fingerspelling were blurred and difficult to distinguish in the video data.
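The fingerspelling rule can be sketched as follows. The function name is hypothetical, and using the letter itself as a stand-in for the handshape value code is a simplifying assumption; in practice the coder selects the matching value from the handshape inventory.

```python
def code_fingerspelled(letters: str) -> tuple[str, str]:
    """Return (initial handshape, final handshape) for a fingerspelled item.

    Only the first and last manual alphabet forms are kept; the
    intermediary forms are disregarded.
    """
    if not letters:
        raise ValueError("a fingerspelled item needs at least one letter")
    return letters[0].upper(), letters[-1].upper()


print(code_fingerspelled("bus"))
```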
2.2 Handshape parameter values
In their study of American Sign Language, Liddell and Johnson (1989, 223) identified over 150 hand
configurations. This amount of distinction in a coding
system seemed overly detailed for the purpose of word list comparisons. Instead, I based my
selection of handshape parameter values on a study of four distinct sign languages by Rozelle
(2003). Rozelle identified an inventory of 68 handshapes among the data set; 22 of these
handshapes were identified in all four languages. Each sign language had a handshape inventory
ranging in size from 34-49 handshapes (Rozelle 2003, 80).
The initial list of handshape values included 102 handshapes listed in the appendix of
Rozelle's dissertation and three other fairly common handshapes our survey team had identified in
the Guatemala sign variety comparison, for a total inventory of 105 handshape values. Six of
these 105 handshapes were never observed in the video data. These six handshapes were
combined with other handshape values to increase the simplicity of the coding system by not
including values that only rarely occur and consequently do not have a significant influence on
similarity calculations. The resulting inventory of 99 handshape values is listed in Figure 2
alphabetically by the handshape value code along with an image representation of the handshape
value. (Handshape images are used with permission and slightly modified from Rozelle (2003)).
Figure 2: Handshape parameter value inventory—99 values with codes and images
In Appendix B, Table 21 contains a list of the 99 handshape values according to rank-frequency
among the entire word list data. Four of the five most frequently occurring handshapes
of this database (coding values: 1, 5, S, and A-Text) match the rank of the pooled data of the four
sign languages analyzed by Rozelle (2003, 108). Rank-frequencies of handshape values for only
the initial handshape parameter are listed in Table 22, and Table 23 lists only the final handshape
parameter rank-frequency results.
The initial handshape parameter values were identified at the same point in the video data as
the initial location parameter values. Similarly, the final handshape and location parameter values
were identified at the same point in the video data timeline. If the handshape was the same at the
beginning and end of a sign token, the same value was coded for both the initial and final
handshape parameter values.
2.2.1 Description of codes used for handshape values
The handshape value codes were written in Latin script for ease of coding and analysis using
computers. The coding values were designed for use by researchers familiar with written English
and ASL in order to avoid the necessity of memorizing abstract value codes. The values were
assigned the codes listed in Figure 2 based on the value's similarity to the ASL manual alphabet
or numbering system. For example, the ASL manual alphabet handshape for the letter "B" was
assigned the code "B". There is one irregular code that doesn't correspond to a letter of the ASL
manual alphabet: "ILY", which stands for the "I love you" handshape used in ASL and many
other sign languages.
Six main variations of finger configuration (or flexing of finger joints) were distinguished in
the coding system by the addition of suffixes to the basic manual alphabet handshape code. These
six code suffixes for finger variations are listed in Table 2. In the handshape descriptions, the
term “base joint” refers to the metacarpal-phalangeal joint, and the term “non-base joint” refers to
the proximal and/or distal inter-phalangeal joints.
Table 2: Handshape coding suffixes for finger variations
Code suffix for finger variation | Description | Example
"bent" | only the base joint of finger(s) are flexed | Ubent
"flex" | only the non-base joint(s) of finger(s) are flexed | Lflex
"flexgap" | non-base joints are flexed in both finger(s) and thumb, but not touching each other | Fflexgap
"flex+" | non-base joints of finger(s) are extremely flexed but not completely flexed to palm, and finger(s) are also touching thumb | Fflex+
"gap" | base joint is flexed in selected finger(s) and thumb is opposed, but finger(s) and thumb are not touching each other | Ugap
"little" | only the index finger is selected rather than all fingers, and the other fingers are completely flexed to palm (the term “little” does not refer to the little or pinky finger) | Olittle
The coding system identified four variations due to the position of the thumb. Code suffixes
for thumb variation were separated from the manual alphabet code (and possible suffix for finger
variations) with a hyphen followed by a “T” for thumb. The four thumb position variations are
listed with examples in Table 3.
Table 3: Handshape coding suffixes for thumb variations
Code suffix for thumb variation | Description | Example
"-Text" | thumb extended | A-Text
"-Tflex" | thumb joint flexed | 1-Tflex
"-Top" | thumb opposed | U-Top
"-Ttog" | thumb together with side of palm | Bbent-Ttog
There are nine code suffixes that are unique to only one manual alphabet code in the
handshape inventory. These unique code suffixes are listed in Table 4.
Table 4: Unique code suffixes for handshapes
Unique code suffix | Description
"Gspread" | middle, ring, and pinky fingers are extended and spread, rather than completely flexed to palm as in "G"
"Olittlebent" | only index finger is flexed at base joint, all other fingers' joints are completely flexed to palm
"Olittleflex+" | only index finger is extremely flexed and touching thumb, all other fingers' joints are completely flexed to palm
"Olittle-Tund" | thumb tucked under flexed index finger, all other fingers' joints are completely flexed to palm
"Rhole" | index and middle fingers are touching, and either the index or middle finger is flexed to form a hole between them
"Tcross" | thumb and index finger are touching and crossing each other, base joint of index finger is flexed
"Wunspr" | index, middle, and ring fingers are unspread and touching each other, rather than spread as in "W"
"Y-MID" | middle finger is fully extended, rather than flexed as in "Y"
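To make the code structure concrete, the sketch below splits a handshape value code into a base code, an optional finger suffix, and an optional thumb suffix. It is a hypothetical helper, not part of the original study. The suffix lists come from Tables 2 through 4; treating each code from Table 4 as an indivisible unit is an assumption made here for simplicity.

```python
# Finger suffixes from Table 2, longest first so "flexgap" wins over "gap".
FINGER_SUFFIXES = ("flexgap", "little", "flex+", "bent", "flex", "gap")
# Thumb suffixes from Table 3, written after a hyphen and "T".
THUMB_SUFFIXES = {"ext", "flex", "op", "tog"}
# Codes from Table 4, kept whole (an assumption of this sketch).
UNIQUE_CODES = {
    "Gspread", "Olittlebent", "Olittleflex+", "Olittle-Tund",
    "Rhole", "Tcross", "Wunspr", "Y-MID",
}


def parse_handshape_code(code: str) -> dict:
    """Split a handshape value code into base, finger, and thumb parts."""
    if code in UNIQUE_CODES:
        return {"base": code, "finger": None, "thumb": None}
    stem, sep, thumb = code.partition("-T")
    if sep and thumb not in THUMB_SUFFIXES:
        raise ValueError(f"unknown thumb suffix in {code!r}")
    finger = next(
        (s for s in FINGER_SUFFIXES if stem.endswith(s) and len(stem) > len(s)),
        None,
    )
    base = stem[: len(stem) - len(finger)] if finger else stem
    return {"base": base, "finger": finger, "thumb": thumb if sep else None}


print(parse_handshape_code("Bbent-Ttog"))
```

Checking the longest finger suffix first avoids reading "Fflexgap" as the base "Fflex" plus "gap".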
2.2.2 Identifying variants of a handshape parameter value
For some handshape values, one value may be used to code a variety of slight handshape
variations. In most of these cases, the variations were either not distinct enough to be clearly and
accurately distinguished in the video data (due to low video quality, poor lighting and
backgrounds, and only one camera angle perspective) or the handshape variation only occurred a
few times in the entire dataset and the value inventory would have been unnecessarily complex if
separate handshape values were identified and coded. Another reason for combining certain
handshape variations was that many participants appeared to have different physical variations in
the degree of flexing or extension possible in the thumb and finger joints. If the handshape
observed in the video data did not exactly match one of the handshape values in the inventory, the
most similar handshape value existing in the inventory was chosen to represent it. See Table 5 for
examples of how slight variations in handshapes were coded as one handshape value according to
the handshape value inventory.
Table 5: Handshape values with variants
Handshape value code | Description of variation
1 | middle finger may be completely flexed to palm, or may be only slightly flexed and touching thumb
1flex | index finger may be flexed at only one non-base finger joint, or both non-base finger joints
7 | ring finger may be flexed at only the base joint, or all ring finger joints may be flexed
8 | middle finger may be flexed at only the base joint, or all middle finger joints may be flexed
A-Text | thumb may be fully extended, or proximal inter-phalangeal thumb joint may be flexed
B-Text | thumb may be fully extended, or proximal inter-phalangeal thumb joint may be flexed
D | non-base joints of thumb and the middle, ring, and pinky fingers may be flexed, or only the base joint may be flexed
F | non-base joints of thumb and index finger may be flexed, or only the base joint may be flexed
ILY | non-base joints of middle and ring finger may be flexed, or only the base joint may be flexed
K | thumb may touch the side of the middle finger, or touch at the tip of the middle finger
Rhole | non-base joints of the index finger may be flexed and the middle finger fully extended, or the non-base joints of the middle finger may be flexed and the index finger fully extended
Tcross | middle, ring, and pinky fingers may be completely flexed to palm or extended; thumb may cross the index finger on either the near or far side of the index finger
Y-MID | thumb and pinky finger may be fully extended, or may be completely flexed to palm
2.3 Location parameter values
The initial coding system identified two location parameters within one sign token: an initial
and a final location. In their study of American Sign Language, Liddell and Johnson (1989,
274-276) identified 56 body locations, 38 non-dominant hand locations, and 14 spatial locations for a
total of 108 locations. For the purpose of word list comparisons evaluating similarities among
sign language varieties, I hypothesized that this level of coding detail would not significantly
enhance similarity results, and would actually hinder consistent application of the coding system.
At a lower level of distinction, a total of 62 locations were identified in Rozelle's study of four
distinct sign languages. Rozelle found 18 body locations and six spatial locations that were
common to all four languages. The location inventory sizes of each language ranged from 34 to
46 locations (Rozelle 2003).
The initial coding system of this study contained 31 values for the location parameters: 25
body locations, and six spatial locations. See Figure 3 for a diagram of the location values and
brief coding value descriptions written in parentheses.
Figure 3: Location parameter value inventory—25 body and 6 spatial location values
In Appendix B, Table 24 lists the 31 location values by the rank-frequency occurrence results
from the entire database, and Table 25 contains the rank-frequency results for both the initial and
final location parameters separately.
Location parameter values were based on the position of the dominant hand at the beginning
and end of a sign token. While coding location values, I focused on identifying where changes in
the speed of movement occurred. Word list items were usually elicited a few seconds apart so that
the participant's hands would come to a resting position between signs and the initial and final
locations would be easily observed. If the dominant hand remained in only one location
throughout a sign token, the same location parameter value was coded for both initial and final
location parameters. If a multimorphemic form was given for a particular item, or if several
variant forms were given in quick succession, and the dominant hand did not return to a resting
position between signs, coding judgments were made to predict the natural initial or final location
parameter value of each sign token. In some cases, due to video quality or camera angles, it was
difficult to determine if the dominant hand made contact with a body location. If the dominant
hand appeared to be near a body location, but the video data was not conclusive on whether
contact was made or not, I coded the body location rather than the spatial location.
In some cases, when the dominant hand made contact with only one body location and the
movement was repetitive, it was difficult to decide if the body location value should be coded as
the initial or final location parameter. See Figure 4 for two examples of this situation. In the sign
for “church”, “SHand” (the side of the non-dominant hand) would be coded as a final location; in
the sign for “paper”, “Palm” (the palm of the non-dominant hand) would be coded as the initial
location.
Figure 4: Examples of body contact coded as initial or final location parameter
To differentiate the body contact location as the initial or final location between these two
examples, the acceleration of the dominant hand movement before and after contact with the
body location was observed to determine the parameter choice. In the sign for “church”, the
dominant hand accelerated just prior to body contact, so the body contact location value “SHand”
(side of hand) was coded in the final location parameter and “SN” (neutral space) was coded in
the initial location parameter. In the sign for “paper”, the dominant hand began to accelerate just
after making contact with the body location, so the body location “Palm” was coded in the initial
location parameter and “SN” (neutral space) in the final location parameter. The assumption
underlying both judgments is that motion normally accelerates during the course of a sign’s
movement: movements that decelerate or are slower are regarded as transitional movements, not
part of the lexical specification of the sign.
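The acceleration criterion for the "church" and "paper" examples can be written as a short rule. This is a sketch under stated assumptions: the function name and the boolean argument are hypothetical, and the judgment of where the hand accelerates is still made by the coder from the video.

```python
NEUTRAL_SPACE = "SN"


def code_repeated_contact(contact_location: str, accelerates_before_contact: bool) -> tuple[str, str]:
    """Return (initial location, final location) for a sign with one body contact.

    If the dominant hand accelerates just before contact, the contact is the
    end point of the lexical movement; if it accelerates just after contact,
    the contact is the starting point.
    """
    if accelerates_before_contact:
        return NEUTRAL_SPACE, contact_location
    return contact_location, NEUTRAL_SPACE


print(code_repeated_contact("SHand", accelerates_before_contact=True))   # "church"
print(code_repeated_contact("Palm", accelerates_before_contact=False))   # "paper"
```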
In a two-handed sign, if the hands made contact, the body location value at the point of
contact on the non-dominant hand was coded for the location parameter. However, in two
situations, contact with the non-dominant hand was not considered the most salient location value
of the sign token. In the first situation, the non-dominant hand was not coded as a location
parameter if it made contact with the arm of the dominant hand at a point closer to the body than
the wrist area. In the second situation, the dominant hand made contact with the non-dominant
hand while the non-dominant hand was lying against a head or torso body location. Figure 5
shows examples of these situations.
Figure 5: Location coding examples where non-dominant hand contact is disregarded
In the sign for "tree", the location parameters would not be coded as “Palm” (palm of
non-dominant hand), even though the palm of the non-dominant hand touches the elbow of the arm of
the dominant hand. Instead, both the initial and final location parameter values would be coded as
"SN" (neutral space), the location of the dominant hand. In the sign for "sleep", the body location
“Cheek” would be coded rather than the location of contact with the non-dominant hand “Palm”.
In both of these examples, the non-dominant hand was not judged as the most salient location
value: the non-dominant hand was relatively distant from the location of the dominant hand, or
contact was made with a more central body location value.
2.4 Joint movement parameter values
According to Sandler and Lillo-Martin (2006, 197), path and internal movements are "the
main kinds of movement found in lexical signs." Path movements can be characterized into one
of four main types: straight, arc, "7", and circle movements; and internal movements come from
changes in the handshape or palm orientation (Sandler and Lillo-Martin 2006, 197). In the initial
coding system, I did not categorize these two movement types directly, but they were represented
indirectly by the combination of two movement parameters: the joint movement parameter and
the palm orientation change parameter. In addition, some aspects of movement were represented
indirectly by coding both the initial and final positions of the handshape and location parameters.
This section focuses on the joint movement parameter, and in section 2.5 I discuss the palm
orientation change parameter.
Five joint movement parameter values were identified for the initial coding system: Fingers,
Wrist, Elbow, Shoulder, and Hold (no movement at all). Hand-internal movements would usually
be coded as "Fingers" or "Wrist", and path movements would be coded as "Elbow" or "Shoulder".
When more than one joint was moving, the smallest (most distal) joint was encoded. This resulted
in the following parameter value sequence based on coding priority: Fingers > Wrist > Elbow >
Shoulder. In Appendix B, Table 27 lists the five joint movement features according to
rank-frequency from the entire database.
The joint movement parameter value would automatically be coded as "Fingers" if the initial
and final handshape parameter values had been coded with different values. However, joint
movement would also be coded as "Fingers" if the fingers only slightly wiggled or trilled while
maintaining the same handshape value. See the sign for "colors" in Figure 6 for an example.
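The coding priority for joint movement can be sketched as a function. The function name and its inputs are hypothetical; the priority order, the "Hold" value, and the automatic "Fingers" value for a handshape change follow the description above.

```python
# Coding priority from most distal to most proximal joint.
JOINT_PRIORITY = ("Fingers", "Wrist", "Elbow", "Shoulder")


def code_joint_movement(moving_joints: set, initial_handshape: str, final_handshape: str) -> str:
    """Return the joint movement value for one sign token."""
    # A change of handshape is always coded as finger movement.
    if initial_handshape != final_handshape:
        return "Fingers"
    # Otherwise the most distal moving joint is coded.
    for joint in JOINT_PRIORITY:
        if joint in moving_joints:
            return joint
    # No joint moves at all.
    return "Hold"


print(code_joint_movement({"Elbow", "Wrist"}, "B", "B"))
```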
Figure 6: Joint movement parameter coding example for "Fingers" value
The sign for "yes" shown in Figure 7 is an example of a sign where the joint movement
parameter would be coded as "Wrist".
Figure 7: Joint movement parameter coding example for "Wrist" value
The sign for "never" shown in Figure 8 is an example of a sign where the joint movement
parameter would be coded as "Elbow".
Figure 8: Joint movement parameter coding example for "Elbow" value
The sign for "chicken" shown in Figure 9 is an example of a sign where the joint movement
parameter would be coded as "Shoulder".
Figure 9: Joint movement parameter coding example for "Shoulder" value
If it was difficult to distinguish if a movement at the beginning of the sign was actually part
of the sign or just a transitional movement, the duration of time the dominant hand remained at
the final location was compared to the duration of movement. If the movement was much shorter
in duration than the hold, and there was no acceleration just prior to the hold, the movement was
considered a transitional or pre-sign token movement, and the joint movement parameter value
was coded as "Hold".
2.5 Palm orientation parameter values
The palm orientation parameter categorized movement as one of two parameter values. If the
palm orientation of the dominant hand changed by 45 degrees or more between any two positions
in the entire sign token, the parameter was coded with the "P+" value. If the dominant hand palm
orientation did not change by at least 45 degrees, the parameter was coded with the "P-" value. In
Appendix B, Table 26 shows the two palm orientation change values in order of rank-frequency
from the entire database.
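The threshold can be stated as a one-line rule. In this sketch the function name is hypothetical, and the input is assumed to be the coder's estimate of the largest change in palm orientation between any two positions in the token.

```python
def code_palm_orientation_change(max_change_degrees: float) -> str:
    """Return "P+" if the palm orientation changes by at least 45 degrees."""
    # The threshold is inclusive: exactly 45 degrees counts as a change.
    return "P+" if max_change_degrees >= 45 else "P-"


print(code_palm_orientation_change(90))
print(code_palm_orientation_change(30))
```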
CHAPTER 3
PROCEDURE
The coding system described in the previous chapter was applied to word list video data that
was collected and archived by SIL International sign language survey teams between November
2007 and January 2010. The video data set represents 50 sign language varieties from 13
countries, mostly in Latin America and the Caribbean. Most word lists contained 241 lexical
items. In this section, I discuss the participants, word list elicitation procedure, coding procedure,
and how similarities among language varieties were calculated using the Levenshtein distance
metric.
3.1 Participants
In various regions of each country, deaf community members encountered at deaf association
or club gatherings, schools, and religious meetings volunteered to participate in the study. As
much as possible, the survey teams screened participants to elicit word lists from people who
were active members of the deaf community, were deaf or hard of hearing, had grown up in the
elicitation city region, and had not traveled internationally. Within a country or region, the survey
team tried to include an equal representation of both males and females and younger and older
generations. Because these guidelines were followed, the participants of this study can be regarded
as fairly reliable representatives of their sign language communities. Although most of the word lists represent
sign language varieties from Latin America and the Caribbean, word lists from the United States
were included since American Sign Language has had a wide influence in much of the Americas.
Word lists from Ireland and Northern Ireland were also included since I wanted to see what type
of similarity scores would be calculated between sign language varieties that were generally
considered to be quite different and had relatively fewer historical connections with varieties in the
Americas. Some basic metadata of the 50 participants representing 13 countries are listed
alphabetically by country in Table 6.
Table 6: Participant metadata
Country | Country ID | City of residence | Gender | Age | Deaf family members | Age started signing
Chile | Chile-01 | Puerto Montt | female | 20 | no | 3
Chile | Chile-02 | Punta Arenas | female | 21 | no | 1
Chile | Chile-04 | Iquique | female | 30 | no | 18
Chile | Chile-05 | Santiago | male | 38 | no | 16
Dominican Republic | DomR-01 | Santo Domingo | male | 25 | no | 10
Dominican Republic | DomR-02 | Santo Domingo | male | 20 | no | 7
Dominican Republic | DomR-03 | Barahona | male | 42 | no | 21
Dominican Republic | DomR-04 | Santo Domingo | male | 35 | yes | 11
Dominican Republic | DomR-05 | La Romana | male | 18 | yes | 6
Dominican Republic | DomR-06 | La Romana | female | 16 | no | 8
Dominican Republic | DomR-08 | Santiago | male | 27 | no | 8
Dominican Republic | DomR-09 | Moca | male | 35 | no | 14
Dominican Republic | DomR-10 | Puerto Plata | female | 36 | no | 12
El Salvador | ElSal-03 | La Libertad | male | 27 | no | 11
El Salvador | ElSal-08 | San Salvador | female | 23 | yes | 3
El Salvador | ElSal-12 | Ahuachapan | female | 19 | yes | 7
Honduras | Hond-01 | Tegucigalpa | male | 27 | yes | 15
Honduras | Hond-05 | Juticalpa | female | 19 | no | 10
Honduras | Hond-10 | San Pedro Sula | male | 28 | yes | 4
Honduras | Hond-11 | El Progreso | male | 24 | no | 7
Ireland | Ire-01 | Dublin | male | 50 | yes | 10
Jamaica | Jam-01 | Kingston | male | 26 | no | 12
Jamaica | Jam-02 | May Pen | male | 25 | no | 7
Jamaica | Jam-03 | Portmore | male | 50 | no | 1
Jamaica | Jam-06 | Mandeville | male | 27 | no | 6
Jamaica | Jam-07 | Montego Bay | female | 28 | yes | 5
Jamaica | Jam-08 | Brown's Town | female | 25 | no | 3
North Ireland | NIre-01 | Belfast | male | 22 | no | 3
Panamá | Pan-01 | Panamá | female | 44 | no | 32
Panamá | Pan-06 | David | male | 40 | yes | 17
Paraguay | Prgy-02 | Asunción | male | 28 | no | 5
Paraguay | Prgy-03 | Coronel Oviedo | male | 52 | yes | 17
Paraguay | Prgy-04 | Caaguazú | male | 37 | no | 14
Paraguay | Prgy-05 | Ciudad del Este | female | na | na | na
Paraguay | Prgy-06 | Ciudad del Este | male | 28 | no | 6
Paraguay | Prgy-07 | Itaugua | female | 45 | yes | 17
Paraguay | Prgy-08 | Asunción | female | 41 | yes | 1
Paraguay | Prgy-09 | Itaugua | female | 37 | yes | 5
Perú | Peru-01 | Arequipa | female | 18 | no | 8
Perú | Peru-05 | Chiclayo | male | 19 | no | 9
Perú | Peru-18 | Lima | female | 23 | yes | 1
Perú | Peru-22 | Trujillo | female | 28 | yes | 5
Saint Vincent | StVin-01 | Kingstown | female | 33 | no | 3
Trinidad | Trin-01 | San Fernando | male | 27 | yes | 6
Trinidad | Trin-02 | Port of Spain | male | 33 | no | 3
Trinidad | Trin-03 | Port of Spain | female | 47 | yes | 3
United States | USA-01 | Hartford | female | 32 | yes | 1
United States | USA-05 | Los Angeles | female | 21 | yes | 1
United States | USA-06 | Los Angeles | male | 42 | yes | 1
United States | USA-07 | Los Angeles | male | 23 | no | 14
3.2 Elicitation procedure
With each of these participants, a word list containing up to 243 items was elicited using a
PowerPoint presentation on a notebook computer. One video camera was set up directly in front
of the participant, and index cards were inserted into the camera view between PowerPoint
slides to visually identify each word list item in the video. The elicitation slides for each item
usually contained both a written word from a spoken language (either English or Spanish, depending on
the most common spoken language of the region) and an image. For all but 41 items that were
difficult to accurately represent visually, the slides included images since the visual
representations tended to help facilitate accurate elicitations, and written English or Spanish
literacy was often low in the deaf communities. For 40 items that had clearly opposite or
contrasting concepts, two contrasting images were included in the slide with an arrow to identify
which item was being elicited. As in the study by Osugi et al. (1999, 92), the survey teams found
this comparison technique of contrasting concepts to be effective and easily understood by
participants during elicitations. Similar to the approach of Parkhurst and Parkhurst (2007, 11),
participants were encouraged to include any variants or synonyms for each item to try to avoid
the problem outlined by Rensch (1992, 13) where similar forms actually existed among sign
varieties, but the similar forms did not happen to be elicited.
A basic set of 241 items was included in most word lists in this study. The list contained
lexical items from a variety of grammatical word classes (nouns, verbs, adjectives, quantifiers,
interrogatives, and others) and semantic domains (animals, food, household items, weather, time,
family, numbers, physical characteristics, religious items, emotions, physical activities, and
others). In comparison to previous word list comparison studies, the items of this study most
closely resemble the items used by Bickford (2005, 34-37). Two additional items were included
in the four Peru word lists. For two of the 50 word lists not all of the items were elicited: the
Prgy-07 word list contains only the first 112 items, and the Hond-01 word list contains only the
first 215 items from the 241-item list. One United States word list (USA-01) contains 210 items
elicited in a slightly different order than the others. See Table 20 in Appendix A for a list of the
word list items in the order that they were typically elicited.
From all 50 participants, a combined total of 15,720 sign tokens were elicited from 11,831
item elicitations. For 73% of the item elicitations, only one sign token was elicited; due to
multimorphemic forms or multiple variants for one item, two sign tokens were elicited for 22% of
the items, and 5% of the items prompted three or more sign tokens.
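As a quick consistency check on these figures, the short calculation below compares the average number of tokens per item elicitation with the minimum implied by the percentages. The variable names are hypothetical. Because the percentages are rounded and the last category is "three or more", the second figure is only an approximate lower bound.

```python
TOTAL_TOKENS = 15_720
TOTAL_ELICITATIONS = 11_831
# Share of item elicitations yielding 1, 2, or at least 3 tokens.
SHARES = {1: 0.73, 2: 0.22, 3: 0.05}

# Average tokens per item elicitation, from the reported totals.
mean_tokens = TOTAL_TOKENS / TOTAL_ELICITATIONS

# Approximate lower bound, counting "three or more" as exactly three.
lower_bound = sum(tokens * share for tokens, share in SHARES.items())

print(f"mean tokens per elicitation: {mean_tokens:.2f}")
print(f"approximate lower bound:     {lower_bound:.2f}")
```

The reported totals give an average of about 1.33 tokens per elicitation, slightly above the approximate lower bound of 1.32, so the totals and percentages are consistent with each other.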
3.3 Word list video data coding procedure
The word list videos were annotated using the ELAN media annotation software (Max Planck
Institute for Psycholinguistics 2011). An ELAN template was used with eight tiers. The first tier
labeled “gloss” was created as a parent tier with a controlled vocabulary containing the word list
items. Six dependent tiers were created corresponding to the six parameters to be coded: initial
handshape, final handshape, initial location, final location, palm orientation change, and joint
movement. Controlled vocabularies containing the parameter values were created for each of
these tiers so that coding errors due to typing or spelling would be avoided, and the parameter
values could be easily accessed from a drop-down menu. An eighth tier was created for
comments to mark items that may be of interest in future studies: fingerspelling, notes on
elicitation misunderstandings (homonyms, copying or describing the elicitation image), and
marking variants for sociolinguistic variables if an explanation was given (variants based on
region, gender, age, etc.). A screenshot of coding sign token parameters in ELAN is shown in
Figure 10.
Figure 10: Annotating word list videos using ELAN
If the participant did not recognize the item being elicited and gave no sign, the sign was
coded as “xxx” for all parameters. If a sign or phrase was elicited, but it was an obvious
misunderstanding of the item due to written language homonyms or an unclear elicitation image,
the sign was coded as “???”. If participants only described an item or the elicitation image, and
the explanatory signs were clearly not meant to represent the lexical item, these signs were also coded
as “???”. In the analysis, if parameters were coded as “xxx” or “???”, that item was omitted from
comparisons.
3.4 Assessing similarity using Levenshtein distance
The algorithm used in this study to calculate similarity among sign language varieties is
called the Levenshtein distance (string edit distance) metric. In essence, it measures the amount of
difference between lexical items by calculating the differences in strings.
In contrast to Blair's approach of assessing lexical similarity in which pairs of words are
considered to be similar or not similar, Levenshtein distance measurements provide a more
nuanced assessment of how different the words are. In addition, Levenshtein distance calculations
can be rapidly and objectively calculated by computer programs without the need for a research
analyst to make pair by pair similarity judgments.
In this section, I describe how Levenshtein distance calculations are made, how they have
been applied to spoken language studies, and how they were applied in this study.
3.4.1 Calculating Levenshtein distance
In spoken languages, in preparation for Levenshtein distance calculations, each phonetic
segment of a word is assigned a unique character code, typically symbols in the International
Phonetic Alphabet. Depending on the level of distinction desired in the comparison, these codes
could include diacritics. Once each word is represented as a string of characters representing the
individual phonetic segments, pairs of character strings are compared to assess the difference (or
Levenshtein distance) between the lexical items. Levenshtein distances are calculations of the
minimum (most efficient) number of edits that would be necessary to make two character strings
identical. There are three possible types of edits that may be necessary: insertions, deletions, and
substitutions. The Levenshtein distance (sum of edits) is usually normalized by length to correct
skewing that would occur in the calculation of average Levenshtein distances based on word
length. If only the raw number of edits were averaged to calculate Levenshtein distance, longer
words would have larger influence on distances than shorter words. Normalization by length can
be done in a variety of ways; Heeringa et al. (2006, 53) recommend dividing the number of edits by
the length of the longest alignment between the two words. Consequently, the normalized
Levenshtein distance between words from two different language varieties could range from zero
(identical character strings) to one (completely different strings) for each lexical item. If a word
list contains multiple variants for one lexical item, the Levenshtein distance would be the average
distance of all comparisons of variants for each word list pair. The Levenshtein distance between
two language varieties for an entire word list is the average of the distances calculated for each
word list item.
As an example of how Levenshtein distance would be calculated between two forms in a
spoken language, Table 7 shows the edits needed to change one pronunciation/form of
"afternoon" in English (æǝftǝnʉn) to another pronunciation/form (æftǝrnun) (White 2010, 4).
Table 7: Levenshtein distance between two pronunciations of "afternoon"
Beginning form: æǝftǝnʉn

| Edit | Resulting form |
|---|---|
| delete ǝ | æftǝnʉn |
| insert r | æftǝrnʉn |
| substitute ʉ/u | æftǝrnun |

Levenshtein distance (number of edits) = 3
Levenshtein distance (normalized) = 3/8 = 0.375
In contrast to this example where the Levenshtein distance between the two forms is 0.375, a
Blair style lexical similarity judgment would only have two possible values: similar or not
similar, and the two forms from Table 7 would be considered as similar since six of the eight
phones are identical.
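The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the RuG/L04 code used for this study: the function names are hypothetical, and normalization here divides by the length of the longer string, a simplifying assumption that reproduces the 3/8 of Table 7 but is not necessarily identical to normalizing by the longest alignment.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # deleting the first i characters of a
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit count divided by the longer string's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# The two pronunciations of "afternoon" from Table 7.
edits = levenshtein("æǝftǝnʉn", "æftǝrnun")
score = normalized_levenshtein("æǝftǝnʉn", "æftǝrnun")
print(edits, score)
```

Each phonetic segment is assumed to be one character; segments written with diacritics would first need to be mapped to single codes, as the text describes.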
Over the last decade, several studies have analyzed differences among language varieties
using Levenshtein distance. Investigating Nisu language varieties spoken in Yunnan, China, Yang
(2009) found that Levenshtein distance results complemented the findings of historical-comparative analysis and had a high correlation with intelligibility testing results. According to
Yang (2009, 28), while comparative analysis identifies specific differences and intelligibility tests
reveal the effect of the differences on comprehension, Levenshtein distances "clarify the degrees
of difference between varieties”.
3.4.2 Levenshtein distance applied to sign language word list comparisons
To calculate Levenshtein distances for sign language data, the value for each of the six
parameters coded for a sign token is assigned a single character, and the six parameters are
treated as if they were a phonetic spelling by arranging them in a fixed sequence. For the initial
coding system of six parameters, each sign token was represented as a string of six characters.
Since all sign tokens were coded with the same number of parameters, there were no edits due to
insertions or deletions; the calculation of necessary edits to a character string was based only on
substitutions (when parameter values were not identical for a given pair of forms).
For an example of how the Levenshtein distance would be calculated for the lexical item
“cat” between two sign varieties of Chilean Sign Language, Chile-01 and Chile-05, Figure 11
shows the images of the initial and final positions of each sign.
Figure 11: Calculating the Levenshtein distance between two signs for “cat”
Table 8 lists the parameter values for each sign with the last column showing the tally of
Levenshtein distance edits.
Table 8: Levenshtein distance between two signs for "cat"
| Parameter | Chile-01 | Chile-05 | Value difference | Edits |
|---|---|---|---|---|
| Initial handshape parameter value | 5 | B-Text | Yes | 1 |
| Final handshape parameter value | A | Bbent-Text | Yes | 1 |
| Initial location parameter value | Fore | Fore | No | 0 |
| Final location parameter value | Fore | Fore | No | 0 |
| Palm orientation change parameter value | P- | P- | No | 0 |
| Joint movement parameter value | Fingers | Fingers | No | 0 |

Levenshtein distance (normalized): 2/6 = 0.333
Comparing these two signs, since the initial and final handshape parameter values are both
different each would require one edit. No edits would be needed for the location or movement
parameters since there were no differences between the parameter values. So the non-normalized
Levenshtein distance for this comparison would be two. In this study, Levenshtein distances were
normalized (dividing the number of edits by six for the number of parameters compared), so the
normalized Levenshtein distance would be 2/6 = 0.333. In comparison, under a Blair-style lexical
similarity criterion requiring at least two of three parameters (handshape, location, and movement)
to be identical for signs to be categorized as similar, these two signs for “cat” would be
considered similar.
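Because every sign token has the same six parameters, the comparison reduces to counting mismatched positions. The following sketch illustrates this for the two signs for “cat”; it is not the SLLED or Rugloafer code, the function name is hypothetical, and it compares parameter values directly rather than first converting each value to a single character. The palm orientation change value is taken to be “P-” for both signs.

```python
PARAMETERS = (
    "initial handshape",
    "final handshape",
    "initial location",
    "final location",
    "palm orientation change",
    "joint movement",
)


def sign_distance(token_a, token_b):
    """Share of parameters whose values differ (substitutions only)."""
    if len(token_a) != len(token_b):
        raise ValueError("both sign tokens must be coded for the same parameters")
    edits = sum(value_a != value_b for value_a, value_b in zip(token_a, token_b))
    return edits / len(token_a)


# Parameter values for "cat", in the order given by PARAMETERS.
chile_01 = ("5", "A", "Fore", "Fore", "P-", "Fingers")
chile_05 = ("B-Text", "Bbent-Text", "Fore", "Fore", "P-", "Fingers")

cat_distance = sign_distance(chile_01, chile_05)
print(round(cat_distance, 3))
```

Passing shorter tuples of equal length would correspond to comparing only a subset of the six parameters.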
In Levenshtein distance calculations that involve more than one sign token per word list item,
the resulting Levenshtein distance is the mean of Levenshtein distances between every possible
combination of sign tokens. For example, if variety A is coded for two sign tokens (A1 and A2)
for word list item X, and variety B is coded for three sign tokens (B1, B2, and B3), the
Levenshtein distance between varieties A and B for item X would be the average of the distances
between A1 and B1, A1 and B2, A1 and B3, A2 and B1, A2 and B2, and A2 and B3.
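The averaging over variants can be illustrated with a short sketch. The function names and the six-character token strings below are hypothetical stand-ins for coded sign tokens, not data from this study.

```python
from itertools import product


def token_distance(a: str, b: str) -> float:
    """Normalized substitution-only distance between two equal-length codes."""
    if len(a) != len(b):
        raise ValueError("tokens must have the same number of parameters")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def item_distance(tokens_a, tokens_b) -> float:
    """Mean distance over every pairing of one token from A with one from B."""
    pairs = list(product(tokens_a, tokens_b))
    if not pairs:
        raise ValueError("each variety needs at least one token for the item")
    return sum(token_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical codes: two variants in variety A, three in variety B.
variety_a = ["abcdef", "abcxef"]            # A1, A2
variety_b = ["abcdef", "abzdef", "qbcdeg"]  # B1, B2, B3

distance_for_item_x = item_distance(variety_a, variety_b)
print(distance_for_item_x)
```

With two tokens in A and three in B, six pairings contribute to the mean, matching the six comparisons listed in the text.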
The Levenshtein distances of this study were calculated using the SLLED and Rugloafer
software programs developed by White (2011). The word list parameter data was first exported as
interlinear text from ELAN. Then, the SLLED software served as a converter program where
parameter values for an item were assigned a single character and arranged in a fixed sequence.
The SLLED software allows the user to select which of the six parameters are to be included in a
comparison if a subset of the six parameters is desired. The SLLED software outputs the
converted word list data as an XML file which is the input format required by Rugloafer. The
Rugloafer software acts as a front end for the various features of the RuG/L04 software suite for
dialectometrics and cartography primarily developed by Kleiweg (2011) which includes the
calculations of the Levenshtein distance between variety pairs.
While Levenshtein distance can calculate similarities between pairs of language varieties, the
results can also be used to group many language varieties into clusters based on similarities. In
the "Preferences" menu of the Rugloafer software, there are several clustering algorithm options
available for selection. For this study, I used the agglomerative clustering method called the
unweighted pair-group method using the average approach (UPGMA) which uses a proximity
matrix to cluster varieties and calculate the Levenshtein distances between clusters. In the
UPGMA method, the distance between language variety clusters is the "average distance between
pairs of objects, one in one cluster, one in the other"; the method "tends to join clusters with small
variances" and is "relatively robust" (Everitt, Landau, and Leese 2001, 60). For example, if two
varieties (X1 and X2) are grouped together at a Levenshtein distance of 0.40, and two varieties
(Y1 and Y2) are grouped together at a Levenshtein distance of 0.45, and the four varieties are
grouped together as a cluster at a larger Levenshtein distance (e.g. 0.53), this Levenshtein
distance for the grouping of X and Y would be calculated as follows: calculate the average
distance between varieties X1 and Y1, and X1 and Y2 (e.g. mean Levenshtein distance of X1 to
Y = 0.50), then calculate the average distance between varieties X2 and Y1, and X2 and Y2 (e.g.
mean Levenshtein distance of X2 to Y = 0.56). The Levenshtein distance of the cluster of X and
Y would be the average of the two distances: 0.50 + 0.56, divided by 2 = 0.53.
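The between-cluster step of this example can be sketched as follows. This is an illustration of average linkage only, not the clustering code in RuG/L04; the function name is hypothetical, and the four pairwise distances are invented values chosen so that the means match the 0.50 and 0.56 used in the example.

```python
from itertools import product


def average_linkage(cluster_a, cluster_b, distances):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(distances[frozenset(pair)] for pair in pairs) / len(pairs)


# Hypothetical pairwise Levenshtein distances between varieties.
distances = {
    frozenset(("X1", "Y1")): 0.48,
    frozenset(("X1", "Y2")): 0.52,  # mean of X1 to Y = 0.50
    frozenset(("X2", "Y1")): 0.55,
    frozenset(("X2", "Y2")): 0.57,  # mean of X2 to Y = 0.56
}

xy_distance = average_linkage(["X1", "X2"], ["Y1", "Y2"], distances)
print(round(xy_distance, 2))
```

Averaging the two per-variety means (0.50 and 0.56) gives the same result as averaging all four pairwise distances directly, because each X variety is compared with the same number of Y varieties.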
CHAPTER 4
RESULTS
While analyzing the results, I had a four-point research focus: 1) to calculate the degrees of
difference among the sign language varieties and produce a dendrogram showing these
relationships, 2) to assess the validity of the results by determining the correlation between word
list comparison and intelligibility testing results, 3) to evaluate the coding system parameters and
value inventories in order to refine and optimize the comparison methodology, and 4) to evaluate
and refine the set of word list items to elicit for comparisons. In this chapter, I present the results
of each of the four points in the analysis.
4.1 Identifying similarity groupings based on Levenshtein distance results
The dendrogram in Figure 12 displays the Levenshtein distance similarity groupings for all 50 sign language
varieties comparing the six parameters and parameter value inventories of the initial coding
system. In the dendrogram, an output of the Rugloafer software, word list pairs and groupings are
linked by vertical lines; the position of each line on the horizontal x-axis corresponds to the
average Levenshtein distance among the varieties in the cluster. The number of shades for
clusters in the dendrogram is based on a number chosen in the Rugloafer software preferences
prior to similarity calculations to help distinguish the similarity groupings.
Figure 12: Dendrogram of Levenshtein distance similarity groupings based on six parameters
In general, the formation of sign language variety similarity clusters based on Levenshtein
distances groups varieties most clearly by country. This general grouping pattern supports the
plausibility of the Levenshtein distance results: one would expect sign language varieties from the same country to
be more similar to each other than to sign varieties from other countries (due to increased
language contact, shared deaf educational settings and places of learning sign language, and
shared historical influences). As expected based on known historical connections, the varieties
from Ireland and Northern Ireland are the most different from any of the varieties in the
Americas.
The Levenshtein distance numerical results corresponding to the vertical lines that connect
varieties in the dendrogram are listed in Table 9. The variety groupings are listed from top to
bottom from most to least similarity. The Levenshtein distances listed in the right column
correspond to the average Levenshtein distance among the varieties included in the cluster as
calculated by the unweighted pair-group method clustering algorithm. These same Levenshtein
distances are used to create the dendrogram shown in Figure 12 and correspond to the positions on the
x-axis where varieties are linked by a vertical line.
Table 9: Levenshtein distances of variety groupings based on the six parameters of the initial coding system
| Variety groupings | Levenshtein distance |
|---|---|
| Honduras (H) | 0.341 |
| United States (U) | 0.348 |
| Jamaica (J) + St. Vincent (S) | 0.383 |
| U + JS | 0.401 |
| Chile (C) | 0.417 |
| Trinidad (T) | 0.419 |
| Panama (Pan) | 0.426 |
| UJS + T | 0.438 |
| Peru (Pe) | 0.442 |
| El Salvador (E) | 0.458 |
| Dominican Republic (D) | 0.464 |
| H + Pan | 0.476 |
| UJST + D | 0.492 |
| Paraguay (Par) | 0.506 |
| UJSTD + HPan | 0.513 |
| UJSTDHPan + Pe | 0.536 |
| UJSTDHPanPe + E | 0.552 |
| C + Par | 0.572 |
| UJSTDHPanPeE + CPar | 0.626 |
| Northern Ireland (NI) + Republic of Ireland (RI) | 0.643 |
| UJSTDHPanPeECPar + NIRI | 0.666 |
The purpose for the different shades of similarity clusters is not to identify or classify distinct
sign languages but rather to visually separate and distinguish sign variety groupings. Defining the
difference between languages and dialects is a bold and complicated endeavor that is beyond the
scope of this study. Consequently, although the Jamaica, Saint Vincent, Trinidad, and United
States sign varieties are all in the same shaded cluster and the Levenshtein distance of this group
is less than the Levenshtein distance within the groups for most of the other countries, the
Levenshtein distance grouping results do not alone prove that these language varieties should all
be considered dialects of one sign language without agreement from other sociolinguistic research
tools. Even so, these similarity results could be used as a basis for preliminary grouping of
varieties into languages as long as there is a full awareness that they are only based on the
similarities of lexical items. In combination with other sociolinguistic research tools, this study
could contribute to the discussion of identifying sign languages and dialects that should also
include other factors such as historical influences, language attitudes and identity, and
intelligibility. Intelligibility testing results of Jamaican and Dominican Republic participants
towards a United States sign language variety are discussed in more detail in section 4.2. In
support of making preliminary language groupings based on Levenshtein distances, a study of
spoken language varieties in Central Asia found that similarity groupings “perform well in the
preliminary classification of varieties even when the dataset includes unrelated varieties” (van der
Ark et al. 2007, 7).
Following the pattern of many lexical similarity studies, it may be tempting to propose
thresholds of Levenshtein distances among sign varieties that would predict intelligibility or
language groupings. However, thresholds may not be consistently applicable. Hendriks (2008,
37) found that lexical similarity scores among what were considered to be similar sign languages
were lower than the common thresholds used to predict language groupings for spoken languages.
In another word list comparison study evaluating how changes in scoring criteria affect similarity
results, Kluge (2008) recommends focusing more on relative relationships than on
absolute scores and thresholds when making conclusions about language similarities and
proposing directions for future research. Without including other related research findings, it is
difficult at this point to propose an accurate Levenshtein distance threshold that could be used to
predict language groupings. First, Levenshtein distance results would need to be calibrated
against known situations, and then the proposed thresholds would need to be adjusted based on
the scoring criteria used.
4.2 Validity of Levenshtein distance results
The Levenshtein distance results seem to produce a distinct representation of the similarities
among the 50 sign language varieties. To assess the validity of these results, I will discuss a few
observations with corresponding factors that reinforce the accuracy of the similarity groupings. I
will also compare the Levenshtein distances with intelligibility testing results between users of
sign languages in the United States, Jamaica, and the Dominican Republic.
In examining the results, there are anecdotal factors that support the similarity groupings.
First, there is a relatively large difference between one Paraguayan sign language variety (Prgy-07) and the other seven Paraguayan varieties. In fact, the Prgy-07 participant represented a deaf
community that was perceived by others in the country to use a unique sign variety. If we
excluded Prgy-07 from the comparison, the Paraguayan varieties would be grouped at a
Levenshtein distance of 0.420 rather than 0.506. In another observation, three Jamaica sign
varieties are more similar to the St. Vincent variety than to the other three varieties from Jamaica. As
an explanation, during fieldwork in St. Vincent, the survey team was told that deaf people from
St. Vincent (including the word list participant) have had frequent contact with deaf people from
Jamaica. The Levenshtein distance results suggest that this contact was with only a subset of the
Jamaican deaf population. The grouping of the Honduras varieties (0.341) and the grouping of the
United States varieties (0.348) show the least amount of variation of any grouping of language
varieties within a country. This may reflect the use of a more highly standardized sign language
in these two countries than in the other countries in this study. The United States, at least, has
by far the most published materials relating to sign languages of any country in this
study, which would contribute to standardization despite the relatively large deaf population and
land area of the country. The dendrogram placed the sign varieties from Chile and Paraguay as
the most different from the other varieties in the Americas. From a subjective perspective, the
survey team members fluent in American Sign Language had more difficulty negotiating meaning
with deaf people in Chile and Paraguay than with deaf people from the other countries
represented in this study from the Caribbean, Central America, and South America.
The groupings of varieties within a country that have the largest Levenshtein distances (the
Dominican Republic: 0.464, El Salvador: 0.458, and Peru: 0.442, once Prgy-07 is excluded from the
Paraguay varieties) may be a result of one or more of the following three factors: 1) deaf
educational institutions that are relatively less integrated on a national level than other countries,
2) historical influences that have caused greater diversity in sign varieties, and 3) less mobility
and interaction among regional deaf communities. Each of these factors was observed to some
extent by the survey teams during fieldwork in these three countries. The Dominican Republic, El
Salvador, and Peru all had a few deaf schools that were run by the government and at least one
deaf school that was privately run by a mission organization from the United States - usually
using a sign variety more similar to ASL than the sign varieties of the government run schools
(Williams and Parks 2010; Parks and Parks 2010a).
A limited set of intelligibility testing results also correlates with the Levenshtein distance
results. Intelligibility testing is intended to determine the degree to which users of one language
variety will understand users of another variety. Intelligibility is often assessed by a methodology
called Recorded Text Testing (RTT). In the traditional RTT methodology described by Casad
(1974), after listening to a portion of a recorded text, participants responded to questions about
the text which were evaluated to assess how much was understood. A modification to this
methodology using the retelling method (RTT-R) rather than asking questions is described by
Kluge (2007). In an RTT-R, a text is played for participants and the participants are asked to
retell the text. RTT-R scores are determined based on the percentage of pre-selected data points
from the text that were included in the retelling by the participant.
Intelligibility of an American Sign Language narrative video text was evaluated in the
Dominican Republic and Jamaica using a methodology similar to a recorded text test retelling
method (RTT-R) (Parks and Parks 2010b). The text was elicited and hometown tested in Tucson,
Arizona. Testing of this text was conducted by the SIL Americas Area sign language survey team
in three locations: Los Angeles, California (to approximate the higher end of scores we might
expect from similar language varieties from the same country as the storyteller), Jamaica, and the
Dominican Republic. The mean RTT-R score from each of the three locations was compared to
the mean Levenshtein distance among all word list pairs between each country. The number of
data points for each research instrument, the mean Levenshtein distances and RTT-R scores, and
the standard deviations from the mean are shown in Table 10.
Table 10: Levenshtein distances and RTT-R intelligibility scores for three country comparisons
| RTT-R and Levenshtein distance results | Within United States | Jamaica to United States | Dominican Republic to United States |
|---|---|---|---|
| RTT-R data points | 7 | 9 | 11 |
| Mean RTT-R score | 87.4% | 74.6% | 55.9% |
| RTT-R standard deviation | 7.1% | 17.6% | 15.8% |
| Levenshtein distance data points | 6 | 24 | 36 |
| Mean Levenshtein distance | 0.337 | 0.415 | 0.520 |
| Levenshtein distance standard deviation | 0.025 | 0.040 | 0.039 |
The correlation results show a negative linear relationship (r = -1.000, p = 0.014) between
RTT-R intelligibility testing results and Levenshtein distances (a correlation coefficient
near -1.00 or 1.00 shows a strong relationship between the results). These results must be
interpreted with caution, both because the intelligibility results only go in one direction (understanding of
the Tucson sign variety text) and because only mean scores are compared, rather than the scores
from both instruments for each individual, since the same participants were not involved in both
the word list and intelligibility testing elicitations. A graph of the correlation with the trend line
and equation showing the relationship between the mean Levenshtein distances and RTT-R
scores is shown in Figure 13.
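As a rough check on the relationship, the correlation coefficient and least-squares trend line can be recomputed from the three pairs of rounded means in Table 10. The sketch below is illustrative only: the function name is hypothetical, the statistical software actually used is not assumed, and because the inputs are rounded the output should be expected to agree with the reported values only approximately.

```python
from math import sqrt


def pearson_and_line(x, y):
    """Pearson r plus the slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Means from Table 10: within United States, Jamaica to United States,
# Dominican Republic to United States.
mean_rtt_r = [0.874, 0.746, 0.559]
mean_levenshtein = [0.337, 0.415, 0.520]

r, slope, intercept = pearson_and_line(mean_rtt_r, mean_levenshtein)
print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

With only three data points, any such coefficient describes the means rather than individual participants, which is one reason for the caution expressed above.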
Figure 13: Correlation of mean Levenshtein distance to mean RTT-R intelligibility score between countries
[Figure 13 plot: x-axis, mean RTT-R score (50% to 90%); y-axis, mean Levenshtein distance (0.33 to 0.53); data points for Within USA, Jamaica to USA, and Dominican Republic to USA. Trend line: Mean Levenshtein distance = -0.579(Mean RTT-R) + 0.844; R² = 0.9997]
A high negative correlation (r = -0.86, p < 0.01) between Levenshtein distances and
intelligibility was also found by Beijering et al. (2008, 18) in a study of 18 Scandinavian language
varieties. In another study, for a data subset excluding data analyzed with a different
methodology, Yang (2009, 28) found a strong negative correlation (r = -0.79, p < 0.001) between
Levenshtein distance and intelligibility among Nisu language varieties in China. Yang (2009, 27)
also found a “high degree of agreement” between the Levenshtein distances and historical-comparative analysis results.
Comparing the Levenshtein distance word list comparison methodology with the
intelligibility testing methodology, both have certain advantages and I would recommend that
selection of one over the other be dependent on the fieldwork context. In general, the word list
comparison tool is better suited to fieldwork situations where time is short (it requires less onsite
fieldwork time) and potential participants may have had little formal education or exposure to
testing methods (the elicitation procedure is much easier to explain). The RTT-R methodology
requires much more onsite preparation including the elicitation of an appropriate narrative text
and hometown testing to calibrate the results. On the other hand, the word list comparison
methodology requires more time to analyze than the RTT-R. Even though the results appear to be
highly correlated, where feasible, I would advise that both be used since multiple perspectives can
strengthen the research conclusions and recommendations.
4.3 Evaluation of parameters
I evaluated each of the six parameters of the initial coding system individually, and then in
sets of two, four, and five parameters in contrast to all six parameters to determine which
parameters or parameter combinations most clearly grouped the varieties based on similarities
and differences. These comparisons in combination with ANOVA statistical evaluations helped to
identify which parameters were most efficient in assessing similarity among sign varieties. In
section 4.3.1, I compare the Levenshtein distance results of each parameter individually to
evaluate if any of the six parameters of the initial coding system are obscuring similarity results.
Then based on the weaknesses observed in certain parameters, in section 4.3.2 I evaluate various
subsets of the six parameters in order to omit unclear parameters which would simplify the
coding system and improve the similarity distinctions shown in the results.
4.3.1 Individual parameters
As shown in section 4.1, the Levenshtein distance results show there are 12 groupings of sign
varieties based on country groupings. In this section, I use Levenshtein distance similarity scores
of these 12 groupings (instead of all 1,225 variety pairings) to show relative differences among
the parameter sets. Table 11 shows the Levenshtein distance results of these sign variety
groupings (listed in rows) for each of the six parameters individually and for all six parameters
together (listed in columns). A table cell marked with an “x” indicates that the parameter results
did not exactly group only the varieties listed in that row. Cells with an “x” indicate that the
parameter is not clearly and distinctly grouping varieties based on similarities.
Table 11: Levenshtein distances of variety groupings based on individual parameters
| Variety grouping | Initial Handshape | Final Handshape | Initial Location | Final Location | Palm Orientation | Joint Movement | 6 Parameters |
|---|---|---|---|---|---|---|---|
| United States (U) | 0.425 | 0.406 | 0.358 | 0.301 | x | 0.346 | 0.348 |
| U + Jamaica & St. Vincent (JS) | 0.505 | 0.490 | 0.436 | x | x | x | 0.401 |
| UJS + Trinidad (T) | 0.529 | 0.547 | 0.463 | 0.426 | x | 0.411 | 0.438 |
| Honduras (H) + Panama (Pan) | 0.607 | 0.599 | x | 0.413 | x | 0.408 | 0.476 |
| UJST + Dominican Rep. (D) | 0.637 | 0.635 | 0.492 | 0.445 | x | x | 0.492 |
| UJSTD + HPan | 0.682 | 0.676 | 0.528 | 0.455 | x | x | 0.513 |
| UJSTDHPan + Peru (Pe) | 0.697 | 0.698 | 0.541 | 0.497 | x | x | 0.536 |
| UJSTDHPanPe + El Salvador (E) | 0.721 | 0.719 | 0.566 | 0.509 | 0.332 | 0.481 | 0.552 |
| Chile (C) + Paraguay (Par) | 0.748 | 0.747 | 0.620 | 0.522 | 0.321 | 0.463 | 0.572 |
| UJSTDHPanPeE + CPar | 0.829 | 0.824 | 0.640 | 0.568 | 0.351 | 0.540 | 0.626 |
| N. Ireland (NI) + Rep. Ireland (RI) | 0.849 | 0.851 | x | 0.548 | 0.320 | x | 0.643 |
| UJSTDHPanPeECPar + NIRI | 0.863 | 0.873 | 0.694 | 0.609 | 0.376 | 0.613 | 0.666 |
The Levenshtein distance results listed in Table 11 for individual parameters are also
graphically displayed in Figure 14 to help clarify the discussion of observations that follow
(missing data points for variety groupings in the graph represent cells with an “x” in the table).
Figure 14: Visual comparison of Levenshtein results of individual parameters for variety groupings
The initial and final handshape parameters consistently identified the 12 groupings that were
also apparent in the results based on all six parameters. The initial location parameter missed two
groupings, and the final location parameter missed just one grouping. The two movement
parameters had the most divergence from clearly identifying the 12 groupings: the palm
orientation change parameter was the most divergent missing seven groupings, and the joint
movement parameter missed five groupings. The groupings that were identified by the palm
orientation change parameter produced Levenshtein distances only between 0.320 and 0.376,
with little distinction among them, and in two cases (the groupings in the 9th and 11th rows of
Table 11) they did not follow the trend of increasing differences between groups.
One explanation for why the movement parameters are not as helpful in identifying degrees
of difference among language varieties may be slight skewing of results due to the fact that
parameter values could be identical merely by chance. This is especially apparent for the two
movement parameters since they only have a few possible parameter values. For example, since
there are only two possible values in the palm orientation change parameter, the probability of
that parameter being identical for two sign tokens is 50% (25% chance both are P+ (0.5 x 0.5)
plus 25% chance both are P- (0.5 x 0.5)). Furthermore, from the occurrence frequencies of the
parameter values for the entire database (see Appendix B), we know that "P-" is coded for the
palm orientation change parameter for 69% of all sign tokens and “P+” occurs 31% of the time.
The probability of identical parameter values between two members of a pair just based on
chance would now be 57.2% (0.69 multiplied by 0.69 = 47.6% for “P-”, and 0.31 multiplied by
0.31 = 9.6% for “P+”). This would slightly skew the results toward smaller Levenshtein distances
among language varieties and decrease the relative degrees of difference shown by other
parameters.
The final location parameter consistently calculated higher similarities than the initial
location parameter and the handshape parameters. One possible explanation for this trend is the
high frequency of occurrence for neutral space (51%) as the final location parameter value. Just
based on chance matches of only the neutral space parameter value (0.51 multiplied by 0.51 =
26%), the high probability would produce a Levenshtein distance of at most 0.74 between a pair
of sign varieties for the final location parameter.
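The chance-agreement arithmetic in the two preceding paragraphs can be written out as a small sketch. The function name is hypothetical, and the calculation assumes that the two sign tokens being compared are coded independently of each other, which real word list data need not satisfy.

```python
def chance_match_probability(frequencies):
    """Probability that two independent tokens share a value: the sum of squared frequencies."""
    return sum(p * p for p in frequencies.values())


# Palm orientation change: two values assumed equally likely.
uniform = chance_match_probability({"P+": 0.5, "P-": 0.5})

# Palm orientation change: frequencies observed in the database.
observed = chance_match_probability({"P-": 0.69, "P+": 0.31})

# Final location: chance matches on the neutral space value alone.
neutral_space_only = 0.51 * 0.51

print(uniform, round(observed, 4), round(neutral_space_only, 2))
```

The more uneven the frequencies of a parameter's values, the higher the chance of identical values, which is why a parameter with few or heavily skewed values tends to yield smaller Levenshtein distances.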
Table 12 lists a few statistical observations of the Levenshtein distance results based on all
1,225 variety pairs for each parameter comparison. The Cronbach’s Alpha is an internal-consistency
reliability measure: to calculate it, the results are split in half and the halves are
compared to each other for every possible combination of split halves. The output is a value
between zero (no internal consistency: extremely low reliability) and one (internally consistent:
extremely high reliability). As a rough guideline, data is considered unreliable if the Cronbach’s
Alpha is less than 0.7. The mean, standard deviation, and range (the difference between the most
similar variety pair and the least similar variety pair) of Levenshtein distances for the 1,225
variety pairs are listed for each parameter. The standard deviation and range indicate the level of
distinction among variety pairs that each parameter is able to produce. A larger standard deviation
and range shows that the results are less clumped together which would suggest that similarity
groupings are easier to identify in the results.
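The statistics described above can be sketched in Python as follows. This is an illustrative sketch only: it uses the standard variance-based formula for Cronbach's Alpha, the arrangement of the data into cases and items is an assumption (this section does not specify how the analysis software arranged it), and the toy scores are hypothetical. The sample standard deviation is used here; the section does not say whether the sample or population form was reported.

```python
from math import comb
from statistics import mean, stdev, variance


def cronbach_alpha(scores):
    """Cronbach's Alpha for rows of cases, each holding k item scores."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def summarize(distances):
    """Mean, sample standard deviation, and range of a list of distances."""
    return mean(distances), stdev(distances), max(distances) - min(distances)


# 50 word lists yield 50 * 49 / 2 unordered variety pairs.
n_pairs = comb(50, 2)

# Hypothetical toy data: three cases scored on two items.
alpha = cronbach_alpha([[1, 2], [2, 2], [3, 5]])
```

With 50 word lists, `n_pairs` is 1,225, which matches the number of variety pairs reported for each parameter comparison.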
Table 12: General statistics of individual parameter Levenshtein distance results
Parameter           Cronbach’s Alpha   Mean Levenshtein distance   Standard deviation   Range
Initial Handshape   0.9663             0.725                       0.1225               0.541
Final Handshape     0.9670             0.722                       0.1242               0.599
Initial Location    0.9375             0.564                       0.0958               0.498
Final Location      0.9169             0.501                       0.0876               0.458
Palm Orientation    0.6959             0.324                       0.0476               0.272
Joint Movement      0.8981             0.479                       0.0839               0.447
6 Parameters        0.9771             0.554                       0.0886               0.416
The initial and final handshape parameters gave the largest ranges and standard deviations of
Levenshtein distances of any parameter which suggests that they produce clearer groupings of
similarity—it is more likely that the difference among Levenshtein distances will be statistically
significant if the distances of the data set have a larger range and standard deviation. In addition,
both handshape parameters produce larger mean Levenshtein distances than the location or
movement parameters. These two observations are related: since handshapes produce a larger
mean Levenshtein distance they are more likely to be different which increases range of
Levenshtein distances especially for the least similar variety pairs. This high difference in
handshapes follows the results of two other studies—in a set of signs that differed by only one
parameter, handshape was most frequently the different parameter, followed by movement, and
then location (Aldersson and McEntee-Atalianis 2008, 63-67; McKee and Kennedy 2000, 56-57).
In contrast to the pattern in these two studies, in this study movements showed less distinction in
differences than locations. One explanation may be the lower number of possible
parameter values for movements than locations in the coding system of this study. The large
number of parameter values for handshapes in this coding system may also explain the tendency
for these parameters to show more differences among variety pairs.
The Cronbach’s Alpha internal-consistency reliability measure shows that the handshape
parameters have the highest reliability of the individual parameters (initial handshape parameter:
0.9663, final handshape parameter: 0.9670) - only slightly lower than the reliability of all six
parameters combined (0.9771). The palm orientation change movement parameter had the lowest
reliability (0.6959) which is under the 0.7 threshold for recommended reliability.
4.3.2 Parameter sets
Based on the observations of the performance of individual parameters in section 4.3.1, I
explored possible simplifications of the coding system. Various sets of parameters are compared
to see if similar or even enhanced results can be obtained by excluding certain parameters from
the analysis. Table 13 shows the Levenshtein distance results of 12 groupings of varieties (listed
in rows) for four sets of parameters (listed in columns): all six parameters, five parameters—all
parameters except palm orientation change (labeled as 5P-NoPO), four parameters—the
handshapes and locations not including the two movement parameters (labeled as 4P-NoMove),
and the initial handshape and location parameters (labeled as 2P-Initial). The statistical values of
Cronbach’s Alpha (internal-consistency reliability measure), standard deviation and range
(difference between most similar and least similar variety pair Levenshtein distances), and the
mean Levenshtein distance are also given to help compare the effectiveness of various parameter
sets in distinguishing similarity groupings.
Table 13: Levenshtein distances of variety groupings based on parameter sets
                                      6 Parameters   5P-NoPO   4P-NoMove   2P-Initial
Cronbach’s Alpha                      0.9771         0.9781    0.9771      0.9726
Mean Levenshtein distance             0.554          0.599     0.628       0.645
Standard deviation                    0.0886         0.0993    0.1053      0.1074
Range                                 0.416          0.476     0.495       0.503
United States (U)                     0.348          0.361     0.365       0.383
U + Jamaica & St. Vincent (JS)        0.401          0.422     0.433       0.451
UJS + Trinidad (T)                    0.438          0.464     0.481       0.483
Honduras (H) + Panama (Pan)           0.476          0.509     0.539       0.572
UJST + Dominican Rep. (D)             0.492          0.531     0.552       0.565
UJSTD + HPan                          0.513          0.554     0.585       0.605
UJSTDHPan + Peru (Pe)                 0.536          0.577     0.609       0.620
UJSTDHPanPe + El Salvador (E)         0.552          0.598     0.630       0.644
Chile (C) + Paraguay (Par)            0.572          0.621     0.660       0.685
UJSTDHPanPeE + CPar                   0.626          0.681     0.716       0.735
N. Ireland (NI) + Rep. Ireland (RI)   0.643          0.707     0.728       0.759
UJSTDHPanPeECPar + NIRI               0.666          0.724     0.755       0.769
The ANOVA statistical analysis showed that all four parameter sets were significantly
different from each other (p < 0.01). The Levenshtein distance results of the four sets of
parameters shown in Table 13 are visually displayed in Figure 15.
Figure 15: Levenshtein distances of variety groupings for parameter sets
The combination of parameters that seems to show distinctions between similar and different
groupings most efficiently while still maintaining a high internal consistency is the four-parameter
combination of initial and final handshapes and locations (labeled as 4P-NoMove).
These results have a high internal-consistency reliability based on the Cronbach’s Alpha value of
0.9771 (equal value to the six parameter set, and only slightly less than the five parameter set
(0.9781)). The range of Levenshtein distances from the most to least similar variety pair for these
four parameters (0.495) and the standard deviation (0.1053) are larger than the ranges and
standard deviations of the five and six parameter combinations which suggests that the
distinctions between similarity groups are clearer in the four parameter set. A study of
Guatemalan sign varieties also found that inclusion of the palm orientation and movement
parameters in the comparison resulted in a smaller range among similarity scores (Parks and
Parks 2008, 27).
The relative relationships of the variety groupings (increasing Levenshtein distance while
progressing through the variety groupings) are similar among all the parameter sets with one
exception: the 2P-Initial parameter set comparison calculates the grouping of the Dominican
Republic varieties with the varieties from the United States, Jamaica, Trinidad, and St. Vincent at
a smaller Levenshtein distance than the grouping of all Honduras and Panama varieties. This
observation combined with a smaller Cronbach’s Alpha value (0.9726) suggests that the 2P-Initial
set is not optimal. Since the 4P-NoMove set yields similar results to the initial six
parameter set with even more distinction between similarity groups, and the two movement
parameters require more time to code consistently than the other parameters, the 4P-NoMove set
is the optimal choice of parameters to evaluate during comparisons.
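The comparison of parameter sets can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the Rugloafer implementation: each sign token is assumed to carry exactly one coded value per parameter, so the edit distance between two tokens reduces to counting substituted values, and dividing by the number of parameters is an assumed normalization. The parameter names and the joint movement codes "J+" and "J-" are hypothetical; the other codes appear in this study's coding system.

```python
# Parameter names are illustrative labels for the six coded parameters.
PARAMETERS = (
    "initial_handshape", "final_handshape",
    "initial_location", "final_location",
    "palm_orientation_change", "joint_movement",
)

PARAMETER_SETS = {
    "6 Parameters": PARAMETERS,
    "5P-NoPO": tuple(p for p in PARAMETERS if p != "palm_orientation_change"),
    "4P-NoMove": PARAMETERS[:4],
    "2P-Initial": ("initial_handshape", "initial_location"),
}


def token_distance(token_a, token_b, parameters):
    """Fraction of the selected parameters whose coded values differ."""
    mismatches = sum(1 for p in parameters if token_a[p] != token_b[p])
    return mismatches / len(parameters)


# Two hypothetical sign tokens for the same word list item.
token_a = dict(zip(PARAMETERS, ("B-Text", "B-Text", "Cheek", "Chin", "P-", "J-")))
token_b = dict(zip(PARAMETERS, ("B-Text", "5flex", "Cheek", "Cheek", "P+", "J-")))

distances = {
    name: token_distance(token_a, token_b, params)
    for name, params in PARAMETER_SETS.items()
}
```

For this pair, dropping parameters changes the distance from 0.5 (six parameters) to 0.4 (5P-NoPO), 0.5 (4P-NoMove), and 0.0 (2P-Initial), which shows how the choice of parameter set alone can shift a score.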
4.4 Evaluation of handshape parameter values
For the initial and final handshape parameters, the initial coding system identified 99 distinct
handshape values. Two small subsets of these handshape values either occurred very infrequently
or were difficult to distinguish during coding. I combined or merged these values in order to
propose a coding system with improved efficiency and consistency without sacrificing clear
similarity groupings. There were 19 values that occurred less than 0.10% of the time; I combined
17 of them with one of the other 80 handshape values with similar features, and two of them with
each other ("U-Top" and "Ugap" were both coded as "U-Top"). I also combined the infrequent
values “ILYbent-Top” and “7” with “ILYflex-Top”, and used a new code name, “ILY-Top” for
the resulting value. These 19 least frequently occurring values are shown in Table 14.
Table 14: Handshape values that occur least frequently to combine with similar values
Rank   Code name to be merged   Occurrences   Frequency   Code with the following similar value
81     U-Top                    23            0.08%       U-Top (merged with Ugap)
82     Rhole                    21            0.07%       R
83     Ybent                    19            0.06%       Y
84     F-Text                   18            0.06%       Lbent
85     E-Top                    16            0.05%       C
86     Ugap                     14            0.05%       U-Top
87     7                        13            0.04%       ILYflex-Top, ILYbent-Top = ILY-Top
88     Olittle-Tund             11            0.04%       T
89     1flex-Tflex              10            0.03%       Lflex
90     ILYbent-Top              10            0.03%       ILYflex-Top, 7 = ILY-Top
91     E-Ttog                   8             0.03%       E-Text
92     1-Ttog                   6             0.02%       1
93     F-Ttog                   6             0.02%       F
94     Iflex                    6             0.02%       Ibent
95     Wflex                    6             0.02%       W
96     E-Tflex                  5             0.02%       E-Text
97     I-Ttog                   5             0.02%       I
98     1-Tflex                  3             0.01%       L
99     Y-MID                    2             0.01%       Y (or Wunspr for middle finger variant)
There were seven pairs of handshape parameter values that were difficult to distinguish in the
word list videos. I merged each of these pairs, reducing the handshape parameter inventory by
seven values. These seven merged values are listed in Table 15. I used a new code name "Fgap"
for the initial coding system values of "Fflexgap" and "Gspread".
Table 15: Handshape values to merge because they are hard to distinguish
Code name to be merged   Remaining code with similar features
5bent                    5-Top
5flex-Text               5flex
8flexgap                 8gap
B-Ttog                   B-Text
C-Top                    C
Clittle-Top              Clittle
Gspread and Fflexgap     Fgap (new code)
By merging the sets of handshape values representing infrequently occurring values and those
representing features that were difficult to distinguish in the videos, the handshape parameter
value inventory was reduced from 99 to 74 values. After evaluating word list items in section 4.5,
I examine the effects of these refinements in addition to word list item refinements in section 4.6.
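The merges listed in Tables 14 and 15 can be expressed as a lookup table that is applied to every coded handshape value. The Python sketch below is illustrative; the dictionary and function names are hypothetical, and "Y-MID" is mapped to "Y" only, leaving aside the "Wunspr" alternative that Table 14 allows for the middle finger variant.

```python
# Table 14: infrequent values and the similar value each is coded with.
INFREQUENT_MERGES = {
    "Ugap": "U-Top", "Rhole": "R", "Ybent": "Y", "F-Text": "Lbent",
    "E-Top": "C", "Olittle-Tund": "T", "1flex-Tflex": "Lflex",
    "E-Ttog": "E-Text", "1-Ttog": "1", "F-Ttog": "F", "Iflex": "Ibent",
    "Wflex": "W", "E-Tflex": "E-Text", "I-Ttog": "I", "1-Tflex": "L",
    # "7", "ILYbent-Top", and "ILYflex-Top" share the new code "ILY-Top".
    "7": "ILY-Top", "ILYbent-Top": "ILY-Top", "ILYflex-Top": "ILY-Top",
    "Y-MID": "Y",  # simplification: the "Wunspr" variant is not handled here
}

# Table 15: values that were hard to distinguish in the videos.
HARD_TO_DISTINGUISH_MERGES = {
    "5bent": "5-Top", "5flex-Text": "5flex", "8flexgap": "8gap",
    "B-Ttog": "B-Text", "C-Top": "C", "Clittle-Top": "Clittle",
    # "Gspread" and "Fflexgap" share the new code "Fgap".
    "Gspread": "Fgap", "Fflexgap": "Fgap",
}

MERGES = {**INFREQUENT_MERGES, **HARD_TO_DISTINGUISH_MERGES}
NEW_CODES = {"ILY-Top", "Fgap"}


def refine_handshape(code):
    """Return the refined code for an initial-inventory handshape code."""
    return MERGES.get(code, code)


# Retired codes leave the inventory and the two new codes enter it.
refined_inventory_size = 99 - len(MERGES) + len(NEW_CODES)
```

Counting this way, 27 initial codes are retired and two new codes are introduced, which is consistent with the reduction from 99 to 74 values.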
4.5 Evaluation of word list items
In order to determine an optimal set of word list items to use in comparisons, I analyzed the
results with two foci: to compare different subsets of items to determine if certain subsets may
enhance or obscure the clarity of similarity relationships, and to identify specific items that may
tend to skew results or cause missing data due to unclear elicitations.
4.5.1 Comparison of item subsets
Levenshtein distances (using the 4P-NoMove parameter set, labeled in this section as 4P-All)
for the complete set of 243 word list items were compared to Levenshtein distances for three
subsets of items to determine if certain subsets produced more distinctions in similarity
groupings. One subset included 67 items containing animals, foods, and other basic nouns
(labeled as 4P-AnimalFoodNoun) that were relatively easy to represent with images during
elicitation—45 items from this set are the same items as used in a 50-item noun list described as
highly iconic by Parkhurst and Parkhurst (2003, 14). Another subset consisted of all the
remaining 176 items not included in 4P-AnimalFoodNoun, which may be considered to be a list
of items less easily represented by images during elicitation (labeled as 4P-NoAnimalFoodNoun).
The third subset of only 25 items contained colors, days, and months (labeled as 4P-ColorDayMonth).
This small subset was chosen based on intuitive observations during coding
(high similarities within a country and low similarities between countries), and I was curious to
see the resulting Levenshtein distance similarity groupings this relatively small subset of items
would produce. Table 16 shows the Levenshtein distances for the four sets of word list items
(listed in columns) including the Cronbach’s Alpha internal-consistency reliability evaluation,
mean, standard deviation, and range.
Table 16: Levenshtein distance results for four sets of word list items
                                      4P-All        4P-ColorDayMonth   4P-AnimalFoodNoun   4P-NoAnimalFoodNoun
                                      (243 items)   (25 items)         (67 items)          (176 items)
Cronbach’s Alpha                      0.9771        0.9179             0.8701              0.9750
Mean Levenshtein distance             0.628         0.678              0.651               0.615
Standard deviation                    0.1053        0.1748             0.0827              0.1196
Range                                 0.495         0.850              0.500               0.537
United States (U)                     0.365         0.151              0.485               0.334
U + Jamaica & St. Vincent (JS)        0.433         x                  0.539               0.430
UJS + Trinidad (T)                    0.481         0.418              0.568               0.466
Honduras (H) + Panama (Pan)           0.539         0.471              x                   0.518
UJST + Dominican Rep. (D)             0.552         0.621              x                   0.512
UJSTD + HPan                          0.585         x                  x                   0.566
UJSTDHPan + Peru (Pe)                 0.609         x                  x                   0.603
UJSTDHPanPe + El Salvador (E)         0.630         0.637              0.652               0.621
Chile (C) + Paraguay (Par)            0.660         0.795              0.672               0.654
UJSTDHPanPeE + CPar                   0.716         0.808              0.708               0.714
N. Ireland (NI) + Rep. Ireland (RI)   0.728         x                  0.709               0.735
UJSTDHPanPeECPar + NIRI               0.755         x                  0.739               0.760
The results of the four different sets of items are visually displayed in Figure 16.
Figure 16: Levenshtein distances of variety groupings for four sets of word list items
Compared to the 4P-All set, the 4P-AnimalFoodNoun subset produced slightly larger
Levenshtein distances in the more similar variety groupings and slightly smaller Levenshtein
distances in the less similar variety groupings. The Cronbach’s Alpha internal-consistency
reliability measure was the lowest (0.8701) among this set of items, and four variety groupings
were not clearly identified (shown by an "x" in Table 16). In comparison, Bickford (2005, 23)
found that a smaller 84-item list that was elicited with pictures and that contained potentially
more iconic concepts produced 7.5% higher similarity scores compared to a 240-item list that
included an additional 156 items that were only elicited with written words and not images.
In the contrasting 4P-NoAnimalFoodNoun item subset, the Levenshtein distances are very
similar in absolute distances and relative relationships to the 4P-All set. The 4P-NoAnimalFoodNoun
subset calculated a slightly larger range (0.537) than the 4P-All set (0.495).
Similarly, in two other studies, word lists containing items that were judged as less-iconic have
produced a greater level of distinction among language varieties (Parkhurst and Parkhurst 2003;
Johnson and Johnson 2008, 37). The ANOVA statistical analysis showed that 4P-All and
4P-NoAnimalFoodNoun were not significantly different from each other (p < 0.01). From these
observations, the exclusion of items that are elicited with pictures and that may be judged by
some standards as "more iconic" only results in minor changes to both the absolute Levenshtein
distances and the relative relationships of similarity grouping results.
Interestingly, the 4P-ColorDayMonth item subset showed extremely high distinction (a range
of 0.850), maintained similar relative relationships across most of the selected groupings (not
distinguishing five groupings; shown by an "x" in Table 16), and had quite a high Cronbach’s
Alpha (0.918) for a small set of items. Vanhecke and De Weerdt (2004, 34-35) also found a
higher than expected number of identical signs from a list that included colors, days, and months
among five regions in Flanders. From all five regions, they calculated 72.3% of 1,401 concepts to
be similar or related. Their finding complements the trend found in this data: among groupings of
relatively similar sign language varieties, the items of colors, days, and months will show high
similarity between varieties (e.g. four ASL varieties grouped at a Levenshtein distance of 0.151).
But in comparisons of relatively different language varieties, the items will reveal sharp
differences among variety groups (e.g. Chile varieties grouped with Paraguay varieties at a
Levenshtein distance of 0.795). This may be due to a higher standardization of these items within
a country as they are basic concepts that may be more consistently taught in deaf schools.
4.5.2 Items with elicitation problems
There are two sets of word list items that caused problems during elicitations. The first set,
listed in Table 17, contains 12 word list items that have the most missing data entries since they
tended to be difficult to elicit or to cause misunderstandings during elicitations. Out of all 50
word lists, these 12 items had no data entries for at least 20% of the word lists.
Table 17: 12 word list items with the most missing data entries
Item        No data entries
sharp       17
to count    17
continue    16
story       14
correct     13
to start    11
enemy       10
early       10
late        10
only        10
to meet     10
weak        10
The difficulty these items caused during elicitation did not seem to be related to whether they
included an image or just a written word—the ratio of items with images for these 12 items is
similar to the ratio of items with images for the entire word list. One possible explanation for
elicitation problems that occurred with items that did include images was that the images were
confusing to participants (e.g. the participants did not directly associate the image with the item).
This is the reason the two items “to live” and “to die” were not elicited after fieldwork in Peru.
Another possible explanation is that the items may represent concepts that participants are not as
familiar with as other items in the list.
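The selection rule behind Table 17 (no data entries in at least 20% of the 50 word lists) can be sketched in Python as follows. The function and constant names are illustrative, the counts for "sharp", "story", and "weak" come from Table 17, and the count for "dog" is a hypothetical value added only to show an item that stays below the threshold.

```python
TOTAL_WORD_LISTS = 50
MISSING_THRESHOLD = 0.20  # share of word lists with no data entry


def items_with_missing_data(missing_counts, total=TOTAL_WORD_LISTS,
                            threshold=MISSING_THRESHOLD):
    """Return items whose share of missing entries meets the threshold."""
    return [
        item for item, missing in missing_counts.items()
        if missing / total >= threshold
    ]


missing_counts = {
    "sharp": 17,
    "story": 14,
    "weak": 10,
    "dog": 3,  # hypothetical count, not taken from the thesis data
}

flagged = items_with_missing_data(missing_counts)
```

Here `flagged` contains "sharp", "story", and "weak", since 10 missing entries out of 50 word lists is exactly 20%.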
The second set of problematic items consists of 14 word list items, listed in Table 18, that
may skew similarity calculations due to the large number of sign tokens they tend to elicit.
Table 18: 14 word list items that elicit the most sign tokens
Item              Ratio of sign tokens per participant
feather           1.91
lightbulb         1.86
window            1.84
bus               1.84
computer          1.82
land              1.78
you’re welcome    1.77
grass             1.75
rich              1.73
rope              1.72
shirt             1.72
chicken           1.71
dog               1.66
tomato            1.66
The large number of sign tokens for these items may indicate that these items represent vague
concepts that are prone to trigger several variants or descriptive phrases instead of single signs.
Another explanation may be that the elicitation images for these items were open to multiple
interpretations. These items may also tend to vary based on cultural differences. For example, the
item “window” in one region may have several types: one sheet of glass, several horizontal metal
panes that rotate, vertical panes that rotate, or just a cut-out opening in a wall. Each type of
window may have a different sign, but the differences among signs are due to differences in
regional construction norms and not the generic concept of the item.
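The ranking behind Table 18 can be sketched as a ratio of sign tokens to participants per item, followed by selecting the items with the highest ratios. In this Python sketch the function name, the raw counts, and the item "sun" are all hypothetical; the counts for "feather" and "dog" were chosen only so that the ratios reproduce the values in Table 18.

```python
from heapq import nlargest


def items_with_most_tokens(counts, n):
    """Return the n items with the highest tokens-per-participant ratio.

    `counts` maps an item to a (sign_tokens, participants) pair.
    """
    ratios = {item: tokens / people for item, (tokens, people) in counts.items()}
    return nlargest(n, ratios, key=ratios.get)


# Hypothetical raw counts for illustration only.
counts = {
    "feather": (191, 100),
    "dog": (166, 100),
    "sun": (105, 100),
}

top_items = items_with_most_tokens(counts, 2)
```

The thesis lists the 14 items with the highest ratios; this section does not state a numeric cutoff, so the sketch takes the number of items to select as an argument.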
The effect of reducing the number of word list items from 241 to 215 on similarity groupings
is discussed in section 4.6. Regardless of the results, excluding the items from Table 17 that are
most often missed by participants would increase the comfort levels of both participants and
researchers during the elicitation sessions since some participants feel embarrassed when an item
is not recognized or they are not familiar with the sign corresponding to that item. In addition,
some participants tend to become bored or easily distracted during the elicitation of many items,
so reducing the number of items will also improve participant comfort.
4.6 Similarity results using refined parameters, values, and word list items
To evaluate how similarity results would be affected by using the refined handshape
parameter value inventory of 74 values and/or the reduced set of 215 word list items, I
recalculated Levenshtein distances for two sets of data: one set consisting of the four handshape
and location parameters evaluating 215 items coded with the initial handshape value inventory of
99 values (labeled as 4P-215-99), and a second set based on the four parameters evaluating
215 items that identified only 74 handshape parameter values (labeled as 4P-215-74). The
Levenshtein distance results for these two refined parameter sets are compared to the 4P-NoMove
set (labeled in section 4.5.1 as 4P-All, and in this section as 4P-241-99) in Table 19.
Table 19: Levenshtein distance results of sets with reduced word list items and handshape parameter values
                                      4P-241-99   4P-215-99   4P-215-74
Cronbach’s Alpha                      0.9771      0.9757      0.9759
Mean Levenshtein distance             0.628       0.622       0.618
Standard deviation                    0.1053      0.1092      0.1101
Range                                 0.495       0.511       0.512
United States (U)                     0.365       0.358       0.352
U + Jamaica & St. Vincent (JS)        0.433       0.423       0.415
UJS + Trinidad (T)                    0.481       0.477       0.469
Honduras (H) + Panama (Pan)           0.539       0.531       0.529
UJST + Dominican Rep. (D)             0.552       0.537       0.533
UJSTD + HPan                          0.585       0.578       0.575
UJSTDHPan + Peru (Pe)                 0.609       0.601       0.597
UJSTDHPanPe + El Salvador (E)         0.630       0.625       0.623
Chile (C) + Paraguay (Par)            0.660       0.653       0.649
UJSTDHPanPeE + CPar                   0.716       0.711       0.708
N. Ireland (NI) + Rep. Ireland (RI)   0.728       0.731       0.724
UJSTDHPanPeECPar + NIRI               0.755       0.757       0.751
As would be expected by eliminating word list items that were difficult to elicit correctly, the
mean Levenshtein distance was slightly less in 4P-215-99 (0.622) compared to the complete set
of word list items in 4P-241-99 (0.628). Likewise, the comparison using the reduced set of
handshape parameter values had a slightly smaller mean Levenshtein distance (0.618). ANOVA
statistical analysis showed that 4P-241-99, 4P-215-99, and 4P-215-74 were not significantly
different from each other (p < 0.01). The Cronbach’s Alpha is also very similar among all three
data sets. This statistical analysis indicates that using the reduced sets of word list items and
handshape parameter values (improving elicitations of word lists, and the efficiency and accuracy
of coding) does not negatively impact the similarity distinctions of the Levenshtein distance
results among sign language varieties. In fact, the standard deviation and range of 4P-215-74 are
actually larger than those of the other two sets, which suggests that it shows more distinctions
between similar and different sign language varieties.
The dendrogram in Figure 17 displays the Levenshtein distance similarity groupings for all
50 sign language varieties comparing the four parameters of handshapes and locations using the
refined word list of 215 items and the reduced handshape parameter value inventory of 74 values.
Figure 17: Dendrogram of Levenshtein distance similarity groupings for 4P-215-74 data set
In comparison to the dendrogram that was produced using the initial coding system (
Figure 12), the similarity groupings are very similar with only a few small changes in the
grouping of varieties within a country. A matrix of the specific Levenshtein distances for each
word list pairing is shown in Table 28 of Appendix C.
CHAPTER 5
CONCLUSION
Given the results of the evaluation of the coding methodology and of the Levenshtein
distance similarity results, in this chapter I summarize my interpretations of the results and
present a final proposal for an efficient and effective coding methodology for sign language word
list comparisons. First, I propose a set of parameters to use for comparisons and explain why
certain parameters of the initial methodology should be excluded from future word list
comparisons. Second, I propose a reduced inventory of possible parameter values to be used for
the handshape parameters. Third, I propose a reduced set of items for word list elicitations.
A refined set of 215 word list items is recommended for optimal similarity calculations and
participant comfort during elicitation sessions. Using the proposed coding methodology, this
preliminary word list comparison evaluating the similarity of lexical items using the Levenshtein
distance metric appears to produce both reliable and valid degrees of difference among sign
language varieties. The Levenshtein distance results had a Cronbach's Alpha of 0.9759 (internal
reliability rating), and their validity is supported by a high negative correlation with intelligibility
testing results (r = -1.000, p = 0.014).
Since word lists are relatively quick to elicit during fieldwork, the proposed coding system is
straightforward with well-defined parameter values, the Levenshtein distance calculations can be
performed rapidly and objectively, and the SLLED and Rugloafer analysis software is user-friendly
with many helpful outputs, word list comparisons using this methodology can effectively
contribute toward sign language identification, documentation, and language development project
planning.
5.1 Refining the parameters for comparison
I recommend basing word list comparisons on four phonetic parameters of a sign token:
initial handshape, final handshape, initial location, and final location. Analysis of the results using
the six parameters of the original methodology indicates that the two parameters coding
movement have low internal-consistency reliability and do not produce similarity groupings as
clearly as do the handshape and location parameters. The palm orientation change parameter had
a low Cronbach's Alpha of 0.6959 and did not group seven of the 12 common similarity
groupings of varieties calculated by the other parameters. Likewise, the joint movement
parameter had a Cronbach's Alpha of 0.8981 and did not group five of the 12 common similar
variety groupings. In comparison, the Cronbach's Alpha of the handshape and location parameters
was higher, ranging from 0.9169 to 0.9670 which shows that the comparison results of these
parameters have more internal-consistency reliability. Both initial and final handshape parameters
calculated all 12 of the common similarity groupings; and the initial location parameter only
missed two while the final location parameter missed just one grouping. Since the movement
parameters produce less clarity and distinctions in the similarity groupings, have a low
internal-consistency reliability, and certain aspects of movement are represented indirectly through the
coding of the initial and final positions of handshapes and locations, I do not recommend
including the two movement parameters in the final proposed methodology. In addition, they
require more time and are more difficult to code than the handshapes and locations.
Relative similarity groupings and Levenshtein distance ranges calculated by the four-parameter
set and either of the handshape parameters alone are quite similar. It could be argued
that only the final handshape parameter should be used to assess similarity since it has the highest
Cronbach's Alpha of any single parameter and has the largest range of Levenshtein distances
between the most similar and least similar language varieties. However, locations tend to have
fewer errors in articulation than handshapes since they require less detailed motor movements
(Siedlecki Jr. and Bonvillian 1993; Meier et al. 1998), thus coding only for handshape may
introduce noise in the analysis due to production errors. Finally, since the Cronbach’s Alpha is
higher when four parameters are compared than when just one handshape parameter is compared,
and the locations are relatively easy and quick to code, I recommend keeping the location
parameters in the coding system.
5.2 Refining parameter values
Sign tokens were coded for each of the four parameters using an inventory of unique values
with descriptions of how to consistently apply the coding system and combine minor feature
differences. The initial and final location inventory contained 31 possible values in the initial
methodology and I do not propose making any changes to the number of values. Although they
did not cause problems, for clarity and consistency with other location value codes, I would
recommend modifying the code names of four location values that were unnecessarily
abbreviated in the initial coding system: changing "Should" to "Shoulder", "Fing" to "Finger",
"Fore" to "Forehead", and "Hip" to "HipLeg".
For the initial and final handshape parameters, the initial coding system identified 99 distinct
handshape parameter values. As described in section 4.4, two sets of handshape parameter values
were merged to make the coding system more efficient and accurate - reducing the total inventory
from 99 to 74 values. Since using the reduced handshape value inventory produced similarity
results that were not significantly different from the initial handshape value inventory (p < 0.01),
I recommend using the refined inventory of 74 handshape values. This will decrease the time
required to learn the coding system and become consistent in applying it. In future studies, if one
of these 74 values appears to combine contrastive features among the language varieties being
compared, additional parameter values can be added to the coding and scoring system (the
SLLED software was designed with “empty” spaces for additional values).
5.3 Refining the word list items
There were 26 items highlighted in section 4.5.2 that tended to be difficult to elicit or that
tended to trigger several variants or descriptions that may skew similarity calculations. I
recommend excluding these two sets of problematic items to reduce the total number of items
from 241 to 215 items. Excluding the 12 items listed in Table 17 will increase participant comfort
during elicitation sessions and reduce missing data entries. In addition, excluding the 14 items
from Table 18 that tend to elicit the largest number of sign tokens will reduce the skewing of
similarity results due to potentially vague concepts. In general, comparing more word list items
improves the reliability of the results, yet there is a tension between this advantage and the
potential negative effect of participants becoming bored or tired with long elicitation sessions.
Reducing the number of items as recommended will maintain the advantage of good reliability
resulting from a longer list while improving participant comfort during elicitations.
Since the difference between the results from the complete set of items and the results from
the subset of items "4P-NoAnimalFoodNoun" with items that some might consider "less iconic"
was small, I do not propose excluding the "more iconic" items. In addition, I recommend
including these items at the beginning of elicitation sessions since the participants usually become
more comfortable with the elicitation procedure when the first items are very familiar and easily
triggered.
5.4 Final methodology proposal
The final proposed word list comparison methodology includes 215 word list items and uses
four parameters to code sign tokens: initial handshape, final handshape, initial location, and final
location. The handshape parameter value inventory contains 74 values, and the location inventory
contains 31 values. Sign tokens are coded for these parameters and values using ELAN software.
This ELAN data is converted by SLLED software in order to calculate Levenshtein distances and
degrees of difference among sign language varieties using the Rugloafer software.
5.5 Areas and considerations for future research
Many areas remain for future research due to the exploratory nature of this study of word list
comparison methodology. First, it may be possible to enhance Levenshtein distance calculations
by assigning weights to parameter values - producing a smaller distance for similar values and
larger distance for different values instead of a binary score. For example, when comparing the
initial location parameter, the values “Cheek”, “Chin”, and “Wrist”, are currently considered
equally different from each other and one edit would be tallied in the Levenshtein distance
calculations for any difference. By assigning weights to values, relatively similar location
parameter values like “Cheek” and “Chin” would calculate a smaller Levenshtein distance than
the comparison of two values like "Cheek" and "Wrist". But further research is needed to
determine what weights should be assigned to parameter value pairings, how weighted value
pairings would affect similarity calculations, and whether there would be noticeable differences
in the relative relationships of sign varieties.
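The weighting idea can be sketched as a substitution cost function that replaces the binary score. In this Python sketch the weight of 0.5 for "Cheek" and "Chin" is purely hypothetical, since suitable weights are left to future research, and the names are illustrative.

```python
# Hypothetical weights for pairs of similar location values; any pair
# not listed keeps the binary cost of one full edit.
SIMILAR_LOCATION_COSTS = {
    frozenset({"Cheek", "Chin"}): 0.5,
}


def substitution_cost(value_a, value_b, weights=SIMILAR_LOCATION_COSTS):
    """Cost of substituting one location value for another."""
    if value_a == value_b:
        return 0.0
    return weights.get(frozenset({value_a, value_b}), 1.0)


near = substitution_cost("Cheek", "Chin")
far = substitution_cost("Cheek", "Wrist")
```

With these assumed weights, `near` is 0.5 and `far` is 1.0, so "Cheek" and "Chin" contribute a smaller distance than "Cheek" and "Wrist". Using an unordered pair as the key keeps the cost symmetric regardless of which variety is listed first.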
A second area for further research would be to expand and refine the analysis of the
correlation between Levenshtein distances and intelligibility testing results. For example,
Ciupek-Reed (2011) reports intelligibility testing results of an ASL text in El Salvador that could be
compared to the Levenshtein distances among the sign varieties of these two countries as reported
in this study.
Third, other sign language sociolinguistic research methodologies could be used to support or
contradict this word list comparison methodology and the Levenshtein distance results. For
example, the data from a previous study that used a Blair-style lexical similarity method could be
reanalyzed using the methodology of this study. The results of the two methodologies could then
be compared and the pros and cons of each method could be evaluated.
Fourth, it will be important to evaluate the proposed word list comparison methodology
among sign varieties from more distinct regions of the world. It is possible that articulatory
feature distinctions would be observed while coding word lists from a larger sign language
variety database that would require a modification of the current parameter value inventories. A
more complete understanding of the limits of the smallest and largest Levenshtein distances
expected between very similar and very different sign language varieties might improve the
interpretation of Levenshtein distances and relative similarity relationships.
As a final consideration for future research, although one of the primary goals of this word
list comparison methodology was to develop a more objective process to assess sign language
variety similarities, in some cases it was difficult to consistently and accurately code the
parameter values for each sign token. Difficulties coding handshapes were mainly due to poor
video quality resulting from less than ideal lighting conditions and backgrounds during fieldwork.
Since only one video camera was used, signs were only viewable from one perspective and it was
difficult to determine some locations and movements in three dimensions. If sufficient resources
of time and equipment were available, coding accuracy would be improved by using multiple
video cameras, adequate lighting, and a standard background material.
While I hope that this study provides a quick, efficient, and accurate tool to be used on a
broad scale in future sociolinguistic research of sign languages, additional research is needed to
strengthen the claims that can be made from the results. I encourage future sign language
sociolinguistic researchers to continue to modify and refine this methodology in order to
appropriately apply it to their specific contexts.
APPENDICES
Appendix A
Word list items
The word list items are listed in their elicitation order grouped by topic and/or semantic
domain in Table 20. The last two items were only elicited from five participants near the middle
of the elicitation.
Table 20: Word list items
1    cat
2    mouse
3    dog
4    chicken
5    rabbit
6    horse
7    elephant
8    bear
9    lion
10   spider
11   fish
12   snake
13   cow
14   animals
15   banana
16   apple
17   grapes
18   carrot
19   onion
20   tomato
21   bread
22   corn
23   rice
24   meat
25   egg
26   milk
27   wine
28   coffee
29   salt
30   food
31   flower
32   tree
33   leaf
34   wood
35   fire
36   grass
37   wind
38   mountain
39   sea
40   land
41   river
42   island
43   rock
44   water
45   sun
46   moon
47   stars
48   ice
49   snow
50   shirt
51   shoe
52   table
53   bed
54   door
55   window
56   house
57   garbage
58   rope
59   feather
60   knife
61   book
62   paper
63   lightbulb
64   computer
65   city
66   plane
67   bus
68   red
69   black
70   white
71   green
72   blue
73   yellow
74   colors
75   three
76   six
77   nine
78   ten
79   twenty
80   hundred
81   thousand
82   numbers
83   full
84   empty
85   wet
86   dry
87   dirty
88   clean
89   long
90   short
91   old
92   young
93   weak
94   strong
95   fat
96   skinny
97   poor
98   rich
99   happy
100  sad
101  hot
102  cold
103  beautiful
104  ugly
105  to love
106  to hate
107  to start
108  to finish
109  to work
110  to play
111  yes
112  no
113  true
114  false
115  good
116  bad
117  easy
118  difficult
119  friend
120  enemy
121  man
122  woman
123  boy
124  girl
125  father
126  mother
127  son
128  daughter
129  grandfather
130  grandmother
131  husband
132  wife
133  brother
134  sister
135  family
136  cousin
137  soldier
138  doctor
139  police
140  king
141  judge
142  law
143  teacher
144  morning
145  afternoon
146  day
147  night
148  early
149  late
150  year
151  week
152  sunday
153  monday
154  tuesday
155  wednesday
156  thursday
157  friday
158  saturday
159  month
160  january
161  february
162  march
163  april
164  may
165  june
166  july
167  august
168  september
169  october
170  november
171  december
172  to dance
173  to cook
174  sweet
175  hungry
176  to sleep
177  to dream
178  to help
179  to fight
180  to forgive
181  peace
182  to run
183  to sit
184  to stand
185  to build
186  to see
187  to search
188  to meet
189  to ask
190  to understand
191  to lie
192  to kill
193  sharp
194  pain
195  blood
196  afraid
197  angry
198  laugh
199  tired
200  money
201  to sell
202  to buy
203  to pay
204  to count
205  to need
206  deaf
207  to sign
208  name
209  story
210  what?
211  how?
212  when?
213  where?
214  who?
215  how many?
216  all
217  some
218  more
219  less
220  many
221  nothing
222  only
223  always
224  never
225  now
226  almost
227  continue
228  other
229  new
230  problem
231  correct
232  with
233  school
234  church
235  god
236  devil
237  jesus
238  mary
239  angel
240  thank you
241  you’re welcome
242  to live
243  to die
Appendix B
Rank and frequency of parameter values
The following tables list the rank and frequency of each parameter value based on the
occurrences in the complete database of 50 sign varieties representing 13 countries. These
frequencies were quickly calculated thanks to a package of XML and XSL scripts developed
specifically for this word list comparison study by Lastufka (2010). In Table 21, the 99
handshape values are listed by rank-frequency for all coded handshapes in both initial and final
handshape parameters. The total tally of occurrences was 30,370.
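The rank and frequency columns in the tables of this appendix follow from a simple tally. The sketch below is a Python stand-in written for illustration only; it is not the XSL package by Lastufka (2010). It assumes the coded values are available as a flat list, and it assumes that ties are broken alphabetically, which the source does not state. The function name and the sample values are hypothetical.

```python
from collections import Counter


def rank_frequency(values):
    """Return (rank, code, occurrences, percent) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    # Sort by descending count; alphabetical order breaks ties (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, code, n, 100.0 * n / total)
            for rank, (code, n) in enumerate(ranked, start=1)]


# Toy list of coded location values, not data from this study.
sample = ["SN", "SN", "Chin", "Fing", "Chin", "SN"]
for rank, code, n, pct in rank_frequency(sample):
    print(f"{rank:>4}  {code:<10} {n:>6,}  {pct:6.2f}%")
```

The same arithmetic reproduces the reported figures: 2,975 occurrences out of a total tally of 30,370 is approximately 9.80%, the frequency listed for the first-ranked handshape in Table 21.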
Table 21: Rank and frequency of the combined initial and final handshape parameter values
Rank  Handshape code  Occurrences  Frequency
1     B-Text          2,975        9.80%
2     1               2,807        9.24%
3     5               2,350        7.74%
4     S               1,812        5.97%
5     A-Text          1,054        3.47%
6     F               1,045        3.44%
7     B               1,036        3.41%
8     O               865          2.85%
9     Bbent-Text      813          2.68%
10    5flex-Top       721          2.37%
11    V               707          2.33%
12    Obent           658          2.17%
13    U               639          2.10%
14    1flex           611          2.01%
15    L               609          2.01%
16    Ttog            589          1.94%
17    I               513          1.69%
18    5-Top           500          1.65%
19    Y               497          1.64%
20    C-Top           485          1.60%
21    A               484          1.59%
22    R               402          1.32%
23    W               396          1.30%
24    1bent           365          1.20%
25    5flex-Text      346          1.14%
26    G               320          1.05%
27    Bflex-Text      313          1.03%
28    D               308          1.01%
29    Vflex           278          0.92%
30    K               269          0.89%
31    Oflex+          268          0.88%
32    8-Text          267          0.88%
33    Olittle         224          0.74%
34    5bent           216          0.71%
35    Mbent           210          0.69%
36    Olittlebent     206          0.68%
37    1-Top           199          0.66%
38    Tcross          179          0.59%
39    5flex           175          0.58%
40    3               171          0.56%
41    Clittle-Top     162          0.53%
42    Lflex           162          0.53%
43    Ubent           157          0.52%
44    C               147          0.48%
45    B-Top           145          0.48%
46    E               140          0.46%
47    ILY             132          0.43%
48    T               119          0.39%
49    Gspread         115          0.38%
50    Bbent           113          0.37%
51    ILYflex-Top     107          0.35%
52    5-Tflex         103          0.34%
53    Fflexgap        93           0.31%
54    Bbent-Top       90           0.30%
55    K-Text          87           0.29%
56    Uflex           87           0.29%
57    8               83           0.27%
58    M               83           0.27%
59    U-Text          83           0.27%
60    Olittleflex+    82           0.27%
61    E-Text          76           0.25%
62    B-Ttog          71           0.23%
63    Vbent           62           0.20%
64    Wunspr          62           0.20%
65    Ubent-Text      60           0.20%
66    Bbent-Ttog      58           0.19%
67    R-Text          54           0.18%
68    Ubent-Top       54           0.18%
69    8gap            53           0.17%
70    3flex           51           0.17%
71    Clittle         49           0.16%
72    Fflex+          45           0.15%
73    3flex-Top       43           0.14%
74    Ibent           43           0.14%
75    Lbent           41           0.14%
76    Uflex-Top       41           0.14%
77    N               37           0.12%
78    8flex+          33           0.11%
79    Bflex-Ttog      32           0.11%
80    8flexgap        31           0.10%
81    U-Top           23           0.08%
82    Rhole           21           0.07%
83    Ybent           19           0.06%
84    F-Text          18           0.06%
85    E-Top           16           0.05%
86    Ugap            14           0.05%
87    7               13           0.04%
88    Olittle-Tund    11           0.04%
89    1flex-Tflex     10           0.03%
90    ILYbent-Top     10           0.03%
91    E-Ttog          8            0.03%
92    1-Ttog          6            0.02%
93    F-Ttog          6            0.02%
94    Iflex           6            0.02%
95    Wflex           6            0.02%
96    E-Tflex         5            0.02%
97    I-Ttog          5            0.02%
98    1-Tflex         3            0.01%
99    Y-MID           2            0.01%
In Table 22, the 99 handshape values are listed by rank-frequency for the initial handshape
parameter. The total tally of occurrences was 15,185.
Table 22: Rank and frequency of initial handshape parameter values
Rank  Handshape code  Occurrences  Frequency
1     1               1,586        10.44%
2     B-Text          1,522        10.02%
3     5               1,085        7.15%
4     S               868          5.72%
5     A-Text          552          3.64%
6     B               530          3.49%
7     O               480          3.16%
8     F               467          3.08%
9     V               375          2.47%
10    Bbent-Text      370          2.44%
11    5flex-Top       334          2.20%
12    U               331          2.18%
13    I               298          1.96%
14    L               293          1.93%
15    Ttog            290          1.91%
16    Oflex+          249          1.64%
17    A               247          1.63%
18    Obent           233          1.53%
19    C-Top           230          1.51%
20    Y               229          1.51%
21    1flex           220          1.45%
22    W               212          1.40%
23    R               205          1.35%
24    5-Top           192          1.26%
25    G               187          1.23%
26    D               167          1.10%
27    5flex-Text      166          1.09%
28    1-Top           149          0.98%
29    K               146          0.96%
30    8-Text          145          0.95%
31    5bent           135          0.89%
32    1bent           133          0.88%
33    Bflex-Text      125          0.82%
34    Vflex           124          0.82%
35    Mbent           113          0.74%
36    B-Top           108          0.71%
37    3               95           0.63%
38    Olittle         92           0.61%
39    Gspread         89           0.59%
40    Clittle-Top     87           0.57%
41    C               85           0.56%
42    5flex           77           0.51%
43    Tcross          73           0.48%
44    Olittleflex+    71           0.47%
45    U-Text          66           0.43%
46    Lflex           65           0.43%
47    ILY             63           0.41%
48    Fflexgap        62           0.41%
49    K-Text          61           0.40%
50    Bbent-Top       60           0.40%
51    M               58           0.38%
52    ILYflex-Top     55           0.36%
53    Ubent           54           0.36%
54    T               53           0.35%
55    E               51           0.34%
56    Bbent           50           0.33%
57    8               47           0.31%
58    5-Tflex         46           0.30%
59    Uflex           38           0.25%
60    B-Ttog          35           0.23%
61    E-Text          35           0.23%
62    8flex+          33           0.22%
63    Wunspr          33           0.22%
64    8gap            32           0.21%
65    Clittle         29           0.19%
66    Vbent           29           0.19%
67    Olittlebent     27           0.18%
68    R-Text          27           0.18%
69    Bbent-Ttog      26           0.17%
70    Fflex+          24           0.16%
71    U-Top           21           0.14%
72    3flex-Top       20           0.13%
73    8flexgap        17           0.11%
74    Ibent           17           0.11%
75    3flex           16           0.11%
76    N               16           0.11%
77    Uflex-Top       16           0.11%
78    Bflex-Ttog      15           0.10%
79    E-Top           14           0.09%
80    Rhole           12           0.08%
81    F-Text          10           0.07%
82    Olittle-Tund    10           0.07%
83    Ybent           10           0.07%
84    Ugap            9            0.06%
85    Lbent           8            0.05%
86    Ubent-Text      8            0.05%
87    7               6            0.04%
88    E-Ttog          5            0.03%
89    ILYbent-Top     5            0.03%
90    1flex-Tflex     4            0.03%
91    I-Ttog          4            0.03%
92    Ubent-Top       4            0.03%
93    E-Tflex         3            0.02%
94    F-Ttog          3            0.02%
95    1-Ttog          2            0.01%
96    Iflex           2            0.01%
97    Wflex           2            0.01%
98    1-Tflex         1            0.01%
99    Y-MID           1            0.01%
In Table 23, the 99 handshape values are listed by rank-frequency for the final handshape
parameter. The total tally of occurrences was 15,185.
Table 23: Rank and frequency of final handshape parameter values
Rank  Handshape code  Occurrences  Frequency
1     B-Text          1,453        9.57%
2     5               1,265        8.33%
3     1               1,221        8.04%
4     S               944          6.22%
5     F               578          3.81%
6     B               506          3.33%
7     A-Text          502          3.31%
8     Bbent-Text      443          2.92%
9     Obent           425          2.80%
10    1flex           391          2.57%
11    5flex-Top       387          2.55%
12    O               385          2.54%
13    V               332          2.19%
14    L               316          2.08%
15    5-Top           308          2.03%
16    U               308          2.03%
17    Ttog            299          1.97%
18    Y               268          1.76%
19    C-Top           255          1.68%
20    A               237          1.56%
21    1bent           232          1.53%
22    I               215          1.42%
23    R               197          1.30%
24    Bflex-Text      188          1.24%
25    W               184          1.21%
26    5flex-Text      180          1.19%
27    Olittlebent     179          1.18%
28    Vflex           154          1.01%
29    D               141          0.93%
30    G               133          0.88%
31    Olittle         132          0.87%
32    K               123          0.81%
33    8-Text          122          0.80%
34    Tcross          106          0.70%
35    Ubent           103          0.68%
36    5flex           98           0.65%
37    Lflex           97           0.64%
38    Mbent           97           0.64%
39    E               89           0.59%
40    5bent           81           0.53%
41    3               76           0.50%
42    Clittle-Top     75           0.49%
43    ILY             69           0.45%
44    T               66           0.43%
45    Bbent           63           0.41%
46    C               62           0.41%
47    5-Tflex         57           0.38%
48    ILYflex-Top     52           0.34%
49    Ubent-Text      52           0.34%
50    1-Top           50           0.33%
51    Ubent-Top       50           0.33%
52    Uflex           49           0.32%
53    E-Text          41           0.27%
54    B-Top           37           0.24%
55    8               36           0.24%
56    B-Ttog          36           0.24%
57    3flex           35           0.23%
58    Lbent           33           0.22%
59    Vbent           33           0.22%
60    Bbent-Ttog      32           0.21%
61    Fflexgap        31           0.20%
62    Bbent-Top       30           0.20%
63    Wunspr          29           0.19%
64    R-Text          27           0.18%
65    Gspread         26           0.17%
66    Ibent           26           0.17%
67    K-Text          26           0.17%
68    M               25           0.16%
69    Uflex-Top       25           0.16%
70    3flex-Top       23           0.15%
71    8gap            21           0.14%
72    Fflex+          21           0.14%
73    N               21           0.14%
74    Clittle         20           0.13%
75    Oflex+          19           0.13%
76    Bflex-Ttog      17           0.11%
77    U-Text          17           0.11%
78    8flexgap        14           0.09%
79    Olittleflex+    11           0.07%
80    Rhole           9            0.06%
81    Ybent           9            0.06%
82    F-Text          8            0.05%
83    7               7            0.05%
84    1flex-Tflex     6            0.04%
85    ILYbent-Top     5            0.03%
86    Ugap            5            0.03%
87    1-Ttog          4            0.03%
88    Iflex           4            0.03%
89    Wflex           4            0.03%
90    E-Ttog          3            0.02%
91    F-Ttog          3            0.02%
92    1-Tflex         2            0.01%
93    E-Tflex         2            0.01%
94    E-Top           2            0.01%
95    U-Top           2            0.01%
96    I-Ttog          1            0.01%
97    Olittle-Tund    1            0.01%
98    Y-MID           1            0.01%
99    8flex+          0            0.00%
In Table 24, the 31 location values are listed by rank-frequency for the combined initial and
final parameters. The total number of occurrences was 30,370.
Table 24: Rank and frequency of the combined initial and final location parameter values
Rank  Location code  Occurrences  Frequency
1     SN             14,141       46.56%
2     Fing           1,919        6.32%
3     SFFace         1,699        5.59%
4     Palm           1,634        5.38%
5     SLoCheek       1,082        3.56%
6     Chin           1,039        3.42%
7     SHand          1,002        3.30%
8     Chest          960          3.16%
9     Fore           941          3.10%
10    Tips           901          2.97%
11    Lips           749          2.47%
12    Cheek          731          2.41%
13    SUpCheek       593          1.95%
14    SFAHead        590          1.94%
15    BHand          561          1.85%
16    Nose           309          1.02%
17    Wrist          234          0.77%
18    LoArm          207          0.68%
19    Elbow          171          0.56%
20    Ear            170          0.56%
21    Ribs           145          0.48%
22    Eye            135          0.44%
23    Should         86           0.28%
24    Neck           64           0.21%
25    UpArm          61           0.20%
26    Waist          61           0.20%
27    THead          44           0.14%
28    Teeth          39           0.13%
29    BHead          37           0.12%
30    Hip            37           0.12%
31    SAHead         28           0.09%
In Table 25, the 31 location values are listed by rank-frequency separately for the initial and
final location parameters. The total number of occurrences for each parameter was 15,185.
Table 25: Rank and frequency of initial and final location parameter values
      Initial Location                     Final Location
Rank  Code      Occurrences  Frequency     Code      Occurrences  Frequency
1     SN        6,413        42.23%        SN        7,728        50.89%
2     Fing      1,085        7.15%         Fing      834          5.49%
3     SFFace    926          6.10%         SFFace    773          5.09%
4     Palm      908          5.98%         Palm      726          4.78%
5     Chin      656          4.32%         SHand     613          4.04%
6     Fore      597          3.93%         SLoCheek  610          4.02%
7     Tips      550          3.62%         Chest     465          3.06%
8     Chest     495          3.26%         Chin      383          2.52%
9     Lips      481          3.17%         Tips      351          2.31%
10    SLoCheek  472          3.11%         Fore      344          2.27%
11    Cheek     398          2.62%         Cheek     333          2.19%
12    SHand     389          2.56%         BHand     298          1.96%
13    SFAHead   340          2.24%         SUpCheek  282          1.86%
14    SUpCheek  311          2.05%         Lips      268          1.76%
15    BHand     263          1.73%         SFAHead   250          1.65%
16    Nose      213          1.40%         Wrist     140          0.92%
17    Ear       94           0.62%         Elbow     115          0.76%
18    Eye       94           0.62%         LoArm     113          0.74%
19    LoArm     94           0.62%         Ribs      101          0.67%
20    Wrist     94           0.62%         Nose      96           0.63%
21    Elbow     56           0.37%         Ear       76           0.50%
22    Should    48           0.32%         Waist     49           0.32%
23    Ribs      44           0.29%         Eye       41           0.27%
24    Neck      37           0.24%         Should    38           0.25%
25    UpArm     34           0.22%         BHead     35           0.23%
26    Hip       23           0.15%         Neck      27           0.18%
27    SAHead    22           0.14%         UpArm     27           0.18%
28    THead     18           0.12%         THead     26           0.17%
29    Teeth     16           0.11%         Teeth     23           0.15%
30    Waist     12           0.08%         Hip       14           0.09%
31    BHead     2            0.01%         SAHead    6            0.04%
In Table 26, the two palm orientation values are listed from most to least frequently occurring
out of 15,185 total occurrences.
Table 26: Rank and frequency of the two palm orientation parameter values
Rank  Palm orientation code  Occurrences  Frequency
1     P                      10,508       69.20%
2     P+                     4,677        30.80%
In Table 27, the five joint movement values are listed from most to least frequently occurring
out of 15,185 total occurrences.
Table 27: Rank and frequency of the five joint movement parameter values
Rank  Joint movement code  Occurrences  Frequency
1     Elbow                7,551        49.73%
2     Fingers              4,847        31.92%
3     Wrist                1,552        10.22%
4     Shoulder             1,026        6.76%
5     Hold                 209          1.38%
Appendix C
Levenshtein distances between each variety pairing
Table 28 lists the Levenshtein distances between each pairing of the 50 sign language
varieties (1,225 pairs) using the four-parameter coding system of initial and final handshapes and
initial and final locations. This data set uses the refined word list of 215 items and the refined
handshape parameter value inventory of 74 values.
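Because each distance is symmetric, only one value per pairing needs to be stored, which is why 50 varieties yield 1,225 pairs and why the table is lower-triangular. The sketch below shows one way such a table could be loaded for lookup. The function names are hypothetical, and the three rows used are the first three rows of Table 28.

```python
from math import comb


def build_lookup(rows):
    """Build a symmetric lookup from lower-triangular rows.

    Each row is (label, distances), where distances are listed in the
    order of the labels that appeared in earlier rows.
    """
    labels, table = [], {}
    for label, distances in rows:
        if len(distances) != len(labels):
            raise ValueError(f"row {label!r} has the wrong number of values")
        for other, d in zip(labels, distances):
            table[frozenset((label, other))] = d
        labels.append(label)
    return labels, table


def distance(table, a, b):
    """Look up a distance regardless of argument order."""
    return 0.0 if a == b else table[frozenset((a, b))]


rows = [
    ("Chile-01", []),
    ("Chile-02", [0.456]),
    ("Chile-04", [0.428, 0.445]),
]
labels, table = build_lookup(rows)
print(comb(50, 2))                               # number of variety pairings
print(distance(table, "Chile-04", "Chile-01"))   # same as ("Chile-01", "Chile-04")
```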
Table 28: Levenshtein distances between each pair of sign language varieties
Chile-01
0.456 Chile-02
0.428 0.445 Chile-04
0.453 0.436 0.444 Chile-05
0.710 0.729 0.699 0.706 DomR-01
0.681 0.683 0.660 0.659 0.353 DomR-02
0.698 0.692 0.676 0.676 0.374 0.373 DomR-03
0.683 0.679 0.647 0.672 0.427 0.377 0.327 DomR-04
0.724 0.709 0.696 0.709 0.523 0.508 0.469 0.475 DomR-05
0.694 0.704 0.694 0.697 0.471 0.414 0.459 0.427 0.472 DomR-06
0.693 0.673 0.668 0.657 0.439 0.411 0.410 0.400 0.487 0.439 DomR-08
0.709 0.724 0.693 0.705 0.474 0.438 0.420 0.434 0.534 0.462 0.409 DomR-09
0.698 0.707 0.677 0.685 0.403 0.379 0.331 0.348 0.466 0.416 0.432 0.409 DomR-10
0.684 0.683 0.674 0.681 0.682 0.646 0.649 0.641 0.699 0.672 0.656 0.645 0.648 ElSal-03
0.704 0.699 0.703 0.713 0.596 0.557 0.522 0.535 0.613 0.612 0.565 0.548 0.496 0.504 ElSal-08
0.702 0.678 0.689 0.692 0.666 0.622 0.639 0.631 0.680 0.672 0.644 0.639 0.641 0.384 0.526 ElSal-12
0.710 0.712 0.694 0.732 0.608 0.553 0.543 0.547 0.611 0.606 0.565 0.581 0.539 0.657 0.563 0.641 Hond-01
0.698 0.693 0.684 0.706 0.582 0.537 0.506 0.535 0.596 0.592 0.565 0.579 0.500 0.665 0.570 0.648 0.308 Hond-05
0.689 0.697 0.683 0.701 0.590 0.535 0.512 0.523 0.604 0.597 0.570 0.569 0.505 0.648 0.547 0.643 0.305 0.289 Hond-10
0.713 0.714 0.718 0.709 0.593 0.554 0.530 0.558 0.603 0.589 0.538 0.556 0.526 0.655 0.541 0.647 0.382 0.370 0.343 Hond-11
0.751 0.757 0.748 0.761 0.784 0.771 0.758 0.754 0.801 0.792 0.760 0.754 0.755 0.755 0.748 0.763 0.775 0.764 0.752 0.789 Ire-01
0.723 0.734 0.728 0.735 0.547 0.548 0.486 0.507 0.563 0.561 0.544 0.563 0.486 0.681 0.547 0.684 0.602 0.577 0.597 0.600 0.718 Jam-01
0.727 0.732 0.729 0.730 0.512 0.520 0.469 0.488 0.551 0.581 0.543 0.534 0.474 0.679 0.513 0.667 0.580 0.567 0.577 0.583 0.711 0.373 Jam-02
0.723 0.727 0.731 0.729 0.522 0.534 0.460 0.483 0.558 0.551 0.546 0.527 0.472 0.674 0.531 0.651 0.591 0.572 0.572 0.598 0.716 0.360 0.322 Jam-03
0.750 0.744 0.747 0.744 0.565 0.543 0.516 0.521 0.604 0.573 0.570 0.561 0.525 0.699 0.554 0.677 0.626 0.604 0.608 0.613 0.723 0.420 0.374 0.385 Jam-06
0.739 0.741 0.735 0.736 0.503 0.513 0.447 0.489 0.546 0.541 0.527 0.524 0.458 0.680 0.526 0.673 0.590 0.560 0.561 0.573 0.727 0.364 0.316 0.315 0.356 Jam-07
0.732 0.722 0.728 0.732 0.559 0.551 0.532 0.530 0.571 0.597 0.561 0.561 0.517 0.682 0.519 0.674 0.601 0.608 0.594 0.604 0.732 0.433 0.382 0.398 0.417 Jam-08
0.750 0.753 0.758 0.754 0.762 0.725 0.737 0.738 0.737 0.749 0.763 0.743 0.745 0.745 0.756 0.755 0.766 0.739 0.744 0.754 0.724 0.759 0.735 0.747 0.760 NIre-01
0.670 0.668 0.680 0.678 0.577 0.545 0.568 0.555 0.620 0.574 0.567 0.566 0.573 0.630 0.613 0.618 0.587 0.567 0.582 0.590 0.762 0.643 0.619 0.617 0.645 Pan-01
0.648 0.645 0.643 0.652 0.549 0.495 0.496 0.500 0.609 0.573 0.536 0.561 0.494 0.568 0.538 0.574 0.493 0.458 0.479 0.478 0.739 0.576 0.571 0.561 0.604 Pan-06
0.648 0.637 0.648 0.649 0.715 0.669 0.692 0.669 0.695 0.710 0.698 0.711 0.690 0.693 0.715 0.695 0.730 0.701 0.716 0.731 0.768 0.721 0.725 0.708 0.734 Prgy-02
0.666 0.651 0.657 0.668 0.736 0.676 0.719 0.686 0.735 0.719 0.680 0.718 0.726 0.688 0.712 0.694 0.745 0.734 0.727 0.741 0.769 0.748 0.732 0.731 0.742 Prgy-03
0.659 0.640 0.659 0.651 0.736 0.701 0.723 0.693 0.703 0.714 0.711 0.725 0.722 0.690 0.708 0.700 0.746 0.731 0.728 0.745 0.785 0.735 0.731 0.742 0.749 Prgy-04
0.666 0.662 0.661 0.649 0.733 0.679 0.717 0.689 0.741 0.729 0.725 0.730 0.702 0.697 0.728 0.703 0.755 0.722 0.720 0.736 0.775 0.752 0.746 0.752 0.766 Prgy-05
0.652 0.644 0.675 0.657 0.731 0.691 0.710 0.688 0.728 0.722 0.706 0.726 0.713 0.697 0.713 0.689 0.727 0.709 0.706 0.732 0.776 0.742 0.734 0.739 0.751 Prgy-06
0.650 0.623 0.670 0.668 0.702 0.662 0.704 0.665 0.741 0.716 0.676 0.733 0.713 0.715 0.726 0.667 0.747 0.727 0.726 0.729 0.778 0.719 0.719 0.715 0.723 Prgy-07
0.620 0.625 0.620 0.622 0.725 0.674 0.697 0.659 0.693 0.687 0.681 0.721 0.705 0.682 0.696 0.690 0.724 0.703 0.704 0.710 0.769 0.728 0.727 0.718 0.749 Prgy-08
0.649 0.640 0.649 0.647 0.719 0.665 0.682 0.661 0.696 0.682 0.684 0.717 0.692 0.699 0.713 0.703 0.719 0.703 0.697 0.711 0.772 0.722 0.707 0.717 0.732 Prgy-09
0.672 0.678 0.674 0.676 0.601 0.569 0.558 0.573 0.609 0.613 0.579 0.590 0.561 0.666 0.594 0.660 0.618 0.603 0.610 0.598 0.731 0.607 0.604 0.577 0.630 Peru-01
0.639 0.641 0.653 0.649 0.616 0.588 0.571 0.571 0.623 0.627 0.600 0.630 0.565 0.665 0.609 0.659 0.655 0.637 0.636 0.626 0.761 0.645 0.609 0.605 0.645 Peru-05
0.670 0.668 0.658 0.673 0.596 0.576 0.527 0.526 0.611 0.616 0.568 0.588 0.548 0.691 0.602 0.654 0.618 0.594 0.590 0.587 0.763 0.580 0.560 0.553 0.591 Peru-18
0.630 0.652 0.637 0.650 0.624 0.589 0.567 0.553 0.612 0.623 0.586 0.597 0.561 0.634 0.581 0.652 0.630 0.617 0.608 0.608 0.767 0.630 0.600 0.588 0.635 Peru-22
0.730 0.747 0.755 0.752 0.537 0.552 0.476 0.490 0.555 0.552 0.534 0.547 0.464 0.692 0.505 0.670 0.543 0.562 0.553 0.546 0.748 0.414 0.369 0.366 0.441 StVin-01
0.735 0.735 0.730 0.735 0.549 0.534 0.467 0.506 0.556 0.574 0.548 0.523 0.480 0.659 0.503 0.658 0.529 0.551 0.545 0.521 0.733 0.432 0.411 0.420 0.472 Trin-01
0.727 0.724 0.710 0.717 0.592 0.580 0.536 0.556 0.587 0.609 0.571 0.576 0.525 0.657 0.564 0.663 0.568 0.588 0.585 0.560 0.733 0.490 0.494 0.472 0.538 Trin-02
0.704 0.729 0.713 0.718 0.577 0.571 0.551 0.555 0.618 0.604 0.577 0.591 0.536 0.660 0.522 0.654 0.614 0.621 0.625 0.602 0.724 0.518 0.469 0.468 0.520 Trin-03
0.715 0.724 0.713 0.723 0.489 0.501 0.446 0.471 0.553 0.563 0.545 0.546 0.457 0.659 0.547 0.647 0.551 0.535 0.555 0.567 0.695 0.402 0.361 0.373 0.453 USA-01
0.726 0.725 0.728 0.731 0.511 0.510 0.457 0.491 0.545 0.527 0.517 0.535 0.455 0.674 0.509 0.672 0.579 0.563 0.560 0.571 0.717 0.399 0.375 0.378 0.446 USA-05
0.729 0.739 0.734 0.737 0.529 0.549 0.473 0.516 0.595 0.591 0.525 0.547 0.484 0.678 0.524 0.664 0.605 0.574 0.576 0.590 0.733 0.409 0.372 0.383 0.444 USA-06
0.748 0.737 0.742 0.735 0.525 0.529 0.483 0.533 0.582 0.537 0.546 0.575 0.484 0.686 0.537 0.680 0.607 0.588 0.589 0.605 0.725 0.415 0.400 0.419 0.449 USA-07
Jam-07
0.371 Jam-08
0.762 0.765 NIre-01
0.629 0.625 0.754 Pan-01
0.545 0.579 0.723 0.461 Pan-06
0.729 0.721 0.743 0.667 0.655 Prgy-02
0.745 0.743 0.778 0.690 0.687 0.459 Prgy-03
0.735 0.736 0.766 0.696 0.679 0.373 0.413 Prgy-04
0.767 0.762 0.758 0.673 0.665 0.422 0.504 0.448 Prgy-05
0.755 0.744 0.764 0.692 0.673 0.409 0.476 0.411 0.435 Prgy-06
0.701 0.720 0.761 0.660 0.658 0.575 0.530 0.525 0.606 0.595 Prgy-07
0.723 0.729 0.751 0.685 0.662 0.361 0.438 0.368 0.430 0.380 0.533 Prgy-08
0.725 0.729 0.743 0.686 0.659 0.363 0.431 0.364 0.425 0.396 0.523 0.343 Prgy-09
0.600 0.618 0.735 0.617 0.562 0.659 0.712 0.689 0.701 0.700 0.714 0.661 0.673 Peru-01
0.628 0.617 0.764 0.600 0.581 0.647 0.697 0.669 0.669 0.670 0.678 0.663 0.664 0.501 Peru-05
0.552 0.599 0.736 0.601 0.548 0.673 0.726 0.693 0.708 0.708 0.671 0.670 0.688 0.385 0.487 Peru-18
0.595 0.597 0.741 0.620 0.574 0.654 0.682 0.674 0.697 0.678 0.675 0.660 0.663 0.478 0.469 0.450 Peru-22
0.348 0.427 0.731 0.640 0.559 0.733 0.755 0.742 0.761 0.751 0.711 0.733 0.714 0.595 0.635 0.573 0.601 StVin-01
0.387 0.467 0.741 0.628 0.555 0.714 0.725 0.727 0.746 0.726 0.701 0.701 0.705 0.579 0.623 0.573 0.589 0.346 Trin-01
0.470 0.509 0.728 0.639 0.574 0.705 0.723 0.722 0.735 0.734 0.708 0.711 0.714 0.602 0.632 0.578 0.609 0.440 0.340 Trin-02
0.476 0.481 0.749 0.616 0.582 0.700 0.716 0.718 0.739 0.722 0.697 0.706 0.704 0.616 0.618 0.609 0.597 0.491 0.441 0.468 Trin-03
0.386 0.487 0.728 0.597 0.524 0.700 0.712 0.715 0.731 0.731 0.690 0.713 0.696 0.555 0.604 0.535 0.601 0.395 0.400 0.466 0.508 USA-01
0.374 0.472 0.743 0.620 0.552 0.723 0.740 0.730 0.747 0.728 0.713 0.721 0.714 0.569 0.620 0.547 0.616 0.409 0.418 0.478 0.496 0.294 USA-05
0.384 0.469 0.757 0.616 0.574 0.732 0.740 0.748 0.763 0.746 0.717 0.732 0.731 0.602 0.628 0.567 0.622 0.430 0.445 0.496 0.492 0.348 0.346 USA-06
0.413 0.489 0.748 0.635 0.581 0.729 0.748 0.738 0.758 0.748 0.732 0.737 0.728 0.602 0.636 0.587 0.638 0.444 0.460 0.511 0.526 0.354 0.321 0.361 USA-07
REFERENCES
Aldersson, Russell R., and Lisa J. McEntee-Atalianis. 2008. “A lexical comparison of signs from
Icelandic and Danish sign languages.” Sign Language Studies 9: 45-87.
van der Ark, René, Philippe Mennecier, John Nerbonne, and Franz Manni. 2007. Preliminary
identification of language groups and loan words in central Asia. In Proceedings of the
RANLP Workshop on Computational Phonology, ed. Petya Osenova, 13-20. Borovets,
Bulgaria. http://www.let.rug.nl/~nerbonne/papers/Ark-et-al-Central-Asia-2007.pdf.
Beijering, Karin, Charlotte Gooskens, and Wilbert Heeringa. 2008. “Predicting intelligibility and
perceived linguistic distance by means of the Levenshtein algorithm.” Linguistics in the
Netherlands 25 (1): 13-24.
Bickford, J. Albert. 2005. “The sign languages of eastern Europe.” SIL Electronic Survey Reports
2005 (026): 45.
Blair, Frank. 1990. Survey on a shoestring: a manual for small-scale language surveys.
Publications in Linguistics 96. Dallas, TX: Summer Institute of Linguistics and the
University of Texas at Arlington.
Brentari, Diane. 1998. A prosodic model of sign language phonology. Cambridge, MA: MIT
Press.
Campbell, Lyle. 2004. Historical linguistics: An introduction. 2nd ed. Cambridge, MA: MIT
Press.
Casad, Eugene H. 1974. Dialect Intelligibility Testing. Summer Institute of Linguistics
Publications in Linguistics and Related Fields 38. Dallas, TX: Summer Institute of
Linguistics.
Ciupek-Reed, Julia. 2011. Participatory methods in sociolinguistic sign language survey: A case
study in El Salvador. M.A. Thesis, Grand Forks, ND: University of North Dakota.
Deibler, Ellis W., and David Trefry. 1963. Languages of the Chimbu sub-district. Port Moresby:
Department of Information and Extension Services.
Everitt, Brian S., Sabine Landau, and Morven Leese. 2001. Cluster Analysis. 4th ed. New York:
Oxford University Press.
Gudschinsky, Sarah C. 1956. “The abc’s of lexicostatistics (glottochronology).” Word 12 (2):
175-210.
Guerra Currie, Anne-Marie P., Richard P. Meier, and Keith Walters. 2002. A crosslinguistic
examination of the lexicons of four signed languages. In Modality and structure in signed
and spoken languages, ed. Richard P. Meier, Kearsy Cormier, and David Quinto-Pozos,
224-236. New York: Cambridge University Press.
Heeringa, Wilbert, Peter Kleiweg, Charlotte Gooskens, and John Nerbonne. 2006. Evaluation of
string distance algorithms for dialectology. In Proceedings of the Workshop on Linguistic
Distances, 51-62. Sydney.
Hendriks, Bernadet. 2008. Jordanian Sign Language: Aspects of grammar from a cross-linguistic
perspective. LOT Dissertation Series 193. Utrecht, the Netherlands: Netherlands
Graduate School of Linguistics.
http://www.lotpublications.nl/publish/articles/003014/bookpart.pdf.
Hurlbut, Hope M. 2007. “A survey of sign language in Taiwan.” SIL Electronic Survey Reports
2008 (001): 117.
Johnson, Jane E., and Russell J. Johnson. 2008. “Assessment of regional language varieties in
Indian Sign Language.” SIL Electronic Survey Reports 2008 (006): 121.
Johnston, Trevor. 2003. BSL, AUSLAN and NZSL: Three signed languages or one? In
Cross-linguistic perspectives in sign language research: selected papers from TISLR 2000, ed.
Anne Baker, B. van den Bogaerde, and O. Crasborn, 47-70. Hamburg: Signum.
Kessler, Brett. 2001. The significance of word lists. Dissertations in Linguistics. Stanford, CA:
Center for the Study of Language and Information Press.
Kleiweg, Peter. 2011. RuG/L04: software for dialectometrics and cartography.
http://www.let.rug.nl/~kleiweg/indexs.html.
Kluge, Angela. 2000. The Gbe language varieties of West Africa: A quantitative analysis of
lexical and grammatical features. Unpublished M.A. Thesis, Cardiff: University of
Wales, College of Cardiff. http://www.sil.org/silesr/2008/silesr2008-023.pdf.
———. 2005. “A Synchronic Lexical Study of Gbe Language Varieties: The Effects of Different
Similarity Judgment Criteria.” Linguistic Discovery 3 (1): 22-53.
———. 2007. “RTT retelling method: An alternative approach to intelligibility testing.” SIL
Electronic Survey Reports 2007: 14.
———. 2008. “A synchronic lexical study of the Ede language continuum of West Africa: The
effects of different similarity judgment criteria.” Afrikanistik online 2007 (4).
http://www.afrikanistik-online.de/archiv/2007/1328.
Lastufka, Michael. 2010. ParamValueUseFreq.xsl. Dallas, TX: SIL International.
Liddell, Scott K., and Robert E. Johnson. 1989. “American Sign Language: The phonological
base.” Sign Language Studies 64: 195-277.
Max Planck Institute for Psycholinguistics. 2011. ELAN - Language Archiving Technology.
Nijmegen, The Netherlands. http://www.lat-mpi.eu/tools/elan/.
McElhanon, Kenneth A. 1967. “Preliminary observations on Huon Peninsula languages.”
Oceanic Linguistics 6: 1-45.
McKee, David, and Graeme Kennedy. 2000. Lexical comparison of signs from American,
Australian, British, and New Zealand Sign Languages. In The signs of language
revisited: An anthology to honour Ursula Bellugi and Edward Klima, ed. Karen
Emmorey and Harlan Lane, 49-76. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Meier, Richard P., Claude Mauk, Gene R. Mirus, and Kimberly E. Conlin. 1998. Motoric
constraints on early sign acquisition. In The proceedings of the twenty-ninth annual child
language research forum, ed. Eve V. Clark, 63-72. Stanford, CA: Center for the Study of
Language and Information Press.
Osugi, Yutaka, Ted Supalla, and Rebecca Webb. 1999. “The use of word elicitation to identify
distinctive gestural systems on Amami Island.” Sign Language & Linguistics 2 (1): 87-112.
Parkhurst, Stephen, and Dianne Parkhurst. 2003. “Lexical comparisons of signed languages and
the effects of iconicity.” Work Papers of the Summer Institute of Linguistics, University
of North Dakota Session 47: 17.
———. 2007. “Spanish Sign Language survey.” SIL Electronic Survey Reports 2007 (008): 85.
Parks, Elizabeth, and Jason Parks. 2008. “Sociolinguistic survey report of the deaf community of
Guatemala.” SIL Electronic Survey Reports 2008 (016): 30.
———. 2010a. “A Sociolinguistic Profile of the Peruvian Deaf Community.” Sign Language
Studies 10 (4): 33.
———. 2010b. Investigating sign language variation through intelligibility testing: The recorded
text test retelling method. In TISLR 2010 Posters. West Lafayette, IN.
http://www.purdue.edu/tislr10/pdfs/Parks Parks.pdf.
Rensch, Calvin R. 1992. Calculating lexical similarity. In Windows on bilingualism, ed. Eugene
H. Casad, 13-15. Summer Institute of Linguistics and the University of Texas at
Arlington Publications in Linguistics 110. Dallas, TX: The Summer Institute of
Linguistics and The University of Texas at Arlington.
Rozelle, Lorna. 2003. The structure of sign language lexicons: Inventory and distribution of
handshape and location. Doctoral dissertation, University of Washington.
Sanders, Arden G. 1977. Guidelines for conducting a lexicostatistic survey in Papua New Guinea.
In Language variation and survey techniques, ed. Richard Loving, 21:21-43. Workpapers
in Papua New Guinea languages. Ukarumpa, Papua New Guinea: Summer Institute of
Linguistics.
Sandler, Wendy. 1989. Phonological representation of the sign: Linearity and nonlinearity in
American Sign Language. Dordrecht: Foris.
Sandler, Wendy, and Diane Lillo-Martin. 2006. Sign language and linguistic universals. New
York: Cambridge University Press.
Sasaki, Daisuke. 2007. Comparing the lexicons of Japanese Sign Language and Taiwan Sign
Language: A preliminary study focusing on the difference in the handshape parameter. In
Sign languages in contact, ed. David Quinto-Pozos, 123-150. Sociolinguistics in Deaf
Communities 13. Washington, D.C.: Gallaudet University Press.
Schooling, Stephen J. 1981. A linguistic and sociolinguistic survey of French Polynesia.
Hamilton, N.Z.: Summer Institute of Linguistics.
Siedlecki Jr., Theodore, and John D. Bonvillian. 1993. “Location, handshape & movement:
Young children’s acquisition of the formational aspects of American Sign Language.”
Sign Language Studies 78: 31-52.
Simons, Gary F. 1977. Phonostatistic methods. In Language variation and survey techniques, ed.
Richard Loving, 155-184. Workpapers in Papua New Guinea Languages 21. Ukarumpa,
Papua New Guinea: Summer Institute of Linguistics.
Stokoe, William C., Dorothy Casterline, and Carl Croneberg. 1965. A dictionary of American
Sign Language on linguistic principles. Washington, D.C.: Gallaudet College Press.
Vanhecke, Eline, and Kristof De Weerdt. 2004. Regional variation in Flemish Sign Language. In
To the lexicon and beyond: Sociolinguistics in European deaf communities, ed. Mieke
van Herreweghe and M. Vermeerbergen, 27-38. Sociolinguistics in Deaf Communities
10. Washington, D.C.: Gallaudet University Press.
White, Chad. 2010. An evaluation of Levenshtein distance calculation. In Paper presented at the
International Language Assessment Conference, 41. Penang, Malaysia.
———. 2011. Rugloafer. Website. https://sites.google.com/site/rugloafer/home.
Williams, Holly, and Elizabeth Parks. 2010. “A Sociolinguistic Survey Report of the Dominican
Republic Deaf Community.” SIL Electronic Survey Reports 2010 (005): 20.
Woodward, James C. 1977. Historical bases of American Sign Language. In Understanding
language through sign language research, ed. P. Siple, 333-348. New York: Academic
Press.
———. 1993. “The relationship of sign language varieties in India, Pakistan and Nepal.” Sign
Language Studies 78: 15-22.
Xu, Wang. 2006. A comparison of Chinese and Taiwan Sign Languages: Towards a new model
for sign language comparison. M.A. Thesis, Columbus, OH: Ohio State University.
http://people.cohums.ohio-state.edu/chan9/ling/theses/xu-wang_2006_MA.pdf.
Yang, Cathryn. 2009. “Nisu dialect geography.” SIL Electronic Survey Reports 2009 (007): 40.