Associated Language codes

Use of language codes in RDA name authority records,
Associated Language: ISO 639-2 versus ISO 639-3
- an Africana cataloger’s perspective (and learning curve)
Marcia Tiede
Area Studies Cataloger, Northwestern University Library
March 2014
One of the many optional MARC 21 fields in name and series authority records under RDA is Associated
Language (377), defined as: "Codes for languages associated with the entity described in the record.
Includes the language a person uses when writing for publication, broadcasting, etc., a language a
corporate body uses in its communications, a language of a family, or a language in which a work is
expressed."
The principal subfields in Associated Language are ‡a (Language code) and ‡l (Language term); both are
repeatable. ‡2 (Source of language code), which is nonrepeatable, is used in association with ‡a if the code
source is not ISO 639-2. As in other fields, if a term’s source is specified in ‡2, the second indicator is
set to 7.
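As a rough illustration of the field structure just described, here is a minimal Python sketch. The class name Field377 and its render method are hypothetical, invented for this example rather than taken from any MARC library, and the blank indicator is shown as # following the convention in the DCM Z1 example. The codes ara and zag are borrowed from Case 1 later in this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field377:
    """Hypothetical in-memory model of one 377 (Associated Language) field."""
    codes: List[str] = field(default_factory=list)  # repeatable subfield a, Language code
    terms: List[str] = field(default_factory=list)  # repeatable subfield l, Language term
    source: Optional[str] = None                    # nonrepeatable subfield 2, Source of code

    def render(self) -> str:
        # Second indicator is 7 only when a source is named in subfield 2.
        ind2 = "7" if self.source else "#"
        parts = [f"377 #{ind2}"]
        parts += [f"‡a {code}" for code in self.codes]
        parts += [f"‡l {term}" for term in self.terms]
        if self.source:
            parts.append(f"‡2 {self.source}")
        return " ".join(parts)


# Default code source: no subfield 2, second indicator blank.
print(Field377(codes=["ara"]).render())                     # 377 ## ‡a ara
# Code from another source: subfield 2 present, second indicator 7.
print(Field377(codes=["zag"], source="iso639-3").render())  # 377 #7 ‡a zag ‡2 iso639-3
```

The one design point worth noting is that the second indicator is derived from the presence of ‡2 rather than stored separately, so the two cannot drift out of step.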
ISO 639-2 and ISO 639-3 are both alpha-3 (three-letter) international standard codes for language
identification. They represent two parts of a six-part standard that has been evolving since the 1990s from
the original alpha-2 code known simply as ISO 639 (now ISO 639-1). ISO 639-2 (1998) is a code set for over
400 individual languages and collections of languages. ISO 639-3 (2007) is a code set containing all the
codes from ISO 639-2 plus 7,000 others, which aims to be "comprehensive" in its coverage of languages.
ISO 639-2 is the default language code set in MARC, and was in fact based on the MARC Language Code. As
of 2007 when the Introduction to the MARC Code List for Languages was written, there were 484 language
codes in MARC, of which 55 were collective codes. The relationship between the MARC and ISO 639-2
language codes is described there as follows:
ISO 639-2 ... was based on the MARC Code List for Languages. Language names in ISO 639-2
are not necessarily the same as those in MARC, particularly because of the practice of
correlating the MARC language names with those used in Library of Congress Subject
Headings. The MARC list includes references for unused forms of language names, while
the ISO list has in [only] some cases included alternative name forms ... In addition the MARC
documentation includes a list of individual languages under collective codes or language
groups, while the ISO list only includes the group codes themselves. The Library of Congress
is the maintenance agency for both lists, and the two are kept compatible in terms of code
additions and deletions.
ISO 639-2 is also the code set used to represent written languages in NISO Z39.53, Codes for the
Representation of Languages for Information Exchange. ISO 639-2 is intended for use in "libraries, archives,
and other documentation applications."
The Library of Congress is the registration authority for ISO 639-2 codes. In accordance with LC Standards,
Criteria for ISO 639-2 (Sep. 22, 2006, issued just before the publication of ISO 639-3), there are a variety of
"objective and subjective metrics" for petitioning to include a language in ISO 639-2. The criteria include
number of documents in that language (50 or more from a given institution or group of five institutions),
along with other factors such as documentation of the language's official status and use in formal
education. ISO 639-2 language codes cannot in theory be changed (though they can be "retired" or
discontinued). The process of proposing a new language code for ISO 639-2 depends on a fairly formidable
set of criteria. And in fact, almost no language codes have been added to ISO 639-2 in recent years.
The Africana cataloging context
African languages for which few or no cataloged publications exist do not have a language identifier code
assigned in ISO 639-2, or may be represented only by a collective (language cluster) code at a very broad
level. I have not yet come across an explanation for how these collective codes were determined, beyond
mention of the practical need to cluster related languages that are still in some cases being classified. For
most libraries the codes provided in ISO 639-2 are more than adequate. And for library users, subject or
keyword access is what counts.
The complexity of African (and other) language categorization, and the scarcity or nonexistence of
publications about some African languages, means that many African languages are still without
established LC subject headings. Unlike language codes, language subject headings may be proposed to
Library of Congress readily, at the point of need, when cataloging a publication about a language for which
there is not yet a relevant heading. Usage in that publication is a starting point in creating the language
heading proposal, and the format of the subject proposal follows the "pattern" described in the Subject
Headings Manual, H 1154 - Languages. As with other subject headings, there is the possibility of proposing
changes to them over time. Submission of African language subject proposals is generally done via the
SACO Africana Subject Funnel.
Cataloging materials that are published in African languages is another matter, and not one that I will get
into much here, because I’ve had limited experience of that so far. Even in libraries that specialize in
collecting Africana, those of us outside the African context see only a very small sampling of the actual
volume that is produced in some languages. Identifying the language in itself can be very challenging,
let alone comprehending the content. Locating language resources or expertise in such a variety of
languages is something that requires patience and resourcefulness. Script can be an issue, even in a
romanized context. The diversity of languages in a single country can be bewildering; South Africa has
eleven official languages as of 1997, meaning that government and educational publications are often
issued in parallel in multiple languages.
More to the point for use of Associated Language in name authority files, there is a fair likelihood of
multilingual knowledge and production, be the person a missionary, musician or Muslim scholar. The
definition of this field hews to a fairly narrow interpretation of “communication” for individuals – “writing
for publication, broadcasting, etc.” – but I have generally interpreted this a little more broadly to include
any languages a person is likely to use for communication of any sort. Someone who translates may not
actually write in that language, but is able to use his/her understanding to “commune” with that language
in order to transform the meaning into another language. And in the Africana context, there is a high
likelihood that one of those languages might not have an equivalent MARC / ISO 639-2 code, or only be
represented by a collective code.
Getting back to subject headings: Africana catalogers refer to Ethnologue: Languages of the World, a
project of SIL International, as a primary resource for proposing new language headings. Ethnologue uses
ISO 639-3 language identifier codes, and SIL International is the registration authority for ISO 639-3 codes.
According to the code's home page on SIL's website, "ISO 639-3 attempts to provide as complete an
enumeration of languages as possible, including living, extinct, ancient, and constructed languages,
whether major or minor, written or unwritten." Gary Simons of SIL International recently did a synopsis of
the history of, need for, and goals of ISO 639-3 and the other members of the ISO 639 family.
One small point where ISO language codes and MARC or other language names can meet up, in the context
of a name authority record’s 377 field, is in the use of ‡l, Language term – specifically, to clarify an ISO 639-2 collective language code. (There are no collective language codes in ISO 639-3.) Field 377's ‡l would
seem to be a useful field for the (anglophone) human eye. My first assumption was that this could be used
at will. But the Descriptive Cataloging Manual (DCM, Z1) specifies, "Prefer language codes over language
terms .... Use subfield $l (Language term) only to provide information not available in the MARC Code List
for Languages.” (In any case, the spelled-out justification for using a given language code should already be
supplied in a 670 note field, Source Data Found, in human-readable form.) Here is the example that
DCM Z1 provides for use of Associated Language:
377 ## $a myn
377 #7 $a acr $2 iso639-3
(ISO 639-3 code for Achi (acr); assigned a collective code (myn) for Mayan languages in the MARC Code List
for Languages)
Though “Achi” is established as a language subject heading and could therefore (I believe) have been
supplied as a legitimate, clarifying term in 377 ‡l, a separate 377 was made to specify the Achi language
according to its ISO 639-3 code.
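To make the structure of the DCM Z1 example concrete, the following Python sketch splits each line into tag, indicators, and subfields, and then checks that the second indicator agrees with the presence of a source subfield. The function names parse_377 and source_flag_consistent are hypothetical, and the parser assumes the simple textual layout used in the example above (tag, two indicator characters with no space between them, then $-delimited subfields); it is not a general MARC parser.

```python
from typing import Dict, List, Tuple


def parse_377(line: str, delimiter: str = "$") -> Dict[str, object]:
    """Split a line such as '377 #7 $a acr $2 iso639-3' into its parts."""
    head, *chunks = line.split(delimiter)
    tag, indicators = head.split()
    # Each chunk starts with the one-character subfield code, then its value.
    subfields: List[Tuple[str, str]] = [(c[0], c[1:].strip()) for c in chunks if c]
    return {"tag": tag, "ind1": indicators[0], "ind2": indicators[1],
            "subfields": subfields}


def source_flag_consistent(parsed: Dict[str, object]) -> bool:
    """True when second indicator 7 and subfield 2 occur together or not at all."""
    has_source = any(code == "2" for code, _ in parsed["subfields"])
    return has_source == (parsed["ind2"] == "7")


for line in ["377 ## $a myn", "377 #7 $a acr $2 iso639-3"]:
    parsed = parse_377(line)
    print(parsed["subfields"], source_flag_consistent(parsed))
```

Both lines of the DCM example pass the check; a line carrying $2 with a blank second indicator would not.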
This leads me to some questions.
If the language is not yet given in the MARC Code List for Languages, and there is only a
collective language code in MARC, what would one use in ‡l ?
The answer seems to be that one would use the relevant LC language heading if there is one (just the
substantive part, dropping "language" or "dialect"); and if an LCSH for that language doesn't yet exist, one
can supply the language name as given in a reference source such as Ethnologue.
My thought was that, in the "natural" progression of things, at some point newly approved LC language
subject headings would appear on the MARC Code List for Languages, Name Sequence, and that they
would be assigned a MARC code. In other words, I had assumed, naively, that our efforts to establish
LCSH's for languages led in some way to establishment of MARC language codes for these languages.
But that is not the case – or at least not now, though language headings that were already extant did shape
the MARC Code List for Languages.
There is an appendix of Changes to MARC Code List for Languages since the 2007 Edition, updated most
recently four months ago (November 2013), but under Part IV: New Codes, it notes, "None since 2007
Edition." On the LC Standards ISO 639-2 Registration Authority there is a table tracking changes from 1989
to 2012, last updated a year ago. After a flurry of language code additions through 2006, there were two in
2007 and one in 2012 (for zgh, Standard Moroccan Tamazight). So it seems that the MARC language codes
upon which ISO 639-2 was based, and ISO 639-2 itself, are essentially becoming static, non-developing
standards.
The changes that have taken place are tweaks to the language name itself (e.g. Maasai rather than Masai),
reassignment of a language to a different collective language code, and three actual code changes - for
Serbian, Croatian, and Moldovan - which occurred in mid-2008 and early 2009. This last part is interesting,
since the whole point of creating "standard" codes for language identification is to have a fixed point of
reference. But there were two factors that entered into these particular changes - political upheaval (the
breakup of Yugoslavia) reflected in language splitting, and script differences in language expression
(Moldovan in Cyrillic, Romanian in Latin) being "compressed" into a single code. (See Tanya Whipple's 2010
thesis describing the use of MARC codes for these languages.)
If one is using a code from ISO 639-3, what is to prohibit use of a complementary ‡l language
term with that?
The Descriptive Cataloging Manual stipulation to "prefer language codes over language terms" was written
specifically in the context of MARC language codes. Here is a hypothetical case: Subject uses a language
that has not been established as an LC subject heading, and (of course) has no ISO 639-2 code. Since in this
case one is not specifying a language to make greater “sense” of a collective code, my initial thought was to
use two 377 fields, as follows:
377 _ 7 ‡a bog ‡2 iso639-3
377 _ 7 ‡l Bamako Sign Language ‡2 iso639-3
But there is apparently a philosophical basis for preferring use of language codes rather than names, as a
way to skirt potential political issues around the full expression of a language name. This controversy
remains even at the code level, though, with some codes being truncations of language names that could
be taken as pejorative to speakers of the language.
It is also good to remember, for perspective (something that one does lose track of sometimes in the thick
of the effort), that these name authority records are not actually intended as reading material for anyone
outside of the rarefied cataloging community. (Though shared public resources such as VIAF are exposing a
narrow slice of that content.) The purpose of these codes is for machine recognition, to permit data
retrieval and manipulation. Since use of the 377 field is merely optional, it is not clear how useful such
manipulation might be (and to what ends it might be put). But we are just coming up on the first
anniversary of RDA implementation, so time will tell more.
What formality or usefulness is there to the term used in ‡l language term? And if there is no
formality or usefulness, why does that subfield even exist?
I had assumed that one needed to use an established form, preferably as established in an LCSH if not (yet)
in MARC. But given the lack of transfusion of newer LCSH language headings into MARC language codes /
names, I now doubt that.
A paper presented by Stephen Morey, Mark W. Post and Victor A. Friedman in December 2013, "The
language codes of ISO 639: a premature and possibly unobtainable standardization," calls into question the
usability and credibility of the entire ISO 639 standardization endeavor. Among other critiques, they single
out ISO 639-3 and its maintenance by SIL International as being "excessively centralized" and as potentially
preserving offensive names for language communities, and they argue that "the in-principle 'permanency' of
language codes such as those of ISO 639-3 is fundamentally incompatible with the nature of human languages,
which are demonstrably impermanent". (See a commentary on their presentation by Martin Haspelmath.)
In a bibliographic context, however, the printed word is at least somewhat permanent, and we can try to
describe it as such. Our language coding efforts in name authority work – which only obliquely refers to
language materials produced or potentially produced by those entities – are a different endeavor.
***
Following are three situations recently encountered in creating or revising personal name authority records
under RDA, to illustrate Associated Language (377) field use with a combination of ISO 639-2 and ISO 639-3
language codes.
Case 1: Subject is bilingual, Arabic and Zaghawa. Zaghawa language has been established as LCSH since
1992, but there is no equivalent MARC / ISO 639-2 language code. It is classified in Ethnologue as Nilo-Saharan, Saharan, Eastern, and its ISO 639-3 language code, zag, is provided there.
Two separate 377 fields may be created – the first, ara for Arabic, with no source flagging needed since the
default source is ISO 639-2; and the second, zag for Zaghawa, flagged with its source as ISO 639-3, and
second indicator 7. In addition, based on the language classification in Ethnologue, one may enter a 377
field for the ISO 639-2 collective code for Nilo-Saharan languages, ssa, followed by a clarifying language
term, even though this language is not referenced in the MARC Code List for Languages. When using ‡l to
clarify a collective ISO 639-2 code, that pairing needs to be in its own field.
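The pairing rule stated at the end of this case can be written as a small check. This is only a sketch of the practice as described here, not an official validation rule; the function name pairing_problems is hypothetical.

```python
from typing import List


def pairing_problems(codes: List[str], terms: List[str]) -> List[str]:
    """Report a problem if a clarifying term shares a field with several codes."""
    problems = []
    if terms and (len(codes) != 1 or len(terms) != 1):
        problems.append(
            "a clarifying language term should be paired with exactly one "
            "collective code, in a field of its own"
        )
    return problems


# The three fields of Case 1, each as (codes, terms).
case_1 = [(["ara"], []), (["zag"], []), (["ssa"], ["Zaghawa"])]
for codes, terms in case_1:
    print(codes, terms, pairing_problems(codes, terms))

# Folding the collective code into the Arabic field would be flagged.
print(pairing_problems(["ara", "ssa"], ["Zaghawa"]))
```

Keeping the pair in its own field means the term can only be read as clarifying that one collective code.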
Case 2: Subject uses three languages—Dutch, English, and Adhola. Adhola language is established as LCSH
since 2009, but has only a collective code in MARC / ISO 639-2: ssa, which represents Nilo-Saharan
languages.
Three separate 377 fields may be created. The first is for the established MARC / ISO 639-2 codes for Dutch
and English. The second is for the ISO 639-2 collective code ssa and the subfield for the specific language,
Adhola, since when employing ‡l to clarify a collective code, that pairing needs to be in its own field. The
third is for the ISO 639-3 language code adh flagged with its source.
An additional issue here is that none of the several more specific levels of linguistic group for this language
have been established in MARC / ISO 639-2 as collective codes, though one of them, Nilotic languages, has
been established since 1985 as an LCSH. So we are obliged to use the very broad collective code.
Case 3: Subject uses or has worked in three languages—French, Amharic, and Dogon. French and Amharic
have MARC / ISO 639-2 language codes. Dogon has been assigned an ISO 639-2 collective code and has had
an LC language subject heading since 1985. On further investigation, however, Dogon turns out to be a
collective language name in itself, representing over a dozen languages, some of which are not mutually
intelligible. There are several more specific codes available for Dogon languages in ISO 639-3.
Three 377 fields may be entered here. The first gives the MARC / ISO 639-2 codes for French and Amharic.
The second gives the MARC / ISO 639-2 collective code nic for Niger-Kordofanian (Other), with a clarifying
‡l, Dogon - which, however, turns out to be a collective term in itself. The third 377 is for the ISO 639-3
language code dts for the more specific Dogon language that Griaule studied, Toro So.
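The pattern shared by the three cases can be sketched in Python as follows: individual MARC / ISO 639-2 codes go together in one field; a language with only a collective code gets that code plus a clarifying term in a field of its own; and an ISO 639-3 code gets a further field flagged with its source. The names Language and build_377_fields are hypothetical, and the codes fre and amh for French and Amharic are supplied from the MARC Code List for Languages, since the case description above does not spell them out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Language:
    name: str                              # term usable in subfield l
    marc_code: Optional[str] = None        # individual MARC / ISO 639-2 code, if any
    collective_code: Optional[str] = None  # collective code, if that is all MARC offers
    iso639_3: Optional[str] = None         # ISO 639-3 code to record, if any


def build_377_fields(languages: List[Language]) -> List[str]:
    fields = []
    # 1. All individual MARC / ISO 639-2 codes share one field (subfield a repeats).
    individual = [lang.marc_code for lang in languages if lang.marc_code]
    if individual:
        fields.append("377 ## " + " ".join(f"‡a {code}" for code in individual))
    for lang in languages:
        if lang.marc_code:
            continue
        # 2. Collective code plus clarifying term, in a field of its own.
        if lang.collective_code:
            fields.append(f"377 ## ‡a {lang.collective_code} ‡l {lang.name}")
        # 3. ISO 639-3 code, flagged with its source and second indicator 7.
        if lang.iso639_3:
            fields.append(f"377 #7 ‡a {lang.iso639_3} ‡2 iso639-3")
    return fields


case_3 = [
    Language("French", marc_code="fre"),
    Language("Amharic", marc_code="amh"),
    # dts is the code for Toro So, the specific Dogon language named in the case.
    Language("Dogon", collective_code="nic", iso639_3="dts"),
]
for line in build_377_fields(case_3):
    print(line)
```

The sketch deliberately leaves the choice of the clarifying term and of the specific ISO 639-3 code to the cataloger, since, as the Dogon case shows, those are judgment calls rather than lookups.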
References:
Codes for the Representation of Languages for Information Interchange, ANSI/NISO Z39.53-2001
http://www.niso.org/apps/group_public/download.php/6541/Codes%20for%20the%20Representation%20of%20Languages%20for%20Information%20Interchange.pdf
Codes for the Representation of Names of Languages: ISO 639-2/RA [Registration Authority] change notice.
http://www.loc.gov/standards/iso639-2/php/code_changes.php
Criteria for ISO 639-2
http://www.loc.gov/standards/iso639-2/criteria2.html
Ethnologue: Languages of the world. 17th edition; online version.
https://www.ethnologue.com/
Haspelmath, Martin. Can language identity be standardized? On Morey et al.'s critique of ISO 639-3.
Diversity Linguistics Comment website, posted Dec. 4, 2013.
http://dlc.hypotheses.org/610
ISO 639-2: an international standard for language codes. November 1998 (rev. June 4, 1999)
http://www.loc.gov/marc/iso639.html
ISO 639-3 Registration Authority.
http://www-01.sil.org/iso639-3/
MARC 21 Format for Authority Data: 377: Associated Language.
http://www.loc.gov/marc/authority/ad377.html
MARC Code List for Languages.
http://www.loc.gov/marc/languages/
MARC Code List for Languages: Introduction (as of Sep. 2007).
http://www.loc.gov/marc/languages/introduction.pdf
Simons, Gary. ISO 639-3 : where are we and how did we get here?
Workshop on Identifying Codes for Languages, Newcastle, Australia, 9 Feb. 2013.
http://www-01.sil.org/~simonsg/local/ISO%20639-3.pdf
Source Codes for Vocabularies, Rules, and Schemes: Language Code and Term Source Codes.
http://www.loc.gov/standards/sourcelist/language.html
Whipple, Tanya L. A study of the use of MARC language codes in OCLC catalog records. Thesis, M.S.L.S.,
University of North Carolina at Chapel Hill, 2010.
https://cdr.lib.unc.edu/indexablecontent/uuid:be4aad8f-3846-472b-a8e9-05818d5d2186