>> Michael Gamon: It's my privilege and pleasure today to introduce two colleagues that I met at the CALICO Conference two years ago and then again this year. Sun-Hee Lee is an assistant professor at Wellesley College and is interested in research in Korean linguistics; she has worked a lot in teaching Korean and so is also interested in developing automatic methods to help learners of Korean. Seok Bae Jang is from Brigham Young University, is a visiting lecturer there -- >>: Jang. >> Michael Gamon: Jang, sorry. >>: It's confusing, I know. >> Michael Gamon: We always use the first names. Sorry. Yeah, he's a visiting lecturer at Brigham Young University and is also interested in Korean linguistics, Korean language processing, and the development of automatic tools for learning Korean. And of course both of them work on corpus annotation, and we'll hear a little bit about that. The third collaborator in this work is Markus Dickinson from Indiana University, who's not here. The title of the talk is Developing an Annotated Korean Learner Corpus and an Automatic Analysis of Learner Language. Thank you. >> Sun-Hee Lee: Hi, everyone. Today I'm going to introduce the work that the three of us have been doing for a couple of years. We started separately, each with individual research interests, but somehow we merged into one, formed a research group, and started to work on automatic processing of learner errors. Today I'm going to talk about the general goals of our study. We are specifically working on the area called ICALL, the field of intelligent computer-assisted language learning. How many of you have heard of a CALL system? Okay. So that just [inaudible] -- we put "intelligent" in front of CALL, which made other people working on CALL quite annoyed, which is true.
So let me explain later what intelligent CALL actually is, and I'll briefly talk about the usage of learner corpora, because we are developing an annotated learner corpus. Since we are specifically working on Korean, I think I need to introduce some background to make our project understood, so I will briefly talk about the background of Korean. Then Seok Bae and I will introduce two pilot studies we have been working on for the last couple of years, and at the end we're going to introduce what we found so far and in what direction we're going to move in the next few years. Our short-term goal is to develop an annotated learner corpus and provide an annotation scheme that can be used as a sort of gold standard for marking up errors in a Korean learner corpus. We're working on Korean, but Korean is quite similar to Japanese, so it's not a project specific to one language. We also try to find useful properties and resources for a markup system that can facilitate the automatic error detection process, and to provide resources for Korean language teaching so that people can use them in the language classroom. The long-term goal is to build a Korean intelligent computer-assisted language learning system, which is actually very far from us now. You can think of CALL as using the computer to teach language. We can use PowerPoint or other software to make learners excited. However, intelligent CALL is an improved CALL system that can actually provide intelligent feedback to the learner, typically by using currently developed natural language processing technology.
Let me talk briefly about the limits of current CALL systems. People develop exercises using multimedia files and movie clips to make language learning more exciting and interesting. However, current exercises are typically limited to uncontextualized multiple-choice questions. Sometimes you just point and click and the system tells you, yes, you are right, or, no, you are wrong. But it's not quite smart enough. The feedback is usually limited to yes/no, or sometimes letter-by-letter matching. For example, there is quite well-known software called Bonzai and Robo-Sensei. Have you heard of this software? It was developed by Nagata in 1996 -- not 1006, 1996. People were excited about having that kind of software for learning Japanese. Learning Japanese is very challenging, especially for language learners who are native speakers of English, because Japanese is typologically very different. Korean is even harder. That's why foreign learners struggle when they try to learn the language. But Bonzai is a sort of translation-based exercise. The system asks you to translate a sentence into Japanese, and if you make an error with a Japanese particle -- I'm going to explain what particles are, but they are very similar to English prepositions, and for language learners it's hard to get the right particles -- if there is something wrong with the particles, it gives you the feedback that you are wrong. But basically the system is a one-to-one correspondence, a sort of [inaudible] matching between English and Japanese. So we think the tool must be extended to permit diagnosis of the errors made by language learners. Language learners shouldn't be so restricted.
Language learners put something in, the system processes the input and then provides intelligent feedback according to the learner's level. That's why we use the term "intelligent." I know the terminology is a little bit controversial, but that's sort of the way to go now. Now, the learner corpora we are developing: basically a learner corpus means texts of learner language. For example, native speakers of English learning Korean produce writing assignments, and the collection of those texts is called a learner corpus. With a learner corpus there are many kinds of processing applied to the learner data. For example, the header information needs to be processed. At the transcription level, people have to process the orthographic information. And at the annotation level there is sentence-boundary disambiguation, annotation for tokenization and lemmatization, or -- a simple example would be syntactic and semantic parsing. That kind of annotation can be added. Part of it is error tagging. We are interested in the annotation of errors and how to mark errors efficiently so that, using that information, we can produce an automatic error detection method later. The usage of error annotation is that it determines what kinds of language properties are underused, overused, or sometimes misused, so it supports correct learner analysis. That's why it's important. Annotation also provides evaluation data for system development. And an error taxonomy tells developers what types of errors to anticipate when processing learner data.
It can also be used for investigating foreign language learners' interlanguage in linguistics or second language [inaudible], so it has usage in that area too. And error analysis through error annotation can be incorporated directly into foreign language teaching practice. But building an exactly correct error taxonomy is very challenging. As for the validity of an error taxonomy, people have thought about it and tried to provide a sort of gold standard annotation scheme. However, throughout the 1970s, people realized that it is impossible to have a generic error taxonomy, because you cannot annotate the whole thing in a way that works for every different task. So a task-specific or purpose-oriented taxonomy is very necessary in this kind of project. Reliability of the error annotation is also important, because annotation is done by annotators, and if annotators don't have enough background or miss some particular errors, after you're done with the annotation the whole thing is useless. How to maintain high reliability is always a challenging topic in annotation tasks. Now let me talk briefly about the Korean language. Korean is an agglutinative language, which means it has morphologically more complex word structure inside, and it also has relatively free word order. Complex morphological combination means that morphological boundaries tend to be maintained in spite of the application of phonological rules. We're going to talk about that when we talk about our spelling error project, because spelling errors are generated for many different reasons. Sometimes people get tired and are just physically not in the right condition.
But many times there are linguistic reasons why people produce certain types of errors. That's the part we are interested in. Morphology in Korean is complex: verbs are conjugated by adding very small morphemes, and nouns are composed of a noun phrase combined with particles, which are quite similar to English prepositions. English learners, as second-language learners, produce a lot of preposition errors, right? So for those people who are working on ESL and -- what's the program name? >>: Assistant. >> Sun-Hee Lee: Yeah, Assistant -- that's an interesting topic. For Korean, particles are very similar to prepositions but more complex: linguistically they have very similar functions, but there is more long-distance connectivity between Korean particles and the verb forms. I will explain later. Also, this is one Korean verb -- you may not be able to read the Korean letters, but this is just one verb. Inside the verb there are small fragments, and each fragment represents tense information, sometimes mood information, and many other kinds of linguistic information. So the verb "hold" is composed of five elements. Another challenge is that each character represents one syllable in writing; however, when people pronounce it, the morpheme boundaries are not maintained. So you can easily see why people make a lot of spelling errors. We will talk about that more when Seok Bae talks about spelling errors. Korean particles look similar to English in structure: instead of "at school," Korean uses "school at." It seems simple, but one difference is that there is no space between "at" and "school."
So particles cannot stand alone; a noun plus its particle is considered one word. Identifying which part is the particle means you have to get the right morpheme boundary information. Some more things about Korean particles: particles represent structural information. In English, subject and object can be identified by word order, but in Korean there is a subject marker after the noun phrase, and likewise an object marker -- very similar to Japanese. Because of the markers, word order is relatively flexible: we don't need to maintain the same word order for subject and object, and particles contribute to that free word order. Particle realization is based on verb argument structure and also semantic information. There is also a different class of particles called topic markers, which deliver a lot of discourse-pragmatic information. Subject markers are sometimes replaced by topic markers, but only under a very specific pragmatic setup, so people try to understand what kind of linguistic information is delivered by using the topic markers. In order to process Korean particles, you have to deal with challenges from the spelling error level all the way up to high-level pragmatic and semantic information. Now let me talk briefly about the Korean writing system. It is based on syllabic structure -- a syllabic alphabet where one umjeol, a syllable, is the basic unit. In English you can identify syllables automatically when you hear a string of sounds. The Korean letter system represents one syllable as one unit, whereas English is more linear: a, b, c, d are written left to right.
But Korean divides up the syllable structure and represents it in the letter system. A syllable is composed of three parts. The first is the choseong, the initial consonant sound. The second is the jungseong, which is always a vowel, and the last is the final sound, the jongseong. In English, for the sound "hak," you write h-a-k linearly, right? In the Korean system, though, the "h" sound is represented by a symbol at the top, the "a" by a symbol beside it, and the "k" sound by a symbol underneath, and when people write, they combine them together into one block. So it has a more combinatoric structure. Spacing is also a real headache for Korean language processing. A particle combination doesn't allow any space between the noun phrase and the particle, but in some cases the Korean spacing rules are very loose -- it can seem like there is no rule: sometimes you can use a space, sometimes you don't need to. That creates a lot of problems for the language processing part. A space-delimited unit is called an Eojeol. For English we talk about a million-word corpus, right? For Korean we say a million-Eojeol corpus, because it's hard to count each word -- there is no space before particles -- so we count space-delimited units, Eojeols. The syllabic representation of Korean requires various kinds of knowledge from the learner: linguistic knowledge of syllabic composition -- which element goes where, all the positional information -- but also orthographic knowledge, including sound-letter relations.
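The choseong/jungseong/jongseong slots just described map directly onto Unicode's precomposed Hangul syllable range, so the block structure can be shown with simple arithmetic. This is an illustrative sketch, not part of the talk:

```python
# Illustrative sketch: a precomposed Hangul syllable encodes the choseong
# (initial consonant), jungseong (vowel), and optional jongseong (final
# consonant) of one syllable block in a single Unicode code point.

BASE = 0xAC00            # first syllable block, '가'
N_JUNG, N_JONG = 21, 28  # vowel count; final count (index 0 = no final)

def compose(cho: int, jung: int, jong: int = 0) -> str:
    """Stack three jamo indices into one syllable block."""
    return chr(BASE + (cho * N_JUNG + jung) * N_JONG + jong)

def decompose(block: str) -> tuple:
    """Recover the jamo indices from a syllable block."""
    offset = ord(block) - BASE
    return (offset // (N_JUNG * N_JONG),
            offset // N_JONG % N_JUNG,
            offset % N_JONG)

# The "hak" example: h (choseong 18) + a (jungseong 0) + k (jongseong 1)
# are stacked into one block rather than written linearly as h-a-k.
print(compose(18, 0, 1))  # -> '학'
```

A learner spelling error that moves a consonant between the jongseong of one block and the choseong of the next is therefore a different code point entirely, which is one reason syllable-level string comparison is so unforgiving.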
For example, Korean people don't distinguish the "k" and "g" sounds, so both sounds map to one letter. Native speakers have all of that complex information, but for language learners it's really hard to match which sound corresponds to which letter. There are two frequent error types: I've been talking about spelling errors, and as you can guess there are also lots of particle errors. These two are the main interest of our study. When people studied learner language, they found that particle errors make up about 24.4 percent and misspelling errors about 20 percent of all the errors the language learners produced. That's quite high. Yeah? >>: One quick question on the last statistics. Does that ratio stay relatively constant at different proficiency levels or does it change? >> Sun-Hee Lee: I don't remember the whole study, but it was the final result, so they combined all the different levels. But I think for each level it's quite similar. >>: I noticed the particle errors are very [inaudible]. >> Sun-Hee Lee: Actually, that's connected to our study as well. I'm not going to talk a lot about heritage language learning versus foreign language learning. However, we've been expecting that there would be some dramatic difference in particle errors between those two groups. What we actually found is that statistically it's not that significant, but the patterns are not quite the same. That's one thing we are investigating more. It's a very interesting question. But you can see that it's not like English preposition errors -- I believe the rate would be much lower for English prepositions, right? For Korean particles, because of their linguistic properties, the error rates are pretty high.
So I'm done with the background on Korean. Seok Bae will now talk about what we did in these two projects -- one is on the spelling errors, and then I'm going to talk about the particles. Okay. Do you want to come up? >> Seok Bae Jang: Hi. I'm Seok Bae Jang. I'll introduce our [inaudible], the research about the spelling errors. Most spelling errors of Korean learners are generated from a lack of awareness of morphological combinations. Korean language learners tend to depend on sound. Most people think that the Korean alphabet is based on sound, and that's true: if you don't know much about Korean morphology but you know the Korean alphabet, you can write down something that sounds very similar using Korean spelling. However, Korean orthography is based on morphology, not phonology. So even though you can write down a similar sound using the Korean alphabet, it is very difficult to produce the correct spelling. That's the problem. So mostly the spelling errors stem from a lack of morphological knowledge of Korean, whereas for English, phonological confusion plays the crucial role in error production. Also, spelling errors are crucial to deal with in any computational system, as you know: if you cannot solve the spelling errors, it is very difficult to go up to higher-level processing. And developing a corpus annotated with them will allow the development of a spelling checker for learner language. Now, this is our error taxonomy for spelling errors. The annotated corpus analysis can show the actual range of spelling errors of Korean learners, and it can show how each type of error is related to linguistic knowledge -- whether it is related to the learner's native language or to deficits in phonological and morphological knowledge of Korean.
We classified five categories of spelling errors: phonological, morphological, typographical, incomprehensible, and foreign word. Let's start with the phonological errors. Most phonological errors are based on an incorrect mapping between a sound and a letter. Korean sounds present big difficulties for English speakers: most Korean consonants have a three-way distinction [inaudible]. Can you [inaudible] sound? The first [inaudible] sound is the plain one, the tense one is [inaudible], and the [inaudible] sound is the third. It is very difficult, especially for English speakers, and lots of phonological errors come from those difficulties. Here, [inaudible] means "pretty." In Korean we use the tense [inaudible]; however, most English speakers write it using the plain [inaudible] sound. There is also a mismatch between the Korean vowel system and the English vowel system. This is [inaudible], this is [inaudible], this is [inaudible]. However, many English speakers [inaudible]. These types of errors are generally restricted to differences between Korean and English. Okay. In the previous slide I explained difficulties that come from the similarity of sounds within the Korean phonological system. However, even when sounds have a long-distance relationship, there is also confusion. In this case the [inaudible] sound is written here, and the correct one is [inaudible]; [inaudible] and [inaudible] are very different, yet many English speakers generate these errors. Heritage Korean speakers do not usually generate this kind of error, but non-native speakers generate a lot of them. Let's move on to the morphological errors. Morphological errors include the failure of morpheme identification.
That's the most important factor in these errors. For example, [inaudible] is "to eat," the second [inaudible] is the past tense, and the last one, [inaudible], is the final ending. However, there is a sound change between the tense [inaudible] and the second character, [inaudible]; the actual pronunciation is [inaudible]. The learner's pronunciation is correct, but the spelling -- the orthography -- is wrong. Okay. There is also the double consonant problem and overgeneralization. Then these are the foreign word errors. You can read this character, right? Yeah. New York. In Korea we have standard forms for foreign names -- place names, person names, city names. That form is "nyuyok." However, most English speakers write [inaudible]. This might be simple in comparison with the previous morphological and phonological errors -- if you can memorize all the foreign names used in Korean, it's okay -- but English speakers not familiar with Korean text have many difficulties in this case. There are also some borderline cases: is this a phonological or a morphological error? To solve this problem, we decided on morphological: even though the cause of the error comes from a phonological rule, if it surfaces as a wrong form in the surface structure, we consider it a morphological error. Okay, our approach. We did a pilot study and then moved to a larger corpus. We collected a ten-person corpus, one essay per person, from ten non-heritage students -- meaning they're not Korean-American. We calculated the inter-annotator agreement, we checked the accuracy of existing spelling checkers, and then we ran our study on bigger data. Okay. This came from our pilot sample, the ten-person corpus. We calculated the Kappa score for inter-annotator agreement. The first one is the Kappa score for error type: .83, which is fairly high. And for the correction information -- this is the correct form, this is the wrong one --
each annotator put that information in the corpus, and that Kappa score is .73. And this is feedback: this means the [inaudible] sound should be written with the "m" sound -- [inaudible] should be "m" in this example. For the feedback information we got a Kappa of .75. All these scores show high agreement between the two annotators, which is positive for our research. Now I'll show some spell-checking results. Spelling checkers for Korean do not adequately handle learner errors. Our hypothesis is that more morphological information is needed to handle learners' errors, that learners require different error diagnosis tactics, and that learner errors need feedback support. We used three spelling checkers. The first is the Korean word processor called Hangeul; its English name is HWP. It is the most popular word processor, especially in Korea. The second is HAM, an acronym for Hangeul Analysis Module; it is the most popular shareware software in Korea. The next is the very famous Microsoft Word 2007. I checked the spelling results with two different sets of data: the raw corpus, and the data with word spacing errors corrected. Mostly the spelling checkers are based on [inaudible] analysis, so word spacing really is a crucial point in measuring the accuracy. With the raw corpus, HWP shows .71 precision and around .5 recall, and the overall F-measure is .62. As you can see here, after word spacing correction -- it still has the spelling errors, but I corrected the spacing errors -- the precision goes up to .92, and the overall F-measure is .69. I think that's fairly high. HAM version 5 is a general-purpose tool: its precision is .5, and after word spacing correction it's .61; its F-measures are around .57 and .62. Here are the results from MS Word 2007. As you can see, the precision is just .3.
After word spacing correction, it goes up to .65. And the recall is very high compared with the previous results. The reason the precision is so low is that Microsoft Word 2007 generates too many false positives: even when it detected an ill-formed string, it mostly flagged spacing errors, not the actual spelling errors. That's the problem. HWP presents some feedback information -- you are wrong in the standard form, or you are wrong in the first consonant, something like that -- but Microsoft Word doesn't give any information about it. The good thing is that Microsoft Word includes some kind of grammar checking, for honorifics for example. However, the feedback includes a lot of wrong information; sometimes the form it presents is not allowed in Korean grammar. It's good, but it's not good sometimes. >> Sun-Hee Lee: Especially for a language learner, no feedback is better than wrong feedback. We were quite interested because the Korean version of Microsoft Word checks spacing errors and spelling errors at the same time, and it even flags them under the same label, so we don't know whether it's a spelling error or actually a spacing error. In many cases it finds the spacing errors quite correctly -- I was quite surprised -- but most of the time what it detects is spacing errors rather than spelling errors. So we thought that was very interesting. >> Seok Bae Jang: Okay. This is our [inaudible] data. We divided our corpus into four groups: heritage beginner, meaning Korean-American beginner; Korean-American intermediate; non-heritage (English-speaking) beginner; and English-speaking intermediate. We have a 100-person corpus; this is the [inaudible] number, and these are the percentages of each error type. Here M means morphological errors and P means phonological errors. As you can see, the morphological and phonological errors account for most of the errors.
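The two kinds of scores quoted so far -- the Kappa agreement from the pilot annotation, and the precision/recall/F-measure for the spelling checkers -- can be computed as follows. This is a generic sketch, not the authors' evaluation code:

```python
# Generic sketch of the evaluation scores quoted in the talk:
# Cohen's kappa for inter-annotator agreement, and precision /
# recall / F-measure for spelling checker detections.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label at random.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def precision_recall_f(tp, fp, fn):
    """Score detections: tp = true hits, fp = false alarms, fn = misses."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```

Note how a checker that flags spacing issues as spelling errors inflates `fp` and drags precision down while leaving recall untouched, which matches the pattern reported for MS Word 2007 above.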
The interesting thing is that H1, the heritage beginners -- Korean-American beginners -- generated the largest number of errors, and F2, the intermediate foreign (English-speaking) learners, generated the smallest number. Okay. So overall, the rate of morphological and phonological errors is very high. This is the comparison by background, meaning Korean-American or not. The difference in numbers is not so big; however, we can see something interesting, especially for heritage learners: Korean-Americans generated more phonological errors than morphological errors, whereas the non-heritage learners generated more morphological errors than phonological errors. This is a comparison by language level, beginner and intermediate. The difference is not so big, but the intermediate-level learners generated more morphological errors. I ran the HWP spelling checker on the 100-person data. This category means the error was detected correctly with a possible candidate -- there may be multiple candidates, but as long as one correct candidate is among them, I count it as having a candidate. That is about 23 percent. Detected but with no candidate is 43 percent, and 32 percent were not detected. Our goal is to increase the first number for learners. Okay, here are actual examples. These are the aspirated errors, and these are the tensed errors. For the tensed errors, the Korean-American intermediate-level learners generated a lot of errors. And this is for vowels. This distinction is actually very difficult even for native Koreans: [inaudible] is like the vowel in "apple," and [inaudible] is like the vowel in "egg." This distinction is especially difficult for Koreans, so a lot of these errors came from the heritage learners, the heritage intermediate learners.
One more interesting thing is that the non-heritage Korean learners at the intermediate level generated the smallest number of errors. Okay, this is the summary so far. With the annotated corpus of spelling errors we can improve evaluation for spelling error detection, and we can test several methods of spelling error detection to determine their effectiveness for each error type. We will try several detection methods using POS tagging, string similarity, and machine learning or statistical information. In the future we will conduct experiments to find the most effective feedback for each type of spelling error. We also need to clarify borderline-case errors and exact specific error patterns, and develop an efficient method to separate spacing errors from spelling errors. Okay. >> Sun-Hee Lee: Let me go back to the particle errors. With spelling errors we saw sort of basic-level errors generated by morphological and phonological confusion, but particle errors are more linguistic and more complex. That's the key point I'm going to talk about, because particles are connected to very different levels: morphology, syntax, and semantics. We also need to consider, when we develop the system, the target -- what task we are doing. We are targeting writing, so written [inaudible]. In writing, particles always appear. Many people think that in Korean and Japanese particles disappear, so sometimes teachers tell you, oh, don't use them, just use a noun phrase. However, if you drop many particles, you cannot actually write properly. So for our task, particle errors are the main topic.
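Of the spelling-error detection methods listed in Seok Bae's summary, string similarity is the simplest to sketch: rank lexicon entries by edit distance to the misspelled form. The lexicon below is a placeholder; a real Korean checker would first decompose syllable blocks into jamo so that single-sound confusions count as single edits:

```python
# Sketch of the string-similarity method from the summary: rank
# dictionary entries by Levenshtein distance to a misspelling.
# The lexicon is a stand-in, not a real Korean dictionary.

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute
        prev = cur
    return prev[-1]

def candidates(word, lexicon, max_dist=1):
    """Corrections within max_dist of the misspelling, closest first."""
    scored = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for d, w in scored if d <= max_dist]
```

This is also where the spacing problem bites: if the input Eojeol fuses two words, no single lexicon entry is within a small edit distance, which is one motivation for separating spacing errors from spelling errors first.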
The taxonomy we have divides particle errors into many different kinds. Omission is considered a serious error. As you may know, in [inaudible] Korean, pronouns disappear: where in English I would use "he" or "she," people drop the subject, right? For [inaudible], recovering the missing element is the major task, and with Korean particles we have the same issue: particles disappear, and we have to insert them back. How do we insert them back? We have the verb information -- basically what we need is verb argument structure information. If we developed a full list of verb argument structures, that would be very helpful, and that's one thing we are doing. But there are also other types: replacement, addition, and malformation errors. We have already developed some guiding principles with respect to particle errors, but let me skip this quickly and move on. After we did the error markup, we extracted error rules that we can use later in order to search for the errors. This is an example of the extracted rules. When an error appears, it would be great if we found a single type of error; however, errors overlap. How to deal with overlapping errors is one of the major tasks, and how to connect them together is another challenge. With the particle errors we did a similar study: we distinguished the learner groups into heritage learners and foreign learners and tried to find significant results. Unfortunately, so far the results are not statistically meaningful, but we realize that it's more of a pattern-internal analysis -- do heritage learners produce different types of errors? Okay.
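One way to hold the four error types just listed -- omission, replacement, addition, malformation -- while still allowing the overlapping errors mentioned above is standoff annotation, where each error tag carries its own text span. This is a hypothetical sketch; the field names are invented and not the project's actual scheme:

```python
# Hypothetical standoff annotation for particle errors: each error is a
# separate record with its own character span, so two overlapping errors
# on one Eojeol simply become two records over intersecting spans.
from dataclasses import dataclass

PARTICLE_ERROR_TYPES = {"omission", "replacement", "addition", "malformation"}

@dataclass
class ErrorTag:
    start: int        # character offsets into the learner text
    end: int
    error_type: str   # one of PARTICLE_ERROR_TYPES
    correction: str   # the target-hypothesis form

    def __post_init__(self):
        if self.error_type not in PARTICLE_ERROR_TYPES:
            raise ValueError(self.error_type)

def tags_at(tags, pos):
    """Every tag covering a position -- overlapping tags come back together."""
    return [t for t in tags if t.start <= pos < t.end]
```

Because no tag is embedded in another, adding a new error type, or a second annotator's competing analysis of the same span, never forces the existing markup to be restructured.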
And then the findings: some of the findings with particles hold across different learner levels, and error tagging needs to be matched with the feedback process. When we develop the whole annotation frame, we can match the annotation to the feedback system. So, for example, when we produce feedback, we have to think of the learner level too. We're not going to deliver all the complex linguistic information to beginning-level learners; they will all disappear then, you know? For beginning levels we need to provide a simple answer. However, higher-level learners are more curious about why they are making those errors. So we are trying to divide the learner levels and divide the annotation levels too. That's another task we are working on. And let me skip some of the results. For automatic processing, Markus is focusing on developing parsers. Because we don't need a really deep parser, just shallow-level parsing, he and his colleague are working on a Korean dependency parser geared towards particle errors. For the machine learning part -- I know some of you are working on this issue -- for Korean, data sparsity is the major issue, and we are at a very early stage. We have just started to work on it, but at the next level we want to obtain a broad range of data by using web corpora. One student actually wrote a paper on it over the summer. We also want to have a model of correct Korean usage, explore using small topic-specific web corpora, and extract contextual features to predict correct particle usage, which would be another task.
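Using contextual features to predict correct particle usage could be sketched like this. It is a deliberately simplified stand-in for a real classifier such as the maximum-entropy models the speakers allude to: the features (preceding noun, governing verb) and the toy data are assumptions for illustration.

```python
# Hypothetical sketch: learn, from correct text, which particle most
# often follows a noun in a given (noun, verb) context, then predict
# the particle for new contexts. A real system would use many more
# features and a proper discriminative model.
from collections import Counter, defaultdict

def train(examples):
    """examples: list of ((noun, verb), particle) pairs drawn from
    correct Korean text."""
    counts = defaultdict(Counter)
    for context, particle in examples:
        counts[context][particle] += 1
    return counts

def predict(model, context):
    """Return the most frequent particle seen in this context,
    or None if the context was never observed (data sparsity)."""
    if context in model:
        return model[context].most_common(1)[0][0]
    return None
```

The `None` branch is exactly where the data-sparsity problem bites, and why the speakers look to web corpora for broader coverage.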
And also, as Michael [inaudible] with the English prepositions, you guys use the language model for the machine learning system, and [inaudible] Tetreault and Chodorow use their ME [phonetic], [inaudible] features for developing their own system. >> Michael Gamon: We actually use [inaudible] first and then we have the results second [inaudible]. >> Sun-Hee Lee: Yeah. So since we have those examples from English, although [inaudible] is very different, we are thinking of adopting similar methods and then doing more experiments. That's the next thing we are doing. The key point is that we want to highlight that linguistic awareness plays a crucial role not only in language learning but in processing language errors, because the phonological differences and the morphological distinctions play a crucial role in error detection. It's a hard task. Also, by building an annotated learner corpus, we want to provide a platform for research that can be applied to many different fields. Some people working on [inaudible] will use it for formal analysis, and for the [inaudible] part people will use it for automatic analysis of learner language. We also realize that our annotation scheme and guidelines cannot be fully comprehensive. So we have to be very specific: we have to define the task and its properties, and then consider the effectiveness of the feedback attached to the annotation scheme. And then complex errors, including verbal endings, spacing errors, and lexical selection errors, need to be ultimately incorporated into the system. But that's the next-level issue: how to incorporate all the different annotation parts. And then future work. We're going to expand the learner corpus with annotation for particles and some problematic language constructions.
And also make the annotated corpus publicly available. We're going to publish the annotated corpus, which can be used by many people. For automatic processing, we need to integrate this annotation with other types of annotation by using a multi-layer annotation scheme, and we need to develop annotation tools that can mark particle errors but also let you add more annotations, overlapping information. That tool should be flexible enough to mark the longest [inaudible] too. So that's another thing we are working on. And then investigating an efficient learner-oriented feedback module, a learner module. But that's probably the last step we're going to head for. Okay. So that's the whole summary. And I confess that we haven't produced finished systems yet, but we have started to work on the issue, and that's the way to go. That's the purpose of today's talk. Yeah. [applause]. >> Michael Gamon: If you have questions, you should definitely make sure that the question gets answered. >>: With the spelling errors, and particularly the errors that relate to the gap between pronunciation and syllabification during pronunciation and morphology, how much of that is actually handled by the conversion -- I mean, I poked around with the Microsoft one, but not really extensively. But what about the -- what is it -- HWP? That has its own input mode [inaudible]. >> Seok Bae Jang: Yeah, the internal engines came from a research group at [inaudible] university. The professor [inaudible] investigated Korean [inaudible], so mostly their engines came from that lab. That's the reason why the score is so high. But as I noted, they're using some preexisting [inaudible] matching and also some algorithms to correct the [inaudible] distinction. It's very simple, but this is very difficult for learners. However, the [inaudible].
The [inaudible] and the [inaudible] are assigned the same key. So the errors between the [inaudible] and the [inaudible] need to be corrected by statistical models. So I think you guys easily can -- >> Sun-Hee Lee: Yeah, just a little bit of [inaudible] information would be helpful. >>: A lot of these errors seem to be the sort of thing that can be prevented earlier on. If a non-native speaker misanalyzes the -- what's the example? I forget the one. Actually, that's the vowel case. If you get a syllabification case where the final [inaudible], if it's something like [inaudible] where the k needs to go underneath, at the input stage that should be detected in theory. >> Seok Bae Jang: [inaudible]. >>: It seems that some of this depends on the [inaudible] input mode. >>: I was surprised that you had numbers like 30 percent or something of learner errors that were not caught by [inaudible]. I was surprised that any errors were not caught; generally spell checkers that we make, at least, flag everything unless we put them in the [inaudible] someplace. When you say they weren't caught, were they words that were not appropriate but happened to be spelled correctly? The intention was wrong, but it was actually a word? Or just gaps in the spell check? >> Seok Bae Jang: I'm not sure. But I'm guessing that the spelling checkers might work by focusing on the statistical models [inaudible]. >>: I don't think it's that sophisticated [laughter]. I'm surprised it's not flagging everything [inaudible]. >> Seok Bae Jang: [inaudible] in using the output, I guess the [inaudible] is very good at [inaudible] the word boundary -- actually, other boundaries. I've heard that it's not very good at locating the actual error point in a word. >> Sun-Hee Lee: I was surprised, actually; whatever dictionary they use, like for proper names, HWP has a much better dictionary.
Because most of those errors -- like with the proper names, they're all flagged as spelling errors, and it presents the candidates. >>: I mean, I'm pretty sure that Microsoft's Korean speller is a whole bunch of hand-coded rules for spacing errors. But I thought it also did [inaudible]. >>: I think the [inaudible]. >>: The same person did both of them. They're both related, sort of. >> Seok Bae Jang: Yeah, maybe. I think that Microsoft wasn't focusing on the native speakers' writing [inaudible]. >>: The sizes of the samples are pretty nice, but it seems like there would be huge opportunities if some of these tools were available on the web, where you could be getting numbers in the tens of thousands instead of in the tens or hundreds. Is anyone doing research where they basically create tools that learners can use on the web and get data sets of a new size to learn from? >> Seok Bae Jang: Actually, there's a huge corpus from the Korean government, which is called the [inaudible] corpus project. They have over one hundred million, and we can use it. However -- >>: A hundred million what? >> Seok Bae Jang: A hundred million corpus. >>: They've got a learner's corpus -- >> Seok Bae Jang: Yeah, it is like a learner's corpus, yeah. But correcting the [inaudible] is difficult, especially for the -- as you know, it is related to the [inaudible] approval, and sometimes the beginning-level learners just generate [inaudible] at the 100 level, 200 level, so we are focusing on making a balance between the groups. As far as I know, the [inaudible] beginner and [inaudible] beginner and intermediate and [inaudible] intermediate -- >> Sun-Hee Lee: Are you talking about a learner corpus available on the web and just capturing the data? >>: I'm talking about -- I mean, it seems like there are language-learning tools online, and any time someone is using them, they're making errors.
And they would have data on tens of thousands of users rather than on tens of users, and it seems like that would be a very rich data set, and I'm wondering whether anyone is actually making use of that. >> Seok Bae Jang: [inaudible] they make the most errors, yeah. >> Sun-Hee Lee: He's saying that if the tool is available, they produce [inaudible] of data. But practically it's hard to get access to the engine and get permission to collect all that data. It's possible, but, you know, we actually started to think about this, because for our tasks we thought that learner information is very important, and if we base the work on web-based exercises, we cannot keep a correct record of the learner information. So for the research proposal we started the collection by hand-collecting the data. We actually do very sophisticated background research. We developed a language questionnaire and background check: how many years they studied, what kind of language their parents used, and how they communicate when they talk back to their parents -- sometimes they answer in English. That's so that later, when we develop the system, we can get onto the right track rather than collecting all the learner data without knowing whose data it is. So that's one of the concerns we've considered. But because of the data sparsity problem, the student who had the idea of creating a web corpus actually used some tools to collect massive data. So that's another way of doing it too. But in the end Markus has a better idea about [inaudible]; he and his students will do the research in that direction. We are more linguistics-based, doing annotation, groundwork-based research right now.
>>: And whatever is the predominant search engine in Korea -- I believe, if I'm not mistaken, it's a Korean search engine. >> Seok Bae Jang: Right. >>: Do you know what they do in terms of spelling correction for search? >> Seok Bae Jang: Mostly, for information [inaudible], they use HAM, yeah. >>: And it's very fast. >> Seok Bae Jang: There are around five or six groups doing research on the [inaudible] analyzer who have never used the HAM engine before, and in this case they [inaudible]. >>: They use tools like that? >> Seok Bae Jang: Yeah. >> Michael Gamon: Thank you. [applause]