>> Hoifung Poon: Hi. It's my great pleasure to welcome Taylor Berg-Kirkpatrick to visit us from Berkeley. Taylor has done a lot of interesting work, ranging over diverse topics such as word alignment, summarization, and more recently recognizing text from images. And a number of us have been using things like, for example, the [inaudible] EM that he has developed. So I'm very excited to have him here. And without further ado, here's Taylor. >> Taylor Berg-Kirkpatrick: Thanks. Hi. So this is joint work with Greg Durrett and Dan Klein, my advisor. And this talk is going to be about transcribing images of historical documents into text. So let me first just give an example of a historical document. I'll be mainly talking about documents from the printing press era. So here's an image of a document that was printed in 1775. And it's part of the proceedings of the Old Bailey Courthouse in London. Let me read just a little bit of it: "Jacob Lazarus and wife, the prisoner, were both together when I received them. I sold eleven pair of them for three guineas, and delivered the remainder back to the prisoner." So researchers are interested in being able to do analysis on collections of documents that look a lot like this one. But in order to do many types of analysis, you first have to take these images and turn them into text. And so that's what this talk is about. So let's see what happens, though, when we take this particular historical document image and plug it into a modern, off-the-shelf OCR system. I'll use Google Tesseract as an example. So here's the transcription you get out of Google Tesseract. I've marked the errors in the transcription in red. And you can see there are a lot of errors. In fact, more than half of the words here are wrong. That's a pretty terrible error rate. And that's weird because Google Tesseract is really good on modern documents. Most OCR systems are really good on modern documents. So, next, to help explain why existing OCR systems perform so badly on historical documents, let me just quickly cartoon how a typical OCR system might work. So here's an image of some digitally typeset modern text; it reads the word modern. So the first thing an OCR system will often do is try to find a baseline for the text in the image. Once it's found the baseline, it will try to segment the pixels in the image into regions that correspond to individual characters. Tesseract actually does that by identifying connected components. Okay. So once the image is segmented into characters, the system will then try to recognize each character segment individually. And that usually happens using some kind of supervised classifier that's trained on a single font or maybe a collection of fonts. So this kind of pipelined approach works really well on modern documents that are digitally typeset, where the white space is really regular and where the font is known. But we've seen that that kind of approach fails really badly on documents like this one. So now let me tell you the three primary aspects of historical documents that make them really hard for existing OCR. The first problem is that the fonts in these documents are often unknown. So here's a rendering of the word positive in a historical document. We can try to line up a modern serif font with the first two characters. For this particular modern serif font, it doesn't quite line up because the descender on the P isn't quite long enough. We'll try a couple more modern serif fonts. 
In fact, no modern font is going to perfectly line up with this image, because the font used in the image is an ancestor of modern fonts. An extreme case of this problem of unknown fonts is the use of the long S glyph, which you can see here. We don't use this glyph anymore, so we don't even have a representation for it in modern fonts. Okay. A second problem that occurs in a lot of these printing press era documents has to do with the baseline of the text. So this example reads the death of the deceased. And you can see that the baseline of the text kind of wanders up and down as you go across the line. The glyphs kind of hop up and down. And this actually has to do with the way that these documents were printed using a printing press. Humans take character templates and align them against a mechanical baseline, and there's some slop in that mechanism. And as a result you get this kind of noise in the baseline. Okay. A third problem has to do with inking levels and the fact that they vary across a lot of these documents. So this example reads rode along in silence. And on the left of the example, the word rode is underinked. It's so underinked that the glyph for the character D has actually divided into two different unconnected components. On the right there's a large degree of overinking and bleeding ink, so much so that many of the glyphs have now come together into one big connected component. And this, again, is a result of how the document was printed. Maybe the ink was applied to the roller bar unevenly, or maybe pressure was applied unevenly when the document was printed. So here are portions of four different historical documents, one each from 1725, 1823, 1875, and 1883. And you can see that the three problems that I mentioned are present in these four documents. And, in fact, if you look, those three problems are present in pretty much all documents from the printing press era. So our approach is going to explicitly deal with each of those issues in turn. The way we deal with the problem of unknown fonts is that we're going to learn the font in an unsupervised fashion from the document itself. We're going to deal with the noise that results from the printing process by jointly segmenting and recognizing, and doing that in a generative model that explicitly models the wandering baseline and also explicitly models the varying levels of ink across the document. So now I'll give you a high-level depiction of the generative model that we use. It's actually going to generate all the way down to the level of pixels. So the first thing we do in the model is generate some text, character by character, from a language model. And now, conditioned on that text, we're going to generate a bunch of typesetting information. And this is going to look like a bunch of bounding boxes that tell you how the glyphs are going to be laid out. So specifically, the first thing we do for the character P is generate a left padding box. This box is going to house the horizontal white space that you see before you actually get to the glyph for the character P. Then we'll generate a glyph bounding box for P, which will house the glyph for P and the vertical white space above and below it. And then the right padding box. We'll do the same for R and the rest of the characters. So now we have a kind of layout. We're also going to generate for each of these glyph boxes a vertical offset for the glyph that goes in that box. 
And so this will actually specify the baseline of the text and allow it to wander as it goes across the document. And then finally we'll generate an inking level, an overall level of ink, in each of these glyph boxes. And that similarly will allow us to model the wandering levels of ink across the document. >>: Yes. >> Taylor Berg-Kirkpatrick: Yeah. >>: [inaudible]? >> Taylor Berg-Kirkpatrick: So we're actually going to be learning the glyph shapes. And so once you've specified the baseline and you know the shape of the glyph in terms of its pixel layout, that then defines the top of the glyph as well. So that's enough to specify it. That will become more clear as I dive into increasing levels of detail. Okay. So then once we have all that typesetting information and the text, conditioned on all that, we're going to actually start filling in pixels. And this will happen in the rendering model. So we're actually laying out the pixels for the glyphs. And then here we've generated an image of the text. So I'm going to name some of the random variables involved. I'll call the text E, and that's going to come from P of E, the language model. And the typesetting information I'll call T, and that comes from the typesetting model, P of T given E. And then finally the image itself, the actual grid of pixels, I'll call X. And that will come from the rendering model, P of X given E and T. And so the only observed random variable in our model is going to be X, the actual image itself. Both E and T are going to be latent random variables. And during learning we'll marginalize them out, and when it comes time to do transcription we'll fill them in with inference. Okay. So now I'm going to tell you more details about each of these three components in turn. Let me just say that probably the first thing that popped into your head about what each of these components might look like is basically correct. With one exception: there's going to be a little extra level of complexity in the rendering model, and I'll tell you about it when we get there. All right. So first let's talk about the language model. We're going to have a random variable E sub-I corresponding to each character. And these random variables will be distributed according to a Kneser-Ney-smoothed character 6-gram model that we'll train on a large corpus of external text data. So this is going to resolve a lot of the ambiguity when we're doing this in an unsupervised way, the fact that we have a strong language model trained on a bunch of data. Okay. So now let's talk about typesetting. Yeah. >>: [inaudible]. >> Taylor Berg-Kirkpatrick: So we played with a couple different language models. You'll see them in a second. One is out-of-domain. It comes from Gigaword. And the other is in-domain for some of the historical corpora that we looked at. >>: [inaudible]. >> Taylor Berg-Kirkpatrick: So the in-domain one is from that period, and it was manually transcribed. That was a big curatorial effort that actually transcribed a bunch of these documents manually. And so we can use that to sort of see what the penalty is from using out-of-domain data. Okay. So I'll talk about the typesetting model. And we'll look at the typesetting process for token index I. It works the same way for each token. So the character random variable at index I is E sub-I, as we've seen. And let's say that that's an A. 
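Written out, the factorization just described looks like this (a restatement in the talk's own notation, with the font parameters living inside the typesetting and rendering terms):

$$
P(E, T, X) \;=\; \underbrace{P(E)}_{\text{language model}} \; \underbrace{P(T \mid E)}_{\text{typesetting}} \; \underbrace{P(X \mid E, T)}_{\text{rendering}},
\qquad
(\hat{E}, \hat{T}) \;=\; \operatorname*{arg\,max}_{E,\,T} \, P(E, T \mid X).
$$

Only $X$ is observed, so learning maximizes the marginal likelihood $\sum_{E,T} P(E)\,P(T \mid E)\,P(X \mid E, T)$ with respect to the font parameters, and transcription fills in $E$ and $T$ with the maximum-likelihood completion on the right.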
So the first thing we're going to do in the typesetting process is generate a random variable L sub-I, which specifies the width of the left padding box. And L sub-I is going to come from a multinomial. Then we'll generate a random variable G sub-I, which is the width of the glyph box, and that's, again, going to come from a multinomial. And then finally we'll generate the width of the right padding box, again from a multinomial. And remember: the left and right padding boxes house horizontal white space, and the glyph box houses the glyph and the vertical white space. These three multinomials are specific to the character type A. They govern the widths of tokens of type A and the distribution of horizontal white space around tokens of type A. And that latter thing can capture effects like kerning. We're going to have similar multinomials for every character in the alphabet, and together all these multinomials are part of our parameterization of the font that we're going to learn in an unsupervised fashion. >>: [inaudible] what are the real parameters you use? >> Taylor Berg-Kirkpatrick: Oh, these pictures? >>: [inaudible] glyph box, the glyph box width? >> Taylor Berg-Kirkpatrick: The glyph box width, yes. >>: Yes. >> Taylor Berg-Kirkpatrick: What? >>: Are you really using [inaudible]? >> Taylor Berg-Kirkpatrick: Oh. Here it depends on the height that we choose. So we downsample the resolution of the images so that each line is 30 pixels high. And then I believe that we cap the glyph box width at 30. It's the same as the height. So, yeah, I think that's correct. That's actually what we use in the model, 1 to 30. That's in terms of pixels. Okay. So we're also going to generate the vertical offset, as I mentioned before, and that we'll call V sub-I. And that's going to come from a global multinomial. And then finally we'll generate the inking level for this glyph box, D sub-I, again from a global multinomial. Okay. So now we're going to try to fill in the pixels for this particular glyph box token. So far we've generated the width of the box, the vertical offset of the glyph that's going in the box, and the inking level for printing that glyph in the box. And we're trying to fill in the grid of pixels, which is X. So there has to be something in our model that tells us the shape of the glyph for the character type A, because we're actually trying to print an A here, since we're saying that this is an A. That part of our model is going to be a grid of glyph shape parameters. And here I'm showing you a particular setting of the parameters that we learned in our model on some document. These glyph shape parameters are going to combine with the current glyph box width, the current vertical offset, and the current inking level to produce an intermediate grid of Bernoulli pixel probabilities that has the same dimensions as the particular glyph box token that we're trying to fill in. And once we've done that, we can just sample the pixel values in the glyph box from the corresponding Bernoullis. So the special thing about this approach is that our parameterization of glyph shape is actually independent of the particular width, offset, and inking level for the token we're trying to generate. 
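As a minimal sketch of the typesetting step just described -- character-specific multinomials over the three box widths, plus global multinomials over the vertical offset and inking level -- here is how one token's layout variables could be sampled. All of the names, ranges, and uniform probabilities below are placeholder assumptions for illustration; only the structure comes from the talk, and in the real model these multinomials are learned font parameters rather than fixed uniforms.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multinomial(probs):
    """Draw one index from a multinomial over 0..len(probs)-1."""
    return int(rng.choice(len(probs), p=probs))

# Hypothetical font parameters for the character type 'a'. Each character
# type gets its own three multinomials over box widths (in pixels).
font_params = {
    'a': {
        'left_pad':  np.full(5, 1 / 5),    # left padding widths 0..4
        'glyph':     np.full(30, 1 / 30),  # glyph box widths 1..30
        'right_pad': np.full(5, 1 / 5),    # right padding widths 0..4
    },
}

# Global multinomials shared by every character type.
vertical_offset_probs = np.full(11, 1 / 11)  # offsets -5..+5 pixels
inking_level_probs = np.full(5, 1 / 5)       # 5 discrete inking levels

def sample_typesetting(char):
    """Sample the layout variables (L_i, G_i, R_i, V_i, D_i) for one token."""
    p = font_params[char]
    return {
        'left_pad_width':  sample_multinomial(p['left_pad']),
        'glyph_width':     sample_multinomial(p['glyph']) + 1,
        'right_pad_width': sample_multinomial(p['right_pad']),
        'vertical_offset': sample_multinomial(vertical_offset_probs) - 5,
        'inking_level':    sample_multinomial(inking_level_probs),
    }

print(sample_typesetting('a'))
```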
And because of that independence, we can fix the glyph shape parameters in the model and then vary the glyph box width to produce tokens with different widths, we can change the vertical offset to generate tokens with different baselines, and we can change the inking level to generate tokens with different levels of ink, all while keeping the shape of the glyph the same. And this is important because it means that when we're doing learning, if we encounter an instance of an underinked A in the document and then later encounter an instance of an overinked A, both of those sources of information can influence our notion of what the shape of an A glyph should look like. We don't have to have separate grids of parameters for underinked As and overinked As. And it turned out that this kind of shared parameterization was really important to getting unsupervised learning to work in this task, because, after all, we're going to try to learn these glyph shape parameters in an unsupervised fashion. Okay. So here's where I'm going to go into a little extra level of detail. The key to getting this parameterization to work is how you define the function that maps the glyph shape parameters, along with the width, offset, and inking level, down to the intermediate grid. So I'll tell you about that process for a particular row of the glyph shape parameters, and how it generates the corresponding row of the Bernoulli pixel probabilities. I'll call the glyph shape parameter row phi and the Bernoulli pixel probability row theta. And first I'll tell you how we generate the leftmost Bernoulli pixel probability in the row. In order to generate the leftmost one, we interpolate the glyph shape parameter row phi with a vector of interpolation weights alpha that's shaped like a Gaussian centered over to the left of the row of glyph shape parameters. Once we've done that, we just apply the logistic function, and that gives us the actual Bernoulli pixel probability. Now, to generate the remaining Bernoullis in this row, as we move along to the right, we just interpolate with different vectors alpha that are still shaped like Gaussians, but are now centered proportionally along the row. So this means that the Bernoulli pixel probability at position J is generated by interpolating with a specific vector alpha sub-J. And now our parameterization looks like this: theta sub-J is the logistic of the inner product of alpha sub-J and phi. Now, it turns out that we can get the various effects that I talked about earlier -- varying the width of tokens, varying the vertical offset, and varying the inking level -- by having these interpolation weights actually depend on the random variables G sub-I, V sub-I, and D sub-I. All right. There's some more detail to be talked about there that I'm not going to go into right now. What I do want to say, though, before we move on to learning, is that you can actually view these interpolation weights like fixed feature functions in a locally normalized log-linear model. And that means that when it comes time to learn the glyph shape parameters phi, we can actually use out-of-the-box unsupervised learning methods. So that's a nice property of the parameterization here. >>: So can I [inaudible] you picked [inaudible] glyph shape and then just rescale them to what [inaudible]? >> Taylor Berg-Kirkpatrick: Yeah. So that's the effect that G sub-I would have. 
So for position J, you'd have a specific set of alphas that would be used to map the full-width glyph shape parameters down to a glyph box, depending on G. So suppose you wanted to map down to a glyph box with width 10: you'd have a specific set of interpolation weights that would do that, and G would pick that out. But you define those ahead of time. Similarly, V sub-I is just going to pick which row you're pulling from. The trickiest thing here is D sub-I. So it turns out that, because you're pushing through a nonlinearity, by rescaling everything, changing the interpolation weights, and adding a bias, you can change things like contrast and darkness levels. So for each of the discrete settings of these random variables, you specify ahead of time the kinds of interpolation weights that you'll use, and then those can be treated like feature functions. But they can actually produce all these effects that we want using this single parameterization of shape, phi. If that makes sense. >>: So [inaudible]. >> Taylor Berg-Kirkpatrick: Yeah. Yeah. >>: So another question is that in your [inaudible] your baseline, wandering, and also your [inaudible] in your generative model they are independent. >> Taylor Berg-Kirkpatrick: They are, yeah. >>: Okay. >> Taylor Berg-Kirkpatrick: There's reason to believe they shouldn't be, right? >>: Yeah, because ->> Taylor Berg-Kirkpatrick: Especially the baseline, which -- yeah. >>: Also inking you can also ->> Taylor Berg-Kirkpatrick: These would all be kind of slowly varying. And even -- I'll show you an example later where, because of three-dimensional warping as you get near the bindings of books, there are actually slowly varying changes in the widths of the glyphs as well. So we don't explicitly model any of that and it works really well, but we thought about ways to extend it. We've seen a couple of examples, a couple of errors, where we think it might actually lead to improvements. But there aren't a lot of errors that we think it would actually fix. Yeah. But, I mean, it's an interesting question -- we can talk about it later -- how you might do it. Because to just do it naively, you end up adding a lot of states to the semi-Markov model, and it can get really slow, and this is already pretty slow. But you could think about using some kind of mean field approximation. Because, you know, the offsets are basically independent of what's going on with the language model. And so it might be an appropriate approximation. Okay. So I'll actually talk about learning. So we use EM to learn the font parameters. And that, again, means learning both the multinomials governing the layout and the glyph shape parameters specifying glyph shape. And we initialize EM using font parameters that come from mixtures of modern fonts. So we just took all the fonts that come with the Ubuntu distribution and mixed them down. And in order to get the expectations that we need for EM, we run the semi-Markov dynamic program, because the model itself is actually an instance of a hidden semi-Markov model. And then ->>: So the mixture, is it just uniform weight or ->> Taylor Berg-Kirkpatrick: Yeah, uniform -- the initialization mixture? Yeah. So there were -- by hand we identified some weird ones, like Comic Sans. There's like a bunch -- like Zapf Dingbats or whatever. And we just said those don't count. But the ones actually with like text, we just uniformly put them together. 
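To make the interpolation just described concrete, here is a minimal sketch of rendering one row of glyph shape parameters at an arbitrary width: Gaussian-shaped weights centered proportionally along the row, an inner product with phi, and a logistic squash, with a scale and bias standing in for the inking level's effect. The Gaussian bandwidth, the exact way inking enters, and all the numbers are illustrative assumptions, not the paper's actual feature functions.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def render_row(phi, out_width, ink_scale=1.0, ink_bias=0.0, bandwidth=1.0):
    """Map one row of glyph shape parameters phi (length W) to a row of
    Bernoulli pixel probabilities theta of length out_width.

    theta[j] = logistic(ink_scale * <alpha_j, phi> + ink_bias), where alpha_j
    is a normalized Gaussian-shaped weight vector centered proportionally
    along the parameter row. Changing out_width stretches or squeezes the same
    shape; ink_scale and ink_bias mimic contrast and darkness changes. (The
    vertical offset would analogously select which parameter rows map to which
    box rows; that part is not shown.)"""
    W = len(phi)
    positions = np.arange(W)
    centers = np.linspace(0.0, W - 1.0, out_width)  # proportional centers
    theta = np.empty(out_width)
    for j, c in enumerate(centers):
        alpha = np.exp(-0.5 * ((positions - c) / bandwidth) ** 2)
        alpha /= alpha.sum()
        theta[j] = logistic(ink_scale * alpha.dot(phi) + ink_bias)
    return theta

# The same parameter row rendered at two widths and a heavier inking level.
phi = np.array([-4.0, -4.0, 3.0, 3.0, 3.0, -4.0, -4.0, 3.0, 3.0, -4.0])
print(np.round(render_row(phi, out_width=10), 2))                # native width
print(np.round(render_row(phi, out_width=6), 2))                 # narrower glyph
print(np.round(render_row(phi, out_width=10, ink_bias=1.5), 2))  # darker ink
```

The point of the sketch is the sharing: one phi serves every width, offset, and inking level, which is what lets underinked and overinked tokens of the same character pool their evidence during EM.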
And so to make that dynamic program fast, we took a coarse-to-fine approach to inference. Okay. So now you know basically what the model looks like and you know how we do learning and inference. Before I show you experiments, I just want to show you two examples of the system in action. So here is a snippet of a document. It reads how the murderers came to. If we train the model using EM on the entire page that this snippet came from, we learn some font parameters. And now with maximum likelihood inference, we can fill in the text random variables and get a transcription. So this is the actual output of our model: how the murderers came to. We happened to get this one right. We can also do maximum likelihood inference on the typesetting random variables and use that to peek into what the model is actually doing here. So here is a representation of the model's prediction about the typesetting random variables. The blue here corresponds to padding boxes and spaces. And the white boxes are the glyph boxes. So you can kind of see the layout. The glyphs that actually appear inside these white boxes are the Bernoulli pixel probabilities that the model chose to use to actually generate the pixels that it saw. Yeah. >>: So do you model space? >> Taylor Berg-Kirkpatrick: So the way we model space is we say it's a special type of glyph that has fixed parameters that are just white, with some small probability of black for noisiness. Yeah. So we don't try to learn that. But this means then that spaces generate a glyph box which is blank and then two padding boxes which are blank. So there's kind of an unidentifiability problem there, but it doesn't really cause problems. >>: How do you determine the width of a glyph box where the ink has been -- where it is overinked? For example, [inaudible], for example. >> Taylor Berg-Kirkpatrick: Sorry, say again? >>: So [inaudible] would be a wide glyph box [inaudible] would be two glyph boxes. But if you were given a ->> Taylor Berg-Kirkpatrick: So how would you distinguish between those ->>: How do you distinguish between -- if you're setting up a glyph box, how do you distinguish between the case where ->> Taylor Berg-Kirkpatrick: Through the language model. >>: The language model. >> Taylor Berg-Kirkpatrick: The language model would tell you. Hopefully. Yeah. I mean, if -- yeah. And some of the errors we get are because the language model was ambivalent about stuff like that. And you will see cases where the -- yeah. >>: [inaudible] confused. Can you tell me again your testing procedure? You used the document and then you run through your model? >> Taylor Berg-Kirkpatrick: Yeah. >>: And then you get the global ink multinomial parameters, the global baseline from this? >> Taylor Berg-Kirkpatrick: Right. >>: And then so basically you do inference on a document. >> Taylor Berg-Kirkpatrick: So we run EM to get the ->>: So the output ->> Taylor Berg-Kirkpatrick: -- the parameters. And then we do maximum likelihood inference to transcribe. >>: I see. >> Taylor Berg-Kirkpatrick: Yeah. There's no supervision, and we're training and testing on the same document because it's totally unsupervised. Because we have to learn the font in that document before we can transcribe. >>: And one page is enough to learn, or is it like ->> Taylor Berg-Kirkpatrick: One page turns out to be enough, yeah. So ideally we'd be able to do this on whole books and actually update the language model parameters as well. But right now it's a little slow. 
But we're working to make it faster. So I want you to see in this example, in the [inaudible] typesetting, that we're actually capturing the wandering baseline. The model is using the vertical offset random variables to do that, which is good. The model is also using the inking random variables to capture the overinked H on the left and the underinked E in the word the. So it's cool to see that the model is kind of doing what we intended it to do. I want to show you another example. This snippet reads taken ill and taken away -- I remember. And again we use EM to learn parameters from the whole page that this snippet came from. And we get a transcription, and it turns out again we were right in this case. And then here is the predicted typesetting. And here I want you to see that, again, we're modeling the wandering baseline. This is something that I mentioned earlier: the wandering baseline in the input here isn't a result of the printing press and how it worked; it's actually a result of three-dimensional warping as you get near the binding of a book. And so we can capture that. But what's more cool is that because the page begins to face slightly away from the camera as it recedes in three-dimensional space, those characters are actually thinner. And because we can generate glyphs of different widths, we capture that here with the thinner B, E, and R. So it's kind of cool that we can capture this effect, because these effects are kind of common in a lot of these historical documents. >>: So just to make sure, so the baseline can actually be not horizontal? >> Taylor Berg-Kirkpatrick: No, so we -- we don't model the fact that they're slanted. We're just -- it's an approximation. We're saying, yes, they're kind of moving down and getting thinner as they ->>: [inaudible] baseline [inaudible]. >> Taylor Berg-Kirkpatrick: Yeah. The baseline doesn't have an angle. It just has -- it's just an offset. >>: Because the red dot you show in a previous slide ->> Taylor Berg-Kirkpatrick: Yeah, that was just -- yeah, I'm just laser pointing with the lines. Okay. So now let me talk about actual real experiments. I'll show you results on two different test sets. The first was taken from this Old Bailey corpus that I talked about earlier, from the Old Bailey Courthouse in London. Here we chose 20 images at random, distributed across years, each consisting of 30 lines. And that formed our test set. And we manually transcribed these to get the gold data. The second test set comes from a corpus called Trove. This is a collection of historical newspapers from Australia that's curated by the National Library of Australia. Here we took ten images at random, again each consisting of 30 lines. We compare against two different baselines. The first is Google Tesseract, which I mentioned earlier. The second is a state-of-the-art commercial system called ABBYY FineReader. And this is actually the system that was chosen by the National Library of Australia to automatically transcribe the documents they had in Trove. I'll show you experiments with two different language models. The first is out-of-domain, as I mentioned. This we trained on 34 million words from the New York Times portion of Gigaword. And the second language model is in-domain for our dataset, and this came from manually transcribed text from the Old Bailey corpus, 32 million words. >>: Is that the reason you restricted it to just the New York Times portion? >> Taylor Berg-Kirkpatrick: Yeah. 
So it's a character 6-gram model, and it seemed like more data wasn't helping and it was making bringing stuff in slower. But yeah. Okay. So I'll show you results in terms of word error rate on the first test set, which was the Old Bailey corpus. So here bigger bars are worse. So Google Tesseract on this test set gets a word error rate of 54.8. So it's pretty bad. It's kind of like we saw on that small example. This is getting more than half of words wrong. ABBYY FineReader does a bit better. It gets a word error rate of 40 on this test set. And our system, with the out-of-domain language model, gets a word error rate of 28.1 and with the in-domain language model gets a word error rate of 24.1. So that's a pretty big reduction in error. That's actually more than a 50 percent error reduction compared to Google Tesseract on this test set. Here's the second test set, Trove. Here Google Tesseract is doing even worse. This is kind of a harder test set in a way. It's getting a word error rate of 59.3. Now ABBYY FineReader is getting almost half of words wrong, 49.2. And our system, with the out-of-domain language model is getting a word error rate of 33. And so that's not quite a 50 percent error reduction compared to Google Tesseract, but it's kind of close. Okay. So I just want to go back to the original example and ->>: [inaudible] show the result using the [inaudible]? >> Taylor Berg-Kirkpatrick: We never ran it. We could have. We figured that it would have been more in-domain than NYT was for Trove, but not as in-domain as it was for Old Bailey because it was actually for the Old Bailey corpus. >>: Would it make sense to use character error rate here? >> Taylor Berg-Kirkpatrick: Yeah, we have those numbers too in the paper. We went back and forth about what to present. The character error rates -- character error rates are much lower, and that's what people usually present. But they're kind of misleading because once you get a -- you know, a couple characters wrong in a word, it makes it hard to search for, makes it hard to read. I don't know. I think it's better to see word error rates. But I think the character error rates are in the 10s, something like this. I don't totally remember. >>: [inaudible] people don't use [inaudible] language models, like you're using a 6-layer [inaudible]? >> Taylor Berg-Kirkpatrick: 6-gram, yeah. >>: For a letter, right? >> Taylor Berg-Kirkpatrick: Yeah, a character. Yeah. >>: Why don't people use words? >> Taylor Berg-Kirkpatrick: So we did a lot of experiments, actually, after we published the paper on trying to boost the language model because we thought, like you did, you know, less ambiguity is better. It's actually hard. The problem is that your state space in the semi-Markov model gets huge. And so you have to do different approximations. We started doing beaming. And the beaming kind of interacts badly with models that have these kind of two levels of generating words and then characters we found. And even when we waited long enough to get more exact results, we saw that it wasn't giving a big improvement for the out-of-domain language model. Even for the in-domain language model it wasn't giving huge improvements. Which is a little -- we're not totally sure. We expected it to give bigger improvements, but it didn't. So here's that original example again. So just to give you an idea of how legibility gets better. This is Google Tesseract with more than half of words wrong. And this is our system with fewer than a third of words wrong. 
This is our output on the document. And you can see that it's easier to read. Okay. Next I just want to show you some examples of the fonts that we actually learn. So first, here is the initializer for the glyph shape parameters that we use for the character G. And you can see that it is a mixture of a bunch of modern fonts. There are two different types of dominant Gs visible here. The first is the type of G that has the descender starting on the left. And the second is the type of G that has the descender starting on the right. So here are the final glyph shape parameters that we learned after running EM on various documents from across different years. And you can see that in all of these historical documents the type of G with the descender on the left is used. And we were able to learn that. We were also able to learn more fine-grained distinctions in glyph shape as glyph shape varies from document to document and across years. There's one more cool thing that we can do with our model that I want to talk about, and this has to do with the fact that we're actually generating all the way down to the level of pixels. We can take pixels in the input that are obscured, whether as a result of overinking, blotched ink, or even tearing in the page, and treat them as unobserved random variables in the model. Then during learning we can marginalize them out, and during transcription inference we can actually make predictions about their values. So here is a portion of a document that we found that had severe bleeding ink from the facing page. You can see that here: these big capital letters got superimposed. We went and manually annotated that bleeding ink in red, and that's why it's red here. We told the model during learning to treat all the red pixels as unobserved and during inference to try to fill them in. And so here's the predicted typesetting that we get on this portion. You can see that we correctly identified the word mouth, which is cool, and we made reasonable predictions about what pixels should go there. We also correctly identified the missing D that was totally obscured in the word found, and again made a reasonable prediction of what the D shape that goes here should be. Because, after all, we know what Ds look like in this document, because we learned the font on this document. So that's cool. And there are future applications that you could think of doing, integrating the model with different kinds of reconstruction techniques that we're looking at. Okay. So now I'll conclude. We've seen that unsupervised font learning can yield state-of-the-art results on documents where the font is unknown. We've also seen that generatively modeling the sources of noise that are specific to printing press era documents was really effective. And finally, we're trying to make this into a tool that historians and researchers can actually use, and we're working on that now. The biggest problem is that it's slow right now. We're trying to make it fast. And thanks. [applause]. >> Taylor Berg-Kirkpatrick: Yeah. >>: So one observation a few slides back when you showed, you know, a comparison of the Google recognition and your recognition ->> Taylor Berg-Kirkpatrick: This one? >>: Yeah. And so when I look at it, it looks like especially names ->> Taylor Berg-Kirkpatrick: Yes. So this is why we're thinking of adapting the language model. So these are like court transcripts. It's like a script. 
The names occur over and over again. And we'll transcribe the same name completely differently every time we see it. I do want to -- like I think that this is funny. If you see what Google Tesseract did, foolmarfs. I just wanted to point that out. I think it's funny that it said that. But, yeah, so some kind of constraint -- we thought a little bit about this so that you don't transcribe the same name differently each time. If you added that constraint, that might improve things. Or if you simply try to adapt the language model. >>: Yeah, that was, I mean, maybe a naive thought, but I was thinking if you had a separate character n-gram model for capitalized ->> Taylor Berg-Kirkpatrick: That's a good idea, yeah. >>: Because, I mean, the names are often -- you know, they may not be English, they may be ->> Taylor Berg-Kirkpatrick: Yeah. >>: -- coming in from other ->> Taylor Berg-Kirkpatrick: Right. >>: There's a lot of character n-gram capabilities that are probably very different in names than they are in English in general. >> Taylor Berg-Kirkpatrick: That makes a lot of sense. And sometimes, especially if the name isn't severely degraded, you'd be better off like ignoring the language model altogether maybe. Like, for example, you can see here, this looks like the language model likes it, but it probably doesn't -- the pixels don't like it. So it's a good idea. >>: I know this is probably a [inaudible] question, but did you try your model on the modern documents? >> Taylor Berg-Kirkpatrick: Yeah, we did. And just a little bit. And it does -- you know, the character error rates on modern documents in our system and other systems are tiny. It's like, you know, less than 2 percent character error rate, 1 percent character error rate. And so it seemed like ours was about a good as Google Tesseract. The thing is ours is so much slower than Google Tesseract that you should just run Google Tesseract if you have a modern document. >>: So how slow it is right now? >> Taylor Berg-Kirkpatrick: Right now it's about 10 minutes per page. Maybe 20 minutes. And that's on a pretty fast desktop. It turns out actually the biggest bottleneck isn't the dynamic program, which we've made really fast now. The bottleneck is actually the computation of all the emission probabilities for all the different spans. Because there's actually many different types of templates you could actually use to print things. You have all different widths, different offsets, different inking levels. They're parameterized in a really simple way, but you actually have to compute how each of those possible templates would fit with the input. And so that's a big computation. You can write it as something that looks basically like a matrix outer product. And so just doing that in our code made it faster [inaudible] cache locality and stuff. But we're also looking at putting it on a GPU because it's sort of ideally suited for a GPU. But that said, we're worried that historians or humanities researchers aren't going to have like some desktop, some gaming desktop with like 3 GPUs in it. So we're also looking at approximations that you could do to speed it up on the CPU. 
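To unpack the "matrix outer product" remark above: the sum of per-pixel Bernoulli log-likelihoods of every candidate window under every template factors into two dense matrix products. The sketch below is a generic illustration of that identity, not the system's actual code; the shapes and names are made up.

```python
import numpy as np

def emission_log_probs(windows, templates, eps=1e-6):
    """Bernoulli log-likelihood of every image window under every template.

    windows:   (N, P) array of binary pixels, one flattened window per row.
    templates: (K, P) array of Bernoulli pixel probabilities, one per
               candidate template (e.g. one per character/width/offset/inking
               combination). Returns an (N, K) matrix of log P(window | template).

    The per-pixel sum x*log(theta) + (1-x)*log(1-theta) splits into two dense
    matrix products, which is why the computation maps well onto BLAS or a GPU.
    """
    t = np.clip(templates, eps, 1.0 - eps)
    return windows @ np.log(t).T + (1.0 - windows) @ np.log(1.0 - t).T

# Tiny example: 3 random windows scored against 4 random templates.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(3, 8)).astype(float)
T = rng.random(size=(4, 8))
print(emission_log_probs(X, T).shape)  # (3, 4)
```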
>>: So this question related to Michael's, but [inaudible] your word error rate is much bigger than the character error rate, which seems like I don't have any [inaudible] but seems like there was some sort of error correcting you could do if you -- maybe also related to Patrick's question, too, is like if you have some word base kind of language model ->> Taylor Berg-Kirkpatrick: Right. >>: -- and you actually try to recover from the characters, maybe your word, just miss a couple characters. >> Taylor Berg-Kirkpatrick: Yeah. >>: And can you sort of -- can correct that. >> Taylor Berg-Kirkpatrick: Yeah, if we could spit out some kind of lattice and then rescore it with some high n-gram word, something like that. Yeah, it's probably a good idea. People do do a lot of post processing, and I think they see improvements in OCR outputs. >>: So for the Google tool, did you -- or for the other tool, do you get a score of each character's -- so if you only get [inaudible] second most possible [inaudible]. >> Taylor Berg-Kirkpatrick: Yeah, no, we didn't. You mean like n-best lists or something like this? >>: Yeah. >> Taylor Berg-Kirkpatrick: Yeah, we didn't try that. >>: I was wondering if you had that and you just smooth with the language model probably you would do pretty well. >> Taylor Berg-Kirkpatrick: Yeah. Yeah, that's possible. Yeah. That's a good idea. >>: So you're not planning to sell your tool since the other ones are commercially available? >> Taylor Berg-Kirkpatrick: Yeah, no, we don't -- I don't know. Not really. >>: So one question. How is it that possible that people transcript 32 million ->> Taylor Berg-Kirkpatrick: Yeah. I don't know. It's pretty amazing. They said like one of these -- it's not -- it's not pay, I think. They set up a Web site, and people come to the Web site and they think it's fun and they do it. And they got lucky that enough people got interested that they did it. Yeah. >>: [inaudible]. >> Taylor Berg-Kirkpatrick: Yeah. It's kind of amazing. >>: So have you talked to some potential consumers, like ->>: [inaudible]. [laughter]. >> Taylor Berg-Kirkpatrick: Yeah, no, yeah, we should ->>: You should call them. >>: No, I'm thinking of along more about like social scientists. >> Taylor Berg-Kirkpatrick: Yeah. So we've been in contact with a couple. Somebody at Berkeley. Yeah. And we're trying to figure out what -- so we looked at their documents, and they do look a lot like these documents. And so really now it's a matter of us making it fast enough and making the interface easily usable for them. >>: So another thing you sort of gloss around, but you definitely seem [inaudible] definitely seem like a good idea that when you try to -- let's say you try to run a whole book and you keep updating your language model ->> Taylor Berg-Kirkpatrick: Yeah. >>: -- [inaudible] language model is sort of like initialized from the [inaudible]. >> Taylor Berg-Kirkpatrick: Right. >>: But then they can get adaptive from whatever preliminaries [inaudible]. >> Taylor Berg-Kirkpatrick: Yeah, yeah. >>: So what's the bottleneck to do that? >> Taylor Berg-Kirkpatrick: Well, so since -- I mean, we could have tried it, but we figured that because we're training on single pages, that's not enough data to actually get signal on the language model. We might be wrong. Maybe you could get something like names because they do repeat. But we figured we need it to be fast enough that we can train on larger portions at a time. >>: So you also mentioned coarse-to-fine. 
So is it coarse-to-fine in terms of different resolution? >> Taylor Berg-Kirkpatrick: So the coarse-to-fine was -- yeah, that's something we're looking at, too, though. But, no, the coarse-to-fine was in terms of increasing the order of the language model over time. That makes the dynamic program pretty fast. And then you prune using the lower-order one, and we use [inaudible]. >>: It seem how you envision, very naturally. >> Taylor Berg-Kirkpatrick: I know. Yeah. And so this is one of the approximations we're thinking about. There's an older literature on something called document image decoding. This is an old-school kind of [inaudible] model that's similar to this, in that they're treating OCR as a noisy channel model. And they came up with some interesting algorithms with Minka back in the day, something called iterative complete paths, where all those emission probabilities I talked about, which were so slow for us to compute -- you compute them all at a lower resolution, or with some sort of column projection, to give you an upper bound on the actual probabilities. And then you find the Viterbi path, and then you update all the entries on that Viterbi path with the exact scores. And then you recompute the Viterbi path. And once that converges, you're guaranteed that you found the right path, but you haven't had to actually compute all those emission probabilities. And so that's a place where the kind of coarse-to-fine pixel-level stuff could be really useful. >>: [inaudible]. [laughter]. >>: If you look at what happens with [inaudible] iterations that give you just [inaudible] without taking the parameters, just from the initialization point ->> Taylor Berg-Kirkpatrick: So what's the score of the initializer? >>: Yeah. >> Taylor Berg-Kirkpatrick: It's -- it's like 10 [inaudible]. Maybe more than that. So it's not -- it's not totally failing, but it's pretty bad. It's like maybe as bad as ABBYY FineReader. >>: Still better than Google? >> Taylor Berg-Kirkpatrick: Better than Google. Yeah. >>: Probably due to the joint [inaudible]. >> Taylor Berg-Kirkpatrick: And the ability to model the noise, yeah. I think the joint and then the ability to do baseline stuff. >>: And so here's a question. I'm wondering why you didn't ask this one, so it seems, you know, once you have a model that is this flexible to deal with noise, right, I mean, the next frontier would be handwriting. >> Taylor Berg-Kirkpatrick: Yeah. >>: And, I mean, there's obviously a huge demand for that in the medical domain. >> Taylor Berg-Kirkpatrick: So we thought about this. And the reason we haven't actually done experiments is we suspect it's unlikely to work, because we've noticed in error analysis that the places where our model fails are where you have a large region where there isn't any single column of white space between characters. So then on the first pass of EM you get the segmentation wrong, and then the learning all goes wrong. You can get away with this happening where a couple characters bleed together, but with something like handwriting or cursive, or even Arabic OCR, even digitally typeset Arabic, this could be a big problem. I mean, it might work. And we've thought about ways to -- I mean, maybe with enough random restarts you could start to do better. But in our initial experiments, we weren't hopeful. >> Hoifung Poon: So let's thank Taylor again. [applause]