>> Hoifung Poon: Hi. It's my great pleasure to welcome Taylor Berg-Kirkpatrick to
visit us from Berkeley. Taylor has done a lot of interesting work on diverse topics such as
word alignment, summarization, and more recently recognizing text from images.
So and then I have -- a number of us have been using things like, for example, [inaudible]
EM that he has developed. So I'm very excited to have him here. And without further
ado, here's Taylor.
>> Taylor Berg-Kirkpatrick: Thanks. Hi. So this is joint work with Greg Durrett and
Dan Klein, my advisor. And this talk is going to be about transcribing images of
historical documents into text.
So let me first just give an example of a historical document. I'll be mainly talking about
documents from the printing press era.
So here's an image of a document that was printed in 1775. And it's part of the
proceedings of the Old Bailey Courthouse in London.
Let me read just a little bit of it. Jacob Lazarus and wife, the prisoner, were both together
when I received them. I sold eleven pair of them for three guineas, and delivered the
remainder back to the prisoner.
So researchers are interested in being able to do analysis on collections of documents that
look a lot like this one. But in order to do many types of analysis, you first have to take
these images and turn them into text. And so that's what this talk is about.
But so let's see what happens, though, when we take this particular historical document
image and plug it into a modern, off-the-shelf OCR system. I'll use Google Tesseract as
an example.
So here's the transcription you get out of Google Tesseract. I've marked the errors in the
transcription in red. And you can see there are a lot of errors. In fact, more than half of the
words here are wrong. That's a pretty terrible error rate. And that's weird because
Google Tesseract is really good on modern documents. Most OCR systems are really
good on modern documents.
So, next, to help explain why existing OCR systems perform so badly on historical
documents, let me just quickly cartoon how a typical OCR system might work.
So here's an image of some digitally typeset modern text; it reads the word modern. So the
first thing an OCR system will often do is try to find a baseline for the text in the image.
Once it's found the baseline, it will try to segment the pixels in the image into regions that
correspond to individual characters.
So Tesseract actually does that by identifying connected components. Okay.
So once the image is segmented into characters, the system will then try to recognize
each character segment individually. And that usually happens using some kind of
supervised classifier that's trained on a single font or maybe a collection of fonts.
So this kind of pipelined approach works really well on modern documents that are
digitally typeset so that white space is really regular and where the font is known. But
we've seen that that kind of approach fails really badly on documents like this one.
So now let me tell you kind of the three primary aspects of historical documents that
make them really hard for existing OCR.
The first problem is that the fonts in these documents are often unknown. So here's a
rendering of the word positive in a historical document. We can try to line up a modern
serif font with the first two characters. This particular modern serif font doesn't quite
line up because the descender on the P isn't quite long enough.
We'll try a couple more modern serif fonts. In fact, no modern font is going to perfectly
line up with this image because the font used in the image is an ancestor of modern fonts.
An extreme case of this problem of unknown fonts can be seen here with the use of the
long S glyph. So we don't even -- we don't use this glyph anymore, so we don't even
have a representation for it in modern fonts.
Okay. A second problem that occurs in a lot of these printing press era documents has to
do with the baseline of the text. So this example reads the death of the deceased. And
you can see that the baseline of the text kind of wanders up and down as you go across
the line. The glyphs kind of hop up and down.
And so this actually has to do with the way that these documents were printed using a
printing press. So humans take, you know, templates, character templates, and align
them using a mechanical baseline, and there's some slop in that mechanism. And as a
result you get this kind of noise in the baseline.
Okay. A third problem has to do with inking levels and the fact that they vary across a
lot of these documents. So this example reads rode along in silence. And on the left of
the example, the word rode is underinked. It's so underinked that the glyph for the
character D has actually divided into two different unconnected components.
On the right there's a large degree of overinking and bleeding ink, so much so that many
of the glyphs have now come together into one big connected component.
And this, again, is a result of how the document was printed. Maybe the ink was applied
to the roller bar unevenly or maybe pressure was applied unevenly when the document
was printed.
So here are four portions -- here are portions of four different historical documents, one
from 1725, 1823, 1875, and 1883. And you can see that the three problems that I
mentioned are present in these four documents. And, in fact, if you look, those three
problems are present in pretty much all documents from the printing press era.
So our approach is going to explicitly deal with each of those issues in turn. So the way
we deal with the problem of unknown font is that we're going to learn the font in an
unsupervised fashion from the document itself.
We're going to deal with the noise resulting from the printing process by jointly segmenting
and recognizing, and doing that in a generative model that explicitly models the wandering
baseline and also explicitly models the varying levels of ink across the document.
So now I'll give you a high-level depiction of the generative model that we use. It's
actually going to generate all the way down to the level of pixels.
So first thing we do in the model is generate some text character by character that comes
from a language model. And now conditioned on that text we're going to generate a
bunch of typesetting information. And so this is going to look like a bunch of bounding
boxes that tell you how the glyphs are going to be laid out.
So specifically first thing we do for the character P is generate a left padding box. So this
box is going to house the horizontal white space that you see before you actually get to
the glyph for the character P.
Then we'll generate a glyph bounding box for P which will house the glyph for P and the
vertical white space above and below it. And then the right padding box. We'll do the
same for R and the rest of the characters.
So now we have a kind of layout. We're also going to generate for each of these glyph
boxes a vertical offset for the glyph that goes in that box. And so this will actually
specify the baseline of the text and allow it to wander as it goes across the document.
And then finally we'll generate an inking level, an overall level of ink, in each of these
glyph boxes. And that similarly will allow us to model the wandering levels of ink across
the document.
>>: Yes.
>> Taylor Berg-Kirkpatrick: Yeah.
>>: [inaudible]?
>> Taylor Berg-Kirkpatrick: So we're actually going to be learning the glyph shapes.
And so once you have -- once you've specified the baseline and you know the shape of
the glyph in terms of its pixel layout, that then defines the top of the glyph as well. So
that's enough to specify it. That will become more clear as I -- I'm going to dive into
increasing levels of detail.
And okay. So then once we have all that typesetting information and the text, conditioned
on all that, we're going to actually start filling in pixels. And this will happen in the
rendering model. So we're actually laying out the pixels for the glyphs. And then here
we've generated an image of the text.
So I'm going to name some of the random variables involved. I'll call the text E, and
that's going to come from P of E, the language model.
And the typesetting information I'll call T, and that comes from the typesetting model, P of T
given E.
And then finally the image itself, the actual grid of pixels, I'll call X. And that will come
from the rendering model, P of E -- P of X given E and T.
And so the only observed random variable in our model is going to be X, the actual
image itself. Both E and T are going to be latent random variables. And during learning
we'll marginalize them out and when it comes time to do transcription we'll fill them in
with inference.
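As a compact way to see the factorization just described (this is just a restatement of the talk's notation, nothing beyond it):

```latex
% Joint distribution over text E, typesetting T, and pixel grid X.
% Only X is observed; E and T are latent, marginalized out during learning
% and filled in by inference at transcription time.
\[
  P(E, T, X) \;=\; \underbrace{P(E)}_{\text{language model}}
  \cdot \underbrace{P(T \mid E)}_{\text{typesetting model}}
  \cdot \underbrace{P(X \mid E, T)}_{\text{rendering model}}
\]
```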
Okay. So now I'm going to tell you more details about each of these three components in
turn.
Let me just say that probably kind of the first thing that popped into your head about
what each of these components might look like is basically correct. With one exception.
There's going to be a little extra level of complexity in the rendering model, and I'll tell
you about it when we get there.
All right. So first let's talk about the language model. We're going to have a random variable
E sub-I corresponding to each character. All right? And these random variables will be
distributed according to a Kneser-Ney smoothed character 6-gram model that we'll train
on a large corpus of external text data. So this is going to resolve a lot of the ambiguity
when we're doing this in an unsupervised way, the fact that we have a strong language
model trained on a bunch of data.
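As a rough illustration of what a character n-gram language model does, here is a minimal sketch; it uses simple additive smoothing for brevity rather than the Kneser-Ney smoothing used in the talk, and the training text and vocabulary size are made-up placeholders.

```python
from collections import defaultdict

# Generic character n-gram language model sketch (n = 6 in the talk).
# Additive smoothing stands in for Kneser-Ney, and the tiny training
# string stands in for a large external corpus.
class CharNGramLM:
    def __init__(self, order=6, alpha=0.01, vocab_size=128):
        self.order = order
        self.alpha = alpha
        self.vocab_size = vocab_size
        self.counts = defaultdict(lambda: defaultdict(int))
        self.context_totals = defaultdict(int)

    def train(self, text):
        padded = " " * (self.order - 1) + text
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            char = padded[i]
            self.counts[context][char] += 1
            self.context_totals[context] += 1

    def prob(self, context, char):
        context = context[-(self.order - 1):]
        num = self.counts[context][char] + self.alpha
        den = self.context_totals[context] + self.alpha * self.vocab_size
        return num / den

lm = CharNGramLM(order=6)
lm.train("how the murderers came to be taken ill and taken away")
print(lm.prob("murde", "r"))
```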
Okay. So now let's talk about typesetting. Yeah.
>>: [inaudible].
>> Taylor Berg-Kirkpatrick: So we played with a couple different language models.
You'll see them in a second. One is out-of-domain. It comes from Gigaword. And the
other is in-domain for some of the historical corpora that we looked at.
>>: [inaudible].
>> Taylor Berg-Kirkpatrick: So the in-domain one is from that period, and it was
manually transcribed. That was a big curatorial effort that actually transcribed a bunch of
these documents manually. And so we can use that to sort of see what the penalty is from
using out-of-domain data.
Okay. So I'll talk about the typesetting model. And we'll look at the typesetting process
for the token index I. And it works the same way for each token. So the character
random variable at index I is E sub-I, as we've seen. And let's say that that's an A.
So the first thing we're going to do in the typesetting process is generate a random
variable L sub-I which specifies the width of the left padding box. And L sub-I is going
to come from a multinomial.
Then we'll generate a random variable G sub-I, which is the width of the glyph box, and
that's, again, going to come from a multinomial.
And then finally we'll generate the width of the right padding box, again, from a
multinomial.
And remember: the left padding box houses horizontal white space, the right padding box
houses horizontal white space, and the glyph box houses the glyph and the vertical white space.
And so these three multinomials are specific to the character type A. They govern the
widths of tokens of type A and the distribution of horizontal white space around tokens of
type A. And so that latter thing can capture effects like kerning.
We're going to have similar multinomials for every character in the alphabet, and together
all these multinomials are part of our parameterization of the font that we're going to
learn in an unsupervised fashion.
>>: [inaudible] what is the real parameters you use?
>> Taylor Berg-Kirkpatrick: Oh, these pictures?
>>: [inaudible] glyph box, the glyph box width?
>> Taylor Berg-Kirkpatrick: The glyph box width, yes.
>>: Yes.
>> Taylor Berg-Kirkpatrick: What?
>>: Are you really using [inaudible]?
>> Taylor Berg-Kirkpatrick: Oh. Here it depends on the height that we choose. So we
downsample the resolution of the images so that each line is 30 pixels high. And then I
believe that we cap the glyph box width at 30. It's the same as the height. So, yeah, I
think that's correct. That's actually what we use in the model, 1 to 30. That's in terms of
pixels.
Okay. So we're also going to generate the vertical offset, as I mentioned before, and that
we'll call V sub-I. But that's going to come from our global multinomial. And then
finally we'll generate the inking level for this glyph box D sub-I again from our global
multinomial.
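Here is a minimal sketch of that per-token typesetting step, with randomly initialized multinomials standing in for parameters learned by EM; the 30-pixel width cap follows the earlier discussion, while the number of offset and inking levels is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-token typesetting sketch: each character type has three multinomials
# over widths (left padding, glyph, right padding, capped at 30 pixels),
# while the vertical offset and inking level come from global multinomials
# shared across the alphabet.
MAX_WIDTH = 30

def random_multinomial(k):
    p = rng.random(k)
    return p / p.sum()

font_params = {
    'a': {
        'left_pad': random_multinomial(MAX_WIDTH),
        'glyph': random_multinomial(MAX_WIDTH),
        'right_pad': random_multinomial(MAX_WIDTH),
    },
}
offset_multinomial = random_multinomial(5)   # number of offset levels: illustrative
inking_multinomial = random_multinomial(5)   # number of inking levels: illustrative

def typeset_token(char):
    p = font_params[char]
    return {
        'left_pad_width': int(rng.choice(MAX_WIDTH, p=p['left_pad'])) + 1,
        'glyph_width': int(rng.choice(MAX_WIDTH, p=p['glyph'])) + 1,
        'right_pad_width': int(rng.choice(MAX_WIDTH, p=p['right_pad'])) + 1,
        'vertical_offset': int(rng.choice(5, p=offset_multinomial)),
        'inking_level': int(rng.choice(5, p=inking_multinomial)),
    }

print(typeset_token('a'))
```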
Okay. So now we're going to try to fill in the pixels for this particular glyph box token.
And so far we've generated the width of the box, the vertical offset of the glyph that's
going in the box, and the inking level for printing that glyph in the box. And we're trying
to fill in the grid of pixels, which is X.
So there has to be something in our model that tells us the shape of the glyph for the
character type A. Because actually we're trying to print an A here because we're saying
that this is an A. So that part of our model is going to be a grid of glyph shape
parameters. And here I'm showing you a particular setting of the parameters that we
learned in our model on some document.
These glyph shape parameters are going to combine with the current glyph box width, the
current vertical offset, and the current inking level to produce an intermediate grid of
Bernoulli pixel probabilities that has the same dimensions as the particular glyph box
token that we're trying to fill in.
And once we've done that, we can just sample the pixel values in the glyph box from the
corresponding Bernoullis.
So the special thing about this approach is that our parameterization of glyph shape is
actually independent of the particular width, offset, and inking level for the token we're
trying to generate.
And that means we can fix the glyph shape parameters in the model and then vary the
glyph box width to produce tokens with different widths, we can change the vertical
offset to generate tokens with different baselines, and we can change the inking level to
generate tokens with different levels of ink, all while keeping the shape of the glyph the
same.
And this is important because it means that when we're doing learning, when in the
document we encounter an instance of an underinked A and then later in the document
encounter an instance of an overinked A, both of those sources of information can
influence our notion of what the shape of an A glyph should look like. We don't have to
have separate grids of parameters for underinked As and overinked As.
And it turned out that this kind of shared parameterization that we're using was really
important to getting unsupervised learning to work in this task, because, after all, we're
going to try to learn these glyph shape parameters unsupervised style.
Okay. So here's where I'm going to go into a little extra level of detail. The key to
getting this parameterization to work is how you define this function that maps the glyph
shape parameters along with the width, offset, and inking level down to the intermediate
grid.
So I'll tell you about that process for a particular row of the glyph shape parameters, how
it generates the corresponding row of the Bernoulli pixel probabilities.
So I'll call the glyph shape parameter row phi and the Bernoulli pixel probability row
theta. And first I'll tell you how we generate the leftmost Bernoulli pixel probability in
the row. So in order to generate the leftmost guy, we interpolate the glyph shape
parameter row phi with a vector of interpolation weights alpha that's shaped according to
a Gaussian that's centered over to the left of the row of glyph shape parameters.
Once we've done that, we just apply the logistic function, and that gives us the actual
Bernoulli pixel probability.
Now, to generate the remaining Bernoullis in this row, as we move along to the right, we
just interpolate with different vectors alpha that are still shaped like Gaussians, but now
we're centered proportionally along the row.
So this means that the Bernoulli pixel probability at position J is generated by interpolating
with a specific vector alpha sub-J. And now our parameterization looks like this. Theta
sub-J is log proportional to the inner product of alpha sub-J and phi.
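A small sketch of that row-wise mapping, assuming a simple Gaussian bandwidth and illustrative sizes; the real model additionally ties the interpolation weights to the width, offset, and inking random variables, which comes next.

```python
import numpy as np

# Row-wise mapping sketch: theta_j = logistic(alpha_j . phi), where alpha_j
# is a Gaussian-shaped interpolation weight vector whose center moves
# proportionally along the row of glyph shape parameters phi.

def interpolation_weights(full_width, out_width, bandwidth=1.0):
    centers = np.linspace(0, full_width - 1, out_width)
    positions = np.arange(full_width)
    alpha = np.exp(-0.5 * ((positions[None, :] - centers[:, None]) / bandwidth) ** 2)
    return alpha / alpha.sum(axis=1, keepdims=True)   # one alpha_j per output pixel

def bernoulli_row(phi, out_width):
    alpha = interpolation_weights(len(phi), out_width)
    logits = alpha @ phi                    # inner products alpha_j . phi
    return 1.0 / (1.0 + np.exp(-logits))    # logistic gives Bernoulli probabilities

phi = np.random.randn(30)                   # one row of glyph shape parameters
print(bernoulli_row(phi, out_width=10))     # a narrower rendering of the same row
```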
Now, it turns out that we can get the various effects that I talked about earlier of varying
width of tokens, varying the vertical offset, and varying the inking level by having these
interpolation weights actually depend on the random variables G sub-I, V sub-I, and D sub-I.
All right. So I'm not going to -- there's some more detail to be talked about there. I'm not
going to talk about it right now. What I do want to say, though, before we move on to
learning is that you can actually view these interpolation weights like fixed feature
functions in a locally normalized log linear model. And that means that when it comes
time to learn the glyph shape parameters phi, we can actually use out-of-the-box
unsupervised learning methods. So that's a nice property of the parameterization here.
>>: So can I [inaudible] you picked [inaudible] glyph shape and then just rescale them to
what [inaudible]?
>> Taylor Berg-Kirkpatrick: Yeah. So that's the effect that GI would have. So you'd
have a specific -- so for the position J, you'd have a specific set of alphas that would be
used to map down the full width glyph shape parameters to a glyph box depending on G.
So suppose you wanted to map down to a glyph box with width 10, you'd have a specific
set of interpolation weights that would do that. And G would pick that out. But you
define those ahead of time.
Similarly, VI is just going to pick which row that you're pulling from. The trickiest thing
here is DI. So it turns out that you can -- because you're pushing through a nonlinearity
by rescaling everything, by changing the interpolation weights and by adding bias, you
can kind of change things like contrast and darkness levels.
And so you specify for each of the discrete settings of these random variables, you
specify ahead of time the kinds of interpolation weights that you'll use, and then those are
actually -- they can be treated like feature functions. But they can actually do all these
effects that we want using this single parameterization of shape phi. If that makes sense.
>>: So [inaudible].
>> Taylor Berg-Kirkpatrick: Yeah. Yeah.
>>: So another question is that in your [inaudible] your baseline, wandering, and also
your [inaudible] in your generative model they are independent.
>> Taylor Berg-Kirkpatrick: They are, yeah.
>>: Okay.
>> Taylor Berg-Kirkpatrick: There's reason to believe they shouldn't be, right?
>>: Yeah, because --
>> Taylor Berg-Kirkpatrick: Especially the baseline, which -- yeah.
>>: Also inking you can also --
>> Taylor Berg-Kirkpatrick: These would all be kind of slow varying. And even -- I'll
show you an example later where, because of three-dimensional warping as you get near
the bindings of books, the widths of the glyphs are actually slow varying as well.
So we don't explicitly model any of that and it works really well, but we thought about
ways to extend it. We think -- we've seen a couple examples, a couple errors that we
think it might actually lead to improvements. But there aren't a lot of errors that we think
it would actually fix. Yeah.
But, I mean, it's an interesting -- I mean, we can talk about it later. It's interesting how
you might do it. Because it gets -- to just do it naively, you end up adding a lot of states
to the Semi-Markov model and it can get really slow and this is already pretty slow. But
you could think about using some kind of mean field approximation. Because, you
know, the offsets are basically independent of what's going on with the language model.
And so it might -- it's an appropriate approximation.
Okay. So I'll actually talk about learning. So we use EM to learn the font parameters.
And so that, again, means learning both the multinomials governing the layout and the
glyph shape parameters specifying glyph shape. And we initialize EM using font
parameters that come from mixtures of modern fonts. So we just took all the fonts that
appear on the Ubuntu distribution and mixed them down.
And in order to get the expectations that we need for EM, we run the semi-Markov
dynamic program because the model itself is actually an instance of a hidden
semi-Markov model. And then --
>>: So the mixture, is it just uniform weight or --
>> Taylor Berg-Kirkpatrick: Yeah, uniform -- the initialization mixture? Yeah. So there
were -- by hand we identified some weird ones, like Comic Sans. There's like a bunch --
like Zapf Dingbats or whatever. And we just said those don't count. But the ones
actually with like text, we just uniformly put them together.
And so to make this fast, we took a coarse-to-fine approach to inference.
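In outline, the learning-and-transcription loop just described looks roughly like this; the function bodies are placeholders showing the shape of the procedure, not the actual implementation.

```python
# Rough outline: initialize the font from a mixture of modern fonts, run EM
# where the E-step is a semi-Markov dynamic program, then transcribe by
# maximum-likelihood inference. All bodies below are stubs.

def initialize_font_from_modern_fonts():
    return {'width_multinomials': None, 'glyph_shapes': None}

def semi_markov_expectations(image_lines, font, language_model):
    # E-step: forward-backward over segmentations, characters, offsets,
    # and inking levels (the model is a hidden semi-Markov model).
    return {'width_counts': None, 'pixel_counts': None}

def update_font(expectations):
    # M-step: re-estimate width multinomials and glyph shape parameters.
    return {'width_multinomials': None, 'glyph_shapes': None}

def viterbi_transcription(image_lines, font, language_model):
    # After EM, maximum-likelihood inference fills in the text variables.
    return ['' for _ in image_lines]

def learn_and_transcribe(image_lines, language_model, iterations=10):
    font = initialize_font_from_modern_fonts()
    for _ in range(iterations):
        expectations = semi_markov_expectations(image_lines, font, language_model)
        font = update_font(expectations)
    return viterbi_transcription(image_lines, font, language_model)

print(learn_and_transcribe(image_lines=[object(), object()], language_model=None))
```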
Okay. So now you know basically what the model looks like and you know how we do
learning and inference. Before I show you experiments, I just want to show you two
examples of the system kind of in action.
So here is a snippet of a document. It reads how the murderers came to. So if we train
the model using EM on the entire page that this snippet came from, we learn some font
parameters. And now with maximum likelihood inference, we can fill in the text random
variables and get a transcription.
So this is the actual output of our model, how the murderers came to. We happened to
get this one right.
We can also do maximum likelihood inference on the typesetting random variables and
use that to kind of peek into what the model is actually doing here.
So here is a representation of the model's prediction about the typesetting random
variables. The blue here corresponds to padding boxes and spaces. And the white boxes
are the glyph boxes. So you can kind of see the layout.
The glyphs that actually appear inside these white boxes are the Bernoulli pixel
probabilities that the model chose to use to actually generate the pixels that it saw. Yeah.
>>: So do you model space?
>> Taylor Berg-Kirkpatrick: So the way we model space is we say it's a special type of
glyph that has fixed parameters that are just white with some small probability of black
for noisiness. Yeah. So we don't try to learn that. But so this means then that spaces
generate a glyph box which is blank and then two padding boxes which are also blank. Which
means there's kind of a non-identifiability problem there, but it didn't -- doesn't really
cause problems.
>>: How do you determine the width of a glyph box where the ink has been -- where it is
overinked? For example, [inaudible], for example.
>> Taylor Berg-Kirkpatrick: Sorry, say again?
>>: So [inaudible] would be a wide glyph box [inaudible] would be two glyph boxes.
But if you were given a --
>> Taylor Berg-Kirkpatrick: So how would you distinguish between those --
>>: How do you distinguish between -- if you're setting up a glyph box, how do you
distinguish between the case where --
>> Taylor Berg-Kirkpatrick: Through the language model.
>>: The language model.
>> Taylor Berg-Kirkpatrick: The language model would tell you. Hopefully. Yeah. I
mean, if -- yeah. And some of the errors we get are because the language model was
ambivalent about stuff like that. And you will see cases where the -- yeah.
>>: [inaudible] confused. Can you tell me again your testing procedure? You used the
document and then you run through your model?
>> Taylor Berg-Kirkpatrick: Yeah.
>>: And then you get the global ink multinomial parameters, the global baseline from
this?
>> Taylor Berg-Kirkpatrick: Right.
>>: And then so basically you do inference on a document.
>> Taylor Berg-Kirkpatrick: So we run EM to get the --
>>: So the output --
>> Taylor Berg-Kirkpatrick: -- the parameters. And then we do maximum likelihood
inference to transcribe.
>>: I see.
>> Taylor Berg-Kirkpatrick: Yeah. There's no supervision and we're training and testing
on the same document because it's totally unsupervised. Because we have to learn the
font in that document before we can transcribe.
>>: And one page is enough to learn, or is it like --
>> Taylor Berg-Kirkpatrick: One page turns out to be enough, yeah. So ideally we'd
be able to do this on whole books and actually update the language model parameters as
well. But right now it's a little slow. But we're working to make it faster.
So I want you to see in this example, in the [inaudible] typesetting, that we're actually --
we're capturing the wandering baseline. The model is using the vertical offset random
variables to do that, which is good. The model is also using the inking random variables
to capture the overinked H on the left and the underinked E in the word the. So it's cool
to see that the model is kind of doing what we intended it to do.
I want to show you another example. This snippet reads taken ill and taken away -- I
remember. And again we use EM to learn parameters on -- from the whole page that this
snippet came from. And we get a transcription and it turns out again we were right in this
case. And then here is the predicted typesetting.
And so here I want you to see that, again, we're modeling the wandering baseline. This is
something that I mentioned earlier. The wandering baseline in the input here isn't a result
of the printing press and how it worked; this is actually a result of three-dimensional
warping as you get near the binding of a book. And so we can capture that.
But what's more cool is that because the page begins to face slightly away from the
camera as it recedes in three-dimensional space, those characters are actually thinner.
And so our -- because we can generate glyphs of different width, we capture that here
with the thinner B, E, and R. So it's kind of cool that we can capture this effect, because
these effects are kind of common in a lot of these historical documents.
>>: So just to make sure, so the baseline can actually be not horizontal?
>> Taylor Berg-Kirkpatrick: No, so we -- we don't model the fact that they're slanted.
We're just -- it's an approximation. We're saying, yes, they're kind of moving down and
getting thinner as they --
>>: [inaudible] baseline [inaudible].
>> Taylor Berg-Kirkpatrick: Yeah. Baseline doesn't have an angle. It just has -- it's just
an offset.
>>: Because the red dot you show in a previous slide --
>> Taylor Berg-Kirkpatrick: Yeah, that was just -- yeah, I'm just like laser pointing with
the lines.
Okay. So now let me talk about actual real experiments. I'll show you results on two
different test sets. The first was taken from this Old Bailey corpus that I talked about
earlier, the Old Bailey Courthouse in London. And here we chose 20 images at random
distributed across years, each consisting of 30 lines. And that formed our test set. And
we manually transcribed these guys to get the gold data.
And the second test set comes from a corpus called Trove. And so this is a collection of
historical newspapers from Australia that's curated by the National Library of Australia.
Here we took ten images at random, again each consisting of 30 lines.
We compare against two different baselines. The first is Google Tesseract, which I
mentioned earlier. The second is a state-of-the-art commercial system called ABBYY
FineReader. And this is actually the system that was chosen by the National Library of
Australia to automatically transcribe the documents they had in Trove.
I'll show you experiments with two different language models. The first is
out-of-domain, as I mentioned. This we trained on 34 million words from the New York
Times portion of Gigaword. And the second language model is in-domain for our
dataset, and this came from manually transcribed text from the Old Bailey corpus, 32
million words.
>>: Is that the reason you restricted it to just the New York Times portion?
>> Taylor Berg-Kirkpatrick: Yeah. So it's a character 6-gram model, and it seemed like
more data wasn't helping and it was just making things slower. But yeah.
Okay. So I'll show you results in terms of word error rate on the first test set, which was
the Old Bailey corpus. So here bigger bars are worse. So Google Tesseract on this test
set gets a word error rate of 54.8. So it's pretty bad. It's kind of like we saw on that small
example. This is getting more than half of words wrong.
ABBYY FineReader does a bit better. It gets a word error rate of 40 on this test set. And
our system, with the out-of-domain language model, gets a word error rate of 28.1 and
with the in-domain language model gets a word error rate of 24.1. So that's a pretty big
reduction in error. That's actually more than a 50 percent error reduction compared to
Google Tesseract on this test set.
Here's the second test set, Trove. Here Google Tesseract is doing even worse. This is
kind of a harder test set in a way. It's getting a word error rate of 59.3. Now ABBYY
FineReader is getting almost half of words wrong, 49.2. And our system, with the
out-of-domain language model is getting a word error rate of 33. And so that's not quite
a 50 percent error reduction compared to Google Tesseract, but it's kind of close.
Okay. So I just want to go back to the original example and --
>>: [inaudible] show the result using the [inaudible]?
>> Taylor Berg-Kirkpatrick: We never ran it. We could have. We figured that it would
have been more in-domain than NYT was for Trove, but not as in-domain as it was for
Old Bailey because it was actually for the Old Bailey corpus.
>>: Would it make sense to use character error rate here?
>> Taylor Berg-Kirkpatrick: Yeah, we have those numbers too in the paper. We went
back and forth about what to present. The character error rates -- character error rates are
much lower, and that's what people usually present. But they're kind of misleading
because once you get a -- you know, a couple characters wrong in a word, it makes it
hard to search for, makes it hard to read. I don't know. I think it's better to see word
error rates. But I think the character error rates are in the 10s, something like this. I don't
totally remember.
>>: [inaudible] people don't use [inaudible] language models, like you're using a 6-layer
[inaudible]?
>> Taylor Berg-Kirkpatrick: 6-gram, yeah.
>>: For a letter, right?
>> Taylor Berg-Kirkpatrick: Yeah, a character. Yeah.
>>: Why don't people use words?
>> Taylor Berg-Kirkpatrick: So we did a lot of experiments, actually, after we published
the paper on trying to boost the language model because we thought, like you did, you
know, less ambiguity is better. It's actually hard. The problem is that your state space in
the semi-Markov model gets huge. And so you have to do different approximations. We
started doing beaming. And the beaming kind of interacts badly with models that have
these kind of two levels of generating words and then characters we found.
And even when we waited long enough to get more exact results, we saw that it wasn't
giving a big improvement for the out-of-domain language model. Even for the in-domain
language model it wasn't giving huge improvements. Which is a little -- we're not totally
sure. We expected it to give bigger improvements, but it didn't.
So here's that original example again. So just to give you an idea of how legibility gets
better. This is Google Tesseract with more than half of words wrong. And this is our
system with fewer than a third of words wrong. This is our output on the document. And
you can see that it's easier to read.
Okay. Next I just want to show you some examples of the fonts that we actually learn.
So first here is the initializer for the glyph shape parameters that we use for the character
G. And you can see that it is a mixture of a bunch of modern fonts. There are kind of
two different types of dominant Gs that are visible here. The first is the type of G that
has the descender starting on the left. And the second is the type of G that has the
descender starting on the right.
So here are the final parameters that we learned for these glyph shape parameters after
running EM on various documents from across different years. And you can see that in
all of these historical documents the type of G with the descender on the left is used. And
we were able to learn that.
We were also able to learn more fine-grained distinctions in glyph shape as glyph shape
varies from document to document and across years.
There's one more cool thing that we can do with our model that I want to talk about, and
this has to do with the fact that we're actually generating all the way down to the level of
pixels. We can treat obscured pixels in the input -- whether they're a result of overinking,
blotched ink, or even tears in the page -- as unobserved random variables in the model.
And then during learning we can marginalize them out.
And during transcription inference, we can actually make predictions about their values.
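A minimal sketch of that idea: annotated pixels are simply dropped from the Bernoulli likelihood during learning and later filled in from the model's predicted pixel probabilities. The arrays and threshold here are illustrative only.

```python
import numpy as np

# Obscured pixels are masked out of the Bernoulli log-likelihood (so they
# carry no signal during learning) and later filled in from the model's
# predicted probabilities.

def masked_log_likelihood(pixels, theta, observed_mask):
    # pixels: observed binary image; theta: model's Bernoulli probabilities;
    # observed_mask: False where ink bleed or tears were annotated.
    ll = pixels * np.log(theta) + (1 - pixels) * np.log(1 - theta)
    return np.sum(ll[observed_mask])

def fill_in_unobserved(pixels, theta, observed_mask):
    # Replace unobserved pixels with the model's most likely values.
    reconstruction = pixels.copy()
    reconstruction[~observed_mask] = (theta[~observed_mask] > 0.5).astype(pixels.dtype)
    return reconstruction

theta = np.clip(np.random.rand(30, 30), 1e-3, 1 - 1e-3)
pixels = (np.random.rand(30, 30) > 0.5).astype(float)
observed = np.random.rand(30, 30) > 0.1        # ~10% of pixels marked unobserved
print(masked_log_likelihood(pixels, theta, observed))
print(fill_in_unobserved(pixels, theta, observed).shape)
```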
So here is a portion of a document that we found that had severe bleeding ink from the
facing page. You can see that here, these big capital letters, they got superimposed. We
went and manually annotated that bleeding ink in red, and that's why it's red here. We
told the model during learning to treat all the red pixels as unobserved and during inference
to try to fill them in.
And so here's the predicted typesetting that we get on this portion. You can see that we
correctly identified the word mouth, which is cool, and we made reasonable predictions
about what pixels should go there.
We also correctly identified the missing D that was totally obscured in the word found
and again made a reasonable prediction of what the D shape should be that goes here.
Because, after all, we know what Ds look like in this document because we learned the
font on this document. So that's cool. And we're looking at -- there's future applications
that you could think of doing, integrating the model with different kinds of reconstruction
techniques that we're looking at.
Okay. So now I'll conclude. So unsupervised font learning, we've seen that it can yield
state-of-the-art results on documents where the font is unknown. We've also seen that
generatively modeling these sources of noise that were specific to the printing press era
documents was really effective. And finally we're trying to make this into a tool so that
historians and researchers can actually use it, and we're working on that now. The
biggest problem is that it's slow right now. We're trying to make it fast. And thanks.
[applause].
>> Taylor Berg-Kirkpatrick: Yeah.
>>: So one observation a few slides back when you showed, you know, a comparison of
the Google recognition and your recognition --
>> Taylor Berg-Kirkpatrick: This one?
>>: Yeah. And so when I look at it, it looks like especially names --
>> Taylor Berg-Kirkpatrick: Yes. So this is why we're thinking of adapting the language
model. So these are like court transcripts. It's like a script. The names occur over and
over again. And we'll transcribe the same name completely differently every time we see
it. I do want to -- like I think that this is funny. If you see what Google Tesseract did,
foolmarfs. I just wanted to point that out. I think it's funny that it said that.
But, yeah, so some kind of constraint -- we thought a little bit about this so that you don't
transcribe the same name differently each time. If you added that constraint, that might
improve things. Or if you simply try to adapt the language model.
>>: Yeah, that was, I mean, maybe a naive thought, but I was thinking if you had a
separate character n-gram model for capitalized --
>> Taylor Berg-Kirkpatrick: That's a good idea, yeah.
>>: Because, I mean, the names are often -- you know, they may not be English, they
may be --
>> Taylor Berg-Kirkpatrick: Yeah.
>>: -- coming in from other --
>> Taylor Berg-Kirkpatrick: Right.
>>: There's a lot of character n-gram probabilities that are probably very different in
names than they are in English in general.
>> Taylor Berg-Kirkpatrick: That makes a lot of sense. And sometimes, especially if the
name isn't severely degraded, you'd be better off like ignoring the language model
altogether maybe. Like, for example, you can see here, this looks like the language
model likes it, but it probably doesn't -- the pixels don't like it. So it's a good idea.
>>: I know this is probably a [inaudible] question, but did you try your model on the
modern documents?
>> Taylor Berg-Kirkpatrick: Yeah, we did. And just a little bit. And it does -- you
know, the character error rates on modern documents in our system and other systems are
tiny. It's like, you know, less than 2 percent character error rate, 1 percent character error
rate. And so it seemed like ours was about as good as Google Tesseract. The thing is ours
is so much slower than Google Tesseract that you should just run Google Tesseract if you
have a modern document.
>>: So how slow is it right now?
>> Taylor Berg-Kirkpatrick: Right now it's about 10 minutes per page. Maybe 20
minutes. And that's on a pretty fast desktop. It turns out actually the biggest bottleneck
isn't the dynamic program, which we've made really fast now. The bottleneck is actually
the computation of all the emission probabilities for all the different spans. Because
there's actually many different types of templates you could actually use to print things.
You have all different widths, different offsets, different inking levels.
They're parameterized in a really simple way, but you actually have to compute how each
of those possible templates would fit with the input. And so that's a big computation.
You can write it as something that looks basically like a matrix outer product. And so
just doing that in our code made it faster [inaudible] cache locality and stuff. But we're
also looking at putting it on a GPU because it's sort of ideally suited for a GPU.
But that said, we're worried that historians or humanities researchers aren't going to have
like some desktop, some gaming desktop with like 3 GPUs in it. So we're also looking at
approximations that you could do to speed it up on the CPU.
>>: So this question related to Michael's, but [inaudible] your word error rate is much
bigger than the character error rate, which seems like I don't have any [inaudible] but
seems like there was some sort of error correcting you could do if you -- maybe also
related to Patrick's question, too, is like if you have some word-based kind of language
model --
>> Taylor Berg-Kirkpatrick: Right.
>>: -- and you actually try to recover from the characters, maybe your word, just miss a
couple characters.
>> Taylor Berg-Kirkpatrick: Yeah.
>>: And can you sort of -- can correct that.
>> Taylor Berg-Kirkpatrick: Yeah, if we could spit out some kind of lattice and then
rescore it with some higher-order word n-gram, something like that. Yeah, it's probably a good
idea. People do do a lot of post processing, and I think they see improvements in OCR
outputs.
>>: So for the Google tool, did you -- or for the other tool, do you get a score of each
character's -- so if you only get [inaudible] second most possible [inaudible].
>> Taylor Berg-Kirkpatrick: Yeah, no, we didn't. You mean like n-best lists or
something like this?
>>: Yeah.
>> Taylor Berg-Kirkpatrick: Yeah, we didn't try that.
>>: I was wondering if you had that and you just smooth with the language model
probably you would do pretty well.
>> Taylor Berg-Kirkpatrick: Yeah. Yeah, that's possible. Yeah. That's a good idea.
>>: So you're not planning to sell your tool since the other ones are commercially
available?
>> Taylor Berg-Kirkpatrick: Yeah, no, we don't -- I don't know. Not really.
>>: So one question. How is it possible that people transcribed 32 million --
>> Taylor Berg-Kirkpatrick: Yeah. I don't know. It's pretty amazing. They said like
one of these -- it's not -- it's not paid, I think. They set up a Web site, and people come to
the Web site and they think it's fun and they do it. And they got lucky that enough people
got interested that they did it. Yeah.
>>: [inaudible].
>> Taylor Berg-Kirkpatrick: Yeah. It's kind of amazing.
>>: So have you talked to some potential consumers, like --
>>: [inaudible].
[laughter].
>> Taylor Berg-Kirkpatrick: Yeah, no, yeah, we should --
>>: You should call them.
>>: No, I'm thinking more along the lines of social scientists.
>> Taylor Berg-Kirkpatrick: Yeah. So we've been in contact with a couple. Somebody
at Berkeley. Yeah. And we're trying to figure out what -- so we looked at their
documents, and they do look a lot like these documents. And so really now it's a matter
of us making it fast enough and making the interface easily usable for them.
>>: So another thing you sort of glossed over, but it definitely seems [inaudible] -- it
definitely seems like a good idea that when you try to -- let's say you try to run a whole
book and you keep updating your language model --
>> Taylor Berg-Kirkpatrick: Yeah.
>>: -- [inaudible] language model is sort of like initialized from the [inaudible].
>> Taylor Berg-Kirkpatrick: Right.
>>: But then it can get adapted from whatever preliminary [inaudible].
>> Taylor Berg-Kirkpatrick: Yeah, yeah.
>>: So what's the bottleneck to do that?
>> Taylor Berg-Kirkpatrick: Well, so since -- I mean, we could have tried it, but we
figured that because we're training on single pages, that's not enough data to actually get
signal on the language model. We might be wrong. Maybe you could get something like
names because they do repeat. But we figured we need it to be fast enough that we can
train on larger portions at a time.
>>: So you also mentioned coarse-to-fine. So is it coarse-to-fine in terms of different
resolution?
>> Taylor Berg-Kirkpatrick: So the coarse-to-fine was -- yeah, that's something we're
looking at, too, though. But, no, coarse-to-fine was in terms of increasing the order of the
language model over time. That makes the dynamic program pretty fast. And then you
prune using the lower order one and we use [inaudible].
>>: It seem how you envision, very naturally.
>> Taylor Berg-Kirkpatrick: I know. Yeah. And so this is one of the approximations
we're thinking about. There's an older literature on something called document image
decoding. And this is also -- this is an old-school kind of [inaudible] model that's similar to
this, where they're treating OCR as a noisy channel model.
And they came up with some interesting algorithms with Minka back in the day,
something called iterative complete paths, where you find -- so you -- all those emission
probabilities I talked about which were so slow for us to compute, you compute them all
at a lower resolution or with some sort of like column projection to give you an upper
bound on the actual probabilities.
And then you find the Viterbi path, and then you update all the guys on that Viterbi path
with the exact scores. And then you recompute the Viterbi path.
And so once that converges, you're guaranteed that you found the right path, but you
haven't had to actually compute all those emission probabilities. And so that's a place
where the kind of coarse-to-fine pixel level stuff could be really useful.
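A toy sketch of that iterated search idea, with each position scored independently to keep it short (the real algorithm runs Viterbi over a semi-Markov lattice): score everything with cheap upper bounds, take the best path, replace only the on-path scores with exact ones, and repeat until the best path is entirely exactly scored.

```python
import numpy as np

# Toy version of the iterated-path search: exact scores are expensive, so we
# keep a table of cheap upper bounds and refine only the entries that the
# current best path actually uses.
rng = np.random.default_rng(1)
n_positions, n_candidates = 5, 4
exact = rng.random((n_positions, n_candidates))           # expensive in reality
upper = exact + rng.random((n_positions, n_candidates))   # cheap upper bounds

scores = upper.copy()
is_exact = np.zeros_like(scores, dtype=bool)
rows = np.arange(n_positions)
while True:
    path = scores.argmax(axis=1)            # best path under current scores
    if is_exact[rows, path].all():
        break                               # best path fully exactly scored: done
    scores[rows, path] = exact[rows, path]  # refine just the chosen candidates
    is_exact[rows, path] = True
print(path)  # picks a maximizer of the exact scores at every position
```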
>>: [inaudible].
[laughter].
>>: If you look at what happens with [inaudible] iterations that give you just [inaudible]
without taking the parameters, just from the initialization point --
>> Taylor Berg-Kirkpatrick: So what's the score of the initializer?
>>: Yeah.
>> Taylor Berg-Kirkpatrick: It's -- it's like 10 [inaudible]. Maybe more than that. So it's
not -- it's not totally failing, but it's pretty bad. It's like maybe as bad as ABBYY
FineReader.
>>: Still better than Google?
>> Taylor Berg-Kirkpatrick: Better than Google. Yeah.
>>: Probably due to the joint [inaudible].
>> Taylor Berg-Kirkpatrick: And the ability to model the noise, yeah. I think the joint
and then the ability to do baseline stuff.
>>: And so here's a question. I'm wondering why you didn't ask this one, so it seems,
you know, once you have a model that is this flexible to deal with noise, right, I mean,
the next frontier would be handwriting.
>> Taylor Berg-Kirkpatrick: Yeah.
>>: And, I mean, there's obviously a huge demand for that in, say, the medical domain.
>> Taylor Berg-Kirkpatrick: So we thought about this. And the reason we haven't
actually done experiments is we think -- we suspect it's unlikely to work because we've
noticed in error analysis that the places where our model fails are where you have a large
region where there isn't any single column of white space between characters. So then
on the first pass of EM you get the segmentation wrong, and then the learning all goes
wrong. You can get away with this happening where a couple characters bleed together,
but with something like handwriting or cursive, yeah, or even Arabic OCR, even digitally
typeset Arabic, this could be a big problem.
I mean, it might work. And we've thought about ways to -- I mean, maybe with enough
random restarts like you could start to, you know, do better. But in our initial
experiments, we weren't hopeful.
>> Hoifung Poon: So let's thank Taylor again.
[applause]