>> Lee Dirks: Good afternoon. My name is Lee Dirks and I am the director for
education and scholarly communication at Microsoft Research, and it is my
pleasure today to introduce two speakers who will be joining us from Cornell
University's Department of Information Science.
One is a long-term colleague and collaborator, Dr. Carl Lagoze, and his
colleague Bernhard Haslhofer. And they will be talking to us today about
research on scholarly practices and communication at Cornell Information
Science.
And I'd like to introduce and welcome to the podium Carl Lagoze.
>> Carl Lagoze: Thanks, Lee. Thank you all for joining virtually and physically. I
really appreciate it. I'm going to give an overview today of the work
we've been doing at Cornell in scholarly practices and communication.
I'm going to start off by dabbling and bouncing around some of the research
projects we've been involved in, to motivate the work. Bernhard, who joined us
recently in March, is going to follow up to talk about the work that he's doing,
and we'll connect it all under one umbrella.
If people in the audience have any questions, please feel free to ask or
comment at will.
So, just very briefly -- I always feel the need to evangelize and advertise Cornell
Information Science, which I think is a very unique entity. Our view of information
science is really built around the three hubs you see here: technology and
information systems; human-computer interaction, so how humans interact with
technology; and how social rules and cultural norms interact with
all of those things.
So we really bring technology, people, and society together under one
umbrella. We advertise ourselves as the information school that was never a
library school, so you could say that that has good aspects and it has bad
aspects. But the nice thing about our program is we link all the way out on
campus to a whole bunch of different types of areas, including law, computer
science, psychology and other things. So it's a really interesting -- it's a new
program. The department's been in existence for about four years now, and I'm
really proud to be a member of it, and I think we're making great strides.
The people whose work I'm going to be talking about today -- of course, as
Robert Wilensky, my colleague at Berkeley, once said, I do the PowerPoint
and the rest of the people do the work. So it's Bernhard, who recently joined the
group; Yookyung Jo, who is a PhD student in computer science; Karin Patzke,
who is a first-year student in information science; Theresa, who is finishing up
her degree and going to join me as a post-doc; and then a few other people,
including Paul Ginsparg, who you might know, who has done the major work on
arXiv, and Simeon Warner, his colleague.
So a bunch of great people. And sort of to set the stage, many of you might
know David Levy, who is a faculty member in the information school at the
University of Washington. Lee, do you know David?
>> Lee Dirks: I know the name.
>> Carl Lagoze: Yeah. So David, a CS PhD from Stanford, describes
himself as a computer scientist who went astray. And I find myself more or less
being in that position. The background of my work is highly technical. On the
screen here are a bunch of projects that I've been involved in. The Digital
Libraries Initiative -- I did a lot of infrastructure work there. I created a
system called NCSTRL, which is an infrastructure for technical reports on the Web.
NSDL, the National Science Digital Library. I was partly responsible, with Sandy
Payette, for Fedora Commons, and for the Open Archives Initiative -- work that
was highly protocol-based and technology-based.
And after years of working in this very technical manner, I continually got
frustrated as to why the work was being adopted by a limited number of
audiences but not going into large-scale deployment.
And that led me to think more about what's interesting about
infrastructures, what it is that really makes it difficult to develop
infrastructure that works across a bunch of different communities, and why
we are trying to drive things with technology when all of the standard
paradigms of how technology permeates society are these sort of sociotechnical,
sociocultural models.
So I've begun to think a lot more about that. And the recent work I've been doing
with people in my group is understanding how to reconcile cyberinfrastructure
and scholarly culture. This, by the way, is the Millau Bridge. If you have not
seen this, it's a bridge in France, one of the great architectural wonders.
That's not a science fiction picture, it's a real bridge. So what I want to talk
about today is some aspects of this research, look at it from the top level, and
understand what the problem is.
And this work stems from this sort of broader vision of interoperability and how
do we make systems work across different cultures and different domains. And
of course, our standard Tower of Babel view here. How do we get a technical
protocol that works amongst people that work in very different manners?
So the foolish among us -- and I'm talking about myself years ago -- were naive
enough to think that this was a relatively easy problem: that you just came up
with a protocol. This is a paper that the late Barry Leiner published on some
work that we were doing that came out of Cornell, an open architecture for the
worldwide digital library. And this view that information and knowledge
artifacts could fit into a single architecture was something that we naively
thought in 1998. Since then we have rethought it and understood more of the
complexities.
So a lot of people -- I proudly join a large number of people who thought
that the Web was going to come along, that these architectures were going to
come along, and the world was suddenly going to change. The revolution was
going to happen. And I always hate to keep overusing Stevan Harnad's name,
but Stevan is my poster child for someone who would get up and say, look,
physicists did this, everybody else is going to do it any day now. And excuse
the picture of him holding an iPhone, right. It's not my picture.
But this notion that the revolution was inevitable -- that as soon as we deployed
this architecture everybody was going to change, and all these issues with
intellectual property and scholarly cultures were going to go away -- was
obviously false. And it really pretended that scholars were sort of a bunch of
sheep: that they just did things en masse and got themselves shorn the same
way and herded the same way. And the reality is that adoption of these
technologies really varies. It varies from relative disinterest -- like, why
should I care -- to slightly more violent reactions of I don't care, I don't
want it, and this is what I think of it. So we have found through the years --
and I'll be talking about chemists, not that this is my representation of
chemists -- that there's a lot of reaction in a lot of communities of we simply
don't want this, it violates what we want.
You know, an interesting thing is I'm a real fan of citizen science. And you get
up and talk about citizen science in some venues and they will literally start
throwing mud at you: don't you realize that violates the basic notions of how
science is done, of professional science? So this very clear difference in
understanding these different communities is very important. So if you haven't
seen this, this is I think one of the best documents available. This is a white
paper that came out of an NSF workshop: Understanding Infrastructure:
Dynamics, Tensions, and Design. It's a wonderful paper that talks about how
difficult this phenomenon called infrastructure is to develop. And it uses a lot
of historical precedent, such as the railroads, the banking system, the
electricity system, and the telephone system.
And it really points out this basic sociotechnical paradigm: that the path
between the technological and the social is not static, there's no one correct
mapping, and cyberinfrastructure will only develop when social, organizational,
and cultural issues are resolved in tandem with the creation of technology --
that you must approach this difficult thing of society and culture at the same
time you do the relatively easy thing of technology. And I say relatively easy
to quote the late Jim Gray, who said, if only all of our problems were code.
So that's the setting of the work, the basis of the work that we build upon.
And really when we look across science, we look across scholarship, we see a
bunch of different islands that really segregate it in interesting ways. We see
these strong methodological islands: people use different scientific methods,
the scientific method varies, and the technology has to vary with it.
Epistemology: the general way they view the world, their notion of
knowledge. And the cultures that have developed. And these are all
sort of independent dimensions that we need to understand.
So the way you start this research is trying to understand what the analytical
unit is. I mean, a basic research problem as you're looking at this issue is
what are you defining or attaching the word culture to? Is it a
discipline? And I think we'd all agree that the discipline is really too
coarse a granularity. Of course you can say chemists, and suddenly you have
synthetic chemists, you have theoretical chemists, you have physical chemists,
and the same thing across a lot of other disciplines.
Is it a subdiscipline? Well, I don't think that works very well either. Is it a
research group? And that sort of gets you down to the level of
impossibility, because if it's individual research groups, you can't do much of
anything. Or is it this thing called the invisible college, the term other
people have used to describe these groups of people who collaborate across
institutional lines in some unique manner.
So the issue here is to understand that, and to pick a really interesting
community to start your research on, which is chemistry. Chemistry is a
particularly challenging area, right, Mark? That's right. We have
spent about the last three or four years looking at chemists and working with
chemists to understand why this is such a difficult area. And one thing that
stands out about why it's so difficult is the commercial value of chemical
information.
I mean astronomers will tell you that our data -- what's wonderful about our data
is nobody would pay for it. That's certainly not true of chemical information,
especially in the pharmaceutical industry.
And then there's just the nature of the research culture. You have this
predominance of this thing called synthesis, this creation culture, which
overshadows discovery. Really the concern here is not to discover, to look at
the world; it's to figure out how to make new things, in these sort of
secret methods of making new things. And that leads to a highly autonomous
culture in which successful research really relies not on a lot of other people;
it relies on being in a lab and figuring out how to get these things to work
together.
And then there's the really difficult problem of the scholarly societies. The
ACS, the American Chemical Society, particularly in the United States, is simply
violently opposed to open access. And opposed in many ways -- people who
don't know this always love this one: the core identifier system in
chemistry has copyright issues attached to it. You cannot use the identifiers
without paying the ACS for them.
It's kind of hard to build infrastructure if you can't share identifiers of
objects. And then the RSC is considerably better, but still has some licensing
issues. But they're getting better.
So just to go back to the historical notions of some of the things that
happened here, one of the things we looked at at the beginning of our work is
this question. In physics, arXiv came along with phenomenal success.
Paul Ginsparg invented it in 1991 in high-energy physics, and it spread from
there to other fields of physics. It became the venue for publishing. It
basically changed the way they published.
When they tried to do the same thing in biomedicine, it became highly modified,
and the hybrid looks very little like the original model. And in
chemistry it crashed and burned. The attempt to have an open archive of
chemistry literature just went away, and the societies basically tried to
shut it down. So, you know, we have to ask why these things happen.
So for a really big overview -- and I can compliment this work even
though my name's on it, because it's mostly Theresa's work -- my student Theresa
and I ran an NSF workshop in DC about two years ago, and out of that came
two papers: a Nature Chemistry article and a longer white paper, whose URL
is included in this slide.
And it gives a very long, very highly detailed analysis
of what the problem with chemistry is, what's going on here.
It talks about the technology, it talks about the culture. So if you're
interested in this subject, I recommend that you take a look at these. That's
the handle to the paper.
Okay. So how do you study this problem? I mean, we can all sit here and do
back-of-the-envelope, writing-on-a-napkin analysis -- okay, disciplines
differ -- but how do we understand this? Really this requires
innovative methods to understand why this problem exists.
So I'm going to talk in particular about some work that Theresa, Asif-ul Haque,
and I have been doing, described in a paper published in Scientometrics: to
understand the fundamentals of the problem, to understand why this problem
exists and where we might go from here.
So to ask basic research questions like how do the socio-intellectual
characteristics of fields shape the communications that they carry on? How does
the communication culture reflect the practice culture, and how do they interact
and what's the synergy between them? And how within a field such as
chemistry, how are there differences within the field itself?
So we've adopted what I think is a very innovative approach, which is to combine
ethnography and network analysis and take advantage of the strengths of each.
Ethnography is great, but the problem with ethnography is it's very
expensive. It's very, very labor intensive, and a classic ethnographic study can
only cover a few groups.
Network analysis is great because you can take huge amounts of data, but
the problem is that you do not see the nuances and subtleties. So what we try
to do is combine those two techniques, with one enhancing the other -- and I'll
talk about how that's done in a second -- to come up with a mesoscopic level of
analysis: to understand the co-authorship relationships, how they cluster, and
the relationships between those clusters, to understand what the collaboration
patterns in chemistry are.
And that's -- we're trying to understand how collaboration patterns reflect
communication cultures, communication cultures reflect collaboration patterns,
and how technology might either interfere or enhance those collaboration
patterns. Because if you don't understand the way scientists work together,
you're not going to be able to have technology to allow that -- to enhance that.
So here's an overall view of the technique. And the technique is as follows. I
think I can get a pointer here. I thought I could get a pointer here, but I can't get
a pointer here.
>> Lee Dirks: You have a laser pointer too.
>> Carl Lagoze: That thing. And the pointer goes like that.
Okay. So the basic technique looks like this. On the left-hand side are the
ethnographic field studies; on the right-hand side is the network analysis. We
first do a delineation of the field: we work out what the units of
analysis are. We then bounce from that and build a full
citation network within the individual fields.
We then bounce back: we sit in on some of the main labs in that field and work
with them to understand what they do, and then develop coauthorship networks
in those groups. With the leaders of those groups we take the networks,
bounce them back and forth, have them look at our clusterings,
understand what the clusterings mean to them, and how we might modify them.
And we use this bidirectionality of going from ethnography to network analysis
and back to come up with a synthesis of field differences in the cultures.
And the technique works really nicely, because you get these aha moments in
which an actual researcher will look at a clustering that you've done and say,
hey, that person there reflects what we're doing, that sort of pattern
reflects what we're doing -- without us trying to infer cultural
information from it ourselves.
So the work we've done has studied five labs in two major fields. The field here
included synthetic chemistry, theoretical chemistry, theoretical physics, and
experimental physics. This was mainly what we call an instrument-building
culture: a field of chemistry that relied on building large instruments and
using those instruments.
And this was another field that included inorganic chemistry,
polymer chemistry, and organic chemistry. This was a more synthetic
culture that did substance building. And this picture leads to the breakdown in
the analysis and the collaboration patterns that really are based on these two
notions: people who work with large instruments that cost a lot of money and
take a long time to develop, and people who create substances in
individual labs. And we'll see very different collaboration patterns in the two
of them.
So we spent a lot of time with those people. We attempted to understand their
research cultures: what the values in their domains are, what counts as
competence, how they share information. And from there came some interesting
analysis that I'll talk about in a moment.
So to go from the fields to actually doing the network analysis, we worked with
the leaders of these labs to develop a lexical query into the Web of Science
to come out with a co-authorship graph. What we wanted to do is develop a
co-authorship graph, a clustered graph that basically represented the field --
I'm sorry, a citation graph that represented the breadth of the field.
And we wanted to come up with something that actually had
the recall and depth in the field, and we could use article self-citations to
understand whether we were getting enough depth: did the query retrieve the
things that Carl Lagoze published and cited himself? And then the breadth to
actually cover the people that I would collaborate with and reference.
It's a rough and crude method, but we've basically come to the conclusion that
using the bibliographic mapping and the citation map is
a really good way to delineate these units of study.
And what comes out of that is a huge citation network that we then use to create
a co-author network. We then cluster the nodes, compare
the clusters with field observations, interview the group leaders, and try to
figure out whether these graphs represent what they think of the world.
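To make that pipeline concrete, here is a toy sketch of the co-author-network step. The record format is hypothetical; a real study would start from Web of Science exports and would need name disambiguation first (a problem I'll come back to):

```python
# Toy sketch of building a weighted co-authorship graph from bibliographic
# records. The record format here is hypothetical; a real study would parse
# Web of Science exports and disambiguate author names first.
from itertools import combinations
from collections import defaultdict

def coauthor_graph(papers):
    """Nodes are authors; an edge's weight counts jointly authored papers."""
    weights = defaultdict(int)
    for paper in papers:
        for a, b in combinations(sorted(set(paper["authors"])), 2):
            weights[(a, b)] += 1
    return dict(weights)

papers = [
    {"authors": ["A", "B"]},
    {"authors": ["A", "B", "C"]},
    {"authors": ["D", "E"]},   # a second, disconnected cluster
]
graph = coauthor_graph(papers)
# graph[("A", "B")] == 2: A and B co-authored two papers together
```

The resulting weighted edge list is what then gets clustered and handed back to the lab leaders for interpretation.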
And what's interesting is that when you do that, you come up with some very nice
patterns. So I've put up here three different topologies of groups that represent
collaboration patterns. Here you can see what stares out at us is a classic
PI-led research group. So this is a co-authorship graph again: the nodes are
authors, and there's an edge if I co-authored with you. And here is the strong
leader who insists that every paper be published with him or her.
Here's the research institute. I would think that Microsoft Research sort of
looks like this: you have people who walk down the hall and say, oh, we
have similar research; you start to work together and then you co-author with
each other.
And then here you have an international research network, where you have PIs in
separate countries and some boundary-crossing collaborations between them.
So we have a bunch of topologies like this, and you can reflect
these topologies back in the ethnographic studies. People look at
them and say, hey, I know who that is. And I know who that is. And things like
that. What also comes out, if you analyze this at the mesoscopic
level, is that you start to understand what the roles of different people are in
these graphs -- so here's a bunch of groups that interact very closely together,
and here are bridge people.
And again, we can show these to people and they say, I know who that is. So you
have this strength-of-weak-ties type of phenomenon occurring there. And then
you have these other fields where you have intensive
collaboration between the groups themselves: you don't have
bridge people, you have many ties going across. Often this
is a graduate student of this person who has now moved over to this group over
here, who still co-authors with his or her colleagues over here but works with
people over here and forms the weak tie over there.
So this is all documented in this Scientometrics article. And what's
interesting, if you go one step further, is you can start to classify the nodes.
You can classify people into being non-hubs or hubs. And we have a whole
classification of nodes and their proportions in the population. And you can
really come up with a notion of who the role players are in this sort of
interesting social network based on co-authorship.
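As an illustration of the idea, a toy role classification might look like the following. The threshold and labels are made up for the example and are not the ones used in the Scientometrics paper:

```python
# Toy mesoscopic role classification on a co-authorship graph: given edges
# and a cluster assignment, call a node a "bridge" if it has edges into a
# foreign cluster, a "hub" if its degree passes a (made-up) threshold, and
# a "non-hub" otherwise.
from collections import defaultdict

def classify_roles(edges, cluster, hub_threshold=3):
    degree = defaultdict(int)
    touched = defaultdict(set)   # clusters each node connects into
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        touched[a].add(cluster[b])
        touched[b].add(cluster[a])
    roles = {}
    for node in degree:
        if touched[node] - {cluster[node]}:   # reaches a foreign cluster
            roles[node] = "bridge"
        elif degree[node] >= hub_threshold:
            roles[node] = "hub"
        else:
            roles[node] = "non-hub"
    return roles

edges = [("P", "Q"), ("P", "R"), ("P", "S"), ("Q", "R"), ("S", "X")]
cluster = {"P": 1, "Q": 1, "R": 1, "S": 1, "X": 2}
roles = classify_roles(edges, cluster)
# P is a hub within cluster 1; S bridges clusters 1 and 2
```

This is the kind of labeling that lets you count the proportion of each role in the population and hand interpretable structures back to the ethnographers.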
So the results here have been quite nice. And you can even extend that into the
geographic realm and understand -- so I think the yellow nodes are people in the
United States, the blue nodes -- now, the yellow nodes are people in Asia, the
blue nodes are the United States, the red nodes are Europe or something like
that. You could understand how geographic dispersion affects co-authorship.
So we did all this work, and we were sitting around one day looking at it and
saying, gee, those Asian researchers really collaborate a lot. Isn't that
interesting how much they collaborate? And then we had this sudden realization
of the potholes in the road to thinking this research was legitimate. It's the
fact that often there's one name that's really two people, and there's
one person who often has two names. The reason the Asian
authors appeared to collaborate so much with each other is that not all C. Wangs
are the same C. Wang, and not all S. Woos are the same S. Woo.
So we ran into a problem quite familiar to people at Microsoft who
have been doing academic search, which is the author name disambiguation
problem. And it turns out that that problem distorted our
results really badly, and we had to go back and spend six months working on the
problem of author name homonymy, which is a very
difficult problem. Given the fact that features are very hard to extract
from academic papers, and that the strength of those features, such as
institutional affiliation, is very difficult to judge, it becomes quite a hard
problem. So we have a paper that we're presenting at JCDL this year that talks
about doing this with very sparse metadata. The way we're doing it is by
coauthor affiliations, and it came up with very nice results.
So if I have a Wang who published a paper with Lagoze and Warner -- and I have
the privilege of having a unique identifier -- then, if we have enough
statistical evidence, we can pretty reliably disambiguate that from the paper by
Wang, Zduncyk and Gagner. Given overwhelming
statistical evidence, we can say that these are different Wangs.
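A toy version of that coauthor-evidence intuition can be sketched as follows; the actual JCDL paper uses a statistical model rather than this hard overlap rule, so treat it purely as an illustration:

```python
# Toy author-name disambiguation by coauthor overlap: papers signed with the
# same ambiguous name are merged into one identity only when their coauthor
# sets intersect. This greedy pass illustrates the intuition; the real method
# weighs statistical evidence rather than using a hard overlap rule.
def split_name_instances(papers, name):
    """Return groups of paper ids, one group per inferred distinct person."""
    groups = []  # each entry: (known_coauthors, paper_ids)
    for pid, authors in papers.items():
        if name not in authors:
            continue
        coauthors = set(authors) - {name}
        for known, pids in groups:
            if known & coauthors:          # shared coauthor -> same person
                known |= coauthors
                pids.append(pid)
                break
        else:
            groups.append((coauthors, [pid]))
    return [pids for _, pids in groups]

papers = {
    1: ["Wang", "Lagoze", "Warner"],
    2: ["Wang", "Lagoze"],              # shares Lagoze with paper 1
    3: ["Wang", "Zduncyk", "Gagner"],   # no overlap: a different Wang
}
# split_name_instances(papers, "Wang") -> [[1, 2], [3]]
```

With only sparse metadata, the coauthor set is often the strongest signal available, which is exactly why the Wang-Sun-Kim-Yang world described next stays hard.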
And our results came out really nicely there. The problem that remains, of
course, is the Wang, Sun, Kim, Yang papers. If you look at
our results in the paper, you'll see we have really, really nice results for
this world over here; that other world is just -- I mean [inaudible]. So we paid
for about a hundred hours of student work to build a gold standard. And
even the students couldn't figure it out -- just by searching all over the
place, searching Bing, searching Google, whatever, they couldn't disambiguate
these. So if humans can't do it, we have a hard time with machines.
But just to show you the results here: these graphs are color-coded by
node role. And what's important to
recognize is that since our co-authorship graph has different roles in it, the
mistakes in one role are really worse than mistakes in other roles. And can I
write on the board, because --
>> Lee Dirks: Yes.
>> Carl Lagoze: So if you have one of these subgraphs, and you
assume that a person is, say, in role four, which is a bridge person -- if
that's really five people, that grossly affects the topology of your
graph. So we have to be careful that the disambiguation techniques are biased
towards our type of graph. So the problem becomes more complex. But in any
case, without fully describing this result in detail: the reds are really
bad disambiguation problems, and you can see that with our disambiguation
algorithm there are very few reds left in here.
So you can take a look at this paper; it reports those results. So, to conclude
on that and then pass it over to Bernhard: our initial results show some very
interesting differences in collaboration patterns between fields that rely on
either individual achievement or collective achievement. The collaboration
patterns are quite different, and they point to a shaping of technology. One of
the prejudices we tend to have is that sharing is great. And in this world,
sharing is not absolutely great, so the technology has to adapt to those kinds
of field differences and cultural differences.
I'll pass it over to Bernhard, and then I'll close up with a couple of last comments.
Thank you.
>> Bernhard Haslhofer: Yes, thank you, Carl. I will now focus on one specific
technique in the area of scholarly work and scholarly communication, which is
annotation, and I will demonstrate one particular tool we built back in Vienna.
I originally come from Vienna; I came to the US two months ago. That's also
the reason why I'm crediting my colleagues back in Vienna.
I will demonstrate our annotation tool and how you can apply it to
one specific use case, which is the annotation of historic maps. We will
see later that this is a use case which targets several domains. So it's a
very interesting use case.
In the demo which I will give at the end, you will also see how we make use of
the vast amount of data that is out there on the Linked Data
Web. I'm speaking of DBpedia, GeoNames, and all the other sources out there.
We believe that this is quite an interesting combination, and that scholars, as
well as institutions running infrastructures where people can annotate, could
benefit from pulling in data from the data Web.
Okay. From Carl's talk we know that communities are diverse. So the first
question that usually comes up when I start talking about annotations is:
okay, what is an annotation? And there are so many definitions out
there in the literature. Definitions that you find might suit one
discipline but be completely unsuitable for another discipline. So whenever it
comes to designing annotation tools, annotation models, and protocols, the first
question to be resolved is: okay, what is an annotation? It's a really
difficult question. And I found one very nice abstract definition in Palmer's
paper -- the reference is on the last slide. It simply says: annotation is note
taking to manage, interpret and coordinate sources. But I don't want to stick
with definitions; I'll go on to some concrete examples. This is an example I
took from a colleague, from the [inaudible] repository. There, people have the
possibility to annotate works by Pico, to add their interpretations to ancient
texts. So this is one facet of annotation in one particular community. The link
to the slide is down there.
Here we have another use case of annotations. This is coming from the
Australian literature project, where scholars have the possibility to annotate
text passages and add additional knowledge to those passages. Here, for
instance, somebody explains what the term young ladies means by means of an
annotation.
Here we have another example from, I think, the biochem domain; I took
this example from a colleague from Colorado University. In this domain they're
annotating words within the text to learn more about the semantics of a
term. That's a completely different notion of annotation. And this is my very
personal view of annotations: the technique I use most often is actually the
annotation of PDFs to mark important sections in a document. This is also a
very important aspect of annotation.
Okay. So what is YUMA now? YUMA is a tool we built to annotate multimedia
objects on the Web. It's actually a Web application which consists of a server
component and several clients. The clients are media specific so we have a
client for video, one for audio, one for images, one for maps. And these clients
can be used for different use cases.
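To make the idea of such a tool concrete, a minimal annotation record might carry a target URI, a region selector, a note, and optional semantic tags. All field names and values below are illustrative, not YUMA's actual data model:

```python
# Illustrative annotation record: a note attached to a region of a media
# object, optionally tagged with Linked Data URIs. Field names and example
# values are hypothetical, not taken from YUMA's real data model.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    target: str                    # URI of the annotated media object
    selector: str                  # region within it, e.g. a media fragment
    body: str                      # the annotator's note
    tags: list = field(default_factory=list)  # semantic tags (URIs)

note = Annotation(
    target="http://example.org/maps/some-historic-map.jpg",
    selector="xywh=120,80,40,40",              # a rectangle on the image
    body="This point is Philadelphia.",
    tags=["http://sws.geonames.org/example"],  # hypothetical GeoNames URI
)
```

The point of the structure is that the note stays separate from the media object itself, so a server can store annotations for video, audio, images, and maps through the same model while the media-specific clients only differ in how the selector is produced.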
We developed this tool back in Europe in the context of Europeana. Europeana
is a European digital library project which aggregates multimedia content from
cultural institutions all over Europe. So there's a really large pool of media
objects out there which can be annotated by scholars, but also by the public, by
normal Internet users. YUMA is also open source software. It's available on
GitHub; anyone can pull the source code. We'd also be happy if somebody
sends us pull requests or at least files some issues, so that we can further
develop this code. So it's completely accessible and free.
YUMA is also, in our opinion, an infrastructure for collaborative research,
because people really have the possibility with those tools to add their
thoughts, their ideas, directly to the media objects. And this is a very
important communication mechanism in scholarly research.
With this slide I just want to emphasize that the demo today will be about
historic map annotation, but YUMA is really more: it's a multimedia annotation
framework. We have Web clients for annotating videos, audio files, and images on
the Web. The video and the audio clients were written in Flash originally
but are currently being completely rebuilt in HTML5, so we want to get away
from Flash in this case.
The image client is written in JavaScript, built with the Google Web
Toolkit, and the same is the case for the map annotation client which I will
show you later.
Why did I take the map annotation use case? I took it because it's a very
interesting use case -- I don't know how I get rid of this -- because
it targets quite different disciplines. You can study historic maps, of
course, from a historical point of view, to learn about how country borders have
changed. But from another point of view you can study the technologies that
were used at a certain point in time to construct maps, or to produce knowledge
about geographical units at that time.
Certain maps are also pieces of art, so you could study them from an artistic
perspective. And of course, most importantly, maps are still the predominant
medium to transport geographical information. So you see there's quite a
variety of disciplines that are involved in this particular use case, and that's
why we think it's quite interesting.
Okay. But now it's demo time. I prepared a screencast which will show -- just
start again. Okay. This screencast shows you how you can use our tool to
annotate digital maps for our use case. We took a map from the Library of
Congress so you should be familiar with the geographical region here. You can
simply start annotating this map by giving the URI of this map to our annotation
tool. Here I'm zooming in and zooming out of the map just to demonstrate that
we actually use high resolution maps tiled into smaller images, the same
way normal Web maps are tiled. And we support two types of annotations.
The first type is georeferencing annotations; the second type is semantically
augmented annotations. And I start with the first type. The georeferencing
annotations allow you to associate geographical points to ancient maps. So here
for instance I identified the city of Philadelphia on my historic map and I say
okay, this point is Philadelphia. I enter this information in my annotation box.
The system automatically fetches geo information from GeoNames to add this geo
information to this control point. And to properly georeference a map you
actually need four control points in order to project today's coordinate system on
the coordinate system of this map, of this ancient map, so that you can use
geographical information system features on historic maps as well.
So here I'm annotating the city of Detroit. I say okay, this is Detroit and I create
later on one more point, one for Chicago because I know that at this point where
the river flows into Lake Michigan this should be Chicago. So I created four
georeferencing annotations for this map. And now I somehow projected today's
coordinate system on this map. And now I can overlay any kind of KML vector
data on this map. For instance, here I'm selecting country borders. Very simple
example, but it shows you what you can do with it. And this allows you to nicely
embed an historic map into today's map.
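The idea behind those control points can be sketched in a few lines. This is my own illustration, not YUMA's actual code: three pixel-to-coordinate pairs determine a planar affine transform exactly, and a fourth point, as the tool requires, makes the fit overdetermined so errors average out (a real georeferencer would use a least-squares fit and a proper map projection). The pixel values below are invented.

```python
# Toy illustration of georeferencing via control points (not YUMA's code).
# Three (pixel -> lon/lat) pairs determine an affine transform exactly;
# a fourth, as YUMA asks for, would allow a least-squares fit instead.

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    a, b, c = m[0]; d, e, f = m[1]; g, h, i = m[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def affine_from_points(src, dst):
    """Solve X = a*x + b*y + c, Y = d*x + e*y + f from 3 point pairs."""
    m = [[x, y, 1.0] for x, y in src]
    dm = det3(m)
    coeffs = []
    for k in range(2):                      # k=0 -> lon row, k=1 -> lat row
        rhs = [p[k] for p in dst]
        row = []
        for col in range(3):                # Cramer's rule, column by column
            mc = [r[:] for r in m]
            for r in range(3):
                mc[r][col] = rhs[r]
            row.append(det3(mc) / dm)
        coeffs.append(row)
    return coeffs                           # [[a, b, c], [d, e, f]]

def apply_affine(coeffs, x, y):
    (a, b, c), (d, e, f) = coeffs
    return (a * x + b * y + c, d * x + e * y + f)

# Made-up pixel positions of Philadelphia, Detroit, Chicago on a scan,
# paired with rough real lon/lat values.
pixels = [(820.0, 610.0), (560.0, 350.0), (210.0, 330.0)]
lonlat = [(-75.16, 39.95), (-83.05, 42.33), (-87.62, 41.88)]
transform = affine_from_points(pixels, lonlat)
```

With the transform in hand, any pixel on the scan can be projected into today's coordinate system, which is what makes overlaying KML vector data on the historic map possible.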
You can imagine that instead of this map you can take any other map.
Okay. So now the second type of annotation, which is the semantically
augmented annotation. This allows users to add comments to regions on a map.
And I created a comment for the region I know best in the States here, which is
the Finger Lakes region. I just moved there recently. So I'm saying okay, this
must be Cayuga Lake, at least one of those lakes, and then I add my knowledge
about this region.
And you see that in the background there are some tags popping up. And these
tags are actually suggested based on the geographic region this annotation
covers, but also on the annotation text the user [inaudible], so it's analyzed
during the annotation creation process actually.
And these tags are actually not just string tags, these are links to resources on
the Web, to resources in GeoNames but also to resources in DBpedia. DBpedia
is just a structured version of Wikipedia. So we give the user the possibility to
check tags which are in his or her opinion relevant for this annotated region and
gain additional semantic information for this region but also for this map.
So for instance, here there are some false positives as well, like the Australian
capital. But Ezra Cornell, the founder of Cornell University, is somehow related
to this region. Central New York is obviously related. Cornell University is in this
region. Cayuga Heights is part of Ithaca, so these are also related. And the
important thing is that these are not string tags; these are links to
DBpedia, and behind those links is extremely valuable semantic information
which we can pull into our system, and we will show you later on how the
system or the organization running the system can benefit from this at the end.
Okay. So you see, while we create the annotation, during the authoring
process of an annotation we create links at the same time. And we believe that these links have
quite a high quality because it's the annotation creator who acknowledges the
semantic validity of a link.
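One concrete payoff of those DBpedia links is that each linked resource carries labels in many languages, so indexing the labels rather than the user's tag string is enough to enable multilingual search over annotations. A minimal offline sketch of that idea (my illustration; the label set is hard-coded here in place of what a live DBpedia lookup would return, and the Finnish label is assumed):

```python
# Sketch of multilingual search over DBpedia labels (illustrative only).
# In a real system the labels would be fetched from DBpedia for each
# linked resource; here they are hard-coded to keep the sketch offline.

LABELS = {
    "http://dbpedia.org/resource/Cayuga_Lake": {
        "en": "Cayuga Lake",
        "it": "Lago Cayuga",
        "fi": "Cayugajarvi",     # assumed Finnish label, for illustration
    },
}

def build_index(annotations):
    """Map every label (any language, lowercased) to annotation ids."""
    index = {}
    for ann_id, resources in annotations.items():
        for uri in resources:
            for label in LABELS.get(uri, {}).values():
                index.setdefault(label.lower(), set()).add(ann_id)
    return index

# A hypothetical annotation that links to the Cayuga Lake resource.
annotations = {"ann-42": ["http://dbpedia.org/resource/Cayuga_Lake"]}
index = build_index(annotations)
```

Searching for "lago cayuga" now finds the annotation even though its author wrote only in English, which is the effect shown in the multilingual part of the demo.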
Okay. What we do now is we search for annotations. It's a very simple use case,
but before we search, I show you that this data record has been created. This
is the annotation I created before, with the text plus the links. You can open it
again and see the annotation. And now just let's assume you're a user and you
want to search for a map.
If we don't have any metadata or annotations, we won't find anything.
With annotations we can at least search over information a user has provided.
We will find something when we search for Finger Lakes because the user wrote it's
about the Finger Lakes. That's trivial. We will also find something for Cornell
University because it's in that text. That's trivial.
But now we start searching in different languages. So I'm searching for lago
Cayuga. It's Italian. I also find this record.
Now I'm searching in Finnish. I also find this record. Now I search in Japanese. I
also find this record, and the same for Chinese.
So one of the benefits of pulling in those data from DBPedia is that you can
enable a very simple multilingual search feature. And this is especially important
in Europe where we have 20 whatever languages just in Europe. Okay. Now I
show you that all the annotations that are actually created by the users are
shared as linked data on the Web, or at least as data on the Web. So each
annotation gets a URI, and this is dereferenceable via HTTP. So you can simply send
an HTTP request against an annotation and you receive the complete annotation,
represented in RDF in this case, via HTTP. We also support JSON if somebody
wants to go for JSON.
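The RDF-or-JSON choice is ordinary HTTP content negotiation: the client states a preference in its Accept header and the server picks a serialization. A minimal server-side sketch of that decision (my illustration, not the YUMA implementation; a real server would also weigh q-values and return 406 for unsupportable requests):

```python
# Sketch of choosing an annotation serialization from the Accept header.
# Illustrative only; real negotiation also handles q-values and errors.

def pick_serialization(accept_header):
    """Return the media type to serve for a dereferenced annotation URI."""
    preferences = [p.split(";")[0].strip() for p in accept_header.split(",")]
    for media_type in preferences:            # honor the client's order
        if media_type in ("application/rdf+xml", "text/turtle"):
            return media_type                 # an RDF representation
        if media_type == "application/json":
            return media_type                 # the JSON representation
    return "application/rdf+xml"              # default: RDF
```

So a plain `GET` with `Accept: application/json` yields the JSON form, while anything else falls back to RDF, which is the behavior described above.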
Okay. So you see with this annotation tool, we try two things. We try to make
use of data that is out there on the Web, which is very nice for use cases like
Europeana where we have this multi-lingual search problem because we can
exploit information a user gives us. But we also give something back. We give
back the knowledge the users created in terms of data. And this is a nice cycle, I
think.
Just go back to my presentation. Yeah, here is the demo again, in case you
want to see. I think you noticed maybe the data model we used for the annotations.
This is not something we invented; it's something which is coming from a
collaboration which started approximately one and a half years ago here in the US
and also in Australia, which aims at facilitating an interoperability
environment for annotations. So the goal is really to make annotations available
in a way so that you can take them and transfer them to another system. For instance,
it should be possible to integrate an annotated map into some e-learning
system. And this of course requires some shared data model, and the Open
Annotation collaboration is working on that at the moment. And we are also
somehow involved there.
Yeah. This is how the annotation graph looks at the moment. It's an RDF graph.
You have the concept of an annotation, you have the concept of a body of an
annotation, which is more or less the content the annotation contains, or the text
plus the links. And you have the concept of an annotation target, which is actually
the medium out there on the Web which has been annotated. And you see that
everything has URIs and everything is dereferenceable via HTTP.
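That annotation/body/target graph can be written down as plain triples. A minimal sketch in Python: the resource URIs and the namespace URI below are placeholders I made up; only the annotation/hasBody/hasTarget structure follows the model described here.

```python
# Sketch of the annotation graph as a set of RDF-style triples.
# All URIs are invented placeholders; the structure (annotation with a
# body and a target) mirrors the graph described in the talk.

OA = "http://example.org/ns/oa#"                   # placeholder namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

annotation = "http://example.org/annotations/42"   # hypothetical URIs
body = "http://example.org/annotations/42/body"    # the text plus the links
target = "http://maps.example.org/historic-map"    # the annotated medium

graph = {
    (annotation, RDF_TYPE, OA + "Annotation"),
    (annotation, OA + "hasBody", body),
    (annotation, OA + "hasTarget", target),
}

def bodies_for(graph, ann):
    """Everything the annotation points at as its content."""
    return {o for s, p, o in graph if s == ann and p == OA + "hasBody"}
```

Because each node is a URI, dereferencing any of them over HTTP can return this same triple structure, which is what makes the annotations transferable between systems.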
Yeah. Future work. In the next two years at Cornell I would definitely like to
further develop this annotation -- this map annotation use case. Maybe set up
some repository for historic maps, where people can annotate maps and add
their knowledge to maps.
And it would also be nice to integrate this with systems that support
today's maps, like Google Maps or Bing Maps or whatever, because it would in my
opinion be very interesting for the user to see, for a certain location, okay, what
was this location in 1769? So this could be a very interesting future work.
I also would like to enhance the semantic augmentation a bit. At the moment it's
really simple and straightforward. We just pull in data from DBpedia, index the
labels and just search over the labels. That's trivial, and it's not optimal because it
still produces a lot of false positives. So we would like to integrate some
relevance feedback mechanism which improves the tag proposals over time.
That's one of the ideas.
And the second big idea is to also think about how some of those techniques,
especially the semantic enrichment technique can be integrated with scholarly
publications. So when somebody writes an article, how we can encourage the
author who writes the publication to create links to external resources on the
Web while he or she is creating the paper.
So this is my future work here at Cornell. And I'm always happy to get some
input. Thank you.
>> Carl Lagoze: A couple more -- so one more thing. Okay. So if I may. So
what do I do, just plug in?
>> Bernhard Haslhofer: I think so.
>> Carl Lagoze: So I wanted to close up with just one more piece of work that is
-- fits into this whole area that we're doing. This is work that we presented just
recently at the Web conference in India, and this is work with Yookyung Jo and
John Hopcroft who is in computer science. And this is really to take a look at a
field like computer science and understand how topics develop and ideas
develop over time. A lot of the work that's been done in this area basically has
either fixed time slices or has a fixed set of topics to see how they develop. And
there's been some certainly nice work there. But what we wanted to do was not
define the time slices at all. A topic could originate at any point in time,
and the topic could be anything -- basically no learning at all, just a totally
unsupervised technique.
So this is sort of a slightly representative graph, but we want to understand this
notion of as time progresses this growth of a topic into separate subtopics or
even the merger of multiple topics into a single topic. And the example we use
all the time is this notion of link based searching, which arguably was sort of
[inaudible] and Jon Kleinberg, which sort of combined, you know, these very
interesting areas of IR and graph theory and other things into an interesting
thing. And there's even more complex topic creation things. So -- topic creation
configurations.
And we have a technique that essentially does a statistical analysis of the text of
the papers -- we used just the abstracts -- and walks through time, looks through the
corpus and notices when a paper appears that has a statistically
different characteristic than the papers that preceded it. And then that
statistically different fingerprint also carries on to follower papers.
So the definition of a topic is something that differs in a way from previous papers,
but that difference actually persists over time, and it's sort of an interesting
model. And then also the linkage between topics is a statistically significant
resemblance between these two separate topic areas.
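The flavor of that idea, though emphatically not the actual method from the paper, can be sketched in a few lines: represent each abstract as a term-frequency vector, walk through the stream in time order, and flag a paper whose similarity to everything before it falls below a threshold -- a "statistically different fingerprint" in miniature. The abstracts below are synthetic.

```python
# Toy sketch of spotting a 'new topic' fingerprint in a paper stream.
# NOT the Jo/Hopcroft/Lagoze technique, just the general flavor: a paper
# sufficiently dissimilar to all predecessors is flagged as novel.

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novel_papers(abstracts, threshold=0.2):
    """Walk through time; flag abstracts unlike everything before them."""
    vectors, flagged = [], []
    for i, text in enumerate(abstracts):
        vec = Counter(text.lower().split())
        if vectors and max(cosine(vec, v) for v in vectors) < threshold:
            flagged.append(i)       # a new fingerprint appears here
        vectors.append(vec)
    return flagged

stream = [
    "relational database query optimization",
    "database query processing and indexing",
    "link analysis for web search ranking",   # unlike everything before
]
```

The real method also requires the new fingerprint to persist in follower papers and detects statistically significant linkage between topic areas, which this sketch omits.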
So we came up with some really nice results here. And I'll just briefly put this up --
this is a little bit hard to read. But this here is a breakdown of the ACM corpus
since 1972. And you can see here -- and this is actually -- we deduced
these by hand, looking at the papers. But future work will be to be able to
automatically cluster these areas.
But you can see the development of graphics over time, development of
database over time. And you can also see the linkages between areas and how
they sort of grew from each other. So -- and what's interesting here again is this
is -- the nodes here are topic areas. They're not specific papers. They're a
group of papers that fit into the topic area and then the linkages are the linkages
between topic areas.
So the results came up really quite nicely here. And there's a paper in
the Web conference this year. So to close up, last slide: a lot of the work that
we've talked about and that we're doing is sort of combined into this large project
called the Data Conservancy, which is one of the NSF-sponsored data curation
projects. And I thought I would just end by pointing out a paper that sort of
gives a large overview of understanding why cyberinfrastructure is so hard.
This is a paper that was accepted this year and presented at JCDL by my
graduate student Karin Patzke and I. And really looking at this whole
cyberinfrastructure problem as a sociotechnical problem that really needs to be
understood in the large to be able to work well.
So with that, we will discuss or end.
[applause].
>> Lee Dirks: Thank you.
>> Carl Lagoze: Thank you.
>>: I had a question.
>> Carl Lagoze: Yes?
>>: So it's a question and perhaps a weird one for both of you, but you started
your talk by saying, you know, in the beginning of your career you did these
different efforts and, you know, some of them wouldn't grow like you would have
hoped or expected, protocol based, et cetera. And I'm trying to think in your part
making reference to open annotation work, how are you able to think about these
situations and how do you make sure that open annotation does scale, does
work, and is -- you run the risk of again developing a system that is hard for its
scholars to use.
And I'm trying to -- you're not alone. Obviously there's, you know, hosts of
people around the world who come up with similar systems every day. How
do we -- how do we address that and prevent that from happening?
>> Bernhard Haslhofer: Yeah, there was a workshop two weeks ago where
exactly this question came up: how to assure that this model actually fits the
disciplines, that the data created by one discipline can really be used by the people in
that discipline, and how cross-disciplinary work should be enabled. And the
solution seems to be that this annotation model will be really something
extremely generic and that different disciplines need to create some kind of
annotate -- application profiles for their specific needs. For instance we have the
whole community of geoinformation people who are annotating whatever with
GIS coordinates, and they need some very specific aspects to be covered by the
model, and they probably will need some extensions of this model.
So it seems that there will be some kind of a core and extensions.
>> Carl Lagoze: So let me -- let me -- because I have another answer. I don't
mean by anything that we're saying that the effort to create
technological solutions is useless. The problem is when you go from the low
hanging fruit, the generic -- and I applaud the generic efforts, I think they make a
huge amount of sense and a lot of the work that I've done and other people have
done, I mean certainly you would be pointing the finger at Tim Berners-Lee and
say that was a foolish effort when you say -- if you say that these things don't
work.
But there's the sort of the hard balance between generic and too much
specificity. And we haven't found on that spectrum -- it's easy to do the generic
things, like OAI-PMH. Now there's an example of something that worked relatively
well that does a limited amount of stuff. As you sort of move up the functionality
spectrum, it becomes very hard because then you start to clash against specific
cultural notions.
So our grand notion of, you know, this great -- the fourth paradigm science that
we've talked about all along when we'll be able to do these wonderful -- this great
synthesis of data from different domains and use it in great new ways is a vision
that's great, but it's going to take a heck of a long time in understanding to get
there because it's not the low hanging fruit. It really clashes in this world. Yeah,
Mark?
>>: [inaudible] talk you mentioned how the large publishers such as the Royal
Society of Chemistry, the American Chemical Society actually inhibit
collaboration. In your opinion, what could they do to facilitate that collaboration?
Is there a magic bullet? Is there a --
>>: Or is there a bullet? [laughter].
>> Carl Lagoze: I think it takes -- Mark, I don't have a great answer, but my sort
of intuitive thinking while talking answer really relates back to the same problem
the music industry is having. It's sort of understanding what your value add is
and distinguishing that from the data level. That publishing industry would be
wise to think of themselves as service providers on top of the data layer and that
the data layer -- the data shall be free. And he or she creates the best services
wins the capitalist game. And the ACS, like the music industry and the movie
industry has tried, the entertainment industry has tried to embrace this vertical
thing which includes the data and all the services.
And they're inhibiting scholarship by doing that and they will lose eventually that
battle. So I don't know if that's -- yeah, Curtis.
>>: You know, I like the idea of annotating lots of different media outcomes but
there's always the challenge of the separation of the media from the annotation
and [inaudible] some in the idea of some of these formats would then be able to
accommodate both -- at least [inaudible] the annotation and also have it separate
as well. Do you have an approach in your annotation of how you make sure
those stay and persist over time [inaudible].
>> Bernhard Haslhofer: I don't have, but there is a group at Los Alamos working
on a project called Memento, and they added a time concept to the
annotation model so that you can at least capture the time when the annotation
was created for a certain medium, so that if the medium disappears you can at
least look it up in a Web archive or somewhere else. So that's one of the ideas. So
that you can go back in time and retrieve the annotated resource in case it
disappears.
>>: [inaudible].
>>: So that's when something was annotated. But I guess the question is if I
originally annotated the thing and I turned out to be the world expert on this
document, but the annotation is separate from the document itself, if your system
dies then, you know, I assume you have to figure how to get that data back
associated with it as [inaudible] ends up moving through [inaudible].
>> Bernhard Haslhofer: Those questions are definitely not part of the model.
But, I mean, at the end, annotations will be Web resources just like any other
Web resources, and the idea is, at least in my opinion -- I'm not speaking for
the consortium here -- to reuse techniques that are used for the same
problem in the Web context, I guess. Also for security and authentication. It's not
part of the model, it's really in the Web stack.
>>: I mean if a [inaudible] have the same problem with [inaudible] how do you
ensure that those annotations [inaudible] tagged you can [inaudible] these image
files [inaudible] video and other formats that are dynamic and [inaudible]
associate that with things that [inaudible] tiles and lots of interesting --
>> Bernhard Haslhofer: The current solution is to make sure that the URI of the
annotated media object remains dereferenceable and returns the representation
>>: Right.
>> Bernhard Haslhofer: That it was at the time of the annotation, yeah.
>>: [inaudible].
>> Carl Lagoze: Bernhard, I have a question. So is anybody in the annotation
consortium doing work on automatically connecting annotations to things? So I
was -- when you were talking I was thinking of Twitter. And my big complaint
about Twitter is it's -- Twitter alone is an incredibly decontextualized thing. But I
am learning -- I am basically annotating. And there's lots of examples. I mean
it's the walled gardens of the Web are actually connected in implicit ways. And is
anybody doing work in taking the annotation model and using it for those types of --
>> Bernhard Haslhofer: Yeah, I think if you look up the standard or the alpha
version of the standard, there's the example there because the tweet has got a
URI and you can use it as an annotation on something, so it's considered, yeah.
>> Carl Lagoze: Okay.
>>: And you're using [inaudible] like that to condense them, that becomes an
intermediary which, if they disappear [inaudible], so it's how do you ensure the
persistence of the connection between the annotation which is so important,
sometimes more important and a lot of times [inaudible] might have copies of.
>> Carl Lagoze: Well, you know in that realm you know the great controversy
when they threw out library cards in the libraries. Where the library cards have
these wonderful annotations or the smudges.
>>: [inaudible].
>> Carl Lagoze: Yeah. That's a wonderful example of annotation that was not
intended to be annotation.
>>: Absolutely.
>> Carl Lagoze: Yeah.
>> Bernhard Haslhofer: But you're actually targeting one of the big problems of
this whole data on the Web discussion. So the problem is what happens if the links
are gone.
>>: Yeah.
>> Lee Dirks: Anything else?
>>: No, we'll just end on that eternal problem. [laughter].
>> Lee Dirks: Thank you all for coming. Bernhard and Carl, thank you very
much.