>> Lee Dirks: Good afternoon. My name is Lee Dirks and I am the director for education and scholarly communication at Microsoft Research, and it is my pleasure today to introduce two speakers who are joining us from Cornell University's Department of Information Science. One is a long-term colleague and collaborator, Dr. Carl Lagoze, and the other is his colleague Bernhard Haslhofer. They will be talking to us today about research on scholarly practices and communication at Cornell Information Science. And I'd like to introduce and welcome to the podium Carl Lagoze. >> Carl Lagoze: Thanks, Lee. Thank you all for joining virtually and physically. I really appreciate it. I'm going to give an overview today of work we've been doing at Cornell on scholarly practices and communication. I'm going to start off and bounce around some research projects we've been involved in, to motivate the work. Bernhard, who joined us recently in March, is going to follow up and talk about the work that he's doing, and we'll connect it all under one umbrella. If people in the audience have any questions, please feel free to ask or comment at will. So, just very briefly: I always feel the need to evangelize and advertise Cornell Information Science, which I think is a unique entity. Our view of information science is really built around these three hubs here: technology, so information systems; human-computer interaction, so how humans interact with technology; and how social rules and cultural norms interact with all of those things. So we really bring technology, people, and society together under one umbrella. We advertise ourselves as the information school that was never a library school, and you could say that has good aspects and bad aspects. 
But the nice thing about our program is we link all the way out across campus to a whole bunch of different areas, including law, computer science, psychology and other things. So it's a really interesting, new program. The department has been in existence for about four years now, and I'm really proud to be a member of it, and I think we're making great strides. As for the people whose work I'm going to be talking about today -- as Robert Wilensky, my colleague at Berkeley, once said, the PI does the PowerPoint and the rest of the people do the work. So it's Bernhard, who recently joined the group; Yookyung Jo, who is a PhD student in computer science; Karin Patzke, who is a first-year student in information science; Theresa, who is finishing up her degree and going to join me as a post-doc; and then a few other people, including Paul Ginsparg, whom you might know, who has done the major work on arXiv, and Simeon Warner, his colleague. So a bunch of great people. And to set the stage, many of you might know David Levy, who is a faculty member in the Information School at the University of Washington. Lee, do you know David? >> Lee Dirks: I know the name. >> Carl Lagoze: Yeah. So David has a CS PhD from Stanford, and he describes himself as a computer scientist who went astray. And I find myself more or less in that position. The background of my work is highly technical; on the screen here are a bunch of projects that I've been involved in. The Digital Libraries Initiative, where I did a lot of infrastructure work. I created a system called NCSTRL -- pronounced "ancestral" -- which is an infrastructure for technical reports on the Web. NSDL, the National Science Digital Library. And I'm partly responsible, with Sandy Payette, for Fedora Commons and for the Open Archives Initiative -- work that was highly protocol based and technology based. 
And after years of working in this very technical manner, I continually got frustrated as to why the work was being adopted by a limited number of audiences but not going into large-scale deployment. And that led me to think more about what's interesting about infrastructures: what is it that really makes it difficult to develop infrastructure that works across a bunch of different communities, and why are we trying to drive things with technology when all of the standard paradigms of how technology permeates society are these sociotechnical, sociocultural models? So I've begun to think a lot more about that. And the recent work I've been doing with people in my group is understanding how to reconcile cyberinfrastructure and scholarly culture. This, by the way, is the Millau Bridge. If you have not seen this, it's a bridge in France, one of the great architectural wonders. That's not a science fiction picture, it's a real bridge. So what I want to talk about today is some aspects of this research, looking at it from the top level to understand what the problem is. And this work stems from this broader vision of interoperability: how do we make systems work across different cultures and different domains? And of course, our standard Tower of Babel view here. How do we get a technical protocol that works among people that work in very different manners? The foolish among us -- and here I'm talking about myself years ago -- were naive enough to think that this was a relatively easy problem: you just came up with a protocol. And this is a paper that the late Barry Leiner published on some work that came out of Cornell, an open architecture for the worldwide digital library. This view that information and knowledge -- information artifacts -- could fit into a single architecture was something we naively thought in 1998. 
And since then we have rethought that and understood more of the complexities. So I proudly join a large number of people who thought that the Web was going to come along, these architectures were going to come along, and the world was suddenly going to change. The revolution was going to happen. And I always hate to keep overusing Stevan Harnad's name, but Stevan is my poster child for someone who would get up and say, look, physicists did this, everybody else is going to do it any day now. And excuse the picture of him holding an iPhone -- it's not my picture. But this notion that the revolution was inevitable, that as soon as we deployed this architecture everybody was going to change and all these issues with intellectual property and scholarly cultures were going to go away, was really obviously false. And it really pretended that scholars were a bunch of sheep, that they just did things en masse and got themselves sheared in the same way and herded in the same way. The reality is that adoption of these technologies really varies. It varies from relative disinterest -- why should I care? -- to slightly more violent reactions of I don't care, I don't want it, and this is what I think of it. So we have found through the years -- and I'll be talking about chemists, not that this is my representation of chemists -- that there are a lot of reactions in a lot of communities of we simply don't want this, it violates what we want. You know, an interesting thing is I'm a real fan of citizen science. Get up and talk about citizen science in some venues and people will literally start throwing mud at you: don't you realize that violates the basic notions of how science is done, of professional science? So understanding these very clear differences between communities is very important. 
So if you haven't seen this, this is I think one of the best documents available. This is a white paper that came out of an NSF workshop: Understanding Infrastructure: Dynamics, Tensions, and Design. It's a wonderful paper that talks about how difficult this phenomenon called infrastructure is to develop. And it uses a lot of historical precedents such as railroads, the banking system, the electricity system and the telephone system. And it really points out this basic sociotechnical paradigm: that the path between the technological and the social is not static, that there's no one correct mapping, and that cyberinfrastructure will only develop when social, organizational and cultural issues are resolved in tandem with the creation of technology -- that you must approach the difficult thing of society and culture at the same time you do the relatively easy thing of technology. And I say relatively easy to quote the late Jim Gray, who said, if only all of our problems were code. So that's the basis of the work that we build upon. And really, when we look across science, across scholarship, we see a bunch of different islands that segregate it in interesting ways. We see strong methodological islands: people use different scientific methods, the scientific method varies, and the technology has to vary with it. Epistemological islands: the general way they view the world, their notion of knowledge. And cultural islands: the cultures that have developed. These are all somewhat independent dimensions that we need to understand. So the way you start this research is by trying to understand what the analytical unit is. I mean, a basic research problem as you look at this issue is: what are you defining or attaching the word culture to? Is it a discipline? I think we'd all agree that the discipline is really too coarse a granularity. 
You can say chemists, but suddenly you have synthetic chemists, you have theoretical chemists, you have physical chemists, and the same thing across a lot of other disciplines. Is it a subdiscipline? Well, I don't think that works very well either. Is it a research group? That gets you down to the level of impossibility, because if it's individual research groups, you can't do much of anything. Or is it this thing called the invisible college, a term other people have used to describe these groups of people who collaborate across institutional lines in some unique manner. So the issue here is to understand that, and to pick a really interesting community to start your research on, which is chemistry. Chemistry is a particularly challenging area, right, Mark? That's right. We have spent about the last three or four years looking at chemists and working with chemists to understand why this is such a difficult area. And some things that stand out about why it's so difficult: first, the commercial value of chemical information. I mean, astronomers will tell you that what's wonderful about our data is nobody would pay for it. That's certainly not true of chemical information, especially in the pharmaceutical industry. And then there's just the nature of the research culture. You have the predominance of this thing called synthesis, a creation culture which overshadows discovery. Really the concern here is not to discover, to look at the world; it's to figure out how to make new things, in these sort of secret methods of making new things. And that leads to a highly autonomous culture in which successful research really relies not on a lot of other people; it relies on being in a lab and figuring out how to get these things to work together. And then there's the really difficult problem of the scholarly societies. 
The ACS, the American Chemical Society, particularly in the United States, is simply violently opposed to open access, and opposed in many other ways. People who don't know this always love this example: the core identifier system in chemistry has copyright issues attached to it. You cannot use the identifiers without paying the ACS for them. It's kind of hard to build infrastructure if you can't share identifiers of objects. The RSC is considerably better, but still has some licensing issues -- though they're getting better. So, to go back to the historical notions of some of the things that happened here, one of the things we looked at at the beginning of our work is this question. In physics, arXiv came along with phenomenal success. Paul Ginsparg invented it in 1991 in high-energy physics, and it spread from there to other fields of physics. It became the venue for publishing; it basically changed the way they published. When they tried to do the same thing in biomedicine, it became highly modified, and the hybrid looks very little like the original model. And in chemistry it crashed and burned. The attempt to have an open archive of the chemistry literature just went away, and the societies basically tried to shut it down. So, you know, we have to ask why these things happen. For a really big overview -- and I can compliment this work even though my name's on it, because it's mostly Theresa's work -- my student Theresa and I ran an NSF workshop in DC about two years ago, and out of that came two papers: a Nature Chemistry article and a longer white paper whose URL is included in this slide. And it gives a very long, very highly detailed analysis of what the problem with chemistry is, what's going on here. It talks about the technology, and it talks about the culture. 
So if you're interested in this subject, I recommend that you take a look at these. That's the handle to the paper. Okay. So how do you study this problem? I mean, we can all sit here and do back-of-the-envelope reasoning -- okay, disciplines differ -- but how do we understand this? This really requires innovative methods to understand why this problem exists. So I'm going to talk in particular about some work that Theresa, Asif-ul Haque, and I have been doing, described in this paper published in Scientometrics, to understand the fundamentals of the problem -- why it exists and where we might go from here. We ask basic research questions like: how do the socio-intellectual characteristics of fields shape the communications that they carry on? How does the communication culture reflect the practice culture, how do they interact, and what's the synergy between them? And within a field such as chemistry, how are there differences within the field itself? We've adopted what I think is a very innovative approach, which is to combine ethnography and network analysis and take advantage of the strengths of each. Ethnography is great, but the problem with ethnography is it's very expensive. It's very, very labor intensive, and a classic ethnographic study can only study a few groups. Network analysis is great because you can handle huge amounts of data, but the problem is that you do not see the nuances and subtleties. So what we try to do is combine those two techniques, with one enhancing the other -- and I'll talk about how that's done in a second -- to come up with a mesoscopic level of analysis: to understand the co-authorship relationships, how they cluster, and the relationships between those clusters, to understand what the collaboration patterns in chemistry are. 
And that's -- we're trying to understand how collaboration patterns reflect communication cultures, how communication cultures reflect collaboration patterns, and how technology might either interfere with or enhance those collaboration patterns. Because if you don't understand the way scientists work together, you're not going to be able to build technology to enhance that. So here's an overall view of the technique. I think I can get a pointer here. I thought I could get a pointer here, but I can't get a pointer here. >> Lee Dirks: You have a laser pointer too. >> Carl Lagoze: That thing. And the pointer goes like that. Okay. So the basic technique looks like this. On the left-hand side are the ethnographic field studies, on the right-hand side is the network analysis. We do a delineation of the field: we understand what the units of analysis are. We then bounce from that and build a citation network within the individual fields. We then bounce back: we sit in on some main labs in those fields and work with them to understand what they do, and then develop co-authorship networks in those groups. We take the networks back to the leaders of those groups, have them look at our clusterings, and understand what they mean to them and how we might modify our clusterings. And we use this bidirectionality of going from ethnography to network analysis and back and forth to come up with a synthesis of field differences in the cultures. The technique works really nicely because you get these aha moments in which an actual researcher will look at a clustering that you've done and say, hey, that person there reflects what we're doing. 
That sort of pattern reflects what we're doing, without us trying to infer from it information that is really cultural. So the work we've done has studied five labs in two major fields. This field here included synthetic chemistry, theoretical chemistry, theoretical physics and experimental physics. This was mainly what we call an instrument-building culture: a field that relied on building large instruments and using those instruments. And this was another field that includes inorganic chemistry, polymer chemistry and organic chemistry. This was a more synthetic culture that did substance building. And this picture leads to the breakdown in the analysis and the collaboration patterns, which are really based on these two notions: people who work with large instruments that cost a lot of money and take a long time to develop, and people who work on creating substances in individual labs. And we'll see very different collaboration patterns in the two of them. So we spent a lot of time with those people. We attempted to understand their research cultures: what the values in their domains are, what counts as competence, how they share information. And from there came some interesting analysis that I'll talk about in a moment. To go from the fields to actually doing the network analysis, we worked with the leaders of these labs to develop a lexical query into the Web of Science to come out with a co-authorship graph. So what we wanted to do was develop -- I'm sorry -- a citation graph that represented the breadth of the field. And we wanted to come up with something that actually had the recall and depth in the field, and we could use article self-citations to understand whether we were getting enough depth. 
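[The depth check described here can be sketched in a few lines. This is an illustrative sketch, not the study's actual code: it assumes we have the set of papers a query retrieved and, for each, the earlier papers it cites that share an author. If retrieved papers self-cite work the query missed, the query is too shallow.]

```python
# Sketch of a self-citation depth check for a lexical field query.
# All paper ids and citation data below are invented for illustration.

def self_citation_coverage(retrieved, self_citations):
    """retrieved: set of paper ids returned by the query.
    self_citations: dict mapping a paper id to the ids of earlier papers
    it cites that share at least one author (article self-citations).
    Returns the fraction of self-cited papers the query also found."""
    cited = set()
    for paper in retrieved:
        cited.update(self_citations.get(paper, set()))
    if not cited:
        return 1.0  # nothing to check against
    return len(cited & retrieved) / len(cited)

# Toy example: paper "p3" self-cites "p1" and "p2", but the query only
# retrieved "p1" -- half of the self-cited work is missing.
retrieved = {"p1", "p3"}
self_cites = {"p3": {"p1", "p2"}}
print(self_citation_coverage(retrieved, self_cites))  # 0.5
```

A coverage well below 1.0 suggests broadening the query terms before building the citation network.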
The query needed to reveal the things that, say, Carl Lagoze published and self-cited, and then the breadth to actually cover the people that I would collaborate with and reference. It's a rude and crude method, but we've basically come to the conclusion that using this bibliographic mapping, this citation map, is a really good way to delineate these units of study. And what comes out of that is a huge citation network that we then use to create a co-author network. We then cluster it, classify the nodes in the clusters, and compare the clusters with the field observations: we interview the group leaders and try to figure out whether these graphs represent what they think of the world. And what's interesting when you do that is that you come up with some very nice patterns. So I've put up here three different topologies of groups that represent collaboration patterns. Here you can see what stares out at us is a classic PI-led research group. This is a co-authorship graph again: the nodes are authors, and there's an edge if I co-authored with you. And here is the strong leader who insists that every paper be published with him or her. Here's the research institute. I would think that Microsoft Research sort of looks like this: you have people who walk down the hall and say, oh, we have similar research, you start to work together, and then you co-author with each other. And then here you have an international research network, where you have PIs in separate countries and some crossing of boundaries between them. So we have a bunch of topologies like this, and you can reflect these topologies back in the ethnographic studies. People look at them and say, hey, I know who that is. And I know who that is. And things like that. 
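[The topologies described above fall out of very simple graph statistics. Here is an illustrative sketch, with invented author names, of how a PI-led group shows up as a star in a co-authorship graph: one hub co-authoring with everyone, everyone else with few partners.]

```python
# Build co-author degree counts from lists of per-paper author names.
# A PI-led lab produces one dominant hub; a research-institute pattern
# produces a denser mesh. The papers below are invented.
from itertools import combinations

def coauthor_degrees(papers):
    """papers: list of author-name lists (one list per paper).
    Returns each author's number of distinct co-authors, i.e. that
    author's degree in the co-authorship graph."""
    neighbors = {}
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    return {a: len(ns) for a, ns in neighbors.items()}

# PI-led lab: the PI appears on every paper with rotating students.
pi_lab = [["PI", "S1"], ["PI", "S2"], ["PI", "S3"], ["PI", "S4"]]
print(coauthor_degrees(pi_lab))
# one dominant hub ('PI' with degree 4), students with degree 1 each
```

The same counts on an institute-style publication list would show many medium-degree nodes instead of a single hub, which is the visual difference between the first two topologies on the slide.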
So what also comes out, if you analyze this at the mesoscopic level, is that you start to understand the roles of different people. So here's a bunch of groups that interact very closely together, and here are bridge people. And again, we can show these to people, and they say, I know who that is. So you have this strength-of-weak-ties type of phenomenon occurring there. And then you have these other fields where you have intensive collaboration between the groups themselves, and you don't have bridge people; you have many ties going across. Often this is a graduate student of this person who has now moved over to this group over here, who still co-authors with his or her colleagues over here but works with people over here, and forms the weak tie over there. This is all documented in the Scientometrics article. And what's interesting, if you go one step further, is you can start to classify the nodes. You can classify people as non-hubs or hubs, and we have a whole classification of nodes and their proportion of the population. So you can really come up with a notion of what role each person plays in this interesting social network based on co-authorship. The results here have been quite nice. And you can even extend that into the geographic realm -- so, the yellow nodes are people in Asia, the blue nodes are the United States, and the red nodes are Europe, or something like that. You can understand how geographic dispersion affects co-authorship. So we did all this work, and we were sitting around one day looking at this and saying, gee, those Asian researchers really collaborate a lot. Isn't that interesting how much they collaborate? And then we had this sudden realization of the potholes in the road to thinking this research was legitimate. 
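[One crude way to spot the "bridge people" mentioned above: find authors whose removal splits the co-authorship graph into disconnected clusters. This is only a sketch of the intuition -- the actual node-role classification in the paper is more refined -- and the adjacency data is invented.]

```python
# Identify cut-vertex "bridge people" in an invented co-authorship graph.
def components(adj, removed=None):
    """Count connected components of an adjacency dict, ignoring nodes
    in `removed`."""
    removed = removed or set()
    seen, count = set(), 0
    for start in adj:
        if start in seen or start in removed:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(n for n in adj[node]
                         if n not in seen and n not in removed)
    return count

def bridge_people(adj):
    """Authors whose removal increases the number of components."""
    base = components(adj)
    return [v for v in adj if components(adj, {v}) > base]

# Two tight triangles of co-authors, linked only through "B" --
# B is the weak tie between the groups.
adj = {"A1": {"A2", "A3", "B"}, "A2": {"A1", "A3", "B"}, "A3": {"A1", "A2"},
       "B": {"A1", "A2", "C1", "C2"},
       "C1": {"C2", "C3", "B"}, "C2": {"C1", "C3", "B"}, "C3": {"C1", "C2"}}
print(bridge_people(adj))  # ['B']
```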
It's the fact that often one name is really two people, and one person often has two names. The reason the Asian researchers appeared to collaborate so much with each other is that not all C. Wangs are the same C. Wang, and not all S. Woos are the same S. Woo. So we ran into a problem quite familiar to people at Microsoft who have been doing academic search, which is the name disambiguation problem. And it turns out that that problem distorted our results really badly, and we had to go back and spend six months working on the problem of author name homonymy, which is a very difficult problem. Given the fact that features are very hard to extract from academic papers, and that the strength of those features, such as institutional affiliation, is very difficult to judge, it becomes quite a hard problem. So we have a paper that we're presenting at JCDL this year that talks about doing this with very sparse metadata. The way we're doing it is by co-author affiliations, and it came up with very nice results. So if I have a Wang who published a paper with Lagoze and Warner -- and I have the privilege of having a fairly unique name -- then, given enough statistical evidence, we can pretty reliably disambiguate that from the paper by Wang, Zduncyk and Gagner and say that these are different Wangs. And our results came out really nicely there. The problem that remains, of course, is the Wang, Sun, Kim, Yang papers. If you look at our results in the paper, you'll see we have really, really nice results for this world. That world is just -- I mean [inaudible]. So we paid for about a hundred hours of student work to build a gold standard. 
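[The intuition behind co-author evidence can be sketched very simply. This is not the JCDL paper's algorithm, just an illustrative greedy version with invented papers: two papers by an ambiguous name are merged only if they share a distinctive co-author; otherwise they are assumed to be different people.]

```python
# Greedy sketch of co-author-based author-name disambiguation.
# Papers and names are invented for illustration.
def split_author(papers):
    """papers: list of co-author name sets for one ambiguous name.
    Greedily merges papers that share a co-author; each resulting
    cluster of paper indices is treated as one distinct person."""
    clusters = []  # each cluster: [paper indices, union of co-authors]
    for i, coauthors in enumerate(papers):
        merged = False
        for cluster in clusters:
            if cluster[1] & coauthors:  # shared co-author: same person
                cluster[0].append(i)
                cluster[1] |= coauthors
                merged = True
                break
        if not merged:
            clusters.append([[i], set(coauthors)])
    return [c[0] for c in clusters]

# "C. Wang" on four papers: two with Lagoze/Warner, two with a
# disjoint group -- the sketch infers two different Wangs.
papers = [{"Lagoze"}, {"Lagoze", "Warner"}, {"Zduncyk"}, {"Zduncyk", "Gagner"}]
print(split_author(papers))  # [[0, 1], [2, 3]]
```

A real system would weigh how distinctive each shared co-author name is (sharing "Zduncyk" is strong evidence; sharing another "Kim" is not), which is exactly where the Wang-Sun-Kim-Yang papers stay hard.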
And even the students couldn't figure it out, just by searching all over the place, searching Bing, searching Google, whatever; they couldn't disambiguate these. So if humans can't do it, we have a hard time with machines. But just to show you the results here: these graphs are broken down by node role. And what's important to recognize is that since our co-authorship graph has different roles in it, mistakes in one role are much worse than mistakes in another role. And can I write on the board, because -- >> Lee Dirks: Yes. >> Carl Lagoze: So if you have one of these subgraphs, and you assume that person is -- say that's the person in row four, which is a bridge person -- if that's really five people, that grossly affects the topology of your graph. So we have to be careful about how disambiguation techniques are biased with respect to our type of graph. So the problem becomes more complex. But in any case, without fully describing this result, the reds are really bad disambiguation problems, and you can see that with our disambiguation algorithm there are very few reds left in here. So you can take a look at this paper; it reports those results. So, to conclude on that and then pass it over to Bernhard: our initial results show some very interesting differences in collaboration patterns between fields that rely on either individual achievement or collective achievement. The collaboration patterns are quite different, and they challenge one of the prejudices we tend to have, which is that sharing is great. In this world, sharing is not absolutely great, so the technology has to adapt to those field differences and cultural differences. I'll pass it over to Bernhard, and then I'll close up with a couple of last comments. Thank you. >> Bernhard Haslhofer: Yes, thank you, Carl. 
I will now focus on one specific technique in the area of scholarly work and scholarly communication, which is annotation, and I will demonstrate one particular tool we built back in Vienna. I originally come from Vienna; I came to the US two months ago. That's also the reason why I'm crediting my colleagues back in Vienna. I will demonstrate our annotation tool and how you can apply it to one specific use case, which is the annotation of historic maps. We will see later that this is a use case which targets several domains, so it's a very interesting one. In the demo which I will give at the end, you will also see how we make use of the vast amount of data that is out there on the Linked Data Web. I'm speaking of DBpedia, GeoNames and all the other sources out there. We believe that this is quite an interesting combination, and that scholars, as well as institutions running infrastructures where people can annotate, could benefit from pulling in data from the data Web. Okay. From Carl's talk we know that communities are diverse. So the first question that usually comes up when I start talking about annotations is: okay, what is an annotation? There are so many definitions out there in the literature, and definitions that you find might suit one discipline but be completely unsuitable for another. So whenever it comes to designing annotation tools, annotation models and protocols, the first question to be resolved is: okay, what is an annotation? It's a really difficult question. And I found one very nice abstract definition in Palmer's paper -- the reference is on the last slide. It simply says: annotation is note taking to manage, interpret and coordinate sources. But I don't want to stick with definitions; I'll go on to some concrete examples. This is an example I took from a colleague, from the [inaudible] repository. 
There, people have the possibility to annotate works from Pico, to add their interpretations to ancient texts. So this is one facet of annotation in one particular community. The link to the slide is down there. Here we have another use case of annotations. This is coming from the Australian literature project, where scholars have the possibility to annotate text passages and add additional knowledge to those passages. Here, for instance, somebody explains what the term young ladies means by means of an annotation. Here we have another example from, I think, the biochemistry domain; I took this example from a colleague at the University of Colorado. In this domain they're annotating words within the text to learn more about the semantics of a term. That's a completely different notion of annotation. And this is my very personal view of annotation: the technique I use most often is actually the annotation of PDFs, to mark important sections in a document. This is also a very important aspect of annotation. Okay. So what is YUMA? YUMA is a tool we built to annotate multimedia objects on the Web. It's actually a Web application which consists of a server component and several clients. The clients are media specific, so we have a client for video, one for audio, one for images, one for maps. And these clients can be used for different use cases. We developed this tool back in Europe in the context of Europeana. Europeana is a European digital library project which aggregates multimedia content from cultural institutions all over Europe. So there's a really large pool of media objects out there which can be annotated by scholars, but also by the public, by normal Internet users. YUMA is also open source software. It's available on GitHub; anyone can pull the source code. We would also be happy if somebody pushes to it, or at least files some issues, so that we can further develop this code. So it's completely accessible and free. 
YUMA is also, in our opinion, an infrastructure for collaborative research, because with these tools people really have the possibility to add their thoughts, their ideas, directly to the media objects. And this is a very important communication mechanism in scholarly research. With this slide I just want to emphasize that the demo today will be about historic map annotation, but YUMA is really more: it's a multimedia annotation framework. So we have Web clients for annotating videos, audio files on the Web, and images on the Web. The video and audio clients were originally written in Flash but are currently being completely rebuilt in HTML5, so we want to get away from Flash. The image client is written in JavaScript but built with the Google Web Toolkit, and the same is the case for the map annotation client, which I will show you later. Why did I take the map annotation use case? I took it because it's a very interesting use case -- I don't know how I get rid of this -- because it targets quite different disciplines. You can study historic maps from a historical point of view, of course, to learn about how country borders have changed. But from another point of view, you can also study the technologies that were used at a certain point in time to construct maps or to produce knowledge about geographical units. Certain maps are also pieces of art, so you could study them from an artistic perspective. And of course, most importantly, maps are still the predominant medium to transport geographical information. So you see there's quite a variety of disciplines involved in this particular use case, and that's why we think it's quite interesting. Okay. But now it's demo time. I prepared a screencast -- just start again. Okay. This screencast shows you how you can use our tool to annotate digital maps for this use case. 
I took -- we took a map from the Library of Congress, so you should be familiar with the geographical region here. You can simply start annotating this map by giving the URI of this map to our annotation tool. Here I'm zooming in and zooming out of the map just to demonstrate that we actually use high resolution maps tiled into smaller images, the same way we tile normal maps. And we support two types of annotations. The first type is georeferencing annotations; the second type is semantically augmented annotations. I'll start with the first type. The georeferencing annotations allow you to associate geographical points with ancient maps. So here, for instance, I identified the city of Philadelphia on my historic map and I say, okay, this point is Philadelphia. I enter this information in my annotation box. The system automatically fetches geo information from GeoNames and adds this geo information to this control point. To properly georeference a map you actually need four control points in order to project today's coordinate system onto the coordinate system of this map, this ancient map, so that you can use geographical information system features on historic maps as well. So here I'm annotating the city of Detroit. I say, okay, this is Detroit, and later on I create one more point, one for Chicago, because I know that at this point where the river flows into Lake Michigan this should be Chicago. So I created four georeferencing annotations for this map. And now I've somehow projected today's coordinate system onto this map. And now I can overlay any kind of KML vector data on this map. For instance, here I'm selecting country borders. A very simple example, but it shows you what you can do with it. And this allows you to nicely embed a historic map into today's map. You can imagine that instead of this map you can take any other map. Okay. 
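The georeferencing step described here can be sketched in code. This is an illustrative reconstruction, not YUMA's actual implementation: given control points pairing pixel positions on the scanned map with modern longitude/latitude values, three points are enough to determine an affine transform (YUMA collects four, which over-determines it for robustness); the sketch below fits the transform from three points and then maps any pixel to geographic coordinates.

```python
def solve3(M, b):
    # Solve a 3x3 linear system M @ x = b by Gaussian elimination
    # with partial pivoting (pure stdlib, no numpy required).
    A = [row[:] + [bv] for row, bv in zip(M, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
    return [A[i][3] / A[i][i] for i in range(3)]

def fit_affine(pixel_pts, geo_pts):
    # Fit lon = a*x + b*y + c and lat = d*x + e*y + f from
    # exactly three control points (pixel -> geo pairs).
    M = [[x, y, 1.0] for x, y in pixel_pts]
    abc = solve3(M, [g[0] for g in geo_pts])
    dev = solve3(M, [g[1] for g in geo_pts])
    return abc, dev

def to_geo(coeffs, x, y):
    # Project a pixel position on the scanned map into the
    # modern coordinate system using the fitted transform.
    (a, b, c), (d, e, f) = coeffs
    return a * x + b * y + c, d * x + e * y + f
```

Once the transform is known, overlaying KML vector data amounts to running each modern coordinate through the inverse mapping onto the scanned image's pixel space.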
So now the second type of annotation, which is the semantically augmented annotation. This allows users to add comments to regions on a map. And I created a comment for the region I know best in the States here, which is the Finger Lakes region. I just moved there recently. So I'm saying, okay, this must be Cayuga Lake, at least one of those lakes, and then I add my knowledge about this region. And you see that in the background there is some text popping up. And these tags are actually suggested based on the geographic region this annotation covers, but also on the annotation text the user [inaudible], so it's analyzed during the annotation creation process, actually. And these tags are actually not just string tags; these are links to resources on the Web, to resources in GeoNames but also to resources in DBpedia. DBpedia is just a structured version of Wikipedia. So we give the user the possibility to check tags which are in his or her opinion relevant for this annotated region, and gain additional semantic information for this region but also for this map. So for instance, here there are some false positives as well, like the Australian capital, but Ezra Cornell, the founder of Cornell University, is somehow related to this region. Central New York is obviously related. Cornell University is in this region. Cayuga Heights is part of Ithaca, so these are also related. And the important thing is that these are not string tags. These are links to DBpedia, and behind those links is extremely valuable semantic information which we can pull into our system, and we will show you later on how the system, or the organization running the system, can benefit from this at the end. Okay. So you see, as we create the annotation, at the same time, during the authoring process of an annotation, we create links. And we believe that these links have quite a high quality, because it's the annotation creator who acknowledges the semantic validity of a link. Okay. 
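The suggestion step can be sketched as a toy: candidate resources whose labels appear in the annotation text (or fall within the annotated region) are proposed as tags, and the user confirms the relevant ones. The candidate list below is illustrative, not a live DBpedia or GeoNames query, and real matching would be more sophisticated than substring search.

```python
# Hypothetical candidate resources, as they might come back from a
# DBpedia/GeoNames lookup scoped to the annotated region.
CANDIDATES = {
    "http://dbpedia.org/resource/Cayuga_Lake": "Cayuga Lake",
    "http://dbpedia.org/resource/Cornell_University": "Cornell University",
    "http://dbpedia.org/resource/Canberra": "Canberra",  # a false positive
}

def suggest_tags(annotation_text, candidates):
    """Propose resource URIs whose label occurs in the annotation text.
    The user then checks the relevant ones, filtering out false positives."""
    text = annotation_text.lower()
    return [uri for uri, label in candidates.items() if label.lower() in text]
```

Because the proposals are URIs rather than string tags, each accepted tag carries the structured data behind the link into the system.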
What we do now is we search for annotations. It's a very simple use case, but before we search, I show you that this data record has been created. This is the annotation I created before, with the text plus the links. You can open it again, see the annotation. And now let's just assume you're a user and you want to search for a map. If we don't have annotated metadata or any annotations, we won't find anything. With annotations we can at least search over information a user has provided. We will find something when we search for Finger Lakes, because the user wrote it's about the Finger Lakes. That's trivial. We will also find something for Cornell University, because that's in the text. That's trivial. But now we start searching in different languages. So I'm searching for lago Cayuga. It's Italian. I also find this record. Now I'm searching in Finnish. I also find this record. Now I search in Japanese. I also find this record, and the same for Chinese. So one of the benefits of pulling in those data from DBpedia is that you can enable a very simple multilingual search feature. And this is especially important in Europe, where we have twenty-whatever languages just in Europe. Okay. Now I show you that all the annotations that are created by the users are shared as linked data on the Web, or at least as data on the Web. So each annotation gets a URI, and this is dereferenceable via HTTP. So you can simply send an HTTP request against an annotation and you receive the complete annotation, represented in RDF in this case, via HTTP. We also support JSON if somebody wants to go for JSON. Okay. So you see, with this annotation tool we try two things. We try to make use of data that is out there on the Web, which is very nice for use cases like Europeana, where we have this multilingual search problem, because we can exploit information a user gives us. But we also give something back. We give back the knowledge the users created in terms of data. And this is a nice cycle, I think. 
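The multilingual search trick reduces to a simple idea: once an annotation is linked to a DBpedia resource, index that resource's labels in every available language, so a query in Italian or Finnish still matches the record. A minimal sketch, with label values that are illustrative rather than a live DBpedia lookup:

```python
# label (lowercased) -> set of annotation identifiers
INDEX = {}

# Labels as they might come back from DBpedia for the Cayuga Lake
# resource; the Italian and Finnish values here are assumptions for
# illustration, not verified DBpedia data.
LABELS = {"en": "Cayuga Lake", "it": "Lago Cayuga", "fi": "Cayugajarvi"}

def index_annotation(ann_id, linked_labels):
    """Index one annotation under every language label of its linked resource."""
    for label in linked_labels.values():
        INDEX.setdefault(label.lower(), set()).add(ann_id)

def search(query):
    """Exact-label lookup; a real system would tokenize and rank."""
    return INDEX.get(query.lower(), set())

index_annotation("ann:1", LABELS)
```

The same record is now findable from any of the indexed languages without the annotator having written a single word in them.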
Just to go back to my presentation. Yeah, here is the demo again, in case you want to see it. I think you noticed maybe the data model we used for the annotation. This is not something we invented; it's something which is coming from a collaboration which started approximately a year and a half ago, here in the US and also in Australia, which aims at facilitating an interoperability environment for annotations. So the goal is really to make annotations available in a way so that you can take them and transfer them to another system. For instance, it should be possible to integrate an annotated map into some e-learning system. And this of course requires some shared data model, and the Open Annotation Collaboration is working on that at the moment. And we are also somehow involved there. Yeah. This is how the annotation graph looks at the moment. It's an RDF graph. You have the concept of an annotation, you have the concept of a body of an annotation, which is more or less the content the annotation contains, or the text plus the links. And you have the concept of an annotation target, which is actually the medium out there on the Web which has been annotated. And you see that everything has URIs and everything is dereferenceable via HTTP. Yeah. Future work. In the next two years at Cornell I would definitely like to further develop this map annotation use case. Maybe set up some repository for historic maps, where people can annotate maps and add their knowledge to maps. And it would also be nice to integrate this with systems that support today's maps, like Google Maps or Bing Maps or whatever, because it would in my opinion be very interesting for the user to see, for a certain location, okay, what's this location in 1769. So this could be very interesting future work. I also would like to enhance the semantic augmentation a bit. At the moment it's really simple and straightforward. 
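The annotation/body/target structure described here, with every node identified by a dereferenceable URI, can be sketched as a JSON serialization of the kind the tool might return alongside RDF. The shape and URIs below are illustrative, not YUMA's actual identifiers or the Open Annotation Collaboration's exact property names:

```python
import json

# One annotation node linking a body (the comment plus its DBpedia
# links) to a target (the annotated medium on the Web). All URIs here
# are made up for illustration.
annotation = {
    "@id": "http://example.org/annotations/1",  # the annotation itself
    "body": {
        "@id": "http://example.org/annotations/1/body",
        "text": "This must be Cayuga Lake.",
        "links": ["http://dbpedia.org/resource/Cayuga_Lake"],
    },
    # the medium out on the Web that was annotated
    "target": "http://example.org/maps/historic-us-1769",
}

# An HTTP GET against the @id would return this structure (or its
# RDF equivalent) via content negotiation.
serialized = json.dumps(annotation, indent=2)
```

Because body and target are separate addressable resources, the same body could in principle annotate several targets, and a target can accumulate annotations from many systems.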
We just pull in data from DBpedia, index the labels, and just search over the labels. That's trivial, and it's not optimal, because it still produces a lot of false positives. So we would like to integrate some relevance feedback mechanism which improves the tag proposals over time. That's one of the ideas. And the second big idea is to also think about how some of those techniques, especially the semantic enrichment technique, can be integrated with scholarly publications. So when somebody writes an article, how can we encourage the author who writes the publication to create links to external resources on the Web while he or she is creating the paper. So this is my future work here at Cornell. And I'm always happy to get some input. Thank you. >> Carl Lagoze: A couple more -- so one more thing. Okay. So if I may. So what do I do, just plug in? >> Bernhard Haslhofer: I think so. >> Carl Lagoze: So I wanted to close up with just one more piece of work that fits into this whole area that we're doing. This is work that we presented just recently at the Web conference in India, and this is work with Yookyung Jo and John Hopcroft, who is in computer science. And this is really to take a look at a field like computer science and understand how topics and ideas develop over time. A lot of the work that's been done in this area basically has either fixed time slices or a fixed set of topics to see how they develop. And there's been some certainly nice work there. But what we wanted to do was not define the time slices at all. A topic could originate at any point in time, and the topic could be anything -- basically no learning at all, just a totally unsupervised technique. So this is a slightly representative graph, but we want to understand this notion of, as time progresses, the growth of a topic into separate subtopics, or even the merger of multiple topics into a single topic. 
And the example we use all the time is this notion of link-based searching, which arguably was sort of [inaudible] and Jon Kleinberg, which sort of combined, you know, these very interesting areas of IR and graph theory and other things into an interesting thing. And there are even more complex topic creation configurations. So we have a technique that essentially does a statistical analysis of the text of the papers -- we used just the abstracts -- and walks through time, looks through the corpus, and notices when a paper appears that has a statistically different characteristic than the papers that preceded it, and also when that statistically different fingerprint carries on to follower papers. So the definition of a topic is something that differs in a way from previous papers, and that difference actually persists over time. It's sort of an interesting model. And then also the linkage between topics is a statistically significant resemblance between two separate topic areas. So we came up with some really nice results here. And I'll just briefly put up -- this is a little bit hard to read. But this here is a breakdown of the ACM corpus since 1972. And you can see here -- and this is actually -- we deduced these by hand, looking at the papers. But future work will be to be able to cluster these areas automatically. But you can see the development of graphics over time, the development of databases over time. And you can also see the linkages between areas and how they sort of grew from each other. And what's interesting here again is that the nodes here are topic areas. They're not specific papers. They're a group of papers that fit into the topic area, and then the linkages are the linkages between topic areas. So the results came out really quite nicely here. And there's a paper in the Web conference this year. 
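The intuition behind "differs from what came before, and the difference persists" can be sketched as a toy. This is an illustrative reconstruction, not the authors' actual statistical model: it uses plain cosine similarity over word counts and arbitrary thresholds, where the real work uses a proper statistical test over the corpus.

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two word-count vectors (Counters).
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_starts(abstracts, low=0.2, high=0.5):
    """Flag abstract i as starting a topic if it is dissimilar to every
    earlier abstract (novel fingerprint) but similar to at least one later
    abstract (the fingerprint persists). Thresholds are arbitrary."""
    vecs = [Counter(text.lower().split()) for text in abstracts]
    starts = []
    for i, v in enumerate(vecs):
        novel = all(cosine(v, u) < low for u in vecs[:i])
        persists = any(cosine(v, u) > high for u in vecs[i + 1:])
        if novel and persists:
            starts.append(i)
    return starts
```

Walking through a corpus ordered by time, a flagged paper marks a candidate topic origin; grouping the later papers that resemble it gives the topic-area nodes, and cross-resemblance between groups gives the linkages.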
So to close up, last slide. A lot of the work that we've talked about and that we're doing is sort of combined into this large project called the Data Conservancy, which is one of the NSF-sponsored data curation projects. And I thought I would just end by pointing out a paper that sort of gives a large overview of understanding why cyberinfrastructure is so hard. This is a paper that was accepted this year and presented at JCDL by my graduate student Karin Patzke and me. And it's really looking at this whole cyberinfrastructure problem as a sociotechnical problem that really needs to be understood in the large to be able to work well. So with that, we will discuss or end. [applause]. >> Lee Dirks: Thank you. >> Carl Lagoze: Thank you. >>: I had a question. >> Carl Lagoze: Yes? >>: So it's a question, and perhaps a weird one, for both of you. But you started your talk by saying, you know, in the beginning of your career you did these different efforts and, you know, some of them wouldn't grow like you would have hoped or expected, protocol based, et cetera. And I'm trying to think, in your part making reference to the open annotation work, how are you able to think about these situations, and how do you make sure that open annotation does scale, does work, and that you don't run the risk of again developing a system that is hard for scholars to use. And you're not alone. Obviously there are, you know, lots of people around the world who come up with similar systems every day. How do we address that and prevent that from happening? >> Bernhard Haslhofer: Yeah, there was a workshop two weeks ago where exactly this question came up: how to ensure that this model actually fits the disciplines, that the data produced by one discipline can really be used by the people in that discipline, and how cross-disciplinary work should be enabled. 
And the solution seems to be that this annotation model will really be something extremely generic, and that different disciplines need to create some kind of application profiles for their specific needs. For instance, we have the whole community of geoinformation people who are annotating whatever with GIS coordinates, and they need some very specific aspects to be covered by the model, and they probably will need some extensions of this model. So it seems that there will be some kind of a core and extensions. >> Carl Lagoze: So let me -- because I have another answer. I don't mean by anything that we're saying that the effort to create technological solutions is useless. The problem is when you go from the low-hanging fruit, the generic -- and I applaud the generic efforts, I think they make a huge amount of sense, and a lot of the work that I've done and other people have done -- I mean, certainly you would be pointing the finger at Tim Berners-Lee and saying that was a foolish effort if you say that these things don't work. But there's this sort of hard balance between generic and too much specificity. And we haven't found that point on the spectrum -- it's easy to do the generic things, like OAI-PMH. Now, there's an example of something that worked relatively well and does a limited amount of stuff. As you sort of move up the functionality spectrum, it becomes very hard, because then you start to clash against specific cultural notions. So our grand notion of, you know, this great fourth-paradigm science that we've talked about all along, when we'll be able to do this great synthesis of data from different domains and use it in great new ways, is a vision that's great, but it's going to take a heck of a long time and a lot of understanding to get there, because it's not the low-hanging fruit. It really clashes in this world. Yeah, Mark? 
>>: [inaudible] talk you mentioned how the big publishers, such as the Royal Society of Chemistry, the American Chemical Society, actually inhibit collaboration. In your opinion, what could they do to facilitate that collaboration? Is there a magic bullet? Is there a -- >>: Or is there a bullet? [laughter]. >> Carl Lagoze: I think it takes -- Mark, I don't have a great answer, but my sort of intuitive thinking-while-talking answer really relates back to the same problem the music industry is having. It's sort of understanding what your value-add is and distinguishing that from the data level. The publishing industry would be wise to think of themselves as service providers on top of the data layer, and the data layer -- the data shall be free. And whoever creates the best services wins the capitalist game. And the ACS, like the music industry and the movie industry -- the entertainment industry -- has tried to embrace this vertical thing which includes the data and all the services. And they're inhibiting scholarship by doing that, and they will eventually lose that battle. So I don't know if that's -- yeah, Curtis. >>: You know, I like the idea of annotating lots of different media objects, but there's always the challenge of the separation of the media from the annotation, and [inaudible] some in the idea of some of these formats would then be able to accommodate both -- at least [inaudible] the annotation and also have it separate as well. Do you have an approach in your annotation of how you make sure those stay and persist over time [inaudible]. 
>> Bernhard Haslhofer: I don't have one, but there is a group at Los Alamos working on a project called Memento, and they added the concept of fixity to the annotation model, so that you can at least capture the time when the annotation was created for a certain medium, so that if a medium disappears you can at least look it up in a Web archive or somewhere else. So that's one of the ideas. So that you can go back in time and retrieve the annotated resource in case it disappears. >>: [inaudible]. >>: So that's when something was annotated. But I guess the question is, if I originally annotated the thing and it turned out I was the world expert on this document, but the annotation is separate from the document itself, if your system dies then, you know, I assume you have to figure out how to get that data back and associated with it as [inaudible] ends up moving through [inaudible]. >> Bernhard Haslhofer: Those questions are definitely not part of the model. But, I mean, in the end, annotations will be Web resources just like any other Web resources, and the idea is, at least in my opinion -- I'm not speaking for the consortium here -- to reuse the techniques that are used for the same problem in the Web context, I guess. Also for security and authentication. It's not part of the model; it's really in the Web stack. >>: I mean, if a [inaudible] have the same problem with [inaudible] how do you ensure that those annotations [inaudible] tagged you can [inaudible] these image files [inaudible] video and other formats that are dynamic and [inaudible] associate that with things that [inaudible] tiles and lots of interesting -- >> Bernhard Haslhofer: The current solution is you have to make sure that the URI of the annotated media object remains dereferenceable and returns the representation. >>: Right. >> Bernhard Haslhofer: That it was at the time of the annotation, yeah. >>: [inaudible]. >> Carl Lagoze: Bernhard, I have a question. 
So is anybody in the annotation consortium doing work on automatically connecting annotations to things? When you were talking I was thinking of Twitter. And my big complaint about Twitter is that Twitter alone is an incredibly decontextualized thing. But I am basically annotating. And there are lots of examples. I mean, the walled gardens of the Web are actually connected in implicit ways. And is anybody doing work on taking the annotation model and using it for those types of -- >> Bernhard Haslhofer: Yeah, I think if you look up the standard, or the alpha version of the standard, there's an example there, because the tweet has got a URI and you can use it as an annotation on something. So it's considered, yeah. >> Carl Lagoze: Okay. >>: And you're using [inaudible] like that to connect them; that becomes an intermediator which, if they disappear [inaudible] so it's how do you ensure the persistence of the connection between the annotation, which is so important, sometimes more important, and a lot of times [inaudible] might have copies of. >> Carl Lagoze: Well, you know, in that realm, you know the great controversy when they threw out the library cards in the libraries, where the library cards had these wonderful annotations, or the smudges. >>: [inaudible]. >> Carl Lagoze: Yeah. That's a wonderful example of annotation that was not intended to be annotation. >>: Absolutely. >> Carl Lagoze: Yeah. >> Bernhard Haslhofer: But you're actually targeting one of the big problems of this whole data-on-the-Web discussion: the problem of what happens if the links are gone. >>: Yeah. >> Lee Dirks: Anything else? >>: No, we'll just end on that eternal problem. [laughter]. >> Lee Dirks: Thank you all for coming. Bernhard and Carl, thank you very much.