>> Matthew J. Graham: Good afternoon, and welcome to the afternoon session, which is the
practical astrosemantics workshop. This is actually the fifth gathering we will have had
devoted to semantics in astronomy. I'm going to talk for about 25 minutes on
just exactly what semantics, and semantics in astronomy, is, and then Norman is going to talk
about some stuff and Sebastien is going to talk about some stuff and then [inaudible] will talk
about some stuff and then we will have a discussion about some stuff, which is what semantics
really is, isn't it? Let me start by giving you a definition of astrosemantics: essentially
astrosemantics is the branch of astronomy which deals with machine-processable knowledge,
and it's nothing more than that and nothing less than that, and it does this through the use of
a set of technologies which you can broadly call semantic technologies. A lot of this stuff
traditionally came out of the AI community and has been adopted by various bits of the web
and things like that, but there has been precious little application specifically in astronomy, and
what we are trying to do is promote the usage of these tools, because in the era of big data
everyone focuses on big data and doesn't think about big information and big knowledge, which
is what you necessarily get when you have big data. You quite often see this pyramid diagram
where you have data going into information going to knowledge going to wisdom, and the
statement is that normally you'll spend about 40% of your time grubbing around in the data
and maybe 30% of your time working with information and if you're lucky 20% up in the
knowledge layer and then the final 10% of your time gaining wisdom. However, by using
semantic technologies you actually invert that and you as a carbon-based system would be
spending maybe 10% of your time down in the data layer and maybe 20% in the information,
but you'll be spending the bulk of your time up here in the knowledge and wisdom part of it
doing good thinking, good science, making discoveries instead of wondering about why is my
data crap and what can I discover in it. The bulk of this stuff will be handled by semantic
technologies or smart technologies. I think smart is the way to think about this sort of stuff:
we're not talking about making intelligent computers which are thinking. We
are just talking about making knowledge machine-processable so that you can then apply first-order
logic and the sorts of things that computers are really good at, so that they can handle that sort
of stuff in the same way that they already handle basic zeros and ones. Why do we want to do
this? What are some of the things that making knowledge machine-processable would be
good for in astronomy? Let me go to NED and type in an NGC number, NGC 7377. This is what
NGC 7377 looks like, and it comes up with a whole load of annotations. One of the annotations
that NGC 7377 has is "most unusual galaxy". This is the most unusual galaxy in NED, so it is
tagged. The reason is that if you go back to the original source paper from the 1970s
talking about this, someone said this is a most unusual galaxy, and someone has annotated that
in a fashion. But that means there has been a judgment call on this. There's something about
this object which makes it an unusual galaxy, so if you could take that as a tag you might want
to go to the system then and say: well, if this is such an unusual galaxy in the opinion of
someone, I would like to find other objects like this, either by type, so I would go out and find
more spirals or more ellipticals, hopefully, that someone has tagged up. But what happens if,
instead of using a tag such as "E", someone has tagged it up using the word "elliptical"?
Somehow my system needs to know that the concept of "E" in this particular tagging system
and the concept of "elliptical" in another tagging system are equivalent, that there is a
concept scheme being used here, that there is some piece of domain knowledge that says when
I am tagging, some people might use this particular tag or that particular tag. That's a piece
of smartness that you want to put into your application at a very low level.
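To make that concrete, here is a minimal sketch of that low-level smartness using Python's rdflib; the URIs, the morphology property and both tagging schemes are invented for the example:

```python
# A minimal sketch with rdflib (hypothetical URIs throughout): one
# scheme tags the morphology "E", another says "elliptical", and a
# skos:exactMatch assertion makes the two concepts equivalent.
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

NED = Namespace("http://example.org/ned#")    # hypothetical scheme A
MY = Namespace("http://example.org/mytags#")  # hypothetical scheme B

g = Graph()
g.add((NED.E, SKOS.exactMatch, MY.elliptical))   # the piece of smartness
g.add((NED.NGC7377, NED.morphology, NED.E))      # object tagged "E" only

# A search for "elliptical" follows exactMatch in either direction
# (or zero hops), so it also finds objects tagged "E".
q = """
SELECT ?obj WHERE {
  ?obj ned:morphology ?tag .
  ?tag (skos:exactMatch|^skos:exactMatch)* ?target .
}"""
for row in g.query(q, initNs={"ned": NED, "skos": SKOS},
                   initBindings={"target": MY.elliptical}):
    print(row.obj)    # -> http://example.org/ned#NGC7377
```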
I might want to do it by properties. I might want to say that actually I'm interested in things which have spiral
arms and someone may have said that this is a spiral galaxy or that this is a galaxy of S type.
Somewhere there will be a definition in your smart system that says these are the set of
properties that in my worldview I will define a spiral galaxy as having. It will have spiral arms or
it might have dust lanes. If I say it has dust lanes then I can do a search for objects which have
been tagged as having dust lanes and my system will be smart enough to know that that falls
within my definition of what I consider to be a spiral galaxy. Conceivably, the word spiral
doesn't appear anywhere in that tagging; there could be an object that is recorded as having
dust lanes and nothing else. It doesn't mention spiral at all, but when I do my search, that
object will come back flagged as a spiral galaxy, because a spiral galaxy is defined to have
dust lanes.
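That worldview rule is easy to sketch; here is a hedged version with rdflib (URIs again invented), where a CONSTRUCT query types anything tagged with dust lanes as a spiral galaxy even though the word spiral never appears in the tagging:

```python
# Sketch of the worldview rule "things with dust lanes (or spiral arms)
# count as spiral galaxies", applied as a CONSTRUCT query over
# hypothetical URIs.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/astro#")
g = Graph()
g.add((EX.obj42, EX.hasFeature, EX.DustLanes))   # never says "spiral"

rule = """
CONSTRUCT { ?o a ex:SpiralGalaxy }
WHERE {
  { ?o ex:hasFeature ex:DustLanes } UNION { ?o ex:hasFeature ex:SpiralArms }
}"""
for triple in g.query(rule, initNs={"ex": EX}):
    g.add(triple)                                # merge inferred triples back

print((EX.obj42, RDF.type, EX.SpiralGalaxy) in g)   # True
```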
I could say: find me more objects at a particular distance, the distance of this object. You
know, there might be objects out there that don't have a specific distance attached to
them at the moment, but they may be tagged as spiral galaxies. In my knowledge system I
might have something which says spiral galaxies can have a
distance attached to them through something called the Tully-Fisher relation, and the
Tully-Fisher relation requires there to be an HI line width attached to the object. So maybe I
can go out and find objects which have an HI line width attached to them, and somewhere
there is defined an algorithm that says you can use that to estimate a distance
for that particular galaxy, using this particular property. That's a
broader piece of domain knowledge, but I can express it in a machine-processable
fashion, so that my smart system can go out and present me with a data set that I can then
use. This is not just tagging; it's also inference. I've expressed my knowledge and my
information in such a way that the system can be smart about it and do things on my behalf,
because I've already given it the knowledge to do so.
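The same kind of chained rule can be sketched in plain Python; the coefficients in this toy Tully-Fisher relation are purely illustrative, not a calibrated relation:

```python
# Toy inference chain: spiral galaxy + HI line width -> Tully-Fisher
# distance estimate. Field names and coefficients are illustrative.
import math

knowledge = {
    "obj7": {"type": "SpiralGalaxy", "hi_linewidth_kms": 300.0,
             "apparent_mag": 12.5},
}

def tully_fisher_distance_mpc(obj):
    """Illustrative Tully-Fisher estimate, not a calibrated relation."""
    M = -7.3 * math.log10(obj["hi_linewidth_kms"]) - 2.0  # toy absolute mag
    mu = obj["apparent_mag"] - M                           # distance modulus
    return 10 ** ((mu + 5) / 5) / 1e6                      # parsecs -> Mpc

for name, obj in knowledge.items():
    if obj.get("type") == "SpiralGalaxy" and "hi_linewidth_kms" in obj:
        obj["distance_mpc"] = tully_fisher_distance_mpc(obj)
        print(name, "~", round(obj["distance_mpc"], 1), "Mpc")
```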
On a completely different note: in Second Life there's a little thing called a semantic museum.
What people have done there, in this particular case, is a display about classical
Greece. They have created pictures of various objects from classical Greece, and they put
descriptions of those objects in such a fashion that there is knowledge about those objects and
knowledge about their provenance and how it's all connected together: pieces of information,
pieces of knowledge, that are connected together in a machine-processable fashion. What
happens when you go up to the object is that a description pops up, saying
that this is a prochous and it comes from the classical period of Greece. It sort of tells you how
this was characterized, and it tells you something about the object and where it is. Each of
those little pieces of information in there does not necessarily reside in a single block of text in
Second Life. It actually is probably stored at the level of individual pieces of information, but
you can connect the pieces of knowledge together like we are familiar with in Wikipedia linking
it all together, and you can retrieve it because it's all related and tagged together. Also,
because you've encoded the information in such a way and said what the structure is, you can do
clever things like multilingual presentations and all of this sort of thing. The term that I use
for this sort of approach is artificial docents. Now, where this might be useful is if you are doing
a sort of smart EPO-type thing. Here we have a Curiosity picture showing stratification on
Mars and, you know, there could be a piece of text saying "planetary scientists believe that the
same geological processes that have shaped the Earth, volcanism, tectonism, water and ice
and impacts are at work on Mars," and there could be links then to descriptions about that and
to find further information, further knowledge about that which would be useful if you are
exploring this sort of thing. Think of it as a very smart Wikipedia, or it could link through to let's
say we are interested in impacts and it could then bring up the amateur picture of the thing
that collided with Jupiter this week and say Jupiter is important in the context of impacts
because, and someone has put a description in there. Now I know that Ray was claiming that
he does a lot of Wikipedia entries, so stuff that he puts in could be drawn up and presented in a
fashion through a knowledge-based framework, and it would be reuse of the information that
he's put in, not necessarily in the context that he originally put it in; but if the necessary
metadata to make it useful and to describe its use is in place, it can be put together for this
sort of smart-docent EPO approach. A third area in which this sort of knowledge can be useful: we
as a species have this sort of innate need to classify things. This is L. Eyer’s sort of conceptual
taxonomy for essentially variable stars or variable objects. Wouldn't it be great if I was doing
data mining or machine learning in some fashion and I could incorporate this knowledge not as
a preprocessing step or as a postprocessing step, but as something that's actually used directly
as part of the machine learning or the data mining that I'm doing? Two particular techniques
I've been playing around with. First, there is a way that you can use self-organizing maps to
include what's called an ontology, which is essentially just a representation of domain
knowledge, so that you can get a metric between two different knowledge ideas. You can get a
metric between things that are expressed in a taxonomy: if you go to NED, NED has the
NGC catalog as you've seen, and then it has this wonderful set of annotations, and you
can express those in such a way that they are machine-processable and get distances between a
spiral galaxy and an elliptical galaxy in your concept scheme. Because you then have a
quantitative measure, you can then do traditional data mining clustering techniques like the
ontological SOM, or self-organizing map. So you take the NGC objects, you figure out broadly
what their classes are, you get a self-organizing map based on the distances of the concepts in
how those objects are tagged up, and then you can see how those sort of relate.
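As a sketch of what such a quantitative measure might look like, here is a toy path-based distance over a small hypothetical taxonomy; the concept names and hierarchy are invented:

```python
# A toy metric between concepts in a (hypothetical) taxonomy: the
# number of edges from each concept up to their lowest common ancestor.
taxonomy = {                       # child -> parent
    "BarredSpiral": "SpiralGalaxy",
    "SpiralGalaxy": "Galaxy",
    "EllipticalGalaxy": "Galaxy",
    "Galaxy": "AstronomicalObject",
    "Star": "AstronomicalObject",
}

def ancestors(concept):
    chain = [concept]
    while concept in taxonomy:
        concept = taxonomy[concept]
        chain.append(concept)
    return chain

def distance(a, b):
    """Path length via the lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    for up, node in enumerate(pa):
        if node in pb:
            return up + pb.index(node)
    return float("inf")            # no common ancestor

print(distance("BarredSpiral", "EllipticalGalaxy"))  # 3
print(distance("SpiralGalaxy", "Star"))              # 3
```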
Another technique is a sort of
superclass of Bayesian networks, which Gerald talked about, and I think, are you talking about
Bayesian networks as well, Gerald, ah, Ashish [phonetic]?
>>: [inaudible].
>> Matthew J. Graham: Right, so Ashish will talk more about networks. There's a thing called
Markov logic networks, which are a way of combining statistical reasoning and first-order logical
reasoning. Essentially, you can express your domain knowledge in the form of rules, but you can
associate weights with those rules, so the rules are expressing knowledge and the weights
express your degree of belief in each particular statement. You could say that a supernova is
associated with a nearby galaxy 80% of the time, and you could then use that as a classification
system, and it will use first-order logic to infer things like: what is the most likely statement I
can make about a particular piece of information, based on the body of knowledge I've
associated with it, Markov probabilities and stuff like that.
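A toy scorer with Markov-logic semantics might look like the following: a world's unnormalized probability is the exponential of the summed weights of the rules it satisfies. The weights here are illustrative, not fitted to anything:

```python
# A toy Markov logic network over one supernova/galaxy grounding: each
# rule has a weight, and a world's unnormalized probability is
# exp(sum of weights of the rules it satisfies). Weights are
# illustrative; 1.4 ~ log(0.8/0.2), echoing the 80% statement above.
import math
from itertools import product

rules = [
    (1.4, lambda w: (not w["supernova"]) or w["near_galaxy"]),  # SN => nearby galaxy
    (0.5, lambda w: w["near_galaxy"]),                          # galaxies are common
]

def weight(world):
    return math.exp(sum(wt for wt, rule in rules if rule(world)))

worlds = [dict(zip(("supernova", "near_galaxy"), vals))
          for vals in product([True, False], repeat=2)]
Z = sum(weight(w) for w in worlds)            # normalization constant
for w in worlds:
    print(w, round(weight(w) / Z, 3))         # probability of each world
```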
Those are just three separate areas where having
machine-processable knowledge could potentially be very useful in astronomy. The question is,
how do we use these tools? Well, the basic rule is that in machine-processable knowledge there
is a lowest level: the quantum of knowledge is something very simple. It is what's called a
triple. You simply have subject, predicate, object, and you reduce all of your statements to that
basic level, that very fundamental unit, and then you do all sorts of other things with it: express
it in your concept schemes, store it in databases, and infer over it. But that's the basic level
that we work at. The World Wide Web Consortium has defined a whole set of standards related
to knowledge representation, and the basic one is called the Resource Description Framework,
RDF. If you read more about semantics and semantic technologies, you will hear more about
what are called triples, or RDF triples, and essentially what they are is just a subject, predicate,
object: "Pluto is a planet."
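In rdflib, that quantum of knowledge, and its later retraction when the knowledge changes, might look like this (the URIs are invented for the example):

```python
# One triple: subject, predicate, object. URIs are hypothetical.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/solarsystem#")
g = Graph()
g.add((EX.Pluto, RDF.type, EX.Planet))        # "Pluto is a planet"

# Knowledge evolves: retract the old statement, assert the new one.
g.remove((EX.Pluto, RDF.type, EX.Planet))
g.add((EX.Pluto, RDF.type, EX.DwarfPlanet))
print(g.serialize(format="turtle"))           # one small graph, one statement
```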
Actually, that's not true anymore. That's the other thing about knowledge: knowledge evolves
and our knowledge changes. If you work in bioinformatics, you have something called
the Gene Ontology. The Gene Ontology, which expresses the current body of knowledge about
gene structures, gets updated about 10 times a day, so that community is
very used to having techniques for dealing with a fluid body of knowledge. We change
these ideas, we reclassify stuff and whatever; that is a sort of bonus you get as well, it's
inherent in the approach. How do we build up bodies of knowledge? Well, one way of doing it is
extracting knowledge from the literature. You express it in angle brackets, and we've been
good at doing that in astronomy for at least 2000 years. In time domain astronomy we send
out event notices expressing our knowledge of what we have seen in the sky, essentially
answering the questions of who, what, where, when, how and why. But actually there's a lot
of information in the literature, a lot of knowledge, that is worth exploring and
trying to extract. This brings us into text mining and trying to create concept schemes from a
body of text. Now, there are two ways that you can create a concept scheme. I can go and ask
George what he believes the correct hierarchy or correct classification scheme is for variable
objects. And then I can go and ask Bob, and they will have, let's hope, 80% agreement,
but they might disagree over certain things, and I may ask someone else. Everyone will have
their own version of what a concept scheme is and how you would classify certain things or
how things might be related, depending on what their domain knowledge is. That doesn't
matter in semantics, because there is this whole business of resolving different people's ideas
to form consensus opinions, so you can have hierarchies and ways to map between them. But going to
the literature, where there is maybe a more established corpus of knowledge and a
representation, is a good idea. One of the things we've been playing around with: another
way in which, in the time domain, you can send event notifications around is things called
Astronomer's Telegrams (ATels). These are natural language, so instead of trying to put things into a forced
structure, like maybe we're trying to do with the angle-bracket way, with the ATels you have
someone who has written some sentences. The hope is that these are scientifically rich
sentences, but there may be things that are very useful in there that might allow us to gather
snippets of knowledge together that are relevant for a particular class of astronomical object,
so I've had students over the last two summers who've been working on this and trying to
hierarchically cluster vector representations of text that they've extracted from ATels and there
is some interesting clustering going on that would maybe allow us to create a concept scheme
for that particular subdomain. This is another thing: you don't need to create one huge
overarching body of knowledge that says, this is astronomy and that's what you've got to use.
You can deal with the very small subdomain in which you are expert and just work with that,
and maybe use someone else's representation of the subdomain in which they are experts, and
then join the two together quite easily. ADS Labs is a potentially very rich corpus of material to
mine to create these sorts of concept schemes, and I think that there is work that has been
done, or is being done, in these various areas.
>>: [inaudible].
>> Matthew J. Graham: Okay. David Hogg has been putting ideas up on his ideas
blog last month, and one of them was about creating a paragraph-level index for arXiv.
Wouldn't this be a great way of identifying astronomical knowledge and, as he says, also
identifying who wrote which paragraph in a paper? And then there are things like the NED
annotations. They are somewhat freeform, but you can create a concept scheme from them, as
I sort of have done, and present that and say: this is the worldview according to NED, use it if you
want. Those would be concept schemes from bodies of knowledge, bodies of data, that we
already have. How do you store this information? How do you store knowledge? Well, for a
separate project, one thing that I did recently: we had a corpus of information, about
11,000 event notifications, and we wanted to see what the best way of representing
and retrieving the knowledge, the information, stored in those was. You see, there are
these things called triple stores, which, who talked about it? Mark Stoltza [phonetic] was
mentioning them with the flash blades on Monday. And we have relational databases. We have
things like NoSQL non-relational databases, which are seen as one of the storage solutions for
big data. With LSST we are going to be getting not 11,000 events; we're going to be getting
about ten million to 100 million events a night. That knowledge, that information, needs to be
stored somewhere in a very efficient way, so we did a little toy experiment to see
what we could do to store those. We tried a traditional relational database, with two different
ways of organizing our information inside it. We tried MongoDB, which is I think what
Galaxy Zoo, or Zooniverse, uses under the hood; Stardog, which is a commercial triple store; and
eXist, which is a native XML database. VOEvent is XML, an angle-bracket representation,
and so you just give it a very simple thing. In this case: find me all events which have a
parameter called event flux and a value less than 40. There is my knowledge
statement, my knowledge query. Find me everything with that in it.
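For flavor, here is that toy query expressed over triples with rdflib rather than against Stardog itself; the vocabulary is invented for the sketch:

```python
# Sketch of the toy benchmark query over triples (hypothetical
# vocabulary): find events with an event-flux value less than 40.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

VO = Namespace("http://example.org/voevent#")   # hypothetical
g = Graph()
g.add((VO.event1, VO.eventFlux, Literal(23.0, datatype=XSD.double)))
g.add((VO.event2, VO.eventFlux, Literal(71.0, datatype=XSD.double)))

q = """
SELECT ?event ?flux WHERE {
  ?event vo:eventFlux ?flux .
  FILTER(?flux < 40)
}"""
for row in g.query(q, initNs={"vo": VO}):
    print(row.event, float(row.flux))    # -> event1 23.0
```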
It turns out that the triple
store is by far the fastest, because the technology is very fast at retrieving these
semantic triples; essentially you break everything down and then you're just doing matching. I
was quite surprised by that. I fully expected MySQL, the relational database, to win out, but
this particular product beat it. We had hoped to have the creator of this particular product here
this afternoon, but he had to call off at the last minute. There are specific technologies for
storing very large amounts of knowledge data out there, and they will scale very nicely, I suspect.
Finally, there are these different ways of codifying knowledge, which I've alluded to, and the
semantics working group of the IVOA is very much into trying to collect these and provide
advice on how to do them. We start off with controlled vocabularies: the theory working
group has produced one of these for marking up
astronomical theories. There are taxonomies, like the General Catalogue of Variable Stars
defining a hierarchy of variable stars. There are thesauri, which Norman is going to be talking
about in a few minutes, and then we have these things called full-blown ontologies. CDS uses
one of these for doing a lot of stuff, and I talked about the concept schemes that I've been
working on for data mining, and there are a couple more out there. Something I finally came across
yesterday, which I was very interested to see, puts all of these together:
the SKA information-intensive framework, which has been developed by the IBM CTO
and someone in New Zealand, and which provides semantic services to manage data right across
the board: for ingesting, for classifying, for making inferences to do transient detection, and for
data retrieval. This uses one of the ontologies that the IVOA has defined, so I was very interested to
see this. I'm trying to find out more information because there seems to be very little
information about it. There seems to have been a set of press announcements about it at the
end of last year, but clearly SKA has got its ideas right about how to use knowledge and
knowledge management systems in the heart of a big data system.
>>: [inaudible].
>> Matthew J. Graham: Yes.
>>: It is?
>> Matthew J. Graham: It is, yes. The guy who proposed it is the chief technology
officer at IBM, who is also in charge of some aspect of SKA project management, so I hadn't
heard about it before I happened to see a random news story on it yesterday and I was quite
surprised by it. Ray says he knows nothing about it.
>>: [inaudible].
>> Matthew J. Graham: Yeah, I know. I don't know. Joe might know more.
>>: [inaudible].
>> Matthew J. Graham: Joe is not in the room. I'll ask him later. That's where I'll end off, so if
there are any questions or comments…
[applause].
>> Norman Gray: Hello again. As Matthew mentioned, I'm going to talk about both the
history and the future of broadly-considered vocabularies in the context of the VO, which in this
context often means within the IVOA, but not exclusively. Oh, that wasn't supposed to happen.
I'm going to skip these slides because they are implied by the discussion that we had in the last
session, closed parenthesis. The core assertion here, the thing I want you to remember, is that
there are already multiple thesauri in astronomy, and I'll get to definitions in a moment. This is
no longer arcane, even if it was originally, and it is all ready for deployment and for
applications to build on it. This needs saying because this whole area, the vocabularies, the
thesauri, the semantic web, has appeared off-putting to people, and that's intelligible,
because a lot of the technologies involved are sufficiently far from the comfort zones of most
physical scientists; they were new to many computer scientists until 10 years ago, and they are
sufficiently fiddly that it is a long route from first reading to producing something
interesting and useful, so you needed to be sort of pre-convinced it was worth the effort.
Now I think there are enough applications, or potential applications, that are fairly obvious next
steps that that is no longer the case; that was in the past and is no longer now. As Matthew
mentioned there are vocabularies. There are thesauri; there are ontologies, so some
definitions are useful. So this section of the half hour is pedagogical, and the pedagogy will be
through stories. I don't expect you to read all of that. You can read it in the PDF of the slides. You
can, and I encourage you to do so, read it in the PDF slides afterwards. The main point is that
there is a story here about someone starting off reading a paper and going from there to a
service, to data, to other things including Wikipedia, moving around through the possibilities
of astronomical knowledge to make their work easier. That doesn't all work yet, but
getting from where we are now to there is just a matter of code. The bare bones of that are
already present in things like the deep linking or the deep markup, the deep tagging that
Matthew mentioned in the last session, and those would support this. One of the points,
and I'll come back to it several times, is that what supports doing that is what makes it easy for
humans: the way the web works. If you're reading a webpage you can follow a
link. That link isn't restricted to going to another place on that page. It's not
restricted to going to another page on that site. The link can go anywhere on the web. But only
humans can read HTML pages. This depends on a machine-readable web, on machines that can
do the same sort of service-to-service, source-to-source linking that humans have always
been able to do with the web. I don't think there's much more to see there. This is lightweight
semantics, that word again, lightweight stuff. At the more elaborate end of the range of
possibilities you can talk about ontologies, and I'll skip over this fairly swiftly. I just wish to
illustrate that there is a heavyweight end of the spectrum. What does this mean? Here is an ra
and a dec and an r magnitude, by some particular observer on a particular date. What can you
conclude from that? This is an optical measurement of a star. You know it's a star because it
has an ra and a dec. It's optical because there was an r-band measurement there. Look, there is
an astronomer, because she has taken an optical measurement. She was not in a radio
observatory on that day; you can tell that because she was clearly at an optical observatory,
and an optical observatory and a radio observatory are different things. And so you could ask
who was at an optical observatory in March, and you could then get an answer to that question.
In other words, there are rules here. If it has an ra and a dec, it is a celestial object. If it has a
measurement in the optical, it's visible in the optical, and so on. All of you are sitting here
thinking: that's not true, I can think of countless counterexamples. And you are right. This is not
absolute truth. It is programming. If you write a simulation, what you are coding up is
not absolute truth. It's enough of the truth to be usable, to get something done, to draw a
conclusion. So this is the sort of more elaborate programming with logic that reflects enough of
a discipline that the machine doesn't understand what's happening, but it can act a little bit as if
it did, and that's a really modest goal, even at the heavyweight end of the
spectrum.
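Those rules are simple enough to sketch in a few lines of Python; the record field names are hypothetical:

```python
# The lightweight rules from this example as plain Python over a record
# (field names hypothetical): enough of the truth to draw a conclusion.
def conclusions(measurement):
    facts = set()
    if "ra" in measurement and "dec" in measurement:
        facts.add("celestial object")          # it has a sky position
    if "r_mag" in measurement:
        facts.add("optical measurement")       # r band is optical
        facts.add("observer was at an optical observatory")
    return facts

obs = {"ra": 231.2, "dec": -12.5, "r_mag": 17.3, "observer": "A. N. Other"}
print(conclusions(obs))
```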
I mentioned the spectrum, and here it is, in a much-PowerPointed picture.
You can imagine a spectrum from very lightweight knowledge representation things
at one end to very heavyweight knowledge representation things at the other end. Way over
here are lists of terms. You've seen things like that: if you submit a paper to one of the major
astronomy journals, you have to add keywords to the paper. Those keywords have no structure,
but you have to pick one or two of them from a controlled vocabulary. You can't just put in any
keyword you like. You can do a little bit more than that. You can add structure to those
keywords. You can say this is a narrower term than that one. The canonical example: if you
have the concept of a car, a steering wheel is a narrower concept. There is less covered
by that concept. It does not follow that a steering wheel is a type of car, so the relationship is
about finding things; it is not a subclass relationship. Going to the other side of this red line, you
can talk about more formal "is a" relations. You have the class of all cars and the class of red
cars, and if you say something is a red car then it is a deduction that it is a car. Nothing exotic
there; it's a very lightweight bit of logical structure, but you can imagine
classification being based on that type of structure. Over on this side you have much
more elaborate, intricate types of the same sort of thing, but there is a sort of boundary, with
vocabularies and thesauri on this side and ontologies on that side. There is also a boundary of
cost. Things on this side are easy and cheap. Things on that side are not easy and emphatically
are not cheap, and all of them are [inaudible]. As Matthew said, this is not about
having one grand ontology or thesaurus, so keep that in mind. I'm now going to go on to talk
about various thesauri and vocabularies that already exist, and that are being developed
further at present, just to show the range of things. One controlled vocabulary that many people
know about is UCDs. [inaudible] everybody knows about UCDs. They have evolved over
the course of years, not hugely, but there are developments. The UCDs are barely structured.
There is some structure sort of there, but it doesn't play a heavy role. The reason why UCDs are
so important is that everyone recognizes them, and they have their authority from the fact
that they were extracted, they were deduced, they were mined from the headings of the
databases in the data holdings, so these aren't just some words that some folks thought
might be a nice idea. This is a collection of things that are actually measured in
astronomy, so they are not intricate but they are very well known. That led on to
work in 2009, where the IVOA declared, in a rather tentative declaration, that everyone should
use SKOS. SKOS is a W3C standard for writing down simple knowledge organization systems;
it captures the important features of the thesaurus idea: a narrower
relationship, a broader relationship, a related relationship, and that's about it, so it is very
lightweight. This document simply said: that's a good plan, if you can produce thesauri in
astronomy, do that. And it mentioned as examples, it showed, SKOS versions of four existing
thesauri, namely the journals' keyword list (basically the list of keywords that are required by
the journals, with angle brackets around it), the UCDs, an outreach vocabulary called AVM, and
the IAU's 1993 thesaurus of all of astronomy, which was an intricately constructed thesaurus but
hasn't been updated in 20 years, so it is of limited use. That again stresses the point that I come
back to: these thesauri don't have to be complete and they don't have to be
new. They can be refurbished versions of existing things, and there is value in that. What does
one of these look like? The concepts are identified by URIs. You can describe broader and
narrower relationships. The concept is different from the labels that describe it: the concept of
absolute magnitude is not bound to the string "absolute Helligkeit" in German or "absolute
magnitude" in English, and so on; the labels are a separate layer on top of the notion, and the
concept points to other related things. So nothing exotic or particularly arcane there; this is
lightweight semantics. That example is from the IAU's thesaurus.
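Here is roughly what such a concept looks like in SKOS, written as Turtle and parsed with rdflib; the URIs and labels are illustrative rather than the IAU thesaurus's actual identifiers:

```python
# An illustrative SKOS concept with multilingual labels (not the real
# IAU thesaurus identifiers).
from rdflib import Graph
from rdflib.namespace import SKOS

ttl = """
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix iau:  <http://example.org/iau#> .

iau:AbsoluteMagnitude a skos:Concept ;
    skos:prefLabel "absolute magnitude"@en ;
    skos:prefLabel "absolute Helligkeit"@de ;
    skos:broader  iau:Magnitude ;
    skos:related  iau:LuminosityFunction .
"""
g = Graph().parse(data=ttl, format="turtle")
for label in g.objects(predicate=SKOS.prefLabel):
    print(label, label.language)   # the labels, one per language
```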
Another set of terms is Simbad's object
types, again not hugely structured but well understood and applied to a very large number of
objects. These are effectively a lightweight ontology of astronomical objects. Sebastien can
talk more about that if there are any questions at all. Also from CDS and this time from the
heavyweight end of the spectrum is the Ontology of Astronomical Object Types, which really
needs a snappier name. This is the URL, and these are the people behind it. It was virtually
finished in 2008, with some tweaks later. I put this up not because I have much more to say about it, but
because I want to point to these notes describing it and the use cases for that which I think are
interesting. This has a couple of applications. Matthew mentioned one. It is to be used within
CDS to do consistency checking of the attachment of labels to astronomical objects to see if
somehow something has been inconsistently labeled. That I say is at the heavyweight end.
One thing that is sort of missing from this talk is snappy demos. I said at the beginning, and I will
say again at the end, that these things are much more mature than they were, and are more
usable, ready-to-use things than they were in the past, but they are not quite there yet, so I don't
have snappy astronomy-specific demos. But all of this general technology does work at scale:
the BBC's sports webpages are, I understand, essentially a semantic web application. There's a
big [inaudible] behind them; none of the pages are fixed. Of course that works. This is
being used as a content management system, and this is not the only piece of the BBC that uses
this content management system. But this content management system is firmly based on the
technologies that I'm talking about here, and is obviously on a big scale. I picked this particular
screenshot because, although it's only sports stuff that is covered by this application, it's
flexible enough and has enough links across the BBC that this, basically a
non-sports news story, had enough of a sports angle that it was included as one of
the sports headlines. As it happens, that's from today. One of the things I'll be saying again and again
is about linking between vocabularies; there are multiple vocabularies and you can link
between them, and this has been systematized in the notion of the linked data cloud, and I do
not expect you to read that; part of the point is that it's too big to read. Each of these is some
sort of data warehouse, and I think these ones are publications of various types. These ones are
community content. These are the governmental ones. But the point is they link from one to
another. Each of these gives out its information as RDF, as those angle brackets I showed you in
the illustration of a thesaurus, and makes links from one to another. This is just like the ordinary
human-readable web. The way the web of links to links to links all goes via
Wikipedia is literally here, in a computer-readable version of the web, in this case going via
DBpedia, which is a machine-readable version of Wikipedia. So this is not a fringe technology
anymore. That is the point that I keep going back to. Bigger projects: one is the grandly unified
astronomy thesaurus. This has various people involved. I think the important thing here is the
range of institutions involved in developing this. The history of it is
implied by the slide after next. One interesting thing is the use cases. These are some of the
use cases for ADS. I'll let you read them, but the point is that having this structure, machine
readable and shareable, allows applications to be built on that foundation. The publishers have
similar, distinct but similar, use cases. There are different things they want to do with the same
structure, and if they are doing things with the same structure, then it is natural for
these applications to go back and forth between the various commercial publishers and the
community-oriented ADS, and beyond. The goals for this are fairly specific. The background to
this is that both the American Institute of Physics, AIP, and the Institute of Physics, IOP, in the
UK were generating large all-of-physics thesauri, and they have donated the astronomical
portions of these to this project, so there are some incompatibilities. Both of those thesauri
have in their lineage the IAU's 1993 thesaurus, but they have come by it by different routes.
Productizing this is important, because this mustn't end up being just a research project. This
must end up being a thing which is released and on which people can build long-term
applications. The maintenance process is important because it is necessary that it will not fall
out of sync with astronomy, as the IAU's thesaurus did, and so a mechanism for involving
astronomers, not information scientists, but astronomers, in the creation of it is important. As
Matthew said in the last remark, this does not have to be complete. It is okay to have more
than one thesaurus. As for process: the AAS, American Astronomical Society, will own this, and
the IAU is expected to bless it. The process is not terribly exciting, but the point is that there is
one, though it has yet to be baked a bit and finalized. There
is nothing more to say about that. Okay. Moving on, another cluster of thesauri are the
[inaudible] theory vocabularies, which again Matthew previewed. These are the four involved,
at the Paris Observatory in Meudon, and I have been involved a little bit, although it is the
Meudon people who have been doing most of the work. Now, the motivation here is that the
simulation data model within the IVOA is about finding simulation results. The plan is that you
search for these by the objects simulated, by the types of input
parameters, by the algorithms used, and so you don't just want free-text
searching for all of this. You want it to be a little more structured, and so these need
thesauri. That is the site, votheory.obspm.fr, and there are 10 vocabularies
being created there. Some of these were pre-existing ones which were just tagged up and
imported; some were given more or less substantial editing work after being imported, and
some were created on the site. The back end of that is a commercial system called PoolParty,
which is a rather expensive system that we have a free license to use, but because this is
using SKOS, we are not in hock to that commercial supplier. If they disappeared or started
charging excessively, we would just dump the SKOS files and go somewhere else. That is the
virtue of standards. I'm going to talk now about one of these in particular, the
thesaurus for chemical species because I was involved in elaborating that a little bit. It builds
on the VAMDC, the virtual atomic and molecular line database, C; I'm not sure what the C
stands for. That is quite a large thesaurus of chemical species of astronomical interest, and here is a term
from it. I put this up here just to illustrate what the thesaurus looks like. There is some URI
that refers to this concept. It has a couple of labels; this is water, the concept of
water. It has preferred names, and it has narrower concepts: the concepts of heavy
water and tritiated water are narrower concepts than this one. Let me make that point explicit.
In this case, heavy water is a narrower concept than water: you find it in the same part of the
shop, if you like, as where you find water, not very much of it typically. But it is also the case
that heavy water is water, it's a type of water, and that isn't captured here; that implication
isn't present in what's up there. The solution was to generate a
parallel ontology (and that site did change, by the way) that does talk of subclassing,
that heavy water is a type of water.
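A sketch of the two layers side by side (hypothetical URIs): skos:narrower for the finding-aid relationship, rdfs:subClassOf in the companion ontology for the "is a type of" relationship:

```python
# skos:narrower says heavy water is a narrower concept than water (a
# finding aid); rdfs:subClassOf says heavy water IS water. URIs are
# hypothetical.
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS, RDFS, RDF

SP = Namespace("http://example.org/species#")
g = Graph()
g.add((SP.Water, SKOS.narrower, SP.HeavyWater))    # thesaurus layer
g.add((SP.HeavyWater, RDFS.subClassOf, SP.Water))  # companion ontology
g.add((SP.sample1, RDF.type, SP.HeavyWater))

# With RDFS reasoning (e.g. the owlrl package), sample1 is also Water:
#   import owlrl
#   owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)
#   print((SP.sample1, RDF.type, SP.Water) in g)   # True after expansion
```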
Another thing is that this very naturally links to other
sources of information: here is what you get clicking on a link, other sources about water.
ChEBI, Chemical Entities of Biological Interest, is another elaborately and, I think, extensively
curated source of machine-readable knowledge about chemistry. It is natural for this fairly
homegrown bit of work to link into that, and thus inherit the knowledge that other people are
extensively creating in their ontology. That linking, the machine-readable linking, is like a
broken record with me; I am going back to it again and again. So, just to make explicit the various
lessons that I've been mentioning in passing through this last 20 minutes. A little bit of
structure goes a long way. You don't have to have elaborate logically intricate descriptions of
the entire astronomical world in order to build something useful. A little bit of structure is
enough. Most of the work involved in the various thesauri I have described here has been
scripting work: taking something that already exists, which people possibly already believe in
and use in one little part of the forest, putting angle brackets around it, and making it a
standard, openly shareable and reusable thing. It's okay to have lots of them. There is no real
benefit to having a monolithic thesaurus here. A monolithic thesaurus has the virtue of
consistency and some internal integrity, but it doesn't have many other benefits. If you had a
whole forest of these things, as long as they were linked together and not logically inconsistent
with each other, everyone maintaining their little bits of the tree can have their work be useful.
Thesauri don't do everything. Sometimes adding a companion, a very lightweight ontology
which just expresses a few extra relationships, is necessary, and as I've said several times, this
is all about machines being able to do what we do on the web anyway: to move from source to
source. I'll end where I began: there are already many thesauri developed and deployed. This
is no longer arcane. This is ready for applications to be built on top of it, so what are you
waiting for? And I will stop there. [applause]. We have plenty of time for questions.
>> Matthew J. Graham: Yes, are there any questions for Norman?
>>: So you showed [inaudible] data stuff, is there any astronomical data in there already?
>> Norman Gray: No, none.
>> Matthew J. Graham: Well, that's not strictly true. Any astronomical data that is in Wikipedia
and is sufficiently marked up will be.
>>: Real data from…
>> Norman Gray: Real data? [laughter].
>>: In that sense, would the main goal be to connect to the linked data
network?
>> Norman Gray: Where--I know exactly how to put astronomical information on there. There
is a lot of information in Simbad, for example, which is exactly the sort of information about
astronomical objects which could go onto that, perhaps not tomorrow. It is only a matter of
code and--no, that's not true. It's a matter of code and permission to use that data that way.
>>: [inaudible] something there, you have to create an RDF database, right?
>> Norman Gray: Not necessarily. You have to give the information out as RDF. But you don't
have to store it as RDF.
>>: But in the case of the [inaudible] Observatory, would it be possible to connect the
[inaudible] Observatory to the linked data web?
>> Norman Gray: The linked data web isn't really about bulk data. RDF tends to be slightly
uneasy. It's not natural to pump vast quantities of numbers out that way. It's about concepts,
about the sort of knowledge that you have after reading a Wikipedia page.
>> Matthew J. Graham: So one thing that we did experiment with at a previous semantic
workshop a couple of years ago was putting an interface around Simbad for the information in
there or a subset thereof and there are ways that you can put layers on top of relational
databases, where you just define the mapping into RDF. You don't need to convert all
of your existing data into it. And we sort of got somewhere with that, so that might be a way of
doing something similar.
>> Norman Gray: So the proof-of-concept way that I had of showing a linked data version of
Simbad was just a layer that took the linked data API on this site, made the right SQL
queries, and rewrote the results.
>>: For example, I could easily imagine that it would be useful browsing through these
relationships and stepping from one to the other, but also some of the relationships could tie
into data. Some of it may already be being done at some level. For example, I
know space telescope shares with ADS the relationships between data sets and publications
and so that's a very natural sort of stepping off point in both directions. Maybe you discovered
some data sets and you want to update some publications and then you can go from there to
other data sets and so on and it's a very natural way to sort of expand the sort of scope of
things that may be of interest to whatever you are browsing.
>> Norman Gray: And this provides a language for doing that. This is glue. This is a pot of glue
sitting there ready to be used and once you've done that for machines it's a matter of code to
do it for human readable pieces as well.
>> Matthew J. Graham: Last question.
>>: I assume that we will produce these triple stores for output.
>> Matthew J. Graham: [inaudible].
>>: I didn't quite follow the context in which you mentioned UCDs, that [inaudible].
>> Norman Gray: I was using those as an example of a quite lightweight controlled vocabulary,
with some structure, but the structure isn't the important thing about it; it gains its
authority from the fact that it is used by many people and it was harvested from [inaudible]
rather than being the creation of someone's beautiful mind.
>>: But they do have these [inaudible]?
>> Norman Gray: They are useful. The point to be made there was that even something as
lightweight as UCDs, the mere fact that there is an agreement to use these strings, even that is
enough to get lots of beautiful things done; and if you have more structure, in a thesaurus, you
get more things done. It's about getting some function from any consensus.
>> Matthew J. Graham: Thanks Norm. Our next speaker is Sebastien Derriere who is going to
be talking about building a smart portal for astronomy. Thank you, Sebastien.
>> Sebastien Derriere: Thank you Norman. Thank you Matthew. [laughter]. I could have made
the title "improving the CDS portal", but this is more ambitious and I hope…
>>: It's already smart.
>> Sebastien Derriere: [laughter]. It's already very smart, so I will start with the word portal,
because it is a fuzzy word. I took the official definition of the word portal from the
Merriam-Webster dictionary. There are several meanings to portal, and the first one is "a door,
entrance, especially a grand or imposing one". So I thought, well, that sounds good; I hope we
can build a portal to
do with church and one has to do with a tunnel, and we don't want to go that way, or an entry
point for diseases or pathogens in the body. Well, that's not good. And the last one is quite
modern; I think it's an arresting definition: "a site serving as a guide or point of entry to the
World Wide Web and usually including a search engine or a collection of links to other sites
arranged especially by topic". They must have had a hard time coming up with this kind of
definition; it's still a bit fuzzy. You can go to the World Wide Web, or do some searches, or point to
different things, and in fact when we do a portal we have the same kind of problem, so my
definition of an astronomy portal is a bit more restrictive. I say an astronomy portal is a web
interface, or more and more a web application, because it's becoming more integrated
somehow. It's something that will allow you to do some data discovery or service discovery,
maybe then run some queries on those data sets or services, and then why not do some
analysis and some workflows and so on, doing more and more of this in your browser. If you
want to do complicated things then you will need some kind of customizable interface where
you can put modular widgets as you want, and these widgets will communicate with each other
and maybe communicate with external applications. And we start to have the technologies
which do all that. We have very nice JavaScript frameworks which allow you to do quite
high-level programming for interfaces. You can do Ajax with XML or JSON to exchange packets
of information. We've got HTML5, which has more and more support and allows you to make
nice interfaces, and we've got all of these VO protocols with which you can query data. I will mainly
focus on the data and service discovery aspect in this session, because
that's where most of the semantics goes in. We've got two choices, in my opinion. We can go
for simplicity or we can go for complexity. You can make a nice simple portal or you can make a
quite complex portal. See all those widgets in this one; well, it's not very customizable, but
there is plenty of information in the [inaudible]. In practice, I looked at a few existing portals. This is
the VAO portal. You can maybe notice that the input here is quite simple. You can put in this
search box an object name or coordinates and say, what is the size of the area you consider
around this sky location? And when you press submit, you will get a bunch of results
corresponding to a list of resources around this location. The input is very simple and then you
can filter the results with facets and so on, so it's a good example of a VAO portal. If you go to
ADS, it's slightly different. You still have a single text input box, but you can start to put
some scripting in this box. You can categorize the input terms that you are putting in there, so
you can say: I want to find papers that have "cloud" in the title and which are dealing with a
thesaurus object, for example. You've got additional constraints which you can put in your
query. You can choose how you wish to sort the results and so on, and you've got the
possibility to log in. So you can categorize the inputs, you can choose somehow how the
sorting will be done, and you can log in, so once you've logged in, maybe the
portal can have some memory on your behavior and customize things for you. Again, on the
results page you've got this faceted search where you can restrict the results. A slightly
different portal is what was demonstrated by Andrew Connolly, two or three years ago at
the [inaudible]. He made a portal demonstration called ASCOT, and this
is made of many individual widgets. Here you've got the name resolver: you put in the name of
the object, you get the coordinates, and then you can broadcast these coordinates or use them
in conjunction with a widget which is writing an SQL query to some dataset, and when you
submit this you can make a plot of the data in another widget, so these widgets are talking to
each other. This is more of a data analysis portal than a simple data discovery portal; in fact
the data discovery is barely there, because there are predefined buttons for the catalogs which
you can query in this case. And then I am coming to CDS resources. This is the current VizieR front page or
the VizieR full search. We've got a simple search with two text boxes, but this is the
complete VizieR interface; it's much more complex than what I showed before. You can put
here some text, whatever you want. This can be an object name, keywords, names of catalogs
or bibcodes, and the service will try to figure out what to make of it, or you can say
yourself what kind of constraint you want to put on the search. In fact, this single interface can
do two different things. It can search for catalogs if you don't know them, so you can say:
I want catalogs with infrared photometry of pulsars, and it will try to find catalogs. Or
you can, if you know the name of a catalog, put an object here, or search for an object
across all of the catalogs, so you've got quite different things that you can do with this same
interface. The point is that it's complicated. I never show this to the students when I'm
teaching databases or access to VO data, because it is too complex. So we came up a few years
ago with the idea of having a single CDS portal where you can go and make a simple query, and
this query will be broadcast to Simbad, Aladin and VizieR, and we try to aggregate the results
together. Currently it's limited to an object name and/or position, just like the VAO portal.
The benefit is that the web interface is very simple. You've got a box where you put your
target; it can be an object name, it can be coordinates. You press go and then you get an
aggregated result. You get Simbad results at the top, where we made a selection of links: we
aggregate some information from Simbad and then we give you some pointers. Find out more
about this object in Simbad; find all of the bibliography for this object; or
find similar objects, which is something that Simbad itself does not do. In that
case, we take the object, look at the papers for this object, look at the other objects cited in
those papers, and make statistics on the fly, and so we say: oh, the papers that deal with this
object also often deal with these other objects. So that is some information that you can
retrieve.
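A toy version of that on-the-fly statistic: over the papers citing the target, count which other objects they also cite. The paper-to-objects index here is invented; in reality it would come from Simbad's bibliography:

```python
# Toy "similar objects" statistic via co-citation counting.
from collections import Counter

papers = {                      # hypothetical paper -> objects-cited index
    "2001ApJ...1": {"M31", "M32", "M110"},
    "2005AJ....2": {"M31", "M33"},
    "2009MNRAS.3": {"M31", "M32"},
}

def similar_objects(target):
    counts = Counter()
    for objects in papers.values():
        if target in objects:
            counts.update(objects - {target})
    return counts.most_common()

print(similar_objects("M31"))   # [('M32', 2), ('M110', 1), ('M33', 1)]
```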
Then there is a list of images which cover this object in Aladin, and a list of
catalogs. In fact we have two things for the VizieR results. First, we search for catalogs in
VizieR which deal with this object; these catalogs might not even contain sky
coordinates, they can have XY coordinates, or just be a list of numbered elements, for instance
values of atomic lines for this object. The second is a full positional search for this object in
VizieR. So those are the current capabilities of the portal, and that's all. What we plan to do is have
some improvements. For example, we would like to be able to put additional inputs into the
query other than the simple object name and coordinates, and make some smart interpretation
of these inputs and then broadcast a query to the various services that we have and possibly
others, and still at the same time keep it simple. That's not easy, and I tried to summarize a few
use cases to access through such a portal: what is the redshift of 3C 273; find information on
globular clusters in M31, around M31; what do we know about the proper motion of brown
dwarfs; I am trying to make an SED for Vega; I just want the Veron catalog, or the Veron
catalogs, because there are several versions; quasars with redshift greater than five; I want to
crossmatch SDSS and 2MASS. So I looked at existing portals. The most famous portal, which I think
everyone is using quite often, is Google. For the user, Google is simple. It's amazing that it has
remained that simple for so long. It's helping you by making suggestions, so in a way you feel
smart because as you type, you see suggestions coming and if you are very lazy and it's fast
enough you can simply click on what is suggested and you don't have to type the whole thing.
And the suggestions as well as the results are nicely sorted. Sometimes when discussing with
colleagues they have trouble sorting the output. They say oh, who am I to sort the output for
the user? I just give everything. This you cannot do when you have a data avalanche, when you
have many, many data sets you need someone to find the most relevant ones. You cannot
simply throw everything to the user. Imagine if Google throws all of the webpages it has for
one keyword. You would not use it. It is useless. It means that on the server side you need a
heck of a lot of indexing. It needs to be very fast and it needs to be capable of fuzzy searches
and to find if you made a typo. You want the system to tell you if you made a typo and still find
the right results and so on. A second example I had in mind for requirements for improving the
portal was Wolfram Alpha. Wolfram Alpha is smart, really smart. If you type "M51 redshift",
it will say: ah, I am assuming you mean that M51 is an astronomical object, which is good and
is what I want, and so it will interpret my input and just give me a result. That's nice. On
the user side it's simple, just a single text box where you can put stuff. You can categorize it
yourself, so there is a syntax where you can say "city:" and put the name of a city,
but if you don't, it will try to figure out, from the knowledge bases behind it, that
maybe this string is a city name. On the server side there is a lot of interpretation of your
inputs and there are many, many--I should have put plurals here--there are many knowledge
bases. They keep on harvesting data sets from databases, knowledge bases such as DBpedia
and so on and trying to annotate it with metadata so they can figure out what's in there and
have many relations, so if you put two city names into Wolfram Alpha it will give you the
distance between the cities, the time it takes to fly from one to the other, and so on. It's also
about thinking about what you have in mind when you make a simple search.
>>: Can I make a comment? So one of my complaints about Alpha is that it sometimes gives
you complete garbage. You can ask how quickly can you [inaudible] this, it says 42, and you've
got no way of telling, so there's no confidence in the…
>> Sebastien Derriere: I think there are some sources here, for each result, it says where it is
coming from and you can go and figure out the originals.
>>: [inaudible] it says [inaudible] meters, fine. You put average height of anybody from
anywhere in the world and it gives you the same answer, so that sort of thing. Clearly it's
picked up a number somewhere and it's misinterpreting it, so one important thing is that
you need some measure of confidence.
>> Sebastien Derriere: Yeah confidence.
>>: Whether that's like [inaudible] [laughter].
>> Sebastien Derriere: Yeah, so I tried to think, okay, what are the simple things you need to
build a smart portal. First the portal needs to understand what you mean when you give it
some input. For this you need some kind of stemming, making sure that star and stars are
interpreted in the same way; lemmatization, where you deal with synonyms, so whether you
type quasar or QSO it will be interpreted the same; and then you categorize the inputs, which is a
problem of named entity recognition. Once you've done that, well, what can you do with it?
What can you do with the query? You must somehow match the annotated, entity-recognized
search string to some search template in various contexts, which is what Wolfram
Alpha is doing, and you must somehow translate the query in each context. In the case
of the various CDS services, they don't have exactly the same metadata, for various reasons, but
we have to deal with that.
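As a rough illustration of these steps, a minimal Python sketch follows; the synonym table and categories are invented for the example, not taken from any actual CDS vocabulary.

    # Toy normalization pipeline: crude stemming, synonym mapping
    # (lemmatization), and a naive named-entity categorizer.
    SYNONYMS = {"qso": "quasar", "qsos": "quasar", "pm": "proper motion"}
    CATEGORIES = {"quasar": "object type", "redshift": "measurement",
                  "m31": "object name", "2mass": "mission/catalog"}

    def stem(token):
        # Naive stemming so that "star" and "stars" are treated the same.
        return token[:-1] if token.endswith("s") and len(token) > 3 else token

    def normalize(token):
        token = token.lower()
        return SYNONYMS.get(token, SYNONYMS.get(stem(token), stem(token)))

    def categorize(query):
        return [(t, CATEGORIES.get(normalize(t), "unknown")) for t in query.split()]

    print(categorize("QSOs redshift M31"))
    # [('QSOs', 'object type'), ('redshift', 'measurement'), ('M31', 'object name')]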
The first thing I really did was to look at statistics: what do astronomers do when they search in
VizieR? You've got this text box in VizieR where you can type whatever you want, so I just took
the logs of VizieR and looked at the contents, at what is used in there. If you do that you can
make some nice plots; this is for a few, I don't know, tens of thousands of queries. This is the
number of queries for each search term, ranked by order. This is the most frequently used term,
which happens to be the word 2MASS, so it has been searched more than 1000 times. Then the
next most searched term and so on. You can continue until you reach expressions which are
searched only once. It's a very classical shape, for it follows a power law. In fact it follows two
power laws: you've got one slope here and another one here. It's called a Zipf law. You find this
everywhere. Take a book, take every word of the book, compute how many times each word
appears, rank them, plot the number versus the rank in log-log scale and you will find a Zipf
law. It's universal.
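A minimal sketch of how such a rank-frequency plot can be made, assuming a hypothetical log file with one search string per line:

    from collections import Counter
    import matplotlib.pyplot as plt

    # Count how many times each distinct search string appears in the log.
    with open("queries.log") as f:
        counts = Counter(line.strip().lower() for line in f if line.strip())

    # Frequencies sorted from the most to the least searched term.
    freqs = sorted(counts.values(), reverse=True)

    # Frequency versus rank in log-log scale: a Zipf law shows up as a
    # straight line, or as two joined slopes as in the VizieR logs.
    plt.loglog(range(1, len(freqs) + 1), freqs)
    plt.xlabel("rank")
    plt.ylabel("number of queries")
    plt.show()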
>>: [inaudible].
>> Sebastien Derriere: Sorry?
>>: [inaudible]?
>> Sebastien Derriere: You will find, usually, it depends. Sometimes it's one slope and
sometimes there are two slopes, but it's something that is well known. It means that you
can have many different searches, a few searches will appear very often, and the trend
follows a Zipf law. I also tried to look at the average length of the search string in
number of characters. How many characters do people type in the search box before they
expect to have a result? What is your guess? How many characters, 10, 15, 20?
>>: Four, eight.
>>: Five.
>>: I guess eight.
>> Sebastien Derriere: It's less, it's less than that. If you look at the total number of
queries it's one, two, three, four, five: five characters is the average, the most likely length that
people will type. If you look at the distinct search strings then it goes to six, but most often it
will be between one, one is not often, but between three and ten characters. That's what people
will type in the search box. It can be as large as, well, this is garbage. [laughter]. But a few tens
have [inaudible]. You must deal with that.
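A short sketch of these length statistics, again over a hypothetical one-query-per-line log:

    from statistics import mean

    # Hypothetical log file with one search string per line.
    with open("queries.log") as f:
        queries = [line.strip() for line in f if line.strip()]

    # Average length over all queries versus over distinct search strings;
    # in the VizieR logs these came out around five and six characters.
    print("mean length, all queries:     ", mean(len(q) for q in queries))
    print("mean length, distinct strings:", mean(len(q) for q in set(queries)))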
People will not give you a lot of information most often. From the list of terms that I extracted from the logs, I tried to categorize the search
terms, and I made this very simple categorization. People will search in VizieR for catalogs, for
the name of a catalog. They can search for a mission or instrument; sometimes the same
word can be both. If I search for IRAS, IRAS is the name of a mission and it is also the name of the
IRAS catalog. They can search for the name of a person. They can look for measurements:
proper motion, redshift. They can look for object types, spiral galaxies, quasars; object names
or positions, which are somehow equivalent since you can easily go from object name to a position; or a data
product: I want a spectrum, I want something else. So I took my previous use cases and I tried
to categorize them with these categories, and this is what you end up with. These are
already complex queries given that people on average type six characters, but they can say, oh, I
can search for proper motion of brown dwarfs or globular clusters in M31. So what do I do if I
have an object type and an object? Do you have a question?
>>: So is five or six keystrokes what people are comfortable with for entry or is it more what is
needed to identify to get results?
>> Sebastien Derriere: I think this is a bit biased, first because the search box was quite small.
It was maybe 20 characters wide, so you would not type a lot of text, and maybe this is sufficient to
retrieve what they want. But if you want to make a more complex query like this one,
probably you would type a bit more information; hopefully, for me, you would give more
information.
>>: [inaudible].
>> Sebastien Derriere: Yes.
>>: That matches: you said that 2MASS was the most searched term, and that one has five
keystrokes.
>> Sebastien Derriere: Yeah. Building a smart portal, you have to think about what the user wants.
Maybe you know this cartoon: there is what the user wants, how he expresses it, what he is
typing in the search box; then you've got what the portal understands and what the
portal returns, and you don't want to end up with this. You want the output to be as close as possible to what
the user desires. In order to interpret and tag the user inputs, the simple way is
to say to the user, okay, please tag it for me, use this syntax. You've got this syntax in
ADS, where you can type author colon and a name, or in Google you can say I want results from this website
by putting site colon in front of the query, and you let the user do the tagging. The hard way is you let the user
type whatever he wants and you try to figure out what was meant. To solve this
problem you can use vocabularies, but I will show just after that vocabularies are not
enough. What you really wish to do, in order for the portal to feel smart, or to be smart, is
to do it on the fly, ideally, and provide suggestions. As the user is typing you want to
be able to make suggestions and say, oh, I know what you mean, and I
am giving you hints on what I am able to understand. So I made some prototypes where you
can, for example, categorize your inputs with the categories that I defined, the
target and the quantity redshift, and give this as an input. The user can type it, and you
can provide suggestions as he types: he has just typed quantity, so which quantities
start with an A? It can be absolute magnitude, abundance, age and so on.
You pick this from the vocabulary. In order to suggest you need to make some queries, sort
them, filter them; you don't want to provide hundreds of results. This tagging and suggesting in
general can be done with vocabularies and thesauri, but not only.
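Here is a minimal sketch of that kind of category-tagged suggestion; the vocabulary is a toy stand-in for the real thesauri.

    # The user may prefix input with a category ("quantity:", "target:");
    # suggestions are drawn from a small illustrative vocabulary.
    VOCABULARY = {
        "quantity": ["absolute magnitude", "abundance", "age",
                     "proper motion", "redshift"],
        "target": ["M31", "M51", "3C273", "Vega"],
    }

    def suggest_tagged(user_input, limit=5):
        if ":" not in user_input:
            return []
        category, _, prefix = user_input.partition(":")
        terms = VOCABULARY.get(category.strip().lower(), [])
        prefix = prefix.strip().lower()
        # Keep only terms matching the typed prefix, capped at 'limit'.
        return [t for t in terms if t.lower().startswith(prefix)][:limit]

    print(suggest_tagged("quantity:a"))
    # ['absolute magnitude', 'abundance', 'age']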
You don't have thesauri for everything. If you take author names, where do you find a reference list of author name
values? Here you can take ADS, with complete statistics on everything, and make statistics on
author names, but then how do you sort? Who are the most relevant authors in the output?
That is a big problem. You will see more on the next slide. It is more difficult for object
names. We've got more than 12 million identifiers in Simbad. We can load all of this in
memory; it fits. But if you go to all of the possible object names that are specified by the
dictionary of nomenclature, you've got nearly ten billion in VizieR, ten billion identifiers for all of
the big catalogs, take 2MASS and [inaudible] and so on. And what you want to do, ideally, is as
you type in the search box you would search the vocabularies and of course the object names at the
same time, to find what the user is typing. There is a big issue of finding this and sorting.
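For the in-memory case, a sorted identifier list plus binary search is one simple way to get fast prefix lookup; the handful of names below stand in for Simbad's millions.

    import bisect

    # A tiny illustrative sample of object identifiers, kept sorted.
    IDENTIFIERS = sorted(["M31", "M51", "NGC 7377", "3C 273",
                          "2MASS J00424433+4116085"])

    def prefix_search(prefix, limit=10):
        """Binary-search the sorted list for identifiers starting with prefix."""
        lo = bisect.bisect_left(IDENTIFIERS, prefix)
        out = []
        for name in IDENTIFIERS[lo:lo + limit]:
            if not name.startswith(prefix):
                break
            out.append(name)
        return out

    print(prefix_search("M"))  # ['M31', 'M51']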
It is mandatory to sort things if you want to face the data avalanche. For this you want to make use
of some statistics. You can make statistics, for example, on the most queried catalogs, or
present first the papers with the most citations. But I know people who are not happy with this
approach. It is biased, the winner takes all: if there is one catalog which is frequently
queried, it will always come out at the top. Yes, but that's life; that's what everyone is doing. If
you don't do that, it is most likely that people won't be happy with what
you present. They will say, ah, I am looking for this one, the most used one, and it's not at the
top; this is not logical. And it's not only the total number in the usage statistics, it is also the
trend, so you need some short-term statistics. What about a paper that is recent? It doesn't yet
have a big number of citations, but it's got a high citation rate, so it's popular, and you want this
one to move to the top even if its total number of citations is low.
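One possible way to express this, sketched in Python: a score mixing long-term popularity with a short-term citation rate, where the weighting is an invented assumption.

    import math

    # A recent, fast-rising item can outrank one with more total citations.
    def score(total_citations, citations_last_year):
        long_term = math.log1p(total_citations)   # damped long-term popularity
        short_term = citations_last_year          # recent trend
        return long_term + 0.5 * short_term

    old_paper = score(total_citations=900, citations_last_year=10)
    new_paper = score(total_citations=40, citations_last_year=35)
    print(old_paper, new_paper)  # the new, "hot" paper scores higher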
The last point I want to make before the conclusion: you've got one query and you have
interpreted it correctly. You search for quasars with redshift greater than five. You've identified
that quasar is an object type and redshift is a measurement, and you've got a numerical constraint on
the measurement. Then you need to broadcast this query to different services. They will
talk different languages and transmit different data. In Simbad, quasars
are called QSO; redshift is called redshift. In VizieR you have to look for astronomy keywords,
QSOs with an S, and you find the redshift with the UCD. You need mappings between all of
these, and this is probably the hardest part in the work of making these smart things.
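A minimal sketch of such a mapping layer; the tables below are illustrative stand-ins, not the actual Simbad or VizieR vocabularies.

    # One interpreted query, broadcast to services speaking different languages.
    QUERY = {"object_type": "quasar", "measurement": "redshift", "constraint": "> 5"}

    # Per-service translation tables (illustrative assumptions only).
    SERVICE_MAPPINGS = {
        "Simbad": {"object_type": {"quasar": "QSO"},
                   "measurement": {"redshift": "redshift"}},
        "VizieR": {"object_type": {"quasar": "QSOs"},
                   "measurement": {"redshift": "src.redshift"}},
    }

    def translate(query, service):
        m = SERVICE_MAPPINGS[service]
        return {"object_type": m["object_type"][query["object_type"]],
                "measurement": m["measurement"][query["measurement"]],
                "constraint": query["constraint"]}

    for service in SERVICE_MAPPINGS:
        print(service, translate(QUERY, service))
    # Simbad {'object_type': 'QSO', 'measurement': 'redshift', 'constraint': '> 5'}
    # VizieR {'object_type': 'QSOs', 'measurement': 'src.redshift', 'constraint': '> 5'}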
So let me just summarize a few ideas that I have presented to you. I want to build a smart portal which will
enable new data services, new features, and discovery. It's important: we released a
crossmatch service which is able to cross big catalogs and so on, and still people will query Simbad
with a list of 20,000 objects one by one. They make 20,000 queries. They don't know that
there is a service that can do this in seconds, and they overload your server with
thousands of queries. So if people query object by object from a list, you want the portal to say, hey, there
is a service that is able to answer your query directly. Building a smart portal is extreme text mining.
You've got very short input strings, which makes things more difficult; it is a different challenge
from mining the full ADS corpus. You can learn from your logs: look at what people are
searching for and you will probably serve them better. There is no such thing as too much
metadata. If you want to interpret queries in various contexts, you will always reach a point
where, ah, why don't we have this piece of metadata in the database? Yes, you can add it, but it
will take days and days of work, so think about the metadata early. Sorting and suggesting is
important. You can learn from the actual usage of the portal. If users are registered you can
even customize the output to remember what they have been searching for and provide it in
the output. Things are changing. It was said earlier that the vocabularies can change. There
can be new data. There can be obsolete data sets or services. You need to take this into
account to have a dynamic answer. And the last point is: do not underestimate the carbon-based intelligence. You will never be able to build a portal which is smarter than the user, or I
don't think so. But you must have confidence in the user, that he will be able to somehow
rephrase his query to find what he wants, so don't try to envision all of the possible cases. Try
to be smart enough with the portal and the user will be smart also with you. Thank you.
[applause].
>> Matthew J. Graham: We've probably got time for maybe one question.
>>: The visuals that your portal generates, what would be the solution [inaudible] multidimensional data? What visuals do you…
>> Sebastien Derriere: Oh. I am not planning to…
>> Matthew J. Graham: Can you repeat the question?
>> Sebastien Derriere: Sorry?
>> Matthew J. Graham: Can you repeat the question?
>> Sebastien Derriere: Yeah, the question is what do we plan to do when the result is
multidimensional data. At this point the portal will not be about visualizing the data and so on,
but mostly interpreting the input and redirecting the user to specific links. If we can
provide the numerical answer directly, like for the redshift, that's good. But if the
user asks for globular clusters in M31, you will say, oh, I can search for a catalog containing this
kind of object type, such as globular clusters, make the search in the area of M31, and tell
the user that this is the kind of query you can construct and customize. Or in Simbad
you can filter on the object type stored in Simbad, same thing, look around M31,
customize the output with a search radius and so on. So it would be more like providing the
user with a list of directions in which to get different results than trying to integrate and
visualize everything on the same page. This is for later, much later probably.
>> Matthew J. Graham: Okay. We will take a 20 minute break if we may, and then we will
reconvene at 3:07.