LIS512 lecture 12: conclusions
Thomas Krichel
2010-12-22

knowledge organization
• As I have already said, the course is not about knowledge organization.
• Instead it is about the organization of information.
• In fact it is only about library cataloging.
• Library cataloging represents a very tiny amount of the organized information used by people. It follows pre-web and even pre-computer standards.

information retrieval
• is generally about getting to information that is in some type of system.
• Usually this is a computer system.
• Most of the time it's about what algorithms the system needs to implement.
• Recently information retrieval has become a huge field because of search engines.

information retrieval first
• One may argue that one has to understand information retrieval first.
• Information retrieval in the sense that you have to understand how people use information.
• Then one designs a system that helps people with the way they search.

search engines
• Search engines such as Google have become the way most people seek information.
• Information found in search engines comes mostly from the open web.
• That means from web pages that can be reached by browsing the web using links.

deep web
• Library catalogs are mainly part of what is referred to as the "deep web".
• This is information on the web that is not available through browsing.
• The information we have in our catalog is part of it, although it really would not have to be; we could also publish it in a non-deep-web form.
• I don't know why this has not been done for Koha.

modern information organization
• Since people use web search engines to find information, the people running search engines de facto control access to information.
• But the search engine (usually) can only show what has been put on web sites.
• Thus webmasters have a crucial role to play.

how do search engines work?
• Initially web search engines worked a bit like the search engine we have in Koha.
• That engine basically looks for the words of the query in a record and shows those records that have them.
• Some people, mainly in the 1990s, advocated some form of cataloging for web pages.
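To make this concrete, here is a minimal sketch of such a word-matching engine in Python. The records and the matching rule are invented for this illustration; a real system such as Koha uses a proper indexing engine rather than a linear scan.

    # a toy word-matching engine, in the spirit of early web search;
    # the records are made up for this illustration
    records = {
        1: "introduction to cataloging and classification",
        2: "organizing information in libraries",
        3: "the theory and practice of knowledge organization",
    }

    def search(query):
        """Return the ids of records that contain every word of the query."""
        terms = query.lower().split()
        return [rid for rid, text in records.items()
                if all(term in text.split() for term in terms)]

    print(search("knowledge organization"))   # -> [3]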
metadata
• The idea of metadata for the web is that we can add some information to pages that gives more, or other, information than the text that is on them.
• It has been largely discredited by the amount of effort required to do this and by the lack of support for it by search engines (past and present).

the big computer brain
• If a computer can simply suck in all the information on the web and build a search engine, is there any need for any metadata?
• Hmm, the information that is on the web is already structured in some form:
– HTML headers
– image file headers
– MP3 headers

metadata not dead
• At its start, Google bragged about not using any metadata at all.
• The words you would search for would be on the pages.
• Sometimes, however, words from the links to the page are used.
• Sometimes, special components of a page are used.

metadata not dead
• Users generally use a search engine to reach a certain site.
• Once they are on the site, they like to follow links around the site rather than doing more searching (I think).
• To organize a site, you need to have some form of organization of your material.

digital preservation
• The big problem looming with digital information is how to preserve it.
• Some information may not be worth preserving.
• But any crap becomes valuable when it is old enough, or when there is a lot of it.
• One thing is for sure: it can not be done without organizing information.

back to cataloging
• If cataloged material receives so little attention, is it worth spending a whole course on it?
• Yes, physical libraries still need it. We don't know, however, how long they will be around.
• On the other hand, the cataloging rules bring in some thinking about structuring information that can be of some use in other areas of information organization.

back to cataloging
• Other areas of information organization are less organized by humans.
• A lot of them are
– subject to research
– owned by companies
• Therefore I have no problem with spending so much time on cataloging.
• But the way to do it is to relate the cataloging rules more closely to general principles.

the future
• Webmasters will have more and more control over what people can and will read.
• Search engine builders will have more and more control over what people can find.
• While we will find it difficult to build search engines, we all can build web sites.
• Learn web site architecture and see information organization in action in another setting.

web information retrieval
• We can think of the web as a pile of documents called pages.
• Some pages are hard to index
– PDF documents
• Some engines index these, some don't
– pictures
– sound files
• But a majority of pages are written in HTML
– easy to index
– have a loose structure

Google uses the structure of HTML
• Google finds the title of the page, i.e. the contents of the <title> element.
• Google analyzes headings and large font sizes and gives priority weight to terms found there.
• Most importantly, Google uses the link structure of the web to find important pages.

Google finds important pages
• The idea is that the documents on the web have different degrees of importance.
• Google will show the most important pages first.
• The idea is that more important pages are likely to be more relevant to any query than non-important pages.

Google's monkey
• Imagine that the web has P pages. Each page has its own address (URL).
• Imagine a monkey who sits at a terminal. He follows links at random, but on rare occasions he gets bored and types in the address of a random page out of those P.
• Will the monkey visit all pages with equal probability?

page rank
• The Google page rank of a page is the probability that Google's monkey will visit the page.
– The monkey will come frequently to pages that have a lot of links to them.
– Once he is at an important page, he is likely to move on to one of the pages that the important page links to, so those pages become important too.
• The structure of all the links on the entire web reveals the importance of a page.

many page ranks
• There is an infinite number of ways to calculate the page rank, depending on
– how likely the monkey is to get bored.
– the probability of the monkey visiting each page when bored.
• Potentially, there is a page rank for each user of the web. Google tries to observe users and may be associating personal page ranks with them.

notation
• Assume that the monkey gets bored with probability d. If bored, he will visit page p with probability π_p.
• For any page p, let o_p be the number of outgoing links.
• Let l(p',p) be the number of links from page p' to page p.

page rank formula
• The page rank of a page p is
r_p = π_p d + (1-d) ∑_{p'} l(p',p) r_{p'} / o_{p'}
• In words, it is the likelihood that, if bored, the monkey goes to page p, plus the likelihood that he gets there from another page p'. The likelihood of getting there from p' is the likelihood of being at p', times the number of links from p' to p, divided by the number of outgoing links on p'.
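Here is a sketch, in Python, of the iterative approximation that this formula suggests. It uses the four-page web of the example on the next slide, with the d and π values given there; it illustrates the idea and is not Google's actual implementation.

    # iterative page rank for a small web; the pages and links are
    # those of the four-page example on the next slide
    links = {                     # page -> pages it links to
        'A': ['B'],
        'B': ['C'],
        'C': ['A', 'D'],
        'D': ['A'],
    }
    d = 0.25                      # probability that the monkey gets bored
    pi = {p: 0.25 for p in links} # where a bored monkey goes

    # start from the bored-monkey distribution and repeatedly apply
    # r_p = pi_p d + (1-d) sum over p' of l(p',p) r_p' / o_p'
    rank = dict(pi)
    for _ in range(100):
        rank = {p: pi[p] * d + (1 - d) *
                   sum(rank[q] / len(links[q])
                       for q in links if p in links[q])
                for p in links}

    for p in sorted(rank):
        print(p, round(rank[p], 4))      # A, with two incoming links,
                                         # gets the highest rank (~0.29)
    print(round(sum(rank.values()), 4))  # the ranks sum to 1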
example
• Let there be a web of four pages A, B, C, D.
• A links to B.
• B links to C.
• C links to A and D.
• D links to A.
• Let the probability of getting bored be ¼, with a ¼ chance of moving to any page when bored.

page ranks
• The following system calculates the ranks:
r_A = ¼ · ¼ + ¾ (r_C / 2 + r_D)
r_B = ¼ · ¼ + ¾ r_A
r_C = ¼ · ¼ + ¾ r_B
r_D = ¼ · ¼ + ¾ r_C / 2
• Since this is fairly complicated, Google uses an iterative approximation to calculate the ranks.
• Note that the sum of all ranks is 1.

http://openlib.org/home/krichel
Thank you for your attention!
Please switch off machines b4 leaving!