Information Retrieval Systems
Maria Indrawan
C4.26, 9903-1916 maria.indrawan@infotech.monash.edu.au
© Maria Indrawan Monash
University 2003
structured: relational database
XML documents
non-structured: free text, search engine
• data representation
• query formulation
• matching
• What will I learn in this unit?
– how to manage data that cannot be effectively handled by a relational DBMS.
• XML documents
• Text (free text)
• There will be no SQL in this unit.
• On completion of this unit, you will (hopefully!) be able to:
– Understand the different natures of information (structured, semi-structured, unstructured) and the issues each raises for information retrieval.
– Understand the XML technologies and their role in information retrieval.
– Create and manipulate XML documents.
– Understand the design issues and the various approaches to the development of text databases.
• Relational database concepts, such as SQL, indexing.
• Basic UNIX commands, e.g. file and directory manipulation commands.
• HTML.
• Basic level of Maths (year-12 level).
• Component A:
– Assignment 1 – XML Schema (week 6): 10%
– Assignment 2 – XSLT (week 9): 15%
– Unit Test – XML, XSLT (week 10): 15%
• Component B:
– Assignment 3 – Research Paper (week 12): 10%
• Component C:
– Exam: 50%
• Component A:
– Assignment 1 – XML Schema (week 6): 10%
– Assignment 2 – XSLT (week 9): 15%
– Unit Test – XML, XSLT (week 10): 15%
• Component B:
– Unit Test on text retrieval (week 12): 10%
• Component C:
– Exam: 50%
• The result of the unit test will determine the maximum final grade for Component A as follows:
Unit Test        Maximum grade for Component A
Fail             Pass
Pass             Credit
Credit           Distinction
Distinction      High Distinction
• In order to pass this unit you must attain:
– 50% overall and
– at least 40% of the available marks in each component A, B and C.
Prescribed:
XML: How to Program (1st ed.)
Deitel, H.M., Deitel, P.J., Nieto, T.R., Lin, T. and Sadhu, P.
Prentice Hall
Recommended:
Professional XML, 2nd ed., WROX Publisher.
Beginning XML, WROX Publisher.
XML Schema
Eric Van Der Vlist, O’Reilly Publishing.
• Unit website:
– www.csse.monash.edu.au/courseware/cse4500
– www.csse.monash.edu.au/courseware/cse3201
• Useful links:
– WWW consortium http://www.w3.org/
– http://www.topxml.com
– http://www.xmlsoftware.com.au
– XML Editor http://www.xmlspy.com
• Please read all the necessary university materials on cheating/plagiarism (listed in the unit guide).
• Quota system
• Acceptable policy
– http://www.infotech.monash.edu.au/myfit/students/student_labinfo_rules_netusage.cfm
– http://www.adm.monash.edu.au/unisec/pol/itec12.html
• I have a question on …
– Read the textbook or reading list.
– Explore additional materials, eg W3C.
– Ask my tutor.
– Ask my lecturer.
• Can I ask my tutor/helpdesk to find the bugs in my work?
– No.
• Will the solution to the tutorial exercises be published?
– No. Students are encouraged to discuss their work with the tutors.
• Will studying the lecture notes be sufficient for this unit?
– No. Students need to read the textbook and additional reading list.
• Be able to:
– Understand XML technologies and their roles.
– Understand different components of an XML document.
– Create a well-formed XML document.
• XML = Extensible Markup Language.
• Markup languages:
– HTML
– SGML
• Uses markup to define the
– structure
– semantics (to a certain level).
• WWW Consortium (W3C) recommendation
– www.w3c.org
HTML
• tags define the presentation layout
<p>CSE3201</p>
<p>Information Retrieval</p>
XML
• tags define the structure and the meaning of the data
<unit>
  <unitCode>CSE3201</unitCode>
  <unitName>Information Retrieval</unitName>
</unit>
• Distributed applications need to share data.
– plain text
– structure and the meaning of the data are tightly defined.
• Delivery of data to multiple devices
– Separation of data and presentation.
<bookshop>
  <book>
    <title>Harry Potter and the Sorcerer’s Stone</title>
    <author>
      <initials>J.K</initials>
      <surname>Rowling</surname>
    </author>
    <price value="$16.95"></price>
  </book>
  …
</bookshop>
[Figure: tree view of the document. bookshop contains book elements; each book contains title, author (with initials and surname) and price (with a value attribute).]
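The tree view of the bookshop document can be reproduced with a parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module (not part of the unit materials); the helper name outline is hypothetical, and the books elided by "…" are left out.

```python
import xml.etree.ElementTree as ET

BOOKSHOP = """
<bookshop>
  <book>
    <title>Harry Potter and the Sorcerer's Stone</title>
    <author>
      <initials>J.K</initials>
      <surname>Rowling</surname>
    </author>
    <price value="$16.95"></price>
  </book>
</bookshop>
"""

def outline(element, depth=0):
    """Return one indented line per element; attributes are marked with '@'."""
    lines = ["  " * depth + element.tag]
    for name in element.attrib:
        lines.append("  " * (depth + 1) + "@" + name)
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(BOOKSHOP)
print("\n".join(outline(root)))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document.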
• DTD/Schema
– definition of XML structures
• XSL (XSLT and XSL-FO)
– presentation
• XPath
– locating nodes
• XLink, XPointer
– linking
• DOM and SAX
– APIs to manipulate XML
• Required to read and manipulate XML documents.
• Reads the XML document as plain text and transforms it into a data structure, typically a tree, in memory.
• Applications, such as web browsers, access the data structure and process the data according to their objectives.
• Example: msxml
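The division of work between parser and application can be sketched as follows. This assumes Python's standard xml.dom.minidom as a stand-in for msxml; the slides do not prescribe a particular parser.

```python
from xml.dom import minidom

# The document arrives as plain text...
text = "<unit><unitCode>CSE3201</unitCode><unitName>Information Retrieval</unitName></unit>"

# ...and the parser turns it into an in-memory tree (a DOM Document).
document = minidom.parseString(text)

# The application then works on the tree, not on the raw text.
unit = document.documentElement
code = unit.getElementsByTagName("unitCode")[0].firstChild.data
name = unit.getElementsByTagName("unitName")[0].firstChild.data
print(unit.tagName, code, name)
```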
• SOAP (simple object access protocol)
• Microsoft BizTalk Server
• WSDL and UDDI in Web Services
• Semantic Web
• Performance
– text processing vs binary processing
• Security
• Elements.
• Attributes.
• Character and Entity References.
• Character Data (CDATA).
• Processing Instruction.
• Comments.
[Figure: the bookshop document as a tree. Root element (compulsory): bookshop. Branch elements such as book and author contain other elements. Leaf elements such as title, initials and surname contain no other elements. value is an attribute of price.]
• The basic building block of XML markups.
• It may contain:
– Text
– Other elements (child elements)
– Attributes
– Character Data
– Other markup, eg comments
• Delimited with a start-tag and an end-tag.
• Element can be empty.
• Unlike in HTML, the end-tag CANNOT be omitted.
• Each tag must contain a valid element type name.
• Element’s name (tag’s name) is CASE SENSITIVE.
– <BOOK>, <Book> and <book> are three different names.
• Trailing space is legal but will be ignored
– <BOOK > = <BOOK>
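Both naming rules can be checked with a parser. A small sketch, assuming Python's standard xml.etree.ElementTree; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# BOOK and book are different names, so this end-tag does not close the start-tag.
try:
    ET.fromstring("<BOOK>Harry Potter</book>")
    mismatch_error = None
except ET.ParseError as err:
    mismatch_error = str(err)

# Trailing space inside a tag is accepted and is not part of the name.
trailing_space_tag = ET.fromstring("<BOOK >Harry Potter</BOOK >").tag

print(mismatch_error)
print(repr(trailing_space_tag))
```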
• Has no content.
• May have attributes.
• Example:
<img src='logo.png'></img> can be abbreviated to
<img src='logo.png'/>
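A quick check that the two spellings are equivalent to a parser, assuming Python's standard xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<img src='logo.png'></img>")
short_form = ET.fromstring("<img src='logo.png'/>")

# Both forms yield an element with no text, no children, and the same attribute.
for img in (long_form, short_form):
    print(img.tag, img.attrib, img.text, len(img))
```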
• Elements.
• Attributes.
• Character and Entity References.
• Character Data (CDATA).
• Processing Instruction.
• Comments.
• Information regarding the element.
“If elements are ‘nouns’ of XML then attributes are its ‘adjective’.”
• <tagname attribute_name="attribute_value">
<book>
  <title>Harry Potter</title>
</book>
<book title="Harry Potter">
</book>
• Determined by the semantic content.
• Attributes are characteristics of an element.
<book>
  <title>Harry Potter</title>
</book>
<book title="Harry Potter">
</book>
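The choice also changes how an application reaches the value. A minimal sketch, assuming Python's standard xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

as_element = ET.fromstring("<book><title>Harry Potter</title></book>")
as_attribute = ET.fromstring('<book title="Harry Potter"></book>')

# Child element: reached by navigating the tree; it could hold further markup.
print(as_element.find("title").text)

# Attribute: a name/value pair stored on the element itself; the value is a plain string.
print(as_attribute.get("title"))
```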
• Elements.
• Attributes.
• Character and Entity References.
• Character Data (CDATA).
• Processing Instruction.
• Comments.
• Used to enter characters that the input device (keyboard) does not support.
– e.g. entering £ on a US-ASCII keyboard.
• Format: &#NNNNN; or &#xXXXX;
– N decimal
– X hexadecimal
• Example: $ => &#36; OR &#x24;
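A parser resolves both notations to the same character. A short sketch, assuming Python's standard xml.etree.ElementTree; the price text is made up for illustration.

```python
import xml.etree.ElementTree as ET

# &#36; and &#x24; both denote "$"; &#163; and &#xA3; both denote the pound sign.
price = ET.fromstring("<price>&#36;16.95 = &#x24;16.95, &#163;10 = &#xA3;10</price>")
print(price.text)

# The numbers are simply the character's code point in decimal and hexadecimal.
print(ord("$"), hex(ord("$")))
```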
• Entities may be defined and used for:
– Representing characters used in markup
• &lt; == "<"
• &amp; == "&"
– Strings
• &IR; == Information Retrieval
• Predefined entities: &lt;, &gt;, &quot;, etc.
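Both uses can be tried out with a parser. A sketch assuming Python's standard xml.etree.ElementTree; declaring IR in an internal DOCTYPE subset is one way to define such an entity and is an assumption here, since the slide does not show the declaration.

```python
import xml.etree.ElementTree as ET

# Predefined entities stand in for characters that would otherwise read as markup.
predefined = ET.fromstring("<note>a &lt; b &amp; c &gt; d, &quot;quoted&quot;</note>")
print(predefined.text)

# A document can declare its own entity and then reference it by name.
custom = ET.fromstring(
    '<!DOCTYPE unit [<!ENTITY IR "Information Retrieval">]>'
    "<unit>&IR; Systems</unit>"
)
print(custom.text)
```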
• Elements.
• Attributes.
• Character and Entity References.
• Character Data (CDATA).
• Processing Instruction.
• Comments.
• To escape blocks of text containing characters which would otherwise be recognized as markup.
• <![CDATA[…]]>
• <![CDATA[<greeting>Hello, world!</greeting>]]>
<example>
<![CDATA[&Warn;-&Disclaimer;<© 2001; &PM;>]]>
</example>
<example>
&Warn;-&Disclaimer;&lt;&copy 2001; &PM; &gt;
</example>
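The effect of a CDATA section is easiest to see with the greeting example. A sketch assuming Python's standard xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<example><![CDATA[<greeting>Hello, world!</greeting>]]></example>")
escaped = ET.fromstring("<example>&lt;greeting&gt;Hello, world!&lt;/greeting&gt;</example>")
real_markup = ET.fromstring("<example><greeting>Hello, world!</greeting></example>")

# Inside CDATA the tags are plain text: one string, no child elements.
print(with_cdata.text, len(with_cdata))
# Escaping every "<" and ">" by hand gives the same text.
print(escaped.text == with_cdata.text)
# Without CDATA or escaping, <greeting> is a real child element.
print(real_markup.text, len(real_markup))
```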
• Elements.
• Attributes.
• Character and Entity References.
• Character Data (CDATA).
• Processing Instruction.
• Comments.
• Processing instructions (PIs) allow documents to contain instructions for applications.
• <?target … instruction … ?>
• Target is used to identify the application or other object to which the PI is directed.
• <?xml-stylesheet href="mystyle.css" type="text/css"?>
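How an application sees a PI can be sketched as below, assuming Python's standard xml.dom.minidom; the unit element is a made-up root so that the document is complete.

```python
from xml.dom import minidom, Node

document = minidom.parseString(
    '<?xml-stylesheet href="mystyle.css" type="text/css"?><unit>CSE3201</unit>'
)

# Processing instructions appear in the tree as their own node type,
# split into the target and the instruction text.
pis = [n for n in document.childNodes if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]
for pi in pis:
    print(pi.target, "->", pi.data)
```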
• Elements.
• Attributes.
• Character and Entity References.
• Character Data (CDATA).
• Processing Instruction.
• Comments.
• Syntax:
<!-- comment text -->
• Comments cannot be used within element tags.
<tag>… some content … <tag <!-- it is illegal -->>
• Comments may never be nested.
<!-- Comments cannot <!-- be nested --> like this -->
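The comment rules can be confirmed with a parser. A sketch assuming Python's standard xml.etree.ElementTree; the sample strings are adapted from the examples above so that each forms a complete document.

```python
import xml.etree.ElementTree as ET

samples = {
    "comment between tags": "<tag>some content <!-- this is fine --></tag>",
    "comment inside a tag": "<tag>some content</tag <!-- it is illegal -->>",
    "nested comment": "<tag><!-- Comments cannot <!-- be nested --> like this --></tag>",
}

results = {}
for label, text in samples.items():
    try:
        ET.fromstring(text)
        results[label] = "accepted"
    except ET.ParseError:
        results[label] = "rejected"
    print(label, "->", results[label])
```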
• XML document has to be well-formed.
– Conform to syntax requirements
– Conform to a simple container structure
• Common structure of XML document:
– Prolog
– Body
– Epilog
• Includes:
– XML Declaration
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
• Version is mandatory, encoding and standalone are optional
– Document Type Declaration
<!DOCTYPE
• Not the same thing as the DTD (Document Type Definition)
• A simple well-formed XML does not need it.
– Schema declaration
• Body
– Contains 1 or more elements
– The “contents”
• Epilog
– Hardly used
– Can be used to identify end of document
• Contains a root element.
• Uses valid tag names.
• Has no overlapping tags.
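These three rules can be checked mechanically. A minimal sketch, assuming Python's standard xml.etree.ElementTree; is_well_formed is a hypothetical helper name, and a parser that accepts the text is taken as evidence of well-formedness.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

checks = {
    # one root element, properly nested
    "<bookshop><book><title>Harry Potter</title></book></bookshop>": True,
    # two top-level elements, so no single root
    "<book></book><book></book>": False,
    # overlapping tags: title is closed after book
    "<book><title>Harry Potter</book></title>": False,
    # invalid tag name: a name cannot start with a digit
    "<1book></1book>": False,
}
for text, expected in checks.items():
    print(is_well_formed(text), expected, text)
```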