CSE3201/4500 - Monash University

CSE3201/4500

Information Retrieval Systems

Maria Indrawan

C4.26, 9903-1916 maria.indrawan@infotech.monash.edu.au

© Maria Indrawan Monash

University 2003

Type of Data

structured: relational database

semi-structured: XML documents

non-structured: free text, search engine

• data representation

• query formulation

• matching

Introduction

• What will I learn in this unit?

– how to manage data that cannot be effectively handled by a relational DBMS.

• XML documents

• Text (free text)

• There will be no SQL in this unit.

Objectives

• On completion of this unit, you will (hopefully!) be able to:

– Understand the different natures of information (structured, semi-structured, unstructured) and the issues each raises for information retrieval.

– Understand the XML technologies and their role in information retrieval.

– Create and manipulate XML documents.

– Understand the design issues and the various approaches to developing text databases.

Prerequisite Knowledge

• Relational database concepts, such as SQL and indexing.

• Basic UNIX commands, e.g. file and directory manipulation commands.

• HTML.

• Basic level of Maths (year-12 level).

Assessment

• There are different assessments for CSE3201 and CSE4500.

• Undergraduate students => CSE3201

• Masters students => CSE4500

CSE4500 Assessment

Component A:

– Assignment 1 – XML Schema (week 6): 10%

– Assignment 2 – XSLT (week 9): 15%

– Unit Test – XML, XSLT (week 10): 15%

CSE4500 Assessment

Component B:

– Assignment 3 – Research Paper (week 12): 10%

Component C:

– Exam: 50%

CSE3201 Assessment

Component A:

– Assignment 1 – XML Schema (week 6): 10%

– Assignment 2 – XSLT (week 9): 15%

– Unit Test – XML, XSLT (week 10): 15%

CSE3201 Assessment

Component B:

– Unit Test on text retrieval (week 12): 10%

Component C:

– Exam: 50%

Assessment Rules

• The result of the unit test will determine the final grade for Component A as follows:

Unit Test      Maximum grade for Component A

Fail           Pass

Pass           Credit

Credit         Distinction

Distinction    High Distinction

Assessment Rules

• In order to pass this unit you must attain:

– 50% overall and

– at least 40% of the available marks in each component A, B and C.

Textbook

Prescribed:

XML: How to Program (1st ed.)

Deitel, H.M., Deitel, P.J., Nieto, T.R., Lin, T. and Sadhu, P.

Prentice Hall

Recommended:

Professional XML, 2nd ed., WROX Publisher.

Beginning XML, WROX Publisher.

XML Schema, Eric van der Vlist, O’Reilly Publishing.

Resources

• Unit website:

– www.csse.monash.edu.au/courseware/cse4500

– www.csse.monash.edu.au/courseware/cse3201

• Useful links:

– WWW Consortium http://www.w3.org/

– http://www.topxml.com

– http://www.xmlsoftware.com.au

– XML Editor http://www.xmlspy.com

Plagiarism/Cheating

• Please read all the necessary university materials on cheating/plagiarism (listed in the unit guide).

Computing Facilities

• Quota system

• Acceptable use policy

– http://www.infotech.monash.edu.au/myfit/students/student_labinfo_rules_netusage.cfm

– http://www.adm.monash.edu.au/unisec/pol/itec12.html

Being Resourceful and Independent

• I have a question on …

– Read the textbook or reading list.

– Explore additional materials, eg W3C.

– Ask my tutor.

– Ask my lecturer.

• Can I ask my tutor/helpdesk to find the bugs in my work?

– No.

• Will the solutions to the tutorial exercises be published?

– No. Students are encouraged to discuss their work with the tutors.

• Will studying the lecture notes be sufficient for this unit?

– No. Students need to read the textbook and the additional reading list.

Basic XML

Objectives

• Be able to:

– Understand XML technologies and their roles.

– Understand different components of an XML document.

– Create a well-formed XML document.

What is XML?

• XML = Extensible Markup Language.

• Markup Languages:

– HTML

– SGML

• Uses markup to define the

– structure

– semantics => to a certain level.

• WWW Consortium (W3C) recommendation

– www.w3c.org

XML vs HTML

HTML

• tags define the presentation layout

<p> CSE3201 </p>

<p> Information Retrieval </p>

XML

• tags define the structure and the meaning of the data

<unit>

<unitCode> CSE3201 </unitCode>

<unitName> Information Retrieval </unitName>

</unit>
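
Because XML tags name the data, a program can look up a field by its meaning. A minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of the original slides):

```python
import xml.etree.ElementTree as ET

# The <unit> document from the slide, as a string.
unit_xml = """
<unit>
  <unitCode> CSE3201 </unitCode>
  <unitName> Information Retrieval </unitName>
</unit>
"""

unit = ET.fromstring(unit_xml)

# Each field is found by its tag name, not by its position on the page.
unit_code = unit.find("unitCode").text.strip()
unit_name = unit.find("unitName").text.strip()

print(unit_code, "-", unit_name)
```

The HTML version offers no such handle: both values sit in identical <p> tags, so a program could only tell them apart by position.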

Why XML?

• Distributed applications need to share data.

– plain text

– structure and the meaning of the data are tightly defined.

• Delivery of data to multiple devices

– Separation of data and presentation.

XML Document – an Example

<bookshop>

<book>

<title> Harry Potter and the Sorcerer’s Stone</title>

<author>

<initials>J.K</initials>

<surname> Rowling</surname>

</author>

<price value="$16.95"></price>

</book>

</bookshop>

[Figure: tree view of this document, with nodes bookshop, book, title, author, initials, surname, price and the attribute value]
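
The tree shape of the example can be recovered by a short program. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name outline is made up for illustration:

```python
import xml.etree.ElementTree as ET

bookshop_xml = """<bookshop>
  <book>
    <title>Harry Potter and the Sorcerer's Stone</title>
    <author>
      <initials>J.K</initials>
      <surname>Rowling</surname>
    </author>
    <price value="$16.95"></price>
  </book>
</bookshop>"""

def outline(element, depth=0):
    """Return one indented line per element; attributes are shown as @name."""
    lines = ["  " * depth + element.tag]
    for name in element.attrib:
        lines.append("  " * (depth + 1) + "@" + name)
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(bookshop_xml))
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of nesting in the document.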

XML Technologies

• DTD/Schema

– definition of XML structures

• XSL (XSLT and XSL-FO)

– presentation

• XPath

– locating nodes

• XLink, XPointer

– linking

• DOM and SAX

– APIs to manipulate XML
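
As a rough illustration of "locating nodes", the sketch below uses the limited XPath subset supported by Python's standard xml.etree.ElementTree module. The book titles and prices are invented sample data, and a full XPath engine offers much more than is shown here:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the slides.
doc = ET.fromstring(
    "<bookshop>"
    "<book><title>Book A</title><price value='10.00'/></book>"
    "<book><title>Book B</title><price value='25.00'/></book>"
    "</bookshop>"
)

# Path expression: every <title> that is a child of a <book>.
titles = [t.text for t in doc.findall("./book/title")]

# Predicate: any <price> whose value attribute equals '25.00'.
matching_prices = doc.findall(".//price[@value='25.00']")

print(titles, len(matching_prices))
```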

XML Parser

• Required to read and manipulate XML documents.

• Reads an XML document as plain text and transforms it into an in-memory data structure, typically a tree.

• Applications, such as a web browser, access the data structure and process the data according to their own objectives.

• Example: MSXML
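
The slide names MSXML as its example parser. The sketch below swaps in Python's standard xml.dom.minidom module instead, purely to show the text-to-tree step. It is an illustration, not a description of how MSXML works:

```python
from xml.dom import minidom

text = (
    "<unit><unitCode>CSE3201</unitCode>"
    "<unitName>Information Retrieval</unitName></unit>"
)

# Parsing turns the plain text into a tree of node objects in memory.
document = minidom.parseString(text)
root = document.documentElement

# The application then navigates the tree rather than the raw text.
child_names = [node.tagName for node in root.childNodes]
first_value = root.firstChild.firstChild.data

print(root.tagName, child_names, first_value)
```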

XML Usage

• SOAP (Simple Object Access Protocol)

• Microsoft BizTalk Server

• WSDL and UDDI in Web Services

• Semantic Web

XML Issues

• Performance

– text processing vs binary processing

• Security

XML Document – Basic

Components

• Elements.

• Attributes.

• Character and Entity References.

• Character Data (CDATA).

• Processing Instruction.

• Comments.

Elements

[Figure: tree of the bookshop document. Labels: Root Element (compulsory), Branch Elements, Leaf Element, attribute. Nodes: bookshop, book, book, title, author, price, initials, surname, value.]

Element

• The basic building block of XML markup.

• It may contain:

– Text

– Other elements (child elements)

– Attributes

– Character Data

– Other markup, e.g. comments

• Delimited with a start-tag and an end-tag.

• An element can be empty.

• Unlike in HTML, the end-tag CANNOT be omitted.

• Each tag must contain a valid element type name.

Element’s Name

• Element’s Name (Tag’s name) is CASE SENSITIVE.

– <BOOK> ≠ <Book> ≠ <book>

• Trailing space is legal but will be ignored

– <BOOK > = <BOOK>
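
Both rules can be checked against a real parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name parses is made up for illustration:

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

same_case = parses("<BOOK></BOOK>")
mixed_case = parses("<BOOK></book>")        # start and end tags differ in case
trailing_space = parses("<BOOK ></BOOK >")  # space before '>' is ignored

# The trailing space is not part of the element's name.
name_with_space = ET.fromstring("<BOOK ></BOOK>").tag

print(same_case, mixed_case, trailing_space, name_with_space)
```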

Empty Element

• Has no content.

• May have attributes.

• Example:

<img src='logo.png'></img> can be abbreviated to

<img src='logo.png'/>
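
A quick check that the two forms are interchangeable, sketched with Python's standard xml.etree.ElementTree module (an assumption, not part of the slides):

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<img src='logo.png'></img>")
short_form = ET.fromstring("<img src='logo.png'/>")

# Both forms yield an element with no text, no children, and the same attribute.
same = (
    long_form.tag == short_form.tag
    and long_form.attrib == short_form.attrib
    and long_form.text is None and short_form.text is None
    and len(long_form) == 0 and len(short_form) == 0
)

print(same, short_form.get("src"))
```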

Attributes

• Information regarding the element.

“If elements are the ‘nouns’ of XML, then attributes are its ‘adjectives’.”

• <tagname attribute_name="attribute_value">

<book>

<title> Harry Potter</title>

</book>

<book title="Harry Potter">

</book>

Attributes vs Element

• The choice is determined by the semantics of the content.

• Attributes are characteristics of an element.

<book>

<title> Harry Potter</title>

</book>

<book title="Harry Potter">

</book>
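
The two designs are read differently by a program. A minimal sketch, assuming Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

as_element = ET.fromstring("<book><title>Harry Potter</title></book>")
as_attribute = ET.fromstring('<book title="Harry Potter"></book>')

# A child element is reached through the tree; an attribute through a lookup.
title_from_child = as_element.find("title").text
title_from_attribute = as_attribute.get("title")

print(title_from_child == title_from_attribute)
```

One practical difference: a child element can later gain children or attributes of its own, whereas an attribute value is always a plain string.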

Character References

• Used to enter characters that are not supported by the input device (keyboard).

– e.g. entering £ using a US-ASCII keyboard.

• Format: &#NNNNN; or &#xXXXX;

– N decimal

– X hexadecimal

• Example: $ => &#36; OR &#x24;
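
The decimal and hexadecimal forms resolve to the same character once parsed. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the <price> element is only a wrapper for the example:

```python
import xml.etree.ElementTree as ET

# 36 decimal and 24 hexadecimal both denote the dollar sign.
decimal_ref = ET.fromstring("<price>&#36;16.95</price>").text
hex_ref = ET.fromstring("<price>&#x24;16.95</price>").text

# 163 decimal is the pound sign, which a US-ASCII keyboard cannot type.
pound = ET.fromstring("<price>&#163;10</price>").text

print(decimal_ref, hex_ref, pound)
```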

Entity References

• Entities may be defined and used for:

– Representing characters used in markup

• &lt; == "<"

• &amp; == "&"

– Strings

• &IR; == Information Retrieval

• Predefined entities: &lt;, &gt;, &amp;, &quot;, &apos;
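
The sketch below shows both uses, assuming Python's standard xml.etree.ElementTree module. The predefined entities need no declaration. A string entity such as &IR; must first be declared, here in an internal DTD subset written for this example:

```python
import xml.etree.ElementTree as ET

# Predefined entities stand in for characters that would look like markup.
predefined = ET.fromstring("<expr>a &lt; b &amp;&amp; c &gt; d</expr>").text

# A custom entity is declared in the document type declaration, then used.
with_entity = """<!DOCTYPE unit [
  <!ENTITY IR "Information Retrieval">
]>
<unit>&IR; Systems</unit>"""
expanded = ET.fromstring(with_entity).text

print(predefined)
print(expanded)
```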

Character Data

• To escape blocks of text containing characters which would otherwise be recognized as markup.

• <![CDATA[…]]>

• <![CDATA[<greeting>Hello, world!</greeting>]]>

Character Data(2)

<example>

<![CDATA[&Warn;-&Disclaimer;&lt;&copy 2001; &PM;&gt;]]>

</example>

is equivalent to:

<example>

&amp;Warn;-&amp;Disclaimer;&amp;lt;&amp;copy 2001; &amp;PM;&amp;gt;

</example>
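
A parser hands back the content of a CDATA section as literal text, exactly as if every markup character had been escaped by hand. A minimal sketch, assuming Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

cdata = ET.fromstring(
    "<example><![CDATA[<greeting>Hello, world!</greeting>]]></example>"
)
escaped = ET.fromstring(
    "<example>&lt;greeting&gt;Hello, world!&lt;/greeting&gt;</example>"
)

# Inside CDATA, <greeting> is text, so <example> has no child elements.
print(cdata.text)
print(len(cdata), cdata.text == escaped.text)
```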

Processing Instruction(PI)

• Processing instructions (PIs) allow documents to contain instructions for applications.

• <?target … instruction … ?>

• Target is used to identify the application or other object to which the PI is directed.

• <?xml-stylesheet href="mystyle.css" type="text/css"?>
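
A parser exposes a PI as a target plus an uninterpreted data string. It is up to the application to make sense of the data. A minimal sketch, assuming Python's standard xml.dom.minidom module; the <unit/> root element is added only to make the document complete:

```python
from xml.dom import Node, minidom

text = '<?xml-stylesheet href="mystyle.css" type="text/css"?><unit/>'
document = minidom.parseString(text)

# The PI precedes the root element, so it is the document's first child node.
pi = document.firstChild
is_pi = pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE
target = pi.target
data = pi.data

print(is_pi, target, data)
```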

Comments

• Syntax:

<!-- comment text -->

• Comments cannot be used within element tags.

<tag>… some content … <tag <!-- it is illegal -->>

• Comments may never be nested.

<!-- Comments cannot <!-- be nested --> like this -->
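
These rules can be tried against a parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name parses is made up for illustration:

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A comment between tags is fine.
plain = parses("<tag>some content<!-- a legal comment --></tag>")

# A comment inside a tag is rejected.
inside_tag = parses("<tag <!-- it is illegal -->>some content</tag>")

# A nested comment is rejected: '--' may not appear inside a comment.
nested = parses("<tag><!-- Comments cannot <!-- be nested --> like this --></tag>")

print(plain, inside_tag, nested)
```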

Structure of XML Document

• XML document has to be well-formed.

– Conform to syntax requirements

– Conform to a simple container structure

• Common structure of XML document:

– Prolog

– Body

– Epilog

Prolog

• Includes:

– XML Declaration

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

• Version is mandatory; encoding and standalone are optional

– Document Type Declaration

<!DOCTYPE

• It is not the same as a DTD (Document Type Definition)

• A simple well-formed XML document does not need it.

– Schema declaration

Body & Epilog

• Body

– Contains 1 or more elements

– The “contents”

• Epilog

– Hardly used

– Can be used to identify end of document

Well-formed XML Document

• Contains a single root element.

• Uses valid tag names.

• Has no overlapping tags.
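
The three conditions can be tested with any conforming parser, since a parser must reject a document that is not well-formed. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are made up for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

checks = {
    "single root": is_well_formed("<a><b/></a>"),
    "two roots": is_well_formed("<a/><b/>"),
    "invalid tag name": is_well_formed("<1a></1a>"),
    "overlapping tags": is_well_formed("<a><b></a></b>"),
}

for label, ok in checks.items():
    print(label, "->", "well-formed" if ok else "rejected")
```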
