Trees and XML

TREES AND XML
Software Engineering Lecture 3
Annatala Wolf
Mathematical Modeling
 When we consider types abstractly (from the client
perspective), we can use math to describe them in
an unambiguous manner.
 int: integers (0, -14, 9000)
 double: real numbers (0.0, 3.14159, -0.00008)
 boolean: Boolean values (true, false)
 char: “character” Unicode labels (‘a’, ‘4’, ‘\t’)
 String: mathematical string of character
 text streams: mathematical string of character
Basic Types: Sets and Strings
 Sets: denoted with { }
 A set is an unordered collection of items. Sets
don’t retain information about duplicates. The
number of elements in a set is its size (or cardinality).
 { } is the empty set (the only set with no elements)
 { 1, 2 } = { 2 , 1 } = { 1, 1, 2 }
 Strings: denoted with < >
 A string is a totally ordered collection of some
type, which can grow or shrink. Strings don’t have a
size; they have a length.
 < 1, 2, 3 > has the element 1 at the front of the list
 We use “Crackle” to abbreviate <‘C’, ‘r’, ‘a’, ‘c’, ‘k’, ‘l’, ‘e’>.
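As a rough analogy only, the set and string models behave much like java.util.Set and java.util.List. The sketch below illustrates the equalities above; the class name ModelDemo and the helper setOf are made up for this example and are not part of the course library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ModelDemo {
    // Builds a set from the given items; duplicates are silently dropped.
    static Set<Integer> setOf(Integer... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        // { 1, 2 } = { 2, 1 } = { 1, 1, 2 }
        System.out.println(setOf(1, 2).equals(setOf(2, 1)));    // true
        System.out.println(setOf(1, 2).equals(setOf(1, 1, 2))); // true
        System.out.println(setOf(1, 1, 2).size());              // 2

        // < 1, 2, 3 > is ordered, so it differs from < 3, 2, 1 >.
        List<Integer> s = Arrays.asList(1, 2, 3);
        System.out.println(s.equals(Arrays.asList(3, 2, 1)));   // false
        System.out.println(s.get(0));                           // 1 (the front)
    }
}
```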
Basic Types: Tuples
 Tuples: denoted with ( )
 A tuple is also a totally ordered collection, but the
length of a tuple is typically fixed. It also doesn’t
matter which item is “first”. All that matters is you
can identify which element is which, semantically
speaking (when you interpret the meaning).
 For example, an ordered pair (x, y) is simply a two-tuple (a tuple of length two). It doesn’t really matter
whether you put x or y first, as long as you know which
one is x and which one is y!
 A database row (name, SSN, birthdate) is another
example. Each datum is a separate field in the tuple.
Warning: English, Math, and Java
 TODO (see Piazza note for now)
Seven Bridges of Königsberg*
 Classic problem from the 1700s: could you cross each bridge in Königsberg exactly once, in a single path?
 This seems like a question math could answer… but at the time, math had not yet been applied to this sort of physical “relationship”.
 Euler (the same fellow who named the constant e) proved in 1735 that it was impossible to do this.
Graph Theory*
 Euler proved this by inventing a new form of math
called graph theory, which studies connections.
 Later, it developed into a related field, topology.
 Each region is a vertex, and each bridge is an edge.
 Euler noticed that all 4 vertices have an odd degree (number of edges).
 But unless a vertex is at the start or end of the path, it must have an even degree (one edge in = one edge out). A contradiction!
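Euler's degree argument can be checked mechanically. The following is a minimal sketch, assuming the four regions are numbered 0 to 3 with region 0 as the central island; the numbering and the class name Koenigsberg are choices made for this example.

```java
class Koenigsberg {
    // Each bridge joins two of the four land regions, numbered 0..3.
    // Region 0 is the central island (numbering assumed for this sketch).
    static final int[][] BRIDGES = {
        {0, 1}, {0, 1}, {0, 2}, {0, 2}, {0, 3}, {1, 3}, {2, 3}
    };

    // Returns the degree (number of incident edges) of every vertex.
    static int[] degrees(int vertexCount, int[][] edges) {
        int[] degree = new int[vertexCount];
        for (int[] e : edges) {
            degree[e[0]]++;
            degree[e[1]]++;
        }
        return degree;
    }

    // Counts the vertices whose degree is odd.
    static int oddVertices(int[] degree) {
        int odd = 0;
        for (int d : degree) {
            if (d % 2 != 0) {
                odd++;
            }
        }
        return odd;
    }

    public static void main(String[] args) {
        int odd = oddVertices(degrees(4, BRIDGES));
        System.out.println("odd-degree vertices: " + odd);          // 4
        // Only the start and end of the path may have odd degree.
        System.out.println("single path possible: " + (odd <= 2));  // false
    }
}
```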
Graph Terminology*
 Graph: a set of vertices (also called nodes)
connected by edges.
 Each edge links two vertices, or one vertex with itself
(called a self-loop or self-edge).
 Some graphs may have edges that only go in one
direction, or hold other data (color, order, weight).
 A path is a sequence of connected vertices. (A
simple path is a path with no repeated vertices.)
 A cycle is a path that ends on the same vertex from
which it started.
Trees*
 A tree is a very common and
useful type of graph! Three
definitions are equivalent:
 A tree is any undirected graph where:
1. Every two vertices are connected by exactly one simple path.
2. It is connected (every vertex can reach every other) and acyclic (it contains no simple cycles).
3. It is connected, and has one fewer edge than vertices.

If permitted, it could also be the order zero graph, which
has no nodes or edges: an empty tree, in other words.
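Definition 3 is the easiest of the three to test in code. Here is a minimal sketch, assuming vertices are numbered 0 to n-1 and edges are given as pairs; the class name TreeCheck is hypothetical, and this version does not count the empty (order zero) graph as a tree.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TreeCheck {
    // Returns true if the undirected graph on vertices 0..n-1 is a tree,
    // using definition 3: connected, with exactly n - 1 edges.
    static boolean isTree(int n, int[][] edges) {
        if (n == 0 || edges.length != n - 1) {
            return false; // the empty graph is excluded in this sketch
        }
        List<List<Integer>> adjacent = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            adjacent.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            adjacent.get(e[0]).add(e[1]);
            adjacent.get(e[1]).add(e[0]);
        }
        // Depth-first search from vertex 0, counting the vertices reached.
        boolean[] seen = new boolean[n];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);
        seen[0] = true;
        int reached = 1;
        while (!stack.isEmpty()) {
            int v = stack.pop();
            for (int w : adjacent.get(v)) {
                if (!seen[w]) {
                    seen[w] = true;
                    reached++;
                    stack.push(w);
                }
            }
        }
        return reached == n; // connected if every vertex was reached
    }
}
```

For example, isTree(4, new int[][] {{0, 1}, {0, 2}, {2, 3}}) holds, while a triangle plus an isolated vertex has the right edge count but fails the connectivity test.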
Rooted Tree Terminology
 A rooted tree is a tree with one node specially
designated as the root.
 Size: total number of nodes in the tree.
 Height: total number of levels. (We count the root here, though some definitions skip it.)
 Leaves are nodes without children. Nodes with children are called internal nodes.
Recursive Structure
 Trees have a recursive structure to them. That
means you can make a bigger tree by sticking
together one or more smaller trees (subtrees).
 If our tree is ordered, then we can tell outgoing edges apart. We usually index them starting with zero on the left side.
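The recursive structure means that size and height can each be computed from the subtrees. The sketch below is one possible illustration, not the course's own class; TreeNode and its fields are names invented for this example, and height counts the root as a level.

```java
import java.util.ArrayList;
import java.util.List;

class TreeNode {
    final String label;
    // Ordered children: index 0 is the leftmost subtree.
    final List<TreeNode> children = new ArrayList<>();

    // Builds a bigger tree by sticking smaller trees under a new root.
    TreeNode(String label, TreeNode... subtrees) {
        this.label = label;
        for (TreeNode t : subtrees) {
            children.add(t);
        }
    }

    // Size: this node plus the sizes of all of its subtrees.
    int size() {
        int total = 1;
        for (TreeNode c : children) {
            total += c.size();
        }
        return total;
    }

    // Height: number of levels, counting the root as one level.
    int height() {
        int tallest = 0;
        for (TreeNode c : children) {
            tallest = Math.max(tallest, c.height());
        }
        return 1 + tallest;
    }

    // A leaf is a node without children.
    boolean isLeaf() {
        return children.isEmpty();
    }
}
```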
XML (eXtensible Markup Language)
 XML is a structured text format for storing data.
 HTML is (in a sense) a specialized form of XML.
<?xml version="1.0" encoding="UTF-8"?>
<book printISBN="978-1-118-06331-6"
      webISBN="1-118063-31-7"
      pubDate="Dec 20 2011">
  <author>Cay Horstmann</author>
  <title>
    Java for Everyone: Late Objects
  </title>
  ...
</book>
XML Tags
 Any text in XML that starts with < and ends with
the first > to follow it is called a tag.
 There are several kinds of special tags (such as
comment tags), which begin with <? or <! .
 All other tags may contain at most one / character.
 Start-tags contain no / characters.
<example foo="bar">
 End-tags begin with </ .
</example>
 Empty-element tags end with /> .
<example foo="bar" />
Markup vs. Character Data
 Every start-tag must eventually be followed by a matching end-tag.
 The end-tag for a start-tag must appear before the
end-tag for an earlier start-tag (no element overlap).
 Example: <a><b></a></b> is not allowed.
 Tags constitute XML “code”. Collectively, tags are
called markup.
 One exception: CData text within the special CDATA
tag is not considered markup. This is because the
CDATA tag acts as an escape code for raw data. We
won’t be using these fields, however.
 Anything that isn’t markup is character data.
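The no-overlap rule is naturally checked with a stack: push each start-tag name, and require every end-tag to match the most recently opened name. The following is a simplified sketch, not a real XML parser. It ignores special tags (those beginning with <? or <!) and CDATA, and the class name NestingCheck and its regular expression are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestingCheck {
    // Matches </name>, <name ...> and <name ... />.
    // Group 1 is "/" for end-tags, group 2 is the tag name,
    // group 3 is "/" for empty-element tags.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][\\w.-]*)[^<>]*?(/?)>");

    // Returns true if every start-tag has a matching end-tag
    // and no two elements overlap.
    static boolean properlyNested(String xml) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(xml);
        while (m.find()) {
            boolean isEndTag = !m.group(1).isEmpty();
            boolean isEmptyElement = !m.group(3).isEmpty();
            String name = m.group(2);
            if (isEndTag) {
                // Must close the most recently opened element.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (!isEmptyElement) {
                open.push(name);
            }
        }
        return open.isEmpty(); // anything left open was never closed
    }
}
```

With this sketch, properlyNested("<a><b></b></a>") holds, while "<a><b></a></b>" is rejected because </a> arrives while <b> is still open.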
Elements
 An element consists of a start-tag, its matching
end-tag, and everything between the two tags
(which may include nested elements).
 An empty element is an element that holds
nothing between its tags.
 An empty-element tag is the preferred way to
represent a start-tag followed by an end-tag (but
both forms are acceptable XML). In this case, the
tag by itself is an entire element.
<a foo="bar"></a>  =  <a foo="bar" />
XML Documents
 Every XML document consists of two things:
 First, the prolog: an XML declaration followed by
an optional document type declaration.
<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
 Last, exactly one element (called the root element).
<tagname maybeAttribute="value" etc.>
  ... (everything else goes here)
</tagname>
 The only stuff that can appear outside of these areas are
processing instructions (special tags used with some
applications), comments, and whitespace.
Attributes
 Start-tags and empty-element tags may have
attributes. Attributes are (name, value) pairs
which appear after the tag name, in the format
attribute="value":
<tagname attr1="val1" attr2="val2" />
 Each attribute within the same tag must use a
unique name, but tag names need not be unique.
<hi attr1="foo" attr2="foo">
  <hi>Same tag name.</hi>
</hi>
Tags and Character Data Indices
 Tags and their content may
look identical, so how can
you tell them apart?
 XML is ordered, so tags and
content may be identified by
which thing appeared first.
<tag>                      (index 0; the root)
  Character data.          (index 0, then index 0)
  <tag>Char-data!</tag>    (index 0, then index 1; its character data is at index 0, then index 1, then index 0)
  <tag />
</tag>
Full XML Example
<?xml version="1.0" encoding="UTF-8"?>
<roottag attrib1="val1" attrib2="val2">
  <child0>
    <gchild0>Char-data</gchild0>
    <gchild1>Char-data</gchild1>
  </child0>
  More character data at index 1.
  <selfclosingchild2 />
  More data at index 3.
  <child4 attrib1="val1">
    Even more char-data.
  </child4>
</roottag>
 <child0> is the first element (index 0) beneath the root (also index 0). It contains two elements with subindices 0 and 1, each of which contains character data held at index 0.
Is HTML Just a Kind of XML?*
 Mostly! Well-formed HTML is often a specific form
of XML, but some standards (like HTML 5.0) allow a
few things XML forbids.
 Also, most HTML documents are not well-formed.
 To make HTML well-formed XML, you’d need to
replace ill-formed HTML such as:
<p>
Paragraph text.<br>
More text on next line.
 With something like:
<p>
Paragraph text.<br />
More text on next line.
</p>
Even though most HTML out there is not well-formed, it’s still a good idea to stick with well-formed HTML. Using it helps to ensure that a page looks the same on any browser.*
*Except, of course, for IE. Derp.
Tree Structure
 Notice that the restrictions on how we can nest
tags force a hierarchical structure. This means
XML can be represented easily as a tree!
[Figure: tree diagram. A <root> node (attr1 = “value1”, attr2 = “value2”) sits at the top; beneath it are <tag> element nodes, some carrying their own attributes, and Character Data leaves.]
Representing Tags and Character Data
 One easy way to represent XML as a tree structure
is to treat each element as a separate node.
 Nodes can be arranged hierarchically starting from
root. Each node will be labeled with the tag for
that element, and also contain information about
all the attribute (name, value) pairs.
 To handle character data, we can treat it as a
separate node under the element it appears within.
XML as a Tree
<root attr="val">
  <mid attr1="val1" attr2="val2">
    <bottom>Hey there!</bottom>
    <fluttershy />
  </mid>
  <stuff is="example" />
</root>
[Figure: the same document drawn as a tree. <root> (attr = “val”) has children <mid> (attr1 = “val1”, attr2 = “val2”) and <stuff> (is = “example”); <mid> has children <bottom> and <fluttershy>; <bottom> has the single child “Hey there!”.]
 Character data will always appear as a leaf of the tree, since it holds no elements inside.
Labels and Attributes
 This design simplifies how we can describe (and
use) such a tree.
 An “XML tree object” consists of:
1. A single node that holds character data, or…
2. A single node that holds an element (identified by its tag name), plus zero or more XML tree objects indexed (starting from 0) in the order in which each appeared in the XML document.
 Notice how case 1 is not a valid XML document!
It’s still valuable to include it, however, because it
makes it much easier to define our model.
XMLTree
 Formally, an XMLTree object is modeled by an
ordered tree of nodes, where a node is an ordered
triple: ( isTag, label, set of (name, value) ).
 In this triple, isTag is a boolean value; while label,
name, and value are each strings of character.
 There are two further constraints:
 An XMLTree is not permitted to have the same name in
its set of (name, value) appear with different values. In
other words, attribute names are unique within each tag.
 If isTag is false, then the set must be empty and this
node must be a leaf in the tree. In other words,
character data can’t have attributes or child nodes.
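To make the model concrete, here is one way the triple and its two constraints could be written down in Java. This is a sketch of the mathematical model only, not the implementation of the real XMLTree class; XmlNode and its methods tag, text, attr and add are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class XmlNode {
    final boolean isTag;   // true for an element, false for character data
    final String label;    // tag name, or the character data itself
    // A map allows each attribute name to appear with only one value.
    final Map<String, String> attributes = new LinkedHashMap<>();
    // Ordered children, indexed from 0.
    final List<XmlNode> children = new ArrayList<>();

    private XmlNode(boolean isTag, String label) {
        this.isTag = isTag;
        this.label = label;
    }

    static XmlNode tag(String name) {
        return new XmlNode(true, name);
    }

    static XmlNode text(String data) {
        return new XmlNode(false, data);
    }

    // Constraint 1: a name may not appear with two different values.
    // Constraint 2 (first half): character data has no attributes.
    XmlNode attr(String name, String value) {
        if (!isTag) {
            throw new IllegalStateException("character data cannot have attributes");
        }
        if (attributes.containsKey(name) && !attributes.get(name).equals(value)) {
            throw new IllegalArgumentException("conflicting values for attribute " + name);
        }
        attributes.put(name, value);
        return this;
    }

    // Constraint 2 (second half): character data must be a leaf.
    XmlNode add(XmlNode child) {
        if (!isTag) {
            throw new IllegalStateException("character data must be a leaf");
        }
        children.add(child);
        return this;
    }
}
```

For instance, XmlNode.tag("bottom").add(XmlNode.text("Hey there!")) models <bottom>Hey there!</bottom>.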
XMLTree Instance Methods
 Iterator<String> attributeIterator()
 String attributeValue(String name)
 XMLTree child(int k)
 void display()
 boolean hasAttribute(String name)
 boolean isTag()
 String label()
 int numberOfChildren()
 String toString()
Working with XMLTree
 Every XMLTree is either:
 A valid representation of an XML document (or, a
portion of one, which could stand on its own)…
 …or else it is simply character data (e.g., a String).
 You can always access the root and:
 Check to see if it’s a tag with isTag().
 Use label() to get the tag name or character data.
 If it’s a tag, you can also:
 Check the numberOfChildren().
 Get a particular child(int k) tree (the one at index k).
 Look at attributes with various other functions.
Attribute and Display Methods
 You can use attributeIterator() to produce an
Iterator<String> holding all of the attribute names.
 Alternately, you can check to see if a particular
attribute name exists with hasAttribute().
 To get the value for an attribute, pass the name of
the attribute to attributeValue().
 The display() method does a fancy display of your
XMLTree in a new window, and toString() has
been written so if you treat an XMLTree like a
String, it will actually be legible.
How to Handle Unfamiliar Trees…?*
 It should be obvious that you can write code to
allow someone to look through a tree, or pull apart
a tree if you know its structure in advance.
 But how do you write code to blindly work with a
tree you’ve never seen before?
 How could you write the code to print out an entire
XMLTree, if you didn’t have display() or toString()?
 How could you copy all parts of an XMLTree object?
 Since we defined XMLTree recursively, you’d need
to use recursion to do this simply! We’ll take a
closer look at this topic later on in the course.
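As a preview of the recursive approach, the sketch below prints an outline of a tree it knows nothing about in advance. So that the example runs on its own, it uses a hypothetical stand-in interface, MiniXMLTree, exposing four of the method names listed earlier (isTag, label, numberOfChildren, child); MiniNode and Outline are likewise invented for illustration and are not part of the real XMLTree library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for four of the XMLTree methods.
interface MiniXMLTree {
    boolean isTag();
    String label();
    int numberOfChildren();
    MiniXMLTree child(int k);
}

class MiniNode implements MiniXMLTree {
    private final boolean tag;
    private final String label;
    private final List<MiniXMLTree> kids = new ArrayList<>();

    private MiniNode(boolean tag, String label) {
        this.tag = tag;
        this.label = label;
    }

    static MiniNode tag(String name, MiniXMLTree... children) {
        MiniNode n = new MiniNode(true, name);
        for (MiniXMLTree c : children) {
            n.kids.add(c);
        }
        return n;
    }

    static MiniNode text(String data) {
        return new MiniNode(false, data);
    }

    public boolean isTag() { return tag; }
    public String label() { return label; }
    public int numberOfChildren() { return kids.size(); }
    public MiniXMLTree child(int k) { return kids.get(k); }
}

class Outline {
    // Appends one line per node, indented two spaces per level.
    static void render(MiniXMLTree t, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(t.isTag() ? "<" + t.label() + ">" : t.label()).append('\n');
        if (t.isTag()) {
            // Recursive step: each child is itself a smaller tree.
            for (int k = 0; k < t.numberOfChildren(); k++) {
                render(t.child(k), depth + 1, out);
            }
        }
    }
}
```

The method never needs to know how deep the tree goes: character data is the base case, and each tag simply hands its children to the same method.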
RSS
 It uses XML. That’s pretty much it.
 (details TODO, but not urgent)