TREES AND XML
Software Engineering Lecture 3
Annatala Wolf

Mathematical Modeling
When we consider types abstractly (from the client perspective), we can use math to describe them in an unambiguous manner.
int: integers (0, -14, 9000)
double: real numbers (0.0, 3.14159, -0.00008)
boolean: Boolean values (true, false)
char: “character” Unicode labels (‘a’, ‘4’, ‘\t’)
String: mathematical string of characters
text streams: mathematical string of characters

Basic Types: Sets and Strings
Sets: denoted with { }
A set is an unordered collection of items. Sets don’t retain information about duplicates.
The number of elements in a set is its size (or cardinality).
{ } is the empty set (the only set with no elements).
{ 1, 2 } = { 2, 1 } = { 1, 1, 2 }
Strings: denoted with < >
A string is a totally ordered collection of items of some type, which can grow or shrink.
Strings don’t have a size; they have a length.
< 1, 2, 3 > has the element 1 at the front of the string.
We use “Crackle” to abbreviate <‘C’, ‘r’, ‘a’, ‘c’, ‘k’, ‘l’, ‘e’>.

Basic Types: Tuples
Tuples: denoted with ( )
A tuple is also a totally ordered collection, but the length of a tuple is typically fixed.
It also doesn’t matter which item is “first”. All that matters is that you can identify which element is which, semantically speaking (when you interpret the meaning).
For example, an ordered pair (x, y) is simply a two-tuple (a tuple of length two). It doesn’t really matter whether you put x or y first, as long as you know which one is x and which one is y!
A database row (name, SSN, birthdate) is another example. Each datum is a separate field in the tuple.

Warning: English, Math, and Java
TODO (see Piazza note for now)

Seven Bridges of Königsberg*
Classic problem from the 1700s: could you cross each bridge in Königsberg exactly once, in a single path?
This seems like a question math could answer… but at the time, math had not yet been applied to this sort of physical “relationship”.
Euler (the same fellow who named the constant e) proved in 1735 that it was impossible to do this.

Graph Theory*
Euler proved this by inventing a new form of math called graph theory, which studies connections. Later, it developed into a related field, topology.
Each region is a vertex… and each bridge is an edge.
Euler noticed all 4 vertices have an odd degree (number of edges)...
But unless a vertex is at the start or end of the path, it must have an even degree (one edge in = one edge out). A path has only one start and one end, so at most two vertices could have an odd degree. A contradiction!

Graph Terminology*
Graph: a set of vertices (also called nodes) connected by edges.
Each edge links two vertices, or one vertex with itself (called a self-loop or self-edge).
Some graphs may have edges that only go in one direction, or hold other data (color, order, weight).
A path is a sequence of connected vertices. (A simple path is a path with no repeated vertices.)
A cycle is a path that ends on the same vertex from which it started.

Trees*
A tree is a very common and useful type of graph!
Three definitions are equivalent. A tree is any undirected graph where:
1. Every two vertices are connected by exactly one simple path.
2. It is connected (every vertex can reach every other) and acyclic (it contains no simple cycles).
3. It is connected, and has one fewer edge than it has vertices.
If permitted, it could also be the order-zero graph, which has no nodes or edges: an empty tree, in other words.

Rooted Tree Terminology
A rooted tree is a tree with one node specially designated as the root.
Size: total number of nodes in the tree.
Height: total number of levels. (We count the root’s level here, though it’s often skipped.)
Leaves are nodes without children. Nodes with children are called internal nodes.

Recursive Structure
Trees have a recursive structure to them. That means you can make a bigger tree by sticking together one or more smaller trees (subtrees).
If our tree is ordered, then we can tell outgoing edges apart. We usually index them starting with zero on the left side.
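
The recursive structure described above can be sketched in a few lines of Java. This is only an illustration, not code from the course: the names RootedTreeSketch, Tree, size, height, and leafCount are all made up for this example. It assumes an ordered tree (children indexed from 0 on the left) and counts the root's level when measuring height.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not from the lecture): an ordered rooted tree
// built by sticking smaller trees together under a new root.
class RootedTreeSketch {

    // One node plus an ordered list of subtrees (child 0 is the leftmost).
    static final class Tree {
        final String label;
        final List<Tree> children = new ArrayList<>();

        Tree(String label, Tree... subtrees) {
            this.label = label;
            for (Tree t : subtrees) {
                this.children.add(t);
            }
        }
    }

    // Size: this node plus the sizes of all of its subtrees.
    static int size(Tree t) {
        int total = 1;
        for (Tree child : t.children) {
            total += size(child);
        }
        return total;
    }

    // Height: number of levels, counting the root's level as 1.
    static int height(Tree t) {
        int tallest = 0;
        for (Tree child : t.children) {
            tallest = Math.max(tallest, height(child));
        }
        return 1 + tallest;
    }

    // Leaves are nodes without children; internal nodes pass the count up.
    static int leafCount(Tree t) {
        if (t.children.isEmpty()) {
            return 1;
        }
        int leaves = 0;
        for (Tree child : t.children) {
            leaves += leafCount(child);
        }
        return leaves;
    }

    public static void main(String[] args) {
        // root has two subtrees: "a" (with leaves "c" and "d") and the leaf "b".
        Tree tree = new Tree("root",
                new Tree("a", new Tree("c"), new Tree("d")),
                new Tree("b"));
        System.out.println("size = " + size(tree));        // 5 nodes
        System.out.println("height = " + height(tree));    // 3 levels
        System.out.println("leaves = " + leafCount(tree)); // c, d, b
        System.out.println("child 0 = " + tree.children.get(0).label);
    }
}
```

Each method handles the root and then hands the rest of the work to the same method on each subtree, which is exactly the "bigger tree made of smaller trees" idea.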

XML (eXtensible Markup Language)
XML is a structured text format for storing data. HTML is (in a sense) a specialized form of XML.
  <?xml version="1.0" encoding="UTF-8"?>
  <book printISBN="978-1-118-06331-6" webISBN="1-118063-31-7" pubDate="Dec 20 2011">
    <author>Cay Horstmann</author>
    <title>
      Java for Everyone: Late Objects
    </title>
    ...
  </book>

XML Tags
Any text in XML that starts with < and ends with the first > to follow it is called a tag.
There are several kinds of special tags (such as comment tags), which begin with <? or <! . All other tags may contain at most one / character.
Start-tags contain no / characters.    <example foo="bar">
End-tags begin with </ .    </example>
Empty-element tags end with /> .    <example foo="bar" />

Markup vs. Character Data
Every start-tag must be followed by a matching end-tag.
The end-tag for a start-tag must appear before the end-tag for any earlier start-tag (no element overlap). Example: <a><b></a></b> is not allowed.
Tags constitute XML “code”. Collectively, tags are called markup.
One exception: text within the special CDATA tag is not considered markup. This is because the CDATA tag acts as an escape code for raw data. We won’t be using these, however.
Anything that isn’t markup is character data.

Elements
An element consists of a start-tag, its matching end-tag, and everything between the two tags (which may include nested elements).
An empty element is an element that holds nothing between its tags.
An empty-element tag is the preferred way to represent a start-tag followed immediately by its end-tag (but both forms are acceptable XML). In this case, the tag by itself is an entire element.
  <a foo="bar"></a> = <a foo="bar" />

XML Documents
Every XML document consists of two things:
First, the prolog: an XML declaration followed by at most one document type declaration.
  <?xml version="1.0"?>
  <!DOCTYPE greeting SYSTEM "hello.dtd">
Last, exactly one element (called the root element).
  <tagname maybeAttribute="value" etc.>
    ...
    (everything else goes here)
  </tagname>
The only things that can appear outside of these areas are processing instructions (special tags used with some applications), comments, and whitespace.

Attributes
Start-tags and empty-element tags may have attributes.
Attributes are (name, value) pairs which appear after the tag name, in the format attribute="value":
  <tagname attr1="val1" attr2="val2" />
Each attribute within the same tag must use a unique name, but tag names need not be unique.
  <hi attr1="foo" attr2="foo">
    <hi>Same tag name.</hi>
  </hi>

Tags and Character Data Indices
Tags and their content may look identical, so how can you tell them apart?
XML is ordered, so tags and content may be identified by which thing appeared first.
  <tag>                      (index 0; the root)
    Character data.          (index 0, then index 0)
    <tag>Char-data!</tag>    (the element: index 0, then index 1; its character data: index 0, then index 1, then index 0)
    <tag />
  </tag>

Full XML Example
  <?xml version="1.0" encoding="UTF-8"?>
  <roottag attrib1="val1" attrib2="val2">
    <child0>
      <gchild0>Char-data</gchild0>
      <gchild1>Char-data</gchild1>
    </child0>
    More character data at index 1.
    <selfclosingchild2 />
    More data at index 3.
    <child4 attrib1="val1">
      Even more char-data.
    </child4>
  </roottag>
Note on <child0>: this is the first element (index 0) beneath the root (also index 0). It contains two elements with subindices 0 and 1, each of which contains character data held at index 0.

Is HTML Just a Kind of XML?*
Mostly! Well-formed HTML is often a specific form of XML, but some standards (like HTML 5.0) allow a few things XML forbids. Also, most HTML documents are not well-formed.
To make HTML well-formed XML, you’d need to replace ill-formed HTML such as:
  <p>
  Paragraph text.<br>
  More text on next line.
With something like:
  <p>
  Paragraph text.<br />
  More text on next line.
  </p>
Even though most HTML out there is not well-formed, it’s still a good idea to stick with well-formed HTML.
Using it helps ensure that a page looks the same on any browser.*
*Except, of course, for IE. Derp.

Tree Structure
Notice that the restrictions on how we can nest tags force a hierarchical structure. This means XML can be represented easily as a tree!
[Figure: tree diagram with a <root> node (attr1 = “value1”, attr2 = “value2”) above nested <tag> nodes, some carrying attributes, and Character Data leaves.]

Representing Tags and Character Data
One easy way to represent XML as a tree structure is to treat each element as a separate node. Nodes can be arranged hierarchically starting from the root.
Each node will be labeled with the tag for that element, and also contain information about all the attribute (name, value) pairs.
To handle character data, we can treat it as a separate node under the element it appears within.

XML as a Tree
  <root attr="val">
    <mid attr1="val1" attr2="val2">
      <bottom>Hey there!</bottom>
      <fluttershy />
    </mid>
    <stuff is="example" />
  </root>
[Figure: the same document drawn as a tree. <root> (attr = “val”) has children <mid> (attr1 = “val1”, attr2 = “val2”) and <stuff> (is = “example”); <mid> has children <bottom> and <fluttershy>; <bottom> has the single character-data child “Hey there!”.]
Character data will always appear as a leaf of the tree, since it holds no elements inside.

Labels and Attributes
This design simplifies how we can describe (and use) such a tree. An “XML tree object” consists of:
1. A single node that holds character data, or…
2. A single node that holds an element (identified by its tag name), plus zero or more XML tree objects indexed (starting from 0) in the order in which each appeared in the XML document.
Notice that case 1 on its own is not a valid XML document! It’s still valuable to include it, however, because it makes it much easier to define our model.

XMLTree
Formally, an XMLTree object is modeled by an ordered tree of nodes, where a node is an ordered triple: ( isTag, label, set of (name, value) ).
In this triple, isTag is a boolean value, while label, name, and value are each strings of characters.
There are two further constraints:
An XMLTree is not permitted to have the same name in its set of (name, value) appear with different values. In other words, attribute names are unique within each tag.
If isTag is false, then the set must be empty and this node must be a leaf in the tree. In other words, character data can’t have attributes or child nodes.

XMLTree Instance Methods
  Iterator<String> attributeIterator()
  String attributeValue(String name)
  XMLTree child(int k)
  void display()
  boolean hasAttribute(String name)
  boolean isTag()
  String label()
  int numberOfChildren()
  String toString()

Working with XMLTree
Every XMLTree is either:
A valid representation of an XML document (or a portion of one, which could stand on its own)…
…or else it is simply character data (e.g., a String).
You can always access the root and:
Check to see if it’s a tag with isTag().
Use label() to get the tag name or character data.
If it’s a tag, you can also:
Check the numberOfChildren().
Get a particular child(int k) tree (the one at index k).
Look at attributes with various other methods.

Attribute and Display Methods
You can use attributeIterator() to produce an Iterator<String> over all of the attribute names.
Alternatively, you can check to see if a particular attribute name exists with hasAttribute().
To get the value for an attribute, pass the name of the attribute to attributeValue().
The display() method does a fancy display of your XMLTree in a new window, and toString() has been written so that if you treat an XMLTree like a String, it will actually be legible.

How to Handle Unfamiliar Trees…?*
It should be obvious that you can write code to allow someone to look through a tree, or pull apart a tree if you know its structure in advance.
But how do you write code to blindly work with a tree you’ve never seen before?
How could you write the code to print out an entire XMLTree, if you didn’t have display() or toString()?
How could you copy all parts of an XMLTree object?
Since we defined XMLTree recursively, the simplest way to do this is with recursion! We’ll take a closer look at this topic later on in the course.

RSS
It uses XML. That’s pretty much it. (details TODO, but not urgent)
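
Coming back to the recursion question from “How to Handle Unfamiliar Trees”: here is a minimal sketch of how printing and copying an unknown tree might look. It does not use the real XMLTree class. Node is a hypothetical stand-in for the (isTag, label, attributes) model, and text, tag, attr, render, and copy are names invented for this example; only the accessor names isTag(), label(), numberOfChildren(), and child(k) are borrowed from the method list in these slides. The sketch also assumes labels and attribute values never need escaping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the XMLTree model from these slides.
// This is NOT the real XMLTree class; only the accessor names are borrowed.
class XmlTreeRecursionSketch {

    static final class Node {
        private final boolean isTag;
        private final String label;
        private final Map<String, String> attributes = new LinkedHashMap<>();
        private final List<Node> children = new ArrayList<>();

        private Node(boolean isTag, String label) {
            this.isTag = isTag;
            this.label = label;
        }

        // Character data: always a leaf with no attributes.
        static Node text(String data) {
            return new Node(false, data);
        }

        // An element with zero or more children, indexed from 0 in order.
        static Node tag(String name, Node... kids) {
            Node n = new Node(true, name);
            for (Node kid : kids) {
                n.children.add(kid);
            }
            return n;
        }

        // Adds an attribute (tags only) and returns this node for chaining.
        Node attr(String name, String value) {
            if (!this.isTag) {
                throw new IllegalStateException("character data has no attributes");
            }
            this.attributes.put(name, value);
            return this;
        }

        boolean isTag() { return this.isTag; }
        String label() { return this.label; }
        int numberOfChildren() { return this.children.size(); }
        Node child(int k) { return this.children.get(k); }
    }

    // Prints a tree we have never seen before: handle the root, then recurse.
    // Assumption: labels and values contain no characters that need escaping.
    static String render(Node t) {
        if (!t.isTag()) {
            return t.label(); // base case: character data is a leaf
        }
        StringBuilder sb = new StringBuilder("<").append(t.label());
        for (Map.Entry<String, String> e : t.attributes.entrySet()) {
            sb.append(' ').append(e.getKey()).append("=\"").append(e.getValue()).append('"');
        }
        if (t.numberOfChildren() == 0) {
            return sb.append(" />").toString(); // empty-element tag
        }
        sb.append('>');
        for (int k = 0; k < t.numberOfChildren(); k++) {
            sb.append(render(t.child(k))); // recursive case: each subtree in order
        }
        return sb.append("</").append(t.label()).append('>').toString();
    }

    // Copies every part of a tree using the same root-then-subtrees shape.
    static Node copy(Node t) {
        if (!t.isTag()) {
            return Node.text(t.label());
        }
        Node result = Node.tag(t.label());
        result.attributes.putAll(t.attributes);
        for (int k = 0; k < t.numberOfChildren(); k++) {
            result.children.add(copy(t.child(k)));
        }
        return result;
    }

    public static void main(String[] args) {
        // The document from the "XML as a Tree" slide.
        Node doc = Node.tag("root",
                Node.tag("mid",
                        Node.tag("bottom", Node.text("Hey there!")),
                        Node.tag("fluttershy"))
                        .attr("attr1", "val1").attr("attr2", "val2"),
                Node.tag("stuff").attr("is", "example"))
                .attr("attr", "val");
        System.out.println(render(doc));
        System.out.println(render(copy(doc))); // the copy renders identically
    }
}
```

Neither method needs to know the tree's shape in advance. Each one deals with a single node and trusts the recursive call to deal with whatever lies beneath it.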