XML.com: What Is RSS
1 of 9
http://www.xml.com/lpt/a/1080
Published on XML.com http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html
What Is RSS
By Mark Pilgrim
December 18, 2002
What are Syndication Feeds
RSS is a format for syndicating news and the content of news-like sites, including major news sites like Wired, news-oriented community sites like
Slashdot, and personal weblogs. But it's not just for news. Pretty much anything that can be broken down into discrete items can be syndicated via
RSS: the "recent changes" page of a wiki, a changelog of CVS checkins, even the revision history of a book. Once information about each item is in
RSS format, an RSS-aware program can check the feed for changes and react to the changes in an appropriate way.
RSS-aware programs called news aggregators are popular in the weblogging community. Many weblogs make content available in RSS. A news
aggregator can help you keep up with all your favorite weblogs by checking their RSS feeds and displaying new items from each of them.
A brief history
But coders beware. The name "RSS" is an umbrella term for a format that spans several different versions of at least two different (but parallel)
formats. The original RSS, version 0.90, was designed by Netscape as a format for building portals of headlines to mainstream news sites. It was
deemed overly complex for its goals; a simpler version, 0.91, was proposed and subsequently dropped when Netscape lost interest in the
portal-making business. But 0.91 was picked up by another vendor, UserLand Software, which intended to use it as the basis of its weblogging products
and other web-based writing software.
In the meantime, a third, non-commercial group split off and designed a new format based on what they perceived as the original guiding principles
of RSS 0.90 (before it got simplified into 0.91). This format, which is based on RDF, is called RSS 1.0. But UserLand was not involved in designing
this new format, and, as an advocate of simplifying 0.90, it was not happy when RSS 1.0 was announced. Instead of accepting RSS 1.0, UserLand
continued to evolve the 0.9x branch, through versions 0.92, 0.93, 0.94, and finally 2.0.
What a mess.
So which one do I use?
That's 7 -- count 'em, 7! -- different formats, all called "RSS". As a coder of RSS-aware programs, you'll need to be liberal enough to handle all the
variations. But as a content producer who wants to make your content available via syndication, which format should you choose?
RSS versions and recommendations

Version | Owner | Pros | Status | Recommendation
0.90 | Netscape | - | Obsoleted by 1.0 | Don't use
0.91 | UserLand | Drop dead simple | Officially obsoleted by 2.0, but still quite popular | Use for basic syndication. Easy migration path to 2.0 if you need more flexibility
0.92, 0.93, 0.94 | UserLand | Allows richer metadata than 0.91 | Obsoleted by 2.0 | Use 2.0 instead
1.0 | RSS-DEV Working Group | RDF-based, extensibility via modules, not controlled by a single vendor | Stable core, active module development | Use for RDF-based applications or if you need advanced RDF-specific modules
2.0 | UserLand | Extensibility via modules, easy migration path from 0.9x branch | Stable core, active module development | Use for general-purpose, metadata-rich syndication
What does RSS look like?
Imagine you want to write a program that reads RSS feeds, so that you can publish headlines on your site, build your own portal or homegrown
news aggregator, or whatever. What does an RSS feed look like? That depends on which version of RSS you're talking about. Here's a sample RSS
0.91 feed (adapted from XML.com's RSS feed):
<rss version="0.91">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final look at applying relational normalization techniques to W3C XML Schema
        data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal
        forms.</description>
    </item>
    <item>
      <title>The .NET Schema Object Model</title>
      <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
      <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic
        manipulation of W3C XML Schemas.</description>
    </item>
    <item>
      <title>SVG's Past and Promising Future</title>
      <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
      <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks
        forward to 2003.</description>
    </item>
  </channel>
</rss>
Simple, right? A feed comprises a channel, which has a title, link, description, and (optional) language, followed by a series of items, each of which
has a title, link, and description.
Now look at the RSS 1.0 version of the same information:
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
>
  <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/normalizing.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/som.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/svg.html"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html">
    <title>Normalizing XML, Part 2</title>
    <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
    <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data
      modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
    </description>
    <dc:creator>Will Provost</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html">
    <title>The .NET Schema Object Model</title>
    <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
    <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic
      manipulation of W3C XML Schemas.</description>
    <dc:creator>Priya Lakshminarayanan</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
    <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
    <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward
      to 2003.</description>
    <dc:creator>Antoine Quint</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
</rdf:RDF>
Quite a bit more verbose. People familiar with RDF will recognize this as an XML serialization of an RDF document; the rest of the world will at
least recognize that we're syndicating essentially the same information. In fact, we're including a bit more information: item-level authors and
publishing dates, which RSS 0.91 does not support.
Despite being RDF/XML, RSS 1.0 is structurally similar to previous versions of RSS -- similar enough that we can simply treat it as XML and write
a single function to extract information out of either an RSS 0.91 or RSS 1.0 feed. However, there are some significant differences that our code
will need to be aware of:
1. The root element is rdf:RDF instead of rss. We'll either need to handle both explicitly or just ignore the name of the root element altogether
and blindly look for useful information inside it.
2. RSS 1.0 uses namespaces extensively. The RSS 1.0 namespace is http://purl.org/rss/1.0/, and it's defined as the default namespace.
The feed also uses http://www.w3.org/1999/02/22-rdf-syntax-ns# for the RDF-specific elements (which we'll simply be ignoring for
our purposes) and http://purl.org/dc/elements/1.1/ (Dublin Core) for the additional metadata of article authors and publishing dates.
We can go in one of two ways here: if we don't have a namespace-aware XML parser, we can blindly assume that the feed uses the standard
prefixes and default namespace and look for item elements and dc:creator elements within them. This will actually work in a large number
of real-world cases; most RSS feeds use the default namespace and the same prefixes for common modules like Dublin Core. This is a
horrible hack, though. There's no guarantee that a feed won't use a different prefix for a namespace (which would be perfectly valid XML
and RDF). If or when it does, we'll miss it.
If we have a namespace-aware XML parser at our disposal, we can construct a more elegant solution that handles both RSS 0.91 and 1.0
feeds. We can look for items in no namespace; if that fails, we can look for items in the RSS 1.0 namespace. (Not shown here: RSS 0.90 feeds
also use a namespace, though not the same one as RSS 1.0. So what we really need is a list of namespaces to search.)
3. Less obvious but still important, the item elements are outside the channel element. (In RSS 0.91, the item elements were inside the
channel. In RSS 0.90, they were outside; in RSS 2.0, they're inside. Whee.) So we can't be picky about where we look for items.
4. Finally, you'll notice there is an extra items element within the channel. It's only useful to RDF parsers, and we're going to ignore it and
assume that the order of the items within the RSS feed is given by the order of the item elements.
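To make the trade-off in point 2 concrete, here is a minimal sketch written for Python 3 with the standard library's xml.dom.minidom. The feed fragment is made up for illustration: it binds Dublin Core to the prefix dublin instead of the conventional dc, which is perfectly legal. The prefix-based lookup misses the author, while the namespace-aware lookup finds it.

```python
from xml.dom import minidom

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical RSS 1.0 fragment that uses an unconventional Dublin Core prefix.
FEED = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dublin="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
    <dublin:creator>Antoine Quint</dublin:creator>
  </item>
</rdf:RDF>"""

doc = minidom.parseString(FEED)

# The hack: match on the literal qualified name, prefix included.
by_prefix = doc.getElementsByTagName("dc:creator")

# Namespace-aware: match on (namespace URI, local name), whatever the prefix.
by_namespace = doc.getElementsByTagNameNS(DC_NS, "creator")

print(len(by_prefix))                    # 0: the author is missed
print(by_namespace[0].firstChild.data)   # Antoine Quint
```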
But what about RSS 2.0? Luckily, once we've written code to handle RSS 0.91 and 1.0, RSS 2.0 is a piece of cake. Here's the RSS 2.0 version of
the same feed:
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final look at applying relational normalization techniques to W3C XML Schema
        data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal
        forms.</description>
      <dc:creator>Will Provost</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
    <item>
      <title>The .NET Schema Object Model</title>
      <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
      <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic
        manipulation of W3C XML Schemas.</description>
      <dc:creator>Priya Lakshminarayanan</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
    <item>
      <title>SVG's Past and Promising Future</title>
      <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
      <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks
        forward to 2003.</description>
      <dc:creator>Antoine Quint</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
  </channel>
</rss>
As this example shows, RSS 2.0 uses namespaces like RSS 1.0, but it's not RDF. Like RSS 0.91, there is no default namespace and items are back
inside the channel. If our code is liberal enough to handle the differences between RSS 0.91 and 1.0, RSS 2.0 should not present any additional
wrinkles.
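The code in this article ignores the root element altogether, but if you would rather handle the formats explicitly, the check is short. The following is only a sketch for Python 3: guess_flavor is a hypothetical helper, not part of the article's script, and it relies on nothing beyond the root element, the version attribute, and the namespaces already mentioned.

```python
from xml.dom import minidom

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS10_NS = "http://purl.org/rss/1.0/"
RSS090_NS = "http://my.netscape.com/rdf/simple/0.9/"

def guess_flavor(document):
    """Return a rough label for the feed format (hypothetical helper)."""
    root = document.documentElement
    if root.namespaceURI is None and root.localName == "rss":
        # The 0.91-to-2.0 branch states its version in an attribute.
        return "RSS " + (root.getAttribute("version") or "0.9x/2.0")
    if root.namespaceURI == RDF_NS and root.localName == "RDF":
        # Both RDF-based formats share a root; the channel's namespace differs.
        if document.getElementsByTagNameNS(RSS10_NS, "channel"):
            return "RSS 1.0"
        if document.getElementsByTagNameNS(RSS090_NS, "channel"):
            return "RSS 0.90"
    return "unknown"

# Tiny stand-in documents, just enough structure to exercise each branch.
SAMPLE_20 = '<rss version="2.0"><channel/></rss>'
SAMPLE_10 = ('<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
             'xmlns="http://purl.org/rss/1.0/"><channel rdf:about="x"/></rdf:RDF>')

print(guess_flavor(minidom.parseString(SAMPLE_20)))   # RSS 2.0
print(guess_flavor(minidom.parseString(SAMPLE_10)))   # RSS 1.0
```

Knowing the flavor is optional for the extraction that follows; it is mostly useful for logging or for choosing format-specific behavior.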
How can I read RSS?
Now let's get down to actually reading these sample RSS feeds from Python. The first thing we'll need to do is download some RSS feeds. This is
simple in Python; most distributions come with both a URL retrieval library and an XML parser. (Note to Mac OS X 10.2 users: your copy of
Python does not come with an XML parser; you will need to install PyXML first.)
from xml.dom import minidom
import urllib

def load(rssURL):
    return minidom.parse(urllib.urlopen(rssURL))
This takes the URL of an RSS feed and returns a parsed representation of the DOM, as native Python objects.
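The listing above calls urllib.urlopen, which exists only in Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. It is an adaptation rather than the article's code, and load_string is an extra helper (our own name) for experimenting with feeds that are already in memory.

```python
from urllib.request import urlopen
from xml.dom import minidom

def load(rss_url):
    """Download rss_url and return the parsed DOM document."""
    # urlopen returns a file-like response object; minidom.parse reads from it.
    with urlopen(rss_url) as response:
        return minidom.parse(response)

def load_string(xml_text):
    """Parse a feed that is already in memory (handy for experiments)."""
    return minidom.parseString(xml_text)

doc = load_string('<rss version="0.91"><channel><title>XML.com</title></channel></rss>')
print(doc.documentElement.tagName)   # rss
```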
The next bit is the tricky part. To compensate for the differences in RSS formats, we'll need a function that searches for specific elements in any
number of namespaces. Python's XML library includes a getElementsByTagNameNS which takes a namespace and a tag name, so we'll use that to
make our code general enough to handle RSS 0.9x/2.0 (which has no default namespace), RSS 1.0 and even RSS 0.90. This function will find all
elements with a given name, anywhere within a node. That's a good thing; it means that we can search for item elements within the root node and
always find them, whether they are inside or outside the channel element.
DEFAULT_NAMESPACES = \
  (None, # RSS 0.91, 0.92, 0.93, 0.94, 2.0
   'http://purl.org/rss/1.0/', # RSS 1.0
   'http://my.netscape.com/rdf/simple/0.9/' # RSS 0.90
  )

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children): return children
    return []
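To see the namespace fallback in action without downloading anything, the sketch below restates the function for Python 3 and runs it against two tiny in-memory feeds. The feeds are trimmed-down versions of the samples above, and the name get_elements is ours.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (
    None,                                      # RSS 0.91, 0.92, 0.93, 0.94, 2.0
    "http://purl.org/rss/1.0/",                # RSS 1.0
    "http://my.netscape.com/rdf/simple/0.9/",  # RSS 0.90
)

def get_elements(node, tag_name, namespaces=DEFAULT_NAMESPACES):
    """Return the matches from the first namespace that yields any."""
    for namespace in namespaces:
        found = node.getElementsByTagNameNS(namespace, tag_name)
        if found:
            return list(found)
    return []

# Items inside the channel, no namespace.
RSS_091 = """<rss version="0.91"><channel>
  <title>XML.com</title>
  <item><title>Normalizing XML, Part 2</title></item>
  <item><title>The .NET Schema Object Model</title></item>
</channel></rss>"""

# Items outside the channel, in the RSS 1.0 default namespace.
RSS_10 = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.xml.com/cs/xml/query/q/19"><title>XML.com</title></channel>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
  </item>
</rdf:RDF>"""

for feed in (RSS_091, RSS_10):
    items = get_elements(minidom.parseString(feed), "item")
    print(len(items))   # 2 for the first feed, 1 for the second
```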
Finally, we need two utility functions to make our lives easier. First, our getElementsByTagName function will return a list of elements, but most of
the time we know there's only going to be one. An item only has one title, one link, one description, and so on. We'll define a first function
that returns the first element of a given name (again, searching across several different namespaces). Second, Python's XML libraries are great at
parsing an XML document into nodes, but not that helpful at putting the data back together again. We'll define a textOf function that returns the
entire text of a particular XML element.
def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    children = getElementsByTagName(node, tagName, possibleNamespaces)
    return len(children) and children[0] or None

def textOf(node):
    return node and "".join([child.data for child in node.childNodes]) or ""
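One caveat about textOf as written: it reads .data from every child node, so it fails with an AttributeError if the element contains child elements, and it would include the text of XML comments. That is fine for the sample feeds. If you expect nested markup, a more defensive variant might look like the sketch below (Python 3; the name text_of and the test fragment are ours, and it assumes you want the concatenated text of all descendants).

```python
from xml.dom import minidom, Node

def text_of(node):
    """Concatenate all text and CDATA found anywhere beneath node."""
    if node is None:
        return ""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            parts.append(text_of(child))   # recurse into nested markup
        # comments and processing instructions are skipped
    return "".join(parts)

doc = minidom.parseString(
    "<item><description>Looks <b>forward</b> to "
    "<![CDATA[2003 & beyond]]><!-- note --></description></item>"
)
print(text_of(doc.getElementsByTagName("description")[0]))
# Looks forward to 2003 & beyond
```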
That's it. The actual parsing is easy. We'll take a URL on the command line, download it, parse it, get the list of items, and then get some useful
information from each item:
DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)

if __name__ == '__main__':
    import sys
    rssDocument = load(sys.argv[1])
    for item in getElementsByTagName(rssDocument, 'item'):
        print 'title:', textOf(first(item, 'title'))
        print 'link:', textOf(first(item, 'link'))
        print 'description:', textOf(first(item, 'description'))
        print 'date:', textOf(first(item, 'date', DUBLIN_CORE))
        print 'author:', textOf(first(item, 'creator', DUBLIN_CORE))
        print
Running it with our sample RSS 0.91 feed prints only title, link, and description (since the feed didn't include any other information on dates or
authors):
$ python rss1.py http://www.xml.com/2002/12/18/examples/rss091.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data
modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date:
author:
title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic
manipulation of W3C XML Schemas.
date:
author:
title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to
2003.
date:
author:
For both the sample RSS 1.0 feed and sample RSS 2.0 feed, we also get dates and authors for each item. We reuse our custom
getElementsByTagName function, but pass in the Dublin Core namespace and appropriate tag name. We could reuse this same function to extract
information from any of the basic RSS modules. (There are a few advanced modules specific to RSS 1.0 that would require a full RDF parser, but
they are not widely deployed in public RSS feeds.)
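As an illustration of reaching into another module, the sketch below pulls the update period from the Syndication module and a subject from Dublin Core. It is written for Python 3; the feed fragment is invented for the example, and first_text is our own condensed stand-in for the first and textOf pair.

```python
from xml.dom import minidom

DUBLIN_CORE = ("http://purl.org/dc/elements/1.1/",)
SYNDICATION = ("http://purl.org/rss/1.0/modules/syndication/",)

def first_text(node, tag_name, namespaces):
    """Text of the first tag_name element found in any namespace, or ''."""
    for namespace in namespaces:
        found = node.getElementsByTagNameNS(namespace, tag_name)
        if found:
            return "".join(
                child.data for child in found[0].childNodes
                if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
            )
    return ""

# Hypothetical feed that uses two modules.
FEED = """<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
  <channel>
    <title>XML.com</title>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <item>
      <title>The .NET Schema Object Model</title>
      <dc:subject>XML Schema</dc:subject>
    </item>
  </channel>
</rss>"""

doc = minidom.parseString(FEED)
print(first_text(doc, "updatePeriod", SYNDICATION))   # hourly
print(first_text(doc, "subject", DUBLIN_CORE))        # XML Schema
print(repr(first_text(doc, "rights", DUBLIN_CORE)))   # '' (element absent)
```

Supporting a further module is then a matter of adding its namespace URI as another tuple.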
Here's the output against our sample RSS 1.0 feed:
$ python rss1.py http://www.xml.com/2002/12/18/examples/rss10.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data
modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date: 2002-12-04
author: Will Provost
title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic
manipulation of W3C XML Schemas.
date: 2002-12-04
author: Priya Lakshminarayanan
title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to
2003.
date: 2002-12-04
author: Antoine Quint
Running against our sample RSS 2.0 feed produces the same results.
This technique will handle about 90% of the RSS feeds out there; the rest are ill-formed in a variety of interesting ways, mostly caused by
non-XML-aware publishing tools building feeds out of templates and not respecting basic XML well-formedness rules. Next month we'll tackle the
thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML.
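Detecting such feeds, at least, is straightforward. The sketch below (Python 3; the helper name try_parse and the broken sample are made up) shows how an unescaped ampersand, a typical template mistake, surfaces as an exception from the parser instead of a DOM. It only reports the problem and makes no attempt at recovery.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def try_parse(xml_text):
    """Return (document, None) on success or (None, error message) on failure."""
    try:
        return minidom.parseString(xml_text), None
    except ExpatError as err:
        return None, str(err)

# Hypothetical template output: the ampersand should have been written as &amp;
BROKEN = "<rss version='0.91'><channel><title>News & Views</title></channel></rss>"
FIXED = BROKEN.replace(" & ", " &amp; ")

for feed in (BROKEN, FIXED):
    doc, problem = try_parse(feed)
    if doc is None:
        print("not well-formed:", problem)
    else:
        print("parsed OK")
```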
Related resources
Sample RSS feeds: RSS 0.91, RSS 1.0, RSS 2.0.
rss1.py
Specifications: RSS 0.90, RSS 0.91, RSS 1.0, RSS 2.0.
Syndic8.com, a directory of 10,000 publicly available RSS feeds.
News Readers in the Open Directory, a variety of client-side and server-side programs for reading RSS
feeds.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.