Genesis & History of the Open Directory Project

Genesis of the Open Directory
Project
Rich Skrenta
skrenta@rt.com
January 21, 2003
March 1998
• Work project was winding down
• Going up and down Sand Hill road trying to
get a web-calendar startup funded
• Read Danny Sullivan’s report on Yahoo’s
listing problems on Search Engine Watch
http://www.searchenginewatch.com/sereport/97/09-yahoo.html
http://www.wired.com/news/print/0,1294,10236,00.html
Idea for GnuHoo
• Yahoo seemed to be ignoring their core
asset - the directory
• How could we build a competitor?
• Didn't want to pay an editorial staff
– even a cheap one
• Tequila + Brainstorming = GnuHoo
Idea for GnuHoo
• Use volunteer editors to build a web directory
like Yahoo’s
• Volunteers would do a better job than paid
generalists, since they would be experts about
their area & have a personal interest
• Restrict editors to sub-branches of the directory,
to limit the harm they could do
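The sub-branch restriction can be sketched as a path-prefix check. This is only an illustration: the function name `can_edit` and the idea of storing an editor's branches as a list of category paths are assumptions, not details from the talk.

```python
def can_edit(editor_branches, category):
    """Return True if `category` is one of the editor's branches or lies below one.

    Both arguments use slash-separated paths such as "Computers/AI".
    (Hypothetical helper; the talk does not describe the real check.)
    """
    for branch in editor_branches:
        # Compare whole path components so "Computers/AI" does not match "Computers/AIX".
        if category == branch or category.startswith(branch + "/"):
            return True
    return False
```

An editor assigned `Computers/AI` could then edit `Computers/AI/Games` but not `Computers/Security`.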
Original Goals
• Thought if we could reach 1,000 editors the directory
would be successful
• Bootstrap problem was key - how to get the first 10,000
sites. The directory had to look “real” from Day 1
• Figured we needed 1M sites for a competitive directory
• Original get-off-the-couch motivational goal: We told
ourselves that if we could get a story in Wired out of the
effort, it would be worth doing
“Seed” Problem
• Needed a hierarchy & 10,000 sites to launch
the directory
• Briefly considered Dewey Decimal
– good thing we didn’t, it’s not free
– didn’t seem to fit the web
• Original GnuHoo hierarchy mirrored Usenet
alt.2600 -> Computers/Hacking
alt.3d -> Computers/Graphics/3D
alt.food -> Recreation/Food
alt.internet -> Computers/Internet
alt.mud -> Games/MUDs
alt.online-service -> Computers/Internet/ISPs
alt.rock-n-roll -> Music/Rock-n-Roll
alt.rock-n-roll.metal -> Music/Heavy_Metal
alt.security -> Computers/Security
alt.sources -> Computers/Software
alt.tv.simpsons -> Television/Simpsons
alt.tv.x-files -> Television/X-Files
comp.ai -> Computers/AI
comp.ai.alife -> Computers/AI/Artificial_Life
comp.ai.fuzzy -> Computers/AI/Fuzzy
comp.ai.games -> Computers/AI/Games
comp.ai.nat-lang -> Computers/AI/Natural_Language
Original Homepage Mock-up
ARTS
Movies Television Books ...
RECREATION
Travel Food Outdoors Humor ...
BUSINESS
Jobs Companies Investing ...
REFERENCE
Education Libraries Taxes ...
COMPUTERS
Internet Software Hardware ...
REGIONAL
US Canada UK Australia Belgium ...
GAMES
Video MUDs Gambling ...
SCIENCE
Engineering Psychology Physics ...
HEALTH
Fitness Medicine Diseases ...
SHOPPING
Autos Clothing Directories ...
HOME
Kids Houses Consumers ...
SOCIETY
People Religion Issues ...
NEWS
Online Media Newspapers ...
SPORTS
Baseball Football Skiing ...
Category Bootstrapping
• Scanned URLs mentioned in newsgroups to find
seed sites for the corresponding directory category
• This yielded something that looked pretty good at a
casual glance
• …but a lot of the original seed URLs were
bad sites or placed in the wrong category
• The first editor in a category simply had to delete or
move the bad entries, which left behind a good
category
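The seeding step described above might look roughly like the following. It is a minimal sketch, not the original Perl: the three newsgroup-to-category pairs come from the hierarchy slide, but the input shape (a list of `(newsgroup, body)` pairs), the URL regex, and the name `seed_categories` are all assumptions.

```python
import re
from collections import defaultdict

# Newsgroup -> category pairs taken from the hierarchy slide; everything else is hypothetical.
NEWSGROUP_TO_CATEGORY = {
    "alt.food": "Recreation/Food",
    "alt.mud": "Games/MUDs",
    "comp.ai": "Computers/AI",
}

# A deliberately loose pattern: anything that looks like an http(s) URL.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


def seed_categories(posts):
    """Collect URLs mentioned in newsgroup posts, grouped by directory category.

    `posts` is an iterable of (newsgroup, body_text) pairs.
    Returns {category: sorted list of unique URLs}.
    """
    seeds = defaultdict(set)
    for newsgroup, body in posts:
        category = NEWSGROUP_TO_CATEGORY.get(newsgroup)
        if category is None:
            continue  # no corresponding directory category
        for url in URL_RE.findall(body):
            # Drop punctuation that commonly trails a URL in running text.
            seeds[category].add(url.rstrip(".,;:!?\"'"))
    return {category: sorted(urls) for category, urls in seeds.items()}
```

Nothing here judges site quality or placement, which matches the slide: the seed looked plausible at a glance, and the first editor in each category did the cleanup.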
Coding & Launch
• Coded from April-June, 1998
• Perl CGI and flat files
• Simple HTML forms to add/edit/delete
websites in the directory
• Web pages served from static HTML files in a
directory tree
• HTML files regenerated whenever an edit
was made
Simple Flat File Format
u: http://www.newhoo.com/
t: NewHoo!
d: The largest human-edited directory of the web
c: Computers/Internet/Web_Directories
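A record in this format is easy to read back. The sketch below assumes one `key: value` field per line and a blank line between records; the slide shows only a single record, so the separator, the long field names, and the function name `parse_records` are assumptions.

```python
# Short keys from the slide mapped to descriptive names (the long names are assumed).
FIELD_NAMES = {"u": "url", "t": "title", "d": "description", "c": "category"}


def parse_records(text):
    """Parse flat-file text into a list of dicts, one per site record."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current record (assumed separator).
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so the colon in "http://" is preserved.
        key, sep, value = line.partition(":")
        if sep and key in FIELD_NAMES:
            current[FIELD_NAMES[key]] = value.strip()
    if current:
        records.append(current)
    return records
```

Fed the NewHoo record from the slide, this returns one dict with `url`, `title`, `description`, and `category` keys.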
Minimalist Design
• Minimal locking, last-writer-wins semantics
– flock() only used for category counts
• Only safe operations used: write-with-append
and rename()
• No big database
• A few DBM files for minor stuff
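The two "safe" operations can be illustrated as follows. This is a sketch of the general technique in Python, not the original Perl code; the function names are hypothetical. It relies on POSIX behaviour: a file opened with `O_APPEND` positions every write at the end of the file, and a rename within one filesystem replaces the target atomically, so readers see either the old file or the new one.

```python
import os
import tempfile


def append_record(path, record_text):
    """Append one record to a flat file, creating the file if needed."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record_text.encode("utf-8"))
    finally:
        os.close(fd)


def replace_file(path, new_contents):
    """Rewrite a file by writing a temporary copy and renaming it over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_contents)
        os.replace(tmp_path, path)  # rename over the target
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file if anything failed
        raise
```

With last-writer-wins semantics, two editors rewriting the same category at once simply leave the later rename in place; neither produces a half-written file.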
Coding & Launch
• Used publicly-available software for keyword
search of the directory: Originally Glimpse,
later Isearch
• First ran on BSDI, later moved to Linux
– filesystem progression: ufs, ext2, vxfs
• Launched June 5, 1998
• Acquired by Netscape in October, 1998
http://www.wired.com/news/print/0,1294,13625,00.html
Early Press was Key to Growth
• About 1% of the visitors to NewHoo applied to become
editors
• Some fraction of those would be accepted
• The more traffic we got, the more editors we would get
• We grubbed around for any hits we could get in the
beginning
• Initial Slashdot, Netly, Wired, Red Herring stories were
vital traffic sources
• No matter what the story said, “Just spell our URL right”
Social Design of NewHoo
• Not a free-for-all links page - every editor
had to apply & be approved
• Every edit logged and possible to undo
• Hierarchy of editors, with senior ones
keeping an eye on the new ones
• Emergent editing guidelines, enforced with
peer review
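"Every edit logged and possible to undo" suggests a structure along these lines. It is a toy in-memory sketch: the class name `EditLog`, its fields, and the choice to record an undo as a new log entry are all assumptions, since the talk does not describe how NewHoo stored its log.

```python
class EditLog:
    """Toy in-memory model of a directory with a log of every edit."""

    def __init__(self):
        self.listings = {}  # url -> category path
        self.log = []       # (editor, url, old_category, new_category)

    def set_category(self, editor, url, category):
        """Place `url` in `category` (None removes it) and log the change."""
        old = self.listings.get(url)
        self.log.append((editor, url, old, category))
        if category is None:
            self.listings.pop(url, None)
        else:
            self.listings[url] = category

    def undo_last(self, editor):
        """Revert the most recent edit; the reversal is itself logged as an edit."""
        if not self.log:
            return False
        _, url, old, _ = self.log[-1]
        self.set_category(editor, url, old)
        return True
```

Because the reversal is logged too, a senior editor's undo leaves its own trail for peer review.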
Why Did You Apply to be a NewHoo Editor?
“There is a link to my old warwick uni account
that has been dead for two years. As editor I
could change it.”
Why Did You Apply to be a NewHoo Editor?
I’m already building Linux indexes and sites,
better to have them all nicely integrated in
computers/software/linux
Why Did You Apply to be a NewHoo Editor?
We already maintain a site called CoinLink
which lists over 800 coin related sites. We know
the coin industry and could easily assist in
building and maintaining this section of the
index.
Why Did You Apply to be a NewHoo Editor?
You have no category in Recreation/Collecting
that focuses on Christmas ornament collecting.
Ornament collecting is one of the fastest
growing hobbies. I've collected ornaments for 25
years and feel I know many of the "best" web
sites dealing with this subject.
Motivations to Edit
• Same urge that makes you straighten a crooked
picture you see on the wall
• People were maintaining link lists on their own
manually; they could do so more easily with
NewHoo’s web forms
• Didn’t need to see the whole directory finished
to have their category be useful
• …but knowing they were helping to build the
pyramid was a warm fuzzy
Directory Editing is Amenable to
Incremental Effort
• First editor finds a good site and adds it
• Second fixes a typo in the description
• Third editor moves it to a more appropriate category
• Fourth editor later notices the site moved and fixes the
URL
• Not as hard as writing device drivers; many can help
• If you ask too much, results fall off quickly
The Free Use License
• Netscape offered the data from the ODP
under a free-use license
• Directory data was adopted by Lycos,
AltaVista, Google and other search engines
• Only requirement was that the Add URL
link point back to dmoz.org
– helped keep dmoz authoritative & prevent forks
GnuHoo -> NewHoo -> ODP
• FSF objected to the “Gnu”
• Yahoo objected to the “Hoo”
• Netscape renamed it to the Open Directory
Project and hosted it on directory.mozilla.org
• directory.mozilla.org was too long to type, so
we shortened it to dmoz.org
Robozilla
• Lloyd Tabb wrote a crawler to visit every site in
the ODP to see if it was 404/301/302
• Didn’t take action on its own, but alerted editors
to potentially bad or moved sites
• Brought bad sites in the ODP down to 0.25%
• Our crawl of Yahoo showed 8% bad links
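The reporting half of such a crawler can be sketched without any network code. This is not Robozilla itself: the function names and labels are hypothetical, and the HTTP status codes are assumed to have been fetched already by some other step. The only details taken from the slide are the three status codes and the rule that the tool flags listings for editors rather than changing them.

```python
def classify_status(status):
    """Map an HTTP status code to a flag for editors, or None if nothing looks wrong."""
    if status == 404:
        return "dead"   # page not found
    if status in (301, 302):
        return "moved"  # permanent or temporary redirect
    return None


def flag_links(results):
    """Build a report of suspicious listings from (url, status) pairs.

    The report is only advisory; listings themselves are never modified here.
    """
    report = []
    for url, status in results:
        flag = classify_status(status)
        if flag is not None:
            report.append((url, flag))
    return report
```

The bad-link rate quoted on the slide would then be the share of listings flagged `dead` out of all listings crawled.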
“That’s a Problem We Want to Have”
• Design decisions were made in the interest of
expediency. Why invest more time in the
infrastructure if the site never takes off?
• Still running much of the 1.0 code today, over 4
years later
• Zillions of flat files in a gigantic VXFS
filesystem
• Were we wrong? No, I don’t think so.
The ODP Won
• 55,000 total editors, probably 10,000 active
• 3.4M sites, 460K categories
• Largest human-created taxonomy ever
• Several times larger than competitors
• Cited in 83 academic research papers
(source: citeseer.nj.nec.com)
The ODP “Won”
…but directories no longer scale to the web
for users:
– small web: use a directory
– big web: use keywords
Everyone uses
:-)
“Lost Ark” Ending?
• The traffic & validation provided by Netscape was key
to the ODP’s success
• Possible future: lost server in an ops farm
• What new idea can take the ODP to the next level?