Genesis of the Open Directory Project
Rich Skrenta
skrenta@rt.com
January 21, 2003

March 1998
• Work project was winding down
• Going up and down Sand Hill Road trying to get a web-calendar startup funded
• Read Danny Sullivan’s report on Yahoo’s listing problems on Search Engine Watch
http://www.searchenginewatch.com/sereport/97/09-yahoo.html
http://www.wired.com/news/print/0,1294,10236,00.html

Idea for GnuHoo
• Yahoo seemed to be ignoring their core asset: the directory
• How could we build a competitor?
• Didn't want to pay an editorial staff, even a cheap one
• Tequila + Brainstorming = GnuHoo

Idea for GnuHoo
• Use volunteer editors to build a web directory like Yahoo’s
• Volunteers would do a better job than paid generalists, since they would be experts in their area & have a personal interest
• Restrict editors to sub-branches of the directory, to limit the harm they could do

Original Goals
• Thought if we could reach 1,000 editors the directory would be successful
• Bootstrap problem was key: how to get the first 10,000 sites. The directory had to look “real” from Day 1
• Figured we needed 1M sites for a competitive directory
• Original get-off-the-couch motivational goal: we told ourselves that if we could get a story in Wired out of the effort, it would be worth doing

“Seed” Problem
• Needed a hierarchy & 10,000 sites to launch the directory
• Briefly considered Dewey Decimal
  – good thing we didn’t, it’s not free
  – didn’t seem to fit the web
• Original GnuHoo hierarchy mirrored Usenet

alt.2600              -> Computers/Hacking
alt.3d                -> Computers/Graphics/3D
alt.food              -> Recreation/Food
alt.internet          -> Computers/Internet
alt.mud               -> Games/MUDs
alt.online-service    -> Computers/Internet/ISPs
alt.rock-n-roll       -> Music/Rock-n-Roll
alt.rock-n-roll.metal -> Music/Heavy_Metal
alt.security          -> Computers/Security
alt.sources           -> Computers/Software
alt.tv.simpsons       -> Television/Simpsons
alt.tv.x-files        -> Television/X-Files
comp.ai               -> Computers/AI
comp.ai.alife         -> Computers/AI/Artificial_Life
comp.ai.fuzzy         -> Computers/AI/Fuzzy
comp.ai.games         -> Computers/AI/Games
comp.ai.nat-lang      -> Computers/AI/Natural_Language

Original Homepage Mock-up
ARTS        Movies, Television, Books ...
RECREATION  Travel, Food, Outdoors, Humor ...
BUSINESS    Jobs, Companies, Investing ...
REFERENCE   Education, Libraries, Taxes ...
COMPUTERS   Internet, Software, Hardware ...
REGIONAL    US, Canada, UK, Australia, Belgium ...
GAMES       Video, MUDs, Gambling ...
SCIENCE     Engineering, Psychology, Physics ...
HEALTH      Fitness, Medicine, Diseases ...
SHOPPING    Autos, Clothing, Directories ...
HOME        Kids, Houses, Consumers ...
SOCIETY     People, Religion, Issues ...
NEWS        Online, Media, Newspapers ...
SPORTS      Baseball, Football, Skiing ...

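Sketch: seeding categories from the Usenet mapping
The newsgroup-to-category table on the “Seed” Problem slide is essentially a lookup table. The following is a minimal sketch, not the original Perl code, of how such a table could seed categories with URLs scanned from newsgroup posts. The names NEWSGROUP_TO_CATEGORY, URL_PATTERN and seed_categories, and the deliberately loose URL regex, are assumptions made for illustration; only a few rows of the table are reproduced.

```python
import re
from collections import defaultdict

# A few rows of the newsgroup -> category table from the "Seed" Problem slide.
NEWSGROUP_TO_CATEGORY = {
    "alt.2600": "Computers/Hacking",
    "alt.food": "Recreation/Food",
    "alt.tv.simpsons": "Television/Simpsons",
    "comp.ai": "Computers/AI",
    "comp.ai.nat-lang": "Computers/AI/Natural_Language",
}

# Hypothetical, deliberately loose URL matcher (an assumption, not the original).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def seed_categories(posts):
    """Group URLs found in posts by directory category.

    posts is an iterable of (newsgroup, body) pairs. Posts from newsgroups
    that are not in the mapping are skipped.
    """
    seeds = defaultdict(set)
    for newsgroup, body in posts:
        category = NEWSGROUP_TO_CATEGORY.get(newsgroup)
        if category is None:
            continue
        for url in URL_PATTERN.findall(body):
            # Drop punctuation that commonly trails a URL in running text.
            seeds[category].add(url.rstrip(".,;)"))
    # Sorted lists give a stable, de-duplicated result per category.
    return {category: sorted(urls) for category, urls in seeds.items()}
```

A sketch like this produces exactly the kind of rough seed list a first editor would then have to clean up: it has no way to tell a good site from a bad one, or a correctly placed one from a misplaced one.
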
Category Bootstrapping
• Scanned URLs mentioned in newsgroups to find seed sites for the corresponding directory category
• This yielded something that looked pretty good at a casual glance
• …but a lot of the original seed URLs were bad sites or placed in the wrong category
• The first editor in a category simply had to delete or move the bad entries, which left behind a good category

Coding & Launch
• Coded from April to June 1998
• Perl CGI and flat files
• Simple HTML forms to add/edit/delete websites in the directory
• Web pages served from static HTML files in a directory tree
• HTML files regenerated whenever an edit was made

Simple Flat File Format
u: http://www.newhoo.com/
t: NewHoo!
d: The largest human-edited directory of the web
c: Computers/Internet/Web_Directories

Minimalist Design
• Minimal locking, last-writer-wins semantics
  – flock() only used for category counts
• Only safe operations used: write-with-append and rename()
• No big database
• A few DBM files for minor stuff

Coding & Launch
• Used publicly available software for keyword search of the directory: originally Glimpse, later Isearch
• First ran on BSDI, later moved to Linux
  – filesystem progression: ufs, ext2, vxfs
• Launched June 5, 1998
• Acquired by Netscape in October 1998
http://www.wired.com/news/print/0,1294,13625,00.html

Early Press was Key to Growth
• About 1% of the visitors to NewHoo applied to become editors
• Some fraction of those would be accepted
• The more traffic we got, the more editors we would get
• We grubbed around for any hits we could get in the beginning
• Initial Slashdot, Netly, Wired, Red Herring stories were vital traffic sources
• No matter what the story said, “Just spell our URL right”

Social Design of NewHoo
• Not a free-for-all links page: every editor had to apply & be approved
• Every edit was logged and could be undone
• Hierarchy of editors, with senior ones keeping an eye on the new ones
• Emergent editing guidelines, enforced with peer review

Why Did You Apply to be a NewHoo Editor?
“There is a link to my old warwick uni account that has been dead for two years. As editor I could change it.”

Why Did You Apply to be a NewHoo Editor?
“I’m already building Linux indexes and sites, better to have them all nicely integrated in computers/software/linux”

Why Did You Apply to be a NewHoo Editor?
“We already maintain a site called CoinLink which lists over 800 coin related sites. We know the coin industry and could easily assist in building and maintaining this section of the index.”

Why Did You Apply to be a NewHoo Editor?
“You have no category in Recreation/Collecting that focuses on Christmas ornament collecting. Ornament collecting is one of the fastest growing hobbies. I've collected ornaments for 25 years and feel I know many of the "best" web sites dealing with this subject.”

Motivations to Edit
• Same urge that makes you straighten a crooked picture you see on the wall
• People were maintaining link lists on their own manually; they could do so more easily with NewHoo’s web forms
• Didn’t need to see the whole directory finished to have their category be useful
• …but knowing they were helping to build the pyramid was a warm fuzzy

Directory Editing is Amenable to Incremental Effort
• First editor finds a good site and adds it
• Second fixes a typo in the description
• Third editor moves it to a more appropriate category
• Fourth editor later notices the site moved and fixes the URL
• Not as hard as writing device drivers; many can help
• If you ask too much, results fall off quickly

The Free Use License
• Netscape offered the data from the ODP under a free-use license
• Directory data was adopted by Lycos, AltaVista, Google and other search engines
• Only requirement was that the Add URL link point back to dmoz.org
  – helped keep dmoz authoritative & prevent forks

GnuHoo -> NewHoo -> ODP
• FSF objected to the “Gnu”
• Yahoo objected to the “Hoo”
• Netscape renamed it to the Open Directory Project and hosted it on directory.mozilla.org
• directory.mozilla.org was too long to type, so we shortened it to dmoz.org

Robozilla
• Lloyd Tabb wrote a crawler to visit every site in the ODP and check whether it returned a 404, 301 or 302
• Didn’t take action on its own, but alerted editors to potentially bad or moved sites
• Brought bad sites in the ODP down to 0.25%
• Our crawl of Yahoo showed 8% bad links

“That’s a Problem We Want to Have”
• Design decisions were made in the interest of expediency. Why invest more time in the infrastructure if the site never takes off?
• Still running much of the 1.0 code today, over 4 years later
• Zillions of flat files in a gigantic VXFS filesystem
• Were we wrong? No, I don’t think so.

The ODP Won
• 55,000 total editors, probably 10,000 active
• 3.4M sites, 460K categories
• Largest human-created taxonomy ever
• Several times larger than competitors
• Cited in 83 academic research papers (source: citeseer.nj.nec.com)

The ODP “Won”
…but directories no longer scale to the web for users:
  – small web: use a directory
  – big web: use keywords
Everyone uses :-)

“Lost Ark” Ending?
• The traffic & validation provided by Netscape was key to the ODP’s success
• Possible future: lost server in an ops farm
• What new idea can take the ODP to the next level?
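
Backup sketch: Robozilla-style link triage
For the Robozilla slide, the rule "flag 404/301/302, but leave the decision to editors" can be sketched as below. This is an illustration under stated assumptions, not Lloyd Tabb's crawler: it does no network fetching and takes already-collected (url, status, location) results as input. The names LinkReport, review_link and build_review_queue are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkReport:
    """One entry in a hypothetical editor review queue."""
    url: str
    status: int
    problem: str                      # "dead" or "moved"
    new_location: Optional[str] = None


def review_link(url, status, location=None):
    """Return a LinkReport if the status suggests a problem, else None.

    Only the three status codes named on the Robozilla slide are flagged.
    """
    if status == 404:
        return LinkReport(url, status, "dead")
    if status in (301, 302):
        return LinkReport(url, status, "moved", location)
    return None


def build_review_queue(results):
    """Collect reports for editors; nothing is deleted or rewritten here."""
    queue = []
    for url, status, location in results:
        report = review_link(url, status, location)
        if report is not None:
            queue.append(report)
    return queue
```

Keeping the checker read-only mirrors the slide: the tool only alerts, and an editor decides whether a redirected or missing site should be fixed, moved or removed.
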