Assembling a parallel corpus from RSS news feeds

John Fry
Artificial Intelligence Center
SRI International
333 Ravenswood Avenue
Menlo Park, CA 94025-3493 USA
fry@ai.sri.com

Abstract

We describe our use of RSS news feeds to quickly assemble a parallel English-Japanese corpus. Our method is simpler than other web mining approaches, and it produces a parallel corpus whose quality, quantity, and rate of growth are stable and predictable.

1 Motivation

A parallel corpus is an indispensable resource for work in machine translation and other multilingual NLP tasks. For some language pairs (e.g., English-French) data are plentiful. For most language pairs, however, parallel corpora are either nonexistent or not publicly available.

The need for parallel corpora led to the development of software for automatically discovering parallel text on the World Wide Web. Examples of such web mining systems include BITS (Ma and Liberman, 1999), PTMiner (Chen and Nie, 2000), and STRAND (Resnik and Smith, 2003).

These web mining systems, while extremely useful, do have a few drawbacks:

• They rely on a random walk through the WWW (through search engines or web spiders), which means that the quantity and, more importantly, the quality of the final results are unpredictable.

• Their 'generate-and-test' approaches are slow and inefficient. For example, STRAND and PTMiner work by applying sets of hand-crafted substitution rules (e.g., english → big5) to all candidate URLs and then checking those new URLs for content, while the BITS system considers the full cross product of web pages on each site as possible translation pairs.

• They sometimes misidentify web page pairs as translations when in fact they are not.

• Good translation pairs are often missed. STRAND, for example, reports recall scores of 60% for some language pairs (Resnik and Smith, 2003).

• Although their source code has not been published, some web mining systems appear to be quite complex to implement, requiring hand-crafted URL manipulation rules and expertise in HTML/XML, similarity scoring, web spiders, and machine learning.

This paper describes a much simpler approach to web mining that avoids these disadvantages. We used our method to quickly assemble an English-Japanese parallel corpus whose quality, quantity, and rate of growth are stable and predictable, obviating the need for quality control by bilingual human experts.

2 Approach

Our approach exploits two recent trends in the delivery of news over the WWW.

The first trend is the growing practice among multinational news organizations of publishing the same content in multiple languages across a network of online news sites. Among the most prolific are online news sites in the domain of information technology (IT). For example, CNET Networks (http://www.cnetnetworks.com), a large IT media company, publishes stories in Chinese, English, French, German, Italian, Japanese, Korean, and Russian on its worldwide network of IT news sites. Another conglomerate, JupiterMedia (http://www.jupiterweb.com), publishes in English, German, Japanese, Korean, and Turkish.

The second trend is the use of RSS, an XML-based syndication format. RSS is increasingly used both by mainstream news web sites (e.g., wired.com, news.yahoo.com, and the IT news sites mentioned above) and by sites that provide news-like content (e.g., slashdot.org and weblogs). RSS-aware client programs, called news aggregators, help readers keep up with such sites by displaying the latest headlines as soon as they are published. In other words, readers subscribe to the sites' RSS feeds, rather than checking the sites manually for new content.

In many cases, a story published in a target language (say Japanese) will include a link to the original story in the source language (usually English). When the target articles are published over RSS, as they increasingly are, virtually all the ingredients of a parallel corpus are in place, with no random crawling required.
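To make this concrete, each story arrives as an RSS <item> whose <link> element carries the URL of the new Japanese article. The fragment below is a hypothetical example of such an item; the headline, URL, and summary are invented, while the element names follow standard RSS conventions:

    <item>
      <title>[Japanese headline]</title>
      <link>http://japan.cnet.com/news/example-story.htm</link>
      <description>[Brief summary of the story in Japanese]</description>
    </item>

It is the URL in the <link> element that the pipeline described in Section 3 extracts and follows.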
3 Assembling the parallel corpus

Using RSS feeds in the domain of technology news, we were able to assemble an English-Japanese parallel corpus quickly and with little programming effort.

The first step in assembling our corpus was to find web sites that publish Japanese-language news stories along with links to the original source articles in English. Table 1 lists the four RSS feeds we subscribed to.

Table 1: RSS feeds used to construct our parallel English-Japanese corpus

    URL of main news site       RSS feed
    http://hotwired.goo.ne.jp   http://www.hotwired.co.jp/news/index.rdf
    http://japan.cnet.com       http://japan.cnet.com/rss
    http://japan.internet.com   http://bulknews.net/rss/rdf.cgi?InternetCom
    http://www.itmedia.co.jp    http://bulknews.net/rss/rdf.cgi?ITmedia

Instead of using a news aggregator, we subscribed to the sites in Table 1 using the open-source rss2email program (written by Aaron Swartz), which delivers news feed updates over email. We then relied on standard UNIX tools like procmail, grep, sed, and wget to process the incoming RSS feeds as they arrived by email. (A sketch of the subscription step appears at the end of this section.)

Listing 1 shows the .procmailrc configuration file that instructs procmail how to process incoming RSS feeds.

Listing 1: Procmail code for creating our corpus

     1  # .procmailrc file: extracts
     2  # parallel URLs from RSS feeds
     3  :0 HB
     4  * ^User-Agent: rss2email
     5  | url=`grep -o http:.*` \
     6  ; wget -O - $url \
     7  | egrep \
     8  '(English)|CNET Networks|target=original\
     9  |<I>.*N</I></A>' \
    10  | grep -o http:.* \
    11  | sed -e 's/[?"].*//' \
    12  | xargs -r echo -e "$url\t" \
    13  >> parallel_url_list.txt

First, the URL of the new Japanese story is extracted from the email (line 5), and the article is downloaded (line 6). Next, the link (if any) to the English source article is extracted from the Japanese article (lines 7-11). Finally, both the Japanese and English URLs are saved to a file (lines 12-13).

The regular expression in lines 8-9 of Listing 1 matches text that accompanies a link to the English source article. This is the only part of Listing 1 that is specific to the sites we used (Table 1) and that would need to be modified in order to adapt our method to different languages or web sites.

It should be noted that we do not record the content of the parallel news articles. Because material on the Web is subject to copyright restrictions, we cannot publish the content directly. Rather, we record the URL of each pair of Japanese and English articles, separated by a tab character. This same format, tab-separated URLs, is also used by the STRAND project for distributing their web-mined parallel corpora (Resnik and Smith, 2003). The STRAND web page (http://umiacs.umd.edu/~resnik/strand) offers a short Perl program for extracting the actual content from the URL pairs; this program works for our English-Japanese data as well.
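As noted above, subscribing with rss2email amounts to only a few commands. The sketch below follows the classic r2e command-line interface; the exact syntax differs across rss2email versions, so treat it as illustrative rather than definitive.

    # Deliver new feed items by email to an address watched by procmail.
    r2e new fry@ai.sri.com
    r2e add http://www.hotwired.co.jp/news/index.rdf
    r2e add http://japan.cnet.com/rss
    r2e add http://bulknews.net/rss/rdf.cgi?InternetCom
    r2e add http://bulknews.net/rss/rdf.cgi?ITmedia
    # Poll the feeds periodically, e.g. every 30 minutes from cron:
    # */30 * * * * r2e run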
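To illustrate the site-specific regular expression of lines 8-9, the fragment below shows the kind of markup it is designed to match: a link in a Japanese article back to the English original. The URL and attribute values here are invented; the real markup differs from site to site, which is precisely why this part of the recipe must be adapted per site.

    <!-- Hypothetical link markup in a Japanese article -->
    <a href="http://news.example.com/story.html?tag=rss" target=original>
    [link text]</a>

Given such a matching line, grep -o http:.* (line 10) captures everything from http: to the end of the line, and sed (line 11) truncates the capture at the first ? or double quote, leaving the clean article URL http://news.example.com/story.html.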
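Because we distribute URL pairs rather than article text, reconstructing the corpus content requires downloading each pair. The STRAND Perl program mentioned above does this; for readers who prefer shell, the minimal POSIX sketch below does the same job under stated assumptions (the ja/ and en/ directory names and the numbering scheme are our own invention, and the pages must still be online).

    #!/bin/sh
    # Fetch both sides of each tab-separated URL pair into
    # parallel files ja/N.html and en/N.html.
    mkdir -p ja en
    n=0
    while read -r ja_url en_url; do
        n=$((n + 1))
        wget -q -O "ja/$n.html" "$ja_url"
        wget -q -O "en/$n.html" "$en_url"
    done < parallel_url_list.txt

Since the shell's default IFS splits on tabs and URLs contain no whitespace, read -r cleanly separates the two URLs on each line.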
4 Results

4.1 A five-week RSS corpus

We processed the RSS feeds from the Japanese sources listed in Table 1 over a period of five weeks. At the end of the fifth week, we had collected 333 parallel article pairs in Japanese and English. As Figure 1 shows, the bulk of the 333 article pairs were collected from HotWired (133) and CNET (125), followed by pairs from internet.com (65) and IT Media (10).

[Figure 1: Sources of the 333 parallel English-Japanese article pairs collected over five weeks. The plot shows the cumulative number of pairs per week (weeks 1-5) for each of the four sources, plus the total.]

We then manually inspected all 333 translation pairs to check for problems. One of the URLs we collected, ostensibly a link to an English-language CNET article, turned out to be a stale link. Another link, from a Japanese internet.com article, did not in fact point to an English translation. Finally, two of the HotWired translation pairs were found to be repeats (HotWired occasionally republishes popular articles from the past). We discarded the two bad pairs and the two repeats, leaving us with a final corpus of 329 unique translation pairs.

4.2 Supplementing the corpus by crawling the archives

A corpus of 329 parallel news articles is of course insufficient for most tasks. We therefore recursively crawled the past archives of all four web sites in Table 1 using wget -r to find article pairs that were posted earlier than those in our five-week experiment with RSS feeds. This crawling netted more than 15,000 additional article pairs. The total count of collected article pairs as of the time of writing (duplicates removed) is shown in Table 2.

Table 2: Total article pairs after crawling

    HotWired        6,701
    CNET              328
    internet.com    8,227
    IT Media        2,021
    Total          17,277

As Table 2 shows, the bulk of the articles obtained through crawling came from the HotWired and internet.com sites, both of which maintain archives stretching back several years. IT Media, on the other hand, observes a policy of removing content after one year, and so is not a substantial source of archived material.
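The crawl itself required no machinery beyond wget. A command along the following lines mirrors an archive politely; wget -r is what we used, but the additional flags and the starting URL shown here are illustrative. The saved pages can then be passed through the same link-extraction pipeline as lines 7-11 of Listing 1.

    # Recursively fetch a news archive, staying below the starting
    # directory and pausing one second between requests.
    wget -r --no-parent --wait=1 --accept html,htm \
         http://japan.cnet.com/news/archive/    # hypothetical starting URL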
4.3 Availability of our data

A regularly updated list of all English-Japanese article pairs we have collected so far can be downloaded from http://johnfry.org/je_corpus. At the time of writing, the list holds 17,277 English-Japanese article pairs (see Table 2), and is growing at a rate of approximately 70 pairs per week.

Where possible, our collected URLs point to 'printer-friendly' (as opposed to 'screen-friendly') versions of the content. Printer-friendly versions of news articles are structurally simpler, with fewer banners and advertisements cluttering the story content. In addition, printer-friendly versions typically contain the entire news article, whereas screen-friendly versions are sometimes published over several successive pages, making them more difficult to process. All the HotWired and IT Media articles in our corpus have printer-friendly versions in both English and Japanese. In the case of CNET and internet.com, the English-language articles offer printer-friendly versions, but the Japanese articles do not.

5 Other parallel English-Japanese corpora on the web

Our corpus of 17,277 article pairs is the largest, but not the only, parallel English-Japanese corpus that is freely available on the web. The following are other free sources of English-Japanese data:

• The NTT Machine Translation Research Group offers a set of 3,718 Japanese-English sentence pairs at http://www.kecl.ntt.co.jp/icl/mtg/resources

• The OPUS project at http://logos.uio.no/opus offers 33,143 aligned Japanese-English sentence pairs, taken from the documentation for the OpenOffice software suite.

• A set of aligned translations of 114 works of literature (taken from Project Gutenberg and similar sources) is available from the home page of NICT researcher Masao Utiyama at http://www2.nict.go.jp/jt/a132/members/mutiyama

A substantial, but not free, source of Japanese-English data is the set of 150,000 sentence pairs that Utiyama and Isahara (2003) collected from newspaper articles and aligned. This collection can be licensed from NICT (see the above link to Utiyama's home page).

6 Conclusion

We have demonstrated how RSS news feeds can be used to quickly assemble a parallel corpus. In the case of our Japanese-English corpus, we supplemented the RSS feeds with web crawling of the news archives in order to assemble a corpus of substantial size (17,277 article pairs and growing).

One drawback of our method is that it is feasible only for language pairs with a substantial online news media presence. On the other hand, our approach has two major advantages over web mining systems. First, it is considerably simpler to implement, requiring essentially one extended line of pipelined Unix shell commands (Listing 1). Second, our approach produces a parallel corpus whose quantity, rate of growth, and (most importantly) quality are stable and predictable. The burden of quality control (including article quality, translation quality, and identification of translation pairs) is shifted onto the news organization that publishes the RSS feed, rather than resting on the web crawling system or bilingual human experts.

References

Jiang Chen and Jian-Yun Nie. 2000. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 21-28, Seattle.

Xiaoyi Ma and Mark Liberman. 1999. BITS: A method for bilingual text search over the web. In Proceedings of MT Summit VII, September.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349-380.

Masao Utiyama and Hitoshi Isahara. 2003. Reliable measures for aligning Japanese-English news articles and sentences. In Proceedings of ACL 2003, pages 72-79, Sapporo, Japan.