Measuring Web Site Traffic: Panel vs. Audit

I/PRO, A TopicalNet Company
444 Spear Street, Suite 200, San Francisco CA 94105
tel: 415-512-7470  fax: 415-512-7996
email: info@ipro.com  Web: www.ipro.com

1. Executive summary – "Don't Guess, Count"

Compared with web log analysis and auditing, panel-based web traffic measurement is a relatively inaccurate way to understand how many page views a web site displays. Actual counting gives a much better picture. The highest level of counting accuracy is achieved by auditing. The attention of human auditors following strict guidelines, combined with the independence of a third party audit agency, provides monthly snapshots of web site traffic that are verified for public distribution to management, investors, and advertisers. The amount of traffic that the panel-based approach misses is quite large.

• Panel vs. web log auditing
  − 85% of the cases show a panel error of more than 10%
  − 30% of the cases have a panel error of more than 50%
  − Nearly 60% of the cases show an undercount

The figure below presents these findings graphically.

[Figure: Web Site Traffic: Magnitude of Panel-Based Underreported Traffic vs. Audit - May 2001. Vertical axis: Underreported Traffic (page views), 0 to 700,000,000; horizontal axis: Error Relative to Audit (Greater than -50%; Between -50% and -10%; Between -10% and 0%).]

2. Background

Consider three media supported by advertising: television, magazines, and the Web. For all three, management, investors, and advertisers need traffic numbers that are both accurate and verifiable. That is, they need external audits. The problems in fulfilling this need, and the ways they are overcome, are unique to each medium.

2.1 Television

Television is a "create once, sell many" medium. Production costs are the same regardless of how many people ultimately watch a particular program. The nature of television is that producers cannot directly measure (count) their audience. While they can certainly survey viewers for internal management purposes, such surveys do not satisfy external investors and advertisers. To address this need, third party panel-based measurement companies like Nielsen survey a representative sample of viewers and the channels to which their television sets are tuned, then extrapolate to estimate the audience as a whole. This extrapolation works because the number of channels available for viewing at any given time is relatively limited: certainly no more than hundreds. The panel-based approach does not completely solve the accuracy and verification problems, but in this case it is "good enough."

2.2 Magazines

The situation is quite different for magazine publishers, who face a "create many, sell many" situation. They know how many copies they print and how many they sell. Unlike television, they can count. There is currently no way, however, to monitor which magazines are actually opened, much less read. Moreover, internal counts, no matter how accurate, again do not satisfy external investors and advertisers. Third party companies such as the Audit Bureau of Circulations and BPA verify circulation numbers based on audits of financial documents, mailing lists, postal receipts, printing bills, and other indicators.
In theory, survey techniques can be used to measure readership, but compared with television the task is much harder: the number of magazines is far larger than the number of television channels, so sample sizes would need to be quite large. The difficulty is compounded because there is no magazine analog to the Nielsen set top box, which records the actual program being displayed; surveys of magazine readership would require that readers accurately remember what they read.

2.3 The Web

Like television, the Web is "create once, sell many." The Web enables an interesting combination of the two measurement approaches: supply side (like magazines) and demand side (like television). Web publishers can use web log analysis to provide the accuracy part of the equation, analogous to a magazine publisher counting the number of copies printed. Panel-based survey companies can install measurement software on user computers to cover the demand side, in a manner similar to panel-based television viewing measurement. Even more dramatically than with magazines, however, the tens of millions of web sites and billions of web pages mean that prohibitively large samples would be required. The difficulty is again compounded, this time because representative samples are impractical to assemble: companies, educational institutions, and other large organizations forbid the installation of measurement software on their users' computers.

These troubles are exemplified by the dramatically different traffic numbers that the various web panel measurers report. For example, in a press release dated August 16, 2001, Gannett Online released "Unique Visitors Per Month" and "Percentage Reach of Internet Audience" numbers from two panel-based measurement services, both drawn from "Home/Work Panels Combined" data sets. The table below is a telling example of the problems with a panel-based approach: the two services differ by nearly 1.5 million unique visitors, and the service reporting fewer visitors reports the higher reach.

Reporting Company      Unique Visitors/Month    % Reach
Nielsen/NetRatings     9,199,000                8.2%
Media Metrix           7,712,000                8.4%

The Web enables auditing. It is the first advertising medium that supports this level of accuracy in reporting. In short, "Don't Guess, Count." The remainder of this paper outlines the components of web site auditing and examines how panel-based traffic estimates differ from the actual audited traffic.

3. Components of web auditing

3.1 What is a web audit?

The purpose of an audit is to report not on what a web site serves, but on what its visitors see. Audits provide management, investors, advertisers, and others with a credible measure of a web site's traffic. A web site audit is a validation of traffic by an independent audit agency.

3.2 What are the elements of an I/PRO audit?

Web log analysis and auditing begins with a "raw" web log file recorded by a web server. These web logs contain a record (or hit) for each file that the web server serves to a user via a web browser. These files include the following.

• HTML files (.htm, .html)
• Server side code that generates HTML (.cgi, .jsp, .asp, .cfm, .php, …)
• Framesets (.frm)
• Images (.gif, .jpg, .jpeg, .bmp, …)
• Multimedia files (.mpg, .mpeg, .mp3, .wav, .swf, …)
• Stylesheets (.css)
• Customized web site extensions

For each file, the web log may record the following information, depending on the web server software and the web log file format selected.

• The file requested
• The time and date that the file was served
• The cookie (unique identifier) accepted by the user
• The IP address to which the file was served
• The success of the delivery of the file (the status code)

A pre-audit counting stage turns Raw Hits into Qualified Hits and then Qualified Pages. Counts for visits, visit length, and Unique Visitors are derived from Qualified Pages. The auditing stage then examines the remaining pages to remove those that are not valid, resulting in either a Document Requests or a Page Requests metric. The removals made at each stage are listed below; a code sketch of the counting stage follows the list.

• Web log file transfer
  − Raw Hits
• Pre-audit counting stage
  − Qualified Hits
    > Removal of invalid status code pages
    > Removal of internal traffic pages
    > Removal of spider and robot pages
  − Qualified Pages
    > Removal of non-HTML generating pages (images, multimedia files, style sheets, etc.)
• Auditing stage
  − Document Requests
    > Removal of blank pages
    > Removal of redirection pages
    > Removal of administrative/test pages
    > Removal of custom error web pages
    > Removal of other non-viewable files
  − Page Requests
    > Multiple frames reduced to one
    > Removal of WAP and PDA pages
    > Removal of passive pages
    > Removal of other error pages
    > Removal of include pages
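To make the counting stage concrete, the sketch below shows one way the first two stages might be implemented. It is a minimal illustration and not I/PRO's production logic: the log format (Common or Combined Log Format), the spider hints, the internal address prefix, and the page-extension list are all simplified assumptions.

    import re

    # A minimal sketch of the pre-audit counting stage, assuming web logs
    # in Common or Combined Log Format. The qualification rules below
    # (status codes, spider hints, internal address prefix, page
    # extensions) are illustrative assumptions, not I/PRO's audit rules.

    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
        r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
    )

    PAGE_EXTENSIONS = ('.htm', '.html', '.cgi', '.jsp', '.asp', '.cfm', '.php')
    INTERNAL_PREFIX = '10.'                       # assumed internal address range
    SPIDER_HINTS = ('bot', 'crawler', 'spider')   # assumed robot identifiers

    def qualify(log_lines):
        """Count Raw Hits, Qualified Hits, and Qualified Pages."""
        raw_hits = qualified_hits = qualified_pages = 0
        for line in log_lines:
            match = LOG_PATTERN.match(line)
            if not match:
                continue
            raw_hits += 1
            # Qualified Hits: drop invalid status codes, internal traffic,
            # and spider/robot requests.
            if not match.group('status').startswith('2'):
                continue
            if match.group('ip').startswith(INTERNAL_PREFIX):
                continue
            agent = (match.group('agent') or '').lower()
            if any(hint in agent for hint in SPIDER_HINTS):
                continue
            qualified_hits += 1
            # Qualified Pages: keep only HTML-generating requests, removing
            # images, multimedia files, stylesheets, and the like.
            path = match.group('path').split('?')[0].lower()
            if path.endswith(PAGE_EXTENSIONS) or path.endswith('/'):
                qualified_pages += 1
        return raw_hits, qualified_hits, qualified_pages

Applied to a month of raw logs, these three counters correspond to the web log file transfer and pre-audit counting stages above. The auditing stage that yields Document Requests and Page Requests involves human review of blank pages, redirects, frames, and site-specific files, which is precisely the part that resists full automation.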
4. Panel estimation vs. web log auditing

Web log analysis counts web site traffic with a level of accuracy that panel-based measurement cannot match. I/PRO compared page view numbers for its audit customers against those from a leading panel-based measurement company across three months: May 2001, April 2001, and December 2000. The selected metric is the number of pages viewed: "page views" in the parlance of the panel-based measurer, "Page Requests" in the case of I/PRO. Page Requests represent the most conservative counting of traffic.

4.1 Percentage of panel-based cases in error vs. log auditing

Figure 1 shows the percentage of panel cases in error (vertical axis) in each of six error bands (horizontal axis). The banding arithmetic is sketched in code after the figure.

• Panel traffic lower than audit by more than 50% (red)
• Panel traffic lower than audit by between 10% and 50% (green)
• Panel traffic lower than audit by between 0% and 10% (blue)
• Panel traffic higher than audit by between 0% and 10% (blue)
• Panel traffic higher than audit by between 10% and 50% (green)
• Panel traffic higher than audit by more than 50% (red)

[Figure 1: Web Site Traffic: Percentage of Panel-Based Cases in Error vs. Audit - May 2001. Vertical axis: Percentage of Panel-Based Cases in Error, 0% to 50%; horizontal axis: Panel Error Relative to Audit, in the six bands listed above.]
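The banding itself is simple arithmetic. The sketch below shows one way to compute a case's panel error relative to audit and assign it to the bands of Figure 1; the function names and the example numbers are illustrative, not I/PRO data.

    def panel_error(panel_page_views, audit_page_requests):
        """Panel error relative to the audited count, as a signed fraction.

        Negative values are undercounts (panel below audit); positive
        values are overcounts.
        """
        return (panel_page_views - audit_page_requests) / audit_page_requests

    def error_band(error):
        """Assign a signed error to one of the six bands of Figure 1."""
        sign = "lower" if error < 0 else "higher"
        magnitude = abs(error)
        if magnitude > 0.50:
            width = "more than 50%"
        elif magnitude > 0.10:
            width = "between 10% and 50%"
        else:
            width = "between 0% and 10%"
        return f"Panel traffic {sign} than audit by {width}"

    # Hypothetical case: a panel reports 4.2 million page views against
    # an audited 10 million Page Requests, a 58% undercount.
    err = panel_error(4_200_000, 10_000_000)
    print(f"{err:+.0%}: {error_band(err)}")
    # -58%: Panel traffic lower than audit by more than 50%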
[Figure 2: Web Site Traffic: Panel-Based Page View Error vs. Audit - May 2001. Vertical axis (logarithmic): Page Views per Month, 100,000 to 10,000,000,000; horizontal axis: Representative Customer Web Sites. Lines mark the audit count and the +/-10% and +/-50% levels; circles mark the panel-based counts.]

For May 2001, more than 30% of the cases have a panel error of more than 50%. That figure climbs to more than 85% for a panel error of more than 10%. And nearly 60% of the cases show an undercount. Similar results are found for April 2001 and December 2000.

4.2 Amount of panel-based error vs. log auditing

The number of cases in error shown in Figure 1 represents a miscounting of page views relative to web log auditing. Figure 2 shows how this miscounting stacks up. The central blue line shows the audit traffic (vertical axis) for each case laid out along the horizontal axis. The empty circles show the panel traffic; the vertical distance between a circle and the blue line represents the error for that case. The green lines just above and below the central blue line mark the +/-10% levels: a circle between a green line and the blue line means the panel approach has an error of less than 10%. Similarly, the red lines mark the +/-50% levels: circles between the red and green lines have an error between 10% and 50%, and circles above the top red line or below the bottom red line have an error of more than 50%. Note that the vertical axis uses a logarithmic scale, which visually compresses the error.

4.3 Magnitude of panel-based underreporting

For the cases where the panel undercounts relative to the audit, it is instructive to understand how much traffic has been missed. Figure 3 examines the lower three error bands.

• Panel traffic lower than audit by more than 50% (red)
• Panel traffic lower than audit by between 10% and 50% (green)
• Panel traffic lower than audit by between 0% and 10% (none)

[Figure 3: Web Site Traffic: Magnitude of Panel-Based Underreported Traffic vs. Audit - May 2001. Vertical axis: Underreported Traffic (page views), 0 to 700,000,000; horizontal axis: Error Relative to Audit.]

For the cases undercounted by more than 50%, the actual traffic is represented by the height of the white bar on the left: just over 600 million page views per month. The red bar shows how much traffic the panel-based approach reports for the same month: under 200 million page views, or about one-third of the actual. For the cases undercounted by between 10% and 50%, the actual traffic is represented by the white bar on the right: over 200 million page views. The green bar shows the panel-based count: a similar amount, just under 200 million page views, or a bit over two-thirds of the actual. There were no cases with an undercount of between 0% and 10%. In total, for the undercount cases, the panel-based approach captures only 40% of the actual traffic.
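A back-of-the-envelope check shows how the roughly 40% aggregate capture rate arises. The round numbers below are approximate readings from Figure 3, assumed for illustration only.

    # Approximate page-view totals read from Figure 3 (illustrative
    # round numbers, not exact audit data).
    actual_severe, panel_severe = 620_000_000, 200_000_000  # >50% undercount band
    actual_milder, panel_milder = 280_000_000, 190_000_000  # 10-50% undercount band

    captured = (panel_severe + panel_milder) / (actual_severe + actual_milder)
    print(f"Panel captures {captured:.0%} of audited traffic")  # about 43%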
[Figure 4: Web Site Traffic: Panel-Based Error vs. Audit - Dec 2000, Apr 2001, and May 2001. Horizontal axis: Panel-Based Error vs. Log Analysis, -100% to 700%, with the width of each band reflecting the panel error one month vs. the next; vertical axis: Representative Customers. Legend: panel-based error beyond +/-50%; between +/-10% and +/-50%; within +/-10%.]

4.4 Variation across multiple months

Figure 4 shows how the panel-based error varies with time. The different cases are presented along the vertical axis, ranked top to bottom from largest to smallest month-to-month variation. The month-to-month variation is expressed by the width of the band along the horizontal axis, and the color of the band indicates the extent of the error: cases whose error goes beyond the +/-50% range are red, and those in the +/-10% to +/-50% range are green. All of the cases except one (the second from the bottom) have an error of greater than +/-10% in at least one of the months. Taking the second case from the top as an example, the error in one of the three months was about -30%, versus more than +600% in another. Nine of the cases show overcounting in at least one month versus undercounting in another.

5. Conclusion

Panel-based web traffic measurement is an inaccurate way to understand how many page views a web site displays. Counting gives a much better picture. The highest level of counting accuracy is achieved by auditing. The attention of human counters following strict guidelines, combined with the independence of a third party audit agency, provides monthly snapshots of web site traffic that are verified for public distribution to management, investors, and advertisers.

• Panel vs. log auditing
  − 30% of the cases have a panel error of more than 50%
  − 85% of the cases show a panel error of more than 10%
  − Nearly 60% of the cases show an undercount

Perhaps most telling, the amount of traffic that the panel-based approach misses is quite large.

• Missed traffic: panel vs. log auditing
  − 2/3 of traffic missed in serious undercount cases
  − 1/3 of traffic missed in milder undercount cases

In short, "Don't guess, count."