University of Oslo
Department of Informatics

Analysis of News-On-Demand Characteristics and Client Access Patterns

Espen Nilsen
espenni@ifi.uio.no

Master Degree Thesis
April 26, 2005

Abstract

World Wide Web services are continuing to grow along with the number of clients connecting to the Internet and the transfer rates of their connections. News is one of the main areas of usage for clients today. It is also an area which has not received much attention from the research community. In this thesis, we investigate several aspects of news-on-demand (NoD) services on the Internet today. We analyze log files of a news server and a streaming server from Norway's largest online newspaper, Verdens Gang (VG). Our focus is on the content in a NoD environment, users' behavior with the content, and object popularity in terms of both news articles and streaming objects. The most central topics we investigate are the types of files on these servers, size distribution, access and interaction patterns, object lifetime, and whether the Zipf popularity distribution applies in this scenario.

Acknowledgements

I would like to thank my supervisors, PhD student Frank Johnsen, Prof. Dr. Thomas Plagemann and Dr. Carsten Griwodz at the Department of Informatics, University of Oslo. I would also like to thank Anders Berg at Verdens Gang (VG) for providing us with article logs, and Svetlana Boudko, Knut Holmqvist and Wolfgang Leister at Norsk Regnesentral (NR) for providing us with streaming logs.

Preface

This document is a thesis presented to the Department of Informatics, University of Oslo, in partial fulfillment of the requirements for the degree Master of Science in Informatics.

University of Oslo, Department of Informatics
April 26, 2005
Espen Nilsen

Contents

Abstract
Acknowledgements
Preface
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Methods
  1.4 Thesis Overview

2 Background
  2.1 Web news application
  2.2 Streaming news application
  2.3 List of questions
      2.3.1 Content analysis questions
      2.3.2 Article access patterns questions
      2.3.3 Stream interaction patterns questions
      2.3.4 Lifetime and popularity analysis questions
  2.4 Dataset

3 Related Work
  3.1 Content analysis
  3.2 Article access patterns
  3.3 Stream interaction patterns
  3.4 Lifetime and popularity analysis

4 Tools
  4.1 Requirements
  4.2 PostgreSQL
  4.3 R and PL/R
  4.4 C, Python and libpq
  4.5 Environment
  4.6 Setup requirements

5 Design and Implementation
  5.1 Content analysis
      5.1.1 Web content
      5.1.2 Stream content
  5.2 Lifetime and popularity analysis
  5.3 Article access pattern analysis
  5.4 Stream interaction analysis
  5.5 Database design
      5.5.1 Stream logs
      5.5.2 Web logs
  5.6 Database implementation
      5.6.1 Stream logs
      5.6.2 Web logs

6 Web Content Analysis
  6.1 Preparation
  6.2 File types and distribution
  6.3 Size and access distribution
  6.4 Internal size distribution

7 Streaming Content Analysis
  7.1 Preparation
  7.2 File types and distribution
  7.3 Size distribution
  7.4 Access distribution
  7.5 Internal size distribution

8 User Behavior
  8.1 Workload characterization
  8.2 Web news sessions
  8.3 Web news reference patterns
  8.4 Stream interaction patterns

9 Lifetime and Popularity
  9.1 Article lifetime analysis
  9.2 Article access distribution
  9.3 Article popularity
  9.4 Stream objects lifetime and popularity

10 Conclusion
  10.1 Thesis summary
  10.2 Results
       10.2.1 Tools development
       10.2.2 Content analysis
       10.2.3 Workload characterization
       10.2.4 Article access patterns
       10.2.5 Stream interaction patterns
       10.2.6 Lifetime and popularity analysis
  10.3 Future work

Bibliography

A Source Code
  A.1 create-stream-tables.c
  A.2 insert-web-logs.c
  A.3 extract-typesize.c
  A.4 vgmimetypedist.R
  A.5 vgfiletypedist.R
  A.6 sortsizes.py
  A.7 vgmediansizedist.R
  A.8 vgaccessdist.R
  A.9 createR1ktable.py
  A.10 graphscript-jpg-log.R
  A.11 nrdosls-parser.c
  A.12 nrfiletypedist.plr
  A.13 nrmediansizedist.R
  A.14 nr-map-dosls-to-objects.plr
  A.15 nr-map-objects-to-accesses.plr
  A.16 nraccessdist.R
  A.17 nrgraphscript-wmv.R
  A.18 vg-graph-workload.plr
  A.19 nr-graph-workload.plr
  A.20 vg-graph-avg-number-of-timesprday-cip-is-seen.plr
  A.21 count-avg-time-between-request-prip-prday.plr
  A.22 create-vgsession-table.plr
  A.23 create-sessions-requests-table.plr
  A.24 graph-sessionrequest-table.plr
  A.25 find-avg-time-between-requests-within-session.plr
  A.26 create-access-viewstat-table.plr
  A.27 create-object-howviewed-table.plr
  A.28 createRViewTable.py
  A.29 nrgraphviewscript.R
  A.30 nrgraphviewscript-cumulative.R
  A.31 populate-vgartinfo.plr
  A.32 graph-avg-day-distance.plr
  A.33 graph-avg-day-distance-firstdayarts.plr
  A.34 graph-cumulative-access-frequency.plr
  A.35 graph-cumulative-access-frequency-firstday.plr
  A.36 graph-pop-zipf-firstday.plr
  A.37 create-nrobjectinfo-table.plr
  A.38 nr-graph-pop-zipf.plr

List of Figures

2.1 VG main page - top
2.2 VG main page - bottom
2.3 VG video page
2.4 VG video player
2.5 VG log sample entries
2.6 NR log sample entries

6.1 VG mime type distribution
6.2 VG file type distribution
6.3 VG median size distribution
6.4 VG file type access distribution
6.5 VG JPEG size distribution
6.6 VG GIF size distribution
6.7 VG HTML size distribution
6.8 VG Javascript size distribution

7.1 NR file type distribution
7.2 NR median size distribution
7.3 NR access distribution
7.4 NR WMA size distribution
7.5 NR WMV size distribution
7.6 NR JPEG size distribution
7.7 NR ASF size distribution

8.1 VG server workload
8.2 NR server workload
8.3 VG log size comparison
8.4 VG mean number of times IP is seen per day
8.5 VG number of sessions with x number of requests
8.6 NR access view percentage
8.7 NR access view percentage distribution for partial accesses
8.8 NR cumulative access view percentage

9.1 VG article lifetime of all articles
9.2 VG article lifetime of articles seen first day of logging
9.3 VG article cumulative access distribution
9.4 VG access distribution of articles seen first day of logging
9.5 VG likeliness of becoming popular compared to Zipf (1 week, top 10 percent of the articles)
9.6 VG article popularity vs. Zipf
9.7 VG top 150 article popularity vs. Zipf
9.8 NR streaming objects lifetime
9.9 NR objects Zipf comparison

List of Tables

2.1 VG original log format
2.2 NR original log format

5.1 NR directory listing example
5.2 NR log object attributes
5.3 NR log client attributes
5.4 NR log access attributes
5.5 VG articles requests table

6.1 VG file type distribution

7.1 NR directory listing example with type field
7.2 NR server list table

8.1 VG average request timing table
8.2 VG session table
8.3 NR access statistics table
8.4 NR view statistics table

9.1 VG article information table

10.1 NR median sizes
10.2 NR median sizes

Chapter 1

Introduction

As the popularity of the World Wide Web continues to increase, we have also seen an increasing popularity of multimedia objects on the Internet. We see that people are moving more and more towards a multimedia-oriented way of communication instead of the traditional text-oriented way. This new area of usage of the web brings new content to the Internet in the form of media streams and dynamic content, which are different from static HTML pages and images.
The difference in characteristics, as well as the impact on the network, of these new methods of communication are issues that need to be explored. One area in which not much research has been conducted is news on demand (NoD). We anticipate that NoD will become an important part of the Internet, and as such it needs to be investigated in more detail.

In this chapter, we first discuss the motivation behind our work and introduce some concepts and ideas the reader should know about. Then we talk about our goals, as well as the methods we use. Finally, we present the reader with an outline of the rest of this thesis.

1.1 Motivation

The INSTANCE II (Intermediate Storage Node Concept) project* of the Distributed Multimedia Research Group (DMMS) at the University of Oslo is aimed at developing new solutions for a next-generation multimedia distribution infrastructure which minimizes response time, resource requirements and cost. Research is being conducted on network infrastructure, caching and operating system kernel enhancements. For the network infrastructure, we are using an overlay network in a multi-ISP hosting content distribution network (CDN).

* This thesis has been performed in the context of the INSTANCE II project, which is funded by the Norwegian Research Council's IKT-2010 Program, Contract No. 147426/431.

A CDN provides a method to improve Internet quality and user-perceived quality through replication of content from the origin server to servers close to the clients requesting the content [31]. A multi-ISP CDN is a network where a standalone company has servers located in the backbone of many ISPs, which allows for large networks on which one can provide content on a global scale. A hosting CDN means that both origin servers and proxies are part of the same infrastructure, which is favorable since it allows for retrieval and coordination of information from all points of the network infrastructure [32]. On top of the fixed infrastructure of the backbone servers, we use an overlay network, which is a distribution tree overlaid on the existing IP network. In addition to allowing for easier configuration of the network infrastructure, overlay networks also allow for configuration changes in response to changes in the desired quality of service (QoS) classes. By QoS we are not only talking about traditional metrics such as delays and error rates, but rather all characteristics of a distributed multimedia service defining requirements for a multimedia application. The overlay network in INSTANCE II has been constructed to automatically reconfigure itself depending on the QoS parameters specified and the properties of the underlying network [26].

An important part of an efficient distribution system is the use of caching and prefetching of popular objects. We are researching a new caching system, structured partial caching [28], which is tailored to caching NoD objects in a network infrastructure as described above. One important aspect of NoD is the use of inter-related files of different kinds, continuous and discrete media. This provides users with more advanced interactivity than that of video on demand (VoD), and in addition, clients themselves can have varying capabilities, ranging from PCs to PDAs and mobile phones. Another aspect of NoD data arising from the use of mixed media is structure between objects. There are two types of structure, internal and external.
Internal structure is structure inherent in a media type, such as layered video or progressive JPEG pictures. External structure defines the relationship between different elements of possibly different media types, such as their layout and composition. Our partial caching algorithm assumes that structure is defined in a presentation plan. These are documents residing on the servers describing the content of media elements and their composition, and from that, the ways in which the different elements can be divided and served.

Unlike for VoD, there has not been much research specifically targeting NoD. We assume NoD will become increasingly popular in the future and as such needs to be researched further. This assumption is also supported by [17], who found in their analysis of Internet reference behavior in Korea that the majority of requests were for news sites. In addition, in Norway today we see that online newspapers are gaining more and more users and that paper editions are losing ground [21]. Of the 10 most read newspapers in Norway, three are online. This trend reflects how the Internet is maturing and how people are growing accustomed to using and interacting with online content. This further emphasizes the importance of researching issues related to NoD, Media on Demand, and Content Distribution Networks.

1.2 Goals

This thesis is an analysis of both content and user behavior in a NoD environment, in order to aid our further understanding of a NoD scenario. Our goals can be divided into four main areas of focus: content analysis, article access pattern analysis, stream interaction analysis, and lifetime and popularity analysis.

First, we research what kind of content exists on a news server, what types of files we can find, and the distribution among them. We also compare the number of accesses between the specific types to see if some are used more than others. In addition to investigating what types of files exist and which are most accessed, we also look at the size distribution both between and within the specific types and compare this to what has been found for regular web content.

Next, we investigate user behavior in a NoD environment. We start with a short workload analysis to investigate roughly how many users connect to our servers each day, and how many requests are served each day. Then we continue with more specific analysis for each of the logs. For web news, this includes access patterns such as the number of articles users request while connected to the server, i.e. whether users request several articles in one session or usually only read one specific article. If the usual pattern is to request several articles, we also want to see if we can find some relationship between requests in these sessions, in addition to investigating the time between those requests. For streaming news, we look at how users interact with the objects in terms of how many are accessed in full and how many are only accessed partially. For those that are only accessed partially, we also investigate the access percentage distribution and see if we can deduce any patterns from that.

In the end, we also explore the lifetime and popularity of both web news and streaming news and compare this to what others have found in their research. By lifetime we mean the number of days between the first and last access to an object. For popularity we investigate the distribution of accesses to specific objects.
From this we can see if there is a small concentration of documents which account for most of the requests, commonly referred to as hot documents. We also look at how the access distribution evolves over the lifetime of an article. In addition, the Zipf distribution is a widely used method to model popularity, so we also check if the Zipf popularity distribution can be applied to our dataset. On this topic, we also compare web news lifetime and popularity with those of streaming news, to see if there are similarities between the two types of news representations.

1.3 Methods

We perform a theoretical analysis of content and user characteristics in the more general area of the Internet as a whole, as well as in the more specialized area of NoD. This is done through a literature study to familiarize ourselves with the topics. Knowledge of both areas is important: in addition to comparing our results to what others have found for datasets similar to ours, we can also compare the characteristics, content and usage of NoD to the web in general.

Further, we acquired logs for both web news and streaming news from the largest online newspaper in Norway, Verdens Gang (VG) [12]. We use systems design and implementation to create our own tools to prepare this data for further analysis. Then, we use statistical analysis to research our dataset, as well as graphical representation where this is warranted.

1.4 Thesis Overview

In the remainder of this thesis, we introduce some background information on the data we are analyzing and the applications in which the data is made available to the users, and present a list of the most important questions we are researching, in Chapter 2. Chapter 3 discusses some related work, what others have done and how it relates to our project. Chapter 4 outlines our work environment, the tools we have used and developed and the reasons they were chosen, and Chapter 5 explains the design and implementation of our tools in detail. Chapters 6 through 9 present both analysis methods and results from researching the questions outlined in Chapter 2. Chapter 10 summarizes the results and concludes our work, ending with a presentation of ideas for further research on the different topics in this thesis.

Chapter 2

Background

In this chapter, we first introduce the reader to the applications which present the data we are analyzing to the users. We show what information we can expect to find, through both the available content and the way users can interact with the applications. Then, we introduce a list of questions we are researching, formulated from the goals in the previous chapter. This chapter concludes with a presentation of our dataset and the formats in which it has been acquired.

2.1 Web news application

We received log files from a web news server and a streaming news server from Norway's largest online newspaper, VG. To investigate what we can expect to learn from the log analysis, we first look at the applications in which the content is made available to users. This tells us how objects are presented to the clients and can aid our understanding of how they interact with the objects. In this section we present the web news application, and in the next section we look at the streaming news application.

Figure 2.1 shows a screen shot of the top of the main page on VG's web server. We see that articles are presented to the user in two columns, except for the first article at the top of the page.
Figure 2.1: VG main page - top

As with paper newspapers, the article believed to be the most important is given special notice. In a paper newspaper, this article usually occupies a significant part of the front page. In our online environment, this article spans across both columns of articles, and it is also larger both textually and graphically. This way, the most recent or most important piece of news is presented clearly to the user. From this, there is reason to expect that most requests will be for this article, and that the newest articles are the most popular.

We also see a group of different categories on the left side, which the clients can use to access news on a specific topic. By collecting articles in different categories and making these categories easily available from the main page, the users have a lot of articles available within just one to two clicks from the main page. Therefore, even though news articles are created, published and become old quite fast (compared to e.g. movies), older articles can still be readily available to the users. In addition, at the bottom of the main page we again find references to the newest articles in specific categories, as shown in Figure 2.2. This tells us that even though news is updated fast and articles are pushed out of the main columns, with pictures accompanying headlines to attract the user's attention, articles can still be available from the main page for some time.

Figure 2.2: VG main page - bottom

To conclude the discussion about availability of articles, we also note that there is a search function conveniently placed in the upper right corner of the main page, where clients can search for old articles that are in the archive.

Further, we see that there are a lot of images present on the main page. There are photos connected to almost every article heading in the main article columns. In addition to photos connected to the headlines of articles, graphics are used extensively as a layout mechanism to distinguish different parts of the page. However, there are no really large images on the main page. The largest image is the photograph connected to the top news story, but even this is not very large compared to regular photos taken with ordinary digital cameras. There are also a lot of commercial elements on the main page, in the form of images, banner ads and flash objects.

Finally, while browsing through a couple of pages on this server, we find that the layout of the pages is very similar, and a number of elements are reused. This means that caching can effectively reduce the transfer of objects between subsequent requests to new pages.

From investigating the web news application, we have found that there are a number of different elements that make up a news web page. News pages seem to be a lot more complex than most regular web pages, in terms of both the amount of information they provide and, as a result, the composition of the HTML pages needed to organize all this information in a user-friendly way. For this reason, it is important to analyze the different elements specifically for news sites.

2.2 Streaming news application

In this section, we investigate the applications in which streaming news objects are presented to the users. There are several different ways users can access these objects.
First, there is a link to a movie page in the categories section on the left side of the main page. This link takes the user to a page which looks much like the main page, only listing video news clips instead of articles, as shown in Figure 2.3.

Figure 2.3: VG video page

From this page, we learn a couple of things. First, it is specified in the links to each individual news clip that the video files are Windows Media types. Reading further, it is explicitly stated that Windows Media Player is needed to watch their videos, preferably Windows Media Player 9. The server is a Microsoft server, and in our initial conversation with VG they mentioned that they had tried to get their pages and players to cooperate with browsers other than Internet Explorer. They gave up on this because the majority of Internet Explorer users was so overwhelming that they did not feel any responsibility to try to accommodate the small percentage of other clients. Given this decision, it is not surprising that we see a trend towards Microsoft formats in their services and content. On this page they also inform the user that Javascript has to be enabled in the browser, which means that we should also find Javascript files in our content analysis.

Another way users can access videos is by clicking on a small camera icon sometimes accompanying the ingress text of news elements presented on the main news web page. By clicking either this or one of the links on the video page described above, the user is presented with a video player showing the selected news clip, as Figure 2.4 shows.

Figure 2.4: VG video player

By studying the video player more closely, we see that on the right there is a list of more videos that can be accessed. One could imagine that once in the video player environment, users request more than just the one video they initially wanted to see. Further, we see that here too we find categories which contain videos on the same topic. Also, the list of videos contains dates of when each video was created, and from this we see that the streaming news videos in our dataset are not created as fast as the web news. There is usually not more than a couple of days between the creation of new objects, so in this respect they are more like movies.

Next, we look closer at the actual video player and note that it presents the user with a set of controls exactly like a VCR. This means that users can play, pause, stop and move back and forth in a video stream. This is an important observation, since it can affect how users watch a news clip: they are not limited to watching it from beginning to end, and there is nothing that dictates that this is the normal behavior either. We also note that there is no control item to choose among a set of transfer rates or the quality wanted; this is computed by the player itself.

2.3 List of questions

After having explored the applications in which our data material is presented to the users, we now present a list of questions we are investigating in this thesis. They are in effect the goals section from Chapter 1 in question form. We will refer back to these questions throughout the thesis.
2.3.1 Content analysis questions

Q: File types existing on server
Q: Distribution among file types
Q: Access distribution among file types
Q: Size distribution between file types
Q: Size distribution within file types

2.3.2 Article access patterns questions

Q: Are there sessions, do users select several articles in sessions
Q: If there are sessions, time between requests within sessions
Q: If there are sessions, reference patterns between requests

2.3.3 Stream interaction patterns questions

Q: How are streaming objects watched, beginning to end or partial
Q: If partially, how much is watched

2.3.4 Lifetime and popularity analysis questions

Q: Lifetime in terms of day distance between the first and last access
Q: Time dependent popularity: concentration of references (hot documents) and access distribution over a period
Q: Time independent popularity: Zipf
Q: Compare lifetime and popularity of web news, streaming news and VoD movies

In order to answer these questions, we need to analyze the logs we received from both the web news and streaming news servers.

2.4 Dataset

Now that we know what we want to find out, we continue with a presentation of the data material. The web news server logs we received directly from VG; they record accesses between 2004.12.07 09:00 and 2004.12.27 15:00. Each log contains half an hour of material, for a total of 968 files. Compressed using gzip, the total size of these logs amounts to 86GB. The log format is listed in Table 2.1, and Figure 2.5 shows some example entries from the logs.

Column             Type       Description
date               STRING32   date
time               STRING32   time stamp
c-ip               IPADDR     client ip address
s-ip               IPADDR     server ip address
s-port             INT32      port number
cs-method          STRING32   GET, HEAD or POST
cs-host            STRING128  host name, e.g. www.vg.no
cs-uri-stem        STRING512  e.g. /annonser/..., /bilder/...
cs-uri-query       STRING512  blank or artid=xxxx
cs(Cookie)         STRING512  cookie information
cs(User-Agent)     STRING512  e.g. Mozilla/4.0...
cs(via)            STRING128  host name of proxy if used
cs(forwarded-for)  STRING128  ip addr of client a proxy is forwarding for
cs(Referer)        STRING512  entire URL of referrer
cs(ref-host)       STRING128  host name of referrer
cs(ref-uri)        STRING512  URI of referrer, e.g. /pub/vgart.hbs
cs(ref-args)       STRING512  arguments from referrer, e.g. artid=xxx
cs(post-len)       INT32      usually blank
time-taken         FLOAT      time to complete request
sc-status          INT32      HTTP status code
sc(Content-Type)   STRING128  content type, e.g. image/gif
sc(Set-Cookie)     STRING512  cookie information
sc-bytes           INT32      bytes sent from server to client

Table 2.1: VG original log format

Figure 2.5: VG log sample entries

The streaming logs we acquired from Norsk Regnesentral (NR) [6], which administered VG's stream server before 2004. The logs contain accesses from January 2002 to November 2003, for a total of 769 log files. Compressed with gzip, the total size of these logs is 530MB. The log format is listed in Table 2.2, and Figure 2.6 shows some example entries from the logs.

Figure 2.6: NR log sample entries
Fields                   Types      Description
c-ip                     IPADDR     client ip address
date                     STRING32   date of request
time                     STRING32   time stamp
c-dns                    STRING32   dns address
cs-uri-stem              STRING128  name of object with complete URL
c-starttime              INT32      client-specified start position in stream, byte-wise; majority at 0
x-duration               INT32      duration of stream, rarely used
c-rate                   INT32      client rate: -5000, -5, 0, 1, 2, 5
c-status                 INT32      status code: 200, 400, 404, 1000
c-playerid               STRING128  player id nr from vendor
c-playerversion          STRING128  player version nr
c-playerlanguage         STRING128  player language, e.g. noNO
cs(User-Agent)           STRING128  e.g. mozilla/4.0...
cs(Referer)              STRING512  URL of referer, e.g. ...vg.no/video/mp-pop.hbs?id=xxx
c-hostexe                STRING128  client executable file, e.g. iexplore.exe
c-hostexever             STRING128  client version nr of hostexe
c-os                     STRING128  client operating system
c-osver                  STRING128  client os version
c-cpu                    STRING56   client cpu type, e.g. 486, Pentium
filelength               INT32      not used
filesize                 INT32      rarely used
avgbandwidth             INT32      average bandwidth achieved, 0-236628
protocol                 STRING32   http or mms
transport                STRING32   transport protocol, TCP or UDP
audiocodec               STRING128  audio codec, e.g. WMA
videocodec               STRING128  video codec, e.g. WMV
channelURL               STRING128  not used
sc-bytes                 FLOAT      bytes sent from server to client
c-bytes                  FLOAT      bytes sent from client to server
s-pkts-sent              INT32      nr. of packets sent by server
c-pkts-received          INT32      nr. of packets received by client
c-pkts-lost-client       INT32      nr. of packets lost on client
c-pkts-lost-net          INT32      nr. of packets lost in the net
c-pkts-lost-cont-net     INT32      rarely used
c-resendreqs             INT32      nr. of resend requests from client
c-pkts-recovered-ECC     INT32      nr. of packets recovered due to ecc
c-pkts-recovered-resent  INT32      nr. of packets recovered by resending
c-buffercount            INT32      client buffercount
c-totalbuffertime        INT32      client total buffer time
quality                  INT32      quality descriptor in percent, 0-100
s-ip                     IPADDR     server ip address
s-dns                    STRING32   server dns address
s-totalclients           INT32      total clients currently connected, always 1
s-cpu-util               INT32      cpu utilization

Table 2.2: NR original log format

Chapter 3

Related Work

In this chapter, we introduce the reader to other work related to our different types of analysis.

3.1 Content analysis

There is a lot of previous research on regular web content and workload characterization. Many of these papers are quite old, and none of them studies news sites exclusively. Their results are, however, important to us, since we need to know about general web characteristics in order to find the special characteristics of NoD content.

Woodruff et al. analyzed several different aspects of web documents from data collected by the Inktomi web crawler [37]. One of their studies was of file types used in child URLs, in which they found over 40 different file types with different file extensions, which they grouped together into five different categories. By counting the total number of occurrences of each file type, they found that HTML, JPEG, GIF and XBM were by far the most used, with HTML leading, followed by GIF files. They also did a size analysis, but only for HTML documents and with all markup removed.

Jeff Sedayao performed an analysis of the size and frequency of objects in log files obtained from a proxy server at Intel [35]. His results are much the same as those of [37]: HTML, JPEG, GIF and XBM are still the most frequently accessed file types, only in his dataset GIF files are more accessed than HTML.
He also includes information on the average size and standard deviation of the file types. Among the top four most accessed types, JPEG files were much larger than the others, followed by GIF files. When it comes to size distribution, he found that there is tremendous variation in the size of image files.

Bahn et al. present a characterization study of web references focused on content analysis, by studying log files from a proxy server at the Korean Education Network (KREN) [17]. In their first study, they show that references are biased towards some hot documents: 10 percent of the documents are responsible for 70 percent of the references. Further, they present an analysis of the distribution of URL types, where they found that 75.2 percent of the total references are to image files such as JPEG and GIF, and about 14 percent of references are to HTML files.

Arlitt et al. did a workload characterization study of six different log files collected from different types of servers [16]. They were searching for invariants across all six data sets and found some that apply to our study. First, they found that HTML and image files account for 90 to 100 percent of the total requests. In addition, they found that 10 percent of the files accessed accounted for 90 percent of the requests. Since they are analyzing data from six different sources, this implies that the concentration of hot documents is even greater than what [17] concluded from analyzing only one source.

While all of the above analyze general content and workload characteristics of web servers and proxies, Acharya et al. performed an experiment to measure how video data specifically is used on the web [14]. Their analysis is much more detailed than ours will be, including analysis of frame rate, duration and aspect ratio of individual movie types. Their video objects are of the types MPEG, AVI and QuickTime, and as we found in Chapter 2, ours are mostly WMV. Therefore, the distribution between these types is insignificant to our analysis, but the size distribution is still interesting. They found that most movies are small, 2MB or less, with the median size being 1.1MB. They also show that most movies are brief: 90 percent lasted 45 seconds or less. This is similar to what we expect to find for streaming news video clips.

3.2 Article access patterns

Catledge et al. researched user navigation strategies on the web by capturing client-side user events from a doctored browser [20]. In this study, they defined user sessions to be within 1.5 standard deviations of the mean time between each event, for all events across users. One of their studies shows that within a particular site, users tend to operate in a small area. In addition, they found that users accessed on average 10 pages per server, and that information must be accessible within two to three jumps from the initial page.
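Stated as a formula (our rendering of the rule in [20], not notation taken from that paper), two consecutive events from the same user belong to the same session exactly when the gap between them stays below a threshold derived from the observed inter-event times:

    % Session boundary rule from [20]: \bar{t} is the mean and
    % \sigma_t the standard deviation of inter-event times; a gap
    % \Delta t starts a new session when
    \Delta t > \bar{t} + 1.5\,\sigma_t

The session definition used for our own analysis in Section 5.3 follows the same pattern, but with 1 standard deviation instead of 1.5.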
As far as we know, not many studies have been conducted specifically on news servers besides [29]. In their paper, they make an undocumented but intuitive claim that users request more than one document while connected to a news server. They use this claim to create a popularity algorithm for groups of articles, called Multi-selection Zipf, which they compare to the Zipf popularity model. See Section 3.4 for more on the Zipf distribution.

3.3 Stream interaction patterns

There has not been much research on how users interact specifically with streaming news objects. Most previous work has focused on what types of video data exist on the web, their characteristics and their access frequency [17, 35]. [14] deals mostly with the video data itself, in terms of what types of files there are and the individual properties of each file type, such as size, frame rate, duration and average bit rate. This is useful when it comes to modeling content on the web, but in our study we also want to get a sense of how users interact with the data. We want to explore how stream objects are viewed. Are they usually viewed from beginning to end? If not, how many are only seen partially, and how much is usually viewed before stopping? Knowing this would be helpful for many caching mechanisms, for example prefix caching [25], where the prefix can be decided based on knowledge of how much of an object is usually accessed.

3.4 Lifetime and popularity analysis

Zipf's law is a power law relating frequency of use to popularity rank [33]. It originates from the Harvard linguist George Kingsley Zipf, who first noticed that the distribution of words in a text followed a special statistical pattern. It states that the frequency of an object is inversely proportional to its rank, i.e. proportional to 1, 1/2, 1/3, etc. If one ranks the words in a text by popularity (denoted i) and denotes their frequency of use by P, then

    P = 1 / i^α

The pure Zipf distribution is parameterless, i.e. α equals 1, but the name is commonly also used for distributions with α close to unity instead of exactly 1. The Zipf distribution has since been applied to many areas in the social sciences, one of them being VoD. Many have also modeled regular web page popularity after Zipf and found their popularity distributions to be Zipf-like, with different values of α.

Cunha et al. as well as Barford et al. have performed reference behavior studies of client traces by modifying a browser to record all user accesses [18, 23]. In [23], Zipf was applied with α = 0.986, which is very close to the pure Zipf distribution. [18] shows studies from two data sets, one from 1995 and one from 1998. They only show the request distribution compared to Zipf for the 1995 dataset, in which they found α to be 0.96. However, they also compare transfers to Zipf, in which case α drops to 0.83 in 1995, and in 1998 it is 0.65. The reason for the difference between requests and transfers in 1995 is that transfers only show the set of cache misses. From this we see that from 1995 to 1998, fewer transfers had to be made, suggesting an improvement in caching techniques.
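To make the role of α concrete, the following worked example (our illustration, not taken from the cited studies) normalizes the Zipf-like law into a probability distribution over N objects and compares a pure Zipf exponent with a flatter one:

    % Normalized Zipf-like popularity: probability that a request
    % hits the object of rank i among N objects.
    p_i = \frac{1/i^{\alpha}}{\sum_{j=1}^{N} 1/j^{\alpha}}, \qquad i = 1, \dots, N

    % Worked example with N = 3:
    %   \alpha = 1.00:  p \approx (0.545,\ 0.273,\ 0.182)
    %   \alpha = 0.75:  p \approx (0.492,\ 0.292,\ 0.216)

Under this normalization, the α values cited above translate directly into how strongly requests concentrate on the top-ranked documents: with α = 0.65 the distribution is markedly flatter than with α = 0.986.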
However, our dataset, albeit from a web server, is quite different from both [19] and [15]. We do not look at general web pages but news articles specifically. There has not been much study of the lifetime of web pages, but for news pages this is important, since news tend to get old reasonably fast. Kim et al. are in [29] analyzing newspaper traces and as such their dataset is of the same type as ours and their results are very interesting to compare to our findings. They found that recency of articles define their popularity and that the most popular articles do not last more than three days. In addition they also compare article popularity to Zipf and without giving exact numbers conclude that it differs from Zipf. In these studies however, they compare the mean access popularity of articles for a month to Zipf. Zipf is a time independent model, and calculating average accesses over a whole month to data not available the whole period cannot be compared to Zipf. All this tells us is the probability an article has to live long, not its popularity. 34 Chapter 4 Tools Since we are analyzing both web and streaming news content, as well as user behavior and interaction with the content we need tools that can perform analysis over a wide range of areas. In this chapter, we outline details on what our requirements for tools are, what tools we have chosen to use and the environment in which we use these tools. 4.1 Requirements There are numerous excellent web log analysis tools available on the Internet today, such as The Webalizer [13] and AWStats [1]. These tools are very good for analysis of content and visiting statistics such as what file types there are, number of visitors, most accessed pages etc. However, we are analyzing not only content but users interaction with it, and most importantly their interactions with single objects. Only stating which pages are most popular a specific date does not work. We want to see for how long single objects are popular. We also want to see how they relate to each other, e.g. if two pages are usually requested right after each other. In short, not only content analysis of a server, but popularity models and users interaction which require analysis of specific objects and clients. For that reason, we have chosen to develop our own tools in a combination of languages. Since our log traces contains a lot of information, we chose to use a database management system (DBMS) for analyzing them, as these are specifically designed to handle and query large amounts of data. Applications and scripts for different tasks like data handling and creation of graphs have been developed in either C, Python, R, PL/R or plpgsql. 35 4.2. PostgreSQL Chapter 4. Tools 4.2 PostgreSQL We decided to use PostgreSQL as our DBMS for several reasons. First, PostgreSQL is a public domain and completely free. Second, it is an established system with a large user base and it also has very good documentation. Third, it has a very good integration with the C programming language both through the libpq library and with its extensible modules features. We use PostgreSQL version 7.4.5. 4.3 R and PL/R R is a language and environment for statistical computing and graphics [11]. Since much of the work we are doing is statistical analysis of content and user actions, this was a natural choice for us. Another important reason for choosing this language is its graphing capabilities. We use version R-2.0.0. 
Using the R language and the PL/R module in conjunction with PostgreSQL, we get a very elegant and easy way of extracting and analyzing statistical data and creating graphical representations of the results.

4.4 C, Python and libpq

There are two main reasons for using C as the main language for our applications. The first is that it is by far the most comfortable language for us; it is what we use the most. The other is the libpq library, which provides a powerful and easy-to-use API for accessing the PostgreSQL server. However, C is not the best language for high-level text operations, which is why we choose to use Python for some tasks [10].

4.5 Environment

Our work environment was a server at the University of Oslo. On this server we set up a DBMS that we used for inserting and querying the log files. The environment we set up is a PostgreSQL [9] server extended with the PL/R module [8] and the R programming language [11], together with a series of our own C applications using the libpq [5] library of PostgreSQL, as well as numerous Python scripts.

4.6 Setup requirements

There are some requirements for setting up the combination of PostgreSQL, R and PL/R, related to the PL/R module and its integration with the database server. In order to get PL/R installed and integrated, we had to compile PostgreSQL from source instead of installing a precompiled package. This is usually a good idea anyway, but it is worth mentioning for everyone else who wants to try this. The reason we had to compile it ourselves is that the headers are needed in order to compile the PL/R language module. This is also the case for the R language. In addition, most precompiled versions of R are compiled without the --enable-R-shlib option, which enables the libR shared object library. libR is also needed in order to compile PL/R, so this is another reason we had to compile R from scratch.

The installation documentation at [8] gives complete instructions on how to get PL/R compiled and installed in the database, but with the version we use, we encountered a small problem: the r-libdir variable in the PL/R Makefile actually pointed to the R bin directory. After changing r-libdir = RHOME/bin to r-libdir = RHOME/lib in the Makefile, following the directions on the PL/R install page worked without problems.

Chapter 5

Design and Implementation

In this chapter, we analyze how we can answer our questions from Chapter 2 given the data we have available. For convenience, we split this discussion up into our four main areas of focus. In the last two sections, we first present the design of the database tables into which we import the logs, and then we discuss our options for performing this task, as well as the one we have chosen.

5.1 Content analysis

In order to answer our questions related to content, we need a list of all the objects on the servers, as well as their types and sizes. From such a list we can extract information on what file types exist, the distribution between the different file types, and the size distribution both between and within each type. We also need a record of requests in order to investigate the access distribution between the types. This section first presents how to acquire this information for the web news, and then the options for the streaming news.

5.1.1 Web content

We do not have a list of all the files on the VG web server, so we have to create a parser which extracts this information from the logs. The parser has to record each new file type it sees, each new object it finds of the specific file types, as well as each new object's size. To investigate access distribution, we can use the same type of parser, only not caring whether or not the log entry it is examining relates to an object previously seen. That is, this parser has to record information for all entries in the log, while the first parser only needs to record information for new objects. To identify each new file type, we can use the sc(Content-Type) field, which tells us the MIME type of the file. To identify each new object, we can look at the uri-stem field, as long as the object is not an article. When the object is an article, we have to combine uri-stem with the uri-query field to distinguish articles from each other. To find the size of each object, we can look at the sc-bytes field, which records the number of bytes sent from the server to the client. This field is not always filled in, so we skip the entries that do not have it set.
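The thesis implements this as a C parser over the raw logs (cf. extract-typesize.c in Appendix A.3), but the logic can be sketched in SQL. Assuming the relevant fields have been loaded into a table weblog(ctype, uristem, uriquery, bytes) — table and column names are ours, for illustration only — a query in this spirit yields per-type object counts and sizes:

    -- Sketch: one row per distinct object (uri-stem, plus uri-query for
    -- articles), keeping its MIME type and its largest observed transfer
    -- size as an approximation of the full object size.
    SELECT ctype,
           count(*)      AS num_objects,
           avg(obj_size) AS avg_size
    FROM (SELECT ctype,
                 uristem || coalesce(uriquery, '') AS obj,
                 max(bytes)                        AS obj_size
          FROM weblog
          WHERE bytes IS NOT NULL
          GROUP BY ctype, obj) AS objects
    GROUP BY ctype
    ORDER BY num_objects DESC;

Dropping the inner deduplication step gives the access-distribution variant described above, which counts every request instead of every object.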
5.1.2 Stream content

For the streaming news, we were able to get a recursive directory listing of the files on the streaming server from NR. Table 5.1 shows the format and an example of what this list looks like.

Date        Time   Size       Name
10.07.2002  10:50  3.700.135  test.wmv

Table 5.1: NR directory listing example

In addition to the information present in this list, we also need to know the type of each file. This can be found by looking at the file extension, so we can create a parser which adds a new column recording the file type. By inserting this information into a database table, we can query both the file type distribution and the size distribution between and within each type. In order to get a count of accesses to each type, we have to match the types we have found in the server list with objects in the streaming logs. These logs are also inserted into a database, so this becomes a matter of matching objects in two different database tables and updating the type field wherever we find matching objects.

5.2 Lifetime and popularity analysis

We want to look at both the lifetime and the popularity of objects. By lifetime, as stated earlier, we mean the distance in time between the first and last access to an object. There is a date field for each access in both the web and streaming news logs which we can use to investigate this. We also want to investigate the concentration of references, in order to find out if there is a small group of articles that accounts for most of the requests. In addition, we want to look at the popularity of objects, both in terms of how the access distribution changes over a period after the first access seen, and in terms of the Zipf distribution. From these questions we see that we need a way to distinguish specific articles and streaming objects. As noted earlier, we can find unique articles in the web logs by looking at two distinct attributes of the logs, uri-stem and uri-query. The uri-stem field is always /pub/vgart.hbs for article requests, and the uri-query distinguishes between specific articles, specified on the form artid=101011.
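As a sketch of how lifetime can then be computed once the article requests are in the database (the table name vgrequests and the exact query are our illustration; the actual analysis functions are listed in Appendix A):

    -- Lifetime per article: number of days between the first and the
    -- last access, grouping on the article id in the uri-query field.
    -- Subtracting two PostgreSQL dates yields an integer day count.
    SELECT uriquery              AS article,
           min(date)             AS first_access,
           max(date)             AS last_access,
           max(date) - min(date) AS lifetime_days
    FROM vgrequests
    WHERE uristem = '/pub/vgart.hbs'
    GROUP BY uriquery
    ORDER BY lifetime_days DESC;

Replacing the min/max pair with count(*) and ordering by it turns the same grouping into the popularity ranking needed for the Zipf comparison.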
To find each stream object, we can simply use the uri-stem field of the streaming logs. When we have identified each object, we can examine all the entries of the logs and record the first and last request we see. We can record the number of requests to each object, both throughout the entire period of the log and within a limited time period based on the date field. From this we can learn about the concentration of references, as well as about the Zipf distribution, by ranking objects by number of requests.

5.3 Article access pattern analysis

The first question we study is whether users select several articles in sessions. This means that we need a way to identify single clients, and we also need to define what a session is. The only entry in the web logs that tells us anything about clients is the IP address. There is some uncertainty in using IPs to uniquely identify clients: IPs can be dynamically assigned, for example through DHCP, or they can represent, for example, a NAT or a proxy server. This means that a single user can have multiple IPs, and that several users can be represented by the same IP. There is no way to distinguish IPs that represent single users from other IPs in our logs, but since we also define sessions, we increase the chance of identifying single clients. There is a greater possibility that requests from one IP come from a single user when the time period is short than when it is long.

In order to define sessions, we need to look at the time field in the logs. We have chosen to define sessions to be within 1 standard deviation of the mean time distance between each access to an article per day from each IP. This is in accordance with what [20] did in their study of client-side user events. When we have identified clients and sessions, we can look at specific clients' access patterns within sessions. In order to find out if specific groups of articles exist, we need to compare the uri-query field for all requests within a session with the requests in all the other sessions. Also, each log entry has a time stamp we can examine to find the time between each new request from a single client within a session. It is important to understand that the term client here only refers to an IP address making a number of requests within a specified time period. A client cannot be traced further to track requests from the same IP in another session.

5.4 Stream interaction analysis

The first question we study on stream interaction is the distribution between partial and full accesses of videos. To find out how many videos are watched in full and how many only partially, there is an sc-bytes field which records the number of bytes sent from the server to the client. We can match this field against the size field of the objects found in the directory listing of the streaming server we got from NR. If we find that many videos are not seen until the end, we also want to see how much of the videos is usually viewed. This can be done by traversing all requests and recording the percentage viewed.
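A sketch of this comparison in SQL, using the object and access tables defined in Section 5.5 below (column names follow Tables 5.2 and 5.4; the query itself is our illustration):

    -- Fraction of each object transferred per access: bytes sent to the
    -- client divided by the object size from the NR directory listing.
    -- Fractions at (or above) 1.0 count as full views, the rest as partial.
    SELECT a.objectid,
           a.sc_bytes / o.size::float8 AS fraction_viewed
    FROM accesses a, objects o
    WHERE a.objectid = o.objectid
      AND o.size > 0;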
5.5 Database design

In this section we discuss the initial database tables into which we insert the logs. The next section discusses how we implemented the tools to import the logs into the database design presented here.

The reader should keep in mind that the design presented here is just an initial design for the database, drawing on the high-level requirements from the discussion above. Many other tables have been created from the tables discussed here to investigate the specific questions we had. These other tables will be presented in the subsequent chapters near the discussion of the results they were designed to produce. This way, the reader gets an overview of what we are analyzing, the way we analyze it, and the result of the analysis in the same place in the text.

5.5.1 Stream logs

From the discussion of lifetime and popularity above, we find that we need to be able to identify specific objects. We also found that the only attribute in the log which can be used for this is the uri-stem attribute. But from the interaction analysis we also found that we need to know the size of each object, so we have to match the size field of the server list from NR with the objects in the logs. Also, for the access distribution analysis of streaming content, we need to identify the type of all objects. We conclude that there are three pieces of information we need to know about streaming objects: name, size and type. Table 5.2 shows the attributes connected to objects in the streaming logs.

Column     Type                     Description
objectid   integer                  assigned unique id number
name       character varying(128)   name of object, parsed out of URI
size       int                      size of object
type       character varying(56)    mime type field

Table 5.2: NR log object attributes

Since we are identifying objects, we also looked at the log format to see if we could identify single clients. From this investigation we found a number of attributes that can be used to identify clients. They are listed in Table 5.3.

Column      Type                     Description
clientid    integer                  assigned unique id number
cip         inet                     client IP address
cdns        character varying(128)   DNS address
playerid    character varying(128)   player id nr from vendor
playerver   character varying(128)   player version nr
playerlang  character varying(128)   e.g. noNO
useragent   character varying(128)   e.g. mozilla/4.0...
hostexe     character varying(128)   executable file, e.g. iexplore.exe
hostexever  character varying(128)   version nr of hostexe
os          character varying(128)   operating system
osver       character varying(128)   os version
cpu         character varying(56)    cpu type, e.g. 486, Pentium

Table 5.3: NR log client attributes

Having identified and used many of the attributes of the streaming log for distinguishing objects and clients, it becomes apparent that we can create our own database tables to hold the list of objects and the list of clients. The rest of the information can be collected in an access table. In addition, the access distribution analysis requires that each object is identified by type as well, so we incorporate this field here too. The attributes of the access table are shown in Table 5.4.

Column                Type           Description
clientid              integer        reference to client table
objectid              integer        reference to object table
date                  date           date of request
time                  time           time stamp
start time            integer        majority at 0
crate                 integer        client rate
referer               varchar(512)   URI of referer
avgbw                 integer        average bandwidth
protocol              varchar(56)    http or mms
transport             varchar(12)    transport protocol, TCP or UDP
aucodec               varchar(128)   audio codec, e.g. WMA
vocodec               varchar(128)   video codec, e.g. WMV
quality               integer        quality descriptor in percent, 0-100
sc bytes              bigint         bytes sent from server to client
c bytes               bigint         bytes sent from client
s pkts sent           integer        nr. of packets sent by server
c pkts recv           integer        nr. of packets received by client
c pkts lost client    integer        nr. of packets lost on client
c pkts lost net       integer        nr. of packets lost in the net
c pkts lost cont net  integer        nr. of packets lost continuously in the net
c resendreqs          integer        nr. of resend requests from client
c pkts recover ecc    integer        nr. of packets recovered due to ECC
c pkts recover resnt  integer        nr. of packets recovered by resending
c bufcount            integer        buffer count
c tot buf time        integer        total buffer time
type                  varchar(56)    mime type of object

Table 5.4: NR log access attributes

By splitting the logs into three parts we need to map the objects and clients back to the access table. This means that we need a unique ID for all the objects and a unique ID for all the clients, and to map those IDs back to the access table. The access table then contains information about which clients accessed which objects when, along with specific stream-related information. It is this table that will be used for most of the further analysis.
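As a concrete illustration, the three tables could be declared roughly as follows. This is only a sketch of the design in Tables 5.2-5.4, abbreviated to a few representative columns, not the exact DDL we used:

    -- A sketch of the stream log design, abbreviated to a few of the
    -- columns from Tables 5.2-5.4.
    CREATE TABLE objects (
        objectid  integer PRIMARY KEY,          -- assigned unique id
        name      character varying(128),       -- parsed out of the URI
        size      integer,
        type      character varying(56)
    );

    CREATE TABLE clients (
        clientid  integer PRIMARY KEY,          -- assigned unique id
        cip       inet,
        useragent character varying(128)
    );

    CREATE TABLE access (
        clientid  integer REFERENCES clients,   -- who ...
        objectid  integer REFERENCES objects,   -- ... accessed what ...
        date      date,                         -- ... and when
        time      time,
        sc_bytes  bigint,                       -- bytes sent server -> client
        type      character varying(56)         -- duplicated for convenience
    );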
5.5.2 Web logs

For the web news analysis, there are no high-level requirements like identifying single clients or objects, since there is only one attribute that can identify each of them. However, we do find that besides the content analysis, the only entries we need are those related to article requests. This is fortunate, since these logs are tremendous in size and we do not have the capacity to put all of their information into a database. Therefore, the database table for the web news logs contains all the attributes of the logs, but only entries related to article requests; see Table 5.5.

Column      Type                     Description
date        date                     date
time        time                     time stamp
sip         inet                     server IP address
port        integer                  port number, e.g. 80
method      character varying(32)    GET, HEAD or POST
host        character varying(512)   requested host name, e.g. www.vg.no
uri-stem    character varying(512)   path on server, e.g. /pub/vgart.hbs
uri-query   character varying(512)   arguments, e.g. artid=xxx
cookie      character varying(512)   cookie information
user-agent  character varying(512)   e.g. Mozilla/4.0...
via host    character varying(256)   name of proxy if used
fw-for      character varying(512)   IP addr of client a proxy is forwarding for
referer     character varying(512)   referring URL, e.g. http://www.vg.no
ref-host    character varying(256)   host name of referer, e.g. www.vg.no
ref-uri     character varying(512)   path on referring server, e.g. /pub/vgart.hbs
ref-args    character varying(256)   arguments of referer, e.g. artid=xxx
time taken  double precision         time to complete request
status      integer                  HTTP status code, e.g. 200, 206, 404, 500
ctype       character varying(256)   content type, e.g. image/gif
set-cookie  character varying(512)   set cookie field
bytes       integer                  size of object (not always used)

Table 5.5: VG article requests table

5.6 Database implementation

In this section we first discuss the options we have for importing the log information into the database tables from the previous section. Then we elaborate on the chosen approach. We do this separately for the two different logs.
5.6.1 Stream logs

There are several methods we can use to split up the log information and import it into separate tables as discussed in the previous section. Here we list two, and discuss which one we have chosen.

Method 1: Database only

We can copy all logs into one big database table using the COPY command of PostgreSQL. From this table we can extract and create a client and an object table with the attributes identified above using database commands. An example of such an SQL command is:

    SELECT DISTINCT attributes INTO new-table FROM big-table;

We can do the same to create the access table, only without the DISTINCT option since it must hold all entries, and discarding the attributes used for clients and objects. When those tables are created, we need to assign a unique ID to all the entries in the clients and objects tables, and map those IDs back to the access table.

Method 2: Everything in C

Instead of using only database commands, we can create a C parser to extract the information for the three tables directly from the logs. It has to go through all of the lines in each of the logs, record to file and assign an ID to each new client and object it encounters. When the client and object are recognized and identified, their IDs are mapped to the access that the entry represents. The access is also recorded to a file, along with the client and object IDs. When the parser is finished we end up with three files on disk: one containing all clients with IDs, one with all objects and IDs, and one containing all accesses with the mapped client and object IDs. In short, these three files contain our desired tables, so we can use the COPY command in PostgreSQL to insert them into their own tables in the database.

Selected approach

With method one, after having created a client and an object table and assigned a unique ID to all of the entries, we have to map the IDs back to the access table by comparing each client and each object against each of the entries in the access table. This proved to take an unreasonable amount of time. The second approach also has to perform this comparison, but only against the entries already recorded, not all entries every time. Therefore, we created a C program implementing the second approach; see Appendix A.1 for its source code. However, since we still have to compare each object and client seen so far with every new log entry, this still takes a lot of time. We therefore limit the time period of our analysis to data between 2002-01-21 and 2003-01-09, for a total of 714,907 recognized clients, 2,412 objects, and 5,180,565 accesses.
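For illustration, the ID mapping that makes method one expensive can be written as two set-oriented UPDATE statements. This is only a sketch under the hypothetical assumption that the raw access rows still carry the client IP and object name they were parsed from; over millions of access rows, each statement amounts to a large join:

    -- A sketch of mapping assigned IDs back into the access table.
    UPDATE access a SET clientid = c.clientid
    FROM   clients c
    WHERE  a.cip = c.cip;

    UPDATE access a SET objectid = o.objectid
    FROM   objects o
    WHERE  a.name = o.name;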
5.6.2 Web logs

As mentioned in Chapter 2, the news logs are divided into 968 files with a total compressed size of 86GB. This is too much information to put into a database on the hardware available to us and still get query results within the time span of this project. Therefore, we limit our log analysis to the logs between December 7 and December 15. In addition, in the previous section we noted that apart from the content study, the only log information we need in our database is information about article accesses. By limiting our time span and only inserting log entries related to article requests, we get a total of 14,905,052 article accesses. As mentioned in Section 5.2, we distinguish the article request entries by looking at two attributes of the logs, uri-stem and uri-query. The uri-stem field is always /pub/vgart.hbs for articles, and the uri-query field distinguishes between the articles, with the form artid=101011. There are several choices for how to extract the article requests and put them into the database table. We present three methods here and then elaborate on the one we have chosen.

Method 1: Everything in the database

With this method we simply push all the logs into the database using the COPY command of PostgreSQL. From this table we can extract the article requests into a new table with a query using the restrictions mentioned above. The query looks much the same as the one we presented for the streaming logs:

    SELECT * INTO articletable FROM alltable
    WHERE uri_stem = '/pub/vgart.hbs' AND uri_query IS NOT NULL;

Method 2: Combination of C parser and DB commands

The second approach is a C parser that, instead of pushing all the logs into one big table, inserts one log at a time into a table. For each log we can then use database commands to copy only the relevant article entries into their own table, excluding all entries referring to image requests etc. Finally, before moving on to the next log, we delete the table containing the whole log. The query for this is similar to the one above, except that an INSERT command has to be used instead of SELECT INTO.

Method 3: Everything in C + libpq

The last method is a parser that operates on one line from one log file at a time. On each line it does exactly what the above SQL command does: it matches uri-stem against /pub/vgart.hbs and checks that uri-query is not blank. If an article request is found, the entry is inserted into the article database table using the libpq library for communication with the database.

Selected approach

With method one, the amount of information to be put into the database is too large. The disk on which the logs and the database reside is a 340GB disk with about 230GB available for the database. After inserting only 17 of the 973 logs, the database was already 180GB large. With method two we greatly reduce the amount of disk space needed, but there is a problem with cleaning the logs before a COPY operation will succeed. With the streaming logs, we used sed to create copies of the logs without any erroneous lines. This was no problem since those logs were not too large. With the news logs, however, this takes a considerable amount of time since most of them are over 1GB in size after being decompressed. Because the cleaning of the log files takes so long, we chose method three, even though a COPY operation on a whole file performs better from the database point of view. By operating on one line of the logs at a time, the database does not grow too big with entries unrelated to article requests, and we avoid the problem of cleaning the files with sed. When we encounter an erroneous line with this approach, the libpq insert fails and PQexec reports an error. Because we do not care about erroneous lines, we can simply ignore these error messages. The parser we made is listed in Appendix A.2.
Chapter 6
Web Content Analysis

In this chapter, we analyze the questions from Chapter 2 regarding web news content. We first introduce the method we have used to extract information from the logs. Then we go into detail about how we answer the specific questions and the results we get.

6.1 Preparation

As noted in Chapter 5, we need a parser which extracts information about file types, sizes and accesses to objects from the web logs. We made a C program, Appendix A.3, that goes through the log files, recording to different files on disk the name and size of each new object of each file type it finds. To find the type we look at the ctype field in the log, which holds mime type entries of the form image/jpeg. When the uri-stem field is /pub/vgart.hbs, we also have to combine this field with the uri-query field so that we get e.g. /pub/vgart/artid=56544 as a distinct HTML file. The output of this program is a directory structure like:

    results/images/jpeg-objects
    results/images/jpeg-sizes
    results/images/gif-objects
    results/images/gif-sizes
    results/text/plain-objects
    results/text/plain-sizes
    results/application/pdf-objects
    results/application/pdf-sizes

In the sizes files, the size of each distinct object the program finds for the respective type is stored. In the objects files, the name of each new object is stored. These objects files are used by the application to determine whether the log entry it is currently processing references a new object or one that has already been recorded.

We ran our program to find and count file types and sizes over a subset of two days of logs. We only use two days' worth of logs because the application has to compare the object referenced in each log entry to all the previously found objects. For each new object found this takes increasingly more time, and beyond this period the program progressed at a rate that would not have given us much more data. Also, the layout of the pages, as discussed in Chapter 2, dictates that the distribution between the objects does not change much over time. For these reasons, we think two days is enough for the type of analysis we are conducting here.

6.2 File types and distribution

The line count of either the objects or the sizes file tells us the number of objects of each file type. In Linux, the shell command cat name-of-file | wc -l gives the line count of a file. To find the mime type distribution we simply add up the line counts of all objects files in each distinct directory (results/images, results/text and so on). Table 6.1 lists all the object types we found and the distribution between them, including the total distribution per mime type.

Name                           Count    Total
application/octet-stream         126
application/pdf                    7
application/smil                   1
application/x-director             1
application/x-javascript        1156
application/xml                    3
application/x-netcdf               1
application/x-shockwave-flash   1065     2360
image/png                          1
image/jpg                      65214
image/gif                       6700    71915
text/css                         191
text/xml                           6
text/html                      16692
text/plain                       213    17102
audio/basic                       43
audio/x-pn-realaudio             196      239
video/quicktime                   12       12

Table 6.1: VG file type distribution (totals are per mime-type category)

Figure 6.1 shows the mime type distribution as a histogram. To create this figure we entered the sums of the line counts of all files in each distinct directory into the R script listed in Appendix A.4.

[Figure 6.1: VG distribution among mime types — histogram of number of objects per mime-type category (application, image, text, audio, video), where the category is the first part of the mime type name, e.g. text/*.]

Not surprisingly, we find a subset of file types that represents the majority of objects. To investigate the differences between the most represented types, we also create a graph of a selected choice of file types, Figure 6.2, using the R script in Appendix A.5.

[Figure 6.2: VG file type distribution — number of objects for the selected file types Jscript, Flash, CSS, Plain, Html, Gif and Jpeg.]

As we can see, most of the objects are of type text/html, image/gif or image/jpeg. This corresponds to what has earlier been found for general web traffic [17, 23, 35, 37].

6.3 Size and access distribution

To find the size distribution between the selected file types we use the median size of each type. We did make some sample graphs using the mean, but they gave a misleading picture because the size distribution is very skewed. As an example, the smallest JPEG image is 304 bytes, the largest is 1,918,304 bytes, the mean is 13,838 bytes and the standard deviation 21,418 bytes. Using the median for this representation is also in accordance with the rules presented in [27] regarding the selection among mean, median and mode. To find the median we use the Python script in Appendix A.6 to sort the sizes files in ascending order, so that the entry at line count / 2 of each file gives us the median size of the specific type. To create a graph of the median size of each type we again enter the sizes into an R script, listed in Appendix A.7.
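Had the sizes been loaded into the database, the same per-type median could also have been computed directly in SQL. A sketch, assuming a hypothetical table webobjects(type, size) and a PostgreSQL version with ordered-set aggregates:

    -- Median object size per file type; percentile_cont is available in
    -- PostgreSQL 9.4 and later.
    SELECT type,
           percentile_cont(0.5) WITHIN GROUP (ORDER BY size) AS median_size
    FROM   webobjects
    GROUP  BY type
    ORDER  BY median_size DESC;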
To find the access frequency of each type we cannot use these files, since they only record distinct objects. To create new files recording the access distribution, we use the same application as before, only without caring about previously seen objects. That is, we skip the routine that checks whether the object in the current entry has been seen before, and thereby record accesses to the specific types. From these new files we use the same approach as before, plotting the line count of each file with an R script, see Appendix A.8. Figure 6.3 shows the median size distribution between the selected types, and Figure 6.4 shows the access distribution between them.

[Figure 6.3: VG median size distribution — median size in bytes for the selected file types.]

We see that the Flash objects are rarely accessed, so even though their size is about 12.5 times larger than the next group, JPEG, they will not impact the server in any great way. HTML and Javascript objects are of similar size and are accessed almost the same number of times. An interesting observation here is the comparison between the number of HTML and Javascript objects in the file type analysis in Figure 6.2: the number of Javascript files in the logs is only about 1.8 percent of the number of HTML files. The reason for this skew between number of objects and number of accesses for HTML and Javascript lies in the inherent nature of Javascript. Javascripts are embedded in HTML documents to do specific tasks, such as banner ads and pop-ups. In our NoD environment, the layout of each page is kept consistent across references to pages, so the same Javascripts are reused throughout several different pages for the same task. To the best of our knowledge, only one other analysis has taken the application mime type (under which Javascript falls) into account, and they found its access share to be far smaller than HTML's [17]. That makes sense for the web in general, since the majority of web pages do not contain such things as banner ads or pop-ups. However, we see that this is not the case for a web news server; in such environments Javascripts are an integral part of the HTML documents.
When it comes to images, JPEG files are both larger and more frequently accessed than GIF files. This contrasts with the findings of both [37] and [35], where GIF was by far the most popular image type, and the image type with the largest average size ([17] and [23] did not distinguish between image types). All of these articles are quite old, however, and our result does not come as a surprise, as it had been predicted that the number of GIF files would drop [30]. The background is that in 1994, Unisys, who held the patent on LZW, the compression algorithm used in GIF, decided to start enforcing this patent and collecting royalties on its use [3]. When this happened, a movement started to move away from the GIF format and encourage people to use a free image format instead, namely PNG [2]. This suggests the hypothesis that we would see many PNG files where we previously saw GIF files. However, in our study there are very few PNG files. One reason could be that the GIF patent has now expired, and GIF has some features that PNG does not, such as animation. Another reason can be that not all browsers support PNG files. For example, we tested with a Qtek 9090 PDA running Windows Mobile 2003 Second Edition, version 4.21.1088, with Internet Explorer, and it did not show any PNG files.

[Figure 6.4: VG access distribution to the file types — number of accesses for the selected file types.]

GIF and PNG discussion aside, the fact remains that JPEG is by far the dominant image format in terms of number of objects in our study. The reason we find many more JPEG files than any other image format has to do with the properties of the different formats. JPEG is simply the best format for photographs, which comprise the majority of images on a web news site [4]. One of the reasons it is better is its lossy compression, which effectively removes the parts of the image that matter least while still preserving reasonably good quality.

From the file type and access distribution analysis we have found that the majority of objects on the server, as well as of accesses, are of type GIF, JPEG, HTML and Javascript. Arlitt et al. found in [16] that 90 to 100 percent of all accesses on a web server were to image or HTML files. They analyzed logs from six different web servers to find common invariants, and concluded that their results were representative of web servers in general. Others have observed properties consistent with this result [23, 35]. They did not, however, include any NoD servers in their studies, but through our analysis we have found that this distribution applies to web news servers as well.

6.4 Internal size distribution

Now that we have done a general analysis of the file types on the server, as well as the access and size distribution between them, we go into the specifics internal to each type. The content of the sizes files tells us the size distribution internal to each file type. To investigate it, we used the Python script in Appendix A.9 to create a table collecting entries in buckets of 1KB. We also used different R scripts to create the graphs of the internal size distribution for each file type. The script in Appendix A.10 is an example of such an R script; all the other graphs regarding internal file sizes have been made with similar scripts, which can be found on a CD distributed with this thesis.
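The bucketing itself can be sketched in SQL as well, again assuming a hypothetical webobjects(type, size) table; width_bucket is a standard PostgreSQL function:

    -- Count JPEG objects in 1KB buckets, mirroring what the Python
    -- script in Appendix A.9 does on the sorted sizes files.
    -- Sizes above 1MB land in the overflow bucket (1025).
    SELECT width_bucket(size, 0, 1048576, 1024) AS kb_bucket,
           count(*)                             AS objects
    FROM   webobjects
    WHERE  type = 'image/jpg'
    GROUP  BY kb_bucket
    ORDER  BY kb_bucket;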
We will concentrate our further analysis on the most accessed types: JPEG, GIF, HTML and Javascript. From both the median size distribution and the access distribution analysis we find that JPEG files are the largest and also by far the most accessed type of object, so we look at these first. Figure 6.5 shows the size distribution among JPEG files.

[Figure 6.5: VG JPEG size distribution — number of objects by size in KB, logarithmic x scale.]

The majority of JPEG files are between 1 and 5KB, but there are also many objects from 5 to 50KB. Beyond 50KB, there are not many objects left. [35] found the average JPEG size to be 73KB, so here we see a clear difference from regular web content. This makes sense, since images on a news web page are generally small images connected to the heading of an article. For general web content, images are more likely to convey much more information and are therefore bigger. Also, given the number of images on a web news server, it is not unreasonable that the publisher tries to shrink the image sizes as much as possible. As we have seen, JPEG is the most common format, and with its lossy compression this can be done quite effectively.

Figure 6.6 shows the size distribution among GIF files.

[Figure 6.6: VG GIF size distribution — number of objects by size in KB, logarithmic x scale.]

Here too, the majority of the objects lie between 1 and 5KB; expanding the range to 1-10KB covers almost all the objects. The peak of the curve for GIF files is at 1KB, as opposed to 4KB for JPEG files. This confirms the result in Figure 6.3 that GIF files are generally smaller than JPEG files. Again, our results show a large difference from [35], which found the average size of GIF files to be approximately 18KB. Here we cannot draw any conclusion about the type of information GIF images convey on a web news site as opposed to the web in general. It is, however, reasonable to assume that since the number of GIF files on the web news server is so small, we only see a limited use of this format, and as such the average size for the web in general would be larger.

Figure 6.7 shows the size distribution among HTML files.

[Figure 6.7: VG HTML size distribution — number of objects by size in KB, logarithmic x scale.]

A large portion of the documents are less than 2KB; the rest lie in the range 2 to 20KB. This corresponds reasonably well with other findings. [23] found that the web strongly favors documents between 256 and 512 bytes, and the median file size distribution in Figure 6.3 shows that this is also true for the news pages in our study. [35] found the average HTML size to be 5KB, so again that study gives much larger sizes than ours, but the ratio between image and HTML sizes remains approximately the same.

Figure 6.8 shows the size distribution among Javascript files.

[Figure 6.8: VG Javascript size distribution — number of objects by size in KB, log/log scale.]

This is the clearest size distribution we found: almost all Javascript files are 4KB in size. There are just a few other documents of this type on the server, and most of them are smaller, with the majority at 1KB.
Chapter 7
Streaming Content Analysis

In this chapter, we analyze content from the streaming news server logs, and investigate the issues discussed at the end of Chapter 2 regarding stream content.

7.1 Preparation

As mentioned in Section 5.1.2, we got a recursive directory listing from NR [6] with output as described in Table 5.1. We wanted to put this list into a database table, which requires some changes to the format. The date field has to be changed from e.g. 10.07.2002 to 2002-07-10, and the size field has to be stripped of punctuation marks. In addition, we wanted a column for the file type. We implemented the C program in Appendix A.11 to perform all these operations and output the result to a new file on disk. The type of each object is found by looking at the file extension. Table 7.1 gives an overview of the format of the new file.

Date        Time   Size     Name      Type
2002-07-10  10:50  3700135  test.wmv  video/x-ms-wmv

Table 7.1: NR directory listing example with type field

To import this new file into a database table we used the COPY command in PostgreSQL. This server list table is used for queries answering the specific questions about file type, size and distribution. A description of the new table is given in Table 7.2.

Column  Type                     Description
date    date                     date from server list
time    time                     time from server list
size    int                      size from server list
name    character varying(128)   name of object, parsed out of URI
type    character varying(56)    mime type field

Table 7.2: NR server list table
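The date rewriting could equally well have been done at load time in SQL. A sketch, assuming the raw listing has first been copied into a hypothetical staging table rawlist with text columns, and that the size field uses a dot as thousands separator:

    -- Convert the day.month.year notation to an SQL date and strip
    -- the assumed thousands separators from the size field.
    INSERT INTO serverlisttable (date, time, size, name, type)
    SELECT to_date(rawdate, 'DD.MM.YYYY'),
           rawtime::time,
           replace(rawsize, '.', '')::int,
           name,
           type
    FROM   rawlist;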
7.2 File types and distribution

To find out what types of files exist, we do a query on the new table, listing all the distinct types:

    SELECT DISTINCT type FROM nrlisttable;

This query gave us a total of 13 different types. To investigate the distribution among these types we created a PL/R script that uses SQL commands to count all entries of the specific types, and then uses R's graphing capabilities to create a histogram presenting the results; see Appendix A.12. Figure 7.1 shows the distribution among file types on the streaming server.

[Figure 7.1: NR file type distribution — number of files per type (WMA, WMV, JPG, ASF, Other), where Other covers mp3, wav, pdf, playlists, txt, mpg, flash and avi.]

As we can see, even though we found a total of 13 different file types, almost all of the files are in Microsoft's WMV video and WMA audio formats, with WMV accounting for the absolute majority of objects on the server. The next file type we see a lot of is JPEG. Beyond that, there are not many files of any other type. Types like mp3, WAV, MPEG, AVI, Real and Quicktime, which we generally find a lot on the Internet, are hardly used at all in our streaming environment; Microsoft formats for video and audio are used almost exclusively. This does not come as a surprise: as we noted in Chapter 2, the video access page specifies that the videos are of WMV type and that watching them requires Windows Media Player.

7.3 Size distribution

To investigate the size distribution of the file types on the streaming server, we use the size and type attributes of our server list table. An example of an SQL query that lists the sizes of all objects of a specific type is:

    SELECT size FROM serverlisttable WHERE type = 'video/x-ms-wmv';

To record the results we use the \o option in PostgreSQL, which sends query output to a file (e.g. \o /home/user/wmvsizes). By performing such queries for all types, we get one file on disk per type, containing the sizes of all objects of that type. These files are the equivalent of the sizes files from the web news content analysis in the previous chapter. Further, we use the same Python script, Appendix A.6, as with the web news content analysis to sort the sizes in ascending order. Again, we find the median size of each type by looking at the entry at line-count-of-file / 2 of each file. To create a graph of the size distribution between the different types, we entered the median sizes into the R script listed in Appendix A.13. We have chosen to include only the types WMV, WMA, JPEG and ASF, since there are so few objects of the other types. Figure 7.2 shows the median size of the four selected file types.

[Figure 7.2: NR median size distribution — median size in KB for WMV, WMA, ASF and JPG.]

We see that WMV and ASF files are of almost the same size. This is because ASF is also a video format from Microsoft, with similar design and characteristics to WMV. WMA files are smaller than both ASF and WMV, which is not so surprising, as audio files tend to be smaller than video files. A bit surprising is the difference between the image files and the audio and video files. At first glance it looks like the JPEG files are really big, but if we look closer at the actual sizes, we see that it is rather the audio and video files that are quite small: the median size of WMV is about 1MB.

7.4 Access distribution

Next, we investigate the access distribution of file types on the streaming server. To find the access distribution we cannot use the server list as above, since it is only a record of the files on the server; we have to look at the streaming logs' access table. As mentioned in Chapter 5, the streaming logs have no type attribute, so we added a type field to both the objects table and the access table of the logs. These fields have to be filled in by matching the object each entry represents to the objects in the server list table. Since we divided the streaming logs into different tables, the access table does not contain the name of the accessed object, only an ID matching an ID in the objects table. Therefore, in order to transfer the type of an object from the server list table to an entry in the access table, we have to perform two operations. First, we match the objects in the server list table to those in the objects table created from the logs, using the name attribute of each table; wherever we find a match, we fill in the type in the streaming log objects table. When the type field in the objects table has been filled in, we do the same operation between the streaming log objects and access tables, using the objectid attribute. To perform these two operations we used the scripts in Appendix A.14 and A.15.
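The two propagation steps can be expressed as UPDATE ... FROM statements. This is only a sketch using the table names from this chapter; the scripts in the appendix may differ in detail:

    -- Step 1: copy the type from the server list to the log objects,
    -- matching on object name.
    UPDATE objects o SET type = s.type
    FROM   serverlisttable s
    WHERE  o.name = s.name;

    -- Step 2: propagate the type from the objects table to the accesses,
    -- matching on the assigned object ID.
    UPDATE access a SET type = o.type
    FROM   objects o
    WHERE  a.objectid = o.objectid AND o.type IS NOT NULL;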
We were not able to map all objects from the server list to the objects found in the logs. There can be several reasons for this. For example, not all names for the same object match, because of different representations of Norwegian characters in the server logs and the server list. Another source of character mismatches is badly formatted log entries. A third reason can simply be that some objects have been removed from the server. Therefore, the type of some objects cannot be determined. We found 7,593 objects in the server list and 2,413 distinct objects in the log files. Of these 2,413 objects, we were able to map 1,325 between the server list and the log objects tables.

Now we are ready to investigate accesses to the different types. To find the number of accesses to each type we use the same method as in the previous section, with queries of the form:

    SELECT count(*) FROM accesstable WHERE type = 'video/x-ms-wmv';

We created the PL/R script in Appendix A.16 to perform these SQL queries and then use R's graphing capabilities to create a histogram of the access distribution, Figure 7.3.

[Figure 7.3: NR access distribution — number of accesses per type (WMA, WMV, JPG, ASF, Other), where Other covers mp3, wav, pdf, playlists, txt, mpg, flash and avi.]

As we see, WMV is definitely the most accessed type. We also see that JPEG files and files from the "Other" category are not accessed at all. One reason they appear in the list table and not in the access table can be that NR had more content on the server than was available through the streaming application interface. Also, the server list contains objects from 2002 to 2004, while we only analyze logs with accesses between January 2002 and January 2003, so the two datasets are not directly comparable beyond the objects we were actually able to map between the tables.

However, WMV and WMA files are accessed almost exclusively, and from the previous analysis we have also found that the majority of objects on the server are of these types. As such, we need to understand better what these files are, besides just video and audio files. Both are Microsoft standards in the Windows Media family. WMV is a video format which includes both video and audio. It is designed to handle all types of video, be delivered as a continuous flow, and be compressed to match different bandwidth requirements [34]. WMA is an audio format in the same family and with the same characteristics as its video counterpart. With these characteristics, they fit a streaming environment very well.

7.5 Internal size distribution

Now we also investigate the internal size distribution for the four selected file types: WMV, WMA, JPEG and ASF. The method we use is the same as for the web news content analysis. From the median size analysis above we already have one file on disk per type, containing the size of each object of that type, sorted in ascending order.
As with the web news, we use the Python script in Appendix A.9 to collect these sizes in buckets, only now we make the buckets 100KB in size. We use R scripts similar to those in the web content analysis to create graphs of the outputs from each analysis; see Appendix A.17 for these scripts.

Figure 7.4 shows the size distribution of WMA files on the NR server.

[Figure 7.4: NR WMA size distribution — number of objects per 100KB bucket, logarithmic x scale.]

Although the sizes range from 5KB to 20MB, most of these audio files are between 200 and 500KB. This is not very big compared to the regular audio content we are used to, like mp3 music. However, the audio files we are looking at are typically small samples of a music file, designed to give the user a preview of a particular song.

Figure 7.5 shows the size distribution of WMV files on the NR server.

[Figure 7.5: NR WMV size distribution — number of objects per 100KB bucket, logarithmic x scale.]

As we have seen, most files on the server are of this type, and they are also the most accessed. The range of sizes within this type is very large, from 1KB to 314MB, but by far the most objects are between 100KB and 1MB. One reason they are this small is that, as with the audio files, the videos are not full news broadcasts, as on a TV news site such as that of the Norwegian television broadcaster NRK [7]. They are small news clips that show just a specific piece of information, for example goals scored in a soccer match or short interviews with celebrities. Another reason can be that the logs we are analyzing are from 2002, when most clients were still using ISDN or modems to connect to the Internet [36]; both size and compression rate were therefore probably fitted to a lower-bandwidth market than today's files. In any case, our results correspond well with what [14] found in their analysis, where most video objects were less than 2MB in size, with a median size of 1.1MB.

Last, we also look at the internal size distribution of the JPEG and ASF files. Figure 7.6 shows the size distribution of JPEG files on the streaming server.

[Figure 7.6: NR JPEG size distribution — number of objects per 100KB bucket, logarithmic x scale.]

As we see, the majority of these files are in the range 500KB to 1MB, which is much larger than those we found in the web log analysis, though not large compared to regular image and photograph sizes. The JPEG files on the streaming server could be the photo series that are sometimes shown on the VG site, now residing on VG's own servers and accessed through http://www.vg.no/bilderigg/. In the web server analysis in the previous chapter we only analyzed articles, so such pictures were not in that subset of images. Figure 7.3 shows that these images are never accessed, however, so we cannot conclude anything about them.

Figure 7.7 shows the size distribution of ASF files on the streaming server.
[Figure 7.7: NR ASF size distribution — number of objects per 100KB bucket, logarithmic x scale.]

The majority of these files have the same size range as the majority of the WMV files, but we do not have a large enough set of objects to conclude anything. The analysis above has investigated the internal size distribution of the files on the server, which can be quite different from the objects actually accessed in the logs. Therefore, we also made graphs of the internal sizes of those objects that were actually accessed, and found this distribution to be almost exactly the same as the distribution of the files on the server.

Chapter 8
Access and Interaction Analysis

In this chapter, we look at user behavior in a NoD environment. We first present a small workload characterization of the servers. Then we investigate access patterns in the web news environment, and finally we look at interaction patterns in the streaming environment.

8.1 Workload characterization

The first behavior analysis we perform is geared more towards the servers, but it gives us a broad overview of what users do as well. To study the workload of the web news server we used the script in Appendix A.18 to investigate the number of requests per hour on December 8, 2004. The result is presented in Figure 8.1, where requests are collected in one-hour buckets, meaning that the entry at, for example, 7 represents requests between 06:00 and 07:00.

[Figure 8.1: VG server workload, 2004-12-08 — requests per hour over the day.]

We see from this figure that the server is very busy throughout the day, with a peak of about 175,000 requests at 7 and about 160,000 at 8, which gives an average of roughly 46 requests per second between 06:00 and 08:00. The next peaks are at 10 and 11, with about 145,000 requests each. Server workload aside, we can already here start to investigate client access behavior: it seems as though many users start their workday by reading the news, and then check back for new articles during their lunch break. We also calculated the number of distinct users this day, which was 233,209, telling us that some clients must request several articles.

To study the workload of the streaming news server we used the script in Appendix A.19 to investigate the number of requests per hour on February 6, 2002. The result is presented in Figure 8.2, where the requests are again collected in one-hour buckets.

[Figure 8.2: NR server workload, 2002-02-06 — requests per hour over the day.]

We see that there are far fewer requests for these types of objects. The peak is at 10 (between 09:00 and 10:00) with about 400 requests; at 11 there are about 300 requests, which gives an average of 5.8 requests per minute between 09:00 and 11:00. Here too we see a small peak at the start of the workday, but by far the most requests for these objects come during lunch time. The number of distinct clients this day was 1,352 and the total number of requests was 2,366, suggesting that at least some clients request two or more stream objects.
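Such an hourly breakdown is a single aggregation over the article table. A sketch, assuming the VG article table from Table 5.5 has been loaded under the hypothetical name vgarticles with column names usable without quoting:

    -- Requests per hour for one day; extract() is a standard
    -- PostgreSQL function.
    SELECT extract(hour FROM time) AS hour,
           count(*)                AS requests
    FROM   vgarticles
    WHERE  date = '2004-12-08'
    GROUP  BY hour
    ORDER  BY hour;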
To find out whether our dataset is representative of the actual average workload of the server, we did a comparison of all the logs we received from VG. In a very simple analysis, we implemented a script that created a graph of web news log similarities based on the size of each log we received. As mentioned in Chapter 2, each log comprises half an hour of material. The result is presented in Figure 8.3.

[Figure 8.3: VG log size comparison — size in MB of each half-hour log over two weeks.]

As we see, the logs exhibit a sort of self-similarity, indicating that our data is in fact representative for performing not only a workload characterization, but also the other types of analysis we do in this thesis. From this figure we also see that on weekends, around the 200 and 500 marks in the graph, the number of requests is not as high as during regular weekdays. However, the weekend logs are not very far behind the weekday logs, so web news objects are clearly requested during weekends as well.

8.2 Web news sessions

Kim et al. claim in their analysis of article access distribution that clients request several articles while connected to a news server [29]. They show no proof of this, but they do use it as the basis for an article popularity model they present. We want to investigate whether this claim holds for our dataset. In order to do so, we first need to define client sessions. Catledge et al., in their study of user interface events, defined sessions to be within 1 1/2 standard deviations of the mean time between user events [20]. We follow the same approach and define sessions to include all requests within 1 standard deviation of the mean time between requests from each client per day.

To find out whether clients request several articles in a session, we perform several steps. First, we recognize that for sessions to even exist, some clients must request more than one article per day, and as such there should be more than one entry in the logs per day from the same IP. Therefore, we first check whether we actually see multiple requests from some IP addresses each day. We used the script in Appendix A.20 to create a graph of the mean number of requests from the same IP address each day, Figure 8.4.

[Figure 8.4: VG mean number of times the same IP is seen per day — between 2 and 3 requests per day over the logging period.]

We see that the mean number is between 2 and 3 times a day. The mean is not a very informative value here in terms of regular client behavior, since it is very susceptible to large fluctuations; for example, a proxy server requesting hundreds or thousands of articles per day would greatly influence it. It does, however, tell us that there is a possibility that clients request more than one article in a session, which is what we wanted to learn from this step. Next, since sessions may exist, we need to find out whether the requests are reasonably close in time, so that we can justify grouping them together in a session. In order to test this, we use the definition of sessions from earlier in this section.
We implemented the script in Appendix A.21, which calculates the mean interval between requests from each IP per day, giving a result of 22 minutes 31 seconds. This script also created a table, Table 8.1, recording the mean distance between requests for each IP on each day the IP is observed.

Attribute  Type     Description
date       date     date we record the average time between requests for
cip        inet     client IP address
avg        integer  average time between requests this date

Table 8.1: VG average request timing table

In order to find the standard deviation, we output all the mean times from this table to a file and fed them into an R vector, which we could query in the R environment. The summary() function shows a mean of 1,850.65 seconds, and the sd() function gives a standard deviation of 3,571.943 seconds, i.e. 59.5 minutes. Following the definition presented earlier, we thus take sessions in our dataset to be 1 hour, which seems reasonable.
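The per-day, per-IP averages and the resulting threshold can also be sketched in SQL. Assuming a hypothetical table gaps(cip, date, gap_seconds) holding the time between consecutive requests from the same IP on the same day:

    -- Mean gap per IP and day (the content of Table 8.1) ...
    SELECT cip, date, avg(gap_seconds) AS avg_gap
    INTO   reqtiming
    FROM   gaps
    GROUP  BY cip, date;

    -- ... and the session threshold: one standard deviation of those means.
    SELECT avg(avg_gap)    AS mean_gap,
           stddev(avg_gap) AS session_threshold
    FROM   reqtiming;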
With sessions defined, we created another table, Table 8.2, using the script in Appendix A.22, which assigns session IDs to each request from each IP.

Attribute  Type          Description
cip        inet          client IP address
date       date          date of the request
time       integer       time of the request
sessionid  integer       id of the session the current request is in
artid      varchar(512)  article id requested

Table 8.2: VG session table

The scripts in Appendix A.23 and A.24 are used to record the number of requests in each session and to create a histogram of the number of sessions versus the number of requests in them, Figure 8.5.

[Figure 8.5: VG number of sessions with x number of requests — normalized session counts for 1-20 requests; the dotted curve is a fitted Zipf with alpha = 1.3.]

The number of requests per session ranged from 1 all the way up to 1,397. In our graph we only show sessions with up to 20 requests, both because beyond this limit there are mostly just one or two sessions with the corresponding number of requests, and because most of those sessions are probably not from distinct users. From the graph we see that the absolute majority of sessions contain only one request for an article. Interestingly, when comparing to Zipf, we see that the popularity of sessions ordered by the number of requests they contain follows a Zipf distribution with α = 1.3. This means the probability of a session containing a certain number of requests decreases as the number of requests in the session grows, and small sessions are favored even more strongly than in pure Zipf popularity models. From this we conclude that sessions in which clients request several articles do exist, but the probability of a session containing one more request decreases according to a Zipf distribution with α = 1.3.

8.3 Web news reference patterns

In [22] it is observed that for prefetching of objects, it is only necessary to transfer an object at a rate sufficient to deliver it in advance of the user's request. If our analysis were to find some access relationship between articles, and also a timing constraint for that relation, it would support the use of rate-controlled prefetching. Therefore, having verified that sessions exist in which clients request several articles, we also wanted to investigate the time between these requests, to find out how much time is spent on each article. From Table 8.2 we can calculate the mean time between requests within sessions, thereby learning how long users spend reading an article on average. For this we used the script in Appendix A.25, which gave a result of 92 seconds. From a small experiment on the reading time of complete articles, we find this result to be quite accurate. Not knowing of any other study of article reading times, it seems as though most articles are read from beginning to end.

Even though we can identify some timing requirements for article requests, in order to perform prefetching we need to know which article is going to be requested next. Therefore, we also need to look at the relationships between requests. For this analysis we wanted to use the web news log attribute ref-args to find out where each request came from, so that we could see whether there are groups of articles that are always requested together. On closer investigation of this attribute we found that of the 14,905,052 accesses, only 123,919 had it filled in. Further, most of these came from other sites, or did not contain values we could deduce any information from. Only 7,067 of these entries had artid= somewhere in the string, and 6,045 of them had the attribute set to a pure artid=number form. 26,305 of the requests with the ref-args field set came from an image on the VG server, which probably means that the user clicked one of the images accompanying a headline on the front page. Because of the numerous different string representations found in this attribute, a thorough investigation of relationships between single articles would require complicated string matching. Given the numbers above, and being pressed for time, we did not see an obvious need for such an analysis: the attribute is simply not used enough to contribute any important results.

8.4 Stream interaction patterns

Next we analyze the streaming logs to investigate how streaming videos are interacted with. The two questions here are whether videos are viewed in full or only partially, and if they are only viewed partially, how much of the video is viewed. In this investigation, we can only look at those objects we were able to map from the stream server list table, Table 7.2, to the objects table, Table 5.2, since these are the only objects whose exact size we know. As mentioned in Chapter 7, out of the 2,413 distinct objects we found in the logs, we were able to map 1,325 to the objects table. Of these 1,325 objects, 1,322 were of type WMA or WMV; the 3 others were ASF/ASX. In addition, we also had to check that the sc-bytes field of the streaming logs' access table, Table 5.4, was filled in for each entry used in this analysis. With all these restrictions, we ended up with 4,198,779 requests to 1,319 objects that we could evaluate. We implemented the script listed in Appendix A.26, which under the above restrictions created Table 8.3.

Attribute  Type     Description
objectid   integer  id of object
size       integer  full size of object in bytes
viewed     integer  bytes viewed
percent    integer  percentage of object viewed

Table 8.3: NR access statistics table

This table lists all requests to each object for which we were able to determine the initial size and which had the bytes-sent-from-server-to-client attribute filled in. The table also contains information on what percentage of the object was viewed in each request. From this we found that out of the 1,319 objects, 886 had requests where the number of bytes sent from server to client was larger than the size of the file. This can have numerous reasons. First, it can be due to commercials running first, while the actual video is being buffered.
It can also be due to TCP/UDP and streaming protocol overhead, although this should not amount to much. We used the transport attribute of the streaming logs to check what kinds of protocols were used, and found that TCP was used about 60 percent of the time, UDP about 30 percent, and the rest was unspecified. Further, we calculated the mean view percentage over all requests that viewed more than 100 percent of the object's size, which came out at 127 percent. Neither TCP nor any streaming protocol should give this much overhead; UDP on a bad link could, with many retransmissions. Another reason could be user interaction, like jumping back and forth, but most of the players were Windows Media Players, which are buffering clients, so user interaction is unlikely to be the explanation there. A final reason can be testing of the line between client and server in order to establish usable transfer rates and parameters. Even though many objects had at least one access where more bytes were sent than the actual size of the object, over all requests only 417,945 of 4,198,779 had more than 100 percent sent, which is only about 10 percent of the requests. It could be that the portion of these accesses only slightly above 100 percent is due to overhead in the different protocols used, while for requests with a much greater percentage sent, a non-buffering client or UDP over a bad link could be responsible. The most likely explanation overall, since most users use buffering clients, is protocol overhead combined with commercial elements and a setup phase.

Figure 8.6 shows the proportions of requests that viewed an object partially, in full (100 percent), and beyond 100 percent. We clearly see that the majority of objects are only accessed partially.

[Figure 8.6: NR access view percentage — pie chart: Partial 80.6%, Full 9.4%, More than 100 percent 10%.]

To investigate this further and find out how much of the objects is usually accessed, we created another table, Table 8.4, which summarizes the view statistics for each object in Table 8.3. We used the script in Appendix A.27 to create this table.

Attribute        Type     Description
objectid         integer  id of object
fullcount        integer  count of accesses that viewed the object in full
partialcount     integer  count of accesses that viewed the object partially
accesses         integer  total number of accesses to this object
fullpercent      integer  percentage of accesses that viewed the object in full
meanviewpercent  integer  mean percent viewed over all accesses to this object

Table 8.4: NR view statistics table

From this table, we calculated the mean view percentage over all requests, which came out at 57 percent.
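The per-object summary in Table 8.4 is essentially one grouped query over Table 8.3. A sketch, assuming the access statistics live in a hypothetical table accessstats; the FILTER clause requires PostgreSQL 9.4 or later:

    -- Summarize full vs. partial views per object (cf. Table 8.4).
    -- Accesses above 100 percent fall into neither count; they are
    -- discussed separately in the text.
    SELECT objectid,
           count(*) FILTER (WHERE percent = 100) AS fullcount,
           count(*) FILTER (WHERE percent < 100) AS partialcount,
           count(*)                              AS accesses,
           avg(percent)::integer                 AS meanviewpercent
    INTO   viewstats
    FROM   accessstats
    GROUP  BY objectid;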
We see that the most common access pattern is to watch only the first 10 percent of the object. If the user watches more than 10 percent, the next most common pattern is to watch 100 percent of the object. This is a reasonable interpretation, since one could imagine users viewing the beginning of a news clip and deciding whether or not it is interesting. If it is not, the user stops viewing early in the stream; if the clip is interesting, the user watches it all. It would be tempting to conclude here that a client either watches less than 10 percent or watches the full 100 percent. The correctness of such a theory would be of great advantage to prefix caching especially. However, we see from Figure 8.7 that there is a substantial number of requests viewing an amount of an object distributed almost uniformly over the percentage buckets other than 10 and 100 percent.

To see if we can conclude anything about the size of a prefix, we used the script in Appendix A.30 to create a graph of the cumulative distribution of the access counts, Figure 8.8. From this we see that 20 percent of the requests watch less than 10 percent of an object. Other than that, the view percentage is almost uniformly distributed. There are no clear distinctions to be made here, but if prefix caching is to be used, a prefix of somewhere between 10 and 20 percent of an object could be a good choice, accounting for between 20 and 30 percent of the requests.

Figure 8.8: NR cumulative access view percentage (cumulative view percent distribution for partial accesses)

Chapter 9

Lifetime and Popularity Analysis

In this chapter, we explore the lifetime and popularity of web articles as well as streaming objects. We start out with an analysis of article lifetime, then we look at article popularity, and at the end we examine the lifetime and popularity of streaming objects to investigate whether streaming news shows patterns similar to web news.

9.1 Article lifetime analysis

To answer the question of news article lifetime, in terms of the distance in days between the first and last day they are accessed, we first created a smaller table from the web news article table, recording the first and last date each article is seen and the total number of requests to that article. The script we used for this operation is listed in Appendix A.31. Table 9.1 describes the attributes of this table.

Attribute   Type                     Description
artid       character varying(512)   uri-query field from Table 5.5
firstday    date                     first date we see an access to this article
lastday     date                     last date we see an access to this article
totalreq    integer                  total number of accesses to this article

Table 9.1: VG article information table

In order to investigate lifetime, we implemented the script in Appendix A.32, which uses this table to create a graph of the distance between the first and last day we see a request to an article, for all articles, Figure 9.1. We see from the result that many articles have a lifetime of eight days. Our log material only covers an eight-day period, meaning that most articles live longer than what we can observe in our analysis. We also see that many articles are accessed on only a single day.

Figure 9.1: VG article lifetime of all articles
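The computation behind Table 9.1 and Figure 9.1 is a simple aggregation; a rough Python sketch follows. The actual work was done by the PL/R scripts in Appendices A.31 and A.32, and the input layout here is an assumption.

    # Summarize first day, last day and total requests per article, as in
    # Table 9.1. Input: (artid, day) pairs with datetime.date days.
    def article_info(requests):
        info = {}
        for artid, day in requests:
            first, last, total = info.get(artid, (day, day, 0))
            info[artid] = (min(first, day), max(last, day), total + 1)
        return info

    # The lifetime plotted in Figure 9.1 is the inclusive day distance
    # (last - first).days + 1: an article seen on a single day has a
    # lifetime of 1, and one seen on both ends of our logs has 8.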
The articles accessed on only one day could simply be old articles being referenced again in one of the new articles from the week we have logs for, but we cannot know this for sure. It could also mean that new articles become unimportant almost right away, but in that case we should not see a substantial number of articles with a lifetime of eight days. One question that arises is the recycling of article IDs. We performed sample tests by looking up the most referenced articles, and they had not been recycled, so this is not the case.

Since we cannot tell from our logs when articles enter the system, we decided for further analysis to use only articles we see a reference to on the first day of the logs. By doing so, we limit the number of old articles accessed only once or twice within the time period of our logs. This way, most articles will be new articles, which lets us analyze lifetime more accurately. Therefore, using the script in Appendix A.33, we created a histogram of the distance between the first and last day of accesses for only those articles that had been accessed on the first day of the logs, Figure 9.2. We see that the number of articles accessed on only one day has dropped dramatically, suggesting that those were mostly old articles referenced only a few times through links to related content in recent articles. To emphasize this further, we calculated the percentage of documents in the logs that were accessed only one time, which was 33.6 percent. Interestingly, [16] found in their study of several different logs from different types of web servers, none of which were news web servers, that approximately a third of all distinct documents were accessed only one time. It would seem, then, that old articles generally follow a regular web pattern.

Figure 9.2: VG article lifetime of articles seen on the first day of logging

A bit of a surprise from this new day-distance analysis is the fact that most articles are accessed both on the first day and on the last day of our logs. Kim et al. found in their study that the most popular articles last three days on average [29]. In our data set we see that all new articles, not only the most popular ones, actually live for quite some time, at least eight days. However, they have modeled lifetime as a function of popularity, so the two life cycles are not directly comparable. Note also that Figure 9.2 only shows the distance in days between the first and last access.
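The restriction to first-day articles is a simple filter over the article information table; a minimal, illustrative Python sketch follows (the thesis uses the PL/R script in Appendix A.33, and the start date is taken from the date named in Figure 9.6).

    # Keep only articles whose first access falls on the first log day,
    # reusing the article_info() mapping sketched above.
    import datetime

    FIRST_LOG_DAY = datetime.date(2004, 12, 7)  # first day of the VG logs

    def first_day_articles(info):
        return {artid: rec for artid, rec in info.items()
                if rec[0] == FIRST_LOG_DAY}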
Even though it is unlikely that all articles are requested on the first day of logging and then again only on the last day, we calculated the mean number of days over which these articles were accessed, which turned out to be six days. This result is not very informative in itself, but knowing that most articles have a day distance of eight days, one could imagine that most of the articles we see on day one of our logs are new articles which are accessed throughout the subsequent eight days. Then there is a subset of old articles accessed on day one of the logs which we do not see again later, and they are the ones that pull the mean down. However, all of the results we have found so far clearly show that we do not have enough material to evaluate the full lifetime of articles.

9.2 Article access distribution

Next we want to explore how accesses are distributed over articles. Through this we can get an idea of the distribution of hot documents. [16] found in their study that general web traffic followed the 90/10 rule: 90 percent of the requests were for 10 percent of the web pages. To investigate this issue for web news, we used the script in Appendix A.34 to create a graph of the cumulative access distribution for the whole week of logs in the database, Figure 9.3. As we can see, the concentration of hot documents is even stronger than 90/10 for web news: about 96 percent of the requests are for 10 percent of the articles.

Figure 9.3: VG article cumulative access distribution

Combining this with the results of the previous section, which showed that new articles are usually requested for at least eight days, it appears that new articles are very popular, but that many older articles are also accessed in the course of a week. To investigate this further, we created a graph of the mean access distribution over the whole eight-day period for the articles seen on the first day, using the script in Appendix A.35. We see from the graph in Figure 9.4 that their popularity in terms of access counts drops dramatically from the first day to the second, and then again to day three. From this we can conclude that web news does become old after just one day, and that new articles are much preferred over old ones, even though we have shown that articles continue to be requested beyond our one-week dataset.

Figure 9.4: VG access distribution of articles seen on the first day of logging
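The concentration result above comes from a simple rank statistic; the following Python sketch, an illustrative stand-in for the PL/R script in Appendix A.34, shows how the cumulative distribution behind Figure 9.3 can be computed.

    # Rank articles by access count and accumulate their share of all
    # requests, as plotted in Figure 9.3.
    def cumulative_share(access_counts):
        ranked = sorted(access_counts, reverse=True)
        total = float(sum(ranked))
        shares, acc = [], 0
        for count in ranked:
            acc += count
            shares.append(acc / total)
        return shares

    # shares[int(0.10 * len(shares)) - 1] is the fraction of all requests
    # going to the top 10 percent of articles; about 0.96 in our logs.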
Many have modeled web page popularity with Zipf and found that α had to be adjusted. The only other work we know of that has modeled article popularity specifically is [29], which also reports that article popularity differs from pure Zipf with α = 1. They also claim to get close to a pure Zipf distribution when combining articles into groups. As mentioned in Chapter 3, the way they have modeled access popularity in their graphs is somewhat misleading. One of their graphs is said to show the mean access popularity of articles over a month, and there is a corresponding graph comparing NoD article popularity to Zipf. The problem is that Zipf is time independent. By calculating the mean access popularity over a whole month, they are actually creating graphs of the probability that articles live for one month. Such an analysis therefore belongs in a lifetime investigation, but for comparison we also created a graph like theirs, Figure 9.5. We have put Zipf in our graph even though it is not comparable to what the graph actually shows; the reason is to allow a comparison of our results to those of [29]. They found their curve to be less steep than Zipf, which is the same result we get.

Figure 9.5: VG likelihood of becoming popular compared to Zipf (1 week, top 10 percent of the articles)

9.3 Article popularity

Putting aside the discussion of article popularity over time, we now continue by comparing the article popularity in our dataset to Zipf. We use the script in Appendix A.36 to create a graph of the popularity distribution on the first day of the logs, Figure 9.6. As we can see, we get a distribution for which pure Zipf is the best fit to our curve; without the log/log scale of this figure, the two would be impossible to tell apart. Kim et al., who analyze a dataset similar to ours, use 145 articles in their graphs [29]. We therefore created a similar graph with only the 150 most popular articles on the first day of the logs, Figure 9.7. When looking at the top 150 articles only, we need to adjust α to 0.7 to get the closest fit to a pure Zipf curve. We can conclude, then, that article popularity does follow Zipf, but as the subset of articles gets smaller, we need to decrease the value of α. This is a consequence of the concentration of requests to articles shown in the previous section.

Figure 9.6: VG article popularity vs. Zipf on December 7, 2004 (dotted line is pure Zipf with α = 1)

Figure 9.7: VG top 150 article popularity vs. Zipf on December 7, 2004 (dotted line is pure Zipf with α = 1)

[29] also presents an article popularity model they call Multi-selection Zipf. This model is based on the claim that clients request several articles once connected to a news server, and that there consequently exist groups of articles which can be ranked by popularity. As we have shown in the previous chapter, clients do request several articles in sessions, but for the most part they request only one article. In their own comparison of the algorithm to Zipf, they found that the fewer articles in a group, the closer the algorithm comes to Zipf. This matches what we can read from our session analysis in the previous chapter: the probability of a group containing more than one article diminishes according to Zipf. However, nothing has been said about the popularity of the groups of articles. As it turns out, they cannot know which group is the most popular, since they rank articles according to the mean access over a whole month, and Zipf is time independent. For example, there is nothing in their graphs that tells us whether a group of the top three articles on one particular day is more popular than a group consisting of the number one article from three days in a row. Since we did not investigate the relationships between articles within a session to find groups of articles, we cannot conclude anything about the popularity distribution of such groups, but neither can [29]. Therefore, we can neither verify nor invalidate Multi-selection Zipf, but we do think it needs to be analyzed in more detail.
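For reference, the Zipf comparison itself is a small computation. Below is an illustrative Python sketch; the actual graphs were produced with the PL/R script in Appendix A.36, so the function names here are assumptions.

    # Normalized Zipf frequencies R(i) proportional to i^(-alpha),
    # compared against observed, rank-ordered access counts.
    def zipf_curve(n, alpha=1.0):
        weights = [i ** -alpha for i in range(1, n + 1)]
        total = sum(weights)
        return [w / total for w in weights]

    def normalized_ranks(access_counts):
        ranked = sorted(access_counts, reverse=True)
        total = float(sum(ranked))
        return [count / total for count in ranked]

    # With all articles of one day, alpha = 1 is an almost exact fit;
    # restricted to the top 150 articles, alpha must drop to about 0.7.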
9.4 Stream objects lifetime and popularity

Finally, we also investigate the lifetime and popularity of the streaming objects, to see whether their properties are similar to those of articles or of regular video, or whether they have their own distribution characteristics entirely. Using the script in Appendix A.37, we created the same kind of table as for the articles, with information about the first and last day each object is accessed, as well as the total number of requests to each streaming object. We first calculated the minimum, maximum and mean number of days between the first and last day these objects are requested, with the results 1, 352 and 61. [24] found that once a movie enters a system, it never leaves. Since we see that some of the stream objects in our dataset are requested over the whole one-year period, streaming objects could be comparable to movies in terms of lifetime. However, we also created a graph similar to the day-distance graph for the web news articles, Figure 9.8. From this we learn that the absolute majority of stream objects are accessed on one day only. Also, from investigating how many new objects appeared each day in these logs, we found that, as with movies, not many new objects were seen each day, and on many days there were none.

Figure 9.8: NR streaming objects lifetime (day distance between first and last access to a streaming object)

In addition, we also compare streaming news popularity to Zipf. For this we used the script in Appendix A.38, which models the Zipf distribution on 2 February 2002. In Figure 9.9 we can see that the popularity of streaming news objects is also close to pure Zipf, as the web news articles were, but we need to adjust α to 0.8 to get an almost exact match. This is very similar to what we found for web news, 0.7, when we reduced the dataset; by reducing the dataset we were in effect making it more similar to our stream dataset in terms of the number of objects.

Figure 9.9: NR objects Zipf comparison

Because we do not have enough material to correctly investigate the lifetime of articles, we cannot compare lifetime between articles and stream objects. But from the knowledge that there is a substantial number of old articles in our logs, as well as a similar access popularity behavior for new objects, we could imagine a similar graph for the news articles. If this is correct, then streaming news is comparable to web news in terms of popularity distribution. A third comparison shows that there are far fewer objects requested, and far fewer new objects released each day, in the stream log than in the web log, which suggests that streaming news shares characteristics with movies. It does seem as though streaming objects exhibit their own characteristics, with similarities to both movies and articles. With our dataset, however, we cannot investigate this issue any further.
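For completeness, the day-distance statistic used at the start of this section is the same inclusive day span as for articles; here is a minimal, illustrative Python sketch (the thesis uses the PL/R script in Appendix A.37, so the names here are assumptions).

    # Inclusive day distance between the first and last access per object,
    # counting both endpoints, so an object seen on one day only gets 1.
    # Input: (objectid, day) pairs with datetime.date days.
    from collections import defaultdict

    def day_distances(requests):
        seen = defaultdict(list)
        for objectid, day in requests:
            seen[objectid].append(day)
        return {oid: (max(d) - min(d)).days + 1 for oid, d in seen.items()}

    # min, max and mean over day_distances(...).values() came out to
    # 1, 352 and 61 days for the NR streaming objects.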
Chapter 10

Conclusion

In this chapter, we summarize the work we have done and present the most important results we have found in this thesis. At the end, we outline ideas for future work within the topic of this thesis, based on some weaknesses and open questions in our analysis.

10.1 Thesis summary

In this thesis, we have investigated several different aspects of a NoD environment through analysis of log files from both the web server and the streaming server of Norway's largest online newspaper, VG. We divided the analysis into four main areas: content analysis, article access pattern analysis, stream interaction analysis, and lifetime and popularity analysis. In terms of content, we have studied the file type distribution, the access distribution, and the size distribution between and within the different object types of both the web and streaming logs. We then performed a short workload characterization of the two server logs. For article access patterns, we have analyzed the existence of sessions, the number of requests in a session, the time between requests, and the relationships between them. For streaming objects, we have investigated the distribution between partial and full requests of objects, and the distribution of the percentage viewed for objects accessed only partially. Lifetime analysis has been performed in terms of the time period in which we see requests to objects, and we have also performed popularity analysis in terms of both request distribution and comparison with the well-known Zipf popularity distribution. In addition to answering our questions, we have developed a set of applications and methods for performing this type of analysis.

10.2 Results

In this section, we discuss the results we have found from the analysis of the questions in Section 2.3. An implicit result of our work has been the development of analysis methods and tools, so we also present a discussion of these.

10.2.1 Tools development

As discussed in Chapter 4, due to the numerous different types of analysis we wanted to perform, we could not simply pick up an existing analysis tool; we had to create our own. We chose to write applications in C to extract information from the logs, format it, and import it into a database. Once the data was in the database, we implemented several scripts to perform the specific tasks of the different types of analysis above. We found this method to work very well, for several reasons. First, the simplicity of queries provides a great way of investigating answers to single questions. Second, indexes greatly speed up these queries. Last, the number of ways one can interact with a PostgreSQL database allows selecting the right tool for each job. For example, in our work we have used plpgsql scripts for most of the analysis jobs and PL/R for creating graphs of the results. In addition, there are many libraries which enable the user to interact with the database in the language of choice, for example libpq for C.

10.2.2 Content analysis

We performed separate content analyses for the web news and streaming news logs. We discuss each of them here.

Web content

For the web news content, we first examined the types of files found on the server and the distribution among them. The results showed that there are many different types of files in this environment, but those most strongly represented were clearly images and HTML documents. Among the image types we found, most were of type JPEG, followed by GIF. We also found that PNG files were not used at all.
This is the same result as has previously been found for regular web content, which suggests that web news sites exhibit the same content characteristics as the web in general. Next, we examined the median size distribution among the different formats, which is summarized in Table 10.1. In the size analysis, we found that Flash objects were much larger than any other type. After Flash objects, JPEG, GIF, HTML documents and Javascripts stood out, with JPEG being the largest of these types. However, the access analysis showed that Flash objects were hardly ever requested, so they do not place too much load on the server.

Type         Median size
JPEG         6.2KB
GIF          2.8KB
HTML         4.3KB
Javascript   3.1KB
Flash        78.9KB

Table 10.1: VG median sizes

In the access analysis, JPEG is a clear winner, accounting for the absolute majority of requests. After JPEG, GIF is a solid number two, so we see that images are by far the most requested types. In this study we also found that Javascripts were accessed almost as many times as HTML documents, even though the number of Javascripts found in the file type distribution analysis was far smaller than the number of HTML documents. Here too, we have found that requests for the different types of objects are much the same as for regular web sites. Perhaps the most interesting lesson learned, however, is that in a web news environment, Javascripts are used extensively and are an integral part of HTML documents.

Last, we also investigated the internal size distribution of the four most requested file types: JPEG, GIF, HTML and Javascripts. We found that the range of sizes for images was very large, but the absolute majority of images are small compared to what has been found in other research on regular web content. Most JPEG and GIF files were between 1 and 5KB. For GIF files there are not many images larger than 10KB, but JPEG also had a substantial number of images between 10 and 50KB. For HTML files, most objects were less than 2KB, which is similar to what has been found for the web in general. Javascript gave us the clearest internal size distribution, with most of these objects being 4KB in size. We do not know of any other work that has analyzed the size of Javascripts specifically.

Streaming content

For the streaming content, we had already learned in Chapter 2 that VG used WMV as their video format, and as such we expected to find most objects of this type. The file type distribution investigation supported this belief. We did find a range of 13 different file types, but counting the number of objects showed that WMV and WMA were almost exclusively represented; only JPEG had enough objects to be visible in the graph. From the size distribution we saw, as expected, that videos were largest. Table 10.2 shows a summary of the median sizes. After WMV came JPEG, and then WMA, which was substantially smaller than the others.

Type   Median size
WMV    1048KB
WMA    244KB
JPEG   732KB

Table 10.2: NR median sizes

The reason for this, we believe, is that the images on this server are large photographs used in image series on the VG news page, as opposed to the small photographs accompanying a headline, which are the images we saw in the content analysis of the web logs. The access analysis revealed that these JPEG images were almost never requested, so we cannot say anything more about what these images were. This analysis further showed that the only types accessed were WMA and WMV, with WMV clearly accounting for most requests.
From the internal size distribution we found that most WMA files were between 100 and 500KB, and most WMV files were between 100KB and 1MB. One similarity between the two types is that the range of sizes is huge: between 5KB and 20MB for audio files and between 1KB and 314MB for videos. Our results from the streaming content analysis correspond with what [14] found in their analysis of videos on the web.

10.2.3 Workload characterization

We only did a very simple workload characterization, in which we found that the peak hours of web news requests were at the beginning of the work day and during lunch hours. Most requests were made in the morning, with an average of 26 requests per second between 06:00 and 08:00. For streaming news, only the lunch hours stood out, with an average of 5 requests per minute. In addition, we did a study of the sizes of the log files on the web news server, where we found a self-similarity suggesting that our dataset contains representative data for all of our different investigations. We also found that the weekend logs were smaller than the weekday logs, but there were still a lot of requests for web news objects during the weekend.

10.2.4 Article access patterns

Next, we looked at access patterns for articles in the web news logs. We first analyzed the existence of sessions, where we defined a session to include all requests from one client within one hour. We did find that sessions with multiple requests exist, but for the most part only one article was requested. This is possibly the most interesting result of this thesis. The popularity of sessions decreases according to Zipf with an increasing number of requests in the session. That is, the most common access pattern is to read just one article, and the probability of a session containing one more request follows a Zipf distribution with α equal to 1.3. We also calculated the average time between requests within sessions, which we found to be 92 seconds. From the test samples we made, this number seemed representative of the reading time of an article, suggesting that the average user reads an article from beginning to end once having selected it. Finally, we also wanted to investigate reference patterns within sessions, to see if we could find any relationships between articles. We were not able to perform this study, because the log attribute that would have given us the required information was rarely used.
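To make the session definition concrete, the following Python sketch groups requests per client into one-hour sessions. It is illustrative only: the thesis uses the PL/R script in Appendix A.22, and whether the hour is measured from the session start or from the previous request is a modeling choice assumed here.

    # Group (ip, timestamp) request pairs into per-client sessions where
    # all requests fall within one hour of the session start.
    from collections import defaultdict
    from datetime import timedelta

    def sessions(requests, window=timedelta(hours=1)):
        by_ip = defaultdict(list)
        for ip, ts in requests:
            by_ip[ip].append(ts)
        result = []
        for ip, times in by_ip.items():
            times.sort()
            current = [times[0]]
            for ts in times[1:]:
                if ts - current[0] <= window:
                    current.append(ts)
                else:
                    result.append((ip, current))
                    current = [ts]
            result.append((ip, current))
        return result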
10.2.5 Stream interaction patterns

In our analysis of stream interaction, we investigated whether objects were accessed in full or partially, and also the distribution of the percentage viewed when objects were only partially accessed. We found that only about 10 percent of the requests viewed an object in full, 80 percent viewed it partially, and 10 percent were sent more than 100 percent of the object. We do not know why we get log entries with more than 100 percent sent, but arguments can be made that it is due either to commercial elements downloaded before the actual object is played out, or to overhead from protocols. Our analysis does show, however, that most objects are only viewed partially. Therefore, we also investigated how much of the objects was viewed. Here we did not find very distinct patterns, but we concluded that about 20 percent of the requests access roughly 10 percent of the objects; beyond that, the view percentage was quite uniformly distributed in our study. We do not know of any other work which has researched this for streaming objects, so we cannot compare our results to other findings.

10.2.6 Lifetime and popularity analysis

Last, we also studied the lifetime and popularity of objects. The first thing we learned was that we did not have enough material to conclude anything about the lifetime of articles. The reason is that, on average, we saw documents being requested over the whole time span of our log subset, which was one week. We did find, however, that most references to articles occur on the first day, followed by a steady decline. Further, we investigated the concentration of references, to see if there was a high concentration of hot documents. [16] found that for general web content, about 90 percent of the requests were for 10 percent of the documents, meaning that a small subset of web pages is popular. In our study of news content we found this to be even more pronounced: about 96 percent of the requests were for 10 percent of the documents. Combined with the decline in requests after the first day, this suggests that new articles are created faster than new pages on the web in general, and that recent articles are favored over old ones.

Next, we compared article popularity to the Zipf distribution model. When we used all articles requested on one day, we found pure Zipf with α equal to one to be indistinguishable from our results. When we narrowed the subset down to the 150 most accessed articles, we had to adjust α to 0.7. As noted in Chapter 3, many have applied Zipf to requests from both web and proxy servers with roughly the same values for α as we get. However, from applying Zipf to the set of all articles on a news server, we find that web news popularity is closer to Zipf than that of regular web pages.

We also studied the lifetime and popularity of the streaming objects. From the popularity analysis, we found that, as with web news, streaming news is also accessed the most on the first day. When comparing to Zipf, we found that α equal to 0.8 gave a close fit, which is similar to web news when the dataset was reduced. However, the lifetime in terms of the distance between the first and last day seen was very large, which is similar to movies; [24] states that once movies enter a system, they never go out. We cannot compare the lifetime of web news and streaming news, though, since we do not have enough log material to conclude anything about the lifetime of articles. Stream objects do seem to be similar to movies in terms of how often they are released, but similar to web news in terms of access behavior and popularity.

10.3 Future work

In this thesis, we have analyzed several aspects of NoD environments and their objects. There is still much to learn from the logs we have received, and there is also much we cannot learn from our dataset. We believe that NoD will become a popular trend in the not too distant future, and as such it needs to be researched in much more detail. We therefore present some ideas for future research topics.

Content is changing rapidly, and for our streaming dataset it could be that the streaming content is already becoming dated. The difference is unlikely to be too great, since our analysis shows that video sizes are much the same as in a study of videos on the web from 1997 [14]. But this could be due to improvements in video codecs as much as to similarity in content.
Also, streaming news is a trend on the rise, so the formats and content of these files can change considerably over a short time period. Therefore, it is important to follow up with an analysis of these objects as time passes. It would also be a good idea to compare streaming objects from several different types of news sites, for example between newspapers and TV stations.

For access pattern analysis, it would be interesting to investigate grouping of articles instead of what we tried with reference patterns. To do this, one must match all sessions containing more than one request to see if many sessions contain the same group of articles. Also, a comparison of the time between requests in a session versus the size of the objects in the session could yield beneficial results for prefetching techniques like rate-controlled prefetching [22]. In addition, it would be interesting to see how session characteristics change during the day, for example whether sessions between 07:00 and 09:00, when people start working, contain several requests, while sessions later in the day contain only one request.

Subject to future analysis in terms of interaction analysis would be an investigation of the percentage of an object viewed compared to its age, to see if there is any relationship one could devise a pattern from. Such a finding would greatly aid caching techniques like prefix caching, where the prefix of an object could be dynamically changed over time.

In our study, we have done a simple lifetime analysis based on the distance in days between the first and last day we see a reference to an article. Lifetime analysis should be extended to include lifetime in terms of popularity, that is, investigating the change in Zipf popularity ranking of objects over time. If groups of articles are found through access pattern analysis, these groups should also be compared to Zipf. This would provide a basis for further investigation of the Multi-selection Zipf algorithm [29]. Another relationship for future work to explore is how many articles the popular articles, and also the referrer articles, are linked from. From this we can learn whether there is a relationship between the number of ways to access an article and its popularity.

Bibliography

[1] AWStats - free real-time logfile analyzer to get advanced statistics. http://awstats.sourceforge.net/.
[2] Burn all GIFs. http://burnallgifs.org/.
[3] The GIF controversy: A software developer's perspective. http://cloanto.com/users/mcb/19950127giflzw.html.
[4] Graphics formats for web pages. http://amath.colorado.edu/computing/graphics/.
[5] libpq - the C application programmer's interface to PostgreSQL. http://www.postgresql.org/docs/7.4/interactive/libpq.html.
[6] Norsk Regnesentral. http://www.nr.no/.
[7] NRK.no. http://www.nrk.no/.
[8] PL/R - R procedural language for PostgreSQL. http://www.joeconway.com/plr/.
[9] PostgreSQL database management system. http://www.postgresql.org/.
[10] The Python programming language. http://www.python.org/.
[11] The R project for statistical computing. http://www.r-project.org/.
[12] VG Nett. http://www.vg.no/.
[13] Webalizer web server log file analysis program. http://www.webalizer.net/.
[14] S. Acharya and B. C. Smith. Experiment to characterize videos stored on the Web. In Multimedia Computing and Networking 1998 (Kevin Jeffay, Dilip D. Kandlur, Timothy Roscoe, eds.), Proc. SPIE Vol. 3310, pages 166-178, Dec. 1997.
[15] V. Almeida, A. Bestavros, M. Crovella, and A. de Oliveira. Characterizing reference locality in the WWW. In Proceedings of the IEEE Conference on Parallel and Distributed Information Systems (PDIS), Miami Beach, FL, 1996.
[16] M. F. Arlitt and C. L. Williamson. Web server workload characterization: The search for invariants. In Measurement and Modeling of Computer Systems, pages 126-137, 1996.
[17] H. Bahn, Y. H. Shin, and K. Koh. Analysis of Internet reference behaviors in the Korean Education Network. Lecture Notes in Computer Science, 2105:114-??, 2001.
[18] P. Barford, A. Bestavros, A. Bradley, and M. Crovella. Changes in web client access patterns: Characteristics and caching implications. Technical Report 1998-023, 1998.
[19] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM (1), pages 126-134, 1999.
[20] L. D. Catledge and J. E. Pitkow. Characterizing browsing strategies in the World-Wide Web. Computer Networks and ISDN Systems, 27(6):1065-1073, 1995.
[21] Computerworld. Nettaviser i toppen. http://www.computerworld.no/index.cfm/bunn/artikkel/id/50177.
[22] M. E. Crovella and P. Barford. The network effects of prefetching. In Proceedings of Infocom '98, pages 1232-1240, Apr. 1998.
[23] C. Cunha, A. Bestavros, and M. Crovella. Characteristics of World Wide Web client-based traces. Technical Report BUCS-TR-1995-010, Boston University, CS Dept, Boston, MA 02215, April 1995.
[24] C. Griwodz, M. Bar, and L. C. Wolf. Long-term movie popularity models in video-on-demand systems: or the life of an on-demand movie. In MULTIMEDIA '97: Proceedings of the fifth ACM international conference on Multimedia, pages 349-357. ACM Press, 1997.
[25] S. Gruber, J. Rexford, and A. Basso. Design considerations for an RTSP-based prefix-caching proxy for multimedia streams. Technical Report 990907-01, AT&T Labs Research, September 1999.
[26] T. Hafsoe. Automatic Route Maintenance in QoS Aware Overlay Networks. PhD thesis, University of Oslo, 2006. Work in progress.
[27] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation and Modelling. John Wiley and Sons, 1991.
[28] F. T. Johnsen, C. Griwodz, and P. Halvorsen. Structured partially caching proxies for mixed media. In WCW 2004, LNCS 3293, pages 144-153. Springer-Verlag Berlin Heidelberg, 2004.
[29] Y.-J. Kim, T. U. Choi, K. O. Jung, Y. K. Kang, S. H. Park, and K.-D. Chung. Clustered multi-media NOD: Popularity-based article prefetching and placement. In IEEE Symposium on Mass Storage Systems, pages 194-202, 1999.
[30] M. Nelson. The Data Compression Book. Henry Holt and Co., Inc., New York, NY, USA, 1991.
[31] G. Peng. CDN: Content distribution network. http://citeseer.ist.psu.edu/peng03cdn.html.
[32] M. Rabinovich and O. Spatscheck. Web Caching and Replication. Addison Wesley, 2002.
[33] V. Sawant. Zipf law. http://www.cs.unc.edu/~vivek/home/stenopedia/zipf/.
[34] W3Schools. Windows multimedia formats. http://www.w3schools.com/media/media_windowsformats.asp.
[35] J. Sedayao. "Mosaic will kill my network!" - studying network traffic patterns of Mosaic use. In Electronic Proceedings of the Second World Wide Web Conference '94: Mosaic and the Web, 1994.
[36] Statistisk Sentralbyrå. Internett-målingen 2002. http://www.ssb.no/emner/10/03/nos_c737/nos_c737.pdf.
[37] A. Woodruff, P. M. Aoki, E. Brewer, P. Gauthier, and L. A. Rowe. An investigation of documents from the World Wide Web. Computer Networks and ISDN Systems, 28(7-11):963-980, 1996.
Appendix A

Source Code

In this appendix, we list the applications we have developed and give a short explanation of each. All source code is stored on a CD distributed with this thesis.

A.1 create-stream-tables.c

This C program parses the logs from NR into the tables discussed in Chapter 5. The source code can be found in the nr/progs/ directory on the CD.

A.2 insert-web-logs.c

This C program parses the VG logs, looping through one log line at a time and inserting only article requests into a database table. The source code can be found in the vg/progs/ directory on the CD.

A.3 extract-typesize.c

This C program extracts the sizes of new objects into separate files for each new type found. The source code can be found in the vg/progs/ directory on the CD.

A.4 vgmimetypedist.R

This R script creates a histogram of given values, in this case the count of objects per MIME type in the web news logs from VG. It is executed with the command source("mimetypedist.R") in the R environment. The source can be found in the vg/scripts/ directory on the CD.

A.5 vgfiletypedist.R

This R script creates a histogram of given values, in this case the count of objects per file type for selected types in the web news logs from VG. The source can be found in the vg/scripts/ directory on the CD.

A.6 sortsizes.py

This Python script sorts the numbers in a text file in ascending order. The source can be found in the vg/python/ directory on the CD.

A.7 vgmediansizedist.R

This R script creates a histogram of the median size of selected file types in the web news logs from VG. The source can be found in the vg/scripts/ directory on the CD.

A.8 vgaccessdist.R

This R script creates a histogram of the access counts for selected file types in the web news logs from VG. The source can be found in the vg/scripts/ directory on the CD.

A.9 createR1ktable.py

This Python script collects size entries from a file into 1KB buckets and writes them to a new file. The source can be found in the vg/python/ directory on the CD.

A.10 graphscript-jpg-log.R

This R script creates a histogram of the contents of a table created by Script A.9, in this instance the JPEG table. It is run with the command source("graphscript-log-jpg") in the R environment. The source can be found in the vg/scripts/ directory on the CD, along with the scripts performing the same task for the GIF, HTML and Javascript tables.

A.11 nrdosls-parser.c

This C program fixes the format of the NR server listing of objects into a format the PostgreSQL database can understand. It also adds a type field. This program left a leading space in the names, so matching in the database did not work; we created another C program, fixnrdosls-parse.c, to fix this problem. Both sources can be found in the nr/progs/ directory on the CD.

A.12 nrfiletypedist.plr

This PL/R script counts all entries of specific types and creates a histogram of the results. The source can be found in the nr/scripts/ directory on the CD.

A.13 nrmediansizedist.R

This R script creates a histogram of the median size of selected file types in the streaming news logs from NR. The source can be found in the nr/scripts/ directory on the CD.

A.14 nr-map-dosls-to-objects.plr

This PL/R script matches names of objects from the server list database table to the streaming log objects table and updates the type field where it finds a match.
The source can be found in the nr/scripts/ directory on the CD.

A.15 nr-map-objects-to-accesses.plr

This PL/R script matches IDs from the NR logs objects table to the access table and updates the type field where it finds a match. The source can be found in the nr/scripts/ directory on the CD.

A.16 nraccessdist.R

This R script creates a histogram of the access counts for selected file types in the streaming news logs from NR. The source can be found in the nr/scripts/ directory on the CD.

A.17 nrgraphscript-wmv.R

This R script creates a histogram of the internal size distribution of WMV files on the streaming server. Similar scripts were used to create the graphs for WMA, JPEG and ASF files. The sources can be found in the nr/scripts/ directory on the CD.

A.18 vg-graph-workload.plr

This PL/R script creates a graph of the workload of the VG news server on December 8, 2004. The source can be found in the vg/scripts/ directory on the CD.

A.19 nr-graph-workload.plr

This PL/R script creates a graph of the workload of the NR news server on February 6, 2002. The source can be found in the nr/scripts/ directory on the CD.

A.20 vg-graph-avg-number-of-timesprday-cip-is-seen.plr

This PL/R script creates a graph of the average number of times we see the same IP per day. The source can be found in the vg/scripts/ directory on the CD.

A.21 count-avg-time-between-request-prip-prday.plr

This PL/R script calculates the average time between requests per IP per day. It also records the findings in a new table. The source can be found in the vg/scripts/ directory on the CD.

A.22 create-vgsession-table.plr

This PL/R script creates the session table for web news articles. The source can be found in the vg/scripts/ directory on the CD.

A.23 create-sessions-requests-table.plr

This PL/R script creates a table recording the number of requests per session. The source can be found in the vg/scripts/ directory on the CD.

A.24 graph-sessionrequest-table.plr

This PL/R script creates a graph of the information in the table created by the previous script. The source can be found in the vg/scripts/ directory on the CD.

A.25 find-avg-time-between-requests-within-session.plr

This PL/R script calculates the average time between requests in a session. The source can be found in the vg/scripts/ directory on the CD.

A.26 create-access-viewstat-table.plr

This PL/R script creates the view statistics table for streaming objects where both the initial size and the bytes sent are known. The source can be found in the nr/scripts/ directory on the CD.

A.27 create-object-howviewed-table.plr

This PL/R script creates the streaming objects view summary table. The source can be found in the nr/scripts/ directory on the CD.

A.28 createRViewTable.py

This Python script counts requests accessing less than 100 percent of an object into buckets of 10 percent. The source can be found in the nr/python/ directory on the CD.

A.29 nrgraphviewscript.R

This R script creates a histogram of requests accessing 10 percent of a streaming object, 20 percent of an object, and so on. The source can be found in the nr/scripts/ directory on the CD.

A.30 nrgraphviewscript-cumulative.R

This R script creates a graph of the cumulative access percentage of requests to streaming objects. The source can be found in the nr/scripts/ directory on the CD.
A.31 populate-vgartinfo.plr

This PL/R script fills in the information in Table 9.1. The source can be found in the vg/scripts/ directory on the CD.

A.32 graph-avg-day-distance.plr

This PL/R script creates a graph of the average distance between the first and last day all articles are seen. The source can be found in the vg/scripts/ directory on the CD.

A.33 graph-avg-day-distance-firstdayarts.plr

This PL/R script creates a graph of the average distance between the first and last day that articles from the first day of logging are seen. The source can be found in the vg/scripts/ directory on the CD.

A.34 graph-cumulative-access-frequency.plr

This PL/R script creates a graph of the cumulative access frequency for the whole week of web news logs. The source can be found in the vg/scripts/ directory on the CD.

A.35 graph-cumulative-access-frequency-firstday.plr

This PL/R script creates a graph of the cumulative access frequency of only those articles seen on the first day of logging, over the whole week of web news logs. The source can be found in the vg/scripts/ directory on the CD.

A.36 graph-pop-zipf-firstday.plr

This PL/R script creates a popularity distribution graph for article requests and compares it to Zipf. The source can be found in the vg/scripts/ directory on the CD.

A.37 create-nrobjectinfo-table.plr

This PL/R script creates the object info table for streaming objects, recording the first and last day of access and the total number of requests to each object. The source can be found in the nr/scripts/ directory on the CD.

A.38 nr-graph-pop-zipf.plr

This PL/R script compares streaming object requests to the Zipf popularity distribution. The source can be found in the nr/scripts/ directory on the CD.