Analysis of News-On-Demand Characteristics and Client Access

University of Oslo
Department of Informatics
Espen Nilsen
Master Degree Thesis
April 26, 2005
World Wide Web services continue to grow along with the number of
clients connecting to the Internet and the transfer rates of their connections.
News is one of the main uses of these services today, yet it is an
area which has not received much attention from the research community.
In this thesis, we investigate several aspects of news on demand (NoD)
services on the Internet today. We analyze log files of a news server and a
streaming server from Norway’s largest online newspaper, Verdens Gang
(VG). Our focus is on the content in a NoD environment, user behavior
with the content, and object popularity in terms of both news articles and
streaming objects. The most central topics we investigate are types of files
on these servers, size distribution, access and interaction patterns, object
lifetime, and whether the Zipf popularity distribution applies in this scenario.
I would like to thank my supervisors, PhD student Frank Johnsen,
Prof. Dr. Thomas Plagemann and Dr. Carsten Griwodz, at the Department
of Informatics, University of Oslo.
I would also like to thank Anders Berg at Verdens Gang (VG) for providing
us with article logs and Svetlana Boudko, Knut Holmqvist and Wolfgang
Leister at Norsk Regnesentral (NR) for providing us with streaming logs.
This document is a Thesis presented to
The Department of Informatics
University of Oslo.
In partial fulfillment of the Requirements for the Degree
Master of Science in Informatics
University of Oslo, Department of Informatics
April 26, 2005
Espen Nilsen
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Goals
1.3 Methods
1.4 Thesis Overview
2 Background
2.1 Web news application
2.2 Streaming news application
2.3 List of questions
2.3.1 Content analysis questions
2.3.2 Article access patterns questions
2.3.3 Stream interaction patterns questions
2.3.4 Lifetime and popularity analysis questions
2.4 Dataset
3 Related Work
3.1 Content analysis
3.2 Article access patterns
3.3 Stream interaction patterns
3.4 Lifetime and popularity analysis
4.1 Requirements
4.2 PostgreSQL
4.3 R and PL/R
4.4 C, Python and libpq
4.5 Environment
4.6 Setup requirements
5 Design and Implementation
5.1 Content analysis
5.1.1 Web content
5.1.2 Stream content
5.2 Lifetime and popularity analysis
5.3 Article access pattern analysis
5.4 Stream interaction analysis
5.5 Database design
5.5.1 Stream logs
5.5.2 Web logs
5.6 Database implementation
5.6.1 Stream logs
5.6.2 Web logs
6 Web Content Analysis
6.1 Preparation
6.2 File types and distribution
6.3 Size and access distribution
6.4 Internal size distribution
7 Streaming Content Analysis
7.1 Preparation
7.2 File types and distribution
7.3 Size distribution
7.4 Access distribution
7.5 Internal size distribution
8 User Behavior
8.1 Workload characterization
8.2 Web news sessions
8.3 Web news reference patterns
8.4 Stream interaction patterns
9 Lifetime and Popularity
9.1 Article lifetime analysis
9.2 Article access distribution
9.3 Article popularity
9.4 Stream objects lifetime and popularity
10 Conclusion
10.1 Thesis summary
10.2 Results
10.2.1 Tools development
10.2.2 Content analysis
10.2.3 Workload characterization
10.2.4 Article access patterns
10.2.5 Stream interaction patterns
10.2.6 Lifetime and popularity analysis
10.3 Future work
A Source Code
A.1 create-stream-tables.c
A.2 insert-web-logs.c
A.3 extract-typesize.c
A.4 vgmimetypedist.R
A.5 vgfiletypedist.R
A.6
A.7 vgmediansizedist.R
A.8 vgaccessdist.R
A.9
A.10 graphscript-jpg-log.R
A.11 nrdosls-parser.c
A.12 nrfiletypedist.plr
A.13 nrmediansizedist.R
A.14 nr-map-dosls-to-objects.plr
A.15 nr-map-objects-to-accesses.plr
A.16 nraccessdist.R
A.17 nrgraphscript-wmv.R
A.18 vg-graph-workload.plr
A.19 nr-graph-workload.plr
A.20 vg-graph-avg-number-of-timesprday-cip-is-seen.plr
A.21 count-avg-time-between-request-prip-prday.plr
A.22 create-vgsession-table.plr
A.23 create-sessions-requests-table.plr
A.24 graph-sessionrequest-table.plr
A.25 find-avg-time-between-requests-within-session.plr
A.26 create-access-viewstat-table.plr
A.27 create-object-howviewed-table.plr
A.28
A.29 nrgraphviewscript.R
A.30 nrgraphviewscript-cumulative.R
A.31 populate-vgartinfo.plr
A.32 graph-avg-day-distance.plr
A.33 graph-avg-day-distance-firstdayarts.plr
A.34 graph-cumulative-access-frequency.plr
A.35 graph-cumulative-access-frequency-firstday.plr
A.36 graph-pop-zipf-firstday.plr
A.37 create-nrobjectinfo-table.plr
A.38 nr-graph-pop-zipf.plr
List of Figures
VG main page - top
VG main page - bottom
VG video page
VG video player
VG log sample entries
NR log sample entries
VG mime type distribution
VG file type distribution
VG median size distribution
VG file type access distribution
VG JPEG size distribution
VG GIF size distribution
VG HTML size distribution
VG Javascript size distribution
NR file type distribution
NR median size distribution
NR access distribution
NR WMA size distribution
NR WMV size distribution
NR JPEG size distribution
NR ASF size distribution
VG server workload
NR server workload
VG log size comparison
VG mean number of times IP is seen pr day
VG number of sessions with x number of requests
NR access view percentage
NR access view percentage distribution for partial accesses
NR cumulative access view percentage
VG article lifetime of all articles
VG article lifetime of articles seen first day of logging
VG article cumulative access distribution
VG access distribution of articles seen first day of logging
VG likeliness of becoming popular compared to Zipf week, top 10 percent of the articles)
VG article popularity vs. Zipf
VG top 150 article popularity vs. Zipf
NR streaming objects lifetime
NR objects Zipf comparison
List of Tables
VG original log format
NR original log format
NR directory listing example
NR log object attributes
NR log client attributes
NR log access attributes
VG articles requests table
VG file type distribution
NR directory listing example with type field
NR server list table
VG average request timing table
VG session table
NR access statistics table
NR view statistics table
VG article information table
10.1 NR median sizes
10.2 NR median sizes
Chapter 1
Introduction
As the popularity of the World Wide Web continues to increase, we have
also seen an increasing popularity of multimedia objects on the Internet. We
see that people are moving more and more towards a multimedia oriented
way of communication instead of the traditional text oriented way. This
new area of usage of the web brings new content to the Internet in the
form of media streams and dynamic content, which are different from
static HTML pages and images. The difference in characteristics as well
as the impact on the network of these new methods of communication
are issues that need to be explored. One area in which not much research
has been conducted is news on demand (NoD). We anticipate that
NoD will become an important part of the Internet, and as such needs to
be investigated in more detail.
In this chapter, we first discuss the motivation behind our work, and
introduce some concepts and ideas the reader should know about. Then we
talk about our goals as well as the methods we use. In the end, we present
the reader with an outline of the rest of this thesis.
1.1 Motivation
The INSTANCE II (Intermediate Storage Node Concept)¹ project of the Distributed Multimedia Research Group (DMMS) at the University of Oslo
aims to develop new solutions for next-generation multimedia distribution infrastructure that minimize response time, resource requirements and cost. Research is being conducted on network infrastructure,
caching and operating system kernel enhancements.
¹ This thesis has been performed in the context of the INSTANCE II project, which is
funded by the Norwegian Research Council’s IKT-2010 Program, Contract No. 147426/431.
For the network infrastructure we are using an overlay network in a
multi-ISP hosting content distribution network (CDN). A CDN provides
a method to improve Internet quality and user perceived quality through
replication of content from the origin server to servers close to the clients
requesting the content [31]. A multi-ISP CDN is a network where a
standalone company has servers located in the backbone of many ISPs,
which allow for large networks on which one can provide content on a
global scale. A hosting CDN means that both origin servers and proxies
are part of the same infrastructure, which is favorable since they allow for
retrieval and coordination of information from all points of the network
infrastructure [32]. On top of the fixed infrastructure of the backbone
servers we use an overlay network which is a distribution tree overlaid
on the existing IP network. In addition to allowing for easier configuration of
the network infrastructure, overlay networks also allow for configuration
changes in response to changes in the desired quality of service (QoS) classes. By
QoS we are not only talking about traditional metrics such as delays and
error rates, but rather all characteristics of a distributed multimedia service
defining requirements for a multimedia application. The overlay network
in INSTANCE II has been constructed to automatically reconfigure itself
depending on different QoS parameters specified and properties of the
underlying network [26].
An important part of an efficient distribution system is the use of
caching and prefetching of popular objects. We are researching a new
caching system, structured partial caching [28], which is tailored to
caching of NoD objects in a network infrastructure as described above.
One important aspect of NoD is the use of inter-related files of different
kinds, both continuous and discrete media. This provides users with more
advanced interactivity than that of VoD, and in addition, clients themselves
can have varying capabilities ranging from PCs to PDAs and mobile
phones. Another aspect of NoD data arising from the use of mixed media
is structure between objects. There are two types of structure, internal
and external. Internal structure is structure inherent in a media type,
such as layered video or progressive JPEG pictures. External structure
defines the relationship between different elements of possibly different
media types, such as their layout and composition. Our partial caching
algorithm assumes that structure is defined in a presentation plan. These
are documents residing on the servers describing the content of media
elements, their composition and from that, the ways in which the different
elements can be divided and served.
Unlike video on demand (VoD) there has not been much research
specifically targeting NoD. We assume NoD will become increasingly
popular in the future and as such needs to be researched further. This
assumption is also supported by [17], who found in their analysis of Internet
reference behavior in Korea that the majority of requests were for news sites.
In addition, in Norway today we see that online newspapers are gaining
more and more users and that paper editions are losing ground [21]. Of
the 10 most read newspapers in Norway, three are online. This trend reflects
how the Internet is maturing and how people are growing accustomed to
using and interacting with online content. This fact further emphasizes the
importance of researching issues related to NoD, Media On Demand, and
Content Distribution Networks.
1.2 Goals
This thesis is an analysis of both content and user behavior in a NoD
environment in order to aid our further understanding of a NoD scenario.
Our goals can be divided into four main areas of focus: content analysis,
article access pattern analysis, stream interaction analysis, and lifetime and
popularity analysis.
First, we research what kind of content exists on a news server, what
type of files we can find, and the distribution among them. We also
compare the number of accesses between the specific types to see if some
are used more than others. In addition to investigating which file types
exist and which are most accessed, we also look at the size distribution
both between and within the specific types and compare this to what has
been found for regular web content.
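The content questions above can be illustrated with a minimal Python sketch. It assumes the logs have already been reduced to one record per file, holding a file type, a size in bytes and an access count. The function name `summarize_by_type` and the sample values are hypothetical and are not taken from the VG or NR data.

```python
from collections import defaultdict
from statistics import median


def summarize_by_type(records):
    """Summarize (file_type, size_in_bytes, access_count) records per file type.

    Returns, for each type, its share of all files, its share of all
    accesses, and the median file size within the type.
    """
    sizes = defaultdict(list)
    accesses = defaultdict(int)
    for file_type, size, hits in records:
        sizes[file_type].append(size)
        accesses[file_type] += hits

    total_files = sum(len(type_sizes) for type_sizes in sizes.values())
    total_hits = sum(accesses.values())

    summary = {}
    for file_type, type_sizes in sizes.items():
        summary[file_type] = {
            "file_share": len(type_sizes) / total_files,
            "access_share": accesses[file_type] / total_hits if total_hits else 0.0,
            "median_size": median(type_sizes),
        }
    return summary


# Hypothetical records for illustration only.
sample_records = [
    ("gif", 1200, 50),
    ("gif", 800, 30),
    ("jpeg", 15000, 10),
    ("html", 40000, 10),
]
summary = summarize_by_type(sample_records)
```

The median is used here rather than the mean because a few very large files can dominate a mean; whether that choice suits a given dataset is a judgment call.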
Next, we investigate user behavior in a NoD environment. We start
with a short workload analysis to investigate roughly how many users
connect to our servers each day, and how many requests are served each
day. Then we continue with more specific analysis for each news log. For
web news, this includes access patterns such as the number of articles users
request while connected to the server, i.e. whether users request several articles
in one session or usually read only one specific article. If the usual
pattern is to request several articles, we also want to see if we can find some
relationship between requests in these sessions, and to investigate
the time between those requests. For streaming news, we look at how users
interact with the objects in terms of how many are accessed in full and how
many are only accessed partially. For those that are only accessed partially,
we also investigate the access percentage distribution and see if we can deduce
any patterns from that.
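One common way to make the notion of a session concrete is to group the requests of each client and start a new session whenever the gap between two consecutive requests exceeds a timeout. The following Python sketch illustrates that idea only. The 30-minute timeout, the use of the client IP address as the client identifier, and the function names are assumptions made for this example.

```python
from collections import defaultdict

# Assumed inactivity threshold in seconds; an illustrative value only.
SESSION_TIMEOUT = 30 * 60


def split_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (client_ip, unix_timestamp) pairs into per-client sessions.

    Returns {client_ip: [[t1, t2, ...], ...]} with one inner list per session.
    """
    by_client = defaultdict(list)
    for ip, timestamp in requests:
        by_client[ip].append(timestamp)

    sessions = {}
    for ip, stamps in by_client.items():
        stamps.sort()
        current = [stamps[0]]
        client_sessions = [current]
        for timestamp in stamps[1:]:
            if timestamp - current[-1] > timeout:
                # Gap too long: close the session and open a new one.
                current = [timestamp]
                client_sessions.append(current)
            else:
                current.append(timestamp)
        sessions[ip] = client_sessions
    return sessions


def mean_gap(session):
    """Mean time between consecutive requests in a session, or None."""
    if len(session) < 2:
        return None
    return (session[-1] - session[0]) / (len(session) - 1)


# Hypothetical requests for illustration only.
sample_requests = [
    ("10.0.0.1", 0), ("10.0.0.1", 60), ("10.0.0.1", 180),
    ("10.0.0.1", 5000), ("10.0.0.2", 100),
]
sessions = split_sessions(sample_requests)
```

Note that an IP address may hide several users behind a proxy, so sessions found this way are only an approximation of individual user behavior.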
Finally, we explore the lifetime and popularity of both web news
and streaming news and compare this to what others have found in their
research. By lifetime we mean the number of days between
the first and last access to an object. For popularity we investigate the
distribution of accesses to specific objects. From this we can see if there is a
small concentration of documents which account for most of the requests,
commonly referred to as hot documents. We also look at how the access
distribution evolves over the lifetime of an article. In addition, the Zipf
distribution is a widely used model of popularity, so we also check
whether the Zipf popularity distribution can be applied to our dataset. On this topic,
we also compare web news lifetime and popularity with that of streaming
news to see if there is some similarity between the two types of news.
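To make the two measures concrete: lifetime is the day distance between the first and last access, and under a Zipf-like distribution the number of accesses to the object of popularity rank r is roughly proportional to 1/r^α, with α close to 1. The Python sketch below is one rough way to compute both, offered as an illustration under stated assumptions. The least-squares fit on a log-log scale is a simple check, not a rigorous estimator, and the function names and sample values are hypothetical.

```python
import math
from datetime import date


def lifetime_days(access_dates):
    """Number of days between the first and the last access to an object."""
    return (max(access_dates) - min(access_dates)).days


def zipf_exponent(access_counts):
    """Estimate alpha in count ~ C / rank**alpha by a log-log least-squares fit."""
    counts = sorted((c for c in access_counts if c > 0), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two objects with accesses")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The slope of the fitted line is negative; alpha is its magnitude.
    return -sxy / sxx


# Hypothetical access dates for one object.
example_lifetime = lifetime_days(
    [date(2004, 12, 7), date(2004, 12, 10), date(2004, 12, 8)]
)

# Synthetic counts that follow 1/rank exactly, so alpha should be about 1.
example_alpha = zipf_exponent([1000 / rank for rank in range(1, 51)])
```

An estimated exponent far from 1, or a log-log plot that bends clearly away from a straight line, would suggest that a pure Zipf model does not describe the data well.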
1.3 Methods
We perform a theoretical analysis of content and user characteristics in the
more general area of the Internet as a whole, as well as the more specialized
area of NoD. This is done through literature study to familiarize ourselves
with the topics. Knowledge of both areas is important, since in addition
to comparing our results to what others have found for datasets similar to
ours, we can also compare the characteristics, content and usage of NoD to
the web in general.
Further, we acquired logs for both web news and streaming news from
the largest online newspaper in Norway, Verdens Gang (VG) [12]. We use
systems design and implementation to create our own tools to prepare this
data for further analysis. Then, we use statistical analysis to research our
dataset, as well as graphical representation where this is warranted.
1.4 Thesis Overview
In the remainder of this thesis, we introduce some background information
on the data we are analyzing, the applications in which the data is made
available to the users, and present a list of the most important questions
we are researching in Chapter 2. Chapter 3 discusses some related work,
what others have done and how it relates to our project. Chapter 4 outlines
our work environment, the tools we have used and developed and the
reasons they have been chosen, and Chapter 5 explains the design and
implementation of our tools in detail. Chapters 6 through 9 present both
analysis methods and results from researching the questions outlined in
Chapter 2. Chapter 10 summarizes the results and concludes our work,
ending with a presentation of ideas for further research on the different
topics in this thesis.
Chapter 2
Background
In this chapter, we first introduce the reader to the applications which
present the data we are analyzing to the users. We show what information
we can expect to find through both available content and the way users
can interact with the applications. Then, we introduce a list of questions
we are researching, formulated from the goals in the previous chapter. This
chapter concludes with a presentation of our dataset and the format in
which it has been acquired.
2.1 Web news application
We received log files from a web news server and a streaming news server
from Norway’s largest online newspaper VG. To investigate what we can
expect to learn from the log analysis we first look at the applications in
which the content is made available to users. This tells us how objects are
presented to the clients and can aid our understanding of how they interact
with the objects. In this section we present the web news application and
in the next section we look at the streaming news application.
Figure 2.1 shows a screen shot of the top of the main page on VG’s
web server. We see that articles are presented to the user in two columns
except for the first article at the top of the page. As with paper newspapers,
the article believed to be the most important is given special notice. In a
paper newspaper this article usually occupies a significant part of the front
page. In our online environment, this article spans across both columns of
articles, and it is also larger in size both textually and graphically. This way,
the most recent or most important piece of news is presented clearly to the
user. From this, there is reason to expect that most requests will be for this
article, and that the newest articles are the most popular.
We also see a group of different categories on the left side which the
clients can use to access news on a specific topic. By collecting articles
in different categories and making these categories easily available from
the main page, the users have a lot of articles available within just one to
two clicks from the main page. Therefore, even though news articles are
created, published and become old quite fast (compared to e.g. movies),
older articles can still be readily available to the users. In addition, at the
bottom of the main page we again find references to the newest articles
in specific categories (Figure 2.2). This tells us that even though news is
updated quickly and articles are pushed out of the main columns, where pictures
accompany headlines to attract the user’s attention, articles can still be
available from the main page for some time. To conclude the discussion
about availability of articles, we also note that there is a search function
conveniently placed in the upper right corner of the main page where
clients can search for old articles that are in the archive.
Figure 2.1: VG main page - top
Figure 2.2: VG main page - bottom
Further, we see that there are a lot of images present on the main
page. There are photos connected to almost every article heading in the
main article columns. In addition to photos connected to the headlines of
articles, graphics are used extensively as a layout mechanism to distinguish
different parts of the page. However, there are no really large images on the
main page. The largest image is the photograph connected to the top news
story, but even this is not very large compared to regular photos taken with
ordinary digital cameras. There are also a lot of commercial elements on
the main page in the form of images, banner ads and flash objects.
Finally, while browsing through a couple of pages on this server, we
find that the layout of the pages is very similar, and a number of elements
are reused. This means that caching can effectively reduce the transfer of
objects between subsequent requests to new pages.
From investigating the web news application, we have found that there
are a number of different elements that make up a news web page. They
seem to be a lot more complex than what is the case for most regular
web pages, in terms of both the amount of information they provide and
as a result the composition of the HTML pages needed to organize all
this information in a user friendly way. For this reason, it is important to
analyze the different elements specifically for news sites.
2.2 Streaming news application
In this section, we investigate the applications in which streaming news
objects are presented to the users. There are several different ways users can
access these objects. First, there is a link to a movie page in the categories
section on the left side of the main page. This link takes the user to a page
which looks much like the main page, only listing video news clips and not
articles, as shown in Figure 2.3.
Figure 2.3: VG video page
From this page, we learn a couple of things. First, it is specified in the
links to each individual news clip that the video files are Windows Media
types. Reading further it is explicitly stated that Windows Media Player
is needed to watch their videos, preferably Windows Media Player 9. The
server is a Microsoft server, and in our initial conversation with VG they
mentioned that they had tried to get their pages and players to cooperate
with other browsers in addition to Internet Explorer. They gave up on this
because the majority of Internet Explorer users was so overwhelming
that they did not feel any responsibility to try to accommodate the small
percentage of other clients. Given this decision, it is not so surprising that
we see a trend towards Microsoft formats in their services and content. On
this page they also inform the user that JavaScript has to be enabled in the
browser, which means that we should also find JavaScript files in our content.
Another way users can access videos is by clicking on a small camera icon
that sometimes accompanies the lead text of news items presented on
the main news web page. By clicking either this or one of the links on
the video page described above, the user is presented with a video player
showing the selected news clip, as Figure 2.4 shows.
Figure 2.4: VG video player
By studying the video player more closely we see that on the right, there
is a list of more videos that can be accessed. One could imagine that once in
the video player environment, users request more than just the one video
they initially wanted to see. Further, we see that here too we find categories
which contain videos on the same topic. Also, the list of videos contains the dates
on which the videos were created, and from this we see that the streaming news
videos in our dataset are not created as fast as the web news. There are
usually no more than a couple of days between the creation of new objects, so
in this respect they are more like movies.
Next, we look closer at the actual video player and note that it presents
the user with a set of controls much like those of a VCR. This means that users
can play, pause, stop and move back and forth in a video stream. This is
an important observation since it can affect how users watch a news clip.
They are not limited to watching it from beginning to end, and there is nothing
that dictates that this is the normal behavior either. We also note that there
is no control for choosing the transfer rate or quality wanted; this is
determined by the player itself.
2.3 List of questions
After having explored the applications in which our data material is presented to the users, we now present a list of questions we are investigating
in this thesis. They are in effect the goals section from Chapter 1 in question
form. We will refer back to these questions throughout the thesis.
2.3.1 Content analysis questions
Q: File types existing on server
Q: Distribution among file types
Q: Access distribution among file types
Q: Size distribution between file types
Q: Size distribution within file types
2.3.2 Article access patterns questions
Q: Are there sessions, do users select several articles in sessions
Q: If there are sessions, time between requests within sessions
Q: If there are sessions, reference patterns between requests
2.3.3 Stream interaction patterns questions
Q: How are streaming objects watched, beginning to end or partial
Q: If partially, how much is watched
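The two stream interaction questions can be sketched in Python as follows. The sketch assumes that the fraction of an object that was viewed can be approximated by the bytes sent from server to client divided by the size of the object on disk. This is an illustrative assumption: retransmissions and protocol overhead can distort such a ratio, so it is capped at 100 percent. The function names and sample values are hypothetical.

```python
def view_percentage(bytes_sent, object_size):
    """Approximate percentage of an object delivered, capped at 100."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    return min(100.0, 100.0 * bytes_sent / object_size)


def view_histogram(accesses, bucket_width=10):
    """Count full accesses and bucket the partial ones by view percentage.

    accesses: iterable of (bytes_sent, object_size) pairs.
    bucket_width must divide 100 evenly.
    Returns (full_count, buckets), where buckets[i] counts partial accesses
    with a view percentage in [i * bucket_width, (i + 1) * bucket_width).
    """
    buckets = [0] * (100 // bucket_width)
    full_count = 0
    for bytes_sent, object_size in accesses:
        pct = view_percentage(bytes_sent, object_size)
        if pct >= 100.0:
            full_count += 1
        else:
            buckets[int(pct // bucket_width)] += 1
    return full_count, buckets


# Hypothetical accesses: (bytes sent, object size) in bytes.
sample_accesses = [(500, 1000), (1000, 1000), (50, 1000), (1200, 1000)]
full_count, partial_buckets = view_histogram(sample_accesses)
```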
time stamp
client ip address
server ip address
port number
host name ..
/annonser/... , /bilder/...
blank or artid=xxxx
Cookie information
e.g. Mozilla/4.0...
host name of proxy if used
ip addr of client a proxy is forwarding for
entire URL of referrer
host name of referrer
URI of referrer, e.g. /pub/vgart.hbs
arguments from referrer, e.g. artid=xxx
usually blank
time to complete request
HTTP status code
content type, e.g. image/gif
cookie information
bytes sent from server to client
Table 2.1: VG original log format
2.3.4 Lifetime and popularity analysis questions
Q: Lifetime in terms of day distance between the first and last access
Q: Time dependent popularity: concentration of references (hot documents) and access distribution over a period
Q: Time independent popularity: Zipf
Q: Compare lifetime and popularity of web news, streaming news and
VoD movies
In order to answer these questions we need to analyze the logs we
received from both the web news and streaming news servers.
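The concentration of references mentioned in the questions above can be measured as the share of all accesses that goes to the most popular fraction of objects. The Python sketch below shows one way to compute it, assuming the logs have been reduced to one access count per object. The function name `top_share`, the 10 percent default and the sample counts are hypothetical.

```python
def top_share(access_counts, top_percent=10):
    """Share of all accesses that goes to the top `top_percent` of objects."""
    counts = sorted(access_counts, reverse=True)
    if not counts:
        raise ValueError("access_counts must not be empty")
    total = sum(counts)
    if total == 0:
        raise ValueError("no accesses recorded")
    # Integer arithmetic avoids floating point surprises; keep at least one object.
    top_n = max(1, len(counts) * top_percent // 100)
    return sum(counts[:top_n]) / total


# Hypothetical counts: one hot object and nine rarely requested ones.
sample_counts = [70, 5, 5, 5, 5, 2, 2, 2, 2, 2]
hot_share = top_share(sample_counts)
```

In this made-up sample, the single most popular object out of ten receives 70 of the 100 accesses.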
2.4 Dataset
Now that we know what we want to find out, we continue with a
presentation of the data material. We received the web news server logs directly
from VG; they cover accesses between 2004.12.07 09:00 and 2004.12.27
15:00. Each log contains half an hour of material for a total of 968 files.
Compressed using gzip, the total size of these logs amounts to 86GB. The log
format is listed in Table 2.1 and Figure 2.5 shows some example entries
from the logs.
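As a small worked check of these figures, the number of half-hour slots in the logging interval can be computed directly. The Python sketch below uses only the dates and the file count stated above; the variable names are illustrative.

```python
from datetime import datetime, timedelta

start = datetime(2004, 12, 7, 9, 0)
end = datetime(2004, 12, 27, 15, 0)
slot = timedelta(minutes=30)

# Number of complete half-hour slots between start and end.
expected_slots = (end - start) // slot

reported_files = 968
difference = expected_slots - reported_files
```

The interval spans 972 half-hour slots, four more than the 968 files reported. The difference could come from a few missing log files or from how the first and last slots are counted; the description above does not say which.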
Figure 2.5: VG log sample entries
We acquired the streaming logs from Norsk Regnesentral (NR) [6],
which was administering VG’s stream server before 2004. The logs
contain accesses from January 2002 to November 2003 for a total of 769
log files. Compressed with gzip the total size of these logs is 530MB. The
log format is listed in Table 2.2 and Figure 2.6 shows some example entries
from the logs.
Figure 2.6: NR log sample entries
client ip address
date of request
time stamp
dns address
name of object with complete URL
client specified where in stream byte wise, majority at 0
duration of stream, rarely used
client rate, -5000, -5, 0, 1, 2, 5, 200, 400, 404, 1000
HTTP status code
player id nr from vendor
player version nr
player language, e.g. noNO
e.g. mozilla/4.0....
URL of referer, e.g
client executable file e.g. iexplore.exe
client version nr of hostexe
client operating system
client os version
client cpu type, e.g 486, Pentium
not used
rarely used
average bandwidth achieved, 0-236628
http or mms
transport protocol, TCP or UDP
audio codec, e.g. WMA
video codec, e.g. WMV
not used
bytes sent from server to client
bytes sent from client to server
nr. of packets sent by server
nr. of packets received by client
nr. of packets lost on client
nr. of packets lost in the net
rarely used
nr. of resend requests from client
nr. of packets recovered due to ecc
nr. of packets recovered by resending
client buffercount
client total buffer time
quality descriptor in percent, 0-100
server ip address
server dns address
total clients currently connected, always 1
cpu utilization
Table 2.2: NR original log format
Chapter 3
Related Work
In this chapter, we introduce the reader to other work related to our
different types of analysis.
3.1 Content analysis
There is a lot of previous research on regular web content and
workload characterization. Many of these papers are quite old, and none
of them studies news sites exclusively. Their results are, however,
important to us, since we need to know about general web characteristics
in order to find special characteristics of NoD content.
Woodruff et al. analyzed several different aspects of web documents
from data collected by the Inktomi Web crawler [37]. One of their studies
was of file types used in child URLs in which they found over 40 different
file types with different file extensions which they grouped together in
five different categories. By counting the total number of occurrences of
each file type, they found that HTML, JPEG, GIF and XBM were by far the
most used, with HTML leading followed by GIF files. They also did a size
analysis, but only for HTML documents and with all markup removed.
Jeff Sedayo performed an analysis of size and frequency of objects in
log files obtained from a proxy server at Intel [35]. His results are much
the same as [37]. HTML, JPEG, GIF and XBM are still the most frequently
accessed file types, only in his dataset, GIF files are more accessed than
HTML. He also includes information on the average size and standard
deviation of the file types. Among the top four most accessed types, JPEG
files were much larger than the others, followed by GIF files. When it comes
to size distribution he found that there is tremendous variation in the size
of image files.
Bahn et al. present a characterization study of web references focused
on content analysis by studying log files from a proxy server at the
Korean Education Network (KREN) [17]. In their first study, they show that
references are biased to some hot documents. 10 percent of the documents
are responsible for 70 percent of the references. Further, they present an
analysis of the distribution of URL types, where they found that 75.2
percent of the total references are to image files such as JPEG and GIF, and
about 14 percent of references are to HTML files.
Arlitt et al. did a workload characterization study of six different log
files collected from different types of servers [16]. They were searching for
invariants across all six data sets and found some that apply to our study.
First, they found that HTML and image files account for 90 to 100 percent
of the total requests. In addition, they found that 10 percent of the files
accessed accounted for 90 percent of the requests. Since they are analyzing
data from six different sources, this implies that the concentration of hot
documents is even greater than what [17] concluded from analyzing only
one source.
While all of the above analyze general content and workload characteristics of web servers and proxies, Acharya et al. performed an experiment to
measure how video data specifically is used on the web [14]. Their analysis
is much more detailed than ours will be, including analysis of frame rate,
duration and aspect ratio of individual movie types. Their video objects
are of the types MPEG, AVI and Quicktime and as we found in Chapter 2,
ours are mostly WMV. Therefore, the distribution between them is insignificant to our analysis but the size distribution is still interesting. They found
that most movies are small, 2MB or less with the median size being 1.1MB.
They also show that most movies are brief, where 90 percent lasted 45 seconds or less. This is similar to what we expect to find from streaming news
video clips.
3.2 Article access patterns
Catledge et al. researched user navigation strategies on the web by
capturing client side user events from a doctored browser [20]. In this study
they defined user sessions to be within 1 1/2 standard deviations of the
mean time between each event for all events across users. One of their
studies shows that within a particular site, users tend to operate in a small
area. In addition they found that users accessed on average 10 pages per
server and that information must be accessible within two to three jumps
from the initial page.
Apart from [29], not many studies that we know of have been conducted
specifically on news servers. In their paper they make
an undocumented but intuitive claim that users request more than one
document while connected to a news server. They use this claim to create
a popularity algorithm for groups of articles, called Multi-selection Zipf,
which they compare to the Zipf popularity model. See Section 3.4 for more
on Zipf distribution.
3.3 Stream interaction patterns
There has not been much research on specifically how users interact with
streaming news objects. Most previous work has focused on what type of
video data exist on the web, their characteristics and their access frequency
[17, 35]. [14] deals mostly with the video data itself in terms of what type
of files there are, and the individual properties of each file type such as
size, frame rate, duration and average bit rate. This is useful when it comes
to modeling content on the web, but in our study we also want to get a
sense of how users interact with the data. We want to explore how stream
objects are viewed. Are they usually viewed from beginning to end? If not,
how many are only seen partially, and how much is usually viewed before
stopping? Knowing this would be helpful for many caching mechanisms,
like for example prefix caching [25] where the prefix can be decided based
on the knowledge of how much of an object is usually accessed.
3.4 Lifetime and popularity analysis
Zipf's law is a power law relating frequency of use to popularity rank
[33]. It originates from the Harvard student George Kingsley Zipf who
first noticed that the distribution of words in a text followed a special
statistical pattern. It states that the size (frequency) of an object is inversely
proportional to its rank (popularity), i.e. proportional to 1, 1/2, 1/3 etc. If
one ranks the popularity of words in a text (denoted i) by their frequency
of use (denoted P), then
$$P = \frac{1}{i^{\alpha}}$$
The real Zipf distribution is parameterless, i.e. α equals 1, but the name is
commonly used for distributions with α close to unity as well. Later, the Zipf
distribution has been applied to many areas in social sciences, one of them
being VoD. Many have also modeled regular web page popularity after Zipf
and found their popularity distribution to be Zipf-like with different values
for α.
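To make the role of α concrete, the following Python sketch ranks objects by request count and fits a straight line to the rank/frequency data on log-log axes; the negated slope of that line is an estimate of α. The function name and the sample counts are hypothetical, and a least-squares fit is only one possible estimator, not necessarily the one used in the studies cited here.

```python
import math

def estimate_zipf_alpha(counts):
    """Estimate alpha in P ~ 1 / i**alpha by least squares on log-log axes."""
    # Rank 1 is the most frequently requested object.
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two objects with non-zero counts")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The slope of the fitted line is -alpha.
    return -sxy / sxx

# Hypothetical counts that follow pure Zipf (frequency proportional to 1/rank).
pure_zipf = [1200 / rank for rank in range(1, 51)]
alpha = estimate_zipf_alpha(pure_zipf)
```

For data that follows the pure Zipf distribution the estimate is 1 up to rounding error; flatter popularity distributions, such as those reported for proxy traces, give smaller values.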
Cunha et al. as well as Barford et al. have performed reference behavior
studies of client traces by modifying a browser to record all user accesses,
[18, 23]. In [23], Zipf was applied with α = 0.986 which is very close to pure
Zipf distribution. [18] show studies from two data sets, one in 1995 and
one in 1998. They only show request distribution compared to Zipf for the
1995 dataset, in which they found α to be 0.96. However, they also compare
transfers to Zipf in which α drops to 0.83 in 1995 and in 1998 it is 0.65.
The reason for the difference between requests and transfers in 1995 is that
transfers only show the set of cache misses. From this we see that between
1995 and 1998, fewer transfers had to be made, suggesting an improvement in
caching techniques.
Almeida et al. in [15] applied Zipf with α equal to 0.85 for a dataset
containing logs from several different web servers, and Bahn et al. showed
that web server popularity also is close to Zipf, without giving exact
numbers [17].
Breslau et al. found in their analysis of web proxy traces that the distribution of page requests follow a Zipf-like distribution with α varying
from trace to trace but centering around 0.75 for heterogeneous environments [19]. They also point out that their work cannot be directly compared to [15] since proxies deal with only a fixed group of users, while web
servers see requests from all users on the Internet.
We are analyzing traces from a web server, which means by the results
above, we should see a behavior like [15] in our results. However, our
dataset, albeit from a web server, is quite different from both [19] and [15].
We do not look at general web pages but news articles specifically.
There has not been much study of the lifetime of web pages, but for
news pages this is important, since news tends to get old reasonably fast.
Kim et al. analyze newspaper traces in [29], and as such their dataset
is of the same type as ours and their results are very interesting to compare
to our findings. They found that the recency of articles defines their popularity
and that the most popular articles do not last more than three days. In
addition they also compare article popularity to Zipf and, without giving
exact numbers, conclude that it differs from Zipf. In these studies, however,
they compare the mean access popularity of articles for a month to Zipf.
Zipf is a time independent model, and average accesses calculated over a
whole month for data not available during the whole period cannot be compared
to Zipf. All this tells us is the probability an article has to live long, not its
Chapter 4
Tools
Since we are analyzing both web and streaming news content, as well
as user behavior and interaction with the content we need tools that can
perform analysis over a wide range of areas. In this chapter, we outline
details on what our requirements for tools are, what tools we have chosen
to use and the environment in which we use these tools.
4.1 Requirements
There are numerous excellent web log analysis tools available on the
Internet today, such as The Webalizer [13] and AWStats [1]. These tools
are very good for analysis of content and visiting statistics such as what
file types there are, number of visitors, most accessed pages etc. However,
we are analyzing not only content but users' interaction with it, and most
importantly their interactions with single objects. Only stating which pages
are most popular on a specific date is not enough. We want to see for how
long single objects are popular. We also want to see how they relate to
each other, e.g. if two pages are usually requested right after each other.
In short, we need not only content analysis of a server, but also popularity
models and user interaction analysis, which require analysis of specific
objects and clients.
For that reason, we have chosen to develop our own tools in a
combination of languages. Since our log traces contain a lot of information,
we chose to use a database management system (DBMS) for analyzing
them, as these are specifically designed to handle and query large amounts
of data. Applications and scripts for different tasks like data handling and
creation of graphs have been developed in either C, Python, R, PL/R or
4.2 PostgreSQL
We decided to use PostgreSQL as our DBMS for several reasons. First, PostgreSQL is open source and completely free. Second, it is an established
system with a large user base and it also has very good documentation.
Third, it integrates very well with the C programming language,
both through the libpq library and through its extensible module features. We
use PostgreSQL version 7.4.5.
4.3 R and PL/R
R is a language and environment for statistical computing and graphics [11]. Since much of the work we are doing is statistical analysis of content and user actions, this was a natural choice for us. Another important
reason for choosing this language is its graphing capabilities. We use version R-2.0.0.
PL/R is a loadable procedural language that enables us to write
PostgreSQL functions and triggers in the R programming language [8].
With this module installed we can use most of the R language’s capabilities
on the database. The version we use is plr-0.6.0b-alpha.
Using the R language and the PL/R module in conjunction with
PostgreSQL we get a very elegant and easy way of extracting and analyzing
statistical data and create graphical representations of the results.
4.4 C, Python and libpq
There are two main reasons for using C as the main language for our
applications. The first is that it is by far the most comfortable language
for us; it is what we use the most. The other is the libpq library, which
provides a powerful and easy to use API for accessing the PostgreSQL
server. However, C is not the best language for high level text operations,
which is why we for some tasks choose to use Python [10].
4.5 Environment
Our work environment was on a server at the University of Oslo. On this
server we set up a DBMS that we used for insertion and querying the log
files. The environment we set up is a PostgreSQL [9] server extended with
the PL/R module [8], the R programming language [11], and a series of
our own C applications that use the libpq [5] library of PostgreSQL, as
well as numerous Python scripts.
4.6 Setup requirements
There are some requirements for setting up the combination of PostgreSQL, R
and PL/R. They have to do with the use of the PL/R module and its integration
with the database server. In order to get PL/R installed and integrated, we
had to compile PostgreSQL from source instead of installing a precompiled
package. This is usually a good idea anyway, but it is worth mentioning
for everyone else who wants to try this. The reason we had to compile
it ourselves is that the headers are needed in order to compile the PL/R
language module. This is also the case for the R language. In addition,
most precompiled versions of the R language are compiled without the
--enable-R-shlib option which enables the libR shared object library. libR is
also needed in order to compile PL/R, so this is another reason we had
to compile R from scratch. The installation documentation on [8] gives
complete instructions for how to get PL/R compiled and installed in the
database, but with the version we use we encountered a small problem.
The r-libdir variable in the PL/R Makefile actually pointed to
the r-bindir. After changing r-libdir = RHOME/bin to r-libdir = RHOME/lib
in the Makefile, following the directions on the PL/R install page worked
without problems.
Chapter 5
Design and Implementation
In this chapter, we analyze how we can answer our questions from
Chapter 2 given the data we have available. For convenience, we split this
discussion up into the four main areas of focus. In the last two sections we
first present the design of the database tables we will import the logs to,
and then we discuss our options for performing this task as well as the one
we have chosen.
5.1 Content analysis
In order to answer our questions related to content we need a list of all the
objects on the servers as well as their type and size. From such a list we can
extract information on what kind of file types that exist, the distribution
between the different file types and the size distribution both between and
internal to each type. We also need a recording of requests in order to
investigate the access distribution between them.
This section first presents how to acquire this information for the web
news, then the options for the streaming news.
5.1.1 Web content
We do not have a list of all the files on the VG web server, so we have to
create a parser which extracts this information from the logs. The parser has
to record each new file type it sees, each new object it finds of the specific
file types, as well as the size of each new object. To investigate access
distribution we can use the same type of parser, only not caring whether or
not the log entry it is examining relates to an object previously seen. That
is, this parser has to record information for all entries in the log, while the
first parser only needs to record information for new objects.
To identify each new file type we can use the sc(ContentType) field which
tells us the mime type of the file. To identify each new object, we can look at
the uri-stem field as long as the object is not an article. When the object is an
article we have to combine uri-stem with the uri-query field to distinguish
them. To find the size of each object we can look at the sc-bytes field which
records the number of bytes sent from the server to the client. This field is
not always filled in, so we skip the entries that do not have it set.
Table 5.1: NR directory listing example
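The two web content parsers described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the log entries are assumed to be already split into named fields, an unset field is assumed to be written as "-", and the image path and all values in the sample entries are hypothetical. The actual parser is a C program.

```python
from collections import Counter

ARTICLE_STEM = "/pub/vgart.hbs"  # uri-stem shared by all article requests

def object_key(entry):
    """Return the string that identifies the object behind one log entry."""
    stem = entry["uri-stem"]
    if stem == ARTICLE_STEM:
        # All articles share one uri-stem, so the query tells them apart.
        return stem + "?" + entry["uri-query"]
    return stem

def parse_entries(entries):
    """Return (objects, accesses) for an iterable of log entries.

    objects maps each distinct object to its (mime type, size);
    accesses counts every request per mime type.
    """
    objects = {}
    accesses = Counter()
    for entry in entries:
        mime = entry["sc(ContentType)"]
        accesses[mime] += 1          # access parser: count every entry
        size = entry["sc-bytes"]
        if not size.isdigit():
            continue                 # size not set: skip for the object list
        key = object_key(entry)
        if key not in objects:       # content parser: record new objects only
            objects[key] = (mime, int(size))
    return objects, accesses

# Hypothetical entries, already split into named fields.
sample = [
    {"uri-stem": "/pub/vgart.hbs", "uri-query": "artid=101011",
     "sc(ContentType)": "text/html", "sc-bytes": "20480"},
    {"uri-stem": "/pub/vgart.hbs", "uri-query": "artid=101012",
     "sc(ContentType)": "text/html", "sc-bytes": "-"},
    {"uri-stem": "/img/logo.gif", "uri-query": "-",
     "sc(ContentType)": "image/gif", "sc-bytes": "512"},
    {"uri-stem": "/img/logo.gif", "uri-query": "-",
     "sc(ContentType)": "image/gif", "sc-bytes": "512"},
]
objects, accesses = parse_entries(sample)
```

In this sketch the second article is counted as an access but does not enter the object list, because its size is not set.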
5.1.2 Stream content
For the streaming news we were able to get a recursive directory listing of
the files on the streaming server from NR. Table 5.1 shows the format and
an example of what this list looks like.
In addition to the information present in this list, we also need to know
the type of each file. This can be found by looking at the file extension, so
we can create a parser which adds a new column recording the file type. By
inserting this information into a database table, we can query on both file
type distribution as well as size distribution between and within each type.
In order to get a count of accesses to each type, we have to match the
types we have found from the server list with objects in the streaming logs.
These logs are also inserted into a database so this becomes a matter of
matching objects in two different database tables and updating the type
field wherever we find matching objects.
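As a rough illustration of these two steps, the Python sketch below derives a file type from the file extension and then counts accesses per type by matching requested object names against the directory listing. The listing format, the paths and the sizes are hypothetical, and the sketch uses in-memory dictionaries where the text above uses database tables.

```python
import os
from collections import Counter

def file_type(name):
    """Derive the file type from the file extension, e.g. 'wmv'."""
    ext = os.path.splitext(name)[1]
    return ext[1:].lower() if ext else "unknown"

def accesses_per_type(listing, requested_names):
    """Count requests per file type for objects found in the listing."""
    types = {name: file_type(name) for name, _size in listing}
    counts = Counter()
    for name in requested_names:
        if name in types:            # only objects present on the server
            counts[types[name]] += 1
    return counts

# Hypothetical directory listing entries: (path, size in bytes).
listing = [
    ("/news/clip1.wmv", 1048576),
    ("/news/clip2.WMV", 2097152),
    ("/audio/talk.wma", 524288),
]
# Hypothetical object names taken from the streaming logs.
requested = ["/news/clip1.wmv", "/news/clip1.wmv",
             "/audio/talk.wma", "/missing.asf"]

type_distribution = Counter(file_type(name) for name, _size in listing)
access_counts = accesses_per_type(listing, requested)
```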
5.2 Lifetime and popularity analysis
We want to look at both lifetime and popularity of objects. When we talk
about lifetime we have said that we mean the distance in time between the
first and last access to an object. There is a date field for each access in both
the web and streaming news logs we can use to investigate this. We also
want to investigate the concentration of references in order to find out if
there is a small group of articles that account for most of the requests. In
addition, we also want to look at the popularity of objects, both in terms
of how the access distribution changes over a period since the first access
seen, and in terms of Zipf distribution.
From these questions we see that we need a way to distinguish specific
articles and streaming objects. As noted earlier we can find unique articles
in the web logs by looking at two distinct attributes of the logs, uri-stem and
uri-query. The uri-stem field is always /pub/vgart.hbs for article requests and
the uri-query distinguishes between specific articles, specified by the form:
artid=101011. To find each stream object we can simply use the uri-stem field
of the streaming logs.
When we have identified each object, we can examine all the entries
of the logs and record the first and last request we see. We can record the
number of requests to each object both throughout the entire period of the
log, and within a limited time period based on the date field. From this we
can learn about the concentration of references as well as the Zipf
distribution, ranking objects by their number of requests.
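The bookkeeping described here can be sketched in Python as follows: one pass over the accesses records the first and last request and the request count per object, from which lifetime, rank and the concentration of references follow. The function names, the object identifiers and the dates are hypothetical and only serve as an illustration.

```python
from datetime import datetime

def summarize(accesses):
    """Return {object: {'first', 'last', 'count'}} from (object, datetime) pairs."""
    summary = {}
    for obj, when in accesses:
        s = summary.get(obj)
        if s is None:
            summary[obj] = {"first": when, "last": when, "count": 1}
        else:
            s["first"] = min(s["first"], when)
            s["last"] = max(s["last"], when)
            s["count"] += 1
    return summary

def lifetime(entry):
    """Distance in time between the first and last access to an object."""
    return entry["last"] - entry["first"]

def rank_by_requests(summary):
    """Objects sorted from most to least requested (rank 1 first)."""
    return sorted(summary, key=lambda obj: summary[obj]["count"], reverse=True)

def top_share(summary, fraction=0.10):
    """Share of all requests that go to the most requested fraction of objects."""
    counts = sorted((s["count"] for s in summary.values()), reverse=True)
    top_n = max(1, int(len(counts) * fraction))
    return sum(counts[:top_n]) / sum(counts)

# Hypothetical accesses: (object, time of request).
log = [
    ("artid=101011", datetime(2000, 1, 1, 8, 0)),
    ("artid=101012", datetime(2000, 1, 1, 9, 0)),
    ("artid=101011", datetime(2000, 1, 2, 8, 0)),
    ("artid=101011", datetime(2000, 1, 3, 20, 0)),
]
summary = summarize(log)
ranking = rank_by_requests(summary)
```

The ranked request counts are the input needed for a comparison with the Zipf distribution.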
5.3 Article access pattern analysis
The first question we study is if users select several articles in sessions.
This means that we need a way to define single clients and we also need to
define what a session is. The only entry in the web logs that tells us anything
about clients is the IP address. There is some uncertainty in using IPs to
uniquely identify clients. IPs can be dynamically assigned through
for example DHCP, or they can represent for example a NAT or a proxy
server. This means that a single user can have multiple IPs, and that several
users can be represented by the same IP. There is no way to distinguish
IPs that represent single users from other IPs in our logs, but since we also
define sessions, we increase the chance of identifying single clients. There
is a greater possibility that requests from one IP are from a single user when
the time period is short than when it is long.
In order to define sessions we need to look at the time field in the logs.
We have chosen to define sessions to be within 1 standard deviation of the
mean time distance between each access to an article per day from each IP.
This is in accordance with what [20] did in their study of client side user
events.
When we have identified a client and sessions, we can look at specific
clients' access patterns within sessions. In order to find out if specific groups
of articles exist, we need to compare the uri-query field for all requests
within a session with the requests in all the other sessions. Also, each log entry
has a time stamp we can examine to find the time between each new request
from a single client within a session. It is important to understand that the
term client only refers to an IP address making a number of
requests within a specified time period. A client cannot be traced further to
track requests from the same IP in another session.
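One possible reading of the session definition above is sketched below in Python: for the article accesses of one IP on one day, a gap longer than the mean gap plus one standard deviation starts a new session. This interpretation, the use of the population standard deviation and the sample time stamps (seconds since midnight) are assumptions made for the illustration.

```python
from statistics import mean, pstdev

def split_sessions(timestamps):
    """Split the accesses of one IP on one day into sessions.

    timestamps are numbers (seconds). A gap larger than
    mean(gaps) + 1 standard deviation starts a new session.
    """
    times = sorted(timestamps)
    if len(times) < 2:
        return [times] if times else []
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    threshold = mean(gaps) + pstdev(gaps)
    sessions = [[times[0]]]
    for earlier, later in zip(times, times[1:]):
        if later - earlier > threshold:
            sessions.append([later])      # long pause: new session
        else:
            sessions[-1].append(later)    # same session continues
    return sessions

# Hypothetical accesses from one IP: two bursts one hour apart.
sessions = split_sessions([0, 10, 20, 30, 3600, 3610, 3620])
```

With evenly spaced accesses the standard deviation is zero and no gap exceeds the threshold, so all accesses end up in one session.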
5.4 Stream interaction analysis
character varying(128)
character varying(56)
assigned unique id number
name of object, parsed out of URI
size of object
mime type field
Table 5.2: NR log object attributes
The first question we study on stream interaction is the distribution
between partial and full accesses of videos. To find out how many videos
are watched in full and how many are partial there is an sc-bytes field which
records the number of bytes sent from the server to the client. We can match
this field against the size field of the objects found in the directory listing
of the streaming server we got from NR. If we find that many videos are
not seen until the end, we also want to see how much of the videos is
usually viewed. This can be done by traversing all requests and recording the
percentage viewed.
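The comparison of sc-bytes against object size can be sketched in Python as follows. The object names and byte counts are hypothetical, and capping the result at 100 percent is an assumption made for the sketch, to cover accesses where more bytes are sent than the file size.

```python
def percent_viewed(sc_bytes, object_size):
    """Percentage of an object that was sent to the client, capped at 100."""
    if object_size <= 0:
        raise ValueError("object size must be positive")
    return min(100.0, 100.0 * sc_bytes / object_size)

def split_full_partial(accesses, sizes):
    """Return (number of full accesses, list of partial percentages)."""
    full = 0
    partial = []
    for name, sc_bytes in accesses:
        size = sizes.get(name)
        if size is None:
            continue                 # object not in the directory listing
        pct = percent_viewed(sc_bytes, size)
        if pct >= 100.0:
            full += 1
        else:
            partial.append(pct)
    return full, partial

# Hypothetical object sizes from the directory listing, in bytes.
sizes = {"/news/clip1.wmv": 1000000}
# Hypothetical accesses: (object name, sc-bytes).
accesses = [
    ("/news/clip1.wmv", 1000000),
    ("/news/clip1.wmv", 250000),
    ("/news/clip1.wmv", 1100000),
    ("/news/other.wmv", 500),
]
full, partial = split_full_partial(accesses, sizes)
```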
5.5 Database design
In this section we discuss the initial database tables in which we insert the
logs. The next section will discuss how we implemented the tools to import
the logs into the database design presented here. The reader should keep in
mind that the design presented here is just an initial design for the database
drawing on high level requirements from the discussion above. Many other
tables have been created from the tables discussed here to investigate the
specific questions we had. These other tables will be presented in the
subsequent chapters at appropriate places near the discussion of the results
they were designed to produce. This way, the reader gets a quick overview
of what we are analyzing, the way we analyze it and the result of the
analysis at the same place in the text.
5.5.1 Stream logs
From the discussion of lifetime and popularity above, we find that we need
to be able to identify specific objects. We also found that the only attribute
in the log which can be used for this is the uri-stem attribute. But, from the
interaction analysis we also found that we need to know the size of each
object, so we have to match the size field of the server list from NR with the
objects in the logs. Also, for the access distribution analysis of streaming
content, we need to identify the type of all objects. We then conclude that
there are three pieces of information we need to know about streaming
objects: name, size and type. Table 5.2 shows the attributes connected to
objects in the streaming logs.
Since we are identifying objects we also looked at the log format to see
if we could identify single clients. From this investigation of the log format
we found a number of attributes that could be used to identify clients. They
are listed in Table 5.3.
character varying(128)
character varying(128)
character varying(128)
character varying(128)
character varying(128)
character varying(128)
character varying(128)
character varying(128)
character varying(128)
character varying(56)
assigned unique id number
client ip address
dns address
player id nr from vendor
player version nr
e.g. noNO
e.g. mozilla/4.0....
executable file e.g. iexplore.exe
version nr of hostexe
operating system
os version
cpu type e.g 486, Pentium
Table 5.3: NR log client attributes
Since we have already identified and used many of the attributes of the
streaming log for distinguishing objects and clients, it becomes apparent
that we can simply create separate database tables to hold the list of objects
and clients. The rest of the information can be collected in an access table.
In addition, access distribution requires that each object is also identified
by type, so we incorporate this field here too. The attributes of the access table
are shown in Table 5.4.
By splitting up the logs in three parts we need to map the objects and
clients back to the access table. This means that we need a unique ID for all
the objects, a unique ID for all the clients, and map those IDs back to the
access table. The access table now contains information about which clients
accessed what objects when, along with specific stream related information.
It is this table that will be used for most of the further analysis.
5.5.2 Web logs
For the web news analysis, there are no high level requirements like
identifying single clients or objects, since there is only one attribute that
can identify each of them. However, we do find that besides content
analysis, the only entries we need are entries related to article requests.
This is good, since the size of these logs is tremendous and we do not
have the capacity to insert all of the information in these logs into a database.
Therefore, the database table for the web news logs will contain all the
attributes of the logs, but only entries related to article requests, see Table 5.5.
start time
sc bytes
c bytes
s pkts sent
c pkts recv
c pkts lost client
c pkts lost net
c pkts lost cont net
c resendreqs
c pkts recover ecc
c pkts recover resnt
c bufcount
c tot buf time
ref client table
ref object table
date of request
time stamp
majority at 0
client rate
URI of referer
average bandwidth
http or mms
transport protocol, TCP or UDP
audio codec, e.g. WMA
video codec, e.g. WMV
quality descriptor in percent, 0-100
server/client bytes sent
bytes sent from client
nr. of packets sent by server
nr. of packets received by client
nr. of packets lost on client
nr. of packets lost in the net
nr. packets lost continuous net
nr. of resend requests from client
nr. of packets recovered due to ecc
nr. of packets recovered by resending
total buffer time
mime type of object
Table 5.4: NR log access attributes
via host
time taken
character varying(32)
character varying(512)
character varying(512)
character varying(512)
character varying(512)
character varying(512)
character varying(256)
character varying(512)
character varying(512)
character varying(256)
character varying(512)
character varying(256)
double precision
character varying(256)
character varying(512)
time stamp
server IP address
port number .. 80
host name of referer,
path on server .. /pub/vgart.hbs
arguments .. e.g. artid=xxx
cookie information
e.g. Mozilla/4.0...
name of proxy if used
IP addr of client a proxy is forwarding for
referring URL
host name of referer,
path on server .. /pub/vgart.hbs
arguments .. e.g. artid=xxx
time to complete request
HTTP status code .. 200, 206, 404, 500 ...
content type e.g. image/gif
set cookie field
size of object .. not always used
Table 5.5: VG articles requests table
5.6 Database implementation
In this section we first discuss the options we have for importing the
log information into the database tables in the previous section. Then
we elaborate on the chosen approach. We do this separately for the two
different logs.
5.6.1 Stream logs
There are several methods we can use to split up the log information and
import them to their own tables as discussed in the previous section. Here
we list two, and discuss which one we have chosen.
Method 1: Database only
We can copy all logs into one big database table, using the COPY command
of PostgreSQL. From this table we can extract and create a client and
an object table with the attributes identified in Section 5.5.1 using database
commands. An example of such an SQL command is:
SELECT DISTINCT ...
INTO new-table
FROM big-table
We can do the same to create the access table, only without the distinct
option since it must hold all entries, discarding the attributes used for
clients and objects. When those tables are created we need to assign a
unique ID to all the entries in the clients and objects tables, as well as map
those IDs back to the access table.
Method 2: Everything in C
Instead of using only database commands, we can create a C parser to
extract the information in the three tables directly from the logs. It has to
go through all of the lines in each of the logs, record to file and assign IDs
to each new client and object it encounters. When the client and object are
recognized and identified, their IDs have to be mapped to the access that
entry is representing. The access also has to be recorded to a file along with
client and object ID. This way, when the parser is finished we end up with
three files on disk, one containing all clients with IDs, one with all objects
and IDs, and the last file containing all accesses with the mapped client and
object IDs in it. In short, these 3 files will contain our desired tables, so we
can now use the COPY command in PostgreSQL to insert those into their
own tables in the database.
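The single pass described for method two can be sketched in Python as shown below. This is not the C program from Appendix A.1: the sketch looks up clients and objects in dictionaries, which is a different technique from comparing against every entry already recorded, and it returns the three tables in memory, where the real parser writes them to files for COPY. The field names c-ip and c-playerid and all sample values are hypothetical.

```python
def split_log(entries, client_fields, object_field="uri-stem"):
    """Split log entries into client, object and access tables.

    Returns (client_ids, object_ids, accesses), where the first two map
    a client or object to its assigned ID and accesses holds one
    (client ID, object ID, remaining fields) tuple per log entry.
    """
    client_ids = {}
    object_ids = {}
    accesses = []
    for entry in entries:
        client_key = tuple(entry[field] for field in client_fields)
        object_key = entry[object_field]
        # setdefault assigns the next free ID only to unseen keys.
        client_id = client_ids.setdefault(client_key, len(client_ids) + 1)
        object_id = object_ids.setdefault(object_key, len(object_ids) + 1)
        rest = {key: value for key, value in entry.items()
                if key not in client_fields and key != object_field}
        accesses.append((client_id, object_id, rest))
    return client_ids, object_ids, accesses

# Hypothetical streaming log entries, already split into named fields.
entries = [
    {"c-ip": "10.0.0.1", "c-playerid": "A", "uri-stem": "/news/clip1.wmv",
     "sc-bytes": "1000"},
    {"c-ip": "10.0.0.2", "c-playerid": "B", "uri-stem": "/news/clip1.wmv",
     "sc-bytes": "400"},
    {"c-ip": "10.0.0.1", "c-playerid": "A", "uri-stem": "/news/clip2.wmv",
     "sc-bytes": "700"},
]
clients, objects, accesses = split_log(entries, ("c-ip", "c-playerid"))
```

Each of the three results corresponds to one of the three files that method two loads into its own database table.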
Selected approach
With method one, after having created a client and an object table and
assigned a unique ID to all of the entries, we have to map the IDs back
to the access table by looping through and comparing each client and each
object to each of the entries in the access table. This proved to
take up an unreasonable amount of time. The second approach also has to
perform this comparison, but only for the entries already recorded, not all
entries every time. Therefore, we created a C program to implement the
second approach. See Appendix A.1 for the source code of this program.
However, since we still have to compare each object seen and each client
seen with all new log entries, this still takes up a lot of time. We therefore
limit the time period in which we perform our analysis to data between
2002-01-21 and 2003-01-09 for a total count of 714,907 clients recognized,
2,412 objects, and 5,180,565 accesses.
5.6.2 Web logs
As mentioned in Chapter 2, the news logs are divided into 968 files with
a total compressed size of 86GB. This is too much information to put into
a database on the hardware available to us, and also be able to get query
results in the time span of this project. Therefore, we limit our log analysis
to the logs between 12-07 and 12-15. In addition, in the previous section
we noted that apart from the content study, the only log information we need
in our database is information about article accesses. By limiting our time
span for the log entries and only inputting log entries related to article
requests we get a total of 14,905,052 article accesses.
As mentioned in Section 5.2, we distinguish the article request entries
by looking at two distinct attributes of the logs, uri-stem and uri-query.
The uri-stem field is always /pub/vgart.hbs for articles and the uri-query
distinguishes between the articles with the form artid=101011.
There are several choices of how to extract the article requests and put
them into the database table. We present three methods here and then
elaborate on the one we have chosen.
Method 1: Everything in the database
With this method we can simply push all the logs into the database
using the COPY command of PostgreSQL. From this table we can extract
the article requests into a new table with a query using the restrictions
mentioned above. The query looks much the same as the one we presented
for the streaming logs:
SELECT *
INTO articletable
FROM alltable
WHERE uri_stem = '/pub/vgart.hbs' AND uri_query IS NOT NULL;
Method 2: Combination C parser and DB commands
The second approach is a C parser that, instead of just pushing all the logs
into one big table, inserts one log at a time into a table. For each log we can
use the database commands to copy only the relevant article entries into
its own table, excluding all entries referring to image requests etc. Finally,
before moving on to the next log, we delete the original table containing
the whole log. The query for this is similar to the one above, except that the
insert command has to be used instead of select.
Method 3: Everything in C + libpq
The last method is a parser that operates on one line from one log file at
a time. On each line it does exactly what the above SQL command does,
which is match uri-stem for /pub/vgart.hbs and check that uri-query is not
blank. If an article request is found, the entry is inserted into the article
database table using the libpq library for communication with the database.
Selected approach
With method one the amount of information to be put into the database is
too big. The disk on which the logs and the database are situated is a 340GB
disk which had about 230GB available for the database. We found
that after inserting only 17 of the 973 logs, the database is already 180GB.
With method two we greatly reduce the amount of disk space needed
but there is a problem with cleaning the logs before a COPY operation will
succeed. With the streaming logs, we used sed to create copies of the logs
without any erroneous lines. This was no problem since these logs were not
too large. With the news logs however, this takes a considerable amount of
time since most of them are over 1GB in size after being decompressed.
Because the cleaning of the log files takes a lot of time, we chose to use
method three even though a COPY operation on the whole file performs
better from the database point of view. By operating on one and one line
of the logs we do not get a problem with the database getting too big due
to entries not related to article request. Also, we do not get the problem
of cleaning the files with sed. When we get an erroneous line with this
approach, libpq insert will fail and PQexec will throw an error exception.
Because we do not care about erroneous lines we can just ignore the error
messages. The parser we made is listed in Appendix A.2
Chapter 6
Web Content Analysis
In this chapter, we analyze the questions from Chapter 2 regarding web
news content. We first introduce the method we have used to extract
information from the logs. Then we go into details about how we answer
the specific questions and the results we get.
6.1 Preparation
As noted in Chapter 5, we need to create a parser that extracts information
regarding file types, sizes and accesses to objects from the web logs.
We made a C program, Appendix A.3, that goes through the log files and
records, in separate files on disk, the name and size of each new object of
each file type it finds. To find the type we look at the ctype field in the log,
which holds mime type entries of the form image/jpeg. When the uri-stem
field is /pub/vgart.hbs we also have to combine this field with the uri-query
field so we get e.g. /pub/vgart/artid=56544 as a distinct HTML file.
The output of this program is a directory structure like:
In the sizes files, the size of each distinct object the program finds for the
respective type is stored. In the objects files the name of each new object
is stored. These objects files are used by the application to determine if the
current log entry it is processing references a new object or if it has already
been recorded.
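The bookkeeping of distinct objects can be illustrated with a short Python sketch. It is not the C program of Appendix A.3: that program keeps the names in objects files on disk and compares each entry with the objects already recorded, whereas this sketch swaps in an in-memory dictionary with hashed lookups. The tuple layout of the entries is an assumption made for the example.

```python
from collections import defaultdict


def object_name(uri_stem, uri_query):
    """Combine stem and query for articles so each article is a distinct object."""
    if uri_stem == "/pub/vgart.hbs":
        return "/pub/vgart/" + uri_query
    return uri_stem


def record_distinct(entries):
    """entries: iterable of (uri_stem, uri_query, mime_type, size_in_bytes).

    Returns {mime_type: {object_name: size}}, keeping the first size seen.
    """
    seen = defaultdict(dict)
    for stem, query, mime, size in entries:
        name = object_name(stem, query)
        if name not in seen[mime]:  # new object of this type
            seen[mime][name] = size
    return seen


# Made-up log entries for illustration only.
entries = [
    ("/pub/vgart.hbs", "artid=56544", "text/html", 1800),
    ("/pub/vgart.hbs", "artid=56544", "text/html", 1800),
    ("/pub/vgart.hbs", "artid=101011", "text/html", 2100),
    ("/img/logo.gif", "-", "image/gif", 950),
]
result = record_distinct(entries)
```

The number of keys per mime type corresponds to the line count of an objects file, and the stored values correspond to the entries of the matching sizes file.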
We ran our program to find and count file types and sizes over logs
for a subset of two days. We only use two days' worth of logs because
the application has to compare the object referenced in each entry of the
log to all the objects found so far. Each new object therefore makes every
later comparison slower, and beyond two days the program progressed too
slowly to give us much more data. Also, the layout of the pages, as
discussed in Chapter 2, implies that the distribution between the object
types does not change much over time. For this reason, we think two days
is enough for the type of analysis we are conducting here.
6.2 File types and distribution
The line count of either the objects or the sizes files tells us the number of
objects of each file type. In Linux, the shell command cat name-of-file | wc -l
gives the line count of a file. To find the mime type distribution we can
simply add up the line counts of all objects files in each distinct directory,
/results/images, /results/text and so on. Table 6.1 lists all the object types
we found and the distribution between them, including the total distribution
per mime type. Figure 6.1 shows the mime type distribution as a histogram.
To create this figure we entered the sum of the line counts of all files
in each distinct directory into the R script listed in Appendix A.4. Not
surprisingly, we find a subset of file types that represents the majority of
objects. To investigate the difference between the most represented types,
we also create a graph of a selection of file types, Figure 6.2. To
create this figure we used the R script in Appendix A.5. As we can see,
most of the objects are of type text/html, image/gif or image/jpeg. This
result corresponds to what has earlier been found for general web
traffic [17, 23, 35, 37].
6.3 Size and access distribution
To find the size distribution between the selected file types we use the
median size for each type. We did make some sample graphs using the
mean, but they gave us a completely wrong picture because the size
distribution is very skewed. As an example of this, the smallest JPEG image
is 304 Bytes, the largest is 1,918,304 Bytes, the mean is 13,838 Bytes and
the standard deviation 21,418 Bytes. Using median for this representation
is also in accordance with the rules presented in [27] regarding selection
among mean, median and mode. To find the median we use the Python
script in Appendix A.6, to sort the sizes files in ascending order so that the
entry at line count / 2 of each file gives us the median size of the specific
type. To create a graph of the median size of each type we again enter the
sizes into an R script, listed in Appendix A.7.
Table 6.1: VG file type distribution
Figure 6.1: VG mime type distribution (number of objects per mime type;
types are the first part of the mime type name, e.g. text/*)
Figure 6.2: VG file type distribution (number of objects for a selected
choice of file types)
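The median lookup, and the reason the mean is misleading for skewed sizes, can be shown with a small Python sketch. It is not the script of Appendix A.6. The sample sizes are made up, and taking the entry at position len // 2 of the sorted list is one reading of "line count / 2" (for an even number of entries it picks the upper of the two middle values).

```python
import statistics


def median_by_position(sizes):
    """Sort the sizes and return the entry at position len(sizes) // 2."""
    ordered = sorted(sizes)
    return ordered[len(ordered) // 2]


# Made-up, heavily skewed sample: many small files and one very large one.
sizes = [304, 900, 1200, 1500, 2100, 3000, 4200, 1918304]

mean_size = statistics.mean(sizes)
median_size = median_by_position(sizes)
```

In this sample the single large object pulls the mean to more than a hundred times the median, while the median stays among the typical sizes.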
To find the access frequency to each type we cannot use these files since
they are only records of distinct objects. To create new files recording access
distribution, we use the same application as before, only we do not care
about previous objects seen. That is, we skip the routine that checks if the
object in the current entry has been seen before, and thereby record accesses
to the specific types. From these new files we use the same approach as
before by plotting the line count of each file into an R script, see Appendix
Figure 6.3 shows the median size distribution between each of the
selected types, and Figure 6.4 shows the access distribution between them.
We see that the flash objects are rarely accessed, so even though their size
is about 12.5 times larger than the next group, JPEG, they will not impact
the server in any great way. HTML and Javascript objects are of similar size
and are accessed almost the same number of times. An interesting observation
here is the comparison between the number of HTML objects and Javascript
objects in the file type analysis in Figure 6.2. The number of Javascript files
in the logs is only about 1.8 percent of the number of HTML files. The reason
for this skew in number of objects versus number of accesses to objects of
type HTML and Javascript lies in the inherent nature of Javascripts. They
are contained in an HTML document to do specific tasks, such as banner
ads and pop ups. In our NoD environment, the layout of each page is kept
consistent across references to pages. As such, the same Javascripts are used
throughout several different pages to do the same specific task.
Figure 6.3: VG median size distribution (size in bytes for a selected choice
of file types)
To the best of our knowledge, only one other analysis has taken
application mime type into account (in which Javascript lies) and they
found its access distribution to be far less than HTML [17]. That makes
sense for the web in general since the majority of web pages will not contain
such things as banner ads or pop ups. However, we see that this is not the
case for a web news server, and in such environments Javascripts are an
integral part of the HTML documents.
When it comes to images, JPEG files are both larger and more frequently
accessed than GIF files. This result is in contrast to the findings of both
[37] and [35], where GIF was by far the most popular image type, and the
image type with the largest average size. [17] and [23] did not distinguish
between the different image types. All of these articles are quite old,
however, and our result does not come as a surprise, as it has been predicted
that the number of GIF files would drop [30]. The reason for this is that in
1994, Unisys, who held the patent on LZW, the compression algorithm used
in GIF, decided to start enforcing this patent and collect royalties on its
use [3]. When this happened a movement was started to move away from
the GIF format and encourage people to start using a free image format
instead, namely PNG [2]. This suggests a hypothesis that we would see
many PNG files where we saw GIF files before. However, in our study
there are very few PNG files. One of the reasons for this could be that the
GIF patent has now expired, and GIF does have some features that PNG does
not, such as animation. Another reason can be that not all browsers support
PNG files. For example, we tested with a Qtek 9090 PDA running Windows
Mobile 2003 Second Edition, version 4.21.1088, with Internet Explorer, and
it did not show any PNG files. GIF and PNG discussion aside, the fact
remains that JPEG is by far the dominant image format in terms of number
of objects in our study. The reason we find many more JPEG files than any
other image format has to do with the properties of the different image
formats. JPEG is simply the best format for photographs, which make up
the majority of images on a web news site [4]. One of the reasons why
it is better is that it uses lossy compression, which effectively removes parts
of the image which are not so important while still preserving a reasonably
good quality.
Figure 6.4: VG file type access distribution (number of accesses for a
selected choice of file types)
From the file type distribution and access distribution analysis we have
found that the majority of objects on the server, as well as of accesses, are
of type GIF, JPEG, HTML and Javascript. Arlitt et al. found in [16] that 90 to
100 percent of all accesses on a web server were to image or HTML files.
They analyzed logs from six different web servers to find common
invariants, and they concluded that their results were representative of
all web servers. Others have also observed properties consistent with this
result [23, 35]. They did not, however, include any NoD servers in their
study, but through our analysis we have found that this distribution applies
to web news servers as well.
Figure 6.5: VG JPEG size distribution (number of objects by size in KBytes,
logarithmic X scale)
6.4 Internal size distribution
Now that we have done a general analysis of the file types on the server
as well as the access and size distribution between them, we go into more
detail on each type. The content of the sizes files tells us the size
distribution internal to each file type. To investigate this distribution
we used the Python script in Appendix A.9 to
create a table collecting entries in buckets of 1KB. We also used different
R scripts to create the graphs of the internal size distribution for each file
type. The script in Appendix A.10 is an example of such an R script. All
the other graphs regarding internal file sizes have been made with similar
scripts, which can be found on a CD distributed with this thesis.
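Collecting sizes into fixed-width buckets can be sketched in Python as below. This is an illustration and not the script of Appendix A.9: the choice of 1024 bytes per KB and the convention that bucket k holds sizes from k to just under k+1 bucket widths are assumptions.

```python
from collections import Counter


def bucket_counts(sizes_in_bytes, bucket_bytes=1024):
    """Count objects per bucket; bucket k holds sizes in [k*bucket, (k+1)*bucket)."""
    counts = Counter(size // bucket_bytes for size in sizes_in_bytes)
    return dict(sorted(counts.items()))


# Made-up object sizes in bytes.
sizes = [304, 1000, 1500, 2047, 4096, 4100, 51200]
table = bucket_counts(sizes)
```

The same function covers wider buckets by changing `bucket_bytes`, for example `bucket_counts(sizes, 100 * 1024)` for buckets of 100KB.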
We will concentrate our further analysis on the most accessed types,
JPEG, GIF, HTML and Javascripts. From both median size distribution and
access distribution analysis we find that JPEG files are the largest and also
by far the most accessed type of objects. Therefore, we look at these first.
Figure 6.5 shows the size distribution among JPEG files. The majority
of JPEG files is between 1 and 5KB but there are also many objects from 5
to 50KB. After 50KB, however, there are not many objects left. [35] found
the average JPEG size to be 73KB, so here we see a clear difference from
regular web content. This makes sense since images on a news web page are
generally small images connected to the heading of an article. For general
web content, images are more likely to show much more information and
are therefore bigger. Also, due to the number of images on a web news
server, it is not unreasonable that the site tries to shrink the image sizes as
much as possible. As we have seen, JPEG is the most common format, and
with its lossy compression this can be done quite effectively.
Figure 6.6 shows the size distribution among GIF files. Here too, we
see that the majority of the objects lie between 1 and 5KB. If we
expand the range to 1 to 10KB we cover almost all the objects. The peak
of the curve for GIF files is at 1KB, as
opposed to 4KB for JPEG files. This confirms the results in Figure 6.3 that
GIF files are generally smaller than JPEG files. Again, our results show
a large difference from [35], which found the average size of GIF files to be
approximately 18KB. Here, we cannot draw any conclusion about the type
of information GIF images on a web news site convey as opposed to on the
web in general. It is, however, reasonable to assume that since the number
of GIF files on the web news server is so small, we only see a limited use of
this format, and as such, the average size for the web in general would be
larger.
Figure 6.7 shows the size distribution among HTML files. A large
portion of the documents are less than 2KB. The rest of the documents lie
in the range 2 to 20KB. This result corresponds reasonably well with other
findings. [23] found that the web strongly favors documents between 256
and 512 Bytes. By looking at the median file size distribution in Figure 6.3
we see that this is also true for the news pages in our research. [35] found
the average HTML size to be 5KB, so again we see that this study gives
much larger sizes than ours, but the ratio between the image and HTML sizes
remains approximately the same.
Figure 6.8 shows the size distribution among Javascript files. This is the
clearest size distribution we found. We can see that almost all Javascript
files are of 4KB size. There are just a few other documents of this type on
the server, and most of them are smaller with the majority at 1KB.
Figure 6.6: VG GIF size distribution (number of objects by size in KBytes,
logarithmic X scale)
Figure 6.7: VG HTML size distribution (number of objects by size in KBytes,
logarithmic X scale)
Figure 6.8: VG Javascript size distribution (number of objects by size in
KBytes, log/log scale)
Chapter 7
Streaming Content Analysis
In this chapter, we will analyze content from the streaming news server
logs, and investigate the issues discussed at the end of Chapter 2 regarding
stream content.
7.1 Preparation
As mentioned in Section 5.1.2, we got a recursive directory listing from
NR [6] with output as described in Table 5.1. We wanted to put this list
into a database table, which requires some changes to the format. The date
field has to be changed from e.g. 10.07.2002 to 2002-10-07, and the size field
has to be stripped of punctuation marks. In addition, we wanted a
column for the file type. We implemented the C program in Appendix A.11
to perform all these operations and output the result to a new file on disk.
The type of each object is found by looking at the file extension.
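The three conversions can be sketched in Python as follows. This is not the C program of Appendix A.11. The date reordering simply mirrors the example in the text (10.07.2002 becomes 2002-10-07), the extension-to-type table is hypothetical, and the file name and size in the usage line are made up.

```python
import os

# Hypothetical extension-to-type mapping; the thesis program may use
# other names or cover more extensions.
TYPE_BY_EXTENSION = {
    ".wmv": "video/x-ms-wmv",
    ".wma": "audio/x-ms-wma",
    ".asf": "video/x-ms-asf",
    ".jpg": "image/jpeg",
}


def convert_date(text):
    """Reorder 'AA.BB.YYYY' into 'YYYY-AA-BB', as in the example in the text."""
    first, second, year = text.split(".")
    return f"{year}-{first}-{second}"


def convert_size(text):
    """Drop every non-digit character, e.g. '1.048.576' -> 1048576."""
    return int("".join(ch for ch in text if ch.isdigit()))


def file_type(name):
    """Look up the type from the file extension; 'unknown' if not in the table."""
    extension = os.path.splitext(name)[1].lower()
    return TYPE_BY_EXTENSION.get(extension, "unknown")


def convert_entry(date, time, size, name):
    """Return one row in the order date, time, size, name, type."""
    return (convert_date(date), time, convert_size(size), name, file_type(name))


row = convert_entry("10.07.2002", "14:31", "1.048.576", "goal_clip.WMV")
```

Each converted row can then be written as one delimited line of the new file, which is the form a COPY import expects.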
Table 7.1 gives an overview of the format of the new file. To import this
new file into a database table we used the COPY command in PostgreSQL.
This server list table is used for queries to answer the specific questions
about file type, size and distribution. A description of the new table is listed
in Table 7.2.
Table 7.1: NR directory listing example with type field
Table 7.2: NR server list table (fields: date from server list; time from
server list; size from server list; name of object, parsed out of URI; mime
type field; column types include character varying(128) and character
varying(56))
7.2 File types and distribution
To find out what types of files exist we do a query on the new table, listing
all the distinct types:
SELECT DISTINCT type FROM nrlisttable;
The result of this query gave us a total of 13 different types. To investigate
the distribution among these types we created a PLR script that uses SQL
commands to count all entries of the specific types, and then uses R’s
graphing capabilities to create a histogram presenting the results. See
Appendix A.12 for this script.
Figure 7.1: NR file type distribution (number of files per file type;
Other = mp3, wav, pdf, playlists, txt, mpg, flash, avi)
Figure 7.1 shows the distribution among file types on the streaming
server. As we can see, even though we found a total of 13 different file
types, almost all of the files are of Microsoft’s WMV video and WMA audio
format, with WMV accounting for the absolute majority of objects on the
server. The next file type we see a lot of is JPEG. After that there are not
many of each of the other types. We find that types like mp3, WAV, MPEG,
AVI, Real and Quicktime, which are file types we generally find a lot of on
the Internet, are hardly used in our streaming environment. Microsoft
formats for video and audio are almost exclusively used. This does not come
as a surprise since, as we noted in Chapter 2, it is specified on the video
access page that the videos are of WMV type and that watching them requires
Windows Media Player.
7.3 Size distribution
To investigate the size distribution of the file types on the streaming server
we use the size and type attributes of our server list table. An example
of an SQL query that lists the size of all objects of a specific type is:
SELECT size
FROM serverlisttable
WHERE type = 'video/x-ms-wmv';
To record the results we use the \o command of the PostgreSQL client psql,
which writes results from queries to a file (e.g. \o /home/user/wmvsizes).
By performing these queries on all types, we get one file on disk for each
type containing the size of all objects of that specific type. These files are
the equivalent of the sizes files from the web news content analysis in the
previous chapter.
Further, we use the same Python script, Appendix A.6, as with the web
news content analysis to sort the sizes in ascending order. Again, we find
the median size of each type by looking at the entry at line-count-of-file /
2 of each distinct file. To create a graph of the size distribution between
the different types we entered the median sizes into the R script listed in
Appendix A.13. We have chosen to only include the types WMV, WMA,
JPEG and ASF, since there are so few objects of the other types.
Figure 7.2 shows the median size of the four selected file types. We see
that WMV and ASF files are of almost the same size. This is because ASF is
also a video format from Microsoft with similar design and characteristics
to WMV. WMA files are smaller than both ASF and WMV. This is not so
surprising as audio files tend to be smaller than video files. A bit surprising
is the difference between image files compared to audio and video. At first
glance it looks like the JPEG files are really big, but if we look closer at the
actual sizes, we see that it is the audio and video files that are quite small.
The median size of WMV is about 1MB.
7.4 Access distribution
Next, we investigate the access distribution of file types on the streaming
server. To find the access distribution we can not use the server list as above,
since this is only a record of files on the server. To investigate accesses we
have to look at the streaming logs access table.
Figure 7.2: NR median size distribution (size in KB per object type)
As mentioned in Chapter 5, since the streaming logs have no type
attribute we added a type field to both the objects table and the access table
of the logs. These fields have to be filled in by matching the objects each
entry represents to the objects in the server list table. Since we divided
the streaming logs into different tables, the access table does not contain
the name of the object accessed, only the ID, which matches an ID in the
objects table. Therefore, in order to match the type of an object from the
server list table to an entry in the access table we have to perform two
operations. First, we match the objects in the server list table to those in
the objects table created from the logs, using the name attribute of each
table. Wherever we find a match, we fill in the type in the streaming log
objects table. When the type field in the objects table has been filled in,
we can do the same operation between the streaming log objects and access
tables using the objectid attribute. To perform these two operations we
used the scripts in Appendix A.14 and A.15.
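The two matching steps can be sketched with a small in-memory example. This is not the content of Appendix A.14 and A.15: it uses Python's sqlite3 module instead of PostgreSQL, and the table names, column names and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE serverlist (name TEXT, type TEXT);
    CREATE TABLE objects (objectid INTEGER PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE access (objectid INTEGER, type TEXT);

    INSERT INTO serverlist VALUES ('goal.wmv', 'video/x-ms-wmv'),
                                  ('song.wma', 'audio/x-ms-wma');
    INSERT INTO objects (objectid, name) VALUES (1, 'goal.wmv'),
                                                (2, 'song.wma'),
                                                (3, 'removed.wmv');
    INSERT INTO access (objectid) VALUES (1), (1), (2), (3);
""")

# Step 1: copy the type from the server list to the log objects, matching on name.
conn.execute("""
    UPDATE objects
    SET type = (SELECT s.type FROM serverlist AS s WHERE s.name = objects.name)
""")

# Step 2: copy the type from the log objects to the accesses, matching on objectid.
conn.execute("""
    UPDATE access
    SET type = (SELECT o.type FROM objects AS o WHERE o.objectid = access.objectid)
""")

# Accesses per type, and accesses whose object could not be mapped.
per_type = dict(conn.execute(
    "SELECT type, count(*) FROM access WHERE type IS NOT NULL GROUP BY type"
).fetchall())
unmapped = conn.execute(
    "SELECT count(*) FROM access WHERE type IS NULL"
).fetchone()[0]
```

Objects that have no counterpart in the server list keep an empty type, so their accesses stay unmapped and fall outside the per-type counts.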
We were not able to map all objects from the server list to the objects
found in the logs. There can be several reasons for this. For example, not
all names for the same object match because of different representations
of Norwegian characters in the server logs and the server list. Badly
formatted log entries are another source of character mismatches. A third
reason can simply be that some objects have been removed
from the server. Therefore, the type of some objects cannot be determined.
We found 7,593 objects in the server list and 2,413 distinct objects in the log
files. Of these 2,413 objects, we were able to map 1,325 between the
server list and the log objects tables.
Figure 7.3: NR access distribution (number of accesses per file type;
Other = mp3, wav, pdf, playlists, txt, mpg, flash, avi)
Now we are ready to investigate accesses between the different types.
To find the number of accesses to each type we use the same method as in
the previous section with queries of the form:
SELECT count(*)
FROM accesstable
WHERE type = 'video/x-ms-wmv';
We created the PLR script in Appendix A.16 to perform these SQL queries
and then use R’s graphing capabilities to create a histogram of the access
distribution, Figure 7.3. As we see, WMV is definitely the most accessed
type. We also see that JPEG and files from the "Other" category are not
accessed at all. One reason they appear in the list table and not the
access table can be that NR had more content on the server than was
available through the streaming application interface. Also, the server list
contains objects from 2002 to 2004, while we only analyze logs with accesses
between January 2002 and January 2003, so the two datasets are not directly
comparable beyond the objects we were actually able to map between the
two.
However, WMV and WMA files are accessed almost exclusively and
from previous analysis we have also found that the majority of objects
on the server are of these types. As such, we need to better understand
what these files are besides just video and audio files. Both are Microsoft
standards in the Windows Media family. WMV is a video format, which
includes both video and audio. It is designed to handle all types of video,
be delivered as a continuous flow, and compressed to match different
bandwidth requirements [34]. WMA is an audio format in the same family
and with the same characteristics as its video counterpart. With these
characteristics, they fit a streaming environment very well.
7.5 Internal size distribution
Now, we also investigate the internal size distribution for the four selected
file types, WMV, WMA, JPEG and ASF. The method we use to find out
about their internal size distribution is the same as for the web news
content analysis. From the median size analysis above we already have files
on disk for each type containing the size of each object of that specific type.
These files are also sorted in ascending order. As with the web news, we use the
Python script in Appendix A.9 to collect these sizes in buckets, only now
we make the buckets 100KB in size. We use R scripts similar to those in the
web content analysis to create graphs of the outputs from each analysis. See
Appendix A.17 for these scripts.
Figure 7.4 shows the size distribution of WMA files on the NR server.
Although there is a large range of sizes from 5KB to 20MB, most of these
audio files are between 200 and 500KB. This is not very big compared to the
regular audio content we are used to, like mp3 music. However, the audio
files we are looking at are typically small samples of a music file designed
to give the user a preview of some particular song.
Figure 7.5 shows the size distribution of WMV files on the NR server. As
we have seen, most files on the server are of this type, and they are also the
most accessed type of files. The range of sizes within this type
is very large, between 1KB and 314MB, but by far the most objects are
between 100KB and 1MB. One reason they are this small is that, as with
the audio files, the videos are not full news broadcasts like those on a TV
news site such as that of the Norwegian broadcaster NRK [7]. They are
small news clips that show just a specific piece of information, for example
goals scored in a soccer match, or short interviews with celebrities. Another
reason can be that the logs we are analyzing are from 2002. In 2002, most
clients were still using ISDN or modem to connect to the Internet [36], and
therefore both size and compression rate were probably fitted to a lower
bandwidth market than today’s files. In any case, our results correspond
well with what [14] found in their analysis, where most video objects were
less than 2MB in size with the median size being 1.1MB.
Figure 7.4: NR WMA size distribution (number of objects by size in
100 KByte buckets, logarithmic X scale)
Figure 7.5: NR WMV size distribution (number of objects by size in
100 KByte buckets, logarithmic X scale)
Last, we also look at the internal size distribution for the JPEG and ASF
files. Figure 7.6 shows the size distribution of JPEG files on the streaming
server. As we see, the majority of these files are in the range between
500KB to 1MB, which is much larger than those we found in the web log
analysis. However, this size range is not large compared to regular image
and photograph sizes. The JPEG files we find on the streaming server could
be photo series that sometimes are shown on the VG site, now residing
on VG’s own servers and accessed through In
the web server analysis in the previous chapter we only analyzed articles
and therefore such pictures were not in that subset of images. Figure 7.3
shows that these images are never accessed, however, so we cannot conclude
anything about them.
Figure 7.6: NR JPEG size distribution (number of objects by size in
100 KByte buckets, logarithmic X scale)
Figure 7.7 shows the size distribution of ASF files on the streaming
server. The majority of these files have the same size range as the majority of
the WMV files, but we do not have a large enough set of objects to conclude
The analysis above has investigated the internal size distribution of files
on the server. This can be quite different from the objects actually accessed
in the logs. Therefore, we also made graphs of the internal size of those
objects that were actually accessed. We found this distribution to be almost
exactly the same as the distribution of the files on the server.
Figure 7.7: NR ASF size distribution (number of objects by size in
100 KByte buckets, logarithmic X scale)
Chapter 8
Access and Interaction Analysis
In this chapter, we look at user behavior in a NoD environment. We
first present a small workload characterization of the servers. Then we
investigate access patterns in the web news environment, and in the end
look at interaction patterns in the streaming environment.
8.1 Workload characterization
The first behavior analysis we perform is more geared towards the servers,
but it gives us a broad overview of what users do as well. To study the
workload of the web news server we used the script in Appendix A.18
to investigate the number of requests per hour on December 8, 2004.
The result is presented in Figure 8.1, where requests are collected in one
hour buckets, meaning that the entry at for example 7 represents requests
between 06:00 and 07:00. We see from this figure that the server is very busy
throughout the day with a peak of about 175,000 requests at 7 and about
160,000 at 8, which gives an average of 26 requests per second between
06:00 and 08:00. The next peak is at 10 and 11 where there are about 145,000
requests for each. Server workload discussion aside, we can already here
start to investigate client access behavior. It seems as though many users
start their workday by reading newspapers, and then they check back to
read new articles during their lunch break. We also calculated the number
of distinct users this day, which was 233,209, telling us that some clients
must request several articles.
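The bucket convention and the averaging step can be sketched in Python. This is an illustration and not the script of Appendix A.18: the input format of (hour, client IP) pairs and the sample requests are assumptions made for the example.

```python
from collections import Counter


def hourly_buckets(requests):
    """requests: iterable of (hour_of_day, client_ip), hour_of_day in 0..23.

    Bucket h counts requests between (h-1):00 and h:00, so a request
    made at 06:35 is counted in bucket 7.
    """
    return Counter(hour + 1 for hour, _ip in requests)


def average_per_second(buckets, first_bucket, last_bucket):
    """Average request rate over the hours covered by the given buckets."""
    total = sum(buckets[b] for b in range(first_bucket, last_bucket + 1))
    hours = last_bucket - first_bucket + 1
    return total / (hours * 3600)


# Made-up requests: (hour of day, client IP).
requests = [(6, "10.0.0.1"), (6, "10.0.0.2"), (7, "10.0.0.1"), (9, "10.0.0.3")]
buckets = hourly_buckets(requests)
distinct_clients = len({ip for _hour, ip in requests})
```

Comparing `distinct_clients` with the total number of requests gives the same kind of check as in the text: when there are more requests than clients, at least one client must have made several requests.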
To study the workload of the streaming news server we used the
script in Appendix A.19 to investigate the number of requests per hour
on February 6, 2002. The result is presented in Figure 8.2, and again the
requests are collected in one hour buckets. We see that there are far
fewer requests for this type of object. The peak of these requests is at 10
(between 09:00 and 10:00) with about 400 requests. At 11 there are about 300
requests, which gives an average of 5.8 requests per minute between 09:00
and 11:00. Here too we see a small peak at the start of the workday, but by
far the most requests for these objects are during lunch time. The number
of distinct clients this day was 1,352 and the total number of requests
was 2,366, suggesting that at least some clients request two or more streams.
Figure 8.1: VG server workload, 2004-12-08 (requests per hour)
Figure 8.2: NR server workload, 2002-02-06 (requests per hour)
To find out if our dataset is representative of the actual average workload
of the server, we did a comparison of all the logs we received from
VG. In a very simple analysis we implemented a script that created a graph
of web news log similarities based on the size of each log we received. As
mentioned in Chapter 2, each log covers half an hour of material.
The result is presented in Figure 8.3. As we see, the logs exhibit a sort of self
similarity, showing that our data is in fact representative, not only for a
workload characterization, but also for all the other types of analysis
we do in this thesis. From this figure we also see that on weekends, around
the 200 and 500 marks in the graph, the number of requests is not as high
as during regular week days. However, the weekend logs are not very far
behind the weekday logs and we see that web news objects are requested
during weekends as well.
Figure 8.3: VG log size comparison (size of log in MB versus log number, half an hour per log, over 2 weeks)
Figure 8.4: VG mean number of times IP is seen per day (average number of requests from the same IP per day)
8.2 Web news sessions
Kim et al. claim in their analysis of article access distribution that
clients request several articles while connected to a news server [29].
They show no proof of this, but they do use it as the basis for an article
popularity model they present. We want to investigate whether this claim
holds for our dataset. In order to do so, we first need to define client
sessions. Catledge et al., in their study of user interface events,
defined sessions to be within 1 1/2 standard deviations of the mean time
between user events [20]. We decided to follow a similar approach and
define sessions to include all requests within 1 standard deviation of the
mean time between requests from each client per day.
To find out if clients request several articles in a session, we perform
several steps. First, we recognize that for sessions to exist at all, some
clients must request more than one article per day, and as such there
should be more than one entry in the logs per day from the same IP.
Therefore, we first check whether we actually see multiple requests from
some IP addresses each day. We used the script in Appendix A.20 to create
a graph of the mean number of requests from the same IP address each day,
Figure 8.4. We see that the mean is between 2 and 3 requests a day. The
mean is not a very informative value here in terms of regular client
behavior, since it is very sensitive to extreme values. For example, a
proxy server requesting hundreds or thousands of articles per day would
greatly influence the value of the mean. It does, however, tell us that
there is a possibility that clients request more than one article in a
session, which is what we wanted to learn from this study.
Table 8.1: VG average request timing table (attributes: date on which we record the average time between requests; client IP address; average time between requests on this date)
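To illustrate why the mean is fragile in this setting, the following small Python sketch compares the mean and the median for a hypothetical day of per-IP request counts. The numbers are invented for illustration and are not taken from the VG logs.

```python
from statistics import mean, median

# Hypothetical per-IP request counts for one day: nine ordinary clients
# and one proxy that fetches 1,000 articles.
requests_per_ip = [1, 1, 1, 1, 2, 2, 1, 3, 1, 1000]

avg = mean(requests_per_ip)    # pulled far upwards by the single proxy
mid = median(requests_per_ip)  # unaffected by the single outlier
```

Here one proxy lifts the mean to about 101 requests per IP, while the median stays at 1, which is the behavior of a typical client.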
Next, since there is a possibility that sessions exist, we need to find out
if the requests are reasonably close in time, so that we can justify
grouping them together in a session. In order to test this, we use the
definition of sessions from earlier in this section. We implemented the
script in Appendix A.21, which calculates the mean interval between
requests from each IP per day, giving a result of 22 minutes 31 seconds.
This script also created a table, Table 8.1, recording the mean distance
between requests for each IP on each day the IP is observed. In order to
find the standard deviation, we wrote all the mean times from this table
to a file and loaded them into an R vector that we could query in the R
environment. The summary() function shows a mean of 1,850.65 seconds and
the sd() function gives a standard deviation of 3,571.943 seconds, or
about 59.5 minutes. From the definition presented earlier, we therefore
take sessions in our dataset to be 1 hour long, which seems reasonable.
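The arithmetic behind the session length can be checked with a short Python sketch. The helper `interval_stats` is hypothetical and only mirrors what R's mean() and sd() report (sd() is the sample standard deviation); the two constants are the values quoted in the text.

```python
from statistics import mean, stdev


def interval_stats(per_ip_day_means):
    """Mean and sample standard deviation, like R's mean() and sd()."""
    return mean(per_ip_day_means), stdev(per_ip_day_means)


# Values reported in the text, in seconds.
reported_mean = 1850.65
reported_sd = 3571.943

sd_minutes = reported_sd / 60                              # about 59.5
mean_plus_sd_minutes = (reported_mean + reported_sd) / 60  # about 90.4
```

The 59.5 minute figure matches the standard deviation taken on its own. Adding the mean interval to it would give roughly 90 minutes, so the one hour session used here is the more conservative of the two readings.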
With sessions defined, we created another table, Table 8.2, using the
script in Appendix A.22, which assigns a session ID to each request from
each IP. The scripts in Appendix A.23 and A.24 are used to record the
number of requests in each session and to create a histogram of the number
of sessions versus the number of requests in the sessions, Figure 8.5. The
number of requests per session ranged from 1 all the way up to 1,397. In
our graph we only show the sessions with up to 20 requests. The reason is
that beyond this limit there are mostly just one or two sessions for each
request count, and most of those sessions are probably not from distinct
users. From the graph we see that the absolute majority of sessions
contain only one request for an article. Interestingly, when comparing to
Zipf, we see that the number of sessions, ordered by the number of
requests they contain, follows a Zipf distribution with α = 1.3. This
means that the probability of a session containing a certain number of
requests decreases as the number of requests increases, and that small
sessions are favored even more strongly than in a pure Zipf model. From
this result, we find that there exist sessions in which clients request
several articles, but the probability of a session containing one more
request decreases according to a Zipf distribution with α = 1.3.
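The session assignment and the histogram described above can be sketched in Python as follows. This is not the code from Appendix A.22 to A.24. It assumes that a session ends when the gap between two consecutive requests from the same IP exceeds the one hour limit, which is one possible reading of the definition; all names are hypothetical.

```python
from collections import Counter, defaultdict

SESSION_TIMEOUT = 3600  # seconds; the one hour session length from the text


def assign_sessions(requests, timeout=SESSION_TIMEOUT):
    """Give every request a session id.

    `requests` is an iterable of (ip, unix_time) pairs, a simplified
    stand-in for the log rows. A new session starts for an IP when the gap
    to its previous request exceeds `timeout` (an assumed interpretation).
    Returns a list of (ip, unix_time, session_id) sorted by ip and time.
    """
    by_ip = defaultdict(list)
    for ip, t in requests:
        by_ip[ip].append(t)

    rows, session_id = [], 0
    for ip in sorted(by_ip):
        previous = None
        for t in sorted(by_ip[ip]):
            if previous is None or t - previous > timeout:
                session_id += 1
            rows.append((ip, t, session_id))
            previous = t
    return rows


def session_size_histogram(rows):
    """Map 'requests in a session' to 'number of such sessions'."""
    sizes = Counter(session_id for _, _, session_id in rows)
    return dict(Counter(sizes.values()))


def zipf_share(k, alpha=1.3, max_k=20):
    """Zipf-like share of sessions with k requests, normalized over 1..max_k."""
    norm = sum(1 / n ** alpha for n in range(1, max_k + 1))
    return (1 / k ** alpha) / norm
```

The function `zipf_share` gives the reference curve with α = 1.3 against which a measured histogram of up to 20 requests per session could be compared.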
Table 8.2: VG session table (attributes: client IP address; date of the request; time of the request; id of the session the current request is in; article id requested)
Figure 8.5: VG number of sessions with x number of requests (dotted curve is fitted Zipf with alpha = 1.3)
8.3 Web news reference patterns
In [22] it is observed that for prefetching of objects, it is only
necessary to transfer an object at a rate sufficient to deliver it in
advance of the user's request. If through our analysis we find some access
relationship between articles, and also some timing constraint for that
relation, it would aid the use of rate controlled prefetching. Having
verified that sessions exist in which clients request several articles, we
therefore also wanted to investigate the time between these requests, to
find out how much time is spent on each article. From Table 8.2 we can
calculate the mean time between requests within sessions, thereby gaining
information on how long users spend reading an article on average. For
this we used the script in Appendix A.25, which gave us a result of 92
seconds. From a small experiment on the reading time of complete articles,
we find this result to be quite accurate. We do not know of any other
study of article reading times, but it seems as though most articles are
read from beginning to end.
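The timing constraint for rate controlled prefetching follows directly from this number: an object has to arrive within the time the user spends on the current article. A minimal Python sketch of that relation is given below. The function name and the article size are assumptions for illustration only.

```python
MEAN_READING_TIME = 92.0  # seconds between requests in a session, from the text


def min_prefetch_rate(object_bytes, time_available=MEAN_READING_TIME):
    """Lowest transfer rate, in bytes per second, that delivers the object
    before the next request is expected."""
    if time_available <= 0:
        raise ValueError("time_available must be positive")
    return object_bytes / time_available


# Hypothetical 46,000 byte article; the size is an assumption.
rate = min_prefetch_rate(46_000)  # 500 bytes per second
```

Any rate above this minimum delivers the article early, while a lower rate risks the user requesting it before the transfer is complete.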
Even though we can identify some time requirements for article requests,
in order to perform prefetching we need to know which article is going to
be requested next. Therefore, we also need to look at the relationship
between requests. To perform this analysis, we wanted to use the web news
log attribute ref-args to find out where each request came from, so that
we could find out whether there were groups of articles always requested
together. On closer investigation of this attribute we found that of the
14,905,052 accesses, only 123,919 had this attribute filled in. Further,
most of them came from other sites, or did not contain values we could
deduce any information from. Only 7,067 of these entries had artid=
somewhere in the string representation, and 6,045 of these entries had
this attribute set to a pure artid=number representation. 26,305 of the
requests with the ref-args field set came from an image on the VG server,
which probably means that the user clicked one of the images accompanying
a headline on the front page. Because of the numerous different string
representations found in this attribute, a thorough investigation of
relationships between single elements would require complicated string
matching methods. Being pressed for time, we did not see an obvious need
for such an analysis based on the numbers presented above. The attribute
is simply not used enough to contribute any important results.
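A simple classification of ref-args values along the lines described above could look like the following Python sketch. The example strings, the use of "-" for an empty field, and the requirement that artid= is followed by digits are assumptions; the real attribute contains many more representations.

```python
import re

# "Pure" form: the whole value is artid=<digits>.
PURE_ARTID = re.compile(r"^artid=(\d+)$")
# Looser form: artid=<digits> appears somewhere in the value.
ANY_ARTID = re.compile(r"artid=(\d+)")


def classify_ref_args(value):
    """Return ('pure' | 'embedded' | 'other' | 'empty', article id or None)."""
    if not value or value == "-":
        return "empty", None
    match = PURE_ARTID.match(value)
    if match:
        return "pure", int(match.group(1))
    match = ANY_ARTID.search(value)
    if match:
        return "embedded", int(match.group(1))
    return "other", None


# Share of accesses with the attribute filled in, from the counts in the text.
filled_share = 123_919 / 14_905_052  # about 0.8 percent
```

The last line makes the sparsity explicit: fewer than one in a hundred accesses carries any ref-args value at all.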
8.4 Stream interaction patterns
Next we analyze the streaming logs to investigate how users interact with
streaming videos. The two questions we had here were whether videos are
viewed in full or only partially, and, if they are only viewed partially,
how much of the video is viewed. In this investigation, we can only look
at those objects we were able to map from the stream server list table,
Table 7.1, to the objects table, Table 5.2, since these are the only
objects we know the exact size of. As mentioned in Chapter 7, out of the
2,413 distinct objects we found in the logs, we were able to map 1,325 to
the objects table. Of these 1,325 objects, 1,322 were of the types WMA or
WMV. The 3 others were ASF/ASX. In addition, we also had to check that the
sc-bytes field of the streaming logs access table, Table 5.4, was filled
in for each entry used in this analysis. With all the above restrictions,
we ended up with 4,198,779 requests to 1,319 objects we could evaluate.
Table 8.3: NR access statistics table (attributes: id of object; full size of object in bytes; bytes viewed; percentage of object viewed)
We implemented the script listed in Appendix A.26, which, using the above
restrictions, created Table 8.3. This table lists all requests to each
object for which we were able to determine the initial size and which also
had the bytes sent from server to client attribute filled in. Further, the
table also contains information on what percentage of the object was
viewed for each request.
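The construction of such a table can be sketched in Python as shown below. This is an illustration of the restrictions and of the percentage calculation, not the script in Appendix A.26; the input shapes and function names are hypothetical.

```python
def view_rows(requests, object_sizes):
    """Build (object_id, size, bytes_sent, percent_viewed) rows.

    `requests` is an iterable of (object_id, sc_bytes) pairs and
    `object_sizes` maps object ids to their size in bytes. Requests are
    skipped when the size is unknown or the bytes sent field is empty,
    mirroring the restrictions described in the text.
    """
    rows = []
    for object_id, sc_bytes in requests:
        size = object_sizes.get(object_id)
        if not size or sc_bytes is None:
            continue
        percent = 100.0 * sc_bytes / size
        rows.append((object_id, size, sc_bytes, percent))
    return rows


def classify(percent):
    """Label a request as partial, full, or more than 100 percent."""
    if percent < 100:
        return "partial"
    if percent == 100:
        return "full"
    return "over"
```

A request that received 1,270 bytes of a 1,000 byte object, for example, is recorded as 127 percent and falls in the "over" class.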
From this we found that 886 of the 1,319 objects had requests where the
number of bytes sent from server to client was larger than the size of the
file. There are several possible reasons for this. First, it can come from
commercials running first, while the actual video is being buffered. It
can also be due to TCP/UDP and streaming protocol overhead, although this
should not amount to much. We used the transport attribute of the
streaming logs to check which protocols were used. We found that TCP was
used about 60 percent of the time, UDP about 30 percent of the time, and
the rest was unspecified. Further, we calculated the mean view percentage
for all requests that were sent more than 100 percent of the object's
size, which gave us 127 percent. Neither TCP nor any streaming protocol
should give this much overhead. A stream sent over UDP on a bad link could
do this if many packets have to be resent. Another reason could be user
interaction such as jumping back and forth. However, most of the players
were Windows Media Players, which are buffering clients, so user
interaction would not be the reason there. The last possible reason is
testing of the line between client and server in order to establish usable
transfer rates and parameters. Even though many of the objects had at
least one access where more bytes were sent than the actual size of the
object, when we look at all requests only 417,945 of 4,198,779 had more
than 100 percent sent. This is only about 10 percent of the requests. It
could be that the accesses that are only slightly above 100 percent are
due to overhead in the different protocols used. For those requests that
have a greater percentage sent, a non-buffering client or UDP over a bad
link could be the cause. The most likely reason, though, since most users
use buffering clients, is overhead from protocols combined with commercial
elements and a setup phase.
Figure 8.6 shows a diagram of the number of requests that were viewed
partially, the number of requests that were viewed in full (100 percent),
and the number of requests that were viewed more than 100 percent. We
clearly see that the majority of requests only access the object
partially.
Figure 8.6: NR access view percentage (partial 80.6%, full 9.4%, more than 100 percent 10%)
To investigate this further and find out how much of the objects is
usually accessed, we created another table, Table 8.4, which summarizes
view statistics for each object in Table 8.3. We used the script in
Appendix A.27 to create this table. From this, we calculated the mean view
percentage of all requests, which came out to 57 percent. We also created
a histogram of requests accessing 10 percent of an object, 20 percent and
so on, which is shown in Figure 8.7. To do this, we used the same approach
as we did with the content analysis. We wrote to file all the percent
values from Table 8.3 where the percentage was less than 100. Then we used
the Python sorting script in Appendix A.6 to sort them, and another Python
script, listed in Appendix A.28, to count the entries in buckets of 10
percent each. From that, we used the script in Appendix A.29 to create the
graph.
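The bucket counting step can be sketched in Python as follows. This is not the script from Appendix A.28; the function name and the convention that a value exactly on a boundary belongs to the lower bucket are assumptions.

```python
import math


def bucket_counts(percents, width=10):
    """Count partial views in buckets of `width` percent.

    Bucket 10 holds values in [0, 10], bucket 20 holds (10, 20], and so
    on. Only values below 100 are counted, as in the text. The boundary
    convention is an assumption.
    """
    counts = {upper: 0 for upper in range(width, 100 + width, width)}
    for p in percents:
        if p < 0 or p >= 100:
            continue
        upper = max(width, math.ceil(p / width) * width)
        counts[upper] += 1
    return counts
```

For the input `[0, 5, 10, 10.5, 57, 99.9, 100, 127]`, the first three values land in bucket 10, 10.5 in bucket 20, 57 in bucket 60 and 99.9 in bucket 100, while the last two are excluded because they are not partial views.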
Table 8.4: NR view statistics table (attributes: id of object; count of accesses that viewed the object in full; count of accesses that viewed the object partially; total number of accesses to this object; percentage of accesses that viewed the object in full; mean percent viewed over all accesses to this object)
Figure 8.7: NR access view percentage distribution for partial accesses (number of accesses versus percent viewed, in 10% buckets)
Figure 8.8: NR cumulative access view percentage (percent of the number of accesses versus percent viewed, in 10% buckets)
We see that the most usual access pattern is to watch only the first 10
percent of the object. If the user watches more than 10 percent, the next
most common pattern is to watch 100 percent of the object. This is
reasonable, since one could imagine users viewing the beginning of a news
clip and then deciding whether or not it is interesting. If it is not, the
user will stop viewing early in the stream, and if the clip is
interesting, the user will watch it all. It would be tempting to conclude
here that a client either watches less than 10 percent or watches the full
100 percent. If such a theory were correct, it would be of great advantage
to prefix caching especially. However, we see from Figure 8.7 that a
substantial number of requests are spread almost uniformly over the
buckets between 10 and 100 percent. To see if we can conclude anything
about the size of a prefix, we used script A.30 to create a graph of the
cumulative distribution of the access count, Figure 8.8. From this we see
that 20 percent of the requests watch less than 10 percent of an object.
Other than that, the view percentage is almost uniformly distributed.
There are no clear distinctions to be made here, but if prefix caching is
to be used, a prefix of somewhere between 10 and 20 percent of an object
could be a good choice, accounting for between 20 and 30 percent of the
requests.
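How a cumulative distribution supports the choice of prefix size can be illustrated with the following Python sketch. The bucket counts are hypothetical and only shaped like the description above (a large first bucket and a roughly uniform remainder); they are not the measured values.

```python
from itertools import accumulate


def cumulative_share(counts):
    """Turn bucket counts into cumulative percentages of all requests.

    `counts` maps the upper bound of each bucket to its request count.
    """
    uppers = sorted(counts)
    total = sum(counts.values())
    running = accumulate(counts[u] for u in uppers)
    return {u: 100.0 * r / total for u, r in zip(uppers, running)}


# Hypothetical counts, not taken from the NR logs.
counts = {10: 20, 20: 9, 30: 9, 40: 9, 50: 9, 60: 9, 70: 9, 80: 9, 90: 9, 100: 8}
cdf = cumulative_share(counts)
```

Reading `cdf[10]` and `cdf[20]` gives the share of partial requests that a cached prefix of 10 or 20 percent of the object would serve completely.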
Chapter 9
Lifetime and Popularity
In this chapter, we explore the lifetime and popularity of web articles as
well as streaming objects. We start with an analysis of article lifetime,
then we look at article popularity, and finally we examine the lifetime
and popularity of streaming objects to investigate whether streaming news
has patterns similar to those of web news.
9.1 Article lifetime analysis
To answer the question of the lifetime of news articles, in terms of the
distance in days between the first and last day they are accessed, we
first created a smaller table from the web news article table, recording
the first and last date each article is seen and the total number of
requests to that article. The script we used for this operation is listed
in Appendix A.31. Table 9.1 shows a description of the attributes in this
table.
Table 9.1: VG article information table (attributes: uri-query field from 5.5, of type character varying(512); first date we see access to this article; last date we see access to this article; total number of accesses to this article)
Figure 9.1: VG article lifetime of all articles (number of articles versus number of days)
In order to investigate lifetime, we implemented the script in Appendix
A.32, which uses this table to create a graph of the distance between the
first and last day we see a request to an article, for all articles,
Figure 9.1. We see from the result that many articles have a lifetime of
eight days. Our log material only covers an eight day period, meaning that
most articles live longer than what we can find in our analysis. We also
see that many articles are accessed on one day only. These could simply be
old articles being referenced again in one of the new articles of the week
we have logs from, but we cannot know this for sure. It could also mean
that new articles become unimportant almost right away, but then we should
not see a substantial number of articles with a lifetime of eight days.
One question that arises is whether article IDs are recycled. We performed
sample tests by looking up the most referenced articles, and they had not
been recycled, so this is not the case.
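The lifetime measure used here can be sketched in Python as follows. The sketch assumes that both the first and the last day are counted, so that an article seen on every day of an eight day log has a lifetime of eight days; this counting convention, the dates, and the function names are assumptions and not taken from the script in Appendix A.32.

```python
from collections import Counter
from datetime import date


def lifetime_days(first_seen, last_seen):
    """Lifetime as the number of days spanned, counting both end days.

    An article seen on a single day has a lifetime of one day. Whether
    the thesis counts inclusively is an assumption.
    """
    return (last_seen - first_seen).days + 1


def lifetime_histogram(articles):
    """`articles` maps article id to (first_seen, last_seen)."""
    return dict(Counter(lifetime_days(f, l) for f, l in articles.values()))


# Hypothetical articles over an eight day log period.
articles = {
    "a": (date(2004, 12, 7), date(2004, 12, 14)),  # seen first and last day
    "b": (date(2004, 12, 9), date(2004, 12, 9)),   # seen on one day only
}
histogram = lifetime_histogram(articles)
```

The resulting histogram maps each lifetime in days to the number of articles with that lifetime, which is the quantity plotted in a lifetime graph of this kind.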
Since we cannot say anything from our logs about when articles enter the
system, for further analysis we decided to use only articles that we see a
reference to on the first day of the logs. By doing so, we limit the
number of old articles accessed only once or twice within the time period
of our logs. This way, most articles will be new articles, which we can
use to analyze lifetime more correctly.
Therefore, using the script in Appendix A.33, we created a histogram of
the distance between the first and last day of accesses for only those
articles that had been accessed on the first day of the logs, Figure 9.2.
We see that the number of articles that are accessed on one day only has
dropped dramatically, suggesting that those were mostly old articles
referenced only a few times through links to related content in recent
articles. To emphasize this even further, we calculated the percentage of
documents in the logs that were accessed only once, which was 33.6
percent. Interestingly, [16] found in their study of several different
logs from different types of web servers, none of which were news web
servers, that approximately a third of all distinct documents were
accessed only once. It would seem, then, that old articles generally
follow a regular web pattern.
Figure 9.2: VG article lifetime of articles seen first day of logging (number of articles versus number of days)
A bit of a surprise in this new day distance analysis is the fact that
most articles are accessed both on the first day and on the last day of
our logs. Kim et al. found in their study that the most popular articles
last three days on average [29]. In our dataset we see that all new
articles, not only the most popular, actually live for quite some time, at
least eight days. However, they have modeled lifetime as a function of
popularity, so the two life cycles are not directly comparable.
Figure 9.2 only shows the distance in days between the first and last
access. Even though it is unlikely that all articles are requested on the
first day of logging and then again only on the last day, we calculated
the mean number of days on which these articles were accessed, which
turned out to be six days. This result is not very informative, but
knowing that most articles have a day distance of at least eight days, one
could imagine that most of the articles we see on day one of our logs are
new articles which are accessed on all eight days. Then there is a subset
of old articles accessed on day one of the logs which we do not see again
later, and as such they are the ones that reduce the mean. However, all of
the results we have found so far clearly show that we do not have enough
material to evaluate the lifetime of articles.
Figure 9.3: VG article cumulative access distribution (number of accesses versus number of articles)
9.2 Article access distribution
Next we want to explore the distribution of accesses over articles.
Through this we can get an idea of the distribution of hot documents. [16]
found in their study that general web traffic followed the 90/10 rule: 90
percent of the requests were for 10 percent of the web pages. To
investigate this issue for web news, we used the script in Appendix A.34
to create a graph of the cumulative access distribution for the whole week
of logs in the database, Figure 9.3. As we can see, the concentration on
hot documents is even stronger than 90/10 for web news. About 96 percent
of the requests are for 10 percent of the articles. Combining this with
the results in the previous section, which show that new articles are
usually requested for at least eight days, it appears as though new
articles are very popular, but many older articles are also being accessed
in the course of a week. To investigate this further, we created a graph
of the mean access distribution of the articles seen the first day over
the whole eight day period, using the script in Appendix A.35. We see from
the graph in Figure 9.4 that their popularity in terms of access counts
drops dramatically from the first day to the second, and then again to day
three. From this we can conclude that web news does become old after just
one day, and new articles are much preferred over old ones, even though we
have shown that articles continue to be requested beyond our dataset of
one week.
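The comparison with the 90/10 rule amounts to asking what share of all requests goes to the most requested tenth of the articles. A minimal Python sketch of that calculation is shown below. The access counts are hypothetical and the function name is an assumption.

```python
import math


def top_share(access_counts, fraction=0.10):
    """Percentage of all requests that go to the most accessed `fraction`
    of the articles."""
    ranked = sorted(access_counts, reverse=True)
    top_n = max(1, math.ceil(fraction * len(ranked)))
    return 100.0 * sum(ranked[:top_n]) / sum(ranked)


# Hypothetical counts: one hot article and nine cold ones.
share = top_share([960, 10, 8, 6, 5, 4, 3, 2, 1, 1])
```

In this invented example the single hottest article out of ten receives 960 of 1,000 requests, so `share` is 96 percent, the same level of concentration as reported for the VG articles.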
Figure 9.4: VG access distribution of articles seen first day of logging (total accesses versus day number)
Many have modeled web page popularity with Zipf and found that α had to be
adjusted. The only other work we know of that has modeled article
popularity specifically is [29], which also says that article popularity
differs from pure Zipf with α = 1. They also claim that they get close to
a pure Zipf distribution when combining articles into groups. As mentioned
in Chapter 3, the way they have modeled access popularity in their graphs
is a bit misleading. One of the graphs is said to show the mean access
popularity of articles over a month, and there is a corresponding graph
comparing NoD article popularity to Zipf. The problem is that Zipf is time
independent. By calculating the mean access popularity over a whole month,
they are actually creating graphs of the probability that articles live
for one month. Such an analysis belongs in a lifetime investigation, so we
also created a graph like theirs, Figure 9.5. We have put Zipf in our
graph even though it is not comparable to what the graph actually shows.
The reason we have done so is to compare our results to those of [29].
They found their curve to be less steep than Zipf, which is the same
result we get.
Figure 9.5: VG likeliness of becoming popular compared to Zipf (1 week, top 10 percent of the articles; number of requests R(i) versus article popularity rank i)
9.3 Article popularity
Putting aside the discussion of article popularity over time, we now
continue by comparing the article popularity in our dataset to Zipf. We
use the script in Appendix A.36 to create a graph of the popularity
distribution on the first day of the logs, Figure 9.6. As we can see, we
get a distribution where pure Zipf is the best fit for our curve, and
without a log/log scale in this figure the two are impossible to tell
apart. Kim et al., who analyze a dataset similar to ours, use 145 articles
in their graphs [29]. Therefore, we created a similar graph with only the
150 most popular articles on the first day of the logs, Figure 9.7. When
looking only at the top 150 articles, we need to adjust α to 0.7 to get
the closest fit to a Zipf curve. We can conclude, then, that article
popularity does follow Zipf, but as the subset of articles gets smaller we
need to decrease the value of α. This is a result of the concentration of
requests to articles as shown in the previous section.
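A Zipf distribution states that the number of requests R(i) to the article of rank i is proportional to 1/i^α, which becomes a straight line with slope -α on a log/log scale. The text does not say how α was chosen, so the following Python sketch shows one common approach, a least squares fit on the log/log data, as an assumption rather than as the method used in Appendix A.36.

```python
import math


def fit_zipf_alpha(request_counts):
    """Estimate alpha in R(i) ~ C / i**alpha by least squares on log/log data.

    `request_counts` are per-article request totals. They are ranked here,
    so rank 1 is the most requested article. Zero counts are dropped.
    """
    ranked = sorted((c for c in request_counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two articles with requests")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic data for 150 articles that follows Zipf with alpha = 0.7 exactly.
synthetic = [1000.0 / rank ** 0.7 for rank in range(1, 151)]
alpha = fit_zipf_alpha(synthetic)
```

On the synthetic data the fit recovers α = 0.7. On real logs the estimate depends on how many ranks are included, which is consistent with the observation above that α changes when the set of articles is reduced.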
Figure 9.6: VG article popularity vs. Zipf on December 7, 2004 (number of requests R(i) versus article popularity rank i; dotted line is pure Zipf with alpha = 1)
Figure 9.7: VG top 150 article popularity vs. Zipf on December 7, 2004 (number of requests R(i) versus article popularity rank i; dotted line is pure Zipf with alpha = 1)
[29] also presents an article popularity model called Multi-selection
Zipf. This model is based on a claim that clients request several articles
once connected to a news server, and that as such there exist groups of
articles which can be ranked by popularity. As we have shown in the
previous chapter, clients do request several articles in sessions, but for
the most part they only request one article. In their own comparison of
the algorithm to Zipf, they did find that the fewer articles in a group,
the closer the algorithm comes to Zipf. This is the same result we can
read from our session analysis in the previous chapter. The probability of
a group containing more than one article diminishes according to Zipf.
However, nothing has been said about the popularity of the groups of
articles. As it turns out, they cannot know which group is the most
popular, since they rank articles according to the mean access over a
whole month and Zipf is time independent. For example, there is nothing in
their graphs that tells us whether a group of the top three articles on
one particular day is more popular than a group with the number one
article from three days in a row. Since we did not investigate the
relationship between articles within a session to find groups of articles,
we cannot conclude anything about the popularity distribution of such
groups, but neither can [29]. Therefore, we cannot verify or invalidate
Multi-selection Zipf, but we do think it needs to be analyzed in more
detail.
9.4 Stream objects lifetime and popularity
Finally, we also investigate the lifetime and popularity of the streaming
objects, to see whether their properties are similar to those of articles
or regular video, or whether they have distribution characteristics
entirely their own.
Using the script in Appendix A.37, we created the same kind of table as
for the articles, with information about the first and last day an object
is accessed as well as the total number of requests to each streaming
object. We first calculated the minimum, maximum and mean number of days
between the first and last day these objects are requested, with the
results 1, 352, and 61. [24] found that once a movie enters a system, it
never leaves. Since we see that some of the stream objects in our dataset
are requested throughout the whole period of one year, it could mean that
streaming objects are comparable to movies in terms of lifetime. However,
we also created a graph similar to the day distance graph of the web news
articles, Figure 9.8. From this we learn that the absolute majority of
stream objects are accessed on one day only. Also, from investigating how
many new objects there were each day in these logs, we found that, as with
movies, there were not many new objects seen each day, and on many days
there were none.
Figure 9.8: NR streaming objects lifetime (number of objects versus day distance between first and last access to a streaming object, in days)
Figure 9.9: NR objects Zipf comparison (number of requests R(i) versus object popularity rank i)
In addition, we also compare streaming news popularity to Zipf. For this
we used the script in Appendix A.38, which compares the popularity
distribution on 2 February 2002 to Zipf. In Figure 9.9 we can see that the
popularity of streaming news objects is also close to pure Zipf, as it was
for the web news articles, but we need to adjust α to 0.8 to get an almost
exact match. This is very similar to what we found for web news, 0.7, when
we reduced the dataset. By reducing the dataset we were in effect making
it more similar to our stream dataset in terms of the number of objects.
Because we do not have enough material to correctly investigate the
lifetime of articles, we cannot compare lifetime for articles and stream
objects. But knowing that there is a substantial number of old articles in
our logs, as well as similar access popularity behavior for new objects,
we could imagine a similar graph for the news articles. If this is
correct, then streaming news is comparable to web news in terms of
popularity distribution.
A third comparison shows that far fewer objects are requested, and far
fewer new objects are released each day, in the stream log than in the web
log, which suggests that streaming news has the same characteristics as
movies. It does seem as though streaming objects exhibit their own
characteristics, with similarities to both movies and articles. We cannot,
however, investigate this issue any further with our dataset.
Chapter 10
Conclusion
In this chapter, we summarize the work we have done and present the most
important results we have found in this thesis. Finally, we outline ideas
for future work within the topic of this thesis, based on some weaknesses
and open questions in our analysis.
10.1 Thesis summary
In this thesis, we have investigated several different aspects of a NoD
environment through analysis of log files from both the web and streaming
servers of Norway’s largest online newspaper, VG. We divided the analysis
into four main areas: content analysis, article access pattern analysis,
stream interaction analysis, and lifetime and popularity analysis. In
terms of content, we have studied file type distribution and access
distribution, as well as size distribution between and within the
different object types of both the web and streaming logs. Continuing, we
did a short workload characterization of the two server logs. Then, for
article access patterns, we have analyzed the existence of sessions, the
number of requests in a session, the time between requests, and the
relationships between them. For streaming objects, we have investigated
the distribution between partial and full requests of objects and the
percent viewed distribution for those only accessed partially. Lifetime
analysis has been performed in terms of the time period in which we see
requests to objects, and we have also performed popularity analysis in
terms of both request distribution and comparison with the well-known Zipf
popularity distribution. In addition to answering our questions, we have
also developed a set of applications and methods for performing this type
of analysis.
10.2 Results
In this section, we discuss the results we have found from analysis of
the questions in Section 2.3. An implicit result of our work has been the
development of analysis methods and tools, so we also present a discussion
of these.
10.2.1 Tools development
As discussed in Chapter 4, due to the numerous different types of analysis
we wanted to perform, we could not just pick up an existing analysis tool
and use it. We had to create our own. We chose to create applications in C
to extract information from the logs, format it and import it into a
database. Once the data was in the database, we implemented several
scripts to perform specific tasks related to the different types of
analysis above. We found this method to work very well for several
reasons. First, the simplicity of queries provides a great way of
investigating answers to single questions. Second, indexes greatly speed
up these queries. Last, the number of ways one can interact with a
PostgreSQL database allows for selection of the right tool for the right
job. For example, in our work we have used PL/pgSQL scripts for most of
the analysis jobs and PL/R for creating graphs of the results. In
addition, there are many libraries which enable the user to interact with
the database in the language of choice, for example libpq for C.
10.2.2 Content analysis
We performed separate analyses of the content of the web news and streaming news logs. We discuss each of them here.
Web content
For our web news content, we first examined the types of files we found
on the server and the distribution among them. The result showed that
there are many different types of files in this environment, but those that
were represented the most were clearly images and HTML documents.
Among the image types we found, most were JPEG files,
followed by GIF files. We also found that PNG files were not used at all. This is the
same result as has previously been found for regular web content, which
suggests that web news sites exhibit the same content characteristics as the
web in general.
Next, we examined the median size distribution among the different
formats, which is summarized in Table 10.1. In the size analysis, we found
that Flash objects were much larger than any other type. After Flash
objects, JPEG, GIF, HTML documents and Javascripts stood out, with JPEG
being the largest of these types. However, access analysis showed that Flash
objects were hardly ever requested, so they do not place too much load on
the server.
Table 10.1: NR median sizes
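A median size per file type can be computed in a few lines. The sketch below is only an illustration under the assumption that each distinct object is available as a (file type, size in bytes) pair; the function name and the sample values are hypothetical and not taken from our logs.

```python
from collections import defaultdict
from statistics import median


def median_size_by_type(objects):
    """Return {file_type: median size in bytes}.

    objects: iterable of (file_type, size_in_bytes), one entry per distinct
    object (hypothetical input format).
    """
    sizes = defaultdict(list)
    for file_type, size in objects:
        sizes[file_type].append(size)
    return {file_type: median(values) for file_type, values in sizes.items()}


if __name__ == "__main__":
    sample = [("jpeg", 1000), ("jpeg", 3000), ("jpeg", 2000),
              ("gif", 500), ("gif", 700)]
    print(median_size_by_type(sample))
```

The median is used rather than the mean because a few very large objects would otherwise dominate the result.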
In the access analysis, JPEG is a clear winner, accounting for the absolute
majority of requests. After JPEG, GIF is a good number two, so we see that
images are by far the most requested types. In this study we also found
that Javascripts were accessed almost as many times as HTML documents,
even though the number of Javascripts in the file type distribution analysis
was far smaller than the number of HTML documents. Here too, we have found
that requests for different types of objects are much the same as for regular web sites.
Perhaps the most interesting lesson learned, however, is that in a web news
environment, Javascripts are used extensively and are an integral part of
HTML documents.
Last, we also investigated the internal size distribution of the four most
requested file types: JPEG, GIF, HTML and Javascript. We found that
the range of sizes for images was very large, but the absolute majority
of images are small compared to what has been found in other
research on regular web content. Most JPEG and GIF files were between 1
and 5KB. For GIF files, there are not many images larger than 10KB, but
JPEG also had a substantial number of images ranging between 10 and
50KB. For HTML files, most objects were less than 2KB, which is similar to
what has been found for the web in general. Javascript gave us the clearest
internal size distribution, where most of these objects were 4KB in size. We
do not know of any other work that has analyzed the size of Javascripts.
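The internal size distribution is essentially a histogram over fixed-width size buckets. The following sketch shows the idea with 1KB buckets; the function name and the sample sizes are hypothetical, and the bucket width is a parameter so other resolutions can be tried.

```python
from collections import Counter


def bucket_sizes(sizes_in_bytes, bucket_bytes=1024):
    """Count objects per size bucket.

    Bucket k holds sizes from k * bucket_bytes up to, but not including,
    (k + 1) * bucket_bytes.
    """
    counts = Counter(size // bucket_bytes for size in sizes_in_bytes)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical object sizes in bytes.
    print(bucket_sizes([100, 1023, 1024, 4500, 4096]))
```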
Streaming content
For the streaming content, we had already learned in Chapter 2 that VG
used WMV as their video format, and as such we expected to find most
objects of this type. The file type distribution investigation supported our
belief. We did find a range of 13 different file types, but counting the
number of objects showed that WMV and WMA were almost exclusively
represented. Of the other types, only JPEG had enough objects to be seen in the graph.
From the size distribution we saw, as expected, that videos were largest.
Table 10.2 shows a summary of the median sizes. After WMV came JPEG,
and then came WMA, which was substantially smaller than the others.
We believe the reason for this is that the images on this server are large
photographs used in image series on the VG news page, as opposed to
the small photographs accompanying a headline, which are the images we saw in
the content analysis of the web logs.
Table 10.2: NR median sizes
The access analysis revealed that these JPEG images were almost never
requested, so we cannot say anything about what these images were. This
analysis further showed that the only types that were accessed were WMA
and WMV, with WMV clearly accounting for most requests.
From the internal size distribution we found that most WMA files were
between 100 and 500KB. Most WMV files were between 100KB and 1MB.
One similarity between these two types is that the range of sizes is huge,
between 5KB and 20MB for audio files and 1KB to 314MB for videos.
Our results from the streaming content analysis correspond with what [14]
found in their analysis of videos on the web.
10.2.3 Workload characterization
We only did a very simple workload characterization, in which we found
that the peak hours of web news requests were at the beginning of
the work day and at lunch time. Most requests were made in the
morning, with an average of 26 requests per second between 06:00 and
08:00. For streaming news, only the lunch hours stood out, with an
average of 5 requests per minute. In addition, we did a study
of the size of log files on the web news server, where we found a
self-similarity suggesting that our dataset contains representative data for all
of our different investigations. We also found that weekend logs
were smaller than weekday logs, but there were still a lot of requests
for web news objects during the weekend.
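Average request rates per hour of the day, such as the 26 requests per second quoted above, can be derived from request timestamps as sketched below. This is a minimal illustration, assuming the timestamps have already been parsed into datetime objects; the function name and the days parameter are hypothetical.

```python
from collections import Counter
from datetime import datetime


def average_rate_per_hour(timestamps, days=1):
    """Return {hour of day: average requests per second in that hour}.

    timestamps: iterable of datetime objects, one per request.
    days: number of days the log covers (assumed known by the caller).
    """
    per_hour = Counter(ts.hour for ts in timestamps)
    return {hour: count / (3600 * days)
            for hour, count in sorted(per_hour.items())}


if __name__ == "__main__":
    # Hypothetical data: two requests every second between 06:00 and 07:00.
    stamps = [datetime(2004, 12, 8, 6, m, s)
              for m in range(60) for s in range(60)] * 2
    print(average_rate_per_hour(stamps))
```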
10.2.4 Article access patterns
Next, we looked at access patterns for articles in the web news logs. We first
analyzed the existence of sessions, where we defined sessions to include all
requests from one client within one hour. We did find that sessions with
multiple requests exist, but for the most part only one article was requested.
This is possibly the most interesting result of this thesis. The popularity of
sessions decreases according to Zipf as the number of requests
in a session increases. That is, the most usual access pattern is to read just one
article, and the probability of a session containing one more request follows
a Zipf distribution with α equal to 1.3.
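The session definition can be made concrete with a short sketch. It assumes one particular reading of the definition: a session starts with a client's first request and contains all further requests from that client within one hour of that first request, after which a new session starts. The input format and function names are hypothetical.

```python
from collections import Counter, defaultdict

SESSION_SECONDS = 3600  # one hour, as in the session definition


def build_sessions(requests):
    """Group requests into sessions.

    requests: iterable of (client_ip, unix_time, article_id).
    Returns a list of (client_ip, [(unix_time, article_id), ...]).
    """
    by_client = defaultdict(list)
    for client, ts, article in requests:
        by_client[client].append((ts, article))

    sessions = []
    for client, items in by_client.items():
        items.sort()
        current, start = [], None
        for ts, article in items:
            # Open a new session once the hour since the session start is over.
            if start is None or ts - start >= SESSION_SECONDS:
                if current:
                    sessions.append((client, current))
                current, start = [], ts
            current.append((ts, article))
        if current:
            sessions.append((client, current))
    return sessions


def session_length_distribution(sessions):
    """Return a Counter mapping requests per session to number of sessions."""
    return Counter(len(reqs) for _, reqs in sessions)
```

The resulting distribution, ordered by number of requests, is what can then be compared to a Zipf curve.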
We also calculated the average time between requests within sessions,
which we found to be 92 seconds. From test samples we made, this number
seemed representative of the reading time of an article, suggesting that the
average user reads an article from beginning to end once having selected one.
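The average time between requests can be computed from the request timestamps of each session, as in the following sketch. It assumes that sessions are given as lists of timestamps in seconds; sessions with a single request contribute no gap. The function name and sample values are hypothetical.

```python
def average_gap_seconds(sessions):
    """Return the mean time between consecutive requests within sessions.

    sessions: iterable of lists of request timestamps in seconds,
    one list per session. Returns None if no session has two requests.
    """
    gaps = []
    for timestamps in sessions:
        ordered = sorted(timestamps)
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return sum(gaps) / len(gaps) if gaps else None


if __name__ == "__main__":
    # Hypothetical sessions: gaps of 90, 94 and 92 seconds.
    print(average_gap_seconds([[0, 90, 184], [10], [5, 97]]))
```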
Finally, we also wanted to investigate reference patterns within
sessions to see if we could find any relationship between articles. We were
not able to perform this study because the attribute of the logs
which would give us the required information was rarely used.
10.2.5 Stream interaction patterns
In our analysis of stream interaction, we investigated whether objects were
accessed in full or partially, and also the distribution of the percentage viewed
when only partially accessed.
We found that only about 10 percent of the objects were accessed in
full, 80 percent were accessed partially, and 10 percent were accessed more
than 100 percent. We do not know why we get entries in the logs which
access more than 100 percent, but arguments have been made that it is due
either to commercial elements downloaded before the actual object is
played out, or to protocol overhead.
Our analysis does show, however, that most objects are only viewed
partially. Therefore, we also investigated how much of the objects was viewed.
From this we did not find very distinct patterns, but we concluded that
about 20 percent of the requests access roughly 10 percent of the objects. The
access was, however, quite uniformly distributed in our study. We do not
know of any other work which has researched this for streaming objects, so
we cannot compare our result to other findings.
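The classification into partial, full and more than 100 percent, and the bucketing of partial accesses, can be sketched as follows. The sketch assumes that both the bytes sent and the object size are known for each request and that "full" means exactly equal; the function names and the bucket labelling are hypothetical.

```python
from collections import Counter


def classify_access(bytes_sent, object_size):
    """Label one request as 'partial', 'full' or 'over' (more than 100 percent)."""
    if bytes_sent < object_size:
        return "partial"
    if bytes_sent == object_size:
        return "full"
    return "over"


def partial_view_buckets(accesses):
    """Count partial accesses in buckets of 10 percent.

    accesses: iterable of (bytes_sent, object_size).
    Bucket 10 holds requests that fetched less than 10 percent, bucket 20
    holds 10 up to 20 percent, and so on up to bucket 100.
    """
    buckets = Counter()
    for sent, size in accesses:
        if size > 0 and sent < size:
            tenth = (10 * sent) // size  # integer 0..9, avoids float rounding
            buckets[(tenth + 1) * 10] += 1
    return dict(sorted(buckets.items()))
```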
10.2.6 Lifetime and popularity analysis
Last, we also studied lifetime and popularity of objects. The first thing we
learned was that we did not have enough material to conclude anything
about the lifetime of articles. The reason for this is that, on average,
documents were requested throughout the whole period covered by our subset of
logs, which was one week. However, we did find that most references to articles
are made on the first day, after which there is a steady decline.
Further, we investigated the concentration of references to see if there
was a high concentration of hot documents. [16] found that for general web
content, about 90 percent of requests were for 10 percent of the documents,
meaning that there is a small subset of web pages that are popular. In our
study of news content we found this to be even more true. About 96 percent
of the requests were for 10 percent of the documents. Combined with the
result of a decline in requests seen from the first day, this suggests that new
articles are created faster than new pages on the web and recent articles are
favored over old articles.
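The concentration of references can be expressed as the share of all requests that goes to the most requested documents. The sketch below computes this share for the top 10 percent; the function name, the parameter and the sample counts are hypothetical.

```python
def top_share(request_counts, top_percent=10):
    """Return the fraction of all requests that go to the most requested
    top_percent percent of documents.

    request_counts: iterable with one request count per document.
    """
    counts = sorted(request_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Integer arithmetic; always include at least one document.
    k = max(1, len(counts) * top_percent // 100)
    return sum(counts[:k]) / total


if __name__ == "__main__":
    # Hypothetical counts for ten documents, one of them very popular.
    print(top_share([2, 90, 2, 1, 1, 1, 1, 1, 1, 0]))
```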
Next, we compared article popularity to the Zipf distribution model.
When we used all articles requested on one day to compare to Zipf, we found
pure Zipf with α equal to one to be indistinguishable from our results. When
we narrowed down the subset to only the 150 most accessed articles, we
had to adjust α to 0.7. As noted in Chapter 3, many have applied Zipf to
requests from both web and proxy servers with about the same values
for α as we get. However, from applying Zipf to the subset of all articles
on a news server, we find that web news popularity is closer to Zipf than
that of regular web pages.
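Under Zipf, the number of requests to the object of rank i is proportional to 1/i^α, so log(requests) plotted against log(rank) is a straight line with slope -α. Our comparison was made by adjusting α and comparing graphs; as an alternative illustration, the sketch below estimates α with an ordinary least-squares fit on the log-log values. This regression is a swapped-in technique, not the method used in this thesis, and the function name is hypothetical.

```python
import math


def zipf_alpha(request_counts):
    """Estimate alpha from a least-squares fit of log(count) on log(rank).

    request_counts: iterable with one request count per object.
    Objects without requests are ignored.
    """
    counts = sorted((c for c in request_counts if c > 0), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two objects with requests")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is -alpha.
    return -sxy / sxx
```

A least-squares fit gives every rank the same weight, so the long tail of rarely requested objects can pull the estimate; restricting the input to the most accessed objects changes the result accordingly.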
We also studied the lifetime and popularity of the streaming objects.
From the popularity analysis, we found that as with web news, streaming
news are also most accessed the first day. When comparing to Zipf we
found that α equals 0.8 gave a close fit, which is similar to web news
when the dataset was reduced. However, the lifetime in terms of distance
between first and last day seen was very large, which is similar to movies.
[24] states that once movies enter a system, they never go out. We cannot
compare the lifetime of web news and streaming news though, since we
do not have enough log material to conclude anything about the lifetime of
articles. It does seem that stream objects are similar to movies in terms of how
often they are released, but similar to web news in terms of access behavior
and popularity.
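Lifetime in the sense used here, the distance in days between the first and last day an object is requested, can be computed as in the following sketch. It assumes that each request is available as an (object ID, date) pair; the function name and the sample data are hypothetical.

```python
from datetime import date


def lifetimes_in_days(requests):
    """Return {object_id: days between first and last day seen}.

    requests: iterable of (object_id, date), in any order.
    """
    first, last = {}, {}
    for obj, day in requests:
        if obj not in first or day < first[obj]:
            first[obj] = day
        if obj not in last or day > last[obj]:
            last[obj] = day
    return {obj: (last[obj] - first[obj]).days for obj in first}


if __name__ == "__main__":
    sample = [("a", date(2004, 12, 6)), ("a", date(2004, 12, 12)),
              ("b", date(2004, 12, 8))]
    print(lifetimes_in_days(sample))
```

Note that this measure is bounded by the length of the log period, which is why a one-week subset cannot say much about article lifetime.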
10.3 Future work
In this thesis, we have analyzed several aspects of NoD environments and
their objects. There is still much to learn from the logs we have received
and there is also much we cannot learn from our dataset. We believe that
NoD will become a popular trend in the not so distant future, and as such
needs to be researched further in much more detail. We therefore present
some ideas we have for future research topics.
Content is changing rapidly, and for our streaming dataset it could
be that the streaming content is already becoming old. It is unlikely that the
difference is too great, since our analysis shows that video sizes are much the
same as in a study of videos on the web from 1997 [14]. However, this could be due
to improvements in video codecs as much as to similarity in content. Also,
streaming news is a trend on the rise, so the formats and content of these
files can change quite a lot in a short time period. Therefore, it is important
to follow up the analysis of these objects as time passes. It would also be a
good idea to compare streaming objects from several different types of news
sites, for example newspapers and TV stations.
For access pattern analysis, it would be interesting to investigate
grouping of articles instead of what we tried with reference patterns. To
do this, one must match all sessions containing more than one request to
see if many sessions contain the same group of articles. Also, a comparison
of the time between requests in a session versus the size of the objects
in the session could yield beneficial results for prefetching techniques
like rate controlled prefetching [22]. In addition, it would be interesting
to see how session characteristics change during the day, for example
whether sessions between 07:00 and 09:00, when people start working, contain
several requests while sessions later in the day contain only one.
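The grouping idea mentioned above could, for example, be approached by treating the set of articles in each multi-request session as a group and counting how many sessions share exactly the same group. The sketch below is one possible starting point, not an evaluated method; the function name and the min_sessions threshold are hypothetical.

```python
from collections import Counter


def common_article_groups(sessions, min_sessions=2):
    """Return {frozenset of article IDs: number of sessions with that set}.

    sessions: iterable of iterables of article IDs, one per session.
    Only sessions with more than one distinct article are considered, and
    only groups seen in at least min_sessions sessions are returned.
    """
    groups = Counter()
    for articles in sessions:
        group = frozenset(articles)
        if len(group) > 1:
            groups[group] += 1
    return {g: n for g, n in groups.items() if n >= min_sessions}


if __name__ == "__main__":
    # Hypothetical sessions given as lists of article IDs.
    print(common_article_groups([[1, 2], [2, 1], [3], [1, 2, 3], [4, 5]]))
```

Exact set matching is strict; a follow-up could relax it to count shared subsets of articles.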
A subject for future interaction analysis would be an
investigation of the percentage of an object viewed compared to its age,
to see if there is any relationship from which one could derive a pattern. Such a
finding would greatly aid caching techniques like prefix caching, where the
prefix of an object could be dynamically changed over time.
In our study, we have done a simple lifetime analysis based on distance
in days between the first and last day we see a reference to an article.
Lifetime analysis should be further extended to include lifetime in terms
of popularity. That is, investigate the change in Zipf popularity ranking
of objects over time. If groups of articles are found from access pattern
analysis, these groups should also be compared to Zipf. This would provide
a basis for further investigation of the Multi-selection Zipf algorithm [29].
Another relationship for future work to explore is how
many articles link to the popular articles and to the referrer articles.
From this we can learn if there is a relationship between the number of
ways to access an article and its popularity.
[1] Awstats - free real-time logfile analyzer to get advanced statistics.
[2] Burn all gifs.
[3] The gif controversy: A software developers perspective. http://cloanto.
[4] Graphics formats for web pages.
[5] libpq - the c application programmer’s interface to postgresql. http:
[6] Norsk regnesentral.
[8] Pl/r - r procedural language for postgresql. http://www.joeconway.
[9] Postgresql database management system.
[10] The python programming language.
[11] The r project for statistical computing.
[12] Vg nett.
[13] Webalizer web server log file analysis program. http://www.webalizer.
[14] S. Acharya and B. C. Smith. Experiment to characterize videos stored
on the Web. In K. Jeffay, D. D. Kandlur, and T. Roscoe, editors, Multimedia
Computing and Networking 1998, Proc. SPIE Vol. 3310, pages 166–178, Dec. 1997.
[15] V. Almeida, A. Bestavros, M. Crovella, and A. de Oliveira. Characterizing reference locality in the WWW. In Proceedings of the IEEE
Conference on Parallel and Distributed Information Systems (PDIS), Miami
Beach, FL, 1996.
[16] M. F. Arlitt and C. L. Williamson. Web server workload characterization: The search for invariants. In Measurement and Modeling of Computer Systems, pages 126–137, 1996.
[17] H. Bahn, Y. H. Shin, and K. Koh. Analysis of Internet reference
behaviors in the Korean Education Network. Lecture Notes in Computer
Science, 2105:114–??, 2001.
[18] P. Barford, A. Bestavros, A. Bradley, and M. Crovella. Changes in
web client access patterns: Characteristics and caching implications.
Technical Report 1998-023, 4, 1998.
[19] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and
zipf-like distributions: Evidence and implications. In INFOCOM (1),
pages 126–134, 1999.
[20] L. D. Catledge and J. E. Pitkow. Characterizing browsing strategies
in the World-Wide Web. Computer Networks and ISDN Systems,
27(6):1065–1073, 1995.
[21] Computerworld. Nettaviser i toppen.
[22] M. E. Crovella and P. Barford. The network effects of prefetching. In
Proceedings of Infocom ’98, pages 1232–1240, Apr. 1998.
[23] C. Cunha, A. Bestavros, and M. Crovella. Characteristics of World
Wide Web Client-based Traces. Technical Report BUCS-TR-1995-010,
Boston University, CS Dept, Boston, MA 02215, April 1995.
[24] C. Griwodz, M. Bar, and L. C. Wolf. Long-term movie popularity
models in video-on-demand systems: or the life of an on-demand
movie. In MULTIMEDIA ’97: Proceedings of the fifth ACM international
conference on Multimedia, pages 349–357. ACM Press, 1997.
[25] S. Gruber, J. Rexford, and A. Basso. Design considerations for an RTSP-based prefix-caching proxy for multimedia streams. Technical Report
990907-01, AT&T Labs - Research, September 1999.
[26] T. Hafsoe. Automatic Route Maintenance in QoS Aware Overlay Networks.
PhD thesis, University of Oslo, 2006. work in progress.
[27] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for
experimental design, measurement, simulation and modelling. John Wiley
and Sons, 1991.
[28] F. T. Johnsen, C. Griwodz, and P. Halvorsen. Structured partially
caching proxies for mixed media. In WCW 2004, LNCS 3293, pages
144 – 153. Springer-Verlag Berlin Heidelberg, 2004.
[29] Y.-J. Kim, T. U. Choi, K. O. Jung, Y. K. Kang, S. H. Park, and K.D. Chung. Clustered multi-media NOD: Popularity-based article
prefetching and placement. In IEEE Symposium on Mass Storage
Systems, pages 194–202, 1999.
[30] M. Nelson. The Data Compression Book. Henry Holt and Co., Inc., New
York, NY, USA, 1991.
[31] G. Peng. CDN: Content distribution network.
[32] M. Rabinovich and O. Spatscheck. Web Caching and Replication.
Addison Wesley, 2002.
[33] V. Sawant. Zipf law.˜vivek/home/stenopedia/
[34] W. Schools. Windows multimedia formats. http://www.w3schools.
[35] J. Sedayao. ”mosaic will kill my network!” - studying network traffic
patterns of mosaic use. In Electronic Proceedings of the Second World
Wide Web Conference ’94: Mosaic and the Web, 1994.
[36] S. Sentralbyra. Internet maalingen 2002.
[37] A. Woodruff, P. M. Aoki, E. Brewer, P. Gauthier, and L. A. Rowe.
An investigation of documents from the World Wide Web. Computer
Networks and ISDN Systems, 28(7–11):963–980, 1996.
Appendix A
Source Code
In this appendix, we list the applications we have developed and give a short
explanation of each. All source code is stored on a CD distributed with
this thesis.
A.1 create-stream-tables.c
This C program parses logs from NR into the tables discussed in Chapter 5.
The source code can be found in the nr/progs/ directory on the CD.
A.2 insert-web-logs.c
This C program parses VG logs, looping through one line from one log at a
time and inserting only article requests into a database table.
The source code can be found in the vg/progs/ directory on the CD.
A.3 extract-typesize.c
This C program extracts the sizes of new objects into a distinct file for each new
type found.
The source code can be found in the vg/progs/ directory on the CD.
A.4 vgmimetypedist.R
This R script creates a histogram of given values, in this case the count of
objects per mime type in the web news logs from VG. It is executed with
the command source("mimetypedist.R") in the R environment.
The source can be found in the vg/scripts/ directory on the CD.
A.5 vgfiletypedist.R
This R script creates a histogram of given values, in this case the count of
objects per file type from selected types in the web news logs from VG.
The source can be found in the vg/scripts/ directory on the CD.
This Python script sorts the numbers in a text file in ascending order.
The source can be found in the vg/python/ directory on the CD.
A.7 vgmediansizedist.R
This R script creates a histogram of the median size of selected file types in
the web news logs from VG.
The source can be found in the vg/scripts/ directory on the CD.
A.8 vgaccessdist.R
This R script creates a histogram of the access counts to selected file types
in the web news logs from VG.
The source can be found in the vg/scripts/ directory on the CD.
This Python script collects size entries in a file into 1KB buckets and outputs
it to a new file.
The source can be found in the vg/python/ directory on the CD.
A.10 graphscript-jpg-log.R
This R script creates a histogram of the content of a table created by
Script A.9, the JPEG table in this instance. It is run with the command
source("graphscript-log-jpg") in the R environment.
The source can be found in the vg/scripts/ directory on the CD along with
the scripts performing the same task for GIF, HTML and Javascript.
A.11 nrdosls-parser.c
This C program converts the NR server listing of objects into a
format the PostgreSQL database can understand. It also adds a type field.
This program added a leading space to names, so matching in the database
did not work. We created another C program to fix this problem, called fixnrdosls-parse.c.
Both sources can be found in the nr/progs/ directory on the CD.
A.12 nrfiletypedist.plr
This PL/R script counts all entries of specific types and creates a histogram
of the results.
The source can be found in the nr/scripts/ directory on the CD.
A.13 nrmediansizedist.R
This R script creates a histogram of the median size of selected file types in
the streaming news logs from NR.
The source can be found in the nr/scripts/ directory on the CD.
A.14 nr-map-dosls-to-objects.plr
This PL/R script matches names of objects from the server list database table
to the streaming log objects table and updates the type field where it finds
a match.
The source can be found in the nr/scripts/ directory on the CD.
A.15 nr-map-objects-to-accesses.plr
This PL/R script matches IDs from the NR logs objects table to the access
table, and updates the type field when it finds a match.
The source can be found in the nr/scripts/ directory on the CD.
A.16 nraccessdist.R
This R script creates a histogram of the access counts to selected file types
in the streaming news logs from NR.
The source can be found in the nr/scripts/ directory on the CD.
A.17 nrgraphscript-wmv.R
This R script creates a histogram of the internal size distribution of WMV
files on the streaming server. Similar scripts were also used to create graphs
for WMA, JPEG and ASF files.
The sources can be found in the nr/scripts/ directory on the CD.
A.18 vg-graph-workload.plr
This PL/R script creates a graph of the workload of the VG news server on
December 8, 2004.
The source can be found in the vg/scripts/ directory on the CD.
A.19 nr-graph-workload.plr
This PL/R script creates a graph of the workload of the NR news server on
February 6, 2002.
The source can be found in the nr/scripts/ directory on the CD.
A.20 vg-graph-avg-number-of-timesprday-cip-is-seen.plr
This PL/R script creates a graph of the average number of times we see the
same IP per day.
The source can be found in the vg/scripts/ directory on the CD.
A.21 count-avg-time-between-request-prip-prday.plr
This PL/R script calculates the average time between requests per IP per
day. It also records the findings in a new table.
The source can be found in the vg/scripts/ directory on the CD.
A.22 create-vgsession-table.plr
This PL/R script creates the session table for web news articles.
The source can be found in the vg/scripts/ directory on the CD.
A.23 create-sessions-requests-table.plr
This PL/R script creates a table recording the number of requests per
session.
The source can be found in the vg/scripts/ directory on the CD.
A.24 graph-sessionrequest-table.plr
This PL/R script creates a graph of the information in the table created by
the previous script.
The source can be found in the vg/scripts/ directory on the CD.
A.25 find-avg-time-between-requests-within-session.plr
This PL/R script calculates the average time between requests in a session.
The source can be found in the vg/scripts/ directory on the CD.
A.26 create-access-viewstat-table.plr
This PL/R script creates the view statistics table for streaming objects
where both initial size and bytes sent are known.
The source can be found in the nr/scripts/ directory on the CD.
A.27 create-object-howviewed-table.plr
This PL/R script creates the streaming objects view summary table.
The source can be found in the nr/scripts/ directory on the CD.
This Python script counts requests accessing less than 100 percent of an
object into buckets of 10 percent.
The source can be found in the nr/python/ directory on the CD.
A.29 nrgraphviewscript.R
This R script creates a histogram of requests accessing 10 percent of a
streaming object, 20 percent of an object and so on.
The source can be found in the nr/scripts/ directory on the CD.
A.30 nrgraphviewscript-cumulative.R
This R script creates a graph of the cumulative access percent of requests to
streaming objects.
The source can be found in the nr/scripts/ directory on the CD.
A.31 populate-vgartinfo.plr
This script fills in the information in Table 9.1.
The source can be found in the vg/scripts/ directory on the CD.
A.32 graph-avg-day-distance.plr
This PL/R script creates a graph of the average distance between the first
and last day all articles are seen.
The source can be found in the vg/scripts/ directory on the CD.
A.33 graph-avg-day-distance-firstdayarts.plr
This PL/R script creates a graph of the average distance between the first
and last day articles from the first day of logging are seen.
The source can be found in the vg/scripts/ directory on the CD.
A.34 graph-cumulative-access-frequency.plr
This PL/R script creates a graph of the cumulative access frequency for the
whole week of web news logs.
The source can be found in the vg/scripts/ directory on the CD.
A.35 graph-cumulative-access-frequency-firstday.plr
This PL/R script creates a graph of the cumulative access frequency of only
articles seen the first day of logging, for the whole week of web news logs.
The source can be found in the vg/scripts/ directory on the CD.
A.36 graph-pop-zipf-firstday.plr
This PL/R script creates a popularity distribution graph for article requests
and compares it to Zipf.
The source can be found in the vg/scripts/ directory on the CD.
A.37 create-nrobjectinfo-table.plr
This PL/R script creates the object info table for streaming objects,
recording first and last day of access and the total number of requests to
each object.
The source can be found in the nr/scripts/ directory on the CD.
A.38 nr-graph-pop-zipf.plr
This PL/R script compares streaming object requests to the Zipf popularity
distribution.
The source can be found in the nr/scripts/ directory on the CD.