International Journal of Engineering Trends and Technology - Volume 3, Issue 2 - 2012
An Enhanced Framework For Performing Pre-Processing On Web Server Logs
T. Subha Mastan Rao #1, P. Siva Durga Bhavani #2, M. Revathi #3, N. Kiran Kumar #4, V. Sara #5
# Department of Information Science and Technology, Koneru Lakshmaiah College of Engineering,
Green Fields, Vaddeswaram, Guntur-522502, INDIA
Abstract- Log files offer valuable insight into web site usage,
and interest in analyzing them is growing. A log file shows the
actual usage of a web site under all circumstances, so no
external laboratory experiments are needed to obtain this
information. This paper describes effective preprocessing of the
access stream before the actual mining process is performed.
Log files collected from different sources pass through several
preprocessing phases to produce an actionable data source. This
supports the automatic discovery of meaningful patterns and
relationships in the user access stream.
Keywords: Web Usage Mining, Web Server, Data Mining, Data
Preprocessing
I. INTRODUCTION
The World Wide Web has become one of the most important
media to store, share and distribute information. At present,
Google is indexing more than 8 billion Web pages. The rapid
expansion of the Web has provided a great opportunity to
study user and system behavior by exploring Web access logs.
The WWW serves as a huge, widely distributed global
information service center for technical information, news,
advertisement, e-commerce and other information services. The
web log file is exported using the Web Log DB software, which
yields output in MS Access file format. This Access file is then
ready for pre-processing.
The main intention of our paper is to perform pre-processing
on web log data. Before such data can be analyzed using web
mining techniques, the web log has to be pre-processed,
integrated and transformed. As the World Wide Web is
continuously and rapidly growing, web miners need intelligent
tools to find, extract, filter and evaluate the desired
information. Data pre-processing is the most important phase in
investigating web user usage behavior. To do this, one must
extract only the human user accesses from the web log data, a
task that is both critical and complex.
II. PROPOSED FRAMEWORK FOR PERFORMING PRE-PROCESSING
Fig: Framework for Pre-Processing
III. WEB LOG DATA
The log file is the input to the pre-processing block. A Web
log is a file to which the Web server writes information each
time a user requests a resource from that particular site.
Log files [4] are text files that can range in size
from 1KB to 100MB, depending on the traffic. In
determining the amount of traffic a site receives during a
specified period of time, it is important to understand
exactly what the log files are counting and tracking.
The raw log files consist of 19 attributes:
Date, Time, Client IP, Auth User, Server Name, Server IP,
Server Port, Request Method, URI-Stem, URI Query,
Protocol Status, Time Taken, Bytes Sent, Bytes Received,
Protocol Version, Host, User Agent, Cookies, Referer
Example: 2003-11-23 16:00:13 210.186.180.199 CSLNTSVR20 202.190.126.85 80 GET /tutor/images/icons/fold.gif - 304 140 4700 HTTP/1.1 www.tutor.com.my Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+98;+Win+9x+4.90) ASPSESSIONIDCSTSBQDC=NBKBCPIBBJHCMMFIKMLNNKFD;+browser=done;+ASPSESSIONIDAQRRCQCC=LBDGBPIBDFCOKHMLHEHNKFBN http://www.tutor.com.my/
ISSN: 2231-5381 http://www.internationaljournalssrg.org
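As a minimal sketch, a record with the 19 attributes listed above could be split into named fields as follows. It assumes that the fields are separated by single spaces and that no field contains a space (in the example above, spaces inside the user agent are encoded as "+"). The field names, the function name parse_record and the sample record are hypothetical and chosen only for illustration.

```python
# Field names follow the order of the 19 attributes listed in Section III.
FIELDS = [
    "date", "time", "client_ip", "auth_user", "server_name", "server_ip",
    "server_port", "method", "uri_stem", "uri_query", "status",
    "time_taken", "bytes_sent", "bytes_received", "protocol_version",
    "host", "user_agent", "cookies", "referer",
]


def parse_record(line):
    """Split one space-delimited log record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# Hypothetical record built for illustration; it is not taken from a real log.
SAMPLE = (
    "2003-11-23 16:00:13 192.0.2.10 - WEBSRV01 192.0.2.1 80 GET "
    "/tutor/images/icons/fold.gif - 304 140 470 0 HTTP/1.1 www.example.com "
    "Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+98) session=abc;+browser=done "
    "http://www.example.com/"
)

record = parse_record(SAMPLE)
print(record["client_ip"], record["method"], record["uri_stem"], record["status"])
```

A record with a missing or extra field raises ValueError instead of being silently misaligned, which makes damaged lines easy to spot before cleaning.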
1) Date
The date, taken from Greenwich Mean Time (GMT x 100), is
recorded for each hit. The date format is YYYY-MM-DD. The
example above shows that the transaction was recorded on
2003-11-23.
2) Time
Time of the transaction. The time format is HH:MM:SS. The
example above shows that the transaction was recorded at
16:00:13.
3) Client IP Address
Client IP is the address of the computer that accesses or
requests the site.
4) User Authentication
Some web sites are set up with a security feature that
requires a user to enter a username and password. Once a
user logs on to a Web site, that user's "username" is logged
in the fourth field of the log file.
5) Server Name
Name of the server. In the example, the name of the server
is CSLNTSVR20.
6) Server IP Address
Server IP is a static IP provided by the Internet Service
Provider. This IP is the reference for accessing information
from the server.
7) Server Port
Server Port is the port used for data transmission. Usually,
the port used is port 80.
8) Server Method
The word request refers to an image, movie, sound, pdf, txt,
HTML file and more. The above example indicates that
fold.gif was the item accessed. It is also important to note
that the full path name from the document root is recorded.
The GET in front of the path name specifies the way in
which the server sends the requested information. Currently,
there are three formats in which Web servers send
information: GET, POST and HEAD. Most HTML files are served
via the GET method, while most CGI functionality is served
via POST.
9) URI-Stem
URI-Stem is the path from the host. It represents the
structure of the web site.
For example: /tutor/images/icons/fold.gif
10) Server URI-Query
URI-Query usually appears after the "?" sign. It represents
the type of user request, and its value usually appears in
the Address Bar. For example:
?q=tawaran+biasiswa&hl=en&lr=&ie=UTF8&oe=UTF-8&start=20&sa=N
IV. WEB LOG DB SOFTWARE
Web Log DB exports web log data to databases via ODBC,
performing the database inserts with SQL queries. Web Log DB
allows you to use the applications you have become accustomed
to, such as MS SQL, MS Excel and MS Access. Any other
ODBC-compliant application can also be used to produce the
output you desire. Use Web Log DB to perform further analysis.
Web Log DB analyzes the most popular log file formats, such as
the MS IIS and Apache log file formats. It can even read GZip
(gz) compressed logs, so you won't need to unpack them
manually.
Fig: Log file browsing
Fig: FTP Server Authentication
Fig: Settings
Fig: Web log db s/w
Fig: While Pre-Processing
Fig: Before Pre-Processing
Fig: After Pre-Processing
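Section IV states that Web Log DB loads log data into a database through SQL inserts over ODBC. The sketch below illustrates only that general idea and does not reproduce Web Log DB itself. As an assumption made to keep the example self-contained, Python's built-in sqlite3 module stands in for an ODBC connection; the table name access_log, its columns and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for an ODBC data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_log ("
    "log_date TEXT, log_time TEXT, client_ip TEXT, uri_stem TEXT, status INTEGER)"
)

# Hypothetical parsed records: date, time, client IP, URI-Stem, status.
rows = [
    ("2003-11-23", "16:00:13", "192.0.2.10", "/tutor/index.html", 200),
    ("2003-11-23", "16:00:14", "192.0.2.10", "/tutor/images/icons/fold.gif", 304),
]

# Parameterised INSERT: the same SQL statement is executed once per record.
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Once loaded, the log can be queried like any other table.
count_by_status = dict(
    conn.execute("SELECT status, COUNT(*) FROM access_log GROUP BY status")
)
print(count_by_status)
```

Once the records are in a table, the cleaning rules of Section V can be expressed either as SQL queries or in application code.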
V. BRIEF VIEW OF DATA PRE-PROCESSING
1) Data Cleaning:
Data cleaning [2] is one of the pre-processing steps. It is
used to eliminate duplicates, fill in missing values and
remove unwanted data. The following types of unwanted and
irrelevant data are removed:
a) Records with a status code above 299 or below 200.
b) Records in which the attribute cs_uri_stem has an
extension such as CSS, JPEG or GIF.
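The two cleaning rules above can be sketched as a simple filter. This is a minimal illustration, assuming each record is a dict; the key sc_status is an assumed name for the status field, the ".jpg" extension is added as an assumption alongside the extensions named in the text, and the sample records are hypothetical.

```python
import posixpath

# Extensions named in the text (CSS, JPEG, GIF); ".jpg" is an added assumption.
IRRELEVANT_EXTENSIONS = {".css", ".jpeg", ".jpg", ".gif"}


def is_relevant(record):
    """Keep a record only if it passes both cleaning rules."""
    status = int(record["sc_status"])
    if status < 200 or status > 299:  # rule (a): status outside 200-299
        return False
    extension = posixpath.splitext(record["cs_uri_stem"])[1].lower()
    return extension not in IRRELEVANT_EXTENSIONS  # rule (b): unwanted file types


def clean(records):
    """Return the records that survive data cleaning, in their original order."""
    return [record for record in records if is_relevant(record)]


# Hypothetical records for illustration.
records = [
    {"cs_uri_stem": "/tutor/index.html", "sc_status": "200"},
    {"cs_uri_stem": "/tutor/images/icons/fold.gif", "sc_status": "200"},
    {"cs_uri_stem": "/tutor/style.CSS", "sc_status": "200"},
    {"cs_uri_stem": "/tutor/missing.html", "sc_status": "404"},
    {"cs_uri_stem": "/tutor/cached.html", "sc_status": "304"},
]

cleaned = clean(records)
print([record["cs_uri_stem"] for record in cleaned])
```

The extension check is case-insensitive, so "style.CSS" and "style.css" are treated alike.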
2) User and Session Identification:
The task of user and session identification is to find
the different user sessions in the original web access
log. A referrer-based method is used for identifying
sessions. Different IP addresses distinguish different
users.
a. If the IP addresses are the same, different browsers and
operating systems indicate different users. These can be
found from the client IP address and the user agent, which
gives information about the user's browser and operating
system.
b. If the IP address, browser and operating system are all
the same, the referrer information is taken into account.
The Referer URI is checked, and a new user session is
identified if the Referer URI field is '-', that is, the
page was not reached from a previously accessed page, or if
there is an interval of more than 30 minutes between the
access time of this record and that of the previous one.
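The rules above can be sketched as follows. This is a simplified illustration rather than the exact procedure used in the paper: it assumes the records are already sorted by time, treats each (client IP, user agent) pair as one user, and starts a new session when the referrer is '-' or the gap to that user's previous request exceeds 30 minutes. The function names and the sample log are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def identify_sessions(records):
    """Group time-ordered records into sessions per (client IP, user agent)."""
    sessions = {}   # user key -> list of sessions, each a list of records
    last_seen = {}  # user key -> timestamp of that user's previous record
    for rec in records:
        user = (rec["client_ip"], rec["user_agent"])
        stamp = datetime.strptime(
            rec["date"] + " " + rec["time"], "%Y-%m-%d %H:%M:%S"
        )
        previous = last_seen.get(user)
        starts_new = (
            previous is None                      # first request of this user
            or rec["referer"] == "-"              # no referring page
            or stamp - previous > SESSION_TIMEOUT  # gap longer than 30 minutes
        )
        if starts_new:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(rec)
        last_seen[user] = stamp
    return sessions


def hit(time, ip, agent, uri, referer):
    """Build a hypothetical log record for the example below."""
    return {"date": "2003-11-23", "time": time, "client_ip": ip,
            "user_agent": agent, "uri_stem": uri, "referer": referer}


log = [
    hit("16:00:00", "192.0.2.10", "MSIE", "/a.html", "-"),
    hit("16:01:00", "192.0.2.10", "MSIE", "/b.html", "/a.html"),
    hit("16:02:00", "192.0.2.10", "Firefox", "/a.html", "-"),
    hit("16:45:00", "192.0.2.10", "MSIE", "/c.html", "/b.html"),
]

sessions = identify_sessions(log)
for user, user_sessions in sessions.items():
    print(user, [len(session) for session in user_sessions])
```

In this sample, the same IP address with two different user agents yields two users, and the 44-minute gap before the last MSIE request opens a second session for that user.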
3) Path Completion:
Path completion is used to acquire the complete user access
path. The incomplete access path of every user session is
recognized based on user session identification. If, at the
start of a user session, both the Referrer and the URI have
data values, the value of the Referrer is deleted by
replacing it with '-'. Web log pre-processing helps remove
unwanted click-streams from the log file and also reduces
the size of the original file by 40-50%.
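The referrer rule stated above can be sketched as a small function. It covers only that one rule, not a full reconstruction of missing page views, and it assumes a session is a list of dicts with the hypothetical keys uri_stem and referer.

```python
def reset_session_start_referrer(session):
    """Return a copy of the session whose first record has its referrer set to '-'.

    The rule is applied only when both the referrer and the URI of the
    first record carry a value; all other records are copied unchanged.
    """
    if not session:
        return []
    first = dict(session[0])
    if first["referer"] != "-" and first["uri_stem"] != "-":
        first["referer"] = "-"
    return [first] + [dict(record) for record in session[1:]]


# Hypothetical session whose first record still carries a referrer.
session = [
    {"uri_stem": "/c.html", "referer": "/b.html"},
    {"uri_stem": "/d.html", "referer": "/c.html"},
]

completed = reset_session_start_referrer(session)
print([record["referer"] for record in completed])
```

The function works on copies, so the original session records are left unchanged and can be compared with the result.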
4) Data pre-processing can be performed using two
approaches:
a) XML
b) Text file
a) XML:
i) Logs [3] recorded in the web log, which is a text file,
are converted to a DOM tree structure using an XML parser.
ii) Since a DOM tree structure is used, the pre-processing
stages can be analysed very well.
iii) Time taken to convert is 20 minutes.
iv) The XML approach can be used when the web log file has a
larger number of attributes describing the usage profile of
the user, as with the IIS web server's Extended Log File
Format, which has 17 attributes.
b) Text file:
i) Logs [3] recorded in the web log, which is a text file,
first need to be separated using a space as the delimiter.
ii) Understanding each step of pre-processing is difficult
for the user, because this approach demands analysis and
knowledge of how the web log looks.
iii) Time taken to convert is 10 seconds.
iv) The text file approach can be used when the web log file
has very few attributes describing the usage profile of
users, i.e., fewer than 10, as in the Common Log File Format.
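To make the XML approach concrete, the sketch below converts space-delimited log lines into a DOM tree using Python's standard xml.dom.minidom module. It is an illustration under stated assumptions, not the conversion tool used in the paper: the element names log and entry, the five-field subset and the sample lines are all hypothetical.

```python
from xml.dom.minidom import Document

# Hypothetical subset of attributes, in the order they appear on each line.
FIELDS = ["date", "time", "client_ip", "uri_stem", "status"]


def log_to_dom(lines):
    """Build a DOM tree with one <entry> element per space-delimited log line."""
    doc = Document()
    root = doc.createElement("log")
    doc.appendChild(root)
    for line in lines:
        entry = doc.createElement("entry")
        for name, value in zip(FIELDS, line.split(" ")):
            field = doc.createElement(name)
            field.appendChild(doc.createTextNode(value))
            entry.appendChild(field)
        root.appendChild(entry)
    return doc


# Hypothetical log lines for illustration.
lines = [
    "2003-11-23 16:00:13 192.0.2.10 /tutor/index.html 200",
    "2003-11-23 16:00:14 192.0.2.10 /tutor/images/icons/fold.gif 304",
]

dom = log_to_dom(lines)
stems = [node.firstChild.data for node in dom.getElementsByTagName("uri_stem")]
print(stems)
```

The text file approach corresponds to the line.split(" ") step alone; the DOM tree adds named, navigable nodes on top of it, at the cost of extra conversion time.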
VI. WEB USAGE MINING
Web usage mining is the type of web mining that allows for
the collection of Web access information for Web pages. This
usage data provides the paths leading to accessed Web pages.
This information is often gathered automatically into access
logs via the Web server.
Data used for web usage mining can be collected at three
different levels:
1) Server level
2) Client level
3) Proxy level
1) Server Level:
The server stores data regarding requests performed by the
client. Data can be collected from multiple users on a single
site.
2) Client Level:
At the client level, the client itself sends information
regarding the user's behaviour. This is done either with an
ad hoc browsing application or through a client-side
application running in standard browsers.
3) Proxy Level:
Information regarding user behaviour is stored at the proxy
side. Web data is thus collected from multiple users on
several websites, but only from users whose web clients pass
through the proxy.
4) Applications Of Web Usage Mining:
Usage mining allows companies to produce productive
information [1] pertaining to the future of their business
functionality. Some of this information can be derived from
the collective information on lifetime user value, product
cross-marketing strategies and promotional campaign
effectiveness. The usage data that is gathered gives companies
the ability to produce results that are more effective for
their businesses and that increase sales. Usage data can also
be useful for developing marketing skills that will out-sell
the competitors and promote the company's services or products
at a higher level.
Usage mining [5] is valuable not only to businesses using
online marketing, but also to e-businesses whose business is
based solely on the traffic provided through search engines.
This type of web mining helps to gather important information
from customers visiting the site. This enables an in-depth
log to complete the analysis of a company's productivity
flow. E-businesses depend on this information to direct the
company to the most effective Web server for promotion of
their product or service.
Fig: Web Usage Mining
The first is usage [1] processing, used to complete pattern
discovery. This first use is also the most difficult because
only bits of information like IP addresses, user information,
and site clicks are available. With this minimal amount of
information, it is harder to track the user through a site,
since the available information does not follow the user
throughout the pages of the site.
The second use is content processing, consisting of the
conversion of Web information like text, images, scripts and
others into useful forms. This helps with the clustering and
categorization of Web page information based on the titles,
specific content and images available.
Finally, the third use is structure processing. This consists
of analysis of the structure of each page contained in a Web
site. Structure processing can prove difficult if it requires
a new structure process to be performed for each page.
VII. CONCLUSION
In this paper, we have taken web log data as the source. The
web log data is converted to an accessible format using the
Web Log DB software, which converts the logged data into a
simple MS Access file format. Data pre-processing is then
performed on this format to increase the quality of the data
by removing erroneous and noisy records. Functions and mining
performed on this Access format are easy and useful for human
analysts. Missing values are replaced by the most frequent
ones, and unwanted data is deleted according to selected
parameters.
REFERENCES:
[1] Google Website. http://www.google.com.
[2] Jiawei Han and M. Kamber, "Data Mining: Concepts and
Techniques," Morgan Kaufmann Publishers, 2001.
[8] ZY Computing, 123 Log Analyzer, San Jose, USA, 2003.
Available at http://www.123loganalyzer.com
[3] Dipa Dixit et al., (IJCSE) International Journal on
Computer Science and Engineering, Vol. 02, No. 07, 2010,
pp. 2447-2452, ISSN: 0975-3397.
[4] Mohd Helmy Abd Wahab, Mohd Norzali Haji Mohd,
Hafizul Fahri Hanafi, Mohamad Farhan Mohamad Mohsin,
World Academy of Science, Engineering and Technology 48,
2008.
[5] Navin Kumar Tyagi, A.K. Solanki and Sanjay Tyagi,
International Journal of Information Technology and
Knowledge Management, July-December 2010, Volume 2, No. 2,
pp. 279-283.