International Journal of Engineering Trends and Technology - Volume 3, Issue 2 - 2012
ISSN: 2231-5381 http://www.internationaljournalssrg.org

An Enhanced Framework for Performing Pre-Processing on Web Server Logs

T. Subha Mastan Rao #1, P. Siva Durga Bhavani #2, M. Revathi #3, N. Kiran Kumar #4, V. Sara #5
# Department of Information Science and Technology, Koneru Lakshmaiah College of Engineering, Green Fields, Vaddeswaram, Guntur-522502, INDIA

Abstract- People are increasingly interested in analyzing log files, which can offer valuable insight into web site usage. Log files show the actual usage of a web site under all circumstances, so no external laboratory experiments are needed to obtain this information. This paper describes effective preprocessing of the access stream before the actual mining process is performed. Log files collected from different sources undergo several preprocessing phases to produce an actionable data source. This helps in the automatic discovery of meaningful patterns and relationships in the users' access stream.

Keywords: Web Usage Mining, Web Server, Data Mining, Data Preprocessing

I. INTRODUCTION
The World Wide Web has become one of the most important media to store, share and distribute information. At present, Google is indexing more than 8 billion Web pages. The rapid expansion of the Web has provided a great opportunity to study user and system behavior by exploring Web access logs. The WWW serves as a huge, widely distributed global information service center for technical information, news, advertisement, e-commerce and other information services.

II. PROPOSED FRAMEWORK FOR PERFORMING PREPROCESSING:
The web log file is exported using the Web Log DB software, which yields output in Access file format. This Access file is then ready for pre-processing. The main intention of our paper is to perform pre-processing on web log data. Before such data can be analyzed using web mining techniques, the web log has to be pre-processed, integrated and transformed.
As the World Wide Web is continuously and rapidly growing, web miners need intelligent tools to find, extract, filter and evaluate the desired information. Data pre-processing is the most important phase in investigating web user behavior. To do this, one must extract only the human user accesses from the web log data, which is a critical and complex task.

Fig: Framework for Pre-Processing

III. WEB LOG DATA
The log file is the input to the pre-processing block. A Web log is a file to which the Web server writes information each time a user requests a resource from that particular site. Log files [4] are text files that can range in size from 1KB to 100MB, depending on the traffic. In determining the amount of traffic a site receives during a specified period of time, it is important to understand exactly what the log files are counting and tracking. A raw log file consists of 19 attributes: Date, Time, Client IP, Auth User, Server Name, Server IP, Server Port, Request Method, URI-Stem, URI-Query, Protocol Status, Time Taken, Bytes Sent, Bytes Received, Protocol Version, Host, User Agent, Cookies, Referer.

Example:
2003-11-23 16:00:13 210.186.180.199 CSLNTSVR20 202.190.126.85 80 GET /tutor/images/icons/fold.gif - 304 140 4700 HTTP/1.1 www.tutor.com.my Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+98;+Win+9x+4.90) ASPSESSIONIDCSTSBQDC=NBKBCPIBBJHCMMFIKMLNNKFD;+browser=done;+ASPSESSIONIDAQRRCQCC=LBDGBPIBDFCOKHMLHEHNKFBN http://www.tutor.com.my/

1) Date
The date, in Greenwich Mean Time (GMT), is recorded for each hit. The date format is YYYY-MM-DD. The example above shows that the transaction was recorded on 2003-11-23.

2) Time
The time of the transaction, in the format HH:MM:SS. The example above shows that the transaction was recorded at 16:00:13.
3) Client IP Address
The client IP is the address of the computer that accesses or requests the site.

4) User Authentication
Some web sites are set up with a security feature that requires a user to enter a username and password. Once a user logs on to a web site, that user's "username" is logged in the fourth field of the log file.

5) Server Name
The name of the server. In the example, the name of the server is CSLNTSVR20.

6) Server IP Address
The server IP is a static IP provided by the Internet Service Provider. This IP is the reference for accessing information from the server.

7) Server Port
The server port is the port used for data transmission. Usually, port 80 is used.

8) Server Method
The word request refers to an image, movie, sound, pdf, txt, HTML file and more. The example above indicates that fold.gif was the item accessed. Note that the full path name from the document root is recorded. The GET in front of the path name specifies the way in which the server sends the requested information. Currently, there are three formats in which Web servers send information: GET, POST and HEAD. Most HTML files are served via the GET method, while most CGI functionality is served via POST.

9) URI-Stem
The URI-Stem is the path from the host. It represents the structure of the web site. For example: /tutor/images/icons/fold.gif

10) URI-Query
The URI-Query usually appears after the "?" sign. It represents the type of user request, and the value usually appears in the address bar. For example: ?q=tawaran+biasiswa&hl=en&lr=&ie=UTF8&oe=UTF-8&start=20&sa=N

IV. WEB LOG DB SOFTWARE:
Web Log DB exports web log data to databases via ODBC, inserting the data with SQL queries. It lets you use the applications you are accustomed to, such as MS SQL, MS Excel and MS Access, and any other ODBC-compliant application can be used to produce the output you desire. The exported data can then be used for further analysis in other software. Web Log DB analyzes the most popular log file formats, such as the MS IIS and Apache log file formats. It can even read GZip (gz) compressed logs, so you won't need to unpack them manually.

Fig: Log file browsing
Fig: FTP Server Authentication
Fig: Settings
Fig: Web Log DB s/w
Fig: While Pre-Processing
Fig: Before Pre-Processing
Fig: After Pre-Processing

V. BRIEF VIEW OF DATA PRE-PROCESSING:
1) Data Cleaning:
Data cleaning [2] is the pre-processing step used to eliminate duplicates, fill in missing values and remove unwanted data. The following types of unwanted and irrelevant records are removed:
a) Records with a status code above 299 or below 200.
b) Records in which the attribute cs_uri_stem has an extension such as CSS, JPEG or GIF.

2) User and Session Identification:
The task of user and session identification is to find the different user sessions in the original web access log. A referrer-based method is used for identifying sessions. Different IP addresses distinguish different users.
a. If the IP addresses are the same, different browsers and operating systems indicate different users. These can be determined from the client IP address and the user agent, which gives information about the user's browser and operating system.
b. If the IP address, browser and operating system are all the same, the referrer information should be taken into account. The Referer URI is checked, and a new user session is identified if the Referer URI is '-' (that is, the request did not come from a previously accessed page), or if there is an interval of more than 30 minutes between the access time of this record and that of the previous one.
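The cleaning rules and the session rules described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the paper: it assumes the log records have already been parsed into dictionaries, and the key names (timestamp, client_ip, user_agent, uri_stem, status, referrer), the helper names, the sample URIs and the ".jpg" extension are all hypothetical choices made for the example.

```python
from datetime import datetime, timedelta

# Assumed extension list: the text names CSS, JPEG and GIF; ".jpg" is added here.
IGNORED_EXTENSIONS = (".css", ".jpeg", ".jpg", ".gif")
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Drop records with a status outside 200-299 or a resource extension."""
    kept = []
    for rec in records:
        if not 200 <= rec["status"] <= 299:
            continue
        if rec["uri_stem"].lower().endswith(IGNORED_EXTENSIONS):
            continue
        kept.append(rec)
    return kept


def identify_sessions(records):
    """Group records into sessions per user, where a user is (client IP, user agent).

    A new session starts on the user's first request, when the referrer
    is '-', or when more than 30 minutes have passed since that user's
    previous request.
    """
    sessions = []
    current = {}    # user -> index of that user's open session
    last_seen = {}  # user -> timestamp of that user's previous request
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        user = (rec["client_ip"], rec["user_agent"])
        previous = last_seen.get(user)
        if (previous is None
                or rec["referrer"] == "-"
                or rec["timestamp"] - previous > SESSION_TIMEOUT):
            sessions.append({"user": user, "requests": []})
            current[user] = len(sessions) - 1
        sessions[current[user]]["requests"].append(rec["uri_stem"])
        last_seen[user] = rec["timestamp"]
    return sessions


def record(time, agent, uri_stem, status, referrer, ip="210.186.180.199"):
    """Build one hypothetical parsed log record for the demonstration."""
    return {
        "timestamp": datetime.strptime("2003-11-23 " + time, "%Y-%m-%d %H:%M:%S"),
        "client_ip": ip,
        "user_agent": agent,
        "uri_stem": uri_stem,
        "status": status,
        "referrer": referrer,
    }


# Invented sample data; only the IP address and fold.gif come from the paper's example.
sample = [
    record("16:00:13", "MSIE 5.5", "/tutor/index.asp", 200, "-"),
    record("16:00:14", "MSIE 5.5", "/tutor/images/icons/fold.gif", 304, "/tutor/index.asp"),
    record("16:05:00", "MSIE 5.5", "/tutor/lesson1.asp", 200, "/tutor/index.asp"),
    record("16:06:00", "Opera 7.0", "/tutor/index.asp", 200, "http://www.google.com/"),
    record("16:20:00", "MSIE 5.5", "/tutor/missing.asp", 404, "/tutor/lesson1.asp"),
    record("16:50:00", "MSIE 5.5", "/tutor/lesson2.asp", 200, "/tutor/lesson1.asp"),
]

cleaned = clean(sample)
sessions = identify_sessions(cleaned)
for session in sessions:
    print(session["user"], session["requests"])
```

On this sample, cleaning removes the image request and the 404 request. The remaining four records fall into three sessions: two page views by the first browser, one by the second browser at the same IP address, and a further session for the first browser after a gap of more than 30 minutes.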
3) Path Completion:
Path completion is used to acquire the complete user access path. The incomplete access path of each user session is recognized on the basis of user session identification. If, at the start of a user session, both the Referrer and the URI have data values, the Referrer value is deleted and replaced with '-'. Web log pre-processing helps remove unwanted click-streams from the log file and also reduces the size of the original file by 40-50%.

4) Data pre-processing can be performed using two approaches:
a) XML
b) Text file

a) XML:
i) Logs [3] recorded in the web log, which is a text file, are converted to a DOM tree structure using an XML parser.
ii) Since a DOM tree structure is used, the pre-processing stages can be analysed very well.
iii) The time taken to convert is 20 minutes.
iv) The XML approach can be used when the web log file has a large number of attributes describing the usage profile of the user, such as the IIS web server's Extended Log File Format with 17 attributes.

b) Text file:
i) Logs [3] recorded in the web log, which is a text file, first need to be split into fields using the space character as the delimiter.
ii) Understanding each step of pre-processing is difficult for the user, because this approach demands analysis and knowledge of what the web log looks like.
iii) The time taken to convert is 10 seconds.
iv) The text file approach can be used when the web log file has very few attributes describing the usage profile of users, i.e., fewer than 10, as in the Common Log File Format.

VI. WEB USAGE MINING:
Web usage mining is the type of web mining that collects Web access information for Web pages. This usage data provides the paths leading to accessed Web pages. This information is often gathered automatically into access logs via the Web server. Data used for web usage mining can be collected at three different levels:
1) Server level
2) Client level
3) Proxy level

1) Server Level:
The server stores data regarding requests performed by the client.
Data can be collected from multiple users on a single site.

2) Client Level:
At the client level, the client itself sends information regarding the user's behaviour. This is done either with an ad hoc browsing application or through a client-side application running in standard browsers.

3) Proxy Level:
Information regarding user behaviour is stored at the proxy side. Web data is thus collected from multiple users on several web sites, but only for users whose web clients pass through the proxy.

4) Applications of Web Usage Mining:
Usage mining allows companies to produce productive information [1] pertaining to the future of their business function ability. Some of this information can be derived from the collective information of lifetime user value, product cross-marketing strategies and promotional campaign effectiveness. The usage data that is gathered gives companies the ability to produce results that are more effective for their businesses and to increase sales. Usage data can also be useful for developing marketing skills that will out-sell the competitors and promote the company's services or products on a higher level.
Usage mining [5] is valuable not only to businesses using online marketing, but also to e-businesses whose business is based solely on the traffic provided through search engines. This type of web mining helps to gather important information from customers visiting the site. This enables an in-depth log for a complete analysis of a company's productivity flow. E-businesses depend on this information to direct the company to the most effective Web server for promotion of their product or service.

Fig: Web Usage Mining

The first use of this type of mining is usage [1] processing, used to complete pattern discovery. This first use is also the most difficult, because only bits of information such as IP addresses, user information and site clicks are available. With this minimal amount of information, it is harder to track the user through a site, since the log does not follow the user throughout the pages of the site.
The second use is content processing, which consists of converting Web information such as text, images, scripts and others into useful forms. This helps with the clustering and categorization of Web page information based on the titles, specific content and images available.
Finally, the third use is structure processing. This consists of analysing the structure of each page contained in a Web site. Structure processing can prove difficult if a new structure analysis has to be performed for each page.

VII. CONCLUSION
In this paper, we have taken web log data as the source. The web log data is converted to an accessible format using the Web Log DB software, which converts the logged data into the simple MS Access file format. Data pre-processing is then performed on this format to increase the quality of the data by removing erroneous and noisy data. Functions and mining performed on this Access format are easy and useful for humans. Missing values are replaced by the most frequent ones, and unwanted data is deleted on the basis of selected parameters.

REFERENCES:
[1] Google Website. http://www.google.com.
[2] Jiawei Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2001.
[3] Dipa Dixit et al., (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 07, 2010, pp. 2447-2452, ISSN: 0975-3397.
[4] Mohd Helmy Abd Wahab, Mohd Norzali Haji Mohd, Hafizul Fahri Hanafi, Mohamad Farhan Mohamad Mohsin, in World Academy of Science, Engineering and Technology 48, 2008.
[5] Navin Kumar Tyagi, A.K. Solanki and Sanjay Tyagi, International Journal of Information Technology and Knowledge Management, July-December 2010, Volume 2, No. 2, pp. 279-283.
[8] ZY Computing, 123 Log Analyzer, 2003. San Jose, USA. Available at http://www.123loganalyzer.com