DATA MINING USING WEKA:
AN ANALYSIS OF WEB BROWSING BEHAVIOR EVALUATION USING FIREFOX CACHE
TEAM-3:
Mike Egbert
Gonzalo Perez
Diah Schur
Novelle Maxwell-Sinclair
For:
Emerging IT
Dr. Charles Tappert
Dr. Sun-Hyuk Cha
Pace University
2012
Page 1 of 13
TABLE OF CONTENTS
Introduction
  Common Browsers
    Internet Explorer
    Google Chrome
    Mozilla Firefox
    Safari
    Opera
    Figure 1: Browser Market Share
  Previous Work in Cache Analysis
    Figure 2: IE Browser Cache Clearing Dialog Box
    Browser Cache and Forensics
    Figure 3: Browser Cache and Forensics
    Browser-based Forensics: History as a Cache Artifact
    Figure 4: Browser History
    Persistence Browser Cache
    Figure 5: Browser Cache left on a Hard Drive
Web Browsing Cache Log
  Data Collection
    Figure 6: About:Cache in Firefox
Methodology
  DataSet
  Categories
  Data Loading
    Figure 7: Loaded Data on WEKA Pre-processing with Category shown
  Analysis
    Figure 8: Categories vs. ID
Conclusion and Recommendation
References
List of Figures
Introduction
This project evaluates whether a browser’s cache/history log can be used to determine a user’s
browsing behavior, especially in the workplace. The first step is to acquire the actual log
file(s). The second step is to identify the relevant file that can be analyzed by the tool. The third step is
to evaluate the recovered data to determine whether the user has been misusing his or her time at work.
The last step is to present the findings at a high level. For this project, we use Firefox as the subject
browser, although several other browsers are commonly used by individuals and companies.
Previous work in computer forensics has used cache analysis as the main methodology. The following
sections discuss common web browsers and previous work in cache analysis.
Common Browsers
The web provides a diverse source of information and services, and many browsers compete in the
market to deliver the experience users desire at the click of a button. Browsers are the tools
that request content from web servers, parse the markup language, interpret the content
and then present it to users. These web browsers are compatible with PCs, Macs and other
Internet-capable devices such as the Apple iPod (though Internet Explorer is compatible only with PCs
running Windows). The most popular browser today is Internet Explorer, with Google Chrome close behind,
followed by Mozilla Firefox, Safari, Opera and a few others.
Internet Explorer
Internet Explorer was introduced in the mid-1990s and is included on every Windows-based PC.
Google Chrome
Google Chrome has gained significant market share within the past few years. Chrome has ranked very
high in independent tests with regard to speed and page load times. Chrome also features an
“incognito mode” in which users can visit web sites without cookies remaining on their
PCs afterward.
Mozilla Firefox
Firefox is an open source web browser with a very simple interface with enhanced security features to
help protect users. Since Firefox is open source, it affords users a healthy library of add-ons that can be
installed to the browser to augment and customize the user experience. Different types of add-ons
include extensions, themes, search providers, dictionaries, language packs and plugins.
Safari
Safari, which runs on Mac OS and the iPhone, offers many browser extensions, including an eBay
manager and Twitter integration.
Opera
The Opera browser has some unique features such as text and graphic enlarging on a web page.
Figure 1: Browser Market Share
Recent trends show browsers becoming their own operating systems, integrating many functions that
were historically performed on a local machine. The Google Chromebook does not have a Windows
operating system installed; Chrome takes on that role. Users can create, save and edit spreadsheets
or word processing documents directly through the Google browser.
Previous Work in Cache Analysis
A browser cache is a set of temporary file folders that stores content from web sites you have visited
(e.g., graphics, static pages, cookies, entire web pages). The idea behind the browser cache is to
improve performance: when you revisit a site, much of its content is already cached, so
the browser serves the local copies instead of going out to the site and retrieving them again. [1]
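The caching idea described above can be illustrated with a toy sketch. This is not how Firefox implements its cache; the class name, the fake_fetch function and the example URL are all hypothetical and exist only to show a cache hit versus a cache miss.

```python
from typing import Callable, Dict


class SimpleBrowserCache:
    """Toy in-memory cache keyed by URL (illustrative only, not Firefox's design)."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                 # function that retrieves content remotely
        self._store: Dict[str, bytes] = {}  # URL -> cached content
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        # Serve the local copy when the URL has been seen before.
        if url in self._store:
            self.hits += 1
            return self._store[url]
        # Otherwise fetch the content once and remember it.
        self.misses += 1
        content = self._fetch(url)
        self._store[url] = content
        return content


def fake_fetch(url: str) -> bytes:
    # Hypothetical stand-in for a real network request.
    return f"<html>content of {url}</html>".encode()


cache = SimpleBrowserCache(fake_fetch)
first = cache.get("http://example.com/")   # miss: fetched and stored
second = cache.get("http://example.com/")  # hit: served from the local store
print(cache.hits, cache.misses)
```

The second request never reaches fake_fetch, which is the performance benefit the paragraph above refers to.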
What kinds of files are stored in the browser cache? The following is a list of typical files that are stored
in browser cache:
1. Files from the web sites you've opened in the browser
2. Entire web pages
3. Images
4. CSS
5. Audio
6. Video
Figure 2: IE Browser Cache Clearing Dialog Box
For example, if you save web pages for offline browsing, all of their files
are stored in the browser cache. Depending on your browser and the operating system, both the
hard disk and RAM are used to store the cache files. [1]
Browser Cache and Forensics
Prior work on forensics and browser cache has used the browser-based cache when
investigating cybercrime. While the browser cache is an important part of web architecture and
provides many benefits, it is also a forensic artifact that may be used in cybercrime investigations. Users
might see this as a negative, but law enforcement officials regard it as a positive and powerful artifact.
Figure 3: Browser Cache and Forensics
In addition to informational artifacts, browser cache can also be tied to geo-location. This is possible
because information from map web pages and Internet addresses may be in the cache, and forensic
investigators will look for it. Chad Tilbury states in an article in Forensic Methods: “Geo-location
artifacts demonstrate an interesting concept with regard to browser-based evidence. Among the various
browser artifacts, Internet history is a fan favorite because it provides such rich information. There is no
easier place to look to identify sites visited by a specific user at a specific time. Browser history is so
useful, a critical shortcoming is often ignored; with today’s dynamic web pages, the vast number of web
page requests goes unrecorded.” [2]
Browser-based Forensics: History as a Cache Artifact
Historical data is kept in the browser cache. In most browsers today, you will find multiple days
and weeks of stored browsing history. The figure below is an example of historical browser cache. Again,
forensic investigators can make use of such historical data.
Figure 4: Browser History
Persistence Browser Cache
Peter Grant notes the importance of browser cache: even after a user deletes the cache, the
data can still be retrieved in most cases. Federal and state cybercrime investigators can use this
data in a court of law. [3] The figure below is an example of browser cache that remains on the hard
disk even though the user deleted it.
Figure 5: Browser Cache left on a Hard Drive
The development of browser cache has produced a good tool for web performance as well
as an important tool for cyber forensics. Keith J. Jones and Rohyt Belani state: “Critical electronic evidence
is often found in the suspect's web browsing history in the form of received emails; sites visited and
attempted Internet searches.” [4]
This prior work has yielded both modern browser cache technology and a powerful cyber forensics tool.
Web Browsing Cache Log
Cache analysis can be done by using the cache log of browsing history from a user’s computer to determine
his or her browsing behavior. This is particularly useful in the workplace environment. Other potential
evidence may be available from registry entries, temporary files, index.dat, cookies, bookmarked
pages, saved HTML pages, emails sent and received by the user, etc. A study by Junghoon Oh,
Seungbong Lee and Sangjin Lee asserts that searching for evidence left by web browsing activity is typically
a crucial component of digital forensic investigations. [5] Almost every action a suspect performs
while using a web browser leaves a trace on the computer, even a simple search for information.
Therefore, when an investigator analyzes the suspect’s computer, this evidence can
provide useful information. After retrieving data such as cache, history, cookies, and the download list from
a suspect’s computer, it is possible to analyze this evidence for web sites visited, time and frequency of
access, and search engine keywords used by the suspect.
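As a minimal sketch of the kind of analysis described above, the following Python counts how often each site appears in a list of visited URLs and pulls search keywords out of a search URL. It assumes the recovered entries are plain URLs and that the search engine passes its keywords in a "q" query parameter; both the parameter name and the sample URLs are illustrative assumptions, not data from this project.

```python
from collections import Counter
from typing import Iterable, List
from urllib.parse import parse_qs, urlparse


def site_frequency(urls: Iterable[str]) -> Counter:
    """Count visits per host name (frequency of access)."""
    return Counter(urlparse(u).netloc.lower() for u in urls)


def search_keywords(url: str, param: str = "q") -> List[str]:
    """Return the search terms carried in the given query parameter, if any."""
    query = parse_qs(urlparse(url).query)
    return query.get(param, [])


# Hypothetical sample of recovered URLs.
sample_urls = [
    "http://www.google.com/search?q=weka+tutorial",
    "http://www.example.com/index.html",
    "http://www.example.com/about.html",
]

frequency = site_frequency(sample_urls)
keywords = search_keywords(sample_urls[0])
print(frequency.most_common())
print(keywords)
```

Timestamps such as the last-modified time could be grouped in the same way to study when the access occurred.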
Computer forensic analysis can also be done from the server side by analyzing access logs, error logs and
FTP log files, as well as network traffic. For this project, we employ a simple cache log analysis method
using the Firefox disk cache.
Data Collection
At the beginning of the project, we decided to use the Firefox cache to obtain the browsing
history of an actual user. We further decided to use 499 of the cache entries, acquired
by entering “about:cache” in the address bar of the Firefox browser. The screenshot is as
follows:
Figure 6: About:Cache in Firefox
As shown in the figure above, there are three cache devices, but only two contain entries: the
memory cache and the disk cache device. The memory cache is held in RAM (application processed), while the disk
cache is stored on the user’s hard drive. The disk cache is where temporary Internet files are stored, so it has
more entries.
To obtain a varied sample, we used the disk cache device log and took 499 of its entries as samples.
Methodology
Once we acquired the log from the cache, we prepared the data so that it could be loaded and
processed by WEKA seamlessly.
DataSet
The log file was transferred to an Excel spreadsheet and then saved as a CSV file. We encountered a few
errors while trying to load the data into Weka for pre-processing. We resolved these errors by adding
an ID attribute to the file so that Weka could process the data.
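The ID step was done by hand in Excel. As a hedged alternative sketch, the same column could be added with Python's csv module. The function name and the sample rows below are hypothetical; only the column names Key and Datasize come from the cache attributes used in this project.

```python
import csv
import io
from typing import TextIO


def add_id_column(src: TextIO, dst: TextIO) -> int:
    """Copy a CSV from src to dst, prepending a sequential ID column.

    Returns the number of data rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:          # empty input: nothing to write
        return 0
    writer.writerow(["ID"] + header)
    count = 0
    for count, row in enumerate(reader, start=1):
        writer.writerow([count] + row)
    return count


# Hypothetical two-row sample standing in for the exported cache log.
source = io.StringIO(
    "Key,Datasize\n"
    "http://a.example/,10\n"
    "http://b.example/,20\n"
)
target = io.StringIO()
rows_written = add_id_column(source, target)
print(target.getvalue())
```

With real files, the two StringIO objects would be replaced by files opened with newline="" as the csv module documentation recommends.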
Categories
A Categories attribute was also added to classify the “key” attribute (the visited sites) into: general
browsing, shopping, entertainment, lifestyle, foreign news, social network, and alert.
• General Browsing is harmless browsing that might be work-related research activity, so
this type of behavior is acceptable
• Shopping is a category that is not acceptable in a work environment
• Entertainment is a category that is not acceptable
• Lifestyle is another category that is not acceptable
• Foreign News is not acceptable
• Social Network is not acceptable
• Alert is a heightened-level category that is beyond unacceptable, e.g. visiting pornographic sites or
monitored sites
With the 5 original attributes from the Firefox cache (Key, Datasize, Fetchcount, Lastmodified, and Expires)
plus the added ID and Categories attributes, the data now has 7 attributes and is ready to be
preprocessed in Weka.
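The report does not say how each site was assigned to a category, so the following is only one possible sketch: a rule table that maps substrings of the host name to a category, with General Browsing as the default. Every host pattern in RULES is a made-up placeholder, and the category names are taken from the Categories and Analysis sections of this report.

```python
from urllib.parse import urlparse

# Hypothetical host-name patterns; a real deployment would need its own list.
# Rules are checked in order, so the most serious category comes first.
RULES = [
    ("blocked.example", "Alert"),
    ("shop.example", "Shopping"),
    ("video.example", "Entertainment"),
    ("lifestyle.example", "Lifestyle"),
    ("news.example", "Foreign News"),
    ("social.example", "Social Network"),
]

DEFAULT_CATEGORY = "General Browsing"
ACCEPTABLE = {DEFAULT_CATEGORY}  # the only acceptable category in this report


def categorize(key: str) -> str:
    """Map a cache key (assumed to be a URL) to a browsing category."""
    host = urlparse(key).netloc.lower()
    for needle, category in RULES:
        if needle in host:
            return category
    return DEFAULT_CATEGORY


def is_acceptable(key: str) -> bool:
    """True when the key falls in an acceptable category."""
    return categorize(key) in ACCEPTABLE


print(categorize("http://shop.example.com/cart"))
print(categorize("http://www.pace.edu/"))
```

Putting the Alert rule first means a site that matches several patterns is reported under the most serious category.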
Data Loading
The data loading process starts by opening Weka Explorer and loading the CSV file prepared from the
log file. The result is as follows:
Figure 7: Loaded Data on WEKA Pre-processing with Category shown
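This project loaded the CSV file directly in Weka Explorer. Weka's native format is ARFF, which declares each attribute's type explicitly, so one alternative is to convert the CSV before loading. The sketch below is that alternative, not the method used here. It assumes a CSV with the columns ID, Key, Datasize, Fetchcount and Categories, leaves out the two date attributes for brevity, and uses a hypothetical relation name and sample row.

```python
import csv
import io
from typing import List, TextIO


def quote(value: str) -> str:
    """Single-quote a value for ARFF, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def csv_to_arff(src: TextIO, relation: str, categories: List[str]) -> str:
    """Build ARFF text from a CSV with ID, Key, Datasize, Fetchcount, Categories."""
    rows = list(csv.DictReader(src))
    nominal = ",".join(quote(c) for c in categories)
    lines = [
        f"@relation {relation}",
        "",
        "@attribute ID numeric",
        "@attribute Key string",
        "@attribute Datasize numeric",
        "@attribute Fetchcount numeric",
        f"@attribute Categories {{{nominal}}}",
        "",
        "@data",
    ]
    for row in rows:
        lines.append(",".join([
            row["ID"],
            quote(row["Key"]),
            row["Datasize"],
            row["Fetchcount"],
            quote(row["Categories"]),
        ]))
    return "\n".join(lines) + "\n"


# Hypothetical one-row sample.
sample_csv = (
    "ID,Key,Datasize,Fetchcount,Categories\n"
    "1,http://shop.example.com/,120,2,Shopping\n"
)
arff_text = csv_to_arff(
    io.StringIO(sample_csv), "firefox_cache", ["General Browsing", "Shopping"]
)
print(arff_text)
```

Declaring Categories as a nominal attribute lets Weka treat it as a class label rather than free text.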
Analysis
The pre-processing panel reveals that our log consists of:
• General Browsing = 371 entries
• Lifestyle = 63 entries
• Shopping = 11 entries
• Social network = 9 entries
• Entertainment = 16 entries
• Foreign News = 16 entries
• Alert = 13 entries
The ID attribute is a unique identifier for each individual browsing instance. The data has 499 unique
instances, which were analyzed against the Categories attribute. The majority of IDs are in the General
Browsing category: 371 out of 499 instances (74.35%) are General Browsing, which is the only
acceptable browsing behavior category. The remaining 128 instances (25.65%) are in
the unacceptable behavior categories.
Figure 8: Categories vs. ID
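The percentages reported in the analysis can be checked with a few lines of Python. The counts are the ones listed from the Weka pre-processing panel; the variable names are illustrative.

```python
# Category counts as reported from the Weka pre-processing panel.
counts = {
    "General Browsing": 371,
    "Lifestyle": 63,
    "Shopping": 11,
    "Social network": 9,
    "Entertainment": 16,
    "Foreign News": 16,
    "Alert": 13,
}

total = sum(counts.values())            # 499 instances
acceptable = counts["General Browsing"]
unacceptable = total - acceptable       # 128 instances

acceptable_share = f"{acceptable / total:.2%}"      # 74.35%
unacceptable_share = f"{unacceptable / total:.2%}"  # 25.65%

print(total, acceptable_share, unacceptable_share)
```

These shares are proportions of cache entries, not of time spent browsing.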
Conclusion and Recommendation
Firefox cache can be used to analyze web browsing behavior, especially for browsing activities during
working hours at the workplace. These kinds of logs can be easily acquired since browsers have caching
capability. For each cached page, this capability provides the address/URL from which the page was
fetched, the name of the file, the size, the time it was last modified, and its expiry date.
From the data acquired in this project, the web browsing cache log shows misuse of Internet access at
work: 128 of the 499 sampled cache entries (25.65%) fall into unacceptable browsing categories.
Most web browsers provide an erase function for log information such as cache, history, cookies, and
the download list. If a user runs this function to erase log information, investigation becomes difficult. [5]
Consequently, we recommend that companies either disable their employees’ ability to delete the browsing
history and cache or retain a copy outside of the user’s computer.
References
1. Web Developers Notes. Browser Cache – What is It?,
http://www.webdevelopersnotes.com/basics/what-is-browser-cache-definition.php 2012
2. Chad Tilbury. Big Brother Forensics: Device Tracking Using Browser-Based Artifacts,
http://forensicmethods.com/browser-geolocation2 April 11, 2012
3. Peter Grant. Forensic Tools for Internet Activity. http://www.ehow.com/info_8344366_forensictools-internet-activity.html 2012
4. Keith J. Jones, Rohyt Belani. Web Browser Forensics,
http://www.symantec.com/connect/articles/web-browser-forensics-part-1 2012
5. Junghoon Oh, Seungbong Lee, Sangjin Lee: Advanced evidence collection and analysis of web
browser activity
6. Jones Keith J. Forensic analysis of internet explorer activity files. Foundstone,
http://www.foundstone.com/us/pdf/wp_index_dat.pdf; 2003.
7. Jones Keith J, Rohyt Belani. Web browser forensic. Security focus,
http://www.securityfocus.com/infocus/1827; 2005a.
8. Jones Keith J, Rohyt Belani. Web browser forensic. Security focus,
http://www.securityfocus.com/infocus/1832; 2005b.
9. Arvidson, Erick. Types of Web Browsers. 2012. <http://www.ehow.com/info_8256056_typesbrowsers.html>.
10. Davison, Brian D. Web Caching. Feb 2008. <http://www.web-caching.com/welcome.html>.
11. Jaroslovsky, Rich. Bloomberg Technology Columnist Renee Montagne. 12 August 2012.
List of Figures
Figure 1: Browser Market Share
Figure 2: IE Browser Cache Clearing Dialog Box
Figure 3: Browser Cache and Forensics
Figure 4: Browser History
Figure 5: Browser Cache left on a Hard Drive
Figure 6: About:Cache in Firefox
Figure 7: Loaded Data on WEKA Pre-processing with Category shown
Figure 8: Categories vs. ID