DATA MINING USING WEKA: AN ANALYSIS OF WEB BROWSING BEHAVIOR EVALUATION USING FIREFOX CACHE

TEAM-3: Mike Egbert, Gonzalo Perez, Diah Schur, Novelle Maxwell-Sinclair
For: Emerging IT, Dr. Charles Tappert, Dr. Sun-Hyuk Cha
Pace University, 2012

TABLE OF CONTENTS

Introduction
    Common Browsers
        Internet Explorer
        Google Chrome
        Mozilla Firefox
        Safari
        Opera
        Figure 1: Browser Market Share
    Previous Work in Cache Analysis
        Figure 2: IE Browser Cache Clearing Dialog Box
        Browser Cache and Forensics
        Figure 3: Browser Cache and Forensics
        Browser-based Forensics: History as a Cache Artifact
        Figure 4: Browser History
        Persistence Browser Cache
        Figure 5: Browser Cache left on a Hard Drive
Web Browsing Cache Log
    Data Collection
        Figure 6: About:Cache in Firefox
Methodology
    DataSet
    Categories
    Data Loading
        Figure 7: Loaded Data on WEKA Pre-processing with Category shown
    Analysis
        Figure 8: Categories vs. ID
Conclusion and Recommendation
References
List of Figures

Introduction

This project evaluates whether browser cache and history logs can be used to determine a user's browsing behavior, especially in the workplace. The first step is to acquire the actual log file(s). The second step is to identify the relevant file that can be analyzed by the tool.
The third step is to evaluate the recovered data to determine whether the user has been misusing his or her time at work. The last step is to present the findings at a high level. For this project, we use Firefox as the subject browser, although several other browsers are commonly used by individuals and companies. Previous work in computer forensics has used cache analysis as a main methodology. The following sections discuss common web browsers and previous work in cache analysis.

Common Browsers

The web provides a diverse source of information and services, and many browsers compete to deliver the experience users want at the click of a button. Browsers are the tools that request content from web servers, parse the markup language, interpret the content and then present it to users. These web browsers are compatible with PCs, Macs and other Internet-capable devices such as the Apple iPod (though Internet Explorer is compatible only with PCs running Windows). The most popular browser today is Internet Explorer, with Google Chrome close behind, followed by Mozilla Firefox, Safari, Opera and a few others.

Internet Explorer

Internet Explorer was introduced in the mid-1990s and is bundled with every Windows-based PC.

Google Chrome

Google Chrome has gained significant market share within the past few years. Chrome has ranked very high in independent tests of speed and page load times. Chrome also features an “incognito mode” in which users can visit web sites without any cookies remaining on their PCs.

Mozilla Firefox

Firefox is an open source web browser with a very simple interface and enhanced security features to help protect users. Because Firefox is open source, it offers users a healthy library of add-ons that can be installed in the browser to augment and customize the user experience.
Different types of add-ons include extensions, themes, search providers, dictionaries, language packs and plugins.

Safari

Safari, which runs on Mac OS and the iPhone, offers many browser extensions, including an eBay manager and Twitter integration.

Opera

The Opera browser has some unique features, such as text and graphic enlarging on a web page.

Figure 1: Browser Market Share

Recent trends show browsers becoming their own operating systems, integrating many functions that were historically performed on a local machine. The Google Chromebook doesn’t have a Windows operating system installed; Chrome assumes that function. Users can create, save and edit spreadsheets or word processing documents directly through the Google browser.

Previous Work in Cache Analysis

The browser cache is a set of temporary file folders that store content from web sites you’ve visited (e.g., graphics, static pages, cookies, entire web pages). The purpose of the browser cache is to improve performance: when you revisit a site, much of the content can be served from the local cache rather than retrieved from the site again. [1]

What kinds of files are stored in the browser cache? The following is a list of typical files that are stored in browser cache:

1. Files from the web sites you've opened in the browser
2. Entire web pages
3. Images
4. CSS
5. Audio
6. Video

Figure 2: IE Browser Cache Clearing Dialog Box

For example, if you save web pages for offline browsing, all the files are stored in the browser cache. Depending on your browser and the operating system, both the hard disk and RAM are used to store the cache files. [1]

Browser Cache and Forensics

Prior work on forensics and browser cache has used browser-based cache when investigating cybercrime.
While browser cache is an important part of web architecture with many benefits, it is also a forensic artifact that may be used in cybercrime investigations. One might view this as a negative, but law enforcement officials regard it as a positive and powerful artifact.

Figure 3: Browser Cache and Forensics

In addition to informational artifacts, browser cache can also be tied to geo-location. This is possible because information from map web pages and Internet addresses may be in the cache, and forensic investigators will look for it. Chad Tilbury states in an article in Forensic Methods: “Geo-location artifacts demonstrate an interesting concept with regard to browser-based evidence. Among the various browser artifacts, Internet history is a fan favorite because it provides such rich information. There is no easier place to look to identify sites visited by a specific user at a specific time. Browser history is so useful, a critical shortcoming is often ignored; with today’s dynamic web pages, the vast number of web page requests goes unrecorded.” [2]

Browser-based Forensics: History as a Cache Artifact

Historical data is kept in browser cache. Most browsers today retain multiple days and weeks of stored browsing history. The figure below is an example of historical browser cache. Again, forensic investigators can use such historical data.

Figure 4: Browser History

Persistence Browser Cache

Peter Grant notes the importance of browser cache: even after a user deletes the cache, the data can still be retrieved in most cases. Federal and state cybercrime investigators can use this data in a court of law. [3] The figure below is an example of browser cache that remains on the hard disk even though the user deleted it.

Figure 5: Browser Cache left on a Hard Drive

Browser cache has thus developed into a good tool for web performance as well as an important tool for cyber forensics.
Keith J. Jones and Rohyt Belani state: “Critical electronic evidence is often found in the suspect's web browsing history in the form of received emails; sites visited and attempted Internet searches.” [4] Prior work on browser cache technology has thus yielded both modern browser caching and a powerful cyber forensics tool.

Web Browsing Cache Log

Cache analysis uses the cache log of browsing history from a user’s computer to determine his or her browsing behavior. This is particularly useful in the workplace. Other potential evidence should be available from registry entries, temporary files, index.dat, cookies, bookmarked pages, saved HTML pages, emails sent and received by the user, etc. A study by Junghoon Oh, Seungbong Lee and Sangjin Lee asserts that searching for evidence left by web browsing activity is typically a crucial component of digital forensic investigations. [5] Almost every action a suspect performs while using a web browser leaves a trace on the computer, even a simple search for information. Therefore, when an investigator analyzes the suspect’s computer, this evidence can provide useful information. After retrieving data such as cache, history, cookies, and download list from a suspect’s computer, it is possible to analyze this evidence for web sites visited, time and frequency of access, and search engine keywords used by the suspect. Computer forensic analysis can also be done from the server side by analyzing access logs, error logs and FTP log files, as well as network traffic. For this project, we employ a simple cache log analysis method using the Firefox disk cache.

Data Collection

At the beginning of the project, we decided to use the Firefox cache to obtain the browsing history of an actual user.
We further decided to use 499 of the cache entries, acquired by entering “about:cache” in the URL address box of the Firefox browser. The screenshot is as follows:

Figure 6: About:Cache in Firefox

As shown in the figure above, there are three cache devices, but only two have entries: the memory cache device and the disk cache device. Memory cache is held in RAM (application processed), while disk cache is stored on the user’s hard drive. The disk cache is where temporary Internet files are stored, so it has more entries. To obtain a good variety of data, we took 499 entries from the disk cache device log as samples.

Methodology

Once we acquired the log from the cache, we prepared the data so that it could be loaded and processed by Weka.

DataSet

The log file was transferred to an Excel spreadsheet and then saved as a CSV file. We encountered a few errors while trying to load the data into Weka for pre-processing; these were resolved after an ID attribute was added to the file.

Categories

A Categories attribute was also added to classify the “key” attribute (the visited sites) into general browsing, shopping, entertainment, lifestyle, foreign news, and alert:

General Browsing: harmless browsing that might be work-related research; this behavior is acceptable.
Shopping: not acceptable in a work environment.
Entertainment: not acceptable.
Lifestyle: not acceptable.
Foreign News: not acceptable.
Alert: a heightened-level category that is beyond unacceptable, e.g. visiting porn sites or monitored sites.

With the 5 original attributes from the Firefox cache (Key, Datasize, Fetchcount, Lastmodified, and Expires) and the added ID and Categories attributes, the data now has 7 attributes and is ready to be pre-processed in Weka.

Data Loading

The data loading process starts by opening Weka Explorer and loading the CSV file prepared from the log file. The result is as follows:

Figure 7: Loaded Data on WEKA Pre-processing with Category shown

Analysis

The pre-processing panel reveals that our log consists of:

General Browsing = 371 entries
Lifestyle = 63 entries
Shopping = 11 entries
Social network = 9 entries
Entertainment = 16 entries
Foreign News = 16 entries
Alert = 13 entries

The ID attribute is a unique identifier for each individual browsing instance. The data has 499 unique instances, which were analyzed against the Categories attribute. The majority of IDs are in the General Browsing category: 371 out of 499 instances (74.35%), which is the only acceptable browsing behavior category. The remaining 128 instances (25.65%) are in the unacceptable behavior categories. The loaded data also contains a Social network category in addition to the six described above; its 9 entries are counted among the unacceptable categories.

Figure 8: Categories vs. ID

Conclusion and Recommendation

The Firefox cache can be used to analyze web browsing behavior, especially browsing activity during working hours at the workplace. These logs can be acquired easily because browsers have caching capability. For each cached page, the cache records the address (URL) from which the page was fetched, the name of the file, the size, the time it was last modified, and its expiry date. From the data acquired in this project, the web browsing cache log shows misuse of Internet access at work: about 25% of the user’s sampled cache entries reflect unacceptable browsing activities. Most web browsers provide an erase function for log information such as cache, history, cookies, and download list.
If a user runs this function to erase log information, investigation becomes difficult. [5] Consequently, we recommend that companies disable their employees’ ability to delete the browsing history cache, or retain a copy outside of the user’s computer.

References

1. Web Developers Notes. Browser Cache – What is It? http://www.webdevelopersnotes.com/basics/what-is-browser-cache-definition.php. 2012.
2. Chad Tilbury. Big Brother Forensics: Device Tracking Using Browser-Based Artifacts. http://forensicmethods.com/browser-geolocation2. April 11, 2012.
3. Peter Grant. Forensic Tools for Internet Activity. http://www.ehow.com/info_8344366_forensictools-internet-activity.html. 2012.
4. Keith J. Jones, Rohyt Belani. Web Browser Forensics. http://www.symantec.com/connect/articles/web-browser-forensics-part-1. 2012.
5. Junghoon Oh, Seungbong Lee, Sangjin Lee. Advanced evidence collection and analysis of web browser activity.
6. Keith J. Jones. Forensic analysis of internet explorer activity files. Foundstone. http://www.foundstone.com/us/pdf/wp_index_dat.pdf. 2003.
7. Keith J. Jones, Rohyt Belani. Web browser forensic. Security Focus. http://www.securityfocus.com/infocus/1827. 2005a.
8. Keith J. Jones, Rohyt Belani. Web browser forensic. Security Focus. http://www.securityfocus.com/infocus/1832. 2005b.
9. Erick Arvidson. Types of Web Browsers. http://www.ehow.com/info_8256056_typesbrowsers.html. 2012.
10. Brian D. Davison. Web Caching. http://www.web-caching.com/welcome.html. Feb 2008.
11. Rich Jaroslovsky. Bloomberg Technology Columnist Renee Montagne. 12 August 2012.

List of Figures

Figure 1: Browser Market Share
Figure 2: IE Browser Cache Clearing Dialog Box
Figure 3: Browser Cache and Forensics
Figure 4: Browser History
Figure 5: Browser Cache left on a Hard Drive
Figure 6: About:Cache in Firefox
Figure 7: Loaded Data on WEKA Pre-processing with Category shown
Figure 8: Categories vs. ID
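
Worked example: data preparation and category shares

The data preparation described under Methodology and the percentages reported under Analysis can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure the team used (the report describes preparing the file in Excel and inspecting it in Weka Explorer). The function names, the sample cache entries and the example.com URLs are hypothetical; the attribute names and category counts are taken from the report.

```python
import csv
import io

# Category counts exactly as reported in the Analysis section.
REPORTED_COUNTS = {
    "General Browsing": 371,
    "Lifestyle": 63,
    "Shopping": 11,
    "Social network": 9,
    "Entertainment": 16,
    "Foreign News": 16,
    "Alert": 13,
}

# The report treats General Browsing as the only acceptable category.
ACCEPTABLE = {"General Browsing"}


def add_id_column(rows):
    """Prepend a sequential, 1-based ID attribute to each row (a dict)."""
    return [{"ID": i, **row} for i, row in enumerate(rows, start=1)]


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header row built from the first row's keys."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def category_shares(counts):
    """Return each category's share of all entries, in percent (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


def unacceptable_summary(counts, acceptable=ACCEPTABLE):
    """Return (entry count, percent) for all categories not marked acceptable."""
    total = sum(counts.values())
    bad = sum(n for name, n in counts.items() if name not in acceptable)
    return bad, round(100 * bad / total, 2)


if __name__ == "__main__":
    # Hypothetical cache entries; the values are illustrative only.
    sample = [
        {"Key": "http://example.com/docs", "Datasize": 1024, "Fetchcount": 2,
         "Lastmodified": "2012-10-01", "Expires": "2012-11-01",
         "Categories": "General Browsing"},
        {"Key": "http://example.com/shop", "Datasize": 2048, "Fetchcount": 1,
         "Lastmodified": "2012-10-02", "Expires": "2012-11-02",
         "Categories": "Shopping"},
    ]
    print(rows_to_csv(add_id_column(sample)))
    print(category_shares(REPORTED_COUNTS))
    print(unacceptable_summary(REPORTED_COUNTS))
```

Keeping the acceptable categories in a separate set makes the acceptable/unacceptable split explicit, so a different workplace policy only requires changing that set rather than the tallying logic.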