Digital Forensics
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Lecture #27: Evidence Correlation
October 31, 2007

Outline
- Review of Lecture 26
- Discussion of the papers on Evidence Correlation

Review of Lecture 26
- FORZA – A digital forensics investigation framework that incorporates legal issues - http://dfrws.org/2006/proceedings/4-Ieong.pdf
- A cyber forensics ontology: Creating a new approach to studying cyber forensics - http://dfrws.org/2006/proceedings/5-Brinson.pdf
- Arriving at an anti-forensics consensus: Examining how to define and control the anti-forensics problem - http://dfrws.org/2006/proceedings/6-Harris.pdf

Papers to Discuss
- Forensic feature extraction and cross-drive analysis - http://dfrws.org/2006/proceedings/10-Garfinkel.pdf
- md5bloom: Forensic file system hashing revisited (OPTIONAL) - http://dfrws.org/2006/proceedings/11-Roussev.pdf
- Identifying almost identical files using context triggered piecewise hashing (OPTIONAL) - http://dfrws.org/2006/proceedings/12-Kornblum.pdf
- A correlation method for establishing provenance of timestamps in digital evidence - http://dfrws.org/2006/proceedings/13-%20Schatz.pdf

Abstract of Paper 1
This paper introduces Forensic Feature Extraction (FFE) and Cross-Drive Analysis (CDA), two new approaches for analyzing large data sets of disk images and other forensic data. FFE uses a variety of lexigraphic techniques for extracting information from bulk data; CDA uses statistical techniques for correlating this information within a single disk image and across multiple disk images. An architecture for these techniques is presented that consists of five discrete steps: imaging, feature extraction, first-order cross-drive analysis, cross-drive correlation, and report generation.
CDA was used to analyze 750 images of drives acquired on the secondary market; it automatically identified drives containing a high concentration of confidential financial records as well as clusters of drives that came from the same organization. FFE and CDA are promising techniques for prioritizing work and automatically identifying members of social networks under investigation. The authors believe the techniques are likely to have other uses as well.

Outline
- Introduction
- Forensic Feature Extraction
- Single Drive Analysis
- Cross-Drive Analysis
- Implementation
- Directions

Introduction: Why?
- Improper prioritization. In these days of cheap storage and fast computers, the critical resource to be optimized is the attention of the examiner or analyst. Today work is not prioritized based on the information that a drive contains.
- Lost opportunities for data correlation. Because each drive is examined independently, there is no opportunity to automatically "connect the dots" in a large case involving multiple storage devices.
- Improper emphasis on document recovery. Because today's forensic tools are based on document recovery, they have taught examiners, analysts, and customers to be primarily concerned with obtaining documents.

Feature Extraction
- An email address extractor, which can recognize RFC822-style email addresses.
- An email Message-ID extractor.
- An email Subject: extractor.
- A date extractor, which can extract date and time stamps in a variety of formats.
- A cookie extractor, which can identify cookies from the Set-Cookie: header in web page cache files.
- A US Social Security number extractor, which identifies the patterns ###-##-#### and ######### when preceded by the letters SSN and an optional colon.
- A credit card number extractor.

Single Drive Analysis
Extracted features can be used to speed initial analysis and answer specific questions about a drive image.
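Two of the extractors above can be sketched as simple regular-expression scanners over the strings pulled from a drive image. The patterns below are illustrative guesses for this lecture, not the paper's implementation:

```python
import re

# RFC822-style email addresses (simplified illustrative pattern).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# SSN: ###-##-#### or #########, preceded by "SSN" and an optional colon.
SSN_RE = re.compile(r"SSN:?\s*(\d{3}-\d{2}-\d{4}|\d{9})")

def extract_features(text):
    """Return (emails, ssns) found in a block of extracted strings."""
    return EMAIL_RE.findall(text), SSN_RE.findall(text)

emails, ssns = extract_features("Contact alice@example.com  SSN: 123-45-6789")
```

In a real pipeline such extractors would run over the string files produced from each image and append every hit, with its offset, to a per-drive feature file.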
The authors have successfully used extracted features for drive image attribution and to build a tool that scans disks to report the likely existence of information that should have been destroyed under the Fair and Accurate Credit Transactions Act.

Drive attribution: an analyst might encounter a hard drive and wish to determine to whom that drive previously belonged. For example, the drive might have been purchased on eBay and the analyst might be attempting to return it to its previous owner. A powerful technique for making this determination is to create a histogram of the email addresses on the drive (as returned by the email address feature extractor).

Cross-Drive Analysis (CDA)
Cross-drive analysis is the term the authors coined to describe forensic analysis of a data set that spans multiple drives. The fundamental theory of cross-drive analysis is that data gleaned from multiple drives can improve the forensic analysis of a drive in question, both when the multiple drives are related to the drive in question and when they are not. There are two forms of CDA: first order, in which the results of a feature extractor are compared across multiple drives, an O(n) operation; and second order, in which the results are correlated, an O(n²) operation.

Implementation
1. Collected disks are imaged into a single AFF file. (AFF is the Advanced Forensic Format, a file format for disk images that contains all of the data accession information, such as the drive's manufacturer and serial number, as well as the disk contents.)
2. The afxml program is used to extract drive metadata from the AFF file and build an entry in the SQL database.
3. Strings are extracted with an AFF-aware program in three passes: one for 8-bit characters, one for 16-bit characters in LSB format, and one for 16-bit characters in MSB format.
4. Feature extractors run over the string files and write their results to feature files.
5. Extracted features from newly ingested drives are run against a watch list; hits are reported to the human operator.
6. The feature files are read by indexers, which build indexes in the SQL server of the identified features.
7. A multi-drive correlation is run to see if the newly accessioned drive contains features in common with any drives on a drive watch list.
8. A user interface allows multiple analysts to simultaneously interact with the database, to schedule new correlations to run in batch mode, or to view individual sectors or recovered files from the drive images stored on the file server.

Directions
- Improve feature extraction
- Improve the algorithms
- Develop end-to-end systems

Abstract of Paper 2
Hashing is a fundamental tool in digital forensic analysis, used both to ensure data integrity and to efficiently identify known data objects. The authors' objective is to leverage advanced hashing techniques in order to improve the efficiency and scalability of digital forensic analysis. They explore the use of Bloom filters as a means to efficiently aggregate and search hashing information. They present md5bloom, a Bloom filter manipulation tool that can be incorporated into forensic practice, along with example uses and experimental results.

Outline
- Introduction
- Bloom Filter
- Applications
- Directions

Introduction
The goal is to pick from a set of forensic images the one(s) that are most like (or perhaps most unlike) a particular target. This problem comes up in a number of different variations, such as comparing the target with previous/related cases, or determining the relationships among targets in a larger investigation. The goal is to get a high-level picture that will guide the following in-depth inquiry.
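As a toy illustration of this goal (not the paper's method), resemblance between a target image and candidate images can be scored by comparing the sets of block hashes drawn from each image; the miniature "images" and block size below are invented for the example:

```python
import hashlib

def block_hashes(data: bytes, block: int = 4096):
    """Hash an image in fixed-size blocks; the set of hashes summarizes it."""
    return {hashlib.md5(data[i:i + block]).digest()
            for i in range(0, len(data), block)}

def resemblance(a: set, b: set) -> float:
    """Jaccard resemblance: |A ∩ B| / |A ∪ B|, in [0, 1]."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Hypothetical two-block images sharing one block (Jaccard = 1/3).
block_a = bytes(range(256)) * 16      # 4096 bytes of distinct content
img1 = block_a + b"\x01" * 4096
img2 = block_a + b"\x02" * 4096
score = resemblance(block_hashes(img1), block_hashes(img2))
```

Ranking candidates by such a score gives the coarse, high-level picture that guides deeper inquiry; the paper's contribution is doing this at scale without storing every hash explicitly.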
The already existing problems of scale in digital forensic tools are further multiplied by the number of targets, which explains why in other forensic areas comparison with other cases is routine and massive, whereas in digital forensics it is the exception. An example is object version detection: the need to detect a particular version of an object, not just the target object.

Introduction (continued)
What needs to be addressed is a way to store a set of hashes representing the different components of a composite object, as opposed to a single hash. For example, hashing the individual routines of libraries or executables would enable fine-grained detection of changes (e.g., only a fraction of the code changes from version to version). The problem is that storing more hashes presents a scalability problem even for targets of modest sizes. Therefore, the authors propose the use of Bloom filters as an efficient way to store and query large sets of hashes.

Bloom Filters
A Bloom filter B is a representation of a set S = {s1, ..., sn} of n elements from a universe (of possible values) U. The filter consists of an array of m bits, initially all set to 0. The ratio r = m/n is a key design element and is usually fixed for a particular application. To represent the set elements, the filter uses k independent hash functions h1, ..., hk, each with range {0, ..., m-1}. All hash functions are assumed to be independent and to map elements from U uniformly over the range of the function.

md5bloom
The authors have a prototype stream-oriented Bloom filter implementation called md5bloom.

Application of Bloom Filters in Security
Spafford (1992) was one of the first to use Bloom filters to support computer security. His OPUS system uses a Bloom filter that efficiently encodes a wordlist of poor password choices to help users choose strong passwords. Bellovin and Cheswick present a scheme for selectively sharing data while maintaining privacy.
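The insert and query operations of a Bloom filter with m bits and k hash functions can be sketched as follows. In the spirit of md5bloom, the k bit positions here are derived by slicing a single MD5 digest; the paper's exact construction, and the m and k chosen below, may differ, so treat this as a hedged illustration:

```python
import hashlib

M = 2 ** 16   # filter size in bits (illustrative choice)
K = 4         # number of hash functions

def _positions(item: bytes):
    """Derive K bit positions by slicing one MD5 digest into 4-byte chunks."""
    digest = hashlib.md5(item).digest()
    return [int.from_bytes(digest[4 * i: 4 * i + 4], "big") % M for i in range(K)]

class BloomFilter:
    def __init__(self):
        self.bits = bytearray(M // 8)   # m bits, initially all 0

    def add(self, item: bytes):
        for p in _positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        # May return a false positive, but never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in _positions(item))

bf = BloomFilter()
bf.add(b"known-object-hash")
assert b"known-object-hash" in bf
```

The appeal for forensics is the space saving: membership of a large hash set is answered from m bits rather than from the hashes themselves, at the cost of a tunable false-positive rate governed by the ratio r = m/n and the choice of k.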
Through the use of encrypted Bloom filters, Bellovin and Cheswick allow parties to perform searches against each other's document sets without revealing the specific details of the queries. The system supports query restrictions to limit the set of allowed queries. Aguilera et al. discuss the use of Bloom filters to enhance security in a network-attached disks (NADs) infrastructure; they use Bloom filters to detect hash tampering.

Directions
- Cryptographic hashing is a key tool for detecting evidence tampering
- Bloom filters are one technique for detecting tampering with hashes
- Need to compare different cryptographic algorithms
- Relation to correlation needs to be determined

Abstract of Paper 3
Homologous files share identical sets of bits in the same order. Because such files are not completely identical, traditional techniques such as cryptographic hashing cannot be used to identify them. This paper introduces a new technique for constructing hash signatures by combining a number of traditional hashes whose boundaries are determined by the context of the input. These signatures can be used to identify modified versions of known files even if data has been inserted, modified, or deleted in the new files. The description of this method is followed by a brief analysis of its performance and some sample applications to computer forensics.

Outline
- Introduction
- Piecewise Hashing
- Spamsum Algorithm
- Directions

Introduction
This paper describes a method for using a context triggered rolling hash in combination with a traditional hashing algorithm to identify known files that have had data inserted, modified, or deleted. First, the authors examine how cryptographic hashes are currently used by forensic examiners to identify known files and what weaknesses exist with such hashes. Next, the concept of piecewise hashing is introduced. Finally, a rolling hash algorithm that produces a pseudo-random output based only on the current context of an input is described.
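A minimal sketch of those two ingredients follows: classic fixed-size piecewise hashing, and a context triggered variant that cuts pieces wherever a rolling hash over the last few bytes hits a trigger value. The toy rolling hash (a sum over a sliding window) and all constants stand in for the paper's actual construction:

```python
import hashlib

BLOCK = 512        # fixed piece size for classic piecewise hashing
WINDOW = 7         # bytes of context the rolling hash looks at
MOD = 64           # cut a piece when the rolling state is 0 mod MOD

def piecewise_hashes(data: bytes):
    """Classic piecewise hashing: one MD5 per fixed-size 512-byte segment."""
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def context_triggered_pieces(data: bytes):
    """Cut pieces where a toy rolling hash (sum of the last WINDOW bytes)
    hits the trigger, so boundaries depend on content, not on offsets."""
    pieces, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        if sum(window) % MOD == 0 and i + 1 - start >= WINDOW:
            pieces.append(hashlib.md5(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        pieces.append(hashlib.md5(data[start:]).hexdigest())
    return pieces
```

Because boundaries follow content, inserting or deleting bytes near the front of a file shifts every fixed-size piece but leaves most context-triggered pieces unchanged, which is what makes the resulting signatures comparable across modified versions of a file.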
By using the rolling hash to set the boundaries for the traditional piecewise hashes, the authors create a Context Triggered Piecewise Hash (CTPH).

Piecewise Hashing
Piecewise hashing uses an arbitrary hashing algorithm to create many checksums for a file instead of just one. Rather than generating a single hash for the entire file, a hash is generated for each of many discrete fixed-size segments of the file. For example, one hash is generated for the first 512 bytes of input, another hash for the next 512 bytes, and so on. A rolling hash algorithm produces a pseudo-random value based only on the current context of the input. The rolling hash works by maintaining a state based solely on the last few bytes of the input. Each byte is added to the state as it is processed and removed from the state after a set number of other bytes have been processed.

Spamsum
Spamsum, an email spam detection tool written by Dr. Andrew Tridgell, can identify emails that are similar but not identical to samples of known spam. The spamsum algorithm was in turn based upon the rsync checksum, also by Dr. Tridgell. The spamsum algorithm uses FNV hashes for the traditional hashes, which produce a 32-bit output for any input. In spamsum, Dr. Tridgell further reduced the FNV hash by recording only a base64 encoding of the six least significant bits (LS6B) of each hash value. The algorithm for the rolling hash was inspired by the Adler-32 checksum.

Directions
- Many applications in altered document matching and partial file matching
- Improvements to hash algorithms
- Performance studies

Abstract of Paper 4
Establishing the time at which a particular event happened is a fundamental concern when relating cause and effect in any forensic investigation. Reliance on computer generated timestamps for correlating events is complicated by uncertainty as to clock skew and drift, environmental factors such as location and local time zone offsets, as well as human factors such as clock tampering.
Establishing that a particular computer's temporal behavior was consistent during its operation remains a challenge. The contributions of this paper are both a description of assumptions commonly made regarding the behavior of clocks in computers, and empirical results demonstrating that real-world behavior diverges from the idealized or assumed behavior. The authors present an approach for inferring the temporal behavior of a particular computer over a range of time by correlating commonly available local machine timestamps with another source of timestamps. They show that a general characterization of the passage of time may be inferred from an analysis of commonly available browser records.

Outline
- Introduction
- Factors to Consider
- Drifting Clocks
- Identifying Computer Timescales by Correlation with Corroborating Sources
- Directions

Introduction
Timestamps are increasingly used to relate events which happen in the digital realm to each other and to events which happen in the physical realm, helping to establish cause and effect. A difficulty with timestamps is how to interpret and relate the timestamps generated by separate computer clocks when they are not synchronized. Current approaches to inferring the real-world interpretation of timestamps assume idealized models of computer clocks, and there is uncertainty about the behavior of a suspect's computer clock before seizure. The authors explore two themes related to this uncertainty:
- investigate whether it is reasonable to assume uniform behavior of computer clocks over time, and test these assumptions by attempting to characterize how computer clocks behave in the wild;
- investigate the feasibility of automatically identifying the local time on a computer by correlating timestamps embedded in digital evidence with corroborative time sources.
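As a toy illustration of the second theme (not the paper's algorithms), the offset of a suspect machine's clock can be estimated by pairing its timestamps with a corroborating source for the same events and taking the median difference; the paired events below are invented for the example:

```python
from statistics import median

# Hypothetical paired events: (local_timestamp, corroborated_timestamp) in
# seconds since the epoch. Real evidence might pair browser cache entries
# with server-side logs for the same requests.
paired_events = [
    (1_000_000.0, 1_000_300.0),
    (1_000_050.0, 1_000_352.0),
    (1_000_500.0, 1_000_799.0),
]

def estimate_offset(pairs):
    """Median of (corroborated - local); robust to a few mispaired events."""
    return median(ref - local for local, ref in pairs)

offset = estimate_offset(paired_events)   # local clock runs ~300 s behind
```

Taking the median rather than the mean keeps a handful of mispaired or cached entries from skewing the estimate; repeating the estimate over successive time windows would expose drift as well as a constant offset.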
Factors
- Computer timekeeping
- Real-time synchronization
- Factors affecting timekeeping accuracy:
  - Clock configuration
  - Tampering
  - Synchronization protocol
  - Misinterpretation
- Usage of timestamps in forensics

Drifting Clocks
Enumerate the main factors influencing the temporal behavior of a computer's clock, and then attempt to experimentally validate whether one can make informed assumptions about such behavior. The authors do this by empirically studying the temporal behavior of a network of computers found in the wild. The subject of the case study is a network of machines in active use by a small business. The network consists of a Windows 2000 domain with one Windows 2000 server and a mixed number of Windows XP and 2000 workstations. The goal is to observe the temporal behavior. In order to observe this behavior, the authors constructed a simple service that logs both the system time of a host computer and the civil time for the location. The program samples both sources of time and logs the results to a file. The logging program was deployed on all workstations and the server.

Correlation
An automated approach which correlates time stamped events found on a suspect computer with time stamped events from a more reliable, corroborating source. Web browser records are increasingly employed as evidence in investigations, and are a rich source of time stamped data. The techniques implemented are a click stream correlation algorithm and a non-cached correlation algorithm; the authors compare the results of both algorithms.

Directions
- Need to determine whether the conditions and the assumptions of the experiments are realistic
- What are the most appropriate correlation algorithms?
- Need to integrate with clock synchronization algorithms
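The logging service described under Drifting Clocks could be sketched along these lines; the authors' service, its sampling interval, its reference-time source, and the log format are not specified in the lecture, so everything below is a hedged reconstruction:

```python
import time
from datetime import datetime, timezone

LOGFILE = "clock_samples.log"   # hypothetical output path
INTERVAL = 60.0                 # seconds between samples (illustrative)

def get_reference_time():
    """Stand-in for the 'civil time' source; a real deployment would query
    an external reference (e.g. an NTP server), not the host's own clock."""
    return datetime.now(timezone.utc)

def sample_once(log):
    """Record one (system time, reference time) pair as a tab-separated line."""
    system = datetime.now(timezone.utc)   # host's system clock
    reference = get_reference_time()      # corroborating time source
    log.write(f"{system.isoformat()}\t{reference.isoformat()}\n")

def run(samples=10):
    with open(LOGFILE, "a") as log:
        for _ in range(samples):
            sample_once(log)
            time.sleep(INTERVAL)
```

A log of such pairs, accumulated across every workstation and the server, is exactly the raw material needed to plot each machine's drift and test whether its temporal behavior was uniform.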