
Digital Forensics
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Lecture #27
Evidence Correlation
October 31, 2007
Outline
 Review of Lecture 26
 Discussion of the papers on Evidence Correlation
Review of Lecture 26
 FORZA – Digital forensics investigation framework that incorporates
legal issues
- http://dfrws.org/2006/proceedings/4-Ieong.pdf
 A cyber forensics ontology: Creating a new approach to studying
cyber forensics
- http://dfrws.org/2006/proceedings/5-Brinson.pdf
 Arriving at an anti-forensics consensus: Examining how to define
and control the anti-forensics problem
- http://dfrws.org/2006/proceedings/6-Harris.pdf
Papers to discuss
 Forensic feature extraction and cross-drive analysis
- http://dfrws.org/2006/proceedings/10-Garfinkel.pdf
 md5bloom: Forensic file system hashing revisited
(OPTIONAL)
- http://dfrws.org/2006/proceedings/11-Roussev.pdf
 Identifying almost identical files using context triggered
piecewise hashing (OPTIONAL)
- http://dfrws.org/2006/proceedings/12-Kornblum.pdf
 A correlation method for establishing provenance of timestamps in
digital evidence
- http://dfrws.org/2006/proceedings/13-%20Schatz.pdf
Abstract of Paper 1
 This paper introduces Forensic Feature Extraction (FFE) and Cross-
Drive Analysis (CDA), two new approaches for analyzing large data
sets of disk images and other forensic data. FFE uses a variety of
lexigraphic techniques for extracting information from bulk data;
CDA uses statistical techniques for correlating this information
within a single disk image and across multiple disk images. An
architecture for these techniques is presented that consists of five
discrete steps: imaging, feature extraction, first-order cross-drive
analysis, cross-drive correlation, and report generation. CDA was
used to analyze 750 images of drives acquired on the secondary
market; it automatically identified drives containing a high
concentration of confidential financial records as well as clusters of
drives that came from the same organization. FFE and CDA are
promising techniques for prioritizing work and automatically
identifying members of social networks under investigation. The
authors believe these techniques are likely to have other uses as well.
Outline
 Introduction
 Forensics Feature Extraction
 Single Drive Analysis
 Cross drive analysis
 Implementation
 Directions
Introduction: Why?
 Improper prioritization. In these days of cheap storage and fast
computers, the critical resource to be optimized is the attention of
the examiner or analyst. Today, work is not prioritized based on the
information that the drive contains.
 Lost opportunities for data correlation. Because each drive is
examined independently, there is no opportunity to automatically
‘‘connect the dots’’ on a large case involving multiple storage
devices.
 Improper emphasis on document recovery. Because today’s
forensic tools are based on document recovery, they have taught
examiners, analysts, and customers to be primarily concerned with
obtaining documents.
Feature Extraction
 An email address extractor, which can recognize RFC822-style
email addresses.
 An email Message-ID extractor.
 An email Subject: extractor.
 A Date extractor, which can extract date and time stamps in a
variety of formats.
 A cookie extractor, which can identify cookies from the Set-Cookie: header in web page cache files.
 A US social security number extractor, which identifies the
patterns ###-##-#### and ######### when preceded by the
letters SSN and an optional colon.
 A credit card number extractor (a minimal extraction sketch follows this list).
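A minimal sketch of how such extractors might look, assuming the bulk data has already been reduced to byte strings; the regular expressions and names below are illustrative, not the exact patterns used in the paper:

import re

# Illustrative patterns only; the paper's extractors are more elaborate.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# ###-##-#### or #########, preceded by the letters SSN and an optional colon.
SSN_RE = re.compile(rb"SSN:?\s*(\d{3}-\d{2}-\d{4}|\d{9})")

def extract_features(bulk_data: bytes):
    """Yield (feature_type, value) pairs found in a block of bulk data."""
    for match in EMAIL_RE.finditer(bulk_data):
        yield ("email", match.group(0).decode("ascii", "replace"))
    for match in SSN_RE.finditer(bulk_data):
        yield ("ssn", match.group(1).decode("ascii", "replace"))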
Single Drive analysis
 Extracted features can be used to speed initial analysis and
answer specific questions about a drive image.
 The authors have successfully used extracted features for drive image
attribution and to build a tool that scans disks to report the likely
existence of information that should have been destroyed under the
Fair and Accurate Credit Transactions Act.
 Drive attribution: an analyst might encounter a hard drive and wish
to determine to whom that drive previously belonged. For example,
the drive might have been purchased on eBay and the analyst
might be attempting to return it to its previous owner.
 A powerful technique for making this determination is to create a
histogram of the email addresses on the drive (as returned by the
email address feature extractor); a sketch follows.
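A sketch of the histogram idea, assuming the email extractor's output is already available as a list of addresses (the function name is hypothetical):

from collections import Counter

def email_histogram(addresses):
    """Count extracted email addresses; the most frequent addresses
    typically point to the drive's primary user or owning organization."""
    return Counter(a.lower() for a in addresses).most_common(10)

# Usage: email_histogram(["alice@example.org", "alice@example.org", "bob@example.org"])
# returns [("alice@example.org", 2), ("bob@example.org", 1)]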
Cross drive analysis (CDA)
 Cross-drive analysis is the term the authors coined to describe forensic
analysis of a data set that spans multiple drives.
 The fundamental theory of cross-drive analysis is that data gleaned
from multiple drives can improve the forensic analysis of a drive in
question, both when the multiple drives are related to the drive in
question and when they are not.
 There are two forms of CDA: first order, in which the results of a feature
extractor are compared across multiple drives, an O(n) operation;
and second order, where the results are correlated, an O(n^2)
operation (the second-order form is sketched below).
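A sketch of the second-order form, under the simplifying assumption that each drive has been reduced to a set of extracted feature values; the scoring (plain set intersection) is an illustration, not the paper's exact statistic:

from itertools import combinations

def cross_drive_correlation(drive_features):
    """drive_features maps drive_id -> set of extracted feature values.
    Returns drive pairs scored by how many features they share; the
    pairwise loop is what makes this an O(n^2) operation."""
    scores = []
    for (d1, f1), (d2, f2) in combinations(drive_features.items(), 2):
        common = len(f1 & f2)
        if common:
            scores.append((d1, d2, common))
    return sorted(scores, key=lambda s: s[2], reverse=True)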
Implementation
 1. Disks collected are imaged into a single AFF file. (AFF is
the Advanced Forensic Format, a file format for disk images that
contains all of the data accession information, such as the drive’s
manufacturer and serial number, as well as the disk contents)
 2. The afxml program is used to extract drive metadata from the
AFF file and build an entry in the SQL database.
 3. Strings are extracted with an AFF-aware program in three
passes, one for 8-bit characters, one for 16-bit characters in lsb
format, and one for 16-bit characters in msb format.
 4. Feature extractors run over the string files and write their results
to feature files.
 5. Extracted features from newly-ingested drives are run against a
watch list; hits are reported to the human operator.
 6. The feature files are read by indexers, which build indexes in
the SQL server of the identified features.
Implementation
 7. A multi-drive correlation is run to see if the newly accessioned
drive contains features in common with any drives that are on a
drive watch list (a sketch of this check follows below).
 8. A user interface allows multiple analysts to simultaneously
interact with the database, to schedule new correlations to be run
in a batch mode, or to view individual sectors or recovered files
from the drive images that are stored on the file server.
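A minimal sketch of the watch-list check in steps 5 and 7, using SQLite as a stand-in for the SQL server; the table and column names are assumptions, not the paper's schema:

import sqlite3

def report_watchlist_hits(db_path, drive_id, features):
    """Index newly extracted features and report any that appear on the
    watch list (hypothetical schema: features(drive_id, value),
    watchlist(value))."""
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.executemany(
        "INSERT INTO features (drive_id, value) VALUES (?, ?)",
        [(drive_id, f) for f in features],
    )
    hits = cur.execute(
        "SELECT value FROM features WHERE drive_id = ? "
        "AND value IN (SELECT value FROM watchlist)",
        (drive_id,),
    ).fetchall()
    con.commit()
    con.close()
    return [value for (value,) in hits]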
Directions
 Improve feature extraction
 Improve the algorithms
 Develop end to end systems
Abstract of Paper 2
 Hashing is a fundamental tool in digital forensic analysis used both
to ensure data integrity and to efficiently identify known data
objects. The authors' objective is to leverage advanced hashing
techniques in order to improve the efficiency and scalability of
digital forensic analysis. They explore the use of Bloom filters as a
means to efficiently aggregate and search hashing information.
They present md5bloom, a Bloom filter manipulation tool that can be
incorporated into forensic practice, along with example uses and
experimental results.
Outline
 Introduction
 Bloom filter
 Applications
 Directions
Introduction
 The goal is to pick from a set of forensic images the one(s) that are
most like (or perhaps most unlike) a particular target.
 This problem comes up in a number of different variations, such as
comparing the target with previous/related cases, or determining
the relationships among targets in a larger investigation.
 The goal is to get a high-level picture that will guide the following
in-depth inquiry.
 Already existing problems of scale in digital forensic tools are
further multiplied by the number of targets, which explains why
in other forensic areas comparison with other cases is routine
and massive, whereas in digital forensics it is the exception.
 An example is object version detection: the need to detect a
particular version of an object, not just the exact target object.
Introduction
 One need to address is a way to store a set of hashes representing the
different components of a composite object, as opposed to a single
hash.
 For example, hashing the individual routines of libraries or
executables would enable fine-grained detection of changes (e.g.
only a fraction of the code changes from version to version).
 The problem is that storing more hashes presents a scalability
problem even for targets of modest sizes.
 Therefore, the authors propose the use of Bloom filters as an efficient
way to store and query large sets of hashes.
Bloom Filters
 A Bloom filter B is a representation of a set S = {s1, ..., sn} of n
elements from a universe (of possible values) U. The filter consists
of an array of m bits, initially all set to 0.
 The ratio r = m/n is a key design element and is usually fixed for a
particular application.
 To represent the set elements, the filter uses k independent hash
functions h1, ..., hk, each with range {0, ..., m - 1}. All hash functions are
assumed to be independent and to map elements from U uniformly
over the range of the function.
 md5bloom: The authors have a prototype stream-oriented Bloom filter
implementation called md5bloom; a simplified sketch of the underlying
data structure follows.
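A simplified sketch of the data structure described above, deriving the k bit positions from slices of an MD5 digest; this is only in the spirit of md5bloom, whose actual parameters and on-disk layout are not reproduced here:

import hashlib

class BloomFilter:
    def __init__(self, m_bits, k):
        assert 1 <= k <= 4          # MD5 yields 16 bytes = four 4-byte slices
        self.m = m_bits             # number of bits in the filter
        self.k = k                  # number of hash functions
        self.bits = bytearray(m_bits // 8 + 1)

    def _indices(self, item):
        # Derive k indices from non-overlapping 4-byte slices of the MD5 digest.
        digest = hashlib.md5(item).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))

# Usage: the ratio r = m/n and the choice of k together control the false-positive rate.
# bf = BloomFilter(m_bits=2**20, k=4); bf.add(b"known object hash"); b"known object hash" in bf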
Application of Bloom Filter in Security
 Spafford (1992) was one of the first to use Bloom filters to
support computer security.
 The OPUS system by Spafford uses a Bloom filter which efficiently
encodes a wordlist containing poor password choices to help
users choose strong passwords.
 Bellovin and Cheswick present a scheme for selectively sharing
data while maintaining privacy. Through the use of encrypted
Bloom filters, they allow parties to perform searches against each
other’s document sets without revealing the specific details of the
queries. The system supports query restrictions to limit the set of
allowed queries.
 Aguilera et al. discuss the use of Bloom filters to enhance security
in a network-attached disk (NAD) infrastructure.
 The authors use Bloom filtering to detect hash tampering.
Directions
 Cryptography is a key application for detecting evidence
tampering
 Bloom filters are one approach for detecting hash tampering
 Need to compare different cryptographic algorithms
 The relation to correlation needs to be determined
Abstract of Paper 3
 Homologous files share identical sets of bits in the same order.
Because such files are not completely identical, traditional
techniques such as cryptographic hashing cannot be used to
identify them. This paper introduces a new technique for
constructing hash signatures by combining a number of traditional
hashes whose boundaries are determined by the context of the
input. These signatures can be used to identify modified versions
of known files even if data has been inserted, modified, or deleted
in the new files. The description of this method is followed by a
brief analysis of its performance and some sample applications to
computer forensics.
Outline
 Introduction
 Piece-wise hashing
 Spamsum algorithms
 Directions
Introduction
 This paper describes a method for using a context triggered rolling
hash in combination with a traditional hashing algorithm to identify
known files that have had data inserted, modified, or deleted.
 First, they examine how cryptographic hashes are currently used
by forensic examiners to identify known files and what
weaknesses exist with such hashes.
 Next, the concept of piecewise hashing is introduced.
 Finally, a rolling hash algorithm that produces a pseudo-random
output based only on the current context of an input is described.
 By using the rolling hash to set the boundaries for the traditional
piecewise hashes, the authors create a Context Triggered Piecewise
Hash (CTPH).
Piecewise hashing
 Piecewise hashing uses an arbitrary hashing algorithm to create many
checksums for a file instead of just one. Rather than generating a
single hash for the entire file, a hash is generated for many discrete
fixed-size segments of the file. For example, one hash is generated
for the first 512 bytes of input, another hash for the next 512 bytes,
and so on.
 A rolling hash algorithm produces a pseudo-random value based
only on the current context of the input. The rolling hash works by
maintaining a state based solely on the last few bytes from the
input. Each byte is added to the state as it is processed and
removed from the state after a set number of other bytes have
been processed (a simplified sketch of the combined scheme follows).
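A simplified sketch of the combined scheme: a rolling hash over a small sliding window decides where each traditional piecewise hash ends. The window size, trigger condition, toy rolling hash, and use of MD5 are illustrative assumptions, not spamsum's actual parameters:

import hashlib

WINDOW = 7      # bytes of context kept in the rolling state (assumed)
TRIGGER = 64    # a boundary fires when rolling % TRIGGER == TRIGGER - 1 (assumed)

def context_triggered_hashes(data):
    """Split data at context-triggered boundaries and hash each piece."""
    window = bytearray()
    pieces, start = [], 0
    for i, byte in enumerate(data):
        window.append(byte)
        if len(window) > WINDOW:
            window.pop(0)            # state depends only on the last WINDOW bytes
        rolling = sum(window)        # toy rolling hash; spamsum's is Adler-32-like
        if rolling % TRIGGER == TRIGGER - 1:
            pieces.append(hashlib.md5(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        pieces.append(hashlib.md5(data[start:]).hexdigest())
    return pieces

Because the boundaries depend only on local context, an insertion or deletion disturbs only the pieces near the edit, so two homologous files still share most of their piece hashes.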
Spamsum
 Spamsum, an email spam detector written by Dr. Andrew Tridgell,
can identify emails that are similar but not identical to samples of
known spam. The spamsum algorithm was in turn based upon the
rsync checksum, also by Dr. Tridgell.
 The spamsum algorithm uses FNV hashes for the traditional
hashes, which produce a 32-bit output for any input. In spamsum,
Dr. Tridgell further reduced the FNV hash by recording only a
base64 encoding of the six least significant bits (LS6B) of each
hash value (a sketch of this reduction follows).
 The algorithm for the rolling hash was inspired by the Adler-32
checksum.
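A sketch of the reduction described above, assuming a standard 32-bit FNV-1a hash (the exact FNV variant spamsum uses is not detailed here) and the usual base64 alphabet:

BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def fnv1a_32(data):
    """Standard 32-bit FNV-1a hash of a byte string."""
    h = 0x811C9DC5                         # FNV offset basis
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h

def ls6b_char(piece):
    """Map a piece's 32-bit hash to a single base64 character using its
    six least significant bits (LS6B), as described above."""
    return BASE64[fnv1a_32(piece) & 0x3F]

The signature is then the string of one such character per piece, which keeps signatures short enough to compare cheaply.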
Directions
 Many applications in altered document matching and partial
file matching
 Improvement to hash algorithms
 Performance studies
Abstract of Paper 4
 Establishing the time at which a particular event happened is a
fundamental concern when relating cause and effect in any forensic
investigation. Reliance on computer generated timestamps for
correlating events is complicated by uncertainty as to clock skew
and drift, environmental factors such as location and local time zone
offsets, as well as human factors such as clock tampering.
Establishing that a particular computer’s temporal behavior was
consistent during its operation remains a challenge. The
contributions of this paper are both a description of assumptions
commonly made regarding the behavior of clocks in computers, and
empirical results demonstrating that real world behavior diverges
from the idealized or assumed behavior. The authors present an
approach for inferring the temporal behavior of a particular
computer over a range of time by correlating commonly available
local machine timestamps with another source of timestamps. They
show that a general characterization of the passage of time may be
inferred from an analysis of commonly available browser records.
Outline
 Introduction
 Factors to consider
 Drifting clocks
 Identifying computer timescales by correlation with
corroborating sources
 Directions
Introduction
 Timestamps are increasingly used to relate events which happen in the
digital realm to each other and to events which happen in the physical realm,
helping to establish cause and effect.
 A difficulty with timestamps is how to interpret and relate the timestamps
generated by separate computer clocks when they are not synchronized.
 Current approaches to inferring the real-world interpretation of timestamps
assume idealized models of computer clocks.
 There is uncertainty about the behavior of a suspect computer's clock before seizure.
 The authors explore two themes related to this uncertainty:
- They investigate whether it is reasonable to assume uniform behavior of
computer clocks over time, and test these assumptions by attempting to
characterize how computer clocks behave in the wild.
- They investigate the feasibility of automatically identifying the local time on a
computer by correlating timestamps embedded in digital evidence with
corroborative time sources.
Factors
 Computer timekeeping
 Real-time synchronization
 Factors affecting timekeeping accuracy
- Clock configuration
- Tampering
- Synchronization protocol
- Misinterpretation
 Usage of timestamps in forensics
Drifting clocks behavior
 The authors enumerate the main factors influencing the temporal behavior
of a computer's clock, and then attempt to experimentally validate
whether one can make informed assumptions about such behavior.
 They do this by empirically studying the temporal behavior of a
network of computers found in the wild.
 The subject of the case study is a network of machines in active use by
a small business: a Windows 2000 domain consisting of one
Windows 2000 server and a mixed number of Windows XP and
2000 workstations.
 The goal here is to observe the temporal behavior. In order to
observe this behavior, the authors constructed a simple service
that logs both the system time of a host computer and the civil time
for the location.
 The program samples both sources of time and logs the results to a
file. The logging program was deployed on all workstations and the
server (a minimal logging sketch follows).
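A minimal sketch of such a logging service; the reference ("civil time") source is left as a caller-supplied callback because the paper's actual program and reference source are not described here:

import time
from datetime import datetime, timezone

def log_clock_samples(reference_time, logfile, interval_s=60.0, samples=10):
    """Periodically record the host's system time next to a reference time.
    reference_time is a callable returning a timezone-aware datetime
    (a hypothetical stand-in for the civil-time source)."""
    with open(logfile, "a") as out:
        for _ in range(samples):
            system_now = datetime.now(timezone.utc)
            ref_now = reference_time()
            skew = (system_now - ref_now).total_seconds()
            out.write(f"{system_now.isoformat()}\t{ref_now.isoformat()}\t{skew:+.3f}\n")
            time.sleep(interval_s)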
Correlation
 An automated approach correlates time-stamped events found
on a suspect computer with time-stamped events from a more
reliable, corroborating source.
 Web browser records are increasingly employed as evidence in
investigations, and are a rich source of time stamped data.
 Techniques implemented are the click-stream correlation algorithm
and the non-cached correlation algorithm.
 The authors compare the results of both algorithms (a rough
offset-estimation sketch follows).
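A rough sketch of the correlation idea, assuming event pairs have already been matched: each pair holds the suspect machine's local timestamp and the corroborating source's timestamp for the same event. The median-offset estimate is an illustration, not the paper's click-stream or non-cached algorithm:

from statistics import median

def estimate_clock_offset(event_pairs):
    """event_pairs: iterable of (local_ts, corroborated_ts) datetime pairs
    for the same events. Returns the median offset in seconds of the
    local clock relative to the corroborating source."""
    offsets = [(local - ref).total_seconds() for local, ref in event_pairs]
    return median(offsets)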
Directions
 Need to determine whether the conditions and the
assumptions of the experiments are realistic
 What are the most appropriate correlation algorithms?
 Need to integrate with clock synchronization algorithms