COE589
Paper Presentation
The Evolution of File Carving
Presenters:
Muhammad Mohsin Butt(g201103010)
Contents
• Introduction
• Background
• Traditional Recovery
• File Carving
• Smart Carver
• Conclusion
Introduction
• This survey presents various file carving techniques.
• File carving is a forensic technique that recovers data
based on file structure and content.
• No file system metadata is required.
• The paper focuses on file carving techniques
for fragmented data.
Background
• File System
• The part of the OS that manages creation, deletion,
allocation, and other operations on files.
• FAT32 and NTFS are the most widely used file systems on
Windows.
• The basic unit of data storage on disk is the cluster.
• Cluster sizes are usually multiples of 512 bytes.
Background
• Recovery in FAT32 (File Allocation Table)
• Files can be allocated in different ways.
• Contiguous Allocation.
• Linked Allocation.
• Indexed Allocation.
Background
• Contiguous Allocation.
• Linked Allocation.
Background
• Indexed Allocation
Traditional Recovery Techniques
• These recovery techniques use the file system metadata to
recover data.
• Data Storage in FAT32
Traditional Recovery Techniques
• Deletion and Recovery in FAT32.
File Carving
• What if we don’t have file system metadata?
• File carving recovers data without using file system
information.
• Knowledge of the structure of the files to be recovered is
used instead.
• File carving can be divided into two categories:
• File carving for non-fragmented data.
• File carving for fragmented data.
File Carving (First Generation)
• Performs well on non-fragmented data.
• In forensics, user data (images, documents, etc.) is
the important data to recover.
• The search pool is reduced by removing operating
system files, which are detected using their MD5 hashes
and keywords.
• Byte sequences at prescribed offsets are used to
identify files.
File Carving (First Generation)
• Header and footer information of the files to be recovered
is used.
• A JPEG header cluster begins with the sequence FFD8.
• A JPEG footer cluster contains the sequence FFD9.
• Some file types don’t have a footer.
• A BMP header stores the file size, from which the number
of clusters can be derived, along with other information.
• The number of unallocated clusters indicated by the
BMP header is merged for recovery.
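The two first-generation strategies above can be sketched as follows. This is a minimal illustration, not the code of any particular tool: the function names `carve_jpegs` and `bmp_cluster_count`, the 512-byte default cluster size, and the assumption that headers sit at cluster boundaries are all choices made for the example.

```python
import struct

JPEG_HEADER = b"\xff\xd8"  # start-of-image marker
JPEG_FOOTER = b"\xff\xd9"  # end-of-image marker


def carve_jpegs(image: bytes, cluster_size: int = 512) -> list:
    """Header-to-footer carving: assumes each file is stored contiguously."""
    carved = []
    for offset in range(0, len(image), cluster_size):
        # Only cluster-aligned headers are considered (an assumption).
        if image.startswith(JPEG_HEADER, offset):
            end = image.find(JPEG_FOOTER, offset + len(JPEG_HEADER))
            if end != -1:
                carved.append(image[offset:end + len(JPEG_FOOTER)])
    return carved


def bmp_cluster_count(header_cluster: bytes, cluster_size: int = 512) -> int:
    """Header-plus-size carving: derive the cluster count from the BMP size field."""
    if not header_cluster.startswith(b"BM"):
        raise ValueError("not a BMP header cluster")
    # The BMP file size is a little-endian 32-bit integer at byte offset 2.
    (file_size,) = struct.unpack_from("<I", header_cluster, 2)
    return -(-file_size // cluster_size)  # ceiling division
```

Both functions silently produce wrong output when the file is fragmented, which is the weakness the later slides address.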
File Carving (First Generation)
• Foremost implements both header-to-footer
carving and carving based on the header and the file size
it records.
• Scalpel, built on the Foremost engine, improves the
performance and memory usage of these carving
techniques.
• Both perform poorly when the
data is fragmented.
Fragmentation
• As files are edited, modified, and deleted, most hard
drives become fragmented.
• Fragmentation also depends on the allocation strategy of the file
system.
• Fragmentation is high in forensically important files such as
email stores and Word documents. Why?
• Because of constant editing, deletion, and addition, PST
files are the most fragmented.
• Wear-leveling algorithms in solid-state drives
(SSDs) also cause fragmentation.
Fragmentation
Fragmented File Recovery
Graph Theoretic Carvers.
• Recover fragmented files.
• Recovery is formulated as a Hamiltonian path
problem.
• Solved using alpha-beta heuristics.
Hamiltonian Path Problem.
• Given: a set of clusters.
• Goal: find the permutation of these clusters that reconstructs the
original file.
• Identify pairs of clusters that are adjacent in the original document.
• Assign each pair of clusters a weight that represents the
likelihood of one cluster following the other in the original
file.
• The best permutation is the one that maximizes the
sum of the candidate weights of adjacent clusters.
Hamiltonian Path Problem.
• The problem is formulated as a graph.
• Vertices represent clusters.
• Edge weights represent the likelihood that one cluster follows another.
• The problem reduces to finding a maximum-weight
Hamiltonian path in this graph.
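To make the formulation concrete, here is a brute-force sketch that tries every ordering of a small cluster set and keeps the one with the largest total weight. It assumes the header cluster is known and comes first, and that `weights[a][b]` scores how likely cluster b follows cluster a; the function name `best_ordering` is made up for the example. Exhaustive search is only feasible for a handful of clusters, which is why the slides mention heuristics.

```python
from itertools import permutations


def best_ordering(weights, header=0):
    """Return the maximum-weight Hamiltonian path that starts at the header."""
    n = len(weights)
    others = [c for c in range(n) if c != header]
    best_path, best_score = None, float("-inf")
    for perm in permutations(others):
        path = (header, *perm)
        # Sum the weights of consecutive cluster pairs along the path.
        score = sum(weights[a][b] for a, b in zip(path, path[1:]))
        if score > best_score:
            best_path, best_score = path, score
    return list(best_path), best_score


# Toy example: the true order is 0 -> 2 -> 1 -> 3.
w = [[1] * 4 for _ in range(4)]
w[0][2], w[2][1], w[1][3] = 9, 8, 7
print(best_ordering(w))
```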
Assigning Weights
• Weight assignment is the key step in this type of carving.
• The Prediction by Partial Matching (PPM) technique is
used to assign weights.
• PPM works well for text.
Assigning Weights
• Weight Assignment in Images
K-Vertex Disjoint Path Problem.
• The Hamiltonian path method assumes that all clusters
belong to the same file.
• In real systems, the clusters of multiple fragmented files are
interleaved.
• Headers of the various files are identified in the pool of
clusters.
• A graph is again formed using weights.
• k vertex-disjoint paths are then found in this graph using
various algorithms, where k is the number of
headers found in the previous step.
• Developed primarily for recovering images.
K-Vertex Disjoint Path Problem.
• Various algorithms exist for finding k disjoint paths.
• Unique Path (UP) algorithms provide the best
performance.
• Each cluster is assigned to only one file.
• An incorrect assignment may result in two files being incorrectly
recovered.
• Parallel Unique Path algorithm.
• Shortest Path First algorithm.
Parallel Unique Path (PUP).
• A variation of Dijkstra’s single-source shortest-path
algorithm.
1. Start with k headers and a pool of clusters.
2. Find the best cluster match for each of the headers.
3. From the matches found in the previous step, take the
best one and assign it to its header.
4. Remove the chosen cluster from the pool of available
clusters.
5. Find the best match for the newly assigned cluster and
repeat from step 3 until all files are recovered.
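The steps above can be sketched as below. Several details are assumptions made for the example, not taken from the slides: `cost[a][b]` is a precomputed matching cost where lower means a better match, a file counts as recovered once a cluster from the `footers` set is appended to it, and ties are broken by cluster number.

```python
def parallel_unique_path(headers, pool, cost, footers):
    """Grow all files in parallel, always extending the globally best match."""
    paths = {h: [h] for h in headers}
    available = set(pool)
    active = {h for h in headers if h not in footers}
    while active and available:
        best = None
        for h in active:
            tail = paths[h][-1]
            # Best remaining cluster for the current end of this file.
            c = min(available, key=lambda x: (cost[tail][x], x))
            candidate = (cost[tail][c], h, c)
            if best is None or candidate < best:
                best = candidate
        _, h, c = best
        paths[h].append(c)   # assign the overall best match to its file
        available.remove(c)  # each cluster belongs to exactly one file
        if c in footers:
            active.remove(h)
    return paths


# Two interleaved files: 0 -> 2 -> 4 and 1 -> 3 -> 5.
cost = [[9] * 6 for _ in range(6)]
cost[0][2], cost[2][4], cost[1][3], cost[3][5] = 1, 2, 1, 2
print(parallel_unique_path([0, 1], {2, 3, 4, 5}, cost, {4, 5}))
```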
Parallel Unique Path (PUP).
Shortest Path First
• This algorithm is based on the idea that the best recoveries have
the lowest average path costs.
• The average path cost is the sum of the weights between
the clusters of a recovered file divided by the number of
clusters.
• Take one image at a time.
• Reconstruct the image.
• After reconstruction, the clusters used are not removed from the
cluster pool.
• Repeat this process for all the images.
• Of all the recovered images, the one with the lowest average path cost is
assumed to be the best recovery.
• The clusters associated with the best recovery are then removed.
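A rough sketch of this procedure follows. It assumes that each image is reconstructed greedily from its header, that `cost[a][b]` is a precomputed matching cost (lower is better), that a reconstruction stops at a cluster in the `footers` set, and that the whole round is repeated for the remaining images after the best recovery is removed. The helper names are made up for the example.

```python
def greedy_path(header, available, cost, footers):
    """Reconstruct one image greedily; return (path, average path cost)."""
    path, total = [header], 0
    free = set(available)
    while free and path[-1] not in footers:
        tail = path[-1]
        nxt = min(free, key=lambda c: (cost[tail][c], c))
        total += cost[tail][nxt]
        path.append(nxt)
        free.remove(nxt)
    return path, total / len(path)


def shortest_path_first(headers, pool, cost, footers):
    available = set(pool)
    remaining = set(headers)
    recovered = {}
    while remaining:
        # Reconstruct every remaining image against the same, unmodified pool.
        candidates = {h: greedy_path(h, available, cost, footers) for h in remaining}
        # Keep only the recovery with the lowest average path cost.
        best = min(candidates, key=lambda h: (candidates[h][1], h))
        recovered[best] = candidates[best][0]
        available -= set(recovered[best])  # its clusters leave the pool
        remaining.remove(best)
    return recovered
```

Because every remaining image is reconstructed again in each round, this does more work than PUP, which matches the speed difference reported on the results slide.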
Results
• Shortest Path First achieves an accuracy of 88%.
• PUP achieves an accuracy of 83% but is faster.
• Both require edge weights to be precomputed.
• For large hard drives, having to compute weights
for the likelihood between every pair of clusters is a major
drawback.
BiFragment Gap Carving
• Most fragmented files in real-world data are split into
exactly two fragments (bi-fragmented).
• The technique works for files with a known header and
footer.
• Files must be decodable or verifiable through their
structure.
• It works by testing combinations of the clusters between the
identified header and footer.
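The search can be sketched as follows: take every cluster from the header to the footer, then try removing each possible contiguous gap until the remainder validates. The `is_valid` callback stands in for a real decoder or structure check and is purely hypothetical here, as is the function name.

```python
from typing import Callable, List, Optional


def bifragment_gap_carve(
    clusters: List[bytes],
    header_idx: int,
    footer_idx: int,
    is_valid: Callable[[bytes], bool],
) -> Optional[bytes]:
    """Try every gap between header and footer; return the first valid file."""
    span = clusters[header_idx:footer_idx + 1]
    n = len(span)
    # No gap at all: the file may simply be contiguous.
    whole = b"".join(span)
    if is_valid(whole):
        return whole
    # The header (index 0) and footer (index n - 1) clusters are always kept.
    for gap_len in range(1, n - 1):
        for gap_start in range(1, n - gap_len):
            candidate = b"".join(span[:gap_start] + span[gap_start + gap_len:])
            if is_valid(candidate):
                return candidate
    return None


clusters = [b"HEAD", b"junk", b"JUNK", b"body", b"FOOT"]
print(bifragment_gap_carve(clusters, 0, 4, lambda d: d == b"HEADbodyFOOT"))
```

The number of candidates grows quadratically with the distance between header and footer, so the validation step dominates the running time.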
BiFragment Gap Carving
Smart Carver
• Works on both fragmented and non-fragmented data.
• Supports a wide variety of file types.
• Preprocessing
• Data clusters are decrypted or decompressed.
• Collating
• Clusters are classified by file type.
• Reassembly
Smart Carver (PreProcessing)
• Compressed and encrypted drives are
decompressed/decrypted in this stage.
• Known clusters are removed from the disk based on file
system metadata.
• This increases speed and reduces the amount of
data for the next phases.
• Allocated files and operating-system-specific data can
be pruned, since they are of no forensic interest.
Smart Carver (Collating)
• Classifies disk clusters as belonging to particular file
types.
• Reduces the cluster pool used when recovering files of each
type.
• Keyword/pattern matching
• Looks for byte sequences that reveal the type of a cluster.
• E.g., <html> tags in a cluster collate it as part of an HTML file.
• ASCII character frequency
• A high frequency of ASCII characters indicates that the data is not video
or image.
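A toy collator combining the two heuristics above might look like this. The 0.9 threshold, the category names, and the function names are arbitrary choices for the illustration, not values from the survey.

```python
def printable_ascii_ratio(cluster: bytes) -> float:
    """Fraction of bytes that are printable ASCII, tab, LF, or CR."""
    if not cluster:
        return 0.0
    printable = sum(1 for b in cluster if 32 <= b <= 126 or b in (9, 10, 13))
    return printable / len(cluster)


def collate_cluster(cluster: bytes, text_threshold: float = 0.9) -> str:
    # Keyword match: an <html tag is strong evidence for an HTML file.
    if b"<html" in cluster.lower():
        return "html"
    # ASCII frequency: mostly printable bytes suggest text, not image or video.
    if printable_ascii_ratio(cluster) >= text_threshold:
        return "text"
    return "binary"
```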
Smart Carver (Collating)
• File fingerprints
• Use the byte frequency distribution (BFD) to determine
the type of a file.
• A BFD is generated by building a histogram of the byte values in the file.
• A centroid model for each file type is created from the
mean and standard deviation of the frequency of each byte value.
• These models still have trouble differentiating JPEG from ZIP.
• Still an active research topic.
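The fingerprint idea can be sketched as below. The histogram and the per-byte mean and standard deviation follow the bullets above; the distance measure (absolute deviation scaled by the standard deviation plus a small `eps`) is an assumption chosen for simplicity and not necessarily what the surveyed classifiers use.

```python
from statistics import mean, pstdev


def byte_frequency_distribution(data: bytes) -> list:
    """Normalized histogram of the 256 possible byte values."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1
    return [c / total for c in counts]


def build_centroid(samples):
    """Per-byte mean and standard deviation over sample files of one type."""
    bfds = [byte_frequency_distribution(s) for s in samples]
    means = [mean(col) for col in zip(*bfds)]
    stds = [pstdev(col) for col in zip(*bfds)]
    return means, stds


def distance(bfd, centroid, eps=1e-3):
    means, stds = centroid
    # Deviations count for less where the file type itself varies a lot.
    return sum(abs(x - m) / (s + eps) for x, m, s in zip(bfd, means, stds))


def classify(data: bytes, centroids: dict) -> str:
    bfd = byte_frequency_distribution(data)
    return min(centroids, key=lambda name: distance(bfd, centroids[name]))
```

Compressed formats such as JPEG and ZIP both have nearly flat histograms, so their centroids end up close together, which is consistent with the difficulty noted above.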
Smart Carver (ReAssembly)
• Reassembly can be done by:
• Finding the starting fragment of a file, which contains the
header.
• Merging clusters belonging to the same fragment.
• Finding the fragmentation point, i.e. the last cluster in
the current fragment.
• Finding the starting point of the next fragment.
• Finding the ending point of the last fragment, i.e. the last cluster, containing the
footer.
Smart Carver (ReAssembly)
• Clusters can be merged in two ways.
• Keyword/dictionary
•Applies when a word is formed across the boundary between two
clusters.
•E.g., one cluster ends with “he” and the next starts with “llo
World”. The two can be merged.
• File structure
•File structure can help in merging. Length fields in
headers indicate the length of the data. E.g., in a PNG file, if a
length value is k, then the CRC of the associated data follows
k bytes of data. If the data in between produces the same
CRC, then all clusters in between can be merged.
Otherwise fragmentation is present.
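The PNG check can be sketched with the standard library. A PNG chunk consists of a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC-32 computed over the type and data. The function name `chunk_is_intact` is made up for the example; if it returns True, every cluster spanned by the chunk can be merged.

```python
import struct
import zlib


def chunk_is_intact(data: bytes, offset: int) -> bool:
    """Check the stored CRC of the PNG chunk that starts at `offset`."""
    if offset + 8 > len(data):
        return False
    (length,) = struct.unpack_from(">I", data, offset)
    body_end = offset + 8 + length  # end of the type + data fields
    if body_end + 4 > len(data):
        return False
    (stored_crc,) = struct.unpack_from(">I", data, body_end)
    # The CRC covers the chunk type and chunk data, not the length field.
    return zlib.crc32(data[offset + 4:body_end]) == stored_crc


payload = b"pixels" * 10
chunk = (struct.pack(">I", len(payload)) + b"IDAT" + payload
         + struct.pack(">I", zlib.crc32(b"IDAT" + payload)))
print(chunk_is_intact(chunk, 0))
```

A mismatch means that at least one cluster between the length field and the expected CRC position belongs to another file.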
Smart Carver (ReAssembly)
• Sequential Hypothesis Testing Parallel Unique Path (SHT-PUP)
algorithm for reassembly.
•A modification of the PUP algorithm.
•As in PUP, the best match is found for each of the k
available headers, and the best of these is selected.
•The clusters immediately following the newly found
cluster are then tested using sequential hypothesis testing
until a fragmentation point is reached.
Smart Carver (ReAssembly)
• Sequential Hypothesis Testing.
•This is done using the weight vector, i.e. the weights of
all clusters in the pool.
•Two hypotheses are tested.
• One says that the clusters belong in sequence to the fragment.
• The other says that they don’t.
•The likelihood ratio of the observed weights under the two hypotheses
is used to decide between them.
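The exact ratio from the slide is not reproduced here. As a generic illustration of sequential hypothesis testing, the sketch below implements Wald's sequential probability ratio test: it accumulates the log-likelihood ratio one observation at a time and stops as soon as a threshold is crossed. The likelihood functions, the error rates `alpha` and `beta`, and the label strings are all assumptions for the example.

```python
import math


def sequential_test(observations, p_in_sequence, p_not_in_sequence,
                    alpha=0.05, beta=0.05):
    """Return (decision, number of observations used)."""
    upper = math.log((1 - beta) / alpha)  # accept "in_sequence" at or above this
    lower = math.log(beta / (1 - alpha))  # accept "not_in_sequence" at or below this
    llr = 0.0
    for n, y in enumerate(observations, start=1):
        llr += math.log(p_in_sequence(y) / p_not_in_sequence(y))
        if llr >= upper:
            return "in_sequence", n
        if llr <= lower:
            return "not_in_sequence", n
    return "undecided", len(observations)


# Hypothetical likelihoods: low matching costs are typical of true neighbors.
p_seq = lambda w: 0.9 if w < 0.5 else 0.1
p_not = lambda w: 0.2 if w < 0.5 else 0.8
print(sequential_test([0.1, 0.2, 0.1], p_seq, p_not))
```

In a carver, the test would be run on the clusters that follow a newly matched cluster, and the point where "not_in_sequence" is accepted would be treated as the fragmentation point.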
Conclusion
• The survey presents various file carving methods for
fragmented files.
• Finding the best weight function is still an open research
issue.