Hashing Concepts
CSC 485/585

Objectives
Define Hashing and Hash Values.
Explain the common uses of Hashes within the field of Computer Forensics:
  Data Authentication
  Data Reduction
  File Identification
Explain the limitations of Hashes.

What is a Hash Function?
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum.
The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.

Cryptographic Hash Functions
A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an accidental or intentional change to the data will, with overwhelming probability, change the hash value.
The data to be encoded is often called the “message”, and the hash value is sometimes called the message digest or simply digest.
The ideal cryptographic hash function has three main properties:
  it is infeasible to find a message that has a given hash,
  it is infeasible to modify a message without changing its hash,
  it is infeasible to find two different messages with the same hash.
MD5 and SHA-1 are the most commonly used cryptographic hash functions (a.k.a. algorithms) in the field of Computer Forensics.

MD5
MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value.
The 128-bit MD5 hashes (also termed message digests) are typically written as 32 hexadecimal digits (16 bytes).
The following demonstrates a 40-byte ASCII input and the corresponding MD5 hash:
  MD5 of “This is an example of an MD5 Hash Value.” = 3413EE4F01F2A0AA17664088E79CF5C2
Even a small change in the message will result in a completely different hash. For example, changing the period at the end of the sentence to an exclamation mark:
  MD5 of “This is an example of an MD5 Hash Value!” = B872D23A7D14B6EE3B390A58C17F21A8

SHA-1
SHA stands for Secure Hash Algorithm.
SHA-1 produces a 160-bit digest from a message; the digest is typically written as 40 hexadecimal digits (20 bytes).
Just like MD5, even a small change in a message will result in a completely different hash. The following is an example of SHA-1 digests:
  SHA1 of “This is a test.” = AFA6C8B3A2FAE95785DC7D9685A57835D703AC88
  SHA1 of “This is a pest.” = FE43FFB3C844CC93093922D1AAC44A39298CAE11

Statistics
The MD5 hash algorithm: for two different files that were not deliberately constructed to collide, the chance of both having the same MD5 hash value is about 1 in 2 to the 128th power (2^128 = 3.4028236692093846346337460743177e+38), or 1 in 340 billion billion billion billion.
The SHA-1 hash algorithm: under the same assumption, the chance of two files having the same SHA-1 hash value is about 1 in 2 to the 160th power (2^160 = 1.4615016373309029182036848327163e+48), or 1 in... a REALLY big number!

What do CF Examiners use Hashes for?
Data Authentication: to prove two things are the same.
Data Reduction: to exclude many “known” files from the hundreds of thousands of files you have to look at.
File Identification: to find a needle in a haystack.

Data Authentication
One of the most important issues a computer forensic examiner faces is ensuring the ability to “authenticate” digital evidence.
This is done via Chain of Custody, Documentation, and Hash values.
Using MD5 or SHA-1 hashing tools, an examiner should be able to verify that data has not changed.
A hash of the acquired data must be identical to a hash of the original evidence.

Data Authentication
Calculating a “hash value” for any block of data (e.g. a file, an entire disk, a partition, etc.) can be accomplished as a stand-alone task or simultaneously with the acquisition process (most tools support this).
Calculating the “hash value” of an entire disk is done by reading all data on the disk, running it through the desired algorithm, and generating a hash of all data read.
The examiner then typically documents the resulting hash value.
The resulting “hash value” is a hash of the data READ from the disk, not necessarily a hash of the data WRITTEN to your target disk during the acquisition process.
Input/Output errors and bad sector errors encountered during the acquisition process will affect the resulting hash value.
An examiner should run a verification process after acquisition to ensure that the original hash value calculated while reading the original data matches the hash value of the data written out to the target disk.

FTK Imager (Hashes calculated without acquiring drive)

WinHex Specialist (Hash calculated without acquiring drive)

Linux – md5sum & sha1sum (using Helix3-2009R1)

FTK Imager (Hashes calculated (and verified) as part of acquisition process)

Data Authentication
Considerations:
  Drives will start to fail as they get older, resulting in “bad sectors”.
  Bad sectors = inability to obtain matching hash values when comparing a hash of the original disk to the hash of a forensic image of the data read from the disk.
  The more often a disk is spun up, the greater the chance of disk failure(s).
  To calculate a hash value of a drive, you must read all data on the disk.
  To acquire a forensic image, you must read all data on the disk.
  If your imaging tool does not simultaneously capture a hash value as part of the data acquisition process, consider whether the risk of doubling the spin-up time to obtain a pre-acquisition hash value is appropriate, given that your primary objective is to obtain the data.

Data Authentication
In the previous slides, we looked at hashing an entire drive.
Using hashes, an examiner can also verify that a specific file or any block of data has not changed.
Hash individual file(s) with FTK Imager, WinHex, md5summer, and many other hashing tools.

Data Authentication
Note that although these graphic files look identical, a single modified byte will result in hash values that do not match.
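
The verification idea can be sketched in a few lines of Python using the standard hashlib module. This is only an illustration under stated assumptions, not how FTK Imager, WinHex, or md5sum are implemented: the helper names hash_stream and demo, the file names, and the 1 MiB chunk size are all made up for the example. It reads a file once in chunks, computes MD5 and SHA-1 together, confirms that an exact copy verifies, and then shows that overwriting a single byte makes the hashes differ.

```python
import hashlib
import os
import shutil
import tempfile


def hash_stream(path, algorithms=("md5", "sha1"), chunk_size=1024 * 1024):
    """Read `path` once in chunks and return {algorithm: hex digest}.

    Hypothetical helper for illustration; chunked reading keeps memory use
    constant even when the input is a large disk image.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as source:
        for chunk in iter(lambda: source.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest().upper() for name, hasher in hashers.items()}


def demo():
    """Hash a stand-in 'original', verify a copy, then alter one byte."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "original.bin")  # stand-in for evidence
        image = os.path.join(workdir, "image.bin")        # stand-in for the acquired copy
        with open(original, "wb") as handle:
            handle.write(bytes(range(256)) * 16)          # 4096 bytes of sample data
        shutil.copyfile(original, image)

        original_hashes = hash_stream(original)
        verified = hash_stream(image) == original_hashes  # exact copy: hashes match

        with open(image, "r+b") as handle:                # overwrite a single byte
            handle.seek(100)
            handle.write(b"\xff")
        changed = hash_stream(image) != original_hashes   # one byte: hashes differ
    return verified, changed, original_hashes


if __name__ == "__main__":
    print(demo())
```

The same comparison applies whether the input is a single file or a raw image of an entire disk; only the amount of data read changes.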

Data Authentication
When hashing individual files:
  Changing filename or extension does NOT change hash value.
  Changing Modified, Accessed, Created dates does NOT change hash value.
  Changing file system attributes (read-only, hidden, system, etc.) does NOT change hash value.
  Changing ANYTHING within the file contents DOES change the hash value of the file.
  For files like MS Word documents that contain “Metadata”, changes within the Metadata DO change the contents of the file and therefore change the hash value of the file. For example, if you opened an MS Word document, made no changes to the contents of the file and just re-saved the file, MS Word would update the dates saved within the Metadata, so the actual raw content of the overall Word document would change and therefore generate a different hash value.
  Cropping a graphic, changing the resolution, saving as another graphic format (BMP to JPEG), or any other change that may not necessarily change the visual depiction of the picture, WILL change the raw contents of the file and therefore will change the hash value of the file.

Data Authentication
NOTE: Although we just told you that changing a filename or other “non-content” of a file does not change the hash value of the file, such a “non-content” change DOES make a change to the FAT directory entry, MFT entry, or other file system component that holds the filename, MAC dates, attributes, etc. and therefore DOES change the data on the file system that holds the file in question.
Therefore a change of a filename, MAC date, file attribute, etc. DOES NOT change the hash value of the file, but it DOES change the hash value of the disk on which the file is stored.

Data Reduction
As the storage capacity of disks grows, so does the number of files a computer forensic examiner must examine.
A typical hard drive containing a Windows installation, software applications, user files, temporary Internet files, music downloads, etc.
will contain well over a hundred thousand files.
Large databases containing hash values of “known” files can be used by a forensic examiner to reduce the number of files he or she must analyze.
Files that are known to be part of the operating system and/or installed software applications are unlikely to contain evidence.
By excluding all known operating system files and files from known software applications, an examiner is left with only user-created files to review for potential evidence.

Data Reduction
Using forensic software tools, an examiner calculates the hash value of all files on a disk.
Then the examiner uses the software tool to compare the calculated hash values against all of the hash values within a known hash database to identify any matching hash values.
The examiner can then exclude from view any files with hash values matching those in the database.
The examiner can also exclude from view any files that are duplicates of each other according to their hash values, further reducing the number of files in view.
This process, called “Data Reduction”, can save the examiner from analyzing many thousands of unnecessary files.

Data Reduction
Hash Databases:
  National Software Reference Library (NSRL), Reference Data Sets (RDS) (NIST)
  HashKeeper (LE, Military and Government only) (NDIC)
  Known File Filter (KFF) (AccessData, Inc.)
  Self-generated or shared databases

NIST NSRL (RDS)

Forensic Tool Kit - KFF

File Identification
Quickly identifying a specific “notable” file or files amongst the hundreds of thousands of files on a disk can also be accomplished by use of hash databases... finding the needle in the haystack!
Instead of using a database of known “ignorable” files such as OS files, databases containing hash values of known “notable” files can be utilized.
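
The lookup behind both data reduction and file identification can be sketched as a set-membership test on hash values. The Python below is a minimal illustration, not the behavior of FTK's KFF or any other tool: the function names md5_hex and triage, the file names, and the file contents are invented for the example, and a real examination would load its hash sets from a database such as the NSRL RDS rather than compute them inline.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as 32 uppercase hexadecimal digits."""
    return hashlib.md5(data).hexdigest().upper()


def triage(files, known_hashes, notable_hashes):
    """Split files into (flagged, ignored, review) lists by hash lookup.

    files          -- mapping of file name to raw bytes (stand-in for a disk)
    known_hashes   -- hashes of "ignorable" files, e.g. OS and application files
    notable_hashes -- hashes of "notable" files the examiner wants to find
    """
    flagged, ignored, review = [], [], []
    seen = set()  # hashes already encountered, used to drop duplicates
    for name, content in files.items():
        digest = md5_hex(content)
        if digest in notable_hashes:
            flagged.append(name)   # notable: always bring to the examiner's attention
        elif digest in known_hashes or digest in seen:
            ignored.append(name)   # known file, or duplicate of an earlier file
        else:
            review.append(name)    # unknown file: left in view for analysis
        seen.add(digest)
    return flagged, ignored, review


# Made-up sample data standing in for files found on a disk.
sample_files = {
    "kernel32.dll": b"operating system file bytes",
    "notes.txt": b"user notes",
    "notes_copy.txt": b"user notes",
    "tool.exe": b"hacker tool bytes",
}
known = {md5_hex(b"operating system file bytes")}
notable = {md5_hex(b"hacker tool bytes")}

flagged, ignored, review = triage(sample_files, known, notable)
print(flagged, ignored, review)
```

Checking the notable set first is a design choice made here so that a notable file is never hidden by the known or duplicate filters.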
Examples of common “notable” files are:
  Child Pornography and other contraband images
  Hacker Tools
  Viruses, Trojans and other Malware
The examiner can search by hash value and flag any files with hash values matching those in the “notable” database.

Limitations
A mismatched hash value only tells you something changed, not what changed!
When using MD5, SHA-1 or other standard cryptographic hashes to identify known files, only EXACT matches will result in success.
When files are slightly modified, standard hashing will not identify similar files.
“Fuzzy Hashing”, as implemented in the tool ssdeep, uses context triggered piecewise hashes to identify files that have similar pieces but may not be entirely identical.
Hash “collisions” have been discovered for MD5 and SHA-1, and some argue that stronger (more collision-resistant) hash algorithms should be used in computer forensics.

Questions ???
...as usual, use the discussion board!