Files
Chapter 4

What is a File?
A computer file is a block of arbitrary information, or resource for storing information, which is available to a computer program and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished. Computer files can be considered as the modern counterpart of paper documents which traditionally are kept in offices' and libraries' files, and this is the source of the term. --Wikipedia
• Questions:
  – How are files stored?
  – How do we retrieve them?
Definition obtained from: http://en.wikipedia.org/wiki/Computer_file

How are files stored?
• At a low level, files are stored as a sequence of bytes on a hard drive (or other storage media)
• However, hard drives don’t understand “files”
  – Hard drives simply contain a huge collection of bytes of data
• We need some way to retrieve the bytes composing our file from the hard drive
  – That’s where filesystems come in

Human vs. Hard Drive View of a File
• Left Side: Hard Drive View
  – Actually, data would be further converted from hex to binary (1 and 0)
• Right Side: Human View
  – Data is converted into human-readable characters

Files
HEX is useful when attempting to view a file that is partially deleted. This leads us to two questions:
1. Why would a partially deleted file have difficulties being opened or viewed normally?
2. What parts of a file does a HEX editor allow us to see, which otherwise would not be visible?

Files, File Structures, and File Formats
• To answer the questions on the previous slide, we need to investigate the basics of a file, file structure, and file format.
• A partially deleted file in many cases may be missing part of its formatting data, the data that identifies the file.
• It is the formatting data that identifies the file to its parent or native software.
• If a file doesn’t contain the formatting information, the software or operating system will most likely not be able to access or execute the file.
• It is this formatting information that uniquely identifies a file.

Different Formats
• There are hundreds of different formats for data.
• There are also formats for executable programs on different platforms (Windows, Linux, Mac, Unix, etc.).
• Each format defines how the sequence of bits and bytes is laid out, with ASCII-based text files being one of the simplest formats for humans to decipher.

Other Formats
• Some file formats are designed to store very particular sorts of data:
  – JPEG format – designed to store photo images.
  – GIF format – designed for both images and animation.
  – QuickTime format – can act as a container for many different types of multimedia.

Text File Formats
• A text file is simply one that stores text.
  – It uses an encoding such as ASCII or UTF-8, with few if any control characters.
  – Other file formats, such as HTML or the source code of a particular programming language, are in fact also text files, but follow stricter rules for specific purposes.
• Parent program – the program or software that is used to create, execute, or otherwise access the file.
• In most cases a file will contain data, its file signature, from which its parent software will be able to identify the file and handle its operation.

File Signatures
• File Signature – contained in the file header.
• File Header – not seen by the user of the software, but very important for the file to function as designed.
  – It is the data contained within the file header that is used to identify the format of the file.
• File Headers – may also contain data regarding the integrity of the file as well as information about the file and its contents. This data is often referred to as metadata.

File Format Structures
• There is no one specific file format structure that fits all file types.
• File formats will vary, as will file content.
• The contents of an image, as well as its format, for example, will be different from the contents and format of a word processing document.

File Extensions
• File formats are easily identified by file extensions.
• The Windows operating system uses file extensions to bind an application to a specific file type.
  – Example: Windows binds Adobe Reader to the .PDF file extension, and MS Word to the .DOC or .DOCX file extension.
• Windows relies on file extensions in this way; without an extension, Windows would not know how to open, process, or handle a file.

Question: What would occur if the file extension of an executable (.EXE) file was changed to that of an Adobe file extension (.PDF)?
ANSWER: Windows would look at the file extension and see that it’s a .PDF; it would therefore hand that file over to Adobe Reader to open. Adobe Reader would attempt to open the file and report an error since the file, regardless of its name, is not actually a PDF file.

Registry
• Windows stores this application binding information in a section of the operating system (OS) called the registry.
• Each file type has a corresponding file extension; this correlation, stored within the registry, tells the OS what type of program is needed to access a certain file type. This is Windows’ way of matching the many different types of files to their corresponding software.

OS
• When the OS identifies an extension, say .CSV (Comma Separated Values), it looks to the registry and finds which application is bound to this extension. In most cases, MS Excel is bound to .CSV, so Windows will hand the file over to Excel.
• A file extension and/or its corresponding registry information can be manipulated by a savvy user.

Changing File Extensions
• Suppose a change was made to the registry so that the .CSV file extension was associated with, and therefore opened by, an image viewer such as Windows Picture Viewer.
• This will cause an error because the file is a CSV file and not an image.
• A file with an incorrect file extension would still open, as long as the Windows Registry had that “incorrect” file extension associated with the correct software.
• Remember, changing or renaming a file extension does not change the content of the file; it only changes the way in which Windows handles the file (i.e. which application the file is sent to).

Computer Criminals
• So why is the way the OS handles the interpretation of a file’s extension important to a cyber forensic investigator?
• Computer criminals can hide files simply by changing the file extension.

Changing A File’s Extension To Evade Detection
• The process to change a file’s extension to evade detection is quite simple:
  – Step 1: Create a legitimate-looking folder into which you wish to place your files. Use a name that will not be conspicuous.

Changing a file extension to evade detection
• Step 2:
  – Open the folder that you created
  – Select the Organize menu, select Layout, and select Menu Bar
• Step 3:
  – Open the Tools menu, select Folder Options, and select the View tab

Removing the file extension
• Step 4:
  – Uncheck “Hide extensions for known file types”
  – The file extension is revealed
• Step 5:
  – Right-click on the file name to rename the file, providing any valid file extension (.doc, .xls, .exe, .txt). The file name is changed based upon the extension provided. (Do this to 4 images.)

Removing the file extension
• Step 6:
  – Check “Hide extensions for known file types” to hide the new file extensions.
• Notice that where there were once 10 image files there are now only six.
• Scanning simply for image files will result in missing the four files with modified extensions!

Notes about Hiding Files
• Remember, Windows looks at a file’s extension first, and hands that file over to the appropriate application to open.
  Microsoft Word, handed a .JPEG or .TIF file that has been renamed with a Word extension, would attempt to open the file and report an error, since the file, regardless of its name, is not actually a Microsoft Word file.

File Signature
• File Signature – also known as the “magic number”.
• File Signature – the binary data that identifies a particular file: the data that aids in identifying the file to its native or parent software.

HEX Editor
• For common file formats, the file signatures conveniently represent the names of the file types.
  – Example: the GIF (Graphics Interchange Format) image signature occupies the first 6 bytes of the file. GIF87a in HEX equals 0x474946383761. GIF89a in HEX equals 0x474946383961.

JPEG
• JPEG – a Joint Photographic Experts Group image file contains 0x4A464946, which is the ASCII equivalent of JFIF (JPEG File Interchange Format).
  – The JFIF identifier begins at the seventh byte of the file.

Files and The Hex Editor
• Back to our case: a forensic investigator will have to look at millions of pieces of data for potential evidence.
• These files can be renamed and moved deep into the logical folder structure.
• Logical folder structure – a way in which to store your files.
  – Assists in the orderly storage of your files.
  – Makes it easier to find your files.
  – Aids in managing your files.
  – Simpler way to archive your files.
• Remember, there can be hundreds if not thousands of folders and even more files, all of which may seem inconsequential as they are scattered and stored throughout an individual’s hard drive.

File Signature
• Magic numbers are referred to as magic because the purpose and significance of their values are not apparent without some additional knowledge.
GIFs and JPEGs
• A GIF file signature occupies the first six bytes of the file.
• In a JPEG file, the JFIF identifier starts at the seventh byte.
• The MS Word document signature is d0 cf 11 e0, which looks like “docfile”.

ASCII is Not Text or HEX
• There may not always be an ASCII equivalent to a file signature; this is one reason to use HEX.
• ASCII has limitations; remember that Unicode extends ASCII. We cannot represent all characters from other languages using ASCII, which is another reason we use HEX.

Value of File Signature
• We see that even when a file extension has been changed, we can still view the file contents.
• If we search the entire drive for a binary representation of the company name “XYZ”, we will be able to find it even if the file signature has been deleted or changed.
• Even if a file has been deleted, its signature changed, or some of its data overwritten, there may be remnants of the file that can be retrieved.
• A forensic examiner cannot always depend on having an intact file or a file with a signature.

File Signature Database
• There are many different file signatures, too many to remember.
• An Internet search is the best way to find a signature.
• http://www.filesignatures.net/ is a good place to look for a file signature.

Complex Files: Compound, Compressed, and Encrypted Files
• We will discuss just the basics of the above topics.
• A compound file – is a file format that consists of numerous files. The compound file itself is little more than a container for those files. The structure within a compound file is similar to that of a real file system, consisting of a hierarchy of storage with one parent directory.
• There is a root directory, child folders contained within it, and files (data streams) contained within those. Compound files are sometimes associated with Microsoft’s Compound File Binary Format (CFBF).
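The signatures quoted on the preceding slides can be checked programmatically. The following is a minimal sketch, not a forensic tool: the `SIGNATURES` table, the `EXPECTED_EXTENSIONS` mapping, and the `identify` and `looks_renamed` helpers are names invented here for illustration, and the table lists only the values mentioned in this chapter. The sample JFIF header bytes are an assumed example of a typical JPEG file start.

```python
import os

# Illustrative table of (offset, magic bytes, label). Only signatures quoted
# in this chapter are listed; a real signature database is far larger.
SIGNATURES = [
    (0, bytes.fromhex("474946383761"), "GIF87a"),
    (0, bytes.fromhex("474946383961"), "GIF89a"),
    (6, bytes.fromhex("4A464946"), "JFIF"),   # "JFIF" starts at the seventh byte
    (0, bytes.fromhex("D0CF11E0"), "CFBF"),   # compound file, e.g. an MS Word .doc
]

# Hypothetical mapping from signature label to plausible file extensions.
EXPECTED_EXTENSIONS = {
    "GIF87a": {".gif"},
    "GIF89a": {".gif"},
    "JFIF": {".jpg", ".jpeg"},
    "CFBF": {".doc", ".xls", ".ppt"},
}


def identify(header: bytes):
    """Return the label of the first signature found in header, or None."""
    for offset, magic, label in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None


def looks_renamed(filename: str, header: bytes) -> bool:
    """True when the signature is recognized but the extension disagrees."""
    label = identify(header)
    if label is None:
        return False  # unknown signature: this sketch cannot judge
    extension = os.path.splitext(filename)[1].lower()
    return extension not in EXPECTED_EXTENSIONS[label]


# Assumed sample: the first 11 bytes of a JFIF file.
jpeg_header = bytes.fromhex("FFD8FFE000104A46494600")
print(identify(jpeg_header))                     # JFIF
print(looks_renamed("report.doc", jpeg_header))  # True
print(looks_renamed("photo.jpg", jpeg_header))   # False
```

For a file on disk, the header could be obtained with something like `open(path, "rb").read(16)`. The check looks only at the file's bytes, so renaming the extension does not change the result.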
Compound File
• All allocations of space within a compound file are done in chunks or units called sectors.
• The size of a sector is defined when the compound file is created; sectors are usually 512 bytes in size.
• A virtual stream is made up of a sequence of sectors.
• At its simplest, the Compound File Binary Format is a container, with little restriction on what can be stored within it.
• More loosely, “compound file” can refer to any file that contains a directory structure.

Compound File Signature
• As with other files, the file header of a compound file will contain a file signature, identifying the file; it will also contain information required to interpret the rest of the file, such as the file’s size and storage location.
• It is this metadata that allows the software to reconstruct the file into the appropriate file format that will display the file’s specific information (i.e. size, creation date, change date, etc.).
• The file therefore needs to be “reconstructed” by its parent software in order for the data to be legible or otherwise accessible.

Example
• We think of data storage as linear. Example: company XYZ Corp. We expect X to come before Y, and Y before Z.
• What would we see if storage is nonlinear? Maybe “oZpYCrX”.
• If that same data is noncontiguous, other data may be intertwined (e.g., …?>>o…Z^qLp…77Ymn….C@qwerbsbdX…).
• Thus XYZ Corp is not easily discernible now.
• We would need an instruction set to reconstruct this data.

Why Do Compound Files Exist?
• Files have become more complex and need to contain a lot of information.
• Many files use Object Linking and Embedding (OLE) technology, in which one file may contain many files.

OLE
• Allows the user to integrate data from different applications.
• Object linking allows the user to share a single source of data for a particular object.
• The document contains the name of the file containing the data, along with a picture of the data.
• When the source is updated, all the documents using the data are updated as well.

Object Embedding
• With object embedding, one application (referred to as the source) provides data or an image that will be contained in the document of another application (referred to as the destination). The destination contains the data or graphic image, but does not understand it or have the ability to edit it.
• It simply displays, prints, and/or plays the embedded item.
• To edit or update the embedded object, it must be opened in the source application that created it.
• This occurs by double-clicking the object or choosing the appropriate edit command when the object is highlighted.

Embedding
• While embedding doesn’t allow the user to have a single source of data, it does make it easier to integrate applications.
• An embedded object contains the actual data for the object, the name of the application that created it, and a picture of the data.

Example
• An MS Word document may contain a JPG image: a file within a file.
• Compound files allow for incremental access, so individual components can be accessed without the need for the entire file.
• This can save time and resources by not having to load an entire file, only the piece or pieces desired.

Compressed Files
• Compressed files are essentially compound files that are compressed.
• Contained within the compound files are compression instructions.

.ZIP
• A common file extension associated with compressed files is .zip.
• Other compression tools and formats include WinZip, 7-Zip, gzip, and rzip.
• The format of a compressed file changes depending upon its compression algorithm.

Questions
• What happens when an application is upgraded (example: going from MS Office 2003 to MS Office 2007)? How might this affect the application’s file signature? The file signature has changed.

Questions
• What is the importance of this to a cyber forensic investigation, and what does it mean? It means that the file is a compound file consisting of other files.
  If we viewed the entirety of the file with our HEX editor, we would not uncover any legible ASCII characters.

Question
• Why? The file structure and assembly instructions are contained within the file; thus, the file would need to be mounted (processed to make it ready for use by the OS) by its native software in order for the contents to be viewed.

Mount
• Viewing and, more importantly, searching the contents of these “complex” files is possible once they are mounted. Forensic tools incorporate the software to mount these files so that searching is possible.
• If these complex files are not mounted, then no search results will be obtained.

Forensics and Encrypted Files
• Encrypted files are also complex but differ in that an encryption key is required to decrypt an encrypted file.
• Encryption uses an algorithm (cipher) to alter or transform the data in an attempt to prevent reconstruction by those without the instruction set, a.k.a. the encryption key.
• Decryption refers to the reverse process of making the data readable or otherwise accessible.

Encrypted Files
• Encryption – a method by which the confidentiality of data can be protected.
• An encrypted file cannot be decrypted without the encryption key (a.k.a. password).
• The encryption process uses an algorithm or cipher to mathematically transform the plaintext along with the encryption key (password), thereby encoding it in such a manner that it is illegible or indecipherable.

Encrypted Files
• With the correct decryption key (password), the data is run through its associated cipher (algorithm) and converted back to clear text, that is, decrypted. Remember, this entire process occurs in binary, as 0’s and 1’s.
• It is the cipher that actually changes the file; the password is just a set of data used to “mathematically mix” and set the process in motion, turning the plaintext data into an unreadable end product.

The Structure of Cipher
• The structure of a cipher depends upon the cipher’s type.
  Types of ciphers vary, but generally they can be categorized as follows:
  – Block or Stream – Block ciphers generally work on fixed-length groups of bits called blocks. The cipher may take, for example, a 256-bit block of data at a time. In a stream cipher, the plaintext bits are encrypted one at a time using the encryption key.
  – Symmetric or Asymmetric –
    • Symmetric encryption – the same encryption key or password is used for both encryption and decryption.
    • Asymmetric encryption (public-key cryptography) – different keys (public and private) are used for encryption and decryption. Data is encrypted using a person’s public key, which anyone may have access to and which may even be openly distributed. However, the data can only be decrypted using the person’s private key, which is kept secret by the individual.

Advanced Encryption Standard (AES)
• A standard adopted by the United States government and one of the most popular encryption methods available and in use today.
• There are many other encryption algorithms or formats available, and many books on them. We will not cover them in this class.
• They all contain some level or form of instruction needed to reconstruct the file.
• If the instructional data needed to reconstruct a compound file is missing, overwritten, destroyed, or compromised, the file may not be recoverable, even though the data containing the evidence may still be contained within the file itself.

Summing it up
• It may be possible to reconstruct a complex file which has been partially overwritten. Forensic analysts are creative, cutting edge, innovative, and very intelligent; they have developed solutions for some of the most complex problems.
• However, recovering the data with normal “point and click” methods may not always be possible.
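The earlier point that complex files must be mounted before they can be searched can be illustrated with a small experiment. This is only a sketch under stated assumptions: zlib compression stands in for a “complex” file, decompression stands in for mounting, and the sample text and the `raw_search` helper are invented for illustration. Real forensic tools handle many container formats and work on disk images rather than in-memory buffers.

```python
import zlib


def raw_search(data: bytes, keyword: bytes) -> int:
    """Return the byte offset of keyword in data, or -1 if it is absent."""
    return data.find(keyword)


keyword = b"XYZ Corp"
# Invented sample content: plain, uncompressed document text.
document = b"Invoice issued to XYZ Corp. " * 40

# Stand-in for a complex file: the same content after zlib compression.
compressed = zlib.compress(document)

# A raw byte search finds the keyword in the plain data...
print(raw_search(document, keyword))
# ...but not in the compressed bytes, because the text is no longer stored literally.
print(raw_search(compressed, keyword))

# "Mounting" in this sketch is simply decompression; afterwards the search works again.
restored = zlib.decompress(compressed)
print(raw_search(restored, keyword))
```

The same reasoning applies to encrypted data, with the difference that restoring a searchable form requires the key and not just the right software.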