Using Cryptographic Hashes Correctly and Effectively

Introduction

Many people have heard, in some vague way, that an obscure technology called cryptographic hash values can be used to verify whether a software component is or is not the correct version of that component. This is true. Moreover, you do not need a significant understanding of how hashing works in order to use hash values correctly and effectively. This article is intended for election officials, elected officials, citizen activists and attorneys.

First, I want to assure you it is possible to use complex technology without a significant understanding of the technology. Most people do it every day. If you are reading this article on the World Wide Web, you are currently using complex technology without a significant understanding of that technology. How many readers of this article are familiar with the fine details of any of the following: Internet Protocol (IP) and its related machinery, such as IP packets, domain name servers, and IP routing; Hypertext Transfer Protocol (the HTTP of the URL to this article); lossless data compression; error detection; or error-correcting codes? The honest answer for most of us is none of the above. Yet all of the technologies listed above (and more) are in use right now as you read this online article.

I firmly believe any reader of this article could master the fine details of the above technologies if they chose. But life is too short and there are only so many hours in a day. Who wants to learn the details of IP packet routing in order to read an online document? Or learn the intricacies of fuel injection in order to drive an automobile? Both may be interesting to some, but deep knowledge of either is unnecessary to use the technology they represent: the World Wide Web and automobiles.
Unfortunately, in the area of cryptography, and cryptographic hashing in particular, many who are familiar with the technology insist on explaining how hashing works rather than how to use hashes and hash values correctly. Or, worse in some ways, the proper use of cryptographic hashes and hash values is buried within a chapter of a book covering how hashes work. I will attempt to excerpt and distill such chapters to explain how to use hashes and hash values without explaining how hashing works. What little technical exposition there is in this article will be hidden behind links to more detailed information or placed in endnotes at the end of this article.

What do I need to know?

For a car with an automatic transmission, you need to know how only 4 controls work in order to drive the automobile: the steering wheel, the accelerator, the brake, and the turn indicators.

With cryptographic hashes, you need to know only 5 things in order to use them properly:

1. The name of the algorithm(s) used,
2. The name of the software component for which the hash value was calculated (optional),
3. The size of the software component for which the hash value was calculated,
4. The hash value generated by the selected algorithm, and
5. The error rate for the algorithm.

Much to the chagrin of some technologists, I would argue the fifth item (error rate) is also optional.

So what are the meanings of some of these terms which I have used so far without definition? I think the definitions are easier to follow with some context. The following sentences are typical statements which involve cryptographic hashes and hash values.

The file _SETUP.EXE, found on the installation disk for the Sequoia WinEDS Election Database System 2.6 Build 220, had a size of 8192 bytes, an MD5 hash value of 1F9BBFAAB8DEC9AC4416E5BE2D22E315, and a SHA-1 hash value of E0A61765653F18ED0777DF975B10D46D586541E6.
The firmware for the ES&S M650 optical scanner in Columbia County, WI had a SHA1 hash value of 8585D0B734F85A37E0C6AA35391E66F873AD3064. The SHA1 hash value for the firmware of the M650 scanner version 1.2.0.0 (as recorded by the National Software Reference Library) is 8585D0B734F85A37E0C6AA35391E66F873AD3064.

The file KERNEL.DLL found on the Hart InterCivic central tabulator of Tarrant County, TX had a file size of 983,552 bytes, an MD5 hash value of 775191A31455FAD793312F8D087146EB and a SHA1 hash value of 888190F293016D9541DDD6AEF5AC94EE3886849A. The file KERNEL.DLL from the NSRL is 983,552 bytes and has reference hash values of 888190E31455FAD793312F8D087146EB and 775191D293016D9541DDD6AEF5AC94AB3776849A for MD5 and SHA1, respectively. Since the hash values of the software component KERNEL.DLL are different, the software component KERNEL.DLL is not from the COTS operating system Microsoft Windows XP Professional Version 2002 Service Pack 2.

Definitions

MD5: the name of a hash algorithm; short for Message Digest 5.

SHA1: the name of a hash algorithm; short for Secure Hash Algorithm one.

SHA-256: the name of one of the hash algorithms which belong to the family of hash algorithms known as SHA2, or Secure Hash Algorithm 2.

SHA-384: the name of one of the hash algorithms which belong to the family of hash algorithms known as SHA2, or Secure Hash Algorithm 2.

SHA-512: the name of one of the hash algorithms which belong to the family of hash algorithms known as SHA2, or Secure Hash Algorithm 2.

Software Component: a collection of digital data which is under some form of version control. A software component can be a file, an OCX control, a document, a text file of configuration parameters, contents of the MS Windows registry, an executable image, the contents of a memory card, the contents of a portion of a memory card, or the contents of a memory chip (e.g.
containing firmware).

Hash Value: the large binary number (between 128 and 512 bits in length) which is produced by applying a hash algorithm to a software component. This value is usually written down as a hexadecimal number.

Hexadecimal: a numbering system in base 16 instead of the common base 10. The 16 digits are 0-9 followed by A-F.

And I use Hash Values How?

Now, with some of the definitions out of the way, we can discuss using hash algorithms and hash values properly and effectively. The basic idea is to compare the hash values of 2 software components instead of performing a laborious byte-for-byte binary comparison of the 2 software components themselves. Aside from being excruciatingly slow, a byte-for-byte comparison may be illegal because of trade secret, copyright or other intellectual property concerns. Using hash values allows one to verify software components without impinging on the intellectual property rights of the developer of the software component.

With cryptographic hash algorithms, if the hash values are different, the 2 software components are different. If the hash values are the same, the software components are the same (see End Note 1). This sounds simple, and it is, but there is the issue of the chain of custody and trust. Consider the following two statements:

The hash value A does not equal the hash value B. Therefore, software component A is not the same as software component B.

The hash value A does equal the hash value B. Therefore, software component A is the same as software component B.

In order to make either statement, 3 pieces of infrastructure need to be in place:

1. The hash value, for the chosen algorithm, of a known "good" version of the software component. This is called the reference hash value.
2. A hash calculator which is trusted to produce the correct hash value when applying the chosen hash algorithm to a software component.
3. A means to calculate the hash value of the second, suspect software component with this trusted hash calculator.
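For readers who have a programmer on hand, the comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a trusted hash calculator: the function names `file_fingerprint` and `matches_reference` are hypothetical, and the sketch assumes the suspect software component is available as an ordinary file. It relies on Python's standard `hashlib` module to compute the hash values.

```python
import hashlib


def file_fingerprint(path, algorithms=("md5", "sha1"), chunk_size=65536):
    """Return (size in bytes, {algorithm name: upper-case hex hash value})."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so a large component need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            size += len(chunk)
            for hasher in hashers.values():
                hasher.update(chunk)
    digests = {name: h.hexdigest().upper() for name, h in hashers.items()}
    return size, digests


def matches_reference(path, reference_size, reference_digests):
    """True only if the size and every listed hash value equal the reference.

    reference_digests is a dict such as {"md5": "...", "sha1": "..."} taken
    from a published list of reference hash values (hypothetical input).
    """
    size, digests = file_fingerprint(path, tuple(reference_digests))
    if size != reference_size:
        return False
    return all(
        digests[name] == expected.strip().upper()
        for name, expected in reference_digests.items()
    )
```

A call such as `matches_reference("KERNEL.DLL", 983552, reference)` would return False as soon as the size or any one hash value differs from the reference, which mirrors the first of the two statements above.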
On the first item of infrastructure, there are 2 generally accepted sources of reference hash values: the National Software Reference Library (NSRL) or a reference installation. The NSRL is a collection of both SHA1 and MD5 hash values of the software components of many commercial computer applications, commercially available operating systems, and voting systems. Some entries in the NSRL are hash values for the contents of an installation CD-ROM instead of the hash values of all the software components installed by that CD-ROM. An example is the installation file BallotStation.ins, found in the NSRL area devoted to voting equipment. The version of this installation file which installs the WinCE application BallotStation 4.5.2 (which runs on a Diebold TSx touch screen DRE) has a file size of 4,505,149 bytes, an MD5 hash value of 663B473011996898B65C3F3B74CD8DB4, and a SHA1 hash value of 3FC23B448EC036C5CABC2220C1989F07974A2B1B. Unfortunately, this information does not provide a reference hash which lets you know whether your particular TSx has version 4.5.2 or version 4.4.5 of the WinCE application BallotStation. For this you will need a reference installation of the WinCE application BallotStation 4.5.2 on some trusted TSx in the state capital. This is where the chain of custody issues come into play.

Generating the reference hash values is a multi-step process:

1. Using the NSRL, compare the hash values of the installation program and verify that it will install the correct version of BallotStation.
2. Using the NSRL-verified installation CD-ROM, install the WinCE application onto an empty TSx.
3. Once the installation is complete, you have a reference installation.
4. From this reference installation, it is now possible to calculate the hash values of some or all of the software components on the reference installation with a hash calculator which supports the selected hash algorithms.
This list of hash values then becomes the reference hash values.
5. This collection of reference hash values is published for use in verifying whether the version of BallotStation found on a particular TSx DRE is or is not version 4.5.2. For example, the list of reference hash values might include an MD5 and a SHA1 hash value for the software component BallotStation.EXE.

The second portion of required infrastructure is a hash calculator trusted to give the correct hash values for all of the desired hash algorithms. One such hash calculator is called HashCalc.

The third portion of the required infrastructure is the ability to apply the trusted hash calculator to the suspect software component. Continuing our specific example, this would mean there must be a way to execute the program HashCalc against the software component BallotStation.EXE as found on the specific TSx DRE number 6354 used last Tuesday in precinct 47.

Go Forth and Hash

With this paper you are now ready to use cryptographic hash algorithms properly and effectively. Remember, the 4 essential things you need in order to use cryptographic hash algorithms properly and effectively are:

1. The name of the algorithm(s) used.
2. Where the reference hash values are located.
3. Which trusted hash calculator is used.
4. The hash value generated by the trusted hash calculator against the suspicious software component using the selected algorithm.

Because of the known defects in both the MD5 and SHA1 algorithms, the author recommends you use both algorithms together and concatenate the 2 hash values into a single composite hash value. Never use MD5 or SHA1 alone. Since there are no known defects in the SHA-2 family of hash algorithms (SHA-256, SHA-384, or SHA-512), you can use any of the SHA-2 algorithms singly.

How were the reference hash values created? Where are the reference hash values published? Is the source the NSRL or a reference installation?
What is the program or application generating the hash values of the software components to be tested? Is the hash calculator trusted, and able to give the correct hash values for the suspicious software components?

End Notes

1. While it is true that if the hash values are different it is a mathematical certainty the software components are different, the same certainty does not hold if the hash values are the same. There is a very small probability that 2 different files of the same size could have the same hash value for a given algorithm. This is called a collision. For SHA-1 the probability of a collision is effectively 1 part in 2^63, or roughly 1 part in 10^19. For MD5 the probability of a collision is effectively 1 part in 2^24, or roughly 1 part in 17 million. For SHA-256 the probability of a collision is effectively 1 part in 2^256, or roughly 1 part in 10^77. SHA-384 and SHA-512 are even stronger. If you use either one of the SHA-2 algorithms, or MD5 together with SHA1, it is literally more likely that a cosmic ray hitting the CPU of your computer induced an error in the calculation of the hash value than that there was a collision of the hashing algorithm(s) between the 2 files.
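The recommendation to use MD5 and SHA1 together as a single composite hash value can be illustrated with a short Python sketch. The article does not specify the order of concatenation or the letter case, so placing the MD5 value first and using upper-case hexadecimal are assumptions made here for illustration, as are the function names `composite_hash` and `same_component`.

```python
import hashlib


def composite_hash(data: bytes) -> str:
    """Concatenate the MD5 and SHA1 hash values into one composite value.

    Assumption: MD5 first, then SHA1, both as upper-case hexadecimal.
    """
    md5_hex = hashlib.md5(data).hexdigest().upper()    # 32 hex digits
    sha1_hex = hashlib.sha1(data).hexdigest().upper()  # 40 hex digits
    return md5_hex + sha1_hex                          # 72 hex digits in total


def same_component(data_a: bytes, data_b: bytes) -> bool:
    """Two components are treated as the same only if both hash values agree."""
    return composite_hash(data_a) == composite_hash(data_b)
```

Because the composite value matches only when the MD5 part and the SHA1 part both match, a pair of different files would have to collide under both algorithms at once to be mistaken for the same component.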