Proposal for efficient hashes effectively differentiating application generated data

Abstract

Hash functions are widely used in computer applications to optimize various computing resources. In many situations (such as duplicate detection, synchronization/mirroring, indexing, and many other storage settings), hashes are used to represent large amounts of data with a significantly smaller number of bits. In order for these hashed representations to be suitable, it is important that the likelihood of two different data streams hashing to the same value (called the collision probability) is negligible. Given this property, one also wishes to use hashes that can be computed quickly and whose hash values are represented with a small number of bits. Yet programs often use hash functions that were not designed for the situation at hand. For example, in the context of duplicate detection and synchronization/mirroring, CRC and MD5 hashes are frequently employed. These hashes were designed for purposes other than distinguishing data streams generated by computer applications. The collision probabilities claimed by systems using these hashes are those inferred from the assumption that all bit-streams are equiprobable (i.e. occur with the same probability). In situations where this assumption is valid, it would be more advantageous to use faster hashes than CRCs and MD5s, since the computational complexity of the latter is only necessary for the properties sought by their design, which are irrelevant in our present situation. Yet, in many cases, computer applications will not produce all possible bit-streams with equal probability. Rather, typical bit sequences will occur frequently, others less often, while some will never occur. Thus there is an opportunity to create hash functions better suited to effectively differentiating data generated by typical computer applications. Further, by exploring a wide solution space, we believe it is likely that we may find effective hashes that additionally have smaller hash sizes and lower computational complexity than the aforementioned customary hashes. In short, we believe that we can find hash functions which will be, in most storage management circumstances, shorter, faster, and more effective than commonly used hash functions.

Table of Contents

1. Introduction
1.1 Hash functions
1.2 Balanced hash functions: An example
1.3 Hashes in storage problems: Duplicate detection and synchronization as examples
2. Problems with current use of hash functions in the context of duplicate detection and synchronization
2.1 Customary Hash Functions: CRC and MD5
2.2 Purpose of CRC and MD5 hash functions
2.3 Differing Purposes of data and storage management
2.4 Probabilistically balanced hash functions
2.5 Problems in using CRC and MD5 for distinguishing computer generated bit-streams
2.6 The case of two Avamar Technologies patents
2.7 Hash size and computational complexity considerations
2.8 Conclusion
3. Proposed approach for finding better hash functions
3.1 An overview of the problems and solution approaches
3.2 Work plan
3.3 Deliverables
3.4 Further details

1. Introduction

1.1 Hash functions

A hash function is a function that converts an input from a (typically) large domain into an output in a (typically) smaller range (the hash value, often a subset of the integers). Hash functions vary in the domain of their inputs and the range of their outputs and in how patterns and similarities of input data affect output data. Hash functions are used in hash tables, cryptography and data processing.
A good hash function is one that experiences few hash collisions in the expected domain of values it will have to deal with; that is, it should be possible to uniquely identify most of these values using this hash. (from Wikipedia.com)

Many computer applications use hash functions to optimize storage and speed when performing certain tasks. This is the case for synchronization/mirroring and duplicate detection, where the hash values of documents or parts of documents (generically, "bit-streams") are calculated continuously in order to represent these with a small number of bits. These hashes can then be used to compare bit-streams with each other by comparing their smaller hash representations. The idea here is to design the hash function so that the hash values are small (thus requiring little memory), fast to compute, and such that if two bit-streams have the same hash, then the probability that the bit-streams are different is very small: we then say that the collision probability is small. It is in this sense that the hash value is said to "represent" the bit-streams.

1.2 Balanced hash functions: An example

[Fig. 1]

Fig. 1 illustrates a hash function mapping nine possible bit-streams to three distinct hash values. If every bit-stream has the same probability of being present, then each has a 1/9 chance of being present. It may be verified that the hash function represented in Fig. 1 has a 2/9 = 0.222 collision probability: the probability that two independently drawn bit-streams share a hash value is 3 x (3/9)^2 = 1/3, while the probability that they are the very same bit-stream is 9 x (1/9)^2 = 1/9, leaving 1/3 - 1/9 = 2/9. This is the lowest collision probability one can achieve with a hash function having 3 hash values where all bit-streams have an equal probability of being present. In general, the lowest collision probability one can achieve when hashing B bit-streams with H hash values is (B/H - 1)/B, that is, 1/H - 1/B. A balanced hash function is one whose inverse images are all more or less the same size: that is, one where the sets of bit-streams mapping to the same hash value are more or less of equal size. If all the bit-streams are equiprobable, then a balanced hash function achieves the lowest collision probability.

1.3 Hashes in storage problems: Duplicate detection and synchronization as examples

In many storage problems, it is useful to have hashes that will differentiate bit-streams from each other. In the case of duplicate detection, hashes are used to separate a collection of documents into probable duplicate groups, allowing a subsequent process to decide whether the documents in those groups (those having at least two elements) are actually duplicates. Dividing the collection of documents into groups accelerates duplicate detection.

File synchronization in computing is a two-way synchronization of files and directories: there is a left side and a right side to synchronize between. Mirroring, by contrast, is one-way: a mirror is a direct copy of a data set. On the Internet, a mirror site is an exact copy of another Internet site (often a web site). Mirror sites are most commonly used to provide multiple sources of the same information, and are of particular value as a way of providing reliable access to large downloads. In the case of synchronization/mirroring, hashes are again used to compare two specific documents (or, often, portions of these documents) stored in different locations, to see whether they are the same or need to be synchronized. Sending whole documents over the network on a regular basis would be infeasible, so the smaller hashes are sent instead. If the hashes are the same, the documents are considered to be the same, and no synchronization takes place. Hence it is important here to have a very low collision probability.
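To make the collision-probability computations of Section 1.2 concrete, the following Java sketch computes the collision probability of a hash assignment over a finite set of bit-streams with known probabilities of occurrence, under the reading (consistent with the 2/9 figure above) that a collision is the event that two independently drawn bit-streams differ yet receive the same hash value. The class and method names are illustrative and not part of the proposal.

    // A minimal sketch, not part of the proposal: exact collision probability of a
    // hash assignment over a finite set of bit-streams with known probabilities.
    // Model: two bit-streams are drawn independently from the distribution; a
    // "collision" is the event that they differ yet receive the same hash value.
    public final class CollisionProbability {

        /**
         * @param prob   prob[i] is the probability of occurrence of bit-stream i (the values sum to 1)
         * @param bucket bucket[i] is the hash value (0..h-1) assigned to bit-stream i
         * @param h      the number of distinct hash values H
         */
        public static double collisionProbability(double[] prob, int[] bucket, int h) {
            double[] mass = new double[h];     // total probability mapped to each hash value
            double sameBitStream = 0.0;        // probability that both draws are the very same bit-stream
            for (int i = 0; i < prob.length; i++) {
                mass[bucket[i]] += prob[i];
                sameBitStream += prob[i] * prob[i];
            }
            double sameHash = 0.0;             // probability that both draws share a hash value
            for (double m : mass) {
                sameHash += m * m;
            }
            return sameHash - sameBitStream;   // same hash value, but different bit-streams
        }

        public static void main(String[] args) {
            // The Fig. 1 situation: nine equiprobable bit-streams, three hash values,
            // three bit-streams mapped to each hash value.
            double[] prob = new double[9];
            java.util.Arrays.fill(prob, 1.0 / 9.0);
            int[] bucket = {0, 0, 0, 1, 1, 1, 2, 2, 2};
            System.out.println(collisionProbability(prob, bucket, 3)); // prints ~0.2222 (= 2/9)
        }
    }

Replacing the uniform probabilities and the bucket assignment reproduces the other figures discussed in Section 2.4 below.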
2. Problems with current use of hash functions in the context of duplicate detection and synchronization

2.1 Customary Hash Functions: CRC and MD5

Most current duplicate detection processes and synchronization schemes use CRC and MD5 hash functions. The Wikipedia.com definitions of these are provided below.

A cyclic redundancy check (CRC) is a type of hash function used to produce a checksum, which is a small number of bits, from a large block of data, such as a packet of network traffic or a block of a computer file, in order to detect errors in transmission or storage. A CRC is computed and appended before transmission or storage, and verified afterwards to confirm that no changes occurred. CRCs are popular because they are simple to implement in binary hardware, are easy to analyze mathematically, and are particularly good at detecting common errors caused by noise in transmission channels.

In cryptography, MD5 (Message-Digest algorithm 5) is a widely-used cryptographic hash function with a 128-bit hash value. As an Internet standard (RFC 1321), MD5 has been employed in a wide variety of security applications, and is also commonly used to check the integrity of files.

2.2 Purpose of CRC and MD5 hash functions

These two classes of hash functions are ingenious and very appropriate in some situations, and have thus become very popular in the IT world. Yet they were designed for specific purposes, and are often used in situations they are not suitable for. The CRC hashes were designed for burst errors, the type of data corruption or loss that occurs when data is transmitted or stored: a burst error corrupts a consecutive section of bits of a bit-stream. Since such clustered errors are much more likely to occur than corruption of bits scattered throughout the bit-stream, it makes sense to tailor a hash that will distinguish a bit-stream from other bit-streams obtained from the original one by introducing burst errors. The CRC hash functions do exactly this. The MD5 hash functions are, on the other hand, designed specifically so that, knowing a hash value, it is very hard (intractable) to construct a bit-stream that would produce that hash value. Hence MD5 hash functions are useful when privacy and security are at stake.

2.3 Differing Purposes of data and storage management

Yet in the case of data and storage management in general, and duplicate detection and synchronization in particular, we are not concerned with burst errors or privacy issues. Our goal here is only to distinguish computer generated bit-streams from each other. Using CRC and MD5 hashes for this purpose has several drawbacks:
1) It is computationally intensive to compute the hash values (especially in the case of MD5 hashes).
2) They are not adapted to the probability distribution of typical bit-streams produced by computer applications.
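For reference, both customary hashes are available directly in the standard JDK (java.util.zip.CRC32 and java.security.MessageDigest); the following minimal sketch computes the two digests for a bit-stream given as a byte array. The class name and sample input are illustrative only.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.zip.CRC32;

    // Sketch: computing the two customary hashes discussed above for a bit-stream
    // (represented as a byte array), using only standard JDK classes.
    public final class CustomaryHashes {

        public static long crc32(byte[] bitStream) {
            CRC32 crc = new CRC32();
            crc.update(bitStream);
            return crc.getValue();                                       // 32-bit checksum
        }

        public static byte[] md5(byte[] bitStream) throws NoSuchAlgorithmException {
            return MessageDigest.getInstance("MD5").digest(bitStream);  // 128-bit digest (16 bytes)
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            byte[] data = "To whom it may concern:".getBytes();
            System.out.printf("CRC32: %08x%n", crc32(data));
            System.out.printf("MD5:   %032x%n", new BigInteger(1, md5(data)));
        }
    }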
2.4 Probabilistically balanced hash functions

In order to convey some understanding of this latter issue, we will illustrate the problem of adapting a hash function to the probability distribution of its domain (here, bit-streams).

[Fig. 2]

[Fig. 3]

Consider Fig. 2, where the same hash function as in Fig. 1 is represented, but where the probabilities of the bit-streams are not equal. Note that the probability that a bit-stream hashes to the first hash value is 16/36 = 0.444, whereas the probabilities for the second and third hash values are respectively 11/36 = 0.306 and 9/36 = 0.250. The hash function is balanced with respect to the number of bit-streams, but not with respect to their probabilities. In this particular case, the collision probability is 0.233.

Methods using balanced hashes usually claim a collision probability corresponding to the case where all bit-streams are equiprobable. In the case of Fig. 2, we would then claim a collision probability of 0.222 (resulting from the Fig. 1 situation). Yet, if the bit-stream probabilities are as indicated in Fig. 2, one sees that the claimed 0.222 collision probability is clearly an underestimate, the actual collision probability being 0.233. Not only can one achieve lower collision probabilities than the balanced hash of Fig. 2; one can moreover achieve lower collision probabilities than 0.222, the theoretical lower bound for balanced hash functions in the equiprobable bit-streams case (with three hash values). For example, the hash function depicted in Fig. 3 achieves a collision probability of 0.213. Note that this hash function is not balanced in the conventional sense, but is what we will call "probabilistically balanced", since the probability of obtaining any given hash value is uniform.

Note that the collision probabilities of the illustrative hash functions mentioned above (Fig. 1, 2, and 3) are not too far apart. This is only because these are small examples. If one increases the ratio of the number of bit-streams to the number of hash values and/or the skew of the bit-stream probability distribution, the differences become more extreme. We conducted experiments to examine how the collision probabilities of balanced and probabilistically balanced hash functions differ under various bit-stream probability skews. Some results are presented here. In this experiment, we consider a group of 1000 bit-streams hashed to 10 hash values. We defined a family of Poisson distributions with a skew factor varying from 0 to 1, which we use as the bit-stream probabilities. Five such distributions are graphed in Fig. 4, with skew k = 0, 0.25, 0.50, 0.75, 1.

[Fig. 4]

Fig. 5 displays the probability of each "same hash" group of bit-streams in the balanced case (in green) and the probabilistically balanced case (in yellow) for skews 0.25, 0.50, 0.75, and 1.00. Fig. 6 graphs the collision probabilities (in the balanced and the probabilistically balanced cases) resulting from increasing bit-stream distribution skew.

[Fig. 5]

[Fig. 6]
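The following Java sketch is in the spirit of the experiment above, though it does not reproduce it: the skewed Poisson family used in the experiment is replaced here by a simple geometric-decay distribution (an assumption made purely for illustration), and the probabilistically balanced assignment is built with a simple greedy heuristic (largest probability first, always into the currently lightest hash value), which is just one easy way to approximately equalize the probability mass per hash value, not the method of this proposal.

    import java.util.Arrays;
    import java.util.Comparator;

    // Sketch in the spirit of the Section 2.4 experiment: 1000 bit-streams, 10 hash
    // values, skewed probabilities. The decay parameter and the greedy assignment
    // are illustrative assumptions only.
    public final class SkewExperiment {

        static double collision(double[] prob, int[] bucket, int h) {
            double[] mass = new double[h];
            double same = 0.0;
            for (int i = 0; i < prob.length; i++) { mass[bucket[i]] += prob[i]; same += prob[i] * prob[i]; }
            double sameHash = 0.0;
            for (double m : mass) sameHash += m * m;
            return sameHash - same;
        }

        public static void main(String[] args) {
            int n = 1000, h = 10;
            double decay = 0.99;                      // assumed skew parameter, not from the experiment above
            double[] prob = new double[n];
            double sum = 0.0;
            for (int i = 0; i < n; i++) { prob[i] = Math.pow(decay, i); sum += prob[i]; }
            for (int i = 0; i < n; i++) prob[i] /= sum;

            // Balanced hash: equal *number* of bit-streams per hash value.
            int[] balanced = new int[n];
            for (int i = 0; i < n; i++) balanced[i] = i / (n / h);

            // Probabilistically balanced (greedy heuristic): equalize the *probability
            // mass* per hash value by assigning bit-streams, largest probability first,
            // to the currently lightest hash value.
            Integer[] order = new Integer[n];
            for (int i = 0; i < n; i++) order[i] = i;
            Arrays.sort(order, Comparator.comparingDouble(i -> -prob[i]));
            int[] probBalanced = new int[n];
            double[] mass = new double[h];
            for (int idx : order) {
                int lightest = 0;
                for (int b = 1; b < h; b++) if (mass[b] < mass[lightest]) lightest = b;
                probBalanced[idx] = lightest;
                mass[lightest] += prob[idx];
            }

            System.out.println("balanced:                   " + collision(prob, balanced, h));
            System.out.println("probabilistically balanced: " + collision(prob, probBalanced, h));
        }
    }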
2.5 Problems in using CRC and MD5 for distinguishing computer generated bit-streams

So we see that when the bit-streams being hashed are weighted by a non-uniform probability distribution, balanced hash functions do not yield optimal collision probability. Yet, when these bit-streams are documents or portions of documents produced by computer applications, the domain probability distribution is in fact not uniform. Many studies have been made of the structure of (natural and computer) languages. Some language models are deterministic (e.g. Chomsky grammars) and some are statistical (e.g. Markov models, Bayesian inference, entropy theory). All of these naturally yield skewed bit-stream probabilities for computer data, since, whether entered by a human or generated by a computer program, this data follows certain patterns, as opposed to being completely random.

Some authors/inventors make the mistake of claiming theoretically optimal collision probabilities when using CRC, MD5, or other balanced hashes for computer generated bit-streams. In reality, when taking into account the actual skewed distribution of the domain, the collision probabilities would be found to be higher. How much higher depends on how skewed the domain probability distribution is. The underestimation of collision probability could potentially lead to faulty systems if non-collision is a critical requirement of the system. In this case, one could increase the hash size (hence the number of possible hash values) sufficiently to compensate for the possible underestimation. Yet even this will not help in some situations, as we will now exemplify.

2.6 The case of two Avamar Technologies patents

Consider the twin patents assigned to Avamar Technologies: "Hash file system and method for use in a commonality factoring system" and "System and method for unorchestrated determination of data sequences using sticky byte factoring to determine breakpoints in digital sequences". These describe methods to represent the contents of the files of a file system as a sequence of hashes associated with commonly encountered bit sequences. This is an ingenious idea that could yield considerable storage savings. However, the hashes proposed to implement the scheme might impede its functionality. Indeed, since bit sequences will be represented by their hash values, it is imperative that no two distinct sequences hash to the same value. Hence, in order for this scheme not to lead to a disastrous loss of data, it is essential that the collision probability of the hash function be very close to zero. In order to claim the optimal collision probability implied by balanced hash functions, we must assume that all bit sequences are equiprobable. Yet, if this were the case, the scheme would not be advantageous, since there would then not be many common bit sequences throughout the file system to yield storage savings.

2.7 Hash size and computational complexity considerations

We have seen the importance of minimizing actual collision probability when designing a hash function. Further, when designing any algorithm, one wishes to minimize the time and space complexity. In the case of hash functions, this translates to:
o Minimizing the hash size, that is, the number of bits that will be used to represent the hash values.
o Minimizing the time (and space) required to compute the hash value of a bit-stream.

One wishes to minimize the hash size because:
o If the hash values are stored in RAM or on disk drives, this minimizes the space required to store them.
o More critically, if the hash values are transmitted, as is the case in mirroring, synchronization, and distributed duplicate detection, this minimizes the amount of data to be communicated, and thus the time needed to communicate the hash values.

Of course, lowering the hash size will increase the collision probability, so an appropriate tradeoff must be decided upon. One wishes to minimize the time (and space) required to compute the hash value for obvious reasons. Yet the more we restrict the operations we are willing to perform to compute hash values, the more we restrict which hash functions we can compute, and thus the less we can lower the collision probability. So again, an appropriate tradeoff must be decided upon. Even if we assume that the bit-stream probability distribution is indeed nearly uniform, any balanced hash function suffices to obtain minimal collision probabilities, since we are not concerned with the circumstances that CRCs and MD5s are designed for. In that case, one should choose hash functions that are faster to compute than CRCs and MD5s. This threefold tradeoff is depicted in Fig. 7.

[Fig. 7]
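One Phase I deliverable (Section 3.3) is to test the actual speed of various classes of hash functions. The rough Java sketch below shows how such a measurement could be set up, comparing CRC32, MD5, and a deliberately simple 64-bit FNV-1a hash as a stand-in for a "fast" hash. The block size, iteration count, and choice of FNV-1a are assumptions for illustration; a serious measurement would also need JIT warm-up and repeated runs (modern JDKs heavily optimize CRC32, for instance), so no particular outcome is asserted here.

    import java.security.MessageDigest;
    import java.util.Random;
    import java.util.zip.CRC32;

    // Rough timing sketch (not a rigorous benchmark): measures the time taken by
    // CRC32, MD5, and a simple 64-bit FNV-1a hash over random 4 KB blocks.
    public final class HashSpeedSketch {

        static long fnv1a64(byte[] data) {
            long hash = 0xcbf29ce484222325L;            // FNV-1a 64-bit offset basis
            for (byte b : data) {
                hash ^= (b & 0xff);
                hash *= 0x100000001b3L;                 // FNV-1a 64-bit prime
            }
            return hash;
        }

        public static void main(String[] args) throws Exception {
            byte[] block = new byte[4096];
            new Random(42).nextBytes(block);
            int iterations = 100_000;
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            CRC32 crc = new CRC32();
            long sink = 0;                              // keeps results "used" so work is not optimized away

            long t0 = System.nanoTime();
            for (int i = 0; i < iterations; i++) { crc.reset(); crc.update(block); sink ^= crc.getValue(); }
            long t1 = System.nanoTime();
            for (int i = 0; i < iterations; i++) { sink ^= md5.digest(block)[0]; }
            long t2 = System.nanoTime();
            for (int i = 0; i < iterations; i++) { sink ^= fnv1a64(block); }
            long t3 = System.nanoTime();

            System.out.printf("CRC32:  %4d ms%n", (t1 - t0) / 1_000_000);
            System.out.printf("MD5:    %4d ms%n", (t2 - t1) / 1_000_000);
            System.out.printf("FNV-1a: %4d ms  (ignore: %x)%n", (t3 - t2) / 1_000_000, sink);
        }
    }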
2.8 Conclusion

Due to their popularity and capability in their intended contexts, customary hash functions such as CRC and MD5 are often used outside the context they were designed for. As a consequence, the systems they are used in incur higher (actual) collision probabilities, hash sizes and computational complexities than necessary. More precisely, since our goal here is solely to distinguish bit-streams, we do not need burst-error detection (which CRCs were designed for) or hard-to-invert hashing (which MD5s were designed for). Hence the computational complexity of these hash functions is unnecessary in our context. Further, these customary hash functions may not be fit for the skewed bit-stream probabilities of computer data, which yields higher collision probabilities. We believe that we may find hash functions better suited to the task at hand. More precisely, we propose to find hash functions that will be better situated in the threefold tradeoff space composed of the following attributes:
1) Lower collision probability.
2) Smaller size.
3) Faster to compute.

3. Proposed approach for finding better hash functions

3.1 An overview of the problems and solution approaches

A great deal of literature can be found in the field of hash functions and their computational optimization. Moreover, much research has been done on the structure of language, studied both deterministically and statistically. In this project, we wish to bring these two areas together with the objective of finding efficient hash functions whose purpose is to distinguish computer generated bit-streams. Since computer generated bit-streams are rarely random, but have a structure specific to the program that generated them, these bit-streams can be thought of as being written in a given language, where "language" is taken in its broad sense. This implies that these bit-streams will have different probabilities of occurrence, and thus a hash function designed to distinguish bit-streams of this language should be adapted to the probabilities in question. Fig. 8 depicts the different components of the proposed approach to the problem.

[Fig. 8]

Let A be the family of all possible bit-streams. Let B be a family of bit-streams of a certain type: by this, we mean that these bit-streams were generated by a typical use of a given computer program. For example, this family of bit-streams could correspond to the bits representing the contents of files (or parts of files) generated by programs such as Word, Excel, PowerPoint, Outlook, or having extensions .pdf, .bmp, .java, .exe, etc. Usually, many bit-streams of A are excluded from B. For example, consider opening a .doc or .xls file as a .pdf or .bmp. Usually, if a bit-stream was generated by a given program, most other programs will not even be able to interpret it. Further, even within the same family B, some bit-streams are more likely than others. For example, consider the bit-stream representing the fragment "To whom it may concern:" (written in MS Word) in contrast with the bit-stream representing the fragment "aslkjf asofi (*$ kjfbb!!". All this is to say that to a given family B corresponds a typical probability space P(B) on the bit-streams of B, where to each bit-stream of B is attached a "probability of occurrence". This probability space may be inferred logically in the case where we have a deterministic model of the language, or statistically if such a model is not available.
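As an illustration of the statistical route, the following Java sketch estimates an empirical distribution for a family B by treating fixed-size chunks of sample files as its bit-streams and counting their relative frequencies. The chunk size, the "samples" directory, and the "*.doc" glob are hypothetical choices made only for the example; they are not prescribed by this proposal.

    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of one simple way to estimate P(B) statistically: treat fixed-size
    // chunks of sample files of a given type as the "bit-streams" of B and count
    // their relative frequencies.
    public final class EmpiricalDistribution {

        public static Map<String, Double> estimate(Path sampleDir, String glob, int chunkSize) throws Exception {
            Map<String, Long> counts = new HashMap<>();
            long total = 0;
            try (DirectoryStream<Path> files = Files.newDirectoryStream(sampleDir, glob)) {
                for (Path file : files) {
                    byte[] bytes = Files.readAllBytes(file);
                    for (int off = 0; off + chunkSize <= bytes.length; off += chunkSize) {
                        // Hex-encode the chunk so it can serve as a map key.
                        StringBuilder key = new StringBuilder();
                        for (int i = off; i < off + chunkSize; i++) key.append(String.format("%02x", bytes[i] & 0xff));
                        counts.merge(key.toString(), 1L, Long::sum);
                        total++;
                    }
                }
            }
            final long t = total;
            Map<String, Double> probabilities = new HashMap<>();
            counts.forEach((chunk, c) -> probabilities.put(chunk, c / (double) t));
            return probabilities;
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical sample set: .doc files collected for the family B under study.
            Map<String, Double> p = estimate(Paths.get("samples"), "*.doc", 8);
            System.out.println("distinct 8-byte chunks observed: " + p.size());
        }
    }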
If B were finite, then, given P(B), we could theoretically find the hash function(s) having the least collision probability. However, there are a few problems here:
o Bit-stream families are typically infinite.
o Even in the finite case, if B is too large, it would be intractable to find this optimal hash. In fact, if the bit-stream family is too large, we may not even be able to represent the probability space, unless it can be computed directly from any given bit-stream.
o Even if we found an optimal hash, it may be intractable to compute this hash if it is not smoothly correlated to the actual bit formation of the bit-streams.

Regarding the large (or infinite) bit-streams problem, several approaches may be explored:
o Take a sampling approach. We could determine how large a sample we should take so that the results thus obtained reflect the population (i.e. all possibilities). A concrete instance of this approach is sketched at the end of this subsection.
o Find a fortuitous relation between these large bit-streams and their constituent parts.
o Show that the hashes found for small bit-streams of the same language model (LM) are also optimal for large bit-streams.

Regarding the computability of hash functions, we will need to explore several approaches:
o Determine if the desired probabilistically balanced hashes found are computable. That is, see if a chosen large class Z of "computable" hash functions can express them.
o Determine which hashes of Z yield the minimal (with respect to Z) collision probabilities.
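As a concrete instance of the sampling approach mentioned above, the following Java sketch estimates the collision probability of a hash function empirically, by drawing random pairs from a sample of bit-streams and counting how often two distinct bit-streams receive the same hash value. The toy sample, the deliberately weak length-based hash (chosen so that collisions are actually observable), and the number of sampled pairs are illustrative assumptions; how large the sample must be for the estimate to be trustworthy is left open here, as in the text above.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;
    import java.util.function.Function;

    // Sketch of a sampling-based (Monte Carlo) estimate of collision probability.
    public final class SampledCollisionEstimate {

        public static double estimate(List<byte[]> sample, Function<byte[], Long> hash, int pairs, long seed) {
            Random rnd = new Random(seed);
            int collisions = 0;
            for (int k = 0; k < pairs; k++) {
                byte[] a = sample.get(rnd.nextInt(sample.size()));
                byte[] b = sample.get(rnd.nextInt(sample.size()));
                if (!Arrays.equals(a, b) && hash.apply(a).equals(hash.apply(b))) collisions++;
            }
            return collisions / (double) pairs;
        }

        public static void main(String[] args) {
            // Toy sample of random bit-streams of varying lengths, and a weak hash
            // (length mod 16) chosen so that collisions are actually observable.
            Random rnd = new Random(1);
            List<byte[]> sample = new ArrayList<>();
            for (int i = 0; i < 5_000; i++) {
                byte[] bs = new byte[1 + rnd.nextInt(64)];
                rnd.nextBytes(bs);
                sample.add(bs);
            }
            Function<byte[], Long> weakHash = bs -> (long) (bs.length % 16);
            System.out.println(estimate(sample, weakHash, 1_000_000, 7));
        }
    }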
3.2 Work plan

We propose a phased approach to the problem.

Phase I: The goal of this phase is to provide Client-Company with resources (both collected and produced) that will help Client-Company to possess state-of-the-art knowledge on the problem, as well as to determine the worth of further research and development on this subject. In particular, we will:
o Collect and read through patents, scientific articles, and other related materials.
o Pinpoint possible applications (besides duplicate detection and synchronization/mirroring) of hash functions that distinguish application generated bit-streams.
o Statistically determine the bit-stream probabilities produced by typical computer applications.
o Compare the actual collision probabilities of commonly used hash functions such as CRC and MD5 to those of probabilistically balanced hash functions in the context of the said typical applications.

Phase II: If Client-Company decides that the results obtained in Phase I are promising, we will undertake the task of finding optimal hash functions and efficient ways to compute these. This task will include:
o Finding language models that spawn the bit-streams of typical computer applications with probabilities comparable to those exhibited by the statistical experiments of Phase I.
o Determining a large family of efficient hash functions as well as their computational complexity.
o Finding efficient hash functions yielding low collision probabilities in the context of the said application generated or language model spawned bit-streams.
o Testing the said hash functions in a real world situation.

Phase III: If deemed useful, we will write patents covering the methods and hash functions developed during Phase II.

3.3 Deliverables

Phase I: Background research, testing, and proof of concept. The goal of this phase is to provide Client-Company with resources (both collected and produced) that will help Client-Company to possess state-of-the-art knowledge on the problem, as well as to assess the worth of further research and development on this subject. In short, we wish to determine if we can find hash functions that will be both more effective (low collision probabilities) and more efficient (low hash size and computational complexity) than existing ones. The deliverables include:

A website where various collected or produced resources will be organized and readily available through hyperlinks. These resources may include
o This proposal
o Reports
o Patents
o Scientific articles
o Tutorials
o Webpages
o Software
o Blogs
o Contacts
o Java programs
o Progress log

Java modules for
o Studying the structure of computer generated bit-streams
o Computing the collision probabilities that various hash functions yield when applied to a set of input bit-streams
o Automatically generating graphs comparing collision probabilities for different hashes and bit-stream families
o Determining (absolutely or statistically) minimally colliding hash functions when given either
  - A set of bit-streams
  - A bit-stream probability distribution
o Testing the actual speed of various general classes of hash functions
o Writing and testing collision probabilities of fast hashes and comparing these to those of commonly used hashes

A final report, presenting our findings.

Phase II: Finding optimal hash functions and efficient ways to compute them. The objective of this phase is to find efficient ways to compute hashes reaching the low collision probabilities put forward in Phase I. In order to do so, we will need to find good language models, i.e. models which spawn the desired bit-streams with probabilities similar to those found in actual application generated data. Further, though probabilistically balanced hashes may easily be defined, we must find an efficient scheme to compute these, or at least hashes that are nearly probabilistically balanced. This will require some research and testing. More details on Phase II will be provided after Phase I is completed. For the time being, these are the projected deliverables:

Java modules for
o Generating bit-streams given various language models (one possible shape of such a module is sketched after this deliverables list)
o Determining how "good" a language model is at approximating the structure and probability of typical bit-streams
o Computing the collision probabilities that various hash functions yield when applied to a language model (generating typical bit-streams)
o Determining (absolutely or statistically) minimally colliding hash functions when given a language model
o Finding a minimally colliding hash within a family of hashes
o Determining if a given hash is computable (given restrictions on the type of hashes to be used)
o Automatically finding the minimally colliding hash of a specified bit-stream type (defined by a file type or language model) within a family of hashes, in a given time limit
o Computing the best hashes found for a list of file types

A final report, presenting our findings.
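To give one possible shape to the "generating bit-streams given a language model" module mentioned above, here is a minimal Java sketch of an order-1, byte-level Markov model that is trained on sample data and then sampled to spawn synthetic bit-streams. The model order, the byte granularity, and the hypothetical samples/example.doc training file are illustrative assumptions; the actual language models are to be chosen in Phase II.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Random;

    // Sketch of an order-1, byte-level Markov model: trained on sample data, then
    // sampled to spawn synthetic bit-streams.
    public final class MarkovByteModel {

        private final long[][] counts = new long[256][256];   // counts[prev][next]
        private final long[] rowTotals = new long[256];

        public void train(byte[] data) {
            for (int i = 1; i < data.length; i++) {
                int prev = data[i - 1] & 0xff, cur = data[i] & 0xff;
                counts[prev][cur]++;
                rowTotals[prev]++;
            }
        }

        public byte[] generate(int length, Random rnd) {
            byte[] out = new byte[length];
            int state = rnd.nextInt(256);
            for (int i = 0; i < length; i++) {
                if (rowTotals[state] == 0) {                   // unseen context: emit a uniform byte
                    out[i] = (byte) rnd.nextInt(256);
                } else {                                       // sample the next byte from counts[state]
                    long r = (long) (rnd.nextDouble() * rowTotals[state]);
                    int next = 0;
                    while (r >= counts[state][next]) { r -= counts[state][next]; next++; }
                    out[i] = (byte) next;
                }
                state = out[i] & 0xff;
            }
            return out;
        }

        public static void main(String[] args) throws Exception {
            MarkovByteModel model = new MarkovByteModel();
            model.train(Files.readAllBytes(Paths.get("samples/example.doc")));  // hypothetical sample file
            byte[] synthetic = model.generate(1024, new Random(3));
            System.out.println("generated " + synthetic.length + " bytes");
        }
    }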
3.4 Further details

Phase I (non-exhaustive) list of tasks:
1. Collect literature (patents, books, scientific articles) on hash functions.
2. Collect literature (patents, books, scientific articles) on language statistics.
3. Collect literature (patents, books, scientific articles) on applications of hash functions in this context (duplicate detection, synchronization/mirroring, content hashing, etc.).
4. Determine a typical set X of bit-stream families of interest, for example bit-streams created by typical use of given programs (e.g. those generating .doc, .xls, .ppt, .pdf, etc.). This set X
   a. Should not be too large
   b. Should contain the most frequent document types the hashes will be used on.
5. Conduct statistical experiments on bit-streams of the different families listed in X, created by typical use of the chosen programs, so as to infer the bit-stream probabilities (BSPs) of these different families.
6. Compute the collision probabilities (for CRCs, MD5s, and possibly other hash functions) inferred by the statistically obtained bit-stream probabilities.
7. Test CRCs and MD5s directly on bit-streams of the families of X and record the collision probabilities thus obtained.
8. Compare the collision probabilities obtained in 6. and 7. with the collision probabilities of the balanced and probabilistically balanced situations.
9. Prove that probabilistically balanced hashes (PBHs) yield minimal collision probability.
10. Write a module that will find PBHs for given BSPs.
11. Write modules that will test customary hashes (CRC, MD5, etc.) against other possible hashes.
12. Write a report presenting our findings.
13. Write a document and/or design a website where various resources (literature, webpages, software, blogs, contacts, etc.) on the subject will be organized and readily available through hyperlinks.

Phase II (non-exhaustive) list of tasks:
14. Determine a list W of language models (LMs) that will be used/tested in this project.
15. Figure out how the models of W influence the bit-stream probabilities.
16. Write modules that can compute the bit-stream probabilities inferred by the models of W.
17. Measure how well an LM fits the actual bit-stream probabilities.
18. Write a module that will find PBHs for given LMs.
19. Write modules that will take a bit-stream probability distribution and return the closest fitting (in the sense of 17.) model of W (along with the closest fitting parameters of the latter).
20. Determine a class Z of hash functions we wish to use. These should be expressible in terms of specific low-level operations and look-up tables.
21. Test the speed of the different operations and look-up tables that will be used by hashes of Z.
22. Using 21., establish a measure of how efficient the hashes of Z are (this should include time and space considerations).
23. Find mathematical and programmatic ways to relate PSPs to efficient hashes of Z.
24. Find mathematical and programmatic ways to relate BSPs to efficient hashes of Z.
25. Find mathematical and programmatic ways to relate LMs to efficient hashes of Z. For example, suppose that Z contains only "subset hashes", i.e. those corresponding to a fixed size subset of the bits of the bit-streams. It is straightforward to see that in this case, if the LM implies a neighborhood correlation of the bits (or chunks of bits) of the bit-stream (i.e. whereby a given bit is correlated to its neighboring bits with a strength increasing with the closeness of these neighbors), then the optimal hash will be one where the bits retrieved for the hash are farthest apart. (A small sketch of such a subset hash follows this task list.)
26. Find mathematical and programmatic ways to relate efficient hashes of Z to PSPs.
27. Find mathematical and programmatic ways to relate efficient hashes of Z to BSPs.
28. Find mathematical and programmatic ways to relate efficient hashes of Z to LMs.
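As a small illustration of the subset-hash example in task 25, the Java sketch below extracts a fixed number of bytes spaced as far apart as the bit-stream allows and packs them into a 64-bit value. Working at byte (rather than bit) granularity and packing the selected bytes into a long are simplifications made only for this sketch.

    // Sketch of the "subset hash" example from task 25: the hash simply extracts a
    // fixed number of bytes spaced as far apart as the bit-stream allows and packs
    // them into a 64-bit value.
    public final class SpreadSubsetHash {

        public static long hash(byte[] bitStream) {
            int k = Math.min(8, bitStream.length);   // number of bytes to sample
            long h = 0;
            for (int j = 0; j < k; j++) {
                // Evenly spaced positions across the whole bit-stream.
                int pos = (int) ((long) j * (bitStream.length - 1) / Math.max(1, k - 1));
                h = (h << 8) | (bitStream[pos] & 0xffL);
            }
            return h;
        }

        public static void main(String[] args) {
            byte[] data = "To whom it may concern: this is a sample bit-stream.".getBytes();
            System.out.printf("subset hash: %016x%n", hash(data));
        }
    }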