Proposal for efficient hashes effectively
differentiating application generated data
Abstract
Hash functions are widely used in computer applications to optimize various
computing resources. In many situations (such as duplicate detection,
synchronization/mirroring, indexing, and many other storage settings), hashes are
used to represent large amounts of data with a significantly smaller number of bits. In
order for these hashed representations to be suitable, it is important that the likelihood
of two different data streams hashing to the same value (called collision probability)
is negligible. Given this property, one also wishes to use hashes that may be promptly
computed, and whose hash values are represented with a small number of bits.
Yet often programs will use hash functions which were not designed for the
situation at hand. For example, in the context of duplicate detection and
synchronization/mirroring, CRC and MD5 hashes are frequently employed. These
hashes were designed for other purposes than distinguishing data streams generated
by computer applications. The collision probabilities claimed by systems using these
hashes are those inferred under the assumption that all bit-streams are equiprobable (i.e.
occur with the same probability).
In situations where this assumption is valid, it would be more advantageous to use
faster hashes than CRCs and MD5s since the computational complexity of these is
only necessary for the properties sought by their design, which are irrelevant in our
present situation. Yet, in many cases, computer applications will not produce all
possible bit-streams with equal probability. Rather, typical bit sequences will occur
frequently, others less often, while some will never occur.
Thus there is an opportunity to create hash functions better fit for effectively
differentiating data generated by typical computer applications. Further, by exploring
a wide solution space, we believe it is likely that we may find effective hashes that
additionally have smaller hash sizes and lower computational complexity than the
aforementioned customary hashes.
In short, we believe that we can find hash functions which will be, in most storage
management circumstances, shorter, faster, and more effective than commonly used
hash functions.
Table of Contents
1. Introduction
1.1 Hash functions
1.2 Balanced hash functions: An example
1.3 Hashes in storage problems: Duplicate detection and synchronization as examples
2. Problems with current use of hash functions in the context of duplicate detection
and synchronization
2.1 Customary Hash Functions: CRC and MD5
2.2 Purpose of CRC and MD5 hash functions
2.3 Differing Purposes of data and storage management
2.4 Probabilistically balanced hash functions
2.5 Problems in using CRC and MD5 for distinguishing computer generated bit-streams
2.6 The case of two Avamar Technologies patents
2.7 Hash size and computational complexity considerations
2.8 Conclusion
3. Proposed approach for finding better hash functions
3.1 An overview of the problems and solution approaches
3.2 Work plan
3.3 Deliverables
3.4 Further details
1. Introduction
1.1 Hash functions
A hash function is a function that converts an input from a (typically) large domain into
an output in a (typically) smaller range (the hash value, often a subset of the integers).
Hash functions vary in the domain of their inputs and the range of their outputs and in
how patterns and similarities of input data affect output data. Hash functions are used in
hash tables, cryptography and data processing. A good hash function is one that
experiences few hash collisions in the expected domain of values it will have to deal
with; that is, it would be possible to uniquely identify most of these values using this
hash. (from Wikipedia.com)
Many computer applications use hash functions to optimize storage and speed when
performing certain tasks. This is the case for synchronization/mirroring and duplicate
detection, where the hash values of documents or parts of documents (generally “bit-streams”)
are calculated continuously in order to represent these with a small number of
bits. These hashes can then be used to compare bit-streams with each other by comparing
their smaller hash representations.
The idea here is to design the hash function so that the hash values are small (thus
requiring little memory), fast to compute, and so that if two bit-streams have the same
hash, then the probability that the bit-streams are different is very small: we then say that the
collision probability is small. It is in this latter sense that the hash value is said to
“represent” the bit-streams.
1.2 Balanced hash functions: An example
Fig. 1
Fig. 1 illustrates a hash function mapping nine possible bit-streams to three distinct hash
values. If every bit-stream has the same probability of being present, then each has a 1/9
chance of being present. It may be verified that the hash function represented in Fig. 1
has a 2/9=0.222 collision probability. This is the lowest collision probability one can
achieve with a hash function having 3 hash values where all bit-streams have an equal
probability of being present.
In general, the lowest collision probability one can achieve when hashing B equiprobable
bit-streams with H hash values is (B/H-1)/B = 1/H-1/B (which gives 2/9 for B=9 and H=3,
in agreement with the example above).
A balanced hash function is one whose inverse images are all more or less the same size:
that is, one where the sets of bit-streams mapping to the same hash value are more or less of
equal size. If all the bit-streams are equiprobable, then a balanced hash function achieves
the lowest collision probability.
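To make the computation above concrete, the following small Java sketch (illustrative only, not a deliverable) computes the collision probability of a perfectly balanced hash over B equiprobable bit-streams and confirms the 2/9 value of Fig. 1:

    // CollisionDemo.java -- illustrative only; assumes B equiprobable bit-streams
    // hashed by a perfectly balanced hash onto H hash values (B divisible by H).
    public class CollisionDemo {

        // Collision probability = P(two independently drawn bit-streams share a hash)
        //                         - P(they are the same bit-stream)
        //                       = H * ((B/H) / B)^2 - B * (1/B)^2 = 1/H - 1/B
        static double balancedCollisionProbability(int B, int H) {
            double pSameHash = H * Math.pow((double) (B / H) / B, 2);
            double pSameStream = B * Math.pow(1.0 / B, 2);
            return pSameHash - pSameStream;
        }

        public static void main(String[] args) {
            // Fig. 1 situation: 9 bit-streams, 3 hash values -> 2/9 = 0.222...
            System.out.println(balancedCollisionProbability(9, 3));
            System.out.println(1.0 / 3 - 1.0 / 9);   // same value, from the closed form
        }
    }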
1.3 Hashes in storage problems: Duplicate detection and synchronization as examples
In many storage problems, it is useful to have hashes that will differentiate bit-streams
from each other. In the case of duplicate detection, hashes are used to separate a
collection of documents into probable-duplicate groups, allowing a subsequent process
to decide whether the documents in those groups (those having at least two elements) are
actually duplicates. Dividing the collection of documents into groups accelerates the
duplicate detection.
File synchronization in computing is the two-way synchronization of files and directories
between two locations (a left side and a right side). A mirror, in contrast, is a direct one-way
copy of a data set. On the Internet, a mirror site is an exact copy of another Internet site
(often a web site); mirror sites are most commonly used to provide multiple sources of the
same information, and are of particular value as a way of providing reliable access to large
downloads.
In the case of synchronization/mirroring, the hashes are again used to compare two specific
documents (or, often, portions of these documents) stored in different locations, to see
whether they differ and therefore need to be synchronized. Sending whole documents
over the network on a regular basis would be infeasible, so the smaller hashes are sent
instead. In this case, if the hashes are the same, the documents are considered to be the
same, and no synchronization takes place. Hence it is important here to have a very low
collision probability.
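As a minimal illustration of the duplicate-detection use described above, the sketch below groups the files passed on its command line by a hash of their contents. CRC32 is used here only as a stand-in hash (it is precisely the kind of customary hash whose suitability Section 2 questions), and a real system would confirm the groups with a byte-by-byte comparison:

    // DuplicateGrouping.java -- illustrative sketch of hash-based duplicate grouping.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.*;
    import java.util.zip.CRC32;

    public class DuplicateGrouping {
        public static void main(String[] args) throws IOException {
            // Group files by hash value: each group with 2+ members is a
            // probable-duplicate group to be confirmed by full comparison.
            Map<Long, List<Path>> groups = new HashMap<>();
            for (String arg : args) {
                Path f = Path.of(arg);
                CRC32 crc = new CRC32();          // stand-in hash; see Section 2
                crc.update(Files.readAllBytes(f));
                groups.computeIfAbsent(crc.getValue(), k -> new ArrayList<>()).add(f);
            }
            groups.values().stream()
                  .filter(g -> g.size() > 1)
                  .forEach(g -> System.out.println("Probable duplicates: " + g));
        }
    }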
2. Problems with current use of hash functions in the context of duplicate detection
and synchronization
2.1 Customary Hash Functions: CRC and MD5
Most current duplicate detection processes and synchronization schemes use CRC and
MD5 hash functions. The Wikipedia.com definitions of these are provided below.
A cyclic redundancy check (CRC) is a type of hash function used to produce a
checksum, which is a small number of bits, from a large block of data, such as a packet of
network traffic or a block of a computer file, in order to detect errors in transmission or
storage. A CRC is computed and appended before transmission or storage, and verified
afterwards to confirm that no changes occurred. CRCs are popular because they are
simple to implement in binary hardware, are easy to analyze mathematically, and are
particularly good at detecting common errors caused by noise in transmission channels.
In cryptography, MD5 (Message-Digest algorithm 5) is a widely-used cryptographic
hash function with a 128-bit hash value. As an Internet standard (RFC 1321), MD5 has
been employed in a wide variety of security applications, and is also commonly used to
check the integrity of files.
2.2 Purpose of CRC and MD5 hash functions
These two classes of hash functions are ingenious and very appropriate in some
situations, and have thus enjoyed much popularity in the IT world. Yet they were
designed for specific purposes, and are often used in situations for which they are not suitable.
The CRC hashes were designed for burst errors, the type of data corruption or loss that
occurs when data is transmitted or stored. A burst error is a random alteration of a
consecutive section of bits of a bit-stream. Since this type of clustered error is much more
likely to happen than corruption of bits scattered throughout the bit-stream, it makes sense
to tailor a hash that will distinguish a bit-stream from other bit-streams obtained from the
original one by introducing burst errors. The CRC
hash functions do exactly this.
The MD5 hash functions are, on the other hand, designed specifically so that, knowing a
hash value, it is very hard (intractable) to create a bit-stream that would produce
this hash value. Hence MD5 hash functions are useful when privacy/security are at
stake.
2.3 Differing Purposes of data and storage management
Yet, in the case of data and storage management in general—and duplicate detection and
synchronization in particular—we are not concerned with burst errors or privacy issues.
Our goal here is only to distinguish computer generated bit-streams from each other.
Using CRC and MD5 hashes for this purpose has several drawbacks:
1) It is computationally intensive to compute the hash values (especially in the
case of MD5 hashes).
2) They are not adapted to the probability distribution of typical bit-streams
produced by computer applications.
2.4 Probabilistically balanced hash functions
In order to convey some understanding of this latter issue, we will illustrate the problem
of adapting a hash function to the probability distribution of its domain (here, bit-streams).
Fig. 2
Fig. 3
Consider Fig. 2, where the same hash function as in Fig. 1 is represented, but where the
probabilities of the bit-streams are not equal. Note that the probability that a bit-stream
hashes to the first hash value is 16/36=0.444, whereas the probabilities for the second and
third hash values are respectively 11/36=0.306 and 9/36=0.250. The hash function is
balanced with respect to the number of bit-streams, but not with respect to their
probabilities. In this particular case, the collision probability is 0.233.
Methods using balanced hashes usually claim a collision probability corresponding to the
case where all bit-streams are equiprobable. In the case of Fig. 2, we would then claim a
collision probability of 0.222 (resulting from the Fig. 1 situation). Yet, if the bit-stream
probabilities are as indicated in Fig. 2, one sees that the claimed 0.222 collision
probability is clearly underestimated, the actual collision probability being 0.233.
Not only can one achieve lower collision probabilities than the balanced hash of Fig. 2,
but one can moreover achieve lower collision probabilities than the theoretical lower bound
0.222 for balanced hash functions in the equiprobable bit-streams case (with three hash
values). For example, the hash function depicted in Fig. 3 achieves a collision probability
of 0.213. Note that this hash function is not balanced in the conventional sense, but is
what we will call “probabilistically balanced” since the probability of obtaining any
given hash value is uniform. Making the hash-value probabilities uniform minimizes the
chance that two independently drawn bit-streams share a hash value, and hence minimizes
the collision probability.
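The following Java sketch shows how such collision probabilities are computed from an explicit assignment of bit-streams to hash values. The individual bit-stream probabilities below are hypothetical placeholders (the figures are not reproduced here); they are merely chosen so that the balanced grouping has the group totals 16/36, 11/36 and 9/36 mentioned above, so the printed values are close to, but need not exactly match, the 0.233 and 0.213 of Figs. 2 and 3:

    // WeightedCollision.java -- collision probability for an arbitrary assignment of
    // bit-streams (with given probabilities) to hash values. The probabilities and
    // assignments below are hypothetical placeholders, not those of Fig. 2 and 3.
    public class WeightedCollision {

        // p[i] = probability of bit-stream i; hashOf[i] = hash value assigned to i.
        // Collision probability = sum over hash values h of (total probability of h)^2
        //                         minus sum of p[i]^2.
        static double collisionProbability(double[] p, int[] hashOf, int numHashes) {
            double[] q = new double[numHashes];
            double pSameStream = 0.0;
            for (int i = 0; i < p.length; i++) {
                q[hashOf[i]] += p[i];
                pSameStream += p[i] * p[i];
            }
            double pSameHash = 0.0;
            for (double qh : q) pSameHash += qh * qh;
            return pSameHash - pSameStream;
        }

        public static void main(String[] args) {
            // Hypothetical skewed probabilities for nine bit-streams (they sum to 1).
            double[] p = {6/36.0, 5/36.0, 5/36.0, 4/36.0, 4/36.0, 3/36.0, 3/36.0, 3/36.0, 3/36.0};
            int[] balanced = {0, 0, 0, 1, 1, 1, 2, 2, 2};      // three bit-streams per hash value
            int[] probBalanced = {0, 1, 2, 0, 1, 2, 0, 1, 2};  // roughly equal probability per hash value
            System.out.println(collisionProbability(p, balanced, 3));
            System.out.println(collisionProbability(p, probBalanced, 3));
        }
    }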
Note that the collision probabilities of the illustrative hash functions mentioned above
(Fig.1, 2, and 3) are not too far apart. This is only due to the fact that these were small
examples. If one increases the ratio of the number of bit-streams to the number of hash
values and/or the skew of the bit-stream probability distribution, these differences
become more extreme.
We conducted some experiments to look into how the collision probabilities of
balanced and probabilistically balanced hash functions differ under different bit-stream
probability skews. Some results are presented here. In this experiment, we
consider a group of 1000 bit-streams hashed to 10 hash values. We defined a family of
Poisson distributions with a skew factor varying from 0 to 1, which we use as the bit-stream
probabilities. Five such distributions are graphed in Fig. 4 with skew k=0, 0.25,
0.50, 0.75, 1.
Fig. 4
Fig. 5 displays the probability of each group of “same hash” bit-streams in the balanced case (in
green) and the probabilistically balanced case (in yellow) for skews 0.25, 0.50, 0.75, and
1.00. Fig. 6 graphs the collision probabilities (in the balanced and the probabilistically
balanced cases) resulting from increasing bit-stream distribution skew.
Fig. 5
Fig. 6
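For the probabilistically balanced case of these experiments, an assignment of bit-streams to hash values has to be constructed from the bit-stream probabilities. One simple heuristic, sketched below under hypothetical probabilities, is to assign each bit-stream, in decreasing order of probability, to the hash value with the smallest accumulated probability so far; this is one candidate construction we would evaluate, not a settled design:

    // GreedyProbBalance.java -- greedy heuristic sketch for building a probabilistically
    // balanced hash assignment: take bit-streams in decreasing order of probability and
    // assign each to the hash value whose accumulated probability is currently smallest.
    import java.util.Arrays;
    import java.util.Comparator;

    public class GreedyProbBalance {

        // Returns hashOf[i] = hash value assigned to bit-stream i.
        static int[] assign(double[] p, int numHashes) {
            Integer[] order = new Integer[p.length];
            for (int i = 0; i < p.length; i++) order[i] = i;
            Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -p[i]));

            double[] load = new double[numHashes];   // accumulated probability per hash value
            int[] hashOf = new int[p.length];
            for (int i : order) {
                int lightest = 0;
                for (int h = 1; h < numHashes; h++)
                    if (load[h] < load[lightest]) lightest = h;
                hashOf[i] = lightest;
                load[lightest] += p[i];
            }
            return hashOf;
        }

        public static void main(String[] args) {
            // Hypothetical skewed bit-stream probabilities (they sum to 1).
            double[] p = {0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.03, 0.02};
            System.out.println(Arrays.toString(assign(p, 3)));
        }
    }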
2.5 Problems in using CRC and MD5 for distinguishing computer generated bit-streams
So we see that, in the case where the bit-streams we are hashing are weighted by a
non-uniform probability distribution, balanced hash functions do not yield
optimal collision probabilities. Yet, when these bit-streams are documents or portions of
documents that have been produced by computer applications, the domain probability
distribution is in fact not uniform.
Many studies have been made on the structure of (natural and computer) languages.
Some language models are deterministic (e.g. Chomsky grammars) and some are
statistical (e.g. Markov models, Bayesian inference, Entropy theory). All of these
naturally yield skewed bit-stream probabilities for computer data since, whether entered by
a human or generated by a computer program, this data will follow certain patterns rather
than being completely random.
Some authors/inventors make the mistake of claiming theoretical optimal collision
probabilities when using CRC, MD5, or other balanced hashes for computer generated
bit-streams. In reality, when taking into account the actual skewed distribution of the
domain, the collision probabilities would be found to be higher. How much higher these
collision probabilities are depends on how skewed the domain probability distribution is.
The underestimation of collision probability could potentially lead to faulty systems if
non-collision is a critical requirement of the system. In this case, one could increase the hash
size (hence the number of possible hash values) sufficiently to compensate for the possible
underestimation. Yet even this won’t help in some situations, as we will now exemplify.
2.6 The case of two Avamar Technologies patents
Consider the twin patents assigned to Avamar Technologies: “Hash file system and
method for use in a commonality factoring system” and “System and method for
unorchestrated determination of data sequences using sticky byte factoring to determine
breakpoints in digital sequences”. These describe methods to represent the contents of the
files of a file system as a sequence of hashes associated with commonly encountered bit
sequences. This is an ingenious idea that could yield considerable storage savings. However,
the hashes proposed to implement the scheme might impede its functionality.
Indeed, since bit sequences will be represented by their hash values, it is imperative that
no two distinct sequences hash to the same value. Hence, in order for this scheme not to
lead to a disastrous loss of data, it is essential that the collision probability of the hash
function be very close to zero. In order to claim the optimal collision probability implied
by balanced hash functions, we must assume that all bit sequences are equiprobable. Yet,
if this were the case, then the said scheme wouldn’t be advantageous, since there
wouldn’t then be many common bit sequences throughout the file system to yield
storage savings.
2.7 Hash size and computational complexity considerations
We have seen the importance of minimizing actual collision probability when designing a
hash function. Further, when designing any algorithm, one wishes to minimize the time
and space complexity. In the case of hash functions, this translates to
 Minimizing the hash size, that is the number of bits that will be used to represent
the hash values.
 Minimizing the time (and space) required to compute the hash value of a bit-stream.
One wishes to minimize the hash size because
 If the hash values are stored on RAM or disk drives, this will minimize the space
required to store the hash values.
 More critically, if the hash values are transmitted—as is the case in mirroring,
synchronization, and distributed duplicate detection—this will minimize the
amount of data to be communicated, thus the amount of time needed to
communicate the hash values.
Of course, lowering the hash size will increase the collision probability, thus an
appropriate tradeoff must be decided upon.
One wishes to minimize the time (and space) required to compute the hash value for
obvious reasons. Yet the more we restrict the amount of operations we are willing to undertake to
compute hash values, the more we restrict which hash functions we can compute, and thus the
less we can lower the collision probability. So again, an appropriate tradeoff must be decided
upon.
Even if we assume that the bit-stream probability distribution is indeed nearly uniform, any
balanced hash function suffices to obtain minimal collision probabilities, since we are not
concerned with the circumstances CRCs and MD5s were designed for. In this
case, one should choose hash functions that are faster to compute than CRCs and
MD5s.
This threefold tradeoff is depicted in Fig. 7.
Fig. 7
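As a rough indication of the computational side of this tradeoff, the sketch below times CRC32, MD5 and a simple FNV-1a hash (used here merely as an example of a cheaper, non-cryptographic hash) on a block of random data. The numbers are only indicative; a serious comparison would use a proper benchmarking harness and representative application generated data:

    // HashTiming.java -- rough timing sketch comparing CRC32, MD5, and a simple
    // FNV-1a hash on random data. Indicative only; not a rigorous benchmark.
    import java.security.MessageDigest;
    import java.util.Random;
    import java.util.zip.CRC32;

    public class HashTiming {

        // FNV-1a, 64-bit: a cheap non-cryptographic hash used as an example alternative.
        static long fnv1a64(byte[] data) {
            long h = 0xcbf29ce484222325L;
            for (byte b : data) {
                h ^= (b & 0xff);
                h *= 0x100000001b3L;
            }
            return h;
        }

        public static void main(String[] args) throws Exception {
            byte[] data = new byte[1 << 20];          // 1 MiB of (non-representative) random data
            new Random(42).nextBytes(data);
            MessageDigest md5 = MessageDigest.getInstance("MD5");

            long t0 = System.nanoTime();
            CRC32 crc = new CRC32();
            crc.update(data);
            long t1 = System.nanoTime();
            md5.digest(data);
            long t2 = System.nanoTime();
            fnv1a64(data);
            long t3 = System.nanoTime();

            System.out.printf("CRC32: %d us, MD5: %d us, FNV-1a: %d us%n",
                    (t1 - t0) / 1000, (t2 - t1) / 1000, (t3 - t2) / 1000);
        }
    }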
2.8 Conclusion
Due to their popularity and capability in given contexts, customary hash functions such as
CRC and MD5 are often used outside the context they were designed for. As a
consequence, the systems they are used in incur higher (actual) collision probabilities,
hash sizes and computational complexities than necessary.
More precisely, since our goal here is solely to distinguish bit-streams, we do not need
burst-error detection (which CRCs were designed for) or hard-to-invert hashes
(which MD5s were designed for). Hence the computational complexity of these hash
functions is unnecessary in our context. Further, these customary hash functions may
not be fit for the skewed bit-stream probabilities of computer data, which yields higher
collision probabilities.
We believe that we may find hash functions more apt for the task at hand. More
precisely, we propose to find hash functions that will be better situated in the three-fold
tradeoff space composed of the following attributes:
1) Lower collision probability.
2) Smaller size.
3) Faster to compute.
3. Proposed approach for finding better hash functions
3.1 An overview of the problems and solution approaches
A great deal of literature can be found in the field of hash functions and the
computational optimization of the latter. Moreover, much research has been done on
studying the structure of language, both deterministically and statistically. In this project,
we wish to bring both of these areas together with the objective of finding efficient hash
functions for distinguishing computer generated bit-streams.
Since computer generated bit-streams are rarely random, but have a structure specific to the
program that generated them, these bit-streams can be thought of as being
written in a given language—where “language” is taken in its broad sense. This implies
that these bit-streams will have different probabilities of occurrence, and thus a hash
function designed to distinguish bit-streams of this language should be adapted to these
probabilities in question.
Fig. 8 depicts the different components of the proposed approach to the problem.
Fig. 8
Let A be the family of all possible bit-streams. Let B be a family of bit-streams of a
certain type: By this, we mean that these bit-streams were generated by a typical use of a
given computer program. For example, this family of bit-streams could correspond to the
bits representing the contents of files (or parts of files) generated by programs such as
Word, Excel, PowerPoint, Outlook, or having extensions .pdf, .bmp, .java, .exe, etc.
Usually, many bit-streams of A are excluded from B. For example, consider opening a
.doc or .xls as a .pdf or .bmp. Usually, if a bit-stream was generated by a given program,
most other programs will not even be able to interpret this bit-stream. Further, even
within the same family B, some bit-streams are more likely than others. For
example, consider the bit-stream representing the fragment “To whom it may concern:”
(written in MsWord) in contrast with the bit-stream representing the fragment “aslkjf
asofi (*$ kjfbb!!”.
All this to say that to a given family B corresponds a typical probability space P(B) on
the bit-streams of B, where to each bit-stream of B is attached a “probability of
occurrence”. This probability space may be inferred logically in the case where we have a
deterministic model of the language, or statistically if such a model is not available.
If B were finite, given P(B), we could theoretically find the hash function(s) having the
least collision probability. However, there are a few problems here:
 Bit-stream families are typically infinite
 Even in the finite case, if B is too large, it would be intractable to find this optimal
hash. In fact, if the bit-stream family is too large, we may not even be able to
represent the probability space, unless it can be computed directly from any given
bit-stream.
 Even if we found an optimal hash, it may be intractable to compute this hash if it
is not smoothly correlated to the actual bit formation of the bit-streams.
Regarding the large (or infinite) bit-streams problem, several approaches may be
explored:
 Take a sampling approach. We could determine how large a sample we should
take so that the results thus obtained reflect the population (i.e. all possibilities); a
sketch of such an empirical estimate is given after this list.
 Find a fortuitous relation between these large bit-streams and their constituting
parts.
 Show that the hashes found for small bit-streams of the same language model (LM)
are also optimal for large bit-streams.
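As a sketch of the sampling approach mentioned above, the following Java program estimates a candidate hash function's collision probability empirically, by counting pairs of distinct sampled bit-streams that share a hash value. The sample generator and the hash used here are hypothetical stand-ins:

    // SampledCollision.java -- sketch of a sampling-based estimate of a hash
    // function's collision probability: count pairs of distinct sampled bit-streams
    // that share a hash value. The sample and the hash below are hypothetical.
    import java.util.*;
    import java.util.function.ToLongFunction;

    public class SampledCollision {

        // Estimate P(two distinct sampled bit-streams share a hash value).
        static double estimate(List<byte[]> sample, ToLongFunction<byte[]> hash) {
            Map<Long, Integer> hashCounts = new HashMap<>();
            Map<String, Integer> contentCounts = new HashMap<>();
            for (byte[] s : sample) {
                hashCounts.merge(hash.applyAsLong(s), 1, Integer::sum);
                contentCounts.merge(Arrays.toString(s), 1, Integer::sum);
            }
            long sameHashPairs = 0, sameContentPairs = 0;
            for (int c : hashCounts.values()) sameHashPairs += (long) c * (c - 1) / 2;
            for (int c : contentCounts.values()) sameContentPairs += (long) c * (c - 1) / 2;
            long totalPairs = (long) sample.size() * (sample.size() - 1) / 2;
            // Pairs with the same hash but different content, over all pairs.
            return (double) (sameHashPairs - sameContentPairs) / totalPairs;
        }

        public static void main(String[] args) {
            // Hypothetical skewed sample: short streams drawn with unequal frequencies.
            List<byte[]> sample = new ArrayList<>();
            Random rnd = new Random(7);
            for (int i = 0; i < 10_000; i++) {
                byte[] s = new byte[1 + rnd.nextInt(4)];   // crude stand-in for a real BSP
                rnd.nextBytes(s);
                sample.add(s);
            }
            System.out.println(estimate(sample, s -> Arrays.hashCode(s) & 0xffL)); // 256 hash values
        }
    }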
Regarding the computability of hash functions, we will need to explore several
approaches:
 Determine if the desired probabilistically balanced hashes found are computable.
That is, see if a chosen large class Z of “computable” hash functions can express
these.
 Determine which hashes of Z yield the minimal (with respect to Z) collision
probabilities; a sketch of such a search over an example family is given below.
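To illustrate what a search over a class Z might look like, the sketch below takes a small parametric family (64-bit multiply-and-add hashes folded over the bytes of a bit-stream, chosen purely as an example, not as the family we would necessarily settle on) and picks the member with the fewest observed collisions on a sample of bit-streams:

    // FamilySearch.java -- sketch of searching a small parametric family of hash
    // functions for the member with the fewest observed collisions on a sample.
    // The family, parameters, and sample are all illustrative assumptions.
    import java.util.*;

    public class FamilySearch {

        // One member of the family: fold the bytes with a multiplier, keep top bits.
        static long hash(byte[] s, long multiplier, int bits) {
            long h = 0;
            for (byte b : s) h = h * multiplier + (b & 0xff);
            return h >>> (64 - bits);
        }

        static long countCollidingPairs(List<byte[]> sample, long multiplier, int bits) {
            Map<Long, Integer> counts = new HashMap<>();
            for (byte[] s : sample) counts.merge(hash(s, multiplier, bits), 1, Integer::sum);
            long pairs = 0;
            for (int c : counts.values()) pairs += (long) c * (c - 1) / 2;
            return pairs;
        }

        public static void main(String[] args) {
            List<byte[]> sample = new ArrayList<>();   // hypothetical sample of bit-streams
            Random rnd = new Random(11);
            for (int i = 0; i < 5_000; i++) {
                byte[] s = new byte[8 + rnd.nextInt(24)];
                rnd.nextBytes(s);
                sample.add(s);
            }
            long best = 0, bestPairs = Long.MAX_VALUE;
            for (int k = 0; k < 50; k++) {             // try 50 random odd multipliers
                long m = rnd.nextLong() | 1L;
                long pairs = countCollidingPairs(sample, m, 16);
                if (pairs < bestPairs) { bestPairs = pairs; best = m; }
            }
            System.out.printf("best multiplier 0x%x with %d colliding pairs%n", best, bestPairs);
        }
    }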
3.2 Work plan
We propose a phased approach to the problem.
Phase I: The goal of this phase is to provide Client-Company with resources (both
collected and produced) that will help Client-Company to possess state-of-the-art
knowledge on the problem as well as determine the worth of further research and
development on this subject. In particular, we will
 Collect and read through patents, scientific articles, and other related materials.
 Pinpoint possible applications (besides duplicate detection and
synchronization/mirroring) of application generated bit-stream distinguishing
hash functions.
 Statistically determine the bit-stream probabilities produced by typical computer
applications.
 Compare actual collision probabilities of commonly used hash functions such as
CRC and MD5 to probabilistically balanced hash functions in the context of the
said typical applications.
Phase II: If Client-Company decides that the results obtained in Phase I are promising,
we will undertake the task of finding optimal hash functions and efficient ways to
compute these. This task will include
 Finding language models that spawn the bit-streams of typical computer
applications with probabilities comparable to those exhibited by the statistical
experiments of Phase I.
 Determine a large family of efficient hash functions as well as their computational
complexity.
 Find efficient hash functions yielding low collision probabilities in the context of
the said application generated or language model spawned bit-streams.
 Test the said hash functions in a real world situation.
Phase III: If deemed useful, we will write patents covering the methods and hash
functions developed during Phase II.
3.3 Deliverables
Phase I: Background research, testing, and proof of concept.
The goal of this phase is to provide Client-Company with resources (both collected and
produced) that will help Client-Company to possess state-of-the-art knowledge on the
problem as well as assess the worth of further research and development on this subject. In
short, we wish to determine if we can find hash functions that will be both more effective
(low collision probabilities) and more efficient (low hash size and computational
complexity) than existing ones.
The deliverables include:
 A website where various collected or produced resources will be organized and
readily available through hyperlinks. These resources may include
o This proposal
o Reports
o Patents
o Scientific articles
o Tutorials
o Webpages
o Software
o Blogs
o Contacts
o Java programs
o Progress log
 Java modules for
o Studying the structure of computer generated bit-streams
o Computing collision probabilities that various hash functions yield when
applied to a set of input bit-streams

o Automatically generating graphs comparing collision probabilities for
different hashes and bit-stream families
o Determining (absolutely or statistically) minimally colliding hash functions
when given either
 A set of bit-streams
 A bit-stream probability distribution
o Testing the actual speed of various general classes of hash functions
o Writing fast hashes, testing their collision probabilities, and comparing these to
those of commonly used hashes.
 A final report presenting our findings.
Phase II: Finding optimal hash functions and efficient ways to compute them.
The objective of this phase is to find efficient ways to compute hashes reaching the low
collision probabilities which were put forward in Phase I. In order to do so, we will need
to find good language models, i.e. models which spawn desired bit-streams with similar
probabilities to those found in actual application generated data. Further, though
probabilistically balanced hashes may easily be defined, we must find an efficient scheme
to compute these, or at least hashes that are nearly probabilistically balanced. This will
require some research and testing.
More details on Phase II will be provided after Phase I is completed. For the time being,
these are the projected deliverables:

Java modules for
o Generating bit-streams given various language models
o Determining how “good” a language model is at approximating the structure
and probability of typical bit-streams
o Computing collision probabilities that various hash functions yield when
applied to a language model (generating typical bit-streams)
o Determining (absolutely or statistically) minimally colliding hash functions
when given a language model
o Finding a minimally colliding hash within a family of hashes.
o Determining if a given hash is computable (given restrictions on the type
of hashes to be used)
o Automatically finding the minimally colliding hash of a specified bit-stream
type (defined by a file type or language model) within a family of hashes,
in a given time limit.
o Computing the best hashes found for a list of file types.
 A final report, presenting our findings.
3.4 Further details
Phase I (non-exhaustive) list of tasks:
1. Collect literature (patents, books, scientific articles) on hash functions.
2. Collect literature (patents, books, scientific articles) on language statistics.
3. Collect literature (patents, books, scientific articles) on applications of hash functions
in this context (duplicate detection, synchronization/mirroring, content hashing, etc.).
4. Determine a typical set X of bit-stream families of interest. For example, bit-streams
created by typical use of given programs (e.g. those generating .doc, .xls, .ppt, .pdf,
etc.). This set X
a. Should not be too large
b. Should contain the most frequent document types the hashes will be
used on.
5. Conduct statistical experiments on bit-streams of the different families listed in X,
created by typical use of the chosen programs, so as to infer the bit-stream probabilities
(BSPs) of these different families.
6. Compute the collision probabilities (for CRCs, MD5s, and possibly other hash
functions) inferred by the statistically obtained bit-stream probabilities.
7. Test CRCs and MD5s directly on bit-streams of the families of X and record the
collision probabilities thus obtained.
8. Compare collision probabilities obtained in 6. and 7. with the collision probabilities
of the balanced and probabilistically balanced situations.
9. Prove that probabilistically balanced hashes (PBHs) yield minimal collision
probability.
10. Write a module that will find PBHs for given BSPs.
11. Write modules that will test customary hashes (CRC, MD5, etc.) against other
possible hashes.
12. Write report presenting our findings.
13. Write a document and/or design a website where various resources (literature, webpages,
software, blogs, contacts, etc.) on the subject will be organized and readily available
through hyperlinks.
Phase II (non-exhaustive) list of tasks:
14. Determine a list of language models (LMs) W that will be used/tested in this project.
15. Figure out how the models of W influence the bit-stream probabilities.
16. Write modules that can compute the bit-stream probabilities inferred by the models of
W.
17. Measure how well an LM fits the actual bit-stream probabilities.
18. Write a module that will find PBHs for given LMs.
19. Write modules that will take a bit-stream probability distribution and return the
closest fitting (in the sense of 17.) model of W (along with the closest fitting
parameters of the latter).
20. Determine a class Z of hash functions we wish to use. These should be able to be
expressed by specific low level operations and look-up tables.
21. Test the speed of the different operations and lookup tables that will be used by
hashes of Z.
22. Using 21., establish a measure for how efficient the hashes of Z are (should include
time and space considerations).
23. Find mathematical and programmatic ways to relate PSPs to efficient hashes of Z.
24. Find mathematical and programmatic ways to relate BSPs to efficient hashes of Z.
25. Find mathematical and programmatic ways to relate LMs to efficient hashes of Z. For
example, suppose that Z contains only “subset hashes”, i.e. those corresponding to a
fixed size subset of the bits of the bit-streams. It is straightforward to see that in this
case, if the LM implies a neighborhood correlation of the bits (or chunks of bits) of
the bit-stream (i.e. whereby a given bit is correlated to its neighboring bits with a
strength increasing with the closeness of these neighbors), then the optimal hash will
be one where the bits retrieved for the hash are farthest apart (a sketch of such a
subset hash is given after this list).
26. Find mathematical and programmatic ways to relate efficient hashes of Z to PSPs.
27. Find mathematical and programmatic ways to relate efficient hashes of Z to BSPs.
28. Find mathematical and programmatic ways to relate efficient hashes of Z to LMs.
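To illustrate the subset hashes of item 25, the sketch below builds a hash from k bit positions spread as evenly as possible across the bit-stream, which is the preferred choice when neighboring bits are strongly correlated; the input and parameters are hypothetical:

    // SubsetHash.java -- sketch of the "subset hash" idea of item 25: build the hash
    // from k bit positions spread as far apart as possible across the bit-stream.
    public class SubsetHash {

        // Extract k evenly spaced bits from the bit-stream and pack them into a hash.
        static long subsetHash(byte[] stream, int k) {
            int totalBits = stream.length * 8;
            long h = 0;
            for (int j = 0; j < k; j++) {
                int pos = (int) ((long) j * totalBits / k);    // evenly spread bit positions
                int bit = (stream[pos / 8] >> (7 - pos % 8)) & 1;
                h = (h << 1) | bit;
            }
            return h;
        }

        public static void main(String[] args) {
            byte[] stream = "To whom it may concern:".getBytes();   // hypothetical bit-stream
            System.out.println(Long.toBinaryString(subsetHash(stream, 16)));
        }
    }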