Analysis of an HMAC Based Database Encryption Scheme
Brad Baker
7/24/2009
CS592 through CS960 independent study
University of Colorado, Colorado Springs
1420 Austin Bluffs Pkwy
Colorado Springs, CO 80918
bbaker@uccs.edu
ABSTRACT
Encryption in database systems is an important topic for
research, as secure and efficient algorithms are needed that
provide the ability to query over encrypted data and allow
optimized encryption and decryption of data. Values in a
database must be encrypted and decrypted separately for
insertion or update, so traditional cipher chaining methods
for symmetric encryption are not ideal. This paper presents
an analysis of one proposed database encryption scheme for
integer values, which is based on the keyed Hash Message
Authentication Code (keyed HMAC) operation over
numeric bucket and remainder values. This database
encryption scheme was proposed in “How to Construct a
New Encryption Scheme Supporting Range Queries on
Encrypted Database” by Dong Hyeok Lee, You
Jin Song, Sung Min Lee, Taek Yong Nam and Jong Su Jang,
presented at the IEEE 2007 International Conference on Convergence
Information Technology. This analysis includes an
implementation and test of the proposed algorithm and
identification of potential areas for future research on this
topic.
The HMAC and bucket based encryption scheme analyzed
for this project has several advantages. The encryption does
not preserve plaintext ordering, it protects against inference
attacks, and the strength of the algorithm can be improved
by using different hash mechanisms for the HMAC
operation. Additionally, individual values can be securely
encrypted and decrypted independently without decreasing
security, so block chaining and feedback in encryption are
not needed. Several performance and configuration
challenges are presented with the encryption scheme, and
discussed with test results.
Categories and Subject Descriptors
C. [Data]: Data Encryption – Public key cryptosystems
General Terms
Experimentation, Security
Keywords
Database Encryption, keyed HMAC
1. INTRODUCTION
Database encryption is an important topic for continuing
research, as databases are commonly used for storing and
processing large amounts of sensitive data in various
industries. Goals include efficient and secure algorithms
that allow some forms of querying over encrypted data,
simple insertion or update of encrypted values, and
selective decryption of data. In the context of most database
systems, symmetric algorithms are preferred for efficiency
and the dual party nature of asymmetric algorithms is not
needed. Typically the database administrator can use a
secret key that is shared among authorized users.
Traditional methods to strengthen symmetric algorithms
such as cipher chaining or cipher feedback are problematic
in databases because ciphertext values must be independent
of one another. This project studied and implemented a
proposed database encryption algorithm that employs
symmetric, hash based encryption. This algorithm was
proposed in “How to Construct a New Encryption Scheme
Supporting Range Queries on Encrypted Database”
by Dong Hyeok Lee, You Jin Song, Sung Min
Lee, Taek Yong Nam and Jong Su Jang, presented at the IEEE 2007
International Conference on Convergence Information
Technology. The referenced paper reviews several
algorithms for database encryption, and presents a
new algorithm for numeric encryption using a bucket and
Keyed Hash Message Authentication Code (Keyed HMAC)
based process. This algorithm was implemented and tested
for this project in C using the SHA-1 hash algorithm.
2. THE ENCRYPTION SCHEME
2.1 Summary of HMAC
Keyed HMAC, or Hash Message Authentication Code, is a
process that uses a secret key and a hash algorithm such as
MD5 or SHA-1 to generate a message authentication code.
This process is symmetric, so two parties communicating
with HMAC must share the same secret key. By using a
hash algorithm in conjunction with a key, HMAC prevents an
unauthorized user from modifying the message or the
digest without being detected. This can protect against
man-in-the-middle attacks on the message, but it is not
designed to encrypt the message itself; it only protects it from
unauthorized update. HMAC can be defined as a function
that takes a key and a plaintext message as input. Any hash
algorithm can be used, including MD5, SHA-1, SHA-256,
etc. The HMAC algorithm defines two padding constants,
the inner pad and the outer pad, with values (0x3636…)
and (0x5c5c…) respectively, each expanded to the block
size of the hash algorithm. To calculate the HMAC, first
the exclusive-or of the key and the inner pad is found. This
result is prepended to the message to be
processed. The result is then hashed with the chosen hash
algorithm, producing an intermediate digest. In the next
step, the exclusive-or of the key and the outer pad is
found, and that result is prepended to the
intermediate digest. The result is hashed again, producing
the final message authentication code. This operation is
summarized in Figure 1, where {⊕} denotes exclusive-or,
{++} denotes concatenation, {K} is the secret key, {m} is
the plaintext message, and {H} is the hash function.
Figure 1 - HMAC operation
Each calculation of the HMAC digest requires running the
underlying hash function twice. The output of HMAC is a
binary code, equal in length to the hash function digest.
This code can only be reproduced with the same key and
message, and the cryptographic strength is based on the
strength of the hash algorithm, which can be modified if
required.
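The two-pass construction described above can be checked against a library implementation. The following is a minimal sketch in Python (the project itself was written in C). The function name hmac_sha1_manual and the sample key and message are illustrative assumptions, and the zero-padding of the key to the block size follows the standard HMAC definition rather than anything stated in this paper.

```python
import hashlib
import hmac


def hmac_sha1_manual(key: bytes, message: bytes) -> bytes:
    """Compute HMAC-SHA1 step by step, following the description above."""
    block_size = hashlib.sha1().block_size  # 64 bytes for SHA-1
    # Standard HMAC rule: keys longer than one block are hashed first.
    if len(key) > block_size:
        key = hashlib.sha1(key).digest()
    # The key is zero-padded to the block size before the XOR steps.
    key = key.ljust(block_size, b"\x00")
    inner_key = bytes(b ^ 0x36 for b in key)  # K xor inner pad
    outer_key = bytes(b ^ 0x5C for b in key)  # K xor outer pad
    # First hash: (K xor inner pad) ++ m
    inner_digest = hashlib.sha1(inner_key + message).digest()
    # Second hash: (K xor outer pad) ++ intermediate digest
    return hashlib.sha1(outer_key + inner_digest).digest()


if __name__ == "__main__":
    key, message = b"999", b"test"  # sample values only
    manual = hmac_sha1_manual(key, message)
    library = hmac.new(key, message, hashlib.sha1).digest()
    print(manual == library, len(manual))
```

The manual digest should agree with the one from the standard hmac module and is 20 bytes (160 bits) long, which makes the two hash invocations per HMAC call explicit.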
2.2 Summary of the Encryption Scheme
The data encryption scheme studied for this project makes
use of the HMAC operation recursively for encryption and
decryption of integer data. The scheme works primarily for
positive integer values; however, it can be extended to real
numbers by scaling with a power of 10 to convert the real
number to an integer. Negative numbers cannot
be stored with the algorithm as designed due to the use of
modular arithmetic. Real numbers and negative numbers
could be encrypted if a method is designed to
store the scale factor and sign of the plaintext data. On the
encryption side the algorithm performs a pre-processing
step, a transformation step, and storage of the encrypted
data in a database. On the decryption side the algorithm
uses an inverse transformation and post-processing to
reproduce the plaintext. The encryption and decryption
processing steps are shown in Figure 2.
Figure 2 - Processing steps in encryption/decryption
3. PROCESSING DETAILS
3.1 Pre-Processing Step
The pre-processing step performs modulus arithmetic on
the plaintext integer, calculating the remainder or residual
{r} with the formula {r = m mod Sb}, where {m} is the
plaintext and {Sb} is a predefined bucket size. After
calculating the residual, the bucket ID {Ib} is found using
the formula {Ib = (m – r) / Sb}. As an example, when
processing the integer 485,321 with a 20,000 bucket size,
the residual is 5,321 and the bucket ID is 24. The bucket ID
and the residual integer are encrypted separately in the
transformation phase. The selection of bucket size is an
important factor in the application of this encryption
scheme; it will affect efficiency and validity of the
encryption process if the bucket size is incorrectly selected.
The modulus operation and calculation of bucket ID
make it difficult to encrypt negative integers.
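The decomposition can be illustrated with a short Python sketch. The function name preprocess is a hypothetical helper, and rejecting negative input is an assumption that mirrors the limitation noted above.

```python
def preprocess(m: int, bucket_size: int) -> tuple[int, int]:
    """Split a non-negative integer into (bucket ID, residual)."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    if m < 0:
        # Assumption: negative plaintext is rejected, as the scheme
        # as designed does not handle it.
        raise ValueError("negative plaintext is not supported")
    residual = m % bucket_size                  # r = m mod Sb
    bucket_id = (m - residual) // bucket_size   # Ib = (m - r) / Sb
    return bucket_id, residual


if __name__ == "__main__":
    print(preprocess(485_321, 20_000))  # (24, 5321)
    print(preprocess(336_789, 1_000))   # (336, 789)
```

Python's % operator returns a non-negative result for a positive modulus, whereas the C remainder takes the sign of the dividend, so the explicit check keeps the behavior the same in either language.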
3.2 Encryption Transformation
The next phase of the algorithm is the transformation step
which provides the primary encryption operation. The
inputs for this step include a secret key, a seed value, the
plaintext bucket ID and the residual. Keyed HMAC is used
recursively to encrypt the bucket ID and residual
independently. The encrypted bucket ID is found by
calculating the HMAC repeatedly N times, where N is
equal to the bucket ID. On the first iteration, the secret key
and a predefined seed value are used as input into the
HMAC operation. For successive iterations, the output of
the previous HMAC is used as input into the next iteration,
along with the secret key. This is repeated until bucket ID
iterations are completed. For example, in the case of bucket
ID equal to 24, HMAC will be executed recursively 24
times, using a predefined seed value for the initial message.
The result is labeled {T(Ib)K}, designating the
transformation on bucket ID {Ib} using key {K}. In this
way, the bucket ID is not directly encrypted, but the
execution of HMAC is based on the value of the bucket ID.
The encrypted value for the residual is found in a similar
operation, differing only in the secret key that is used. Each
bucket ID and residual forms a pair from the decomposition
of the plaintext. When encrypting the residual value, the
corresponding bucket ID is prepended to
the secret key to form a new key. After forming the new
key, the recursive HMAC operation is the same. Beginning
with the seed, the digest is calculated N times where N is
equal to the value of the residual. This result is labeled
{T(r)Ib||K}, designating the transformation on residual {r}
using the composite key {Ib||K}. As an example, consider the
encryption of integer 336,789 with a bucket size of 1,000.
The bucket ID is 336 and the residual is 789. If using the
SHA-1 hash algorithm, a key of “999”, and a seed value of
“test”, HMAC will be executed recursively 336 times for
the bucket ID, and 789 times for the residual. Both
recursions use “test” as the initial HMAC message, but the
bucket ID uses key {K} and the residual uses key {Ib||K}.
The resulting encrypted values are {T(Ib)K} =
“2CI0b3pNB8KbiCIUbKkOd2ciRAc=” and {T(r)Ib||K} =
“PynDpvSFSSUZCqk3yVY8J2g3Ks4=”, using base64
encoding. Note that the output in this situation is two 28
character base64 encoded strings, which is a result of the
160 bit digest output of the SHA-1 hash used with HMAC
in this project. The pseudocode for the encryption
transformation is presented in Figure 3.
Procedure Transformation T(x)K
Begin
  t := seed
  For j = 1 to x
    t := HMAC(t)K
  Endfor
  return t
End.
Figure 3 – Pseudocode for encryption transformation
The transformation is used twice, once for the bucket ID
and once for the residual. The resulting value {t} is the
encrypted data. The values for {T(Ib)K} and {T(r)Ib||K}
calculated from the encryption transformation represent the
ciphertext and are stored in the database. Note that a single
integer data field is replaced with two data fields, and an
increased amount of data. Considering a typical range for
long integer values, 4 bytes of data will be replaced with 56
bytes of data if using base64 encoding on ciphertext. This
is a 14-fold increase in stored data.
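The transformation of Figure 3 can be sketched in Python as follows. The names transform and encrypt are hypothetical. Two details are assumptions because this paper does not specify them: the composite key {Ib||K} is formed from the decimal text of the bucket ID followed by the key bytes, and each iteration feeds the raw digest bytes into the next HMAC call. For that reason the output is not expected to reproduce the sample ciphertext strings quoted above.

```python
import base64
import hashlib
import hmac


def transform(count: int, key: bytes, seed: bytes) -> bytes:
    """Apply HMAC-SHA1 `count` times, starting from the seed (Figure 3)."""
    t = seed
    for _ in range(count):
        t = hmac.new(key, t, hashlib.sha1).digest()
    return t


def encrypt(m: int, bucket_size: int, key: bytes, seed: bytes) -> tuple[str, str]:
    """Return base64 ciphertexts for the bucket ID and the residual."""
    bucket_id, residual = divmod(m, bucket_size)
    # Assumption: Ib||K is the decimal bucket ID followed by the key bytes.
    composite_key = str(bucket_id).encode() + key
    enc_bucket = transform(bucket_id, key, seed)
    enc_residual = transform(residual, composite_key, seed)
    return (base64.b64encode(enc_bucket).decode(),
            base64.b64encode(enc_residual).decode())


if __name__ == "__main__":
    c_bucket, c_residual = encrypt(336_789, 1_000, b"999", b"test")
    # Each 20-byte SHA-1 digest encodes to 28 base64 characters.
    print(len(c_bucket), len(c_residual))
```

In this sketch a count of zero returns the seed unchanged, so a bucket ID or residual of zero would need separate handling in a real deployment.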
3.3 Decryption Transformation
To reproduce the plaintext from the ciphertext, an inverse
transformation is defined. Because the algorithm uses a
hash as the basis of its encryption, a direct inverse cannot be
calculated. The inverse transformation must search through
potential bucket ID and residual values. The inverse
transformation uses the set of possible bucket IDs as a
range for the search process. This requires that the range
of possible bucket IDs be known beforehand, possibly from
domain knowledge or the data type being encrypted.
Because the set of all bucket IDs is processed even for
decryption of one value, it is more efficient on a cost per
record basis to decrypt all records at the same time.
In the decryption transformation, the first step is finding the
bucket ID of the ciphertext elements. This operation will
reproduce the value of {Ib} from ciphertext {T(Ib)K}. The
same seed and key value are used in the HMAC operation,
and this operation is executed N times, where N is the
number of possible buckets in the domain. For example, if
using a bucket size of 2,000 in a domain where the
maximum data value is 1,000,000, there are 500 possible
bucket values and HMAC is executed 500 times. In this
way the upper limit of allowable data values must be
known in order to provide a limit to the HMAC search
loop. While the N iterations of HMAC are calculated, the
input for each calculation is based on the output of the
previous iteration. Each time, the resulting value is
compared against all encrypted bucket IDs for a match. If a
match is found, the bucket ID plaintext is equal to the
number of loops executed in the search.
Once a bucket ID is found, a similar search can be made for
the residual value. Once again, a new key is constructed by
prepending the decrypted bucket ID to the
secret key, and HMAC is calculated N times, where N is
equal to the bucket size. The bucket size defines all
possible residual values. Once a match is found between
the HMAC output and the encrypted residual value, the
plaintext residual is equal to the number of loops executed
in the search. The pseudocode for the decryption
transformation is shown in Figure 4.
Procedure Inverse Transformation T-1(T)K
Begin
  u := seed
  For i = 1 to MAX(T)
    u := HMAC(u)K
    If i >= MIN(T) Then
      find Ti = u ?
      If find any Ti Then ITi := i
    Endif
  Endfor
  return IT
End.
Figure 4 – Pseudocode for decryption transformation
The inverse transformation is executed twice for each
plaintext value, once to find the bucket ID and once to find
the residual. The value of MAX(T) in pseudocode
represents the maximum number of possible buckets when
searching for {T-1(Ib)K}, and it represents the bucket size
when searching for {T-1(r)Ib||K}. In the pseudocode, Ti
represents the encrypted values, and ITi represents the
resulting plaintext. If an encrypted value matches the
HMAC output, then the number of loop iterations is the
plaintext value, for either bucket ID or residual.
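The search in Figure 4 can be sketched in Python as follows. The names transform, inverse_transform and decrypt are hypothetical, the composite key encoding (decimal bucket ID followed by the key bytes) is an assumption, and the explicit test for a count of zero is an addition that does not appear in the pseudocode.

```python
import hashlib
import hmac


def transform(count: int, key: bytes, seed: bytes) -> bytes:
    """Forward transformation: HMAC-SHA1 applied `count` times to the seed."""
    t = seed
    for _ in range(count):
        t = hmac.new(key, t, hashlib.sha1).digest()
    return t


def inverse_transform(targets: set[bytes], key: bytes, seed: bytes,
                      max_count: int) -> dict[bytes, int]:
    """Walk the HMAC chain once and map each matched ciphertext to its count."""
    found = {}
    u = seed
    if u in targets:  # assumption: a count of zero leaves the seed unchanged
        found[u] = 0
    for i in range(1, max_count + 1):
        u = hmac.new(key, u, hashlib.sha1).digest()
        if u in targets:
            found[u] = i  # the loop counter is the plaintext value
    return found


def decrypt(enc_bucket: bytes, enc_residual: bytes, bucket_size: int,
            max_buckets: int, key: bytes, seed: bytes) -> int:
    """Recover m; raises KeyError if a ciphertext lies outside the search range."""
    bucket_id = inverse_transform({enc_bucket}, key, seed, max_buckets)[enc_bucket]
    composite_key = str(bucket_id).encode() + key  # assumed encoding of Ib||K
    residual = inverse_transform({enc_residual}, composite_key, seed,
                                 bucket_size - 1)[enc_residual]
    return bucket_id * bucket_size + residual


if __name__ == "__main__":
    key, seed = b"999", b"test"
    enc_b = transform(123, key, seed)
    enc_r = transform(45, b"123" + key, seed)
    print(decrypt(enc_b, enc_r, 100, 200, key, seed))  # 12345
```

Because inverse_transform accepts a set of ciphertexts, one pass over the bucket ID chain can resolve the bucket IDs of many records at once, which is the batching effect described above.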
3.4 Post-Processing Step
The post processing step reverses the modulus operation
from the pre-processing step to generate the original
plaintext from the decrypted bucket ID and residual. The
plaintext value {m} is found using {m = Ib * Sb + r}, where
{Ib} is the decrypted bucket ID, {r} is the decrypted
residual, and {Sb} is the bucket size. In the post-processing
step, any scaling of plaintext or encoding of negative values
is reversed if these actions were performed in the pre-processing step.
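As a brief worked check of the recombination formula, the following Python sketch rebuilds the two example values used earlier in this paper. The function name postprocess and the range validation are illustrative assumptions.

```python
def postprocess(bucket_id: int, residual: int, bucket_size: int) -> int:
    """Rebuild the plaintext as m = Ib * Sb + r."""
    if bucket_id < 0 or not 0 <= residual < bucket_size:
        # Assumption: out-of-range parts are treated as a decryption error.
        raise ValueError("bucket ID or residual out of range")
    return bucket_id * bucket_size + residual


if __name__ == "__main__":
    print(postprocess(24, 5_321, 20_000))  # 485321
    print(postprocess(336, 789, 1_000))    # 336789
```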
4. EXPERIMENTATION
4.1 Implementation
In this project, the HMAC based database encryption
scheme was implemented in C, using the SHA-1 hash
algorithm and base64 data encoding. The HMAC and
SHA-1 algorithms were implemented using existing source
under the GPL license, distributed by the Free Software
Foundation. Existing source was used to avoid errors in
implementing SHA-1 and HMAC, and to focus resources
on the proposed encryption scheme and testing.
Several challenges were encountered in the implementation
based on data handling in C, including memory
management, null byte processing, and data encoding
including base64. For the purposes of this project,
configuration parameters such as the bucket size, maximum
number of buckets, and maximum number of encrypted
records were defined in compiled constant values. A
production version of the algorithm should allow dynamic
specification of these parameters.
4.2 Testing
The test phase of the project focused on three aspects: data
validity, program efficiency, and ideal bucket size. The
testing strategy included two input data sets, each with
2,000 random integer values. One data set contained very large
integers, ranging from 1,000,000 to 999,000,000; the other
data set contained small integers ranging from 1 to 999. These
datasets were run through encryption and decryption
transformations in five program configurations. Four of the
configurations supported a maximum integer value of
2,500,000,000. The fifth configuration supported a
maximum integer value of 5,000,000. The maximum
supported integer is equal to the bucket size multiplied by
the number of possible buckets. The product represents the
largest value that can be encrypted safely by the algorithm.
The five configurations used for testing were:
- 500,000 bucket size, 5,000 possible buckets
  (2.5 billion supported values)
- 50,000 bucket size, 50,000 possible buckets
  (2.5 billion supported values)
- 5,000 bucket size, 500,000 possible buckets
  (2.5 billion supported values)
- 500 bucket size, 5,000,000 possible buckets
  (2.5 billion supported values)
- 500 bucket size, 10,000 possible buckets
  (5 million supported values)
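The supported maximum for each configuration is simply the bucket size multiplied by the number of possible buckets. The short Python sketch below recomputes it for the five configurations and checks the upper end of each test data set against it. The names CONFIGS, max_supported and fits are hypothetical.

```python
# (bucket size, possible buckets) for the five test configurations
CONFIGS = [
    (500_000, 5_000),
    (50_000, 50_000),
    (5_000, 500_000),
    (500, 5_000_000),
    (500, 10_000),
]


def max_supported(bucket_size: int, possible_buckets: int) -> int:
    """Largest integer the configuration is stated to support."""
    return bucket_size * possible_buckets


def fits(value: int, bucket_size: int, possible_buckets: int) -> bool:
    """True if the value does not exceed the supported maximum."""
    return 0 <= value <= max_supported(bucket_size, possible_buckets)


if __name__ == "__main__":
    for size, buckets in CONFIGS:
        print(size, buckets, max_supported(size, buckets),
              fits(999, size, buckets), fits(999_000_000, size, buckets))
```

Only the fifth configuration fails to cover the upper end of the large data set (999,000,000).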
Distinct differences in performance between different
bucket configurations were observed. Additionally, a
relationship between the distribution of plaintext values and
the efficiency of the program when using different
configurations was discovered. The results for the five
program configurations for encryption and decryption are
presented in Table 1 and Table 2. The
values for time elapsed are in minutes and seconds for the
encryption and decryption operation over the 2,000 integer
values.
In the small dataset results presented in Table 1, the
encryption operation was much faster than the decryption
operation. This is due to the small or non-existent bucket
ID values, and the small residual values. The recursive
HMAC operation required a small number of iterations to
calculate the encrypted values.
test  data set  test set     mode     time elapsed  data validity
1     small     500K bucket  encrypt  00:02.2       match
2     small     500K bucket  decrypt  29:36.4       match
3     small     50K bucket   encrypt  00:01.8       match
4     small     50K bucket   decrypt  03:05.2       match
5     small     5K bucket    encrypt  00:01.8       match
6     small     5K bucket    decrypt  01:27.1       match
7     small     500 bucket   encrypt  00:00.9       match
8     small     500 bucket   decrypt  11:35.8       match
9     small     5M max int   encrypt  00:00.9       match
10    small     5M max int   decrypt  00:03.2       match
Table 1 - Test results for small data set
The decryption operation is less efficient in all tests
because of the large number of residual values that must be
searched to find the plaintext. As the bucket size decreases
across different tests, the number of residuals also
decreases and the decryption operation is faster. Once the
500 bucket size limit is reached, decryption slows again
because plaintext values require a bucket ID search in
addition to residual searching. An important point in the
small data set result is the improved performance in the 5
million supported value test (500 bucket size, 10,000 possible
buckets). This is because of the smaller search domain in
that test relative to the original integer data. For maximum
performance, the bucket size multiplied by the possible
buckets should be as small as possible while still able to
represent all possible data values.
test  data set  test set     mode     time elapsed  data validity
11    large     500K bucket  encrypt  14:28.3       match
12    large     500K bucket  decrypt  29:49.1       match
13    large     50K bucket   encrypt  02:01.9       match
14    large     50K bucket   decrypt  03:06.6       match
15    large     5K bucket    encrypt  05:50.7       match
16    large     5K bucket    decrypt  01:27.5       match
17    large     500 bucket   encrypt  56:56.4       match
18    large     500 bucket   decrypt  11:31.6       match
19    large     5M max int   encrypt  59:01.2       no rows match
20    large     5M max int   decrypt  00:03.9       no rows match
Table 2 - Test results for large data set
In the large data set tests presented in Table 2 the
encryption operation was also faster than decryption with
two exceptions. In the case of the 500 bucket size test, there
were 5 million possible buckets, and each plaintext value
executed the HMAC operation for all possible buckets
during encryption. The decryption process was faster with a
500 bucket size because the bucket ID decoding step was
done once, and yielded all plaintext bucket values. The
residual decryption step for each bucket ID was fast
because of the small bucket size.
In the case of the 5 million supported value test, encryption
time was slow for the same reason as the 500 bucket size
test: the domain of bucket IDs was very large, and all
values exceeded the maximum possible number of buckets.
Decryption was extremely fast because it could find no
matching rows within the 5 million possible records. This
demonstrates that the bucket size multiplied by the possible
number of buckets must support all integer values found in
the domain. Generally, a calculated bucket ID cannot
exceed the number of possible buckets.
Another important point seen in the large data set results is
the sequence from 500K bucket size, to 50K bucket size, to
5K bucket size. In the 500K bucket size, both encryption
and decryption times were increased because of the large
amount of residual value recursion and searching. The 50K
bucket size balances the number of records evenly between
bucket ID and residuals, resulting in improved encryption
and decryption times. With the 5K bucket size, the
decryption time is much faster due to the small number of
residual values to search against, but increased time is
invested in the encryption step. From these results, it
appears that a smaller bucket size improves decryption
performance, and a moderate bucket size improves
encryption performance.
4.3 Analysis
Several points to guide configuration and use of the
algorithm were obtained from testing including:
- (Bucket size * possible buckets) should be small
- (Bucket size * possible buckets) should be greater
than the maximum desired integer value
- Smaller bucket sizes improve decryption
performance
- Moderate bucket sizes improve encryption
performance
It is apparent that the algorithm is computationally
intensive for large integers because of the large number of
recursive hash calculations, and the exhaustive search
strategy for decryption. For the encryption side, if {N} is
the number of plaintext records, {Pb} is the number of
possible buckets, and {Sb} is the bucket size, encryption
could require up to {N*(Pb*2)*(Sb*2)} hash operations,
because each HMAC operation includes two hash
calculations. For a dataset of one million records, with a
50K bucket size and 50K possible buckets, the maximum
number of hash operations would be 1x10^16. The actual
number of operations would decrease if the plaintext values
were smaller in the set of {Pb * Sb} integers.
For decryption, the bucket recursion is calculated once for
all plaintext values, but the residual recursion is calculated
for each plaintext value. The decryption process results in a
maximum number of hash operations of {(Pb*2) +
N*(Sb*2)}. Using the same dataset parameters as above,
this would result in 1x10^11 hash operations, a significant
improvement but still an intensive operation.
The proposed encryption scheme can be improved to
include the efficiency seen in the decryption process. If
encryption iterated over the number of possible bucket IDs
once, the performance should be similar to the decryption
process. In the discussion of performance, it is clear that
there is a benefit to encrypting or decrypting a large
number of values at one time. While the algorithm can
support encryption and decryption of individual values, the
performance per record is worse in that scenario. The
underlying cost for one record in decryption of the above
example would be 200,000 hash operations. When
decrypting one million records, the average cost is 100,000
hash operations per plaintext value. These results
assume the full range of possible integer values, from 1
to {Pb * Sb}. The performance could be improved by
reducing the maximum and minimum integer values to fit a
limited problem domain. For example, if only values from
10,000 to 100,000 are required, the search over residual
values will be more efficient. However, this provides an
attacker with additional information about the data being
encrypted.
In further analysis of the ciphertext, it is not clear how
range queries would be executed over encrypted data. The
encryption scheme was presented as supporting range
queries and some aggregation queries such as MIN, MAX
and COUNT. Equality queries or list criteria could be
easily implemented by selectively encrypting the criteria
values with the same seed and secret key, and comparing to
the stored ciphertext data. Because the ciphertext is
unordered hash output which has no relation to the
plaintext data, range and aggregation operations aren’t
possible over the encrypted data. The solution to this
problem would require decryption of all ciphertext data in a
database table in order to execute the query. Based on the
100,000 average hash operations per value, this could be a
computationally intensive process for large databases.
As noted previously, the encryption output contains two
base64 strings, for a total of 56 bytes of character data
storage in the database. In many systems, 4 bytes are
required to store an integer value in the 2.5 billion range. This is a
factor of 14 increase to the data storage requirements, from
10 MB to 140 MB for a given integer table. For very large
databases, an increase of this proportion can be difficult to
implement.
5. FUTURE WORK
5.1 Algorithm Improvements
There are several ways to improve the algorithm as
presented. Because this encryption scheme uses a hash
method to create ciphertext, it is not possible to incorporate
data ordering into the output. This is good because it
prevents inference attacks and deduction based on known
plaintext/ciphertext pairs. The best way to support queries
with a method like this is to improve efficiency, primarily
on the decryption side.
It appears that recursive HMAC was included in the
encryption scheme to support the exhaustive search
process. This process does not improve the security of the
algorithm, and does not provide querying support. An
alternative process that can be considered in future research
is a non-recursive HMAC based algorithm. Rather than
using a seed value, the process would calculate HMAC
based on the bucket ID and residual directly. This process
would still employ the secret key, and the modified key
concatenated with bucket ID. The performance of the
encryption process would be improved in this situation.
However, the decryption process would still require
exhaustive searching across the bucket ID and residual
numeric domains to identify plaintext values.
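The non-recursive alternative can be sketched in Python as follows. This is an illustration of the idea outlined above, not a tested design. The names mac, encrypt_direct and decrypt_direct are hypothetical, and both the decimal text encoding of the integers and the layout of the composite key are assumptions.

```python
import hashlib
import hmac


def mac(key: bytes, value: int) -> bytes:
    """HMAC-SHA1 of the decimal text of an integer (assumed encoding)."""
    return hmac.new(key, str(value).encode(), hashlib.sha1).digest()


def encrypt_direct(m: int, bucket_size: int, key: bytes) -> tuple[bytes, bytes]:
    """Two HMAC calls per value, independent of the size of m."""
    bucket_id, residual = divmod(m, bucket_size)
    composite_key = str(bucket_id).encode() + key  # assumed form of Ib||K
    return mac(key, bucket_id), mac(composite_key, residual)


def decrypt_direct(enc_bucket: bytes, enc_residual: bytes, bucket_size: int,
                   possible_buckets: int, key: bytes) -> int:
    """Decryption still searches the bucket ID and residual domains."""
    for bucket_id in range(possible_buckets + 1):
        if hmac.compare_digest(mac(key, bucket_id), enc_bucket):
            break
    else:
        raise ValueError("bucket ID not found in the search range")
    composite_key = str(bucket_id).encode() + key
    for residual in range(bucket_size):
        if hmac.compare_digest(mac(composite_key, residual), enc_residual):
            return bucket_id * bucket_size + residual
    raise ValueError("residual not found in the search range")


if __name__ == "__main__":
    cipher = encrypt_direct(336_789, 1_000, b"999")
    print(decrypt_direct(*cipher, 1_000, 500, b"999"))  # 336789
```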
5.2 Potential Multiple Bucket Solution
The encryption and decryption processes of this algorithm
could be improved further by decomposing the original
number into a larger number of smaller values, and
applying HMAC to the smaller values. One model is to use
powers of 1000 to decompose the plaintext into multiple
buckets. For example, if encrypting a value of
122,344,566,788 then four values could be run through
HMAC. Three buckets would have values of 122, 344 and
566, with a residual of 788. Each of these four values lies
in the range 0 through 999, giving a small search range to
identify each plaintext value in decryption. This potential
solution maximizes the effect of small search ranges, but it
also increases the problem of data storage.
In this scenario, four base64 encoded values are stored for a
total of 112 bytes ciphertext per plaintext value. This is a
28-fold increase in data storage requirements, which is
excessive. Additional research is needed to determine if
this drawback can be improved.
In the multiple bucket solution described above, the maximum
number of hash operations for encryption would be
{N*log1000(m)*2 + N*2}, where log1000(m) represents the
number of buckets encrypted for the process. With three
buckets per value, this would
require eight million hash operations for one million
records, an average of eight hash operations per record to
encrypt data, which is an acceptable cost. Pseudocode for
the encryption process is shown in Figure 5 below. This
solution defines the encryption transformation as Tx(m)K,
where the pre-processing and encryption steps are
presented together to illustrate the concept.
Procedure Transformation Tx(m)K
Begin
  d := floor(log1000(m))
  t := m
  For j = d to 1
    ** Find bucket for each power of 1000
    r := t mod 1000^j
    Bj := (t – r) / 1000^j
    t := r
  Endfor
  ** Data Bd to B1 are buckets.
  ** r is the residual, < 1000
  ** Encrypt buckets in d HMAC operations
  ** Key is modified on each loop
  For j = d to 1
    Ej := HMAC(Bj)K
    K := Bj || K
  Endfor
  ** Encrypt the residual value
  ER := HMAC(r)K
  return Ed, Ed-1, ..., E1, ER
End.
Figure 5 – Pseudocode for multiple bucket encryption
In the encryption process presented in Figure 5, the number
of buckets of size 1,000 is found using log1000(m). The
number of encrypted values produced is log1000(m)+1, for
the additional residual. The total number of HMAC
operations used for encrypting N plaintext values is
{N*log1000(m)+N}, making it very efficient. Because the
primary cryptographic strength is the use of a hash
algorithm and a strong secret key, this process will not
decrease the cryptographic security. The decryption process
is demonstrated in pseudocode in Figure 6.
Procedure Inverse Transformation Tx-1(c)K
Begin
  ** c holds the encrypted values cd, ..., c1, c0
  d := MAX(log1000(domain))
  For j = d to 0
    ** Find bucket for each power of 1000
    For i = 0 to 999
      e := HMAC(i)K
      ** test possible buckets vs cipher
      If e = cj Then
        Bj := i
        K := Bj || K
        exit inner loop
      Endif
    Endfor
  Endfor
  r := B0
  m := Bd*1000^d + Bd-1*1000^(d-1) + ... + r
  return m
End.
Figure 6 – Pseudocode for multiple bucket decryption
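The multiple bucket idea of Figures 5 and 6 can be sketched end to end in Python. This is an illustrative sketch under stated assumptions, not a definitive implementation. The names mac, encrypt_multi and decrypt_multi are hypothetical, each value is split into a fixed number of base-1000 groups (the parameter groups, chosen from the domain) so that the decryption side knows how many ciphertexts to expect, and the integers are encoded as decimal text.

```python
import hashlib
import hmac


def mac(key: bytes, value: int) -> bytes:
    """HMAC-SHA1 of the decimal text of an integer (assumed encoding)."""
    return hmac.new(key, str(value).encode(), hashlib.sha1).digest()


def encrypt_multi(m: int, key: bytes, groups: int) -> list[bytes]:
    """One HMAC per base-1000 group, most significant group first."""
    if not 0 <= m < 1000 ** groups:
        raise ValueError("value outside the supported domain")
    parts = [(m // 1000 ** j) % 1000 for j in range(groups - 1, -1, -1)]
    cipher = []
    for part in parts:
        cipher.append(mac(key, part))
        key = str(part).encode() + key  # K := Bj || K, as in Figure 5
    return cipher


def decrypt_multi(cipher: list[bytes], key: bytes) -> int:
    """Search 0..999 for each group, extending the key after each match."""
    m = 0
    for c in cipher:
        for candidate in range(1000):
            if hmac.compare_digest(mac(key, candidate), c):
                m = m * 1000 + candidate
                key = str(candidate).encode() + key
                break
        else:
            raise ValueError("no candidate matched; wrong key or corrupt data")
    return m


if __name__ == "__main__":
    cipher = encrypt_multi(122_344_566_788, b"999", groups=4)
    print(len(cipher), decrypt_multi(cipher, b"999"))  # 4 122344566788
```

Each group costs one HMAC call to encrypt and at most 1,000 calls to decrypt, regardless of the magnitude of the plaintext.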
In the decryption process presented in Figure 6, the only
domain based information that is required is the
MAX(log1000(domain)), representing the largest power of
1,000 that will be supported. For example, in a domain of
one billion, the MAX(log1000(domain)) is three. In a domain
of one trillion, it is four. The remainder of the algorithm
steps through the powers of 1,000 from most significant to
least significant, collecting the plaintext bucket values and
composite key along the way. In the post-processing step,
all bucket values and the residual are combined into the
plaintext. The decryption process requires more hash
operations than the encryption side, because searching is
still performed. The maximum number of operations for
decryption would be {N*MAX(log1000(domain))*1000*2}.
This is due to two hash operations per HMAC process,
1,000 possible searches per bucket, MAX(log1000(domain))
buckets, and N ciphertext values. For one million ciphertext
values in a domain of one billion, the number of decryption
hash operations would be 6x10^9 or 6,000 operations per
record on average.
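The operation counts quoted for the multiple bucket method can be reproduced with a few lines of arithmetic. The Python sketch below evaluates the formulas as stated in this paper, together with the decryption bound {(Pb*2) + N*(Sb*2)} given in Section 4.3 for comparison. The function names are hypothetical.

```python
def original_decrypt_hashes(n: int, possible_buckets: int, bucket_size: int) -> int:
    """(Pb*2) + N*(Sb*2): decryption bound for the original scheme."""
    return possible_buckets * 2 + n * bucket_size * 2


def multi_encrypt_hashes(n: int, buckets: int) -> int:
    """N*log1000(m)*2 + N*2: multiple bucket encryption."""
    return n * buckets * 2 + n * 2


def multi_decrypt_hashes(n: int, buckets: int, candidates: int = 1000) -> int:
    """N*MAX(log1000(domain))*1000*2: multiple bucket decryption."""
    return n * buckets * candidates * 2


if __name__ == "__main__":
    n = 1_000_000
    print(original_decrypt_hashes(n, 50_000, 50_000))  # 100000100000
    print(multi_encrypt_hashes(n, 3))                  # 8000000
    print(multi_decrypt_hashes(n, 3))                  # 6000000000
    print(multi_decrypt_hashes(n, 3) // n)             # 6000
```

Under these formulas the multiple bucket decryption bound is roughly 17 times smaller than the original bound of about 1x10^11 for the same one million records.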
With the proposed modified algorithm, the processing
efficiency for encryption and decryption could be greatly
improved. The major drawback to this multiple bucket
method is the greatly increased ciphertext storage
requirements, a 28-fold increase in stored text.
There could be unforeseen drawbacks to the multiple
bucket method other than the large ciphertext output.
Further refinement of this algorithm is a
potential topic for future research.
6. CONCLUSIONS
The database encryption scheme researched for this project
provides an interesting use of the keyed Hash Message
Authentication Code algorithm in conjunction with an
underlying hash algorithm such as SHA-1. The proposed
process uses information about the problem domain to
encrypt and decrypt integer values using modular
arithmetic, buckets and residual values. The encryption and
decryption processes use secret keys and strong hash
algorithms to ensure the security of encrypted data. The
HMAC operation is used recursively in both the encryption
and decryption sides which creates performance problems
for large integer values. On the decryption side, an
exhaustive search is performed to determine the correct
plaintext from the ciphertext and secret key. This
exhaustive search requires between 100,000 and 200,000
hash operations on average to identify plaintext values,
making processing large amounts of data infeasible.
The efficiency problem can be minimized in certain
problem domains by defining the minimum and maximum
possible plaintext values, and picking a small bucket size.
In some domains such as personally-identifiable social
security numbers (SSNs), a plaintext range of 100,000,000
to 999,000,000 could be defined, with a 1,000 bucket size.
With this configuration the efficiency should be
manageable. Whenever the encrypted data is stored in a
database, the attacker will most likely know that the field
stores SSN data rather than other numeric data, but they
will be unable to distinguish patterns or plaintext values
from the ciphertext.
Another drawback to the proposed method is the greatly
increased data storage requirements for ciphertext data.
One four byte input value will produce two 28 byte
ciphertext outputs, resulting in a 14-fold increase of stored
data.
The proposed algorithm has several strengths; it protects
against inference attacks, does not preserve plaintext
ordering, and supports single record encryption/decryption.
Because strong hash algorithms are used, individual
encryption will not reveal patterns that can be exploited to
find the key. This means that traditional defenses for
symmetric ciphertext such as cipher block chaining, cipher
feedback, etc. are not needed. Because these chaining
methods are not used, each ciphertext value is independent
of the other ciphertext values.
There is an opportunity for future research motivated by
this encryption scheme, in order to improve the processing
efficiency of the algorithm. One potential area for research
is a multiple bucket based HMAC encryption scheme.
Challenges still remain with the quantity of ciphertext data
produced in relation to the plaintext data values.
7. REFERENCES
[1] Dong Hyeok Lee; You Jin Song; Sung Min Lee; Taek Yong
Nam; Jong Su Jang, "How to Construct a New Encryption
Scheme Supporting Range Queries on Encrypted Database,"
Convergence Information Technology, 2007. International
Conference on, pp. 1402-1407, 21-23 Nov. 2007
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4420452&isnumber=4420217
[2] Forouzan, Behrouz A. 2008. Cryptography and Network
Security. McGraw Hill Higher Education. ISBN 978-0-07287022-0
[3] Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant,
Yirong Xu, "Order Preserving Encryption for Numeric
Data," Proceedings of the 2004 ACM SIGMOD international
conference on Management of data. 2007
URL: http://doi.acm.org/10.1145/1007568.1007632
[4] Tingjian Ge; Zdonik, S., "Fast, Secure Encryption for
Indexing in a Column-Oriented DBMS," Data Engineering,
2007. ICDE 2007. IEEE 23rd International Conference on,
pp. 676-685, 15-20 April 2007
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4221716&isnumber=4221635
[5] Wikipedia, July 2009. HMAC reference material.
URL: http://en.wikipedia.org/wiki/Hmac
[6] Wikipedia, July 2009. SHA-1 reference material.
URL: http://en.wikipedia.org/wiki/SHA-1
[7] Simon Josefsson, 2006. GPL implementation of HMAC-SHA1.
URL: http://www.koders.com/c/fidF9A73606BEE357A031F14689D03C089777847EFE.aspx
[8] Scott G. Miller, 2006. GPL implementation of SHA-1 hash.
URL: http://www.koders.com/c/fid716FD533B2D3ED4F230292A6F9617821C8FDD3D4.aspx
[9] Bob Trower, August 2001. Open source base64 encoding
implementation, adapted for test program.
URL: http://base64.sourceforge.net/b64.c