Large uniqueness constraints

SQL Server uniqueness constraints create an underlying unique index. SQL Server index keys may
not be more than 900 bytes. Below I discuss how to implement uniqueness constraints with hash
indexes when the key size can exceed 900 bytes and give the results of some tests on the relative
performance of the hash index approach.
If there are variable length columns, SQL Server may allow the uniqueness constraint to be created
but oversized rows will not be insertable. Below is a formula to determine when an index may
overflow the 900 byte limit. The maximum size of a unique constraint index key can be calculated
using Table 1 as:
$o + v(n_v) + u(n_u) + \sum_{i=1}^{n_c} t(i)$
where $o$ is the overhead, $v(n)$ is the variable sized column overhead for $n$ variable sized columns, $u(n)$ is the nullable column overhead for $n$ nullable columns, $t(i)$ is the type cost for column $i$, $n_v$ is the number of variable sized columns, $n_u$ is the number of nullable columns, and $n_c$ is the number of columns. The overhead $o$ varies depending on whether or not the underlying index is clustered (see Table 1).
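For example, such a constraint is accepted at creation time but enforced per row; a hypothetical sketch (made-up table and column names, assuming the 900 byte limit described above):

-- The constraint is created, with a warning that the key may exceed the maximum
-- index key size, but an oversized row cannot be inserted.
create table [dbo].[WideKey]
(
    [Value] nvarchar(500) not null,
    constraint [UQ_WideKey_Value] unique ([Value])
);
go
insert into [dbo].[WideKey] ([Value]) values (replicate(N'x', 100)); -- succeeds: 200 byte key
insert into [dbo].[WideKey] ([Value]) values (replicate(N'x', 500)); -- fails: the 1 000 byte key exceeds the limit
go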
Table 1: Table of contributions to the size of an index key. To calculate the maximum size of an index key in bytes, add the size of every column in the index key and the primary key, and add the overhead.

Object           Size (bytes)   Object                      Size (bytes)
bigint           8              smallint                    2
binary(n)        n              smallmoney                  4
binary(max)      ∞              sql_variant                 8016
bit              1              sysname                     256
char             n              text                        ∞
date             3              time                        5
datetime         8              timestamp                   8
datetime2        8              tinyint                     1
datetimeoffset   10             uniqueidentifier            16
decimal          17             varbinary                   n
float            8              varbinary(max)              ∞
geography        ∞              varchar                     n
geometry         ∞              xml                         ∞
hierarchyid      892            CLR type                    ?
image            ∞              variable size columns (n)   2 + (2 × n) (0 if n = 0)
int              4              nullable columns (n)        2 + ⌈n/8⌉ (0 if n = 0)
money            8              clustered overhead          4
nchar(n)         n × 2          nonclustered overhead       7
nchar(max)       ∞
ntext            ∞
numeric          17
nvarchar(n)      n × 2
nvarchar(max)    ∞
real             4
smalldatetime    4
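As a hypothetical worked example, consider a nonclustered uniqueness constraint over an nvarchar(450) column (not null) and a nullable int column, on a table with an int primary key. Reading the formula and Table 1 literally gives

$7 + (2 + 2 \times 1) + (2 + \lceil 1/8 \rceil) + (450 \times 2) + 4 + 4 = 922$

bytes, which exceeds the 900 byte limit, so an ordinary unique constraint is not safe for this column set.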
If the index key will always be 900 bytes or less, then an ordinary uniqueness constraint may be
used. If the index key could be greater than 900 bytes, then a uniqueness constraint can be enforced
with an after-trigger. To avoid table scans, a hash index can be generated for the columns with the
unique constraint. The uniqueness constraint can be enforced with the scheme:
create table [Schema].[Table]
(
    [Id] int not null constraint [PK_Table] primary key,
    …,
    [Column1] Column1TypeDeclaration not null,
    [Column2] Column2TypeDeclaration null,
    …,
    [Internal_Hash_Column1_Column2] as checksum([Column1], [Column2]) persisted not null
);
go
create index [Unique_Hash_Table_Column1_Column2] on [Schema].[Table]
    ([Internal_Hash_Column1_Column2]);
go
create trigger [ConstraintInsertTrigger_Extent] on [Schema].[Table]
after insert, update as
begin
    -- Check only when a constrained column may have changed, and look for
    -- duplicate (hash, column) combinations involving the inserted rows.
    if (update([Column1]) or update([Column2])) and
       exists (select *
               from [Schema].[Table] as T
                   inner join inserted as I
                       on I.[Internal_Hash_Column1_Column2] = T.[Internal_Hash_Column1_Column2] and
                          I.[Column1] = T.[Column1] and
                          (I.[Column2] = T.[Column2] or
                           (I.[Column2] is null and T.[Column2] is null))
               group by I.[Internal_Hash_Column1_Column2], I.[Column1], I.[Column2]
               having count(*) > 1)
    begin
        if @@trancount > 0
        begin
            rollback transaction;
        end;
        raiserror(N'Error: Values in the columns [Schema].[Table] ([Column1], [Column2]) must be unique.', 16, 0);
    end;
end;
go
Essentially, the strategy is to store an indexed hash of the columns in the uniqueness constraint. The after trigger ensures there are no duplicates by searching for rows that hash to the same value as one of the inserted rows and that also match on the constrained columns. Using hashes avoids scanning the entire table for matches and is space efficient.
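As a concrete illustration of the scheme (hypothetical table and column names; the after trigger is assumed to be written as above for the two constrained columns):

-- Two wide Unicode columns whose combined key size can exceed 900 bytes,
-- so an ordinary unique constraint over ([Name], [Region]) is not safe.
create table [dbo].[Customer]
(
    [Id]     int not null constraint [PK_Customer] primary key,
    [Name]   nvarchar(400) not null,
    [Region] nvarchar(400) null,
    [Internal_Hash_Name_Region] as checksum([Name], [Region]) persisted not null
);
go
create index [Unique_Hash_Customer_Name_Region] on [dbo].[Customer]
    ([Internal_Hash_Name_Region]);
go
-- With the after trigger in place, the second insert below is rejected and
-- rolled back under a case insensitive collation.
insert into [dbo].[Customer] ([Id], [Name], [Region]) values (1, N'Contoso', N'West');
insert into [dbo].[Customer] ([Id], [Name], [Region]) values (2, N'contoso', N'West');
go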
The reason to use checksum rather than hashbytes or binary_checksum is that checksum respects collation. For example,
checksum(N'X') = checksum(N'x')
but
binary_checksum(N'X') ≠ binary_checksum(N'x').
However, hashbytes and binary_checksum are suitable hashing functions for non-text columns. Also, if the constrained columns are unlikely to contain common prefixes, a prefix of the columns (e.g. left([Column], 10)) may be a very good hashing function.
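This is easy to check directly (a small sketch; it assumes a database with a case insensitive default collation):

-- checksum follows the collation, so N'X' and N'x' hash to the same value;
-- binary_checksum works on the raw bytes, so they do not.
select
    case when checksum(N'X') = checksum(N'x') then 'equal' else 'different' end as checksum_result,
    case when binary_checksum(N'X') = binary_checksum(N'x') then 'equal' else 'different' end as binary_checksum_result;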
A common alternative strategy is to put a unique constraint over a persisted column defined using
hashbytes. The idea is that the chance of a collision for, say, a SHA1 hash is so low that other bugs or
hardware failures are much more likely to cause a failure. However, because of the Birthday
Fallacy/Paradox people often underestimate the chance of a collision happening. If a hashing
function is perfect then the chance of there being a collision among n randomly chosen values in a
b-bit hash is approximately:
$1 - e^{-n(n-1)/2^{b+1}}$

In general, a collision is highly likely in a sample of size $2^{b/2}$.
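As a rough illustration, the approximation can be evaluated directly; the sketch below simply computes the formula for the 32-bit checksum used in the scheme above and for a 160-bit hash (the figures are evaluations of the approximation, not measurements):

-- Evaluate 1 - exp(-n(n - 1)/2^(b + 1)) for a perfect b-bit hash. Collisions in the
-- 32-bit checksum become effectively certain within a few hundred thousand rows,
-- which is why the trigger must compare the constrained columns and not just the
-- hash; for a 160-bit hash the probability is negligible even at a trillion rows.
select b, n,
       1 - exp(-(n * (n - 1.0)) / power(cast(2 as float), b + 1)) as collision_probability
from (values (32.0, 1e5), (32.0, 1e6), (160.0, 1e12)) as samples(b, n);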
Thus for, say, the 160-bit SHA1 hash the chance of a collision in even a trillion random rows seems very low. However, there are some flaws in this argument:
• Cryptographic hashing functions are poorly understood, but there are strong indications that MD5 and SHA1 are weaker than perfect hashing functions.
• For a given value, it is often feasible to find another value that hashes to the same hash (i.e. a hash collision). Thus, if an application could be compromised by a hash collision and an adversary has some control over the values the application uses, then this strategy is unsuitable.
• Cryptographic hash functions do not respect collations. Thus, two strings that are equal, like 'X' and 'x' in a case insensitive collation, may hash to different hashes.
• The input to the hashing function is not random.
A natural question is whether the hash index approach can be as efficient as the unique index approach to enforcing a uniqueness constraint. Table 2 and Table 3 show the average time to insert and delete 1 000 and 1 000 000 rows for various total byte sizes of two columns with a uniqueness constraint. The SQL Server unique constraint and the hash index approach are compared for both the no prefix and the 20 byte common prefix cases. From this analysis, it is clear that for column sets under 900 bytes the hash index approach is unlikely to perform better than an ordinary unique constraint.
Table 2: The average time, in seconds, to insert and delete 1 000 rows with a Unicode string with a uniqueness constraint. Both the unique and hash index approaches to uniqueness constraint enforcement are tabulated. Both random strings and strings with a fixed 20 byte (10 Unicode character) prefix are tabulated. The first time in each pair is the time to insert 1 000 rows at once. The second time in each pair is the time to insert 1 000 rows separately (i.e. 1 at a time).

         Unique index                 Hash index
Length   No prefix    With prefix     No prefix    With prefix
50       0.10  0.43   0.11  0.46      0.12  0.59   0.12  0.59
100      0.18  0.45   0.18  0.44      0.14  0.58   0.14  0.57
200      0.26  0.56   0.26  0.54      0.28  0.62   0.24  0.60
400      0.37  0.62   0.39  0.66      0.31  0.65   0.42  0.70
Table 3: The average time, in seconds, to insert and delete 1 000 000 rows with a Unicode string with a uniqueness constraint. Both the unique and hash index approaches to uniqueness constraint enforcement are tabulated. Both random strings and strings with a fixed 20 byte (10 Unicode character) prefix are tabulated. The first time in each pair is the time to insert 1 000 000 rows at once. The second time in each pair is the time to insert 1 000 000 rows separately (i.e. 1 at a time).

         Unique index                    Hash index
Length   No prefix     With prefix       No prefix      With prefix
50        6.71  36.22   6.92  35.05       20.82  52.60   20.54  60.33
100       7.09  37.36   7.33  37.02       39.92  57.03   36.14  57.46
200      16.38  43.13  15.01  44.09       65.81  59.12   62.35  59.55
400      33.59  55.07  24.99  56.63      149.95  70.06  151.21  71.07