Indexing and Binning Large Databases
Abstract



Problems with large databases
 Biometric identification (1:N Matching) does not scale well
with size
 No established way to organize high dimensional biometric
data
Proposed Solution
 Reduce search space before 1:N matching
 Divide the database using Clustering Techniques
Contributions
 We analyze the effect of implementing a binning scheme
on search performance and accuracy
 We present binning and pruning approaches using multiple
biometrics
 Using hand geometry and signature, we have achieved a
search space reduction of 95% at 0% FRR
Background





Only biometric identification (1:N matching) can
prevent duplicate enrollments, double dipping
Biometrics are being deployed for immigration and
national ID applications
 US-VISIT program
 Voter ID and national ID programs[3]
Potential database sizes run into the millions
Current research is focused only on accuracy
Apart from accuracy, scalability, speed and efficiency
also become important at this scale
Challenges
Textual/Numeric Data


Data is scalar (1D)
Textual/numeric data can be linearly
ordered and therefore easily indexed
Biometric Data


Biometric templates are high
dimensional
No linear ordering or sorting method
exists for biometric data
Search space analysis
 As the number of stored templates increases, template
density (TD) also increases
$$TD \propto \left( \frac{D_{IC}}{N_C} \right)^{-1}$$
D_IC: average intracluster distance
N_C: cluster count
Identification problem
Number of false positives grows quadratically with the size of
the database
Let FAR and FRR be the False Acceptance Rate (probability)
and False Reject Rate (probability) for 1:1 matching
 For 1:N matching,
$$FAR_N = 1 - (1 - FAR)^N \approx N \cdot FAR$$
$$FRR_N = FRR$$
The total number of false accepts is given by
$$\text{False accepts} = N \cdot FAR_N = N \left( 1 - (1 - FAR)^N \right) \approx N^2 \cdot FAR$$
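The scaling above can be checked numerically. The following is a minimal sketch, assuming independent 1:1 comparisons as the formulas do; the function names and the FAR value are illustrative choices, not part of the original slides.

```python
def far_1_to_n(far: float, n: int) -> float:
    """Exact 1:N false accept rate, assuming n independent 1:1 comparisons."""
    return 1.0 - (1.0 - far) ** n


def total_false_accepts(far: float, n: int) -> float:
    """N * FAR_N, the total number of false accepts used on the slide."""
    return n * far_1_to_n(far, n)


far = 1e-6  # 0.0001 %, an illustrative 1:1 operating point (assumption)
for n in (1_000, 10_000, 100_000):
    exact = far_1_to_n(far, n)
    approx = n * far  # first-order approximation, good while n * far << 1
    print(f"N={n:>7}  FAR_N={exact:.5f}  N*FAR={approx:.5f}  "
          f"false accepts={total_false_accepts(far, n):.1f}")
```

The exact value stays slightly below the N * FAR approximation, and the gap widens as N * FAR approaches 1.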
State of the Art
Biometrics | State of the art | Research Problems
Fingerprint | 0.15% FRR at 1% FAR (FVC 2002) | Fingerprint enhancement; partial fingerprint matching
Face Recognition | 10% FRR at 1% FAR (FRVT 2002) | Improving accuracy; face alignment variation; handling lighting variations
Hand Geometry | 4% FRR at 0% FAR (Transport Security Administration Tests) | Developing reliable models; identification problem
Signature Verification | 1.5% (IBM Israel) | Developing offline verification systems; handling skillful forgeries
Voice Verification | <1% FRR (Current Research) | Handling channel normalization; user habituation; text and language independence
State of the Art (contd.)
Biometrics | State of the art | Research Problems
Hand Geometry | 2.6% FRR at 0.02% FAR (CUBS, SUNY-Buffalo) | Developing reliable models; identification problem
Fingerprint, face recognition, signature verification, and voice verification figures are the same as on the previous slide.
Identification problem (contd.)



Even if FAR = 0.0001%, false accepts = 1 in 10 for
N = 100,000 (lower bound) in the identification case,
since FAR_N ≈ N × FAR = 100,000 × 0.000001 = 0.1
No single biometric is capable of meeting this security
requirement individually
Ways to reduce identification errors:
 Reduce FAR
 FAR is limited by feature representation and the
recognition algorithm
 Cannot be indefinitely reduced
 Reduce N
 Classify or index the biometric database (e.g., Henry
classification system for fingerprints)
 Index the records based on meta-data
 Can we do better?
Fingerprint Features
Fingerprints can be classified based on the ridge flow pattern
Fingerprints can be distinguished based on the ridge characteristics
65% of fingerprints belong to the Loop class
Henry Classification of Fingerprints
 [Ratha et al,1996] used Henry Classification on database of
1800 templates, tested on 100 templates
 Search Space: 25%; FRR: 10%
 [Jain, Pankanti,2000] similar experiment on database of 700
templates achieved FRR: 7.4% (Focus on classification only)
 State-of-the-art fingerprint classification system
[Cappelli, Maio, Maltoni, Nanni, 2003] has FRR 4.8% for the 5-class
problem and 3.7% for the 4-class problem
 Even though natural classes exist, classification is still non-trivial
 Natural classes do not exist for biometrics like hand geometry
 More sophisticated methods are needed to partition the database
Analysis of search space reduction
We can improve performance by reducing the search space
during identification
 Let PSYS be the penetration rate [between 0.0 and 1.0]
 Penetration rate is the average fraction of the database
searched during identification
 Effective size = N × PSYS
 For 1:N matching,
$$FAR_{P_{SYS} N} = 1 - (1 - FAR)^{P_{SYS} N} \approx P_{SYS} \cdot N \cdot FAR$$
$$FRR_{P_{SYS} N} = FRR$$
 The total number of false accepts is given by
$$\text{False accepts} = N \cdot P_{SYS} \cdot FAR_{P_{SYS} N} \approx (N \cdot P_{SYS})^2 \cdot FAR$$
State-of-the-art fingerprint systems have PSYS = 0.5
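To see how the penetration rate enters the error count, the sketch below evaluates the false-accept total for several values of PSYS. It is an illustration under assumed numbers: the database size, the FAR, and the function name are hypothetical.

```python
def false_accepts(n: int, far: float, p_sys: float = 1.0) -> float:
    """Total false accepts when only a fraction p_sys of the database is searched."""
    effective = n * p_sys                      # effective database size N * PSYS
    far_eff = 1.0 - (1.0 - far) ** effective   # 1:N FAR over the reduced search space
    return effective * far_eff


n, far = 100_000, 1e-6  # hypothetical database size and 1:1 FAR
baseline = false_accepts(n, far)
for p_sys in (1.0, 0.75, 0.5, 0.3, 0.1):
    fa = false_accepts(n, far, p_sys)
    print(f"PSYS={p_sys:<4}  false accepts={fa:8.1f}  relative to full search={fa / baseline:.3f}")
```

Because the total scales roughly with (N × PSYS)², halving the penetration rate cuts the false accepts to about a quarter.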
Effect of binning on accuracy
[Figure: false accepts (×10^7) versus database size N (×10^5), one curve per penetration rate PSYS = 0.1, 0.3, 0.5, 0.75, 1.0]

For PSYS < 0.2, the false accepts are almost constant

Query response time improves by a factor of

Capabilities of a low FAR system
 Will allow us to screen immigrants at airports
 Will make biometric systems more user-friendly by
eliminating the need to remember PINs and IDs
PSYS
Binning
 Binning can be used to achieve a smaller PSYS
 Partition the feature space
 Each bin is represented by a cluster center CK
 Records are compared with only NB cluster centers
 Bin representatives are computed offline during training
 Challenges
 How to handle clustering of large databases?
 How to handle additions and deletions?
Tradeoff






Although binning reduces the search space, it introduces another
source of identification error: the bin miss
If the bin in which the user record exists is not searched, then
a false reject is generated no matter how good the matcher is
If P(B) is the probability of searching the correct bin, then any P(B) < 1
increases the probability of false rejects
Not tolerable in security and screening applications
Solution:
 Use K-means clustering to find K bins
 Check Ns nearest bins for the record, such that P(B) = 1
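One way to pick Ns is to rank the cluster centers by distance for each validation query and take the worst rank at which the enrolled bin appears. The sketch below assumes Euclidean distance and already-trained cluster centers; the helper names and the toy data are hypothetical, and P(B) = 1 is only guaranteed on the queries used to choose Ns.

```python
import math


def nearest_bins(x, centers, ns):
    """Indices of the ns cluster centers closest to x (Euclidean distance)."""
    order = sorted(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
    return order[:ns]


def smallest_ns_without_bin_miss(queries, true_bins, centers):
    """Smallest Ns such that every query's enrolled bin is among its Ns nearest bins."""
    ns = 1
    for x, enrolled_bin in zip(queries, true_bins):
        ranking = nearest_bins(x, centers, len(centers))
        ns = max(ns, ranking.index(enrolled_bin) + 1)
    return ns


# Toy example: three bins, three validation queries (all values are made up).
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
queries = [(1.0, 1.0), (6.0, 0.0), (4.0, 6.0)]
true_bins = [0, 0, 2]  # bin holding each user's enrolled template

print(smallest_ns_without_bin_miss(queries, true_bins, centers))
```

The second query drifts closer to a neighbouring center than to its enrolled bin, so searching only the nearest bin would miss it; searching the two nearest bins does not.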
Formal definition of Binning
 In general a biometric template may be represented
as a vector $[x_{i1}, x_{i2}, x_{i3}, x_{i4}, \ldots, x_{ik}] \in \mathbb{R}^k$
 Vectors are grouped into N distinct clusters, each
represented by a 'code book vector'
$[Y_1, Y_2, Y_3, Y_4, \ldots, Y_N] \in \mathbb{R}^k$
 The code book vectors divide the feature space into
N distinct Voronoi regions $V_i$
 Every template is closest to the mean (codebook
vector) of the region it belongs to
$$\| Y_i - x \|^2 \le \| Y_j - x \|^2 \quad \forall\, i \ne j$$
$$\bigcup_i V_i = \mathbb{R}^k \quad \text{and} \quad \bigcap_i V_i = \emptyset$$
Search Space Partition: Voronoi Regions
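The partition into Voronoi regions follows directly from nearest-codebook assignment. Below is a minimal pure-Python sketch of K-means (Lloyd's iteration) that trains the code book vectors and assigns templates to regions; the iteration count, the seed, and the toy two-dimensional templates are assumptions for illustration, not the setup used in the experiments.

```python
import math
import random


def assign(x, codebook):
    """Index of the Voronoi region (nearest code book vector) that contains x."""
    return min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))


def kmeans(templates, k, iters=20, seed=0):
    """Plain Lloyd iteration; returns k code book vectors (the bin representatives)."""
    rng = random.Random(seed)
    codebook = rng.sample(templates, k)  # start from k distinct templates
    for _ in range(iters):
        bins = [[] for _ in range(k)]
        for x in templates:
            bins[assign(x, codebook)].append(x)
        # Move each code book vector to the mean of its bin; keep it if the bin is empty.
        codebook = [
            tuple(sum(col) / len(b) for col in zip(*b)) if b else codebook[i]
            for i, b in enumerate(bins)
        ]
    return codebook


# Toy 2-D "templates" forming two well-separated groups (made-up values).
templates = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
codebook = kmeans(templates, k=2)
labels = [assign(x, codebook) for x in templates]
print(codebook, labels)
```

Training runs offline; at query time only `assign` (or a ranking of the nearest few code book vectors) is needed.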
Hand Geometry Template
Feature extraction stages
•Image capture
•Binarization
•Contour Extraction
•Noise Removal
35 Features are extracted
•25 directly measured features
•10 ratio and perimeter features
Signature Template
11 Features Extracted
•Regression Constants b0,b1
•Compactness
•Signature Length
•Major Stroke Length
•Major Stroke Angle
•Connected Components
•Hole Count
•Hole Area
•Stroke Count
•Signing Time
Results
[Figure, two panels: Signature Binning and Hand Geometry Binning; y-axis: penetration rate (%), x-axis: number of bins (3 to 30)]
11-dimensional signature data
Best penetration: 35.57% for 6 bins
FRR = 0%
35-dimensional hand geometry data
Best penetration: 35.8% for 6 bins
FRR = 0%
$$P_{SYS} = \frac{N_S T_B + N_B}{N_D}$$
N_S: number of bins to search, T_B: average templates in bin
N_B: number of bins, N_D: size of the database
Dataset 250 Training Set & 250 Testing Set
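As a worked illustration of the penetration-rate formula, the sketch below assumes it reads PSYS = (N_S × T_B + N_B) / N_D, that is, the templates in the searched bins plus one comparison per cluster center. The input values are hypothetical and are not claimed to reproduce the reported figures.

```python
def penetration_rate(ns: int, tb: float, nb: int, nd: int) -> float:
    """Fraction of the database touched: searched-bin templates plus center comparisons."""
    return (ns * tb + nb) / nd


nd = 250      # database size (matches the training-set size on the slide)
nb = 6        # number of bins
tb = nd / nb  # average templates per bin, assuming evenly filled bins
ns = 2        # hypothetical number of bins searched per query

print(f"PSYS = {penetration_rate(ns, tb, nb, nd):.4f}")
```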
Multi-modal approach
 Resulting bins have very high template densities
 A different biometric modality should be used to classify
templates within a bin
 Multimodal biometrics
 Using multiple biometrics improves accuracy
 It is difficult to forge multiple biometrics
 Composite templates reduce template density
 Statistical independence ensures that individual
binning results are diverse
 The search space (intersection of bins) is reduced
due to low commonality between the individual
binning results
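The intersection step can be sketched with plain sets. In this hypothetical example each modality maps bin numbers to the enrolled user IDs stored in that bin; the bin contents and the searched bins are made up to show how two largely independent partitions shrink the candidate list.

```python
def candidates(bins, searched):
    """Union of the enrolled IDs found in the searched bins of one modality."""
    found = set()
    for b in searched:
        found |= bins[b]
    return found


# Hypothetical bin contents for 8 enrolled users under two independent binnings.
hand_bins = {0: {1, 2, 3, 4}, 1: {5, 6, 7, 8}}
signature_bins = {0: {1, 2, 5, 6}, 1: {3, 4, 7, 8}}

hand_candidates = candidates(hand_bins, searched=[0])            # hand geometry query falls in bin 0
signature_candidates = candidates(signature_bins, searched=[1])  # signature query falls in bin 1

final = hand_candidates & signature_candidates  # only users present in both results
print(sorted(final), len(final) / 8)
```

Each modality alone leaves half of the database to search; the intersection leaves a quarter.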
Multi-Modal Approach
Search space: 5% of the original database size; FRR = 0%
Results of Combination
[Figure: penetration rate (%) versus number of bins (5 to 30) for Signature, Hand Geometry, and Combination]
Best combined penetration rate of 5%
$$P_{SYS} = \frac{N_S T_B}{N_D}$$
N_S: number of bins to search, T_B: average templates in bin
N_D: size of the database
Dataset 250 Training Set & 250 Testing Set
Binning v/s Indexing
 Applications can have frequent insertions of new
templates
 Binning works well when database is static
 Insertions will require re-partitioning the entire
database
 Indexing can be used in both – static and dynamic
database scenarios
 Trees are commonly used for indexing
 Extend the concept of indexing relational databases
to indexing biometric databases
 Much more challenging – no concept of primary key
exists in biometric templates!
Pyramid Technique spatial hashing




Determine the pyramid (i) within which the
template lies
Determine height (h) of template from the apex
The 1-D value = Pyramid Number (i) + Height (h)
Indexing done using B+ Trees
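The three steps above can be sketched as follows. This assumes the usual formulation of the Pyramid Technique: features are normalized to [0, 1], the 2d pyramids share an apex at the center (0.5, ..., 0.5), and the height is the distance from the center along the dominant dimension. A sorted list with `bisect` stands in for the B+ tree; the template IDs and values are hypothetical.

```python
import bisect


def pyramid_value(v):
    """Map a point v in [0, 1]^d to a 1-D key: pyramid number i plus height h."""
    d = len(v)
    # Dimension in which v lies farthest from the center, which is the shared apex.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    # Pyramids 0..d-1 lie on the low side of their dimension, d..2d-1 on the high side.
    i = j_max if v[j_max] < 0.5 else j_max + d
    h = abs(0.5 - v[j_max])  # height measured from the apex, always in [0, 0.5]
    return i + h


# Hypothetical normalized 2-D templates.
templates = {"a": (0.1, 0.5), "b": (0.5, 0.9), "c": (0.5, 0.3)}
index = sorted((pyramid_value(v), tid) for tid, v in templates.items())
keys = [k for k, _ in index]


def key_range(lo, hi):
    """IDs whose 1-D key falls in [lo, hi]; a B+ tree would serve the same range scan."""
    return [tid for _, tid in index[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]]


print(index)
print(key_range(3.0, 4.0))
```

Since h never exceeds 0.5, keys from different pyramids cannot collide, so a one-dimensional range scan over [i + h_low, i + h_high] retrieves candidates from a single pyramid.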
Various Indexing Techniques
Grid Files
R Tree
Pyramid Technique
R+ Tree
KD Tree
X Tree
Comparative Study
Method | Scalable | Order Invariant | Dynamic | Range Query | No Overlap
Grid File | Y | Y | N | N | Y
R Tree | Y | N | N | N | N
R* Tree | Y | N | N | N | N
R+ Tree | Y | N | N | N | Y
KD Tree | Y | N | N | N | Y
X Tree | Y | N | Y | Y | Y
Pyramid Technique | Y | Y | Y | Y | Y
Results of Indexing
35 – Dimensional Hand Geometry data
Best Penetration: 27%
FRR = 0%
Dataset 450 Training Set & 450 Testing Set
 Parallel combination with signature will further
reduce the search space
Multimodal Biometrics
2D Biometric: Signature & Fingerprint Fusion
[Figure: scatter plot of impostor score pairs and true match score pairs]
Optimal Fusion Algorithm: Signature Fused With Fingerprint
[Figure: optimal fusion ROC, accuracy (1-FRR) versus false accept rate (FAR); labelled areas: Unrealizable Performance Area, Suboptimal Performance Area]
The ROC is the boundary between
what is possible and suboptimal performance.
Optimal Fusion Algorithm Decision Regions
[Figure: match zone and no-match zone over the 1st and 2nd biometric score axes]
99.04% Accuracy @ Specified FAR of 1 in a Million
Irregular decision region boundary due to finite sample size;
the more data, the smoother the boundaries
RSS Fusion Algorithm for Fingerprint & Signature
Provides a Suboptimal Performance ROC
[Figure: RSS fusion ROC compared with the optimal ROC; accuracy (1-FRR) versus false accept rate (FAR)]
RSS Fusion Decision Regions
[Figure: match zone and no-match zone over the 1st and 2nd biometric score axes]
96.11% Accuracy @ Specified FAR of 1 in a Million
OR Fusion Algorithm for Fingerprint & Signature
Provides a Suboptimal Performance ROC
[Figure: OR fusion ROC compared with the optimal ROC; accuracy (1-FRR) versus false accept rate (FAR)]
OR Fusion Decision Regions
[Figure: match zone and no-match zone over the 1st and 2nd biometric score axes]
96.85% Accuracy @ Specified FAR of 1 in a Million
AND Fusion Algorithm for Fingerprint & Signature
Provides a Suboptimal Performance ROC
[Figure: AND fusion ROC compared with the optimal ROC; accuracy (1-FRR) versus false accept rate (FAR)]
AND Fusion Decision Regions
[Figure: match zone and no-match zone over the 1st and 2nd biometric score axes]
62.91% Accuracy @ Specified FAR of 1 in a Million
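The fixed fusion rules compared above differ only in the shape of the match zone they carve out of the two-score plane. The sketch below is an illustration under stated assumptions: higher scores mean better matches, both scores are normalized to a comparable range, RSS is read as root-sum-of-squares, and all thresholds are hypothetical. It does not reproduce the optimal fusion algorithm or the reported accuracies.

```python
import math


def and_fusion(s1: float, s2: float, t1: float, t2: float) -> bool:
    """Match only if both biometrics pass: the match zone is the upper-right quadrant."""
    return s1 >= t1 and s2 >= t2


def or_fusion(s1: float, s2: float, t1: float, t2: float) -> bool:
    """Match if either biometric passes: the no-match zone is the lower-left rectangle."""
    return s1 >= t1 or s2 >= t2


def rss_fusion(s1: float, s2: float, t: float) -> bool:
    """Match if the root-sum-of-squares of the scores passes: a circular boundary."""
    return math.hypot(s1, s2) >= t


# One hypothetical score pair: strong first biometric, weak second biometric.
s1, s2 = 0.9, 0.2
print("AND:", and_fusion(s1, s2, t1=0.5, t2=0.5))
print("OR: ", or_fusion(s1, s2, t1=0.5, t2=0.5))
print("RSS:", rss_fusion(s1, s2, t=0.8))
```

In practice each rule's thresholds would be tuned to the specified FAR before accuracies are compared, which is what the ROC curves summarize.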
ROC
Thank You