第12章索引与散列

第12章: 索引与散列  基本概念  有序索引  B+-Tree 索引文件  B-Tree 索引文件  静态散列  动态散列  有序索引与散列的比较  SQL中的索引定义  多键存取 Database System Concepts 12.1 ©Silberschatz, Korth and Sudarshan 基本概念  索引机制用于加速对所需数据的存取.  例如, 图书馆中的作者目录  搜索键 – 用来在文件中查找记录的属性或属性集合.  索引文件由如下形式的记录(称为索引项)组成 search-key pointer  索引文件一般比原始文件小的多  两种基本索引:  有序索引: 搜索键按顺序存储  散列索引: 搜索键被“散列函数”一致地分配到若干“桶”中. Database System Concepts 12.2 ©Silberschatz, Korth and Sudarshan 索引评价度量对索引技术的评价是基于:  有效支持的存取类型, 如  在某属性上具有特定值的记录  属性值落入指定范围的记录  存取时间  插入时间  删除时间  空间开销 Database System Concepts 12.3 ©Silberschatz, Korth and Sudarshan 有序索引  有序索引: 索引项按搜索键值的顺序有序存储.  主索引: 顺序文件的记录顺序正是索引搜索键的顺序.  也称为聚簇索引  主索引的搜索键通常是主键, 但并非必要.  索引顺序文件: 带有主索引的顺序文件.  次级索引: 索引搜索键的顺序与文件的记录顺序不同.  也称为非聚簇索引 Database System Concepts 12.4 ©Silberschatz, Korth and Sudarshan 稠密索引文件  稠密索引 — 对文件中的每个搜索键值都有索引记录. Database System Concepts 12.5 ©Silberschatz, Korth and Sudarshan 稀疏索引文件  稀疏索引: 只对某些搜索键值有索引记录.  仅当文件记录按搜索键排序才可用  为定位具有搜索键值K的记录:  找到具有比K小的最大搜索键值的索引记录  从该索引记录指向的文件记录开始顺序搜索文件  对插入和删除需要较少空间与维护开销.  查找记录一般比稠密索引慢.  一个好的折衷:针对文件的每个块建立稀疏索引的索引项, 对应于该块中的最小搜索键值. Database System Concepts 12.6 ©Silberschatz, Korth and Sudarshan 稀疏索引文件例 Database System Concepts 12.7 ©Silberschatz, Korth and Sudarshan 多级索引  如果主索引不能一次放入内存, 存取代价就会很大.  为减少对索引记录的磁盘存取次数, 将主索引视为存储在磁盘上的顺序文件并为它建立一个稀疏索引.  外索引 – 主索引的稀疏索引  内索引 – 主索引文件  如果外索引仍太大而不能放入内存, 还可再为它创建另一层索引, 如此类推.  当插入或删除文件记录时, 各层索引都必须更新. Database System Concepts 12.8 ©Silberschatz, Korth and Sudarshan 多级索引(续) Database System Concepts 12.9 ©Silberschatz, Korth and Sudarshan 索引更新: 删除  如果被删除记录是文件中具有其搜索键值的唯一记录, 则需从索引中删除该搜索键.  单层索引删除:  稠密索引 – 搜索键的删除类似于文件记录的删除.  稀疏索引 – 如果存在该搜索键值的索引项, 则删除之并用文件中的下一搜索键值(按搜索键顺序)代替. 如果下一搜索键值已经存在索引项, 则直接删除. Database System Concepts 12.10 ©Silberschatz, Korth and Sudarshan 索引更新: 插入  单层索引插入:  用被插入记录的搜索键值执行一次查找.  稠密索引 – 若该搜索键值在索引中不存在, 插入之.  稀疏索引 – 若索引对每个文件块存有索引项, 则不必修改索引, 除非需创建一个新块. 这时, 新块的第一个搜索键值插入到索引中.  多级索引的插入和删除算法是对单层索引算法的简单扩展 Database System Concepts 12.11 ©Silberschatz, Korth and Sudarshan 次级索引  经常需要查找某属性值(不是主索引的搜索键) 满足某种条件的全体记录.  例1: 在按账号顺序存储的account 数据库中, 可能希望查找某分行的全体账户  例2: 同上, 但希望查找具有指定余额或余额范围的全体账户  可建立次级索引, 它对每个搜索键值都有一索引记录; 索引记录指向包含具有该搜索键值的所有实际记录的指针的桶. Database System Concepts 12.12 ©Silberschatz, Korth and Sudarshan 建立在balance字段上的次级索引 Database System Concepts 12.13 ©Silberschatz, Korth and Sudarshan 主索引与次级索引  次级索引必须是稠密的  索引对搜索文件记录很有益处.  更新文件时, 该文件上的每个索引都必须更新. 更新索引是数据库更新带来的开销.  利用主索引做顺序扫描很高效, 但利用次级索引做顺序扫描则代价昂贵  因为对每个记录的存取都可能导致存取一个新的磁盘块 Database System Concepts 12.14 ©Silberschatz, Korth and Sudarshan B+-树索引文件 B+-树索引是相对于索引顺序文件的另一选择.  索引顺序文件的缺点: 当文件增大时性能降低, 因为可能创建许多溢出块. 需要对整个文件进行周期性的重组.  B+-树索引文件的优点: 插入与删除时仅以较少的局部的变化来自动重组. 不需要整个文件重组来维持性能.  B+-树索引的缺点: 额外的插入与删除开销, 空间开销.  B+-树的优点超过了缺点, 因此被广泛使用. Database System Concepts 12.15 ©Silberschatz, Korth and Sudarshan B+-树索引文件 (续) B+-树是满足下列性质的树:  从根到叶的所有路径长度相同  每个非根非叶节点有[n/2] 到n 个子节点.  叶节点有[(n–1)/2] 到n–1 个值  特殊情况:  若根非叶, 则至少有2 个子节点.  若根是叶(即树中没有其他节点), 则可有0 到(n–1) 个值. Database System Concepts 12.16 ©Silberschatz, Korth and Sudarshan B+-树节点结构  典型节点  Ki 是搜索键值  Pi 是子节点指针 (非叶节点中)或记录/记录桶指针(叶节点).  每个节点中搜索键是有序的 K1 < K2 < K3 < . . . < Kn–1 Database System Concepts 12.17 ©Silberschatz, Korth and Sudarshan B+-树中的叶节点叶节点的性质:  对i = 1, 2, . . ., n–1, 指针Pi 要么指向具有搜索键值Ki 的文件记录, 要么指向一个具有搜索键值Ki 的所有文件记录的指针桶. 仅当搜索键不构成主键时才需要桶结构.  如果Li, Lj 是叶节点且i < j, 则Li 的搜索键值小于 Lj 的搜索键值  Pn 指向按搜索键顺序的下一叶节点 Database System Concepts 12.18 ©Silberschatz, Korth and Sudarshan B+-树中的非叶节点  非叶节点形成了对叶节点的多级稀疏索引. 对具有m个指针的非叶节点:  P1指向的子树中的所有搜索键都小于K1  对2  i  n – 1, Pi 指向的子树中的所有搜索键都大于Ki–1小于Km–1 Database System Concepts 12.19 ©Silberschatz, Korth and Sudarshan B+-树例 account 文件的B+-树(n = 3) Database System Concepts 12.20 ©Silberschatz, Korth and Sudarshan B+-树例 account 文件的B+-树(n = 5)  叶节点必须具有2到4个值(即(n–1)/2 和n –1, 这里n = 5).  根以外的非叶节点必须具有3到5个子节点(即n/2 和n, 这里n =5).  根节点必须至少有2个子节点. Database System Concepts 12.21 ©Silberschatz, Korth and Sudarshan 对B+-树的若干观察  由于内节点通过指针相连, “逻辑上”相近的块在“物理上”不必靠近.  B+-树的非叶层形成对叶节点的稀疏索引层.  B+-树包含较少的层数(是主文件大小的对数), 因此可高效搜索.  对主文件的插入和删除可高效处理, 因为索引可在对数时间内重构. Database System Concepts 12.22 ©Silberschatz, Korth and Sudarshan 对B+-树的查询  查找具有搜索键值k 的所有记录. 1. 从根节点开始 1. 查找节点中大于k 的最小搜索键值. 2. 如果存在这样的值, 假设为Kj . 则沿Pi 到相应子节点 3. 否则k  Km–1, 其中m 是节点指针数. 则沿 Pm 到相应子节点. 2. 若沿上述指针到达的节点不是叶节点, 则对该节点重复上述过程. 3. 最终到达叶节点. 若存在 i, 使键值Ki = k 则沿 Pi 到所需记录或桶 . 否则不存在具有搜索键值k 的记录. Database System Concepts 12.23 ©Silberschatz, Korth and Sudarshan 对B+-树的查询 (续)  处理查询时, 需遍历一条从根到叶的路径.  若文件中有K 个搜索键值, 则路径长度不超过 logn/2(K).  节点一般与磁盘块大小相同, 典型地为4KB, 则n 典型地为 100左右(设每个索引项40字节).  若有1百万搜索键值且n =100, 则一次查询最多需存取 log50(1,000,000) = 4个节点.  而用具有1百万个搜索键值的平衡二叉树则需存取大约20 个节点  以上差别是巨大的, 因为每次存取节点都需要一次磁盘I/O, 花费大约20毫秒! Database System Concepts 12.24 ©Silberschatz, Korth and Sudarshan 对B+-树的更新: 插入  找到搜索键值可能出现的叶节点  若搜索键值已经存在, 则向文件中增加记录, 并且如有必要还需向桶中插入一个指针.  若搜索键值不存在, 则向主文件增加记录, 并且如有必要创建一个桶. 然后:  如果该叶节点还有空间, 则在其中插入(键值, 指针)  否则, 分裂节点. Database System Concepts 12.25 ©Silberschatz, Korth and Sudarshan B+-树的更新: 插入(续)  分裂节点:  将n 个(搜索键值, 指针)对排序. 将前  n/2  个放在原节点中, 其余放在一个新节点中.  设新节点为p, 并令k 是p 中最小的键值. 将(k,p) 插入到分裂节点的父节点中. 若父节点满, 则继续分裂.  向上进行节点分裂直至找到不满的节点. 最坏情况下导致根节点的分裂, 树的高度加 1. 在包含Brighton和Downtown的节点中插入Clearview导致分裂的结果 Database System Concepts 12.26 ©Silberschatz, Korth and Sudarshan B+-树的更新: 插入(续) 插入“Clearview” 前后的B+-树 Database System Concepts 12.27 ©Silberschatz, Korth and Sudarshan B+-树的更新: 删除  找到待删除记录, 从主文件及桶(如果存在的话)中删除之  如果没有桶或桶变成空则从叶节点中删除 (搜索键值, 指针)  若节点因删除导致索引项过少, 且该节点与一兄弟节点的索引项可以放入一个节点中, 则  将两个节点的所有搜索键值插入到一个节点中(左边的), 在删除另一个节点.  从父节点中删除 (Ki–1, Pi), 其中Pi 是指向被删除节点的指针, 递归使用上述过程. Database System Concepts 12.28 ©Silberschatz, Korth and Sudarshan B+-树的更新: 删除(续)  如果节点因删除导致索引项过少, 并且该节点与一兄弟节点的索引项数不能放入一个节点中, 则  在两个节点之间重新分配指针, 使两者都具有超过最低要求的索引项数.  更新父节点中对应的搜索键值.  节点删除可能向上扩散直至找到一个具有n/2  或更多指针的节点 . 若删除后根节点只有一个指针, 则删除根节点而其唯一子节点成为根. Database System Concepts 12.29 ©Silberschatz, Korth and Sudarshan B+-树删除例 “Downtown”删除前后  删除包含“Downtown” 的叶节点并不导致其父节点具有过少指针. 因此没有引起父节点的级联删除. Database System Concepts 12.30 ©Silberschatz, Korth and Sudarshan B+-树删除例 (续) 从前例中继续删除 “Perryridge”  “Perryridge”所在节点不满 (这里事实上为空)故与其兄弟合并.  导致 “Perryridge” 的父节点也变成不满, 又与其兄弟节点合并 (又导致它们的父节点删除一个索引项)  根节点只有一个子节点, 从而被删除, 它的子节点成为新的根节点 Database System Concepts 12.31 ©Silberschatz, Korth and Sudarshan B+-树的更新: 删除(续) 前例中删除 “Perryridge” 前后  包含Perryridge的叶节点的父节点不满, 从其左兄弟借来一个指针  父节点的父节点中的搜索键值要做相应改变 Database System Concepts 12.32 ©Silberschatz, Korth and Sudarshan B+-树文件组织  可用B+-树索引解决索引文件性能降低问题. 数据文件性能降低问题可通过B+-树文件组织解决.  B+-树文件组织中的叶节点存储数据记录, 而非指针.  由于记录比指针大, 叶节点中可存储的最大记录数小于非叶节点中的指针数.  叶节点仍要求至少半满.  插入与删除的处理方式与B+-树索引中索引项的插入与删除相同. Database System Concepts 12.33 ©Silberschatz, Korth and Sudarshan B+-树文件组织 (续) B+-树文件组织  好的空间利用率是重要的, 因为记录比指针要用更多的空间.  为改善空间利用率, 在分裂与合并重分配时考虑多个兄弟节点  涉及2个兄弟的重分配(尽量避免分裂/合并)导致每个节点至少具有个索引项 2 n / 3  Database System Concepts 12.34 ©Silberschatz, Korth and Sudarshan B-树索引文件  类似于B+-树, 但B-树允许搜索键值只出现一次; 消除搜索键的冗余存储.  非叶结点中的搜索键在B-树中的其他地方不出现; 对每个非叶结点中的搜索键必须包含一个附加指针域.  一般的B-树叶节点  非叶结点 – 指针Bi 是桶或文件记录指针. Database System Concepts 12.35 ©Silberschatz, Korth and Sudarshan B-树索引文件例同样数据上的B-树(上) 和B+-树(下) Database System Concepts 12.36 ©Silberschatz, Korth and Sudarshan B-树索引文件 (续)  B-树索引的优点:  与对应的B+-树相比可能使用较少树节点.  有时可能在到达叶结点之前就找到搜索键值.  B-树索引的缺点:  只有小部分搜索键值可以早点找到  非叶结点较大, 从而扇出减少. 因此B-树通常深度比对应的B+-树大  插入和删除比B+-树复杂  实现比B+-树难.  典型地, B-树的优点抵不过缺点. Database System Concepts 12.37 ©Silberschatz, Korth and Sudarshan Static Hashing  A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).  In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.  Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.  Hash function is used to locate records for access, insertion as well as deletion.  Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record. Database System Concepts 12.38 ©Silberschatz, Korth and Sudarshan Example of Hash File Organization (Cont.) Hash file organization of account file, using branch-name as key (See figure in next slide.)  There are 10 buckets,  The binary representation of the ith character is assumed to be the integer i.  The hash function returns the sum of the binary representations of the characters modulo 10  E.g. h(Perryridge) = 5 Database System Concepts h(Round Hill) = 3 h(Brighton) = 3 12.39 ©Silberschatz, Korth and Sudarshan Example of Hash File Organization Hash file organization of account file, using branch-name as key (see previous slide for details). Database System Concepts 12.40 ©Silberschatz, Korth and Sudarshan Hash Functions  Worst has function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.  An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.  Ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.  Typical hash functions perform computation on the internal binary representation of the search-key.  For example, for a string search-key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned. . Database System Concepts 12.41 ©Silberschatz, Korth and Sudarshan Handling of Bucket Overflows  Bucket overflow can occur because of  Insufficient buckets  Skew in distribution of records. This can occur due to two reasons:  multiple records have same search-key value  chosen hash function produces non-uniform distribution of key values  Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets. Database System Concepts 12.42 ©Silberschatz, Korth and Sudarshan Handling of Bucket Overflows (Cont.)  Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.  Above scheme is called closed hashing.  An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications. Database System Concepts 12.43 ©Silberschatz, Korth and Sudarshan Hash Indices  Hashing can be used not only for file organization, but also for index-structure creation.  A hash index organizes the search keys, with their associated record pointers, into a hash file structure.  Strictly speaking, hash indices are always secondary indices  if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.  However, we use the term hash index to refer to both secondary index structures and hash organized files. Database System Concepts 12.44 ©Silberschatz, Korth and Sudarshan Example of Hash Index Database System Concepts 12.45 ©Silberschatz, Korth and Sudarshan Deficiencies of Static Hashing  In static hashing, function h maps search-key values to a fixed set of B of bucket addresses.  Databases grow with time. If initial number of buckets is too small, performance will degrade due to too much overflows.  If file size at some point in the future is anticipated and number of buckets allocated accordingly, significant amount of space will be wasted initially.  If database shrinks, again space will be wasted.  One option is periodic re-organization of the file with a new hash function, but it is very expensive.  These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically. Database System Concepts 12.46 ©Silberschatz, Korth and Sudarshan Dynamic Hashing  Good for database that grows and shrinks in size  Allows the hash function to be modified dynamically  Extendable hashing – one form of dynamic hashing  Hash function generates values over a large range — typically b-bit integers, with b = 32.  At any time use only a prefix of the hash function to index into a table of bucket addresses.  Let the length of the prefix be i bits, 0  i  32.  Bucket address table size = 2i. Initially i = 0  Value of i grows and shrinks as the size of the database grows and shrinks.  Multiple entries in the bucket address table may point to a bucket.  Thus, actual number of buckets is < 2i  The number of buckets also changes dynamically due to coalescing and splitting of buckets. Database System Concepts 12.47 ©Silberschatz, Korth and Sudarshan General Extendable Hash Structure In this structure, i2 = i3 = i, whereas i1 = i – 1 (see next slide for details) Database System Concepts 12.48 ©Silberschatz, Korth and Sudarshan Use of Extendable Hash Structure  Each bucket j stores a value ij; all the entries that point to the same bucket have the same values on the first ij bits.  To locate the bucket containing search-key Kj: 1. Compute h(Kj) = X 2. Use the first i high order bits of X as a displacement into bucket address table, and follow the pointer to appropriate bucket  To insert a record with search-key value Kj  follow same procedure as look-up and locate the bucket, say j.  If there is room in the bucket j insert record in the bucket.  Else the bucket must be split and insertion re-attempted (next slide.)  Overflow buckets used instead in some cases (will see shortly) Database System Concepts 12.49 ©Silberschatz, Korth and Sudarshan Updates in Extendable Hash Structure To split a bucket j when inserting record with search-key value Kj:  If i > ij (more than one pointer to bucket j)  allocate a new bucket z, and set ij and iz to the old ij -+ 1.  make the second half of the bucket address table entries pointing to j to point to z  remove and reinsert each record in bucket j.  recompute new bucket for Kj and insert record in the bucket (further splitting is required if the bucket is still full)  If i = ij (only one pointer to bucket j)  increment i and double the size of the bucket address table.  replace each entry in the table by two entries that point to the same bucket.  recompute new bucket address table entry for Kj Now i > ij so use the first case above. Database System Concepts 12.50 ©Silberschatz, Korth and Sudarshan Updates in Extendable Hash Structure (Cont.)  When inserting a value, if the bucket is full after several splits (that is, i reaches some limit b) create an overflow bucket instead of splitting bucket entry table further.  To delete a key value,  locate it in its bucket and remove it.  The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table).  Coalescing of buckets can be done (can coalesce only with a “buddy” bucket having same value of ij and same ij –1 prefix, if it is present)  Decreasing bucket address table size is also possible  Note: decreasing bucket address table size is an expensive operation and should be done only if number of buckets becomes much smaller than the size of the table Database System Concepts 12.51 ©Silberschatz, Korth and Sudarshan Use of Extendable Hash Structure: Example Initial Hash structure, bucket size = 2 Database System Concepts 12.52 ©Silberschatz, Korth and Sudarshan Example (Cont.)  Hash structure after insertion of one Brighton and two Downtown records Database System Concepts 12.53 ©Silberschatz, Korth and Sudarshan Example (Cont.) Hash structure after insertion of Mianus record Database System Concepts 12.54 ©Silberschatz, Korth and Sudarshan Example (Cont.) Hash structure after insertion of three Perryridge records Database System Concepts 12.55 ©Silberschatz, Korth and Sudarshan Example (Cont.)  Hash structure after insertion of Redwood and Round Hill records Database System Concepts 12.56 ©Silberschatz, Korth and Sudarshan Extendable Hashing vs. Other Schemes  Benefits of extendable hashing:  Hash performance does not degrade with growth of file  Minimal space overhead  Disadvantages of extendable hashing  Extra level of indirection to find desired record  Bucket address table may itself become very big (larger than memory)  Need a tree structure to locate desired record in the structure!  Changing size of bucket address table is an expensive operation  Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows Database System Concepts 12.57 ©Silberschatz, Korth and Sudarshan Comparison of Ordered Indexing and Hashing  Cost of periodic re-organization  Relative frequency of insertions and deletions  Is it desirable to optimize average access time at the expense of worst-case access time?  Expected type of queries:  Hashing is generally better at retrieving records having a specified value of the key.  If range queries are common, ordered indices are to be preferred Database System Concepts 12.58 ©Silberschatz, Korth and Sudarshan SQL中的索引定义  创建索引 create index <索引名> on <关系名>(<属性列表>) 例如: create index b-index on branch(branch-name)  用create unique index 可间接指定搜索键是候选键.  如果支持SQL的 unique 完整性约束则不需要  删除索引 drop index <index-name> Database System Concepts 12.59 ©Silberschatz, Korth and Sudarshan 多键访问  使用多个索引进行某些类型的查询.  例如: select account-number from account where branch-name = “Perryridge” and balance - 1000  使用单个属性上的索引进行查询处理的可能策略: 1. 利用 branch-name 上的索引查找属于Perryridge分行的账号; 再测试 balance = 1000. 2. 利用balance上的索引查找具有余额$1000的账户; 再测试 branchname = “Perryridge”. 3. 利用branch-name 索引找出所有指向 Perryridge 分行的记录的指针. 类似地利用 balance上索引. 再求所得指针集合的交集. Database System Concepts 12.60 ©Silberschatz, Korth and Sudarshan 多属性索引假设我们在联合搜索键(branch-name, balance)上有一个索引.  对于where子句 where branch-name = “Perryridge” and balance = 1000 联合搜索键上的索引将只取出两个条件都满足的记录. 利用分开的索引效率较差 — 因为只满足一个条件时我们可能取出很多记录(或指针).  也可以高效地处理 where branch-name - “Perryridge” and balance < 1000  但是不能高效地处理 where branch-name < “Perryridge” and balance = 1000 可能取出很多满足第一个条件却不满足第二个条件的记录. Database System Concepts 12.61 ©Silberschatz, Korth and Sudarshan 网格文件  是一种用于加速处理涉及一个或多个比较运算的一般的多搜索键查询的结构.  网格文件(grid file) 只有一个网格数组, 并对每个搜索键属性有一个线性标度. 网格数组的维数等于搜索键属性的个数.  网格阵列的每个单元有一个指向桶的指针  桶中包含搜索键的值和记录指针  网格数组的多个单元可以指向同一个桶  为找到某一搜索键值对应的桶, 利用线性标度定位单元的行和列, 并沿指针找到 Database System Concepts 12.62 ©Silberschatz, Korth and Sudarshan account的网格文件例 Database System Concepts 12.63 ©Silberschatz, Korth and Sudarshan 基于网格文件的查询  两个属性A和B上的网格文件可以具有一定效率地处理所有下列形式的查询  (a1  A  a2)  (b1  B  b2)  (a1  A  a2  b1  B  b2)  例如, 为回答 (a1  A  a2  b1  B  b2), 利用线性标度找到对应的候选网个数组单元, 并查找所有被这些单元指向的桶. Database System Concepts 12.64 ©Silberschatz, Korth and Sudarshan 网格文件 (续)  在插入时, 如果一个桶变满, 且有多个单元指向该桶, 则可以创建新桶.  思想类似于可扩展散列, 但针对的是多个维  如果只有一个单元指向他, 要么必须创建一个溢出桶, 要么网格尺寸必须增加  线性标度必须选择得使记录在单元间一致分布.  否则将会有太多溢出桶.  定期重组以增加网格尺寸是有益的.  但重组代价很大.  网格数组的空间开销可能较高.  R-树(Chapter 23) 是另一种方法 Database System Concepts 12.65 ©Silberschatz, Korth and Sudarshan 位图索引  位图索引是一种特殊类型的索引, 用来实现高效的多键查询  假设关系中的记录从0开始顺序编号  给定数 n 必须很容易取得记录n  如果是定长记录尤其容易  适用于具有相对较少数目的不同值的属性  例如性别, 国家, 州, …  例如收入等级(收入分解成少数几个等级, 如0-9999, 10000-19999, 20000-50000, 50000- 无穷大)  位图就是位的数组 Database System Concepts 12.66 ©Silberschatz, Korth and Sudarshan 位图索引(续)  一个属性的最简单形式的位图索引对该属性的每个值都有一个位图  位图的位数等于记录个数  在值v的位图中,如果记录在该属性上的值是v, 则对应的位为1, 否则为 0 Database System Concepts 12.67 ©Silberschatz, Korth and Sudarshan 位图索引 (续)  位图索引对多属性查询很有用  对单属性查询不是特别有用  利用位图运算来解答查询  交 (and)  并 (or)  补 (not)  每个运算施加于两个大小相同的位图并对对应的位进行运算, 得到结果位图  例如 100110 AND 110011 = 100010 100110 OR 110011 = 110111 NOT 100110 = 011001  收入等级为L1的男性: 10010 AND 10100 = 10000  然后即可读取所需元组.  匹配元组的计数更快 Database System Concepts 12.68 ©Silberschatz, Korth and Sudarshan 位图索引(续)  与关系大小相比, 位图索引一般很小  例如若记录为100字节, 单个位图的空间为是关系所用空间的1/800.  若不同属性值的个数为8, 位图仅为关系大小的1%  删除需要恰当地处理  存在位图指示在一个记录位置是否有合法记录  求补时需要  not(A=v): (NOT bitmap-A-v) AND ExistenceBitmap  必须为所有值保存位图, 甚至包括空值  为了对NOT(A=v)正确处理SQL空值语义:  将上述结果与(NOT bitmap-A-Null)求交 Database System Concepts 12.69 ©Silberschatz, Korth and Sudarshan 位图运算的高效实现  位图压缩成字; 单个字AND(基本CPU指令) 一次计算32或64位  例如1百万位的位图可以仅用31,250条指令进行与  计数1的个数有一个技巧快速完成:  根据每个字节的值索引到一个预先计算的有256 个元素的数组中, 该元素存储该值对应的二进制表示中的1的个数  可利用字节对来加速, 但存储代价更高  将所有计数值相加  对具有大量匹配记录的值, 位图可替代元组标识列表而用于B+-树的叶子层  若超过记录的1/64都具有该值, 就是值得的(假设元组标识是64位的)  上述技术结合了位图与B+-树索引的优点 Database System Concepts 12.70 ©Silberschatz, Korth and Sudarshan End of Chapter Partitioned Hashing  Hash values are split into segments that depend on each attribute of the search-key. (A1, A2, . . . , An) for n attribute search-key  Example: n = 2, for customer, search-key being (customer-street, customer-city) search-key value (Main, Harrison) (Main, Brooklyn) (Park, Palo Alto) (Spring, Brooklyn) (Alma, Palo Alto) hash value 101 111 101 001 010 010 001 001 110 010  To answer equality query on single attribute, need to look up multiple buckets. Similar in effect to grid files. Database System Concepts 12.72 ©Silberschatz, Korth and Sudarshan Sequential File For account Records Database System Concepts 12.73 ©Silberschatz, Korth and Sudarshan Sample account File Database System Concepts 12.74 ©Silberschatz, Korth and Sudarshan

第12章索引与散列

Related documents

Products

Support

第12章索引与散列

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib