Random projection trees and low dimensional manifolds
Yoav Freund, Sanjoy Dasgupta
University of California, San Diego, 2008

2013.01.07 (Mon), Jeonbuk National Univ. DBLAB, 김태훈

Contents
1. Introduction
2. Detailed overview
3. An RPTree-Max adapts to Assouad dimension
4. An RPTree-Mean adapts to local covariance dimension

Introduction

A k-d tree is a spatial data structure that partitions R^D into hyperrectangular cells. It is built in a recursive manner, splitting along one coordinate direction at a time. The succession of splits corresponds to a binary tree whose leaves contain the individual cells of R^D.

[Figure: the dots are points in a database; the cross is a query point q.]

A k-d tree requires D levels in order to halve the cell diameter. If the data lie in R^1000, it could take 1000 levels of the tree to bring the diameter of the cells down to half that of the entire data set. This would require 2^1000 data points!

Thus k-d trees are susceptible to the curse of dimensionality. However, a recent positive development in machine learning has been the realization that a lot of data which superficially lie in a very high-dimensional space R^D actually have low intrinsic dimension d, with d ≪ D, where d is the intrinsic dimension of the data and D is the ambient dimension.

In this paper, we are interested in techniques that automatically adapt to intrinsic low-dimensional structure without having to explicitly learn this structure.

Detailed overview

Both k-d trees and RP trees are built by recursive binary splits. The core tree-building algorithm is called MakeTree, and takes as input a data set S ⊂ R^D. The tree variants differ only in ChooseRule, so MakeTree is shown once here rather than repeated for each version.

MakeTree algorithm
  procedure MakeTree(S)
    if |S| < MinSize return (Leaf)
    Rule ← ChooseRule(S)
    LeftTree ← MakeTree({x ∈ S : Rule(x) = true})
    RightTree ← MakeTree({x ∈ S : Rule(x) = false})
    return ([Rule, LeftTree, RightTree])

k-d tree version
  procedure ChooseRule(S)
    comment: k-d tree version
    choose a coordinate direction i
    Rule(x) := x_i ≤ median({z_i : z ∈ S})
    return (Rule)
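As a concrete illustration, here is a minimal Python sketch of MakeTree with the k-d splitting rule. The names (make_tree, choose_rule_kd, MIN_SIZE) and the convention of cycling through coordinates by depth are our own assumptions; the pseudocode above leaves the choice of coordinate open.

import numpy as np

MIN_SIZE = 10  # illustrative stopping size, not a value from the paper

def choose_rule_kd(S, depth=0):
    # k-d version: split one coordinate at its median
    i = depth % S.shape[1]          # assumed convention: cycle coordinates by depth
    threshold = np.median(S[:, i])
    return lambda x: x[i] <= threshold

def make_tree(S, depth=0):
    # S is an (n, D) array of points; returns a nested tuple tree
    if len(S) < MIN_SIZE:
        return ("Leaf", S)
    rule = choose_rule_kd(S, depth)
    mask = np.array([rule(x) for x in S])
    if mask.all() or not mask.any():  # degenerate split (e.g. duplicate points): stop
        return ("Leaf", S)
    return (rule, make_tree(S[mask], depth + 1), make_tree(S[~mask], depth + 1))

For example, make_tree(np.random.randn(1000, 5)) builds a tree over 1000 points in R^5.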
RP-tree version

Choose a random direction and split the data about the median of their projections onto it (compare Principal Component Analysis (PCA), which would instead use a direction learned from the data).

RP-tree Max version
  procedure ChooseRule(S)
    comment: RPTree-Max version
    choose a random unit direction v ∈ R^D
    pick any x ∈ S; let y ∈ S be the farthest point from it
    choose δ uniformly at random in [−1, 1] · 6‖x − y‖/√D
    Rule(x) := x · v ≤ median({z · v : z ∈ S}) + δ
    return (Rule)

RP-tree Mean version
  procedure ChooseRule(S)
    comment: RPTree-Mean version
    if Δ²(S) ≤ c · Δ²_A(S)
      then choose a random unit direction v
           Rule(x) := x · v ≤ median({z · v : z ∈ S})
      else Rule(x) := ‖x − mean(S)‖ ≤ median({‖z − mean(S)‖ : z ∈ S})
    return (Rule)

Here Δ(S) is the diameter of S, Δ_A(S) is the average interpoint distance, and c is a constant.
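Continuing the Python sketch above (again with our own naming, and with c = 10 as an arbitrary stand-in for the paper's constant c), the two RP splitting rules might look like this. The average squared distance below includes the zero diagonal, a simplification of Δ²_A(S).

def choose_rule_rp_max(S, depth=0):
    # RPTree-Max: random-projection split at a jittered median
    # (depth is unused; kept for the same call signature as choose_rule_kd)
    n, D = S.shape
    v = np.random.randn(D)
    v /= np.linalg.norm(v)                           # random unit direction
    x = S[0]                                         # pick any point
    y = S[np.argmax(np.linalg.norm(S - x, axis=1))]  # farthest point from x
    delta = np.random.uniform(-1, 1) * 6 * np.linalg.norm(x - y) / np.sqrt(D)
    threshold = np.median(S @ v) + delta
    return lambda p: p @ v <= threshold

def choose_rule_rp_mean(S, depth=0, c=10.0):
    # RPTree-Mean: projection split unless the squared diameter is much
    # larger than the average squared interpoint distance; in that case
    # split on distance from the mean. O(n^2) distances, fine for a sketch.
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    if sq.max() <= c * sq.mean():                    # Δ²(S) ≤ c · Δ²_A(S)
        v = np.random.randn(S.shape[1])
        v /= np.linalg.norm(v)
        threshold = np.median(S @ v)
        return lambda p: p @ v <= threshold
    mu = S.mean(axis=0)
    threshold = np.median(np.linalg.norm(S - mu, axis=1))
    return lambda p: np.linalg.norm(p - mu) <= threshold

Either function can be substituted for choose_rule_kd inside make_tree to build the corresponding RP tree.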