Why Not Store Everything in Main Memory? Why use disks?

10. Data Mining
Data Mining is one aspect of Database Query Processing. It sits on the "what if" (pattern and trend
querying) end of Query Processing, rather than the "please find" end. To say it another way,
data mining queries are on the ad hoc or unstructured end of the query spectrum, rather than the
standard report generation, "retrieve all records matching a criterion", or SQL side.
Still, Data Mining queries ARE queries and are processed (or will eventually be processed) by a
Database Management System the same way queries are processed today, namely:
1. SCAN and PARSE (SCANNER-PARSER): A Scanner identifies the tokens or language elements of
the DM query. The Parser checks for syntax or grammar validity.
2. VALIDATE: The Validator checks for valid names and semantic correctness.
3. CONVERT: The Converter converts the query to an internal representation.
4. QUERY OPTIMIZE: The Optimizer devises a strategy for executing the DM query (chooses among
alternative internal representations of the query).
5. CODE GENERATION: generates code to implement each operator in the selected DM query plan
(the optimizer-selected internal representation).
6. RUNTIME DATABASE PROCESSING: run the plan code.
Developing new, efficient and effective DataMining Query (DMQ) processors is the central need
and issue in DBMS research today (far and away!).
These notes concentrate on step 5, i.e., generating code (algorithms) to implement the operators (at a high level):
Association Rule Mining (ARM), Clustering (CLU), and Classification (CLA).
Database analysis can be broken down into 2 areas: Querying and Data Mining.
Data Mining can be broken down into 2 areas: Machine Learning and Assoc. Rule Mining.
Machine Learning can be broken down into 2 areas: Clustering and Classification.
Clustering can be broken down into 2 types: Isotropic (round clusters) and Density-based.
Classification can be broken down into 2 types: Model-based and Neighbor-based.
Machine Learning is almost always based on Near Neighbor Sets (NNS).
Clustering, even density-based, identifies near neighbor cores first (round NNSs about a center).
Classification is continuity based, and Near Neighbor Sets (NNS) are the central concept in continuity:
∀ε>0 ∃δ>0 : d(x,a)<δ ⇒ d(f(x),f(a))<ε
where f assigns a class to a feature vector, or, in NNS terms:
for every ε-NNS of f(a) there is a δ-NNS of a in its pre-image. If f(Dom) is categorical: ∃δ>0 : d(x,a)<δ ⇒ f(x)=f(a)
Caution: For classification, boundary analysis may also be needed to see the class (done by projecting?).
Finding an NNS in a lower dimension may still be the first step. E.g., points 1 through 8 are all at distance ε from a
(the unclassified sample); 1, 2, 3, 4 are red-class and 5, 6, 7, 8 are blue-class.
[Figure: points 1-8 arranged around the sample a, with 5 and 6 directly beside it.]
Any ε that gives us a vote gives us a tie vote (0-to-0, then 4-to-4).
But projecting onto the vertical subspace and then taking ε/2, we see that the ε/2-neighborhood
about a contains only blue-class (5, 6) votes.
Using horizontal data, NNS derivation requires at least one scan (O(n)). An L∞ ε-NNS can be derived using vertical data
in O(log₂ n) (but Euclidean disks are preferred). (Euclidean and L∞ coincide in binary data sets.)
Query Processing and Optimization:
Relational Queries to Data Mining
Most people have Data from which they want information.
So, most people need DBMSs whether they know it or not.
A major component of any DBMS is the DataMining Query Processor.
Queries can range from structured to unstructured:
• Relational querying: simple searching and aggregating (SQL: SELECT FROM WHERE)
• Complex SQL queries (nested, EXISTS, ...)
• FUZZY queries (e.g., BLAST searches, ...)
• OLAP (rollup, drilldown, slice/dice, ...)
• Machine Learning: Supervised (Classification, Regression) and Unsupervised (Clustering)
• Data Mining: Association Rule Mining
On the Data Mining end, we have barely scratched the surface.
But those scratches have already made the difference between becoming the world’s
biggest corporation (Walmart - got into DM for supply chain mgmt early) and filing for
bankruptcy (KMart - didn't!).
Recall the Entity-Relationship (ER) Model's notion of a Relationship
• Relationship: Association among 2 [or more, the # of entities is the degree] entities.
• The Graph of a Relationship: A degree=2 relationship between entity T and I
generates a bipartite undirected graph (bipartite means that the node set is a disjoint
union of two subsets and that all edges must run from one subset to the other).
Relationships can have attributes too!
[Figure: ER diagram. Entity Employee (attributes ssn, name, lot) and entity Department (attributes did, dname, budget) are joined by the relationship Works_In (attribute since).]
Degree=2 relationship between entities, Employees and Departments.
[Figure: ER diagram. Entity Employee joined to itself by the relationship Reports_To, with roles supervisor and subordinate.]
A degree=2 relationship between an entity and itself, e.g., Employee Reports_To Employee,
generates a unipartite undirected graph. To distinguish roles in a unipartite graph, one can
specify the "role" of each entity.
Association Rule Mining (ARM)
In a relationship between entities T and I, T is the set of Transactions an enterprise
performs and I is the set of Items on which those transactions are performed.
In Market Basket Research (MBR), a transaction is a checkout transaction and
an item is an item in that customer's market basket (which gets checked out).
[Figure: bipartite graph linking the transactions T = {t1, t2, t3, t4, t5} to the items I = {i1, i2, i3, i4}.]
An I-Association Rule, A⇒C, relates 2 disjoint subsets of I (I-itemsets) and has 2
main measures, support and confidence (A is called the antecedent, C the consequent).
The support of an I-set, A, is the fraction of T-instances related to every I-instance in A. E.g., if A={i1,i2}
and C={i4}, then supp(A) = |{t2,t4}|/|{t1,t2,t3,t4,t5}| = 2/5. Note: | | means set size (count of elements in the set).
I.e., t2 and t4 are the only transactions from the total transaction set, T={t1,t2,t3,t4,t5}, that are
related to both i1 and i2 (buy i1 and i2 during the pertinent T-period of time).
The support of the rule A⇒C is defined as supp(A∪C) = |{t2,t4}|/|{t1,t2,t3,t4,t5}| = 2/5.
The confidence of the rule A⇒C is supp(A∪C)/supp(A) = (2/5)/(2/5) = 1.
DM queriers typically want STRONG RULES: supp ≥ minsupp and conf ≥ minconf
(minsupp and minconf are threshold levels).
Note that conf(A⇒C) is also just the conditional probability of t being related to C, given that t is related to A.
There are also the dual concepts of T-association rules (just reverse the roles of T and I above).
Examples of Association Rules include:
• In MBR, the relationship between customer cash-register transactions, T, and purchasable items, I
(t is related to i iff i is being bought by that customer during that cash-register transaction).
• In Software Engineering (SE), the relationship between Aspects, T, and Code Modules, I (t is related to i iff module, i, is
part of the aspect, t).
• In Bioinformatics, the relationship between experiments, T, and genes, I (t is related to i iff gene, i, expresses at a threshold
level during experiment, t).
• In ER diagramming, any "part of" relationship in which i∈I is part of t∈T (t is related to i iff i is part of t); and any "ISA"
relationship in which i∈I ISA t∈T (t is related to i iff i IS A t) . . .
Finding Strong Assoc Rules
The relationship between Transactions and Items can be expressed in a
Transaction Table where each transaction is a row containing its ID
and the list of the items that are related to that transaction:
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

If minsupp is set by the querier at .5 and minconf at .75:
To find the frequent or Large itemsets (support ≥ minsupp), start by finding the large 1-itemsets.
Alternatively, the Transaction Table items can be expressed using "item bit vectors":

T ID               A  B  C  D  E  F
2000               1  1  1  0  0  0
1000               1  0  1  0  0  0
4000               1  0  0  1  0  0
5000               0  1  0  0  1  1
1-itemset supp     3  2  2  1  1  1
Large (supp ≥ 2)   3  2  2
FACT: Any subset of a large itemset is large. Why?
(e.g., if {A, B} is large, {A} and {B} must be large, because every transaction that contains {A, B} also contains {A} and {B}.)
APRIORI METHOD:
Iteratively find the large k-itemsets, k=1...
Find all association rules supported by each large itemset.
Ck denotes the candidate k-itemsets generated at each step.
Lk denotes the large k-itemsets.
PseudoCode (assume the items in each itemset of Lk-1 are ordered):
Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1<q.itemk-1
Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) delete c from Ck
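The join and prune steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `apriori_gen` and the representation of itemsets as sorted tuples are choices made here, not part of the notes, and the small `L1` and `L2` inputs are illustrative.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Build candidate k-itemsets C_k from the large (k-1)-itemsets L_{k-1}."""
    prev = {tuple(sorted(s)) for s in large_prev}
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    candidates = set()
    # Step 1: self-join. p and q agree on their first k-2 items and
    # p's last item is smaller than q's last item.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Step 2: prune any candidate with a (k-1)-subset that is not in L_{k-1}.
    return {c for c in candidates
            if all(s in prev for s in combinations(c, k - 1))}

# Illustrative inputs (itemsets written as sorted tuples).
L1 = [(1,), (2,), (3,), (5,)]
L2 = [(1, 3), (2, 3), (2, 5), (3, 5)]
C2 = apriori_gen(L1)
C3 = apriori_gen(L2)
print(sorted(C2))
print(sorted(C3))
```

Because the join only pairs itemsets that share their first k-2 items, many non-candidates are never generated at all; the prune step removes the rest.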
Ptree Review: A data table, R(A1..An), containing horizontal structures (records) is
processed vertically (vertical scans).
Vertical basic binary Predicate-tree (P-tree): vertically partition the table; compress each vertical bit
slice into a basic binary P-tree as follows; then process using multi-operand logical ANDs.

R(A1 A2 A3 A4): horizontal structures (records), scanned vertically:
A1   A2   A3   A4
010  111  110  001
011  111  110  000
010  110  101  001
010  111  101  111
101  010  001  100
010  010  001  101
111  000  001  100
111  000  001  100

Each attribute R[Ai] is split into bit slices, e.g., R[A1] into R11, R12, R13. The high-order slice of A1 is
R11 = 0 0 0 0 1 0 1 1
The basic binary P-tree, P11, for R11 is built top-down by recording the truth of the
predicate pure1 recursively on halves, until purity:
1. Whole file is not pure1 → 0
2. 1st half is not pure1 → 0 (but it is pure (pure0), so this branch ends)
3. 2nd half is not pure1 → 0
4. 1st half of 2nd half is not pure1 → 0
5. 2nd half of 2nd half is pure1 → 1
6. 1st half of 1st half of 2nd half is pure1 → 1
7. 2nd half of 1st half of 2nd half is not pure1 → 0
[Figure: the table R, its bit slices R11, R12, R13, R21, ..., R43, and the corresponding basic binary P-trees P11, P12, P13, P21, ..., P43.]
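E.g., to count the number of occurrences of the tuple 111 000 001 100, AND the basic P-trees, using the
complement P' wherever the pattern has a 0 bit:
P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43 =
0      (2^3-level)
0 0    (2^2-level)
0 1    (2^1-level)
The only 1-bit is at the 2^1-level, so the count is 1*2^1 = 2.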
Top-down construction of basic binary P-trees is good for understanding, but bottom-up is more efficient.
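A minimal Python sketch of the ideas above, assuming uncompressed bit slices held as plain lists (the compression and tree-level AND of real P-trees are not reproduced here). The rows of `R` are transcribed from the binary table, and the helper names `bit_slice`, `build_ptree`, and `count_tuple` are hypothetical.

```python
# Rows of R(A1, A2, A3, A4); each attribute is a 3-bit value.
R = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
     (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]
BITS = 3

def bit_slice(rows, attr, bit):
    """Vertical bit slice of one attribute as a list of 0/1 (bit 0 = high order)."""
    return [(row[attr] >> (BITS - 1 - bit)) & 1 for row in rows]

def build_ptree(bits):
    """Top-down basic P-tree: flag 1 = pure1, flag 0 = not pure1.
    Returns (flag, children); children is None when the segment is pure."""
    if all(bits):
        return (1, None)
    if not any(bits):
        return (0, None)  # pure0, so the branch ends here
    half = len(bits) // 2
    return (0, (build_ptree(bits[:half]), build_ptree(bits[half:])))

def count_tuple(rows, target):
    """Count rows equal to target by ANDing slices (complemented for 0 bits)."""
    result = [1] * len(rows)
    for attr, value in enumerate(target):
        for bit in range(BITS):
            want = (value >> (BITS - 1 - bit)) & 1
            column = bit_slice(rows, attr, bit)
            # Use the slice itself for a 1 bit and its complement for a 0 bit.
            result = [r & (c if want else 1 - c) for r, c in zip(result, column)]
    return sum(result)

r11 = bit_slice(R, 0, 0)
p11 = build_ptree(r11)
n = count_tuple(R, (7, 0, 1, 4))
print(r11)
print(p11)
print(n)
```

The count is obtained without ever reading a horizontal record as a unit; only the twelve vertical slices are combined.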
Bottom-up construction of P11 is done using in-order tree traversal and the collapsing of pure siblings.
[Figure: bottom-up construction of P11 from the bit slice R11 = 0 0 0 0 1 0 1 1, shown alongside the bit slices R11..R43.]
Processing Efficiencies? (prefixed leaf-sizes have been removed)
[Figure: R(A1 A2 A3 A4) shown in decimal and in binary, its bit slices R11..R43, and the basic P-trees P11..P43 being ANDed.]
To count occurrences of 7,0,1,4 use pure111000001100:
P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43
A 0 at a node of any operand makes the entire branch below that node 0 in the result; a result node is 1
only where every uncomplemented operand has a 1 and every complemented operand has a 0.
The 2^1-level has the only 1-bit, so the 1-count = 1*2^1 = 2.
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D to count C1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (from L1), then scan D to count:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (from L2): {2 3 5}
({1 2 3} is pruned since {1 2} is not large; {1 3 5} is pruned since {1 5} is not large.)
Scan D to count, giving L3:
itemset   sup
{2 3 5}   2

Example ARM using uncompressed P-trees
(note: I have placed the 1-count at the root of each Ptree)
Build P-trees with one scan of D:
TID   1 2 3 4 5
100   1 0 1 1 0
200   0 1 1 0 1
300   1 1 1 0 1
400   0 1 0 0 1

P-tree      bits (TIDs 100, 200, 300, 400)   1-count
P1          1010                             2
P2          0111                             3
P3          1110                             3
P4          1000                             1
P5          0111                             3
L1={1,2,3,5}
P1^P2       0010                             1
P1^P3       1010                             2
P1^P5       0010                             1
P2^P3       0110                             2
P2^P5       0111                             3
P3^P5       0110                             2
L2={13,23,25,35}
P1^P2^P3    0010                             1
P1^P3^P5    0010                             1
P2^P3^P5    0110                             2
L3={235}

1-ItemSets don't support Association Rules (they would have no antecedent or no consequent).
2-Itemsets do support ARs.
Are there any Strong Rules supported by Large 2-ItemSets (at minconf=.75)?
{1,3}
conf({1}⇒{3}) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75 STRONG!
conf({3}⇒{1}) = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}
conf({2}⇒{3}) = supp{2,3}/supp{2} = 2/3 = .67 < .75
conf({3}⇒{2}) = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}
conf({2}⇒{5}) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75 STRONG!
conf({5}⇒{2}) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75 STRONG!
{3,5}
conf({3}⇒{5}) = supp{3,5}/supp{3} = 2/3 = .67 < .75
conf({5}⇒{3}) = supp{3,5}/supp{5} = 2/3 = .67 < .75
Are there any Strong Rules supported by Large 3-ItemSets?
{2,3,5}
conf({2,3}⇒{5}) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75 STRONG!
conf({2,5}⇒{3}) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75
No subset antecedent of {2,5} can yield a strong rule either (i.e., no need to check
conf({2}⇒{3,5}) or conf({5}⇒{2,3}), since both denominators will be at least
as large and therefore both confidences will be at least as low).
conf({3,5}⇒{2}) = supp{2,3,5}/supp{3,5} = 2/2 = 1 ≥ .75 STRONG!
The only antecedent not yet ruled out is {3}: conf({3}⇒{2,5}) = supp{2,3,5}/supp{3} = 2/3 = .67 < .75
DONE!
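The confidence checks above can be reproduced with a short Python sketch. As an assumption made for brevity, it finds the large itemsets by brute-force enumeration of all item subsets rather than by Apriori or P-trees, which is only reasonable for a database as small as D; the names `supp`, `large`, and `strong` are chosen here for illustration.

```python
from itertools import combinations

# Database D from the example: TID -> set of items.
D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
MINSUPP, MINCONF = 0.5, 0.75

def supp(itemset):
    """Fraction of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(items <= t for t in D.values()) / len(D)

all_items = sorted(set().union(*D.values()))
# Brute force: test every non-empty subset of the items.
large = [frozenset(c)
         for k in range(1, len(all_items) + 1)
         for c in combinations(all_items, k)
         if supp(c) >= MINSUPP]

strong = []
for itemset in large:
    # Every non-empty proper subset is a possible antecedent.
    for k in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), k):
            a = frozenset(antecedent)
            conf = supp(itemset) / supp(a)
            if conf >= MINCONF:
                strong.append((sorted(a), sorted(itemset - a), conf))

for a, c, conf in strong:
    print(a, "=>", c, round(conf, 2))
```

Unlike the hand calculation, this sketch checks every antecedent instead of skipping subsets of an antecedent that already failed; the skipping is an optimization and does not change the result.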
Ptree-ARM versus Apriori
on aerial photo (RGB) data together with yield data
P-ARM compared to Horizontal Apriori (classical) and FP-growth (an improvement of it).
In P-ARM, we find all frequent itemsets, not just those containing Yield (for fairness).
Aerial TIFF images (R,G,B) with synchronized yield (Y).
• 1320 × 1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000).
[Figure: "Scalability with support threshold". Run time (sec.) of P-ARM and Apriori versus support threshold (10% to 90%).]
[Figure: "Scalability with number of transactions". Time (sec.) of P-ARM and Apriori versus number of transactions (K), 100 to 1700.]
Identical results
P-ARM is more scalable for lower support thresholds.
P-ARM algorithm is more scalable to large spatial datasets.

P-ARM versus FP-growth (see literature for definition)
17,424,000 pixels (transactions)
[Figure: "Scalability with support threshold". Run time (sec.) of P-ARM and FP-growth versus support threshold (10% to 90%).]
[Figure: "Scalability with number of trans". Time (sec.) of P-ARM and FP-growth versus number of transactions (K), 100 to 1700.]
FP-growth = efficient, tree-based frequent pattern mining method (details later)
For a dataset of 100K bytes, FP-growth runs very fast. But for images of large
size, P-ARM achieves better performance.
P-ARM achieves better performance in the case of low support threshold.
Other methods (other than FP-growth) to Improve Apriori’s Efficiency
(see the literature or the html notes 10datamining.html in Other Materials for more detail)
• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count
is below the threshold cannot be frequent
• Transaction reduction: A transaction that does not contain any frequent k-itemset is
useless in subsequent scans
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least
one of the partitions of DB
• Sampling: mining on a subset of given data, lower support threshold + a method to
determine the completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets
are estimated to be frequent
• The core of the Apriori algorithm:
– Use only large (k – 1)-itemsets to generate candidate large k-itemsets
– Use database scan and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
1. Huge candidate sets:
10^4 large 1-itemsets may generate 10^7 candidate 2-itemsets
To discover a large pattern of size 100, e.g., {a1…a100}, we need to generate 2^100 ≈ 10^30 candidates.
2. Multiple scans of the database: (needs (n+1) scans, n = length of the longest pattern)
Classification
Using a Training Data Set (TDS) in which each feature tuple is already classified (has a class value, called its
class label, attached to it in the class column):
1. Build a model of the TDS (called the TRAINING PHASE).
2. Use that model to classify unclassified feature tuples (unclassified samples). E.g., TDS = last year's aerial
image of a crop field (feature columns are the R, G, B columns, together with last year's crop yields attached in
a class column, e.g., class values = {Hi, Med, Lo} yield). Unclassified samples are the RGB tuples from
this year's aerial image.
3. Predict the class of each unclassified tuple (in the example: predict the yield for each point in the field).
[Figure: unclassified samples enter an INPUT hopper; a MODEL of the feature_tuple→class relationship, built from the TDS table (columns tuple#, feature1, f2, f3, ..., fn, class; rows tuple1 ... tuplem), outputs the predicted class of each unclassified sample.]
3 steps: Build a Model of the TDS feature-to-class relationship, Test that Model, Use the Model
(to predict the most likely class of each unclassified sample). Note: other names for this process include regression analysis, case-based reasoning, ...
Other Typical Applications:
– Targeted Product Marketing (the so-called classical Business Intelligence problem)
– Medical Diagnosis (the so-called Computer Aided Diagnosis or CAD)
• Nearest Neighbor Classifiers (NNCs) use a portion of the TDS as the model (neighboring tuples vote)
– Finding the neighbor set is much faster than building other models, but it must be done anew for each unclassified
sample. (NNC is called a lazy classifier because it gets lazy and doesn't take the time to build a concise model of
the relationship between feature tuples and class labels ahead of time.)
• Eager Classifiers (~all other classifiers) build 1 concise model once and for all, then use it for all
unclassified samples. The model building can be very costly, but that cost can be amortized over all the
classifications of a large number of unclassified samples (e.g., all RGB points in a field).
Eager Classifiers
[Figure: Training Data are fed to the Classification Algorithm, which creates the Classifier or Model during the training phase; unclassified samples enter the Model through the INPUT hopper.]
Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
E.g., Model (as a rule set):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Unclassified samples: (clyde, professor, 8); (tim, assoc. professor, 3); (herb, professor, 4)
Test Process (2):
Usually some of the Training Tuples are set aside as a Test Set and after a
model is constructed, the Test Tuples are run through the Model.
The Model is acceptable if, e.g., the % correct > 60%. If not, the Model is rejected (never used).
[Figure: the Testing Data are run through the Classifier.]
Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Associate Prof   5       yes
Joseph    Assistant Prof   7       no
% correct classifications? Correct=3, Incorrect=1: 75%
Since 75% is above the acceptability threshold, accept the model!
Classification by Decision Tree Induction
• Decision tree (instead of a simple case statement of rules, the rules are prioritized into a tree)
– Each Internal node denotes a test or rule on an attribute (test attribute for that node)
– Each Branch represents an outcome of the test (value of the test attribute)
– Leaf nodes represent class label decisions (plurality leaf class is predicted class)
• Decision tree model development consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Decision tree use: Classify unclassified samples by filtering them
down the decision tree to their proper leaf, then predict the plurality
class of that leaf (often only one, depending upon the stopping
condition of the construction phase)
Algorithm for Decision Tree Induction
• Basic ID3 algorithm (a simple greedy top-down algorithm)
– At start, the current node is the root and all the training tuples are at the root
– Repeat, down each branch, until the stopping condition is true
• At current node, choose a decision attribute (e.g., one with largest information gain).
• Each value for that decision attribute is associated with a link to the next level down
and that value is used as the selection criterion of that link.
• Each new level produces a partition of the parent training subset based on the
selection value assigned to its link.
– stopping conditions:
• When all samples for a given node belong to the same class
• When there are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
• When there are no samples left
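As an illustration of the "largest information gain" choice in ID3, the following Python sketch computes the gain of each attribute on the six tenure training tuples from the eager-classifier example. Treating YEARS as a categorical attribute (one branch per distinct value) is an assumption made here to keep the sketch short, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

# (rank, years, tenured) training tuples.
TRAIN = [("Assistant Prof", 3, "no"), ("Assistant Prof", 7, "yes"),
         ("Professor", 2, "yes"), ("Associate Prof", 7, "yes"),
         ("Assistant Prof", 6, "no"), ("Associate Prof", 3, "no")]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index, class_index=-1):
    """Entropy of the class minus the weighted entropy after splitting on attr."""
    labels = [r[class_index] for r in rows]
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[class_index])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

gain_rank = information_gain(TRAIN, 0)
gain_years = information_gain(TRAIN, 1)
print(round(gain_rank, 4), round(gain_years, 4))
```

On this tiny set the YEARS split looks perfect only because every distinct value becomes its own branch; a practical implementation would usually discretize or threshold numeric attributes first.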
Bayesian Classification
(eager: Model is based on conditional probabilities. Prediction is done by taking the highest conditionally probable class)
A Bayesian classifier is a statistical classifier based on the
following theorem, known as Bayes theorem:
Bayes theorem:
Let X be a data sample whose class label is unknown.
Let H be the hypothesis that X belongs to a particular class.
P(H|X) is the conditional probability of H given X, and P(H) is the probability of H;
then
P(H|X) = P(X|H)P(H)/P(X)
Naïve Bayesian Classification
Given a training set, R(f1..fn, C), where C={C1..Cm} is the class label attribute.
A Naive Bayesian Classifier will predict the class of an unknown data sample, X=(x1..xn),
to be the class, Cj, having the highest conditional probability, conditioned on X.
That is, it will predict the class to be Cj iff
P(Cj|X) ≥ P(Ci|X) for all i ≠ j
(a tie-handling algorithm may be required).
• From the Bayes theorem:
P(Cj|X) = P(X|Cj)P(Cj)/P(X)
– P(X) is constant for all classes, so we need only maximize P(X|Cj)P(Cj).
– The P(Cj)s are known.
– To reduce the computational complexity of calculating all the P(X|Cj)s, the naive assumption
is to assume class conditional independence: P(X|Cj) is the product of the P(xi|Cj)s.
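A minimal Python sketch of the naive Bayesian prediction rule, assuming categorical features and plain relative-frequency estimates with no smoothing. The training rows reuse the tenure example with YEARS reduced to the boolean "years > 6", which is a discretization chosen here purely for illustration; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows):
    """rows: list of (feature_tuple, class_label). Returns the count tables."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for features, label in rows:
        for i, v in enumerate(features):
            value_counts[(i, label)][v] += 1
    return class_counts, value_counts

def predict(model, x):
    """Return the class Cj maximizing P(Cj) * product of P(xi | Cj), and all scores."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total                              # P(Cj)
        for i, v in enumerate(x):
            score *= value_counts[(i, label)][v] / n   # P(xi | Cj)
        scores[label] = score
    return max(scores, key=scores.get), scores

# (rank, years > 6) -> tenured; the discretization is an assumption.
ROWS = [(("Assistant Prof", False), "no"), (("Assistant Prof", True), "yes"),
        (("Professor", False), "yes"), (("Associate Prof", True), "yes"),
        (("Assistant Prof", False), "no"), (("Associate Prof", False), "no")]

model = train_naive_bayes(ROWS)
label, scores = predict(model, ("Associate Prof", True))
print(label, scores)
```

P(X) is never computed, since it is the same for every class and does not affect which score is largest.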
Neural Network Classification
• A Neural Network is trained to make the prediction
• Advantages
– prediction accuracy is generally high
– it is generally robust (works when training examples contain errors)
– output may be discrete, real-valued, or a vector of several
discrete or real-valued attributes
– It provides fast classification of unclassified samples.
• Criticism
– It is difficult to understand the learned function (it involves
complex and almost magic weight adjustments).
– It is difficult to incorporate domain knowledge.
– Training time is long (for large training sets, it is prohibitive!)
A Neuron
[Figure: a neuron. The input vector x = (x0, ..., xn) and the weight vector w = (w0, ..., wn) feed a weighted sum, with bias term −μk, followed by the activation function f, which produces the output y.]
• The input feature vector x=(x0..xn) is mapped into the
variable y by means of the scalar product with the weight vector w, a bias, μk, and a
nonlinear function mapping, f (called the damping function).
Neural Network Training
• The ultimate objective of training
– obtain a set of weights that makes almost all the tuples in the training data
classify correctly (usually using a time-consuming "back propagation" procedure
which is based, ultimately, on Newton's method. See the literature or Other Materials
- 10datamining.html for examples and alternate training techniques).
• Steps
– Initialize weights with random values
– Feed the input tuples into the network
– For each unit
• Compute the net input to the unit as a linear combination of all the inputs to the unit
• Compute the output value using the activation function
• Compute the error
• Update the weights and the bias
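Neural Multi-Layer Perceptron
[Figure: a multi-layer perceptron. The input vector x_i enters the input nodes, which connect through weights w_ij to the hidden nodes and on to the output nodes, which produce the output vector.]
Net input and output of unit j:
$$I_j = \sum_i w_{ij} O_i + \theta_j, \qquad O_j = \frac{1}{1 + e^{-I_j}}$$
Error at an output node j (T_j is the target value):
$$Err_j = O_j (1 - O_j)(T_j - O_j)$$
Error at a hidden node j (k ranges over the nodes of the next layer):
$$Err_j = O_j (1 - O_j) \sum_k Err_k \, w_{jk}$$
Weight and bias updates, with learning rate l:
$$w_{ij} = w_{ij} + (l)\, Err_j\, O_i, \qquad \theta_j = \theta_j + (l)\, Err_j$$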
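To make the update rules concrete, here is a hedged Python sketch of repeated training steps for a single output unit with a logistic activation. It covers only the output-node case (no hidden layer), and the inputs, starting weights, target, and learning rate of 0.9 are arbitrary values assumed for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation: O = 1 / (1 + e^(-I))."""
    return 1.0 / (1.0 + math.exp(-z))

def train_step(inputs, weights, theta, target, rate):
    """One update of a single output unit j; returns new weights, bias, output."""
    net = sum(w * o for w, o in zip(weights, inputs)) + theta   # I_j
    out = sigmoid(net)                                          # O_j
    err = out * (1.0 - out) * (target - out)                    # Err_j (output node)
    new_weights = [w + rate * err * o for w, o in zip(weights, inputs)]
    new_theta = theta + rate * err
    return new_weights, new_theta, out

# Illustrative values: two inputs, target output 1.0.
weights, theta = [0.0, 0.0], 0.0
outputs = []
for _ in range(200):
    weights, theta, out = train_step([1.0, 0.0], weights, theta, 1.0, 0.9)
    outputs.append(out)
print(outputs[0], outputs[-1], weights, theta)
```

The weight attached to the zero-valued input never changes, because each update is proportional to the input O_i that the weight multiplies.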
These next 3 slides treat the concept of Distance in great detail. You may feel you don't need this much
detail; if so, skip what you feel you don't need.
For Nearest Neighbor Classification, a distance is needed
(to make sense of "nearest". Other classifiers also use distance.)
A distance is a function, d, applied to two n-dimensional points X and Y, such that
d(X, Y) is positive definite:
if X ≠ Y, d(X, Y) > 0;
if X = Y, d(X, Y) = 0
d(X, Y) is symmetric:
d(X, Y) = d(Y, X)
d(X, Y) satisfies the triangle inequality:
d(X, Y) + d(Y, Z) ≥ d(X, Z)
For $X = (x_1, x_2, x_3, \ldots, x_n)$ and $Y = (y_1, y_2, y_3, \ldots, y_n)$:
Minkowski distance or Lp distance,
$$d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
Manhattan distance (p = 1),
$$d_1(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$$
Euclidean distance (p = 2),
$$d_2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Max distance (p = ∞),
$$d_\infty(X, Y) = \max_{i=1}^{n} |x_i - y_i|$$
Canberra distance,
$$d_c(X, Y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i + y_i}$$
Squared chord distance,
$$d_{sc}(X, Y) = \sum_{i=1}^{n} \left( \sqrt{x_i} - \sqrt{y_i} \right)^2$$
Squared chi-squared distance,
$$d_{chi}(X, Y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$$
An Example
A two-dimensional space with X = (2,1), Y = (6,4), and Z = (6,1) the corner point between them:
Manhattan, d1(X,Y) = XZ + ZY = 4+3 = 7
Euclidean, d2(X,Y) = XY = 5
Max, d∞(X,Y) = Max(XZ, ZY) = XZ = 4
d1 ≥ d2 ≥ d∞ always.
In fact, for any positive integer p, d_p ≥ d_{p+1}.
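The worked example can be checked with a few lines of Python. This is a sketch; the function names `minkowski` and `max_distance` are chosen here for illustration.

```python
def minkowski(x, y, p):
    """L_p distance: (sum |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def max_distance(x, y):
    """L_infinity distance: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

X, Y = (2, 1), (6, 4)
d1 = minkowski(X, Y, 1)    # Manhattan
d2 = minkowski(X, Y, 2)    # Euclidean
d3 = minkowski(X, Y, 3)
dmax = max_distance(X, Y)  # Max
print(d1, d2, d3, dmax)
```

The values decrease as p grows, in line with the ordering d1 ≥ d2 ≥ d∞ stated above.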
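Neighborhoods of a Point
A Neighborhood (disk neighborhood) of a point, T, with radius r is the set of points, S, such that
X ∈ S iff d(T, X) ≤ r
[Figure: disk neighborhoods of T, each of width 2r, under the Manhattan (diamond), Euclidean (circle), and Max (square) distances, each with a boundary point X.]
If X is a point on the boundary, d(T, X) = r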
Classical k-Nearest Neighbor Classification
• Select a suitable value for k (how many Training Data Set (TDS) neighbors do you
want to vote as to the best predicted class for the unclassified feature sample? )
• Determine a suitable distance metric (to give meaning to neighbor)
• Find the k nearest training set points to the unclassified
sample.
• Let them vote (tally up the counts of TDS neighbors for each class).
• Predict the class with the highest vote (plurality class) from among
the k-nearest neighbor set.
Closed-KNN
Example: assume 2 features (one in the x-direction and one in the y-direction).
T is the unclassified sample.
Using k = 3, find the three nearest neighbors.
KNN arbitrarily selects one point from the boundary line shown.
Closed-KNN includes all points on the boundary.
Closed-KNN yields higher classification accuracy than traditional KNN (thesis of
MD Maleq Khan, NDSU, 2001).
The P-tree method always produces closed neighborhoods (and is faster!)
k-Nearest Neighbor (kNN) Classification and
Closed-k-Nearest Neighbor (CkNN) Classification
1) Select a suitable value for k
2) Determine a suitable distance or similarity notion.
3) Find the k nearest neighbor set [closed] of the unclassified sample.
4) Find the plurality class in the nearest neighbor set.
5) Assign the plurality class as the predicted class of the sample
T is the unclassified sample. Use Euclidean distance.
k = 3: Find the 3 closest neighbors. Move out from T until there are ≥ 3 neighbors; the boundary line may then hold more than 3!
kNN arbitrarily selects one point from that boundary line as the 3rd nearest
neighbor, whereas CkNN includes all points on that boundary line.
CkNN yields higher classification accuracy than traditional kNN.
At what additional cost? Actually, at negative cost (faster and more accurate!!)
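The difference between kNN and closed kNN can be sketched in Python as follows. The training points below are hypothetical, chosen so that the two methods disagree, and Hamming distance on binary features is an assumption made for simplicity (any distance function could be passed in). This horizontal sketch sorts the training set; it does not reproduce the P-tree method.

```python
from collections import Counter

def hamming(x, y):
    """Number of positions in which two equal-length tuples differ."""
    return sum(a != b for a, b in zip(x, y))

def knn(train, sample, k, dist=hamming):
    """Plain kNN: exactly k voters; boundary ties are broken by list order."""
    ranked = sorted(train, key=lambda row: dist(row[0], sample))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], ranked[:k]

def closed_knn(train, sample, k, dist=hamming):
    """Closed kNN: every point as near as the k-th neighbor gets to vote."""
    ranked = sorted(train, key=lambda row: dist(row[0], sample))
    cutoff = dist(ranked[k - 1][0], sample)  # distance of the k-th neighbor
    voters = [row for row in ranked if dist(row[0], sample) <= cutoff]
    votes = Counter(label for _, label in voters)
    return votes.most_common(1)[0][0], voters

# Hypothetical training set: (features, class label).
TRAIN = [((0, 0, 0, 1), 1), ((0, 0, 1, 0), 0), ((0, 1, 1, 0), 1),
         ((1, 1, 0, 0), 0), ((1, 0, 0, 1), 0), ((1, 1, 1, 1), 1)]
SAMPLE = (0, 0, 0, 0)

plain_class, plain_voters = knn(TRAIN, SAMPLE, 3)
closed_class, closed_voters = closed_knn(TRAIN, SAMPLE, 3)
print(plain_class, len(plain_voters))
print(closed_class, len(closed_voters))
```

With these points, plain 3NN lets only one of the three distance-2 points vote, while the closed set admits all of them, which changes the predicted class.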
The slides numbered 28 through 93 give great detail on the relative performance of
kNN and CkNN, on the use of other distance functions, some examples, etc.
There may be more detail on these issues than you want/need. If so, just scan for
what you are most interested in or just skip ahead to slide 94 on CLUSTERING.
Experiments were run on two sets of (aerial) remotely sensed images of the
Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND.
The data contain 6 bands: Red, Green, Blue reflectance values, Soil Moisture,
Nitrate, and Yield (class label).
Band values range from 0 to 255 (8 bits).
Considering 8 classes or levels of yield values.
Performance – Accuracy (3 horizontal methods in middle, 3 vertical methods (the 2 most accurate and the least accurate))
[Figure: 1997 Dataset. Accuracy (%) versus Training Set Size (no. of pixels, 256 to 262144) for kNN-Manhattan, kNN-Euclidian, kNN-Max, kNN using HOBbit distance, P-tree Closed-KNN-max, and Closed-kNN using HOBbit distance.]
[Figure: 1998 Dataset. Accuracy (%) versus Training Set Size (no. of pixels, 256 to 262144) for the same six methods.]
Performance – Speed (3 horizontal methods in middle, 3 vertical methods (the 2 fastest (the same 2) and the slowest))
[Figure: 1997 Dataset, both axes in logarithmic scale. Per Sample Classification Time (sec) versus Training Set Size (no. of pixels, 256 to 262144) for the same six methods.]
Hint: NEVER use a log scale to show a WIN!!!
[Figure: 1998 Dataset, both axes in logarithmic scale. Per Sample Classification Time (sec) versus Training Set Size (no. of pixels, 256 to 262144) for the same six methods.]
Win-Win situation!! (almost never happens)
P-tree CkNN and CkNN-H are more accurate and much faster.
kNN-H is not recommended because it is slower and less accurate (because it doesn't use Closed nbr sets
and it requires another step to get rid of ties (why do it?)).
Horizontal kNNs are not recommended because they are less accurate and slower!
WALK THRU: 3NN CLASSIFICATION of an unclassified sample, a = (a5 a6 a11 a12 a13 a14) = (0 0 0 0 0 0).
HORIZONTAL APPROACH
(relevant attributes are a5 a6 a11 a12 a13 a14)
[Table: the 17 training tuples (keys t12, t13, t15, t16, t21, t27, t31, t32, t33, t35, t51, t53, t55, t57, t61, t72, t75) over attributes a1..a20, with a10 = C the class label. During the scan, each tuple's distance d from the sample is computed and the tuple either replaces a member of the current 3NN set or does not.]
The 3 nearest neighbors after the scan:
Key   a5  a6  a10=C  a11  a12  a13  a14  distance
t12   0   0   1      0    1    1    0    2
t13   0   0   1      0    1    0    0    1
t53   0   0   0      0    1    0    0    1
C=1 wins!
Note that only 1 of the many training tuples at distance=2 from the sample got to vote.
We didn't know that distance=2 was going to be the vote cutoff until the end of the 1st scan.
Finding the other distance=2 voters (Closed 3NN set or C3NN) requires another scan.
WALK THRU of the required 2nd scan to find the Closed 3NN set. Does it change the vote?
YES! C=0 wins now!
Unclassified sample: (a5 a6 a11 a12 a13 a14) = (0 0 0 0 0 0)
3NN set after 1st scan:
key  a5 a6 a10=C a11 a12 a13 a14  distance
t12   0  0    1    0   1   1   0     2
t13   0  0    1    0   1   0   0     1
t53   0  0    0    0   1   0   0     1
Vote after 1st scan: C=1 wins, 2 to 1.
2nd scan (every tuple at distance <= 2 is included):
key  a5 a6 a10=C a11 a12 a13 a14   d   action
t12   0  0    1    0   1   1   0   2   already voted
t13   0  0    1    0   1   0   0   1   already voted
t15   0  0    1    0   1   0   1   2   include it also
t16   0  0    1    1   0   1   0   2   include it also
t21   1  1    1    1   0   1   0   4   don't include
t27   1  1    1    0   0   1   1   4   don't include
t31   1  0    1    1   0   1   0   3   don't include
t32   1  0    1    0   1   1   0   3   don't include
t33   1  0    1    0   1   0   0   2   include it also
t35   1  0    1    0   1   0   1   3   don't include
t51   0  0    0    1   0   1   0   2   include it also
t53   0  0    0    0   1   0   0   1   already voted
t55   0  0    0    0   1   0   1   2   include it also
t57   0  0    0    0   0   1   1   2   include it also
t61   1  0    0    1   0   1   0   3   don't include
t72   0  0    0    0   1   1   0   2   include it also
t75   0  0    0    0   1   0   1   2   include it also
Vote over the Closed 3NN set: C=1 gets 5 votes (t12, t13, t15, t16, t33), C=0 gets 6 votes (t51, t53, t55, t57, t72, t75).
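The two horizontal scans above can be sketched in Python. This is a minimal illustration, not the course's implementation: the function names are hypothetical, the table holds only the relevant columns, and ties for "worst current neighbor" are broken toward the most recently added tuple so that the replacement order matches the walk-through.

```python
from collections import Counter

# (key, bits of a5 a6 a11 a12 a13 a14, class C), copied from the training table
TRAIN = [
    ("t12", "000110", 1), ("t13", "000100", 1), ("t15", "000101", 1),
    ("t16", "001010", 1), ("t21", "111010", 1), ("t27", "110011", 1),
    ("t31", "101010", 1), ("t32", "100110", 1), ("t33", "100100", 1),
    ("t35", "100101", 1), ("t51", "001010", 0), ("t53", "000100", 0),
    ("t55", "000101", 0), ("t57", "000011", 0), ("t61", "101010", 0),
    ("t72", "000110", 0), ("t75", "000101", 0),
]

def hamming(bits, sample):
    """Number of positions where the tuple differs from the sample."""
    return sum(b != s for b, s in zip(bits, sample))

def first_scan_knn(train, sample, k=3):
    """One pass: keep k tuples; replace the worst one only when strictly closer."""
    nbrs = []  # entries are (distance, key, class)
    for key, bits, cls in train:
        d = hamming(bits, sample)
        if len(nbrs) < k:
            nbrs.append((d, key, cls))
            continue
        # worst = largest distance; among ties, the latest entry (assumption)
        worst = max(range(k), key=lambda i: (nbrs[i][0], i))
        if d < nbrs[worst][0]:
            nbrs[worst] = (d, key, cls)
    return nbrs

def closed_knn(train, sample, k=3):
    """Second pass: include every tuple within the cutoff found by the first pass."""
    cutoff = max(d for d, _, _ in first_scan_knn(train, sample, k))
    return [(hamming(bits, sample), key, cls)
            for key, bits, cls in train if hamming(bits, sample) <= cutoff]

def vote(nbrs):
    """Count one vote per neighbor for its class."""
    return Counter(cls for _, _, cls in nbrs)

print(vote(first_scan_knn(TRAIN, "000000")))  # 3NN vote
print(vote(closed_knn(TRAIN, "000000")))      # closed 3NN vote
```

The cutoff distance is only known once the first pass ends, which is why the closed set needs the second pass over the horizontal data.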
WALK THRU: Closed 3NNC using P-trees
First let all training points at distance=0 vote, then distance=1, then distance=2, ... until ≥ 3 have voted.
For distance=0 (exact matches), construct the P-tree, Ps, then AND it with PC and PC' to compute the vote.
(In the slide graphics, black denotes a complemented P-tree and red denotes an uncomplemented one.)
Since the sample is (000000), Ps is the AND of the complements of the six relevant attribute P-trees:
Ps = P'a5 ^ P'a6 ^ P'a11 ^ P'a12 ^ P'a13 ^ P'a14
Ps is all zeros (root count 0), so Ps ^ PC and Ps ^ PC' are empty as well.
No neighbors at distance=0
key  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12   1  0  1  0  0  0  1  1  0    1    0   1   1   0   1   1   0   0   0   1
t13   1  0  1  0  0  0  1  1  0    1    0   1   0   0   1   0   0   0   1   1
t15   1  0  1  0  0  0  1  1  0    1    0   1   0   1   0   0   1   1   0   0
t16   1  0  1  0  0  0  1  1  0    1    1   0   1   0   1   0   0   0   1   0
t21   0  1  1  0  1  1  0  0  0    1    1   0   1   0   0   0   1   1   0   1
t27   0  1  1  0  1  1  0  0  0    1    0   0   1   1   0   0   1   1   0   0
t31   0  1  0  0  1  0  0  0  1    1    1   0   1   0   0   0   1   1   0   1
t32   0  1  0  0  1  0  0  0  1    1    0   1   1   0   1   1   0   0   0   1
t33   0  1  0  0  1  0  0  0  1    1    0   1   0   0   1   0   0   0   1   1
t35   0  1  0  0  1  0  0  0  1    1    0   1   0   1   0   0   1   1   0   0
t51   0  1  0  1  0  0  1  1  0    0    1   0   1   0   0   0   1   1   0   1
t53   0  1  0  1  0  0  1  1  0    0    0   1   0   0   1   0   0   0   1   1
t55   0  1  0  1  0  0  1  1  0    0    0   1   0   1   0   0   1   1   0   0
t57   0  1  0  1  0  0  1  1  0    0    0   0   1   1   0   0   1   1   0   0
t61   1  0  1  0  1  0  0  0  1    0    1   0   1   0   0   0   1   1   0   1
t72   0  0  1  1  0  0  1  1  0    0    0   1   1   0   1   1   0   0   0   1
t75   0  0  1  1  0  0  1  1  0    0    0   1   0   1   0   0   1   1   0   0
WALK THRU: C3NNC
distance=1 nbrs:
Construct the P-tree
\[ P_{S(s,1)} = \mathrm{OR}_{i \in \{5,6,11,12,13,14\}} P_i , \qquad P_i = P_{|s_i-t_i|=1;\ |s_j-t_j|=0,\ j \neq i} = P_{S(s_i,1)} \wedge \bigwedge_{j \in \{5,6,11,12,13,14\}-\{i\}} P_{S(s_j,0)} \]
Each of P5, P6, P11, P12, P13, P14 is the AND of one uncomplemented attribute P-tree (the attribute that differs from the sample) with the complements of the other five.
Only P12 is non-empty: it holds t13 (C=1) and t53 (C=0).
ANDing the result, PD(s,1), with PC and PC' gives one vote for each class, so fewer than 3 neighbors have voted so far.
WALK THRU: C3NNC
distance=2 nbrs:
OR{all double-dim interval-Ptrees}:
\[ P_{D(s,2)} = \mathrm{OR}_{i,j \in \{5,6,11,12,13,14\},\ i<j} P_{i,j} , \qquad P_{i,j} = P_{S(s_i,1)} \wedge P_{S(s_j,1)} \wedge \bigwedge_{k \in \{5,6,11,12,13,14\}-\{i,j\}} P_{S(s_k,0)} \]
The pair P-trees are P5,6 P5,11 P5,12 P5,13 P5,14 P6,11 P6,12 P6,13 P6,14 P11,12 P11,13 P11,14 P12,13 P12,14 P13,14.
After P5,12 (which holds t33, C=1): We now have 3 nearest nbrs. We could quit and declare C=1 the winner?
After all pair P-trees have been ORed: We now have the C3NN set and we can declare C=0 the winner!
key  a5 a6 a10=C a11 a12 a13 a14   distance=2 P-tree containing the tuple
t12   0  0    1    0   1   1   0   P12,13
t13   0  0    1    0   1   0   0   (distance 1)
t15   0  0    1    0   1   0   1   P12,14
t16   0  0    1    1   0   1   0   P11,13
t21   1  1    1    1   0   1   0
t27   1  1    1    0   0   1   1
t31   1  0    1    1   0   1   0
t32   1  0    1    0   1   1   0
t33   1  0    1    0   1   0   0   P5,12
t35   1  0    1    0   1   0   1
t51   0  0    0    1   0   1   0   P11,13
t53   0  0    0    0   1   0   0   (distance 1)
t55   0  0    0    0   1   0   1   P12,14
t57   0  0    0    0   0   1   1   P13,14
t61   1  0    0    1   0   1   0
t72   0  0    0    0   1   1   0   P12,13
t75   0  0    0    0   1   0   1   P12,14
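The distance-by-distance construction above can be sketched with plain Python integers standing in for uncompressed P-trees (one bit per training tuple). This is an illustrative sketch under that assumption: real P-trees are compressed trees, and the names `ring`, `closed_knn_vote` and `keys_of` are hypothetical.

```python
from itertools import combinations

KEYS = ["t12", "t13", "t15", "t16", "t21", "t27", "t31", "t32", "t33",
        "t35", "t51", "t53", "t55", "t57", "t61", "t72", "t75"]
# One bit string per attribute column, one character per tuple in KEYS order
COLS = {
    "a5":  "00001111110000100",
    "a6":  "00001100000000000",
    "a11": "00011010001000100",
    "a12": "11100001110110011",
    "a13": "10011111001001110",
    "a14": "00100100010011001",
}
CLASS = "11111111110000000"

N = len(KEYS)
FULL = (1 << N) - 1                       # all-ones mask
P = {name: int(bits, 2) for name, bits in COLS.items()}
PC = int(CLASS, 2)                        # class C=1 mask; FULL & ~PC is C=0

def match_mask(attr, sample_bit):
    """Tuples whose attribute equals the sample bit (P, or its complement P')."""
    return P[attr] if sample_bit == 1 else FULL & ~P[attr]

def ring(sample, r):
    """Mask of tuples at distance exactly r: OR over all r-subsets of attributes."""
    out = 0
    for differ in combinations(sample, r):
        m = FULL
        for attr, bit in sample.items():
            same = match_mask(attr, bit)
            m &= (FULL & ~same) if attr in differ else same
        out |= m
    return out

def closed_knn_vote(sample, k):
    """Let distance 0, 1, 2, ... vote until at least k tuples have voted."""
    votes1 = votes0 = 0
    for r in range(len(sample) + 1):
        m = ring(sample, r)
        votes1 += bin(m & PC).count("1")          # root count of ring AND PC
        votes0 += bin(m & ~PC & FULL).count("1")  # root count of ring AND PC'
        if votes1 + votes0 >= k:
            break
    return votes1, votes0

def keys_of(mask):
    """Tuple keys whose bit is set (leftmost character = highest bit)."""
    return [key for i, key in enumerate(KEYS) if (mask >> (N - 1 - i)) & 1]

sample = dict.fromkeys(COLS, 0)
print(keys_of(ring(sample, 1)), closed_knn_vote(sample, 3))
```

No scan of the rows is needed: every ring is built from ANDs and ORs of whole columns, and a whole ring votes at once, so the result is automatically the closed neighbor set.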
In the previous example, there were no exact matches (dis=0 neighbors or similarity=6 neighbors) for the sample.
Two neighbors were found at a distance of 1 (dis=1 or sim=5) and nine at dis=2 (sim=4).
All 11 neighbors got an equal vote even though the two sim=5 neighbors are much closer than the nine sim=4 neighbors. Also, processing for the 9 is costly.
A better approach would be to weight each vote by the similarity of the voter to the sample. (We will use a vote weight function which is linear in the similarity; admittedly, a better choice would be a function which is Gaussian in the similarity, but, so far, it has been too hard to compute.)
As long as we are weighting votes by similarity, we might as well also weight attributes by relevance (assuming some attributes are more relevant than others; e.g., the relevance weight of a feature attribute could be the correlation of that attribute to the class label).
P-trees accommodate this method very well (in fact, a variation on this theme won the KDD-cup competition in 02: http://www.biostat.wisc.edu/~craven/kddcup/ ).
Association for Computing Machinery KDD-Cup-02
NDSU Team
Closed Manhattan Nearest Neighbor Classifier (uses a linear function of Manhattan similarity)
The sample is (000000); the attribute weights of the relevant attributes are their subscripts.
(In the slide graphics, black is attribute complement, red is uncomplemented.)
The vote is even simpler than in the "equal" vote case. We just note that all tuples vote in accordance with their weighted similarity (if the ai value matches that of (000000), then the vote contribution is the subscript of that attribute, else zero). Thus, we can just add up the root counts of each relevant complemented attribute P-tree, ANDed with the class P-tree, weighted by their subscripts.
Class=1 root counts:
rc(PC^P'a5)=4  rc(PC^P'a6)=8  rc(PC^P'a11)=7  rc(PC^P'a12)=4  rc(PC^P'a13)=4  rc(PC^P'a14)=7
C=1 vote is: 343 = 4*5 + 8*6 + 7*11 + 4*12 + 4*13 + 7*14
Similarly, C=0 vote is: 258 = 6*5 + 7*6 + 5*11 + 3*12 + 3*13 + 4*14
Complemented attribute bits (1 = matches the sample) and class:
key  a5' a6' a11' a12' a13' a14'  C
t12   1   1   1    0    0    1    1
t13   1   1   1    0    1    1    1
t15   1   1   1    0    1    0    1
t16   1   1   0    1    0    1    1
t21   0   0   0    1    0    1    1
t27   0   0   1    1    0    0    1
t31   0   1   0    1    0    1    1
t32   0   1   1    0    0    1    1
t33   0   1   1    0    1    1    1
t35   0   1   1    0    1    0    1
t51   1   1   0    1    0    1    0
t53   1   1   1    0    1    1    0
t55   1   1   1    0    1    0    0
t57   1   1   1    1    0    0    0
t61   0   1   0    1    0    1    0
t72   1   1   1    0    0    1    0
t75   1   1   1    0    1    0    0
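The weighted vote above reduces to a handful of root counts. The following Python sketch reproduces the arithmetic with integers standing in for uncompressed P-trees; the helper names are hypothetical and the subscript weights are the ones used on the slide.

```python
# One bit string per attribute column, one character per tuple
# (order: t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t53 t55 t57 t61 t72 t75)
COLS = {
    "a5":  "00001111110000100",
    "a6":  "00001100000000000",
    "a11": "00011010001000100",
    "a12": "11100001110110011",
    "a13": "10011111001001110",
    "a14": "00100100010011001",
}
CLASS = "11111111110000000"
WEIGHTS = {"a5": 5, "a6": 6, "a11": 11, "a12": 12, "a13": 13, "a14": 14}

FULL = (1 << len(CLASS)) - 1
P = {name: int(bits, 2) for name, bits in COLS.items()}
PC = int(CLASS, 2)

def root_count(mask):
    """Number of 1-bits, i.e. the root count of the (uncompressed) P-tree."""
    return bin(mask).count("1")

def weighted_vote(class_mask):
    """Sum over attributes of weight * rc(class AND complement of attribute)."""
    total = 0
    for attr, weight in WEIGHTS.items():
        matches_sample = FULL & ~P[attr]   # sample bit is 0, so a match is a 0-bit
        total += weight * root_count(class_mask & matches_sample)
    return total

print(weighted_vote(PC), weighted_vote(FULL & ~PC))
```

Every training tuple contributes, so no neighbor set has to be determined first; the cost is one AND and one root count per relevant attribute and class.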
We note that the Closed Manhattan NN Classifier uses an influence function which is pyramidal.
It would be much better to use a Gaussian influence function, but it is much harder to implement.
One generalization of this method to the case of integer values rather than Boolean would be to weight each bit position in a more Gaussian shape (i.e., weight the bit positions b, b-1, ..., 0 (high order to low order) using Gaussian weights). By so doing, at least within each attribute, influences are Gaussian.
We can call this method Closed Manhattan Gaussian NN Classification.
Testing the performance of either CM NNC or CMG NNC would make a great paper for this course (thesis?).
Improving it in some way would make an even better paper (thesis).
Review of slide 2 (with additions):
Database analysis can be broken down into 2 areas, Querying and Data Mining.
Data Mining can be broken down into 2 areas, Machine Learning and Assoc. Rule Mining.
Machine Learning can be broken down into 2 areas, Clustering and Classification.
Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based.
Classification can be broken down into 2 types, Model-based and Neighbor-based.
Machine Learning is based on Near Neighbor Set(s), NNS.
Clustering, even density based, identifies near neighbor cores 1st (round NNSs, ε about a center).
Classification is continuity based, and Near Neighbor Sets (NNS) are the central concept in continuity:
∀ε>0 ∃δ>0 : d(x,a)<δ ⇒ d(f(x),f(a))<ε, where f assigns a class to a feature vector, or
∀ ε-NNS of f(a), ∃ a δ-NNS of a in its pre-image. If f(Dom) is categorical: ∃δ>0 : d(x,a)<δ ⇒ f(x)=f(a)
Caution: For classification, boundary analysis may be needed also to see the class (done by projecting?).
Finding the NNS in a lower dimension may still be the 1st step. E.g., 1,2,3,4,5,6,7,8 are all ε from a (the unclassified sample); 1,2,3,4 are red-class, 5,6,7,8 are blue-class.
[Figure: points 1 2 3 4 above a; points 5 and 6 on either side of a; points 7 8 below a]
Any ε that gives us a vote gives us a tie vote (0-to-0, then 4-to-4).
But projecting onto the vertical subspace, then taking ε/2, we see that ε/2 about a contains only blue class (5,6) votes.
Using horizontal data, NNS derivation requires ≥1 scan (O(n)). An L∞ ε-NNS can be derived using vertical data in O(log2 n) (but Euclidean disks are preferred). (Euclidean and L∞ coincide in binary data sets.)
Solution (next slide): Circumscribe the desired Euclidean ε-NNS with a few intersections of functional contours (f⁻¹([b,c]) sets) until the intersection is scannable, then scan it for Euclidean ε-nbrhd membership.
Advantage: the intersection can be determined before scanning: create and AND functional contour P-trees.
Functional Contours:
∀ function f: R(A1..An) → Y and ∀ S ⊆ Y, contour(f,S) = f⁻¹(S).
Equivalently, ∃ a derived attribute, Af, with DOMAIN(Af) ⊆ Y, giving the extended relation R*(A1..An, Af).
(The equivalence is x.Af = f(x) ∀ x ∈ R.)
Equivalently, contour(Af,S) = SELECT A1..An FROM R* WHERE x.Af ∈ S.
[Figure: graph(f) = {(x, f(x)) | x ∈ R} plotted over R, with sets S1, S2, S3 ⊆ Y on the vertical axis and f-contour(S) marked on R]
If S={a}, Isobar(f,a) = contour(f,{a}).
∀ partition {Si} of Y, the contour set {f⁻¹(Si)} is a partition of R (a clustering of R):
A weather map: f = barometric pressure or temperature, {Si} = equi-width partition of the Reals.
f = local density (e.g., OPTICS: f = reachability distance, {Sk} = partition produced by the intersection points of {graph(f), plotted wrt some walk of R} and a horizontal threshold line).
A grid is the intersection of dimension projection contour partitions (see next slide for more definitions).
A Class is a contour under f: R → ClassAttr wrt the partition {Ci} of ClassAttr (where {Ci} are the classes).
An L∞ ε-disk about a is the intersection of all ε-dimension_projection contours containing a.
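The contour definitions above can be made concrete with a small Python sketch over a list of tuples. The function names and the toy relation are hypothetical; a P-tree implementation would produce bit masks rather than row lists.

```python
from collections import defaultdict

def contour(R, f, S):
    """contour(f, S) = f^-1(S): the rows of R whose f-value lies in S."""
    return [x for x in R if f(x) in S]

def isobar(R, f, a):
    """Isobar(f, a) = contour(f, {a})."""
    return contour(R, f, {a})

def contour_partition(R, f, cell_of):
    """Pre-image partition of R; cell_of maps a value y to its partition cell."""
    cells = defaultdict(list)
    for x in R:
        cells[cell_of(f(x))].append(x)
    return dict(cells)

# Toy relation R(A1, A2) and the functional f(x) = x1 + x2 (assumed example)
R = [(1, 2), (3, 4), (5, 1), (2, 2)]
f = sum

print(contour(R, f, {3, 4}))                      # rows with f-value 3 or 4
print(isobar(R, f, 6))                            # rows with f-value exactly 6
print(contour_partition(R, f, lambda y: y // 4))  # equi-width cells of width 4
```

Every row lands in exactly one cell of `contour_partition`, which is the sense in which a partition of Y induces a clustering of R.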
GRIDs
f: R → Y, ∀ partition S={Sk} of Y, {f⁻¹(Sk)} = S,f-grid of R (grid cells = contours).
If Y=Reals, the j.lo f-grid is produced by agglomerating over the j lo bits of Y, ∀ fixed (b-j) hi bit pattern.
The j lo bits walk [isobars of] cells.
The b-j hi bits identify cells. (lo=extension / hi=intension)
Let b-1,...,0 be the b bit positions of Y. The j.lo f-grid is the partition of R generated by f and S = {S(b-1,...,b-j) | S(b-1,...,b-j) = [(b-1)(b-2)...(b-j)0..0, (b-1)(b-2)...(b-j)1..1)}, a partition of Y=Reals.
If F={fh}, the j.lo F-grid is the intersection partition of the j.lo fh-grids (intersection of partitions).
The canonical j.lo grid is the j.lo π-grid; π = {πd: R → R[Ad] | πd = dth coordinate projection}.
j.hi gridding is similar (the b-j lo bits walk cell contents / the j hi bits identify cells).
If the horizontal and vertical dimensions have bitwidths 3 and 2 respectively:
[Figure: the 2.lo grid and the 1.hi grid; horizontal axis 000..111, vertical axis 00..11. Want square cells or a square pattern?]
j.lo and j.hi gridding continued
The horizontal_bitwidth = vertical_bitwidth = b iff j.lo grid = (b-j).hi grid,
e.g., for hb=vb=b=3 and j=2:
[Figure: the 2.lo grid and the 1.hi grid coincide; both axes run 000..111]
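The j.lo and j.hi cell identifiers can be sketched with bit shifts. This is a minimal Python illustration for non-negative integer attribute values; `lo_cell` and `hi_cell` are hypothetical helper names.

```python
def lo_cell(v, j):
    """j.lo grid: the j low-order bits walk the cell, the remaining hi bits name it."""
    return v >> j

def hi_cell(v, j, b):
    """j.hi grid for a b-bit value: the j high-order bits name the cell."""
    return v >> (b - j)

# Equal bitwidths (b = 3): the 2.lo grid and the 1.hi grid have the same cells
same = all(lo_cell(v, 2) == hi_cell(v, 1, 3) for v in range(8))

# Bitwidths 3 (horizontal) and 2 (vertical): the two grids differ
cells_2lo = {(lo_cell(x, 2), lo_cell(y, 2)) for x in range(8) for y in range(4)}
cells_1hi = {(hi_cell(x, 1, 3), hi_cell(y, 1, 2)) for x in range(8) for y in range(4)}

print(same, len(cells_2lo), len(cells_1hi))
```

With unequal bitwidths, the 2.lo grid has square cells (4 by 4 values) but only a 2 by 1 pattern of them, while the 1.hi grid has a square 2 by 2 pattern of non-square cells, which is the trade-off the slide asks about.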
Similarity NearNeighborSets (SNNS)
Given a similarity s: R×R → PartiallyOrderedSet (e.g., Reals) (i.e., s(x,y)=s(y,x) and s(x,x) ≥ s(x,y) ∀ x,y ∈ R) and given any C ⊆ R:
The Ordinal disks, skins and rings are:
disk(C,k) ⊇ C : |disk(C,k) ∩ C'| = k and s(x,C) ≥ s(y,C) ∀ x ∈ disk(C,k), y ∉ disk(C,k)
skin(C,k) = disk(C,k) - C (skin comes from s k immediate neighbors and is a kNNS of C.)
ring(C,k) = cskin(C,k) - cskin(C,k-1)
closeddisk(C,k) ≡ ∪ all disk(C,k); closedskin(C,k) ≡ ∪ all skin(C,k)
The Cardinal disks, skins and rings are (PartiallyOrderedSet = Reals):
disk(C,r) ≡ {x ∈ R | s(x,C) ≥ r}, also = functional contour f⁻¹([r, ∞)), where f(x) = sC(x) = s(x,C)
skin(C,r) ≡ disk(C,r) - C
ring(C,r2,r1) ≡ disk(C,r2) - disk(C,r1) ≡ skin(C,r2) - skin(C,r1), also = functional contour sC⁻¹([r2,r1)) (for r2 < r1)
Note: closeddisk(C,r) is redundant, since all r-disks are closed, and closeddisk(C,k) = disk(C,s(C,y)) where y = kth NN of C.
A distance, d, generates a similarity in many ways, e.g.:
s(x,y) = 1/(1+d(x,y)) (or, if the relationship varies by location, s(x,y) = λ(x,y)/(1+d(x,y)))
s(x,y) = a*e^(-b*d(x,y)^2) (Gaussian)
s(x,y) = a*e^(-b*d(x,y)^2) - a*e^(-b*ε^2) if d(x,y) ≤ ε, and s(x,y) = 0 if d(x,y) > ε (truncated Gaussian)
(Vote weighting IS a similarity assignment, so the similarity-to-distance graph IS a vote weighting for classification.)
[Figure: graphs of s against d for each of the three similarities above]
L∞ skins: skin(a,k) = {x | ∃d, xd is one of the k-NNs of ad} (a local normalizer?)
For C = {a}:
[Figure: the ring about a between radii r1 and r2]
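The distance-to-similarity conversions and the cardinal disk can be sketched in Python as follows. The parameter defaults (a = b = 1) and the one-dimensional example points are assumptions made only for illustration.

```python
import math

def sim_inverse(d):
    """s = 1 / (1 + d)."""
    return 1.0 / (1.0 + d)

def sim_gauss(d, a=1.0, b=1.0):
    """s = a * exp(-b * d^2)."""
    return a * math.exp(-b * d * d)

def sim_truncated_gauss(d, eps, a=1.0, b=1.0):
    """Gaussian lowered so that it reaches 0 at d = eps, and 0 beyond eps."""
    if d > eps:
        return 0.0
    return a * math.exp(-b * d * d) - a * math.exp(-b * eps * eps)

def cardinal_disk(points, center, r, sim, dist):
    """disk({center}, r) = {x | s(x, center) >= r}, a functional contour of s."""
    return [x for x in points if sim(dist(x, center)) >= r]

points = [0, 1, 2, 3]
print(cardinal_disk(points, 0, 0.5, sim_inverse, lambda x, c: abs(x - c)))
```

Because similarity decreases with distance, raising the threshold r shrinks the disk, the reverse of what happens with a distance radius.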
Ptrees
Vertical, compressed, lossless structures that facilitate fast horizontal AND-processing.
The jury is still out on parallelization: vertical (by relation), horizontal (by tree node), or some combination?
Horizontal parallelization is pretty, but network multicast overhead is huge.
Use active networking? Clusters of Playstations?...
Formally, P-trees can be defined as any of the following:
Partition-tree: Tree of nested partitions (a partition P(R)={C1..Cn}; each component is partitioned by P(Ci)={Ci,1..Ci,ni}, i=1..n; each component is partitioned by P(Ci,j)={Ci,j,1..Ci,j,nij} ...)
Partition tree
              R
           /  …  \
        C1    …    Cn
       /…\    …   /…\
C1,1…C1,n1        Cn,1…Cn,nn
. . .
Predicate-tree: For a predicate on the leaf-nodes of a partition-tree (also induces predicates on i-nodes using quantifiers).
Predicate-tree nodes can be truth-values (Boolean P-tree); can be quantified existentially (1 or a threshold %) or universally; Predicate-tree nodes can count the # of true leaf children of that component (Count P-tree).
Purity-tree: universally quantified Boolean-Predicate-tree (e.g., if the predicate is <=1>, Pure1-tree or P1tree).
A 1-bit at a node iff the corresponding component is pure1 (universally quantified).
There are many other useful predicates, e.g., NonPure0-trees;
but we will focus on P1trees.
All Ptrees shown so far were 1-dimensional (recursively partitioned by halving bit files), but they can be
2-D (recursively quartering) (e.g., used for 2-D images); 3-D (recursively eighth-ing), …; or based on purity runs or LZW-runs or …
Further observations about Ptrees:
Partition-trees have set nodes.
Predicate-trees have either Boolean nodes (Boolean P-tree) or count nodes (Count P-tree).
Purity-trees, being universally quantified Boolean-Predicate-trees, have Boolean nodes (since the count is always the "full" count of leaves, expressing Purity-trees as count-trees is redundant).
A Partition-tree can be sliced at a given level if each partition at that level is labeled with the very same label set (e.g., the Month partition of years).
A Partition-tree can be generalized to a Set-graph when the siblings of a node do not form a partition.
The partitions used to create P-trees can come from functional contours. (Note: there is a natural duality between partitions and functions, namely, a partition creates a function from the space of points partitioned to the set of partition components, and a function creates the pre-image partition of its domain.)
In Functional Contour terms (i.e., f⁻¹(S) where f: R(A1..An) → Y, S ⊆ Y), the uncompressed Ptree or uncompressed Predicate-tree, 0Pf,S = bitmap of the set containment-predicate, 0Pf,S(x) = true iff x ∈ f⁻¹(S).
0Pf,S = equivalently, the existential R*-bit map of the predicate, R*.Af ∈ S.
The Compressed Ptree, sPf,S, is the compression of 0Pf,S with equi-width leaf size, s, as follows:
1. Choose a walk of R (converts 0Pf,S from bit map to bit vector)
2. Equi-width partition 0Pf,S with segment size, s (s=leafsize, the last segment can be short)
3. Eliminate and mask to 0, all pure-zero segments (call the mask, NotPure0 Mask or EM)
4. Eliminate and mask to 1, all pure-one segments (call the mask, Pure1 Mask or UM)
(EM=existential aggregation, UM=universal aggregation)
Compressing each leaf of sPf,S with leafsize=s2 gives s1,s2Pf,S (builds an EM and a UM tree).
Recursively: s1,s2,s3Pf,S; s1,s2,s3,s4Pf,S ...
BASIC P-trees
If Ai is Real or Binary and fi,j(x) ≡ jth bit of xi, then {(*)Pfi,j,{1} ≡ (*)Pi,j} for j=b..0 are the basic (*)P-trees of Ai, where * = s1..sk.
If Ai is Categorical and fi,a(x)=1 if xi=a, else 0, then {(*)Pfi,a,{1} ≡ (*)Pi,a} for a ∈ R[Ai] are the basic (*)P-trees of Ai.
Notes:
The UM masks (e.g., of 2^k,...,2^0 Pi,j, with k=roof(log2|R|)) form a (binary) tree.
Whenever the EM bit is 0, that entire subtree can be eliminated (since it represents a pure0 segment); then a 0-node at level-k (lowest level = level-0) with no sub-tree indicates a 2^k-run of zeros. In this construction, the UM tree is redundant.
We call these EM trees the basic binary P-trees. The next slide shows a top-down (easy to understand) construction, and the following slide shows a (much more efficient) bottom-up construction of the same. We have suppressed the leafsize prefix.
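One level of the compression steps (walk, equi-width segments, EM and UM masks) can be sketched in Python. This is a single-level illustration with hypothetical function names, not the recursive multi-level construction; bit vectors are plain lists of 0/1.

```python
def compress(bits, s):
    """Split a bit vector into segments of size s; return (EM, UM, mixed leaves)."""
    em, um, leaves = [], [], []
    for start in range(0, len(bits), s):
        seg = bits[start:start + s]       # the last segment can be short
        not_pure0 = any(seg)
        pure1 = all(seg)
        em.append(int(not_pure0))         # NotPure0 mask (existential aggregation)
        um.append(int(pure1))             # Pure1 mask (universal aggregation)
        if not_pure0 and not pure1:
            leaves.append(seg)            # only mixed segments are stored
    return em, um, leaves

def decompress(em, um, leaves, s, n):
    """Rebuild the original n-bit vector from the two masks and the mixed leaves."""
    out, mixed = [], iter(leaves)
    for e, u in zip(em, um):
        seg_len = min(s, n - len(out))
        if u:
            out += [1] * seg_len
        elif not e:
            out += [0] * seg_len
        else:
            out += next(mixed)
    return out

bits = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
em, um, leaves = compress(bits, 4)
print(em, um, leaves)
```

Applying `compress` again to the EM and UM vectors themselves is one way to picture the recursion that yields the multi-level trees.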
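Example functionals: Total Variation (TV) functionals
\[ TV(a) = \sum_{x \in R} (x-a)\circ(x-a) \]
If we use d for an index variable over the dimensions (and i, j, k for bit-slice indexes),
\begin{align*}
TV(a) &= \sum_{x \in R}\sum_{d=1}^{n} \left( x_d^2 - 2a_d x_d + a_d^2 \right) \\
&= \sum_{x \in R}\sum_{d=1}^{n} \Big(\sum_k 2^k x_{dk}\Big)^2 - 2\sum_{x \in R}\sum_{d=1}^{n} a_d \Big(\sum_k 2^k x_{dk}\Big) + |R||a|^2 \\
&= \sum_{x}\sum_{d} \Big(\sum_i 2^i x_{di}\Big)\Big(\sum_j 2^j x_{dj}\Big) - 2\sum_{x \in R}\sum_{d=1}^{n} a_d \Big(\sum_k 2^k x_{dk}\Big) + |R||a|^2 \\
&= \sum_{x}\sum_{d}\sum_{i,j} 2^{i+j} x_{di}x_{dj} - 2\sum_{x,d,k} 2^k a_d x_{dk} + |R||a|^2 \\
&= \sum_{x,d,i,j} 2^{i+j} x_{di}x_{dj} - 2\sum_d a_d \Big(\sum_{x,k} 2^k x_{dk}\Big) + |R||a|^2
\end{align*}
In terms of P-tree root counts,
\[ TV(a) = \sum_{i,j,d} 2^{i+j}\,|P_{di \wedge dj}| - \sum_k 2^{k+1} \sum_d a_d\,|P_{dk}| + |R| \sum_d a_d a_d \qquad \text{(equation 7)} \]
Alternatively, since \( \sum_x \sum_k 2^k x_{dk} = \sum_x x_d = |R|\mu_d \),
\begin{align*}
TV(a) &= \sum_{x,d,i,j} 2^{i+j} x_{di}x_{dj} - 2|R| \sum_d a_d\mu_d + |R||a|^2 \\
&= \sum_{x,d,i,j} 2^{i+j} x_{di}x_{dj} + |R| \Big( -2\sum_d a_d\mu_d + \sum_d a_d a_d \Big)
\end{align*}
Note that the first term does not depend upon a. Thus, the derived attribute, TV-TV(μ) (eliminate the 1st term), is much simpler to compute and has identical contours (it just lowers the graph by TV(μ)).
We also find it useful to post-compose a log to reduce the number of bit slices.
The resulting functional is called the High-Dimension-ready Total Variation or HDTV(a).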
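Equation 7 can be checked numerically on a small relation. The sketch below is only a sanity check, assuming b-bit non-negative integer attributes and counting bits row by row where a P-tree system would read root counts; the function names and the toy data are hypothetical.

```python
def tv_direct(R, a):
    """TV(a) = sum over rows x of (x - a) o (x - a)."""
    return sum((xd - ad) ** 2 for x in R for xd, ad in zip(x, a))

def tv_bitslice(R, a, b):
    """TV(a) from bit-slice counts (equation 7) for b-bit non-negative integers."""
    n = len(a)

    def bit(x, d, k):
        return (x[d] >> k) & 1

    # |P_{di ^ dj}|: rows where bits i and j of attribute d are both 1
    term1 = sum(2 ** (i + j) * sum(bit(x, d, i) & bit(x, d, j) for x in R)
                for d in range(n) for i in range(b) for j in range(b))
    # |P_{dk}|: rows where bit k of attribute d is 1
    term2 = sum(2 ** (k + 1) * a[d] * sum(bit(x, d, k) for x in R)
                for d in range(n) for k in range(b))
    term3 = len(R) * sum(ad * ad for ad in a)
    return term1 - term2 + term3

R = [(3, 5), (7, 0), (2, 6), (1, 1)]   # toy relation with 3-bit attributes
a = (4, 2)
print(tv_direct(R, a), tv_bitslice(R, a, 3))
```

The first term is independent of a, so it can be computed once per training set; only the second and third terms change from sample to sample.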
From equation 7,
\[ TV(a) = \sum_{x,d,i,j} 2^{i+j} x_{di}x_{dj} + |R| \Big( -2\sum_d a_d\mu_d + \sum_d a_d a_d \Big) \]
Normalized Total Variation:
\begin{align*}
NTV(a) \equiv TV(a) - TV(\mu) &= |R| \Big( -2\sum_d (a_d\mu_d - \mu_d\mu_d) + \sum_d (a_d a_d - \mu_d\mu_d) \Big) \\
&= |R| \Big( \sum_d a_d^2 - 2\sum_d \mu_d a_d + \sum_d \mu_d^2 \Big) = |R|\,|a-\mu|^2
\end{align*}
Thus there is a simpler function which gives us circular contours, the Log Normal TV function:
\[ LNTV(a) = \ln( NTV(a) ) = \ln( TV(a) - TV(\mu) ) = \ln|R| + \ln|a-\mu|^2 \]
The value of LNTV(a) depends only on the length of a-μ, so isobars are hyper-circles centered at μ.
The graph of LNTV is a log-shaped hyper-funnel.
For an ε-contour ring (radius ε about a), go inward and outward along a-μ by ε to the points:
inner point, b = μ + (1 - ε/|a-μ|)(a-μ), and
outer point, c = μ + (1 + ε/|a-μ|)(a-μ).
Then take g(b) and g(c), where g(x) = LNTV(x), as the lower and upper endpoints of a vertical interval.
Then we use EIN formulas on that interval to get a mask P-tree for the ε-contour (which is a well-pruned superset of the ε-neighborhood of a).
[Figure: graph of g(x) = LNTV(x) over the (x1, x2) plane, showing g(b), g(c), the points b, a, c, and the ε-contour (radius ε about a)]
If the LNTV circumscribing contour of a is still too populous, use a circumscribing Ad-contour.
(Note: Ad is not a derived attribute at all, but just Ad, so we already have its basic P-trees.)
As pre-processing, calculate the basic P-trees for the LNTV derived attribute (or another hypercircular contour derived attribute).
To classify a:
1. Calculate b and c (they depend on a and ε).
2. Form the mask P-tree for training pts with LNTV-values ∈ [LNTV(b), LNTV(c)].
3. Use that P-tree to prune out the candidate NNS.
4. If the count of candidates is small, proceed to scan and assign class votes using the Gaussian vote function; else prune further using dimension projections.
(Use the voting function G(x) = Gauss(|x-a|) - Gauss(ε), where Gauss(r) is (1/(std*√(2π))) e^(-(r-mean)^2/(2var)); std, mean, var are wrt the set of distances from a of the voters, i.e., {r=|x-a| : x a voter}.)
We can also note that LNTV can be further simplified (retaining the same contours) using h(a)=|a-μ|. Since we create the derived attribute by scanning the training set, why not just use this very simple function?
Others leap to mind, e.g., hb(a)=|a-b|.
[Figure: graph of LNTV(x) over the (x1, x2) plane, showing LNTV(b), LNTV(c), the points b, a, c, the ε-contour (radius ε about a), and the contour of the dimension projection f(a)=a1]
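The four classification steps can be sketched in Python as below. This is a simplified sketch, not SMART-TV itself: the contour mask is computed by a list comprehension rather than by P-tree operations, the vote weight is a zero-mean Gaussian with a caller-supplied `sigma` (an assumption that replaces the std/mean/var of the voter distances), and the sample is assumed not to coincide with μ.

```python
import math

def classify_lntv(train, a, eps, sigma):
    """train: list of (point, cls). Returns (vote weight per class, #candidates)."""
    n = len(train)
    mu = [sum(p[d] for p, _ in train) / n for d in range(len(a))]

    def dist(p, q):
        return math.sqrt(sum((pd - qd) ** 2 for pd, qd in zip(p, q)))

    def lntv(p):
        r = dist(p, mu)
        return math.log(n) + 2 * math.log(r) if r > 0 else -math.inf

    # Step 1: inner point b and outer point c along a - mu
    ra = dist(a, mu)
    b = [m + (1 - eps / ra) * (ad - m) for ad, m in zip(a, mu)]
    c = [m + (1 + eps / ra) * (ad - m) for ad, m in zip(a, mu)]
    lo = lntv(b) if eps < ra else -math.inf
    hi = lntv(c)

    # Steps 2-3: contour mask = training points with LNTV in [LNTV(b), LNTV(c)]
    tol = 1e-9  # guard against floating-point rounding at the boundary
    candidates = [(p, cls) for p, cls in train if lo - tol <= lntv(p) <= hi + tol]

    # Step 4: scan the candidates; Gaussian-weighted votes within eps of a
    votes = {}
    for p, cls in candidates:
        r = dist(p, a)
        if r <= eps:
            w = (math.exp(-r * r / (2 * sigma ** 2))
                 - math.exp(-eps * eps / (2 * sigma ** 2)))
            votes[cls] = votes.get(cls, 0.0) + w
    return votes, len(candidates)

TRAIN = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0),
         ((5, 5), 1), ((6, 5), 1), ((5, 6), 1), ((2.8, 2.8), 1)]
votes, n_candidates = classify_lntv(TRAIN, (0.5, 0.5), 1.0, 0.5)
print(votes, n_candidates)
```

In this toy data the ring-shaped contour prunes only the point near μ, which illustrates why the slide suggests intersecting with further contours (e.g., dimension projections) when the LNTV contour is still too populous.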
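Graphs of functionals with hyper-circular contours: TV, TV-TV(μ), LNTV, h(a)=|a-μ|, hb(a)=|a-b|
[Figure: 3-D plots over the X-Y plane (axis ticks 1..5) of TV and TV-TV(μ), marking TV(x15), TV(μ)=TV(x33) and TV(x15)-TV(μ), and of LNTV, h(a)=|a-μ| centered at μ, and hb(a)=|a-b| centered at b]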
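Angular Variation functionals: e.g.,
\begin{align*}
AV(a) &\equiv \frac{1}{|a|} \sum_{x \in R} x \circ a \\
&= \frac{1}{|a|} \sum_{x \in R} \sum_{d=1}^{n} x_d a_d && \text{(d is an index over the dimensions)} \\
&= \frac{1}{|a|} \sum_{d} \Big( \sum_x x_d a_d \Big) \\
&= \frac{1}{|a|} \sum_{d=1}^{n} \Big( \sum_x x_d \Big) a_d && \text{(factor out } a_d\text{)} \\
&= \frac{|R|}{|a|} \sum_{d=1}^{n} \Big( \frac{\sum_x x_d}{|R|} \Big) a_d \\
&= \frac{|R|}{|a|} \sum_{d=1}^{n} \mu_d a_d = \frac{|R|}{|a|}\, \mu \circ a
\end{align*}
\[ COS(a) \equiv \frac{AV(a)}{|\mu||R|} = \frac{\mu \circ a}{|\mu||a|} = \cos(\theta), \quad \theta \text{ the angle between } a \text{ and } \mu \]
COS (and AV) has hyper-conic isobars centered on μ.
COS and AV have ε-contour(a) = the space between two hyper-cones centered on μ which just circumscribes the Euclidean ε-hyperdisk at a.
[Figure: the two hyper-cones about μ circumscribing the ε-disk at a; their intersection with the LNTV ε-contour is shown in pink]
COSb(a)?
Graphs of functionals with hyper-conic contours:
E.g., COSb(a) for any vector, b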
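The simplification of AV(a) to a single inner product with the mean vector can be checked numerically. The sketch below uses hypothetical helper names and a toy relation; it assumes a and μ are non-zero vectors.

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def mean_vector(R):
    n = len(R)
    return [sum(x[d] for x in R) / n for d in range(len(R[0]))]

def av_direct(R, a):
    """AV(a) = (1/|a|) * sum over rows x of x o a."""
    return sum(dot(x, a) for x in R) / norm(a)

def av_from_mean(R, a):
    """AV(a) = (|R|/|a|) * (mu o a), using only the precomputed mean vector."""
    return len(R) / norm(a) * dot(mean_vector(R), a)

def cos_functional(R, a):
    """COS(a) = AV(a) / (|mu| |R|) = cosine of the angle between a and mu."""
    return av_from_mean(R, a) / (norm(mean_vector(R)) * len(R))

R = [(1, 2), (3, 0), (2, 4)]
a = (3, 4)
print(av_direct(R, a), av_from_mean(R, a), cos_functional(R, a))
```

Once μ is known, evaluating AV or COS for a new sample costs one inner product, independent of the number of training rows.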
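Gaussian weighted votes (d = index over dims; i, j, k = bit-slice indexes):
\begin{align*}
f(a)_x = (x-a)\circ(x-a) &= \sum_{d=1}^{n} \left( x_d^2 - 2a_d x_d + a_d^2 \right) \\
&= \sum_{d=1}^{n} \Big(\sum_k 2^k x_{dk}\Big)^2 - 2\sum_{d=1}^{n} a_d \Big(\sum_k 2^k x_{dk}\Big) + |a|^2 \\
&= \sum_{d} \Big(\sum_i 2^i x_{di}\Big)\Big(\sum_j 2^j x_{dj}\Big) - 2\sum_{d=1}^{n} a_d \Big(\sum_k 2^k x_{dk}\Big) + |a|^2 \\
&= \sum_{d}\sum_{i,j} 2^{i+j} x_{di}x_{dj} - 2\sum_{d,k} 2^k a_d x_{dk} + |a|^2
\end{align*}
\[ f(a)_x = \sum_{i,j,d} 2^{i+j} (P_{di \wedge dj})_x - \sum_k 2^{k+1} \sum_d a_d (P_{dk})_x + |a|^2 \]
\[ \beta \exp(-f(a)_x) = \beta \exp(-|a|^2) \cdot \exp\Big(-\sum_{i,j,d} 2^{i+j} (P_{di \wedge dj})_x\Big) \cdot \exp\Big(\sum_k 2^{k+1} \sum_d a_d (P_{dk})_x\Big) \]
Adding up the Gaussian weighted votes for class c:
\[ \sum_{x \in c} \beta \exp(-f(a)_x) = \beta \exp(-|a|^2) \sum_{x \in c} \Big( \exp\Big(-\sum_{i,j,d} 2^{i+j} (P_{di \wedge dj})_x\Big) \cdot \exp\Big(\sum_k 2^{k+1} \sum_d a_d (P_{dk})_x\Big) \Big) \]
Collecting diagonal terms inside exp:
\begin{align*}
\sum_{x \in c} \beta \exp(-f(a)_x) &= \beta \exp(-|a|^2) \sum_{x \in c} \exp\Big( \sum_{i \neq j,\,d} -2^{i+j} (P_{di \wedge dj})_x + \sum_{i=j,\,d} (a_d 2^{i+1} - 2^{2i}) (P_{di})_x \Big) \\
&= \beta \exp(-|a|^2) \sum_{x \in c} \Big( \prod_{i \neq j,\,d} \exp\big(-2^{i+j} (P_{di \wedge dj})_x\big) \cdot \prod_{i=j,\,d} \exp\big((a_d 2^{i+1} - 2^{2i}) (P_{di})_x\big) \Big) \\
&= \beta \exp(-|a|^2) \sum_{x \in c} \Big( \prod_{i \neq j,\,d:\ (P_{di \wedge dj})_x = 1} \exp(-2^{i+j}) \cdot \prod_{i=j,\,d:\ (P_{di})_x = 1} \exp(a_d 2^{i+1} - 2^{2i}) \Big) \qquad \text{(eq1)}
\end{align*}
Inside exp, the coefficients are multiplied by a 1|0-bit (which depends on x). For fixed i,j,d, the factor is either the x-independent coefficient (if 1-bit) or absent (if 0-bit).
Some additional formulas:
\begin{align*}
f(a)_x &= \sum_d \Big( \Big(\sum_i 2^i x_{di}\Big)\Big(\sum_j 2^j x_{dj}\Big) - 2\Big(\sum_i 2^i a_{di}\Big)\Big(\sum_j 2^j x_{dj}\Big) + \Big(\sum_i 2^i a_{di}\Big)\Big(\sum_j 2^j a_{dj}\Big) \Big) \\
&= \sum_d \Big( \sum_{i,j} 2^{i+j} x_{di}x_{dj} - \sum_{i,j} 2^{i+j+1} a_{di}x_{dj} + \sum_{i,j} 2^{i+j} a_{di}a_{dj} \Big) \\
&= \sum_{d,i,j} 2^{i+j} \left( x_{di}x_{dj} - 2a_{di}x_{dj} + a_{di}a_{dj} \right) \\
&= \sum_{d,i,j} 2^{i+j} (x_{di} - a_{di})(x_{dj} - a_{dj})
\end{align*}
\[ f_d(a)_x = |x-a|_d = \Big| \sum_i 2^i (x_{di} - a_{di}) \Big| = \Big| \sum_{i:\ a_{di}=0} 2^i x_{di} - \sum_{i:\ a_{di}=1} 2^i x'_{di} \Big| \]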
Thus, for the derived attribute, fd(a) = numeric distance of xd from ad, if we remember that when adi=1 we subtract those contributing powers of 2 (don't add) and that we use the complement dimension=d basic Ptrees, then it should work.
The point is that we can get a set of near basic or negative basic Ptrees, nbPtrees, for the derived attr fd(a) directly from the basic Ptrees for Ad for free. Thus, the near basic Ptrees for fd(a) (called fd(a)'s nbPtrees) are
the basic Ad Ptrees for those bit-positions where adi = 0,
and they are the complements of the basic Ad Ptrees for those bit-positions where adi = 1.
Caution: subtract the contribution of the nbPtrees for positions where adi=1.
Note: nbPtrees are not predicate trees (are they? What's the predicate?). The EIN ring formulas are related to this, how?
If we are simply after easy pruning contours containing a (so that we can scan to get the actual Euclidean epsilon nbrs and/or to get Gaussian weighted vote counts), we can use Hobbit-type contours (middle earth contours of a?).
See next slide for a discussion of hobbit contours.
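The bit-slice identity behind the nbPtrees (add the powers of 2 from x's bit slices where a's bit is 0, subtract them from the complemented slices where a's bit is 1) can be checked exhaustively for small bit widths. The function names below are hypothetical.

```python
def bit(v, i):
    return (v >> i) & 1

def dist_from_bits(x, a, b):
    """|x - a| for b-bit non-negative integers, from x's (complemented) bit slices."""
    # positions where a's bit is 0: use x's bit as is, and add
    plus = sum(2 ** i * bit(x, i) for i in range(b) if bit(a, i) == 0)
    # positions where a's bit is 1: use x's complemented bit, and subtract
    minus = sum(2 ** i * (1 - bit(x, i)) for i in range(b) if bit(a, i) == 1)
    return abs(plus - minus)

print(dist_from_bits(5, 12, 4), abs(5 - 12))
```

The identity holds because x_i - a_i equals x_i when a_i = 0 and equals -(1 - x_i) when a_i = 1.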
A principle: A job is not done until the Mathematics is completed.
The Mathematics of a research job includes:
0. getting to the frontiers of the area (researching, organizing, understanding and integrating everything others have done in the area up to the present moment and what they are likely to do next),
1. developing a killer idea for a better way to do something,
2. proving claims (theorems, performance evaluation, simulation, etc.),
3. simplification (everything is simple once fully understood),
4. generalization (to the widest possible application scope), and
5. insight (what are the main issues and underlying mega-truths (with full drill down)).
Therefore, we need to ask the following questions at this point:
Should we use the vector of medians (the only good choice of middle point in multidimensional space, since the point closest to the mean definition is influenced by skewness, like the mean)?
We will denote the vector of medians as 
h(a)=|a-| is an important functional (better than h(a)=|a-|?)
If we compute the median of an even number of values as the count-weighted average of the
middle two values, then in binary columns,  and  coincide. (so if µ and  are far apart, that tells
us there is high skew in the data (and the coordinates where they differ are the columns where the
skew is found).
Additional Mathematics to enjoy:
What about the vector of standard deviations, σ? (computable with P-trees!) Do we have an improvement of BIRCH here (generating similar comprehensive statistical measures, but much faster and more focused)?
We can do the same for any rank statistic (or order statistic), e.g., the vector of 1st or 3rd quartiles, Q1 or Q3; the vector of kth rank values (kth ordinal values).
If we preprocessed to get the basic P-trees of , and each mixed quartile vector (e.g., in
2-D add 5 new derived attributes; , Q1,1, Q1,2, Q2,1, Q2,2; where Qi,j is the ith quartile of
the jth column), what does this tell us (e.g., what can we conclude about the location of
core clusters? Maybe all we need is the basic P-trees of the column quartiles, Q1..Qn ?)
L∞ ordinal disks:
disk(C,k) = {x | xd is one of the k-Nearest Neighbors of ad ∀ d}.
skin(C,k), closed skin(C,k) and ring(C,k) are defined as above.
Are they easy P-tree computations? Do they offer advantages? When? What? Why?
E.g., do they automatically normalize for us?
The Middle Earth Contours of a are gotten by ANDing in the basic Ptree where ad,i=1 and ANDing in the complement where ad,i=0 (down to some bit-position threshold in each dimension, bptd; bptd can be the same for each d or not).
Caution: Hobbit contours of a are not symmetric about a. That becomes a problem (for knowing when you have a symmetric nbrhd in the contour), especially when many of the lowest order bits of a are identical (e.g., if ad = 8 = 1000).
If the low order bits of ad are zeros, one should union (OR) in the Hobbit contour of ad - 1 (e.g., for 8 also take 7=0111).
If the low order bits of ad are ones, one should union (OR) in the Hobbit contour of ad + 1 (e.g., for 7=0111 also take 8=1000).
Some needed research:
Since we are looking for an easy prune to get our mask down to a scannable size (low root count), but not so much of a prune that we have too few voters within Euclidean epsilon distance of a for a good vote, how can we quickly determine an easy choice of a Hobbit prune to accomplish that? Note that there are many Hobbit contours. We can start with pruning in just one dimension and with only the lowest order bit in that dimension and work from there; how, though?
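The one-dimensional Hobbit contour and the suggested symmetry repair can be sketched in Python. This is an illustration only: `low_bits` plays the role of the bit-position threshold, the contour is returned as a set of row indexes rather than a P-tree mask, and the function names are hypothetical.

```python
def hobbit_contour(values, a, low_bits):
    """Rows agreeing with a on every bit above the low_bits lowest positions."""
    return {i for i, v in enumerate(values) if v >> low_bits == a >> low_bits}

def symmetric_hobbit(values, a, low_bits):
    """Also OR in the contour of a-1 (or a+1) when a sits on a cell boundary."""
    mask = (1 << low_bits) - 1
    out = hobbit_contour(values, a, low_bits)
    if a & mask == 0 and a > 0:          # low order bits all zeros
        out |= hobbit_contour(values, a - 1, low_bits)
    if a & mask == mask:                 # low order bits all ones
        out |= hobbit_contour(values, a + 1, low_bits)
    return out

values = list(range(16))                 # row i holds the value i
print(sorted(hobbit_contour(values, 8, 2)))
print(sorted(symmetric_hobbit(values, 8, 2)))
```

For a = 8 the plain contour contains nothing below 8 even though 7 is an immediate neighbor; ORing in the contour of 7 restores a neighborhood that extends on both sides of a.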
THIS COULD BE VERY USEFUL?
Suppose there are two classes, red and green, and they are on the cylinder shown.
Then the vector connecting medians (vcm) in YZ space is shown in purple.
The unit vector in the direction of the vector connecting medians (uvcm) in YZ space is shown in blue.
The vector from the midpoint of the medians to s is in orange.
The inner product of the blue and the orange is the same as the inner product we would get by doing the same thing in all 3 dimensions!
The point is that the x-component of the red vector of medians and that of the green are identical, so that the x-component of the vcm is zero.
Thus, when the vcm component in a given dimension is very small or zero, we can eliminate that dimension!
That's why I suggest a threshold for the inner product in each dimension first.
It is a feature or attribute relevance tool.
[Figure: red and green classes on a cylinder along the x-axis, with axes x, y, z and the sample s]
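The claim that a dimension with zero vcm component can be dropped without changing the inner product can be checked on toy data. The class points, the sample s, and the function name below are assumptions for illustration; medians come from Python's standard `statistics.median`.

```python
import math
from statistics import median

def vcm_inner_product(class_a, class_b, s, dims):
    """Inner product of (s - midpoint of medians) with the unit vcm, over dims."""
    med_a = [median(p[d] for p in class_a) for d in dims]
    med_b = [median(p[d] for p in class_b) for d in dims]
    vcm = [mb - ma for ma, mb in zip(med_a, med_b)]
    length = math.sqrt(sum(v * v for v in vcm))
    mid = [(ma + mb) / 2 for ma, mb in zip(med_a, med_b)]
    return sum((s[d] - m) * v / length for d, m, v in zip(dims, mid, vcm))

# Both classes have the same median in x (dimension 0), so vcm_x = 0
red = [(1, 0, 0), (1, 2, 1), (1, 1, 2)]
green = [(1, 4, 5), (1, 6, 4), (1, 5, 6)]
s = (9, 3, 2)

full = vcm_inner_product(red, green, s, dims=(0, 1, 2))
yz_only = vcm_inner_product(red, green, s, dims=(1, 2))
print(full, yz_only)
```

The x-coordinate of s is far from both classes, yet it has no effect on the result, because it is multiplied by the zero x-component of the unit vcm.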
DBQ versus MIKE (DataBase Querying vs. Mining through data for Information and Knowledge Extraction)
Why do we call it Mining through data for Information & Knowledge Extraction and not just Data Mining?
We Mine Silver and Gold! We don't just Mine Rock. (The emphasis should be on the desired result, not the discard. The name should emphasize what we mine for, not what we mine from.)
Silver and Gold are low-volume, high-value products, found (or not) in the mountains of rock (high-volume, low-value). Information and knowledge are low-volume, high-value, hiding in mountains of data (high-volume, low-value).
In both MIKE and MSG, the output and the substrate are substantially different in structure (chemical / data structure).
Just as in Mining Silver and Gold we extract (hopefully) Silver and Gold from raw Rock, in Mining through data for Information and Knowledge we extract (hopefully) Information and Knowledge from raw Data.
So Mining through data for Information and Knowledge Extraction is the correct terminology and MIKE is the correct acronym, not Data Mining (DM).
How is Data Base Querying (DBQ) different from Mining thru data for Info & Knowledge (MIKE)?
In all mining (MIKE as well as MSG) we hope to successfully mine out something of value, but failure is likely, whereas in DBQ, valuable results are likely and no result is unlikely.
DBQ should be called Data Base Quarrying, since it is more similar to Granite Quarrying (GQ), in that what we extract has the same structure as that from which we extract it (the substrate). It has higher value because of its detail and specificity. (I.e., the output records of a DBQ are exactly the reduced-size set of records we demanded and expected from our query, and the output grave stones of GQ are exactly the size and shape we demanded and expected, and in both cases what is left is a substance that is the same as what is taken.)
In sum,
DBQ = Quarrying (highly predictable output, and the output has the same structure as the substrate (sets of records)).
MIKE = Mining (unpredictable output, and the output has a different structure than the substrate (e.g., T/F or partition)).
Some good datasets for classification
1. KDDCUP-99 Dataset (Network Intrusion Dataset)
– 4.8 million records, 32 numerical attributes
– 6 classes, each containing >10,000 records
– Class distribution:
    Normal          972,780
    IP sweep         12,481
    Neptune       1,072,017
    Port sweep       10,413
    Satan            15,892
    Smurf         2,807,886
– Testing set: 120 records, 20 per class
– 4 synthetic datasets (randomly generated):
- 10,000 records (SS-I)
- 100,000 records (SS-II)
- 1,000,000 records (SS-III)
Speed and Scalability
Speed (Scalability) Comparison (k=5, hs=25), by training set cardinality (x 1000)

Algorithm       10     100    1000    2000    4891
SMART-TV      0.14    0.33    2.01    3.88    9.27
P-KNN         0.89    1.06    3.94   12.44   30.79
KNN           0.39    2.34   23.47   49.28      NA

[Figure: Running Time Against Varying Cardinality. Time in Seconds versus Training Set Cardinality (x1000) for SMART-TV, PKNN and KNN.]
Machine: Intel Pentium 4 CPU 2.6 GHz, 3.8GB RAM, running Red Hat Linux.
Note: these evaluations were done when we were still sorting the derived TV attribute and before we used
Gaussian vote weighting. Therefore both speed and accuracy of SMART-TV have improved markedly!
Dataset (Cont.)
2. OPTICS dataset
– 8,000 points, 8 classes (CL-1, CL-2,…,CL-8)
– 2 numerical attributes
– Training set: 7,920 points
[Figure: scatter plot of the OPTICS dataset showing the eight clusters CL-1 through CL-8.]
Dataset (Cont.)
3. IRIS dataset
– 150 samples
– 3 classes (iris-setosa, iris-versicolor, and iris-virginica)
– 4 numerical attributes
– Training set: 120 samples
– Testing set: 30 samples, 10 per class
Overall Accuracy
Overall Classification Accuracy Comparison

Datasets    SMART-TV    PKNN    KNN
IRIS          0.97      0.71    0.97
OPTICS        0.96      0.99    0.97
SS-I          0.96      0.72    0.89
SS-II         0.92      0.91    0.97
SS-III        0.94      0.91    0.96
SS-IV         0.92      0.91    0.97
NI            0.93      0.91    NA

[Figure: Comparison of the Algorithms' Overall Classification Accuracy. Average F-Score (0.00 to 1.00) by Dataset for SMART-TV, PKNN and KNN.]
More Mathematics to enjoy:
As you probably know, Taufik used a heap process to get the k nearest neighbors of unclassified samples
(requiring one scan through the well-pruned nearest neighbor candidate set).
This means that Taufik did not use the closed kNN set, so accuracy will be the same as horizontal kNN
(single scan, non-closed). (Actually, accuracy varies slightly depending on which kth NN is picked from the ties.)
Taufik is planning to leave the thesis that way and explain why he did it that way (over-fairness) ;-)
A great learning experience with respect to using DataMIME, and a great thesis opportunity, exists here: namely,
showing that when one uses closed kNN in SMART-TV, not only do we get a tremendously scalable algorithm, but also a
much more accurate result (even slightly faster, since there is no heaping to do).
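The difference between the two neighbor sets can be sketched as follows. This is a horizontal (record-at-a-time) illustration only, not SMART-TV's vertical implementation; the function names and the candidate list are hypothetical.

```python
import heapq
from collections import Counter


def knn_heap(candidates, k):
    """Single-scan kNN: exactly k (distance, label) pairs; ties at the kth distance are cut."""
    return heapq.nsmallest(k, candidates, key=lambda c: c[0])


def knn_closed(candidates, k):
    """Closed kNN: the k nearest plus every candidate tied with the kth nearest."""
    if not candidates:
        return []
    dists = sorted(d for d, _ in candidates)
    kth = dists[min(k, len(dists)) - 1]
    return [c for c in candidates if c[0] <= kth]


def majority(neighbors):
    """Label counts of a neighbor set, most frequent first."""
    return Counter(label for _, label in neighbors).most_common()


# Hypothetical pruned candidate set: three candidates tie at distance 2.0.
cands = [(1.0, "A"), (2.0, "B"), (2.0, "A"), (2.0, "A"), (3.0, "B")]
print(majority(knn_heap(cands, 2)))    # depends on which tied candidate is kept
print(majority(knn_closed(cands, 2)))  # all tied candidates vote
```

With k = 2 the heap version keeps only one of the three tied candidates, so the vote depends on an arbitrary choice; the closed set lets all of them vote.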
A project: Re-run Taufik's performance measurements (just for SMART-TV) using a final scan of the pruned
candidate Nearest Neighbor Set. Let all candidates vote using a Gaussian vote drop-off function as follows:
If a candidate x lies at Euclidean distance > $\epsilon$ from a, its vote weight is 0; otherwise we define the Gaussian drop-off function
$g(x) = \mathrm{Gauss}(r = |x-a|) = \frac{1}{\mathrm{std}\,\sqrt{2\pi}}\; e^{-(r-\mathrm{mean})^2 / (2\,\mathrm{var})}$,
where std, mean and var refer to the set of distances from a of the non-zero voters (i.e., the set of r = |x-a| numbers).
But use the Modified Gaussian, $MG(x) = g(x) - e^{-\epsilon^2}$, so that the vote weight function drops smoothly
to 0 right at the boundary of the $\epsilon$-disk about a and then stays zero outside it.
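A hedged sketch of such a vote-weight function. The names, the parameter eps (the disk radius) and the sample distances are hypothetical. The offset subtracted here is the Gaussian evaluated at the boundary distance, an assumption chosen so that the weight is exactly zero at the edge of the disk; the treatment of the zero-variance case is also an assumption.

```python
import math


def gaussian(r, mean, var):
    """Gaussian density at distance r for the given mean and variance."""
    std = math.sqrt(var)
    return math.exp(-((r - mean) ** 2) / (2.0 * var)) / (std * math.sqrt(2.0 * math.pi))


def modified_gaussian_weights(dists, eps):
    """Vote weights for candidates at distances dists from a; zero outside the eps-disk."""
    voters = [r for r in dists if r <= eps]
    if not voters:
        return [0.0 for _ in dists]
    mean = sum(voters) / len(voters)
    var = sum((r - mean) ** 2 for r in voters) / len(voters)
    if var == 0.0:
        # Assumption: all voters are equidistant from a, so they vote equally.
        return [1.0 if r <= eps else 0.0 for r in dists]
    offset = gaussian(eps, mean, var)  # assumed offset: the value at the disk boundary
    return [max(0.0, gaussian(r, mean, var) - offset) if r <= eps else 0.0
            for r in dists]


def weighted_vote(candidates, eps):
    """candidates: (distance, label) pairs. Returns the label with the largest total weight."""
    weights = modified_gaussian_weights([d for d, _ in candidates], eps)
    totals = {}
    for (_, label), w in zip(candidates, weights):
        totals[label] = totals.get(label, 0.0) + w
    if not totals or max(totals.values()) <= 0.0:
        return None
    return max(totals, key=totals.get)


cands = [(0.2, "A"), (0.4, "B"), (0.6, "A"), (2.0, "B"), (3.0, "B")]
print(modified_gaussian_weights([d for d, _ in cands], eps=2.0))
print(weighted_vote(cands, eps=2.0))
```

Because the mean is taken over the voters' distances, the weight peaks at the mean distance rather than at a itself; whether that is the intended behavior is one of the parameterization questions left open above.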
More Mathematics to enjoy:
Meegeum has claimed the study of how much the above improves accuracy in various settings
(and how various parameterizations of g(x) (à la statistics) affect it).
More enjoyment!:
If there is an example data set where accuracy goes down when using the Gaussian and closed NNS, does that
prove the data set is noise at that point (i.e., there is no continuity relationship between features and classes at a)?
This leads to an interesting theory regarding the continuity of a training set.
Everyone assumes the training set is what you work with, and everyone assumes continuity!
In fact, cancer studies forge ahead with classification even when p>>n, that is, when there are just too few
feature tuples for continuity to even make good sense!
So now we have a method of identifying those regions of the training space where there is a discontinuity
in the feature-to-class relationship, FC. That will be valuable (e.g., in Bioinformatics).
Has this been studied?
(I think everyone just goes ahead and classifies based on continuity and lives with the result!)
More enjoyment (continued):
I claim that when the class label is real, then with a properly chosen Gaussian vote-weight function (chosen by
domain experts) and with a good method of testing the classifier, if SMART-TV misclassifies a test point,
then it is not a misclassification at all; rather, there is a discontinuity in FC there (between feature and class)!
In other words, that proves the training set is wrong there and should not be used.
I am claiming (do you back me up, Dr. Abidin?) that SMART-TV, properly tuned, DEFINES CORRECT!
It will be fun to see how far we can go with this point of view. Be warned - it will cause quite a stir!
Thoughts:
1. Choose a vote drop-off Gaussian carefully (with domain knowledge in play) and designate it as "right".
What could be more right? - if you agree that classification has to be based on FC continuity.
2. Analyze (very carefully) SMART-TV vote histograms for radii $\epsilon_1 < \epsilon_2 < ... < \epsilon_h$. If all are inconclusive then the
Feature-to-Class function (FC) is discontinuous at a and classification SHOULD NOT BE ATTEMPTED
USING THAT TRAINING SET! (This may get us into Kriging.) If the histograms get more and more
conclusive as the radius increases, then possibly one would want to rely on the outer ring votes, but one
should also report that there is class noise at a!
3. We can get into tuning the Modified Gaussian, MG, by region. Determine the subregion where MG gives
conclusive histograms. Then for each inconclusive training point, examine modifications (other parameters
or even other drop-off functions) until you find one that is conclusive. Determine the full region where that
one is conclusive ...
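Thought 2 can be sketched as a scan over increasing radii. This is a simplified illustration using unweighted votes; the function names, the 0.6 "conclusive" margin and the neighbor lists are all hypothetical, and a real version would presumably use the chosen drop-off weights.

```python
from collections import Counter


def vote_histogram(neighbors, radius):
    """Count class labels of training points within radius of the test point a."""
    return Counter(label for dist, label in neighbors if dist <= radius)


def is_conclusive(hist, margin=0.6):
    """A histogram is called conclusive if its top class holds at least `margin` of the votes."""
    total = sum(hist.values())
    if total == 0:
        return False
    return hist.most_common(1)[0][1] / total >= margin


def classify_or_flag(neighbors, radii, margin=0.6):
    """Return (label, radius) for the smallest conclusive radius,
    or (None, None) if every histogram is inconclusive."""
    for radius in sorted(radii):
        hist = vote_histogram(neighbors, radius)
        if is_conclusive(hist, margin):
            return hist.most_common(1)[0][0], radius
    return None, None


# Evenly mixed labels at every radius: no classification should be attempted.
noisy = [(0.1, "A"), (0.2, "B"), (0.3, "A"), (0.4, "B")]
# Inconclusive near a, conclusive only once the outer ring is included.
outer = [(0.1, "A"), (0.2, "B"), (0.3, "A"), (0.35, "A")]

print(classify_or_flag(noisy, [0.2, 0.4]))  # (None, None)
print(classify_or_flag(outer, [0.2, 0.4]))  # ('A', 0.4)
```

Returning the radius along with the label lets the caller report class noise at a whenever the first conclusive radius is not the smallest one.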
More enjoyment (continued):
CAUTION! Many important Class Label Attributes (CLAs) are real (e.g., level of ill intent in homeland
security data mining, level of ill intent in Network Intrusion analysis, probability of cancer in a cancer tissue
microarray dataset), but many important Class Label Attributes are categorical (e.g., bioinformatic annotation,
phenotype prediction, etc.). When the class label is categorical, the distance on the CLA becomes the
characteristic distance function (distance=0 iff the 2 categories are the same). Continuity at a becomes:
$\exists \delta > 0 : d(x,a) < \delta \Rightarrow f(x) = f(a)$. Possibly boundary determination of the training set classes is most important
in that case? Is that why SVM works so well in those situations? Still, for there to be continuity at a, it must be
the case that some NNS of a maps to f(a).
However, if the CLA is real, continuity at a takes the usual form:
$\forall \epsilon > 0 \; \exists \delta > 0 : d(x,a) < \delta \Rightarrow |f(x) - f(a)| < \epsilon$.
Has anyone ever seen an analysis of "what is the best definition of correct" (do statisticians do this?).
Maybe we need a good statistician to take a look, but let's pursue it anyway so we know what we can claim.
Step 1: re-run Taufik's SMART-TV with the Modified Gaussian, MG(x), and closed kNN.
We should also take several standard UCI ML classification datasets and randomize classes in some particular
isotropic neighborhood (so that we know where there is definitely an FC discontinuity).
Then show (using MATLAB?) that SVM fails to detect that discontinuity (i.e., SVM gives a definitive class to it
without the ability to detect the fact that it doesn't deserve one; or do the Lagrange multipliers do that for
them?), and then show that we can detect that situation.
Does any other author do this?
Other fun ideas?
Classifying with multi-relational (heterogeneous) training data
Classify on a foreign key table (S) when some of the features need to come from the primary key table (R)?
[Figure: relations R(A0 A1 A2) and S(B0 B1 B2 B3) shown as bit columns, with the bit slices S11 ... S43 and their basic P-trees P11 ... P43.]
To data mine this PK multi-relation (R.A0 is the ordered ascending primary key and S.B2 is the foreign key),
scan S, building (basic P-trees for) the derived attributes Bn+1..Bn+m (here B4, B5) from A1..Am using
the bottom-up approach (next slide).
Note: Once the derived basic P-trees are built, what if a tuple is added to S? If it has a new B2-value, then a
new tuple must have been added to R also (with that value in A0). Then all basic P-trees must be extended
(including the derived ones).
[Figure: S(B0 B1 B2 B3) and R(B2 A1 A2), the derived columns S.B4,1, S.B4,2 and S.B5, and the basic P-trees P01 ... P33 together with the derived P-trees P4,1, P4,2 and P5.]
The cost is the same as an indexed nested loop join (it is reasonable to assume there is a primary index on R).
When an insert is made to R, nothing has to change. When an insert is made to S, the P-tree structure is
expanded and filled using the values in that insert plus the R-attribute values of the new S.B2 value. (This is
one index lookup. The S.B2 value must exist in R.A0 by referential integrity.)
Finally, if we are using, e.g., 4Pi,j P-trees instead of the (4,2,1)Pi,j P-trees shown here, it's the same:
the basic P-tree fanout is /\ , the left leaf is filled by the first 4 values, the right leaf is filled with the last 4.
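A minimal sketch of the derivation step, assuming R is available as a primary-key index (here a Python dict) and building plain, uncompressed vertical bit slices rather than tree-structured P-trees. The function names and the toy rows are hypothetical and are not the values on the slide.

```python
def derive_attributes(s_rows, r_index, fk_pos):
    """Append R's non-key attributes to each S row, one index lookup per row.
    A KeyError here would mean referential integrity is violated."""
    return [row + r_index[row[fk_pos]] for row in s_rows]


def bit_slices(values, width):
    """Vertical bit slices: slices[j][i] is bit j (most significant first) of values[i]."""
    return [[(v >> (width - 1 - j)) & 1 for v in values] for j in range(width)]


def insert_into_s(s_rows, extended, new_row, r_index, fk_pos):
    """Insert into S: extend the derived columns with a single lookup into R."""
    s_rows.append(new_row)
    extended.append(new_row + r_index[new_row[fk_pos]])


# Toy data: R maps A0 -> (A1, A2); S rows are (B0, B1, B2, B3) with B2 the foreign key.
r_index = {0b001: (0b11, 1), 0b011: (0b10, 0), 0b100: (0b01, 1)}
s_rows = [(0b010, 0b111, 0b001, 0b001),
          (0b011, 0b110, 0b100, 0b000),
          (0b010, 0b111, 0b011, 0b111)]

extended = derive_attributes(s_rows, r_index, fk_pos=2)
b4 = [row[4] for row in extended]   # derived attribute B4 (from A1)
b5 = [row[5] for row in extended]   # derived attribute B5 (from A2)
print(bit_slices(b4, 2), bit_slices(b5, 1))

insert_into_s(s_rows, extended, (0b101, 0b000, 0b011, 0b100), r_index, fk_pos=2)
```

An insert into R touches nothing here, which mirrors the observation above: only inserts into S extend the derived columns.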
[Figure: the same S(B0 B1 B2 B3) and R(B2 A1 A2) relations, with the derived columns S.B4,1, S.B4,2 and S.B5 and their two-leaf P-trees, or:]
S.B4,1:  1,1 = EM-S.B4,1    0,1 = UM-S.B4,1
S.B4,2:  1,1 = EM-S.B4,2    1,1 = UM-S.B4,2
S.B5:    0,0 = EM-S.B5      0,0 = UM-S.B5
VertiGO (Vertical Gene Ontology)
The GO is a data structure which needs to be mined together with various valuable bioinformatic data sets.
Biologists waste time searching for all available information about each small area of research.
This is hampered further by variations in the terminology in common usage at any given time,
which inhibit effective searching by computers as well as people.
E.g., In a search for new targets for antibiotics, you want all gene products involved in bacterial protein
synthesis, that have significantly different sequence or structure from those in humans.
If one DB says these molecules are involved in 'translation' and another uses 'protein synthesis', it is difficult
for you - and even harder for a computer - to find functionally equivalent terms.
GO is an effort to address the need for consistent descriptions of gene products in different DBs.
The project began in 1998 as a collaboration between three model organism databases:
FlyBase (Drosophila),
Saccharomyces Genome Database (SGD),
Mouse Genome Database (MGD).
Since then, the GO Consortium has grown to include several of the world's major repositories for plant,
animal and microbial genomes. See the GO web page for a full list of member orgs.
VertiGO (Vertical Gene Ontology)
The GO is a DAG which needs to be mined in conjunction with the Gene Table (one tuple for each gene with
feature attributes). The DAG links are IS-A or PART-OF links. (Description follows from the GO website).
If we take the simplified view that the GO assigns annotations of types (Biological Process (BP);
Molecular Function (MF); Cellular Component (CC)) to genes, with qualifiers ("contributes to", "colocalizes with", "not")
and evidence codes:
IDA = Inferred from Direct Assay
IGI = Inferred from Genetic Interaction
IMP = Inferred from Mutant Phenotype
IPI = Inferred from Physical Interaction
TAS = Traceable Author Statement
IEP = Inferred from Expression Pattern
RCA = Inferred from Reviewed Computational Analysis
IC = Inferred by Curator
IEA = Inferred by Electronic Annotation
ISS = Inferred from Sequence/Structural Similarity
NAS = Non-traceable Author Statement
ND = No Biological Data Available
NR = Not Recorded
then the following encodings are possible:
Solution-1: For each annotation (term or GOID) have a
2-bit type code column GOIDType (BP=11, MF=10, CC=00), a
2-bit qualifier code column GOIDQualifier (contributesto=11, co-localizeswith=10, not=00) and a
4-bit evidence code column GOIDEvidence, e.g.: IDA=1111, IGI=1110, IMP=1101, IPI=1100, TAS=1011,
IEP=1010, ISS=1001, RCA=1000, IC=0111, IEA=0110, NAS=0100, ND=0010, NR=0001 (putting the DAG
structure in the schema catalog). (This increases the width by 8 bits * #GOIDs to losslessly incorporate the GO info.)
Solution-2: If the BP, MF and CC DAGs are disjoint (share no GOIDs? true?), an alternative solution is: use a
4-bit evidencecode/qualifier column, GOIDECQ. Evidence codes: IDA=1111, IGI=1110, IMP=1101,
IPI=1100, TAS=1011, IEP=1010, ISS=1001, RCA=1000, IC=0111, IEA=0110, NAS=0101, ND=0100, NR=0011.
Qualifiers: contributesto=0010, colocalizeswith=0001, not=0000. (The width increases by 4 bits * #GOIDs, still lossless.)
Solution-3: Bitmap all 13 evidence codes and all 3 qualifiers (adds 16 bitmaps per GO term). Keep in mind
that genes are assumed to be inherited up the DAGs but are only listed at the lowest level to which they
apply. This will keep the bitmaps sparse. If a GO term has no attached genes, it need not be included (many
such?). It will be in the schema with its DAG links, and will be assumed to inherit all downstream genes, but
it will not generate 16 bit columns in the Gene Table. Is the "not" qualifier the complement of the term bitmap?
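Solution-1 can be sketched as packing the three codes into one 8-bit value per (gene, GOID) pair. The code tables below copy the Solution-1 values given above; the bit layout (type in the top two bits, then qualifier, then evidence), the function names and the dictionary keys are assumptions made for illustration.

```python
TYPE_CODE = {"BP": 0b11, "MF": 0b10, "CC": 0b00}
QUALIFIER_CODE = {"contributes_to": 0b11, "colocalizes_with": 0b10, "not": 0b00}
EVIDENCE_CODE = {
    "IDA": 0b1111, "IGI": 0b1110, "IMP": 0b1101, "IPI": 0b1100, "TAS": 0b1011,
    "IEP": 0b1010, "ISS": 0b1001, "RCA": 0b1000, "IC": 0b0111, "IEA": 0b0110,
    "NAS": 0b0100, "ND": 0b0010, "NR": 0b0001,
}

# Reverse lookups for decoding.
TYPE_NAME = {v: k for k, v in TYPE_CODE.items()}
QUALIFIER_NAME = {v: k for k, v in QUALIFIER_CODE.items()}
EVIDENCE_NAME = {v: k for k, v in EVIDENCE_CODE.items()}


def encode(go_type, qualifier, evidence):
    """Pack type (2 bits), qualifier (2 bits) and evidence (4 bits) into one byte."""
    return (TYPE_CODE[go_type] << 6) | (QUALIFIER_CODE[qualifier] << 4) | EVIDENCE_CODE[evidence]


def decode(byte):
    """Unpack a byte produced by encode(); raises KeyError on an unassigned code."""
    return (TYPE_NAME[(byte >> 6) & 0b11],
            QUALIFIER_NAME[(byte >> 4) & 0b11],
            EVIDENCE_NAME[byte & 0b1111])


packed = encode("MF", "contributes_to", "IDA")
print(format(packed, "08b"))   # 10111111
print(decode(packed))          # ('MF', 'contributes_to', 'IDA')
```

Each of the 8 bit positions then becomes one vertical bit column per GOID in the Gene Table. Note that the scheme as listed has no code for "no qualifier", so an unqualified annotation would need an additional convention.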
GO has 3 structured, controlled vocabularies (ontologies) describing gene products
(the RNA or protein resulting after transcription) by their species-independent, associated
biological processes (BP),
cellular components (CC)
molecular functions (MF).
There are three separate aspects to this effort: The GO consortium
1. writes and maintains the ontologies themselves;
2. makes associations between the ontologies and genes / gene products in the collaborating DBs,
3. develops tools that facilitate the creation, maintenance and use of ontologies.
The use of GO terms by several collaborating databases facilitates uniform queries across them.
The controlled vocabularies are structured so that you can query them at different levels: e.g.,
1. use GO to find all gene products in the mouse genome that are involved in signal transduction,
2. zoom in on all the receptor tyrosine kinases.
This structure also allows annotators to assign properties to gene products at different levels, depending on
how much is known about a gene product.
GO is not a database of gene sequences or a catalog of gene products
GO describes how gene products behave in a cellular context. GO is not a way to unify biological
databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but
is not sufficient. Reasons include:
Knowledge changes and updates lag behind.
Curators evaluate data differently (e.g., we can agree to use the word 'kinase', but we must also support this by
stating how and why we use 'kinase', and apply it consistently). Only in this way can we hope to
compare gene products and determine whether they are related.
GO does not attempt to describe every aspect of biology. For example, domain structure, 3D structure,
evolution and expression are not described by GO.
GO is not a dictated standard, mandating nomenclature across databases.
Groups participate because of self-interest, and cooperate to arrive at a consensus.
The 3 organizing GO principles: molecular function, biological process, cellular component.
A gene product has one or more molecular functions and is used in one or more biological processes; it
might be associated with one or more cellular components.
E.g., the gene product cytochrome c can be described by the molecular function term oxidoreductase
activity, the biological process terms oxidative phosphorylation and induction of cell death, and the
cellular component terms mitochondrial matrix, mitochondrial inner membrane.
Molecular function (organizing principle of GO) describes e.g., catalytic/binding activities, at molecular level
GO molecular function terms represent activities rather than the entities (molecules / complexes) that
perform actions, and do not specify where or when, or in what context, the action takes place. Molecular
functions correspond to activities that can be performed by individual gene products, but some activities are
performed by assembled complexes of gene products.
Examples of broad functional terms are catalytic activity, transporter activity, or binding;
Examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding. It is easy to
confuse a gene product with its molecular function, and for this reason many GO molecular functions are appended
with the word "activity".
The documentation on gene products explains this confusion in more depth.
A Biological Process is a series of events accomplished by 1 or more ordered assemblies of molecular functions.
Examples of broad biological process terms: cellular physiological process or signal transduction.
Examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport.
It can be difficult to distinguish between a biological process and a molecular function, but the general rule is
that a process must have more than one distinct step.
A biological process is not equivalent to a pathway. We are specifically not capturing or trying to represent
any of the dynamics or dependencies that would be required to describe a pathway.
A cellular component is just that, a component of a cell but with the proviso that it is part of some larger
object, which may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene
product group (e.g. ribosome, proteasome or a protein dimer).
What does the Ontology look like?
GO terms are organized in structures called directed acyclic graphs (DAGs), which differ from hierarchies in
that a child (more specialized term) can have many parents (less specialized terms).
For example, the biological process term hexose biosynthesis has two parents, hexose metabolism and
monosaccharide biosynthesis. This is because biosynthesis is a subtype of metabolism, and a hexose is a type
of monosaccharide.
When any gene involved in hexose biosynthesis is annotated to this term, it is automatically annotated to both
hexose metabolism and monosaccharide biosynthesis, because every GO term must obey the true path rule: if
the child term describes the gene product, then all its parent terms must also apply to that gene product.
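The true path rule amounts to propagating an annotation to every ancestor of the annotated term. A minimal sketch, using only the three terms named above (the real DAG continues above them); the gene name, the dictionary layout and the function names are hypothetical.

```python
# child term -> list of parent terms (a fragment of the DAG only)
PARENTS = {
    "hexose biosynthesis": ["hexose metabolism", "monosaccharide biosynthesis"],
    "hexose metabolism": [],
    "monosaccharide biosynthesis": [],
}


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen


def annotate(gene, term, parents, annotations):
    """Annotate `gene` to `term` and, by the true path rule, to all of its ancestors."""
    for t in {term} | ancestors(term, parents):
        annotations.setdefault(t, set()).add(gene)


annotations = {}
annotate("geneX", "hexose biosynthesis", PARENTS, annotations)
print(sorted(annotations))
```

Because a child may have several parents, the traversal tracks visited terms so that a shared ancestor is processed only once.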
It is easy to confuse a gene product and its molecular function, because very often these are
described in exactly the same words. For example, 'alcohol dehydrogenase' can describe what
you can put in an Eppendorf tube (the gene product) or it can describe the function of this stuff.
There is, however, a formal difference: a single gene product might have several molecular
functions, and many gene products can share a single molecular function, e.g., there are many
gene products that have the function 'alcohol dehydrogenase'.
Some, but by no means all, of these are encoded by genes with the name alcohol dehydrogenase.
A particular gene product might have both the functions 'alcohol dehydrogenase' and
'acetaldehyde dismutase', and perhaps other functions as well.
It's important to grasp that, whenever we use terms such as alcohol dehydrogenase activity in
GO, we mean the function, not the entity; for this reason, most GO molecular function terms are
appended with the word 'activity'.
Many gene products associate into entities that function as complexes, or 'gene product groups',
which often include small molecules. They range in complexity from the relatively simple (for
example, hemoglobin contains the gene products alpha-globin and beta-globin, and the small
molecule heme) to complex assemblies of numerous different gene products, e.g., the ribosome.
At present, small molecules are not represented in GO. In the future, we might be able to create
cross products by linking GO to existing databases of small molecules such as Klotho and LIGAND.
How do terms get associated with gene products?
Collaborating databases annotate their gene products (or genes) with GO terms, providing references and indicating what
kind of evidence is available to support the annotations. More info in GO Annotation Guide.
If you browse any of the contributing databases, you'll find that each gene or gene product has a list of associated GO terms.
Each database also publishes a table of these associations, and these are freely available from the GO ftp site.
You can also browse the ontologies using a range of web-based browsers. A full list of these, and other tools for analyzing
gene function using GO, is available on the GO Tools page .
In addition, the GO consortium has prepared GO slims, 'slimmed down' versions of the ontologies that allow you to annotate
genomes or sets of gene products to gain a high-level view of gene functions.
Using GO slims you can, for example, work out what proportion of a genome is involved in signal transduction, biosynthesis
or reproduction. See the GO Slim Guide for more information.
All GO data is free. Download the ontology data in a number of different formats, including XML and MySQL, from the GO
Downloads page (more info on the syntax of these formats is in the GO File Format Guide).
If you need lists of the genes or gene products that have been associated with a particular GO term, the Current Annotations
table tracks the number of annotations and provides links to the gene association files for each of the collaborating DBs.
GO allows us to annotate genes and their products with a limited set of attributes. e.g., GO does not allow us to describe
genes in terms of which cells or tissues they're expressed in, which developmental stages they're expressed at, or their
involvement in disease. It is not necessary for GO to do this since other ontologies are doing it. The GO consortium supports
the development of other ontologies and makes its tools for editing and curating ontologies available. A list of freely
available ontologies that are relevant to genomics and proteomics and are structured similarly to GO can be found at the
Open Biomedical Ontologies website. A larger list, which includes the ontologies listed at OBO and also other controlled
vocabularies that do not fulfil the OBO criteria is available at the Ontology Working Group page of the Microarray Gene
Expression Data Society (MGED).
Cross-products: The existence of several ontologies will also allow us to create 'cross-products' that maximize the utility of
each ontology while avoiding redundancy. For example, by combining the developmental terms in the GO process ontology
with a second ontology that describes Drosophila anatomical structures, we could create an ontology of fly development.
We could repeat this process for other organisms without having to clutter up GO with large numbers of species-specific
terms. Similarly, we could create an ontology of biosynthetic pathways by combining biosynthesis terms in the GO process
ontology with a chemical ontology.
Mappings to other classification systems: GO is not the only attempt to build structured controlled vocabularies for
genome annotation. Nor is it the only such series of catalogs in current use. We have attempted to make translation tables
between these catalogs and GO. We caution that these mappings are neither complete nor exact; they are to be used as a
guide. One reason for this is absence of definitions from many of the other catalogs and of a complete set of definitions in
GO itself. More information on the syntax of these mappings can be found in the GO File Format Guide.
Contributing to GO: The GO project is constantly evolving, and we welcome feedback from all users. If you need a new
term or definition, or would like to suggest that we reorganize a section of one of the ontologies, please do so through our
online request-tracking system, which is hosted by SourceForge.net.
What is a GO term? The purpose of GO is to define particular attributes of gene products. A term is simply
the text string used to describe an entry in GO, e.g. cell, fibroblast growth factor receptor binding or signal
transduction. A node refers to a term and all its children.
GO does not contain the following:
Gene products: e.g. cytochrome c is not in GO; attributes of it, e.g., oxidoreductase activity, are.
Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO
term because causing cancer is not the normal function of any gene.
Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be
described in a separate sequence ontology (see OBO web site for more information).
Protein domains or structural features.
Protein-protein interactions.
Conventions when adding a term: These stylistic points should be applied to all aspects of the ontologies.
Spelling conventions: Where there are differences in accepted spelling between British and US English, use the US form, e.g.
polymerizing, signaling, rather than polymerising, signalling. A dictionary of 'words' used in GO terms is at GODict.DAT.
Abbreviations: Avoid abbreviations unless they're self-explanatory. Use full element names, not symbols. Use hydrogen for
H+. Use copper and zinc rather than Cu and Zn. Use copper(II), copper(III), etc., rather than cuprous, cupric, etc. For
biomolecules, spell out the term in full wherever practical: use fibroblast growth factor, not FGF.
Greek symbols: Spell out Greek symbols in full: e.g. alpha, beta, gamma.
Case: GO terms are all lower case except where demanded by context, e.g. DNA, not dna.
Singular/plural: Use the singular, except where a term is only used in the plural (e.g. caveolae).
Be descriptive: Be reasonably descriptive, even at the risk of verbal redundancy. Remember, DBs that refer to GO terms
might list only the finest-level terms associated with a particular gene product. If the parent is aromatic amino acid family
biosynthesis, then child should be aromatic amino acid family biosynthesis, anthranilate pathway, not anthranilate pathway.
Anatomical qualifiers: Do not use anatomical qualifiers in the cellular process and molecular function ontologies. For
example, GO has the molecular function term DNA-directed DNA polymerase activity but neither nuclear DNA polymerase
nor mitochondrial DNA polymerase. These terms with anatomical qualifiers are not necessary because annotators can use
the cellular component ontology to attribute location to gene products, independently of process or function.
Synonyms: When several words or phrases could be used as the term name, one form is chosen as the term name
whilst the other possible names are added as synonyms. Despite the name, GO synonyms are not always 'synonymous' in the
strictest sense of the word, as they do not always mean exactly the same as the term they are attached to.
Instead, a GO synonym may be broader or narrower than the term string; it may be a related phrase; it may be alternative
wording, spelling or use a different system of nomenclature; or it may be a true synonym. This flexibility allows GO
synonyms to serve as valuable search aids, as well as being useful for apps such as text mining and semantic matching.
Having a single, broad relationship between a GO term and its synonyms is adequate for most search purposes, but for other
applications such as semantic matching, the inclusion of a more formal relationship set is valuable. Thus, GO records a
relationship type for each synonym, stored in OBO format flat file.
Synonym types: The synonym relationship types are:
- exact: the term is an exact synonym (ornithine cycle is an exact synonym of urea cycle)
- related: the terms are related (cytochrome bc1 complex is related to ubiquinol-cytochrome-c reductase activity)
- broad: the synonym is broader than the term name (cell division is a broad synonym of cytokinesis)
- narrow: the synonym is narrower or more precise (pyrimidine-dimer repair by photolyase is a narrow synonym of photoreactive repair)
- other related: the synonym is related to, but not exact, broader or narrower (virulence has synonym type other related to the term pathogenesis)
These types form a loose hierarchy:
related
    [i] exact synonym
    [i] broad synonym
    [i] narrow synonym
    [i] other related synonym
The default relationship is related to, as all synonyms are in some way related to the term name, but more specific
relationships are assigned where possible. The synonym type other related is used where the relationship between a term
and its synonym is NOT exact, narrower or broader. In some cases, broader and narrower synonyms are created in the place
of new parent or child terms because some synonym strings may not be valid GO terms but may still be useful for search
purposes. This may be because the synonym is the name of a gene product e.g. ubiquitin-protein ligase activity has the
narrower synonym E3, as E3 is a specific gene product with ubiquitin-protein ligase activity.
Adding synonyms: When you add a synonym using DAG-Edit, choose a type from the pull-down selector (see the DAG-Edit user guide for more information). DAG-Edit will incorporate the synonym type into the OBO format flat file when you
save. The default synonym type is the broadest, 'synonym' (equivalent to 'related' above). Number of synonyms for a term is
not limited, and the same text string can be used for more than 1 GO term. Add synonyms if you edit a term name but the
old name is still a valid synonym; for example, if you change respiration to cellular respiration, keep respiration as a
synonym. This helps other users find familiar terms. Add synonyms if the term has (or contains) a commonly used
abbreviation. For example, FGF binding could be used as a synonym for fibroblast growth factor binding. Do not add a
synonym if the only difference is case (e.g. start vs. START). Synonyms, like term names, are all lower case except where
demanded by context (e.g. DNA, not dna).
Rules for Synonyms:
- Acronyms are exactly synonymous with the full name (if the acronym is not used in any other sense elsewhere).
- 'Jargon' type phrases are exactly synonymous with the full name (if the phrase is not used in any other sense elsewhere).
- proton is exactly synonymous with hydrogen in most senses EXCEPT where hydrogen means H2 (i.e. gas).
- Include implicit information when making a decision; take into account which ontology the term is in, e.g. an entry term that
ends in 'factor' is not synonymous with a molecular function.
- ligand is NOT exactly synonymous with binding (ligand is an entity, binding an action).
- XXX receptor ligand is NOT exactly synonymous with XXX (there is more than 1 potential ligand, so XXX receptor ligand is broader than XXX).
- XXX complex is NOT exactly synonymous with XXX (XXX is ambiguous; it could describe the activity of XXX).
- porter and transporter are NOT exactly synonymous (transporter is broader).
- symporter/antiporter and transporter are NOT exactly synonymous (transporter is broader).
General database cross references (general dbxrefs) should be used whenever a GO term has an identical
meaning to an object in another database. Some examples of common general dbxrefs in GO:
|Ontology |Database                       |Sample dbxref         |
|Function |Enzyme Commission              |EC:3.5.1.6            |
|         |Transport Protein Database     |TC:2.A.29.10.1        |
|         |Biocatalysis/Biodegradation DB |UM-BBD_enzymeID:e0310 |
|         |Biocatalysis/Biodegradation DB |UM-BBD_pathwayID:dcb  |
|         |MetaCyc Metabolic Pathway DB   |MetaCyc:XXXX-RXN      |
|Process  |MetaCyc Metabolic Pathway DB   |MetaCyc:2ASDEG-PWY    |
|Component|None                           |                      |
The GO.xrf_abbs file is maintained by the BioMOBY project, so to make changes to the file, you need to use their web form.
Understanding relationships in GO: GO ontologies are structured as a directed acyclic graph (DAG), which
means that a child (more specialized) term can have multiple parents (less specialized terms).
This makes GO a powerful system to describe biology, but creates some pitfalls for curators.
Keeping the following guidelines in mind should help you to avoid these problems.
A child term can have one of two different relationships to its parent(s): is_a or part_of.
The same term can have different relationships to different parents; for example, the child 'GO term 3' may be an is_a
child of parent 'GO term 1' and a part_of child of parent 'GO term 2'.
In GO, an is_a relationship means that the term is a subclass of its parent. For example, mitotic cell cycle is_a cell
cycle. This is not to be confused with an 'instance', which is a specific example: clogs are a subclass of (is_a) shoes, while the
shoes I have on my feet now are an instance of shoes. GO, like most ontologies, does not use instances. The is_a
relationship is transitive, which means that if 'GO term A' is a subclass of 'GO term B', and 'GO term B' is a
subclass of 'GO term C', then 'GO term A' is also a subclass of 'GO term C'. E.g.,
Terminal N-glycosylation is a subclass of terminal glycosylation.
Terminal glycosylation is a subclass of protein glycosylation.
Terminal N-glycosylation is a subclass of protein glycosylation.
part_of in GO is more complex. There are 4 basic levels of restriction for a part_of relationship:
1st type has no restrictions - no inferences can be made from the relationship between parent and child other than that
parent may have child as a part, and the child may or may not be a part of the parent.
2nd type, 'necessarily is_part', means that wherever the child exists, it is as part of the parent.
To give a biological example, replication fork is part_of chromosome, so whenever replication fork occurs, it is as
part_of chromosome, but chromosome does not necessarily have part replication fork.
3rd type, 'necessarily has_part', is the exact inverse of type two; wherever the parent exists, it has the child as a part,
but the child is not necessarily part of the parent. For example, nucleus always has_part chromosome, but
chromosome isn't necessarily part_of nucleus.
4th type, is a combination of both two and three, 'has_part' and 'is_part'. An example of this is nuclear membrane is
part_of nucleus. So nucleus always has_part nuclear membrane, and nuclear membrane is always part_of nucleus.
The part_of relationship used in GO is usually type two, 'necessarily is_part'. Note that part_of types 1 and 3 are not
used in GO, as they would violate the true path rule. Like is_a, part_of is transitive, so that if 'GO term A' is part_of
'GO term B', and 'GO term B' is part_of 'GO term C', 'GO term A' is part_of 'GO term C':
E.g., Laminin-1 is part_of basal lamina.
Basal lamina is part_of basement membrane.
Laminin-1 is part_of basement membrane.
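The transitivity of is_a and part_of can be sketched in a few lines of Python. This is only an illustration: the dictionary encoding (child mapped to its direct parents, one dictionary per relationship type) and the function name ancestors are assumptions made for this sketch, not part of any GO tool. The terms are the two examples given above.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical in-memory encoding: child -> list of direct parents.
IS_A = {
    "terminal N-glycosylation": ["terminal glycosylation"],
    "terminal glycosylation": ["protein glycosylation"],
}
PART_OF = {
    "laminin-1": ["basal lamina"],
    "basal lamina": ["basement membrane"],
}

print(sorted(ancestors("terminal N-glycosylation", IS_A)))
print(sorted(ancestors("laminin-1", PART_OF)))
```

Each relationship type is kept in its own dictionary here, so the sketch only chains is_a with is_a and part_of with part_of, which is all that the two examples above claim.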
The ontology editing tool DAG-Edit, from version 1.411 on, allows you to specify the necessity of relationships. The
part_of relationship used in GO, necessarily is_part, would correspond to part_of, [inverse] necessarily true. For more
information, see the DAG-Edit user guide.
For info on how these relationships are represented in the GO flat files, see the GO File Format Guide.
For technical info on the relationships used in GO and OBO, see the OBO relationships ontology.
The true path rule states that "the pathway from a child term all the way up to its top-level parent(s) must always be
true". One of the implications of this is that the type of part_of relationship used in GO, outlined more fully in the
part_of relationship section above, is restricted to those types where a child term must always be part_of its parent.
Often, annotating a new gene product reveals relationships in an ontology that break the true path rule, or species
specificity becomes a problem. In such cases, the ontology must be restructured by adding more nodes and connecting
terms such that any path upwards is true. When a term is added to the ontology, the curator needs to add all of the
parents and children of the new term.
This becomes clear with an example: consider how chitin metabolism is represented in the process ontology. Chitin
metabolism is a part of cuticle synthesis in fly and is also part of cell wall organization in yeast. This was once
represented in the process ontology as:
cuticle synthesis
[i]chitin metabolism
cell wall biosynthesis
[i]chitin metabolism
---[i]chitin biosynthesis
---[i]chitin catabolism
The problem with this organization becomes apparent when one tries to annotate a specific gene product
from one species. A fly chitin synthase could be annotated to chitin biosynthesis, and appear in a query for genes
annotated to cell wall biosynthesis (and its children), which makes no sense because flies don't have cell walls.
This is the revised ontology structure, which ensures that the true path rule is not broken:
chitin metabolism
[i]chitin biosynthesis
[i]chitin catabolism
[i]cuticle chitin metabolism
---[i]cuticle chitin biosynthesis
---[i]cuticle chitin catabolism
[i]cell wall chitin metabolism
---[i]cell wall chitin biosynthesis
---[i]cell wall chitin catabolism
The parent chitin metabolism now has the child terms cuticle chitin metabolism and cell wall chitin
metabolism, with the appropriate catabolism and synthesis terms beneath them. With this structure, all the daughter
terms can be followed up to chitin metabolism, but cuticle chitin metabolism terms do not trace back to cell wall
terms, so all the paths are true. In addition, gene products such as chitin synthase can be annotated to nodes of
appropriate granularity in both yeast and flies, and queries will yield the expected results.
Dependent ontology terms: Some GO terms imply the presence of others. Examples from the process ontology include:
If either X biosynthesis or X catabolism exists, then the parent X metabolism must also exist.
If regulation of X exists, then the process X must also exist. Potentially any process in the ontology can be regulated. Note: X may refer to a phenotype (for
example, cell size in regulation of cell size); in these cases, X should not be added to the ontology.
GO nodes must avoid using species-specific definitions. Nevertheless, many functions, processes and components are not common to all life forms. Our current
convention is to include any term that can apply to more than one taxonomic class of organism. Within the ontologies there are cases
where a word or phrase has different meanings for different organisms. E.g., embryonic development in insects is very different from
embryonic development in mammals. Such terms are distinguished from one another by their definitions and by the sensu
designation (sensu means 'in the sense of'), as in the term embryonic development (sensu Insecta). Nodes should be divided
into sensu sub-trees where the children are or are likely to be different. Using the sensu designation in a term does not exclude
that term from being used to annotate species outside that designation; e.g., a 'sensu Drosophila' term might reasonably be used
to annotate a mosquito gene product.
A GO node should never be more species-specific than any of its children. Child nodes
can be at the same level of species specificity as the parent node(s), or more specific. When adding more species-specific
nodes, curators should make sure that non-species-specific parents exist (or add them if necessary).
E.g., take the process of sporulation. This occurs in both bacteria and fungi, but bacterial sporulation is quite a different
process to fungal sporulation, so we therefore add two children to sporulation, sporulation (sensu Bacteria) and sporulation
(sensu Fungi). If we now want to add a term to represent the assembly of the spore wall in fungi, we cannot just add spore
wall assembly as a direct child of sporulation (sensu Fungi) as such a term could conceivably refer to the assembly of spore
walls in bacteria. Instead, name the child term spore wall assembly (sensu Fungi) to ensure that it is as species-specific as its parent term.
References and Evidence
Every annotation must be attributed to a source, which may be a literature reference, another database or a
computational analysis. The annotation must indicate what kind of evidence is found in the cited source to support
the association between the gene product and the GO term. A simple controlled vocabulary is used to record evidence:
IMP inferred from mutant phenotype
IGI inferred from genetic interaction <db:gene_symbol[allele_symbol]>
IPI inferred from physical interaction [w <db:protein_name>]
ISS inferred from sequence similarity [with <database:sequence_id>]
IDA inferred from direct assay
IEP inferred from expression pattern
IEA inferred from electronic annotation [with <database:id>]
TAS traceable author statement
NAS non-traceable author statement
ND no biological data available
RCA inferred from reviewed computational analysis
IC inferred by curator [from <GO:id>]
Annotation File Format: Collaborating databases export to GO a tab-delimited file, known informally as a "gene association
file", of links between database objects and GO terms. Despite the jargon, the database object may represent a gene or a gene product
(transcript or protein). The columns in the file are described below; a table showing the columns in order, with examples, is available.
The entry in the DB_Object_ID field (see below) of the association file is the identifier for the database object, which may or
may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support
annotations to the gene encoding the protein (gene ID in DB_Object_ID field) or annotations to a protein object (protein ID
in DB_Object_ID field).
The entry in the DB_Object_Symbol field should be a symbol that means something to a biologist, wherever possible (gene
symbol, for example). It is not an ID or an accession number (the second column, DB_Object_ID, provides the unique
identifier), although IDs can be used in DB_Object_Symbol if there is no more biologically meaningful symbol available
(e.g., when an unnamed gene is annotated).
The object type (gene, transcript, protein, protein_structure, or complex) listed in the DB_Object_Type field MUST match
the database entry identified by DB_Object_ID. Note that DB_Object_Type refers to the database entry (i.e. does it represent
a gene, protein, etc.); this column does not reflect anything about the GO term or the evidence on which the annotation is
based. For example, if your database entry represents a gene, then 'gene' goes in the DB_Object_Type column, even if the
annotation is to a component term relevant to the localization of a protein product of the gene. The text entered in the
DB_Object_Name and DB_Object_Symbol can refer to the same database entry (recommended), or to a "broader" entity. For
example, several alternative transcripts from one gene may be annotated separately, each with a unique transcript
DB_Object_ID, but list the same gene symbol in the DB_Object_Symbol column.
The flat file format comprises 15 tab-delimited fields. Blue denotes required fields:
DB: refers to the database contributing the gene_association file; the value must be present in the file of database abbreviations. [Database
abbreviations explanation] This field is mandatory, cardinality 1.
DB_Object_ID: unique identifier in DB for the item being annotated. This field is mandatory, cardinality 1.
DB_Object_Symbol: (unique and valid) symbol to which DB_Object_ID is matched. The ORF name can be used for an otherwise
unnamed gene or protein. If gene products are annotated, the gene product symbol can be used if available, or many gene product annotation entries can
share a gene symbol. This field is mandatory, cardinality 1.
Qualifier: flags that modify the interpretation of an annotation; one (or more) of NOT, contributes_to,
colocalizes_with. This field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g. NOT|contributes_to).
GOid: GO identifier for the term attributed to the DB_Object_ID. This field is mandatory, cardinality 1.
DB:Reference: one or more unique identifiers for a
single source cited as an authority for the attribution of the GOid to the DB_Object_ID. This may be a literature reference or a database record.
The syntax is DB:accession_number. Note that only one reference can be cited on a single line in the gene_association file. If a reference has
identifiers in more than one database, multiple identifiers for that reference can be included on a single line. For example, if the reference is a
published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism
database. Note that if the model organism database has an identifier for the reference, that identifier should always be included, even if a PubMed
ID is also used. This field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. SGD:8789|PMID:2676709).
Evidence: one of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA. This field is mandatory, cardinality 1.
With (or) From: one of DB:gene_symbol; DB:gene_symbol[allele_symbol]; DB:gene_id; DB:protein_name; DB:sequence_id; GO:GO_id.
This field is not mandatory (except in the case of IC evidence code), cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g.
CGSC:pabA|CGSC:pabB) . Note: This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI,
IPI, ISS). For example, it can identify another gene product to which the annotated gene product is similar (ISS) or interacts with (IPI). More
information on the meaning of 'with/from' column entries is available in the evidence documentation entries for the relevant codes. Cardinality =
0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical
interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0
should link to an explanation of why there is no entry in 'with.' Cardinality may be >1 for any of the evidence codes that use 'with'; for IPI and IGI
cardinality >1 has a special meaning. For cardinality >1 use a pipe to separate entries (e.g. FB:FBgn1111111|FB:FBgn2222222). Note that a
gene ID may be used in the 'with' column for a IPI annotation, or for an ISS annotation based on amino acid sequence or protein structure
similarity, if the database does not have identifiers for individual gene products. A gene ID may also be used if the cited reference provides
enough information to determine which gene ID should be used, but not enough to establish which protein ID is correct. 'GO:GO_id' is used only
when the evidence code is 'IC', and refers to the GO term(s) used as the basis of a curator inference. In these cases the entry in the 'DB:Reference'
column will be that used to assign the GO term(s) from which the inference is made. This field is mandatory for evidence code IC. The ID is
usually an identifier for an individual entry in a database (such as a sequence ID, gene ID, GO ID, etc.). Identifiers from the Center for Biological
Sequence Analysis (CBS), however, represent tools used to find homology or sequence similarity; these identifiers can be used in the 'with'
column for ISS annotations. The 'with' column may not be used with evidence codes IDA, TAS, NAS, ND.
Aspect one of P (biological process), F (molecular function) or C (cellular component) this field is
mandatory; cardinality 1
DB_Object_Name name of gene or gene product. not mandatory, cardinality 0, 1 [white space allowed]
Synonym Gene_symbol [or other text]. Strongly recommend gene synonyms are included in the gene
association file, as this aids the searching of GO. this field is not mandatory, cardinality 0, 1, >1 [white
space allowed]; for cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene)
DB_Object_Type what kind of thing is being annotated one of gene, transcript, protein,
protein_structure, complex this field is mandatory, cardinality 1
Taxon taxonomic identifier(s). For cardinality 1, the ID of the species encoding the gene product.
For cardinality 2, to be used only in conjunction with terms that have the term 'interaction between
organisms' as an ancestor. The first taxon id should be that of the organism encoding the gene or gene
product, and the taxon id after the pipe should be that of the other organism in the interaction.
mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000)
Date: on which the annotation was made; format is YYYYMMDD this field is mandatory, cardinality 1
Assigned_by The database which made the annotation; one of the values in the table of database
abbreviations. [Database abbreviations explanation] Used for tracking the source of an individual
annotation. The default value is the value entered in column 1 (DB). The value will differ from column 1 for any
annotation that is made by one database and incorporated into another. This field is mandatory, cardinality 1
Note that several fields contain database cross-reference (dbxrefs) in the format dbname:dbaccession. The fields are:
GOid (where dbname is always GO), DB:Reference, With, Taxon (where dbname is always taxon). For GO ids, do not
repeat the 'GO:' prefix (i.e. always use GO:0000000, not GO:GO:0000000)
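As a rough illustration of the file format just described, the sketch below splits one line of a gene association file into its 15 fields, in the order the fields are listed in this section. The function name parse_association_line, the dictionary layout and the sample line are assumptions made for this sketch; the sample values are stitched together from the examples in the text and do not describe a real annotation.

```python
# Column order as listed in this section (15 tab-delimited fields).
COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GOid",
    "DB:Reference", "Evidence", "With", "Aspect", "DB_Object_Name",
    "Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
]
# Fields whose cardinality may exceed 1 (entries separated by a pipe).
PIPE_SEPARATED = {"Qualifier", "DB:Reference", "With", "Synonym", "Taxon"}
# Fields described as "not mandatory".
OPTIONAL = {"Qualifier", "With", "DB_Object_Name", "Synonym"}
EVIDENCE_CODES = {"IMP", "IGI", "IPI", "ISS", "IDA", "IEP", "IEA",
                  "TAS", "NAS", "ND", "IC", "RCA"}

def parse_association_line(line):
    """Split one annotation line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    record = {}
    for name, value in zip(COLUMNS, fields):
        if name in PIPE_SEPARATED:
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    for name in COLUMNS:
        if name not in OPTIONAL and not record[name]:
            raise ValueError(f"mandatory field {name} is empty")
    if record["Evidence"] not in EVIDENCE_CODES:
        raise ValueError(f"unknown evidence code {record['Evidence']}")
    if record["Evidence"] == "IC" and not record["With"]:
        raise ValueError("IC annotations need a With/From entry")
    return record

# Made-up sample line, for illustration only.
SAMPLE = "\t".join([
    "SGD", "S0000000", "ACT1", "", "GO:0000000", "SGD:8789|PMID:2676709",
    "TAS", "", "C", "actin", "YFL039C|ABY1|END7|actin gene", "gene",
    "taxon:1", "20020903", "SGD",
])
record = parse_association_line(SAMPLE)
print(record["DB:Reference"], record["Synonym"])
```

Only the checks stated in the text are enforced (field count, mandatory fields, evidence vocabulary, With/From required for IC); a real loader would need more.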
Computational Annotation Methods
This section includes descriptions of automated annotation methods used by participating databases
(descriptions have been provided by each group listed). EBI | MGI | TIGR
EBI GOA Electronic Annotation The large-scale assignment of GO terms to UniProt Knowledgebase entries
involves electronic techniques. This strategy exploits existing properties within database entries including
keywords and Enzyme Commission (EC) numbers and cross-reference to InterPro (a database of protein
motifs) which are manually mapped to GO. SWISS-PROT keyword and InterPro to GO mappings are
maintained in-house and shared on the GO home page for local database updates. Electronically combining
these mappings with a table of matching Uniprot Knowledgebase entries generates a table of associations. For
each GOA association, an evidence code, which summarizes how the association is made, is provided.
Associations made electronically are labeled as 'inferred from electronic annotation' (IEA). Evelyn
Camon, 2002-09-03
MGI Electronic Annotation Methods Every object in the MGI databases (markers, seqids, references, etc.)
has an MGI: accession ID. See details in GO
TIGR ISS Annotation (Arabidopsis, T. brucei) For TIGR Arabidopsis or T. brucei annotations using
'Inferred from Sequence Similarity' (ISS) evidence, the reference is usually 'TIGR_Ath1:annotation' for
Arabidopsis (author: TIGR Arabidopsis annotation team) and TIGR_Tba1:annotation for T. brucei (author:
TIGR Trypanosoma brucei annotation team), which are defined as follows:
name: TIGR annotation based upon multiple sources of similarity evidence
description: TIGR_Ath1:annotation or TIGR_Tba1:annotation denotes a curator's interpretation of a
combination of evidence. Our internal software tools present us with a great deal of evidence based domains,
sequence similarities, signal sequences, paralogous proteins, etc. The curator interprets the body of evidence to
make a decision about a GO assignment when an external reference is not available. The curator places one or
more accessions that informed the decision in the "with" field.
What this says is that we have used many sequence similarity hits, etc., to make our decision. However, we
choose only 1-3 pieces of information as "with" information, as it is not practical to enter and submit many
entries for each annotation. We also have internal calculations of paralogy and new domains we are
identifying which have not yet been published, but which help inform our decisions.
Clustering Methods
Clustering is partitioning a data set into mutually exclusive and collectively exhaustive subsets (components), such that
each point is:
• very similar to (close to) the other points in its component, and
• very dissimilar to (far from) the points in the other components.
A Categorization of Major Clustering Methods
• Partitioning methods (K-means | K-medoids...)
• Hierarchical methods (Agnes, Diana...)
• Density-based methods
• Grid-based methods
• Model-based methods
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps. (This assumes the partitioning
criterion is: maximize intra-cluster similarity and minimize inter-cluster
similarity. Of course, a heuristic is used; the method isn’t really an optimization.)
1. Partition objects into k nonempty subsets (or pick k initial means).
2. Compute the mean (center) or centroid of each cluster of the current
partition (if one started with k means, then this step is already done the first time through).
centroid ~ point that minimizes the sum of dissimilarities from the mean, or the
sum of the squared errors from the mean.
3. Assign each object to the cluster with the most similar (closest) center.
4. Go back to Step 2. Stop when the new set of means doesn’t change (or when some other stopping
condition is met).
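The four steps can be sketched in plain Python as follows. This is a minimal sketch, assuming Euclidean distance as the dissimilarity and the "pick k initial means" variant of Step 1; the function names and the six sample points are made up for illustration.

```python
import math

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, means, max_iter=100):
    """Alternate assignment and mean update until the means stop changing."""
    means = [tuple(m) for m in means]
    for _ in range(max_iter):
        # Assign each object to the cluster with the closest center.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Recompute the means; an empty cluster keeps its old mean.
        new_means = [mean(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:  # stopping condition: no change
            break
        means = new_means
    return means, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
means, clusters = kmeans(points, means=[(1, 1), (2, 1)])
print(means)
print(clusters)
```

Even with both initial means placed in the same group of points, the two means drift apart and each group of three points ends up in its own cluster. A different starting partition can give a different result, which is the heuristic nature noted above.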
k-Means Clustering animation
[Figure: four scatter-plot panels labelled Step 1, Step 2, Step 3 and Step 4; both axes run from 0 to 10.]
Strength: relatively efficient, O(tkn), where n is # objects, k is # clusters and t is # iterations.
Normally, k, t << n.
Weakness: applicable only when a mean is defined (e.g., in a vector space). Need to
specify k, the number of clusters, in advance. It is sensitive to noisy data and outliers.
The K-Medoids Clustering Method
• Find representative objects, called medoids (a medoid must be an actual object
in the cluster, whereas the mean seldom is).
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids
– iteratively replaces one of the medoids by a non-medoid
– if the swap improves the aggregate similarity measure, retains it;
this is done over all medoid-nonmedoid pairs
– PAM works for small data sets but does not scale to large data sets
• CLARA (Clustering LARge Applications) (Kaufman & Rousseeuw, 1990):
draws sub-samples, then applies PAM to them
• CLARANS (Clustering Large Applications based on RANdom
Search) (Ng & Han, 1994): randomizes the sampling
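The PAM swap loop can be sketched as below. Assumptions in this sketch: the aggregate measure is the total Euclidean distance from each object to its nearest medoid (lower is better), a swap is retained as soon as it lowers that total, and the function names and sample points are made up. It is a simplified greedy variant, not the published algorithm in every detail.

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, medoids):
    """Try every medoid/non-medoid swap; keep any swap that lowers the cost."""
    medoids = list(medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in product(range(len(medoids)), points):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:
                medoids, best, improved = trial, cost, True
    return medoids, best

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
medoids, cost = pam(points, [(1, 1), (1, 2)])
print(medoids, cost)
```

Every pass evaluates all medoid/non-medoid pairs against all objects, which is why the approach only suits small data sets, as noted above.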
Hierarchical Clustering Methods: AGNES (Agglomerative Nesting)
• Introduced in Kaufman and Rousseeuw (1990)
• Uses the Single-Link method (distance between two sets is the minimum
pairwise distance)
• Other options are complete link (distance is the maximum pairwise distance), average link, ...
• Merge the nodes that are most similar
• Eventually all nodes belong to the same cluster
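A single-link agglomerative pass in the spirit of AGNES can be sketched as follows. This is a naive illustration under stated assumptions: Euclidean distance, clusters held as tuples of points, and a made-up agnes function that returns the sequence of merges rather than a full dendrogram structure.

```python
import math

def single_link(a, b):
    """Single-link distance: the minimum pairwise distance between two sets."""
    return min(math.dist(p, q) for p in a for q in b)

def agnes(points):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

points = [(0, 0), (0, 1), (5, 0), (5, 3), (9, 9)]
merges = agnes(points)
for left, right in merges:
    print(left, "+", right)
```

With n objects there are always n - 1 merges, and the last one puts every node in the same cluster. Swapping min for max inside single_link would give the complete-link variant mentioned above.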
[Figure: three scatter-plot panels illustrating AGNES; both axes run from 0 to 10.]
DIANA (Divisive Analysis)
• Introduced in Kaufman and Rousseeuw (1990)
• Inverse order of AGNES: initially all objects are in one cluster; then it
is split according to some criterion (e.g., again maximize some aggregate
measure of pairwise dissimilarity)
• Eventually each node forms a cluster on its own
[Figure: three scatter-plot panels illustrating DIANA; both axes run from 0 to 10.]
Contrasting Clustering Techniques
• Partitioning algorithms: partition a dataset into k clusters, e.g., k=3.
• Hierarchical algorithms: create a hierarchical decomposition of ever-finer partitions,
e.g., top down (divisive) or bottom up (agglomerative).
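Hierarchical Clustering
[Figure: dendrogram over the objects a, b, c, d, e, with a step axis from Step 0 to Step 4. Read bottom up (agglomerative), the singletons merge into the clusters ab, de and cde, and finally abcde; read top down (divisive), abcde is split back into the singletons.]
abcde
  ab
    a
    b
  cde
    c
    de
      d
      e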
In either case, one gets a nice dendrogram in which any maximal antichain (a maximal set of nodes in which no 2 are linked) is a clustering (partition); a dendrogram offers many such clusterings.
Hierarchical Clustering (Cont.)
But the “horizontal” anti-chains are the clusterings resulting from the
top down (or bottom up) method(s).
Data Mining Summary
Data Mining on a given table of data includes
Association Rule Mining (ARM) on Bipartite Relationships
Clustering
Partitioning methods (K-means | K-medoids...), Hierarchical methods (Agnes, Diana...),
Model-based methods (K-Means, K-Medoids..), ....
Classification
Decision Tree Induction, Bayesian, Neural Network, k-Nearest-Neighbor, ...
But most data mining is done on a database, not just one table;
that is, one must often first apply the appropriate
SQL query to the database to get the table to be data mined.
The next slides discuss vertical data methods for doing that.
You may wish to skip this material if not interested in the topic.
Vertical Select-Project-Join (SPJ) Queries
A Select-Project-Join query has joins, selections and projections.
Typically there is a central fact relation to which several dimension relations are to be joined (standard STAR DW)
E.g., Student(S), Course(C), Enrol(E) STAR DB below (bit encoding is shown in reduced font italics for certain attributes)
S|s____|name_|gen| C|c____|name|st|term|
|0 000|CLAY |M 0| |0 000|BI |ND|F 0|
|1 001|THAIS|M 0| |1 001|DB |ND|S 1|
|2 010|GOOD |F 1| |2 010|DM |NJ|S 1|
|3 011|BAID |F 1| |3 011|DS |ND|F 0|
|4 100|PERRY|M 0| |4 100|SE |NJ|S 1|
|5 101|JOAN |F 1| |5 101|AI |ND|F 0|
E|s____|c____|grade|
 |0 000|1 001|B 10|
 |0 000|0 000|A 11|
 |3 011|1 001|A 11|
 |3 011|3 011|D 00|
 |1 001|3 011|D 00|
 |1 001|0 000|B 10|
 |2 010|2 010|B 10|
 |2 010|3 011|A 11|
 |4 100|4 100|B 10|
 |5 101|5 101|B 10|
Vertical bit sliced (uncompressed) attrs stored as:
S.s2
0
0
1
1
0
0
S.s1
0
0
0
0
1
1
S.s0
0
1
0
1
0
1
S.g
0
0
0
1
1
1
C.c2
0
0
1
1
0
0
C.c1
0
0
0
0
1
1
C.c0
0
1
0
1
0
1
Vertical (un-bit-sliced) attributes are stored:
C.t
0
1
1
0
1
0
E.s2
0
0
0
0
0
0
0
0
1
1
S.name
|CLAY |
|THAIS|
|GOOD |
|BAID |
|PERRY|
|JOAN |
E.s1
0
0
0
0
1
1
1
1
0
0
E.s0
0
0
1
1
1
1
0
0
0
1
C.name
|BI |
|DB |
|DM |
|DS |
|SE |
|AI |
E.c2
0
0
0
0
0
0
0
0
1
1
E.c1
0
0
1
0
1
1
1
0
0
0
C.st
|ND|
|ND|
|NJ|
|ND|
|NJ|
|ND|
E.c0
1
0
1
0
1
1
0
1
0
1
E.g1
1
1
0
1
1
0
1
1
1
1
E.g0
0
1
0
0
1
0
0
1
0
0
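The idea behind the vertical bit-sliced layout listed above can be sketched as follows: each numeric (or bit-encoded) attribute is stored as one bit column per bit position. The function names are made up for this sketch. The input values are the student keys 0 to 5 and the grade codes of the Enrol relation (B=10, A=11, D=00) as they appear in the tables above; slices are returned most significant bit first.

```python
def bit_slices(values, width):
    """Return `width` vertical bit columns, most significant bit first."""
    return [[(v >> b) & 1 for v in values] for b in range(width - 1, -1, -1)]

def reassemble(slices):
    """Inverse of bit_slices: rebuild the horizontal values row by row."""
    values = [0] * len(slices[0])
    for bits in slices:
        values = [(v << 1) | b for v, b in zip(values, bits)]
    return values

student_keys = [0, 1, 2, 3, 4, 5]                        # S.s
grade_codes = [0b10, 0b11, 0b11, 0b00, 0b00,
               0b10, 0b10, 0b11, 0b10, 0b10]             # E.grade

s2, s1, s0 = bit_slices(student_keys, 3)
g1, g0 = bit_slices(grade_codes, 2)
print("S.s2", s2)
print("S.s1", s1)
print("S.s0", s0)
print("E.g1", g1)
print("E.g0", g0)
```

Categorical attributes such as S.name are not sliced in this way; as the listing shows, they are simply stored as a single vertical column.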
Vertical preliminary Select-Project-Join Query Processing (SPJ)
In the SCORE database (Students, Courses, Offerings, Rooms, Enrollments), numeric attributes are represented
vertically as P-trees (not compressed).
SELECT S.n, C.n
FROM S, C, O, R, E
Categorical are projected to a 1 column
R:rR.r
cap
R.r
R.c
R.c
10
10
WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r
vertical file
|0 0
00|30
0
1
11|
1
& S.g=M & C.r=2
& E.g=A
& R.c=20;
|1 0
01|20
1
1
10|
0
decimal binary.
S:sS.s
n
S.s
S.g
2S.s
1 S.n
0 gen
|0 0
000|A|M|
00 A M
|1 0
001|T|M|
01 T M
|2 1
100|S|F|
00 S F
|3 1
111|B|F|
01 B F
|4 0
010|C|M|
10 C M
|5 0
011|J|F|
11 J F
C.c
C.r
C:cC.c
n0 cred
C.r
1 C.n
10
0 B
|0 0
00|B|1
01|
01
1 D
|1 0
01|D|3
11|
11
0 M
|2 1
10|M|3
11|
11
1 S
|3 1
11|S|2
10|
10
|2 1
10|30
0
1
11|
1
|3 1
11|10
1
0
01|
1
E:sE.s
E.s
E.s
21o
0
|0 0
000|1
00
|0 0
000|0
00
|3 0
011|1
11
|3 0
011|3
11
|1 0
001|3
01
|1 0
001|0
01
|2 0
010|2
10
|2 0
010|7
10
|4 1
100|4
00
|5 1
101|5
01
E.o
E.o
E.o
E.g
E.g
2 1 grade
0
1 0
0
001|2
01
1
10|
0
0
000|3
00
1
11|
1
0
001|3
01
1
11|
1
0
011|0
11
0
00|
0
0
011|0
11
0
00|
0
0
000|2
00
1
10|
0
0
010|2
10
1
10|
0
1
111|3
11
1
11|
1
1
100|2
00
1
10|
0
1
101|2
01
1
10|
0
O :oO.o
O.o
1 0
|0 000|0
00
|1 001|0
01
|2 010|1
10
|3
O.o 011|1
11
2
|4 100|2
00
0 101|2
|5
01
0
|6 110|2
10
0 111|3
|7
11
0
1
1
1
1
cO.c
O.c
10
00|0
0
0
00|1
0
0
01|0
0
1
01|1
0
1
10|0
1
0
10|2
1
0
10|3
1
0
11|2
1
1
r10
O.r
O.r
01|
0
1
01|
0
1
00|
0
0
01|
0
1
00|
0
0
10|
1
0
11|
1
1
10|
1
0
S.s2
0
0
1
1
0
0
S.s1
0
0
0
0
1
1
S.s0
0
1
0
1
0
1
S.n
A
T
S
B
C
J
C.c1
0
0
1
1
C.c1
0
1
0
1
C.n
B
D
M
S
C.r1
0
1
1
1
S.g
SM
1
1
0
0
1
0
C.r’22
C.r
1
0
1
0
1
0
0
1
O.o2
0
0
0
0
1
1
1
1
O.o1
0
0
1
1
0
0
1
1
O.o0
0
1
0
1
0
1
0
1
O.c1
0
0
0
0
1
1
1
1
O.c0
0
0
1
1
0
0
0
1
O.r1
0
0
0
0
0
1
1
1
O.r0
1
1
0
1
0
0
1
0
For the selections S.g=M (1b), C.r=2 (10b), E.g=A (11b) and R.c=20 (10b),
create the selection masks using ANDs and COMPLEMENTS.
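How a selection mask is built from ANDs and COMPLEMENTS can be sketched as below: for each bit of the constant, AND in the bit column itself where the constant has a 1 and its complement where it has a 0. The helper names are made up for this sketch; the bit columns are derived from the course credits (1, 3, 3, 2) and the room capacity codes (11, 10, 11, 01) in the SCORE tables.

```python
def bit_and(a, b):
    return [x & y for x, y in zip(a, b)]

def bit_not(a):
    return [1 - x for x in a]

def selection_mask(slices, constant_bits):
    """Mask of rows whose attribute equals the constant (both MSB first)."""
    mask = [1] * len(slices[0])
    for column, wanted in zip(slices, constant_bits):
        mask = bit_and(mask, column if wanted else bit_not(column))
    return mask

# C.r (credits 1, 3, 3, 2 -> 01, 11, 11, 10), sliced vertically.
C_r1 = [0, 1, 1, 1]
C_r0 = [1, 1, 1, 0]
Cr2 = selection_mask([C_r1, C_r0], [1, 0])    # C.r = 2 = 10b

# R.c (capacity codes 11, 10, 11, 01), sliced vertically.
R_c1 = [1, 1, 1, 0]
R_c0 = [1, 0, 1, 1]
Rc20 = selection_mask([R_c1, R_c0], [1, 0])   # R.c = 20 = 10b

print("Cr2 ", Cr2)
print("Rc20", Rc20)
```

Each mask has one bit per tuple of its relation; the masks are then used to zero out the non-qualifying rows.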
R.r1 R.r0
0 0
0 1
1 0
1 1
R.c1 R.c’
R.c00
1 0
1
1 1
0
1 0
1
0 0
1
EgA
0
1
Cr2 1
0
0
0
0
0
0
1
0
1
0
0
Rc20
0
1
0
0
E.s2
0
0
0
0
0
0
0
0
1
1
E.s1
0
0
1
1
0
0
1
1
0
0
E.s0
0
0
1
1
1
1
0
0
0
1
E.o2
0
0
0
0
0
0
1
0
1
1
E.o1
0
0
0
1
1
0
1
1
0
0
E.o0
1
0
1
1
1
0
0
1
0
1
E.s2
0
0
0
0
0
0
0
0
0
0
E.s1
0
0
1
0
0
0
0
1
0
0
E.s0
0
0
1
0
0
0
0
0
0
0
E.o2
0
0
0
0
0
0
0
1
0
0
E.o1
0
0
0
0
0
0
0
1
0
0
E.o0
0
0
1
0
0
0
0
1
0
0
Apply these selection masks (zero out numeric values, blank out the others).
S.s2
0
0
0
0
0
0
S.s1
0
0
0
0
1
0
S.s0
0
1
0
0
0
0
C.c1
0
0
0
1
C.c0 C.n
0
0
0
S
1
S.n
A
T
C
O.o2
0
0
0
0
1
1
1
1
O.o1
0
0
1
1
0
0
1
1
0
0
1
0
1
0
1
0
1
O.c1
0
0
0
0
1
1
1
1
O.c0
0
0
1
1
0
0
0
1
O.r1
0
0
0
0
0
1
1
1
O.r0
1
1
0
1
0
0
1
0
R.r1 R.r0
0 0
0 1
0 0
0 0
SELECT S.n, C.n
FROM S, C, O, R, E
WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r
& S.g=M & C.r=2
& E.g=A
& R.c=20;
E.g1
1
1
1
0
0
1
1
1
1
1
E.g0
0
1
1
0
0
0
0
1
0
0
S.s2
0
0
0
0
0
0
S.s1
0
0
0
0
1
0
S.s0
0
1
0
0
0
0
C.c1
0
0
0
1
C.c0 C.n
0
0
0
S
1
S.n
A
T
C
O.o2
0
0
0
0
1
1
1
1
O.o1
0
0
1
1
0
0
1
1
O.o0
0
1
0
1
0
1
0
1
O.c1
0
0
0
0
1
1
1
1
O.c0
0
0
1
1
0
0
0
1
O.r1
0
0
0
0
0
1
1
1
O’.r00
O.r
1
0
1
0
0
1
1
0
0
1
0
1
1
0
0
1
R.r1 R.r0
0 0
0 1
0 0
0 0
Rc20
0
1
0
0
E.s2
0
0
0
0
0
0
0
0
0
0
E.s1
0
0
1
0
0
0
0
1
0
0
E.s0
0
0
1
0
0
0
0
0
0
0
E.o2
0
0
0
0
0
0
0
1
0
0
E.o1
0
0
0
0
0
0
0
1
0
0
E.o0
0
0
1
0
0
0
0
1
0
0
For the joins S.s=E.s, C.c=O.c, O.o=E.o and O.r=R.r, one approach is to follow an indexed-nested-loop-like method
(noting that attribute P-trees ARE an index for that attribute).
The join O.r=R.r is simply part of a selection on O (R doesn’t contribute output nor participate in any further operations).
Use the Rc20-masked R as the outer relation.
Use O as the indexed inner relation to produce that O-selection mask.
Get 1st R.r value, 01b (there's only 1) Mask the O tuples:
PO.r1^P’O.r0
O.o2
0
0
0
0
0
1
0
1
O.o1
0
0
0
0
0
0
0
1
O.o0
0
0
0
0
0
1
0
1
O.c1
0
0
0
0
0
1
0
1
O.c0
0
0
0
0
0
0
0
1
OM
0
0
0
0
0
1
0
1
This is the only R.r value (if there were more,
one would do the same for each, then OR
those masks to get the final O-mask).
Next, we apply the O-mask, OM to O
SELECT S.n, C.n
FROM S, C, O, R, E
WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r
& S.g=M & C.r=2
& E.g=A
& R.c=20;
C.c1
0
0
0
1
C.c0 C.n
0
0
0
1
S
O.o2
0
0
0
0
0
1
0
1
For the final 3 joins C.c=O.c
O.o1
0
0
0
0
0
0
0
1
O.o0
0
0
0
0
0
1
0
1
O.c1
0
0
0
0
0
1
0
1
O.c0
0
0
0
0
0
0
0
1
E.s2
0
0
0
0
0
0
0
0
0
0
E.s1
0
0
1
0
0
0
0
1
0
0
E.s0
0
0
1
0
0
0
0
0
0
0
E.o2
0
0
0
0
0
0
0
1
0
0
E.o1
0
0
0
0
0
0
0
1
0
0
S’.s22 S.s1
S.s
0 0
1
0 0
1
0 0
0 0
0 1
1
0 0
E.o0
0
0
1
0
0
0
0
1
0
0
S’.s00 S.n
S.s
0 A
1
1 T
0
0
0
0 C
1
0
O.o=E.o E.s=S.s the same indexed nested loop like method can be used.
Get 1st masked C.c value, 11b Mask corresponding O tuples: PO.c1^PO.c0
OM
0
0
0
0
0
0
0
1
Get 1st masked O.o value, 111b Mask corresponding E tuples: PE.o2^PE.o1^PE.o0
Get 1st masked E.s value, 010b
EM
0
0
0
0
0
0
0
1
0
0
Mask corresponding S tuples: P’S.s2^PS.s1^P’S.s0
SM
0
0
0
0
1
0
Get S.n-value(s), C, pair it with C.n-value(s), S, output concatenation, C.n S.n
S C
There was just one masked tuple at each stage in this example. In general, one would loop through the
masked portion of the extant domain at each level (thus, Indexed Horizontal Nested Loop or IHNL).
SELECT S.n, C.n
FROM S, C, O, R, E
WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r
& S.g=M & C.r=2
& E.g=A
& R.c=20;
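One step of the indexed-nested-loop-like join can be sketched as follows: loop over the key values that survive in the masked outer relation, build an equality mask on the inner relation's bit columns for each value, OR those masks together, and AND the result with the inner relation's own selection mask. All function names are made up, and the small room/offering data is hypothetical toy data, NOT the SCORE tables from the slides.

```python
def equals_mask(slices, value):
    """Rows whose bit-sliced key equals `value` (slices are MSB first)."""
    width = len(slices)
    mask = [1] * len(slices[0])
    for position, bits in enumerate(slices):
        want = (value >> (width - 1 - position)) & 1
        mask = [m & (b if want else 1 - b) for m, b in zip(mask, bits)]
    return mask

def key_values(slices, mask):
    """Distinct key values among the rows selected by `mask`."""
    values = set()
    for row, keep in enumerate(mask):
        if keep:
            v = 0
            for bits in slices:
                v = (v << 1) | bits[row]
            values.add(v)
    return values

def join_mask(outer_slices, outer_mask, inner_slices, inner_mask):
    """Mask of inner rows that match some selected outer key."""
    result = [0] * len(inner_mask)
    for value in key_values(outer_slices, outer_mask):
        hit = equals_mask(inner_slices, value)
        result = [r | h for r, h in zip(result, hit)]
    return [r & m for r, m in zip(result, inner_mask)]

# Hypothetical toy data: 4 rooms with keys 0..3, of which rooms 1 and 2
# passed the selection; 5 offerings held in rooms 1, 2, 1, 3, 2, of which
# the last was already masked out by an earlier step.
room_r = [[0, 0, 1, 1], [0, 1, 0, 1]]
room_mask = [0, 1, 1, 0]
offering_r = [[0, 1, 0, 1, 1], [1, 0, 1, 1, 0]]
offering_mask = [1, 1, 1, 1, 0]

OM = join_mask(room_r, room_mask, offering_r, offering_mask)
print(OM)
```

Chaining such steps (room to offering, course to offering, offering to enrollment, enrollment to student) would give the loop through the masked portion of the domain at each level that the text describes.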
Vertical Select-Project-Join-Classification Query
Given the previous SCORE training database (not presented as just one training table),
predict which course a male student will register for, given that he got an A in a previous course
held in a room with a capacity of 20.
This is a matter of applying the previous complex SPJ query first to get the pertinent
training table and then classifying the above unclassified sample
(e.g., using 1-nearest-neighbour classification).
The result of the SPJ is the single-row training set (S,C), and so the prediction is
course=C.