3.6 Use a flow chart to summarize the following procedures for attribute subset selection:
a- stepwise forward selection
b- stepwise backward elimination
c- a combination of forward selection and backward elimination
solution
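a- Stepwise forward selection:
-Start with an empty set of attributes.
-The best of the original attributes is picked first.
-Then the next best of the remaining attributes is added to the set, ...
b- Stepwise backward elimination:
-Start with the full set of attributes.
-Repeatedly eliminate the worst remaining attribute.
c- Combination of forward selection and backward elimination:
-At each step, select the best attribute and remove the worst from among the remaining attributes.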
Decision tree induction:
-A tree is constructed from the given data.
-Set of attributes appearing in the tree form the reduced attributes subset.
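The procedures in (a) and (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `score` function, which rates a candidate attribute subset, is hypothetical (in practice it might be a statistical significance test or the accuracy of a classifier), and the attribute names A1 to A6 and the toy scoring rule are invented for the example.

```python
from typing import Callable, FrozenSet, Iterable, Set

Score = Callable[[FrozenSet[str]], float]


def forward_selection(attributes: Iterable[str], score: Score) -> Set[str]:
    """Start with an empty set and keep adding the best remaining attribute."""
    remaining = set(attributes)
    selected: Set[str] = set()
    best = score(frozenset(selected))
    while remaining:
        # Pick the remaining attribute that gives the highest score when added.
        candidate = max(sorted(remaining),
                        key=lambda a: score(frozenset(selected | {a})))
        candidate_score = score(frozenset(selected | {candidate}))
        if candidate_score <= best:
            break  # no remaining attribute improves the subset, so stop
        selected.add(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


def backward_elimination(attributes: Iterable[str], score: Score) -> Set[str]:
    """Start with all attributes and keep removing the worst one."""
    selected = set(attributes)
    best = score(frozenset(selected))
    while selected:
        # The worst attribute is the one whose removal gives the highest score.
        candidate = max(sorted(selected),
                        key=lambda a: score(frozenset(selected - {a})))
        candidate_score = score(frozenset(selected - {candidate}))
        if candidate_score < best:
            break  # every possible removal lowers the score, so stop
        selected.remove(candidate)
        best = candidate_score
    return selected


# Toy scoring rule (an assumption): reward useful attributes, penalize the rest.
USEFUL = frozenset({"A1", "A4", "A6"})
ATTRS = ["A1", "A2", "A3", "A4", "A5", "A6"]


def toy_score(subset: FrozenSet[str]) -> float:
    return len(subset & USEFUL) - 0.5 * len(subset - USEFUL)


print(sorted(forward_selection(ATTRS, toy_score)))
print(sorted(backward_elimination(ATTRS, toy_score)))
```

With this toy score both procedures end at the same reduced subset, although in general they can stop at different subsets because each is a greedy search.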
4.1 List and describe the five primitives for specifying a data mining task.
solution
a- Task-Relevant Data: (Database portion to be investigated)
-Database or data warehouse name.
-Database tables or data warehouse cubes.
-Conditions for data selection.
-Relevant attributes or dimensions.
-Data grouping criteria.
b- Types of knowledge to be mined: (Data mining functions to be performed)
-Characterization
-Discrimination
-Association
-Classification/prediction
-Clustering
-Outlier analysis
-Other data mining tasks
c- Background Knowledge: (Knowledge about the domain to be mined)
-Concept hierarchies:
-Schema hierarchy: a total or partial ordering among attributes in the database schema.
E.g., street < city < province_or_state < country
-Set-grouping hierarchy: organize values for a given attribute or dimension into groups
of constants or range values.
E.g., {20-39} = young, {40-59} = middle_aged
-Operation-derived hierarchy: based on operations specified by users, experts, or the
data mining system.
E.g., email address: login-name < department < university < country
-Rule-based hierarchy: occurs when a hierarchy is defined by a set of rules.
E.g., low_profit_margin(X) <= price(X, P1) and cost(X, P2) and ((P1 - P2) < $50)
More examples:
-To specify which concept hierarchies to use:
use hierarchy <hierarchy> for <attribute_or_dimension>
-Different syntax is used to define different types of hierarchies:

Schema hierarchies
define hierarchy time_hierarchy on date
as [day, month, quarter, year]

Set-grouping hierarchies
define hierarchy age_hierarchy for age on customer
as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior

Operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)

Rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) <= $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) <= $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
d- Measurements of Pattern Interestingness: (to evaluate the discovered patterns)
-Simplicity
e.g., (association) rule length, (decision) tree size.
-Certainty
e.g., confidence, P(A|B) = n(A and B) / n(B), classification reliability or accuracy,
certainty factor, rule strength, rule quality, discriminating weight, etc.
-Utility
potential usefulness, e.g., support (association), noise threshold (description).
-Novelty
not previously known, surprising (used to remove redundant rules).
e- Visualization of Discovered Patterns: (How to display the discovered patterns)
-Users with different backgrounds, to identify patterns of interest, may require different
forms of representation.
E.g., rules, tables, crosstabs, pie or bar charts, etc.
-Concept hierarchy is also important to visualize the discovered patterns.
a) Discovered knowledge might be more understandable when represented at a
high level of abstraction.
b) Interactive drill up/down, pivoting, slicing and dicing provide different
perspectives on the data.
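The certainty and utility measures listed under (d) can be made concrete with a short Python sketch. It computes support and confidence for an association rule, using the same definition of confidence as above, n(A and B) / n(B) with B as the rule's antecedent. The transactions and item names are invented for illustration.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """n(antecedent and consequent) / n(antecedent); 0.0 if the antecedent never occurs."""
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    if n_antecedent == 0:
        return 0.0
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return n_both / n_antecedent


# Hypothetical market-basket data.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support(baskets, frozenset({"bread", "milk"}))
rule_confidence = confidence(baskets, frozenset({"bread"}), frozenset({"milk"}))
print(rule_support, rule_confidence)
```

A rule would then be kept only if both values exceed user-chosen thresholds, which is one simple way to apply the utility and certainty measures together.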
4.2 Describe why concept hierarchies are useful in data mining.
solution
They are useful in data mining because they allow the discovery of knowledge at multiple levels
of abstraction and provide the structure on which data can be generalized (rolled-up) or
specialized (drilled-down).
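As a small illustration of generalization (roll-up), the following Python sketch aggregates counts recorded at the city level up to the country level. The city-to-country mapping and the sales figures are assumptions made up for the example; a real system would take the mapping from its schema hierarchy.

```python
from collections import Counter
from typing import Dict

# Hypothetical one-step mapping in a schema hierarchy: city < country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}


def roll_up(counts_by_city: Dict[str, int]) -> Dict[str, int]:
    """Generalize city-level counts to the country level by summing them."""
    totals: Counter = Counter()
    for city, count in counts_by_city.items():
        totals[CITY_TO_COUNTRY[city]] += count
    return dict(totals)


sales_by_city = {"Vancouver": 30, "Toronto": 50, "Chicago": 20, "New York": 40}
sales_by_country = roll_up(sales_by_city)
print(sales_by_country)
```

Drilling down is the reverse view: the city-level counts are kept, and the user moves from the country totals back to them.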
4.3 The four major types of concept hierarchies are: schema hierarchies, set-grouping hierarchies, operation-derived hierarchies, and rule-based hierarchies.
a- Briefly define each type of hierarchy.
b- For each hierarchy type, provide an example.
solution
-Schema hierarchy: a total or partial ordering among attributes in the database schema.
E.g., street < city < province_or_state < country
-Set-grouping hierarchy: organize values for a given attribute or dimension into groups of
constants or range values.
E.g., {20-39} = young, {40-59} = middle_aged
-Operation-derived hierarchy: based on operations specified by users, experts, or the data
mining system.
E.g., email address: login-name < department < university < country
-Rule-based hierarchy: occurs when a hierarchy is defined by a set of rules.
E.g., low_profit_margin(X) <= price(X, P1) and cost(X, P2) and ((P1 - P2) < $50)
More examples:
-To specify which concept hierarchies to use:
use hierarchy <hierarchy> for <attribute_or_dimension>
-Different syntax is used to define different types of hierarchies:

Schema hierarchies
define hierarchy time_hierarchy on date
as [day, month, quarter, year]

Set-grouping hierarchies
define hierarchy age_hierarchy for age on customer
as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior

Operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)

Rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) <= $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) <= $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
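The set-grouping and rule-based definitions above can also be read as simple mapping functions. The Python sketch below is an assumption about how one might encode them outside DMQL; the function names are hypothetical, while the age ranges and the $50 and $250 thresholds are taken from the definitions above.

```python
def age_group(age: int) -> str:
    """Set-grouping hierarchy: map an age to its level-1 group."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    if 60 <= age <= 89:
        return "senior"
    # Ages outside 20..89 are not covered by the hierarchy as defined.
    raise ValueError(f"age {age} is outside the defined groups")


def profit_margin_level(price: float, cost: float) -> str:
    """Rule-based hierarchy: classify an item by its profit margin."""
    margin = price - cost
    if margin <= 50:
        return "low_profit_margin"
    if margin <= 250:
        return "medium_profit_margin"
    return "high_profit_margin"


print(age_group(25), profit_margin_level(300, 100))
```

Each function climbs one level of its hierarchy; applying a further mapping of every group to "all" would reach the top level.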
4.4 a- Propose a concept hierarchy for each of the attributes: address, status, major, gpa.
b- What type of concept hierarchy is each one?
solution
address: Schema hierarchy
[street, city, state, country]
status: ????????????????????????
major: ??????????????????????????
gpa: Rule-based hierarchy
if (grade > 90)
gpa = A
else if (grade > 60)
gpa = B
4.8 Discuss the importance of establishing a standard data mining query language. List a few of the
recent proposals in this area.
Solution
A DMQL can support interactive data mining and facilitate flexible knowledge
discovery.
-The hope is to achieve an effect similar to the one SQL has had on relational databases.
-Foundation for system development and evolution.
-Facilitate information exchange, technology transfer, commercialization and wide acceptance.
4.9 No coupling, loose coupling, ...

No coupling
• Flat file processing, not recommended.

Loose coupling
• Fetching data from DB/DW.

Semi-tight coupling: enhanced DM performance
• Provide efficient implementations of essential data mining primitives in
a DB/DW system, e.g., sorting, indexing, aggregation, histogram
analysis, multiway join, precomputation of some statistical measures.

Tight coupling
• A uniform information processing environment.
• A DM system is smoothly integrated into a DB/DW system.
• Mining queries are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a DB.
5.2 solution
Class/birth place   Canada                                 other                                  Both-places
                    count  t-weight        d-weight        count  t-weight        d-weight        count  t-weight        d-weight
Programmer          180    180/300 = 60%   180/200 = 90%   120    120/300 = 40%   120/200 = 60%   300    300/300 = 100%  300/400 = 75%
DBA                 20     20/100 = 20%    20/200 = 10%    80     80/100 = 80%    80/200 = 40%    100    100/100 = 100%  100/400 = 25%
Both-classes        200    200/400 = 50%   200/200 = 100%  200    200/400 = 50%   200/200 = 100%  400    400/400 = 100%  100%
Programmer(x) ⇔ (birth_place(x) = "Canada") [t: 60%, d: 90%] ∨ (birth_place(x) = "other") [t: 40%, d: 60%]
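The weights in this solution can be checked with a short Python sketch. It assumes the usual definitions: the t-weight of a cell is its count divided by the total for its class (row), and the d-weight is its count divided by the total for its birth place (column) across both classes. The counts come from the crosstab; the function and variable names are hypothetical.

```python
from typing import Dict

# Counts from the crosstab: class -> birth place -> count.
counts: Dict[str, Dict[str, int]] = {
    "Programmer": {"Canada": 180, "other": 120},
    "DBA": {"Canada": 20, "other": 80},
}


def t_weight(table: Dict[str, Dict[str, int]], cls: str, place: str) -> float:
    """Share of the class's tuples that fall in this birth place (row share)."""
    return table[cls][place] / sum(table[cls].values())


def d_weight(table: Dict[str, Dict[str, int]], cls: str, place: str) -> float:
    """Share of this birth place's tuples that belong to the class (column share)."""
    return table[cls][place] / sum(row[place] for row in table.values())


for cls in counts:
    for place in counts[cls]:
        t = t_weight(counts, cls, place)
        d = d_weight(counts, cls, place)
        print(f"{cls:<10} {place:<6} t: {t:.0%}  d: {d:.0%}")
```

The t-weights of a class sum to 100% across birth places, and the d-weights of a birth place sum to 100% across classes, which is a quick consistency check on any such crosstab.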
5.3????????????????????????????/
5.6
When a new set of tuples, ΔDB, is inserted into the database:
• Generalize ΔDB to the same level of abstraction as the generalized
relation R to derive ΔR.
• Union R ∪ ΔR, i.e., merge counts and other statistical information to
produce a new relation R'.
Deletion can be performed in a similar manner.
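A minimal Python sketch of this incremental update, assuming the generalized relation R is stored as a mapping from generalized values to counts. The city-to-country hierarchy and all function names are hypothetical, and only counts are merged here; other statistical information would need its own merge rule.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical concept hierarchy used to generalize raw values.
CITY_TO_COUNTRY: Dict[str, str] = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
}


def generalize(tuples: Iterable[str], hierarchy: Dict[str, str]) -> Counter:
    """Lift raw tuples to the abstraction level of R and count them."""
    return Counter(hierarchy[value] for value in tuples)


def insert_tuples(relation: Counter, new_tuples: Iterable[str],
                  hierarchy: Dict[str, str]) -> Counter:
    """Generalize the inserted tuples, then merge their counts into R."""
    return relation + generalize(new_tuples, hierarchy)


def delete_tuples(relation: Counter, old_tuples: Iterable[str],
                  hierarchy: Dict[str, str]) -> Counter:
    """Generalize the deleted tuples, then subtract their counts from R.

    Counter subtraction drops entries whose count falls to zero or below.
    """
    return relation - generalize(old_tuples, hierarchy)


R = Counter({"Canada": 2, "USA": 1})
R_new = insert_tuples(R, ["Toronto", "Chicago"], CITY_TO_COUNTRY)
R_after_delete = delete_tuples(R_new, ["Toronto"], CITY_TO_COUNTRY)
print(R_new, R_after_delete)
```

The point of the sketch is that neither operation rescans the existing database: only the inserted or deleted tuples are generalized, and the stored counts are adjusted.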