The voluminous amount of data now available, and ever increasing

advertisement
Dissertation Idea Paper
Gregory A. Vaughn Sr.
Research Question
Is there a set of factors that provide association rules for the best combination of cube
elements? Cans a multidimensional data model using an adaptive piecewise constant
approximation or linear regression to reduce sparcity, satisfy these rules? Can these
associations be visualized? Can next level of modeling should also combine the concept
encapsulation of the object-oriented model to support recent trends in distributed
computing?
Data Mining and Computing
Contrary to purely retrieval efforts, data mining 5“looks for relations and associations between
phenomenon that are not known beforehand”. 6Data mining (also known as Knowledge Discovery
in Databases - KDD) has been defined as "The nontrivial extraction of implicit, previously
unknown, and potentially useful information from data”. Data mining goes beyond statistical data
analyses and computational algorithms and aims to be a major part of the business intelligence
that supports business decisions.
Research databases either from business organizations, universities or research centers are
usually created from transactional data. More often than not these databases contain information
that goes undetected by the organizations and their researchers, and fails to be recognized and
used by the organizations that own and maintain them. KDD also known as Data Mining looks to
unearth the hidden yet meaningful data within. A question that naturally derives from this
circumstance is, are there best methods or practices to discovering this undiscovered wealth.
KDD13 is seen by some as being a different activity from statistical analysis, one which takes into
account “non-statistical” issues. “Examples of “non-statistical” issues in KDD include the following
1. Data cleaning
What can be done to locate and ameliorate the pervasive problems of invalid or
incomplete data?
2. “First cut” analysis
What can be done to automatically provide an initial assessment of the patterns and
potentially useful or interesting knowledge in a database? The aim here is,
realistically, to automate some of the basic work that is now done by skilled human
analysts.
3. Hypothesis generation
What can be done to support, or even automate, the finding of plausible hypotheses in
the data? Found hypotheses would, of course, need to be tested subsequently with
statistical techniques, but where do you get “the contenders” in the first place?”
Some researchers feel that efforts should be focused on “the hypothesis generation problem for
KDD. Because hypothesis space is generally quite large…, it is normally quite impossible to
enumerate and investigate all the potentially interesting hypotheses.
However better results may be obtained by utilization of older statistical techniques on the first
two issues prior to any meaningful hypotheses generation can begin.
If information retrieval from large or small data repositories is based on pre-supposed ideas about
the data, and there is a plan for the extraction of information from the data on hand, which is
5“exogenous from the extraction itself”, (an example might be a request from the FDA Federally
funded clinic for data on patients who received Metroformin XR, Glipizide and Actos in
combination were afforded a better diabetic treatment regimen. The FDA may have determined
that there is some relationship to the administration of these drugs in combination and based on
Dissertation Idea Paper
Gregory A. Vaughn Sr.
the summarized results may initiate further investigation of these patients) then data/attribute
sparcity must be resolved (perhaps linear regression prediction).
The tools/techniques involved in the data mining process take a center stage. The multidimensional modeling tools for visualization of the data to be mined, e.g. online analytic
processing tools (OLAP), usually appear in the form of a derived relation stored in terms of a
base relation [17]. By instantiating the tuples of view in a database the view is materialized. The
benefits gained are access speed (especially when the results are the product of complex
computations).
…………………In Progress
Multi-Dimensional Modeling
As described by Kimbal(1997). “The Dimensional model adheres to a discipline that incorporates
the relational model with restrictions. The dimensional model is composed of a table called a fact
table that has multi-part keys and a set of smaller tables that are called dimension tables. The
dimension tables have a single-part primary key that relates to only one of the components of the
multi-part keys in the fact table. This structure is known as the "star join" and dates back to the
earliest days of relational databases” [15, 16].
Muti-dimensional modeling presents data 10 “as facts with associated numerical measures,
dimensional tables as mentioned above, or as textual dimensions of the facts”. In the case of
treatment for a given disease, dosage and frequency would be measures while Laboratory/drug
Company or regional location would form the dimensions. Researchers at SAP define
multidimensional modeling in terms of the goals to be achieved.
14”The
overarching goals of multi-dimensional models are:
To present information to the end-user in a way that corresponds to his normal understanding of
his business/ i.e. to show the KPIs, key figures or facts from the different perspectives that
influence them (sales organization, product/ material or time). In other words, to deliver structured
information that the end-user can easily navigate by using any possible combination of business
terms to illustrate the behavior of the KPIs.
To offer the basis for a physical implementation that the software recognizes (the OLAP engine),
thus allowing a program to easily access the data required.
The Multi-Dimensional Model (MDM) has been introduced in order to achieve the first. The most
popular physical implementation of multi-dimensional models on relational database systembased data warehouses is the Star schema implementation. SAP BW uses the Star schema
approach and extends it to support integration within the data warehouse, to offer easy handling
and allow high performance solutions”.
The steps necessary to accomplish the modeling include:
1. Complete understanding of the underlying processes that generate the data
2. Create a desired schema
3. Create a cube description
.
In progress ---- According to Hacid and Satler here in lies the strength of Multidimensional
Databases.
Multi-Dimensional Data (Data Cubes)
The voluminous amount of data now available, and ever increasing, requires new techniques for
discovery of information for decision making. The goal of the data spelunker is to find unusual
patterns that may yield heretofore un-evidenced information. Traditional methods have used
techniques that focus on data in a two dimensional plane. Current methods involving data cubes
offer a new view of data that may afford many more decision-making opportunities. The relational
Dissertation Idea Paper
Gregory A. Vaughn Sr.
model of data storage and retrieval is the standard of the day but tables and rows by their very
nature limit the dimensionality that may naturally exist in the data.
1”Data
cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a threedimensional extension of a square. The word cube brings to mind a 3-D object, and we can think
of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one
another.” Data cubes can be constructed with many more dimensions while still affording single
dimension indexing and query, but provide additional views to the data and consequently many
more decision points.
Multi-Dimensional databases, with data cubes, 4instead of presenting data to the user in the form
of tables presents it presents it in a form that can be manipulated by operators that can cut out
pieces from large cubes, change granularity, of dimensions, and turn cubes.
Dissertation Idea Paper
Gregory A. Vaughn Sr.
Sample Table 1.
Give the three dimensions X, Y, and Z, let each represent a dimension. X = a particular year of
sales (2004), let y = and area of sales, and let Z = a particular product.
X = year (2004)
Y = area (Brooklyn, Queens, Bronx, Manhattan)
Z = product (Scotch, Bourbon, Cognac, Vodka)
The cube is a set of cells, and a cell represents the association of a measure with one member in
each dimension. A cube representing X, Y, and Z would look like the following:
With kind of multidimensional representation data can be viewed by each of the dimensions and
aggregates derived for each dimension i.e.
SELECT * FROM Data Cube
GROUPED BY X
SELECT * FROM Data Cube
GROUPED BY Y
SELECT * FROM Data Cube
GROUPED BY Z
Or
SELECT * FROM Data Cube
GROUPED BY GROUP SET ((X), (Y), (Z))
OR SOME VARIANT.
3“There
are inherent features of the multidimensional model that make it an appropriate
environment for business intelligence. The multidimensional model:

Enforces referential integrity. Each dimension member is unique and cannot be NA. If a
measure has three dimensions, then each data value of that measure must be qualified
by a member of each dimension.
Dissertation Idea Paper
Gregory A. Vaughn Sr.




Promotes consistency. Dimensions are maintained as separate workspace objects and
are shared by measures.
Preserves the order of data. Each dimension has a default status list, which contains all
of its members in the order they are stored. The default status list is always the same
unless it is purposefully altered by adding, deleting, or moving members. Within a
session, the user can change the selection and order of the status list; this is called the
current status list. The current status list remains the same until the user purposefully
alters it by adding, removing, or changing the order of its members.
Because the order of dimension members is consistent and known, the selection of
members can be relative. For example, the function call
lag (sales, 12, month) compares the sales values of all months in the current status list
against sales from a year ago (that is, 12 time periods earlier in the default status list for
the month dimension).
Presents data as fully solved. Applications do not need to define calculations. Because of
the combination of power and ease-of-use of the OLAP DML, the analytic workspace can
be prepared so that the data is presented as fully solved to the application.
Manages calculated members and measures transparently. Users can define their own
dimension members (often called custom aggregates), that function identically to the
other dimension members and can be used transparently in any calculation. Similarly,
users can define their own measures and assign values to them using any of the
methods available in the OLAP DML. Throughout the session, these additions behave
identically to the dimension members and objects originally provided in the workspace.
Users can save their changes from one session to the next with a single DML command.
“
The process for this type of mining is constant, as outlined by 2Gray et. al., 1) formulation – a
query that extracts relevant data from a large database; 2) extracting – the aggregated data from
the database into a file or table; 3) visualizing – the results in a graphical way; 4) analyzing – the
results and formulating a new query.
Materialized Views
3A
relatively new data structure, data cubes, are far more complex than their earlier purely, two
dimensional relational siblings, and are that much more difficult to fathom and extract meaningful
information. For this reason analyses materialized views of this complex data provide a better
means of access and decision reporting. “A materialized view (summary table) can be thought of
as a special kind of view, which physically exists inside the database, it can contain joins and or
aggregates and exists to improve query execution time by pre-calculating expensive joins and
aggregation operations prior to execution”.
In progress ………
Past Research
In progress ……..
Why use regression analyses!!!!
A number of studies have used regression techniques in attempting to derive a predictive model
for single or multiple response variables on the basis of one or more of the other variables to
describe columnar data entries, and have found the technique to produce less error than other
available methods (8,9,10).
Dissertation Idea Paper
Gregory A. Vaughn Sr.
This Research
This study will test the efficiency of this model in discovering new associative or correctional
information vs. a more traditional method.
Objectives To access the predictability of type 2-onset diabetes from undiscovered
physio/environmental predicates.
- 7To determine if association rule mining can discover strong association or
correlation relationships between predicates.
Hypotheses - 1) the multidimensional model/technique will disclose new relationships and
subsequently new predicates for in detecting and treating diabetes
References
1. “Data Cubes“, Russell Kay, MARCH 29, 2004 (COMPUTERWORLD
2. “Data Cube: A relational Aggregation Operator Generalizing group-By, Cross-Tab, and
Sub-Totals, S. Gray, et al., in Data Mining and Knowledge Discovery 1, 29-53 (1997)
Kluwer Academic Publishers, Manufactured in the Netherlands
3. “Oracle9I Materialized Views”, An Oracle White Paper, May 2001
4. M. S. Hacid, and U Sattler, Modeling Multidimensional Databases: A Formal ObjectCentered Approach Proc. Of the Sixth European Conference on Information Systems
1998 (ECIS98)
5. Paolo Giudici, “Applied Data Mining: Statistical Methods for Business and Industry”, John
Wiley and Son, 2003
6. W. Frawley and G. Piatetsky-Shapiro and C. Matheus,” Knowledge Discovery in
Databases: An
7. Overview”, AI Magazine, Fall 1992, pgs 213-228.
8. Hua Zhu, “On-Line Analytical Mining of Association Rules”, Thesis, Simon Fraser
University 1998
9. Daniel Barbara and Mark Sullivan, “Quasi-Cubes: A space efficient way to support
approximate multidimensional databases’, 1998.
10. S. Abad-Mota, Approximate Query Processing with Summary Tables in Statistical
Database. In Proceedings of the 3rd Int’l Conference on Extending Database technology,
Vienna, Austria, March 1992.
11. Paolo Giudici, Applied Data Mining: Statistical Models for Business and Industry”, John
Wiley and Sons Ltd, 2003.
12. Torben Bach Pedersen and Christian S Jensen, “Multidimensional Database Technology”,
Dec 2001, Aalborg University, IEEE Distributed Systems Online, computer.org/dsonline
13. MOTC: An Interactive Aid for Multidimensional Hypothesis Generation,
K. Balachandran, J. Buzydlowski, G. Dworman, S.O. Kimbrough, T. Shafer, & W.
Vachula
14. Multi-Dimensional Modeling with BW ASAP for BW Accelerator Business Information
Warehouse, SAP America Inc and SAP AG.
15. Kimball, Ralph. "A Dimensional Modeling Manifesto", DBMS. 10(9). 1997 Aug.
16. "Star Schemas and STARjoin? Technology", A Red Brick Systems White Paper.
17. “What is the Data Warehousing Problem? (Are Materialized Views the Answer)”, Ashish
Gupta, Inderpal Sigh Mumick, VLDB 1996: 602 , ww.sigmod.org/vldb/conf/1996/P602.PDF
Dissertation Idea Paper
Gregory A. Vaughn Sr.
13. "Star Schemas and STARjoin? Technology", A Red Brick Systems White Paper.
4. Kimball, Ralph. "A Dimensional Modeling Manifesto", DBMS. 10(9). 1997 Aug.
1. Date, C. J. "A Fruitful Union", Computerworld. 27(24): 130. 1994 Jun 14.
Raden, Neil. "Modeling the Data Warehouse", Manuscript of an article by Neil Raden that was
excerpted in the January 29, 1996 issue of Information Week,
http://members.aol.com/nraden/iw0196_1.htm.
Download