
DM-LAB-MANUAL-IV-CSE-I-SEM

SRI INDU INSTITUTE OF ENGINEERING & TECHNOLOGY
LAB MANUAL
ON
“DATA MINING LAB”
PREPARED BY
Mr. B. HARI KUMAR
Associate Professor
In The Department Of
Computer Science & Engineering
SRI INDU INSTITUTE OF ENGINEERING & TECHNOLOGY
(AFFILIATED TO J.N.T.U, HYDERABAD)
SHERIGUDA (V), IBRAHIMPATNAM (M), R.R. DIST-501 510
(2019-20)
Syllabus:
Task 1: Credit Risk Assessment
Description:
The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial
importance. You have to develop a system to help a loan officer decide whether the credit of a customer is
good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one
hand, a bank wants to make as many loans as possible; interest on these loans is the bank's profit source.
On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to
the collapse of the bank. The bank's loan policy must involve a compromise: not too strict, and not too
lenient.
To do the assignment, you first and foremost need some knowledge about the world of credit. You can
acquire such knowledge in a number of ways.
1. Knowledge engineering. Find a loan officer who is willing to talk. Interview her and try to
represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance.
Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used
to judge the creditworthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when,
and when not, to approve a loan application.
The German credit data:
Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one
such dataset, consisting of 1000 actual cases collected in Germany: the credit dataset (original), an Excel
spreadsheet version of the German credit data.
In spite of the fact that the data is German, you should probably make use of it for this assignment (unless
you really can consult a real loan officer!).
A few notes on the German dataset
• DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and
acts like a quarter).
• Owns telephone. German phone rates are much higher than in Canada, so fewer people own
telephones.
• Foreign worker. There are millions of these in Germany (many from Turkey). It is very hard to get
German citizenship if you were not born of German parents.
• There are 20 attributes used in judging a loan applicant. The goal is to classify the applicant into
one of two categories, good or bad.
Subtasks: (Turn in your answers to the following tasks)
1. List all the categorical (or nominal) attributes and the real-valued attributes separately. (5 marks)
2. What attributes do you think might be crucial in making the credit assessment? Come up with
some simple rules in plain English using your selected attributes. (5 marks)
3. One type of model that you can create is a decision tree. Train a decision tree using the complete
dataset as the training data. Report the model obtained after training. (10 marks)
4. Suppose you use your above model, trained on the complete dataset, to classify credit as good/bad
for each of the examples in the dataset. What % of examples can you classify correctly? (This is also
called testing on the training set.) Why do you think you cannot get 100% training accuracy? (10
marks)
5. Is testing on the training set, as you did above, a good idea? Why or why not? (10 marks)
6. One approach for solving the problem encountered in the previous question is cross-validation.
Describe what cross-validation is and report your results. Does your accuracy increase/decrease?
Why? (10 marks)
7. Check to see if the data shows a bias against "foreign workers" (attribute 20) or "personal
status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these
attributes from the dataset and see if the decision tree created in those cases is significantly
different from the full-dataset case which you have already done. To remove an attribute you can
use the Preprocess tab in Weka's GUI Explorer. Did removing these attributes have any significant
effect? Discuss. (10 marks)
8. Another question might be: do you really need to input so many attributes to get good results?
Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and
21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in
problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start
selecting the ones you want.) (10 marks)
9. Sometimes the cost of rejecting an applicant who actually has good credit (case 1) might be
higher than accepting an applicant who has bad credit (case 2). Instead of counting the
misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower
cost to the second case. You can do this by using a cost matrix in Weka. Train your decision tree
again and report the decision tree and cross-validation results. Are they significantly different from
the results obtained in problem 6 (using equal cost)? (10 marks)
10. Do you think it is a good idea to prefer simple decision trees instead of long, complex
decision trees? How does the complexity of a decision tree relate to the bias of the model? (10
marks)
11. You can make your decision trees simpler by pruning the nodes. One approach is to use reduced
error pruning. Explain this idea briefly. Try reduced error pruning for training your decision trees
using cross-validation (you can do this in Weka) and report the decision tree you obtain. Also
report your accuracy using the pruned model. Does your accuracy increase? (10 marks)
12. (Extra credit): How can you convert a decision tree into "if-then-else rules"? Make up your own
small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist
different classifiers that output the model in the form of rules; one such classifier in Weka is
rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be
good enough in making the decision, yes, just one! Can you predict what attribute that might be in
this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute
based on minimum error). Report the rule obtained by training a OneR classifier. Rank the
performance of J48, PART and OneR. (10 marks)
Task Resources:
• Mentor lecture on Decision Trees
• Andrew Moore's Data Mining Tutorials (see tutorials on Decision Trees and cross-validation)
• Decision Trees (Source: Tan, MSU)
• Tom Mitchell's book slides (see slides on concept learning and Decision Trees)
Weka resources:
• Introduction to Weka (html version) (download ppt version)
• Download Weka
• Weka Tutorial
• ARFF format
• Using Weka from command line
Task 2: Hospital Management System
A data warehouse consists of dimension tables and fact tables.
REMEMBER the following
Dimension
The dimension object (Dimension):
- Name
- Attributes (levels), with one primary key
- Hierarchies
One time dimension is a must.
About Levels and Hierarchies
Dimension objects (dimensions) consist of a set of levels and a set of hierarchies defined over those
levels. The levels represent levels of aggregation. Hierarchies describe parent-child relationships among a
set of levels.
For example, a typical calendar dimension could contain five levels. Two hierarchies can be defined on
these levels:
H1: Year > Quarter > Month > Week > Day
H2: Year > Week > Day
The hierarchies are described from parent to child, so that Year is the parent of Quarter, Quarter the parent
of Month, and so forth.
About Unique Key Constraints
When you create a definition for a hierarchy, Warehouse Builder creates an identifier key for each level of
the hierarchy and a unique key constraint on the lowest level (base level).
Design a Hospital Management System data warehouse (TARGET) consisting of the dimensions Patient,
Medicine, Supplier and Time, where the measures are NO_UNITS and UNIT_PRICE.
Assume the relational database (SOURCE) table schemas as follows:
TIME (day, month, year)
PATIENT (patient_name, age, address, etc.)
MEDICINE (medicine_brand_name, drug_name, supplier, no_units, unit_price, etc.)
SUPPLIER (supplier_name, medicine_brand_name, address, etc.)
If each dimension has 6 levels, decide the level hierarchies; assume suitable level names.
Design the Hospital Management System data warehouse using all schemas. Give an example 4-D cube
with assumed names.
EXP:1
Introduction
Explore the WEKA Data Mining / Machine Learning Toolkit
I. Downloading and/or installation of the WEKA data mining toolkit.
II. Understanding features of the WEKA toolkit such as the Explorer, Knowledge Flow interface,
Experimenter and command-line interface.
III. Navigate the options available in WEKA (e.g. Select Attributes panel, Preprocess panel,
Classify panel, Cluster panel, Associate panel and Visualize panel).
IV. Study the ARFF file format.
V. Explore the available data sets in WEKA.
VI. Load a data set (e.g. weather dataset, iris dataset, etc.).
VII. Load each dataset and observe the following:
i. List the attribute names and their types.
ii. Number of records in each dataset.
iii. Identify the class attribute (if any).
iv. Plot histograms.
v. Determine the number of records for each class.
vi. Visualize the data in various dimensions.
PROCEDURE:
I. Downloading and/or installation of WEKA data mining toolkit
Download the WEKA tool from the following link and install it.
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
II. Understanding features of the WEKA toolkit such as Explorer, Knowledge Flow interface,
Experimenter and command-line interface.
WEKA: Waikato Environment for Knowledge Analysis
The WEKA GUI Chooser provides a starting point for launching WEKA’S main GUI
applications and supporting tools. The GUI Chooser consists of four buttons—one for each
of the four major Weka applications—and four menus.
The buttons can be used to start the following applications:
1. Explorer: An environment for exploring data with WEKA.
2. Experimenter: An environment for performing experiments and conducting
statistical tests between learning schemes.
3. Knowledge Flow: It supports essentially the same functions as the Explorer but
with a drag-and-drop interface. One advantage is that it supports incremental
learning.
4. Simple CLI: Provides a simple command-line interface that allows direct
execution of WEKA commands for operating systems that do not provide their
own command line interface.
EXPLORER:
It is a user interface which contains a group of tabs just below the title bar. The
tabs are as follows:
1. Preprocess
2. Classify
3. Cluster
4. Associate
5. Select Attributes
6. Visualize
The bottom of the window contains the status box, the log and the WEKA bird.
Experimenter:
The Weka Experiment Environment enables the user to create, run, modify, and
analyse experiments in a more convenient manner than is possible when processing the
schemes individually. For example, the user can create an experiment that runs several
schemes against a series of datasets and then analyse the results to determine if one of the
schemes is (statistically) better than the other schemes.
The Experiment Environment can be run from the command line using the Simple
CLI.
You can choose between those two with the Experiment Configuration Mode radio buttons:
• Simple
• Advanced
Both setups allow you to set up standard experiments that are run locally on a single machine,
or remote experiments, which are distributed between several hosts.
Knowledge Flow
The Knowledge Flow provides an alternative to the Explorer as a graphical front end
to WEKA's core algorithms. The Knowledge Flow presents a data-flow-inspired interface to
WEKA. The user can select WEKA components from a palette, place them on a layout canvas
and connect them together in order to form a knowledge flow for processing and analyzing
data. At present, all of WEKA's classifiers, filters, clusterers, associators, loaders and savers
are available in the Knowledge Flow along with some extra tools.
Simple CLI
The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters,
clusterers, etc., but without the hassle of the CLASSPATH (it uses the one with
which WEKA was started). It offers a simple Weka shell with separate command-line
and output areas.
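For example, typing a command of the following form into the Simple CLI trains the J48 decision tree on a dataset (the dataset path is an assumption and depends on your installation; with only a -t training-file option, WEKA reports 10-fold cross-validation results by default):

java weka.classifiers.trees.J48 -t data/weather.arff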
III. Navigate the options available in WEKA (e.g. Select Attributes panel, Preprocess
panel, Classify panel, Cluster panel, Associate panel and Visualize panel). Explore
the available data sets in WEKA, load a data set, and for each dataset observe the
following:
i. List the attribute names and their types.
ii. Number of records in each dataset.
iii. Identify the class attribute (if any).
iv. Plot histograms.
v. Determine the number of records for each class.
vi. Visualize the data in various dimensions.
PREPROCESSING:
It is the process of identifying unwanted data (data cleaning) before loading the data from
the database.
• Now open the WEKA application as shown in figure-1 below.
Figure-1
• Now click on Explorer as shown in figure-2.
Figure-2
• Now open a file by choosing the "Open file" button as shown in figure-3.
Figure-3
Relation specifies the name of the dataset used, Instances specifies the number of objects
involved, and Attributes specifies the number of attributes used in the dataset or relation.
• Now choose the data folder in the open dialogue box as in figure-4.
• Now choose the "house.arff" file as shown in figure-5.
Figure-5
• Now click on Visualize All.
IV. Study the ARFF file format
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of
instances sharing a set of attributes. ARFF files are not the only format WEKA can load; any
file that can be converted with WEKA's "core converters" (for example, CSV files) can also be
loaded.
Now create the ARFF file: open Notepad, type the following code and save the
file with the .arff extension.
@RELATION Student
@ATTRIBUTE customerid NUMERIC
@ATTRIBUTE age {youth, middle, senior}
@ATTRIBUTE income {low, medium, high}
@ATTRIBUTE student {yes, no}
@ATTRIBUTE credit_rating {fair, excellent}
@ATTRIBUTE buy_computer {yes, no}
@DATA
1,youth,high,no,fair,no
2,youth,high,no,excellent,no
3,middle,high,no,fair,yes
4,senior,medium,no,fair,yes
5,senior,low,yes,fair,yes
6,senior,low,yes,excellent,no
7,middle,low,yes,excellent,yes
8,youth,medium,no,fair,no
9,youth,low,yes,fair,yes
10,senior,medium,yes,fair,yes
11,youth,medium,yes,excellent,yes
12,middle,medium,no,excellent,yes
13,middle,high,yes,fair,yes
14,senior,medium,no,excellent,no
Now save the file with the .arff extension.
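To verify the file, it can also be loaded through WEKA's Java API. The following is a minimal sketch, assuming the file was saved as student.arff in the working directory and that weka.jar is on the classpath (the file name and the class name LoadArff are illustrative assumptions):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file created above (path is an assumed example)
        Instances data = DataSource.read("student.arff");
        // Treat the last attribute (buy_computer) as the class attribute
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Relation: " + data.relationName());
        System.out.println("Records:  " + data.numInstances());
        // Print each attribute declaration (name and type), as asked in observation (i)
        for (int i = 0; i < data.numAttributes(); i++) {
            System.out.println(data.attribute(i));
        }
    }
}

The Explorer's Preprocess tab reports the same information (relation name, number of instances, and attribute names and types) after the file is opened.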
TASK-1
Credit Risk Assessment
Description: The business of banks is making loans. Assessing the creditworthiness of an
applicant is of crucial importance. You have to develop a system to help a loan officer decide
whether the credit of a customer is good or bad. A bank's business rules regarding loans must
consider two opposing factors. On the one hand, a bank wants to make as many loans as possible;
interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make
too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank's loan
policy must involve a compromise: not too strict, and not too lenient.
The German Credit Data
Actual historical credit data is not always easy to come by because of confidentiality rules.
Tasks:
Now select the credit-g.arff file from the WEKA data folder and load the file as shown in figure-1.
1. List all the categorical (or nominal) attributes and the real valued attributes
separately.
Ans) The following are the categorical (or nominal) attributes:
1. Checking_Status
2. Credit_history
3. Purpose
4. Savings_status
5. Employment
6. Personal_status
7. Other_parties
8. Property_Magnitude
9. Other_payment_plans
10. Housing
11. Job
12. Own_telephone
13. Foreign_worker
The following are the numerical attributes:
1. Duration
2. Credit_amount
3. Installment_Commitment
4. Residence_since
5. Age
6. Existing_credits
7. Num_dependents
2. What attributes do you think might be crucial in making the credit assessment? Come
up with some simple rules in plain English using your selected attributes.
Ans) The following attributes may be crucial in making the credit assessment:
1. Credit_amount
2. Age
3. Job
4. Savings_status
5. Existing_credits
6. Installment_commitment
7. Property_magnitude
3. One type of model that you can create is a Decision tree. Train a Decision tree using the
complete data set as the training data. Report the model obtained after training.
4. Suppose you use your above model, trained on the complete dataset, to classify credit as
good/bad for each of the examples in the dataset. What % of examples can you classify
correctly? (This is also called testing on the training set.) Why do you think you cannot get
100% training accuracy?
Ans) If we use our above model, trained on the complete dataset, to classify credit as
good/bad for each of the examples in that dataset, we cannot get 100% training accuracy;
only 85.5% of the examples are classified correctly.
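The same experiment can be reproduced outside the GUI with WEKA's Java API. This is a sketch only; the dataset path and the class name TrainJ48 are assumptions (credit-g.arff ships in WEKA's data folder):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);           // class attribute: good/bad
        J48 tree = new J48();                                   // C4.5 decision tree
        tree.buildClassifier(data);                             // train on the full dataset
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);                         // test on the training set
        System.out.println(tree);                               // the trained model (task 3)
        System.out.println("Training accuracy: " + eval.pctCorrect() + " %"); // task 4
    }
}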
5. Is testing on the training set as you did above a good idea? Why or why not?
Ans) It is not a good idea, because the model is evaluated on the same data it was trained on, so the accuracy estimate is optimistically biased and says little about performance on unseen applicants.
6. One approach for solving the problem encountered in the previous question is
cross-validation. Describe briefly what cross-validation is. Train a decision tree again
using cross-validation and report your results. Does the accuracy increase/decrease? Why?
Ans) Cross-validation definition: in k-fold cross-validation the dataset is split into k equal
parts (folds); the classifier is trained on k-1 folds and tested on the remaining fold, and this
is repeated k times so that every fold is used once for testing. In WEKA, the classifier is
evaluated by cross-validation using the number of folds entered in the Folds text field.
In the Classify tab, select the Cross-validation option, set the fold size to 2 and press the
Start button; then repeat with fold size 5, and again with fold size 10.
i) Fold size 10
ii) Fold size 5
iii) Fold size 2
Note: With this observation we have seen that accuracy increases when the fold size
is 5 and decreases when the fold size is 10.
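A sketch of the same fold-size comparison using WEKA's Java API (the path and class name are assumptions; the fixed random seed makes the runs repeatable):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);
        for (int folds : new int[] {2, 5, 10}) {
            Evaluation eval = new Evaluation(data);
            // k-fold cross-validation with a fixed random seed
            eval.crossValidateModel(new J48(), data, folds, new Random(1));
            System.out.println(folds + "-fold accuracy: " + eval.pctCorrect() + " %");
        }
    }
}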
7. Check to see if the data shows a bias against "foreign workers" or "personal status".
One way to do this is to remove these attributes from the dataset and see if the
decision tree created in those cases is significantly different from the full-dataset case
which you have already done. Did removing these attributes have any significant
effect? Discuss.
Ans) We use the Preprocess tab in the WEKA GUI Explorer to remove the attributes
"foreign_worker" and "personal_status" one by one. In the Classify tab, select the Use
training set option and press the Start button. With these attributes removed from the
dataset, we can see the change in accuracy compared to the full data set.
i) If Foreign_worker is removed
ii) If Personal_status is removed
Analysis:
With this observation we have seen that when the "foreign_worker" attribute is removed from
the dataset, the accuracy decreases, so this attribute is important for classification.
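Programmatically, the Preprocess-tab removal corresponds to WEKA's Remove filter. A hedged sketch follows (the path and class name are assumptions; indices 20 and 9 follow the task statement, where attributes are numbered from 1):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class BiasCheck {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // assumed path
        for (String index : new String[] {"20", "9"}) {         // foreign_worker, personal_status
            Remove remove = new Remove();
            remove.setAttributeIndices(index);                  // 1-based attribute index
            remove.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, remove);
            reduced.setClassIndex(reduced.numAttributes() - 1); // class stays last
            J48 tree = new J48();
            tree.buildClassifier(reduced);
            Evaluation eval = new Evaluation(reduced);
            eval.evaluateModel(tree, reduced);                  // test on training set, as above
            System.out.println("Without attribute " + index + ": " + eval.pctCorrect() + " %");
        }
    }
}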
8. Another question might be: do you really need to input so many attributes to get good
results? Maybe only a few would do. For example, you could try just having attributes
2, 3, 5, 7, 10, 17 and 21. Try out some combinations. (You had removed two attributes in
problem 7. Remember to reload the ARFF data file to get all the attributes initially before
you start selecting the ones you want.)
Procedure:
1. Remove the 2nd attribute (Duration):
We use the Preprocess tab in the WEKA GUI Explorer to remove the 2nd attribute (Duration).
In the Classify tab, select the Use training set option and press the Start button. With this
attribute removed, we can see the change in accuracy compared to the full data set. The
output is shown in figure-1.
Figure-1
2. Remove the 3rd attribute (Credit_history):
Remember to reload the previously removed attribute by pressing the Undo option in the
Preprocess tab. Then remove the 3rd attribute (Credit_history), select the Use training set
option in the Classify tab and press Start. The output is shown in figure-2.
Figure-2
3. Remove the 5th attribute (Credit_amount):
Reload the previously removed attribute with Undo, remove the 5th attribute (Credit_amount),
select the Use training set option in the Classify tab and press Start. The output is shown in
figure-3.
Figure-3
4. Remove the 7th attribute (Employment):
Reload the previously removed attribute with Undo, remove the 7th attribute (Employment),
select the Use training set option in the Classify tab and press Start. The output is shown in
figure-4.
Figure-4
5. Remove the 10th attribute (Other_parties):
Reload the previously removed attribute with Undo, remove the 10th attribute (Other_parties),
select the Use training set option in the Classify tab and press Start. The output is shown in
figure-5.
Figure-5
6. Remove the 17th attribute (Job):
Reload the previously removed attribute with Undo, remove the 17th attribute (Job), select
the Use training set option in the Classify tab and press Start. The output is shown in
figure-6.
Figure-6
7. Remove the 21st attribute (Class):
Reload the previously removed attribute with Undo, remove the 21st attribute (Class), select
the Use training set option in the Classify tab and press Start. The output is shown in
figure-7.
Figure-7
ANALYSIS:
With this observation we have seen that when the 3rd attribute is removed from the dataset,
the accuracy (83%) decreases, so this attribute is important for classification. When the 2nd
and 10th attributes are removed, the accuracy (84%) stays the same, so we can remove either
of them. When the 7th and 17th attributes are removed, the accuracy (85%) stays the same, so
we can remove either of them. If we remove the 5th and 21st attributes the accuracy increases,
so these attributes may not be needed for the classification.
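The equivalent of keeping only attributes 2, 3, 5, 7, 10, 17 and 21 can be scripted with the Remove filter's invert option. A sketch under the same assumptions as before (path and class name are illustrative):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class AttributeSubset {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // assumed path
        Remove remove = new Remove();
        remove.setAttributeIndices("2,3,5,7,10,17,21"); // attributes to keep (1-based)
        remove.setInvertSelection(true);                // invert: remove all the others
        remove.setInputFormat(data);
        Instances subset = Filter.useFilter(data, remove);
        subset.setClassIndex(subset.numAttributes() - 1); // attribute 21 (class) is last
        J48 tree = new J48();
        tree.buildClassifier(subset);
        Evaluation eval = new Evaluation(subset);
        eval.evaluateModel(tree, subset); // testing on the training set, as in the steps above
        System.out.println("Accuracy with the reduced attribute set: " + eval.pctCorrect() + " %");
    }
}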
9. Sometimes the cost of rejecting an applicant who actually has good credit
might be higher than accepting an applicant who has bad credit. Instead of
counting the misclassifications equally in both cases, give a higher cost to the
first case (say cost 5) and a lower cost to the second case by using a cost matrix
in Weka. Train your decision tree again and report the decision tree and
cross-validation results. Are they significantly different from the results obtained
in problem 6?
Procedure:
• Now open the WEKA GUI Explorer, select the Classify tab and select the Use training
set option.
• In the Classify tab press the Choose button and select J48 as the decision tree technique.
• In the Classify tab press the More options button to open the classifier evaluation
options window.
• Now select Cost-sensitive evaluation and press the Set button to open the Cost
Matrix Editor.
• Now change Classes to 2 and press the Resize button to get a 2x2 cost matrix.
• Now change the value at location (0,1) of the cost matrix to 5; the modified cost matrix
is shown in figure-8.
Figure-8
• Then close the Cost Matrix Editor, press the OK button, and press the Start button. The
result is shown in figure-9 below.
Figure-9
Analysis:
With this observation we have seen that, of the 700 good customers, 669 are classified as good
and 31 are misclassified as bad. Of the 300 bad customers, 186 are classified as bad and 114
are misclassified as good.
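The Cost Matrix Editor steps map onto a cost-sensitive Evaluation in the Java API. A sketch, assuming the same dataset path as before; parseMatlab builds the 2x2 matrix with cost 5 at cell (0,1), mirroring the editor step above:

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitive {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);
        // Cost 5 at cell (0,1), cost 1 at cell (1,0), zero on the diagonal
        CostMatrix costs = CostMatrix.parseMatlab("[0.0 5.0; 1.0 0.0]");
        Evaluation eval = new Evaluation(data, costs);          // cost-sensitive evaluation
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toMatrixString());              // confusion matrix
        System.out.println("Total cost: " + eval.totalCost());
    }
}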
10. Do you think it is a good idea to prefer simple decision trees instead of long,
complex decision trees? How does the complexity of a decision tree relate to the
bias of the model?
Analysis:
It is a good idea to prefer simple decision trees instead of complex ones, because a simpler
tree is less likely to overfit the training data. A more complex tree has lower bias but higher
variance; a simpler tree has higher bias but tends to generalize better.
11. You can make your decision trees simpler by pruning the nodes. One approach is
to use reduced error pruning. Explain this idea briefly. Try reduced error pruning
for training your decision trees using cross-validation and report the decision
trees you obtain. Also report your accuracy using the pruned model. Does your
accuracy increase?
Ans) In reduced error pruning, starting from the bottom of the tree, each subtree is replaced
by a leaf if doing so does not reduce accuracy on a held-out pruning (validation) set. We can
make our decision tree simpler by pruning the nodes as follows:
• In the WEKA GUI, select the Classify tab and select the Use training set option.
• In the Classify tab press the Choose button and select J48 as the decision tree technique.
• Beside the Choose button, click on the "J48 -C 0.25 -M 2" text to open the Generic
Object Editor.
• Set the reducedErrorPruning property to True and press OK.
• Now press the Start button.
Figure-10
Analysis:
With the pruned model the accuracy decreased; however, by pruning the nodes we make our
decision tree simpler.
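A sketch of the same pruning experiment in the Java API (the path and class name are assumptions); setReducedErrorPruning(true) corresponds to setting reducedErrorPruning to True in the Generic Object Editor:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrunedTree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        tree.setReducedErrorPruning(true);   // same as the -R flag / GUI property
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println("Pruned-model accuracy: " + eval.pctCorrect() + " %");
        tree.buildClassifier(data);          // rebuild on all data to print the tree
        System.out.println(tree);
    }
}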
12. How can you convert a decision tree into "if-then-else rules"? Make up your own
small decision tree consisting of 2-3 levels and convert it into a set of rules. There also
exist different classifiers that output the model in the form of rules. One such
classifier in WEKA is rules.PART; train this model and report the set of rules obtained.
Sometimes just one attribute can be good enough in making the decision, yes, just one!
Can you predict what attribute that might be in this data set? The OneR classifier uses a
single attribute to make decisions (it chooses the attribute based on minimum
error). Report the rule obtained by training a OneR classifier. Rank the performance
of J48, PART and OneR.
Procedure:
• Make up a small decision tree with 2-3 levels. A sample decision tree is shown in
figure-11.
Figure-11
• Now, converting the above decision tree into a set of rules gives the following:
Rule1: If age = youth AND student=yes THEN buys_computer=yes
Rule2: If age = youth AND student=no THEN buys_computer=no
Rule3: If age = middle_aged THEN buys_computer=yes
Rule4: If age = senior AND credit_rating=excellent THEN buys_computer=yes
Rule5: If age = senior AND credit_rating=fair THEN buys_computer=no
• Now open the WEKA GUI Explorer and select the Classify tab.
• Now select the Use training set option. There also exist different classifiers that output
the model in the form of rules; such classifiers in WEKA are "PART" and "OneR".
• Then go to Choose, select Rules, select PART and press the Start button. The result is
shown in figure-12.
Figure-12
• Then go to Choose, select Rules, select OneR and press the Start button. The result is
shown in figure-13.
Figure-13
• Then go to Choose, select Trees, select J48 and press the Start button. The result is
shown in figure-14.
Figure-14
Analysis:
From this observation, the performance ranking of the classifiers is as follows:
1. PART
2. J48
3. OneR
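The ranking can be checked programmatically by cross-validating all three classifiers under identical conditions. A sketch (the path and class name are assumptions):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);
        Classifier[] models = {new J48(), new PART(), new OneR()};
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            // Identical 10-fold splits (same seed) make the comparison fair
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.println(model.getClass().getSimpleName()
                + " accuracy: " + eval.pctCorrect() + " %");
        }
    }
}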
TASK-2: Hospital Management System
AIM:
A data warehouse consists of dimension tables and fact tables.
REMEMBER the following
Dimension
The dimension object (Dimension):
- Name
- Attributes (levels), with one primary key
- Hierarchies
One time dimension is a must.
About Levels and Hierarchies
Dimension objects (dimensions) consist of a set of levels and a set of hierarchies defined over those levels.
The levels represent levels of aggregation. Hierarchies describe parent-child relationships among a set of
levels.
For example, a typical calendar dimension could contain five levels. Two hierarchies can be defined on
these levels:
H1: Year > Quarter > Month > Week > Day
H2: Year > Week > Day
The hierarchies are described from parent to child, so that Year is the parent of Quarter, Quarter the parent
of Month, and so forth.
About Unique Key Constraints
When you create a definition for a hierarchy, Warehouse Builder creates an identifier key for each level of
the hierarchy and a unique key constraint on the lowest level (base level).
Design a Hospital Management System data warehouse (TARGET) consisting of the dimensions Patient,
Medicine, Supplier and Time, where the measures are NO_UNITS and UNIT_PRICE.
Assume the relational database (SOURCE) table schemas as follows:
TIME (day, month, year)
PATIENT (patient_name, age, address, etc.)
MEDICINE (medicine_brand_name, drug_name, supplier, no_units, unit_price, etc.)
SUPPLIER (supplier_name, medicine_brand_name, address, etc.)
If each dimension has 6 levels, decide the level hierarchies; assume suitable level names.
Design the Hospital Management System data warehouse using all schemas. Give an example 4-D cube
with assumed names.
PROCEDURE:
Design the Hospital Management System data warehouse using the Pentaho Business Analytics open
source tool. Give an example 4-D cube with assumed names. Apply OLAP operations for analysing
the data.
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in
support of management's decision-making process.
The steps in building a data warehouse are as follows:
1. Extracting the transactional data from the data sources into a staging area
2. Transforming the transactional data
3. Designing a multidimensional data model
4. Performing various OLAP operations and visualizing the multidimensional data for analysis
Create a relational database with the following tables in PostgreSQL:
Time (Timeid, day, month, year)
Patient (Patientid, patient_name, Age, Address)
Medicine (Medicineid, Medicine_Brand_name, Drug_name, Supplier, unit_Price)
Supplier (Supplierid, Supplier_name, Medicine_Brand_name, Address)
Sales (Timeid, Patientid, Medicineid, Supplierid, units)
Populate the tables with the available data.
Alter the Sales table and update it to include an additional field, revenue, which is the product of units
and unit_price (a sketch of this step follows below).
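A hedged sketch of this step through JDBC (the connection URL, database name and credentials are assumptions; the PostgreSQL JDBC driver must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddRevenue {
    public static void main(String[] args) throws Exception {
        // Connection details are assumptions; adjust the URL, user and password
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/hospital_dw", "postgres", "postgres");
             Statement st = con.createStatement()) {
            // Add the revenue column to the fact table
            st.executeUpdate("ALTER TABLE sales ADD COLUMN revenue NUMERIC");
            // revenue = units * unit_price, with unit_price taken from the Medicine table
            st.executeUpdate(
                "UPDATE sales s SET revenue = s.units * m.unit_price " +
                "FROM medicine m WHERE s.medicineid = m.medicineid");
        }
    }
}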
Create a multidimensional model by adding the data source in Pentaho.
Specify the dimension tables as Time, Patient, Medicine and Supplier.
Specify the fact table as the Sales table, and specify the joins of its foreign keys with the primary keys
in the dimension tables.
The star schema has three main components, as shown in figure 1:
- The fact table and its contents: metric attributes and the foreign keys necessary to join to the
dimension tables.
- The dimension tables and their contents: reference attributes, hierarchical attributes, and metric
attributes. The dimension tables are highly denormalized.
- The lines that link the dimension tables to the fact table.
[Figure: star schema diagram. A central fact table (SupplierId, TimeId, MedicineId, PatientId, Unitssold, Revenue) is linked to four dimension tables: Supplier (SupplierId, Name, City, Country), Time (TimeId, Day, Month, Year), Medicine (MedicineId, Name, Drug Name, Supplier) and Patient (PatientId, Name, City, Country).]
Figure 1: Star schema showing dimensions and fact table for Hospital management System
Pentaho Analyzer automatically fetches data in real time as you add and remove fields, so you may
find it easier to build a report with the Auto Refresh feature turned off. This lets you design your
report layout first, including calculations and filtering, without querying the database
automatically after each change. Just click the Auto Refresh icon in the toolbar to toggle Auto
Refresh on or off, or you can click the Refresh Report button at any time.
1. From User Console Home, click Create New, then Analysis Report.
2. Choose a data source for the report from the Select Data Source dialog box. Click OK.
3. From the Available Fields pane on the left, click and drag an object to the Rows or Columns
area in the Layout panel. The data row or column appears in the table workspace. Note
that you can remove an object from a row or column by dragging it from the Layout panel
back to the Available Fields list.
4. In the list of fields, click and drag a measure to the Measures area in the Layout pane. The
measure appears as a column in the table workspace.
5. If you want to rename or reformat your columns, right-click a column and choose Column
Name and Format from the menu. The Edit Column window appears. You can also sort the
data in your columns by clicking a column and choosing a sort order from the drop-down menu.
6. Choose a format from the Format drop-down box, or choose a visualization from the drop-down
menu. Click to refresh the report if you need to, then click OK.
7. Click Save As. Type a file name for your report and choose a location to save it in, then click
OK.
The new Analyzer report is created and saved in a location of your choice.
Some of the queries that can be answered using the 4-D data cube are given below; a sketch of one
such query follows the list.
1) Revenue generated by the sales of medicine of a particular brand for a given patient over
a given duration
2) Revenue generated by the sales of medicine for a given location, brand or duration, and
by brand, location and duration together
3) Revenue generated by the sales of medicine over a given duration for a specific brand
4) Supplier-wise sales of medicine
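As an illustration, query 3 corresponds to a simple aggregation over the star schema. A sketch through JDBC (the connection details and the brand name 'Crocin' are assumptions; table and column names follow the schema above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BrandRevenueByYear {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/hospital_dw", "postgres", "postgres");
             Statement st = con.createStatement();
             // Revenue for a specific brand, broken down by year (query 3 above)
             ResultSet rs = st.executeQuery(
                 "SELECT t.year, SUM(s.revenue) AS total_revenue " +
                 "FROM sales s JOIN time t ON s.timeid = t.timeid " +
                 "JOIN medicine m ON s.medicineid = m.medicineid " +
                 "WHERE m.medicine_brand_name = 'Crocin' " +   // assumed brand name
                 "GROUP BY t.year ORDER BY t.year")) {
            while (rs.next()) {
                System.out.println(rs.getInt("year") + ": " + rs.getDouble("total_revenue"));
            }
        }
    }
}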