data warehousing, data mining, OLAP

Business Intelligence : Competitive Advantage
Data Warehousing and Data mining
R.K.Gupta*
“Knowledge [no more Information] is not only power, but also has significant competitive
advantage”
The globalization of business, the liberalization of the economy and rapid strides
in technology make strategic and operational (micro-level) plans virtually outdated by the time
they are generated and ready for implementation. The changing economic scenario, with the
on-going process of liberalization, globalization and privatization, poses new challenges of
multi-level complexity to planners and decision-makers. Planning has been transformed into a
dynamic, on-going process and thus needs a "Business Intelligence Solution". In the present
economic scenario, for a large number of organizations, making unbiased and faster decisions
can make the difference between surviving and thriving, all the more so in an increasingly
competitive market. Buried in the huge databases assembled by large organizations is
information useful for generating new facts and relationships that can provide significant
competitive advantage.
Organizations have lately realized that just processing transactions and/or
information faster and more efficiently no longer provides them with a competitive
advantage over their competitors in achieving business excellence. Information
technology (IT) tools that are oriented towards knowledge processing can provide the edge
that organizations need to survive and thrive in the current era of fierce competition.
Enterprises are no longer satisfied with business information system(s); they require business
intelligence system(s). Increasing competitive pressures and the desire to leverage
information technology techniques have led many organizations to explore the benefits of
new emerging technologies, viz. "Data Warehousing and Data Mining". What is needed today
is not just the latest information, updated to the nanosecond, but cross-functional
information that can make decision making an "on-line" process.
Data mining technology runs parallel to gold extraction technology. Data mining is
based on the filtration and assaying of a mountain of data "ore" in order to get data
"nuggets". It is designed to help corporate organization(s) discover hidden patterns and
delve deeper to establish hidden connections in the organization's data: patterns that can
help planners and decision makers understand the behavior of key users, detect likely trends
and growth patterns, predict change(s) in the financial sector etc., and thus manage the
business effectively and gain a competitive edge.
*Senior Technical Director & Head, Analytics and Modelling Division, National Informatics Centre, Planning Commission
(GOI), CGO Complex, Lodhi Road, New Delhi-110003. Tel. No.: (O) 011-4362530 E-Mail : rkg@hub.nic.in OR
gupta@amdiv.delhi.nic.in
Evolution of Information Technology Tools:
The initial information technology tools utilized in the managerial world were for
data collection and storage (Transaction Processing Systems): tools for systemizing the
collection and storage of the data needed to manage the transactions that the enterprise
entered into with its trading partner(s). This was followed by the realization that an
enormous amount of data is created by the organization (either internally, from the
interaction of the different functional areas, or from interaction with business partners),
and that a system to manage these data in an integrated manner was needed. To address this
need the Management Information System (MIS) concept was created. This system
(theoretically) allowed the management to query the transactional database for summary as
well as detailed reports on any matter of interest, on a routine or exception basis.
Unfortunately, the MIS concept could not be implemented at the organizational level as an
integrated system; it was implemented within functionally distinct areas like finance,
purchase, audit, human resource management etc. The organization-wide MIS has only recently
been accepted and is being implemented in many organizations in a new incarnation: the
Enterprise (wide) Resource Planning (ERP) systems. Together with the MIS concept, a need was
felt for systems that supported the management in decision making by providing analytical
and/or heuristic model(s): the Decision Support Systems (DSS).
The evolution of information systems is characterized by a movement from data
maintenance systems to systems that transform the data into "information" for use in the
decision making process. Organizations had realized that information was a competitive tool
that allowed them to perform better in a dynamic environment. These systems supported
information acquisition from the database of transactional data, but they did not directly
support the managerial knowledge acquisition function (Fig. 1). New patterns emerging in a
changing scenario could not be provided by these systems directly; the planner was supposed
to find them from experience. For example, a shift in consumers' buying behavior was not
directly reported, and the manager was supposed to analyze the sales figures over time
himself or herself to identify the reasons.
[Figure: Data is processed into Information, which is processed into Knowledge. Associated
tools: Transaction Processing Systems (data), Management Information Systems (information),
Data Mining Tools and On-Line Analytical Processing Tools (knowledge).]
Fig. 1: The Transformation of Data into Knowledge and associated tools.
To answer questions that require analysis of historical data, a repository of the
relevant data is needed. With advances in On-Line Transaction Processing (OLTP) systems,
organizations have gathered data/information for decades, viz. customer names,
addresses, credit lines, purchasing preferences, product sales history, pricing elasticity,
seasonality. The data exists somewhere, in a departmental database or in the organization
database. The problem lies not in the data, but in data access, data consistency, data
timeliness, data accuracy and data granularity (the level of detail), and in turn in the
tools needed to access this data and the types of queries that these tools can handle.
On-line Analytical Processing (OLAP) tools are decision support tools that can access
data in the operational database across multiple dimensions and apply statistical analysis
tools (like cluster analysis, factor analysis etc.) to them. To use them, however, the user
must have domain knowledge and must define the objective(s) of the analysis. Data mining
tools, in contrast, allow the user to run more general queries: they allow the discovery of
previously unknown or obscure patterns and relationships in a very large database, with the
aim of arriving at comprehensible and meaningful results from extensive analysis of
information. Though data mining tools can also run on the database management system (DBMS)
that supports transaction/operational data processing (which may range from a centralized
structure at one end to a totally distributed system running disparate DBMSs at the other),
they impose a large overhead on these DBMSs (OLAP tools impose similar overheads).
Therefore, to support data mining operations, huge repositories of data known as "data
warehouses" are used. Data warehousing provides an architecture that gives the different
data mining technologies structured access to the data.
Data Warehouse:
The data warehouse makes an attempt to figure out "what we need" before we know
we need it. Data warehousing is the process of integrating enterprise-wide corporate data into
a single repository, from which end-users can easily run complex queries, generate
multidimensional reports, and perform analyses.
In general, a database is not a data warehouse unless it has the following two features:
• It collects information from a number of disparate sources and is the place
where this disparity is reconciled, and
• It allows several different applications to make use of the same information.
Table 1. Similarities and Differences between OLTP and Data Warehouse Systems
                  | OLTP                        | DATA WAREHOUSE
Purpose           | Run day-to-day operation    | Information retrieval and analysis
Structure         | RDBMS                       | RDBMS/Multi-dimensional DBMS
Data Model        | Normalized                  | Denormalized and/or Multi-dimensional
Access            | SQL                         | SQL plus data analysis extensions
Type of data      | Data that runs the business | Data to analyze the business
Condition of data | Changing, incomplete        | Historical, descriptive
Together with the data collected from various sources (Fig. 2), the data warehouse
stores a kind of data called metadata, which is "data about data". Metadata may record
when the data was created, from which system it came, and which tools have handled it as it
moved from where it originally was to where it is now. It covers everything surrounding the
actual content of the data that gives a person an understanding of how the data was created
and how it is maintained.
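The kind of information metadata carries can be sketched as a small record attached to a warehouse table. This is only an illustration: the class names, field names and sample values below are assumptions made for the example, not part of any particular warehouse product.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Metadata:
    """Hypothetical "data about data" kept alongside a warehouse table."""
    source_system: str        # the system the data came from
    created_at: datetime      # when the data was created
    history: list = field(default_factory=list)  # (tool, timestamp) pairs

    def record_access(self, tool, when):
        # Note each tool that handled the data on its way into the warehouse.
        self.history.append((tool, when))


@dataclass
class WarehouseTable:
    name: str
    rows: list
    metadata: Metadata


meta = Metadata("billing_oltp", datetime(1998, 6, 1))
meta.record_access("extract_job", datetime(1998, 6, 2))
meta.record_access("transform_job", datetime(1998, 6, 2, 1))
table = WarehouseTable("customer_sales", [("Asha", 9000)], meta)

# The lineage of the table can be read back from its metadata alone.
tools_used = [tool for tool, _ in table.metadata.history]
```

Keeping the lineage next to the data is what lets a person answer "where did this figure come from?" without inspecting the source systems.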
Data warehouses also differ from conventional databases in that they are
denormalized, i.e. the same data may appear several times in different places.
Denormalization allows data to be combined into larger tables (the structures in which
RDBMSs hold the data) and reduces the number of input/output operations that have to be
made, thereby speeding system operation.
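A minimal sketch of denormalization, assuming three small normalized tables with made-up names and values: the customer and product attributes are copied into every sales row once, at load time, so that a later analysis reads a single wide table instead of joining three.

```python
# Hypothetical normalized tables, as an OLTP system might hold them.
customers = {1: {"name": "Asha", "region": "North"},
             2: {"name": "Ravi", "region": "South"}}
products = {10: {"product": "Printer", "category": "Hardware"},
            11: {"product": "Toner", "category": "Supplies"}}
sales = [
    {"customer_id": 1, "product_id": 10, "amount": 9000},
    {"customer_id": 1, "product_id": 11, "amount": 1200},
    {"customer_id": 2, "product_id": 11, "amount": 1500},
]


def denormalize(sales, customers, products):
    """Join each sale with its customer and product attributes at load time."""
    wide = []
    for sale in sales:
        row = {"amount": sale["amount"]}
        row.update(customers[sale["customer_id"]])  # repeated for every sale
        row.update(products[sale["product_id"]])    # repeated for every sale
        wide.append(row)
    return wide


def total_by(rows, key):
    """Sum the amount per value of one column of the wide table."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row["amount"]
    return totals


wide_sales = denormalize(sales, customers, products)
sales_by_region = total_by(wide_sales, "region")  # no join needed at query time
```

The price is redundancy: the name "Asha" is now stored twice. That is acceptable in a warehouse because the data is loaded once and then read many times.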
Metadata serves as:
• A directory to help locate the contents of the data warehouse.
• A guide to the mapping of data as the data is transformed from the operational
environment to the data warehouse environment.
• A guide to the algorithms used for summarization between the current data and the
summarized data, etc.
In a typical implementation, the data warehouse application is coupled to the warehouse
via the metadata, allowing changes to the data warehouse to be immediately reflected in the
end-user data-access application. For example, if a corporation restructures to eliminate a
layer of management, as soon as the data corresponding to the new organizational hierarchy
is added to the warehouse, the application should "reconfigure" itself using metadata to
reflect the new hierarchy.
W.H. Inmon (1993), in his landmark work Building the Data Warehouse, offers the
following definition of a data warehouse: “A data warehouse is a subject-oriented,
integrated, time-variant, non-volatile collection of data in support of management’s
decision making process.”
Subject oriented means that the data warehouse focuses on the high-level entities of the
business; in the case of marketing, subjects such as consumers, their income, their addresses,
sales figures etc. This is in contrast to operational systems, which deal with processes
such as bill payments etc.
Integrated means that the data is stored in a consistent format (i.e., naming conventions,
domain constraints, physical attributes, and measurements). For example, production systems
may have several unique coding schemes for parts. In the data warehouse there is only one
coding scheme.
Time variant means that the data is associated with a point in time (i.e. fiscal year, pay
period etc.).
Lastly, non-volatile means that the data does not change once it gets into the data
warehouse.
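Three of these properties can be illustrated with a toy store. The sketch below is an assumption-laden example, not a real warehouse design: the plant names, part codes and the `Warehouse` class are all invented. Source-specific part codes are mapped to one scheme (integrated), every row carries a period (time variant), and rows can only be appended, never updated (non-volatile).

```python
# Hypothetical mapping from each source system's part code to the single
# coding scheme used inside the warehouse.
PART_CODE_MAP = {
    ("plant_a", "P-001"): "PART0001",
    ("plant_b", "1"): "PART0001",
    ("plant_b", "2"): "PART0002",
}


class Warehouse:
    def __init__(self):
        self._rows = []  # (warehouse_code, period, quantity)

    def load(self, source, part_code, quantity, period):
        # Integrated: translate the source code before storing.
        code = PART_CODE_MAP[(source, part_code)]
        # Time variant: each row is tied to a period.
        # Non-volatile: rows are appended; no update or delete is offered.
        self._rows.append((code, period, quantity))

    def quantity(self, code, period):
        return sum(q for c, p, q in self._rows if c == code and p == period)

    def rows(self):
        return tuple(self._rows)  # read-only snapshot


wh = Warehouse()
wh.load("plant_a", "P-001", 40, "1998-06")
wh.load("plant_b", "1", 60, "1998-06")
wh.load("plant_b", "2", 5, "1998-07")

# Both plants' records for the same part are now comparable.
june_total = wh.quantity("PART0001", "1998-06")
```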
[Figure: data from the Legacy Database, the Operational Database and External Data Sources
passes through Extract, Transform and Maintain steps into the Data Warehouse, which also
holds the Metadata.]
Fig. 2: Data Warehouse Architecture.
While designing and building a data warehouse, several important points must be kept in
mind, a few of these points being:
• To support accelerated decision making, the right information should be available
and easily accessible at the right time.
• The effort needed to create the infrastructure to support the data warehouse should
not be underestimated.
• The requirement definition (to build a data warehouse) is more difficult because a
data warehouse requires developing a system to support undefined requests.
• The data warehouse is not an operational system that people have to use to do
their jobs. It has value, however, only if used.
Need for a Data Warehouse:
The data warehouse (DW) concept sprang from the growing competitive need to
quickly analyse information. Existing operational systems cannot meet this need because:
• They lack on-line historical data.
• The data required for analysis resides in different operational systems.
• The query performance is extremely poor, which in turn impacts the performance of
operational systems.
• The operational DBMS designs are inadequate for decision support.
As a result, information stored in operational systems is inaccessible to planners and
decision-makers. A data warehouse eliminates these problems by storing, in a single
consolidated system, the current and historical data from disparate operational systems
that decision-makers need. This makes data readily accessible to all in the organisation
who need it, without interrupting on-line operational workloads.
Types of Data Warehouses:
There are two major approaches that differ greatly in scale and complexity: the data
mart and the data warehouse.
A "data mart" is a department-oriented or function-oriented data warehouse. It is a
scaled-down version of a data warehouse that focuses on the local needs of a specific
department like finance or purchase. Being department oriented, a data mart contains a
subset of the data that would be in an organisation's data warehouse. An organization may
have many data marts, each focused on a distinct organizational activity (distinct
functional domains like finance, purchase, planning, human resource, etc.).
Table 2. Differences between Data Marts and Data Warehouses
Attribute | Data Mart | Data Warehouse
Scope | Department | Enterprise
Effort
  Time to build | ~Months | ~Years
  Cost to build | Few Lakh(s) of Rs. | Crore(s) of Rs.
  Complexity to build | Low to Medium | High
Data
  Requirements for sharing | Shared (within business area) | Common (across enterprise)
  Sources | Few operational and external systems | Multiple operational and external systems
  Size | Megabyte to low gigabyte | Gigabytes to terabytes
  Time Horizon | Near-current and historical data | Historical data
  Amount of data transformation | Low to Medium | High
  Frequency of updates | Daily, weekly | Weekly, monthly
Technology
  Hardware | Intel-based (or compatible) computers and minicomputers | Minicomputers and mainframe computers
  Operating System | NT | UNIX, MVS, and others
  Database | Workgroup database servers | Large (enterprise) database servers
Usage
  Number of concurrent users | Tens | Hundreds
  Types of users | Business area analysts and managers | Enterprise analysts and key senior executives
  Business focus | Optimising activities within the department/business area | Cross-functional optimisation and decision making
A data warehouse is an orderly and accessible repository of known facts or things
from many subject areas, used for decision making. In contrast to the data mart approach, the
data warehouse is generally organization-wide in scope. Its goal is to provide a single,
integrated view of the enterprise's data, spanning all the enterprise's activities. The data
warehouse consolidates the various departmental perspectives into a single enterprise
perspective. Because the data mart is more focused than the data warehouse, its creation and
maintenance involve less complexity. The major differences between data marts and data
warehouses are listed in Table 2.
Data Warehouse Functions:
[Figure: flow from operational and external data through six functions.
Access: operational and external data.
Transform: cleanse, reconcile, enhance, summarize, aggregate, stage, join multiple sources.
Distribute: populate on demand.
Store: relational data, specialized cache, multiple platforms and hardware.
Find: information catalogue, business views, models.
Display & Analyze: query and reporting, multi-dimensional analysis, data mining (mainly
statistical).]
Fig. 3: Data Warehouse Functions / Components.
Fig. 3 represents the flow of data from its original source to the user, and includes
management and implementation capabilities. For example, access mechanisms are required to
retrieve data from heterogeneous operational databases. That data is then transformed and
delivered to the “data warehouse store” based on a selected model (or mapping definition).
The metadata defines this model and the transformation of the original data. The data
transformation and movement processes are executed whenever an update to the warehouse data
is desired. The data warehouse management software has the capability to manage and automate
the processes required to execute these functions.
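The access, transform and store steps described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the two sources, their field names and the `METADATA` mapping are invented for the example, and real warehouse management software does far more (scheduling, logging, error handling).

```python
# Hypothetical records from two heterogeneous operational sources.
legacy_rows = [{"CUST_NM": " asha ", "AMT": "9000"},
               {"CUST_NM": "Ravi", "AMT": "n/a"}]
oltp_rows = [{"customer": "Ravi", "amount": 1500.0}]

# Metadata: for each source, how its fields map onto the warehouse model.
METADATA = {
    "legacy": {"customer": "CUST_NM", "amount": "AMT"},
    "oltp": {"customer": "customer", "amount": "amount"},
}


def access(source):
    """Access step: fetch the raw rows of one source."""
    return {"legacy": legacy_rows, "oltp": oltp_rows}[source]


def transform(source, row):
    """Transform step: apply the metadata mapping, then cleanse and reconcile."""
    mapping = METADATA[source]
    try:
        amount = float(row[mapping["amount"]])
    except (TypeError, ValueError):
        return None  # cleanse: drop rows whose amount cannot be read
    customer = str(row[mapping["customer"]]).strip().title()
    return {"customer": customer, "amount": amount, "source": source}


def refresh(store):
    """Run the whole flow; meant to be executed whenever an update is desired."""
    for source in METADATA:
        for row in access(source):
            clean = transform(source, row)
            if clean is not None:
                store.append(clean)
    return store


warehouse_store = refresh([])
```

Because the field mapping lives in `METADATA` rather than in the code, adding a third source in this sketch only requires a new mapping entry and a way to access its rows.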
Data Warehouse Architecture:
Each implementation of a data warehouse is different in its detailed design (a
schematic high-level view of the architecture and its components is given in Fig. 4), but
all are characterised by the following key components:
• A data model to define the warehouse contents. This model is different for every
implementation, but the utility and success of the data warehouse depend to a
large extent on how well this data model reflects the type of processing that will
be done on the stored data and on how well the warehouse reflects the business
process. Modelling the data warehouse (a subject-oriented database, as detailed above)
must take into consideration the following issues: a) what business
process is being modelled, b) what are the measures or facts (information) to
be stored, c) at what level of detail (granularity) is “active” analysis conducted, d)
what do the measures have in common (the “dimensions”), e) what are the
dimensions’ attributes, and f) are the attributes stable or variable over time and is
their “cardinality” bounded or unbounded.
• A carefully designed warehouse database, whether hierarchical, relational, or
multidimensional. While choosing a DBMS it must be kept in view that the
database management system should be powerful enough to handle huge amounts
of data, running up to terabytes. Well known relational DBMS products include DB2,
Informix, Oracle, Sybase, etc.; multidimensional DBMSs are offered by
Kenan Systems Corp. (Acumate ES), Dimensional Insight (CrossTarget), Informix
(MetaCube) etc.
[Figure: data from the Legacy Database, the Operational Database and External Data Sources
passes through Extract, Transform and Maintain steps into the Data Warehouse (with its
Metadata), which serves the front-end tools: query and reporting, multidimensional analysis
tools, other OLAP tools, and data mining tools.]
Fig. 4: Schematic view of the Data Warehouse Architecture.

• A front end for the Decision Support System (DSS), for reporting and for structured
and unstructured analysis.
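The data model component listed above (measures, dimensions, dimension attributes and granularity) is often laid out as one fact table surrounded by dimension tables. The sketch below assumes that layout; the table names, keys and figures are invented for illustration only.

```python
from collections import defaultdict

# Dimension tables: descriptive attributes of each dimension (made-up values).
dim_product = {1: {"name": "Printer", "category": "Hardware"},
               2: {"name": "Toner", "category": "Supplies"}}
dim_region = {1: {"state": "UP", "zone": "North"},
              2: {"state": "Kerala", "zone": "South"}}

# Fact table: the measure ("amount") at a chosen granularity of one row
# per product, region and month.
fact_sales = [
    {"product_id": 1, "region_id": 1, "month": "1998-01", "amount": 9000.0},
    {"product_id": 2, "region_id": 1, "month": "1998-01", "amount": 1200.0},
    {"product_id": 2, "region_id": 2, "month": "1998-02", "amount": 1500.0},
]


def roll_up(facts, dimension, key, attribute, measure="amount"):
    """Total a measure by any attribute of one dimension."""
    totals = defaultdict(float)
    for fact in facts:
        totals[dimension[fact[key]][attribute]] += fact[measure]
    return dict(totals)


amount_by_zone = roll_up(fact_sales, dim_region, "region_id", "zone")
amount_by_category = roll_up(fact_sales, dim_product, "product_id", "category")
```

The granularity chosen for the fact table sets a floor on the analysis: with monthly rows, as here, no daily question can be answered later.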
The data warehouse allows the storage of data in a format that facilitates its access, but
if tools for deriving information and/or knowledge and presenting them in a format that is
useful for decision making are not provided, the whole rationale for the existence of the
warehouse disappears. Various technologies for extracting new insight from the data
warehouse have come up, which we classify loosely as "Data Mining Techniques".
Data Mining:
The process of extracting new information from the repository of historical data (the
data warehouse) using advanced statistical and artificial intelligence techniques is known
as data mining. Data mining requires prospecting: the exploration that constantly guides
mining operations. To get the best out of the data mining process, home-grown data is not
sufficient; one may also have to overlay outsourced data on it, viz. demographics,
geographical information, weather and climate patterns, economic and social indicators etc.
On-Line Analytical Processing (OLAP), though strictly speaking not a data mining
technique, is an efficient architecture for performing complex analyses from a business
perspective while hiding the complexity of the underlying data structures. OLAP is also
called multidimensional analysis, as it typically involves analysis of trends and
comparisons across business dimensions such as product, sales region, or distribution
channel, via analytical operations such as data consolidation, drill-down, and slicing and
dicing. OLAP tools allow the user to analyse complex data relationships quickly and easily,
using historical, projected, and derived data to provide detailed reports. OLAP relies on
the user to provide the path or route through the data for the analysis, but databases in
data warehouses are often so large and complex that they cannot be analysed adequately with
repetitive queries and reports. In such situations, data mining tools can be used to
automate the decision-support process and find facts hidden in databases (Fig. 5). Using a
combination of machine learning and database technology, data mining tools find patterns in
data and infer rules about the patterns, essentially finding answers to questions that
users do not know to ask. Techniques such as multidimensional analysis are then employed to
evaluate the implications of the inferred rules, and the information is presented in a
suitable form with graphics, reports, text, and hypertext.
[Figure: pyramid rising from Operational Database Management Systems and the Data Warehouse
(data), through Data Analysis (information) and Data Mining (knowledge), to Data
Presentation (business decisions).]
Fig. 5: The Information Value Chain and Information Value Pyramid.
On-Line Analytical Processing (OLAP):
On-line analytical processing is the next logical step beyond query and reporting.
OLAP software tools deliver the technological means for complex business analysis by
enabling end-users to analyse data in a multidimensional environment. With OLAP tools,
users can analyse and navigate through data to discover trends, spot exceptions, and get
underlying details to better understand the on-going process. One example of an OLAP tool is
the Plato OLAP tool bundled with Microsoft SQL Server 7.xx.
A user’s view of the enterprise is typically multidimensional in nature. Fertiliser
consumption, for example, can be viewed in three dimensions: type (N, P and K), time and
region. To be effective, OLAP tools must therefore allow multidimensional
"visualisation" and analysis of data.
Analysis requirements span a spectrum from statistics to simulation. The two forms of
analysis most relevant in this context are commonly known as “slice and dice” and
“drill-down”.
“Slice” means the facility to view data along any dimension. For example, if
the fertiliser consumption data is available across the three dimensions of type, region and
time, the user can fix any one of the dimensions, say time as 1998, and view the data
distributed across type and region. This allows the user to view the data in a more specific
context. Dicing is the facility to rotate the data about any particular dimension.
“Drill-down” is the technique by which data that is presented in a summarised
form is expanded to show more detail. This allows the user to navigate or “drill”
through information to get more detail. During data analysis, a user can spot exceptions.
Using OLAP data navigation, the user can drill down through levels of data to get more
details that help answer “why” questions about the exceptions. For example: why was the
consumption of phosphate fertiliser in UP state low in June 1998 as compared to January 1998?
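Slicing and drilling down can be sketched on a tiny fertiliser cube. The consumption figures, the extra "Punjab" region and the helper names below are invented for illustration; the cube is simply a dictionary keyed by (type, region, year, month).

```python
from collections import defaultdict

DIMENSIONS = ("type", "region", "year", "month")

# (type, region, year, month) -> consumption; all figures are made up.
cube = {
    ("N", "UP", 1998, 1): 100, ("N", "UP", 1998, 6): 90,
    ("P", "UP", 1998, 1): 60, ("P", "UP", 1998, 6): 20,
    ("P", "Punjab", 1998, 1): 40, ("P", "Punjab", 1998, 6): 35,
    ("K", "UP", 1997, 6): 10,
}


def slice_cube(cube, **fixed):
    """Slice: keep only the cells that match the fixed dimension values."""
    result = {}
    for coords, value in cube.items():
        cell = dict(zip(DIMENSIONS, coords))
        if all(cell[dim] == wanted for dim, wanted in fixed.items()):
            result[coords] = value
    return result


def summarise(cube, *by):
    """Total the cells over every dimension not named in `by`."""
    positions = [DIMENSIONS.index(dim) for dim in by]
    totals = defaultdict(int)
    for coords, value in cube.items():
        totals[tuple(coords[i] for i in positions)] += value
    return dict(totals)


# Slice: fix time at 1998 and view the summary by fertiliser type.
year_1998 = slice_cube(cube, year=1998)
by_type = summarise(year_1998, "type")

# Drill down: expand the phosphate total into region and month detail.
p_detail = summarise(slice_cube(year_1998, type="P"), "region", "month")
```

In this made-up data the drill-down exposes the exception the text asks about: phosphate consumption in UP falls from 60 in January to 20 in June, while Punjab stays nearly level.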
Although OLAP users do not formulate questions in advance, OLAP still requires the
user to select paths through the data, which limits the findings to the areas pursued. The
results of OLAP analysis help organisations answer specific questions like “Who are my best
customers for this product in this region?” But OLAP cannot answer general questions like
“How can I segment my customer base for better targeting of products manufactured by us?”
Data mining helps the user to ask more general questions, ones that do not
automatically limit the results. For example, “What sort of customers should we be
targeting?”, “Which airline passengers flew to Bombay last month and might be invited to
respond to special pricing on tickets for this month?” or “Which customers bought computers
but no printers last month, so that we can entice them with a discount on a new printer?”
Data Mining:
Database mining or data mining (DM), formally termed Knowledge Discovery in
Databases (KDD), is a process that aims to use existing data to derive new facts and to
uncover new relationships previously unknown even to experts thoroughly familiar with the
data. It is like extracting precious metal (say gold) and/or gems, hence the term “mining”.
It is based on the filtration and assaying of a mountain of data “ore” in order to get
“nuggets” of knowledge. The data mining process is shown diagrammatically in Fig. 6.
[Figure: Data Sources (1, 2, ..., N) feed the Data Warehouse. Select yields Selected Data,
Transform yields Transformed Data, Mine yields Extracted Information, and Assimilate yields
Assimilated Information.]
Fig. 6: The Data Mining Process.
Humans are especially adept at both of these tasks (filtration and assaying of data), but
the brain makes such advances slowly and sporadically. Computer databases pose additional,
unique problems:
• Database structures are highly complex. They contain numerous tables connected
through abstract linkages that the mind finds difficult to trace. Very efficient data
gathering mechanisms have thus led to the decision-making capabilities of the
management drowning in a sea of information.
• Digitised databases are hidden from sight, so the details in the records are unseen
and unanalysed.
• The size and nature of the databases make it impossible for the mind to detect
hidden patterns and ill-formed relationships.
Data mining includes the identification of relationships that would have gone
undetected without the application of specialised approaches. For example, one application
determined that certain bank customers with occasional overdrafts and characteristic deposit
histories were especially good candidates for home equity loan advertising. In another case,
a fraud detection system identified a fraudulent mortgage unit that changed names frequently
and defrauded many different banks, duplicating in minutes the findings of a team of
investigators who had worked with the same data for two years.
Data mining helps the user to discover the right questions to ask, based on the patterns
found in the data. It is a strategic tool that can uncover patterns in data already owned by
the enterprise, allowing the enterprise to build a more effective customer relationship,
which is now recognised to be a very powerful competitive tool. It automates the process of
knowledge discovery, helping to pinpoint particular areas of interest, predict outcomes, and
extend the capabilities offered by other business intelligence tools like OLAP tools.
Whereas predefined and ad-hoc access tools provide top-down, query-driven data
analysis, data mining provides bottom-up, discovery-driven data analysis (also known as
“knowledge discovery”). The predefined and ad-hoc access tools allow users to test their
theories or hypotheses repeatedly by exploring the data. In contrast, data mining identifies
facts or conclusions by sifting through the data to discover patterns or anomalies. Data
mining tools typically access more granular data than ad-hoc query tools. Data mining
complements predefined and ad-hoc access tools by enabling users to discover new
relationships in the data that they may have overlooked, such as those that help explain
consumer behaviour. Unlike analytical tools, it can be automated to run continuously in the
background, saving significant time. Data mining does not necessarily require a data
warehouse to be effective, but the presence of a data warehouse makes the data mining
operation easier. A data mining project requires a well-formulated strategy for designing
the warehouse, choosing the tools to mine it, and then following up with actions based on
the mined knowledge (Fig. 7).
[Figure: the data mining process of Fig. 6 (Data Warehouse, Select, Selected Data,
Transform, Transformed Data, Mine, Extracted Information, Assimilate, Assimilated
Information) framed by project steps. Before mining: define the problem, scope the project,
identify data sources, form the team. After mining: take action, measure results, assess
permanent adoption.]
Fig. 7: The Data Mining Project.
Data mining derives business intelligence from the data warehouse by using advanced
analytic techniques such as neural networks, logic (heuristics, inductive reasoning, and fuzzy
logic), tree-based models, and advanced statistical techniques (cluster analysis, discriminant
analysis, logistic regression, survival testing, hypothesis testing, perceptual mapping and
conjoint analysis; these are appropriate for answering “why” and “how” questions).
Neural networks have been used to forecast electronic network and component
failure, identify loan applicants who are likely to default, carry out image recognition, and
perceive stock and bond market fluctuations. Surprisingly accurate predictions and
identifications have been made by neural networks in areas in which human experts have had
difficulty defining and programming traditional systems to perform these tasks.
Logical inference theory has been used to locate relationships, and examples of
relationships, that may be suspected but unverified. For example, suppose one had a database
of persons who contracted a variety of diseases and another database that contained
genealogical relationships among the people whose disease histories are found in the first
database. Some pattern-matching, rule-based algorithms are capable of determining which
diseases might be genetic because they exhibit characteristics found predominantly in males
or commonly found both in parents and in children. In business, these pattern-matching
systems can be used to identify airline passengers who travel on particular routes, or
potentially fraudulent credit requests that differ from an individual’s normal buying
behaviour, etc.
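As a rough illustration of flagging a credit request that differs from an individual's normal buying behaviour, the sketch below uses a single threshold rule on the customer's own purchase history. This is a deliberately simple stand-in, not the rule-based inference systems described above; the customer identifiers, the amounts and the threshold of three standard deviations are all assumptions made for the example.

```python
from statistics import mean, pstdev

# Hypothetical purchase histories (amounts in Rs.) for two customers.
history = {
    "cust_1": [1200, 900, 1100, 1000, 950],
    "cust_2": [50000, 48000, 52000],
}


def is_unusual(customer, amount, history, threshold=3.0):
    """Rule: flag an amount far outside the customer's own past behaviour."""
    past = history[customer]
    mu, sigma = mean(past), pstdev(past)
    if sigma == 0:
        return amount != mu  # no variation so far: any change stands out
    return abs(amount - mu) / sigma > threshold


flag_small_spender = is_unusual("cust_1", 50000, history)  # far above usual
flag_big_spender = is_unusual("cust_2", 50000, history)    # typical for them
```

The same amount is suspicious for one customer and ordinary for another, which is why the rule compares each request with that individual's history rather than with a global limit.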
The four major operations of data mining are predictive modelling, database
segmentation, link analysis, and deviation detection.
• Predictive Modelling: a form of inductive reasoning that uses neural networks
and inductive reasoning algorithms (rule-based models) to create software models
that can be used to predict future situations, such as which customers are likely to
leave for the competition.
• Database Segmentation: partitions the data into clusters using statistical cluster
analysis techniques.
• Link Analysis: identifies connections between records, based on association
discovery and sequential patterns.
• Deviation Detection: detects, and explains why, records cannot be put
into specific segments.
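Of the four operations listed above, link analysis by association discovery is the easiest to show in a few lines. The sketch below counts how often pairs of items occur together and keeps the rules that pass minimum support and confidence thresholds. The baskets and both thresholds are invented, and the brute-force pair counting is only suitable for toy data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records, one set of items per customer visit.
baskets = [
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer", "paper"},
    {"printer", "paper"},
    {"computer", "printer", "toner"},
]


def association_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for item pairs that qualify."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # share of lhs baskets with rhs
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = association_rules(baskets)
```

With these baskets only the computer and printer pair survives both thresholds, in both directions; a rule of this kind is what would support a printer discount aimed at computer buyers.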
Text Mining:
Organisations generate, collect and hold large volumes of data, which they use
in day to day operations. These data are mostly numeric or textual.
A large number of tools and products are available to analyse the stored numeric data
and generate valuable information from it. Text based databases take various
forms, like:
a) Electronic mails from customers, containing feedback about products and services.
b) Internal documents such as memos and presentations, which embody corporate
expertise.
c) Technical reports which describe new technology (Patent Information Systems).
d) News wires carrying information about the business environment and the
activities of competitors.
Many organisations are unable to capitalise fully on the value of this
data because the information implicit in it is not easy to discern. The need for tools
to deal with such databases is already large. This implies an opportunity to make
more effective use of repositories of business communications, and other unstructured
data, by using computer analysis. Since these databases contain only implicit and not
explicit information, they need specialised software tools. Text mining is a new and
emerging technology that promises to discover hidden patterns and extract valuable
information from text data.
Features:
Unlike most search engines, text mining tools, in addition to searching, have
built-in intelligence and can train the system, in a manner parallel to neural network
methodology. Typically, text mining broadly performs the following:
. Feature extraction - finding the key single-word or multi-word concepts in a
document or a collection of documents.
. Clustering - discovering predominant themes in a document collection.
An added feature of clustering is that it can learn by itself and act intelligently.
The clustering algorithm can take a query and determine which areas and data sources
need to be searched and eliminate the others, drastically reducing information query
and retrieval times.
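The idea of extracting key terms and then using themes to narrow a search can be sketched as follows. This is a simplified illustration under stated assumptions: documents are taken to be already grouped by source, the grouping (clustering) step itself is not shown, and the stop-word list, function names and sample texts are hypothetical.

```python
import re
from collections import Counter

# A tiny, assumed stop-word list; real tools use far larger ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for", "on", "with"}


def key_terms(text):
    """Feature extraction: count the lower-cased words that are not stop words."""
    return Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS)


def theme_vocabulary(documents, top_n=3):
    """The most frequent key terms across one group of documents."""
    total = Counter()
    for doc in documents:
        total += key_terms(doc)
    return {term for term, _ in total.most_common(top_n)}


def sources_to_search(query, themes):
    """Keep only the sources whose theme vocabulary overlaps the query terms."""
    wanted = set(key_terms(query))
    return [name for name, vocab in themes.items() if wanted & vocab]


# Made-up sample documents, for illustration only.
sources = {
    "customer_email": [
        "The printer jams on every page",
        "Printer driver refund request",
        "Refund for the faulty printer",
    ],
    "tech_reports": [
        "New battery patent filed",
        "Battery chemistry report and patent survey",
    ],
}
themes = {name: theme_vocabulary(docs) for name, docs in sources.items()}
hits = sources_to_search("printer refund status", themes)
```

Here a query about printer refunds is routed to the customer e-mail collection only, so the technical reports need not be searched at all. That pruning is what reduces query and retrieval time.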
Functions:
Much of the benefit of text mining lies in the combination of its various
functions. Some of these functions include:
. Language Identification - to identify the language in which a document is
written.
. Feature Extraction - to recognise significant vocabulary items in a document, such as
names, abbreviations, dates and currency amounts.
. Clustering - to group documents by the themes they share.
. Categorisation - to assign documents to pre-existing categories.
. Visualisation - to present the information in a way which is easy to understand.
Both data mining and text mining are well-established and widely used tools.
As regards the mining of GIS/video documents, the technology is still evolving, and it may
take some more time before it stabilises.
Data Warehouses - the `Ideal' Solution v/s Data Marts:
ADVANTAGES OF DATAWAREHOUSING:
. Centralized storage of information, reducing redundancy.
. Ensures data integrity.
. Common understanding of data across the enterprise.
. Effort of data extraction and loading is done once.
. More efficient use of hardware and networking resources.
ADVANTAGES OF DATA MARTS:
. Quicker return on investment.
. Less costly (time, money, personnel).
. Fewer variables to control and hence lower risk.
. Less chance of scope escalation.
. Possible to implement even when a common understanding of business across
departments does not exist.
. Less data, fewer conflicts, and simpler models.
. Less complex in terms of the types of users, and hence in the design of summary tables.
. More focussed on specific business problems.
DISADVANTAGES OF DATAWAREHOUSING:
. Long term, longer time for ROI.
. A high number of users have to agree to spend time and money for the process to
succeed. One person can stop the process.
. Differences among persons handling functional areas of LoBs can bring the project
to a halt.
. Common metadata cannot be achieved when a common understanding of business does
not exist.
. More types of users means that complex summaries and applications need to be built.
. Scope escalation, as requirements change through the length of the project.
. Higher risk as the number of variables increases.
DISADVANTAGES OF DATA MARTS:
. Duplication of effort in data extraction and cleaning.
. Duplication of data at times, with integrity problems.
. Different understanding of data between departments.
. Possibility of different standards.
Reasons for Failure:
A data warehouse can only be successful if the project is carried out meticulously. Some of
the main reasons for failure, to be kept in mind before going for a data warehouse, are:
. Cost overruns, caused by wrong estimation of hardware/network resources.
. Time overruns, and changing priorities.
. Scope escalation.
. Lack of focus on the main issue(s).
. Non-co-operation by one or more user groups.
Myths about Data warehouses:
Myth: Data warehouse is a repository of all historical data
Reality: Data warehouse is historical data required for decision support, arranged by subject
area.
Myth: Data warehouse is always a very large database.
Reality: Data warehouse could be small. It depends on the kind of business, and the amount
of information required for solving business problems.
Myth: Complexity of data warehouse comes from the size of data and number of users.
Reality: Complexity comes from the multiplicity of data sources, and the different types of
users/subject areas.
Myth: Data marts are smaller data warehouses
Reality: Data marts are focused, subject-specific data warehouses. Actual database size does
not determine whether it is a warehouse or a mart.
Myth: Data marts have fewer users
Reality: Data marts have fewer types of users. If a solution has to be used by 100 sales
managers, it is still a data mart. If it is used by 5 sales managers, 5 production managers and
the CEO, it turns into a warehouse.
Data Mining Tools:
Numerous tools specifically designed for data mining are available. The tools differ
substantially in the types of problems they are designed to address and in the ways in which
they work.
Table 3: Data Mining Tools
Product                          Company               URL
Intelligent Miner                IBM                   www.ibm.com
Data Mart Suite/Express          Oracle                www.oracle.com
Data Mine                        Red Brick Systems     www.redbrick.com
Discovery Server                 Pilot Software        www.pilotsw.com
Enterprise Miner                 SAS Institute         www.sas.com
Express                          IRI Express           www.express.com
Business Miner                   Business Objects      www.businessobject.com
Meta Cube/Informix/Red Bricks    Informix              www.informix.com
MineSet                          Silicon Graphics      www.sgi.com
Scenario                         Cognos Corporation    www.cognos.com
Seagate Holos                    Seagate Software      www.seagate.com
SPSS - Clementine                SPSS Inc.             www.spss.com
XpertRule                        Attar Software        www.attar.com
Tools that use advanced statistical techniques, neural networks or genetic algorithms
are also available (Table 3 provides information about some other data mining tools
available).
Data Mining Applications
Typical data mining applications could include:
. Consumer segmentation on similar buying behaviour.
. Profiling customers for individual relationship management.
. Increasing response rate from mailshots.
. Identifying the most profitable customers and the underlying reasons.
. Understanding why customers are leaving for the competitors – it can provide
information like: customer dissatisfaction peaks during holidays, when the
company’s on-line staff runs a skeletal service; using such information the
management can plan future strategies or tactics.
. Uncovering factors affecting purchasing patterns, payments, and response rates.
. Detection of fraudulent credit card transactions or insurance claims.
. Preparing for utility demand (telecom sector, transport, energy and water).
. Anticipating a customer’s future actions based on current histories and
characteristics.
Other application areas can be mass customisation, cross-selling, demand
forecasting, inventory control, machine (part) maintenance, risk analysis, multi-product
campaigning, etc.
Data mining enables organisations to take full advantage of the investments they have
made, and are currently making, in building data stores. Decision-makers can tap the unique
opportunities that data mining offers; thus, large corporate houses are capitalising on their
databases and becoming sole proprietors of competitive advantage.
Due to the strategic nature of these applications, organisations normally do not
discuss their use of data mining and data warehousing technology with outsiders. Even
so, the number of reported instances of data mining being utilised effectively is
growing at a very fast pace.
More specifically, the following are the reported applications (or planned uses) of data
mining techniques in the business environment (both Indian and foreign):
. A medical supplier company increased its return on advertisement by targeting
doctors who were most likely to make a second purchase.
. A collection agency improved its ability to determine which delinquent accounts
were most likely to be collectable.
. A bank initiated an automobile loan campaign by predicting which customers were
likely to buy a new car.
. A researcher discovered the conditions under which it was most likely that
companies would take corporate write-downs.
. A health insurance company discovered that understaffed medical units were
sending patients for tests as a means of warehousing them until staff could deal with
them.
. A telecom company found out how cricket telecasts affect
telecommunication services.
. A leading bank discovered how punctuality at the bank’s teller counters determines
daily cash outflows.
. An insurance company searched its data for patterns indicating fraud(s).
. A telephone company began predicting which of its new customers were likely to
turn over (to the competition) in a short period of time, limited its advertising to
them, and increased its customer retention ratio by evaluating their payment
patterns.
. A life insurance company discovered the patterns that lead to early cancellation of
insurance policies.
. A cosmetic company found the sales of moisturising lotion to be below average
in northern India last year (1998), as humidity levels were above normal that year.
. A production manager came to know that the mean time between failures of industrial
refrigeration systems sold has a direct correlation with the age of the
customer.
. The BPL group is deploying statistical analysis tools to extract patterns from its sales
and component-inventory database.
. Hutchison Max and Modi-Telstra are both constructing integrated data warehouses,
so as to plumb them for insights into customer behaviour.
. Reliance Industries is setting up an enterprise-wide data warehouse and data
mining system.
. Citibank (India) is using data mining to manage customer relationships.
. Godrej-GE Appliances has built a data warehouse to use data mining to understand
its distribution chain better.
. The National Stock Exchange (NSE) is putting up a data warehouse as a prelude to
using data mining tools to manage its clearing house operations, followed by capital
market operations, and then derivatives trading.
. The State Bank of India plans to use data mining as a means for better customer
account management.
. MCI Communications (a $20 billion organisation) discovered complex patterns
in the usage of telecommunication services by different sets of customers, and is
using the findings to revise its tariffs in a way that benefits users and optimises its
revenues.
Applications in the Government:
"Data warehouse and Data mining" is a perfect means of preparing the government to
face the challenges of the next millennium.
Government departments are in the process of a paradigm shift - a transformation in
how to govern better at centre, state and district level. Officials are often forced to do more in
less time, compete with the private sector, operate with tighter budgets and smaller staffs, and
provide better service to the people. As a result, they are being forced to evaluate their core
strengths and weaknesses and find new strategies for carrying out development activities.
Information technology has served a vital role in the drive to meet these new challenges. It is
no surprise that data warehousing - one of the hottest developments in the IT industry - is
quickly becoming an integral part of IT strategy. Warehousing and mining could be the most
significant advancement in government computing in the near future.
The proven benefits of data warehousing for commercial organisations are clear.
Recent surveys indicate that a large percentage of Fortune 2000 corporations are either
planning or have already built large scale data warehouse initiatives, as a means to increase
sales, reduce costs and maximise profits. These initiatives will enable sophisticated decision
support systems to deliver necessary information throughout organisations and beyond. But
the question remains: How will public organisations benefit from jumping on the data
warehousing bandwagon? The answer lies in the realisation that information is still the
government's largest asset, and one that has not been tapped to its full potential.
Both private and public organisations face the reality that resources are limited.
Capital assets are scarce and will only continue to be so, and rightsizing and downsizing
result in limited human resources. As these resources decline, organisations continue to
amass large amounts of data - information that often holds the key to more efficient
organisational operation. However, government organisations are realising that the means of
accessing this information still has some unresolved issues. The information exists, but
creating smooth enterprise-wide access to these data stores is another matter.
To understand the in-depth relevance of "Data warehouse and Data mining" in the
government sector, one must understand the major difference between the objective of a
government/public sector undertaking (enterprise) and that of a private sector enterprise. The
objective of a government/public sector enterprise is not solely maximisation of profit, but also
economic development of the nation (as a long-term goal) and the welfare of society,
whereas a private sector enterprise is oriented towards the sole objective of maximisation of
profit. But even if the objectives of these two categories of enterprises are entirely
different, they share some features:
. To generate and process the latest, timely and up-to-date information to create an
information/knowledge base.
. Allocation of limited resources (of the nation and/or enterprise) to meet the above
objective.
Typically, the environment in the government is such that all development sectors have a
direct or indirect impact on each other and are inter-linked. For example, health has
implications for productivity. Investment in education eventually leads to higher standards of
nutrition and family planning. Investment in the fertiliser sector increases agricultural
productivity. However, resources are limited, so investment in one sector may result in lower
productivity in other sectors. One needs to study and describe these links to achieve the
common objective of national development. Moreover, to evaluate any scenario in advance, for
planning and decision making, one typically needs to develop a data warehouse covering
economic, production, national accounts, demography, agriculture, energy, health, education,
nutrition, environment data etc. In short, one needs to develop local and global data
warehouses, depending upon the needs, to strengthen decision making. Broadly, this provides
the capability of moving from "Planning in isolation" to "planning following an Integrated
Approach".
DW & DM is already being used effectively by large sales, services and marketing
organisations for activities such as database marketing, segmentation and consumer
management. There are also a large number of applications in the government, at the centre,
in the states and in attached organisations.
Some of the major application areas include the development of local and global data
warehouses from the following databases/data marts, depending upon the key objective(s).
A. Data Warehouse and "Data mining"

Ministry of Agriculture :
Production; Consumption; Agricultural Marketing; Fertiliser Consumption; Seeds;
Prices (wholesale & retail); Technology; Agricultural census; Marketing region(s);
Livestock; Crops; Agricultural credit; Plant Protection; Watershed; Area under Production;
Yields; Land use statistics; Finance & Budget etc.

Ministry of Petroleum & Natural Gas:
Marketing; Finance; Personnel; Pricing; Import; Crude & Product Production; Sale from
Oil Corporations (IOC, BPCL, HPCL & others); Marketing Division of Ministry of
P&NG.

Department of Tourism:
Foreign Tourist Arrival System (FTAS); Customer preference/behaviour Data base;
Tourism & Product Development Information; Foreign Exchange earning; Employment
Opportunities; Manpower & Training; Marketing Research; Publicity; Hotel
Classification System; Travel & Tour Operators data base.

Ministry of Rural Development :
Below Poverty line; DRDA; Drinking Water; Rural Population - census; Rural
Development schemes of the state & Central govt.; per capita income - Rural, Urban;

Ministry of Health & Family Welfare:
Health & Family Welfare MIS; Community Needs Assessment Approach:
Immunisation (mother & child health); External Aid monitoring; National Programme for
Control of Blindness (NPCB); National Leprosy Eradication Programme (NLEP); National
Malaria Eradication Programme (NMEP); National AIDS Control; Indian System of Medicine &
Homeopathy (plants, herbs, medicines etc.); Drug policy; Health law; Morbidity &
Mortality pattern; Medical Record System (Hospital); Stores; Medical & Para-medical
manpower; NGO data base; Emergency Medical Relief; Health Education; census etc.

Ministry of Energy :
CEA; MOC; MOP&NG; DC&PC; Power Plants; Non-conventional energy;

Planning Commission:
State Plans (All sectors); Labour; Health; Education; Trade; Industry; Annual Budget;
Five Year Plans; State Plan Project; Rural Development; Energy including
Non-conventional.

Department of Programme Implementation:
Central Sector Projects costing Rs.20 crores & above;

Ministry of Commerce:
Import & Export (Trade); E-Commerce; Exports & Imports data bank (8 digit
HSCODE); Foreign Trade of India (Principal Commodities and countries); Trade Policy;
Balance of Payment; World Price monitoring system; Provisional Estimates of Import &
Export.

Deptt. of Revenue :
Customs & Central Excise; Income Tax; Commercial Taxes;

Deptt. of Economic Affairs :
External Assisted Projects in the various central govt. ministries/deptts.; Budget
Expenditure; Annual Economic Survey(s).

Ministry of Welfare:
Welfare Schemes data bases; Programmes for Weaker sections of society; NGO - for
Welfare Projects/Schemes;

Ministry of Shipping & Transport:
Shipping Information System; Shipping Tonnage Information System; Chartering
Information System; Transport Statistics;

Audit & Accounts:
Govt. Accounts Data base;

Ministry of Railways; Deptt. of Coal; Department of Posts; Department of
Telecommunication; Ministry of Labour; Ministry of Civil Supplies; Ministry of
Education;

Public Utilities Departments:
CPWD; State Vidyut Board; State Development Authority;
Department; Sales Tax; Accident Tribunals; Hospitals;
B. Data Warehouse and "Text Mining"







Human Rights Commission
MRTPC
Supreme Court & High Court Judgement Cases
Parliamentary question - answers and Debate Information
Patent Information
Public Grievances
Department of Welfare
MTNL; Police


Land Records
Public Interface Departments:
Passport; Licensing Authority; Ration Card
In general, for government departments, the database generated, updated and
maintained by the corresponding public sector undertaking or attached organisation will
act as a data mart (or a local warehouse) to the global warehouse generated and
maintained in the corresponding central govt. ministry/deptt. The same is true for the state
govt. deptts. Typically, for any state planning deptt., or even for the Planning Commission, the
global data warehouse in the central ministry will act as a local data warehouse, and the
data warehouse generated and maintained at the Planning Commission will act more as a
"Super data warehouse".
Conclusion:
Organisations today are suffering from a malaise of data overflow. Developments
in transaction processing technology have given rise to a situation where the amount and
rate of data capture is very high, but the processing of this data into information that can be
utilised for decision making is not developing at the same pace. Data warehousing and data
mining (both data & text) provide a technology that enables the decision-maker in the
corporate sector/govt. to process this huge amount of data in a reasonable amount of time and
to extract intelligence/knowledge in near real time.
A data warehouse takes the organisation's operational data, historical data and external
data (Fig. 2); consolidates it into a separately designed database (which can either be
relational or multi-dimensional in nature); and manages it in a format that is optimised for end
users to access and analyse. When a data warehouse has been constructed, it provides a
complete picture of the enterprise (similarly, a data mart provides a full representation of the
business area it is designed to serve), perhaps for the first time. It provides an unparalleled
opportunity for the management to learn about their customers. Data warehouse
technology, together with online transaction processing and data mining, allows the
management to provide better customer service; create greater customer loyalty and activity;
focus customer acquisition and retention on the most profitable customers; increase revenue;
and reduce operating cost. It also provides tools that facilitate sounder decision making;
improves worker/management knowledge and productivity; spares the operational database
from ad-hoc queries and the resulting performance degradation; and clears the legacy database
system, while moving the corporate system architecture forward.
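The consolidation step described above, in which operational, historical and external data are brought together into a separately designed store arranged by subject area, can be sketched as follows. This is only an illustrative outline: the function name, the record fields and the sample rows are assumptions, and a real warehouse would add cleaning, transformation and a proper relational or multi-dimensional schema.

```python
from collections import defaultdict


def consolidate(sources):
    """Combine records from several source systems into one store keyed by subject area."""
    warehouse = defaultdict(list)
    for source_name, records in sources.items():
        for record in records:
            # Tag each row with its origin so an analyst can trace it back.
            warehouse[record["subject"]].append({**record, "source": source_name})
    return dict(warehouse)


# Made-up sample rows, for illustration only.
sources = {
    "operational": [{"subject": "sales", "year": 1999, "amount": 120}],
    "historical": [{"subject": "sales", "year": 1998, "amount": 95}],
    "external": [{"subject": "market", "year": 1999, "amount": 4000}],
}
warehouse = consolidate(sources)
sales_by_year = {row["year"]: row["amount"] for row in warehouse["sales"]}
```

Once the rows sit under one subject key, a question such as "sales by year" is answered from a single place, without querying the operational system.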
All this has become possible due to development on two fronts: a) on the hardware
front by the emergence of faster processors (which also can work in parallel configurations)
having greater computational power as compared to processors even a year ago, and reduced
data storage costs and larger and faster secondary storage devices that further decrease
processing time and provide online data in amounts that were impossible earlier and b)
emergence of new software technologies from artificial intelligence and innovative constructs
about how to carry out intelligent (and optimised) data mining.
With the incorporation of new data delivery and presentation techniques, like
Hypertext Markup Language (HTML), Open Database Connectivity (ODBC) etc., the database
mining (data & text) operation has gained widespread recognition as a viable tool for
business intelligence gathering. Advances in document mining technology (database
mining of free form text/data, in contrast to the “classical” approach to data mining of fixed
length records) are making data mining technology more powerful.
Last but not least, the Internet has emerged as the largest data warehouse of
unstructured and free form data. The new technologies are geared towards mining this great
data warehouse.
****