SQL Server Technical Article
DW 2.0 and SQL Server Architecture and Vision
Writer: W.H. Inmon
Published: October 2009
Applies to: SQL Server 2008
Summary: Architecture and data warehousing are not static. From the first notion of a data warehouse to a full-blown analytical processing architecture that includes data marts, ETL, near-line storage, exploration warehouses, and other constructs, data warehousing and its associated architecture continue to evolve. In 2008, the book on the latest evolution of data warehousing appeared – DW 2.0: The Architecture for the Next Generation of Data Warehousing (Morgan Kaufmann). In that book, the general architecture for data warehousing in its most highly evolved form appeared.
Introduction
Architecture and data warehousing are not static. From the first notion of a data warehouse to a full-blown analytical processing architecture that includes data marts, ETL, near-line storage, exploration warehouses, and other constructs, data warehousing and its associated architecture continue to evolve. In 2008, the book on the latest evolution of data warehousing appeared – DW 2.0: The Architecture for the Next Generation of Data Warehousing (Morgan Kaufmann). In that book, the general architecture for data warehousing in its most highly evolved form appeared.
Figure 1: Diagram of the granular data found in DW 2.0
Among other things, DW 2.0 recognizes the life cycle of data within the data warehouse,
recognizes the need for including textual data in the data warehouse, and recognizes that
metadata is an essential component of the data warehouse environment. Along the way, DW
2.0 recognizes that data warehouses attract large amounts of data, store that data over a
lengthy period of time, support a wide variety of processing, and, finally, can become very costly if expensive design and infrastructure decisions are made.
SQL Server in Evolution
While data warehouse architecture has been evolving, Microsoft® SQL Server® has also been evolving. From its humble origins as a database serving small amounts of data on a personal computer with very basic functions, SQL Server is now prepared to serve as the database foundation for mid-size and very large data warehousing workloads.
It is said that tuning a system can yield up to a 10 percent improvement in its capacity or performance, but that an improvement of one or two orders of magnitude requires a change in the architecture of the system. And indeed that is what SQL Server has undergone – a fundamental change in architecture from the early days of SQL Server.
Just as data warehouse architecture has evolved, so SQL Server has also been evolving.
And whether by chance or by design, SQL Server has turned into the preferred technology
platform for the most advanced form of data warehouse architecture – DW 2.0.
This means that SQL Server has advanced mightily up the evolutionary curve for serving the
data warehouse community as the database foundation for large and sophisticated data
warehouses. No longer is SQL Server limited to small amounts of data and personal computers.
With these architectural enhancements, SQL Server is ready to become the infrastructure of
choice when implementing advanced data warehouse and analytical architectures such as DW
2.0.
Aspects of DW 2.0
There are many aspects to the DW 2.0 architecture. Not all of them can be addressed in the
space of this white paper. However, some of the more profound and more important aspects of
DW 2.0 will be discussed in the context of SQL Server.
Basic Access of Data
For years the preferred storage medium for data has been disk storage. Disk storage appeared
at the time that online transaction processing was first being done. In fact, in many ways it was
the advent of disk storage that allowed online transaction processing to become a reality. Online transaction processing enables access of data for transactions by accessing disk storage randomly. To this end, data is loaded onto disk storage either by hashing
the data as it is placed on storage or by the creation of an index (or both). In accessing data for
online transactions, a random and rapid access of small amounts of data is required.
For many online applications and for many usages of data, a pattern of rapid, random access of
small amounts of data on disk storage works just fine. But when it comes to DSS (decision support system) analytical processing, the basic pattern of data access is quite different. Most DSS analytical processing is done by means of SQL, and SQL operates on sets of data, not individual records.
Therefore, for analytical processing, an access mode where the first record being sought is accessed randomly and the remaining records in the set are then accessed sequentially fits optimally with data warehouse analytical processing. Furthermore, the sets of data accessed by
DSS processing may not be small at all. On many occasions very large sets of data are
accessed. In other words, a sequential access mode for most of the data to be accessed in a
data warehouse is optimal, not a random mode of access for every record of data to be sought.
Figure 2 shows this difference in the fundamental mode of access in the OLTP environment and
the data warehouse analytic environment.
Figure 2: Illustrating the difference between random and sequential I/O
In the latest release of SQL Server, this mode of access – random access of the first record followed by sequential access of the remainder of the set – is the one that is supported. This means that at the most basic level of processing, SQL Server holds a major performance advantage over its competition.
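To make the two access patterns concrete, here is a minimal T-SQL sketch; the table names, column names, and literal values are hypothetical and purely illustrative:

    -- OLTP-style access: a rapid, random probe for a single record,
    -- typically satisfied by an index seek (one random I/O).
    SELECT CustomerName, Balance
    FROM dbo.Account
    WHERE AccountId = 123456;

    -- DSS-style access: a set-oriented query. The first qualifying row is
    -- located randomly; the remaining rows in the set are then read
    -- sequentially in a range scan over a potentially very large set.
    SELECT CustomerId, SUM(Amount) AS TotalSpend
    FROM dbo.SalesTransaction
    WHERE TransactionDate >= '2008-01-01'
      AND TransactionDate <  '2009-01-01'
    GROUP BY CustomerId;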
A Data Mart Migration Path
Another recurring problem with data warehouses and analytic processing is the fact that many
organizations prefer to build data marts first, before they build an actual data warehouse. Then
one day the organization wakes up and discovers that in addition to their data marts, they need
a data warehouse. It is at this point that there is no easy or graceful migration plan to go from
multiple data marts to a data-warehouse-centric environment. Many organizations start out in
the hope that a data mart or two are going to meet their analytical needs. But over time the problems with a data-mart-centric architecture start to appear – there is no definitive source of
corporate data, there is the need to build every data mart from scratch when a new need for
data appears, data marts are terribly brittle and need to be destroyed and rewritten when basic
business requirements change, and so forth.
In DW 2.0, the granular data forms what is called the “system of record” and
becomes the “single version of the truth” for the organization. And from the system of record,
data marts are created just as they were created in classical first-generation data warehouses.
Now SQL Server supports the easy migration from data marts to data warehouses. SQL Server
offers the ability to build small data warehouses or data marts with its Fast Track option. Then, when the volume of data grows and the need to create a full-scale data warehouse arises, SQL Server offers its SQL Server 2008 R2 Parallel Data Warehouse (previously code-named Project “Madison”) option.
Figure 3: Parallel Data Warehouse offers an easy migration path from data marts and smaller
data warehouses to enterprise-scale data warehouses
In the SQL Server 2008 R2 Parallel Data Warehouse edition of SQL Server there is the opportunity to manage very large amounts of data – up to petabytes. There is the opportunity to support this volume of data in a parallel fashion. And there is the opportunity for redundancy of components so that the system can process in an efficient and failsafe manner.
But perhaps most importantly there is the opportunity to synchronize – automatically – data
residing in the Fast Track data mart or mini data warehouse with the data managed centrally by
SQL Server 2008 R2 Parallel Data Warehouse. While there are other aspects to the conversion
of a data mart to a data warehouse, SQL Server has solved some of the hardest aspects of the
problem very nicely. If an organization commits to SQL Server as a basis for data warehouse
processing, many of the problems of migration from a data-mart-centric environment to a real
data warehouse are mitigated.
Data Warehouse Costs
Another sensitivity of DW 2.0 is the recognition that the cost of the data warehouse is an issue.
And if it is not an issue today, it will be an issue tomorrow. As the volume of data in the data warehouse grows, so grows the cost associated with data warehousing.
When the discussion of cost arises, it must be noted that the more centralized the components
of technology become, the more expensive they become. For example, suppose that an
organization needs a total of n units of processing power. The most expensive thing an
organization can do is to buy one central processor that provides n units of power. The most
cost-effective thing an organization can do is to break up the n units of power into many smaller units. Suppose that the cost of n units of power in a single processor is X. Now suppose that the n units are divided into ten units of n/10 each, and that each n/10 unit costs Y. Then:

10 × Y < X

In fact, 10 × Y is far less than X. The most expensive processing cycles are those that are found in the largest machines; the more the workload can be divided, the less expensive the processing cycles become. It is not unreasonable for Y to be 1/100th of X, in which case 10 × Y = X/10 – one tenth the cost of the single centralized processor. Therefore, from an economic standpoint, it
makes sense to take work that needs to be done and distribute that work over many different
processors. This has the effect of greatly reducing the costs of the data warehouse
environment.
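A quick worked example, using hypothetical prices rather than vendor quotes, makes the equation concrete. Suppose a single large processor delivering n units of power costs X = $1,000,000, and a small commodity server delivering n/10 units costs Y = $10,000 (that is, Y = X/100). Then:

    10 × Y = 10 × $10,000 = $100,000 = X/10

The same n units of power are obtained for one tenth of the cost of the single centralized processor.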
DW 2.0 is aware of this general cost equation. In fact DW 2.0 starts with this basic hypothesis
as the basis for all following architectural decisions. And SQL Server is also aware of this
fundamental fact of life regarding technology costs.
SQL Server accommodates the need for the distribution of processing across the data
warehousing environment. Figure 4 shows this basic understanding of the costs of technology.
Figure 4: The unique hub-and-spoke architecture for SQL Server’s Parallel Data Warehouse
SQL Server distributes the processing workload in several ways. The first way that SQL Server
supports the distribution of work across multiple locations is in support of the hub-and-spoke
architecture. Basic data management is done in the data warehouse hub, where massive
amounts of data can be handled. And end-user analytical processing is handled in the different
spokes of the architecture. In fact, within the hub processor there is a distribution of the
workload. The hub processes data differently in different places, thus avoiding a large queue
that can and will negatively impact performance.
In doing so, the costs of the infrastructure for SQL Server are held to a minimum, allowing the organization to easily and cost-effectively grow its data warehouse while achieving good and consistent performance at the same time.
Compression
Another aspect of data warehouses – recognized by both DW 2.0 and SQL Server - is the need
to store and manage a large volume of data. There are many ways that large volumes of data
can be managed. A simple way to manage volumes of data is through compression. In
compression, extraneous data is removed or stored in a minimized fashion. The techniques for
compression are especially applicable for a data warehouse because data warehouses – built
properly – do not allow data to be updated. Compression actually harms performance when
update of data is allowed because it is costly for the system to go and find data, decompress it,
update it, recompress it, and then try to replace the data in the database efficiently. But
because updating of data does not occur in the data warehouse environment, compression of
data makes a lot of sense. And indeed SQL Server allows data to be compressed.
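As a minimal sketch of how this is done (the table name is hypothetical, and data compression requires an edition of SQL Server 2008 that supports it, such as Enterprise):

    -- Rebuild an existing table with page-level compression. PAGE compression
    -- includes row compression plus prefix and dictionary compression within
    -- each page, which suits read-mostly data warehouse tables.
    ALTER TABLE dbo.FactSales
    REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- Estimate the space savings before committing to the rebuild.
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'FactSales',
        @index_id         = NULL,
        @partition_number = NULL,
        @data_compression = 'PAGE';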
Parallel Processing
But the biggest gain in the management of large volumes of data now part of SQL Server (SQL Server 2008 R2 Parallel Data Warehouse) is the parallel management of data. In parallel processing, data is stored on more than one device so that more than
one processor can access and manage data at the same time. In order to understand the value
of parallel management of data, consider what a buggy driver must do when the weight of the
buggy becomes too heavy for the horse. One alternative is to go from a regular-sized horse to
an oversized horse such as a Percheron or a Clydesdale. Percherons were bred years ago to
allow knights in armor to ride them into battle or jousting contests. And a knight in armor weighs
a lot. This strategy works well as long as there is a Percheron that is available and is less than
ten years old. But what happens if there is no Percheron available? Or what happens if the load
is too heavy for a Clydesdale to pull? Then, a team of horses – not a single horse - is needed.
And up to a point more horses can be added as the load to be pulled grows.
The same analogy applies to managing a lot of data. If a single server is overwhelmed by its
load of data, then multiple servers can be used at the same time and the load of data can be
divided over more than one server. Such an approach is called a parallel approach because
different sets of data are operated on independently and in parallel. In this way, adding more parallel servers increases the total throughput that the system can handle.
And the SQL Server 2008 R2 Parallel Data Warehouse option handles data in a parallel
fashion.
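Parallel Data Warehouse has its own syntax for declaring distributed tables, but the underlying idea can be sketched in plain T-SQL. In the hypothetical example below, each row of a fact table is assigned to one of eight servers by hashing its distribution key, so each server owns roughly one eighth of the data and can scan its share independently:

    -- A sketch of hash distribution (illustrative only, not actual Parallel
    -- Data Warehouse syntax): every row with the same CustomerId lands on
    -- the same server, and rows spread roughly evenly across all eight.
    SELECT SaleId,
           CustomerId,
           ABS(CHECKSUM(CustomerId)) % 8 AS AssignedServer
    FROM dbo.FactOrders;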
Figure 5: Parallel management of data in the SQL Server hub
Probability of Access of Data
But compression and parallel approaches to the management of data are not the only ways that
large volumes of data can be managed. DW 2.0 calls for the physical separation of data based
on the probability of access of the data. Very highly accessed data needs to be placed in high
performance storage. In this regard, a data warehouse built under SQL Server is like any other
database management system. But as the volume of data grows and the probability of access
of data drops, it no longer makes sense to store all data on high-performance storage. Not only is rarely accessed data expensive to place on high-performance storage, the unused data also gets in the way of accessing data whose probability of access is indeed high. By placing all
data on high-performance storage, the organization has the worst of all worlds – large expense
and poor performance.
In order to understand why data with a low probability of access should be removed from high-performance storage, consider that an information system is in many ways like the blood
pumping through the human body. In a young athlete running a marathon, there is very little
cholesterol. The heart pumps blood efficiently through the blood vessels of the athlete. But now
consider a lethargic couch potato. The couch potato has a lot of cholesterol in his/her body. The
heart has to work hard to pump blood through the cholesterol-clogged arteries of the couch
potato.
Dormant, unused data in a data warehouse is like cholesterol in the body. The less cholesterol there is, the more efficiently the heart pumps. The less unused data there is in high-performance storage, the more efficient it is to find the data that is being looked for.
DW 2.0 recognizes this basic fact of life and SQL Server also recognizes this fact.
Figure 6: Logical separation of data based on probability of access
SQL Server allows data to be divided according to its probability of access. In SQL Server data
can be hot, warm, or cold. By physically dividing data into different sectors, performance is greatly enhanced.
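SQL Server 2008 table partitioning gives one concrete, hedged illustration of this separation; the filegroup, table, and boundary values below are hypothetical, and the filegroups would be created beforehand on storage of the appropriate speed:

    -- Rows before 2009 are "cold"; rows from 2009 on are "hot".
    CREATE PARTITION FUNCTION pfByYear (datetime)
    AS RANGE RIGHT FOR VALUES ('2009-01-01');

    -- Map the cold partition to a filegroup on inexpensive storage and the
    -- hot partition to a filegroup on high-performance storage.
    CREATE PARTITION SCHEME psByYear
    AS PARTITION pfByYear TO (FG_COLD, FG_HOT);

    -- Create the (hypothetical) fact table on the partition scheme.
    CREATE TABLE dbo.FactOrders
    (
        OrderId   bigint   NOT NULL,
        OrderDate datetime NOT NULL,
        Amount    money    NOT NULL
    )
    ON psByYear (OrderDate);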
Streaming Data
But there is another very important feature of SQL Server that sets it apart from any other
database management system. That capability is the ability to handle streaming data.
DW 2.0 calls for the separation of online high-performance processing from integrated data processing. DW
2.0 recognizes what is termed the “interactive sector”. It is in the interactive layer that data can
be entered in a continuous, high-performance fashion. Figure 7 shows that there are two basic
divisions of data – static data and streamed data.
Figure 7: Static and streamed data
Most database management systems manage static data. Static data is data that is recorded as
a byproduct of some event occurring on an event-by-event basis. The event that occurs usually
happens in a relaxed manner. At 10:01 am there is a bank deposit. At 11:03 am there is ATM activity. Throughout the day, static data enters the system in a random, somewhat relaxed
manner. Streamed data differs from static data in that streamed data occurs and enters the
database system very rapidly and very predictably.
One way to think of streamed data is that a single event occurs, then multiple fast
measurements are made because of the occurrence of the event.
As an example of streamed data, consider an electronic eye that measures the output of steel in
a steel manufacturing environment. The event that occurs is the production of a batch of steel.
The output steel from the batch is rolled out in a bar very rapidly as the smelting process is
finished. An electronic eye takes a measurement of the steel every 12 inches, or about 1,000
times a minute. As long as the mill is producing steel, the measurements are made quickly and
consistently.
The electronic eye measures lots of variables as the steel passes by, such as:
• temperature of the steel
• chemical composition of the steel
• width of the steel bar, and so forth
In such a manner, very large volumes of data are captured in a rapid and predictable manner
triggered by a single event. This mode of capturing data into a database is called the streaming
mode. Streaming mode data fits very nicely with the interactive layer of data within DW 2.0.
Static data does not fit well or comfortably within the interactive layer of data within DW 2.0.
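Streamed data is queried with stream-oriented languages rather than ordinary SQL, but a rough T-SQL sketch (hypothetical table and column names) shows how the steel-mill measurements might look once landed in a database, along with the kind of continuous aggregate a streaming engine computes over the live stream:

    -- Hypothetical landing table for the streamed steel-mill measurements.
    CREATE TABLE dbo.SteelMeasurement
    (
        BatchId     int          NOT NULL,
        MeasuredAt  datetime     NOT NULL,
        TempCelsius decimal(6,1) NOT NULL,
        WidthMm     decimal(6,1) NOT NULL
    );

    -- A static, set-based approximation of a streaming query: the average
    -- temperature per one-minute window for a single batch. A streaming
    -- engine would emit these results continuously as the steel passes by.
    SELECT BatchId,
           DATEADD(minute, DATEDIFF(minute, 0, MeasuredAt), 0) AS MinuteWindow,
           AVG(TempCelsius) AS AvgTemp
    FROM dbo.SteelMeasurement
    WHERE BatchId = 42
    GROUP BY BatchId, DATEADD(minute, DATEDIFF(minute, 0, MeasuredAt), 0);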
Historical Data - What Does That Mean?
One of the issues that must be addressed in database technology is the meaning of
historical data or archival data. When most people think of archival data, they often think of
large archives of data that are 5 to 10 years old. Indeed that kind of data is archival data. But
when one considers the meaning of historical data, one is faced with a dilemma. Data that is
one second old is historical data. But is data that is one second old also archival data? If all historical data is considered archival data, then indeed data that is one second old is archival data.
But in most environments, calling data that is one second old archival data is simply misleading. So let us call data that is still very fresh “newly created historical data” and data that is older than that “true archival data”.
Figure 8 shows that historical data can be divided up into two classes – newly created historical
data and true archival data.
Figure 8: Two classes of historical data
This distinction of what is meant by historical data is necessary for understanding what kind of
data needs to be placed in the interactive sector. In DW 2.0 there is a sector of data called the
interactive sector. The interactive sector contains newly created historical data, such as streamed data, not archival data.
This distinction is important in understanding the true nature of DW 2.0’s interactive
sector data and processing. One vendor has created what is called the “active data warehouse”.
This vendor claims that up-to-the-second processing and analysis can be done in the active data warehouse. Unfortunately, the data that is being managed under the
active data warehouse is static data, not streamed data. And therein lies the problem. When
static data in the interactive sector is treated as streamed data, much confusion and much
inefficiency occurs. Static data requires an infrastructure that demands integrity – integrity of
transaction processing and integrity of database processing. By trying to treat static data as if it
were streamed data in the interactive sector, there is a very high and unnecessary cost of the
infrastructure. In addition the vendor that supports the active data warehouse tries to query its
static/streamed data using the query tools designed for static data. Streamed data requires an
entirely different means of query and analysis. The net result of active data warehousing is
confusion, waste, and the inability to handle streamed data as it needs to be handled. Active
data warehousing is not a form of data warehousing at all. Only when the data that is streamed
is managed by techniques fit for streaming is the interactive sector actually created.
An active data warehouse with all of its very expensive costs and technological limitations is no
substitute for handling streamed data. It is streamed data that belongs in the interactive sector
of DW 2.0, not static data.
Unlike the vendor that promulgates active data warehousing, SQL Server handles streamed
data. Figure 9 shows how SQL Server handles streamed data.
Figure 9: How SQL Server handles streamed data
Figure 9 shows that streamed data is handled separately from static data. Figure 9 shows that
streamed data is captured in the form of newly created historical data. The data found in the
streaming component – the DW 2.0 interactive sector – is accessed and manipulated by a language
suited for accessing and manipulating streams of data, not static data. The output of processing
can be a report, an answer, or the movement of streamed data into a static environment.
Microsoft will release its new streaming engine, StreamInsight, in the same timeframe as SQL
Server 2008 R2.
The Fit Between DW 2.0 and SQL Server
These, then, are merely the highlights of the very good architectural fit between the data warehouse architecture of the future – DW 2.0 – and SQL Server. Both DW 2.0 and SQL Server recognize:
• the need to handle very large volumes of data
• the need to be constantly aware of the costs of data warehousing
• the need to separate interactive streamed data and processing from other parts of data and processing
• the need to manage data in a parallel manner
• the need to divide the workload onto as many smaller components as possible
• the need for basic sequential access of sets of data
• the need to have a rational migration path from data marts and mini data warehouses to a large centralized data warehouse
• the place and position of streamed data
• the need to physically separate data based on differences in the probability of access of the data.
References
DW 2.0: The Architecture for the Next Generation of Data Warehousing, Morgan Kaufmann, 2008
For more information:
http://www.microsoft.com/madison: SQL Server 2008 R2 Parallel Data Warehouse
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5
(excellent), how would you rate this paper and why have you given it this rating? For example:
Are you rating it high due to having good examples, excellent screen shots, clear writing, or
another reason?
Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.
Send feedback.