Data Normalization Fundamentals

Tuesday 26 Jul, 2005
by Luke Chung
President of FMS Inc.
http://www.fmsinc.com
Introduction
Data Normalization Overview
Spreadsheet Gurus
Efficient Storage and Unique IDs
Data Normalization Extremes
Storing Duplicate Data
Converting Non-normalized Data
Separating into Application and Data Databases
Seek Statements
Summary
Sample Database: normalize.zip (22KB)
Introduction
The ability to analyze data in Access is a fundamental skill that all developers must master. The better y
data and knowing how to analyze it, the easier your application development will be. There are lots of w
different techniques must be used depending on your goal. However, there are a few fundamentals that


Data Normalization
Separate Databases for the Application and Data
Data Normalization Overview
There are lots of articles and books on data normalization. They usually scare people including me. I am
theoretical discussion of the pros and cons of data normalization levels. Basically, it comes down to this
data efficiently. This differs depending on the database used, so the more you understand how to manip
more obvious the way you should store data in tables and fields.
A primary goal of good database design is to make sure your data can be easily maintained over time. D
managing more records. They are terrible if fields need to be added since all its queries, forms, reports,
dependent.
Spreadsheet Gurus
Data normalization is a particularly difficult concept for spreadsheet experts. Having been a spreadshee
databases, I sympathize with those struggling to make the transition. The main reason you’re using a d
spreadsheet is probably because you have so much data you can’t manage it properly in Excel. The fundame
database is that it allows your data to grow without causing other problems. The big disaster in most sp
add new columns or worksheets (for new years, products, etc.) which cause massive rewrites of formula
difficult to debug and test thoroughly. Been there. Designed properly, databases let your data grow ove
queries or reports. You need to understand how to structure your data so your database takes advantag
your data is totally different from how you show it. So, stop creating fields for each month, quarter or y
as a field. You’ll be glad you did it:
Non-Normalized "Spreadsheet" Data
Normalized Data
Both tables in the example above contain the same data, but they are distinctly different. Notice how th
easily add more records (years) without forcing a restructuring of the table. In the non-normalized table
would require adding a field. By avoiding the need to add a field when you get more data, you eliminate
objects (queries, forms, reports, macros, and modules) that depend on the table. Basically, in database
while new columns are "expensive". Try to structure your tables so you don’t need to modify their fields
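To make the "records are cheap" point concrete, here is a minimal sketch of adding another year to the normalized table. It assumes the normalized table is named Federal Budget and has a numeric Year field, as in the TransposeData procedure later in this paper; the procedure name AddBudgetYear and the value 1998 are illustrative only.

```vba
Sub AddBudgetYear()
    ' Sketch only: assumes a table named "Federal Budget" with a numeric [Year] field.
    Dim dbs As DAO.Database
    Dim rst As DAO.Recordset

    Set dbs = CurrentDb
    Set rst = dbs.OpenRecordset("Federal Budget", dbOpenDynaset)

    ' A new year is just a new record: no field is added, so no
    ' query, form, or report that depends on the table has to change.
    rst.AddNew
    rst![Year] = 1998   ' illustrative value
    rst.Update

    rst.Close
    Set rst = Nothing
    Set dbs = Nothing
End Sub
```

In the non-normalized layout, the same change would mean adding a field named 1998 to the table and then revisiting every object that lists the year columns.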
Efficient Storage and Unique IDs
A fundamental principle of data normalization is that the same data should not be stored in multiple places.
time such as customer names and addresses should be stored in one table and other tables referencing
to it.
Unique IDs (key fields) with no connection to the data are used to link between tables. For instance, cus
stored in a customer table with a Customer ID field identifying the record. Access lets you use an AutoN
assign new ID numbers. It doesn’t matter what the ID number is or whether they are consecutive. The
other than identifying the record and letting records in other tables link to that record.
I’ve seen databases where people use ID numbers that combine a few letters of the last name, first nam
no sense and creates a mess over time. The ID should just be a number and if you want your data sorte
secondary index.
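As a rough illustration of a meaningless numeric key plus a secondary index for sorting, the sketch below creates a small table through Jet SQL. The table name tblCustomerDemo, its fields, and the index name are hypothetical and chosen so they do not collide with a real Customer table.

```vba
Sub CreateCustomerDemoTable()
    ' Sketch only: table, field, and index names are hypothetical.
    Dim dbs As DAO.Database

    Set dbs = CurrentDb

    ' COUNTER is the Jet SQL type for an AutoNumber field.
    ' The ID carries no meaning; it only identifies the record.
    dbs.Execute "CREATE TABLE tblCustomerDemo (" & _
                "CustomerID COUNTER CONSTRAINT pkCustomerDemo PRIMARY KEY, " & _
                "LastName TEXT(50), " & _
                "FirstName TEXT(50))", dbFailOnError

    ' Sorting and searching by name are handled by a secondary index,
    ' not by encoding the name into the ID.
    dbs.Execute "CREATE INDEX idxCustomerDemoName " & _
                "ON tblCustomerDemo (LastName, FirstName)", dbFailOnError

    Set dbs = Nothing
End Sub
```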
Data Normalization Extremes
It is important to not take data normalization to extremes in Access. Most people are familiar with sepa
table. If not, they quickly discover why they need to. But what about optional fields like telephone numb
number, mobile phone, modem, home phone, home fax, etc.? Most customers won’t have all those num
store them for those that do. There are three approaches:
1. All the fields are in the Customer table.
2. A separate table is created for each type of telephone (a one-to-one link). The table would conta
telephone number.
3. A telephone table is created with these fields: customer ID, the Telephone Type ID, and number
have unlimited phone numbers (a one-to-many link).
There are arguments for each alternative and to some extent it depends how well you know your data.
and programs such as Erwin often suggest separate table(s). Obviously option 3 is the most flexible sin
unlimited number and type of phones. If you cannot limit the number of alternatives, this is your only c
limit the types, you should opt for option 1 in Access.
First, Access stores data in variable length records. One of the reasons for data normalization is to save
such as dBase, FoxPro, and Paradox stored data in fixed length records with each record taking the sam
fields were blank. The more fields, the larger the table. Not only is disk space cheap today, Access rec
contained in them. Therefore, this kind of data normalization is not necessary for efficient data storage
Second, and more important, the retrieval of data across multiple tables may be an unnecessary hassle
show the customer’s phone and fax number, retrieving those records out of another table is an unneces
performance. The way Access is designed, it is much easier to just pick the fields from the Customer tab
separate query or sub-report to grab each telephone type separately.
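The difference in retrieval effort between option 1 and option 3 can be sketched with two small lookups. All table and field names here (Customer, Fax, CustomerPhone, PhoneNumber, TelephoneTypeID) are assumptions made for illustration, not names taken from the sample database.

```vba
' Option 1: the number is simply another field on the customer record.
Function GetFax_Option1(lngCustomerID As Long) As Variant
    ' Returns Null if the customer is not found or has no fax number.
    GetFax_Option1 = DLookup("[Fax]", "Customer", _
                             "[CustomerID] = " & lngCustomerID)
End Function

' Option 3: every number lives in a separate telephone table, so each
' phone type needs its own lookup (or a join) by customer and type.
Function GetPhone_Option3(lngCustomerID As Long, lngTypeID As Long) As Variant
    ' Returns Null if the customer has no number of this type.
    GetPhone_Option3 = DLookup("[PhoneNumber]", "CustomerPhone", _
                               "[CustomerID] = " & lngCustomerID & _
                               " AND [TelephoneTypeID] = " & lngTypeID)
End Function
```

On a report that shows both phone and fax, option 1 needs nothing beyond the customer record, while option 3 needs one such lookup, subquery, or sub-report per telephone type.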
Storing Duplicate Data
It is very important to remember there are situations where you must store what seems like duplicate d
related to the passage of time and the need to preserve what happened. The typical case is an order en
Invoice table linked to Customer and LineItem tables. Each record in the LineItem table is linked to a Pr
product descriptions and pricing. The LineItem table stores a ProductID to designate which product was
However, this is not sufficient. The LineItem table must also store the Price and Description at the time
prices and descriptions in the Product table often change. If it is not preserved in the LineItem table, yo
print the original invoice, which could be a disaster (you would actually show the current description and
data entry, when a Product is selected, you also need to retrieve the Price and Description to fill in thos
The customer information may also change, but that’s actually good since we want the latest customer
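A minimal sketch of that data-entry step, using the Product and LineItem tables and the ProductID, Price, and Description fields named above. The InvoiceID and Quantity fields and the procedure name AddLineItem are assumptions for illustration.

```vba
Sub AddLineItem(lngInvoiceID As Long, lngProductID As Long, lngQuantity As Long)
    ' Sketch only: InvoiceID and Quantity are assumed field names.
    Dim dbs As DAO.Database
    Dim rstProduct As DAO.Recordset
    Dim rstLine As DAO.Recordset

    Set dbs = CurrentDb
    Set rstProduct = dbs.OpenRecordset( _
        "SELECT [Price], [Description] FROM Product " & _
        "WHERE [ProductID] = " & lngProductID, dbOpenSnapshot)

    If Not rstProduct.EOF Then
        Set rstLine = dbs.OpenRecordset("LineItem", dbOpenDynaset)
        rstLine.AddNew
        rstLine![InvoiceID] = lngInvoiceID
        rstLine![ProductID] = lngProductID
        rstLine![Quantity] = lngQuantity
        ' Copy today's values so the original invoice can be reprinted
        ' even after the Product record changes.
        rstLine![Price] = rstProduct![Price]
        rstLine![Description] = rstProduct![Description]
        rstLine.Update
        rstLine.Close
    End If

    rstProduct.Close
    Set dbs = Nothing
End Sub
```

On a bound form, the same copy would typically happen in the product control's AfterUpdate event rather than in a standalone procedure.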
Converting Non-Normalized Data
The example above shows non-normalized and normalized tables. How do you get from one
manually run queries, but that would be very cumbersome. A simple solution is to use Excel’s Transpose
highlight the data to transpose, copy it, then select Edit | Paste Special and select the Transpose option
Within Access, the solution requires some code:
Sub TransposeData()
    Const cstrInputTable As String = "Federal Budget Non-Normalized"
    Const cstrOutputTable As String = "Federal Budget"
    Dim dbs As Database
    Dim rstInput As Recordset
    Dim rstOutput As Recordset
    Dim intYear As Integer

    Set dbs = CurrentDb
    Set rstInput = dbs.OpenRecordset(cstrInputTable)
    Set rstOutput = dbs.OpenRecordset(cstrOutputTable)

    If Not rstInput.EOF Then
        ' For each column in the Input table, create a record in the output table
        For intYear = 1990 To 1997
            rstInput.MoveFirst
            rstOutput.AddNew
            rstOutput![Year] = intYear
            ' Go through every record in the Input table
            Do
                rstOutput(rstInput![Data Type]) = rstInput(CStr(intYear))
                rstInput.MoveNext
            Loop Until rstInput.EOF
            rstOutput.Update
        Next intYear
    End If

    rstInput.Close
    rstOutput.Close
    dbs.Close
End Sub
In the TransposeData procedure, we basically go down each year in the original table (Federal Budget N
new record in the target table (Federal Budget). Since we know the column names are years, we use a
through each year to transpose. The fields in the target table correspond to the value in the original tab
more general procedure is included in the database that accompanies this article.
Separating into Application and Data Databases
Since Access stores all its objects in one MDB file, it is very important to separate your application and d
This can be a hassle during development since it makes it difficult to create and modify tables, but befo
production, you need to separate the two and have your application database link to the tables in the da
reason is to be able to update the application without wiping out the data (assuming your users are add
also important if you are deploying a multi-user application. The application MDB should include all the
macros, and modules. It should also include any tables that are user specific (tables to store user option
Done properly, you can deliver a new version of your application, replace the existing ones, and use the
changes are necessary with the tables, you will then need to manually modify the data database. The a
linked to the data database through the Linked Table Manager under the Access Tools | Add-Ins menu.
You can also write some module code to update a linked table (from Total Access SourceBook):
Function ReLinkTable_TSB(strTable As String, strPath As String) As Boolean
    ' Comments  : Re-links the named table to the named path
    ' Parameters: strTable - table name of the linked table
    '             strPath - full path name of the database containing the real table
    ' Returns   : True if successful, False otherwise
    '
    Dim dbsTmp As Database
    Dim tdfTmp As TableDef
    Dim strPrefix As String
    Dim strNewConnect As String

    On Error GoTo PROC_ERR

    Set dbsTmp = CurrentDb()
    Set tdfTmp = dbsTmp.TableDefs(strTable)
    strPrefix = Left$(tdfTmp.Connect, InStr(tdfTmp.Connect, "="))
    strNewConnect = strPrefix & strPath
    tdfTmp.Connect = strNewConnect
    tdfTmp.RefreshLink
    ReLinkTable_TSB = True

PROC_EXIT:
    dbsTmp.Close
    Exit Function

PROC_ERR:
    ReLinkTable_TSB = False
    Resume PROC_EXIT
End Function
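After a new version of the application is deployed, every linked table usually has to point at the same data database. The sketch below loops over the linked tables and calls ReLinkTable_TSB for each one. The function name ReLinkAllTables is hypothetical, and the code assumes the links are to Access (Jet) databases, whose Connect strings begin with ";DATABASE=".

```vba
Function ReLinkAllTables(strPath As String) As Long
    ' Sketch only: returns the number of tables successfully re-linked.
    Dim dbs As DAO.Database
    Dim tdf As DAO.TableDef
    Dim colNames As New Collection
    Dim varName As Variant
    Dim lngCount As Long

    Set dbs = CurrentDb

    ' Collect the names first so re-linking does not disturb the enumeration.
    For Each tdf In dbs.TableDefs
        ' Local tables have an empty Connect string; ODBC links start with "ODBC;"
        If Left$(tdf.Connect, 10) = ";DATABASE=" Then
            colNames.Add tdf.Name
        End If
    Next tdf

    For Each varName In colNames
        If ReLinkTable_TSB(CStr(varName), strPath) Then
            lngCount = lngCount + 1
        End If
    Next varName

    ReLinkAllTables = lngCount
    Set dbs = Nothing
End Function
```

A startup routine could call it with the full path to the data MDB and compare the result with the number of linked tables it expects.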
Seek Statements
For the most part, separating the data into a data database does not affect your application. The querie
remain the same, as do your forms, reports, and code. The main exception is Seek statements. Seek st
find a record. They are very fast because they use an index you specify. For example, for a given table
and search values (varValue1 and varValue2):
Dim dbs As Database
Dim rst As Recordset
Dim fFound As Boolean
Set dbs = CurrentDb
Set rst = dbs.OpenRecordset(strTable)
rst.Index = strIndex
rst.Seek "=", varValue1, varValue2
fFound = Not rst.NoMatch
However, this code fails if the table is linked. This is very frustrating and many developers resort to the
Unfortunately, FindFirst is very inefficient. It does not use an index and performs a slow sequential sear
This can be very painful for large tables. The good news is that you can use Seek on linked tables. It’s a
identifying the database where the table resides. Often, you will know the linked database name and yo
variable (where strLinkedDB is the linked database name):
Set dbs = DBEngine.Workspaces(0).OpenDatabase(strLinkedDB)
The example below is a general solution where the code tests a table and changes the database variabl
Dim dbs As Database
Dim tdf As TableDef
Dim strConnect As String
Dim strLinkedDB As String
Dim rst As Recordset
Dim fFound As Boolean

Set dbs = CurrentDb
Set tdf = dbs.TableDefs(strTable)

' Connect = "" if it is not a linked table
strConnect = tdf.Connect
If strConnect <> "" Then
    ' Database name follows the "=" sign
    strLinkedDB = Right$(strConnect, Len(strConnect) - InStr(strConnect, "="))
    ' Change database handle to external database
    dbs.Close
    Set dbs = DBEngine.Workspaces(0).OpenDatabase(strLinkedDB)
End If

Set rst = dbs.OpenRecordset(strTable)
rst.Index = strIndex
rst.Seek "=", varValue1, varValue2
fFound = Not rst.NoMatch
Summary
By normalizing your data and splitting your database into separate application and data MDB files, you’l
establishing a solid foundation for your database development efforts. Data normalization not only make
accurate, it makes it easier to analyze, and more importantly, maintain and expand. Separating your ap
enables you to support multiple users and upgrade the application without wiping out their data. Assum
change that often, the separation also makes it easier to just back up the data database since only that
Luke Chung is the president and founder of FMS, Inc., a database consulting firm and the leading developer of Microsoft Access add-in pro
author of several Access add-ins including Total Access Analyzer, Total Access CodeTools, Total Access Detective, Total Access SourceBook
spoken at a variety of conferences and user groups, and can be reached at LChung@fmsinc.com. Their web site (www.fmsinc.com) offers
technical papers, utilities, and demos.