HOW TO IMPLEMENT DATA INTEGRITY IN SQL SERVER
Louis Davidson (drsql.org) – drsql@hotmail.com

Why did I choose data integrity as a topic?
• Answer 1
  • If I obviously lie to you, will you trust me?
  • If your data obviously lies to your customer, will they trust it?
  • For data to become information, it has to be as trustworthy as reasonably possible.
• Answer 2
  • If I were the judge convicting someone of poor data integrity, I would sentence them to write and maintain ETL
  • I wrote this slide at 12:49am on 8/14/2013 because I had to get up and fix a data integrity issue

First Line of Defense – Testing and Requirements
• First, know what your user wants (requirements)
• Build queries that check the data is within tolerances as you build
• Define both illegal values and exceptional values
  • Age for a daycare student:
    • Legal: 1–8; Illegal: everything else
    • Legal, but outside the norm and worth flagging: 1, 2, 6, 7, 8
• Save these queries as you go (a sketch follows this slide)
• Test during all phases of the project
  • Design
  • Development
  • Customer testing
  • Production
• Even if you ignore all of the advice that follows and let your tables go naked, these scripts can be used to verify data is within tolerances
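A minimal sketch of such a saved tolerance test, assuming a hypothetical dbo.DayCareStudent table with StudentId and Age columns:

  -- Saved tolerance test: returns offending rows, so an empty
  -- result means the data is within tolerances.
  SELECT StudentId, Age
  FROM   dbo.DayCareStudent
  WHERE  Age NOT BETWEEN 1 AND 8 OR Age IS NULL;  -- illegal values

  -- Legal but exceptional values, flagged for review rather than rejected:
  SELECT StudentId, Age
  FROM   dbo.DayCareStudent
  WHERE  Age IN (1, 2, 6, 7, 8);

Queries like these cost little to write during design and can be rerun in every later phase, all the way into production.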
Requirements tell you what to test for; now WHERE/HOW to implement?
• DB
  • Classic client/server: “May I save this data, please?”
  • Very trustworthy data protection
• Middle tier / rules engine
  • Built-in tools that most programmers understand
  • Flexible to change as users need them to
• UI
  • Friendly for the user
  • Provides immediate feedback for nominal rules, limiting bandwidth utilization

No one place can satisfy well enough (But…)
• DB
  • No interactive protection; limited to 100% true rules
  • Extremely limited flexibility
• Middle tier / rules engine
  • Suffers under highly concurrent situations
  • Difficult for inter-row, inter-table rules
  • Difficult to use with tools like SQL and SSIS
• UI
  • Must be recoded for every form/screen
  • Very limited rule set that can be enforced

Database Layer Responsibilities
• 100% rules
  • Always true
  • Usually very simple rules
  • Failure to meet the prescribed condition would be harmful to the software (and possibly to the users of the software)
• Other layers repeat some rules and implement everything else

Database tier layered approach
• Keep it simple
• Enforce integrity via (our agenda for the next hour):
  • Structure – providing correct places to store data
  • Keys – protecting uniqueness
  • Relationships – foreign keys
  • Domains – limiting data points to sizes/values that are legit
  • Conditions – required situations (a customer may have only 1 primary address; no overlapping ranges; etc.)

But We Don’t Want Errors from the Data Tier!
• A frequent concern of non-data-tier programmers
• Even if you apply no constraints, you are apt to get errors
  • You are always likely to get deadlocks
  • And if your indexing isn’t great, you may get them frequently
• Best to code error handlers that handle any error condition regardless
• If the other tiers handle all of the errors, then the database protection should remain silent
  • Except perhaps during testing/coding

Structure
• Match the user's needs precisely to the design, with room for growth
• Getting the design to match the user's needs will get you way down the road to integrity
  • Normalization will usually get the car fueled up and started
  • Naming things well doesn’t hurt either…
• Getting it right can only be done by understanding the user's requirements
  • I promise, no more requirements talk

If your structure is wrong… users will find a way
• Requirement: store information about books

  BookISBN     BookTitle      BookPublisher  Author
  ===========  -------------  -------------  -----------------
  111111111    Normalization  Apress         Louis
  222222222    T-SQL          Apress         Michael
  333333333    Indexing       Microsoft      Kim
  444444444    DB Design      Apress         Louis
  444444444-1  DB Design      Apress         Jessica and Louis

• What is wrong with this table? Lots of books have > 1 author
• What are common ways users would “solve” the problem? Any way they think of! (Here: a fake “-1” ISBN and two names jammed into one column)
• What’s another common way someone might fix this?

Close, but still quite messy
• Add a repeating group?

  BookISBN   BookTitle      BookPublisher  …  Author1  Author2  Author3
  =========  -------------  -------------  …  -------  -------  -------
  111111111  Normalization  Apress         …  Louis
  222222222  T-SQL          Apress         …  Michael
  333333333  Indexing       Microsoft      …  Kim
  444444444  Design         Apress         …  Jessica  Louis

• But now, how do you represent who was the primary author of the book?

Now, the structure protects the data…

  BookISBN   BookTitle      BookPublisher
  =========  -------------  -------------
  111111111  Normalization  Apress
  222222222  T-SQL          Apress
  333333333  Indexing       Microsoft
  444444444  Design         Apress

  BookISBN   Author   ContributionType
  =========  =======  ----------------
  111111111  Louis    Principal Author
  222222222  Michael  Principal Author
  333333333  Kim      Principal Author
  444444444  Jessica  Contributor
  444444444  Louis    Principal Author

• And it gives you easy expansion (a DDL sketch follows)
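A minimal T-SQL sketch of the normalized structure above; the data types and constraint names are assumptions, not from the slides:

  CREATE TABLE dbo.Book
  (
      BookISBN      varchar(13)  NOT NULL CONSTRAINT PKBook PRIMARY KEY,
      BookTitle     varchar(100) NOT NULL,
      BookPublisher varchar(100) NOT NULL
  );

  CREATE TABLE dbo.BookAuthor
  (
      BookISBN         varchar(13)  NOT NULL
          CONSTRAINT FKBookAuthor$References$Book REFERENCES dbo.Book (BookISBN),
      Author           varchar(100) NOT NULL,
      ContributionType varchar(50)  NOT NULL,
      -- one row per author per book; the composite key prevents duplicates
      CONSTRAINT PKBookAuthor PRIMARY KEY (BookISBN, Author)
  );

Adding a fifth author is now just another BookAuthor row, which is the easy expansion the slide refers to.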
Keys
• Defending against duplication of data where it oughtn't be duplicated
• An artificial key (identity-, GUID-, or sequence-generated value) should NOT be the only key
• When employed, the artificial key is for tuning; the natural key is for the user
• Avoid giving users sequentially created values
  • “Well, I am account 0000001; what about account 0000002?”

Uniqueness Counts
• Requirement: table of school mascots

  MascotId  Name     Color        School
  ========  ~~~~~~~  -----------  ~~~~~~~~~~~~~~~~~
  1         Smokey   Black/Brown  UT
  112       Smokey   Black/White  Central High
  4567      Smokey   Smoky        Less Central High
  979796    Smokey   Brown        Southwest Middle

  (the ~~~ columns are the ones that ought to be unique together)

• For a row to be truly unique, some manner of constraint needs to be on column(s) that have meaning
• It is a good idea to unit test your structures by putting in data that looks really wrong and seeing if it stops you, warns you, or something!

Key Constraints
• Applied to protect data from duplication
• May help performance, but should exist even if never used for a query
• Part of the data structure – applied with ALTER TABLE – unlike indexes, which are generally attached for performance
• NULLs
  • PRIMARY KEY – no NULLs allowed
  • UNIQUE – NULLs allowed, but NULL is treated as a single value (so only one NULL row)
• Table clustering
  • It usually makes sense for the primary key to be clustered (not a hard and fast rule, though)
  • Key constraints are valuable with or without clustering

Demo – Key Constraints (and a wee bit more)

Relationships
• Establish a connection between two tables
• Probably the most troublesome to implement from outside of the database
  • Concurrent users mean data can change under you
  • Caching all the data is really costly (particularly keeping multiple caching servers up to date through inserts, updates, and deletes!)
• Using foreign key constraints means these two queries always return the same value:
  • SELECT COUNT(*) FROM InvoiceLineItem
  • SELECT COUNT(*) FROM Invoice JOIN InvoiceLineItem ON Invoice.InvoiceId = InvoiceLineItem.InvoiceId

Foreign Key Constraints
• Like CHECK constraints, they are part of the table structure
• One table can reference another's PRIMARY KEY columns, or even its UNIQUE key columns
• Indexing the child's reference key can be helpful in many cases
• Usually extremely fast, even on very large tables
  • As long as the keys' underlying indexes are maintained
  • For integer keys, a B-tree index can search millions of rows in a few reads

Foreign Key Cascading
• You can define cascading operations (sketched after this slide)
  • DELETE CASCADE – deleting the parent deletes the children
  • DELETE SET NULL – deleting the parent sets the child reference key to NULL
  • DELETE SET DEFAULT – deleting the parent sets the child reference key to its default
  • DELETE NO ACTION – fails if any child rows exist – THE DEFAULT
  • Plus the same UPDATE options: CASCADE, SET NULL, SET DEFAULT, or NO ACTION
• DELETE CASCADE operations should be limited, to avoid surprises
• Use UPDATE CASCADE where you have updatable primary keys; changing a referenced primary key by hand is messy
• Multiple or cyclic cascade paths require INSTEAD OF triggers or procedures to implement
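A hedged sketch of the cascade syntax, reusing the Invoice/InvoiceLineItem pair from the Relationships slide (column definitions are assumptions):

  CREATE TABLE dbo.Invoice
  (
      InvoiceId int NOT NULL CONSTRAINT PKInvoice PRIMARY KEY
  );

  CREATE TABLE dbo.InvoiceLineItem
  (
      InvoiceLineItemId int NOT NULL CONSTRAINT PKInvoiceLineItem PRIMARY KEY,
      InvoiceId         int NOT NULL
          CONSTRAINT FKInvoiceLineItem$References$Invoice
              REFERENCES dbo.Invoice (InvoiceId)
              ON DELETE NO ACTION   -- the default: the delete fails if line items exist
              ON UPDATE CASCADE     -- a parent key change propagates to the line items
  );

Demo – Foreign Keys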
Domains
• Defining the domain of an object or column
• Table – Customers? All customers, or only certain types?
• Column
  • Integer? Or a whole number between 0 and 10,000,000?
  • A true Unicode value accepting 64K characters? Or simple alphanumeric?
  • Can you really accept 2GB of text (varchar(max))?
• Goal: 0% chance of defects
  • Constraints have no situational intelligence
  • If there can be ANY variation, then the domain includes the variations
  • Can't fight users doing dumb stuff

Please don’t do this. Please?

  CREATE TABLE object
  (
      objectId uniqueidentifier,  -- no key, and NULLs allowed
      fillMeUp varchar(max)       -- any value at all, up to 2GB of it
  )

Extreme Bucket Datatypes
• numeric(38,2)
  • Max value: 999,999,999,999,999,999,999,999,999,999,999,999.99
  • Bill Gates' net worth: < $99,999,999,999.99
  • US national debt + all personal debt: < $99,999,999,999,999.99
  • For a nutty value, the distance to the nearest galaxy in inches (yes, inches): ~74,488,200,000,000,000,000,000.00

Extreme Bucket Datatypes – Strings
• varchar(8000)
  • The alphabet repeated 30 times fills an entire slide, yet that is just 780 characters – less than a tenth of what the column allows
• Note: if you allow N characters, your apps should minimally test for N characters (successfully) and N + 1 characters (error)

On the other hand, don’t be over-restrictive
• “Why did they name me 555 95472? Now I can’t go to school because of the stupid school database!”
• http://peanuts.wikia.com/wiki/File:555.jpg

Single-Column Domains
• What data is EVER "legal" for a column?
• Most data integrity issues are due to lack of domain control
  • Misspellings: TN, TNN, TENN, TENNESEE, TINNESEE
  • Bad values: -1 for Age, NULL for a required value, a random default value chosen
• Implementation includes
  • Intrinsic data type
  • Optionality (NULL vs. NOT NULL)
  • Default value
  • Simple predicates
    • CHECK constraint
    • Domain table
  • Forcing the issue: trigger

Multiple-Column Domains
• Where the domain of one column is affected by the domain/value of another
• Examples:
  • if col1 = 1 then col2 in (1,2,3), else col2 in (3,4,5)
  • if col1 = 'bob' then col2 is NOT NULL
• Usually implemented with a CHECK constraint (see the sketch below)

Multiple-Column Concerns
• Minimize these conditions to only where necessary to avoid illogical/illegal data
  • RefusedToGiveBirthDateFlag = 1 AND BirthDate IS NULL
  • Questionable: if DiscountPercent > .5, then ApproverUserId is not null
• Likely contraindicated – processing situations:
  • The user always enters Date1 before Date2
  • The ship date must be after the order date
• Avoid domains based on data in other tables, because data in other tables can shift, leading to messy situations
  • if DiscountPercent > .5 and savingUser.NeedsApproverFlag = 1, then ApproverUserId is not null
  • What happens if you change or delete the user referenced by savingUser?

Check Constraints
• Used to complete the implementation of 99.9% of simple domains
• May help performance, because they give the optimizer knowledge of the data
• Part of the data structure – applied with ALTER TABLE
• Simple predicate implementation
• If a column allows NULL, the expectation is that NULL is an acceptable answer unless specifically coded for
  • Hence, to fail a CHECK condition the answer must be FALSE (unlike WHERE clauses, which succeed only when the result is TRUE)
  • 1=1 → TRUE – passes both WHERE and CHECK
  • 1=NULL → UNKNOWN – passes a CHECK constraint on a NULLable column, fails a WHERE clause
  • 1=2 → FALSE – fails both
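A minimal sketch of both kinds of CHECK constraint, built from the examples on the preceding slides (table and constraint names are assumptions):

  -- Single-column domain: Age limited to the legal range.
  -- Note: if Age allows NULL, a NULL evaluates to UNKNOWN and passes;
  -- declare the column NOT NULL if that is not acceptable.
  ALTER TABLE dbo.DayCareStudent
      ADD CONSTRAINT CHKDayCareStudent$Age CHECK (Age BETWEEN 1 AND 8);

  -- Multiple-column condition: a person who refused to give a birth date
  -- must not have one stored.
  ALTER TABLE dbo.Person
      ADD CONSTRAINT CHKPerson$BirthDateRefusal
          CHECK (NOT (RefusedToGiveBirthDateFlag = 1 AND BirthDate IS NOT NULL));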
Demo – Domains

Conditions
• Making sure that some condition is met reliably
• Examples
  • Row modification details
  • Overlapping ranges
• Big decisions here
  • Non-trivial to implement
  • It feels natural to do it in non-data-tier code
  • However, non-data-tier code:
    • Can be less reliable
    • Can be greatly affected by concurrency

Tools
• Triggers
  • INSTEAD OF triggers to automatically maintain values
  • AFTER triggers to validate complex conditions that must be constantly true
• SQL
  • Optimistic locking to avoid heavy locking without lost updates

Demo – Protecting against Conditions

Performance Concerns…
• Almost everything you will commonly need can be based on basic declarative integrity constraints
• By now, there will be some concern about performance
• Performance WILL be impacted
  • Done well: almost negligible
  • Done poorly: can cause lots of pain
• The next demo will do a non-scientific, single-user job of showing that the performance hit is noticeable, but not tremendous…

Demo – Performance

Summary
• Getting the structure correct is a great start towards data integrity
• Make sure column values are always within an acceptable tolerance so software doesn’t break
• Employ all of the tools SQL Server gives you to help ensure data integrity
• Use non-data-tier software to ensure that errors returned from the data tier are extremely rare
• The key word is teamwork: you can’t do an adequate job of protecting data in the UI, business/object, or data tier alone

Trust but verify
• Never stop testing the data, even into production
• Be vigilant
  • Test the structures to make sure constraints are not disabled and are still trusted (a sketch follows)
  • Test any data that is not constrained in a 100% manner
• Use your slow periods wisely, running tests regularly
• Even 1 bad row that a customer notices means they may no longer trust the data…
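As one sketch of that structural verification, SQL Server's catalog views report constraints that have been disabled, or re-enabled WITH NOCHECK and so left untrusted:

  -- Constraints that are disabled or not trusted; any row returned means
  -- the engine is no longer guaranteeing the rule for existing data.
  SELECT OBJECT_NAME(parent_object_id) AS TableName,
         name AS ConstraintName, is_disabled, is_not_trusted
  FROM   sys.check_constraints
  WHERE  is_disabled = 1 OR is_not_trusted = 1
  UNION ALL
  SELECT OBJECT_NAME(parent_object_id),
         name, is_disabled, is_not_trusted
  FROM   sys.foreign_keys
  WHERE  is_disabled = 1 OR is_not_trusted = 1;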