SQL for Crime ANALYSTS BACIAA Session Thursday, 22 March 2012 James G. Beldock Today’s Agenda • Introductions • Preliminaries • Databases, Structured Data, and Tables • Demo 1: Exploring Tables • How Databases Are Structured (& Why) • Demo 2: Lots of Tables • Break • A Sample CAD Database • SQL SELECT, part 1 • Using database data in Excel • Lunch • Joins • SQL Select, part 2 • Joining • “Saving” Joins to a View • Break • Views • Other SQL Commands 1. 2. Databases, Database Varieties, and SQL How Databases Are Structured (& Why) PRELIMINARIES • Store data permanently • Sometimes called “persistent storage” • Data can be • Structured data • A Person has: First Name; Last Name; Social Security Number; Photo.JPG • Unstructured data • examples: Moby Dick; an entire website; email messages (sometimes) • Sizes • Databases can be small (100K, 1MB, etc.) or • Quite Large (UK Land Registry is 23TB; that’s ~1.1 Libraries of Congress)1 • RIDICULOUSLY LARGE (Google’s index of the web; Facebook’s profiles database) 1 2 DB2 - the secret database (http://www.theregister.co.uk/2006/01/18/db2_neglected/) Wolfram Alpha is great for this sort of thing: http://www.wolframalpha.com/input/?i=23+terabytes that’s 1.84 x 100,000,000,000,000 bits!2 Databases [silicon valley moment] • Recently, SQL-running databases have fallen somewhat out of fashion • SQL was never cool • Now it’s officially “uncool” for some purposes, like building NetFlix • Highly scalable (thousands of servers?) • Very flexible data structures • Today’s session is all about SQL, and SQL is (usually) used with relational databases, which are, if you ask the cool people, not as cool as they used to be. • SQL is still the world’s most prolific database language, and certainly stores more structured data than any other environment ever built. Structured Data • SQL deals with structured data3 • Structured Data • Keeps track of one or more types of things, called Entities (or TABLEs in SQL) • Knows certain, specific, structured pieces of information about those entities, called Attributes (or COLUMNs in SQL) Note: SQL keywords will be in blue. They are traditionally written in ALL CAPS. 3 4 Sample Structured Data: a TABLE of Customers4 and names of Tables or Columns will appear in Brown or Orange, respectively. They are traditionally Capitalized (but not ALL CAPS). Well, nearly always. But not always always: Storing Unstructured Data in SQL Server 2008 – Microsoft, source: SqlCourse2.com, http://www.sqlcourse2.com/index.html Database TABLES Names of COLUMNS Name of TABLE Question: What’s the name of a ROW? ROWs COLUMNs a ROW • 5 COLUMNs (also called Fields): • • • • • Unique IDs are called probably a Unique Identifier KEYs probably not unique customeridsome type of number firstname text (called a String) lastname string probably not unique city string probably not unique state string probably not unique The KEY used to name a ROW is called the PRIMARY KEY Before We Go Further: SQL • That is why you’re here, right? • Structured Query Language (SQL) is: • A language for asking a database for information (“querying”) • A language for changing information in a database • Changing the structure of a database • Adjusting security, performance, and deployment of databases • Destroying everything in the database…but don’t worry :-) Database Manipulation Language, DML Create Read Update Delete Often called: DANGEROUS (seriously, called admin functionality, or Database Definition Language, DDL) SQL’s SELECT Statement • The single most important SQL statement. Period. • “Selects” data out of a database, or performs a calculation on a column, value, table, etc. • Really simple examples: • SELECT 'hello' → hello • SELECT 1 + 3 → 4 SELECT Statement, continued • More commonly, the basic SELECT statement returns ROWs from a TABLE: • • SELECT firstname FROM customers → John Leroy Elroy Lisa SELECT firstname, city FROM customers → John Lynden Leroy Pinetop A special COLUMN name: Elroy Snoqualmie Means “all COLUMNs” Lisa Oshkosh * • SELECT * FROM customers → 10101 John Gray 10298 Leroy Brown 10299 Elroy Keller 10315 Lisa Jones Lynden Pinetop Snoqualmie Oshkosh Washington Arizona Washington Washington SELECT Statement: the Important optioNS (for one table) list of columns, functions on columns, or * SELECT 5 FROM name of table WHERE list of conditions to include (called “predicates”) ORDER BY list of columns and direction of sort (ascending/descending) GROUP BY list of columns 5 The full definition of the SQL SELECT statement syntax is much longer and, to some extent, specific to the database software. See the definition of Microsoft SQL Server 2008 R2’s SELECT statement at http://msdn.microsoft.com/enus/library/ms189499.aspx SELECT…ORDER BY • Use Order By to sort by one or more columns, in ascending or descending order Effect of ORDER BY clause SELECT…WHERE • Use WHERE to filter • based on one criterion: Why the [square brackets]? • or more than one: The word state is a reserved SQL keyword. When it is used as a column name, it must be [bracketed] to avoid confusion. FUNCTIONS • You can add functions to a SQL SELECT statement to perform various analyses. • The most common6 are • Aggregate functions • count(), which returns the number of somethings, and • sum(), which adds up the somethings • Also: min(), max(), avg(), stdev(), var() • Math, Date and String (text) Manipulation functions • Math: abs(), ceiling(), power(), sqrt(), others • String: len(), substring(), replace(), upper(), lower(), left(), right(), others • Date: dateadd(), datediff(), datepart(), getdate(), day(), month(), year(), others 6 The full list is quite long. For SQL Server, see http://msdn.microsoft.com/enus/library/aa258899(v=sql.80).aspx. [DEMO] Using FunctionS, WHERE, and ORDER BY • Summary: • count(*) gives you the count of rows resulting from your query • You can SELECT any combination of columns • Unless you GROUP BY, in which case you are limited to the GROUPed BY columns and aggregate functions applied to other columns • Gotchas • sum(*) doesn’t make sense, but sum(columnname) does—for columns of numbers • GROUP BY is finicky: the list of columns you select is limited • Some things aren’t easy: for example, finding the percent of total Terminology: DBMS • “Database” is a generic term; it can refer to: • A specific set of data running on a Database Server • A Database Server itself (not really the right term) • A large body of information kept by a human being (“my recipe database”) • Databases generally run on a Database Server • • • • A computer running Database Management System (DBMS) Accepts connections (“queries”) from many client computers Returns a response (“result set”) to each client in response to each query Can be distributed onto lots of servers (Facebook: 1,800+ MySQL servers) • DBMS handle multiple databases • Each Database is stored in one or more “database files” • Database Files can sometimes be loaded/viewed/edited by other software Names You Might Encounter (in the Database World) • • • • SQL Server, from Microsoft (also “Microsoft SQL Server”) Oracle DB2, from IBM Less common: • Microsoft Access, dBase, Sybase Database Structures • Most databases have many TABLEs • 10 would be “few”; 50 would be normal; 150 would be many • There is a method to this madness • Different TABLEs contain different categories of information • Example: • Customers: contains lots of customers • Products: contains lots of products • Orders: combines customers and products (and quantities, etc.) Why So Many Tables? • Imagine a world with just 1 table • The problem of duplicate data: OrderID CustomerName CustomerAddress Quantity ProductName 1000 James 123 Main Street, Arcadia, CA, 95000 3 Orange 1001 James 123 Main Street, Arcadia, CA, 95000 4 Apple 1002 George 444 1st Avenue, Sacramento, CA 97000 1 Fork 6 Pear • Adding a new order is easy: 1003 James • 123 Main Street, Arcadia, CA 95000 But what happens when James changes his address? • Answer: need to update every ROW where 'James' is the CustomerName (ugh!) Solution: Divide and Conquer • Divide data into Entities (TABLEs), specific to a given purpose: CustomerID CustomerName CustomerAddress 1 James 123 Main Street, Arcadia, CA, 95000 2 George 444 1st Avenue, Sacramento, CA 97000 OrderID CustomerID Quantity ProductName 1000 1 3 Orange 1001 1 4 Apple 1002 2 1 Fork