Immutability Changes Everything!
October 10, 2012
Pat Helland
Salesforce.com
Outline
 Introduction
 Accountants Don’t Use Erasers
 Keeping the Stone Tablets Safe
 Hey! Versions Are Immutable, Too!
 Immutability by Reference
 Immutability Is in the Eye of the Beholder
 Normalization Is for Sissies
 Conclusion
Some Industry Trends to Consider
Old                                          | New
Computation (CPUs) Expensive                 | Computation Cheap (Manycore Computers)
Disk Storage Expensive                       | Disk Storage Cheap (Cheap Commodity Disks)
Coordination Easy (Latches Don’t Often Hit)  | Coordination Hard (Latches Stall a Lot, etc.)
DRAM Expensive                               | DRAM / SSD Getting Cheap
We Can Afford to Keep Immutable Copies of Lots of Data
We Need Immutability to Coordinate with Fewer Challenges
Increasing Storage, Distribution, and Ambiguity
 Increasing Storage
 Cost per Gigabyte/Terabyte/Petabyte is dropping
 We can keep LOTS OF data for a LONG time
 Increasing Distribution
 More and more, we have data and work spread across a great distance
 Data within the Datacenter may be far away…
(This may be easing as we get faster and flatter networks in the datacenter)
 Data within a many-core chip may be far away…
(Instruction opportunities lost waiting for a semaphore increase with more cores…)
 Increasing Ambiguity
 When trying to coordinate with systems that are farther away, there’s more that’s
happened since you’ve heard the news
 Can you take action with incomplete knowledge?
 Can you wait for enough knowledge?
“Append-Only” Computing
 Many kinds of computing are “Append-Only”
 Observations are recorded forever (or a long time)
 Derived results are calculated on demand
 You can’t rewrite history
 Database transaction logs record all the changes
made to the database
 High-speed appends to the log
 You never modify the log other than by appending to it
 The database is a cache of a subset of the log!
 The latest value of each record is kept in the database
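A minimal sketch of the idea that the database is a cache of the log, assuming an in-memory list stands in for the transaction log. The names `LogRecord`, `AppendOnlyLog`, and `latest_values` are illustrative and not taken from any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """One immutable change: `key` now has `value`."""
    lsn: int      # log sequence number (position in the log)
    key: str
    value: str


class AppendOnlyLog:
    def __init__(self):
        self._records = []

    def append(self, key, value):
        record = LogRecord(lsn=len(self._records), key=key, value=value)
        self._records.append(record)   # appending is the only mutation allowed
        return record

    def records(self):
        return tuple(self._records)    # read-only view of the history


def latest_values(log):
    """Rebuild the 'database': the latest value of each key in the log."""
    table = {}
    for record in log.records():
        table[record.key] = record.value
    return table


log = AppendOnlyLog()
log.append("acct-1", "balance=100")
log.append("acct-2", "balance=50")
log.append("acct-1", "balance=70")

print(latest_values(log))   # {'acct-1': 'balance=70', 'acct-2': 'balance=50'}
```

The table can be thrown away and rebuilt at any time, because the log still holds every change.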
Accounting: Recorded & Derived Knowledge
 Accountants don’t use erasers
 All entries in the ledger remain in the ledger
 Corrections can be made but only by new entries
 A company’s quarterly results are published
o They include small corrections to the previous quarter… Small fixes are OK!
 Some entries describe observed facts
 We received these credits and debits
 Some entries are derived facts
 We amortized these capital expenses at this rate based on their cost and usage
 Your current balance depends on last month’s balance with applied debits & credits
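As a sketch of "corrections only by new entries", the ledger below is a tuple of frozen entries and the balance is derived on demand. The entry memos and amounts are invented for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    memo: str
    amount: Decimal          # positive = credit, negative = debit


def balance(opening, entries):
    """Derived fact: the opening balance plus every entry, in order."""
    return opening + sum((e.amount for e in entries), Decimal("0"))


ledger = (
    Entry("invoice 17 paid", Decimal("250.00")),
    Entry("office rent", Decimal("-90.00")),
)

# The rent was really 95.00. The wrong entry stays; a new entry corrects it.
ledger = ledger + (Entry("correction: office rent understated", Decimal("-5.00")),)

print(balance(Decimal("1000.00"), ledger))   # 1155.00
```

The history shows both the mistake and its correction, while the derived balance reflects only the net effect.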
The Append-Only View of
Distributed Single-Master Computing
 Single-Master computing means somehow we order the changes
 Centralized Computing
 Two-Phase Commit or Paxos
 Optimistic Concurrency Control
 Somehow, we semantically apply one change at a time
 Each change is layered over its predecessors
 We can perceive a new set of values superseding the old ones
 This may be transactional or single-record changes but they appear in an order
 We continue to append new knowledge over the immutable history
 The new version of the truth is interpreted through the older versions
Distributed Computing “Back in the Day”
 Back before telephones, people used messengers
 Kids walking through town or riding bicycles to deliver the message
 The US Postal Service or the Pony Express would deliver the message
 Sometimes, people used fancy forms to capture the computing
 Add new data to a new part of the form
 Tear off the back copy of the form and file it
 Send the remaining portions to the next participant
 Each participant received the data they needed and
added the new information to the form
 You cannot update earlier data on the form…
o You can only append new knowledge to the form!
 Distributed computing was append-only!
 New messages, new additions to the forms…
 You couldn’t overwrite what had been written!
[Figure: a multi-part form to which Part 1, Part 2, Part 3, … are appended in turn]
Files, Blocks, & Replication for
Durability & Availability
 GFS and HDFS (and others) offer highly-available files
 A file is a bunch of blocks (or chunks)
 The file (as a file name and description of needed blocks) is highly available
 Each block (chunk) is replicated within the cluster for durability and availability
o Blocks are typically replicated three times with scrubbing
o Replicas are placed across fault-zones
 Each file is immutable and (typically) single writer
 The file is created, one process can append to it, it lives for a while and is deleted
 Multi-writer files are hard (GFS had some challenges with failures and replicas)
 Immutable files and immutable blocks empower this replication
 The file system has no concept of a change to a complete file
 Each block’s immutability allows it to be replicated (and have extra replicas, too)
High Availability of Immutable Blocks Is Affordable Now!
Google, Amazon, Yahoo, Microsoft, and more keep Petabytes & Exabytes
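The sketch below illustrates cutting a file into immutable blocks and giving each block three replicas in distinct fault zones. The round-robin placement policy, zone names, and node names are assumptions made for illustration; this is not how GFS or HDFS actually chooses replica locations.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into an immutable sequence of blocks."""
    return tuple(data[i:i + block_size] for i in range(0, len(data), block_size))


def place_replicas(block_index, nodes_by_zone, replicas=3):
    """Pick one node in each of `replicas` distinct fault zones (toy policy)."""
    zones = sorted(nodes_by_zone)
    if len(zones) < replicas:
        raise ValueError("need at least as many fault zones as replicas")
    chosen = []
    for offset in range(replicas):
        zone = zones[(block_index + offset) % len(zones)]
        nodes = nodes_by_zone[zone]
        chosen.append((zone, nodes[block_index % len(nodes)]))
    return chosen


# Hypothetical cluster layout.
cluster = {
    "zone-a": ["node-1", "node-2"],
    "zone-b": ["node-3"],
    "zone-c": ["node-4", "node-5"],
    "zone-d": ["node-6"],
}

blocks = split_into_blocks(b"immutable file contents", block_size=8)
for index, block in enumerate(blocks):
    print(index, block, place_replicas(index, cluster))
```

Because a block never changes after it is written, any replica can be copied again later without coordinating with the others.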
Widely Sharing Immutable Files Is Easy
 Immutable files have an identity and a content
 Neither the identity nor the content can change
 You can copy the immutable file whenever and wherever you want
 Since you can’t change it, you don’t need to track where it’s landed!
 You can share the same immutable copy across users
 As long as you track reference counts (when it’s OK to delete it), you can use one
copy of the file to share across many users
 You can distribute immutable files wherever you want
 Same identity, same contents, location independent!
Published Books are Immutable!
Sometimes later editions repair previous bugs
This is versioning of the book
Versions are immutable objects!
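One way to picture "same identity, same contents" with reference counting is the sketch below. It assumes the identity is a SHA-256 hash of the content, which is a choice made here for illustration; the slides do not say how identities are assigned. `ImmutableFileStore` is a hypothetical name.

```python
import hashlib


class ImmutableFileStore:
    """Hypothetical store: a file's identity is the SHA-256 of its content."""

    def __init__(self):
        self._content = {}     # identity -> bytes
        self._refcount = {}    # identity -> number of users sharing it

    def put(self, data: bytes) -> str:
        identity = hashlib.sha256(data).hexdigest()
        self._content.setdefault(identity, data)   # same bytes, same identity
        self._refcount[identity] = self._refcount.get(identity, 0) + 1
        return identity

    def get(self, identity: str) -> bytes:
        return self._content[identity]

    def release(self, identity: str) -> None:
        self._refcount[identity] -= 1
        if self._refcount[identity] == 0:          # last user gone: OK to delete
            del self._content[identity]
            del self._refcount[identity]


store = ImmutableFileStore()
alice = store.put(b"chapter one")
bob = store.put(b"chapter one")     # shares the single stored copy
print(alice == bob)                 # True
```

The copy is deleted only after both users call `release`; until then either of them can read it from wherever it lives.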
Names and Immutability…
Watch Out for the Slippery Slope
 GFS (Google File System) and HDFS (Hadoop Distributed File System)
provide immutable files
 Immutable blocks (chunks) are replicated across Data Nodes
 Immutable files are a sequence of blocks (chunks)
 The immutable files are identified with a GUID
 The contents of a file are immutable and labeled with a GUID
 The GUID will always refer to exactly that file and its contents
 GFS and HDFS also provide a namespace which can be changed
 The logical name of the immutable file may be changed to something else
 It takes care in usage to ensure that you have predictable results
Is Something Really Immutable When Its Name Can Change?
Storing Immutable Data
in an Eventually Consistent Store
 Consider a strongly consistent catalog
 Single master control over a namespace yielding GUIDs for the file blobs
 Now, keep the GUID to immutable blob storage in Dynamo or Riak
 The eventually consistent store will NEVER give you the wrong answer
 Each GUID will only yield one result because you never store different values
[Figure: HDFS (a NameNode holding the NameSpace and Block & DataNode Mgmt, over many Data Nodes) beside an RDBMS holding the NameSpace over a Riak File/Block Store of Data Nodes, with Files/Blocks identified by GUID]
 Self-managing and master-less blob-store!
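The claim that an eventually consistent store never gives the wrong answer for immutable blobs can be sketched as follows. The `Replica` class and `read` function are toy stand-ins, not the Dynamo or Riak APIs; a lagging replica can only answer "not here yet", never a different value.

```python
import uuid


class Replica:
    def __init__(self):
        self.blobs = {}

    def write(self, guid, data):
        # A GUID is only ever written with one value, so a re-delivered
        # write cannot conflict with what is already stored.
        existing = self.blobs.setdefault(guid, data)
        assert existing == data, "immutable blobs never change value"


def read(replicas, guid):
    """Return the blob from any replica that has it, else None (not arrived yet)."""
    for replica in replicas:
        if guid in replica.blobs:
            return replica.blobs[guid]
    return None


replicas = [Replica(), Replica(), Replica()]
guid = str(uuid.uuid4())
replicas[0].write(guid, b"block contents")   # replication has not caught up yet

print(read(replicas[1:], guid))              # None: unknown, but never wrong
replicas[2].write(guid, b"block contents")   # replication catches up
print(read(replicas[1:], guid))              # b'block contents'
```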
Versions and History
 Linear Version History (a.k.a. Strongly Consistent):
 One version replaces another – One parent and one child in the sequence
 Each version is immutable
 Each version has an identity
 Typically, each new version is viewed as a replacement for the earlier one
 DAG (Directed Acyclic Graph) Version History
(a.k.a. Eventual Consistency):
 Each version may have one or more parents
 Each parent may have one or more children
 Each parent may have children with different parents
 Each version is immutable
 Each version has an identity (but we may now need vector clocks to describe)
 Each version may be viewed as one of many replacement versions for its parents
Versions Are Immutable and (Should) Have Immutable Names
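The two history shapes can be sketched with frozen version objects. Here a version's identity is a hash of its parents and content, which is an illustrative substitute for the vector clocks mentioned above, not an implementation of them. `Version` and `is_linear` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    content: str
    parents: tuple = ()      # identities of the parent versions

    @property
    def identity(self) -> str:
        h = hashlib.sha256()
        for parent in sorted(self.parents):
            h.update(parent.encode())
        h.update(self.content.encode())
        return h.hexdigest()[:12]


def is_linear(versions):
    """True when no version has more than one parent or more than one child."""
    children = {}
    for v in versions:
        if len(v.parents) > 1:
            return False
        for p in v.parents:
            children[p] = children.get(p, 0) + 1
    return all(count <= 1 for count in children.values())


v1 = Version("cart: []")
v2a = Version("cart: [book]", (v1.identity,))
v2b = Version("cart: [pen]", (v1.identity,))                      # concurrent sibling
v3 = Version("cart: [book, pen]", (v2a.identity, v2b.identity))   # merge of both

print(is_linear([v1, v2a]))             # True: one parent, one child
print(is_linear([v1, v2a, v2b, v3]))    # False: a DAG with a fork and a merge
```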
Strongly Consistent Transactions Viewed as Versions
 In a Database, ACID transactions appear as if they have serial order
 This is called serializability
 I know there are reduced degrees of consistency but this is usually close to true
 Transaction T1 commits at one point and Transaction T2 at a later one
 Transaction T1 presents a consistent view of the entire database
 Transaction T2 presents a different and later view of the database
An Active Database Is Constantly
Presenting New Versions of Its Data
Transaction T1 Is a Version of the Database
Later, Transaction T2 Is a Version of the Database
Everything Changeable Can Be Understood as a Bunch of Versions
How Do You Identify the Versions? Can You See Old Ones?
BigTable & HBase: Interpreting the Immutable Entrails
 BigTable & HBase:
 Log: When a change occurs, write a record in the log to ensure it’s durable
o Limited notion of transactions
 Major Compaction: an image (key sorted) of the key-value pairs at a point in time
 Minor Compaction: a set of new key-values (or new values for existing keys)
o Represents changes to a set of keys since the last major compaction
 Both BigTable & HBase function by writing immutable files
 There is not an “update-in-place” to change the data
 There is an append to a new file (Minor Compaction) describing a new version
 Both BigTable & HBase provide a programmer perspective of versions
 Each key has a set of versions (in a linear, strongly-consistent sequence)
 A read may get the latest version or may get an earlier version
Immutability Is at the Heart of BigTable & HBase
Data Change Is By Appending to Files Which Become Immutable
User Semantics Present Immutable Versions of Key-Values
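A toy sketch of the write path described above, assuming an in-memory buffer and omitting the durable log entirely. `TinyLSM` is a hypothetical class for illustration and does not reflect the BigTable or HBase APIs; the point is only that files are frozen when written and reads consult the newest file first.

```python
from types import MappingProxyType


class TinyLSM:
    """Toy store: writes buffer in memory, then flush to immutable files."""

    def __init__(self):
        self._memtable = {}
        self._files = []          # oldest first; each file is read-only

    def put(self, key, value):
        self._memtable[key] = value

    def minor_compaction(self):
        """Freeze buffered changes into a new immutable, key-sorted file."""
        if self._memtable:
            frozen = MappingProxyType(dict(sorted(self._memtable.items())))
            self._files.append(frozen)
            self._memtable = {}

    def major_compaction(self):
        """Merge all files into one image; old files are dropped, never edited."""
        merged = {}
        for f in self._files:     # later files override earlier ones
            merged.update(f)
        self._files = [MappingProxyType(dict(sorted(merged.items())))]

    def get(self, key):
        if key in self._memtable:
            return self._memtable[key]
        for f in reversed(self._files):   # newest file wins
            if key in f:
                return f[key]
        return None


db = TinyLSM()
db.put("row1", "v1")
db.minor_compaction()
db.put("row1", "v2")
db.put("row2", "x")
db.minor_compaction()
print(db.get("row1"))   # v2, read from the newer immutable file
```

The older file still holds `v1` untouched, which is what makes it possible to serve earlier versions until a major compaction discards them.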
DataSets: Immutable Collections of Data
 A DataSet is a fixed collection of tables:
 The schema for each table is created when the DataSet is made
 The contents of each table is created when the DataSet is made
 A DataSet is immutable:
o It is created, it may be consumed for reading, and it may be deleted
[Figure: DataSet-X, containing a Schema and Table1, Table2, …, TableN]
 DataSets may be relational or some other representation…
DataSets Referenced by a Relational Database
 DataSets can be present within the relational store
 The meta-data for the DataSet is visible within the relational database
 We may choose to store the DataSet “by-reference” but the contents are
semantically present within the relational store
[Figure: a Relational Database referencing DataSet-X, DataSet-Y, and DataSet-Z, which are stored elsewhere; DataSet-X contains a Schema and Table1, Table2, …, TableN]
Functional Calculations Outside a Relational DB
 Functional versus Dysfunctional calculations
 A functional calculation takes a set of inputs and predictably creates an output
 The entire calculation and pieces of it are idempotent
o Idempotence: Doing it more than once is the same as doing it once!
Idempotence: It’s Not That Hard!
 Work using DataSets can be performed outside the relational store
 The inputs may exist outside the relational store
 The computation may happen outside the relational store
 The results may be stored outside the relational store
 The results may appear (by reference) inside the relational store
[Figure: a Functional Calculation connecting DataSets; labels shown: DataSet-M, DataSet-N, DataSet-O, DataSet-P, DataSet-R]
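Idempotence of a functional calculation can be shown in a few lines. The sketch assumes the input DataSet is a tuple of rows and the result is stored under a fixed name; `total_by_region` and `run` are hypothetical helpers, and the DataSet names are borrowed from the figure only as labels.

```python
def total_by_region(orders):
    """Pure function: the output depends only on the immutable input."""
    totals = {}
    for region, amount in orders:
        totals[region] = totals.get(region, 0) + amount
    return tuple(sorted(totals.items()))


def run(results, output_name, calculation, dataset):
    """Store the result under a fixed name; repeating the call changes nothing."""
    results[output_name] = calculation(dataset)
    return results[output_name]


dataset_m = (("east", 10), ("west", 5), ("east", 7))   # immutable input DataSet
results = {}
first = run(results, "DataSet-P", total_by_region, dataset_m)
second = run(results, "DataSet-P", total_by_region, dataset_m)   # e.g. a retry

print(first == second, len(results))   # True 1
```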
Relational Operations on Immutable DataSets
 You can meaningfully apply relational operations across locked
relational data and immutable DataSets
 Relational operations are value based and require locking semantics
 Database concurrency control temporarily freezes the changing data
 Relational JOINS require frozen snapshots to be meaningful
 Locking presents a version of the Relational DB which can be joined
 Named and frozen DataSets may also be joined with the classic data
[Figure: a Relational Database holding TableA, TableB, …; "Join TableA and Table1" joins TableA with Table1 of DataSet-X, which is stored elsewhere with its Schema and Table2, …, TableN]
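The sketch below joins a frozen snapshot of changing data with an immutable table. Copying the rows in `freeze` stands in for database locking, which is an assumption made to keep the example self-contained; the table contents and the `sku` column are invented.

```python
def freeze(rows):
    """Take a frozen snapshot of changing rows (a stand-in for locking)."""
    return tuple(tuple(sorted(row.items())) for row in rows)


def join(left, right, key):
    """Value-based equi-join of two frozen row collections on `key`."""
    index = {}
    for row in right:
        right_row = dict(row)
        index.setdefault(right_row[key], []).append(right_row)
    joined = []
    for row in left:
        left_row = dict(row)
        for right_row in index.get(left_row[key], []):
            joined.append({**right_row, **left_row})
    return joined


table_a = [{"sku": "A1", "qty": 3}, {"sku": "B2", "qty": 1}]                # live and mutable
table_1 = freeze([{"sku": "A1", "price": 10}, {"sku": "B2", "price": 4}])   # part of a DataSet

view = freeze(table_a)      # freeze TableA for the duration of the join
table_a[0]["qty"] = 99      # a concurrent update does not disturb the join
result = join(view, table_1, "sku")
print(result)
```

Both inputs to the join are fixed values at the moment it runs, which is what makes the joined rows meaningful.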
DataSets Are Semantically Immutable
 A DataSet is semantically immutable
 It has a set of tables, rows, and columns
 It may have semi-structured data (e.g. JSON)
 It may have app-defined data
 DataSets may be defined as a SELECTION, PROJECTION, or JOIN
over previously existing DataSets
 Semantically, all that data is copied into a new DataSet
 Physically, optimizations can occur
[Figure: DataSet-X, containing a Schema and Table1, Table2, …, TableN]
Optimizing DataSets for Read Patterns
 DataSets are semantically immutable but may be physically changed
 You can add an index or two
 You can denormalize tables to optimize for read access
 You can make a copy of a table with far fewer columns for fast access
 You can place partitions of the DataSet close to where they are being read
 You can dynamically watch the read usage of a DataSet and create
optimizations for the new reader
[Figure: DataSet-X (Schema, Table1, Table2, …, TableN) augmented with indexes and a denormalization of parts of Table1 & Table2]
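A small sketch of "semantically immutable but physically changed": an index is added after the table is created, and every query returns the same rows as before. `DataSetTable` and its methods are hypothetical names, and the rows are invented.

```python
class DataSetTable:
    """Immutable rows; indexes are a physical detail added after creation."""

    def __init__(self, columns, rows):
        self._columns = tuple(columns)
        self._rows = tuple(tuple(r) for r in rows)   # semantic content, never changes
        self._indexes = {}                           # column -> {value: [row, ...]}

    def add_index(self, column):
        pos = self._columns.index(column)
        index = {}
        for row in self._rows:
            index.setdefault(row[pos], []).append(row)
        self._indexes[column] = index

    def where(self, column, value):
        if column in self._indexes:                  # fast path via the index
            return list(self._indexes[column].get(value, []))
        pos = self._columns.index(column)            # otherwise a full scan
        return [row for row in self._rows if row[pos] == value]


table1 = DataSetTable(
    ("emp", "dept"),
    [(47, "sales"), (18, "ops"), (91, "sales")],
)
before = table1.where("dept", "sales")
table1.add_index("dept")            # physical change only
after = table1.where("dept", "sales")
print(before == after)              # True
```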
Immutability and “Big Data”
 Massively parallel computations usually are functional and
based on immutable inputs
 MapReduce (Hadoop) and Dryad take immutable files as input
 The work is cut into pieces, each of which is immutable
 Functional computation (based on immutable inputs) is idempotent
 It’s OK to croak and restart
Immutability Is the Backbone of
“Big Data” Computations!
Functional Computation with Immutable Inputs
Failure and Restart Based on the Idempotent Nature
of Functional Computing over Immutable Inputs
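Why it is OK to croak and restart can be sketched with a word count over immutable partitions. The job runner below is a toy written for illustration, not the MapReduce, Hadoop, or Dryad API; a task that fails partway is simply rerun from the same unchanged input and its partial work is discarded.

```python
from collections import Counter


def map_partition(lines, crash=False):
    """Functional task: word counts for one immutable input partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
        if crash:
            raise RuntimeError("worker croaked mid-task")
    return counts


def run_job(partitions, crash_first_attempt=frozenset()):
    """Run every task, restarting any task whose first attempt fails."""
    total = Counter()
    for index, partition in enumerate(partitions):
        attempt = 0
        while True:
            attempt += 1
            crash = attempt == 1 and index in crash_first_attempt
            try:
                partial = map_partition(partition, crash=crash)
                break
            except RuntimeError:
                continue      # discard partial work; the input is unchanged
        total += partial
    return total


partitions = (
    ("to be or not to be",),
    ("be immutable", "or be sorry"),
)

clean = run_job(partitions)
restarted = run_job(partitions, crash_first_attempt={1})
print(clean == restarted)   # True
```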
Immutability as a Semantic Prism
 DataSets show an immutable semantic perspective
 Even if the underlying representation is augmented or completely replaced
 The King James Bible is character for character immutable
 Even when printed in a different font… Even when digitized…
Even when accompanied by different pictures… ???... Hmm…
 Is a DataSet changed if there is a loss-less transformation to a new
schema representation?
 The new address field has more capacity… Is that OK?
 The ENUM values are mapped to a new underlying representation… Is that OK?
It’s Not Enough to Have the Right Bits!
You Have to Know How to Interpret Them…
“President Bush” meant a different thing in 1990 versus 2005
The word “Fanny” is interpreted differently in the US versus Australia
You Need to Know What the Immutable Bits Actually Mean!
Why Normalize?
 Normalization’s goal is to eliminate update anomalies
 Can be changed without “funny behavior”
 Each data item lives in one place
De-normalization is OK if you aren’t going to update!
Classic problem with de-normalization:
Emp # | Emp Name | Emp Phone | Mgr # | Mgr Name | Mgr Phone
47    | Joe      | 5-1234    | 13    | Sam      | 6-9876
18    | Sally    | 3-3123    | 38    | Harry    | 5-6782
91    | Pete     | 2-1112    | 13    | Sam      | 6-9876
66    | Mary     | 5-7349    | 02    | Betty    | 4-0101
Can’t update Sam’s phone # since there are many copies
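The update anomaly, and why it disappears for immutable data, can be sketched with the rows from the slide's table (the employee phone column is left out for brevity). The replacement phone number is hypothetical.

```python
from types import MappingProxyType

# Rows taken from the slide's table, without the employee phone column.
rows = [
    {"emp": 47, "name": "Joe", "mgr": 13, "mgr_name": "Sam", "mgr_phone": "6-9876"},
    {"emp": 18, "name": "Sally", "mgr": 38, "mgr_name": "Harry", "mgr_phone": "5-6782"},
    {"emp": 91, "name": "Pete", "mgr": 13, "mgr_name": "Sam", "mgr_phone": "6-9876"},
    {"emp": 66, "name": "Mary", "mgr": 2, "mgr_name": "Betty", "mgr_phone": "4-0101"},
]


def phones_on_file(table, mgr):
    """All phone numbers the table currently claims for one manager."""
    return {row["mgr_phone"] for row in table if row["mgr"] == mgr}


# Mutable and denormalized: updating one copy of Sam's phone creates an anomaly.
mutable = [dict(row) for row in rows]
mutable[0]["mgr_phone"] = "6-0000"          # hypothetical new number
print(len(phones_on_file(mutable, 13)))     # 2 conflicting answers

# Immutable and denormalized: nobody can update, so the copies cannot diverge.
frozen = tuple(MappingProxyType(dict(row)) for row in rows)
print(len(phones_on_file(frozen, 13)))      # 1
```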
We Are Swimming in a Sea of Immutable Data
[Figure: a Data Owning Service publishes Monday’s, Tuesday’s, and Wednesday’s Price-Lists; copies of these price-lists are held by Listening Partner Service-1, Service-5, Service-7, and Service-8]
Think First Before You Normalize
For God’s Sake,
Don’t Normalize Immutable Data!
Unless It’s to Optimize Space in the Representation…
Culture:
the Way We Do Things Around Here
People Normalize ‘Cuz their Professor Said To
-- That’s Why We Need All Those Joins…
If All You Have Is a Database,
Everything Looks Like a Nail…
Takeaways
 Things have changed towards immutability
 We need immutability to coordinate at ever increasing distances
 We can afford immutability because we have room to store versions for a long time
 Versioning allows a changing view of objects with immutable backing
 Linear (strongly consistent) version histories for some (e.g. BigTable, HBase)
 Directed-Acyclic-Graph (eventually consistent) history for others (e.g. Dynamo, Riak)
 Increasingly, systems are based on writing immutable data
 Log-Structured Merge trees (e.g. HBase, BigTable, LevelDB, etc.) as implementation
 Layering immutable data over a distributed file system offers robustness and scale
 Immutability extends consistent relational systems
 Very large immutable DataSets may be embedded by reference in relational stores
 The semantics of immutable DataSets joins cleanly with the changing relational data
 Semantically immutable data may be changed for optimization
 Projections, redundant copies, denormalization, column stores, indexing and more…
 Semantically immutable means the user behavior doesn’t change
 Immutability is the backbone of emerging “Big Data” systems
 MapReduce, Hadoop, and more leverage immutable snapshots
Immutability Changes Everything!