Slides - WWW4 Server

Computer Science

iBigTable: Practical Data Integrity for BigTable in Public Cloud

CODASPY 2013

Wei Wei, Ting Yu, Rui Xue


iBigTable – Overview



BigTable – Scalable Storage System
o Stores large data sets of petabytes or more
  - Business transactions, software logs, social network messages
o Benefits from processing large data sets
  - Identify business opportunities, find software bugs, mine social relationships
o Widely used at Google, Facebook, Twitter

However, small companies and researchers usually lack the capability to deploy BigTable
o Large cluster required
o Technical difficulties
o High maintenance cost

Deploying BigTable in a public cloud is an economical solution.

However, one may not always trust the public cloud provider.


iBigTable – Overview



Our Focus
o Provide integrity assurance for BigTable in a public cloud

Basic Idea
o Build a Merkle Hash Tree based Authenticated Data Structure
o Decentralize integrity verification across multiple nodes


Agenda



Introduction



System Model



System Design



Experimental Evaluation



Related Work



Conclusion


Merkle Hash Tree (MHT)

• $s_{\text{root}} = S(h_{\text{root}})$
• $h_{\text{root}} = H(h_{12} \,|\, h_{34})$
• $h_{12} = H(h_1 \,|\, h_2)$
• $h_{34} = H(h_3 \,|\, h_4)$
• $h_1 = H(d_1)$
• $h_2 = H(d_2)$
• $h_3 = H(d_3)$
• $h_4 = H(d_4)$



Verification Object (VO)
o Data returned along with the result and used to authenticate the result

Example
o To authenticate data $d_1$, the VO for $d_1$ is $\{h_2, h_{34}\}$
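The example above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: SHA-256 as H and byte concatenation for "|" are assumptions, the signature S over the root is left out, and the function names are hypothetical.

```python
import hashlib

def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption, the slides do not name one."""
    return hashlib.sha256(data).digest()

def build_levels(blocks):
    """Return all tree levels, leaves first. Assumes a power-of-two block count."""
    level = [H(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def make_vo(levels, index):
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    vo = []
    for level in levels[:-1]:
        sibling = index ^ 1
        vo.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        index //= 2
    return vo

def verify(block, vo, trusted_root):
    """Recompute the root from the block and its VO, then compare."""
    h = H(block)
    for sibling, sibling_is_left in vo:
        h = H(sibling + h) if sibling_is_left else H(h + sibling)
    return h == trusted_root

blocks = [b"d1", b"d2", b"d3", b"d4"]
levels = build_levels(blocks)
root = levels[-1][0]
vo_d1 = make_vo(levels, 0)  # the VO for d1 holds h2 and h34
```

A client that trusts `root` can call `verify(b"d1", vo_d1, root)` without seeing d2, d3, or d4.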


BigTable – Data Model



A table is a sparse, distributed, persistent multidimensional sorted map (OSDI 2006).



Data Model
o Table schema only defines its column families
  - Each family consists of any number of columns
  - Each column consists of any number of versions
  - Columns only exist when inserted; NULLs are free
  - Columns within a family are sorted and stored together
o Table contains a set of rows sorted by row key
  - Row: a set of column families
  - Column family: a set of columns
  - Cell: arbitrary (uninterpreted) string
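As a rough illustration of this data model, the toy class below keeps a nested map from row key to column family to column to versions. The class and method names are hypothetical and it ignores distribution and persistence entirely; it only shows that the schema fixes column families while columns and versions appear on insert.

```python
class SparseTable:
    """Toy model: row key -> family -> column -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # the schema only fixes column families
        self.rows = {}

    def put(self, row, family, column, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = (self.rows.setdefault(row, {})
                             .setdefault(family, {})
                             .setdefault(column, {}))
        versions[ts] = value  # a column exists only once something is inserted

    def get(self, row, family, column):
        """Return the latest version, or None if the column was never inserted."""
        versions = self.rows.get(row, {}).get(family, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def row_keys(self):
        """Rows are exposed in sorted row-key order."""
        return sorted(self.rows)

t = SparseTable(["info"])
t.put("row2", "info", "name", "b", ts=1)
t.put("row1", "info", "name", "a", ts=1)
t.put("row1", "info", "name", "a2", ts=2)
```

Missing columns cost nothing here: `t.get("row2", "info", "age")` simply finds no entry.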


BigTable – Data Organization



Tablet
o Root tablet
o Metadata tablet
o User tablet

Tablet Server
o Each tablet is stored on exactly one tablet server
o Multiple tablets can be stored on one tablet server

Master
o Responsible for load balancing and assigning tablets


BigTable – Data Operations



Queries
o Single row query by specifying the row key
o Range query by specifying start and end row keys
o Projection query to retrieve a specific column or column family

Changes
o Data insert, update, and delete
o Tablet split & merge
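The three query types can be sketched over a small in-memory tablet as follows. The sample data, the "family:column" cell naming, and the inclusive end key are all assumptions made for illustration; this is not the BigTable or HBase client API.

```python
import bisect

# Toy tablet: row key -> {"family:column": value}; hypothetical sample data.
table = {
    "row10": {"info:name": "a", "info:age": "30", "log:last": "x"},
    "row20": {"info:name": "b"},
    "row30": {"info:name": "c", "log:last": "y"},
}
sorted_keys = sorted(table)

def single_row(key):
    """Single row query: look up one row by its row key."""
    return table.get(key)

def range_query(start, end):
    """Rows with start <= key <= end (the inclusive end is an assumption)."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    return [(k, table[k]) for k in sorted_keys[lo:hi]]

def project(rows, family, column=None):
    """Keep only the cells of one column family, or of one family:column."""
    out = []
    for key, cells in rows:
        if column is None:
            kept = {c: v for c, v in cells.items() if c.startswith(family + ":")}
        else:
            name = f"{family}:{column}"
            kept = {c: v for c, v in cells.items() if c == name}
        out.append((key, kept))
    return out
```

For example, `project(range_query("row10", "row30"), "log")` returns every row in the range but only its `log` cells.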

System Model



Similar to Database Outsourcing
o Host data at an untrusted party and support data retrieval
o Principal ideas of integrity verification

Different from Database Outsourcing
o Data distributed among a large number of nodes
  - How to handle authenticated data structures during tablet merging or splitting
  - Impractical to store authenticated structures in a single node
  - Not scalable to adopt a centralized integrity verification scheme at a single point
o Simple data model and query interfaces
  - Design much simpler and more efficient authenticated structures and protocols to verify data integrity

The actual design and deployment of authentication schemes are significantly different


System Model



Assumptions
o The public cloud is not trusted, and BigTable is deployed in the public cloud, including the master and tablet servers
o The data owner has a public/private key pair, and the public key is known to all
o The data owner is the only party who can update data
o Public communications are through a secure channel

Attacks from the Public Cloud
o Return incorrect data by tampering with some data
o Return an incomplete result by discarding some data
o Report that data doesn’t exist, or return old data


System Model cont’d



Goal
o Deploy BigTable over a public cloud with practical integrity assurance

Design Goals
o Security (integrity)
  - Correctness, completeness, freshness
o Practicability
  - Simplicity, flexibility, efficiency


System Design



Basic Idea
o Embed an MHT-based Authenticated Data Structure in each tablet


Distributed Merkle Hash Tree

[Figure: Data Owner (root hash), Root Tablet, Meta Tablet, User Tablets]

Pros
o Authenticated data distributed across nodes
o Only one hash maintained for all data

Cons
o Requires update propagation
o Concurrent updates could cause issues
  - Hard to synchronize hash tree updates
  - Complicates protocols between tablet servers


Our Design

[Figure: Data Owner (root hash), Root Tablet, Meta Tablet, User Tablets]

Our Design

[Figure: Data Owner (one root hash per tablet), Root Tablet, Meta Tablet, User Tablets]

System Design



Basic Idea
o Embed an MHT-based Authenticated Data Structure in each tablet
o Store the root hash of each MHT with a trusted party (e.g., the data owner)
o Decentralize the integrity verification across multiple tablet servers

Data integrity is guaranteed by the correctness of the root hash of the MHT in each tablet.


Decentralized Integrity Verification

Client and tablet server serving the ROOT tablet
• 1.1 meta key (root, meta, table name, start row key)
• 1.2 generate VO
• 1.3 meta row (meta tablet location, start and end keys), VO

Client and tablet server serving the META tablet
• 2.1 meta key (meta, table name, start row key)
• 2.2 generate VO
• 2.3 meta row (user tablet location, start and end keys), VO

Client and tablet server serving the USER tablet
• 3.1 start and end row keys
• 3.2 generate VO
• 3.3 rows within the start and end row keys, VO
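The three-step lookup can be sketched as a chain of independent checks, one per tablet, each against that tablet's own trusted root hash. Everything below is a simplified assumption for illustration: SHA-256, a binary hash tree instead of a Merkle B+ tree, exact-key lookups instead of range lookups, a power-of-two number of rows per tablet, and hypothetical tablet and key names.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf(key, value):
    return H(f"{key}={value}".encode())

class Tablet:
    """Toy tablet with a binary hash tree over its sorted rows."""
    def __init__(self, rows):
        self.rows = dict(rows)
        self.keys = sorted(self.rows)
        level = [leaf(k, self.rows[k]) for k in self.keys]
        self.levels = [level]
        while len(level) > 1:
            level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
            self.levels.append(level)
        self.root_hash = self.levels[-1][0]

    def lookup(self, key):
        """Server side: return the row value plus its VO (sibling hashes)."""
        i = self.keys.index(key)
        vo = []
        for level in self.levels[:-1]:
            vo.append((level[i ^ 1], (i ^ 1) < i))
            i //= 2
        return self.rows[key], vo

def verify(key, value, vo, trusted_root):
    h = leaf(key, value)
    for sibling, sibling_is_left in vo:
        h = H(sibling + h) if sibling_is_left else H(h + sibling)
    return h == trusted_root

tablets = {
    "root": Tablet({"meta-a": "meta1", "meta-b": "meta2"}),
    "meta1": Tablet({"users,a": "user1", "users,m": "user2"}),
    "user1": Tablet({"alice": "row-a", "bob": "row-b"}),
}
# Root hashes kept by the trusted party, one per tablet.
trusted_roots = {name: t.root_hash for name, t in tablets.items()}

def verified_read(path):
    """Look up one key per tablet (root, meta, user), verifying each answer."""
    tablet_name, value = "root", None
    for key in path:
        value, vo = tablets[tablet_name].lookup(key)
        if not verify(key, value, vo, trusted_roots[tablet_name]):
            raise ValueError(f"integrity check failed at tablet {tablet_name}")
        tablet_name = value  # a meta row names the next tablet to contact
    return value
```

Each check needs only the VO from the server that answered and the root hash of that one tablet, so no server has to coordinate with another to prove its answer.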

iBigTable – Authenticated Data Structure



Signature Aggregation Compared with Merkle Hash Tree
o Both can guarantee correctness and completeness
o Signature aggregation incurs significant computation cost on the client side and large storage cost on the server side
o Not clear how signature aggregation addresses freshness

MHT-based Authenticated Data Structure
o SL-MBT: a single-level Merkle B+ tree
  - Builds a Merkle B+ tree based on all key-value pairs in a tablet
  - Each leaf is a hash of a key-value pair
o ML-MBT: a multi-level Merkle B+ tree
  - Builds multiple Merkle B+ trees at three different levels
o TL-MBT: a two-level Merkle B+ tree (adopted)


iBigTable – TL-MBT



Index Level
o Only one tree: the index tree
o Each leaf points to a data tree

Data Level
o Row Tree: generates hashes for all rows; each leaf is a hash of a row
o Column Family Tree: generates hashes for a column family of all rows; each leaf is a hash of a column family of a row
o Column Tree: generates hashes for a column of all rows; each leaf is a hash of a column of a row

Verification Object Generation
o Find the data tree(s) based on the specific query
o Use the data tree(s) to generate the VO based on the query range

Pros
o Performance is comparable to ML-MBT for row-based queries
o Much more efficient than SL-MBT and ML-MBT for projection queries
o Flexible authenticated data structure

Cons
o Update cost may increase by 3 times
o Large storage cost if column trees are created
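One way to picture the trade-off is a small sketch of how a data tree might be chosen for a query and which trees a single cell write touches. The tree naming scheme ("row", "cf:...", "col:...") and both functions are hypothetical; the slides do not describe the index tree at this level of detail.

```python
def choose_data_tree(index, family=None, column=None):
    """Pick the most specific data tree in the index that covers the query.

    `index` is the set of data-tree names the index tree points to.
    """
    if family and column and f"col:{family}:{column}" in index:
        return f"col:{family}:{column}"
    if family and f"cf:{family}" in index:
        return f"cf:{family}"
    return "row"  # fall back to the row tree

def trees_to_update(index, family, column):
    """Data trees whose hashes change when one cell is written."""
    candidates = ["row", f"cf:{family}", f"col:{family}:{column}"]
    return [t for t in candidates if t in index]

index = {"row", "cf:info", "col:info:name"}
```

With all three kinds of tree present, writing `info:name` touches three trees, which matches the update cost noted under Cons; a projection on `info:name` can be answered from the column tree alone.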


iBigTable – Data Access



Range query within a tablet
o Find the metadata tablet, the user tablet, and the data through the specific tablet servers

Range query across tablets
o Break a large range into small sub-ranges
  - Based on the end key of each tablet
  - Each sub-range falls in one tablet
o Execute the sub-range queries
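Splitting a range by tablet end keys can be sketched as below. The half-open ranges and exclusive tablet end keys are assumptions chosen to keep the arithmetic simple; the function name and the integer keys are hypothetical.

```python
import bisect

def split_range(start, end, tablet_end_keys):
    """Split [start, end) into per-tablet sub-ranges.

    Assumes tablet i holds keys in [tablet_end_keys[i-1], tablet_end_keys[i])
    and that `tablet_end_keys` is sorted.
    Returns a list of (tablet_index, sub_start, sub_end).
    """
    sub_ranges = []
    i = bisect.bisect_right(tablet_end_keys, start)  # first tablet containing start
    lo = start
    while lo < end and i < len(tablet_end_keys):
        hi = min(end, tablet_end_keys[i])
        sub_ranges.append((i, lo, hi))
        lo = hi
        i += 1
    return sub_ranges

end_keys = [30, 60, 100]
parts = split_range(20, 70, end_keys)  # spans all three tablets
```

Each sub-range can then be sent to its tablet server and verified against that tablet's root hash on its own.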


iBigTable – Single Row Update



Partial Tree Verification Object (VO)
o Data included
  - Only keys and hashes of data for two boundaries
  - Hashes of nodes for computing the root hash
  - Keys in related inner nodes
o Used for direct update within the range of the partial tree

Data Owner and Tablet Server
• 3.1 new row
• 3.2 generate VO (tablet server)
• 3.3 partial tree VO
• 3.4 verify and update tablet root hash (data owner)
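The verify-then-update step can be illustrated in a much simpler setting than the partial tree VO: an in-place update of one existing row in a binary hash tree, where the VO is just the sibling path. This is a deliberate simplification and an assumption, as are SHA-256 and the function names; it does not model insertion or B+ tree node splits.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def root_from_path(leaf_hash, vo):
    """Fold a leaf hash with its sibling hashes up to the root."""
    h = leaf_hash
    for sibling, sibling_is_left in vo:
        h = H(sibling + h) if sibling_is_left else H(h + sibling)
    return h

def owner_update(old_row, new_row, vo, trusted_root):
    """Data owner: check the VO against the stored root, then derive the new root."""
    if root_from_path(H(old_row), vo) != trusted_root:
        raise ValueError("VO does not match the stored root hash")
    # The siblings are unchanged, so the same VO yields the new root.
    return root_from_path(H(new_row), vo)

# Four-row tablet; the owner holds only `root`.
h = [H(d) for d in (b"d1", b"d2", b"d3", b"d4")]
root = H(H(h[0] + h[1]) + H(h[2] + h[3]))
vo_d2 = [(h[0], True), (H(h[2] + h[3]), False)]  # as returned by the server
new_root = owner_update(b"d2", b"d2-new", vo_d2, root)
```

The owner never needs the full tree: the VO is enough both to detect a forged answer and to compute the root hash it should store next.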

iBigTable – Single Row Update cont’d

[Figure: Initial MB+ row tree of a tablet in a tablet server, with leaf keys 0, 10, ..., 90.]

iBigTable – Single Row Update cont’d

[Figure: Insert a row with key 45 into the partial tree VO.]

[Figure: Partial tree VO after 45 is inserted.]

iBigTable – Efficient Batch Update



Single row update is inefficient
o One verification per row

Range query is efficient
o One verification for multiple rows

How can we do batch updates like a range query?

Data Owner and Tablet Server
• 3.1 request partial tree VO for a range
• 3.2 generate VO (tablet server)
• 3.3 partial tree VO
• 3.4 verify and update tablet root hash (data owner)
• 3.4 … 3.n new rows

iBigTable – Tablet Changes



Tablet split
o Tablet grows too large
o Load balancing
o Better management

Tablet merge
o Only a small amount of data in a tablet
o Improve query efficiency

How to guarantee data integrity?
o Make sure the root hash of each tablet is correctly updated


iBigTable – Tablet Split

[Figure (a): An MBT of a tablet in a tablet server, with the tablet split at key 45.]

iBigTable – Tablet Split cont’d

[Figure (b): Partial tree returned to the data owner, with the left boundary node, the two boundary keys, and the right boundary node marked.]

[Figure (c): The data owner splits it into a left partial tree and a right partial tree.]

[Figure (d): The data owner adjusts the left partial tree and computes the new root hash for the new tablet.]

[Figure (e): The data owner adjusts the right partial tree and computes the new root hash for the new tablet.]

iBigTable – Tablet Merge

[Figure: Left partial tree, right partial tree, and merged tree.]

Data owner merges two partial trees sent from tablet servers into one for the new merged tablet

iBigTable – Experimental Evaluation



System Implementation
o Implementation based on HBase
o Extend some interfaces to specify integrity options
o Add new interfaces to support efficient batch updates

Experiment Setup
o 5 hosts in Virtual Computing Lab (VCL)
o Intel(R) Xeon(TM) CPU 3.00GHz
o Red Hat Enterprise 5.1, Hadoop-0.20.2, and HBase-0.90.4
o Client network with 30Mbps download and 4Mbps upload

iBigTable – Baseline



Observations
o It takes almost the same time to transmit any amount of data smaller than 4k
o Time doubles each time the data size doubles, from 4k to 8k and on up to around 64k
o After 64k, the time increases dramatically
o The VO size increases as the range increases, but the VO size per row actually decreases

[Figure: Ex 1. Time to receive data from server]

[Figure: Ex 2. VO size vs # of rows]

iBigTable – Write



Observations
o The performance overhead ranges from 10% to 50%
o iBigTable with Efficient Batch Update causes a performance overhead of only about 1.5%
o Communication cost is high, but computation cost is small, about 2~5%

[Figure: Ex 3. Write performance]

[Figure: Ex 4. The breakdown of write cost]

iBigTable – Read



Observations
o The read performance overhead is small, ranging from 1% to 8%
o The total computation cost of both client and servers is about 1%
o Most of the performance degradation is caused by communication

[Figure: Ex 5. Read performance]

[Figure: Ex 6. The breakdown of read cost]




Observations
o As the number of trees that need to be updated increases, the performance decreases dramatically
o For different data sizes, performance varies widely across the different cases

[Figure: Ex 7. TL-MBT update performance]

[Figure: Ex 8. Projection query with TL-MBT]

iBigTable – Related Work



Research related to BigTable
o Performance evaluation [Carstoiu et al., NISS 2010]
o High performance OLAP analysis [You et al., IMSCCS 2008]
o BigTable in a hybrid cloud [Ko et al., HotCloud 2011]
o Integrity layer for cloud storage [Kevin et al., CCS 2009]

Outsourcing Database
o Different authenticated data structures [DASFAA 2006]
o Probabilistic approaches [Xie et al., VLDB 2007]
o Approaches to address complex queries [Yang et al., SIGMOD 2009]
o Partitioned MHT (P-MHT) [Zhou et al., MS-CIS 2010]

iBigTable – Conclusion



Contributions
o Explore the practicability of different authenticated data structures
  - Focus on Merkle Hash Tree based authenticated data structures
o Design a set of efficient mechanisms to handle authenticated data structure changes
  - Efficient data batch update
  - Handle tablet split and merge
o Implement a prototype of iBigTable based on HBase, an open source implementation of BigTable
o Conduct experimental evaluation of performance overhead

Thank you
