Transaction Processing on Top of Hadoop
Spring 2012
Aviram Rehana, Lior Zeno
Supervisor: Edward Bortnikov

Agenda
• The problem and motivation
• Background on databases’ access
• What is done so far
• Project goals
• Effort assessment and roadmap

The problem and motivation
• Combining transactional and non-transactional operations on top of a distributed NoSQL database.
• A little bit of motivation:
  • A new approach in the distributed databases field.
  • Better performance in comparison to related solutions.
  • Ease of use from a programmer’s perspective.

Background on databases
• Historically, data processing was done with SQL databases, which supported transactions.
• The growth of computing and the internet led to rapid growth in data volume, which created a need for databases that support much larger data sets.
• Because of these scalability issues, SQL databases were replaced in Web-scale settings by NoSQL stores during the 2000s. These technologies dropped transaction (txn) support, though.

Background on databases
• Lately, transaction support has started to appear for some NoSQL databases, such as HBase.
• Application designers noticed the potential of transactions in the distributed field, and solutions have emerged over the years.
• Our goal is to extend the mechanism to support both transactional and non-transactional traffic.

What is done so far
• Existing open-source stack:
  • HDFS – a distributed file system primarily designed for write-once, read-many access (in use by Facebook and others).
  • HBase – a distributed read/write database on top of HDFS, without support for txns.
  • OMID – a transaction processing service on top of HBase, designed by the Yahoo! Research team.
• OMID does not guarantee consistency between transactional and non-transactional access to the same data at the same time.
• We wish to extend OMID to support transactional as well as traditional non-transactional clients.

The OMID model

An example of mixed access
• What happens if there are no rules:

What is a transaction?
• A transaction is a collection of operations in an application that fulfills the ACID properties:
  • Atomicity - Either all or none of the transaction's operations are performed.
  • Consistency - A transaction T is a correct transformation of the state.
  • Isolation - Even though transactions (txs) execute concurrently, it appears to each tx T that the others executed either before T or after T (serializability).
  • Durability - Results of a committed transaction are permanent, in spite of possible failures.

Levels of Isolation
• Nevertheless, achieving the highest level of isolation (serializability) is too expensive: it typically requires locking, which reduces scalability.
• How do we overcome this issue?
• Snapshot Isolation (SI) - A guarantee that all reads made in a transaction will see a consistent snapshot of the database (in practice, the transaction reads the last committed values that existed at the time it started), and that the transaction itself will successfully commit only if none of its updates conflict with concurrent updates made since that snapshot.
• SI gives weaker guarantees than serializability, but it is much more scalable!

Project goals
• Extend the OMID work to support both transactional and non-transactional accesses against the distributed database.
• Implement the base algorithms used in the project:
  • Lazy Snapshot Algorithm (LSA)
  • RingSTM
• Achieve better performance than OMID and other solutions for regular transactional clients.

Roadmap of the project
• Phase 1:
  • Design RingSTM + LSA components – 2 weeks
  • Integrate the design into OMID – 5 weeks
  • Simulations and benchmarking – 2 weeks
• Phase 2:
  • Design the module for write conflicts against non-transactional clients – 2 weeks
  • Integrate the above design into OMID – 3 weeks
  • Simulations and benchmarking of the whole project – 2 weeks
  • Final conclusions – 1 week
• Future work:
  • Implementing serializability and dynamic separation.

Questions?
Special thanks: Eshcar Hillel.
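
Supplement: a sketch of the Snapshot Isolation rule
The rule on the Levels of Isolation slide (read from the snapshot taken at start, commit only if no concurrent update conflicts) can be illustrated with a small single-process sketch. This is not OMID's implementation or API: the names `SnapshotStore`, `begin`, `read`, `write`, `commit` and `demo` are hypothetical, and a simple in-memory counter is assumed in place of a real timestamp service.

```python
import itertools


class SnapshotStore:
    """Toy multi-version store with snapshot isolation (illustrative only)."""

    def __init__(self):
        self._clock = itertools.count(1)  # assumed stand-in for a timestamp service
        self._versions = {}               # key -> list of (commit_ts, value), oldest first
        self._last_commit = {}            # key -> commit_ts of the latest committed write

    def begin(self):
        # A transaction records its start timestamp and buffers its writes.
        return {"start_ts": next(self._clock), "writes": {}}

    def read(self, txn, key):
        # A transaction sees its own buffered writes first.
        if key in txn["writes"]:
            return txn["writes"][key]
        # Otherwise it sees the newest version committed before it started.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts < txn["start_ts"]:
                return value
        return None

    def write(self, txn, key, value):
        # Writes stay private until commit.
        txn["writes"][key] = value

    def commit(self, txn):
        # Write-write conflict check: abort if any key written by this
        # transaction was committed by someone else after the snapshot.
        for key in txn["writes"]:
            if self._last_commit.get(key, 0) > txn["start_ts"]:
                return False
        commit_ts = next(self._clock)
        for key, value in txn["writes"].items():
            self._versions.setdefault(key, []).append((commit_ts, value))
            self._last_commit[key] = commit_ts
        return True


def demo():
    store = SnapshotStore()
    setup = store.begin()
    store.write(setup, "x", 0)
    store.commit(setup)

    t1 = store.begin()
    t2 = store.begin()                  # concurrent with t1
    store.write(t1, "x", 1)
    first = store.commit(t1)            # no conflict, commits
    seen_by_t2 = store.read(t2, "x")    # t2 still reads its snapshot value
    store.write(t2, "x", 2)
    second = store.commit(t2)           # conflicts with t1's committed write
    final = store.read(store.begin(), "x")
    return first, seen_by_t2, second, final
```

In `demo`, the two transactions overlap and both write `x`; the first to commit wins and the second is aborted, while the second transaction's read still returns the value from its own snapshot. Buffering writes until commit is one simple way to keep uncommitted data invisible to other transactions.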