BACHELOR OF SCIENCE (HONOURS) IN SOFTWARE DESIGN - YEAR 4
Databases
DISTRIBUTED DATABASE SYSTEMS
Tom Wasniewski (A00148326)

TABLE OF CONTENTS

1. Introduction
2. Challenges and Issues of Modern Distributed Databases
3. Google's Database System
   a. General Description of the Google Services and Database
   b. Architecture of the Google Distributed Database
   c. Hardware/Software used by Google
   d. Security of the Google Distributed Database
   e. Reliability of the Google Distributed Database
4. Yahoo!'s Database System
   a. General Description of the Yahoo! Services and Database
   b. Architecture of the Yahoo! Distributed Database
   c. Hardware/Software used by Yahoo!
   d. Security of the Yahoo! Distributed Database
   e. Reliability of the Yahoo! Distributed Database
5. Reference Sources

1. INTRODUCTION

In this paper I am going to discuss the distributed database systems of two web development giants: Google and Yahoo!. Both companies operate on massive amounts of data and serve users from all over the globe. As will be shown later on, there is much to take into account when choosing and designing a database system of such a large scale.

2. CHALLENGES AND ISSUES OF MODERN DISTRIBUTED DATABASES

There are a number of issues to consider while designing a distributed database system:

- Scalability (the system needs to be able to span multiple datacentres located in different, often distant, geographical locations)
- Transparency (the system has to behave as one global unit, without interrupting the user's experience)
- Data Replication/Fragmentation (there must exist a set of rules for how to distribute the data among many servers and locations)
- Security (both the physical remote sites and the underlying data network need to be carefully secured to reduce any potential risk of a breach)

These are only some of the challenges that a modern distributed database system brings with it. The field of distributed systems is relatively young and there are still no established standards for how to deal with these problems most efficiently and effectively. The biggest players, like Google and Yahoo!, are nevertheless taking big steps towards making a change in this area.

3. GOOGLE'S DATABASE SYSTEM

- General Description of the Google Services and Database

Google is one of the fastest growing software companies in the world. Starting from the seemingly simple 'Google Search' web search application, it now provides a broad variety of software-related (mostly web) services, such as Gmail, YouTube or Google Maps, to name a few. As a result of Google joining the elite group of leading web development companies, it has to manage and process enormous amounts of data. Standard database systems like Oracle or MySQL do not handle huge amounts of distributed data efficiently. For this reason, over the years Google has come out with its own innovative database systems: BigTable, Megastore and, most recently, Spanner.

For a long time, Google has been using the BigTable database system. It introduced an innovative way to store distributed data by abandoning the traditional relational approach and using multi-dimensional tables, with each cell being assigned a timestamp, allowing multiple versions of the same cell to be stored. As a trade-off for the high scalability and performance achieved by the NoSQL ("Not only SQL", i.e. non-relational) approach, BigTable lacks the functionality to query and aggregate the database.
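To give a flavour of this idea, the following is a small Python sketch of my own (a simplification for illustration only, not Google's actual code) of a table whose cells are addressed by row key, column and timestamp, so that several versions of the same cell can be kept side by side.

    import time
    from collections import defaultdict

    class MultiVersionTable:
        """Toy model of a BigTable-style table: every (row, column) cell
        keeps all the versions ever written to it, keyed by timestamp."""

        def __init__(self):
            # (row_key, column) -> {timestamp: value}
            self._cells = defaultdict(dict)

        def put(self, row_key, column, value, timestamp=None):
            ts = timestamp if timestamp is not None else time.time()
            self._cells[(row_key, column)][ts] = value

        def get(self, row_key, column, timestamp=None):
            """Return the newest version at or before 'timestamp'
            (or the newest overall when no timestamp is given)."""
            versions = self._cells.get((row_key, column), {})
            candidates = [ts for ts in versions
                          if timestamp is None or ts <= timestamp]
            if not candidates:
                return None
            return versions[max(candidates)]

    # Two versions of the same cell live side by side.
    table = MultiVersionTable()
    table.put("com.example/index", "contents", "<html>v1</html>", timestamp=1)
    table.put("com.example/index", "contents", "<html>v2</html>", timestamp=2)
    print(table.get("com.example/index", "contents"))               # newest version -> v2
    print(table.get("com.example/index", "contents", timestamp=1))  # as of time 1 -> v1

A structure like this is easy to scale and replicate, but it offers no built-in way to run rich queries or aggregations over the data, which is exactly the trade-off mentioned above.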
After some time BigTable evolved into Megastore, which in turn morphed into the state-of-the-art distributed database system we know today by the name of Spanner. Even though BigTable is still widely used by Google services, Spanner is slowly replacing its predecessor, as it offers some interesting features and implements crucial improvements over BigTable.

- Architecture of the Google Distributed Database

"Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions." (Google Inc., 2012)

A universe is the name for a single Spanner deployment. It consists of zones, which represent particular locations that store the data. In each zone there is one zone-master server, which is in charge of a large group (100-1000) of slave servers called span-servers. These span-servers hold the actual replicas of the data assigned to them by the zone-master. When clients request data from a particular zone, they first contact a location proxy server, which directs them to the correct span-server. (Hari, 2012)

Each span-server is responsible for storing up to a thousand instances of tablets, which are collections of key-value mappings, similar to BigTable's format. Another interesting structure used in Spanner is the directory. Directories hold collections of keys with a common prefix. This helps greatly in keeping related tables together and distributing them over as few partitions as possible, depending on the tables' sizes, which in turn results in faster querying times. (Hari, 2012)

It is worth mentioning that, in addition to being extremely scalable, Spanner also supports an SQL-based query language, unlike its predecessor BigTable. The last feature I would like to mention is its ability to automatically adjust how the data is spread across the available span-servers, dynamically optimizing the load within the universe.
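As a rough illustration of the directory concept, the sketch below (again my own simplification with made-up names, not Spanner's real placement logic) groups keys that share a common prefix into directories and then places each directory as a whole onto a single span-server, so that related data is never split across partitions.

    from collections import defaultdict

    def group_into_directories(keys, prefix_length=1):
        """Group keys that share a common prefix into 'directories'
        (here the prefix is simply the first path component)."""
        directories = defaultdict(list)
        for key in keys:
            prefix = "/".join(key.split("/")[:prefix_length])
            directories[prefix].append(key)
        return directories

    def assign_to_spanservers(directories, spanservers):
        """Place each directory on exactly one span-server (round-robin),
        so keys with the same prefix always end up together."""
        placement = {}
        for i, prefix in enumerate(sorted(directories)):
            placement[prefix] = spanservers[i % len(spanservers)]
        return placement

    keys = ["users/alice/profile", "users/alice/mail",
            "users/bob/profile", "videos/cat/meta"]
    dirs = group_into_directories(keys)
    print(assign_to_spanservers(dirs, ["spanserver-1", "spanserver-2"]))
    # -> {'users': 'spanserver-1', 'videos': 'spanserver-2'}

In the real system the placement decision is of course far more sophisticated (it takes the sizes of the tables and the current load into account, as mentioned above), but the basic benefit is the same: queries over related keys touch as few machines as possible.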
- Hardware/Software used by Google

Google uses commodity-class computers for its servers, trying to achieve the most computing power per dollar. The exact configuration is kept secret, yet one can assume that most of the money goes into the processing units, RAM and storage capacity. All of Google's server-side software is customized and optimized to fit this particular configuration. The programming languages most commonly used by Google are C++, Java and Python. Examples of the more notable software installed on Google servers are:

- Google Web Server (a custom, Linux-based web server)
- Spanner (the most current distributed database system, the subject discussed in this section)
- Colossus (the next generation of the Google File System)
- Chubby (sitting on top of the zone-masters and responsible for assigning locks to sub-servers)
- MapReduce (processes and aggregates large sets of data)

(Wikipedia, 2012)

- Security of the Google Distributed Database

Google takes the security of its datacentres very seriously. Access to the datacentre sites is highly restricted, and security personnel are present on site 24 hours a day. Each site is monitored by CCTV cameras, helping the security personnel keep watch over the datacentre. Within the buildings there are multiple access points restricting entry to privileged personnel, who have to scan their badges or use biometric devices in order to enter a particular area. A particular place of interest is the server room, where all the servers are stored. The protection of hard drives is particularly emphasized at Google, making sure the data they contain remains confidential and properly secured. (Google Data center security, 2012)

- Reliability of the Google Distributed Database

Physically, Google ensures the reliability of its databases by storing files in multiple locations and encrypting their names and contents to increase security. Spanner also implements three significant mechanisms that help make the system reliable:

- Snapshot isolation (whenever data is pulled from multiple machines, Spanner takes a snapshot of the data on all the queried machines at the same time, guaranteeing consistent results)
- Two-phase locking
- TrueTime (a customized time structure derived from GPS receivers attached to the servers; it returns an interval of time between the earliest and latest possible value, accounting for the possible error range denoted by epsilon, and is backed up by atomic clocks)

4. YAHOO!'S DATABASE SYSTEM

- General Description of Yahoo! Services and Database

"Yahoo! Inc. is an American multinational internet corporation headquartered in Sunnyvale, California, United States. The company is best known for its web portal, search engine (Yahoo! Search) and for a variety of other services, including Yahoo! Directory, Yahoo! Mail, Yahoo! News, Yahoo! Finance, Yahoo! Groups, Yahoo! Answers, advertising, online mapping, video sharing, fantasy sports and its social media website. It is one of the most popular sites in the United States." (Wikipedia, 2012)

Yahoo!, like Google, needs to manage a huge amount of data on a daily basis, and it too requires an efficient database system to meet its clients' expectations. The two most significant distributed database systems that Yahoo! uses today are PNUTS, which has been developed internally by Yahoo! (and which is the main focus of this discussion), and Apache Hadoop, an open-source project also used by other web giants such as Amazon.

- Architecture of the Yahoo! Distributed Database

The focus of the PNUTS system is "on data serving for web applications, rather than complex queries [...] PNUTS is a hosted, centrally-managed database service shared by multiple applications." (Yahoo! Research, 2008) The architecture of PNUTS encapsulates a simplified relational data model. The tables and data are structured much like in any other relational database. However, when querying data from PNUTS relations the tables cannot be joined, and thus only rows from a single relation can be retrieved by a single query. It is also worth mentioning that PNUTS is able to perform many operations asynchronously, compensating with its other modules for the higher latency this may introduce at times.
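The sketch below (my own toy example in Python, not PNUTS code) illustrates this access model: records are read and written individually by primary key, selections run over one table at a time, and there is deliberately no join operation.

    class SimpleRecordTable:
        """Toy model of a PNUTS-style table: records are accessed
        individually by primary key; joins across tables are not offered."""

        def __init__(self, name):
            self.name = name
            self._records = {}  # primary key -> record (a dict of fields)

        def put(self, key, record):
            self._records[key] = record

        def get(self, key):
            return self._records.get(key)

        def scan(self, predicate=lambda record: True):
            """Selection over this single table only; joining two
            SimpleRecordTable instances is deliberately unsupported."""
            return [r for r in self._records.values() if predicate(r)]

    users = SimpleRecordTable("users")
    users.put("alice", {"name": "Alice", "country": "IE"})
    users.put("bob", {"name": "Bob", "country": "US"})
    print(users.get("alice"))
    print(users.scan(lambda r: r["country"] == "IE"))

Giving up joins is what allows single-record reads and writes to stay fast and scalable, which matches the data-serving focus described above.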
- Hardware/Software used by Yahoo!

"PNUTS is a hosted, centrally-managed database service shared by multiple applications. To add capacity, we add servers. The system adapts by automatically shifting some load to the new servers." (Yahoo! Research, 2008) The bottleneck introduced by this setup appears to be the key hardware components in the servers, such as RAM, CPUs and hard drives. In the event of a hardware failure, the data is copied from a corresponding replica to other live servers, taking care that the load is spread evenly. Like Google, Yahoo! runs a minimal number of applications on its servers. It should also be mentioned that Yahoo! invests a lot of resources into research related to cloud computing technologies, as the company believes, quite rightly, that this is the future of massive-scale, globally distributed systems.

- Security of the Yahoo! Distributed Database

Yahoo! employs rather standard security measures when it comes to the physical security of its datacentres, definitely less restrictive than Google's. For data exchanged over the Internet, the company introduced the 'Yahoo! Security Key', which can be obtained by users of its financial services. The key is an additional layer of security and it expires one hour after being issued. A famous incident took place earlier this year, in which Yahoo! was the victim of a serious security breach that resulted in over half a million Yahoo! user accounts being compromised. The information revealed included user names, e-mail addresses and passwords.

- Reliability of the Yahoo! Distributed Database

PNUTS implements redundancy at many levels, including the data, the metadata and the serving components. Like Google, Yahoo! also replicates data across multiple physical locations. Yahoo! states that PNUTS provides 'per-record timeline consistency', which means that all clones (replicas) of a given record apply updates to that record in the same order. In general, PNUTS supplies "a consistency model that provides useful guarantees to applications without sacrificing scalability." (Yahoo! Research, 2008)

5. REFERENCE SOURCES

Google Data center security, 2012. [Film] Directed by Google. US: Google.
Google Inc., 2012. [Online] Available at: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf [Accessed 27 10 2012].
Hari, 2012. Systems We Make. [Online] Available at: http://www.systemswemake.com/papers/spanner [Accessed 27 10 2012].
Wikipedia, 2012. Wikipedia. [Online] Available at: http://en.wikipedia.org/wiki/Google_platform [Accessed 27 10 2012].
Wikipedia, 2012. Wikipedia. [Online] Available at: http://en.wikipedia.org/wiki/Yahoo! [Accessed 29 10 2012].
Yahoo! Research, 2008. [Online] Available at: http://research.yahoo.com/files/pnuts.pdf [Accessed 29 10 2012].