Distributed Database Systems

BACHELOR OF SCIENCE (HONOURS) IN SOFTWARE DESIGN
- YEAR 4 -
- Databases -
DISTRIBUTED DATABASE SYSTEMS
Tom Wasniewski (A00148326)
TABLE OF CONTENTS
1. Introduction
2. Challenges and Issues of Modern Distributed Databases
3. Google’s Database System
a. General Description of the Google Services and Database
b. Architecture of the Google Distributed Database
c. Hardware/Software used by Google
d. Security of the Google Distributed Database
e. Reliability of the Google Distributed Database
4. Yahoo!’s Database System
a. General Description of the Yahoo! Services and Database
b. Architecture of the Yahoo! Distributed Database
c. Hardware/Software used by Yahoo!
d. Security of the Yahoo! Distributed Database
e. Reliability of the Yahoo! Distributed Database
5. Reference Sources
1. INTRODUCTION
In this paper I discuss the distributed database systems of two web development giants: Google and
Yahoo!. Both companies operate on massive amounts of data and serve users from all over the globe. As
will be shown later on, there is much to take into account when choosing and designing a database
system of such a large scale.
2. CHALLENGES AND ISSUES OF MODERN DISTRIBUTED DATABASES
There are a number of issues to consider when designing a distributed database system:
- Scalability (the system needs to be able to span multiple datacentres located in different, often
distant, geographical locations)
- Transparency (the system has to behave as one global unit, without interrupting the user’s
experience)
- Data Replication/Fragmentation (there must exist a set of rules for how to distribute the data
among many servers/locations)
- Security (both the physical remote sites and the underlying data network need to be carefully
secured to reduce any potential risk of a breach)
These are only some of the challenges that a modern distributed database system brings with it. The
field of distributed systems is relatively young and there are still no established standards for how to
deal with these problems most efficiently and effectively. The biggest players, like Google and Yahoo!,
are taking big steps towards changing this, however.
3. GOOGLE’S DATABASE SYSTEM
- General Description of the Google Services and Database
Google is one of the fastest growing software companies in the world. Starting from the seemingly
simple ‘Google Search’ web search application, it now provides a broad variety of software-related
(mostly web) services, such as Gmail, YouTube and Google Maps, to name a few.
As a result of Google joining the elite group of leading web companies, it has to manage and process
enormous amounts of data. Standard database systems like Oracle or MySQL do not handle huge
amounts of distributed data efficiently. For this reason, over the years Google has come out with its own
innovative database systems: BigTable, Megastore and, most recently, Spanner.
For a long time, Google has been using the BigTable database system. It introduced an innovative way
to store distributed data by abandoning the traditional relational approach and using multi-dimensional
tables, with each cell assigned a timestamp, allowing multiple versions of the same cell to be stored. As a
trade-off for the high scalability and performance achieved by the NoSQL (Not only SQL), i.e.
non-relational, approach, BigTable lacks the ability to query and aggregate the database declaratively.
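As an illustration of this data model, the following is a minimal Python sketch of a multi-dimensional,
timestamped table; the class and method names are my own invention for illustration and are not
BigTable’s actual API.

    # Minimal sketch of a BigTable-style data model: a sparse map from
    # (row key, column key, timestamp) to a value, so several versions of
    # the same cell can coexist. Illustrative only, not Google's API.
    import time
    from collections import defaultdict

    class MultiVersionTable:
        def __init__(self):
            # row -> column -> {timestamp: value}
            self._cells = defaultdict(lambda: defaultdict(dict))

        def put(self, row, column, value, timestamp=None):
            ts = timestamp if timestamp is not None else time.time()
            self._cells[row][column][ts] = value

        def get(self, row, column, timestamp=None):
            versions = self._cells[row][column]
            if not versions:
                return None
            if timestamp is None:                      # latest version
                return versions[max(versions)]
            # newest version at or before the requested timestamp
            valid = [ts for ts in versions if ts <= timestamp]
            return versions[max(valid)] if valid else None

    table = MultiVersionTable()
    table.put("com.example.www", "contents:html", "<html>v1</html>", timestamp=1)
    table.put("com.example.www", "contents:html", "<html>v2</html>", timestamp=2)
    print(table.get("com.example.www", "contents:html"))               # v2
    print(table.get("com.example.www", "contents:html", timestamp=1))  # v1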
Over time BigTable evolved into Megastore, which in turn morphed into the state-of-the-art
distributed database system we know today as Spanner.
Even though BigTable is still widely used by Google Services, Spanner is slowly replacing its
predecessor as it offers some interesting features and implements crucial improvements over BigTable.
- Architecture of the Google Distributed Database
“Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated
database. It is the first system to distribute data at global scale and support externally-consistent
distributed transactions.” (Google Inc., 2012)
A universe is the name for a single Spanner deployment. It consists of zones, which represent particular
locations that store the data. In each zone there is one zone-master server, which is in charge of a large
group (100-1000) of slave servers called span-servers. These span-servers hold the actual replicas of the
data, assigned to them by the zone-master. When clients request data from a particular zone, they first
contact a location proxy server, which directs the client to the correct span-server. (Hari, 2012)
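The hierarchy described above can be pictured with a short Python sketch; the class names and the
hash-based placement are illustrative assumptions of mine, not Google’s implementation.

    # Sketch of the Spanner deployment hierarchy: a universe made of zones,
    # each with one zone-master that assigns data to its span-servers, and a
    # location proxy that tells clients which span-server holds a given key.
    class SpanServer:
        def __init__(self, name):
            self.name = name
            self.replicas = {}          # key -> replicated data

    class Zone:
        def __init__(self, name, num_spanservers):
            self.name = name
            self.zonemaster = f"{name}-zonemaster"
            self.spanservers = [SpanServer(f"{name}-ss{i}") for i in range(num_spanservers)]
            self.location_proxy = {}    # key -> span-server (what clients consult)

        def assign(self, key, value):
            # the zone-master decides placement; here simply by hash of the key
            server = self.spanservers[hash(key) % len(self.spanservers)]
            server.replicas[key] = value
            self.location_proxy[key] = server

        def lookup(self, key):
            server = self.location_proxy[key]      # client asks the proxy first
            return server.replicas[key]

    class Universe:
        def __init__(self, zones):
            self.zones = zones

    universe = Universe([Zone("eu-zone", 4), Zone("us-zone", 4)])
    universe.zones[0].assign("user:42", {"name": "Alice"})
    print(universe.zones[0].lookup("user:42"))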
Each span-server is responsible for storing up to a thousand instances of tablets, which are collections
of key-value mappings, similar to BigTable’s format. Another interesting structure used in Spanner is the
directory. Directories hold collections of keys with a common prefix. This greatly helps to keep related
tables together and to distribute them across as few partitions as possible, depending on the tables’
sizes, which in turn results in faster query times. (Hari, 2012)
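The following Python sketch illustrates the principle behind directories, grouping keys by a shared prefix
so that related data can be placed as a unit; the grouping function is a simplification of my own, not
Spanner’s actual placement algorithm.

    # Keys sharing a common prefix are kept together (a stand-in for a
    # directory), so related rows end up on as few partitions as possible.
    from itertools import groupby

    def group_into_directories(keys, prefix_length):
        """Bucket keys by their common prefix."""
        keys = sorted(keys)
        return {prefix: list(group)
                for prefix, group in groupby(keys, key=lambda k: k[:prefix_length])}

    keys = ["album:u1:photo1", "album:u1:photo2", "album:u2:photo1", "user:u1", "user:u2"]
    directories = group_into_directories(keys, prefix_length=8)
    for prefix, members in directories.items():
        print(prefix, "->", members)   # each directory can be placed as a unit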
It is worth mentioning that, in addition to being extremely scalable, Spanner also supports an SQL-based
query language, unlike its predecessor BigTable. The last feature I would like to mention is the ability to
automatically adjust how data is spread across the available span-servers, dynamically optimizing the
load within the universe.
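To illustrate the difference an SQL-based language makes, the snippet below uses Python’s built-in
sqlite3 module purely as a stand-in: the point is only to contrast a declarative query with the key-value
lookups BigTable offers, not to show Spanner’s actual interface or dialect, which the cited sources do not
spell out.

    # Contrast: a key-value store can only get/put by key, while an
    # SQL-based language expresses filtering and projection declaratively.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT, signup_date TEXT)")
    conn.execute("INSERT INTO users VALUES ('u1', 'Alice', '2012-05-01')")
    conn.execute("INSERT INTO users VALUES ('u2', 'Bob', '2011-03-01')")

    rows = conn.execute(
        "SELECT name FROM users WHERE signup_date > ?", ("2012-01-01",)
    ).fetchall()
    print(rows)   # [('Alice',)]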
- Hardware/Software used by Google
Google uses commodity-class computers for its servers, trying to achieve the most computing power per
dollar. The exact configuration is kept secret, yet one can assume that most of the money goes into the
processing units, RAM and storage capacity.
All of Google’s server-side software is customized and optimized to fit this particular configuration. The
programming languages most commonly used by Google are C++, Java and Python.
Examples of the more notable software installed on Google servers are:
- Google Web Server (Google’s custom, Linux-based web server)
- Spanner (the most current distributed database system, the subject of this section)
- Colossus (the next generation of the Google File System)
- Chubby (a lock service sitting on top of the zone-masters and responsible for assigning locks to
sub-servers)
- MapReduce (processes and aggregates large sets of data; a sketch of its programming model
follows below)
(Wikipedia, 2012)
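Since MapReduce comes up throughout Google’s infrastructure, here is a minimal single-machine Python
sketch of its programming model (map, group by key, reduce); the real system distributes these phases
across many servers, which this toy version does not attempt.

    # Word count in the MapReduce style: map emits (key, value) pairs,
    # pairs are grouped by key, and reduce aggregates each group.
    from collections import defaultdict

    def map_phase(documents):
        for doc in documents:
            for word in doc.split():
                yield word, 1

    def reduce_phase(pairs):
        grouped = defaultdict(list)
        for key, value in pairs:            # shuffle/group by key
            grouped[key].append(value)
        return {key: sum(values) for key, values in grouped.items()}

    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(map_phase(docs)))    # {'the': 3, 'quick': 1, ...}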
- Security of the Google Distributed Database
Google takes security of their datacentres very seriously. Access to the datacentre sites is highly
restricted. The security personnel are present on the site 24 hours a day. Each site is monitored by CCTV
cameras, helping the security personnel keep watch over the datacentre.
Within the buildings, there are multiple access points restricting access to privileged personnel,
who have to scan their badges or use biometric devices in order to enter a particular area. A particular
place of interest is the server room, where the machines themselves are housed. The protection of hard
drives is especially emphasized at Google, to make sure the data they contain remains confidential and
properly secured. (Google Data center security, 2012)
- Reliability of the Google Distributed Database
Physically, Google ensures the reliability of their databases by storing files in multiple locations and
encrypting their names and contents to increase security.
Spanner also implements three significant mechanisms that help make the system reliable:
- Snapshot isolation (whenever data is pulled from multiple machines, Spanner takes a snapshot
of the data on all the queried machines at the same time, guaranteeing consistent results)
- Two-phase locking
- TrueTime (a customized time API, derived from GPS receivers attached to the servers and backed
up by atomic clocks; it returns an interval between the earliest and latest possible current time,
accounting for the possible error range denoted by epsilon; see the sketch below)
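The TrueTime idea can be sketched in a few lines of Python; the epsilon value and function names below
are assumptions for illustration only, since real TrueTime derives its uncertainty bound from the GPS
and atomic-clock hardware.

    # Instead of a single timestamp, the clock returns an interval
    # [earliest, latest] whose width reflects the current uncertainty.
    import time
    from collections import namedtuple

    TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

    def tt_now(epsilon_seconds=0.007):          # epsilon is a made-up value
        t = time.time()
        return TTInterval(t - epsilon_seconds, t + epsilon_seconds)

    def tt_after(interval):
        """True only once real time is guaranteed to be past the interval."""
        return time.time() > interval.latest

    ts = tt_now()
    print(ts)
    # A transaction can wait until tt_after(ts) holds before exposing its
    # commit timestamp, which is how commits can be ordered consistently
    # across machines despite clock uncertainty.
    print(tt_after(ts))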
4. YAHOO!’S DATABASE SYSTEM
- General Description of Yahoo! Services and Database
“Yahoo! Inc. is an American multinational internet corporation headquartered in Sunnyvale,
California, United States. The company is best known for its web portal, search engine (Yahoo! Search)
and for a variety of other services, including Yahoo! Directory, Yahoo! Mail, Yahoo! News, Yahoo!
Finance, Yahoo! Groups, Yahoo! Answers, advertising, online mapping, video sharing, fantasy sports and
its social media website. It is one of the most popular sites in the United States.” (Wikipedia, 2012)
Yahoo!, like Google, needs to manage a huge amount of data on a daily basis, and it too requires an
efficient database system to meet its clients’ expectations. The two most significant distributed database
systems that Yahoo! uses today are PNUTS, which was developed internally by Yahoo! (and which is the
main focus of this discussion), and Apache Hadoop, an open-source project also used by other web
giants such as Amazon.
- Architecture of the Yahoo! Distributed Database
The focus of the PNUTS system is “on data serving for web applications, rather than complex queries […]
PNUTS is a hosted, centrally-managed database service shared by multiple applications.” (Yahoo!
Research, 2008).
The architecture of PNUTS is built around a simplified relational data model. Tables and data are
structured much like in any other relational database. However, when querying data from PNUTS
relations, tables cannot be joined, so a single query can only retrieve rows from a single relation.
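A tiny Python sketch of this access model, with made-up table and field names, shows why join-like
work falls back on the application:

    # PNUTS-style tables look relational, but each query touches only one
    # table, so combining tables has to be done in application code.
    users = {
        "u1": {"name": "Alice", "country": "IE"},
        "u2": {"name": "Bob",   "country": "US"},
    }
    posts = {
        "p1": {"author": "u1", "title": "Hello"},
        "p2": {"author": "u2", "title": "Hi"},
    }

    # Single-table query: scan one relation only.
    irish_users = [uid for uid, row in users.items() if row["country"] == "IE"]

    # "Join" performed by the application with a second single-table query.
    irish_posts = [post["title"] for post in posts.values() if post["author"] in irish_users]
    print(irish_posts)   # ['Hello']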
It is also worth mentioning that PNUTS performs many operations asynchronously, compensating with
its other components for the higher latency this may occasionally introduce.
- Hardware/Software used by Yahoo!
“PNUTS is a hosted, centrally-managed database service shared by multiple applications. To add
capacity, we add servers. The system adapts by automatically shifting some load to the new servers.”
(Yahoo! Research, 2008)
The bottleneck in this setup tends to be the key hardware components in the servers, such as RAM, CPU
or hard drives. In the event of a hardware failure, the failed server’s data is copied from a corresponding
replica to other live servers, taking care that the load is spread evenly.
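The recovery behaviour described above can be sketched roughly as follows; the data layout and the
round-robin placement are assumptions of mine for illustration, not PNUTS’s actual recovery protocol.

    # When a server fails, its records are re-copied from surviving replicas
    # and spread over the remaining live servers.
    from itertools import cycle

    def recover(servers, failed):
        """Re-create the failed server's records from surviving replicas,
        spreading the new copies evenly across the live servers.
        Assumes every record still has at least one surviving replica and
        at least one live server that does not yet hold it."""
        lost_records = set(servers.pop(failed))
        targets = cycle(sorted(servers))
        for record in sorted(lost_records):
            source = next(s for s in servers if record in servers[s])   # surviving replica
            target = next(t for t in targets if record not in servers[t])
            servers[target][record] = servers[source][record]
        return servers

    servers = {
        "s1": {"r1": "a", "r2": "b"},
        "s2": {"r1": "a", "r3": "c"},
        "s3": {"r2": "b", "r3": "c"},
    }
    print(recover(servers, failed="s1"))
    # r1 and r2 get second copies again, placed on different live servers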
Like Google, Yahoo! runs a minimal number of applications on their servers.
It should be mentioned that Yahoo! invests considerable resources in research related to cloud
computing technologies, which the company believes, quite rightly, to be the future of massive-scale,
globally distributed systems.
- Security of the Yahoo! Distributed Database
Yahoo! employs rather standard measures when it comes to the physical security of their datacentres,
noticeably less restrictive than Google’s. For exchanging data over the Internet, the company introduced
the ‘Yahoo! Security Key’, which can be obtained by users of its financial services. The key acts as an
additional layer of security and expires one hour after being issued.
A well-known incident took place earlier this year (2012), in which Yahoo! was the victim of a serious
security breach resulting in over half a million Yahoo! user accounts being compromised. The
information revealed included user names, e-mail addresses and passwords.
- Reliability of the Yahoo! Distributed Database
PNUTS implements redundancy at many levels, including the data, metadata and serving components.
Like Google, Yahoo! replicates data across multiple physical locations. Yahoo! states that PNUTS
provides ‘per-record timeline consistency’, which means that all replicas of a given record apply updates
to that record in the same order.
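A minimal Python sketch of per-record timeline consistency, assuming (as a simplification of mine) one
master replica per record that orders all writes before fanning them out:

    # One replica acts as master for a record, stamps each update with an
    # increasing version number, and every replica applies updates in that
    # order, so all copies move through the same sequence of states.
    class Replica:
        def __init__(self, name):
            self.name = name
            self.value = None
            self.version = 0

        def apply(self, version, value):
            # apply only in order; out-of-order updates would be buffered/retried
            if version == self.version + 1:
                self.value, self.version = value, version

    class RecordMaster:
        """Orders all writes to a single record."""
        def __init__(self, replicas):
            self.replicas = replicas
            self.next_version = 1

        def write(self, value):
            version = self.next_version
            self.next_version += 1
            for replica in self.replicas:       # asynchronous in reality
                replica.apply(version, value)

    replicas = [Replica("us"), Replica("eu"), Replica("asia")]
    master = RecordMaster(replicas)
    master.write("v1")
    master.write("v2")
    print([(r.name, r.value, r.version) for r in replicas])  # all end at version 2, value 'v2'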
PNUTS in general supplies “a consistency model that provides useful guarantees to applications
without sacrificing scalability.” (Yahoo! Research, 2008)
5. REFERENCE SOURCES
Google Data center security. 2012. [Film] Directed by Google. US: Google.
Google Inc., 2012. [Online]
Available at:
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archiv
e/bigtable-osdi06.pdf
[Accessed 27 10 2012].
Hari, 2012. Systems We Make. [Online]
Available at: http://www.systemswemake.com/papers/spanner
[Accessed 27 10 2012].
Wikipedia, 2012. Wikipedia. [Online]
Available at: http://en.wikipedia.org/wiki/Google_platform
[Accessed 27 10 2012].
Wikipedia, 2012. Wikipedia. [Online]
Available at: http://en.wikipedia.org/wiki/Yahoo!
[Accessed 29 10 2012].
Yahoo! Research, 2008. [Online]
Available at: http://research.yahoo.com/files/pnuts.pdf
[Accessed 29 10 2012].