Modern Distributed Databases

Databases – Software Development Year 4
Joseph Kenny A00174254
10/14/2013
TABLE OF CONTENTS
1. Introduction
2. Challenges and Issues of Modern Distributed Databases
3. Google’s Database System
a. General Description of the Google Services and Database
b. Architecture of the Google Distributed Database
c. Hardware/Software used by Google
d. Security of the Google Distributed Database
e. Reliability of the Google Distributed Database
4. Amazon’s Database System
a. General Description of the Amazon Services and Database
b. Architecture of the Amazon Distributed Database
c. Hardware/Software used by Amazon
d. Security of the Amazon Distributed Database
e. Reliability of the Amazon Distributed Database
5. References
Introduction
This paper deals with modern distributed databases, with particular reference to Google Inc. and
Amazon.com Inc. These two international companies process vast amounts of data every second and
are relied upon by users across the world. The paper examines the challenges faced when dealing
with distributed databases and then, for each company, the architecture, security, reliability and
hardware/software of its database.
Challenges and Issues of Modern Distributed Databases
There are numerous challenges and issues faced when dealing with distributed databases. Three of
the most important are transparency, security and scalability.
Transparency
A distributed database must offer the user a high level of continuity. The entire database must
behave as if it were a single entity, even though there may be thousands of kilometres between
machines. This continuity is essential so that neither the user nor the service the database
provides is interrupted.
Security
The first issue is that a distributed database has multiple access points, which means each node
has to be completely secure. Another issue I’ve observed is the sending of encryption keys across
the system: if the database is spread across five centres, the keys must be sent to each location,
and every transfer is a chance for a key to be exposed. The last issue is the vulnerability of the
database if a node is compromised by a virus or hacker: if one node is corrupted, the other nodes
are left vulnerable.
Scalability
The scalability issue for a distributed database is whether it can be distributed across several
countries while still handling user traffic, storing large amounts of data and remaining reliable.
These three factors are fundamental to a distributed database; if one fails or underperforms, the
entire system is affected.
Google’s Distributed Database System
General Description of the Google Services and Database
Google is one of the largest companies in the world. It was originally concerned with organizing
online material and making it readily available through its search engine. Since then the company
has built an ever-expanding portfolio, which now includes:
• Gmail
• Google Drive
• Google+
• YouTube
• Google Maps
• Google Navigation
With such a vast amount of data being sent and received, Google has had to continually update its
database system. Since its inception in 1998, the number of searches Google has to process has
grown every year. The table below shows the figures.
Year    Annual Google Searches    Average Searches Per Day
2012    1,873,910,000,000         5,134,000,000
2011    1,722,071,000,000         4,717,000,000
2000    22,000,000,000            60,000,000
(www.statisticbrain.com, 2013)
Because Google holds an ever-expanding amount of distributed data, it has had to develop its own
systems: from the original BigTable, to its successor Megastore, to the newly released Spanner.
Architecture of the Google Distributed Database
A Spanner deployment is called a universe; it is divided into zones, which correspond to physical
locations. A zone is made up of a master server (the zonemaster) which controls servers known as
spanservers. These spanservers store the data assigned to them by the zonemaster and serve it to
clients. To request data from a zone, clients contact a location proxy server, which directs them
to the appropriate spanserver. The universe also has an administrative console called the universe
master, which displays status information on the individual zones, and a placement driver, which
is responsible for moving data across zones.
(Hari, 2012)
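The read path described above (client, then location proxy, then spanserver) can be illustrated with a small sketch. This is not Google's code: the class names, the method names and the "fewest keys" assignment policy are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class SpanServer:
    """Stores the data assigned to it and serves it to clients."""
    name: str
    data: dict = field(default_factory=dict)

    def read(self, key):
        return self.data[key]


@dataclass
class Zone:
    """One zone: a zonemaster assigns data, a location proxy locates it."""
    name: str
    spanservers: list
    assignment: dict = field(default_factory=dict)  # key -> SpanServer

    def zonemaster_assign(self, key, value):
        # Hypothetical policy: give the key to the spanserver with the
        # fewest keys (ties go to the first server in the list).
        server = min(self.spanservers, key=lambda s: len(s.data))
        server.data[key] = value
        self.assignment[key] = server
        return server

    def location_proxy_lookup(self, key):
        # The proxy does not return data, only where the data lives.
        return self.assignment[key]


def client_read(zone, key):
    server = zone.location_proxy_lookup(key)  # step 1: ask the proxy
    return server.read(key)                   # step 2: read from the spanserver
```

In this sketch the client never talks to the zonemaster when reading; the zonemaster only decides where data is placed.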
Hardware/Software used by Google
Google uses cheap commodity computers for its servers, which gives it more computing power for a
given budget. It compensates for the lower reliability of low-cost machines by keeping replica
machines in many data centres around the world.
(Taylor, 2003)
Most of the software running on Google’s servers is developed by Google employees, with Java,
Python and Go the more favoured programming languages.
It includes:
• Google Web Server
• Spanner
• Google F1
• MapReduce
• BigTable
• TeraGoogle
(Wikipedia, 2013)
Security of the Google Distributed Database
Security measures at Google’s data centres range from a 24/7 security detail and security cameras
to restricted access to certain rooms and security perimeters.
On entering a data centre there are numerous restrictions in place to prevent data from being
compromised. Each employee must use a security badge to gain access to rooms, in particular the
server room, which stores the Google database. This room must stay secure as it contains
confidential user information.
(CBS, 2011)
Reliability of the Google Distributed Database
Google uses various methods to ensure the reliability of its distributed database. Files are
stored, many of them as replicas, in data centres at numerous locations around the world. This
allows Google to recover a file from a replica if the original is lost or corrupted.
Google’s Spanner data centres use their own time-keeping mechanism, exposed through the TrueTime
API. Each data centre is equipped with atomic clocks and GPS receivers, which allow it to keep
accurate time independently of outside time servers. The clocks are connected to master servers,
which pass the time on to the other servers, so that all servers run on a unified time. This
allows better synchronisation and more reliable data transfers across zones.
(Metz, 2012)
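Google's published description of Spanner presents TrueTime as returning an interval that is guaranteed to contain the real time, rather than a single timestamp. The sketch below illustrates that idea only. It assumes a fixed uncertainty bound (the `epsilon` default of 5 ms is an arbitrary, hypothetical value) and an ordinary local clock; the class and method names are illustrative and are not Google's API.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # real time is assumed to be no earlier than this
    latest: float    # real time is assumed to be no later than this


class TrueTimeSketch:
    def __init__(self, epsilon=0.005, clock=time.time):
        self.epsilon = epsilon  # assumed clock uncertainty, in seconds
        self.clock = clock      # any zero-argument function returning seconds

    def now(self):
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        """True only if time t has definitely passed."""
        return self.now().earliest > t

    def before(self, t):
        """True only if time t has definitely not arrived yet."""
        return self.now().latest < t
```

With such an interval, two servers can agree that one event happened before another only when their intervals do not overlap, which is why a smaller uncertainty bound gives better synchronisation.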
Amazon’s Database System
General Description of the Amazon Services and Database
Amazon is an American international electronic commerce company based in Seattle, Washington. It
is the largest online retailer. It began as an online bookstore but has expanded to sell DVDs,
MP3s, games, jewellery and electronics.
(Wikipedia, 2013)
Amazon offers a NoSQL distributed database service called DynamoDB, which is the successor to
SimpleDB. DynamoDB provides efficient and reliable performance. It is designed to store and
retrieve any amount of data and to handle any level of request traffic.
(Vogels, 2012)
Architecture of the Amazon Distributed Database
Amazon’s DynamoDB brings the functionality of cloud computing to the NoSQL database. It offers
high availability, reliability and scalability, with no limits on dataset size or request traffic.
It runs on solid state drive (SSD) technology, which offers low latency at any scale.
Amazon operates from multiple locations worldwide. These locations are organised into regions and
Availability Zones. This allows datasets to be replicated across multiple Availability Zones to
provide built-in high availability.
(Amazon, 2013)
(Amazon)
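As a rough illustration of replicating a dataset across Availability Zones, the sketch below chooses a set of distinct zones for each dataset. It is not Amazon's placement algorithm: the function name, the zone names in the usage note and the hash-based choice of starting zone are all assumptions made for the example.

```python
import hashlib


def place_replicas(dataset, zones, copies=3):
    """Return `copies` distinct zones for `dataset`, chosen deterministically."""
    if copies > len(zones):
        raise ValueError("need at least as many zones as copies")
    ordered = sorted(zones)
    # Hash the dataset name to pick a starting zone, so that different
    # datasets tend to start in different zones.
    digest = hashlib.sha256(dataset.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(ordered)
    # Take consecutive zones from the starting point, wrapping around.
    return [ordered[(start + i) % len(ordered)] for i in range(copies)]
```

Calling `place_replicas("orders", ["zone-a", "zone-b", "zone-c", "zone-d"])` returns three different zones, so the loss of any single zone still leaves two copies of the dataset.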
Hardware/Software used by Amazon
Amazon has embraced Google’s philosophy of buying and using cheap machines to run its distributed
database. It previously used HP machines for this task, but now buys server processors and memory
straight from Intel. All of its machines are custom-built to its own specifications, and its
servers are built in-house in tandem with an Asian manufacturer.
(McMillan & Metz, 2012)
Some of the software used by Amazon includes:
• Java
• Servlets
• Perl
• Rails
• Linux
• Oracle
• C++
• Mason
• JBoss
(highscalability.com, 2007)
Security of the Amazon Distributed Database
Security at Amazon’s data centres is similar to that of Google, with trained guards on duty 24/7
and access to rooms restricted to privileged employees. These measures work in tandem with
surveillance cameras in operation inside and outside the building.
Amazon’s virtual infrastructure has been designed to be secure, to protect customer privacy and to
provide high availability. As well as protecting its data centres and infrastructure physically,
Amazon runs monitoring systems across its network to detect and mitigate distributed denial of
service attacks and brute-force password attempts.
(Amazon, 2013)
Additional security measures include:
• Integrated firewalls
• Private subnets
• Encrypted data storage
(Amazon, 2013)
Reliability of the Amazon Distributed Database
Amazon’s use of regions and Availability Zones allows data to be replicated in multiple data
centres for reliability. If a machine fails in Seattle, a replica in another data centre takes its
place.
In 2010 Amazon introduced a feature to its service called Multi-Availability Zone. This feature
automatically configures a backup copy of the database, which it stores in a different location
from the original. Updates to the primary database are automatically applied to the copy. The copy
is promoted to primary if the original becomes unavailable, for example during a network failure.
(Ricknas, 2010)
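The primary and standby behaviour described above can be sketched in a few lines. This is a simplified model and not Amazon's implementation: the class name, the zone names in the usage note and the choice to mirror every write to the standby before returning are assumptions made for the example.

```python
class MultiAZDatabase:
    """A primary copy plus a standby copy kept in a different zone."""

    def __init__(self, primary_zone, standby_zone):
        if primary_zone == standby_zone:
            raise ValueError("standby must live in a different zone")
        self.primary_zone = primary_zone
        self.standby_zone = standby_zone
        self.copies = {primary_zone: {}, standby_zone: {}}

    def write(self, key, value):
        # Apply the update to the primary, then mirror it to the standby
        # before returning, so the standby does not fall behind.
        self.copies[self.primary_zone][key] = value
        self.copies[self.standby_zone][key] = value

    def read(self, key):
        # Reads are always served by whichever copy is currently primary.
        return self.copies[self.primary_zone][key]

    def fail_over(self):
        # Promote the standby; the old primary becomes the new standby.
        self.primary_zone, self.standby_zone = (
            self.standby_zone,
            self.primary_zone,
        )
        return self.primary_zone
```

For example, after `db = MultiAZDatabase("zone-a", "zone-b")` and `db.write("order-1", "paid")`, calling `db.fail_over()` makes "zone-b" the primary, and `db.read("order-1")` still returns the stored value.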
References
Amazon. Amazon Web Services. aws documentation. Seattle.
Amazon. (2013). aws.amazon.com. Retrieved from Amazon Web Services - Security:
https://aws.amazon.com/security/
Amazon. (2013). docs.aws.amazon. Retrieved from Amazon:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
CBS (Director). (2011). Security and Data Protection in Google Data Centers [Motion Picture].
Hari. (2012, September 16). www.systemswemake.com. Retrieved from Systemswemake:
http://www.systemswemake.com/author/xobni/
highscalability.com. (2007, September 18). Retrieved from Highscalability: http://highscalability.com/amazonarchitecture
McMillan, R., & Metz, C. (2012, November 30). www.wired.com. Retrieved from wired:
http://www.wired.com/wiredenterprise/2012/11/amazon-google-secret-servers/
Metz, C. (2012, November 26). www.wired.com. Retrieved from wired.com:
http://www.wired.com/wiredenterprise/2012/11/google-spanner-time/
O'Brien, T. M. (2012, October 4). strata.oreilly.com. Retrieved from Strata:
http://strata.oreilly.com/2012/10/google-spanner-relational-database.html
Ricknas, M. (2010, May 18). www.cio.com. Retrieved from Cio:
http://www.cio.com/article/594063/Amazon_Improves_Reliability_for_Its_Cloud_Based_Database
Taylor, A. (2003, October 10). http://www.pcworld.com. Retrieved from pcworld:
http://www.pcworld.com/article/112891/article.html
Vogels, W. (2012, January 12). www.allthingsdistributed.com. Retrieved from Allthingsdistributed:
http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
Wikipedia. (2013). Wikipedia. Retrieved from http://en.wikipedia.org/wiki/Google_platform
Wikipedia. (2013). Wikipedia. Retrieved from http://en.wikipedia.org/wiki/Amazon.com
www.statisticbrain.com. (2013, June 18). Retrieved from StatisticBRAIN:
http://www.statisticbrain.com/google-searches/