Modern Distributed Databases
Databases – Software Development Year 4
Joseph Kenny
A00174254
10/14/2013

TABLE OF CONTENTS
1. Introduction
2. Challenges and Issues of Modern Distributed Databases
3. Google’s Database System
   a. General Description of the Google Services and Database
   b. Architecture of the Google Distributed Database
   c. Hardware/Software used by Google
   d. Security of the Google Distributed Database
   e. Reliability of the Google Distributed Database
4. Amazon’s Database System
   a. General Description of the Amazon Services and Database
   b. Architecture of the Amazon Distributed Database
   c. Hardware/Software used by Amazon
   d. Security of the Amazon Distributed Database
   e. Reliability of the Amazon Distributed Database
5. References

Introduction

This paper deals with modern distributed databases, with particular reference to Google Inc and Amazon.com Inc. These two international companies process vast amounts of data every second and are relied upon by users across the world. The paper examines the challenges faced when dealing with distributed databases, and then the architecture, hardware/software, security and reliability of each company's database.

Challenges and Issues of Modern Distributed Databases

Distributed databases face numerous challenges and issues, including the following.

Transparency
A distributed database must offer a high level of continuity for the benefit of the user. The entire database must act as if it were a single entity, even though there could be thousands of kilometres between machines. This continuity is essential so that neither the user nor the service the database provides is interrupted.

Security
The first issue is that the database has multiple access points, which means each node has to be completely secure.
Another issue is the sending of encryption keys across the system: if the database is spread across five centres, the keys must be sent to each location, and every transfer is a chance for them to be exposed. The last issue is the vulnerability of the database as a whole if a single node is compromised by a virus or a hacker, because one corrupted node leaves the other nodes vulnerable.

Scalability
The key scalability question is whether the database can be distributed across several countries while still handling user traffic, storing large amounts of data and remaining reliable. These three factors are fundamental to a distributed database; if one fails or underperforms, the entire system is affected.

Google’s Database System

General Description of the Google Services and Database

Google is one of the largest companies in the world. It was originally concerned with organizing online material and making it readily available through Google’s search engine. Since then the company has built an ever-expanding portfolio, which now includes Gmail, Google Drive, Google+, YouTube, Google Maps and Google Navigation.

With such a vast amount of data being sent and received, Google has had to update its database system continually. Since the company's inception in 1998, the number of searches it has to process has grown every year. The table below shows the figures.

Year    Annual Google Searches    Average Searches Per Day
2012    1,873,910,000,000         5,134,000,000
2011    1,722,071,000,000         4,717,000,000
2000    22,000,000,000            60,000,000
(www.statisticbrain.com, 2013)

Because the amount of distributed data keeps growing, Google has had to develop its own systems, from the original BigTable, to its successor Megastore, to the newly released Spanner.

Architecture of the Google Distributed Database

A Spanner deployment is called a universe; it is divided into zones, each of which is assigned to a location.
A zone is made up of a master server which controls servers known as span servers. The span servers store the data assigned to them by the master server and pass it on to clients. To request data from a zone, clients contact the location proxy server, which directs them to the appropriate span server. The universe also has an administrative console called the universe master, which stores and displays information on the individual zones. A placement driver is responsible for transferring data across zones. (Hari, 2012)

Hardware/Software used by Google

Google uses cheap commodity computers for its servers, which gives it more computing power on a budget. Google compensates for the low-cost machines by keeping twin computers in many data centres around the world, so that a replica of each machine exists elsewhere. (Taylor, 2003)

Most of the software Google runs on its servers is developed by Google employees, with Java, Python and Go the favoured programming languages. It includes Google Web Server, Spanner, Google F1, MapReduce, BigTable and TeraGoogle. (Wikipedia, 2013)

Security of the Google Distributed Database

Google’s security measures at its data centres range from a 24/7 security detail and security cameras to privileged access to certain rooms and security perimeters. Numerous restrictions apply on entering a data centre to prevent the data from being compromised. Each employee must use a security badge to gain access to restricted rooms, in particular the server room, which stores the Google database. This room must stay secure because it contains confidential user information. (CBS, 2011)

Reliability of the Google Distributed Database

Google uses various methods to ensure the reliability of its distributed database.
The company stores files, many of them replicas, in data centres in numerous locations around the world. This allows Google to back up files on replica computers, so that a corrupted or lost copy can be restored from another.

Google’s Spanner data centres use their own time-keeping mechanism, exposed through the TrueTime API. Each data centre is equipped with atomic clocks and GPS receivers, which allow it to tell the time independently of outside time servers. TrueTime is connected to master servers in each zone, so that the master servers run on a unified time, which allows better synchronisation and more reliable data transfers across zones. (Metz, 2012)

Amazon’s Database System

General Description of the Amazon Services and Database

Amazon is an American international electronic commerce company based in Seattle, Washington. It is the largest online retailer. It began as an online bookstore but has expanded to selling DVDs, MP3s, games, jewellery and electronics. (Wikipedia, 2013)

Amazon uses a NoSQL distributed database called DynamoDB, which is the successor to SimpleDB. DynamoDB is a NoSQL database service that provides efficient and reliable performance. It is designed to store and retrieve any amount of data and to handle any level of request traffic. (Vogels, 2012)

Architecture of the Amazon Distributed Database

Amazon’s DynamoDB brings the functionality of cloud computing to the NoSQL database. It offers high availability, reliability and scalability, with no limits on dataset size or request traffic. It runs on solid state drive (SSD) technology, which offers low latency at any scale. Amazon transfers data between multiple locations worldwide. These locations are organised into regions and Availability Zones, and datasets are replicated across them to provide built-in high availability.
(Amazon, 2013) (Amazon)

Hardware/Software used by Amazon

Amazon has embraced Google’s philosophy of buying and using cheap machines to run its distributed database. It previously used HP machines for this task but now buys server processors and memory straight from Intel. All its machines are custom built to its own specifications, and its servers are built in-house in tandem with an Asian manufacturer. (McMillan & Metz, 2012)

Some of the software used by Amazon includes Java Servlets, Perl, Rails, Linux, Oracle, C++, Mason and JBoss. (highscalability.com, 2007)

Security of the Amazon Distributed Database

Security at Amazon data centres is similar to that at Google, with trained guards on duty 24/7 and access to rooms restricted to privileged employees. These measures work in tandem with surveillance cameras operating inside and outside the building. Amazon’s virtual infrastructure is designed to be secure and to provide optimum availability while protecting customer privacy. As well as protecting its data centres and infrastructure, Amazon runs monitoring systems across its network to detect and guard against distributed denial-of-service attacks and password brute-force attempts. (Amazon, 2013)

Additional security measures include integrated firewalls, private subnets and encrypted data storage. (Amazon, 2013)

Reliability of the Amazon Distributed Database

Amazon’s use of regions and Availability Zones allows data to be replicated in multiple data centres for reliability. If a machine fails in Seattle, a twin machine in another data centre takes its place. In 2010 Amazon introduced a feature to its service called Multi-Availability Zone. This feature automatically configures a backup copy of the database, which it stores in a different location from the original. Updates to the prime database are automatically applied to the copy. The copy is assigned as the prime database if the original becomes unavailable, for example during a network failure.
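
The prime/copy behaviour described above can be illustrated with a minimal Python sketch. It is not Amazon's implementation: the class names, method names and location strings are all hypothetical, replication is modelled as a simple in-memory copy, and failure is simulated by flipping a flag.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseCopy:
    """One copy of the database, held in a single (hypothetical) location."""
    location: str
    rows: dict = field(default_factory=dict)
    available: bool = True


class MultiZoneDatabase:
    """Illustrative prime/copy pair; an assumption-laden toy, not a real API."""

    def __init__(self, prime_location: str, copy_location: str):
        # The backup must be stored somewhere other than the original.
        if prime_location == copy_location:
            raise ValueError("the copy must live in a different location")
        self.prime = DatabaseCopy(prime_location)
        self.copy = DatabaseCopy(copy_location)

    def write(self, key, value):
        # Every update to the prime is applied to the copy as well.
        self._ensure_prime_available()
        self.prime.rows[key] = value
        if self.copy.available:
            self.copy.rows[key] = value

    def read(self, key):
        self._ensure_prime_available()
        return self.prime.rows[key]

    def _ensure_prime_available(self):
        # Failover: promote the copy when the prime cannot be reached.
        # Resynchronising a recovered machine is left out of this sketch.
        if not self.prime.available:
            if not self.copy.available:
                raise RuntimeError("no copy of the database is available")
            self.prime, self.copy = self.copy, self.prime


db = MultiZoneDatabase("location-a", "location-b")
db.write("order-1", "shipped")
db.prime.available = False   # simulate a failure at the original location
print(db.read("order-1"))    # prints "shipped", served by the promoted copy
print(db.prime.location)     # prints "location-b"
```

As in the description above, the copy takes over as the prime database only once the original has become unavailable.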
(Ricknas, 2010)

References

Amazon. (n.d.). Amazon Web Services. AWS documentation. Seattle.
Amazon. (2013). Amazon Web Services - Security. Retrieved from aws.amazon.com: https://aws.amazon.com/security/
Amazon. (2013). Regions and Availability Zones. Retrieved from docs.aws.amazon.com: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
CBS (Director). (2011). Security and Data Protection in Google Data Centers [Motion Picture].
Hari. (2012, September 16). Retrieved from Systemswemake: http://www.systemswemake.com/author/xobni/
highscalability.com. (2007, September 18). Retrieved from Highscalability: http://highscalability.com/amazonarchitecture
McMillan, R., & Metz, C. (2012, November 30). Retrieved from Wired: http://www.wired.com/wiredenterprise/2012/11/amazon-google-secret-servers/
Metz, C. (2012, November 26). Retrieved from Wired: http://www.wired.com/wiredenterprise/2012/11/google-spanner-time/
O'Brien, T. M. (2012, October 4). Retrieved from Strata: http://strata.oreilly.com/2012/10/google-spanner-relational-database.html
Ricknas, M. (2010, May 18). Retrieved from CIO: http://www.cio.com/article/594063/Amazon_Improves_Reliability_for_Its_Cloud_Based_Database
Taylor, A. (2003, October 10). Retrieved from PCWorld: http://www.pcworld.com/article/112891/article.html
Vogels, W. (2012, January 12). Retrieved from All Things Distributed: http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
Wikipedia. (2013). Google platform. Retrieved from http://en.wikipedia.org/wiki/Google_platform
Wikipedia. (2013). Amazon.com. Retrieved from http://en.wikipedia.org/wiki/Amazon.com
www.statisticbrain.com. (2013, June 18). Retrieved from StatisticBrain: http://www.statisticbrain.com/google-searches/