BITS Pilani presentation
BITS Pilani, Hyderabad Campus
D. Powar, Lecturer, BITS-Pilani, Hyderabad Campus
SSZG527 Cloud Computing: Lecture 18

Lectures
Lecture No | Objectives
Lecture 10 | Capacity management
Lecture 11 | Introduction to PaaS (Drupal, Wolf frameworks, force.com); 5 Principles of UI Design by AWS: MADPO Principles
Lecture 12 | RAID (Redundant Array of Independent Disks)
Lecture 13 | MapReduce: distributed programming framework, Pig, Hive
Lecture 14 | Distributed File System (GFS, HDFS), cloud storage
Lecture 15 | Multi-tenancy, 4 levels of multi-tenancy
Lecture 16 | Cloud security
Lecture 17 | OpenStack – a cloud computing operating system

MapReduce

Map + Reduce
Very big data -> MAP -> REDUCE -> Result
Map:
– Accepts an input key/value pair
– Emits intermediate key/value pairs
Reduce:
– Accepts an intermediate key/value* pair (a key together with the list of its values)
– Emits output key/value pairs

MapReduce Programming Model
Data type: key-value records
Map function: (K_in, V_in) -> list(K_inter, V_inter)
Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out)

Examples
let map(k, v) = emit(k.toUpper(), v.toUpper())
– ("foo", "bar") -> ("FOO", "BAR")
– ("key2", "data") -> ("KEY2", "DATA")
let map(k, v) = foreach char c in v: emit(k, c)
– ("A", "cats") -> ("A", "c"), ("A", "a"), ("A", "t"), ("A", "s")
– ("B", "hi") -> ("B", "h"), ("B", "i")
let map(k, v) = if (isPrime(v)) then emit(k, v)
– ("foo", 7) -> ("foo", 7)
– ("test", 10) -> (nothing)
let map(k, v) = emit(v.length, v)
– ("hi", "test") -> (4, "test")
– ("x", "quux") -> (4, "quux")

Example: Word Count
    def mapper(line):
        for word in line.split():
            output(word, 1)

    def reducer(key, values):
        output(key, sum(values))

Word Count Execution
Input -> Map -> Shuffle & Sort -> Reduce -> Output
Input and Map output:
– Map 1: "the quick brown fox" -> the, 1; quick, 1; brown, 1; fox, 1
– Map 2: "the fox ate the mouse" -> the, 1; fox, 1; ate, 1; the, 1; mouse, 1
– Map 3: "how now brown cow" -> how, 1; now, 1; brown, 1; cow, 1
Reduce output (after Shuffle & Sort):
– Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
– Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1

Word Count example code (Java)
http://hadoop.apache.org/docs/stable/mapred_tutorial.html
http://wiki.apache.org/hadoop/WordCount

Distributed File Systems

The Google File System
GFS stores a huge number of files, totaling many terabytes of data.
Individual file characteristics:
– Very large, multiple gigabytes per file
– Files are updated by appending new entries to the end (faster than overwriting existing data)
– Files are virtually never modified (other than by appends) and virtually never deleted
– Files are mostly read-only

Google File System
Divide files into large 64 MB chunks, and distribute/replicate the chunks across many servers.
A couple of important details:
– The master maintains only a (file name, chunk server) table in main memory: minimal I/O
– Files are replicated using a primary-backup scheme; the master is kept out of the loop

HDFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
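The block layout described above (equal-sized blocks except the last, each block replicated) can be illustrated with a small sketch. This is not the HDFS API: the function names, the 64 MB block size (borrowed from the GFS chunk size mentioned earlier), the replication factor of 3 and the round-robin placement are all assumptions made only for illustration; real HDFS placement is decided by the namenode using its own policy.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB   # assumed block size; in HDFS this is configurable per file
REPLICATION = 3        # assumed replication factor; also configurable per file


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes would occupy.

    All blocks are block_size bytes except possibly the last one.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes.

    Round-robin placement is a hypothetical stand-in for the real policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the replication factor")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    sizes = split_into_blocks(200 * MB)
    print([s // MB for s in sizes])
    print(place_replicas(len(sizes), ["dn1", "dn2", "dn3", "dn4"]))
```

Under these assumptions a 200 MB file occupies three full 64 MB blocks plus one 8 MB block, and losing any single datanode still leaves two replicas of every block.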
Hadoop Distributed File System – Goals:
• Store large data sets
• Cope with hardware failure
• Emphasize streaming data access

From GFS to HDFS
Terminology differences:
– GFS master = Hadoop namenode
– GFS chunkservers = Hadoop datanodes
Functional differences:
– No file appends in HDFS (planned feature)
– HDFS performance is (likely) slower

HDFS Architecture
[Figure: the application's HDFS client sends (file name, block id) to the HDFS namenode, which holds the file namespace (e.g. /foo/bar -> block 3df2) and returns (block id, block location). The client then sends (block id, byte range) to an HDFS datanode and receives block data. The namenode sends instructions to the datanodes and receives datanode state. Each HDFS datanode stores its blocks on a local Linux file system.]
Adapted from (Ghemawat et al., SOSP 2003)

Namenode Responsibilities
Managing the file system namespace:
– Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
Coordinating file operations:
– Directs clients to datanodes for reads and writes
– No data is moved through the namenode
Maintaining overall health:
– Periodic communication with the datanodes
– Garbage collection

Cloud storage
– Cloud storage is a model of networked online storage where data is stored in virtualized pools of storage
– Companies operate large data centers, and people who require their data to be hosted buy or lease storage capacity from them
– Cloud storage services may be accessed through a web service application programming interface (API), a cloud storage gateway, or a Web-based user interface
– It is difficult to pin down a canonical definition of cloud storage architecture, but object storage is reasonably analogous

Multi-tenancy

Basic SaaS maturity model
1. Ad-hoc/custom
2. Configurable single tenant
3. Configurable multi-tenant
4. Configurable multi-tenant (scalable)

Ad-hoc/customizable instances
– Each customer has their own custom version of the software
– Represents an enterprise data center where there are multiple instances and versions of the software
– Each customer would have their own binaries, as well as their own dedicated processes for implementation of the application
– Disadvantage: difficult to manage, since each customer would need their own management support

Configurable instances
– All customers share the same version of the software (one copy for each customer)
– Advantage: easy management, since there is only one version of the software to maintain

Configurable multi-tenant efficient instances
– All customers share the same version of the software (a single copy shared among all customers)
– Advantage: easy management, since only a single instance is running

Configurable multi-tenant efficient instances (scalable)
– All customers share the same version of the software (a single copy shared among all customers)
– The software is hosted on a cluster of computers, which allows the capacity of the system to scale almost limitlessly
– Thus the number of customers can grow, and the capacity with it
– Examples: Gmail, Yahoo Mail, etc.
– Disadvantage: shared storage problem

Share vs isolate
– Business model (can I monetise?)
– Architectural model (can I do it?)
– Operational model (can I guarantee SLAs?)

Meta-data, access control

Authentication
– Unlike in traditional computer systems, the tenant specifies the valid users, and the cloud service provider authenticates them
– Two basic approaches are used:
  – Centralized authentication
  – Decentralized authentication

Authentication (contd..)
Centralized authentication:
– Authentication is performed using a centralized user database
– The cloud admin gives the tenant's admin the rights to manage user accounts for that tenant
– Multiple (two) sign-on service
– Given the self-service nature of the cloud, this is the more generally used approach
Decentralized authentication:
– Each tenant maintains their own user database, and needs to deploy a federation service that interfaces between that tenant's authentication framework and the cloud system's authentication service
– Single sign-on service

Resource sharing
– The two major resources that need to be shared are storage and servers
– Sharing storage resources (two types):
  – File system
  – Databases
– Since file system storage is a well-known mechanism, we restrict our discussion to database storage

Database
There are two methods of sharing data in a single database:
– Dedicated tables per tenant
– Shared table
Dedicated tables per tenant:
– Each tenant stores their data in a separate set of tables, different from those of other tenants
– Example: the www.mygarage.com portal, whose tenants are auto repair shops; each table may be stored as a separate file

Dedicated tables per tenant:
Best garage:     Car license | Service | Cost
Friendly garage: Car license | Service | Cost
Honest garage:   Car license | Service | Cost

Shared table:
– The data for all the tenants is stored in the same table, in different rows
– One of the columns in the table identifies the tenant to which a particular row belongs
– It is more space efficient than the previous approach
– An auxiliary table, called a metadata table, stores information about the tenants

Shared table (contd..)
Data table 1
Tenant ID | Car license | Repair | Cost
1 | | |
2 | | |
2 | | |
1 | | |
3 | | |
2 | | |
Metadata table 1
Tenant ID | Data
1 | Best garage
2 | Friendly garage
3 | Honest garage

Data customization
– It is important for the cloud infrastructure to support customization of the stored data, since different tenants are likely to want to store different data in their tables
– In the dedicated table method, each tenant has their own table, and can therefore have a different schema
– The difficulty is with the shared table approach
– Three methods are used:
  – Pre-allocated columns
  – Name-value pair
  – XML method

Pre-allocated columns
– Space is reserved in the tables for custom columns, which tenants can use for defining new columns
– Salesforce.com reserves 500 columns
– Some of the tenants may not use these columns
– Disadvantage: there could be a lot of wasted space

Pre-allocated columns
Data table 1
Tenant ID | Car license | Service | Cost | Custom1 | Custom2
1 | | | | |
2 | | | | |
2 | | | | |
1 | | | | |
3 | | | | |
2 | | | | |
Metadata table 1
Tenant ID | Tenant name | Custom1 name | Custom1 type
1 | Best garage | Service rating | int
2 | Friendly garage | Service manager | string
3 | Honest garage | |

Name-value pair
– The standard table has an extra column, which is a pointer to a table of name-value pairs holding the additional custom fields for a record
– The name-value pair table is also called a pivot table
– This method overcomes the storage wastage of the previous method
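The name-value pair method can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not how any particular SaaS platform implements it: the table and column names (`repairs`, `name_value`, `field_names`, `nv_record`), the helper functions and all row data (license plates, services, costs, the rating value) are made up for the example. Only the garage names and the idea of a "Service rating" custom field come from the lecture.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a shared table plus a name-value table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tenants (tenant_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE field_names (name_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
        CREATE TABLE repairs (
            tenant_id INTEGER, car_license TEXT, service TEXT, cost REAL,
            nv_record INTEGER
        );
        CREATE TABLE name_value (nv_record INTEGER, name_id INTEGER, value TEXT);
    """)
    conn.executemany(
        "INSERT INTO tenants VALUES (?, ?)",
        [(1, "Best garage"), (2, "Friendly garage"), (3, "Honest garage")],
    )
    # Metadata: custom field 15 is called "Service rating".
    conn.execute("INSERT INTO field_names VALUES (15, 'Service rating', 'int')")
    # Shared data table: nv_record points into name_value, or is NULL when the
    # row has no custom fields, so no space is reserved for unused columns.
    conn.executemany(
        "INSERT INTO repairs VALUES (?, ?, ?, ?, ?)",
        [(1, "CAR-001", "Oil change", 40.0, 275),
         (2, "CAR-002", "Brake check", 60.0, None)],
    )
    conn.execute("INSERT INTO name_value VALUES (275, 15, '5')")
    return conn


def custom_fields(conn, tenant_id, car_license):
    """Return the custom fields of one tenant's row as a {name: value} dict."""
    rows = conn.execute(
        """
        SELECT f.name, nv.value
        FROM repairs AS r
        JOIN name_value AS nv ON nv.nv_record = r.nv_record
        JOIN field_names AS f ON f.name_id = nv.name_id
        WHERE r.tenant_id = ? AND r.car_license = ?
        """,
        (tenant_id, car_license),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print(custom_fields(conn, 1, "CAR-001"))
    print(custom_fields(conn, 2, "CAR-002"))
```

Every query filters on `tenant_id`, which is what keeps one tenant's rows invisible to another in a shared table. The price of the flexibility is an extra join for each access to custom fields.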
Data table 1
Tenant ID | Car license | Service | Cost | Name-value pair record
1 | | | | 275
2 | | | |
2 | | | |
1 | | | |
3 | | | |
2 | | | |
Data table 2
Name-value pair | Name ID | Value
275 | 15 | 5.5
Metadata table 2
Name ID | Name | Type
15 | Service rating | int
 | Service manager | string
Metadata table 1
Tenant ID | Data
1 | Best garage
2 | Friendly garage
3 | Honest garage

OpenStack – a cloud computing operating system

9 core components of OpenStack (Havana)
– Nova: Compute Service
– Swift: Storage Service
– Glance: Imaging Service
– Keystone: Identity Service
– Horizon: UI Service
– Neutron (formerly Quantum): Network connectivity Service
– Cinder: Block Storage Service
– Ceilometer: metering for billing, benchmarking, scalability, and statistics purposes
– Heat: orchestrates multiple composite cloud applications

OpenStack conceptual architecture
[Figure not reproduced]

Table 1.1. OpenStack current services (Havana)
Service | Project name | Description
Dashboard | Horizon | Enables users to interact with OpenStack services to launch an instance, assign IP addresses, set access controls, and so on.
Compute | Nova | Provisions and manages large networks of virtual machines on demand.
Networking | Neutron | Enables network connectivity as a service among interface devices managed by other OpenStack services, usually Compute. Enables users to create and attach interfaces to networks. Has a pluggable architecture that supports many popular networking vendors and technologies.
Storage
Object Storage | Swift | Stores and gets files. Does not mount directories like a file server.
Block Storage | Cinder | Provides persistent block storage to guest virtual machines.
Shared services
Identity Service | Keystone | Provides authentication and authorization for the OpenStack services. Also provides a service catalog within a particular OpenStack cloud.
Image Service | Glance | Provides a registry of virtual machine images. Compute uses it to provision instances.
Metering/Monitoring Service | Ceilometer | Monitors and meters the OpenStack cloud for billing, benchmarking, scalability, and statistics purposes.
Higher-level services
Orchestration Service | Heat | Orchestrates multiple composite cloud applications by using either the native HOT template format or the AWS CloudFormation template format, through both an OpenStack-native REST API and a CloudFormation-compatible Query API.

Summary
– Capacity management
– Introduction to PaaS (Drupal, Wolf frameworks, force.com), 5 Principles of UI Design by AWS
– RAID (Redundant Array of Independent Disks)
– MapReduce: distributed programming framework, Pig, Hive
– Distributed File System (GFS, HDFS), cloud storage
– Multi-tenancy, 4 levels of multi-tenancy
– Cloud security
– OpenStack – a cloud computing operating system
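As a recap of the MapReduce part of the lecture, the Word Count example can be simulated in a single Python process. This is only a sketch of the map, shuffle & sort, and reduce phases on one machine, not Hadoop code; `run_word_count` is a hypothetical driver introduced here, and the mapper and reducer return their pairs instead of calling the `output` function used on the slide.

```python
from collections import defaultdict


def mapper(line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word, 1


def reducer(key, values):
    """Reduce: sum all the counts collected for one word."""
    return key, sum(values)


def run_word_count(lines):
    """Run map, shuffle & sort, and reduce in-process (hypothetical driver)."""
    # Map phase: every line is processed independently of the others.
    intermediate = [pair for line in lines for pair in mapper(line)]

    # Shuffle & sort phase: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per distinct key, in sorted key order.
    return dict(reducer(key, groups[key]) for key in sorted(groups))


if __name__ == "__main__":
    lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
    for word, count in run_word_count(lines).items():
        print(word, count)
```

On a real cluster the map calls run in parallel on different input splits, and the shuffle moves each key's values across the network to the reducer responsible for that key. The in-memory dictionary here stands in for that step.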