Introduction to Windows Azure Cloud Computing Futures Group, Microsoft Research Roger Barga, Jared Jackson, Nelson Araujo, Dennis Gannon, Wei Lu, and Jaliya Ekanayake Range in size from “edge” facilities to megascale. Economies of scale Approximate costs for a small size center (1000 servers) and a larger, 100K server center. Technology Cost in smallsized Data Center Cost in Large Data Center Ratio Network $95 per Mbps/ month $13 per Mbps/ month 7.1 Storage $2.20 per GB/ month $0.40 per GB/ month 5.1 Administration ~140 servers/ Administrator >1000 Servers/ Administrator 7.1 Each data center is 11.5 times the size of a football field A bunch of machines in data centers Fabric Controller Owns all data center hardware Uses inventory to host services Deploys applications to free resources Maintains the health of those applications Maintains health of hardware If the node goes offline, FC will try to recover it If a failed node can’t be recovered, FC migrates role instances to a new node, A suitable replacement location is found, Existing role instances are notified of change Manages the service life cycle starting from bare metal Highly-available Fabric Controller (FC) At Minimum (Small) Up to 7 Guest VMs A Host Virtual Machine An Optimized Hypervisor CPU: 1.5-1.7 GHz x64 Memory: 1.7GB Network: 100+ Mbps Local Storage: 500GB Up to (Extra Large) CPU: 8 Cores Memory: 14.2 GB Local Storage: 2+ TB At Minimum CPU: 1.5-1.7 GHz x64 Memory: 1.7GB Network: 100+ Mbps Local Storage: 500GB Up to CPU: 8 Cores Memory: 14.2 GB Local Storage: 2+ TB Azure Platform Worker Role Web Role Compute Blobs Queues Storage Tables Drives A closer look HTTP Blobs Application Storage Compute Fabric … Drives Tables Queues Access Data is exposed via .NET and RESTful interfaces Data can be accessed by: Windows Azure apps Other on-premise applications or cloud applications Account Container images jared Blob PIC01.JPG PIC02.JPG movies MOV1.AVI http://jared.blob.core.windows.net/images/PIC01.JPG Number of Blob Containers Can have has many Blob Containers as will fit within the storage account limit Blob Container A container holds a set of blobs Set access policies at the container level Private or Public accessible Associate Metadata with Container Metadata are <name, value> pairs Up to 8KB per container Block Blob Targeted at streaming workloads Each blob consists of a sequence of blocks Each block is identified by a Block ID Size limit 200GB per blob Page Blob Targeted at random read/write workloads Each blob consists of an array of pages Each page is identified by its offset from the start of the blob Size limit 1TB per blob Account Container images jared Blob PIC01.JPG PIC02.JPG movies MOV1.AVI Block or Page Block or Page 1 Block or Page 2 Block or Page 3 Producers Scalable message paths Provides loose synchronization Any number of messages One week of persistence Maximum size 8KB Visibility timeout Consumers C1 P2 4 P1 3 2 1 C2 Provides Structured Storage Massively Scalable Tables Billions of entities (rows) and TBs of data Can use thousands of servers as traffic grows Data is replicated several times Table A storage account can create many tables Table name is scoped by account Set of entities (i.e. rows) Entity Set of properties (columns) Required properties PartitionKey, RowKey and Timestamp Partition 1 Partition 2 Source : Windows Azure Table – Programming Table Storage A Windows Azure Drive is a Page Blob formatted as a NTFS single volume Virtual Hard Drive (VHD) Drives can be up to 1TB A VM can dynamically mount up to 8 drives A Page Blob can only be mounted by one VM at a time for read/write Remote Access via Page Blob Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive Can download the Drive through the Page Blob interface A closer look Web Role HTTP Load Balancer IIS Worker Role ASP.NET, WCF, etc. Agent main() { … } Agent Fabric VM Using queues for reliable messaging To scale, add more of either 1) Receive work Worker Role Web Role main() { … } ASP.NET, WCF, etc. 2) Put work in queue 3) Get work from queue Queue 4) Do work Queues are the application glue • Decouple parts of application, easier to scale independently; • Resource allocation, different priority queues and backend servers • Mask faults in worker roles (reliable messaging). Use Inter-role communication for performance • TCP communication between role instances • Define your ports in the service models Points of interest Access Data is exposed via .NET and RESTful interfaces Data can be accessed by: Windows Azure apps Other on-premise applications or cloud applications Work Home Develop Development Fabric Develop Your App Run Development Storage Source Control Version Local Application Works Locally What the ‘Value Add’ ? Provide a platform that is scalable and available Services are always running, rolling upgrades/downgrades Failure of any node is expected, state has to be replicated Failure of a role (app code) is expected, automatic recovery Services can grow to be large, provide state management that scales automatically Handle dynamic configuration changes due to load or failure Manage data center hardware: from CPU cores, nodes, rack, to network infrastructure and load balancers. Key takeaways Cloud services have specific design considerations Always on, distributed state, large scale, fault tolerance Scalable infrastructure demands a scalable architecture Stateless roles and durable queues Windows Azure frees service developers from many platform issues Windows Azure manages both services and servers Worker Web Role Web Portal Web Service Job registration Job Management Role Scaling Engine Job Scheduler Job Registry NCBI databas es Database updating Role Azure Table Worker Global dispatch queue Blast databases, temporary data, etc.) Azure Blob … Worker • Always design with failure in mind - On large jobs it will happen, and it can happen anywhere • Factoring work into optimal sizes has large performance impacts - The optimal size may change depending on the scope of the job • Test runs are your friend - Blowing $20,000 of computation is not a good idea • Make ample use of logging features - When failure does happen, it’s good to know where • Cutting 10 years of computation down to 1 week is great!! - Little Cloud development headaches are probably worth it Thank you!