Big Data Open Source Software and Projects
ABDS in Summary II: Layer 5
I590 Data Science Curriculum
August 15 2014

Geoffrey Fox
gcf@indiana.edu
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington

HPC-ABDS Layers
Here are 17 functionalities. Technologies are presented in this order: 4 cross-cutting at top, then 13 in order of the layered diagram starting at the bottom.
1) Message Protocols
2) Distributed Coordination
3) Security & Privacy
4) Monitoring
5) IaaS Management from HPC to hypervisors
6) DevOps
7) Interoperability
8) File systems
9) Cluster Resource Management
10) Data Transport
11) SQL / NoSQL / File management
12) In-memory databases & caches / Object-relational mapping / Extraction Tools
13) Inter process communication: Collectives, point-to-point, publish-subscribe
14) Basic Programming model and runtime: SPMD, Streaming, MapReduce, MPI
15) High level Programming
16) Application and Analytics
17) Workflow-Orchestration

Xen
• Xen http://en.wikipedia.org/wiki/Xen is a type 1 hypervisor that supports paravirtualization, in which guests run a modified operating system. The guests are modified to use a special hypercall ABI instead of certain architectural features.
• Through paravirtualization, Xen can achieve high performance even on its host architecture (x86), which has a reputation for non-cooperation with traditional virtualization techniques.
• Xen was developed at the University of Cambridge. Its commercial arm, XenSource, was acquired by Citrix in 2007; the open-source Xen Project has been hosted by the Linux Foundation since 2013.
• Responsibilities of the hypervisor include memory management and CPU scheduling of all virtual machines ("domains"), and launching the most privileged domain ("dom0"), the only virtual machine which by default has direct access to hardware. From dom0 the hypervisor can be managed and unprivileged domains ("domU") can be launched.
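The dom0/domU split described on the Xen slide can be sketched as a toy model. This is only an illustration of the privilege rule, not the Xen API: the names `Domain`, `ToyHypervisor`, `boot`, `launch_guest` and `has_hardware_access` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    # By default only dom0 has direct access to hardware.
    privileged: bool = False


@dataclass
class ToyHypervisor:
    """Hypothetical bookkeeping for domains; it does not talk to Xen."""
    domains: dict = field(default_factory=dict)

    def boot(self) -> Domain:
        # The hypervisor itself launches the most privileged domain, dom0.
        dom0 = Domain("dom0", privileged=True)
        self.domains[dom0.name] = dom0
        return dom0

    def launch_guest(self, requester: Domain, name: str) -> Domain:
        # Unprivileged domains (domU) can only be launched from dom0.
        if not requester.privileged:
            raise PermissionError(f"{requester.name} may not launch domains")
        if name in self.domains:
            raise ValueError(f"domain {name} already exists")
        guest = Domain(name)
        self.domains[name] = guest
        return guest

    def has_hardware_access(self, name: str) -> bool:
        return self.domains[name].privileged


hv = ToyHypervisor()
dom0 = hv.boot()
guest = hv.launch_guest(dom0, "domU-1")
```

In this sketch a request to launch a further domain from `guest` raises `PermissionError`, mirroring the rule that management goes through dom0.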

KVM, VirtualBox
• KVM http://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine is a GPL-licensed type 2 virtualization infrastructure for the Linux kernel that turns it into a hypervisor; it was merged into the Linux kernel mainline in February 2007
  – It requires a processor with hardware virtualization extensions (Intel VT-x or AMD-V).
• Oracle VirtualBox https://www.virtualbox.org/ http://en.wikipedia.org/wiki/VirtualBox is another well-known type 2 hypervisor, with a GPLv2 license
  – Runs on many operating systems

Hyper-V
• Microsoft's proprietary hypervisor http://en.wikipedia.org/wiki/Hyper-V, which supports Windows and some variants of Linux
• There must be a parent partition running Windows Server

OpenVZ
• OpenVZ http://openvz.org/Main_Page is GPL licensed and is often listed with type 2 hypervisors
• OpenVZ (Open VirtualiZation) or Open Virtuozzo is an operating system-level virtualization technology based on the Linux kernel and operating system. OpenVZ allows a physical server to run multiple isolated operating system instances, known as containers, Virtual Private Servers (VPSs), or Virtual Environments (VEs).
• Docker works well with containers
• OpenVZ is not true virtualization but really containerization, like FreeBSD jails.
• Technologies like VMware and Xen are more flexible in that they virtualize the entire machine and can run multiple operating systems and different kernel versions.
• OpenVZ uses a single patched Linux kernel and therefore can run only Linux; all containers share the same architecture and kernel version. However, as it does not have the overhead of a true hypervisor, it is very fast and efficient.
• The disadvantage of this approach is the single kernel. All guests must function with the same kernel version that the host uses.
• LXC (LinuX Containers) and Linux-VServer are similar technologies

OpenStack
• OpenStack, OpenNebula, CloudStack, Nimbus and Eucalyptus are all cloud or virtual machine managers.
They help users and system administrators use virtual machines with various characteristics
  – The big commercial public clouds have equivalent proprietary systems
• OpenStack http://en.wikipedia.org/wiki/OpenStack http://www.openstack.org/ is a free and open-source, Apache-licensed cloud computing software platform. Users primarily deploy it as an infrastructure as a service (IaaS) solution.
• The technology consists of a series of interrelated projects that control pools of processing, storage, and networking resources throughout a data center, which users manage through a web-based dashboard, command-line tools, or a RESTful API.
• OpenStack began in 2010 as a joint project of Rackspace Hosting and NASA. Currently, it is managed by the OpenStack Foundation, a non-profit corporate entity established in September 2012 to promote OpenStack software and its community. More than 200 companies have joined the project, including Arista Networks, AT&T, AMD, Avaya, Canonical, Cisco, Dell, EMC, Ericsson, Go Daddy, Hewlett-Packard, IBM, Intel, Mellanox, Mirantis, NEC, NetApp, Nexenta, Oracle, PLUMgrid, Red Hat, SUSE Linux, VMware and Yahoo!.
• The OpenStack community collaborates around a six-month, time-based release cycle with frequent development milestones. During the planning phase of each release, the community gathers for the OpenStack Design Summit to facilitate developer working sessions and to assemble plans. The most recent OpenStack Summit, in May 2014 in Atlanta, drew 4,500 attendees, a 50% increase from the Hong Kong Summit six months earlier.

Apache CloudStack
• http://cloudstack.apache.org/ http://en.wikipedia.org/wiki/Apache_CloudStack Has a reputation for solid software but has not seen the rapid adoption of OpenStack; it is unusual for the Apache solution not to be the most popular!
• Came from Citrix via acquisitions
• Features include
  – Built-in high-availability for hosts and VMs
  – AJAX web GUI for management
  – AWS API compatibility
  – Hypervisor agnostic (VMware, KVM, XenServer, Xen Cloud Platform (XCP) and Hyper-V)
  – Snapshot management
  – Usage metering
  – Network management (VLANs, security groups)
  – Virtual routers, firewalls, load balancers
  – Multi-role support

Eucalyptus, Nimbus
• Eucalyptus https://www.eucalyptus.com/ http://en.wikipedia.org/wiki/Eucalyptus_(software) was the top academic project in 2009; it was then commercialized and was just recently purchased by Hewlett Packard
  – Eucalyptus had both commercial and open source (GPL3) tracks, but the latter was not developed as vigorously as other open source solutions
  – Perhaps the first to offer an AWS-compatible interface
• Apache-licensed Nimbus http://en.wikipedia.org/wiki/Nimbus_(cloud_computing) http://www.nimbusproject.org/ was probably the most effective academic cloud software after Eucalyptus was commercialized and before OpenStack became popular

FutureGrid IaaS request popularity by year
(figure not reproduced)

OpenNebula
• http://en.wikipedia.org/wiki/OpenNebula http://opennebula.org/ Apache License.
• OpenNebula orchestrates storage, network, virtualization, monitoring, and security technologies to deploy multi-tier services (e.g. compute clusters) as virtual machines on distributed infrastructures, combining both data center resources and remote cloud resources, according to allocation policies
• The toolkit includes features for integration, management, scalability, security and accounting.
It also claims standardization, interoperability and portability, providing cloud users and administrators with a choice of several cloud interfaces (Amazon EC2 Query, OGF Open Cloud Computing Interface and vCloud) and hypervisors (Xen, KVM and VMware), and can accommodate multiple hardware and software combinations in a data center
• A good system that is strongly promoted in Europe but little used in the USA, where it is eclipsed by OpenStack

VMware vCloud
• VMware ESX http://en.wikipedia.org/wiki/VMware_ESX is an enterprise-level computer virtualization product offered by VMware. ESX is a component of VMware's larger offering, VMware Infrastructure, which adds management and reliability services to the core server product. VMware recommends that deployments running the earlier ESX architecture migrate to the newer ESXi hypervisor architecture.
• VMware ESX and ESXi are VMware's enterprise software Type 1 hypervisors for guest virtual servers; they run on host server hardware without an underlying operating system.
• vSphere http://en.wikipedia.org/wiki/VMware_vSphere uses VMware's ESXi hypervisor, adding management (as in OpenStack)
• Note that the desktop product VMware Workstation is a type 2 hypervisor
• VMware has historically been a software vendor focused on virtualization technologies. It entered the cloud IaaS market when it launched the VMware vCloud Hybrid Service (vCHS) into general availability in September 2013. http://en.wikipedia.org/wiki/VCloud This allows customers to migrate work on demand from their "internal cloud" of cooperating VMware hypervisors to a remote cloud of VMware hypervisors.
  – This is called cloud bursting

Amazon, Azure, Google Clouds
• Gartner has a "magic quadrant" summarizing public clouds, 28 May 2014 http://www.gartner.com/technology/reprints.do?id=1-1UKQQA6&ct=140528
• Note Amazon is way ahead!
• Google with GCE (Google Compute Engine) is just starting IaaS.
Previously it offered PaaS with Google App Engine
• Microsoft has recently expanded Azure but is still catching up

Layered service diagram (figure not reproduced; recoverable labels, top to bottom)
• Diagram label: Dynamic Orchestration and Dataflow
• Software (Application or Usage), SaaS: Use HPC-ABDS; Class Usages e.g. run GPU & multicore; Applications; Control Robot
• Platform, PaaS: Cloud e.g. MapReduce; HPC e.g. PETSc, SAGA; Computer Science e.g. Compiler tools, Sensor nets, Monitors
• Infrastructure, IaaS: Software Defined Computing (virtual Clusters); Hypervisor, Bare Metal; Operating System
• Network, NaaS: Software Defined Networks; OpenFlow; GENI

Amazon Web Services AWS
• Compute: Elastic Compute Cloud (EC2) offers multitenant, fixed-size and nonresizable, Xen-virtualized VMs without autorestart. Single-tenant VMs are available via Dedicated Instances. There are special options for HPC, including graphics processing units (GPUs). AWS does not have any formal private cloud offerings, though it is willing to negotiate such deals (such as its deal for the U.S. intelligence community cloud).
• Storage: VM storage is ephemeral. Persistence requires VM-independent block storage (Elastic Block Store). There is an option for SSDs, as well as storage performance guarantees (Provisioned IOPS). Object-based storage (Simple Storage Service [S3]) is integrated with a CDN (CloudFront), there is an option for long-term archive storage (Glacier), and AWS offers its own cloud storage gateway appliance.
• Network: AWS offers a full range of networking options. Complex networking and IPsec VPN are done via Amazon Virtual Private Cloud (VPC). Third-party connectivity is via partner exchanges (AWS Direct Connect).
• Security: RBAC (Role Based Access Control) is per-element, with customer-defined roles and exceptional control over permissions. AWS has obtained many security and compliance-related certifications and audits.

Google Compute Engine
• Google has been operating App Engine since 2008, but did not enter the IaaS market until the general-availability launch of GCE in December 2013.
• Compute: GCE offers multitenant, fixed-size and nonresizable, KVM-virtualized VMs, metered by the minute. Provisioning is exceptionally fast (typically under 1 minute).
• Storage: VM storage is persistent, and there is also VM-independent block storage. All block storage is encrypted.
• Network: Third-party private connectivity is not supported. Customers cannot bring their own private IP addresses (although this need may possibly be addressed by GCE's Advanced Routing features). There is no back-end load balancing.
• Security: RBAC permissions apply to the whole account.
• Google's strategy for Google Cloud Platform centers on the concept of allowing other organizations to "run like Google" by taking Google's highly innovative internal technology capabilities and exposing them as services that other companies can purchase. Consequently, although Google is a late entrant to the IaaS market, it is primarily productizing existing capabilities, rather than having to engineer those capabilities from scratch. It will therefore be able to advance its offering more rapidly than most competitors.

Microsoft Azure
• The Azure business was previously strictly PaaS with a Windows and .NET focus, but Microsoft launched Azure Infrastructure Services (which include Azure Virtual Machines and Azure Virtual Network) into general availability in April 2013, thus entering the cloud IaaS market.
• Compute: Azure VMs (Linux or Windows) are fixed-size, paid-by-the-VM, and Hyper-V-virtualized; they are metered by the minute.
• Storage: Block storage ("virtual hard disk") is persistent and VM-independent. Object-based cloud storage is integrated with a CDN.
• Network: There is no support for complex network topologies. Third-party connectivity is via partner exchange (Azure ExpressRoute).
• Security: Virtual network topology limitations prevent useful deployment of most security-related virtual appliances, such as a perimeter intrusion detection/prevention system (IDS/IPS).
RBAC uses Azure Active Directory, but permissions are whole-account.

Google Cloud DNS & Amazon Route 53
• Google Cloud DNS
  – Authoritative DNS server available as a service in Google Cloud
  – The service is efficient, fault-tolerant and available globally
  – It can be used by user-hosted services in Google Cloud or by third-party applications
  – https://developers.google.com/cloud-dns/what-is-cloud-dns
• Amazon Route 53
  – Authoritative DNS server available as a service in Amazon AWS
  – Provides a fault-tolerant, very fast DNS service
  – As with Google Cloud DNS, it can be used by services hosted in the Amazon cloud or by third-party applications
  – The service is available on all continents except Africa
  – http://aws.amazon.com/route53/
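
Both services above are described as authoritative DNS servers: they answer only from records held for the zones they host, rather than resolving arbitrary names on behalf of clients. A minimal in-memory sketch of that behavior follows. It does not use the Google Cloud DNS or Route 53 APIs; the zone contents, the `answer` function and the response-code strings are assumptions made for illustration, and CNAME handling, wildcards and delegation are deliberately left out.

```python
from typing import Dict, List, Tuple

# Hypothetical zone data using the reserved example.com name and
# documentation-range IP addresses.
ZONE_ORIGIN = "example.com."
ZONE: Dict[Tuple[str, str], List[str]] = {
    ("example.com.", "A"): ["192.0.2.10"],
    ("www.example.com.", "A"): ["192.0.2.20"],
    ("example.com.", "MX"): ["10 mail.example.com."],
}


def normalize(name: str) -> str:
    """Lower-case the name and make it fully qualified (trailing dot)."""
    name = name.lower()
    return name if name.endswith(".") else name + "."


def answer(name: str, rtype: str) -> Tuple[str, List[str]]:
    """Answer a query from local zone data only, as an authoritative server does."""
    qname = normalize(name)
    in_zone = qname == ZONE_ORIGIN or qname.endswith("." + ZONE_ORIGIN)
    if not in_zone:
        # Not our zone: an authoritative-only server does not recurse.
        # Many such servers reply REFUSED; that choice is an assumption here.
        return ("REFUSED", [])
    records = ZONE.get((qname, rtype.upper()))
    if records:
        return ("NOERROR", list(records))
    if any(owner == qname for owner, _ in ZONE):
        # The name exists but has no record of the requested type.
        return ("NOERROR", [])
    return ("NXDOMAIN", [])


print(answer("www.example.com", "A"))
print(answer("missing.example.com", "A"))
print(answer("example.org", "A"))
```

The managed services add what this sketch lacks: replication of the zone data across many locations, which is where the fault tolerance and global availability come from.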