Big Data Open Source Software and Projects ABDS in Summary XVI: Layer 13 Part 1 Data Science Curriculum March 5 2015 Geoffrey Fox gcf@indiana.edu http://www.infomall.org School of Informatics and Computing Digital Science Center Indiana University Bloomington Functionality of 21 HPC-ABDS Layers 1) Message Protocols: 2) Distributed Coordination: 3) Security & Privacy: 4) Monitoring: 5) IaaS Management from HPC to hypervisors: 6) DevOps: Here are 21 functionalities. 7) Interoperability: (including 11, 14, 15 subparts) 8) File systems: 9) Cluster Resource Management: 4 Cross cutting at top 10) Data Transport: 17 in order of layered diagram 11) A) File management starting at bottom B) NoSQL C) SQL 12) In-memory databases&caches / Object-relational mapping / Extraction Tools 13) Inter process communication Collectives, point-to-point, publish-subscribe, MPI: Part 1 14) A) Basic Programming model and runtime, SPMD, MapReduce: B) Streaming: 15) A) High level Programming: B) Application Hosting Frameworks 16) Application and Analytics: 17) Workflow-Orchestration: Publish-Subscribe Technology Helped by Supun Kamburugamuve Apache Kafka (LinkedIn) • http://kafka.apache.org/ • Apache Kafka is a message brokering system designed for high throughput and low latency, as well as a high level of durability and fault tolerance. It uses a publish/subscribe model, where producers publish messages to a cluster of brokers, consumers poll messages from the brokers. • Kafka was originally developed to track web site activity such as page views and searches, but it has been applied to varied activities. • Kafka was developed at LinkedIn, and was made open source in 2011. It became a top level Apache project in 2012. Kafka uses another Apache project, Zookeeper, to maintain its broker clusters. Apache ActiveMQ • http://activemq.apache.org/ • Apache ActiveMQ is a very popular publish-subscribe message broker • Mainly supports the JMS 1.1. standard. JMS is an API and not a wire protocol. • Has its own wire protocol called OpenWire and supports standard protocols like Stomp and MQTT. • ActiveMQ supports clustered brokers and networked brokers for scalability and fault tolerance. • Producers and consumers can be written in different programming languages RabbitMQ • http://www.rabbitmq.com/ • RabbitMQ is an open source messaging framework under the Mozilla Public License. • Developed to support the AMQP open messaging standard and has support available for other protocols like MQTT. • Developed using erlang programming language and boasts on its high throughput and low latency. • RabbitMQ supports clustering for fault tolerance and scalability • Any AMQP compliant client can publish messages to RabbitMQ as well as consume messages. • Python interface py-librabbitmq used by Celery Apache Qpid • https://qpid.apache.org/ • Qpid is the Apache project implementing the AMQP protocol. • The broker is mainly written in Java and has clients written in different programming languages • Support clustering for high availability Kestrel • http://robey.github.io/kestrel/ • Kestrel is a simple distributed messaging queue. • The nodes in a Kestrel cluster doesn’t communicate with each other, resulting in loosely ordered queues across the cluster. • Because of this simple design Kestrel can scale to thousands of nodes. • The project is developed at Twitter and available in Github. • Kestrel supports memcache protocol and thrift based protocol for sending and receiving messages ZeroMQ • http://zeromq.org/ • ZeroMQ is a embeddable library for creating custom messaging solutions for applications • Provides sockets that can be used to do inter-process, intra-process, TCP multicast messaging. • The sockets can be connected 1 to 1, N to N with patterns like fan-out, pub-sub etc. • The library is asynchronous in nature and provide very efficient and fast communication channels to the applications. • Primarily developed in C but the library can be used from difference programming languages like Java, PHP etc. Netty • http://netty.io/ • Netty is a NIO based Java framework which enables easy development of high performance network applications like protocol Servers and Clients. • Netty was developed at Red Hat JBoss and now available in Github under the Apache license version 2.0. • The library provides out of the box support for popular application protocols like HTTP. • The library can be used to build custom transport protocols using TCP or UDP. NaradaBrokering (NB) • http://www.naradabrokering.org/ • Development stopped in 2009 and ideas further developed into Granules (layer 14B) • For 10 years NB was state of art in exploring publish subscribe systems and their use in collaboration and computing but now other systems have caught up as described in this slide deck • NB supported security, robustness (fault-tolerance), quality of service (such as variation in arrival time, synchronization of streams, delivery order), Web Services. Multiple transport protocols, high performance, distributed efficient set of multiple cooperating brokers, sophisticated topic specification and search Public Cloud Pub/Sub • Google Cloud Pub Sub https://developers.google.com/pubsub is a publish-subscribe messaging system offered by Google as a cloud service – Supports many to many, one to many and many to one communications. – The publishers can use the HTTP API for sending the data. – The subscribers can use a pull based API or push based API for receiving the data. – The service is available as a developer preview and free of charge. – Part of Google Cloud Dataflow that also has FlumeJava and Google MillWheel • See Simple Notification Service (Amazon SNS) http://aws.amazon.com/sns/ for Amazon equivalent and • Azure Queues and Service Bus Queues (advanced functionality) http://msdn.microsoft.com/enus/library/azure/hh767287.aspx for Azure equivalent • See Azure Event Hubs later System Features Amazon Simple Queue Azure Queue ActiveMQ MuleMQ Websphere MQ Narada Brokering AMQP compliant No No No, use OpenWire and Stomp. No No No JMS compliant No No Yes Yes Yes Yes Distributed broker No No Yes Yes Yes Yes Exactly once delivery supported Guaranteed and exactly-once Delivery guarantees Ordering guarantees Message retained Message Based on journaling Disk store uses 1 in queue for 4 accessible for 7 and JDBC drivers file/channel, TTL days days to databases. purges messages Best effort, once No ordering, delivery, duplicate Message returns messages exist more than once Publisher order guarantee Not clear. Publisher- or timePublisher order order by Network guarantee Time Protocol Access Model SOAP, HTTPbased GET/POST HTTP REST interfaces Using JMS classes Max. Message 8 KB 8 KB NA NA NA NA Buffering NA Yes Yes Yes. Yes Yes Yes Yes Yes Yes Time decoupled delivery Security Up to 4 days. Up to a max. of Support timeouts. 7 days. Based on HMACSHA1 signature. scheme Support for WSSecurity 1.0. Support for Web Services SOAP based interactions Transports HTTP/ HTTPS, SSL Subscription formats Access is to individual queues JMS, Adm. API, Message Queue and JNDI Interface, JMS JMS, WSEventing Access to Access control , SSL, end-to-end Authorization based SSL, end-to-end queues by authentication, application level on JAAS for application level HMAC SHA256 SSL for data security, and authentication data security signature communication ACLs REST interfaces REST REST Mule ESB TCP, UDP, SSL, supports TCP, HTTP/ HTTPS HTTP/S, Multicast, UDP, RMI, SSL, in-VM, JXTA SMTP and FTP Access is to individual queues REST, SOAP interactions WS-Eventing TCP, UDP, Multicast, SSL, HTTP/S TCP, Parallel TCP, UDP, Multicast, SSL, HTTP/S, IPSec SQL Selectors, JMS spec allows JMS spec allows JMS spec allows Regular expresfor SQL selectors. for SQL selectors. SQL selectors. sions, <tag, Also access to Also access to Access to indivivalue> pairs, individual queues. individual queues. dual queues. XQuery and XPath ~2010 Comparison Important Brokers changed since then Azure Event Hubs I • This is used by Azure Stream Analytics • It appears to be built on Azure service-bus publish subscribe messaging system • It has usual feature of supporting input and output to Stream Analytics with multiple subscribers Azure Event Hubs II • These figures show that a given event hub can support multiple stream hosts Also the event hub brokers are auto scaled in response to increased load Amazon Lambda • Event triggered computing http://aws.amazon.com/lambda/ • AWS Lambda can automatically run code in response to modifications to objects in Amazon S3 buckets, messages arriving in Amazon Kinesis streams, or table updates in Amazon DynamoDB. • At launch AWS Lambda supports user defined functions written in Node.js (JavaScript). Your code can include existing Node.js libraries, even native ones. • Lambda has scaling, fault tolerance, security, automated administration • Could also/instead be in layer 14B as a stream programming model