PPT - Big Data & Open Source Software Projects

Big Data Open Source Software
and Projects
ABDS in Summary XVI: Layer 13 Part 1
Data Science Curriculum
March 5 2015
Geoffrey Fox
gcf@indiana.edu
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
Functionality of 21 HPC-ABDS Layers
Here are 21 functionalities (including the subparts of 11, 14 and 15): 4 cross-cutting at top, then 17 in order of the layered diagram, starting at bottom.
1) Message Protocols:
2) Distributed Coordination:
3) Security & Privacy:
4) Monitoring:
5) IaaS Management from HPC to hypervisors:
6) DevOps:
7) Interoperability:
8) File systems:
9) Cluster Resource Management:
10) Data Transport:
11) A) File management
B) NoSQL
C) SQL
12) In-memory databases & caches / Object-relational mapping / Extraction Tools
13) Inter-process communication: Collectives, point-to-point, publish-subscribe, MPI: Part 1
14) A) Basic Programming model and runtime, SPMD, MapReduce:
B) Streaming:
15) A) High level Programming:
B) Application Hosting Frameworks
16) Application and Analytics:
17) Workflow-Orchestration:
Publish-Subscribe Technology
Helped by Supun Kamburugamuve
Apache Kafka (LinkedIn)
• http://kafka.apache.org/
• Apache Kafka is a message brokering system
designed for high throughput and low latency, as
well as a high level of durability and fault tolerance.
It uses a publish/subscribe model in which producers
publish messages to a cluster of brokers and consumers
poll messages from the brokers.
• Kafka was originally developed to track web site
activity such as page views and searches, but it has
since been applied to a variety of other uses.
• Kafka was developed at LinkedIn and was made
open source in 2011. It became a top-level Apache
project in 2012. Kafka uses another Apache project,
ZooKeeper, to maintain its broker clusters.
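The publish/poll model described above can be sketched in a few lines of Python. This is a toy in-memory stand-in, not Kafka's client API: the names `ToyBroker`, `ToyConsumer` and the topic `page-views` are hypothetical, and the sketch assumes a single broker with one append-only log per topic and consumers that keep track of their own read offset.

```python
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def publish(self, topic, message):
        """Append a message and return its offset in the topic log."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, topic, offset, max_messages=10):
        """Return up to max_messages starting at the given offset."""
        return self._logs[topic][offset:offset + max_messages]


class ToyConsumer:
    """Consumer that pulls from the broker and remembers its own offset."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.broker.poll(self.topic, self.offset, max_messages)
        self.offset += len(batch)  # advance only past what was read
        return batch


broker = ToyBroker()
for view in ["/home", "/search?q=kafka", "/jobs"]:
    broker.publish("page-views", view)

a = ToyConsumer(broker, "page-views")
b = ToyConsumer(broker, "page-views")
first_batch = a.poll(max_messages=2)  # consumer a reads two messages
rest = a.poll()                       # then the remaining one
all_for_b = b.poll()                  # consumer b reads independently from offset 0
```

Because messages stay in the log and each consumer holds its own offset, several consumers can read the same topic at their own pace, which is the property the sketch is meant to show.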
Apache ActiveMQ
• http://activemq.apache.org/
• Apache ActiveMQ is a very popular publish-subscribe
message broker.
• Mainly supports the JMS 1.1 standard. JMS is an API,
not a wire protocol.
• Has its own wire protocol called OpenWire and also supports
standard protocols such as Stomp and MQTT.
• ActiveMQ supports clustered brokers and networked
brokers for scalability and fault tolerance.
• Producers and consumers can be written in different
programming languages.
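To make the distinction between an API (JMS) and a wire protocol concrete, the sketch below builds the bytes of a Stomp SEND frame by hand: a command line, header lines, a blank line, the body, and a terminating NUL byte. The helper names and the destination `/queue/demo` are hypothetical, and header escaping and `content-length` handling are deliberately left out.

```python
def stomp_frame(command, headers, body=""):
    """Encode a frame: command, header lines, blank line, body, NUL byte."""
    lines = [command]
    lines += [f"{key}:{value}" for key, value in headers.items()]
    head = "\n".join(lines) + "\n\n"
    return head.encode("utf-8") + body.encode("utf-8") + b"\x00"


def parse_stomp_frame(data):
    """Decode a frame produced by stomp_frame back into its three parts."""
    head, _, body = data.rstrip(b"\x00").partition(b"\n\n")
    command, *header_lines = head.decode("utf-8").split("\n")
    headers = dict(line.split(":", 1) for line in header_lines)
    return command, headers, body.decode("utf-8")


frame = stomp_frame(
    "SEND",
    {"destination": "/queue/demo", "content-type": "text/plain"},
    "hello",
)
command, headers, body = parse_stomp_frame(frame)
```

Any client that can write these bytes to a socket can talk to a Stomp-speaking broker, which is why producers and consumers need not share a programming language.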
RabbitMQ
• http://www.rabbitmq.com/
• RabbitMQ is an open source messaging framework under the
Mozilla Public License.
• Developed to support the AMQP open messaging standard and
has support available for other protocols like MQTT.
• Written in the Erlang programming language and advertises
high throughput and low latency.
• RabbitMQ supports clustering for fault tolerance and scalability.
• Any AMQP-compliant client can publish messages to RabbitMQ as
well as consume messages.
• Python interface py-librabbitmq used by Celery
Apache Qpid
• https://qpid.apache.org/
• Qpid is the Apache project implementing the
AMQP protocol.
• The broker is mainly written in Java and has
clients written in different programming
languages
• Supports clustering for high availability
Kestrel
• http://robey.github.io/kestrel/
• Kestrel is a simple distributed message queue.
• The nodes in a Kestrel cluster do not communicate
with each other, resulting in loosely ordered queues
across the cluster.
• Because of this simple design, Kestrel can scale to
thousands of nodes.
• The project was developed at Twitter and is available on
GitHub.
• Kestrel supports the memcache protocol and a Thrift-based
protocol for sending and receiving messages.
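The "loosely ordered" behaviour follows directly from nodes not coordinating. The toy simulation below is a sketch under simplifying assumptions, not Kestrel's implementation: `ToyCluster` is a hypothetical class in which clients pick a node at random for each put and each get, every node keeps its own FIFO queue, and the random seed is fixed only to make the run repeatable.

```python
import random
from collections import deque


class ToyCluster:
    """Hypothetical cluster of independent FIFO queues with no coordination."""

    def __init__(self, node_count, seed=0):
        self.nodes = [deque() for _ in range(node_count)]
        self.rng = random.Random(seed)

    def put(self, item):
        """Enqueue on a randomly chosen node and return that node's index."""
        index = self.rng.randrange(len(self.nodes))
        self.nodes[index].append(item)
        return index

    def get(self):
        """Dequeue from a randomly chosen non-empty node, or None if all are empty."""
        non_empty = [node for node in self.nodes if node]
        if not non_empty:
            return None
        return self.rng.choice(non_empty).popleft()


cluster = ToyCluster(node_count=3)
placed = {i: cluster.put(i) for i in range(10)}  # message -> node it landed on

received = []
while (item := cluster.get()) is not None:
    received.append(item)
```

Every message is delivered once and each node preserves its own FIFO order, but the order seen across the whole cluster may differ from the order of publication. Giving up global ordering is what lets nodes be added without any inter-node protocol.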
ZeroMQ
• http://zeromq.org/
• ZeroMQ is an embeddable library for creating custom
messaging solutions for applications.
• Provides sockets that can be used for inter-process,
intra-process, TCP and multicast messaging.
• The sockets can be connected 1 to 1 or N to N with
patterns like fan-out, pub-sub etc.
• The library is asynchronous in nature and provides very
efficient and fast communication channels to
applications.
• Primarily developed in C, but the library can be used
from different programming languages like Java, PHP
etc.
Netty
• http://netty.io/
• Netty is an NIO-based Java framework which enables
easy development of high performance network
applications such as protocol servers and clients.
• Netty was developed at Red Hat JBoss and is now
available on GitHub under the Apache License, version
2.0.
• The library provides out of the box support for popular
application protocols like HTTP.
• The library can be used to build custom transport
protocols using TCP or UDP.
NaradaBrokering (NB)
• http://www.naradabrokering.org/
• Development stopped in 2009 and its ideas were
further developed into Granules (layer 14B)
• For 10 years NB was state of the art in exploring publish-subscribe
systems and their use in collaboration and computing, but now
other systems have caught up, as described in this slide deck
• NB supported security, robustness (fault tolerance), quality of
service (such as variation in arrival time, synchronization of
streams, delivery order), Web Services, multiple transport
protocols, high performance, a distributed efficient set of multiple
cooperating brokers, and sophisticated topic specification and search
Public Cloud Pub/Sub
• Google Cloud Pub/Sub
https://developers.google.com/pubsub is a publish-subscribe
messaging system offered by Google as a cloud service
– Supports many-to-many, one-to-many and many-to-one
communications.
– Publishers can use the HTTP API for sending data.
– Subscribers can use a pull-based API or a push-based API for
receiving data.
– The service is available as a developer preview and free of charge.
– Part of Google Cloud Dataflow, which also has FlumeJava and
Google MillWheel
• See Simple Notification Service (Amazon SNS)
http://aws.amazon.com/sns/ for the Amazon equivalent
• See Azure Queues and Service Bus Queues (advanced
functionality) http://msdn.microsoft.com/en-us/library/azure/hh767287.aspx for the Azure equivalent
• See Azure Event Hubs later
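The difference between pull-based and push-based subscribers mentioned above can be illustrated with a small in-memory sketch. It is not the Google Cloud Pub/Sub API: `ToyTopic` and the subscription names are hypothetical, a pull subscription is modelled as a queue the subscriber drains on request, and a push subscription as a callback invoked at publish time.

```python
from collections import deque


class ToyTopic:
    """Hypothetical topic with both pull and push subscriptions."""

    def __init__(self):
        self._pull_queues = {}     # subscription name -> pending messages
        self._push_callbacks = {}  # subscription name -> callable

    def subscribe_pull(self, name):
        self._pull_queues[name] = deque()

    def subscribe_push(self, name, callback):
        self._push_callbacks[name] = callback

    def publish(self, message):
        """Fan the message out to every subscription (one to many)."""
        for queue in self._pull_queues.values():
            queue.append(message)   # held until the subscriber asks
        for callback in self._push_callbacks.values():
            callback(message)       # delivered immediately

    def pull(self, name, max_messages=10):
        queue = self._pull_queues[name]
        count = min(max_messages, len(queue))
        return [queue.popleft() for _ in range(count)]


topic = ToyTopic()
pushed = []
topic.subscribe_push("alerts", pushed.append)
topic.subscribe_pull("archive")

# Two independent publishers writing to the same topic (many to many).
topic.publish("sensor-1: 20.5")
topic.publish("sensor-2: 19.8")

pulled = topic.pull("archive")
```

A pull subscriber controls its own pace at the cost of polling, while a push subscriber gets lower latency but must be reachable and able to keep up.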
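Comparison of Messaging System Features
• ~2010 comparison: important brokers have changed since then
• Systems compared: Amazon Simple Queue, Azure Queue, ActiveMQ, MuleMQ, Websphere MQ, Narada Brokering
• AMQP compliant
– Amazon Simple Queue, Azure Queue, MuleMQ, Websphere MQ, Narada Brokering: No
– ActiveMQ: No, uses OpenWire and Stomp
• JMS compliant
– Amazon Simple Queue, Azure Queue: No
– ActiveMQ, MuleMQ, Websphere MQ, Narada Brokering: Yes
• Distributed broker
– Amazon Simple Queue, Azure Queue: No
– ActiveMQ, MuleMQ, Websphere MQ, Narada Brokering: Yes
• Delivery guarantees
– Amazon Simple Queue: Message retained in queue for 4 days
– Azure Queue: Message accessible for 7 days
– ActiveMQ: Based on journaling and JDBC drivers to databases
– MuleMQ: Disk store uses 1 file/channel; TTL purges messages
– Websphere MQ: Exactly-once delivery supported
– Narada Brokering: Guaranteed and exactly-once
• Ordering guarantees
– Amazon Simple Queue: Best effort, once delivery, duplicate messages exist
– Azure Queue: No ordering; message returns more than once
– ActiveMQ: Publisher order guarantee
– MuleMQ: Not clear
– Websphere MQ: Publisher order guarantee
– Narada Brokering: Publisher- or time-order by Network Time Protocol
• Access model
– Amazon Simple Queue: SOAP, HTTP-based GET/POST
– Azure Queue: HTTP REST interfaces
– ActiveMQ: Using JMS classes
– MuleMQ: JMS, Adm. API, and JNDI
– Websphere MQ: Message Queue Interface, JMS
– Narada Brokering: JMS, WS-Eventing
• Max. message size
– Amazon Simple Queue, Azure Queue: 8 KB
– ActiveMQ, MuleMQ, Websphere MQ, Narada Brokering: NA
• Buffering
– Amazon Simple Queue: NA
– Azure Queue, ActiveMQ, MuleMQ, Websphere MQ, Narada Brokering: Yes
• Time decoupled delivery
– Amazon Simple Queue: Up to 4 days. Supports timeouts.
– Azure Queue: Up to a max. of 7 days
– ActiveMQ, MuleMQ, Websphere MQ, Narada Brokering: Yes
• Security scheme
– Amazon Simple Queue: Based on HMAC-SHA1 signature. Support for WS-Security 1.0.
– Azure Queue: Access to queues by HMAC SHA256 signature
– ActiveMQ: Authorization based on JAAS for authentication
– MuleMQ: Access control, authentication, SSL for communication
– Websphere MQ: SSL, end-to-end application level data security
– Narada Brokering: SSL, end-to-end application level data security, and ACLs
• Support for Web Services
– Amazon Simple Queue: SOAP based interactions
– Azure Queue: REST interfaces
– ActiveMQ: REST
– MuleMQ: REST
– Websphere MQ: REST, SOAP interactions
– Narada Brokering: WS-Eventing
• Transports
– Amazon Simple Queue: HTTP/HTTPS, SSL
– Azure Queue: HTTP/HTTPS
– ActiveMQ: TCP, UDP, SSL, HTTP/S, Multicast, in-VM, JXTA
– MuleMQ: Mule ESB supports TCP, UDP, RMI, SSL, SMTP and FTP
– Websphere MQ: TCP, UDP, Multicast, SSL, HTTP/S
– Narada Brokering: TCP, Parallel TCP, UDP, Multicast, SSL, HTTP/S, IPSec
• Subscription formats
– Amazon Simple Queue: Access is to individual queues
– Azure Queue: Access is to individual queues
– ActiveMQ, MuleMQ, Websphere MQ: JMS spec allows for SQL selectors. Also access to individual queues.
– Narada Brokering: SQL Selectors, Regular expressions, <tag, value> pairs, XQuery and XPath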
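Several of the security schemes in the comparison rest on HMAC signatures: the client and the service share a secret key, the client signs a canonical description of the request, and the service recomputes the signature to check it. The sketch below shows the general idea with Python's standard library. The key, the string-to-sign layout and the helper names are assumptions for illustration; each cloud service defines its own canonical string and header format.

```python
import base64
import hashlib
import hmac


def sign(secret_key: bytes, string_to_sign: str) -> str:
    """Return the base64-encoded HMAC-SHA256 of the request description."""
    digest = hmac.new(secret_key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def verify(secret_key: bytes, string_to_sign: str, signature: str) -> bool:
    """Recompute the signature server-side and compare in constant time."""
    return hmac.compare_digest(sign(secret_key, string_to_sign), signature)


secret = b"hypothetical-shared-secret"
# Hypothetical canonical form: method, resource and date, one per line.
request = "GET\n/myqueue/messages\nThu, 05 Mar 2015 12:00:00 GMT"

signature = sign(secret, request)
accepted = verify(secret, request, signature)
tampered = verify(secret, request.replace("GET", "DELETE"), signature)
```

The secret itself never travels with the request, and any change to the signed fields invalidates the signature.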
Azure Event Hubs I
• This is used by Azure Stream Analytics
• It appears to be built on the Azure Service Bus
publish-subscribe messaging system
• It has the usual features of supporting input and
output to Stream Analytics with multiple
subscribers
Azure Event Hubs II
• These figures show that a given event hub
can support multiple stream hosts
• Also, the event hub brokers are auto-scaled in
response to increased load
Amazon Lambda
• Event triggered computing
http://aws.amazon.com/lambda/
• AWS Lambda can automatically run code in response to
modifications to objects in Amazon S3 buckets,
messages arriving in Amazon Kinesis streams, or table
updates in Amazon DynamoDB.
• At launch AWS Lambda supports user defined functions
written in Node.js (JavaScript). Your code can include
existing Node.js libraries, even native ones.
• Lambda provides scaling, fault tolerance, security and
automated administration
• Could also, or instead, be placed in layer 14B as a stream
programming model
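Event-triggered computing can be pictured as registering functions against event sources and letting a runtime call them when events arrive. The sketch below is a toy dispatcher in Python, not the AWS Lambda API (which at launch took Node.js functions): `ToyEventRuntime`, the source name `object-created` and the event fields are all hypothetical.

```python
from collections import defaultdict


class ToyEventRuntime:
    """Hypothetical runtime that invokes registered functions per event source."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, source):
        """Decorator that registers a function for the given event source."""
        def register(func):
            self._handlers[source].append(func)
            return func
        return register

    def emit(self, source, event):
        """Run every handler registered for the source and collect the results."""
        return [handler(event) for handler in self._handlers[source]]


runtime = ToyEventRuntime()


@runtime.on("object-created")
def make_thumbnail_name(event):
    # User-defined function: runs only when a matching event arrives.
    return "thumb-" + event["key"]


results = runtime.emit("object-created", {"bucket": "photos", "key": "cat.jpg"})
ignored = runtime.emit("table-updated", {"table": "users"})
```

The user supplies only the function body; deciding when, where and how many copies to run is left to the runtime, which is where the scaling and administration features come from.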