Big Data Open Source Software
and Projects
ABDS in Summary XIV: Level 14B
I590 Data Science Curriculum
August 15 2014
Geoffrey Fox
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
Here are 17 functionalities. Technologies are presented in this order:
• 4 Cross cutting at top
• 13 in order of layered diagram starting at
Message Protocols
Distributed Coordination:
Security & Privacy:
IaaS Management from HPC to hypervisors:
File systems:
Cluster Resource Management:
Data Transport:
SQL / NoSQL / File management:
In-memory databases & caches / Object-relational mapping / Extraction Tools
Inter process communication Collectives, point-to-point, publish-subscribe
Basic Programming model and runtime, SPMD, Streaming, MapReduce, MPI:
High level Programming:
Application and Analytics:
Apache Storm
• https://storm.incubator.apache.org/
• Apache Storm is a distributed real time computation framework for
processing streaming data.
• Storm is being used to do real time analytics, online machine
learning, distributed RPC etc.
• Provides scalable, fault tolerant and guaranteed message processing.
• Trident is a high level API on top of Storm which provides functions
like stream joins, groupings and filters. Trident also provides
exactly-once processing guarantees.
• The project was originally developed at BackType, which Twitter
acquired in 2011. Twitter used it for processing Tweets from users,
and it was donated to ASF in 2013.
• Storm has been used in very large deployments in Fortune 500
companies like Twitter and Yahoo.
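The groupings mentioned above can be illustrated with a toy word count. This is a minimal sketch in plain Python, not the Storm or Trident API. Storm calls its stream sources spouts and its processing steps bolts; the function names below are hypothetical and only mirror that vocabulary. The point it shows is that grouping by a field routes equal values to the same task, so each task can count its words independently.

```python
from collections import Counter, defaultdict


def sentence_spout(sentences):
    """Hypothetical spout: emits one tuple (a sentence) at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Hypothetical bolt: splits each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def fields_grouping(stream, num_tasks):
    """Route each word to a task using a stable hash of the word,
    so that equal words always land in the same task."""
    tasks = defaultdict(list)
    for word in stream:
        tasks[sum(word.encode()) % num_tasks].append(word)
    return tasks


def count_bolt(words):
    """Hypothetical bolt: counts the words routed to one task."""
    return Counter(words)


def run_topology(sentences, num_tasks=2):
    """Wire spout -> split -> grouping -> count and merge the task results."""
    grouped = fields_grouping(split_bolt(sentence_spout(sentences)), num_tasks)
    total = Counter()
    for task_words in grouped.values():
        total.update(count_bolt(task_words))
    return dict(total)
```

In a real deployment the tasks would run on different machines and the stream would be unbounded; here everything runs in one process over a finite list.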
Apache Samza (LinkedIn)
• http://samza.incubator.apache.org/
• Similar to Apache Storm, Apache Samza is a distributed real
time computation framework for processing streaming data.
• Apache Samza is built on top of Apache Kafka and Apache
Yarn. Samza uses Kafka as its messaging layer and Yarn for
managing the cluster of nodes with Samza processes.
• Samza is scalable, fault tolerant and provides guaranteed
message processing.
• Samza was originally developed at LinkedIn and was donated
to ASF in 2013
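One way to picture how a log-based messaging layer supports guaranteed processing is a task that records its read position and resumes from it after a failure. The sketch below is plain Python under that assumption; it does not use the Samza or Kafka APIs, and `PartitionLog`, `Task`, and the `fail_at` parameter are hypothetical names for illustration only.

```python
class PartitionLog:
    """Hypothetical stand-in for one partition of a message log:
    an append-only list addressed by offset."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)

    def read_from(self, offset):
        """Return (offset, message) pairs starting at the given offset."""
        return list(enumerate(self.messages))[offset:]


class Task:
    """Hypothetical stream task that checkpoints its offset after each message."""

    def __init__(self, name, log, checkpoints):
        self.name = name
        self.log = log
        self.checkpoints = checkpoints  # shared dict that survives task restarts

    def run(self, process, fail_at=None):
        start = self.checkpoints.get(self.name, 0)
        for offset, message in self.log.read_from(start):
            if offset == fail_at:
                raise RuntimeError("simulated task failure")
            process(message)
            # Record progress only after the message has been processed.
            self.checkpoints[self.name] = offset + 1
```

Because the checkpoint is written after processing, a crash between the two steps would cause the message to be processed again on restart, so this pattern gives at-least-once rather than exactly-once delivery.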
Apache S4
• http://incubator.apache.org/s4/
• Apache S4 is a distributed real time computation framework
for processing unbounded streams of data.
• Unlike Storm and Samza, S4 provides a key-value based system
for processing data.
• The system is scalable, fault tolerant and provides guaranteed
message processing.
• S4 was originally developed at Yahoo and was donated to ASF
in 2011
• S4 is not as popular as Apache Storm.
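The key-value style of processing can be sketched as one small stateful object per key, with events routed to the object that owns their key. This is an illustration in plain Python, not the S4 API; `CountPE` and `KeyedDispatcher` are hypothetical names.

```python
class CountPE:
    """Hypothetical processing element that holds the state for one key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self, value):
        self.count += value


class KeyedDispatcher:
    """Creates one processing element per distinct key on first use
    and routes every later event with that key to the same element."""

    def __init__(self, pe_factory):
        self.pe_factory = pe_factory
        self.instances = {}

    def dispatch(self, key, value):
        pe = self.instances.get(key)
        if pe is None:
            pe = self.pe_factory(key)
            self.instances[key] = pe
        pe.on_event(value)
```

For example, dispatching `("a", 1)`, `("b", 2)`, `("a", 3)` creates two elements, and the one for key `"a"` ends with a count of 4.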
Databus (LinkedIn)
• Databus (open-sourced by LinkedIn in 2013) http://data.linkedin.com/projects/databus
• Databus provides a timeline-consistent stream of change capture events for a
database. It enables applications to watch a database, view and process updates in
near real-time.
• Databus provides a complete after-image of every new/changed record as well as
deletes, while maintaining timeline consistency and transactional boundaries.
• The application integration is decoupled from the source database, and each
application integration is isolated, which allows for parallel development and rapid
• Databus has a few key parts:
– a database connector to watch changes and maintain a clock or sequence value
– an in-memory relay that keeps recent changes for efficient retrieval
– a bootstrap service/database that enables long lookback queries (including from the
beginning of time)
– a client that provides a simple API to get changes since a point in time
• To use Databus, the consuming application simply maintains a high watermark, and
periodically requests all changes since that point in time using the Databus client.
Each consuming application maintains its own high watermark, which isolates the
consumers from one another.
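The high-watermark pattern described above can be sketched as follows. This is plain Python and not the Databus client API; `Relay`, `Consumer`, `capture`, and `changes_since` are hypothetical names, and the integer sequence number stands in for the clock or sequence value that the database connector maintains.

```python
class Relay:
    """Hypothetical in-memory relay that keeps change events in sequence order."""

    def __init__(self):
        self._events = []   # list of (sequence_number, change)
        self._next_seq = 1

    def capture(self, change):
        """Record one change and return the sequence number assigned to it."""
        seq = self._next_seq
        self._events.append((seq, change))
        self._next_seq += 1
        return seq

    def changes_since(self, seq):
        """Return all changes with a sequence number greater than seq."""
        return [(s, c) for s, c in self._events if s > seq]


class Consumer:
    """Hypothetical consuming application with its own high watermark."""

    def __init__(self, relay):
        self.relay = relay
        self.high_watermark = 0
        self.applied = []

    def poll(self):
        """Fetch and apply every change past the watermark, then advance it."""
        batch = self.relay.changes_since(self.high_watermark)
        for seq, change in batch:
            self.applied.append(change)
            self.high_watermark = seq
        return len(batch)
```

Two consumers attached to the same relay can poll at different rates: a slow consumer simply receives a larger batch on its next poll, and neither affects the other's watermark.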
Google MillWheel
• http://research.google.com/pubs/pub41378.html
• MillWheel is a distributed real time computation framework by Google.
• Provides scalable, fault tolerant and exactly once message
processing guarantees.
• The key data abstraction of MillWheel is the key-value pair, and
data is processed in a directed acyclic graph whose nodes are the
computations.
• The project is not open source and is planned to be available to
general public through Google Cloud platform as a SaaS.
• Similar functionality to Apache Storm
• Part of Google Cloud Dataflow, which also has Google Pub-Sub and
FlumeJava
• See Amazon Kinesis http://aws.amazon.com/kinesis/ which
combines Pub-Sub and Apache Storm capabilities
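The exactly-once guarantee noted for MillWheel can be pictured as a computation node that remembers which records it has already applied, so a record redelivered after a retry does not change per-key state twice. The sketch below is plain Python under that assumption and is not a Google API; `Computation`, `deliver`, and the record IDs are hypothetical.

```python
class Computation:
    """Hypothetical graph node holding per-key state and a set of applied record IDs."""

    def __init__(self, update):
        self.update = update   # function (old_state, value) -> new_state
        self.state = {}        # per-key state
        self.seen = set()      # IDs of records already applied

    def deliver(self, record_id, key, value):
        """Apply a record once; return False if it was a duplicate delivery."""
        if record_id in self.seen:
            return False
        self.state[key] = self.update(self.state.get(key, 0), value)
        self.seen.add(record_id)
        return True
```

For example, delivering record `"r1"` for key `"clicks"` with value 5 twice leaves the state at 5, because the second delivery is recognised as a duplicate and ignored. A production system would also need to persist the state and the set of seen IDs together and eventually discard old IDs, which this sketch omits.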
