Flint: Making Sparks (and Sharks and HDFSs too!)
Jim Donahue | Principal Scientist Adobe Systems Technology Lab
© 2013 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Flint: Bring BDAS to the AWS Masses @ Adobe

How to effectively evangelize BDAS @ Adobe?

Looking for intrepid, curious users who want to experiment

Curiosity is always tempered by cost of startup

Most of the data for experimental applications likely in AWS
Flint: Design Principles

• Shared Nothing
    • Get your own AWS account and go
• Simple Configuration
    • Write a little JSON, run a couple of scripts
    • As simple or complex as you want/need
• Full access to tools
    • Batch, Spark/Shark shells, Shark Server, web UIs, …
    • Access to all the Spark/Shark tuning parameters
• Efficient, flexible scaling
• Tuned to Adobe environment
    • Very simple hardwired “spark-env.sh”
    • Port choices determined by our firewall
Flint: Architecture
• Local Spark/Shark, slaves can use S3 storage for files
• Remote Access runs shells on SSH Server
• Components use S3, SimpleDB for state management
• Flint distributes shared AWS credentials among components
• Flint manages master, SSHServer startup
• Slave elasticity managed by master, can leverage spot pricing
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, Spark Slave(s), SSHServer (Shells), S3, SimpleDB]
Flint: Setup
• Flint instance manages encrypted AWS credentials
• Create S3 buckets to hold JAR files
• Create SimpleDB tables to hold state
• Create key pair, security group for instances
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, S3, SimpleDB]
Flint: Provisioning
• Define clusters through JSON spec (“master instance configuration is x, slave instance configuration is y, scaling rule is …”)
• Define configurations through JSON spec (“spark master uses AMI x, running service y, with properties a, b, …”) and JAR file containing services code
• “Getting started” set of clusters, configurations provided
• AMI provided with all the requisite Spark / Shark / Hadoop / Kafka bits
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, S3, SimpleDB]
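The two kinds of JSON spec on the Provisioning slide could be sketched roughly as follows. This is an illustration only: the deck does not show Flint's actual schema, so every field name here (`masterConfiguration`, `slaveConfiguration`, `scalingRule`, `ami`, `service`, `properties`) and both helper functions are hypothetical, and the values x, y, z, a, b are the slide's own placeholders rather than real settings.

```python
import json

# Hypothetical cluster spec: names a master configuration, a slave
# configuration, and a scaling rule. The deck elides what a scaling rule
# looks like, so it is left as the opaque placeholder "z".
CLUSTER_SPEC = """
{
  "name": "demo-cluster",
  "masterConfiguration": "x",
  "slaveConfiguration": "y",
  "scalingRule": "z"
}
"""

# Hypothetical configuration spec: AMI, service to run, and its properties.
CONFIGURATION_SPEC = """
{
  "name": "x",
  "ami": "x",
  "service": "y",
  "properties": {"a": "value-a", "b": "value-b"}
}
"""

REQUIRED_CLUSTER_KEYS = {"name", "masterConfiguration", "slaveConfiguration", "scalingRule"}
REQUIRED_CONFIG_KEYS = {"name", "ami", "service", "properties"}


def load_spec(text, required_keys):
    """Parse a JSON spec and raise ValueError if a required key is missing."""
    spec = json.loads(text)
    missing = required_keys - set(spec)
    if missing:
        raise ValueError("spec is missing keys: " + ", ".join(sorted(missing)))
    return spec


def resolve_master(cluster, configurations):
    """Look up the configuration that a cluster names for its master instance."""
    by_name = {config["name"]: config for config in configurations}
    return by_name[cluster["masterConfiguration"]]
```

Keeping the cluster spec and the configuration spec separate, as the slide does, would let several clusters reuse one configuration by name.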
Flint: Cluster Start
• Local Flint instance launches “master” instance (using cluster definition in SimpleDB)
• Master reads SimpleDB and S3 for configuration and code, installs master services
• Starting services launches Spark and/or HDFS masters through command line
• Master puts “connect URL” in SimpleDB
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, S3, SimpleDB]
Flint: Slave(s) Start
• Master “scaling service” launches slave instance(s)
• Slave reads SimpleDB and S3 for configuration and code, installs worker services
• Slave gets master “connect URL” from SimpleDB
• Slave launches Spark and/or HDFS workers through command line
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, Spark Slave(s), S3, SimpleDB]
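The “connect URL” handoff between master and slave can be sketched as a rendezvous through a shared key-value store. This is a minimal sketch under stated assumptions, not Flint's code: `StateStore` is an in-memory stand-in for the SimpleDB state tables, the function names and the `connectUrl` key are hypothetical, polling is only one plausible way for a slave to wait, and port 7077 is merely Spark's standalone default (the deck notes that real port choices are set by the firewall).

```python
import time


class StateStore:
    """In-memory stand-in for the SimpleDB state tables (illustration only)."""

    def __init__(self):
        self._items = {}

    def put(self, cluster, key, value):
        self._items[(cluster, key)] = value

    def get(self, cluster, key):
        return self._items.get((cluster, key))


def publish_connect_url(store, cluster, host, port=7077):
    """Master side: record where slaves and clients should connect."""
    url = "spark://{}:{}".format(host, port)
    store.put(cluster, "connectUrl", url)
    return url


def wait_for_connect_url(store, cluster, attempts=5, delay_seconds=0.0):
    """Slave side: poll the store until the master has published its URL."""
    for _ in range(attempts):
        url = store.get(cluster, "connectUrl")
        if url is not None:
            return url
        time.sleep(delay_seconds)
    raise TimeoutError("master for {} has not published a connect URL".format(cluster))
```

Because the only coupling is the stored URL, the same lookup would also serve the client instance, which the deck says reads the master connect URL from SimpleDB.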
Flint: Client Start
• Flint instance launches “client” instance (using cluster definition in SimpleDB)
• Client reads SimpleDB and S3 for configuration and code, installs (SSHServer) services
• Client reads SimpleDB for authentication info, master connect URL
• Service startup starts SSHServer connected to right “shell factory”
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, Spark Slave(s), SSHServer (Shells), S3, SimpleDB]
Flint: Client Connect (Remote Shells)
• Flint server finds “appropriate client”
• SSH client launched to connect
• SSHServer connects to master on client’s behalf
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, Spark Slave(s), SSHServer (Shells), S3, SimpleDB]
Flint: Client Asynchronous Requests



Flint clients can also make asynchronous requests

Each Flint master runs a service that pulls requests from an SQS queue

Request progress/results stored in SimpleDB (SDB)
Requests include:

Move data between HDFS and S3

Mount EBS volume and cache in HDFS (AWS public data sets)

Run batch job
Client can make request even if cluster not alive

Simplifies startup sequencing

Can use monitoring of “cluster queues” to start cluster “on demand”
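The asynchronous request flow above can be sketched as a queue plus a status table. This is a sketch under stated assumptions: `RequestQueue` is an in-memory stand-in for the per-cluster SQS queue, a plain dict stands in for the SimpleDB status records, and the names `submit`, `drain`, `needs_cluster`, and the request fields are all hypothetical rather than Flint's API.

```python
from collections import deque


class RequestQueue:
    """In-memory stand-in for a per-cluster SQS queue (illustration only)."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None

    def __len__(self):
        return len(self._messages)


def submit(queue, status, request_id, kind, **params):
    """Client side: enqueue a request; works whether or not a master is running."""
    queue.send({"id": request_id, "kind": kind, "params": params})
    status[request_id] = {"state": "queued", "result": None}


def drain(queue, status, handlers):
    """Master side: pull requests one at a time, recording progress and results."""
    while True:
        request = queue.receive()
        if request is None:
            return
        entry = status[request["id"]]
        entry["state"] = "running"
        try:
            entry["result"] = handlers[request["kind"]](**request["params"])
            entry["state"] = "done"
        except Exception as error:
            # Unknown request kinds and handler errors both end up here.
            entry["state"] = "failed"
            entry["result"] = str(error)


def needs_cluster(queue, cluster_alive):
    """On-demand start check: there is pending work and no live cluster."""
    return len(queue) > 0 and not cluster_alive
```

Since `submit` touches only the queue and the status table, a request can sit there until a master comes up, which is what makes the “on demand” start in the last bullet possible.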
Flint: Where We Are Now

Have some intrepid, curious users

The big issue is always “Do I really want to use Spark/Shark?”

SQL is a big selling point

Scala is a mild put-off

Spark Streaming may help settle the issue

Open Sourcing is under discussion

If you’re interested, let me know! 