Flint: Making Sparks (and Sharks and HDFSs too!)
Jim Donahue | Principal Scientist
Adobe Systems Technology Lab
© 2013 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Flint: Bring BDAS to the AWS Masses @ Adobe
• How to effectively evangelize BDAS @ Adobe?
• Looking for intrepid, curious users who want to experiment
• Curiosity is always tempered by cost of startup
• Most of the data for experimental applications likely in AWS

Flint: Design Principles
• Shared Nothing
• Simple Configuration
• Batch, Spark/Shark shells, Shark Server, web UIs, …
• Access to all the Spark/Shark tuning parameters
• As simple or complex as you want/need
• Full access to tools
• Write a little JSON, run a couple of scripts
• Efficient, flexible scaling
• Get your own AWS account and go
• Very simple hardwired “spark-env.sh”
• Tuned to Adobe environment
• Port choices determined by our firewall

Flint: Architecture
• Local Spark/Shark, Slaves can use S3 storage for files
• Remote Access runs shells on SSH Server
• Components use S3, SimpleDB for state management
• Flint distributes shared AWS credentials among components
• Flint manages master, SSHServer startup
• Slave elasticity managed by master, can leverage spot pricing
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, Spark Slave(s), SSHServer (Shells), S3, SimpleDB]

Flint: Setup
• Flint instance manages encrypted AWS credentials
• Create S3 buckets to hold JAR files
• Create SimpleDB tables to hold state
• Create key pair, security group for instances
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, S3, SimpleDB]
Flint: Provisioning
• Define clusters through JSON spec (“master instance configuration is x, slave instance configuration is y, scaling rule is …”)
• Define configurations through JSON spec (“spark master uses AMI x, running service y, with properties a, b, …”) and JAR file containing services code
• “Getting started” set of clusters, configurations provided
• AMI provided with all the requisite Spark / Shark / Hadoop / Kafka bits
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, S3, SimpleDB]

Flint: Cluster Start
• Local Flint Instance launches “master” instance (using cluster definition in SimpleDB)
• Master reads SimpleDB and S3 for configuration and code, installs master services
• Starting services launches Spark and/or HDFS masters through command line
• Master puts “connect URL” in SimpleDB
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, S3, SimpleDB]

Flint: Slave(s) Start
• Master “scaling service” launches slave instance(s)
• Slave reads SimpleDB and S3 for configuration and code, installs worker services
• Slave gets master “connect URL” from SimpleDB
• Slave launches Spark and/or HDFS workers through command line
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, Spark Slave(s), S3, SimpleDB]

Flint: Client Start
• Flint instance launches “client” instance (using cluster definition in SimpleDB)
• Client reads SimpleDB and S3 for configuration and code, installs (SSHServer) services
• Client reads SimpleDB for authentication info, master connect URL
• Service startup starts SSHServer connected to right “shell factory”
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, Spark Slave(s), SSHServer (Shells), S3, SimpleDB]
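The provisioning and start slides share one pattern: a JSON spec defines the cluster, the master publishes its “connect URL” to shared state, and slaves and clients read it back. The sketch below illustrates that pattern only and is not Flint's code. The spec field names, the `StateStore` class (an in-memory stand-in for SimpleDB), and all function names are assumptions made for illustration; the slides do not show Flint's actual schema or API. The `spark://host:7077` form is Spark's standard standalone master URL with its default port, whereas the slides say Flint's port choices were set by the firewall.

```python
import json

# Hypothetical cluster spec. Every field name is an assumption for
# illustration; the slides do not show Flint's real JSON schema.
CLUSTER_SPEC = json.loads("""
{
  "name": "demo",
  "master": {"configuration": "spark-master"},
  "slave": {"configuration": "spark-slave"},
  "scaling": {"initial_slaves": 2}
}
""")


class StateStore:
    """In-memory stand-in for the SimpleDB tables that hold cluster state."""

    def __init__(self):
        self._items = {}

    def put(self, domain, key, value):
        self._items[(domain, key)] = value

    def get(self, domain, key):
        # Returns None when nothing has been stored under this key.
        return self._items.get((domain, key))


def start_master(store, spec, host):
    """Master start: publish the connect URL once services are up."""
    # 7077 is Spark's default standalone master port, used here as a placeholder.
    url = f"spark://{host}:7077"
    store.put(spec["name"], "connect_url", url)
    return url


def start_slave(store, spec, slave_id):
    """Slave start: look up the master's connect URL and attach to it."""
    url = store.get(spec["name"], "connect_url")
    if url is None:
        raise RuntimeError("master has not published a connect URL yet")
    return f"{slave_id} -> {url}"


def start_cluster(store, spec, master_host):
    """Start the master, then the number of slaves named in the spec."""
    start_master(store, spec, master_host)
    count = spec["scaling"]["initial_slaves"]
    return [start_slave(store, spec, f"slave-{i}") for i in range(count)]


if __name__ == "__main__":
    for link in start_cluster(StateStore(), CLUSTER_SPEC, "10.0.0.5"):
        print(link)
```

Because the only coupling between instances is the shared store, a slave or client needs no direct knowledge of the master beyond the cluster name, which is consistent with the “shared nothing” principle stated earlier in the deck.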
Flint: Client Connect (Remote Shells)
• Flint server finds “appropriate client”
• SSH client launched to connect
• SSHServer connects to master on client’s behalf
[Diagram: Local Spark/Shark, Cluster Setup, Remote Access, Local Flint Server, Spark Master, Spark Slave(s), SSHServer (Shells), S3, SimpleDB]

Flint: Client Asynchronous Requests
• Flint clients can also make asynchronous requests
  • Each Flint master runs service that pulls request from SQS queue
  • Request progress/results stored in SDB
  • Requests include:
    • Move data between HDFS and S3
    • Mount EBS volume and cache in HDFS (AWS public data sets)
    • Run batch job
• Client can make request even if cluster not alive
  • Simplifies startup sequencing
  • Can use monitoring of “cluster queues” to start cluster “on demand”

Flint: Where We Are Now
• Have some intrepid, curious users
• The big issue is always “Do I really want to use Spark/Shark?”
  • SQL is a big selling point
  • Scala is a mild put-off
  • Spark Streaming may help settle the issue
• Open Sourcing is under discussion
  • If you’re interested, let me know!
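As a supplementary sketch for the “Client Asynchronous Requests” slide: the flow described there (client enqueues a request, master pulls it, progress and results land in a state table, and a non-empty queue can trigger cluster start) can be illustrated as follows. This is not Flint's implementation. `RequestQueue` is an in-memory stand-in for an SQS queue, a plain dict stands in for SDB, and the request kinds, field names, and handlers are hypothetical and only simulate the work.

```python
from collections import deque


class RequestQueue:
    """In-memory stand-in for a cluster's SQS queue."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None

    def __len__(self):
        return len(self._messages)


def submit(queue, results, request_id, kind, **params):
    """Client side: enqueue a request and record it as pending.

    Works whether or not the cluster is running, since only the queue
    and the results table are touched.
    """
    queue.send({"id": request_id, "kind": kind, "params": params})
    results[request_id] = {"status": "pending"}


def should_start_cluster(queue, cluster_alive):
    """Queue monitoring: start on demand when work is waiting."""
    return not cluster_alive and len(queue) > 0


# Hypothetical handlers that only describe the work instead of doing it.
HANDLERS = {
    "copy": lambda p: f"copied {p['src']} to {p['dst']}",
    "batch": lambda p: f"ran {p['job']}",
}


def drain(queue, results):
    """Master side: pull requests until the queue is empty, storing outcomes."""
    while (message := queue.receive()) is not None:
        handler = HANDLERS.get(message["kind"])
        if handler is None:
            results[message["id"]] = {"status": "failed", "detail": "unknown kind"}
        else:
            detail = handler(message["params"])
            results[message["id"]] = {"status": "done", "detail": detail}


if __name__ == "__main__":
    queue, results = RequestQueue(), {}
    submit(queue, results, "r1", "copy", src="s3://bucket/in", dst="hdfs:///in")
    if should_start_cluster(queue, cluster_alive=False):
        drain(queue, results)  # stands in for the master coming up
    print(results["r1"])
```

Decoupling submission from execution is what lets a client make a request before the cluster exists, which is the startup-sequencing benefit the slide calls out.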