Syngenta’s Predictive Analytics Platform for Seeds R&D
A journey from on-premise Hadoop to AWS’s Big Data Serverless Analytics stack
Amazon Athena and AWS Glue Summit – Boston
October 9, 2018
Michael Swanson, Domain Architect - Insights and Decisions, Syngenta
Classification: PUBLIC ©2018 Syngenta. The Alliance Frame and Syngenta logo are registered trademarks of a Syngenta Group Company. All other trademarks are the property of their respective owners.

Outline
- CHALLENGE
- OPTIONS AND EVALUATION
- BUILD-OUT

Find better ways to feed the world
- Every day the world’s population increases by 200,000
- 870 million people go to bed hungry
- Agriculture produces 30% of greenhouse gases and 40% of soil is degraded
- By 2050 4 billion people will live with water scarcity – agriculture uses 70% of fresh water

About Syngenta
Agribusiness: #1 Crop Protection and #3 Seeds

Seeds R&D Challenge
- LONG PRODUCT DEVELOPMENT LIFECYCLE: 7+ years
- SHORT DECISION CYCLES: Several short 6-week windows to make critical decisions
- BIG AND COMPLEX: Volume (GB to PB), velocity (30% YTD), tools (100+)
- HIGH STAKES: A mistake in year 7 is 900X more costly than in year 1
- MISSION: Build world-class analytics platform for Syngenta Seeds R&D over next 3 years

Increase Analytics Capability With Flat IT Budget
BEFORE
- Legacy RDBMS warehouse, replication, and ETL; ad-hoc per-app ETL
- On-prem Hadoop compute cluster with high operational cost that is 92% idle
- Repetitive calculations performed in analytics tools and shared through non-scalable, application-specific APIs
AFTER
- Store once, transform as needed: object store and managed ETL service
- On-demand, scalable compute at a low and transparent cost per analytics operation
- Self-service bulk data access, BYO-tool capability, and business-focused APIs

Design Considerations
- DEMAND: Must be resilient and scalable during several 6-week periods
- ACCEPT NON-REALTIME: Max 8-hour data refresh via mini-batch until source systems are rearchitected
- COST: EMR startup delay is acceptable as calculations take 30 minutes to 1 day

Solution Options
OVER 2 MONTHS, WE EVALUATED AND CREATED POCs ON FOUR DIFFERENT PLATFORMS
- Expand on-prem Hadoop cluster
- Move to National Center for Supercomputing Applications (NCSA)
- Lift-and-shift to Cloudera Altus in AWS
- Refactor on AWS using S3, EMR, Glue, Athena

Evaluation Criteria
- Total cost of ownership: total cost for platform (infrastructure, usage, licensing and support)
- Support and recovery: infrastructure support hours and recovery times for platform components
- Dynamic scalability: ability to add/remove compute/storage to support variable demand
- Cost to re-engineer: cost to convert existing codebase and processes
- Network implications: effect on network due to data movement to disparate locations
- Data integration performance: data integration performance to support analytics tools and processes
- Security: ability to implement access control across data and tools
- Support for currently used tools: Sqoop, Spark, Hive, Pig
- Global performance: performance in NA, LATAM, EMEA, APAC
- Training: training and design support
Legend: critical criteria were marked on the slide.

Evaluation Scoring
Scoring chart rows:
- Total cost of ownership
- Support hours and recovery times
- Dynamic scalability
- Cost and time to convert
- Network implications
- Data integration performance
- Security
- Support for currently used tools
- Global performance
- Training

Build-out For A Breeding Predictive Analytics Tool*
1. Ingest from RDS to S3
2. Crawl S3 /ingest
3. Transform to ORC & publish
4. Crawl S3 /pub
5. Access through Athena
* Many steps left out for clarity

What is PARENT/PROGENY SELECTION?
Genomic best linear unbiased prediction (GBLUP) is an algorithm that uses genomic, phenotypic and environment data to predict the best parents to breed.

Syngenta Seeds R&D Analytical Platform - Parent and Progeny Prediction Use-case
Diagram components: RDS, EC2, EMR, Airflow, S3, Glue, Glue Data Catalog, Athena, Predictive Analytics Tools

Ingest from RDS to S3
● Full extracts from 8 source systems to S3 using a custom Java importer on EC2
● Why wasn’t Sqoop used for incremental extracts?
  - Limited by source system architecture: missing timestamps, partitioning anti-patterns…
● Why not DMS?
  - Extracts can fail on delimiters, quoted strings, partitions in source systems
● Java importer allowed control for:
  - Parallelization by partitioning
  - Filtering of non-ASCII characters
● Lessons learned:
  - Excessive parallelization can cause source system strain
Airflow Ingest Workflow (diagram): RDS (Oracle, Postgres, SQL Server) → x6 Java on EC2 → S3 /ingest

S3 Folder Structure
● ENVIRONMENT SEPARATION – prod, stg, dev, test, and developer-specific folders
● DATA SEPARATION – Ingest, work, and publish prefixes avoid concurrency issues and future-proof the design
● TIME-BASED PREFIXES – Data is loaded into date-specific prefixes that act as RUN_ID. This future-proofs the design for when incremental extracts are implemented
● SEPARATE ACCOUNTS – Prod and non-prod accounts protect against accidental deletes
● TEST ENVIRONMENT – Controlled data set for continuous system functionality testing in continuous integration
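
The folder conventions above can be illustrated with a small helper that builds an S3 key prefix from the environment, the data stage, and a date-based RUN_ID. This is only a sketch: the segment order, the stage names (`ingest`, `work`, `pub`), the date format, and the `source` and `table` segments are assumptions made for illustration, not the layout Syngenta actually used.

```python
from datetime import date

# Assumed names; the slide lists prod, stg, dev, test plus developer-specific folders.
ENVIRONMENTS = {"prod", "stg", "dev", "test"}
# Assumed spellings for the ingest / work / publish prefixes.
STAGES = {"ingest", "work", "pub"}


def run_id(run_date: date) -> str:
    """Date-specific prefix segment that doubles as the RUN_ID."""
    return run_date.strftime("%Y-%m-%d")


def s3_prefix(env: str, stage: str, source: str, table: str, run_date: date) -> str:
    """Build a key prefix such as 'dev/ingest/<source>/<table>/<run id>/'.

    `source` and `table` are hypothetical segments identifying the source
    system and the extracted table.
    """
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{env}/{stage}/{source}/{table}/{run_id(run_date)}/"
```

Because every load lands under its own run prefix, a later move to incremental extracts would only add new prefixes rather than overwrite existing ones, which is the future-proofing the slide refers to.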

Crawl S3 /ingest
● Serverless ETL that enables direct querying of S3 data via Athena
● Build metadata catalog
● Lessons learned:
  • Crawler limits: Recycle crawlers programmatically via delete_crawler()
  • A crawler can create a new catalog table instead of updating the existing one when the schema changes in the source system
Airflow task "Crawl Ingest" (diagram): create collection of S3 targets → try to delete previous crawlers → create crawler
Diagram: S3 /ingest → Glue Crawler → Glue Data Catalog

Transform to ORC
● Why: ORC tables enhance query performance in Athena
● Create EMR Cluster and run pySpark job
● ORC vs Parquet?
  - Before June: Athena was prevented from performing date arithmetic operations when using Parquet, due to the table column definition as a date type
● Why not Glue Jobs?
  - High-memory need favors EMR solution
  - Glue Jobs took 8 hours vs. 1.5 hours on a 20-node EMR cluster
Airflow tasks (diagram): Create EMR Cluster → Transform to ORC: foreach S3 target in /ingest, pySpark .read() → .repartition() → .toDF() → .write.orc()  # to /data
Diagram: S3 /ingest → S3 /data

Publish ORC to Pub
● Copy ORC from /data to /pub to prevent concurrency issues and prepare for crawling published data
Airflow task "Publish ORC" (diagram): foreach table in /data, S3 sync src dest
Diagram: S3 /data → S3 /pub

Crawl S3 /pub
● Same step as crawling ingest data
● Separate metadata catalog enables Athena access to:
  - published data for analytics tools
  - raw ingested data for new model development
Airflow task "Crawl Ingest" (diagram): create collection of S3 targets → try to delete previous crawlers → create crawler
Diagram: S3 /ingest
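
The crawl step used for both /ingest and /pub (create collection of S3 targets, try to delete previous crawlers, create crawler) might look roughly like the following with the boto3 Glue client. This is a minimal sketch under stated assumptions: the function name, crawler name, IAM role, database name, and S3 paths are all hypothetical, and starting the crawler right after creating it is assumed rather than stated on the slides.

```python
def recycle_crawler(glue, name, role_arn, database, s3_paths):
    """Delete any previous crawler called `name`, then create and start a new one.

    `glue` is a boto3 Glue client, e.g. boto3.client("glue").
    All argument values are supplied by the caller; none come from the slides.
    """
    try:
        # Recycling keeps the account under the crawler limit.
        glue.delete_crawler(Name=name)
    except glue.exceptions.EntityNotFoundException:
        pass  # no previous crawler with this name, nothing to delete

    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        # One S3 target per prefix to be catalogued.
        Targets={"S3Targets": [{"Path": path} for path in s3_paths]},
    )
    glue.start_crawler(Name=name)  # assumed: run the crawl immediately
```

A call such as `recycle_crawler(boto3.client("glue"), "ingest-crawler", role_arn, "ingest_db", paths)` would then be wrapped in an Airflow task. Using a separate `database` value for /ingest and /pub corresponds to the separate metadata catalogs described above.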
Crawl S3 /pub (diagram, continued): Glue Crawler → Glue Data Catalog

Access through Athena
● Expose published data to analytics tools via Athena
    -- filter training data set by trial environment
    select year, trial, material...
    from trainingDataSet
    left outer join environmental on...
    where...
    order by
● Athena vs Impala:
  - Low cost, serverless, direct query against S3
  - Near drop-in replacement for Impala
    • Refactor queries with CREATE TABLE LIKE, INSERT INTO and PreparedStatement
● Athena design considerations:
  - Time limit of 30 minutes
  - Number of concurrent queries might require limit increases
● Great Support:
  - Athena JDBC driver: slow iteration over ResultSet fixed June 2018
Diagram: S3 /pub → Athena → Predictive Analytics Tools for parent selection

Keys to Success
- DETERMINE access patterns and compute requirements
- SOURCE data will drive ingestion strategy
- INVEST in POCs to build foundation
- UNDERSTAND security and access requirements

Summary
- LOWER COST: Lowered infrastructure costs 10x. Eliminated $100k+ in vendor licensing and support costs
- LOWER RISK: On-prem outages 15+ days last year. Zero outages on AWS
- BALANCED SKILLS: Lowered demand on high-skill data and infrastructure experts
- SPEED: Team of 9 people for 3 months

QUESTIONS & ANSWERS
Thank You