Syngenta’s Predictive Analytics Platform for Seeds R&D
A journey from on-premise Hadoop to AWS’s Big Data Serverless Analytics stack
Amazon Athena and AWS Glue Summit – Boston
October 9, 2018
Michael Swanson
Domain Architect - Insights and Decisions, Syngenta
Classification: PUBLIC
©2018 Syngenta. The Alliance Frame and Syngenta logo are registered trademarks of a Syngenta Group Company. All other trademarks are the property of their respective owners.
Outline
CHALLENGE
OPTIONS AND EVALUATION
BUILD-OUT
Find better ways to feed the world
Every day the world’s population increases by 200,000
870 million people go to bed hungry
Agriculture produces 30% of greenhouse gases and 40% of soil is degraded
By 2050, 4 billion people will live with water scarcity – agriculture uses 70% of fresh water
About Syngenta
Agribusiness
#1 Crop Protection and #3 Seeds
Seeds R&D Challenge
LONG PRODUCT DEVELOPMENT LIFECYCLE
7+ years
SHORT DECISION CYCLES
Several short 6-week windows to make critical decisions
BIG AND COMPLEX
Volume (GB to PB), velocity (30% YTD), tools (100+)
HIGH STAKES
A mistake in year 7 is 900X more costly than in year 1
MISSION
Build world-class analytics platform for Syngenta Seeds R&D over next 3 years
Increase Analytics Capability With Flat IT Budget
BEFORE
- Legacy RDBMS warehouse, replication, and ETL; ad-hoc per-app ETL
- On-prem Hadoop compute cluster with high operational cost that sits 92% idle
- Repetitive calculations performed in analytics tools and shared through non-scalable, application-specific APIs
AFTER
- Store once, transform as needed: object store and managed ETL service
- On-demand, scalable compute at a low and transparent cost per analytics operation
- Self-service bulk data access, BYO-tool capability, and business-focused APIs
Design Considerations
DEMAND
Must be resilient and scalable during several 6-week periods
ACCEPT NON-REALTIME
Max 8-hour data refresh via mini-batch until source systems are rearchitected
COST
EMR startup delay is acceptable as calculations take 30 minutes to 1 day
Solution Options
Over 2 months, we evaluated and created POCs on four different platforms:
- Expand on-prem Hadoop cluster
- Move to National Center for Supercomputing Applications (NCSA)
- Lift-and-shift to Cloudera Altus in AWS
- Refactor on AWS using S3, EMR, Glue, Athena
Evaluation Criteria
Total cost of ownership
Total cost for platform: infrastructure, usage, licensing and support
Support and recovery
Infrastructure support hours and recovery times for platform components
Dynamic scalability
Ability to add/remove compute/storage to support variable demand
Cost to re-engineer
Cost to convert existing codebase and processes
Network implications
Effect on network due to data movement to disparate locations
Data integration performance
Data integration performance to support analytics tools and processes
Security
Ability to implement access control across data and tools
Support for currently used tools
Support for currently used tools – Sqoop, Spark, Hive, Pig
Global performance
Global performance in NA, LATAM, EMEA, APAC
Training
Training and design support
- Critical criteria
Evaluation Scoring
Total cost of ownership
Support hours and recovery times
Dynamic scalability
Cost and time to convert
Network implications
Data integration performance
Security
Support for currently used tools
Global performance
Training
Build-out For A Breeding Predictive Analytics Tool*
1. Ingest from RDS to S3
2. Crawl S3 /ingest
3. Transform to ORC & publish
4. Crawl S3 /pub
5. Access through Athena
What is PARENT/PROGENY SELECTION? Genomic best linear unbiased prediction (GBLUP) is an algorithm that uses genomic, phenotypic, and environmental data to predict the best parents to breed
* Many steps left out for clarity
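The five build-out steps form a simple dependency chain. As a minimal sketch, with plain Python standing in for the Airflow DAG (which the slides do not show) and with step names that are illustrative labels only, the ordering can be expressed and checked like this:

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
# The step names are hypothetical labels for the five build-out steps.
PIPELINE = {
    "ingest_rds_to_s3": set(),
    "crawl_s3_ingest": {"ingest_rds_to_s3"},
    "transform_to_orc_and_publish": {"crawl_s3_ingest"},
    "crawl_s3_pub": {"transform_to_orc_and_publish"},
    "access_through_athena": {"crawl_s3_pub"},
}


def run_order(pipeline):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())
```

In a real scheduler each label would be a task, and the same dependency map would decide which tasks may start after a given run of the previous one.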
Syngenta Seeds R&D Analytical Platform - Parent and Progeny Prediction Use-case
Architecture diagram components: Airflow, RDS, EC2, S3, Glue, Glue Data Catalog, EMR, Athena, Predictive Analytics Tools
Ingest from RDS to S3
● Full extracts from 8 source systems to S3 using a custom Java importer on EC2
● Why wasn’t Sqoop used for incremental extracts?
- Limited by source system architecture: missing timestamps, partitioning anti-patterns…
● Why not DMS?
- Extracts can fail on delimiters, quoted strings, partitions in source systems
● Java importer allowed control for:
- Parallelization by partitioning
- Filtering of non-ASCII characters
● Lessons learned:
- Excessive parallelization can cause source system strain
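The importer itself was written in Java and is not shown in the slides. As a rough Python sketch of the two controls listed above, assuming each table has a numeric key that can be split into ranges and that "filtering" means dropping non-ASCII characters from string fields (both are assumptions, and all function names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor


def partition_ranges(min_id, max_id, partitions):
    """Split the inclusive key range [min_id, max_id] into contiguous chunks."""
    if partitions < 1:
        raise ValueError("partitions must be >= 1")
    total = max_id - min_id + 1
    if total <= 0:
        return []
    partitions = min(partitions, total)
    size, extra = divmod(total, partitions)
    ranges, start = [], min_id
    for i in range(partitions):
        # The first `extra` chunks take one additional key each.
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def strip_non_ascii(value):
    """Drop characters outside the ASCII range from a string field."""
    return value.encode("ascii", errors="ignore").decode("ascii")


def extract_all(ranges, extract_fn, max_workers=4):
    """Run extract_fn on each range with a capped number of workers.

    The cap is the knob that limits load on the source system.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_fn, ranges))
```

Keeping `max_workers` small reflects the lesson learned above: more partitions than the source database can serve concurrently only moves the bottleneck to the source.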
Ingest Workflow
Diagram: Airflow runs the ingest from RDS (Oracle, Postgres, SQL Server) through Java on EC2 to S3 /ingest (diagram label: x6)
S3 Folder Structure
● ENVIRONMENT SEPARATION – prod, stg, dev, test, and developer-specific folders
● DATA SEPARATION – Ingest, work, and publish prefixes avoid concurrency issues and future-proof the design
● TIME-BASED PREFIXES – Data is loaded into date-specific prefixes that act as a RUN_ID; this future-proofs the design for when incremental extracts are implemented
● SEPARATE ACCOUNTS – Prod and non-prod accounts for protection against accidental deletes
● TEST ENVIRONMENT – Controlled data set for continuous system functionality testing as part of continuous integration
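The slides do not give the exact key layout, so the following is only a sketch of how environment, stage, and a date-based RUN_ID could be combined into one S3 prefix. The segment order, the stage names, and the `run_id=` naming are assumptions made for illustration:

```python
from datetime import datetime, timezone

# Hypothetical values; the deck names the categories but not the literal folder names.
ENVIRONMENTS = {"prod", "stg", "dev", "test"}
STAGES = {"ingest", "work", "pub"}


def run_prefix(env, stage, source, table, run_time):
    """Build an S3 key prefix of the form env/stage/source/table/run_id=YYYY-MM-DD/."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    run_id = run_time.strftime("%Y-%m-%d")
    return f"{env}/{stage}/{source}/{table}/run_id={run_id}/"


example = run_prefix("dev", "ingest", "trials", "plots",
                     datetime(2018, 10, 9, tzinfo=timezone.utc))
```

Because every load writes under its own date prefix, a later switch to incremental extracts would only add new prefixes rather than rewrite existing ones.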
Crawl S3 /ingest
● Serverless ETL that enables direct querying of S3 data via Athena
● Build metadata catalog:
  create collection of S3 targets
  try to delete previous crawlers
  create crawler
● Lessons learned:
• Crawler limits: recycle crawlers programmatically via delete_crawler()
• Can create a new catalog table instead of updating the existing one when the schema changes in the source system
Diagram: Airflow task "Crawl Ingest" reads S3 /ingest with a Glue Crawler and writes to the Glue Data Catalog
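The three crawl steps (collect S3 targets, try to delete the previous crawler, create a new one) could look roughly like the sketch below. It assumes a boto3 Glue client is passed in as `glue`; the function name and its parameters are hypothetical, and starting the crawler right after creating it is an added assumption:

```python
def recycle_crawler(glue, name, role_arn, database, s3_paths):
    """Delete any previous crawler with this name, then create and start a new one."""
    try:
        glue.delete_crawler(Name=name)
    except glue.exceptions.EntityNotFoundException:
        pass  # No previous crawler exists, so there is nothing to recycle.
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        # One S3 target per ingested table prefix.
        Targets={"S3Targets": [{"Path": path} for path in s3_paths]},
    )
    glue.start_crawler(Name=name)
```

With boto3 installed, the client would typically come from `boto3.client("glue")`. Deleting before creating keeps the number of crawlers in the account constant across runs, which is the point of the "crawler limits" lesson.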
Transform to ORC
● Why: ORC tables enhance query performance in Athena
● Create EMR cluster and run PySpark job:
  foreach S3 target in /ingest
    .read()
    .repartition()
    .toDF()
    .write.orc() # to /data
● ORC vs Parquet?
- Before June: Athena could not perform date arithmetic on columns defined as a date type when using Parquet
● Why not Glue Jobs?
- High-memory need favors EMR solution
- Glue Jobs took 8 hours vs. 1.5 hours on a 20-node EMR cluster
Diagram: Airflow tasks "Create EMR Cluster" and "Transform to ORC" (PySpark) read S3 /ingest and write S3 /data
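The per-target loop on this slide might be written along these lines in PySpark. This is a sketch only: it assumes the ingested files are delimited text with a header row (the slides do not state the format), that an existing `SparkSession` is passed in as `spark`, and that the function names and partition count are placeholders:

```python
def transform_to_orc(spark, src_path, dest_path, num_partitions=8):
    """Read one ingested table, repartition it, and write it out as ORC."""
    # Assumption: the source is CSV with a header row.
    df = spark.read.csv(src_path, header=True, inferSchema=True)
    df.repartition(num_partitions).write.mode("overwrite").orc(dest_path)


def transform_all(spark, tables, ingest_root, data_root, num_partitions=8):
    """Apply the ORC transform to every table found under the ingest prefix."""
    for table in tables:
        transform_to_orc(
            spark,
            f"{ingest_root}/{table}/",
            f"{data_root}/{table}/",
            num_partitions,
        )
```

Repartitioning before the write controls how many ORC files each table ends up with, which affects how much work Athena does per query.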
Publish ORC to Pub
● Copy ORC from /data to /pub to prevent concurrency issues and prepare for crawling published data
  foreach table in /data
    S3 sync src dest
Diagram: Airflow task "Publish ORC" copies S3 /data to S3 /pub
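One way to express the publish loop is to shell out to `aws s3 sync` once per table. The sketch below assumes the AWS CLI is installed and configured; the bucket roots and function names are hypothetical, and the `run` parameter exists only so the command runner can be swapped out:

```python
import subprocess


def sync_command(src_root, dest_root, table):
    """Build the argument list for syncing one table prefix from /data to /pub."""
    return ["aws", "s3", "sync", f"{src_root}/{table}/", f"{dest_root}/{table}/"]


def publish_tables(src_root, dest_root, tables, run=subprocess.run):
    """Sync every table; check=True stops the publish on the first failed copy."""
    for table in tables:
        run(sync_command(src_root, dest_root, table), check=True)
```

Publishing as a separate copy means readers of /pub never see a table that the transform step is still writing.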
Crawl S3 /pub
● Same step as crawling ingest data:
  create collection of S3 targets
  try to delete previous crawlers
  create crawler
● Separate metadata catalog enables Athena access to:
- published data for analytics tools
- raw ingested data for new model development
Diagram labels: Airflow, Crawl Ingest, S3 /ingest, Glue Crawler, Glue Data Catalog
Access through Athena
● Expose published data to analytics tools via Athena
  -- filter training data set by trial environment
  select year, trial, material...
  from trainingDataSet
  left outer join environmental on...
  where...
  order by
● Athena vs Impala:
- Low cost, serverless, direct query against S3
- Near drop-in replacement for Impala
• Refactor queries that use CREATE TABLE LIKE, INSERT INTO and PreparedStatement
● Athena design considerations:
- Time limit of 30 minutes
- Number of concurrent queries might require limit increases
● Great support:
- Athena JDBC driver: slow iteration on ResultSet fixed June 2018
Diagram: S3 /pub, Athena, Predictive Analytics Tools for parent selection
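Because of the 30-minute limit, a client that submits queries has to poll for completion and give up at some point. A minimal sketch of that pattern follows, assuming a boto3 Athena client is passed in as `athena`; the function name, polling interval, and the injectable `sleep` and `clock` parameters are illustrative choices, not part of the presented platform:

```python
import time


def run_athena_query(athena, sql, database, output_location,
                     timeout_s=30 * 60, poll_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Submit a query, poll until it finishes, and stop it if the deadline passes."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    deadline = clock() + timeout_s
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        if clock() >= deadline:
            # Cancel the query so it does not keep using a concurrency slot.
            athena.stop_query_execution(QueryExecutionId=query_id)
            raise TimeoutError(f"Athena query {query_id} exceeded {timeout_s} seconds")
        sleep(poll_s)
```

With boto3 installed, the client would typically come from `boto3.client("athena")`, and the returned query ID can then be used to fetch results from the output location.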
Keys to Success
DETERMINE
access patterns and compute requirements
SOURCE
data will drive ingestion strategy
INVEST
in POCs to build foundation
UNDERSTAND
security and access requirements
Summary
LOWER COST
Lowered infrastructure costs 10x
Eliminated $100k+ in vendor licensing and support costs
LOWER RISK
On-prem outages 15+ days last year. Zero outages on AWS
BALANCED SKILLS
Lowered demand on high-skill data and infrastructure experts
SPEED
Team of 9 people for 3 months
QUESTIONS & ANSWERS
Thank You