02 - Data Migration

Data Migration
Concepts - Push Approach - Pull Approach
Data Migration Concepts
• Hadoop -> Databricks Lakehouse: Data Migration
  • Key Considerations
  • Data Dependencies
  • Data Validation and Testing
Data Migration Considerations
● Cloud storage already includes data redundancy. Costs are based on actual data size and level of service for storing the data (hot, cold, archive)
● Two types of data: historical and ongoing data feeds
● Historical data
  ○ Can be in the TB to PB size range
  ○ Not all data may need to be pushed to the cloud
● Ongoing data feeds
  ○ Typically in the MB to low-TB per day range
  ○ Provided by some existing ingestion service - ETL, CDC, in-house framework
Hive Schema and Data Migration Tooling
● BladeBridge
  ○ Can convert Hive DDL scripts to equivalent Spark SQL DDL scripts
● WanDisco Data Migrator
  ○ Migrates schema and data into Delta Lake
  ○ Can automatically create Delta DDL
● SI solutions
  ○ Some system integrators (SIs) provide their own data migration solutions
Hive Schema Migration Best Practices
● Hive DDL statements are similar to Spark SQL DDL
  ○ Always change DDL from STORED AS to USING for better performance
  ○ USING tells the Spark optimizer that the table is a Spark table
● A LOCATION clause in the DDL implies the table is "external"
  ○ In Hive, the LOCATION and EXTERNAL keywords are decoupled
● Hive and Spark DDL are similar, but partitioned table DDL has subtle differences (illustrated in the sketch below)
  ○ Hive specifies the partition column type in PARTITIONED BY, while Spark SQL does not
  ○ Hive does not include partition columns in the column definition, while Spark SQL does
● Convert ORC and other file formats to Delta Lake programmatically in Spark or via the COPY INTO command
● Parquet can be converted with COPY INTO or converted in place with CONVERT TO DELTA
● Hive UDF/UDAF/UDTF are supported, but should be converted to Spark equivalents for better performance
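A minimal sketch of the DDL differences and conversion paths above, run from a notebook. The table, column, and storage paths are hypothetical, and the COPY INTO target is assumed to be an existing Delta table.

# Hypothetical table, column, and path names throughout this sketch.

# Hive DDL, for reference: partition column typed in PARTITIONED BY and omitted
# from the column list, STORED AS instead of USING:
#   CREATE EXTERNAL TABLE sales (id BIGINT, amount DOUBLE)
#   PARTITIONED BY (sale_date STRING)
#   STORED AS PARQUET
#   LOCATION '/data/sales';

# Equivalent Spark SQL DDL: USING instead of STORED AS, partition column included
# in the column list, and no type in PARTITIONED BY.
spark.sql("""
  CREATE TABLE sales (id BIGINT, amount DOUBLE, sale_date STRING)
  USING PARQUET
  PARTITIONED BY (sale_date)
  LOCATION 'abfss://lake@account.dfs.core.windows.net/data/sales'
""")

# Parquet can be converted to Delta in place ...
spark.sql("""
  CONVERT TO DELTA parquet.`abfss://lake@account.dfs.core.windows.net/data/sales`
  PARTITIONED BY (sale_date STRING)
""")

# ... while ORC (and other supported formats) can be loaded into an existing
# Delta table with COPY INTO.
spark.sql("""
  COPY INTO sales_delta
  FROM 'abfss://lake@account.dfs.core.windows.net/landing/sales_orc'
  FILEFORMAT = ORC
""")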
Partner and 3rd Party Tool Mapping
Data Migration & ETL Tools

Category             | Hadoop Component | 3rd Party Tool Options                        | Type         | Integration Level
Data Synchronization | DistCP           | WanDisco                                      | 3rd Party    | Native (with Logic Pushdown)
Data Synchronization | DistCP           | AWS DataSync, GCP DistCp                      | Cloud Native | Native
Data Migration/ETL   | DistCP           | Azure Data Factory                            | Cloud Native | Native (with Logic Pushdown)
Data Migration/ETL   | DistCP           | Talend, Informatica, Fivetran, dbt, Matillion | 3rd Party    | Native (with Logic Pushdown)
Data Migration Concepts
• Hadoop -> Databricks Lakehouse: Data Migration
  • Key Considerations
  • Data Dependencies
  • Data Validation and Testing
Understanding Data Dependencies for Migration
● Moving a workload to the cloud starts with identifying what data is needed
● Data sources
  ○ HDFS data
  ○ External systems
● Reference data
  ○ HDFS data
  ○ HBase / Solr / Kudu
  ○ External systems
● How to identify data dependencies in a Hadoop workflow
  ○ Leverage native Hadoop data governance tooling, e.g. Atlas, Navigator
  ○ Leverage enterprise data governance solutions, e.g. Alation or Collibra
  ○ Code analysis - review what data is being accessed by the workflow
  ○ Metastore analysis: table dependencies (see the sketch below)
    ▪ VIEWS - if code accesses data via views
    ▪ FOREIGN KEY constraints - Hive doesn't enforce these constraints, but they hint at underlying dependencies
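A hedged sketch of the metastore-analysis approach: query the Hive metastore's backing database for view definitions, which expose table-to-table dependencies. The JDBC URL, credentials, and secret scope below are placeholder assumptions; TBLS and DBS are standard Hive metastore schema tables.

# Hedged sketch: inspect the Hive metastore's backing RDBMS for view definitions,
# which often reveal underlying table dependencies. Connection details are placeholders.
metastore_views = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://metastore-host:3306/hive")            # placeholder host/db
    .option("user", "hive_ro")                                         # placeholder user
    .option("password", dbutils.secrets.get("hadoop", "metastore_pw")) # placeholder secret
    .option("query", """
        SELECT d.NAME AS db_name, t.TBL_NAME, t.VIEW_EXPANDED_TEXT
        FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID
        WHERE t.TBL_TYPE = 'VIRTUAL_VIEW'
    """)
    .load()
)
display(metastore_views)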
Data Migration Concepts
• Hadoop -> Databricks Lakehouse: Data Migration
  • Key Considerations
  • Data Dependencies
  • Data Validation and Testing
Data Validation
● Custom approach
  ○ Validate DDL and metrics using custom scripts
    ▪ Column definitions
    ▪ Row counts
  ○ Validate data
    ▪ Source data replicated as is into cloud storage - the data transfer tool ensures data integrity
    ▪ Run Spark code to compare source data in the cloud against target data, e.g.
      df1_source.unionAll(df2_target).except(df1_source.intersect(df2_target))
  ○ Hash / row signature (see the sketch below)
    ▪ Source and target data reside in the cloud
    ▪ Use custom logic to generate a hash value for source and target records
    ▪ Compare hash values between source and target tables based on matching primary keys
● WanDisco Data Migrator
  ○ Leverage the migration verification feature
  ○ Can be used when replicating source to target as is
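A minimal sketch of the hash / row signature comparison, assuming both tables are registered in the metastore and share a primary key column named id (all table and column names are hypothetical).

from pyspark.sql import functions as F

# Hypothetical table and key names; both datasets are assumed to be in cloud storage.
src = spark.table("migrated_source.orders")
tgt = spark.table("lakehouse_target.orders")

def with_row_hash(df):
    # Hash every non-key column in a stable order to build a row signature.
    cols = sorted(c for c in df.columns if c != "id")
    return df.select(
        "id",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols]), 256).alias("row_hash"),
    )

# Full outer join on the primary key, then keep rows that are missing on one side
# or whose signatures differ.
mismatches = (
    with_row_hash(src).alias("s")
    .join(with_row_hash(tgt).alias("t"), "id", "full_outer")
    .where("s.row_hash IS NULL OR t.row_hash IS NULL OR s.row_hash <> t.row_hash")
)
print(f"Rows that differ or are missing on one side: {mismatches.count()}")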
Push Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Push Approach
  • Basics - Pros/Cons and Examples
  • Hadoop Native Tools
  • 3rd Party Tools
  • CDC
  • Auto Loader
Push approach
Pros & Cons when considering a push approach
● Pros
  ○ On-premises data owners have control over how and when data gets moved to the cloud
  ○ Network security teams are more likely to approve outbound connections to the cloud vs. inbound connections to the data center
  ○ More outbound network throughput means faster data transfers
  ○ Data flows through internal security controls before landing in cloud storage
● Cons
  ○ Data latency: consumers have access to data as of the last push, and may not be able to trigger a refresh of the data into the cloud
Push approach
Tools
● DistCP / Cloudera BDR (Hadoop native)
● 3rd party tooling
  ○ WanDisco
  ○ On-premises ETL solutions, e.g. Talend, Informatica
● In-house frameworks
  ○ Existing logic that facilitates data movement
● Cloud native
  ○ AWS Snowmobile, Azure Data Box, Google Transfer Appliance
*NOTE: A push approach requires an agent to be installed on-premises. Databricks is a cloud-first solution, and hence does not provide a native push approach for data migration.
Push Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Push Approach
  • Basics - Pros/Cons and Examples
  • Hadoop Native Tools
  • 3rd Party Tools
  • CDC
  • Auto Loader
Migrate Data from HDFS
Hadoop native tool considerations
● Does not require inbound network connections from the cloud
● Requires resources from the Hadoop cluster
● Throughput affected by
  ○ Available YARN resources
  ○ Available network bandwidth to the cloud
  ○ Cloud storage write speed
Push Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Push Approach
  • Basics - Pros/Cons and Examples
  • Hadoop Native Tools
  • 3rd Party Tools
  • CDC
  • Auto Loader
Why use a 3rd party tool over DistCP
Connecting from on-premises
● Some 3rd party solutions provide a drag-and-drop interface for defining ingestion pipelines. This may be a customer requirement.
● Customers may already leverage an ETL solution and want to continue using it in the cloud
  ○ Be sure to confirm cloud support; it may be version specific
● Disadvantages of DistCP
  ○ Hadoop specific
  ○ No built-in monitoring
  ○ Requires more setup when working with larger data transfers
Data Migration from On-Premises
Highlighted 3rd party option
● WanDisco provides HDFS to cloud storage replication
  ○ Simple setup, multi-cloud support
  ○ Can be used for historical plus ongoing data synchronization
  ○ Built-in monitoring and tracking of replication
  ○ Can replicate Hive tables into Delta tables in Databricks
    ▪ Includes replication of Hive DDL and table data
● Cloud native solutions
  ○ Azure Data Factory
  ○ AWS DataSync
Push Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Push Approach
  • Basics - Pros/Cons and Examples
  • Hadoop Native Tools
  • 3rd Party Tools
  • CDC
  • Auto Loader
Considerations: Change Data Capture (CDC)
● Use cases where CDC is recommended
  ○ SQL-based extracts impact other workloads on the source database
  ○ Data needs to be moved in stages, migrating only changed data instead of full extract overwrites
● Databricks partners that provide CDC-based technologies
  ○ Fivetran HVR, Arcion, Talend, AWS DMS, Oracle GoldenGate, open-source Debezium, and Qlik Replicate
● Databricks can also consume change records from other CDC solutions
  ○ https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html
● Databricks can be used to handle SCD (Slowly Changing Dimension) use cases
  ○ https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-cdc.html
● Existing Hadoop ingestion pipelines may already include CDC records
  ○ Ingestion pipelines can be modified to target cloud storage
  ○ Legacy merge logic may be reused - partition rebuilding and partition swap
  ○ Databricks recommends the use of Delta Lake and its MERGE capabilities (see the sketch below)
    ▪ https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html
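A minimal sketch of applying CDC records to a Delta table with MERGE, assuming the change records landed in cloud storage carry a key column id and an operation flag op (paths, table names, and schema are hypothetical).

from delta.tables import DeltaTable

# Hypothetical paths, table names, and change-record schema (id, op, ...columns).
changes = spark.read.format("json").load(
    "abfss://landing@account.dfs.core.windows.net/cdc/customers/"
)

target = DeltaTable.forName(spark, "lakehouse.customers")

(target.alias("t")
    .merge(changes.alias("c"), "t.id = c.id")
    .whenMatchedDelete(condition="c.op = 'DELETE'")      # apply deletes
    .whenMatchedUpdateAll(condition="c.op != 'DELETE'")  # apply updates
    .whenNotMatchedInsertAll(condition="c.op != 'DELETE'")
    .execute())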
Push Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Push Approach
  • Basics - Pros/Cons and Examples
  • Hadoop Native Tools
  • 3rd Party Tools
  • CDC
  • Auto Loader
Auto Loader
General Information
● Recommended to replace legacy Hadoop jobs that monitor edge node or HDFS directories for new data and trigger ingestion processing into Hadoop
● Consider rewriting data processing workflows to leverage Auto Loader
  ○ Efficiently handles the processing of newly landed data files in cloud storage
  ○ Can only be used with Structured Streaming and Delta Live Tables
● Language support
  ○ Structured Streaming supports Python, Scala, and Java. Hence, the SQL interface for Auto Loader cannot be used with Structured Streaming.
  ○ Delta Live Tables supports Python and SQL. Hence, the Scala interface for Auto Loader cannot be used when configured as part of a Delta Live Tables pipeline.
● Auto Loader processes data landed in cloud storage only
  ○ An external process is required to copy the data to cloud storage
  ○ Auto Loader can then be used to trigger additional data processing, e.g. transformations
Auto Loader
Considerations
● Auto Loader provides file-based processing and hence operates under file-based processing SLAs, i.e. it does not provide sub-second processing times
● Decoupled from cloud ingestion processes, allowing for integration with any ingestion tool
  ○ Auto Loader detects new files in cloud storage without knowledge of the technology used to deliver the data
● Use Auto Loader to auto-convert files from JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE formats into Delta Lake (see the sketch below)
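A minimal Auto Loader sketch using Structured Streaming in Python, converting CSV files landed by any external process into a Delta table; the storage paths and table name are placeholder assumptions.

# Hypothetical paths and table name; Auto Loader ("cloudFiles") detects newly landed files.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "abfss://lake@account.dfs.core.windows.net/_schemas/orders")
    .load("abfss://landing@account.dfs.core.windows.net/orders/")
    .writeStream
    .option("checkpointLocation", "abfss://lake@account.dfs.core.windows.net/_checkpoints/orders")
    .trigger(availableNow=True)   # process all pending files, then stop
    .toTable("lakehouse.orders_bronze"))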
Pull Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Pull Approach
  • Basics - Pros/Cons and Examples
  • Connecting to On-Prem
  • HDFS
  • JDBC
Pull Approach Considerations
Circumstances where the Pull approach is recommended
● Disconnect between Hadoop owners and the business, resulting in
  ○ Artifacts produced by data engineering being outside of the business's control
  ○ Consumption being up to the business
● When on-premises data source systems cannot be moved to the cloud
● Consumers need the latest version of on-premises data, and the dataset is small (10s-100 MB)
  ○ For larger datasets (GBs to TBs), consider a scheduled extract that writes to cloud storage
● Databricks is optimized for read/write access to cloud storage
  ○ Direct network reads from on-premises systems to the cloud will be a bottleneck
Pull Approach Considerations
Pros & Cons
● Pros
  ○ Business users have access to Hadoop data
  ○ Acceptable as a short-term solution to access curated data
  ○ Can take advantage of optimized data connectors within Databricks
  ○ Can access the most up-to-date incremental data from Hadoop very quickly
  ○ Simpler to implement than a push approach
● Cons
  ○ Data extraction is gated by network connectivity to Hadoop (typically on-premises)
  ○ Slows migration progress by retaining on-premises dependencies
  ○ Reduces data owners' control over how the data is transmitted compared to a push approach
Examples of the Pull Approach
● Spark Structured Streaming
  ○ Accessing on-premises Kafka (or other messaging systems) - see the sketch below
● Spark batch processing
  ○ Reading files from an on-premises HDFS
  ○ Accessing an on-premises RDBMS via JDBC
● 3rd party tooling
  ○ Sqoop replacement
    ▪ Fivetran (JDBC)
    ▪ Matillion (JDBC)
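A minimal sketch of the Structured Streaming pull pattern, reading from an on-premises Kafka cluster once network connectivity is in place; broker addresses, topic, and target table are placeholder assumptions.

# Hypothetical broker addresses, topic, and target table names.
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "onprem-broker1:9092,onprem-broker2:9092")
    .option("subscribe", "orders_events")
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")
    .writeStream
    .option("checkpointLocation", "abfss://lake@account.dfs.core.windows.net/_checkpoints/orders_events")
    .toTable("lakehouse.orders_events_bronze"))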
Pull Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Pull Approach
  • Basics - Pros/Cons and Examples
  • Connecting to On-Prem
  • HDFS
  • JDBC
Relevant Technologies
Connecting to on-premises
● On-premises connectivity options include:
  ○ AWS Direct Connect
  ○ Azure ExpressRoute
  ○ Google Cloud Interconnect
● Some cloud providers have multiple ways to connect to on-premises environments, which offer different throughput
Pull Approach & Networking
Connectivity requirements
● For HDFS - the Databricks cluster needs to talk to
  ○ Nameservice (namenode services)
  ○ Datanode services
  ○ KDC
  ○ Metastore database
● For RDBMS - Databricks needs access to the database host and port
Pull Approach & Networking
Networking Requirements
● Options
  ○ Same VNET/VPC as Hadoop, if Hadoop is deployed in the cloud
    ▪ Dedicated subnets for Databricks
  ○ Separate VNET/VPC
    ▪ VNET/VPC peering - typically needed when connecting from the cloud to on-premises
● DNS
  ○ Setup is simpler if you can use an internal DNS
  ○ The alternative is to use /etc/hosts to identify Hadoop nodes, but this requires manual configuration
● Firewall
  ○ The Databricks cluster subnet CIDR needs to be whitelisted
  ○ Treat Databricks as a Hadoop client
● Troubleshoot connectivity issues from a Databricks notebook
  ○ E.g. in a %sh notebook cell: nc -vz <hostname> <port>
Pull Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Pull Approach
  • Basics - Pros/Cons and Examples
  • Connecting to On-Prem
  • HDFS
  • JDBC
Accessing HDFS using Kerberos
Required configuration files for Kerberos
● Extract from the Hadoop cluster
  ○ hdfs-site.xml
  ○ core-site.xml
  ○ krb5.conf
  ○ keytab
● Upload the artifacts to a location accessible to the Databricks cluster
  ○ DBFS for initial testing only!
    ▪ Use the CLI or UI to upload
  ○ A protected location is recommended for production
    ▪ sftp or cloud storage, protected by credentials
    ▪ Store access credentials as a Databricks secret
Accessing HDFS using Kerberos
Sample cluster init script for Kerberos cluster
#!/bin/bash
## Init script by Ganesh Rajagopal for HDP 3.x / Databricks integration

## Copy hdfs-site.xml and core-site.xml to the cluster's Hive conf folder
## (they must be uploaded to DBFS first)
cp /dbfs/<some location>/hdfs-site.xml /home/ubuntu/databricks/hive/conf/hdfs-site.xml
cp /dbfs/<some location>/core-site.xml /home/ubuntu/databricks/hive/conf/core-site.xml

## Copy the krb5 conf to the nodes
cp /dbfs/<some location>/krb5.conf /etc/krb5.conf

## Install the krb5 client package
export DEBIAN_FRONTEND=noninteractive && apt-get --yes install krb5-user

## Copy over the keytab (this can be any user/headless keytab)
cp /dbfs/databricks/<some location>/<keytab file> /tmp/<keytab file>

## Run kinit
kinit -kt /tmp/<keytab file> <kerberos principal>
Accessing HDFS using Kerberos
Protecting the Kerberos keytab
● Store the keytab on hosted storage - sftp, cloud storage
● Use an init script to download the keytab
  ○ Secrets are used to retrieve credentials for the protected storage
  ○ The keytab is copied to the cluster's VM storage
    ▪ No SSH access to VMs
    ▪ Ephemeral storage
● Cluster ACLs prevent use by end users
● Use a Kerberos principal with least privilege - a general best practice
HDFS
Reading and writing from and to HDFS
● When configured, Databricks can read from and write to HDFS
● Coexistence between Hadoop and Databricks
  ○ Not all data processing pipelines can be migrated to the cloud at once (migration schedule)
  ○ Consumers in Databricks need to access Hadoop-curated datasets
  ○ Consumers in Hadoop need access to Databricks-curated datasets
● Cloud egress costs will be incurred when pushing data back to on-premises
  ○ Pushing data back to on-premises can be beneficial when not all downstream dependencies have been migrated
HDFS Access from Databricks
After the connection to on-prem is established
● Use Spark to access the HDFS data
  ○ DataFrame API example: df = spark.read.parquet("hdfs://<path to parquet file directory>")
● Data can be pushed back to HDFS. This is typical for a coexistence strategy where consumers on Hadoop need access to Databricks-curated datasets.
  ○ DataFrame API example: df.write.parquet("hdfs://<path to parquet file directory>")
Pull Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Pull Approach
  • Basics - Pros/Cons and Examples
  • Connecting to On-Prem
  • HDFS
  • JDBC
Access using JDBC/ODBC
Considerations
● Typically used to access on-premises database systems that
  ○ Cannot be moved to the cloud due to the migration schedule
  ○ Consist of smaller datasets
● JDBC/ODBC can be used to access Hadoop and database systems
  ○ JDBC/ODBC access to Hadoop data is slower than accessing HDFS directly
● JDBC/ODBC access and network connectivity can be simpler than accessing HDFS
  ○ May not require Kerberos
  ○ No configuration files are required
● JDBC/ODBC access can be used as a way to provide Sqoop functionality within Databricks
  ○ Use the Spark JDBC source as a way to read from and write to database systems (see the sketch below)
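A minimal sketch of the Sqoop-style pattern above: pull a table from an on-premises RDBMS over the Spark JDBC source and land it as a Delta table. The JDBC URL, credentials, secret scope, partitioning bounds, and target table are placeholder assumptions.

# Hypothetical connection details; credentials come from a Databricks secret scope.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://onprem-db-host:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "replication_ro")
      .option("password", dbutils.secrets.get("onprem", "sales_db_pw"))
      .option("numPartitions", 8)            # parallel reads, similar to Sqoop mappers
      .option("partitionColumn", "order_id")
      .option("lowerBound", 1)
      .option("upperBound", 10000000)
      .load())

# Land the extract as a Delta table in the lakehouse.
df.write.format("delta").mode("overwrite").saveAsTable("lakehouse.orders_raw")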