Data Migration Concepts - Push Approach - Pull Approach

Data Migration Concepts
• Hadoop -> Databricks Lakehouse: Data Migration
• Key Considerations
• Data Dependencies
• Data Validation and Testing

Data Migration Considerations
● Cloud storage already includes data redundancy. Costs are based on the actual data size and the level of service used to store the data (hot, cold, archive).
● Two types of data
  ○ Historical
  ○ Ongoing data feeds
● Historical data
  ○ Can be in the TB to PB size range
  ○ Not all data may need to be pushed to the cloud
● Ongoing data feeds
  ○ Typically in the MB to low-TB-per-day range
  ○ Provided by some existing ingestion service - ETL, CDC, in-house framework

Hive Schema and Data Migration Tooling
● BladeBridge
  ○ Can convert Hive DDL scripts to equivalent Spark SQL DDL scripts
● WanDisco Data Migrator
  ○ Migrates schema and data into Delta Lake
  ○ Can automatically create Delta DDL
● SI solutions
  ○ Some systems integrators (SIs) provide their own data migration solutions

Hive Schema Migration Best Practices
● Hive DDL statements are similar to Spark SQL DDL
● Always change DDL from STORED AS to USING for better performance
  ○ USING tells the Spark optimizer that the table is a Spark table
● LOCATION in DDL implies the table is "external"
  ○ In Hadoop, the LOCATION and EXTERNAL keywords are decoupled
● Hive and Spark DDL are similar, but partitioned-table DDL has subtle differences
  ○ Hive specifies the partition column type in PARTITIONED BY, while Spark SQL does not
  ○ Hive does not include partition columns in the column definition, while Spark SQL does
● Convert ORC and other file formats to Delta Lake - programmatically in Spark or via the COPY INTO command
● Parquet can be converted with COPY INTO or converted in place with CONVERT TO DELTA
● Hive UDF/UDAF/UDTF are supported, but should be converted to Spark equivalents for better performance

Partner and 3rd Party Tool Mapping
Data Migration & ETL Tools

Category             | Hadoop Component | 3rd Party Tool Options                        | Type         | Integration Level
Data Synchronization | DistCP           | WanDisco                                      | 3rd Party    | Native (with Logic Pushdown)
Data Synchronization | DistCP           | AWS DataSync, GCP Distcp                      | Cloud Native | Native
Data Migration/ETL   | DistCP           | Azure Data Factory                            | Cloud Native | Native (with Logic Pushdown)
Data Migration/ETL   | DistCP           | Talend, Informatica, Fivetran, dbt, Matillion | 3rd Party    | Native (with Logic Pushdown)

Data Migration Concepts - Next: Data Dependencies

Understanding Data Dependencies for Migration
● Moving a workload to the cloud starts with identifying what data is needed
● Data sources
  ○ HDFS data
  ○ External systems
● Reference data
  ○ HDFS data
  ○ HBase / Solr / Kudu
  ○ External systems
● How to identify data dependencies in a Hadoop workflow
  ○ Leverage native Hadoop data governance tooling, e.g. Atlas, Navigator
  ○ Leverage enterprise data governance solutions, e.g. Alation or Collibra
  ○ Code analysis - review what data is being accessed by the workflow
  ○ Metastore analysis: table dependencies
    ○ VIEWS - if code accesses data via views
    ○ FOREIGN KEY constraints - Hive does not enforce these constraints, but they hint at underlying dependencies
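
To make the metastore analysis above concrete, the sketch below lists the views in each database and prints their definitions so that the underlying table dependencies can be reviewed. It is a minimal PySpark sketch; the column names returned by the SHOW commands vary across Spark versions, so treat it as a starting point rather than a finished audit tool.

# Sketch: enumerate views in the Hive metastore and print their definitions
# so the tables each view depends on can be reviewed.
for db_row in spark.sql("SHOW DATABASES").collect():
    db = db_row[0]
    for view_row in spark.sql(f"SHOW VIEWS IN {db}").collect():
        view_name = view_row["viewName"]
        ddl = spark.sql(f"SHOW CREATE TABLE {db}.{view_name}").first()[0]
        print(f"--- {db}.{view_name} ---")
        print(ddl)  # scan the SELECT statement for the tables the view references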
Data Migration Concepts - Next: Data Validation and Testing

Data Validation
● Custom approach
  ○ Validate DDL and metrics - using custom scripts
    ○ Column definitions
    ○ Row counts
  ○ Validate data
    ○ Source data replicated as-is into cloud storage - the data transfer tool ensures data integrity
      ➢ Run Spark code to compare the source data in the cloud against the target data, e.g. df1_source.union(df2_target).subtract(df1_source.intersect(df2_target))
    ○ Hash / row signature
      ○ Source and target data reside in the cloud
      ○ Use custom logic to generate a hash value for each source and target record
      ○ Compare hash values between source and target tables based on a matching primary key (a minimal PySpark sketch of this comparison appears at the end of the Hadoop Native Tools discussion below)
● WanDisco Data Migrator
  ○ Leverage the migration verification feature
    ○ Can be used when replicating source to target as-is

Push Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Push Approach
• Basics - Pros/Cons and Examples
• Hadoop Native Tools
• 3rd Party Tools
• CDC
• Auto Loader

Push Approach
Pros & cons when considering a push approach
● Pros
  ○ On-premises data owners have control over how and when data gets moved to the cloud
  ○ Network security teams are more likely to approve outbound connections to the cloud than inbound connections to the data center
  ○ More outbound network throughput means faster data transfers
  ○ Data flows through internal security controls before landing in cloud storage
● Cons
  ○ Data latency
    ○ Consumers have access to data only as of the last push, and may not have the influence to trigger a refresh of the data into the cloud

Push Approach
Tools
● DistCP / Cloudera BDR (Hadoop native)
● 3rd party tooling
  ○ WanDisco
  ○ On-premises ETL solutions, e.g. Talend, Informatica
● In-house frameworks
  ○ Existing logic that facilitates data movement
● Cloud native
  ○ AWS Snowmobile, Azure Data Box, Google Transfer Appliance
*NOTE: A push approach requires an agent to be installed on-premises. Databricks is a cloud-first solution and hence does not provide a native push mechanism for data migration.

Push Approach - Next: Hadoop Native Tools

Migrate Data from HDFS
Hadoop native tool considerations
● Does not require inbound network connections from the cloud
● Requires resources from the Hadoop cluster
● Throughput is affected by
  ○ Available YARN resources
  ○ Available network bandwidth to the cloud
  ○ Cloud storage write speed
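
As referenced in the Data Validation slide earlier in this module, the sketch below illustrates the hash / row-signature comparison between a source copy and a target table. It is a minimal PySpark sketch; the table names and the primary key column id are placeholders, and the hash expression should be adjusted to the columns that matter for your tables.

from pyspark.sql import functions as F

# Placeholder table names and primary key column
src = spark.table("source_db.orders")
tgt = spark.table("target_db.orders")

# Build a row signature over all columns (sorted for a stable order)
cols = sorted(src.columns)
sig = F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols]), 256)

src_sig = src.select("id", sig.alias("row_hash"))
tgt_sig = tgt.select("id", sig.alias("row_hash"))

# Rows whose hash differs, or that exist only on one side, need investigation
mismatches = (src_sig.alias("s")
              .join(tgt_sig.alias("t"), on="id", how="full_outer")
              .where((F.col("s.row_hash") != F.col("t.row_hash"))
                     | F.col("s.row_hash").isNull()
                     | F.col("t.row_hash").isNull()))
print(mismatches.count())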
Push Approach - Next: 3rd Party Tools

Why Use a 3rd Party Tool over DistCP
Connecting from on-premises
● Some 3rd party solutions provide a drag-and-drop interface for defining ingestion pipelines. This may be a customer requirement.
● Customers may already leverage an ETL solution and want to continue using it in the cloud
  ○ Be sure to confirm cloud support; it may be version specific
● Disadvantages of DistCP
  ○ Hadoop specific
  ○ No built-in monitoring
  ○ Requires more setup when working with larger data transfers

Data Migration from On-Premises
Highlighted 3rd party option
● WanDisco provides HDFS to cloud storage replication
  ○ Simple setup, multi-cloud support
  ○ Can be used for historical plus ongoing data synchronization
  ○ Built-in monitoring and tracking of replication
  ○ Can replicate Hive tables into Delta tables in Databricks
    ○ Includes replication of Hive DDL and table data
● Cloud native solutions
  ○ Azure Data Factory
  ○ AWS DataSync

Push Approach - Next: CDC

Considerations: Change Data Capture (CDC)
● Use cases where CDC is recommended
  ○ SQL-based extracts impact other workloads on the source database
  ○ Data needs to be moved in stages, migrating only changed data instead of full extract overwrites
● Databricks partners that provide CDC-based technologies
  ○ Fivetran HVR, Arcion, Talend, AWS DMS, Oracle GoldenGate, open-source Debezium, and Qlik Replicate
● Databricks can also consume change records from other CDC solutions
  ○ https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html
● Databricks can be used to handle SCD (Slowly Changing Dimension) use cases
  ○ https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-cdc.html
● Existing Hadoop ingestion pipelines may already include CDC records
  ○ Ingestion pipelines can be modified to target cloud storage
  ○ Legacy merge logic may be reused - partition rebuilding and partition swap
  ○ Databricks recommends using Delta Lake and its MERGE capabilities
    ○ https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html

Push Approach - Next: Auto Loader

Auto Loader
General information
● Recommended to replace legacy Hadoop jobs that monitor edge-node or HDFS directories for new data and trigger ingestion processing into Hadoop
● Consider rewriting data processing workflows to leverage Auto Loader
● Efficiently handles the processing of newly landed data files in cloud storage
● Can only be used with Structured Streaming and Delta Live Tables
● Language support
  ○ Structured Streaming supports Python, Scala, and Java. Hence, the SQL interface for Auto Loader cannot be used with Structured Streaming.
  ○ Delta Live Tables supports Python and SQL. Hence, the Scala interface for Auto Loader cannot be used when configured as part of a Delta Live Tables pipeline.
● Auto Loader processes data landed in cloud storage only
  ○ An external process is required to copy the data to cloud storage
  ○ Auto Loader can then be used to trigger additional data processing, i.e. transformations, etc.
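
A minimal PySpark sketch of the Auto Loader pattern described above: incrementally pick up newly landed JSON files from a cloud storage path and append them to a Delta table. The paths and table name are placeholders.

# Sketch: incrementally ingest newly landed JSON files with Auto Loader
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/events")  # placeholder path
      .load("/mnt/landing/events/"))                                        # placeholder path

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/bronze/_checkpoints/events")  # placeholder path
   .trigger(availableNow=True)                                       # process what is available, then stop
   .toTable("bronze.events"))                                        # placeholder table name

The checkpoint location is what lets Auto Loader track which files have already been ingested, so each run only picks up files that landed since the previous run.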
Auto Loader
Considerations
● Auto Loader provides file-based processing and hence operates under file-based processing SLAs, i.e. it does not provide sub-second processing times
● Decoupled from cloud ingestion processes, allowing for integration with any ingestion tool
● Auto Loader detects new files in cloud storage without knowledge of the technology used to deliver the data
● Use Auto Loader to automatically convert files from JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE formats into Delta Lake

Pull Approach
• Hadoop -> Databricks Lakehouse: Migrating Data using a Pull Approach
• Basics - Pros/Cons and Examples
• Connecting to On-Prem
• HDFS
• JDBC

Pull Approach Considerations
Circumstances where the pull approach is recommended
● A disconnect between Hadoop owners and the business, resulting in
  ○ Artifacts produced by Data Engineering being outside of the business's control
  ○ Consumption being up to the business
● When on-premises data source systems cannot be moved to the cloud
● When consumers need the latest version of on-premises data and the dataset is small (tens of MB to around 100 MB)
● For larger datasets (GBs to TBs), consider a scheduled extract that writes to cloud storage
  ○ Databricks is optimized for read/write access to cloud storage
  ○ Direct network reads from on-premises systems to the cloud will be a bottleneck

Pull Approach Considerations
Pros & cons
● Pros
  ○ Business users have access to Hadoop data
  ○ Acceptable as a short-term solution to access curated data
  ○ Can take advantage of optimized data connectors within Databricks
  ○ Can access the most up-to-date incremental data from Hadoop very quickly
  ○ Simpler to implement than a push approach
● Cons
  ○ Data extraction is gated by network connectivity to Hadoop (typically on-premises)
  ○ Slows migration progress by keeping on-premises dependencies
  ○ Reduces data owners' control over how the data is transmitted compared to a push approach

Examples of the Pull Approach
● Spark Structured Streaming
  ○ Accessing on-premises Kafka (or other messaging systems) - a sketch follows the connectivity requirements below
● Spark batch processing
  ○ Reading files from an on-premises HDFS
  ○ Accessing an on-premises RDBMS via JDBC
● 3rd party tooling
  ○ Sqoop replacement
    ○ Fivetran (JDBC)
    ○ Matillion (JDBC)

Pull Approach - Next: Connecting to On-Prem

Relevant Technologies
Connecting to on-premises
● On-premises connectivity options include:
  ○ AWS Direct Connect
  ○ Azure ExpressRoute
  ○ Google Cloud Interconnect
● Some cloud providers have multiple ways to connect to on-premises environments, offering different throughput

Pull Approach & Networking
Connectivity requirements
● For HDFS - the Databricks cluster needs to talk to
  ○ Nameservice (namenode services)
  ○ Datanode services
  ○ KDC
  ○ Metastore database
● For RDBMS - Databricks needs access to the database host and port
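
As an illustration of the first pull example listed above (Spark Structured Streaming against an on-premises Kafka cluster), a minimal sketch. The broker addresses, topic, paths, and table name are placeholders, and the sketch assumes the network connectivity described in this section is already in place.

# Sketch: pull records from an on-premises Kafka topic into a Delta table
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "onprem-broker1:9092,onprem-broker2:9092")  # placeholder hosts
       .option("subscribe", "orders")                                                 # placeholder topic
       .option("startingOffsets", "earliest")
       .load())

events = raw.selectExpr("CAST(key AS STRING) AS key",
                        "CAST(value AS STRING) AS value",
                        "timestamp")

(events.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/bronze/_checkpoints/orders")  # placeholder path
       .toTable("bronze.orders"))                                        # placeholder table name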
Pull Approach & Networking
Networking requirements
● Options
  ○ Same VNET/VPC as Hadoop, if Hadoop is deployed in the cloud
    ○ Dedicated subnets for Databricks
  ○ Separate VNET/VPC
    ○ VNET/VPC peering - typically needed when connecting from the cloud to on-premises
● DNS
  ○ An internal DNS simplifies setup
  ○ The alternative is to use /etc/hosts to identify Hadoop nodes, but this requires manual configuration
● Firewall
  ○ The Databricks cluster subnet CIDR needs to be on the firewall allow list
  ○ Treat Databricks as a Hadoop client
● Troubleshoot connectivity issues from a Databricks notebook
  ○ E.g. in a %sh notebook cell: nc <hostname> <port>

Pull Approach - Next: HDFS

Accessing HDFS using Kerberos
Required configuration files for Kerberos
● Extract from the Hadoop cluster
  ○ hdfs-site.xml
  ○ core-site.xml
  ○ krb5.conf
  ○ keytab
● Upload the artifacts to a location accessible to the Databricks cluster
  ○ DBFS for initial testing only!
    ○ Use the CLI or UI to upload
  ○ A protected location is recommended for production
    ○ sftp or cloud storage, protected by credentials
    ○ Store the access credentials as a Databricks secret

Accessing HDFS using Kerberos
Sample cluster init script for a Kerberos-enabled cluster

#!/bin/bash
## Init script by Ganesh Rajagopal for HDP 3.x / Databricks integration
## Copy hdfs-site.xml and core-site.xml to the conf folder; they must be uploaded to DBFS first
cp /dbfs/<some location>/hdfs-site.xml /home/ubuntu/databricks/hive/conf/hdfs-site.xml
cp /dbfs/<some location>/core-site.xml /home/ubuntu/databricks/hive/conf/core-site.xml
## Copy krb5.conf to the nodes
cp /dbfs/<some location>/krb5.conf /etc/krb5.conf
## Install the krb5 package
export DEBIAN_FRONTEND=noninteractive && apt-get --yes install krb5-user
## Copy over the keytab (this can be any user/headless keytab)
cp /dbfs/databricks/<some location>/<keytab file> /tmp/<keytab file>
## Run kinit
kinit -kt /tmp/<keytab file> <kerberos principal>

Accessing HDFS using Kerberos
Protecting the Kerberos keytab
● Store the keytab on hosted storage - sftp, cloud storage
● Use an init script to download the keytab
  ○ Secrets are used to retrieve the credentials for the protected storage
  ○ The keytab is copied to the cluster VM's storage
    ○ No SSH access to the VMs
    ○ Ephemeral storage
● Cluster ACLs prevent use by end users
● Use a Kerberos principal with least privilege - a general best practice

HDFS
Reading and writing from and to HDFS
● When configured, Databricks can read from and write to HDFS
● Co-existence between Hadoop and Databricks
  ○ Not all data processing pipelines can be migrated to the cloud at once (migration schedule)
  ○ Consumers in Databricks need to access Hadoop-curated datasets
  ○ Consumers in Hadoop need access to Databricks-curated datasets
● Cloud egress costs will be incurred when pushing data back to on-premises
  ○ Pushing data back to on-premises can be beneficial when not all downstream dependencies have been migrated

HDFS Access from Databricks
After the connection to on-prem is established
● Use Spark to access the HDFS data
  ○ Dataframe API example: df = spark.read.parquet("hdfs://<path to parquet file directory>")
● Data can be pushed back to HDFS. This is typical for a co-exist strategy where consumers on Hadoop need access to Databricks-curated datasets.
  ○ Dataframe API example: df.write.parquet("hdfs://<path to parquet file directory>")
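
Building on the Dataframe API examples above, a minimal sketch of the migration direction of the co-exist pattern: read a Parquet dataset from on-premises HDFS and land it as a Delta table in the Lakehouse. The HDFS nameservice, path, and table name are placeholders, and the sketch assumes the Kerberos setup described earlier is in place.

# Sketch: copy an HDFS-resident Parquet dataset into a Delta table
df = spark.read.parquet("hdfs://nameservice1/warehouse/sales/orders")  # placeholder HDFS path

(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("bronze.orders_from_hdfs"))                            # placeholder table name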
Pull Approach - Next: JDBC

Access using JDBC/ODBC
Considerations
● Typically used to access on-premises database systems
  ○ That can't be moved to the cloud due to the migration schedule
  ○ That consist of smaller datasets
● JDBC/ODBC can be used to access both Hadoop and database systems
  ○ JDBC/ODBC access to Hadoop data is slower than accessing HDFS directly
  ○ JDBC/ODBC access and network connectivity can be simpler than accessing HDFS
    ○ May not require Kerberos
    ○ No configuration files are required
● JDBC/ODBC access can be used as a way to provide Sqoop functionality within Databricks
  ○ Use the Spark JDBC source as a way to read from and write to database systems
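
A minimal PySpark sketch of the Sqoop-style JDBC pull described above. The connection URL, table, and partitioning bounds are placeholders, and credentials are read from a Databricks secret scope (also a placeholder) rather than being hard-coded.

# Sketch: Sqoop-style pull of an on-premises RDBMS table over JDBC
jdbc_url = "jdbc:postgresql://onprem-db-host:5432/sales"           # placeholder host/database
user = dbutils.secrets.get(scope="onprem-db", key="user")          # placeholder secret scope/keys
password = dbutils.secrets.get(scope="onprem-db", key="password")

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "public.orders")     # placeholder table
      .option("user", user)
      .option("password", password)
      .option("partitionColumn", "order_id")  # placeholder numeric column for parallel reads
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())

df.write.format("delta").mode("overwrite").saveAsTable("bronze.orders_jdbc")  # placeholder table name

The partitionColumn, lowerBound, upperBound, and numPartitions options split the extract into parallel reads, which plays the same role as Sqoop's --split-by and --num-mappers settings.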