Top 14 Snowflake ETL Best Practices for Data Engineers
Written by John Ryan
8 Minute Read.

In this article we will describe the Snowflake data transformation landscape, explain the steps and the options available, and summarise the data engineering best practices learned from over 50 engagements with Snowflake customers.

What are Snowflake Transformation and ETL?

ETL (Extract, Transform and Load) and ELT (Extract, Load and Transform) are often used interchangeably as shorthand for data engineering. For the purposes of this article, data engineering is the process of transforming raw data into useful information to facilitate data-driven business decisions. The steps involved include data acquisition and ingestion of the raw data history, followed by cleaning, restructuring and enriching the data with additional attributes, and finally preparing it for consumption by end users. The terms ETL and transformation also tend to be used interchangeably, although the transformation task is a subset of the overall ETL pipeline.

Snowflake ETL in Practice

The diagram below illustrates the Snowflake ETL data flow used to build complex data engineering pipelines. There are several components, and you may not use all of them on your project, but they are based on my experience with Snowflake customers over the past five years.

The diagram above shows the main categories of data provider, which include:

Data Lakes: Some Snowflake customers already have an existing cloud-based data lake which acts as an enterprise-wide store of historical raw data used to feed both the data warehouse and machine learning initiatives. Typically, data is stored in S3, Azure or GCP cloud storage in CSV, JSON or Parquet format.

On-Premises Databases: These include both operational databases which generate data and existing on-premises data warehouses which are in the process of being migrated to Snowflake. Examples include billing systems and ERP systems used to manage business operations.

Streaming Sources: Unlike on-premises databases where the data is relatively static, streaming data sources constantly feed in new data. This can include data from Internet of Things (IoT) devices or web logs in addition to social media sources.

SaaS and Data Applications: These include existing Software as a Service (SaaS) systems, for example ServiceNow and Salesforce, which have Snowflake connectors, in addition to other cloud-based applications.

Data Files: Data provided from either cloud or on-premises systems in a variety of file formats including CSV, JSON, Parquet, Avro and ORC, which Snowflake can store and query natively (see the sketch after this list).

Data Sharing: Refers to the ability for Snowflake to seamlessly expose read-only access to data on other Snowflake accounts. Using either the Snowflake Data Exchange or Marketplace provides instant access to data across all major cloud platforms (Google, AWS or Microsoft) and global regions. This can be used to enrich existing transactions with additional externally sourced data without physically copying the data.
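As a quick illustration of querying data files natively, the sketch below selects directly from files held in a stage before they are ever loaded into a table. The stage name, file path and file format are hypothetical assumptions for illustration only, not objects referenced elsewhere in this article.

-- Hypothetical stage, path and named file format; $1, $2 are the positional
-- columns within each staged CSV file.
SELECT t.$1::DATE    AS sale_date,
       t.$2::NUMBER  AS sale_amount
FROM   @landing_stage/sales/ (FILE_FORMAT => 'csv_format') t;

External Tables, mentioned later in this article, build on the same idea by presenting staged files as a permanent, read-only table.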
In common with all analytics platforms, the data engineering phases include:

Data Acquisition: Capturing the raw data files and storing them on cloud storage, including Amazon S3, Azure Blob or GCP storage.

Ingestion and Landing: Loading the data into a Snowflake table, from which point it can be cleaned and transformed. It's good practice to initially load data into a transient table, which balances the need for speed and resilience against simplicity and reduced storage cost (a sketch follows at the end of this section).

Raw History: Unless the data is sourced from a raw data lake, it's good practice to retain the history of raw data to support machine learning in addition to data reprocessing as needed.

Data Integration: The process of cleaning and enriching data with additional attributes, and restructuring and integrating the data. It's normally good practice to use temporary or transient tables to store intermediate results during the transformation process, with the final results stored in Snowflake permanent tables.

Data Presentation and Consumption: Whereas the data integration area may hold data in 3rd Normal Form or a Data Vault, it's normally good practice to store data ready for consumption in a Kimball dimensional design or denormalised tables as needed. This area can also include a layer of views acting as a semantic layer to insulate users from the underlying table design.

Data Governance, Security and Monitoring: Refers to the ability to manage access to the data, including Role Based Access Control, in addition to handling sensitive data using Dynamic Data Masking and Row Level Security. This also supports monitoring Snowflake usage and cost to ensure the platform is operating efficiently.

Finally, the data consumers can include dashboards and ad-hoc analysis, real-time processing and machine learning, business intelligence or data sharing.
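To make the landing and raw history layers concrete, here is a minimal sketch, assuming hypothetical schema and table names. A transient landing table avoids fail-safe storage costs, while a permanent table with a VARIANT column retains the raw history.

-- Transient landing table: cheap storage, typically truncated and reloaded on each batch.
CREATE TRANSIENT TABLE landing.sales_landing (
    raw_record      VARIANT,
    source_file     STRING,
    load_timestamp  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Permanent raw history table: retains every record ever loaded for reprocessing and data science.
CREATE TABLE raw.sales_history (
    raw_record      VARIANT,
    load_timestamp  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);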
Snowflake Data Loading Options

The diagram below illustrates the range of options available to acquire and load data into a landing table - the first step in the Snowflake data engineering process.

As the diagram above shows, Snowflake supports a wide range of use cases including:

Data File Loading: The most common and highly efficient data loading method in Snowflake. This involves using SnowSQL to execute SQL commands to rapidly load data into a landing table. Using this technique it's possible to quickly load terabytes of data, and this can be executed on a batch or micro-batch basis. Once the data files are held in a cloud stage (e.g. S3 buckets), the COPY command can be used to load the data into Snowflake. For the majority of large-volume batch data ingestion this is the most common method, and it's normally good practice to size data files at around 100-250 megabytes of compressed data, optionally breaking up very large data files where appropriate (a COPY sketch appears at the end of this section).

Replication from On-Premises Databases: Snowflake supports a range of data replication and ETL tools, including HVR, Stitch, Fivetran and Qlik Replicate, which will seamlessly replicate changes from operational or legacy warehouse systems with zero impact upon the source system. Equally, there is a huge range of data integration tools which support Snowflake in addition to other database platforms, and these can be used to extract and load data. Some customers also choose to write their own data extract routines and use the data file loading and COPY technique described above.

Data Streaming: Options to stream data into Snowflake include using the Snowflake Kafka Connector to automatically ingest data directly from a Kafka topic. Unlike the COPY command, which needs a virtual warehouse, Snowpipe is an entirely serverless process: Snowflake manages the operation entirely, scaling out the compute as needed. Equally, the option exists to simply trigger Snowpipe to automatically load data files when they arrive on cloud storage.

Inserts using JDBC and ODBC: Although not the most efficient way to bulk load data into Snowflake (using COPY or Snowpipe is always faster and more efficient), the Snowflake JDBC and ODBC connectors are available, in addition to a range of connectors and drivers including Python, Node.js and Go.

Ingestion from a Data Lake: While Snowflake can be used to host a data lake, customers with an existing investment in a cloud data lake can make use of Snowflake External Tables to provide a transparent interface to data in the lake. From a Snowflake perspective, the data appears to be held in a read-only table, but it is transparently read from the underlying files on cloud storage.

Data Sharing: For customers with multiple Snowflake deployments, the Data Exchange provides a seamless way to share data across the globe. Using the underlying Snowflake Data Sharing technology, customers can query and join data in real time from multiple sources without the need to copy. Existing in-house data can also be enriched with additional attributes from externally sourced data using the Snowflake Data Marketplace.

You may notice a consistent design pattern in the above scenarios:

1. Acquire the data: Load data files to a Snowflake file stage.
2. Ingest and land the data: Into a Snowflake table.

Once the data has landed in Snowflake, it can be transformed using the techniques described below. In some cases, Snowflake has made the entire process appear seamless, for example using the Kafka Connector or Snowpipe, but the underlying design pattern is the same.
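As an illustration of the data file loading pattern, the sketch below copies staged CSV files into a landing table. The stage, table and file format settings are hypothetical assumptions, and the example assumes the landing table columns match the CSV layout.

-- Load all files under a given path in an external stage into the landing table.
COPY INTO landing.customer_landing
FROM @landing_stage/customers/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
ON_ERROR = 'ABORT_STATEMENT';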
Snowflake Transformation Options

Having ingested the data into a Snowflake landing table, several tools are available to clean, enrich and transform the data, which are illustrated in the diagram below.

The options available to transform data include:

Using ETL Tools: This often has the advantage of leveraging the existing skill set within the data engineering team, and Snowflake supports a wide range of data integration tools. It is best practice to ensure these tools follow the push-down principle, in which the ETL tool executes SQL which is pushed down to Snowflake to gain maximum benefit from the scale-up architecture.

Using Stored Procedures: In addition to the JavaScript API, Snowflake will soon release a new procedural language, Snowflake Scripting. This includes support for loops, cursors and dynamic SQL, similar to Oracle's PL/SQL. These features, combined with the additional power of External Functions and Java User Defined Functions, can be used to build and execute sophisticated transformation logic within Snowflake. Be aware, however, that the best practice for delivering highly scalable data transformation solutions is to avoid row-by-row processing. Instead, use SQL statements to perform set-based processing, calling functions as needed for complex logic. Use the procedural code simply to organise the steps, and use External and Java UDFs for complex calculations.

Incremental Views: A pattern commonly found on systems migrated from Teradata, which uses a series of views built upon views to create a real-time transformation pipeline. It is good practice to break complex pipelines into smaller steps and write intermediate results to transient tables, as this makes it easier to test and debug and can in some cases lead to significant performance improvements.

Streams and Tasks: Snowflake Streams provide a remarkably powerful yet simple way of implementing change data capture (CDC) within Snowflake. It is good practice to combine Streams and Snowflake Tasks on acquired data for near real-time processing (see the sketch at the end of this section). Effectively, the Stream keeps a pointer in the data to record what has already been processed, and the Task provides the scheduling to periodically transform the newly arrived data. Previously, it was necessary to allocate a suitably sized virtual warehouse to execute the task, but the recent release of the Serverless Compute option further simplifies this: Snowflake automatically manages the compute resources, scaling up or out as needed.

Spark and Java on Snowflake: Using the recently released Snowpark API, data engineers and data scientists who would previously load data into a Databricks cluster to execute SparkSQL jobs can now develop using Visual Studio, IntelliJ, SBT, Scala and Jupyter notebooks, with Spark DataFrames automatically translated and executed as Snowflake SQL. Combined with the ability to execute Java UDFs, this provides powerful options to transform data in Snowflake using your preferred development environment, but without the additional cost and complexity of supporting external clusters.

The data transformation patterns described above show the most common methods; however, each Snowflake component can be combined seamlessly as needed. For example, although Streams and Tasks are commonly used together, they can each be used independently to build a bespoke transformation pipeline, and combined with materialised views to deliver highly scalable and performant solutions.
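The following sketch shows the Stream and Task combination described above. The table, stream, task, warehouse and target names are hypothetical assumptions rather than objects from this article.

-- Track newly arrived rows in the landing table.
CREATE OR REPLACE STREAM sales_stream ON TABLE landing.sales_landing;

-- Periodically transform any new rows; the WHEN clause means the task only runs when the stream has data.
CREATE OR REPLACE TASK transform_sales
  WAREHOUSE = transform_wh       -- omit WAREHOUSE to use the serverless compute option instead
  SCHEDULE  = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('SALES_STREAM')
AS
  INSERT INTO integration.sales_clean
  SELECT raw_record:customer_id::NUMBER,
         raw_record:amount::NUMBER(10,2)
  FROM   sales_stream;

-- Tasks are created suspended and must be resumed before they start running.
ALTER TASK transform_sales RESUME;

Because the INSERT consumes from the stream, the stream offset advances automatically, so each run processes only the rows that arrived since the previous run.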
"If the only tool you have is a hammer, you tend to see every problem as a nail." - Abraham Maslow

Snowflake ETL and Data Engineering Best Practices

The article above summarises several options and highlights some of the best data engineering practices on Snowflake. The key lessons learned from working with Snowflake customers include:

Follow the standard ingestion pattern: This involves the multi-stage process of landing the data files in cloud storage and loading them to a landing table before transforming the data. Breaking the overall process into predefined steps makes it easier to orchestrate and test.

Retain history of raw data: Unless your data is sourced from a raw data lake, it makes sense to keep the raw data history, which should ideally be stored using the VARIANT data type to benefit from automatic schema evolution. This means you can truncate and re-process data if bugs are found in the transformation pipeline, and it provides an excellent raw data source for data scientists. While you may not yet have any machine learning requirements, it's almost certain you will, if not now, then in the coming years. Remember that Snowflake data storage is remarkably cheap, unlike on-premises solutions.

Use multiple data models: On-premises data storage was so expensive that it was not feasible to store multiple copies of data, each using a different data model to match the need. Using Snowflake, however, it makes sense to store raw data history in either structured or variant format, cleaned and conformed data in 3rd Normal Form or a Data Vault model, and finally data ready for consumption in a Kimball dimensional model. Each data model has unique benefits, and storing the results of intermediate steps has huge architectural benefits, not least the ability to reload and reprocess the data in case of mistakes.

Use the right tool: As the quote above implies, if you only know one tool, you'll use it inappropriately. The decision should be based upon a range of factors, including the existing skill set in the team, whether you need rapid near real-time delivery, and whether you're doing a one-off data load or a regular repeating process. Be aware that Snowflake can natively handle a range of file formats, including Avro, Parquet, ORC, JSON and CSV. There is extensive guidance on loading data into Snowflake in the online documentation.

Use COPY or Snowpipe to load data: Around 80% of data loaded into a data warehouse is either ingested using a regular batch process or, increasingly, immediately after the data files arrive. By far the fastest, most cost-efficient way to load data is using COPY and Snowpipe, so avoid the temptation to use other methods (for example, queries against external tables) for regular data loads. Effectively, this is another example of using the right tool (a Snowpipe sketch follows below).

Avoid JDBC or ODBC for regular large data loads: Another right-tool recommendation. While a JDBC or ODBC interface may be fine to load a few megabytes of data, these interfaces will not scale to the massive throughput of COPY and Snowpipe. Use them by all means, but not for large regular data loads.
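To illustrate the Snowpipe recommendation above, here is a minimal sketch. The pipe, stage and table names are hypothetical, and AUTO_INGEST assumes cloud event notifications have been configured on the underlying storage bucket.

-- Serverless, event-driven loading: each new file arriving on the stage is loaded automatically.
CREATE PIPE landing.customer_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO landing.customer_landing
  FROM @landing_stage/customers/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);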
Avoid scanning files: When using the COPY command to ingest data, use partitioned staged data files. This reduces the effort of scanning large numbers of data files in cloud storage.

Choose a suitable virtual warehouse size: Don't assume a 6X-LARGE warehouse will load huge data files any faster than an X-SMALL. Each physical file is loaded sequentially on a single CPU, so it is more sensible to run most loads on an X-SMALL warehouse. Consider splitting massive data files into 100-250MB chunks and loading them on a larger (perhaps MEDIUM size) warehouse.

Ensure 3rd party tools push down: ETL tools like Ab Initio, Talend and Informatica were originally designed to extract data from source systems into an ETL server, transform the data and write the results to the warehouse. As Snowflake can draw upon massive on-demand compute resources and automatically scale out, it makes no sense to have data copied to an external server. Instead, use the ELT (Extract, Load and Transform) approach, and ensure the tools generate and execute SQL statements on Snowflake to maximise throughput and reduce costs. Excellent examples include dbt and Matillion.

Transform data in steps: A common mistake by inexperienced data engineers is to write huge SQL statements that join, summarise and process lots of tables in the mistaken belief that this is an efficient way of working. In reality, the code becomes over-complex, difficult to maintain and, worse still, often performs poorly. Instead, break the transformation pipeline into multiple steps and write results to intermediate tables. This makes it easier to test intermediate results, simplifies the code and often produces simpler SQL that runs faster.

Use transient tables for intermediate results: During a complex ELT pipeline, write intermediate results to a transient table, which may be truncated before the next load. This reduces the time-travel storage to just one day and avoids an additional seven days of fail-safe storage. By all means use temporary tables where sensible, but the option to check the results of intermediate steps in a complex ELT pipeline is often helpful.

Avoid row-by-row processing: As described in the article on Snowflake query tuning, Snowflake is designed to ingest, process and analyse billions of rows at amazing speed, often referred to as set-at-a-time processing. However, people tend to think in terms of row-by-row processing, which sometimes leads to programming loops that fetch and update rows one at a time. Be aware that row-by-row processing is the single biggest way to kill query performance. Use SQL statements to process all table entries simultaneously and avoid row-by-row processing at all costs.

Use query tags: When you start any multi-step transformation task, set the session query tag using ALTER SESSION SET QUERY_TAG = 'XXXXXX', and clear it afterwards with ALTER SESSION UNSET QUERY_TAG. This stamps every SQL statement until reset with an identifier and is invaluable to system administrators. As every SQL statement (and QUERY_TAG) is recorded in the QUERY_HISTORY view, you can track job performance over time. This can be used to quickly identify when a change to a task has resulted in poor performance, identify inefficient transformation jobs, or indicate when a job would be better executed on a larger or smaller warehouse (see the sketch at the end of this list).

Keep it simple: Probably the best indicator of an experienced data engineer is the value they place on simplicity. You can always make a job 10% faster, more generic or more elegant, and that may be beneficial, but simplifying a solution is always beneficial. Simple solutions are easier to understand, easier to diagnose and therefore easier to maintain. Around 50% of the performance challenges I face are difficult to resolve because the solution is a single, monolithic, complex block of code. The first thing I do is break the solution down into steps, and only then identify the root cause.
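Here is a minimal sketch of the query tag practice above. The tag value and the use of the ACCOUNT_USAGE query history are illustrative assumptions; any role with access to that share could run the final query.

-- Tag every statement in the transformation job with a job identifier.
ALTER SESSION SET QUERY_TAG = 'nightly_sales_transform';

-- ... run the transformation steps here ...

ALTER SESSION UNSET QUERY_TAG;

-- Later, review the job's performance over time from the query history.
SELECT DATE_TRUNC('day', start_time)    AS run_date,
       COUNT(*)                         AS statements,
       SUM(total_elapsed_time) / 1000   AS total_seconds
FROM   snowflake.account_usage.query_history
WHERE  query_tag = 'nightly_sales_transform'
GROUP BY run_date
ORDER BY run_date;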
Conclusion

Snowflake supports many tools and components to ingest and transform data, and each is designed for a specific use case. These include traditional batch ingestion and processing, continuous data streaming, and exploration on a data lake. It's easy to get lost in the array of options and, worse still, to use an inappropriate tool for the task. I hope the description above and the best practices help you deliver faster, more efficient and simpler data pipelines.

Can you Help?

Was this article helpful? Can you explain why in the comments below? I'm writing a book, "Snowflake: Best Practices", and I'm interested in your opinion on what to include. What's important to you? I'd appreciate your thoughts.

Disclaimer: The opinions expressed on this site are entirely my own, and will not necessarily reflect those of my employer.

Comments

Robert · 5 months ago
As a Data Architect, I was looking into designing an end-to-end data pipeline and every tool that we could use for that, and stumbled on your comprehensive article. There is one important tool which I noticed is missing here - dbt.
What do you think about it and would you consider using it?

John Ryan · 3 weeks ago
Hey Robert, many apologies for the delay. Yes, DBT is excellent. I've kind of tried to avoid recommending (or criticising) any tools in the article because often the best tool is the one you already know. But yes, DBT has a well-deserved great reputation with the Snowflake Solution Architects. I believe I saw somewhere that Snowflake has actually invested in the company to ensure it has some influence in the tool design.

John Ryan · 4 months ago
Hi Robert. Excellent question, and I'm delighted to say the answer is absolutely yes. Firstly, DBT is in fact one of a number of ELT tools which can be used with Snowflake. The tools include DBT, Matillion and Informatica. There is a full list here: https://docs.snowflake.com/en/user-guide/ecosystem-etl.html Matillion and DBT stand out from the crowd, as these are relatively new entrants into the market and have been developed from the ground up to use Extract, Load and Transform techniques rather than "old school" ETL (Extract, Transform and Load). DBT is in fact very impressive and in my personal opinion is an excellent fit for Snowflake. Maybe I should write an article about the advantages and disadvantages, but I find it is very well received by data engineers who are more familiar and comfortable with coding rather than operating entirely through a graphical user interface. So the short answer is yes, DBT is an excellent transformation tool for Snowflake. Be aware, though, it is purely a transformation tool: it assumes the data has already been loaded into Snowflake.

Thaenraj Packiamani · 7 months ago
Ryan, very well explained, you connected all the dots in data engineering. Thank you so much for your great article.

John Ryan · 5 months ago
Thank you @Thaenraj - it's a real pleasure to help.

Kali · 8 months ago
This is an awesome post. Keep up the good work.

John Ryan · 5 months ago
Wow! Thanks @Kali - really delighted to hear that.

Любовь · 10 months ago
Hello! Thanks for this helpful article. I recommend another one that describes ETL in Snowflake: technology, tools, and best practices (https://skyvia.com/blog/snowflake-etl#skyvia). I think it complements your guide.

Rajesh · 11 months ago
What an excellent article Ryan. I am a big fan of your articles on Snowflake and other stuff since you were in CS :-). Please post a blog on an end-to-end real-time data engineering scenario using Snowflake (extracting from any source, loading into Snowflake and transforming, with any single example), which would help beginners. Thank you in advance.

John Ryan · 5 months ago
Hey @Rajesh - great to hear from you. Yes, the CS days seem a very long time ago. I'll take you up on the suggestion - I'm always looking for ideas for articles. So much better if it comes from the development community. Take care mate!
Alex · a year ago
Hi John, this page is very interesting. Would you also have information on the consumption of the data? You touched on the modelling aspect, but I would like to see the aspect of data going out of Snowflake, for example the Snowflake SQL API, the web browser interface including the new UI, or the unload function?

John Ryan · 2 years ago
Thanks for the amazing feedback Diane, Binu and Parag! It's a "labour of love" for me. Snowflake really is an astonishing database platform. Thanks again guys!

Diane Elinski · 2 years ago
How did you pack so much great information in one blog! Awesome John!

Binu Varghese · 2 years ago
This is a fantastic summary of what Snowflake does and what you should know about it. Keep up the great work John.

Parag · 2 years ago
Wow... query tag is a really cool feature; that alone makes Snowflake stand out from the other data warehouse solutions. Very informative article John. Great work.