5 More Snowflake Best Practices
Snowflake is a powerful data cloud tool. Of course, with great power comes… the potential for great mistakes. When talking about massive amounts of data and data processing, mistakes can be incredibly costly.
That cost isn’t just financial, although that can be significant. It can also be expensive in regards to time spent (and wasted), governance concerns (including security), and access (including too much and too little).
As such, it’s crucial that you carefully consider how you’ll be using Snowflake. Best practices are great guardrails for optimal usage - including ones we’ve talked about before:
- Separate needs operationally and/or by team: This is one of the most critical aspects of Snowflake. Snowflake is incredibly flexible and offers many features, not all of which are appropriate for every team or need. Don't be afraid to be granular with how the platform is used - it's built so that you can be.
- Assign resources to each team: Not everyone needs access to every piece of data in Snowflake, and it can be easy to get carried away with digging into the data. Use resource monitors and quotas to help keep an eye - and some control - on usage.
- Leverage Snowflake’s limitless optimization features: The Snowflake team has dedicated much time to creating a platform that can be optimized from nearly every angle and direction. Be sure to understand what options are available from an optimization standpoint.
- Secure your data: Some of the data you house is sensitive, including PII, financial, etc. Use various Snowflake features, like views, to give teams access to the information they need and protect the data they shouldn’t see.
- Automate loading of new entries: When do you need your data? Configure loads based on need - bulk loads/batching, continuous auto-ingestion, and direct query are all ways to have Snowflake consume cloud data.
These are just the start of the best practices for using Snowflake. In the rest of this piece, we’ll explore a few more.
Getting the Most from Snowflake with Best Practices
Think ELT, not ETL
What is it? While many people architect around an ETL methodology when building out a warehouse or database, Snowflake challenges this paradigm and works as an ELT platform - Extract, Load, then Transform. Loading the data before transforming it has many benefits.
Why is it important? Loading before transforming your data can lessen your development overhead. Snowflake has built-in tools to automate the transformation and curation of your data. When you're familiar with these tools, you can consume data with minimal effort and transform it on an as-needed basis.
For instance, you can load JSON blobs into a variant column, then manipulate that data into a view or table. When someone has a specific need for the data, you can manipulate it at that time to suit their needs instead of forcing them to transform data into their required state.
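A minimal sketch of that pattern in Snowflake SQL (the table, stage, and JSON field names here are illustrative, not from a real system):

```sql
-- Land raw JSON as-is in a VARIANT column - no upfront transformation.
CREATE TABLE raw_events (
  payload   VARIANT,
  loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

COPY INTO raw_events (payload)
  FROM @my_json_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- Shape the data only when a consumer actually needs it.
CREATE VIEW curated_events AS
SELECT
  payload:event_id::STRING  AS event_id,
  payload:user.name::STRING AS user_name,
  payload:ts::TIMESTAMP_NTZ AS event_ts
FROM raw_events;
```

Because the raw payload stays intact in the VARIANT column, you can add or change views later without ever reloading the source data.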
What will it prevent? Transforming after loading will reduce massive technical debt and development overhead that results from needing to curate data constantly as your organization’s data needs evolve.
When should you leverage this? Any time your data isn't perfectly curated or suited to your needs is an ideal time to shovel it into a data lake, then transform it into what you need after it has been loaded into your data warehouse.
Integrate Snowflake into your CI/CD pipeline
What is it? Snowflake can interface with your CI/CD cycle, keeping your code clean and allowing for testing.
Why is it important? Code that isn’t integrated and tested can have critical faults that may cause a production failure. Many people don’t include their warehouses and associated assets in the CI/CD pipeline, and with products like Snowflake, that’s a missed opportunity. Your code can still be stored in GitHub and branched (particularly any stored procedures or UDFs you’ve written) and should be tested and checked just like any other non-warehouse-based scripts.
What will it prevent? Simple - adding your Snowflake SQL code, scripts, and stored procedures to your CI/CD pipeline will prevent critical failures and code degradation. A few small changes may not seem worth the trouble. Still, over time they can create massive amounts of technical debt that is easily avoided using GitHub, Jenkins, and other tools for quality and automation.
When should you leverage this? Has your code base reached the point where multiple people are touching it? Are you doing rapid development? Then it's time to implement CI/CD for your Snowflake infrastructure. As our world moves towards data-driven operations and business decisions, this becomes more and more crucial for continuity.
Separation of concerns - by source, by schema, by type of data (and name it well!)
What is it? Whenever you have a new data source or type of data, you have an opportunity to consider how you will keep it separate within your existing system. Do you have multiple data sources? Schemas can separate them. Your data types can be divided this way as well. For instance, if you're dealing with massive amounts of data in different types - structured, unstructured, semi-structured - it can be advantageous to separate those. This can be especially valuable if you have one source sending large volumes of different types of data. You can use separate schemas with many tables to group the data from that one source.
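One way this layout might look in practice (database, schema, and table names here are hypothetical examples):

```sql
-- One database, one schema per source and data shape,
-- with source-prefixed names throughout.
CREATE DATABASE analytics;

CREATE SCHEMA analytics.salesforce_raw;      -- semi-structured API payloads
CREATE SCHEMA analytics.salesforce_curated;  -- transformed, typed tables
CREATE SCHEMA analytics.clickstream_raw;     -- high-volume event JSON

-- Raw data lands in the source's own schema, never a shared one.
CREATE TABLE analytics.salesforce_raw.accounts_json (payload VARIANT);
```

With this convention, tracing any curated table back to its original source is a matter of reading the schema name.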
Why is it important? Logically separating the data keeps it clean and easier to work with. At first blush, it may seem excessive, but as your data grows, you'll appreciate the ability to find source-specific JSON data in a single place. Also, if you're working to combine all of your data and need to trace it back to its original form, this separation can make it simpler to identify where your data originated from.
What will it prevent? When your data is all in one place, it can become hard to navigate, especially as it grows. That wastes time that would be better spent on development and analysis. Separation also streamlines issue resolution - when a problem surfaces in some of the data, it can be cumbersome to dig through unorganized databases to find it, and a production-level outage is the worst time to be doing that digging.
When should you leverage this? Pretty much all of the time. Get as granular as you can. Pay attention to your naming conventions across all of your objects, including schemas, tables, views, and Snowpipes. If something does go wrong or you need to act on a particular set of data, you'll be able to find it quickly.
Transactions: Reconcile, Reconcile, Reconcile
What is it? In this instance, we're referring to double-checking your data. There are many ways to do this, but the concept is essential to keep in mind in your pipeline.
Why is it important? Tainted data can and will skew results. Some simple steps, like reconciling the total transaction amount against a header record, validating the number of rows inserted into a staging table, or holding data from multiple systems to cross-reference single entries, will shore up the credibility of your warehouse data.
What will it prevent? You'll avoid bad reporting, over- or under-estimating, and poorly trained models. That list may seem short, but play out for yourself what a few bad or missed entries could mean further down the line and in all the places that data is used, and you'll quickly see why this is so crucial to your process.
When should you leverage this? It's most critical when reporting on transactions, fees, or money in general. Many data sources will give you sums by time period, or a header or footer record to validate against. Snowflake even has tools that let you reach out to other systems to check the data against. Use stored procedures or other methods to check data before it's used in reporting or before inserting it into a larger dataset. This can't be stressed enough - data quality is vital.
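A sketch of one such reconciliation check - comparing the rows loaded into staging against a control count supplied by the source system (all table and column names here are hypothetical):

```sql
-- Any rows this returns represent a batch that failed reconciliation
-- and should be held back before merging into the main dataset.
SELECT
  t.batch_id,
  MAX(h.expected_row_count) AS expected_rows,
  COUNT(*)                  AS loaded_rows
FROM stg_transactions t
JOIN stg_file_header  h
  ON h.batch_id = t.batch_id
GROUP BY t.batch_id
HAVING MAX(h.expected_row_count) <> COUNT(*);
```

A query like this can be wrapped in a stored procedure and run as a gate in your pipeline before data moves past staging.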
Plan for failures, backups, and failover
What is it? Backups and failovers are ubiquitous concepts. Snowflake offers multiple features to help plan for the eventual need for these. Time Travel, database replication, and failover features can protect your data when the worst happens.
Why is it important? Something, at some point, will go wrong. That's a given. Snowflake is a fantastic platform, but no technology is infallible. The old saying "Failing to plan is planning to fail" - a phrase attributed to everyone from Benjamin Franklin to Winston Churchill - is accurate. Planning for the moment of failure is one of the most crucial aspects of data warehousing. What will you do if your AWS region goes down and your data is inaccessible? How will you recover if someone truncates a production table with years of data? If bad data enters the warehouse, how will you identify it?
What will it prevent? Planning can't stop an issue from occurring, but it can mitigate the severity of the impact when everything goes wrong.
When should you leverage this? When your data is a critical part of your business, you always need to consider how to protect it. Take a look at Snowflake's recovery, failover, and backup features so that you understand what your options are, like:
- “Undrop” in Snowflake for errantly dropped tables
- Snapshots of your data from a week ago to determine if bad data was consumed
- Failovers to another account in another region for seamless operational continuity
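A few of these features in hedged, illustrative form (the object and account names are placeholders, and Time Travel queries depend on your configured data retention period):

```sql
-- Restore an errantly dropped table within the retention window.
UNDROP TABLE prod.sales.orders;

-- Query a snapshot from a week ago to check when bad data arrived.
SELECT *
FROM prod.sales.orders
  AT(OFFSET => -60*60*24*7);  -- seconds back in time

-- Replicate a database to another account for failover readiness.
ALTER DATABASE prod ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;
```

Note that longer Time Travel retention and cross-account failover depend on your Snowflake edition, so confirm what your account supports before relying on them.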
Remember, we previously said to be careful and considered with your use of Time Travel, but not to ignore it altogether. It and Snowflake's other disaster prevention and recovery features will help you sleep at night.
(Bonus best practice): Stay in the loop
Snowflake is a dynamic platform. The company constantly brings out new and enhanced features meant to make your warehouse management easier, offer better protection of your data, and make access more controlled and flexible.
Make it a habit to check Snowflake's documentation and look for webinars and info sessions Snowflake might have to introduce features. When possible, take part in public previews - it's also an opportunity to learn new skill sets. By keeping up to date with what's new and emerging with Snowflake, you'll be able to build performant data warehouses with ease and with future-proofing top of mind.
Snowflake is a powerful system with an extensive set of features and options. Leveraging best practices when using Snowflake as much as possible - including during planning, allocating, and monitoring - will help you realize the value the toolset can bring while managing financial and opportunity-based costs.
Big Compass's extensive experience with the platform in some of the world's largest and best-known companies can give you an edge in seeing the greatest value and success with your Snowflake usage. Need help applying these practices to your Snowflake implementation? Contact us today to chat about how we can help you with your data platform.