5 Snowflake Mistakes You Can Avoid
Snowflake is an incredibly powerful cloud-based data solution. But with great power comes great… mistakes. With Snowflake, those mistakes can be costly.
Snowflake is designed to be incredibly efficient for certain workloads, such as online analytical processing (OLAP) or machine learning training. These highly intensive database jobs are what the platform was built and optimized for.
However, that means that it is not intended for short, transactional-type data activities. eCommerce inserts, quick data lookups, and the like aren't Snowflake's wheelhouse. That doesn't mean Snowflake can't perform these activities - it certainly can. What it does mean, though, is that organizations must be deliberate in how they use it, because misuse of the platform can be inefficient and expensive.
One Reason Why Snowflake Misuse is Expensive
It may seem counter-intuitive, but it isn't cost-effective to use Snowflake for shorter, less processor-intensive jobs. The shorter and simpler the query, the more money it wastes.
Why? Because Snowflake bills warehouse compute per second, with a 60-second minimum each time a warehouse resumes. A query that takes 40 or 50 seconds to complete therefore costs exactly the same as one that takes 2 or 3 seconds. With a 2-second query, you're paying for the operation plus 58 seconds of idle time.
This isn't the only mistake you can make with Snowflake. Still, it's one that clearly illustrates that the platform is optimized for particular types of jobs and that it's crucial to tune your usage of Snowflake to better align with its intended usage.
Common Mistakes When Using Snowflake
As you can see, failing to plan or optimize your Snowflake usage can result in a lot of waste. Let's look at a few other missteps that organizations commonly make when using this powerful platform.
Mistake: Not paying enough attention to optimization
While inherently short-running queries on Snowflake aren't cost-efficient, that doesn't mean that you should let your queries run rampant, either. Optimization is still an essential part of database activities.
We're typically dealing with large data sets where queries can take hours - or even days - to run on Snowflake. You'll still run into issues if you're using an unoptimized query or an underpowered warehouse.
Some things to remember about optimization of your queries:
- Limit the returns of subqueries: If your subqueries are returning hundreds of rows, it's probably time to refactor and create well-thought-out clauses and filters.
- Utilize time windows: Use time windows and filter to return smaller subsets where possible instead of returning whole data sets if they aren't needed.
- Watch for many-to-many relationships: These commonly inflate query time and produce bloated result sets. Use junction tables where appropriate.
- Operation order: The order of operations is also important. If you're aggregating data, think about where that can and should take place. Frequently, it makes more sense to aggregate a count in a subquery than return the data itself.
- Join types: In many instances, a full join is overkill. Examine whether the work can be done with a left, right, or inner join instead.
- Avoid Cartesian joins: Cartesian joins can be incredibly expensive for long-running queries.
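As a sketch, a query that follows several of these guidelines might look like the following. The table and column names are hypothetical, used only for illustration:

```sql
-- Aggregate early in a subquery, filter to a time window, and use an
-- INNER JOIN on an explicit key rather than a full or Cartesian join.
SELECT c.customer_id,
       c.region,
       o.order_count
FROM customers AS c
INNER JOIN (
    SELECT customer_id,
           COUNT(*) AS order_count                        -- aggregate a count, not the raw rows
    FROM orders
    WHERE order_date >= DATEADD(day, -30, CURRENT_DATE)   -- time window instead of the whole table
    GROUP BY customer_id
) AS o
  ON o.customer_id = c.customer_id;                       -- explicit join key avoids a Cartesian product
```

The subquery returns one row per customer rather than every order, so the outer join has far less data to process.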
One other area to pay attention to is clustering keys. If you are defining your own, you may not realize that Snowflake re-organizes your data around those clustering keys every time you insert into the table. Frequently changing tables can therefore quickly become expensive to maintain. Consider relying on Snowflake's built-in micro-partitioning and reserve clustering keys for very large, infrequently changed datasets.
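By default, Snowflake micro-partitions tables automatically with no action on your part. If you do decide a very large, rarely updated table warrants an explicit clustering key, the change looks like this (table and column names are hypothetical):

```sql
-- Cluster a large, append-mostly history table on its common filter columns.
ALTER TABLE sales_history CLUSTER BY (sale_date, region);

-- Inspect how well the table is clustered before and after:
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_history');
```

Checking the clustering information first can tell you whether a manual key is even needed.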
Mistake: Not finding the right warehouse balance
Are you running many queries? Or heavy queries? Looking at this can help you determine if you've got the right warehouse balance.
If your team is large, but your dataset isn't, smaller, multi-clustered warehouses are probably the right choice. These will disperse the load and keep processing time to a minimum. Keep an eye on the general runtime of your queries and consider economy scaling to get the maximum benefit from your warehouses.
On the other hand, if you have vast datasets that require complex analysis, you may need beefier warehouses. Setting a higher minimum cluster count prevents numerous long-running queries from repeatedly causing the warehouse to spin up and down.
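The two scenarios above might be configured roughly as follows. Warehouse names and sizes are hypothetical, and note that multi-cluster warehouses require Snowflake's Enterprise edition or above:

```sql
-- A small multi-cluster warehouse for a large team running many light
-- queries: economy scaling and a quick auto-suspend keep idle costs down.
CREATE WAREHOUSE IF NOT EXISTS team_wh WITH
  WAREHOUSE_SIZE    = 'SMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'ECONOMY'
  AUTO_SUSPEND      = 60      -- seconds of inactivity before suspending
  AUTO_RESUME       = TRUE;

-- For heavy analytical jobs, a larger single warehouse is often more
-- cost-effective than many small clusters.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh WITH
  WAREHOUSE_SIZE = 'XLARGE'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;
```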
Mistake: Ignoring the advantages of batch processing
What are your requirements for ingesting data in any of your processes?
Snowpipe is an excellent tool within Snowflake. It can be easy to misuse it, however. Files that are too small or too large hurt Snowpipe's performance, negating the benefits of the tool; Snowflake's general guidance is to aim for files of roughly 100-250 MB compressed.
It's also worth determining when your data is needed. Do you really need it right away? Bulk loads can be an intelligent approach if that's not a genuine requirement. You may not need to go the route of frequent inserts of individual rows but instead can utilize Snowflake's "COPY INTO" feature.
Sizing your warehouses appropriately for your bulk loads is also essential. If you are doing a bulk load once a day, monitor the first few loads for the average runtime. You can resize your warehouses accordingly based on that information.
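A daily bulk load with COPY INTO might look like the following sketch, where the table, stage, and path are hypothetical:

```sql
-- Load the day's staged files in one batch instead of streaming
-- individual rows throughout the day.
COPY INTO daily_events
FROM @daily_stage/events/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'ABORT_STATEMENT';   -- stop the load if any file fails
```

The output of COPY INTO reports rows loaded and time taken per file, which is useful for the runtime monitoring described above.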
Mistake: Overusing Time Travel
Time Travel is a stellar feature of Snowflake. Knowing that you can query a point-in-time snapshot of your data from up to 90 days back (on Enterprise edition and above) is empowering.
But is it necessary? It's rarely required to save data for that long. In many cases, a week of history allows enough time to notice and correct mistakes. Even mission-critical data, like accounting data, can often be kept for only a month. Evaluate how frequently the data changes to determine the right level of data insurance, and separate your concerns: test data may not need historical retention at all, whereas production data may require longer retention and tighter access.
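Right-sizing retention is a one-line change per table. The schemas and tables below are hypothetical:

```sql
-- Keep a week of Time Travel history on an important table:
ALTER TABLE accounting.ledger SET DATA_RETENTION_TIME_IN_DAYS = 7;

-- Test data that never needs recovery can live in a transient table,
-- which skips Fail-safe storage and caps Time Travel at one day:
CREATE TRANSIENT TABLE scratch.test_results (
    run_id  INT,
    outcome STRING
) DATA_RETENTION_TIME_IN_DAYS = 0;
```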
Mistake: Failing to properly configure access and resources
It should also be noted that Snowflake allows for fine-grained role-based access and resource management. This can do more than just keep one query from interfering with another.
Consider creating views if you have several teams interested in different parts of the same data. Plus, secure views have the added benefit of limiting the data that users can see and act on.
Materialized views may be a good solution if the data doesn't change frequently. A pre-computed view keeps you from burning Snowflake credits constantly re-creating the same result sets.
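Both kinds of view are straightforward to create. The names and columns below are hypothetical, and note that Snowflake materialized views can only query a single table and incur background maintenance costs as that table changes:

```sql
-- A secure view exposing only the columns one team needs,
-- hiding the view definition and the rest of the data:
CREATE SECURE VIEW marketing.customer_contacts AS
SELECT customer_id, email, region
FROM raw.customers;

-- A materialized view pre-computing a commonly requested aggregate:
CREATE MATERIALIZED VIEW analytics.daily_order_totals AS
SELECT order_date, SUM(amount) AS total_amount
FROM raw.orders
GROUP BY order_date;
```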
You should also consider using resource monitors when possible. It can be easy for a team to get carried away playing with data and not be aware of the cost incurred. Resource monitors and credit quotas can bring awareness to the amount of needless exploration that could be happening within a team.
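A resource monitor with notify and suspend triggers can be sketched like this (the monitor name, quota, and warehouse are hypothetical):

```sql
-- A monthly credit budget: warn the team at 75%, suspend at 100%.
CREATE RESOURCE MONITOR IF NOT EXISTS team_budget WITH
  CREDIT_QUOTA    = 100
  FREQUENCY       = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 75  PERCENT DO NOTIFY
    ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to the warehouse it should govern:
ALTER WAREHOUSE team_wh SET RESOURCE_MONITOR = team_budget;
```

The NOTIFY trigger gives the team early warning, while SUSPEND acts as a hard stop on runaway exploration.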
Costly Snowflake Mistakes Can Be Avoided
Snowflake is a powerful tool, but it isn't a panacea. Mistakes with the platform can be costly, and you can end up paying too much for bad code or architecture. Proper planning, allocation, and monitoring will help you eliminate as much waste as possible. And, of course, if you have questions, Big Compass is here to help. Snowflake is a core technology for Big Compass, and our architects and engineers can get your Snowflake instances optimized and save you money.