Many companies are overspending on their data infrastructure. I know, I've seen plenty of them: some, obviously smaller, are spending tens of thousands; others hundreds of thousands if not millions of dollars on data infrastructure that's often poorly optimized and built, and that could be costing them twice as much as they should be spending, if not more. And don't just believe me. Here's a great article, if you missed it, about someone who accidentally saved their company half a million dollars simply by fixing a few configurations in Snowflake.

This is happening everywhere. There are so many ways companies are wasting money on their data infrastructure, and I wanted to talk today about ways you can check that your team is fully optimizing its data infrastructure spend, so you can focus more on actually delivering value and not on explaining why you're spending $500,000 on data infrastructure.

Now, before diving too deep into that, I just want to say: hey there everyone, thanks so much for joining the studio. My name is Ben Rogojan, aka the Seattle Data Guy, now in Denver. If you don't know me, I work as a data engineering and infrastructure consultant. Prior to this I worked at Facebook full-time for three years, and before that at some startups and in healthcare.

What I find interesting about costs in the data space is that I was once talking to an engineer who told me they didn't want to consider cost as one of their requirements; they didn't want to be inhibited by whatever costs were going to be incurred. That's strange to me, because at the end of the day, whether you're building a bridge or writing software, cost is always a factor. What materials you can use, how safe something is and how long it will last, how well you abstract and build things into your software: all of it will likely be impacted somewhat by cost. Yes, there are obviously timelines and everything else involved, but costs play a major role in how we
build what we build.

In fact, before I start almost any project with my clients, one of the conversations I have is: one, how much do you want to spend on infrastructure up front, and two, how much do you want to spend monthly? Now obviously everyone wants to spend zero, but that's not going to happen; there's always a trade-off. If you use an open source solution, more than likely you are going to pay for it in other ways, such as monthly maintenance and time. So there is generally always a trade-off. Sometimes it will be worth it to go the open source route, and sometimes a vendor solution makes the most sense, but you need to go through the process of thinking it through: do we build it, do we use open source, do we buy something off the shelf?

But let's say you've already built your solution, you're now spending way too much, and you're not sure how to fix it. Let's go over a few key places where I constantly see people spend way too much money on their data infrastructure. As you go through this, you can follow the chart I'm going to put together. It's basically what I used to do when I looked for optimizations in terms of time, back when things were not on the cloud and your only concern was mostly performance, like how long an ETL process was running. You can now add cost to it as well, and go through the process of first figuring out which problems are causing these costs.

One of those problems might be that you're spending so much on a solution that it might as well be the cost of an employee. That sounds a little crazy, but I recently had a client who was quoted nearly $200,000 for a solution to ingest data from some sources that were really just one source: a single database. That's what the solution cost. Obviously the solution made a lot of sense early on, when they were paying for some APIs and
smaller data pulls, but as soon as they connected it to the application database, their bill ballooned. They were quoted $200,000 a year just to pull data from a database, something that most of us could probably write ourselves in terms of building the connector and pulling the data. Yes, there are other things you have to handle, such as making sure you're only pulling in the most recent data, and there are other little details, but the fact that they were spending essentially $200,000 just to pull that data over was a little shocking.

In this case we luckily found a solution, because they didn't want to hire a whole other employee to build something similar, and that solution was Estuary. For those of you who know, this specific project is why I became an advisor with Estuary: they helped me reduce this company's bill by about 80%, and we did it all within about the span of a month.

As we went through the list I talked about, we basically broke down all these processes: ingesting data from API 1, API 2, API 3, the database, and put the cost next to each one. They were basically spending $200k in one place, so it was clear that's where we were going to focus. And that's what you're going to do: go through the process of listing out what you could actually improve. What are the different steps in the process? Is there a transformation you can improve? Is there a dashboard that's costing you a lot of money? Listing it all out is a great way to keep track of what's costing what, especially since most companies won't give you this information, or at least will make it hard to get to; you'll likely have to do some level of parsing it out yourself.
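To make that "list everything out" step concrete, here's a minimal sketch of a cost inventory in Python. Every step name and dollar figure below is hypothetical; the point is simply ranking your processes by spend so the focus area picks itself:

```python
# Hypothetical cost inventory: pipeline steps mapped to estimated monthly spend.
# All names and dollar figures are invented for illustration.
monthly_costs = {
    "ingest: API 1": 500,
    "ingest: API 2": 750,
    "ingest: production database": 16_500,  # roughly a $200k/year quote
    "transform: staging models": 1_200,
    "dashboard: executive overview": 2_500,
}

def biggest_cost_drivers(costs: dict, top_n: int = 3) -> list:
    """Return the top_n (step, monthly_cost) pairs, most expensive first."""
    return sorted(costs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

for step, cost in biggest_cost_drivers(monthly_costs):
    print(f"{step}: ${cost:,}/month")
```

Even a table this small makes the decision obvious: the database ingestion dwarfs everything else, so that's where you spend your energy first.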
Next, another problem I constantly see companies deal with is the view-on-view-on-view problem. Again, if you remember the chart, what's interesting here is that you'll see both that the cost of the processes connected to this is generally very high, and that performance is poor: it generally takes something like ten minutes for dashboards to come back to people. This is one of those things where, if people tell me their dashboards are taking ten minutes and they're spending a lot of money, I'll often say I can kill two birds with one stone, because generally the issue is they've built some sort of view-on-view-on-view system where it just takes way too long to process all the data. And on top of that, every time you query it you're re-running all of those views, meaning you're paying for all of that compute, versus pre-running the data.

Yes, there are trade-offs here: maybe your data won't be as up to date as if you were going directly to the raw state with some streaming involved. Sure. But is that necessary? It's one of those questions I often ask people: do you really need this data to be live? What decisions are you making with it? You really have to understand what they're trying to do, because they need to understand that they're paying for the fact that this data is live. Do they want that, or do they need to look for a different solution to give them that ability? There are so many options here, but in general I've found that if there's a view-on-view-on-view situation that's costing a ton of money, more than likely you can fix it by building some more permanent data models, or tables, that actually sit there to support those needs.
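That trade-off between re-running stacked views on every query and pre-building a table can be sketched with some back-of-the-envelope math. This is a toy model, and every number in it is invented; it just shows why the pre-built table usually wins on a pay-per-query warehouse:

```python
def stacked_views_cost(view_chain_cost: float, queries_per_day: int) -> float:
    """View-on-view-on-view: every dashboard query re-runs the whole chain."""
    return view_chain_cost * queries_per_day

def materialized_cost(view_chain_cost: float, refreshes_per_day: int,
                      read_cost: float, queries_per_day: int) -> float:
    """Pre-built table: pay the full chain once per refresh, then cheap reads."""
    return view_chain_cost * refreshes_per_day + read_cost * queries_per_day

# Hypothetical numbers: a $2 view chain hit 500 times a day,
# versus refreshing a table hourly (24x/day) and serving $0.01 reads.
live = stacked_views_cost(2.00, 500)
batched = materialized_cost(2.00, 24, 0.01, 500)
```

With those made-up numbers the stacked views run about $1,000 a day while the hourly-refreshed table runs about $53, and the only thing you gave up is sub-hour freshness.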
Now, with that, real time, which I just brought up, is expensive in the cloud. It really is. It's not just that it's technically hard; honestly, there are a lot of solutions, and again I'll bring up Estuary, that make it very easy. But even in cases where we've used Estuary, we had to turn off the real-time behavior because we didn't need it and we were having to pay for it. Not from Estuary, mind you, but from Snowflake, because that's how Snowflake works. If you're loading data into Snowflake, every time you tell the warehouse to turn on, it will generally run for at least a minute, and you will pay for that minute.

So the way to think about this is: say you have 60 tasks, and you want each one to fire whenever its data is live. Great, but if they're spread perfectly across every minute of the hour, you're going to pay for a whole hour of Snowflake. Whereas if you batch all 60 of those tasks at the end of the hour, and the batch only takes a minute to run, you'll pay for one minute of Snowflake: one sixtieth of what it cost you to load that data live every minute.

So if you're seeing that you're spending too much because you're doing real-time data and you pay every time that data fires, and this is something I've dealt with as well, consider batching your data. I get that batch is boring, but one, it works, and two, it ensures you're not paying as much, again if you're on Snowflake. If you're on-prem you can probably get away with real time, as long as your system requirements can handle it. But yes, real time is expensive, and I don't just mean technically. We're starting to get to a point where companies and solutions exist that make it easier, but you'll often pay for it somewhere else, so consider that as you're doing real time.

Again, we have this list of things: you'll see the ingestion at $200k; maybe real time is costing you extra thousands because it's a real-time pipeline versus a batch pipeline that would be $10k or $5k or $1k; and then maybe your dashboards are costing you $30,000. I've often seen people spend a ton, especially on dashboard solutions that are connected straight to the raw data.
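The batching arithmetic above is simple enough to write down. This sketch assumes Snowflake-style billing where each warehouse wake-up is billed for at least a full minute; the per-minute rate is a made-up placeholder, since real rates depend on warehouse size:

```python
COST_PER_BILLED_MINUTE = 0.05  # placeholder rate, not a real Snowflake price

def per_event_loading_cost(loads_per_hour: int = 60) -> float:
    """Firing a load every minute: each one wakes the warehouse
    and is billed for at least one full minute."""
    return loads_per_hour * 1 * COST_PER_BILLED_MINUTE

def hourly_batch_cost(batch_runtime_minutes: int = 1) -> float:
    """One batch at the end of the hour; here it finishes within a minute."""
    return batch_runtime_minutes * COST_PER_BILLED_MINUTE

ratio = hourly_batch_cost() / per_event_loading_cost()  # one sixtieth
```

Whatever the actual rate is, the ratio is what matters: sixty one-minute wake-ups cost sixty times one batched minute of work.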
It's just funny when I think about this: I've actually had a vendor tell a client of mine that, hey, the best way to set up our dashboarding solution is to connect it to your raw data and just do all the processing in our dashboarding tool. And it's like, hey guys, I get what you're saying; in theory that gives you the ability to granularly look at the data. Great. But the client will pay for that, and the vendor didn't tell them. They're not going to tell them that Snowflake, or whatever cloud solution you're using, will charge you every time. So it's very important to realize, especially if you're using something like Snowflake or Databricks where you pay for consumption, that there is a trade-off: as you consume more, it's geared to be more expensive than using the near-equivalent of a database that just exists all the time.

Finally, a topic I've talked about a few times now on this channel is data modeling, which I need to do more of; expect that soon, probably in January. Bad data models, and I liked that someone posted this, can cost companies a lot of money. Maybe it's because of how the models are set up and how they rebuild themselves. Perhaps you recreate the table every time, which can be fine; it's an easy way to solve a problem if your tables are small. But as soon as your tables start getting big, you need to start thinking: hey, can we append data? Is a merge faster? Sometimes it's not; merge is one of those things where sometimes it's better and sometimes it isn't. So it's very important to think through all of the possible decisions you make, like: what is the most efficient way to run this script when I'm building the data model?

Now, all of that is just building the data model, the physical action of building it. But what about how it is modeled?
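To see why the recreate-every-time approach stops scaling, here's a toy model that just counts rows written per load. It deliberately ignores merge, whose cost depends on how much of the table the engine has to scan, which is exactly why merge is sometimes faster and sometimes not:

```python
def full_rebuild_rows_written(existing_rows: int, new_rows: int) -> int:
    """Recreate the table: rewrite everything that already exists
    plus the new data, on every single load."""
    return existing_rows + new_rows

def append_rows_written(existing_rows: int, new_rows: int) -> int:
    """Append-only load: write just the new data."""
    return new_rows

# Small table: a rebuild writes 11,000 rows instead of 1,000. Wasteful, survivable.
small = full_rebuild_rows_written(10_000, 1_000)
# Big table: a rebuild writes ~1 billion rows just to add 1 million.
big = full_rebuild_rows_written(1_000_000_000, 1_000_000)
```

The rebuild cost grows with the size of the whole table while the append cost grows only with the new data, which is why the "easy" pattern quietly becomes the expensive one.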
This is where you'll often see people reach for Kimball, one big table, all these different approaches to how you could possibly build your data model. And as per usual, there are trade-offs in all of these. If you do Kimball, there is that cost where, hey, you have to join, and joins are going to be expensive. If the model is heavily normalized, that is likely going to cost you a little more when people are building things. What I generally see people do in that case, because there are trade-offs in both directions, is a hybrid: they'll build the normalized tables and then build one big table for the analysts to work with. That's one way you can hopefully both reduce costs and simplify how people actually work with the data. But even there, you're committing to building that table, which takes more storage and also costs the compute of building it.

And that's why you have this chart we put together: you go through it and figure out what is actually costing you money. If you go table by table and see, hey, these tables are costing us a ton of money, is it worth it? You can actually answer that question. But usually people don't do the groundwork of laying out what actually exists, what it's costing, and where they should focus their energy. Instead they'll just take blind shots in the dark because they saw something was expensive in one place, without thinking about what the trade-offs are. I think that's the biggest thing here for me: as you're going through the process of trying to save your company money, it's generally all about trade-offs. Maybe you don't have real-time data anymore, or maybe your data isn't as granular in certain parts of your dashboard, but that can lead to the cost savings you're trying to achieve.

And if you're looking for a team of data experts who have helped companies save hundreds of thousands of dollars: honestly, my goal next year is to save companies over a million dollars of data
infrastructure costs. So if you need help, even if it's just a phone call where in 30 minutes I can give you a few tips or pointers on how to save your company $50,000, I'd love to take that call. Feel free to set up a meeting with me via the calendar below. I'd really like to reach that million-dollar goal by the end of next year, so if that's something that interests you, feel free to reach out. Now, back to the regular programming.