Today we're going to be talking about idempotent pipelines and slowly changing dimensions. Let's dig into it.

All right, so what are we covering today? Two things, really. First, idempotent pipelines. "Idempotent" is a very interesting word; definitely look it up in the dictionary right now. One of the things you might notice after you look it up is that you're more confused about what it actually means, because the definition is kind of mathematical. And then we're also covering slowly changing dimensions, and not just slowly changing dimensions but dimensional data modeling in general. The most complex part of dimensional data modeling is slowly changing dimensions; the rest of it is actually pretty straightforward. So I'm really excited to get through this with y'all today. Let's go.

All right, so idempotent pipelines are critical. If you're writing pipelines that are not idempotent, you're going to frustrate a lot of your coworkers and create all sorts of crazy data quality problems that are very hard to troubleshoot. So what does idempotent mean? I had y'all look in the dictionary, but here's the dictionary definition: denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself. A lot of you are probably going, what the hell is that? It's a very mathematical definition. The key thing to remember here is that your pipeline should produce the same results regardless of when it's run. And idempotency actually captures more than just when it's run; it has a bunch of other aspects as well. Your pipeline should produce the same results regardless of the day you run it, regardless of how many times you run it, and regardless of the hour you run it. What this means is that when you run your pipeline in production and when you run it as a backfill, you should get the same data. It should be the same; it's not going to be any different. Keep in mind that a lot of this idempotency stuff applies much more critically to batch pipelines than to streaming pipelines. Streaming pipelines are a little bit different and a little bit harder to replicate, though even streaming pipelines can have their own idempotent feel to them.

Let's dig a little deeper with an example. You want to think of your pipeline like a mathematical function. Take f(x) = 2x: if you plug in 4, you get 8. It doesn't matter if you plug in 4 on Christmas, you get 8. Plug in 4 on New Year's, you get 8. Plug in 4 on the Fourth of July, you get 8. It's the same, always. So the key thing with idempotent pipelines is that if you have the same input data set, you should get the same output data set. The pipeline is like a function; it performs a transformation. But that's not always how people architect their pipelines, and not always how they implement them. f(x) = 2x is an idempotent function. So let's dig a little deeper into what this even means: why is troubleshooting non-idempotent pipelines hard?
Think about it this way: say you're running a pipeline in production and you find an error, so you backfill to fix the data quality bug, and the backfill actually changes the data because your pipeline isn't idempotent. Now your backfilled data doesn't match your old production data. Even though you fixed a quality error, you have new inconsistencies in your data, because your pipeline is not idempotent. Generally speaking, these errors are not caught anywhere. They're not caught by data quality checks, and they're not caught by unit tests or integration tests, because all of those make assumptions about when or how many times the pipeline is executed; a unit test assumes your pipeline is executed one time. Now, if you're a super sweaty engineer, you might be someone who says, "actually, I run unit tests to check whether my pipeline is idempotent," but the number of those unit tests I've seen in my career is about one in ten years. So it's not a common thing, but it is a good test you might want to consider putting in your pipelines.

Let's dig in more here, because this concept is tricky to explain. So what can make your pipeline non-idempotent? INSERT INTO is a great example of something that can be non-idempotent, because if your pipeline uses INSERT INTO and you run it twice, now you have 2x the data. It's not f(x) anymore: the first time you plugged in the data you got one result, and the second time you got double the result. It's not a mathematical function anymore, and the transformation is no longer idempotent if you're just using INSERT INTO by itself. There are some workarounds here. A lot of people like to truncate the table and then insert into it, because that makes it idempotent again: the whole flow becomes something you can run over and over and over again and get the same result, which is what idempotency is all about.

Another great example of when a pipeline becomes non-idempotent is filtering with only a start date. Imagine you say, "give me all the data after January 1st, 2023." One of the problems with that is you need something on the other end, a less-than, so that you process a bounded window of data. Here's why: say you want all the data after January 1st, 2024, and you run the pipeline in production on January 1st. You get one day of data. But if you backfill January 1st on January 7th, now you get seven days of data, and so on and so forth. If you don't specify how big your window is supposed to be, then when you backfill you'll process a bigger and bigger window of data, and your pipeline won't be idempotent anymore.
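Here's a minimal sketch of both failure modes in SQL. The analytics.daily_events and raw.events tables are hypothetical names, just for illustration:

```sql
-- Non-idempotent: plain INSERT INTO appends another copy on every run,
-- and the open-ended filter pulls a bigger window every day you rerun it.
INSERT INTO analytics.daily_events
SELECT user_id, COUNT(*) AS event_count, DATE '2024-01-01' AS ds
FROM raw.events
WHERE event_date > DATE '2024-01-01'
GROUP BY user_id;

-- Idempotent: clear what you're about to write, then write a bounded window.
-- (Engines that support INSERT OVERWRITE collapse these two steps into one.)
DELETE FROM analytics.daily_events
WHERE ds = DATE '2024-01-01';

INSERT INTO analytics.daily_events
SELECT user_id, COUNT(*) AS event_count, DATE '2024-01-01' AS ds
FROM raw.events
WHERE event_date >= DATE '2024-01-01'
  AND event_date <  DATE '2024-01-02'   -- explicit end: exactly one day, always
GROUP BY user_id;
```

The point of the second version is that reruns replace rather than append: run it once or run it fifty times, the partition looks the same.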
Now, depending on your WHERE clauses, there is a chance a pipeline like that could still technically be idempotent; that can definitely be the case. But even if it's still idempotent, it's going to be inefficient, because you're reading and then throwing away all of that extra data. So keep that in mind.

Another thing to think about: your pipeline might not be using a full set of partition sensors, so in production you might be running with an incomplete or partial set of data. Then your f(x) = y situation no longer holds, because you changed x. You aren't waiting for all of your inputs to be ready, so you're running with an incomplete x, and that's why y changes. So make sure you have your full set of partition sensors so that your pipeline isn't running on partial data. You would be surprised: this is a problem I saw probably two or three times a year in my time in big tech, so it's a very common problem. When you set up your partition sensors in Airflow, make sure they're actually checking for the full data you're looking for.

Last but certainly not least: not using depends_on_past for cumulative pipelines. Cumulative pipelines are different from other pipelines in that each run actually depends on the previous run, because the data accumulates. That's why you have to use depends_on_past: the pipeline has to run sequentially. You have to run yesterday, then today, then tomorrow; that's the only way you can do it. If you don't, what can happen is that yesterday's run succeeds, today's doesn't, and when tomorrow arrives, if tomorrow's run doesn't depend on yesterday's, it runs anyway, breaks your accumulation, and you no longer have a cumulative pipeline. That can be a big problem. So when you're checking your pipelines for idempotency, make sure you're using depends_on_past in your cumulative pipelines. And just letting y'all know: a cumulative pipeline and a slowly changing dimension pipeline are the same thing, so this applies to a lot of the rest of this presentation as well.
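Here's a minimal sketch of the cumulative pattern that makes depends_on_past necessary; the table and column names are hypothetical:

```sql
-- Each run folds today's new data into yesterday's accumulated partition,
-- so runs must execute strictly in order: yesterday, then today, then tomorrow.
INSERT INTO analytics.user_activity_cumulated
SELECT
    COALESCE(y.user_id, t.user_id)                      AS user_id,
    COALESCE(y.total_events, 0) + COALESCE(t.events, 0) AS total_events,
    DATE '2024-01-02'                                   AS ds
FROM (
    SELECT user_id, total_events
    FROM analytics.user_activity_cumulated
    WHERE ds = DATE '2024-01-01'          -- yesterday's accumulation
) y
FULL OUTER JOIN (
    SELECT user_id, COUNT(*) AS events
    FROM raw.events
    WHERE event_date = DATE '2024-01-02'  -- today's data only
    GROUP BY user_id
) t ON y.user_id = t.user_id;
-- If the 2024-01-01 partition never ran (or ran partially), every later
-- partition is wrong, which is exactly what depends_on_past prevents.
```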
Here are some more things that can make your pipeline non-idempotent, and I have some good stories for you here. A lot of times data engineers feel pressure to cut corners on their data sets because they need the data to be available sooner, so they say, "give me the most recent version of this data set" instead of "give me today's version of this data set." That again causes inconsistent behavior between production and backfill, because when you backfill, you will have the most up-to-date data, whereas in production you picked up whatever you had at the time, probably yesterday's or the day before's. So don't do that; generally speaking it's a bad practice. If you're building a pipeline that needs that kind of constraint, the better practice is just to wait on yesterday's data, wait on the day before, because then at least it's idempotent: backfill will have the same behavior as production. That can be a very powerful thing.

I want to give a quick story here about one of the last things that happened for me when I was working at Facebook, back in 2018. I was working on a pipeline to detect the inflows and outflows of fake accounts. For example, an account can be labeled fake, then pass a challenge and get unlabeled, then be labeled fake again, and then be deleted; there are all sorts of things that can happen in a fake account's life cycle. I was building a table on top of a table called dim_all_fake_accounts, tracking the inflows and outflows of that fake-account process, mostly to monitor the health of the machine learning. So I was producing all this data, and the analysts kept asking, "why doesn't this match the canonical user table at all?" This was a bug I looked at for about a month and couldn't quite figure out. Ultimately I realized that the dim_all_fake_accounts table I was reading from had exactly this latest-partition problem, which was just terrible, because no matter what I did, whether I was backfilling or in production, my output was inconsistent with other tables. Other tables might depend on the same upstream data sets, but if they depend on them in an idempotent way and you depend on them in a non-idempotent way, you're going to have inconsistent data, and people are going to be mad at you.

There is one exception here: you can rely on the latest partition of a properly modeled SCD table, assuming you also rely on the daily dimension table, so you know the daily dimension is ready and you can safely pick up that SCD table. Anything else, meaning relying on the latest partition of anything else, is almost always a bad idea. It produces all sorts of really hard-to-troubleshoot, really nasty bugs that I highly recommend you avoid.
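Sketched against a hypothetical dim_accounts table, here's the anti-pattern versus the fix (the {{ ds }} placeholder is Airflow's run-date templating):

```sql
-- Non-idempotent: "whatever is newest right now" gives different answers
-- to a production run and to a backfill of the same logical date.
SELECT account_id, is_fake
FROM dim_accounts
WHERE ds = (SELECT MAX(ds) FROM dim_accounts);

-- Idempotent: pin the read to the partition for the run's own logical date,
-- and let a partition sensor confirm that partition has actually landed.
SELECT account_id, is_fake
FROM dim_accounts
WHERE ds = DATE '{{ ds }}';
```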
OK, so remember the pains here; this is a very important part of what I'm trying to convey to y'all: the pains of not having idempotent pipelines. Backfilling causes inconsistencies between the old data and the restated data, and that's probably the biggest pain. It's so painful because backfilling isn't fun even when you do it right; it's probably one of the most pain-in-the-ass parts of being a data engineer. Now imagine you go through all that pain, end up with inconsistent data, and have to backfill again. When you have to do it twice, that's when you want to pull your hair out as a data engineer. It's not fun. Next: very hard-to-troubleshoot bugs. Why is it inconsistent? You could walk through all of your SQL, and all of your SQL could be right, every join perfect, and you still can't find the quality bug, because your pipeline produces inconsistent results when you backfill. Also, unit tests cannot replicate the production behavior: if you're relying on the latest partition in production, you can't emulate that pattern with a unit test, so unit test coverage cannot save you when you design your pipelines with these non-idempotent errors in them. And last but certainly not least: silent failures. Silent failures are my least favorite type of failure, because they're very disrespectful; they really waste your time. At least when a pipeline fails outright, like Spark throwing an out-of-memory exception, it screams at you: hey, we ran out of memory, what's going on? That's not the case with non-idempotent pipelines. The error just doesn't show up. It's not there. So that's definitely something to consider as you work through this kind of stuff.

OK, now we're going to talk a little bit about slowly changing dimensions. I'll get y'all the link when this goes to YouTube, but one thing to think about when you're modeling slowly changing dimensions is that they can become non-idempotent fairly easily. Some people actually say that slowly changing dimensions are kind of a scam, that we don't need them, and that it's better to keep daily dimensions, even though SCDs do give you compression. For example, say I'm on iPhone on Facebook, and I've been on iPhone since 2014. There are two ways to model that. One way: I was on iPhone today, and I was on iPhone every single day for ten years, so there are about 3,000 rows of me being on iPhone. The other way: one row saying I was on iPhone from 2014 to 2024, a single row with a range. The problem with the range is that it's not quite as robustly idempotent, because you can mess things up. I don't know if y'all know Max Beauchemin; he's the creator of Apache Airflow, and he says SCDs are terrible and we shouldn't be using them at all, that daily dimensions are the better way to go, mostly because storage these days is cheap. And if storage is cheap, why are we wasting our time modeling slowly changing dimensions when we can just keep the daily value every day? Honestly, there's a lot of merit to that. In my startup, I do not model slowly changing dimensions. There's no reason to; my data is too small. A lot of the wins you get from SCDs only show up when you have very big, FAANG-scale data.

So let's talk about the three ways you can go. You can use the latest snapshot, which is terrible: it's not idempotent, because if you use the latest snapshot for your dimension, your backfills will be wrong. You can use a daily, monthly, or yearly snapshot, which is better; that option is very idempotent. The problem is that it doesn't compress very well and you carry a lot of extra data, but storage is cheap, so generally speaking it's also not a half-bad option. And the last option is modeling a slowly changing dimension.
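To make that trade-off concrete, here are the two shapes side by side as inline rows; user 123 and the column names are made up for illustration:

```sql
-- Daily snapshot shape: one row per entity per day, so a value that never
-- changed in ten years still costs roughly 3,650 identical rows.
SELECT * FROM (VALUES
    (123, 'iPhone', DATE '2014-01-01'),
    (123, 'iPhone', DATE '2014-01-02')  -- ...and one more row per day since
) AS daily_snapshot(user_id, device, snapshot_date);

-- SCD shape: one row per unbroken stretch of the same value.
SELECT * FROM (VALUES
    (123, 'iPhone', DATE '2014-01-01', DATE '9999-12-31')
) AS scd(user_id, device, start_date, end_date);
```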
Another thing to think about when choosing whether to model a slowly changing dimension is: how slowly is this dimension actually changing? If you have a dimension that changes every day, modeling it as a slowly changing dimension is probably pretty pointless, because you could just do a daily snapshot, call it a day, and it's simpler; you don't have to think about it. I remember this from when I worked at Netflix and was modeling EC2 instances and the IP addresses associated with those instances. They actually changed not every day but every hour, so it was almost the opposite of a slowly changing dimension; it was a rapidly changing dimension. In that case you get no benefit from modeling it as an SCD, because you're going to have roughly one row per day anyway, so there's no win. Generally speaking, a slowly changing dimension should be something that changes maybe once or twice a year. That's probably a pretty solid rule of thumb for when to model something as an SCD versus a daily snapshot. Just remember that, because data engineers get into this habit of wanting the data to be absolutely as small as it can possibly be, and they forget two things. One, they could be working on things that are more important than getting another 2% win on storage cost. Two, just because it's smaller doesn't mean it's easier to use, because if you model it as slowly changing, it can also cause non-idempotency problems, whereas daily snapshots never have that problem. If you just use daily snapshots, you will never have the non-idempotency problem.

Let's dig a little deeper into slowly changing dimensions, because it's a complicated topic as well. First off, why do dimensions change? Someone decides they hate iPhone and wants to move to Android, or the other way around; I did that back in 2014, when I switched from Android to iPhone. Someone migrates from team dog to team cat. When I was at Netflix, almost everyone had either "catflix" or "dogflix" stickers on their laptop, and back then I was on team catflix. Nowadays I'm such a different person. Look back there, you see that? That's a dog back there, in the big old thousand-dollar neon sign. If I went back to Netflix I'd be firmly on team dogflix, not team catflix anymore. And you can think of all sorts of other changes: someone migrates from the US to another country; your life changes over time. Maybe you gain some weight and your weight dimension changes, maybe you grow taller, maybe you get glasses. There are all sorts of things you can think of as dimensions that change.

OK, so how can you model dimensions that change? Some data engineers are tempted to model them as singular snapshots: one data set that holds whatever the latest value is. In my case, that data set would say I'm on team dog now, with no understanding that I used to be a cat person.
You can see how storing things as singular snapshots would be bad: if we backfilled using that as the source, my catflix history at Netflix would go away, it would no longer be a thing, and backfilling from that snapshot just wouldn't work. Very painful, very frustrating stuff to work with.

Another option is daily partitioned snapshots: every day you look at what the value is and you store it, so once a day you take a snapshot of the dimension. This does have some trade-offs. If a dimension changes more often than once a day, you only capture the day-over-day change at snapshot time, so you can lose some clarity on the actual dimensional mutations happening in between. Another option there is CDC, change data capture, which can be another very powerful way to model your snapshots: you get the full change log, so you capture every mutation. One nice thing about change data capture is that the change log essentially gives you a slowly changing dimension type 2 out of the box, which can be really nice.

The last option is to model a slowly changing dimension, and there are a bunch of ways to do that: type 1, type 2, type 3, and there are also types 4, 5, 6, 7; they keep going on and on, and most of them you will never need to use. It's one of those things where data engineers go, "oh my goodness, there are so many different ways to do this." If you've ever read the book JavaScript: The Good Parts, you know how the normal JavaScript book is five inches thick and The Good Parts is half an inch thick. You really just need the 10 or 20% of data engineering that delivers 80% of the impact. So let's dig a little deeper into the ways you can model these slowly changing dimensions.

OK, the types of slowly changing dimensions. Type 0 is a good one, because a type 0 dimension actually isn't slowly changing at all. For example, my dog Lulu is a Siberian Husky. She's been a Siberian Husky since day one and she will be a Siberian Husky until the day she dies, and even after. That's a dimension of hers that is actually not a slowly changing dimension. So if you're doing SCD type 0, remember that modeling it this way means it's not slowly changing; the only things that should be modeled this way are things that genuinely don't change, like my birthday. That's another great example. And one thing you'll notice is that there aren't that many things that fit this bucket. A user's registration date when they sign up is another unchanging value, but there aren't many values that are completely unchanging. So generally speaking, avoid type 0 unless you know for sure, for sure, that it's not going to change.
OK, type 1. Don't use type 1. Just never use type 1. When you model a slowly changing dimension as type 1, you give no shits about backfilling at all; backfill becomes a bad word, because type 1 keeps only the latest value, which means you don't care one bit about backfilling. That's why on this slide I say never, ever, ever, ever use this type of modeling: it makes your pipeline non-idempotent. Obviously, in this context I'm talking about analytical data modeling, where you care about history and trends and all that kind of stuff. If you're doing relational data modeling, say Postgres with primary and foreign keys and latest values, then type 1 probably fits. But keep in mind this entire presentation is about OLAP and analytical data modeling, not relational data modeling.

Now for what's probably my favorite slide in this whole presentation: type 2. SCD type 2 is the gold standard SCD. One thing to remember about SCD type 2 is that it's what Airbnb uses. At Airbnb, when you build out a gold-standard pipeline (they have this thing called the Midas process that creates gold pipelines, very high quality and very highly trusted), the only acceptable SCD modeling they allow is SCD type 2. The way SCD type 2 works is that you keep a history log of everything. Take my Android and iPhone history: I was an Android user from about 2009 to 2014, about five years, so there would be a record like (my ID, Android, 2009, 2014). You have four things: the ID of the entity, the dimensional value, the start date, and the end date. A lot of times these tables have one more column called is_current, which flags whatever the current value is. That's great, because guess what you get if you filter this table to is_current = true? You get type 1. You get all the benefits of type 1 and you still get a freaking idempotent pipeline. Pretty cool.

Here's how this works for the end date of the current record. It's usually modeled in one of two ways: either as NULL, or, as we did at Airbnb, as 9999-12-31, just a date really far into the future. Hopefully my pipeline isn't still running in 7,000 years; that's a very long time. I like the 9999-12-31 way of doing things, even though it does put quote-unquote fake data into the pipeline, and the reason is that it makes the BETWEEN syntax work better. If you want the record where a given date falls between start_date and end_date, the problem with NULL is that when one side of the interval is NULL, the BETWEEN comparison never evaluates to true, so the current record never comes back. You get nicer BETWEEN syntax if you model the end date as 9999-12-31 instead of NULL.
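As a minimal sketch of that shape (illustrative DDL and names, not Airbnb's actual schema):

```sql
CREATE TABLE user_device_scd (
    user_id    BIGINT,
    device     VARCHAR,   -- the dimensional value being tracked
    start_date DATE,
    end_date   DATE,      -- 9999-12-31 (or NULL) marks the open-ended record
    is_current BOOLEAN
);

-- Filter to is_current and a type 2 table hands you a type 1 view,
-- without giving up the history that keeps backfills idempotent.
SELECT user_id, device
FROM user_device_scd
WHERE is_current;
```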
I'm sure there are people in this live stream right now who think what I just said is absolutely absurd, and that's OK, because guess what: data modeling is an art, not a science, and it's all about how you want to use your data.

One thing that's kind of tricky about SCD type 2 is that it's harder to use, because when you join, you get a history of dimension values rather than a single dimension value, and almost always you just want a single dimension. So when you're joining SCD type 2 tables, you have to join on the entity identifier, like my user ID, but you also want to join on the date: whatever your reference date is, you require that it falls between the start date and end date of the dimension record. That can be a really powerful way to use your slowly changing dimensions; there's a sketch of the join pattern below.
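Here's that join pattern, with a hypothetical fct_events fact table; the BETWEEN behaves because the open record's end_date is 9999-12-31 rather than NULL:

```sql
-- Attach the dimension value that was true on the day of each event:
-- join on the entity id AND require the event's date to fall inside
-- the record's validity window.
SELECT
    f.event_date,
    f.user_id,
    d.device
FROM fct_events AS f
JOIN user_device_scd AS d
  ON  f.user_id = d.user_id
  AND f.event_date BETWEEN d.start_date AND d.end_date;
```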
Keep in mind that SCD type 2 is my all-time favorite slowly changing dimension, because it is purely idempotent. It's the only one that is. All the other slowly changing dimension types we talk about today, you can basically forget about. I'm going to spend the next ten minutes talking about things that don't really matter that much, but I want to be exhaustive, just to let y'all know, because some of y'all might be interested in learning how to model slowly changing dimensions in non-idempotent ways. So let's dig a little deeper into type 3 and friends.

Type 3 is a little bit different. What you get with type 3 is one row per dimension, holding the original value of the dimension and the current value of the dimension. The good here is that you're back to one row per dimension, so you don't have multiple rows; that can be a very powerful usability win. The problem is what happens when there's more than one mutation. For example, I grew up in Utah, so my first location dimension value was Utah; then I moved to Washington, DC; then I moved to California. In this data set you only have the original value, which for me would be Utah, and the current value, which would be California, so if I backfilled my data from when I lived in DC, it would be attributed either to Utah or to California, depending on how you model the data set. This one is kind of a nice middle ground: if you're going to pick something other than type 2 (and I highly recommend you don't, because type 2 is the best), I would recommend type 3, because you keep a lot of the value you get from type 2 while gaining usability from having only one row per dimension. That's pretty nice. But you do lose clarity whenever there are multiple dimension states over time, so generally speaking, for analytical pipelines, I'd say avoid type 3. I have used type 3 before in my relational data models, where we keep track of whatever the first value was and watch how it changes over time, and in those cases we don't really care about the entire history, so it can be a fine option there. But keep in mind that it's not idempotent: you still get silent failures and you still get inconsistencies on backfill if you use this type of slowly changing dimension. For me, that's the showstopper. That's why I say screw it, I'm not using type 3, though I'm sure people in the stream might think otherwise.

OK, let's go over this again very quickly. Type 0 and type 2 are idempotent. Type 0 is only idempotent because it's not actually a slowly changing dimension; it's a value that is the same all the time. For example, Lulu my dog is a Siberian Husky and always will be. Maybe in the future we'll get to some genetic engineering where you can change the breed of your dog halfway through her life, but we're not there yet, and we're probably still pretty far away from that. So if I backfill any of Lulu's data with Siberian Husky as the breed dimension, it will always be correct. Type 2 is the other idempotent one, and the reason is that you get the entire history: a start date and an end date for every single value, for exactly when it was active and inactive. We'll jump into a lab in just a little bit where we build a type 2 slowly changing dimension, so that should be really great.

Now for the things that make me sad as a data engineer. Type 1 dimensions: if you model a slowly changing dimension as type 1, you're a bad person. Don't do it; it's just terrible. Type 1 is the worst. At least with type 3 it feels like you tried, like you gave at least a little bit of a damn about history. Type 1 says: I don't care about history at all, only the latest value. Obviously there are other slowly changing dimension types you could work with, types 4, 5, 6 and beyond, but I'm not covering those in this presentation because I've never used them in my career and I don't think they're that valuable. Maybe this can be a springboard for you to read into them yourself; it just keeps going. The number of modeling options in data engineering is kind of way too much.

OK: slowly changing dimension type 2 loading. When you load your SCD type 2, there are two ways you can do it. The first way is to look at the entire history in one query.
That approach is inefficient, because you process all the data in one go; you just read the entire history. But it's one query and you're done, and you have all of it. It's great. The alternative is to incrementally load data after the previous SCD is generated: you take today's snapshot and append it onto yesterday's SCD, so you're not reading the entire history, or the entire daily-snapshot history, every day.

I want to talk about the difference with a quick story. I worked on unit economics at Airbnb for a while, and you can think of unit economics as a slowly changing dimension. Imagine you book a reservation: there's revenue associated with it, but then we refund you later, so depending on the day, Airbnb either has that revenue or it doesn't. It changes over time. One thing I noticed was that I wanted to build the incremental version of that pipeline, but it was complicated, because the only way to do it is to pick a cutoff point after which a reservation can no longer experience changes, and I couldn't figure out what the right cutoff point would be. So I said, OK, we'll just load all of history every day. As an engineer I thought, damn, this is very, very inefficient, but it's also elegant, because it's just one query and it's done. You don't even really backfill in stages; it's not like backfilling multiple stages in Airflow. You backfill one day and you're done. It's pretty cool, but it has its own problems. That full-history-in-one-query approach is what we're going to build in the lab today.

Obviously, the other way to go is the incremental load, where you have yesterday's SCD and you fold in today's data. A lot of that data will be unchanged, some of it will be changed, some will be new, some will be old, all that kind of stuff, and you have to think through all of the different permutations, which can get kind of complicated.
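If you do go incremental, here's a rough sketch of the pattern, reusing the hypothetical user_device_scd table from earlier plus a today_snapshot input. A real version also has to handle late-arriving data and deletes, which is exactly the permutation explosion I mentioned:

```sql
INSERT INTO user_device_scd_new
-- 1. Closed-out history carries over untouched.
SELECT user_id, device, start_date, end_date, is_current
FROM user_device_scd
WHERE is_current = false

UNION ALL

-- 2. Current records: stay open if unchanged, close out if the value
--    changed (or the entity disappeared from today's snapshot).
SELECT
    s.user_id,
    s.device,
    s.start_date,
    CASE WHEN t.device = s.device
         THEN DATE '9999-12-31'
         ELSE DATE '2024-01-01' END      AS end_date,
    COALESCE(t.device = s.device, false) AS is_current
FROM user_device_scd s
LEFT JOIN today_snapshot t ON s.user_id = t.user_id
WHERE s.is_current = true

UNION ALL

-- 3. Open a new record for changed values and for brand-new entities.
SELECT t.user_id, t.device, DATE '2024-01-01', DATE '9999-12-31', true
FROM today_snapshot t
LEFT JOIN (
    SELECT user_id, device FROM user_device_scd WHERE is_current = true
) s ON t.user_id = s.user_id
WHERE s.user_id IS NULL OR s.device <> t.device;
```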
Making unit economics incremental was always a dream of mine at Airbnb, so that it wasn't processing all of history every day, but I kept coming back to the realization that there were more important things for me to work on. It wasn't the number one priority. That's another thing to remember when you're working with slowly changing dimensions, or any data engineering task: these optimizations often come at the expense of other things you could be working on. And that's the last thing I want to say about slowly changing dimension modeling. You have people like Max who say we don't need to do this anymore, that we can just hold on to all the data in the cloud as daily snapshots and all of this is silly, and Airbnb the company actually jumped back and forth between that mentality and the SCD mentality. Generally speaking, my perspective now, having worked at startups and big tech and all sorts of company sizes, and having my own startup, is that at a small enough scale, SCDs are a very low-ROI thing to do. The efficiency gain is like, "ooh, I saved five cents in the cloud, look at me, and I spent ten hours saving those five cents." As the company gets bigger, though, and I think the cutoff is somewhere around 10 or 15 million users, that's when you really start to see the slowly changing dimension gains you'd expect.

All right, I'm going to pause here for a second and take about five minutes to answer any questions y'all have about the presentation before we dig into the lab. Make sure you have a DataExpert.io account (you can see it running across the bottom there), and go to the DataExpert.io classroom under Zach Wilson, where you'll find all the queries pushed to you in real time as I run them live. If you're in the chat, I'll answer whatever questions y'all have.

Kunal, you asked: can't we derive type 3 from a type 2 table? 100% correct, and that's exactly why type 2 is the best. I think that's really awesome. If anyone else has questions, just drop them in the chat; we'll do a quick Q&A for about five minutes and then go into the lab.

"How about current and history tables instead of SCD type 2, if it's a huge table, say 300 million records in a users table?" You can do that; that's another way to model it. The problem with current-and-history tables is that if people depend on the current table in their pipelines, they're going to inadvertently cause non-idempotency problems. But that's usually OK. In my experience, I have done the type 1 style, but I've also told people: this table is not meant to be consumed by a pipeline; it's meant to be consumed by analysts. The strongest way to enforce that is to take it out of the lake and move it to some other store. For example, when I worked at Netflix, I didn't put the current table in the data lake; I put it in Druid, some other place where people can do fast analytics, and we moved it out of the lake. Because here's the thing: if something is in the data lake, people are going to ETL it. That's how it works. If it's in the lake, it's going to be ETL'd, for sure.

"Have you seen any use cases where type 2 wasn't sufficient?" No, it's been pretty good; I've actually found type 2 to be really, really solid. You can get into some hairy cases, Kunal, and they come back to what I talked about early on in this presentation: when you're modeling a slowly changing dimension, you have to be aware of how slowly it actually changes.
How slowly the dimension changes can be another really important thing to think about. I've found SCD type 2 to be a very powerful pattern; the only time it runs into problems is when the dimension isn't actually slowly changing but fast changing, with so many mutations that you actually end up with more records than you would in a daily snapshot scenario. That's another solid thing to keep in mind.

"I have a question: do you recommend a change data capture table instead of SCD?" That's not something I can recommend one way or the other; there are definitely trade-offs both ways.

(For whoever's query page isn't working: it's because you're not signed in. You need to log in; that's why it's not working. Let me show that off real quick. I'm going to pop this window over and try to do this side by side. One second, let me make this a little smaller and see if that fixes it. OK, there we go, that looks better.)

A question came in from the chat: "Is Kimball worth reading on slowly changing dimensions?" Yeah, I like the Kimball books, 100%.

So we're mostly going to be working in this query editor today. One thing you might notice is that when I run a query here, you'll get a notification right here and you can copy it, so if you fall behind you'll be able to catch up and we can work through this together. The idea here: if you were in my other live session where we went over consecutive streaks, this is going to be very similar. We have all these NBA player-season records showing when someone was active, but we actually need to create the data for when they're not active, and that's super important. When you're building slowly changing dimensions, you often need to model data that isn't there, because the dimension doesn't exist anymore once the entity stops being active. So we're going to be working with this table today. We'll say select * from bootcamp.
nba_player_seasons, and I'm going to add where player_name = 'Michael Jordan'. This is what we're going to be working on today. So here's Michael Jordan, and one thing you might notice is that he has seasons for 1996 and 1997 and then 2001 and 2002, so he was not active in 1998, 1999, and 2000. What we want is a record for him for every year, so we can see when he was actually not active.

So what we can do here is add a cross join unnest, and in this case I'll try generate_series(1996, 2002) as t. I think this should work... generate_series... okay, give me one second. Generate series... oh no, that's Postgres. In Trino it's sequence. That's why it failed; that makes a lot of sense. Let's go back in here and use sequence, and this should give us what we're looking for. There we go.

But this isn't quite right yet, because we're cross joining, so we get a row for every year in the sequence, including the 1998 and 1999 rows, and our goal is to smash this data down to find, for each year, whether he was active. So in the select we take player_name and then season = t as is_active... oh, and we have to alias it properly. The syntax is super weird when you want to explode out the records: you alias the unnest output like t(exploded_season). That's probably a good name. Now we select exploded_season, compare it against season, and aggregate down: group by player_name and exploded_season, and really we want to wrap the comparison in a max, because the cross join gave us one row per (season, exploded_season) pair and we only care whether any of them matched. Then order by exploded_season, and now we get an interesting way of looking at things: true, true, false, false, false, true, true. This is getting a lot closer to what we expect. We could explode this out further if we wanted, but I want the queries to run fairly quickly. So next, let's put this in a CTE.
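Here's a sketch of the query as built so far, in the Trino syntax the editor uses:

```sql
-- Explode each player out to one row per year in the range, then flag
-- whether they actually played that year.
SELECT
    player_name,
    exploded_season,
    MAX(season = exploded_season) AS is_active
FROM bootcamp.nba_player_seasons
CROSS JOIN UNNEST(SEQUENCE(1996, 2002)) AS t(exploded_season)
WHERE player_name = 'Michael Jordan'
GROUP BY player_name, exploded_season
ORDER BY exploded_season;
```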
We'll call it with all_data as (...), and I'm going to rename exploded_season to season so the later queries read naturally. The goal here is just to create a slowly changing dimension for Michael Jordan, to show when he retired, when he was active, all that kind of stuff. If we say select * from all_data, it's looking pretty good in terms of giving us the data set we're after.

Now we're going to use some window functions to understand the lag and the lead of these values, what's going on season over season. So: lag(is_active, 1). The thing about window functions is they need a window definition, so we say over (partition by player_name order by season), and alias it as is_active_last_season. Let's run this. OK, so where you see true followed by true, we can finally spot where there's going to be a switch. Next we make another column: case when is_active = is_active_last_season then 0 else 1 end as did_change. That gives us the change indicator we'd expect. And you'll see it makes sense that the very first row counts as a change, because it goes from null to true; that's fine. Then no change, no change, did change, and did change again when he came back.

The whole idea here is to collapse all these records down into the continuous streaks where the player's status stayed the same. So I'll move this into another CTE; we'll call it change_identified. Then on top of it we do a rolling sum: sum(did_change) over (partition by player_name order by season) as streak_identifier, very similar to one of the previous sessions if you were there. Now you'll see the streak identifier: he was active two seasons in a row, then inactive for three seasons, 1998, 1999, 2000, then came back in 2001 and 2002, and each of those runs gets its own identifier we can work with.
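Here's roughly where the query stands now, matching what's in the editor (still filtered to Michael Jordan while we build):

```sql
WITH all_data AS (
    SELECT
        player_name,
        exploded_season AS season,     -- rename so later CTEs read naturally
        MAX(season = exploded_season) AS is_active
    FROM bootcamp.nba_player_seasons
    CROSS JOIN UNNEST(SEQUENCE(1996, 2002)) AS t(exploded_season)
    WHERE player_name = 'Michael Jordan'
    GROUP BY player_name, exploded_season
),

change_identified AS (
    SELECT
        player_name,
        season,
        is_active,
        -- 1 whenever this season's value differs from last season's
        CASE WHEN is_active = LAG(is_active, 1)
                 OVER (PARTITION BY player_name ORDER BY season)
             THEN 0 ELSE 1 END AS did_change
    FROM all_data
),

identified AS (
    SELECT
        player_name,
        season,
        is_active,
        -- rolling sum of changes = one id per unbroken streak of a value
        SUM(did_change) OVER (PARTITION BY player_name
                              ORDER BY season) AS streak_identifier
    FROM change_identified
)

SELECT * FROM identified;
```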
We'll call that last CTE identified, and we're getting really close to having our SCD. So then we select from identified: player_name, is_active, streak_identifier, and then min(season) and max(season), aliased as start_season and end_season, and we group by player_name, is_active, streak_identifier. Now this gives us a pretty cool little data set. (Oh, I forgot the group by. There we go.) We actually don't need streak_identifier in the select list; it's kind of a hidden grouping column we don't need to show. And look: we have a start_season and end_season... oh, I got min and max backwards; it's flipped. We'll run that one more time, and now we have our start season, our end season, and is_active. A slowly changing dimension.

Then, if we remove the Michael Jordan filter and run the query again, you'll see this actually works across the board for every player. You can see exactly when the players were active and inactive; this Fred Roberts guy was around for one season, for example. This essentially gives you your type 2 SCD. One more thing: this data set actually goes to 2021, so the last thing I want to do here is add is_current as max(season) = 2021, whatever that last value is, so we have the current value for each player.

One thing that's interesting, and not quite as good, about this unnest strategy is that it also generates dimension rows for people before their careers started. The beginning of the sequence really should be the minimum season for each player. You see this guy A.J. Guyton: he really didn't even start playing until 2000; he wasn't in the league before then, so his first season was probably 2000 and he shouldn't have rows before that. So we want to do one more aggregation ahead of time. Let's add a CTE called first_season: select player_name, min(season) from bootcamp.nba_player_seasons group by player_name, and call the min first_season. Then we do a join: bootcamp.nba_player_seasons np join first_season fs on np.player_name = fs.player_name, qualify all the columns with np., and change the start of the sequence to fs.first_season, so we don't explode out years for people who haven't shown up yet. That's an important part of this exploded process. So now if we run this query... player_name is ambiguous; where is that? There's a player_name in here somewhere. Oh, it's this one; it needs to be np.player_name.
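Putting it all together, here's the full one-shot SCD query roughly as built in the lab (Trino; 2021 is simply the last season in this data set):

```sql
WITH first_season AS (
    SELECT player_name, MIN(season) AS first_season
    FROM bootcamp.nba_player_seasons
    GROUP BY player_name
),

all_data AS (
    SELECT
        np.player_name,
        exploded_season AS season,
        MAX(np.season = exploded_season) AS is_active
    FROM bootcamp.nba_player_seasons np
    JOIN first_season fs ON np.player_name = fs.player_name
    -- start each player's sequence at their own first season, so we don't
    -- generate rows for years before they were in the league
    CROSS JOIN UNNEST(SEQUENCE(fs.first_season, 2021)) AS t(exploded_season)
    GROUP BY np.player_name, exploded_season
),

change_identified AS (
    SELECT *,
        CASE WHEN is_active = LAG(is_active, 1)
                 OVER (PARTITION BY player_name ORDER BY season)
             THEN 0 ELSE 1 END AS did_change
    FROM all_data
),

identified AS (
    SELECT *,
        SUM(did_change) OVER (PARTITION BY player_name
                              ORDER BY season) AS streak_identifier
    FROM change_identified
)

SELECT
    player_name,
    is_active,
    MIN(season)        AS start_season,
    MAX(season)        AS end_season,
    MAX(season) = 2021 AS is_current
FROM identified
GROUP BY player_name, is_active, streak_identifier;
```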
That's what we've got to do. There we go. Now no player in here gets rows from before they were playing, because we only start the sequence from their first season, and every player's data runs from their first season all the way to the current date. That can be a really powerful way to generate the whole SCD table, and all this kind of stuff, in one shot, and it's probably the best way to do it.

Someone asked how you can turn this SCD type 2 into a type 1 and into a type 3, so let's do that real quick. First let's materialize it: create table zachwilson.nba_players_scd as (the query above). That's one thing that's really cool about this platform: it actually creates and builds the data. There we go: we have 5,000 records now of different people transitioning. So now I'm going to take this table, select from it, and stomp the rest of the query.

This is our type 2 SCD. Getting to type 1 is very easy, and this is why people who use type 1 are just kind of lazy sometimes: we just say where is_current, done. Now this gives us all the current values. But coming back to what I was talking about with backfilling: say I add and player_name = 'Michael Jordan'. We get one record. It's nice. But if we backfill with this dimension, Michael Jordan will just be retired; he'll be marked not active for the years when he actually won six championships in the 1990s. And to make it exactly type 1, you wouldn't even have the start date and end date; you'd drop those columns, devolving the type 2 into a true type 1: one row for every player, without filtering, telling you whether they're active. Easy, but also terrible, because again, we don't want a backfill saying Michael Jordan won six championships while he was retired.
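The devolution from type 2 to type 1 is just a filter plus dropping the date columns; a sketch against the table created in the lab:

```sql
-- Type 1 view of the type 2 table: latest value only. Easy to use, but
-- backfilling with it rewrites history (a retired Michael Jordan in the 90s).
SELECT player_name, is_active
FROM zachwilson.nba_players_scd
WHERE is_current;
```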
So let's look at how we can convert this into a type 3 SCD, which I think is another one that can be really interesting. For a type 3 we start with a CTE, WITH f AS, where we select player_name and is_active. In this case is_active is going to be each player's very first, original value. Admittedly it's not super interesting here because we're working with a binary: everyone's first value is always active, because if it weren't, they wouldn't be in the league in the first place. We select is_active from zachwilson.nba_players_scd and add ROW_NUMBER() OVER (PARTITION BY player_name ORDER BY start_season). If we SELECT * FROM f, that row number, call it num, tells us which record is each player's first. We also pull in start_season and end_season. Then in the outer query we say player_name, then MAX(CASE WHEN num = 1 THEN is_active END) AS original_is_active, and MAX(CASE WHEN end_season = 2021 THEN is_active END), oh, we need an END on that CASE, AS current_is_active, and then we GROUP BY player_name. So now this converts it fairly efficiently from type 2 to type 3.

A lot of times a type 3 also has the dates in it, so you could add the original dates too: MAX(CASE WHEN num = 1 THEN start_season END) AS original_start, and then the same thing for the current one. I was going to say the current one is always the same and usually just hardcoded, like 2021, but no, actually, they are different, never mind: grab the same expression with end_season = 2021 and call it current_start. So a lot of times you do have the date that the value changed. Now you can see, for example, that this guy here originally started in 2004 and then retired in 2018.

But here's the catch: you might make the assumption from that that he actually played for fourteen years straight. For example, if we say WHERE player_name = 'Michael Jordan', I think Michael Jordan makes this exact problem very clear... oh, I forgot the WHERE needs to go before GROUP BY, SQL fail... okay. You see here how this one is kind of wrong? His current start here is actually not the right one, because this is when he retired. This output is lying to you a little bit: it makes it seem like he played for seven years, from 1996 to 2003, but he didn't. He had a whole other iteration in between, because his value changed multiple times. This is showing the limitations of SCD type 3: it only keeps the original and the current values, so the history in the middle is lost, and that can cause idempotency problems and all sorts of other issues.

So that is... oh wow, I guess this has only been about 75 minutes. Okay, that's fine. I think that's pretty much what I wanted to show in today's lab.
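Cleaned up, that type 2 to type 3 conversion looks roughly like this (same assumed table names as before):

```sql
-- One row per player: the original value/date plus the current value/date.
WITH f AS (
    SELECT
        player_name,
        is_active,
        start_season,
        end_season,
        ROW_NUMBER() OVER (
            PARTITION BY player_name
            ORDER BY start_season
        ) AS num  -- num = 1 marks each player's first record
    FROM zachwilson.nba_players_scd
)
SELECT
    player_name,
    MAX(CASE WHEN num = 1 THEN is_active END)               AS original_is_active,
    MAX(CASE WHEN num = 1 THEN start_season END)            AS original_start,
    MAX(CASE WHEN end_season = 2021 THEN is_active END)     AS current_is_active,
    MAX(CASE WHEN end_season = 2021 THEN start_season END)  AS current_start
FROM f
GROUP BY player_name
-- If your engine's MAX doesn't accept booleans, cast is_active to an integer.
```

Anything between the original and the current value is lost here, which is exactly the Michael Jordan problem above.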
Things went a little quicker than I thought they would today, which is fine. Anyway, I want to open it up to questions so we can go over more of what happened in the lab and y'all can be successful with your slowly changing dimensions.

Vidit: yes, you can query my table. That's probably a good thing to explain. In the boot camp environment, if I try to drop a table, say DROP TABLE bootcamp.nba_player_seasons, it's not going to let me, because you can only create and drop tables in your own schema, whatever schema is set up for you here. In your own schema you can create tables, drop tables, do whatever the hell you want. Reading, though, is totally open: you can read data from wherever you want.

And keep in mind we're going to be doing some pretty cool stuff here in the future. I have a LeetCode-style experience that's almost done, so y'all can try it out. We should have about a hundred questions when we launch, and they're difficult questions. What we just did is one of them. It's not "what is the average number of likes on a post," none of those very easy questions; it's going to be a genuinely challenging platform to work on, and it's launching very soon, in the next couple of weeks. I'm really excited to show y'all that as well.

"Is SCD for a column or a table?" That's a great question. I would say an SCD is mostly a table, but you can do the same stuff for multiple columns at the same time, that whole did_change aspect. You can have one slowly changing dimension table that tracks multiple columns at once, so you aren't just tracking is_active; you might track whether they were active and good, or active and some other attribute. It's the pattern I showed you before with did_change. Let me go back to that real quick... is it this one? No, that's player_name, is_active... oh yeah, this guy, the one with the LAG calls. Let me copy it and bring it back up here. When we compute did_change, you can see there are different ways that did_change function can manifest; this query is another example of doing a very similar thing, but with consecutive seasons. And you can combine multiple conditions, so that if nothing changed you get zero, otherwise one, which is how you track multiple columns at the same time.
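For instance, a two-column version of that did_change might look like this. To be clear, team is a hypothetical second tracked column I'm adding just for illustration; it's not in the lab table:

```sql
-- did_change across two tracked columns: a change in EITHER column closes the
-- current streak and starts a new SCD record. LAG returns NULL on each
-- player's first row, so the first record correctly counts as a change.
SELECT
    player_name,
    season,
    is_active,
    team,  -- hypothetical second tracked column
    CASE
        WHEN is_active = LAG(is_active) OVER (PARTITION BY player_name ORDER BY season)
         AND team = LAG(team) OVER (PARTITION BY player_name ORDER BY season)
        THEN 0
        ELSE 1
    END AS did_change
FROM bootcamp.nba_player_seasons
```

From there, the streak and aggregation steps are identical to the single-column version.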
Generally speaking, a slowly changing dimension table can model multiple columns at the same time; in fact, usually almost all the columns in the table are slowly changing except for the identifier column. So yeah, it works well.

"For the same table, can we see an example of type 6?" I didn't prepare for that, so probably not today, but that's a good suggestion. That'll be a good one for another session.

"What if we want to track changes for two columns in the same table? How does SCD type 2 work there?" The same way, just like the sketch above: you have your two columns plus a start date and an end date, and if either of them changes, you need a new record. It's exactly the did_change logic from before, where a change in either column starts a new streak, so you end up with multiple records as new streaks show up.

Awesome, thank you so much, Sebastian. I'm really happy y'all are liking this. I've been putting a lot of time, effort, and energy into this stuff, and I'm really grateful to each and every one of you for showing up and supporting me as I go through this journey. (And I realize I need to turn this overlay off; I should have a bigger face on screen.)

The one last thing I wanted to talk about: I'm going to be doing a boot camp from May 6th to June 14th, and we are still accepting early-bird sales, which is the cheapest you can get the boot camp for, cheaper than at any other time and cheaper than any other boot camp I've ever done. If you use the early bird V4 code at checkout, you'll get 20% off. We're going to cover a lot more of this stuff in a lot more detail; think about it this way: we're going to have about eight hours of this, so you can really grind and really understand the trade-offs, benefits, and risks of all the different things you can do here. So I definitely, highly recommend checking it out. You'll also be in a community with a lot of other people who are highly motivated to learn; we already have about 65 people who have bought, so I'm hoping this will be my biggest boot camp ever. With the 20% off you can get the whole boot camp for about $1,600, and it should be a really, really great time. I'm really excited for y'all to join the community, which is full of highly motivated data engineers, and at the end of the boot camp, if you put in all the effort to get certified, we will match you with a mentor who will help you change your life and change your career. So thank you so much, everyone, for all of this.

Kanu: the boot camp is a mix of analytics and data engineering. You can do just analytics by itself; there's an analytics track of the boot camp, so if you want to focus on that and you don't care about Flink and Spark and all that technical stuff, you can focus on the analytics side, where we go really deep into every single thing in analytics that you could think of.
It should be a really, really great time. But yeah, everyone, thank you so much for your time today. I'm really happy that y'all entrusted me with your eyeballs and your attention, and I hope y'all have a fantastic Thursday. I'm excited to keep doing these, and keep in mind, we'll be doing them around 5 to 7 on Thursdays, at least until the boot camp launches in May, so for the next couple of months. Hopefully we can get to 100K on YouTube in that time as well, so make sure to subscribe to me on YouTube too.

Sebastian, for the boot camp, just go to dataexpert.io or dataengineer.io; they both work. I'm kind of in the middle of a rebranding, so I'm using both of them right now, and they actually take you to the same place. I'll put the link in the channel here as well, and you can learn more about the curricula there.

Okay, last question: "You have showcased interesting ways to derive SCDs against base tables or views. In a typical warehouse or DB, would you create SCDs as stored procedures or a set of tables?" Always a set of tables. Always. Never build SCDs as a view; you saw how nasty that was, the view that exploded the data. No. Definitely not.

Awesome, everyone. I hope y'all have a fantastic evening. That's it; thank you!