Good morning, fellow data enthusiasts, and good evening to my friends in Australia, one of whom has already joined while the other is on his way, so I'm not entirely alone here in the dark. It's not too bad, it's only 7 PM, so I shouldn't be complaining, and I'm not. I've really been looking forward to today. It's going to be about, what else, code generation and automation for data solutions, my favorite topics, and I'm going to cover something I think is really cool and interesting: something that brings all kinds of data concepts and technologies together into what I refer to as the engine for data automation.

But before we go there, I'd like to share a couple of personal views on what it means to work with data, and why I think the patterns and technologies I'm going to show today are worth considering. For me, data is typically just stuff. I define data as evidence of events, of activities by something or someone, that is persisted somewhere and that we can uncover and analyze if we want to. To me, that process of uncovering and analyzing something that happened in the past is similar to how I imagine paleontologists digging up ancient fossils and skeletons, using those bones and artifacts to piece together what actually happened at that forgotten moment in time. As a kid that was one of those things I really got into: fossil hunting, dinosaurs, all kinds of things, like any young boy I suppose. But the idea that something happened in the past and just sits there until we decide to dig it up and figure out what it means really stuck with me. It's also similar to how biologists use that same fossil record to improve our understanding of why we came to be the way we are, of how evolution unfolded.

What we see here is a set of visualizations of the tree of life: the visual that describes the relationships between living things and how close they are as species, based on their shared characteristics and genetic material. I find that a really good analogy for what it means to work with data, because our model, our interpretation of reality, is only as good as the understanding we have at a point in time. When new facts, new events, new knowledge become available, those views and that model should change. Scientists like these biologists and paleontologists know that; they know that new evidence may mean that what we previously thought to be true might have to be adjusted. I think that's awesome, really fascinating. These models on the screen, these visualizations, have changed dramatically over the years as new discoveries were made. By following the scientific method we collect more and more context that supports our understanding of what is definite, at least as far as we know it now, and what is still being clarified: the things we're not too sure about yet, where we're working on assumptions.

To me, working with data is similar, because we find ways to extract this data, these events that happened somewhere, and to interpret it, collect it, and do something with it. The data we work with typically comes from what we call source systems or feeding systems, the operational systems.
These are the applications that support the day-to-day operation of the business, and as we all know they are many and they are varied. They come in all kinds of shapes, sizes, colors and designs, with different levels of quality, different processes and controls around them, different consistency and reliability, and of course different technology. Every new company you go into has this patchwork of applications that may or may not talk to each other in the same language. All of them create data, because people run their day-to-day business on them: data is created when people follow business processes using these applications, for example entering new customers or processing refunds.

By understanding those processes we get a better view of what the data means, and that context is important, critical even, to actually understanding the data and how to use it. It's something you just don't get right from the start; it's something you have to fine-tune over a longer period of time, and that is really hard to get right. I would say it's almost impossible in most cases. It might be straightforward for some data points, but the context may also be completely non-existent: the people who actually understand what happened may no longer be there, it wasn't documented, people were using workarounds and not following the documentation, the system has shortcuts, or the processes have changed, so we look at the current state of the processes while in the past things were done differently, and we interpret the data that was generated back then through today's understanding of how the process works. There are so many reasons why it's not always clear what the data really means.

This understanding, this context (and I keep using the word context because it's what shapes the model in the end) is what we use to define the model. You often hear about the single version of the truth, for example, and there are lots of workshops to piece together this holistic business model: this is what the business does, this is how its processes should work. Depending on how you look at implementing that, you take different avenues to work towards that ideal goal. So at best you work with an imperfect understanding of the data, with a model that needs to undergo a couple of revisions at the very least. And that's not even all of it, because the business itself evolves as well: you've got new products, mergers and acquisitions, technology being changed, business models being changed, and all of that impacts your model, your interpretation of the world.

Last but not least, to make matters even worse, the methodologies and techniques we use change too. I think that even today we haven't truly figured out all the necessary details for something like Data Vault, and we're here in the Data Vault user group: new ideas, new patterns and new ways of doing things keep coming up. Having worked with Data Vault for more than 20 years now, I have certainly let go of some ideas about how to do quote-unquote Data Vault and picked up some ideas from others. I think that's great, that's progress, but it obviously means that things change as well.
I was at the ELM forum in Rotterdam, in November I believe, where I talked about this a little: I think the point in time at which you were first exposed to Data Vault colors your understanding of what it means to work with it. Doing Data Vault today is very different from doing it in the early 2000s or in 2010. These methodologies change over time, and when certain things simply work better, we should find a way to incorporate them. That is also one of the drivers behind why designing for change is so important, and one of my core beliefs for building these solutions in the first place: it's okay to change your mind, and you have to account for a degree of flexibility.

The way I look at it is this: yes, we want the single version of truth, the business model, but we know that everything around us changes: technology, the company, our understanding of the data, the methodology itself. Everything is fluid, so we approach this ideal as closely as we can. It's a moving goalpost; it changes all the time. But thanks to automation you can get really close to it, or at least as close as possible, and try to keep up with this ever-shifting state. It's a morphing organism.

So imagine you build this model, everybody is super happy with it, and then you throw it all away. You truncate it, and your model just grows back automatically. You truncate it again, and it grows back automatically, exactly the same way. Then you change it a little bit, and it grows back into that new shape. Then you go back to the previous version, and it comes back the exact same way. How cool would that be? That's what we're going to do today.

I refer to myself as an automation enthusiast, because I really like working on this: making things run automatically, generating code, DevOps, all that kind of stuff. For a long time I've been calling it inspired laziness, because it's really hard work to be professionally lazy. Lazy is good. I'm going to spare you the lengthy personal history; suffice it to say there are a couple of things that are meaningful to share in this context. The most important one is that the code I'm generating and running today is something you can actually try for yourself at home, via the Agnostic Data Labs link here: beta.agnosticdatalabs.com.
Agnostic Data Labs is a venture that Stefan, who is also on the call, and I started a couple of months back: a new platform for data automation. We think we approach the space from an angle that is unique. It's still early days and we're still working on it, but if you go to that beta link, which also comes back at the end of the deck, you can register, we can set you up, and you can run this code and other code as well. We're still tying things up before we can call it a generally available version, which is why it's still a beta, but the things I'm covering today work fine as they are. So by all means give it a go if this is of interest and you want to know a bit more. Obviously 40 to 45 minutes is not a lot of time to go into all the details, but if you can run it yourself you'll get an idea, and of course you can reach out, which we'll get to later.

That's Agnostic Data Labs. The GitHub link is there because the ideas, concepts, and some of the technologies we use for Agnostic Data Labs are open source. There is a whole collection of frameworks you need to have in place to run this, and that's what the Agnostic Data Labs platform and other platforms are built on. The idea is that these open-source components keep being expanded and worked on, and Agnostic Data Labs is just one of the front ends for using those concepts; there are other tools as well, including open-source tools. It's really meant to be available and open.

Oops, sorry, this is not the one I wanted to click. We've also got this book coming up, called Data Engine Thinking, and that's really the theoretical underpinning of all of this. It's something I'm working on with Dirk Lerner, a fellow countryman for you Germans; you probably know him, or some of you do. We want to wrap that up within a couple of weeks, and it should be available somewhere this year. It's something I'm really proud of, and I hope you'll have a look when it's done; it explains pretty much the background of what I'm talking about today as well. And last but not least there's the weblog with practical tips, and the link to the training materials if you're interested in coaching and training.

Right, with all that out of the way, a lengthy introduction, but let's get into it. I've still got roughly 35 minutes to talk, so five to ten minutes for questions. As Christian said, if you have questions in the meantime, feel free to put them in the chat; Stefan is there as well and he will definitely be able to answer some of them, so we might even get a bit of time back if we're lucky.

So, the engine. I'll keep this pretty short, because it's familiar to some of the people in this group and I don't want to dive into too much detail, but it is important context: to make everything we're about to see work, you need a lot of things in place, and each of the blocks we use to build this engine is a framework in itself, which is also in the open-source GitHub. So what do we need?
We need something to capture metadata in the schema for data warehouse automation, a repository where we store our design; that's foundational. We need something to keep the original raw events as they happened, prior to any interpretation at all, which also means prior to going into the Data Vault world: that's the persistent staging area, essentially the transaction log of all the changes that are happening, with timestamps.

We need to understand, in detail, exactly what it is we're trying to do. I'm a big fan of having a library of patterns that says: a hub works like this, a satellite works like this, we're going to handle dates like this, and we're going to make these decisions based on these considerations. Only then can you start to talk about code generation and templates, because templates are the implementation of a pattern: how do I do this best in that particular technology, and how can I automate it using the metadata that I have?

We need a way to reload history deterministically, as per the introduction about letting something grow, pruning it, and letting it grow back again; you can only do that if your processes are truly deterministic. We need to be able to version everything, because we need to go from this model to an earlier model and back to the current one, and maybe even have two versions of the model at the same time. That means versioning the data, which is the PSA; versioning the metadata, which happens in typical repositories; and versioning the templates, which are part of the same repository. As long as we version all of that in the same context, we can do anything.

We need some kind of automation pipeline to apply this automatically: when there is a change, which could be triggered by a commit in version control or by a change in the model, the updated design needs to be refactored automatically. These are all things you need to have in place to make the stuff I'm going to demo today work, so please go to the GitHub and have a look.

We need a way to implement checks and balances and validation, to make sure the data we generate still conforms to the intent, the purpose we intended for it: a testing and data validation framework. That's regression testing, but also periodically checking that things are still okay. I use it in this demo for referential integrity; I'm not sure I'll get to it, but I can run it if we need it. And we need some way of running things and understanding, when we run them, what happened, why, by whom, and what the outcome was. That's very important to maintain the integrity of the overall application: we need to be reminded and notified if something is out of the ordinary.

Then we can take a step back, look at all of this, and ask whether it is really that meaningful to keep debating whether we need a link satellite or a hub satellite, how we do driving keys, or transactional links versus hierarchical links versus modeling things out. At some point you zoom out to a level of abstraction where it doesn't matter as much, and we can say: we still have the same model, but if we link it to a design pattern that implements a link like this, then the physical model will start to look slightly different. So having the ability to connect your design to logical and conceptual models is really meaningful. Of course we need transformations in there as well, but I'm not going to dive into those too much.
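As an aside, to make the persistent staging idea a bit more concrete: a PSA table could look something like the minimal sketch below. The table and column names (a hypothetical customer feed) are illustrative assumptions, not the demo's actual generated DDL.

```sql
-- PSA_Sales_Customer: every change event from a (hypothetical) customer feed is
-- kept as a new, immutable row, prior to any interpretation or modeling.
CREATE TABLE PSA_Sales_Customer
(
    CustomerId      INT           NOT NULL,  -- natural key as received from the source
    CustomerName    NVARCHAR(100) NULL,      -- source attributes, stored as-is
    Segment         NVARCHAR(50)  NULL,
    LoadDateTime    DATETIME2(7)  NOT NULL,  -- when this change event was received
    ChangeOperation CHAR(1)       NOT NULL,  -- I, U or D: the raw change type from the source
    CONSTRAINT PK_PSA_Sales_Customer PRIMARY KEY (CustomerId, LoadDateTime)
);
```

Because rows are only ever appended, this table is the one immutable record that everything else can be rebuilt from.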
Then there is the thing that locks it all into place (this is one of those rare occasions where I actually know the Dutch word but not the English one; it doesn't happen as much anymore): the optimizer, which figures out how the ideal physical model can be generated from the logical model, based on the directives that we provide. I love that. So that's the engine: the stuff that's built into these frameworks and then runs in the background, doing what it needs to do while we figure out how the model is going to change. We can start collecting runtime information, query patterns, usage, and so on, and then we can start talking about the real semantic meaning of the things we're working on, the glossaries and the taxonomies, and we do that in workshops. My ideal is that we spend all our time in workshops, we tweak the metadata, and everything just merges and refactors automatically. That's when we're winning. The engine gives you the power to do that: it drives the wheels, the direction we're going is driven by workshops and governance processes, and we can adjust where we are on the road based on the information we collect from using the system. That's the engine.

I've got an example running on a local SQL Server that I'm going to switch over to in a bit, but I do want to explain what this solution looks like first, before we go into the screenshots. I use the term data solution rather than data warehouse more often than not, and it's a pretty standard three-tiered design: data is loaded from the operational systems into a persistent staging area, the PSA, and from there into an integration layer, which is the equivalent of a raw Data Vault if you will. From the integration layer it can either go straight into the delivery layer, or get some transformations applied first and then go in. So if you see terms like "derived" and "base", those are different tables being created that carry those monikers; keep that in mind. It's typical three-layer stuff, so not super exciting.

What is exciting, and what I'm really passionate about, is that only one part of this is truly persistent and physical, and that is the persistent staging area. I'm a really big fan of capturing everything first and deferring any interpretation, including modeling, until later. This is where the term virtual data warehousing comes from, which I sometimes use, because everything above the PSA can be changed whenever you want: virtual means you can morph these layers into whatever you need them to be. You can certainly deploy them as views; quite often when you hear "virtual data warehouse" people assume all layers are views, and that can be true, but they could also be stored procedures that load tables over and over again, and as long as you truncate those tables they just grow back. Whatever you do, you always use the same steps and the same architecture, and you get the same outcome, so I'm less interested in the specific technology or implementation choice used to deliver it.
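As an aside, this is roughly what a "virtual" hub looks like when it is deployed as a view straight over the PSA. The names and the hashing convention are assumptions for illustration, building on the hypothetical PSA table sketched earlier.

```sql
-- HUB_Customer as a view: the integration-layer object exists only virtually,
-- derived from the PSA on every query, so it can never be out of date and can
-- be redefined at will.
CREATE VIEW HUB_Customer
AS
SELECT
    HASHBYTES('SHA2_256', CAST(psa.CustomerId AS NVARCHAR(100))) AS CustomerHashKey,
    psa.CustomerId                                               AS CustomerId,    -- business key
    MIN(psa.LoadDateTime)                                        AS LoadDateTime,  -- first time the key was seen
    'Sales_Customer'                                             AS RecordSource
FROM PSA_Sales_Customer AS psa
GROUP BY psa.CustomerId;
```

The same pattern can just as easily be materialized as a table loaded by a generated stored procedure; the implementation choice changes, the pattern does not.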
That said, I sometimes get criticism for using views, because we can't always go into production with views on big data warehouses, and I get that, although I still think it would be super cool if we conceptually could. But I don't really mind: we can also generate it all as stored procedures, which is what I'll be doing today. These stored procedures are generated in such a way that they are completely independent of each other. All of these procedures essentially have to load from the PSA, because that's the only place where you actually have a copy of the data as it was.

So how does that work? That's done using load windows. Each procedure loads data from the PSA into its hub, link, or satellite, and from the hubs, links, and satellites into the dimensions, as we'll see. Each PSA-to-integration-layer procedure has its own load window, so each time you run it, it processes whatever there is to do.

And how does it get kicked off? Each of these procedures is started by a state machine. A state machine in this sense is an internally running process, like a service or a daemon, with a number of slots available. Every time a slot opens up, because one of the running processes has finished, the next process from the queue is picked up based on its priority and executed, and when it completes, another one pops up. There is no dependency and no workflow orchestration; it just loads whatever there is to load, in whatever priority you set. That's another one of those concepts I really, really love. I'm keen to boot up the server and show you, which is a couple of minutes away, but it's such a cool way of doing this, because you can tweak the prioritization to fix a whole bunch of problems such as race conditions, bottlenecks, and delays, and it's super transparent. I'm a big fan.

The state machine runs on my local server, and in fact I've got two, because one state machine loads data into the PSA continuously while the other loads from the PSA into the integration layer at the same time. So there is no ordering and no dependency; it just loads, and at some point everything is back to where it needs to be. And because I ran out of time preparing this demo, I decided to just use views on top; obviously I could do the same thing there, and I have done that in other projects, but the presentation layer, the dimensional model in this case, is just a layer of views.

Last but not least, I created this demo in SQL Server because that's easy for me: I can take it with me on the road and tweak it on the plane or while waiting somewhere, family shopping or whatever. We have demos in other technologies as well, so you can also look at this in Snowflake, for example. It doesn't really matter that much; in the end it's just a tweak of the code generation template.

So let's start running things. I'll switch back to my local SQL Server. What we have here are these databases: a staging area, a persistent staging area, an integration layer, and a presentation layer, and they're all empty. The first thing I'll do is generate the demo and kick it off. To do that I have prepared some code; I'll show you how I got there in a bit, but basically I have generated a set of scripts.
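As an aside, the queue behind that state machine can be as simple as a prioritized table plus a statement that claims the next eligible process whenever a slot frees up. The sketch below is an assumption about how such a queue could be implemented in T-SQL, not the demo's actual implementation.

```sql
-- ProcessQueue: the prioritized list the state machine works through. There is
-- no workflow and no dependency graph; whenever a slot frees up, the next
-- queued process is claimed and executed.
CREATE TABLE ProcessQueue
(
    ProcessName NVARCHAR(200) NOT NULL PRIMARY KEY,        -- e.g. 'usp_Load_HUB_Customer'
    Priority    INT           NOT NULL,                    -- lower value = picked up sooner
    QueueStatus NVARCHAR(20)  NOT NULL DEFAULT ('Queued')  -- Queued / Running / Succeeded / Failed
);

-- Claim the highest-priority queued process when a slot opens up. READPAST lets
-- several worker slots poll the same queue without blocking each other.
WITH nextProcess AS
(
    SELECT TOP (1) *
    FROM ProcessQueue WITH (UPDLOCK, READPAST)
    WHERE QueueStatus = 'Queued'
    ORDER BY Priority
)
UPDATE nextProcess
SET    QueueStatus = 'Running'
OUTPUT inserted.ProcessName;
```

Tweaking the Priority column is all it takes to work around bottlenecks or race conditions, which is exactly why this setup stays so transparent.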
This is the stuff you can do yourself as well, and it comes from my earlier project work. If you run this deployment script, a PowerShell script that is generated as part of the whole automation (you could put it in DevOps, but I'm running it manually here), it will start setting up the implementation of the whole solution as it is defined in metadata right now. So if I go back and have a look, there should be some tables, and I can start running queries to see what happens. You can already see that in my integration layer, in my Data Vault, I've got three rows here and everything else is still empty. If I keep querying, now there are five rows, and then I need to wait a bit because the PSA also needs to be populated, but slowly but surely you see more and more data coming in.

This is that state machine running: it says, I've got so many slots, and I'm going to pick up whatever is next. If I want to see what order that is, I can look at the list that tells the state machine which process to pick up next. Here is an example: it says do this link first, then this other link, then this link, then this link satellite, and the order is based on nothing in particular, because everything has started from scratch. The load window isn't there yet, so the first time a process runs it asks "what can I do?" and the answer is that it can load data from the beginning of time, based on the load date timestamp in the Data Vault: when did I receive the data, and up to what point can I load it? That's how those load windows are built. This list will keep changing: you can see this one is at the top now, and now it's gone, and now this one will be gone too. In the meantime my data warehouse is done, so I've got the data warehouse loaded and I can query some random dimension that I generated as an example. What it contains doesn't really matter; what matters is that if I drop the whole thing, so now I'm going to truncate everything including the load window, then everything is gone, but it starts to pop back up, and it will keep rerunning and reloading until it's back in its original state. If I leave this running, it will come back looking exactly the same.

Let's have a look at how we got there, and this is where we go to Agnostic Data Labs. The code that I'm running from the Visual Studio Code screen I showed earlier is generated from the tool, and this is the part you can do at home. This is our new platform, but this is not meant to be a software demo; I just want to show how you can do this yourself. It basically comes down to going to the get-started guide, which explains how it works. You can specify where you want your metadata to be saved in your repository, and then, and this is the critical part, you can connect to a folder that you create. When you do that, the browser will ask whether it can interface with your metadata, and that's local metadata or git metadata, because we're really of the opinion that you should be able to version-control things separately from the tool and run things separately from the tool. And that's it: the tool can now connect to the local folder you've nominated, and then you can select the Data Vault physical data warehouse for SQL Server preview, which is the demo I'm going to show today.
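Coming back to the load windows for a moment: below is a sketch of what one of those independently runnable, load-window-driven procedures could look like, assuming the hub is now a physical table (the alternative to the view shown earlier) and assuming a hypothetical LoadWindow control table. The generated code in the demo will differ; this only illustrates the pattern.

```sql
-- Control table for the per-target load windows (an assumption for this sketch).
CREATE TABLE LoadWindow
(
    TargetObject  NVARCHAR(200) NOT NULL PRIMARY KEY,
    HighWaterMark DATETIME2(7)  NOT NULL
);
GO

-- Each run processes whatever arrived in the PSA since the previous run;
-- no other procedure needs to have run first.
CREATE PROCEDURE usp_Load_HUB_Customer
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @WindowStart DATETIME2(7),
            @WindowEnd   DATETIME2(7) = SYSUTCDATETIME();

    -- Start of the load window: the previous high-water mark, or the
    -- beginning of time on the very first run (or after a truncate).
    SELECT @WindowStart = COALESCE(MAX(HighWaterMark), '1900-01-01')
    FROM LoadWindow
    WHERE TargetObject = 'HUB_Customer';

    -- Insert business keys seen in the PSA within this window that are not yet in the hub.
    INSERT INTO HUB_Customer (CustomerHashKey, CustomerId, LoadDateTime, RecordSource)
    SELECT
        HASHBYTES('SHA2_256', CAST(psa.CustomerId AS NVARCHAR(100))),
        psa.CustomerId,
        MIN(psa.LoadDateTime),
        'Sales_Customer'
    FROM PSA_Sales_Customer AS psa
    WHERE psa.LoadDateTime >  @WindowStart
      AND psa.LoadDateTime <= @WindowEnd
      AND NOT EXISTS (SELECT 1 FROM HUB_Customer AS hub
                      WHERE hub.CustomerId = psa.CustomerId)
    GROUP BY psa.CustomerId;

    -- Move the window forward so the next run picks up where this one stopped.
    IF EXISTS (SELECT 1 FROM LoadWindow WHERE TargetObject = 'HUB_Customer')
        UPDATE LoadWindow SET HighWaterMark = @WindowEnd WHERE TargetObject = 'HUB_Customer';
    ELSE
        INSERT INTO LoadWindow (TargetObject, HighWaterMark) VALUES ('HUB_Customer', @WindowEnd);
END;
```

Because the load is driven entirely by the PSA and the window, truncating the hub and the control row simply makes the next run start from the beginning of time and rebuild the exact same content.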
I'm not going to press Next here, because it would deploy all the objects, and then I'd need to regenerate everything to demo it. Instead I'll disconnect from here and open up the one I prepared earlier. It will again ask whether the browser is allowed to access those files, and then it shows me the design, which is a list of, well, let's go to the graph, I'll show you that a little later, but basically things like hubs, links, and satellites.

Now, I think it's good to take a quick step back and look at the model itself, because I'm super happy that there's so much data in it and that it always comes back the same, but let's look at the model. I'm going to drag the model in; I've created it in SqlDBM for now. It's just a model, we don't really need to care about it too much: there's a customer, there's an offer, there's a membership. But when I look at it, I see this customer hub going to this membership link, with this link satellite and this membership plan sitting here as a degenerate attribute on the link, and it has link satellites, which I'm not a big fan of these days anymore. I want to change that. I ultimately want to model the membership plan out into its own business concept, and model the membership itself as its own business concept as well, as opposed to having a link. So I need to create a new hub, a new satellite, update the link, probably rename the link as well, and then take this attribute out and model it somewhere else. A big change in the model, because I've fallen from my beliefs of the early days and adopted some other beliefs, and as we said, the more you abstract, the less it matters, but at this point it matters to me and I want to change this model.

So what we need to do is basically this: we go from (oops, this is not the screen I wanted to go to) the version where we have one hub, a link satellite, and this degenerate column, to a version with another hub for the plan, because I think that's better. We don't need to agree on that, but what you will see is that if I do this and I don't like it, I can go back to my earlier state and everything is still okay. So: model this out, change this link, create this hub, and create this satellite.
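For illustration, that change comes down to DDL along these lines: a new hub for the plan, a satellite carrying its descriptive context, and a reworked link without the degenerate column. The names and attributes are placeholders; in practice this DDL is generated from the updated metadata rather than written by hand.

```sql
-- The membership plan becomes its own business concept with its own hub...
CREATE TABLE HUB_Membership_Plan
(
    MembershipPlanHashKey BINARY(32)    NOT NULL PRIMARY KEY,
    PlanCode              NVARCHAR(50)  NOT NULL,  -- the business key, previously a link attribute
    LoadDateTime          DATETIME2(7)  NOT NULL,
    RecordSource          NVARCHAR(100) NOT NULL
);

-- ...and its descriptive context moves into a regular satellite on that hub.
CREATE TABLE SAT_Membership_Plan
(
    MembershipPlanHashKey BINARY(32)    NOT NULL,
    LoadDateTime          DATETIME2(7)  NOT NULL,
    PlanName              NVARCHAR(100) NULL,
    RecordSource          NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_SAT_Membership_Plan PRIMARY KEY (MembershipPlanHashKey, LoadDateTime)
);

-- The renamed link now only relates business concepts; the degenerate plan column is gone.
CREATE TABLE LNK_Subscription
(
    SubscriptionHashKey   BINARY(32)    NOT NULL PRIMARY KEY,
    CustomerHashKey       BINARY(32)    NOT NULL,
    MembershipPlanHashKey BINARY(32)    NOT NULL,
    LoadDateTime          DATETIME2(7)  NOT NULL,
    RecordSource          NVARCHAR(100) NOT NULL
);
```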
Now, what we're going to do in Agnostic Data Labs is something like this: we can go here and move things around. Where is my membership? This would become, oops, apologies, where is it, this would become something else; I want to call it subscription, or subscription membership, for example. I'm not going to make these changes live here, because it's pretty boring and I have prepared it all earlier, but what I think is important is that all these objects are defined in here. I've got this customer, which has columns and so on, and more importantly, I know which mappings are attached to it and where it's going as part of the lineage, which is built into the metadata anyway. By using this open-source schema we get all of that for free. So I can see that all these objects map to what we call a data object, the table for the customer hub. And what happens if I make these changes in the model? These objects will change, and these mappings will change. Ultimately I'll go to, hmm, that's not what I wanted to show; we've connected the mappings, and I need to go back and select the template here, so we know that this object is subject to this template.

These templates are where the code generation happens. In the templates we have these ones, and this is the one I was looking at earlier; this is what actually generates the code. By connecting this template to the object I can generate this code, which some of you will recognize: it's the exact same way we've been doing this for years and years in TEAM, VDW, and the open-source automation frameworks, just with a nicer front end around it. That goes into the preview, and it ultimately gives me my stored procedure. To explain how that works (and again, this is not meant to be a tool demo, because you can do this at home): the template generates this code, and when I press play in the code generator it creates the code for all those mappings, plus the deployment scripts, the documentation, the testing scripts, and all the other stuff you'll find in the tool. So when I press generate, it updates this output, which I can then version-control or put into a DevOps framework. We've got a new deployment script and a new set of tables, including the subscription membership hub and the channel hub; the degenerate field is gone, and so on.

Now, before I go on, and I keep forgetting this, let's stop the queue, the state machine, which is this one; I've got two, this one and this one. Now these automatic processes that load everything stop working, so my data warehouse doesn't grow back. If I go here and get rid of all these tables, there's obviously not much left, and if I deploy this version, let me make sure I'm in the right directory, yes, it will deploy the updated version, the queue will start, and everything will be fine. So now I've got my updated model, which will grow back in time, and I think that's awesome. This can obviously only be done because you've got the PSA up and running. Let's give it a minute; I think you'll believe me that if I truncate a table again it will keep coming back, but slowly but surely this updated model of my data will come into existence and be populated again. This can only work if all those frameworks align for this purpose, but I think it's again such a cool way of doing things. And if I want to go back to the previous version, all I need to do is stop the queue, clear out my integration layer, and deploy the other version, and then I'm back to where I was.

I wanted to stop there and leave a bit of time for questions, because I know it's a lot, and I don't want to bore you with too many details on how it works. But I definitely invite you to reach out and talk to me about it, to work with me on these open-source frameworks, and to use these tools, which are free. I really like these kinds of topics, so if nothing else, it has been my pleasure to show this to you. My contact details are on the deck, so do reach out if you want to know more.

What we've done today is think about why designing for change is so important, and look in detail at one of the ways to make that happen: having the right combination of frameworks, patterns, and automation in place to give us that.
And I think we need it, because everything does change. We've set up our solution in such a way that we can change our model, our interpretation of reality, in a way that's version-controlled and automated. Every time we make a change, depending on how we set up our DevOps, it can kick off automatically: if we commit the change to git, we can set it up so that it kicks off exactly the workflow I showed today, and it will rebuild the model. We do need a PSA, but by using a PSA we get a deterministic set of patterns that actually guarantees the same outcome when you load the same data, which is of course immutable by definition. If you have that in your code generation, you can be super flexible about how you define what the data means and how you run and load it, as your understanding of the context grows. So again, if you want to try this yourself, go to beta.agnosticdatalabs.com, register, we'll set you up, and you can run it yourself.