Hey there! Have you ever wondered why we shouldn't just connect our reporting solution directly to the source database? Why bother with all those extra stages that I presented last time, and how did the approach evolve over the years? If yes, then this episode is for you. So let's get started.
Now let's assume that your case is quite simple, your scenario is simple. Your only data source is a relational database, right? So let me write it down: database.
For example, it is SQL Server, right? You don't have any other data sources: no CSV files, no JSON files, no APIs, just a good old database. And on the other hand, we've got various reports that we would like to create. For example, let's show some sales, right?
Because why not? So why shouldn't we just connect our reporting solution directly to this database and query data from it? What prevents us from doing this?
Or maybe this is a valid approach? Let me explain what we should take into consideration. First of all, those two
sides use completely different types of workloads. On the left-hand side, on our source, it is most likely an OLTP workload, where T stands for transactional. OLTP means Online Transaction Processing, so this T, it means transactional. And what it means is that most likely this is a database used by some application, right? We've got some app that reads and writes data from this database. And we've got users that use the app, right? And there can be thousands of users using the app at the same time. Imagine customers placing orders through this application.
And all of these orders have to be saved inside our database. So some of the key features of this OLTP processing are that we have multiple queries executed at the same time, with a lot of concurrency in place. So we have to make sure that those requests do not block each other, or at least do not block each other in an extensive way. Secondly...
those requests to the database have to be very fast, right? Users cannot wait a minute to get a response, because they will just go away and buy the product from a different company, and we can't afford that. Then, queries executed against this source database are usually about single rows: we want to create a single new order, we want to update the status of a single row, a single order, right? So we are focused on small amounts of data. And to achieve that, a special design of this database was probably used, one that resulted in multiple tables created in a normalized way. So we can imagine a design like this: these are our tables, and let's say all of these tables contain data about our customers and orders, and this data is spread across many tables just to support this OLTP processing. And this design works; it is perfectly fine to have this type of table design on our source database, because we want to normalize our data, normalize our tables, and for example we would design tables to be at least in the third normal form, right?

On the other hand, our reporting solution is an example of OLAP processing, Online Analytical Processing, so this A here stands for analytical. And here we rather don't care about a single order. We don't want to look at a single sales transaction. No, we care about huge amounts of data. Let's say we want to check sales across the whole year.
Get the sum of sales amount from every order and compare it to the previous year. Or check it quarter by quarter to see whether our sales improved or not. And in this case, users are much more forgiving when it comes to response time.
So here our queries might take a little bit longer than those on the OLTP side. Not too much, but it is fine. We don't have to provide a response in a millisecond.
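Just to make that contrast concrete, here is a small sketch in T-SQL; the table and column names are invented for illustration, not taken from any real system. The first statement is a typical OLTP operation touching a single row, the second is a typical OLAP question scanning lots of rows.

```sql
-- OLTP: update the status of one order; touches a single row and must finish in milliseconds.
UPDATE dbo.Orders
SET    Status = 'Shipped'
WHERE  OrderId = 123456;

-- OLAP: compare sales quarter by quarter; scans huge amounts of data, a few seconds is acceptable.
SELECT DATEPART(YEAR, OrderDate)    AS OrderYear,
       DATEPART(QUARTER, OrderDate) AS OrderQuarter,
       SUM(SalesAmount)             AS TotalSales
FROM   dbo.Orders
GROUP  BY DATEPART(YEAR, OrderDate), DATEPART(QUARTER, OrderDate)
ORDER  BY OrderYear, OrderQuarter;
```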
But that is changing, and users want to get their data as fast as possible. Anyway, to make those queries on the OLAP side fast and effective, we've got to design our database in a different way. So instead of the complex, normalized structure we had on the source side, we might end up with a very simple design like this, one that resembles a star. In this case we have denormalized data, and we do this for two reasons. First of all, reading data from such a structure will be faster compared to the old one, because we make fewer joins and the queries are simpler. Secondly, this data structure is simpler and easier to understand, and that is especially important if you would like to use self-service BI. Self-service BI is a concept in which end users create reports on their own: you just give them access to the data, to our data model, and then they create their own reports. And to be able to do this, the data model has to be simple, right? It has to be self-explanatory, so the end user would be able to start using it to create their own reports.
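If you have never seen a star schema written out, this is roughly what such a denormalized design could look like; a minimal sketch with invented names, one fact table surrounded by dimension tables.

```sql
-- Dimension tables describe the "who" and "when"; the fact table holds the measures.
CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT IDENTITY PRIMARY KEY,
    CustomerName NVARCHAR(100),
    Country      NVARCHAR(50)
);

CREATE TABLE dbo.DimDate (
    DateKey   INT PRIMARY KEY,   -- e.g. 20240131
    [Date]    DATE,
    [Year]    INT,
    [Quarter] INT
);

CREATE TABLE dbo.FactSales (
    CustomerKey INT REFERENCES dbo.DimCustomer (CustomerKey),
    DateKey     INT REFERENCES dbo.DimDate (DateKey),
    SalesAmount DECIMAL(18, 2),
    Quantity    INT
);
```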
Now compare these two designs, right? On the left-hand side, the normalized one is too difficult for end users to understand. For example, think about SAP tables.
I am not sure if you have ever seen this database. SAP is this huge ERP system, right? It is widely used across the world, but it has a lot of tables with cryptic names and cryptic column names, so there is no way an end user would be able to write good queries on this model. So to enable users to create their own reports, we've got to have a simplified data model, one that is easier to understand. Secondly, if we allowed users to run queries on the source database, and let's even assume that they understand the data structure, then the queries they would write might be quite bad.
They might execute for a long time, because there are a lot of tables involved that have to be joined, and there might be complex WHERE clauses included. So a query executed from a report could run for a few minutes, and that would effectively block those tables from our application, the one that brings in the money, and that's not acceptable. We cannot block these tables, right? The application that is the main income source has to be up and running without any issues. Then we've got yet another reason why we might not be let into this source database, which is security. If our database is used by this very critical application, then no database administrator would allow end users just to get in and run their own queries. No way. Simply put, security is more important here. So that's why we usually don't want to run queries directly on the source database.
Of course, in some cases, for example for a proof of concept, it might be a good idea just to get some reports quickly, show them to end users, and get some feedback. But eventually we would add those additional stages. And you might think: hey, I actually would use some techniques to work around those issues that you just described. For example, on the source side, you could use SQL views that would hide all that complex logic and expose the data as a view. Right?
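Just to illustrate the idea, such a view could look something like this; the tables and columns are made up for the example, it is only a sketch of hiding the joins behind a friendlier shape.

```sql
-- A view that flattens the normalized source tables into one reporting-friendly result set.
CREATE VIEW dbo.vSalesForReporting
AS
SELECT o.OrderId,
       o.OrderDate,
       c.CustomerName,
       p.ProductName,
       ol.Quantity * ol.UnitPrice AS SalesAmount
FROM   dbo.Orders     AS o
JOIN   dbo.OrderLines AS ol ON ol.OrderId   = o.OrderId
JOIN   dbo.Customers  AS c  ON c.CustomerId = o.CustomerId
JOIN   dbo.Products   AS p  ON p.ProductId  = ol.ProductId;
```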
True. You could do this. But do you really believe that you would be allowed to create your own view in the source database? That the DBAs would allow this?
No chance. Another thing we could do, just to limit the impact on the application that brings in the money, would be to use a read-only replica of our database. So it would be like this.
The green database is a read-only replica that contains basically the same data as the original database, and it is updated automatically on a periodic basis. So then, instead of connecting our reports to the main database, the primary one, we would connect them to the read-only replica. This way we wouldn't put as much stress on the source database and on the application, so it should work fine, right? And yes, partially that's true, but it still doesn't solve all the issues that I just described. Still, having a read-only replica is a great option, so if it's possible, you should use it, even in the extended approach. And one more thing about the database structure and design between these two solutions. I told you that in the case of OLAP workloads, those analytical workloads, we are usually interested in all the rows, right? Huge amounts of rows. So you can think of a query like this.
SELECT, let's say, Amount from our table. Now let's assume that this table contains 20 columns, and Amount is just one of them. Even if you execute this query on the OLTP database, the database engine will actually read all the columns from disk, not only this one, simply because of the way data is physically stored in pages on disk. On the other hand, for the same query executed on our OLAP database, we might configure the storage in a way that only this single column is read from disk, which obviously would be faster. So that's yet another advantage of having this separation.
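On SQL Server, one way to get that column-only behaviour is a columnstore index; a small sketch, assuming the illustrative FactSales table from earlier.

```sql
-- With a clustered columnstore index each column is stored separately on disk,
-- so the query below only reads the SalesAmount column, not all 20 columns.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;

SELECT SUM(SalesAmount) AS TotalSales
FROM   dbo.FactSales;
```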
All right. So now that we know that reading data directly from the source database is rather not a good option, except for proofs of concept or very simple scenarios in which you are in total control of the database and you know what you are doing, let's look at some designs that were created in the past. I would like to go through a few of them just to explain how they worked. The first notable one was called the relational data warehouse. And how did it work? Basically, this concept of data warehousing is not new.
It was created in the 80s, in the previous century, so it's quite old, right? And at the beginning of data warehousing, we usually had to deal with databases as a source, right?
So we've got some database as a source, from which we wanted to ingest the data, transform it, and present it on some reports. What we would do then is to have a data warehouse, which was just a database hosted on a relational database engine. It could, for example, be split into two areas. The first one would be used for staging purposes: we would use this area just to grab data from the source database and save it as is, as tables in our data warehouse. That's the extract stage.
And then we would model. this extracted data into a design that works in our case. For example, like this. And to transform the data, we would use SQL statements.
Then we would have some reports connected to this data warehouse, and everyone is happy. Now, why was this approach possible? Basically, back then our main data source was structured. So we got structured data as a source.
And by structured, I mean that, let's say in a database, we know exactly what tables are there, what columns every table has, what data types are set for every column, whether a column is nullable or not, and what constraints are there. So it was quite easy to grab this data and store it as a table in our data warehouse, because going table to table is quite easy. Then transformations were done
using T-SQL or any other SQL dialect, right? And we would connect our reports to it. Optionally, what we could have on top were cubes, which were just yet another representation of our data that made reporting faster.
They gave us some additional capabilities, and everything worked just fine. What was great about this solution was that, because the data warehouse was just a database, it had transactional support, right? We could write transactions.
Then, we could enforce a schema on the data in our data model, because again, everything was stored as a table, right? To model the data, there were some approaches that we could use, and there were two most important ones.
So we had dimensional modeling, created by Ralph Kimball, and we had a second approach, created by Bill Inmon. These two approaches were all about how to design the data model inside a data warehouse, and there was something like a holy war between the followers of Kimball and those for whom the Inmon approach was better.
And then, in time, we got a third approach, which was Data Vault, by Dan Linstedt. Each of those approaches had its pros and cons, right?
Each of them had its own use cases, and in this course we'll focus on dimensional modeling by Ralph Kimball. The next key feature of this approach was that the target schema, the target model in our data warehouse, had to be created upfront. So before we started loading the data, we had to have it modeled already. This is called schema-on-write, meaning that whenever we want to write our data to the data warehouse, the schema is already there; it had to be defined upfront. Which was kind of a drawback, right?
When you think about how much time it took to take a new data source, process it, and expose it in the data warehouse, right? Because to be able to do this, first the developers would have to know how to model the data. And to model it in a correct way, they had to know what kind of questions
this data is supposed to answer. And that wasn't always so obvious, right? So this approach worked, but it was a little bit slow when it came to adding new data. Still, for some time this approach was a valid one. However, the world has changed.
And this approach stopped working because of big data. If we think about the key characteristics of big data, the first one is volume, right? As the amount of data we have to process was getting bigger and bigger, it turned out that this approach, in which we transform data using SQL inside the data warehouse, is not such a good option anymore, right?
Because there might not be enough time during the night, when we do batch processing, to process all of this stuff. And increasing the compute power of the data warehouse was not easy, right? Storage costs were not that low either. So basically, the more data we had to process, the more difficult it started to become.
Then we've got variety, a variety of data sources. As long as we had only structured data, everything was fine. But how about unstructured data, like JSON files or XML files, for example, right?
Well, they are actually semi-structured, because we know what a valid JSON or XML should look like, but we don't know upfront what will actually be stored in those files, or how many nesting levels there will be. Which made modeling this type of data quite difficult when we wanted to grab the data from a JSON file and just load it into those nice columns and tables in a data warehouse.
So it was kind of difficult. And finally we have velocity. Basically, this relational data warehouse approach doesn't work for streaming, when you have a constant flow of events, like from some IoT devices or some Twitter feeds and notifications; it simply doesn't work there. There are some other characteristics of big data, but let's just focus on those three. So basically, big data caused this design to start going away, and we had to find something different that would be able to handle it. That's when the next approach was created, called a data lake. And its goal was to allow ingesting and processing these big data files.
So how did it work? In this case our data sources could be not only structured ones, like we had previously. Let's say we've got a nice structured database,
but apart from that we've got some files: XML, JSON, and so on. This data had to be ingested somewhere, but this time we were not ingesting it into a relational data warehouse, but into a thing called a data lake. And a data lake is just a glorified file folder. I really like this definition, because it explains what a data lake is: you can compare it to a file folder in Windows into which you just save files. So think of a data lake as a file system that can store any files, regardless of their type. It means that inside our data lake we would store our files, whatever type they were: JSON, XML, CSVs, and so on. And if we could store any type of file, then the extract part was very easy, because we could just grab data from the source and dump it into the data lake. Very easy. We didn't have to transform it in any way. And then we would have our reporting solution that would connect to the data lake, read the data stored there, and present it on the reports.
And the key feature of this approach is that in this case we have schema-on-read, which simply means that we define the target schema of our data at the time we start to read it. When we read a JSON file from the data lake, that is when we decide what we want the output to look like. Until then, the data is just stored as a JSON file. Now, a data lake is just storage; it doesn't have any compute. So if we had to transform the data, we needed some additional software that would be able to process it.
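Jumping a little ahead, with an engine that can query files in place (Spark SQL is used here purely as an illustration; Databricks shows up later in this episode), schema-on-read could look roughly like this. The path and field names are invented.

```sql
-- The JSON file sits untouched in the lake; its target shape is decided only in the query.
SELECT id            AS OrderId,
       customer.name AS CustomerName,
       total         AS SalesAmount
FROM   json.`/lake/raw/orders/2024-01-31.json`;
```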
And remember that one of the issues with the previous approach, the relational data warehouse, was that it was too slow, not performant enough, to process big data. So the new approach had to be able to do this in a timely manner.
And that's why Hadoop was created. Hadoop is simply an open-source distributed processing framework that is able to process and manage big data. And it does it in a different way than a relational data warehouse.
Simply put, we have multiple compute nodes, no longer a single one, among which the data and the workload are distributed. Each of them does its part of the job, and then the data is returned to the client, right? And to be able to do this, these Hadoop applications needed a special file system that could access data stored on a data lake. And that's how HDFS was created.
HDFS stands for Hadoop Distributed File System, and it was simply a file system that enabled Hadoop applications to access data stored on a data lake. All right, so the perfect solution, right? Not really. Actually, it turned out that using Hadoop is quite difficult: it requires a lot of skills and has a steep learning curve, so it wasn't the best solution either. Of course, it was much cheaper than the relational data warehouse, it had much better performance, and it was scalable, because we were able to load huge amounts of data, but querying the data was not that easy, right?
Secondly, it turned out that actually those features we had in a data warehouse were pretty cool. Like transaction support. We don't have it here in a data lake.
We don't have schema enforcement, because here in the data lake these are just files. And how about concurrency? When we want to read and write to the same file at the same time, it doesn't really work, right? So actually it turned out that this data lake approach didn't quite work. It had some benefits, like the very easy way to extract and ingest the data, but many projects that were hyped up by this data lake approach failed. So yet another approach had to be developed, and it was called, or rather it is called, the modern data warehouse. So let me write it down.
Modern data warehouse. And how does it work? Basically, it's a combination of the two previous approaches: a relational data warehouse and a data lake. So here we can again have both structured and unstructured data as a source; these are our data sources. We have to ingest this data quite easily, and we know from the data lake approach that this part works. So let's keep using it, right?
So let's have this data lake here. Data lake. And we would store this data as files, right?
Because that's how a data lake works. But now, to make it easier for end users and developers to use, what we would do is add something behind the data lake, a relational data warehouse, to which data would be loaded from the data lake. And there we would model our data in a proper way, for example following Kimball's approach.
And then we would have reports that connect to our data warehouse. So it looks like a combination of the two previous approaches, the best of both, right? Ingesting data, extracting it, was very easy: just get a file, dump it in the data lake, and that's it. Very fast, very low overhead. Then, if we wanted to grant some users access to those files, just to start looking at them and analyzing them, for example data scientists, we could do this. We could grant them access to the data lake, and those guys like working with files, right? And then we got new services, so we no longer have to use those clunky Hadoop applications. We have, for example, Databricks, which makes this big data processing much easier, and the entry barrier is much lower than before. So transformations on a data lake are much easier to do, right?
But then, to have transactional support and schema enforcement, we would just load the data into the data warehouse, right? And end users that like to use SQL statements, SQL queries, would just connect to our relational data warehouse. Works great, right? And actually, this is a design that is quite commonly used nowadays.
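Just as an example of that last hop, in an Azure Synapse style warehouse the load from the lake into the warehouse could be a single COPY statement; the storage account, container, and table names are invented for this sketch.

```sql
-- Load the Parquet files that landed in the data lake into a warehouse table.
COPY INTO dw.FactSales
FROM 'https://mydatalake.blob.core.windows.net/curated/sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET');
```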
You can spot it in multiple projects, but it's not ideal; there are some drawbacks to this approach. First of all, our data is duplicated: we've got one copy of the data in the data lake and another one in the data warehouse, so we pay twice for data storage, and we have to maintain and care about both. When it comes to security, we have to set it up separately on the data lake and in the data warehouse. So it's not a perfect solution, and that's why yet another design was created, called the data lakehouse. It's quite a new one, and it was actually created by the Databricks company, so you won't be surprised that they suggest using Databricks as the service for it. As always, we have all types of data: structured, semi-structured, or unstructured. We've got databases, we've got some files, and so on. Then we ingest this data into a data lake, and you can see that this part, having a data lake, is common; it has simply proved to be working. In the data lake we store our data as files, but to make it better and to enable this data lakehouse approach, the data would be stored using a new file format called Delta. It's an open-source file format that brings some additional capabilities to the data lake, like ACID properties, something we had in databases, which basically means that the data lake now kind of looks like a database. And actually, if you wonder what file format is used internally by Microsoft Fabric, it's this Delta format. By the way, the data lakehouse name was created as a combination of data warehouse and data lake.
That's why we have this data lakehouse, right? Then, apart from the data lake, we have a new service that is able to connect to and read these Delta files stored in the data lake very easily, with a very performant connection. For example, it might be Databricks. And then you would have reports. What's so special about Databricks in this scenario is that this additional layer exposes our data lake as if it were a database: we've got a SQL endpoint, so everyone who is connecting to this service thinks that they are actually connecting to a database. They are not even aware that the SQL statement they sent is translated and the data is retrieved from files on the data lake. For them it looks just as if it were a database, right?
And everyone knows SQL, so it's very easy to query this data. The same goes for transformations: developers can write SQL statements to transform the data, which makes it quite easy to process.
And we've got this huge benefit that the data is stored in only a single place, in the data lake. Apart from that, we've got transactional support and schema enforcement; that's what this Delta format gives us.
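To give you a feel for it, here is a minimal Delta sketch in Spark SQL, as you would run it on Databricks; the table name, columns, and path are invented. The table is just files in the lake, yet it supports DDL, updates, and plain SQL queries through the SQL endpoint.

```sql
-- Create a Delta table: metadata plus Parquet-based files stored in the data lake.
CREATE TABLE sales_orders (
    order_id     BIGINT,
    customer     STRING,
    sales_amount DECIMAL(18, 2),
    order_date   DATE
)
USING DELTA
LOCATION '/lake/curated/sales_orders';

-- ACID support: updates and deletes work even though the data is just files in the lake.
UPDATE sales_orders SET sales_amount = 0 WHERE order_id = 42;

-- End users connected through the SQL endpoint just write normal SQL.
SELECT order_date, SUM(sales_amount) AS total_sales
FROM   sales_orders
GROUP  BY order_date;
```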
And don't worry, we will cover this Delta format in more detail in one of the future episodes. So this data lakehouse approach is getting more and more popular, and you will probably work with it in one of your projects one day. All right. In this course we will focus mainly on these two: the modern data warehouse approach and the data lakehouse approach.

Now, there is one more concept that is quite hot nowadays that I would like to talk about: data mesh. Please remember that data mesh is just an idea, a set of principles that should be followed; it's not a service that we can simply provision in Azure. Basically, data mesh is all about four core principles.

The first one is about domain ownership. Previously, all this data flow, all the data transformations, were implemented by a central IT team. However, this IT team is quite often overloaded with tasks, and it takes time to implement things. So in data mesh we decentralize responsibilities: it is no longer the central IT team that is doing the job. We take it away and give it to business domains like manufacturing, sales, finance, HR, and so on. Each of those domains, let's say this is our finance domain, should be responsible for processing its own data. They would have their own mini IT team that implements this flow and produces the data as a result. And it should be done by the finance domain because those guys know exactly what data they use; they know every in and out of this financial and accounting stuff, so they are the best candidates to work with this data.

Secondly, the data that is produced by those domains, let's say at the end the finance domain produces a table with sales transactions, should be treated as a product. So we have the notion of treating data as a product. And what does that mean, data as a product? Let's say that you go to a shop and you buy a new mobile phone. You expect it to be working from day one, because you believe it was tested, everything works, it has some guarantee, so if it breaks it will be fixed in no time, and you believe that the manufacturer will release new versions of the software and update it when necessary. Basically, this product has good quality. And the same should be done here: this sales table created by the finance domain should have very good quality, which means it should be tested and the data should be trusted, ready to be used and consumed by other domains and by other processes. And it is the finance domain's responsibility to make sure that the quality is good enough, or at least satisfactory: they vouch that they did everything to clean the data, remove anomalies, and make it proper. Done.

The third rule is about self-serve infrastructure. To allow those domains to work quite independently, they have to have the possibility to provision the services and infrastructure they need to generate those data products.
So there should be some platform created that would allow those domains just to go there, select "I want this, this, and this, please create it", and then within an hour those resources are deployed, pre-configured, and ready to be used. Basically, every domain could use whatever tools they need, depending on the skills they have and the type of data they process, but this ability to provision infrastructure for them is very important. And the last rule is about federated governance, which means that, okay, some autonomy is given to the domains, but to avoid a total mess, we need to set some standards, some policies, that would be enforced.
For example, we might define that every product created by the domains has to expose its data using a SQL endpoint. We might define that every product that contains sensitive data is encrypted in a specific way. We might define how those products have to be documented, how they will be versioned, how new versions will be released, and so on. And as a result of those four rules, we might end up with multiple data products created by various domains, right?
So those boxes are just data products. And actually one product can be created from another product, right? Like this, which creates this mesh, these connections between those products, hence the name.
Now, this data mesh approach is not easy to implement, especially since it involves a lot of organizational changes, right? We shift responsibility from the IT team to the domains, and those guys, for example in HR, might not be ready to generate and process this data. So there is a long road to implementing this.
And actually, there is a great blog series by Paul Andrew that discusses if and how it is possible to implement data mesh on Azure. Links to his blog posts are in the video description. Alright, so that was it. Today I just wanted to give you a brief history of these data architectures: what they looked like, what approaches were used, and what the pros and cons of each of them were.
And that's it for today. I hope this was useful. Thanks for today and see you next time. Take care.