Transcript for:
Designing Data Warehouses in BigQuery

What we're going to talk about today is designing data warehouses in BigQuery, and why it's different from the way we designed data warehouses twenty years ago. In particular, I want to focus on what it is about BigQuery that makes it so fast and so efficient, and why that also requires us to change a little bit about how we think about designing our data models. First, I want to point out that this is an architecture discussion, so there won't be a lot of visuals like screenshots of the BigQuery interface or command lines showing how to execute a particular operation. Instead it's about how BigQuery is put together: what are the components that influence how we make decisions about our data model?

At a high level, there are three broad categories of data store. The first is relational: these databases have been around the longest, they're highly structured, and we often use them for transaction processing systems. They work really well in many cases, but they can be difficult to scale. NoSQL databases emerged in the last couple of decades in response to that difficulty scaling relational databases; NoSQL works really well with semi-structured data, especially in applications where replicated data doesn't have to be immediately consistent. The third type is the analytical database, which is designed for structured data like a relational database but is also designed to be highly scalable; what we give up is some of the transactional capabilities. BigQuery is an analytical database, so we're going to dig into how that differs from what has typically been the case with data warehouses.

Early data warehouses were typically built on relational databases. We usually had one server, or maybe a small cluster like an Oracle cluster, but the key thing was that storage and compute resources were tightly coupled: if you wanted to scale up storage you were also scaling up compute, and vice versa. Typically, when the data warehouse got too big or performance got too slow, we had to scale vertically, which means a bigger server with more memory and more disk. These data warehouses also ran the same software we used for OLTP systems. We might be using Oracle for an online transaction processing system and use the same version of the software to build the data warehouse, changing some configurations, like using larger block sizes, but essentially we were working with the same engine. It wasn't optimized for analytical operations; we were taking a tool optimized for OLTP and using it for analytics.

BigQuery is different. It's a serverless data warehouse designed from the beginning to be petabyte scale, so right out of the box we can begin to understand why it had to be designed differently. BigQuery does use SQL, but it's not a relational database; it's an analytical database, which means it's really designed for read-intensive analytic workloads that don't require transaction processing.
There are some other features of BigQuery that are really interesting and appealing. I won't get into them today, but I want to point them out. BigQuery ML lets you do machine learning directly in BigQuery: you don't have to export the data to Python or Spark, you can work with your data right in the database and use SQL commands to build models. BigQuery now has BI Engine, essentially an in-memory OLAP cube, which allows very low latency operations on large amounts of data. And there's BigQuery GIS if you're interested in geographic information systems.

So what's really different about BigQuery under the hood? It all comes down to the architecture, and this is probably my favorite image Google has provided about BigQuery; I think it succinctly captures what's going on. The key takeaway is that on the left we have the storage system and on the right we have the compute system, and they really are two separate things; in fact, they were developed separately. Colossus is the successor of the Google File System. It's the global block storage system Google uses, and it's used by multiple systems: Bigtable uses it, Spanner uses it. If you want to write data to highly reliable, persistent storage, it goes to Colossus. Now, you might think that sounds like network-attached storage, and that you're going to get really long latencies going over the network to grab data. Google solved that problem with proprietary networking switches they call Jupiter. It's a petabit network with very high throughput, which means you can go across the network and get the same kind of latencies you would typically get from an on-prem server with direct-attached storage.

On the compute side there are two important components. Dremel is the software system that manages a cluster of servers and is responsible for executing SQL queries; Dremel is the thing that creates a job, which is then executed and managed by Borg. Borg is the predecessor of Kubernetes; it's a job scheduler and it manages the resources. Because BigQuery is multi-tenant, you and I and a lot of other people all want access to the same resources, so Borg manages the SQL jobs, and the process of allocating resources fairly is something that comes with Dremel.

Let's look at each of these components and dig down a little, because they all influence why we do things in BigQuery the way we do. As I said, Dremel is a multi-tenant cluster; there's no isolating large-volume customers in one physical cluster and everyone else in another, we're all in the same pool of BigQuery resources. Dremel maps SQL queries into execution trees, and an execution tree is just a name for splitting a SQL query into a set of jobs that can be executed in a way that produces the required result.
The leaves of the execution tree are called slots. Slots are the compute resources that read data and perform basic computation; you can think of a slot as a processor plus some memory, a computation unit that can do the work and has network access. The inner nodes of the tree perform aggregation. So if you're grabbing data across multiple rows, picking a particular column, and transforming it or doing a calculation on it, that kind of work is done at the slot level, in the leaf nodes. If you then need to aggregate that and combine it with the results that other slots are producing, that work is performed in the inner nodes. Dremel can dynamically allocate slots to queries, which is how it maintains fairness: you don't get the kind of noisy-neighbor effect you might see with virtual machines, where one VM on a host tries to consume too much. Dremel is designed from the beginning to maintain fairness. That doesn't mean everybody gets the same number of slots; it depends on what's available and what's required. A single user can get many slots if that's what the query needs, or just a few slots if a few will do the job.

Here's another visualization that drills down into the Dremel and Borg architecture. At the bottom we have Colossus, again the global file system and data storage system, and it depends on the Jupiter network and the Jupiter switches to move data between Dremel and Colossus. Within Dremel, the leaf nodes at the bottom are the slots that do the computation. When moving data between the leaf nodes, the mixers (those inner nodes), and the root server, which is the thing that builds the query plan, the execution tree, we're also depending on Jupiter, so we have very low latency between all of the servers regardless of what kind of work they're doing. Around Dremel, over to the left, you'll notice Borg, which at this point is basically a job scheduler trying to make sure all of these nodes are being used efficiently.

Colossus, as I mentioned, is a distributed storage system, and that's really handy because it handles so many storage management functions: replication, recovery, and so on. We don't need to worry about failover servers and hot standbys, which used to take up a lot of mental energy and a lot of DBA time. We don't even need to think about storage capacity. When I used to work on Oracle, we were constantly thinking about what storage we had available, what our growth rate was, and how soon we needed to order disks so that we'd have enough. We don't need to deal with any of that anymore; we can basically assume there's enough storage out there for our petabyte-scale database.
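As an aside, since Dremel tracks how many slot-milliseconds and how many bytes each job consumed, you can inspect that work after the fact. Here's a quick sketch using the INFORMATION_SCHEMA.JOBS_BY_PROJECT view, assuming your jobs run in the US multi-region; the one-day window and the LIMIT are just example choices:

```sql
-- Find yesterday's most slot-hungry queries in this project.
SELECT
  job_id,
  user_email,
  total_slot_ms,           -- slot-milliseconds consumed across the execution tree
  total_bytes_processed    -- bytes the leaf slots actually scanned
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_slot_ms DESC
LIMIT 10;
```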
Jupiter, again, is the global networking fabric, and it's petabit scale; that's the critical thing. It carries both storage-to-compute communication and compute-to-compute communication. One of the nice things about this is that BigQuery doesn't need rack awareness. If you've worked with other distributed computing systems like Spark, you might have come across the idea that the scheduler tries to schedule jobs so that, as a job is broken down and distributed across multiple servers, those servers are all in the same rack, because within-rack communication is much faster than communication between racks. With the Jupiter network it doesn't make any difference where the servers are; you still have petabit-scale networking between them, and therefore low latency, so you don't need the extra overhead of rack-based optimizations. And Borg, again, the key thing is that it was the predecessor of Kubernetes, and it manages the mixers and the slots.

Another component that wasn't in the diagram, but is just as important, is Capacitor. Capacitor is a columnar storage format; it's not part of the cluster architecture so much as the way we use the architecture, and it's designed to support semi-structured data. The critical thing about Capacitor is that it's built from the ground up to support nested structures, so if you're thinking in terms of JSON or hierarchies, it's designed right off the bat for that, and it can also support arrays, or repeated fields. As for optimization, you might think that with nested structures you have to maintain the whole nested tree and walk down it to get to a particular node deep in the hierarchy. That's not the case with Capacitor: it keeps metadata about every row, so it's very efficient to get to attributes that are lower down in the nested structure, and there's no need to read the entire parent column to produce a nested attribute. Capacitor also has specialized compression that makes it very efficient to store data.

And again, Capacitor is a columnar storage format, which works really well for analytical databases, because we typically have very wide tables with a lot of columns, but when we query we're often picking a small subset of those columns. It's very efficient to get a few columns from many rows, as opposed to getting all the columns from a few rows, so Capacitor is designed specifically for this kind of analytical workload.
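To make that concrete, here's a small sketch with a hypothetical wide table (the project, dataset, and column names are made up). Because the storage is columnar, only the columns a query actually references get read and billed:

```sql
-- Reads (and bills) only the two named columns out of a very wide table.
SELECT order_id, order_total
FROM `my_project.sales.orders`
WHERE order_total > 100;

-- Reads every column of every row, even if you only look at a couple of them.
SELECT *
FROM `my_project.sales.orders`;
```

The query validator in the console (or a dry run) shows the estimated bytes that will be scanned before you execute, which is a handy way to see the difference.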
All right, so this is all well and good; now we have a peek under the hood at what's going on in the Google data centers. But what does this mean for us when we're building a data model? Let me go to the punch line right now, and then we'll drill into the details. First, remember that this is a distributed system, so things are happening in parallel, and at the end of the day databases are very good at scanning data blocks looking for pieces of data. BigQuery is like that as well; in fact, it's so good at scanning that it doesn't even have indexes. There are no indexes, which have been a staple of data warehousing on relational database management systems. So everything we do from a design perspective is driven by the knowledge that BigQuery is basically a big scanner, scanning through our data.

The other thing to keep in mind is that Google charges for BigQuery based on how much data you scan, so Google has a vested interest in all of us using its resources efficiently, and it incentivizes us to do that by charging for the data we scan.

So the question becomes: how do we reduce the amount of data we scan? Partitioning is a big one. Partitioning breaks up big tables into smaller pieces, and it lets us minimize the amount of data we scan to answer a particular query. BigQuery also has clustered tables, or clustered partitions, which are another way of reducing scanning: we keep the data within a partition in some kind of order, and we can take advantage of the fact that the partition is always kept in sort order. BigQuery supports joins, and it's not a problem to do joins, but they can require shuffling data across slots, so another thing that can make queries more efficient is to denormalize and take advantage of Capacitor by using nested and repeated fields. These data modeling techniques go against the grain of many things you may have practiced when designing for OLTP, or even for denormalized data warehouses on a relational database management system, but they are good patterns for BigQuery, because BigQuery is built differently from other databases. We don't want to carry over practices and patterns that were optimized for one architecture into another architecture where they're not a great fit.

The idea behind partitioning, as I mentioned, is making a big table feel smaller. Say we have five partitions, and each partition covers some range of values: partition A might cover 1 to 100, partition B might cover 100 to 200, and so on. With partitioning, when we specify in our WHERE clause that we're looking for the range 310 to 320, we know immediately that we only need to scan the one partition covering 300 to 400, and if that partition is clustered, that is, kept in order, we can reduce the scan even further. The idea is that we're breaking a big table into ordered chunks, we know what the order is, and we can use that to optimize our queries. So partitioned tables are just tables that are grouped into partitions, and they improve query performance by reducing the amount of data scanned.

There are a few different ways to partition a table. We can partition by ingestion time: as we load data in, it's loaded into a daily, date-based partition. What's really nice about this is that BigQuery automatically creates the new partition for us each day and uses the ingestion time to determine which partition a row goes into. It also creates a pseudo column called _PARTITIONTIME, an extra column on the table that we can use in our queries to determine which range or set of partitions gets scanned. _PARTITIONTIME is a date-based timestamp, and we use it in the WHERE clause, as in WHERE _PARTITIONTIME satisfies some condition.
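Here's a minimal sketch of that, with hypothetical table and column names, showing an ingestion-time partitioned table and a query that uses the _PARTITIONTIME pseudo column to limit which partitions get scanned:

```sql
-- Ingestion-time partitioned table: BigQuery creates a new daily partition as data arrives.
CREATE TABLE `my_project.warehouse.page_views`
(
  user_id    STRING,
  url        STRING,
  latency_ms INT64
)
PARTITION BY _PARTITIONDATE;

-- The filter on _PARTITIONTIME tells BigQuery to scan only one week of partitions.
SELECT url, COUNT(*) AS views
FROM `my_project.warehouse.page_views`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01') AND TIMESTAMP('2024-01-07')
GROUP BY url;
```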
There's also date or timestamp partitioning, which is similar to ingestion-time partitioning in that each partition holds one day of data, but it's based on a date or timestamp column that already exists in the table, so there's no need for the _PARTITIONTIME pseudo column. A couple of special partitions get created for these tables: rows that don't have a date in the partitioning column go into a __NULL__ partition, and rows whose value in that column falls outside the allowed range go into an __UNPARTITIONED__ partition. The final type is integer range partitioning, which is based on an integer column. It can't be a repeated field or an array of integers; it needs to be a scalar. Now, BigQuery has something called legacy SQL, the older dialect that was used with BigQuery originally, as opposed to the standard SQL we use now. Integer range partitioning can't be used with legacy SQL, but I can't think of a reason why we wouldn't be using standard SQL on a new data warehouse, so this really shouldn't be much of an issue.

In other database management systems you might have come across the idea of sharding, which is using a separate table for each partition: for example, a separate table for each day, named with some prefix like sales_orders plus a particular date, so you end up with a bunch of these sales order tables, and then you use a union to query or scan multiple tables. That is one way of doing it, but in BigQuery partitioning is preferred over sharding, because there's less metadata to maintain (each table in the sharding scenario has its own set of metadata), and there are access control implications as well: with fewer tables there are fewer times you need to check whether a particular process, query, or service actually has access to a given table. As a result, partitioning gives better performance than sharding, so in most cases we want to prefer partitioning over sharding.

Another thing we can do is set the require_partition_filter option, which means that at the table level we can require that the partitioning column be used in the WHERE clause. This is useful when other people are querying your data and might not know about the partitioning, or might not understand the implications, in performance and cost, of doing a full table scan over a petabyte table. It's a way of putting guardrails on how much data somebody can pull back. Sure, they could still write a WHERE clause that goes from the earliest date to the latest date and end up scanning a lot of data anyway, but at least it forces people writing queries to put some limit on the range of data being scanned.
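As a rough sketch of integer range partitioning and the require_partition_filter guardrail (again with hypothetical names):

```sql
-- Integer range partitioning on a scalar INT64 column:
-- partitions cover 0-9999, 10000-19999, ... up to 1,000,000.
CREATE TABLE `my_project.warehouse.customers_by_id`
(
  customer_id INT64,
  region      STRING,
  signup_date DATE
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000));

-- Date-column partitioning with a guardrail: queries that don't constrain
-- order_date in the WHERE clause are rejected instead of scanning everything.
CREATE TABLE `my_project.warehouse.orders`
(
  order_id   STRING,
  order_date DATE,
  amount     NUMERIC
)
PARTITION BY order_date
OPTIONS (require_partition_filter = TRUE);
```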
I mentioned clustered tables earlier. In a clustered table, the data is sorted based on the values in one or more columns, and that can improve performance for things like aggregate queries: if you want to do a sum across some sub-range and the data is already clustered, it's easy to find that range. It can also reduce scanning when the clustering column is used in a WHERE clause. For example, if you're in a partition with values between 300 and 399 and you're looking for 310 to 320, then as soon as you hit 321 you can stop scanning, because you're never going to see another value in the range 310 to 320 in that clustered table. That's why clustered tables can further reduce the amount of data that's scanned. Now, clusters, like anything else in the universe, are subject to entropy: things start to get out of order, especially because new data doesn't necessarily arrive in order, so as you ingest data the sorting degrades. That's okay, because BigQuery automatically re-clusters in the background, so we don't need to worry about it. This is another one of those chores you might have had to think about before, like when to refresh materialized views or how to manage storage; you don't need to think about re-sorting your clustered partitions because BigQuery just takes care of that for you.

The final architectural piece that's different about BigQuery is the ability to use nested and repeated fields. You can use nested and repeated fields in other databases; Postgres, for example, supports nested structures like JSON and XML and has arrays, but its storage structure isn't designed, the way Capacitor is, to store that kind of data optimally. Here, for example, is a screenshot of part of the BigQuery console. I'm looking at some datasets, and in the left-hand column I've opened a public dataset called bitcoin blockchain and highlighted the transactions table. If you look at that table, you'll notice it has attributes like timestamp and transaction_id, one an integer and one a string, and then something called inputs, whose data type is RECORD and whose mode is REPEATED. Repeated simply means we can have an array of these things, so we can have an array of these inputs. If we take a closer look, the inputs to a bitcoin transaction have things like script bytes and script string and a number of other fields. I'm not familiar enough with bitcoin transactions to know exactly what these are, but I can tell you that from a relational data modeling perspective this screams: create a separate table, use a foreign key from the transaction to that table, and make each transaction input its own row. With BigQuery we don't need to do that. We can store the transaction inputs together with the transaction data, because we can take advantage of Capacitor and its ability to store records and repeated fields in an optimal way. This is one of the reasons we don't pay a penalty for denormalizing; in fact, it can be advantageous to denormalize, because this is a more efficient way of storing this kind of data.
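Here's a rough sketch of what that kind of denormalized, nested schema looks like in SQL; the table and field names are simplified stand-ins, not the actual public dataset's schema. An ARRAY of STRUCTs is what the console displays as a RECORD in REPEATED mode, and UNNEST flattens it when you want one row per input, with no join involved:

```sql
-- Each transaction carries its inputs inline as an array of structs.
CREATE TABLE `my_project.warehouse.transactions`
(
  transaction_id STRING,
  ts             TIMESTAMP,
  inputs         ARRAY<STRUCT<script_string STRING, script_bytes BYTES>>
);

-- Capacitor stores each nested attribute as its own column, so pulling one
-- nested field doesn't require reading the whole parent record or doing a join.
SELECT
  t.transaction_id,
  i.script_string
FROM `my_project.warehouse.transactions` AS t,
     UNNEST(t.inputs) AS i;
```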
What I mean by that is, with a transaction and its transaction inputs, if you're using a particular transaction you're almost certainly going to use some of the data in its inputs. In a normalized model you would have to go do a join; here you don't have to, because that data is right there and easily accessible.

That covers the key points I wanted to make, so let me re-emphasize: when we're working with BigQuery, we're constantly thinking about how to deal with scanning, because we know the basic operation is scanning. I can use partitioning to minimize the amount of scanning, and I can use clustering within a partition to further reduce it. I can use joins, and that's perfectly fine, especially when joining large tables to small tables. But it often helps to denormalize and use nested and repeated fields to avoid joins, especially joins between two large tables; if you can find a way to denormalize and use nested and repeated fields instead, that's almost always a better option. And that wraps up my talk, so I will stop sharing.