Ciao friends and welcome to a new video from SQLBI. In this video, I want to talk about why you should never ever use tables as filters in CALCULATE. Now, DAX has several best practices, and there are really a tiny number of golden rules, rules that you should always obey.
And this is one of the most important ones: filter columns, never filter tables.
If you like rules, you can just follow the rule and live a happy life without having to watch the video. But if you're curious about why this is important and why we are so strong in enforcing this rule, then I want to provide you the rationale behind the rule. So what we are going to do is author a measure, place filters on tables rather than columns, and then investigate why the measure has two big issues. The first one: the numbers are wrong. The second one: it is incredibly slow. By changing the way the filter happens, we will make the measure fast and correct, and at the end you will have a better picture of why filtering tables is almost always the wrong choice. Actually, it is always the wrong choice.
Anyway, let's take a look at the demo together. I'm going to use a modified version of Contoso, because I need two fact tables for this demo.
So in this model, we have sales as we usually do and we also have receipts. Why do we have two fact tables? Well, the idea is that sales happen through orders. Therefore, we know the customer.
We know that a customer bought something, and we have sales in a store of a product on a given date. Whereas receipts are direct sales that happen in stores: somebody goes into the shop, picks something, goes and pays for it and then brings it home.
We have no idea about who the customer is; therefore, there is no relationship between receipts and customer. So we have revenues coming from two different sources, sales and receipts, and we can investigate the total revenues.
So I build a nice report that shows the sales amount, the receipts amount and then the combined revenues. Sales amount is just our usual calculation, SUMX over sales of quantity times net price. The receipts amount is the same code, but this time it references receipts rather than referencing sales.
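As a sketch, the three measures described here would look something like the following. The exact column names are an assumption based on the usual Contoso schema:

```dax
Sales Amount :=
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )

Receipts Amount :=
    SUMX ( Receipts, Receipts[Quantity] * Receipts[Net Price] )

Combined Revenues :=
    [Sales Amount] + [Receipts Amount]
```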
And combined revenues, as you might guess, just sums the two measures together. So everything is working fine, and you can easily check that the sales amount plus the receipts amount is actually the combined revenues. So there is nothing complex here. However, the requirement is now different.
We want to compute the large transactions. We want to build a measure, large revenues, that takes into account only revenues that individually account for more than $500. So, transaction by transaction, both in sales and in receipts, we want to check if the quantity times the net price is greater than 500. If it is, we take the transaction into account; otherwise, we skip it. So we are sure that large revenues will return a smaller value than the combined revenues, because not all the transactions account for more than $500.
And we know CALCULATE, we know FILTER, we know IF, we know conditions, so that looks like an easy task. So let's start writing the code that, as you are going to see, is going to be wrong. We start writing it: we build a new measure and we call it large revenues.
Large revenues uses CALCULATE because we want to place a filter. It will compute the combined revenues, and then we need to place a filter checking that quantity times net price is greater than 500. We are using two columns from the table, so we can just use a FILTER over Sales that says that Sales quantity times Sales net price is greater than or equal to 500. And this is good for sales.
So this filter is going to scan the sales table row by row and return only the rows where quantity times net price is greater than 500. We need to do the same also for receipts, because we have two fact tables now. So we do the same: FILTER over Receipts, where Receipts quantity times Receipts net price is greater than or equal to 500. We close one parenthesis, we close the second parenthesis, and it looks like the job is done.
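Reconstructed from the narration, the measure at this point looks something like the following. Note that this is the pattern the video warns against, because it filters entire tables:

```dax
Large Revenues :=
    CALCULATE (
        [Combined Revenues],
        FILTER ( Sales, Sales[Quantity] * Sales[Net Price] >= 500 ),
        FILTER ( Receipts, Receipts[Quantity] * Receipts[Net Price] >= 500 )
    )
```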
The measure is going to compute the combined revenues only for transactions, both in sales and receipts, where the amount is greater than 500. Just hit Enter, format it as a decimal number, and then we place it in the report. Let's get rid of sales amount
and receipts amount; we are only interested in the combined revenues. We take the large revenues and we drop it here. And now start counting with me. One, two, three, four, five, six, seven, eight, nine.
Nine seconds to compute the large revenues. Now I'm not using a small model, I'm using a model that contains around 10 million rows. However, you can easily tell that there is something wrong.
The time that is required to compute the combined revenues is incredibly small compared to the time that is needed to compute the large revenues. However, we do have a number here: 1.5 millions out of 10 millions. And the question is: is this number correct?
Well, I have no idea about the data, so I would assume that it is correct. Then I will show it to my users, and they will immediately complain: I'm sorry, this number is really, really too small; I would expect it to be much larger than that. And you look at the code, and there seems to be nothing wrong here. So we do have a problem. The problem is that the measure, first of all, is returning a much smaller number than the expected one.
And it's incredibly slow. It took around 10 seconds to compute its value. So there is something definitely wrong.
And as we are going to discover, the problem is this FILTER over Sales and FILTER over Receipts. Before we investigate the problem and understand exactly what happened, let me rename this, because we know there's something wrong. We just rename it as large revenues wrong. And I want to show you immediately the correct version of the code. So let me copy everything, build a new measure, working on it, let's wait a second.
Okay, and we now write large revenues correct. The problem is going to be this FILTER over Sales. This is just wrong. If you want to place a filter over two columns, you can just write the condition as I'm doing here, for one and the other. There is a huge difference between this condition that touches only two columns and the previous condition that actually scans an entire table.
But in order to keep the same semantics, we also need to use KEEPFILTERS, to make sure that the numbers returned are actually correct. So we need to use KEEPFILTERS here and there. And this version is the correct version of large revenues. Let me just hit Enter, format it as a decimal number. Okay, working on it.
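Based on the narration, the corrected measure replaces the table filters with column predicates wrapped in KEEPFILTERS, along these lines:

```dax
Large Revenues Correct :=
    CALCULATE (
        [Combined Revenues],
        KEEPFILTERS ( Sales[Quantity] * Sales[Net Price] >= 500 ),
        KEEPFILTERS ( Receipts[Quantity] * Receipts[Net Price] >= 500 )
    )
```

Because each predicate touches only columns of one table, the two filters no longer interact through the expanded tables.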
And then let me get rid of large revenues wrong. We place large revenues, the correct one, the first one. We drop it here, boom, job done.
So this is the speed that I would expect from Power BI. Large revenues correct, first of all, shows a much larger number. 9.3 millions, which is now correct compared to the number computed by large revenues wrong that is going to show 1.5 millions.
So a much smaller number, and as you can easily see, it takes forever to compute this value. Understanding the reason why requires a bit more theory. We need to understand expanded tables and how expanded tables affect the way the numbers are computed. I'm not going to go into the details of expanded tables here.
We have articles and videos at SQLBI that go into a much deeper level of detail than I can in this simple video. So if you are not familiar with them, go to the article, there is a link, you can click on it and spend some time understanding what an expanded table is. For now, I just want to focus on the model and check what the expanded tables are. Let me show you again the code of large revenues, the wrong version. And please pay attention: I'm filtering sales and I'm filtering receipts.
Whenever you mention a table, whenever you reference a table in DAX, it is always the expanded table. Now, what are the expanded tables of our tables? Let me show you this with the whiteboard.
Let's focus on receipts first. Well, the expanded version of receipts follows all the relationships that go out of receipts, and it includes store, product and date. So, in other words (let's use red), all the tables that I'm highlighting are part of the expanded table of receipts. And what is the expanded version of sales? Well, let's change the color, let's go for green. Sales follows a relationship with customer, because it is linked to customer, and then it has relationships that go to store, to product and finally to date. Therefore, the expanded version of sales includes all the tables that I'm coloring in green.
And if you pay attention to the highlighted part, you see that I'm highlighting in green store, product and date, plus customer; but we are not interested in customer. What I'm interested in are these three tables, because the expanded version of sales and the expanded version of receipts intersect over three tables, which are store, date and product. So what happens is that in my code, when I write FILTER over Sales here, because I'm using sales, I'm filtering customer, date, product and store, plus sales. And what about receipts? When I filter receipts, I'm not filtering customer, because there is no relationship, but still I'm filtering date, product, store and receipts.
In other words, I'm placing two filters through CALCULATE, and the two filters intersect over three tables. So the engine needs to retrieve all the combinations of date, product and store from both filters in order to intersect them and make sure that both filters are working at the same time. That explains why the number is so small: large revenues wrong shows, from both sales and receipts, only the rows where there is the exact same combination of store, product and date. So a product needs to be sold both directly and through an order, the same product in the same store on the same day. Only in this case will I see the row.
Otherwise, I just don't see the value at all. Trust me, debugging this code and understanding why this number comes out of the blue is totally insane. You will just go crazy trying to understand why on earth you see 1.5 millions rather than nine millions. The solution, well, the solution is absolutely simple, as we have seen.
If you just use large revenues correct, one filter filters just two columns, which are Sales quantity and Sales net price. The other filter filters Receipts quantity and Receipts net price. As you see, the two filters used in CALCULATE do not intersect in any way.
So there's no need to retrieve pairs from the two tables to make a combination and check values. And because there is no need for that, the two filters operate independently, and they can be pushed down to the storage engine in a much more efficient way. That is why it is faster, and it is correct because now sales and receipts are filtered independently from each other. The same reasoning about expanded tables also explains the speed. If we want to investigate the speed, well, it's enough to launch DAX Studio and look at the query plan of a test query that computes both large revenues correct and large revenues wrong.
Let's do that together. I should already have DAX Studio open, and we can write a test query: EVALUATE, SUMMARIZECOLUMNS. We group by product brand, and we compute large revenues.
Let's use large revenues wrong. So we start with the wrong version, and we need the server timings in order to understand the storage engine queries which are being executed, because there we will see the effect of expanded tables. And then we run it.
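The test query described here would look something like the following sketch, assuming the brand column is 'Product'[Brand]:

```dax
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Brand],
    "Large Revenues", [Large Revenues Wrong]
)
```

Swapping in [Large Revenues Correct] as the measure gives the second test discussed below.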
We know it's gonna take around 10 seconds, sometimes a bit less, sometimes a bit more, depending on how busy the machine is when I execute the code. Indeed, it took around 10 seconds to compute the value. Let's look at the server timings. The total execution time is 10 seconds. The storage engine CPU is only three seconds, meaning that around eight seconds are spent in the formula engine.
If we look at the timeline, you see that there is a lot of interleaving: small storage engine queries and a lot of formula engine. And if we take a look at the details, there are one, two, three, four, five different storage engine queries, but the ones that retrieve a lot of values are row number two and row number six. The other ones perform calculations, but they are not all that complicated, even though we can take a quick look at them.
Let's investigate the first query, and you see that it's retrieving Date, Product Key, Sales Quantity, Sales Net Price and Store Key. So why on earth is it retrieving the Product Key, the Date, the Store Key? Well, the reason is expanded tables: it needs to retrieve the combinations of product, store and date in order to intersect this filter with the next one. This one is coming from sales, so it needs to intersect with the one that is coming from receipts.
Not only this: it also needs to retrieve quantity and net price, so it looks like it is not performing the multiplication in the storage engine. Actually, if we look at more details, we see that it is computing the multiplication in the storage engine, retrieving again a data cache that contains all the columns. Then we have the same calculation for receipts.
So this is scanning receipts, and again it's retrieving date, product key, brand, store key, and quantity. And now, going through all the storage engine queries, you get a feeling: the engine is retrieving a lot of rows to obtain all the combinations that are needed in order to generate the filter that is required by the expression. If we take a look back at large revenues wrong, it's computing this filter, and later the formula engine will intersect them and finally compute the value. So the reason why it is slow is that it needs to perform all this work because of expanded tables.
If we execute the very same code, but this time rather than using large revenues wrong we use large revenues correct, I would expect this to be much, much faster. Indeed it is. It's running in 53 milliseconds, which compared to 10 seconds is around 200 times faster. Not only is it incredibly faster, you see that the amount of time spent in the formula engine is definitely tiny.
Most of the time is spent in the storage engine. It is so fast that there is basically no parallelism happening. And if we look at the storage engine queries, we have only two of them, which are grouping by brand and computing the expression. One is computing the expression on top of receipts; the other one is computing the same expression but on top of sales. So we have strongly reduced the level of materialization.
The engine pushes everything down to the storage engine, the storage engine makes its calculation, and welcome back Power BI: we can now scan millions of rows at the blazing speed that VertiPaq got us used to whenever we scan large models. The key takeaway: get rid of filters over tables and just use filters over columns, because of expanded tables.
You don't want to spend your time debugging these measures because of expanded tables. So just don't filter tables. As you have seen, we have a few rules in DAX. That of not filtering tables is probably the strongest one.
And by the way, it is one of the mistakes that most newbies make at the beginning, because it comes natural to say "filter sales". Maybe you're used to SQL, maybe you're used to a different programming language, or it just comes natural to write FILTER over Sales where quantity times net price is greater than 500, or whatever condition you want to place there. The thing is, expanded tables are an issue.
I mean, they are not an issue, they are one of the advanced topics about DAX that, as a DAX developer, you should be aware of and know quite well. Because expanded tables are not always useful, but when they are, they are really, really important. They can explain why the code might be right, wrong, slow, or fast.
So spend some time learning them, because that will make you a better DAX developer. And if you want to improve the quality of your DAX code, just stop filtering tables. And because I know that somewhere in your code you have some CALCULATE, whatever, and then a FILTER over Sales, just go there and remove it. Because if you have a simpler model than what I have here, it might be the case that the numbers are correct.
Still, it will be slow, so just get rid of that FILTER over Sales and replace it with a filter over columns. Enjoy DAX!