Transcript for:
Microsoft Certified Fabric Analytics Engineer (DP-600) Exam Preparation

Hey everyone, and welcome back to the channel. Today we've got a very exciting video, because we're starting a brand new series here on the channel: we're going to be looking at how you can become a Microsoft Certified Fabric Analytics Engineer. Bit of a mouthful; you might know it better as the DP-600, the exam you need to pass if you want to get this certification. A lot of people have been asking me about DP-600: how do I get certified, what do I need to learn? So in this course I'm going to bring together everything you need to know for this certification so that you can hopefully pass first time. So what are we going to be doing in this video? Well, this is a bit of an introduction. We're going to look at what's in the exam and some of the details of the exam: how long is it, how do you pass, do you do it online or in person, that kind of thing. We're going to look at an overview of this course, specifically what you're going to learn in this course on YouTube and how you're going to learn it. Then we're going to finish by looking at why you might want to take this course, and why you might want to get the Fabric Analytics Engineer certification in general.

So, what's in the exam? The first section is around planning, implementing and managing solutions for data analytics. Within that we've got various sub-modules: you need to understand how to identify requirements for solutions, how we handle security, the difference between the different data gateways, how you set up access control in workspaces and capacities, and how you modify the settings of all of these things as well. Another important part of this section, and of being an analytics engineer in general, is version control. You'll need to understand how to set up version control with Azure DevOps, some of the settings and configuration options that entails, as well as some of the deployment pipeline functionality. Now, this is not a completely exhaustive list of what's in that section of the exam; we'll go through each of those areas in a bit more detail. It's just the high-level areas that are covered, and this section is worth 10 to 15% of the exam. So it's a small section, but a very important one: in terms of becoming an analytics engineer, these are really important topics to understand.

Up next we've got 40 to 45% of the exam, so this is really the core of the exam, where you should probably spend most of your study time, and it's around preparing and serving data. This is one of the core tasks you'll be asked to carry out as an analytics engineer, and within it we've got a lot of really important topics: understanding the lakehouse (how do we set one up, how do we create tables, the difference between tables and files), the warehouse (so the T-SQL experience), creating shortcuts, and ingesting data from external locations into our warehouse or our lakehouse, including what the different methods are and when to choose each one. Then we've got data transformation: once we've got our data in Fabric we might want to do some transformations on it, and you can do that with T-SQL or Spark. You will need to know at least the basics of T-SQL and Spark, and also DAX in the next section, so there are quite a few languages that you need to know to an intermediate level, I would say, for this exam.
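Since the exam expects at least basic Spark, here's a minimal, hedged sketch of the kind of notebook transformation this section is getting at. It assumes a Lakehouse is attached to the notebook and contains a Delta table called "sales" with order_date, region and amount columns; those names are purely illustrative, not something from the exam or this video.

```python
# Minimal PySpark transformation sketch for a Fabric notebook.
# Table and column names ("sales", "order_date", "region", "amount")
# are made up for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

# Read a Delta table from the attached Lakehouse.
sales = spark.read.table("sales")

# A simple shaping step: total sales per region per day.
daily_totals = (
    sales
    .withColumn("order_day", F.to_date("order_date"))
    .groupBy("order_day", "region")
    .agg(F.sum("amount").alias("total_amount"))
)

# Write the result back to the Lakehouse as a new Delta table.
daily_totals.write.format("delta").mode("overwrite").saveAsTable("daily_sales_totals")
```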
Next we've got performance: how do we optimize performance, both in terms of getting data into Fabric and in terms of the transformation piece? So if you've got Spark jobs that are running really long, or T-SQL scripts that aren't performing very well and are taking a long time to return your data, what can you do to monitor performance and then optimize it? As I mentioned, you will need to know PySpark and T-SQL to an intermediate level there, so that's something to bear in mind for this exam.

The next section of the exam is around implementing and managing semantic models, and this is worth 20 to 25% of the exam. Within this category you've got the different storage modes, so DirectQuery, Import mode and Direct Lake mode, and understanding how they work, how to set up Direct Lake mode, when to use it and maybe when not to use it. You're going to need a good understanding of DAX here; there are quite a few questions around DAX, DAX Studio and Tabular Editor 2 as well, so these are things you need to be aware of, because you'll probably get some questions around them. Also within this section you've got data modeling, so things like star schemas, bridge tables and how we deal with many-to-many relationships, and you've got security: how do we set up row-level security and object-level security within your semantic model, and how do we validate that it's actually working correctly?

The final section of the exam, which again is worth around 20 to 25%, is explore and analyze data. Here we're really looking at the analysis part of being an analytics engineer, because most of your role might be around the engineering piece, but you need to know the analytics side of things as well; that's going to make you a much better analytics engineer. In this section you're going to be asked questions about data analysis, specifically using T-SQL: analyzing your lakehouse SQL endpoint, analyzing your data warehouse, and coming up with insights from data using T-SQL. You'll also be asked about data profiling, so understanding how to profile different tables based on some of the profiling metrics that you get, and about analyzing data via the XMLA endpoint; that's another part of this section of the exam. So those are the four sections. There's quite a lot to go through, and it covers a broad range of skills: PySpark, T-SQL, DAX, data modeling, getting data in, and serving data in semantic models. Quite a broad, end-to-end exam.
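To make that data profiling item a bit more concrete, here's a rough sketch of the kind of quick profiling you might run in a notebook: row count, nulls per column and summary statistics. The table name "customers" is just an example, not something from the exam.

```python
# Rough data-profiling sketch for a Fabric notebook: row count, nulls per
# column and basic summary statistics. The table name is illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook
df = spark.read.table("customers")

print(f"Row count: {df.count()}")

# Nulls per column.
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()

# count / mean / stddev / min / max for each column.
df.describe().show()
```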
So, in the exam there are between 40 and 60 questions, and the results are scaled: you're given a result between 0 and 1,000, and the pass mark is 700 out of 1,000 on that scaled score. That's something to bear in mind. The questions can be of different types: some are multiple choice, some might be a case study, and a case study normally carries two, three or four questions, so you really need to understand what's going on because you get multiple questions about the same case study. You also have things like drag and drop and ordering lists. For a full list of the question types I recommend you go to the exam questions section on Microsoft Learn; I'll leave a link to that in the description below. You can also use the exam sandbox, which gives you a sandbox experience of the exact question types you can expect in the exam.

You'll get 100 minutes to actually complete the exam, though you should set aside at least two hours, so 120 minutes, just so you can get in and get settled. As I mentioned, it's 700 out of 1,000 to pass, so 70%. You can take the exam either in person at Pearson VUE test centers or you can do it online. My personal recommendation would be to take the exam in person if you can: it eliminates a lot of the doubts and problems around Wi-Fi, worrying about your desk setup, and your room having to be laid out in a very specific way. If you can visit a center in person I think it's a lot better, because you can just walk in, take the exam and walk out, whereas online you have to think about lots of different things. For me it's less stressful to do it in person if you can. There is an exam fee, which is $165 for people in the USA and varies for other countries, but if you look on the right-hand side we've got a free exam: if you're very quick and you go to the link in the description and complete the Fabric AI Skills Challenge training before the 19th of April (so you've only got a few days), you can get a voucher to take the DP-600 exam for free. So you can do that skills challenge training very quickly, get your free voucher, and then watch all of this series on YouTube, and as long as you take the exam before, I think it's the 22nd of June or so, you've got two or three months to go through the material at your own pace and save yourself $165 or the equivalent in your currency.

So this is what this series is going to look like here on YouTube. We've got this first video, which is an introduction to the course, and then I'm going to be covering all 11 chapters of the exam. The content is going to be delivered through various real-world scenarios, because I want to make this content interesting and relevant for you, whether you want to be an analytics engineer or already are one. To do this I'm going to be combining theory (there is some theory in some of these modules; you do have to understand how things work) with practice, so how to actually implement this stuff in Fabric. Throughout the course I'm going to be reinforcing that knowledge as we go, asking rhetorical questions as well as sample questions, and at the end I plan to do a full video that's essentially a practice paper, with lots of questions like the ones you can expect in the exam. As with all of my courses, all of the resources, module notes, scripts, notebooks and so on will be posted in the Skool community, so if you're not already a member there, I'll leave a link in the description below. It's completely free, so make sure you sign up and you'll get access to all of that.

Finally, there are a few reasons I just want to cover quickly around why you might want to take this course and go on to become a certified analytics engineer. The course gives structure to your learning: Microsoft have given us a study guide and said, these are the things we think are important for an analytics engineer to learn, so it gives you a really good pathway to follow. It might also help you get a new job; having the certification on your CV is definitely not going to hinder your chances. It can also be useful going into promotion or pay-rise talks with your boss, being able to say, well, last year I did this certification.
And this is especially true if you work in a consultancy, because consultancies are essentially selling your skills and your experience, so if you have the certification, that can help them win work in the future. So in this lesson we've looked at what's in the exam, we've looked at some of the exam details, and we've looked at an overview of this course, so the different chapters we're going to work through and why I think you should take it. Join us in the next lesson, where we'll start the course properly and look at how to plan and implement a data analytics solution.

Hello and welcome to this first chapter in this DP-600 exam preparation course. We're going to be looking at how to plan a data analytics environment in Fabric. This is the first of 11 chapters that we'll go through, teaching you everything you need to know to hopefully pass the DP-600 exam. In this chapter we're going to cover exactly what you need to know: if you look at the study guide on Microsoft Learn, these are the elements we'll cover. How do we identify requirements for a solution, so the various components, features, performance and capacity SKUs, and how do we make decisions about them? We're also going to look at how to recommend settings in the Fabric admin portal, how to choose data gateway types, and how to create a custom Power BI report theme. Towards the end of the lesson we'll test your knowledge with five sample questions, and just as a reminder, all of the lesson notes, key points and links to further learning resources will be published on the Skool community, so if you're not already a member, I'll leave a link in the description.

Now, you play the main character in a scenario, and this scenario is going to walk you through everything you need to know for those four elements of the study guide. Are you ready? Let's begin. You are a consultant and you're starting your first day on a new project, and this is Camila: she is your client for the project. On the phone before the meeting, Camila mentioned that she wants to implement Fabric, but she doesn't really know where to start, and that's where you come in. You're going to start with a requirements gathering workshop, so you organize a full-day workshop with Camila, the client, to truly understand her business and her requirements. Your goals for this workshop are to extract a set of requirements from the client to help you build a plan for their new Fabric environment, and also to do such a great job in planning that environment that the client gives you a new contract by the end of it to actually build the thing. So in this requirements gathering workshop, what are you going to ask Camila? What do you need to know? When you're identifying the requirements you should think about focusing on three elements to begin with: the capacities, so how many do we need and what size do they need to be in this new environment; the data ingestion methods, so there are lots of different ways we can ingest data into Fabric, and you're going to ask a set of questions that will help deduce the best method for getting data into Fabric based on the requirements; and similarly data storage, so we've got three different options for storing data in Fabric, and you need to ask the right questions and identify the requirements to choose the right one. So let's start off thinking about capacity requirements.
Now, the requirements we need here are really the number of capacities required and their sizing, so the SKU, the stock keeping unit. You probably know by now that in Fabric we have capacities of varying sizes. So what determines the number of capacities required? From previous videos you've probably understood that one of the things that impacts the number of capacities is compliance with data residency regulations, because the capacity dictates where your data is stored. If you have regulations that say your data must reside in the EU, for example for GDPR, that's going to be one capacity in your business; if you have other requirements that say certain datasets need to be stored in the US, you're going to need a separate capacity for those as well. Another thing that can impact the number of capacities is billing preference: the capacity is how you get billed in Fabric, so some organizations might want to separate billing between different departments, for example one capacity for the finance department, one for the consulting division and one for the marketing department. Another factor is segregating by workload type: if you have a lot of heavy, intensive data engineering workloads, you might want to put those in a separate capacity and give it enough resource to run them in a confined capacity, and then do your serving of business intelligence in a separate capacity, so that the read performance on those dashboards isn't impacted by the heavy data engineering or machine learning work being done in other capacities. You might also want to segregate by department simply through business preference, aligned with that billing preference; some companies like to have their capacities aligned to the various departments in their business. These are the things you need to extract in terms of requirements when you're talking with the client.

And what about sizing? We've touched on that already, but some of the things that impact the sizing of a capacity are the intensity of the expected workloads. Are you going to be doing high volumes of data ingestion, getting gigabytes or even terabytes of fresh data into Fabric every day? That's going to use a lot of your resources, and to get through it quickly it helps to have a higher capacity. Similarly, heavy data transformation: if you're doing a lot of heavy transformations in Spark, that's going to use a lot of resources, so if that's something you'll be doing regularly in your business, you want to choose a higher capacity. Again, machine learning training can be very resource intensive; it can take hours or sometimes even days to train a machine learning model, so if that's something you're doing regularly, you want that on a high capacity. The budget of your client also dictates the sizing of the capacity you choose: obviously the more resources, the higher the SKU, the more expensive it's going to be, and some clients might be very sensitive about cost. Related to that is whether you, or the client, can afford to wait, because if you procure an F2 SKU it will probably get through your data, but it might take a very long time, and in some businesses that might not be a problem.
Maybe you're just doing data ingestion once per day: you ingest all of your fresh data, and it might take a lot longer on an F2 capacity, but that's not necessarily a problem; maybe you can do it overnight, and by the time people come in in the morning all of that data has been ingested or transformed and is ready for consumption. So what's your propensity to wait? Other companies might have data coming in every hour, gigabytes of data every hour, and in that scenario you really need a high capacity to churn through all of it and get it processed before the next hourly load. Another thing that can determine the sizing of the capacity is whether the client wants access to F64 features. There are actually quite a lot of features that open up at F64, Copilot being a good example currently, and there are many more; I'll list them on the screen here. These features are only available if you choose an F64 capacity or above, so that's something to bear in mind: if you want to use any of them, you need F64 plus.

So what about the data ingestion requirements? Here what we really need to know is which Fabric items and/or features you need to get data into Fabric, and how you're going to configure those items once you've built them. Some of the options here (and this is not an exhaustive list, there are lots of options) are the shortcut, database mirroring, ETL via a dataflow, ETL via a data pipeline or a notebook, and the eventstream. So what questions do you need to ask the client when you're identifying the requirements to help you make the decision? These are some of the deciding factors. The main one is really: where is the external data stored? If it's in ADLS Gen2, Amazon S3 or an S3-compatible storage location (Cloudflare, for example), Google Cloud Storage, or Dataverse, then these are the sources available for you to shortcut into Fabric. So if you get any questions in the exam along the lines of "my data is stored in ADLS Gen2", then obviously the shortcut is a good option. It's not necessarily the only option (you can still do ETL from any of these storage locations), but it does open up that shortcut possibility. Now, if you see Azure SQL, Azure Cosmos DB or Snowflake mentioned, then immediately you should start thinking this could be database mirrored: you can use database mirroring to create that kind of live link to the database, and it will maintain a mirror inside Fabric. Is it on premises? If your data is stored on premises, then you'll probably want to use ETL via dataflows or data pipelines, because these two items allow you to create an on-premises data gateway on your on-premises server and then connect to it via the dataflow or the data pipeline. And if you've got real-time events, real-time streaming data, you probably want to use the eventstream to get that data into Fabric. Anything else, really, and you're looking at ETL via the dataflow, the data pipeline or the notebook, and when to choose which one? Well, I've done a very long video on that; I'll leave a link in the description, or you can click here, to help you make the decision about which of these is best for a particular organization.
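To ground the "ETL via notebook" option, here's a hedged sketch of pulling data from a REST API with ordinary Python libraries and landing it as a Delta table in the attached Lakehouse. The URL and table name are invented for illustration; the video doesn't prescribe this exact code.

```python
# Hedged sketch of notebook-based ETL: fetch JSON from a REST API and land it
# as a Delta table in the attached Lakehouse. URL and table name are made up.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

# 1. Pull the data with a plain Python library.
response = requests.get("https://api.example.com/v1/orders", timeout=30)
response.raise_for_status()
records = response.json()  # assuming the API returns a JSON array of objects

# 2. Convert to a Spark DataFrame and persist it as a Delta table so it is
#    queryable from the Lakehouse SQL analytics endpoint.
df = spark.createDataFrame(records)
df.write.format("delta").mode("append").saveAsTable("raw_orders")
```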
Related to that is what skills exist in the team, because you don't want to build a solution that can't be maintained or managed by the company, whether that's your company or your client's company. If you're looking for a predominantly no- and low-code experience, then you'll want to focus on ETL via dataflows and data pipelines; both of these are fairly low/no-code experiences that help you get data into Fabric. If you've got a lot of SQL experience in your team, then you can use the data pipeline and its script activities to do transformations on your data as it comes in. And if you have people familiar with Spark, Python, Scala, that kind of thing, then you can do ETL in a notebook; if, say, you've got data coming from a REST API and you want to use Python libraries to get it in, that's a good option for you there.

While we're on the topic of data ingestion, there are a few other features you need to be aware of that might come up in the exam and that can help you identify requirements for getting data into Fabric: the on-premises data gateway, which we've mentioned, the VNet (virtual network) data gateway, fast copy, and staging. You might get some questions about these in the exam as well. So when do we decide on these sorts of things? Well, you need to ask how the data in the external system is secured. If it's on premises, say an on-premises SQL Server, you have to use the on-premises data gateway; if your data lives in Azure behind some sort of virtual network or private endpoint, then you want to set up the VNet data gateway to access it. The volume of data will also have an impact on the items you choose for your data ingestion and on some of the features available. If you've got low or medium amounts of data per day, well, if it's low you probably don't need any of these specific features; the out-of-the-box options will be good enough. But if you've got quite a lot of data, gigabytes per day in that kind of range, you'll want to use features like fast copy and staging. Similarly, with very high volumes of data you'll want to use fast copy and staging if you're using dataflows; alternatively you can use data pipelines, and if you can get the data in via a Fabric notebook, that's another option as well.

Before we move on I just want to mention a bit more detail around the data gateways. As you probably know already, there are two types of data gateway we can configure in Microsoft Fabric: number one is the on-premises data gateway, and number two is the virtual network data gateway. A data gateway, in essence, helps us access data that's otherwise secured. If the data is on an on-premises SQL Server, for example, it gives us a secure way to access that data and bring it into Fabric; likewise, if you've got data secured behind a virtual network in Azure, in Blob Storage or ADLS Gen2 for instance, it provides a secure mechanism to access that data. I'm not going to show you step by step how to set up a data gateway in this lesson, but what I have done is link to two other videos by other creators that show the process in detail, if you want to go and have a look; I'll leave those in the Skool community. I do want to cover the high-level process for each of them, though, just so you understand a bit more about what it looks like if you've never set one up before. For the on-premises data gateway there are a few high-level steps. Number one, we need to install the data gateway
on the on-premises server, and if you've already got an on-premises data gateway set up on your on-premises server, perhaps because you're using it with traditional Power BI dataflows, then you'll need to update it to the latest version so that it's compatible with Microsoft Fabric. The next step is, in Fabric, to create a new on-premises data gateway connection, and then from that you can connect to the data gateway from either a dataflow or, now, also a data pipeline. The data pipeline connection was added in the last few weeks and I think it's still in preview, so you might not get asked about it in the exam, but it's good to know that it's now possible via both the dataflow and the data pipeline. To set up the VNet data gateway we start in Azure: there are a few settings you need to configure in your Azure environment before you can set up the VNet data gateway connection. You'll need to register the Power Platform resource provider in your Azure subscription, and then, on the item you want to access (for example your Azure Blob Storage account), you need to create a private endpoint in the networking settings and then create a subnet. Then we use that in Fabric to create a new virtual network data gateway connection, and again from that you can connect to it via your dataflow to access the data that's behind that virtual network in Azure.

Next, let's look at the data storage requirements. When we're talking to our client here and identifying requirements, what we're really trying to work out is which Fabric data stores will be best for these requirements and what overall architectural pattern we're aiming for with this solution. The options here are the lakehouse, the data warehouse and the KQL database, and some of the deciding factors to choose between them are, first, the data type: is it structured, semi-structured or even unstructured? Are you going to be getting raw files, CSV, JSON, maybe from a REST API? Is it unstructured: image data, video, audio, for example? All of these want to be stored in the lakehouse, because it's really the only place in Fabric where you can store a variety of different file formats. If your data is relational and structured, then you can keep it in either the lakehouse or the data warehouse, and if it's real-time, streaming data, you'll want to stream it into your KQL database. Another important consideration when choosing a data store is what skills exist in the team: if you're predominantly T-SQL based, you'll want to use the data warehouse experience; if you're predominantly Spark, Python, Scala, that kind of thing, you'll want to store your data predominantly in the lakehouse; and if you're predominantly using KQL in your organization, then the KQL database is the one to use for your data storage.
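As a quick, hedged illustration of that files-versus-tables split in the lakehouse (assuming a default Lakehouse is attached to the notebook; the file and table names are invented):

```python
# Sketch of the Lakehouse files-vs-tables split. With a default Lakehouse
# attached, relative "Files/..." paths point at the unmanaged files area,
# while saveAsTable() creates a managed Delta table under Tables.
# File and table names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

# Raw, semi-structured data can sit in the Files area as-is (CSV, JSON, images...).
raw = spark.read.option("header", True).csv("Files/landing/web_logs.csv")

# Once structured, persist it as a Delta table so it appears under Tables
# and can be queried through the SQL analytics endpoint.
raw.write.format("delta").mode("overwrite").saveAsTable("web_logs")
```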
Congratulations, you've completed your first engagement for Camila. You've convinced her to set up a proof-of-concept project in her organization, so she's created the Fabric free trial and set up her environment, but immediately she's hit a bit of a hurdle, and this is your next mission. She rings you up and says: "Hey, I need some help. I opened the Fabric admin portal and nearly had a heart attack. Please can you help me understand all of these settings?" So you set up a call with Camila to help her understand the Fabric admin portal. How are you going to teach her, and what are you going to teach her about the admin portal? What are the most important settings she needs to know about?

Just before we get into the Fabric admin portal and look at some of the settings available in there, it's important to note that to access the admin portal you of course first need a Fabric license, but then you also need one of the following roles: you need to be either a Global Administrator, a Power Platform Administrator or a Fabric Administrator. Within the admin portal in Fabric you'll see a menu on the left-hand side, and these are some of the important settings. In the tenant settings you can allow users to create Fabric items (if you've just set up Fabric in your organization you need to allow people to actually create Fabric items; without that you can't get very far), enable preview features (every time Microsoft releases new features they normally put them in the admin portal, and you can allow or disallow users in your organization from using them), and allow users to create workspaces. There's a whole host of security-related features you can manage and control for your tenant: for example, how you manage guest users, allowing single sign-on options for things like Snowflake, BigQuery and Redshift accounts, how you block public internet access (that's really important to know), enabling other features like Azure Private Link, and allowing service principal access to the Fabric APIs, so if you're going to be doing some automation you need to allow service principals access to the API. There are also options for allowing Git integration, so if you're setting up version control that needs to be enabled there, and features like allowing Copilot within the organization as well. In general, a setting can be one of three things: it can be enabled for the entire organization; it can be enabled for specific security groups, say you only want super users or admins within your Fabric environment to use a specific feature; or it can be enabled for everyone except certain security groups, so everyone in your organization gets access apart from those people, guest users being a good example. Other settings in the Fabric tenant settings are binary: you either enable them or disable them for the entire organization. Another important area of the Fabric admin portal is the capacity settings. In that section you can create new capacities, delete capacities, manage the capacity permissions and change the size of a capacity, so those are important capacity settings you need to be aware of: understand how they work and how to manage them within your Fabric environment.

So, great, you've talked Camila through the Fabric admin portal and she's very grateful, but before the meeting ends she has one more thing she wants to ask you about. She says: "One final thing before you go, and it might seem a bit random, but when we migrate to Fabric I want our BI team to create more consistent reports. Have you got any ideas about how we can achieve that?" And of course the first thing you think of is custom Power BI report themes. Now, there are many ways to create a custom report theme in Power BI: you can update the current theme if you're in Power BI Desktop, you can write a JSON template yourself using the documentation if you're feeling a bit brave, or you can use a third-party online tool; there are quite a few report theme generator tools online, but it's unlikely you're going to be tested on those in the exam.
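To make the JSON-template route a bit more concrete, here's a minimal sketch of the shape of a report theme file, built and saved with Python to keep the code samples in one language. "name" and "dataColors" are the core properties; treat the other keys as illustrative examples and check the report theme documentation for the full schema.

```python
# Minimal sketch of a Power BI report theme, written out as JSON.
# "name" and "dataColors" are the essentials; the remaining keys are
# examples of common top-level properties, not an exhaustive schema.
import json

theme = {
    "name": "Contoso Corporate",   # display name shown in Power BI Desktop
    "dataColors": [                # colors applied to data series, in order
        "#1F6FC5", "#F2C80F", "#FD625E", "#01B8AA",
    ],
    "background": "#FFFFFF",
    "foreground": "#252423",
    "tableAccent": "#1F6FC5",
}

with open("contoso_theme.json", "w") as f:
    json.dump(theme, f, indent=2)

# Import the resulting file in Power BI Desktop via View > Themes > Browse for themes.
```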
So your task is to show Camila how to create a custom report theme. Let's have a look at how you can do that within Power BI Desktop. Here I've got a report, and to access the report themes you need to go to the View tab, where you can see the themes. These are the preset themes, so you can just click to update the current theme, very simply, like that. But to do most of the customization you need to click on this button here, where you can access the current theme and all of the themes currently installed on this machine. Then there are a few settings down here that are quite important to know. "Browse for themes": if you click on that, it allows you to import a Power BI report theme, so if you've already got a theme, for example this one here, you can select it and install it into your environment, like so. If you want to customize the current theme, you can do that like this, and it brings you through to a UI where you can change some colors, some text, some visuals. What you have to think about for this section of the exam is what they could ask you; you have to think about how you could possibly be tested on this. In terms of the Power BI report theme material, you're likely to be tested on these buttons here and what they do, plus they could ask you about a JSON theme: they could show you a JSON theme and ask, for example, how can you edit this theme, or what doesn't look right in it, that kind of thing. So it's good to have a bit of familiarity with the different sections in these JSON files: the name, how you store your data colors as a list, some of these other settings here. You're probably not expected to memorize all of the settings in JSON format, but you might get shown a theme in JSON format and asked to modify it or comment on it in some way. To export the current theme you can use "Save current theme", which lets you export a JSON file that you can share within your organization, and you've also got access to the theme gallery here, which brings you through to the theme gallery website where you can download other people's themes for your reports.

To finish up this video and this lesson, we're going to go through five practice questions, just to solidify that knowledge and make sure you understand some of the key concepts within the context of a scenario. The first question: you're running an F2 capacity and you regularly experience throttling on that capacity. There are a number of long-running Spark jobs that take on average three hours to complete, and you need them to complete in under one hour, so you plan to increase the SKU of the capacity. Where would you go to make this change? Would you (A) go to the workspace settings and configure the Spark settings, (B) go to the admin portal, the capacity settings section, and then click through to Azure to update your capacity, (C) go to the monitoring hub and look at the run history, or (D) use the capacity metrics app? Pause the video here and have a bit of a think, and then I'll move on. The answer is B.
You can manage your capacity settings within the admin portal, under capacity settings, and from there you can click through to Azure: it gives you a link to the Azure portal, and that's where you change the capacity. It can't be the Spark settings, because those are for managing the configuration of your Spark cluster within a workspace; in the monitoring hub there's nothing to do with capacity settings, it just tells you how your jobs are running; and the capacity metrics app is a read-only app for looking at how your capacity is being used, so that wouldn't be suitable either.

Question number two: your data governance team would like to certify a semantic model to make it discoverable in your organization, and only the data governance team should be able to do this. In what order should you complete the following tasks to certify a semantic model? Have a look at the five actions here; what you have to do is put them in order, first through fifth. Once you've got them in order, we'll move on. So let's look at the answer. The correct order looks like this. We start by creating a security group for the data governance team; the clue in the question was that only the data governance team should be able to do this, so when you see that, you think, okay, they need to be in a security group to enable this. Number two (and you could argue that one and two are interchangeable, but these are the first two items anyway) is to enable the "make certified content discoverable" setting: in the admin portal tenant settings there's a section for discovery, and you need to enable that for the organization. After that, you have to make sure the setting is applied only to the data governance security group you set up. Then you need to ask the data governance team to go into the semantic model settings, then endorsement and discovery, and click certify for that semantic model. Finally, you want to validate that it has been set up correctly and that your business user can see the certified semantic model within the OneLake data hub.

Question three: you join a new company and you're given a Power BI report theme as a JSON file to use for all new projects. How do you apply this JSON theme file to the report you're currently developing? Is it (A) in Power BI Desktop, go to View, Themes, and customize the current theme; (B) go to the Fabric admin portal, click on custom branding, and set the default report theme; (C) use Tabular Editor 2 to update the theme; or (D) in Power BI Desktop, go to View, Themes, and then browse for themes? The answer here is D: in Power BI Desktop, go to View, Themes, and browse for themes. D and A are quite similar, but A is for customizing the current theme, so it won't let you import a JSON file; it lets you use the user interface to update the current theme, and that's not what we want. We want to import a JSON file as our report theme, which is possible using D. For B, that functionality doesn't actually exist: custom branding does exist, but it only lets you update the colors and icons within Fabric, not a default report theme. And Tabular Editor 2 is also incorrect.

Question four: you have 1,000 JSON files stored in Azure Data Lake Storage (ADLS) Gen2 that you want to bring into Fabric, and the ADLS Gen2 storage account is secured using a virtual network.
Which of these actions would you need to perform first? Is it (A) in Fabric, go to manage connections and gateways and click create a new virtual network data gateway; (B) create a shortcut to the ADLS Gen2 storage account; (C) in Azure, register a new resource provider and create a private endpoint and subnet; (D) install an on-premises data gateway on an Azure virtual machine in the same virtual network; or (E) enable public access in the storage account network settings? For this one, you'll remember that the answer is C. The first step in setting up a virtual network data gateway is to go into Azure and perform some network configuration: you need to register that new resource provider, the Microsoft Power Platform resource provider, in your subscription, and then on the item create a private endpoint and a subnet. As for the other options, some of them are steps in the process, but not the first step, and the question asked which action you would need to perform first. So yes, we do need to do A, but it's not the first thing you'd do. B is a bit of a red herring, because you might have seen ADLS Gen2 and thought "ah, shortcut", but actually you need to configure the virtual network data gateway before you can even think about connecting to it. D, installing the on-premises data gateway: well, we're looking at a virtual network here, so you'd choose the virtual network data gateway rather than an on-premises data gateway. And E, enabling public access in the storage account network settings, would expose your data to the public internet, so it's not advisable.

Question five: you have data stored in tables in Snowflake. Which of the following cannot be used to bring the data into Fabric? (A) use the data pipeline copy data activity; (B) create a shortcut to the Snowflake tables from your lakehouse; (C) use a Dataflow Gen2 with the Snowflake connector; or (D) use database mirroring to create a mirrored Snowflake database in Fabric. The answer here is B, create a shortcut to the Snowflake tables from your lakehouse: as you'll know, you can only shortcut to sources like ADLS Gen2, Amazon S3 or Google Cloud Storage, and the ability to shortcut is generally about files, not tables in databases. All of the other three can be used: you can do a copy data activity from a data pipeline to copy the data in, you can use a Dataflow Gen2, or you can use database mirroring, because Snowflake is one of the databases where database mirroring is possible. Camila says thanks; she's seriously impressed with your knowledge. Well done. In this lesson we've looked at how to identify requirements for a Fabric solution, the different types of data gateway available in Fabric, the settings in the admin portal, and how to create custom Power BI report themes. And the good news is you've won an extension to the contract: Camila would like you to implement and manage her data analytics environment, so you've got the next stage of the contract. In the next lesson we'll look at how you can do that: how to set up access control, sensitivity labels, workspaces, capacities, all that kind of stuff, and how to set these things up inside Fabric. So make sure you click here for the next lesson.

Hey everyone, welcome back to the channel. Today we're continuing our DP-600 series, and we're going to be looking at implementing and managing a data analytics environment.
This is the second part of the DP-600 syllabus, or study guide, that we're going through on the channel, and today we're going to cover these particular elements, which come straight from the study guide: implementing workspace and item-level access control, implementing data sharing for workspaces, warehouses and lakehouses, managing sensitivity labels, configuring Fabric-enabled workspace settings, and managing Fabric capacity. We've got five sample questions at the end of the video to test your knowledge, and as with the last video I'll be posting the key points and links to further resources in the Skool community, available for free; I'll leave a link in the description for that. We'll continue the scenario we were developing in the last lesson, and you're the main character again. Are you ready? Let's begin.

Just to recap: you are a consultant, and you're working with your client, Camila. In the last engagement you successfully planned her data analytics environment, and now you've won the contract to support Camila in implementing that solution. Camila is busy doing her resource planning: she's thinking about who she's going to need to support this environment in Fabric, and she's asking for your assistance to help her structure her thoughts and her team. So let's have a look at what that looks like in Fabric. This is the high-level structure of a Fabric implementation, and you'll notice that it's hierarchical. At the top we have the tenant level, the one tenant you're going to have in your organization. Below that you might have one or many capacities (we talked about capacities in the previous lesson, and we'll do a bit more on them in this lesson as well). In each capacity you might have one or multiple workspaces, and in each workspace we go down to the item level, so you might have a data warehouse and a lakehouse in your workspace. Then we can go one level deeper, to what's called the object level: within the data warehouse you have dbo.customer, which might be a table or a view in your data warehouse, and that sits at the object level. When we're administering Fabric we need to be aware of these different levels, because at each level administration happens in a different way. In the last lesson we looked at the tenant-level admin settings, so that's mostly things in the admin portal under the tenant settings section. Today we'll explore the item level a little later on, but first I just want to look at how we administer each of the three top levels.

Here we have a table. It isn't complete yet; we're going to walk through it together. We have the three top levels, the tenant level, the capacity level and the workspace level, and on the right-hand side we're going to go through who the administrator is, what role they require, and where the administration happens, so where they'll be working and where the settings are that they need to administer at each level. Starting at the top level we've got the tenant admin. To get the rights to be a tenant admin we actually go higher than Fabric: you need an Entra ID role of Global Administrator, Power Platform Administrator or Fabric Administrator, and we looked at what that looks like in the last lesson.
They're going to be working predominantly in the Fabric admin portal, so if you have any of those three roles, that's available to you and you can configure your tenant settings in there. One level down, at the capacity level, the capacity admin is assigned when you create a new Fabric capacity in Azure, and any administration at the capacity level is done either in the Azure portal, as we mentioned before, or in the Fabric admin portal, where there's a section for capacity settings; we'll have a look at both in a minute. At the workspace level, one level down again, you have a workspace admin, the person in charge at that level, and the role required here is a workspace role: a person or a group with the workspace role of admin, and we'll look at what that means in more detail a bit later on. They'll be doing most of their administration within the workspace settings and within manage access; those are the two areas where they'll focus most of their time. We looked at the tenant-level admin settings in the last video, so in this lesson we're going to focus on the capacity-level settings and then the workspace-level settings.

Let's focus on the capacity administrator settings for a moment. As I mentioned, there are really two places where capacity administration gets done. Number one is Azure, because we need to use the Azure portal to purchase capacity: that's where you go to create a new capacity, delete existing capacities, or change the size of a capacity (so if you've got an F2 SKU and you want to go up to an F4, that's done in the Azure portal), and you can also change the capacity administrator there. As well as that, within Fabric we can change some of the settings for a particular capacity: we can do things like enabling disaster recovery, viewing the capacity usage report (so how much of our capacity is being used), defining who can create workspaces within that capacity, defining who is a capacity administrator, and updating the Power BI connection settings, so who can connect to this capacity, or items within it, from Power BI and what that looks like. We can also permit workspace admins to size their own custom Spark pools, and this is quite important, because you might want to set some limits on the sizing of the Spark pools that workspace owners in this capacity can create; you might want to limit how high they can go, because that will have a big impact on overall capacity usage. So if you're sitting at the capacity level you might want to put some restrictions on the custom Spark configurations that happen in your capacity. And we can assign workspaces to the capacity in this section as well.

Okay, so here we are in the Azure portal, and I just wanted to show you some of the capacity settings and how to administer a Fabric capacity, starting right from the beginning: how do you actually set up and buy a capacity? You go to Microsoft Fabric; if it's not already there you can just search for it, and this brings you through to the resource creation tool. We click on Create and walk through these steps to create a Fabric capacity.
You need a subscription and a resource group, and you can enter the capacity name, give it a region, and change the size; I'm just going to do an F2. Select it, and here is where you assign the capacity administrator (we can change that afterwards, but you need at least one to set it up). It gives you the estimated cost per month here, and then you press create and it deploys the capacity. Now my capacity has been created, we can click on go to resource, and this is where we do some of the administration tasks within the Azure portal. Here we can, firstly, pause the capacity if we want to, and delete it as well. Down the left-hand side we've got some useful things: capacity administrators, which is where you change your capacity administrator, and change size, so if you're finding that an F2 is not enough for the workloads you're running, you can change it to an F4, or an F256 if you've got a spare 40 grand a month to be spending on Fabric. I'm going to keep it as an F2; that's just something to bear in mind.

Now I just want to have a look at what that capacity looks like inside Fabric. If we go to the admin portal, then capacity settings, and go over to the Fabric capacity tab, we can see the Fabric capacity we've just set up: it's an F2, it's in UK South, and it's active. We've got some actions here: we can change the name, and we can have a look at the admin, but you can't really do much there; there's a link back into Azure, so if you want to make any changes to the capacity from within Fabric you need to click on that link. But if you click on the actual capacity name, we go through to the capacity settings, and this is where you do things like enabling disaster recovery, looking at the usage report for that capacity, turning on notifications (so it notifies you when you've used a certain percentage of your capacity), updating who the administrator for that capacity is, and changing how Power BI can access the capacity and the data in it. We've also got data engineering settings, which are mainly Spark settings, and this is where we have that permission to let people customize Spark pooling, either on or off, and we can assign workspaces to this capacity: if you've just created a new capacity it will be empty, and you can move existing workspaces onto it.

As a workspace administrator we've got a few different options. Here we're stepping down a level into the workspace level, and a workspace administrator, as we mentioned, deals primarily with the workspace settings. Here you can edit the license for the workspace, so change it from, for example, a trial capacity to a Fabric capacity, or from Pro to PPU (Premium Per User). We can also configure connections to Azure, as well as Azure DevOps connections, so if you want to use Git version control for this workspace, that's done in the workspace settings. Another thing we can do in the workspace settings is set up what's called a workspace identity. This is basically a managed service principal dedicated to your workspace, and it means you can connect to things like ADLS Gen2, for things like shortcuts, using what's called trusted workspace access. This is another pretty new security feature that was added quite recently.
I'll be going through more of the security principles in more detail, probably in a separate video at the end of this series, because they're quite important. We can also edit some of the Power BI settings in the workspace settings, and the Spark settings, so particular default environments we might want to set up within this workspace, things like that. Just note here that managing access is done through the manage access section; it's in the same kind of area, but it's not in the workspace settings where we add users and groups to our workspace. We'll have a look at that in a bit more detail.

Next I just want to walk through some of the workspace settings in more detail. Here we are in a workspace called Share Hub. We're going to click on these dots, and you can see there are two options here that are useful for workspace administrators. Number one is manage access: this is how we give people access to our workspace, either a person or a group; we can add people here and give them admin, member, contributor or viewer. If we click on the dots again and go through to the workspace settings, this is where we can edit the settings for our workspace. General is just the image and the description, and also domains if you're using domains. License info is where you change the capacity and the license being used by that workspace; this one is a trial workspace, so maybe we want to change that to an F2. If we've got some Fabric capacity we can select which one we want to use (I'm going to use this Fabric F2 learn capacity), and that changes the license for the workspace. We've also got connecting to Azure, connecting to Git, downloading things like the OneLake file explorer, and enabling caching for shortcuts is another workspace setting we've got here. Then there are managed identities: if you're on an F64 capacity or higher you can make use of workspace identities, which basically let you create something like a managed service principal just for this workspace, so you give your workspace an identity and allow it to connect to ADLS Gen2, create your shortcuts and so on in a secure manner; trusted workspace access is what it's called, and if you want to learn more about that I'll leave a link to the workspace identity section in the Skool community. We can also do things like adding private endpoints for connecting via Spark to things in Azure. You've got your Spark settings down here for configuring the Spark pool being used in this workspace, and you can change the default environment, which is where you add libraries and things like that, so if you want to pre-install Python packages onto your Spark cluster so that every time you start a notebook those libraries are there ready to go, that's where you do it, and you can change some settings for high concurrency and that kind of thing there.

So Camila says: okay, great, I now have some clarity on administering Fabric at the tenant, capacity and workspace level. What I'm not sure about is giving access to the people on my team. That's important, right? We build all these things in Fabric, but how can we give people the right amount of access to these items? Let's have a look at that in a bit more detail, going back to our structure of a Fabric implementation.
So Camila says: okay, great, I now have some clarity on administering Fabric at the tenant, capacity and workspace level; what I'm not sure about is giving access to the people on my team. That's important, right: we build all these things in Fabric, but how can we give people access, the right amount of access, to these items? Let's have a look at that in a bit more detail. If we go back to our structure of a Fabric implementation, generally when we're sharing items with people that's done at the bottom three levels of our Fabric architecture, so sharing things in Fabric is normally done at these three levels. Now, object-level sharing is possible for the data warehouse and the SQL endpoint in the lakehouse, but I don't think it's assessed as part of this exam (it's not part of the study guide in any way), so we're not going to be covering that in this lesson. There's some documentation on the Microsoft Learn website and I'll leave a link to that if you're interested in object-level sharing and want to learn a bit more about it. We're going to focus on workspace-level sharing and item-level sharing, so let's start with workspace-level sharing. People or groups can be given workspace-level access, and when sharing, the person or group is assigned a workspace role; as you can see on the right hand side we've got admin, member, contributor and viewer. This role applies to all items in the workspace; for example, a viewer in the workspace will be able to view all of the items in the workspace. Let's take a moment here because roles are really important, and the role you assign someone basically dictates what they can do in your workspace. This image comes from the Microsoft documentation (again, I'll leave a link in the school community) and I definitely recommend you take some time to study it, which is what we're going to do here. The first thing to note with this diagram: let's start with the admin. If you give someone admin permission, what can they do? First, they can update and delete the workspace, so this is a really high level of permission that only maybe one or two people should have, people that you trust in your organization. They can also add and remove people, including other admins; it's the only role that allows you to add another admin. Next we move down to the member: a member can do similar things to an admin, but they can't add an admin. A member can only add people with lower permissions, or other members. The other permission that's unique at the member level is that you can give other people the permission to share items; being able to share items is a fairly high-level thing to do, and you're giving people that permission to share, so that's something to bear in mind as well. Then we move down to the contributor level. Contributors can do pretty much everything in the workspace other than, as we see here, deleting the workspace, adding other people into the workspace, and allowing other people to share. But when you're talking about contributing to Fabric items, anything around lakehouses or warehouses or data pipelines, they have read and write access to all of these things. The viewer has a unique set of permissions, those six green ticks, and if we go from top to bottom: they can view and read content in a data pipeline, a notebook, a Spark job definition and a machine learning model, so they can view the outputs of these things; they can also view and read the content of KQL databases, query sets and real-time dashboards; they can connect to the SQL analytics endpoint of a lakehouse or a data warehouse; and they can read lakehouse and data warehouse data and shortcuts with T-SQL. So the viewer can basically use SQL to analyze data in either the lakehouse or the data warehouse. What they can't do is access any of the OneLake APIs or Spark, so they can't run Spark jobs or notebooks or anything like that.
Now, one unique thing about the viewer permission is in data pipelines: they can't edit or update any of the activities in a data pipeline, but they can execute and cancel the execution of a data pipeline run, so that's an important edge case to remember for the exam. And finally, they can view the output of data pipelines, notebooks and machine learning models. So that's a high-level overview of all of these workspace roles and what they can do at each level. Again, this is really important to understand for the exam, so I definitely recommend going into the documentation and taking some time to understand these differences, because you'll probably be tested quite a lot on them. Let's have a bit of a workspace-level access example. This is John: he's a business analyst working in Camila's team, and Camila has asked you to give him contributor access to workspace one. This is workspace one, this is the architecture they've got here, and this is what John's access currently looks like, where a red box is basically no access at all and a green box is any access; you can see everything's red, so currently he has no access to anything. You are an admin in the workspace. What steps would you take to give John this access? Have a little think about that, and then we'll talk about it in a second. Okay, so what steps would you take? Well, personally what I would be asking is: does John fit into an existing security group that has contributor access to the workspace? Best practice here is to add people into groups rather than adding them individually; it just makes maintenance in the future a lot easier, so where possible we always want to add people via groups before we add them individually. Now, if a security group doesn't exist then you might want to create one for John: maybe you create an analyst security group, so that in the future when another analyst joins the team or the workspace you can just add that person into the group rather than having a long list of individual contributors in the workspace. So you create an analyst security group, add John to the security group, and give the group contributor access to workspace one. In the picture, how does that update? It looks a bit like this: John now has access to workspace one and everything within it, because we've given the security group access at the workspace level. You'll notice that he still has no access to workspace two, he can't even see it, so that's something to bear in mind when you're giving workspace-level access. Then Camila suddenly rings you: she's realized that John shouldn't have access to everything in the workspace; instead she wants you to give him access to the data warehouse only, not the semantic model, not the data pipeline. So how would you change what we've just done to reflect this? This is what we're looking at here: we want to go from this, the arrangement we've just set up for John at the workspace level, to this, at the item level. Now this is important because it reflects quite an important principle when it comes to giving people access, which is the principle of least privilege. In general, in data systems and information security, we want to give people the amount of access they need to perform their role and nothing more. So if John doesn't technically need access to the semantic model or the data pipeline, then one way of achieving that is to give him item-level access, giving people access to only what they need.
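Before we move on to item permissions, it's worth noting that the workspace-level grant we just did for John's analyst group can also be scripted. Here's a minimal sketch using the Power BI REST API's add group user call; the IDs and the token handling are placeholders, so confirm the exact request shape against the REST API reference.

```python
import requests

# Placeholders: the workspace GUID, the Entra ID group's object ID, and a
# token with the right Power BI scopes all need to come from your tenant.
TOKEN = "<aad-access-token>"
WORKSPACE_ID = "<workspace-one-guid>"
ANALYSTS_GROUP_ID = "<entra-security-group-object-id>"

url = f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/users"
payload = {
    "identifier": ANALYSTS_GROUP_ID,
    "principalType": "Group",
    "groupUserAccessRight": "Contributor",  # or Admin / Member / Viewer
}

resp = requests.post(url, json=payload, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
```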
Okay, just to recap on some of the additional permissions. When you share a data warehouse you get these three additional permissions. We have Read all data using SQL, and what that means is it allows people to read all objects within the warehouse using T-SQL. We also have Read all OneLake data, and with this you're allowing that person to read the underlying OneLake files using Spark, pipelines, anything else basically; so with the top one they can only use SQL, and if you give them the second permission it allows them to do basically anything with that data. The third permission allows the user to build reports on the default semantic model, not any custom models, just the default semantic model. When it comes to the lakehouse these permissions are similar, just worded a little differently: again, if you give them Read all SQL endpoint data it allows them to run T-SQL on the SQL endpoint; if you give them Read all Apache Spark data it allows them to run notebooks and Spark code on top of that data; and Build reports on the default semantic model does exactly what it says on the tin. Now, one point I did want to make here is around the OneLake data access model. This is a very newly announced feature, so it might not have made its way into the exam yet, but I do think it's going to have a very big impact on how we manage security in Fabric going forward, so I did want to at least mention it. I'm not going to go through it in detail, I just want to flag it; you might want to have a look at the documentation page so that you become aware of it. It's not something I've looked at much in detail yet, but from looking at the documentation, how I understand it is that it's going to allow you to perform RBAC, role-based access control, on things like folders: you can give a specific role access control over a folder, and then the permissions are inherited for every item in that folder. But like I mentioned, it's a preview feature and it's relatively new, so I'd be surprised if they ask you about it in the exam, but it is very important and I do think it will change quite a lot in Fabric, so I wanted to mention it. If you look at the study guide for the DP-600 exam, it does say that most questions cover features that are generally available, and that the exam may contain questions on preview features if those features are commonly used; I think at the moment this isn't commonly used, because it's only been released a few weeks ago, so that's something to bear in mind. Camila says thank you, now I understand workspace-level and item-level sharing in more detail. One last thing before you go: we've been working on this government project and I need to apply sensitivity labeling in a workspace, can you walk me through it? So what even is a sensitivity label? Well, sensitivity labels are a data governance feature, and they're created and managed in Microsoft Purview. Fabric items such as a semantic model can be given a sensitivity label, such as Confidential, for information protection purposes. In some industries, labeling data and information with a sensitivity label is necessary for compliance with information protection regulations. Now, to apply a sensitivity label in Fabric there are really two main methods.
The first is to go into the item itself, for example this lakehouse here, where in the top toolbar you've got the sensitivity label, and you can just click on that drop-down and change the sensitivity label there. The other option is to go into the settings of that particular Fabric item, and in the left hand toolbar there you've got sensitivity label, and you can change it in there. Now, one of the options you get if you go through the settings method is to apply the label to downstream items, so again we've got that notion of inheritance: the label you give it here also applies to everything downstream. Okay, so now we're going to test some of your knowledge of everything we've learned in this section of the study guide, and we're going to start with a case-study-style question: we'll go through a bit of a case study and then you'll be asked three, maybe four questions on it. Let's begin: Toby creates a new workspace with some Fabric items to be used by data analysts. Toby creates a new security group called data analysts, and he includes himself as a member of this security group. Toby gives the data analysts security group a viewer role in the workspace. What workspace role does Toby have: is it A viewer, B member, C admin or D contributor? Pause the video here, have a think, and then we'll move on to the answer. So the answer here is C, and this combines two pretty important concepts to understand when we're looking at workspace-level sharing. Number one is that the creator of a workspace is always given admin permissions in that workspace. Now, we also have Toby with the viewer role in that workspace, because he's in the security group with the viewer role, and this is the other concept: if you have more than one level of permission within the workspace, you're always given the higher level. He's got the admin role because he created the workspace, and he's got the viewer role because he's in that security group; well, the admin permission is always going to be prioritized, so he's always going to take that role over his viewer role. Let's continue this case study. Sarah is also a member of that data analysts security group, and she has no other role in the workspace. Which of the following can Sarah not do in the workspace: A, execute a data pipeline; B, run SQL scripts in the data warehouse; C, run a Spark notebook; or D, review the evaluation metrics of a machine learning model? The answer here is C, run the Spark notebook. When we look at the workspace-level roles and the permissions for each role, we know that Sarah is a viewer in the workspace, her highest level of permission, and you'll remember that the viewer role can actually execute a data pipeline in a workspace, and they can also run T-SQL scripts against a data warehouse or the SQL endpoint of a lakehouse. What they can't do is run a Spark notebook: they can have a look at the notebook, but they can't actually execute any code. So C is the right answer, because what we're looking for is what she can not do in the workspace, and D, review the evaluation metrics of a machine learning model, we know she can do because she's just reading the output of that model. To continue this case study again: Toby wants to delegate some of the management responsibility in the workspace. He wants to give this person the ability to share content within the workspace and invite new contributors to the workspace, but not add new admins to the workspace. What role should Toby give this person: A admin, B member, C contributor or D viewer?
So the answer here is B. The key point in the question was "but not add new admins to the workspace": we know that to be able to add another admin into a workspace you need to have admin permissions yourself, and Toby doesn't want to give this person that ability, so we know it can't be admin. It's not going to be viewer, and it's not going to be contributor; we know that member is one step down from admin, and that allows you to do all three of these things: they can share content and they can invite other contributors, because a member can add new people as members, contributors or viewers, but they can't add other admins. So it would be B, member. The next question is completely separate. You have the admin role in a workspace. Sheila is a data engineer in your team, and she currently has no access to this workspace at all. Sheila needs to update a data transformation script in a PySpark notebook; the script gets data from a lakehouse table, cleans it, and then writes it to a table in the same lakehouse. You want to adhere to the principle of least privilege, so what actions should you take to enable this? Is it A, give Sheila the contributor role in the workspace; B, share the lakehouse item with the Read all Apache Spark data permission; C, give Sheila the admin role in the workspace; or D, share the lakehouse item with the Read all Apache Spark data permission and share the notebook with edit permissions? The answer here is D. One of the clues in this question was the line that says you want to adhere to the principle of least privilege; immediately when you see that, giving people workspace-level access is not really good enough. A and C both give a role in the workspace, which would let her contribute to, change and edit everything in the workspace, but that doesn't adhere to the principle of least privilege, so we can immediately rule out A and C. Another really important point in the question was that Sheila needs to update a data transformation script in a notebook, so she needs to edit the code in a notebook; she's not just executing an existing notebook, she needs to actually make changes to it. With B you wouldn't have that ability: you've got Read all Apache Spark data, so you could execute a notebook against that data, but you can't edit the notebook. To be able to make these changes she really needs access to both the notebook and the lakehouse that the notebook is reading from: you can't just share the notebook, because then she won't have access to the underlying data, and you can't just share the lakehouse, because then she won't have access to the notebook she needs to edit. So the answer is D: share the lakehouse item, giving the Spark permission, and also give edit permissions on the notebook.
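To picture what Sheila is actually editing, here's a minimal sketch of that kind of transformation notebook in PySpark. The table and column names are made up, and it assumes the notebook's default lakehouse contains the source table.

```python
from pyspark.sql import functions as F

# `spark` is the SparkSession provided by the Fabric notebook runtime.
# Read from a table in the attached lakehouse (hypothetical table name).
df = spark.read.table("sales_raw")

# Clean the data: drop duplicates, drop rows missing the key, fix the date type.
cleaned = (
    df.dropDuplicates()
      .na.drop(subset=["order_id"])
      .withColumn("order_date", F.to_date("order_date"))
)

# Write the cleaned data back to a table in the same lakehouse.
cleaned.write.mode("overwrite").saveAsTable("sales_clean")
```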
Next question: you have the admin role in a workspace, and you want to pre-install some useful Python packages to be used across all notebooks in the workspace. How do you achieve this? A, in the Fabric admin portal go to Spark settings and install the libraries; B, go to workspace settings, Spark settings and then library management; C, create an environment, install the packages in the environment, then go to workspace settings, Spark settings and set it as the default environment; or D, go to capacity settings and then default libraries. The answer here is C: creating an environment and then going into your workspace settings and setting it as the default environment for Spark. Now, this is a bit of a naughty question, because B is the old way: it used to be that you'd go to workspace settings, Spark settings, and there was a section for library management, but that's actually not possible anymore; the way to do it, as I mentioned, is to create an environment and then make it the default environment in your Spark settings. The capabilities described in A and D don't actually exist, so they're red herrings, and the answer is C. Camila says thanks; she's seriously impressed with your knowledge. Again, in this lesson we covered all five of these elements of the DP-600 study guide: workspace and item-level sharing, data sharing for data warehouses and lakehouses, sensitivity labeling, and then workspace and capacity-level settings. And the good news is, again, you've won an extension to the contract: Camila would like you to implement control over the entire analytics development life cycle in her organization. For this we're talking version control, deployment pipelines, Power BI projects, all that good stuff, and that's what we're going to be looking at in the next lesson, so click here to continue to that lesson. Hey everyone, welcome back to the channel. Today we're continuing the DP-600 series, looking at what it's going to take to hopefully pass that exam and become a Fabric certified analytics engineer. Today is video 4, and we're looking at managing the analytics development life cycle. In this section of the exam we're focusing on implementing version control, creating and managing Power BI projects, planning and implementing deployment solutions, performing impact analysis on downstream activities, deploying and managing semantic models through the XMLA endpoint, and creating reusable assets: Power BI template files, Power BI data source files, all these kinds of things. That's what we've got in store for you today. As ever, there will be some sample questions at the end, and I'll also be posting all of the lesson notes and links for further resources in the school community, so go and grab them there. You will play the main character in the scenario again; we're continuing the scenario we've been developing throughout this course, you're the main character, are you ready? Let's begin. As you know already, you are a consultant working with a client called Camila, and you've already helped Camila plan and implement her Fabric environment. Now the time has come to implement the analytics development life cycle. She wants to focus on version control, at least initially, but to be honest she's not really familiar with Git version control and has never really used it in an analytics environment before. So what you're going to have to do is set up a call with Camila and walk her through the basics of Git and version control first: what is it, why does it even exist, why is it now a part of Fabric, and what are some of the key terms and terminology we need to understand when implementing version control. Just bear in mind that all of the Git integration features are currently in preview, and some of the other features we're looking at today, like the deployment pipeline functionality, and a lot of this analytics life cycle stuff generally, is still in the preview stages, so bear that in mind as we walk through these examples. I'll show you what's possible today, but if you're coming from using GitHub or Git a lot in the past, a lot of the features you might be used to are currently not available. So let's move on. Okay, I just wanted to start this video with a bit of a primer on Git and version control in general.
I think the best way to do that is to show you what it looks like, and along the way we can explain some of the key concepts in Git and version control; I realize that a lot of people coming from a Power BI background probably don't have experience with version control and Git in general. So let's set up a project, start at a basic level, and introduce more and more concepts as we go through this demo. In Fabric, the way you're going to start is with Azure DevOps. Azure DevOps is a service used primarily for software development and for managing infrastructure and the DevOps life cycle of that infrastructure and software, but it's also what we're going to use to store our code, our artifacts and our Power BI project files, because it comes with repositories. You need to set up an account; just go to this website here, and I'll leave a link in the description or in the school community. Once you've set up Azure DevOps for your organization you'll come through to this page; I've set up an organization called Fabric University. Then you have this concept of a project: a project is where you store your repositories and your code. It has lots of other features as well, but we won't be going into too many of them in this short video. We're going to create a new project for this demo, let's just call it dp600 practice, make it private, and click on create. Okay, so this is an Azure DevOps project. On the left hand side (let me make that a little bit bigger) we've got some useful things to know about. This is just the overview, where you can see an overview of the project. Probably the most important ones to bear in mind are Boards, which is a work management tool for planning development work (we're not going to focus much on that in this video), and what we're really interested in, which is Repos, repositories. Repositories are the core thing we need to implement version control. Repositories are also available in GitHub, but currently that integration is not possible between Fabric and GitHub; you can only use Azure DevOps repos, and that's why we're using this. A repository is just where we're going to store our code in the cloud: when you're developing locally, a Power BI report for example, what we really want to do is store that report in the cloud in our repository, and that's where we're going to be tracking changes and managing who can make changes to the project. So let's do a bit of configuration to set this repository up so we can use it for version control. The first thing I'm going to do, right at the bottom here, is initialize a main branch with a README (a README file is just a markdown file, like a text file, just to initialize our branch), and we'll talk about branches in more detail shortly. Currently I have one branch, and it's the main branch. Now, branching is a whole field in itself: there are lots of different strategies for how we can manage branches when we're developing our code or our Fabric artifacts, but we'll get into that a bit later on. For now what we're interested in is creating a copy of this repository on our local machine, so that's going to be the first sync: we want to get these two in sync, because the way version control generally works is that you develop locally on your local machine,
for Power BI say, and then you sync those changes up to the repository in the cloud. And if you have a big team of people, maybe 10 people doing this, they're all syncing their changes into this main branch by default. So what we want to do is clone this repository: we get this URL here for the Git repository, we can copy the URL, and we want to clone it onto our local machine. There are various ways to do that. You might see in the documentation that Microsoft uses a tool called VS Code; I personally prefer GitHub Desktop. It's another tool, it's completely free, you can download it, just Google "GitHub Desktop download". It's built by GitHub, obviously, but it's also possible to work with Azure DevOps repos in there as well, and it just makes the whole process of version control and Git a lot simpler; it basically builds a UI on top of a lot of the Git functionality. So the first thing we want to do is click on file and clone repository, because we're trying to pull things down from the internet, and if you click over to URL we can just paste the URL we got from our Azure DevOps repository and clone it. Now it's going to ask for a username and password, so if you go back to Azure DevOps we've got this generate Git credentials option: we can copy the username in there, copy the password in there, save and retry. That's just going to authenticate with our Azure DevOps account, so it knows this is actually the person that owns that repository; it's creating that authentication between the two. Okay, so what has that actually done? Well, now we've got a copy of what we have in Azure DevOps on our local machine (you can see the file path there is C:\Users\learn\documents\git) and we can actually open it in Explorer. So now I have a copy of those files and folders from Azure DevOps on my local machine. That's the first thing we need to do: clone the repository and get that copy locally. Now that we've got this all set up and synced, GitHub Desktop is tracking this folder, so any changes you make locally to this folder it's going to pick up. For example, if I create a new text file and call it "my file", and we go back to GitHub Desktop, you can see that it's picked up that change already: it's saying you've added a new file in there called my file.txt.
If I open that with Notepad, for example, type "my amazing file" and save it, here we can see again that it's tracked the changes. Git works by tracking the changes in text-based documents; traditionally it's used for code, so a Python file, a SQL script, a JavaScript or C file, all these kinds of things are text-based, so they work really well with Git. Now we've made some changes to our local repository, we've added this text file, but what you'll notice is that here in Azure DevOps nothing has changed yet, because what we need to do is push those changes into the cloud environment; maybe you've got a team and you want other people to see the changes you've made. We'll start with this text file and then slowly move up to more complicated stuff (we'll look at Power BI projects as well), but for now let's just push these changes. You see down here you can add a commit message, so "created a new file", and then we're going to click this commit to main button, which commits our changes onto that main branch, and up here we click on push to origin. So we commit and then push the changes up into the cloud, into Azure DevOps. Now if we refresh Azure DevOps we can see that file there, my file.txt, "my amazing file", in this repository in Azure DevOps. Then if we want to make more changes, maybe we change it to "even more amazing file" and save that; again it's going to show you, well, your original file was this, this is what we've got stored in Azure DevOps, this is what the main branch currently says, and now you've changed that file, and it's noticed that. So it's tracking every change you make in that file, which is a really important thing to bear in mind. Let's just say "update this text file"; obviously in practice, in proper environments, you'd add a more meaningful commit message, because these are really important, and we'll push that to the origin again using this button. If we go back to Azure DevOps you can see that it has now updated in here as well. Great, so that is the very basic implementation of Git and version control: syncing our local repository with our Azure DevOps repository in the cloud. At this stage I think it's worthwhile just noting what we've done: at the top we've created an Azure DevOps account, we've added that README file, we've copied the Git URL and cloned it locally, and then we've modified that local folder, created a new file, modified the file, committed those changes, pushed to the origin and observed the changes in Azure DevOps. Now this is great, it's a good first step, we can track changes between files now, but one of the core benefits of Git and version control in general is that instead of just allowing anyone to update that main branch, we can protect the main branch, so we can control who has access and who has the ability to overwrite those changes. So let's make a few changes to the configuration of our repository in Azure DevOps: we're going to protect that main branch so that nobody can just overwrite that file anymore, we want to add in a review and approval phase, and this is going to help a lot with our quality and our control over who is changing the code base and how. Let's have a look at that in a bit more detail.
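By the way, everything we just clicked through in GitHub Desktop can also be scripted. Here's a rough sketch of the same clone, edit, commit, push loop using the GitPython library; the repo URL and paths are placeholders, and it assumes your Azure DevOps Git credentials (or a PAT) are already configured for HTTPS.

```python
from pathlib import Path
from git import Repo  # pip install GitPython

# Hypothetical Azure DevOps repo URL and local path.
repo_url = "https://dev.azure.com/<organization>/<project>/_git/<repo>"
local_path = Path.home() / "Documents" / "git" / "dp600-practice"

# 1. Clone the repository locally (GitHub Desktop's "clone repository").
repo = Repo.clone_from(repo_url, local_path)

# 2. Make a local change -- here, a new text file.
(local_path / "my file.txt").write_text("my amazing file")

# 3. Stage and commit the change ("commit to main" in GitHub Desktop).
repo.index.add(["my file.txt"])
repo.index.commit("created a new file")

# 4. Push the commit up to Azure DevOps ("push to origin").
repo.remotes.origin.push()
```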
Okay, so in our Azure DevOps project we can go to project settings and then down to repositories, and what we're going to do is add a branch policy to protect our important branch. So we're going to add something in here to protect the default branch, and automatically include reviewers; I'm going to add myself as the reviewer, just because there's only one person in this project, but in reality you might have a team of senior or lead developers that are responsible for reviewing code and approving changes to the code base. Now that we've set that up, let's have a look at doing something more. We've got our file here, and if I save it I've just added another line, "new changes"; if we go back to GitHub Desktop we can see that it's picked up that new change again. But now when I update this ("added some new changes"), commit it and try to push to the origin, it's going to say error, because now pushes to this branch are not permitted: you must use a pull request to update this branch. That introduces a couple of new concepts we need to get around this: number one is the concept of branches, and number two is the concept of pull requests. Let's have a look at both of those now. Now that we've implemented some protection over that main branch, we have to change how we develop, so now we have to use branching. Currently all of my files, the text file and the README file, are on this main branch, so if I now want to update what's in that text file I'm going to have to create a new branch, because nobody can update and edit the main branch directly. In GitHub Desktop that's very easy: we click on branch, new branch, change the name to "change text file", create the branch, and now we can publish this branch and add in our changes. If we go back to Azure DevOps, back to our project and back to Repos, and go down to this "change text file", you can see that it has actually found this new branch. And if we go over to the pull requests section (we'll explain what that means in a minute) you can see it's registered the fact that we've now published a new branch: we've made some changes, added that "new changes" line in the text file, and it's popped up asking, do you want to create a pull request here? The pull request is how we ask to merge some changes into the main branch, and you can only do this through a review, because that's the policy I've set up on that main branch. So we can create a pull request: you can say "I've added some new changes", you can add a reviewer (which is going to be me), say "please review this", and create the pull request. What that's going to do is notify the reviewer and say, this person wants you to review their changes to the code base on this "change text file" branch. Now as the reviewer, because I'm playing both roles here, I can have a look at the file changes, review them ("yep, added new changes, that looks good to me"), and approve the change. And now what you can do is complete: complete is going to merge those changes into the main branch, your protected branch, because we've gone through the review process; the text file has been reviewed, so now we can merge it safely into the main branch.
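That pull request can also be opened without the UI. Here's a hedged sketch using the Azure DevOps Git REST API; organisation, project, repo and PAT are placeholders, and the api-version string is my assumption, so check which version your organisation supports.

```python
import base64
import requests

ORG = "<organization>"
PROJECT = "<project>"    # e.g. the dp600 practice project
REPO = "<repository>"
PAT = "<personal-access-token>"

url = (
    f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/git/repositories/"
    f"{REPO}/pullrequests?api-version=7.1"
)
payload = {
    "sourceRefName": "refs/heads/change-text-file",  # the feature branch
    "targetRefName": "refs/heads/main",              # the protected branch
    "title": "Added some new changes",
    "description": "Please review this change to the text file.",
}
# Azure DevOps PAT auth: basic auth with an empty username.
auth = base64.b64encode(f":{PAT}".encode()).decode()

resp = requests.post(url, json=payload, headers={"Authorization": f"Basic {auth}"})
resp.raise_for_status()
print("Created PR", resp.json()["pullRequestId"])
```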
Back in Azure DevOps, when you complete the pull request there are a few options that are quite important to bear in mind, and we'll look at those in a bit more detail when we move into Fabric, but generally the default settings are fine: complete associated work items after merging is fine, and we don't want to delete the "change text file" branch after merging (generally not in Fabric, and we'll get to why that is in a minute), so we'll just do merge on complete. Okay, so we've now merged our pull request; let's have a look at the repository. We now have my text file in the main branch and it's got the new changes, so we've merged our changes into the main branch. Great, that's the first section of this demo done. In that second half, what we did was protect the main branch: we added an approval policy in the repository, we tried committing a new change and it was rejected, we added a new feature branch, brought the changes onto that branch, committed the branch and opened a pull request, then our approver approved it and we merged it into the main branch. Just to reiterate the benefits: now we're tracking the changes between files, and we're also controlling who can make changes, so we're protecting that main branch and saying, if you want to update this, it has to go through an approval process. Now, up until this point we've just been using one text file, but actually Git can track the changes in any text-based file format. Text-based is the key phrase, and that's the problem that existed with PBIX files: they're a bit of a black box, and if you can't open a file in Notepad and understand what's going on, it's going to struggle with source control and version control in Git. That is one of the core reasons why Microsoft developed the Power BI project file format, because it represents a Power BI project in text-based files. So let's have a look at an example of a PBIP first, and then we'll look at how we can integrate it into version control systems. Okay, here we are in Power BI Desktop and I've got one of the sample reports from Microsoft, the competitive marketing analysis report. At the moment this is just a PBIX file, and what we're going to do is save it as a PBIP, a Power BI project file, then explore it and have a look at what that looks like. It's fairly easy to save your Power BI file as a Power BI project file: we go to file and then save as, find our Git repository (that's in documents\git\dp600 practice, the repository we just cloned from Azure DevOps, which is currently empty as far as Power BI content goes), change the save-as type to a Power BI project, and click on save. Now we have our Power BI project saved in PBIP format, so let's have a look at what that looks like. This is our Git repository, dp600 practice, and we've saved our Power BI project as a PBIP, which is actually a number of different files and folders. What we've got here is the README that was already in the repository, and "my file", that text file we've been working with, but now we've also got two folders, a gitignore file and this .pbip file. What it's done is basically decompose our Power BI project into a series of files and folders that describe what's going on in that report. Let's move back to GitHub Desktop, and the first thing you'll notice is that I've created a new branch, because now we're going to be updating this PBIP file.
You'll notice that all of the files are now listed here, because we created that PBIP; they're all text-based, so they're all trackable using version control. "First commit of the pbip": again, we can commit that and push it to the origin in Azure DevOps. Now if we go into Azure DevOps again, into pull requests, we can see it there and we can create a new pull request; the reviewer will be myself again, create it, approve it, complete and merge. So now we've got our Power BI report in our Azure DevOps environment, it's been approved, and it's now trackable through version control. Now imagine I'm a Power BI developer and I want to make some changes to that report. What is that going to look like? Well, we're going to create a new branch: any time you're making changes to your Power BI file, you create a new branch, "powerbi changes" (obviously you'd make it a bit more descriptive than that, but it's good enough for this purpose). Then we go through and, actually, this isn't a marketing report, it's sales and marketing, so we're going to update the title, just as an example, and save the file. And now this is the real power of version control for our Power BI assets, but also for all of the stuff in Fabric that we'll look at in a minute: any changes we make to that Power BI file are now being tracked in our version control system, and when we're done we can push those changes up into Azure DevOps, go through the review process, and merge them into the main branch. Okay, you're probably thinking: great, but up until this point you haven't even mentioned Fabric, and I thought this was a channel about Fabric. Well, we've just built up the concepts of Git and version control and looked at the local case of Power BI report development, but everything else in Fabric happens in the Fabric cloud, so now let's look at where Fabric fits into all this: how do we link a repository to our workspace? Let's have a look at how Fabric fits into this picture. Okay, so here we are in Microsoft Fabric. What I've done is set up a workspace, and what we're going to do is link it to our Azure DevOps repository, our Azure DevOps project. To do this you will need to be a workspace administrator: go to workspace settings, and the first time you click on Git integration you'll need to sync it with your account (I've already done that with mine), and it's going to allow you to select the organization, the project, the repository (dp600 practice) and a specific branch that you might want to connect to. Now, one thing I would say is that the branch protection we set up in the last section of the video doesn't, in my experience, work particularly well with the way Git integration is currently set up in Fabric, but we'll have a look at what it looks like; just click on connect and sync. Okay, so our first sync has started and it's completed successfully. What you'll notice is this source control section here has a zero in it: what's in our Azure DevOps repository and what's in Fabric are perfectly in sync currently, and we can see that as well with the green synced status here. So now that we've got what's in our Azure DevOps repo synced with what's in Fabric, we can add Fabric items into the mix. We can create a new data pipeline, for example; for the moment I'll just keep it as an empty one.
We can just go back to our repository and see that it's uncommitted, and the source control button up here is now showing one. If we click on that we can see we've got one change: compared to what's in our main branch in Azure DevOps, it's noticed that we've got a change; similar to how GitHub Desktop tracks local changes, this is tracking changes in Fabric as well. So what we can do is commit this change, this data pipeline, into our source control in Azure DevOps. Now, because we've got protection on that main branch, let's just try committing and see what it says ("first commit data pipeline"); we know we've got protections on that main branch, so when we commit I don't think it's going to allow us to do it, and no, it's forbidden due to the branch policy. This is where it gets a bit more complicated within Fabric: what we're going to have to do is check out a new branch. Maybe we call it "adding data pipeline", that can be our new branch, and that's going to allow us to commit: "added a new data pipeline", click on the item that we want to commit, which is our data pipeline, and now that goes through into Azure DevOps. If we open up our pull requests section, boom, we can see "adding data pipeline". So now we've got these two things in sync: between Fabric and Azure DevOps, and between our local repository for Power BI development and Azure DevOps. And again, if we want to merge that into the code base, into the main branch, we go through the familiar process of choosing a reviewer and creating that pull request, and if you're the approver you can then approve the pull request and complete and merge it into the main branch. Then if we go back to our repository, we can see that in our main branch we have this "my data pipeline" data pipeline, so that is now being tracked with version control in our repository. And if we go back to Fabric we can see that we've now got three synced items here, so it's perfectly in sync with what's in our branch. Now, the problem here, or the slight limitation that I found, is changing back to the main branch. Currently we're on this "adding data pipeline" branch, and that was just a feature branch; we want to move back onto the main branch so that next time we can branch from main rather than from "adding data pipeline". That is possible: we can go back to workspace settings, go through to the Git integration again, and change our branch back to main (switch and override), and now we're back on the main branch and all three items are synced, because we know we've merged into that main branch. But the problem is that only the workspace administrator can change that branch back, so if you've found any better ways of working with protected branches in Git and Fabric, please let me know.
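One more aside before we summarize: the commit we just did through the Fabric UI also has a REST surface. The sketch below is my reading of the Fabric Git APIs (get the workspace's Git status, then commit pending changes); the endpoint names and payload are paraphrased and these APIs are still evolving, so verify them against the current Fabric REST reference before relying on anything like this.

```python
import requests

TOKEN = "<aad-access-token-for-api.fabric.microsoft.com>"
WORKSPACE_ID = "<workspace-guid>"
BASE = f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/git"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. See what has changed in the workspace compared to the connected branch.
status = requests.get(f"{BASE}/status", headers=HEADERS)
status.raise_for_status()
print(status.json())

# 2. Commit all pending workspace changes to the connected branch
#    (payload fields assumed from the docs; adjust as needed).
commit = requests.post(
    f"{BASE}/commitToGit",
    headers=HEADERS,
    json={"mode": "All", "comment": "added a new data pipeline"},
)
commit.raise_for_status()
```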
Okay, so we've covered quite a lot of ground there, from the basics of Git and version control in general right through to tracking changes in a Power BI project in Azure DevOps, and then the same from a Fabric environment, tracking changes we make within Fabric to the same Azure DevOps repository. Let's do a bit of a summary and focus back on the exam, because obviously not all of that is going to be tested; some of it was just background knowledge. In general, version control with Git allows you to track changes made to Fabric items. We can also revert back to older versions of an item; I didn't show you how to do that, but it is also possible with Git version control. One of the benefits is that multiple users can collaborate on the same Fabric item or the same Power BI report: no longer are you sending a PBIX file around to your colleagues, everyone can work on the same PBIP file, changes are tracked using version control, and you can even update the same report at the same time; as long as there are no conflicts when you merge, two different changes from two different people can both be merged into the same central branch, so it's a good way of enabling collaboration on Power BI reports. We've also looked at how you can implement a check-and-approval process for changes made to Fabric items, so if you don't want your junior developer updating your Fabric notebooks and pushing them straight into your Git version-controlled repository without you approving them, you want to set up protections on your main branch, or your equivalent of a main branch, before changes get pushed into production. Now, currently the following items are supported (I would argue that not all of these are as fully supported as we would like, but it is possible to at least check them into a version control system): data pipelines, lakehouses, notebooks, paginated reports, reports (apart from ones connected to Azure Analysis Services), and also semantic models, except for the exceptions listed here. So that is Git integration and version control. The next section of the exam is related to this, but it's around deployment solutions. Let's look at what we mean by deployment solutions; well, actually, let's start with deployment, because for many people coming from the world of analytics, deployment is probably a bit of a new concept. Rather than having one copy of a Power BI report, for example, which is your production copy, instead we're going to have multiple copies, typically three, sometimes four. We're going to have a development version, which is the version you're using when you're making changes to the report; we're going to have a test version, which is the report you send to your client or your colleagues to review, or maybe you've got automated testing in place that's done at that stage (the test version is sometimes also called the staging version); and we've also got the production report, which is the public-facing report that gets given to your client or shared within your organization. Now, Microsoft have released a feature called deployment pipelines to help manage these three (or more) environments, so let's have a look at deployment pipelines in Microsoft Fabric in a bit more detail. Okay, so let's look at how we can set up deployment pipelines in Microsoft Fabric. What you're going to need here is three workspaces, and each workspace is going to represent a different environment, a different stage in our deployment pipeline. If we click through to workspaces you'll see that I've set up dp600 dev, test and production: three separate workspaces, currently completely empty. What we're going to do is set up a deployment pipeline so that we can manage that deployment process from development through test to production.
Because what we want to be doing is, obviously, a lot of our development work in the development workspace, and then when it's ready we push those changes into our test workspace so we can do testing: share it with colleagues who might want to test a report or a notebook, run test scripts, integration tests, unit tests, data validation checks, that kind of thing, in the test environment. And then production is where we're going to be sharing our public, production reports or notebooks and things like that. So, to set up a deployment pipeline we click on workspaces and then deployment pipelines, set up a new pipeline, and give it a name; we can call it dp600 pipeline. It's going to ask you to customize your stages; you can add as many stages as you like, but for our example we're just going to do three, development, test and production, and click create. Now we get to this wizard, this UI, which allows us to specify which workspace we want to connect to which stage in our deployment pipeline. So we find our dp600 dev workspace and assign it here, then for the test stage dp600 test, assign that here, and then for production, do that as well. Now we've linked our three workspaces to our three stages in the deployment pipeline, and currently we can see it's completely empty; we haven't got any items in any of these three workspaces. So let's go through to our dp600 dev workspace (we can see it's been linked with this deployment pipeline, so it knows it's part of a deployment pipeline) and create a new notebook. This is our new ETL notebook, let's just call it ETL notebook; it could be doing whatever you want in there, it doesn't really matter for the purposes of this demo. If we go back into our workspace we can see we've got this ETL notebook. Say we're happy with it, we've done our development work, and now we want to push it into our test environment. We go into our deployment pipeline, and what you can see now is this deploy button: we can select any items we've got in our dev environment and deploy them, add a note if we want, and deploy into our test environment. It moves one stage to the right, into our test environment; what it does is take exactly the item you've got there, copy it, and paste it into the test environment, so now we can see we've got one notebook in the test environment. Let's have a look at that: workspaces, dp600 test, and there's our ETL notebook in the test environment. One thing to bear in mind is that we have a history of our deployments, so we can see that we deployed to test at this date and time, by this person (me), and we can see the number of items that changed; we've got one new item in that deployment.
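As a side note, that "deploy" click can be automated. The deployment pipelines REST API has a deploy-all operation that promotes the contents of one stage to the next; the sketch below reflects my understanding of that API, and the IDs, stage numbering and options should all be checked against the Pipelines REST reference before use.

```python
import requests

TOKEN = "<aad-access-token>"
PIPELINE_ID = "<deployment-pipeline-guid>"

url = f"https://api.powerbi.com/v1.0/myorg/pipelines/{PIPELINE_ID}/deployAll"
payload = {
    "sourceStageOrder": 0,  # assumed: 0 = development -> test, 1 = test -> production
    "options": {
        "allowCreateArtifact": True,     # create items missing from the target stage
        "allowOverwriteArtifact": True,  # overwrite items that already exist there
    },
    "note": "Promote ETL notebook to test",  # optional note (field name assumed)
}

resp = requests.post(url, json=payload, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
```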
Another thing we need to look at is this button here, which is quite important: deployment rules. If we click on it we can look at a deployment rule. Deployment rules basically allow you to change different parameters and settings at different stages in the pipeline, and they differ depending on the Fabric item you're assigning them to. For a notebook, the deployment rule we can set is changing the default lakehouse: we can add a rule whereby we change the default lakehouse, so maybe you have a development lakehouse and a test lakehouse, and when we push to the test environment we actually want that notebook to be reading from the test lakehouse. That's what we'd do here in the deployment rules section, so that's something important to note. Now, if we want to deploy this all the way into production, we can click on deploy at the test stage, and it's going to copy that notebook into our production environment. If we click on the settings of this stage, the production stage, we can see we've got this "make stage public" option, and what that means is that people can actually access the output of this stage: if they're in that workspace they're going to be able to see the items that are in there, whereas the other stages, like this test environment, are not public stages by default. So that's something to bear in mind around public and non-public stages. That's the very basics of deployment pipelines. Currently the functionality is, I would say, quite limited as to what you can do with deployment pipelines in Fabric; I think this is a feature they're going to be adding a lot more to. Currently it's a very manual process, and normally when we're deploying things in the real world, in the software development world, a lot of this is automated. Okay, so we've looked at the basics of deployment pipelines and what they look like in Fabric; let's do a bit of a summary of what we've just learned, again focusing back on the exam. The overall goal is to add layers of control when we're developing and deploying new Fabric items, or making changes to existing items in Fabric; ultimately we're trying to ensure that the new things you develop are not going to break your existing analytics solutions, so you're adding in that test stage rather than just pushing straight into production and risking ruining the analytics in your environment. Normally this includes three stages, development, test (or staging), and production, and sometimes you have a fourth one in there, called something like pre-prod; it just depends on your strategy. Deployment using deployment pipelines involves copying items from one workspace to another, and by default in Fabric that's a manual process. Deployment rules can be implemented to change things like the default lakehouse for a notebook, or the datasets and data sources that your semantic model reads from, for example. Now, as well as the deployment pipelines functionality there are a number of other ways to manage deployment in Fabric. It can be done through branching in Azure DevOps; you can also use the Azure DevOps functionality called Pipelines, where you create a YAML template (we're not going to go through what that looks like, but for the exam just know that it's possible); and you can also deploy semantic models via the XMLA endpoint, which we'll look at in a bit more detail later in this lesson. As I mentioned, there are a few other ways we can deploy things in Fabric, and a really good resource to check out here is Kevin Chant's blog; I'll leave a link to that in the school community. Specifically, there are two blog posts around CI/CD for the data warehouse: the data warehouse is not natively supported yet in Fabric, but you can set up a SQL database project and then use things like YAML pipelines in Azure DevOps. So if you want to understand other ways to deploy items in Fabric, I recommend checking out those blogs.
Now, if we're looking specifically at semantic models and how we can deploy those, we also have the option of the XMLA endpoint. Broadly speaking there are two ways to create and manage semantic models. Number one is to create and manage your semantic model within your workspace, so create it from a lakehouse or a data warehouse in your workspace. Method two is to create your semantic model in a third-party tool, for example Tabular Editor, and then deploy it via what's called the XMLA endpoint into your workspace. To grab that XMLA endpoint you need to go to your workspace settings, where you get this powerbi:// connection URL, and you can use that to connect Tabular Editor, or SSMS, or DAX Studio, and deploy your models into Fabric using the XMLA endpoint. Okay, the next section of the study guide we're going to look at covers three different file types, or three different items, that we can create in Power BI and Microsoft Fabric which you need to know for the exam: number one is the Power BI template file, then we've got the Power BI data source file, and then shared semantic models. We're going to look at each of these in a bit more detail, starting with the Power BI template file. The Power BI template file is a reusable asset that can improve efficiency and consistency when you're creating Power BI reports. You can easily save a PBIX file as a Power BI template file, and then use that to generate new reports in a given style or with a given layout already specified in the template. If you've got parameters in that Power BI file, then when you create a new file from the template you'll be asked to set the parameter values, if there are any parameters in that report. Let's have a quick look at how you create a Power BI template file from a PBIX file. Here we are back in Power BI Desktop, with the Power BI project we're working on, our sales and marketing analysis report. Say we want to create multiple versions of this report, all with the same format, structure and layout, the same pages in the report. What we can simply do is go to file, save as, and at the bottom here, rather than saving as a PBIX or a PBIP, we save it as a PBIT, a Power BI template file. We just select a folder to save it to and click on save; we can give it a description, "sales template", and that saves your template. The next time you go to create a new report you can import that template, and it can be the starting point for your new report. The Power BI data source file, or PBIDS, is another file type; it basically represents a data source, so it's a reusable asset that helps you quickly transfer all of the data connections you created in one Power BI file over to another report. Okay, in this section of the exam we're going to look at impact analysis and the lineage tool in Microsoft Fabric. Here we have a workspace with quite a lot of different items, and if we click on this button in the top right hand corner we can change from the list view to a lineage view, and this is what that looks like: we can see all of our Fabric items within the workspace, and we can see how data is flowing from source through to the different lakehouses; here we've got some notebooks, and the semantic model that's being created from the lakehouse.
model that's being created from that Lakehouse. Now, for each of the main items in our lineage view we've got this button, which is the impact analysis button. So for this bronze Lakehouse we can click on impact analysis and look at what the downstream items of that Lakehouse are — if you're going to make changes to that Lakehouse that might break some of the notebooks or the semantic model, for example, impact analysis lets you see what the downstream items are. In our example we've only got three downstream items, but you could have 10, 20, or hundreds of downstream items, so it's really important to know, if you make a change upstream, what the impact is going to be. Another piece of functionality here is to notify people: you can notify people who are listening for notifications on any of those downstream items and let them know before you make a change — say you're going to add a new column to a table in the Lakehouse, or add some new tables — an FYI just to make them aware of the changes before you make them. So that's basically what the lineage tool and the impact analysis tool in Fabric look like at the moment; again, I think they'll be adding a lot more functionality to these features in future, but that's all you can do with them right now. Just bear that in mind for the exam, because you might get asked questions around notifications, or about how you find the downstream items of a particular Fabric item — so have a quick look at some of your workspaces and perform some impact analysis on the downstream items before the exam. Okay, we've covered a lot of ground in that video; let's round it up with some practice questions to test your knowledge of this section of the exam. Question one: you're looking to improve the efficiency and consistency of your Power BI development team. You want each report created by the team to always consist of three pages — the intro, the context, and the analysis — and the reports should always align to the company branding. Which of the following would help you achieve this? One, create a PBIP file; two, create a PBIX file; three, create a PBIT file; four, create a PBIDS file; or five, use a JSON custom report theme. Pause the video here, have a think, and I'll reveal the answer shortly. So the answer is the PBIT file, because that's the Power BI template file. The important part of this question was that you want each report created by the team to consist of those specific pages. You might have thought, "we're talking about branding, about following a style guide," so E might look like an answer — using the JSON report theme we covered in the first video in this series — but if we want to include template pages, that's going to be the Power BI template file. The PBIP (the project file) and the PBIX aren't going to do the job, and the PBIDS is just for data sources, not for giving us a template structure to follow for our report. Question two: which of the following most accurately describes Git? A: to use Git you must be using GitHub. B: Git is a Microsoft product for tracking changes made to Fabric items. C: Git is an open-source version control system that tracks changes in any set of text-based files. D: Git allows us to add deployment rules to Fabric deployment pipelines. The answer here is C: Git is an open-source version control system that tracks changes in any set of text-based files.
Right, so Git, the underlying technology, is open source and it's used in a wide variety of version control platforms — one of which is Azure DevOps (Azure Repos is in there), GitHub is another example, Bitbucket is another; there are lots of them. So you don't have to be using GitHub to be using Git. Git is not a Microsoft product built specifically for Fabric; it's a generic tool used right across the software development industry. And it's got nothing to do, really, with deployment rules in Fabric deployment pipelines — although you can set up Git as the version control for the workspaces behind the different stages in your deployment pipeline, that's not the best description of Git. Question three: in an Azure DevOps repo, the main branch is protected, so it needs approval before any changes are merged into it. The repo contains one PBIP file. You have to update the title in the report, merging these changes to the main branch. In which order should you carry out the following tasks to achieve this? This is an unordered list; your job is to order it. The tasks are: commit and push the feature branch; wait for approval then merge into the main branch; clone the repository to your local machine; make the required changes to the report; check out a new feature branch from the main branch; and open a pull request in Azure Repos (Azure DevOps). Order the list, and I'll show you the correct order shortly. The correct ordering starts with cloning the repository to your local machine — if you want to make any changes to the report, you need that repository in your local environment first. Then you check out a branch: because the main branch is protected we can't edit it directly, so we need to check out a new branch from the main branch. Then we make the changes to the report, which are tracked by our version control system; we commit and push those changes on the feature branch; we open a pull request in Azure Repos; and we wait for approval and then merge it into the main branch. So that is the correct order. Question four: you want to deploy a semantic model using the XMLA endpoint. Where can you find the XMLA endpoint to set up a connection with a third-party tool? A: go to the workspace settings for the workspace you want to deploy your model to. B: go to the Fabric admin portal and then capacity settings. C: in your workspace, find your semantic model, then click on its settings to get the XMLA endpoint address. D: in the Azure portal, in your Fabric capacity, go to the XMLA endpoint connection string settings. The answer here is A: go to the workspace settings for the workspace you want to deploy your model to. When we're creating that XMLA endpoint connection it links to our workspace in Fabric, so that's where you go to get the address — not the capacity settings and not the Azure portal. And as for C, finding your semantic model in the workspace: the semantic model doesn't actually exist in that workspace yet, because we want to deploy it into there — and even if it did, it doesn't have the XMLA endpoint in its settings anyway. So congratulations, that's the first section of the exam study guide complete. Next up we move into the biggest section of the exam, which is worth 40 to 45%. Camila has been seriously impressed with your skills and knowledge so far; in the next lesson we'll be looking at how to create objects in a Lakehouse and a data
warehouse. So click here for the next lesson in this series. Hey everyone, welcome back to the channel. This is video five in our DP-600 exam preparation course, and in this video the focus is on getting data into Fabric. This is the first part of the second section of the exam, which is all around preparing and serving data, and it's worth 40 to 45% of the exam — so there are going to be a lot of questions on this; let's get into it. In this video we're covering: ingesting data using data pipelines, data flows, and notebooks; copying data (which is basically the same thing, but they've included it twice in the study guide); choosing an appropriate method for copying data — so not just understanding the tools available to us, but making decisions about the best method for a particular problem or scenario — and also creating and managing shortcuts. You'll notice these deviate slightly from the study guide: what I've done is look at the whole of section two and change the order of some things. Don't worry, we'll go through all of the skills in the study guide, just in a slightly different order — these are the four we're focusing on today. Now, I released a video in the last month — this one here, data pipelines versus data flows, shortcuts and notebooks — and that's a more comprehensive video, so I definitely recommend watching it either before or after this one. As such, in this video we won't be starting from zero; we'll be revising what we went through in that video, updating it a little because there have been some changes, and revising the core concepts and distinctions between these tools that you need to know for the exam. As ever, there will be five sample questions at the end of this video, and the Skool community has quite extensive notes and links to further resources if you want to go into more detail on any of the topics we cover in this lesson. So let's start with some framing of the different options available to us when we're talking about ingesting data and getting data into Fabric. One group of tools is the data ingestion — the EL or ETL (extract, transform, load) — tools. These copy data from external systems, bring it into Fabric, and save it in some sort of Fabric data store: the data flow, the data pipeline, the notebook, and the event stream. The event stream, I don't think, is actually covered in the DP-600 exam, so we're not going to talk about that much today. We also have shortcuts — another method we need to be aware of, which we'll go through in this video — and that involves bringing data in from Amazon S3, ADLS Gen2, the Dataverse, and also Google Cloud Storage. We've also got internal shortcuts, so connecting our different datasets within Fabric — creating references to other Lakehouse tables from a Lakehouse, for example. Then at the bottom we've got a new feature that's in preview at the moment, database mirroring: you can create a mirror of your database within Fabric from Snowflake, Cosmos DB, or Azure SQL at the moment. Again, I don't think this is currently part of the exam study guide, so we're not going to cover it in this section; if you do want to learn a bit more about database mirroring, I mentioned it in that video I referenced previously. So in this video
we're just going to go through the data flow, the data pipeline, the notebook, and shortcuts in a bit more detail, because these are the things you need to study for the exam — starting with the data flow. As you know, the data flow comes with about 150 connectors to external systems, and you can bring data in using Power Query — that no-/low-code interface — and do data transformation on it before writing it to one of the Fabric data stores. So when should you use it, and when shouldn't you? Well, if you want to use any of those 150 connectors, it's definitely a good fit. It's a no-/low-code solution, so it's quite maintainable if you have people without SQL or Python skills in your organisation, and, as I mentioned, you can do extract, transform, and load all in one tool. The data flow is also one of the tools you can use to access on-premises data via the on-premises data gateway, and it's quite useful when you want to ingest more than one dataset and maybe combine them in the same data flow — although if you're following a Medallion architecture, maybe that's not what you want to do, but it's an option. It's also the only tool that lets you upload raw local files: maybe you have a CSV file that will never really need updating and you just want to get some data into Fabric — you can do that in the data flow. So when shouldn't you use it? Traditionally, data flows have struggled with large datasets, but a number of features have been released to try to speed them up. Recently they released Fast Copy, and since the data flow now uses the same backend infrastructure that the data pipeline uses in its copy data activity, the performance between the two should now be a lot more similar. Another aspect of Power Query is that it's quite difficult to implement data validation: because all of our logic is locked up in those Power Query routines, it's hard to validate the individual steps as you go through them — which is one reason you might prefer an ELT routine (extract, load, transform) over ETL (extract, transform, load). It's also the only tool out of the data flow, the data pipeline, and the notebook where you can't pass in external parameters, so it's difficult to build metadata-driven architectures with the data flow at the moment. There is a kind of workaround, where you create another query inside your Power Query project holding the metadata you want to use to parameterize the solution, but there's no native functionality for passing parameters into a data flow from a data pipeline, for example. Next up, ingesting data with a data pipeline. The data pipeline is primarily an orchestration tool, but it can also be used to get data into Fabric using the copy data activity — there are some other activities you can use for this as well, but the copy data activity is the main one. One of the main pros of a data pipeline is that it performs well on large datasets — although, as I mentioned, the data flow now has Fast Copy, so performance should be comparable between the two. It has many connectors to cloud data sources, especially in Azure, so it's a good choice if you've got data in
Azure, for example. We might need some sort of control flow logic — maybe looping through different tables — and there's a lot of functionality for building metadata-driven or parameterized data ingestion, so that's something to bear in mind with the data pipeline. As well as the copy data activity, you can also use it to trigger a wide variety of other actions in Fabric, for example a stored procedure — and a stored procedure can itself be used for ingesting data into Fabric; for example, the COPY INTO statement in T-SQL can be used to ingest a CSV file directly into a data warehouse. Now, some of the cons of the data pipeline: it can't do the transform piece natively — there are no real data transformation activities — but you can embed notebooks and data flows into a data pipeline if you want to do that transformation. It has no ability to upload local files, so that's not possible with a data pipeline at the moment. And you do need to be careful with any sort of cross-workspace data pipeline usage. We did submit an idea on ideas.fabric.microsoft.com, and it's actually been planned now, which is good news — we don't know when it will be released, but they are working on support for cross-workspace data pipelines. What we mean by that is: maybe you want to bring data in from a source and write it to a destination that's in a different workspace to the data pipeline — that currently isn't possible, but they're working on it as we speak. Next up, ingesting data with a notebook. A notebook is a general-purpose coding notebook which can be used for a wide variety of things, one of which is bringing data into Fabric. We can do this either by connecting to APIs — using something like the requests library in Python — or by using client Python libraries. For example, if you have a third-party SaaS product like HubSpot, a lot of the big ones have Python libraries you can use to bring data in, as do a lot of the Azure tools — Azure Data Lake Storage, for example, has a Python client you can use to bring data in — so that's another option with the notebook. Some of the pros: it's really good for extraction from APIs, because if you can code in Python you can write quite customized logic around things like authentication and pagination, and that becomes really simple in a notebook, as does using any of those client libraries — anything from Azure, or HubSpot as we mentioned. It's good for code reuse: a notebook can be parameterized and then used in lots of different situations. You can also embed data validation and data quality testing on the incoming data — we've done quite a lot on this channel around data quality and validating incoming data, and that becomes much easier in a notebook. And in terms of performance, a few people have been testing these different methods, and the notebook pretty much always comes out on top in terms of speed, and therefore also capacity usage — so if you're really sensitive about the number of capacity units you're using, the notebook is going to be the most efficient. I'll leave a link to an analysis on the Lucid BI blog that goes through some of the investigation work they've done there; it's a really good read if you want to learn more. Now, some of the cons:
well, when you don't have Python capability in your organisation — that might sound like an obvious one, but if you don't have a team to write and then support these ETL notebooks, it's not going to be a good choice for you. And secondly, one of the limitations of the notebook is that you can't currently write into a data warehouse, so if that's your destination you'll want to use one of the other ingestion tools, not the notebook. Okay, the other method we can use to bring data into Fabric — or at least make data accessible from within Fabric — is the shortcut. We've talked quite a lot about shortcuts on this channel already, so we'll just review what's possible with a shortcut and the things you need to bear in mind for the exam. First, an overview: a shortcut enables you to create a live link to data stored in another part of Fabric (an internal shortcut) or in the following external storage locations: ADLS Gen2 (Azure Data Lake Storage), Amazon S3, other services with S3-compatible storage — Cloudflare buckets, for example; that's quite a big chunk of the market, so a lot of tools are opened up for shortcuts there — Google Cloud Storage, and also tables in the Dataverse. A shortcut can be set up for individual files but also for a folder: if you set up a shortcut to a folder, it's going to bring in all of the files in that folder and keep them in sync. One thing to be careful of is cross-region egress fees: if your Fabric capacity is in the UK South region, for example, but your ADLS storage account is in West US, you're going to be charged cross-region egress fees by Azure — around 1 cent per gigabyte of data transferred. You can also now create shortcuts via the Fabric REST API, so if you want to create shortcuts programmatically — maybe shortcuts to hundreds of tables in one go — that's an option for you. Now, something that might come up in the exam is permissions for shortcuts. What I've done is copy the relevant table from the documentation — I'll link to it in the Skool community as always — and what it shows is the shortcut-related permissions for each workspace role. Starting at the top: to create a shortcut, the user needs write permission in the place where they're creating the shortcut, and read permission on the file or folder they're shortcutting to — those are the two permissions you need side by side to create a new shortcut. Secondly, if you just want to read the file contents of a shortcut, you need read permission both where the shortcut lives and where the referenced file lives — so at least read permission in both locations. And if you want to write new files or new data to a shortcut target location, you need write permission for both of those locations: where the shortcut lives and where the data it references is stored. So let's talk now about deciding when to use which method. We touched on this in the first lesson in this series, when we talked about the deciding factors for planning out a Fabric implementation: where is the data
stored, and also what skills exist in the team. That's a really good start, but there are some other factors to bear in mind when choosing a data ingestion method. If you have a requirement for real-time or near-real-time data, you're going to want to prefer options like the shortcut (if it's a file or folder) or database mirroring (if it's a table in one of the three database types supported by mirroring), because these are live links to those locations — when you query them, you get near-real-time data back. We talked about the skills in the team: if you've got predominantly no- and low-code users, then the data flow and the data pipeline; if you've got SQL-based users, you can use the data pipeline with the stored procedure activity or the script activity — we'll look at stored procedures more in the next lesson, but it is possible to use, for example, the COPY INTO statement for ingesting data from Parquet or CSV files into a data warehouse; and if you've got Python or Scala skills, you can be using notebooks. The next thing is the cross-workspace limitation: as we mentioned, the data pipeline must be in the same workspace as your destination store when we're talking about data ingestion, whereas the other two methods, the data flow and the notebook, don't have that limitation — just to be clear on that. Finally, think about scalability, the size of your data, and cost and capacity usage — these are all pretty tightly linked. In general the notebook, from quite a few people's analyses that I've seen, is the most efficient method. I'm not sure you'll be tested on that in the exam, but it's something to bear in mind when you're actually designing solutions.
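Since the COPY INTO option has come up a couple of times now, here's a rough, hedged sketch of what a statement like that can look like when loading a CSV file from cloud storage into a warehouse table. The storage URL, table name, and the commented-out credential are placeholders I've made up for illustration — check the COPY INTO documentation for the exact options your scenario needs.

```sql
-- Illustrative only: load a CSV file from cloud storage into a warehouse table.
-- The storage URL, table name and credential below are made-up placeholders.
COPY INTO dbo.sales_raw
FROM 'https://examplestorage.blob.core.windows.net/landing/sales/sales_2024.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW  = 2   -- skip the header row
    -- ,CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
);
```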
Okay, so now let's test some of our knowledge on this part of the exam, getting data into Fabric. Question one: you're trying to create a shortcut to a folder of CSV files in Azure Data Lake Storage Gen2. Which of the following is a valid connection string you can connect to: A, B, C, or D? Pause the video here, have a think, and I'll reveal the answer shortly. So the answer is D. In Azure there are a number of different endpoints you can connect to for a storage account, and the one we need for a shortcut is the DFS (Distributed File System) endpoint, which is D. The database.windows.net address — well, we're not connecting to a database here, so that's not going to be A — and B and C are both endpoints of a storage account, but not the ones you need to create a shortcut. Question two: you're implementing the first stage in a Medallion architecture. Your goal is to retrieve data from a REST API using a GET request, and you're going to save the raw JSON response in the Files area of a bronze Lakehouse. Which of the following methods can you use to achieve this? There are three correct answers: A, the data pipeline copy data activity; B, the data flow with the Web API connector; C, the event stream; D, the data pipeline web activity; or E, the Fabric notebook. The three correct methods are A, the data pipeline copy data activity; D, the data pipeline web activity; and E, the Fabric notebook. One of the important parts of this question is that we want to save the raw JSON response in the Files area of a bronze Lakehouse — we're not ingesting it directly into a table, we want to save the raw JSON — and these three are the only methods where that's possible. With the data flow we have to output into a data store — a Lakehouse table or a data warehouse table, for example — so we'd need to perform some transformation on that JSON; we can't just write out the raw file, and it's the same with the event stream. The other three do allow it: with the data pipeline you can output in various file formats, and with the Fabric notebook you can do that as well. Question three: your goal is to extract a CSV file in Azure Blob Storage and write it to a Fabric data warehouse table. Which of the following methods can you not use to achieve this? A: the data pipeline copy data activity. B: a data flow with the Blob Storage connector. C: in a Fabric notebook, use the Azure Blob Storage client library for Python to get the file and write the data into the data warehouse table. D: use the COPY INTO statement in T-SQL from within your Fabric data warehouse. The answer here is C — which of the following can you not use — because, as we covered in this lesson, you can't use a notebook to write data directly into a Fabric data warehouse; that was the clue in this question. The other three methods — the data pipeline, the data flow, and the COPY INTO statement in a T-SQL script — can all be used to ingest data into a Fabric data warehouse table. Question four: workspace A contains Lakehouse A; workspace B contains Lakehouse B. You want to create an internal shortcut in Lakehouse A pointing to a table in Lakehouse B. What's the minimum level of workspace permissions you need to achieve this? A: Contributor in workspace A and Viewer in workspace B. B: Contributor in workspace A and Contributor in workspace B. C: Member in workspace A and Contributor in workspace B. Or D: Viewer in workspace A and Viewer in workspace B. The answer here is A, Contributor in workspace A and Viewer in workspace B. When we talked about the workspace role permissions required to create a new shortcut: you need some sort of write permission in the workspace where you're creating the shortcut, and read permission on the Lakehouse you're referencing. The important part of the question is the minimum level of workspace permissions needed — the others, or B and C at least, would let you do it, but they're not the minimum level of permissions required; and with D, Viewer in workspace A, you won't have the permissions to create
the shortcut in Lakehouse A in the first place. Number five: one of your team is a superstar at using M code for data extraction. In which of the following data extraction tools can you write M code? Is it A, the Fabric notebook; B, a T-SQL script in the data warehouse; C, a data flow; or D, a data pipeline mapping data flow? The answer here is C, the Dataflow Gen2 — that's the one with the Power Query interface, and in the advanced editor you can write M code. You obviously can't in the Fabric notebook or a T-SQL script, and the data pipeline doesn't actually have a mapping data flow activity — that's a bit of a red herring coming from Azure Data Factory, where that does exist, but not in Fabric. Congratulations, you've completed the first part of section two of the study guide, preparing and serving data. In the next lesson we'll look at the Fabric data warehouse in more detail, and at how you can schedule all of your ETL workloads, whether that's a data flow, a notebook, or a data pipeline — so click here to continue your DP-600 learning journey, I'll see you there. Hey everyone, welcome back to the channel. Today we're continuing the DP-600 exam preparation course, and we're looking at SQL, the data warehouse, and how we can schedule things to run in Microsoft Fabric. This is video six in our series, so we're nearly halfway — we've covered a lot of ground already, but there's some way to go still until we get to the end of the course. In this video we're covering: creating views, functions, and stored procedures in the data warehouse experience; how to add stored procedures, notebooks, and data flows to a data pipeline; and then how to schedule data pipelines, and also schedule things like data flows and notebooks. Throughout this section of the course we're obviously focusing on what you need to know for the DP-600 exam — a lot of these topics can go really deep, but we're just going to cover what I think is sensible to learn in case it comes up. This lesson will be pretty much 100% practical, so we'll be going into Fabric and creating all of these things ourselves. At the end of the lesson we'll do four sample questions, and as ever I've got key points and links to further learning resources if you want to brush up on your skills in a particular area for the exam. Okay, so let's start by looking at creating functions, stored procedures, and views in the data warehouse experience: what they are, when you should use them, how you should use them, and the things to bear in mind for the exam. We're starting off in a workspace here, and I've created some Fabric items, the most important one being this data warehouse, dp600 data warehouse. If we open it up and have a quick look around, I've created some really simple tables, and we'll be using one or maybe two of them for this tutorial — specifically this dbo.employees table. It's only got four rows, but that's good enough to show you the functionality of a function, a stored procedure, and a view.
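If you want to follow along and build the same demo objects, the table I'm using looks roughly like this. This is my reconstruction rather than the exact script from the video, so the data types and the sample rows (beyond there being a "Jack" in there) are assumptions:

```sql
-- Hypothetical recreation of the demo table: column names follow the video,
-- but the data types and sample values are assumed for illustration.
CREATE TABLE dbo.employees
(
    EmployeeID INT,
    Name       VARCHAR(50),
    Age        INT
);

INSERT INTO dbo.employees (EmployeeID, Name, Age)
VALUES
    (1, 'Jack Jones', 34),
    (2, 'Sarah Smith', 29),
    (3, 'Maria Lopez', 41),
    (4, 'Tom Wright', 25);
```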
So what we're going to do is not actually work in the Fabric online experience — we're going to use SQL Server Management Studio. We need to go into the settings of the data warehouse, copy this SQL connection string, and then go over to SQL Server Management Studio — you can download that for free, and I'll leave a link in the description and in the Skool community — and then we can connect to our Fabric data warehouse from within SSMS. I've already set up the connection here, but if it's your first time using SSMS and you haven't connected before, just click to connect to a new database engine, add your server name, and use Microsoft Entra multi-factor authentication — it will ask you to authenticate with your Fabric account — and then you should see your databases listed. We've got the dp600 data warehouse we were looking at previously, and you can see the tables: the employees table, the gold table, and some others. Let's start by exploring what we've got: a simple SELECT * FROM dbo.employees returns all of the data, just so we can have a quick look and make sure everything is connected correctly — yes, we've got our four rows and three columns: employee ID, name, and age. Perfect. So say that, for a Power BI report we want to create, we actually want to do a bit of transformation on this table — we don't want the raw table as it is. It doesn't really matter what that transformation is for this demo; maybe we want a WHERE statement, say WHERE name is LIKE 'Jack%'. This brings back all of the employees called Jack: we're using the percent wildcard operator, which means that if the first four characters are J-a-c-k, anything after that still matches and comes back in the result set. As you can see, there's only one Jack in the dataset, so that's coming back correctly. Now imagine that in Power BI we want to query just this result set. What we can do is save this query as a view. This is a toy example, but sometimes you have a view with really complex transformation logic — the result isn't stored in the database, but whenever we query it we want the transformation done at query time. All you do to create a view of this is add CREATE VIEW, give the view a name — dbo.view_employee_get_jack — and then use the keyword AS; everything that follows is part of your view. If we execute that, we can see it completed successfully, and then we can do SELECT * FROM dbo.view_employee_get_jack — executing that gives us the same result as before. So now, when you're creating your Power BI report, you can just query this view rather than the underlying table to get the transformed data back.
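In case it's easier to see it written down, the view described above looks roughly like this — my reconstruction of what's on screen, so treat the object name and column list as illustrative:

```sql
-- Roughly the view created above: only the logic is stored, not the data.
CREATE VIEW dbo.view_employee_get_jack
AS
SELECT EmployeeID, Name, Age
FROM dbo.employees
WHERE Name LIKE 'Jack%';   -- % wildcard: anything after 'Jack' still matches
GO

-- Query it like a table; the filter runs at query time.
SELECT * FROM dbo.view_employee_get_jack;
```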
Okay, so that's the basic syntax for creating a view. I've written down some key points to understand, in general but also for the exam. With a view, the transformed data is not stored — only the transformation logic, the SQL code, is stored — and every time you query it, maybe in a Power BI report, it performs that transformation at query time. You can create a view from another view — we could query view_employee_get_jack from another view, that's possible — but be careful with performance: when you string multiple views together doing lots of heavy processing it can have an impact, and it can also become difficult to understand and maintain if you're constantly querying views from other views. With a view we can't specify any parameters; it's just a SELECT statement — you query it as-is and get whatever the result is. Parameters are possible with functions and stored procedures, as we'll see in a minute. And as we mentioned, reading a SQL view into a semantic model will fall back to DirectQuery mode — Direct Lake mode is not possible — which makes sense, because Direct Lake works by reading the underlying Delta tables, and for a view those Delta tables don't exist. So that's the view; now let's look at a function. Say we want to parameterize that view: we want to pass in a parameter, maybe first name, so that we're not just getting back Jacks — we can use it to get a list of employees for any first name. Let's look at what that might look like. This is the declaration of a function, and you'll notice it's similar in some ways to the view, but there are key differences. We start with CREATE FUNCTION and give the function a name — this one is function_employee_first_name_search. Then we open some brackets, and the brackets are really important in a function — it's a bit like a function in Python — because that's where we pass in parameters. Parameters are optional, but they are an option. We define our parameter with the @ symbol, so @FirstName, we give it a data type, varchar(20), and a default value — here an empty string. So we have one parameter: its name is @FirstName, its type is varchar(20), and its default value is an empty string. The next really important piece of syntax is RETURNS TABLE. If you've used functions in other flavours of SQL, you'll know there are a few different types; in Fabric we're talking about table-valued functions, which means the function returns a table — we're not talking about scalar functions, because those aren't currently possible in the Microsoft Fabric version of T-SQL. When we call a function it always returns a table, so in our syntax we have to define that: RETURNS TABLE AS RETURN, open brackets, and then whatever we want in our SELECT statement. This is very similar to our view declaration, but here we're making it dynamic — we're parameterizing it using the @FirstName parameter. So let's run this — okay, we've created our function. Let's see what that looks like: if we go to Programmability in SQL Server Management Studio, under Functions you'll see SSMS differentiates between scalar-valued functions and table-valued functions, and, as we know, ours will be under table-valued functions — here it is, dbo.function_employee_first_name_search. We can call it like this: SELECT * FROM dbo., our function name, open brackets, with the parameter. If we execute it, you'll see it returns the same result as before, but parameterized — so now, instead of just Jack, we can ask for Jack, or Sarah, or whatever it needs to be. We're just encapsulating all of that logic into a function and parameterizing it, so you can use it in multiple different places in your T-SQL code base — it packages that logic up into a nice reusable function.
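Written out, the function described above looks roughly like this — again a reconstruction, so the exact object name and column list are assumptions:

```sql
-- An inline table-valued function: the parameterized version of the view.
-- @FirstName defaults to an empty string if nothing is passed in.
CREATE FUNCTION dbo.function_employee_first_name_search
(
    @FirstName VARCHAR(20) = ''
)
RETURNS TABLE
AS
RETURN
(
    SELECT EmployeeID, Name, Age
    FROM dbo.employees
    WHERE Name LIKE @FirstName + '%'
);
GO

-- Called from a SELECT statement, passing the parameter in brackets.
SELECT * FROM dbo.function_employee_first_name_search('Jack');
SELECT * FROM dbo.function_employee_first_name_search('Sarah');
```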
So what are some of the key points with a function? As I mentioned, it's useful for packaging up logic you might want to reuse in different places, and it can be parameterized. With a function you can only do SELECT statements — you can't update rows, you can't do inserts; it's read-only SQL, SELECT statements only. To call a function we use a SELECT statement, like we just did, or we can call it from within a stored procedure or from within a view (which is still just SELECT logic underneath). It can't be orchestrated directly — we can't use it in a data pipeline directly — although we can embed it in a stored procedure, as mentioned. It allows one or more parameters, specifically input parameters, and the output is always a table: in Fabric we're only using table-valued functions, and they always return a table. Okay, that's the function; now let's move on to the stored procedure. This is a very basic definition of a stored procedure: CREATE PROCEDURE, we give the procedure a name, then AS, and here we're just doing a SELECT statement — no parameters. You'll notice that when we execute a stored procedure we use EXEC, which is execute — that's one of the differences between a stored procedure and a function or a view, the way in which we call it. At the moment we're not doing much with this stored procedure — it's the most basic one possible: we wrap a SELECT statement, execute it, and get back the full result set. Now let's step it up a gear and add a parameter. We've got something very similar to before: CREATE PROCEDURE, now it's dbo.sp_employee_get_by_first_name, and we're declaring a parameter — it's called @FirstName and again it's of type varchar(20). Underneath that we declare what we actually want the stored procedure to do: select all of the employees where the name is LIKE @FirstName, which is passed in as the parameter, with the wildcard operator added on the end so that whatever follows — the surname, say — still matches. Let's execute this to create the stored procedure — ah, it already exists, so let's just drop it first: DROP PROCEDURE IF EXISTS dbo.sp_employee_get_by_first_name. Okay, now we've dropped it we can recreate it — completed successfully — and then, to run it, we use the EXEC command, pass in the name of our stored procedure and our parameter, which is 'Jack', and if we execute that we get our results back. So that's the basic definition of a stored procedure and what it looks like.
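Here's that same stored procedure written out as a sketch — the drop-and-recreate pattern at the top is just the workaround for the "already exists" error I hit above, and the object names are my reconstruction:

```sql
-- Drop the procedure first if it already exists, then recreate it.
DROP PROCEDURE IF EXISTS dbo.sp_employee_get_by_first_name;
GO

CREATE PROCEDURE dbo.sp_employee_get_by_first_name
    @FirstName VARCHAR(20)
AS
SELECT EmployeeID, Name, Age
FROM dbo.employees
WHERE Name LIKE @FirstName + '%';
GO

-- Stored procedures are called with EXEC, not from inside a SELECT.
EXEC dbo.sp_employee_get_by_first_name @FirstName = 'Jack';
```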
Now let's think about some of the key points for the stored procedure. We have the ability to define input parameters, but also output parameters — I didn't show output parameters in this lesson, maybe that's one for another time, but they're useful in data pipeline scenarios; for the exam, probably all you need to know is that it's possible. Stored procedures are called using the EXEC command, not as part of a SELECT statement. You can call other stored procedures from a stored procedure, so you can create three stored procedures and then a kind of master stored procedure that calls each of the others in series. One of the key use cases of a stored procedure is to give people access just to the stored procedure and not to the underlying data — you can use it as a security mechanism: by granting access only to the stored procedure, people only get the results of that procedure, not any of the underlying data. Now, one of the key points to understand in Fabric is that the stored procedure is one of the only things we can embed into a data pipeline, and we can pass parameters in from other pipeline activities — that's really important to understand, because it makes stored procedures a lot more useful: they can be orchestrated as part of a data pipeline, so they become really useful for data transformation in your data warehouse and in your architectures in general. The one thing missing from this list so far is the ability to do inserts, updates, and deletes. That's what's unique to the stored procedure, at least in this list: here we only used a SELECT statement, but we can use stored procedures to update and insert into the underlying datasets — so they're really powerful tools for any sort of data transformation and data loading.
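To make those last two points a bit more concrete — stored procedures can modify data, and they can call other procedures — here's a small hedged sketch. The procedure names and the "refresh" framing are invented for illustration, reusing the hypothetical objects from earlier:

```sql
-- A stored procedure can run DML (inserts, updates, deletes), not just SELECTs.
CREATE PROCEDURE dbo.sp_update_employee_age
    @EmployeeID INT,
    @NewAge     INT
AS
UPDATE dbo.employees
SET Age = @NewAge
WHERE EmployeeID = @EmployeeID;
GO

-- A "master" procedure that calls other stored procedures in series.
CREATE PROCEDURE dbo.sp_master_refresh
AS
BEGIN
    EXEC dbo.sp_update_employee_age @EmployeeID = 1, @NewAge = 35;
    EXEC dbo.sp_employee_get_by_first_name @FirstName = 'Jack';
END;
GO

EXEC dbo.sp_master_refresh;
```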
Okay, so let's focus now on that point about the stored procedure being embeddable in a data pipeline — let's go back into Fabric and look at what that looks like. Here we are in a data pipeline, and you'll notice that on the Activities tab there are a lot of different options. The three the DP-600 study guide mentions are the stored procedure activity, the notebook activity, and the data flow activity, so we'll go through each one in a little detail, look at some of the settings we can apply when creating them in a data pipeline, and then move on to how we schedule these things — starting with the stored procedure. When we drop the stored procedure activity onto the data pipeline canvas, we can look at the settings we get out of the box. Most of the configuration happens in the Settings tab: you navigate to the specific data warehouse you want to connect to — noting that the data warehouse currently has to be in the same workspace as your data pipeline; we've mentioned the cross-workspace limitations of the data pipeline quite a lot on the channel, so be wary of that. Then we connect to the warehouse and choose from the stored procedures in that data warehouse, and we can define any parameters. There's an option to automatically import parameters, which looks into your stored procedure and picks out any parameters it finds — here it's found one called FirstName, it knows it's of type string, and currently we're hard-coding in a value of 'Jack'. One of the benefits of the stored procedure activity, though, is that you can add dynamic content, so you can parameterize the activity and pass something into that value using data pipeline parameters. On the General tab we've got the name — you can update it and make the activity active or inactive — and we can add a timeout. This can be quite useful for some data pipeline activities: maybe you've got a really long-running stored procedure that might take half an hour, so you add a timeout of one hour, because if it runs for an hour or longer you know something has probably gone wrong and you don't want it to lock up your data warehouse — so bear the timeout in mind. Another one is the retry: this setting is available on quite a few of the data pipeline activities and, as the name suggests, if the activity fails it will retry the number of times you specify in this box. We can also set a retry interval, so we might say: try it once, wait 30 seconds, then try it again. Then there's secure output and secure input, which specify whether the output (or the input) of that activity should be written to the logs. So that's the stored procedure activity; now let's look at the notebook activity. We've got a notebook here, and again, if you go to the settings, you can define your workspace and your notebook — here I've got my "notebook load to silver" — and, similarly to the stored procedure, we can declare parameters: if you've got a parameter cell in that notebook, you can pass parameters into it from other activities in your data pipeline. The settings on the notebook activity are very similar to the stored procedure activity: a number of retries — and with the notebook, a retry is probably more important, or more likely to be used, because the notebook is using the Spark cluster and potentially querying REST APIs, so there's a lot more that can go wrong than in a stored procedure. I think Microsoft recommends setting the retry to two or three, so that if your Spark cluster is busy, or the session times out, or something else goes wrong with your notebook execution, it will retry — so it's always recommended to add one or two retries to a notebook activity. Again, we can specify the retry interval down here, plus secure input and output — the same as the stored procedure activity. The final one is the data flow activity. Similarly, you go to the Settings tab, find your workspace, and select the data flow you want to run; and if we go back to the General tab you'll see the settings are exactly the same: a name and description, the activity state (active or inactive), a timeout, a number of retries, and the retry interval. So that's a little bit about the stored procedure activity, the data flow activity, and the notebook activity in data pipelines. Now let's look at scheduling these different items in Fabric and the options available to us. Here we are back in our workspace, and there are a few different ways we can schedule things — mainly these ETL jobs — to run in Fabric: we
can schedule them to run every hour, for example, or whatever suits your use case. We'll look at scheduling data pipelines, data flows, and notebooks in a bit more detail here. With the data flow, we can actually schedule from the workspace: click on the ellipsis (the three dots) next to a particular data flow, click on its settings, go down to the refresh settings, configure a scheduled refresh and turn it on, and change the update frequency to daily or weekly — or, if you want more fine-grained refreshes than that, add specific times: 1:00 a.m., 2:00 a.m., 3:00 a.m., that kind of thing. Now, with the data flow the maximum number of refreshes per day is 48, which is in line with what you could do previously in Power BI Premium. We can also send refresh failure notifications to specific people, or the owner, or both. Let's hop back to our workspace and look at the notebook. With the notebook it's a similar story: we can click on either Schedule or Settings — both take you to the same place — which opens a sidebar menu where we specify the schedule. Again, we can repeat it every hour, every day, weekly, or by the minute (every 5 minutes, for example); we can specify a start time, an end time, and the time zone the schedule should run on. This is one of the key differences between the notebook and the data flow: here we can specify a frequency well beyond 48 refreshes per day — but bear in mind it's obviously going to use a lot more capacity units, so if you don't need to refresh your data at that cadence, don't. So that's scheduling notebooks. Now, the data pipeline is an orchestration tool, so another option is to put our notebooks and data flows inside a data pipeline and then schedule the data pipeline. You'll see that within the data pipeline we've got this Schedule button; again it opens a sidebar menu where we choose by the minute, hourly, daily, or weekly, with a start and end time and a time zone — very similar to the notebook scheduling. Now, if you're coming from ADF, Azure Data Factory, one thing to bear in mind is that currently the only way of triggering a data pipeline is with a schedule. We don't currently have event-based triggers, window-based triggers, or HTTP request triggers — all that really useful triggering functionality that exists in ADF doesn't exist in Fabric yet; the only option is the schedule trigger. You might be thinking: when should I use the built-in scheduling for a data flow, say, and when should I schedule it via a data pipeline? Well, the pipeline just gives you more flexibility and functionality for handling different events: say you want some sort of on-fail activity — a custom notification, some other activity, or logging — that's possible if you orchestrate it with a data pipeline, whereas if you just schedule a data flow to refresh on its own you don't get access to all that other functionality. So that's why you might want to think about embedding your data flows into a data
pipeline and then triggering them from the data pipeline. Okay, that rounds up the content for this lesson; let's go back to the slides and test some of the knowledge from this section of the study guide with some practice questions. Number one: the maximum number of scheduled refreshes allowed per day with the Dataflow Gen2 is: A, 12; B, 24; C, 48; D, 96; or E, unlimited refreshes per day. Pause the video here, have a think, and I'll reveal the answer shortly. The answer is 48 refreshes per day — you can get a Dataflow Gen2 to refresh every half hour if that's what you want to do. If you're coming from the Power BI Premium world you'll be used to this; the other figures are simply incorrect. Question two: which of the following can be used to update a row in a data warehouse table? Is it A, a stored procedure; B, a view; C, a CREATE TABLE statement; or D, a function? The correct answer is A, a stored procedure — that's the only one that allows you to actually update data in a data warehouse table. A view, as the name suggests, is read-only: it creates a view on top of the data but doesn't modify the underlying data. A CREATE TABLE statement isn't going to update a row, and a function also can't change the underlying data. In a stored procedure we have a lot of flexibility to insert, update, and delete rows — all of that kind of DML is available to us (and by DML I mean data manipulation language, a subset of the SQL language). Question three: which of the following statements is false when talking about operations in a Fabric data warehouse? A: you can call a function from a stored procedure. B: you can call a function from a view. C: you can query a view from another view. D: you can call a stored procedure from a function. The answer here is D — you can't call a stored procedure from a function; that's the only one you can't do. Remember that a stored procedure has to be called with the EXEC (execute) command, and you can't do that from within a function. You can call a stored procedure from another stored procedure, but that's pretty much the only place you can call one from. The others are all fine: you can call a function from a stored procedure, you can call a function from a view, and you can query a view from another view. So the correct answer is D — that statement is false. Question four: you're orchestrating many Spark notebooks to perform data transformation activities at the same time, and you notice that sometimes the notebook execution fails because your cluster is busy — that's the error message it's giving you. What modification can you make to the data pipeline notebook activity to give the pipeline more chances to run successfully? Is it A, deactivate and reactivate the activity; B, set the number of retries to two or three; C, change the parameters you're passing into the notebook; or D, add a failure activity to handle the execution failure? The answer here is B, set the number of retries to two or three. You'll remember that the retry option on a notebook activity allows the activity to retry a number of times if it fails — so if your Spark cluster is busy and it can't execute the first time, it's going to wait
and then retry it two, three, or however many times you specify in that retry setting. Deactivating and reactivating the activity isn't going to make much difference; changing the parameters — well, you probably don't want to do that because it would change the output; and adding a failure activity to handle the execution failure might be a good idea, but it's not actually going to affect the result of the activity — it won't let it rerun and get a successful execution. So the answer is B. Congratulations, you've completed the second part of section two, preparing and serving data. In the next lesson we're deep diving into the exciting world of data transformations within Microsoft Fabric — make sure you click this video here to join us in the next lesson. Hey everyone, welcome back — this is the next video in our DP-600 exam preparation course, video seven, and we're looking at transforming data. There's a lot to get through, and it can look a little overwhelming when you see everything we're going to cover, but don't worry — I'm going to break it down into a number of sections to make it a bit more digestible. First we'll look at data cleansing: how to implement a data cleansing process, and how to resolve some common issues we get with data — duplicates, missing data, null values, conversion of data types, and filtering data — and how to do those things using the data flow, T-SQL, and Spark. Then we'll move on to data enrichment: under this category we'll look at merging and joining different datasets together, and enriching the data we've already got — adding new columns and new tables based on our existing data. Finally we'll take a look at data modelling: the star schema and what it is, type one and type two slowly changing dimensions, the bridge table — what problem it solves and how to implement it using T-SQL — data denormalization, and aggregating and de-aggregating data. As ever, we'll test your knowledge at the end of this video with some sample questions, and all the key points for this section of the exam, plus links to further learning resources if you want to dig into any of these topics in more detail, are posted in the Skool community — I'll leave a link in the description below. So first, let's take a look at the data cleansing process. To give some structure to what we're looking at, I'm going to talk through the lens of the Medallion architecture, whereby we have bronze, silver, and gold areas in our data processing pipeline. When we talk about data cleansing, it normally takes place anywhere between bronze and silver (and maybe within silver as well), because we're going to get raw, messy data in bronze, so we want to do our data cleaning steps between bronze and silver. The way I'm going to do this is to walk you through common data cleansing routines and operations in all three of the tools listed there: starting with the data flow, then how to do the same things with T-SQL in the data warehouse, and also in a Spark notebook. So
let's jump into Fabric and start with data cleansing in a data flow. Okay, so for this part of the lesson we're going to be jumping into Fabric and transforming a particular data set, and it's that car dealership sales data set. The data set itself comes from this blog here, Tableau Server Guru, so thanks very much to this person, who has some sample data sets on their website. Don't worry, we're not going to be talking about Tableau servers or anything like that, but they do have this nice car sales data set that I downloaded; it comes in a pretty good star schema, and when you download it, it actually comes in xlsx. Now this data set is actually quite clean, so I have made some modifications to make it a bit more dirty so that we can perform some data cleaning on it, and I'll leave a link to the dirtified data sets, if you can call them that, in the school community, so go over there to get the source data files that I used here. To get us started, what I've done is put them into a Lakehouse files area, so it's a number of CSVs, and then I've simply loaded them into a number of tables; these are going to be our bronze Lakehouse tables that we're going to be using for this part of the tutorial. So here we are in the data flow Gen 2, and I'm just going to show you how I can do some data cleansing within the data flow itself. I've pulled in one of those tables from our Lakehouse area — the source here is our Lakehouse — and it's this Revenue table, so this is what it looks like, basically completely untransformed. You'll notice I've introduced some null values and a few duplicate values as well, so what we're going to do is step through the different transformation steps to clean this data set within the Power Query experience inside the data flow. The first thing we might want to do is remove any duplicate values; I know there are duplicates in here because I introduced them into this data set. To remove duplicates in the Power Query engine, we just highlight all of these columns — clicking on the left-hand column, holding shift and then clicking on the right-hand column, which selects all of our columns — then we can right click at the top and remove any duplicate rows. This is important because we want to check that the whole row is a duplicate, so we need to select all of the columns and then remove the duplicates like that, and you can see it's been added into the applied steps here. If we take a look at our Revenue column, you can see that we do actually have some null values in this data set. Now obviously it depends on your use case for data cleansing, but if this Revenue value is really important and it's the only thing that you really care about in this data set, then you might want to remove these nulls altogether from the data model; again, it depends very much on your use case. To remove the nulls in this Revenue column we just click on the column itself, deselect the null value and press OK, and that's going to remove those null values from the data set. Another thing we can do is change the type: if you want to change a type in the data flow, simply click here on change type. It might not make sense for this particular column because this one already looks good — it looks like an integer value and it's got an integer type here, whole
number — but say for example this was a text value and you know that deep down it's actually an integer, then you can just right click on it and change the type to any of these types here. Another thing we can do with the data flow Gen 2 is add in new columns, so you can see here, add column, and what we can do is add a custom column. Maybe we want to do some sort of revenue bin, maybe you want to add in some logic to group these revenues into something a little bit different; I'm just going to do a simple one here, revenue is greater than 1 million, and we're going to give it the data type of true or false, and then if we click OK that should give us this true value — and just to check, some of them are false, so yeah, that's happening correctly. Now maybe we want to change the name to something a bit more descriptive, Revenue over 1 million; that's probably a better description for what that new column is. Another way we can filter these data sets: maybe you don't actually want all of these units sold; perhaps for this particular piece of analysis you want to remove the units sold over one. Again, I'm just showing the functionality here — it really depends on what your use case is in your business as to which data cleansing steps make sense. So that's a bit of an overview of Power Query and how we can do data cleansing in the data flow Gen 2; now let's jump over to the data warehouse and look at how you can implement a similar data cleansing process using T-SQL. Okay, so I've just jumped over to the SQL endpoint in that same bronze Lakehouse, because we've got the same tables here, but now we're in the SQL endpoint experience so we can write some T-SQL. To begin with let's just have a look at our data set, just to make sure that it's coming through okay — yep, it looks exactly the same as we had in the data flow, so that's good. The first step we're going to look at is identifying duplicates in this table. There are many different ways to identify and remove duplicates from a table using T-SQL; the method I'm using here is GROUP BY, so we're going to group by something that I know is unique, or should be unique, for every row in this fact table, and we're going to use HAVING with a count of more than one. What this means is that for each of these groups there should be exactly one row, so if this returns a count of more than one for a group then we know that is in fact a duplicate row. Let's just have a look: we can see that for these two dealer IDs and these two dates there are actually three rows, and if we add a count of the number of rows to make that a bit clearer and rerun it, we can see the number of rows is three and four. So how do we go about removing those duplicates? Well, one way of doing it is by using the GROUP BY again: we group by the same two columns, the dealer ID and the date ID, and we bring back an aggregate function, the MAX of the revenue. It's worth checking before you do this that the revenue figures for each of these rows are actually the same, so that taking the MAX or the MIN doesn't make a difference; we're just going to bring through whatever that value is, and the result is the same data set but with the duplicates removed, which is this one here, and then
you can save that into another table or use it for whatever downstream activities you want. Next, let's look at missing data: nulls, removing nulls, filtering, that kind of thing. There are some null values in this Revenue column, so first we're going to use WHERE revenue IS NULL to identify them, and you can see we've got six rows where the revenue is null. Again, you might want to remove those, and doing so is pretty simple: we can just use IS NOT NULL, and that returns all the rows where the revenue is not null. One way we can quickly verify that this has actually worked is to bring it into a CTE called remove_nulls, then do a SELECT COUNT(*) from remove_nulls and a SELECT COUNT(*) from the original table and compare the two values — ah yes, I have to actually run it. Result one is 1855 and result two is 1861, so the row count has changed by six, and those are the six null values we saw when we did WHERE revenue IS NULL, so that's worked correctly. Another thing we need to bear in mind for the exam is T-SQL type conversion: here we're using the CAST function, and we're going to cast the values in the revenue column as a float — I think currently it's an integer — so we're casting it to a float, basically a decimal number. By doing that, when we bring both of these columns through you can see the original revenue in this column, which is an integer, and here the data type has been changed into a float using the CAST functionality. Another thing we've done here is add another column: when you're using T-SQL you can just add another expression to your SELECT statement and do whatever you want there; this is just an example where I've divided the revenue by two and given it the name half the revenue, so that's an example of adding new columns in T-SQL. Okay, so to round off this part of the tutorial, next we're going to look at transforming data using PySpark notebooks, and specifically how we can implement in a Spark notebook some of the data cleansing routines we've been looking at in the data flow and the T-SQL engine. Now I did do a 3.5 hour tutorial which goes into a lot more depth about different data cleansing operations; you can have a look at that video here, and I'll leave a link in the school community if you want to explore it in more detail, but here we're just going to do a quick dive into some of the most common operations, and I'll also leave this notebook on our school community so you can play along at home if you want to. We're using the same bronze Lakehouse, we've got our tables here, and I'm going to look at this Revenue table, which is our fact table, in a bit more detail. I've begun by reading it into a Spark DataFrame and displaying the results, just to check that everything's loading okay, and it's all loaded in here. Now you'll notice that I haven't actually committed or written any of the changes that we made in the data flow and the T-SQL engine, so this is still reading from the raw data in that bronze layer. So let's start by looking at duplicate data again; to identify duplicates we can use a similar kind of pattern to the one we used in T-SQL.
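As a quick aside, since a Fabric notebook can also run Spark SQL, here's a minimal sketch of that same GROUP BY / HAVING pattern expressed as Spark SQL against the bronze Lakehouse — this is an illustration rather than the exact warehouse T-SQL from the demo, and the table and column names (revenue, Dealer_ID, Date_ID) are assumptions:

```python
# Assumed table/column names: revenue, Dealer_ID, Date_ID -- adjust to your own Lakehouse.
# Rows where the (Dealer_ID, Date_ID) combination appears more than once are duplicates.
duplicates = spark.sql("""
    SELECT Dealer_ID, Date_ID, COUNT(*) AS row_count
    FROM revenue
    GROUP BY Dealer_ID, Date_ID
    HAVING COUNT(*) > 1
""")
display(duplicates)

# The MAX() trick: collapse each group back to a single row, dropping the duplicates.
deduped_sql = spark.sql("""
    SELECT Dealer_ID, Date_ID, MAX(Revenue) AS Revenue
    FROM revenue
    GROUP BY Dealer_ID, Date_ID
""")
```

The two-step idea is the same as in the warehouse walkthrough: find the offending groups first, then collapse them.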
In the DataFrame API we do the same thing, using a groupBy and then looking for a count of more than one, and here you can see we've identified the rows where we have some duplicates in the data set; luckily it matches what we were seeing in the T-SQL engine, so we can see that these branch IDs and date IDs have a count of four and three respectively. So how do we go about dropping those duplicate values from our data set? Well, in Spark we have the dropDuplicates method, so what I'm going to do is start by counting the rows in the original data set, perform the dropDuplicates, save it into deduped, which is a new DataFrame, and then count the rows in that new deduped DataFrame. I've printed out the result here and we can see that the process has removed — if I could spell process correctly — five rows from the data set; I've just taken the difference between the original data set and the deduped data set. That makes sense, because we had a count of seven here originally and two of those rows we want to keep, right, because they are actually valid data, so we've removed five rows, meaning five of them were duplicates. To verify, we can perform the same operation we did before, grouping by those two fields and looking at where the count is more than one, and we get an empty DataFrame, so that looks like it's worked correctly.
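Pulled together, the de-duplication steps just described might look something like this in PySpark — a rough sketch, where df is the revenue DataFrame read earlier and the grouping column names are assumptions:

```python
# df is the bronze revenue DataFrame read earlier in the notebook.
print(f"Rows before de-dup: {df.count()}")

deduped = df.dropDuplicates()            # drops rows only where every column matches
print(f"Rows after de-dup:  {deduped.count()}")

# Re-run the duplicate check to confirm nothing is left with a count greater than one.
check = (deduped
         .groupBy("Dealer_ID", "Date_ID")  # assumed column names
         .count()
         .filter("count > 1"))
display(check)   # expect an empty DataFrame
```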
Next we're going to take a look at missing data and how we handle nulls in Spark. First off, identifying missing values: you might want to do a bit of inspection of your data before you go ahead and drop things, so it's a good idea to interrogate your data and look at what you might be deleting. Let's just run this and then talk through it: we've got our original DataFrame and we're filtering on a specific column, passing in df.Revenue.isNull(), which is a method we get on a Spark column, and we can see it's returning the six rows that we know are null. Another way to do that — I've changed two things in this second example — is using df.where; df.filter and df.where are basically synonymous in Spark. This time we've passed in col, which is one of the Spark SQL functions and just another way of obtaining that column, and again we're calling isNull on it, saving the result into this nulls_2 variable and displaying it, and we can see that the two results are exactly the same, so both methods have returned the same null values, which is good. Now let's look at dropping some of those null values using the dropna method. This method has a few different parameters — how, thresh and subset, which is the one we're using in this example. If you pass a value for the how parameter it can be either any or all: if it's any, the row is dropped if the value in any column is null; if it's all, the row is dropped only if all of the values are null. Thresh gives you a threshold for the number of columns that need to be null for the row to be dropped, and if you use it, it overrides the how parameter. The other one we're using is subset: you can pass in subset to limit the columns that are checked for the drop, so for example we only want to drop a row if there's a null value in this Revenue column, and you can pass it in either as a list or as a string. In our example we've only got one column in that list, and we're going to save the result as no_nas and then do the same thing, printing out the result just to check that we have actually removed some rows — and we've removed those six rows with null values from the data set, so that looks like it's worked correctly. Next, let's look at type conversion and adding new columns to a Spark DataFrame. We can call df.printSchema() and it gives us a look at the initial schema for this data set, including the data types for each column. We're going to get the column using df.Units_Sold — that's one of our columns, and currently it's an integer — and for the purposes of this demo we're going to use the cast method and give it the data type of string. Then we've printed the schema again just to check what it looks like, and we can see this Units_Sold_converted column — that's the new column name we passed to the withColumn function — and the printSchema shows a data type of string, so we've successfully cast that value from an integer into a string. Next, a quick look at filtering; we've already seen some filtering previously in this lesson, but let's go over it again. For the filtered DataFrame we're taking the original DataFrame and calling df.where, passing in the column — the column here is revenue — and we want to filter only for data that is more than 10 million, and we can see that this has filtered out over a thousand rows from the data set. As I mentioned previously, we can use df.where or df.filter for filtering; if I change this to df.filter it should work exactly the same — there you go.
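To tie the null handling, casting and filtering together, here's a hedged sketch of those DataFrame API calls — the column names (Revenue, Units_Sold) and the 10 million threshold come from the demo but may differ in your data:

```python
from pyspark.sql.functions import col

# Keep only rows where the Revenue column is populated; subset limits the null check to that column.
no_nas = deduped.dropna(subset=["Revenue"])

# Cast Units_Sold from integer to string via withColumn, then confirm the new type in the schema.
converted = no_nas.withColumn("Units_Sold_converted", col("Units_Sold").cast("string"))
converted.printSchema()

# where() and filter() are interchangeable ways to filter rows.
filtered = converted.where(col("Revenue") > 10_000_000)
print(f"Rows with revenue over 10 million: {filtered.count()}")
```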
So now let's look at data enrichment, adding new columns. We've already seen this one: we're going to be using df.withColumn, and you can also use withColumns if you want to add more than one at a time, but we're just showing one column here. We take the original DataFrame, call withColumn to add a new column, give the column a name, half_revenue, and give the expression to apply for that new column; in our example we're just dividing the revenue, so df.Revenue divided by two, and then we display the results in our enriched DataFrame. If we run that, it looks like this: we've got our new column, half_revenue, which is the revenue divided by two. Finally we're going to look at joining and merging DataFrames in Spark. For this one we're using some different tables: I've pulled in the dealers DataFrame, so the dealers table, which is this one here, and also the countries DataFrame, and you'll notice these do actually have a joining key — they both have country ID, so each dealer has a country ID. What we want to do is pull through the country name; maybe we've got a normalized data model and we want to denormalize it — we'll look at what that means in a bit more detail later in this tutorial — but for now let's just look at joining and merging these two data sets together. The basic syntax for a join in Spark is: we take one DataFrame, the dealers data set, and call dealers_df.join — that's the first part of our join — then we pass in the second DataFrame we want to join to, in our case countries_df, then we specify the column we're joining on, so dealers_df country ID equals countries_df country ID. In the next row we just specify a select: we've used these brackets because we're chaining more than one command together, and in this second row we're selecting a few columns to be returned in the joined DataFrame — we want the dealer ID, the country ID and the country name, that's all we care about in the resulting DataFrame. Ah, dealers_df is not defined — that's because we haven't defined it, so let's read that data into a Spark DataFrame first and then run this again. Okay, now we've got dealer ID, country ID and country name in the same DataFrame. One thing you'll notice is that the country name has pulled through a few interesting characters, so that's something you might want to change later in your data processing workflow; it might be that I saved the CSV file in an incorrect format, so maybe that's something to clean as well in our data cleansing process, but for now we're just going to leave it like this.
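The enrichment and join steps described above could be sketched roughly like this — again, the table names (dealers, countries) and column names are assumptions based on the demo:

```python
from pyspark.sql.functions import col

# Enrichment: derive a new column from an existing one.
enriched = df.withColumn("Half_Revenue", col("Revenue") / 2)

# Join dealers to countries on the shared country key and keep only the columns we care about.
dealers_df = spark.read.table("dealers")
countries_df = spark.read.table("countries")

joined = (dealers_df
          .join(countries_df, dealers_df["Country_ID"] == countries_df["Country_ID"])  # inner join by default
          .select(dealers_df["Dealer_ID"], dealers_df["Country_ID"], countries_df["Country_Name"]))
display(joined)
```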
So finally we're going to take a look at data modeling, and typically this takes place within that gold layer, because these are the analytical models, the data models, that we're going to be bringing into our semantic layer. Let's start off with the star schema. This is an example of a star schema: in the middle we've got a fact table, and this fact table represents sales, so it's revenue for a particular company — this example is using car sales, the number of cars sold from different dealerships. We've also got a dealership dimension table, a dim date, a dim model (so the type of car that's sold) and the branch that it was actually sold at. It's called a star schema because we have our fact table in the middle and then multiple dimensions all linking to that fact table via some sort of primary key. The reason we prefer star schemas, if you want to be doing BI — Power BI basically — is that the Power BI engine is most efficient with this star schema, so a pretty common piece of data modeling that you might have to do as an analytics engineer is to prepare this data model in your gold layer, of a data warehouse for example, so that your Power BI developers can just pick up the data model and write their measures on top of it. Normally when we're modeling in this star schema, the fact table is going to be append-only, so we're not really going to be updating many values in it; it's just going to be new sales added onto the end. Your fact table is normally going to be a very long list, and you create connections from that fact table to the other dimensions, so if you want to know more about a particular sale in the fact table you find it in the dimensions. And the dimension tables can change over time, right? Take dim branch for example: in that dimension table you'd expect to see details about the different branches that exist for this car manufacturer, and some of those things can change over time — they might change address, they might change name, certain details about each dimension might change. One way of dealing with that in a data model is to introduce the concept of slowly changing dimensions: with our fact table, as we mentioned, we're not really updating anything over time, we're just appending new rows, whereas with our dimensions we need to be able to handle changes to the dimension table over time. They're called slowly changing dimensions because these are not going to be big updates — they might happen once a month or every week, a much slower frequency than the rate at which data is added to the fact table. In data modeling there are a lot of different ways we can model these slowly changing dimensions; here's an example, using a different data set: we're looking at an employee table, and we've got employee ID on the left-hand side, employee name and the department. It's not uncommon for an employee to change departments, so on the right-hand side you can see that this dimension has actually changed: Danny Walker, who is employee ID number two, has moved from the marketing department into the sales department, and we can model this in a number of different ways. The first way, which you really need to be aware of for the exam, is the type one SCD, or slowly changing dimension. In the type one SCD we overwrite with any new data, so every time we query that source data again we just overwrite whatever's in the table; we don't store any history. We implement it with the overwrite write mode, and you can do that either in the data pipeline, the data flow, or in PySpark. Now if you're using a Lakehouse as your data store, it's important to bear in mind that the history can still actually be retrieved at a point in time using the Delta log; in a Lakehouse architecture, in the Fabric Lakehouse, because you can access those Delta logs, just because you overwrite the table doesn't necessarily mean that the old data isn't stored.
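As a minimal sketch of the type one pattern in a Lakehouse notebook — assuming the dimension is a Delta table called dim_employee and new_employee_df holds the latest snapshot (both names are assumptions):

```python
# Type 1 SCD: simply overwrite the dimension with the latest snapshot on every load.
(new_employee_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("dim_employee"))

# Because the table is Delta, the overwritten versions are still in the Delta log,
# so you can inspect the history and even time-travel back to an earlier version.
display(spark.sql("DESCRIBE HISTORY dim_employee"))
previous = spark.sql("SELECT * FROM dim_employee VERSION AS OF 0")
```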
So the second way we can deal with this sort of change in a dimension is the type two slowly changing dimension. In this example we're going to need a few extra columns: you'll notice that at the top there we've added valid from and valid to, and optionally an is current flag as well, which is also quite useful for BI purposes. When we first write a record into this table we populate the valid from — that's when this row became valid in the data set — and the valid to is set to some date far into the future, typically the year 9999, and we also set the is current flag to one, or true. Now when we get an update to the data set, so we get a brand new set of data, in the type two SCD we add new rows based on the incoming data: you can see we're going to be adding a third row, and it's going to be Daniel Walker, but this time the department is sales. We have to do a few things here with the valid to and valid from dates: when you write that third row into the data set you need to update row two to populate its valid to column and set its is current to zero, or false; the new row, the third row, will obviously have department sales, with the valid from date set to the day you're making the change, a valid to of the year 9999, and an is current of true. You can see that by doing this we're actually storing a history of how our dimension evolves over time, and on the back of this you can write some quite sophisticated queries about the state of the dimension table at any given point in time. Now when we're implementing the type two slowly changing dimension, there are a few things to bear in mind. If you're using the Lakehouse and the Spark engine, then remember that, as we mentioned previously, the Delta log stores a history of all the writes to a particular table, so this might be a simple option, and depending on how you want to use the history downstream in your analysis, that might be good enough for some use cases. If not, then you can use MERGE INTO: it's part of the Spark SQL surface, and we can use it to update the underlying Delta tables. If you're using the data warehouse and the T-SQL experience, then one method is to load your new data into a staging table and then use a stored procedure to perform the checks around the updates, the inserts and the valid to and valid from calculations. Unfortunately the MERGE operation is not currently supported in the warehouse's T-SQL surface area, so you have to implement this manually if you want to do it in the data warehouse today. One implementation trick is to use row hashing: say you want to check which rows have changed from one update to the next, one way to do this more efficiently is to hash the value of an entire row, both in your existing data set and in the new data set you're checking, and then compare the two hashes. If the hash values match then you know that record hasn't changed, but if they're different then you can go ahead with the update logic that you need to update that table.
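For the Lakehouse route, here's a hedged sketch of what the Spark SQL MERGE INTO step of a type two SCD might look like — the table name dim_employee, the incoming updates_df, and columns like Employee_ID, Department, ValidTo and IsCurrent are all assumptions for illustration:

```python
from pyspark.sql.functions import sha2, concat_ws

# Optional row-hashing trick: hash the attribute columns so change detection is a single comparison.
updates_hashed = updates_df.withColumn(
    "Row_Hash", sha2(concat_ws("||", "Employee_Name", "Department"), 256))
updates_hashed.createOrReplaceTempView("employee_updates")

# Step 1: close off the current row for any employee whose attributes have changed.
spark.sql("""
    MERGE INTO dim_employee AS target
    USING employee_updates AS source
      ON target.Employee_ID = source.Employee_ID AND target.IsCurrent = 1
    WHEN MATCHED AND target.Department <> source.Department THEN
      UPDATE SET target.ValidTo = current_date(), target.IsCurrent = 0
""")

# Step 2 (not shown): append the changed rows back in as new records with
# ValidFrom = today, ValidTo = '9999-12-31' and IsCurrent = 1.
```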
Next we're going to talk about bridge tables. Imagine a company has many different projects running and employees are assigned to one or many projects — what does that look like if you want to build a data model and maybe a Power BI report off it? In this scenario, many employees can work on many different projects at the same time, so that's a many-to-many relationship between our dim projects and dim employees tables, and this can cause quite a lot of issues when it comes to Power BI and efficiency in very large data models. One thing we can do to resolve this is implement a bridge table, which is basically a one-to-one mapping of all projects and all participants. This is really useful because it turns our many-to-many relationship into two one-to-many relationships. Let's have a look at how you can implement that in a T-SQL data warehouse, just to show you what it looks like in real life. Okay, let's explore bridge tables in a bit more detail. Here we're in the data warehouse experience; I've built a gold data warehouse and we've got two tables, a projects table and an employees table — just a simple demo to show you what a bridge table might look like and how you can implement it in SQL. The projects table has the project ID and the project name, and we've also got this project participants column, and if we look at the data, we've got a bit of a messy comma-separated string in that project participants column, so we can see that each project has multiple project participants and some of them overlap — these are employee numbers that come from the employees table, so employee 101, who is John Smith, is actually working on multiple projects. So in our data model here we've actually got a many-to-many relationship, but we can't really implement it at the moment because of that string concatenation in the project participants column. How do we get around this? Well, I have got the T-SQL here, and I'll leave it in the school community as well if you want to have a play around with it yourself: I've created the projects table, created the employees table and inserted some values. One way to resolve the many-to-many relationship is to create a bridge table between the projects and the employees tables, and in this example I'm implementing it as a SQL view. What I'm doing is using the CROSS APPLY operator in T-SQL along with STRING_SPLIT: STRING_SPLIT looks at that project participants column within dbo.projects — the messy comma-separated column — and CROSS APPLY and STRING_SPLIT together separate out all those values based on the separator, the comma. What this does is create a one-to-one mapping for every single project and every single project participant. If we run this, let's have a look at what it looks like: looking at the original projects table just to remind ourselves, it initially had two rows, project one with two participants and project two with three participants, and what we've done in this view, if we recalculate it, is separate out all of those comma-separated values and make one row for each combination.
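Just for comparison — and this is not the T-SQL from the demo — if the same projects data lived in a Lakehouse, the equivalent split-and-flatten step could be done in PySpark with split() and explode(); the table and column names here mirror the demo but are assumptions:

```python
from pyspark.sql.functions import split, explode, trim, col

projects_df = spark.read.table("projects")

# Explode the comma-separated participants column into one row per (project, employee) pair.
bridge_df = (projects_df
    .withColumn("Employee_ID", explode(split(col("Project_Participants"), ",")))
    .withColumn("Employee_ID", trim(col("Employee_ID")))
    .select("Project_ID", "Employee_ID"))

display(bridge_df)   # this flattened mapping is the bridge table
```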
So now we have five rows, because these are all the different combinations of projects and employees. What we can do is create a view with this logic, like so — I've called it dbo.view_Bridge_Project_Participants — and that makes it available in our data model. So now we've got this bridge view in our data model, and we can connect the project ID to the project ID; in this instance it's going to be one-to-many, because this is a dimension table and this is our bridge table, and we can do the same on the other side, where this time it's many-to-one. So what we've done is transform a many-to-many relationship into two one-to-many relationships, which is going to make things a lot easier when you implement this in Power BI — that's one way you can resolve a many-to-many relationship using a bridge table. Now we've implemented this as a SQL view just because it's quite easy; in reality, if you wanted to use Direct Lake mode, you can't use a view with Direct Lake, so you might want to use a stored procedure to materialize the data in this bridge table instead, so that it doesn't fall back to DirectQuery mode. But that's bridge tables in T-SQL. Okay, so the next thing we want to talk about is normalized and denormalized data. If we go back to our data model, where we were looking at car sales, we have some sort of fact revenue in the middle and some dimensions that give us more information about each sale — what branch it was at, what model of car was sold, the date it was sold and also the dealership — and what we've done is add another dimension onto that dim dealers, so there's a relationship between dim dealers and dim cities, which basically tells us the city of that dealership. Now this is what's called a snowflake architecture. It can be a very efficient way of storing really large data models, because you'll notice that in the dim dealers table we're only storing the city ID — we're not bringing any other city data into dim dealers. This can be beneficial in some instances, for example for efficiency as we mentioned, but when we move this kind of model into the Power BI world it can bring some limitations as well. One way to get around this is with the denormalized data model, and this is what that looks like: it's effectively turning our snowflake model back into our star schema model, which we know can be a lot more efficient when we've got very large data sets. This does introduce some redundancy into that dim dealers dimension, because instead of just storing the city ID for every dealership we're going to be repeating the city ID, the country ID and the region for all of the rows in dim dealers, but by denormalizing our data model we actually get a number of other benefits: now we have everything in the dim dealers table, which makes it a lot easier to build things like filters on top of that dimension table, and we can also have hierarchical filters, so we can look at the dealerships by city, country and region and build a hierarchy there, which isn't possible when you've got that split across two dimension tables. Finally we're going to take a look at data aggregation, and the word aggregation can mean a few different things in data analysis and
data modeling, and I'm not 100% sure what Microsoft expects for the data aggregation and deaggregation that they've mentioned in the study guide, but I'm going to talk through both interpretations so that you know them both — and if you have any insight as to what Microsoft means in the study guide when they talk about aggregation, let us know in the comments. The first possible meaning of data aggregation is when we have different slices of the same data set spread across different files: here we have one file with the UK data, one file with the USA data and one file with the Canadian data, and in this context data aggregation basically means combining these data sets into one long table of transactions, in this case showing revenue across all your different source data sets. We can implement this in a number of different ways in Fabric: in the data flow we can use the append functionality, and in the T-SQL experience, and also Spark SQL, we can use UNION and UNION ALL — and union is also a method in PySpark. The difference between UNION and UNION ALL is that UNION is going to remove any duplicate rows in the resulting data set, whereas UNION ALL just appends all of the data sets on top of each other without checking for duplicates. Also mentioned in the study guide is data deaggregation, and again there can be different meanings of that word; assuming we mean the first meaning of aggregation, then deaggregation is the opposite — splitting one large file into multiple different categories of data. The second possible definition of data aggregation is when we transform a data set from a more granular data set into a less granular one. On the left-hand side, for example, we have all of our transactions, the revenue for each transaction and the country the transaction was made in; if we aggregate that data, perhaps by country, then we get some sort of aggregation metric for each country, so here we're showing the total revenue per country: the UK has 600, which is 100 plus 500, the US has 200 and Canada has 800. This is typically implemented using the GROUP BY statement and the group by functionality, and it can be done in the data flow, T-SQL or Spark, and when we build this kind of aggregate transformation we also need an aggregation function — normally one of count, sum, max, min or average — which is how you combine all of the values within each group. In our example in the top right-hand corner we used a sum, so we just summed up the revenue numbers for each country, but you could take the average revenue or the max revenue; it depends on your use case here.
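In a notebook, both flavours of aggregation could be sketched roughly like this — the DataFrame names (uk_df, us_df, canada_df) and columns are assumptions:

```python
from pyspark.sql.functions import sum as sum_

# "Aggregation" as appending slices of the same schema: union() in PySpark behaves like UNION ALL,
# i.e. it keeps duplicates, so add dropDuplicates() if you want UNION-style behaviour.
all_rows = uk_df.unionByName(us_df).unionByName(canada_df)
distinct_rows = all_rows.dropDuplicates()

# "Aggregation" as reducing granularity: group by country and apply an aggregation function.
totals = all_rows.groupBy("Country").agg(sum_("Revenue").alias("Total_Revenue"))
display(totals)
```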
Okay, let's round up everything that we've gone through in this lesson and test some of your knowledge. Question one: when using df.join in a PySpark notebook, the default join type is a) a full outer join, b) an inner join, c) a left join, d) a right join or e) an anti join. Pause the video here, take a moment to think about the answer, and I'll reveal it shortly. The default join type in PySpark, or in any Spark notebook, is the inner join; not much to say about that one, it's just something you need to know, and all the others are incorrect. Question two: you're looking to migrate a data transformation workload that is currently done using a data flow Gen 2 and convert it into a T-SQL script. The data flow appends two data sets together and removes any duplicate rows; which T-SQL command can you use to implement this transformation — a LEFT JOIN, UNION, CONCAT, APPEND or UNION ALL? The answer here is UNION. We mentioned when we were talking about unions that UNION removes duplicates, and the important sentence to pick out in the question was that the source data flow, the thing we're converting into a T-SQL script, currently appends two data sets together and removes any duplicate rows — the T-SQL equivalent of that is UNION. It's not UNION ALL, because that doesn't remove the duplicate rows; it's not the LEFT JOIN, because that won't do what we want; APPEND is not a T-SQL command; and CONCAT is how you could achieve this in pandas but not in T-SQL, so the answer is UNION. Question three: you have a Spark DataFrame called df, and your goal is to remove rows that contain a null value in the transaction date column. Which of the following will help you achieve this: a) df.dropDuplicates(), b) df.dropna() with how equal to all, c) df.filter() on transaction date is null, d) df.dropna() with how equal to any, or e) df.dropna() with subset equal to transaction date? The correct answer is e: we want to use dropna and pass in subset equal to transaction date. The question asks us to remove rows that contain null values in a specific column, and that specific column is the important part: we use dropna because it removes rows, and passing subset equals transaction date tells it to only look at the transaction date column. Maybe that's a really important column in our data set and we want to be absolutely sure there are no null values, because any nulls there would ruin our downstream analysis, so we want to remove those rows entirely. It's not b, because in b and d we're not specifying the subset — if you implemented d, it would drop rows with a null in the transaction date column, but it would also remove rows with null values in any other column, which might be too much based on your requirements. The filter option just filters the data set for null values, so it returns all of the rows which are null — we want to do the opposite of that — and dropDuplicates isn't what we're trying to achieve here, so that's not the answer. Question four: a classical star schema data model consists of which of the following — is it a) one central fact table and multiple dimension tables, b) one dimension table and one or more fact tables, c) one fact table and multiple dimension tables, some with dimension-to-dimension relationships, or d) one big fact table, fully denormalized, without any dimension tables? The answer is one central fact table with multiple dimension tables. B is obviously wrong because it's got one dimension table and many fact tables, so that's not a star schema; c is a snowflake — one fact table with multiple dimension tables, some with dimension-to-dimension relationships — so that's not what we're looking for in a star schema; and one big fact table, fully denormalized, without any dimension tables, again that's not really a classical star schema. So the answer here is a. Question five: you inherit a
data project and you're inspecting the tables in the data warehouse. One of the tables is a dimension table, dim contacts, with the following columns: contact ID, contact name, contact address, effective date and effective until. Make an assumption about the type of data modeling being implemented in this dimension table: is it a) a type zero SCD (slowly changing dimension), b) a type one SCD, c) a type two SCD, or d) a type three SCD? Here the answer is the type two SCD. We can see from inspecting the columns that there are two date fields, or at least we can assume they're dates: effective date is probably when that contact record was entered into the system, and effective until is basically the same as the valid to we saw earlier. We can't actually see the data here, but we can assume that's what those two columns are doing. A type zero SCD is a fixed dimension, something that never changes; with a type one you obviously wouldn't be tracking effective date and effective until, because the data just gets overwritten; and a type three is another slowly changing dimension type that isn't actually asked about in the exam, but it basically stores previous values for each of your columns, or at least the columns you're interested in tracking — so say you have contact name there, you might also have previous contact name, and when the data gets updated you update both the previous contact name and the new contact name. That's a type three SCD; I don't think it's in the exam, but it's something to bear in mind. So the answer here is c, a type two slowly changing dimension. Congratulations, you've now completed the third part of section two, preparing and serving data. In the next lesson we're going to be looking at performance monitoring and optimization of all of our data processing workloads in Fabric, so make sure you click here to join us in the next lesson — I'll see you there. Hey everyone, welcome back to the channel; today we're continuing our DP-600 exam preparation course and we're up to video eight, so we're making very good progress through the course plan. Today we're going to be looking at optimizing performance, and specifically we're going to be covering these bullet points in the DP-600 study guide. We're going to be looking mainly at identifying and resolving performance issues, so when you're loading data, querying data or transforming data, and specifically we're going to be looking within the data flow, the notebook (so the Spark engine) and also SQL queries. Then we're going to look at Delta tables in a bit more detail — how we can identify and resolve issues within our Delta tables — because, as you know, Fabric is built on top of the Delta file format, so that's a really important topic to understand, and as part of that we're going to be looking at file partitioning as well: what it is, what it looks like and why you might want to implement file partitioning in your Lakehouse. As ever, at the end of the video we'll be testing some of your knowledge on the topics we cover in this lesson, and I've got some quite detailed notes that you can use to enhance your revision, available in our school community — I'll leave a link in the description box below. For most of this video I'm going to be diving into Fabric and going through performance optimization in a number of different places, but I just wanted to start by framing what we actually mean by performance optimization in Fabric. Now as you know,
Fabric is a very diverse tool, so when we talk about performance we really need to get more specific about what we mean: it could be data flow performance, it could be a SQL script in your data warehouse, or it could be the Delta files themselves and optimizing how they get written and read in our OneLake, in our Lakehouses. And I would make the distinction here between identifying performance issues and then resolving them — it's a two-step process: first we need to know how to identify performance issues in each of these tools, and then we need to think about how we can resolve the issues we find. You'll notice that for each of the Fabric items, how we identify performance issues is a little bit different: for the data flow we're going to be looking at the refresh history, the monitoring hub and the capacity metrics app — and you'll notice that some of these repeat, because the monitoring hub and the capacity metrics app are generic places where you can do lots of performance monitoring across Fabric; in the data warehouse we have query insights and DMVs, dynamic management views; with the Spark engine we have the Spark history server, plus quite detailed monitoring in the monitoring hub as well; and when we're identifying performance issues in Delta files there are a number of places to do that, one of which we'll look at is DESCRIBE. Then, when it comes to resolving these issues, that's where it gets a bit harder to define, because it normally involves some element of refactoring — using different operations in your data flow, for example, or refactoring your SQL code or your Spark jobs. In the data flow we have some specific performance optimization features worth going through, so we'll be talking a little bit about staging and fast copy, and when it comes to Delta file optimization we're going to go into a bit more detail about V-Order optimization, file partitioning, and also VACUUM and OPTIMIZE, which are two Delta table maintenance commands we can use to improve performance with Delta files. So that's a bit of an overview of what we're going to be discussing in this lesson; now let's dive into Fabric, and we're going to begin by looking at some of the generic tools like the monitoring hub and the capacity metrics app before diving into the data flow, the data warehouse and the Spark notebook in more detail. So let's begin. Okay, so just before we jump into Fabric for this tutorial, I'm going to start in the school community and talk through some of the notes we have for optimizing performance. Here is video 8, optimizing performance; we've got this framing-performance-optimization chart that we spoke about previously, but to get us started I want to speak generally about performance monitoring, because there are a few tools that are quite general — they apply across different workloads — and they're useful to know both for the exam and in Fabric generally. We're going to be talking about the monitoring hub and the capacity metrics app, two tools that can be used to monitor performance for a wide variety of operations in Fabric. The monitoring hub is the first one, and if I just flick over to Power BI here, to access the monitoring hub — you've probably seen it — it's in the left-hand toolbar, this big button for the
monitoring hub, and within here we can have a look at the runs of a lot of different item types, so you can see semantic model refreshes, notebooks — so this is going to be a Spark session — and for each item type different logging gets exposed in this monitoring hub. The notebook one we'll look at in a bit more detail because it exposes Spark log information, and you can also see things like data flow Gen 2 runs, where we can check whether those runs have succeeded or not, and table loading information in a Lakehouse, for example, and we can obviously click through into specific items that we care about and get more information. So this is showing a table load into the bronze Lakehouse, and it's showing the different jobs, because a Lakehouse load runs as Spark jobs. The monitoring hub will also tell us whether runs have been successful or failed, so for this particular data flow run we can see that it actually failed — on the 1st of May at 12:46 I tried to refresh this data flow and it failed. You can click on view detail and get a few more details about what happened, but you can't really diagnose what went wrong with the data flow from here; to get the details you have to go into the data flow itself, which is this one here, then click on these three dots and open the refresh history. So for data flows, if you want to actually debug what went wrong — the detail here is admittedly pretty limited — you can click on the individual runs and get more detailed information, and here we can see it's this activity that failed; we can click on it and get more information about why specifically that column or data set can't be refreshed. The next general tool we can use to monitor performance and resource consumption within our Fabric capacities is the capacity metrics app. I'll leave a link — you can get the install instructions here if you've never installed it before; it's a Power BI app that you install within your Fabric environment, you give it your capacity ID (the capacity settings is where you'll find that ID), and then it brings you through to this capacity metrics app. The capacity metrics app is split into two sections, compute and storage, and this lines up with how Fabric is billed: you're billed partly on storage, the amount of storage you have in Fabric, plus the resources you consume during compute. If you've already installed the capacity metrics app you can find it in the Power BI experience — go to apps and you should see it there, the Microsoft Fabric capacity metrics app. As I mentioned, there are two tabs to this report, compute and storage. On the compute page, starting at the top left, we can see the capacity unit spend for particular item types in Fabric, and we can break it down by duration, by different operations and by user as well. We also get a time series of capacity usage over time as a percentage of the total capacity available based on your SKU — currently I'm on a trial capacity, which is equivalent to an F64, and as you can see I'm barely using any of it. We've also got these other interesting graphs around throttling, so if you're using more than 100% of your capacity on that particular capacity this is going to show you
where you're throttling, and it's also going to show you rejections — so if you've got workloads that are being rejected on your capacity because you're over 100% and it's currently rejecting workloads, that gets exposed here. We've also got a graph on overages: if your capacity is throttled, an overage is basically you repaying that capacity usage from your future spend. For each of these — obviously I haven't actually been throttled and there's no overage on my capacity — there is an explore button that lets you drill through to the specific events you want to look at in more detail, if that is something happening on your capacity. Down below you've got a table of all of the different items in your Fabric capacity and the capacity unit seconds being used by each specific resource, so this is a Synapse notebook and we can see that it's the most resource intensive — it's used up the most of our capacity unit seconds — and it's also got a tooltip where you can look at specific activities and runs of that notebook to dig into more detail. The storage tab focuses on the amount of gigabytes of storage in this Fabric capacity: we can see how it's changing over time, we can look at the storage by date, and we can also look at the top 10 workspaces by billable storage once that's loaded — there we go. Okay, so next I just wanted to talk about data flows and summarize what we looked at previously. If you want to monitor the performance of a data flow at a high level, you can do that within the monitoring hub, but if you want lower-level data and to understand what's happening within a particular data flow, you're going to want the refresh history, as I showed you previously: there you can inspect the error messages and get a breakdown of the different sub-activities in that load, so that's going to be really important for diagnosing what's going wrong in a particular data flow if it isn't refreshing correctly — that's where you go to have a look. Now there are a couple of features you need to be aware of when it comes to optimizing the performance of data flows specifically, and the main one is staging. Staging is probably best described using this diagram here, which comes from this link, the spotlight blog on data flows, and it has some top tips for improving the performance of your data flows; I think this diagram gives you a pretty good idea of what's going on with staging. Let's start with when staging is disabled, this bottom diagram here: if staging is disabled, all of your transformation in a data flow is done by the data flow engine, otherwise known as the mashup engine, and if you've got a really big data set or you're doing lots of transformations, that might not be the most efficient way of doing it. So a feature we have available to try and improve the speed of all these transformations within a data flow is staging: at the top here, when staging is enabled, the data flow reads in the data from the data source and then immediately writes it into a Lakehouse staging table; it then uses that Lakehouse to perform the transformation — leveraging the Spark engine rather than
the mashup engine to do your transformation — then it reads the data back into the mashup engine and writes it to the destination, wherever that might be. A few things to bear in mind here: if you've got a lot of data transformations or very large amounts of data, staging is probably going to be a lot more efficient; if you've got small data sets, it's probably going to be less efficient, because you have to write the data into a Lakehouse, transform it in the Lakehouse, and then the mashup engine picks it up again and writes it to your output destination, so there's a lot of reading and writing going on. On small data sets you probably want to leave staging disabled; it's only really when the data set becomes large, or you're doing lots of transformations on it, that you want to enable staging — that's what I've summarized with this sentence here: there's a bit of an overhead when you're using staging, so it doesn't pay off in all cases, only when your data set is large, or you're doing lots of transformations, or both. Another feature that they recently announced, and which therefore might not actually be in the exam yet but is good to know exists, is fast copy, and the way I understand fast copy to work is that under the hood it uses the same technology as the data pipeline copy data activity rather than the data flow technology. As I mentioned, it's still a preview feature and relatively recent, so it might not be in the exam yet, but if you're using data flows in the real world and struggling with performance, it's worthwhile enabling fast copy just to give it a go and see how it impacts the performance of your data flows. Okay, next we're going to move on from data flows and focus on SQL: we have the SQL engine within the data warehouse and also the SQL endpoint of the Lakehouse, and we're going to look at how we can diagnose and then optimize the performance of SQL scripts. As I mentioned before, the capacity metrics app can give you a good high-level overview of the resource consumption of specific operations you're doing within your data warehouse, but we have a lot more functionality within the data warehouse itself to explore and identify things like long-running queries and frequently used queries. Next we're going to talk about dynamic management views, or DMVs, and if you take a look in the data warehouse under the sys schema, there are obviously a lot of different views in there that we can use for database management in general, but there are three main ones for understanding the live SQL query life cycle — the things currently going on in your data warehouse that you need to be aware of and get some insight into — and these are exec_connections, exec_sessions and exec_requests, and they're related in this way here, so we've got a bit of a data model. Whenever you start a query execution in the data warehouse it starts up a session; the session normally has a one-to-one relationship with a connection — it creates a connection between your data warehouse and the underlying SQL engine, so that's what a connection shows you — and then you normally have many requests for each of those connections. Using these three DMVs we can begin to build a picture
Using these three DMVs we can begin to build a picture of who is querying your data warehouse and how. They help you answer questions like: who is the user running the current session, when was the session started, what's the ID of the connection running a particular request, how many queries are currently active, and which queries are long-running. So you start to understand how your warehouse is being queried, who's querying it, what they're doing and how those queries perform. Now, the DMVs are quite a low-level view: there's a lot of information in there, and you can join these views in different ways to get more and more detail. Alongside the DMVs, Microsoft also exposes Query Insights, which lives in a different schema. If we look at this example data warehouse, we've got the sys schema, which contains the DMVs and the other system-generated views, and we've also got the queryinsights schema. The views in there give us more user-friendly abstractions over the DMVs: they surface things like frequently run queries and long-running queries without you having to perform those joins over the underlying DMVs yourself. We're going to focus on three of these Query Insights views: exec_requests_history returns information about each completed SQL request on that particular data warehouse; frequently_run_queries, as the name suggests, gives you information about the most frequently run queries; and long_running_queries returns queries ranked by execution time, which is really useful for identifying queries that run for a long time and might be causing performance issues in your warehouse. One thing to note if you're coming from a SQL Server background: a really important performance tool there is the query plan, and currently I don't think it's possible to expose the query plan for a particular SQL query in the warehouse, although I do think they're planning to support that in the future.
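To make this concrete, here's roughly what those queries look like in the warehouse SQL editor. Treat it as a sketch rather than gospel: the columns I'm selecting are the ones I'd expect to find on these views, so check the column list in the documentation for your runtime before relying on them.

```sql
-- Who is connected right now, and what are they running?
SELECT s.session_id,
       s.login_name,
       s.login_time,
       r.status,
       r.start_time,
       r.total_elapsed_time
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_requests AS r
  ON r.session_id = s.session_id
WHERE r.status = 'running';

-- The Query Insights views expose similar information, pre-aggregated for you
SELECT TOP 10 * FROM queryinsights.long_running_queries;
SELECT TOP 10 * FROM queryinsights.frequently_run_queries;
```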
Next I want to talk about identifying performance issues with the Spark engine. You'll notice we're back in the Monitoring Hub, and I want to look at the item details for one specific notebook. I've been running this notebook, called Delta Optimization, and when we click through on that item we get a really detailed breakdown of the jobs that ran in it: we can see that they all succeeded, the duration of each job, and the data read and written by each one. So this is a really good place to go if you want to understand what's actually happening under the hood when you click run in a Spark notebook, and whether each individual Spark job succeeded or failed. Within the Monitoring Hub we've also got a link through to the Spark History Server, and as you can see from the UI we've now switched from a Fabric tool to a generic Spark tool. Here you get much more detailed information about the specific Spark jobs you're running: you can look at the job graph — a bit like a query plan — and at the different stages in the execution of a particular Spark job. So if you're looking for really fine-grained control and analysis of what's going on in your Spark engine, the Spark History Server is where you come. Interpreting everything in there would be a whole series in itself, and I don't think you need to know the nitty-gritty of the Spark History Server for the DP-600 exam; understanding what's possible there, what it logs and what you can monitor is good enough. So finally in this tutorial I want to focus on Delta table optimization. As you probably know, Fabric is built on top of Delta tables, so this is a really important topic to understand: the Delta file format is great, but it can lead to poor performance and bloated storage if those Delta files aren't managed correctly. This topic can run very deep — like a lot of the topics today we're just going to cover the basics you need for the exam — and we'll do that by working through a Spark notebook. I'm going to start by exploring the problem in a little detail, and I think a really good way of understanding the kind of problem that can arise with Delta files is this visual, which comes from a blog post by Sid Daba on efficient data partitioning with Microsoft Fabric — best practices and implementation guide. I've left a link to it in the notebook, the notebook is available in the community, and the link is here as well. If you have one big 10 GB Parquet file, the Spark engine is going to struggle to process it, because Spark is a distributed processing engine: it works best when it can split your Parquet files into smaller chunks and process those chunks in parallel. To make that possible we need to partition our data, and partitioning is basically the process of converting one (or several) really big files into more manageable chunks so the data can be transformed in parallel by the Spark engine. In this notebook we'll start with file partitioning and then look at some other methods for optimizing Delta tables. Done well, this improves read and retrieval performance — there's no need to scan through millions and millions of rows in one really big Parquet file if you've got a good partitioning scheme in place — and it also improves transformation performance, because, as we mentioned, partitions can be transformed in parallel. So, file partitioning. I'm going to walk through a bit of a demo using a demo file that's available on this website — it's just a Parquet file of flight data, Flights 1M, and I've put it in the Files area of the lakehouse. Before we get started, bear in mind the best practices for Delta Lake partitioning; I've left a link to the Delta Lake website where they give some general guidance on managing Delta files, and one of them is around choosing the right partition column.
Most commonly it's done by date: if you've got time-series data with a date column, it's common to partition on that date. There are two rules of thumb to bear in mind when deciding which columns to partition on. First, in general, don't choose a partition column with very high cardinality. If you partition on something like a user ID, where basically every row is unique, a million rows gives you a million tiny partitions, which is a really bad partitioning strategy. Second, as a general rule of thumb, aim for roughly one gigabyte of data per partition — that's about the right balance to target. So let's look at how to actually implement file partitioning in Fabric in a Spark notebook. I'll begin by reading that flights Parquet file into a DataFrame and displaying it. Notice the date column here, which will be useful for our partitioning strategy — we're going to partition on date — and the other columns, which we don't really care about for this tutorial. Before we write this data into the lakehouse with partitions, we'll do a bit of preparation and add some columns to the DataFrame. We'll use df.withColumns to add more than one column at a time, passing in a dictionary where each key is the new column name and each value is the expression that computes it; here I'm using some PySpark SQL functions to extract the year, month and day of month from that date field we looked at previously. If we run that and display the results, we now have a transformed DataFrame with year, month and day columns added, which we can use in our partitioning strategy. In the next code cell you can see we've written three different writes. The first writes without partitions — just the normal saving of the table into the lakehouse from our Parquet file, into Tables, as a table called flights_not_partitioned. The second calls partitionBy with year and month and saves into another table called flights_partitioned. The third, as a more fine-grained example, calls partitionBy with year, month and day, and saves that into a table called flights_partitioned_daily.
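If you want to follow along, the cells I've just described look roughly like this. It's a sketch rather than a copy of my exact notebook: the file path and the date column name (I've assumed FL_DATE) depend on the dataset you downloaded, so adjust them to match your lakehouse.

```python
from pyspark.sql import functions as F

# Read the demo Parquet file from the lakehouse Files area
df = spark.read.parquet("Files/flights_1m.parquet")

# Add year / month / day columns derived from the date field
df = df.withColumns({
    "year":  F.year("FL_DATE"),
    "month": F.month("FL_DATE"),
    "day":   F.dayofmonth("FL_DATE"),
})

# 1) No partitioning
df.write.mode("overwrite").format("delta").saveAsTable("flights_not_partitioned")

# 2) Partitioned by year and month
(df.write.mode("overwrite").format("delta")
   .partitionBy("year", "month")
   .saveAsTable("flights_partitioned"))

# 3) Finer-grained: partitioned by year, month and day
(df.write.mode("overwrite").format("delta")
   .partitionBy("year", "month", "day")
   .saveAsTable("flights_partitioned_daily"))
```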
Once those cells have executed and all the Spark jobs have finished, you can see in the lakehouse Tables area that we've got flights_not_partitioned, flights_partitioned and flights_partitioned_daily: three tables with three different partitioning strategies. Let's inspect them and see what's going on. I've got three cells here using Spark SQL, each calling DESCRIBE DETAIL on one of the tables; that describes how the table was written into OneLake. For flights_partitioned, the partition columns are year and month, which makes sense, and the number of Parquet files it created is two. Compare that to flights_not_partitioned, where there are zero partition columns and one file — everything was written into a single file. And for the daily one, the partition columns are year, month and day and there are 59 files, so the data has been broken up into 59 smaller chunks. That might actually be too fine-grained, because as I mentioned at the top this is a balancing act: if your files are too big the Spark engine has performance problems, but there's also the small-file problem — if your file sizes are well under that gigabyte guideline the engine has to work harder, going through the operation 59 times rather than two. So getting a good partitioning strategy for your Delta tables is a balancing act, but for the purposes of the exam it's worth knowing the syntax for creating partitions like this and for analyzing them with the DESCRIBE DETAIL Spark SQL command. Okay, next I want to talk about V-Order optimization. V-Order is a Microsoft proprietary algorithm that changes the structure of your Parquet files — a bit of special sauce, doing some special sorting, compaction and compression — ultimately to improve the read performance of those Parquet files across all of the engines in Fabric. While the algorithm itself is proprietary and only used by Microsoft, the output is still a fully open, standards-compliant Parquet file, so you can read V-Ordered Parquet files anywhere you can read normal Parquet files — that's not a problem at all. By default, V-Order writing is enabled in the Fabric Spark runtime, and you can check that by reading the Spark configuration with spark.conf.get, which here returns true. It can be disabled manually by setting the spark.sql.parquet.vorder.enabled property to false, and re-enabled by setting it back to true. For the exam you might be asked how to check whether a particular notebook or Spark environment has V-Order enabled, or how to enable or disable it in a Spark notebook — that's a common style of question.
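Here's roughly what those two bits of syntax look like in a notebook cell. Treat the exact config key as something to verify: spark.sql.parquet.vorder.enabled is the property name used here, but newer Fabric runtimes may expose a slightly different one, so check with spark.conf in your own environment.

```python
# Inspect how a table was written: file count, partition columns, size in bytes, ...
display(spark.sql("DESCRIBE DETAIL flights_partitioned"))

# Check whether V-Order writing is enabled for this Spark session
print(spark.conf.get("spark.sql.parquet.vorder.enabled"))

# Disable it for this session...
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")

# ...and turn it back on
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
```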
Okay, finally I just want to mention a few more Delta table maintenance and optimization techniques. A couple come from the Delta format itself. OPTIMIZE is a Delta Lake command that performs bin compaction: if you've got lots of small files it coalesces them — basically joins small files into larger files — which can improve the speed of your read queries. VACUUM is another command we can run, and it removes files that are no longer referenced by a Delta table. Then on the Spark side there are two operations you should at least be familiar with. coalesce is a Spark method for reducing the number of partitions in your data: if you've got 100 partitions for a particular table you can coalesce down to 10, and it's a pretty efficient way of grouping partitions into a smaller number because it doesn't require a shuffle of the data — it just groups partitions together without reorganizing the data within them. repartition is similar to coalesce but less efficient, because it involves breaking up your existing partitions and creating new ones — you can end up with either more or fewer partitions — so it's restructuring how your partitions are built, and that does involve shuffling. Now you might be wondering what the difference is between these Spark functions and Delta's OPTIMIZE. OPTIMIZE has a couple of things working in its favor. First, it's idempotent: if you run it repeatedly it won't re-optimize files that have already been optimized, whereas repartition will always repartition your files — you can keep running it again and again and it never gets cheaper. Second, you can run OPTIMIZE on specific partitions of your dataset, whereas repartition is an all-or-nothing approach — you have to repartition the whole table in one go. So if you're looking for more fine-grained optimization, you'll want to use Delta's OPTIMIZE.
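Continuing with the DataFrame and tables from earlier, these commands look roughly like this; the table name and the partition predicate are just illustrative values from my demo.

```python
# Delta maintenance: bin-compact small files into larger ones...
spark.sql("OPTIMIZE flights_partitioned")

# ...optionally only for a specific partition (the year value is just an example)
spark.sql("OPTIMIZE flights_partitioned WHERE year = 2006")

# Remove files no longer referenced by the Delta log (default retention rules apply)
spark.sql("VACUUM flights_partitioned")

# Spark-side alternatives: reduce the number of partitions without a full shuffle...
df_fewer = df.coalesce(10)

# ...or reshuffle the data into a new number of partitions
df_reshuffled = df.repartition(50)
```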
Okay, let's round off by testing your knowledge of the topics we've covered in this lesson. Question one: a client you're working with wants to reduce the SKU of their Fabric capacity from an F16 to an F8 to save some money, so they want to find the most resource-intensive workloads and optimize them to use fewer capacity unit seconds. Where should they look to find this information? A, the Monitoring Hub; B, the Capacity Metrics app; C, Query Insights; D, the Spark History Server; or E, the OneLake hub. Pause the video here, have a think, and I'll reveal the answer shortly. The answer is B, the Capacity Metrics app. We're talking about resource-intensive workloads, and our capacity is where those resources come from: the Capacity Metrics app is what tells you which workloads consume the most capacity unit seconds. We can look across our Spark jobs, warehouse operations and dataflows and come to conclusions about which are good candidates for refactoring or optimization. The other options might help you understand the performance of specific workloads in Fabric, but the Capacity Metrics app is the only one that translates that into capacity units, and that's the important part of the question. Question two: you notice one of your Dataflow Gen2 runs failed to refresh last night. Where would you go to find out why that particular run failed? A, the Capacity Metrics app; B, the Monitoring Hub; C, the Power Query error hub; D, the dataflow refresh history; or E, the data pipeline run history. The answer is D: the dataflow refresh history is where you go to analyze error messages and debug a particular run of a dataflow. The Power Query error hub doesn't actually exist — I made that up — and we're not talking about a data pipeline here, so it's not the pipeline run history. The Monitoring Hub will tell you whether a run failed or succeeded, but it doesn't give you detailed error messages, and the Capacity Metrics app won't answer this either. Question three: when talking about Delta table optimization, which of the following operations removes old files that are no longer referenced by a Delta table log? A, V-Order optimization; B, Z-ordering; C, vacuum; D, optimize; or E, bin compaction. The correct answer is C: as we mentioned previously, the VACUUM command does exactly what the question describes — it removes old files no longer referenced by a Delta table log. Question four: which of the following statements about V-Order optimization is false? A, V-Order is enabled by default in the Fabric Spark runtime; B, V-Order can be enabled during table creation using table properties; C, a table can be both V-Ordered and Z-ordered; D, V-Order improves read performance for Parquet files; E, V-Order speeds up the write time of a Parquet file. The correct answer is E — remember the question asks which statement is false. V-Order does not speed up writes; it actually increases the write time of a Parquet file. The benefit comes on the read side: it takes a bit longer to write these files, but it massively improves read performance across any of the engines you might use in Fabric. All of the other options are true: it's enabled by default; if it isn't already enabled in your Spark environment it can be enabled for specific tables at creation time using table properties; a table (or a Parquet file within it) can be both V-Ordered and Z-ordered at the same time; and V-Order does improve read performance for Parquet files — that's the whole point. Question five: you want to analyze long-running queries in a Fabric data warehouse. What's the minimum workspace role you need to run the following query: SELECT * FROM queryinsights.long_running_queries?
Is it A, Admin; B, Member; C, Contributor; or D, Viewer? The answer is C, Contributor. To run a Query Insights query — those auto-generated views that give us information about, in this example, long-running queries — you need a workspace role of at least Contributor. The question asks for the minimum role: you can also run these queries as Admin or Member, but the minimum is Contributor, and with only a Viewer role you can't run them against the data warehouse. Congratulations — you've completed the biggest section of the exam, and we're well over halfway now. In the next lesson we'll start the third section of the exam, which is all about building semantic models. Well done, and I'll see you in the next video. Hello and welcome back to the channel. Today we're continuing our DP-600 exam preparation course, and we're up to video 9: designing and building semantic models. As you can see from the course plan we're making really good progress, with just a few more important sections to look at, and today we're starting the Power BI and semantic modeling part of the exam. Specifically, we're going to look at the different storage modes — import mode, DirectQuery and Direct Lake — at composite models and what we mean by that, including aggregations, and at large format semantic models. Then we'll dig into a practical example in Power BI Desktop, defining different DAX measures: we'll look at variables, iterators, table filtering, windowing, information functions, and some of the more advanced or more recent features as well, including calculation groups, dynamic format strings and field parameters. As ever, there are five sample questions at the end of the video to test your knowledge. Now, bear in mind that I wouldn't classify myself as a Power BI developer — I used Power BI quite a lot about six or seven years ago, but recently I've been more focused on data engineering and data science — so I'll explain these concepts as best I can, but I definitely recommend doing your own research; I'll leave links to a lot of really good resources in the community so you can go into more detail there. So first up, storage modes. I've mentioned these a fair amount on the channel, and lots of people talk about storage modes within Power BI and Fabric, so let's do a bit of revision on what we mean by the different storage modes (you might also hear them called connection modes), their advantages and disadvantages, and how to choose between them. This diagram from the documentation does a pretty good job of framing the three modes. If you've worked in Power BI for a while you'll definitely be familiar with import mode — this one in the middle — and probably with DirectQuery mode at the top as well. For those not coming from a Power BI background: import mode basically takes a copy of your data from the source and loads it into Power BI, so a copy of all your data is stored in the Power BI data model itself. That makes report building really fast, because your data is right there, but because you're copying the data into Power BI there are some limitations around the size of your model.
That's one of the main limitations of import mode. The other is freshness: because you're doing this lift-and-shift import on a schedule, your Power BI data model isn't always going to be up to date — if you refresh every hour, say, there's a chance your data will become a little stale. On the other hand we have DirectQuery, the one at the top. With DirectQuery, whenever the user views a particular visual or report page, Power BI sends a query back to your source, runs it, and gets fresh data back, so it's near real time — that's one of the benefits. Some of the downsides are that you can't perform much transformation on that data in Power BI; it has to be transformed at the source, so you need views and tables that are already modeled. And in practice DirectQuery can be really slow, because that query has to run every time you go back to the original data source: the user sits waiting for their visuals to update every time they click a filter or change the page, which can be a pretty bad user experience. The final mode is the new one that came with Fabric: Direct Lake. Direct Lake creates a connection between your Power BI report and the underlying Parquet files, reading the Parquet files in your OneLake environment directly. So let's look at why you might choose import mode, Direct Lake mode or DirectQuery mode. Import mode is a really good fit when your data is small enough to fit in a Power BI data model (including large format semantic models, which we'll talk about in a short while); when you want really good read performance in your dashboards — interactivity and a good user experience for your dashboard users; when you don't have near-real-time requirements; when you want to use calculated columns or calculated tables; and when you want to combine data from multiple different data sources. Which leads me nicely into when you'd choose Direct Lake mode. The first limitation is that your data has to be stored in one Fabric data store, such as a lakehouse or a data warehouse; you can't use Direct Lake to access data across lots of different stores. The prime use case is when your dataset is really, really big — we're talking tens or hundreds of gigabytes — which would be too large for most import-mode models but works really well in Direct Lake. Direct Lake will require a slightly different skill set in your team, because a lot of the data modeling has to happen further upstream, in your lakehouse or data warehouse: you need to materialize Parquet files that the Direct Lake connection can read, which typically means your data transformation and modeling is done in the lakehouse using Spark or T-SQL, and that might be a slightly different skill set to what you'd have in a typical import-mode
Power BI team, for example. Okay, so finally let's talk about when you might choose DirectQuery mode for your semantic models. Again, if you need near-real-time updates, that's the big one; again you'll need to do your data transformations further upstream, in your data source, wherever that might be; and DirectQuery is also an important part of what we call composite models, which we'll look at in more detail now. A composite model combines more than one of the connection modes we've just discussed — commonly a DirectQuery fact table with import-mode dimension tables. Think about the typical characteristics of a fact table versus a dimension table. A fact table usually has a lot of rows — potentially millions or even billions — and is likely to be updated very often, maybe every minute or every second; in some OLTP-style transaction-processing scenarios there could be hundreds and hundreds of records arriving every second. Because of those characteristics, DirectQuery can be a good match for that kind of dataset: on data that changes very often and very fast, it gives you near-real-time access to fresh data. Dimensions tend to change a lot more slowly, so it makes sense to use import mode for them: your product table might only be updated once a day, for example, so a DirectQuery connection there wouldn't make much sense — you'd introduce user-experience issues on the front end without getting much benefit, because the underlying data isn't really changing very often. Another benefit of composite models is that they provide a way to model many-to-many relationships without the need for bridge tables. Next up, aggregations. In this context we're talking about a specific Power BI feature for managing aggregations. It takes a really large dataset, creates an aggregation — either automatically or user-defined, where you specify the aggregation you want to build — and then caches that aggregation. For really large data models this can improve performance, because you're hitting the cached aggregation rather than loading a really long fact table, and because of that aggregations are often used in conjunction with composite models. You can either create the aggregation itself in your data source and just pull it into your Power BI data model, or bring the detail data into Power BI and use the Power Query engine to create the aggregation that then gets loaded into the model. As I mentioned, there are really two types of aggregation. User-defined aggregations use the Manage Aggregations dialog in Power BI Desktop to define the aggregation for a specific aggregation column: you choose how you want to summarize it — min, max, that kind of thing — plus the detail table and detail column properties. And if you have access to a Power BI Premium subscription you also get automatic aggregations, which use machine learning to try to optimize DirectQuery semantic models by looking for the best aggregations to improve performance.
Another thing we need to be aware of for this part of the exam is the large format semantic model. Large format semantic models provide a highly compressed in-memory cache for optimized query performance, enabling fast user interactivity. So if you've got a semantic model that's bigger than, say, 10, 20 or 30 GB, you can convert it from the small (default) format to the large format, and it applies that compression and in-memory caching to improve the performance of the model. And it's not only models over roughly 10 GB where it's worth considering: even below that threshold there are benefits in converting — firstly you get the performance improvements anyway, and secondly large format is commonly used when connecting to third-party tools via the XMLA endpoint. Another feature of large format semantic models is on-demand loading. If you've watched some of my previous videos you'll know Direct Lake mode also uses on-demand loading: when a user is viewing a particular page in a report, the full dataset doesn't need to be loaded into memory — on-demand loading looks at the Parquet files in OneLake and loads only the data required to render what's being requested at that specific time. That can really improve performance, and it's a feature shared between large format semantic models and Direct Lake mode. Okay, for the next part of this video we're going to switch over to Power BI Desktop and use it to explain some of the key things we need to know for this module of the exam. Specifically, we'll look at variables, iterators, table filtering, window functions, information functions, calculation groups, dynamic format strings and field parameters, and for each of these features and DAX expressions I've got a small example to talk through the implementation. Let's start with variables. Variables are common in pretty much every programming language, and in DAX they help us avoid code repetition and can also improve the performance of your DAX code. Here we've created two variables, one for total revenue and one for total days. Notice the syntax: you use VAR to declare the variable, which stores the result locally so you can use it later in the expression, and you'll typically pair that with a RETURN statement. So in this example we declare TotalRevenue and TotalDays as variables, then use them in the RETURN statement to give us the overall average revenue per day.
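Written out, that measure looks roughly like this — a minimal sketch where the table and column names (Sales[Revenue], Sales[Date]) are placeholders for whatever exists in your own model.

```dax
Avg Revenue Per Day =
VAR TotalRevenue = SUM ( Sales[Revenue] )        -- assumed fact table and column
VAR TotalDays    = DISTINCTCOUNT ( Sales[Date] ) -- number of distinct days with sales
RETURN
    DIVIDE ( TotalRevenue, TotalDays )
```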
Next we're going to look at iterator functions. Iterator functions in DAX enumerate through all of the rows in a table, perform some calculation for each row depending on the specific function you choose, and then aggregate the result. Examples include SUMX, COUNTX and AVERAGEX — most of them end with an X. Here's an example using SUMX in a cumulative measure, Cumulative Revenue, together with FILTER. What we're saying is: for each row in our date table, go one by one and keep increasing the number of rows in scope. On the first pass only the first date is in scope, so we just get the revenue for that date; on the second pass two dates are in scope, so we get the cumulative revenue for the first and second dates; on the third, the first three dates; and so on. Run across the whole table, the result is this cumulative revenue chart, from the first date all the way through to about 19 billion in revenue on the last date in the dataset, the 31st of May 2020. Next we have table filtering. Table filtering uses the FILTER function, which returns a table representing a subset of another table or expression. In this measure we're calculating total revenue by category: we use CALCULATE, pass in the sum of revenue, and then filter it by specific product names. That produces this chart, with revenue figures for each category — Audi, Tata, Hyundai and so on are the product names — and the filter expression is keeping the rows where the product name equals the selected value of product name.
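Roughly, those two measures look like this. Again it's a sketch: 'Date', Sales and Product are assumed table names, and your own model's relationships may let you simplify the expressions.

```dax
Cumulative Revenue =
SUMX (
    FILTER (
        ALL ( 'Date'[Date] ),                 -- iterate over every date...
        'Date'[Date] <= MAX ( 'Date'[Date] )  -- ...up to the current date in context
    ),
    CALCULATE ( SUM ( Sales[Revenue] ) )      -- revenue for each of those dates
)

Revenue (Selected Product) =
CALCULATE (
    SUM ( Sales[Revenue] ),
    FILTER (
        Sales,
        Sales[ProductName] = SELECTEDVALUE ( 'Product'[ProductName] )
    )
)
```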
Next up, window functions. There are actually three DAX functions classed as window functions — WINDOW, INDEX and OFFSET — and we're going to focus on WINDOW in this example. If you've used window functions in SQL, the results are quite similar, although the way you implement them is quite a bit different. Typical use cases are things like rolling averages — say a three-month moving average of revenue (there are lots of ways to do moving averages in DAX; the WINDOW function is one of them) — or the window can be a categorical variable, for example the average revenue for each department in a company. At a high level, the documentation says WINDOW returns multiple rows positioned within a given interval: you give it a "from" and a "to" and it returns the rows that fall between them. There are a few other parameters to be aware of, in particular from_type and to_type, where you can specify absolute or relative positions: absolute takes the whole table from top to bottom and picks the value at that absolute position, whereas relative looks at the current row and moves from there — minus ten, say, might be a relative "from" parameter. On its own the WINDOW function isn't particularly useful; if we scroll down to the example in the docs, it's normally used in combination with some other measure, typically inside an iterator function. Here WINDOW carves out the window of rows and AVERAGEX averages across that window to return a three-day average price — that's a good example of how you use WINDOW in practice. Information functions are another class of functions in the DAX language, and you've probably used some of them if you've used DAX before: CONTAINS, CONTAINSSTRING, HASONEVALUE, ISBLANK, ISERROR, SELECTEDMEASURE, USERPRINCIPALNAME and so on. They check something about a value in your table and return a result depending on the function — usually a Boolean, although some, like USERPRINCIPALNAME, don't look at a table at all but at the report context, in that case the logged-in user of your Power BI file. In this example we've got a very simple table of transactions, with transaction IDs and revenue, and we're using the information function ISBLANK inside an IF statement: if ISBLANK returns true we show "No revenue recorded", and if it returns false — in other words there is a value in that column — we show "Revenue recorded", which gives us this table. Next up, calculation groups, which provide a simple way to reduce the number of measures in a model, or at least the maintenance of those measures. Say you want to create daily, monthly and yearly average measures, and you want them for revenue, for cost and for salaries: that's a lot of repetition in your DAX code. With calculation groups we basically parameterize those measures, so the code you actually have to maintain is a lot smaller. You can now create them directly in Power BI Desktop — as of a few months ago — as well as in Tabular Editor. I've created a calculation group here called Averages; if we go over to the Modeling tab you can see the calculation group and its calculation items, and you can create a new calculation group from here too. This group has three calculation items: the first is Total, which is just SELECTEDMEASURE — that's your baseline measure — and then the Daily Average and Monthly Average items reuse SELECTEDMEASURE, but average it over days and months respectively.
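To make those three ideas concrete, here are sketches of the kinds of expressions involved. They're modeled on the patterns in the documentation rather than copied from my file, so treat the table and column names ('Date'[Date], 'Date'[Year Month], Transactions[Revenue], Sales[UnitPrice]) as placeholders, and double-check the WINDOW argument order against the docs for your version.

```dax
3-Day Average Price =
AVERAGEX (
    WINDOW (
        -2, REL,                                        -- from: two rows back...
        0, REL,                                         -- ...to: the current row
        SUMMARIZE ( ALLSELECTED ( 'Date' ), 'Date'[Date] ),
        ORDERBY ( 'Date'[Date] )
    ),
    CALCULATE ( AVERAGE ( Sales[UnitPrice] ) )
)

Revenue Status =
IF (
    ISBLANK ( SUM ( Transactions[Revenue] ) ),
    "No revenue recorded",
    "Revenue recorded"
)

-- Expressions behind the calculation items in the "Averages" calculation group:
-- Total:           SELECTEDMEASURE ()
-- Daily Average:   AVERAGEX ( VALUES ( 'Date'[Date] ), CALCULATE ( SELECTEDMEASURE () ) )
-- Monthly Average: AVERAGEX ( VALUES ( 'Date'[Year Month] ), CALCULATE ( SELECTEDMEASURE () ) )
```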
Next we're going to look at dynamic string formatting, and this is a pretty cool one: it lets you apply string formatting to numeric measures without changing the underlying data type, so the measure can remain numeric. In this example we've created a measure called Dynamic Format Measure, using Sum Revenue as the base, and in the format string we use a SWITCH statement to give a few different options: if the value is under a thousand we don't apply any special formatting; between a thousand and a million we scale it and append a "K"; in the millions we append an "M"; and in the billions we append "bn". The benefit is that in our tables — and in this card here — you get a much nicer presentation, "5,246 M" rather than lots and lots of digits. And the big difference compared with converting the value to a formatted string yourself (for example with the FORMAT function, or hard-coding a format string in the Modeling tab) is that the measure keeps its underlying numeric data type: we maintain the format as a whole number while only changing the presentation, which matters because we might still want to do something like visualize this measure in a chart.
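A minimal sketch of what that dynamic format string could look like, assuming a base measure called [Sum Revenue]. The trailing commas in the format codes scale the value by thousands; check the exact quoting and escaping in your version of Desktop, as this is written from memory of the pattern rather than copied from my file.

```dax
Dynamic Format Measure = [Sum Revenue]

-- Dynamic format string for [Dynamic Format Measure]
-- (set the measure's Format to "Dynamic" in the Format pane, then edit this expression)
SWITCH (
    TRUE (),
    [Sum Revenue] < 1e3, "#,##0",
    [Sum Revenue] < 1e6, "#,##0,.0 ""K""",
    [Sum Revenue] < 1e9, "#,##0,,.0 ""M""",
    "#,##0,,,.0 ""bn"""
)
```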
The final feature we're going to look at in Power BI is field parameters. Field parameters allow the report user — the end user coming into your report — to select different categorical variables, and also different measures, dynamically, depending on how they want to slice the data. In this example we've got a number of parameters and we're looking at revenue by different fields: currently by location, but the user could also look at it by dealer, by country, by model or by product, and you're giving them that flexibility. To set up field parameters you go to the Modeling tab, New parameter, Fields: give the parameter a name, pass in the fields you want (renaming them here if you like), and it sets up the field parameter for you. Maybe I could improve the naming here, but it basically ends up as a small table with the display name and an index — 0, 1, 2, 3 — so that when you click the slicer it updates the visual. On the visual side, the y-axis is just the parameter we created plus the measure we want to visualize, sum of revenue. In this instance the field parameter is on the y-axis, but we could also have a field parameter for the measure — sum of revenue, average revenue, median revenue, whatever you want — if you want to give the user the flexibility to change those dynamically too. Okay, we've been through a lot in this video, so let's test your knowledge of what we've covered and the kinds of questions you might face in this section of the exam. Question one: the DAX expression AVERAGEX is an example of — A, an information function; B, a calculation group; C, table filtering; D, a window function; or E, an iterator function. Pause the video here, have a think, and I'll reveal the answer shortly. The answer is E, an iterator function — the big clue is the X at the end of AVERAGEX, which generally denotes an iterator: one of those functions that enumerates through every row in a particular table, performs some calculation, and then combines the results in some way depending on how you set it up. It's not an information function or a calculation group; you might do some table filtering inside your iterator function, but AVERAGEX itself is not a table filtering function, and similarly you might use AVERAGEX within a window function, but AVERAGEX itself isn't a windowing function. Question two: on-demand loading — loading only the data needed for a particular query — is a feature of which two of the following? A, import mode; B, Direct Lake mode; C, DirectQuery mode; D, large format semantic models; or E, the XMLA endpoint. The answer is B, Direct Lake mode, and D, large format semantic models. As we mentioned when going through the slides, on-demand loading is shared by those two. It doesn't really make sense in import mode, because in import mode the full dataset is already there for us to query. You could argue that DirectQuery does something very similar — whenever a query comes in, it goes back to the data source, queries it directly, and loads only whatever comes back — but there is a distinction between what DirectQuery mode is doing and the specific feature called on-demand loading, so although DirectQuery does a similar job it isn't actually leveraging on-demand loading, and the XMLA endpoint, E, is simply not a correct answer. Question three: dynamic format strings overcome which significant limitation of the DAX FORMAT function? A, the FORMAT function is slow on large datasets; B, the FORMAT function returns a string value, so the values can't be used in chart visuals; C, the FORMAT function can't handle date locale conversion while formatting; or D, the FORMAT function can't be used with field parameters. The answer is B. One of the major benefits of dynamic format strings is that the original data type of the measure is retained: if you've got a numeric measure and you want to present it in millions or billions, a dynamic format string formats the presentation while the underlying data type remains numeric, so you can still use that field in charts — a time-series chart, say. That wouldn't be possible with the FORMAT function, which returns a string value you can't visualize in a chart like that. Question four: the DAX expression SELECTEDMEASURE is most likely found in the construction of which of the following? A, a calculation item in a calculation group; B, field parameters; C, an iterator function; D, large format semantic models; or E, a window function. The answer is A, a calculation item in a calculation group: as you'll remember, SELECTEDMEASURE is what makes a calculation item dynamic and, let's say, parameterizable.
It's not part of field parameters, iterator functions or window functions, and while of course you could have a calculation group inside a large format semantic model, the most likely place you'll find SELECTEDMEASURE — because it's pretty much required there — is in a calculation item when you're creating calculation groups. Question five: which of the following is an irreversible operation, meaning it can't be changed afterwards — you can't go back? A, changing the cross-filtering of a relationship to bidirectional; B, changing the storage mode of a table to import; C, naming a calculation group; D, converting a semantic model into a large format semantic model; or E, creating a window function. The answer is B, changing the storage mode of a table to import. This assumes the table's original storage mode was DirectQuery: moving it to import mode is an irreversible operation, and I'll leave a link to the documentation that specifies that case. With A, we can change the cross-filtering of a particular relationship — that's not a problem; we can also rename calculation groups; and creating a window function doesn't really fit, because you can just delete the window function or the measure. Converting a semantic model into a large format semantic model is an interesting one: I was under the impression this was also irreversible, but when I actually went to research and test it I was able to convert a large format semantic model back into a small-format one, so for that reason I've treated it as reversible — but let me know if you think D is also a correct answer. Congratulations, that's the first part of the semantic modeling section of this exam complete. In the next lesson we'll look at model optimization and security, so make sure you click here to join us for the next lesson — I'll see you there. Hi, and welcome to video 10 of 12 in this DP-600 exam preparation course. Today we're looking at securing and optimizing semantic models. This is the second part of the semantic modeling section of the study guide, and as you can see we're very close to the end of the course, with two more modules after this one. Specifically, we're going to look at implementing dynamic row-level security and object-level security, implementing incremental refresh, and implementing performance improvements in queries and report visuals. To do that we'll dive into the use cases for external tools like DAX Studio and Tabular Editor 2, and then go into more detail about what we can actually do to improve DAX performance in DAX Studio and to optimize semantic models using Tabular Editor. As ever, at the end of this video I'll ask you five sample questions to test your knowledge of what we cover. As with the last video, I've left links to really good further learning resources from people who are much more experienced in Power BI development and semantic modeling — a lot of this content is Power BI, plus third-party tools that connect to Power BI — so I'll caveat this lesson by saying: definitely go and check out the further learning resources from those Microsoft MVPs. Let's start with dynamic row-level security. In general, row-level security is about restricting who can see which rows of data in specific tables in a Power BI report. Dynamic row-level security is an extension of that: it applies the row-level security using USERPRINCIPALNAME — there are other information functions you could use, but USERPRINCIPALNAME basically gives you the email address of the logged-in user. So when the user signs in to the Power BI service, we get access to their email address behind the scenes and can use it to filter the data in the report, so that each user only sees the information you've configured them to see.
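In its simplest form, the table filter DAX expression you put behind a role looks something like this. It's a sketch: it assumes a Salesperson table with an Email column that's related to your fact table, so swap in whatever mapping table exists in your own model.

```dax
-- Table filter expression applied to the Salesperson table for the role.
-- Rows where the email matches the signed-in user are kept, and the related
-- fact rows follow via the model relationship.
'Salesperson'[Email] = USERPRINCIPALNAME ()
```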
You can configure row-level security in the semantic model, the data warehouse and the SQL endpoint of the lakehouse in Fabric, but if you're using Direct Lake mode you want to configure it in the semantic model, otherwise you'll fall back to DirectQuery mode. In general the whole OneLake security model in Fabric is still under construction, so where we are currently I'd recommend just using conventional semantic model row-level security. One thing to bear in mind is that row-level security only really works when the users you're applying it to have Viewer permissions on the workspace. If they have more than Viewer — Admin, Member or Contributor — the row-level security effectively won't be enforced, because they'll have plenty of other ways to access the data. So your report consumers need to be Viewers on the workspace. At a high level, the steps to implement row-level security in Power BI Desktop are: create a role; select the table you want to apply row-level security to; enter a table filter DAX expression to configure when and for whom the row-level security applies; and validate that it's being applied correctly. As well as row-level security, we can also apply object-level security to our semantic models, but the way we do that is different, because object-level security can only be configured via third-party tools such as Tabular Editor. Object-level security is about restricting access to a whole table in a semantic model, or to a specific column — you might have a sensitive column in your Power BI report and want to restrict who can see that column of data in your semantic model. As I mentioned, configuring it requires an external tool such as Tabular Editor, and similarly to row-level security, object-level security only restricts data access for users with Viewer permissions, so again you can't be giving those users Admin, Member or Contributor roles on the workspace. The high-level steps to implement object-level security in Tabular Editor are: create a role in Power BI Desktop (or in Tabular Editor); in Tabular Editor, find the role you've created and open its table permissions; and set the permission for the table you want to secure to either None or Read.
None is for when you don't want to give the role access to that table, and Read is for when you do; you can do the same for a particular column. Then you publish the report to the service and add the relevant people and groups to that role in the service. Next we're going to talk about incremental refresh in Power BI. One thing to bear in mind is that incremental refresh is a developing area in the world of Fabric, but for the purposes of this exam I think what they're looking for is the incremental refresh feature in Power BI — not incremental refresh in Dataflow Gen2 or data pipelines or anything like that. Typically it's used on large fact tables, because incremental refresh allows you to pull in only the data that has changed within a given range — perhaps the last 24 hours or the last hour — rather than loading everything that's in the source database. Doing this incrementally rather than all in one go has several benefits: you need fewer full refreshes; refreshes are a lot quicker, because you're only pulling in what's new; resource consumption can be lower, again because you're only bringing in what's changed rather than everything every time; and refreshes can be more reliable — pulling in hundreds of thousands or millions of rows on every refresh can mean long-running open connections against your database with the potential to time out, and that problem largely goes away when your refreshes are quick and incremental. As I mentioned, incremental refresh is currently a Power BI-side feature; they're working on incremental refresh for Data Factory items like Dataflow Gen2 and data pipelines, but for this exam, when we talk about incremental refresh we're talking about Power BI. Incremental refresh in Power BI is available for Power BI Premium licenses only — PPU or Premium capacity subscriptions — and the incremental refresh policies are defined in Power BI Desktop. Implementing it works like this: you start by creating two parameters called RangeStart and RangeEnd — they must have exactly those names, which are reserved for these parameters; in Power Query you apply a custom date filter on the table's date column to keep only the rows between RangeStart and RangeEnd; then you define your incremental refresh policy — select the table, set your import and refresh ranges, and optionally settings like only refreshing complete days or detecting data changes — review and apply the policy, and when you publish the report to the service the incremental refresh kicks in based on the policy and ranges you've set.
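The Power Query filter step usually looks something like this. OrderDate and the previous step name Source are placeholders for whatever your query actually contains, and the convention of >= RangeStart with < RangeEnd avoids rows being counted twice at the boundary between refresh windows.

```m
// RangeStart and RangeEnd must be Date/Time parameters with exactly these names
= Table.SelectRows(
    Source,
    each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd
)
```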
So next we're going to switch our attention to semantic model performance. Whenever we talk about performance of anything, it's useful to think of it in two steps: first monitoring and observation, gathering data about what's happening in the semantic model, and then optimization, so that once we understand what's going wrong we can improve it.

Within Power BI and the external tools that connect to it there are quite a few ways to monitor semantic model performance, so let's do a high-level summary before digging into each in more detail. For Power Query performance there's the query diagnostics tooling. For analysing visual and query performance we can use Performance Analyzer in Power BI Desktop, and we can also export that data into tools like DAX Studio for more fine-grained analysis; we'll look at that shortly. For DAX performance we can again use DAX Studio, bring in Performance Analyzer data, and use traces to monitor events on both the client side and the server side. And for semantic model performance we can use the Best Practice Analyzer, which is a tool within Tabular Editor. For the exam you need to know what you can do in Power BI Desktop, the limits of performance analysis there, and when to switch to tools like DAX Studio and Tabular Editor when you want more advanced analysis or a deeper dive into what's going wrong in your DAX and your semantic models.

Let's look at these in more detail, starting with DAX Studio. One of its core use cases is loading Power BI Performance Analyzer data into DAX Studio for further analysis. In Power BI we can record actions as we use the report, like clicking filters and refreshing pages, and it logs the query time and visual load time for every visual on a page. The UI in Power BI for analysing that Performance Analyzer data is a little limited, so we can export it and import it into DAX Studio for a deeper dive: better filtering and sorting than in Power BI Desktop, and you can view the queries behind each visual load. Since Power BI introduced the DAX query view this has become less of an advantage for DAX Studio, because you can now run the underlying DAX query for a specific visual inside Power BI Desktop, but it's something to bear in mind.

Another core use case of DAX Studio is View Metrics, the VertiPaq Analyzer. VertiPaq is the Analysis Services engine that Power BI runs on, and we can analyse what's going on inside it: table and column sizes, which have a really big impact on performance, and then deeper reasons why things might be slow, like the cardinality of a column and the data types you're using. We can also use the VertiPaq Analyzer to look for referential integrity violations.
A referential integrity violation means a mismatch in the unique keys on the two sides of a relationship. There's a lot more information in the VertiPaq Analyzer; for the exam it's enough to have a look at it and understand what you can do there and the insights you can gather from it.

Another really useful feature in DAX Studio is trace analysis. There are three traces to be aware of. The All Queries trace captures query events coming from client-side tools such as Power BI Desktop, and it's the only one of the three that captures your queries on the client side as well as the server side. The Query Plan trace only captures query plan events from the Analysis Services tabular server, so if you're using the DAX editor within DAX Studio to run queries and analyse their performance, that's where you'd use it, because it gives you the plan of that specific query on the Analysis Services engine. The Server Timings trace gives you query timing from the server's perspective.

Next let's switch over to Tabular Editor and its use cases. The main one when it comes to optimizing semantic model performance is the Best Practice Analyzer, a tool that performs a scan of your semantic model and checks it for common issues. There's a list of rules that can be downloaded from GitHub, and you can also create your own custom rules. The predefined rules are organised into categories: performance, DAX expressions, error prevention, formatting and maintenance. These checks can also be run from the Tabular Editor CLI as part of a CI/CD process, so when you deploy a semantic model from a development environment to a testing environment you can run the Best Practice Analyzer rule set against it to get an automated way of analysing the quality of that model and surfacing any issues you need to be aware of from a quality and maintenance perspective.

I've left the summary of use cases for DAX Studio and Tabular Editor to the end, just to review everything we've talked about. On the DAX Studio side: use DAX Studio when you want to write, execute and debug DAX queries, although they do need to be manually copied over to Power BI Desktop, because DAX Studio is read-only. As I mentioned, Power BI Desktop now has the DAX query view, so you can write and execute DAX queries and view the results there too, similar to DAX Studio, but that's a relatively new feature. Use the VertiPaq Analyzer to understand the size of your semantic model and of individual tables and columns within it. Bring Power BI Desktop Performance Analyzer data into DAX Studio to analyse it further. And use the trace analysis functionality to analyse query events on the client side and the server side, depending on the trace you select.
When we're talking about use cases for Tabular Editor, we can quickly edit data models: create measures, perspectives and calculation groups from the editor within Tabular Editor and publish them directly into the semantic model. That's a key distinction from DAX Studio: in Tabular Editor we can actually update the semantic model from the tool itself. There's also functionality for automating repetitive tasks using scripting, and as we've mentioned we can incorporate DevOps into the tabular model lifecycle using the CI/CD functionality in the Tabular Editor CLI. We can use the Best Practice Analyzer to identify common issues in a Power BI semantic model. Finally, a good use case for Tabular Editor is implementing object-level security; as we mentioned before, that's not currently possible in Power BI Desktop, so if you want to define object-level security it's going to be done in Tabular Editor.

Okay, we've covered a lot of ground again, so let's wrap up this video with some practice questions to test your knowledge of this section of the exam.

Question one: your goal is to analyse Performance Analyzer data in DAX Studio to find the visual in your report with the longest total load time. Put the following five steps in the correct order. Take a moment, maybe write them down, and I'll show you the answer shortly. The answer: first, start a recording in the Power BI Performance Analyzer, because we need to record it in Power BI to begin with; then click Refresh visuals to update the visuals on the page, or interact with the report if that's what you want to analyse; then click Stop recording; then export the performance data JSON and import it into DAX Studio; and finally go to the Power BI performance tab in DAX Studio and sort by total milliseconds descending to find the visual with the longest load time.

Question two: you want to use the Best Practice Analyzer rules within Tabular Editor to assess your semantic model. Which of the following severity codes for Best Practice Analyzer rule violations indicates an error? A, level 0; B, level 1; C, level 2 and above; D, level 2; E, level 3 and above. The answer is E, level 3 and above. In the Best Practice Analyzer within Tabular Editor there are different severity levels for rule violations: level 1 is information only, level 2 is a warning, and level 3 and above is an error.

Question three: when implementing dynamic row-level security, which information function should you use to filter the data in a specific table based on the logged-in user's email address? A, USER; B, USEROBJECTID; C, USEREMAIL; D, USERPRINCIPALNAME; E, USERNAME. The answer is D, USERPRINCIPALNAME. As you'll recall from the dynamic row-level security discussion, the question is specifically about the logged-in user's email address, and the information function that gives us that is USERPRINCIPALNAME. The USER function doesn't exist; USEROBJECTID and USERNAME do exist as information functions in DAX, but they won't give you the email address — USERNAME returns the domain and the user's name rather than the actual email address — and USEREMAIL, option C, is made up; that's not what the function is called. As a reminder, a minimal sketch of what this looks like as a role filter is below.
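This is a minimal sketch of a dynamic RLS filter, assuming a hypothetical DimUser table with an EmailAddress column that stores each user's UPN/email; it's the row filter expression you'd put on that table inside a security role, not a measure, and the names are illustrative.

```dax
-- Row filter applied to DimUser (table and column names are placeholders)
-- Keeps only the rows whose stored email matches the signed-in user's UPN
[EmailAddress] = USERPRINCIPALNAME()
```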
Question four: in DAX Studio, which of the following records queries generated by a client tool like Power BI Desktop? A, the Query Plan trace; B, SQL Profiler; C, the All Queries trace; D, the Server Timings trace; E, the VertiPaq Analyzer. The answer is C, the All Queries trace. The clue in the question is the client tool and recording queries generated within the client tool itself, Power BI Desktop; the All Queries trace is the only one that gives you that. The VertiPaq Analyzer analyses things like referential integrity and the sizes of tables and columns in your semantic model, not queries coming from the client tool, and SQL Profiler, the Server Timings trace and the Query Plan trace are all server-side, back-end tools; they're not going to give you information about queries generated in Power BI. So the answer is C.

Question five: the first step in implementing incremental refresh in Power BI is to: A, add RangeStart and RangeEnd columns to your dataset; B, add RefreshStart and RefreshEnd parameters to your Power BI Desktop project; C, add RefreshStart and RefreshEnd columns to your dataset; or D, add RangeStart and RangeEnd parameters to your Power BI Desktop project. The answer is D, adding RangeStart and RangeEnd parameters to your Power BI Desktop project. As we mentioned previously, RangeStart and RangeEnd are reserved names for these parameters, so you need to be specific: the parameters must be called RangeStart and RangeEnd. Answers A and C talk about adding columns to your dataset, which isn't going to do anything useful for incremental refresh; we need parameters, because we use them to filter the date column in the table we're interested in. And B is wrong because it uses RefreshStart and RefreshEnd rather than RangeStart and RangeEnd.

Congratulations, you've now completed the first three sections of the exam study guide. We only have two quite short sections to go, so make sure you click here to join me in the next video, where we'll be looking at exploratory analytics in a bit more detail. See you there.

Hello and welcome back to the channel. Today we're continuing our DP-600 exam preparation course, and we've made it to video 11 of 12 in this series. Today we're looking at performing exploratory data analytics. There's this video and then one more to go, so we're very close to the end — keep going, we're almost there. Within this video we'll look at descriptive, diagnostic, predictive and prescriptive analytics, and specifically how to implement those things within Power BI. We'll also take a look at the data profiling tool, which is part of the Power Query experience,
so you'll see it in the Dataflow Gen 2 and also in the Power Query engine in Power BI Desktop. Towards the end of the lesson we'll finish with five sample questions to test your knowledge of this section of the study guide, and as ever I'll leave links to further learning resources if you want to go into more detail about anything I mention, plus a link to the community in the description below if you want to grab the learning notes.

To kick us off, it's worth going through the four types of analytics and what they mean in Microsoft's own words. We'll start with descriptive analytics: analytics that interpret past data and KPIs to look for trends and patterns. We're looking at what happened in the past; we're describing it. We'll go into examples of descriptive analytics and visuals you can use for it later, but for now let's just look at the definitions. The second one to know is diagnostic analytics. Diagnostic analytics differs from descriptive analytics because we're not just worried about what happened; it's analytics that describe which data elements influence specific trends and the possibility of future events. So with diagnostic analytics we're not just stating what happened, we're inferring why it happened: not just what the last 12 months looked like, but why certain events or trends might have occurred. Typically this uses techniques like correlation analysis and data mining, at least in Microsoft's words, and there are various techniques we can use to infuse our Power BI reports with this kind of diagnostic information. Number three is predictive analytics: forecasting future outcomes with statistical models and machine learning techniques. These analytics provide context and clarity for future decisions. So with predictive analytics we're not just worried about what's happened in the past; we're using our historic data to say, with some degree of certainty, what might happen in the future. And the next level up is prescriptive analytics: here we're not just predicting what will happen, we're providing recommendations — actions that might be the best course of action either to prevent a particular scenario or to increase the chances of a particular outcome for your business.

With all four of these, the clue really is in the name, so if you're struggling to remember what each type of analytics means, go back to the first word of each. With descriptive analytics we're describing what happened in the past. With diagnostic analytics we're diagnosing: there's some causality involved, we're thinking about why something happened, so we're diagnosing a particular trend. With predictive analytics we're predicting what will happen in the future,
and with prescriptive analytics the key word is prescribe: we're not just predicting what will happen in the future, we're also prescribing a suitable course of action you should follow to optimize a particular outcome.

One thing I'll say before we start: most of the content for this section of the exam comes from, or is at least inspired by, the PL-300 exam for the Power BI Data Analyst, so that's what we'll focus on in this section of the study guide. For me it's a bit less about the descriptive, diagnostic, predictive and prescriptive categories themselves — although you could argue that each visual aligns to one of those four — and more about Power BI: which visual to use when, and when to use specific Power BI features. That's what we'll focus on in this lesson. When it comes to choosing visuals, visual selection is somewhat subjective, so for the exam I'd recommend not trying to be too clever. They might ask when you should use a particular visual, or which visual you'd choose for a given scenario; rather than overthinking it, think about what Microsoft sees as the main use case for that visual. Get inside the heads of the examiners and think about how they want you to use the tools. Keep that in mind as we go through the following examples.

So let's go through some commonly used visuals in Power BI and when you might consider choosing them, or not. We'll start with the table and the matrix. These are really good for visualizing fine-grained detail in your data, letting the report user explore the data themselves, and they're commonly used with drill-through: when you want to give the user the option to drill through to more detail about a particular metric (we'll talk about that functionality a little later on). They can also display aggregated information, for example revenue broken down by month. One drawback of the table and matrix is that it's difficult to spot long-term trends, at least visually.

Next, the bar and column chart. When you have one categorical variable and one numeric variable, for example revenue by region, these are particularly useful. They can also be stacked if you want to add another categorical variable into your analysis, say revenue by region and then by product type, to add an extra layer. Bar and column charts can be used to visualize time series information, but it gets a bit messy if you have lots of time periods along the x-axis; generally it's better to present time series data on a line chart, which brings us nicely to the next one.

The line and the area chart: as I mentioned, these are really good for visualizing time series information. You can fit a lot of information into one chart over large time ranges, and you can use the legend to break a single line into multiple lines and compare how a metric is changing within each category over time. A variation of this is the area chart.
Personally I don't really like the area chart; it's difficult to interpret all the different areas because the colour of each area is affected by the colour underneath it, so it can be hard to read from a user perspective. But again, this isn't about my personal opinion; it's about what Microsoft sees as the core use cases for each of these charts.

Next is the card visual. These are obviously really good for KPI metrics, and you can include percentage-change metrics to add context about what's happened: is that sales amount an increase or a decrease since the last period? Out of the box it doesn't really give you the longer-term trend, but it can be coupled with a sparkline, so if you want to show the momentum of a metric over time, adding a sparkline gives your users context about how it's changing. The card visual has been developed quite a lot by Microsoft over the last few months and years and is now quite feature-rich, so you can pack a lot of information into it.

The pie chart, donut chart and treemap are a little controversial, but their main goal is to show ratios between different categories. The reason these ratio charts can be controversial is that it's more difficult for the human brain to compare areas, which is what a pie chart or treemap shows, than the linear comparison you get with a bar chart. Another potential downside is that by presenting ratios they somewhat obscure the overall numbers: you might be comparing sales in the USA to sales in the UK as a ratio, but by default you don't get a view of the overall sales in each category. You can add that via labels or a tooltip, but if you're going to do that, I'd argue there are better visuals to choose. And you can't really see trends over time: as with the card visual, you only get a point-in-time view of the metric, not how it has changed over time, which is usually more useful for the user.

Next, combo charts. These are typically used to visualize more than one metric on the y-axis: a bar chart showing one metric with a line chart showing another on top of the same visual. This can be useful in some scenarios, but be careful: it can lead the user to conclusions about correlation between the two metrics that might not actually be correct, and it can be difficult for the user to tell which axis is showing which metric, so you often need a really good legend to explain the two axes and the two metrics.

Next is the funnel visual. These are really good for showing movement through a linear process. An example would be tracking website conversion: at the top of your funnel you might have a visitor landing on your website, the next stage in that linear process might be going to a product page, the next might be clicking buy and adding something to their cart,
and then the next stage might be that they actually bought the product, and so on. So when you have a linear process and you're trying to visualize movement through it, funnel visuals are really useful. If I were being picky, you could argue it's difficult for some users to grasp the scale of the difference between each level in the funnel, but that's personal preference rather than something for the exam.

Next is the gauge chart, which is useful when you want to show progress of a particular metric towards a goal. Say you have an annual revenue target and you want to show that halfway through the year you're at 53% of target, so you're on track.

Next up is the waterfall visual, which is useful for showing a running total over either a time period or across specific categories. The goal is to understand which periods or categories contribute most to the change in the overall figure. From personal experience I think the waterfall chart should be used carefully, because in my opinion it can be somewhat difficult for users to interpret, but that's just based on my experience.

Next we have the scatter chart, generally used when you want to visualize two numeric or continuous variables. You can add more information via the legend if you want a further layer of analysis: you might be visualizing people's age versus their height, and a third variable, say the country they grew up in or live in, can be added by colour-coding the dots or changing their shape. Obviously you need to be careful when adding a third variable, because a scatter chart with lots of categories in lots of different colours can be quite difficult to interpret. Another warning with these charts is that they can lead users to conclusions about correlation between the two variables, and as we know, correlation doesn't always equal causation, so as a report author bear in mind that the two variables you compare might not actually have a causal relationship.

Next we've got custom visuals. Custom visuals help you go beyond the out-of-the-box visuals that come with Power BI, and there are lots of them to explore in AppSource. They can be really useful if you want slightly more niche visual types that haven't made their way into the core Power BI visual set yet. Bear in mind that custom visuals are normally built by third parties and sometimes involve licensing and things like paywalls — you might be able to use one for a short period and then have to pay — so that's something to keep in mind.

The final visual type we'll look at is the Q&A visual in Power BI, which allows users to ask natural-language questions about the data in your underlying data model. That sounds good, but in practice I think it's difficult to implement well, because the questions currently need to be carefully articulated to match your data model,
so it requires the user to understand the columns and tables in your dataset to craft good questions that give good results. Currently it's only available in English, plus Spanish if you enable it in the admin settings.

That's a summary of some commonly used visuals. Now let's look at some Power BI features that can help us as report developers go a layer deeper in our analysis, towards the diagnostic or predictive elements we're looking for in this part of the study guide.

Starting with drill down. Drill down allows your user to explore your data through the layers of a hierarchy, and that hierarchy can be explicit, a defined Power BI hierarchy, or implicit, where you haven't actually defined a hierarchy but there's an implicit hierarchical relationship between the variables. A classic example would be to visualize revenue by country at the top level, then drill down to revenue by state, and then within a state drill down to revenue by store. Alongside drill down and drill up you also have drill through, a Power BI feature that lets users drill through from one report page to another, carrying the context of whatever they clicked on. For example, you might let users drill through on a particular category and take them to a different page of the analysis, pre-filtered on whatever they drilled through on. With drill through you need to think carefully about the user journey and the navigation through your report; you might want to add back buttons to make sure users don't get lost. With both of these features, and this is just from personal experience, your end user needs to know that drill down and drill through exist, which I'd argue is a bit of a downside, but it's something to bear in mind when you're authoring these reports.

Next up, grouping. Grouping allows report authors to group two or more categories within a visual; you could create groups of months, for example, or anything that makes sense for the visual you're creating. When your data is continuous, i.e. numeric, you might want to use binning instead, which creates bins for those continuous values. An example would be salary bins: rather than salary as a raw number, you create salary ranges, 0 to 30k, 30k to 60k and so on, which opens up different visual types you might want to use for that data.

Some other features to be aware of: reference lines, which let report authors draw a static reference line across the x- or y-axis to give the user more context. It might be the average sales amount across salespeople, so that when you're visualizing how all the different salespeople have performed, users can see the average and make a comparison with some context. Another feature is visualizing errors: if your dataset contains some measure of error, normally prediction uncertainty
or measurement uncertainty, then we can use error bars to show that uncertainty to the user. These can be markers, lines or shaded areas; if you're doing something like a time series forecast you can have a shaded area for the uncertainty of a particular prediction. Obviously this requires data on the errors or the prediction uncertainty.

Next we have what-if parameters. We can use what-if parameters for what I'd call fairly rudimentary scenario analysis: giving your report users the option to change a particular parameter, in this case a discount percentage, and see what impact that has on a particular metric in the chart below. To get a genuinely useful what-if analysis working you may need quite a lot of data engineering and potentially prediction logic behind it, so a good what-if analysis can be quite difficult to set up.

Finally we've got time series forecasting, which is accessed in the analytics pane, the third icon when you're configuring a visual. If you have a time series dataset, say revenue numbers for the last 24 months, Power BI gives you the functionality to create a forecast. It's quite limited, and I'd argue you should be careful here: any serious time series analysis probably shouldn't be done solely within Power BI, but the functionality is there and you might get asked about it. To make the what-if mechanics a little more concrete, here's a rough sketch of the kind of DAX that sits behind a what-if parameter.
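This is a minimal sketch, not the exact DAX Power BI generates: the table and measure names (Discount Pct, FactRevenue, Revenue) are illustrative, and the three definitions are created separately as a calculated table plus two measures rather than run as one script.

```dax
-- 1. A disconnected parameter table; GENERATESERIES produces a single [Value] column
Discount Pct = GENERATESERIES ( 0, 0.5, 0.05 )

-- 2. A measure that picks up the slicer selection, defaulting to 0 when nothing is selected
Discount Pct Value = SELECTEDVALUE ( 'Discount Pct'[Value], 0 )

-- 3. A measure that applies the selected discount to a metric (FactRevenue is hypothetical)
Discounted Revenue = SUMX ( FactRevenue, FactRevenue[Revenue] * ( 1 - [Discount Pct Value] ) )
```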
Great, so let's finish off this lesson with a look at the Power Query data profiling tool. The data profiling tool comes with the Power Query engine, so we can use it in a Dataflow Gen 2 or in the Power Query engine in Power BI Desktop, if that's what you're using. What I've got here is a query on a revenue dataset, and I want to explore some potential data quality issues in it. To do that we go to the View tab and enable the column profiling options, starting with column quality. That goes through each column and gives us a percentage of how many values in that column are valid, how many are errors and how many are empty, so you get a really quick indication of column quality. By default, as you can see at the bottom, column profiling is based on the top 1,000 rows, and for this dealer ID column it's saying none are empty, there are no errors and 100% are valid. If we change it to profile the entire dataset it obviously takes a bit longer to calculate, but then it looks at every value in the column and performs the same categorization. This dataset looks pretty good: 100% valid for all columns, no empties, no errors.

If we go back to the View tab and turn on column distribution, we get a fairly high-level view of the distribution of values within each column, and we can begin to see how many of those values are distinct. To explain the difference between distinct and unique, it helps to look at a column with true and false values in every row. It shows a distinct count of two, which obviously refers to true and false, so distinct is essentially a distinct count of the values in that column. Unique means values for which there is only one row, so here there are no unique values: true appears in more than one row, and there are multiple falses too. If our sample only contained one false value, that value would in fact be unique and we'd get a one in the unique count, but because both values appear in multiple rows there are zero unique values in this column.

Next up we have the details pane: if we enable it and click on a specific column's distribution, we get a bit more information about that column. We can see the row count, error count and empty count, plus the distinct count, unique count, empty string count and the minimum and maximum values — here it's the min and max of a string, but it works for numeric columns too (this particular column only contains the value one, so the min and the max are both one, which isn't a great example). We also get things like the average, the standard deviation and the counts of odd and even values, so it gives you a good bit of detail about what's in that column. Once you've got a good idea of a column's distribution, you might spot things like duplicate values; you can right-click and remove duplicates, or remove or replace errors as well, straight from that top section, which gives you a quick way to deal with them.

Okay, let's round off this video by going through some practice questions to test your knowledge of this section of the study guide. Question one: a company's annual report shows a net profit of $34 million. The company has 12 business units, each with its own net profit or loss. Which of the following visual types could best be used to visually show how each business unit contributed to the overall net profit metric? A, a line chart; B, a scatter chart; C, a matrix; D, a waterfall chart; E, a Q&A visual. Pause the video here, have a think, and I'll reveal the answer shortly. The answer is the waterfall chart: as you'll remember, one of its core use cases is breaking a top-level metric down into its constituent parts, letting the user explore how each business unit contributes to the overall net profit. You might be able to glean some of that information from the other visuals listed, but as I mentioned previously, we're after the best visual to present this information, and for me that's the waterfall chart.

Question two: you want to add measurement error bars to a time series line chart to show the potential error in each measurement. Where would you go to add this to your visual? A, the analytics pane; B, the format visual pane; C, the View tab, "show error bars"; D, the build visual pane; E, the model view. The answer is A, the analytics pane.
When you're configuring a particular visual there are three panes, three different windows, where you can declare different settings, and the error bars functionality lives in the analytics pane. It's not the build visual pane, which is where you add the different fields and columns to a visual; it's not the format visual pane, which is where you change things like text and borders; it's not the model view, because that's where you define relationships and the like; and it's not the View tab's "show error bars", because that option doesn't actually exist — I made it up.

Question three: you want to visually compare two continuous variables, the age and height of survey respondents, in one chart. Which of the following visual types could best represent this data? A, a line chart; B, a stacked bar visual; C, a scatter visual; D, a Q&A visual; E, a matrix. The answer is C, the scatter visual. When you're visualizing two continuous variables like age and height, the best visual is probably the scatter chart: put height on one axis and age on the other and plot a dot for each survey respondent. The other visual types aren't really going to help with that kind of analysis; you could argue a matrix would technically show it all, but it would be a really big table if you've got lots of respondents.

Question four: by default, the data profiling tool reviews the top n rows of your dataset to show you potential data quality issues. What is n? A, 10; B, 100; C, 500; D, 1,000; E, 10,000. The answer is D, 1,000 rows: by default column profiling is based on the top 1,000 rows, although as we saw you can change it to profile the entire dataset.

Question five: which of the following features of the data profiling tool can help you identify duplicate values in a column that you plan to use as a key to join on? A, column quality details; B, column value distribution; C, column duplicate analysis; D, column key constraints; E, group by. The answer is B, the column value distribution. As we mentioned, one of its key use cases is looking for duplicate values: if you've got a key you plan to join on, you want to check it for duplicates, and a column with no duplicates will show a flat distribution where every value is distinct. If some values have a higher count than the others, that's a potentially problematic key to join on; in the example, the date ID column's values were all unique, at least in the sample of data used for profiling. Column quality details (A) doesn't give us this information, column duplicate analysis (C) and column key constraints (D) don't actually exist, and group by (E) isn't a feature of the data profiling tool itself, although you could potentially use a group by elsewhere to look for duplicates.

Congratulations, we're nearly there, only one section of the study guide to go. In the next lesson we'll look at the final section of the study guide: how to use SQL to analyze data via the Lakehouse SQL endpoint and the warehouse,
and also via the XMLA endpoint. So make sure you click here to join me in the last lesson in this series — I'll see you there.

Hello and welcome to video 12 in this DP-600 exam preparation course. We've made it: this is the final video of the study guide, and we're going to be looking at querying data by using T-SQL. You'll notice I've added an extra video to the series, because I really wanted to give you one more video to help you prepare for the exam: how to prepare, some other resources I think you should look at before you take it, and some advice for when you're actually sitting the exam. But that's the next video. For this video we're focusing on three sections of the study guide: querying the lakehouse and the data warehouse using T-SQL, the visual query editor, which is a feature of both of those SQL experiences, and finally connecting to and querying data using the XMLA endpoint. As ever I've released study notes to help you go a little deeper and make sure you're covering all the right points, plus links to further resources if you want to go further on anything I mention, and as ever there are five sample questions to test your knowledge at the end of this video.

Let's start by looking at the different ways we can access the T-SQL engine within Fabric; there are a few to be aware of. One is the Lakehouse SQL analytics endpoint. The thing to bear in mind here, as we've already mentioned quite a lot, is that it's read-only: you can really only run SELECT statements and a bit of DDL, but no inserts, updates or deletes from the Lakehouse SQL endpoint. The more obvious place to write T-SQL is the Fabric data warehouse: here you can write both DDL and DML — inserts, updates, deletes, selects, all of it — so the data warehouse gives you the ultimate flexibility for writing T-SQL scripts within Fabric. On top of the T-SQL query editor, you can also build T-SQL-style queries using the visual query editor, which is quite similar to the visual editor in a Dataflow Gen 2 if you've ever used that: merging tables, filtering rows, adding columns and so on through a no-code interface. We'll take a look at that in more detail shortly. The other option is the XMLA endpoint: as we've mentioned previously in this course, you go into your workspace settings, grab the connection string, open SSMS, connect using the Analysis Services server type and paste in the XMLA endpoint. That brings through the models for our lakehouses and warehouses within that workspace, and we can then write different queries against them depending on the use case. To make the read-only distinction concrete, here's a quick sketch.
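This is a rough illustration only; the object names (LH_Bronze, dbo.revenue, dbo.fact_revenue, the branch ID value) are hypothetical and just stand in for a lakehouse SQL endpoint and a warehouse in the same workspace.

```sql
-- Reads work against a Lakehouse SQL analytics endpoint (three-part name, same workspace)
SELECT TOP 10 *
FROM   LH_Bronze.dbo.revenue;

-- Writes like this are only supported against a Warehouse; the Lakehouse SQL
-- analytics endpoint is read-only, so DML run against it will fail
UPDATE dbo.fact_revenue
SET    revenue = 0
WHERE  branch_id = 'B001';
```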
Now that we've got a good understanding of where we can write T-SQL: for most of this exam you need a pretty good level of T-SQL, or at least an appreciation of what the common functions do and the ability to read T-SQL quite well. I don't think there's a definitive list of which T-SQL functions you need to be familiar with, but I'd definitely recommend being comfortable with the following: the difference between WHERE and HAVING; GROUP BY and summarizations; UNION versus UNION ALL; the different joins and when to use them; common table expressions; things like LEAD and LAG; ROW_NUMBER and partitioning; subqueries; and cross-warehouse queries. Rather than just describing all of these, I think it's better to jump into Fabric, open up a data warehouse and show you some of them in action.

So here we are in SQL Server Management Studio, connected to one of my gold data warehouses, DW Gold, using the SQL connection string. Currently there's no data in this warehouse — the tables are empty — so the first thing to do is bring in some data from other tables we've got. For that I'm using CTAS, CREATE TABLE AS SELECT, which lets us create a new table in this data warehouse using existing data from another data warehouse or, in this example, from a lakehouse, because we're connecting to its SQL endpoint. This is an example of cross-database querying: we're taking all the data from the dbo.revenue table in our bronze lakehouse and creating a new table called dbo.fact_revenue. If we refresh the tables, the first one appears, and we do the same for dim_date and dim_branch. We'll use these datasets to demonstrate some of the functions you need to be familiar with for the exam — not all of them, because to be honest I'm not sure of the full breadth and depth expected for T-SQL, but we'll go over some common types of problem you might see.

After refreshing we've got dbo.fact_revenue, dim_date and dim_branch. These datasets come from the same car sales and dealerships data we used in another part of this exam preparation course: fact_revenue has a revenue figure, a date ID column and a branch ID column, and that's what we'll use for the rest of this analysis. I'll present you with a series of problems and walk through how you might solve each one in SQL, starting simple and adding more functionality as we go.

First problem: calculate the top five branches by total revenue. We need an aggregate: the fact table just has branch ID and revenue, so we select the TOP 5 branch IDs and the SUM of the revenue, do a simple GROUP BY on the branch ID because we're looking for the top five branches, and ORDER BY the sum of the revenue descending so we get the top five. A sketch of the CTAS statement and this query is below.
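A minimal sketch of the two statements just described; the object names (LH_Bronze, dbo.fact_revenue, branch_id, revenue) are my reconstruction of what's shown on screen, so treat them as placeholders.

```sql
-- CTAS: create a warehouse table from a lakehouse SQL endpoint (cross-database query)
CREATE TABLE dbo.fact_revenue
AS
SELECT *
FROM   LH_Bronze.dbo.revenue;

-- Top five branches by total revenue
SELECT TOP 5
       branch_id,
       SUM(revenue) AS total_revenue
FROM   dbo.fact_revenue
GROUP BY branch_id
ORDER BY SUM(revenue) DESC;
```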
Now let's change the problem a little: return only the top branches in Spain, say the top three. Spain is a country name that comes from a different table in our dataset, so we add an INNER JOIN to dim_branch on the branch ID, because we have the branch ID in both datasets; we use an inner join because we only want to return rows where the key exists on both sides. We're still aggregating by the SUM of the revenue, but this time we've brought through a few columns from the branch dimension. Run it and the query returns the top three, filtered with a WHERE clause so that the country name equals Spain. Note that we had to add country name and branch name to the GROUP BY, because we've brought them through in the SELECT; bringing the branch name through means you're not stuck with branch IDs you might not recognise. So that's that problem sorted.

But what if we want to filter after the aggregate? For example: give me all the branches with total revenue greater than some amount. Here we don't use WHERE; when you want to filter on the result of an aggregation you use HAVING, and it comes after the GROUP BY. Previously the WHERE clause was pre-filtering rows on the country name; now we want branches whose aggregated revenue is greater than, say, 50 million, so we drop the WHERE and add HAVING SUM(revenue) > 50,000,000. That returns only two branches, which makes sense; if we knock a zero off and use 5 million instead, more branches qualify — with 50 million there are only two branches with more than that revenue over the full dataset in the fact table.

In this example we've used an inner join because that's what the use case needed, but for the exam you'll need to be familiar with LEFT JOIN, RIGHT JOIN, INNER JOIN and FULL OUTER JOIN, and when each is useful for different scenarios. We're not going to walk through all of them here, but if you're not comfortable with these I definitely recommend learning the differences and when you'd choose one over another. Both of the filtering patterns we just discussed are sketched below.
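Two hedged sketches of the queries described above, again with reconstructed object and column names (dim_branch, country_name, branch_name), so adjust them to your own schema.

```sql
-- Filter BEFORE the aggregate with WHERE: top three branches in Spain by total revenue
SELECT TOP 3
       b.branch_id,
       b.branch_name,
       SUM(f.revenue) AS total_revenue
FROM   dbo.fact_revenue AS f
INNER JOIN dbo.dim_branch AS b
       ON f.branch_id = b.branch_id
WHERE  b.country_name = 'Spain'
GROUP BY b.branch_id, b.branch_name
ORDER BY SUM(f.revenue) DESC;

-- Filter AFTER the aggregate with HAVING: branches with more than 50 million total revenue
SELECT b.branch_id,
       b.branch_name,
       SUM(f.revenue) AS total_revenue
FROM   dbo.fact_revenue AS f
INNER JOIN dbo.dim_branch AS b
       ON f.branch_id = b.branch_id
GROUP BY b.branch_id, b.branch_name
HAVING SUM(f.revenue) > 50000000;
```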
Another thing you might need to be familiar with for the exam is common table expressions. These are heavily used in the SQL world, for example when creating views, so you need to know how to construct one, what the keywords are and what the general structure looks like. In general a CTE lets us define something like named intermediate results that we can reuse later in the script. In this example, the first three lines are a query selecting the top five branch IDs and summing the revenue, grouped by branch ID, similar to what we saw before; you can select just those lines and run them to see what they return. We then save that as a common table expression called top_five_rev. The syntax always starts with WITH, which is the keyword for a CTE, then you give it a name, then AS and the expression inside brackets, and we can use top_five_rev later in the query.

One of the other benefits of CTEs is that we can define more than one. Any subsequent expression needs a comma first, but you don't repeat the WITH keyword — that's only on the first one. So here I've defined a second one, branches, as SELECT branch_id, branch_name FROM dbo.dim_branch, which just gives us the branch ID and name. The final part of a CTE query is the SELECT statement: you always need to return something, and here we're selecting the branch ID, the branch name and the total revenue, taking them from top_five_rev, our first expression, joined to branches, our second one, with aliases on each. To run it we select all of the sections and execute, and we get the top five branches by revenue joined back to the branch data. You could obviously write this a different way — you probably don't especially need a CTE for this result — but I wanted to show you the structure of a common table expression, because you might get asked about it in the exam, or be shown code in this format and need to understand how it works. The full pattern is sketched below.
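A minimal sketch of the CTE structure just described, using the same reconstructed table and column names as before.

```sql
WITH top_five_rev AS (
    -- First expression: top five branches by total revenue
    SELECT TOP 5
           branch_id,
           SUM(revenue) AS total_revenue
    FROM   dbo.fact_revenue
    GROUP BY branch_id
    ORDER BY SUM(revenue) DESC
),
branches AS (
    -- Second expression: no WITH keyword, just a comma before it
    SELECT branch_id, branch_name
    FROM   dbo.dim_branch
)
-- Final SELECT: every CTE query ends by returning something
SELECT t.branch_id,
       b.branch_name,
       t.total_revenue
FROM   top_five_rev AS t
INNER JOIN branches AS b
       ON t.branch_id = b.branch_id
ORDER BY t.total_revenue DESC;
```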
Another thing to be aware of for the exam is the LAG and LEAD functions. These create new columns that reference another column, but with a specified offset, so it's worth walking through an example. I've defined another common table expression: in the first part we take the top five branches, divide the revenue by one million just to give us easier values to comprehend, and take the FLOOR of that division so it rounds down to single digits, defining that column as revenue in millions to make it easier to see what LAG and LEAD are doing. So the dataset is five branches, each with a revenue-in-millions value that's a single-digit whole number, stored in the CTE.

Next we introduce a LAG function over this dataset: SELECT the branch ID and revenue in millions, and add another column using LAG. With an offset of one, LAG looks at the column you pass in, revenue in millions, and returns the value from the row one position back: for row one it looks at row zero, which doesn't exist, so it returns NULL, but for row two it returns the value from row one, and so on. Note that we've ordered by revenue in millions, so the rows come back in ascending order, 1, 2, 2, 3, 6. You can change the offset to whatever you want: with an offset of two it shifts by two positions and introduces another NULL. Hand in hand with LAG we also have LEAD, which looks in the opposite direction: rather than looking back, it looks forward through your rows. If we change the offset back to one and swap LAG for LEAD, the first value in the new column comes from the row below it, and the last row gets a NULL because there's no sixth row to pull from. One thing to bear in mind: you can't do a LEAD or a LAG with an offset of minus one — the offset parameter can't be negative.

Another function to be aware of is ROW_NUMBER, together with partitioning. ROW_NUMBER, as the name suggests, runs through your dataset and numbers the rows sequentially. In this example we've defined a row number column, row_num, and passed in a PARTITION BY on the dealer ID, so it groups all the rows for each dealer: for the first dealer, DLR001, all of its rows fall within one partition, ordered by revenue, and the row number runs 1 to 8 within that partition; for the next dealer, DLR002, there are ten rows in the partition, ordered the same way, with row numbers 1 through 10. That's the section inside the CTE; then I've just done SELECT * FROM that CTE with a WHERE clause to bring back only two of the dealer IDs, DLR001 and DLR0017. Make sure you understand row numbering, partitioning and ordering for the exam, because this could easily come up. Here's a compact sketch of both patterns.
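Two hedged sketches of the window-function patterns described above; branch_revenue_m is a stand-in for the revenue-in-millions CTE built in the demo, and the column names and dealer IDs are reconstructed, so treat everything as illustrative.

```sql
-- LAG looks back n rows, LEAD looks forward n rows (the offset must not be negative)
SELECT branch_id,
       revenue_m,
       LAG(revenue_m, 1)  OVER (ORDER BY revenue_m) AS prev_revenue_m,
       LEAD(revenue_m, 1) OVER (ORDER BY revenue_m) AS next_revenue_m
FROM   branch_revenue_m;   -- e.g. the CTE from the demo

-- ROW_NUMBER restarts at 1 for every dealer_id partition, ordered by revenue descending
SELECT dealer_id,
       revenue,
       ROW_NUMBER() OVER (PARTITION BY dealer_id ORDER BY revenue DESC) AS row_num
FROM   dbo.fact_revenue;
```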
Now the last topic that I wanted to cover is subqueries. For all of the other examples within this video so far we've been using SELECT star from a table, right? So if we go back up here, we've got SELECT all of this stuff from dbo fact Revenue. Now a subquery basically doesn't have that structure of SELECT star from a dbo table; instead we can actually open up brackets here, and rather than getting the data from the whole table, we can create a subquery that's going to basically prefilter, or do some sort of T-SQL query, and return the results of that subquery back up to the top level here, and we've got to give it this alias of sub. It doesn't have to be sub, that's just what I've called it here, and we'll see the results there. So what this is doing is first going to calculate whatever is in our brackets, so we're going to get the revenue figures just for these two dealer IDs, and then we're going to do SELECT star from that result. So again this is just a bit of a toy example, just to show you subqueries in action, or at least the structure of a subquery, so that if you come across it in the exam you know what it is.
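And just so the shape of it is clear on paper as well, here's a rough sketch of that subquery pattern, again with placeholder table, column and dealer ID names:

    -- the inner query in brackets runs first, and the outer SELECT reads from its result;
    -- the alias is required, but it doesn't have to be called sub
    SELECT *
    FROM (
        SELECT Dealer_ID, Revenue_M
        FROM dbo.fact_revenue
        WHERE Dealer_ID IN ('DLR001', 'DLR017')
    ) AS sub;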
Okay, so here we are in a data warehouse, and it's the same data warehouse that we were using for the first part of this video, the DW gold, so we've got our fact Revenue table and our dim date and dim branch in here. Now specifically I want to show you the visual query editor. We can access it by clicking on new visual query, and immediately we're going to get this introduction: to build a visual query we need to drag some tables onto our canvas. So let's just drag on the fact Revenue and we'll see what we get. The visual query editor basically tries to make it as simple as possible to transform the data that's in your data warehouse, or also in the Lakehouse SQL endpoint as well. So we've got our fact Revenue table, so what can we actually do with it here? Well, you can see along the top we've got some quick options for choosing specific columns, removing columns, that kind of thing, filtering, so removing certain rows, sorting rows, transforming, we can do group by, and we can also do merging and appending different data sets together, so if we want to drag another one of these onto the canvas then we can combine those using either merge or append, but we're not going to do that in this example. Now another thing that you can do within the visual query editor is to click on this plus button and we get access to a few more commands. So we've got all of the ones that are available in that top menu, plus we've also got some transformations, so we can do some text transformations on a specific column, we can do length, finding the first characters, adding columns using for example a conditional column or column from examples, so if you're used to using the Dataflow Power Query engine you might be familiar with creating a new column from examples, and we can also do adding a column from text using these methods here. So say for example I want to do a group by on this fact Revenue table and I want to get the median value, I'm just going to change this to the median, and I want it on the revenue column, and I want to group by our branches. We also have this fuzzy grouping as well, so if this column is not particularly good quality you might want to add in some fuzzy grouping, which is basically going to look at likeness or similarity between different values in that column, and if it's above a certain threshold, so they're very similar, maybe there's just one character that's different, it's going to add them into the same group. So that's what fuzzy grouping would be; we don't want to enable it here because we've already managed our data quality. So if we do an OK here you can see it's added in this step, and now along the bottom you can see the result has actually updated: we've got this group by the branch ID and we've got our new column, which is the median value within that aggregate basically. I'm just going to remove this one, because we don't want to be doing anything with our dim date, so we've got our fact Revenue now, and what we can do is either save as table, which is going to save the results as a new table, or save it as a view. You can see that this is actually greyed out here because you can't save this as a view: the query fact Revenue is not supported as a warehouse view since it cannot be fully translated to SQL, and the reason for that is because we've used this median, and median is not actually a function in T-SQL. So what we can do instead, rather than grouping by and aggregating on that median value, is change this to sum, and yeah, now that error message, or that warning message, has gone away. We might also want to change this to sum of the revenue, and then we can save it as a SQL view. So maybe we want to do the sum of the revenue aggregation, give our view a name, and it also gives you the SQL statement for that view, which we can copy to the clipboard if we want, and we just save it as a view, which you can query from a Power BI semantic model, obviously acknowledging the fact that with a view it's going to fall back to DirectQuery mode if you're using Direct Lake mode to access this data. So that was just a tour of the visual query editor in the data warehouse; I think for the exam just understand what it's capable of, have a look at the different functionality, understand what you can do here, because you might get a question or two about the visual query editor.
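As a rough idea of what ends up in the warehouse when you save that kind of query as a view, the generated SQL would look something along these lines; the view, table and column names here are illustrative rather than copied from the demo:

    -- a grouped aggregation the editor can fully translate to T-SQL (SUM rather than median),
    -- so it can be saved as a warehouse view
    CREATE VIEW dbo.vw_revenue_by_branch
    AS
    SELECT
        Branch_ID,
        SUM(Revenue_M) AS total_revenue
    FROM dbo.fact_revenue
    GROUP BY Branch_ID;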
Okay, so let's just round up this video by looking at some practice questions that you could expect for this section of the exam. Question one: the SQL script creates the results shown below, what is the function in this example? So we're doing SELECT branch ID, column one, and then your function, which you have to work out, from sales, and returning the table below. So take a look at the answers on the right hand side, have a little think about what that function would be to produce the results in the table at the bottom of the page, pause the video here, have a think, and then I'll reveal the answer to you shortly. So the answer here is A: we want to be using a LAG function, because we can see that what we're trying to do is make a transformation from column 1 to column 2, and we can see the pattern there, that on row three in column 2 there's obviously an offset going on, and we can see the offset is two, because row three in column 2 maps to row one in column 1, and we know it's going to be a LAG function because column 2 is actually two rows behind what's in column 1, and you can see that replicated below with rows four and five as well. Another clue here is that we've got two null values at the top: if you've got null values at the top it's going to be a LAG function, because it doesn't have a value for row one in column 2 or row two in column 2, so those are always going to be null. So we know it's got to be a LAG function and we know that the offset is two, so it can't be B, which is a LEAD function, the LEAD function is going to look ahead rather than looking back, and the bottom two there have an offset of minus two, and the offset in a LAG or a LEAD is always positive, so those two would not be correct either. Question two: you have the following query analyzing sales data for various products, your goal is to analyze the sales data by product name and year, but only for the products that have a yearly sales amount of more than $50,000, how would you complete this query? So take a look at the different options there, have a think about how you would come to the conclusion that this question is asking for, and I'll reveal the answer to you shortly. So the answer here that we were looking for is B. The question is asking you to analyze sales data by product name and by year, so that is the first clue: in our GROUP BY we need to be grouping by the product name and the year, not the product key and the date key, because that wouldn't be the right aggregation for what the question is asking. Now the second part of the question is asking for products that have a yearly sales amount of more than 50,000, and that yearly sales amount is going to come from our sum of the sales amount, which is calculated during the aggregation, so we need to be using HAVING here. What we're trying to do is filter on the result of the aggregate; we're not filtering before the aggregate, because that would just filter out individual rows in the table. We want to be looking at the result of an aggregate function, where the sum of the sales amount, not just individual sales amounts, is more than $50,000. So the answer here is B: we need to GROUP BY product name and year, and then we need to use the HAVING statement to only return the product names with a yearly sales amount of more than 50,000. Question three: you're trying to inspect a join between two tables to spot referential integrity violations, which of the following T-SQL join types would make it easiest to identify keys on both sides of the join that do not have a match on the other side of the join? Is it A, left join, B, right join, C, inner join, D, full outer join, or E, cross join? So the answer here is going to be the full outer join. Again, there are two parts to this question really. Firstly we need to understand what a referential integrity violation is: referential integrity relates to joining two tables together, where obviously you're going to be joining on a specific key, and a referential integrity violation would occur when one of the keys in the left hand table is not in the right hand data set, and vice versa. So to be able to identify that using a SQL join we're going to need the full outer join, because the full outer join is going to bring back firstly the instances where those keys do match, and then secondly all of the other rows that don't match, from both tables. So when we get this back we're going to see some null values where there isn't a match on that join for the table on the left, and equally the same for the table on the right, and that's going to help us identify referential integrity violations by investigating where there are null values in that output. The left join and the right join are not going to help us here, because we need to identify violations on both sides of the join, the inner join is only going to return the matching keys from both sides, and the cross join is just going to return all the different combinations of the keys that exist in these tables, so that's not going to help us really identify referential integrity violations.
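To tie questions two and three back to actual syntax, here are two minimal sketches, one for the GROUP BY with HAVING pattern and one for the full outer join check; the table names, column names and keys are made up purely for illustration:

    -- question two: filter on the result of the aggregate with HAVING, not with WHERE
    SELECT
        p.ProductName,
        YEAR(s.OrderDate) AS SalesYear,
        SUM(s.SalesAmount) AS YearlySales
    FROM dbo.FactSales AS s
    JOIN dbo.DimProduct AS p
        ON p.ProductKey = s.ProductKey
    GROUP BY p.ProductName, YEAR(s.OrderDate)
    HAVING SUM(s.SalesAmount) > 50000;

    -- question three: a full outer join keeps unmatched keys from both sides,
    -- so NULLs on either side point at referential integrity problems
    SELECT
        f.ProductKey AS fact_key,
        d.ProductKey AS dim_key
    FROM dbo.FactSales AS f
    FULL OUTER JOIN dbo.DimProduct AS d
        ON d.ProductKey = f.ProductKey
    WHERE f.ProductKey IS NULL
       OR d.ProductKey IS NULL;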
Question four: you have two warehouses, Warehouse 1 and Warehouse 2, you want to create a SQL view in Warehouse 2 that combines data from both warehouses, and your solution should minimize development effort, which solution do you recommend? Is it A, use a data pipeline copy activity to copy the table in Warehouse 1 to Warehouse 2, B, create a shortcut from Warehouse 1 to Warehouse 2 to perform the query, C, use cross-database querying between Warehouse 1 and Warehouse 2, or D, use a Dataflow Gen2 to read the table in Warehouse 1 and set Warehouse 2 as the destination? So the answer here is C, use cross-database querying between Warehouse 1 and Warehouse 2. Within a warehouse we can query any other warehouse within the same workspace, so this would be the most economical option, or require the least amount of effort, because we can just directly query the table in Warehouse 1 from Warehouse 2. B would be incorrect, create a shortcut, because we can't actually create a shortcut from Warehouse 1 to Warehouse 2, that functionality doesn't exist at the moment, and the data pipeline copy activity and the Dataflow Gen2 wouldn't minimize the development effort, so they would technically meet the requirements apart from the requirement that says your solution should minimize development effort. And this is something you might see on quite a few questions within the exam, so when you see it, it's a bit of a flag that you should always look for the most efficient way of doing things; normally there's a method that requires little to no effort against some that require more development effort. So the answer here is C, using cross-database querying.
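As a rough illustration of that answer, a view in Warehouse 2 can reach a table in Warehouse 1 with three-part naming, as long as both warehouses sit in the same workspace; the table names and join key below are hypothetical:

    -- created inside Warehouse2: the three-part name reaches straight into Warehouse1,
    -- so no copy activity, shortcut or Dataflow Gen2 is needed
    CREATE VIEW dbo.vw_combined_sales
    AS
    SELECT
        w1.OrderKey,
        w1.SalesAmount,
        w2.ReturnAmount
    FROM Warehouse1.dbo.FactSales AS w1
    JOIN dbo.FactReturns AS w2
        ON w2.OrderKey = w1.OrderKey;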
Question five: you're using the T-SQL visual query editor and your goal is to add a new column to your data set called salary bins, and this column is going to bin a continuous salary variable into three different bins, less than 30,000, between 30,000 and 60,000, and more than 60,000, which functionality should you use to add this new column? A, add a conditional column, B, add column from examples from all columns, C, add column from examples from selection, D, duplicate the column, or E, add column from text? So the answer here is A, add a conditional column. When we're in the visual query editor there are a few different options to help us add columns, and for this specific use case, creating a new column which is going to bin the values in a specific column, we want to use the conditional column, because that's where we can add in the logic for less than, less than or equal to, and more than, and we can add more than one condition on that column, which would help us achieve our goal of creating the bins in this salary bins column. Now add column from examples, B and C, those two are functionality within the visual editor, but it's going to be very difficult to implement that logic from column from examples, it's not really a good use case for that type of adding column functionality, and D, duplicating the column, well that's not going to do much, it's just going to duplicate the column, and similarly E, adding a column from text, that's also not going to achieve what we're looking for, so the answer here is A, adding a conditional column. We did it, congratulations, we made it to the end of the study guide. This is video 12 out of 12 and we've covered an awful lot of ground over the last 6 weeks, so well done for sticking with it. It's a very tough exam and we've covered a lot of different things; this exam covers a very wide range of topics, from T-SQL, PySpark and DAX to all of the different planning and that sort of thing, so to get this far is a really good job. Now as a bonus I'll be recording one more video in this series, basically to help you prepare to take the exam, so things you need to know before the exam, when you're booking your exam, and while you're in the exam, to try and help you get as good a score as possible. So thank you very much for joining us in this series, I'll see you in the next video, which will be the final video, and I look forward to seeing you all there. Hey everyone, you thought the series was over, but I've just got one more bonus video in this DP-600 exam preparation course. I really wanted to bring you some top tips for the exam, so we're not going to be covering any of the technical content, but I just have some words of advice, or some tips, that might help you when you're actually doing the exam itself. So this is the bonus round of our course plan; if you're just joining us here in this video I recommend you go back through all of the last 12 videos, because that's where we cover most of the content. I wanted to talk to you about booking your exam, preparing for the exam, some advice for during the exam, and then also what you should do after the exam. So, when you're booking the exam: if you've used the AI Skills Challenge voucher then make sure you schedule your exam before the expiration date, and that means you have to schedule the exam to be conducted before that expiration date. Now I think that was the 24th of June 2024, but double check in your own emails that you have that date. If you don't have that AI Skills voucher, which was given away maybe two months ago now, then you can get a 50% off voucher for the exam by completing the Cloud Skills Challenge, and I'll leave a link to that in the description, so if you want 50% off the DP-600 exam and you still haven't booked your exam yet, or you still haven't got a voucher for it, you can use that. And I've said this before, but I always recommend, if possible, taking the exam in person. You might have heard there are a lot of horror stories from people who take it online, you know, the online proctored experience, and they're very very busy and there are always problems with technology, and you have to clear your desk and clear your room, and there are all these kinds of hurdles that you can avoid by going to an in-person test center, so I would very much recommend doing that. I appreciate not everyone lives near a test center, but if you do, or you can commute to one just for the exam, I would definitely recommend it, it reduces a lot of stress, you just walk in, take the exam and walk out. So, preparing for the exam: well, obviously you can go back through these videos and look in the school community as well, I've got lots of notes in there, and the reason I created those was to help people who are just about to take the exam, so they can go through the key points and make sure they remember everything as they go through each of the different sections in that school community, and if you're still not sure about anything in any section of the exam you can dig into the further learning resources that I've linked in every section there. Now one of the key resources that I would definitely use when you're preparing is the official practice assessment for the DP-600 exam, and again I'll leave a link to that in the description. You can go through it multiple times; I think it's a 50 question assessment, but you can refresh the page and you get fresh questions when you refresh, and I don't know exactly how many different questions there are in that question set, but there are definitely more than 50.
So go through that practice assessment numerous times to get a really good idea of the types of questions that you're going to be asked and to highlight any gaps in your knowledge. Another really good resource is the Microsoft Learn Together series that's been running side by side with this series, I think we started around the same time and I think they finished maybe a week or two ago. This is a really good series delivered by Microsoft MVPs, so there's lots and lots of content that you can go through there to dig a bit deeper into any of the aspects in the DP-600 study guide, and again I'll leave a link to that in the description. As well as the technical content from Microsoft MVPs, they've also got a few lectures, or a few videos, that describe what the exam entails, so if you've never taken a Microsoft exam before I would definitely recommend watching some of those videos, I think they're at the end of that series, where they walk through the types of questions you might get, how to prepare, how much time you have, all of that kind of stuff, just so that you can enter the exam as prepared as possible. Now as well as that there's lots of other great content from other people in the community: I definitely recommend Data Mozart, Nikola's blog, and also his YouTube channel as well. There's the Data Guy newsletter on LinkedIn, managed by Abu Baka, where he's basically collated a lot of learning resources for each section of the study guide, definitely recommend that, and similarly there's a post on Serverless SQL by Andy Cutler, which is basically a guide to the exam as well, and in there he also highlights a lot of resources for the exam. So, during the exam: I would say don't spend too long on one question, if you get stuck then you can flag it for review and come back to it at the end if you have time. Now a lot of people are saying that there's not much time in this exam: you have 100 minutes to actually do the exam and there are normally around 55 to 60 questions, and some of the questions are very wordy, so it might take you 30 seconds or even a minute just to understand what they're asking you. So if you get stuck and you don't know the answer, don't waste too long on one question, just flag it for review and come back to it at the end. Now you will receive at least one case study question, where you'll be given a lot of context about a business and you'll be asked a series of questions afterwards. I definitely recommend that you use a pen and paper to draw the case study architecture as you're reading that question, because normally these case studies are very complex and they'll describe a scenario to you, for example, the company has capacity 1, and in capacity 1 there's workspace 1 and workspace 2, and in workspace 1 there's lakehouse 1 and lakehouse 2, and so on; when you're reading this you don't really take any of it in, so I definitely recommend drawing it as you're reading it. This can really help you when you go forward to answering the questions, because you don't have to keep going back to the case study context, you can just refer to your diagram. If you don't know an answer, guess, you don't lose marks for an incorrect guess, so you might as well. And bear in mind that the questions were created many months ago, and I don't think they've been updated since the exam first came out, so a lot of the more recent features are not going to be the correct answer, because they weren't in general availability when the questions were created.
Now a lot of the questions, I would argue, are ambiguous, or at least the answers are, so what you have to keep asking yourself is: what do I think the Microsoft examiner is expecting to see here? So don't try and be too clever, always think about what Microsoft is trying to teach us about Microsoft Fabric, how do they want us to use the platform, and answer your questions with that in mind. So, after the exam: you will know if you passed or failed directly after the exam, you get the results straight away. Now if you fail, I would say don't be too hard on yourself, honestly it's a very very tough exam, it expects you to be familiar with a very wide range of topics, right, for starters SQL, PySpark, M and DAX, plus all of the planning stuff, and all the lakehouses, data pipelines and data flows; it's a very wide ranging exam that you need to be at least familiar with, so if you do fail, don't be too hard on yourself at all, and if you pass, then congratulations, you can be very very proud of your achievement. So that is all I have, thank you so much for joining me in this series, and the best of luck to all of you taking the exam in the future. I'm going to be taking a short break now and I'll be back on the channel with more videos teaching Fabric in the future, thank you.