Hello and welcome, everybody, to Cloud Fitness. In today's video I am going to show you the different ways in which we can connect ADLS Gen2 to Databricks, and we will do it as a hands-on session. This is an on-demand video: one of you commented asking for a video on this topic, so here I am. I made this PPT around eight to nine months back, but instead of going through the theory, I will go directly to the Azure portal and show you how we can accomplish this.
The very first thing to understand is that you need an ADLS Gen2 storage account and Databricks. I have already created an ADLS Gen2 storage account; if you have watched my previous video, you already know how to create one. Inside my storage account I have a container, named container itself, and inside it I have uploaded a CSV file. On the Databricks side, I already have Databricks set up.
Now, when you think about connecting a storage account to Databricks, the first thing that should come to mind is that to establish any connectivity you need a username and a password. It is the same for a storage account. Your username is the storage account name, which for me is youtubestorageaccount01, and the password is the access key that you see over here. You can use this username and password directly to connect. That is not the recommended approach, but you should know about it because it can come in handy. If you have, say, thousands of storage accounts, you have to store and maintain the keys for each and every one of them, so security and data governance come into the picture. But let us see how to use it.
What you see on the screen right now is the storage account key. Let me first copy my storage account name and go to Databricks, where I have command 5 on the screen. The first part of the configuration I am setting is the storage account name: I am telling Databricks, hey, this is the name of my storage account. The second part is the key. In the portal you can click to show the keys, but don't show them to anybody else. This is just a demo video; otherwise you are not supposed to share your keys, because the other person will have full access. Copy the key and paste it over here.
I have a cluster already running, so let me attach it and run this command. The moment I run it, connectivity is established using the username and password.
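
As a rough sketch of what a cell like command 5 sets, assuming the standard ABFS account-key setting: the helper name `account_key_conf` is hypothetical, the account name is my best reading of the name used in the video, and the key is a placeholder. The `spark.conf.set` call is left as a comment because `spark` only exists inside a Databricks notebook.

```python
from typing import Tuple


def account_key_conf(storage_account: str, access_key: str) -> Tuple[str, str]:
    """Return the (name, value) pair that points Spark at a storage account key."""
    # The storage account name (the "username") is part of the setting name;
    # the access key (the "password") is the setting value.
    name = f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"
    return name, access_key


# Assumed account name and a placeholder key; never hard-code a real key.
conf_name, conf_value = account_key_conf(
    "youtubestorageaccount01", "<storage-account-access-key>"
)
print(conf_name)

# In a Databricks notebook you would then run:
# spark.conf.set(conf_name, conf_value)
```

Keeping the key out of the notebook text, for example by reading it from a secret, is the safer variant of the same idea.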
Now, in command 6, I am using dbutils. I have already made a video on dbutils, so you can go ahead and watch that.
Here I am just listing what is inside my container. In the path, container is the file system.
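
To make the path in command 6 concrete, here is a minimal sketch of how an `abfss://` path is put together. The helper `abfss_uri` is hypothetical, and the container and account names are the ones I assume from the video; the `dbutils.fs.ls` call is commented out because `dbutils` is only available in a Databricks notebook.

```python
def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an abfss:// URI: container (file system) before the @, account after it."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


# Assumed names: container "container" in account "youtubestorageaccount01".
container_root = abfss_uri("container", "youtubestorageaccount01")
print(container_root)

# In a Databricks notebook, once the account key is configured:
# display(dbutils.fs.ls(container_root))
```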
And youtubestorageaccount01 is the name of my storage account. So I am listing what is there, and you can see that it shows the exact file name as well as the size of the file. This is how your connectivity is established, and this is the first way.
Now coming to the second way. As I told you, it is difficult to maintain these passwords for each and every storage account, and you have data governance and security issues as well. For that, you have the option to create a service principal, give that service principal access to your storage account, and use the service principal's credentials to log in from Databricks. That is the second approach. In this case you are not using or displaying the storage account key at all. You are only using the service principal, and whoever has access to that service principal will be able to see your content.
So how do we do it? The first thing you see over here is the app registration. You go to the Azure portal and create an app registration, and through that you are establishing a service principal. After creating the app registration, you create the connectivity between the two by giving the app registration access to the storage account.
Once you have done that, your app registration has a secret. This secret, together with the app's ID, is what you use to connect from Databricks. But we are not even exposing the secret directly in Databricks.
We store it in a Key Vault, and when using it in Databricks, we use a scope. A scope is just a wrapper around the Key Vault: you use the scope to read Key Vault secrets in Databricks. That is the overall picture, but what happens in the background? Whenever Databricks is trying to access your storage account, the credentials go through the app registration to Azure Active Directory. The credentials are not given to Databricks; they are given to Azure Active Directory, and Azure Active Directory in return gives a token to the app registration.
The app registration gets the token and shares it with Databricks. Once Databricks gets the token, it confirms with Azure Active Directory whether it is a valid token. Once Azure Active Directory confirms that yes, this is a valid token, the connectivity is established. So you can see that what is exchanged here is not your username and password.
What is exchanged is the token. So that is the background. Now let's go to the portal and create our app registration. Type it in the search bar and you will see App registrations. Click on New, write the app registration name, and without thinking much, click on Register. You land on this page, where you can see that it has an object ID, an application ID, and a tenant ID. As for the object ID: in Azure everything is an object.
So every object is linked to an object ID by default. Then you have Certificates & secrets. Go ahead, click on New client secret, and let's say you name it client secret.
I'm just giving it a name. Click on Add, and the moment you do, the client secret appears over here along with its value. Remember to copy this value right away, because the moment you refresh this page the value will be gone and you will not be able to copy it. Once you have done this app registration, you have created a service principal.
Now go to your storage account. Under IAM of your storage account, you will see the Add role assignment option. Search for Storage Blob Data Contributor. It is a built-in role that allows you to read, write, and delete. Click on it, click Next, and it will ask to whom you want to assign this access, so you have to select the member.
Click on Select members. Remember what I explained: your service principal, that is your app registration, should have access to your storage account, so you have to give the access to the app registration. This is where you type the name of your app registration.
Click on it and then Select. You can see that your app registration appears over here with its object ID. Click Next, then Review + assign. So, going back to the slide: you have created an app registration with a secret, and you have given the app registration access to the storage account. You can also verify the access: type the app registration name and check its access, and you can see that Storage Blob Data Contributor has been assigned to it.
Once you have done that, you want to save the app registration's secret in a Key Vault. A Key Vault is just used for storing your passwords. So let's click on Create a key vault. Let me quickly select the resource group, enter a name, say youtubekeyvault02, choose the location of my choice, and go to Access policy. Remember, there has to be a link between the app registration and the Key Vault, because the Key Vault will store the client ID and secret from your app registration. The Key Vault should be able to read the secret, and when the scope asks it to list the secret, it should be able to list it. That is the kind of permission you give here. So click on Add access policy.
There are several permissions, and you can give whatever permissions you want.
But Get and List are the mandatory ones, because you need to get the password and you need to list the password. Then you need to select the principal. Your principal here is the app registration.
So click on the app registration, click Select, and then click Add. The moment you do, your app registration shows up over here. Now click through Networking and Tags to Review + create. While the Key Vault is being created, you can see that we have established the full connectivity. And now you can see that the deployment is complete.
I'll go to the youtubekeyvault02 Key Vault and click on Secrets. Now I have to import the secret from my service principal, that is, from my app registration.
Let me name it secret itself. For the value, I use what we copied from the app registration. This is the value of my secret.
I'll just copy it; in fact, you should copy it beforehand and keep it somewhere. Then simply click on Create, and now you also have a secret in the Key Vault. Next, go to Databricks, where you have to create a scope. What is a scope? In Databricks, the scope is what reads your Key Vault. You can create it through the command line interface, but I am not going to show the CLI right now; I am going to show you how to do it through the UI. Copy your Databricks URL, append #secrets/createScope, and this leads you to a page where you can create a scope. You write a scope name here; for the scope name I'll use the Key Vault name itself so that you don't get confused, because the scope does nothing of its own.
It is just reading your Key Vault. Let me paste the name here, so this is your scope name. Then it asks for the DNS name and the resource ID. For both, go to the Properties of your vault, where you will see the Vault URI.
Copy the Vault URI and paste it in as the DNS name. Similarly, copy the resource ID from the same page and paste it in. What is a DNS name? Say you have a virtual machine at 10.10.something.something; instead of calling it by that address, you call it by its DNS name. Click on Create and you will see that the secret scope has been created. That is how you add a secret scope; you can also do it through the command line interface. We have now set up everything, so let me go back to the slide I was showing you.
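
The scope-creation step can be sketched like this. It is only an illustration: the helper names are hypothetical, the workspace URL is a made-up example, the vault name is assumed from the video, and the Vault URI pattern assumes the Azure public cloud. In practice you copy the Vault URI and resource ID from the vault's Properties page rather than building them.

```python
def create_scope_url(workspace_url: str) -> str:
    """Append the secret-scope creation fragment to a Databricks workspace URL."""
    # The fragment is case sensitive: "createScope" with a capital S.
    return workspace_url.rstrip("/") + "#secrets/createScope"


def key_vault_dns_name(vault_name: str) -> str:
    """Vault URI in the form shown on the vault's Properties page (public cloud)."""
    return f"https://{vault_name}.vault.azure.net/"


# Hypothetical workspace URL and assumed vault name, for illustration only.
scope_page = create_scope_url("https://adb-1234567890123456.7.azuredatabricks.net/")
dns_name = key_vault_dns_name("youtubekeyvault02")

print(scope_page)
print(dns_name)
```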
You have created a scope, a Key Vault, and an app registration. You have established connectivity between the app registration and the storage account, and between the app registration and the Key Vault.
Now, over here are the few commands that allow you to connect through a service principal. You can find these commands online as well; there is nothing fancy about them. They are just a few configuration lines where you authorize the connection. The first one is the client ID. If you go to App registrations (let me refresh this), you see the app registration that we created, and it has the client ID right here. That is what you copy, so let me paste my client ID here.
I'll come back to the secret, but the last line is the client endpoint. You use the same URL, with your tenant ID in the middle. For the tenant ID, go back to the app registration, where you see Directory (tenant) ID, and paste it in. The directory ID is also called the tenant ID, so wherever you see directory ID, it is essentially your tenant ID.
Now the scope. The first thing was the service principal's ID, and the second thing you read here is the service principal's secret. That secret is stored in the Key Vault, and you reach the Key Vault through a secret scope, so here you give the name of the scope. What was the name of our scope? youtubekeyvault02. I'll copy the exact name; it is the same name that we kept for the Key Vault. Then the key: the key is just the secret. If you go to the Key Vault and click on Secrets (let me refresh it), there is the secret you created inside the Key Vault, and that is the one you are using here.
So these lines are your connectivity lines. After that, say you are moving your code to production; in that case you will typically use a mount point.
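
Those connectivity lines look roughly like the sketch below. It follows the commonly documented OAuth settings for a service principal rather than the exact cell from the video; the function name is hypothetical and the IDs are placeholders. In a notebook the secret would come from the scope, which is shown only as a comment because `dbutils` does not exist outside Databricks.

```python
def service_principal_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Build the OAuth (client credentials) settings used to reach ADLS Gen2."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        # Application (client) ID of the app registration.
        "fs.azure.account.oauth2.client.id": client_id,
        # Client secret of the app registration, ideally read from a secret scope.
        "fs.azure.account.oauth2.client.secret": client_secret,
        # Token endpoint with the directory (tenant) ID in the middle.
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Placeholders only. In a Databricks notebook the secret would be read as:
# client_secret = dbutils.secrets.get(scope="youtubekeyvault02", key="secret")
configs = service_principal_configs("<client-id>", "<client-secret>", "<tenant-id>")
for name in configs:
    print(name)
```

The scope and key names in the comment are the ones assumed from the video: the scope named after the Key Vault, and the secret named secret.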
A mount point is just a location inside your container that you want to reuse again and again. If you don't use a mount point and you run your code 1,000 times, your connectivity is established 1,000 times. If you mount, it is not established 1,000 times. When I say establishing connectivity, I mean the token exchange that happens.
The token exchange is a little time-consuming, that's all, and it does not happen again once you have mounted your container. Say I have a storage account with a container named container. If I mount this container, then until I unmount it, the connectivity remains; it does not create and exchange a token again. That is the difference between using a mount point and not using one.
Here I have my youtubestorageaccount01 and my container. This is the file system, the container name, that I am specifying, and I am mounting this container. Let me run this command. I am running it on a simple standard cluster. OK, it is already mounted, so you can see this error as well. I had already mounted this point before making this video, so I have to unmount it. Let me take the mount point from here, copy it, and run the unmount command. Whenever you get this error, go ahead and unmount. Even when you are writing your code, mount at the start and unmount once the full code has run, so that the next time you run it you don't get this kind of error.
You can see that it is unmounting my container. Once it is unmounted, I can mount it again and then try to read the file. And there is the message that the container has been unmounted.
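
One way to avoid the "already mounted" error is to check before mounting. The sketch below is a variation on the mount-then-unmount habit described above, not the cell from the video: `remount` is a hypothetical helper, and `dbutils` is passed in as an argument so the function can be defined anywhere, although it only does real work inside a Databricks notebook.

```python
def remount(dbutils, source: str, mount_point: str, extra_configs: dict) -> None:
    """Unmount mount_point if it is already mounted, then mount source there."""
    # dbutils.fs.mounts() lists current mounts; each entry exposes .mountPoint.
    already_mounted = any(
        m.mountPoint == mount_point for m in dbutils.fs.mounts()
    )
    if already_mounted:
        dbutils.fs.unmount(mount_point)
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )


# In a Databricks notebook (names assumed from the video, configs as a placeholder):
# remount(
#     dbutils,
#     source="abfss://container@youtubestorageaccount01.dfs.core.windows.net/",
#     mount_point="/mnt/container",
#     extra_configs={},  # the service principal OAuth settings go here
# )
```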
Now if I run the mount command, you will see that it starts running and my container gets mounted again.
While it mounts, let me show you the second command I am running. Here I am reading a file that is present inside my container. You can use your mount point, or you can give the full address of your file. I am simply reading it into a DataFrame; most of you who have been watching my videos already know how to read a file into a DataFrame. You already know that I have a random CSV file inside my container. It has a header, and I am just reading it. Why am I able to read it? Because I have established the connectivity. And you can see that yes, it has worked. This is the most common way of connecting ADLS to Databricks.
Now let me go back to the cluster and show you the type of cluster I am using. Everything I have shown you so far, I have run on this cluster, test01.
If you look here, I have not enabled credential passthrough on it. But I have another cluster, named HC cluster, which is a high concurrency cluster.
High concurrency essentially means that multiple users can use it. On that one, there is an option to enable credential passthrough, and I have enabled it. If you are using a high concurrency cluster with credential passthrough enabled, you can use this command as well: you directly give the path and try to read the file.
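
A read of this kind can be sketched as below, wrapped in try and except so that an access problem is reported instead of stopping the notebook. Treat it as an assumption-laden sketch: `read_csv_or_report` is a hypothetical helper, the file name in the comment is made up, and `spark` is passed in because it only exists on a cluster.

```python
def read_csv_or_report(spark, path: str):
    """Try to read a CSV that has a header row; print the error and return None on failure."""
    try:
        return spark.read.option("header", "true").csv(path)
    except Exception as err:
        # For example, the signed-in user has no access to the storage account.
        print(f"Could not read {path}: {err}")
        return None


# On a credential passthrough cluster (hypothetical file name):
# df = read_csv_or_report(
#     spark,
#     "abfss://container@youtubestorageaccount01.dfs.core.windows.net/<file>.csv",
# )
```

Note that there are no keys or service principal settings here; with credential passthrough, the identity of whoever runs the command decides whether the read succeeds.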
Let me do this again. Here the location is not correct, so I need to change it.
Let me copy the correct name of my storage account and paste it here, because the name here was slightly different. The rest looks OK: the container is container. Now let me try to read it. OK, the notebook is still attached to test01.
So let me move it to the other cluster: detach it and reattach it. Now you see I am attached to the high concurrency cluster, which has credential passthrough enabled. That is the main thing: credential passthrough has to be enabled on a high concurrency cluster for this command to work.
If I go ahead and run this, it should work. So this is another way of connecting ADLS to Databricks. What you need to understand is that in this case it is using your own credentials: the credentials of the person running the command, here mine, are used to authorize the access. Also remember that with credential passthrough you cannot use R and Scala commands. So those are two things you need to take care of. This option is not widely used, because it relies on the credentials of whoever runs the command. Right now it is using my credentials, and they most probably do not have access, because I have not set that up. If your credentials do not have the proper level of access, it throws an error, as it has in my case. This is not something you will use in production-ready code, because you do not want a user to authenticate and authorize your commands in production. You will most probably go for the earlier approach that I showed you. The try and except here is just error handling: a slightly fancy way of reporting that yes, this is my error.
So this is what I wanted to show you in this video. To recap: we created an app registration, and we gave Storage Blob Data Contributor access to the app registration in the IAM of the storage account. I am repeating the steps so that you remember them. In the background, everything goes to Azure Active Directory for authentication in the form of tokens. Tokens are what is used in the background, not your username and password.
Of course, you are not able to see that. Then you generate a secret from the app registration, create a Key Vault, and store the secret in it. Finally you come to Databricks and create a scope so that you can use that secret. That is pretty much what I wanted to explain in this video. Do let me know in the comments if you have any doubts or if you want me to make a video on any other topic. Thank you so much for staying till here. Do remember to like, subscribe, and share my video. Thank you so much.