Transcript for:
CI/CD Setup for Databricks

Hello and welcome, everybody, to Cloud Fitness. In today's video we are going to talk about continuous integration and continuous deployment in Databricks, so essentially we will be seeing how we can do CI/CD in Databricks. This was one of the most important and most requested videos, especially in terms of Databricks. For Azure Data Factory I have already created a CI/CD video, and you can go ahead and watch that. Bear with me on this particular video, because it is definitely going to be a bit longer.

Now let's move on to the portal and see what the basic necessities are. Since you're watching this video, I assume that you already know about Databricks, you have been watching my previous videos, and you already have a Databricks instance created and are working on it. You should also have an Azure DevOps account created. Always remember that the user ID with which you created the Azure DevOps account should be the same as your Databricks one. If you look at the email ID I am signed in with in Databricks, it is the same over here in Azure DevOps. This has to be the same.

Whenever you create this Azure DevOps account, you will land on this particular page, and on the right-hand side you can see something called New project. You can click on New project and get started by creating a new project. Let me just name the project, let's say, 258 456. Now, what exactly is this project? This is the place where you will be working in Azure DevOps. You create a project, you work in that project, and this is the place where you are going to store the
code of your Databricks, and you will have all the DevOps features available here. You can create multiple projects within the same account as well. This is not a detailed video on Azure DevOps, but I'll just give you an overview. The moment you create this project, on the left-hand side you can see that you have Dashboards, and you have a Wiki page where you can add some documents. You have Boards: if your organization uses it to create work items, you can link your work items here and track your sprints. You have a Repos section here. This Repos section is where your code will reside, so if you have code in Databricks and you want to store it in Azure DevOps, the Repos section is where you are going to do it. Similarly, if you want to create pipelines, the CI/CD pipelines you must have heard of, and we will be creating one, those pipelines come here: release pipelines, integration pipelines and deployment pipelines all come here. If you have any test plans and artifacts, they reside here. At the bottom you also have something called Project settings, but this is not something we are going to discuss today.

Going back to the Repos section, this is how the initial setup of Repos will look. In case you want to work on it, you have to initialize this repository, and to initialize it you just click on this Initialize option here. Let me click it. The moment you initialize it, you can see your repository getting created over here, a completely blank repository. This is the name of the repository over here, and this
is the project; it is the same project name that we created. Inside it you will have multiple branches: there is the main branch, and you can add as many branches as you want over here. That is a little overview of the DevOps part.

Now coming to the Databricks side of it. In Databricks you have two ways in which you can do continuous integration and continuous deployment. One way is the old-school way, which is not used much. We will discuss both, but I will show a demo for only one. If you go to the Workspace on the left-hand side and go to Users, you might have multiple users over here. If I go to any one user and click on any notebook, on the right-hand side you can see that there is something called Revision history. This is one way in which you can do CI/CD, but it is not the preferred one. If you click on Revision history, you can see all the revisions, all the changes that were made to your notebook, reflected here, and on the right-hand side you have something called Git: Not linked. If you click on it, in this blue box they have already mentioned that this integration functionality is a legacy feature, so it is quite an old one. If you click on the Link option over here, you can paste the URL of your DevOps repository, select whatever branch you might have, and give the path in the Git repository where you want to store your code. This is how you can link it. The moment you link it, whenever you make any changes and save, it will go directly to your Azure DevOps over here and
store the code. But this is not used much. The reason is that, as you see over here, I am one user, and similarly there might be multiple developers. If you have multiple developers, every developer will have to link DevOps to their own, if I may say, local development environment. If they want to link it to some other branch, they first have to delink it and then relink it, and it can become really difficult to work using this approach. That is why this one is obsolete, and we will talk in more detail about the second approach, the preferred one. In the preferred one you have something called Repos on the left-hand side.

Before moving on to this Repos option, let me talk about a few basics. In this Excel you can see what essentially happens. You have a Git server, and you have developer 1, developer 2, developer 3, multiple developers working in their own local environments. The Git server is the place where the latest version of the code is kept. What a developer does is take the latest code from the Git server, do the development, test it, and then push it back to the Git server. So the first part is that the developer pulls the latest code from the Git server, then develops and tests, and then does a git push to push it back to the Git server. In between, approvals are always involved. This is a typical process. It is a very easy process when someone is developing in Python or Java, because it is easy to pull the code into their own local environment and execute it. But in the case of Databricks, everything is happening on the cloud. Now, in this case, how
everything works in the case of Databricks is that you have something called Repos. If I go back to the Repos section, you see that right now I am the owner and I do not have multiple users here, but in your organization, or in your Databricks instance, you might have multiple users. When you have multiple users, you will see a folder with the name of each and every user over here. What you can do is use the option called Add Repo; we will click on it and do it in a moment. With it you can add a repository over here. What does adding a repository mean? It essentially means that you are linking your repository to your development space. This is your development space, your own folder inside Repos, and in this development space you can do all your development. How do you link your development space to Azure DevOps? You link it to Azure DevOps, pull the latest code from Azure DevOps over here, do the development and make the changes, and then push it back to Azure DevOps. That is how the first step takes place.

That is the first part. For the second part, you might ask: if push and pull are going to happen on the Databricks notebooks, then when you want to schedule the notebooks as a job, there should be some standard notebook where your latest code is always present, and that code is what your Databricks jobs use. For that, whenever a developer is developing, he first connects to Azure DevOps and pulls the latest code. The moment he pulls the
latest code, he does the development and pushes the code back to Azure DevOps. From Azure DevOps, a pipeline can then run which points to a folder over here in the workspace. You can add any folder; I have just added a folder named release. In this release folder the latest code will be present, and this latest code can be used for scheduling your notebooks or any other kind of work. That is the use case. So what is happening here? The moment a developer pushes the code back to Azure DevOps and it reaches, let's say, the master branch or the main branch, a deployment pipeline can run which takes the code from Azure DevOps and puts it into this release folder, and you can use this release folder for your scheduling. This is what we are going to see today.

Now if I go back to Repos and click on Add Repo here, you can see that there is something called Git repository URL. What is this Git repository URL? Go back to Azure DevOps, click on the Clone option over here, copy this URL, and simply paste it over here. You can see that the Git provider, Azure DevOps Services, is selected automatically. There are multiple Git providers, so if you're using Bitbucket, GitHub or anything else you can use that as well, but for this demo session we are going to use Azure DevOps Services. The repository name is just for your reference; you can keep it as anything. Just click on Submit. The moment you click on Submit, you will see that it says creating repo in the background, and then you will see that this repo got created. So this is how your repository will get
created. This is the repo name, and it also tells you which branch it is linked to: it is linked to the main branch. If you click on that, you can see that it is linked to this particular repository and this particular main branch. If you want to change the branch, you can create a new branch and simply select it; that is all you need to do. In case you want to pull the latest code from the repository, you can always hit this Pull option, and it will pull the latest code from the repository.

You can see that it has taken the README.md file. Once I linked it, it took the latest files from the repository. This README.md file is created by default whenever you create a repository, and you can see that the file has come up here as well. Now let's say that I am a developer doing some kind of development; it can be anything. Let me create a notebook over here, let's say test notebook. In that notebook let me just write a normal print statement, let's say print demo notebook. Let's say this is the code that I have. Let me refresh: you can see that this code is still not present in Azure DevOps. You have linked the repo, but the code is not there. Why? Because you need to push it. How do you do that? Let me go back here. You can click on this test notebook, or in fact you can simply click on this project, so let me do that. Once you click on this project, this folder where you are is nothing but your local environment, and you have copied all the latest code from
Azure DevOps. Now if you go here and click on Git, the moment you click on Git you will see this page opening. It shows you that this particular file has been added. It tells you that this folder is linked to your Azure DevOps and that you have added one file here which is not present over there. If you want to commit it, which means you want to push this changed file to Azure DevOps, add a commit message over here. The green lines are the changes that you are making. Then just say Commit & Push. The moment you do this commit and push, you can see that it says successfully committed and pushed changes to the main branch, create a pull request on your Git provider. If I go back over here and refresh, you will see that this test notebook.py actually came over here. This is how it works whenever you write code and want to push it to some branch. This branch can be a feature branch as well; say this is your feature branch, for example. From this feature branch we create a pull request. Let's say create a new pull request, and we select the branch where our code is present. Right now our code... okay, let me do one thing and go back over here to show you this part as well. If we go to the Repos section over here, in main, let me create a new branch. Let me name it New Branch. This is how you can create a new branch, so now my new branch is created. If I go back over here and click here, I can select the new branch, so now the repo will be linked to my new branch. It is just linked right now. My new branch is actually created from my main
branch, so whatever changes were already there in the main branch came to the new branch. Now let me add another file for my new branch over here. Let me create a notebook, let's say New Branch notebook, and simply create it. In it let me add a print statement, let's say New Branch changes. So, for example, these are the new branch changes. If I go here in this new branch in Azure DevOps, nothing new will show up; everything is the older version. Now I go back over here, and if I click here, you can see it says that you have added this new file which is not present in the new branch. The repo is linked to the new branch and you have added a new file over here. Just add a commit message, new commit, and then simply say Commit & Push. The moment you do this, if I go and refresh here in this new branch, you can see that this new branch .py file got added.

This can be a feature branch, where a developer is working and is allowed to push and merge changes. From here a pull request is created. The pull request goes from your working branch to whatever target branch you choose, and it goes through a set of reviewers. You simply create it, and the approvers check whatever files were added or updated and what the commits are, and then they approve it or complete it. When you complete it, the changes from your new branch go to the main branch. So a developer can create his or her own branch, make the changes in that branch, and then push the changes to the main branch after the code has been reviewed and all the approvals are in. Always remember that whenever you complete a pull request, you should delete the new branch after merging. Otherwise developers will keep creating branches, and in the end you will have too many unnecessary branches lying around. Let me just complete the merge. You can see it says completed, and if I go back to Repos and look at the main branch, you can see that even the main branch has this notebook now.

This is the process in general. But you would say that we are not going to come again and again and create a pull request from this UI; you will have to do it programmatically. This is nothing but your continuous integration and continuous deployment pipeline, which we will be creating now.
This is the process, and with the pipeline that we will create, whenever a person merges code to a branch, a folder will automatically be created in the workspace, let me go back here, over here in the Workspace part. This is all part of continuous integration and continuous deployment.

Remember one thing. Let me show you this part as well; I'm just switching back to the main branch, because this is only a demo. Whenever you try to push changes from here, the first time you might get an error. If you go to the Settings option, go to the Admin Console, and go to Workspace settings, you will see that there is a section called Repos. You can put in some restrictions using these options over here. The restriction can be that you want changes pushed only to particular URLs. Let's say you choose restrict clone, commit and push to allowed Git repositories, so developers are only allowed to push to the allowed Git repositories. You select that option, and then here in this space you need to define the URLs of the repositories that you want to commit to. Only those repositories which you mention over here will be allowed. This is something you need to check. Right now I have it disabled: I have no restrictions, and I can commit and push to any repository. Keep this in mind, otherwise this setting will give you an error while pushing. Let me refresh this part again and go back to Repos here. Right now I am connected to the main branch. Now, I hope you
understood how the integration part with your DevOps happens. Now let me go back to Azure DevOps. What I want is this: the moment a change reaches the main branch, that is, a developer commits to the feature branch, everything goes through the PR process, everything is approved, and the change finally comes to the main branch, then this release folder in my workspace should automatically get updated with the latest code. Right now this release folder has some code: it has a demo notebook 01 with some code written in it. But I want my release folder to be updated automatically whenever a developer makes any change in the main branch, so that I can use the latest code, not just for deployments but essentially for job scheduling. For that, we have to create a pipeline. On the left-hand side there is a Pipelines option. Let us click it, and this is how it will look. Let me click on Create Pipeline.

In the meantime, in case you have not yet subscribed to my channel, I do request all of you to subscribe and share this video as much as possible, because I think this is a really important one.

Let me click on Create Pipeline. You can see that there is an option to choose where your code is, the versioning tool. You are using Azure Repos Git, so we'll click on the first option, and then this is the project that we are using. This pipeline is nothing but a piece of code that we are going to write, and we are going to write it in a very simple language, which is YAML. So let me just click on Starter pipeline, so we are
going to write the pipeline. YAML is very easy to understand and very easy to write as well. Let me remove everything from here. YAML is nothing but a simple, human-readable format in which the pipeline is described; it is like simple English with a little bit of syntax. Just to save time, because this video will otherwise become really long, I have already written a script, so let me paste it here. We will go through it step by step to see how everything works. Let me open another instance as well.

If you look at the YAML pipeline, the very first thing that I have written is the pool. This pool is nothing but a virtual machine, you can call it a DevOps virtual machine, where your script will run. The script you are writing needs some machine to run on, and this pool is that machine. By default you get ubuntu-latest, a Linux machine, so you can use that. Variables are a simple set of variables which you are going to use throughout the script; we will come back to them later. Then you go on and write the steps. We are going to discuss these variables and everything else in detail.

Now, what steps are involved? The goal is that the moment a developer makes any change to the main branch, my folder should get updated automatically. How do I do this whole thing? The very first thing is to take whatever latest code I have in my repository and bundle it together, and that bundle is called an artifact. So all the
code from this particular project, when you bundle it together, is called an artifact. I want to take all the code as an artifact, then connect to this Databricks instance and go to this particular folder. When I go to this folder, what do I want to do? I want to remove everything that is present inside this release folder and then recreate everything using the artifact which I have downloaded. And what is the artifact? It is nothing but the latest code present in my repository, which has been pushed by the developers. This is how it will work.

Now if I go back to the script, this is exactly what the script is doing. Everything here is defined in the form of tasks. The first task is publish pipeline artifact, and these paths are the default variable paths that are available inside your virtual machine when you write YAML scripts. Next, you want to connect to your Databricks instance from the machine, which is why you say pip install databricks-cli: install the Databricks command-line interface on the machine. Once it is installed, to log in to the Databricks instance I need credentials, like a username and password, and for that I need the Databricks host and the Databricks token. The Databricks token you get from Settings, User Settings: simply click on Generate new token, get the token from here, and keep it. I'll show you where to put it. So you have your Databricks token. Coming to the host part, what is your host? The part I am highlighting on the screen is your host name: the https and the whole URL up to this .net part is your host. So you have taken this URL as the host name, but if you see here, I have mentioned a dollar sign. Whenever you mention
anything with this dollar symbol, it means that it is a variable: dollar, open bracket, the variable name, close bracket. This variable is kept here, and you would say, but we don't see any variable like Databricks host or token. No, it is not there in the script, because we have kept it in a group; you can see variable group Dev. If I go to this project, go to the Pipelines section on the left-hand side and click on Library, and then click on Variable group, you can see that I can create a group. The group name here is Dev, which is what I have mentioned in the script, but you can keep any name. In this Dev group you simply add the same variables that we mentioned in the script. You have Databricks host there, so write Databricks host over here; similarly you have Databricks token, so write Databricks token here as well. Now I need to provide the value of my Databricks host, and to get it we just need to copy this part of the URL. Once I have copied it, I go to the pipeline library and simply paste it in here. The trailing slash is not needed. This is my Databricks host. Now let me put in the token as well and come back. You can see that I have generated a test token and pasted it over here. If you want, you can click on this icon and your token will be hidden. If I click on this one, you can see that even my host gets masked. If I click on Save, the variable group is created. This variable group has my Databricks host and token, which will be used by this script to log in to the Databricks CLI. After that, the script simply runs one test command, just to check whether it has connected to the CLI or
not: workspace ls, which simply says list whatever is present inside my workspace. After that, it downloads whatever artifact was generated, using this download pipeline artifact task. After that, if you come over here, it checks whether a folder with this folder name is present or not. This folder name is a variable which I have defined at the top; if you see, my folder name is release. You can use any name. In case this folder is present, then go ahead and delete from that folder: that is the delete old release step. Whatever was present in that folder earlier was an older version, so just delete it. Then whatever artifact you have generated, import it and add that new piece of code to the folder. This is essentially what the script is doing. We will see it run, don't worry, and we might come back here so I can tell you step by step what is happening.

Now, here you have not defined any trigger, that is, when the script should run. Ideally you define a trigger at the top. You say trigger, and then the branch name, which can be main, so that whenever any change is made to that branch, the pipeline runs. You can do this kind of customization; the syntax might be a little different here and there. So whenever any change is made to the main branch, the pipeline starts. But in my case I have not added anything, and when you do not add anything, for any change it will
execute right so since it is a demo I'm not uh you know create I have not created multiple branches and all so that's why if I remove this option it is going to run in all the cases now I can say save and run right now when I click on save and run you can actually see that it says commit directly to the main branch it means that whatever your pipeline you have created this script do you want to save it I would say yes save it to my main branch itself so now why if I run it now you can say that it says creating pipeline running and now you can actually see that it has started running as well right save and run this is what we discussed now right now it has just started right so if you can see over here zero artifact so I have told you that it creates artifact it creates it takes the piece of code from the repo creates a file out of it which is nothing but artifact and then deploy that artifact to the datab breakes workspace this place right this is what it is going to do right right now we have demo notebook over here correct now let me just refresh it and at the same time this part is also running this pipeline is also running now let me you can see that it says running and you also see it says one published right now if I click on this one published you can actually see the artifact generated now what is this artifact again if you I go back to the repos over here right this main repo it contains you know yaml pipeline you know New Branch notebook it contains test notebook right so all these you can actually see it over here right what exactly is this right it is the artifact it is the latest code that it has taken from the pipeline right and then you can see that the job has finished and if I go back over here uh to my data brakes and let me just run ex uh you know refresh it and if I go back to my workspace release folder you can see that the two python notebooks that I created test notebook and the new Branch notebook right both got added over here right you 
know, it has taken the files and added them to the release folder. So this is how it actually works.

Also, if I go inside this job, you can see the steps. It has initialized the job, then it has installed the Databricks CLI; this is what we did with pip install databricks-cli, so it started the installation, and it also says it successfully built databricks-cli. Similarly, there is the configure Databricks CLI step. We also did a workspace ls in the script, so it lists whatever is in the workspace: Release, Users, Shared, everything. Then comes download pipeline artifacts, where you can see the chunks downloaded (four) and "download complete". Then the command line step, and then delete old release, which deletes what was present in the release folder. Then new release, which adds the files: you can see it says it has added test notebook and new branch notebook. The artifact also has an .md file and a .yaml file, but these do not have a valid extension for import, so it has skipped them. Then you can see that the job has finished successfully. So this is how it works.

I'll also show you one more thing. In this project you have already seen the artifact. Let me go back to the pipeline that we ran; you can see that one artifact was published, and this artifact had these four files. Now let's say you say: I don't want to include this .md or
YAML file in my artifact; why should I unnecessarily add these files? In that case, in this same repo where you keep your YAML file, you can add a new file and name it .artifactignore. The moment you add this .artifactignore file, the file gets created, and in it you can simply add **/*. What does this **/* mean? It matches every file in this repository, and since this is an ignore file, it means ignore everything. Then on the next line you type !*.py. The exclamation mark means "except", so this line brings back all the .py files. Together, it means do not take anything apart from the .py files. Then you can simply commit it. So this is one of the ways in which you can avoid any extra files that you don't want; it is just one way.

Remember why my pipeline has started: because I have made a change to the main branch. Whenever a developer makes a change to the main branch, it should come from Databricks, not the direct way that I have done it here, and then you will see that this pipeline runs automatically. For example, right now it's running, and you can see that it has again published the artifacts. I want to show you the artifact over here especially: you can now see that this artifact is just the
py files, nothing else. Earlier the artifact had all four files, so this is the difference.

Now, just to make things a little more clear, let's say I go to the repo again. It is linked to the main branch. I'm a developer, so I create a notebook, let's say new development, and I just write anything in it, "new development code". Let me check whether my previous pipeline run has finished. OK, it has finished. So what I'm trying to do is commit this new development notebook to the main branch. Let me go to Pipelines over here; I ran this pipeline two times, as you already know. Now let me commit and push. The moment I commit and push to the main branch, all these things will happen automatically.

Now, why did this error, "remote contain...", come up? Because I have added the .artifactignore file over there, in the repo, if you remember, to show you one feature; sorry, not .gitignore, I added .artifactignore. So I need to pull in the latest changes first. Let me pull from here using this pull option; you always have to pull the latest code from the repository. Then let me commit and push again. Now you can see it has committed and pushed the changes to the main branch. If I go to the Repos part over here, you can see that it has this new development
py file, because it has pushed my changes. Now if I go to the pipeline and refresh, you can see that the new development pipeline run is running. If I go here, it has published the artifacts, and because we have the .artifactignore file, the artifact has only .py files, and it has added the third .py file here. If I go back, you can see that the job is complete. Now if I refresh my Databricks workspace, you will see that in the release folder I have the new development file. From this new development file you can schedule your code to run at any time of the day. So this is how your CI/CD actually works.

Do let me know in the comment section if you have any doubts. And remember, in case you did not understand it in the first go, you can always replay the video and try listening to it a little slower; I kind of tried to complete it within the time span of this video. I hope you liked it, and do remember to like, share and subscribe.
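For reference, a pipeline like the one walked through in this video might look roughly like the sketch below. It is not the exact file from the demo. The variable names (folderName, databricksHost, databricksToken), the artifact name (notebooks) and the step names are assumptions made for illustration. It also assumes the legacy databricks-cli package, with the workspace URL and access token supplied as pipeline variables rather than through a separate configure step.

```yaml
# Sketch only: names and layout are assumptions, not the exact file from the demo.

# Run whenever main changes. If this block is omitted,
# the pipeline runs on pushes to any branch.
trigger:
  branches:
    include:
      - main

pool:
  vmImage: ubuntu-latest

variables:
  folderName: release   # target folder in the workspace; any name works

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.x'

  - script: pip install databricks-cli
    displayName: Install Databricks CLI

  # Publish the repo contents as an artifact. An .artifactignore file
  # in the published directory filters which files are included.
  - publish: $(Build.SourcesDirectory)
    artifact: notebooks

  - task: DownloadPipelineArtifact@2
    inputs:
      artifact: notebooks
      path: $(Pipeline.Workspace)/notebooks

  - script: |
      # List what is currently in the workspace.
      databricks workspace ls
      # Delete the old release only if the folder already exists.
      if databricks workspace ls "/$(folderName)" > /dev/null 2>&1; then
        databricks workspace rm -r "/$(folderName)"
      fi
      # Import the new artifact; files with unsupported extensions are skipped.
      databricks workspace import_dir "$(Pipeline.Workspace)/notebooks" "/$(folderName)"
    displayName: Delete old release and import new release
    env:
      DATABRICKS_HOST: $(databricksHost)     # assumed pipeline variable
      DATABRICKS_TOKEN: $(databricksToken)   # assumed secret pipeline variable
```

With an .artifactignore file containing the two lines `**/*` and `!*.py`, the published artifact would hold only the .py files, which matches what the second and third runs in the video show.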