Hi Santosh, good morning, how are you doing?

Good morning, I'm doing good. How are you?

I'm doing good, thank you. So, you can call me Arun. I have about 15 years of experience and I've been with TCS for more than 10 years; I joined way back in 2012. I've been working as a data scientist and data engineer, with Azure ML and so on. That's one line about me. Can you give me a very brief introduction about yourself?

Sure, let me walk you through my profile. I'm currently working with Microsoft and I have around nine years of experience with data-driven technologies, mainly focusing on data engineering pipelines. I understand the role you're looking at is on the Azure data engineering side, so let me give you a brief idea of what I've worked on. At Microsoft we collect a lot of data coming from third parties; we sell a lot of products, such as Teams and Outlook, and...

Santosh, let me pause you here. Give me a very brief intro, but don't get into the projects right away. I want to understand all the tools and technologies you have worked on, and then we'll get into the project. So first, can you tell me all the tools and technologies you've worked on?

Yeah, sure. To start with, I have worked with SQL, which is my primary skill, and Python is my primary language, which we have used for...

Sorry to interrupt you again. Whenever you say SQL, Python, and so on, I want you to mention the duration as well, like "SQL, I worked with for seven years; Python, six years", and so on.

Okay, thanks for that context and those expectations. I have worked with SQL right from the beginning of my career, so almost eight years in SQL; Python for almost six years; and Azure services and pipeline creation for almost six years as well.

Okay, good to know. Any experience in big data technologies?

Yes. Before jumping into cloud pipelines I was working with on-prem pipelines, and that is where I have around three years of experience with big data: Hive, MapReduce especially, and Pig is also something I've worked with.

Okay, good to know. Anything else in terms of technologies?

Yes, I'm good with BI tools as well, which I usually use for reporting to our clients.

Okay, fine, good to know. When you say Azure, which services have you worked on?

When we talk about Azure, it provides a lot of services, but for data pipelines the ones we really needed were Databricks, which we used for heavy-lifting transformations, Data Factory for orchestration, and for storage we used Azure Data Lake Storage and Blob Storage. These are the services I have experience with.

Before we get started, I want you to rate yourself on a scale of 10 in SQL, Python, PySpark, and Azure as a whole.

Okay. SQL, 8.5 out of 10; Python, 8 out of 10; and PySpark, or big data technologies in general, I would say 7 out of 10.

How about Azure?

Azure, yes, I've been working with almost 10 to 12 services, so I would say around 8 out of 10.
Okay, so you're most comfortable with SQL, so let's start there. I'll give you a small scenario; if you don't mind, can you share your screen and open a notepad? We'll get SQL, Python, and PySpark out of the way first and then jump to Azure, where we'll spend more time. So let's spend a quick five to ten minutes on the other technologies.

Yeah, sure. Is my screen visible?

Yes, I can see it. So assume we have a table with two columns, country and daily sales.

If you don't mind, can I note it down? What is the table name?

You can assume the table name to be anything; the columns are country and d_sales.

D_sales, okay, got it.

So give me a SQL query to find country-wise total sales.

Country-wise total sales, okay. So I can write select country, and you mean this column is the daily sales, right? Then I can simply write select country, sum of the sales, group by country. I think this should work.

Okay, that's fine. Only one small thing: you named the column d_sales but you're using sales in the query, but that's totally fine.

Okay, thanks for the catch.

Is SQL case sensitive?

No, SQL is not case sensitive.

Good, okay, fine, we'll keep moving. Now assume you have this same table schema as a pandas DataFrame, and write the same query in Python.

Okay. I'm assuming that pandas, NumPy, all the libraries, are pre-installed?

Yes, you can use anything.

Okay, in that case I will go with the DataFrame: assuming it is the country DataFrame, group by country, take d_sales, apply sum, and then display the DataFrame. This should give me the answer in Python.

Okay, fine. And after doing group by country, why select d_sales?

That's because we want the aggregate, right? You want the total sum of sales, so I'm taking that column and applying the aggregate function on it.

I understand where you're coming from, but we have only one other column, d_sales, right? So is it really required?

If you were taking something like a count, then any column would do, since there is only one. The reason I've selected it is that you need to sum up the sales column specifically. Of course you can write it in a number of ways, but this is what I prefer so that we explicitly get the sales column.

Okay, fine. Same thing in PySpark, same question. Assume it is a Spark DataFrame.

Okay. Shall I assume the Spark session is already initiated, and just concentrate on the code?

Yes.

In that case I can go with the Spark DataFrame. Here it's almost the same, but there is a syntax change: group by country and then apply sum on d_sales. We can write it that way, and in fact we can also use Spark SQL; with Spark SQL you can write the same SQL query within it. I think this should answer your question.

Yeah, although with Spark SQL you would effectively have written the SQL query again, so that's fine. You've written it in SQL and done it with PySpark as well.
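For reference, here is a minimal sketch of the three equivalent answers discussed above. The table/DataFrame name sales_tbl and the sample rows are assumptions for illustration only; the columns country and d_sales come from the scenario.

```python
# A minimal sketch of the three answers discussed above.
# "sales_tbl" and the sample data are placeholders, not from the interview.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# -- SQL version --
sql_query = """
SELECT country, SUM(d_sales) AS total_sales
FROM sales_tbl
GROUP BY country
"""

# -- pandas version --
pdf = pd.DataFrame({"country": ["IN", "IN", "US"], "d_sales": [10, 20, 30]})
pandas_result = pdf.groupby("country")["d_sales"].sum().reset_index(name="total_sales")

# -- PySpark version --
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
spark_result = sdf.groupBy("country").agg(F.sum("d_sales").alias("total_sales"))
spark_result.show()

# Or via Spark SQL, as mentioned in the discussion:
sdf.createOrReplaceTempView("sales_tbl")
spark.sql(sql_query).show()
```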
Okay, we'll move on to Azure; this is where we'll focus mostly. As a data engineer, can you walk me through all your day-to-day activities?

Okay. When I'm talking about day-to-day activities, we actually work on pipelines. When I say pipelines, we usually collect data from different data sources that we get from third parties and do data integration: we collect it and make sure we prepare a data warehouse system. That is the high level, but on a day-to-day basis we follow agile methodology, so we connect with the client daily to understand whether they have any ad-hoc requests. Sometimes they suddenly come up and say they need an XYZ table, so we work closely with the team, pull it, and also give them an analysis. So day-to-day activities involve ad-hoc requests, implementation of data pipelines, and we also work with some reporting tools where we report to the end users.

Okay, so all the reporting and transformation, everything, is part of your role?

Yes, yes.

Okay, fine. So, Santosh, I have your resume open. Pick any project, preferably the most recent project you have done, or if that one is not complete and you are more comfortable with some other project we can do that, and walk me through the entire process. I want to understand what your role is there, what the end objective is that you're trying to solve, and most importantly, concentrate on the transformations part that you have done.

Okay, sure. I'll walk you through my latest project, which was around the Dynamics 365 family of products. As Microsoft, we sell our products to many clients: we have Dynamics, Outlook, Teams, Yammer, all of these are Microsoft products, and we have different companies buying them. This data comes to us from third parties. Now, as Microsoft, we need to find out how much revenue is being generated from each and every product, but the problem is that we don't have a single proper source. So we have to collect all of the data sources and come up with one centralized data warehouse so that we can start building our analysis. For this we collected data from flat files, SQL files, and Excel files, with the majority of the data coming from SQL Server. Using all of this, we moved our data from on-prem to the cloud. The orchestration tool we used is Azure Data Factory; Data Factory is responsible for moving the data, and we first bring it into Data Lake Storage, making sure the data is present in the cloud. When it comes to transformations, Azure Data Factory offers code-free transformations, but you may not get everything there, so for that we used Databricks. Databricks is essentially a Spark cluster that we spin up in the environment, and using PySpark we implemented the complex logic. So, specifically to answer your question on transformations: calculating rolling 7-day sales, 14-day sales averages, and the number of products sold per year, all of these are calculations we implemented in PySpark.

Okay, so your PySpark is on Python?

Yes.

Okay, fine.
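As an illustration of the kind of rolling-window transformation mentioned here, a minimal PySpark sketch might look like the following. The table name daily_sales and the columns product, sale_date, and d_sales are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical sketch of a rolling 7-day sales sum/average in PySpark.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
sales = spark.table("daily_sales")  # assumed source: product, sale_date, d_sales

# Range window over the current day and the 6 days before it, per product.
seven_days = 6 * 86400  # seconds
w = (
    Window.partitionBy("product")
    .orderBy(F.col("sale_date").cast("timestamp").cast("long"))
    .rangeBetween(-seven_days, 0)
)

rolling = (
    sales.withColumn("rolling_7d_sum", F.sum("d_sales").over(w))
         .withColumn("rolling_7d_avg", F.avg("d_sales").over(w))
)
```

A 14-day average would just widen the window, and a products-sold-per-year figure would be a groupBy on the year extracted from the date; the exact logic in the project would of course depend on its real schema.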
Can you explain to me how to optimize Spark jobs?

Optimizing Spark jobs, there are actually multiple ways. If you want me to walk through them one by one: we have three kinds of abstractions, RDDs, Datasets, and DataFrames, so depending on the use case we can choose among them. We also have persist and cache. Persist and cache are, I would say, two unique features Spark has: wherever you feel you are revisiting some data, you can go and persist it, or save it as a cache, and that is something we prefer. Serialization also plays a very important role; that is something we can decide in our configuration. Most importantly, nowadays in every project we are using broadcast variables and broadcast joins. Broadcast variables and broadcast joins make our work easier by storing the data on the node itself, so that it does not need to go and read it from another worker node. By using all of these we can improve our Spark jobs.

Okay, understood. You mentioned persist and cache, right? Can you walk me through the difference between the two?

Persist and cache: with persist, it will go and persist to disk; say you are maintaining HDFS, it will directly go, persist, and save it there. If you say cache, it stays only in memory. If you want me to put it in very layman terms, it's like RAM and ROM: you save it on your hard disk, that's persist; you keep it in memory, that's cache.

Yeah, I understood. So what is lazy evaluation in Spark, basically?

Okay, so lazy evaluation: for certain functions we use in Spark, the execution does not start until an action is triggered. Some transformations do not need to execute immediately. Let me give an example: you have a DataFrame, on that DataFrame you write a filter, and then you select some columns. None of this needs to be displayed again and again; it will not even be executed yet, it is just kept as a plan, and when you proceed further it saves the computation and increases the speed, because I'm not running it every time. It is also one of the optimization techniques: we can organize the program into many small operations, and finally, when you want to display the result, Spark clubs them together. Until then it will not execute. That is lazy evaluation.

Okay, very nicely put. So now my question is: Spark is extremely fast because of lazy evaluation, right? And you are saying you have used persist. As soon as you say persist, it will persist to the disk, so the lazy evaluation benefit won't happen there. So why do we want to persist in the first place?

In the first place we are not persisting everything. Let's say we are working with multiple transformations; these multiple transformations are actually held in temporary RDDs. I am not directly going and dumping into the disk; it is held in temporary RDDs, and finally, when I am using the results, I can take those temporary RDDs and club them. That is how we usually handle our heavy-lifting transformations.

Okay, fine, good. Nice answer.
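To make the lazy-evaluation point concrete, here is a small illustrative sketch, not from the project: the filter and select below only build a plan, nothing runs until an action such as count() is called, and persist()/cache() only take effect once that first action materializes the data.

```python
# Illustrative only: lazy evaluation plus cache/persist in PySpark.
from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("sales", (F.col("id") % 100).cast("double"))

# Transformations: nothing is executed yet, Spark only records the plan.
filtered = df.filter(F.col("sales") > 50).select("id", "sales")

# Mark the result for reuse: cache() keeps it in memory,
# persist(StorageLevel.MEMORY_AND_DISK) spills to disk if memory is short.
filtered.persist(StorageLevel.MEMORY_AND_DISK)

# The first action triggers the whole chain and materializes the persisted data.
print(filtered.count())

# Subsequent actions reuse the persisted result instead of recomputing it.
print(filtered.agg(F.sum("sales")).collect())

filtered.unpersist()
```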
Okay, you said RDD, right? Can you tell me the difference between an RDD, a Dataset, and a DataFrame?

RDDs, Datasets, and DataFrames are, I would say, three distinct abstractions we can use in our system. Whenever you are working with low-level transformations where you need Spark's internals to work for you, we can go with RDDs; the RDD is the base layer. Datasets, as we all know, are like a DataFrame, sorry, a Dataset is more like a table format where you can go and pull in data from CSV files and flat files and work on it as a whole, and a DataFrame is what we specifically use from Python, where we pull the data in a table format and query it for our analysis. Now, if you ask me which is better, it all depends on the use case: for low-level transformations we can go with RDDs, for table-level transformations we can go with DataFrames, and if you're loading external files we can use Datasets.

Okay, fine. So if you have to write any custom functions, how do you do that in PySpark?

Custom functions: off the top of my head, I would use a UDF. Whenever you want to do a repetitive task, we can go with a UDF. UDF is a user-defined function, and as the name says, you write it once and you can use it as many times as you want. We may not always have a built-in function. I remember one case where the client wanted a specific date format that even Python and PySpark did not have as a built-in; they wanted it in a DD-MM-YY style with the time appended. That type of format we did not have as a built-in function, so for that we wrote a standalone function, registered it as a UDF, and used it in multiple places.

Okay, fine, good.
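A rough sketch of the kind of formatting UDF described here; the exact format string, table, and column names are assumptions for illustration.

```python
# Hypothetical example of a PySpark UDF for a custom date/time string format.
from datetime import datetime
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def to_custom_format(ts: datetime) -> str:
    # Assumed target format: day-month-year plus time, e.g. "05-03-2023 14:30"
    return ts.strftime("%d-%m-%Y %H:%M") if ts is not None else None

to_custom_format_udf = F.udf(to_custom_format, StringType())

df = spark.createDataFrame([(1, datetime(2023, 3, 5, 14, 30))], ["id", "event_ts"])
df = df.withColumn("event_ts_fmt", to_custom_format_udf("event_ts"))
df.show(truncate=False)
```

(For simple patterns like this one, Spark's built-in date_format can often do the job; a UDF becomes necessary only when no built-in covers the requirement, which is the situation the answer describes.)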
Okay, so once you are in the team, this is what is expected of you: you have to do the ETL jobs, basically the ETL transformations, and once that is done you'll have to store everything in data lakes. Then the data science team will make use of this data and build different models. Your task here would be looking after the ETL, and once they build the models you need to run those models, so basically you need to schedule both of these jobs. Can you walk me through how you schedule and run your jobs?

How do we schedule Spark jobs, you mean Spark jobs?

Yeah, you can assume Spark jobs. Have you worked on Airflow?

Yes, I have worked with Airflow, but not on the latest project. For the latest project we scheduled using Azure Logic Apps; I mean, initially it was Logic Apps, but now we have something called Azure Scheduler. Using Azure Scheduler we can execute the workflows and we are also able to integrate, manage, and monitor the services. In the other project I used Airflow. Airflow is again a product with which we can monitor and manage the pipelines and connect end-to-end services, and get to know where we missed an operation or where we got an error; we can see that complete flow, and that was again integrated with Azure. We used that as well as part of a different module.

Okay, fine. You have mentioned pipelines so many times in your resume, so can you tell me how you maintain incremental load in your pipelines?

Okay, incremental load. Actually, let me first tell you why incremental load is required. In our case we had a problem: we had ten years of data, and every time the pipeline runs we cannot go and load all ten years of data again. We need to pick, let's say, the last three months, or eight or ten months, depending on the frequency of the changes. For that, what we have done is maintain something called a watermark column. Let's say, for example, the last load happened in December; we mark that down, say 31-12-2022, so we mark the date up to which data has been loaded. So we come up with a watermark column, which is basically a timestamp of the last update. Using this column we can check what the delta is with respect to the existing data. Delta means A minus B, just like dataset A minus dataset B. Using this delta we get to know the gap between the last refresh and this refresh, and only that delta we load as part of the pipeline. That is what is called incremental load.

Okay, that's fine. Can you tell me how it is different from auto scaling?

With auto scaling you are dependent on the system configuration, right? In that case we do not have control over where the starting point or end point is; we have to go with a fixed interval, and that is the problem with auto scaling. But when we come up with our own watermark, it is easy for us to calculate. Let me give an example: in this cycle you have to get the last eight months of sales, and in the previous cycle it would also have been eight months of sales, right? Auto scaling will find it difficult to adjust to that frequency, which is why I usually prefer our own watermark; it makes our work easier.

Okay, fine.
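A minimal sketch of the watermark pattern described above, assuming a control table that stores the last loaded timestamp and a source table with a last_updated column; all table and column names here are illustrative, not the project's.

```python
# Illustrative watermark-based incremental load in PySpark.
# "watermark_ctrl", "src.orders", "dw.orders", "last_updated" are assumed names.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Read the watermark: the timestamp up to which data was loaded last time.
last_watermark = (
    spark.table("watermark_ctrl")
    .filter(F.col("table_name") == "orders")
    .agg(F.max("last_loaded_ts"))
    .collect()[0][0]
)

# 2. Pull only the delta: rows updated after the previous watermark.
delta = spark.table("src.orders").filter(F.col("last_updated") > F.lit(last_watermark))

# 3. Append the delta to the target (a merge/upsert would be used if rows can change).
delta.write.mode("append").saveAsTable("dw.orders")

# 4. Advance the watermark to the max timestamp just processed.
#    (UPDATE via Spark SQL assumes the control table is a Delta table, as on Databricks.)
new_watermark = delta.agg(F.max("last_updated")).collect()[0][0]
if new_watermark is not None:
    spark.sql(
        f"UPDATE watermark_ctrl SET last_loaded_ts = '{new_watermark}' "
        "WHERE table_name = 'orders'"
    )
```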
Let me ask you a few technical questions as well. Can you tell me the difference between groupByKey and reduceByKey?

Yeah, definitely, we have to consider this because it comes under optimization and shuffling. Both reduceByKey and groupByKey are wide transformations, which means they trigger a shuffle operation. The key difference between reduceByKey and groupByKey is that reduceByKey does a map-side combine and groupByKey does not. So the usually preferred one is reduceByKey, because that is the one that gives you better optimization.

Okay. There are a lot of built-in Azure functions, right, so I'm sure you would have used Azure Functions. I don't want you to explain them generically; take any use case where you have used them and explain that use case. It doesn't necessarily have to be your current project, it could be a previous project; I'm just looking for use cases where you have used Azure Functions.

Actually, yes, we have used Azure Functions. It is basically a trigger: whenever we want the code to get executed based on some scenario. Let's say, for example, a pipeline. If you're running a pipeline on a daily basis, that is simply a cost, right? At the end of the day it's all about the cloud, so it keeps generating cost. What Azure Functions lets us do is run the trigger only when some scenario or condition is satisfied. In our project we wanted to run the pipeline only when the file changed. Sometimes it may change once in a year, sometimes in a week, sometimes in two days; only when the file changes do you have to execute the pipeline. It doesn't make sense to run the pipeline all 365 days if your file changes only five or ten times a year, and only then do you need to run it. For that we used Azure Functions, and in fact it is very useful for keeping your project's cost optimized.

Okay, fine. Can you walk me through data validation and data quality checks in your pipelines?

Okay, so for data validation, what we usually do, since it comes from the normal database solution, is implement something called an error log table. We have the error log table at every activity, and we also have exception handling implemented in Python, especially in Databricks. We have also written stored procedures. What the stored procedures do is take two parameters, an input and an output; the input is the source, the output is the destination, and they compare row by row, column by column. Any small special character, say an extra space, will be captured by this stored procedure, because the way we implemented it is to compare row by row, column by column. If it matches, well and good; if it doesn't match, immediately an entry is made in the error log table. Ideally, in every project you want this error log table to be empty; if something is getting entered, that means some error has happened in the data. That is how we do data validation.

Okay. Can you tell me what broadcast is?

Broadcast is actually a variable, I would say, and there are joins as well. A broadcast join in Spark is what we usually go with if we want to join a small dataset with a large one...

What you are explaining is broadcast joins. My question is about broadcasting; there is a process called broadcasting that we usually do in Python, so my question is mostly on broadcasting. Any idea?

I'm sorry, I don't remember right away.

Okay, it is nothing but this: whenever we do any operation or transformation, we want to apply it column by column, right, not row by row, not on each row. That process of applying the operation across the entire column is broadcasting.

Okay, got it.

You were explaining broadcast joins, right? Please go ahead and explain broadcast joins.

Usually we all work with normal joins, but we also have broadcast joins. These are usually preferred when you want to join a small dataset with a larger one. What happens is that Spark sends the small DataFrame to all the executor nodes in the cluster, and once this DataFrame is broadcast, Spark can perform the join without shuffling. We always need to remember: the less the shuffling, the better the performance. So by using broadcast joins we are able to avoid shuffling, and that is what makes the join faster.

Okay, nice.
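A small illustrative sketch of the broadcast join just described; the table names are made up.

```python
# Illustrative broadcast join in PySpark: the small dimension table is shipped
# to every executor so the large fact table does not need to be shuffled.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

large_sales = spark.table("dw.sales")            # assumed large fact table
small_countries = spark.table("dw.country_dim")  # assumed small dimension table

joined = large_sales.join(
    F.broadcast(small_countries),                # hint: broadcast the small side
    on="country_code",
    how="left",
)
joined.explain()  # the plan should show a broadcast hash join rather than a shuffle join
```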
I am mostly done, Santosh, so before we go I have one last question: access management controls, have you used them, do you manage permissions in your organization?

Yeah, definitely, because at the end of the day it's all about who can access your systems, whether it's a developer or a user. At the end of the day we need to grant access, so in our case what we have done is use Azure Active Directory, and we have also used identity and access management, IAM. We protect things by implementing this access control, and whoever is entitled, we make sure their entry is made there. That is something we do, and we also safeguard things with access keys, so only people with the specific access key are able to use our services.

Okay, fine. Lastly, how do you troubleshoot if you get an issue in Databricks?

Troubleshooting: I would say we need to find out which part of the code is taking time. There are basic metrics such as resource consumption time and CPU consumption time, and we also have disk usage and shuffle memory, so we keep an eye on all of these. Internally the tool itself provides log analytics; it comes up with charts and so on that help us understand what is taking more time. By analyzing this we can go through each job stage by stage, do a check, and by drilling down into the problem we are able to investigate the execution of the jobs.

Okay. Santosh, I am mostly done from my side, so any questions for me?

I just have one question. This role is really exciting; if you can help me understand what the next steps would be, and whether this is specifically a project-oriented role you are hiring for or a pool role, that will really help me take a decision.

Okay. So you have mentioned all these pipelines in your current job, and you also mentioned the data lake and so on, a lot of things, right? This is mostly along the same lines. We have data, and the data comes from different sources, the flat files, SQL, everything you mentioned. It is the same thing: you need to do the ETL part, so you extract, do all the transformations, the UDFs you mentioned, and then land it in the data lake. Then we have different teams, a data science team and a BI team, which usually do all the reporting with Power BI, Tableau, and so on. Once they do that, you need to schedule the jobs as well, because you don't want to run this ETL process by hand every day; the data comes every day, so everything must be automated, and that is where we need you. Then, once the data science team does their modeling, you need to run those jobs too; currently we are using Airflow, so your help is required on that part as well. And the BI tools must be refreshed automatically as soon as the data gets refreshed. So basically it is an extension of whatever you have been doing.

Okay, thanks so much for this, it was really helpful for understanding the role. One last question: I just wanted to understand, if you can share your experience of working with TCS, that will give me an idea about the organization and help me know more about your company.

Okay, so usually we don't get these questions. TCS is very well known, everybody knows about TCS; we are a four-lakh-plus-employee company and very employee friendly. We have very good policies, some of the best medical insurance in the industry, I would say. You will be leading a team and there will be a lot of people working around you; even in this project we have different teams that help each other out, and the managers, the policies, the HR, everything is really good. We are a global MNC, we have a presence everywhere; in India itself we are present in almost every city, and as part of our CSR activities we go to rural areas
and find talent from those less-reached places as well.

Maybe this is out of context, but I've heard from a lot of people that TCS is known for on-site opportunities; is that really true?

Before that, I have a question for you: where are you currently located?

Actually, I'm from Bangalore.

Bangalore. So are you willing to relocate to Chennai?

If there is a project need, I can, yeah.

Okay. Currently we have teams set up in Bangalore, Chennai, and Hyderabad. Right now we are working hybrid, so we work from home for a few days and sometimes we'll call you to the office; that's how we are operating. And as for on-site: our client is basically a US client, so we do have quite a lot of people working from the US as well. So maybe if you prove your worth you could be going on-site, but as of now I can't say anything; it all depends on the department.

I understand, yeah, sure. Thank you so much, it was a very good discussion.

Do you have a visa?

Yeah, I have it, but I have not been travelling, actually, so I'm looking for an opportunity.

Okay. Any other questions before you leave?

I'm good, thank you so much; I'm excited about this role.

Yeah, thank you, thank you.