Transcript for:
Databricks Data Engineer Exam Insights

Namaste, welcome back to SNA. You are watching the Databricks Certified Data Engineer Professional exam question analysis series. I actually recorded about 10 questions last night and then fell asleep, and I don't even know what I recorded, so I'm restarting this video. Part of it is that it was late, but honestly these Databricks questions make me sleepy, because so many of them are "where do you click this button" type questions. When you're actually doing the task you'll find the option anyway, so what is the point of asking where an option lives in the UI? Give us questions that make us think a little and actually test whether the person has real knowledge. This is a professional exam, not even an associate one. To be fair, there are some questions in here that make sense, real-time scenario questions, but a lot of them are pure memorization of UI options. Yes, they give five options instead of four, so you have a 20% guess rate instead of 25%, but they could still put more effort into questions that test reasoning instead of recall. Compare this with the AWS exams, whether it is SAA, DOP, SOA, or DVA: nowhere in those questions will they ask "where is this option" type things, those exams are cheaper than this one, and if you pass one AWS exam the next is 50% off, which I don't think Databricks even offers. If you're spending that much, at least make the questions meaningful. Okay, I'll stop ranting. If you agree with me, leave a comment and let me know whether you feel the same way about the Databricks questions; I feel they should put more effort in, considering how expensive the exam is.

First question. A data pipeline uses Structured Streaming to ingest data from Apache Kafka into Delta Lake. Data is being stored in a bronze table and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team notices latency issues during certain times of day. A senior data engineer updates the Delta table schema and the ingestion logic to include the current timestamp as recorded by Apache Spark, as well as the Kafka topic and partition. The team plans to use these additional metadata fields to diagnose the transient processing delays. Which limitation will the team face while diagnosing this problem?

This question is at least good; it makes you think instead of just picking an option. I won't read the options out loud, so pause and read them, and let's go from the bottom up. Option E is incorrect: while you can provide default values, it's not strictly required for all data types, so that's not a limitation. Option D: schema updates in Delta Lake do not invalidate the transaction log, and that log is essential for features like time travel and ACID transactions, so that's not a limitation either. Option C: Delta Lake is designed to handle schema changes, including adding new fields, even in production environments, so that's not a limitation. Option B: Spark can indeed capture the Kafka topic and partition as part of Structured Streaming, so that's not a limitation either.

So what is the limitation? When you add columns to a Delta Lake table, the existing data will have null values in those new columns; the newly added fields will only be populated for data ingested after the schema change. If you've worked with database tables, it's the same story: add new columns to an existing table and whatever data already exists will be empty for those columns, while anything inserted afterwards will have them populated.

If you want to handle the missing values for historical data, there are a few options. One, you can run a separate backfill process that reads the historical data, adds the new columns with appropriate values (potentially defaults), and writes the modified data back to the Delta table. Two, if the new columns can be derived from existing data, you can compute them as derived columns in your queries. Three, depending on your analysis, you may simply be able to live with the null values in historical records if the insights you're after don't depend heavily on the new fields. Those are some ways you can handle this situation; a sketch of the first approach follows.
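Here is a minimal, hypothetical sketch of that backfill idea, assuming a bronze Delta table named `bronze_orders` that already captured the Kafka `timestamp` column at ingest; every name here is made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # in a Databricks notebook, `spark` already exists

# 1. Evolve the schema: existing rows will show NULL for the new columns.
spark.sql("""
    ALTER TABLE bronze_orders
    ADD COLUMNS (processing_time TIMESTAMP, source_topic STRING, source_partition INT)
""")

# 2. Optional backfill for historical rows, e.g. defaulting processing_time
#    to the Kafka timestamp that was already captured at ingest.
spark.sql("""
    UPDATE bronze_orders
    SET processing_time = timestamp
    WHERE processing_time IS NULL
""")
```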
Next question. The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the two teams do not match, and a number of marketing-specific fields have not been approved for the sales organization. Which of the following solutions addresses the situation while emphasizing simplicity? "Emphasizing simplicity" means several of these options could work, but we want the simplest one, the least operational overhead if you're familiar with AWS certification wording.

Again, bottom up. Option E is not optimal in terms of simplicity or scalability, because it introduces manual steps that are prone to errors, delays, and inefficiencies; relying on CSV downloads and emails is not a robust, automated way to share data between teams. Option D introduces complexity by adding parallel table writes to the production pipeline; while it allows customizing the sales table based on the marketing table, managing multiple table writes in the pipeline increases complexity and maintenance effort. Option C, CTAS, refers to CREATE TABLE AS SELECT; that statement can create a new table with the desired schema based on a query or an existing table, but configuring a production job to propagate changes adds complexity, since it requires additional monitoring and maintenance to ensure changes are accurately propagated, so it doesn't emphasize simplicity. Option B: while this ensures both teams have separate tables with the required schema, using deep clone adds unnecessary complexity; deep clone is for creating a copy of a Delta table, including its schema and data, not for ongoing synchronization of changes between tables, so using it here introduces complexity and maintenance overhead.

That leaves option A. It offers simplicity by creating a view, essentially a virtual table that presents data from one or more tables, selecting only the approved fields and aliasing the names to match the sales naming conventions. The marketing team's data can be presented to the sales organization without altering the underlying data. This approach is straightforward and minimizes complexity; something like the sketch below.
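A hypothetical example of the kind of view option A describes, with made-up table, column, and group names; the GRANT line assumes table access control is enabled and follows the table-ACL form, which may differ slightly under Unity Catalog.

```python
# `spark` is the SparkSession provided by a Databricks notebook.
spark.sql("""
    CREATE OR REPLACE VIEW sales.customer_aggregates AS
    SELECT
        cust_id      AS customer_id,       -- renamed to the sales naming convention
        total_spend  AS lifetime_value,
        last_order   AS most_recent_order  -- only approved fields are exposed
    FROM marketing.customer_aggregates
""")

# Give the sales group read access to the view only, not the underlying table.
spark.sql("GRANT SELECT ON VIEW sales.customer_aggregates TO `sales-team`")
```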
Next. When evaluating the Ganglia metrics for a given cluster with three executor nodes, which indicator would signal proper utilization of the VMs' resources? They're testing whether you know these terms. Option A, the 5-minute load average of 1: while relevant, it's a broader system-level metric that can be affected by processes other than your Spark job, so we're not looking at that one. Option B, bytes received: network metrics are vital for distributed computing, but an absolute limit like 80 million bytes per second is arbitrary; you would focus on spikes or unexpected drops rather than fixed thresholds. Option C, network I/O never spikes: that's an unrealistic expectation, since bursts in network traffic are common in distributed systems, especially during shuffles within a Spark job. Option D, total disk space remains constant: you wouldn't expect that in a running cluster, because data caching, temporary files, and logs consume disk space; a more useful metric is the amount of free disk space available. So which indicator signals proper utilization of VM resources? CPU utilization around 75%. CPU directly reflects how much processing power the executors on each VM are using, and a moderate level like 75% indicates good workload distribution and healthy resource usage. If CPU utilization is consistently low, the job might be I/O bound, limited by disk or network, and not making full use of the available compute; conversely, sustained high CPU can suggest the cluster is overloaded. Two things to remember: holistic analysis, meaning monitoring a combination of metrics (CPU, memory, network) is always best for understanding cluster health, because a single metric rarely tells the whole story; and workload specifics, since the optimal CPU utilization depends on the job, some jobs are naturally CPU heavy while others are I/O bound. I hope that helps you decide.

Next question. You are testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function, which you can see here. Which kind of test would the above line exemplify? Basically they are testing your understanding of the different kinds of testing. Bottom up again. End-to-end testing simulates a user's complete journey through a system; here you are only testing one part of the system, far too narrow in scope to count as end-to-end. Integration testing focuses on how different components of a system work together; this test is focused on a single function, not the interaction between multiple components, so it's not that. Option C, functional tests, check the behavior of a piece of code from the perspective of its requirements, what it's supposed to do; while this test does validate the function's behavior, the term unit test is more specific, which is option A. Option B, manual: a manual test involves a human tester executing steps and checking results, but the code snippet indicates an automated test, so not manual. So option A, unit test: a unit test focuses on the smallest testable part of an application, typically a single function, and in this case you are testing the my_integrate function in isolation, verifying that it calculates the definite integral correctly. I hope you now know the different kinds of tests; a small example of what such a unit test looks like is below.
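Purely for illustration, a minimal unit test in the spirit of the question might look like this; the function name `my_integrate` and the assertion are hypothetical stand-ins, not the exam's actual code.

```python
import math


def my_integrate(f, a, b, n=10_000):
    """Approximate the definite integral of f on [a, b] with the midpoint rule."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


def test_my_integrate_quadratic():
    # A unit test exercises one function in isolation against a known result:
    # the integral of x^2 from 0 to 3 is 9.
    assert math.isclose(my_integrate(lambda x: x * x, 0, 3), 9.0, rel_tol=1e-3)
```

You would run this with a test runner such as pytest; there is no Spark, UI, or external system involved, which is exactly what makes it a unit test rather than an integration or end-to-end test.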
Next. Which statement regarding Spark configuration on the Databricks platform is true? Let's go through the options. Option A: while you can use the REST API to modify cluster configurations, those changes often require a cluster restart, which would interrupt running jobs, so this is false. Option B: Spark configurations set within a notebook are isolated to that specific notebook's Spark session, not applied to all Spark sessions, so this is false as well. Option C: you actually have multiple ways to configure Spark properties on Databricks, at the cluster level, at the notebook level, or using init scripts, so init scripts are not the only method and this is false. Option E: notebook-level configurations generally override cluster-level configurations, they are not ignored, so this is false too. Option D is true: Spark configuration properties set for an interactive cluster through the Clusters UI will impact all notebooks attached to that cluster, because you are setting them at the cluster level, not the notebook level. Remember the hierarchy of precedence for configuration settings: cluster-level configurations are set in the Databricks UI when creating or editing a cluster, notebook-level configurations are set with spark.conf.set(key, value) inside a notebook, and notebook-level settings take the highest priority, so whatever you set at the notebook level automatically overrides the cluster-level value for that session. A quick illustration is below.
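A quick, hedged illustration of that precedence; the property chosen here is just an example.

```python
# Value inherited from the cluster's Spark config (set in the Clusters UI).
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Notebook-level override: affects only this notebook's Spark session.
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # now 64 for this session
```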
Next question. A developer has successfully configured their credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace. Which approach allows the user to share their code updates without the risk of overwriting the work of their teammates?

This is how you develop code in real life: you're usually only given read access to the main branch, because main always holds the production code, and typically a separate person controls merges into it. You create your own feature branch off main, develop your code in that feature branch, push your changes to the remote feature branch, and then open a pull request asking your admin, or whoever owns the repo, to merge your feature branch into main. That's how your changes end up in the main branch.

With that in mind, you can see that options A through D are all wrong. Option A lets the team see changes but isn't a collaborative workflow and hinders your actual contributions. Option B, forking, is valid for external contribution patterns, but within the same team forks are usually unnecessary overhead. Option C might work if the remote main branch has changes, but it's unsafe if the developer can't push directly to main, and accidents happen. Option D, merging without a separate branch, risks messy conflicts if main has evolved further. So the correct option is E: create a new branch, which gives you a safe work area isolated from the protected main branch; commit your changes to that branch; and push the branch to the remote repository, creating a clear separation of work. Even though it's not mentioned in the option, the next step would be a pull request: if the workflow requires it, the developer opens a pull request from their branch into main, which triggers the code review and approval process.

If you learned something new, this deserves a like and a comment, so go ahead and do that, and if you haven't subscribed to this channel, please do, because I'm adding more content. If you're interested in the Databricks Data Analyst or Databricks Machine Learning certifications, let me know in the comments; I'm not planning those right now because I don't see enough interest, but if you leave a comment saying you're interested, I'll consider making videos for them. If you want to support the channel, one way is to become a member; as a membership perk I provide PDFs of all the questions I upload to YouTube. If you need questions I didn't upload, or PDFs for an exam I don't cover on YouTube, I can provide those for a price of $25; my email is in the background of each slide. If you don't want to become a member you can still support the channel with a Super Thanks or Super Chat, available under the video next to the Subscribe button, or simply by subscribing, liking, and commenting; all of that really helps the channel provide more content. If you're into AWS certification question analysis, you can see I cover pretty much all the AWS exams, and if you want more exams covered here I need your support. About 75% of the viewers on this channel are not yet subscribed; think about how the channel would grow if those viewers actually subscribed. I have plans to host these questions as quizzes on a website, where you could practice and control things like whether you want 75 questions or 30 questions drawn from a pool. All of that will only happen with your support, so if you're already supporting, I really appreciate it, and if you're not, please go ahead and subscribe.
Back to the questions. In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clones, development tables are created using shallow clone. A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 SCDs stop working. The transaction logs for the source tables show that VACUUM was run the day before. Which statement describes why the cloned tables are no longer working?

This is an interesting question because it tests your knowledge of deep versus shallow clones, SCDs, and VACUUM all at once, so let's learn all of those concepts. Option A is incorrect: Type 1 SCD updates overwrite existing records, but Delta Lake ensures data consistency through its transaction log and ACID compliance, so a cloned table should still reflect the state of the data at the time of cloning regardless of the SCD type. Option B is also incorrect: while running VACUUM affects the data files of Delta tables, it doesn't automatically invalidate shallow clones as such; shallow clones can still be used after a VACUUM but may become inconsistent if the vacuumed data files they reference are deleted, and deep clones aren't strictly necessary for this scenario. Option C is incorrect because shallow clones of Delta tables are not automatically deleted after some retention period; they persist until explicitly dropped. Option E is only partially correct: it's true that after VACUUM the clone's metadata no longer matches the files on disk, but the remedy it proposes, running REFRESH, might not be sufficient when the issue is that data files have been physically deleted. That leaves option D: when VACUUM is run on a Delta table, it deletes files that are no longer needed for query consistency or that are marked for deletion, and if shallow-cloned tables still reference those deleted files, they become inconsistent or non-functional. That's why the cloned tables are no longer working. Here's a small sketch of that setup.
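A minimal sketch of the scenario, with hypothetical table names; the shallow clone copies only metadata and keeps pointing at the source table's data files.

```python
# Development table: metadata only, data files still live under prod.customers.
spark.sql("CREATE TABLE dev.customers_clone SHALLOW CLONE prod.customers")

# Later, run against the source table: physically removes files outside the retention window.
spark.sql("VACUUM prod.customers RETAIN 168 HOURS")

# If VACUUM removed files the shallow clone still references, reads like this
# can now fail with missing-file errors:
spark.table("dev.customers_clone").count()
```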
Next question. You are performing a join operation to combine values from a static user lookup table with a streaming DataFrame. Which code block attempts to perform an invalid stream-static join? Invalid means four of the options are valid and only one is not, and this is the kind of exam question I was ranting about, where you basically have to memorize a table. Here it is: if both sides are static, all join types are supported. If the stream is on the left and the static table is on the right, only inner and left outer joins are supported; right outer and full outer are not. If the static table is on the left and the stream is on the right, only inner and right outer joins are supported; left outer and full outer are not.

Going through the options: option A has the static users table on the left and the stream on the right, and with static-left/stream-right only inner and right outer joins are supported; this one is an inner join, so it's valid, cross it out. Option B has the stream on the left and the static table on the right; with stream-left/static-right, inner and left outer are supported while full outer and right outer are not, and this option is a full outer join, so it's not supported, which makes it the invalid one and therefore the answer. You can check the remaining options the same way and they all come out valid. The point I'm trying to make is that you'll only get this right if you've memorized the matrix, however you choose to remember it. So our invalid join is option B, because a full outer join between a stream and a static table isn't supported. Here's what valid versus invalid looks like in code.
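A hedged sketch (made-up table names) of the valid and invalid shapes from options A and B; the unsupported join is rejected with an AnalysisException when Spark analyzes or starts the query.

```python
static_users = spark.table("users")                      # static lookup table
orders_stream = spark.readStream.table("orders_bronze")  # streaming DataFrame

# Supported: static on the left, stream on the right, inner join (option A's shape).
valid = static_users.join(orders_stream, on="user_id", how="inner")

# Not supported: full outer join between a streaming and a static DataFrame (option B's shape).
# orders_stream.join(static_users, on="user_id", how="full_outer")
```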
Next. Spill occurs as a result of executing various wide transformations; however, diagnosing spill requires you to proactively look for key indicators. Where in the Spark UI are two of Spark's primary indicators that a partition is spilling to disk? They ask for two primary indicators, which is why each option lists two things, and if either one is wrong the whole option is wrong. Option A pairs the Query Details screen with the Job Details screen: the Query Details screen typically shows information about the execution of a specific Spark SQL query or DataFrame operation, and even though it provides a summary it isn't granular enough, while the Job Details screen shows overall resource usage but often lacks granular spill metrics, so cancel that out. Option C, driver and executor log files: executor log files may contain detailed information about spills, such as warning messages, so that half is right, but driver logs won't show those details, and true-and-false makes the option false. Option D pairs executor log files (true) with the Executor Details screen, which doesn't directly display spill metrics, so false again. Option E pairs the Query Details screen, which we already ruled out, with the Stage Details screen; the Stage Details screen does provide information about the execution of each stage within a job, including metrics related to shuffle spill such as memory usage, spill size, and spill rate, which indicate whether partitions are spilling to disk during the stage, so that half is true, but paired with Query Details the option is false. So the two primary indicators are the Stage Details screen and the executor log files: true and true, and that's where you can find them.

Next. A task orchestrator has been configured to run two hourly tasks: first, an outside system writes Parquet data to a directory mounted at that location; after this data is written, a Databricks job containing the following code is executed. You can see a readStream with format parquet and a load of that path, a watermark of 2 hours on the time column, then dropDuplicates on customer_id and order_id, and finally a write with trigger once to the orders table. Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Option A is incorrect: the 2-hour watermark means duplicates arriving outside this window are simply not retained, they are ignored. Option B is also incorrect, because DLT (Delta Live Tables) doesn't store data in a state store for extended periods before processing. Option C is incorrect: the two-hour window is for processing late data, not a restriction on the overall data stored. Option E: the deduplication logic actively removes duplicate records rather than ignoring them, so that's incorrect as well. So option D is the right one. Breaking it down: the job reads the Parquet files from the mounted raw-orders directory; withWatermark("time", "2 hours") sets a watermark on the time column, a threshold for processing data based on event time, so the system considers events more than 2 hours old as late; and dropDuplicates removes duplicates on the composite key. Key points to consider: deduplication happens within each micro-batch of data processed by the streaming job, and if multiple records with the same keys arrive in the same micro-batch, both might be written before the deduplication logic is applied; across batches, the watermark plays the crucial role in handling duplicates, because it defines the window, here 2 hours, in which data is considered valid, and records with a time value more than 2 hours behind the watermark are ignored or discarded. This prevents outdated duplicates from entering the orders table. The code in question looks roughly like the sketch below.
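A sketch of the code the question describes, with an assumed schema, path, and checkpoint location (all hypothetical).

```python
streaming_df = (
    spark.readStream
        .format("parquet")
        .schema("customer_id LONG, order_id LONG, time TIMESTAMP")  # assumed schema
        .load("/mnt/raw_orders")
)

deduped = (
    streaming_df
        .withWatermark("time", "2 hours")             # bounds how long duplicate state is kept
        .dropDuplicates(["customer_id", "order_id"])  # composite key
)

(
    deduped.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/orders")
        .trigger(once=True)                           # the run-once style trigger from the question
        .toTable("orders")
)
```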
Next. A junior data engineer is migrating a workload from a relational database system to the Databricks lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write. Which consideration will impact the decisions made by the engineer while migrating this workload? Option A: while Databricks supports defining foreign key constraints, it doesn't restrict them to hashed identifiers, and more importantly foreign key constraints are not strictly enforced the way they are in a relational database. Option B is not entirely accurate either: Databricks does support Spark SQL and JDBC, but the lack of strict enforcement of constraints means that aspects of the migration related to referential integrity need additional consideration and implementation. Option C doesn't directly address the lack of enforcement of foreign key constraints in Databricks Delta Lake, and option E doesn't accurately reflect it either. Option D does: it accurately reflects the fact that while transactions are ACID-compliant at the table level, Databricks does not enforce foreign key constraints. It supports declaring them, so you can still use them for documentation and joins, but they are informational, in name only, unlike an RDBMS where foreign keys are enforced.

Next. A data architect has heard about Delta Lake's built-in versioning and time travel capabilities; good for them. For auditing purposes they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table. The architect is interested in implementing a Type 1 table, Type 1 as in a Type 1 slowly changing dimension, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing, which is indeed what Type 1 does. A data engineer on the project feels that a Type 2 table will provide better performance and scalability. Which piece of information is critical to this decision? You need to know the difference between Type 1 and Type 2: Type 2 doesn't overwrite, it adds extra columns, such as effective and end dates, showing whether a particular record is the currently active one. Option A discusses potential data corruption issues related to Type 2 tables but doesn't directly address the scalability concerns associated with Delta Lake's time travel feature. Option B likewise doesn't address the scalability concerns of relying on time travel, so it's a goner. Option C talks about the time travel feature but claims it doesn't allow querying previous versions of tables, which it does, even though scalability concerns can arise over time. Option E: Delta Lake supports both Type 1 and Type 2 tables, and records can be updated or appended based on the table's configuration, so this is wrong. The right answer is option D, because it accurately reflects the scalability concerns of using Delta Lake time travel as a long-term versioning solution: it highlights the potential for increased storage consumption and query latency over time.

Next question. A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are organized into groups, which are used for setting up data access via ACLs. The user_ltv table has the schema email, age, ltv. The view definition is executed with a filter of the form CASE WHEN the user is a member of the auditing group THEN TRUE ELSE age >= 18 END, so based on that condition non-auditors can see no minors. An analyst who is not a member of the auditing group executes the following query. Reading the definition: if the user is in auditing, the condition is true and they select everything from user_ltv; if not, the filter checks whether age is greater than or equal to 18. Which statement describes the results returned by this query?

Option E looks similar to option A but says greater than 18, while our condition is greater than or equal to 18, so it misses rows where age is exactly 18 and isn't right. Option D: the view definition applies the group-membership filter, so for an analyst who is not in the auditing group only rows passing the age condition are included, not all records; you cannot see all records, so this is wrong. Option C says all values in the age column will be returned as null, which isn't true because the view filters out the underage rows entirely; there wouldn't be null values in the age column. Option B talks about rows with age less than 18 being returned; no, the view definition excludes rows with age less than 18 entirely, and again there wouldn't be any null ages in the results. So the answer is A. You might be thinking, but option A says greater than 17, while the view says greater than or equal to 18. Think about it: since age is a whole number here, "greater than or equal to 18" and "greater than 17" describe the same set of rows, everything from 18 upward. It would have been an easier giveaway if the option had said "greater than or equal to 18", but greater than 17 means the same thing, so don't get confused. The view logic looks something like the sketch below.
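A hypothetical reconstruction of that view, using Databricks' is_member() group-check function; the question's exact function name may differ (for example, is_account_group_member()).

```python
spark.sql("""
    CREATE OR REPLACE VIEW user_ltv_no_minors AS
    SELECT email, age, ltv
    FROM user_ltv
    WHERE CASE
            WHEN is_member('auditing') THEN TRUE   -- auditors see every row
            ELSE age >= 18                         -- everyone else sees adults only
          END
""")
```

For an analyst outside the auditing group, querying this view returns only rows where age is 18 or greater, i.e. greater than 17.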
Next. The data governance team is reviewing code used for deleting records to comply with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table: a Delta read with option readChangeData set to true and starting and ending timestamp options against the user_lookup table, then creating or replacing a temporary view over those changes, and then a SQL statement that deletes from user_aggregates where user_id is in that view. Assuming that user_id is a unique identifying key and that all users who have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible, and why?

Option A: MERGE INTO can be used for complex data manipulations, but it's not strictly necessary for basic deletions, which are already covered by Delta Lake's ACID transactions. Option C: Change Data Feed (CDF) captures inserts, updates, and deletes, but it's a separate mechanism for tracking data changes; it doesn't itself guarantee deletion from the table. Option D isn't precise: Delta Lake's ACID guarantees (atomicity, consistency, isolation, durability) ensure the success of the delete operation itself, but they don't guarantee immediate physical removal of the data files; that only happens after a VACUUM, so cross it out. Option E: while some CDF implementations might not track deletes, Delta Lake's CDF can capture deletes, and as mentioned for option C it's separate from the physical deletion process anyway, so cross that out too. That leaves option B. As you can see in the code, it uses a DELETE statement, which logically removes the rows from the Delta table; however, Delta Lake keeps old versions of the data to support time travel queries, so the deleted records may still be physically present in the underlying data files until a cleanup is triggered. That's what the VACUUM command is for: it identifies and removes data files containing records that are no longer referenced by the latest version of the table, which ensures physical deletion and frees up storage space. So that's the difference: DELETE is logical, VACUUM is physical. Key points to remember: a Delta DELETE with ACID guarantees ensures the deletion is logically successful; time travel can still expose logically deleted data; so use VACUUM to physically remove the data files and enforce GDPR compliance. This is an interview question, so remember it. In code it's the two-step pattern below.
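A minimal sketch of that two-step pattern, with hypothetical table and view names and an illustrative retention period.

```python
# Logical removal, recorded in the Delta transaction log.
spark.sql("""
    DELETE FROM user_aggregates
    WHERE user_id IN (SELECT user_id FROM deleted_users_vw)
""")

# Physical removal: drops data files outside the retention window, after which the
# deleted rows are no longer reachable even via time travel.
spark.sql("VACUUM user_aggregates RETAIN 168 HOURS")
```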
Next question. The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership, and these groups map directly to user groups already created in Databricks that represent various teams within the company. A new login credential has been created for each group in the external database, and the Databricks Utilities secrets module will be used to make these credentials available to Databricks users. Assuming that all credentials are configured correctly on the external database and group membership is properly configured in Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials? If you'd like a reference, search for the Databricks documentation on secret ACLs. Going through the options: option A suggests granting MANAGE permission on a secret key, which contradicts the fact that access control is managed at the secret scope level. Option B similarly suggests setting permissions on an individual secret, which again isn't accurate if access control is managed at the scope level. Option D does manage permissions at the secret scope level, which is what we're looking for, but it grants MANAGE permission, which is more access than necessary, so eliminate it. The option claiming that no additional configuration is necessary as long as users are workspace administrators is not accurate either. That leaves option C: the key consideration is setting the permission at the secret scope level, and both C and D do that, but we don't need MANAGE, all we need is READ. By granting READ permission on a secret scope, a team can securely access the credentials within that scope and nothing more. So the correct option is C.

Next. Which indicators would you look for in the Spark UI Storage tab to signal that a cached table is not performing optimally? Assume you are using Spark's MEMORY_ONLY storage level. Option A suggests that the size on disk is smaller than the size in memory, which could indicate efficient memory use; but for a cached table using the MEMORY_ONLY storage level the entire dataset should ideally fit in memory, so the size on disk should ideally be zero, and this option isn't indicative of suboptimal performance. Option C, size on disk greater than zero: the same logic applies, with MEMORY_ONLY the size on disk should be zero. Option D suggests a discrepancy between the number of partitions cached in memory and the number of partitions in the Spark RDD; ideally those numbers should match to ensure efficient caching, so a discrepancy could hint at issues, but it isn't the primary indicator here. Option E compares on-heap and off-heap memory usage; a large difference might suggest memory inefficiencies, but it doesn't directly indicate suboptimal performance of cached tables. That leaves option B: RDD blocks annotated with an asterisk indicate a failure to cache, which indeed suggests suboptimal performance, since caching did not succeed for those blocks and they may have to be recomputed. That makes this option the strong indicator of suboptimal performance.
Now this is the kind of question I was talking about; it's not professional-level, it doesn't even belong in the Data Engineer Associate exam. What is the first line of a Databricks Python notebook when viewed in a text editor? Seriously, why does it even matter? Anyway, here is the answer, and there isn't much to explain: when you open the exported notebook source in a text editor, the first line is a comment marker whose syntax depends on the language, so a Python notebook looks one way (for Python it's the comment line "# Databricks notebook source"), and SQL, Scala, and R each look slightly different. The answer here is C; the other options shown are the SQL and Scala variants. Do you even need an explanation for that?

Next. Which statement describes a key benefit of an end-to-end test? I already explained what an end-to-end test is, so hopefully you can pick this up from that definition; if not, let's go through the options. Option A, it makes it easier to automate your test suite: end-to-end tests can be automated, but their complexity often makes them harder to automate than simpler unit or integration tests, so that's not the benefit. Option B, it pinpoints errors in the building blocks of your application: unit and integration tests are the ones focused on isolating errors in specific components; end-to-end tests might reveal errors, but pinpoint accuracy isn't their strength, so you don't use them for that. Option C: achieving 100% coverage is incredibly difficult in complex systems, and end-to-end testing prioritizes critical user workflows, so not this one. Option E, it ensures code is optimized for a real-life workflow: performance optimization might be one aspect an end-to-end test touches, but it's not the primary focus. The actual key benefit is that it closely simulates real-world usage of your application: end-to-end tests excel at simulating how a real user interacts with the system, including interactions between multiple components, external systems, and the UI; they validate an entire feature or flow, unlike lower-level unit or integration tests that focus on individual components; and they catch errors that tests focused purely on code logic might miss. Important notes: end-to-end tests shouldn't be your only testing mechanism, they're best used in combination with unit, integration, and other types of testing, and the trade-off is that they are often slower and more brittle, more prone to break with changes, than lower-level tests. Remember that before relying on them.

Next. The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response confirming that the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents? Again, these are the questions I don't love. Option A, the job_id and the number of times the job has been run, concatenated and returned: no, the run_id is not a concatenation of the job_id and anything else; we'll see what it is in a minute. Option B, the total number of jobs that have been run in the workspace: no, the run_id doesn't relate to the total number of jobs in your workspace at all. Option C, the number of times the job definition has been run in the workspace: no, the run_id identifies a specific job execution, not a count of executions, so that's a goner. Option D, the job_id: definitely not, the job_id identifies the job itself, not an individual run. So what is it? It's the globally unique ID of the newly triggered run. To summarize run_id versus job_id: the job_id is a permanent identifier for the job definition itself, while the run_id is a unique identifier assigned to each individual execution of that job, so the number alongside this field represents the unique ID of that particular run.
Next. The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE. The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model: model is created with mlflow.pyfunc.spark_udf, passing a model_uri that points at the production model, the DataFrame is created with spark.table("customers"), and the feature columns are defined. Which code block will output a DataFrame with the schema customer_id LONG, predictions DOUBLE?

Option A uses df.map, which applies a lambda function to each row of the DataFrame and calls the model function inside the lambda; but .map isn't suitable for applying the model function to DataFrame columns directly, and on top of that the .select syntax is wrong, passing "customer_id, predictions" inside a single pair of quotes instead of "customer_id" and "predictions" as separate quoted names, so this option is incorrect. Option C appears to call a .predict method on the model object to generate predictions for the DataFrame, but the UDF returned by mlflow.pyfunc.spark_udf does not expose a method named predict, and the column selection is missing, so this is incorrect. Option D attempts to apply the model through a pandas UDF, but mlflow.pyfunc.spark_udf has already wrapped the model as a Spark UDF, so wrapping it again in a pandas UDF is unnecessary and incorrect; the .select usage is fine, but applying a pandas UDF is wrong in this context. Option E uses .apply on the DataFrame to apply the model function to the specified columns; .apply isn't suitable here either, and like option A it uses the incorrect .select syntax. That leaves option B: it uses df.select to pick the customer_id column, applies the model function to the specified feature columns to generate predictions, and uses .alias("predictions") to rename the resulting column, which fulfils the requirement exactly. In code, roughly like the sketch below.
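A hedged sketch of that winning pattern, with a made-up model URI and feature column list.

```python
import mlflow.pyfunc

# Wrap the logged model as a Spark UDF (returns DOUBLE by default).
model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")

df = spark.table("customers")
columns = ["age", "tenure", "monthly_spend"]  # hypothetical feature columns

predictions_df = df.select("customer_id", model(*columns).alias("predictions"))
```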
Next question. A nightly batch job is configured to ingest all data files from a cloud object storage container, where records are stored in a nested directory structure. The data for each date represents all records that were processed by the source system on that date, noting that some records may be delayed while they await moderator approval. Each entry represents a user review of a product and has the schema user_id, review_id, product_id, review_timestamp, and review_text. The ingestion job is configured to append all data for the previous date to a target table, reviews_raw, with a schema identical to the source system. The next step in the pipeline is a batch write that propagates all new records inserted into reviews_raw to a table where the data is fully deduplicated, validated, and enriched. Which solution minimizes the compute cost to propagate this batch of data? When they say "minimizes", it means multiple options are correct, but we choose the one with the lowest compute cost. Option A can handle new records if deduplication is taken care of separately, but it requires reading the entire reviews_raw table daily, which increases computation cost, especially for large datasets. Option C is extremely efficient if the majority of daily records are brand new, since you only process the changes, but it requires maintaining history, which takes storage space, and it doesn't seamlessly handle delayed records; you would need a separate strategy to periodically reprocess old data to catch late arrivals. Option D is simple to implement for recent data but, like option A, reads the entire reviews_raw table daily, and it completely misses records delayed beyond its 48-hour filter window. Option E is terribly inefficient: reprocessing the entire history every time leads to excessive compute cost. So the right option is B: it's designed for batch-style workloads and, more importantly, it naturally handles delayed records, since it reads data as it arrives rather than filtering by date.

Next. Which statement describes Delta Lake's optimized writes? One option suggests that the OPTIMIZE command is automatically executed on all tables modified during the most recent job before the cluster terminates; that's incorrect, OPTIMIZE compacts the layout of Delta tables but has to be explicitly triggered by the user, it isn't run automatically by the cluster. Option B describes an asynchronous job that runs after the write completes to detect whether files could be further compacted, and if so executes an OPTIMIZE job toward a default of 1 GB; while that could be a plausible optimization strategy, it's not how Delta Lake optimized writes work, and as just noted OPTIMIZE has to be invoked explicitly. Option C describes a hypothetical setup where data is staged in a messaging bus instead of being committed directly, with all the data committed from the bus in one batch once the job completes; that's not how it works either, Delta Lake writes data directly to storage and doesn't involve a messaging bus. Option D correctly describes one aspect of Delta Lake, that it uses logical partitioning tracked in metadata rather than relying purely on directory structure, which lets it manage data without creating large numbers of small files, but that covers only one aspect of the optimization strategy, so we won't pick it. The remaining option is the answer, because it accurately describes optimized writes: Delta shuffles the data prior to writing it to storage so that similar data is grouped together, which reduces the number of files generated during write operations, resulting in more efficient storage utilization and improved performance, since data with the same partition keys ends up collocated. For context, the sketch below shows how optimized writes are typically enabled per table.
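For context only, a sketch of enabling optimized writes on a single table; the property name is the one I recall from the Databricks documentation (delta.autoOptimize.optimizeWrite), so treat it as an assumption and verify it against your runtime's docs. The table name is hypothetical.

```python
# Enables the pre-write shuffle that coalesces output into fewer, larger files.
spark.sql("""
    ALTER TABLE reviews_raw
    SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")
```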
Next. Which statement describes the default execution mode for Databricks Auto Loader? Option A is only partially correct: the part about cloud queues is accurate for Auto Loader's file notification mode, not for the default execution mode; in the default directory listing mode it relies on directory scans, and it would not query every file in the directory each time because it processes new files incrementally, so this is wrong. Option B: as just mentioned, Auto Loader incrementally processes only new files, not the entire directory every time, so this is wrong. Option C: webhooks could be involved in some designs, but the focus there is on schema inference, whereas Auto Loader in its default mode prioritizes efficient incremental loading, so no. Option E is partially correct just like option A, its first half is similar, but what it describes is not the default mode, which uses directory listing, so it's a goner. That leaves option D, which accurately describes the default behavior: directory listing to discover new files, followed by incremental, idempotent loading of those files, typically into Delta Lake tables. A minimal Auto Loader stream in that default mode looks roughly like this.
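A minimal Auto Loader sketch in its default directory-listing mode; switching to file notification mode would be done via the cloudFiles.useNotifications option. Paths and names are hypothetical.

```python
stream = (
    spark.readStream
        .format("cloudFiles")                                   # Auto Loader source
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
        .load("/mnt/landing/events")
)

(
    stream.writeStream
        .option("checkpointLocation", "/mnt/checkpoints/events")
        .toTable("events_bronze")                                # incremental, idempotent loads
)
```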
Next. A Delta Lake table representing metadata about content posts from users has the schema user_id, post_text, post_id, longitude, latitude, post_time, date. Based on this schema, which column is a good candidate for partitioning the Delta table? This should be an easy pick, and here's why. Option A, post_time: time-based queries are possible, but partitioning at that granularity leads to far too many tiny partitions, which makes management difficult; post_time is a timestamp, with seconds and sometimes fractions of a second, so a new partition would be created for nearly every value, which is not ideal. Option B, latitude or longitude: geospatial partitioning can be excellent, but it depends on the query patterns; if most queries don't filter on location it won't help much, and if a lot of users post from the same area or region the partitions will be skewed. Option C, post_id: high cardinality, many unique values, similar to post_time, potentially leading to too many partitions and making it less effective. Option D, user_id: similar to A and C; it can be useful if your primary query pattern focuses on a specific user's post history, but that's less common than date-based analysis. So you will usually pick the date column for partitioning. Queries on a table like this are highly likely to filter on date ranges, for example analyzing posts within a week or a month, and partitioning by date aligns perfectly with that: it naturally groups posts into daily chunks, makes queries targeting specific dates efficient by skipping unnecessary partitions, and, assuming dates are reasonably well distributed, keeps the number of partitions moderate, balancing efficiency against management overhead. Some important considerations if you implement this at work: if the table is huge, date partitioning might need further refinement, for example monthly partitions instead of daily; always consider your most common queries, because if location-based filtering dominates, latitude and longitude might be the better choice; Z-ordering on additional columns, for example date plus post_time, can further enhance query performance; and data access patterns change, so the best partitioning strategy may need to be adjusted over time. I hope that helps; a partitioned table definition is sketched below.
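A sketch of the recommended layout, with the question's columns reconstructed loosely.

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS posts (
        user_id    BIGINT,
        post_text  STRING,
        post_id    BIGINT,
        longitude  DOUBLE,
        latitude   DOUBLE,
        post_time  TIMESTAMP,
        `date`     DATE
    )
    USING DELTA
    PARTITIONED BY (`date`)    -- date-range queries can prune whole partitions
""")
```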
Next. A large company seeks to implement a near-real-time solution involving hundreds of pipelines with parallel writes to many tables, with extremely high-volume and high-velocity data. Which of the following solutions would you implement to achieve this requirement? Option A: high-concurrency clusters in Databricks are designed to handle multiple concurrent users and workloads efficiently; they can handle a large number of queries concurrently, but they don't necessarily optimize data throughput for high-volume, high-velocity processing. Option C, configure Databricks to save all data to attached SSD volumes: SSD storage offers faster read/write speeds than traditional HDDs, but it may not be the most cost-effective way to store large volumes of data, and the bottleneck in processing high-volume, high-velocity data is usually not storage I/O but compute resources and parallel processing capability. Option D, storage container isolation: potentially helpful for API limits, but it doesn't directly address the core issue of massive parallel writes, so cross it out. Option E, a single database: consolidating everything can overload the metastore, which manages table metadata, whereas partitioning helps with both metadata handling and write distribution, so this one's a goner too. The right answer is option B: partitioning ingest tables by a short time interval, for example minutes or even seconds, naturally aligns with high-velocity data intake. It creates many smaller files, enabling massive parallelism when writing to Delta Lake, and Delta Lake and Spark work well with this pattern, distributing the workload across executors and effectively leveraging cluster resources; smaller partitions also lead to more manageable metadata in the Delta transaction log, which is beneficial at scale. Important considerations: good partitioning keys are crucial to ensure even load distribution, optimized object stores like S3 or Azure Blob Storage are essential for scaling, and visibility into write patterns and potential bottlenecks is very important.

Next. Which describes a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster? Option A suggests activating a virtual environment in a notebook setup script; activating a virtual environment is a common way to isolate package installations, but it isn't how you install packages scoped at the notebook level across all nodes of the active cluster, and doing it in a setup script would only affect the environment where the script runs, not the whole cluster. Option B looks like it has a typo, and in any case there is no standard magic command consisting of the single character B. Option D: %sh lets you run shell commands, but installing via pip that way typically only puts the package on the driver node, not the workers, so that's a goner. Option E installs at the cluster level, affecting all notebooks, which is not notebook-scoped. So the answer is option C, %pip: this magic command in a Databricks notebook cell is specifically designed for installing Python packages within the context of the notebook and its attached cluster, and packages installed with it become available to all nodes of the active cluster, guaranteeing the package is present for your notebook code. Just remember that notebook-scoped libraries installed with %pip don't persist across cluster restarts; you need to rerun the %pip install cell after a restart.
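For reference, notebook-scoped installation is just a cell like this; the package name is arbitrary.

```python
# Runs in its own notebook cell; installs the package for this notebook on all
# nodes of the attached cluster, and must be rerun after a cluster restart.
%pip install requests
```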
Next question: each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores, and only one executor per VM. Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to guarantee completion of the job in light of one or more VM failures? So basically, when you see this, option C is obviously eliminated, because they clearly said one or more VM failures and there is only one VM; what would happen? It's a goner. But anyway, let's look at all the options. Option A, 8 VMs: while somewhat distributed, a failure of one VM would take out a larger chunk of the job, increasing the rerun time. Option C, 1 VM: we already discussed this; it lacks any resiliency at all, and one VM failure could cause the entire job to fail. Option D: similar to A and C, less distribution than option B, making it more vulnerable to the impact of VM failures. And E: this has the least distribution, making the impact of a single VM failure significantly worse. So what is the right option? Obviously option B would be more resilient to one or more VM failures, because 16 VMs provide a good balance between distributing the job and managing overhead; too few VMs increases the impact of a single failure, while too many can lead to coordination overhead. Having smaller executors, about 25 GB each, means the work is divided into more fine-grained tasks, which makes it easier for other executors to take over tasks from failed VMs. Spark inherently tries to recover from task failures, so options with more VMs give you better chances to reassign tasks successfully. For jobs with highly dependent tasks, even finer-grained executor sizes, smaller than 25 GB, might be more favorable. A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented. Which command should be removed from the notebook before scheduling it as a job? First of all, we need to understand what this is doing. What is command one doing? It loads raw data into a DataFrame named rawDF using spark.table(); it's essential for the pipeline as it fetches the required input, so we cannot remove that. Command two prints the schema of the rawDF DataFrame using rawDF.printSchema(); while it's helpful for understanding the structure of the loaded data, it's not strictly necessary for the pipeline to run successfully, but it can be useful for debugging purposes, so let's leave it there. And option, sorry, command three flattens the rawDF DataFrame into flattenedDF by selecting all columns and adding the fields of the values struct as separate columns; it's a crucial step for data transformation and is necessary for generating the final output. Command four drops the values column from the flattened DataFrame, creating the final DataFrame; it's a necessary step for removing redundant or unwanted columns and preparing the data for storage. Command five explains the execution plan of the DataFrame transformations using explain(); while it's not strictly necessary for the pipeline to execute successfully, it can provide valuable insight into how Spark will execute the operations, for optimization and performance tuning. Command six displays the final DataFrame using display(); it's specific to Databricks notebooks and is used for interactive visualization, and since scheduled jobs don't support interactive display, this command should be removed for job scheduling. Command seven writes the final DataFrame to a table named flat_data in append mode; it's a crucial step for persisting the processed data and is necessary for the pipeline. So given all of the above analysis, command six should be removed before scheduling the notebook as a job, as it's specific to interactive visualization within Databricks notebooks and is not suitable for scheduled job execution; all the other commands are relevant for processing and storing the data, making them suitable for inclusion in the scheduled job. So what is our answer? Option E is our right answer; that's the command we are going to remove.
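To make that answer concrete, here is a rough sketch of what such a notebook might look like once display() is removed; the table name, the variable names, and the values struct are my reconstruction from the description above, not the literal code from the question.

# Databricks notebook; 'spark' is the SparkSession the notebook provides.

# Cmd 1: load the raw input table (essential for the pipeline)
raw_df = spark.table("raw_data")

# Cmd 2: print the schema; not required for the job, but harmless and handy for debugging
raw_df.printSchema()

# Cmd 3: flatten the nested 'values' struct into top-level columns
flattened_df = raw_df.select("*", "values.*")

# Cmd 4: drop the original struct column now that it has been flattened
final_df = flattened_df.drop("values")

# Cmd 5: show the execution plan; optional, useful for performance tuning
final_df.explain()

# Cmd 6 -- display(final_df) -- is removed: it only makes sense interactively.

# Cmd 7: persist the result in append mode
final_df.write.mode("append").saveAsTable("flat_data")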
The business reporting team requires the data for their dashboards to be updated every hour; the total processing time for the pipeline that extracts, transforms, and loads the data is 10 minutes. Assuming normal operating conditions, which configuration will meet their service-level agreement (SLA) requirements with the lowest cost? So it needs to be updated every hour and the pipeline takes 10 minutes to run. Let's see. Option A does not meet the SLA requirement of updating the data every hour automatically; manual intervention may lead to delays and is not suitable for meeting SLAs. Anytime you see SLA and production together, manual triggering is thrown out of the window. Option C: while Structured Streaming offers real-time processing capabilities, setting a trigger interval of 60 minutes may not be suitable for this scenario, as it doesn't guarantee execution exactly once every hour; it may introduce delays and doesn't align well with the hourly update requirement. Option D: using a dedicated interactive cluster for executing the job would incur continuous costs even during idle periods when the cluster is not actively used, so this option is not cost-effective compared to using a job cluster, which we'll see in a minute. Option E ensures that the job executes whenever new data arrives, but it doesn't guarantee hourly updates, and it may lead to unnecessary job executions and higher costs if data arrives more frequently than once an hour. So what is the best solution that gives the lowest cost? Option B aligns with the SLA requirement of updating the data every hour and is cost-effective, as it involves spinning up a new job cluster only when needed, minimizing costs. The business intelligence team has a dashboard configured to track various summary statistics or metrics for retail stores; this includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema (go through it). For demand forecasting, the lakehouse contains a validated table of all itemized sales, updated incrementally in near real time; this table, named products_per_order, includes the following fields. Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require the data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce the total compute associated with each materialization. Which solution meets the expectations of the end users while controlling and limiting possible costs? That's a big question. Okay, let's look at option B: using Structured Streaming may introduce unnecessary complexity and overhead, as real-time processing is not required for dashboard updates; they only want refreshes once a day. Option C: configuring a webhook may lead to increased query latency and resource consumption if it is executed each time the dashboard is refreshed. Option D: using the Delta cache is not suitable for reducing the compute cost associated with each materialization whatsoever. And option E: defining a view may not offer the same performance benefits as pre-calculating the summary statistics in a nightly batch job. So that leads us to option A. Since the dashboard only needs to be refreshed once daily and long-term sales trends are less volatile, a nightly batch job aligns well with these requirements: by running a nightly batch job, the required summary metrics can be calculated and saved as a table, ensuring consistent and reliable data for the dashboard. Using a batch job reduces the total compute associated with each materialization, as it allows for efficient processing of large volumes of data at once rather than processing each query in real time, and overwriting the table with each update ensures that the dashboard reflects the latest data without accumulating unnecessary historical data, thus controlling and limiting possible costs. So overall, option A is the most appropriate solution for meeting the end users' expectations while controlling and limiting costs.
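As a sketch of that nightly approach: the batch job below recomputes the summary once and overwrites the reporting table, so dashboard queries hit a small pre-aggregated table instead of the full sales data. Only the products_per_order table name comes from the question; the column names, metrics, and output table are assumptions on my part.

from pyspark.sql import functions as F

# Nightly batch job: recompute the summary metrics and overwrite the reporting table.
daily_summary = (
    spark.table("products_per_order")
        .groupBy("store_id", F.to_date("order_timestamp").alias("order_date"))
        .agg(
            F.sum("sales_amount").alias("total_sales"),
            F.avg("sales_amount").alias("avg_sale"),
        )
)

# Overwrite keeps only the latest snapshot, so the dashboard always reads fresh,
# pre-aggregated data without accumulating history it does not need.
daily_summary.write.mode("overwrite").saveAsTable("store_sales_summary")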
A Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows (note that the proposed changes are in bold): so this is the original query, and whatever you see in bold is the new part of the query. Which step must also be completed to put the proposed query into production? So as it stands, it looks like it won't work; so what is it? Okay, let's go through the options. Option B: if the new promotion_code field significantly increases the number of distinct groups or aggregates in the groupBy operation, it may lead to skewed data distribution and inefficient processing due to inadequate shuffle partitions; increasing the shuffle partitions helps distribute the data more evenly across executors, improving parallelism and reducing processing bottlenecks. However, this step is not required by the proposed query and may not be necessary unless there are significant changes to the data distribution or processing requirements. Option C: running the REFRESH TABLE command updates the metadata for the Delta table, ensuring that changes made to the underlying data files are reflected in the table's metadata; this step may be necessary if the table's metadata needs to be refreshed to reflect changes in the data structure or schema, but it is not directly related to the streaming query modification described in this scenario. Option D: registering the data directory with the Hive metastore allows Hive-compatible tools and services to access and query the data; while registering the data may be part of the production deployment process, it is not specifically related to modifying the streaming query or updating checkpoint locations. And option E: the mergeSchema option is used to specify whether to merge the schema of the data files with the schema of the table when writing to Delta Lake; if there are changes to the schema or additional fields are added, it's important to ensure that the mergeSchema option is correctly set to avoid schema conflicts and ensure data integrity, but this option is not mentioned in the proposed query, so there is nothing to remove. So that leaves us with option A. When making changes to a Structured Streaming query, especially modifying the output schema or changing aggregation operations, it's essential to use a new checkpoint location; this ensures that the streaming job starts from a clean state and does not reuse any existing checkpoints, which may not be compatible with the updated query. In the proposed query, the checkpointLocation option is specified, but it points to the same location as the existing query's checkpoint, so a new checkpoint location needs to be specified to avoid conflicts with the previous checkpoints.
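Here is a minimal sketch of that idea; the table names, grouping columns, and checkpoint paths are placeholders I'm assuming, since we can't see the actual query. The only point that matters is the new checkpointLocation.

from pyspark.sql import functions as F

# The aggregation now also tracks promotion_code usage, which changes the state
# and output schema of the streaming query...
item_agg = (
    spark.readStream.table("item_sales")
        .groupBy("item_id", "promotion_code")
        .agg(
            F.sum("quantity").alias("units_sold"),
            F.count("promotion_code").alias("promo_uses"),
        )
)

# ...so the write must point at a NEW checkpoint location; reusing the old one
# would try to restore state that no longer matches the query.
(
    item_agg.writeStream
        .outputMode("complete")
        .option("checkpointLocation", "/mnt/checkpoints/item_agg_v2")  # new path, not the old one
        .toTable("item_sales_aggregates")
)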
And the final question for today; and if you haven't subscribed, subscribe, like, comment, and do all that awesome stuff. A Structured Streaming job deployed to production has been resulting in higher-than-expected cloud storage costs. At present, during normal execution, each micro-batch of data is processed in less than 3 seconds, and at least 12 times per minute a micro-batch is processed that contains zero records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in this workspace, with instance pools provisioned to reduce startup time for jobs with batch execution. Holding all other variables constant, and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement? Option A suggests reducing the trigger interval to 3 seconds to process data more frequently; however, such a short interval may result in smaller micro-batches being processed, potentially leading to more frequent checkpointing and increased overhead. While it may reduce latency, it could also increase cloud storage costs due to more frequent API calls to the source storage account and potentially more data being written. Option B: increasing the number of shuffle partitions can improve parallelism and help distribute the workload across executors; however, it does not directly address the issue of higher-than-expected cloud storage costs. While it can enhance performance, it is not the most effective solution for reducing costs in this scenario; remember, cost is the thing we want to handle. Option D: similar to option A, this proposes reducing the trigger interval, to 500 milliseconds, to process data more frequently; however, such a short interval may lead to even more frequent API calls, and the same explanation we gave for option A applies. And option E suggests using the trigger-once option to execute the query every 10 minutes as a Databricks job; while it can minimize compute and storage costs, it introduces additional complexity by requiring separate job configuration and scheduling. So then we are left with option C. But why is that the right one? Because it suggests setting a longer trigger interval of 10 minutes to reduce the frequency of API calls to the source storage account; by processing data less frequently, it aims to minimize the cloud storage costs associated with accessing data, and given that each micro-batch is processed quickly, this adjustment should still meet the requirement of processing records in less than 10 minutes. So in summary, option C appears to be the most appropriate choice, as it addresses the requirement of processing records in less than 10 minutes while minimizing cloud storage costs by reducing the frequency of API calls to the source storage account (there's a small sketch of this trigger setting right at the end). I think that's about it. I hope you enjoyed learning these 32 questions, and all the best for your examination. You know, Databricks doesn't usually provide updates to their questions; that's why we don't see that many new questions. They rarely add any, and that's the reason I don't post anything new. All right, but if I get any new questions I would obviously upload them, and I think in three to four days you can expect a quiz set for all these 117 questions. So all the best, and again, if you haven't subscribed to this channel, do that, and don't forget to like and comment; that is a must. Anyway, all the best once again, thank you very much for hanging out here and watching this video. See you in the next video. Have a great day. Peace out.
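Oh, and here's that small sketch I mentioned for the last trigger question. The source and target table names and the checkpoint path are just placeholders I'm assuming; the only part that matters is the 10-minute processing-time trigger.

# The default trigger starts a new micro-batch as soon as the previous one finishes,
# which here means lots of near-empty batches and lots of storage API calls.
# A 10-minute processingTime trigger batches up the work while still keeping
# end-to-end latency under the 10-minute requirement.
(
    spark.readStream.table("bronze_events")
        .writeStream
        .trigger(processingTime="10 minutes")
        .option("checkpointLocation", "/mnt/checkpoints/bronze_to_silver")
        .toTable("silver_events")
)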