Interview Transcript
Jul 3, 2024
Key Points from Interview Transcript
Introduction
Santosh: Interviewee, working at Microsoft with 9 years of experience in data-driven technologies.
Arun: Interviewer with 15 years of experience, including 10+ years at TCS as a data scientist and data engineer.
Santosh's Background
SQL: 8 years
Python: 6 years
Azure Services: 6 years (Databricks, Data Factory, Data Lake Storage)
Big Data technologies: 3 years (Hive, MapReduce, Pig)
BI Tools: Reporting for clients
Ratings:
SQL: 8.5/10
Python: 8/10
Big Data: 7/10
Azure: 8/10
Technical Skills Demonstration
SQL Query: Query to find country-wise total sales.
Python: Using pandas to replicate the SQL query.
PySpark: Group by and sum on a PySpark DataFrame.
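The country-wise total sales exercise above can be sketched in pandas; the table and column names (`sales`, `country`, `amount`) are assumptions for illustration, not from the interview:

```python
import pandas as pd

# Hypothetical sales data; column names are assumptions.
sales = pd.DataFrame({
    "country": ["IN", "US", "IN", "UK", "US"],
    "amount":  [100,  250,  150,  300,  50],
})

# SQL equivalent:
#   SELECT country, SUM(amount) AS total_sales
#   FROM sales
#   GROUP BY country;
totals = (
    sales.groupby("country", as_index=False)["amount"]
         .sum()
         .rename(columns={"amount": "total_sales"})
)

# PySpark equivalent (sketch):
#   df.groupBy("country").agg(F.sum("amount").alias("total_sales"))
print(totals)
```

The same group-by-and-sum shape carries across all three tools; only the API surface changes.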
Azure Data Engineering
Daily Activities: Agile methodology, ad hoc requests, pipeline implementation, reporting.
Project Overview: Data collection and integration from multiple sources; transformation using Azure Data Factory and Databricks.
Transformations: Rolling 7-day sales, 14-day sales average, products sold per year.
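The rolling 7-day sales transformation mentioned above can be sketched with pandas; the data and column names here are made up for illustration:

```python
import pandas as pd

# Hypothetical daily sales series; values are illustrative only.
daily = pd.DataFrame(
    {"sales": [10, 20, 30, 40, 50, 60, 70, 80]},
    index=pd.date_range("2024-07-01", periods=8, freq="D"),
)

# Time-based rolling window: sum of sales over the trailing 7 days.
daily["rolling_7d_sales"] = daily["sales"].rolling("7D").sum()
print(daily.tail(2))
```

Swapping `"7D"` for `"14D"` and `.sum()` for `.mean()` gives the 14-day sales average from the same note.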
Optimization
Spark Jobs: Use RDDs, Datasets, DataFrames, persist, cache, serialization, broadcast variables, and join optimizations.
Lazy Evaluation: Program organized into small operations; improves speed and avoids unnecessary computation.
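Spark evaluates transformations lazily: nothing executes until an action is called. A pure-Python analogy using generators (this is not Spark code, just an illustration of the same principle):

```python
def transform(records):
    # Each step is lazy -- building the pipeline does no work yet,
    # analogous to chaining Spark transformations.
    doubled = (r * 2 for r in records)
    positive = (r for r in doubled if r > 0)
    return positive

pipeline = transform([3, -1, 4])   # no computation has happened yet
result = list(pipeline)            # the "action": forces evaluation
print(result)                      # [6, 8]
```

Because evaluation is deferred until the action, the whole chain can be optimized and records that fail a filter are never materialized, which is the "avoid unnecessary computations" point above.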
Pipeline and Scheduling
ETL Tasks: Data extraction and transformation, storage in data lakes, running models.
Airflow: Used for scheduling and running jobs.
Incremental Load: Use of a watermark column for efficient data loading.
Autoscaling vs. Incremental Load: Incremental load is more efficient for specific intervals.
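The watermark pattern above can be sketched as follows; the column name `modified_at` and the record shape are hypothetical, not from the interview:

```python
from datetime import datetime

# Minimal sketch of a watermark-based incremental load: only rows
# modified after the stored watermark are pulled, and the watermark
# is advanced for the next run.
def incremental_load(source_rows, last_watermark):
    new_rows = [r for r in source_rows if r["modified_at"] > last_watermark]
    new_watermark = max((r["modified_at"] for r in new_rows),
                        default=last_watermark)
    return new_rows, new_watermark

rows = [
    {"id": 1, "modified_at": datetime(2024, 7, 1)},
    {"id": 2, "modified_at": datetime(2024, 7, 2)},
    {"id": 3, "modified_at": datetime(2024, 7, 3)},
]
loaded, wm = incremental_load(rows, datetime(2024, 7, 1))
print([r["id"] for r in loaded])   # only rows newer than the watermark
```

This is why incremental load beats reprocessing: each run touches only the delta since the persisted watermark rather than the full table.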
Additional Technical Knowledge
Group by Key vs. Reduce by Key: Optimization and shuffling; reduceByKey combines values locally before shuffling.
Azure Functions: Use cases such as optimizing pipeline costs by triggering on file changes.
Data Validation: Error log tables, exception handling in Python, stored procedures.
Custom Functions in PySpark: UDFs (User Defined Functions).
Access Management: Azure Active Directory, access management using IAM, access keys.
Troubleshooting in Databricks: Resource consumption, log analytics.
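The groupByKey vs. reduceByKey point can be illustrated in pure Python (this is not Spark code; partitioning is simulated): reduceByKey combines values per key on each partition before anything moves, so far fewer records cross the shuffle.

```python
# Two simulated partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],   # partition 0
    [("a", 4), ("b", 5)],             # partition 1
]

def combine_locally(partition):
    # Map-side combine: one (key, partial_sum) per key per partition.
    acc = {}
    for key, value in partition:
        acc[key] = acc.get(key, 0) + value
    return list(acc.items())

# groupByKey would shuffle every record (5 here); reduceByKey shuffles
# only the locally combined pairs (4 here).
shuffled = [pair for p in partitions for pair in combine_locally(p)]

final = {}
for key, value in shuffled:           # reduce-side merge after the shuffle
    final[key] = final.get(key, 0) + value
print(final)                          # {'a': 8, 'b': 7}
```

With realistic data volumes the gap widens: groupByKey moves every raw record across the network, while reduceByKey moves at most one partial result per key per partition.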
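The error-log-table idea under Data Validation can be sketched as follows; the field names and validation rule are illustrative assumptions:

```python
# Bad records are caught with exception handling and appended to an
# error log (standing in for an error log table) instead of failing
# the whole load.
error_log = []

def validate(record):
    if record.get("amount") is None:
        raise ValueError("amount is missing")
    return {"id": record["id"], "amount": float(record["amount"])}

def load(records):
    clean = []
    for record in records:
        try:
            clean.append(validate(record))
        except (ValueError, TypeError, KeyError) as exc:
            error_log.append({"record": record, "error": str(exc)})
    return clean

clean = load([
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": None},     # rejected and logged, load continues
    {"id": 3, "amount": "7"},
])
print(len(clean), len(error_log))
```

In a real pipeline the `error_log` rows would be written to the error log table for later review, while the clean rows proceed downstream.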
Conclusion
Role Description: ETL tasks, scheduling jobs, supporting data science and BI teams, and automating processes.
TCS Experience: Large MNC, employee-friendly, with the possibility of onsite opportunities depending on performance.