Interview Transcript

Jul 3, 2024

Key Points from Interview Transcript

Introduction

  • Santosh: Interviewee working at Microsoft, with 9 years of experience in data-driven technologies.
  • Arun: Interviewer with 15 years of experience, including 10+ years at TCS as a data scientist and data engineer.

Santosh's Background

  • SQL: 8 years
  • Python: 6 years
  • Azure Services: 6 years, including Databricks, Data Factory, and Data Lake Storage
  • Big Data technologies: 3 years, including Hive, MapReduce, and Pig
  • BI Tools: Reporting for clients
  • Ratings:
    • SQL: 8.5/10
    • Python: 8/10
    • Big Data: 7/10
    • Azure: 8/10

Technical Skills Demonstration

  • SQL Query: Query to find country-wise total sales
  • Python: Replicating the same aggregation with pandas
  • PySpark: Group-by and sum on a PySpark DataFrame (a combined sketch of all three follows this list)
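
The transcript summarizes these demonstrations without the code itself. A minimal sketch of what each might look like, assuming a sales table with columns country and sales_amount (hypothetical names):

    # Hypothetical schema: a "sales" table with columns country and sales_amount.
    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("country-sales").getOrCreate()

    # SQL version: total sales per country.
    spark.createDataFrame(
        [("IN", 100.0), ("IN", 250.0), ("US", 300.0)],
        ["country", "sales_amount"],
    ).createOrReplaceTempView("sales")
    sql_result = spark.sql(
        "SELECT country, SUM(sales_amount) AS total_sales FROM sales GROUP BY country"
    )

    # pandas version: the same aggregation with groupby + sum.
    pdf = pd.DataFrame(
        {"country": ["IN", "IN", "US"], "sales_amount": [100.0, 250.0, 300.0]}
    )
    pandas_result = pdf.groupby("country", as_index=False)["sales_amount"].sum()

    # PySpark DataFrame version: groupBy + agg(sum).
    spark_result = spark.table("sales").groupBy("country").agg(
        F.sum("sales_amount").alias("total_sales")
    )
    spark_result.show()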

Azure Data Engineering

  • Daily Activities: Agile methodology, ad hoc requests, pipeline implementation, reporting.
  • Project Overview: Data collection and integration from multiple sources; transformation using Azure Data Factory and Databricks.
  • Transformations: Rolling 7-day and 14-day sales averages, products sold per year (see the window-function sketch after this list).
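
The transformation logic itself isn't quoted in the transcript. A minimal PySpark sketch of the rolling averages, assuming a daily-aggregated table daily_sales with columns order_date, sales_amount, and quantity (all hypothetical names):

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("rolling-sales").getOrCreate()
    daily = spark.table("daily_sales")  # hypothetical: one row per order_date

    # Order by epoch seconds so rangeBetween can express calendar windows.
    secs = F.col("order_date").cast("timestamp").cast("long")
    w7 = Window.orderBy(secs).rangeBetween(-6 * 86400, 0)    # today + 6 prior days
    w14 = Window.orderBy(secs).rangeBetween(-13 * 86400, 0)  # today + 13 prior days

    rolling = (
        daily
        .withColumn("sales_avg_7d", F.avg("sales_amount").over(w7))
        .withColumn("sales_avg_14d", F.avg("sales_amount").over(w14))
    )

    # Products sold per year.
    per_year = daily.groupBy(F.year("order_date").alias("year")).agg(
        F.sum("quantity").alias("products_sold")
    )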

Optimization

  • Spark Jobs: Choose the right abstraction (RDDs, Datasets, DataFrames) and use persist/cache, efficient serialization, and broadcast variables and joins (sketch after this list).
  • Lazy Evaluation: Transformations are recorded as a plan of small operations and run only when an action executes, which lets Spark optimize the plan and avoid unnecessary computations.
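
A minimal sketch of a few of these techniques together; the table and column names are assumptions:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("spark-optimization").getOrCreate()

    sales = spark.table("sales")          # hypothetical large fact table
    countries = spark.table("countries")  # hypothetical small dimension table

    # persist/cache: keep a reused DataFrame in memory (spilling to disk)
    # rather than recomputing it for every downstream action.
    filtered = sales.filter(F.col("sales_amount") > 0)
    filtered.persist(StorageLevel.MEMORY_AND_DISK)

    # Broadcast join: ship the small table to every executor so the large
    # table joins without a shuffle.
    joined = filtered.join(F.broadcast(countries), on="country")

    # Lazy evaluation: nothing above has run yet; Spark has only recorded a
    # plan. This action triggers plan optimization and execution.
    joined.groupBy("country").count().show()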

Pipeline and Scheduling

  • ETL Tasks: Data extraction and transformation, storage in data lakes, and running models.
  • Airflow: Used for scheduling and running jobs (sample DAG after this list).
  • Incremental Load: A watermark column marks what has already been loaded, so each run processes only new or changed rows (sketch below).
  • Autoscaling vs. Incremental Load: Incremental load is the more efficient choice when only data from specific intervals needs processing, since it avoids reprocessing the full dataset.
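
A minimal Airflow DAG sketch for a daily ETL job (Airflow 2.x; the DAG and task names are assumptions):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_etl():
        # Placeholder for the extract/transform/load logic.
        pass

    with DAG(
        dag_id="daily_sales_etl",        # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",               # run once per day (Airflow >= 2.4 syntax)
        catchup=False,
    ):
        PythonOperator(task_id="run_etl", python_callable=run_etl)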
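
And a minimal sketch of watermark-based incremental loading; the control table, source table, and output path are all hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("incremental-load").getOrCreate()

    # Read the last successfully loaded watermark from a control table
    # (assumes the control table is seeded before the first run).
    last_wm = (
        spark.table("etl_watermarks")
        .filter(F.col("table_name") == "sales")
        .agg(F.max("watermark_value"))
        .first()[0]
    )

    # Pull only rows modified since the stored watermark.
    delta = spark.table("source_sales").filter(F.col("modified_at") > F.lit(last_wm))

    # Append the delta to the lake; a full reload would rewrite everything.
    delta.write.mode("append").format("delta").save("/mnt/datalake/sales")

    # Advance the watermark for the next run.
    new_wm = delta.agg(F.max("modified_at")).first()[0]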

Additional Technical Knowledge

  • groupByKey vs. reduceByKey: reduceByKey combines values before the shuffle, so it moves less data (RDD sketch after this list).
  • Azure Functions: Use cases such as cutting pipeline costs by triggering on file changes (sketch below).
  • Data Validation: Error log tables, exception handling in Python, stored procedures (sketch below).
  • Custom Functions in PySpark: UDFs (user-defined functions); see the sketch below.
  • Access Management: Azure Active Directory, IAM role assignments, and access keys.
  • Troubleshooting in Databricks: Resource consumption, Log Analytics.
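
A minimal RDD sketch of the groupByKey/reduceByKey difference:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reduce-vs-group").getOrCreate()
    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

    # reduceByKey combines values within each partition before the shuffle
    # (a map-side combine), so only partial sums cross the network.
    reduced = pairs.reduceByKey(lambda x, y: x + y)

    # groupByKey shuffles every (key, value) pair before grouping, moving
    # more data and risking memory pressure on hot keys.
    grouped = pairs.groupByKey().mapValues(sum)

    # Same result, different shuffle cost.
    assert sorted(reduced.collect()) == sorted(grouped.collect())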
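
For the Azure Functions use case, a minimal sketch using the Python v2 programming model, assuming a hypothetical sales-data blob container:

    import logging
    import azure.functions as func

    app = func.FunctionApp()

    # Fires when a file lands in (or changes within) the container, so the
    # downstream pipeline runs only when there is new data, keeping costs down.
    @app.blob_trigger(
        arg_name="blob",
        path="sales-data/{name}",         # hypothetical container
        connection="AzureWebJobsStorage",
    )
    def on_file_change(blob: func.InputStream):
        logging.info("New file %s (%s bytes); trigger the pipeline here", blob.name, blob.length)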
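
A minimal sketch of the validation pattern, routing bad rows to an error log table; all names here are assumptions:

    import traceback
    from datetime import datetime, timezone

    def load_batch(rows, validate, write_target, write_error_log):
        """Write valid rows to the target; log failures with their exceptions."""
        good, errors = [], []
        for row in rows:
            try:
                validate(row)  # assumed to raise ValueError on bad data
                good.append(row)
            except Exception as exc:
                errors.append({
                    "row": repr(row),
                    "error": repr(exc),
                    "trace": traceback.format_exc(),
                    "logged_at": datetime.now(timezone.utc).isoformat(),
                })
        write_target(good)
        if errors:
            write_error_log(errors)  # e.g., append to an error_log table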
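
And a minimal PySpark UDF sketch (the column and function names are assumptions; built-in functions are preferred where they exist, since Python UDFs bypass Spark's optimizer):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-demo").getOrCreate()

    # Wrap a plain Python function as a UDF so it can run on DataFrame columns.
    @F.udf(returnType=StringType())
    def normalize_country(code):
        return code.strip().upper() if code else None

    df = spark.createDataFrame([(" in ",), ("US",)], ["country"])
    df.withColumn("country_norm", normalize_country("country")).show()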

Conclusion

  • Role Description: ETL tasks, scheduling jobs, supporting data science and BI teams, and automating processes.
  • TCS Experience: A large, employee-friendly MNC, with the possibility of onsite opportunities depending on performance.