Interview Transcript

Jul 3, 2024

Key Points from Interview Transcript

Introduction

  • Santosh: Interviewee working at Microsoft, with 9 years of experience in data-driven technologies.
  • Arun: Interviewer with 15 years of experience, including 10+ years at TCS as a data scientist and data engineer.

Santosh's Background

  • SQL: 8 years
  • Python: 6 years
  • Azure Services: 6 years, including Databricks, Data Factory, and Data Lake Storage
  • Big Data technologies: 3 years, including Hive, MapReduce, and Pig
  • BI Tools: Reporting for clients
  • Ratings:
    • SQL: 8.5/10
    • Python: 8/10
    • Big Data: 7/10
    • Azure: 8/10

Technical Skills Demonstration

  • SQL Query: Query to find country-wise total sales
  • Python: Replicating the same aggregation with pandas
  • PySpark: Group-by and sum on a PySpark DataFrame (a combined sketch of all three follows this list)
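
The transcript summarizes these demonstrations without the code itself. A minimal sketch of what each might look like, assuming a sales table with columns country and sales_amount (hypothetical names):

    # Hypothetical schema: a "sales" table with columns country and sales_amount.
    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("country-sales").getOrCreate()

    # SQL version: total sales per country.
    spark.createDataFrame(
        [("IN", 100.0), ("IN", 250.0), ("US", 300.0)],
        ["country", "sales_amount"],
    ).createOrReplaceTempView("sales")
    sql_result = spark.sql(
        "SELECT country, SUM(sales_amount) AS total_sales FROM sales GROUP BY country"
    )

    # pandas version: the same aggregation with groupby + sum.
    pdf = pd.DataFrame(
        {"country": ["IN", "IN", "US"], "sales_amount": [100.0, 250.0, 300.0]}
    )
    pandas_result = pdf.groupby("country", as_index=False)["sales_amount"].sum()

    # PySpark DataFrame version: groupBy + agg(sum).
    spark_result = spark.table("sales").groupBy("country").agg(
        F.sum("sales_amount").alias("total_sales")
    )
    spark_result.show()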

Azure Data Engineering

  • Daily Activities: Agile methodology, ad hoc requests, pipeline implementation, reporting.
  • Project Overview: Data collection and integration from multiple sources; transformation using Azure Data Factory and Databricks.
  • Transformations: Rolling 7-day and 14-day sales averages, products sold per year (see the window-function sketch after this list).
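
The transformation logic itself isn't quoted in the transcript. A minimal PySpark sketch of the rolling averages, assuming a daily-aggregated table daily_sales with columns order_date, sales_amount, and quantity (all hypothetical names):

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("rolling-sales").getOrCreate()
    daily = spark.table("daily_sales")  # hypothetical: one row per order_date

    # Order by epoch seconds so rangeBetween can express calendar windows.
    secs = F.col("order_date").cast("timestamp").cast("long")
    w7 = Window.orderBy(secs).rangeBetween(-6 * 86400, 0)    # today + 6 prior days
    w14 = Window.orderBy(secs).rangeBetween(-13 * 86400, 0)  # today + 13 prior days

    rolling = (
        daily
        .withColumn("sales_avg_7d", F.avg("sales_amount").over(w7))
        .withColumn("sales_avg_14d", F.avg("sales_amount").over(w14))
    )

    # Products sold per year.
    per_year = daily.groupBy(F.year("order_date").alias("year")).agg(
        F.sum("quantity").alias("products_sold")
    )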

Optimization

  • Spark Jobs: Choose the right abstraction (RDDs, Datasets, DataFrames) and use persist/cache, efficient serialization, and broadcast variables and joins (sketch after this list).
  • Lazy Evaluation: Transformations are recorded as a plan of small operations and run only when an action executes, which lets Spark optimize the plan and avoid unnecessary computations.
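
A minimal sketch of a few of these techniques together; the table and column names are assumptions:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("spark-optimization").getOrCreate()

    sales = spark.table("sales")          # hypothetical large fact table
    countries = spark.table("countries")  # hypothetical small dimension table

    # persist/cache: keep a reused DataFrame in memory (spilling to disk)
    # rather than recomputing it for every downstream action.
    filtered = sales.filter(F.col("sales_amount") > 0)
    filtered.persist(StorageLevel.MEMORY_AND_DISK)

    # Broadcast join: ship the small table to every executor so the large
    # table joins without a shuffle.
    joined = filtered.join(F.broadcast(countries), on="country")

    # Lazy evaluation: nothing above has run yet; Spark has only recorded a
    # plan. This action triggers plan optimization and execution.
    joined.groupBy("country").count().show()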

Pipeline and Scheduling

  • ETL Tasks: Data extraction and transformation, storage in data lakes, and running models.
  • Airflow: Used for scheduling and running jobs (sample DAG after this list).
  • Incremental Load: A watermark column marks what has already been loaded, so each run processes only new or changed rows (sketch below).
  • Autoscaling vs. Incremental Load: Incremental load is the more efficient choice when only data from specific intervals needs processing, since it avoids reprocessing the full dataset.
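
A minimal Airflow DAG sketch for a daily ETL job (Airflow 2.x; the DAG and task names are assumptions):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_etl():
        # Placeholder for the extract/transform/load logic.
        pass

    with DAG(
        dag_id="daily_sales_etl",        # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",               # run once per day (Airflow >= 2.4 syntax)
        catchup=False,
    ):
        PythonOperator(task_id="run_etl", python_callable=run_etl)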
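
And a minimal sketch of watermark-based incremental loading; the control table, source table, and output path are all hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("incremental-load").getOrCreate()

    # Read the last successfully loaded watermark from a control table
    # (assumes the control table is seeded before the first run).
    last_wm = (
        spark.table("etl_watermarks")
        .filter(F.col("table_name") == "sales")
        .agg(F.max("watermark_value"))
        .first()[0]
    )

    # Pull only rows modified since the stored watermark.
    delta = spark.table("source_sales").filter(F.col("modified_at") > F.lit(last_wm))

    # Append the delta to the lake; a full reload would rewrite everything.
    delta.write.mode("append").format("delta").save("/mnt/datalake/sales")

    # Advance the watermark for the next run.
    new_wm = delta.agg(F.max("modified_at")).first()[0]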

Additional Technical Knowledge

  • groupByKey vs. reduceByKey: reduceByKey combines values before the shuffle, so it moves less data (RDD sketch after this list).
  • Azure Functions: Use cases such as cutting pipeline costs by triggering on file changes (sketch below).
  • Data Validation: Error log tables, exception handling in Python, stored procedures (sketch below).
  • Custom Functions in PySpark: UDFs (user-defined functions); see the sketch below.
  • Access Management: Azure Active Directory, IAM role assignments, and access keys.
  • Troubleshooting in Databricks: Resource consumption, Log Analytics.
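
A minimal RDD sketch of the groupByKey/reduceByKey difference:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reduce-vs-group").getOrCreate()
    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

    # reduceByKey combines values within each partition before the shuffle
    # (a map-side combine), so only partial sums cross the network.
    reduced = pairs.reduceByKey(lambda x, y: x + y)

    # groupByKey shuffles every (key, value) pair before grouping, moving
    # more data and risking memory pressure on hot keys.
    grouped = pairs.groupByKey().mapValues(sum)

    # Same result, different shuffle cost.
    assert sorted(reduced.collect()) == sorted(grouped.collect())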
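
For the Azure Functions use case, a minimal sketch using the Python v2 programming model, assuming a hypothetical sales-data blob container:

    import logging
    import azure.functions as func

    app = func.FunctionApp()

    # Fires when a file lands in (or changes within) the container, so the
    # downstream pipeline runs only when there is new data, keeping costs down.
    @app.blob_trigger(
        arg_name="blob",
        path="sales-data/{name}",         # hypothetical container
        connection="AzureWebJobsStorage",
    )
    def on_file_change(blob: func.InputStream):
        logging.info("New file %s (%s bytes); trigger the pipeline here", blob.name, blob.length)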
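
A minimal sketch of the validation pattern, routing bad rows to an error log table; all names here are assumptions:

    import traceback
    from datetime import datetime, timezone

    def load_batch(rows, validate, write_target, write_error_log):
        """Write valid rows to the target; log failures with their exceptions."""
        good, errors = [], []
        for row in rows:
            try:
                validate(row)  # assumed to raise ValueError on bad data
                good.append(row)
            except Exception as exc:
                errors.append({
                    "row": repr(row),
                    "error": repr(exc),
                    "trace": traceback.format_exc(),
                    "logged_at": datetime.now(timezone.utc).isoformat(),
                })
        write_target(good)
        if errors:
            write_error_log(errors)  # e.g., append to an error_log table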
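
And a minimal PySpark UDF sketch (the column and function names are assumptions; built-in functions are preferred where they exist, since Python UDFs bypass Spark's optimizer):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-demo").getOrCreate()

    # Wrap a plain Python function as a UDF so it can run on DataFrame columns.
    @F.udf(returnType=StringType())
    def normalize_country(code):
        return code.strip().upper() if code else None

    df = spark.createDataFrame([(" in ",), ("US",)], ["country"])
    df.withColumn("country_norm", normalize_country("country")).show()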

Conclusion

  • Role Description: ETL tasks, scheduling jobs, supporting data science and BI teams, and automating processes.
  • TCS Experience: A large, employee-friendly MNC, with the possibility of onsite opportunities depending on performance.