Transcript for:
Data Pipelines

Today we're diving into the world of data pipelines. So what exactly is a data pipeline? In today's data-driven world, companies collect massive amounts of data from various sources. This data is critical for making informed business decisions and driving innovation. However, raw data is often messy, unstructured, and stored in different formats across multiple systems. Data pipelines automate the process of collecting, transforming, and delivering data to make it usable and valuable.

Data pipelines come in many different forms. The term is broad and covers any process of moving a large amount of data from one place to another. Represented here is a general version of it, but this is by no means the only way to implement an effective data pipeline. Broadly speaking, a data pipeline has these stages: collect, ingest, store, compute, and consume. The order of these stages can switch based on the type of data, but they are generally all present.

Let's start at the top with data collection. Imagine we're working for an e-commerce company like Amazon. We get data flowing in from multiple sources: data stores, data streams, and applications. Data stores are databases like MySQL, PostgreSQL, or DynamoDB where transaction records are stored. For instance, every user registration, order, and payment transaction goes into these databases. Data streams capture live data feeds in real time. Think of tracking user clicks and searches as they happen using tools like Apache Kafka or Amazon Kinesis, or data coming in from IoT devices.

With all these diverse data sources, the next stage is the ingest phase, where data gets loaded into the data pipeline environment. Depending on the type of data, it could be loaded directly into the processing pipeline or into an intermediate event queue. Tools like Apache Kafka or Amazon Kinesis are commonly used for real-time data streaming. Data from databases is often ingested through batch processing or change data capture tools. After ingestion, the data may be processed immediately or stored first, depending on the specific use case.

Here it makes sense to explain two broad categories of processing: batch processing and stream processing. Batch processing involves processing large volumes of data at scheduled intervals. Apache Spark, with its distributed computing capabilities, is key here. Other popular batch processing tools include Apache Hadoop MapReduce and Apache Hive. For instance, Spark jobs can be configured to run nightly to aggregate daily sales data.

Stream processing handles real-time data. Tools like Apache Flink, Google Cloud Dataflow, Apache Storm, or Apache Samza process data as it arrives. For example, Flink can be used to detect fraudulent transactions in real time by analyzing transaction streams and applying complex event processing rules. Stream processing typically processes data directly from the data sources, the data stores, data streams, and applications, rather than tapping into the data lake.

ETL or ELT processes are also critical to the compute phase. ETL tools like Apache Airflow and AWS Glue orchestrate data loading, ensuring transformations like data cleaning, normalization, and enrichment are applied before data is loaded into the storage layer. This is the stage where messy, unstructured, and inconsistently formatted data is transformed into a clean, structured format suitable for analysis.

After processing, data flows into the storage phase. Here we have several options: a data lake, a data warehouse, and a data lakehouse. Data lakes store raw and processed data using tools like Amazon S3 or HDFS. Data is often stored in formats like Parquet or Avro, which are efficient for large-scale storage and querying.
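To make the nightly batch-aggregation idea concrete, here is a minimal PySpark sketch of the pattern described above: read raw order events from the data lake, aggregate daily sales, and write the result back as Parquet. The bucket, paths, dataset, and column names (order_id, product_id, order_amount) are hypothetical placeholders, not anything shown in the video.

```python
# Minimal sketch of a nightly Spark batch job: aggregate daily sales
# from raw order events and write curated Parquet back to the lake.
# All paths and column names below are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily-sales-aggregation")  # hypothetical job name
    .getOrCreate()
)

# Read the day's raw order events from the data lake (Parquet on S3).
orders = spark.read.parquet("s3a://example-data-lake/raw/orders/dt=2024-01-01/")

# Clean and aggregate: drop malformed rows, then total sales per product.
daily_sales = (
    orders
    .filter(F.col("order_amount").isNotNull())
    .groupBy("product_id")
    .agg(
        F.sum("order_amount").alias("total_sales"),
        F.count("order_id").alias("order_count"),
    )
)

# Write the aggregated result to the curated zone of the lake in Parquet,
# ready for a downstream warehouse load.
daily_sales.write.mode("overwrite").parquet(
    "s3a://example-data-lake/curated/daily_sales/dt=2024-01-01/"
)

spark.stop()
```

In practice a job like this would be scheduled by an orchestrator such as Airflow rather than run by hand, with the date passed in as a parameter instead of hard-coded.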
Structured data is stored in data warehouses like Snowflake, Amazon Redshift, or Google BigQuery.

Finally, all this processed data is ready for consumption, and various end users leverage it. Data science teams use it for predictive modeling; tools like Jupyter notebooks with libraries like TensorFlow or PyTorch are common. Data scientists might build models to predict customer churn based on historical interaction data stored in the data warehouse. Business intelligence tools like Tableau or Power BI provide interactive dashboards and reports. These tools connect directly to data warehouses or lakehouses, enabling business leaders to visualize KPIs and trends. Self-service analytics tools like Looker empower teams to run queries without deep technical knowledge; LookML, the Looker modeling language, abstracts the complexity of SQL, allowing marketing teams to analyze campaign performance. Machine learning models use this data for continuous learning and improvement. For instance, bank fraud detection models are continuously trained with new transaction data to adapt to evolving fraud patterns.

And that's a wrap on the overview of data pipelines. If you like our videos, you might like our system design newsletter as well. It covers topics and trends in large-scale system design, trusted by 500,000 readers. Subscribe at blog.bytebytego.com.
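As a companion to the consumption stage described above, here is a rough sketch of the notebook workflow a data scientist might follow: pull historical interaction data from the warehouse and fit a simple churn model. The connection string, table, and column names are hypothetical, and while the video mentions TensorFlow and PyTorch, a scikit-learn logistic regression is used here purely to keep the sketch short.

```python
# Rough sketch: query a curated warehouse table and train a simple churn model.
# Connection details, table name, and columns are illustrative assumptions.
import pandas as pd
from sqlalchemy import create_engine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical warehouse connection exposed through SQLAlchemy.
engine = create_engine("postgresql://user:password@warehouse.example.com/analytics")

# Pull features and the churn label from a curated table.
df = pd.read_sql(
    "SELECT days_since_last_order, total_orders, avg_order_value, churned "
    "FROM customer_interactions",
    engine,
)

X = df[["days_since_last_order", "total_orders", "avg_order_value"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a baseline churn classifier and report holdout accuracy.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```

The point of the sketch is the shape of the workflow, warehouse query in, model out, rather than the specific library or model choice.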