Transcript for:
Data Lakehouse and History of Data Management

What is a data lakehouse? The history of data management.

In this video, you'll learn about the origin and purpose of the data lakehouse and the challenges of managing big data. To understand what a data lakehouse is, you'll need to explore the history of data management and analytics.

In the late 1980s, businesses wanted to harness data-driven insights for business decisions and innovation. To do this, organizations had to move past simple relational databases to systems that could manage and analyze data that was being generated and collected at high volumes and at a faster pace. Data warehouses were designed to collect and consolidate this influx of data and provide support for overall business intelligence and analytics. Data in a data warehouse is structured and clean, with predefined schemas. However, data warehouses were not designed with semi-structured or unstructured data in mind, and became very expensive when trying to store and analyze any data that didn't fit the schema. As companies grew and the world became more digital, data collection drastically increased in volume, velocity, and variety, pushing data warehouses out of favor. It took too much time to process data and provide results, and there was limited capability to handle data variety and velocity.

In the early 2000s, the advent of big data drove the development of data lakes, where structured, semi-structured, and unstructured data could live simultaneously, collected at the volumes and speeds necessary. Multiple data types could be stored side by side in a data lake. Data created from many different sources, such as web logs or sensor data, could be streamed into the data lake quickly and cheaply in low-cost cloud object stores. However, while data lakes solved the storage dilemma, they introduced additional concerns and lacked necessary features from data warehouses. First, data lakes are not supportive of transactional data and can't enforce data quality, so the reliability of the data stored in the data lake is questionable,
mostly due to the various formats. Second, with such a large volume of data, the performance of analysis is slower, and the timeliness of decision-impacting results has never manifested. And third, governance over the data in a data lake creates challenges with security and privacy enforcement, due to the unstructured nature of the contents of a data lake.

Because data lakes didn't fully replace data warehouses for reliable BI insights, businesses implemented complex technology stack environments, including data lakes, data warehouses, and additional specialized systems for streaming, time series, graph, and image databases, to name a few. But such an environment introduced complexity and delay, as data teams were stuck in silos completing disjointed work. Data had to be copied between the systems, and in some cases copied back, impacting oversight and data usage governance, not to mention the cost of storing the same information twice. With disjointed systems, successful AI implementation was difficult, and actionable outcomes required data from multiple places. The value behind the data was lost. In a recent study by Accenture, only 32 percent of companies reported measurable value from data.

Something needed to change, because businesses needed a single, flexible, high-performance system to support the ever-increasing use cases for data exploration, predictive modeling, and predictive analytics. Data teams needed systems to support data applications including SQL analytics, real-time analysis, data science, and machine learning. To meet these needs and address the concerns and challenges, a new data management architecture emerged: the data lakehouse.

The data lakehouse was developed as an open architecture combining the benefits of a data lake with the analytical power and controls of a data warehouse. Built on a data lake, a data lakehouse can store all data of any type together, becoming a single reliable source of truth, providing direct access for AI and BI together. Data lakehouses like the
Databricks Lakehouse Platform offer several key features, such as transaction support, including ACID transactions for concurrent read-write interactions; schema enforcement and governance for data integrity and robust auditing needs; data governance to support privacy regulation and data use metrics; and BI support to reduce the latency between obtaining data and drawing insights.

Additionally, the data lakehouse offers decoupled storage from compute, meaning each operates on its own clusters, allowing them to scale independently to support specific needs; open storage formats, such as Apache Parquet, which are open and standardized so a variety of tools and engines can access the data directly and efficiently; support for diverse data types, so a business can store, refine, analyze, and access semi-structured, structured, and unstructured data in one location; support for diverse workloads, allowing a range of workloads such as data science, machine learning, and SQL analytics to use the same data repository; and end-to-end streaming for real-time reports, which removes the need for a separate system dedicated to real-time data applications.

The lakehouse supports the work of data analysts, data engineers, and data scientists, all in one location. The lakehouse essentially is the modernized version of a data warehouse, providing all the benefits and features without compromising the flexibility and depth of a data lake.
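
To make the schema enforcement and transaction support mentioned above a little more concrete, here is a minimal sketch in plain Python. It is an illustration only, not how the Databricks Lakehouse Platform or any real engine works: the names Table, SchemaError, and ORDERS_SCHEMA are hypothetical, and the "table" is just an in-memory list. It shows a write path that checks every record against a predefined schema and commits a batch only if all of its records pass, so a bad record cannot leave a half-written batch behind.

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration: column name -> required Python type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "amount": float}


class SchemaError(ValueError):
    """Raised when a record does not match the table schema."""


@dataclass
class Table:
    """Toy in-memory table that enforces a schema on every write."""

    schema: dict
    rows: list = field(default_factory=list)

    def _validate(self, record: dict) -> None:
        # Reject records with missing or unexpected columns.
        if set(record) != set(self.schema):
            raise SchemaError(
                f"columns {sorted(record)} do not match schema {sorted(self.schema)}"
            )
        # Reject values whose type differs from the declared column type.
        for column, expected in self.schema.items():
            if not isinstance(record[column], expected):
                raise SchemaError(
                    f"column {column!r} expects {expected.__name__}, "
                    f"got {type(record[column]).__name__}"
                )

    def append(self, batch: list) -> None:
        # Validate the whole batch first, then commit: all or nothing.
        for record in batch:
            self._validate(record)
        self.rows.extend(batch)


orders = Table(ORDERS_SCHEMA)
orders.append([{"order_id": 1, "customer": "acme", "amount": 19.99}])

try:
    # The second record has a string where an int is required,
    # so the entire batch is rejected, including the valid first record.
    orders.append([
        {"order_id": 2, "customer": "globex", "amount": 5.0},
        {"order_id": "three", "customer": "initech", "amount": 7.5},
    ])
except SchemaError as err:
    print(f"batch rejected: {err}")

print(len(orders.rows))
```

A plain data lake, by contrast, would accept both batches as files and leave it to each reader to discover the bad record later, which is the reliability problem described earlier in the transcript.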