Now, what could stop Lakehouse from replacing Data Warehouse?

Databricks SQL, powered by Delta Lake, offers the full suite of data warehousing capabilities such as ACID transactions, fine-grained data security, scalable metadata handling, first-class SQL support, and BI reporting. In addition, new capabilities are being added continuously and at a fast pace, such as the recent Low Shuffle Merge and SQL custom function features. It is impressive to see those data warehousing capabilities on top of data lakes. Just a few years ago, we had to write rather awkward code to work around the lack of merge capability when updating data in the data lake.

On the ‘Lake’ side of the workloads, there is no doubt that Lakehouse outperforms data warehouses (of course, that is why it is called a “lakehouse”) and offers capabilities that data warehouses cannot, such as native support for unstructured or semi-structured data and for machine learning workloads. The data lake was born out of the difficulties businesses face in handling data at greater volume, variety and speed with a classic data warehouse. Lakehouse is capable of storing and processing very large volumes of data. Handling that scale with a classic data warehouse is either impossible or too expensive. Lakehouse natively supports unstructured and semi-structured data. Lakehouse natively supports streaming data. I personally consider Lakehouse more flexible than the classic data warehouse. Databricks Lakehouse originated from open-source initiatives and adopts an open architecture rather than being built as a closed black box like most classic data warehouses. It offers data engineers more options and flexibility to integrate into or extend their lakehouse.

The data warehouse has been around for more than 30 years. There is no doubt that a data warehouse is normally more reliable and robust than Lakehouse under most conditions. However, the situation is changing at a fast pace.

Databricks Lakehouse is not cheap, especially when you need to pay for both the Databricks Units and the VMs provisioned to support it. However, for similar data warehousing workloads and data volumes, Lakehouse has its advantages: low-cost cloud-based storage, elastic pay-as-you-go compute, and the recent serverless SQL feature (Databricks claims a 40% cost saving).

Now, let's get to the decisive factor: performance, and more specifically interactive query performance. Manageability/reliability and interactive query performance are two of the biggest hurdles Lakehouse must clear to be competent for data warehousing workloads. With the rapid advance of Delta Lake capabilities, the hurdle of manageability/reliability is not that formidable.

Databricks and its open-source cousin, Apache Spark, were originally designed for offline processing of big data workloads. In other words, this is a design favouring high throughput over low latency, and people did not have high expectations of its performance with interactive queries. However, with the complete rewrite of its processing engine and its performance optimisation techniques (caching, a cost-based query optimizer, data skipping, data compaction, and so on), Databricks Lakehouse reaches the same performance level as a data warehouse.

How about Snowflake, the Most Promising Data Warehouse?

Yes, Snowflake started as a data warehousing company; however, it has been adding more and more data lake features. Even though it takes the opposite route to Databricks, which started as a big data company but has been adding more and more data warehousing features, the two are becoming more and more alike. Eventually, a new name would be given to them.
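
To make the merge capability mentioned earlier a little more concrete, here is a minimal plain-Python sketch of what a merge (upsert) does: rows that match on a key are updated, and rows that do not match are inserted. This is only an illustration of the semantics, not Delta Lake's or Databricks' actual API; the function `merge_upsert`, the `id` key and the sample rows are all hypothetical.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def merge_upsert(target: Dict[int, Row], updates: Iterable[Row], key: str = "id") -> Dict[int, Row]:
    """Return a new table with `updates` merged into `target` by `key`.

    Hypothetical helper: models merge semantics only, not a real engine.
    """
    # Copy every row so the original table is left unchanged.
    merged = {k: dict(v) for k, v in target.items()}
    for row in updates:
        k = row[key]
        if k in merged:
            # "When matched": overwrite only the columns present in the update.
            merged[k].update(row)
        else:
            # "When not matched": insert the row as a new record.
            merged[k] = dict(row)
    return merged


# Illustrative data (made up for this sketch).
customers = {
    1: {"id": 1, "name": "Ann", "city": "Leeds"},
    2: {"id": 2, "name": "Bob", "city": "York"},
}
changes = [
    {"id": 2, "city": "Bath"},               # existing key: update
    {"id": 3, "name": "Cy", "city": "Hull"},  # new key: insert
]

result = merge_upsert(customers, changes)
for row in result.values():
    print(row)
```

Without a merge operation, the same effect on a data lake typically meant reading the affected data, combining it with the changes by hand, and rewriting it, which is the kind of awkward workaround described above.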