
Databricks puts cards on the table format as Snowflake looks for more players


Analysis With confirmation of support for the Apache Iceberg and Hudi table formats this week, Databricks is striving to broaden the appeal of its approach to data lakes, leveraging its dominance in machine learning to branch out into data warehouse-type workloads.

Meanwhile, rival Snowflake has also unveiled updates to Iceberg Tables to further eliminate data silos.

Both companies claim to support unstructured data lake-style workloads, and the SQL-based reporting and analytics of data warehousing in the same system, while also using their analytics engines to address data held elsewhere.

In Delta Lake 3.0, Databricks — which cut its teeth developing Apache Spark back when Hadoop was king — has launched what it calls Universal Format (UniForm), designed to allow data stored in Delta to be read as if it were Apache Iceberg or Apache Hudi.

Days before the vendor’s annual shindig in San Francisco this week, marketing veep Joel Minnick told The Register that Delta was the “longest established, most enterprise adopted Lakehouse format from an open source perspective.”

All three table formats are based on the Apache Parquet file format, he pointed out: "Where the difference comes into play is that each one of these formats creates similar but not the same metadata," affecting how the data is expressed to applications and analytics workloads.

The result is some incompatibility between Delta, Hudi, and Iceberg. Hoping to simplify the problem for customers, Databricks has introduced its Universal Format, or UniForm for short.

Minnick said UniForm automatically generates the metadata for all three formats and automatically understands which format the user is trying to read or write to.

“It will automatically then do the translation for the user to the appropriate metadata that system is expecting. Now if you build for Delta Lake, you build for everyone and you’re able to eliminate all of this complexity of having to understand which Lakehouse format the system is expecting and maintaining different connectors to do these translations,” he said.
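Conceptually, that means a single commit writes the Parquet data files once and then emits each format's metadata alongside them, so a Delta, Iceberg, or Hudi reader each finds the table state it expects. Here is a purely illustrative Python sketch of that idea; the function names and metadata structures are invented for clarity and are not Databricks' actual implementation:

```python
# Illustrative sketch of the UniForm idea: one set of Parquet data files,
# with metadata generated for every enabled table format on each commit.
# These structures are simplified stand-ins, not the real on-disk layouts.

def write_delta_commit(files):
    # Delta expresses table state as a JSON transaction log of add actions
    return {"format": "delta", "log": [{"add": f} for f in files]}

def write_iceberg_snapshot(files):
    # Iceberg expresses the same state via snapshots pointing at manifests
    return {"format": "iceberg", "manifests": list(files)}

def write_hudi_timeline(files):
    # Hudi tracks the same state as commits on a timeline
    return {"format": "hudi", "timeline": [("commit", f) for f in files]}

METADATA_WRITERS = {
    "delta": write_delta_commit,
    "iceberg": write_iceberg_snapshot,
    "hudi": write_hudi_timeline,
}

def commit(files, enabled_formats=("delta", "iceberg", "hudi")):
    # The data files are written once; only the metadata is per-format
    return {fmt: METADATA_WRITERS[fmt](files) for fmt in enabled_formats}

snapshot = commit(["part-0001.parquet", "part-0002.parquet"])
print(sorted(snapshot))  # ['delta', 'hudi', 'iceberg']
```

The point of the sketch is the asymmetry Minnick describes: the expensive part, the Parquet data, exists once, while the cheap part, the metadata, is fanned out so each downstream engine reads the table natively.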

Apache Iceberg is an open table format designed for large-scale analytical workloads while supporting query engines including Spark, Trino, Flink, Presto, Hive and Impala. It has spent the last couple of years gathering momentum, after Snowflake, Google, and Cloudera announced their support last year. More specialist players are also in on the act, including Dremio, Starburst, and Tabular, which was founded by the team behind the Iceberg project when it was developed at Netflix.

In fact, Databricks CEO and co-founder Ali Ghodsi told The Register last year that the three table formats – Iceberg, Hudi and Delta – were similar, and all were likely to be adopted across the board by the majority of vendors. This year, SAP and Microsoft have announced support for Delta, but both have said they could address data in Iceberg and Hudi in time.

Ice cold

The backer of Iceberg, meanwhile, has not stood still. In some sort of enterprise data analytics grudge match, Snowflake decided to hold its annual get together in the same week as Databricks.

The cloud data warehouse and platform company — once valued at a staggering $120 billion — has announced a private preview of its Iceberg Tables, which also promises to reach across silos – although without supporting Hudi and Delta.

It said organizations could work with data in their own storage in the Apache Iceberg format, whether or not the storage was managed by Snowflake, but use the vendor’s performance management and governance tools.

Snowflake also announced its Native App Framework in public preview on AWS. The idea is that developers can build and test Snowflake Native Apps to exploit data in its marketplace. More than 25 apps were already available, it said.

Hyoun Park, CEO and chief analyst with Amalgam Insights, said there was a battle in the data lake world between the Iceberg, Hudi and Delta formats.

“A lot of third parties are working with Iceberg, feeling that it is the easiest data format to work with and because they are frankly afraid of empowering Databricks,” he told The Register.

However, Databricks’ move to support all three would allow it to offer services to Iceberg customers, including those using Snowflake or Cloudera.

“It’s a smart way to be able to be the intelligence above all of the data lake formats that are out there,” he said.

Park reckons Iceberg is technically winning in terms of adoption, but faces challenges in terms of performance.

Meanwhile, it was expectations from investors, as much as anything else, that were pushing Snowflake to branch out. "Snowflake's valuation and the expectations put upon it by shareholders mean it is trying to be all things data, whether it be an application development platform or machine learning platform, or anything in between," Park said.

Mike Gualtieri, Forrester principal analyst, was unimpressed with Snowflake’s move in third-party apps. “I don’t think it’s convincing because this whole notion of apps that are just kind of focused on data is so incredibly lightweight and trivial, compared to full application solutions that enterprises need.”

But Snowflake is making progress at looking like a data lake, which was promising for the vendor and the customers who favor the platform, he added.

Over the past couple of years, the boundaries between data lakes and data warehouses have blurred. Databricks has coined its lakehouse concept, offering SQL and BI-style queries on its platform, while Snowflake, for example, has started to support unstructured data.

“There is a clash of these two technologies. The most desirable result for enterprises will be a unified platform. That’s why Snowflake can’t just sit there and say, ‘Oh, we’re a great data warehouse, kind of like Teradata.’ They have to say you can handle unstructured data and machine learning and when it lacks those capabilities, it fills those gaps through partnerships,” Gualtieri said.

But while the enterprises might want one platform, user expectations and the technology would prevent a unified market in the near future, he said.

"Teradata and Snowflake: they have some machine learning capabilities and you could do a lot with them. Databricks might have five times more capabilities. But if you take a BI user used to getting reports in Spotfire or Tableau, and they do a query, they expect instant results, not to wait the three or more seconds that a query against a data lake might require. In terms of features and technical capabilities, there are gaps between both of them, so unification can't happen right away," Gualtieri said.

For now, many organizations will continue to employ both styles of data management and analytics. Snowflake and Databricks both have an impressive roster of multinational customers, including Kraft Heinz, Comcast and EDF Energy for the former while the latter claims Toyota, Shell and AT&T, notably also a Snowflake customer.

It might take three years for both sides of the data lake/data warehouse divide to build the full set of capabilities offered by the other, Gualtieri said. Meanwhile, the clash between the two vendors is likely to continue. ®

 


