Data + AI - Summit 2022 Keynotes

If you are in contact with the world of data analytics or data management in your day-to-day, you have probably already heard of Data + AI Summit 2022, organized by Databricks. For those not in the know, in the organizers' own words the summit was "the world's largest gathering of the data and analytics community". And with 5,000 people joining the conference in San Francisco and more than 60,000 watching online, there might be some truth to that claim.

If you were part of those 65,000 people, you know that some pretty important things were announced. If you work with Apache Spark, Delta Lake, or Unity Catalog, or even if you are just looking for ways to share and consume data, well, you probably should have been there. But reading this article will be the next best thing. After all, we compiled a list of the most interesting things announced during the Summit. So fasten your seatbelts and get ready to take a Peak and learn a Bit.

Project Lightspeed

Okay, so let us start quick and fast with the launch of Project Lightspeed. First, for the uninitiated, what exactly is Project Lightspeed? It is simply a new, faster generation of Structured Streaming, which in Databricks' own words is "a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. The user can express the logic using SQL or Dataset/DataFrame API."

So there you have it. And now that we know what Project Lightspeed is, it is worth mentioning why it matters. The reason is simple: more and more people want to stream data. More and more organizations, and even individual decisions, are data-driven. This creates a need for tools that process data in real or near-real time. However, such tools can be resource- and time-intensive. Thanks to Project Lightspeed, the process will be quicker, lighter and simpler. This is to be achieved by:

  • Improving latency and ensuring its predictability.

  • Enhancing functionality for processing data with new operators and APIs.

  • Improving ecosystem support for connectors.

  • Simplifying deployment, operations, monitoring and troubleshooting.

And how are those goals supposed to be fulfilled? Well, for starters, the project team noticed that the Structured Streaming engine does a significant amount of bookkeeping. At the start of every micro-batch, offset management plans which records the batch will consist of and logs that plan to external storage; at the end, it marks the micro-batch as done. It may sound needlessly complicated, but it is necessary, or at least it was. Now Databricks has another, simpler idea. Offset management will no longer mark each micro-batch as done; instead, the offset will serve as a marker of pipeline progress. Additionally, offset writes do not have to be forced to disk immediately; they can overlap with the next micro-batch's execution. This should limit the number of actions that need to be taken and recorded during streaming, and thus speed up the process by reducing latency and lightening the load. That already sounds great, right? But wait! There is more!

Pic. 1: A graph from the Data + AI 2022 day 1 keynote presentation
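The bookkeeping change above can be sketched in plain Python. This is a conceptual model only, not Spark's actual internals: it contrasts two synchronous durable writes per micro-batch (a plan plus a completion marker) with a single progress marker whose write can overlap the next batch.

```python
# Conceptual sketch (plain Python, not actual Spark internals) of the
# offset-management change: fewer, deferrable durable writes per micro-batch.

durable_log = []  # stands in for external storage (e.g. a checkpoint dir)

def run_old_style(batches):
    """Two synchronous writes per micro-batch: a plan and a completion marker."""
    writes = 0
    for i, batch in enumerate(batches):
        durable_log.append(("plan", i, len(batch))); writes += 1   # blocking
        _ = [r * 2 for r in batch]                                 # process
        durable_log.append(("done", i)); writes += 1               # blocking
    return writes

def run_new_style(batches):
    """One progress marker per micro-batch; persisting it overlaps the next batch."""
    writes = 0
    pending = None
    for i, batch in enumerate(batches):
        if pending is not None:          # flush the previous marker while this
            durable_log.append(pending)  # batch is (conceptually) executing
            writes += 1
        _ = [r * 2 for r in batch]       # process
        pending = ("progress", i)        # not forced to disk immediately
    if pending is not None:              # final flush at shutdown
        durable_log.append(pending); writes += 1
    return writes

batches = [[1, 2], [3, 4], [5, 6]]
old = run_old_style(batches)   # 6 durable writes for 3 micro-batches
new = run_new_style(batches)   # 3 durable writes for the same work
```

Halving the number of blocking writes, and letting the remaining ones overlap with processing, is where the promised latency reduction comes from.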

As of right now, Spark checkpoints the state of the process synchronously after each group of records is processed, which increases latency. After the change, users will be able to checkpoint state for every Nth group of records instead, which should improve latency by 20-30% in stateful processing. Once again: less bookkeeping, fewer actions, more speed. Additionally, for stateless processing, the current checkpointing mechanism synchronously writes to external storage after each group of records. If persisting checkpoints overlapped with the execution of the next group of records, latency could be further improved by as much as 25%. As you can see on the graph below, this simply means that logging to external storage and processing can take place at the same time. Thanks to this method, resource utilization is improved and the time spent on the process is greatly reduced. However, we are wary of potential stability issues due to the less precise bookkeeping.

Pic. 2: A graph from the Data + AI 2022 day 1 keynote presentation
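The "every Nth group" idea can be illustrated with a short sketch. Again this is plain Python under our own assumptions (the interval `every_n` is an illustrative knob, not a real Spark configuration name): state is updated on every micro-batch, but only periodically snapshotted.

```python
# Sketch of checkpointing state every Nth micro-batch instead of after
# every one. Fewer snapshots means lower steady-state latency; the trade-off
# is that recovery may need to replay up to N-1 groups.

def process_with_periodic_checkpoints(groups, every_n):
    state = {}
    checkpoints = []                       # stand-in for durable snapshots
    for i, group in enumerate(groups, start=1):
        for key in group:                  # stateful op: running count per key
            state[key] = state.get(key, 0) + 1
        if i % every_n == 0:               # persist only every Nth group
            checkpoints.append(dict(state))
    return state, checkpoints

groups = [["a"], ["a", "b"], ["b"], ["a"], ["c"], ["a"]]
state, cps = process_with_periodic_checkpoints(groups, every_n=3)
# 6 groups with N=3 -> 2 snapshots instead of 6
```

The stability concern we raised maps directly onto the replay window: the larger N is, the more work a restarted query has to redo from the last snapshot.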

Ok, so now we know how Project Lightspeed will be made faster. Now let's see how it will be made simpler and more comfortable to use. The first step is making Python a first-class citizen. No longer will Python users suffer while Scala thrives. In other words, Python will now be treated as a primary language in the Databricks environment. Functionalities that were only available in Scala are being rewritten in Python. This includes advanced state processing methods (so-called arbitrary stateful processing). Such a change will make those techniques much more accessible to Python users. In particular, the following functions are planned to be available in Python:

  • flatMapGroupsWithState

  • mapGroupsWithState
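
To make the semantics of those functions concrete, here is a plain-Python emulation, not the PySpark API itself: per-key state survives across micro-batches, and each invocation may emit zero or more output rows, which is exactly what distinguishes flatMapGroupsWithState from a plain aggregation. The class and threshold below are our own illustrative inventions.

```python
# Plain-Python emulation of flatMapGroupsWithState semantics:
# long-lived per-key state plus zero-or-more output rows per invocation.

from collections import defaultdict

class SessionCounter:
    """Counts events per key; emits a row only when a key crosses a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.state = defaultdict(int)    # the long-lived, per-key state

    def update(self, key, events):
        """Called once per key per micro-batch, like the user function
        passed to flatMapGroupsWithState."""
        before = self.state[key]
        self.state[key] += len(events)
        out = []
        if before < self.threshold <= self.state[key]:
            out.append((key, self.state[key]))   # zero-or-more output rows
        return out

counter = SessionCounter(threshold=3)
emitted = []
emitted += counter.update("user1", ["click", "click"])   # below threshold
emitted += counter.update("user2", ["click"])            # below threshold
emitted += counter.update("user1", ["click"])            # user1 crosses 3
```

Only the micro-batch in which "user1" crosses the threshold produces output; "user2" keeps its state silently, waiting for later batches.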

Lightspeed will also provide tighter integrations with popular Python data processing packages like Pandas, which will make coding in Python even simpler. I don't know about you, but I love simpler. And Pandas. So I really like this change. For now, this is the extent of what Project Lightspeed promises to change in data streaming. All that is left is to leave you with the words of the wise Ali Ghodsi, CEO and Co-founder of Databricks: "Streaming is finally happening. We have been waiting for that year where streaming workloads take off, and I think last year was it. I think it's because people are moving to the right of this data/AI maturity curve, and they're having more and more AI use cases that just need to be real-time."

Multiple Stateful Operators

Another thrilling announcement is that Structured Streaming will support more than one stateful operator per streaming job. For example, after joining two streams we will still be able to group the rows by some dimension value or a time window. We are very happy that the familiar unsupported-operation error will not haunt us anymore once the enhancement becomes available. We also can't wait for the details regarding operations, deployment and monitoring, which are to be simplified, with users provided with an easy-to-debug interface.
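Why does chaining stateful operators matter? Both a stream-stream join and a grouped aggregation have to keep state, and today only one of them may appear per query. The sketch below is plain Python, not the Spark operators: it models a join (buffering rows from one side) feeding directly into a grouped count.

```python
# Conceptual sketch of two stateful operators in one pipeline: a stream-stream
# join (stateful: buffers rows awaiting a match) followed by a grouped count
# (stateful: running aggregates). Plain Python for illustration only.

def stream_join_then_count(left, right):
    buffered_right = {}                # join state: dimension rows by key
    counts = {}                        # aggregation state
    for key, dim in right:
        buffered_right.setdefault(key, []).append(dim)
    for key, _value in left:
        for dim in buffered_right.get(key, []):   # operator 1: join
            counts[dim] = counts.get(dim, 0) + 1  # operator 2: group-by count
    return counts

left = [("k1", 10), ("k2", 20), ("k1", 30)]       # fact stream
right = [("k1", "blue"), ("k2", "red")]           # dimension stream
counts = stream_join_then_count(left, right)      # {"blue": 2, "red": 1}
```

Today this pattern requires writing the join result out and running a second streaming query over it; with the announced support it becomes a single job.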

Delta Lake

Another important takeaway from the conference is that Delta Lake fired our imaginations with the announcement of deletion vectors. This feature will allow a single value in a file to be changed without rewriting the whole Parquet file to accommodate a small correction. A single, tiny write will no longer turn into a massive copy of all the unchanged data! This enhancement is not only going to dramatically improve the performance of single updates and deletes, but will also cut storage costs thanks to a significantly smaller number of read-write operations.
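The core idea can be shown in a few lines of Python. This is a toy model of the concept, not Delta Lake's implementation: instead of rewriting an immutable file to delete one row, we record the deleted row positions in a small side structure and filter them out at read time.

```python
# Toy model of a deletion vector: the data file stays immutable, and deletes
# become tiny metadata writes that the reader applies as a filter.

class ImmutableFile:
    def __init__(self, rows):
        self.rows = tuple(rows)        # stands in for a written Parquet file
        self.deletion_vector = set()   # positions marked as deleted

    def delete(self, position):
        """A small metadata write instead of copying every surviving row."""
        self.deletion_vector.add(position)

    def read(self):
        """Readers skip positions listed in the deletion vector."""
        return [r for i, r in enumerate(self.rows)
                if i not in self.deletion_vector]

f = ImmutableFile(["row0", "row1", "row2", "row3"])
f.delete(2)                            # no file rewrite happens
visible = f.read()                     # ["row0", "row1", "row3"]
```

The cost of a delete now scales with the number of deleted rows, not with the size of the file they happen to live in.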

Unity Catalog

During the conference, some good news was also announced for Unity Catalog users. They will now be able to see lineage (currently in preview), covering both downstream and upstream lineage. The graphs are computed in real time as jobs and pipelines are developed. Lineage is also provided at the column level: if a column is computed, we will be able to see all the predecessor columns used in the computation, as well as any downstream columns derived from a selected column.
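Column-level lineage is, at heart, a directed graph that can be walked in both directions. The sketch below uses made-up table and column names purely for illustration; it is not the Unity Catalog API.

```python
# Column lineage as a directed graph: an edge points from a source column to
# the column computed from it. Table/column names are hypothetical.

edges = {
    "orders.price":    ["report.revenue"],
    "orders.quantity": ["report.revenue"],
    "report.revenue":  ["dashboard.total"],
}

def downstream(col, graph):
    """All columns that (transitively) depend on `col`."""
    seen, stack = set(), [col]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def upstream(col, graph):
    """All predecessor columns of `col`: walk the reversed graph."""
    reversed_graph = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reversed_graph.setdefault(dst, []).append(src)
    return downstream(col, reversed_graph)

# downstream("orders.price") -> revenue and everything built on it
# upstream("dashboard.total") -> every column that feeds the total
```

The value of the feature is exactly these two queries answered automatically: "what breaks if I change this column?" and "where did this number come from?"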

Databricks Marketplace

For those interested in sharing their data, Delta Sharing has now moved to another stage with Databricks Marketplace, where organizations can provide and consume third-party data. Not only is the underlying data shared, but also other data assets like dashboards, notebooks or ML models, which accelerate usage and enhance understanding of the data. This new offering has the potential to drastically simplify data integrations and might be a game changer for organizations that rely on third-party data. We may be on the verge of a great paradigm shift in the whole data ecosystem. It raises the question of what direction it will take and whether the decentralization of data services will continue. If you want to get ready for such a possibility and see the benefits of Databricks Marketplace, click here.


So here it is: Data + AI 2022 in a nutshell. A lot has happened and a lot will happen in the future. We got faster data streaming with Project Lightspeed thanks to some creative bookkeeping and resource optimization; easier times for Python and Unity Catalog users; and some quality-of-life improvements when it comes to small changes in Delta Lake data files. Lastly, but perhaps most importantly, a big step towards a global decentralized data ecosystem was announced with another upgrade to Databricks Marketplace. For those who want to learn more, we advise you to use the "Watch on demand" function of the Databricks conference site, which you can find here. And if you don't want to miss further developments, don't forget to bookmark our blog and follow us on LinkedIn for more valuable insights and content!

Beata Boraca, Senior Data Engineer at BitPeak, and Marcel Klisz, Junior Marketing Specialist at BitPeak
