
Updated 2 months ago

Kedro capabilities to work with deltalake

At a glance

A community member is testing Kedro's capabilities for working with Delta Lake: they have a Delta table that is updated daily, and pipelines that need to recompute models daily. The table is currently small but may grow too large to fit in memory. They are using the pandas Delta Lake dataset and are looking for options besides PySpark.

In the comments, another community member suggests using a Polars Delta dataset. There is no official implementation, but they link to a custom Polars Delta dataset that can be used as a workaround.


Hey there, I'm testing Kedro's capabilities to work with Delta Lake. I have a Delta table that is going to be updated every day with new data, and some pipelines that need to recompute models daily. The table is pretty small now, but the total volume of data will increase and might not fit in memory (loading the whole table and then filtering it).
I'm currently using the pandas Delta Lake dataset.
What are my options in the future, besides PySpark?
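
For context, the setup described above might look like this in the Data Catalog. This is a sketch, not the poster's actual config: the dataset name, path, and options are assumptions, based on the `pandas.DeltaTableDataset` shipped in kedro-datasets.

```yaml
# catalog.yml — hypothetical sketch of the setup described above
daily_table:
  type: pandas.DeltaTableDataset
  filepath: data/01_raw/daily_table
  save_args:
    mode: append   # new data lands in the Delta table every day
```

Loading this dataset reads the whole table into a pandas DataFrame, which is why it stops scaling once the table no longer fits in memory.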

4 comments

one more thing: I noticed the Polars Delta Lake dataset is not available in your implementation, because the available formats are explicitly restricted. Can you update it so we can actually use Polars to lazily scan Delta Lake?

hello @Sean Yogev! I was about to suggest Polars, indeed. There's no official Polars Delta dataset, but you can copy-paste this:

https://github.com/astrojuanlu/kedro-deltalake-demo/blob/main/src/kedro_deltalake_demo/datasets/polars_delta_dataset.py
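
The linked file is the authoritative version; the sketch below only illustrates the general shape of such a dataset (names and options here are hypothetical, not copied from the link). It is kept framework-free for brevity: a real Kedro dataset would subclass `kedro.io.AbstractDataset` and implement `_load`/`_save`/`_describe`. It assumes Polars is installed with its Delta Lake extras.

```python
class PolarsDeltaDataset:
    """Hypothetical sketch: load a Delta table lazily with Polars, save eagerly."""

    def __init__(self, filepath: str, load_args=None, save_args=None):
        self._filepath = filepath
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})

    def load(self):
        # Deferred import: requires `polars` with the deltalake extra installed.
        import polars as pl

        # scan_delta returns a LazyFrame, so filters and projections are
        # pushed down and the full table never has to fit in memory.
        return pl.scan_delta(self._filepath, **self._load_args)

    def save(self, data) -> None:
        # DataFrame.write_delta materialises the frame to the Delta table.
        data.write_delta(self._filepath, **self._save_args)

    def describe(self) -> dict:
        return {"filepath": self._filepath}
```

The key point is the lazy load: downstream nodes can chain `.filter(...)` on the returned LazyFrame and only `.collect()` the rows they actually need.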

thanks, that's exactly what i was doing until now 😅

cc @Théo Andro
