
Updated 2 months ago

Kedro capabilities to work with deltalake

At a glance

A community member is testing Kedro's capabilities for working with Delta Lake: they have a Delta table that is updated daily, and pipelines that need to recompute models daily. The table is currently small but may grow too large to fit in memory. They are using the pandas Delta Lake dataset and are looking for options besides PySpark.

In the comments, another community member suggests using a Polars Delta dataset. There is no official implementation, but they link to a custom Polars Delta dataset that can be used as a workaround.


Hey there, I'm testing Kedro's capabilities to work with Delta Lake. I have a Delta table that is going to be updated every day with new data, and some pipelines that need to recompute models daily. The table is pretty small now, but the total volume of data will increase and might not fit in memory (loading the whole table and then filtering it).
I'm currently using the pandas Delta Lake dataset.
What are my options in the future, besides PySpark?
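
For context, the setup described above might look like this in the Data Catalog. This is a sketch, not the poster's actual config: the dataset name, path, and options are assumptions, based on the `pandas.DeltaTableDataset` shipped in kedro-datasets.

```yaml
# catalog.yml — hypothetical sketch of the setup described above
daily_table:
  type: pandas.DeltaTableDataset
  filepath: data/01_raw/daily_table
  save_args:
    mode: append   # new data lands in the Delta table every day
```

Loading this dataset reads the whole table into a pandas DataFrame, which is why it stops scaling once the table no longer fits in memory.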

4 comments

one more thing: I noticed the Polars Delta Lake dataset is not available in your implementation, because the available formats are explicitly restricted. Can you update it so we can actually use Polars to lazily scan Delta Lake?

hello @Sean Yogev! I was about to suggest Polars, indeed. There's no official Polars Delta dataset, but you can copy-paste this:

https://github.com/astrojuanlu/kedro-deltalake-demo/blob/main/src/kedro_deltalake_demo/datasets/polars_delta_dataset.py
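
The linked file is the authoritative version; the sketch below only illustrates the general shape of such a dataset (names and options here are hypothetical, not copied from the link). It is kept framework-free for brevity: a real Kedro dataset would subclass `kedro.io.AbstractDataset` and implement `_load`/`_save`/`_describe`. It assumes Polars is installed with its Delta Lake extras.

```python
class PolarsDeltaDataset:
    """Hypothetical sketch: load a Delta table lazily with Polars, save eagerly."""

    def __init__(self, filepath: str, load_args=None, save_args=None):
        self._filepath = filepath
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})

    def load(self):
        # Deferred import: requires `polars` with the deltalake extra installed.
        import polars as pl

        # scan_delta returns a LazyFrame, so filters and projections are
        # pushed down and the full table never has to fit in memory.
        return pl.scan_delta(self._filepath, **self._load_args)

    def save(self, data) -> None:
        # DataFrame.write_delta materialises the frame to the Delta table.
        data.write_delta(self._filepath, **self._save_args)

    def describe(self) -> dict:
        return {"filepath": self._filepath}
```

The key point is the lazy load: downstream nodes can chain `.filter(...)` on the returned LazyFrame and only `.collect()` the rows they actually need.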

thanks, that's exactly what i was doing until now 😅

cc @Théo Andro
