Handling Large Databases with Partial Node Processing

Guys, is there any built-in solution to handle large databases, so that the nodes run them partially? Like, let's say, 100k rows running in batches of 10k each, instead of doing it by hand with a for loop or something like that...


I'm actually doing a blog post on this topic as we speak

but you can use the PartitionedDataset
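Roughly, the node side of that looks like the sketch below. Kedro passes the partitions in as a dict of partition id to load function, so you only pull one chunk into memory at a time (the catalog entry, dataset name, and transform here are just illustrative, not from this thread):

```python
# A minimal sketch, assuming a catalog entry along these lines (illustrative):
#
#   raw_parts:
#     type: partitions.PartitionedDataset
#     path: data/01_raw/parts
#     dataset: pandas.CSVDataset
#
from typing import Callable, Dict

import pandas as pd


def process_in_batches(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> Dict[str, pd.DataFrame]:
    """Each partition arrives as a lazy load function, so only one
    partition's rows are in memory at any point in the loop."""
    results = {}
    for partition_id, load_partition in partitions.items():
        df = load_partition()                # load just this partition
        results[partition_id] = df.dropna()  # placeholder per-partition transform
    return results  # returning a dict writes one output file per key
```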

Nice to know, I'd love to read it...

Yeah, I mean, I've seen a little bit about PartitionedDataset; it's just that it wasn't clear to me whether it's usable in all scenarios, like to avoid problems with a lack of VM resources and to let me run even with a lower CPU count, and so on...

I do want to learn more about the "Kedro way" of doing things, to understand its full potential, you know.

It's not really a question of the "Kedro way", but if you want to process large volumes of data from a database, the best way is to do the compute on the database.

For example, Ibis is a Python dataframe library that lets you lazily execute code on the backend. Ibis can be fairly easily integrated with Kedro (there are built-in datasets and examples).
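As a rough idea of what that looks like (the backend, file path, table, and column names below are made up for illustration), an Ibis expression stays lazy until you execute it, so the heavy lifting happens in the database rather than in pandas:

```python
# A minimal sketch, assuming a DuckDB backend and a table called "orders";
# the path and column names are illustrative, not from this thread.
import ibis
from ibis.expr.types import Table


def summarise_orders(orders: Table) -> Table:
    """Build a lazy expression; nothing runs until it is executed,
    and the aggregation happens inside the database."""
    return (
        orders.filter(orders.amount > 0)
        .group_by("customer_id")
        .aggregate(total_amount=orders.amount.sum())
    )


if __name__ == "__main__":
    con = ibis.duckdb.connect("data/orders.duckdb")  # backend connection
    expr = summarise_orders(con.table("orders"))     # still lazy at this point
    print(expr.execute())                            # compute happens in DuckDB here
```

In a real pipeline the connection and table would typically come from the catalog via the built-in Ibis datasets mentioned above, rather than being opened inside the node.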

Would this help, or am I misunderstanding your question?

https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#how-to-use-generator-functions-in-a-node Kedro supports generator functions. You only need a dataset that loads a part of the data and yields it; in the node you iterate through the generated chunks, process them, and yield them back, and Kedro calls save on an output dataset that supports appending. The docs have an example of how that works.

This way only a small part of the dataset gets loaded into memory at a time.
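As a rough illustration of that pattern (the chunk size, transform, and dataset behaviour below are assumptions, not from this thread):

```python
# A minimal sketch of the generator-node pattern, assuming the input dataset
# yields pandas chunks (e.g. a custom dataset wrapping
# pandas.read_sql(..., chunksize=10_000)) and the output dataset supports append.
from collections.abc import Iterator

import pandas as pd


def clean_in_chunks(chunks: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Kedro calls save() on the output dataset once per yielded chunk,
    so memory stays bounded by the chunk size, not the table size."""
    for chunk in chunks:
        yield chunk.dropna()  # placeholder per-chunk transform
```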
