Handling Large Databases with Partial Node Processing

At a glance

The community members are discussing ways to handle large databases in a more efficient manner. One community member is working on a blog post on this topic and suggests using the PartitionedDataset. Another community member is interested in learning more about the "Kedro way" of handling large datasets. The discussion then shifts to using Ibis, a Python dataframe library that allows for lazy execution of code on the database backend, as well as using Kedro's support for generator functions to process data in smaller chunks. The community members agree that this approach of processing data in batches can help avoid memory issues when dealing with large datasets.

Guys, is there any built-in solution to handle large databases, so that the nodes run them partially? Like, let's say, 100k rows running in batches of 10k each, instead of doing it by hand with a for loop or something like that...

I'm actually doing a blog post on this topic as we speak

but you can use the PartitionedDataset
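For illustration only, here is a minimal sketch of how a PartitionedDataset node could look, assuming the large table has already been split into one CSV file per chunk; the catalog entries, paths, and the transformation are hypothetical placeholders, not the blog post's actual example:

```python
from typing import Callable

import pandas as pd

# Hypothetical catalog entries (conf/base/catalog.yml):
#
# raw_chunks:
#   type: partitions.PartitionedDataset
#   path: data/01_raw/chunks          # one CSV per ~10k-row chunk
#   dataset: pandas.CSVDataset
#
# processed_chunks:
#   type: partitions.PartitionedDataset
#   path: data/02_intermediate/chunks
#   dataset: pandas.CSVDataset


def process_partitions(
    partitions: dict[str, Callable[[], pd.DataFrame]]
) -> dict[str, Callable[[], pd.DataFrame]]:
    """Node that handles one partition at a time.

    Kedro passes a mapping of partition id -> load function, so a chunk
    is only read when its callable is invoked. Returning callables as
    values keeps the saving lazy as well, so at most one chunk needs to
    be in memory at any point.
    """

    def make_processor(load_partition: Callable[[], pd.DataFrame]) -> Callable[[], pd.DataFrame]:
        # Placeholder transformation; replace with the real processing step.
        return lambda: load_partition().dropna()

    return {
        partition_id: make_processor(load_partition)
        for partition_id, load_partition in partitions.items()
    }
```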

Nice to know, I'd love to read it...

Yeah, I mean, I saw a little bit about PartitionedDataset; it's just that it wasn't clear to me whether it's usable in all scenarios, like to avoid problems with a lack of VM resources, to let me run even with a lower CPU count, and so on...

I do want to learn more about the "Kedro way" of things, to understand its full potential, you know.

It's not really a question of the "Kedro way", but if you want to process large volumes of data from a database, the best way is to do the compute on the database.

For example, Ibis is a Python dataframe library that lets you lazily execute code on the backend. Ibis can be fairly easily integrated with Kedro (there are built-in datasets and examples).
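To give a rough idea of the lazy-execution point, here is a small standalone sketch; the DuckDB file, the `orders` table, and the columns are all made up, and in a real Kedro pipeline the table would typically come from the catalog via the built-in Ibis datasets rather than a manual connection:

```python
import ibis

# Connect to the backend; nothing is pulled into memory yet.
con = ibis.duckdb.connect("warehouse.duckdb")   # hypothetical database file

orders = con.table("orders")                    # lazy table expression

# Build the query lazily; this is only an expression tree so far.
summary = (
    orders.filter(orders.amount > 0)
    .group_by("customer_id")
    .aggregate(total=orders.amount.sum())
)

# Only now does the database backend run the query and return a small result.
df = summary.execute()
```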

Would this help, or am I misunderstanding your question?

https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#how-to-use-generator-functions-in-a-node Kedro supports generator functions. You only need a dataset that loads part of the data and yields it; then, in the node, you iterate through the generated data chunks, process them, and yield them back. Kedro then calls save on an output dataset that supports appending. You can check the docs for an example of how that works.

This way only a small part of the dataset gets loaded into memory at a time.
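As a minimal sketch of that pattern, assuming a CSV source read with pandas in 10k-row chunks and an output dataset whose save() appends; the file name, chunk size, and transformation are illustrative, and the loader function here just stands in for a dataset's load() method:

```python
from collections.abc import Iterator

import pandas as pd


def stream_chunks(
    path: str = "data/01_raw/big_table.csv", chunksize: int = 10_000
) -> Iterator[pd.DataFrame]:
    """Stand-in for a dataset load() that yields the data chunk by chunk."""
    yield from pd.read_csv(path, chunksize=chunksize)


def process_chunks(chunks: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Generator node: process one chunk at a time and yield it back.

    Kedro calls save() on the output dataset for every yielded chunk,
    so the full table never has to fit in memory at once.
    """
    for chunk in chunks:
        yield chunk.dropna()   # placeholder transformation
```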

I appreciate that, guys, I'll read and try that...
My problem right now is actually unknown. I was running code that I didn't build, and it was working, but as soon as I changed the input data, which is larger than what we used to use, it stopped working for some reason: the kernel dies before it finishes, and since the process takes quite some time, it's kind of impossible for me to keep watching the code execution. That's why I'd like to know a way to make sure the input data is processed properly every step of the way.

But I'll try that solution Ivan mentioned and see what I can do with that..
