
Handling Memory Issues in Kedro Projects for Processing Large Datasets

Hello, I'm facing a memory issue in my Kedro project and I would like to know if there is a Kedro-oriented solution.

I am developing a pipeline for processing large datasets of audio recordings. This involves processing several audio files (large NumPy arrays) in a single node and storing them again. I was relying on PartitionedDataset for this, but I'm having memory issues because building the dictionary of NumPy arrays is quite heavy and always ends up consuming all of my limited memory.
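Roughly, the node currently looks like this (simplified, names are illustrative): PartitionedDataset hands me a dict of load callables, and I build the whole dict of processed arrays before returning it.

from typing import Callable

import numpy as np


def process_audio(partitions: dict[str, Callable[[], np.ndarray]]) -> dict[str, np.ndarray]:
    processed = {}
    for name, load_func in partitions.items():
        recording = load_func()
        # placeholder processing: peak-normalise each recording
        processed[name] = recording / np.max(np.abs(recording))
    # every processed array is held in memory until the node returns
    return processed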

Is there a way to store each processed file as soon as it is done, instead of keeping them all in RAM until the last one is finished? Of course this is possible in many ways, but my question is about Kedro: is it possible to save from the body of the node function using Kedro and partitioned datasets? Has any of you experienced something like this before?

Best,
Nicolas


Hi @Nicolas Betancourt Cardona, maybe you can look at this doc on saving data using generators - https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#saving-data-with-generators
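As I understand the docs, when a node function is a generator, Kedro saves each yielded value to the output dataset as soon as it is produced, so only one chunk is in memory at a time. A minimal sketch (the processing and names are placeholders, and the output dataset has to support incremental/append-style saving):

from typing import Iterator

import numpy as np


def process_stream(recordings: Iterator[np.ndarray]) -> Iterator[np.ndarray]:
    for recording in recordings:
        # each yielded value is saved straight away instead of being
        # accumulated in a dict
        yield recording / np.max(np.abs(recording))  # placeholder processing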

Hi @Rashida Kanchwala, this seems perfect. I have been running some examples to understand how it works, and now I'm wondering whether generators can be coupled with partitioned datasets. What I mean is whether the iterable returned by the generator can be used as the input to the _save method of PartitionedDataset. Of course one could write a custom dataset that works as desired, but I would like to know if there is a straightforward way of doing it.

If your Kedro node returns dict[str, Callable], this will be handled by Kedro's PartitionedDataset lazy saving. I've had the same case on many occasions and it works as expected.
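Something like this, sticking with the audio example (the processing is just a placeholder): the callables are only invoked when PartitionedDataset saves each partition, so only one array is loaded and processed at a time.

from typing import Callable

import numpy as np


def process_partitions(
    partitions: dict[str, Callable[[], np.ndarray]],
) -> dict[str, Callable[[], np.ndarray]]:
    result = {}
    for name, load_func in partitions.items():
        # nothing is loaded here; the load + processing happens
        # partition by partition at save time
        result[name] = lambda load_func=load_func: load_func() / 2.0
    return result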

@Fazil Topal the problem is that when I use generators the node returns an iterable instead of dict[str, Callable].

Just use lambda functions. Example:

d[fpath] = lambda: yourfunc()
return d

then Kedro will call each function when saving that partition.
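One caveat if you build that dict in a loop: Python lambdas capture variables late, so bind the current value as a default argument, otherwise every entry will call yourfunc with whatever the loop variable was on the last iteration. Sketch (yourfunc is the same placeholder as above):

def my_node(recordings: dict) -> dict:
    d = {}
    for fpath, recording in recordings.items():
        # the default argument freezes the current value; a bare
        # "lambda: yourfunc(recording)" would only see the last one
        d[fpath] = lambda recording=recording: yourfunc(recording)
    return d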

@Fazil Topal Your solution does not work for me because I would like to use yield instead of return in the body of the node. I don't want to use lazy saving with partitioned datasets but generators with partitioned datasets. Thank you very much though for your time and for caring about my question.
