
Handling Memory Issues in Kedro Projects for Processing Large Datasets

Hello, I'm facing a memory issue in my Kedro project and I would like to know if there is a Kedro-oriented solution.

I am developing a pipeline for processing large datasets of audio recordings. This involves processing several audio files (large NumPy arrays) in a single node and storing them again. I was relying on PartitionedDataset for this, but I'm having memory issues because building the dictionary of NumPy arrays is quite heavy and always ends up consuming all of my limited memory.
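Roughly, the node currently looks like this (simplified, names are illustrative): PartitionedDataset hands me a dict of load callables, and I build the whole dict of processed arrays before returning it.

from typing import Callable

import numpy as np


def process_audio(partitions: dict[str, Callable[[], np.ndarray]]) -> dict[str, np.ndarray]:
    processed = {}
    for name, load_func in partitions.items():
        recording = load_func()
        # placeholder processing: peak-normalise each recording
        processed[name] = recording / np.max(np.abs(recording))
    # every processed array is held in memory until the node returns
    return processed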

Is there a way to store each processed file as soon as it is done, instead of keeping them all in RAM until the last one is finished? Of course this is possible in many ways, but my question is about Kedro: is it possible to save from the body of the node function using Kedro and partitioned datasets? Has any of you experienced something like this before?

Best,
Nicolas


Hi @Nicolas Betancourt Cardona, maybe you can look at this doc on saving data using generators - https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#saving-data-with-generators
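As I understand the docs, when a node function is a generator, Kedro saves each yielded value to the output dataset as soon as it is produced, so only one chunk is in memory at a time. A minimal sketch (the processing and names are placeholders, and the output dataset has to support incremental/append-style saving):

from typing import Iterator

import numpy as np


def process_stream(recordings: Iterator[np.ndarray]) -> Iterator[np.ndarray]:
    for recording in recordings:
        # each yielded value is saved straight away instead of being
        # accumulated in a dict
        yield recording / np.max(np.abs(recording))  # placeholder processing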

Hi @Rashida Kanchwala, this seems perfect. I have been running some examples to understand how it works, and now I'm wondering whether generators can be coupled with partitioned datasets. What I mean is whether the iterable returned by the generator can be used as the input to the _save method of PartitionedDataset. Of course one could write a custom dataset that works as desired, but I would like to know if there is a straightforward way of doing it.

If your Kedro node returns dict[str, Callable], this will be handled by Kedro's PartitionedDataset lazy saving. I've had the same case on many occasions and it works as expected.
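Something like this, sticking with the audio example (the processing is just a placeholder): the callables are only invoked when PartitionedDataset saves each partition, so only one array is loaded and processed at a time.

from typing import Callable

import numpy as np


def process_partitions(
    partitions: dict[str, Callable[[], np.ndarray]],
) -> dict[str, Callable[[], np.ndarray]]:
    result = {}
    for name, load_func in partitions.items():
        # nothing is loaded here; the load + processing happens
        # partition by partition at save time
        result[name] = lambda load_func=load_func: load_func() / 2.0
    return result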

@Fazil Topal the problem is that when I use generators the node returns an iterable instead of dict[str, Callable].

Just use lambda functions. Example:

d[fpath] = lambda: yourfunc()
return d

then Kedro will call each function when saving that partition.
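One caveat if you build that dict in a loop: Python lambdas capture variables late, so bind the current value as a default argument, otherwise every entry will call yourfunc with whatever the loop variable was on the last iteration. Sketch (yourfunc is the same placeholder as above):

def my_node(recordings: dict) -> dict:
    d = {}
    for fpath, recording in recordings.items():
        # the default argument freezes the current value; a bare
        # "lambda: yourfunc(recording)" would only see the last one
        d[fpath] = lambda recording=recording: yourfunc(recording)
    return d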

@Fazil Topal Your solution does not work for me because I would like to use yield instead of return in the body of the node. I don't want to use lazy saving with partitioned datasets but generators with partitioned datasets. Thank you very much though for your time and for caring about my question.
