Parallel Runner Memory Utilization Issue with Modular Pipelines

At a glance
The community member runs thousands of modular pipelines in parallel (32 workers on a 72-core machine) and saves each pipeline's results to a PostgreSQL database. They noticed that memory utilization never drops, even as individual pipelines complete, and asked whether the ParallelRunner holds on to Python objects until all of the modular pipelines are finished. The comments suggest trying the --async flag, connecting to PostgreSQL with Ibis instead of pandas or Polars, and note that the ParallelRunner should not hold on to the data. The community member tried the --async flag and planned to replace their Polars SQL implementation with Ibis. They later reported that the memory growth occurs with both the SequentialRunner and the ParallelRunner, and narrowed it down to a potential memory leak (or compilation caching) related to JAX.

Hi everyone,

I have a question about the ParallelRunner. I have a modular pipeline setup with lots of pipelines that I want to run in parallel. At the end of each modular pipeline, the computations are saved to a PostgreSQL database. What I am noticing is that even though some pipelines are completing, the memory utilization never goes down. A little more info on my setup: I am running 32 workers on a 72-core machine and I have thousands of modular pipelines that I want to run in parallel. My question is this: does the ParallelRunner hold on to the Python objects until ALL of the modular pipelines are complete?
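For context, a run like the one described can be launched with `kedro run --runner=ParallelRunner`, or programmatically. A minimal sketch of the programmatic form is below; the worker count comes from the thread, while the project path and Kedro 0.19-style imports are assumptions:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ParallelRunner

# Assumes we are inside the Kedro project directory.
bootstrap_project(Path.cwd())

with KedroSession.create(project_path=Path.cwd()) as session:
    # Each worker is a separate process; the datasets at the end of each
    # modular pipeline handle the PostgreSQL writes.
    session.run(runner=ParallelRunner(max_workers=32))
```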

Marked as solution

@Nok interesting, but then I don’t know how to explain why the memory utilization never goes down?
does it come down when you are using the SequentialRunner (the default)? Does this problem only appear with the ParallelRunner?


You can see the implementation here:
https://docs.kedro.org/en/stable/_modules/kedro/runner/parallel_runner.html#ParallelRunner

It's also possible to subclass this and use your own runner.

can you try the --async flag?
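For reference, the flag is passed as `kedro run --async` on the CLI; in the session API, the equivalent (to my understanding) is the runner's `is_async` argument, e.g.:

```python
from kedro.runner import ParallelRunner

# is_async=True loads and saves node inputs/outputs in background threads,
# which can overlap I/O (e.g. the PostgreSQL writes) with computation.
runner = ParallelRunner(max_workers=32, is_async=True)
```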

how are you connecting to Postgres? Is it pandas or Ibis?

Thank you @datajoely

I will try the async flag and see if that does anything.

I am using a custom dataset that subclasses the SQLTableDataset, and I just override the save method to use Polars instead of pandas.
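The poster didn't share their dataset, but a setup like that might look roughly like the sketch below. The class name, attribute names, and the append behaviour are assumptions, not the poster's code; `SQLTableDataset` is from a recent kedro-datasets and `write_database` is Polars' SQL writer:

```python
import polars as pl
from kedro_datasets.pandas import SQLTableDataset


class PolarsSQLTableDataset(SQLTableDataset):
    """Sketch of a dataset that saves with Polars instead of pandas."""

    def __init__(self, *, table_name: str, credentials: dict, **kwargs):
        super().__init__(table_name=table_name, credentials=credentials, **kwargs)
        # Keep our own copies so we don't depend on the parent's internals.
        self._polars_table = table_name
        self._polars_uri = credentials["con"]  # SQLAlchemy-style connection string

    def _save(self, data: pl.DataFrame) -> None:
        # Polars pushes the frame straight to PostgreSQL via the connection URI.
        data.write_database(
            table_name=self._polars_table,
            connection=self._polars_uri,
            if_table_exists="append",  # assumption; could also be "replace"
        )
```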

oh interesting

I think if you use Ibis you can use the ThreadRunner, which will definitely be faster
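Switching runners is just `kedro run --runner=ThreadRunner` on the CLI, or in code (worker count illustrative):

```python
from kedro.runner import ThreadRunner

# Threads share one process, so this only pays off when the heavy lifting
# happens outside Python, e.g. in the database via Ibis, or in Spark.
runner = ThreadRunner(max_workers=32)
```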

just thinking quickly, I don't think the ParallelRunner tries to hold any data. In a Kedro run, once you start the pipeline, what Kedro sees is a bunch of nodes (there is no more concept of a pipeline), and the execution order is determined by resolving the dependencies between nodes.

If the data is a persisted type (e.g. CSVDataset), the memory for the data is released immediately after the node runs. If it's a memory type, it is released once the data has no more downstream dependencies.
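To illustrate the distinction (dataset names and paths below are made up): a declared, persisted dataset is written out and can be freed right after the producing node, while an in-memory dataset stays in the worker until its last consumer has run.

```python
from kedro.io import DataCatalog, MemoryDataset
from kedro_datasets.pandas import CSVDataset

catalog = DataCatalog(
    {
        # Persisted: written to disk by the producing node, so the in-memory
        # copy can be released as soon as that node finishes.
        "features": CSVDataset(filepath="data/02_intermediate/features.csv"),
        # In-memory: kept alive only until the last downstream node that
        # needs it has run, then released by the runner.
        "intermediate_table": MemoryDataset(),
    }
)
```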

Oh! @datajoely I will try that. I didn't know that the ThreadRunner is faster than the ParallelRunner.

@Nok interesting, but then I don’t know how to explain why the memory utilization never goes down?

I am trying to run it now with async to see if that makes a difference.

I would be happy to share my implementation; unfortunately, I work in the healthcare field, so I can't really share anything related to patient information.

sorry, it's only faster in certain conditions (because of Python's limitations with concurrency more generally)

Ah I see, well I will give it a try and see if it speeds up my use case. Thank you!

Because Spark and Ibis delegate their processing to other execution engines, you can use threads; with pandas and Polars, you need to work with the ParallelRunner on your machine.

I see. Currently I am running on a single-node server, but eventually I think I will need to deploy this on a Kubernetes cluster. Maybe then the switch to Spark or Ibis will give me a more meaningful boost in performance.

It would be really cool if kedro had a kedro-kubernetes plugin!

I think our Airflow plugin is the best way of achieving that today

but also, using Ibis makes Postgres your workhorse and delegates execution directly to the SQL engine, which handles scaling etc. there
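A rough sketch of what that looks like with Ibis (connection details and the table/column names are placeholders): the expression is built lazily, compiled to SQL, and executed inside PostgreSQL, so only the final result comes back to Python.

```python
import ibis

# Placeholder credentials; in a Kedro project these would come from credentials.yml.
con = ibis.postgres.connect(
    host="localhost",
    user="kedro",
    password="secret",
    database="analytics",
)

events = con.table("events")  # hypothetical table

# Compiled to SQL and executed by PostgreSQL itself.
daily_counts = (
    events.group_by("event_date")
    .aggregate(n_events=events.count())
    .order_by("event_date")
)
print(daily_counts.execute())  # returns a pandas DataFrame with the result
```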

Yes, I think so too. I have played around with some plugins and find that to be the most reliable. However, it seems tied to a platform; I forget if it was called Astro or something. We are a Microsoft shop, so we use Azure, and all of our resources and our Data Lake are on Azure. I need to find out if the plugin can play nicely with our current infrastructure.

Oh, that would actually be quite a performance boost. I will go ahead and replace the Polars SQL implementation with Ibis then. Thank you for all your help! I am still running with --async and will report back whether there are any improvements or not.

@Nok interesting, but then I don’t know how to explain why the memory utilization never goes down?
does it come down when you are using the SequentialRunner (the default)? Does this problem only appear with the ParallelRunner?

Hey @Nok, thank you for that tip. It looks like even with the SequentialRunner the memory was steadily increasing. I narrowed the issue down to a potential memory leak related to JAX, or it might be caching things to reduce compilation time; I need to dig into it more. But your advice certainly helped me find that issue. @datajoely, the --async flag didn't change anything, but then again, if this is a memory leak it isn't expected to change anything.
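On the JAX point: jit-compiled functions keep their compiled executables in a process-level cache, which can look like a leak in a long-running worker. A hedged way to test that hypothesis (the function and shapes here are made up, not the poster's code) is to clear JAX's caches and watch whether memory drops:

```python
import jax
import jax.numpy as jnp


@jax.jit
def score(x):
    # Stand-in for the real per-pipeline computation.
    return jnp.tanh(x).sum()


for i in range(1000):
    # Different input shapes trigger fresh compilations, each of which is cached.
    x = jnp.ones((i % 10 + 1, 128))
    score(x).block_until_ready()

# If memory falls after this call, the growth was (at least partly) JAX's
# compilation/executable caches rather than a true leak.
jax.clear_caches()
```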

Good progress on tracking down the source!
