Parallel Runner Memory Utilization Issue with Modular Pipelines

At a glance
The community member runs thousands of modular pipelines in parallel (32 workers on a 72-core machine) and saves each pipeline's results to a PostgreSQL database. They noticed that memory utilization never drops, even as individual pipelines complete, and asked whether the ParallelRunner holds on to Python objects until all of the modular pipelines are finished. The comments suggest trying the --async flag, connecting to PostgreSQL with Ibis instead of pandas or Polars, and note that the ParallelRunner should not hold on to the data. The community member tried the --async flag and planned to replace their Polars SQL implementation with Ibis. They later reported that the memory growth occurs with both the SequentialRunner and the ParallelRunner, and narrowed it down to a potential memory leak (or compilation caching) related to JAX.

Hi everyone,

I have a question about the ParallelRunner. I have a modular pipeline setup with lots of pipelines that I want to run in parallel. At the end of each modular pipeline, the computations are saved to a PostgreSQL database. What I am noticing is that even though some pipelines are completing, the memory utilization never goes down. A little more info on my setup: I am running 32 workers on a 72-core machine and I have thousands of modular pipelines that I want to run in parallel. My question is this: does the ParallelRunner hold on to the Python objects until ALL of the modular pipelines are complete?
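For context, a run like the one described can be launched with `kedro run --runner=ParallelRunner`, or programmatically. A minimal sketch of the programmatic form is below; the worker count comes from the thread, while the project path and Kedro 0.19-style imports are assumptions:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ParallelRunner

# Assumes we are inside the Kedro project directory.
bootstrap_project(Path.cwd())

with KedroSession.create(project_path=Path.cwd()) as session:
    # Each worker is a separate process; the datasets at the end of each
    # modular pipeline handle the PostgreSQL writes.
    session.run(runner=ParallelRunner(max_workers=32))
```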

Marked as solution

@Nok interesting, but then I don’t know how to explain why the memory utilization never goes down?
does it come down when you are using the SequentialRunner (the default)? Does this problem only appear with the ParallelRunner?


You can see the implementation here:
https://docs.kedro.org/en/stable/_modules/kedro/runner/parallel_runner.html#ParallelRunner

It's also possible to subclass this and use your own runner.

can you try the --async flag?
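For reference, the flag is passed as `kedro run --async` on the CLI; in the session API, the equivalent (to my understanding) is the runner's `is_async` argument, e.g.:

```python
from kedro.runner import ParallelRunner

# is_async=True loads and saves node inputs/outputs in background threads,
# which can overlap I/O (e.g. the PostgreSQL writes) with computation.
runner = ParallelRunner(max_workers=32, is_async=True)
```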

how are you connecting to Postgres? Is it pandas or Ibis?

Thank you @datajoely

I will try the async flag and see if that does anything.

I am using a custom dataset that subclasses the SQLTableDataset, and I just override the save method to use Polars instead of pandas.
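The poster didn't share their dataset, but a setup like that might look roughly like the sketch below. The class name, attribute names, and the append behaviour are assumptions, not the poster's code; `SQLTableDataset` is from a recent kedro-datasets and `write_database` is Polars' SQL writer:

```python
import polars as pl
from kedro_datasets.pandas import SQLTableDataset


class PolarsSQLTableDataset(SQLTableDataset):
    """Sketch of a dataset that saves with Polars instead of pandas."""

    def __init__(self, *, table_name: str, credentials: dict, **kwargs):
        super().__init__(table_name=table_name, credentials=credentials, **kwargs)
        # Keep our own copies so we don't depend on the parent's internals.
        self._polars_table = table_name
        self._polars_uri = credentials["con"]  # SQLAlchemy-style connection string

    def _save(self, data: pl.DataFrame) -> None:
        # Polars pushes the frame straight to PostgreSQL via the connection URI.
        data.write_database(
            table_name=self._polars_table,
            connection=self._polars_uri,
            if_table_exists="append",  # assumption; could also be "replace"
        )
```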

oh interesting

I think if you use Ibis you can use the ThreadRunner, which will definitely be faster
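Switching runners is just `kedro run --runner=ThreadRunner` on the CLI, or in code (worker count illustrative):

```python
from kedro.runner import ThreadRunner

# Threads share one process, so this only pays off when the heavy lifting
# happens outside Python, e.g. in the database via Ibis, or in Spark.
runner = ThreadRunner(max_workers=32)
```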

just thinking quickly, I don't think the ParallelRunner tries to hold any data. In a Kedro run, once you start the pipeline, what Kedro sees is a bunch of nodes (there is no more concept of a pipeline), and the execution order is determined by resolving the dependencies between nodes.

If the data is a persisted type (e.g. CSVDataset), the memory for the data is released immediately after the node runs. If it's a memory type, it is released once the data has no more downstream dependencies.
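To illustrate the distinction (dataset names and paths below are made up): a declared, persisted dataset is written out and can be freed right after the producing node, while an in-memory dataset stays in the worker until its last consumer has run.

```python
from kedro.io import DataCatalog, MemoryDataset
from kedro_datasets.pandas import CSVDataset

catalog = DataCatalog(
    {
        # Persisted: written to disk by the producing node, so the in-memory
        # copy can be released as soon as that node finishes.
        "features": CSVDataset(filepath="data/02_intermediate/features.csv"),
        # In-memory: kept alive only until the last downstream node that
        # needs it has run, then released by the runner.
        "intermediate_table": MemoryDataset(),
    }
)
```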

Oh! @datajoely I will try that. I didn't know that the ThreadRunner is faster than the ParallelRunner.

@Nok interesting, but then I don’t know how to explain why the memory utilization never goes down?

I am trying to run it now with async to see if that makes a difference.

I would be happy to share my implementation; unfortunately, I work in the healthcare field, so I can't really share anything related to patient information.

sorry, it's only faster in certain conditions (because of Python's limitations with concurrency more generally)

Ah I see, well I will give it a try and see if it speeds up my use case. Thank you!

Because Spark and Ibis delegate their processing to other execution engines, you can use threads; with pandas and Polars, you need to work with the ParallelRunner on your machine.

I see. Currently I am running on a single-node server, but eventually I think I will need to deploy this on a Kubernetes cluster. Maybe then the switch to Spark or Ibis will give me a more meaningful boost in performance.

It would be really cool if kedro had a kedro-kubernetes plugin!

I think our Airflow plugin is the best way of achieving that today

but also, using Ibis makes Postgres your workhorse and delegates execution directly to the SQL engine, which handles scaling etc. there
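A rough sketch of what that looks like with Ibis (connection details and the table/column names are placeholders): the expression is built lazily, compiled to SQL, and executed inside PostgreSQL, so only the final result comes back to Python.

```python
import ibis

# Placeholder credentials; in a Kedro project these would come from credentials.yml.
con = ibis.postgres.connect(
    host="localhost",
    user="kedro",
    password="secret",
    database="analytics",
)

events = con.table("events")  # hypothetical table

# Compiled to SQL and executed by PostgreSQL itself.
daily_counts = (
    events.group_by("event_date")
    .aggregate(n_events=events.count())
    .order_by("event_date")
)
print(daily_counts.execute())  # returns a pandas DataFrame with the result
```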

Yes, I think so too. I have played around with some plugins and find that to be the most reliable. However, it seems tied to a platform; I forget if it was called Astro or something. We are a Microsoft shop, so we use Azure, and all of our resources and our Data Lake are on Azure. I need to find out if the plugin can play nicely with our current infrastructure.

Oh, that would actually be quite a performance boost. I will go ahead and replace the Polars SQL implementation with Ibis then. Thank you for all your help! I am still running with --async and will report back whether there are any improvements or not.

@Nok interesting, but then I don’t know how to explain why the memory utilization never goes down?
does it come down when you are using the SequentialRunner (the default)? Does this problem only appear with the ParallelRunner?

Hey @Nok, thank you for that tip. It looks like even with the SequentialRunner the memory was steadily increasing. I narrowed the issue down to a potential memory leak related to JAX, or it might be caching things to reduce compilation time; I need to dig into it more. But your advice certainly helped me find that issue. @datajoely, the --async flag didn't change anything, but then again, if this is a memory leak it isn't expected to change anything.
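On the JAX point: jit-compiled functions keep their compiled executables in a process-level cache, which can look like a leak in a long-running worker. A hedged way to test that hypothesis (the function and shapes here are made up, not the poster's code) is to clear JAX's caches and watch whether memory drops:

```python
import jax
import jax.numpy as jnp


@jax.jit
def score(x):
    # Stand-in for the real per-pipeline computation.
    return jnp.tanh(x).sum()


for i in range(1000):
    # Different input shapes trigger fresh compilations, each of which is cached.
    x = jnp.ones((i % 10 + 1, 128))
    score(x).block_until_ready()

# If memory falls after this call, the growth was (at least partly) JAX's
# compilation/executable caches rather than a true leak.
jax.clear_caches()
```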

Good progress on tracking down the source!
