Hi everyone,
I have a question about the ParallelRunner. I have a modular pipeline setup with lots of pipelines that I want to run in parallel. At the end of each modular pipeline the computations are saved to a PostgreSQL database. What I am noticing is that even though some pipelines are completing, the memory utilization never goes down. A little more info on my setup: I am running 32 workers on a 72-core machine and I have thousands of modular pipelines that I want to run in parallel. My question is this: does the ParallelRunner hold on to the Python objects until ALL of the modular pipelines are complete?
You can see the implementation here:
https://docs.kedro.org/en/stable/_modules/kedro/runner/parallel_runner.html#ParallelRunner
It's also possible to subclass this and use your own runner
Thank you @datajoely
I will try the async flag and see if that does anything.
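For anyone following along, the async behaviour is a flag on `kedro run` (worth double-checking the exact spelling against your Kedro version):

```shell
# Load and save node inputs/outputs asynchronously within each node
kedro run --async
```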
I am using a custom dataset that subclasses the SQLTableDataset and I just overload the save method to use polars instead of pandas.
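Roughly this pattern, by the way (a self-contained sketch with a stand-in base class, not my real code; in practice you subclass `kedro_datasets.pandas.SQLTableDataset` and the polars write would be something like `DataFrame.write_database`):

```python
class SQLTableDataset:
    """Stand-in for kedro_datasets.pandas.SQLTableDataset, so this
    sketch runs on its own without kedro or polars installed."""

    def __init__(self, table_name: str, credentials: dict):
        self._table_name = table_name
        self._connection = credentials["con"]
        self.writes = []  # records writes so the sketch is inspectable

    def save(self, data) -> None:
        # The real dataset writes via pandas.DataFrame.to_sql
        self.writes.append(("pandas", self._table_name, data))


class PolarsSQLTableDataset(SQLTableDataset):
    """Same dataset, but save() goes through polars instead of pandas."""

    def save(self, data) -> None:
        # The real override might call:
        #   data.write_database(self._table_name, self._connection,
        #                       if_table_exists="append")
        self.writes.append(("polars", self._table_name, data))


ds = PolarsSQLTableDataset("results", {"con": "postgresql://host/db"})
ds.save({"patient_id": [1, 2]})
```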
just thinking quickly, I don't think the ParallelRunner tries to hold any data. In a kedro run, once you start the pipeline, what Kedro sees is a bunch of nodes (no more concept of pipelines), and the execution order is determined by solving the dependencies between nodes.
If the data is a persisted type (e.g. CSVDataset), the memory for the data is released immediately after the node runs. If it's a memory type, it is released once the data has no more downstream dependencies.
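A minimal sketch of that release rule (this is not Kedro's actual implementation, just the idea): count the remaining downstream consumers of each dataset and free the in-memory ones when the count reaches zero, while persisted datasets are kept because they've already been written out.

```python
from collections import Counter


def run(nodes, is_persisted):
    """nodes: list of (func, input_names, output_name) in execution order.
    is_persisted: dataset name -> True if it is a persisted type."""
    # How many downstream nodes still need each dataset
    load_counts = Counter(name for _, inputs, _ in nodes for name in inputs)
    memory, released = {}, []
    for func, inputs, output in nodes:
        memory[output] = func(*(memory[name] for name in inputs))
        for name in inputs:
            load_counts[name] -= 1
            # No consumers left and not persisted: free the MemoryDataset
            if load_counts[name] == 0 and not is_persisted.get(name, False):
                memory.pop(name, None)
                released.append(name)
    return memory, released


nodes = [
    (lambda: [1, 2, 3], (), "raw"),                      # raw is in-memory
    (lambda x: [v * 2 for v in x], ("raw",), "doubled"), # doubled is "persisted"
    (lambda x: sum(x), ("doubled",), "total"),
]
mem, released = run(nodes, {"doubled": True})
# "raw" is freed after its last consumer; "doubled" survives as persisted
```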
Oh! @datajoely I will try that. I didn't know that ThreadRunner is faster than ParallelRunner.
@Nok interesting, but then I don’t know how to explain why the memory utilization never goes down?
I am trying to run it now with async to see if that makes a difference.
I would be happy to share my implementation; unfortunately, I work in the healthcare field, so I can't really share anything related to patient information.
sorry, it's only faster in certain conditions (because of Python's limitations with concurrency more generally, i.e. the GIL)
Ah I see, well I will give it a try and see if it speeds up my use case. Thank you!
Because Spark and Ibis delegate their processing to other execution engines, you can use threads; in the pandas and polars case you need to work with ParallelRunner on your machine
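In CLI terms, switching runners looks like this (double-check the flags against your Kedro version):

```shell
# Threads: fine when the heavy lifting happens outside Python (Spark, Ibis/SQL)
kedro run --runner=ThreadRunner

# Processes: needed for CPU-bound pandas/polars work because of the GIL
kedro run --runner=ParallelRunner
```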
I see, currently I am running on a single-node server, though eventually I think I will need to deploy this on a Kubernetes cluster. Maybe then the switch to Spark or Ibis will give me a more meaningful boost in performance.
but also using Ibis makes postgres your workhorse and delegates execution directly to the SQL engine which handles scaling etc there
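To illustrate the pushdown idea with something self-contained: here is the same principle using the stdlib sqlite3 module, where the GROUP BY executes inside the database engine rather than in Python. Ibis does the equivalent by compiling its expressions to SQL for Postgres (table and column names here are made up).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# The aggregation runs inside SQLite, not in Python -- the same principle
# Ibis relies on when it hands a compiled query to the Postgres engine.
rows = con.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
```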
Yes, I think so too. I have played around with some plugins and found that to be the most reliable. However, it seems tied to a platform; I forget if it was called Astro or something. We are a Microsoft shop, so we use Azure, and all of our resources and Data Lake are on Azure. I need to find out if the plugin can play nicely with our current infrastructure.
Oh, that would actually be quite a performance boost. I will go ahead and switch the polars SQL implementation over to Ibis then. Thank you for all your help! I am still running with async and will report back whether there are any improvements.
Does it come down when you are using SequentialRunner (the default)? Does this problem only appear with ParallelRunner?
Hey @Nok, thank you for that tip. It looks like even using the sequential runner the memory was steadily increasing. I narrowed down the issue to a potential memory leak related to JAX. Or they might be caching things to reduce compilation time. I need to dig into it more. But your advice certainly helped me find that issue. @datajoely the async flag didn’t change anything but again if this is a memory leak then it isn’t expected to change anything.