Updated 2 weeks ago

Improving Kedro Project Load Times

Regarding https://github.com/kedro-org/kedro/issues/4322: I am upgrading a large project from Kedro 0.18.13 to the latest version. While doing so, I am also removing a custom ConfigLoader because I want to use OmegaConf. However, I see performance issues compared to the custom implementation we had. I did some debugging (using logging in my hooks) and found the following:

  • The project has 1500 catalog entries, most with a filepath that combines info from globals (bucket, prefix, data version, …)
  • With Kedro 0.18, I was able to load the project in a notebook in around 25 sec
  • In the new version, it takes 100 sec
  • Most of the load time is spent after my after_context_created hooks (potentially when creating the catalog?)

I would like to see what I can do to improve load times or, at least, figure out for sure what is causing the slowdown. Any help would be appreciated (I cannot give access to the full project, but I will share whatever info I can).

Hi Matthias, thanks for sharing your observations. We did some more in-depth analysis as well: https://github.com/kedro-org/kedro/issues/3893

The verdict is that most of the slowness comes from the OmegaConf side, but we are working on improving what we can on our side.

Would be great to hear your ideas if you have any!

On top of what was said above, are you able to produce line_profiler + cProfile reports on your code and let us know where the hotspots are?
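For anyone wanting to produce such a report, here is a minimal sketch using only the standard library's cProfile and pstats. The function load_project is a hypothetical stand-in for whatever call is slow in your project (for example, the code that builds the catalog); it is not a Kedro API.

```python
import cProfile
import io
import pstats


def load_project():
    """Hypothetical stand-in for the slow call (e.g. building the catalog)."""
    return sum(i * i for i in range(200_000))


def profile_call(func, top=10):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the expensive call chains show up first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(load_project)
print(report)
```

Sorting by cumulative time tends to be the quickest way to see which top-level call dominates; line_profiler (a separate third-party package) can then be pointed at that function for a per-line breakdown.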

I will try to do that and share them next week

I did a deep dive into what is making the catalog slow to load for me. I only load from the base env, which already contains 63 catalog files (with 1500 entries each). It seems the bottleneck is in the return statement when loading and merging configs. More specifically, it is the to_container call that resolves the config.

Thanks, this seems to match our initial investigation on https://github.com/kedro-org/kedro/issues/3893

Our intended solution is still "Reduce the time spent [...] on OmegaConf.to_container"

I can already say it scales linearly with the number of entries. For me, it is about 44 ms per catalog entry (x1500 -> roughly 60 sec)

Two small changes I immediately see are:

  1. Keep globals as an OmegaConf object internally to save a cast on every resolve
  2. First merge all configs, then loop over the keys to filter out the ones starting with _, and then do the resolving

These will probably have very minimal impact on performance though (I expect at most a couple of seconds).
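To make the ordering in the second point concrete, here is a rough pure-Python sketch of "merge, then filter, then resolve". It does not use OmegaConf or Kedro: plain dicts stand in for the config objects, and resolve_value and load_catalog are hypothetical helpers with a toy ${name} substitution, so treat it only as an illustration of the order of the steps.

```python
import re

# Toy placeholder syntax: ${name}. This is NOT OmegaConf's resolver.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve_value(value, variables):
    """Replace ${name} placeholders in strings, recursing into dicts."""
    if isinstance(value, dict):
        return {k: resolve_value(v, variables) for k, v in value.items()}
    if isinstance(value, str):
        return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), value)
    return value


def load_catalog(config_files, variables):
    """Hypothetical loader: merge first, filter second, resolve last."""
    merged = {}
    for config in config_files:
        merged.update(config)  # step 1: merge all files (shallow merge)
    # Step 2: drop keys starting with "_" before any resolving happens.
    visible = {k: v for k, v in merged.items() if not k.startswith("_")}
    # Step 3: resolve only the entries that survived the filter.
    return resolve_value(visible, variables)


files = [
    {
        "_defaults": {"filepath": "s3://${bucket}/unused"},
        "cars": {"type": "pandas.CSVDataset",
                 "filepath": "s3://${bucket}/${prefix}/cars.csv"},
    },
    {
        "boats": {"type": "pandas.ParquetDataset",
                  "filepath": "s3://${bucket}/${prefix}/boats.parquet"},
    },
]
variables = {"bucket": "my-bucket", "prefix": "v1"}
catalog = load_catalog(files, variables)
print(catalog["cars"]["filepath"])
```

In this toy version the globals live in a separate dict, so dropping the _-prefixed keys before resolving is safe. If regular entries interpolate values from _-prefixed keys, the filter would have to run after resolving instead.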

What could be a huge improvement, at least in my case, is to keep using OmegaConf objects in the rest of the Kedro project (as opposed to dicts). This would probably be a major backend change, but it would let you postpone to_container calls as long as possible (and, in our case, skip many of them, as we only use a portion of the catalog on every kedro run).
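The idea of postponing resolution can be sketched without OmegaConf. Below, LazyCatalogConfig is a hypothetical read-only mapping that resolves an entry only when it is first accessed and caches the result; fill_bucket is a toy resolver standing in for the real interpolation step. Neither name exists in Kedro or OmegaConf.

```python
from collections.abc import Mapping


class LazyCatalogConfig(Mapping):
    """Hypothetical mapping that resolves each entry on first access only."""

    def __init__(self, raw, resolver):
        self._raw = raw            # unresolved entries, kept as-is
        self._resolver = resolver  # callable applied lazily, per entry
        self._cache = {}
        self.resolve_calls = 0     # counter, for illustration only

    def __getitem__(self, key):
        if key not in self._cache:
            self.resolve_calls += 1
            self._cache[key] = self._resolver(self._raw[key])
        return self._cache[key]

    def __iter__(self):
        return iter(self._raw)

    def __len__(self):
        return len(self._raw)


def fill_bucket(entry):
    """Toy resolver: substitutes a single hard-coded placeholder."""
    return {
        k: v.replace("${bucket}", "my-bucket") if isinstance(v, str) else v
        for k, v in entry.items()
    }


# 1500 unresolved entries, mirroring the catalog size mentioned above.
raw = {
    f"dataset_{i}": {"filepath": f"s3://${{bucket}}/data_{i}.csv"}
    for i in range(1500)
}
lazy = LazyCatalogConfig(raw, fill_bucket)

# A run that needs only two datasets pays the resolve cost only twice.
needed = ["dataset_3", "dataset_42"]
paths = [lazy[name]["filepath"] for name in needed]
print(paths, lazy.resolve_calls)
```

With an eager approach, all 1500 entries would be resolved up front; here the cost is proportional to the number of entries a given run actually touches.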
