
Updated 2 days ago

Dynamically changing catalog entry paths

Can anyone suggest a good way of dynamically changing a catalog entry's path? For example, by default I want to use local paths for my intermediate datasets, but when I deploy to production I don't want anything to be saved locally. Duplicating catalog.yml in the conf/production/ folder is not ideal, as I would have to maintain two copies of each catalog entry.

31 comments

We’re in the middle of building a new Kedro catalog where some of these requirements are going to be covered. @Elena Khaustova, where is the best place to read up on this milestone?

But actually I think dataset factories may make your catalogs much simpler

Are you using this?

This particular feature won’t be in the new catalog, as we still suggest replacing the datasets instead of modifying them.

What can be helpful here is creating two catalogs for different environments and using the appropriate one based on the current environment.

What we do is change the path structure dynamically using env vars or globals. A typical path looks like:
${globals:file_system}://${globals:prefix}/…
where file_system is s3 in prod and file for local testing.

I ended up creating a quick and dirty solution similar to @Matthias Roels's, by changing the catalog file paths in settings.py

the only part I haven't figured out is how to find out, from within settings.py, which environment is active

e.g. kedro run should use whatever the default env is (in my project that would be local), but kedro run --env production should use production

this may be off topic and worth a new thread, but it's important as I still want to leverage the local filesystem when developing locally.

Ah settings.py isn’t related to that, what do you need to do with that information?

Hooks are registered in settings.py and can intercept the environment argument if needed

I can't share any of the code publicly so I'll try my best to paraphrase...

but basically I want to do

# settings.py

# `env` is the value I don't know how to obtain here
if env == 'local':
    pass  # keep the local file paths
elif env in ('production', 'staging'):
    change_catalog_filepaths()

change_catalog_filepaths() is working and doing exactly what I want, I just don't know of a non-hacky way to access env in settings.py

Okay gimme a sec to think about this

Are you using dataset factories?

yes, we are using some dataset factories, but not all catalog entries use them

Okay and how big is your catalog?

You can intercept a runtime parameter, set it as a global

And then use that argument in your file paths

len(OmegaConfigLoader(conf_source='conf/', **CONFIG_LOADER_ARGS).get('catalog').items()) is 53 in settings.py, but that's before the dataset factories get resolved into their own unique entries

Okay so I think templating your file paths to be driven by the environment argument is the way to go

So you’ll always have to run a cli command with the argument

But the minute you’re doing that sort of stuff in settings.py you’re kind of going out of bounds

I’d read through the runtime parameters/ global parameters / configuration environment docs

All should be there

hmmm ok...

A few months ago I spent a good amount of time reading through the Kedro docs + source code to see if I could get the Kedro environment without any hacks... but that was for a very different problem, which I was able to solve in a more "Kedro approved" way

(in that case it was a custom dataset + credentials set per ENV for an email alerting system)

Okay so there are more elegant solutions, but what you can do is drive everything by the KEDRO_ENV environment variable?

  • use an omegaconf resolver to inject the variable in the filepath
  • kedro will select the right environment based on this
  • You can also use a before_command_run hook to set the env var if you drive it by the CLI
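A minimal sketch of the two pieces mentioned in the bullets above, using only the standard library. It assumes that a before_command_run hook hands you the raw CLI arguments as a list of strings, and that "local" is the project's default environment; the helper names env_from_cli_args and current_env are hypothetical. Whether the variable is set early enough for settings.py to see it is something to verify in your own setup.

```python
import os

DEFAULT_ENV = "local"  # assumption: the project's default environment


def env_from_cli_args(command_args):
    """Return the value passed via --env / -e, or None if it is absent."""
    for i, arg in enumerate(command_args):
        if arg.startswith("--env="):
            return arg.split("=", 1)[1]
        if arg in ("--env", "-e") and i + 1 < len(command_args):
            return command_args[i + 1]
    return None


def current_env(environ=None):
    """Read KEDRO_ENV, falling back to the assumed default environment."""
    environ = os.environ if environ is None else environ
    return environ.get("KEDRO_ENV") or DEFAULT_ENV
```

Inside the hook you could then do something like os.environ["KEDRO_ENV"] = env_from_cli_args(command_args) when the result is not None, and read it back with current_env() wherever the environment name is needed.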

Why do you want to actually change the filepaths? You could just parametrise them using globals. This way, you keep the same structure, you don’t need any additional code and you can just put the different options in a globals file in your production/local kedro environment folders.

What Matthias is saying is correct

Yeah after sleeping on this I am going to use the global variable solution, ty for the help

@Matthias Roels I'm revisiting this as I realized that when I run this locally it's saving to C:/data/01_raw/ instead of ./data/01_raw/

is this expected or is there a way to make the globals.yaml dynamically resolve the absolute path of my kedro project?

If you do:

filepath: ${globals:filesystem}/${globals:bucket}/…
with filesystem set to either s3:// or ./data, and bucket set to the name of your S3 bucket or a subfolder of data of your choosing
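To make the substitution concrete, here is a small plain-Python stand-in for what the ${globals:...} interpolation does with per-environment values. It is not Kedro's resolver, just a regex toy; the dictionary values, the bucket names, and the 01_raw/example.csv tail are all hypothetical. Note that the production value is written as s3:/ here so that the slash in the template completes s3://; with s3:// as the value, the template as written would produce three slashes.

```python
import re

# Hypothetical values standing in for conf/local/globals.yml and
# conf/production/globals.yml; the names are illustrative only.
GLOBALS_BY_ENV = {
    "local": {"filesystem": "./data", "bucket": "my_subfolder"},
    "production": {"filesystem": "s3:/", "bucket": "my-bucket"},
}

# Matches placeholders of the form ${globals:some_key}
_PLACEHOLDER = re.compile(r"\$\{globals:(\w+)\}")


def resolve(template, env):
    """Substitute each ${globals:key} with the value for the given env."""
    values = GLOBALS_BY_ENV[env]
    return _PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


# Hypothetical catalog filepath template
TEMPLATE = "${globals:filesystem}/${globals:bucket}/01_raw/example.csv"

print(resolve(TEMPLATE, "local"))       # ./data/my_subfolder/01_raw/example.csv
print(resolve(TEMPLATE, "production"))  # s3://my-bucket/01_raw/example.csv
```

Because the local value starts with ./ and carries no protocol prefix, the resulting path stays relative rather than pointing at the drive root.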
