Can anyone suggest a good way of dynamically changing a catalog entry's path? For example, by default I want to use local paths for my intermediate datasets, but when I deploy to production I don't want anything saved locally. Duplicating the catalog.yml in the conf/production/ folder is not ideal, as I would have to maintain two copies of each catalog entry.
We’re in the middle of building a new Kedro catalog where some of these requirements are going to be covered. @Elena Khaustova, where is the best place to read up on this milestone?
This particular feature won’t be in the new catalog, as we still suggest replacing the datasets instead of modifying them.
What can be helpful here is creating two catalogs for different environments and using the one needed based on the current environment.
What we do is just change the path structure dynamically using env vars or globals. So a typical path would look like: ${globals:file_system}://${globals:prefix}/…
where file_system is s3 in prod and file for local testing
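A minimal sketch of that globals pattern (the dataset name, dataset type, and the concrete prefix values are illustrative assumptions, not from this thread):

```yaml
# conf/base/catalog.yml -- one entry, shared across all environments
# "my_dataset" and pandas.ParquetDataset are hypothetical examples
my_dataset:
  type: pandas.ParquetDataset
  filepath: ${globals:file_system}://${globals:prefix}/my_dataset.parquet

# conf/local/globals.yml -- resolves to a local path for dev runs
# file_system: file
# prefix: data

# conf/production/globals.yml -- resolves to S3 in prod (bucket name made up)
# file_system: s3
# prefix: my-prod-bucket
```

The catalog entry itself never changes between environments; only the globals file that Kedro picks up for the active environment does.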
I ended up creating a quick and dirty solution similar to @Matthias Roels's by changing catalog file paths in settings.py
the only part I haven't figured out is how to know what the environment parameter is from settings.py?
e.g. kedro run should use whatever the default env is (in my project that would be local), but kedro run --env production should use production
this may be off-topic and worth a new thread, but it's important as I want to still leverage the local filesystem when developing locally.
I can't share any of the code publicly, so I'll try my best to paraphrase…
but basically I want to do
# settings.py
if env == 'local':
    pass
elif env in ('production', 'staging'):
    change_catalog_filepaths()
the change_catalog_filepaths() is working and doing exactly what I want; I just don't know of a non-hacky way to access env in settings.py
But additionally you may want to use this
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-use-globals-and-runtime-params
len(OmegaConfigLoader(conf_source='conf/', **CONFIG_LOADER_ARGS).get('catalog').items())
is 53 in settings.py, but that's before the dataset factories get resolved into their own unique entries
Okay, so I think templating your file paths to be driven by the environment argument is the way to go
But the minute you’re doing that sort of stuff in settings.py you’re kind of going out of bounds
I’d read through the runtime parameters / global parameters / configuration environment docs
hmmm ok...
A few months ago I spent a good amount of time reading through the Kedro docs + source code to see if I could get the Kedro environment without any hacks… but that was for a very different problem, which I was able to solve in a more "kedro-approved" way
(in that case it was a custom dataset + credentials set per ENV for an email alerting system)
Okay, so there are more elegant solutions, but what you can do is drive everything by the KEDRO_ENV environment variable
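A hedged sketch of what that could look like in settings.py (the "local" fallback value and the commented-out helper call are assumptions based on the thread, not a definitive pattern):

```python
# settings.py (sketch) -- read the environment from the KEDRO_ENV
# environment variable instead of trying to introspect the CLI flags.
# Caveat: `kedro run --env production` does NOT export KEDRO_ENV for you;
# you would set it yourself, e.g. `KEDRO_ENV=production kedro run`.
import os

# Fall back to the project's default environment ("local" here, an assumption)
# when the variable is unset.
env = os.environ.get("KEDRO_ENV", "local")

if env in ("production", "staging"):
    # change_catalog_filepaths() is the user's own helper from earlier in
    # the thread -- left commented out here as a placeholder.
    # change_catalog_filepaths()
    pass
```

Kedro itself reads KEDRO_ENV as a fallback when --env is not passed, so exporting the variable keeps the CLI and settings.py in agreement.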
Why do you want to actually change the filepaths? You could just parametrise them using globals. This way, you keep the same structure, you don’t need any additional code and you can just put the different options in a globals file in your production/local kedro environment folders.
Yeah after sleeping on this I am going to use the global variable solution, ty for the help
@Matthias Roels I'm revisiting this as I realized when I run this locally it's saving to C:/data/01_raw/ instead of ./data/01_raw/
is this expected or is there a way to make the globals.yaml dynamically resolve the absolute path of my kedro project?
If you do:
filepath: ${globals:filesystem}/${globals:bucket}/…
with filesystem either s3:// or ./data, and bucket the name of your S3 bucket or a subfolder of data of your choosing
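As a sketch, the per-environment globals files for that template might look like this (the bucket and folder names are invented examples):

```yaml
# conf/production/globals.yml -- invented bucket name
filesystem: s3://
bucket: my-company-data

# conf/local/globals.yml -- a relative path keeps data inside the project
# filesystem: ./data
# bucket: 01_intermediate
```

Note that with the `${globals:filesystem}/${globals:bucket}/…` template, a value ending in `://` produces a triple slash (`s3:///…`); depending on your filesystem library you may want to fold the separator into the template or the globals values instead.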