Hello everyone
This is regarding the kubeflow plugin.
I just wanted to get some information about how kedro nodes are executed by kubeflow.
Does kubeflow run each node in a separate container, or in separate pods, or are all of the nodes executed in the same container?
I added a feature to group nodes during the translation process, to aggregate some of them together in the same pod where it makes sense, but I don't remember if I ported it to the kubeflow plugin
Is the kubeflow plugin not maintained anymore?
Is it a bad choice to pick kubeflow for deploying kedro pipelines in 2024?
It is on life support for now, as in not actively maintained, but we try our best to update the versions of dependencies and work on it when we have more resources, which for now we don't have
We've kinda put it on lower priority since kubeflow as an ecosystem has been declining in popularity
if you are not limited in any way (don't have anything in place yet) then it would depend on the planned scale of your project
Still, among open-source and self-managed options, kubeflow and airflow with kubernetes seem like the best choices for kedro for now
Regarding node grouping - I haven't done it yet (https://github.com/getindata/kedro-kubeflow/issues/262), but its draft is present in other plugins, so I could maybe find some time to do it or guide someone else willing to put work into it
The main problem with this plugin and its maintenance for us is:
So we already have a self-managed kubeflow service running on AWS EKS, and we are thinking of using the kubeflow plugin to publish our kedro pipelines. I do have a couple of questions regarding running kedro pipelines on kubeflow
I can try to see what resources we have and maybe bump the priority in porting the grouping feature
And yeah, without it I am aware that it's a huge pain to have everything run in separate pods,
as it's at odds with a principle you'd like to keep in kedro of keeping nodes simple and atomic -
you don't want to download a docker image and spin up a pod just to add 2 numbers together or extract some params
So yes, I am also trying to save the outputs on s3. It works out of the box, versioning them as well, so whenever the next node fetches the same dataset it gets the latest version.
One quick question here -> In one of our use cases, multiple users will be creating pipeline runs from the kubeflow UI. I wanted to understand how to save the datasets on s3 so that each run uses its own generated intermediate files and not the ones generated by some other run. Think of it like we have launched parallel runs of a pipeline - r1, r2, r3 - how can we ensure they do not mix the intermediate files?
I am really not familiar with the kubeflow UI, or rather it has been so long since I last used it that I forget what it looks like,
but I'd be surprised if there are no options to pass any parameters or environment variables to the runs
Yeah, you can ignore the UI for now. But yeah, we can surely pass parameters from the kubeflow UI, not sure about env variables though.
you could use an env variable to set the user and use this value in paths that are generated in catalog
Got it. Will oc.env help me access these env variables in kedro, which will be set by kubeflow somehow?
so let me know what you'll figure out about how to pass params, as I said I'm rusty in this topic and would be happy to know as well
So basically I can use some env variable to have a different folder structure in s3, to make sure that nodes running for run r1 use their own files and don't interfere with other runs r2, r3.
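For example, a minimal sketch of that idea, testable with omegaconf alone - the variable name KEDRO_RUN_USER and the bucket are made up here, and the same interpolation would go into the filepath entries of your catalog:

import os
from omegaconf import OmegaConf

# Assume kubeflow (or the plugin) sets this on the pod; the variable name is an assumption.
os.environ.setdefault("KEDRO_RUN_USER", "r1")

# The same ${oc.env:...} interpolation could live in catalog.yml as a dataset filepath.
config = OmegaConf.create(
    {"filepath": "s3://my-bucket/intermediate/${oc.env:KEDRO_RUN_USER}/dataset.parquet"}
)
print(config.filepath)  # s3://my-bucket/intermediate/r1/dataset.parquet

I'm not 100% sure whether kedro's OmegaConfigLoader resolves oc.env outside of credentials by default, so that's worth double-checking; if not, a small custom resolver that reads os.environ does the same job.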
On this, someone in the channel mentioned that once we define the params in parameters.yml, they are reflected in the kubeflow UI and take the default values defined in the yaml. Users can edit them; I will test it and let you know for sure.
One thing that I want, maybe you can find some time to confirm: I am looking for a unique run_id for each run, as this will help me sort out many problems like -
Even if kubeflow generates a unique run_id, I am not sure if that will be passed as some env variables to our kedro pipeline. Like I am looking to somehow use that unique run_id in hooks and catalogs to achieve many things.
Here's an example of how you can do/test it with omegaconf alone (take only the generate_uuid function from this code):
import uuid
from omegaconf import OmegaConf

# Define a custom resolver to generate a random UUID
def generate_uuid():
    return str(uuid.uuid4())

# Register the resolver with OmegaConf
OmegaConf.register_new_resolver("uuid", generate_uuid)

# Example usage
config = OmegaConf.create({
    "id": "${uuid:}",
})

# Access the config to generate a random UUID
print(config.id)  # Each time config.id is accessed, it generates a new UUID

In kedro's settings you can add this:
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "random_uid": generate_uuid,
    }
}

and then enjoy it in your configs:
${random_uid:}
If you need to generate it once and then re-use the same value in the current session, then the simplest solution would be to add a cache decorator to the generate_uuid function, or just do the caching manually (sketch below)
but before doing that I'd make double sure that you can't use the kubeflow id, as it would be better to have them be consistent and common
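For the caching variant, a minimal sketch - note it only keeps the value stable within a single Python process, so with one pod per node each pod would still generate its own value, which is exactly why reusing the kubeflow id would be preferable:

import uuid
from functools import lru_cache

@lru_cache(maxsize=1)
def generate_uuid() -> str:
    # The first call generates the UUID; later calls in the same process return the same value.
    return str(uuid.uuid4())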
Suppose I want to access this random_uid in hooks - will ${random_uid:} work in a hooks implementation as well?
Like, can this custom resolver resolve ${random_uid:} used anywhere in the source code of the kedro project?
Can we use something like this in hooks.py?
from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings

# Instantiate an `OmegaConfigLoader` instance with the location of your project configuration.
project_path = Path.cwd()  # path to the Kedro project root
conf_path = str(project_path / settings.CONF_SOURCE)
conf_loader = OmegaConfigLoader(conf_source=conf_path)
this magic is for letting your config loading execute some python code at load time
Oh, but how do we persist this uuid across the kedro session? I might be using this uuid across nodes and in many other hooks
I am not 100% sure whether the config resolving happens in a separate process, in which case it would need some more care to keep it consistent, but in general it should be simple
The omegaconf resolver syntax is only resolved by omegaconf in config files (yamls) - in params, the data catalog and others. Hooks are python classes, not yaml files - so you should call the python function behind the config resolver directly
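So in a hook you'd call the function itself, roughly like this - the import path my_project.settings is an assumption, it's just wherever your generate_uuid function lives:

from kedro.framework.hooks import hook_impl

# hypothetical import - point it at the module where generate_uuid is defined
from my_project.settings import generate_uuid


class RunIdHooks:
    @hook_impl
    def before_pipeline_run(self, run_params):
        # call the python function directly instead of writing ${random_uid:} in python code
        run_uuid = generate_uuid()
        print(f"run uuid: {run_uuid}")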
Ok so you actually mean that we cannot use resolvers to put dynamic values in some python files
resolvers are meant to enable usage of said function in CONFIGS not in python files
I need to first understand when should we use resolvers and why do we really need it.
But don't we have some way of persisting variables or objects in the kedro session - something which we can generate in a before-pipeline-run hook and which can then be used in nodes.py and other hooks as well?
Caching is definitely one solution that you mentioned.
resolvers are a must if you want to have dynamic paths for your artifacts in data catalog
There is another option - using dataset factories - but they rely on namespaces, which should also be static, so yeah, resolvers are the only option to be dynamic
Yeah, that's something I learned recently.
Coming back to this: don't we have some way of persisting variables or objects in the kedro session, something we can generate in a before-pipeline-run hook and then use in nodes.py and other hooks as well?
Like some global config, a python dict kind of thing, which can be retrieved at any point in the entire kedro session
Technically you can do it, but that's a much uglier and more convoluted solution in my opinion
I mean you can add custom code to edit kedro session and add anything to it or dynamically overwrite read configs... but why do that when you have legal mechanisms to achieve it
and nothing stops you from making the resolver just reach for some set field in your common python config dict
you would need to populate that dict at import time, or in a hook that happens before the configs are loaded
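A rough sketch of that pattern, with made-up module and variable names:

# e.g. src/my_project/run_context.py (hypothetical module)
import os

RUN_CONTEXT: dict = {
    # populated at import time, e.g. from an env variable set by kubeflow (name assumed)
    "run_id": os.environ.get("KUBEFLOW_RUN_ID", "local"),
}


def current_run_id() -> str:
    # the custom resolver registered in settings.py just reaches into the shared dict
    return RUN_CONTEXT["run_id"]

Then the registration in settings.py stays the same, e.g. CONFIG_LOADER_ARGS = {"custom_resolvers": {"run_id": current_run_id}}, and ${run_id:} in the catalog resolves to the same value that your hooks and nodes see when they import RUN_CONTEXT.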
So I will summarise now -
Can you think of some other options as well which I can explore to utilise the kubeflow run_id?
idk how this kubeflow run_id is handled, but you perhaps might try to use kubeflow api to get the current run id or maybe it's available in some templating syntax to fill command params - I'm just speculating here, this would require some googling for me
One quick question: if kubeflow is able to pass the run_id through run params, which eventually means the params stored in parameters.yml as well - we can definitely retrieve these params in nodes, but can we also retrieve them in kedro hooks?
Yes, in kedro hooks you can run a hook at the step after the catalog is loaded, read it manually from the catalog/params, and then retrieve it at another hook point (see the sketch below)
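Something like this, as a sketch - the parameter name run_id is an assumption (whatever key ends up in parameters.yml or the runtime params), and it's worth double-checking the hook ordering for your kedro version:

from kedro.framework.hooks import hook_impl


class KubeflowRunIdHooks:
    def __init__(self):
        self._run_id = None

    @hook_impl
    def after_context_created(self, context):
        # params (including runtime params passed on the command line) are available here
        self._run_id = context.params.get("run_id")

    @hook_impl
    def before_node_run(self, node):
        # the value captured earlier can be reused at any later hook point
        print(f"node {node.name} running for run_id={self._run_id}")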
I'm not sure if this would work with data catalog templating at this moment, catalog is a bit special
You'd need to ask in #C03RKP2LW64 - can you access params or runtime params in catalog.yaml?
If you see, a kubeflow.yml file is generated when we do kedro kubeflow init.
A couple of questions here - does the upload_pipeline command always publish a new pipeline on kubeflow, or is there a way to simply publish a new version of an existing pipeline on kubeflow?
Also as a side note, if your main case is for different users to have their own versions, then you can use kedro envs for that instead of fiddling with dynamic configs and resolvers.
Can you please elaborate on this?
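For context on the kedro envs remark (not speaking for the maintainers): kedro configuration environments are extra folders under conf/ that override conf/base and are selected with kedro run --env <name>. A minimal sketch of running a specific environment programmatically - the env and folder names are assumptions:

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Hypothetical layout:
#   conf/base/catalog.yml      <- shared defaults
#   conf/user_a/catalog.yml    <- overrides only the s3 paths for user_a's runs
# Equivalent CLI: kedro run --env user_a

project_path = Path.cwd()
bootstrap_project(project_path)

with KedroSession.create(project_path=project_path, env="user_a") as session:
    session.run()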