Kubeflow execution of kedro nodes in containers or pods

Hello everyone

This is regarding the kubeflow plugin.

I wanted to just gain some information about how kedro nodes are executed by kubeflow.

Does kubeflow run each node in a separate container, or in separate pods, or are all of the nodes executed in the same container?


any idea on this ?

afaik the default is separate pods

I added a feature to group nodes during the translation process, to aggregate some of them together in the same pod where it makes sense, but I don't remember if I ported it to the kubeflow plugin

is the kubeflow plugin not maintained anymore?

Is it a bad choice to pick kubeflow for deploying a kedro pipeline in 2024?

it is on life support for now, as in not actively maintained, but we try our best to update the versions of dependencies and work on it when we have more resources, which for now we don't have

We've kinda put it on lower priority since kubeflow as an ecosystem has been declining in popularity

Would you recommend some other orchestrator for deploying our kedro pipelines ?

it really depends on your ecosystem/constraints

which cloud, on prem or not, what you already have in place

if you are not limited in any way (don't have anything in place yet) then it would depend on the planned scale of your project

still, among open-source and self-managed options, kubeflow and airflow with kubernetes seem like the best choices for kedro for now

regarding node grouping - I haven't ported it to the kubeflow plugin yet (https://github.com/getindata/kedro-kubeflow/issues/262), but its draft is present in other plugins, so I could maybe find some time to do it or guide someone else willing to put work into it

The main problem with this plugin and its maintenance for us is:

  • we're short on people and resources to actively maintain it right now
  • we've reclaimed the resources of our kubeflow cluster and need some time to recreate the environment to test changes properly; the plan is to do it in GitHub Actions, with a kubeflow cluster in minikube spun up on the fly
If someone else is willing to help keep it alive, they are welcome, and I can help with guidance as time allows

So we already have a self-managed kubeflow service running on AWS EKS, and we are thinking of using the kubeflow plugin to publish our kedro pipelines. I do have a couple of questions about running kedro pipelines on kubeflow:

  1. Do I need to have working knowledge of kubeflow to run kedro pipelines on kubeflow?
  2. Each of our nodes generates some output files; these are intermediate files that will be used by some other node in the pipeline. When I run these pipelines locally, I can either treat them as MemoryDatasets or save them in data/, and all the nodes can access these files easily. What is the suggested strategy for handling this when the pipelines are deployed on kubeflow, given that each node will not be running in the same pod?

  1. probably not but it helps to see the picture and know how to optimize the workflow
  2. the intermediate files need to be saved in buckets/cloud storage, or with the grouping feature you can group nodes together and keep the files as memory datasets

I can try to see what resources we have and maybe bump the priority in porting the grouping feature

And yeah, without it I'm aware it's a huge pain to have everything run in separate pods

as it's at odds with a principle you'd like to keep in kedro: keeping nodes simple and atomic

you don't want to download a docker image and spin up a pod just to add 2 numbers together or extract some params

so yes, I am also trying to save the outputs on s3; it works out of the box by versioning them as well, so whenever the next node fetches the same dataset it gets the latest version.

One quick question here: in one of our use cases, multiple users will be creating pipeline runs from the kubeflow UI. I wanted to understand how to save the files on s3 so that each run uses its own generated intermediate files and not the ones generated by some other run. Think of it like we have launched parallel runs of a pipeline - r1, r2, r3 - how can we ensure they do not mix their intermediate files?

I'm really not familiar with the kubeflow UI, or it's been so long since I last used it that I forget what it looks like

but I'd be surprised if there are no options to pass any parameters or environment variables to the runs

Yeah, you can ignore the UI for now. But yeah, we can surely pass parameters from the kubeflow UI; not sure about env variables though.

you could use an env variable to set the user and use this value in paths that are generated in catalog

I'd need to confirm if this is parametrizable in kedro

got it - will oc.env help me access these env variables in kedro, which will somehow be set by kubeflow?

so let me know what you'll figure out about how to pass params, as I said I'm rusty in this topic and would be happy to know as well

So basically I can use some env variable to have a different folder structure in s3, to make sure that nodes running for run r1 use their own files and don't interfere with other runs r2, r3.
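For illustration, one way to wire that up is a custom resolver that reads the env variable and is used in catalog paths - a rough sketch, untested, where KEDRO_RUN_ID and the bucket path are just placeholders (iirc the built-in oc.env resolver may be limited to credentials depending on the kedro version, so a custom resolver is the safer bet):

import os

# Hypothetical resolver: read a per-run id from an env variable set on the pod,
# falling back to "local" for local runs.
def run_id_from_env() -> str:
    return os.environ.get("KEDRO_RUN_ID", "local")

# in settings.py, expose it to the config loader:
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "run_id": run_id_from_env,
    }
}

# then a catalog entry could use it in its path, e.g.:
#   filepath: s3://my-bucket/${run_id:}/intermediate/features.parquet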

so let me know what you'll figure out about how to pass params, as I said I'm rusty in this topic and would be happy to know as well
on this, someone in the channel mentioned that once we define the params in parameters.yml, they show up in the kubeflow UI with the default values defined in the yaml, and users can edit them. I will test it and let you know for sure.


One thing that I want, maybe you can find some time to confirm: I am looking for a unique run_id for each run, as this will help me solve many problems, like:

  1. The one we just discussed: I can use this unique run_id in kedro to have different S3 folders for each run and store the intermediate files there.
  2. This unique run_id will be used to track various metrics for that run; we might dump metrics like the time taken by each run, whether it was a success or not, etc., keyed to the unique run_id.

Even if kubeflow generates a unique run_id, I am not sure whether it will be passed as an env variable to our kedro pipeline. I am looking to somehow use that unique run_id in hooks and catalogs to achieve many things.

you can always generate your own

Here's an example of how you can do/test it with omegaconf alone (take only the generate_uuid function from this code):

import uuid
from omegaconf import OmegaConf

# Define a custom resolver to generate a random UUID
def generate_uuid():
    return str(uuid.uuid4())

# Register the resolver with OmegaConf
OmegaConf.register_new_resolver("uuid", generate_uuid)

# Example usage
config = OmegaConf.create({
    "id": "${uuid:}",
})

# Access the config to generate a random UUID
print(config.id)  # Each time config.id is accessed, it generates a new UUID
in kedro's settings.py you can add this:

CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "random_uid": generate_uuid,
    }
}

and then enjoy it in configs:

${random_uid:}

if you need to generate it once and then re-use the same value in the current session, the simplest solution would be to add a cache decorator to the generate_uuid function, or just do the caching manually

but before doing that I'd make double sure that you can't use the kubeflow id, as it would be better to have them be consistent and common

Suppose I want to access this random_uid in hooks - will ${random_uid:} work in a hooks implementation as well?

Like, can this custom resolver resolve ${random_uid:} used anywhere in the source code of the kedro project?

in hooks no, but you can call the function directly

can we use something like this in hooks.py?

from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings

# Instantiate an `OmegaConfigLoader` with the location of your project configuration.
project_path = Path.cwd()  # or however you obtain the project root
conf_path = str(project_path / settings.CONF_SOURCE)
conf_loader = OmegaConfigLoader(conf_source=conf_path)

eh why would you do that

when you can just call generate_uuid()

this magic is for letting your config loading execute some python code at load time

why would you want to go through config magic when you're already running python code

oh, but how do we persist this uuid across the kedro session? I might be using this uuid across nodes and in many other hooks

@cached decorator could work I think

I am not 100% sure whether the config resolving happens in a separate process - in that case it would need some more care to keep it consistent - but in general that should be the simplest approach
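As a minimal sketch of the caching idea (using functools; note the value is cached per python process, so separately spawned pods would still each generate their own):

import uuid
from functools import lru_cache

@lru_cache(maxsize=1)
def generate_uuid() -> str:
    # Generated once per python process; later calls (from a config resolver
    # or directly from a hook) return the same cached value.
    return str(uuid.uuid4())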

just wanted to understand why custom config resolvers do not work in hooks

Can't we override which files the custom resolver does its magic in?

and by hooks do you mean kedro hooks?

the omegaconf resolver syntax is only resolved by omegaconf in config files (yamls) - in params, the data catalog and others. Hooks are python classes, not yaml files - so you should call the python function behind the config resolver directly

Ok so you actually mean that we cannot use resolvers to put dynamic values in some python files

in python files you just use a function

resolvers are meant to enable usage of said function in CONFIGS not in python files

and they use the same function underneath
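so in hooks.py it would just be a plain function call, something like this (rough sketch; the import path and class name are placeholders):

from kedro.framework.hooks import hook_impl

from my_project.settings import generate_uuid  # wherever you defined the function


class RunIdHooks:
    @hook_impl
    def before_pipeline_run(self, run_params):
        # no resolver syntax here - just call the same python function directly
        self.run_id = generate_uuid()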

I first need to understand when we should use resolvers and why we really need them.

But don't we have some way of persisting variables or objects in the kedro session - something we can generate in a before_pipeline_run hook and then use in nodes.py and other hooks as well?

Caching is definitely one solution that you mentioned.

I don't think I can explain it any clearer πŸ˜„

resolvers are a must if you want to have dynamic paths for your artifacts in data catalog

there is another option using dataset factories, but they rely on namespaces, which should also be static, so yeah, resolvers are the only option for being dynamic

resolvers are a must if you want to have dynamic paths for your artifacts in data catalog
Yeah that's something I learned recently.

But don't we have some way of persisting variables or objects in the kedro session - something we can generate in a before_pipeline_run hook and then use in nodes.py and other hooks as well?

Like some global config, a python dict kind of thing, which can be retrieved at any point in the entire kedro session

Technically you can do it, but in my opinion that's a much uglier and more convoluted solution

I mean, you can add custom code to edit the kedro session and add anything to it, or dynamically overwrite the configs that were read... but why do that when you have legal mechanisms to achieve it

and nothing stops you from making the resolver just reach for some set field in your common python config dict

but you need to be aware of the order of events happening in kedro

and reading & resolving configs is pretty early on

you would need to populate that dict at import time or in a hook that happens before the configs are loaded
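roughly what I mean, as a sketch (module and field names are made up):

# run_context.py - a module-level dict shared within the python process
RUN_CONTEXT = {"run_id": None}

def resolve_run_context(key: str):
    return RUN_CONTEXT[key]

# settings.py - expose the dict to configs through a resolver
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "run_context": resolve_run_context,
    }
}

# populate RUN_CONTEXT at import time (e.g. at the top of settings.py) or in a
# hook that fires before the configs are loaded; then ${run_context:run_id} in
# a yaml file resolves to whatever you put in the dict.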

So I will summarise now:

  1. First of all, I need to look into how we can utilise the run_id generated by kubeflow in the kedro pipeline. But if I want to use it across the entire session - in all the hooks and all nodes - this kubeflow run_id should be set as an env variable.
  2. If the first approach is not viable, we can generate our own unique ids in the kedro session, as we just discussed.

correct me if I made some mistake

  1. env variable is just the easiest way to communicate, there might be other options too - yes
  2. yes

can you think of some other options as well that I could explore to utilise the kubeflow run_id?

idk how this kubeflow run_id is handled, but you perhaps might try to use kubeflow api to get the current run id or maybe it's available in some templating syntax to fill command params - I'm just speculating here, this would require some googling for me

I see, interesting 😊

One quick question: if kubeflow is able to pass the run_id through run params, that eventually means the params stored in parameters.yml as well. So we can definitely retrieve these params in nodes, but can we also retrieve them in kedro hooks?

yes, in kedro hooks you can run a hook at the step after the catalog is loaded, read it manually from the catalog/params, and then retrieve it at another hook point
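something along these lines, untested - I'm assuming the runtime params land in run_params["extra_params"], and "run_id" is just a placeholder name:

from kedro.framework.hooks import hook_impl


class RuntimeParamsHooks:
    def __init__(self):
        self.run_id = None

    @hook_impl
    def before_pipeline_run(self, run_params):
        # read the runtime params once, early in the run
        self.run_id = (run_params.get("extra_params") or {}).get("run_id")

    @hook_impl
    def after_pipeline_run(self, run_params):
        # the value captured earlier is still available at later hook points
        print(f"run {self.run_id} finished")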

I'm not sure if this would work with data catalog templating at this moment, catalog is a bit special

you need to ask in #C03RKP2LW64 - can you access params or runtime params in catalog.yaml ?

sure, I can ask this question there.

if you look, a kubeflow.yml is generated when we do a kedro kubeflow init.

Couple of questions here -

  1. Is this configuration used only when publishing/uploading a pipeline, and if we make changes to this config will we have to run the upload_pipeline command again?
  2. Does the upload_pipeline command always publish a new pipeline on kubeflow, or is there a way to simply publish a new version of an existing pipeline?
  3. Can we reconfigure these settings for different runs from the kubeflow UI once the pipeline is published? Because if not, someone will always have to re-run the upload_pipeline command.

I'll reply tomorrow, I've got to quit for today

Sure, please carry on.

Attaching the kubeflow UI for a published kedro pipeline

[attachment: image.png]

  1. yes - it's local state for the plugin to know how to translate the pipeline to kubeflow
  2. there should be a way to overwrite it, not sure
  3. afaik once you want a new version of the pipeline you need to re-run the translation process.
I can see there is a run parameters section in the kubeflow UI; I'll add a ticket to investigate using it for more flexible parametrization of existing pipelines.
Also as a side note, if your main use case is for different users to have their own versions, then you can use kedro envs for that instead of fiddling with dynamic configs and resolvers.

can you maybe take a look and confirm my answers?

Also as a side note, if your main use case is for different users to have their own versions, then you can use kedro envs for that instead of fiddling with dynamic configs and resolvers.
Can you please elaborate on this?
