
Updated 4 weeks ago

Seeking Help with Kedro Vertex AI Plugin Async Node Runs

Hey guys, I'd like to know if anyone has tested the latest version of the Kedro Vertex AI plugin. I'm having some issues with async node runs: for some reason they take a lot longer than when run locally. It might be because I'm allocating a GPU to part of the process, but in my view that shouldn't matter, so if anyone has any ideas or suggestions, I'd appreciate it...

12 comments

Hi, you may already be aware of the factors below, but I'm posting them here for reference (with help from GPT). I hope the community will pitch in, as I don't have experience with Vertex AI. Thank you.

Resource Provisioning Time: Vertex AI may take time to provision GPU resources, which could slow down node initialization. Check if GPU resources are immediately available or if they're being dynamically allocated.
Network Latency: When you're running async processes in the cloud, network I/O between nodes or external systems (such as GCS or BigQuery) can add delays compared to local runs.
GPU Utilization: Ensure that the GPU is being effectively utilized. If a part of your process doesn't benefit from GPU acceleration, it could be causing the slowdown. You can monitor GPU usage with Vertex AI monitoring tools to verify this.
Vertex AI Settings: Check if you're using the right machine type with enough CPU and RAM to complement the GPU. Sometimes, misconfiguration in machine types can lead to underutilization of GPU resources.
Async Behavior: If nodes are running asynchronously, make sure that there are no dependency bottlenecks between them. Sometimes, a node might wait for resources or inputs to be ready before starting, which could contribute to delays.
Containerization Overheads: If you're using Docker containers, they might introduce additional overhead, especially if large dependencies or data files are being loaded into memory on start.
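To narrow down which of these factors dominates, one option is to time each node individually before the pipeline is converted to Vertex AI. A minimal sketch using only the standard library (nothing Vertex AI- or Kedro-specific; `simple_query` is a hypothetical stand-in for the real call procedure):

```python
import functools
import time

def timed(fn):
    """Wrap a node function and print how long it takes to run."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{fn.__name__} took {elapsed:.2f}s")
        return result
    return wrapper

@timed
def simple_query():
    # Stand-in for the real call procedure; replace with the actual node body.
    time.sleep(0.1)
    return "done"

simple_query()
```

Comparing these per-function timings between a local run and a Vertex AI run should show whether the slowdown is in your own code or in provisioning/startup overhead around it.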

Hi, how are you?

So, my scenario is the following: I'm not actually using the vertexai plugin; I'll try to update my code to use it. I use Cloud Build to build images, and a kfp.py file that converts my pipelines and nodes from Kedro to Vertex AI. For some reason a simple call procedure takes 20 minutes to execute, while it should take 5 minutes at most, since it's a simple query. GPUs are allocated only to specific nodes, the ones that run a model, but sometimes the node execution time (end time minus creation time) is the same for all nodes, even though it should be lower. I'd like to know more about the KFP conversion done by the plugin, to see if it's possible to optimize these things.
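For reference, the kedro-vertexai plugin (the GetInData one) replaces a hand-written kfp.py with CLI commands along these lines; command names are taken from the plugin's docs, so double-check them against the version you install:

```shell
# Install the plugin and generate its config file (vertexai.yml)
pip install kedro-vertexai
kedro vertexai init

# Compile the Kedro pipeline to a Vertex AI pipeline spec
kedro vertexai compile

# Submit a one-off run to Vertex AI
kedro vertexai run-once
```

The plugin's vertexai.yml config also lets you declare resources (CPU, memory, GPU) per node rather than in code, which is relevant to only giving the GPU to the model-training nodes; see the plugin docs for the exact field names.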

Hi, I'm doing well, I hope you are too. Thanks for being patient. I think there might be many reasons why node execution takes time. I found that async node runs require all the datasets being used to be thread-safe. I'm not sure whether race conditions or the GIL are delaying the node runs; I have no experience with this.
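On the thread-safety point: when nodes run asynchronously, two nodes may load or save through the same dataset object concurrently. A minimal illustration of serializing access with a lock (pure Python, not the Kedro dataset API; `LockedDataset` is a toy class for this sketch):

```python
import threading

class LockedDataset:
    """Toy in-memory dataset whose save/load are serialized by a lock.

    Illustrative only; real Kedro datasets have their own interface.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._data = None

    def save(self, data):
        with self._lock:
            self._data = data

    def load(self):
        with self._lock:
            return self._data

ds = LockedDataset()
# Eight "nodes" saving concurrently; the lock prevents interleaved writes.
threads = [threading.Thread(target=ds.save, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(ds.load())  # one of the saved values, 0 through 7
```

A dataset without such guarding can corrupt state or force implicit serialization, either of which would change timings between local and cloud runs.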

"For some reason the time of execution of a simple call procedure is taking 20 minutes, while it should only take 5 minutes tops, since it's a simple query"
Did you try this before? Since you said it should take 5 minutes, I wonder if there was a filter condition or a change in the dataset leading to longer execution times.

The odd part is that when I run the pipeline on Vertex AI without the nodes that use the GPU, the execution time is much lower: a simple call procedure that used to take 20 minutes ran in 6 minutes without those nodes included.

My way of converting Kedro into Vertex AI pipelines is a bit old compared to the plugin, but that shouldn't be the reason, since it works flawlessly in other scenarios.

Also, which runner are you using?

My actual problem is with GCP, not Kedro. That's why I want to know if there's a better way to implement Kedro on Vertex AI in a way that lets me control how the nodes run, to avoid using too many resources.

So, Kedro has different runners for running your pipeline. I was curious which runner you were using: https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_pipeline.html#run-a-pipeline

Ohh, that one. OK, we use the SequentialRunner.
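For context, SequentialRunner executes nodes one at a time, while Kedro also ships ThreadRunner and ParallelRunner that overlap independent nodes. The difference can be sketched with the standard library alone (no Kedro imports; `fake_node` is a hypothetical stand-in for an I/O-bound node such as a BigQuery call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_node(name):
    """Stand-in for an I/O-bound node, e.g. a simple query."""
    time.sleep(0.2)
    return name

nodes = ["query_a", "query_b", "query_c"]

# Sequential execution: total time is roughly the sum of node times.
start = time.perf_counter()
sequential = [fake_node(n) for n in nodes]
t_seq = time.perf_counter() - start

# Threaded execution: independent I/O-bound nodes overlap,
# analogous to what ThreadRunner does for a Kedro pipeline.
start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    threaded = list(pool.map(fake_node, nodes))
t_thr = time.perf_counter() - start

print(f"sequential: {t_seq:.2f}s, threaded: {t_thr:.2f}s")
```

If your slow nodes are mostly waiting on external services rather than computing, trying `kedro run --runner=ThreadRunner` locally (and the equivalent runner setting in your Vertex AI conversion) may be worth testing; for CPU-bound work, ParallelRunner is usually the relevant option instead.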
