

Kedro pipeline deployment using VertexAI SDK with API endpoint triggering

Hey Kedroids! :kedro:

(Apologies in advance for the long message, but I would really appreciate a good discussion on the points below from the kedro community! πŸ™‚)

I have a use case of deploying kedro pipelines using the Vertex AI SDK.

  1. In the production system (web app), I want to be able to trigger a kedro pipeline (or multiple pipelines) with specified parameters (say, from the UI).
  2. Let's say we have an API endpoint https://my.web.app/api/v1/some-task whose request body includes the parameters to trigger one or multiple kedro pipelines as a Vertex AI DAG (roughly as in the sketch below).
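Roughly, what I imagine for the endpoint handler is the sketch below (untested; FastAPI, the GCS bucket and the compiled pipeline.json template are placeholders, not something I have set up yet):

# Hypothetical endpoint handler: receives params and launches the compiled Vertex AI pipeline.
from fastapi import FastAPI
from google.cloud import aiplatform

app = FastAPI()

@app.post("/api/v1/some-task")
def trigger_pipeline(params: dict):
    aiplatform.init(project="my-gcp-project", location="europe-west1")
    job = aiplatform.PipelineJob(
        display_name="kedro-vertex-run",
        template_path="gs://my-bucket/pipelines/pipeline.json",  # compiled kfp pipeline spec
        parameter_values=params,  # e.g. {"param_key1": "value1", "param_key2": 2.0}
        enable_caching=False,
    )
    job.submit()  # non-blocking; job.run() would block until the DAG finishes
    return {"pipeline_job": job.resource_name}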

My Vertex AI DAG has a combination of nodes (steps), and each node:

  1. May or may not be a kedro pipeline
  2. May be a pyspark workload running on dataproc or non spark workload running on a single compute VM
  3. May run a bigquery job
  4. May or may not run in a docker container

Let's take the example of running a kedro pipeline on Dataproc Serverless in a custom Docker container, submitted via the Vertex AI SDK.
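To make the DAG shape concrete, I picture it being compiled from a kfp pipeline roughly like this (minimal, untested sketch; the BigQuery step stands in for one of the non-kedro nodes, and all names/URIs are placeholders):

# Minimal kfp v2 sketch of the Vertex AI DAG; compiles to the pipeline.json referenced above.
from kfp import compiler, dsl
from google_cloud_pipeline_components.v1.bigquery import BigqueryQueryJobOp

@dsl.pipeline(name="my-vertex-dag")
def my_vertex_dag(param_key1: str = "value1", param_key2: float = 2.0):
    # Placeholder for one of the non-kedro nodes (a BigQuery job)
    bq_step = BigqueryQueryJobOp(
        project="my-gcp-project",
        location="EU",
        query="SELECT 1",
    )
    # ...the kedro-on-Dataproc-Serverless step would go here, e.g. chained with .after(bq_step)...

compiler.Compiler().compile(my_vertex_dag, package_path="pipeline.json")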

Questions:

  1. Do you package the kedro code as part of the Docker container or just the dependencies?

For example, I have seen this done a lot, where the kedro code is packaged as well:

RUN mkdir /usr/kedro
WORKDIR /usr/kedro/
COPY . .

which means copying the whole project, and then, in src/entrypoint.py:

import os

from kedro.framework import cli

# Run from the project root so kedro can find pyproject.toml and conf/
os.chdir("/usr/kedro")
cli.main()

2. Do I need to package my kedro project as a wheel file and submit it with the job to Dataproc? If so, how have you seen that done with DataprocPySparkBatchOp?

3. How do you recommend passing dynamic parameters to the kedro pipeline run?

As I understand it, cli.main() picks up sys.argv to infer the pipeline name and parameters, so one could do:

kedro run --pipeline <my_pipeline> --params=param_key1=value1,param_key2=2.0

Is there a better recommended way of doing this?

Thanks a lot, and hoping for a good discussion! πŸ™‚


Hi Abhishek, based on my understanding, here are some suggestions:

  1. If you want the full project in Docker, mirroring your local dev setup, COPY . . should work fine.
  2. The cleaner approach, if you want to minimize the size of the Docker image and separate code from infra, is to package the project as a .whl.
  3. I am not aware of DataprocPySparkBatchOp, but based on my search you can package your kedro project as a .whl file and submit it to Dataproc.
  4. You can pass dynamic params via the CLI as you mentioned, which works well; you can also pass them programmatically (e.g. from the Vertex AI job entrypoint) by calling the packaged project's main():

# my_project is the packaged kedro project; main() forwards these args to the run command
from my_project.__main__ import main

main([
    "--pipeline", "<my_pipeline>",
    "--params", "param_key1=value1,param_key2=2.0",
])
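For point 3, I have not run this myself, but based on the google-cloud-pipeline-components docs the Dataproc Serverless step inside the kfp pipeline could look roughly like the sketch below (argument names, image and GCS paths are my assumptions, please double-check):

from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp

@dsl.pipeline(name="kedro-on-dataproc")
def kedro_on_dataproc():
    # The custom image already has kedro and the project's .whl installed;
    # entrypoint.py on GCS just forwards its CLI args to the kedro run command.
    DataprocPySparkBatchOp(
        project="my-gcp-project",
        location="europe-west1",
        batch_id="kedro-run-001",  # must be unique per submission
        container_image="europe-docker.pkg.dev/my-gcp-project/my-repo/kedro-spark:latest",
        main_python_file_uri="gs://my-bucket/entrypoint.py",
        args=["--pipeline", "my_pipeline", "--params", "param_key1=value1,param_key2=2.0"],
    )

Baking the .whl into the image sidesteps shipping it separately with the batch job, so only the thin entrypoint script needs to live on GCS.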
I would also wait for the community to respond, in case someone has tried this and has any recommendations.
Thank you

Thanks for the response πŸ™‚

On packaging as a .whl file and submitting it to the Dataproc cluster, my main question is: if I do not include the kedro project folder in the Dockerfile, how would kedro find the conf folder?

I guess the kedro project folder structure has to be brought in somehow in order to execute the project (either by cloning it or by packaging it with the Docker image).
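(From what I can see in the docs, kedro package also produces a conf-<package_name>.tar.gz next to the .whl, so maybe the entrypoint could point the packaged project at that archive; an untested sketch, with made-up paths:)

# Hypothetical entrypoint for the packaged project: the conf archive produced by
# `kedro package` is copied into the image, and --conf-source points at it.
from my_project.__main__ import main

main([
    "--pipeline", "my_pipeline",
    "--conf-source", "/opt/kedro/conf-my_project.tar.gz",
    "--params", "param_key1=value1,param_key2=2.0",
])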

And thanks for the 4th point πŸ™‚

Also, hopefully this discussion helps guide us towards creating a kedro GCP development/deployment guide. I can contribute too! πŸ™‚
