Updated 6 days ago

Deploying Kedro Pipelines on GCP with Dataproc and Cloud Composer for Automated Execution

ISSUE: Deployment of Kedro Pipelines on GCP with Dataproc and Cloud Composer

Description
I am conducting a POC with Kedro in a GCP environment and need help deploying my Kedro project in a GCP-compatible format. The goal is to package the Kedro project for execution on Cloud Dataproc clusters.

The intended workflow is as follows:

  1. Create a separate Dataproc cluster for each Kedro pipeline.
  2. Execute the pipelines on their respective Dataproc clusters.
  3. Use Cloud Composer (Airflow) to orchestrate the process.
  4. Store the data in GCS buckets.
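
As a rough illustration of the workflow above, here is a minimal sketch that builds one plan per Kedro pipeline: a dedicated cluster name plus a Dataproc PySpark job definition. It deliberately avoids importing Airflow so it runs anywhere. The project ID, region, bucket, pipeline names, and the `main.py` entrypoint are all hypothetical placeholders, and the sketch assumes the packaged Kedro project is already installed on the cluster by some other means. In a Cloud Composer DAG, dictionaries like these would typically be handed to `DataprocCreateClusterOperator`, `DataprocSubmitJobOperator`, and `DataprocDeleteClusterOperator` from the Airflow Google provider.

```python
from typing import Any, Dict, List


def build_pipeline_plan(
    project_id: str,
    region: str,
    pipeline: str,
    bucket: str,
    entrypoint: str = "main.py",  # hypothetical entrypoint script uploaded to GCS
) -> Dict[str, Any]:
    """Describe one Kedro pipeline run on its own Dataproc cluster."""
    # Dataproc cluster names allow lowercase letters, digits, and hyphens,
    # so underscores in the Kedro pipeline name are replaced.
    cluster_name = f"kedro-{pipeline.lower().replace('_', '-')}"

    # Shape of a Dataproc PySpark job definition. Assumption: the Kedro
    # package is already available on the cluster.
    job = {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": f"gs://{bucket}/{entrypoint}",
            "args": ["--pipeline", pipeline],
        },
    }

    return {
        "cluster_name": cluster_name,
        "region": region,
        "job": job,
        # Order in which an orchestrator would run the tasks.
        "steps": ["create_cluster", "submit_job", "delete_cluster"],
    }


# Hypothetical values, for illustration only.
PIPELINES: List[str] = ["data_processing", "data_science"]
plans = [
    build_pipeline_plan("my-gcp-project", "europe-west1", name, "my-kedro-bucket")
    for name in PIPELINES
]

for plan in plans:
    print(plan["cluster_name"], "->", " > ".join(plan["steps"]))
```

Keeping the plan as plain dictionaries means the same structure can be unit-tested locally and later wired into whichever orchestrator is used.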

I have not found clear documentation or guidelines on how to structure or deploy Kedro projects for this specific setup. Any guidance or resources to achieve this would be greatly appreciated.

Requirements
  • Package the Kedro project for GCP compatibility.
  • Deploy and run pipelines on Dataproc clusters.
  • Orchestrate pipeline execution using Cloud Composer.
  • Use GCS as the storage location for data.

This request is urgent, as it is critical to the success of the POC and the subsequent project deployment.

2 comments

Unfortunately, I don't have any recent experience on GCP to be able to answer this question. Perhaps https://linen-slack.kedro.org/t/23168580/hey-kedroids-kedro-apologies-in-advance-for-the-long-message#614d3471-83e8-4a83-a29d-d40fa931b1a0 may help? (The approach seems pretty standard; you can package Kedro projects to create the wheel the standard way, and submit that to Dataproc.)
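
For reference, a minimal sketch of what "package the project and submit it" might look like on the Python side. `kedro package` builds a wheel for the project, and a small entrypoint script can then call the packaged project's `main` function with the same options as `kedro run`. The package name `my_kedro_project` is hypothetical, and this assumes the wheel is installed in the environment where the Dataproc job runs; the exact behaviour of `main` can vary between Kedro versions.

```python
import importlib
from typing import List, Optional

PACKAGE_NAME = "my_kedro_project"  # hypothetical package name


def build_kedro_args(
    pipeline: str,
    env: Optional[str] = None,
    conf_source: Optional[str] = None,
) -> List[str]:
    """Translate job parameters into `kedro run`-style CLI arguments."""
    args = ["--pipeline", pipeline]
    if env:
        args += ["--env", env]
    if conf_source:
        # Location of the configuration folder or archive, if it is
        # shipped separately from the wheel.
        args += ["--conf-source", conf_source]
    return args


def run(argv: List[str]) -> None:
    """Import the packaged project lazily and run it.

    Assumes the wheel built by `kedro package` is installed, so that
    `<package>.__main__.main` is importable.
    """
    module = importlib.import_module(f"{PACKAGE_NAME}.__main__")
    module.main(argv)


# Only the argument list is built here; calling run() needs the wheel installed.
print(build_kedro_args("data_processing", env="prod"))
```

In an actual entrypoint script uploaded to GCS, `run(sys.argv[1:])` would be called under an `if __name__ == "__main__":` guard so the arguments passed to the Dataproc job are forwarded to Kedro.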

Hey @Mohamed El Guendouz, I can help with this request, as I have done this extensively across GCP Dataproc Serverless, Compute Engine, and Airflow (Cloud Composer).

I am contributing a GCP Dataproc deployment guide to Kedro's official docs here: https://github.com/kedro-org/kedro/pull/4393 (currently in draft). I can also talk about a lot more than the guide covers, e.g.:

  • Dataproc compute engine
  • Dataproc provisioning
  • CI/CD and DEV/PROD workflows (if that environment tiering pattern applies to you)
  • Dataproc experimentation practices for Data Scientists
  • GCP IAM practices
  • Integrating storage and compute services such as GCS and BigQuery with Dataproc
  • Common Dataproc errors / gotchas

Initially the guide is limited to Dataproc Serverless, but I will add more contributions if this one gets incorporated.

Please have a look and let me know if you have any questions 🙂

Also, to the Kedro maintainers: I'd appreciate you taking a look at the PR.

CC: @Ravi Kumar Pilla (as I mentioned in the attached thread, I will be contributing a guide on Dataproc)
