Mohamed El Guendouz
Joined October 2, 2024

ISSUE: Deployment of Kedro Pipelines on GCP with Dataproc and Cloud Composer

Description
I am conducting a POC with Kedro in a GCP environment and need assistance deploying my Kedro project in a GCP-compatible format. The goal is to package the Kedro project for execution on Cloud Dataproc clusters.

The intended workflow is as follows:

  1. Create a separate Dataproc cluster for each Kedro pipeline.
  2. Execute the pipelines on their respective Dataproc clusters.
  3. Use Cloud Composer (Airflow) to orchestrate the process.
  4. Store data in GCS buckets.

I have not found clear documentation or guidelines on how to structure or deploy Kedro projects for this specific setup. Any guidance or resources to achieve this would be greatly appreciated.
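
For context, my current plan for packaging is to build a wheel with `kedro package`, install it on each cluster (e.g. via a pip initialization action), and submit a thin PySpark entrypoint to Dataproc that invokes the packaged project. A minimal sketch of that entrypoint follows; the project name is a placeholder, and I'm assuming the packaged conf/ is made available to the job separately:

# entrypoint.py -- hypothetical PySpark driver submitted to Dataproc.
# Assumes the wheel built by `kedro package` (e.g. my_project-0.1-py3-none-any.whl)
# is already installed on the cluster.
import sys

# `kedro package` generates this entrypoint inside the packaged project
from my_project.__main__ import main

if __name__ == "__main__":
    # Forward the Dataproc job arguments, e.g. --pipeline=<name> --env=<env>
    main(sys.argv[1:])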

Requirements
  • Package Kedro project for GCP compatibility.
  • Deploy and run pipelines on Dataproc clusters.
  • Orchestrate pipeline execution using Cloud Composer (see the DAG sketch below).
  • Use GCS as the storage location for data.
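
To make the orchestration requirement concrete, here is a minimal sketch of the kind of Composer DAG I have in mind, using the Dataproc operators from the apache-airflow-providers-google package. The project ID, region, bucket, machine types, and pipeline name are all placeholders:

# dags/kedro_data_processing.py -- sketch: one cluster per pipeline,
# create -> run -> tear down, orchestrated by Composer.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "my-gcp-project"            # placeholder
REGION = "europe-west1"                  # placeholder
BUCKET = "gs://my-bucket"                # placeholder
CLUSTER_NAME = "kedro-data-processing"   # placeholder

with DAG(
    dag_id="kedro_data_processing",
    start_date=datetime(2024, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )

    run_pipeline = DataprocSubmitJobOperator(
        task_id="run_pipeline",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {
                "main_python_file_uri": f"{BUCKET}/entrypoint.py",
                "args": ["--pipeline=data_processing"],
            },
        },
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",  # tear down even if the run fails
    )

    create_cluster >> run_pipeline >> delete_cluster

The idea would be to repeat (or template) this pattern per pipeline so that each pipeline gets its own cluster.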

This request is urgent, as it is critical to the success of the POC and the subsequent project deployment.

2 comments

Hello 🙂
When generating Airflow DAGs for a Kedro project with the kedro-airflow plugin, is it possible to create a separate DAG for each pipeline in the project rather than a single DAG per project? If so, how can I configure each DAG individually, for example to set the start date, schedule, and other parameters for the pipeline it corresponds to?
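
For what it's worth, my current understanding is that `kedro airflow create` generates one DAG per invocation, so running it once per pipeline might be one way to do this, with a custom Jinja2 template to set per-DAG parameters. Pipeline names and the template filename below are placeholders:

# one DAG per pipeline, written to the Composer dags/ folder
kedro airflow create --pipeline=ingestion --target-dir=dags/
kedro airflow create --pipeline=export --target-dir=dags/

# supply a custom Jinja2 template to control start_date, schedule, etc.
kedro airflow create --pipeline=ingestion --jinja-file=my_dag_template.j2

Is that the intended approach, or is there a more direct way?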

2 comments

Hello,
I would like to work with Delta tables in a GCS bucket using PySpark, but I'm having trouble with spark.DeltaTableDataset:

table_name:
  type: spark.DeltaTableDataset
  filepath: gs://XXXX/poc-kedro/table_name/*.parquet
Could you tell me what might be wrong with this?
Additionally, could you explain how to specify the credentials for accessing the table with this Dataset?
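
In case it helps, this is the setup I tried to move towards after reading the Delta Lake and GCS connector docs: my understanding is that the dataset expects the Delta table's root directory rather than a glob over parquet files, and that GCS credentials are configured on the Spark session (via the GCS connector) rather than on the dataset itself. The keyfile path and package versions below are assumptions, and I'm not sure this is the intended usage:

# catalog.yml -- point filepath at the table's root directory, not the files
table_name:
  type: spark.DeltaTableDataset
  filepath: gs://XXXX/poc-kedro/table_name

# conf/base/spark.yml -- Spark session configuration, including GCS access
spark.jars.packages: io.delta:delta-core_2.12:2.4.0,com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.17
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.google.cloud.auth.service.account.enable: "true"
spark.hadoop.google.cloud.auth.service.account.json.keyfile: /path/to/service-account-key.json

Does that match the expected usage?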

24 comments

Hi everyone,

I’m a Data Engineer, and my team is working on multiple pipelines, each addressing a different use case (one use case = one pipeline). We have both ingestion pipelines and export pipelines that deliver data to various clients.
We’re considering grouping certain nodes into a common library to be shared across these pipelines. I wanted to ask if this is considered a good practice within the Kedro framework. If so, could you recommend an approach or best practices for implementing this?
Additionally, do you have any recommendations for structuring a Kedro project when working with multiple pipelines like this?
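
To make the idea concrete, here is roughly the kind of reuse we have in mind; the package, node, and dataset names below are made up:

# src/common_nodes/cleaning.py -- hypothetical shared library of node functions
import pandas as pd


def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """A typical reusable cleaning step shared by several pipelines."""
    return df.drop_duplicates()

# src/my_project/pipelines/client_export/pipeline.py -- reusing the shared node
from kedro.pipeline import Pipeline, node, pipeline

from common_nodes.cleaning import drop_exact_duplicates


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                drop_exact_duplicates,
                inputs="raw_clients",
                outputs="clean_clients",
                name="drop_exact_duplicates_node",
            ),
        ]
    )
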
Thanks in advance for your help!

Best regards,
El Guendouz Mohamed

3 comments