
Updated 2 days ago

Repo Separation Between ETL and ML Apps



Hello, team!

I have a question about best practices. I am developing a fairly classic ML solution that reads data from S3, runs ETL, and then trains and serves multiple models. Each model has its own preprocessing pipeline, while the ETL contains model-independent logic. I plan to use Kedro with the Kedro-MLflow plugin. I think the suggested application architecture works well for me, but I have doubts about separation of concerns. My main concern is whether to keep the ETL and ML applications together in one repository. Here are some thoughts and inputs that I think will be useful for the decision:

  1. I think each model will have its own repository with its own Kedro + Kedro-MLflow setup. The logic of the models and their pipelines is very different, and the teams working on them are expected to be independent. However, all teams depend on the same ETL and will therefore have to sync on any changes to its contract.
  2. The ETL and ML apps will very likely use different infrastructure: for example, AWS Batch and AWS SageMaker respectively.
  3. Both the ETL and ML apps are expected to be orchestrated through Kedro-Airflow.

Thank you very much for your help!

Hi @Oleg Litvinov, Kedro doesn't have a best practice as such; it really depends on your team's workflows and requirements. Kedro is modular in nature and supports both integrated and decoupled approaches. Given your setup with distinct infrastructure requirements, separating them makes sense.
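If the repositories are separated, the shared ETL contract mentioned in the question becomes the main coupling point between teams. A minimal sketch of one way to make that contract explicit, assuming the ETL repo publishes a small versioned schema that each model repo checks before running (the names `DatasetContract`, `contract_violations`, `FEATURES_V1`, and all column names are hypothetical and not part of Kedro or Kedro-MLflow):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetContract:
    """Hypothetical description of one ETL output that model repos rely on."""
    name: str
    version: str
    columns: dict  # column name -> expected dtype name


def contract_violations(contract, actual_columns):
    """Return human-readable differences between the contract and observed columns."""
    problems = []
    for col, expected in contract.columns.items():
        if col not in actual_columns:
            problems.append(f"missing column: {col}")
        elif actual_columns[col] != expected:
            problems.append(f"{col}: expected {expected}, got {actual_columns[col]}")
    return problems


# Assumed contract, published by the ETL repo (for example as a small shared package).
FEATURES_V1 = DatasetContract(
    name="etl_features",
    version="1.0.0",
    columns={
        "customer_id": "int64",
        "event_ts": "datetime64[ns]",
        "amount": "float64",
    },
)

# Assumed column types as observed by a model repo when it loads the ETL output.
observed = {"customer_id": "int64", "amount": "float32"}
violations = contract_violations(FEATURES_V1, observed)
for line in violations:
    print(line)
```

In a Kedro project, a check like this could run as an early node in each model pipeline, so that a contract change in the ETL repo fails fast instead of surfacing later during training.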

Hi @Rashida Kanchwala! Thank you very much for sharing your thoughts. I appreciate it!
