I’m working on a big project that is about to hit its next phase. We are using Kedro and we have a single, large Kedro project. To give you an idea of how big: we have 500+ catalog entries and 500+ nodes across different Kedro pipelines (we disabled the default pipeline that sums all pipelines, as it is too large to use). Now, I know the general guideline is to split your project into several smaller ones if it becomes too big, but I need some advice/opinions on this. I’ll explain more details in the comments. Thanks!
The project we have is a single solution rolled out to 50 countries across EMEA. For each country, we have the same code base, nodes and pipelines, but different configuration (parameters) that we store in different folders in `conf/`, i.e. every country is a `kedro_env`. Hence, splitting the project by country won’t reduce the size and complexity. The only logical split I see is grouping certain pipelines and nodes and moving those to separate Kedro projects, roughly with the following dependencies:
```
   +--> B --> C
A--+
   +--> D --> E
```
These projects would share some code, but that’s not too bad to handle. The biggest challenge is that they would share an extensive amount of config, so a change to a param would need to be replicated across several projects. Any advice on how you would solve this would be awesome.
To help reduce your catalog size, could you use Kedro dataset factories?
https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html
And pretty much the same for nodes: is each node unique in code, or is it mostly just the parameters that change? If so, you could generate the nodes in code from those parameters (that's what we do).
Our project is not yet at 500 nodes, but it could get there: we have a bunch of scenarios to train on plus a bunch of tests to verify, across ~6 models (so far). So 5 scenarios on 6 models already generate 65 nodes: 5 to split the data per scenario, plus 2 (train + test) × 6 (models) × 5 (scenarios) = 60. Our catalog has exactly 5 entries: 2 inputs (a metadata.csv describing binary files plus the binary files themselves), 1 for all of the scenarios that will be used as model input, 1 for saving the various trained models, and 1 for all of the reporting.
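For what it's worth, here is a minimal sketch of that node-generation approach. The function names, dataset names and the `"{model}.{scenario}"` naming convention are made up for illustration; with a convention like that, a single dataset factory pattern in `catalog.yml` could cover every generated output.
```
# Rough sketch: generate nodes from parameters instead of writing each by hand,
# following the scenario x model layout described above. Names are hypothetical.
from kedro.pipeline import Pipeline, node

from my_project.models import test_model, train_model  # hypothetical functions

SCENARIOS = ["s1", "s2", "s3", "s4", "s5"]
MODELS = ["m1", "m2", "m3", "m4", "m5", "m6"]


def create_pipeline(**kwargs) -> Pipeline:
    nodes = []
    for scenario in SCENARIOS:
        for model in MODELS:
            # One training node per (model, scenario) pair.
            nodes.append(
                node(
                    train_model,
                    inputs=[f"{scenario}.model_input", f"params:{model}"],
                    outputs=f"{model}.{scenario}.model",
                    name=f"train_{model}_{scenario}",
                )
            )
            # One test/evaluation node per (model, scenario) pair.
            nodes.append(
                node(
                    test_model,
                    inputs=[f"{model}.{scenario}.model", f"{scenario}.model_input"],
                    outputs=f"{model}.{scenario}.metrics",
                    name=f"test_{model}_{scenario}",
                )
            )
    return Pipeline(nodes)
```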
Hi @Matthias Roels, if it's just about sharing params, you could consider saving them in an S3 bucket as a CSV (for example) and loading them in a Kedro hook such as `before_pipeline_run`. But as @Alexandre Ouellet mentioned, Kedro dataset factories are a great way to reduce the config complexity. Thank you
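A minimal sketch of such a hook, assuming Kedro >= 0.19 naming (`MemoryDataset`), pandas with s3fs available, and placeholder bucket path and dataset names; the hook would still need registering in `HOOKS` in `settings.py`:
```
# Sketch of a hook that loads shared params from a single source of truth
# and injects them into the catalog before the run.
import pandas as pd
from kedro.framework.hooks import hook_impl
from kedro.io import MemoryDataset


class SharedParamsHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Placeholder path: one shared CSV instead of copies in every project.
        shared = pd.read_csv("s3://my-shared-config/params.csv")
        # Nodes can now declare "params:shared" as an input.
        catalog.add(
            "params:shared",
            MemoryDataset(data=shared.to_dict("records")),
            replace=True,
        )
```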
The way we manage these is using a monorepo management framework called Alloy; it was actually initially developed to help manage Kedro projects like this at scale.
With Kedro projects we separate our functional logic from pipeline logic: the functional logic lives in Python packages, and the pipeline logic in separate Python packages. In Alloy these packages are called `components`. Alloy also has the concept of `apps`: .yml files which declare combinations of components plus the relevant config and metadata.
We then use Alloy to assemble apps as needed for different contexts. Often different regions or business units have their own app, which is a specific combination of the components relevant to them. Managing the code this way gives a really clear separation of concerns and makes it super easy to scale development.
More info: https://medium.com/quantumblack/engineering-solutions-for-reuse-1ff5a81d8611
Speaking of which (and related, I believe): is there a way to manage a dataset of about 1 million files in AzureML? The files are about 4k each and entirely independent of each other.
@Alexandre Ouellet would `PartitionedDataset` work for you? (Feel free to open a separate thread to keep this one focused.)
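For illustration, a minimal sketch of a node consuming a `PartitionedDataset` (the catalog entry would point the dataset at the folder of files, e.g. on Azure Blob storage via fsspec; names below are placeholders). Kedro passes the input as a dict of partition ids mapped to load callables, so the files are read lazily one at a time rather than all at once:
```
# Hypothetical node consuming a PartitionedDataset: Kedro injects a dict
# mapping each partition id (file) to a callable that loads it on demand,
# which keeps memory usage flat even with ~1 million small files.
from typing import Any, Callable


def summarise_partitions(partitions: dict[str, Callable[[], Any]]) -> dict[str, int]:
    sizes = {}
    for partition_id, load_partition in partitions.items():
        data = load_partition()  # the underlying file is only read here
        sizes[partition_id] = len(data)
    return sizes
```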