
I need a node that can dynamically accept a variable number of input datasets based on the number of unique values in a specific field of the primary input dataset.

Hello Kedro community,

I am currently developing a project where I need to pass a dynamic number of catalog dataset entries as inputs to a node. The number of input datasets for this node depends on the primary input dataset, specifically on the number of unique values in one of its fields.

For instance, this node expects three inputs: a column name (this is fixed, not dynamic), feature datasets, and target datasets. The node collates all these datasets into a single object as its output:

  • The number of feature and target datasets is dynamic; it can be 1 or 20. They all have catalog entries.
  • I tried creating a list of catalog entry strings to pass for the feature and target datasets, as below:
feature_df_list = [
    f"{group_name_cleaned}.features_with_clusters"
    for group_name_cleaned in groups_cleaned
]

target_df_list = [
    f"{group_name_cleaned}.target_with_clusters"
    for group_name_cleaned in groups_cleaned
]

input_dict = {
    "target_col": "params:target_col",
    "group_list": feature_df_list,
    "target_clusters_with_features": target_df_list,
}


node(
    func=collate_results,
    inputs=input_dict,
    outputs="run_collection",
),
  • But Kedro treats the catalog entries in the list as plain strings and does not load the corresponding datasets.
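One pattern worth noting here (a minimal sketch, not necessarily the fix the thread settles on): Kedro node `inputs` can be a flat list of dataset names, which Kedro loads and passes positionally, so a node function can accept a variable number of datasets with `*args`. The group names below are illustrative, and the `node(...)` call is commented out because it only runs inside a Kedro project:

```python
# Illustrative group names; in the real project these come from the
# unique values of a field in the primary dataset.
groups_cleaned = ["group_a", "group_b"]

feature_df_list = [f"{g}.features_with_clusters" for g in groups_cleaned]
target_df_list = [f"{g}.target_with_clusters" for g in groups_cleaned]

# A node function that accepts a variable number of datasets positionally.
def collate_results(target_col, *dfs):
    # Collate everything into one object; here just a summary dict.
    return {"target_col": target_col, "n_datasets": len(dfs)}

# With a flat list of names, Kedro loads each catalog entry and passes
# the loaded datasets positionally (sketch; needs a Kedro project to run):
# node(
#     func=collate_results,
#     inputs=["params:target_col", *feature_df_list, *target_df_list],
#     outputs="run_collection",
# )
```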

Please help me understand how best I can pass dynamic inputs to a node in Kedro :)

9 comments

Hi, have you tried using parameters in Kedro?

I am not sure how I would use that, Rashida. I have the params and catalog files set up. How would that help me pass dynamic inputs to a node? If you could share an example, that would be great πŸ™‚

Not sure if this example is relevant in your case

pipes = []
for namespace, variants in settings.DYNAMIC_PIPELINES_MAPPING.items():
    for variant in variants:
        pipes.append(
            pipeline(
                data_science_pipeline,
                inputs={"model_input_table": f"{namespace}.model_input_table"},
                namespace=f"{namespace}.{variant}",
                tags=[variant, namespace],
            )
        )
return sum(pipes)

It's from this blog - https://getindata.com/blog/kedro-dynamic-pipelines/
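For context, `settings.DYNAMIC_PIPELINES_MAPPING` in that approach is just a plain dict defined in `settings.py`. A hypothetical sketch (the namespace and variant names below are assumptions for illustration, not taken from the blog):

```python
# settings.py (sketch): each namespace mapped to its model variants.
DYNAMIC_PIPELINES_MAPPING = {
    "base": ["regressor"],
    "candidate1": ["regressor"],
    "candidate2": ["regressor"],
}

# The loop in the snippet above then builds one namespaced pipeline
# per (namespace, variant) pair:
pairs = [
    (ns, v)
    for ns, variants in DYNAMIC_PIPELINES_MAPPING.items()
    for v in variants
]
```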

You probably want a preprocessing pipeline that creates data according to your groups, then use those as inputs. Check out namespaces; they helped me a lot with this, along with the blog post Rashida mentioned. I actually ended up implementing it.

Thank you both for pointing me to namespaces. Extremely helpful πŸ™Œ.
I also want to create a node that collates output from all namespaces into one summary output. Is there a way to pass all outputs created by the dynamic namespaces to a single node which collates them?

For instance, in the example Rashida shared, which has base, candidate1 and candidate2 namespaces with a regressor model for each: I want to create one node which takes the three models (this number is dynamic) as input.

Hi @Vinayak Singh,
I haven't tried this myself, but in principle, the outputs of a node can serve as inputs to another node. If you define your outputs correctly in the DataCatalog, you should be able to reference them as inputs in a new node.

Maybe you can build your input dict beforehand by reading settings.DYNAMIC_PIPELINES_MAPPING.items()?
That way you can populate your inputs with all the namespaces/variants used and read them using kwargs.
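A sketch of that suggestion, under the assumption that the mapping and the dataset naming scheme follow the blog post (all names here are illustrative): build a dict of catalog entry names up front, then let the collating node receive the loaded datasets as keyword arguments.

```python
# Illustrative mapping; in a real project this lives in settings.py.
DYNAMIC_PIPELINES_MAPPING = {
    "base": ["regressor"],
    "candidate1": ["regressor"],
    "candidate2": ["regressor"],
}

# One input per namespace/variant; the keys become the kwargs of the
# node function, the values are the (assumed) catalog entry names.
collate_inputs = {
    f"{ns}_{variant}": f"{ns}.{variant}.regressor"
    for ns, variants in DYNAMIC_PIPELINES_MAPPING.items()
    for variant in variants
}

# The collating node accepts whatever was built above via **kwargs.
def collate_models(**models):
    # e.g. produce one summary object over all models
    return sorted(models)

# node(func=collate_models, inputs=collate_inputs, outputs="model_summary")
```

Kedro also accepts a dict for `inputs`, mapping the function's parameter names to dataset names, which is what makes the kwargs pattern work here.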

Thank you both for your responses. Great suggestion @Philipp Dahlke, I will try that.
