
I need a node that can dynamically accept a variable number of input datasets based on the number of unique values in a specific field of the primary input dataset.

Hello Kedro community,

I am currently developing a project where I need to pass a dynamic number of catalog dataset entries as inputs to a node. The number of input datasets for this node depends on the primary input dataset, specifically on the number of unique values in one of its fields.

For instance, this node expects three inputs: a column name (this is fixed, not dynamic), feature datasets, and target datasets. The node collates all these datasets into a single object as its output:

  • The number of feature and target datasets is dynamic; it can be 1 or 20. They all have catalog entries.
  • I tried creating a list of catalog entry strings to pass for the feature and target datasets, as below:
feature_df_list = [
    f"{group_name_cleaned}.features_with_clusters"
    for group_name_cleaned in groups_cleaned
]

target_df_list = [
    f"{group_name_cleaned}.target_with_clusters"
    for group_name_cleaned in groups_cleaned
]

input_dict = {
    "target_col": "params:target_col",
    "group_list": feature_df_list,
    "target_clusters_with_features": target_df_list,
}


node(
    func=collate_results,
    inputs=input_dict,
    outputs="run_collection",
),
  • But Kedro treats the catalog entries in the list as plain strings and does not load the corresponding datasets.
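One pattern worth noting here (a minimal sketch, not necessarily the fix the thread settles on): Kedro node `inputs` can be a flat list of dataset names, which Kedro loads and passes positionally, so a node function can accept a variable number of datasets with `*args`. The group names below are illustrative, and the `node(...)` call is commented out because it only runs inside a Kedro project:

```python
# Illustrative group names; in the real project these come from the
# unique values of a field in the primary dataset.
groups_cleaned = ["group_a", "group_b"]

feature_df_list = [f"{g}.features_with_clusters" for g in groups_cleaned]
target_df_list = [f"{g}.target_with_clusters" for g in groups_cleaned]

# A node function that accepts a variable number of datasets positionally.
def collate_results(target_col, *dfs):
    # Collate everything into one object; here just a summary dict.
    return {"target_col": target_col, "n_datasets": len(dfs)}

# With a flat list of names, Kedro loads each catalog entry and passes
# the loaded datasets positionally (sketch; needs a Kedro project to run):
# node(
#     func=collate_results,
#     inputs=["params:target_col", *feature_df_list, *target_df_list],
#     outputs="run_collection",
# )
```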

Please help me understand how best I can pass dynamic inputs to a node in Kedro :)

9 comments

Hi, have you tried using parameters in Kedro?

I am not sure how I would use that, Rashida. I have the params and catalog files set up. How would that help me pass dynamic inputs to a node? If you could share an example, that would be great πŸ™‚

Not sure if this example is relevant in your case

pipes = []
for namespace, variants in settings.DYNAMIC_PIPELINES_MAPPING.items():
    for variant in variants:
        pipes.append(
            pipeline(
                data_science_pipeline,
                inputs={"model_input_table": f"{namespace}.model_input_table"},
                namespace=f"{namespace}.{variant}",
                tags=[variant, namespace],
            )
        )
return sum(pipes)

It's from this blog - https://getindata.com/blog/kedro-dynamic-pipelines/
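For context, `settings.DYNAMIC_PIPELINES_MAPPING` in that approach is just a plain dict defined in `settings.py`. A hypothetical sketch (the namespace and variant names below are assumptions for illustration, not taken from the blog):

```python
# settings.py (sketch): each namespace mapped to its model variants.
DYNAMIC_PIPELINES_MAPPING = {
    "base": ["regressor"],
    "candidate1": ["regressor"],
    "candidate2": ["regressor"],
}

# The loop in the snippet above then builds one namespaced pipeline
# per (namespace, variant) pair:
pairs = [
    (ns, v)
    for ns, variants in DYNAMIC_PIPELINES_MAPPING.items()
    for v in variants
]
```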

You probably want a preprocessing pipeline that creates data according to your groups, then use those as inputs. Check out namespaces; they helped me a lot with this, along with the blog post Rashida mentioned. I actually ended up implementing it.

Thank you both for pointing me to namespaces. Extremely helpful πŸ™Œ.
I also want to create a node that collates output from all namespaces into one summary output. Is there a way to pass all outputs created by the dynamic namespaces to a single node which collates them?

For instance, in the example Rashida shared, which has base, candidate1 and candidate2 namespaces with a regressor model for each: I want to create one node which takes the three models (this number is dynamic) as input.

Hi @Vinayak Singh,
I haven't tried this myself, but in principle, the outputs of a node can serve as inputs to another node. If you define your outputs correctly in the DataCatalog, you should be able to reference them as inputs in a new node.

Maybe you can build your input dict beforehand by reading settings.DYNAMIC_PIPELINES_MAPPING.items()?
That way you can populate your inputs with all the namespaces/variants used and read them using kwargs.
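A sketch of that suggestion, under the assumption that the mapping and the dataset naming scheme follow the blog post (all names here are illustrative): build a dict of catalog entry names up front, then let the collating node receive the loaded datasets as keyword arguments.

```python
# Illustrative mapping; in a real project this lives in settings.py.
DYNAMIC_PIPELINES_MAPPING = {
    "base": ["regressor"],
    "candidate1": ["regressor"],
    "candidate2": ["regressor"],
}

# One input per namespace/variant; the keys become the kwargs of the
# node function, the values are the (assumed) catalog entry names.
collate_inputs = {
    f"{ns}_{variant}": f"{ns}.{variant}.regressor"
    for ns, variants in DYNAMIC_PIPELINES_MAPPING.items()
    for variant in variants
}

# The collating node accepts whatever was built above via **kwargs.
def collate_models(**models):
    # e.g. produce one summary object over all models
    return sorted(models)

# node(func=collate_models, inputs=collate_inputs, outputs="model_summary")
```

Kedro also accepts a dict for `inputs`, mapping the function's parameter names to dataset names, which is what makes the kwargs pattern work here.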

Thank you both for your responses. Great suggestion @Philipp Dahlke, I will try that.
