Guys, I would like to check with you if there's a simpler way to use a run_identifier in the path of a catalog entry:
I'm loading a table from BigQuery and splitting each row to run in another pipeline, where I load and save the inputs/outputs dynamically.
I would like to get a value from a column and use it as a run_identifier in the path in the catalog:
```yaml
filepath: ${root_folder}/${current_datetime}/${run_identifier}/data/model/{placeholder:name}.pt
```
Is there a known way to do something like that? I'm open to suggestions...
yeah, but I need it to be updated dynamically, based on the value from a row that comes from the input...
So I understand that you basically want to override some parameters based on a node output.
https://docs.kedro.org/en/stable/extend_kedro/architecture_overview.html
I'm loading a table from BigQuery and splitting each row to run in another pipeline
If you are calling a separate Kedro pipeline, you can simply inject those identifiers as part of the runtime_params.
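For example, a minimal sketch of what that injection could look like if you trigger the second pipeline from Python; the pipeline name `model_pipeline` and the parameter key `run_identifier` are placeholders, not names from your project:
```python
# Hedged sketch: drive the per-row pipeline from Python and inject the
# identifier as a runtime parameter.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())
with KedroSession.create(extra_params={"run_identifier": "row_001"}) as session:
    session.run(pipeline_name="model_pipeline")
```
The injected value then behaves like any other parameter, so nodes can receive it as `params:run_identifier`.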
I do have a pipeline that loads the BQ table and splits that data into rows, which are saved dynamically in the catalog.
Once that orchestration pipeline is done, the model pipeline runs.
```python
from functools import reduce
from operator import add

from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs) -> Pipeline:
    params = load_catalog_params()
    print("input_tables entry from parameters/general.yml: ", params["input_tables"])
    all_pipelines = []
    for group in range(int(params["n_cores"])):
        p = Pipeline(
            [
                node(
                    func=train_setup,
                    inputs=["imagens", "masks", "split_data_input", f"train_config_params_{group}"],
                    outputs=[
                        f"metrics_{group}",
                        f"epoch_loss_{group}",
                        f"validity_metrics_{group}",
                        f"model_save_{group}",
                        f"train_params_{group}",
                    ],
                    name=f"train_setup_{group}",
                )
            ],
            tags="training",
        )
        all_pipelines += [p]
    return reduce(add, all_pipelines)
```
Otherwise I am thinking of using a namespaced pipeline + dataset factory. So the catalog looks like:
```yaml
"{run_identifier}_xxx_dataset":
  filepath: "{run_identifier}/some_folder/some_dataset.parquet"
```
Then in those namespaced pipelines you use the run_identifier as part of the input/output name.
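A rough sketch of how the namespaced side could look (purely illustrative; `train_setup`, the dataset names and the identifiers are placeholders, and note that a namespace prefixes dataset names with `<namespace>.`, so the factory pattern has to match that prefixed name):
```python
# Hedged sketch of "namespace pipeline + dataset factory"; every name here
# is a placeholder.
from kedro.pipeline import Pipeline, node, pipeline


def train_setup(data):  # stand-in for the real training function
    ...


def create_pipeline(**kwargs) -> Pipeline:
    base = Pipeline([node(train_setup, inputs="xxx_dataset", outputs="model")])
    run_identifiers = ["run_a", "run_b"]  # in practice these come from the split step
    # namespace=rid renames the datasets to e.g. "run_a.xxx_dataset",
    # which a catalog pattern like "{run_identifier}.xxx_dataset" can then catch.
    return sum((pipeline(base, namespace=rid) for rid in run_identifiers), Pipeline([]))
```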
So it's the outputs from that train_setup where I would like to be able to add the run_identifier into the path,
so that for every whole run, I'll have a folder with date/run_identifier/data..
I see, I don't have an immediate solution. This is tricky because you have a runtime output defining the dataset, which was initialised way before the node is run.
https://docs.kedro.org/en/stable/extend_kedro/architecture_overview.html
The other way is likely using a hook to override the catalog during the node run.
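As a rough illustration of the hook idea (all names below are assumptions, and the dataset class is just an example), something along these lines could re-register an output with a path built from a runtime value:
```python
# Hedged sketch of a hook that swaps in a dataset whose filepath embeds a
# run identifier taken from a node input. Dataset/node names are placeholders.
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro_datasets.pandas import ParquetDataset


class DynamicPathHook:
    @hook_impl
    def before_node_run(self, node, catalog: DataCatalog, inputs: dict):
        run_id = inputs.get("run_identifier")  # assumed upstream output fed into the node
        if run_id and node.name.startswith("train_setup"):
            # Replace the catalog entry so the output is saved under the run id.
            catalog.add(
                "model_save_0",
                ParquetDataset(filepath=f"data/{run_id}/model_save_0.parquet"),
                replace=True,
            )
```
You would still need to register the hook in your project's `settings.py`, e.g. `HOOKS = (DynamicPathHook(),)`.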
Alternatively, if you can split this into two separate Kedro runs, this would be very simple.
First run: generate the rows to run.
Second run: a kedro run with runtime_params that reads the result of the previous run, potentially from a table.
The downside is that you cannot use a single `kedro run` command to do what you think of as a whole job. In an orchestrator it shouldn't be a problem, since you can treat the two runs as one job and it doesn't have to be mapped 1:1.
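To make that concrete, here is a hedged sketch of a tiny driver for the second stage; the CSV path, column name, and pipeline name are all assumptions about what the first run produced:
```python
# Hedged sketch: loop over the rows produced by the first run and launch one
# "kedro run" per row, passing the identifier as a runtime parameter.
import csv
import subprocess

with open("data/02_intermediate/rows_to_run.csv") as f:  # assumed output of run #1
    run_identifiers = [row["run_identifier"] for row in csv.DictReader(f)]

for rid in run_identifiers:
    subprocess.run(
        ["kedro", "run", "--pipeline", "model", "--params", f"run_identifier={rid}"],
        check=True,
    )
```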
Hey, do you have documentation on how to implement that example you gave: namespace pipeline + dataset....
```yaml
"{run_identifier}_xxx_dataset":
  filepath: "{run_identifier}/some_folder/some_dataset.parquet"
```
Let's see if this helps you.
Docs:
nice...
I get the namespace part now, but the layer one, where is it defined? I mean, the namespace is defined in the node... and the layer?
I see, this is using both a namespace and a factory at the same time. For example, if you only use a factory but not a namespace, it may look like this:
```yaml
"{layer}_some_dataset":
  filepath: data/{layer}/data.pq
```
ok, but I just don't know where the value "layer" is coming from, sorry about that. I'm somewhat new to the more advanced Kedro concepts, so I don't know about the factory.
So for example, when you have a node without any dataset factory:
```python
node(my_func, inputs="some_data", outputs="my_fav_dataset")
```
then your catalog may look like:
```yaml
some_data:
  ...
"{some}_dataset":
  filepath: data/{some}/dataset.pq
```
The `outputs` will automatically match the 2nd dataset in the catalog (here `my_fav_dataset` matches the `{some}_dataset` pattern, so `{some}` resolves to `my_fav` and the filepath becomes `data/my_fav/dataset.pq`). Without a pattern, Kedro defaults to an in-memory dataset that is thrown away after the run.
So how would a node and catalog that have a layer look? Just to see if I truly got it.
Don't worry about that, I can see the confusion, as it requires both the node and the catalog to reason about the dataset. It's not super clear in the docs; I will try to add some explanation.
ohhh so that will be the layer? Or can I call it anything, as long as it follows a pattern according to that mentioned rank...
```yaml
some_data:
  ...
"{some}_dataset_{abc}":
  filepath: data/{some}/dataset.pq
  layer: "{abc}"
"{some}_dataset":
  filepath: data/{some}/dataset.pq
```
There is also `kedro catalog rank` to help you understand the resolution. `kedro_dataset` will match `{some}_dataset`, while `kedro_dataset_something` will match `{some}_dataset_{abc}`. We use `parse` as the underlying library, which is a bit like the reverse of an f-string. This example should help you understand more:
```python
>>> from parse import compile
>>> p = compile("It's {}, I love it!")
>>> print(p)
<Parser "It's {}, I love it!">
>>> p.parse("It's spam, I love it!")
<Result ('spam',) {}>
```
I see now, I believe that I'll be able to solve that issue I'm having with these features.
I'll use a for loop inside the node to extract the information I need and use it as the layer.
Would it be possible to access data from an input in the pipeline node definition?
Added some docs here: https://github.com/kedro-org/kedro/pull/4308
The build is stuck for some reason but you can review this temporarily with this link: https://5500-kedroorg-kedro-ov91qdu83us.ws-us116.gitpod.io/docs/build/html/data/kedro_dataset_factories.html
But like in that example you shared, you create a list of months that you use as namespaces to save the outputs. Would it be possible to access a value from the catalog instead of creating that list?
Yeah, they are created in another pipeline; basically it would be accessing a catalog entry inside the pipeline.py file... like you did there with the months, but accessing an entry...
```python
months = ["jan", "feb", "mar", "apr"]
# instead use:
months = catalog.load("months")  # let's say... something like that
```
It is possible, though not the most conventional way: https://github.com/kedro-org/kedro/issues/2627#issuecomment-1691596460
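For completeness, a hedged sketch of one way this could be done (not necessarily what the linked comment does); the conf path and the `months` entry are assumptions:
```python
# Hedged sketch: load a catalog entry at pipeline-construction time.
# Not the conventional approach, and "conf" / "months" are placeholder names.
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(conf_source="conf")
catalog = DataCatalog.from_config(conf_loader["catalog"])
months = catalog.load("months")  # value written by the other pipeline
```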
If you can afford two runs, this may be a simpler approach: https://github.com/noklam/kedro-example/blob/master/conditional-kedro-runs/conditional_run.py
Hey
I do have a param:
conf/base/parameters/test.yml
group_id: null
I'm trying to update it during a pipeline run with catalog.save(); is it possible?
I managed to work around what I needed; it's just the namespace that I'm not getting to work for some reason.