Using placeholders for data catalog in pipeline

Hugo Acosta

Hello everyone,

I am running into an issue with placeholders in the data catalog, and I was hoping you could shed some light on it.
I have the following pipeline:

load_date = settings.LOAD_DATE_COMPARISON.get("current")
previous_load_date = settings.LOAD_DATE_COMPARISON.get("previous")


def create_pipeline(**kwargs) -> Pipeline:
    format_data_quality = pipeline(
        [
            node(
                func=compare_id,
                inputs=[
                    f"maestro_indicadores_{load_date}",
                    f"maestro_indicadores_{previous_load_date}",
                ],
                outputs=f"compare_id_{load_date}_{previous_load_date}",
                name="compare_id_node",
                tags="compare_id",
            ),
        ]
    )
    return format_data_quality

With the corresponding catalog entry for the output:

compare_id_{load_date}_{previous_load_date}:
  type: json.JSONDataset
  filepath: reports/{load_date}/id_comparison/id_comparison_{load_date}_{previous_load_date}.json

The issue is that whenever load_date has a value like 2024_07_01, the catalog generates a path like:
reports/2024/id_comparison/id_comparison_2024_07_01_2024_05_01.json

Note that the first placeholder is not substituted with the intended value, while the others are.
This only happens when the value of load_date contains underscores; it does not happen with dots or hyphens.
Why does this happen?

12 comments

Rashida Kanchwala

Thank you for raising this issue. This will require further investigation from the team. Could you kindly raise this as a bug on GitHub?

Hugo Acosta

Sure, done!

Nok Lam Chan

Can you run kedro catalog resolve to understand this better? Is it using the pattern that you intend to use?

Nok Lam Chan

So the problem is that 2024_07_07 somehow becomes 2024?

Hugo Acosta

Yes, 2024_07_07 becomes 2024

Nok Lam Chan

Can you put together a minimal example that reproduces this issue?

Vishal Pandey

Just for learning purposes, I would like to know more about
settings.LOAD_DATE_COMPARISON.get("current")

What kind of object is LOAD_DATE_COMPARISON, and how is it defined in settings.py?

Ankita Katiyar

If you're using dataset factories, this happens because of how the parse library (https://pypi.org/project/parse/), which we use under the hood to match dataset names to patterns, works. For compare_id_{load_date}_{previous_load_date}, it resolves the first placeholder at the first underscore. This is expected behaviour, and I'd recommend using a different separator between the dates for this output dataset.

Hugo Acosta

In that case, all of the placeholders should suffer from this issue, but it only happens on the one I'm highlighting:
reports/{load_date}/id_comparison/id_comparison_{load_date}_{previous_load_date}.json

Hugo Acosta

This is the content of settings.py
LOAD_DATE_COMPARISON = globals_config["load_dates_comparison"]

Which refers to the globals.yml file where:

load_dates_comparison:
  previous: "2024_07_01"
  current: "2024_10_07"

Hugo Acosta

So it turns out the problem comes from the catalog.yml entry name, which contains underscores and follows this schema:
When the name is something_{placeholder1}_{placeholder2}, the path placeholders take unwanted values.
This does not happen if we name the entry something_{placeholder1}_vs_{placeholder2}.
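
To illustrate the difference between the two naming schemas, here is a minimal sketch using only the standard library. It does not call the parse library itself; it assumes that an untyped field such as {load_date} behaves like a non-greedy `.+?` regex group. The helper `split_name` and the sample dataset names are hypothetical, built from the dates in globals.yml above.

```python
import re


def split_name(pattern: str, name: str) -> dict:
    """Hypothetical helper: match `name` against a `{field}` style pattern.

    Assumption: each untyped field behaves like a non-greedy `.+?` group.
    """
    # re.split with a capturing group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, name)
    return match.groupdict() if match else {}


# Underscore as the only separator: several splits satisfy the pattern.
ambiguous = split_name(
    "compare_id_{load_date}_{previous_load_date}",
    "compare_id_2024_10_07_2024_07_01",
)

# A separator that cannot appear inside a date: only one split is possible.
unambiguous = split_name(
    "compare_id_{load_date}_vs_{previous_load_date}",
    "compare_id_2024_10_07_vs_2024_07_01",
)

print(ambiguous)
print(unambiguous)
```

Under that assumption, the first call assigns only 2024 to load_date, while the second recovers both dates in full. For the second schema to apply, the node output in the pipeline would need the same `_vs_` separator in its name.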

Ankita Katiyar

Yeah, if the placeholder values themselves contain underscores and the separator between them is also an underscore, the string can be split in multiple ways that all satisfy the pattern. The parse library returns the first match that satisfies the pattern.
So something_{2024}_{07_01_2024_10_07}, something_{2024_07}_{01_2024_10_07} and something_{2024_07_01}_{2024_10_07} all satisfy the pattern, but the parse library returns the first one.
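
This also explains why only the first placeholder in the filepath looks wrong. The sketch below approximates the matching with a hand-written regex rather than the parse library, on the assumption that untyped fields are matched non-greedily; the dataset name is built from the dates in globals.yml above. The two captured values are wrong individually, but joined with an underscore they reproduce the original string, so the file name looks correct while the directory does not.

```python
import re

# Assumption: untyped fields behave like non-greedy ".+?" groups.
name_regex = r"compare_id_(?P<load_date>.+?)_(?P<previous_load_date>.+?)"
fields = re.fullmatch(name_regex, "compare_id_2024_10_07_2024_07_01").groupdict()

# The filepath template from the catalog entry, filled with the captured values.
filepath_template = (
    "reports/{load_date}/id_comparison/"
    "id_comparison_{load_date}_{previous_load_date}.json"
)
filepath = filepath_template.format(**fields)

print(fields)
print(filepath)
```

Here load_date is captured as 2024 and previous_load_date as 10_07_2024_07_01, which gives reports/2024/id_comparison/id_comparison_2024_10_07_2024_07_01.json.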
