Join the Kedro community


Using placeholders for data catalog in pipeline

Hello everyone,

I am encountering an issue with placeholders in the data catalog, and I was hoping you could shed some light on it.
I have the following pipeline:

load_date = settings.LOAD_DATE_COMPARISON.get("current")
previous_load_date = settings.LOAD_DATE_COMPARISON.get("previous")


def create_pipeline(**kwargs) -> Pipeline:
    format_data_quality = pipeline(
        [
            node(
                func=compare_id,
                inputs=[
                    f"maestro_indicadores_{load_date}",
                    f"maestro_indicadores_{previous_load_date}",
                ],
                outputs=f"compare_id_{load_date}_{previous_load_date}",
                name="compare_id_node",
                tags="compare_id",
            ),
        ]
    )
    return format_data_quality

With the corresponding catalog entry for the output:

compare_id_{load_date}_{previous_load_date}:
  type: json.JSONDataset
  filepath: reports/{load_date}/id_comparison/id_comparison_{load_date}_{previous_load_date}.json
The issue is that whenever the value of load_date is something like 2024_07_01, it generates a path like:
reports/2024/id_comparison/id_comparison_2024_07_01_2024_05_01.json

Note that the first placeholder is not substituted with the intended value, while the others are.
This only happens when the value of load_date contains underscores; it does not happen with dots or hyphens.
Why does this happen?

12 comments

Thank you for raising this issue. This will require further investigation from the team. Could you kindly raise this as a bug on GitHub?

Can you run kedro catalog resolve to understand this better? Is it using the pattern that you intend to use?

So the problem is that 2024_07_07 somehow becomes 2024?

Yes, 2024_07_07 becomes 2024

Can you make a minimal example that reproduces this issue?

Just for learning purposes, I wanted to know more about settings.LOAD_DATE_COMPARISON.get("current").

What kind of object is LOAD_DATE_COMPARISON, and how is it defined in settings.py?

If you’re using dataset factories, it’s because the parse library (https://pypi.org/project/parse/) that we use under the hood for matching dataset names to patterns works this way. It’ll resolve the brackets for compare_id_{load_date}_{previous_load_date} at the first underscore. It’s expected behaviour, and I’d recommend using a different separator between the dates for this output dataset.

In that case, all of the placeholders should suffer from this issue, but it only happens on the one I'm highlighting:
reports/{load_date}/id_comparison/id_comparison_{load_date}_{previous_load_date}.json
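A small sketch of why only the folder looks wrong, assuming the dataset name was split at the first underscore as the reply above describes. Plain str.format stands in here for Kedro's own substitution, and the two dates are the ones from the path in the original post. Every placeholder receives the mis-split value, but in the file name the two pieces are joined back with an underscore and happen to rebuild the intended text.

```python
# The filepath template from the catalog entry in the question.
filepath_template = (
    "reports/{load_date}/id_comparison/"
    "id_comparison_{load_date}_{previous_load_date}.json"
)

# Assumption: the name "compare_id_2024_07_01_2024_05_01" was split
# at the first underscore after the "compare_id_" prefix.
mis_split = {"load_date": "2024", "previous_load_date": "07_01_2024_05_01"}

# What the user intended the placeholders to hold.
intended = {"load_date": "2024_07_01", "previous_load_date": "2024_05_01"}

resolved_path = filepath_template.format(**mis_split)
intended_path = filepath_template.format(**intended)

print(resolved_path)
print(intended_path)

# The file names are identical; only the folder differs.
same_file_name = resolved_path.split("/")[-1] == intended_path.split("/")[-1]
print(same_file_name)
```

So the second and third substitutions are affected too; the damage is only visible where load_date is used on its own.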


This is the content of settings.py:
LOAD_DATE_COMPARISON = globals_config["load_dates_comparison"]

Which refers to the globals.yml file where:

load_dates_comparison:
  previous: "2024_07_01"
  current: "2024_10_07"

So it turns out the problem comes from the name of the catalog.yml entry, which contains underscores and follows this schema:
When the name is something_{placeholder1}_{placeholder2}, the path placeholders take unwanted values.
This does not happen if we name the entry something_{placeholder1}_vs_{placeholder2}.

Yeah, if the placeholder values themselves contain underscores and the separator between them is also an underscore, the string can be split in multiple ways that all satisfy the pattern. The parse library returns the first match that satisfies the pattern.
So something_{2024}_{07_01_2024_10_07}, something_{2024_07}_{01_2024_10_07} and something_{2024_07_01}_{2024_10_07} all satisfy the pattern, but the parse library returns the first one.
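The ambiguity can be reproduced with a rough stand-in built on the standard library. This is only a sketch: it assumes each untyped placeholder behaves like a non-greedy regex group, which gives the first-match behaviour described above. It is not the parse library's or Kedro's actual code, and match_pattern is a hypothetical helper name.

```python
import re
from typing import Dict, Optional


def match_pattern(pattern: str, name: str) -> Optional[Dict[str, str]]:
    """Match a dataset name against a {placeholder} pattern (approximation)."""
    # re.split with a capture group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)  # literal text between placeholders
        else:
            regex += f"(?P<{part}>.+?)"  # non-greedy: stop as early as possible
    match = re.fullmatch(regex, name)
    return match.groupdict() if match else None


# Underscore as separator: the first placeholder stops at the first underscore.
ambiguous = match_pattern(
    "compare_id_{load_date}_{previous_load_date}",
    "compare_id_2024_10_07_2024_07_01",
)
print(ambiguous)

# A separator that cannot occur inside a date removes the ambiguity.
unambiguous = match_pattern(
    "compare_id_{load_date}_vs_{previous_load_date}",
    "compare_id_2024_10_07_vs_2024_07_01",
)
print(unambiguous)
```

With the underscore separator the sketch yields load_date "2024" and puts the rest into previous_load_date; with _vs_ both dates come back whole, which matches the workaround found earlier in the thread.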
