Hello everyone,
I am encountering some issues with the use of placeholders in the data catalog, and I was hoping you could shed some light on this.
I have the following pipeline:
load_date = settings.LOAD_DATE_COMPARISON.get("current")
previous_load_date = settings.LOAD_DATE_COMPARISON.get("previous")

def create_pipeline(**kwargs) -> Pipeline:
    format_data_quality = pipeline(
        [
            node(
                func=compare_id,
                inputs=[
                    f"maestro_indicadores_{load_date}",
                    f"maestro_indicadores_{previous_load_date}",
                ],
                outputs=f"compare_id_{load_date}_{previous_load_date}",
                name="compare_id_node",
                tags="compare_id",
            ),
        ]
    )
    return format_data_quality

With the corresponding catalog entry for the output:
compare_id_{load_date}_{previous_load_date}:
  type: json.JSONDataset
  filepath: reports/{load_date}/id_comparison/id_comparison_{load_date}_{previous_load_date}.json

The issue here is that whenever the value of load_date is something like 2024_07_01, it will generate a path like:
Thank you for raising this issue. This will require further investigation from the team. Could you kindly raise this as a bug on GitHub?
Can you run kedro catalog resolve to understand this better? Is it using the pattern that you intend to use?
Just for learning purposes, I wanted to know more about settings.LOAD_DATE_COMPARISON.get("current"). What kind of object is LOAD_DATE_COMPARISON, and how is it defined in settings.py?
If you’re using dataset factories, it’s because the parse library (https://pypi.org/project/parse/) that we use under the hood for matching dataset names to patterns works this way. It’ll resolve the brackets for compare_id_{load_date}_{previous_load_date} at the first underscore. It’s expected behaviour, and I’d recommend using a different separator between the dates for this output dataset.
In that case, all of the placeholders should suffer from this issue, but it only happens on the one I'm highlighting:
reports/{load_date}/id_comparison/id_comparison_{load_date}_{previous_load_date}.json
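One possible explanation for why only the directory part looks wrong can be sketched with the standard library. This assumes that an untyped placeholder behaves like a non-greedy regex group, which is how the parse library is understood to match; the regex below is a hand-written stand-in, not Kedro's actual code. Under that assumption the two captured values are both wrong, but joined with an underscore they rebuild the original string, so the filename looks right while reports/{load_date} does not.

```python
import re

# Assumption: each untyped "{field}" matches like the non-greedy group "(.+?)".
NAME_PATTERN = re.compile(
    r"compare_id_(?P<load_date>.+?)_(?P<previous_load_date>.+?)"
)
FILEPATH_TEMPLATE = (
    "reports/{load_date}/id_comparison/"
    "id_comparison_{load_date}_{previous_load_date}.json"
)

# Dataset name built from current="2024_10_07" and previous="2024_07_01".
dataset_name = "compare_id_2024_10_07_2024_07_01"

# fullmatch forces the second group to consume the rest of the name.
fields = NAME_PATTERN.fullmatch(dataset_name).groupdict()
resolved_path = FILEPATH_TEMPLATE.format(**fields)

print(fields)         # load_date is cut at the first underscore
print(resolved_path)  # directory is wrong, filename still looks correct
```

Here load_date comes out as 2024, so the directory becomes reports/2024, while the filename still reads id_comparison_2024_10_07_2024_07_01.json.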
This is the content of settings.py:
LOAD_DATE_COMPARISON = globals_config["load_dates_comparison"]
Which refers to the globals.yml file where:
load_dates_comparison:
  previous: "2024_07_01"
  current: "2024_10_07"
So it turns out the problem comes from the catalog.yml entry name, which uses underscores and follows this schema:
When the name is something_{placeholder1}_{placeholder2}, the path placeholders take unwanted values.
This does not happen if we name the entry like something_{placeholder1}_vs_{placeholder2}.
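A minimal sketch of why the _vs_ separator helps, using only the standard library. The regex is an approximation that assumes placeholders match non-greedily; split_name is a hypothetical helper for illustration, not a Kedro or parse function.

```python
import re


def split_name(pattern: str, name: str) -> dict:
    """Match the whole name against a regex with named, non-greedy groups."""
    match = re.fullmatch(pattern, name)
    return match.groupdict() if match else {}


# "_vs_" cannot occur inside a date such as 2024_10_07, so only one split fits.
with_vs = split_name(
    r"something_(?P<placeholder1>.+?)_vs_(?P<placeholder2>.+?)",
    "something_2024_10_07_vs_2024_07_01",
)
print(with_vs)
```

Any separator that never appears inside the placeholder values should work the same way; _vs_ is just the one used in this thread.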
Yeah, if the placeholders themselves contain underscores and the separator between them is also an underscore, the string can be split in multiple ways that satisfy the pattern.
So something_{2024}_{07_01_2024_10_07}
and something_{2024_07}_{01_2024_10_07}
and something_{2024_07_01}_{2024_10_07}
all satisfy the pattern, but the parse library returns the first match it finds.
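To make the ambiguity concrete, here is a small sketch that lists every way the combined value can be cut at an underscore. candidate_splits is a hypothetical helper written for this illustration and is not part of parse or Kedro.

```python
def candidate_splits(value: str, sep: str = "_") -> list[tuple[str, str]]:
    """Return every (left, right) pair obtained by cutting value at one sep."""
    parts = value.split(sep)
    return [
        (sep.join(parts[:i]), sep.join(parts[i:]))
        for i in range(1, len(parts))
    ]


splits = candidate_splits("2024_07_01_2024_10_07")
for left, right in splits:
    print(f"something_{{{left}}}_{{{right}}}")
```

The intended split (2024_07_01, 2024_10_07) is in the list, but it is not the first entry, which is why a matcher that stops at the earliest valid split picks the wrong one.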