Join the Kedro community


Adding Timestamps to Catalog Entries

Hello, what is the proper way to add a current timestamp to the names of catalog entries? Thanks!


Hi @Gauthier Pierard, if your goal is to version your dataset, you can set versioned: True in the catalog entry. This will save your datasets with a timestamp-based version for each kedro run.
https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-versioning
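To make the behaviour concrete, here is a small sketch of what a versioned save looks like on disk. The version-string format and the `<filepath>/<version>/<filename>` layout mirror what Kedro produces, but the helper functions below are illustrative stand-ins, not Kedro's own `generate_timestamp()`:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def timestamp_version(now=None):
    """Build a version string similar to Kedro's timestamp versions
    (e.g. 2024-01-01T12.00.00.000Z). Illustrative sketch only."""
    now = now or datetime.now(tz=timezone.utc)
    return now.strftime("%Y-%m-%dT%H.%M.%S.") + f"{now.microsecond // 1000:03d}Z"

def versioned_path(filepath, version):
    """Versioned datasets are saved under <filepath>/<version>/<filename>."""
    p = PurePosixPath(filepath)
    return str(p / version / p.name)

version = timestamp_version()
print(versioned_path("data/01_raw/test.csv", version))
```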

Thanks Rashida, but I actually need more control.
In general my save folders look like output_folder_<parameter>_<from_date>_<to_date>,
where from_date and to_date are defined by a node and saved as MemoryDatasets in the catalog.
Is it possible to define other catalog entries whose names depend on previous entries?
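The naming scheme described above can be assembled with plain string formatting once the node has produced the dates. A minimal sketch, where the base directory, parameter name, and date values are all hypothetical placeholders:

```python
from datetime import date

def output_path(base, parameter, from_date, to_date):
    """Assemble output_folder_<parameter>_<from_date>_<to_date>.
    All argument values used below are hypothetical placeholders."""
    return f"{base}/output_folder_{parameter}_{from_date:%Y%m%d}_{to_date:%Y%m%d}"

# from_date / to_date would come from the node that computes them
print(output_path("data/02_intermediate", "motorbikes", date(2024, 1, 1), date(2024, 3, 31)))
```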

If I understand this correctly you'd essentially like to dynamically create your catalog based on previous runs?

Indeed. I suppose this is best done in Python with something like

CSVDataset(
    filepath="s3://test_bucket/data/02_intermediate/company/motorbikes.csv",
    load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]),
    credentials=dict(key="token", secret="key"),
)
and
# save the dataset to data/01_raw/test.csv/<version>/test.csv
catalog.save("test_dataset", data1)
correct?

The above allows you to save the data, but you wouldn't preserve the dataset entry in the catalog. Saving here doesn't add it to the catalog itself.

Do you need to have the catalog for future processing or are you okay with just saving the data to storage?

Yes, I understand the catalog file won't be updated, only the catalog object in memory.
However, could I define a PartitionedDataset at the parent directory that would load the dynamically generated output paths and files for future computations?

You could possibly use the OmegaConfigLoader, register a custom resolver in settings.py, and then define your catalog filepath as filepath: data/02_intermediate/pypi_kedro_demo_${now:}.csv. Here is some example code: https://github.com/kedro-org/kedro/issues/2355#issuecomment-2260512795
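A minimal sketch of what that registration could look like in settings.py, assuming a Kedro version whose OmegaConfigLoader accepts a custom_resolvers argument; the resolver name `now` and the timestamp format are assumptions, not fixed by Kedro:

```python
# settings.py (sketch): register a custom OmegaConf resolver so that
# ${now:} in catalog.yml expands to a timestamp at config-load time.
from datetime import datetime, timezone

def now_resolver():
    """Body of the hypothetical ${now:} resolver.
    The %Y%m%d%H%M%S format is an assumption; pick what suits your paths."""
    return datetime.now(tz=timezone.utc).strftime("%Y%m%d%H%M%S")

# Passed through to OmegaConfigLoader by Kedro when loading config.
CONFIG_LOADER_ARGS = {"custom_resolvers": {"now": now_resolver}}
```

With this in place, a catalog entry like filepath: data/02_intermediate/pypi_kedro_demo_${now:}.csv would resolve to a fresh timestamped path on each run.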

Hmm, this seems to involve the file datasets.py, with which I am not familiar. Thanks for the idea in any case!

You can ignore that file! It was just an example

However could I define a partitionedDataset at the parent directory that would load the dynamically generated output paths and files for future computations?

As far as I know this should be possible, because for the load path you just provide the top-level directory: https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-load
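To illustrate the idea: when a PartitionedDataset is loaded, it scans the top-level directory and returns one partition per file found underneath it, regardless of how the subfolders were named at save time. The sketch below mimics only that discovery step with pathlib; it is not Kedro's implementation, and the folder names are hypothetical:

```python
import tempfile
from pathlib import Path

def discover_partitions(parent, suffix=".csv"):
    """Mimic the discovery step of a PartitionedDataset pointed at `parent`:
    every matching file under the top-level directory becomes a partition id.
    Illustrative sketch only, not Kedro's implementation."""
    parent = Path(parent)
    return sorted(str(p.relative_to(parent)) for p in parent.rglob(f"*{suffix}"))

# Hypothetical layout: dynamically generated run folders under one parent
root = Path(tempfile.mkdtemp())
for folder in ("output_folder_a_20240101_20240131", "output_folder_a_20240201_20240229"):
    (root / folder).mkdir()
    (root / folder / "data.csv").write_text("x\n1\n")

print(discover_partitions(root))
```

So even though the subfolder names were generated at run time, a single catalog entry pointing at the parent directory can pick them all up later.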
