Accessing Factory Datasets from a DataCatalog/KedroDataCatalog Instance

Question

Hi team!

Is there a way to resolve factory datasets and access them from a DataCatalog/KedroDataCatalog instance?

I noticed that listing datasets from the CLI with kedro catalog list automatically resolves them (for a given pipeline - see this bit of code), while calling catalog.list() in a kedro jupyter notebook lists only non-factory datasets (and parameters). Do the two return different outputs by design, or is it a bug?

Thanks!

Ankita Katiyar · Accepted Answer

The factory datasets are lazy, so they don’t show up in catalog.list() (discussion in https://github.com/kedro-org/kedro/issues/3312).

With the new catalog you can do catalog["<dataset_name>"] and it’ll resolve and get you the factory dataset.

for dataset in pipelines['__default__'].datasets():
    catalog.exists(dataset)  # or catalog.get_dataset(dataset)

# now it'll show up
catalog.list()
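
To illustrate why touching each dataset makes it appear in the listing, here is a minimal toy sketch of lazy pattern resolution. It is not Kedro's implementation: the LazyCatalog class, its get and list methods, and the dataset names are all hypothetical, and a plain string stands in for a real dataset object.

```python
import re


class LazyCatalog:
    """Toy stand-in for a catalog with dataset factories (hypothetical, not Kedro's class)."""

    def __init__(self, datasets, patterns):
        self._datasets = dict(datasets)  # explicitly registered entries
        self._patterns = dict(patterns)  # factory patterns, e.g. "{name}_csv"

    @staticmethod
    def _to_regex(pattern):
        # Turn "{name}_csv" into a regex with one named group per placeholder.
        parts = re.split(r"\{(\w+)\}", pattern)
        out = ""
        for i, part in enumerate(parts):
            # Odd positions hold placeholder names, even positions literal text.
            out += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        return re.compile(f"^{out}$")

    def get(self, name):
        # Resolve lazily: a pattern match is only materialised on first access.
        if name not in self._datasets:
            for pattern, template in self._patterns.items():
                match = self._to_regex(pattern).match(name)
                if match:
                    self._datasets[name] = template.format(**match.groupdict())
                    break
            else:
                raise KeyError(name)
        return self._datasets[name]

    def list(self):
        # Only materialised entries are listed; unresolved patterns are invisible.
        return sorted(self._datasets)


catalog = LazyCatalog(
    datasets={"companies": "data/companies.csv"},
    patterns={"{name}_csv": "data/{name}.csv"},
)
before = catalog.list()                # factory-backed name is not listed yet
resolved = catalog.get("reviews_csv")  # first access resolves the pattern
after = catalog.list()                 # the resolved name is now listed
```

In this toy model, as in the loop above, a name that matches a pattern only becomes part of the listing once something has asked for it by name.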

Guillaume Tauzin · Answer

For a bit of context, I noticed this while using vizro, which has a kedro integration that relies on catalog.list.

In practice, I would like to look up by name a dataset that is defined by a dataset factory and get its load function.

datajoely · Answer

There is a trick for doing this until we fix it. Essentially, .list() needs to match the patterns before they show up, so you can do catalog.list(Pipeline.inputs() | Pipeline.outputs())

Guillaume Tauzin · Answer

Thanks @datajoely, it does not work out of the box:

from kedro.framework.project import pipelines
pipeline = pipelines.get("__default__")
catalog.list(pipeline.inputs() | pipeline.outputs())

returns AttributeError: 'set' object has no attribute 'strip'. Seems like regex_search is supposed to be a string? If I pass regex_search=".*KWD.*", where KWD is part of the name of one of my factory datasets, it also does not find it.
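
The error message suggests that regex_search is handled as a string, so a set of names cannot be passed directly. One way to express "any of these names" as a single string is a regex alternation, sketched below with the standard library only. The names_to_regex helper and the dataset names are hypothetical. As observed above, filtering by regex still only sees names the catalog already lists, so this alone would not surface unresolved factory datasets.

```python
import re


def names_to_regex(names):
    """Build one regex string that matches any of the given names exactly."""
    return "^(" + "|".join(re.escape(n) for n in sorted(names)) + ")$"


# Hypothetical pipeline inputs/outputs, one of them backed by a factory pattern.
wanted = {"companies", "reviews_csv"}
pattern = names_to_regex(wanted)

# Hypothetical names that a catalog listing currently knows about.
registered = ["companies", "parameters", "shuttles"]
matches = [n for n in registered if re.search(pattern, n, flags=re.IGNORECASE)]
```

Here matches contains only the names that were already registered, which mirrors the behaviour reported in this post.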

Guillaume Tauzin · Answer

Thank you @Ankita Katiyar! That's perfect. PS: I opened an issue to fix this on the vizro side.
