Hi team!
Is there any way to resolve factory datasets and access them from a DataCatalog/KedroDataCatalog instance?
I noticed that listing datasets from the CLI with kedro catalog list automatically resolves them (for a given pipeline - see this bit of code), while calling catalog.list() in a kedro jupyter notebook only lists non-factory datasets (and parameters). Do the two return different outputs by design, or is it a bug?
Thanks!
The factory datasets are lazy so they don’t show up in catalog.list()
(Discussion in https://github.com/kedro-org/kedro/issues/3312)
With the new catalog you can do:
catalog["<dataset_name>"]
and it'll resolve and get you the factory dataset.
for dataset in pipelines['__default__'].datasets():
    catalog.exists(dataset)  # or catalog.get_dataset(dataset)
# now it'll show up in catalog.list()
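To make the lazy behaviour described above concrete, here is a minimal standard-library sketch. ToyCatalog, its pattern syntax handling, and the dataset names are all hypothetical and are not Kedro's implementation; the sketch only illustrates why a pattern-backed entry is missing from list() until something asks for it by name.

```python
import re


class ToyCatalog:
    """Hypothetical stand-in for a catalog with lazily resolved factory patterns."""

    def __init__(self, datasets, patterns):
        self._datasets = dict(datasets)  # explicitly declared entries
        self._patterns = dict(patterns)  # factory patterns such as "{name}_csv"

    @staticmethod
    def _to_regex(pattern):
        # Turn "{name}_csv" into a regex with one named group per placeholder.
        parts = re.split(r"\{(\w+)\}", pattern)
        out = ""
        for i, part in enumerate(parts):
            # Odd indices hold placeholder names, even indices hold literal text.
            out += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        return re.compile(out)

    def _resolve(self, name):
        if name in self._datasets:
            return True
        for pattern, template in self._patterns.items():
            match = self._to_regex(pattern).fullmatch(name)
            if match:
                # Materialise the entry so later list() calls include it.
                self._datasets[name] = template.format(**match.groupdict())
                return True
        return False

    def exists(self, name):
        return self._resolve(name)

    def __getitem__(self, name):
        if not self._resolve(name):
            raise KeyError(name)
        return self._datasets[name]

    def list(self):
        return sorted(self._datasets)


catalog = ToyCatalog(
    datasets={"companies": "data/01_raw/companies.csv"},
    patterns={"{name}_csv": "data/02_intermediate/{name}.csv"},
)

before = catalog.list()  # only the explicitly declared entry
for dataset in ["companies", "reviews_csv", "shuttles_csv"]:
    catalog.exists(dataset)  # touching a name resolves its pattern
after = catalog.list()  # pattern-backed names are now included
```

In this toy model, looping over the pipeline's dataset names plays the same role as the loop over pipelines['__default__'].datasets() above: each lookup materialises one pattern match, which is why the names only appear in the listing afterwards.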
For a bit of context, I noticed this while using vizro which has a kedro integration that relies on catalog.list.
In practice, I would like to look up by name a dataset that is defined by a dataset factory and get its load function.
So there is a trick for doing this before we fix this.
Essentially .list() needs to match the patterns before they show up so you can do catalog.list(Pipeline.inputs() | Pipeline.outputs())
Thanks @datajoely, it does not work out of the box:
from kedro.framework.project import pipelines
pipeline = pipelines.get("__default__")
catalog.list(pipeline.inputs() | pipeline.outputs())
AttributeError: 'set' object has no attribute 'strip'
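The traceback suggests that list() expects a single regex string and calls .strip() on it, so a set of names fails. The sketch below reproduces that with a hypothetical list_names helper (an assumption modelled on the error message, not Kedro's actual code) and shows one way to turn a set of names into a regex string. Note that filtering by regex only covers entries that are already materialised, so on its own it would not resolve factory patterns.

```python
import re


def list_names(names, regex_search=None):
    """Hypothetical helper modelled on a list method that takes one regex string."""
    if regex_search is None:
        return list(names)
    if not regex_search.strip():  # raises AttributeError when given a set
        return []
    pattern = re.compile(regex_search, flags=re.IGNORECASE)
    return [name for name in names if pattern.search(name)]


names = ["companies", "reviews", "params:alpha"]
wanted = {"companies", "reviews"}

# Passing the set directly reproduces the kind of error seen in the thread.
try:
    list_names(names, wanted)
    error = None
except AttributeError as exc:
    error = str(exc)

# Joining the escaped names into one alternation gives a valid regex string.
regex = "|".join(f"^{re.escape(name)}$" for name in sorted(wanted))
matched = list_names(names, regex)
```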
Thank you @Ankita Katiyar! That's perfect.
PS: I opened an issue to fix this on the vizro side.