
Updated 2 weeks ago

Accessing Factory Datasets from a DataCatalog/KedroDataCatalog Instance

At a glance

The community member is facing an issue where factory datasets do not show up in the output of catalog.list() in a Kedro Jupyter notebook, while the CLI command kedro catalog list resolves and lists them. The community members discuss potential solutions, including calling catalog.list(Pipeline.inputs() | Pipeline.outputs()) and accessing the datasets directly with catalog["<dataset_name>"]. The accepted answer explains that factory datasets are lazy-loaded, and recommends calling catalog.exists(dataset) or catalog.get_dataset(dataset) for each dataset, after which the factory datasets show up in the catalog.list() output.


Hi team!

Is there any way to resolve factory datasets and access them from a DataCatalog/KedroDataCatalog instance?

I noticed that listing datasets with the CLI command kedro catalog list automatically resolves them (for a given pipeline - see this bit of code), while calling catalog.list() in a kedro jupyter notebook lists only non-factory datasets (and parameters). Do the two return different outputs by design, or is it a bug?

Thanks!

Marked as solution

The factory datasets are lazy, so they don't show up in catalog.list() (discussion in https://github.com/kedro-org/kedro/issues/3312).

With the new catalog you can do:

catalog["<dataset_name>"]

and it'll resolve and get you the factory dataset.

To make them show up in catalog.list(), resolve each dataset in the pipeline first:

from kedro.framework.project import pipelines

for dataset in pipelines['__default__'].datasets():
  catalog.exists(dataset) # or catalog.get_dataset(dataset)

# now it'll show up
catalog.list()
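
To illustrate why the loop above changes what catalog.list() returns, here is a minimal toy model of lazy pattern resolution. It is a sketch under assumptions, not Kedro's implementation: ToyCatalog, its attributes, and the example names and paths are all hypothetical, and the pattern matching is simplified to a regex.

```python
import re


class ToyCatalog:
    """Toy stand-in for a catalog with lazily resolved factory patterns.

    Hypothetical: this is not Kedro's code, only a model of the behaviour
    described in the thread.
    """

    def __init__(self, datasets, patterns):
        self._datasets = dict(datasets)  # explicitly declared entries
        self._patterns = dict(patterns)  # factory patterns, e.g. "{name}_raw"

    @staticmethod
    def _to_regex(pattern):
        # Turn "{name}_raw" into the regex "^(?P<name>.+)_raw$".
        parts = re.split(r"\{(\w+)\}", pattern)
        regex = "".join(
            f"(?P<{part}>.+)" if i % 2 else re.escape(part)
            for i, part in enumerate(parts)
        )
        return re.compile(f"^{regex}$")

    def list(self):
        # Only materialised entries are listed; patterns are not expanded.
        return sorted(self._datasets)

    def exists(self, name):
        # Matching a name against a pattern materialises it as a side effect.
        if name in self._datasets:
            return True
        for pattern, template in self._patterns.items():
            match = self._to_regex(pattern).match(name)
            if match:
                self._datasets[name] = template.format(**match.groupdict())
                return True
        return False


catalog = ToyCatalog(
    datasets={"companies": "data/companies.csv"},
    patterns={"{name}_raw": "data/01_raw/{name}.csv"},
)

before = catalog.list()  # the pattern entry is not visible yet
for dataset in ["companies", "shuttles_raw"]:
    catalog.exists(dataset)  # resolves "shuttles_raw" against the pattern
after = catalog.list()

print(before, after)
```

In this toy model, a name produced by a pattern is only added to the listing once something asks for it, which mirrors the "lazy" behaviour described in the answer.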

5 comments

For a bit of context, I noticed this while using vizro which has a kedro integration that relies on catalog.list.

In practice, I would like to query by name a dataset defined by a factory dataset and get its load function.

So there is a trick for doing this before we fix this.

Essentially .list() needs to match the patterns before they show up so you can do catalog.list(Pipeline.inputs() | Pipeline.outputs())

Thanks @datajoely, it does not work out of the box:

from kedro.framework.project import pipelines
pipeline = pipelines.get("__default__")
catalog.list(pipeline.inputs() | pipeline.outputs())

returns

AttributeError: 'set' object has no attribute 'strip'

Seems like regex_search is supposed to be a string?

If I pass `regex_search=".*KWD.*"`, where KWD is part of the name of one of my factory datasets, it also does not find it.
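
The AttributeError suggests that regex_search expects a single string rather than a set. A minimal sketch of turning a set of names into one regex string follows, assuming the filter is a case-insensitive regex search over names already registered in the catalog. Here filter_names is a hypothetical stand-in for catalog.list, and the dataset names are made up.

```python
import re


def names_to_regex(names):
    """Join dataset names into one anchored, escaped alternation."""
    return "^(?:" + "|".join(re.escape(n) for n in sorted(names)) + ")$"


def filter_names(registered, regex_search):
    """Hypothetical stand-in for catalog.list(regex_search=...)."""
    pattern = re.compile(regex_search, flags=re.IGNORECASE)
    return [name for name in registered if pattern.search(name)]


# Names currently registered; the factory dataset has not been resolved yet.
registered = ["companies", "parameters"]

# Names a pipeline refers to, including one that only a pattern would produce.
wanted = {"companies", "france_KWD_raw"}

regex_search = names_to_regex(wanted)
found = filter_names(registered, regex_search)
print(found)
```

Under these assumptions, passing a string avoids the error, but an unresolved factory dataset is still not found, because the filter only sees names that are already registered.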


Thank you @Ankita Katiyar! That's perfect.

PS: I opened an issue to fix this on the vizro side.
