Hello, I've worked on a lot of Kedro pipelines over the past ~year and am a big fan, but there's one detail that I've seen cause some very confusing problems, and I'd like help with it.
Whenever there's an error loading a pipeline, whether it be a syntax error, a missing import, etc., instead of the Kedro process erroring out, it just skips that pipeline and carries on without it. This is not only confusing, but can lead to some pretty big problems in a model without any errors being raised.
I was wondering how I disable this, forcing Kedro to raise errors when loading pipelines? I tried googling but couldn't find anything.
Thanks!
See https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_registry.html#pipeline-autodiscovery and specify raise_errors=True when calling find_pipelines().
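For reference, a minimal sketch of a pipeline_registry.py with that flag set. It assumes a Kedro version whose find_pipelines() accepts raise_errors, and the layout of the standard project template; adjust it to your own registry.

```python
# src/<your_package>/pipeline_registry.py  (path follows the standard template)
from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> dict[str, Pipeline]:
    """Register the project's pipelines, failing loudly on a broken one."""
    # raise_errors=True makes autodiscovery raise instead of warning and
    # skipping a pipeline that fails to import.
    pipelines = find_pipelines(raise_errors=True)
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines
```

With this in place, a syntax error or missing import in any discovered pipeline module should stop the run at startup rather than silently dropping that pipeline.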
Think this would be just for the pipeline code; I don't recall exactly when errors in dataset code show up.
I'm not at my computer (and probably can't look into this today), but if you can easily test the custom dataset error bit and see what the behavior is, we could see if it can be improved.
Sounds good / no rush... the dataset one is luckily easier to deal with. I also realized I phrased the question poorly.
If I have an error in a custom dataset (a missing import, for example), it will error out, but instead of showing the real error it will say "dataset.my_dataset.FooBar" not found, install kedro-datasets
This is definitely a better default behavior than the pipeline one I mentioned, and I know how to deal with it. I tell the rest of my team to try importing the dataset via the python interpreter to find the real error...
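The interpreter trick boils down to importing the class by its full dotted path, so the original exception is shown untouched. A minimal sketch using only the standard library; load_class is a hypothetical helper name, not a Kedro function, and the stdlib path in the usage lines is only a stand-in for your own dataset path:

```python
import importlib


def load_class(dotted_path: str):
    """Import 'package.module.ClassName' and return the class.

    Nothing is caught here, so a missing import or syntax error inside
    the module surfaces with its real traceback.
    """
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


if __name__ == "__main__":
    # Stand-in path; replace with e.g. your own dataset's dotted path.
    cls = load_class("collections.OrderedDict")
    print(cls)
```

Running this against the path from the catalog entry should print the underlying ImportError or SyntaxError instead of the generic "not found" message.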
All that being said, IMO it would be better if kedro would raise the actual error inside a dataset
I don't have any good publicly available kedro pipelines I can use to make your life easier reproducing atm
I guess the pipeline part is solved.
For the dataset part, we tried to improve this in the past. AFAIK there is still an open ticket to improve it further. But you are right, importing it directly will reveal the issue better.
Maybe it helps to explain why this happens.
In the past, datasets were all part of Kedro, implemented in kedro.extras.datasets.
The behaviour is still consistent: you can define a dataset as pandas.CSVDataset instead of kedro.extras.datasets.pandas.CSVDataset.
Behind the scenes, Kedro always searches for datasets in a few predefined places.
This allows a shorter type in the catalog, but it also means that when there is an import error, we can only make a best-effort guess at whether it comes from a missing module, partially missing dependencies, or a dataset that simply doesn't exist.
Say you have a dataset called dataset.mydataset.
Kedro will search kedro.io.dataset.mydataset, then kedro_datasets.dataset.mydataset, and finally what is desired in this case, which is just dataset.mydataset.
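A rough sketch of that search, to show why the real error gets lost. This is not Kedro's actual implementation: the prefix list is taken from the message above, and candidate_paths and resolve are hypothetical names. The point is that a failed import at any candidate location looks the same from the outside, whether the module is absent or broken.

```python
import importlib

# Search locations as described in the thread (assumed, not copied from Kedro).
PREFIXES = ["kedro.io.", "kedro_datasets.", ""]


def candidate_paths(type_name: str) -> list[str]:
    """Expand a catalog type into every dotted path that would be tried."""
    return [prefix + type_name for prefix in PREFIXES]


def resolve(type_name: str):
    """Try each candidate in order; return (class_or_None, errors_by_path)."""
    errors = {}
    for path in candidate_paths(type_name):
        module_path, _, class_name = path.rpartition(".")
        if not module_path:
            continue  # no module part to import
        try:
            module = importlib.import_module(module_path)
        except ImportError as exc:
            # "Module is absent" and "module exists but one of its own
            # imports failed" both end up here, so the cause is ambiguous.
            errors[path] = exc
            continue
        if hasattr(module, class_name):
            return getattr(module, class_name), errors
    return None, errors


if __name__ == "__main__":
    print(candidate_paths("dataset.mydataset"))
```

Keeping the collected exceptions, as the errors dict does here, is one way such a lookup could report what went wrong at each location instead of a single generic message.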
Interesting... that explanation clears up some other weird issues I've had with custom dataset errors.
I don't have any strong opinions or suggestions on how to solve this dataset import error issue, but I'll ponder it.
Actually... here's a thought...
What if I could define / register my custom dataset in settings.py, to be able to map 'mydataset': myproject.datasets.mydataset?
I have no idea how hard it would be to do that with the current Kedro architecture, but it could exist alongside the current system of dataset discovery.
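To make the idea concrete, here is a purely hypothetical sketch. Kedro has no such setting as far as this thread establishes; CUSTOM_DATASETS and resolve_registered are invented names for illustration only. Because the mapping names exactly one location, the import can be done directly and any error inside the dataset module propagates as-is.

```python
import importlib

# Hypothetical registry, imagined as living in settings.py.
# NOT an existing Kedro setting.
CUSTOM_DATASETS = {
    "mydataset": "myproject.datasets.mydataset",
}


def resolve_registered(name: str, registry: dict[str, str] = CUSTOM_DATASETS):
    """Look up a short name and import its single, explicit location."""
    dotted_path = registry[name]
    module_path, _, attr_name = dotted_path.rpartition(".")
    # No search and no guessing: whatever the import raises is the real error.
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)
```

A lookup like this could run before the prefix search, falling back to the existing discovery only for names that are not registered.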