Am I correct in understanding that Kedro-Pandera currently only works with pandas schemas? I saw that it uses pandera.io.deserialize_schema under the hood in its schema resolver, and that seems to be implemented in pandera only for pandas, is that right?
Hiya Deepy, I think I've just discovered the same thing. Seems only pandas is supported so far
I do wonder though, Kedro already does have the dataset definition. Updating the code to ensure that the dataset type is used to construct the proper Pandera object should not be much of a stretch
Hi, sorry I missed it. Yes, that is the case, but hopefully you can build your own resolver to pass another schema; I'm not absolutely sure how the hook will behave though. There is still a *lot* to do for this plugin, and unfortunately I don't think it will happen in the foreseeable future.
"I do wonder though, Kedro already does have the dataset definition. Updating the code to ensure that the dataset type is used to construct the proper Pandera object should not be much of a stretch"
Yeah. I'm guessing this is also not a huge lift on the pandera side to just include the parser for other schemas; they all look pretty similar.
"Kedro already does have the dataset definition. Updating the code to ensure that the dataset type is used to construct the proper Pandera object should not be much of a stretch"
Can you explain a little more on this?
I see. I think the question here is that kedro-pandera relies on pandera to do this deserialisation step (from YAML to object). pandera only supports pandas so far: https://github.com/unionai-oss/pandera/blob/main/pandera/io/pandas_io.py
could that info be used to construct the correct schema object in pandera?
"infer_schema is another piece of the functionality. But you don't necessarily want to set your validation rule based on the inferred schema, maybe you want some subset or something. I think this is more P2 functionality."
Two things coming together here indeed. I was not saying we should validate based on the inferred schema from the dataset, I intended to say that the type (and only the type, i.e., SparkDataset, PandasDataset) should aid in parsing the yaml into the *correct* Pandera object.
Specifically:
SparkDataset results in creation of pandera.pyspark.DataFrameModel
PandasDataset results in creation of pandera.api.pandas.model.DataFrameModel
and so on
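
To make the idea concrete, here is a rough sketch of what that type-based dispatch could look like. It is not how kedro-pandera works today. The two class paths are the ones mentioned above; the lookup on the catalog `type` prefix ("pandas", "spark") and the names `SCHEMA_CLASS_BY_DATASET_PREFIX`, `schema_class_path` and `load_schema_class` are made up for illustration.

```python
from importlib import import_module

# Hypothetical registry: prefix of the catalog `type` string -> dotted path of
# the pandera model class. The paths come from this thread; keying on the
# prefix is an assumption, not existing kedro-pandera behaviour.
SCHEMA_CLASS_BY_DATASET_PREFIX = {
    "pandas": "pandera.api.pandas.model.DataFrameModel",
    "spark": "pandera.pyspark.DataFrameModel",
}


def schema_class_path(dataset_type: str) -> str:
    """Return the dotted path of the pandera class for a catalog dataset type."""
    # "spark.SparkDataset" -> "spark", "pandas.CSVDataset" -> "pandas"
    prefix = dataset_type.split(".", 1)[0].lower()
    try:
        return SCHEMA_CLASS_BY_DATASET_PREFIX[prefix]
    except KeyError:
        raise NotImplementedError(
            f"No pandera backend registered for dataset type {dataset_type!r}"
        ) from None


def load_schema_class(dataset_type: str):
    """Import the class lazily, so backends that are not used need not be installed."""
    module_path, _, class_name = schema_class_path(dataset_type).rpartition(".")
    return getattr(import_module(module_path), class_name)
```

With this, `schema_class_path("spark.SparkDataset")` would pick the PySpark model and `schema_class_path("pandas.CSVDataset")` the pandas one. Parsing the YAML into an instance of that class would still need deserialisation support on the pandera side for each backend.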