Hello Kedro community,
I currently have a problem with how my databricks.ManagedTableDataset
is created (a typing problem: the precision of a DecimalType).
To avoid that, I want to define the schema in my YAML file, so that the ManagedTableDataset in Databricks is created with the schema I expect.
Would you have a YAML example of how to define this schema? (with DecimalType if possible 🙂). I did not find any example, and IntegerType (a Spark type), for instance, did not match anything.
Thanks and have a good day !
Are you able to create it in Python code first? If you can do this, it's a simple process to move it to YAML.
I assume these are not primitive types, so you will likely need to use a custom resolver. Searching for something like "Polars type" or "custom resolver" in our configuration docs should give you some examples.
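Something along these lines (a rough, untested sketch, all names are placeholders):

# settings.py
import json
from pathlib import Path
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        # hypothetical resolver: load a schema definition from a JSON file
        "spark_schema": lambda path: json.loads(Path(path).read_text()),
    }
}

# catalog.yml
my_table:
  type: databricks.ManagedTableDataset
  table: my_table
  schema: "${spark_schema:conf/base/schemas/my_table.json}"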
Thank you for the answer 🙂
I am not able to find an example of how to define a ManagedTable schema.
By exploring the source code, I found that we need to pass a JSON-style schema definition that is read as follows:
from pyspark.sql.types import StructType
StructType.fromJson(schema)
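If I read the pyspark source correctly, fromJson expects the already-parsed dict form, roughly like this (column names are just examples):

from pyspark.sql.types import StructType

schema_dict = {
    "type": "struct",
    "fields": [
        {"name": "content_id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "amount", "type": "decimal(18,2)", "nullable": True, "metadata": {}},
    ],
}
schema = StructType.fromJson(schema_dict)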
I will follow this lead and share my results.
def __init__(  # noqa: PLR0913
    self,
    *,
    table: str,
    catalog: str | None = None,
    database: str = "default",
    write_mode: str | None = None,
    dataframe_type: str = "spark",
    primary_key: str | list[str] | None = None,
    version: Version | None = None,
    # the following parameters are used by project hooks
    # to create or update table properties
    schema: dict[str, Any] | None = None,
    partition_columns: list[str] | None = None,
    owner_group: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> None:
After trying several methods, here I am:
Failed to convert the JSON string '{"metadata":{},"name":"content_id","nullable":"true","type":"long"}' to a field.
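My guess (not verified) is that nullable ends up as the string "true" rather than a boolean, while Spark seems to want something like:

{"metadata": {}, "name": "content_id", "nullable": true, "type": "long"}

i.e. an unquoted boolean.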
If we do this directly in Python, yes, but I am using a YAML template to define the schema.
I do not know how the translation between YAML and Python is done. Would you have an idea?
Thanks, please let me know if this works. I would like to have a look at this dataset soon; I don't think having JSON in a YAML file is the best developer experience, and we can improve this.
Thank you for that, it worked indeed.
Yes, you are right. It would be cool to have a place to store all schemas and reuse them inside the YAML conf or directly in nodes.
Anyway, thank you for your help 🙏
Here is what worked as a schema definition in YAML.
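Roughly this shape (I've swapped in illustrative table and column names; only content_id is from my real table):

# catalog.yml
my_managed_table:
  type: databricks.ManagedTableDataset
  catalog: my_catalog
  database: my_database
  table: my_table
  write_mode: overwrite
  schema:
    type: struct
    fields:
      - name: content_id
        type: long
        nullable: true       # unquoted, so YAML parses it as a boolean
        metadata: {}
      - name: amount
        type: decimal(18,2)  # DecimalType with precision 18 and scale 2
        nullable: true
        metadata: {}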
It would be cool to have an example in the documentation 🙂 (there is currently no example).
And also to explain how to choose the column type names (we can find them by doing StringType().typeName(), or the same on any other type).
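For example:

from pyspark.sql.types import DecimalType, IntegerType, StringType

StringType().typeName()            # 'string'
IntegerType().typeName()           # 'integer'
DecimalType(18, 2).simpleString()  # 'decimal(18,2)' -- decimals carry precision/scale in this form

(the precision/scale values here are just examples)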
Have a good day :)
Makes sense. I think this is coming more from a Spark-first world, where schema is a first-class citizen. Normally you have strongly typed data, so when you load it you already have the schema there, or it's a decision that you need to make yourself.
In case you have the StructType object already, you can convert it with .json()
as well. Maybe linking the Spark docs would help? https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.StructType.html
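For example (the column name is just a placeholder):

from pyspark.sql.types import DecimalType, StructField, StructType

schema = StructType([StructField("amount", DecimalType(18, 2), True)])
print(schema.json())       # JSON string you could save to a file
print(schema.jsonValue())  # the same structure as a Python dict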