Hello Kedro community,
I currently have a problem with how my databricks.ManagedTableDataset
is created (a typing problem: the precision of a DecimalType).
To avoid that, I want to define the schema in my YAML file, so that the ManagedTableDataset in Databricks is created with the schema I expect.
Would you have a YAML example of how to define this schema? (with DecimalType if possible 🙂). I did not find any example, and IntegerType (a Spark type), for instance, did not match anything.
Thanks and have a good day !
Are you able to create it in Python code first? If you can do this, it's a simple process to move it to YAML.
I assume these are not primitive types, so you will likely need to use a custom resolver. Searching for something like "Polars type" or "custom resolver" in our configuration docs should give you some examples.
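Something along these lines (a rough, untested sketch, all names are placeholders):

# settings.py
import json
from pathlib import Path
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        # hypothetical resolver: load a schema definition from a JSON file
        "spark_schema": lambda path: json.loads(Path(path).read_text()),
    }
}

# catalog.yml
my_table:
  type: databricks.ManagedTableDataset
  table: my_table
  schema: "${spark_schema:conf/base/schemas/my_table.json}"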
Thank you for the answer 🙂
I am not able to find an example of how to define a ManagedTable schema.
By exploring the source code, I found that we need to pass a JSON-style schema definition that is read as follows:
from pyspark.sql.types import StructType
StructType.fromJson(schema)
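If I read the pyspark source correctly, fromJson expects the already-parsed dict form, roughly like this (column names are just examples):

from pyspark.sql.types import StructType

schema_dict = {
    "type": "struct",
    "fields": [
        {"name": "content_id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "amount", "type": "decimal(18,2)", "nullable": True, "metadata": {}},
    ],
}
schema = StructType.fromJson(schema_dict)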
I will follow this lead and share my results.
def __init__(  # noqa: PLR0913
    self,
    *,
    table: str,
    catalog: str | None = None,
    database: str = "default",
    write_mode: str | None = None,
    dataframe_type: str = "spark",
    primary_key: str | list[str] | None = None,
    version: Version | None = None,
    # the following parameters are used by project hooks
    # to create or update table properties
    schema: dict[str, Any] | None = None,
    partition_columns: list[str] | None = None,
    owner_group: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> None:
After trying several methods, here I am:
Failed to convert the JSON string '{"metadata":{},"name":"content_id","nullable":"true","type":"long"}' to a field.
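My guess (not verified) is that nullable ends up as the string "true" rather than a boolean, while Spark seems to want something like:

{"metadata": {}, "name": "content_id", "nullable": true, "type": "long"}

i.e. an unquoted boolean.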
If we do this directly in Python, yes, but I am using a YAML template to define the schema.
I do not know how the translation between YAML and Python is done. Would you have an idea?
Thanks, please let me know if this works. I would like to have a look at this dataset soon; I don't think having JSON in a YAML file is the best developer experience, and we can improve this.
Thank you for that, it worked indeed.
Yes, you are right. It would be cool to have a place to store all schemas and reuse them inside the YAML conf or directly in nodes.
Anyway, thank you for your help 🙏
Here is what worked as a schema definition in YAML.
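Roughly this shape (I've swapped in illustrative table and column names; only content_id is from my real table):

# catalog.yml
my_managed_table:
  type: databricks.ManagedTableDataset
  catalog: my_catalog
  database: my_database
  table: my_table
  write_mode: overwrite
  schema:
    type: struct
    fields:
      - name: content_id
        type: long
        nullable: true       # unquoted, so YAML parses it as a boolean
        metadata: {}
      - name: amount
        type: decimal(18,2)  # DecimalType with precision 18 and scale 2
        nullable: true
        metadata: {}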
It would be cool to have an example in the documentation 🙂 (there is currently no example).
And also to explain how to choose the column type names (we can find them by doing StringType().typeName(), or the same on any other type).
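For example:

from pyspark.sql.types import DecimalType, IntegerType, StringType

StringType().typeName()            # 'string'
IntegerType().typeName()           # 'integer'
DecimalType(18, 2).simpleString()  # 'decimal(18,2)' -- decimals carry precision/scale in this form

(the precision/scale values here are just examples)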
Have a good day :)
Makes sense. I think this is coming more from a Spark-first world, where schema is a first-class citizen. Normally you have strongly typed data, so when you load it you already have the schema there, or it's a decision that you need to make yourself.
In case you have the StructType object already, you can convert it with .json()
as well. Maybe linking the Spark docs would help? https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.StructType.html
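For example (the column name is just a placeholder):

from pyspark.sql.types import DecimalType, StructField, StructType

schema = StructType([StructField("amount", DecimalType(18, 2), True)])
print(schema.json())       # JSON string you could save to a file
print(schema.jsonValue())  # the same structure as a Python dict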