How to Initialize a Delta Table That Doesn't Exist Using Kedro

Hello everyone,
I’m facing an issue related to the Delta library. When I attempt to read a Delta table using spark.DeltaTableDataset, I receive a message stating that the table does not exist. This is expected since the table hasn't been created yet. However, my goal is to initialize the table with data that I will subsequently provide.
Unfortunately, the DeltaTableDataset does not support write operations. Does anyone know how to handle the initialization of a Delta table in this scenario?
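For reference, this is roughly how I'm loading the table directly — the filepath here is just a placeholder for the one in my catalog:

from kedro_datasets.spark import DeltaTableDataset

# Placeholder path; the real location comes from my catalog entry.
dataset = DeltaTableDataset(filepath="data/03_primary/my_table")
delta_table = dataset.load()  # fails because no Delta table exists at that path yet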
Currently, I am working on a custom hook using the @hook_impl decorator:

from kedro.framework.hooks import hook_impl
from kedro.pipeline.node import Node

@hook_impl
def before_dataset_loaded(self, dataset_name: str, node: Node) -> None:
    ...  # My logic to initialize the Delta table
The idea is to initialize the Delta table (if it doesn’t already exist) using PySpark within this hook. However, I am struggling to retrieve the table's schema dynamically in order to create it.
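For now, the initialization I have in mind inside the hook looks roughly like this — the schema and path are hard-coded placeholders, which is exactly the part I'd like to make dynamic:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

def init_delta_table(path: str) -> None:
    # Placeholder schema; ideally this would come from the catalog instead of being hard-coded.
    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=True),
    ])
    spark = SparkSession.builder.getOrCreate()
    # Writing an empty DataFrame in "ignore" mode creates the Delta table
    # only when nothing exists at the target path yet.
    spark.createDataFrame([], schema).write.format("delta").mode("ignore").save(path)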
If anyone has encountered a similar situation or has insights on how to resolve this, I would greatly appreciate your help!
Thank you in advance for your support!

2 comments

Hey @Mohamed El Guendouz, our @juanlu tried this DeltaTable.is_deltatable() hack before when working with Delta tables.

https://github.com/delta-io/delta-rs/pull/2715
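With the deltalake package (delta-rs), the check looks roughly like this — the path is just an example:

from deltalake import DeltaTable

table_path = "data/03_primary/my_table"  # example path

# is_deltatable() only returns True when a valid Delta table already exists at the path,
# so it can guard the initialization instead of letting the load fail.
if not DeltaTable.is_deltatable(table_path):
    ...  # create/seed the table here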

Thank you @Huong Nguyen! 🙂

Ultimately, I created a custom dataset so that I could give it a specific schema.

  • I added the schema to the catalog and then created a hook that runs before each read operation.
  • This hook checks whether the dataset being loaded is an instance of my new custom dataset and, if so, initializes the Delta table when it doesn't already exist.
This ensures the table is created if it doesn't already exist, and it is then updated within the node once initialized.
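Roughly, the hook ended up looking like this — the dataset class name and its path/schema attributes are just illustrative, and I go through the catalog's internal _get_dataset() to grab the dataset instance:

from delta.tables import DeltaTable
from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession

from my_project.datasets import InitialisableDeltaDataset  # illustrative custom dataset


class DeltaInitHook:
    @hook_impl
    def after_catalog_created(self, catalog) -> None:
        # Keep a reference to the catalog so the dataset instance can be inspected later.
        self._catalog = catalog

    @hook_impl
    def before_dataset_loaded(self, dataset_name: str, node) -> None:
        dataset = self._catalog._get_dataset(dataset_name)
        if not isinstance(dataset, InitialisableDeltaDataset):
            return
        spark = SparkSession.builder.getOrCreate()
        # Create an empty Delta table from the schema declared in the catalog
        # when nothing exists at the target path yet; the node then updates it.
        if not DeltaTable.isDeltaTable(spark, dataset.path):
            spark.createDataFrame([], dataset.schema) \
                .write.format("delta").mode("ignore").save(dataset.path)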

Thanks again for your suggestion!
