
Create a data catalog for files in Databricks Unity Catalog

At a glance

The community member is facing an issue with creating a data catalog for files in Databricks Unity Catalog. With Kedro they manually specified absolute paths for input and output files in DBFS, but with Unity Catalog the paths become relative and contain hash keys. They can fetch absolute paths for input files, but are unsure how to construct such paths for output files.

In the comments, another community member suggests using the ManagedTableDataset instead of SparkDataset and provides an example configuration. The original community member decides to try this approach.

Additionally, the community member asks whether the input tables, which they ingest into Unity Catalog from raw tables in an external location, must be written in a different way, for example with ManagedTableDataset. They later confirm that this approach works and resolves the issue.

Answer: The community members suggest using ManagedTableDataset instead of SparkDataset to create a data catalog for files in Databricks Unity Catalog.

Subject: Create a data catalog for files in Databricks Unity Catalog

Hello everyone.
When I create a data catalog for a Kedro project, I manually specify the absolute paths for input and output files that reside in DBFS, as required by Kedro.
But since I started using Unity Catalog, these paths become relative, and the absolute paths start having hash keys in them.
It is easy to fetch absolute paths for input files, but how can we create such paths for output files?


hi @FIRAS ALAMEDDINE,

> But since I started using Unity Catalog, these paths become relative, and the absolute paths start having hash keys in them.

could you clarify this a bit more?

I saved files on DBFS at first. When defining a data catalog, we used this format:

ingest_dict= {"df1": "dbfs:/Filestore/tables/path/to/df1",
              "df2": "dbfs:/Filestore/tables/path/to/df2",
                          ...}
dict_ingest_catalog = {}
for table in ingest_dict:
    a_df = SparkDataSet(
        filepath=ingest_dict[table],
        file_format='parquet',
            load_args={"header": True, "inferSchema": True,"nullValue" : "NA" },
            save_args={"sep": ",", "header": True, "mode":"overwrite"},
        )
     dict_ingest_catalog[table] = a_df

full_catalog = DataCatalog(dict_ingest_catalog)

Now I want to save files on UC. Instead of using the initial file paths on DBFS, I tried using f"{catalog}.{schema}.{tableName}" and it failed.
Then I replaced the file paths with the new "locations" of the files on UC. We can get them with this piece of code:

details = spark.sql(f"DESCRIBE DETAIL {catalog}.{schema}.{tableName}").collect()
location = details[0]['location']

I can get locations for a pipeline's input files. However, since its output files don't exist yet, I cannot get their locations, nor can I predefine a format similar to location. Therefore, my data catalog is not complete.

can you maybe try the ManagedTableDataset instead of SparkDataset?

from kedro.io import DataCatalog

DataCatalog.from_config(
    {
        "nyctaxi_trips": {
            "type": "databricks.ManagedTableDataset",
            "catalog": "samples",
            "database": "nyctaxi",
            "table": "trips",
        }
    }
)

(from https://github.com/astrojuanlu/kedro-databricks-demo/blob/main/First%20Steps%20with%20Kedro%20on%20Databricks.ipynb)
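
For illustration, here is a minimal sketch (not from the thread; the output dataset and the target catalog/schema names are hypothetical) of how this addresses the original question about output paths: a ManagedTableDataset is addressed purely by catalog, schema, and table name, so Unity Catalog resolves the physical location itself and nothing has to exist before the first save.

from kedro.io import DataCatalog

catalog = DataCatalog.from_config(
    {
        # existing input table from the Databricks "samples" catalog
        "nyctaxi_trips": {
            "type": "databricks.ManagedTableDataset",
            "catalog": "samples",
            "database": "nyctaxi",
            "table": "trips",
        },
        # hypothetical output table: it does not need to exist yet,
        # Unity Catalog decides where the data is physically stored
        "trips_cleaned": {
            "type": "databricks.ManagedTableDataset",
            "catalog": "my_catalog",   # assumed target catalog
            "database": "my_schema",   # assumed target schema
            "table": "trips_cleaned",
            "write_mode": "overwrite",
        },
    }
)

df = catalog.load("nyctaxi_trips")   # returns a Spark DataFrame
catalog.save("trips_cleaned", df)    # the managed table is created on first save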

I'll give it a shot. Thank you!

A question:
I ingest my input tables into UC by calling raw tables that reside in an external location, using something like

df = spark.read.table(f"{external_catalog}.{external_schema}.{raw_table}")

then modified a bit, and written using

df.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.{tableName}")

In order to use a config similar to what you wrote, is it mandatory to write these input files in a different way? Maybe something like:

from kedro_datasets.databricks import ManagedTableDataset

dataset = ManagedTableDataset(table=tableName, catalog=catalog, database=schema, write_mode="overwrite")
dataset.save(df)

Apparently yes, my data engineering pipeline is working again. Thanks a lot Juan!
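
For reference, a self-contained sketch of the ingestion step the thread converges on (catalog, schema, and table names are hypothetical; spark is the SparkSession that Databricks notebooks provide):

from kedro_datasets.databricks import ManagedTableDataset

# read the raw table from the external location, as in the original ingestion code
df = spark.read.table("external_catalog.external_schema.raw_table")
# ... apply the usual transformations here ...

# persist it as a Unity Catalog managed table instead of calling saveAsTable()
dataset = ManagedTableDataset(
    table="my_table",        # assumed target table name
    catalog="my_catalog",    # assumed target catalog
    database="my_schema",    # assumed target schema
    write_mode="overwrite",
)
dataset.save(df)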
