Subject: Create a data catalog for files in databricks unity catalog
Hello everyone.
When I create a data catalog for a Kedro project, I manually specify the absolute paths of the input and output files that reside in DBFS, as required by Kedro.
But since I started using Unity Catalog, these paths have become relative, and the absolute paths now contain hash keys.
It is easy to fetch absolute paths for input files, but how can we create such paths for output files?
hi @FIRAS ALAMEDDINE,
> But since I started using Unity Catalog, these paths have become relative, and the absolute paths now contain hash keys.
I saved files on DBFS at first. When defining a data catalog, we used this format:
ingest_dict= {"df1": "dbfs:/Filestore/tables/path/to/df1", "df2": "dbfs:/Filestore/tables/path/to/df2", ...} dict_ingest_catalog = {} for table in ingest_dict: a_df = SparkDataSet( filepath=ingest_dict[table], file_format='parquet', load_args={"header": True, "inferSchema": True,"nullValue" : "NA" }, save_args={"sep": ",", "header": True, "mode":"overwrite"}, ) dict_ingest_catalog[table] = a_df full_catalog = DataCatalog(dict_ingest_catalog)
Now I want to save files in UC. Instead of using the initial file paths on DBFS, I tried using f"{catalog}.{schema}.{tableName}" as the filepath, and it failed.
Then I replaced the file paths with the new "locations" of the files in UC. We can get them with this piece of code:
details = spark.sql(f"DESCRIBE DETAIL {catalog}.{schema}.{tableName}").collect()
location = details[0]['location']

I can get these locations for a pipeline's input files. However, since its output files do not exist yet, I cannot fetch their locations, and I cannot predefine a path in a format similar to location. Therefore, my data catalog is not complete.
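(For reference, feeding those locations back into the earlier SparkDataSet loop for the input tables might look roughly like the sketch below; the table list is hypothetical, catalog and schema are assumed to be defined, and the file format is set to Delta since UC managed tables are stored as Delta tables.)

# Sketch only: SparkDataSet entries for *input* tables, built from their UC locations.
input_tables = ["df1", "df2"]  # hypothetical list of input table names

dict_ingest_catalog = {}
for tableName in input_tables:
    details = spark.sql(f"DESCRIBE DETAIL {catalog}.{schema}.{tableName}").collect()
    location = details[0]["location"]
    dict_ingest_catalog[tableName] = SparkDataSet(
        filepath=location,
        file_format="delta",  # UC managed tables are Delta tables
    )

full_catalog = DataCatalog(dict_ingest_catalog)  # output datasets are still missing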
Can you maybe try the ManagedTableDataset instead of SparkDataset?
DataCatalog.from_config(
    {
        "nyctaxi_trips": {
            "type": "databricks.ManagedTableDataset",
            "catalog": "samples",
            "database": "nyctaxi",
            "table": "trips",
        }
    }
)
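(To mirror the earlier dict-based construction, the same idea can be sketched programmatically; the table names below are placeholders and catalog/schema are assumed to be defined. The point is that no filepath or location is needed, so output tables that do not exist yet can still be declared up front and are created on the first save.)

# Sketch: catalog entries built with ManagedTableDataset instead of SparkDataSet.
from kedro.io import DataCatalog
from kedro_datasets.databricks import ManagedTableDataset

table_names = ["df1", "df2", "df_output"]  # hypothetical input and output tables

catalog_entries = {
    name: ManagedTableDataset(
        table=name,
        catalog=catalog,    # assumed UC catalog name
        database=schema,    # assumed UC schema name
        write_mode="overwrite",
    )
    for name in table_names
}
full_catalog = DataCatalog(catalog_entries)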
A question:
I ingest my input tables into UC by reading raw tables that reside in an external location, using something like:

df = spark.read.table(f"{external_catalog}.{external_schema}.{raw_table}")

then I modify them a bit and write them with:

df.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.{tableName}")
With a config that is similar to what you wrote, is it mandatory to write these input files another way? Maybe something like:

from kedro_datasets.databricks import ManagedTableDataset

dataset = ManagedTableDataset(table=tableName, catalog=catalog, database=schema, write_mode="overwrite")
dataset.save(df)
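(One possible arrangement, sketched under the assumption that both the raw external table and the target table are declared as catalog entries; the dataset names below are hypothetical.)

# Sketch: let Kedro handle the save through the catalog. The node only returns the
# modified DataFrame; Kedro calls .save() on the dataset registered as its output.
from kedro.pipeline import node, pipeline

def clean_trips(raw_df):
    # placeholder for the "modified a bit" step
    return raw_df

ingest_pipeline = pipeline(
    [
        node(
            func=clean_trips,
            inputs="raw_trips",     # hypothetical catalog entry for the external raw table
            outputs="trips_clean",  # hypothetical ManagedTableDataset entry in UC
        )
    ]
)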