Carlos Prieto - Tomtom

Hello Kedro Community,
I am working on a project where I need to store a Spark DataFrame in Delta format using Kedro. Specifically, I want the data to be written exactly the way this function does it:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # active Spark session on the cluster


def export_results_to_delta(summary_df, output_path="/mnt/success5/Success5_results/metric_changes"):
    if DeltaTable.isDeltaTable(spark, output_path):
        # The Delta table already exists: merge (upsert) on the composite key.
        DeltaTable.forPath(spark, output_path).alias("target").merge(
            summary_df.alias("source"),
            """target.reference_id = source.reference_id AND
               target.country = source.country AND
               target.provider_id = source.provider_id AND
               target.matching_run_id = source.matching_run_id""",
        ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
    else:
        # First write: create the table partitioned by the key columns.
        summary_df.write.format("delta").mode("overwrite").partitionBy(
            "country", "matching_run_id", "provider_id"
        ).save(output_path)
```
Is it possible to create a catalog entry in Kedro that allows me to store the dataset in this manner? If so, could you please provide an example of how to configure the catalog entry?
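For context, here is roughly what I imagine on my side: wrapping the logic above in a custom dataset class and pointing a catalog entry at it. This is only a sketch; the class, module and argument names (`DeltaMergeDataset`, `merge_condition`, `partition_cols`) are placeholders I made up rather than an existing Kedro dataset, and I would be glad to learn there is a built-in way instead.

```python
from delta.tables import DeltaTable
from kedro.io import AbstractDataset
from pyspark.sql import DataFrame, SparkSession


class DeltaMergeDataset(AbstractDataset[DataFrame, DataFrame]):
    """Hypothetical dataset that writes a Spark DataFrame to Delta,
    merging into the table when it already exists."""

    def __init__(self, filepath: str, merge_condition: str, partition_cols: list[str]):
        self._filepath = filepath
        self._merge_condition = merge_condition
        self._partition_cols = partition_cols

    def _get_spark(self) -> SparkSession:
        return SparkSession.builder.getOrCreate()

    def _load(self) -> DataFrame:
        return self._get_spark().read.format("delta").load(self._filepath)

    def _save(self, data: DataFrame) -> None:
        spark = self._get_spark()
        if DeltaTable.isDeltaTable(spark, self._filepath):
            # Table exists: upsert on the configured merge condition.
            (
                DeltaTable.forPath(spark, self._filepath)
                .alias("target")
                .merge(data.alias("source"), self._merge_condition)
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute()
            )
        else:
            # First write: create the partitioned Delta table.
            (
                data.write.format("delta")
                .mode("overwrite")
                .partitionBy(*self._partition_cols)
                .save(self._filepath)
            )

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "partition_cols": self._partition_cols}
```

In `catalog.yml` I would then reference it with something like `type: my_project.datasets.DeltaMergeDataset` plus `filepath`, `merge_condition` and `partition_cols` keys, but I do not know whether this is the recommended approach or whether an existing dataset already covers it.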
Thank you in advance for your help!

1 comment

Issue Summary
Confusion with Credential Configuration in Kedro 0.19 vs 0.18
Hello Kedro team,
I have encountered an issue regarding the configuration of credentials for accessing storage via abfss in Kedro 0.19.3, which was not present in version 0.18. Here is a summary of the problem:
In Kedro 0.18, I configured the credentials for accessing storage through Spark configurations with Azure Service Principal, and everything worked fine. However, after upgrading to Kedro 0.19.3, the same setup stopped working. After spending a couple of days troubleshooting, I discovered that adding the credentials as environment variables resolved the issue.
My questions are:

  1. Does Kedro 0.19.3 read these environment variables directly?
  2. Is this behavior managed by Kedro itself or by the abfss library?

Additionally, it seems redundant to add the credentials both in the Spark configuration and as environment variables. This redundancy is confusing and feels like a bug rather than a feature. Could you please clarify if this is the intended behavior?
Execution Environment:
  • This is being executed in Databricks.
  • The Spark configuration for the Azure Service Principal is added to the Databricks cluster that is used. (The cluster configuration includes credentials for multiple storage accounts.)
  • Only the credentials for a single storage account can be added as environment variables, but since the Spark configuration already authenticates the Spark session, filling in these values, even with incorrect ones, still allows access to all the storage accounts (roughly what I added is sketched below).
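For reference, what I ended up adding amounts to something like the following (values redacted). I am assuming these are the standard `azure-identity` service-principal variable names that the filesystem layer behind `abfss` falls back to, which is part of what I would like to have confirmed:

```python
# Shown as Python only for illustration; on the cluster these are set as
# environment variables in the Databricks cluster configuration.
# Assumption on my side: the abfss filesystem layer picks up the standard
# azure-identity service-principal variables. Values are redacted.
import os

os.environ["AZURE_CLIENT_ID"] = "<service-principal-client-id>"
os.environ["AZURE_TENANT_ID"] = "<tenant-id>"
os.environ["AZURE_CLIENT_SECRET"] = "<service-principal-secret>"
```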

Thank you for your assistance!

2 comments