Troubleshooting delta table dataset in gcs bucket with pyspark

Hello,
I would like to work with Delta Tables using PySpark in a GCS bucket, but I'm having trouble using spark.DeltaTableDataset:

table_name:
  type: spark.DeltaTableDataset
  filepath: "<a target="_blank" rel="noopener noreferrer" href="gs://XXXX/poc-kedro/table_name/*.parquet">gs://XXXX/poc-kedro/table_name/*.parquet</a>"

Could you tell me what might be wrong with this?
Additionally, could you explain how to specify the credentials for accessing the table with this Dataset?

24 comments

Hi , what is the trouble you are facing here. Is it only related to credentials or something else. Do you see any error which can give us more information on the issue. Thank you

Hi , when trying to load data from a DeltaTable using PySpark and Kedro, an error occurs. The process attempts to load the dataset from a Google Cloud Storage (GCS) bucket, but fails with the following message:

"TypeError: 'JavaPackage' object is not callable", which points to an issue with the DeltaTable.forPath() method in the delta.tables library. This leads to a DatasetError in Kedro, preventing the data from being loaded successfully.

 File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXXX/poc-kedro/table_name/*.parquet, fs_prefix=gs://).
'JavaPackage' object is not callable

I was wondering if this issue could be caused by the fact that I haven't provided the credentials.

But the dataset doesn't seem to allow specifying the credentials in the parameters.

Did you use gs instead of gcs or it's just. a typo?

gs://

https://gcsfs.readthedocs.io/en/latest/ isn't the suffix gcs? not sure if I am missing anything here

Yes I tried but I have the same issues with gcs :

File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXX/poc-kedro/table_name/*.parquet, fs_prefix=gcs://).
'JavaPackage' object is not callable

I guess 2nd question will be, do you have delta configured for your sparkcontext?

The error is not from Python so very likely your Spark configuration is not working

Yes it is possible 👍 here is my spark configuration :

spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.databricks.delta.properties.defaults.compatibility.symlinkFormatManifest.enabled: true
# <a target="_blank" rel="noopener noreferrer" href="https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner">https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner</a>
spark.scheduler.mode: FAIR

is this for s3 - spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem

Shouldn't we use google file system as we are trying for gcs ?

For now, I've made a few changes with this configuration, and I’m able to successfully launch the Spark session:

spark.driver.maxResultSize: 3g
spark.jars.packages: io.delta:delta-core_2.12:2.0.0
spark.jars: <a target="_blank" rel="noopener noreferrer" href="https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar">https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar</a>
spark.sql.execution.arrow.pyspark.enabled: true
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.databricks.delta.properties.defaults.compatibility.symlinkFormatManifest.enabled: true
# <a target="_blank" rel="noopener noreferrer" href="https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner">https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner</a>
spark.scheduler.mode: FAIR
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.gs.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: XXXX.json

However, it's still not recognizing my table… And GCS isn’t working on my end, but with GS, they are able to fetch the table.

I think we use fsspec to access files and based on the docs, the url should have a prefix gcs:// . I am not sure how gs is working for you. But with gs:// is your issue resolved ?

okay fsspec uses both protocols to identify gcfs.

Yeah, I don’t understand it either... Unfortunately, no, it’s not resolved because I’m getting an error saying that the URL I provided is not a Delta table. I’ve tried the same URLs multiple times in notebooks with the same paths, and it works there.

can you check if there is a folder gs://your-bucket-name/path/to/delta-table/_delta_log exists ?
Also can you try filepath: "gs://XXXX/poc-kedro/table_name" (instead of wildcard parquet)

  File "/opt/anaconda3/lib/python3.11/site-packages/kedro/runner/runner.py", line 494, in _run_node_sequential
    inputs[name] = catalog.load(name)
                   ^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/data_catalog.py", line 515, in load
    result = dataset.load()
             ^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXX/poc-kedro/table_name, fs_prefix=gs://).
`<a target="_blank" rel="noopener noreferrer" href="gs://XXXXXX/poc-kedro/table_name">gs://XXXXXX/poc-kedro/table_name</a>` is not a Delta table.

Attachment

Capture d’écran 2024-10-17 à 17.35.00.png

Thanks for your patience ,I am new to this and as I read the docs it shows an example of using a single parquet file but not with wildcard, so I am not sure if we use multiple files. Let me get some help from the team. Meanwhile, if you could resolve the issue, please let us know. Thank you

 weather@delta:
          type: spark.DeltaTableDataset
          filepath: data/02_intermediate/data.parquet

Yes, exactly. That's what I found strange when I read the official documentation, as a Delta Table contains multiple files, not just a single Parquet file. Thanks, I'll keep you posted if I find a solution.

I also tested with a single file from the table, and I got the same error:

kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXX/poc-kedro/table_name/processed_at=XXXXXX/part-00000-0076fd68-4ca3-46f6-982f-e77c539af8a1.c000.snappy.parquet, fs_prefix=gs://). <a target="_blank" rel="noopener noreferrer" href="gs://XXXXXX/poc-kedro/table_name/processed_at=XXXXXX/part-00000-0076fd68-4ca3-46f6-982f-e77c539af8a1.c000.snappy.parquet">gs://XXXXXX/poc-kedro/table_name/processed_at=XXXXXX/part-00000-0076fd68-4ca3-46f6-982f-e77c539af8a1.c000.snappy.parquet</a> is not a Delta table."