Hello,
I would like to work with Delta tables stored in a GCS bucket using PySpark, but I'm having trouble using spark.DeltaTableDataset:

table_name:
  type: spark.DeltaTableDataset
  filepath: "gs://XXXX/poc-kedro/table_name/*.parquet"

Could you tell me what might be wrong with this?
Hi, what is the trouble you are facing here? Is it only related to credentials, or something else? Do you see any error that can give us more information on the issue? Thank you
Hi, when trying to load data from a Delta table using PySpark and Kedro, an error occurs. The process attempts to load the dataset from a Google Cloud Storage (GCS) bucket, but fails with the message below. The failure comes from the delta.tables library and leads to a DatasetError in Kedro, preventing the data from being loaded successfully:

File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXXX/poc-kedro/table_name/*.parquet, fs_prefix=gs://). 'JavaPackage' object is not callable
I was wondering if this issue could be caused by the fact that I haven't provided any credentials. But the dataset doesn't seem to allow specifying credentials in its parameters.
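For what it's worth, with spark.DeltaTableDataset the GCS credentials are normally supplied to Spark/Hadoop itself (for example in conf/base/spark.yml) rather than to the dataset entry. A sketch of the equivalent session-level configuration, using the GCS connector properties that also appear later in this thread (the keyfile path is a placeholder):

from pyspark.sql import SparkSession

# Assumption: authentication goes through the GCS Hadoop connector, so the
# service-account keyfile is passed as Spark/Hadoop configuration.
# "/path/to/key.json" is a placeholder.
spark = (
    SparkSession.builder
    .config(
        "spark.hadoop.fs.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    )
    .config("spark.hadoop.fs.gs.auth.service.account.enable", "true")
    .config(
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile",
        "/path/to/key.json",
    )
    .getOrCreate()
)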
Per https://gcsfs.readthedocs.io/en/latest/, isn't the prefix gcs://? Not sure if I am missing anything here.
Yes, I tried, but I have the same issue with gcs:

File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXX/poc-kedro/table_name/*.parquet, fs_prefix=gcs://). 'JavaPackage' object is not callable
Yes, it is possible 👍 Here is my Spark configuration:

spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.databricks.delta.properties.defaults.compatibility.symlinkFormatManifest.enabled: true
# https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
For now, I've made a few changes to this configuration, and I'm able to successfully launch the Spark session:

spark.driver.maxResultSize: 3g
spark.jars.packages: io.delta:delta-core_2.12:2.0.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar
spark.sql.execution.arrow.pyspark.enabled: true
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.databricks.delta.properties.defaults.compatibility.symlinkFormatManifest.enabled: true
# https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.gs.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: XXXX.json

However, it's still not recognizing my table… And gcs:// isn't working on my end, but with gs:// it is able to fetch the table.
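One thing worth double-checking with this setup (an aside, not from the thread): the io.delta:delta-core version has to match the installed PySpark minor version, otherwise the Delta classes fail to load with exactly the 'JavaPackage' error above. A quick check:

import pyspark

# delta-core_2.12:2.0.0 targets Spark 3.2.x (for example,
# delta-core_2.12:2.4.0 pairs with Spark 3.4.x); a mismatched pairing is
# a common cause of "'JavaPackage' object is not callable".
print(pyspark.__version__)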
I think we use fsspec to access files, and based on the docs, the URL should have the prefix gcs://. I am not sure how gs is working for you. But with gs://, is your issue resolved?
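A small aside that may explain the gs/gcs confusion: gcsfs appears to register both prefixes as fsspec protocols, so either one should resolve to the same filesystem class. A quick check, assuming gcsfs is installed:

import fsspec

# gcsfs registers both "gs" and "gcs" as protocols, so both calls should
# return the same GCSFileSystem implementation.
print(type(fsspec.filesystem("gs")))
print(type(fsspec.filesystem("gcs")))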
Yeah, I don't understand it either... Unfortunately, no, it's not resolved: I'm getting an error saying that the URL I provided is not a Delta table. I've tried the same URLs multiple times in notebooks, and it works there.
Can you check if the folder gs://your-bucket-name/path/to/delta-table/_delta_log exists?
Also, can you try filepath: "gs://XXXX/poc-kedro/table_name" (instead of the wildcard *.parquet)?
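A quick way to run that check from Python, assuming gcsfs is installed and application-default credentials are available (the bucket and path are placeholders):

import gcsfs

# A Delta table is a directory with a _delta_log/ subfolder; if this
# returns False, the path is not the root of a Delta table.
fs = gcsfs.GCSFileSystem()
print(fs.exists("your-bucket-name/path/to/delta-table/_delta_log"))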
File "/opt/anaconda3/lib/python3.11/site-packages/kedro/runner/runner.py", line 494, in _run_node_sequential inputs[name] = catalog.load(name) ^^^^^^^^^^^^^^^^^^ File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/data_catalog.py", line 515, in load result = dataset.load() ^^^^^^^^^^^^^^ File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load raise DatasetError(message) from exc kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXX/poc-kedro/table_name, fs_prefix=gs://). `<a target="_blank" rel="noopener noreferrer" href="gs://XXXXXX/poc-kedro/table_name">gs://XXXXXX/poc-kedro/table_name</a>` is not a Delta table.
Thanks for your patience. I am new to this, and the docs show an example using a single Parquet file, not a wildcard, so I am not sure whether multiple files are supported. Let me get some help from the team. Meanwhile, if you manage to resolve the issue, please let us know. Thank you
weather@delta:
  type: spark.DeltaTableDataset
  filepath: data/02_intermediate/data.parquet
Yes, exactly. That's what I found strange when I read the official documentation, as a Delta Table contains multiple files, not just a single Parquet file. Thanks, I'll keep you posted if I find a solution.
I also tested with a single file from the table, and I got the same error:
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXX/poc-kedro/table_name/processed_at=XXXXXX/part-00000-0076fd68-4ca3-46f6-982f-e77c539af8a1.c000.snappy.parquet, fs_prefix=gs://). gs://XXXXXX/poc-kedro/table_name/processed_at=XXXXXX/part-00000-0076fd68-4ca3-46f6-982f-e77c539af8a1.c000.snappy.parquet is not a Delta table.
Hey! 🙂 I figured out the cause of my issue: it was missing permissions on the service account (SA). I found the solution by trying to read the table from a notebook inside the Kedro project. Thanks for your help! 👍
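For anyone hitting the same thing later, a minimal sketch of the kind of notebook check that surfaces such a permissions problem, assuming spark is an existing Delta-enabled session (the path is a placeholder):

from delta.tables import DeltaTable

# Reads the Delta table directly; a missing SA permission shows up here
# as an explicit GCS access error instead of Kedro's DatasetError wrapper.
dt = DeltaTable.forPath(spark, "gs://XXXX/poc-kedro/table_name")
dt.toDF().show(5)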