Hello, I'd like to read a BigQuery table using spark.SparkDataset, but I'm getting an error saying that I need to configure the project ID. Has anyone encountered this issue before?
Spark session:

spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
DatasetError: Failed while loading data from dataset SparkDataset(file_format=bigquery, filepath=/tmp/dummy.parquet, load_args={'table': project_id.dataset_id.table_id}, save_args={}).
An error occurred while calling o45.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) [Guice/ErrorInCustomProvider]: IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment. Please set a project ID using the builder.
  at SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:102)
  while locating SparkBigQueryConfig

Learn more: https://github.com/google/guice/wiki/ERROR_IN_CUSTOM_PROVIDER

1 error

Full classname legend:
SparkBigQueryConfig:          "com.google.cloud.spark.bigquery.SparkBigQueryConfig"
SparkBigQueryConnectorModule: "com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule"
Hey Mohamed, I believe you can add your project ID in the spark.yml file in your Kedro project. How are you passing the project ID to the Spark session?
Hello @Laura Couto, the only place I have declared the project ID is in the catalog, to identify the table in question. Do you think I should add any specific configuration to the spark.yml file?
Catalog.yml:

table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "/tmp/dummy.parquet"
  load_args:
    table: "project_id.dataset_id.table_id"
I think you have to pass it to the Spark session, either by declaring it in the spark.yml file or in the hook where you initialize the session.
https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#centralise-spark-configuration-in-conf-base-spark-yml
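For reference, the pattern on that docs page is an after_context_created hook that reads conf/base/spark.yml and builds the session, so any project-related keys you add to spark.yml end up on the SparkSession. Below is a minimal sketch of that hook, following the Kedro docs; it assumes a "spark" pattern is registered in CONFIG_LOADER_ARGS and that the hook is listed in HOOKS in settings.py:

# src/<your_package>/hooks.py -- sketch of the Kedro docs pattern for
# centralising Spark configuration in conf/base/spark.yml
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession from the settings in conf/base/spark.yml."""
        # Assumes settings.py contains something like:
        # CONFIG_LOADER_ARGS = {"config_patterns": {"spark": ["spark*", "spark*/**"]}}
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the session once for the whole run
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")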
Yes, that's a good lead, thank you @Laura Couto. However, I can't find anything on the internet or in the Kedro documentation that explains how to properly configure these parameters in the Spark session.
I haven't found any information related to these parameters, which seem to be the ones I need to configure:
com.google.cloud.spark.bigquery.SparkBigQueryConfig
com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule
I am using the spark.yml file with this configuration:

spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
This sounds more like a Spark configuration issue; you should consult the BigQuery/Spark connector docs instead.
Try passing this to the session builder; it's what I could find in the Spark docs.
.config('parentProject', 'google-project-ID')
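In context, that would look roughly like the sketch below (untested; "your-gcp-project-id" is a placeholder, and whether the connector picks the option up from the session config rather than only at read time is something to verify against the spark-bigquery-connector docs):

# Minimal sketch of the suggestion: set the BigQuery parent project when
# building the session. "your-gcp-project-id" is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bigquery-read")
    .config("parentProject", "your-gcp-project-id")
    .getOrCreate()
)

# The connector also documents parentProject as a per-read option:
df = (
    spark.read.format("bigquery")
    .option("parentProject", "your-gcp-project-id")
    .option("table", "project_id.dataset_id.table_id")
    .load()
)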
Hi @Nok and @Laura Couto,
I wanted to share how I finally managed to read the BigQuery table. It turns out some configuration was missing.
Here is the configuration I used:
table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "tmp/dummy.parquet"
  load_args:
    dataset: "project_id"
    table: "project_id.dataset_id.table_name"

I also verified that the service account (SA) had the required roles.
Yes, what was missing was the addition of the dataset parameter in the load_args. I believe it would indeed be useful to include this configuration in the documentation.
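For anyone finding this later, my rough understanding of why this works: spark.SparkDataset forwards load_args as options to the underlying reader, so the working catalog entry above should be roughly equivalent to the plain PySpark read below (a sketch of the idea, not Kedro's exact internals):

# Rough plain-PySpark equivalent of the working catalog entry above:
# load_args become options on the BigQuery reader.
df = (
    spark.read.format("bigquery")
    .option("dataset", "project_id")                       # the previously missing piece
    .option("table", "project_id.dataset_id.table_name")
    .load()
)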