Configuring the Project ID for Spark BigQuery Integration

Hello πŸ™‚ I'd like to read a BigQuery table using spark.SparkDataset, but I'm getting an error saying that I need to configure the project ID. Has anyone encountered this issue before?

Spark session:

spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

Error:
DatasetError: Failed while loading data from dataset SparkDataset(file_format=bigquery, filepath=/tmp/dummy.parquet, load_args={'table': project_id.dataset_id.table_id}, save_args={}).
An error occurred while calling o45.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) [Guice/ErrorInCustomProvider]: IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment. Please set a project ID using the builder.
  at SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:102)
  while locating SparkBigQueryConfig

Learn more:
  https://github.com/google/guice/wiki/ERROR_IN_CUSTOM_PROVIDER

1 error

======================
Full classname legend:
======================
SparkBigQueryConfig:          "com.google.cloud.spark.bigquery.SparkBigQueryConfig"
SparkBigQueryConnectorModule: "com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule"
========================
End of classname legend:
========================

Hey Mohamed, I believe you can add your project ID in the spark.yml file in your Kedro project. How are you passing the project ID to the Spark session?

Hello @Laura Couto πŸ™‚, the only place I have declared the project ID is in the catalog to identify the table in question. Do you think I should add any specific configuration to the spark.yml file?

catalog.yml:

table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "/tmp/dummy.parquet"
  load_args:
    table: "project_id.dataset_id.table_id"
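
For context, Kedro's spark.SparkDataset forwards file_format and load_args to the Spark DataFrameReader, so nothing in this entry reaches the session-level configuration. A rough, untested sketch of the roughly equivalent plain-Spark call (identifiers are the placeholders from the catalog entry):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# load_args become reader options; the dummy filepath mainly satisfies
# Kedro's filepath requirement. No project ID is supplied anywhere,
# which seems to be what the ProvisionException above complains about.
df = (
    spark.read.format("bigquery")
    .option("table", "project_id.dataset_id.table_id")
    .load()
)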

I think you have to pass it to the Spark session, either by declaring it in the spark.yml file or in the hook where you initialize the session.

https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#centralise-spark-configuration-in-conf-base-spark-yml
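
For illustration, a minimal sketch of such a hook, following the pattern from the linked docs; any session-level key added to conf/base/spark.yml (a project ID included) would flow through here:

from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession

class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Load conf/base/spark.yml and apply every key/value pair
        # as ordinary Spark configuration.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        _spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .config(conf=spark_conf)
            .getOrCreate()
        )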

Yes, that's a good lead, thank you @Laura Couto πŸ™‚. However, I can't find anything on the internet or in the Kedro documentation that explains how to properly configure these parameters in the Spark session.

I haven't found any information related to these parameters, which seem to be the ones I need to configure:

  • com.google.cloud.spark.bigquery.SparkBigQueryConfig
  • com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule

Would you mind sharing how you're configuring your Spark session?

I am using the spark.yml file with this configuration:

spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

And I am using the basic Hook provided by Kedro to configure a Spark session.

This sounds more like a Spark configuration issue; you should consult the BigQuery/Spark docs instead.

Try passing this to the session builder; it's what I could find in the Spark docs.

.config('parentProject', 'google-project-ID')
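
For illustration, a rough sketch of where that would sit; "my-gcp-project" is a placeholder, and per the connector docs parentProject can also be given as a per-read option:

from pyspark.sql import SparkSession

# Placeholder: replace "my-gcp-project" with the GCP project to bill.
spark = (
    SparkSession.builder.appName("bigquery-read")
    .config("parentProject", "my-gcp-project")
    .getOrCreate()
)

# Equivalent per-read form:
df = (
    spark.read.format("bigquery")
    .option("parentProject", "my-gcp-project")
    .option("table", "project_id.dataset_id.table_id")
    .load()
)

In a Kedro catalog entry, the per-read form should correspond to an extra parentProject key under load_args.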

Hi @Nok and @Laura Couto πŸ™‚ ,
I wanted to share how I successfully read the BigQuery table. It turns out that some configurations were missing for reading the table.
Here is the configuration I used:

table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "/tmp/dummy.parquet"
  load_args:
    dataset: "project_id"
    table: "project_id.dataset_id.table_name"

I also verified that the service account (SA) had the following roles:
  • Storage Object Viewer
  • BigQuery Data Viewer
  • BigQuery Read Session User
After properly configuring the credentials and without altering the Spark configuration I shared with you, I was able to read the BigQuery table successfully.
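
For reference, a rough sketch of the plain-Spark equivalent of that working entry (untested; option names copied verbatim from the catalog above, identifiers are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "dataset" and "table" mirror the load_args of the working catalog entry.
df = (
    spark.read.format("bigquery")
    .option("dataset", "project_id")
    .option("table", "project_id.dataset_id.table_name")
    .load()
)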

^ Which configuration was the missing one?

Maybe it's something worth mentioning in the docs, as an example at least.

Yes, what was missing was the addition of the dataset parameter in the load_args. I believe it would indeed be useful to include this configuration in the documentation.
