Configuring the Project ID for Spark BigQuery Integration

Hello πŸ™‚ I'd like to read a BigQuery table using spark.SparkDataset, but I'm getting an error saying that I need to configure the project ID. Has anyone encountered this issue before?

Spark session:

spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

Error:
DatasetError: Failed while loading data from dataset SparkDataset(file_format=bigquery, filepath=/tmp/dummy.parquet, load_args={'table': project_id.dataset_id.table_id}, save_args={}).
An error occurred while calling o45.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) [Guice/ErrorInCustomProvider]: IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment. Please set a project ID using the builder.
  at SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:102)
  while locating SparkBigQueryConfig

Learn more:
  https://github.com/google/guice/wiki/ERROR_IN_CUSTOM_PROVIDER

1 error

======================
Full classname legend:
======================
SparkBigQueryConfig:          "com.google.cloud.spark.bigquery.SparkBigQueryConfig"
SparkBigQueryConnectorModule: "com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule"
========================
End of classname legend:
========================

Hey Mohamed, I believe you can add your project ID in the spark.yml file in your Kedro project. How are you passing the project ID to the Spark session?

Hello @Laura Couto πŸ™‚, the only place I have declared the project ID is in the catalog to identify the table in question. Do you think I should add any specific configuration to the spark.yml file?

catalog.yml:

table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "/tmp/dummy.parquet"
  load_args:
    table: "project_id.dataset_id.table_id"
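
For context, Kedro's spark.SparkDataset forwards file_format and load_args to the Spark DataFrameReader, so nothing in this entry reaches the session-level configuration. A rough, untested sketch of the roughly equivalent plain-Spark call (identifiers are the placeholders from the catalog entry):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# load_args become reader options; the dummy filepath mainly satisfies
# Kedro's filepath requirement. No project ID is supplied anywhere,
# which seems to be what the ProvisionException above complains about.
df = (
    spark.read.format("bigquery")
    .option("table", "project_id.dataset_id.table_id")
    .load()
)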

I think you have to pass it to the Spark session, either by declaring it in the spark.yml file or in the hook where you initialize the session.

https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#centralise-spark-configuration-in-conf-base-spark-yml
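
For illustration, a minimal sketch of such a hook, following the pattern from the linked docs; any session-level key added to conf/base/spark.yml (a project ID included) would flow through here:

from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession

class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Load conf/base/spark.yml and apply every key/value pair
        # as ordinary Spark configuration.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        _spark_session = (
            SparkSession.builder.appName(context.project_path.name)
            .config(conf=spark_conf)
            .getOrCreate()
        )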

Yes, that's a good lead, thank you @Laura Couto πŸ™‚. However, I can't find anything on the internet or in the Kedro documentation that explains how to properly configure these parameters in the Spark session.

I haven't found any information related to these parameters, which seem to be the ones I need to configure:

  • com.google.cloud.spark.bigquery.SparkBigQueryConfig
  • com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule

Would you mind sharing how you're configuring your Spark session?

I am using the spark.yml file with this configuration:

spark.jars.packages: io.delta:delta-spark_2.12:3.2.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar,https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.36.1/spark-bigquery-with-dependencies_2.12-0.36.1.jar
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

And I am using the basic Hook provided by Kedro to configure a Spark session.

This sounds more like a Spark configuration issue; you should consult the BigQuery/Spark docs instead.

Try passing this to the session builder; it's what I could find in the Spark docs.

.config('parentProject', 'google-project-ID')
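
For illustration, a rough sketch of where that would sit; "my-gcp-project" is a placeholder, and per the connector docs parentProject can also be given as a per-read option:

from pyspark.sql import SparkSession

# Placeholder: replace "my-gcp-project" with the GCP project to bill.
spark = (
    SparkSession.builder.appName("bigquery-read")
    .config("parentProject", "my-gcp-project")
    .getOrCreate()
)

# Equivalent per-read form:
df = (
    spark.read.format("bigquery")
    .option("parentProject", "my-gcp-project")
    .option("table", "project_id.dataset_id.table_id")
    .load()
)

In a Kedro catalog entry, the per-read form should correspond to an extra parentProject key under load_args.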

Hi @Nok and @Laura Couto πŸ™‚ ,
I wanted to share how I successfully read the BigQuery table. It turns out that some configurations were missing for reading the table.
Here is the configuration I used:

table_name:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: "/tmp/dummy.parquet"
  load_args:
    dataset: "project_id"
    table: "project_id.dataset_id.table_name"

I also verified that the service account (SA) had the following roles:
  • Storage Object Viewer
  • BigQuery Data Viewer
  • BigQuery Read Session User
After properly configuring the credentials and without altering the Spark configuration I shared with you, I was able to read the BigQuery table successfully.
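
For reference, a rough sketch of the plain-Spark equivalent of that working entry (untested; option names copied verbatim from the catalog above, identifiers are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "dataset" and "table" mirror the load_args of the working catalog entry.
df = (
    spark.read.format("bigquery")
    .option("dataset", "project_id")
    .option("table", "project_id.dataset_id.table_name")
    .load()
)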

^ Which configuration was the missing one?

Maybe it's something worth mentioning in the docs, as an example at least.

Yes, what was missing was the addition of the dataset parameter in the load_args. I believe it would indeed be useful to include this configuration in the documentation.
