Read a BigQuery table using spark.SparkDataSet

At a glance

The community member is trying to read a BigQuery table using spark.SparkDataSet with an arbitrary query, but the dataset rejects the filepath parameter as not being in the expected format. The community member notes that a raw spark.read.format("bigquery").option("query", ...).load() call works, but spark.SparkDataSet does not expose this functionality.

The comments suggest that the community member could try the JDBC approach using kedro.datasets.spark.SparkJDBCDataSet, but this requires passing a table parameter, which may not work for the community member's arbitrary query. The comments also suggest creating a custom dataset or extending the existing spark.SparkDataSet to support this functionality.

The community member then mentions that creating a spark.GBQQueryDataset similar to the kedro.datasets.pandas.GBQQueryDataSet might be a good approach, and eventually opens a draft PR on kedro-plugins implementing the new dataset.

Hi Team!

I am trying to read a BigQuery table using spark.SparkDataSet with an arbitrary query, as follows:

trx_agg_data:
  type: spark.SparkDataSet
  file_format: bigquery
  load_args:
    viewsEnabled: true
    query: |
      SELECT ph.category, MAX(trx.sales)
      FROM {project}.{dataset}.trx_data trx
      LEFT JOIN {project}.{dataset}.prod_hierarchy ph
  filepath: gs://my-bucket/trx_agg_data.parquet

The dataset complains that the filepath is not in the correct format (BigQuery expects `<project>.<dataset>.<table>`), but I am trying to read it with a query.

The following code works:

spark.read.format("bigquery").option(
    "query",
    """SELECT ph.category, MAX(trx.sales)
    FROM {project}.{dataset}.trx_data trx
    LEFT JOIN {project}.{dataset}.prod_hierarchy ph""",
).load()

Looks like spark.SparkDataSet does not have this functionality. Should I create a custom dataset here?
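A note on the raw read above: if I recall the spark-bigquery-connector's behaviour correctly, the query option only works when viewsEnabled is true and a materializationDataset (an existing BigQuery dataset the connector can write a temporary result table into) is also set. A minimal sketch, assuming the connector jar is on the Spark classpath; my_project, my_dataset, and the query are placeholder names, not from the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Query-based read with the spark-bigquery-connector; the connector
# materializes the query result into a temporary table inside
# "materializationDataset" before Spark reads it.
df = (
    spark.read.format("bigquery")
    .option("viewsEnabled", "true")
    .option("materializationDataset", "my_dataset")
    .option("query", "SELECT category, SUM(sales) AS sales FROM my_project.my_dataset.trx_data GROUP BY category")
    .load()
)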

16 comments

It requires passing a `table` init parameter, but my SQL query can contain an arbitrary number of tables

That sort of goes against the ergonomics Kedro is designed around

I guess you could try doing a dataset factory here

but it's a bit funky

Updated the post to show that I am using 2 tables in the query to BigQuery.

sure, but the spark.SparkDataSet wasn't built around that pattern

you can extend it to work that way

but it doesn't out of the box
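For anyone finding this thread later, here is a minimal sketch of what such an extension could look like, written as a standalone read-only dataset rather than a subclass of spark.SparkDataSet. The class name and parameters are illustrative, not an existing Kedro API, and it targets the kedro>=0.19 interface (on kedro 0.18 the base class is AbstractDataSet):

from typing import Any, Optional

from kedro.io import AbstractDataset  # AbstractDataSet on kedro<0.19
from pyspark.sql import DataFrame, SparkSession


class GBQQueryDataset(AbstractDataset[None, DataFrame]):
    """Read-only dataset that loads a Spark DataFrame from a BigQuery SQL query."""

    def __init__(
        self,
        sql: str,
        materialization_dataset: str,
        load_args: Optional[dict[str, Any]] = None,
    ) -> None:
        self._sql = sql
        self._materialization_dataset = materialization_dataset
        self._load_args = load_args or {}

    def _load(self) -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        return (
            spark.read.format("bigquery")
            .option("viewsEnabled", "true")
            .option("materializationDataset", self._materialization_dataset)
            .option("query", self._sql)
            .options(**self._load_args)
            .load()
        )

    def _save(self, data: DataFrame) -> None:
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self) -> dict[str, Any]:
        return {"sql": self._sql, "materialization_dataset": self._materialization_dataset}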

So something similar to kedro.datasets.pandas.GBQQueryDataSet is exactly what I want:

vehicles:
  type: pandas.GBQQueryDataSet
  sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
  project: my-project
  credentials: gbq-creds
  load_args:
    reauth: True

I think creating a spark.GBQQueryDataSet is then my best bet?
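If the class sketched earlier in the thread existed, using it directly would look something like this (all names are illustrative; the query mirrors the pandas.GBQQueryDataSet example above):

# Hypothetical usage of the GBQQueryDataset sketch from above.
dataset = GBQQueryDataset(
    sql="SELECT shuttle, shuttle_id FROM spaceflights.shuttles",
    materialization_dataset="my_materialization_dataset",
)
shuttles = dataset.load()  # returns a pyspark.sql.DataFrame
shuttles.show()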

Or you extend / override the existing spark.SparkDataSet to support this; if it works, we'd love a PR back into Kedro

Sure, yes, seems like a good idea! Although I must note that I am on kedro==0.18.14 for this 🙂 It would be similar to implement for the kedro_datasets package post kedro>=0.19 though

because they're now decoupled, they should largely be backwards compatible

get it working locally first and then I can help you get your contribution into kedro-datasets

Very soon! Thanks @U03R8FW4HUZ

Hi @U03R8FW4HUZ + Kedro Team :kedro:

Opened a PR on kedro-plugins to implement a new dataset, spark.GBQQueryDataset:

feat(datasets): Implement `spark.GBQQueryDataset` for reading data from BigQuery as a spark dataframe using SQL query #971

Currently a draft, but it would be great if I could get some initial comments 🙂
