Read a BigQuery table using spark.SparkDataSet

At a glance

The community member is trying to read a BigQuery table using spark.SparkDataSet with an arbitrary query, but the dataset rejects the filepath parameter as not being in the expected format. The community member notes that a raw spark.read.format("bigquery").option("query", ...).load() call works, but spark.SparkDataSet does not expose this functionality.

The comments suggest that the community member could try the JDBC approach using kedro.datasets.spark.SparkJDBCDataSet, but this requires passing a table parameter, which may not work for the community member's arbitrary query. The comments also suggest creating a custom dataset or extending the existing spark.SparkDataSet to support this functionality.

The community member then mentions that creating a spark.GBQQueryDataset similar to the kedro.datasets.pandas.GBQQueryDataSet might be a good approach, and eventually opens a draft PR on kedro-plugins implementing the new dataset.

Hi Team!

I am trying to read a BigQuery table using spark.SparkDataSet with an arbitrary query, as follows:

trx_agg_data:
  type: spark.SparkDataSet
  file_format: bigquery
  load_args:
    viewsEnabled: true
    query: |
      SELECT ph.category, MAX(trx.sales)
      FROM {project}.{dataset}.trx_data trx
      LEFT JOIN {project}.{dataset}.prod_hierarchy ph
  filepath: gs://my-bucket/trx_agg_data.parquet

The dataset complains that the filepath is not in the correct format (BigQuery expects `<project>.<dataset>.<table>`), but I am trying to read it with a query.

The following code works:

spark.read.format("bigquery").option(
    "query",
    """SELECT ph.category, MAX(trx.sales)
    FROM {project}.{dataset}.trx_data trx
    LEFT JOIN {project}.{dataset}.prod_hierarchy ph""",
).load()

Looks like spark.SparkDataSet does not have this functionality. Should I create a custom dataset here?
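A note on the raw read above: if I recall the spark-bigquery-connector's behaviour correctly, the query option only works when viewsEnabled is true and a materializationDataset (an existing BigQuery dataset the connector can write a temporary result table into) is also set. A minimal sketch, assuming the connector jar is on the Spark classpath; my_project, my_dataset, and the query are placeholder names, not from the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Query-based read with the spark-bigquery-connector; the connector
# materializes the query result into a temporary table inside
# "materializationDataset" before Spark reads it.
df = (
    spark.read.format("bigquery")
    .option("viewsEnabled", "true")
    .option("materializationDataset", "my_dataset")
    .option("query", "SELECT category, SUM(sales) AS sales FROM my_project.my_dataset.trx_data GROUP BY category")
    .load()
)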

16 comments

It requires passing a `table` init parameter, but my SQL query can contain an arbitrary number of tables

That sort of goes against the ergonomics Kedro is designed around

I guess you could try doing a dataset factory here

but it's a bit funky

Updated the post to show that I am using 2 tables in the query to BigQuery.

sure, but the spark.SparkDataSet wasn't built around that pattern

you can extend it to work that way

but it doesn't out of the box
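For anyone finding this thread later, here is a minimal sketch of what such an extension could look like, written as a standalone read-only dataset rather than a subclass of spark.SparkDataSet. The class name and parameters are illustrative, not an existing Kedro API, and it targets the kedro>=0.19 interface (on kedro 0.18 the base class is AbstractDataSet):

from typing import Any, Optional

from kedro.io import AbstractDataset  # AbstractDataSet on kedro<0.19
from pyspark.sql import DataFrame, SparkSession


class GBQQueryDataset(AbstractDataset[None, DataFrame]):
    """Read-only dataset that loads a Spark DataFrame from a BigQuery SQL query."""

    def __init__(
        self,
        sql: str,
        materialization_dataset: str,
        load_args: Optional[dict[str, Any]] = None,
    ) -> None:
        self._sql = sql
        self._materialization_dataset = materialization_dataset
        self._load_args = load_args or {}

    def _load(self) -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        return (
            spark.read.format("bigquery")
            .option("viewsEnabled", "true")
            .option("materializationDataset", self._materialization_dataset)
            .option("query", self._sql)
            .options(**self._load_args)
            .load()
        )

    def _save(self, data: DataFrame) -> None:
        raise NotImplementedError("This dataset is read-only.")

    def _describe(self) -> dict[str, Any]:
        return {"sql": self._sql, "materialization_dataset": self._materialization_dataset}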

So something similar to kedro.datasets.pandas.GBQQueryDataSet is exactly what I want:

vehicles:
  type: pandas.GBQQueryDataSet
  sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
  project: my-project
  credentials: gbq-creds
  load_args:
    reauth: True

I think creating a spark.GBQQueryDataSet is then my best bet?
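If the class sketched earlier in the thread existed, using it directly would look something like this (all names are illustrative; the query mirrors the pandas.GBQQueryDataSet example above):

# Hypothetical usage of the GBQQueryDataset sketch from above.
dataset = GBQQueryDataset(
    sql="SELECT shuttle, shuttle_id FROM spaceflights.shuttles",
    materialization_dataset="my_materialization_dataset",
)
shuttles = dataset.load()  # returns a pyspark.sql.DataFrame
shuttles.show()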

Or you extend / override the existing spark.SparkDataSet to support this; if it works, we'd love a PR back into Kedro

Sure, yes, seems like a good idea! Although I must note that I am on kedro==0.18.14 for this 🙂 It would be similar to implement for the kedro_datasets package post kedro>=0.19 though

because they're now decoupled, they should largely be backwards compatible

get it working locally first and then I can help you get your contribution into kedro-datasets

Very soon! Thanks @U03R8FW4HUZ

Hi @U03R8FW4HUZ + Kedro Team :kedro:

Opened a PR on kedro-plugins to implement a new dataset, spark.GBQQueryDataset:

feat(datasets): Implement `spark.GBQQueryDataset` for reading data from BigQuery as a spark dataframe using SQL query #971

Currently a draft, but it would be great if I could get some initial comments 🙂
