
Validating External Dataset Accessibility in BigQuery

At a glance
A community member is running into permission issues in BigQuery and wants to validate that all external datasets in the catalog are accessible before running their nodes, possibly via a hook plus a metadata tag identifying the external datasets. Their two main concerns are how to handle different dataset types and how to check that tables exist without fully loading them, for speed. Commenters suggest using load_args to limit rows, a custom dataset that queries just one row, or Ibis, whose lazy/deferred execution model can confirm a table exists without loading any data. Since the member is already using Ibis, the discussion converges on making sure they are alerted whenever the service account running the code lacks access to a dataset.

Hey team, what is a good way of checking whether all the input tables for the nodes I want to run are accessible? I am having issues with permissions in BigQuery, and testing is cumbersome. Is there a way to run a validation of all external datasets in the catalog?

I was thinking of adding a hook and a metadata tag that identifies the datasets as external.

My main concerns are:

  1. How do I handle different dataset types?
  2. How do I only ping each table (or load just the first row) instead of loading it in full, for speed reasons?
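A minimal sketch of that hook idea follows. The class name and the external: true metadata flag are made up for illustration, catalog._get_dataset is private Kedro API that may change between versions, and exists() is only implemented by some datasets, so treat this as a starting point rather than a drop-in solution:

```python
from kedro.framework.hooks import hook_impl


class ExternalDatasetCheckHook:
    """Fail the run early if any catalog entry tagged as external is unreachable."""

    @hook_impl
    def after_catalog_created(self, catalog):
        failures = []
        for name in catalog.list():
            dataset = catalog._get_dataset(name)  # private API; adjust per Kedro version
            metadata = getattr(dataset, "metadata", None) or {}
            if not metadata.get("external"):  # hypothetical catalog tag
                continue
            try:
                # exists() is a lightweight existence check where implemented;
                # it does not load the data itself.
                if not catalog.exists(name):
                    failures.append(name)
            except Exception as exc:  # e.g. a permissions error from BigQuery
                failures.append(f"{name}: {exc}")
        if failures:
            raise RuntimeError(f"External datasets not accessible: {failures}")
```

Registering the hook in settings.py (HOOKS = (ExternalDatasetCheckHook(),)) and adding external: true under each relevant entry's metadata key in catalog.yml would then fail the run before any node executes.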

Marked as solution

For different dataset types, you would need to adjust the check accordingly: maybe use load_args to limit rows, or use a custom dataset that queries just one row?
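For example (illustrative entries; the project, table, and file names are placeholders), a query dataset can probe a single row, while a file-backed dataset can cap rows via load_args:

```yaml
ping_orders:
  type: pandas.GBQQueryDataset        # from kedro-datasets
  sql: SELECT 1 FROM `my-project.sales.orders` LIMIT 1
  project: my-project

sample_customers:
  type: pandas.CSVDataset
  filepath: data/01_raw/customers.csv
  load_args:
    nrows: 1                          # read only the first row
```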



Just wondering—what are you using to work with BigQuery?

Ibis has good BigQuery support (the BigFrames team at Google helped build and maintain the Ibis backend, and they use it under the hood of BigFrames). Especially on your second point: Ibis has a lazy/deferred execution model, so it can verify that tables exist and let you examine their schemas without loading any data.

Kedro's Ibis integration is on the newer side, but improving quickly.
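For instance (placeholder project, dataset, and table names, assuming the BigQuery backend is installed via ibis-framework[bigquery]):

```python
import ibis

con = ibis.bigquery.connect(project_id="my-project", dataset_id="sales")

# Both calls below hit only BigQuery's metadata service; no table data
# is scanned until an expression is actually executed.
print(con.list_tables())
print(con.table("orders").schema())
```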

I am using Ibis, yes. My issue is that I pull external tables into different nodes and want to make sure I know whenever the service account (SA) running the code does not have access to any of the datasets.

I think an after_catalog_created or before_pipeline_run hook could make sense.

  1. Not sure I really understand this question.
  2. Shouldn't be an issue if you're using Ibis; see the sketch below.
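To make that concrete, here is a rough sketch of such a check with Ibis. The names are placeholders, in practice you might derive the table list from the catalog or the pipeline's inputs and run the loop inside one of the hooks mentioned above, and whether the raw google-api-core exceptions propagate may depend on the Ibis version:

```python
import ibis
from google.api_core.exceptions import Forbidden, NotFound

con = ibis.bigquery.connect(project_id="my-project", dataset_id="sales")

# Fetching a schema touches only the metadata API, so a missing grant
# for the service account surfaces here without scanning any rows.
for name in ["orders", "customers"]:  # placeholder table names
    try:
        con.table(name).schema()
        print(f"OK: {name}")
    except (Forbidden, NotFound) as exc:
        print(f"NOT ACCESSIBLE: {name}: {exc}")
```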
