Validating External Dataset Accessibility in BigQuery

At a glance

Hey team, what is a good way of checking whether all the input tables for the nodes I want to run are accessible? I am having issues with permissions in BigQuery, and testing is cumbersome. Is there a way to run a validation of all external datasets in the catalog?

I was thinking of adding a hook and a metadata tag that identifies the datasets as external.

My main concerns are:

  1. How do I handle different dataset types?
  2. How do I only ping each table (or load just the first row) instead of loading it in full, for speed reasons?

Marked as solution

For different dataset types, you would need to adjust the check accordingly: maybe use load_args to limit rows, or use a custom dataset that queries just one row?
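
A rough sketch of that per-type dispatch, assuming a kedro-datasets version that ships ibis.TableDataset and pandas.GBQTableDataset; the helper name is made up, and you would extend the branches for whatever dataset types you actually have:

```
# Sketch only: cheap_access_check is a made-up helper, not a Kedro feature.
from kedro_datasets.ibis import TableDataset as IbisTableDataset
from kedro_datasets.pandas import GBQTableDataset


def cheap_access_check(dataset) -> None:
    """Raise if the dataset can't be reached, reading as little data as possible."""
    if isinstance(dataset, IbisTableDataset):
        # Lazy: load() returns an Ibis table expression, so schema() touches metadata only.
        dataset.load().schema()
    elif isinstance(dataset, GBQTableDataset):
        # Eager: a plain load() pulls the whole table, so in practice you'd point this
        # at a dataset whose load_args / query return just one row.
        dataset.load()
    else:
        dataset.load()  # fall back to a plain load attempt
```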

Just wondering—what are you using to work with BigQuery?

Ibis has good BigQuery support (the BigFrames team at Google helped build and maintain the Ibis backend, and they use it under the hood of BigFrames). Especially on your second point: Ibis has a lazy/deferred execution model, so connecting to a table confirms it exists and lets you examine its schema without loading any data.
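
For example, a sketch with placeholder identifiers (my-project, the analytics dataset and the orders table are not real), assuming the BigQuery backend for Ibis is installed:

```
import ibis

# Placeholders: substitute your own project / dataset / table names.
con = ibis.bigquery.connect(project_id="my-project", dataset_id="analytics")

orders = con.table("orders")      # raises if the table is missing or not visible to you
print(orders.schema())            # metadata only, no rows are read
print(orders.head(1).execute())   # optional smoke test: pulls exactly one row
```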

Kedro's Ibis integration is on the newer side, but improving quickly.

I am using Ibis, yes. My issue is that I pull external tables into different nodes and want to make sure I know whenever the service account (SA) running the code does not have access to one of them.

I think an after_catalog_created or before_pipeline_run hook could make sense.
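
Something like this, maybe (sketch only: the external metadata flag and the hook class name are mine, and I'd have to double-check the exact catalog API for my Kedro version):

```
# Sketch: the `external` metadata flag and the hook class name are made up.
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class ExternalInputCheckHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog: DataCatalog):
        failures = {}
        for name in pipeline.inputs():            # free inputs of the pipeline being run
            if name not in catalog.list():        # skip anything not declared in the catalog
                continue
            dataset = catalog._get_dataset(name)  # private API; adjust for your Kedro version
            metadata = getattr(dataset, "metadata", None) or {}
            if not metadata.get("external"):
                continue
            try:
                data = dataset.load()
                if hasattr(data, "schema"):       # lazy Ibis table: touches metadata only
                    data.schema()
            except Exception as exc:              # permission / not-found errors land here
                failures[name] = str(exc)
        if failures:
            raise RuntimeError(f"External inputs not accessible: {failures}")
```

I'd register it in settings.py with HOOKS = (ExternalInputCheckHook(),).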

  1. Not sure I really understand this question.
  2. Shouldn't be an issue if you're using Ibis.
