Enforcing Schema Data Type on Catalog

Hey Guys, quick question:

Is there a way to enforce the schema data types on the catalog?

Like:

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
    schema:


so that I can specify which columns I want to be in a certain data type

When reading, you can use the dtype arg of pandas’ read_csv method in the load_args field. Is that what you are looking for? So:

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
    dtype: {}
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
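For reference, here is a minimal sketch of what that `dtype` in `load_args` maps to in pandas (the column names and dtypes below are hypothetical, and an in-memory string stands in for the CSV file):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data/01_raw/company/cars.csv
csv_text = "name,year,price\nastra,2017,17500.0\ncorsa,2018,15000.0\n"

# Equivalent of load_args: {sep: ',', dtype: {...}} in the catalog entry:
# pandas applies the given dtypes while parsing instead of inferring them.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",
    dtype={"name": "string", "year": "int32", "price": "float64"},
)

print(df.dtypes)
```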

Not on loading, but when saving it.

Unfortunately no… pandas does not have such an option, as it infers the types from df.dtypes when writing to CSV. You can always cast to the correct dtypes from within the node.
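Casting inside a node can look like this minimal sketch (the node name, columns, and dtypes are hypothetical):

```python
import pandas as pd


def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Node that casts columns to the desired dtypes before the dataset saves them."""
    schema = {"name": "string", "year": "int32", "price": "float64"}
    return df.astype(schema)


# Example usage: everything arrives as strings, the node enforces the schema.
cars = pd.DataFrame({"name": ["astra"], "year": ["2017"], "price": ["17500"]})
typed = enforce_schema(cars)
print(typed.dtypes)
```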

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html

table_schema list of dicts, optional
List of BigQuery table fields to which according DataFrame columns conform to, e.g. [{'name': 'col1', 'type': 'STRING'},...]. If schema is not provided, it will be generated according to dtypes of DataFrame columns. See BigQuery API documentation on available names of a field.
New in version 0.3.1 of pandas-gbq.
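Assuming the dataset forwards `save_args` to pandas-gbq's `to_gbq` (the dataset, table, and project names below are hypothetical), a catalog entry passing `table_schema` could look like:

```yaml
cars_bq:
  type: pandas.GBQTableDataset
  dataset: company
  table_name: cars
  project: my-gcp-project
  save_args:
    table_schema:
      - {name: name, type: STRING}
      - {name: year, type: INTEGER}
      - {name: price, type: FLOAT}
```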

so it seems that it is possible

To be clear, it's not a pandas issue. CSV has no types because it is plain text, which is why it's generally known as a bad format for any data processing pipeline.

The reason you can do that with BigQuery is that it is a typed system that stores your dataframe as a table.

If you want to preserve type, use something like Parquet
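A quick sketch of the difference: a CSV round trip throws the dtypes away and pandas re-infers them, while Parquet stores them in the file's schema (the Parquet part is left commented since it needs pyarrow installed):

```python
import io

import pandas as pd

df = pd.DataFrame({"year": pd.array([2017, 2018], dtype="int32")})

# CSV round trip: dtypes are not stored, so pandas re-infers int64 on load.
csv_back = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(csv_back["year"].dtype)  # not int32 anymore

# Parquet round trip (requires pyarrow): the int32 dtype survives.
# df.to_parquet("cars.parquet")
# pd.read_parquet("cars.parquet")["year"].dtype
```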

So there's no way to feed a BQ table with specific data types for the columns using type: pandas.GBQTableDataSet?

You can definitely do that with BigQuery; I said you cannot save CSV with types, as that was the original example you provided.

oh yeah, my bad, I'm using
type: pandas.GBQTableDataSet

but for some reason I got

kedro.io.core.DatasetError: Failed while saving data to data set GBQTableDataset
Could not convert DataFrame to Parquet.
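A common cause of that error is an object column with mixed types, which the Parquet serialisation step can't map to a single column type. A sketch of the symptom and one possible fix (the column name is hypothetical, and `errors="coerce"` is an assumption about how you want to handle bad rows):

```python
import pandas as pd

# A mixed-type object column like this is a typical trigger for
# "Could not convert DataFrame to Parquet" during the save step.
df = pd.DataFrame({"price": [17500.0, "unknown", 15000.0]})
print(df["price"].dtype)  # object: floats and strings mixed

# Casting to one consistent dtype in a node before saving avoids it;
# errors="coerce" turns non-numeric values into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].dtype)
```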
