Schema

Schemas have the following structure:

  • directives - specify global rules and settings.
  • column rules - Define a set of expressions on which to evaluate each column.
  • data optional - a yaml-formatted dataset can be appended to the end of a schema to define value sets or other data used in validation.

Directives

Directives are assigned using a @ prefix and apply global settings.

@na_values

Use @na_values to specify values to treat as NA. See Missing Data for more details.

Default

@na_values NA

@empty_values

Use @empty_values to specify values to treat as EMPTY.

Use "" for empty cells. See Missing Data for more details.

Default

@empty_values "" NULL

@ordered

Require that the columns appear in the same order as specified in the schema.

@ordered

@fixed

Require that column names match the schema in the same order.

@fixed

@separater

Sets the separater/delimiter for a data file. Do not quote the delimiter. Use TAB or \t for tab-delimited data.

@separater: TAB
# comma-delimited
@separater: ,

@sep also works.

Column Rules

Column Rules consist of a column name and expressions to test for each column. For example, the following tests that the color column must be equal to red, blue or green.

color: any("red", "blue", "green")

Data Providers

Certain rules are easier to specify if you need to compare a column against a larger set of data.

YAML Data

Adding a dashline (---) signals the beginning of the data section of the schema. Any content below the dashline is parsed as YAML and can be accessed in expressions using its key. For example:

color_values:
  - red
  - blue
  - green

flavor_values:
  - chocolate
  - vanilla
  - strawberry

Column rules might be specified like this:

color: any(color_values)
flavor: any(flavor_values)

Functions supporting data providers

  • any

Missing Data

There are two types of missing data that still manages for additional flexibility. However, you can choose to treat all missing data as NA if desired. NA values represent “known” missing data. These are similar to NA values in R. EMPTY can be considered “unknown” missing data

To clarify further, consider a dataset on cars. The column mpg for all-electric vehicles would be labeled NA (“not applicable”) as it does not apply. Another scenerio might be that you know the name, make, and mpg of a new vehicle but not the color. This flexibility would allow you to throw an error with an NA value for color, but not for an EMPTY value.

Comments

You can add comments to your schema file using // or /* block */. For example:

// This is a comment
color: any("blue", "red", "green") // expression for the color column
/*
    Using a block comment is fun
*/