Column Rules

Introduction

Column rules consist of the rule and a test expression. All test expressions return a boolean (true/false) and allow you to evaluate conditions on a column. For brevity, test expressions are implicitely passed the column being evaluated as the first argument. For example:

# Column --> Status; Testing for a value containing '# NOTE'
status: is("fail")
# Converted to:
status: is(status, "fail")

You can still be explicit when referencing the column of interest, and you can also combine test functions using different columns:

# Check that a failure reason is provided for missing data.
status: is(status, "fail") && !is_missing(reason)

Operators

still uses Knetic/govaluate to evaluate expressions. See the manual for more detail on operators. The following operators are supported.

  • Modifiers: + - / * & | ^ ** % >> <<
  • Comparators: > >= < <= == != =~ !~
  • Logical ops: || &&
  • Numeric constants, as 64-bit floating point (12345.678)
  • String constants (single quotes: 'foobar')
  • Date constants (single quotes, using any RFC3339, ISO8601, ruby date, or unix date; date parsing is automatically tried with any string constant)
  • Boolean constants: true false
  • Parenthesis to control order of evaluation ( )
  • Arrays (anything separated by , within parenthesis: (1, 2, 'foo'))
  • Prefixes: ! - ~
  • Ternary conditional: ? :
  • Null coalescence: ??

Dates are parsed when specified as single quotes.

Basic

is

Tests whether a value matches.

color: is("red")
not

Tests whether a value does not match.

is_passed: not("fail")

Sets

any

Tests whether a value matches any of passed arguments.

color: any("red", "blue", "green")
color: any(provider)

You can also use an array with the IN operator, but you must specify the column name:

color: color IN ("red", "blue", "green")
unique

Tests whether a column is unique.

items: unique()
# You can also test that a set of columns are unique.
color: unique(size, weight) # remember that color is implicit

Note

unique does not work well on large datasets. It stores a hash digest of the arguments to test for uniqueness.

identical

Tests whether all values of a column are identical.

dataset_id: identical()
is_subset_list
is_subset_list(group_set, column_delimiter)
  • group_set - A comma-delimited set of values.
  • column_delimiter - The delimiter for the specified column you are testing.

Tests whether a delimited list (nested data) is a subset of the specified delimited list. For example:

letters: is_subset_list("A,B,C", ",") # If letters='A,B' --> TRUE
letters: is_subset_list("A,C", ",")   # If letters='A'B' --> FALSE

Numbers

range
range(lower, upper)

Tests whether a value falls between lower and upper inclusive.

rating: range(0,10)
is_positive

Tests whether a value is positive.

is_negative

Tests whether a value is negative.

Strings

contains
contains(substring)

Tests for the presence of a substring in a value.

regex
regex(expression)

Tests whether a value matches a regular expression.

You can also use =~ or !~ regex comparators.

(colname =~ "L[0-9]+")
uppercase

Tests whether a value is all uppercase.

lowercase

Tests whether a value is all lowercase.

length
length(min_len, max_len)

Tests for string length in a given column.

# Test for exact length match
column: length(10)

# Test for minimum length
column: length(10, "*")

# Test for range of lengths
column: length(10, 20)
is_url

Tests whether a string is a valid URL

Types

is_numeric

Checks if the value numeric.

is_int

Checks if the value an integer.

is_bool

Tests that column contains true, TRUE, false, or FALSE

is_string

Checks if the value is a string. This is done by checking that it does not look like an integer, bool, or numeric.

Dates

is_date

Checks whether a value is date-like using strict criteria. is_date will fail on ambiguous date strings. For example, 02/03/2020 is interpretted differently in Europe vs. the US but 2020-02-03 is not.

is_date_relaxed

Checks whether a value is date-like with potential ambiguity. 02/03/2020 would pass.

is_date_format

Check whether a column matches a specified date format. Formats can be specified as any date like this:

September 17, 2012, 10:10:09
oct. 7, 70
8/8/1965 12:00:00 AM
2006-01-02T15:04:05+0000
2014-04-26
date_of_birth: is_date_format("June 10, 1987")

See araddon/dateparse for more examples.

Important! You need to escape date values containing dashes using brackets ([]) or a double backslash (\\). For example 2020-02-10 is escaped like this:

collection_date: is_date_format("[2020-02-10]")
collection_date: is_date_format("2020\\-02\\-10")

Files

file_exists

Checks whether a file exists

file_min_size

Checks whether a file is a minimum size

photo: file_min_size("1MB")

fsize is a size string such as 100 mb, 1G, or 500.

mimetype

Validates the mimetype for a given file. See gabriel-vasile/mimetype for available mimetypes.

photo: mimetype("image/jpeg")

Missing Data

is_na

Checks whether a value is NA. Use !is_na() to restrict the presence of EMPTY values.

name: !is_na() # tests that no NA values exist
is_empty

Checks whether a value is EMPTY. Use !is_empty() to restrict the presence of EMPTY values.

name: !is_empty() # tests that no EMPTY values exist
is_missing

Returns true if is_na() or is_empty() is true.

name: !is_empty() # tests that no NA or EMPTY values exist

Providers

Providers allow you to specify arguments in test expressions from an external data source. For example, if you want to restrict

Schema YAML

You can append YAML data to the end of a .still datafile. See Schmema YAML Data for more details.

YAML File