Column Rules¶
Introduction¶
Column rules consist of the rule and a test expression. All test expressions return a boolean (true/false) and allow you to evaluate conditions on a column. For brevity, test expressions are implicitely passed the column being evaluated as the first argument. For example:
# Column --> Status; Testing for a value containing '# NOTE'
status: is("fail")
# Converted to:
status: is(status, "fail")
You can still be explicit when referencing the column of interest, and you can also combine test functions using different columns:
# Check that a failure reason is provided for missing data.
status: is(status, "fail") && !is_missing(reason)
Operators¶
still
uses Knetic/govaluate to evaluate expressions. See the manual for more detail on operators. The following operators are supported.
- Modifiers:
+
-
/
*
&
|
^
**
%
>>
<<
- Comparators:
>
>=
<
<=
==
!=
=~
!~
- Logical ops:
||
&&
- Numeric constants, as 64-bit floating point (
12345.678
) - String constants (single quotes:
'foobar'
) - Date constants (single quotes, using any RFC3339, ISO8601, ruby date, or unix date; date parsing is automatically tried with any string constant)
- Boolean constants:
true
false
- Parenthesis to control order of evaluation
(
)
- Arrays (anything separated by
,
within parenthesis:(1, 2, 'foo')
) - Prefixes:
!
-
~
- Ternary conditional:
?
:
- Null coalescence:
??
Dates are parsed when specified as single quotes.
Basic¶
is
¶
Tests whether a value matches.
color: is("red")
not
¶
Tests whether a value does not match.
is_passed: not("fail")
Sets¶
any
¶
Tests whether a value matches any of passed arguments.
color: any("red", "blue", "green")
color: any(provider)
You can also use an array with the IN
operator, but you must specify the column name:
color: color IN ("red", "blue", "green")
unique
¶
Tests whether a column is unique.
items: unique()
# You can also test that a set of columns are unique.
color: unique(size, weight) # remember that color is implicit
Note
unique
does not work well on large datasets. It stores a hash digest of the arguments to test for uniqueness.
identical
¶
Tests whether all values of a column are identical.
dataset_id: identical()
is_subset_list
¶
is_subset_list(group_set, column_delimiter)
group_set
- A comma-delimited set of values.column_delimiter
- The delimiter for the specified column you are testing.
Tests whether a delimited list (nested data) is a subset of the specified delimited list. For example:
letters: is_subset_list("A,B,C", ",") # If letters='A,B' --> TRUE
letters: is_subset_list("A,C", ",") # If letters='A'B' --> FALSE
Numbers¶
range
¶
range(lower, upper)
Tests whether a value falls between lower
and upper
inclusive.
rating: range(0,10)
is_positive
¶
Tests whether a value is positive.
is_negative
¶
Tests whether a value is negative.
Strings¶
contains
¶
contains(substring)
Tests for the presence of a substring in a value.
regex
¶
regex(expression)
Tests whether a value matches a regular expression.
You can also use =~
or !~
regex comparators.
(colname =~ "L[0-9]+")
uppercase
¶
Tests whether a value is all uppercase.
lowercase
¶
Tests whether a value is all lowercase.
length
¶
length(min_len, max_len)
Tests for string length in a given column.
# Test for exact length match
column: length(10)
# Test for minimum length
column: length(10, "*")
# Test for range of lengths
column: length(10, 20)
is_url
¶
Tests whether a string is a valid URL
Types¶
is_numeric
¶
Checks if the value numeric.
is_int
¶
Checks if the value an integer.
is_bool
¶
Tests that column contains true
, TRUE
, false
, or FALSE
is_string
¶
Checks if the value is a string. This is done by checking that it does not look like an integer, bool, or numeric.
Dates¶
is_date
¶
Checks whether a value is date-like using strict criteria. is_date
will fail on ambiguous date strings. For example, 02/03/2020
is interpretted differently in Europe vs. the US but 2020-02-03
is not.
is_date_relaxed
¶
Checks whether a value is date-like with potential ambiguity. 02/03/2020
would pass.
is_date_format
¶
Check whether a column matches a specified date format. Formats can be specified as any date like this:
September 17, 2012, 10:10:09
oct. 7, 70
8/8/1965 12:00:00 AM
2006-01-02T15:04:05+0000
2014-04-26
date_of_birth: is_date_format("June 10, 1987")
See araddon/dateparse for more examples.
Important! You need to escape date values containing dashes using brackets ([]
) or a double backslash (\\
). For example 2020-02-10
is escaped like this:
collection_date: is_date_format("[2020-02-10]")
collection_date: is_date_format("2020\\-02\\-10")
Files¶
file_exists
¶
Checks whether a file exists
file_min_size
¶
Checks whether a file is a minimum size
photo: file_min_size("1MB")
fsize
is a size string such as 100 mb
, 1G
, or 500
.
mimetype
¶
Validates the mimetype for a given file. See gabriel-vasile/mimetype for available mimetypes.
photo: mimetype("image/jpeg")
Missing Data¶
is_na
¶
Checks whether a value is NA
. Use !is_na()
to restrict the presence of EMPTY
values.
name: !is_na() # tests that no NA values exist
is_empty
¶
Checks whether a value is EMPTY
. Use !is_empty()
to restrict the presence of EMPTY
values.
name: !is_empty() # tests that no EMPTY values exist
is_missing
¶
Returns true
if is_na()
or is_empty()
is true.
name: !is_empty() # tests that no NA or EMPTY values exist
Providers¶
Providers allow you to specify arguments in test expressions from an external data source. For example, if you want to restrict
Schema YAML¶
You can append YAML data to the end of a .still
datafile. See Schmema YAML Data for more details.