Quick links:   Flag list   Verb list   Function list   Glossary   Repository ↗
# Null/empty/absent data One of Miller's key features is its support for **heterogeneous** data. For example, take `mlr sort`: if you try to sort on field `hostname` when not all records in the data stream *have* a field named `hostname`, it is not an error (although you could pre-filter the data stream using `mlr having-fields --at-least hostname then sort ...`). Rather, records lacking one or more sort keys are simply output contiguously by `mlr sort`. Miller has three kinds of null data: * **Empty (key present, value empty)**: a field name is present in a record (or in an out-of-stream variable) with empty value: e.g. `x=,y=2` in the data input stream, or assignment `$x=""` or `@x=""` in `mlr put`. * **Absent (key not present)**: a field name is not present, e.g. input record is `x=1,y=2` and a `put` or `filter` expression refers to `$z`. Or, reading an out-of-stream variable which hasn't been assigned a value yet, e.g. `mlr put -q '@sum += $x; end{emit @sum}'` or `mlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'`. * **JSON null**: The main purpose of this is to support reading the `null` type in JSON files. The [Miller programming language](programming-language.md) has a `null` keyword as well, so you can also write the null type using `$x = null`. Addtionally, though, when you write past the end of an array, leaving gaps -- e.g. writing `a[12]` when the array `a` has length 10 -- JSON-null is used to fill the gaps. See also the [arrays page](reference-main-arrays.md#auto-extend-and-null-gaps). You can test these programatically using the functions `is_empty`/`is_not_empty`, `is_absent`/`is_present`, and `is_null`/`is_not_null`. For the last pair, note that null means either empty or absent. Rules for null-handling: * Records with one or more empty sort-field values sort after records with all sort-field values present:
mlr cat data/sort-null.dat
a=3,b=2
a=1,b=8
a=,b=4
x=9,b=10
a=5,b=7
mlr sort -n a data/sort-null.dat
a=1,b=8
a=3,b=2
a=5,b=7
a=,b=4
x=9,b=10
mlr sort -nr a data/sort-null.dat
a=,b=4
a=5,b=7
a=3,b=2
a=1,b=8
x=9,b=10
* Functions/operators which have one or more *empty* arguments produce empty output: e.g.
echo 'x=2,y=3' | mlr put '$a=$x+$y'
x=2,y=3,a=5
echo 'x=,y=3' | mlr put '$a=$x+$y'
x=,y=3,a=
echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'
x=,y=3,a=,b=1.0986122886681096
with the exception that the `min` and `max` functions are special: if one argument is non-null, it wins:
echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'
x=,y=3,a=3,b=
* Functions of *absent* variables (e.g. `mlr put '$y = log10($nonesuch)'`) evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication: Furthermore, **any expression which evaluates to absent is not stored in the left-hand side of an assignment statement**:
echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'
x=2,y=3,b=3,c=5
echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'
x=2,y=3,a=2,b=3
* Likewise, for assignment to maps, **absent-valued keys or values result in a skipped assignment**. The reasoning is as follows: * Empty values are explicit in the data so they should explicitly affect accumulations: `mlr put '@sum += $x'` should accumulate numeric `x` values into the sum but an empty `x`, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use the `is_not_null` function as follows: `mlr put 'is_not_null($x) { @sum += $x }'` * Absent stream-record values should not break accumulations, since Miller by design handles heterogenous data: the running `@sum` in `mlr put '@sum += $x'` should not be invalidated for records which have no `x`. * Absent out-of-stream-variable values are precisely what allow you to write `mlr put '@sum += $x'`. Otherwise you would have to write `mlr put 'begin{@sum = 0}; @sum += $x'` -- which is tolerable -- but for `mlr put 'begin{...}; @sum[$a][$b] += $x'` you'd have to pre-initialize `@sum` for all values of `$a` and `$b` in your input data stream, which is intolerable. * The penalty for the absent feature is that misspelled variables can be hard to find: e.g. in `mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}'` the accumulator is spelt `@sumx` in the begin-block but `@sum` in the end-block, where since it is absent, `@sum*2` evaluates to 2. See also the section on [DSL errors and transparency](reference-dsl-errors.md). Since absent plus absent is absent (and likewise for other operators), accumulations such as `@sum += $x` work correctly on heterogenous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don't desire. To work around this -- namely, to set an output field only for records which have all the inputs present -- you can use a pattern-action block with `is_present`:
mlr cat data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false
mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970
mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450
record_count=100,resource=/path/to/file,loadmillis=0
resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320
record_count=150,resource=/path/to/second/file,loadmillis=0
resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970
If you're interested in a formal description of how empty and absent fields participate in arithmetic, here's a table for plus (other arithmetic/boolean/bitwise operators are similar):
mlr help type-arithmetic-info
(+)        | 1          2.5        (absent)   (error)   
------     + ------     ------     ------     ------    
1          | 2          3.5        1          (error)   
2.5        | 3.5        5          2.5        (error)   
(absent)   | 1          2.5        (absent)   (error)   
(error)    | (error)    (error)    (error)    (error)