miller/docs6/docs/reference-main-null-data.md.in

91 lines
5.5 KiB
Markdown

# Null/empty/absent data
One of Miller's key features is its support for **heterogeneous** data. For example, take `mlr sort`: if you try to sort on field `hostname` when not all records in the data stream *have* a field named `hostname`, it is not an error (although you could pre-filter the data stream using `mlr having-fields --at-least hostname then sort ...`). Rather, records lacking one or more sort keys are simply output contiguously by `mlr sort`.
Miller has three kinds of null data:
* **Empty (key present, value empty)**: a field name is present in a record (or in an out-of-stream variable) with empty value: e.g. `x=,y=2` in the data input stream, or assignment `$x=""` or `@x=""` in `mlr put`.
* **Absent (key not present)**: a field name is not present, e.g. input record is `x=1,y=2` and a `put` or `filter` expression refers to `$z`. Or, reading an out-of-stream variable which hasn't been assigned a value yet, e.g. `mlr put -q '@sum += $x; end{emit @sum}'` or `mlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'`.
* **JSON null**: The main purpose of this is to support reading the `null` type in JSON files. The [Miller programming language](programming-language.md) has a `null` keyword as well, so you can also write the null type using `$x = null`. Addtionally, though, when you write past the end of an array, leaving gaps -- e.g. writing `a[12]` when the array `a` has length 10 -- JSON-null is used to fill the gaps. See also the [arrays page](reference-main-arrays.md#auto-extend-and-null-gaps).
You can test these programatically using the functions `is_empty`/`is_not_empty`, `is_absent`/`is_present`, and `is_null`/`is_not_null`. For the last pair, note that null means either empty or absent.
Rules for null-handling:
* Records with one or more empty sort-field values sort after records with all sort-field values present:
GENMD_RUN_COMMAND
mlr cat data/sort-null.dat
GENMD_EOF
GENMD_RUN_COMMAND
mlr sort -n a data/sort-null.dat
GENMD_EOF
GENMD_RUN_COMMAND
mlr sort -nr a data/sort-null.dat
GENMD_EOF
* Functions/operators which have one or more *empty* arguments produce empty output: e.g.
GENMD_RUN_COMMAND
echo 'x=2,y=3' | mlr put '$a=$x+$y'
GENMD_EOF
GENMD_RUN_COMMAND
echo 'x=,y=3' | mlr put '$a=$x+$y'
GENMD_EOF
GENMD_RUN_COMMAND
echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'
GENMD_EOF
with the exception that the `min` and `max` functions are special: if one argument is non-null, it wins:
GENMD_RUN_COMMAND
echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'
GENMD_EOF
* Functions of *absent* variables (e.g. `mlr put '$y = log10($nonesuch)'`) evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication: Furthermore, **any expression which evaluates to absent is not stored in the left-hand side of an assignment statement**:
GENMD_RUN_COMMAND
echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'
GENMD_EOF
GENMD_RUN_COMMAND
echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'
GENMD_EOF
* Likewise, for assignment to maps, **absent-valued keys or values result in a skipped assignment**.
The reasoning is as follows:
* Empty values are explicit in the data so they should explicitly affect accumulations: `mlr put '@sum += $x'` should accumulate numeric `x` values into the sum but an empty `x`, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use the `is_not_null` function as follows: `mlr put 'is_not_null($x) { @sum += $x }'`
* Absent stream-record values should not break accumulations, since Miller by design handles heterogenous data: the running `@sum` in `mlr put '@sum += $x'` should not be invalidated for records which have no `x`.
* Absent out-of-stream-variable values are precisely what allow you to write `mlr put '@sum += $x'`. Otherwise you would have to write `mlr put 'begin{@sum = 0}; @sum += $x'` -- which is tolerable -- but for `mlr put 'begin{...}; @sum[$a][$b] += $x'` you'd have to pre-initialize `@sum` for all values of `$a` and `$b` in your input data stream, which is intolerable.
* The penalty for the absent feature is that misspelled variables can be hard to find: e.g. in `mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}'` the accumulator is spelt `@sumx` in the begin-block but `@sum` in the end-block, where since it is absent, `@sum*2` evaluates to 2. See also the section on [DSL errors and transparency](reference-dsl-errors.md).
Since absent plus absent is absent (and likewise for other operators), accumulations such as `@sum += $x` work correctly on heterogenous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don't desire. To work around this -- namely, to set an output field only for records which have all the inputs present -- you can use a pattern-action block with `is_present`:
GENMD_RUN_COMMAND
mlr cat data/het.dkvp
GENMD_EOF
GENMD_RUN_COMMAND
mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp
GENMD_EOF
GENMD_RUN_COMMAND
mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp
GENMD_EOF
If you're interested in a formal description of how empty and absent fields participate in arithmetic, here's a table for plus (other arithmetic/boolean/bitwise operators are similar):
GENMD_RUN_COMMAND
mlr help type-arithmetic-info
GENMD_EOF