7.3 KiB
null data
One of Miller's key features is its support for heterogeneous data. For example, take mlr sort: if you try to sort on field hostname when not all records in the data stream have a field named hostname, it is not an error (although you could pre-filter the data stream using mlr having-fields --at-least hostname then sort ...). Rather, records lacking one or more sort keys are simply output contiguously by mlr sort.
Miller has two kinds of null data:
-
Empty (key present, value empty): a field name is present in a record (or in an out-of-stream variable) with empty value: e.g.
x=,y=2in the data input stream, or assignment$x=""or@x=""inmlr put. -
Absent (key not present): a field name is not present, e.g. input record is
x=1,y=2and aputorfilterexpression refers to$z. Or, reading an out-of-stream variable which hasn't been assigned a value yet, e.g.mlr put -q '@sum += $x; end{emit @sum}'ormlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'.
You can test these programatically using the functions is_empty/is_not_empty, is_absent/is_present, and is_null/is_not_null. For the last pair, note that null means either empty or absent.
Rules for null-handling:
- Records with one or more empty sort-field values sort after records with all sort-field values present:
mlr cat data/sort-null.dat
a=3,b=2 a=1,b=8 a=,b=4 x=9,b=10 a=5,b=7
mlr sort -n a data/sort-null.dat
a=1,b=8 a=3,b=2 a=5,b=7 a=,b=4 x=9,b=10
mlr sort -nr a data/sort-null.dat
a=,b=4 a=5,b=7 a=3,b=2 a=1,b=8 x=9,b=10
- Functions/operators which have one or more empty arguments produce empty output: e.g.
echo 'x=2,y=3' | mlr put '$a=$x+$y'
x=2,y=3,a=5
echo 'x=,y=3' | mlr put '$a=$x+$y'
x=,y=3,a=
echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'
x=,y=3,a=,b=1.0986122886681096
with the exception that the min and max functions are special: if one argument is non-null, it wins:
echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'
x=,y=3,a=3,b=
- Functions of absent variables (e.g.
mlr put '$y = log10($nonesuch)') evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication: Furthermore, any expression which evaluates to absent is not stored in the left-hand side of an assignment statement:
echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'
x=2,y=3,b=3,c=5
echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'
x=2,y=3,a=2,b=3
- Likewise, for assignment to maps, absent-valued keys or values result in a skipped assignment.
The reasoning is as follows:
-
Empty values are explicit in the data so they should explicitly affect accumulations:
mlr put '@sum += $x'should accumulate numericxvalues into the sum but an emptyx, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use theis_not_nullfunction as follows:mlr put 'is_not_null($x) { @sum += $x }' -
Absent stream-record values should not break accumulations, since Miller by design handles heterogenous data: the running
@suminmlr put '@sum += $x'should not be invalidated for records which have nox. -
Absent out-of-stream-variable values are precisely what allow you to write
mlr put '@sum += $x'. Otherwise you would have to writemlr put 'begin{@sum = 0}; @sum += $x'-- which is tolerable -- but formlr put 'begin{...}; @sum[$a][$b] += $x'you'd have to pre-initialize@sumfor all values of$aand$bin your input data stream, which is intolerable. -
The penalty for the absent feature is that misspelled variables can be hard to find: e.g. in
mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}'the accumulator is spelt@sumxin the begin-block but@sumin the end-block, where since it is absent,@sum*2evaluates to 2. See also the section on DSL errors and transparency.
Since absent plus absent is absent (and likewise for other operators), accumulations such as @sum += $x work correctly on heterogenous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don't desire. To work around this -- namely, to set an output field only for records which have all the inputs present -- you can use a pattern-action block with is_present:
mlr cat data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true record_count=100,resource=/path/to/file resource=/path/to/second/file,loadsec=0.32,ok=true record_count=150,resource=/path/to/second/file resource=/some/other/path,loadsec=0.97,ok=false
mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450 record_count=100,resource=/path/to/file resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320 record_count=150,resource=/path/to/second/file resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970
mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450 record_count=100,resource=/path/to/file,loadmillis=0 resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320 record_count=150,resource=/path/to/second/file,loadmillis=0 resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970
If you're interested in a formal description of how empty and absent fields participate in arithmetic, here's a table for plus (other arithmetic/boolean/bitwise operators are similar):
mlr help type-arithmetic-info
(+) | 1 2.5 (absent) (error) ------ + ------ ------ ------ ------ 1 | 2 3.5 1 (error) 2.5 | 3.5 5 2.5 (error) (absent) | 1 2.5 (absent) (error) (error) | (error) (error) (error) (error)