miller/docs6/docs/two-pass-algorithms.md.in
John Kerl 4f1424789e
Doc6 proofreads 3 (#638)
2021-09-03 23:19:32 -04:00


# Two-pass algorithms
## Overview
Miller is a streaming record processor; commands are performed once per record.
(See [here](reference-dsl.md#implicit-loop-over-records-for-main-statements)
and [here](operating-on-all-records.md) for an introductory discussion.) This
makes Miller particularly suitable for single-pass algorithms, allowing many of
its verbs to process files that are (much) larger than the amount of RAM
present in your system. (Of course, Miller verbs such as `sort`, `tac`, etc.
all must ingest and retain all input records before emitting any output records
-- see the [page on streaming processing and memory
usage](streaming-and-memory.md).) You can also use [out-of-stream
variables](reference-dsl-variables.md#out-of-stream-variables) to perform
multi-pass computations, at the price of retaining all input records in memory.
One of Miller's strengths is its compact notation: for example, given input of the form
GENMD_RUN_COMMAND
head -n 5 ./data/medium
GENMD_EOF
you can simply do
GENMD_RUN_COMMAND
mlr --oxtab stats1 -a sum -f x ./data/medium
GENMD_EOF
or
GENMD_RUN_COMMAND
mlr --opprint stats1 -a sum -f x -g b ./data/medium
GENMD_EOF
rather than the more tedious
GENMD_RUN_COMMAND
mlr --oxtab put -q '
  @x_sum += $x;
  end {
    emit @x_sum
  }
' data/medium
GENMD_EOF
or
GENMD_RUN_COMMAND
mlr --opprint put -q '
  @x_sum[$b] += $x;
  end {
    emit @x_sum, "b"
  }
' data/medium
GENMD_EOF
The former (`mlr stats1` et al.) is easier to type, less error-prone, and faster to run.

Nonetheless,
[out-of-stream variables](reference-dsl-variables.md#out-of-stream-variables) (which I
whimsically call *oosvars*),
[begin/end blocks](reference-main-overview.md), and
[emit statements](reference-dsl-output-statements.md#emit-statements) give
you the ability to implement logic -- if you wish to do so -- which isn't
present in other Miller verbs. (If you find yourself using the same
out-of-stream-variable logic over and over, please file a request at
[https://github.com/johnkerl/miller/issues](https://github.com/johnkerl/miller/issues)
to get it implemented directly in Go as a Miller verb of its own.)

The following examples use oosvars to compute things which are already computable using Miller verbs, by way of providing food for thought.
## Computation of percentages
For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values, then on the second, map each record's value to a percentage.
GENMD_RUN_COMMAND
mlr --from data/small --opprint put -q '
  # These are executed once per record, which is the first pass.
  # The key is to use NR to index an out-of-stream variable to
  # retain all the x-field values.
  @x_min = min($x, @x_min);
  @x_max = max($x, @x_max);
  @x[NR] = $x;

  # The second pass is in a for-loop in an end-block.
  end {
    for (nr, x in @x) {
      @x_pct[nr] = 100 * (x - @x_min) / (@x_max - @x_min);
    }
    emit (@x, @x_pct), "NR"
  }
'
GENMD_EOF
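To make the two passes explicit outside of Miller, here is a minimal Python sketch of the same idea. The function name `percent_of_range` and the sample values are made up for illustration, and the guard for the all-values-equal case is an addition that the DSL example above does not have.

```python
def percent_of_range(values):
    """Map each value to its percentage position between the min and the max."""
    # First pass: find the extremes.
    lo, hi = min(values), max(values)
    span = hi - lo
    # Second pass: rescale each value. If all values are equal the span is
    # zero, so return 0.0 rather than dividing by zero (an assumption).
    return [100 * (v - lo) / span if span else 0.0 for v in values]


print(percent_of_range([2.0, 4.0, 3.0, 6.0]))  # [0.0, 50.0, 25.0, 100.0]
```

As in the DSL version, the price is that every value is held in memory until the first pass is done.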
## Line-number ratios
Similarly, finding the total record count requires first reading through all the data:
GENMD_RUN_COMMAND
mlr --opprint --from data/small put -q '
  @records[NR] = $*;
  end {
    for ((Istring, k), v in @records) {
      int I = int(Istring);
      @records[I]["I"] = I;
      @records[I]["N"] = NR;
      @records[I]["PCT"] = 100*I/NR
    }
    emit @records, "I"
  }
' then reorder -f I,N,PCT
GENMD_EOF
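The same retain-then-annotate pattern can be sketched in Python. This is only an illustration: `add_position_fields` and the sample records are hypothetical, and the new fields are placed first to mimic the effect of `reorder` in the command above.

```python
def add_position_fields(records):
    """Annotate each record with its index I, the total count N, and PCT."""
    # First pass: retain every record so that the total count is known.
    retained = list(records)
    n = len(retained)
    # Second pass: put I, N, and PCT in front of each record's own fields.
    return [
        {"I": i, "N": n, "PCT": 100 * i / n, **rec}
        for i, rec in enumerate(retained, start=1)
    ]


for row in add_position_fields([{"a": "pan"}, {"a": "eks"}, {"a": "wye"}, {"a": "zee"}]):
    print(row)
```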
## Records having max value
The idea is to retain records having the largest value of `n` in the following data:
GENMD_RUN_COMMAND
mlr --itsv --opprint cat data/maxrows.tsv
GENMD_EOF
Of course, the largest value of `n` isn't known until after all data have been read. Using an [out-of-stream variable](reference-dsl-variables.md#out-of-stream-variables) we can [retain all records as they are read](operating-on-all-records.md), then filter them at the end:
GENMD_RUN_COMMAND
cat data/maxrows.mlr
GENMD_EOF
GENMD_RUN_COMMAND
mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv
GENMD_EOF
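For comparison, here is a rough Python sketch of the retain-then-filter idea. It is not a transcription of `data/maxrows.mlr`; the function name `records_with_max` and the sample rows are assumptions made for illustration.

```python
def records_with_max(records, field):
    """Return the records whose value of `field` equals the largest one seen."""
    # First pass: retain all records, since the max isn't known until the end.
    retained = list(records)
    if not retained:
        return []
    best = max(rec[field] for rec in retained)
    # Second pass: keep only the records attaining the max, ties included.
    return [rec for rec in retained if rec[field] == best]


rows = [{"name": "x", "n": 3}, {"name": "y", "n": 7}, {"name": "z", "n": 7}]
print(records_with_max(rows, "n"))  # the two records with n == 7
```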
## Feature-counting
Suppose you have some heterogeneous data like this:
GENMD_INCLUDE_ESCAPED(data/features.json)
A reasonable question to ask is: how many occurrences of each field are there? And what percentage of the total row count has each of them? Since the denominator of the percentage is not known until the end, this is a two-pass algorithm:
GENMD_INCLUDE_ESCAPED(data/feature-count.mlr)
Then
GENMD_RUN_COMMAND
mlr --json put -q -f data/feature-count.mlr data/features.json
GENMD_EOF
GENMD_RUN_COMMAND
mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json
GENMD_EOF
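The counting logic can also be sketched in Python to show where the denominator comes from. This is an illustrative sketch rather than a port of `data/feature-count.mlr`; the name `feature_counts`, the output keys `count` and `pct`, and the sample records are all assumptions.

```python
from collections import Counter


def feature_counts(records):
    """Count how often each field name occurs, and its share of all records."""
    counts = Counter()
    total = 0
    # First pass: tally field names and the number of records.
    for rec in records:
        total += 1
        counts.update(rec.keys())
    # Second pass: the denominator is known only now.
    return {k: {"count": c, "pct": 100 * c / total} for k, c in counts.items()}


sample = [{"qoh": 1, "rate": 2}, {"qoh": 3}, {"name": "x", "rate": 1}, {"qoh": 0}]
print(feature_counts(sample))
```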
## Unsparsing
Filling out missing data fields within CSV that has a full header line is one problem -- there, the list of all field names is present within the header line. Next, let's look at a related problem: we have data where each record has various key names, but we want to produce rectangular output having the union of all key names.

There is a keystroke-saving verb for this: [unsparsify](reference-verbs.md#unsparsify). Here, we look at how to implement that in the DSL.

For example, suppose you have JSON input like this:
GENMD_RUN_COMMAND
cat data/sparse.json
GENMD_EOF
There are field names `a`, `b`, `v`, `u`, `x`, `w` in the data -- but not all in every record. Since we don't know the names of all the keys until we've read them all, this needs to be a two-pass algorithm. On the first pass, remember all the unique key names and all the records; on the second pass, loop through the records, filling in absent values, then produce output. Use `put -q` since we don't want per-record output, only the output emitted in the `end` block:
GENMD_RUN_COMMAND
cat data/unsparsify.mlr
GENMD_EOF
GENMD_RUN_COMMAND
mlr --json put -q -f data/unsparsify.mlr data/sparse.json
GENMD_EOF
GENMD_RUN_COMMAND
mlr --ijson --ocsv put -q -f data/unsparsify.mlr data/sparse.json
GENMD_EOF
GENMD_RUN_COMMAND
mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json
GENMD_EOF
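The two passes described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the contents of `data/unsparsify.mlr`: the function name is hypothetical, and filling absent values with an empty string is an assumed default.

```python
def unsparsify(records, fill=""):
    """Give every record the union of all key names, filling absent values."""
    retained = []
    keys = {}  # a dict used as an insertion-ordered set of key names
    # First pass: remember all the unique key names and all the records.
    for rec in records:
        retained.append(rec)
        for k in rec:
            keys[k] = True
    # Second pass: emit each record with every key, in first-seen order.
    return [{k: rec.get(k, fill) for k in keys} for rec in retained]


for row in unsparsify([{"a": 1, "b": 2}, {"u": 3}, {"a": 4, "x": 5}]):
    print(row)
```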
## Mean without/with oosvars
GENMD_RUN_COMMAND
mlr --opprint stats1 -a mean -f x data/medium
GENMD_EOF
GENMD_RUN_COMMAND
mlr --opprint put -q '
  @x_sum += $x;
  @x_count += 1;
  end {
    @x_mean = @x_sum / @x_count;
    emit @x_mean
  }
' data/medium
GENMD_EOF
## Keyed mean without/with oosvars
GENMD_RUN_COMMAND
mlr --opprint stats1 -a mean -f x -g a,b data/medium
GENMD_EOF
GENMD_RUN_COMMAND
mlr --opprint put -q '
  @x_sum[$a][$b] += $x;
  @x_count[$a][$b] += 1;
  end {
    for ((a, b), v in @x_sum) {
      @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
    }
    emit @x_mean, "a", "b"
  }
' data/medium
GENMD_EOF
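The keyed accumulation above maps naturally onto dictionaries keyed by the group values. Here is a small Python sketch of that idea; `keyed_means` and the sample records are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict


def keyed_means(records, keys, field):
    """Compute the mean of `field` for each distinct combination of `keys`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Per-record work: accumulate a sum and a count for each group.
    for rec in records:
        group = tuple(rec[k] for k in keys)
        sums[group] += rec[field]
        counts[group] += 1
    # End-of-stream work: divide each sum by its count.
    return {group: sums[group] / counts[group] for group in sums}


sample = [
    {"a": "pan", "b": "wye", "x": 1.0},
    {"a": "pan", "b": "wye", "x": 3.0},
    {"a": "eks", "b": "zee", "x": 5.0},
]
print(keyed_means(sample, ["a", "b"], "x"))
```

Only one sum and one count per group are retained, so memory use grows with the number of distinct groups rather than with the number of records.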
## Variance and standard deviation without/with oosvars
GENMD_RUN_COMMAND
mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
GENMD_EOF
GENMD_RUN_COMMAND
cat variance.mlr
GENMD_EOF
GENMD_RUN_COMMAND
mlr --oxtab put -q -f variance.mlr data/medium
GENMD_EOF
You can also do this keyed, of course, imitating the keyed-mean example above.
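One common way to get variance from streaming accumulators is to keep a count, a sum, and a sum of squares. The Python sketch below illustrates that approach; it is not necessarily what `variance.mlr` does, it assumes the sample variance (denominator `n - 1`), and the function name is hypothetical.

```python
import math


def streaming_variance(xs):
    """Return (mean, sample variance, standard deviation) from running sums."""
    n = 0
    sumx = 0.0
    sumx2 = 0.0
    # Per-record work: update three accumulators; no values are retained.
    for x in xs:
        n += 1
        sumx += x
        sumx2 += x * x
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    # End-of-stream work: combine the accumulators.
    mean = sumx / n
    var = (sumx2 - n * mean * mean) / (n - 1)
    return mean, var, math.sqrt(var)


print(streaming_variance([1.0, 2.0, 3.0, 4.0]))
```

This textbook formula can lose precision when the mean is large relative to the spread of the values, so treat it as a sketch of the accumulator idea rather than a numerically robust implementation.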
## Min/max without/with oosvars
GENMD_RUN_COMMAND
mlr --oxtab stats1 -a min,max -f x data/medium
GENMD_EOF
GENMD_RUN_COMMAND
mlr --oxtab put -q '
  @x_min = min(@x_min, $x);
  @x_max = max(@x_max, $x);
  end {
    emitf @x_min, @x_max
  }
' data/medium
GENMD_EOF
## Keyed min/max without/with oosvars
GENMD_RUN_COMMAND
mlr --opprint stats1 -a min,max -f x -g a data/medium
GENMD_EOF
GENMD_RUN_COMMAND
mlr --opprint --from data/medium put -q '
  @min[$a] = min(@min[$a], $x);
  @max[$a] = max(@max[$a], $x);
  end {
    emit (@min, @max), "a";
  }
'
GENMD_EOF
## Delta without/with oosvars
GENMD_RUN_COMMAND
mlr --opprint step -a delta -f x data/small
GENMD_EOF
GENMD_RUN_COMMAND
mlr --opprint put '
  $x_delta = is_present(@last) ? $x - @last : 0;
  @last = $x
' data/small
GENMD_EOF
## Keyed delta without/with oosvars
GENMD_RUN_COMMAND
mlr --opprint step -a delta -f x -g a data/small
GENMD_EOF
GENMD_RUN_COMMAND
mlr --opprint put '
  $x_delta = is_present(@last[$a]) ? $x - @last[$a] : 0;
  @last[$a] = $x
' data/small
GENMD_EOF
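The keyed delta needs only the previous value per key, which the following Python sketch makes explicit. The function name `keyed_deltas` and the sample records are made up for illustration; as in the DSL above, the first record seen for each key gets a delta of 0.

```python
def keyed_deltas(records, key, field):
    """Add `<field>_delta`: the change in `field` since the last record with the same key."""
    last = {}  # most recent value of `field` seen for each key
    out = []
    for rec in records:
        k = rec[key]
        # The first record for a key has nothing to compare against.
        delta = rec[field] - last[k] if k in last else 0
        last[k] = rec[field]
        out.append({**rec, field + "_delta": delta})
    return out


sample = [
    {"a": "pan", "x": 1},
    {"a": "eks", "x": 5},
    {"a": "pan", "x": 4},
    {"a": "eks", "x": 2},
]
print([rec["x_delta"] for rec in keyed_deltas(sample, "a", "x")])  # [0, 0, 3, -3]
```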
## Exponentially weighted moving averages without/with oosvars
GENMD_RUN_COMMAND
mlr --opprint step -a ewma -d 0.1 -f x data/small
GENMD_EOF
GENMD_RUN_COMMAND
mlr --opprint put '
  begin{ @a=0.1 };
  $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
  @e=$e
' data/small
GENMD_EOF
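The recurrence in the DSL example is `e = alpha * x + (1 - alpha) * e_previous`, seeded with the first value. A short Python sketch of that recurrence follows; the function name `ewma` and the sample inputs are chosen for illustration only.

```python
def ewma(xs, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    out = []
    prev = None
    for x in xs:
        # The first value passes through unchanged; afterwards, blend the
        # new value with the previous average.
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out


print(ewma([1.0, 2.0, 3.0], 0.5))  # [1.0, 1.5, 2.25]
```

A smaller `alpha` weights history more heavily, so the average responds more slowly to new values.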