# Two-pass algorithms

## Overview

Miller is a streaming record processor; commands are performed once per record.
(See [here](reference-dsl.md#implicit-loop-over-records-for-main-statements)
and [here](operating-on-all-records.md) for an introductory discussion.) This
makes Miller particularly suitable for single-pass algorithms, allowing many of
its verbs to process files that are (much) larger than the amount of RAM
present in your system. (Of course, Miller verbs such as `sort`, `tac`, etc.
all must ingest and retain all input records before emitting any output records
-- see the [page on streaming processing and memory
usage](streaming-and-memory.md).) You can also use [out-of-stream
variables](reference-dsl-variables.md#out-of-stream-variables) to perform
multi-pass computations, at the price of retaining all input records in memory.

One of Miller's strengths is its compact notation: for example, given input of the form

GENMD_RUN_COMMAND
head -n 5 ./data/medium
GENMD_EOF

you can simply do

GENMD_RUN_COMMAND
mlr --oxtab stats1 -a sum -f x ./data/medium
GENMD_EOF

or

GENMD_RUN_COMMAND
mlr --opprint stats1 -a sum -f x -g b ./data/medium
GENMD_EOF

rather than the more tedious

GENMD_RUN_COMMAND
mlr --oxtab put -q '
  @x_sum += $x;
  end {
    emit @x_sum
  }
' data/medium
GENMD_EOF

or

GENMD_RUN_COMMAND
mlr --opprint put -q '
  @x_sum[$b] += $x;
  end {
    emit @x_sum, "b"
  }
' data/medium
GENMD_EOF

The former (`mlr stats1` et al.) is easier to type, less error-prone, and faster to run.

Nonetheless,
[out-of-stream variables](reference-dsl-variables.md#out-of-stream-variables) (which I
whimsically call *oosvars*),
[begin/end blocks](reference-main-overview.md), and
[emit statements](reference-dsl-output-statements.md#emit-statements) give
you the ability to implement logic -- if you wish to do so -- which isn't
present in other Miller verbs. (If you find yourself using the same
out-of-stream-variable logic over and over, please file a request at
[https://github.com/johnkerl/miller/issues](https://github.com/johnkerl/miller/issues)
to get it implemented directly in Go as a Miller verb of its own.)

The following examples use oosvars to compute things which are already computable using Miller verbs, by way of providing food for thought.

## Computation of percentages

For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values, then on the second, map each record's value to a percentage.

GENMD_RUN_COMMAND
mlr --from data/small --opprint put -q '
  # These are executed once per record, which is the first pass.
  # The key is to use NR to index an out-of-stream variable to
  # retain all the x-field values.
  @x_min = min($x, @x_min);
  @x_max = max($x, @x_max);
  @x[NR] = $x;

  # The second pass is in a for-loop in an end-block.
  end {
    for (nr, x in @x) {
      @x_pct[nr] = 100 * (x - @x_min) / (@x_max - @x_min);
    }
    emit (@x, @x_pct), "NR"
  }
'
GENMD_EOF
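
For readers who want to see the same two-pass shape outside the Miller DSL, here is a minimal Python sketch. The function name `percent_of_range` and the sample values are hypothetical, and returning `0.0` when all values are equal is this sketch's own choice, not something the Miller example does.

```python
def percent_of_range(values):
    """Map each value to its percentage position between the min and max."""
    # First pass: retain every value and find the extremes.
    retained = list(values)
    lo, hi = min(retained), max(retained)
    span = hi - lo
    # Second pass: rescale each retained value.
    # Assumption of this sketch: if every value is equal (span == 0),
    # report 0.0 rather than dividing by zero.
    return [100 * (v - lo) / span if span else 0.0 for v in retained]


pcts = percent_of_range([2.0, 4.0, 3.0, 6.0])
print(pcts)
```

As with the oosvar version, every value has to be retained in memory, since no percentage can be computed until the last value has been seen.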

## Line-number ratios

Similarly, finding the total record count requires first reading through all the data:

GENMD_RUN_COMMAND
mlr --opprint --from data/small put -q '
  @records[NR] = $*;
  end {
    for((Istring,k),v in @records) {
      int I = int(Istring);
      @records[I]["I"] = I;
      @records[I]["N"] = NR;
      @records[I]["PCT"] = 100*I/NR
    }
    emit @records,"I"
  }
' then reorder -f I,N,PCT
GENMD_EOF
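
The same idea can be sketched in Python: retain the records, then annotate each with its position once the total is known. The function name `with_position` and the sample records are hypothetical; only the output field names `I`, `N`, and `PCT` come from the example above.

```python
def with_position(records):
    """Annotate each record with its index I, the total count N, and 100*I/N."""
    # First pass: retain all records so the total count is known.
    retained = list(records)
    n = len(retained)
    # Second pass: put I, N, PCT first, followed by the original fields.
    return [
        {"I": i, "N": n, "PCT": 100 * i / n, **rec}
        for i, rec in enumerate(retained, start=1)
    ]


annotated = with_position([{"a": "pan"}, {"a": "eks"}, {"a": "wye"}, {"a": "zee"}])
for rec in annotated:
    print(rec)
```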

## Records having max value

The idea is to retain records having the largest value of `n` in the following data:

GENMD_RUN_COMMAND
mlr --itsv --opprint cat data/maxrows.tsv
GENMD_EOF

Of course, the largest value of `n` isn't known until after all data have been read. Using an [out-of-stream variable](reference-dsl-variables.md#out-of-stream-variables) we can [retain all records as they are read](operating-on-all-records.md), then filter them at the end:

GENMD_RUN_COMMAND
cat data/maxrows.mlr
GENMD_EOF

GENMD_RUN_COMMAND
mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv
GENMD_EOF
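
As a language-neutral illustration of the retain-then-filter approach, here is a minimal Python sketch. It is not a transcription of `data/maxrows.mlr`; the function name `records_with_max` and the sample records are hypothetical.

```python
def records_with_max(records, field):
    """Return the records whose `field` equals the largest value seen."""
    retained = []
    best = None
    # First pass: retain every record and track the running maximum.
    for rec in records:
        retained.append(rec)
        if field in rec and (best is None or rec[field] > best):
            best = rec[field]
    if best is None:
        # No record carried the field at all.
        return []
    # Second pass: keep only the records that attain the maximum.
    return [rec for rec in retained if rec.get(field) == best]


sample = [
    {"a": "purple", "n": 5},
    {"a": "blue", "n": 2},
    {"a": "red", "n": 5},
    {"a": "green", "n": 3},
]
winners = records_with_max(sample, "n")
print(winners)
```

A possible refinement, if memory matters, is to retain only the records tied for the current maximum and discard them whenever a larger value arrives; the sketch above keeps everything to stay close to the approach described in the text.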

## Feature-counting

Suppose you have some heterogeneous data like this:

GENMD_INCLUDE_ESCAPED(data/features.json)

Two reasonable questions to ask: how many occurrences of each field are there, and what percentage of the total row count has each of them? Since the denominator of the percentage is not known until the end, this is a two-pass algorithm:

GENMD_INCLUDE_ESCAPED(data/feature-count.mlr)

Then:

GENMD_RUN_COMMAND
mlr --json put -q -f data/feature-count.mlr data/features.json
GENMD_EOF

GENMD_RUN_COMMAND
mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json
GENMD_EOF
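
The counting logic can be sketched in a few lines of Python. This is an illustration of the algorithm rather than a rendering of `data/feature-count.mlr`; the function name `feature_counts` and the sample field names are hypothetical.

```python
from collections import Counter


def feature_counts(records):
    """Return {field_name: (count, percent_of_records)} for heterogeneous records."""
    counts = Counter()
    total = 0
    # First pass: count records and count how often each field name appears.
    for rec in records:
        total += 1
        counts.update(rec.keys())
    # Second pass: the denominator is only known now, so compute percentages here.
    return {name: (n, 100 * n / total) for name, n in counts.items()}


sample = [
    {"qoh": 1, "rate": 2},
    {"qoh": 3},
    {"rate": 1, "x": 1},
    {"qoh": 4},
]
summary = feature_counts(sample)
print(summary)
```

Note that only the per-field counts are retained here, not the records themselves, so the memory cost grows with the number of distinct field names rather than with the number of records.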

## Unsparsing

When CSV input has a full header line, the list of all field names is available up front in that header. Here, let's look at a related problem: we have data where each record has various key names but we want to produce rectangular output having the union of all key names.

There is a keystroke-saving verb for this: [unsparsify](reference-verbs.md#unsparsify). Here, we look at how to implement that in the DSL.

For example, suppose you have JSON input like this:

GENMD_RUN_COMMAND
cat data/sparse.json
GENMD_EOF

There are field names `a`, `b`, `v`, `u`, `x`, `w` in the data -- but not all in every record. Since we don't know the names of all the keys until we've read them all, this needs to be a two-pass algorithm. On the first pass, remember all the unique key names and all the records; on the second pass, loop through the records filling in absent values, then produce output. Use `put -q` since we don't want to produce per-record output; output is emitted only in the `end` block:

GENMD_RUN_COMMAND
cat data/unsparsify.mlr
GENMD_EOF

GENMD_RUN_COMMAND
mlr --json put -q -f data/unsparsify.mlr data/sparse.json
GENMD_EOF

GENMD_RUN_COMMAND
mlr --ijson --ocsv put -q -f data/unsparsify.mlr data/sparse.json
GENMD_EOF

GENMD_RUN_COMMAND
mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json
GENMD_EOF
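
The two passes described above can be sketched in Python as follows. This is not the contents of `data/unsparsify.mlr`; the function name `unsparsify`, the `fill` parameter with its empty-string default, and the sample records are assumptions made for illustration.

```python
def unsparsify(records, fill=""):
    """Return rectangular records having the union of all key names."""
    retained = []
    keys = {}  # a dict used as an insertion-ordered set of key names
    # First pass: remember all the records and all the unique key names.
    for rec in records:
        retained.append(rec)
        for k in rec:
            keys.setdefault(k, None)
    # Second pass: fill in absent values, using first-seen key order.
    return [{k: rec.get(k, fill) for k in keys} for rec in retained]


sparse = [
    {"a": 1, "b": 2, "v": 3},
    {"u": 1, "b": 2},
    {"a": 1, "v": 2, "x": 3},
    {"v": 1, "w": 2},
]
rectangular = unsparsify(sparse)
for rec in rectangular:
    print(rec)
```

Using first-seen key order means each output record has its fields in the same order, which is what a rectangular format such as CSV needs.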

## Mean without/with oosvars

GENMD_RUN_COMMAND
mlr --opprint stats1 -a mean -f x data/medium
GENMD_EOF

GENMD_RUN_COMMAND
mlr --opprint put -q '
  @x_sum += $x;
  @x_count += 1;
  end {
    @x_mean = @x_sum / @x_count;
    emit @x_mean
  }
' data/medium
GENMD_EOF

## Keyed mean without/with oosvars

GENMD_RUN_COMMAND
mlr --opprint stats1 -a mean -f x -g a,b data/medium
GENMD_EOF

GENMD_RUN_COMMAND
mlr --opprint put -q '
  @x_sum[$a][$b] += $x;
  @x_count[$a][$b] += 1;
  end{
    for ((a, b), v in @x_sum) {
      @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
    }
    emit @x_mean, "a", "b"
  }
' data/medium
GENMD_EOF

## Variance and standard deviation without/with oosvars

GENMD_RUN_COMMAND
mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
GENMD_EOF

GENMD_RUN_COMMAND
cat variance.mlr
GENMD_EOF

GENMD_RUN_COMMAND
mlr --oxtab put -q -f variance.mlr data/medium
GENMD_EOF

You can also do this keyed, of course, imitating the keyed-mean example above.
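
One common way to get variance from streaming accumulators is to keep the count, the sum, and the sum of squares, then finish the arithmetic at the end. The Python sketch below shows that approach; it is not a transcription of `variance.mlr`, the function name `accumulate_stats` is hypothetical, and the use of the sample (n - 1) denominator is an assumption of this sketch.

```python
import math


def accumulate_stats(values):
    """Return (count, sum, mean, variance, stddev) from running accumulators."""
    n = 0
    sumx = 0.0
    sumx2 = 0.0
    # Per-value work: update three accumulators, retaining no values.
    for x in values:
        n += 1
        sumx += x
        sumx2 += x * x
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    # End-of-stream work: finish the arithmetic.
    mean = sumx / n
    # Assumption: sample variance with an n - 1 denominator. The max() guards
    # against a tiny negative result from floating-point roundoff.
    var = max((sumx2 - n * mean * mean) / (n - 1), 0.0)
    return n, sumx, mean, var, math.sqrt(var)


n, sumx, mean, var, stddev = accumulate_stats([1, 2, 3, 4])
print(n, sumx, mean, var, stddev)
```

The sum-of-squares form is compact but can lose precision when the mean is large relative to the spread; an incremental update such as Welford's method is a frequently used alternative in that case.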

## Min/max without/with oosvars

GENMD_RUN_COMMAND
mlr --oxtab stats1 -a min,max -f x data/medium
GENMD_EOF

GENMD_RUN_COMMAND
mlr --oxtab put -q '
  @x_min = min(@x_min, $x);
  @x_max = max(@x_max, $x);
  end{emitf @x_min, @x_max}
' data/medium
GENMD_EOF

## Keyed min/max without/with oosvars

GENMD_RUN_COMMAND
mlr --opprint stats1 -a min,max -f x -g a data/medium
GENMD_EOF

GENMD_RUN_COMMAND
mlr --opprint --from data/medium put -q '
  @min[$a] = min(@min[$a], $x);
  @max[$a] = max(@max[$a], $x);
  end{
    emit (@min, @max), "a";
  }
'
GENMD_EOF

## Delta without/with oosvars

GENMD_RUN_COMMAND
mlr --opprint step -a delta -f x data/small
GENMD_EOF

GENMD_RUN_COMMAND
mlr --opprint put '
  $x_delta = is_present(@last) ? $x - @last : 0;
  @last = $x
' data/small
GENMD_EOF

## Keyed delta without/with oosvars

GENMD_RUN_COMMAND
mlr --opprint step -a delta -f x -g a data/small
GENMD_EOF

GENMD_RUN_COMMAND
mlr --opprint put '
  $x_delta = is_present(@last[$a]) ? $x - @last[$a] : 0;
  @last[$a]=$x
' data/small
GENMD_EOF
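
The keyed-delta logic needs only one remembered value per key, so it stays streaming. Here is a minimal Python sketch of the same idea; the function name `keyed_deltas` and the sample records are hypothetical.

```python
def keyed_deltas(records, value_field, key_field):
    """Yield each record with <value_field>_delta added, computed per key."""
    last = {}  # most recent value seen for each key
    for rec in records:
        key = rec[key_field]
        x = rec[value_field]
        # The first record for a key has no predecessor, so its delta is 0.
        delta = x - last[key] if key in last else 0
        last[key] = x
        yield {**rec, value_field + "_delta": delta}


sample = [
    {"a": "pan", "x": 1},
    {"a": "eks", "x": 5},
    {"a": "pan", "x": 4},
    {"a": "eks", "x": 2},
]
deltas = [rec["x_delta"] for rec in keyed_deltas(sample, "x", "a")]
print(deltas)
```

Because this is a generator, each output record is available as soon as its input record has been read, mirroring the use of `put` without `-q` above.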

## Exponentially weighted moving averages without/with oosvars

GENMD_RUN_COMMAND
mlr --opprint step -a ewma -d 0.1 -f x data/small
GENMD_EOF

GENMD_RUN_COMMAND
mlr --opprint put '
  begin{ @a=0.1 };
  $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
  @e=$e
' data/small
GENMD_EOF
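
The recurrence in the DSL expression above is e = x on the first record and e = a * x + (1 - a) * e_prev afterward. A minimal Python sketch of that recurrence follows; the function name `ewma` and the sample values are hypothetical.

```python
def ewma(values, alpha):
    """Return the exponentially weighted moving average after each value."""
    out = []
    prev = None
    for x in values:
        # First value seeds the average; later values blend with the previous one.
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out


smoothed = ewma([1.0, 2.0, 3.0], 0.5)
print(smoothed)
```

Smaller values of `alpha` weight the history more heavily and so smooth more; only the previous average is remembered, so this is a single-pass computation.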