mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 10:15:36 +00:00
# DSL output statements

You can **output** variable-values or expressions in **five ways**:

* **Assign** them to stream-record fields. For example, `$cumulative_sum = @sum`. For another example, `$nr = NR` adds a field named `nr` to each output record, containing the value of the built-in variable `NR` as of when that record was ingested.

* Use **emit1**/**emit**/**emitp**/**emitf** to send out-of-stream variables' current values to the output record stream, e.g. `@sum += $x; emit1 @sum` which produces an extra record such as `sum=3.1648382`. These records, just like records from input file(s), participate in downstream [then-chaining](reference-main-then-chaining.md) to other verbs.

* Use the **print** or **eprint** keywords which immediately print an expression *directly to standard output or standard error*, respectively. Note that `dump`, `edump`, `print`, and `eprint` don't output records that participate in `then`-chaining; rather, they're just immediate prints to stdout/stderr. The `printn` and `eprintn` keywords are the same except that they don't print final newlines. Additionally, you can print to a specified file instead of stdout/stderr.

* Use the **dump** or **edump** keywords, which *immediately print all out-of-stream variables as a JSON data structure to the standard output or standard error* (respectively).

* Use **tee**, which formats the current stream record (not just an arbitrary string as with **print**) to a specific file.

For the first two options, you are populating the output-records stream which feeds into the next verb in a `then`-chain (if any), or which otherwise is formatted for output using `--o...` flags.

For the last three options, you are sending output directly to standard output, standard error, or a file.

## Print statements

The `print` statement is perhaps self-explanatory, but with a few light caveats:

* There are four variants: `print` goes to stdout with final newline, `printn` goes to stdout without final newline (you can include one using "\n" in your output string), `eprint` goes to stderr with final newline, and `eprintn` goes to stderr without final newline.

* Output goes directly to stdout/stderr, respectively: data produced this way does not go downstream to the next verb in a `then`-chain. (Use `emit` for that.)

* Print statements are for strings (`print "hello"`), or things which can be made into strings: numbers (`print 3`, `print $a + $b`), or concatenations thereof (`print "a + b = " . ($a + $b)`). Maps (in `$*`, map-valued out-of-stream or local variables, and map literals) as well as arrays are printed as JSON.

* You can redirect print output to a file:

GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from myfile.dat put 'print > "tap.txt", $x'
GENMD-EOF

* You can redirect print output to multiple files, split by values present in various records:

GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from myfile.dat put 'print > $a.".txt", $x'
GENMD-EOF

See also [Redirected-output statements](reference-dsl-output-statements.md#redirected-output-statements) for examples.

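
To make the split-by-value redirect concrete, here is a rough Python model of what `print > $a.".txt", $x` does conceptually. It is a sketch, not Miller's implementation: the function name `print_redirected`, the sample records, and the assumption that each output file is opened once (truncating) and kept open for the rest of the stream are illustrative.

```python
import os
import tempfile


def print_redirected(records, out_dir):
    """Model of `print > $a.".txt", $x`: one output file per distinct value of field a."""
    handles = {}  # filename -> open file handle, opened on first use (assumption)
    try:
        for rec in records:
            name = os.path.join(out_dir, rec["a"] + ".txt")
            if name not in handles:
                handles[name] = open(name, "w")  # ">" is modeled as truncate-on-open
            print(rec["x"], file=handles[name])  # one line per record, like `print`
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(os.path.basename(n) for n in handles)


# Hypothetical records standing in for myfile.dat.
records = [{"a": "pan", "x": 1}, {"a": "eks", "x": 2}, {"a": "pan", "x": 3}]

with tempfile.TemporaryDirectory() as d:
    names = print_redirected(records, d)
    with open(os.path.join(d, "pan.txt")) as fh:
        pan_lines = fh.read().splitlines()
```

Records sharing the same `a` value land in the same file, in stream order.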
## Dump statements

The `dump` and `edump` statements print expressions, including maps, directly to stdout and stderr, respectively:

* There are two variants: `dump` prints to stdout; `edump` prints to stderr.

* Output goes directly to stdout/stderr, respectively: data produced this way does not go downstream to the next verb in a `then`-chain. (Use `emit` for that.)

* You can use `dump` to output single strings, numbers, or expressions including map-valued data. Map-valued data is printed as JSON.

* If you use `dump` (or `edump`) with no arguments, you get a JSON structure representing the current values of all out-of-stream variables.

* As with `print`, you can redirect output to files.

* See also [Redirected-output statements](reference-dsl-output-statements.md#redirected-output-statements) for examples.

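
As a mental model for argument-less `dump`, the sketch below keeps all out-of-stream variables in one Python dictionary and serializes it as JSON. The dictionary `oosvars` and the sample records are assumptions for illustration, and the exact indentation Miller uses may differ from `json.dumps`.

```python
import json

oosvars = {}  # stands in for Miller's out-of-stream variables (@sum, ...)

# Equivalent in spirit to '@sum[$a] += $x' over three hypothetical records.
for rec in [{"a": "pan", "x": 1}, {"a": "eks", "x": 2}, {"a": "pan", "x": 3}]:
    sums = oosvars.setdefault("sum", {})
    sums[rec["a"]] = sums.get(rec["a"], 0) + rec["x"]

# Equivalent in spirit to 'end { dump }': everything, as one JSON structure.
dumped = json.dumps(oosvars, indent=2)
print(dumped)
```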
## Tee statements

Records produced by a `mlr put` go downstream to the next verb in your `then`-chain, if any, or otherwise to standard output. If you want to additionally copy out records to files, you can do that using `tee`.

The syntax is, for example:

GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from myfile.dat put 'tee > "tap.dat", $*' then sort -n index
GENMD-EOF

First is `tee >`, then the filename expression (which can be an expression such as `"tap.".$a.".dat"`), then a comma, then `$*`. (Nothing else but `$*` is teeable.)

You can also write to a variable file name -- for example, you can split a single file into multiple ones, one per distinct value of a field:

GENMD-RUN-COMMAND
mlr --csv cat example.csv
GENMD-EOF

GENMD-RUN-COMMAND
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
GENMD-EOF

GENMD-RUN-COMMAND
mlr --csv cat circle.csv
GENMD-EOF

GENMD-RUN-COMMAND
mlr --csv cat square.csv
GENMD-EOF

GENMD-RUN-COMMAND
mlr --csv cat triangle.csv
GENMD-EOF

See also [Redirected-output statements](reference-dsl-output-statements.md#redirected-output-statements) for examples.

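
For readers who think in general-purpose languages, the split performed by `tee > $shape.".csv", $*` can be approximated in Python as follows. This is only a sketch: `tee_by_field` is a hypothetical helper, the sample rows are made up rather than taken from `example.csv`, and in-memory buffers stand in for the output files.

```python
import csv
import io


def tee_by_field(csv_text, field):
    """Group CSV records by the value of `field`; return {filename: csv_text}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    buffers, writers = {}, {}
    for rec in reader:
        name = rec[field] + ".csv"  # like the filename expression $shape.".csv"
        if name not in writers:
            # First record for this value: start a new output with its own header.
            buffers[name] = io.StringIO()
            writers[name] = csv.DictWriter(
                buffers[name], fieldnames=reader.fieldnames, lineterminator="\n"
            )
            writers[name].writeheader()
        writers[name].writerow(rec)  # the whole record, like $*
    return {name: buf.getvalue() for name, buf in buffers.items()}


# Made-up rows for illustration only.
sample = (
    "color,shape,index\n"
    "yellow,triangle,11\n"
    "red,square,15\n"
    "red,circle,16\n"
    "red,square,48\n"
)
files = tee_by_field(sample, "shape")
```

Each output keeps the full record and repeats the header, which is what makes every per-shape file a valid CSV on its own.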
## Redirected-output statements

The **print**, **dump**, **tee**, **emit**, **emitp**, and **emitf** keywords all allow you to redirect output to one or more files or pipe-to commands. The filenames/commands are strings which can be constructed using record-dependent values, so you can do things like splitting a table into multiple files, one for each account ID, and so on.

Details:

* The `print` and `dump` keywords produce output immediately to standard output, or to specified file(s) or pipe-to command if present.

GENMD-RUN-COMMAND
mlr help keyword print
GENMD-EOF

GENMD-RUN-COMMAND
mlr help keyword dump
GENMD-EOF

* `mlr put` sends the current record (possibly modified by the `put` expression) to the output record stream. Records are then input to the following verb in a `then`-chain (if any), else printed to standard output (unless `put -q`). The **tee** keyword *additionally* writes the output record to specified file(s) or pipe-to command, or immediately to `stdout`/`stderr`.

GENMD-RUN-COMMAND
mlr help keyword tee
GENMD-EOF

* `mlr put`'s `emitf`, `emitp`, and `emit` send out-of-stream variables to the output record stream. These are then input to the following verb in a `then`-chain (if any), else printed to standard output. When redirected with `>`, `>>`, or `|`, they *instead* write the out-of-stream variable(s) to specified file(s) or pipe-to command, or immediately to `stdout`/`stderr`.

GENMD-RUN-COMMAND
mlr help keyword emitf
GENMD-EOF

GENMD-RUN-COMMAND
mlr help keyword emitp
GENMD-EOF

GENMD-RUN-COMMAND
mlr help keyword emit
GENMD-EOF

## Emit1 and emit/emitp/emitf

There are four variants: `emit1`, `emitf`, `emit`, and `emitp`. These are used
to insert new records into the record stream -- or, optionally, redirect them
to files.

Keep in mind that out-of-stream variables are a nested, multi-level [map](reference-main-maps.md) (directly viewable as JSON using `dump`), while Miller record values are as well during processing -- but records may be flattened down for output to tabular formats. See the page [Flatten/unflatten: JSON vs. tabular formats](flatten-unflatten.md) for more information.

* You can use `emit1` to emit any map-valued expression, including `$*`, map-valued out-of-stream variables, the entire out-of-stream-variable collection `@*`, map-valued local variables, map literals, or map-valued function return values.
* For `emit`, `emitp`, and `emitf`, you can emit map-valued local variables, map-valued field attributes (with `$`), map-valued out-of-stream variables (with `@`), `$*`, `@*`, or map literals (with outermost `{...}`) -- but not arbitrary expressions which evaluate to map (such as function return values).

The reason for this is partly historical and partly technical. As we'll see below, you can do lots of syntactical things with `emit`, `emitp`, and `emitf`, including printing them side-by-side, indexing them, redirecting the output to files, etc. What this means syntactically is that Miller's parser needs to handle all sorts of commas, parentheses, and so on:

GENMD-CARDIFY
emitf @count, @sum
emit @sum, "a", "b"
emitp (@count, @sum),"a","b"
# etc
GENMD-EOF

When we try to allow `emitf`/`emit`/`emitp` to handle arbitrary map-valued expressions, like `mapexcept($*, mymap)` and so on, this inserts more syntactic complexity in terms of commas, parentheses, and so on. The technical term is _LR-1 shift-reduce conflicts_, but we can think of this in terms of the parser being unable to efficiently disambiguate all the punctuational opportunities.

So, `emit1` can handle syntactic richness in the one thing being emitted;
`emitf`, `emit`, and `emitp` can handle syntactic richness in the side-by-side
placement, indexing, and redirection.

(Mnemonic: If all you want is to insert a new record into the record stream, `emit1` is probably the _one_ you want.)

What this means is that if you want to emit an expression that evaluates to a map, you can do it quite simply:

GENMD-RUN-COMMAND
mlr --c2p --from example.csv put -q '
  emit1 mapsum({"id": NR}, $*)
'
GENMD-EOF

And if you want indexing, redirects, etc., just assign to a temporary variable and use one of the other `emit` variants:

GENMD-RUN-COMMAND
mlr --c2p --from example.csv put -q '
  o = mapsum({"id": NR}, $*);
  emit o;
'
GENMD-EOF

## Emitf statements

Use **emitf** to output several out-of-stream variables side-by-side in the same output record. For `emitf`, these mustn't have indexing using `@name[...]`. Example:

GENMD-RUN-COMMAND
mlr put -q '
  @count += 1;
  @x_sum += $x;
  @y_sum += $y;
  end { emitf @count, @x_sum, @y_sum}
' data/small
GENMD-EOF

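
Conceptually, `emitf` gathers the named non-indexed variables into a single record, one field per variable. The Python sketch below models that under stated assumptions: `emitf` here is a hypothetical helper, `oosvars` stands in for the out-of-stream variables, and the two input records are invented.

```python
def emitf(oosvars, *names):
    """Model of `emitf @count, @x_sum, @y_sum`: one record, one field per named variable."""
    return {name: oosvars[name] for name in names}


oosvars = {"count": 0, "x_sum": 0.0, "y_sum": 0.0}

# Main block: accumulate over the (hypothetical) input records.
for rec in [{"x": 1.0, "y": 10.0}, {"x": 2.0, "y": 20.0}]:
    oosvars["count"] += 1
    oosvars["x_sum"] += rec["x"]
    oosvars["y_sum"] += rec["y"]

# End block: a single output record with the accumulators side by side.
record = emitf(oosvars, "count", "x_sum", "y_sum")
```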
## Emit statements

Use **emit** to output an out-of-stream variable. If it's non-indexed, you'll get a simple key-value pair:

GENMD-RUN-COMMAND
cat data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum += $x; end { dump }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum += $x; end { emit @sum }' data/small
GENMD-EOF

If it's indexed, then use as many names after `emit` as there are indices:

GENMD-RUN-COMMAND
mlr put -q '@sum[$a] += $x; end { dump }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum[$a] += $x; end { emit @sum, "a" }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a", "b" }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b][$i] += $x; end { dump }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '
  @sum[$a][$b][$i] += $x;
  end { emit @sum, "a", "b", "i" }
' data/small
GENMD-EOF

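
One way to picture the indexed form is as a walk down the nested map, peeling off one level per name. The following Python sketch models that idea; it is not Miller's code. The helper `emit` and the small `sums` map are illustrative, and the sketch assumes you give no more names than the map has levels.

```python
def emit(name, value, *keys):
    """Model of `emit @name, "k1", "k2", ...` for a (possibly nested) dict `value`."""

    def walk(node, remaining, prefix):
        if not remaining:
            # No names left: a terminal value is keyed by the variable name,
            # while a remaining map supplies the rest of the record's fields.
            rest = node if isinstance(node, dict) else {name: node}
            yield {**prefix, **rest}
            return
        for k, child in node.items():
            # Each map key at this level becomes the value of the next name.
            yield from walk(child, remaining[1:], {**prefix, remaining[0]: k})

    return list(walk(value, keys, {}))


# Stand-in for @sum[$a][$b] after accumulation (values are made up).
sums = {"pan": {"pan": 1}, "eks": {"pan": 2, "wye": 3}}

by_a = emit("sum", sums, "a")          # one record per first-level key
by_a_b = emit("sum", sums, "a", "b")   # one record per (a, b) pair
scalar = emit("sum", 6)                # non-indexed: a single key-value pair
```

With both names given, every record carries `a`, `b`, and a `sum` field. With only `"a"`, the second-level keys become field names.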
Now for **emitp**: if you have as many names following `emit` as there are levels in the out-of-stream variable's map, then `emit` and `emitp` do the same thing. Where they differ is when you don't specify as many names as there are map levels. In this case, Miller needs to flatten multiple map indices down to output-record keys: `emitp` includes full prefixing (hence the `p` in `emitp`) while `emit` takes the deepest map key as the output-record key:

GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a" }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emit @sum }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr --oxtab put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
GENMD-EOF

Use **--flatsep** to specify the character that joins multilevel keys for `emitp` (it defaults to a colon):

GENMD-RUN-COMMAND
mlr --flatsep / put -q '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr --flatsep / put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
GENMD-EOF

GENMD-RUN-COMMAND
mlr --flatsep / --oxtab put -q '
  @sum[$a][$b] += $x;
  end { emitp @sum }
' data/small
GENMD-EOF

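
The difference between the two keywords in the one-name case can be sketched in Python as below. This is a simplified model covering only a two-level map split on its first level: `emit_by_first_key` is a hypothetical helper, and the separator is passed explicitly rather than assuming any default.

```python
def emit_by_first_key(name, value, key_name, sep=None):
    """Split a two-level map on its first level.

    sep=None models `emit @name, "key_name"`: second-level keys are used as-is.
    A string sep models `emitp`: keys are prefixed with the variable name,
    joined by the separator (the role played by --flatsep).
    """
    records = []
    for first_key, inner in value.items():
        rec = {key_name: first_key}
        for leaf_key, leaf in inner.items():
            out_key = leaf_key if sep is None else name + sep + leaf_key
            rec[out_key] = leaf
        records.append(rec)
    return records


# Stand-in for @sum[$a][$b] after accumulation (values are made up).
sums = {"pan": {"pan": 1}, "eks": {"pan": 2, "wye": 3}}

unprefixed = emit_by_first_key("sum", sums, "a")           # like emit @sum, "a"
prefixed = emit_by_first_key("sum", sums, "a", sep="/")    # like emitp with --flatsep /
```

Prefixing keeps the variable name in every output key, which matters once several variables are emitted into the same record.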
## Multi-emit statements

You can emit **multiple map-valued expressions side-by-side** by
including their names in parentheses:

GENMD-RUN-COMMAND
mlr --from data/medium --opprint put -q '
  @x_count[$a][$b] += 1;
  @x_sum[$a][$b] += $x;
  end {
      for ((a, b), _ in @x_count) {
          @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b]
      }
      emit (@x_sum, @x_count, @x_mean), "a", "b"
  }
'
GENMD-EOF

What this does is walk through the first out-of-stream variable (`@x_sum` in this example) as usual, then for each keylist found (e.g., `pan,wye`), include the values for the remaining out-of-stream variables (here, `@x_count` and `@x_mean`). You should use this when all out-of-stream variables in the emit statement have **the same shape and the same keylists**.

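
The walk described above can be modeled in a few lines of Python. This is an illustrative sketch rather than Miller's implementation: `emit_lashed` is a hypothetical name, the data is invented, and the sketch assumes exactly as many names as map levels and identical keylists across all variables.

```python
def emit_lashed(variables, *key_names):
    """Model of `emit (@x, @y, ...), "k1", "k2"`.

    `variables` maps each variable name to a nested dict; all must share
    the same shape. Keylists are discovered by walking the first variable.
    """
    names = list(variables)
    records = []

    def walk(node, path):
        if len(path) == len(key_names):
            rec = dict(zip(key_names, path))
            for name in names:
                value = variables[name]
                for k in path:  # look up the same keylist in every variable
                    value = value[k]
                rec[name] = value
            records.append(rec)
            return
        for k, child in node.items():
            walk(child, path + [k])

    walk(variables[names[0]], [])
    return records


# Made-up accumulators with matching shapes.
x_sum = {"pan": {"pan": 3.0, "wye": 8.0}}
x_count = {"pan": {"pan": 2, "wye": 4}}
x_mean = {a: {b: x_sum[a][b] / x_count[a][b] for b in x_sum[a]} for a in x_sum}

rows = emit_lashed({"x_sum": x_sum, "x_count": x_count, "x_mean": x_mean}, "a", "b")
```

If a later variable lacked one of the keylists, the lookup in this sketch would raise `KeyError`, which is one way to see why matching shapes are recommended.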
## Emit-all statements

Use **emit all** (or `emit @*`, which is synonymous) to output all out-of-stream variables. You can use the following idiom to get various accumulators' output side-by-side (reminiscent of `mlr stats1`):

GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
  @v[$a][$b]["sum"] += $x;
  @v[$a][$b]["count"] += 1;
  end{emit @*,"a","b"}
'
GENMD-EOF

GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
  @sum[$a][$b] += $x;
  @count[$a][$b] += 1;
  end{emit @*,"a","b"}
'
GENMD-EOF

GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
  @sum[$a][$b] += $x;
  @count[$a][$b] += 1;
  end{emit (@sum, @count),"a","b"}
'
GENMD-EOF