miller/docs/src/reference-dsl-output-statements.md.in
John Kerl b7248bae98
Doc copy edits (#1827)
* Update index.md.in

* more copy-editing

* swipes.sh

* swipes.sh

* run `make docs` to generate `*.md` from `*.md.in`
2025-07-04 13:43:22 -04:00

323 lines
13 KiB
Markdown

# DSL output statements
You can **output** variable-values or expressions in **five ways**:
* **Assign** them to stream-record fields. For example, `$cumulative_sum = @sum`. For another example, `$nr = NR` adds a field named `nr` to each output record, containing the value of the built-in variable `NR` as of when that record was ingested.
* Use **emit1**/**emit**/**emitp**/**emitf** to send out-of-stream variables' current values to the output record stream, e.g. `@sum += $x; emit1 @sum` which produces an extra record such as `sum=3.1648382`. These records, just like records from input file(s), participate in downstream [then-chaining](reference-main-then-chaining.md) to other verbs.
* Use the **print** or **eprint** keywords which immediately print an expression *directly to standard output or standard error*, respectively. Note that `dump`, `edump`, `print`, and `eprint` don't output records that participate in `then`-chaining; rather, they're just immediate prints to stdout/stderr. The `printn` and `eprintn` keywords are the same except that they don't print final newlines. Additionally, you can print to a specified file instead of stdout/stderr.
* Use the **dump** or **edump** keywords, which *immediately print all out-of-stream variables as a JSON data structure to the standard output or standard error* (respectively).
* Use **tee**, which formats the current stream record (not just an arbitrary string as with **print**) to a specific file.
For the first two options, you are populating the output-records stream which feeds into the next verb in a `then`-chain (if any), or which otherwise is formatted for output using `--o...` flags.
For the last three options, you are sending output directly to standard output, standard error, or a file.
## Print statements
The `print` statement is perhaps self-explanatory, but with a few light caveats:
* There are four variants: `print` goes to stdout with final newline, `printn` goes to stdout without final newline (you can include one using "\n" in your output string), `eprint` goes to stderr with final newline, and `eprintn` goes to stderr without final newline.
* Output goes directly to stdout/stderr, respectively: data produced this way does not go downstream to the next verb in a `then`-chain. (Use `emit` for that.)
* Print statements are for strings (`print "hello"`), or things which can be made into strings: numbers (`print 3`, `print $a + $b`), or concatenations thereof (`print "a + b = " . ($a + $b)`). Maps (in `$*`, map-valued out-of-stream or local variables, and map literals) as well as arrays are printed as JSON.
* You can redirect print output to a file:
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from myfile.dat put 'print > "tap.txt", $x'
GENMD-EOF
* You can redirect print output to multiple files, split by values present in various records:
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from myfile.dat put 'print > $a.".txt", $x'
GENMD-EOF
See also [Redirected-output statements](reference-dsl-output-statements.md#redirected-output-statements) for examples.
## Dump statements
The `dump` statement is for printing expressions, including maps, directly to stdout/stderr, respectively:
* There are two variants: `dump` prints to stdout; `edump` prints to stderr.
* Output goes directly to stdout/stderr, respectively: data produced this way does not go downstream to the next verb in a `then`-chain. (Use `emit` for that.)
* You can use `dump` to output single strings, numbers, or expressions including map-valued data. Map-valued data is printed as JSON.
* If you use `dump` (or `edump`) with no arguments, you get a JSON structure representing the current values of all out-of-stream variables.
* As with `print`, you can redirect output to files.
* See also [Redirected-output statements](reference-dsl-output-statements.md#redirected-output-statements) for examples.
## Tee statements
Records produced by a `mlr put` go downstream to the next verb in your `then`-chain, if any, or otherwise to standard output. If you want to additionally copy out records to files, you can do that using `tee`.
The syntax is, for example:
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from myfile.dat put 'tee > "tap.dat", $*' then sort -n index
GENMD-EOF
First is `tee >`, then the filename expression (which can be an expression such as `"tap.".$a.".dat"`), then a comma, then `$*`. (Nothing else but `$*` is teeable.)
You can also write to a variable file name -- for example, you can split a single file into multiple ones on field names:
GENMD-RUN-COMMAND
mlr --csv cat example.csv
GENMD-EOF
GENMD-RUN-COMMAND
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
GENMD-EOF
GENMD-RUN-COMMAND
mlr --csv cat circle.csv
GENMD-EOF
GENMD-RUN-COMMAND
mlr --csv cat square.csv
GENMD-EOF
GENMD-RUN-COMMAND
mlr --csv cat triangle.csv
GENMD-EOF
See also [Redirected-output statements](reference-dsl-output-statements.md#redirected-output-statements) for examples.
## Redirected-output statements
The **print**, **dump** **tee**, **emit**, **emitp**, and **emitf** keywords all allow you to redirect output to one or more files or pipe-to commands. The filenames/commands are strings which can be constructed using record-dependent values, so you can do things like splitting a table into multiple files, one for each account ID, and so on.
Details:
* The `print` and `dump` keywords produce output immediately to standard output, or to specified file(s) or pipe-to command if present.
GENMD-RUN-COMMAND
mlr help keyword print
GENMD-EOF
GENMD-RUN-COMMAND
mlr help keyword dump
GENMD-EOF
* `mlr put` sends the current record (possibly modified by the `put` expression) to the output record stream. Records are then input to the following verb in a `then`-chain (if any), else printed to standard output (unless `put -q`). The **tee** keyword *additionally* writes the output record to specified file(s) or pipe-to command, or immediately to `stdout`/`stderr`.
GENMD-RUN-COMMAND
mlr help keyword tee
GENMD-EOF
* `mlr put`'s `emitf`, `emitp`, and `emit` send out-of-stream variables to the output record stream. These are then input to the following verb in a `then`-chain (if any), else printed to standard output. When redirected with `>`, `>>`, or `|`, they *instead* write the out-of-stream variable(s) to specified file(s) or pipe-to command, or immediately to `stdout`/`stderr`.
GENMD-RUN-COMMAND
mlr help keyword emitf
GENMD-EOF
GENMD-RUN-COMMAND
mlr help keyword emitp
GENMD-EOF
GENMD-RUN-COMMAND
mlr help keyword emit
GENMD-EOF
## Emit1 and emit/emitp/emitf
There are four variants: `emit1`, `emitf`, `emit`, and `emitp`. These are used
to insert new records into the record stream -- or, optionally, redirect them
to files.
Keep in mind that out-of-stream variables are a nested, multi-level [map](reference-main-maps.md) (directly viewable as JSON using `dump`), while Miller record values are as well during processing -- but records may be flattened down for output to tabular formats. See the page [Flatten/unflatten: JSON vs. tabular formats](flatten-unflatten.md) for more information.
* You can use `emit1` to emit any map-valued expression, including `$*`, map-valued out-of-stream variables, the entire out-of-stream-variable collection `@*`, map-valued local variables, map literals, or map-valued function return values.
* For `emit`, `emitp`, and `emitf`, you can emit map-valued local variables, map-valued field attributes (with `$`), map-va out-of-stream variables (with `@`), `$*`, `@*`, or map literals (with outermost `{...}`) -- but not arbitrary expressions which evaluate to map (such as function return values).
The reason for this is partly historical and partly technical. As we'll see below, you can do lots of syntactical things with `emit`, `emitp`, and `emitf`, including printing them side-by-side, indexing them, redirecting the output to files, etc. What this means syntactically is that Miller's parser needs to handle all sorts of commas, parentheses, and so on:
GENMD-CARDIFY
emitf @count, @sum
emit @sum, "a", "b"
emitp (@count, @sum),"a","b"}
# etc
GENMD-EOF
When we try to allow `emitf`/`emit`/`emitp` to handle arbitrary map-valued expressions, like `mapexcept($*, mymap)` and so on, this inserts more syntactic complexity in terms of commas, parentheses, and so on. The technical term is _LR-1 shift-reduce conflicts_, but we can think of this in terms of the parser being unable to efficiently disambiguate all the punctuational opportunities.
So, `emit1` can handle syntactic richness in the one thing being emitted;
`emitf`, `emit`, and `emitp` can handle syntactic richness in the side-by-side
placement, indexing, and redirection.
(Mnemonic: If all you want is to insert a new record into the record stream, `emit1` is probably the _one_ you want.)
What this means is that if you want to emit an expression that evaluates to a map, you can do it quite simply:
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put -q '
emit1 mapsum({"id": NR}, $*)
'
GENMD-EOF
And if you want indexing, redirects, etc., just assign to a temporary variable and use one of the other `emit` variants:
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put -q '
o = mapsum({"id": NR}, $*);
emit o;
'
GENMD-EOF
## Emitf statements
Use **emitf** to output several out-of-stream variables side-by-side in the same output record. For `emitf`, these mustn't have indexing using `@name[...]`. Example:
GENMD-RUN-COMMAND
mlr put -q '
@count += 1;
@x_sum += $x;
@y_sum += $y;
end { emitf @count, @x_sum, @y_sum}
' data/small
GENMD-EOF
## Emit statements
Use **emit** to output an out-of-stream variable. If it's non-indexed, you'll get a simple key-value pair:
GENMD-RUN-COMMAND
cat data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum += $x; end { dump }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum += $x; end { emit @sum }' data/small
GENMD-EOF
If it's indexed, then use as many names after `emit` as there are indices:
GENMD-RUN-COMMAND
mlr put -q '@sum[$a] += $x; end { dump }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum[$a] += $x; end { emit @sum, "a" }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a", "b" }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b][$i] += $x; end { dump }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '
@sum[$a][$b][$i] += $x;
end { emit @sum, "a", "b", "i" }
' data/small
GENMD-EOF
Now for **emitp**: if you have as many names following `emit` as there are levels in the out-of-stream variable's map, then `emit` and `emitp` do the same thing. Where they differ is when you don't specify as many names as there are map levels. In this case, Miller needs to flatten multiple map indices down to output-record keys: `emitp` includes full prefixing (hence the `p` in `emitp`) while `emit` takes the deepest map key as the output-record key:
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a" }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emit @sum }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr --oxtab put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
GENMD-EOF
Use **--flatsep** to specify the character that joins multilevel keys for `emitp` (it defaults to a colon):
GENMD-RUN-COMMAND
mlr --flatsep / put -q '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr --flatsep / put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
GENMD-EOF
GENMD-RUN-COMMAND
mlr --flatsep / --oxtab put -q '
@sum[$a][$b] += $x;
end { emitp @sum }
' data/small
GENMD-EOF
## Multi-emit statements
You can emit **multiple map-valued expressions side-by-side** by
including their names in parentheses:
GENMD-RUN-COMMAND
mlr --from data/medium --opprint put -q '
@x_count[$a][$b] += 1;
@x_sum[$a][$b] += $x;
end {
for ((a, b), _ in @x_count) {
@x_mean[a][b] = @x_sum[a][b] / @x_count[a][b]
}
emit (@x_sum, @x_count, @x_mean), "a", "b"
}
'
GENMD-EOF
What this does is walk through the first out-of-stream variable (`@x_sum` in this example) as usual, then for each keylist found (e.g., `pan,wye`), include the values for the remaining out-of-stream variables (here, `@x_count` and `@x_mean`). You should use this when all out-of-stream variables in the emit statement have **the same shape and the same keylists**.
## Emit-all statements
Use **emit all** (or `emit @*`, which is synonymous) to output all out-of-stream variables. You can use the following idiom to get various accumulators' output side-by-side (reminiscent of `mlr stats1`):
GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
@v[$a][$b]["sum"] += $x;
@v[$a][$b]["count"] += 1;
end{emit @*,"a","b"}
'
GENMD-EOF
GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
@sum[$a][$b] += $x;
@count[$a][$b] += 1;
end{emit @*,"a","b"}
'
GENMD-EOF
GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
@sum[$a][$b] += $x;
@count[$a][$b] += 1;
end{emit (@sum, @count),"a","b"}
'
GENMD-EOF