mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 18:25:45 +00:00
# DSL control structures

## Pattern-action blocks

These are reminiscent of `awk` syntax. They let you perform assignments only when appropriate, e.g. for math-function domain restrictions, regex-matching, and so on:

GENMD_RUN_COMMAND
mlr cat data/put-gating-example-1.dkvp
GENMD_EOF

GENMD_RUN_COMMAND
mlr put '$x > 0.0 { $y = log10($x); $z = sqrt($y) }' data/put-gating-example-1.dkvp
GENMD_EOF

GENMD_RUN_COMMAND
mlr cat data/put-gating-example-2.dkvp
GENMD_EOF

GENMD_RUN_COMMAND
mlr put '
  $a =~ "([a-z]+)_([0-9]+)" {
    $b = "left_\1"; $c = "right_\2"
  }' \
  data/put-gating-example-2.dkvp
GENMD_EOF

This produces heterogeneous output which Miller, of course, has no problems with (see [Record Heterogeneity](record-heterogeneity.md)). But if you want homogeneous output, the curly braces can be replaced with a semicolon between the expression and the body statements. This causes `put` to evaluate the boolean expression (along with any side effects, namely, regex-captures `\1`, `\2`, etc.) but doesn't use it as a criterion for whether subsequent assignments should be executed. Instead, subsequent assignments are done unconditionally:

GENMD_RUN_COMMAND
mlr --opprint put '
  $a =~ "([a-z]+)_([0-9]+)";
  $b = "left_\1";
  $c = "right_\2"
' data/put-gating-example-2.dkvp
GENMD_EOF

Note that pattern-action blocks are just a syntactic variation of if-statements. The following do the same thing:

GENMD_CARDIFY
boolean_condition {
  body
}
GENMD_EOF

GENMD_CARDIFY
if (boolean_condition) {
  body
}
GENMD_EOF

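The gating behavior of a pattern-action block can be modeled outside Miller. The following is a minimal Python sketch, not Miller code: `pattern_action` and `add_logs` are hypothetical names, and the two input records are made up for illustration. It shows why gated assignments yield heterogeneous records: only records passing the condition gain the new fields.

```python
import math

def pattern_action(record, condition, body):
    """Run body(record) only when condition(record) holds, like `cond { body }`."""
    if condition(record):
        body(record)
    return record

def add_logs(record):
    # Mirrors `$y = log10($x); $z = sqrt($y)` for illustration.
    record["y"] = math.log10(record["x"])
    record["z"] = math.sqrt(record["y"])

# Hypothetical input records; the first fails the gate, the second passes.
records = [{"x": -1.0}, {"x": 100.0}]
gated = [pattern_action(dict(r), lambda r: r["x"] > 0.0, add_logs) for r in records]

for r in gated:
    print(r)
```

The first record comes back with only `x`, while the second also carries `y` and `z`, which is the heterogeneity described above.
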
## If-statements

These are again reminiscent of `awk`. Pattern-action blocks are a special case of `if` with no `elif` or `else` blocks, no `if` keyword, and parentheses optional around the boolean expression:

GENMD_SHOW_COMMAND
mlr put 'NR == 4 {$foo = "bar"}'
GENMD_EOF

GENMD_SHOW_COMMAND
mlr put 'if (NR == 4) {$foo = "bar"}'
GENMD_EOF

Compound statements use `elif` (rather than `elsif` or `else if`):

GENMD_SHOW_COMMAND
mlr put '
  if (NR == 2) {
    ...
  } elif (NR == 4) {
    ...
  } elif (NR == 6) {
    ...
  } else {
    ...
  }
'
GENMD_EOF

## While and do-while loops

Miller's `while` and `do-while` loops work much as they do in most other languages, as do `break` and `continue`:

GENMD_RUN_COMMAND
echo x=1,y=2 | mlr put '
  while (NF < 10) {
    $[NF+1] = ""
  }
  $foo = "bar"
'
GENMD_EOF

GENMD_RUN_COMMAND
echo x=1,y=2 | mlr put '
  do {
    $[NF+1] = "";
    if (NF == 5) {
      break
    }
  } while (NF < 10);
  $foo = "bar"
'
GENMD_EOF

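For readers who want to trace the do-while example step by step, here is a minimal Python model of it. This is a sketch rather than Miller code: `pad_fields`, `stop_at`, and `limit` are hypothetical names, and a Python dict stands in for the record, with `len(record)` playing the role of `NF`.

```python
def pad_fields(record, stop_at=5, limit=10):
    """Model of: do { $[NF+1] = ""; if (NF == 5) { break } } while (NF < 10)."""
    record = dict(record)  # work on a copy; dicts keep insertion order
    while True:  # Python has no do-while, so the test sits at the bottom
        record[str(len(record) + 1)] = ""  # new field named after the next position
        if len(record) == stop_at:
            break  # leaves the loop early, as in the DSL example
        if not (len(record) < limit):
            break  # the do-while continuation test
    record["foo"] = "bar"  # runs after the loop either way
    return record

result = pad_fields({"x": 1, "y": 2})
print(list(result))  # ['x', 'y', '3', '4', '5', 'foo']
```
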
A `break` or `continue` within nested conditional blocks or if-statements will,
of course, propagate to the innermost loop enclosing them, if any. A `break` or
`continue` outside a loop is a syntax error that will be flagged as soon as the
expression is parsed, before any input records are ingested.

The existence of `while`, `do-while`, and `for` loops in Miller's DSL means
that you can create infinite-loop scenarios inadvertently. In particular,
please recall that DSL statements are executed once if in `begin` or `end`
blocks, and once *per record* otherwise. For example, **while (NR < 10) will
never terminate**. The [`NR`
variable](reference-dsl-variables.md#built-in-variables) is only incremented
between records, and each DSL expression is invoked once per record: so, once
for `NR=1`, once for `NR=2`, etc.

If you do want to loop over records, see [Operating on all
records](operating-on-all-records.md) for some options.

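The reason `while (NR < 10)` never terminates can be seen in a small Python model of per-record execution. This is a sketch under stated assumptions, not Miller's implementation: `run_per_record`, `guarded_while`, and `max_iterations` are hypothetical names, and the iteration cap exists only so the demonstration finishes.

```python
def run_per_record(records, statement):
    """Call `statement` once per record; NR is fixed for the duration of each call."""
    for nr, record in enumerate(records, start=1):
        statement(nr, record)

def guarded_while(nr, record, max_iterations=1000):
    # Models `while (NR < 10) { ... }`: nothing in the body can change nr,
    # so without the guard this loop would spin forever on the first record.
    iterations = 0
    while nr < 10:
        iterations += 1
        if iterations >= max_iterations:  # safety valve for the demo only
            break
    record["iterations"] = iterations

records = [{"x": 1}, {"x": 2}]
run_per_record(records, guarded_while)
print(records)
```

Every record hits the cap, because the loop condition depends only on a value that changes between invocations, never within one.
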
## For-loops

While Miller's `while` and `do-while` statements are much as in many other languages, `for` loops are more idiosyncratic to Miller. They are loops over key-value pairs, whether in stream records, out-of-stream variables, local variables, or map-literals: more reminiscent of `foreach`, as in (for example) PHP. There are **for-loops over map keys** and **for-loops over key-value tuples**. Additionally, Miller has a **C-style triple-for loop** with initialize, test, and update statements. Each is described below.

As with `while` and `do-while`, a `break` or `continue` within nested control structures will propagate to the innermost loop enclosing them, if any, and a `break` or `continue` outside a loop is a syntax error that will be flagged as soon as the expression is parsed, before any input records are ingested.

### Single-variable for-loops

For [maps](reference-main-maps.md), the single variable is always bound to the *key* of key-value pairs:

GENMD_RUN_COMMAND
mlr --from data/small put -q '
  print "NR = ".NR;
  for (e in $*) {
    print " key:", e, "value:", $[e];
  }
'
GENMD_EOF

GENMD_RUN_COMMAND
mlr -n put -q '
  end {
    o = {"a":1, "b":{"c":3}};
    for (e in o) {
      print "key:", e, "valuetype:", typeof(o[e]);
    }
  }
'
GENMD_EOF

Note that the value corresponding to a given key may be gotten through a **computed field name** using square brackets, as in `$[e]` for stream records, or by indexing the looped-over variable using square brackets, as in `o[e]`.

For [arrays](reference-main-arrays.md), the single variable is always bound to the *value* (not the array index):

GENMD_RUN_COMMAND
mlr -n put -q '
  end {
    o = [10, "20", {}, "four", true];
    for (e in o) {
      print "value:", e, "valuetype:", typeof(e);
    }
  }
'
GENMD_EOF

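Python happens to follow the same binding convention, which may help as a mnemonic: iterating a dict with one variable yields keys, and iterating a list yields values. The sketch below is an analogy only, not Miller code, and `single_variable_bindings` is a hypothetical helper name.

```python
def single_variable_bindings(collection):
    """Return what a single loop variable is bound to at each iteration."""
    return [e for e in collection]

# A dict plays the role of a Miller map: the variable is bound to each key.
map_bindings = single_variable_bindings({"a": 1, "b": {"c": 3}})

# A list plays the role of a Miller array: the variable is bound to each value.
array_bindings = single_variable_bindings([10, "20", {}, "four", True])

print(map_bindings)    # ['a', 'b']
print(array_bindings)  # [10, '20', {}, 'four', True]
```
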
### Key-value for-loops

For [maps](reference-main-maps.md), the first loop variable is the key and the
second is the value; for [arrays](reference-main-arrays.md), the first loop
variable is the (1-up) array index and the second is the value.

Single-level keys may be gotten at using either `for(k,v)` or `for((k),v)`; multi-level keys may be gotten at using `for((k1,k2,k3),v)` and so on. The `v` variable will be bound to a scalar value (non-array/non-map) if the map stops at that level, or to a map-valued or array-valued variable if the map goes deeper. If the map isn't deep enough then the loop body won't be executed.

GENMD_RUN_COMMAND
cat data/for-srec-example.tbl
GENMD_EOF

GENMD_RUN_COMMAND
mlr --pprint --from data/for-srec-example.tbl put '
  $sum1 = $f1 + $f2 + $f3;
  $sum2 = 0;
  $sum3 = 0;
  for (key, value in $*) {
    if (key =~ "^f[0-9]+") {
      $sum2 += value;
      $sum3 += $[key];
    }
  }
'
GENMD_EOF

GENMD_RUN_COMMAND
mlr --from data/small --opprint put 'for (k,v in $*) { $[k."_type"] = typeof(v) }'
GENMD_EOF

Note that the value of the current field in the for-loop can be gotten either using the bound variable `value`, or through a **computed field name** using square brackets as in `$[key]`.

Important note: to avoid inconsistent looping behavior in case you're setting new fields (and/or unsetting existing ones) while looping over the record, **Miller makes a copy of the record before the loop: loop variables are bound from the copy and all other reads/writes involve the record itself**:

GENMD_RUN_COMMAND
mlr --from data/small --opprint put '
  $sum1 = 0;
  $sum2 = 0;
  for (k,v in $*) {
    if (is_numeric(v)) {
      $sum1 += v;
      $sum2 += $[k];
    }
  }
'
GENMD_EOF

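The difference between `v` (read from the copy) and `$[k]` (read from the live record) is easy to miss, so here is a minimal Python model of the copy-before-loop rule. It is a sketch, not Miller's implementation: `sum_with_snapshot` is a hypothetical name, the input record is made up, and integers are used so the arithmetic is exact.

```python
def sum_with_snapshot(record):
    """Model of the $sum1/$sum2 example: iterate a snapshot, write to the live record."""
    record = dict(record)
    record["sum1"] = 0
    record["sum2"] = 0
    snapshot = dict(record)  # taken once, before the loop starts
    for k, v in snapshot.items():  # loop variables come from the snapshot
        if isinstance(v, (int, float)):
            record["sum1"] += v          # v is the value as of loop start
            record["sum2"] += record[k]  # record[k] is the live, possibly updated value
    return record

result = sum_with_snapshot({"a": "pan", "i": 1, "x": 2, "y": 3})
# The snapshot includes sum1 and sum2 themselves (both 0 at loop start), so
# sum1 gains nothing from them, while sum2 adds the live sum1 and then itself.
print(result["sum1"], result["sum2"])  # 6 24
```
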
It can be confusing to modify the stream record while iterating over a copy of it, so instead you might find it simpler to use a local variable in the loop and only update the stream record after the loop:

GENMD_RUN_COMMAND
mlr --from data/small --opprint put '
  sum = 0;
  for (k,v in $*) {
    if (is_numeric(v)) {
      sum += $[k];
    }
  }
  $sum = sum
'
GENMD_EOF

You can also start iterating on sub-maps of an out-of-stream or local variable; you can loop over nested keys; you can loop over all out-of-stream variables. The bound variables are bound to a copy of the sub-map as it was before the loop started. The sub-map is specified by square-bracketed indices after `in`, and additional deeper indices are bound to loop key-variables. The terminal values are bound to the loop value-variable whenever the keys are not too shallow. The value-variable may refer to a terminal (string, number) or it may be map-valued if the map goes deeper. Example indexing is as follows:

GENMD_INCLUDE_ESCAPED(data/for-oosvar-example-0a.txt)

That's confusing in the abstract, so a concrete example is in order. Suppose the out-of-stream variable `@myvar` is populated as follows:

GENMD_RUN_COMMAND
mlr -n put --jknquoteint -q '
  begin {
    @myvar = {
      1: 2,
      3: { 4 : 5 },
      6: { 7: { 8: 9 } }
    }
  }
  end { dump }
'
GENMD_EOF

Then we can get at various values as follows:

GENMD_RUN_COMMAND
mlr -n put --jknquoteint -q '
  begin {
    @myvar = {
      1: 2,
      3: { 4 : 5 },
      6: { 7: { 8: 9 } }
    }
  }
  end {
    for (k, v in @myvar) {
      print
        "key=" . k .
        ",valuetype=" . typeof(v);
    }
  }
'
GENMD_EOF

GENMD_RUN_COMMAND
mlr -n put --jknquoteint -q '
  begin {
    @myvar = {
      1: 2,
      3: { 4 : 5 },
      6: { 7: { 8: 9 } }
    }
  }
  end {
    for ((k1, k2), v in @myvar) {
      print
        "key1=" . k1 .
        ",key2=" . k2 .
        ",valuetype=" . typeof(v);
    }
  }
'
GENMD_EOF

GENMD_RUN_COMMAND
mlr -n put --jknquoteint -q '
  begin {
    @myvar = {
      1: 2,
      3: { 4 : 5 },
      6: { 7: { 8: 9 } }
    }
  }
  end {
    for ((k1, k2), v in @myvar[6]) {
      print
        "key1=" . k1 .
        ",key2=" . k2 .
        ",valuetype=" . typeof(v);
    }
  }
'
GENMD_EOF

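The multi-level key binding used in the `@myvar` examples can be modeled in a few lines of Python. This is a sketch of the described semantics, not Miller's implementation: `for_multi_key` and `depth` are hypothetical names, nested dicts stand in for Miller maps, and the copy-before-loop step is omitted for brevity.

```python
def for_multi_key(mapping, depth):
    """Yield (keys, value) pairs, binding `depth` levels of keys per iteration."""
    def walk(node, keys):
        if len(keys) == depth:
            # Enough keys are bound; the value may be a terminal or a deeper map.
            yield tuple(keys), node
        elif isinstance(node, dict):
            for k, child in node.items():
                yield from walk(child, keys + [k])
        # Otherwise this branch is too shallow, so the loop body never runs for it.
    yield from walk(mapping, [])

myvar = {1: 2, 3: {4: 5}, 6: {7: {8: 9}}}

two_level = list(for_multi_key(myvar, 2))   # like for ((k1, k2), v in @myvar)
sub_map = list(for_multi_key(myvar[6], 2))  # like for ((k1, k2), v in @myvar[6])

print(two_level)  # [((3, 4), 5), ((6, 7), {8: 9})]
print(sub_map)    # [((7, 8), 9)]
```

Key `1` is skipped in the two-level loop because its value is a terminal, so the map is not deep enough along that branch.
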
### C-style triple-for loops

These are supported as follows:

GENMD_RUN_COMMAND
mlr --from data/small --opprint put '
  num suma = 0;
  for (a = 1; a <= NR; a += 1) {
    suma += a;
  }
  $suma = suma;
'
GENMD_EOF

GENMD_RUN_COMMAND
mlr --from data/small --opprint put '
  num suma = 0;
  num sumb = 0;
  for (num a = 1, num b = 1; a <= NR; a += 1, b *= 2) {
    suma += a;
    sumb += b;
  }
  $suma = suma;
  $sumb = sumb;
'
GENMD_EOF

Notes:

* In `for (start; continuation; update) { body }`, the start, continuation, and update statements may be empty, single statements, or multiple comma-separated statements. If the continuation is empty (e.g. `for(i=1;;i+=1)`) it defaults to true.

* In particular, you may use `$`-variables and/or `@`-variables in the start, continuation, and/or update steps (as well as the body, of course).

* The typedecls such as `int` or `num` are optional. If a typedecl is provided (for a local variable), it binds a variable scoped to the for-loop regardless of whether a same-name variable is present in outer scope. If a typedecl is not provided, then the variable is scoped to the for-loop if no same-name variable is present in outer scope, or if a same-name variable is present in outer scope then it is modified.

* Miller has no `++` or `--` operators.

* As with all `for`/`if`/`while` statements in Miller, the curly braces are required even if the body is a single statement, or empty.

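The start/continuation/update structure from the notes above can be unrolled into a plain `while` loop. The following Python sketch models the two-variable example for a given `NR`; `triple_for_sums` is a hypothetical name, and the rewrite ignores `continue` (which in a real triple-for would still run the update statements).

```python
def triple_for_sums(nr):
    """Model of: for (num a = 1, num b = 1; a <= NR; a += 1, b *= 2) { ... }."""
    suma = 0
    sumb = 0
    a, b = 1, 1       # start statements, comma-separated in the DSL
    while a <= nr:    # continuation test; an empty test would mean "true"
        suma += a     # loop body
        sumb += b
        a += 1        # update statements, run after each pass through the body
        b *= 2
    return suma, sumb

for nr in (1, 2, 3):
    print(nr, triple_for_sums(nr))
```

For `nr = 3` the loop visits `a = 1, 2, 3` and `b = 1, 2, 4`, giving sums of 6 and 7.
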
## Begin/end blocks

Miller supports an `awk`-like `begin/end` syntax. The statements in the `begin` block are executed before any input records are read; the statements in the `end` block are executed after the last input record is read. (If you want to execute some statement at the start of each file, not at the start of the first file as with `begin`, you might use a pattern/action block of the form `FNR == 1 { ... }`.) All statements outside of `begin` or `end` are, of course, executed on every input record. Semicolons separate statements inside or outside of begin/end blocks; semicolons are required between begin/end block bodies and any subsequent statement. For example:

GENMD_RUN_COMMAND
mlr put '
  begin { @x_sum = 0 };
  @x_sum += $x;
  end { emit @x_sum }
' ./data/small
GENMD_EOF

Since uninitialized out-of-stream variables default to 0 for addition/subtraction and 1 for multiplication when they appear on expression right-hand sides (not quite as in `awk`, where they'd default to 0 either way), the above can be written more succinctly as

GENMD_RUN_COMMAND
mlr put '
  @x_sum += $x;
  end { emit @x_sum }
' ./data/small
GENMD_EOF

The **put -q** option suppresses printing of each output record, with only `emit` statements being output. So to get only summary outputs, you could write

GENMD_RUN_COMMAND
mlr put -q '
  @x_sum += $x;
  end { emit @x_sum }
' ./data/small
GENMD_EOF

We can do similarly with multiple out-of-stream variables:

GENMD_RUN_COMMAND
mlr put -q '
  @x_count += 1;
  @x_sum += $x;
  end {
    emit @x_count;
    emit @x_sum;
  }
' ./data/small
GENMD_EOF

This is of course (see also [here](reference-dsl.md#verbs-compared-to-dsl)) not much different from

GENMD_RUN_COMMAND
mlr stats1 -a count,sum -f x ./data/small
GENMD_EOF

Note that it's a syntax error for begin/end blocks to refer to field names (beginning with `$`), since begin/end blocks execute outside the context of input records.

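The execution order described in this section (`begin` once, main statements once per record, `end` once) can be modeled with a short Python sketch. It is an illustration under stated assumptions, not Miller's implementation: `run_put`, `main`, `end`, `oosvars`, and `emitted` are hypothetical names, a dict stands in for the `@`-variables, and the input records are made up.

```python
def run_put(records, begin=None, main=None, end=None):
    """Run begin once, main once per record, end once; return (records_out, emitted)."""
    oosvars = {}   # plays the role of @-variables, shared across all blocks
    emitted = []   # plays the role of records produced by `emit`
    if begin:
        begin(oosvars)  # no record argument: begin runs before any input is read
    out = []
    for record in records:
        record = dict(record)
        if main:
            main(oosvars, record)
        out.append(record)
    if end:
        end(oosvars, emitted)  # no record argument: end runs after the last record
    return out, emitted

def main(oosvars, record):
    # Models `@x_count += 1; @x_sum += $x`, with unset variables acting as 0.
    oosvars["x_count"] = oosvars.get("x_count", 0) + 1
    oosvars["x_sum"] = oosvars.get("x_sum", 0) + record["x"]

def end(oosvars, emitted):
    # Models `emit @x_count; emit @x_sum`.
    emitted.append({"x_count": oosvars["x_count"]})
    emitted.append({"x_sum": oosvars["x_sum"]})

out, emitted = run_put([{"x": 1}, {"x": 2}, {"x": 4}], main=main, end=end)
print(emitted)  # [{'x_count': 3}, {'x_sum': 7}]
```

Because the `begin` and `end` callbacks receive no record, the model also reflects why field names cannot be referenced there. Discarding `out` and keeping only `emitted` corresponds to the effect of `put -q`.
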