miller/docs6/docs/reference-dsl.md.in
2021-08-22 11:41:31 -04:00

# DSL overview
## Verbs compared to DSL
Here's a comparison of verbs and `put`/`filter` DSL expressions:
Example:
GENMD_RUN_COMMAND
mlr stats1 -a sum -f x -g a data/small
GENMD_EOF
* Verbs are coded in Go
* They run a bit faster
* They take fewer keystrokes
* There is less to learn
* Their customization is limited to each verb's options
Example:
GENMD_RUN_COMMAND
mlr put -q '@x_sum[$a] += $x; end{emit @x_sum, "a"}' data/small
GENMD_EOF
* You get to write your own DSL expressions
* They run a bit slower
* They take more keystrokes
* There is more to learn
* They are highly customizable
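
To make the equivalence of the two commands above concrete, here is a minimal Python sketch of the computation they both express: a sum of `x` grouped by `a`. It is only a conceptual model, not how Miller is implemented. The function name `sum_by_group` and the record values are hypothetical and are not taken from `data/small`.

```python
from collections import defaultdict

# Hypothetical records standing in for data/small; the values are made up.
records = [
    {"a": "pan", "b": "pan", "x": 0.25},
    {"a": "eks", "b": "pan", "x": 0.5},
    {"a": "pan", "b": "wye", "x": 0.75},
]


def sum_by_group(records, value_field, group_field):
    """Sum value_field per distinct group_field value, in first-seen order."""
    sums = defaultdict(float)  # plays the role of the out-of-stream variable @x_sum
    for record in records:  # Miller runs this loop for you
        sums[record[group_field]] += record[value_field]  # @x_sum[$a] += $x
    # Roughly the shape of what `emit @x_sum, "a"` writes: one record per group.
    return [
        {group_field: key, value_field + "_sum": total}
        for key, total in sums.items()
    ]


result = sum_by_group(records, "x", "a")
print(result)
```

The verb hides the accumulator and the final emit behind its `-a`, `-f`, and `-g` options, while the DSL version spells them out, which is where the extra keystrokes and the extra flexibility both come from.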
Please see [Verbs Reference](reference-verbs.md) for information on verbs other than `put` and `filter`.
## Implicit loop over records for main statements
The most important point about the Miller DSL is that it is designed for _streaming operation over records_.
DSL statements include:
* `func` and `subr` for user-defined functions and subroutines, which we'll look at later in the [separate page about them](reference-dsl-user-defined-functions.md);
* `begin` and `end` blocks, for statements you want to run before the first record, or after the last one;
* everything else, which collectively are called _main statements_.
The feature of _streaming operation over records_ is implemented by the main
statements getting invoked once per record. You don't explicitly loop over
records, as you would in some dataframes contexts; rather, _Miller loops over
records for you_, and it lets you specify what to do on each record: you write
the body of the loop.
(You can, if you like, use the per-record statements to grow a list of records,
then loop over them all in an `end` block. This is described in the page on
[operating on all records](operating-on-all-records.md).)
To see this in action, let's take a look at the [data/short.csv](./data/short.csv) file:
GENMD_RUN_COMMAND
cat data/short.csv
GENMD_EOF
There are three records in this file, with `word=apple`, `word=ball`, and
`word=cat`, respectively. Let's print something in a `begin` statement, add a
field in a main statement, and print something else in an `end` statement:
GENMD_RUN_COMMAND
mlr --csv --from data/short.csv put '
begin {
print "begin";
}
$nr = NR;
end {
print "end";
}
'
GENMD_EOF
The `print` statements for `begin` and `end` went out before the first record
was seen and after the last was seen; the field-creation statement `$nr = NR`
was invoked three times, once for each record. We didn't explicitly loop over
records, since Miller was already looping over records, and invoked our main
statement on each loop iteration.
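
The control flow just described can be modeled in a few lines of Python. This is a rough sketch under stated assumptions, not Miller's actual implementation: `run_stream` and `add_nr` are hypothetical names, and the three records carry only the `word` field mentioned above.

```python
def run_stream(records, begin, main, end):
    """Call begin once, main once per record, and end once, in that order."""
    output = []
    begin()  # before the first record is seen
    for nr, record in enumerate(records, start=1):  # the implicit loop
        main(record, nr)  # the body you write
        output.append(record)
    end()  # after the last record is seen
    return output


events = []
records = [{"word": "apple"}, {"word": "ball"}, {"word": "cat"}]


def add_nr(record, nr):
    events.append("main")
    record["nr"] = nr  # corresponds to: $nr = NR


result = run_stream(
    records,
    begin=lambda: events.append("begin"),
    main=add_nr,
    end=lambda: events.append("end"),
)
print(events)
print(result)
```

In Miller you supply only the three pieces passed to the hypothetical `run_stream` here; the driver loop itself is never written by the user.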
For almost all simple uses of the Miller programming language, this implicit
looping over records is probably all you will need. (For more involved cases you
can see the pages on [operating over all records](operating-on-all-records.md),
[out-of-stream variables](reference-dsl-variables.md#out-of-stream-variables),
and [two-pass algorithms](two-pass-algorithms.md).)
## Essential use: record-selection and record-updating
The essential usages of `mlr filter` and `mlr put` are for record-selection and
record-updating expressions, respectively. For example, given the following
input data:
GENMD_RUN_COMMAND
cat data/small
GENMD_EOF
you might retain only the records whose `a` field has value `eks`:
GENMD_RUN_COMMAND
mlr filter '$a == "eks"' data/small
GENMD_EOF
or you might add a new field which is a function of existing fields:
GENMD_RUN_COMMAND
mlr put '$ab = $a . "_" . $b ' data/small
GENMD_EOF
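
As a conceptual illustration only, the two commands above correspond to a selection and an update over a list of records. The Python sketch below uses hypothetical records, not the actual contents of `data/small`.

```python
# Hypothetical records; the real data/small has more fields and other values.
records = [
    {"a": "pan", "b": "pan"},
    {"a": "eks", "b": "pan"},
    {"a": "eks", "b": "wye"},
]

# Record selection, as in: mlr filter '$a == "eks"'
selected = [record for record in records if record["a"] == "eks"]

# Record updating, as in: mlr put '$ab = $a . "_" . $b'
# Each output record is a copy of the input record with the new field appended.
updated = [{**record, "ab": record["a"] + "_" + record["b"]} for record in records]

print(selected)
print(updated)
```

Selection changes how many records come out but not their fields; updating changes the fields but not the record count.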
## Differences between put and filter
The two verbs `mlr filter` and `mlr put` are essentially the same. The only differences are:
* Expressions sent to `mlr filter` should contain a boolean expression, which is the filtering criterion. (If not, all records pass through.)
* `mlr filter` expressions may not reference the `filter` keyword within them.
## Location of boolean expression for filter
You can define and invoke functions and subroutines to help produce the bare-boolean statement, and record fields may be assigned in the statements before or after the bare-boolean statement. For example:
GENMD_RUN_COMMAND
mlr --c2p --from example.csv filter '
# Bare-boolean filter expression: only records matching this pass through:
$quantity >= 70;
# For records that do pass through, set these:
if ($rate > 8) {
$description = "high rate";
} else {
$description = "low rate";
}
'
GENMD_EOF
The bare-boolean statement can also be a regex match whose captures are used in the assignments that follow it:
GENMD_RUN_COMMAND
mlr --c2p --from example.csv filter '
# Bare-boolean filter expression: only records matching this pass through:
$shape =~ "^(...)(...)$";
# For records that do pass through, capture the first "(...)" into $left and
# the second "(...)" into $right
$left = "\1";
$right = "\2";
'
GENMD_EOF
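
The observable effect of the regex example can be sketched in Python as follows. This models only which records come out and with which fields. It does not model how Miller evaluates the statements internally, and it makes no claim about what `"\1"` holds when the match fails. The function name and the record values are hypothetical.

```python
import re

# Hypothetical records; only the shape field matters for this sketch.
records = [
    {"shape": "triangle"},
    {"shape": "square"},
    {"shape": "circle"},
]


def filter_with_captures(records):
    """Keep records whose shape is exactly six characters, split into two halves."""
    kept = []
    for record in records:
        match = re.search(r"^(...)(...)$", record["shape"])  # the bare boolean
        if match is None:
            continue  # filter condition false: the record is not emitted
        updated = dict(record)
        updated["left"] = match.group(1)  # corresponds to: $left = "\1"
        updated["right"] = match.group(2)  # corresponds to: $right = "\2"
        kept.append(updated)
    return kept


result = filter_with_captures(records)
print(result)
```

The pattern requires exactly six characters, so the eight-character `triangle` is dropped while `square` and `circle` pass through with their two halves captured.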
There are more details and more choices, of course, as detailed in the following sections.