mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-24 02:36:15 +00:00
* Docs6 proofreads batch 3 * BUild-everything script for local development * Start of glossary * Put quicklinks atop every page, not just the base-index page * Expanded record-heterogeneity page * streaming page * separators page * vimrc doc * separators page
157 lines
5.3 KiB
Markdown
157 lines
5.3 KiB
Markdown
# DSL overview
|
|
|
|
_DSL_ stands for _domain-specific language_: it's language particular to Miller
|
|
which you can use to write expressions to specify customer transformations to
|
|
your data. (See [Miller programming language](programming-language.md) for an
|
|
introduction.)
|
|
|
|
## Verbs compared to DSL
|
|
|
|
While `put` and `filter` are [verbs](reference-verbs.md), they're different
|
|
from the rest in that they let you use the DSL -- so we often contrast _DSL_
|
|
(things you can do in the `put` and `filter` verbs), and _verbs_ (things you
|
|
can do using the other verbs besides `put` and `filter`.)
|
|
|
|
Here's comparison of verbs and `put`/`filter` DSL expressions:
|
|
|
|
Example:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr stats1 -a sum -f x -g a data/small
|
|
GENMD_EOF
|
|
|
|
* Verbs are coded in Go
|
|
* They run a bit faster
|
|
* They take fewer keystrokes
|
|
* There is less to learn
|
|
* Their customization is limited to each verb's options
|
|
|
|
Example:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr put -q '@x_sum[$a] += $x; end{emit @x_sum, "a"}' data/small
|
|
GENMD_EOF
|
|
|
|
* You get to write your own DSL expressions
|
|
* They run a bit slower
|
|
* They take more keystrokes
|
|
* There is more to learn
|
|
* They are highly customizable
|
|
|
|
Please see [Verbs Reference](reference-verbs.md) for information on verbs other than `put` and `filter`.
|
|
|
|
## Implicit loop over records for main statements
|
|
|
|
The most important point about the Miller DSL is that it is designed for [streaming operation over records](streaming-and-memory.md).
|
|
|
|
DSL statements include:
|
|
|
|
* `func` and `subr` for user-defined functions and subroutines, which we'll look at later in the [separate page about them](reference-dsl-user-defined-functions.md);
|
|
* `begin` and `end` blocks, for statements you want to run before the first record, or after the last one;
|
|
* everything else, which collectively are called _main statements_.
|
|
|
|
The feature of _streaming operation over records_ is implemented by the main
|
|
statements getting invoked once per record. You don't explicitly loop over
|
|
records, as you would in some dataframes contexts; rather, _Miller loops over
|
|
records for you_, and it lets you specify what to do on each record: you write
|
|
the body of the loop, not the loop itself.
|
|
|
|
(But you can, if you like, use those per-record statements to grow a list of
|
|
records, then loop over them all in an `end` block. This is described in the
|
|
page on [operating on all records](operating-on-all-records.md)).
|
|
|
|
To see this in action, let's take a look at the [data/short.csv](./data/short.csv) file:
|
|
|
|
GENMD_RUN_COMMAND
|
|
cat data/short.csv
|
|
GENMD_EOF
|
|
|
|
There are three records in this file, with `word=apple`, `word=ball`, and
|
|
`word=cat`, respectively. Let's print something in a `begin` statement, add a
|
|
field in a main statement, and print something else in an `end` statement:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --csv --from data/short.csv put '
|
|
begin {
|
|
print "begin";
|
|
}
|
|
$nr = NR;
|
|
end {
|
|
print "end";
|
|
}
|
|
'
|
|
GENMD_EOF
|
|
|
|
The `print` statements for `begin` and `end` went out before the first record
|
|
was seen and after the last was seen; the field-creation statement `$nr = NR`
|
|
was invoked three times, once for each record. We didn't explicitly loop over
|
|
records, since Miller was already looping over records, and invoked our main
|
|
statement on each loop iteration.
|
|
|
|
For almost all simple uses of the Miller programming language, this implicit
|
|
looping over records is probably all you will need. (For more involved cases you
|
|
can see the pages on [operating on all records](operating-on-all-records.md),
|
|
[out-of-stream variables](reference-dsl-variables.md#out-of-stream-variables),
|
|
and [two-pass algorithms](two-pass-algorithms.md).)
|
|
|
|
## Essential use: record-selection and record-updating
|
|
|
|
The essential usages of `mlr filter` and `mlr put` are for record-selection and
|
|
record-updating expressions, respectively. For example, given the following
|
|
input data:
|
|
|
|
GENMD_RUN_COMMAND
|
|
cat data/small
|
|
GENMD_EOF
|
|
|
|
you might retain only the records whose `a` field has value `eks`:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr filter '$a == "eks"' data/small
|
|
GENMD_EOF
|
|
|
|
or you might add a new field which is a function of existing fields:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr put '$ab = $a . "_" . $b ' data/small
|
|
GENMD_EOF
|
|
|
|
## Differences between put and filter
|
|
|
|
The two verbs `mlr filter` and `mlr put` are essentially the same. The only differences are:
|
|
|
|
* Expressions sent to `mlr filter` should contain a boolean expression, which is the filtering criterion. (If not, all records pass through.)
|
|
|
|
* `mlr filter` expressions may not reference the `filter` keyword within them.
|
|
|
|
## Location of boolean expression for filter
|
|
|
|
You can define and invoke functions and subroutines to help produce the bare-boolean statement, and record fields may be assigned in the statements before or after the bare-boolean statement. For example:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2p --from example.csv filter '
|
|
# Bare-boolean filter expression: only records matching this pass through:
|
|
$quantity >= 70;
|
|
# For records that do pass through, set these:
|
|
if ($rate > 8) {
|
|
$description = "high rate";
|
|
} else {
|
|
$description = "low rate";
|
|
}
|
|
'
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2p --from example.csv filter '
|
|
# Bare-boolean filter expression: only records matching this pass through:
|
|
$shape =~ "^(...)(...)$";
|
|
# For records that do pass through, capture the first "(...)" into $left and
|
|
# the second "(...)" into $right
|
|
$left = "\1";
|
|
$right = "\2";
|
|
'
|
|
GENMD_EOF
|
|
|
|
|
|
There are more details and more choices, of course, as detailed in the following sections.
|
|
|