More Miller 6 release prep

John Kerl 2021-11-06 12:57:24 -04:00
parent 18dda9e01e
commit 91def6906f
94 changed files with 2104 additions and 2271 deletions

3
.gitignore vendored
@ -1,4 +1,7 @@
go/mlr
go/mlr.exe
./mlr
./mlr.exe
a.out
*.dSYM
catc

@ -1,5 +1,9 @@
# TODO: 'cp go/mlr .' or 'copy go\mlr.exe .' with reliable platform detection
# and no confusing error messages.
build:
make -C go build
@echo Miller executable is: go/mlr
check:
make -C go check

@ -28,11 +28,11 @@ key-value-pair data in a variety of data formats.
# Getting started
* [Miller in 10 minutes](https://miller.readthedocs.io/en/latest/10min)
* [A quick tutorial on Miller](https://www.ict4g.net/adolfo/notes/data-analysis/miller-quick-tutorial.html)
* [Tools to manipulate CSV files from the Command Line](https://www.ict4g.net/adolfo/notes/data-analysis/tools-to-manipulate-csv.html)
* [www.togaware.com/linux/survivor/CSV_Files.html](https://www.togaware.com/linux/survivor/CSV_Files.html)
* [MLR for CSV manipulation](https://guillim.github.io/terminal/2018/06/19/MLR-for-CSV-manipulation.html)
* [Miller in 10 minutes](https://miller.readthedocs.io/en/latest/10min.html)
* [Linux Magazine: Process structured text files with Miller](https://www.linux-magazine.com/Issues/2016/187/Miller)
* [Miller: Command Line CSV File Processing](https://onepointzero.app/posts/miller-command-line-csv-file-processing/)
@ -43,12 +43,6 @@ key-value-pair data in a variety of data formats.
* [Notes about issue-labeling in the Github repo](https://github.com/johnkerl/miller/wiki/Issue-labeling)
* [Active issues](https://github.com/johnkerl/miller/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc)
# Miller 6 pre-release
* Pre-release/WIP docs are at [http://johnkerl.org/miller6](http://johnkerl.org/miller6)
* [go/README.md](./go/README.md)
* [Tracking issue](https://github.com/johnkerl/miller/issues/372)
# Installing
There's a good chance you can get Miller pre-built for your system:
@ -81,9 +75,15 @@ See also [building from source](https://miller.readthedocs.io/en/latest/build.ht
[![Go-port multi-platform build status](https://github.com/johnkerl/miller/actions/workflows/go.yml/badge.svg)](https://github.com/johnkerl/miller/actions)
[License: BSD2](https://github.com/johnkerl/miller/blob/master/LICENSE.txt)
# Building from source
[Docs](https://miller.readthedocs.io/en/latest/?badge=latest)
* `make` and `make check`
* The Miller executable is `go/mlr` (or `go\mlr.exe` on Windows)
* For more developer information please see [go/README.md](./go/README.md)
# License
[License: BSD2](https://github.com/johnkerl/miller/blob/master/LICENSE.txt)
# Community

@ -105,7 +105,6 @@ nav:
- 'Misc. reference':
- "Auxiliary commands": "reference-main-auxiliary-commands.md"
- "Manual page": "manpage.md"
- "Installation": "installation.md"
- "Building from source": "build.md"
- "Documents for previous releases": "release-docs.md"
- "Glossary": "glossary.md"

@ -8,69 +8,69 @@ For most of this section we'll use our [example.csv](./example.csv).
`mlr cat` is like system `cat` (or `type` on Windows) -- it passes the data through unmodified:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat example.csv
GENMD_EOF
GENMD-EOF
But `mlr cat` can also do format conversion -- for example, you can pretty-print in tabular format:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint cat example.csv
GENMD_EOF
GENMD-EOF
`mlr head` and `mlr tail` count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv head -n 4 example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv tail -n 4 example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson tail -n 2 example.csv
GENMD_EOF
GENMD-EOF
You can sort on a single field:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint sort -f shape example.csv
GENMD_EOF
GENMD-EOF
Or, you can sort primarily alphabetically on one field, then secondarily numerically descending on another field, and so on:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint sort -f shape -nr index example.csv
GENMD_EOF
GENMD-EOF
If there are fields you don't want to see in your data, you can use `cut` to keep only the ones you want, in the same order they appeared in the input data:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint cut -f flag,shape example.csv
GENMD_EOF
GENMD-EOF
You can also use `cut -o` to keep specified fields, but in your preferred order:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint cut -o -f flag,shape example.csv
GENMD_EOF
GENMD-EOF
You can use `cut -x` to omit fields you don't care about:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint cut -x -f flag,shape example.csv
GENMD_EOF
GENMD-EOF
Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use `$[[3]]` to access the name of field 3 or `$[[[3]]]` to access the value of field 3:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv
GENMD_EOF
GENMD-EOF
You can find the full list of verbs at the [Verbs Reference](reference-verbs.md) page.
@ -78,33 +78,33 @@ You can find the full list of verbs at the [Verbs Reference](reference-verbs.md)
You can use `filter` to keep only records you care about:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint filter '$color == "red"' example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint filter '$color == "red" && $flag == true' example.csv
GENMD_EOF
GENMD-EOF
## Computing new fields
You can use `put` to create new fields which are computed from other fields:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint put '
$ratio = $quantity / $rate;
$color_shape = $color . "_" . $shape
' example.csv
GENMD_EOF
GENMD-EOF
When you create a new field, it can immediately be used in subsequent statements:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv put '
$y = $index + 1;
$z = $y**2 + $k;
'
GENMD_EOF
GENMD-EOF
For `put` and `filter` we were able to type out expressions using a programming-language syntax.
See the [Miller programming language page](miller-programming-language.md) for more information.
@ -113,50 +113,50 @@ See the [Miller programming language page](miller-programming-language.md) for m
Miller takes all the files from the command line as an input stream. But it's format-aware, so it doesn't repeat CSV header lines. For example, with input files [data/a.csv](data/a.csv) and [data/b.csv](data/b.csv), the system `cat` command will repeat header lines:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/a.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/b.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/a.csv data/b.csv
GENMD_EOF
GENMD-EOF
However, `mlr cat` will not:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat data/a.csv data/b.csv
GENMD_EOF
GENMD-EOF
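The header-aware behavior described above can be sketched outside Miller. The following is a minimal Python illustration, not Miller's implementation: the function name `cat_csv` and the sample inputs are hypothetical, and the choice to write a new header whenever it differs from the previous one is an assumption of this sketch, not a statement about what Miller does.

```python
import csv
import io


def cat_csv(texts):
    """Concatenate CSV texts, writing a repeated header only once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    previous_header = None
    for text in texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue
        header, data = rows[0], rows[1:]
        # Assumption of this sketch: emit a header only when it differs
        # from the one most recently written.
        if header != previous_header:
            writer.writerow(header)
            previous_header = header
        writer.writerows(data)
    return out.getvalue()


# Made-up inputs standing in for two files with the same header.
a = "a,b,c\n1,2,3\n4,5,6\n"
b = "a,b,c\n7,8,9\n"
print(cat_csv([a, b]), end="")
```

A plain line-oriented concatenation would output the `a,b,c` line twice; treating the first line of each input as a header rather than as data is what avoids that.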
## Chaining verbs together
Often we want to chain queries together -- for example, sorting by a field and taking the top few values. We can do this using pipes:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv sort -nr index example.csv | mlr --icsv --opprint head -n 3
GENMD_EOF
GENMD-EOF
This works fine -- but Miller also lets you chain verbs together using the word `then`. Think of this as a Miller-internal pipe that lets you use fewer keystrokes:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
GENMD_EOF
GENMD-EOF
As another convenience, you can put the filename first using `--from`. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv sort -nr index then head -n 3
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv \
sort -nr index \
then head -n 3 \
then cut -f shape,quantity
GENMD_EOF
GENMD-EOF
## Sorts and stats
@ -164,55 +164,55 @@ Now suppose you want to sort the data on a given column, *and then* take the top
Here are the records with the top three `index` values:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
GENMD_EOF
GENMD-EOF
Lots of Miller commands take a `-g` option for group-by: here, `head -n 1 -g shape` outputs the first record for each distinct value of the `shape` field. This means we're finding the record with highest `index` field for each distinct `shape` field:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv
GENMD_EOF
GENMD-EOF
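The sort-then-head-by-group idea can be sketched in a few lines of Python. This is only an illustration of the logic under stated assumptions: the records are made up (they are not the contents of `example.csv`), and the helper name `sort_then_head_by_group` is hypothetical.

```python
def sort_then_head_by_group(records, group_field, sort_field, n=1):
    """Sort by group ascending and sort_field descending, then keep
    the first n records seen for each distinct group value."""
    ordered = sorted(records, key=lambda r: (r[group_field], -r[sort_field]))
    counts = {}
    kept = []
    for record in ordered:
        seen = counts.get(record[group_field], 0)
        if seen < n:
            kept.append(record)
            counts[record[group_field]] = seen + 1
    return kept


# Made-up records for illustration only.
records = [
    {"shape": "triangle", "index": 11},
    {"shape": "square", "index": 15},
    {"shape": "circle", "index": 16},
    {"shape": "square", "index": 48},
    {"shape": "triangle", "index": 51},
    {"shape": "circle", "index": 87},
]

for record in sort_then_head_by_group(records, "shape", "index"):
    print(record["shape"], record["index"])
```

Because the records are sorted with the highest `index` first within each `shape`, keeping the first record per group is the same as keeping the maximum per group.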
Statistics can be computed with or without group-by field(s):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv \
stats1 -a count,min,mean,max -f quantity -g shape
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv \
stats1 -a count,min,mean,max -f quantity -g shape,color
GENMD_EOF
GENMD-EOF
If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --oxtab --from example.csv \
stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate
GENMD_EOF
GENMD-EOF
## Unicode and internationalization
While Miller's function names, verb names, online help, etc. are all in English, Miller supports
UTF-8 data. For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat παράδειγμα.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p filter '$σχήμα == "κύκλος"' παράδειγμα.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p sort -f σημαία παράδειγμα.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p put '$форма = toupper($форма); $длина = strlen($цвет)' пример.csv
GENMD_EOF
GENMD-EOF
## File formats and format conversion
@ -230,22 +230,22 @@ What's a CSV file, really? It's an array of rows, or *records*, each being a lis
For example, if you have:
GENMD_CARDIFY
GENMD-CARDIFY
shape,flag,index
circle,1,24
square,0,36
GENMD_EOF
GENMD-EOF
then that's a way of saying:
GENMD_CARDIFY
GENMD-CARDIFY
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
GENMD_EOF
GENMD-EOF
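The correspondence between CSV rows and key-value records can be made concrete with a short Python sketch. It uses only the standard `csv` module; the helper name `csv_to_dkvp` is hypothetical, and this is an illustration of the idea rather than how Miller parses or writes anything.

```python
import csv
import io


def csv_to_dkvp(text):
    """Pair each data value with its header name, one record per line."""
    reader = csv.DictReader(io.StringIO(text))
    return [",".join(f"{key}={value}" for key, value in row.items())
            for row in reader]


csv_text = "shape,flag,index\ncircle,1,24\nsquare,0,36\n"
for line in csv_to_dkvp(csv_text):
    print(line)
```

The header line supplies the keys once, and every subsequent line supplies one record's values in the same order.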
Other ways to write the same data:
GENMD_CARDIFY
GENMD-CARDIFY
CSV PPRINT
shape,flag,index shape flag index
circle,1,24 circle 1 24
@ -266,7 +266,7 @@ JSON XTAB
DKVP
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
GENMD_EOF
GENMD-EOF
Anything we can do with CSV input data, we can do with any other format input data. And you can read from one format, do any record-processing, and output to the same format as the input, or to a different output format.
@ -280,36 +280,36 @@ You can read more about this at the [File Formats](file-formats.md) page.
If all record values are numbers, strings, etc., then converting back and forth between CSV and JSON is
a matter of specifying input-format and output-format flags:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json cat example.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --ocsv cat example.json
GENMD_EOF
GENMD-EOF
However, if JSON data has map-valued or array-valued fields, Miller gives you choices on how to
convert these to CSV columns. For example, here's some JSON data with map-valued fields:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/server-log.json
GENMD_EOF
GENMD-EOF
We can convert this to CSV, or other tabular formats:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --ocsv cat data/server-log.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --oxtab cat data/server-log.json
GENMD_EOF
GENMD-EOF
These transformations are reversible:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --oxtab cat data/server-log.json | mlr --ixtab --ojson cat
GENMD_EOF
GENMD-EOF
See the [flatten/unflatten page](flatten-unflatten.md) for more information.
@ -319,13 +319,13 @@ Often we want to print output to the screen. Miller does this by default, as we'
Sometimes, though, we want to print output to another file. Just use `> outputfilenamegoeshere` at the end of your command:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --icsv --opprint cat example.csv > newfile.csv
# Output goes to the new file;
# nothing is printed to the screen.
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
cat newfile.csv
color shape flag index quantity rate
yellow triangle true 11 43.6498 9.8870
@ -338,15 +338,15 @@ purple triangle false 65 80.1405 5.8240
yellow circle true 73 63.9785 4.2370
yellow circle true 87 63.5058 8.3350
purple square false 91 72.3735 8.2430
GENMD_EOF
GENMD-EOF
Other times we just want our files to be **changed in-place**: just use `mlr -I`:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
cp example.csv newfile.txt
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
cat newfile.txt
color,shape,flag,index,quantity,rate
yellow,triangle,true,11,43.6498,9.8870
@ -359,13 +359,13 @@ purple,triangle,false,65,80.1405,5.8240
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
purple,square,false,91,72.3735,8.2430
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr -I --csv sort -f shape newfile.txt
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
cat newfile.txt
color,shape,flag,index,quantity,rate
red,circle,true,16,13.8103,2.9010
@ -378,30 +378,30 @@ purple,square,false,91,72.3735,8.2430
yellow,triangle,true,11,43.6498,9.8870
purple,triangle,false,51,81.2290,8.5910
purple,triangle,false,65,80.1405,5.8240
GENMD_EOF
GENMD-EOF
You can also use `mlr -I` to bulk-operate on many files, e.g.:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr -I --csv cut -x -f unwanted_column_name *.csv
GENMD_EOF
GENMD-EOF
If you like, you can first copy off your original data somewhere else, before doing in-place operations.
Lastly, using `tee` within `put`, you can split your input data into separate files per one or more field names:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat circle.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat square.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat triangle.csv
GENMD_EOF
GENMD-EOF

@ -16,7 +16,7 @@ Quick links:
</div>
# Building from source
Please also see [Installation](installation.md) for information about pre-built executables.
Please also see [Installation](installing-miller.md) for information about pre-built executables.
You will need to first install Go version 1.15 or higher: please see [https://go.dev](https://go.dev).
@ -30,15 +30,18 @@ Two-clause BSD license [https://github.com/johnkerl/miller/blob/master/LICENSE.t
* `tar zxvf mlr-i.j.k.tar.gz`
* `cd mlr-i.j.k`
* `cd go`
* `./build` creates the `go/mlr` executable and runs regression tests
* `go build mlr.go` creates the `go/mlr` executable without running regression tests
* `make` creates the `go/mlr` (or `go\mlr.exe` on Windows) executable
* `make check` runs tests
* `make install` installs the `mlr` executable and the `mlr` manpage
* On Windows, if you don't have `make`, then you can do `choco install make` -- or, alternatively:
* `cd go`
* `go build` creates `mlr.exe`
* `go test -v mlr\src\...` and `go test -v` run tests
## From git clone
* `git clone https://github.com/johnkerl/miller`
* `cd miller/go`
* `./build` creates the `go/mlr` executable and runs regression tests
* `go build mlr.go` creates the `go/mlr` executable without running regression tests
* `make`, `make check`, and `make install` as above
## In case of problems
@ -66,10 +69,8 @@ In this example I am using version 6.1.0 to 6.2.0; of course that will change fo
* Update version found in `mlr --version` and `man mlr`:
* Edit `go/src/version/version.go` from `6.1.0-dev` to `6.2.0`.
* `cd ../docs`
* `export PATH=../go:$PATH`
* `make html`
* The ordering is important: the first build creates `mlr`; the second runs `mlr` to create `manpage.txt`; the third includes `manpage.txt` into one of its outputs.
* `sh build-go-src-test-man-doc.sh`
* The ordering in this script is important: the first build creates `mlr`; the second runs `mlr` to create `manpage.txt`; the third includes `manpage.txt` into one of its outputs.
* Commit and push.
* Create the release tarball and SRPM:
@ -88,11 +89,9 @@ In this example I am using version 6.1.0 to 6.2.0; of course that will change fo
* Check the release-specific docs:
* Look at [https://miller.readthedocs.io](https://miller.readthedocs.io) for new-version docs, after a few minutes' propagation time. Note this won't work until Miller 6 is released.
* ISP-push to [https://johnkerl.org/miller6](https://johnkerl.org/miller6). (Until release: this is a temporary substitute for readthedocs.)
* Notify:
* Only do these once Miller 6 is released:
* Submit `brew` pull request; notify any other distros which don't appear to have autoupdated since the previous release (notes below)
* Similarly for `macports`: [https://github.com/macports/macports-ports/blob/master/textproc/miller/Portfile](https://github.com/macports/macports-ports/blob/master/textproc/miller/Portfile)
* Social-media updates.

@ -1,6 +1,6 @@
# Building from source
Please also see [Installation](installation.md) for information about pre-built executables.
Please also see [Installation](installing-miller.md) for information about pre-built executables.
You will need to first install Go version 1.15 or higher: please see [https://go.dev](https://go.dev).
@ -14,15 +14,18 @@ Two-clause BSD license [https://github.com/johnkerl/miller/blob/master/LICENSE.t
* `tar zxvf mlr-i.j.k.tar.gz`
* `cd mlr-i.j.k`
* `cd go`
* `./build` creates the `go/mlr` executable and runs regression tests
* `go build mlr.go` creates the `go/mlr` executable without running regression tests
* `make` creates the `go/mlr` (or `go\mlr.exe` on Windows) executable
* `make check` runs tests
* `make install` installs the `mlr` executable and the `mlr` manpage
* On Windows, if you don't have `make`, then you can do `choco install make` -- or, alternatively:
* `cd go`
* `go build` creates `mlr.exe`
* `go test -v mlr\src\...` and `go test -v` run tests
## From git clone
* `git clone https://github.com/johnkerl/miller`
* `cd miller/go`
* `./build` creates the `go/mlr` executable and runs regression tests
* `go build mlr.go` creates the `go/mlr` executable without running regression tests
* `make`, `make check`, and `make install` as above
## In case of problems
@ -50,10 +53,8 @@ In this example I am using version 6.1.0 to 6.2.0; of course that will change fo
* Update version found in `mlr --version` and `man mlr`:
* Edit `go/src/version/version.go` from `6.1.0-dev` to `6.2.0`.
* `cd ../docs`
* `export PATH=../go:$PATH`
* `make html`
* The ordering is important: the first build creates `mlr`; the second runs `mlr` to create `manpage.txt`; the third includes `manpage.txt` into one of its outputs.
* `sh build-go-src-test-man-doc.sh`
* The ordering in this script is important: the first build creates `mlr`; the second runs `mlr` to create `manpage.txt`; the third includes `manpage.txt` into one of its outputs.
* Commit and push.
* Create the release tarball and SRPM:
@ -72,16 +73,14 @@ In this example I am using version 6.1.0 to 6.2.0; of course that will change fo
* Check the release-specific docs:
* Look at [https://miller.readthedocs.io](https://miller.readthedocs.io) for new-version docs, after a few minutes' propagation time. Note this won't work until Miller 6 is released.
* ISP-push to [https://johnkerl.org/miller6](https://johnkerl.org/miller6). (Until release: this is a temporary substitute for readthedocs.)
* Notify:
* Only do these once Miller 6 is released:
* Submit `brew` pull request; notify any other distros which don't appear to have autoupdated since the previous release (notes below)
* Similarly for `macports`: [https://github.com/macports/macports-ports/blob/master/textproc/miller/Portfile](https://github.com/macports/macports-ports/blob/master/textproc/miller/Portfile)
* Social-media updates.
GENMD_CARDIFY
GENMD-CARDIFY
# brew notes:
git remote add upstream https://github.com/Homebrew/homebrew-core # one-time setup only
git fetch upstream
@ -98,7 +97,7 @@ git add Formula/miller.rb
git commit -m 'miller 6.1.0'
git push -u origin miller-6.1.0
(submit the pull request)
GENMD_EOF
GENMD-EOF
* Afterwork:

@ -26,27 +26,16 @@ Pre-release Miller documentation is at [https://github.com/johnkerl/miller/tree/
Instructions for modifying, viewing, and submitting PRs for these are in the [docs/README.md](https://github.com/johnkerl/miller/blob/main/docs/README.md).
While Miller 6 is in pre-release, these docs are not viewable at
[https://miller.readthedocs.io](https://miller.readthedocs.io) which shows Miller 5 docs.
For now, I'll push Miller-6 docs to my ISP space at
[https://johnkerl.org/miller6](https://johnkerl.org/miller6) after your PR is merged.
<!---
TODO: after Miller6 release when these are on RTD
Once PRs are merged, readthedocs creates [https://miller.readthedocs.io](https://miller.readthedocs.io) using the following configs:
* [https://readthedocs.org/projects/miller](https://readthedocs.org/projects/miller)
* [https://readthedocs.org/projects/miller/builds](https://readthedocs.org/projects/miller/builds)
* [https://github.com/johnkerl/miller/settings/hooks](https://github.com/johnkerl/miller/settings/hooks)
-->
## Testing
While Miller 6 is in pre-release, the best way to test is either to build from source via [Building from source](build.md), or to get a recent binary at [https://github.com/johnkerl/miller/actions](https://github.com/johnkerl/miller/actions): click the latest build, then *Artifacts*. Then simply use Miller for whatever you do, and create an issue at [https://github.com/johnkerl/miller/issues](https://github.com/johnkerl/miller/issues) if anything goes wrong.
Do note that as of mid-2021 a few things have not been ported to Miller 6, most notably the localtime DSL functions, among other issues.
## Feature development
Issues: [https://github.com/johnkerl/miller/issues](https://github.com/johnkerl/miller/issues)

@ -10,27 +10,16 @@ Pre-release Miller documentation is at [https://github.com/johnkerl/miller/tree/
Instructions for modifying, viewing, and submitting PRs for these are in the [docs/README.md](https://github.com/johnkerl/miller/blob/main/docs/README.md).
While Miller 6 is in pre-release, these docs are not viewable at
[https://miller.readthedocs.io](https://miller.readthedocs.io) which shows Miller 5 docs.
For now, I'll push Miller-6 docs to my ISP space at
[https://johnkerl.org/miller6](https://johnkerl.org/miller6) after your PR is merged.
<!---
TODO: after Miller6 release when these are on RTD
Once PRs are merged, readthedocs creates [https://miller.readthedocs.io](https://miller.readthedocs.io) using the following configs:
* [https://readthedocs.org/projects/miller](https://readthedocs.org/projects/miller)
* [https://readthedocs.org/projects/miller/builds](https://readthedocs.org/projects/miller/builds)
* [https://github.com/johnkerl/miller/settings/hooks](https://github.com/johnkerl/miller/settings/hooks)
-->
## Testing
While Miller 6 is in pre-release, the best way to test is either to build from source via [Building from source](build.md), or to get a recent binary at [https://github.com/johnkerl/miller/actions](https://github.com/johnkerl/miller/actions): click the latest build, then *Artifacts*. Then simply use Miller for whatever you do, and create an issue at [https://github.com/johnkerl/miller/issues](https://github.com/johnkerl/miller/issues) if anything goes wrong.
Do note that as of mid-2021 a few things have not been ported to Miller 6, most notably the localtime DSL functions, among other issues.
## Feature development
Issues: [https://github.com/johnkerl/miller/issues](https://github.com/johnkerl/miller/issues)

@ -4,37 +4,37 @@
Sometimes we get CSV files which lack a header. For example, [data/headerless.csv](./data/headerless.csv):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/headerless.csv
GENMD_EOF
GENMD-EOF
You can use Miller to add a header. The `--implicit-csv-header` applies positionally indexed labels:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --implicit-csv-header cat data/headerless.csv
GENMD_EOF
GENMD-EOF
Following that, you can rename the positionally indexed labels to names with meaning for your context. For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --implicit-csv-header label name,age,status data/headerless.csv
GENMD_EOF
GENMD-EOF
Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller's output to the system command `sed 1d`, or you can use Miller's `--headerless-csv-output` option:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
head -5 data/colored-shapes.dkvp | mlr --ocsv cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat
GENMD_EOF
GENMD-EOF
Lastly, often we say "CSV" or "TSV" when we have positionally indexed data in columns which are separated by commas or tabs, respectively. In this case it's perhaps simpler to **just use NIDX format** which was designed for this purpose. (See also [File Formats](file-formats.md).) For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv
GENMD_EOF
GENMD-EOF
## Headerless CSV with duplicate field values
@ -43,16 +43,16 @@ However, lots of folks think of CSV data -- comma-separated values -- as just th
Here's some sample CSV data which is values-only, i.e. headerless:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/nas.csv
GENMD_EOF
GENMD-EOF
There are clearly nine fields here, but if we try to have Miller parse it as CSV, we
see there are fewer than nine columns:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat data/nas.csv
GENMD_EOF
GENMD-EOF
What happened?
@ -67,36 +67,36 @@ values are being seen as duplicate keys.
One solution is to use `--implicit-csv-header`, or its shorter alias `--hi`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --hi cat data/nas.csv
GENMD_EOF
GENMD-EOF
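As a rough illustration of why positional keys help, here is a Python sketch of the general effect: when a values-only line is mistaken for a header, repeated values become repeated keys, and a key-value map holds only one entry per key. The sample line and both helper names are hypothetical; this is not Miller's CSV reader and the line is not taken from `data/nas.csv`.

```python
def header_keys(line, sep=","):
    """Keys obtained when a data line is mistaken for a header line."""
    # A map holds one entry per key, so repeated values collapse.
    return list(dict.fromkeys(line.split(sep)))


def positional_keys(line, sep=","):
    """Keys 1..N assigned by position, independent of the values."""
    return [str(i) for i, _ in enumerate(line.split(sep), start=1)]


line = "3.5,7.25,NA,NA,NA"  # hypothetical values-only line
print(header_keys(line))      # three distinct keys
print(positional_keys(line))  # five keys, one per field
```

With positional keys, the number of keys always equals the number of fields on the line, whatever the values are.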
Another solution is to use [NIDX format](file-formats.md#nidx-index-numbered-toolkit-style):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --inidx --ifs comma --ocsv cat data/nas.csv
GENMD_EOF
GENMD-EOF
Either way, since there is no explicit header, fields are named `1` through `9`. We can use the
[label verb](reference-verbs.md#label) to apply more meaningful names:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --hi cat then label xsn,ysn,x,y,t,a,e29,e31,e32 data/nas.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --inidx --ifs comma --ocsv cat then label xsn,ysn,x,y,t,a,e29,e31,e32 data/nas.csv
GENMD_EOF
GENMD-EOF
## Regularizing ragged CSV
Miller handles [RFC-4180-compliant CSV](file-formats.md#csvtsvasvusvetc): in particular, it's an error if the number of data fields in a given data line doesn't match the number of header fields. But if you have a CSV file in which some lines have fewer than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, for which each line stands on its own without respect to a header line.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/ragged.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/ragged.csv --fs comma --nidx put '
@maxnf = max(@maxnf, NF);
@nf = NF;
@ -105,17 +105,17 @@ mlr --from data/ragged.csv --fs comma --nidx put '
$[@nf] = ""
}
'
GENMD_EOF
GENMD-EOF
or, more simply,
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/ragged.csv --fs comma --nidx put '
@maxnf = max(@maxnf, NF);
while(NF < @maxnf) {
$[NF+1] = "";
}
'
GENMD_EOF
GENMD-EOF
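The padding logic above can be sketched in Python to make the streaming behavior explicit. This is an illustration under assumptions, not a replacement for the Miller commands: the helper name `pad_ragged` and the sample lines are made up (they are not the contents of `data/ragged.csv`), and the split is a naive one that ignores CSV quoting.

```python
def pad_ragged(lines, sep=","):
    """Pad each line with empty fields up to the widest line seen so far."""
    max_nf = 0
    for line in lines:
        fields = line.split(sep)  # naive split: no quoting support
        # Running maximum, like @maxnf = max(@maxnf, NF) in the DSL.
        max_nf = max(max_nf, len(fields))
        fields += [""] * (max_nf - len(fields))
        yield sep.join(fields)


# Made-up ragged input for illustration.
ragged = ["a,b,c", "1,2,3", "4,5", "6,7,8,9"]
for padded in pad_ragged(ragged):
    print(padded)
```

Because the maximum is tracked as lines stream by, a short line is padded only to the widest line seen up to that point; earlier lines are not widened retroactively when a longer line arrives later.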
See also the [record-heterogeneity page](record-heterogeneity.md).

@ -4,29 +4,29 @@
Suppose you always use CSV files. Then instead of always having to type `--csv` as in
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --csv cut -x -f extra mydata.csv
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --csv sort -n id mydata.csv
GENMD_EOF
GENMD-EOF
and so on, you can instead put the following into your `$HOME/.mlrrc`:
GENMD_CARDIFY
GENMD-CARDIFY
--csv
GENMD_EOF
GENMD-EOF
Then you can just type things like
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr cut -x -f extra mydata.csv
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr sort -n id mydata.csv
GENMD_EOF
GENMD-EOF
and the `--csv` part will automatically be understood. If you do want to process, say, a JSON file then `mlr --json ...` at the command line will still override the defaults you've placed in your `.mlrrc`.
@ -46,7 +46,7 @@ and the `--csv` part will automatically be understood. If you do want to process
Here is an example `.mlrrc` file:
GENMD_INCLUDE_ESCAPED(sample_mlrrc)
GENMD-INCLUDE-ESCAPED(sample_mlrrc)
## Where to put your .mlrrc

@ -2,36 +2,36 @@
Here are some ways to use the type-checking options as described in the [Type-checking page](reference-dsl-variables.md#type-checking). Suppose you have the following data file, with inconsistent typing for boolean. (Also imagine that, for the sake of discussion, we have a million-line file rather than a four-line file, so we can't see it all at once and some automation is called for.)
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/het-bool.csv
GENMD_EOF
GENMD-EOF
One option is to coerce everything to boolean, or integer:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint put '$reachable = boolean($reachable)' data/het-bool.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint put '$reachable = int(boolean($reachable))' data/het-bool.csv
GENMD_EOF
GENMD-EOF
A second option is to flag badly formatted data within the output stream:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint put '$format_ok = is_string($reachable)' data/het-bool.csv
GENMD_EOF
GENMD-EOF
Or perhaps to flag badly formatted data outside the output stream:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint put '
if (!is_string($reachable)) {eprint "Malformed at NR=".NR}
' data/het-bool.csv
GENMD_EOF
GENMD-EOF
A third way is to abort the process on first instance of bad data:
GENMD_RUN_COMMAND_STDERR_ONLY
GENMD-RUN-COMMAND-STDERR-ONLY
mlr --csv put '$reachable = asserting_string($reachable)' data/het-bool.csv
GENMD_EOF
GENMD-EOF

@ -6,56 +6,56 @@ The [flins.csv](data/flins.csv) file is some sample data obtained from [https://
Vertical-tabular format is good for a quick look at CSV data layout -- seeing what columns you have to work with, as this is a file big enough that we can't just see it on a single screenful:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
wc -l data/flins.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2x --from data/flins.csv head -n 2
GENMD_EOF
GENMD-EOF
A few simple queries:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from data/flins.csv count-distinct -f county | head
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from data/flins.csv count-distinct -f line
GENMD_EOF
GENMD-EOF
Categorization of total insured value:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2x --from data/flins.csv stats1 -a min,mean,max -f tiv_2012
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from data/flins.csv \
stats1 -a min,mean,max -f tiv_2012 -g construction,line
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2x --from data/flins.csv \
stats1 -a p0,p10,p50,p90,p95,p99,p100 -f hu_site_deductible
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from data/flins.csv \
stats1 -a p95,p99,p100 -f hu_site_deductible -g county \
then sort -f county | head
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2x --from data/flins.csv \
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2x --from data/flins.csv --ofmt '%.4f' \
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county \
then head -n 5
GENMD_EOF
GENMD-EOF
## Color/shape data
@ -71,50 +71,50 @@ The [data/colored-shapes.dkvp](data/colored-shapes.dkvp) file is some sample dat
Peek at the data:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
wc -l data/colored-shapes.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
head -n 6 data/colored-shapes.dkvp | mlr --opprint cat
GENMD_EOF
GENMD-EOF
Look at uncategorized stats (using [creach](https://github.com/johnkerl/scripts/blob/master/fundam/creach) for spacing).
Here it looks reasonable that `u` is unit-uniform; something's up with `v` but we can't yet see what:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab stats1 -a min,mean,max -f flag,u,v data/colored-shapes.dkvp | creach 3
GENMD_EOF
GENMD-EOF
The histogram shows the different distribution of 0/1 flags:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp
GENMD_EOF
GENMD-EOF
Look at univariate stats by color and shape. In particular, color-dependent flag probabilities pop out, aligning with their original Bernoulli probabilities from the data-generator script:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g color \
then sort -f color \
data/colored-shapes.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g shape \
then sort -f shape \
data/colored-shapes.dkvp
GENMD_EOF
GENMD-EOF
Look at bivariate stats by color and shape. In particular, `u,v` pairwise correlation for red circles pops out:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint --right \
stats2 -a corr -f u,v,w,x -g color,shape then sort -nr u_v_corr \
data/colored-shapes.dkvp
GENMD_EOF
GENMD-EOF


@ -4,17 +4,17 @@
Given input like
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat dates.csv
GENMD_EOF
GENMD-EOF
we can use [strptime](reference-verbs.md#strptime) to parse the date field into seconds-since-epoch and then do numeric comparisons. Simply match your input dataset's date-formatting to the [strptime](reference-verbs.md#strptime) format-string. For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv filter '
strptime($date, "%Y-%m-%d") > strptime("2018-03-03", "%Y-%m-%d")
' dates.csv
GENMD_EOF
GENMD-EOF
Caveat: localtime-handling in timezones with DST is still a work in progress; see [https://github.com/johnkerl/miller/issues/170](https://github.com/johnkerl/miller/issues/170) . See also [https://github.com/johnkerl/miller/issues/208](https://github.com/johnkerl/miller/issues/208) -- thanks @aborruso!
@ -22,40 +22,40 @@ Caveat: localtime-handling in timezones with DST is still a work in progress; se
Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
head -n 10 data/miss-date.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
wc -l data/miss-date.csv
GENMD_EOF
GENMD-EOF
Since there are 1372 lines in the data file, some automation is called for. To find the missing dates, you can convert the dates to seconds since the epoch using `strptime`, then compute adjacent differences (the `cat -n` simply inserts record-counters):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/miss-date.csv --icsv \
cat -n \
then put '$datestamp = strptime($date, "%Y-%m-%d")' \
then step -a delta -f datestamp \
| head
GENMD_EOF
GENMD-EOF
Then, filter for adjacent difference not being 86400 (the number of seconds in a day):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/miss-date.csv --icsv \
cat -n \
then put '$datestamp = strptime($date, "%Y-%m-%d")' \
then step -a delta -f datestamp \
then filter '$datestamp_delta != 86400 && $n != 1'
GENMD_EOF
GENMD-EOF
Given this, it's now easy to see where the gaps are:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr cat -n then filter '$n >= 770 && $n <= 780' data/miss-date.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr cat -n then filter '$n >= 1115 && $n <= 1125' data/miss-date.csv
GENMD_EOF
GENMD-EOF
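The gap-finding steps above (convert each date to seconds since the epoch, take adjacent differences, keep the pairs whose difference is not 86400) can also be sketched outside Miller. The following Ruby is only an illustration under stated assumptions: `epoch_seconds` and `find_date_gaps` are hypothetical helper names, the dates are assumed to be sorted and in `%Y-%m-%d` format, and UTC is used so that DST shifts do not affect the differences.

```ruby
require 'date'

SECONDS_PER_DAY = 86_400

# Hypothetical helper: parse a %Y-%m-%d date as midnight UTC, in epoch seconds.
def epoch_seconds(date_string)
  d = Date.strptime(date_string, '%Y-%m-%d')
  Time.utc(d.year, d.month, d.day).to_i
end

# Hypothetical helper: return the [previous, current] pairs of adjacent
# dates that are not exactly one day apart.
def find_date_gaps(dates)
  dates.each_cons(2).select do |prev, curr|
    epoch_seconds(curr) - epoch_seconds(prev) != SECONDS_PER_DAY
  end
end

# Made-up sample data, not taken from data/miss-date.csv.
sample = %w[2012-03-04 2012-03-05 2012-03-07 2012-03-08]
find_date_gaps(sample).each do |prev, curr|
  puts "gap between #{prev} and #{curr}"
end
```

Unlike the Miller pipeline, this sketch compares pairs directly, so it needs no special case for the first record.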


@ -4,46 +4,46 @@
Here are the I/O routines:
GENMD_INCLUDE_ESCAPED(polyglot-dkvp-io/dkvp_io.py)
GENMD-INCLUDE-ESCAPED(polyglot-dkvp-io/dkvp_io.py)
And here is an example using them:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat polyglot-dkvp-io/example.py
GENMD_EOF
GENMD-EOF
Run as-is:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
python polyglot-dkvp-io/example.py < data/small
GENMD_EOF
GENMD-EOF
Run as-is, then pipe to Miller for pretty-printing:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
python polyglot-dkvp-io/example.py < data/small | mlr --opprint cat
GENMD_EOF
GENMD-EOF
## DKVP I/O in Ruby
Here are the I/O routines:
GENMD_INCLUDE_ESCAPED(polyglot-dkvp-io/dkvp_io.rb)
GENMD-INCLUDE-ESCAPED(polyglot-dkvp-io/dkvp_io.rb)
And here is an example using them:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat polyglot-dkvp-io/example.rb
GENMD_EOF
GENMD-EOF
Run as-is:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small
GENMD_EOF
GENMD-EOF
Run as-is, then pipe to Miller for pretty-printing:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small | mlr --opprint cat
GENMD_EOF
GENMD-EOF


@ -6,9 +6,9 @@ Additionally, Miller gives you the option of including comments within your data
## Examples
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help file-formats
GENMD_EOF
GENMD-EOF
## CSV/TSV/ASV/USV/etc.
@ -67,39 +67,39 @@ you.
An **array of single-level objects** is, quite simply, **a table**:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json head -n 2 then cut -f color,shape data/json-example-1.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json head -n 2 then cut -f color,u,v data/json-example-1.json
GENMD_EOF
GENMD-EOF
Single-level JSON data goes back and forth between JSON and tabular formats
in the direct way:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint head -n 2 then cut -f color,u,v data/json-example-1.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint cat data/json-example-1.json
GENMD_EOF
GENMD-EOF
### Nested JSON objects
Additionally, Miller can **tabularize nested objects by concatenating keys**. If your processing has
input as well as output in JSON format, JSON structure is preserved throughout the processing:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json --jvstack head -n 2 data/json-example-2.json
GENMD_EOF
GENMD-EOF
But if the input format is JSON and the output format is not (or vice versa) then key-concatenation applies:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint head -n 4 data/json-example-2.json
GENMD_EOF
GENMD-EOF
This is discussed in more detail on the page [Flatten/unflatten: JSON vs. tabular formats](flatten-unflatten.md).
@ -128,13 +128,13 @@ decode these in Miller.
Miller's pretty-print format is like CSV, but column-aligned. For example, compare
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ocsv cat data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint cat data/small
GENMD_EOF
GENMD-EOF
Note that while Miller is a line-at-a-time processor and retains input lines in memory only where necessary (e.g. for sort), pretty-print output requires it to accumulate all input lines (so that it can compute maximum column widths) before producing any output. This has two consequences: (a) pretty-print output won't work on `tail -f` contexts, where Miller will be waiting for an end-of-file marker which never arrives; (b) pretty-print output for large files is constrained by available machine memory.
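To see why pretty-print output cannot be streamed, here is a minimal Ruby sketch of column alignment. It is not Miller's implementation: `pprint_lines` is a hypothetical helper, and it assumes every record has the same keys in the same order. The point is that the column widths depend on every record, so no line can be emitted until all input has been read.

```ruby
# Hypothetical helper: format homogeneous records as column-aligned lines.
def pprint_lines(records)
  headers = records.first.keys
  rows = [headers] + records.map { |r| headers.map { |h| r[h].to_s } }

  # Each width is a maximum over ALL rows, which is why the whole
  # input must be held in memory before the first line is written.
  widths = headers.each_index.map { |i| rows.map { |row| row[i].length }.max }

  rows.map do |row|
    row.each_with_index.map { |cell, i| cell.ljust(widths[i]) }.join(' ').rstrip
  end
end

# Made-up sample records for illustration.
records = [
  { 'a' => 'pan', 'x' => '0.3467901443380824' },
  { 'a' => 'eks', 'x' => '0.7586799647899636' }
]
puts pprint_lines(records)
```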
@ -142,17 +142,17 @@ See [Record Heterogeneity](record-heterogeneity.md) for how Miller handles chang
For output only (this isn't supported in the input-scanner as of 5.0.0) you can use `--barred` with pprint output format:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint --barred cat data/small
GENMD_EOF
GENMD-EOF
## Markdown tabular
Markdown format looks like this:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --omd cat data/small
GENMD_EOF
GENMD-EOF
which renders like this when dropped into various web tools (e.g. github comments):
@ -165,7 +165,7 @@ As of Miller 4.3.0, markdown format is supported only for output, not input.
This is perhaps most useful for looking at very wide and/or multi-column data which causes line-wraps on the screen (but see also
[ngrid](https://github.com/twosigma/ngrid/) for an entirely different, very powerful option). Namely:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
$ grep -v '^#' /etc/passwd | head -n 6 | mlr --nidx --fs : --opprint cat
1 2 3 4 5 6 7
nobody * -2 -2 Unprivileged User /var/empty /usr/bin/false
@ -174,9 +174,9 @@ daemon * 1 1 System Services /var/root /usr/bin/false
_uucp * 4 4 Unix to Unix Copy Protocol /var/spool/uucp /usr/sbin/uucico
_taskgated * 13 13 Task Gate Daemon /var/empty /usr/bin/false
_networkd * 24 24 Network Services /var/networkd /usr/bin/false
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
$ grep -v '^#' /etc/passwd | head -n 2 | mlr --nidx --fs : --oxtab cat
1 nobody
2 *
@ -193,9 +193,9 @@ $ grep -v '^#' /etc/passwd | head -n 2 | mlr --nidx --fs : --oxtab cat
5 System Administrator
6 /var/root
7 /bin/sh
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_THREE
GENMD-CARDIFY-HIGHLIGHT-THREE
$ grep -v '^#' /etc/passwd | head -n 2 | \
mlr --nidx --fs : --ojson --jvstack --jlistwrap \
label name,password,uid,gid,gecos,home_dir,shell
@ -219,45 +219,45 @@ $ grep -v '^#' /etc/passwd | head -n 2 | \
"shell": "/bin/sh"
}
]
GENMD_EOF
GENMD-EOF
## DKVP: Key-value pairs
Miller's default file format is DKVP, for **delimited key-value pairs**. Example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr cat data/small
GENMD_EOF
GENMD-EOF
Such data are easy to generate, e.g. in Ruby with
GENMD_CARDIFY
GENMD-CARDIFY
puts "host=#{hostname},seconds=#{t2-t1},message=#{msg}"
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY
GENMD-CARDIFY
puts mymap.collect{|k,v| "#{k}=#{v}"}.join(',')
GENMD_EOF
GENMD-EOF
or `print` statements in various languages, e.g.
GENMD_CARDIFY
GENMD-CARDIFY
echo "type=3,user=$USER,date=$date\n";
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY
GENMD-CARDIFY
logger.log("type=3,user=$USER,date=$date\n");
GENMD_EOF
GENMD-EOF
Fields lacking an IPS will have positional index (starting at 1) used as the key, as in NIDX format. For example, `dish=7,egg=8,flint` is parsed as `"dish" => "7", "egg" => "8", "3" => "flint"` and `dish,egg,flint` is parsed as `"1" => "dish", "2" => "egg", "3" => "flint"`.
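The positional-key rule can be sketched in a few lines of Ruby. This is an illustration rather than Miller's parser: `parse_dkvp_line` is a hypothetical helper, and it assumes single-character separators with no quoting or escaping.

```ruby
# Hypothetical helper: parse one DKVP line into a Hash of strings.
def parse_dkvp_line(line, ifs: ',', ips: '=')
  record = {}
  line.chomp.split(ifs).each_with_index do |field, index|
    key, value = field.split(ips, 2)
    if value.nil?
      # No pair separator in this field: key it by its 1-up position.
      record[(index + 1).to_s] = field
    else
      record[key] = value
    end
  end
  record
end

p parse_dkvp_line('dish=7,egg=8,flint')
p parse_dkvp_line('dish,egg,flint')
```

Note that the position counts all fields, not only the ones lacking a key, which is why `flint` is keyed by `"3"` in both examples.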
As discussed in [Record Heterogeneity](record-heterogeneity.md), Miller handles changes of field names within the same data stream. But using DKVP format this is particularly natural. One of my favorite use-cases for Miller is in application/server logs, where I log all sorts of lines such as
GENMD_CARDIFY
GENMD-CARDIFY
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100, resource=/path/to/file
resource=/some/other/path,loadsec=0.97,ok=false
GENMD_EOF
GENMD-EOF
etc. and I just log them as needed. Then later, I can use `grep`, `mlr --opprint group-like`, etc.
to analyze my logs.
@ -272,41 +272,41 @@ This recapitulates Unix-toolkit behavior.
Example with index-numbered output:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --onidx --ofs ' ' cat data/small
GENMD_EOF
GENMD-EOF
Example with index-numbered input:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/mydata.txt
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --inidx --ifs ' ' --odkvp cat data/mydata.txt
GENMD_EOF
GENMD-EOF
Example with index-numbered input and output:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/mydata.txt
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --nidx --fs ' ' --repifs cut -f 2,3 data/mydata.txt
GENMD_EOF
GENMD-EOF
## Data-conversion keystroke-savers
While you can do format conversion using `mlr --icsv --ojson cat myfile.csv`, there are also keystroke-savers for this purpose, such as `mlr --c2j cat myfile.csv`. For a complete list:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help format-conversion
GENMD_EOF
GENMD-EOF
<!---
TODO: probably entirely unsupport this feature in Miller6.
@ -330,20 +330,20 @@ See also the [separators page](reference-main-separators.md) for more informatio
You can include comments within your data files, and either have them ignored, or passed directly through to the standard output as soon as they are encountered:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help comments-in-data
GENMD_EOF
GENMD-EOF
Examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/budget.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --skip-comments --icsv --opprint sort -nr quantity data/budget.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --pass-comments --icsv --opprint sort -nr quantity data/budget.csv
GENMD_EOF
GENMD-EOF


@ -14,13 +14,13 @@ For [JSON files](file-formats.md#json), this is easy -- JSON is a nested format
can be maps or arrays, which can contain other maps or arrays, and so on, with the nesting
happily indicated by curly braces:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/map-values.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/map-values-nested.json
GENMD_EOF
GENMD-EOF
How can we represent these in CSV files?
@ -44,51 +44,51 @@ default behavior is to spread the map values into multiple keys -- using
Miller's `flatsep` separator, which defaults to `.` -- to join the original
record key with map keys:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/map-values.json
GENMD_EOF
GENMD-EOF
Flattened to CSV format:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --ocsv cat data/map-values.json
GENMD_EOF
GENMD-EOF
Flattened to pretty-print format:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint cat data/map-values.json
GENMD_EOF
GENMD-EOF
Using flatten-separator `:` instead of the default `.`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint --flatsep : cat data/map-values.json
GENMD_EOF
GENMD-EOF
If the maps are more deeply nested, each level of map keys is joined in:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/map-values-nested.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint cat data/map-values-nested.json
GENMD_EOF
GENMD-EOF
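The key-joining rule described above can be sketched as a short recursive function. This is a simplified Ruby illustration, not Miller's code: `flatten_record` is a hypothetical name, and the sketch ignores edge cases such as empty maps or arrays, which it silently drops.

```ruby
# Hypothetical helper: flatten nested Hashes/Arrays into a single-level Hash,
# joining keys (and 1-up array indices) with the flatten separator.
def flatten_record(value, flatsep = '.', prefix = nil, out = {})
  case value
  when Hash
    value.each do |key, child|
      name = prefix.nil? ? key.to_s : "#{prefix}#{flatsep}#{key}"
      flatten_record(child, flatsep, name, out)
    end
  when Array
    value.each_with_index do |child, index|
      name = prefix.nil? ? (index + 1).to_s : "#{prefix}#{flatsep}#{index + 1}"
      flatten_record(child, flatsep, name, out)
    end
  else
    # Leaf value: store it under the fully joined key.
    out[prefix] = value
  end
  out
end

# Made-up sample record for illustration.
record = { 'id' => 1, 'values' => { 'x' => 3, 'y' => { 'a' => 4 } } }
p flatten_record(record)
p flatten_record(record, ':')
```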
**Unflattening** is simply the reverse -- from non-JSON back to JSON:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/map-values.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --ocsv cat data/map-values.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --ocsv cat data/map-values.json | mlr --icsv --ojson cat
GENMD_EOF
GENMD-EOF
## Converting arrays between JSON and non-JSON
@ -96,23 +96,23 @@ If the input data contains arrays, these are also flattened similarly: the
[1-up array indices](reference-main-arrays.md#1-up-indexing) `1,2,3,...` become string keys
`"1","2","3",...`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/array-values.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint cat data/array-values.json
GENMD_EOF
GENMD-EOF
If the arrays are more deeply nested, each level of array indices is joined in:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/array-values-nested.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint cat data/array-values-nested.json
GENMD_EOF
GENMD-EOF
In the nested-data examples shown here, nested map values are shown containing
maps, and nested array values are shown containing arrays -- of course (even
@ -120,17 +120,17 @@ though not shown here) nested map values can contain arrays, and vice versa.
**Unflattening** arrays is, again, simply the reverse -- from non-JSON back to JSON:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/array-values.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --ocsv cat data/array-values.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --ocsv cat data/array-values.json | mlr --icsv --ojson cat
GENMD_EOF
GENMD-EOF
## Auto-inferencing of arrays on unflatten
@ -140,21 +140,21 @@ with keys `"1"`, `"2"`, etc. -- starting with `"1"`, consecutively, and with
no gaps -- it turns that back into an array. This is precisely to undo the
flatten conversion. However, it may (or may not) be surprising:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/consecutive.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2j cat data/consecutive.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/non-consecutive.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2j cat data/non-consecutive.csv
GENMD_EOF
GENMD-EOF
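The array-inference rule can be written down as a small check. The Ruby below is a sketch of the rule as stated in this section, not Miller's implementation: `arrayify` is a hypothetical name, and it assumes the keys must appear in the order `"1"`, `"2"`, and so on, which may be stricter than what Miller actually requires.

```ruby
# Hypothetical helper: turn a map into an array only when its keys are
# exactly "1", "2", ..., "n" -- starting at "1", consecutive, with no gaps.
def arrayify(map)
  expected = (1..map.length).map(&:to_s)
  if !map.empty? && map.keys == expected
    map.values
  else
    map
  end
end

p arrayify({ '1' => 'a', '2' => 'b', '3' => 'c' })
p arrayify({ '1' => 'a', '3' => 'c' })
```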
## Manual control
@ -174,50 +174,50 @@ because
[map](reference-main-maps.md)-valued/[array](reference-main-arrays.md)-valued
fields can be produced using [DSL statements](miller-programming-language.md):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/hostnames.csv
GENMD_EOF
GENMD-EOF
Using JSON output, we can see that `splita` has produced an array-valued field named `components`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/hostnames.csv put '$components = splita($host, ".")'
GENMD_EOF
GENMD-EOF
Using CSV output, with default auto-flatten, we get `components.1` through `components.4`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from data/hostnames.csv put '$components = splita($host, ".")'
GENMD_EOF
GENMD-EOF
Using CSV output, without default auto-flatten, we get a JSON-stringified encoding of the `components` field:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from data/hostnames.csv --no-auto-flatten put '$components = splita($host, ".")'
GENMD_EOF
GENMD-EOF
Now suppose we ran this
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --oxtab --from data/hostnames.csv --no-auto-flatten put '
$a = splita($host, ".");
$b = splita($host, ".");
'
GENMD_EOF
GENMD-EOF
into a file [data/hostnames.xtab](./data/hostnames.xtab):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/hostnames.xtab
GENMD_EOF
GENMD-EOF
This was written with `--no-auto-flatten` so we need to manually revive the
array-valued fields, if we choose -- here, we can JSON-parse the `a` field and
leave `b` JSON-stringified:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ixtab --ojson json-parse -f a data/hostnames.xtab
GENMD_EOF
GENMD-EOF
See also the
[JSON parse and stringify section](reference-main-data-types.md#json-parse-and-stringify) for


@ -29,53 +29,53 @@ def main
begin
content_line = input_handle.readline
if content_line =~ /^GENMD_RUN_COMMAND$/
if content_line =~ /^GENMD-RUN-COMMAND$/
cmd_lines = read_until_genmd_eof(input_handle)
run_command(cmd_lines, output_handle)
elsif content_line =~ /^GENMD_CARDIFY$/
elsif content_line =~ /^GENMD-CARDIFY$/
lines = read_until_genmd_eof(input_handle)
write_card([], lines, output_handle)
elsif content_line =~ /^GENMD_CARDIFY_HIGHLIGHT_ONE$/
elsif content_line =~ /^GENMD-CARDIFY-HIGHLIGHT-ONE$/
lines = read_until_genmd_eof(input_handle)
line1 = lines.shift
write_card([line1], lines, output_handle)
elsif content_line =~ /^GENMD_CARDIFY_HIGHLIGHT_TWO$/
elsif content_line =~ /^GENMD-CARDIFY-HIGHLIGHT-TWO$/
lines = read_until_genmd_eof(input_handle)
write_card(lines.slice(0,2), lines.slice(2, lines.length), output_handle)
elsif content_line =~ /^GENMD_CARDIFY_HIGHLIGHT_THREE$/
elsif content_line =~ /^GENMD-CARDIFY-HIGHLIGHT-THREE$/
lines = read_until_genmd_eof(input_handle)
write_card(lines.slice(0,3), lines.slice(3, lines.length), output_handle)
elsif content_line =~ /^GENMD_CARDIFY_HIGHLIGHT_FOUR$/
elsif content_line =~ /^GENMD-CARDIFY-HIGHLIGHT-FOUR$/
lines = read_until_genmd_eof(input_handle)
write_card(lines.slice(0,4), lines.slice(4, lines.length), output_handle)
elsif content_line =~ /^GENMD_RUN_COMMAND_TOLERATING_ERROR$/
elsif content_line =~ /^GENMD-RUN-COMMAND-TOLERATING-ERROR$/
cmd_lines = read_until_genmd_eof(input_handle)
run_command_tolerating_error(cmd_lines, output_handle)
elsif content_line =~ /^GENMD_RUN_COMMAND_STDERR_ONLY$/
elsif content_line =~ /^GENMD-RUN-COMMAND-STDERR-ONLY$/
cmd_lines = read_until_genmd_eof(input_handle)
run_command_stderr_only(cmd_lines, output_handle)
elsif content_line =~ /^GENMD_SHOW_COMMAND$/
elsif content_line =~ /^GENMD-SHOW-COMMAND$/
cmd_lines = read_until_genmd_eof(input_handle)
show_command(cmd_lines, output_handle)
elsif content_line =~ /^GENMD_INCLUDE_ESCAPED\(([^)]+)\)$/
elsif content_line =~ /^GENMD-INCLUDE-ESCAPED\(([^)]+)\)$/
included_file_name = $1
include_escaped(included_file_name, output_handle)
elsif content_line =~ /^GENMD_INCLUDE_AND_RUN_ESCAPED\(([^)]+)\)$/
elsif content_line =~ /^GENMD-INCLUDE-AND-RUN-ESCAPED\(([^)]+)\)$/
included_file_name = $1
cmd_lines = File.readlines(included_file_name).map{|line| line.chomp}
run_command(cmd_lines, output_handle)
elsif content_line =~ /^GENMD_RUN_CONTENT_GENERATOR\(([^)]+)\)$/
elsif content_line =~ /^GENMD-RUN-CONTENT-GENERATOR\(([^)]+)\)$/
cmd = $1
run_content_generator(cmd, output_handle)
@ -147,7 +147,7 @@ def read_until_genmd_eof(input_handle)
lines = []
while true
line = input_handle.readline.chomp
if line == 'GENMD_EOF'
if line == 'GENMD-EOF'
break
end
lines << line
@ -155,7 +155,7 @@ def read_until_genmd_eof(input_handle)
return lines
rescue EOFError
$stderr.puts "$0: did not find GENMD_EOF"
$stderr.puts "#{$0}: did not find GENMD-EOF"
exit 1
end
end
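The directive rename this diff applies throughout the `.md.in` sources (underscores to hyphens in `GENMD_...` names) could be done mechanically. The Ruby below is a hypothetical one-off helper, not part of the repository; it assumes directives always start at the beginning of a line and consist of uppercase letters and underscores.

```ruby
# Hypothetical helper: rewrite a leading GENMD_... directive name to use
# hyphens, leaving any parenthesized argument (such as a file path) untouched.
def rename_directive(line)
  line.sub(/\AGENMD_[A-Z_]+/) { |name| name.tr('_', '-') }
end

[
  'GENMD_RUN_COMMAND',
  'GENMD_INCLUDE_ESCAPED(polyglot-dkvp-io/dkvp_io.py)',
  'mlr --icsv --opprint cat example.csv',
  'GENMD_EOF'
].each { |line| puts rename_directive(line) }
```

Because the pattern stops at the first character that is not an uppercase letter or underscore, underscores inside file names such as `dkvp_io.py` are preserved.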


@ -14,19 +14,19 @@ pretty-print your data horizontally or vertically to make it easier to read.
**Compact verbs vs programming language:** For low-keystroking you can do things like
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr --csv sort -f name input.csv
GENMD_EOF
GENMD-EOF
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr --json head -n 1 myfile.json
GENMD_EOF
GENMD-EOF
The `sort`, `head`, etc. are called *verbs*. They're analogs of familiar command-line tools like `sort`, `head`, and so on -- but they're aware of name-indexed, multi-line file formats like CSV, TSV, and JSON. In addition, though, using Miller's `put` verb you can use programming-language statements for expressions like
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr --csv put '$rate = $units / $seconds' input.csv
GENMD_EOF
GENMD-EOF
which allow you to succinctly express your own logic.


@ -1,70 +0,0 @@
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
<div>
<span class="quicklinks">
Quick links:
&nbsp;
<a class="quicklink" href="../reference-main-flag-list/index.html">Flags</a>
&nbsp;
<a class="quicklink" href="../reference-verbs/index.html">Verbs</a>
&nbsp;
<a class="quicklink" href="../reference-dsl-builtin-functions/index.html">Functions</a>
&nbsp;
<a class="quicklink" href="../glossary/index.html">Glossary</a>
&nbsp;
<a class="quicklink" href="../release-docs/index.html">Release docs</a>
</span>
</div>
# Installation
Note:
* Miller 6 is in pre-release, and is described by the docs you're reading ([https://johnkerl.org/miller6](https://johnkerl.org/miller6)).
* Miller 5 is released, and is described by [https://miller.readthedocs.io](https://miller.readthedocs.io). Package managers will currently give you Miller 5.
## Prebuilt executables via package managers (Miller 5 only)
[Homebrew](https://brew.sh/) installation support for OS X is available via
<pre class="pre-highlight-non-pair">
<b>brew update && brew install miller</b>
</pre>
... and also via [MacPorts](https://www.macports.org/):
<pre class="pre-highlight-non-pair">
<b>sudo port selfupdate && sudo port install miller</b>
</pre>
Note that Homebrew is available for Linux as well: [https://docs.brew.sh/linux](https://docs.brew.sh/linux).
You may already have the `mlr` executable available in your platform's package manager on NetBSD, Debian Linux, Ubuntu Xenial and upward, Arch Linux, or perhaps other distributions. For example, on various Linux distributions you might do one of the following:
<pre class="pre-highlight-non-pair">
<b>sudo apt-get install miller</b>
</pre>
<pre class="pre-highlight-non-pair">
<b>sudo apt install miller</b>
</pre>
<pre class="pre-highlight-non-pair">
<b>sudo yum install miller</b>
</pre>
On Windows, Miller is available via [Chocolatey](https://chocolatey.org/):
<pre class="pre-highlight-non-pair">
<b>choco install miller</b>
</pre>
## Prebuilt executables via GitHub per release (Miller 5 only)
Please see [https://github.com/johnkerl/miller/releases](https://github.com/johnkerl/miller/releases) where there are builds for OS X Yosemite, Linux x86-64 (dynamically linked), and Windows.
## Prebuilt executables via GitHub per commit (Miller 6)
Miller is [autobuilt for **Linux**, **MacOS**, and **Windows** using **GitHub Actions** on every commit](https://github.com/johnkerl/miller/actions): select the latest build and click _Artifacts_. (These are retained for 5 days after each commit.)
## Building from source (Miller 6)
Please see [Building from source](build.md).


@ -1,54 +0,0 @@
# Installation
Note:
* Miller 6 is in pre-release, and is described by the docs you're reading ([https://johnkerl.org/miller6](https://johnkerl.org/miller6)).
* Miller 5 is released, and is described by [https://miller.readthedocs.io](https://miller.readthedocs.io). Package managers will currently give you Miller 5.
## Prebuilt executables via package managers (Miller 5 only)
[Homebrew](https://brew.sh/) installation support for OS X is available via
GENMD_CARDIFY_HIGHLIGHT_ONE
brew update && brew install miller
GENMD_EOF
... and also via [MacPorts](https://www.macports.org/):
GENMD_CARDIFY_HIGHLIGHT_ONE
sudo port selfupdate && sudo port install miller
GENMD_EOF
Note that Homebrew is available for Linux as well: [https://docs.brew.sh/linux](https://docs.brew.sh/linux).
You may already have the `mlr` executable available in your platform's package manager on NetBSD, Debian Linux, Ubuntu Xenial and upward, Arch Linux, or perhaps other distributions. For example, on various Linux distributions you might do one of the following:
GENMD_CARDIFY_HIGHLIGHT_ONE
sudo apt-get install miller
GENMD_EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
sudo apt install miller
GENMD_EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
sudo yum install miller
GENMD_EOF
On Windows, Miller is available via [Chocolatey](https://chocolatey.org/):
GENMD_CARDIFY_HIGHLIGHT_ONE
choco install miller
GENMD_EOF
## Prebuilt executables via GitHub per release (Miller 5 only)
Please see [https://github.com/johnkerl/miller/releases](https://github.com/johnkerl/miller/releases) where there are builds for OS X Yosemite, Linux x86-64 (dynamically linked), and Windows.
## Prebuilt executables via GitHub per commit (Miller 6)
Miller is [autobuilt for **Linux**, **MacOS**, and **Windows** using **GitHub Actions** on every commit](https://github.com/johnkerl/miller/actions): select the latest build and click _Artifacts_. (These are retained for 5 days after each commit.)
## Building from source (Miller 6)
Please see [Building from source](build.md).


@ -18,12 +18,12 @@ Quick links:
You can install Miller for various platforms as follows.
* Miller 6 is in pre-release, and is described by the docs you're reading ([https://johnkerl.org/miller6](https://johnkerl.org/miller6)).
* Miller 6 is in pre-release.
* You can get latest Miller 6 builds for Linux, MacOS, and Windows by visiting [https://github.com/johnkerl/miller/actions](https://github.com/johnkerl/miller/actions), selecting the latest build, and clicking _Artifacts_. (These are retained for 5 days after each commit.)
* See also the [build page](build.md) if you prefer -- in particular, if your platform's package manager doesn't have the latest release.
* Miller 5 is released, and is described by [https://miller.readthedocs.io](https://miller.readthedocs.io).
* Miller 5 is released.
* Linux: `yum install miller` or `apt-get install miller` depending on your flavor of Linux, or [Homebrew](https://docs.brew.sh/linux).
* MacOS: `brew install miller` or `port install miller` depending on your preference of [Homebrew](https://brew.sh) or [MacPorts](https://macports.org).
* MacOS: `brew update` and `brew install miller`, or `sudo port selfupdate` and `sudo port install miller`, depending on your preference of [Homebrew](https://brew.sh) or [MacPorts](https://macports.org).
* Windows: `choco install miller` using [Chocolatey](https://chocolatey.org).
As a first check, you should be able to run `mlr --version` at your system's command prompt and see something like the following:


@ -2,29 +2,29 @@
You can install Miller for various platforms as follows.
* Miller 6 is in pre-release, and is described by the docs you're reading ([https://johnkerl.org/miller6](https://johnkerl.org/miller6)).
* Miller 6 is in pre-release.
* You can get latest Miller 6 builds for Linux, MacOS, and Windows by visiting [https://github.com/johnkerl/miller/actions](https://github.com/johnkerl/miller/actions), selecting the latest build, and clicking _Artifacts_. (These are retained for 5 days after each commit.)
* See also the [build page](build.md) if you prefer -- in particular, if your platform's package manager doesn't have the latest release.
* Miller 5 is released, and is described by [https://miller.readthedocs.io](https://miller.readthedocs.io).
* Miller 5 is released.
* Linux: `yum install miller` or `apt-get install miller` depending on your flavor of Linux, or [Homebrew](https://docs.brew.sh/linux).
* MacOS: `brew install miller` or `port install miller` depending on your preference of [Homebrew](https://brew.sh) or [MacPorts](https://macports.org).
* MacOS: `brew update` and `brew install miller`, or `sudo port selfupdate` and `sudo port install miller`, depending on your preference of [Homebrew](https://brew.sh) or [MacPorts](https://macports.org).
* Windows: `choco install miller` using [Chocolatey](https://chocolatey.org).
As a first check, you should be able to run `mlr --version` at your system's command prompt and see something like the following:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --version
GENMD_EOF
GENMD-EOF
As a second check, given [example.csv](./example.csv) you should be able to do
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint cat example.csv
GENMD_EOF
GENMD-EOF
If you run into issues on these checks, please check out the resources on the [community page](community.md) for help.


@ -9,18 +9,18 @@ Support for internationalization includes:
* The [toupper](reference-dsl-builtin-functions.md#toupper), [tolower](reference-dsl-builtin-functions.md#tolower), and [capitalize](reference-dsl-builtin-functions.md#capitalize) DSL functions operate within the capabilities of the Go libraries.
* While Miller's function names, verb names, online help, etc. are all in English, you can write field names, string literals, variable names, etc. in UTF-8.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat παράδειγμα.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p filter '$σχήμα == "κύκλος"' παράδειγμα.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p sort -f σημαία παράδειγμα.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p put '$форма = toupper($форма); $длина = strlen($цвет)' пример.csv
GENMD_EOF
GENMD-EOF


@ -4,13 +4,13 @@
In our examples so far we've often made use of `mlr --icsv --opprint` or `mlr --icsv --ojson`. These are such frequently occurring patterns that they have short options like `--c2p` and `--c2j`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p head -n 2 example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2j head -n 2 example.csv
GENMD_EOF
GENMD-EOF
You can get the full list [here](file-formats.md#data-conversion-keystroke-savers).
@ -18,19 +18,19 @@ You can get the full list [here](file-formats.md#data-conversion-keystroke-saver
Already we saw that you can put the filename first using `--from`. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv sort -nr index then head -n 3
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv sort -nr index then head -n 3 then cut -f shape,quantity
GENMD_EOF
GENMD-EOF
If there's more than one input file, you can use `--mfrom`, then however many file names, then `--` to indicate the end of your input-file-name list:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr --c2p --mfrom data/*.csv -- sort -n index
GENMD_EOF
GENMD-EOF
## Shortest flags for CSV, TSV, and JSON


@ -10,47 +10,47 @@ Writing a program -- in any language whatsoever -- you can have it print out log
Suppose your program has printed something like this [log.txt](./log.txt):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat log.txt
GENMD_EOF
GENMD-EOF
Each print statement simply contains local information: the current timestamp, whether a particular cache was hit or not, etc. Then using either the system `grep` command, or Miller's [having-fields verb](reference-verbs.md#having-fields), or the [is_present DSL function](reference-dsl-builtin-functions.md#is_present), we can pick out the parts we want and analyze them:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
grep op=cache log.txt \
| mlr --idkvp --opprint stats1 -a mean -f hit -g type then sort -f type
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from log.txt --opprint \
filter 'is_present($batch_size)' \
then step -a delta -f time,num_filtered \
then sec2gmt time
GENMD_EOF
GENMD-EOF
Alternatively, we can simply group the similar data for a better look:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint group-like log.txt
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint group-like then sec2gmt time log.txt
GENMD_EOF
GENMD-EOF
## Parsing log-file output
This, of course, depends highly on what's in your log files. But, as an example, suppose you have log-file lines such as
GENMD_CARDIFY
GENMD-CARDIFY
2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378
GENMD_EOF
GENMD-EOF
I prefer to pre-filter with `grep` and/or `sed` to extract the structured text, then hand that to Miller. Example:
GENMD_CARDIFY_HIGHLIGHT_THREE
GENMD-CARDIFY-HIGHLIGHT-THREE
grep 'various sorts' *.log \
| sed 's/.*} //' \
| mlr --fs space --repifs --oxtab stats1 -a min,p10,p50,p90,max -f time -g status
... output here ...
GENMD_EOF
GENMD-EOF
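
To make the `sed` step in the pipeline above concrete, here is a minimal sketch that applies the same substitution to the sample log line. The helper name `extract_pairs` is an assumption made for illustration and is not part of Miller; the sketch uses only `printf` and `sed`.

```shell
# Sample line copied from the example above.
line='2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378'

# extract_pairs is a hypothetical helper name, not part of Miller.
# The greedy ".*} " removes everything through the last "} ",
# so only the space-separated key=value pairs remain.
extract_pairs() {
  sed 's/.*} //'
}

printf '%s\n' "$line" | extract_pairs
```

This prints `hits=1 status=0 time=2.378`, which is the structured text that the `mlr` stage of the pipeline then reads.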


@ -40,7 +40,7 @@ SYNOPSIS
example.csv
Please see 'mlr help topics' for more information. Please also see
https://johnkerl.org/miller6
https://miller.readthedocs.io
DESCRIPTION
@ -1019,7 +1019,7 @@ VERBS
Using 'any' higher-order function to see if $index is 10, 20, or 30:
'any([10,20,30], func(e) {return $index == e})'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
flatten
Usage: mlr flatten [options]
@ -1437,7 +1437,7 @@ VERBS
end{emitf @min, @max}
'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
regularize
Usage: mlr regularize [options]
@ -2692,7 +2692,7 @@ KEYWORDS FOR PUT AND FILTER
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit &gt; stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
emitf
emitf: inserts non-indexed out-of-stream variable(s) side-by-side into the
@ -2720,7 +2720,7 @@ KEYWORDS FOR PUT AND FILTER
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern", @a, @b, @c'
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern &gt; mytap.dat", @a, @b, @c'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
emitp
emitp: inserts an out-of-stream variable into the output record stream.
@ -2750,7 +2750,7 @@ KEYWORDS FOR PUT AND FILTER
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp &gt; stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
end
end: defines a block of statements to be executed after input records
@ -2975,5 +2975,5 @@ SEE ALSO
2021-11-05 MILLER(1)
2021-11-06 MILLER(1)
</pre>


@ -2,4 +2,4 @@
This is simply a copy of what you should see on running `man mlr` at a command prompt, once Miller is installed on your system.
GENMD_INCLUDE_ESCAPED(manpage.txt)
GENMD-INCLUDE-ESCAPED(manpage.txt)


@ -19,7 +19,7 @@ SYNOPSIS
example.csv
Please see 'mlr help topics' for more information. Please also see
https://johnkerl.org/miller6
https://miller.readthedocs.io
DESCRIPTION
@ -998,7 +998,7 @@ VERBS
Using 'any' higher-order function to see if $index is 10, 20, or 30:
'any([10,20,30], func(e) {return $index == e})'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
flatten
Usage: mlr flatten [options]
@ -1416,7 +1416,7 @@ VERBS
end{emitf @min, @max}
'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
regularize
Usage: mlr regularize [options]
@ -2671,7 +2671,7 @@ KEYWORDS FOR PUT AND FILTER
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
emitf
emitf: inserts non-indexed out-of-stream variable(s) side-by-side into the
@ -2699,7 +2699,7 @@ KEYWORDS FOR PUT AND FILTER
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern", @a, @b, @c'
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern > mytap.dat", @a, @b, @c'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
emitp
emitp: inserts an out-of-stream variable into the output record stream.
@ -2729,7 +2729,7 @@ KEYWORDS FOR PUT AND FILTER
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
end
end: defines a block of statements to be executed after input records
@ -2954,4 +2954,4 @@ SEE ALSO
2021-11-05 MILLER(1)
2021-11-06 MILLER(1)


@ -24,7 +24,7 @@ Miller was originally developed for Unix-like operating systems including Linux
MSYS2 is no longer required -- although you can of course still use Miller from within MSYS2 if you prefer. There is now simply a single `mlr.exe`, with no `msys2.dll` alongside anymore.
See [Installation](installation.md) for how to get a copy of `mlr.exe`.
See [Installation](installing-miller.md) for how to get a copy of `mlr.exe`.
## Setup


@ -8,7 +8,7 @@ Miller was originally developed for Unix-like operating systems including Linux
MSYS2 is no longer required -- although you can of course still use Miller from within MSYS2 if you prefer. There is now simply a single `mlr.exe`, with no `msys2.dll` alongside anymore.
See [Installation](installation.md) for how to get a copy of `mlr.exe`.
See [Installation](installing-miller.md) for how to get a copy of `mlr.exe`.
## Setup


@ -10,9 +10,9 @@ In the [DSL reference](reference-dsl.md) page we have a complete reference to Mi
Let's keep using the [example.csv](./example.csv) file:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p put '$cost = $quantity * $rate' example.csv
GENMD_EOF
GENMD-EOF
When we type that, a few things are happening:
@ -25,26 +25,26 @@ When we type that, a few things are happening:
You can use more than one statement, separating them with semicolons, and optionally putting them on lines of their own:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p put '$cost = $quantity * $rate; $index = $index * 100' example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p put '
$cost = $quantity * $rate;
$index *= 100
' example.csv
GENMD_EOF
GENMD-EOF
One of Miller's key features is the ability to express data-transformation right there at the keyboard, interactively. But if you find yourself using expressions repeatedly, you can put everything between the single quotes into a file and refer to that using `put -f`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat dsl-example.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p put -f dsl-example.mlr example.csv
GENMD_EOF
GENMD-EOF
This becomes particularly important on Windows. Quite a bit of effort was put into making Miller on Windows be able to handle the kinds of single-quoted expressions we're showing here, but if you get syntax-error messages on Windows using examples in this documentation, you can put the parts between single quotes into a file and refer to that using `mlr put -f` -- or, use the triple-double-quote trick as described in the [Miller on Windows page](miller-on-windows.md).
@ -56,34 +56,34 @@ Above we also saw that names like `$quantity` are bound to each record in turn.
To make `begin` and `end` statements useful, we need somewhere to put things that persist across the duration of the record stream, and a way to emit them. Miller uses [**out-of-stream variables**](reference-dsl-variables.md#out-of-stream-variables) (or **oosvars** for short) whose names start with an `@` sigil, along with the [`emit`](reference-dsl-output-statements.md#emit-statements) keyword to write them into the output record stream:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put 'begin { @sum = 0 } @sum += $quantity; end {emit @sum}'
GENMD_EOF
GENMD-EOF
If you want the end-block output to be the only output, and not include the records from the input data, you can use `mlr put -q`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put -q 'begin { @sum = 0 } @sum += $quantity; end {emit @sum}'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2j --from example.csv put -q 'begin { @sum = 0 } @sum += $quantity; end {emit @sum}'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2j --from example.csv put -q '
begin { @count = 0; @sum = 0 }
@count += 1;
@sum += $quantity;
end {emit (@count, @sum)}
'
GENMD_EOF
GENMD-EOF
We'll see in the documentation for [stats1](reference-verbs.md#stats1) that there's a lower-keystroking way to get counts and sums of things:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2j --from example.csv stats1 -a sum,count -f quantity
GENMD_EOF
GENMD-EOF
So, take this sum/count example as an indication of the kinds of things you can do using Miller's programming language.
@ -97,33 +97,33 @@ Also inspired by [AWK](https://en.wikipedia.org/wiki/AWK), the Miller DSL has th
* `NR` -- starting from 1, counter of how many records processed so far.
* `FNR` -- similar, but resets to 1 at the start of each file.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat context-example.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/a.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/b.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p put -f context-example.mlr data/a.csv data/b.csv
GENMD_EOF
GENMD-EOF
## Functions and local variables
You can [define your own functions](reference-dsl-user-defined-functions.md):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat factorial-example.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put -f factorial-example.mlr -e '$fact = factorial(NR)'
GENMD_EOF
GENMD-EOF
Note that here we used the `-f` flag to `put` to load our function
definition, and also the `-e` flag to add another statement on the command
@ -135,13 +135,13 @@ future use.)
Suppose you want to only compute sums conditionally -- you can use an `if` statement:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat if-example.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put -q -f if-example.mlr
GENMD_EOF
GENMD-EOF
Miller's else-if is spelled `elif`.
@ -154,17 +154,17 @@ haven't encountered maps and arrays yet in this introduction, but for now it
suffices to know that `$*` is a special variable holding the current record as
a map:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat for-example.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat data/a.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from data/a.csv put -qf for-example.mlr
GENMD_EOF
GENMD-EOF
Here we used the local variables `k` and `v`. Now we've seen four kinds of variables:
@ -199,10 +199,10 @@ basic idea is:
For example, you can sum up all the `$a` values across records without having to check whether they're present or not:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json cat absent-example.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json put '@sum_of_a += $a; end {emit @sum_of_a}' absent-example.json
GENMD_EOF
GENMD-EOF


@ -2,61 +2,61 @@
Column select:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr --csv cut -f hostname,uptime mydata.csv
GENMD_EOF
GENMD-EOF
Add new columns as function of other columns:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --nidx put '$sum = $7 < 0.0 ? 3.5 : $7 + 2.1*$8' *.dat
GENMD_EOF
GENMD-EOF
Row filter:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --csv filter '$status != "down" && $upsec >= 10000' *.csv
GENMD_EOF
GENMD-EOF
Apply column labels and pretty-print:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
grep -v '^#' /etc/group | mlr --ifs : --nidx --opprint label group,pass,gid,member then sort -f group
GENMD_EOF
GENMD-EOF
Join multiple data sources on key columns:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr join -j account_id -f accounts.dat then group-by account_name balances.dat
GENMD_EOF
GENMD-EOF
Multiple formats including JSON:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --json put '$attr = sub($attr, "([0-9]+)_([0-9]+)_.*", "\1:\2")' data/*.json
GENMD_EOF
GENMD-EOF
Aggregate per-column statistics:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr stats1 -a min,mean,max,p10,p50,p90 -f flag,u,v data/*
GENMD_EOF
GENMD-EOF
Linear regression:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr stats2 -a linreg-pca -f u,v -g shape data/*
GENMD_EOF
GENMD-EOF
Aggregate custom per-column statistics:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr put -q '@sum[$a][$b] += $x; end {emit @sum, "a", "b"}' data/*
GENMD_EOF
GENMD-EOF
Iterate over data using DSL expressions:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from estimates.tbl put '
for (k,v in $*) {
if (is_numeric(v) && k =~ "^[t-z].*$") {
@ -65,85 +65,85 @@ mlr --from estimates.tbl put '
}
$mean = $sum / $count # no assignment if count unset
'
GENMD_EOF
GENMD-EOF
Run DSL expressions from a script file:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from infile.dat put -f analyze.mlr
GENMD_EOF
GENMD-EOF
Split/reduce output to multiple filenames:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from infile.dat put 'tee > "./taps/data-".$a."-".$b, $*'
GENMD_EOF
GENMD-EOF
Compressed I/O:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from infile.dat put 'tee | "gzip > ./taps/data-".$a."-".$b.".gz", $*'
GENMD_EOF
GENMD-EOF
Interoperate with other data-processing tools using standard pipes:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from infile.dat put -q '@v=$*; dump | "jq .[]"'
GENMD_EOF
GENMD-EOF
Tap/trace:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from infile.dat put '(NR % 1000 == 0) { print > stderr, "Checkpoint ".NR}'
GENMD_EOF
GENMD-EOF
## Program timing
This admittedly artificial example demonstrates using Miller time and stats functions to introspectively acquire some information about Miller's own runtime. The `delta` function computes the difference between successive timestamps.
GENMD_INCLUDE_ESCAPED(data/timing-example.txt)
GENMD-INCLUDE-ESCAPED(data/timing-example.txt)
## Showing differences between successive queries
Suppose you have a database query which you run at one point in time, producing the output on the left, then again later producing the output on the right:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/previous_counters.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/current_counters.csv
GENMD_EOF
GENMD-EOF
And, suppose you want to compute the differences in the counters between adjacent keys. Since the color names aren't all in the same order, nor are they all present on both sides, we can't just paste the two files side-by-side and do some column-four-minus-column-two arithmetic.
First, rename counter columns to make them distinct:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv rename count,previous_count data/previous_counters.csv > data/prevtemp.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/prevtemp.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv rename count,current_count data/current_counters.csv > data/currtemp.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/currtemp.csv
GENMD_EOF
GENMD-EOF
Then, join on the key field(s), and use unsparsify to zero-fill counters absent on one side but present on the other. Use `--ul` and `--ur` to emit unpaired records (namely, purple on the left and yellow on the right):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint \
join -j color --ul --ur -f data/prevtemp.csv \
then unsparsify --fill-with 0 \
then put '$count_delta = $current_count - $previous_count' \
data/currtemp.csv
GENMD_EOF
GENMD-EOF
See also the [record-heterogeneity page](record-heterogeneity.md).
@ -151,11 +151,11 @@ See also the [record-heterogeneity page](record-heterogeneity.md).
The recursive function for the Fibonacci sequence is famous for its computational complexity. Namely, using f(0)=1, f(1)=1, f(n)=f(n-1)+f(n-2) for n>=2, the evaluation tree branches left as well as right at each non-trivial level, resulting in millions or more paths to the root 0/1 nodes for larger n. This program
GENMD_INCLUDE_ESCAPED(data/fibo-uncached.sh)
GENMD-INCLUDE-ESCAPED(data/fibo-uncached.sh)
produces output like this:
GENMD_CARDIFY
GENMD-CARDIFY
i o fcount seconds_delta
1 1 1 0
2 2 3 0.000039101
@ -185,15 +185,15 @@ i o fcount seconds_delta
26 196418 392835 0.334423065
27 317811 635621 0.605969906
28 514229 1028457 0.971235037
GENMD_EOF
GENMD-EOF
Note that the time it takes to evaluate the function is blowing up exponentially as the input argument increases. Using `@`-variables, which persist across records, we can cache and reuse the results of previous computations:
GENMD_INCLUDE_ESCAPED(data/fibo-cached.sh)
GENMD-INCLUDE-ESCAPED(data/fibo-cached.sh)
with output like this:
GENMD_CARDIFY
GENMD-CARDIFY
i o fcount seconds_delta
1 1 1 0
2 2 3 0.000053883
@ -223,4 +223,4 @@ i o fcount seconds_delta
26 196418 3 0.000012875
27 317811 3 0.000013113
28 514229 3 0.000012875
GENMD_EOF
GENMD-EOF
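
The caching idea can be sketched outside Miller as well. The following is an illustration only: `data/fibo-cached.sh` is not shown in this diff, so this is not that script. It uses `awk` rather than the Miller DSL, and the names `fib_cached`, `cache`, and `evals` are assumptions. It uses the same definition as the text, f(0)=1, f(1)=1, f(n)=f(n-1)+f(n-2).

```shell
# fib_cached is a hypothetical helper, written in awk rather than the Miller DSL.
# It prints f(n) followed by the number of non-cached evaluations.
fib_cached() {
  awk -v n="$1" '
    function fib(k,    v) {
      if (k in cache) return cache[k]   # reuse an earlier result
      evals++                           # count real evaluations only
      v = (k <= 1) ? 1 : fib(k - 1) + fib(k - 2)
      cache[k] = v                      # remember the result for later calls
      return v
    }
    BEGIN {
      evals = 0
      result = fib(n)
      print result, evals
    }
  '
}

fib_cached 28
```

Because each argument from 0 through 28 is evaluated once and then served from the cache, the evaluation count grows linearly with the argument instead of exponentially.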


@ -7,12 +7,12 @@ while true
break
end
if line =~ /GENMD_RUN_COMMAND{{.*}}HERE/
line.sub!("GENMD_RUN_COMMAND{{", "")
if line =~ /GENMD-RUN-COMMAND{{.*}}HERE/
line.sub!("GENMD-RUN-COMMAND{{", "")
line.sub!("}}HERE", "")
puts 'GENMD_RUN_COMMAND'
puts 'GENMD-RUN-COMMAND'
puts line
puts 'GENMD_EOF'
puts 'GENMD-EOF'
else
puts line
end


@ -72,7 +72,7 @@ See also the [Arrays reference](reference-main-arrays.md) for more information.
Stronger support for Windows (with or without MSYS2), with a couple of
exceptions. See [Miller on Windows](miller-on-windows.md) for more information.
Binaries are reliably available using GitHub Actions: see also [Installation](installation.md).
Binaries are reliably available using GitHub Actions: see also [Installation](installing-miller.md).
## In-process support for compressed input

View file

@ -56,7 +56,7 @@ See also the [Arrays reference](reference-main-arrays.md) for more information.
Stronger support for Windows (with or without MSYS2), with a couple of
exceptions. See [Miller on Windows](miller-on-windows.md) for more information.
Binaries are reliably available using GitHub Actions: see also [Installation](installation.md).
Binaries are reliably available using GitHub Actions: see also [Installation](installing-miller.md).
## In-process support for compressed input
@ -66,10 +66,10 @@ In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, i
You can read input with prefixes `https://`, `http://`, and `file://`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv sort -f shape \
https://raw.githubusercontent.com/johnkerl/miller/main/docs/src/gz-example.csv.gz
GENMD_EOF
GENMD-EOF
## Output colorization
@ -98,13 +98,13 @@ strings throughout the processing chain.
For example (see [https://github.com/johnkerl/miller/issues/178](https://github.com/johnkerl/miller/issues/178)) you can now do
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo '{ "a": "0123" }' | mlr --json cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo '{ "x": 1.230, "y": 1.230000000 }' | mlr --json cat
GENMD_EOF
GENMD-EOF
## REPL
@ -150,10 +150,10 @@ Miller 6 has getoptish command-line parsing ([pull request 467](https://github.c
For `mlr put` and `mlr filter`, parse-error messages now include location information:
GENMD_CARDIFY
GENMD-CARDIFY
mlr: cannot parse DSL expression.
Parse error on token ">" at line 63 columnn 7.
GENMD_EOF
GENMD-EOF
## Developer-specific aspects


@ -35,7 +35,7 @@ Output of one verb may be chained as input to another using "then", e.g.
mlr --csv stats1 -a min,mean,max -f quantity then sort -f color example.csv
Please see 'mlr help topics' for more information.
Please also see https://johnkerl.org/miller6
Please also see https://miller.readthedocs.io
</pre>
<pre class="pre-highlight-in-pair">
@ -214,7 +214,7 @@ You can use `:h` or `:help` inside the [REPL](repl.md):
</pre>
<pre class="pre-non-highlight-in-pair">
Miller v6.0.0-dev REPL for darwin:amd64:go1.16.5
Pre-release docs for Miller 6: https://johnkerl.org/miller6
Docs: https://miller.readthedocs.io
Type ':h' or ':help' for on-line help; ':q' or ':quit' to quit.
[mlr] :h
Options:


@ -6,17 +6,17 @@ Miller has several online help mechanisms built in.
The front door is `mlr --help` or its synonym `mlr -h`. This leads you to `mlr help topics` with its list of specific areas:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --help
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help topics
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help functions
GENMD_EOF
GENMD-EOF
Etc.
@ -30,17 +30,17 @@ See `mlr help flags` for a full listing.
This is a command-line version of the [List of verbs](reference-verbs.md) page.
Given the name of a verb (from `mlr -l`) you can invoke it with `--help` or `-h` -- or, use `mlr help verb`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr cat --help
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr group-like -h
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help verb sort
GENMD_EOF
GENMD-EOF
Etc.
@ -49,17 +49,17 @@ Etc.
This is a command-line version of the [DSL built-in functions](reference-dsl-builtin-functions.md) page.
Given the name of a DSL function (from `mlr -f`) you can use `mlr help function` for details:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help function append
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help function split
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help function splita
GENMD_EOF
GENMD-EOF
Etc.
@ -68,10 +68,10 @@ Etc.
You can use `:h` or `:help` inside the [REPL](repl.md):
<!--- TODO: repl-executor genmd function -->
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
$ mlr repl
Miller v6.0.0-dev REPL for darwin:amd64:go1.16.5
Pre-release docs for Miller 6: https://johnkerl.org/miller6
Docs: https://miller.readthedocs.io
Type ':h' or ':help' for on-line help; ':q' or ':quit' to quit.
[mlr] :h
Options:
@ -85,7 +85,7 @@ Options:
:help {function name}, e.g. :help sec2gmt
:help {function name}, e.g. :help sec2gmt
[mlr]
GENMD_EOF
GENMD-EOF
## Manual page


@ -4,55 +4,55 @@
Suppose you want to replace spaces with underscores in your column names:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/spaces.csv
GENMD_EOF
GENMD-EOF
The simplest way is to use `mlr rename` with `-g` (for global replace, not just first occurrence of space within each field) and `-r` for pattern-matching (rather than explicit single-column renames):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv rename -g -r ' ,_' data/spaces.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --opprint rename -g -r ' ,_' data/spaces.csv
GENMD_EOF
GENMD-EOF
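
To see the intended effect of the global rename on a header line, here is a rough equivalent using plain `sed`. It is a sketch under stated assumptions: the two-line CSV input is made up for illustration, `underscore_header` is a hypothetical helper name, and the substitution is naive about quoted CSV fields, which `mlr rename` handles properly.

```shell
# underscore_header is a hypothetical helper, not part of Miller.
# It replaces every space with an underscore on line 1 (the CSV header) only.
underscore_header() {
  sed '1s/ /_/g'
}

# Hypothetical input: a header with spaces, then one data row.
printf 'a b c,def,g h i\n123,4567,890\n' | underscore_header
```

The header comes out as `a_b_c,def,g_h_i` and the data row is left alone.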
You can also do this with a for-loop:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/bulk-rename-for-loop.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint put -f data/bulk-rename-for-loop.mlr data/spaces.csv
GENMD_EOF
GENMD-EOF
## Search-and-replace over all fields
How to do `$name = gsub($name, "old", "new")` for all fields?
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/sar.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/sar.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv put -f data/sar.mlr data/sar.csv
GENMD_EOF
GENMD-EOF
## Full field renames and reassigns
Using Miller 5.0.0's map literals and assigning to `$*`, you can fully generalize [rename](reference-verbs.md#rename), [reorder](reference-verbs.md#reorder), etc.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '
begin {
@i_cumu = 0;
@ -68,4 +68,4 @@ mlr put '
"x": $y,
};
' data/small
GENMD_EOF
GENMD-EOF


@ -22,9 +22,9 @@ to retain sums, counters, etc.
For example, let's look at our short data file [data/short.csv](data/short.csv):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/short.csv
GENMD_EOF
GENMD-EOF
We can track count and sum using
[out-of-stream variables](reference-dsl-variables.md#out-of-stream-variables) -- the ones that
@ -32,7 +32,7 @@ start with the `@` sigil -- then
[emit](reference-dsl-output-statements.md#emit-statements) them as a new record
after all the input is read.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/short.csv put '
begin {
@count = 0;
@ -44,12 +44,12 @@ mlr --icsv --ojson --from data/short.csv put '
emit (@count, @sum);
}
'
GENMD_EOF
GENMD-EOF
And if all we want is the final output and not the input data, we can use `put
-q` to not pass through the input records:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/short.csv put -q '
begin {
@count = 0;
@ -61,7 +61,7 @@ mlr --icsv --ojson --from data/short.csv put -q '
emit (@count, @sum);
}
'
GENMD_EOF
GENMD-EOF
As discussed a bit more on the page on [streaming processing and memory
usage](streaming-and-memory.md), this doesn't keep all records in memory, only
@ -74,11 +74,11 @@ The second option is to retain entire records in a [map](reference-main-maps.md)
Let's use the same short data file [data/short.csv](data/short.csv):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/short.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/short.csv put -q '
# map
begin {
@ -95,11 +95,11 @@ mlr --icsv --ojson --from data/short.csv put -q '
emit (count, sum);
}
'
GENMD_EOF
GENMD-EOF
The downside to this, of course, is that this retains all records (plus data-structure overhead) in memory, so you're limited to processing files that fit in your computer's memory. The upside, though, is that you can do random access over the records using things like
GENMD_CARDIFY
GENMD-CARDIFY
output = 0;
for (i = 1; i <= NR; i += 1) {
for (j = 1; j <= NR; j += 1) {
@ -109,13 +109,13 @@ GENMD_CARDIFY
}
}
# do something with the output
GENMD_EOF
GENMD-EOF
## Retaining records in an array
The third option is to retain records in an [array](reference-main-arrays.md), then loop over them in an `end` block.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/short.csv put -q '
# array
begin {
@ -132,7 +132,7 @@ mlr --icsv --ojson --from data/short.csv put -q '
emit (count, sum);
}
'
GENMD_EOF
GENMD-EOF
Just as with the retain-as-map approach, the downside is the overhead of
retaining all records in memory, and the upside is that you get random access
@ -149,7 +149,7 @@ start with 1, not 0 as discussed in the [Arrays](reference-main-arrays.md)
page.) This means that if you are only retaining a subset of records then your
array will have [null-gaps](reference-main-arrays.md) in it:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/short.csv put -q '
begin {
@records = [];
@ -161,11 +161,11 @@ mlr --icsv --ojson --from data/short.csv put -q '
dump @records;
}
'
GENMD_EOF
GENMD-EOF
You can index `@records` by `@count` rather than `NR` to get a contiguous array:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/short.csv put -q '
begin {
@records = [];
@ -186,11 +186,11 @@ mlr --icsv --ojson --from data/short.csv put -q '
emit (count, sum);
}
'
GENMD_EOF
GENMD-EOF
If you use a map to retain records, then this is a non-issue: maps can retain whatever values you like:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/short.csv put -q '
begin {
@records = {};
@ -209,7 +209,7 @@ mlr --icsv --ojson --from data/short.csv put -q '
emit (count, sum);
}
'
GENMD_EOF
GENMD-EOF
Do note that Miller [maps](reference-main-maps.md) preserve insertion order, so
at the end you're guaranteed to loop over records in the same order you read
@ -222,7 +222,7 @@ If all you need is one or a few attributes out of a record, you don't need to
retain full records. You can retain a map, or array, of just the fields you're
interested in:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/short.csv put -q '
begin {
@values = {};
@ -241,7 +241,7 @@ mlr --icsv --ojson --from data/short.csv put -q '
emit (count, sum);
}
'
GENMD_EOF
GENMD-EOF
## Sorting


@ -6,13 +6,13 @@ Here are a few things focusing on Miller's DSL as a programming language per se,
The [Sieve of Eratosthenes](http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) is a standard introductory programming topic. The idea is to find all primes up to some *N* by making a list of the numbers 1 to *N*, then striking out all multiples of 2 except 2 itself, all multiples of 3 except 3 itself, all multiples of 4 except 4 itself, and so on. Whatever survives that without getting marked is a prime. This is easy enough in Miller. Notice that here all the work is in `begin` and `end` statements; there is no file input (so we use `mlr -n` to keep Miller from waiting for input data).
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat programs/sieve.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put -f programs/sieve.mlr
GENMD_EOF
GENMD-EOF
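For comparison with a general-purpose language, here is a minimal Python sketch of the same sieve idea. It is an illustration only, not a transcription of `programs/sieve.mlr`, and the function name `sieve` is made up for this example. As a small shortcut it only strikes out multiples of numbers which are still unmarked, which gives the same survivors.

```python
def sieve(n):
    """Return the primes up to and including n, using the Sieve of Eratosthenes."""
    if n < 2:
        return []
    # is_prime[i] stays True until i is struck out as a multiple of a smaller number.
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, n + 1):
        if is_prime[i]:
            # Strike out every multiple of i except i itself.
            for multiple in range(2 * i, n + 1, i):
                is_prime[multiple] = False
    return [i for i in range(2, n + 1) if is_prime[i]]


print(sieve(30))
```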
## Mandelbrot-set generator
@ -20,19 +20,19 @@ The [Mandelbrot set](http://en.wikipedia.org/wiki/Mandelbrot_set) is also easily
The (approximate) computation of points in the complex plane which are and aren't members is just a few lines of complex arithmetic (see the [Wikipedia article](https://en.wikipedia.org/wiki/Mandelbrot_set)); how to render them visually is another task. Using graphics libraries you can create PNG or JPEG files, but another fun way to do this is by printing various characters to the screen:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat programs/mand.mlr
GENMD_EOF
GENMD-EOF
At standard resolution this makes a nice little ASCII plot:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put -s iheight=25 -s iwidth=50 -f ./programs/mand.mlr
GENMD_EOF
GENMD-EOF
But using a very small font size (as small as my Mac will let me go), and by choosing the coordinates to zoom in on a particular part of the complex plane, we can get a nice little picture:
GENMD_CARDIFY
GENMD-CARDIFY
#!/bin/bash
# Get the number of rows and columns from the terminal window dimensions
iheight=$(stty size | mlr --nidx --fs space cut -f 1)
@ -40,6 +40,6 @@ iwidth=$(stty size | mlr --nidx --fs space cut -f 2)
mlr -n put \
-s rcorn=-1.755350 -s icorn=0.014230 -s side=0.000020 -s maxits=10000 -s iheight=$iheight -s iwidth=$iwidth \
-f programs/mand.mlr
GENMD_EOF
GENMD-EOF
![pix/mand.png](pix/mand.png)


@ -6,25 +6,25 @@
For example, the right file here has nine records, and the left file should add in the `hostname` column -- so the join output should also have 9 records:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsvlite --opprint cat data/join-u-left.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsvlite --opprint cat data/join-u-right.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsvlite --opprint join -s -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
GENMD_EOF
GENMD-EOF
The issue is that Miller's `join`, by default (before 5.1.0), took input sorted (lexically ascending) by the sort keys on both the left and right files. This design decision was made intentionally to parallel the Unix/Linux system `join` command, which has the same semantics. The benefit of this default is that the joiner program can stream through the left and right files, needing to load neither entirely into memory. The drawback, of course, is that it requires sorted input.
The solution (besides pre-sorting the input files on the join keys) is to simply use **mlr join -u** (which is now the default). This loads the left file entirely into memory (while the right file is still streamed one line at a time) and does all possible joins without requiring sorted input:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
GENMD_EOF
GENMD-EOF
General advice is to make sure the left-file is relatively small, e.g. containing name-to-number mappings, while saving large amounts of data for the right file.
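To see why the left file is the one that should be small, here is a rough Python sketch of an unsorted join of the kind described above: the left side is held in memory, bucketed by the join key, while the right side is streamed. This is not Miller's implementation. The function name `unsorted_join`, the sample records, and the field order of the merged output are all assumptions made for illustration, and only paired records are emitted.

```python
from collections import defaultdict


def unsorted_join(left_records, right_records, key):
    """Yield a merged record for every left/right pair sharing the same key value."""
    # Load the whole left side into memory, bucketed by join-key value.
    buckets = defaultdict(list)
    for left in left_records:
        if key in left:
            buckets[left[key]].append(left)
    # Stream the right side one record at a time; no sorting is needed.
    for right in right_records:
        for left in buckets.get(right.get(key), []):
            merged = dict(left)
            merged.update(right)
            yield merged


# Hypothetical sample data, deliberately not sorted by ipaddr.
left = [
    {"ipaddr": "10.0.0.2", "hostname": "beta"},
    {"ipaddr": "10.0.0.1", "hostname": "alpha"},
]
right = [
    {"ipaddr": "10.0.0.1", "bytes": "100"},
    {"ipaddr": "10.0.0.3", "bytes": "7"},
    {"ipaddr": "10.0.0.2", "bytes": "250"},
]
joined = list(unsorted_join(left, right, "ipaddr"))
for record in joined:
    print(record)
```

Memory use in the sketch grows with the size of the left input only, which is the reason for the advice to keep the left file small.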
@ -32,29 +32,29 @@ General advice is to make sure the left-file is relatively small, e.g. containin
Suppose you have the following two data files:
GENMD_INCLUDE_ESCAPED(data/color-codes.csv)
GENMD-INCLUDE-ESCAPED(data/color-codes.csv)
GENMD_INCLUDE_ESCAPED(data/color-names.csv)
GENMD-INCLUDE-ESCAPED(data/color-names.csv)
Joining on `id`, the results are as expected:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv join -j id -f data/color-codes.csv data/color-names.csv
GENMD_EOF
GENMD-EOF
However, if we ask for left-unpaireds, since there's no `color` column, we get a row not having the same column names as the other:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv join --ul -j id -f data/color-codes.csv data/color-names.csv
GENMD_EOF
GENMD-EOF
To fix this, we can use **unsparsify**:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv join --ul -j id -f data/color-codes.csv \
then unsparsify --fill-with "" \
data/color-names.csv
GENMD_EOF
GENMD-EOF
Thanks to @aborruso for the tip!
@ -64,24 +64,24 @@ See also the [record-heterogeneity page](record-heterogeneity.md).
Suppose we have the following data:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat multi-join/input.csv
GENMD_EOF
GENMD-EOF
And we want to augment the `id` column with lookups from the following data files:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat multi-join/name-lookup.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat multi-join/status-lookup.csv
GENMD_EOF
GENMD-EOF
We can run the input file through multiple `join` commands in a `then`-chain:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint join -f multi-join/name-lookup.csv -j id \
then join -f multi-join/status-lookup.csv -j id \
multi-join/input.csv
GENMD_EOF
GENMD-EOF


@ -6,55 +6,55 @@ Then-chaining found in Miller is intended to function the same as Unix pipes, bu
First, look at the input data:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/then-example.csv
GENMD_EOF
GENMD-EOF
Next, run the first step of your command, omitting anything from the first `then` onward:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/then-example.csv --c2p count-distinct -f Status,Payment_Type
GENMD_EOF
GENMD-EOF
After that, run it with the next `then` step included:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/then-example.csv --c2p count-distinct -f Status,Payment_Type \
then sort -nr count
GENMD_EOF
GENMD-EOF
Now if you use `then` to include another verb after that, the columns `Status`, `Payment_Type`, and `count` will be the input to that verb.
Note, by the way, that you'll get the same results using pipes:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/then-example.csv --csv count-distinct -f Status,Payment_Type \
| mlr --c2p sort -nr count
GENMD_EOF
GENMD-EOF
## NR is not consecutive after then-chaining
Given this input data:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/small
GENMD_EOF
GENMD-EOF
Why don't I see `NR=1` and `NR=2` here?
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small filter '$x > 0.5' then put '$NR = NR'
GENMD_EOF
GENMD-EOF
The reason is that `NR` is computed for the original input records and isn't dynamically updated. By contrast, `NF` is dynamically updated: it's the number of fields in the current record, and if you add/remove a field, the value of `NF` will change:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x=1,y=2,z=3 | mlr put '$nf1 = NF; $u = 4; $nf2 = NF; unset $x,$y,$z; $nf3 = NF'
GENMD_EOF
GENMD-EOF
`NR`, by contrast (and `FNR` as well), retains the value from the original input stream, and records may be dropped by a `filter` within a `then`-chain. To recover consecutive record numbers, you can use out-of-stream variables as follows:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint --from data/small put '
begin{ @nr1 = 0 }
@nr1 += 1;
@ -66,10 +66,10 @@ then put '
@nr2 += 1;
$nr2 = @nr2
'
GENMD_EOF
GENMD-EOF
Or, simply use `mlr cat -n`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr filter '$x > 0.5' then cat -n data/small
GENMD_EOF
GENMD-EOF
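The difference between record numbers assigned at read time and numbers assigned after filtering can be sketched in a few lines of Python. This models the behavior described in this section only loosely; the helper `number_records` and the sample values are made up for illustration.

```python
def number_records(records):
    """Attach a 1-up record number at read time, in the spirit of NR."""
    return [(nr, record) for nr, record in enumerate(records, start=1)]


# Hypothetical input: four records, two of which pass the filter.
records = [{"x": 0.3}, {"x": 0.7}, {"x": 0.2}, {"x": 0.9}]

numbered = number_records(records)
kept = [(nr, record) for nr, record in numbered if record["x"] > 0.5]

# Numbers assigned at read time have gaps once records are dropped ...
original_numbers = [nr for nr, _ in kept]
# ... whereas counting after the filter gives consecutive numbers.
renumbered = [i for i, _ in enumerate(kept, start=1)]

print(original_numbers, renumbered)
```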


@ -4,9 +4,9 @@
Here we can chain together a few simple building blocks:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat expo-sample.sh
GENMD_EOF
GENMD-EOF
Namely:
@ -18,15 +18,15 @@ Namely:
The output is as follows:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
sh expo-sample.sh
GENMD_EOF
GENMD-EOF
## Randomly selecting words from a list
Given this [word list](./data/english-words.txt), first take a look to see what the first few lines look like:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
head data/english-words.txt
a
aa
@ -38,11 +38,11 @@ aardwolf
aba
abac
abaca
GENMD_EOF
GENMD-EOF
Then the following will randomly sample ten words with four to eight characters in them:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from data/english-words.txt --nidx filter -S 'n=strlen($1);4<=n&&n<=8' then sample -k 10
thionine
birchman
@ -54,7 +54,7 @@ askant
aiming
insulant
coinmate
GENMD_EOF
GENMD-EOF
## Randomly generating jabberwocky words
@ -62,7 +62,7 @@ These are simple *n*-grams as [described here](http://johnkerl.org/randspell/ran
The idea is that words from the input file are consumed, then taken apart and pasted back together in ways which imitate the letter-to-letter transitions found in the word list -- giving us automatically generated words in the same vein as *bromance* and *spork*:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --nidx --from ./ngrams/gsl-2000.txt put -q -f ./ngrams/ngfuncs.mlr -f ./ngrams/ng5.mlr
beard
plastinguish
@ -80,4 +80,4 @@ rottendence
lessenger
diffendant
suggestional
GENMD_EOF
GENMD-EOF


@ -127,8 +127,6 @@ If you `mlr csv cat` this, you'll get an error message:
<b>mlr --csv cat data/het/ragged.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a,b,c
1,2,3
mlr : mlr: CSV header/data length mismatch 3 != 2 at filename data/het/ragged.csv row 3.
</pre>


@ -17,35 +17,35 @@ Different kinds of heterogeneous data include _ragged_, _irregular_, and _sparse
A **homogeneous** list of records is one in which all records have _the same keys, in the same order_.
For example, here is a well-formed [CSV file](file-formats.md#csvtsvasvusvetc):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat data/het/hom.csv
GENMD_EOF
GENMD-EOF
It has three records (written here using JSON formatting):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --no-jvstack cat data/het/hom.csv
GENMD_EOF
GENMD-EOF
Here every row has the same keys, in the same order: `a,b,c`.
These are also sometimes called **rectangular** since if we pretty-print them we get a nice rectangle:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint cat data/het/hom.csv
GENMD_EOF
GENMD-EOF
### Fillable data
A second example has some empty cells which could be **filled**:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat data/het/fillable.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --no-jvstack cat data/het/fillable.csv
GENMD_EOF
GENMD-EOF
This example is still homogeneous, though: every row has the same keys, in the same order: `a,b,c`.
Empty values don't make the data heterogeneous.
@ -53,23 +53,23 @@ Empty values don't make the data heterogeneous.
Note however that we can use the [`fill-empty`](reference-verbs.md#fill-empty) verb to make these
values non-empty, if we like:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint fill-empty -v filler data/het/fillable.csv
GENMD_EOF
GENMD-EOF
### Ragged data
Next let's look at non-well-formed CSV files. For a third example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/het/ragged.csv
GENMD_EOF
GENMD-EOF
If you `mlr --csv cat` this, you'll get an error message:
GENMD_RUN_COMMAND_TOLERATING_ERROR
GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr --csv cat data/het/ragged.csv
GENMD_EOF
GENMD-EOF
There are two kinds of raggedness here. Since CSVs form records by zipping the
keys from the header line together with the values from each data line, the
@ -80,18 +80,18 @@ Using the [`--allow-ragged-csv-input` flag](reference-main-flag-list.md#csv-only
we can fill values in too-short rows, and provide a key (column number starting
with 1) for too-long rows:
GENMD_RUN_COMMAND_TOLERATING_ERROR
GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr --icsv --ojson --allow-ragged-csv-input cat data/het/ragged.csv
GENMD_EOF
GENMD-EOF
### Irregular data
Here's another situation -- this file has, in some sense, the "same" data as
our `ragged.csv` example above:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/het/irregular.json
GENMD_EOF
GENMD-EOF
For example, on the second record, `a` is 4, `b` is 5, `c` is 6. But this data
is heterogeneous because the keys `a,b,c` aren't in the same order in each
@ -107,9 +107,9 @@ We can use the [`regularize`](reference-verbs.md#regularize) or
[`sort-within-records`](reference-verbs.md#sort-within-records) verb to order
the keys:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json --no-jvstack regularize data/het/irregular.json
GENMD_EOF
GENMD-EOF
The `regularize` verb tries to re-order subsequent rows to look like the first
(whatever order that is); the `sort-within-records` verb simply uses
@ -121,24 +121,24 @@ record has keys in the order `a,b,c`).
Here's another frequently occurring situation -- quite often, systems will log
data for items which are present, but won't log data for items which aren't.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json cat data/het/sparse.json
GENMD_EOF
GENMD-EOF
This data is called **sparse** (from the [data-storage term](https://en.wikipedia.org/wiki/Sparse_matrix)).
We can use the [`unsparsify`](reference-verbs.md#unsparsify) verb to make sure
every record has the same keys:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json unsparsify data/het/sparse.json
GENMD_EOF
GENMD-EOF
Since this data is now homogeneous (rectangular), it pretty-prints nicely:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint unsparsify data/het/sparse.json
GENMD_EOF
GENMD-EOF
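As a rough mental model of what it means to make sparse records rectangular, here is a short Python sketch. It is not Miller's implementation: the function name `unsparsify` is borrowed from the verb for readability, and the key ordering (first-seen order across the whole input) and the empty-string filler are assumptions of the sketch.

```python
def unsparsify(records, fill_with=""):
    """Return copies of the records in which every record has every key."""
    records = list(records)
    # First pass: collect the union of all keys, in first-seen order.
    all_keys = {}
    for record in records:
        for key in record:
            all_keys.setdefault(key, None)
    # Second pass: rebuild each record over the full key list, filling gaps.
    return [
        {key: record.get(key, fill_with) for key in all_keys}
        for record in records
    ]


# Hypothetical sparse input.
sparse = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]
filled = unsparsify(sparse)
for record in filled:
    print(record)
```

Note that the sketch has to see every record before it can emit the first one, since a key appearing only in the last record still needs a column in the first.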
## Reading and writing heterogeneous data
@ -149,31 +149,31 @@ to transform the data to make it homogeneous.
For these formats, record-heterogeneity comes naturally:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/het/sparse.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --onidx --ofs ' ' cat data/het/sparse.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --oxtab cat data/het/sparse.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --odkvp cat data/het/sparse.json
GENMD_EOF
GENMD-EOF
Even then, we may wish to put like with like, using the [`group-like`](reference-verbs.md#group-like) verb:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --odkvp cat data/het.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --odkvp group-like data/het.json
GENMD_EOF
GENMD-EOF
### Rectangular file formats: CSV and pretty-print
@ -188,23 +188,23 @@ the same way. The difference between CSV and CSV-lite is that the former is
[RFC-4180-compliant](file-formats.md#csvtsvasvusvetc), while the latter readily
handles heterogeneous data (which is non-compliant). For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/het.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint cat data/het.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint group-like data/het.json
GENMD_EOF
GENMD-EOF
Miller handles explicit header changes as just shown. If your CSV input contains ragged data -- if there are implicit header changes (no intervening blank line and new header line) as seen above -- you can use `--allow-ragged-csv-input` (or keystroke-saver `--ragged`).
GENMD_RUN_COMMAND_TOLERATING_ERROR
GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr --csv --ragged cat data/het/ragged.csv
GENMD_EOF
GENMD-EOF
## Processing heterogeneous data
@ -216,10 +216,10 @@ you are sorting on the `count` field then all records in the input stream must
have a `count` field but the other fields can vary, and moreover the sorted-on
field name(s) don't need to be in the same position on each line:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/sort-het.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr sort -n count data/sort-het.dkvp
GENMD_EOF
GENMD-EOF


@ -3,12 +3,12 @@
These are functions in the [Miller programming language](miller-programming-language.md)
that you can call when you use `mlr put` and `mlr filter`. For example, when you type
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv put '
$color = toupper($color);
$shape = gsub($shape, "[aeiou]", "*");
'
GENMD_EOF
GENMD-EOF
the `toupper` and `gsub` bits are _functions_.
@ -35,4 +35,4 @@ say `x+y` so the details for the `+` operator say that its number of arguments
is 2. Unary operators such as `!` and `~` show argument-count of 1; the ternary
`? :` operator shows an argument-count of 3.
GENMD_RUN_CONTENT_GENERATOR(./mk-func-info.rb)
GENMD-RUN-CONTENT-GENERATOR(./mk-func-info.rb)


@ -4,65 +4,65 @@
These are reminiscent of `awk` syntax. They can be used to allow assignments to be done only when appropriate -- e.g. for math-function domain restrictions, regex-matching, and so on:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr cat data/put-gating-example-1.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$x > 0.0 { $y = log10($x); $z = sqrt($y) }' data/put-gating-example-1.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr cat data/put-gating-example-2.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '
$a =~ "([a-z]+)_([0-9]+)" {
$b = "left_\1"; $c = "right_\2"
}' \
data/put-gating-example-2.dkvp
GENMD_EOF
GENMD-EOF
This produces heterogeneous output which Miller, of course, has no problems with (see [Record Heterogeneity](record-heterogeneity.md)). But if you want homogeneous output, the curly braces can be replaced with a semicolon between the expression and the body statements. This causes `put` to evaluate the boolean expression (along with any side effects, namely, regex-captures `\1`, `\2`, etc.) but doesn't use it as a criterion for whether subsequent assignments should be executed. Instead, subsequent assignments are done unconditionally:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put '
$a =~ "([a-z]+)_([0-9]+)";
$b = "left_\1";
$c = "right_\2"
' data/put-gating-example-2.dkvp
GENMD_EOF
GENMD-EOF
Note that pattern-action blocks are just a syntactic variation of if-statements. The following do the same thing:
GENMD_CARDIFY
GENMD-CARDIFY
boolean_condition {
body
}
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY
GENMD-CARDIFY
if (boolean_condition) {
body
}
GENMD_EOF
GENMD-EOF
## If-statements
These are again reminiscent of `awk`. Pattern-action blocks are a special case of `if` with no `elif` or `else` blocks, no `if` keyword, and parentheses optional around the boolean expression:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr put 'NR == 4 {$foo = "bar"}'
GENMD_EOF
GENMD-EOF
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr put 'if (NR == 4) {$foo = "bar"}'
GENMD_EOF
GENMD-EOF
Compound statements use `elif` (rather than `elsif` or `else if`):
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr put '
if (NR == 2) {
...
@ -74,22 +74,22 @@ mlr put '
...
}
'
GENMD_EOF
GENMD-EOF
## While and do-while loops
Miller's `while` and `do-while` are unsurprising in comparison to various languages, as are `break` and `continue`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x=1,y=2 | mlr put '
while (NF < 10) {
$[NF+1] = ""
}
$foo = "bar"
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x=1,y=2 | mlr put '
do {
$[NF+1] = "";
@ -99,7 +99,7 @@ echo x=1,y=2 | mlr put '
} while (NF < 10);
$foo = "bar"
'
GENMD_EOF
GENMD-EOF
A `break` or `continue` within nested conditional blocks or if-statements will,
of course, propagate to the innermost loop enclosing them, if any. A `break` or
@ -128,16 +128,16 @@ As with `while` and `do-while`, a `break` or `continue` within nested control st
For [maps](reference-main-maps.md), the single variable is always bound to the *key* of key-value pairs:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small put -q '
print "NR = ".NR;
for (e in $*) {
print " key:", e, "value:", $[e];
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put -q '
end {
o = {"a":1, "b":{"c":3}};
@ -146,13 +146,13 @@ mlr -n put -q '
}
}
'
GENMD_EOF
GENMD-EOF
Note that the value corresponding to a given key may be gotten through a **computed field name** using square brackets as in `$[e]` for stream records, or by indexing the looped-over variable using square brackets.
For [arrays](reference-main-arrays.md), the single variable is always bound to the *value* (not the array index):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put -q '
end {
o = [10, "20", {}, "four", true];
@ -161,7 +161,7 @@ mlr -n put -q '
}
}
'
GENMD_EOF
GENMD-EOF
### Key-value for-loops
@ -171,11 +171,11 @@ variable is the (1-up) array index and the second is the value.
Single-level keys may be gotten at using either `for(k,v)` or `for((k),v)`; multi-level keys may be gotten at using `for((k1,k2,k3),v)` and so on. The `v` variable will be bound to a scalar value (non-array/non-map) if the map stops at that level, or to a map-valued or array-valued variable if the map goes deeper. If the map isn't deep enough then the loop body won't be executed.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/for-srec-example.tbl
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --pprint --from data/for-srec-example.tbl put '
$sum1 = $f1 + $f2 + $f3;
$sum2 = 0;
@ -187,17 +187,17 @@ mlr --pprint --from data/for-srec-example.tbl put '
}
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small --opprint put 'for (k,v in $*) { $[k."_type"] = typeof(v) }'
GENMD_EOF
GENMD-EOF
Note that the value of the current field in the for-loop can be gotten either using the bound variable `value`, or through a **computed field name** using square brackets as in `$[key]`.
Important note: to avoid inconsistent looping behavior in case you're setting new fields (and/or unsetting existing ones) while looping over the record, **Miller makes a copy of the record before the loop: loop variables are bound from the copy and all other reads/writes involve the record itself**:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small --opprint put '
$sum1 = 0;
$sum2 = 0;
@ -208,11 +208,11 @@ mlr --from data/small --opprint put '
}
}
'
GENMD_EOF
GENMD-EOF
It can be confusing to modify the stream record while iterating over a copy of it, so instead you might find it simpler to use a local variable in the loop and only update the stream record after the loop:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small --opprint put '
sum = 0;
for (k,v in $*) {
@ -222,15 +222,15 @@ mlr --from data/small --opprint put '
}
$sum = sum
'
GENMD_EOF
GENMD-EOF
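The copy-before-loop behavior described above can be imitated in Python, which may help if the idea is unfamiliar. This is only an analogy under stated assumptions: the function name `add_copies_and_sum` and the sample record are made up, and a Python dict stands in for the stream record.

```python
def add_copies_and_sum(record):
    """Loop over a snapshot of `record` while writing new fields to the live record."""
    snapshot = dict(record)  # shallow copy taken once, before the loop starts
    total = 0
    for key, value in snapshot.items():
        if isinstance(value, (int, float)):
            total += value
            # This write goes to the live record; the loop never visits it
            # because the loop walks the snapshot.
            record[key + "_copy"] = value
    # Update the record with the accumulated local value only after the loop.
    record["sum"] = total
    return record


result = add_copies_and_sum({"a": "pan", "x": 1, "y": 2})
print(result)
```

Without the snapshot, the fields added inside the loop would change the very thing being looped over, which is the inconsistency the copy is there to avoid.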
You can also start iterating on sub-maps of an out-of-stream or local variable; you can loop over nested keys; you can loop over all out-of-stream variables. The bound variables are bound to a copy of the sub-map as it was before the loop started. The sub-map is specified by square-bracketed indices after `in`, and additional deeper indices are bound to loop key-variables. The terminal values are bound to the loop value-variable whenever the keys are not too shallow. The value-variable may refer to a terminal (string, number) or it may be map-valued if the map goes deeper. Example indexing is as follows:
GENMD_INCLUDE_ESCAPED(data/for-oosvar-example-0a.txt)
GENMD-INCLUDE-ESCAPED(data/for-oosvar-example-0a.txt)
That's confusing in the abstract, so a concrete example is in order. Suppose the out-of-stream variable `@myvar` is populated as follows:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put --jknquoteint -q '
begin {
@myvar = {
@ -241,11 +241,11 @@ mlr -n put --jknquoteint -q '
}
end { dump }
'
GENMD_EOF
GENMD-EOF
Then we can get at various values as follows:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put --jknquoteint -q '
begin {
@myvar = {
@ -262,9 +262,9 @@ mlr -n put --jknquoteint -q '
}
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put --jknquoteint -q '
begin {
@myvar = {
@ -282,9 +282,9 @@ mlr -n put --jknquoteint -q '
}
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put --jknquoteint -q '
begin {
@myvar = {
@ -302,13 +302,13 @@ mlr -n put --jknquoteint -q '
}
}
'
GENMD_EOF
GENMD-EOF
### C-style triple-for loops
These are supported as follows:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small --opprint put '
num suma = 0;
for (a = 1; a <= NR; a += 1) {
@ -316,9 +316,9 @@ mlr --from data/small --opprint put '
}
$suma = suma;
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small --opprint put '
num suma = 0;
num sumb = 0;
@ -329,7 +329,7 @@ mlr --from data/small --opprint put '
$suma = suma;
$sumb = sumb;
'
GENMD_EOF
GENMD-EOF
Notes:
@ -347,35 +347,35 @@ Notes:
Miller supports an `awk`-like `begin/end` syntax. The statements in the `begin` block are executed before any input records are read; the statements in the `end` block are executed after the last input record is read. (If you want to execute some statement at the start of each file, not at the start of the first file as with `begin`, you might use a pattern/action block of the form `FNR == 1 { ... }`.) All statements outside of `begin` or `end` are, of course, executed on every input record. Semicolons separate statements inside or outside of begin/end blocks; semicolons are required between begin/end block bodies and any subsequent statement. For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '
begin { @x_sum = 0 };
@x_sum += $x;
end { emit @x_sum }
' ./data/small
GENMD_EOF
GENMD-EOF
Since uninitialized out-of-stream variables default to 0 for addition/subtraction and 1 for multiplication when they appear on expression right-hand sides (not quite as in `awk`, where they'd default to 0 either way), the above can be written more succinctly as
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '
@x_sum += $x;
end { emit @x_sum }
' ./data/small
GENMD_EOF
GENMD-EOF
The **put -q** option suppresses printing of each output record, with only `emit` statements being output. So to get only summary outputs, you could write
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '
@x_sum += $x;
end { emit @x_sum }
' ./data/small
GENMD_EOF
GENMD-EOF
We can do similarly with multiple out-of-stream variables:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '
@x_count += 1;
@x_sum += $x;
@ -384,13 +384,13 @@ mlr put -q '
emit @x_sum;
}
' ./data/small
GENMD_EOF
GENMD-EOF
This is of course (see also [here](reference-dsl.md#verbs-compared-to-dsl)) not much different than
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr stats1 -a count,sum -f x ./data/small
GENMD_EOF
GENMD-EOF
Note that it's a syntax error for begin/end blocks to refer to field names (beginning with `$`), since begin/end blocks execute outside the context of input records.


@ -28,16 +28,16 @@ semicolon where one is needed . The parser tries to remind you about semicolons
whenever there's a chance a missing semicolon might be involved in a parse
error.
GENMD_RUN_COMMAND_TOLERATING_ERROR
GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr --csv --from example.csv put -q '
begin {
@count = 0 # No semicolon required -- before closing curly brace
}
$x=1 # No semicolon required -- at end of expression
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND_TOLERATING_ERROR
GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr --csv --from example.csv put -q '
begin {
@count = 0 # No semicolon required -- before closing curly brace
@ -45,7 +45,7 @@ mlr --csv --from example.csv put -q '
$x=1 # Needs a semicolon after it
$y=2 # No semicolon required -- at end of expression
'
GENMD_EOF
GENMD-EOF
## elif
@ -56,39 +56,39 @@ Miller has [`elif`](reference-dsl-control-structures.md#if-statements), not `els
Miller is simple-minded about scoping [local variables](reference-dsl-variables.md#local-variables) to blocks.
If you have
GENMD_CARDIFY
GENMD-CARDIFY
if (something) {
x = 1
} else {
x = 2
}
GENMD_EOF
GENMD-EOF
then there are two `x` variables, each confined to its own enclosing curly
braces; there is no hoisting out of the `if` and `else` blocks.
A suggestion is
GENMD_CARDIFY
GENMD-CARDIFY
var x
if (something) {
x = 1
} else {
x = 2
}
GENMD_EOF
GENMD-EOF
## Required curly braces
Bodies for all compound statements must be enclosed in curly braces, even if the body is a single statement:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr ... put 'if ($x == 1) $y = 2' # Syntax error
GENMD_EOF
GENMD-EOF
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr ... put 'if ($x == 1) { $y = 2 }' # This is OK
GENMD_EOF
GENMD-EOF
## No autoconvert to boolean
@ -105,7 +105,7 @@ As discussed on the [arithmetic page](reference-main-arithmetic.md) the sum, dif
Likewise, while quotient and remainder are generally pythonic, the quotient and exponentiation of two integers is an integer when possible.
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
$ mlr repl -q
[mlr] 6/2
3
@ -124,7 +124,7 @@ int
[mlr] typeof(7**80)
float
GENMD_EOF
GENMD-EOF
## Print adds spaces around multiple arguments
@ -133,14 +133,14 @@ As seen in the previous example,
comma-delimited arguments fills in intervening spaces for you. If you want to
avoid this, use the dot operator for string-concatenation instead.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put -q '
end {
print "[", "a", "b", "c", "]";
print "[" . "a" . "b" . "c" . "]";
}
'
GENMD_EOF
GENMD-EOF
Similarly, a final newline is printed for you; use [`printn`](reference-dsl-output-statements.md#print-statements) to avoid this.
@ -183,11 +183,11 @@ Arrays and strings are indexed starting with 1, not 0. This is discussed in
detail on the [arrays page](reference-main-arrays.md) and the [strings
page](reference-main-strings.md).
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from data/short.csv cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from data/short.csv put -q '
@records[NR] = $*;
end {
@ -196,7 +196,7 @@ mlr --csv --from data/short.csv put -q '
}
}
'
GENMD_EOF
GENMD-EOF
Also, slices for arrays and strings are _doubly inclusive_: `x[3:5]` gets you
elements 3, 4, and 5 of the array or string named `x`.
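If you are used to 0-up, half-open slices, the following Python sketch shows the index arithmetic for translating a 1-up, doubly inclusive slice. The helper name `one_up_inclusive_slice` is hypothetical, and the sketch only covers positive, in-bounds indices; it makes no claim about how Miller treats negative or out-of-bounds indices.

```python
def one_up_inclusive_slice(seq, m, n):
    """Return elements m through n of seq, counting from 1 and including both ends."""
    if not (1 <= m <= n <= len(seq)):
        raise ValueError("this sketch only handles 1 <= m <= n <= len(seq)")
    # 1-up inclusive [m:n] corresponds to 0-up half-open [m-1:n].
    return seq[m - 1 : n]


print(one_up_inclusive_slice([10, 20, 30, 40, 50, 60], 3, 5))
print(one_up_inclusive_slice("abcdefg", 3, 5))
```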


@ -2,20 +2,20 @@
You can use the `filter` DSL keyword within the `put` verb. In fact, the following two are synonymous:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv filter 'NR==2 || NR==3' example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv put 'filter NR==2 || NR==3' example.csv
GENMD_EOF
GENMD-EOF
The former, of course, is a little easier to type. For another example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv put '@running_sum += $quantity; filter @running_sum > 500' example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv filter '@running_sum += $quantity; @running_sum > 500' example.csv
GENMD_EOF
GENMD-EOF


@ -33,7 +33,7 @@ A perhaps helpful analogy: the `select` function is to arrays and maps as the
Array examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_array = [2, 9, 10, 3, 1, 4, 5, 8, 7, 6];
@ -51,11 +51,11 @@ mlr -n put '
print;
}
'
GENMD_EOF
GENMD-EOF
Map examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_map = {"cubit": 823, "dale": 13, "apple": 199, "ember": 191, "bottle": 107};
@ -71,7 +71,7 @@ mlr -n put '
print select(my_map, func (k,v) { return v % 10 >= 5});
}
'
GENMD_EOF
GENMD-EOF
## apply
@ -88,7 +88,7 @@ A perhaps helpful analogy: the `apply` function is to arrays and maps as the
Array examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_array = [2, 9, 10, 3, 1, 4, 5, 8, 7, 6];
@ -108,9 +108,9 @@ mlr -n put '
print sort(apply(my_array, func(e) { return e**3 }));
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_map = {"cubit": 823, "dale": 13, "apple": 199, "ember": 191, "bottle": 107};
@ -130,7 +130,7 @@ mlr -n put '
print sort(apply(my_map, func(k,v) { return {toupper(k): v**3} }));
}
'
GENMD_EOF
GENMD-EOF
## reduce
@ -146,7 +146,7 @@ accumulator.
The start value for the accumulator is the first element for arrays, or the
first element's key-value pair for maps.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_array = [2, 9, 10, 3, 1, 4, 5, 8, 7, 6];
@ -175,9 +175,9 @@ mlr -n put '
print reduce(my_array, func (acc,e) { return acc. "," . e });
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_map = {"cubit": 823, "dale": 13, "apple": 199, "ember": 191, "bottle": 107};
@ -209,7 +209,7 @@ mlr -n put '
print reduce(my_map, func (acck,accv,ek,ev) { return {"joined": accv . "," . ev }});
}
'
GENMD_EOF
GENMD-EOF
## fold
@ -218,7 +218,7 @@ The [`fold`](reference-dsl-builtin-functions.md#fold) function is the same as
taken from the first entry of the array/map, you specify it as the third
argument.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_array = [2, 9, 10, 3, 1, 4, 5, 8, 7, 6];
@ -239,9 +239,9 @@ mlr -n put '
print fold(my_array, func (acc,e) { return acc + e }, 1000000);
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_map = {"cubit": 823, "dale": 13, "apple": 199, "ember": 191, "bottle": 107};
@ -265,7 +265,7 @@ mlr -n put '
print fold(my_map, func (acck,accv,ek,ev) { return {"sum": accv + ev} }, {"sum": 1000000});
}
'
GENMD_EOF
GENMD-EOF
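As a rough cross-language analogy (a Python sketch, not Miller code), the difference between `reduce` and `fold` resembles calling `functools.reduce` without and with an initial value. The array mirrors `my_array` above; the variable names are illustrative only.

```python
from functools import reduce

my_array = [2, 9, 10, 3, 1, 4, 5, 8, 7, 6]

# Like Miller's reduce: the accumulator starts as the first element.
sum_reduce = reduce(lambda acc, e: acc + e, my_array)

# Like Miller's fold: the accumulator starts at an explicitly supplied value.
sum_fold = reduce(lambda acc, e: acc + e, my_array, 1000000)

print(sum_reduce)
print(sum_fold)
```

The only difference between the two calls is where the accumulator's start value comes from.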
## sort
@ -288,7 +288,7 @@ values.
Array examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_array = [2, 9, 10, 3, 1, 4, 5, 8, 7, 6];
@ -307,11 +307,11 @@ mlr -n put '
print sort(my_array, func (a,b) { return b <=> a });
}
'
GENMD_EOF
GENMD-EOF
Map examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
my_map = {"cubit": 823, "dale": 13, "apple": 199, "ember": 191, "bottle": 107};
@ -338,7 +338,7 @@ mlr -n put '
print sort(my_map, func(ak,av,bk,bv) { return bv <=> av });
}
'
GENMD_EOF
GENMD-EOF
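For readers more familiar with other languages, the comparator-function style above can be sketched in Python. This is an analogy under stated assumptions, not Miller's implementation: the `spaceship` helper is a hypothetical stand-in for a three-way comparison such as `<=>`, and `cmp_to_key` adapts it to Python's key-based sort.

```python
from functools import cmp_to_key

def spaceship(a, b):
    """Return -1, 0, or 1 according to whether a is less than, equal to, or greater than b."""
    return (a > b) - (a < b)

my_array = [2, 9, 10, 3, 1, 4, 5, 8, 7, 6]
# Swapping the arguments reverses the order, as with `b <=> a` above.
descending = sorted(my_array, key=cmp_to_key(lambda a, b: spaceship(b, a)))

my_map = {"cubit": 823, "dale": 13, "apple": 199, "ember": 191, "bottle": 107}
# Each item is a (key, value) pair; compare on the values, descending.
by_value_desc = dict(
    sorted(my_map.items(), key=cmp_to_key(lambda x, y: spaceship(y[1], x[1])))
)

print(descending)
print(list(by_value_desc))
```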
Please see the [sorting page](sorting.md) for more examples.
@ -346,36 +346,36 @@ Please see the [sorting page](sorting.md) for more examples.
This is a way to do a logical OR/AND, respectively, of several boolean expressions, without the explicit `||`/`&&` and without a `for`-loop. This is a keystroke-saving convenience.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p cat example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv filter 'any({"color":"red","shape":"square"}, func(k,v) {return $[k] == v})'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv filter 'every({"color":"red","shape":"square"}, func(k,v) {return $[k] == v})'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put '$is_red_square = every({"color":"red","shape":"square"}, func(k,v) {return $[k] == v})'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv filter 'any([16,51,61,64], func(e) {return $index == e})'
GENMD_EOF
GENMD-EOF
This last example could also be done using a map:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv filter '
begin {
@indices = {16:true, 51:true, 61:true, 64:true};
}
@indices[$index] == true;
'
GENMD_EOF
GENMD-EOF
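The same logic can be sketched in Python, where `any` and `all` play the roles of `any` and `every`. This is only an analogy: the `record` below is a made-up stand-in for one input record, not a row taken from `example.csv`.

```python
# Hypothetical record standing in for one row of input.
record = {"color": "red", "shape": "circle", "index": 16}
wanted = {"color": "red", "shape": "square"}

# Logical OR across the key-value tests.
matches_any = any(record[k] == v for k, v in wanted.items())

# Logical AND across the key-value tests.
matches_every = all(record[k] == v for k, v in wanted.items())

# Membership test, first as a scan over a list, then as a keyed lookup.
in_list = any(record["index"] == e for e in [16, 51, 61, 64])
in_set = record["index"] in {16, 51, 61, 64}

print(matches_any, matches_every, in_list, in_set)
```

The keyed lookup corresponds to the map-based variant: it avoids scanning the candidates for every record.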
## Combined examples
@ -383,11 +383,11 @@ Using a paradigm from the [page on operating on all
records](operating-on-all-records.md), we can retain a column from the input
data as an array, then apply some higher-order functions to it:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p cat example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put -q '
begin {
@indexes = [] # So auto-extend will make an array, not a map
@ -420,7 +420,7 @@ mlr --c2p --from example.csv put -q '
)
}
'
GENMD_EOF
GENMD-EOF
## Caveats
@ -428,21 +428,21 @@ GENMD_EOF
Coming from other languages, it's easy to accidentally write
GENMD_RUN_COMMAND_TOLERATING_ERROR
GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr -n put 'end { print select([1,2,3,4,5], func (e) { e >= 3 })}'
GENMD_EOF
GENMD-EOF
instead of
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end { print select([1,2,3,4,5], func (e) { return e >= 3 })}'
GENMD_EOF
GENMD-EOF
### No IIFEs
As of September 2021, immediately invoked function expressions (IIFEs) are not part of the Miller DSL's grammar. For example, this doesn't work yet:
GENMD_RUN_COMMAND_TOLERATING_ERROR
GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr -n put '
end {
x = 3;
@ -450,11 +450,11 @@ mlr -n put '
print y;
}
'
GENMD_EOF
GENMD-EOF
but this does:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = 3;
@ -463,7 +463,7 @@ mlr -n put '
print y;
}
'
GENMD_EOF
GENMD-EOF
### Built-in functions currently unsupported as arguments
@ -474,7 +474,7 @@ be used directly as arguments to higher-order functions.
For example, this doesn't work yet:
GENMD_RUN_COMMAND_TOLERATING_ERROR
GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr -n put '
end {
notches = [0,1,2,3];
@ -483,11 +483,11 @@ mlr -n put '
print cosines;
}
'
GENMD_EOF
GENMD-EOF
but this does:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
notches = [0,1,2,3];
@ -497,4 +497,4 @@ mlr -n put '
print cosines;
}
'
GENMD_EOF
GENMD-EOF


@ -54,45 +54,45 @@ The main use for the `.` operator is for string concatenation: `"abc" . "def"` i
However, in Miller 6 it has optional use for map traversal. Example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/server-log.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json --from data/server-log.json put -q '
print $req["headers"]["host"];
print $req.headers.host;
'
GENMD_EOF
GENMD-EOF
This also works on the left-hand sides of assignment statements:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json --from data/server-log.json put '
$req.headers.host = "UPDATED";
'
GENMD_EOF
GENMD-EOF
A few caveats:
* This is why `.` has higher precedence than `+` in the table above -- in Miller 5 and below, where `.` was only used for concatenation, it had the same precedence as `+`. So you can now do this:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json --from data/server-log.json put -q '
print $req.id + $res.status_code
'
GENMD_EOF
GENMD-EOF
* However (awkwardly), if you want to use `.` for map-traversal as well as string-concatenation in the same statement, you'll need to insert parentheses, as the default associativity is left-to-right:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json --from data/server-log.json put -q '
print $req.method . " -- " . $req.path
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json --from data/server-log.json put -q '
print ($req.method) . " -- " . ($req.path)
'
GENMD_EOF
GENMD-EOF


@ -246,7 +246,7 @@ etc., to control the format of the output if the output is redirected. See also
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern", @a, @b, @c'
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern > mytap.dat", @a, @b, @c'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
</pre>
<pre class="pre-highlight-in-pair">
@ -280,7 +280,7 @@ etc., to control the format of the output if the output is redirected. See also
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
</pre>
<pre class="pre-highlight-in-pair">
@ -315,7 +315,7 @@ etc., to control the format of the output if the output is redirected. See also
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
</pre>
## Emit statements


@ -28,15 +28,15 @@ The `print` statement is perhaps self-explanatory, but with a few light caveats:
* You can redirect print output to a file:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from myfile.dat put 'print > "tap.txt", $x'
GENMD_EOF
GENMD-EOF
* You can redirect print output to multiple files, split by values present in various records:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from myfile.dat put 'print > $a.".txt", $x'
GENMD_EOF
GENMD-EOF
See also [Redirected-output statements](reference-dsl-output-statements.md#redirected-output-statements) for examples.
@ -62,34 +62,34 @@ Records produced by a `mlr put` go downstream to the next verb in your `then`-ch
The syntax is, by example:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from myfile.dat put 'tee > "tap.dat", $*' then sort -n index
GENMD_EOF
GENMD-EOF
First is `tee >`, then the filename expression (which can be an expression such as `"tap.".$a.".dat"`), then a comma, then `$*`. (Nothing else but `$*` is teeable.)
You can also write to a variable file name -- for example, you can split a
single file into multiple ones on field names:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat circle.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat square.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat triangle.csv
GENMD_EOF
GENMD-EOF
See also [Redirected-output statements](reference-dsl-output-statements.md#redirected-output-statements) for examples.
@ -101,33 +101,33 @@ Details:
* The `print` and `dump` keywords produce output immediately to standard output, or to specified file(s) or pipe-to command if present.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help keyword print
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help keyword dump
GENMD_EOF
GENMD-EOF
* `mlr put` sends the current record (possibly modified by the `put` expression) to the output record stream. Records are then input to the following verb in a `then`-chain (if any), else printed to standard output (unless `put -q`). The **tee** keyword *additionally* writes the output record to specified file(s) or pipe-to command, or immediately to `stdout`/`stderr`.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help keyword tee
GENMD_EOF
GENMD-EOF
* `mlr put`'s `emitf`, `emitp`, and `emit` send out-of-stream variables to the output record stream. These are then input to the following verb in a `then`-chain (if any), else printed to standard output. When redirected with `>`, `>>`, or `|`, they *instead* write the out-of-stream variable(s) to specified file(s) or pipe-to command, or immediately to `stdout`/`stderr`.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help keyword emitf
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help keyword emitp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help keyword emit
GENMD_EOF
GENMD-EOF
## Emit statements
@ -142,108 +142,108 @@ You can emit any map-valued expression, including `$*`, map-valued out-of-stream
Use **emitf** to output several out-of-stream variables side-by-side in the same output record. For `emitf` these mustn't have indexing using `@name[...]`. Example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '
@count += 1;
@x_sum += $x;
@y_sum += $y;
end { emitf @count, @x_sum, @y_sum}
' data/small
GENMD_EOF
GENMD-EOF
Use **emit** to output an out-of-stream variable. If it's non-indexed you'll get a simple key-value pair:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum += $x; end { dump }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum += $x; end { emit @sum }' data/small
GENMD_EOF
GENMD-EOF
If it's indexed then use as many names after `emit` as there are indices:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a] += $x; end { dump }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a] += $x; end { emit @sum, "a" }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a", "b" }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b][$i] += $x; end { dump }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '
@sum[$a][$b][$i] += $x;
end { emit @sum, "a", "b", "i" }
' data/small
GENMD_EOF
GENMD-EOF
Now for **emitp**: if you have as many names following `emit` as there are levels in the out-of-stream variable's map, then `emit` and `emitp` do the same thing. Where they differ is when you don't specify as many names as there are map levels. In this case, Miller needs to flatten multiple map indices down to output-record keys: `emitp` includes full prefixing (hence the `p` in `emitp`) while `emit` takes the deepest map key as the output-record key:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a" }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emit @sum }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
GENMD_EOF
GENMD-EOF
Use **--flatsep** to specify the character which joins multilevel
keys for `emitp` (it defaults to a colon):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --flatsep / put -q '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --flatsep / put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --flatsep / --oxtab put -q '
@sum[$a][$b] += $x;
end { emitp @sum }
' data/small
GENMD_EOF
GENMD-EOF
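The prefixed key names can be pictured with a short Python sketch. It illustrates only the key-joining idea described above, not how Miller builds its output records. The function name and the sample values are hypothetical, and the colon default simply follows the text above.

```python
def flatten_prefixed(name, value, sep=":"):
    """Flatten nested dicts into one flat dict whose keys carry the full path."""
    if not isinstance(value, dict):
        return {name: value}
    flat = {}
    for key, child in value.items():
        # Each level's key is appended to the prefix built so far.
        flat.update(flatten_prefixed(name + sep + key, child, sep))
    return flat

# Made-up sums keyed by two levels, in the shape of @sum[$a][$b].
sums = {"pan": {"pan": 0.25, "wye": 0.5}, "eks": {"pan": 0.75}}

with_colon = flatten_prefixed("sum", sums)
with_slash = flatten_prefixed("sum", sums, sep="/")

print(with_colon)
print(with_slash)
```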
## Multi-emit statements
You can emit **multiple map-valued expressions side-by-side** by
including their names in parentheses:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/medium --opprint put -q '
@x_count[$a][$b] += 1;
@x_sum[$a][$b] += $x;
@ -254,7 +254,7 @@ mlr --from data/medium --opprint put -q '
emit (@x_sum, @x_count, @x_mean), "a", "b"
}
'
GENMD_EOF
GENMD-EOF
What this does is walk through the first out-of-stream variable (`@x_sum` in this example) as usual, then for each keylist found (e.g. `pan,wye`), include the values for the remaining out-of-stream variables (here, `@x_count` and `@x_mean`). You should use this when all out-of-stream variables in the emit statement have **the same shape and the same keylists**.
@ -262,27 +262,27 @@ What this does is walk through the first out-of-stream variable (`@x_sum` in thi
Use **emit all** (or `emit @*` which is synonymous) to output all out-of-stream variables. You can use the following idiom to get various accumulators output side-by-side (reminiscent of `mlr stats1`):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
@v[$a][$b]["sum"] += $x;
@v[$a][$b]["count"] += 1;
end{emit @*,"a","b"}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
@sum[$a][$b] += $x;
@count[$a][$b] += 1;
end{emit @*,"a","b"}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
@sum[$a][$b] += $x;
@count[$a][$b] += 1;
end{emit (@sum, @count),"a","b"}
'
GENMD_EOF
GENMD-EOF


@ -4,13 +4,13 @@
Multiple expressions may be given, separated by semicolons, and each may refer to the ones before:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
ruby -e '10.times{|i|puts "i=#{i}"}' | mlr --opprint put '$j = $i + 1; $k = $i +$j'
GENMD_EOF
GENMD-EOF
Newlines within the expression are ignored, which can help increase legibility of complex expressions:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put '
$nf = NF;
$nr = NR;
@ -18,46 +18,46 @@ mlr --opprint put '
$filenum = FILENUM;
$filename = FILENAME
' data/small data/small2
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint filter '($x > 0.5 && $y < 0.5) || ($x < 0.5 && $y > 0.5)' \
then stats2 -a corr -f x,y \
data/medium
GENMD_EOF
GENMD-EOF
## Expressions from files
The simplest way to enter expressions for `put` and `filter` is between single quotes on the command line (see also [here](miller-on-windows.md) for Windows). For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small put '$xy = sqrt($x**2 + $y**2)'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small put 'func f(a, b) { return sqrt(a**2 + b**2) } $xy = f($x, $y)'
GENMD_EOF
GENMD-EOF
You may, though, find it convenient to put expressions into files for reuse, and read them
**using the -f option**. For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/fe-example-3.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small put -f data/fe-example-3.mlr
GENMD_EOF
GENMD-EOF
If you have some of the logic in a file and you want to write the rest on the command line, you can **use the -f and -e options together**:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/fe-example-4.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small put -f data/fe-example-4.mlr -e '$xy = f($x, $y)'
GENMD_EOF
GENMD-EOF
A suggested use-case here is defining functions in files, and calling them from command-line expressions.
@ -69,25 +69,25 @@ Moreover, you can have one or more `-f` expressions (maybe one function per file
Miller uses **semicolons as statement separators**, not statement terminators. This means you can write:
GENMD_INCLUDE_ESCAPED(data/semicolon-example.txt)
GENMD-INCLUDE-ESCAPED(data/semicolon-example.txt)
Semicolons are optional after closing curly braces (which close conditionals and loops as discussed below).
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x=1,y=2 | mlr put 'while (NF < 10) { $[NF+1] = ""} $foo = "bar"'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x=1,y=2 | mlr put 'while (NF < 10) { $[NF+1] = ""}; $foo = "bar"'
GENMD_EOF
GENMD-EOF
Semicolons are required between statements even if those statements are on separate lines. **Newlines** are for your convenience but have no syntactic meaning: line endings do not terminate statements. For example, adjacent assignment statements must be separated by semicolons even if those statements are on separate lines:
GENMD_INCLUDE_ESCAPED(data/newline-example.txt)
GENMD-INCLUDE-ESCAPED(data/newline-example.txt)
**Trailing commas** are allowed in function/subroutine definitions, function/subroutine callsites, and map literals. This is intended for (although not restricted to) the multi-line case:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csvlite --from data/a.csv put '
func f(
num a,
@ -105,21 +105,21 @@ mlr --csvlite --from data/a.csv put '
"v": NR,
}
'
GENMD_EOF
GENMD-EOF
Bodies for all compound statements must be enclosed in **curly braces**, even if the body is a single statement:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr put 'if ($x == 1) $y = 2' # Syntax error
GENMD_EOF
GENMD-EOF
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr put 'if ($x == 1) { $y = 2 }' # This is OK
GENMD_EOF
GENMD-EOF
Bodies for compound statements may be empty:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr put 'if ($x == 1) { }' # This no-op is syntactically acceptable
GENMD_EOF
GENMD-EOF


@ -31,17 +31,17 @@ seconds, are common in some contexts, particulary JavaScript. If you ever
(anywhere) see a timestamp for the year 49,000-something -- probably someone is
treating epoch-milliseconds as epoch-seconds.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end {
print sec2gmt(1500000000);
print sec2gmt(1500000000000);
}'
GENMD_EOF
GENMD-EOF
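The arithmetic behind that year-49,000 symptom can be checked with a small Python sketch. It is independent of Miller: `approx_year` is a hypothetical helper that uses the mean Gregorian year length, so it gives only an approximate year.

```python
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365.2425 * 86400  # mean Gregorian year, in seconds

def approx_year(epoch_seconds):
    """Approximate calendar year for a count of seconds since 1970."""
    return 1970 + epoch_seconds / SECONDS_PER_YEAR

ts_seconds = 1500000000
ts_millis = ts_seconds * 1000  # the same instant, expressed in milliseconds

# Interpreted correctly, as seconds:
as_gmt = datetime.fromtimestamp(ts_seconds, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(as_gmt)

# Misinterpreted: milliseconds treated as if they were seconds.
print(int(approx_year(ts_millis)))
```

A factor of 1000 on roughly 47.5 years since 1970 lands about 47,500 years in the future, which is why the mistake is easy to recognize.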
You can get the current system time, as epoch-seconds, using the
[systime](reference-dsl-builtin-functions.md#systime) DSL function:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --c2p --from example.csv put '$t = systime()'
color shape flag k index quantity rate t
yellow triangle true 1 11 43.6498 9.8870 1634784588.045347
@ -54,7 +54,7 @@ purple triangle false 7 65 80.1405 5.8240 1634784588.045418
yellow circle true 8 73 63.9785 4.2370 1634784588.045419
yellow circle true 9 87 63.5058 8.3350 1634784588.045421
purple square false 10 91 72.3735 8.2430 1634784588.045422
GENMD_EOF
GENMD-EOF
The [systimeint](reference-dsl-builtin-functions.md#systimeint) DSL function
is nothing more than a keystroke-saver for `int(systime())`.
@ -72,7 +72,7 @@ You can get these from epoch-seconds using the
(Note that the terms _UTC_ and _GMT_ are used interchangeably in Miller.)
We also have the [sec2gmtdate](reference-dsl-builtin-functions.md#sec2gmtdate) DSL function.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end {
print sec2gmt(0);
print sec2gmt(1234567890.123);
@ -82,7 +82,7 @@ mlr -n put 'end {
print sec2gmtdate(1234567890.123);
print sec2gmtdate(-1234567890.123);
}'
GENMD_EOF
GENMD-EOF
# Local times with standard format; specifying timezones
@ -101,20 +101,20 @@ You can specify the timezone using any of the following:
Regardless, if you specify an invalid timezone, you'll be clearly notified:
GENMD_RUN_COMMAND_TOLERATING_ERROR
GENMD-RUN-COMMAND-TOLERATING-ERROR
mlr --from example.csv --tz This/Is/A/Typo cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
export TZ=Asia/Istanbul
mlr -n put 'end { print sec2localtime(0) }'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --tz America/Sao_Paulo -n put 'end { print sec2localtime(0) }'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end {
ENV["TZ"] = "Asia/Istanbul";
print sec2localtime(0);
@ -126,9 +126,9 @@ mlr -n put 'end {
print sec2localdate(0);
print localtime2sec("2000-01-02 03:04:05");
}'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end {
print sec2localtime(0, 0, "Asia/Istanbul");
print sec2localdate(0, "Asia/Istanbul");
@ -138,7 +138,7 @@ mlr -n put 'end {
print sec2localdate(0, "America/Sao_Paulo");
print localtime2sec("2000-01-02 03:04:05", "America/Sao_Paulo");
}'
GENMD_EOF
GENMD-EOF
Note that for local times, Miller omits the `T` and the `Z` you see in GMT times.
@ -146,22 +146,22 @@ We also have the
[gmt2localtime](reference-dsl-builtin-functions.md#gmt2localtime) and
[localtime2gmt](reference-dsl-builtin-functions.md#localtime2gmt) convenience functions:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end {
ENV["TZ"] = "Asia/Istanbul";
print gmt2localtime("1970-01-01T00:00:00Z");
print localtime2gmt("1970-01-01 00:00:00");
}'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end {
print gmt2localtime("1970-01-01T00:00:00Z", "America/Sao_Paulo");
print gmt2localtime("1970-01-01T00:00:00Z", "Asia/Istanbul");
print localtime2gmt("1970-01-01 00:00:00", "America/Sao_Paulo");
print localtime2gmt("1970-01-01 00:00:00", "Asia/Istanbul");
}'
GENMD_EOF
GENMD-EOF
# GMT and local times with custom formats
@ -181,7 +181,7 @@ Notes:
Some examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end {
ENV["TZ"] = "Asia/Istanbul";
print strftime(0, "%Y-%m-%d %H:%M:%S");
@ -190,7 +190,7 @@ mlr -n put 'end {
print strftime(0, "%A, %B %e, %Y");
print strftime(123456789, "%I:%M %p");
}'
GENMD_EOF
GENMD-EOF
Unfortunately, names from `%A` and `%B` are only available in English, as an
artifact of a design choice in the Go `time` library which Miller (and its
@ -200,7 +200,7 @@ We also have
[strftime_local](reference-dsl-builtin-functions.md#strftimelocal) and
[strptime_local](reference-dsl-builtin-functions.md#strptimelocal):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end {
ENV["TZ"] = "America/Anchorage";
print strftime_local(0, "%Y-%m-%d %H:%M:%S %Z");
@ -214,9 +214,9 @@ mlr -n put 'end {
print strftime_local(0, "%A, %B %e, %Y");
print strptime_local("2020-03-01 00:00:00", "%Y-%m-%d %H:%M:%S");
}'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put 'end {
print strftime_local(0, "%Y-%m-%d %H:%M:%S %Z", "America/Anchorage");
print strftime_local(0, "%Y-%m-%d %H:%M:%S %z", "America/Anchorage");
@ -228,7 +228,7 @@ mlr -n put 'end {
print strftime_local(0, "%A, %B %e, %Y", "Asia/Hong_Kong");
print strptime_local("2020-03-01 00:00:00", "%Y-%m-%d %H:%M:%S", "Asia/Hong_Kong");
}'
GENMD_EOF
GENMD-EOF
# Relative times
@ -236,7 +236,7 @@ You can get the seconds since the Miller process start using
[uptime](reference-dsl-builtin-functions.md#uptime):
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
color shape flag k index quantity rate u
yellow triangle true 1 11 43.6498 9.8870 0.0011110305786132812
red square true 2 15 79.2778 0.0130 0.0011241436004638672
@ -248,13 +248,13 @@ purple triangle false 7 65 80.1405 5.8240 0.0024831295013427734
yellow circle true 8 73 63.9785 4.2370 0.0024831295013427734
yellow circle true 9 87 63.5058 8.3350 0.0024852752685546875
purple square false 10 91 72.3735 8.2430 0.002485990524291992
GENMD_EOF
GENMD-EOF
Time-differences can be done in seconds, of course; you can also use the following if you like:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -F | grep hms
GENMD_EOF
GENMD-EOF
# References


@ -2,22 +2,22 @@
You can clear a map key by assigning the empty string as its value: `$x=""` or `@x=""`. Using `unset` you can remove the key entirely. Examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put 'unset $x, $a' data/small
GENMD_EOF
GENMD-EOF
This can also be done, of course, using `mlr cut -x`. You can also clear out-of-stream or local variables, at the base name level, or at an indexed sublevel:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { dump; unset @sum; dump }' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@sum[$a][$b] += $x; end { dump; unset @sum["eks"]; dump }' data/small
GENMD_EOF
GENMD-EOF
If you use `unset all` (or `unset @*` which is synonymous), that will unset all out-of-stream variables which have been assigned up to that point.


@ -6,7 +6,7 @@ As of Miller 5.0.0 you can define your own functions, as well as subroutines.
Here's the obligatory example of a recursive function, computing the factorial:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint --from data/small put '
func f(n) {
if (is_numeric(n)) {
@ -21,7 +21,7 @@ mlr --opprint --from data/small put '
$ox = f($x + NR);
$oi = f($i);
'
GENMD_EOF
GENMD-EOF
Properties of user-defined functions:
@ -45,7 +45,7 @@ Properties of user-defined functions:
Example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint --from data/small put -q '
begin {
@call_count = 0;
@ -63,7 +63,7 @@ mlr --opprint --from data/small put -q '
print "NR=" . NR;
call s(NR);
'
GENMD_EOF
GENMD-EOF
Properties of user-defined subroutines:
@ -97,15 +97,15 @@ If you have a file with UDFs you use frequently, say `my-udfs.mlr`, you can use
`--load` or `--mload` to define them for your Miller scripts. For example, in
your shell,
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
alias mlr='mlr --load ~/my-udfs.mlr'
GENMD_EOF
GENMD-EOF
or
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
alias mlr='mlr --load /u/miller-udfs/'
GENMD_EOF
GENMD-EOF
See the [miscellaneous-flags page](reference-main-flag-list.md#miscellaneous-flags) for more information.
@ -123,16 +123,16 @@ for more information on
For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put '
f = func(s, t) {
return s . ":" . t;
};
$z = f($color, $shape);
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put '
a = func(s, t) {
return s . ":" . t . " above";
@ -143,7 +143,7 @@ mlr --c2p --from example.csv put '
f = $index >= 50 ? a : b;
$z = f($color, $shape);
'
GENMD_EOF
GENMD-EOF
Note that you need a semicolon after the closing curly brace of the function literal.
@ -151,7 +151,7 @@ Unlike named functions, function literals (also known as unnamed functions)
have access to local variables defined in their enclosing scope. That's
so you can do things like this:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put '
f = func(s, t, i) {
if (i >= cap) {
@ -163,6 +163,6 @@ mlr --c2p --from example.csv put '
cap = 10;
$z = f($color, $shape, $index);
'
GENMD_EOF
GENMD-EOF
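Capturing a variable from the enclosing scope is what other languages call a closure. As a loose analogy only, here is a Python sketch. The names and the returned strings are hypothetical, and Python's scoping rules differ from Miller's in other respects (for instance, Python's named nested functions can also see enclosing locals).

```python
def run():
    # The function literal refers to `cap`, a local of the enclosing scope.
    f = lambda s, t, i: s + ":" + t + ":" + ("high" if i >= cap else "low")

    # As in the example above, `cap` is assigned before f is called.
    cap = 10

    return f("red", "square", 15), f("red", "square", 5)

results = run()
print(results)
```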
See the [page on higher-order functions](reference-dsl-higher-order-functions.md) for more.


@ -987,7 +987,7 @@ etc., to control the format of the output if the output is redirected. See also
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
emitf: inserts non-indexed out-of-stream variable(s) side-by-side into the
output record stream.
@ -1014,7 +1014,7 @@ etc., to control the format of the output if the output is redirected. See also
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern", @a, @b, @c'
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern > mytap.dat", @a, @b, @c'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
emitp: inserts an out-of-stream variable into the output record stream.
Hashmap indices present in the data but not slotted by emitp arguments are
@ -1043,7 +1043,7 @@ etc., to control the format of the output if the output is redirected. See also
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
end: defines a block of statements to be executed after input records
are ingested. The body statements must be wrapped in curly braces.


@ -20,13 +20,13 @@ If field names have **special characters** such as `.` then you can use braces,
You may also use a **computed field name** in square brackets, e.g.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo a=3,b=4 | mlr filter '$["x"] < 0.5'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo s=green,t=blue,a=3,b=4 | mlr put '$[$s."_".$t] = $a * $b'
GENMD_EOF
GENMD-EOF
Notes:
@ -46,39 +46,39 @@ Use `$[[3]]` to access the name of field 3. More generally, any expression eval
Then using a computed field name, `$[ $[[3]] ]` is the value in the third field. This has the shorter equivalent notation `$[[[3]]]`.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr cat data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$[[3]] = "NEW"' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$[[[3]]] = "NEW"' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$NEW = $[[NR]]' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$NEW = $[[[NR]]]' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$[[[NR]]] = "NEW"' data/small
GENMD_EOF
GENMD-EOF
Right-hand side accesses to non-existent fields -- i.e. with index less than 1 or greater than `NF` -- return an absent value. Likewise, left-hand side accesses only refer to fields which already exist. For example, if a record has 5 fields then assigning the name or value of the 6th (or 600th) field results in a no-op.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$[[6]] = "NEW"' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$[[[6]]] = "NEW"' data/small
GENMD_EOF
GENMD-EOF
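The positional semantics described above can be modeled with an insertion-ordered Python dict. This is a sketch of the described behavior, not Miller's implementation: the helper names and the sample record are hypothetical, and `None` stands in for Miller's absent value.

```python
def field_name_at(record, pos):
    """Name of the field at 1-up position pos, or None when out of range."""
    names = list(record)
    return names[pos - 1] if 1 <= pos <= len(names) else None

def rename_field_at(record, pos, new_name):
    """Copy of record with the field at position pos renamed; unchanged if out of range."""
    old = field_name_at(record, pos)
    if old is None:
        return dict(record)
    return {(new_name if k == old else k): v for k, v in record.items()}

def set_value_at(record, pos, value):
    """Copy of record with the value at position pos replaced; unchanged if out of range."""
    old = field_name_at(record, pos)
    if old is None:
        return dict(record)
    return {**record, old: value}  # an existing key keeps its position

record = {"a": "pan", "b": "pan", "i": 1, "x": 0.25, "y": 0.75}  # made-up record

renamed = rename_field_at(record, 3, "NEW")      # in the manner of $[[3]] = "NEW"
revalued = set_value_at(record, 3, "NEW")        # in the manner of $[[[3]]] = "NEW"
out_of_range = rename_field_at(record, 6, "NEW")  # position 6 does not exist

print(renamed)
print(revalued)
print(out_of_range)
```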
## Out-of-stream variables
@ -90,21 +90,21 @@ Just as for field names in stream records, if you want to define out-of-stream v
You may use a **computed key** in square brackets, e.g.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo s=green,t=blue,a=3,b=4 | mlr put -q '@[$s."_".$t] = $a * $b; emit all'
GENMD_EOF
GENMD-EOF
Out-of-stream variables are **scoped** to the `put` command in which they appear. In particular, if you have two or more `put` commands separated by `then`, each put will have its own set of out-of-stream variables:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/a.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '@sum += $a; end {emit @sum}' \
then put 'is_present($a) {$a=10*$a; @sum += $a}; end {emit @sum}' \
data/a.dkvp
GENMD_EOF
GENMD-EOF
Out-of-stream variables' **extent** is from the start to the end of the record stream, i.e. every time the `put` or `filter` statement referring to them is executed.
@ -114,7 +114,7 @@ Out-of-stream variables are **read-write**: you can do `$sum=@sum`, `@sum=$sum`,
Using an index on the `@count` and `@sum` variables, we get the benefit of the `-g` (group-by) option which `mlr stats1` and various other Miller commands have:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '
@x_count[$a] += 1;
@x_sum[$a] += $x;
@ -123,15 +123,15 @@ mlr put -q '
emit @x_sum, "a";
}
' ./data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr stats1 -a count,sum -f x -g a ./data/small
GENMD_EOF
GENMD-EOF
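For readers coming from general-purpose languages, the indexed out-of-stream variables above behave much like dictionaries keyed by the group-by field. The following Python sketch is an analogy only, not a description of how Miller is implemented; the sample records are made up for illustration and are not the contents of `data/small`.

```python
from collections import defaultdict

# Hypothetical records standing in for an input stream.
records = [
    {"a": "pan", "x": 1},
    {"a": "eks", "x": 2},
    {"a": "pan", "x": 3},
]

x_count = defaultdict(int)  # plays the role of @x_count[$a]
x_sum = defaultdict(int)    # plays the role of @x_sum[$a]

# Per-record work: one update per map, keyed by the value of field "a".
for record in records:
    x_count[record["a"]] += 1
    x_sum[record["a"]] += record["x"]

# End-of-stream work: report one line per distinct value of "a".
for key in x_count:
    print(key, x_count[key], x_sum[key])
```

The point of the comparison is that the key does the grouping: no separate group-by step is needed once the accumulator is indexed.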
Indices can be arbitrarily deep -- here there are two or more of them:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/medium put -q '
@x_count[$a][$b] += 1;
@x_sum[$a][$b] += $x;
@ -139,13 +139,13 @@ mlr --from data/medium put -q '
emit (@x_count, @x_sum), "a", "b";
}
'
GENMD_EOF
GENMD-EOF
The idea is that `stats1`, and other Miller verbs, encapsulate frequently-used patterns with a minimum of keystroking (and run a little faster), whereas using out-of-stream variables you have more flexibility and control in what you do.
Begin/end blocks can be mixed with pattern/action blocks. For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '
begin {
@num_total = 0;
@ -160,7 +160,7 @@ mlr put '
emitf @num_total, @num_positive
}
' data/put-gating-example-1.dkvp
GENMD_EOF
GENMD-EOF
## Local variables
@ -168,7 +168,7 @@ Local variables are similar to out-of-stream variables, except that their extent
For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
# Here I'm using a specified random-number seed so this example always
# produces the same output for this web document: in everyday practice we
# would leave off the --seed 12345 part.
@ -185,7 +185,7 @@ mlr --seed 12345 seqgen --start 1 --stop 10 then put '
num o = f(10, 20); # local to the top-level scope
$o = o;
'
GENMD_EOF
GENMD-EOF
Things which are completely unsurprising, resembling many other languages:
@ -213,23 +213,23 @@ Things which are perhaps surprising compared to other languages:
The following example demonstrates the scope rules:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/scope-example.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/scope-example.dat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab --from data/scope-example.dat put -f data/scope-example.mlr
GENMD_EOF
GENMD-EOF
And this example demonstrates the type-declaration rules:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/type-decl-example.mlr
GENMD_EOF
GENMD-EOF
## Map literals
@ -237,7 +237,7 @@ Miller's `put`/`filter` DSL has four kinds of maps. **Stream records** are (sing
For example, the following swaps the input stream's `a` and `i` fields, modifies `y`, and drops the rest:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put '
$* = {
"a": $i,
@ -245,11 +245,11 @@ mlr --opprint put '
"y": $y * 10,
}
' data/small
GENMD_EOF
GENMD-EOF
Likewise, you can assign map literals to out-of-stream variables or local variables; pass them as arguments to user-defined functions, return them from functions, and so on:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small put '
func f(map m): map {
m["x"] *= 200;
@ -257,11 +257,11 @@ mlr --from data/small put '
}
$* = f({"a": $a, "x": $x});
'
GENMD_EOF
GENMD-EOF
Like out-of-stream and local variables, map literals can be multi-level:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small put -q '
begin {
@o = {
@ -281,7 +281,7 @@ mlr --from data/small put -q '
dump @o;
}
'
GENMD_EOF
GENMD-EOF
See also the [Maps page](reference-main-maps.md).
@ -301,13 +301,13 @@ read/write access to environment variables, e.g. `ENV["HOME"]` or
<!--- TODO: FLATSEP IFLATSEP OFLATSEP --->
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr filter 'FNR == 2' data/small*
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$fnr = FNR' data/small*
GENMD_EOF
GENMD-EOF
The values of `NF`, `NR`, `FNR`, `FILENUM`, and `FILENAME` change from one
record to the next as Miller scans through your input data stream. The
@ -318,13 +318,13 @@ system environment variables at the time Miller starts. Any changes made to
Their **scope is global**: you can refer to them in any `filter` or `put` statement. Their values are assigned by the input-record reader:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv put '$nr = NR' data/a.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv repeat -n 3 then put '$nr = NR' data/a.csv
GENMD_EOF
GENMD-EOF
The **extent** is for the duration of the put/filter: in a `begin` statement (which executes before the first input record is consumed) you will find `NR=1` and in an `end` statement (which is executed after the last input record is consumed) you will find `NR` to be the total number of records ingested.
@ -340,13 +340,13 @@ Use of type-checking is entirely up to you: omit it if you want flexibility with
The following `is_...` functions take a value and return a boolean indicating whether the argument is of the indicated type. The `assert_...` functions return their argument if it is of the specified type, and cause a fatal error otherwise:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -f | grep ^is
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -f | grep ^assert
GENMD_EOF
GENMD-EOF
See [Data-cleaning Examples](data-cleaning-examples.md) for examples of how to use these.
@ -356,36 +356,36 @@ Local variables can be defined either untyped as in `x = 1`, or typed as in `int
The reason for `num` is that `int` and `float` typedecls are very precise:
GENMD_CARDIFY
GENMD-CARDIFY
float a = 0; # Runtime error since 0 is int not float
int b = 1.0; # Runtime error since 1.0 is float not int
num c = 0; # OK
num d = 1.0; # OK
GENMD_EOF
GENMD-EOF
A suggestion is to use `num` for general use when you want numeric content, and use `int` when you genuinely want integer-only values, e.g. in loop indices or map keys (since Miller map keys can only be strings or ints).
The `var` type declaration indicates no type restrictions, e.g. `var x = 1` has the same type restrictions on `x` as `x = 1`. The difference is in intentional shadowing: if you have `x = 1` in outer scope and `x = 2` in inner scope (e.g. within a for-loop or an if-statement) then outer-scope `x` has value 2 after the second assignment. But if you have `var x = 2` in the inner scope, then you are declaring a variable scoped to the inner block. For example:
GENMD_CARDIFY
GENMD-CARDIFY
x = 1;
if (NR == 4) {
x = 2; # Refers to outer-scope x: value changes from 1 to 2.
}
print x; # Value of x is now two
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY
GENMD-CARDIFY
x = 1;
if (NR == 4) {
var x = 2; # Defines a new inner-scope x with value 2
}
print x; # Value of this x is still 1
GENMD_EOF
GENMD-EOF
Likewise function arguments can optionally be typed, with type enforced when the function is called:
GENMD_CARDIFY
GENMD-CARDIFY
func f(map m, int i) {
...
}
@ -396,11 +396,11 @@ if (NR == 4) {
var x = 2; # Defines a new inner-scope x with value 2
}
print x; # Value of this x is still 1
GENMD_EOF
GENMD-EOF
Thirdly, function return values can be type-checked at the point of `return` using `:` and a typedecl after the parameter list:
GENMD_CARDIFY
GENMD-CARDIFY
func f(map m, int i): bool {
...
...
@ -419,7 +419,7 @@ func f(map m, int i): bool {
# So it would also be a runtime error on reaching the end of this function without
# an explicit return statement.
}
GENMD_EOF
GENMD-EOF
## Aggregate variable assignments
@ -431,7 +431,7 @@ There are three remaining kinds of variable assignment using out-of-stream varia
Example recursive copy of out-of-stream variables:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint --from data/small put -q '
@v["sum"] += $x;
@v["count"] += 1;
@ -441,35 +441,35 @@ mlr --opprint --from data/small put -q '
dump
}
'
GENMD_EOF
GENMD-EOF
Example of out-of-stream variable assigned to full stream record, where the 2nd record is stashed, and the 4th record is overwritten with that:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put 'NR == 2 {@keep = $*}; NR == 4 {$* = @keep}' data/small
GENMD_EOF
GENMD-EOF
Example of full stream record assigned to an out-of-stream variable, finding the record for which the `x` field has the largest value in the input stream:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put -q '
is_null(@xmax) || $x > @xmax {@xmax = $x; @recmax = $*};
end {emit @recmax}
' data/small
GENMD_EOF
GENMD-EOF
## Keywords for filter and put
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help list-keywords # you can also use mlr -k
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help usage-keywords # you can also use mlr -K
GENMD_EOF
GENMD-EOF

View file

@ -16,9 +16,9 @@ Here's comparison of verbs and `put`/`filter` DSL expressions:
Example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr stats1 -a sum -f x -g a data/small
GENMD_EOF
GENMD-EOF
* Verbs are coded in Go
* They run a bit faster
@ -28,9 +28,9 @@ GENMD_EOF
Example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@x_sum[$a] += $x; end{emit @x_sum, "a"}' data/small
GENMD_EOF
GENMD-EOF
* You get to write your own DSL expressions
* They run a bit slower
@ -62,15 +62,15 @@ page on [operating on all records](operating-on-all-records.md).)
To see this in action, let's take a look at the [data/short.csv](./data/short.csv) file:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/short.csv
GENMD_EOF
GENMD-EOF
There are three records in this file, with `word=apple`, `word=ball`, and
`word=cat`, respectively. Let's print something in a `begin` statement, add a
field in a main statement, and print something else in an `end` statement:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from data/short.csv put '
begin {
print "begin";
@ -80,7 +80,7 @@ mlr --csv --from data/short.csv put '
print "end";
}
'
GENMD_EOF
GENMD-EOF
The `print` statements for `begin` and `end` went out before the first record
was seen and after the last was seen; the field-creation statement `$nr = NR`
@ -100,21 +100,21 @@ The essential usages of `mlr filter` and `mlr put` are for record-selection and
record-updating expressions, respectively. For example, given the following
input data:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/small
GENMD_EOF
GENMD-EOF
you might retain only the records whose `a` field has value `eks`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr filter '$a == "eks"' data/small
GENMD_EOF
GENMD-EOF
or you might add a new field which is a function of existing fields:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$ab = $a . "_" . $b ' data/small
GENMD_EOF
GENMD-EOF
## Differences between put and filter
@ -128,7 +128,7 @@ The two verbs `mlr filter` and `mlr put` are essentially the same. The only diff
You can define and invoke functions and subroutines to help produce the bare-boolean statement, and record fields may be assigned in the statements before or after the bare-boolean statement. For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv filter '
# Bare-boolean filter expression: only records matching this pass through:
$quantity >= 70;
@ -139,9 +139,9 @@ mlr --c2p --from example.csv filter '
$description = "low rate";
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv filter '
# Bare-boolean filter expression: only records matching this pass through:
$shape =~ "^(...)(...)$";
@ -150,7 +150,7 @@ mlr --c2p --from example.csv filter '
$left = "\1";
$right = "\2";
'
GENMD_EOF
GENMD-EOF
There are more details and more choices, of course, as detailed in the following sections.

View file

@ -22,7 +22,7 @@ The short of it is that Miller does this transparently for you so you needn't th
Implementation details of this, for the interested: integer adds and subtracts overflow by at most one bit so it suffices to check sign-changes. Thus, Miller allows you to add and subtract arbitrary 64-bit signed integers, converting only to float precisely when the result is less than -2\*\*63 or greater than 2\*\*63 - 1. Multiplies, on the other hand, can overflow by a word size and a sign-change technique does not suffice to detect overflow. Instead, Miller tests whether the floating-point product exceeds the representable integer range. Now, 64-bit integers have 64-bit precision while IEEE-doubles have only 52-bit mantissas -- so, there are 53 bits including implicit leading one. The following experiment explicitly demonstrates the resolution at this range:
GENMD_CARDIFY
GENMD-CARDIFY
64-bit integer 64-bit integer Casted to double Back to 64-bit
in hex in decimal integer
0x7ffffffffffff9ff 9223372036854774271 9223372036854773760.000000 0x7ffffffffffff800
@ -33,7 +33,7 @@ in hex in decimal integer
0x7ffffffffffffe00 9223372036854775296 9223372036854775808.000000 0x8000000000000000
0x7ffffffffffffffe 9223372036854775806 9223372036854775808.000000 0x8000000000000000
0x7fffffffffffffff 9223372036854775807 9223372036854775808.000000 0x8000000000000000
GENMD_EOF
GENMD-EOF
That is, one cannot check an integer product to see if it is precisely greater than 2\*\*63 - 1 or less than -2\*\*63 using either integer arithmetic (it may have already overflowed) or using double-precision (due to granularity). Instead, Miller checks for overflow in 64-bit integer multiplication by seeing whether the absolute value of the double-precision product exceeds the largest representable IEEE double less than 2\*\*63, which we see from the listing above is 9223372036854774784. (An alternative would be to do all integer multiplies using handcrafted multi-word 128-bit arithmetic. This approach is not taken.)
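The overflow test described above can be sketched in Python, whose integers are arbitrary-precision and so let us see the exact product alongside the float one. This is a rough illustration under stated assumptions, not Miller's Go code: the function name `multiply` and the constant name `LIMIT` are invented here, and `LIMIT` is the value 9223372036854774784 quoted in the paragraph above.

```python
# Largest IEEE double below 2**63, as quoted in the text above.
LIMIT = 9223372036854774784.0


def multiply(a: int, b: int):
    """Return an int product when it safely fits in 64 bits, else a float."""
    float_product = float(a) * float(b)
    if abs(float_product) > LIMIT:
        # Too close to (or beyond) the 64-bit range: keep the float result.
        return float_product
    return a * b


print(multiply(3, 4))          # small operands stay integer
print(multiply(2**62, 4))      # far out of range: float result
print(float(2**63 - 1) == 2.0**63)  # granularity: the double rounds up
```

In this sketch the check is conservative near the boundary: a product such as `(2**63 - 1) * 1` fits in 64 bits but is still returned as a float, because its double-precision image rounds up to 2\*\*63.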

View file

@ -10,18 +10,18 @@ of the major advantages of Miller 6.
Array literals are written in square brackets with integer indices. Array slots can be any [Miller data type](reference-main-data-types.md) (including other arrays, or maps).
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = [ "a", 1, "b", {"x": 2, "y": [3,4,5]}, 99, true];
print x;
}
'
GENMD_EOF
GENMD-EOF
As with maps and argument-lists, trailing commas are supported:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = [
@ -32,7 +32,7 @@ mlr -n put '
print x;
}
'
GENMD_EOF
GENMD-EOF
Also note that several [built-in functions](reference-dsl-builtin-functions.md) operate on arrays and/or return arrays.
@ -60,7 +60,7 @@ are already 1-up in Miller, and always have been, mostly inherited from AWK:
Imitating Python and other languages, you can use negative indices to read backward from the end of the array,
while positive indices read forward from the start. If an array has length `n` then `-n..-1` are aliases for `1..n`, respectively; 0 is never a valid array index in Miller.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = [10, 20, 30, 40, 50];
@ -70,7 +70,7 @@ mlr -n put '
print x[-2:-1];
}
'
GENMD_EOF
GENMD-EOF
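The aliasing rule stated above (`-n..-1` are aliases for `1..n`, and 0 is never valid) can be written down as a small mapping. The Python sketch below is only an illustration of that rule; the helper name `unalias` is made up, and raising an error for anything outside the in-bounds range is a choice of this sketch, not a statement about what Miller does.

```python
def unalias(index: int, n: int) -> int:
    """Map a 1-up or negative index to its positive 1-up equivalent."""
    if 1 <= index <= n:
        return index
    if -n <= index <= -1:
        return index + n + 1
    # Zero and out-of-bounds indices are outside the scope of this sketch.
    raise IndexError(f"index {index} is not in -{n}..-1 or 1..{n}")


x = [10, 20, 30, 40, 50]
for i in (1, 5, -1, -5):
    # Python lists are 0-up, so subtract one after un-aliasing.
    print(i, x[unalias(i, len(x)) - 1])
```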
## Slicing
@ -79,7 +79,7 @@ in a slice can be negatively aliased as described above. Unlike in Python,
Miller array-slice indices are inclusive on both sides: `x[3:5]` means `[x[3],
x[4], x[5]]`.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = [10, 20, 30, 40, 50];
@ -90,7 +90,7 @@ mlr -n put '
print x[2:-2];
}
'
GENMD_EOF
GENMD-EOF
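Because Python slices are 0-up and exclude the upper bound, it can help to see the inclusive, 1-up convention translated explicitly. This is a minimal sketch for in-bounds, nonzero indices only; the function name `inclusive_slice` is invented for the illustration and is not part of Miller.

```python
def inclusive_slice(x: list, lo: int, hi: int) -> list:
    """Emulate a 1-up slice that includes both endpoints (in-bounds only)."""
    n = len(x)
    # Negative indices alias positive ones: -n..-1 map to 1..n.
    lo = lo + n + 1 if lo < 0 else lo
    hi = hi + n + 1 if hi < 0 else hi
    # Shift the lower bound to 0-up; the upper bound needs no shift because
    # "inclusive 1-up hi" and "exclusive 0-up hi" are the same number.
    return x[lo - 1:hi]


x = [10, 20, 30, 40, 50]
print(inclusive_slice(x, 3, 5))
print(inclusive_slice(x, 2, -2))
```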
## Out-of-bounds indexing
@ -98,7 +98,7 @@ Somewhat imitating Python, out-of-bounds index accesses are
[absent](reference-main-null-data.md), but out-of-bounds slice accesses result
in trimming the indices, resulting in a short array or even the empty array:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = [10, 20, 30, 40, 50];
@ -107,9 +107,9 @@ mlr -n put '
print x[6]; # absent
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = [10, 20, 30, 40, 50];
@ -118,7 +118,7 @@ mlr -n put '
print x[10:20];
}
'
GENMD_EOF
GENMD-EOF
## Auto-create results in maps
@ -126,7 +126,7 @@ As noted on the [maps page](reference-main-maps.md), indexing any
as-yet-unassigned local variable or out-of-stream variable results in
**auto-create** of that variable as a map variable:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from example.csv put -q '
# You can do this but you do not need to:
# begin { @last_rates = {} }
@ -135,12 +135,12 @@ mlr --csv --from example.csv put -q '
dump @last_rates;
}
'
GENMD_EOF
GENMD-EOF
*This also means that auto-create results in maps, not arrays, even if keys are integers.*
If you want to auto-extend an [array](reference-main-arrays.md), initialize it explicitly to `[]`.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from example.csv head -n 4 then put -q '
begin {
@my_array = [];
@ -151,7 +151,7 @@ mlr --csv --from example.csv head -n 4 then put -q '
dump
}
'
GENMD_EOF
GENMD-EOF
## Auto-extend and null-gaps
@ -170,7 +170,7 @@ However, if an array is written to more than one past its end, [values of type
JSON-null](reference-main-data-types.md) are used to fill in the gaps. These
are called **null-gaps**.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
no_gaps = [];
@ -185,13 +185,13 @@ mlr -n put '
print gaps;
}
'
GENMD_EOF
GENMD-EOF
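The null-gap rule can be modeled in a few lines of Python, with `None` standing in for JSON null. This is a sketch of the behavior as described in this section, not Miller's implementation; the helper name `assign` is hypothetical, and only positive 1-up indices are handled.

```python
def assign(array: list, index: int, value) -> None:
    """Assign at a positive 1-up index, extending with None as needed."""
    n = len(array)
    if index < 1:
        raise IndexError("this sketch handles positive indices only")
    if index <= n:
        array[index - 1] = value          # ordinary in-bounds write
    else:
        # Fill any skipped slots with nulls, then append the new value.
        array.extend([None] * (index - n - 1))
        array.append(value)


no_gaps = []
assign(no_gaps, 1, "a")
assign(no_gaps, 2, "b")   # one past the end: plain auto-extend
print(no_gaps)

gaps = []
assign(gaps, 1, "a")
assign(gaps, 4, "d")      # more than one past the end: null-gaps
print(gaps)
```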
## Unset as shift
Unsetting an array index results in shifting all higher-index elements down by one:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = [ "a", "b", "c", "d", "e"];
@ -200,11 +200,11 @@ mlr -n put '
print x;
}
'
GENMD_EOF
GENMD-EOF
More generally, you can get shift and pop operations by unsetting indices 1 and -1:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
$ mlr repl -q
[mlr] x=[1,2,3,4,5]
[mlr] unset x[-1]
@ -222,7 +222,7 @@ $ mlr repl -q
[mlr] x
[3, 4, 5]
[mlr]
GENMD_EOF
GENMD-EOF
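The same idea in Python terms, as a rough analogy: deleting a list element closes the gap, so deleting the first element acts as a shift and deleting the last acts as a pop. The helper name `unset` below is invented for the sketch and covers in-bounds, nonzero indices only.

```python
def unset(array: list, index: int) -> None:
    """Remove the element at a 1-up (or negative) index, closing the gap."""
    n = len(array)
    if index < 0:
        index = index + n + 1   # -n..-1 alias 1..n
    if not 1 <= index <= n:
        raise IndexError("this sketch handles in-bounds indices only")
    del array[index - 1]


x = ["a", "b", "c", "d", "e"]
unset(x, 2)    # higher-index elements move down by one
print(x)
unset(x, 1)    # shift: drop the first element
unset(x, -1)   # pop: drop the last element
print(x)
```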
## Looping

View file

@ -2,46 +2,46 @@
There are a few nearly-standalone programs which have little to do with the rest of Miller, do not participate in record streams, and do not deal with file formats. They might as well be little standalone executables, but instead they're delivered within the main Miller executable for convenience.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr aux-list
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr lecat --help
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr termcvt --help
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr hex --help
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr unhex --help
GENMD_EOF
GENMD-EOF
Examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'Hello, world!' | mlr lecat --mono
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'Hello, world!' | mlr termcvt --lf2crlf | mlr lecat --mono
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr hex data/budget.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr hex -r data/budget.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr hex -r data/budget.csv | sed 's/20/2a/g' | mlr unhex
GENMD_EOF
GENMD-EOF
Additionally, [`mlr help`](online-help.md), [`mlr repl`](repl.md), and [`mlr regtest`](https://github.com/johnkerl/miller/blob/main/go/regtest/README.md) are implemented here.

View file

@ -8,14 +8,14 @@ more general `--prepipe` option to support other decompression programs.
If your files end in `.gz`, `.bz2`, or `.z` then Miller will autodetect by file extension:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
file gz-example.csv.gz
gz-example.csv.gz: gzip compressed data, was "gz-example.csv", last modified: Mon Aug 23 02:04:34 2021, from Unix, original size modulo 2^32 429
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv sort -f color gz-example.csv.gz
GENMD_EOF
GENMD-EOF
This will decompress the input data on the fly, while leaving the disk file unmodified. This helps you save disk space, at the cost of some additional runtime CPU usage to decompress the data.
@ -23,9 +23,9 @@ This will decompress the input data on the fly, while leaving the disk file unmo
If the filename doesn't end in `.gz`, `.bz2`, or `.z` then you can use the flags `--gzin`, `--bz2in`, or `--zin` to let Miller know:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --csv --gzin sort -f color myfile.bin # myfile.bin has gzip contents
GENMD_EOF
GENMD-EOF
## External decompressors on input
@ -35,9 +35,9 @@ piping the standard output of that program to Miller's standard input.
You can, of course, already do without this for single input files, for example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
gunzip < gz-example.csv.gz | mlr --csv sort -f color
GENMD_EOF
GENMD-EOF
The benefit of `--prepipe` is that Miller will run the specified program once per
file, respecting file boundaries.
@ -81,27 +81,27 @@ For compressed output:
file, which is annoying for version control. That can be suppressed by using 'gzip -n' but then that's
confusing for the reader, who has no need for -n. We handle this by making this code sample non-live.
--->
GENMD_CARDIFY_HIGHLIGHT_FOUR
GENMD-CARDIFY-HIGHLIGHT-FOUR
mlr --from example.csv --csv put -q '
filename = $color.".csv.gz";
tee | "gzip > ".filename, $*
'
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
file red.csv.gz purple.csv.gz yellow.csv.gz
red.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 185
purple.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 164
yellow.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 158
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --csv cat yellow.csv.gz
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
GENMD_EOF
GENMD-EOF
* Using the [in-place flag](reference-main-in-place-processing.md) `-I`, the overwritten file will
be compressed when possible. See the [page on in-place mode](reference-main-in-place-processing.md) for details.

View file

@ -49,11 +49,11 @@ dot operator has been generalized to stringify non-strings
Examples:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat data/type-infer.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --oxtab --from data/type-infer.csv put '
$d = $a . $c;
$e = 7;
@ -67,7 +67,7 @@ mlr --icsv --oxtab --from data/type-infer.csv put '
$tf = typeof($f);
$tg = typeof($g);
' then reorder -f a,ta,b,tb,c,tc,d,td,e,te,f,tf,g,tg
GENMD_EOF
GENMD-EOF
On input, string values representable as boolean (e.g. `"true"`, `"false"`)
are *not* automatically treated as boolean. This is because `"true"` and
@ -90,21 +90,21 @@ they will not be auto-converted, but you can use the
or the
[`json_parse` DSL function](reference-dsl-builtin-functions.md#json_parse):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from data/json-in-csv.csv cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/json-in-csv.csv cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/json-in-csv.csv json-parse -f blob
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/json-in-csv.csv put '$blob = json_parse($blob)'
GENMD_EOF
GENMD-EOF
These have their respective operations to convert back to string: the
[`json-stringify` verb](reference-verbs.md#json-stringify)

View file

@ -2,13 +2,13 @@
Here are flags you can use when invoking Miller. For example, when you type
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson head -n 1 example.csv
GENMD_EOF
GENMD-EOF
the `--icsv` and `--ojson` bits are _flags_. See the [Miller command
structure](reference-main-overview.md) page for context.
Also, at the command line, you can use `mlr -g` for a list much like this one.
GENMD_RUN_CONTENT_GENERATOR(./mk-flag-info.rb)
GENMD-RUN-CONTENT-GENERATOR(./mk-flag-info.rb)

View file

@ -11,7 +11,7 @@ there are a few differences as noted below.
_Map literals_ are written in curly braces with string keys and any [Miller data type](reference-main-data-types.md) (including other maps, or arrays) as values. Also, integers may be given as keys although they'll be stored as strings.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = {"a": 1, "b": {"x": 2, "y": [3,4,5]}, 99: true};
@ -20,11 +20,11 @@ mlr -n put '
print x["99"];
}
'
GENMD_EOF
GENMD-EOF
As with arrays and argument-lists, trailing commas are supported:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = {
@ -35,20 +35,20 @@ mlr -n put '
print x;
}
'
GENMD_EOF
GENMD-EOF
The current record, accessible using `$*`, is a map.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from example.csv head -n 2 then put -q '
dump $*;
print "Color is", $*["color"];
'
GENMD_EOF
GENMD-EOF
The collection of all [out-of-stream variables](reference-dsl-variables.md#out-of-stream-variables), `@*`, is a map.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from example.csv put -q '
begin {
@last_rates = {};
@ -59,7 +59,7 @@ mlr --csv --from example.csv put -q '
dump @*;
}
'
GENMD_EOF
GENMD-EOF
Also note that several [built-in functions](reference-dsl-builtin-functions.md) operate on maps and/or return maps.
@ -82,7 +82,7 @@ let people do `@records[NR] = $*`.
Indexing any as-yet-unassigned local variable or out-of-stream variable results
in **auto-create** of that variable as a map variable:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from example.csv put -q '
# You can do this but you do not need to:
# begin { @last_rates = {} }
@ -91,12 +91,12 @@ mlr --csv --from example.csv put -q '
dump @last_rates;
}
'
GENMD_EOF
GENMD-EOF
*This also means that auto-create results in maps, not arrays, even if keys are integers.*
If you want to auto-extend an [array](reference-main-arrays.md), initialize it explicitly to `[]`.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from example.csv head -n 4 then put -q '
begin {
@my_array = [];
@ -107,7 +107,7 @@ mlr --csv --from example.csv head -n 4 then put -q '
dump
}
'
GENMD_EOF
GENMD-EOF
## Auto-deepen
@ -116,14 +116,14 @@ without first setting `@m["a"]={}` and `@m["a"]["b"]={}`. The reason for this
is for doing data aggregations: for example if you want to compute keyed sums, you
can do that with a minimum of keystrokes.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv put -q '
@quantity_sum[$color][$shape] += $rate;
end {
emit @quantity_sum, "color", "shape";
}
'
GENMD_EOF
GENMD-EOF
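For comparison, here is what auto-deepen saves you from writing by hand. The Python sketch below creates intermediate levels on demand before accumulating; it is an analogy under assumptions, with a made-up helper name `add_at` and made-up sample values, and says nothing about how Miller implements the feature.

```python
def add_at(m: dict, keys: list, amount) -> None:
    """Accumulate into a nested dict, creating missing levels on the way."""
    for key in keys[:-1]:
        m = m.setdefault(key, {})   # create the intermediate map if absent
    last = keys[-1]
    m[last] = m.get(last, 0) + amount


# Hypothetical (color, shape, amount) triples.
sums = {}
add_at(sums, ["red", "square"], 3)
add_at(sums, ["red", "circle"], 4)
add_at(sums, ["red", "square"], 5)
add_at(sums, ["blue", "square"], 1)
print(sums)
```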
## Looping

View file

@ -14,55 +14,55 @@ Miller has three kinds of null data:
You can test these programmatically using the functions `is_empty`/`is_not_empty`, `is_absent`/`is_present`, and `is_null`/`is_not_null`. For the last pair, note that null means either empty or absent. Here is a full list of such functions:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -f | grep is_
GENMD_EOF
GENMD-EOF
## Rules for null-handling
* Records with one or more empty sort-field values sort after records with all sort-field values present:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr cat data/sort-null.dat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr sort -n a data/sort-null.dat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr sort -nr a data/sort-null.dat
GENMD_EOF
GENMD-EOF
* Functions/operators which have one or more *empty* arguments produce empty output: e.g.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=2,y=3' | mlr put '$a=$x+$y'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=,y=3' | mlr put '$a=$x+$y'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'
GENMD_EOF
GENMD-EOF
with the exception that the `min` and `max` functions are special: if one argument is non-null, it wins:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'
GENMD_EOF
GENMD-EOF
* Functions of *absent* variables (e.g. `mlr put '$y = log10($nonesuch)'`) evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication. Furthermore, **any expression which evaluates to absent is not stored in the left-hand side of an assignment statement**:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'
GENMD_EOF
GENMD-EOF
* Likewise, for assignment to maps, **absent-valued keys or values result in a skipped assignment**.
@ -80,22 +80,22 @@ The reasoning is as follows:
Since absent plus absent is absent (and likewise for other operators), accumulations such as `@sum += $x` work correctly on heterogeneous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don't desire. To work around this -- namely, to set an output field only for records which have all the inputs present -- you can use a pattern-action block with `is_present`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr cat data/het.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp
GENMD_EOF
GENMD-EOF
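The absent rules for addition can be modeled with a sentinel value. The Python sketch below covers only the absent cases described above (absent plus absent is absent; one absent operand returns the other operand) and leaves empty values out entirely. The names `ABSENT` and `add`, and the sample records, are assumptions made for the illustration.

```python
ABSENT = object()  # sentinel standing in for Miller's absent value


def add(a, b):
    """Addition under the absent rules: an absent operand is ignored."""
    if a is ABSENT:
        return b          # also covers absent + absent -> absent
    if b is ABSENT:
        return a
    return a + b


# Hypothetical heterogeneous records: the middle one has no "x" field.
records = [{"x": 2}, {"y": 9}, {"x": 3}]

total = ABSENT            # an unset accumulator starts out absent
for record in records:
    total = add(total, record.get("x", ABSENT))
print(total)
```

This is why an accumulator needs no explicit initialization to zero in the sketch: the first present value simply replaces the absent one.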
## Arithmetic rules
If you're interested in a formal description of how empty and absent fields participate in arithmetic, here's a table for plus (other arithmetic/boolean/bitwise operators are similar):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help type-arithmetic-info
GENMD_EOF
GENMD-EOF

View file

@ -4,19 +4,19 @@
The command-line option `--ofmt {format string}` is the global number format for all numeric fields. Examples:
GENMD_CARDIFY
GENMD-CARDIFY
--ofmt %.9e --ofmt %.6f --ofmt %.0f
GENMD_EOF
GENMD-EOF
These are just familiar `printf` formats. (TODO: write about type-checking once that's implemented.) Additionally, if you use leading width (e.g. `%18.12f`) then the output will contain embedded whitespace, which may not be what you want if you pipe the output to something else, particularly CSV. I use Miller's pretty-print format (`mlr --opprint`) to column-align numerical data.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=3.1,y=4.3' | mlr --ofmt '%8.3f' cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=3.1,y=4.3' | mlr --ofmt '%11.8e' cat
GENMD_EOF
GENMD-EOF
## The format-values verb
@ -29,20 +29,20 @@ To apply formatting to a single field, you can also use
[`fmtnum`](reference-dsl-builtin-functions.md#fmtnum) function within `mlr
put`. For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08f")'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08x")'
GENMD_EOF
GENMD-EOF
Input conversion from hexadecimal is done automatically on fields handled by `mlr put` and `mlr filter` as long as the field value begins with `0x`. To apply output conversion to hexadecimal on a single column, you may use `fmtnum`, or the keystroke-saving [`hexfmt`](reference-dsl-builtin-functions.md#hexfmt) function. Example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=0xffff,y=0xff' | mlr put '$z=$x*$y'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'
GENMD_EOF
GENMD-EOF
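Since these are ordinary `printf`-style formats, the arithmetic in the examples above can be cross-checked in any language with similar formatting. The Python sketch below shows Python's own behavior for the same inputs; it is offered as a sanity check on the arithmetic, not as a claim about Miller's exact output.

```python
# Base 0 tells int() to honor the 0x prefix, much as a hex-aware reader would.
x = int("0xffff", 0)
y = int("0xff", 0)
product = x * y

print(product)            # the product in decimal
print("%08x" % product)   # printf-style hex, zero-padded to width 8
print(hex(product))       # hex with a leading 0x
```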

View file

@ -11,19 +11,19 @@ The outline of an invocation of Miller is:
For example, reading from a file:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint head -n 2 then sort -f shape example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from example.csv --icsv --opprint head -n 2 then sort -f shape
GENMD_EOF
GENMD-EOF
Reading from standard input:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat example.csv | mlr --icsv --opprint head -n 2 then sort -f shape
GENMD_EOF
GENMD-EOF
The rest of this reference section gives you full information on each of these parts of the command line.
@ -41,9 +41,9 @@ Here's a comparison of verbs and `put`/`filter` DSL expressions:
Example of using a verb for data processing:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr stats1 -a sum -f x -g a data/small
GENMD_EOF
GENMD-EOF
* Verbs are coded in Go
* They run a bit faster
@ -53,9 +53,9 @@ GENMD_EOF
Example of doing the same thing using a DSL expression:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr put -q '@x_sum[$a] += $x; end{emit @x_sum, "a"}' data/small
GENMD_EOF
GENMD-EOF
* You get to write your own expressions in Miller's programming language
* They run a bit slower

View file

@ -26,13 +26,13 @@ Points demonstrated by the above examples:
Example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/regex-in-data.dat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr filter '$name =~ $regex' data/regex-in-data.dat
GENMD_EOF
GENMD-EOF
## Regex captures
@ -40,21 +40,21 @@ Regex captures of the form `\0` through `\9` are supported as
* Captures have in-function context for `sub` and `gsub`. For example, the first `\1,\2` pair belong to the first `sub` and the second `\1,\2` pair belong to the second `sub`:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr put '$b = sub($a, "(..)_(...)", "\2-\1"); $c = sub($a, "(..)_(.)(..)", ":\1:\2:\3")'
GENMD_EOF
GENMD-EOF
* Captures endure for the entirety of a `put` for the `=~` and `!=~` operators. For example, here the `\1,\2` are set by the `=~` operator and are used by both subsequent assignment statements:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr put '$a =~ "(..)_(....)"; $b = "left_\1"; $c = "right_\2"'
GENMD_EOF
GENMD-EOF
* The captures are not retained across multiple puts. For example, here the `\1,\2` won't be expanded from the regex capture:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr put '$a =~ "(..)_(....)"' then {... something else ...} then put '$b = "left_\1"; $c = "right_\2"'
GENMD_EOF
GENMD-EOF
* Up to nine matches are supported: `\1` through `\9`, while `\0` is the entire match string; `\15` is treated as `\1` followed by an unrelated `5`.
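Numbered captures of this kind behave much like those in `sed`. As a rough analogy only (this is `sed`, not Miller, and the sample string `ab_cde` is made up for illustration), swapping two captured groups looks like this:

```shell
# \1 and \2 refer to the first and second parenthesized groups;
# the replacement emits them in swapped order.
# (In sed the entire match is spelled & rather than \0.)
swapped=$(printf '%s\n' 'ab_cde' | sed -E 's/(..)_(...)/\2-\1/')
echo "$swapped"
```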

View file

@ -6,9 +6,9 @@ Miller has record separators, field separators, and pair separators. For
example, given the following [DKVP](file-formats.md#dkvp-key-value-pairs)
records:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/a.dkvp
GENMD_EOF
GENMD-EOF
* the **record separator** is newline -- it separates records from one another;
* the **field separator** is `,` -- it separates fields (key-value pairs) from one another;
@ -40,33 +40,33 @@ separators, `IFS` and `OFS` for the input and output field separators, and
For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/a.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ifs , --ofs ';' --ips = --ops : cut -o -f c,a,b data/a.dkvp
GENMD_EOF
GENMD-EOF
If your data has non-default separators and you don't want to change those
between input and output, you can use `--rs`, `--fs`, and `--ps`. Setting `--fs
:` is the same as setting `--ifs : --ofs :`, but with fewer keystrokes.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/modsep.dkvp
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --fs ';' --ps : cut -o -f c,a,b data/modsep.dkvp
GENMD_EOF
GENMD-EOF
## Multi-character and regular-expression separators
The separators default to single characters, but can be multiple characters if you like:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ifs ';' --ips : --ofs ';;;' --ops := cut -o -f c,a,b data/modsep.dkvp
GENMD_EOF
GENMD-EOF
As of September 2021:
@ -89,22 +89,22 @@ is internally implemented in terms of `--repifs`.
For example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/extra-spaces.txt
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ifs ' ' --repifs --inidx --oxtab cat data/extra-spaces.txt
GENMD_EOF
GENMD-EOF
## Aliases
Many things we'd like to write as separators need to be escaped from the shell
-- e.g. `--ifs ';'` or `--ofs '|'`, and so on. You can use the following if you like:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr help list-separator-aliases
GENMD_EOF
GENMD-EOF
Note that `spaces`, `tabs`, and `whitespace` already are regexes so you
shouldn't use `--repifs` with them.
@ -113,11 +113,11 @@ shouldn't use `--repifs` with them.
Given the above, we now have seen the following flags:
GENMD_CARDIFY
GENMD-CARDIFY
--rs --irs --ors
--fs --ifs --ofs --repifs
--ps --ips --ops
GENMD_EOF
GENMD-EOF
See also the [separator-flags section](reference-main-flag-list.md#separator-flags).
@ -127,9 +127,9 @@ Miller exposes for you read-only [built-in variables](reference-dsl-variables.md
names `IRS`, `ORS`, `IFS`, `OFS`, `IPS`, and `OPS`. Unlike in AWK, you can't set these in begin-blocks --
their values indicate what you specified at the command line -- so their use is limited.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ifs , --ofs ';' --ips = --ops : --from data/a.dkvp put '$d = ">>>" . IFS . "|||" . OFS . "<<<"'
GENMD_EOF
GENMD-EOF
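For contrast with the AWK behavior mentioned above, here is a small AWK sketch in which the separators are assigned inside a `BEGIN` block. The input line is invented for illustration.

```shell
# awk lets a BEGIN block assign the input and output field separators.
# Assigning $1 = $1 forces awk to rebuild the record using OFS.
converted=$(printf '%s\n' 'a=1,b=2,c=3' |
  awk 'BEGIN { FS = ","; OFS = ";" } { $1 = $1; print }')
echo "$converted"
```

In Miller the equivalent choice is made with the command-line flags shown earlier, and the built-in variables only report it.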
## Which separators apply to which file formats

View file

@ -10,9 +10,9 @@ the single-quotes are consumed by the shell and Miller gets `$b=$a.".suffix"`. (
A basic string operation is the `.` (concatenation) operator:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from example.csv put '$output = $color . ":" . $shape'
GENMD_EOF
GENMD-EOF
Also see the [list of string-related built-in functions](reference-dsl-builtin-functions.md#string-functions).
@ -47,7 +47,7 @@ backward from the end of the string, while positive indices read forward from
the start. If a string has length `n` then `-n..-1` are aliases for `1..n`,
respectively; 0 is never a valid string index in Miller.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = "abcde";
@ -57,7 +57,7 @@ mlr -n put '
print x[-2:-1];
}
'
GENMD_EOF
GENMD-EOF
## Slicing
@ -65,7 +65,7 @@ Miller supports slicing using `[lo:hi]` syntax. Either or both of the indices
in a slice can be negatively aliased as described above. Unlike in Python,
Miller string-slice indices are inclusive on both sides: `x[3:5]` means `x[3] . x[4] . x[5]`.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = "abcde";
@ -76,7 +76,7 @@ mlr -n put '
print x[2:-2];
}
'
GENMD_EOF
GENMD-EOF
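The Unix `cut -c` command happens to use the same convention of 1-up indices that are inclusive on both ends, which makes for a quick sanity check of what `x[3:5]` selects. This is an analogy in the shell, not Miller itself, and `cut` has no counterpart to negative indices.

```shell
# Characters 3 through 5 of "abcde", counting from 1 and including both ends.
middle=$(printf '%s\n' 'abcde' | cut -c3-5)
echo "$middle"
```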
## Out-of-bounds indexing
@ -84,7 +84,7 @@ Somewhat imitating Python, out-of-bounds index accesses are
[errors](reference-main-data-types.md), but out-of-bounds slice accesses result
in trimming the indices, resulting in a short string or even the empty string:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = "abcde";
@ -93,9 +93,9 @@ mlr -n put '
print x[6]; # absent
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
x = "abcde";
@ -104,7 +104,7 @@ mlr -n put '
print x[10:20];
}
'
GENMD_EOF
GENMD-EOF
## Escape sequences for string literals

View file

@ -2,21 +2,21 @@
In accord with the [Unix philosophy](http://en.wikipedia.org/wiki/Unix_philosophy), you can pipe data into or out of Miller. For example:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr cut --complement -f os_version *.dat | mlr sort -f hostname,uptime
GENMD_EOF
GENMD-EOF
You can, if you like, instead simply chain commands together using the `then` keyword:
GENMD_SHOW_COMMAND
GENMD-SHOW-COMMAND
mlr cut --complement -f os_version then sort -f hostname,uptime *.dat
GENMD_EOF
GENMD-EOF
(You can precede the very first verb with `then`, if you like, for symmetry.)
Here's a performance comparison:
GENMD_INCLUDE_ESCAPED(data/then-chaining-performance.txt)
GENMD-INCLUDE-ESCAPED(data/then-chaining-performance.txt)
There are two reasons to use then-chaining: one is for performance, although I don't expect this to be a win in all cases. Using then-chaining avoids redundant string-parsing and string-formatting at each pipeline step: instead input records are parsed once, they are fed through each pipeline stage in memory, and then output records are formatted once.

View file

@ -996,7 +996,7 @@ More example filter expressions:
Using 'any' higher-order function to see if $index is 10, 20, or 30:
'any([10,20,30], func(e) {return $index == e})'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
</pre>
### Features which filter shares with put
@ -2226,7 +2226,7 @@ More example put expressions:
end{emitf @min, @max}
'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
</pre>
### Features which put shares with filter

File diff suppressed because it is too large

View file

@ -4,12 +4,12 @@ The Miller REPL (read-evaluate-print loop) is an interactive counterpart to reco
Miller's REPL isn't a source-level debugger which lets you execute one source-code *statement* at a time -- however, it does let you operate on one *record* at a time. Further, it lets you use "immediate expressions", namely, you can interact with the [Miller programming language](miller-programming-language.md) without having to provide data from an input file.
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr repl
[mlr] 1 + 2
3
GENMD_EOF
GENMD-EOF
## Using Miller without the REPL
@ -23,10 +23,10 @@ Using `put` and `filter`, you can do the following as we've seen above:
* Specify statements to be executed on each record -- which are anything outside of `begin`/`end`/`func`/`subr`.
* Example:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from example.csv head -n 2 \
then put 'begin {print "HELLO"} $qr = $quantity / $rate; end {print "GOODBYE"}'
GENMD_EOF
GENMD-EOF
## Using Miller with the REPL
@ -74,7 +74,7 @@ printed to the terminal, e.g. if you type `1+2`, you will see `3`.
Use the REPL to look at arithmetic:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr repl
[mlr] 6/3
@ -88,11 +88,11 @@ int
[mlr] typeof(6/5)
float
GENMD_EOF
GENMD-EOF
Read the first record from a small file:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr repl
[mlr] :open foo.dat
@ -115,11 +115,11 @@ FILENAME="foo.dat",FILENUM=1,NR=1,FNR=1
[mlr] :write
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463,z=4.381399393871141
GENMD_EOF
GENMD-EOF
Skip until deep into a larger file, then inspect a record:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr repl --csv
[mlr] :open data/colored-shapes.csv
@ -136,7 +136,7 @@ mlr repl --csv
"w": 0.4799125551304738,
"x": 6.379888206335166
}
GENMD_EOF
GENMD-EOF
## History-editing

View file

@ -16,39 +16,39 @@ Also try `od -xcv` and/or `cat -e` on your file to check for non-printable chara
Use the `file` command to see if there are CR/LF terminators (in this case, there are not):
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
file data/colours.csv
data/colours.csv: UTF-8 Unicode text
GENMD_EOF
GENMD-EOF
Look at the file to find names of fields:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR
masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
GENMD_EOF
GENMD-EOF
Extract a few fields:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --csv cut -f KEY,PL,RO data/colours.csv
(only blank lines appear)
GENMD_EOF
GENMD-EOF
Use XTAB output format to get a sharper picture of where records/fields are being split:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --icsv --oxtab cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
GENMD_EOF
GENMD-EOF
Using XTAB output format makes it clearer that `KEY;DE;...;RO;TR` is being treated as a single field name in the CSV header, and likewise each subsequent line is being treated as a single field value. This is because the default field separator is a comma but we have semicolons here. Use XTAB again with a different input field separator (`--ifs semicolon`):
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --icsv --ifs semicolon --oxtab cat data/colours.csv
KEY masterdata_colourcode_1
DE Weiß
@ -73,98 +73,98 @@ NL Zwart
PL Czarny
RO Negru
TR Siyah
GENMD_EOF
GENMD-EOF
Using the new field-separator, retry the cut:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --csv --fs semicolon cut -f KEY,PL,RO data/colours.csv
KEY;PL;RO
masterdata_colourcode_1;Biały;Alb
masterdata_colourcode_2;Czarny;Negru
GENMD_EOF
GENMD-EOF
## I assigned $9 and it's not 9th
Miller records are ordered lists of key-value pairs. For NIDX format, DKVP format when keys are missing, or CSV/CSV-lite format with `--implicit-csv-header`, Miller will sequentially assign keys of the form `1`, `2`, etc. But these are not integer array indices: they're just field names taken from the initial field ordering in the input data, when it was originally read from the input file(s).
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x,y,z | mlr --dkvp cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x,y,z | mlr --dkvp put '$6="a";$4="b";$55="cde"'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x,y,z | mlr --nidx cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x,y,z | mlr --csv --implicit-csv-header cat
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x,y,z | mlr --dkvp rename 2,999
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x,y,z | mlr --dkvp rename 2,newname
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo x,y,z | mlr --csv --implicit-csv-header reorder -f 3,1,2
GENMD_EOF
GENMD-EOF
## Why doesn't mlr cut put fields in the order I want?
Example: columns `rate,shape,flag` were requested but they appear here in the order `shape,flag,rate`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cut -f rate,shape,flag example.csv
GENMD_EOF
GENMD-EOF
The issue is that Miller's `cut`, by default, outputs cut fields in the order they appear in the input data. This design decision was made intentionally to parallel the Unix/Linux system `cut` command, which has the same semantics.
The solution is to use the `-o` option:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cut -o -f rate,shape,flag example.csv
GENMD_EOF
GENMD-EOF
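The input-order behavior of the system `cut` command, which Miller's default parallels, can be seen with a one-line sketch. The sample input `a,b,c` is made up for illustration.

```shell
# Fields are requested as 3,1 but come out in input order: 1, then 3.
picked=$(printf '%s\n' 'a,b,c' | cut -d, -f3,1)
echo "$picked"
```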
## Numbering and renumbering records
The `awk`-like built-in variable `NR` is incremented for each input record:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat example.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv put '$nr = NR' example.csv
GENMD_EOF
GENMD-EOF
However, this is the record number within the original input stream -- not after any filtering you may have done:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv filter '$color == "yellow"' then put '$nr = NR' example.csv
GENMD_EOF
GENMD-EOF
There are two good options here. One is to use the `cat` verb with `-n`:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv filter '$color == "yellow"' then cat -n example.csv
GENMD_EOF
GENMD-EOF
The other is to keep your own counter within the `put` DSL:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv filter '$color == "yellow"' then put 'begin {@n = 1} $n = @n; @n += 1' example.csv
GENMD_EOF
GENMD-EOF
The difference is a matter of taste (although `mlr cat -n` puts the counter first).
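Since `NR` is modeled on AWK, the same distinction between the original record number and a post-filter counter can be sketched in AWK itself. The four input lines and the counter name `n` are invented for illustration.

```shell
# NR is the line number in the original input; n counts only the lines
# that pass the filter.
numbered=$(printf '%s\n' red yellow blue yellow |
  awk '$1 == "yellow" { n++; print NR, n, $1 }')
echo "$numbered"
```

The first column keeps the original positions (2 and 4) while the second column counts 1 and 2.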
@ -172,50 +172,50 @@ The difference is a matter of taste (although `mlr cat -n` puts the counter firs
Suppose you want to just keep the first two components of the hostnames:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/hosts.csv
GENMD_EOF
GENMD-EOF
Using the [`splita`](reference-dsl-builtin-functions.md#splita) and
[`joinv`](reference-dsl-builtin-functions.md#joinv) functions, along with
[array slicing](reference-main-arrays.md#slicing), we get
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'
GENMD_EOF
GENMD-EOF
## Splitting nested fields
Suppose you have a TSV file like this:
GENMD_INCLUDE_ESCAPED(data/nested.tsv)
GENMD-INCLUDE-ESCAPED(data/nested.tsv)
The simplest option is to use [nest](reference-verbs.md#nest):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --tsv nest --explode --values --across-records -f b --nested-fs : data/nested.tsv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --tsv nest --explode --values --across-fields -f b --nested-fs : data/nested.tsv
GENMD_EOF
GENMD-EOF
While `mlr nest` is simplest, let's also take a look at a few ways to do this using the `put` DSL.
One option to split out the colon-delimited values in the `b` column is to use `splitnv` to create an integer-indexed map and loop over it, adding new fields to the current record:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/nested.tsv --itsv --oxtab put '
o = splitnv($b, ":");
for (k,v in o) {
$["p".k]=v
}
'
GENMD_EOF
GENMD-EOF
while another is to loop over the same map from `splitnv` and use it (with `put -q` to suppress printing the original record) to produce multiple records:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/nested.tsv --itsv --oxtab put -q '
o = splitnv($b, ":");
for (k,v in o) {
@ -223,16 +223,16 @@ mlr --from data/nested.tsv --itsv --oxtab put -q '
emit x
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/nested.tsv --tsv put -q '
o = splitnv($b, ":");
for (k,v in o) {
x = mapsum($*, {"b":v}); emit x
}
'
GENMD_EOF
GENMD-EOF
## Options for dealing with duplicate rows
@ -244,23 +244,23 @@ If you want to look at partial uniqueness -- for example, show only the first re
Suppose you have a method (in whatever language) which is printing things of the form
GENMD_INCLUDE_ESCAPED(data/rect-outer.txt)
GENMD-INCLUDE-ESCAPED(data/rect-outer.txt)
and then calls another method which prints things of the form
GENMD_INCLUDE_ESCAPED(data/rect-middle.txt)
GENMD-INCLUDE-ESCAPED(data/rect-middle.txt)
and then, perhaps, that second method calls a third method which prints things of the form
GENMD_INCLUDE_ESCAPED(data/rect-inner.txt)
GENMD-INCLUDE-ESCAPED(data/rect-inner.txt)
with the result that your program's output is
GENMD_INCLUDE_ESCAPED(data/rect.txt)
GENMD-INCLUDE-ESCAPED(data/rect.txt)
The idea here is that middles starting with a 1 belong to the outer value of 1, and so on. (For example, the outer values might be account IDs, the middle values might be invoice IDs, and the inner values might be invoice line-items.) If you want all the middle and inner lines to have the context of which outers they belong to, you can modify your software to pass all those through your methods. Alternatively, don't refactor your code just to handle some ad-hoc log-data formatting -- instead, use the following to [rectangularize the data](record-heterogeneity.md). The idea is to use an out-of-stream variable to accumulate fields across records. Clear that variable when you see an outer ID; accumulate fields; emit output when you see the inner IDs.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/rect.txt put -q '
is_present($outer) {
unset @r
@ -271,7 +271,7 @@ mlr --from data/rect.txt put -q '
is_present($inner1) {
emit @r
}'
GENMD_EOF
GENMD-EOF
See also the [record-heterogeneity page](record-heterogeneity.md); see in
particular the [`regularize` verb](reference-verbs.md#regularize) for a way to

View file

@ -4,23 +4,23 @@ TODO: while-read example from issues
The [system](reference-dsl.md#system) DSL function allows you to run a specific shell command and put its output -- minus the final newline -- into a record field. The command itself is any string, either a literal string, or a concatenation of strings, perhaps including other field values or what have you.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put '$o = system("echo hello world")' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put '$o = system("echo {" . NR . "}")' data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put '$o = system("echo -n ".$a."| md5")' data/small
GENMD_EOF
GENMD-EOF
Note that running a subprocess on every record takes a non-trivial amount of time. Compare asking the system `date` command for the current time in nanoseconds with computing it in-process:
<!--- hard-coded, not live-code, since %N doesn't exist on all platforms -->
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --opprint put '$t=system("date +%s.%N")' then step -a delta -f t data/small
a b i x y t t_delta
pan pan 1 0.3467901443380824 0.7268028627434533 1568774318.513903817 0
@ -28,9 +28,9 @@ eks pan 2 0.7586799647899636 0.5221511083334797 1568774318.514722876 0.000819
wye wye 3 0.20460330576630303 0.33831852551664776 1568774318.515618046 0.000895
eks wye 4 0.38139939387114097 0.13418874328430463 1568774318.516547441 0.000929
wye pan 5 0.5732889198020006 0.8636244699032729 1568774318.517518828 0.000971
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --opprint put '$t=systime()' then step -a delta -f t data/small
a b i x y t t_delta
pan pan 1 0.3467901443380824 0.7268028627434533 1568774318.518699 0
@ -38,4 +38,4 @@ eks pan 2 0.7586799647899636 0.5221511083334797 1568774318.518717 0.000018
wye wye 3 0.20460330576630303 0.33831852551664776 1568774318.518723 0.000006
eks wye 4 0.38139939387114097 0.13418874328430463 1568774318.518727 0.000004
wye pan 5 0.5732889198020006 0.8636244699032729 1568774318.518730 0.000003
GENMD_EOF
GENMD-EOF

View file

@ -16,21 +16,21 @@ another, etc.
Input data:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p cat example.csv
GENMD_EOF
GENMD-EOF
Sorted numerically ascending by rate:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p sort -n rate example.csv
GENMD_EOF
GENMD-EOF
Sorted lexically ascending by color; then, within each color, numerically descending by quantity:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p sort -f color -nr quantity example.csv
GENMD_EOF
GENMD-EOF
## Sorting fields within records: the sort-within-records verb
@ -40,17 +40,17 @@ leaves records in their original order in the data stream, but reorders fields
within each record. A typical use-case is giving all records the same column-ordering,
in particular for converting JSON to CSV (or other tabular formats):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/sort-within-records.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint cat data/sort-within-records.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint sort-within-records data/sort-within-records.json
GENMD_EOF
GENMD-EOF
## The sort function by example
@ -58,77 +58,77 @@ GENMD_EOF
* Without a second argument, `sort` uses the natural ordering.
* With a second argument which is a string, `sort` takes sorting flags from it: `"f"` for lexical or `"c"` for case-folded lexical, and/or `"r"` for reverse/descending.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
# Sort array with natural ordering
print sort([5,2,3,1,4]);
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
# Sort array with reverse-natural ordering
print sort([5,2,3,1,4], "r");
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
# Sort array with custom function: natural ordering
print sort([5,2,3,1,4], func(a,b) { return a <=> b});
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
# Sort array with custom function: reverse-natural ordering
print sort([5,2,3,1,4], func(a,b) { return b <=> a});
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
# Sort map with natural ordering on keys
print sort({"c":2, "a": 3, "b": 1});
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
# Sort map with reverse-natural ordering on keys
print sort({"c":2, "a": 3, "b": 1}, "r");
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
# Sort map with custom function: natural ordering on values
print sort({"c":2, "a": 3, "b": 1}, func(ak,av,bk,bv){return av <=> bv});
}
'
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put '
end {
# Sort map with custom function: reverse-natural ordering on values
print sort({"c":2, "a": 3, "b": 1}, func(ak,av,bk,bv){return bv <=> av});
}
'
GENMD_EOF
GENMD-EOF
In the rest of this page we'll look more closely at these variants.
@ -147,47 +147,47 @@ contain strings, floats, and booleans; if you need to sort an array whose
values are themselves maps or arrays, you'll need `sort` with function argument
as described further down in this page.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/sorta-example.csv
GENMD_EOF
GENMD-EOF
Default sort is numerical ascending:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from data/sorta-example.csv put '
$values = splita($values, ";");
$values = sort($values); # default flags
$values = joinv($values, ";");
'
GENMD_EOF
GENMD-EOF
Use the `"r"` flag for reverse, which is numerical descending:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from data/sorta-example.csv put '
$values = splita($values, ";");
$values = sort($values, "r"); # 'r' flag for reverse sort
$values = joinv($values, ";");
'
GENMD_EOF
GENMD-EOF
Use the `"f"` flag for lexical ascending sort (and `"fr"` would be lexical descending):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from data/sorta-example.csv put '
$values = splita($values, ";");
$values = sort($values, "f"); # 'f' flag for lexical sort
$values = joinv($values, ";");
'
GENMD_EOF
GENMD-EOF
Without and with case-folding:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/sorta-example-text.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --c2p --from data/sorta-example-text.csv put '
$values = splita($values, ";");
if (NR == 1) {
@ -197,7 +197,7 @@ mlr --c2p --from data/sorta-example-text.csv put '
}
$values = joinv($values, ";");
'
GENMD_EOF
GENMD-EOF
## Simple sorting of maps within records
@ -211,16 +211,16 @@ described further down in this page.
Also note that, unlike the `sort-within-records` verb with its `-r` flag,
`sort` doesn't recurse into submaps and sort those.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/server-log.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json --from data/server-log.json put '
$req = sort($req); # Ascending here
$res = sort($res, "r"); # Descending here
'
GENMD_EOF
GENMD-EOF
## Simple sorting of maps across records
@ -235,7 +235,7 @@ of accumulating records in a map, then sorting the map.
Using the `f` flag we're sorting the map keys (1-up NR) lexically, so we
have 1, then 10, then 2:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv put -q '
begin {
@records = {}; # Define as a map
@ -249,7 +249,7 @@ mlr --icsv --opprint --from example.csv put -q '
}
}
'
GENMD_EOF
GENMD-EOF
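The "1, then 10, then 2" ordering is what any lexical sort does with integer-looking keys. As a shell-only illustration (the system `sort` command, not Miller's `sort` function):

```shell
# Use the C locale so the lexical ordering is byte-wise and predictable.
LC_ALL=C
export LC_ALL

# Lexical sort compares character by character, so "10" sorts before "2".
lexical=$(printf '%s\n' 1 2 10 | sort | paste -s -d ' ' -)
# Numeric sort compares by value.
numeric=$(printf '%s\n' 1 2 10 | sort -n | paste -s -d ' ' -)

echo "lexical: $lexical"
echo "numeric: $numeric"
```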
## Custom sorting of arrays within records
@ -264,14 +264,14 @@ for comparing elements.
For example, let's use the following input data. Instead of having an array, it
has some semicolon-delimited data in a field which we can split and sort:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/sortaf-example.csv
GENMD_EOF
GENMD-EOF
In the following example we sort data in several ways -- the first two just
recapitulate (for reference) what `sort` with default flags already does; the third is novel:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/sortaf-example.csv put '
# Same as sort($values)
@ -302,7 +302,7 @@ mlr --icsv --ojson --from data/sortaf-example.csv put '
$reverse = sort(split_values, reverse);
$even_then_odd = sort(split_values, even_then_odd);
'
GENMD_EOF
GENMD-EOF
## Custom sorting of arrays across records
@ -319,7 +319,7 @@ functions are maps -- and we have to access the `index` field using either
indexing](reference-dsl-operators.md#the-double-purpose-dot-operator))
`a.index` and `b.index`.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv put -q '
# Sort primarily ascending on the shape field, then secondarily
# descending numeric on the index field.
@ -342,7 +342,7 @@ mlr --icsv --opprint --from example.csv put -q '
}
}
'
GENMD_EOF
GENMD-EOF
## Custom sorting of maps within records
@ -356,7 +356,7 @@ keys and/or values.
For example, we can sort ascending or descending by map key or map value:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr -n put -q '
func f1(ak, av, bk, bv) {
return ak <=> bk
@ -383,7 +383,7 @@ mlr -n put -q '
print sort(x, f4);
}
'
GENMD_EOF
GENMD-EOF
## Custom sorting of maps across records
@ -395,7 +395,7 @@ of them -- densely -- accumulating them in an array is fine. If we're only
taking a subset -- sparsely -- and we want to retain the original `NR` as keys,
using a map is handy, since we don't need contiguous keys.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --opprint --from example.csv put -q '
# Sort descending numeric on the index field
func cmp(ak, av, bk, bv) {
@ -412,4 +412,4 @@ mlr --icsv --opprint --from example.csv put -q '
}
}
'
GENMD_EOF
GENMD-EOF

View file

@ -4,66 +4,66 @@
[CSV](file-formats.md) handles this well and by design:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat commas.csv
GENMD_EOF
GENMD-EOF
Likewise [JSON](file-formats.md#json):
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --ojson cat commas.csv
GENMD_EOF
GENMD-EOF
For Miller's [XTAB](file-formats.md#xtab-vertical-tabular) there is no escaping for carriage returns, but commas work fine:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --oxtab cat commas.csv
GENMD_EOF
GENMD-EOF
But for [key-value-pairs](file-formats.md#dkvp-key-value-pairs) and [index-numbered](file-formats.md#nidx-index-numbered-toolkit-style) formats, commas are the default field separator. And -- as of Miller 5.4.0 anyway -- there is no CSV-style double-quote-handling like there is for CSV. So commas within the data look like delimiters:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --odkvp cat commas.csv
GENMD_EOF
GENMD-EOF
One solution is to use a different delimiter, such as a pipe character:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --odkvp --ofs pipe cat commas.csv
GENMD_EOF
GENMD-EOF
To be extra-sure to avoid data/delimiter clashes, you can also use control
characters as delimiters -- here, control-A:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --icsv --odkvp --ofs '\001' cat commas.csv | cat -v
GENMD_EOF
GENMD-EOF
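What a control-A delimiter looks like through `cat -v` can be previewed without Miller. In this sketch the sample data `a`/`b` is made up, and `cat -v` is assumed to be available (it is common but not strictly POSIX).

```shell
# \001 in a printf format is the control-A byte; cat -v renders it as ^A.
visible=$(printf 'a\001b\n' | cat -v)
echo "$visible"

# The same byte can be mapped back to a printable delimiter with tr.
restored=$(printf 'a\001b\n' | tr '\001' ',')
echo "$restored"
```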
## How can I handle field names with special symbols in them?
Simply surround the field names with curly braces:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo 'x.a=3,y:b=4,z/c=5' | mlr put '${product.all} = ${x.a} * ${y:b} * ${z/c}'
GENMD_EOF
GENMD-EOF
## How can I put single quotes into strings?
This is a little tricky due to the shell's handling of quotes. For simplicity, let's first put an update script into a file:
GENMD_INCLUDE_ESCAPED(data/single-quote-example.mlr)
GENMD-INCLUDE-ESCAPED(data/single-quote-example.mlr)
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo a=bcd | mlr put -f data/single-quote-example.mlr
GENMD_EOF
GENMD-EOF
So: Miller's DSL uses double quotes for strings, and you can put single quotes (or backslash-escaped double-quotes) inside strings, no problem.
Without putting the update expression in a file, it's messier:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
echo a=bcd | mlr put '$a="It'\''s OK, I said, '\''for now'\''."'
GENMD_EOF
GENMD-EOF
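The shell half of this quoting can be checked on its own, without Miller, by storing the same argument in a variable and printing it. This is only a sketch of what the shell passes along; the variable name `quoted` is illustrative.

```shell
# Same quoting as the mlr put argument above: each '\'' closes the
# single-quoted string, adds one escaped single quote, and reopens it.
quoted='$a="It'\''s OK, I said, '\''for now'\''."'

# Print exactly what a program would receive as its argument.
printf '%s\n' "$quoted"
```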
The idea is that the outermost single-quotes are to protect the `put` expression from the shell, and the double quotes within them are for Miller. To get a single quote in the middle there, you need to actually put it *outside* the single-quoting for the shell. The pieces are the following, all concatenated together:
@ -79,14 +79,14 @@ The idea is that the outermost single-quotes are to protect the `put` expression
One way is to use square brackets; an alternative is to use simple string-substitution rather than a regular expression.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/question.dat
GENMD_EOF
GENMD_RUN_COMMAND
GENMD-EOF
GENMD-RUN-COMMAND
mlr --oxtab put '$c = gsub($a, "[?]"," ...")' data/question.dat
GENMD_EOF
GENMD_RUN_COMMAND
GENMD-EOF
GENMD-RUN-COMMAND
mlr --oxtab put '$c = ssub($a, "?"," ...")' data/question.dat
GENMD_EOF
GENMD-EOF
The `ssub` function exists precisely for this reason: so you don't have to escape anything.
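The same regex-versus-literal distinction can be sketched outside Miller. The Go program below is only an illustration, not Miller's implementation: `regexReplace` and `literalReplace` are hypothetical helper names, and replacing only the first occurrence in the literal case is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// questionRegex matches a literal question mark. "?" is a regex
// metacharacter, so it is wrapped in a character class, as in "[?]" above.
var questionRegex = regexp.MustCompile("[?]")

// regexReplace replaces every regex match, in the spirit of gsub.
func regexReplace(s string) string {
	return questionRegex.ReplaceAllString(s, " ...")
}

// literalReplace does plain string substitution with no regex involved,
// in the spirit of ssub. Assumption: only the first occurrence is replaced.
func literalReplace(s string) string {
	return strings.Replace(s, "?", " ...", 1)
}

func main() {
	fmt.Println(regexReplace("is it?"))
	fmt.Println(literalReplace("is it?"))
}
```

With literal substitution there is no pattern language at all, so nothing in the search string needs escaping.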


@ -6,7 +6,7 @@ I like to produce SQL-query output with header-column and tab delimiter: this is
For example, using default output formatting in `mysql` we get formatting like Miller's `--opprint --barred`:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mysql --database=mydb -e 'show columns in mytable'
+------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
@ -17,11 +17,11 @@ mysql --database=mydb -e 'show columns in mytable'
| assigned_to | bigint(20) | YES | | NULL | |
| last_update_time | int(11) | YES | | NULL | |
+------------------+--------------+------+-----+---------+-------+
GENMD_EOF
GENMD-EOF
Using `mysql`'s `-B` we get TSV output:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mysql --database=mydb -B -e 'show columns in mytable' | mlr --itsvlite --opprint cat
Field Type Null Key Default Extra
id bigint(20) NO MUL NULL -
@ -29,11 +29,11 @@ category varchar(256) NO - NULL -
is_permanent tinyint(1) NO - NULL -
assigned_to bigint(20) YES - NULL -
last_update_time int(11) YES - NULL -
GENMD_EOF
GENMD-EOF
Since Miller handles TSV output, we can do as much or as little processing as we want in the SQL query, then send the rest on to Miller. This includes outputting as JSON, doing further selects/joins in Miller, doing stats, etc. etc.:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mysql --database=mydb -B -e 'show columns in mytable' | mlr --itsvlite --ojson --jlistwrap --jvstack cat
[
{
@ -77,13 +77,13 @@ mysql --database=mydb -B -e 'show columns in mytable' | mlr --itsvlite --ojson -
"Extra": ""
}
]
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mysql --database=mydb -B -e 'select * from mytable' > query.tsv
GENMD_EOF
GENMD-EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --from query.tsv --t2p stats1 -a count -f id -g category,assigned_to
category assigned_to id_count
special 10000978 207
@ -93,7 +93,7 @@ standard 10000978 524
standard 10003924 392
standard 10009872 108
...
GENMD_EOF
GENMD-EOF
## SQL-input examples
@ -101,7 +101,7 @@ One use of NIDX (value-only, no keys) format is for loading up SQL tables.
Create and load SQL table:
GENMD_CARDIFY
GENMD-CARDIFY
mysql> CREATE TABLE abixy(
a VARCHAR(32),
b VARCHAR(32),
@ -140,11 +140,11 @@ mysql> SELECT * FROM abixy LIMIT 10;
| hat | wye | 9 | 0.03144187646093577 | 0.7495507603507059 |
| pan | wye | 10 | 0.5026260055412137 | 0.9526183602969864 |
+------+------+------+---------------------+---------------------+
GENMD_EOF
GENMD-EOF
Aggregate counts within SQL:
GENMD_CARDIFY
GENMD-CARDIFY
mysql> SELECT a, b, COUNT(*) AS count FROM abixy GROUP BY a, b ORDER BY COUNT DESC;
+------+------+-------+
| a | b | count |
@ -176,11 +176,11 @@ mysql> SELECT a, b, COUNT(*) AS count FROM abixy GROUP BY a, b ORDER BY COUNT DE
| eks | zee | 357 |
+------+------+-------+
25 rows in set (0.01 sec)
GENMD_EOF
GENMD-EOF
Aggregate counts within Miller:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mlr --opprint uniq -c -g a,b then sort -nr count data/medium
a b count
zee wye 455
@ -198,11 +198,11 @@ zee zee 403
pan wye 395
hat pan 363
eks zee 357
GENMD_EOF
GENMD-EOF
Pipe SQL output to aggregate counts within Miller:
GENMD_CARDIFY_HIGHLIGHT_ONE
GENMD-CARDIFY-HIGHLIGHT-ONE
mysql -D miller -B -e 'select * from abixy' | mlr --itsv --opprint uniq -c -g a,b then sort -nr count
a b count
zee wye 455
@ -230,4 +230,4 @@ wye wye 377
eks pan 371
hat pan 363
eks zee 357
GENMD_EOF
GENMD-EOF


@ -4,15 +4,15 @@
For one or more specified field names, simply compute p25 and p75, then write the IQR as the difference of p75 and p25:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab stats1 -f x -a p25,p75 \
then put '$x_iqr = $x_p75 - $x_p25' \
data/medium
GENMD_EOF
GENMD-EOF
For wildcarded field names, first compute p25 and p75, then loop over field names with `p25` in them:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab stats1 --fr '[i-z]' -a p25,p75 \
then put 'for (k,v in $*) {
if (k =~ "(.*)_p25") {
@ -20,13 +20,13 @@ mlr --oxtab stats1 --fr '[i-z]' -a p25,p75 \
}
}' \
data/medium
GENMD_EOF
GENMD-EOF
## Computing weighted means
This might be more elegantly implemented as an option within the `stats1` verb. Meanwhile, it's expressible within the DSL:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/medium put -q '
# Using the y field for weighting in this example
weight = $y;
@ -51,4 +51,4 @@ mlr --from data/medium put -q '
#emit mean, "a";
emit (wmean, mean), "a";
}'
GENMD_EOF
GENMD-EOF
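For reference, the arithmetic behind a weighted mean is the sum of weight times value divided by the sum of the weights. The Go sketch below shows only that formula; `weightedMean` is a hypothetical name, the zero-weight behavior is an assumption, and it is not a transcription of the DSL code in this hunk.

```go
package main

import "fmt"

// weightedMean returns sum(w*x) / sum(w). It assumes xs and ws have the
// same length, and returns 0 when the total weight is 0 (an arbitrary
// choice made for this sketch).
func weightedMean(xs, ws []float64) float64 {
	var sumWX, sumW float64
	for i, x := range xs {
		sumWX += ws[i] * x
		sumW += ws[i]
	}
	if sumW == 0 {
		return 0
	}
	return sumWX / sumW
}

func main() {
	// Values 1 and 3 with weights 1 and 3: (1*1 + 3*3) / (1 + 3).
	fmt.Println(weightedMean([]float64{1, 3}, []float64{1, 3}))
}
```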


@ -16,43 +16,43 @@ multi-pass computations, at the price of retaining all input records in memory.
One of Miller's strengths is its compact notation: for example, given input of the form
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
head -n 5 ./data/medium
GENMD_EOF
GENMD-EOF
you can simply do
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab stats1 -a sum -f x ./data/medium
GENMD_EOF
GENMD-EOF
or
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint stats1 -a sum -f x -g b ./data/medium
GENMD_EOF
GENMD-EOF
rather than the more tedious
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab put -q '
@x_sum += $x;
end {
emit @x_sum
}
' data/medium
GENMD_EOF
GENMD-EOF
or
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put -q '
@x_sum[$b] += $x;
end {
emit @x_sum, "b"
}
' data/medium
GENMD_EOF
GENMD-EOF
The former (`mlr stats1` et al.) has the advantages of being easier to type, being less error-prone to type, and running faster.
@ -73,7 +73,7 @@ The following examples compute some things using oosvars which are already compu
For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values, then on the second, map each record's value to a percentage.
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --from data/small --opprint put -q '
# These are executed once per record, which is the first pass.
# The key is to use NR to index an out-of-stream variable to
@ -90,13 +90,13 @@ mlr --from data/small --opprint put -q '
emit (@x, @x_pct), "NR"
}
'
GENMD_EOF
GENMD-EOF
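The two-pass shape of this computation can be sketched in a general-purpose language as well. The Go program below is a minimal illustration under stated assumptions: `percentOfRange` is a hypothetical name, the whole column fits in memory, and a column with no spread maps to 0.

```go
package main

import "fmt"

// percentOfRange maps each value to its percentage between the column's
// min and max. The first pass finds min and max; the second pass maps.
func percentOfRange(xs []float64) []float64 {
	if len(xs) == 0 {
		return nil
	}
	// First pass: find min and max.
	lo, hi := xs[0], xs[0]
	for _, x := range xs {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	// Second pass: map each value to a percentage.
	out := make([]float64, len(xs))
	for i, x := range xs {
		if hi == lo {
			out[i] = 0 // assumption: no spread means 0 percent
			continue
		}
		out[i] = 100 * (x - lo) / (hi - lo)
	}
	return out
}

func main() {
	fmt.Println(percentOfRange([]float64{1, 3, 5})) // prints [0 50 100]
}
```

As with the out-of-stream-variable approach, the price of the second pass is that every value is retained until the end of input.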
## Line-number ratios
Similarly, finding the total record count requires first reading through all the data:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint --from data/small put -q '
@records[NR] = $*;
end {
@ -109,45 +109,45 @@ mlr --opprint --from data/small put -q '
emit @records,"I"
}
' then reorder -f I,N,PCT
GENMD_EOF
GENMD-EOF
## Records having max value
The idea is to retain records having the largest value of `n` in the following data:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --itsv --opprint cat data/maxrows.tsv
GENMD_EOF
GENMD-EOF
Of course, the largest value of `n` isn't known until after all data have been read. Using an [out-of-stream variable](reference-dsl-variables.md#out-of-stream-variables) we can [retain all records as they are read](operating-on-all-records.md), then filter them at the end:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/maxrows.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv
GENMD_EOF
GENMD-EOF
## Feature-counting
Suppose you have some [heterogeneous data](record-heterogeneity.md) like this:
GENMD_INCLUDE_ESCAPED(data/features.json)
GENMD-INCLUDE-ESCAPED(data/features.json)
A reasonable question to ask is, how many occurrences of each field are there? And, what percentage of total row count has each of them? Since the denominator of the percentage is not known until the end, this is a two-pass algorithm:
GENMD_INCLUDE_ESCAPED(data/feature-count.mlr)
GENMD-INCLUDE-ESCAPED(data/feature-count.mlr)
Then
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json put -q -f data/feature-count.mlr data/features.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json
GENMD_EOF
GENMD-EOF
## Unsparsing
@ -157,35 +157,35 @@ There is a keystroke-saving verb for this: [unsparsify](reference-verbs.md#unspa
For example, suppose you have JSON input like this:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/sparse.json
GENMD_EOF
GENMD-EOF
There are field names `a`, `b`, `v`, `u`, `x`, `w` in the data -- but not all in every record. Since we don't know the names of all the keys until we've read them all, this needs to be a two-pass algorithm. On the first pass, remember all the unique key names and all the records; on the second pass, loop through the records filling in absent values, then producing output. Use `put -q` since we don't want to produce per-record output, only emitting output in the `end` block:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/unsparsify.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --json put -q -f data/unsparsify.mlr data/sparse.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --ocsv put -q -f data/unsparsify.mlr data/sparse.json
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json
GENMD_EOF
GENMD-EOF
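The two-pass idea described above (first remember all key names and all records, then fill in absent values) can be sketched as follows. This is an illustration only, not the contents of `data/unsparsify.mlr`: `unsparsify` here is a hypothetical Go helper, records are modeled as string maps, and absent values are filled with the empty string. Go maps are unordered, so this sketch does not preserve field order.

```go
package main

import "fmt"

// unsparsify returns copies of the records in which every record has
// every key seen anywhere in the input; absent values become "".
func unsparsify(records []map[string]string) []map[string]string {
	// First pass: collect the union of all key names.
	keys := make(map[string]bool)
	for _, rec := range records {
		for k := range rec {
			keys[k] = true
		}
	}
	// Second pass: fill in absent values.
	out := make([]map[string]string, 0, len(records))
	for _, rec := range records {
		filled := make(map[string]string, len(keys))
		for k := range keys {
			filled[k] = ""
		}
		for k, v := range rec {
			filled[k] = v
		}
		out = append(out, filled)
	}
	return out
}

func main() {
	records := []map[string]string{
		{"a": "1", "b": "2"},
		{"u": "3"},
	}
	for _, rec := range unsparsify(records) {
		fmt.Println(rec)
	}
}
```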
## Mean without/with oosvars
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint stats1 -a mean -f x data/medium
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put -q '
@x_sum += $x;
@x_count += 1;
@ -194,15 +194,15 @@ mlr --opprint put -q '
emit @x_mean
}
' data/medium
GENMD_EOF
GENMD-EOF
## Keyed mean without/with oosvars
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint stats1 -a mean -f x -g a,b data/medium
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put -q '
@x_sum[$a][$b] += $x;
@x_count[$a][$b] += 1;
@ -213,45 +213,45 @@ mlr --opprint put -q '
emit @x_mean, "a", "b"
}
' data/medium
GENMD_EOF
GENMD-EOF
## Variance and standard deviation without/with oosvars
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat variance.mlr
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab put -q -f variance.mlr data/medium
GENMD_EOF
GENMD-EOF
You can also do this keyed, of course, imitating the keyed-mean example above.
## Min/max without/with oosvars
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab stats1 -a min,max -f x data/medium
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --oxtab put -q '
@x_min = min(@x_min, $x);
@x_max = max(@x_max, $x);
end{emitf @x_min, @x_max}
' data/medium
GENMD_EOF
GENMD-EOF
## Keyed min/max without/with oosvars
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint stats1 -a min,max -f x -g a data/medium
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint --from data/medium put -q '
@min[$a] = min(@min[$a], $x);
@max[$a] = max(@max[$a], $x);
@ -259,44 +259,44 @@ mlr --opprint --from data/medium put -q '
emit (@min, @max), "a";
}
'
GENMD_EOF
GENMD-EOF
## Delta without/with oosvars
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint step -a delta -f x data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put '
$x_delta = is_present(@last) ? $x - @last : 0;
@last = $x
' data/small
GENMD_EOF
GENMD-EOF
## Keyed delta without/with oosvars
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint step -a delta -f x -g a data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put '
$x_delta = is_present(@last[$a]) ? $x - @last[$a] : 0;
@last[$a]=$x
' data/small
GENMD_EOF
GENMD-EOF
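The keyed-delta logic amounts to remembering the last value seen per key. A minimal Go sketch of that bookkeeping, assuming parallel slices of keys and values (`keyedDelta` is a hypothetical name, not part of Miller):

```go
package main

import "fmt"

// keyedDelta returns, for each value, its difference from the previous
// value having the same key, or 0 for the first value of each key.
// It assumes keys and xs have the same length.
func keyedDelta(keys []string, xs []float64) []float64 {
	last := make(map[string]float64)
	out := make([]float64, len(xs))
	for i, x := range xs {
		if prev, ok := last[keys[i]]; ok {
			out[i] = x - prev
		}
		last[keys[i]] = x
	}
	return out
}

func main() {
	keys := []string{"pan", "eks", "pan", "eks"}
	xs := []float64{1, 10, 4, 15}
	fmt.Println(keyedDelta(keys, xs)) // prints [0 0 3 5]
}
```

The `ok` check on the map lookup plays the role that `is_present` plays in the DSL expression.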
## Exponentially weighted moving averages without/with oosvars
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint step -a ewma -d 0.1 -f x data/small
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --opprint put '
begin{ @a=0.1 };
$e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
@e=$e
' data/small
GENMD_EOF
GENMD-EOF
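The recurrence in the DSL version is e = x for the first record and e = a * x + (1 - a) * e_prev afterward. A minimal Go sketch of the same recurrence, with `ewma` as a hypothetical helper name and a smoothing factor chosen only to make the arithmetic easy to follow:

```go
package main

import "fmt"

// ewma returns the exponentially weighted moving average of xs with
// smoothing factor alpha: the first output is the first input, and each
// later output is alpha*x + (1-alpha)*previous output.
func ewma(xs []float64, alpha float64) []float64 {
	out := make([]float64, len(xs))
	for i, x := range xs {
		if i == 0 {
			out[i] = x
		} else {
			out[i] = alpha*x + (1-alpha)*out[i-1]
		}
	}
	return out
}

func main() {
	fmt.Println(ewma([]float64{2, 4, 8}, 0.5)) // prints [2 3 5.5]
}
```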


@ -6,21 +6,21 @@ How does Miller fit within the Unix toolkit (`grep`, `sed`, `awk`, etc.)?
Miller respects CSV headers. If you do `mlr --csv cat *.csv` then the header line is written once:
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/a.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
cat data/b.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv cat data/a.csv data/b.csv
GENMD_EOF
GENMD-EOF
GENMD_RUN_COMMAND
GENMD-RUN-COMMAND
mlr --csv sort -nr b data/a.csv data/b.csv
GENMD_EOF
GENMD-EOF
Likewise with `mlr sort`, `mlr tac`, and so on.


@ -1,47 +1,14 @@
# Quickstart
A TL;DR for anyone wanting to compile and run the Go port of Miller:
# Quickstart for developers
* `go build` -- produces the `mlr` executable
* Miller has tens of unit tests and thousands of regression tests:
* `go test mlr/src/...` runs the unit tests.
* `go test` runs the regression tests. This runs the same tests that `mlr regtest` runs by default, but note that (see `mlr regtest -h`) the latter gives you more options.
* `./mlr regtest` -- runs `regtest/cases`, which are cases passing on all platforms
* `./mlr regtest regtest/cases-pending-go-port` -- needing Go code to be ported from C
* `./mlr regtest regtest/cases-pending-windows` -- for Go code already ported from C but needing some work for Windows
* `go test` or `mlr regtest` runs the regression tests in `regtest/cases/`. Using `mlr regtest -h` you can see more options available than are exposed by `go test`.
Pre-release/rough-draft docs are at http://johnkerl.org/miller6.
# Continuous integration
See also the tracking issue (somewhat redundant to this README file) https://github.com/johnkerl/miller/issues/372.
A note on Continuous Integration:
* The C implementation is auto-built for Linux using Travis: see [../.travis.yml](../.travis.yml).
* The C implementation is also auto-built for Windows using Appveyor: see [../appveyor.yml](../appveyor.yml). However, I find that it often breaks and I'm bewildered as to how to fix it.
* See also [../README.md](../README.md).
* The Go implementation is auto-built using GitHub Actions: see [../.github/workflows/go.yml](../.github/workflows/go.yml). This works splendidly on Linux, MacOS, and Windows.
# Status of the Go port
* This will be a full Go port of [Miller](https://miller.readthedocs.io/). Things are currently rough and iterative and incomplete. I don't have a firm timeline but I suspect it will take a few more months of late-evening/spare-time work.
* The released Go port will become Miller 6.0. As noted below, this will be a win both at the source-code level, and for users of Miller.
* I hope to retain backward compatibility at the command-line level as much as possible.
* In the meantime I will still keep fixing bugs, doing some features, etc. in C on Miller 5.x -- in the near term, support for Miller's C implementation continues as before.
# Trying out the Go port
* Caveat: *lots* of things present in the C implementation are currently missing in the Go implementation. So if something doesn't work, it's almost certainly because it doesn't work *yet*.
* That said, if anyone is interested in playing around with it and giving early feedback, I'll be happy for it.
* Building:
* Clone the Miller repo
* `cd go`
* `./build` should create `mlr`. If it doesn't do this on your platform, please [file an issue](https://github.com/johnkerl/miller/issues).
* Platforms tried so far:
* macOS with Go 1.14 and 1.16, Linux Mint with Go 1.10 and 1.16, and Windows 10 with Go 1.16
* On-line help:
* `mlr --help` advertises some things the Go implementation doesn't actually do yet.
* `mlr --help-all-verbs` correctly lists verbs which do things in the Go implementation.
* See also https://github.com/johnkerl/miller/issues/372
* See also [../README.md](../README.md).
# Benefits of porting to Go
@ -62,10 +29,6 @@ A note on Continuous Integration:
* Go is an up-and-coming language, with good reason -- it's mature, stable, with few of C's weaknesses and many of C's strengths.
* The source code will be easier to read/maintain/write, by myself and others.
# Things which may change
Please see https://github.com/johnkerl/miller/issues/372.
# Efficiency of the Go port
As I wrote [here](http://johnkerl.org/miller/doc/whyc.html) back in 2015 I couldn't get Rust or Go (or any other language I tried) to do some test-case processing as quickly as C, so I stuck with C.


@ -245,7 +245,7 @@ More example filter expressions:
Using 'any' higher-order function to see if $index is 10, 20, or 30:
'any([10,20,30], func(e) {return $index == e})'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
================================================================
flatten
@ -683,7 +683,7 @@ More example put expressions:
end{emitf @min, @max}
'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
================================================================
regularize


@ -51,7 +51,7 @@ func getPrompt2() string {
func (repl *Repl) printStartupBanner() {
if repl.inputIsTerminal {
fmt.Printf("Miller %s REPL for %s:%s:%s\n", version.STRING, runtime.GOOS, runtime.GOARCH, runtime.Version())
fmt.Printf("Pre-release docs for Miller 6: %s\n", lib.DOC_URL)
fmt.Printf("Docs: %s\n", lib.DOC_URL)
fmt.Printf("Type ':h' or ':help' for online help; ':q' or ':quit' to quit.\n")
}
}


@ -23,7 +23,7 @@ type RecordReaderCSV struct {
// ----------------------------------------------------------------
func NewRecordReaderCSV(readerOptions *cli.TReaderOptions) (*RecordReaderCSV, error) {
if readerOptions.IRS != "\n" {
return nil, errors.New("CSV IRS can only be newline")
return nil, errors.New("CSV IRS can only be newline; LF vs CR/LF is autodetected.")
}
if len(readerOptions.IFS) != 1 {
return nil, errors.New("CSV IFS can only be a single character")


@ -1,6 +1,3 @@
package lib
// DOC_URL is for the current location of Miller 6 docs on the web.
// Miller 5 is released and its docs are at https://miller.readthedocs.io/en/latest.
// Miller 6 is pre-release and its docs are at the following location.
const DOC_URL = "https://johnkerl.org/miller6"
const DOC_URL = "https://miller.readthedocs.io"


@ -1,10 +1,28 @@
================================================================
PUNCHDOWN LIST
* ./configure equivalent
o make:
- windoc note 'choco install make'
- (works in GH CI due to their toolchain)
* plan:
o blockers:
- fractional-strptime
- cmp-matrices
- all-contribs
- license triple-checks
- ./configure --prefix
? alpha?
- csv irs lf/crlf ignores -- ? already is so?
- `mlr put` -> coverart
o doc / release:
- ? auto-cp go/mlr // go/mlr.exe to basedir?
- auto-path (../go/mlr) in docs dir ...
- brew as first trial -- ?
> brew macports chocolatey
ubuntu debian fedora gentoo prolinux archlinux
netbsd freebsd
* post-release:
w installing-miller.md.in
w build.md.in developer/release notes
? RTD -> GP -- ?
? twi-dm re all-contribs: all-contributors.org
* nikos materials -> fold in
@ -46,7 +64,6 @@ PUNCHDOWN LIST
o TODO in *.go & *.mi
o release notes per se
o ./configure whatever equivalent
o readthedocs -- find out what's necessary to get per-version history
* doc
o new-in-miller-6: missings:


@ -19,7 +19,7 @@ SYNOPSIS
example.csv
Please see 'mlr help topics' for more information. Please also see
https://johnkerl.org/miller6
https://miller.readthedocs.io
DESCRIPTION
@ -998,7 +998,7 @@ VERBS
Using 'any' higher-order function to see if $index is 10, 20, or 30:
'any([10,20,30], func(e) {return $index == e})'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
flatten
Usage: mlr flatten [options]
@ -1416,7 +1416,7 @@ VERBS
end{emitf @min, @max}
'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
regularize
Usage: mlr regularize [options]
@ -2671,7 +2671,7 @@ KEYWORDS FOR PUT AND FILTER
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
emitf
emitf: inserts non-indexed out-of-stream variable(s) side-by-side into the
@ -2699,7 +2699,7 @@ KEYWORDS FOR PUT AND FILTER
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern", @a, @b, @c'
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern > mytap.dat", @a, @b, @c'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
emitp
emitp: inserts an out-of-stream variable into the output record stream.
@ -2729,7 +2729,7 @@ KEYWORDS FOR PUT AND FILTER
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
end
end: defines a block of statements to be executed after input records
@ -2954,4 +2954,4 @@ SEE ALSO
2021-11-05 MILLER(1)
2021-11-06 MILLER(1)


@ -2,12 +2,12 @@
.\" Title: mlr
.\" Author: [see the "AUTHOR" section]
.\" Generator: ./mkman.rb
.\" Date: 2021-11-05
.\" Date: 2021-11-06
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "MILLER" "1" "2021-11-05" "\ \&" "\ \&"
.TH "MILLER" "1" "2021-11-06" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Portability definitions
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -38,7 +38,7 @@ Output of one verb may be chained as input to another using "then", e.g.
mlr --csv stats1 -a min,mean,max -f quantity then sort -f color example.csv
Please see 'mlr help topics' for more information.
Please also see https://johnkerl.org/miller6
Please also see https://miller.readthedocs.io
.SH "DESCRIPTION"
.sp
@ -1245,7 +1245,7 @@ More example filter expressions:
Using 'any' higher-order function to see if $index is 10, 20, or 30:
'any([10,20,30], func(e) {return $index == e})'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
.fi
.if n \{\
.RE
@ -1783,7 +1783,7 @@ More example put expressions:
end{emitf @min, @max}
'
See also https://johnkerl.org/miller6/reference-dsl for more context.
See also https://miller.readthedocs.io/reference-dsl for more context.
.fi
.if n \{\
.RE
@ -4488,7 +4488,7 @@ etc., to control the format of the output if the output is redirected. See also
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emit | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
.fi
.if n \{\
.RE
@ -4522,7 +4522,7 @@ etc., to control the format of the output if the output is redirected. See also
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern", @a, @b, @c'
Example: mlr --from f.dat put '@a=$i;@b+=$x;@c+=$y; emitf | "grep somepattern > mytap.dat", @a, @b, @c'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
.fi
.if n \{\
.RE
@ -4558,7 +4558,7 @@ etc., to control the format of the output if the output is redirected. See also
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp > stderr, @*, "index1", "index2"'
Example: mlr --from f.dat put '@sums[$a][$b]+=$x; emitp | "grep somepattern", @*, "index1", "index2"'
Please see https://johnkerl.org/miller6://johnkerl.org/miller/doc for more information.
Please see https://miller.readthedocs.io://johnkerl.org/miller/doc for more information.
.fi
.if n \{\
.RE