More docs6 proofreads

This commit is contained in:
John Kerl 2021-08-23 22:36:34 -04:00
parent 59b03433a2
commit dffaee0328
113 changed files with 1283 additions and 706 deletions

View file

@ -11,15 +11,15 @@
* You need `pip install mkdocs` (or `pip3 install mkdocs`).
* The docs include lots of live code examples which will be invoked using `mlr` which must be somewhere in your `$PATH`.
* Clone https://github.com/johnkerl/miller and cd into `docs/` within your clone.
* Clone https://github.com/johnkerl/miller and cd into `docs6/` within your clone.
* Quick-editing loop:
* In one terminal, cd to this directory and leave `mkdocs serve` running.
* In another terminal, cd to the `docs` subdirectory and edit `*.md.in`.
* In another terminal, cd to the `docs` subdirectory of `docs6` and edit `*.md.in`.
* Run `genmds` to re-create all the `*.md` files, or `genmds foo.md.in` to re-create just the `foo.md` corresponding to the `foo.md.in` file you edited.
* In your browser, visit http://127.0.0.1:8000
* Alternate editing loop:
* Leave one terminal open as a place where you will run `mkdocs build`.
* In one terminal, cd to the `docs` subdirectory and edit `*.md.in`.
* In one terminal, cd to the `docs` subdirectory of `docs6` and edit `*.md.in`.
* Generate `docs/*.md` from `docs/*.md.in`, and then from that generate the `site/*/*.html`:
* Run `genmds` to re-create all the `*.md` files, or `genmds foo.md.in` to re-create just the `foo.md` corresponding to the `foo.md.in` file you edited.
* In the first terminal, run `mkdocs build` which will populate the `site` directory.
@ -34,7 +34,7 @@
## Notes
* CSS:
* I used the Mkdocs Readthedocs theme which I like a lot. I customized `docs/extra.css` for Miller coloring/branding.
* I used the Mkdocs Readthedocs theme which I like a lot. I customized `docs6/docs/extra.css` for Miller coloring/branding.
* Live code:
* I didn't find a way to include non-Python live-code examples within Mkdocs so I adapted the pre-Mkdocs Miller-doc strategy which is to have a generator script read a template file (here, `foo.md.in`), run the marked lines, and generate the output file (`foo.md`). This is `genmds`.
* Edit the `*.md.in` files, not `*.md` directly.
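* The `genmds` script itself isn't shown in this commit, but the strategy it implements -- read a template, execute the marked command blocks, splice in their output -- is simple. Here is a rough Python sketch, assuming `GENMD_RUN_COMMAND`/`GENMD_EOF` markers as seen elsewhere in this diff; the function name and output formatting are illustrative only, not the real script:

```python
import subprocess

def expand_template(lines):
    """Expand a genmds-style template: ordinary lines pass through;
    lines between GENMD_RUN_COMMAND and GENMD_EOF are executed as a
    shell command, and both the command and its output are emitted."""
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.strip() == "GENMD_RUN_COMMAND":
            i += 1
            cmd_lines = []
            while lines[i].strip() != "GENMD_EOF":
                cmd_lines.append(lines[i])
                i += 1
            cmd = "\n".join(cmd_lines)
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            out.append("<pre>")
            out.extend(cmd_lines)                 # show the command itself
            out.append(result.stdout.rstrip("\n"))  # then its captured output
            out.append("</pre>")
        else:
            out.append(line)
        i += 1
    return out
```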

View file

@ -1,2 +0,0 @@
grep op=cache log.txt \
| mlr --idkvp --opprint stats1 -a mean -f hit -g type then sort -f type

View file

@ -1,4 +0,0 @@
mlr --from log.txt --opprint \
filter 'is_present($batch_size)' \
then step -a delta -f time,num_filtered \
then sec2gmt time

View file

@ -9,27 +9,28 @@ You can ask questions -- or answer them! -- following the links on the [Communit
Pre-release Miller documentation is at [https://github.com/johnkerl/miller/tree/main/docs6](https://github.com/johnkerl/miller/tree/main/docs6).
Clone [https://github.com/johnkerl/miller](https://github.com/johnkerl/miller) and `cd` into `docs6`.
Instructions for modifying, viewing, and submitting PRs for these are in the [docs6/README.md](https://github.com/johnkerl/miller/blob/main/docs6/README.md).
After `sudo pip install sphinx` (or `pip3`) you should be able to do `make html`.
While Miller 6 is in pre-release, these docs are not viewable at
[https://miller.readthedocs.io](https://miller.readthedocs.io) which shows Miller 5 docs.
For now, I'll push Miller-6 docs to my ISP space at
[https://johnkerl.org/miller6](https://johnkerl.org/miller6) after your PR is merged.
Edit the `*.md.in` files, then run `make html`, which regenerates the `*.md` files and runs the Sphinx document generator.
Open `_build/html/index.html` in your browser, e.g. `file:///Users/yourname/git/miller/docs6/_build/html/index.html`, to verify.
PRs are welcome at [https://github.com/johnkerl/miller](https://github.com/johnkerl/miller).
<!---
TODO: after Miller6 release when these are on RTD
Once PRs are merged, readthedocs creates [https://miller.readthedocs.io](https://miller.readthedocs.io) using the following configs:
* [https://readthedocs.org/projects/miller](https://readthedocs.org/projects/miller)
* [https://readthedocs.org/projects/miller/builds](https://readthedocs.org/projects/miller/builds)
* [https://github.com/johnkerl/miller/settings/hooks](https://github.com/johnkerl/miller/settings/hooks)
-->
## Testing
As of Miller 6's current pre-release status, the best way to test is either to build from source via [Building from source](build.md), or to get a recent binary at [https://github.com/johnkerl/miller/actions](https://github.com/johnkerl/miller/actions) -- click the latest build, then *Artifacts*. Then simply use Miller for whatever you do, and create an issue at [https://github.com/johnkerl/miller/issues](https://github.com/johnkerl/miller/issues).
Do note that as of 2021-06-17 a few things have not been ported to Miller 6 -- most notably, including localtime DSL functions and other issues.
Do note that as of mid-2021 a few things have not yet been ported to Miller 6 -- most notably, localtime DSL functions, among other gaps.
## Feature development

View file

@ -8,27 +8,28 @@ You can ask questions -- or answer them! -- following the links on the [Communit
Pre-release Miller documentation is at [https://github.com/johnkerl/miller/tree/main/docs6](https://github.com/johnkerl/miller/tree/main/docs6).
Clone [https://github.com/johnkerl/miller](https://github.com/johnkerl/miller) and `cd` into `docs6`.
Instructions for modifying, viewing, and submitting PRs for these are in the [docs6/README.md](https://github.com/johnkerl/miller/blob/main/docs6/README.md).
After `sudo pip install sphinx` (or `pip3`) you should be able to do `make html`.
While Miller 6 is in pre-release, these docs are not viewable at
[https://miller.readthedocs.io](https://miller.readthedocs.io) which shows Miller 5 docs.
For now, I'll push Miller-6 docs to my ISP space at
[https://johnkerl.org/miller6](https://johnkerl.org/miller6) after your PR is merged.
Edit the `*.md.in` files, then run `make html`, which regenerates the `*.md` files and runs the Sphinx document generator.
Open `_build/html/index.html` in your browser, e.g. `file:///Users/yourname/git/miller/docs6/_build/html/index.html`, to verify.
PRs are welcome at [https://github.com/johnkerl/miller](https://github.com/johnkerl/miller).
<!---
TODO: after Miller6 release when these are on RTD
Once PRs are merged, readthedocs creates [https://miller.readthedocs.io](https://miller.readthedocs.io) using the following configs:
* [https://readthedocs.org/projects/miller](https://readthedocs.org/projects/miller)
* [https://readthedocs.org/projects/miller/builds](https://readthedocs.org/projects/miller/builds)
* [https://github.com/johnkerl/miller/settings/hooks](https://github.com/johnkerl/miller/settings/hooks)
-->
## Testing
As of Miller 6's current pre-release status, the best way to test is either to build from source via [Building from source](build.md), or to get a recent binary at [https://github.com/johnkerl/miller/actions](https://github.com/johnkerl/miller/actions) -- click the latest build, then *Artifacts*. Then simply use Miller for whatever you do, and create an issue at [https://github.com/johnkerl/miller/issues](https://github.com/johnkerl/miller/issues).
Do note that as of 2021-06-17 a few things have not been ported to Miller 6 -- most notably, including localtime DSL functions and other issues.
Do note that as of mid-2021 a few things have not yet been ported to Miller 6 -- most notably, localtime DSL functions, among other gaps.
## Feature development

View file

@ -61,8 +61,24 @@ GENMD_RUN_COMMAND
cat data/ragged.csv
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv.sh)
GENMD_RUN_COMMAND
mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  @nf = NF;
  while(@nf < @maxnf) {
    @nf += 1;
    $[@nf] = ""
  }
'
GENMD_EOF
or, more simply,
GENMD_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv-2.sh)
GENMD_RUN_COMMAND
mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  while(NF < @maxnf) {
    $[NF+1] = "";
  }
'
GENMD_EOF
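Both DSL variants above implement the same pad-to-widest idea. As a rough single-pass Python equivalent (a hypothetical helper assuming plain comma-split with no CSV quoting) the logic looks like this -- note that, just as in the streaming Miller version, a short record seen before the maximum grows is not revisited:

```python
def pad_ragged(lines):
    """Right-pad each comma-separated record with empty fields so that
    every record has as many fields as the widest record seen so far.
    Mirrors the @maxnf/NF logic in the Miller DSL examples above."""
    maxnf = 0
    for line in lines:
        fields = line.split(",")
        maxnf = max(maxnf, len(fields))
        fields += [""] * (maxnf - len(fields))
        yield ",".join(fields)
```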

View file

@ -1,5 +0,0 @@
mlr put '
begin { @x_sum = 0 };
@x_sum += $x;
end { emit @x_sum }
' ./data/small

View file

@ -1,4 +0,0 @@
mlr put '
@x_sum += $x;
end { emit @x_sum }
' ./data/small

View file

@ -1,4 +0,0 @@
mlr put -q '
@x_sum += $x;
end { emit @x_sum }
' ./data/small

View file

@ -1,8 +0,0 @@
mlr put -q '
@x_count += 1;
@x_sum += $x;
end {
emit @x_count;
emit @x_sum;
}
' ./data/small

View file

@ -1 +0,0 @@
mlr stats1 -a count,sum -f x ./data/small

View file

@ -1,8 +0,0 @@
mlr put -q '
@x_count[$a] += 1;
@x_sum[$a] += $x;
end {
emit @x_count, "a";
emit @x_sum, "a";
}
' ./data/small

View file

@ -1,7 +0,0 @@
mlr --from data/medium put -q '
@x_count[$a][$b] += 1;
@x_sum[$a][$b] += $x;
end {
emit (@x_count, @x_sum), "a", "b";
}
'

View file

@ -1 +0,0 @@
mlr stats1 -a count,sum -f x -g a ./data/small

View file

@ -1,14 +0,0 @@
mlr put '
begin {
@num_total = 0;
@num_positive = 0;
};
@num_total += 1;
$x > 0.0 {
@num_positive += 1;
$y = log10($x); $z = sqrt($y)
};
end {
emitf @num_total, @num_positive
}
' data/put-gating-example-1.dkvp

View file

@ -1,10 +0,0 @@
mlr --from data/medium --opprint put -q '
@x_count[$a][$b] += 1;
@x_sum[$a][$b] += $x;
end {
for ((a, b), _ in @x_count) {
@x_mean[a][b] = @x_sum[a][b] / @x_count[a][b]
}
emit (@x_sum, @x_count, @x_mean), "a", "b"
}
'

View file

@ -1,14 +0,0 @@
mlr --opprint --from data/small put '
func f(n) {
if (is_numeric(n)) {
if (n > 0) {
return n * f(n-1);
} else {
return 1;
}
}
# implicitly return absent-null if non-numeric
}
$ox = f($x + NR);
$oi = f($i);
'

View file

@ -1 +0,0 @@
mlr --from data/small put '$xy = sqrt($x**2 + $y**2)'

View file

@ -1 +0,0 @@
mlr --from data/small put 'func f(a, b) { return sqrt(a**2 + b**2) } $xy = f($x, $y)'

View file

@ -1,10 +0,0 @@
mlr -n put --jknquoteint -q '
begin {
@myvar = {
1: 2,
3: { 4 : 5 },
6: { 7: { 8: 9 } }
}
}
end { dump }
'

View file

@ -1,16 +0,0 @@
mlr -n put --jknquoteint -q '
begin {
@myvar = {
1: 2,
3: { 4 : 5 },
6: { 7: { 8: 9 } }
}
}
end {
for (k, v in @myvar) {
print
"key=" . k .
",valuetype=" . typeof(v);
}
}
'

View file

@ -1,17 +0,0 @@
mlr -n put --jknquoteint -q '
begin {
@myvar = {
1: 2,
3: { 4 : 5 },
6: { 7: { 8: 9 } }
}
}
end {
for ((k1, k2), v in @myvar) {
print
"key1=" . k1 .
",key2=" . k2 .
",valuetype=" . typeof(v);
}
}
'

View file

@ -1,17 +0,0 @@
mlr -n put --jknquoteint -q '
begin {
@myvar = {
1: 2,
3: { 4 : 5 },
6: { 7: { 8: 9 } }
}
}
end {
for ((k1, k2), v in @myvar[6]) {
print
"key1=" . k1 .
",key2=" . k2 .
",valuetype=" . typeof(v);
}
}
'

View file

@ -1,11 +0,0 @@
mlr --pprint --from data/for-srec-example.tbl put '
$sum1 = $f1 + $f2 + $f3;
$sum2 = 0;
$sum3 = 0;
for (key, value in $*) {
if (key =~ "^f[0-9]+") {
$sum2 += value;
$sum3 += $[key];
}
}
'

View file

@ -1,10 +0,0 @@
mlr --from data/small --opprint put '
$sum1 = 0;
$sum2 = 0;
for (k,v in $*) {
if (is_numeric(v)) {
$sum1 +=v;
$sum2 += $[k];
}
}
'

View file

@ -1,9 +0,0 @@
mlr --from data/small --opprint put '
sum = 0;
for (k,v in $*) {
if (is_numeric(v)) {
sum += $[k];
}
}
$sum = sum
'

View file

@ -1,15 +0,0 @@
mlr put '
begin {
@i_cumu = 0;
}
@i_cumu += $i;
$* = {
"z": $x + y,
"KEYFIELD": $a,
"i": @i_cumu,
"b": $b,
"y": $x,
"x": $y,
};
' data/small

View file

@ -1,3 +0,0 @@
mlr --oxtab stats1 -f x -a p25,p75 \
then put '$x_iqr = $x_p75 - $x_p25' \
data/medium

View file

@ -1,7 +0,0 @@
mlr --oxtab stats1 --fr '[i-z]' -a p25,p75 \
then put 'for (k,v in $*) {
if (k =~ "(.*)_p25") {
$["\1_iqr"] = $["\1_p75"] - $["\1_p25"]
}
}' \
data/medium

View file

@ -1,10 +0,0 @@
mlr --opprint put -q '
@x_sum[$a][$b] += $x;
@x_count[$a][$b] += 1;
end{
for ((a, b), v in @x_sum) {
@x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
}
emit @x_mean, "a", "b"
}
' data/medium

View file

@ -1,7 +0,0 @@
mlr --opprint --from data/medium put -q '
@min[$a] = min(@min[$a], $x);
@max[$a] = max(@max[$a], $x);
end{
emit (@min, @max), "a";
}
'

View file

@ -1,16 +0,0 @@
# Here I'm using a specified random-number seed so this example always
# produces the same output for this web document: in everyday practice we
# would leave off the --seed 12345 part.
mlr --seed 12345 seqgen --start 1 --stop 10 then put '
func f(a, b) { # function arguments a and b
r = 0.0; # local r scoped to the function
for (int i = 0; i < 6; i += 1) { # local i scoped to the for-loop
num u = urand(); # local u scoped to the for-loop
r += u; # updates r from the enclosing scope
}
r /= 6;
return a + (b - a) * r;
}
num o = f(10, 20); # local to the top-level scope
$o = o;
'

View file

@ -1,7 +0,0 @@
mlr --opprint put '
$* = {
"a": $i,
"i": $a,
"y": $y * 10,
}
' data/small

View file

@ -1,7 +0,0 @@
mlr --from data/small put '
func f(map m): map {
m["x"] *= 200;
return m;
}
$* = f({"a": $a, "x": $x});
'

View file

@ -1,19 +0,0 @@
mlr --from data/small put -q '
begin {
@o = {
"nrec": 0,
"nkey": {"numeric":0, "non-numeric":0},
};
}
@o["nrec"] += 1;
for (k, v in $*) {
if (is_numeric(v)) {
@o["nkey"]["numeric"] += 1;
} else {
@o["nkey"]["non-numeric"] += 1;
}
}
end {
dump @o;
}
'

View file

@ -1,8 +0,0 @@
mlr --opprint put -q '
@x_sum += $x;
@x_count += 1;
end {
@x_mean = @x_sum / @x_count;
emit @x_mean
}
' data/medium

View file

@ -1,5 +0,0 @@
mlr --from data/miss-date.csv --icsv \
cat -n \
then put '$datestamp = strptime($date, "%Y-%m-%d")' \
then step -a delta -f datestamp \
| head

View file

@ -1,5 +0,0 @@
mlr --from data/miss-date.csv --icsv \
cat -n \
then put '$datestamp = strptime($date, "%Y-%m-%d")' \
then step -a delta -f datestamp \
then filter '$datestamp_delta != 86400 && $n != 1'

View file

@ -1,5 +0,0 @@
mlr --icsv --opprint \
join -j color --ul --ur -f data/prevtemp.csv \
then unsparsify --fill-with 0 \
then put '$count_delta = $current_count - $previous_count' \
data/currtemp.csv

View file

@ -1,7 +0,0 @@
mlr --opprint put '
$nf = NF;
$nr = NR;
$fnr = FNR;
$filenum = FILENUM;
$filename = FILENAME
' data/small data/small2

View file

@ -1,6 +0,0 @@
mlr --from data/ragged.csv --fs comma --nidx put '
@maxnf = max(@maxnf, NF);
while(NF < @maxnf) {
$[NF+1] = "";
}
'

View file

@ -1,8 +0,0 @@
mlr --from data/ragged.csv --fs comma --nidx put '
@maxnf = max(@maxnf, NF);
@nf = NF;
while(@nf < @maxnf) {
@nf += 1;
$[@nf] = ""
}
'

View file

@ -1,10 +0,0 @@
mlr --from data/rect.txt put -q '
is_present($outer) {
unset @r
}
for (k, v in $*) {
@r[k] = v
}
is_present($inner1) {
emit @r
}'

View file

@ -1,8 +0,0 @@
mlr --from data/small put '
print "NR = ".NR;
for (key in $*) {
value = $[key];
print " key:" . key . " value:".value;
}
'

View file

@ -1,8 +0,0 @@
mlr -n put '
end {
o = {1:2, 3:{4:5}};
for (key in o) {
print " key:" . key . " valuetype:" . typeof(o[key]);
}
}
'

View file

@ -1,17 +0,0 @@
mlr --opprint --from data/small put -q '
begin {
@call_count = 0;
}
subr s(n) {
@call_count += 1;
if (is_numeric(n)) {
if (n > 1) {
call s(n-1);
} else {
print "numcalls=" . @call_count;
}
}
}
print "NR=" . NR;
call s(NR);
'

View file

@ -1,17 +0,0 @@
mlr --csvlite --from data/a.csv put '
func f(
num a,
num b,
): num {
return a**2 + b**2;
}
$* = {
"s": $a + $b,
"t": $a - $b,
"u": f(
$a,
$b,
),
"v": NR,
}
'

View file

@ -1,7 +0,0 @@
mlr --from data/small --opprint put '
num suma = 0;
for (a = 1; a <= NR; a += 1) {
suma += a;
}
$suma = suma;
'

View file

@ -1,10 +0,0 @@
mlr --from data/small --opprint put '
num suma = 0;
num sumb = 0;
for (num a = 1, num b = 1; a <= NR; a += 1, b *= 2) {
suma += a;
sumb += b;
}
$suma = suma;
$sumb = sumb;
'

View file

@ -1,24 +0,0 @@
mlr --from data/medium put -q '
# Using the y field for weighting in this example
weight = $y;
# Using the a field for weighted aggregation in this example
@sumwx[$a] += weight * $i;
@sumw[$a] += weight;
@sumx[$a] += $i;
@sumn[$a] += 1;
end {
map wmean = {};
map mean = {};
for (a in @sumwx) {
wmean[a] = @sumwx[a] / @sumw[a]
}
for (a in @sumx) {
mean[a] = @sumx[a] / @sumn[a]
}
#emit wmean, "a";
#emit mean, "a";
emit (wmean, mean), "a";
}'

View file

@ -1,6 +0,0 @@
echo x=1,y=2 | mlr put '
while (NF < 10) {
$[NF+1] = ""
}
$foo = "bar"
'

View file

@ -1,9 +0,0 @@
echo x=1,y=2 | mlr put '
do {
$[NF+1] = "";
if (NF == 5) {
break
}
} while (NF < 10);
$foo = "bar"
'

View file

@ -1,5 +1,5 @@
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
# Dates and times
# Date/time examples
## How can I filter by date?

View file

@ -1,4 +1,4 @@
# Dates and times
# Date/time examples
## How can I filter by date?
@ -32,11 +32,23 @@ GENMD_EOF
Since there are 1372 lines in the data file, some automation is called for. To find the missing dates, you can convert the dates to seconds since the epoch using `strptime`, then compute adjacent differences (the `cat -n` simply inserts record-counters):
GENMD_INCLUDE_AND_RUN_ESCAPED(data/miss-date-1.sh)
GENMD_RUN_COMMAND
mlr --from data/miss-date.csv --icsv \
  cat -n \
  then put '$datestamp = strptime($date, "%Y-%m-%d")' \
  then step -a delta -f datestamp \
| head
GENMD_EOF
Then, filter for adjacent difference not being 86400 (the number of seconds in a day):
GENMD_INCLUDE_AND_RUN_ESCAPED(data/miss-date-2.sh)
GENMD_RUN_COMMAND
mlr --from data/miss-date.csv --icsv \
  cat -n \
  then put '$datestamp = strptime($date, "%Y-%m-%d")' \
  then step -a delta -f datestamp \
  then filter '$datestamp_delta != 86400 && $n != 1'
GENMD_EOF
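The adjacent-difference idea in the pipeline above can be sketched in Python as well -- a hypothetical helper, assuming one ISO `yyyy-mm-dd` date per record, which flags any pair of consecutive dates that are not exactly one day apart:

```python
from datetime import date, timedelta

def find_gaps(datestamps):
    """Given a list of ISO yyyy-mm-dd strings, return (n, previous, current)
    tuples wherever consecutive entries are not exactly one day apart --
    the same adjacent-difference idea as the strptime/step-delta pipeline."""
    gaps = []
    prev = None
    for n, s in enumerate(datestamps, start=1):
        d = date.fromisoformat(s)
        if prev is not None and (d - prev) != timedelta(days=1):
            gaps.append((n, prev.isoformat(), s))
        prev = d
    return gaps
```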
Given this, it's now easy to see where the gaps are:

View file

@ -1,4 +0,0 @@
mlr --c2p put '
$cost = $quantity * $rate;
$index *= 100
' example.csv

Binary file not shown.

View file

@ -16,9 +16,17 @@ GENMD_EOF
Each print statement simply contains local information: the current timestamp, whether a particular cache was hit or not, etc. Then using either the system `grep` command, or Miller's `having-fields`, or `is_present`, we can pick out the parts we want and analyze them:
GENMD_INCLUDE_AND_RUN_ESCAPED(10-1.sh)
GENMD_RUN_COMMAND
grep op=cache log.txt \
| mlr --idkvp --opprint stats1 -a mean -f hit -g type then sort -f type
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(10-2.sh)
GENMD_RUN_COMMAND
mlr --from log.txt --opprint \
  filter 'is_present($batch_size)' \
  then step -a delta -f time,num_filtered \
  then sec2gmt time
GENMD_EOF
Alternatively, we can simply group the similar data for a better look:

View file

@ -1,7 +1,7 @@
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
# Manual page
This is simply a copy of what you should see on running **man mlr** at a command prompt, once Miller is installed on your system.
This is simply a copy of what you should see on running `man mlr` at a command prompt, once Miller is installed on your system.
<pre class="pre-non-highlight-non-pair">
MILLER(1) MILLER(1)

View file

@ -1,5 +1,5 @@
# Manual page
This is simply a copy of what you should see on running **man mlr** at a command prompt, once Miller is installed on your system.
This is simply a copy of what you should see on running `man mlr` at a command prompt, once Miller is installed on your system.
GENMD_INCLUDE_ESCAPED(manpage.txt)

View file

@ -137,7 +137,13 @@ GENMD_EOF
Then, join on the key field(s), and use unsparsify to zero-fill counters absent on one side but present on the other. Use `--ul` and `--ur` to emit unpaired records (namely, purple on the left and yellow on the right):
GENMD_INCLUDE_AND_RUN_ESCAPED(data/previous-to-current.sh)
GENMD_RUN_COMMAND
mlr --icsv --opprint \
  join -j color --ul --ur -f data/prevtemp.csv \
  then unsparsify --fill-with 0 \
  then put '$count_delta = $current_count - $previous_count' \
  data/currtemp.csv
GENMD_EOF
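The join/fill/delta pipeline above amounts to a full outer join with zero-fill. As a rough Python sketch (the `color`/`count` column names here are assumptions for illustration; the actual `prevtemp.csv`/`currtemp.csv` schemas are not shown in this diff):

```python
import csv
from io import StringIO

def count_deltas(prev_csv, curr_csv):
    """Full-outer-join two {color,count} tables on color, fill missing
    counts with 0, and compute current - previous -- mirroring the
    join --ul --ur / unsparsify --fill-with 0 / put pipeline above."""
    prev = {r["color"]: int(r["count"]) for r in csv.DictReader(StringIO(prev_csv))}
    curr = {r["color"]: int(r["count"]) for r in csv.DictReader(StringIO(curr_csv))}
    # Union of keys emulates --ul/--ur (keep unpaired records on both sides).
    return {color: curr.get(color, 0) - prev.get(color, 0)
            for color in sorted(set(prev) | set(curr))}
```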
## Memoization with out-of-stream variables

View file

@ -5,7 +5,7 @@ See also the [list of issues tagged with go-port](https://github.com/johnkerl/mi
## Documentation improvements
Documentation (what you're reading here) and on-line help (`mlr --help`) have been completely reworked.
Documentation (what you're reading here) and online help (`mlr --help`) have been completely reworked.
In the initial release, the focus was convincing users already familiar with
`awk`/`grep`/`cut` that Miller was a viable alternative -- but over time it's
@ -45,7 +45,7 @@ Binaries are reliably available using GitHub Actions: see also [Installation](in
## In-process support for compressed input
In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, if your files end in `.gz` you don't even need to do that -- Miller will autodetect by file extension and automatically uncompress `mlr --csv cat foo.csv.gz`. Similarly for `.z` and `.bz2` files. Please see section [TODO:linkify] for more information.
In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, if your files end in `.gz` you don't even need to do that -- Miller will autodetect by file extension and automatically uncompress `mlr --csv cat foo.csv.gz`. Similarly for `.z` and `.bz2` files. Please see the page on [Compressed data](reference-main-compressed-data.md) for more information.
## Output colorization

View file

@ -4,7 +4,7 @@ See also the [list of issues tagged with go-port](https://github.com/johnkerl/mi
## Documentation improvements
Documentation (what you're reading here) and on-line help (`mlr --help`) have been completely reworked.
Documentation (what you're reading here) and online help (`mlr --help`) have been completely reworked.
In the initial release, the focus was convincing users already familiar with
`awk`/`grep`/`cut` that Miller was a viable alternative -- but over time it's
@ -44,7 +44,7 @@ Binaries are reliably available using GitHub Actions: see also [Installation](in
## In-process support for compressed input
In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, if your files end in `.gz` you don't even need to do that -- Miller will autodetect by file extension and automatically uncompress `mlr --csv cat foo.csv.gz`. Similarly for `.z` and `.bz2` files. Please see section [TODO:linkify] for more information.
In addition to `--prepipe gunzip`, you can now use the `--gzin` flag. In fact, if your files end in `.gz` you don't even need to do that -- Miller will autodetect by file extension and automatically uncompress `mlr --csv cat foo.csv.gz`. Similarly for `.z` and `.bz2` files. Please see the page on [Compressed data](reference-main-compressed-data.md) for more information.
## Output colorization

View file

@ -221,4 +221,4 @@ Options:
## Manual page
If you've gotten Miller from a package installer, you should have `man mlr` producing a traditional manual page.
If not, no worries -- the manual page is a concatenated listing of the same information also available by each of the topics in `mlr help topics`.
If not, no worries -- the manual page is a concatenated listing of the same information also available under each of the topics in `mlr help topics`. See also the [Manual page](manpage.md), which is an online copy.

View file

@ -83,4 +83,4 @@ GENMD_EOF
## Manual page
If you've gotten Miller from a package installer, you should have `man mlr` producing a traditional manual page.
If not, no worries -- the manual page is a concatenated listing of the same information also available by each of the topics in `mlr help topics`.
If not, no worries -- the manual page is a concatenated listing of the same information also available under each of the topics in `mlr help topics`. See also the [Manual page](manpage.md), which is an online copy.

View file

@ -1,5 +0,0 @@
mlr --opprint put '
begin{ @a=0.1 };
$e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
@e=$e
' data/small

View file

@ -1,6 +0,0 @@
mlr --opprint put -q '
@x_sum[$b] += $x;
end {
emit @x_sum, "b"
}
' data/medium

View file

@ -1,6 +0,0 @@
mlr --oxtab put -q '
@x_sum += $x;
end {
emit @x_sum
}
' data/medium

View file

@ -52,4 +52,20 @@ GENMD_RUN_COMMAND
cat data/small
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/full-reorg.sh)
GENMD_RUN_COMMAND
mlr put '
  begin {
    @i_cumu = 0;
  }
  @i_cumu += $i;
  $* = {
    "z": $x + $y,
    "KEYFIELD": $a,
    "i": @i_cumu,
    "b": $b,
    "y": $x,
    "x": $y,
  };
' data/small
GENMD_EOF
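The full-record `$* = {...}` assignment above rebuilds each record from scratch, so field order and names are entirely under your control. A rough Python analogue of the same reorg (a hypothetical helper; assumes `z` is meant to be `$x + $y`):

```python
def reorg(records):
    """Rebuild each record: accumulate a running sum of i, rename a to
    KEYFIELD, swap x and y, and put z = x + y up front -- mirroring the
    $* = {...} full-record assignment in the Miller example above."""
    i_cumu = 0
    for r in records:
        i_cumu += r["i"]
        yield {
            "z": r["x"] + r["y"],
            "KEYFIELD": r["a"],
            "i": i_cumu,
            "b": r["b"],
            "y": r["x"],
            "x": r["y"],
        }
```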

View file

@ -10,7 +10,7 @@ Things having colors:
* Keys in CSV header lines, JSON keys, etc
* Values in CSV data lines, JSON scalar values, etc
* "PASS" and "FAIL" in regression-test output
* Some online-help strings
* Some [online-help](online-help.md) strings
Rules for colorization:

View file

@ -9,7 +9,7 @@ Things having colors:
* Keys in CSV header lines, JSON keys, etc
* Values in CSV data lines, JSON scalar values, etc
* "PASS" and "FAIL" in regression-test output
* Some online-help strings
* Some [online-help](online-help.md) strings
Rules for colorization:

View file

@ -29,7 +29,12 @@ GENMD_RUN_COMMAND
mlr --c2p put '$cost = $quantity * $rate; $index = $index * 100' example.csv
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(dsl-example-multiline.sh)
GENMD_RUN_COMMAND
mlr --c2p put '
  $cost = $quantity * $rate;
  $index *= 100
' example.csv
GENMD_EOF
One of Miller's key features is the ability to express data-transformation right there at the keyboard, interactively. But if you find yourself using expressions repeatedly, you can put everything between the single quotes into a file and refer to that using `put -f`:

View file

@ -1,31 +1,29 @@
----------------------------------------------------------------
ALL:
* unvisited links are still blue -- ?!?
* GENMD_INCLUDE_AND_RUN_ESCAPED -> remove and replace with GENMD_RUN_COMMAND
* example.csv rename index to something else, and add i column, update any code samples which use index
* csv to csv,tsv throughout
* rid of explicitly passing around os.Stdout in all various help functions, annoying
* het.dkvp > het.json in more places
* check each page for adequate h2 coverage
* hash-map / hashmap -> map everywhere
----------------------------------------------------------------
w compression page: make one! :)
w flatten/unflatten page: make one! :)
w flesh out arrays page!
E flatten/unflatten page: make one! :)
E memory/streaming page
E flesh out data-types page!
E flesh out arrays page!
E new maps page!
- insertion order ...
- for-k-v etc
e reduce #digits in data/small
e mcvt pass
c GOMAXPROCS -- up it? separate page maybe -- ?
- note one goroutine for in, out, & each verb
- check and respect env-var
* move aux-cmds down lower. maybe some other reorders as well.
* new different-from-other-languages page
- no ++
- 1-up arrays
for (i = 1; i <= n; i += 1) { ... }
- hash-maps are order-preserving
- single-for over array: var is value; over map: var is key
----------------------------------------------------------------
index:
@ -56,12 +54,8 @@ record-heterogeneity:
new-in-miller-6:
w flatten/unflatten -- needs a new separate page
l gzin/bz2in linkify
? TODO marks
contributing:
L add a pre-release note about https://johnkerl.org/miller6 & why no double RTD
E update for sphinx -> mkdocs. and/or link to r.md.
* ?? operator
csv-with-and-without-headers:
? Headerless CSV with duplicate field values -> typo-fix -- duplicate keys actually -- ?!?
@ -120,6 +114,7 @@ statistics-examples:
two-pass-algorithms:
l link to "new" verbs x everywhere possible
l Of course, Miller verbs such as sort, tac, etc. all must ... -> linkify to new memory/streaming page
x this (or wherever ...) maybe get rid of some of the too-many examples. pick some survivors; x the rest.
misc examples:
? Program timing & subsequents -> another page
@ -195,5 +190,9 @@ E Keep in mind that out-of-stream variables are a nested, multi-level hashmap (d
o 2 examples not 3?
o why not '--oflatsep /' respected?
reference-dsl-differences.md:
l check for linkify opportunities
manpage:
? [NEEDS READ-THROUGH]
? 'Kerl .' and 'Veith .'

BIN
docs6/docs/purple.csv.gz Normal file

Binary file not shown.

View file

@ -1,5 +1,5 @@
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
# Joins
# Questions about joins
## Why am I not seeing all possible joins occur?

View file

@ -1,4 +1,4 @@
# Joins
# Questions about joins
## Why am I not seeing all possible joins occur?

View file

@ -1,5 +1,5 @@
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
# Then-chaining
# Questions about then-chaining
## How do I examine then-chaining?

View file

@ -1,4 +1,4 @@
# Then-chaining
# Questions about then-chaining
## How do I examine then-chaining?

BIN
docs6/docs/red.csv.gz Normal file

Binary file not shown.

View file

@ -157,78 +157,95 @@ While Miller's `while` and `do-while` statements are much as in many other langu
As with `while` and `do-while`, a `break` or `continue` within nested control structures will propagate to the innermost loop enclosing them, if any, and a `break` or `continue` outside a loop is a syntax error that will be flagged as soon as the expression is parsed, before any input records are ingested.
### Key-only for-loops
### Single-variable for-loops
The `key` variable is always bound to the *key* of key-value pairs:
For [maps](reference-dsl-maps.md), the single variable is always bound to the *key* of key-value pairs:
<pre class="pre-highlight-in-pair">
<b>mlr --from data/small put '</b>
<b>mlr --from data/small put -q '</b>
<b> print "NR = ".NR;</b>
<b> for (key in $*) {</b>
<b> value = $[key];</b>
<b> print " key:" . key . " value:".value;</b>
<b> for (e in $*) {</b>
<b> print " key:", e, "value:", $[e];</b>
<b> }</b>
<b></b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
NR = 1
key:a value:pan
key:b value:pan
key:i value:1
key:x value:0.3467901443380824
key:y value:0.7268028627434533
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
key: a value: pan
key: b value: pan
key: i value: 1
key: x value: 0.3467901443380824
key: y value: 0.7268028627434533
NR = 2
key:a value:eks
key:b value:pan
key:i value:2
key:x value:0.7586799647899636
key:y value:0.5221511083334797
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
key: a value: eks
key: b value: pan
key: i value: 2
key: x value: 0.7586799647899636
key: y value: 0.5221511083334797
NR = 3
key:a value:wye
key:b value:wye
key:i value:3
key:x value:0.20460330576630303
key:y value:0.33831852551664776
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
key: a value: wye
key: b value: wye
key: i value: 3
key: x value: 0.20460330576630303
key: y value: 0.33831852551664776
NR = 4
key:a value:eks
key:b value:wye
key:i value:4
key:x value:0.38139939387114097
key:y value:0.13418874328430463
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
key: a value: eks
key: b value: wye
key: i value: 4
key: x value: 0.38139939387114097
key: y value: 0.13418874328430463
NR = 5
key:a value:wye
key:b value:pan
key:i value:5
key:x value:0.5732889198020006
key:y value:0.8636244699032729
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
key: a value: wye
key: b value: pan
key: i value: 5
key: x value: 0.5732889198020006
key: y value: 0.8636244699032729
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr -n put '</b>
<b>mlr -n put -q '</b>
<b> end {</b>
<b> o = {1:2, 3:{4:5}};</b>
<b> for (key in o) {</b>
<b> print " key:" . key . " valuetype:" . typeof(o[key]);</b>
<b> o = {"a":1, "b":{"c":3}};</b>
<b> for (e in o) {</b>
<b> print "key:", e, "valuetype:", typeof(o[e]);</b>
<b> }</b>
<b> }</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
key:1 valuetype:int
key:3 valuetype:map
key: a valuetype: int
key: b valuetype: map
</pre>
Note that the value corresponding to a given key may be gotten as through a **computed field name** using square brackets as in `$[key]` for stream records, or by indexing the looped-over variable using square brackets.
Note that the value corresponding to a given key may be gotten through a **computed field name** using square brackets, as in `$[e]` for stream records, or by indexing the looped-over variable using square brackets.
For [arrays](reference-dsl-arrays.md), the single variable is always bound to the *value* (not the array index):
<pre class="pre-highlight-in-pair">
<b>mlr -n put -q '</b>
<b> end {</b>
<b> o = [10, "20", {}, "four", true];</b>
<b> for (e in o) {</b>
<b> print "value:", e, "valuetype:", typeof(e);</b>
<b> }</b>
<b> }</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
value: 10 valuetype: int
value: 20 valuetype: string
value: {} valuetype: map
value: four valuetype: string
value: true valuetype: bool
</pre>
### Key-value for-loops
Single-level keys may be gotten at using either `for(k,v)` or `for((k),v)`; multi-level keys may be gotten at using `for((k1,k2,k3),v)` and so on. The `v` variable will be bound to a scalar value (a string or a number) if the map stops at that level, or to a map-valued variable if the map goes deeper. If the map isn't deep enough then the loop body won't be executed.
For [maps](reference-dsl-maps.md), the first loop variable is the key and the
second is the value; for [arrays](reference-dsl-arrays.md), the first loop
variable is the (1-up) array index and the second is the value.
Single-level keys may be gotten at using either `for(k,v)` or `for((k),v)`; multi-level keys may be gotten at using `for((k1,k2,k3),v)` and so on. The `v` variable will be bound to a scalar value (non-array/non-map) if the map stops at that level, or to a map-valued or array-valued variable if the map goes deeper. If the map isn't deep enough then the loop body won't be executed.
<pre class="pre-highlight-in-pair">
<b>cat data/for-srec-example.tbl</b>

View file

@ -70,9 +70,26 @@ GENMD_EOF
Miller's `while` and `do-while` are unsurprising in comparison to various languages, as are `break` and `continue`:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/while-example-1.sh)
GENMD_RUN_COMMAND
echo x=1,y=2 | mlr put '
while (NF < 10) {
$[NF+1] = ""
}
$foo = "bar"
'
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/while-example-2.sh)
GENMD_RUN_COMMAND
echo x=1,y=2 | mlr put '
do {
$[NF+1] = "";
if (NF == 5) {
break
}
} while (NF < 10);
$foo = "bar"
'
GENMD_EOF
A `break` or `continue` within nested conditional blocks or if-statements will,
of course, propagate to the innermost loop enclosing them, if any. A `break` or
@ -97,25 +114,70 @@ While Miller's `while` and `do-while` statements are much as in many other langu
As with `while` and `do-while`, a `break` or `continue` within nested control structures will propagate to the innermost loop enclosing them, if any, and a `break` or `continue` outside a loop is a syntax error that will be flagged as soon as the expression is parsed, before any input records are ingested.
### Key-only for-loops
### Single-variable for-loops
The `key` variable is always bound to the *key* of key-value pairs:
For [maps](reference-dsl-maps.md), the single variable is always bound to the *key* of key-value pairs:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/single-for-example-1.sh)
GENMD_RUN_COMMAND
mlr --from data/small put -q '
print "NR = ".NR;
for (e in $*) {
print " key:", e, "value:", $[e];
}
'
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/single-for-example-2.sh)
GENMD_RUN_COMMAND
mlr -n put -q '
end {
o = {"a":1, "b":{"c":3}};
for (e in o) {
print "key:", e, "valuetype:", typeof(o[e]);
}
}
'
GENMD_EOF
Note that the value corresponding to a given key may be gotten through a **computed field name** using square brackets as in `$[key]` for stream records, or by indexing the looped-over variable using square brackets.
Note that the value corresponding to a given key may be gotten through a **computed field name** using square brackets as in `$[e]` for stream records, or by indexing the looped-over variable using square brackets.
For [arrays](reference-dsl-arrays.md), the single variable is always bound to the *value* (not the array index):
GENMD_RUN_COMMAND
mlr -n put -q '
end {
o = [10, "20", {}, "four", true];
for (e in o) {
print "value:", e, "valuetype:", typeof(e);
}
}
'
GENMD_EOF
### Key-value for-loops
Single-level keys may be gotten at using either `for(k,v)` or `for((k),v)`; multi-level keys may be gotten at using `for((k1,k2,k3),v)` and so on. The `v` variable will be bound to a scalar value (a string or a number) if the map stops at that level, or to a map-valued variable if the map goes deeper. If the map isn't deep enough then the loop body won't be executed.
For [maps](reference-dsl-maps.md), the first loop variable is the key and the
second is the value; for [arrays](reference-dsl-arrays.md), the first loop
variable is the (1-up) array index and the second is the value.
Single-level keys may be gotten at using either `for(k,v)` or `for((k),v)`; multi-level keys may be gotten at using `for((k1,k2,k3),v)` and so on. The `v` variable will be bound to a scalar value (non-array/non-map) if the map stops at that level, or to a map-valued or array-valued variable if the map goes deeper. If the map isn't deep enough then the loop body won't be executed.
GENMD_RUN_COMMAND
cat data/for-srec-example.tbl
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/for-srec-example-1.sh)
GENMD_RUN_COMMAND
mlr --pprint --from data/for-srec-example.tbl put '
$sum1 = $f1 + $f2 + $f3;
$sum2 = 0;
$sum3 = 0;
for (key, value in $*) {
if (key =~ "^f[0-9]+") {
$sum2 += value;
$sum3 += $[key];
}
}
'
GENMD_EOF
GENMD_RUN_COMMAND
mlr --from data/small --opprint put 'for (k,v in $*) { $[k."_type"] = typeof(v) }'
@ -125,11 +187,32 @@ Note that the value of the current field in the for-loop can be gotten either us
Important note: to avoid inconsistent looping behavior in case you're setting new fields (and/or unsetting existing ones) while looping over the record, **Miller makes a copy of the record before the loop: loop variables are bound from the copy and all other reads/writes involve the record itself**:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/for-srec-example-2.sh)
GENMD_RUN_COMMAND
mlr --from data/small --opprint put '
$sum1 = 0;
$sum2 = 0;
for (k,v in $*) {
if (is_numeric(v)) {
$sum1 += v;
$sum2 += $[k];
}
}
'
GENMD_EOF
It can be confusing to modify the stream record while iterating over a copy of it, so instead you might find it simpler to use a local variable in the loop and only update the stream record after the loop:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/for-srec-example-3.sh)
GENMD_RUN_COMMAND
mlr --from data/small --opprint put '
sum = 0;
for (k,v in $*) {
if (is_numeric(v)) {
sum += $[k];
}
}
$sum = sum
'
GENMD_EOF
You can also start iterating on sub-hashmaps of an out-of-stream or local variable; you can loop over nested keys; you can loop over all out-of-stream variables. The bound variables are bound to a copy of the sub-hashmap as it was before the loop started. The sub-hashmap is specified by square-bracketed indices after `in`, and additional deeper indices are bound to loop key-variables. The terminal values are bound to the loop value-variable whenever the keys are not too shallow. The value-variable may refer to a terminal (string, number) or it may be map-valued if the map goes deeper. Example indexing is as follows:
@ -137,23 +220,106 @@ GENMD_INCLUDE_ESCAPED(data/for-oosvar-example-0a.txt)
That's confusing in the abstract, so a concrete example is in order. Suppose the out-of-stream variable `@myvar` is populated as follows:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0b.sh)
GENMD_RUN_COMMAND
mlr -n put --jknquoteint -q '
begin {
@myvar = {
1: 2,
3: { 4 : 5 },
6: { 7: { 8: 9 } }
}
}
end { dump }
'
GENMD_EOF
Then we can get at various values as follows:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0c.sh)
GENMD_RUN_COMMAND
mlr -n put --jknquoteint -q '
begin {
@myvar = {
1: 2,
3: { 4 : 5 },
6: { 7: { 8: 9 } }
}
}
end {
for (k, v in @myvar) {
print
"key=" . k .
",valuetype=" . typeof(v);
}
}
'
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0d.sh)
GENMD_RUN_COMMAND
mlr -n put --jknquoteint -q '
begin {
@myvar = {
1: 2,
3: { 4 : 5 },
6: { 7: { 8: 9 } }
}
}
end {
for ((k1, k2), v in @myvar) {
print
"key1=" . k1 .
",key2=" . k2 .
",valuetype=" . typeof(v);
}
}
'
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0e.sh)
GENMD_RUN_COMMAND
mlr -n put --jknquoteint -q '
begin {
@myvar = {
1: 2,
3: { 4 : 5 },
6: { 7: { 8: 9 } }
}
}
end {
for ((k1, k2), v in @myvar[6]) {
print
"key1=" . k1 .
",key2=" . k2 .
",valuetype=" . typeof(v);
}
}
'
GENMD_EOF
### C-style triple-for loops
These are supported as follows:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/triple-for-example-1.sh)
GENMD_RUN_COMMAND
mlr --from data/small --opprint put '
num suma = 0;
for (a = 1; a <= NR; a += 1) {
suma += a;
}
$suma = suma;
'
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/triple-for-example-2.sh)
GENMD_RUN_COMMAND
mlr --from data/small --opprint put '
num suma = 0;
num sumb = 0;
for (num a = 1, num b = 1; a <= NR; a += 1, b *= 2) {
suma += a;
sumb += b;
}
$suma = suma;
$sumb = sumb;
'
GENMD_EOF
Notes:
@ -171,23 +337,50 @@ Notes:
Miller supports an `awk`-like `begin/end` syntax. The statements in the `begin` block are executed before any input records are read; the statements in the `end` block are executed after the last input record is read. (If you want to execute some statement at the start of each file, not at the start of the first file as with `begin`, you might use a pattern/action block of the form `FNR == 1 { ... }`.) All statements outside of `begin` or `end` are, of course, executed on every input record. Semicolons separate statements inside or outside of begin/end blocks; semicolons are required between begin/end block bodies and any subsequent statement. For example:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-1.sh)
GENMD_RUN_COMMAND
mlr put '
begin { @x_sum = 0 };
@x_sum += $x;
end { emit @x_sum }
' ./data/small
GENMD_EOF
Since uninitialized out-of-stream variables default to 0 for addition/subtraction and 1 for multiplication when they appear on expression right-hand sides (not quite as in `awk`, where they'd default to 0 either way), the above can be written more succinctly as
GENMD_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-2.sh)
GENMD_RUN_COMMAND
mlr put '
@x_sum += $x;
end { emit @x_sum }
' ./data/small
GENMD_EOF
The **put -q** option suppresses printing of each output record, with only `emit` statements being output. So to get only summary outputs, you could write
GENMD_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-3.sh)
GENMD_RUN_COMMAND
mlr put -q '
@x_sum += $x;
end { emit @x_sum }
' ./data/small
GENMD_EOF
We can do similarly with multiple out-of-stream variables:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-4.sh)
GENMD_RUN_COMMAND
mlr put -q '
@x_count += 1;
@x_sum += $x;
end {
emit @x_count;
emit @x_sum;
}
' ./data/small
GENMD_EOF
This is of course (see also [here](reference-dsl.md#verbs-compared-to-dsl)) not much different from
GENMD_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-5.sh)
GENMD_RUN_COMMAND
mlr stats1 -a count,sum -f x ./data/small
GENMD_EOF
Note that it's a syntax error for begin/end blocks to refer to field names (beginning with `$`), since begin/end blocks execute outside the context of input records.

View file

@ -0,0 +1,157 @@
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
# Differences from other programming languages
The Miller programming language is intended to be straightforward and familiar,
as well as [not overly complex](reference-dsl-complexity.md). It doesn't try to
break new ground in terms of syntax; there are no classes or closures, and so
on.
While the [Principle of Least
Surprise](https://en.wikipedia.org/wiki/Principle_of_least_astonishment) is
generally followed, the following points may nonetheless be surprising.
## No ++ or --
There is no `++` or `--` [operator](reference-dsl-operators.md). To increment
`x`, use `x = x+1` or `x += 1`, and similarly for decrement.
## Semicolons as delimiters
You don't need a semicolon to end expressions, only to separate them. This
was done intentionally from the very start of Miller: you should be able to do
simple things like `mlr put '$z = $x * $y' myfile.dat` without needing a
semicolon.
Note that since you also don't need a semicolon before or after closing curly
braces (such as `begin`/`end` blocks, `if`-statements, `for`-loops, etc.) it's
easy to key in several semicolon-free statements, and then to forget a
semicolon where one is needed. The parser tries to remind you about semicolons
whenever there's a chance a missing semicolon might be involved in a parse
error.
## No autoconvert to boolean
Boolean tests in `if`/`while`/`for`/etc must always take a boolean expression:
`if (1) {...}` results in the parse error
`Miller: conditional expression did not evaluate to boolean.`, as does
`if (x) {...}` unless `x` is a variable of boolean type.
Please use `if (x != 0) {...}`, etc.
## Integer-preserving arithmetic
As discussed on the [arithmetic page](reference-main-arithmetic.md) the sum, difference, and product of two integers is again an integer, unless overflow occurs -- in which case Miller tries to convert to float in the least obtrusive way possible.
Likewise, while quotient and remainder are generally pythonic, the quotient and exponentiation of two integers is an integer when possible.
<pre class="pre-highlight-in-pair">
<b>$ mlr repl -q</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[mlr] 6/2
3
[mlr] typeof(6/2)
int
[mlr] 6/5
1.2
[mlr] typeof(6/5)
float
[mlr] typeof(7**8)
int
[mlr] typeof(7**80)
float
</pre>
## 1-up array indices
Arrays are indexed starting with 1, not 0. This is discussed in detail on the [arrays page](reference-dsl-arrays.md).
<pre class="pre-highlight-in-pair">
<b>mlr --csv --from data/short.csv cat</b>
</pre>
<pre class="pre-non-highlight-in-pair">
word,value
apple,37
ball,28
cat,54
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --csv --from data/short.csv put -q '</b>
<b> @records[NR] = $*;</b>
<b> end {</b>
<b> for (i = 1; i <= NR; i += 1) {</b>
<b> print "Record", i, "has word", @records[i]["word"];</b>
<b> }</b>
<b> }</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
Record 1 has word apple
Record 2 has word ball
Record 3 has word cat
</pre>
## Print adds spaces around multiple arguments
As seen in the previous example,
[`print`](reference-dsl-output-statements.md#print-statements) with multiple
comma-delimited arguments fills in intervening spaces for you. If you want to
avoid this, use the dot operator for string-concatenation instead.
<pre class="pre-highlight-in-pair">
<b>mlr -n put -q '</b>
<b> end {</b>
<b> print "[", "a", "b", "c", "]";</b>
<b> print "[" . "a" . "b" . "c" . "]";</b>
<b> }</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[ a b c ]
[abc]
</pre>
Similarly, a final newline is printed for you; use [`printn`](reference-dsl-output-statements.md#print-statements) to avoid this.
## Insertion-order-preserving hashmaps
Miller's hashmaps [TODO:linkify] (as in many modern languages) preserve insertion order. If you set `x["foo"]=1` and then `x["bar"]=2`, then you are guaranteed that any looping over `x` will retrieve the `"foo"` key-value pair first, and the `"bar"` key-value pair second.
<pre class="pre-highlight-in-pair">
<b>mlr -n put -q 'end {</b>
<b> x["foo"] = 1;</b>
<b> x["bar"] = 2;</b>
<b> dump x;</b>
<b> for (k,v in x) {</b>
<b> print "key", k, "value", v</b>
<b> }</b>
<b>}'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
{
"foo": 1,
"bar": 2
}
key foo value 1
key bar value 2
</pre>
## Two-variable for-loops
Miller has a [key-value loop flavor](reference-dsl-control-structures.md#key-value-for-loops): whether `x` is a map or array, in `for (k,v in x) { ... }` the `k` will be bound to successive map keys (for maps) or 1-up array indices (for arrays), and the `v` will be bound to successive map values.
## Semantics for one-variable for-loops
Miller also has a [single-variable loop flavor](reference-dsl-control-structures.md#single-variable-for-loops). If `x` is a map then `for (e in x) { ... }` binds `e` to successive map _keys_ (not values, as in PHP). But if `x` is an array then `for (e in x) { ... }` binds `e` to successive array _values_ (not indices).
## Absent-null
Miller has a somewhat novel flavor of null data called _absent_: if a record
has a field `x` then `$y=$x` creates a field `y`, but if it doesn't then the assignment
is skipped. See the [null-data page](reference-main-null-data.md) for more
information.

View file

@ -0,0 +1,131 @@
# Differences from other programming languages
The Miller programming language is intended to be straightforward and familiar,
as well as [not overly complex](reference-dsl-complexity.md). It doesn't try to
break new ground in terms of syntax; there are no classes or closures, and so
on.
While the [Principle of Least
Surprise](https://en.wikipedia.org/wiki/Principle_of_least_astonishment) is
generally followed, the following points may nonetheless be surprising.
## No ++ or --
There is no `++` or `--` [operator](reference-dsl-operators.md). To increment
`x`, use `x = x+1` or `x += 1`, and similarly for decrement.
## Semicolons as delimiters
You don't need a semicolon to end expressions, only to separate them. This
was done intentionally from the very start of Miller: you should be able to do
simple things like `mlr put '$z = $x * $y' myfile.dat` without needing a
semicolon.
Note that since you also don't need a semicolon before or after closing curly
braces (such as `begin`/`end` blocks, `if`-statements, `for`-loops, etc.) it's
easy to key in several semicolon-free statements, and then to forget a
semicolon where one is needed. The parser tries to remind you about semicolons
whenever there's a chance a missing semicolon might be involved in a parse
error.
## No autoconvert to boolean
Boolean tests in `if`/`while`/`for`/etc must always take a boolean expression:
`if (1) {...}` results in the parse error
`Miller: conditional expression did not evaluate to boolean.`, as does
`if (x) {...}` unless `x` is a variable of boolean type.
Please use `if (x != 0) {...}`, etc.
## Integer-preserving arithmetic
As discussed on the [arithmetic page](reference-main-arithmetic.md) the sum, difference, and product of two integers is again an integer, unless overflow occurs -- in which case Miller tries to convert to float in the least obtrusive way possible.
Likewise, while quotient and remainder are generally pythonic, the quotient and exponentiation of two integers is an integer when possible.
GENMD_CARDIFY_HIGHLIGHT_ONE
$ mlr repl -q
[mlr] 6/2
3
[mlr] typeof(6/2)
int
[mlr] 6/5
1.2
[mlr] typeof(6/5)
float
[mlr] typeof(7**8)
int
[mlr] typeof(7**80)
float
GENMD_EOF
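As a rough sketch of the rule, here is a Python analogy (illustrative only; the `mlr_divide` helper is hypothetical and not part of Miller):

```python
# Analogy of Miller's division rule, not Miller's implementation:
# the quotient of two ints stays int when exact, else becomes float.
def mlr_divide(a, b):
    if isinstance(a, int) and isinstance(b, int) and a % b == 0:
        return a // b  # exact division: result remains an integer
    return a / b       # inexact division: result is a float

print(mlr_divide(6, 2))  # 3
print(mlr_divide(6, 5))  # 1.2
```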
## 1-up array indices
Arrays are indexed starting with 1, not 0. This is discussed in detail on the [arrays page](reference-dsl-arrays.md).
GENMD_RUN_COMMAND
mlr --csv --from data/short.csv cat
GENMD_EOF
GENMD_RUN_COMMAND
mlr --csv --from data/short.csv put -q '
@records[NR] = $*;
end {
for (i = 1; i <= NR; i += 1) {
print "Record", i, "has word", @records[i]["word"];
}
}
'
GENMD_EOF
## Print adds spaces around multiple arguments
As seen in the previous example,
[`print`](reference-dsl-output-statements.md#print-statements) with multiple
comma-delimited arguments fills in intervening spaces for you. If you want to
avoid this, use the dot operator for string-concatenation instead.
GENMD_RUN_COMMAND
mlr -n put -q '
end {
print "[", "a", "b", "c", "]";
print "[" . "a" . "b" . "c" . "]";
}
'
GENMD_EOF
Similarly, a final newline is printed for you; use [`printn`](reference-dsl-output-statements.md#print-statements) to avoid this.
## Insertion-order-preserving hashmaps
Miller's hashmaps [TODO:linkify] (as in many modern languages) preserve insertion order. If you set `x["foo"]=1` and then `x["bar"]=2`, then you are guaranteed that any looping over `x` will retrieve the `"foo"` key-value pair first, and the `"bar"` key-value pair second.
GENMD_RUN_COMMAND
mlr -n put -q 'end {
x["foo"] = 1;
x["bar"] = 2;
dump x;
for (k,v in x) {
print "key", k, "value", v
}
}'
GENMD_EOF
## Two-variable for-loops
Miller has a [key-value loop flavor](reference-dsl-control-structures.md#key-value-for-loops): whether `x` is a map or array, in `for (k,v in x) { ... }` the `k` will be bound to successive map keys (for maps) or 1-up array indices (for arrays), and the `v` will be bound to successive map values.
## Semantics for one-variable for-loops
Miller also has a [single-variable loop flavor](reference-dsl-control-structures.md#single-variable-for-loops). If `x` is a map then `for (e in x) { ... }` binds `e` to successive map _keys_ (not values, as in PHP). But if `x` is an array then `for (e in x) { ... }` binds `e` to successive array _values_ (not indices).
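Both loop flavors can be mimicked in Python (an analogy only, not Miller syntax):

```python
# Python analogy of Miller's for-loop bindings over maps vs. arrays.
m = {"a": 1, "b": 2}
a = [10, 20, 30]

# Single-variable flavor: keys for maps, values for arrays.
map_singles = [e for e in m]        # map keys
arr_singles = [e for e in a]        # array values

# Two-variable flavor: (key, value) for maps, (1-up index, value) for arrays.
map_pairs = list(m.items())
arr_pairs = list(enumerate(a, 1))
```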
## Absent-null
Miller has a somewhat novel flavor of null data called _absent_: if a record
has a field `x` then `$y=$x` creates a field `y`, but if it doesn't then the assignment
is skipped. See the [null-data page](reference-main-null-data.md) for more
information.
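The absent-assignment rule can be sketched in Python (an analogy; `assign` is a hypothetical helper, not Miller's implementation):

```python
# Python analogy of Miller's absent-null: $y = $x assigns only when x exists.
def assign(record, dst, src):
    if src in record:              # source present: do the assignment
        record[dst] = record[src]
    return record                  # source absent: record is left untouched

print(assign({"x": 3}, "y", "x"))  # y is created
print(assign({"z": 9}, "y", "x"))  # record unchanged
```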

View file

@ -0,0 +1,4 @@
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
# Maps
TODO

View file

@ -0,0 +1,3 @@
# Maps
TODO

View file

@ -213,7 +213,18 @@ GENMD_EOF
You can emit **multiple map-valued expressions side-by-side** by
including their names in parentheses:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/emit-lashed.sh)
GENMD_RUN_COMMAND
mlr --from data/medium --opprint put -q '
@x_count[$a][$b] += 1;
@x_sum[$a][$b] += $x;
end {
for ((a, b), _ in @x_count) {
@x_mean[a][b] = @x_sum[a][b] / @x_count[a][b]
}
emit (@x_sum, @x_count, @x_mean), "a", "b"
}
'
GENMD_EOF
What this does is walk through the first out-of-stream variable (`@x_sum` in this example) as usual, then for each keylist found (e.g. `pan,wye`), include the values for the remaining out-of-stream variables (here, `@x_count` and `@x_mean`). You should use this when all out-of-stream variables in the emit statement have **the same shape and the same keylists**.

View file

@ -10,7 +10,15 @@ GENMD_EOF
Newlines within the expression are ignored, which can help increase legibility of complex expressions:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/put-multiline-example.txt)
GENMD_RUN_COMMAND
mlr --opprint put '
$nf = NF;
$nr = NR;
$fnr = FNR;
$filenum = FILENUM;
$filename = FILENAME
' data/small data/small2
GENMD_EOF
GENMD_RUN_COMMAND
mlr --opprint filter '($x > 0.5 && $y < 0.5) || ($x < 0.5 && $y > 0.5)' \
@ -22,9 +30,13 @@ GENMD_EOF
The simplest way to enter expressions for `put` and `filter` is between single quotes on the command line (see also [here](miller-on-windows.md) for Windows). For example:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/fe-example-1.sh)
GENMD_RUN_COMMAND
mlr --from data/small put '$xy = sqrt($x**2 + $y**2)'
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/fe-example-2.sh)
GENMD_RUN_COMMAND
mlr --from data/small put 'func f(a, b) { return sqrt(a**2 + b**2) } $xy = f($x, $y)'
GENMD_EOF
You may, though, find it convenient to put expressions into files for reuse, and read them
**using the -f option**. For example:
@ -75,7 +87,25 @@ GENMD_INCLUDE_ESCAPED(data/newline-example.txt)
**Trailing commas** are allowed in function/subroutine definitions, function/subroutine callsites, and map literals. This is intended for (although not restricted to) the multi-line case:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/trailing-commas.sh)
GENMD_RUN_COMMAND
mlr --csvlite --from data/a.csv put '
func f(
num a,
num b,
): num {
return a**2 + b**2;
}
$* = {
"s": $a + $b,
"t": $a - $b,
"u": f(
$a,
$b,
),
"v": NR,
}
'
GENMD_EOF
Bodies for all compound statements must be enclosed in **curly braces**, even if the body is a single statement:

View file

@ -6,7 +6,22 @@ As of Miller 5.0.0 you can define your own functions, as well as subroutines.
Here's the obligatory example of a recursive function to compute the factorial function:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/factorial-example.sh)
GENMD_RUN_COMMAND
mlr --opprint --from data/small put '
func f(n) {
if (is_numeric(n)) {
if (n > 0) {
return n * f(n-1);
} else {
return 1;
}
}
# implicitly return absent-null if non-numeric
}
$ox = f($x + NR);
$oi = f($i);
'
GENMD_EOF
Properties of user-defined functions:
@ -30,7 +45,25 @@ Properties of user-defined functions:
Example:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/subr-example.sh)
GENMD_RUN_COMMAND
mlr --opprint --from data/small put -q '
begin {
@call_count = 0;
}
subr s(n) {
@call_count += 1;
if (is_numeric(n)) {
if (n > 1) {
call s(n-1);
} else {
print "numcalls=" . @call_count;
}
}
}
print "NR=" . NR;
call s(NR);
'
GENMD_EOF
Properties of user-defined subroutines:

View file

@ -144,19 +144,53 @@ Out-of-stream variables are **read-write**: you can do `$sum=@sum`, `@sum=$sum`,
Using an index on the `@count` and `@sum` variables, we get the benefit of the `-g` (group-by) option which `mlr stats1` and various other Miller commands have:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-6.sh)
GENMD_RUN_COMMAND
mlr put -q '
@x_count[$a] += 1;
@x_sum[$a] += $x;
end {
emit @x_count, "a";
emit @x_sum, "a";
}
' ./data/small
GENMD_EOF
GENMD_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-7.sh)
GENMD_RUN_COMMAND
mlr stats1 -a count,sum -f x -g a ./data/small
GENMD_EOF
Indices can be arbitrarily deep -- here there are two or more of them:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-6a.sh)
GENMD_RUN_COMMAND
mlr --from data/medium put -q '
@x_count[$a][$b] += 1;
@x_sum[$a][$b] += $x;
end {
emit (@x_count, @x_sum), "a", "b";
}
'
GENMD_EOF
The idea is that `stats1`, and other Miller verbs, encapsulate frequently-used patterns with a minimum of keystroking (and run a little faster), whereas using out-of-stream variables you have more flexibility and control in what you do.
Begin/end blocks can be mixed with pattern/action blocks. For example:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-8.sh)
GENMD_RUN_COMMAND
mlr put '
begin {
@num_total = 0;
@num_positive = 0;
};
@num_total += 1;
$x > 0.0 {
@num_positive += 1;
$y = log10($x); $z = sqrt($y)
};
end {
emitf @num_total, @num_positive
}
' data/put-gating-example-1.dkvp
GENMD_EOF
## Local variables
@ -164,7 +198,24 @@ Local variables are similar to out-of-stream variables, except that their extent
For example:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/local-example-1.sh)
GENMD_RUN_COMMAND
# Here I'm using a specified random-number seed so this example always
# produces the same output for this web document: in everyday practice we
# would leave off the --seed 12345 part.
mlr --seed 12345 seqgen --start 1 --stop 10 then put '
func f(a, b) { # function arguments a and b
r = 0.0; # local r scoped to the function
for (int i = 0; i < 6; i += 1) { # local i scoped to the for-loop
num u = urand(); # local u scoped to the for-loop
r += u; # updates r from the enclosing scope
}
r /= 6;
return a + (b - a) * r;
}
num o = f(10, 20); # local to the top-level scope
$o = o;
'
GENMD_EOF
Things which are completely unsurprising, resembling many other languages:
@ -216,15 +267,51 @@ Miller's `put`/`filter` DSL has four kinds of hashmaps. **Stream records** are (
For example, the following swaps the input stream's `a` and `i` fields, modifies `y`, and drops the rest:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/map-literal-example-1.sh)
GENMD_RUN_COMMAND
mlr --opprint put '
$* = {
"a": $i,
"i": $a,
"y": $y * 10,
}
' data/small
GENMD_EOF
Likewise, you can assign map literals to out-of-stream variables or local variables; pass them as arguments to user-defined functions, return them from functions, and so on:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/map-literal-example-2.sh)
GENMD_RUN_COMMAND
mlr --from data/small put '
func f(map m): map {
m["x"] *= 200;
return m;
}
$* = f({"a": $a, "x": $x});
'
GENMD_EOF
Like out-of-stream and local variables, map literals can be multi-level:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/map-literal-example-3.sh)
GENMD_RUN_COMMAND
mlr --from data/small put -q '
begin {
@o = {
"nrec": 0,
"nkey": {"numeric":0, "non-numeric":0},
};
}
@o["nrec"] += 1;
for (k, v in $*) {
if (is_numeric(v)) {
@o["nkey"]["numeric"] += 1;
} else {
@o["nkey"]["non-numeric"] += 1;
}
}
end {
dump @o;
}
'
GENMD_EOF
## Type-checking

View file

@ -59,7 +59,7 @@ the body of the loop.
(You can, if you like, use the per-record statements to grow a list of records,
then loop over them all in an `end` block. This is described in the page on
[operating over all records](operating-over-all-records.md)).
[operating on all records](operating-on-all-records.md)).
To see this in action, let's take a look at the [data/short.csv](./data/short.csv) file:
@ -105,7 +105,7 @@ statement on each loop iteration.
For almost all simple uses of the Miller programming language, this implicit
looping over records is probably all you will need. (For more involved cases you
can see the pages on [operating over all records](operating-on-all-records.md),
can see the pages on [operating on all records](operating-on-all-records.md),
[out-of-stream variables](reference-dsl-variables.md#out-of-stream-variables),
and [two-pass algorithms](two-pass-algorithms.md).)

View file

@ -48,7 +48,7 @@ the body of the loop.
(You can, if you like, use the per-record statements to grow a list of records,
then loop over them all in an `end` block. This is described in the page on
[operating over all records](operating-over-all-records.md)).
[operating on all records](operating-on-all-records.md)).
To see this in action, let's take a look at the [data/short.csv](./data/short.csv) file:
@ -80,7 +80,7 @@ statement on each loop iteration.
For almost all simple uses of the Miller programming language, this implicit
looping over records is probably all you will need. (For more involved cases you
can see the pages on [operating over all records](operating-on-all-records.md),
can see the pages on [operating on all records](operating-on-all-records.md),
[out-of-stream variables](reference-dsl-variables.md#out-of-stream-variables),
and [two-pass algorithms](two-pass-algorithms.md).)

View file

@ -0,0 +1,133 @@
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
# Compressed data
As of [Miller 6](new-in-miller-6.md), Miller supports reading GZIP, BZIP2, and
ZLIB formats transparently, and in-process. And (as before Miller 6) you have a
more general `--prepipe` option to support other decompression programs.
## Automatic detection on input
If your files end in `.gz`, `.bz2`, or `.z` then Miller will autodetect by file extension:
<pre class="pre-highlight-in-pair">
<b>file gz-example.csv.gz</b>
</pre>
<pre class="pre-non-highlight-in-pair">
gz-example.csv.gz: gzip compressed data, was "gz-example.csv", last modified: Mon Aug 23 02:04:34 2021, from Unix, original size modulo 2^32 429
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --csv sort -f color gz-example.csv.gz</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
purple,square,false,10,91,72.3735,8.2430
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
yellow,triangle,true,1,11,43.6498,9.8870
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
</pre>
This will decompress the input data on the fly, while leaving the disk file unmodified. This helps you save disk space, at the cost of some additional runtime CPU usage to decompress the data.
## Manual detection on input
If the filename doesn't end in `.gz`, `.bz2`, or `.z` then you can use the flags `--gzin`, `--bz2in`, or `--zin` to let Miller know:
<pre class="pre-highlight-non-pair">
<b>mlr --csv --gzin sort -f color myfile.bin # myfile.bin has gzip contents</b>
</pre>
## External decompressors on input
Using the `--prepipe` flag, you can provide the name of any decompression
program in your `$PATH` and Miller will run it on each input file, effectively
piping the standard output of that program to Miller's standard input.
You can, of course, get by without this for single input files, for example:
<pre class="pre-highlight-in-pair">
<b>gunzip < gz-example.csv.gz | mlr --csv sort -f color</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
purple,square,false,10,91,72.3735,8.2430
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
yellow,triangle,true,1,11,43.6498,9.8870
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
</pre>
The benefit of `--prepipe` is that Miller will run the specified program once per
file, respecting file boundaries.
The prepipe command can be anything which reads from standard input and produces
data acceptable to Miller. Nominally this allows you to use whichever
decompression utilities you have installed on your system, on a per-file basis.
If the command has flags, quote them: e.g. `mlr --prepipe 'zcat -cf'`.
Note that this feature is quite general and is not limited to decompression
utilities. You can use it to apply per-file filters of your choice: e.g. `mlr
--prepipe 'head -n 10' ...`, if you like.
There is a `--prepipe` and a `--prepipex`:
* If the command normally runs with `nameofprogram < filename.ext` (such as `gunzip` or `zcat -cf` or `xz -cd`) then use `--prepipe`.
* If the command normally runs with `nameofprogram filename.ext` (such as `unzip -qc`) then use `--prepipex`.
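As a concrete sketch of that distinction, using only `gzip` so that nothing below requires `mlr` itself (the `/tmp` filenames are hypothetical): a `--prepipe`-style program reads its data from standard input, while a `--prepipex`-style program is handed the filename as an argument.

```shell
# Make a small compressed sample file.
printf 'color,qty\nred,3\nblue,5\n' > /tmp/sample.csv
gzip -f /tmp/sample.csv            # produces /tmp/sample.csv.gz

# --prepipe shape: the decompressor reads standard input.
gunzip < /tmp/sample.csv.gz

# --prepipex shape: the decompressor takes the filename as an argument.
gzip -dc /tmp/sample.csv.gz
```

Both commands print the same CSV; with Miller, these shapes would correspond to `mlr --prepipe gunzip ...` and `mlr --prepipex 'gzip -dc' ...` respectively.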
Lastly, note that if `--prepipe` or `--prepipex` is specified on the Miller
command line, it replaces any autodetect decisions that might have been made
based on the filename extension. Likewise, `--gzin`/`--bz2in`/`--zin` are ignored if
`--prepipe` or `--prepipex` is also specified.
## Compressed output
Everything said so far on this page has to do with compressed input.
For compressed output:
* Normally Miller output is to stdout, so you can pipe the output: `mlr sort -n quantity foo.csv | gzip > sorted.csv.gz`.
* For [`tee` statements](reference-dsl-output-statements.md#tee-statements), which write output to files rather than stdout, use `tee`'s redirect syntax:
<pre class="pre-highlight-non-pair">
<b>mlr --from example.csv --csv put -q '</b>
<b> filename = $color.".csv.gz";</b>
<b> tee | "gzip > ".filename, $*</b>
<b>'</b>
</pre>
<pre class="pre-highlight-in-pair">
<b>file red.csv.gz purple.csv.gz yellow.csv.gz</b>
</pre>
<pre class="pre-non-highlight-in-pair">
red.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 185
purple.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 164
yellow.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 158
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --csv cat yellow.csv.gz</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
</pre>
* Using the [in-place flag](reference-main-io-options.md#in-place-mode) `-I`,
as of August 2021 the overwritten file will _not_ be compressed as it was when it was read:
e.g. `mlr -I --csv cat gz-example.csv.gz` will write `gz-example.csv.gz` which contains
plain, uncompressed CSV content. This is a bug and will be fixed.
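Until that's fixed, you can get the compressed-in-place effect manually with a temp file. The sketch below is pure shell, with `sort` standing in for whatever Miller verb chain you'd run, so it doesn't assume `mlr` is installed; the `/tmp` filenames are hypothetical.

```shell
# Make a compressed input file.
printf 'b\na\nc\n' > /tmp/example.txt
gzip -f /tmp/example.txt                      # produces /tmp/example.txt.gz

# Decompress, transform, recompress to a temp file, then move into place.
gzip -dc /tmp/example.txt.gz | sort | gzip > /tmp/example.txt.gz.tmp
mv /tmp/example.txt.gz.tmp /tmp/example.txt.gz

# The file on disk is still gzipped; its contents are now sorted.
gzip -dc /tmp/example.txt.gz
```

With Miller in place of `sort`, the transform step would look something like `mlr --csv sort -f color gz-example.csv.gz | gzip > tmp.gz && mv tmp.gz gz-example.csv.gz`.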
@ -0,0 +1,96 @@
# Compressed data
As of [Miller 6](new-in-miller-6.md), Miller supports reading GZIP, BZIP2, and
ZLIB formats transparently, and in-process. And (as before Miller 6) you have a
more general `--prepipe` option to support other decompression programs.
## Automatic detection on input
If your files end in `.gz`, `.bz2`, or `.z` then Miller will autodetect by file extension:
GENMD_CARDIFY_HIGHLIGHT_ONE
file gz-example.csv.gz
gz-example.csv.gz: gzip compressed data, was "gz-example.csv", last modified: Mon Aug 23 02:04:34 2021, from Unix, original size modulo 2^32 429
GENMD_EOF
GENMD_RUN_COMMAND
mlr --csv sort -f color gz-example.csv.gz
GENMD_EOF
This will decompress the input data on the fly, while leaving the disk file unmodified. This helps you save disk space, at the cost of some additional runtime CPU usage to decompress the data.
## Manual detection on input
If the filename doesn't end in `.gz`, `.bz2`, or `.z`, then you can use the flags `--gzin`, `--bz2in`, or `--zin` to let Miller know:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --csv --gzin sort -f color myfile.bin # myfile.bin has gzip contents
GENMD_EOF
## External decompressors on input
Using the `--prepipe` flag, you can provide the name of any decompression
program in your `$PATH` and Miller will run it on each input file, effectively
piping the standard output of that program to Miller's standard input.
For single input files you can, of course, already do this without `--prepipe`, for example:
GENMD_RUN_COMMAND
gunzip < gz-example.csv.gz | mlr --csv sort -f color
GENMD_EOF
The benefit of `--prepipe` is that Miller will run the specified program once per
file, respecting file boundaries.
The prepipe command can be anything which reads from standard input and produces
data acceptable to Miller. Nominally this allows you to use whichever
decompression utilities you have installed on your system, on a per-file basis.
If the command has flags, quote them: e.g. `mlr --prepipe 'zcat -cf'`.
Note that this feature is quite general and is not limited to decompression
utilities. You can use it to apply per-file filters of your choice: e.g. `mlr
--prepipe 'head -n 10' ...`, if you like.
There is a `--prepipe` and a `--prepipex`:
* If the command normally runs with `nameofprogram < filename.ext` (such as `gunzip` or `zcat -cf` or `xz -cd`) then use `--prepipe`.
* If the command normally runs with `nameofprogram filename.ext` (such as `unzip -qc`) then use `--prepipex`.
Lastly, note that if `--prepipe` or `--prepipex` is specified on the Miller
command line, it replaces any autodetect decisions that might have been made
based on the filename extension. Likewise, `--gzin`/`--bz2in`/`--zin` are ignored if
`--prepipe` or `--prepipex` is also specified.
## Compressed output
Everything said so far on this page has to do with compressed input.
For compressed output:
* Normally Miller output is to stdout, so you can pipe the output: `mlr sort -n quantity foo.csv | gzip > sorted.csv.gz`.
* For [`tee` statements](reference-dsl-output-statements.md#tee-statements), which write output to files rather than stdout, use `tee`'s redirect syntax:
GENMD_RUN_COMMAND
mlr --from example.csv --csv put -q '
filename = $color.".csv.gz";
tee | "gzip > ".filename, $*
'
GENMD_EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
file red.csv.gz purple.csv.gz yellow.csv.gz
red.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 185
purple.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 164
yellow.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 158
GENMD_EOF
GENMD_RUN_COMMAND
mlr --csv cat yellow.csv.gz
GENMD_EOF
* Using the [in-place flag](reference-main-io-options.md#in-place-mode) `-I`,
as of August 2021 the overwritten file will _not_ be compressed as it was when it was read:
e.g. `mlr -I --csv cat gz-example.csv.gz` will write `gz-example.csv.gz` which contains
plain, uncompressed CSV content. This is a bug and will be fixed.
@ -61,37 +61,7 @@ Please see [Choices for printing to files](10min.md#choices-for-printing-to-file
## Compression
Options:
<pre class="pre-non-highlight-non-pair">
--prepipe {command}
</pre>
The prepipe command is anything which reads from standard input and produces data acceptable to Miller. Nominally this allows you to use whichever decompression utilities you have installed on your system, on a per-file basis. If the command has flags, quote them: e.g. `mlr --prepipe 'zcat -cf'`. Examples:
<pre class="pre-non-highlight-non-pair">
# These two produce the same output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz
# With multiple input files you need --prepipe:
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz
$ mlr --prepipe gunzip --idkvp --oxtab cut -f hostname,uptime myfile1.dat.gz myfile2.dat.gz
</pre>
<pre class="pre-non-highlight-non-pair">
# Similar to the above, but with compressed output as well as input:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz | gzip > outfile.csv.gz
</pre>
<pre class="pre-non-highlight-non-pair">
# Similar to the above, but with different compression tools for input and output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | xz -z > outfile.csv.xz
$ xz -cd < myfile1.csv.xz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe 'xz -cd' cut -f hostname,uptime myfile1.csv.xz myfile2.csv.xz | xz -z > outfile.csv.xz
</pre>
See the separate page on [Compressed data](reference-main-compressed-data.md).
## Record/field/pair separators
@ -44,37 +44,7 @@ Please see [Choices for printing to files](10min.md#choices-for-printing-to-file
## Compression
Options:
GENMD_CARDIFY
--prepipe {command}
GENMD_EOF
The prepipe command is anything which reads from standard input and produces data acceptable to Miller. Nominally this allows you to use whichever decompression utilities you have installed on your system, on a per-file basis. If the command has flags, quote them: e.g. `mlr --prepipe 'zcat -cf'`. Examples:
GENMD_CARDIFY
# These two produce the same output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz
# With multiple input files you need --prepipe:
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz
$ mlr --prepipe gunzip --idkvp --oxtab cut -f hostname,uptime myfile1.dat.gz myfile2.dat.gz
GENMD_EOF
GENMD_CARDIFY
# Similar to the above, but with compressed output as well as input:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz | gzip > outfile.csv.gz
GENMD_EOF
GENMD_CARDIFY
# Similar to the above, but with different compression tools for input and output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | xz -z > outfile.csv.xz
$ xz -cd < myfile1.csv.xz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe 'xz -cd' cut -f hostname,uptime myfile1.csv.xz myfile2.csv.xz | xz -z > outfile.csv.xz
GENMD_EOF
See the separate page on [Compressed data](reference-main-compressed-data.md).
## Record/field/pair separators
@ -3,12 +3,12 @@
## Overview
The outline of an invocation of Miller is
The outline of an invocation of Miller is:
* `mlr`
* The program name `mlr`.
* Options controlling input/output formatting, etc. (See [I/O options](reference-main-io-options.md)).
* One or more verbs -- such as `cut`, `sort`, etc. (see [Verbs Reference](reference-verbs.md)) -- chained together using [then](reference-main-then-chaining.md). You use these to transform your data.
* Zero or more filenames, with input taken from standard input if there are no filenames present.
* Zero or more filenames, with input taken from standard input if there are no filenames present. (You can place the filenames up front using `--from` or `--mfrom` as described on the [keystroke-savers page](keystroke-savers.md#file-names-up-front-including-from).)
For example, reading from a file:
@ -21,6 +21,15 @@ red square true 2 15 79.2778 0.0130
yellow triangle true 1 11 43.6498 9.8870
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --from example.csv --icsv --opprint head -n 2 then sort -f shape</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color shape flag k index quantity rate
red square true 2 15 79.2778 0.0130
yellow triangle true 1 11 43.6498 9.8870
</pre>
Reading from standard input:
<pre class="pre-highlight-in-pair">
@ -38,7 +47,7 @@ The rest of this reference section gives you full information on each of these p
When you type `mlr {something} myfile.dat`, the `{something}` part is called a **verb**. It specifies how you want to transform your data. Most of the verbs are counterparts of built-in system tools like `cut` and `sort` -- but with file-format awareness, and giving you the ability to refer to fields by name.
The verbs `put` and `filter` are special in that they have a rich expression language (domain-specific language, or "DSL"). More information about them can be found at [DSL reference](reference-dsl.md).
The verbs `put` and `filter` are special in that they have a rich expression language (domain-specific language, or "DSL"). More information about them can be found on the [Intro to Miller's programming language page](programming-language.md); see also the [DSL reference](reference-dsl.md) for more details.
Here's a comparison of verbs and `put`/`filter` DSL expressions:
@ -2,12 +2,12 @@
## Overview
The outline of an invocation of Miller is
The outline of an invocation of Miller is:
* `mlr`
* The program name `mlr`.
* Options controlling input/output formatting, etc. (See [I/O options](reference-main-io-options.md)).
* One or more verbs -- such as `cut`, `sort`, etc. (see [Verbs Reference](reference-verbs.md)) -- chained together using [then](reference-main-then-chaining.md). You use these to transform your data.
* Zero or more filenames, with input taken from standard input if there are no filenames present.
* Zero or more filenames, with input taken from standard input if there are no filenames present. (You can place the filenames up front using `--from` or `--mfrom` as described on the [keystroke-savers page](keystroke-savers.md#file-names-up-front-including-from).)
For example, reading from a file:
@ -15,6 +15,10 @@ GENMD_RUN_COMMAND
mlr --icsv --opprint head -n 2 then sort -f shape example.csv
GENMD_EOF
GENMD_RUN_COMMAND
mlr --from example.csv --icsv --opprint head -n 2 then sort -f shape
GENMD_EOF
Reading from standard input:
GENMD_RUN_COMMAND
@ -27,7 +31,7 @@ The rest of this reference section gives you full information on each of these p
When you type `mlr {something} myfile.dat`, the `{something}` part is called a **verb**. It specifies how you want to transform your data. Most of the verbs are counterparts of built-in system tools like `cut` and `sort` -- but with file-format awareness, and giving you the ability to refer to fields by name.
The verbs `put` and `filter` are special in that they have a rich expression language (domain-specific language, or "DSL"). More information about them can be found at [DSL reference](reference-dsl.md).
The verbs `put` and `filter` are special in that they have a rich expression language (domain-specific language, or "DSL"). More information about them can be found on the [Intro to Miller's programming language page](programming-language.md); see also the [DSL reference](reference-dsl.md) for more details.
Here's a comparison of verbs and `put`/`filter` DSL expressions:
@ -179,6 +179,6 @@ etc. depending on your platform.
Suggestion: `alias mrpl='rlwrap mlr repl'` in your shell's startup file.
## On-line help
## Online help
After `mlr repl`, type `:help` to see more about your options. In particular, `:help examples`.
@ -146,6 +146,6 @@ etc. depending on your platform.
Suggestion: `alias mrpl='rlwrap mlr repl'` in your shell's startup file.
## On-line help
## Online help
After `mlr repl`, type `:help` to see more about your options. In particular, `:help examples`.
@ -244,4 +244,15 @@ GENMD_INCLUDE_ESCAPED(data/rect.txt)
The idea here is that middles starting with a 1 belong to the outer value of 1, and so on. (For example, the outer values might be account IDs, the middle values might be invoice IDs, and the inner values might be invoice line-items.) If you want all the middle and inner lines to have the context of which outers they belong to, you can modify your software to pass all those through your methods. Alternatively, don't refactor your code just to handle some ad-hoc log-data formatting -- instead, use the following to rectangularize the data. The idea is to use an out-of-stream variable to accumulate fields across records. Clear that variable when you see an outer ID; accumulate fields; emit output when you see the inner IDs.
GENMD_INCLUDE_AND_RUN_ESCAPED(data/rect.sh)
GENMD_RUN_COMMAND
mlr --from data/rect.txt put -q '
is_present($outer) {
unset @r
}
for (k, v in $*) {
@r[k] = v
}
is_present($inner1) {
emit @r
}'
GENMD_EOF
@ -4,14 +4,51 @@
For one or more specified field names, simply compute p25 and p75, then write the IQR as the difference of p75 and p25:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/iqr1.sh)
GENMD_RUN_COMMAND
mlr --oxtab stats1 -f x -a p25,p75 \
then put '$x_iqr = $x_p75 - $x_p25' \
data/medium
GENMD_EOF
For wildcarded field names, first compute p25 and p75, then loop over field names with `p25` in them:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/iqrn.sh)
GENMD_RUN_COMMAND
mlr --oxtab stats1 --fr '[i-z]' -a p25,p75 \
then put 'for (k,v in $*) {
if (k =~ "(.*)_p25") {
$["\1_iqr"] = $["\1_p75"] - $["\1_p25"]
}
}' \
data/medium
GENMD_EOF
## Computing weighted means
This might be more elegantly implemented as an option within the `stats1` verb. Meanwhile, it's expressible within the DSL:
GENMD_INCLUDE_AND_RUN_ESCAPED(data/weighted-mean.sh)
GENMD_RUN_COMMAND
mlr --from data/medium put -q '
# Using the y field for weighting in this example
weight = $y;
# Using the a field for weighted aggregation in this example
@sumwx[$a] += weight * $i;
@sumw[$a] += weight;
@sumx[$a] += $i;
@sumn[$a] += 1;
end {
map wmean = {};
map mean = {};
for (a in @sumwx) {
wmean[a] = @sumwx[a] / @sumw[a]
}
for (a in @sumx) {
mean[a] = @sumx[a] / @sumn[a]
}
#emit wmean, "a";
#emit mean, "a";
emit (wmean, mean), "a";
}'
GENMD_EOF
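For reference, and in notation not used elsewhere on this page, the two per-group quantities the script accumulates are the weighted and unweighted means:

```latex
\bar{x}_w(a) = \frac{\sum_{i \in a} w_i x_i}{\sum_{i \in a} w_i},
\qquad
\bar{x}(a) = \frac{1}{n_a} \sum_{i \in a} x_i
```

Here `w_i` is the `y` field, `x_i` is the `i` field, and `n_a` is the number of records in group `a` -- matching the `@sumwx`/`@sumw` and `@sumx`/`@sumn` accumulators in the code above.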