Replace miller-6 sphinx docs with mkdocs docs (#618)

John Kerl 2021-08-04 22:03:01 -04:00 committed by GitHub
parent fa3ee05822
commit afd3c9c149
GPG key ID: 4AEE18F83AFDEB23
1313 changed files with 32 additions and 287985 deletions

.gitignore (vendored): 2 lines changed

@@ -120,4 +120,4 @@ experiments/dsl-parser/two/main
experiments/cli-parser/cliparse
experiments/cli-parser/cliparse.exe
docs6b/site/
docs6/site/


@@ -1 +0,0 @@
../multi-join


@@ -1 +0,0 @@
../ngrams


@@ -1 +0,0 @@
../polyglot-dkvp-io


@@ -1,2 +0,0 @@
map \d :w<C-m>:!clear;build-one %<C-m>
map \f :w<C-m>:!clear;make html<C-m>


@@ -1,653 +0,0 @@
..
   PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
Miller in 10 minutes
====================
Obtaining Miller
^^^^^^^^^^^^^^^^
You can install Miller for various platforms as follows:
* Linux: ``yum install miller`` or ``apt-get install miller``, depending on your flavor of Linux.
* macOS: ``brew install miller`` or ``port install miller``, depending on your preference for `Homebrew <https://brew.sh>`_ or `MacPorts <https://macports.org>`_.
* Windows: ``choco install miller`` using `Chocolatey <https://chocolatey.org>`_.
* You can get the latest builds for Linux, macOS, and Windows by visiting https://github.com/johnkerl/miller/actions, selecting the latest build, and clicking *Artifacts*. (These are retained for 5 days after each commit.)
* See also :doc:`build` if you prefer -- in particular, if your platform's package manager doesn't have the latest release.
As a first check, you should be able to run ``mlr --version`` at your system's command prompt and see something like the following:
.. code-block:: none
:emphasize-lines: 1-1
mlr --version
Miller v6.0.0-dev
As a second check, given (`example.csv <./example.csv>`_) you should be able to do
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv cat example.csv
color,shape,flag,index,quantity,rate
yellow,triangle,true,11,43.6498,9.8870
red,square,true,15,79.2778,0.0130
red,circle,true,16,13.8103,2.9010
red,square,false,48,77.5542,7.4670
purple,triangle,false,51,81.2290,8.5910
red,square,false,64,77.1991,9.5310
purple,triangle,false,65,80.1405,5.8240
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
purple,square,false,91,72.3735,8.2430
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint cat example.csv
color shape flag index quantity rate
yellow triangle true 11 43.6498 9.8870
red square true 15 79.2778 0.0130
red circle true 16 13.8103 2.9010
red square false 48 77.5542 7.4670
purple triangle false 51 81.2290 8.5910
red square false 64 77.1991 9.5310
purple triangle false 65 80.1405 5.8240
yellow circle true 73 63.9785 4.2370
yellow circle true 87 63.5058 8.3350
purple square false 91 72.3735 8.2430
If you run into issues on these checks, please check out the resources on the :doc:`community` page for help.
Miller verbs
^^^^^^^^^^^^
Let's take a quick look at some of the most useful Miller verbs -- file-format-aware, name-index-empowered equivalents of standard system commands.
``mlr cat`` is like system ``cat`` (or ``type`` on Windows) -- it passes the data through unmodified:
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv cat example.csv
color,shape,flag,index,quantity,rate
yellow,triangle,true,11,43.6498,9.8870
red,square,true,15,79.2778,0.0130
red,circle,true,16,13.8103,2.9010
red,square,false,48,77.5542,7.4670
purple,triangle,false,51,81.2290,8.5910
red,square,false,64,77.1991,9.5310
purple,triangle,false,65,80.1405,5.8240
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
purple,square,false,91,72.3735,8.2430
But ``mlr cat`` can also do format conversion -- for example, you can pretty-print in tabular format:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint cat example.csv
color shape flag index quantity rate
yellow triangle true 11 43.6498 9.8870
red square true 15 79.2778 0.0130
red circle true 16 13.8103 2.9010
red square false 48 77.5542 7.4670
purple triangle false 51 81.2290 8.5910
red square false 64 77.1991 9.5310
purple triangle false 65 80.1405 5.8240
yellow circle true 73 63.9785 4.2370
yellow circle true 87 63.5058 8.3350
purple square false 91 72.3735 8.2430
``mlr head`` and ``mlr tail`` count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way:
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv head -n 4 example.csv
color,shape,flag,index,quantity,rate
yellow,triangle,true,11,43.6498,9.8870
red,square,true,15,79.2778,0.0130
red,circle,true,16,13.8103,2.9010
red,square,false,48,77.5542,7.4670
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv tail -n 4 example.csv
color,shape,flag,index,quantity,rate
purple,triangle,false,65,80.1405,5.8240
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
purple,square,false,91,72.3735,8.2430
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --ojson tail -n 2 example.csv
{
"color": "yellow",
"shape": "circle",
"flag": true,
"index": 87,
"quantity": 63.5058,
"rate": 8.3350
}
{
"color": "purple",
"shape": "square",
"flag": false,
"index": 91,
"quantity": 72.3735,
"rate": 8.2430
}
You can sort on a single field:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint sort -f shape example.csv
color shape flag index quantity rate
red circle true 16 13.8103 2.9010
yellow circle true 73 63.9785 4.2370
yellow circle true 87 63.5058 8.3350
red square true 15 79.2778 0.0130
red square false 48 77.5542 7.4670
red square false 64 77.1991 9.5310
purple square false 91 72.3735 8.2430
yellow triangle true 11 43.6498 9.8870
purple triangle false 51 81.2290 8.5910
purple triangle false 65 80.1405 5.8240
Or, you can sort primarily alphabetically on one field, then secondarily numerically descending on another field, and so on:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint sort -f shape -nr index example.csv
color shape flag index quantity rate
yellow circle true 87 63.5058 8.3350
yellow circle true 73 63.9785 4.2370
red circle true 16 13.8103 2.9010
purple square false 91 72.3735 8.2430
red square false 64 77.1991 9.5310
red square false 48 77.5542 7.4670
red square true 15 79.2778 0.0130
purple triangle false 65 80.1405 5.8240
purple triangle false 51 81.2290 8.5910
yellow triangle true 11 43.6498 9.8870
If there are fields you don't want to see in your data, you can use ``cut`` to keep only the ones you want, in the same order they appeared in the input data:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint cut -f flag,shape example.csv
shape flag
triangle true
square true
circle true
square false
triangle false
square false
triangle false
circle true
circle true
square false
You can also use ``cut -o`` to keep specified fields, but in your preferred order:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint cut -o -f flag,shape example.csv
flag shape
true triangle
true square
true circle
false square
false triangle
false square
false triangle
true circle
true circle
false square
You can use ``cut -x`` to omit fields you don't care about:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint cut -x -f flag,shape example.csv
color index quantity rate
yellow 11 43.6498 9.8870
red 15 79.2778 0.0130
red 16 13.8103 2.9010
red 48 77.5542 7.4670
purple 51 81.2290 8.5910
red 64 77.1991 9.5310
purple 65 80.1405 5.8240
yellow 73 63.9785 4.2370
yellow 87 63.5058 8.3350
purple 91 72.3735 8.2430
You can use ``filter`` to keep only records you care about:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint filter '$color == "red"' example.csv
color shape flag index quantity rate
red square true 15 79.2778 0.0130
red circle true 16 13.8103 2.9010
red square false 48 77.5542 7.4670
red square false 64 77.1991 9.5310
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint filter '$color == "red" && $flag == true' example.csv
color shape flag index quantity rate
red square true 15 79.2778 0.0130
red circle true 16 13.8103 2.9010
You can use ``put`` to create new fields which are computed from other fields:
.. code-block:: none
:emphasize-lines: 1-4
mlr --icsv --opprint put '
$ratio = $quantity / $rate;
$color_shape = $color . "_" . $shape
' example.csv
color shape flag index quantity rate ratio color_shape
yellow triangle true 11 43.6498 9.8870 4.414868008496004 yellow_triangle
red square true 15 79.2778 0.0130 6098.292307692308 red_square
red circle true 16 13.8103 2.9010 4.760530851430541 red_circle
red square false 48 77.5542 7.4670 10.386259541984733 red_square
purple triangle false 51 81.2290 8.5910 9.455127458968688 purple_triangle
red square false 64 77.1991 9.5310 8.099790158430384 red_square
purple triangle false 65 80.1405 5.8240 13.760388049450551 purple_triangle
yellow circle true 73 63.9785 4.2370 15.09995279679018 yellow_circle
yellow circle true 87 63.5058 8.3350 7.619172165566886 yellow_circle
purple square false 91 72.3735 8.2430 8.779995147397793 purple_square
Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use ``$[[3]]`` to access the name of field 3 or ``$[[[3]]]`` to access the value of field 3:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv
color shape NEW index quantity rate
yellow triangle true 11 43.6498 9.8870
red square true 15 79.2778 0.0130
red circle true 16 13.8103 2.9010
red square false 48 77.5542 7.4670
purple triangle false 51 81.2290 8.5910
red square false 64 77.1991 9.5310
purple triangle false 65 80.1405 5.8240
yellow circle true 73 63.9785 4.2370
yellow circle true 87 63.5058 8.3350
purple square false 91 72.3735 8.2430
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv
color shape flag index quantity rate
yellow triangle NEW 11 43.6498 9.8870
red square NEW 15 79.2778 0.0130
red circle NEW 16 13.8103 2.9010
red square NEW 48 77.5542 7.4670
purple triangle NEW 51 81.2290 8.5910
red square NEW 64 77.1991 9.5310
purple triangle NEW 65 80.1405 5.8240
yellow circle NEW 73 63.9785 4.2370
yellow circle NEW 87 63.5058 8.3350
purple square NEW 91 72.3735 8.2430
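For illustration only, here is a minimal Python sketch of positional assignment over an ordered record. The name ``positional_assign`` is hypothetical and this is not Miller's implementation:

```python
def positional_assign(record, n, new, to_name=False):
    """Assign to field n (1-up) of an ordered record (a dict).

    With to_name=False this mimics $[[[n]]] = ... (set the value);
    with to_name=True it mimics $[[n]] = ... (rename the field).
    Illustrative sketch only, not Miller's implementation.
    """
    target = list(record)[n - 1]
    if to_name:
        # Rebuild the dict to rename in place, preserving field order.
        return {(new if k == target else k): v for k, v in record.items()}
    record[target] = new
    return record
```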
You can find the full list of verbs at the :doc:`reference-verbs` page.
Multiple input files
^^^^^^^^^^^^^^^^^^^^
Miller takes all the files from the command line as an input stream. But it's format-aware, so it doesn't repeat CSV header lines. For example, with input files (`data/a.csv <data/a.csv>`_) and (`data/b.csv <data/b.csv>`_), the system ``cat`` command will repeat header lines:
.. code-block:: none
:emphasize-lines: 1-1
cat data/a.csv
a,b,c
1,2,3
4,5,6
.. code-block:: none
:emphasize-lines: 1-1
cat data/b.csv
a,b,c
7,8,9
.. code-block:: none
:emphasize-lines: 1-1
cat data/a.csv data/b.csv
a,b,c
1,2,3
4,5,6
a,b,c
7,8,9
However, ``mlr cat`` will not:
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv cat data/a.csv data/b.csv
a,b,c
1,2,3
4,5,6
7,8,9
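As a sketch of the idea, header-aware concatenation keeps the first header and drops repeats. The helper name ``cat_csv`` is assumed for illustration; this is not Miller code:

```python
def cat_csv(*csv_texts):
    """Concatenate CSV contents, emitting the shared header only once."""
    out = []
    for text in csv_texts:
        header, *data = text.strip().splitlines()
        if not out:
            out.append(header)  # keep the first file's header
        out.extend(data)        # subsequent headers are split off and dropped
    return "\n".join(out)
```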
Chaining verbs together
^^^^^^^^^^^^^^^^^^^^^^^
Often we want to chain queries together -- for example, sorting by a field and taking the top few values. We can do this using pipes:
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv sort -nr index example.csv | mlr --icsv --opprint head -n 3
color shape flag index quantity rate
purple square false 91 72.3735 8.2430
yellow circle true 87 63.5058 8.3350
yellow circle true 73 63.9785 4.2370
This works fine -- but Miller also lets you chain verbs together using the word ``then``. Think of this as a Miller-internal pipe that lets you use fewer keystrokes:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
color shape flag index quantity rate
purple square false 91 72.3735 8.2430
yellow circle true 87 63.5058 8.3350
yellow circle true 73 63.9785 4.2370
As another convenience, you can put the filename first using ``--from``. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint --from example.csv sort -nr index then head -n 3
color shape flag index quantity rate
purple square false 91 72.3735 8.2430
yellow circle true 87 63.5058 8.3350
yellow circle true 73 63.9785 4.2370
.. code-block:: none
:emphasize-lines: 1-4
mlr --icsv --opprint --from example.csv \
sort -nr index \
then head -n 3 \
then cut -f shape,quantity
shape quantity
square 72.3735
circle 63.5058
circle 63.9785
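Conceptually, ``then`` composes verbs left to right, each consuming the previous one's records. A hypothetical Python sketch of that internal pipe (the names are illustrative, not Miller's API):

```python
def then_chain(records, *verbs):
    """Feed a record list through each verb in turn, like Miller's 'then'."""
    for verb in verbs:
        records = verb(records)
    return records

# Usage: the analogue of "sort -nr index then head -n 2" over toy records.
out = then_chain(
    [{"index": 1}, {"index": 3}, {"index": 2}],
    lambda rs: sorted(rs, key=lambda r: r["index"], reverse=True),
    lambda rs: rs[:2],
)
```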
Sorts and stats
^^^^^^^^^^^^^^^
Now suppose you want to sort the data on a given column, *and then* take the top few in that ordering. You can use Miller's ``then`` feature to pipe commands together.
Here are the records with the top three ``index`` values:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
color shape flag index quantity rate
purple square false 91 72.3735 8.2430
yellow circle true 87 63.5058 8.3350
yellow circle true 73 63.9785 4.2370
Lots of Miller commands take a ``-g`` option for group-by: here, ``head -n 1 -g shape`` outputs the first record for each distinct value of the ``shape`` field. This means we're finding the record with the highest ``index`` value for each distinct ``shape``:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv
color shape flag index quantity rate
yellow circle true 87 63.5058 8.3350
purple square false 91 72.3735 8.2430
purple triangle false 65 80.1405 5.8240
Statistics can be computed with or without group-by field(s):
.. code-block:: none
:emphasize-lines: 1-2
mlr --icsv --opprint --from example.csv \
stats1 -a count,min,mean,max -f quantity -g shape
shape quantity_count quantity_min quantity_mean quantity_max
triangle 3 43.6498 68.33976666666666 81.229
square 4 72.3735 76.60114999999999 79.2778
circle 3 13.8103 47.0982 63.9785
.. code-block:: none
:emphasize-lines: 1-2
mlr --icsv --opprint --from example.csv \
stats1 -a count,min,mean,max -f quantity -g shape,color
shape color quantity_count quantity_min quantity_mean quantity_max
triangle yellow 1 43.6498 43.6498 43.6498
square red 3 77.1991 78.01036666666666 79.2778
circle red 1 13.8103 13.8103 13.8103
triangle purple 2 80.1405 80.68475000000001 81.229
circle yellow 2 63.5058 63.742149999999995 63.9785
square purple 1 72.3735 72.3735 72.3735
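Under assumed names, a Python sketch of what grouped ``stats1`` computes (illustrative only, not Miller's implementation):

```python
from collections import defaultdict

def grouped_stats(records, value_field, group_fields):
    """count/min/mean/max of value_field per distinct group-field combination."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[g] for g in group_fields)
        groups[key].append(float(rec[value_field]))
    return {
        key: {"count": len(v), "min": min(v), "mean": sum(v) / len(v), "max": max(v)}
        for key, v in groups.items()
    }
```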
If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:
.. code-block:: none
:emphasize-lines: 1-2
mlr --icsv --oxtab --from example.csv \
stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate
rate_p0 0.0130
rate_p10 2.9010
rate_p25 4.2370
rate_p50 8.2430
rate_p75 8.5910
rate_p90 9.8870
rate_p99 9.8870
rate_p100 9.8870
File formats and format conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Miller supports the following formats:
* CSV (comma-separated values)
* TSV (tab-separated values)
* JSON (JavaScript Object Notation)
* PPRINT (pretty-printed tabular)
* XTAB (vertical-tabular or sideways-tabular)
* NIDX (numerically indexed, label-free, with implicit labels ``"1"``, ``"2"``, etc.)
* DKVP (delimited key-value pairs).
What's a CSV file, really? It's an array of rows, or *records*, each being a list of key-value pairs, or *fields*: for CSV it so happens that all the keys are shared in the header line and the values vary from one data line to another.
For example, if you have:
.. code-block:: none
shape,flag,index
circle,1,24
square,0,36
then that's a way of saying:
.. code-block:: none
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
Other ways to write the same data:
.. code-block:: none
CSV PPRINT
shape,flag,index shape flag index
circle,1,24 circle 1 24
square,0,36 square 0 36
JSON XTAB
{ shape circle
"shape": "circle", flag 1
"flag": 1, index 24
"index": 24 .
} shape square
{ flag 0
"shape": "square", index 36
"flag": 0,
"index": 36
}
DKVP
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
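DKVP is simple enough to sketch a reader for in a few lines of Python. This is illustrative only; ``parse_dkvp`` is not a Miller function:

```python
def parse_dkvp(line, pair_sep=",", kv_sep="="):
    """Parse one DKVP line into an ordered record of string fields."""
    record = {}
    for pair in line.split(pair_sep):
        key, _, value = pair.partition(kv_sep)
        record[key] = value
    return record
```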
Anything we can do with CSV input, we can do with input in any other format. You can read one format, do any record-processing, and write either the same format as the input or a different one.
How to specify these to Miller:
* If you use ``--csv`` or ``--json`` or ``--pprint``, etc., then Miller will use that format for input and output.
* If you use ``--icsv`` and ``--ojson`` (note the extra ``i`` and ``o``) then Miller will use CSV for input and JSON for output, etc. See also :doc:`keystroke-savers` for even shorter options like ``--c2j``.
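That flag-resolution rule can be sketched as a toy model in Python. The function name is assumed for illustration and this is not Miller's actual option parser:

```python
def resolve_formats(flags):
    """Resolve Miller-style format flags to an (input, output) format pair."""
    shorthands = {"--c2j": ("csv", "json"), "--c2p": ("csv", "pprint")}
    informat = outformat = "dkvp"  # Miller's default format
    for flag in flags:
        if flag in shorthands:        # keystroke-savers like --c2j
            informat, outformat = shorthands[flag]
        elif flag.startswith("--i"):  # input-only, e.g. --icsv
            informat = flag[3:]
        elif flag.startswith("--o"):  # output-only, e.g. --ojson
            outformat = flag[3:]
        else:                         # both input and output, e.g. --csv
            informat = outformat = flag.lstrip("-")
    return informat, outformat
```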
You can read more about this at the :doc:`file-formats` page.
.. _10min-choices-for-printing-to-files:
Choices for printing to files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.
Sometimes, though, we want to print output to another file. Just use **> outputfilenamegoeshere** at the end of your command:
.. code-block:: none
:emphasize-lines: 1,1
mlr --icsv --opprint cat example.csv > newfile.csv
# Output goes to the new file;
# nothing is printed to the screen.
.. code-block:: none
:emphasize-lines: 1,1
cat newfile.csv
color shape flag index quantity rate
yellow triangle true 11 43.6498 9.8870
red square true 15 79.2778 0.0130
red circle true 16 13.8103 2.9010
red square false 48 77.5542 7.4670
purple triangle false 51 81.2290 8.5910
red square false 64 77.1991 9.5310
purple triangle false 65 80.1405 5.8240
yellow circle true 73 63.9785 4.2370
yellow circle true 87 63.5058 8.3350
purple square false 91 72.3735 8.2430
Other times we want our files to be **changed in-place**: use **mlr -I**:
.. code-block:: none
:emphasize-lines: 1,1
cp example.csv newfile.txt
.. code-block:: none
:emphasize-lines: 1,1
cat newfile.txt
color,shape,flag,index,quantity,rate
yellow,triangle,true,11,43.6498,9.8870
red,square,true,15,79.2778,0.0130
red,circle,true,16,13.8103,2.9010
red,square,false,48,77.5542,7.4670
purple,triangle,false,51,81.2290,8.5910
red,square,false,64,77.1991,9.5310
purple,triangle,false,65,80.1405,5.8240
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
purple,square,false,91,72.3735,8.2430
.. code-block:: none
:emphasize-lines: 1,1
mlr -I --csv sort -f shape newfile.txt
.. code-block:: none
:emphasize-lines: 1,1
cat newfile.txt
color,shape,flag,index,quantity,rate
red,circle,true,16,13.8103,2.9010
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
red,square,true,15,79.2778,0.0130
red,square,false,48,77.5542,7.4670
red,square,false,64,77.1991,9.5310
purple,square,false,91,72.3735,8.2430
yellow,triangle,true,11,43.6498,9.8870
purple,triangle,false,51,81.2290,8.5910
purple,triangle,false,65,80.1405,5.8240
You can also use ``mlr -I`` to bulk-operate on many files at once, e.g.:
.. code-block:: none
:emphasize-lines: 1,1
mlr -I --csv cut -x -f unwanted_column_name *.csv
If you like, you can first copy your original data somewhere else before doing in-place operations.
Lastly, using ``tee`` within ``put``, you can split your input data into separate files per one or more field names:
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
.. code-block:: none
:emphasize-lines: 1-1
cat circle.csv
color,shape,flag,index,quantity,rate
red,circle,true,16,13.8103,2.9010
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
.. code-block:: none
:emphasize-lines: 1-1
cat square.csv
color,shape,flag,index,quantity,rate
red,square,true,15,79.2778,0.0130
red,square,false,48,77.5542,7.4670
red,square,false,64,77.1991,9.5310
purple,square,false,91,72.3735,8.2430
.. code-block:: none
:emphasize-lines: 1-1
cat triangle.csv
color,shape,flag,index,quantity,rate
yellow,triangle,true,11,43.6498,9.8870
purple,triangle,false,51,81.2290,8.5910
purple,triangle,false,65,80.1405,5.8240
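The split-by-field idea can be sketched in Python under assumed names (illustrative only, not Miller code):

```python
def split_by_field(records, field):
    """Bucket records by a field's value, like tee to per-shape files above."""
    buckets = {}
    for rec in records:
        buckets.setdefault(rec[field], []).append(rec)
    return buckets
```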


@@ -1,378 +0,0 @@
Miller in 10 minutes
====================
Obtaining Miller
^^^^^^^^^^^^^^^^
You can install Miller for various platforms as follows:
* Linux: ``yum install miller`` or ``apt-get install miller``, depending on your flavor of Linux.
* macOS: ``brew install miller`` or ``port install miller``, depending on your preference for `Homebrew <https://brew.sh>`_ or `MacPorts <https://macports.org>`_.
* Windows: ``choco install miller`` using `Chocolatey <https://chocolatey.org>`_.
* You can get the latest builds for Linux, macOS, and Windows by visiting https://github.com/johnkerl/miller/actions, selecting the latest build, and clicking *Artifacts*. (These are retained for 5 days after each commit.)
* See also :doc:`build` if you prefer -- in particular, if your platform's package manager doesn't have the latest release.
As a first check, you should be able to run ``mlr --version`` at your system's command prompt and see something like the following:
GENRST_RUN_COMMAND
mlr --version
GENRST_EOF
As a second check, given (`example.csv <./example.csv>`_) you should be able to do
GENRST_RUN_COMMAND
mlr --csv cat example.csv
GENRST_EOF
GENRST_RUN_COMMAND
mlr --icsv --opprint cat example.csv
GENRST_EOF
If you run into issues on these checks, please check out the resources on the :doc:`community` page for help.
Miller verbs
^^^^^^^^^^^^
Let's take a quick look at some of the most useful Miller verbs -- file-format-aware, name-index-empowered equivalents of standard system commands.
``mlr cat`` is like system ``cat`` (or ``type`` on Windows) -- it passes the data through unmodified:
GENRST_RUN_COMMAND
mlr --csv cat example.csv
GENRST_EOF
But ``mlr cat`` can also do format conversion -- for example, you can pretty-print in tabular format:
GENRST_RUN_COMMAND
mlr --icsv --opprint cat example.csv
GENRST_EOF
``mlr head`` and ``mlr tail`` count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way:
GENRST_RUN_COMMAND
mlr --csv head -n 4 example.csv
GENRST_EOF
GENRST_RUN_COMMAND
mlr --csv tail -n 4 example.csv
GENRST_EOF
GENRST_RUN_COMMAND
mlr --icsv --ojson tail -n 2 example.csv
GENRST_EOF
You can sort on a single field:
GENRST_RUN_COMMAND
mlr --icsv --opprint sort -f shape example.csv
GENRST_EOF
Or, you can sort primarily alphabetically on one field, then secondarily numerically descending on another field, and so on:
GENRST_RUN_COMMAND
mlr --icsv --opprint sort -f shape -nr index example.csv
GENRST_EOF
If there are fields you don't want to see in your data, you can use ``cut`` to keep only the ones you want, in the same order they appeared in the input data:
GENRST_RUN_COMMAND
mlr --icsv --opprint cut -f flag,shape example.csv
GENRST_EOF
You can also use ``cut -o`` to keep specified fields, but in your preferred order:
GENRST_RUN_COMMAND
mlr --icsv --opprint cut -o -f flag,shape example.csv
GENRST_EOF
You can use ``cut -x`` to omit fields you don't care about:
GENRST_RUN_COMMAND
mlr --icsv --opprint cut -x -f flag,shape example.csv
GENRST_EOF
You can use ``filter`` to keep only records you care about:
GENRST_RUN_COMMAND
mlr --icsv --opprint filter '$color == "red"' example.csv
GENRST_EOF
GENRST_RUN_COMMAND
mlr --icsv --opprint filter '$color == "red" && $flag == true' example.csv
GENRST_EOF
You can use ``put`` to create new fields which are computed from other fields:
GENRST_RUN_COMMAND
mlr --icsv --opprint put '
$ratio = $quantity / $rate;
$color_shape = $color . "_" . $shape
' example.csv
GENRST_EOF
Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use ``$[[3]]`` to access the name of field 3 or ``$[[[3]]]`` to access the value of field 3:
GENRST_RUN_COMMAND
mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv
GENRST_EOF
GENRST_RUN_COMMAND
mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv
GENRST_EOF
You can find the full list of verbs at the :doc:`reference-verbs` page.
Multiple input files
^^^^^^^^^^^^^^^^^^^^
Miller takes all the files from the command line as an input stream. But it's format-aware, so it doesn't repeat CSV header lines. For example, with input files (`data/a.csv <data/a.csv>`_) and (`data/b.csv <data/b.csv>`_), the system ``cat`` command will repeat header lines:
GENRST_RUN_COMMAND
cat data/a.csv
GENRST_EOF
GENRST_RUN_COMMAND
cat data/b.csv
GENRST_EOF
GENRST_RUN_COMMAND
cat data/a.csv data/b.csv
GENRST_EOF
However, ``mlr cat`` will not:
GENRST_RUN_COMMAND
mlr --csv cat data/a.csv data/b.csv
GENRST_EOF
Chaining verbs together
^^^^^^^^^^^^^^^^^^^^^^^
Often we want to chain queries together -- for example, sorting by a field and taking the top few values. We can do this using pipes:
GENRST_RUN_COMMAND
mlr --csv sort -nr index example.csv | mlr --icsv --opprint head -n 3
GENRST_EOF
This works fine -- but Miller also lets you chain verbs together using the word ``then``. Think of this as a Miller-internal pipe that lets you use fewer keystrokes:
GENRST_RUN_COMMAND
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
GENRST_EOF
As another convenience, you can put the filename first using ``--from``. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:
GENRST_RUN_COMMAND
mlr --icsv --opprint --from example.csv sort -nr index then head -n 3
GENRST_EOF
GENRST_RUN_COMMAND
mlr --icsv --opprint --from example.csv \
sort -nr index \
then head -n 3 \
then cut -f shape,quantity
GENRST_EOF
Sorts and stats
^^^^^^^^^^^^^^^
Now suppose you want to sort the data on a given column, *and then* take the top few in that ordering. You can use Miller's ``then`` feature to pipe commands together.
Here are the records with the top three ``index`` values:
GENRST_RUN_COMMAND
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
GENRST_EOF
Lots of Miller commands take a ``-g`` option for group-by: here, ``head -n 1 -g shape`` outputs the first record for each distinct value of the ``shape`` field. This means we're finding the record with the highest ``index`` value for each distinct ``shape``:
GENRST_RUN_COMMAND
mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv
GENRST_EOF
Statistics can be computed with or without group-by field(s):
GENRST_RUN_COMMAND
mlr --icsv --opprint --from example.csv \
stats1 -a count,min,mean,max -f quantity -g shape
GENRST_EOF
GENRST_RUN_COMMAND
mlr --icsv --opprint --from example.csv \
stats1 -a count,min,mean,max -f quantity -g shape,color
GENRST_EOF
If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:
GENRST_RUN_COMMAND
mlr --icsv --oxtab --from example.csv \
stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate
GENRST_EOF
File formats and format conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Miller supports the following formats:
* CSV (comma-separated values)
* TSV (tab-separated values)
* JSON (JavaScript Object Notation)
* PPRINT (pretty-printed tabular)
* XTAB (vertical-tabular or sideways-tabular)
* NIDX (numerically indexed, label-free, with implicit labels ``"1"``, ``"2"``, etc.)
* DKVP (delimited key-value pairs).
What's a CSV file, really? It's an array of rows, or *records*, each being a list of key-value pairs, or *fields*: for CSV it so happens that all the keys are shared in the header line and the values vary from one data line to another.
For example, if you have:
GENRST_CARDIFY
shape,flag,index
circle,1,24
square,0,36
GENRST_EOF
then that's a way of saying:
GENRST_CARDIFY
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
GENRST_EOF
Other ways to write the same data:
GENRST_CARDIFY
CSV PPRINT
shape,flag,index shape flag index
circle,1,24 circle 1 24
square,0,36 square 0 36
JSON XTAB
{ shape circle
"shape": "circle", flag 1
"flag": 1, index 24
"index": 24 .
} shape square
{ flag 0
"shape": "square", index 36
"flag": 0,
"index": 36
}
DKVP
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
GENRST_EOF
Anything we can do with CSV input, we can do with input in any other format. You can read one format, do any record-processing, and write either the same format as the input or a different one.
How to specify these to Miller:
* If you use ``--csv`` or ``--json`` or ``--pprint``, etc., then Miller will use that format for input and output.
* If you use ``--icsv`` and ``--ojson`` (note the extra ``i`` and ``o``) then Miller will use CSV for input and JSON for output, etc. See also :doc:`keystroke-savers` for even shorter options like ``--c2j``.
You can read more about this at the :doc:`file-formats` page.
.. _10min-choices-for-printing-to-files:
Choices for printing to files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.
Sometimes, though, we want to print output to another file. Just use **> outputfilenamegoeshere** at the end of your command:
.. code-block:: none
:emphasize-lines: 1,1
mlr --icsv --opprint cat example.csv > newfile.csv
# Output goes to the new file;
# nothing is printed to the screen.
.. code-block:: none
:emphasize-lines: 1,1
cat newfile.csv
color shape flag index quantity rate
yellow triangle true 11 43.6498 9.8870
red square true 15 79.2778 0.0130
red circle true 16 13.8103 2.9010
red square false 48 77.5542 7.4670
purple triangle false 51 81.2290 8.5910
red square false 64 77.1991 9.5310
purple triangle false 65 80.1405 5.8240
yellow circle true 73 63.9785 4.2370
yellow circle true 87 63.5058 8.3350
purple square false 91 72.3735 8.2430
Other times we want our files to be **changed in-place**: use **mlr -I**:
.. code-block:: none
:emphasize-lines: 1,1
cp example.csv newfile.txt
.. code-block:: none
:emphasize-lines: 1,1
cat newfile.txt
color,shape,flag,index,quantity,rate
yellow,triangle,true,11,43.6498,9.8870
red,square,true,15,79.2778,0.0130
red,circle,true,16,13.8103,2.9010
red,square,false,48,77.5542,7.4670
purple,triangle,false,51,81.2290,8.5910
red,square,false,64,77.1991,9.5310
purple,triangle,false,65,80.1405,5.8240
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
purple,square,false,91,72.3735,8.2430
.. code-block:: none
:emphasize-lines: 1,1
mlr -I --csv sort -f shape newfile.txt
.. code-block:: none
:emphasize-lines: 1,1
cat newfile.txt
color,shape,flag,index,quantity,rate
red,circle,true,16,13.8103,2.9010
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
red,square,true,15,79.2778,0.0130
red,square,false,48,77.5542,7.4670
red,square,false,64,77.1991,9.5310
purple,square,false,91,72.3735,8.2430
yellow,triangle,true,11,43.6498,9.8870
purple,triangle,false,51,81.2290,8.5910
purple,triangle,false,65,80.1405,5.8240
You can also use ``mlr -I`` to bulk-operate on many files at once, e.g.:
.. code-block:: none
:emphasize-lines: 1,1
mlr -I --csv cut -x -f unwanted_column_name *.csv
If you like, you can first copy your original data somewhere else before doing in-place operations.
Lastly, using ``tee`` within ``put``, you can split your input data into separate files per one or more field names:
GENRST_RUN_COMMAND
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
GENRST_EOF
GENRST_RUN_COMMAND
cat circle.csv
GENRST_EOF
GENRST_RUN_COMMAND
cat square.csv
GENRST_EOF
GENRST_RUN_COMMAND
cat triangle.csv
GENRST_EOF


@@ -1,28 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# Note: run this after make in the ../c directory and make in the ../man directory
# since ../c/mlr is used to autogenerate ../man/manpage.txt which is used in this directory.
# See also https://miller.readthedocs.io/en/latest/build.html#creating-a-new-release-for-developers
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Respective MANPATH entries would include /usr/local/share/man or $HOME/man.
INSTALLDIR=/usr/local/share/man/man1
INSTALLHOME=$(HOME)/man/man1
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
./genrsts
$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

View file

@ -1,40 +1,44 @@
# Miller Sphinx docs
# Miller docs
## Why use Sphinx
## Why use Mkdocs
* Connects to https://miller.readthedocs.io so people can get their docmods onto the web instead of the self-hosted https://johnkerl.org/miller/doc. Thanks to @pabloab for the great advice!
* More standard look and feel -- lots of people use readthedocs for other things so this should feel familiar
* We get a Search feature for free
* More standard look and feel -- lots of people use readthedocs for other things so this should feel familiar.
* We get a Search feature for free.
* Mkdocs vs Sphinx: these are similar tools, but I find that I more easily get better desktop+mobile formatting using Mkdocs.
## Contributing
* You need `pip install sphinx` (or `pip3 install sphinx`)
* The docs include lots of live code examples which will be invoked using `mlr` which must be somewhere in your `$PATH`
* Clone https://github.com/johnkerl/miller and cd into `docs/` within your clone
* Editing loop:
* Edit `*.rst.in`
* Run `make html`
* Either `open _build/html/index.html` (MacOS) or point your browser to `file:///path/to/your/clone/of/miller/docs/_build/html/index.html`
* You need `pip install mkdocs` (or `pip3 install mkdocs`).
* The docs include lots of live code examples which will be invoked using `mlr` which must be somewhere in your `$PATH`.
* Clone https://github.com/johnkerl/miller and cd into `docs/` within your clone.
* Quick-editing loop:
* In one terminal, cd to this directory and leave `mkdocs serve` running.
* In another terminal, cd to the `docs` subdirectory and edit `*.md.in`.
* Run `genmds` to re-create all the `*.md` files, or `genmds foo.md.in` to just re-create the `foo.md.in` file you just edited.
* In your browser, visit http://127.0.0.1:8000
* Alternate editing loop:
* Leave one terminal open as a place you will run `mkdocs build`
* In one terminal, cd to the `docs` subdirectory and edit `*.md.in`.
* Run `genmds` to re-create all the `*.md` files, or `genmds foo.md.in` to just re-create the `foo.md.in` file you just edited.
* In the first terminal, run `mkdocs build` which will populate the `site` directory.
* In your browser, visit `file:///your/path/to/miller/docs/site/index.html`
* Link-checking:
* `sudo pip3 install git+https://github.com/linkchecker/linkchecker.git`
* `cd site` and `linkchecker .`
* Submitting:
* `git add` your modified files, `git commit`, `git push`, and submit a PR at https://github.com/johnkerl/miller
* A nice markup reference: https://www.sphinx-doc.org/en/1.8/usage/restructuredtext/basics.html
* `git add` your modified files, `git commit`, `git push`, and submit a PR at https://github.com/johnkerl/miller.
## Notes
* CSS:
* I used the Sphinx Classic theme which I like a lot except the colors -- it's a blue scheme and Miller has never been blue.
* Files are in `docs/_static/*.css` where I marked my mods with `/* CHANGE ME */`.
* If you modify the CSS you must run `make clean html` (not just `make html`) then reload in your browser.
* I used the Mkdocs Readthedocs theme which I like a lot. I customized `docs/extra.css` for Miller coloring/branding.
* Live code:
* I didn't find a way to include non-Python live-code examples within Sphinx so I adapted the pre-Sphinx Miller-doc strategy which is to have a generator script read a template file (here, `foo.rst.in`), run the marked lines, and generate the output file (`foo.rst`).
* Edit the `*.rst.in` files, not `*.rst` directly.
* Within the `*.rst.in` files are lines like `GENRST_RUN_COMMAND`. These will be run, and their output included, by `make html` which calls the `genrsts` script for you.
* I didn't find a way to include non-Python live-code examples within Mkdocs so I adapted the pre-Mkdocs Miller-doc strategy which is to have a generator script read a template file (here, `foo.md.in`), run the marked lines, and generate the output file (`foo.md`). This is `genmds`.
* Edit the `*.md.in` files, not `*.md` directly.
* Within the `*.md.in` files are lines like `GENMD_RUN_COMMAND`. These will be run, and their output included, when you run the `genmds` script.
* readthedocs:
* https://readthedocs.org/
* https://readthedocs.org/projects/miller/
* https://readthedocs.org/projects/miller/builds/
* https://miller.readthedocs.io/en/latest/
## To do
* Let's all discuss if/how we want the v2 docs to be structured better than the v1 docs.

View file

@ -1,13 +0,0 @@
#!/bin/bash
set -euo pipefail
if [ $# -ge 1 ]; then
for name; do
if [[ $name == *.rst.in ]]; then
genrsts $name;
fi
done
else
for rstin in *.rst.in; do genrsts $rstin; done
fi
sphinx-build -M html . _build

View file

@ -1,121 +0,0 @@
..
PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
Building from source
================================================================
Please also see :doc:`installation` for information about pre-built executables.
Miller license
----------------------------------------------------------------
Two-clause BSD license https://github.com/johnkerl/miller/blob/master/LICENSE.txt.
From release tarball
----------------------------------------------------------------
* Obtain ``mlr-i.j.k.tar.gz`` from https://github.com/johnkerl/miller/tags, replacing ``i.j.k`` with the desired release, e.g. ``6.1.0``.
* ``tar zxvf mlr-i.j.k.tar.gz``
* ``cd mlr-i.j.k``
* ``cd go``
* ``./build`` creates the ``go/mlr`` executable and runs regression tests
* ``go build mlr.go`` creates the ``go/mlr`` executable without running regression tests
From git clone
----------------------------------------------------------------
* ``git clone https://github.com/johnkerl/miller``
* ``cd miller/go``
* ``./build`` creates the ``go/mlr`` executable and runs regression tests
* ``go build mlr.go`` creates the ``go/mlr`` executable without running regression tests
In case of problems
----------------------------------------------------------------
If you have any build errors, feel free to open an issue with "New Issue" at https://github.com/johnkerl/miller/issues.
Dependencies
----------------------------------------------------------------
Required external dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These are necessary to produce the ``mlr`` executable.
* Go version 1.16 or higher
* Others packaged within ``go.mod`` and ``go.sum`` which you don't need to deal with manually -- the Go build process handles them for us
Optional external dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This documentation pageset is built using Sphinx. Please see https://github.com/johnkerl/miller/blob/main/docs6/README.md for details.
Creating a new release: for developers
----------------------------------------------------------------
At present I'm the primary developer so this is just my checklist for making new releases.
In this example I am using version 6.1.0 to 6.2.0; of course that will change for subsequent revisions.
* Update version found in ``mlr --version`` and ``man mlr``:
* Edit ``go/src/version/version.go`` from ``6.1.0-dev`` to ``6.2.0``.
* Likewise ``docs6/conf.py``
* ``cd ../docs6``
* ``export PATH=../go:$PATH``
* ``make html``
* The ordering is important: the first build creates ``mlr``; the second runs ``mlr`` to create ``manpage.txt``; the third includes ``manpage.txt`` into one of its outputs.
* Commit and push.
* Create the release tarball and SRPM:
* TBD for the Go port ...
* Linux/MacOS/Windows binaries from GitHub Actions ...
* Pull back release tarball ``mlr-6.2.0.tar.gz`` from buildbox, and ``mlr.{arch}`` binaries from whatever buildboxes.
* Create the Github release tag:
* Don't forget the ``v`` in ``v6.2.0``
* Write the release notes
* Attach the release tarball and binaries. Double-check assets were successfully uploaded.
* Publish the release
* Check the release-specific docs:
* Look at https://miller.readthedocs.io for new-version docs, after a few minutes' propagation time.
* Notify:
* Submit ``brew`` pull request; notify any other distros which don't appear to have autoupdated since the previous release (notes below)
* Similarly for ``macports``: https://github.com/macports/macports-ports/blob/master/textproc/miller/Portfile.
* Social-media updates.
.. code-block:: none
git remote add upstream https://github.com/Homebrew/homebrew-core # one-time setup only
git fetch upstream
git rebase upstream/master
git checkout -b miller-6.1.0
shasum -a 256 /path/to/mlr-6.1.0.tar.gz
edit Formula/miller.rb
# Test the URL from the line like
# url "https://github.com/johnkerl/miller/releases/download/v6.1.0/mlr-6.1.0.tar.gz"
# in a browser for typos
# A '@BrewTestBot Test this please' comment within the homebrew-core pull request will restart the homebrew travis build
git add Formula/miller.rb
git commit -m 'miller 6.1.0'
git push -u origin miller-6.1.0
(submit the pull request)
* Afterwork:
* Edit ``go/src/version/version.go`` and ``docs6/conf.py`` to change version from ``6.2.0`` to ``6.2.0-dev``.
* ``cd go``
* ``./build``
* Commit and push.
Misc. development notes
----------------------------------------------------------------
I use terminal width 120 and tabwidth 4.

View file

@ -1,118 +0,0 @@
Building from source
================================================================
Please also see :doc:`installation` for information about pre-built executables.
Miller license
----------------------------------------------------------------
Two-clause BSD license https://github.com/johnkerl/miller/blob/master/LICENSE.txt.
From release tarball
----------------------------------------------------------------
* Obtain ``mlr-i.j.k.tar.gz`` from https://github.com/johnkerl/miller/tags, replacing ``i.j.k`` with the desired release, e.g. ``6.1.0``.
* ``tar zxvf mlr-i.j.k.tar.gz``
* ``cd mlr-i.j.k``
* ``cd go``
* ``./build`` creates the ``go/mlr`` executable and runs regression tests
* ``go build mlr.go`` creates the ``go/mlr`` executable without running regression tests
From git clone
----------------------------------------------------------------
* ``git clone https://github.com/johnkerl/miller``
* ``cd miller/go``
* ``./build`` creates the ``go/mlr`` executable and runs regression tests
* ``go build mlr.go`` creates the ``go/mlr`` executable without running regression tests
In case of problems
----------------------------------------------------------------
If you have any build errors, feel free to open an issue with "New Issue" at https://github.com/johnkerl/miller/issues.
Dependencies
----------------------------------------------------------------
Required external dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These are necessary to produce the ``mlr`` executable.
* Go version 1.16 or higher
* Others packaged within ``go.mod`` and ``go.sum`` which you don't need to deal with manually -- the Go build process handles them for us
Optional external dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This documentation pageset is built using Sphinx. Please see https://github.com/johnkerl/miller/blob/main/docs6/README.md for details.
Creating a new release: for developers
----------------------------------------------------------------
At present I'm the primary developer so this is just my checklist for making new releases.
In this example I am using version 6.1.0 to 6.2.0; of course that will change for subsequent revisions.
* Update version found in ``mlr --version`` and ``man mlr``:
* Edit ``go/src/version/version.go`` from ``6.1.0-dev`` to ``6.2.0``.
* Likewise ``docs6/conf.py``
* ``cd ../docs6``
* ``export PATH=../go:$PATH``
* ``make html``
* The ordering is important: the first build creates ``mlr``; the second runs ``mlr`` to create ``manpage.txt``; the third includes ``manpage.txt`` into one of its outputs.
* Commit and push.
* Create the release tarball and SRPM:
* TBD for the Go port ...
* Linux/MacOS/Windows binaries from GitHub Actions ...
* Pull back release tarball ``mlr-6.2.0.tar.gz`` from buildbox, and ``mlr.{arch}`` binaries from whatever buildboxes.
* Create the Github release tag:
* Don't forget the ``v`` in ``v6.2.0``
* Write the release notes
* Attach the release tarball and binaries. Double-check assets were successfully uploaded.
* Publish the release
* Check the release-specific docs:
* Look at https://miller.readthedocs.io for new-version docs, after a few minutes' propagation time.
* Notify:
* Submit ``brew`` pull request; notify any other distros which don't appear to have autoupdated since the previous release (notes below)
* Similarly for ``macports``: https://github.com/macports/macports-ports/blob/master/textproc/miller/Portfile.
* Social-media updates.
GENRST_CARDIFY
git remote add upstream https://github.com/Homebrew/homebrew-core # one-time setup only
git fetch upstream
git rebase upstream/master
git checkout -b miller-6.1.0
shasum -a 256 /path/to/mlr-6.1.0.tar.gz
edit Formula/miller.rb
# Test the URL from the line like
# url "https://github.com/johnkerl/miller/releases/download/v6.1.0/mlr-6.1.0.tar.gz"
# in a browser for typos
# A '@BrewTestBot Test this please' comment within the homebrew-core pull request will restart the homebrew travis build
git add Formula/miller.rb
git commit -m 'miller 6.1.0'
git push -u origin miller-6.1.0
(submit the pull request)
GENRST_EOF
* Afterwork:
* Edit ``go/src/version/version.go`` and ``docs6/conf.py`` to change version from ``6.2.0`` to ``6.2.0-dev``.
* ``cd go``
* ``./build``
* Commit and push.
Misc. development notes
----------------------------------------------------------------
I use terminal width 120 and tabwidth 4.

View file

@ -1,10 +0,0 @@
..
PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
Community
=========
* See `Miller GitHub Discussions <https://github.com/johnkerl/miller/discussions>`_ for general Q&A, advice, sharing success stories, etc.
* See also `Miller-tagged questions on Stack Overflow <https://stackoverflow.com/questions/tagged/miller?tab=Newest>`_
* See `Miller GitHub Issues <https://github.com/johnkerl/miller/issues>`_ for bug reports and feature requests
* Other correspondence: mailto:kerl.john.r+miller@gmail.com

View file

@ -1,7 +0,0 @@
Community
=========
* See `Miller GitHub Discussions <https://github.com/johnkerl/miller/discussions>`_ for general Q&A, advice, sharing success stories, etc.
* See also `Miller-tagged questions on Stack Overflow <https://stackoverflow.com/questions/tagged/miller?tab=Newest>`_
* See `Miller GitHub Issues <https://github.com/johnkerl/miller/issues>`_ for bug reports and feature requests
* Other correspondence: mailto:kerl.john.r+miller@gmail.com

View file

@ -1,112 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
# -- Project information -----------------------------------------------------
project = 'Miller'
copyright = '2021, John Kerl'
author = 'John Kerl'
# The full version, including alpha/beta/rc tags
release = '6.0.0-alpha'
# -- General configuration ---------------------------------------------------
master_doc = 'index'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
#html_theme = 'alabaster'
#html_theme = 'classic'
#html_theme = 'sphinxdoc'
#html_theme = 'nature'
html_theme = 'scrolls'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# ----------------------------------------------------------------
# Include code-sample files in the Sphinx build tree.
# See also https://github.com/johnkerl/miller/issues/560.
#
# There is a problem, and an opportunity for a hack.
#
# * Our data files are in ./data/* (and a few other subdirs of .).
#
# * We want them copied to ./_build/data/* so that we can symlink from our doc
# files (written in ./*.rst. autogenned to HTML in ./_build/html/*.html) to
# relative paths like ./data/a.csv.
#
# * If we use html_extra_path = ['data'] then the files like ./data/a.csv
# are copied to _build/html/a.csv -- one directory 'down'. This means that
# example Miller commands are shown in the generated HTML using 'mlr --csv
# cat data/a.csv' but 'data/a.csv' doesn't exist relative to _build/html.
# This is bad enough for local Sphinx builds but worse for readthedocs
# (https://miller.readthedocs.io) since *only* _build/html files are put into
# readthedocs.
#
# * In our Makefile it's easy enough to do some cp -a commands from ./data
# to ./_build_html/data etc. for local Sphinx builds -- however, readthedocs
# doesn't use the Makefile at all, only this conf.py file.
#
# * Hence the hack: we have a subdir ./sphinx-hack which has a symlink
# ./sphinx-hack/data pointing to ./data. So when the Sphinx build executes
# html_extra_path and removes one directory level, it's an 'extra' level we
# can do without.
#
# * This all relies on symlinks being propagated through GitHub version
# control, readthedocs, and Sphinx build at readthedocs.
html_extra_path = [
'sphinx-hack',
'10-1.sh',
'10-2.sh',
'circle.csv',
'commas.csv',
'dates.csv',
'example.csv',
'expo-sample.sh',
'log.txt',
'make.bat',
'manpage.txt',
'oosvar-example-ewma.sh',
'oosvar-example-sum-grouped.sh',
'oosvar-example-sum.sh',
'sample_mlrrc',
'square.csv',
'triangle.csv',
'variance.mlr',
'verb-example-ewma.sh',
]

View file

@ -1,47 +0,0 @@
..
PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
How to contribute
================================================================
Community
----------------------------------------------------------------
You can ask questions -- or answer them! -- following the links at :doc:`community`.
Documentation improvements
----------------------------------------------------------------
Pre-release Miller documentation is at https://github.com/johnkerl/miller/tree/main/docs6.
Clone https://github.com/johnkerl/miller and `cd` into `docs6`.
After ``sudo pip install sphinx`` (or ``pip3``) you should be able to do ``make html``.
Edit ``*.rst.in`` files, then run ``make html``, which regenerates the ``*.rst`` files and runs the Sphinx document-generator.
Open ``_build/html/index.html`` in your browser, e.g. ``file:///Users/yourname/git/miller/docs6/_build/html/contributing.html``, to verify.
PRs are welcome at https://github.com/johnkerl/miller.
Once PRs are merged, readthedocs creates https://miller.readthedocs.io using the following configs:
* https://readthedocs.org/projects/miller/
* https://readthedocs.org/projects/miller/builds/
* https://github.com/johnkerl/miller/settings/hooks
Testing
----------------------------------------------------------------
As of Miller-6's current pre-release status, the best way to test is to either build from source via :doc:`build`, or by getting a recent binary at https://github.com/johnkerl/miller/actions, then click latest build, then *Artifacts*. Then simply use Miller for whatever you do, and create an issue at https://github.com/johnkerl/miller/issues.
Do note that as of 2021-06-17 a few things have not been ported to Miller 6 -- most notably regex captures and localtime DSL functions.
Feature development
----------------------------------------------------------------
Issues: https://github.com/johnkerl/miller/issues
Developer notes: https://github.com/johnkerl/miller/blob/main/go/README.md
PRs which pass regression test (https://github.com/johnkerl/miller/blob/main/go/regtest/README.md) are always welcome!

View file

@ -1,44 +0,0 @@
How to contribute
================================================================
Community
----------------------------------------------------------------
You can ask questions -- or answer them! -- following the links at :doc:`community`.
Documentation improvements
----------------------------------------------------------------
Pre-release Miller documentation is at https://github.com/johnkerl/miller/tree/main/docs6.
Clone https://github.com/johnkerl/miller and `cd` into `docs6`.
After ``sudo pip install sphinx`` (or ``pip3``) you should be able to do ``make html``.
Edit ``*.rst.in`` files, then run ``make html``, which regenerates the ``*.rst`` files and runs the Sphinx document-generator.
Open ``_build/html/index.html`` in your browser, e.g. ``file:///Users/yourname/git/miller/docs6/_build/html/contributing.html``, to verify.
PRs are welcome at https://github.com/johnkerl/miller.
Once PRs are merged, readthedocs creates https://miller.readthedocs.io using the following configs:
* https://readthedocs.org/projects/miller/
* https://readthedocs.org/projects/miller/builds/
* https://github.com/johnkerl/miller/settings/hooks
Testing
----------------------------------------------------------------
As of Miller-6's current pre-release status, the best way to test is to either build from source via :doc:`build`, or by getting a recent binary at https://github.com/johnkerl/miller/actions, then click latest build, then *Artifacts*. Then simply use Miller for whatever you do, and create an issue at https://github.com/johnkerl/miller/issues.
Do note that as of 2021-06-17 a few things have not been ported to Miller 6 -- most notably regex captures and localtime DSL functions.
Feature development
----------------------------------------------------------------
Issues: https://github.com/johnkerl/miller/issues
Developer notes: https://github.com/johnkerl/miller/blob/main/go/README.md
PRs which pass regression test (https://github.com/johnkerl/miller/blob/main/go/regtest/README.md) are always welcome!

View file

@ -1,149 +0,0 @@
..
PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
CSV, with and without headers
=============================
Headerless CSV on input or output
----------------------------------------------------------------
Sometimes we get CSV files which lack a header. For example (`data/headerless.csv <./data/headerless.csv>`_):
.. code-block:: none
:emphasize-lines: 1-1
cat data/headerless.csv
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present
You can use Miller to add a header. The ``--implicit-csv-header`` flag applies positionally indexed labels:
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv --implicit-csv-header cat data/headerless.csv
1,2,3
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present
Following that, you can rename the positionally indexed labels to names with meaning for your context. For example:
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv --implicit-csv-header label name,age,status data/headerless.csv
name,age,status
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present
Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller's output to the system command ``sed 1d``, or you can use Miller's ``--headerless-csv-output`` option:
.. code-block:: none
:emphasize-lines: 1-1
head -5 data/colored-shapes.dkvp | mlr --ocsv cat
color,shape,flag,i,u,v,w,x
yellow,triangle,1,11,0.6321695890307647,0.9887207810889004,0.4364983936735774,5.7981881667050565
red,square,1,15,0.21966833570651523,0.001257332190235938,0.7927778364718627,2.944117399716207
red,circle,1,16,0.20901671281497636,0.29005231936593445,0.13810280912907674,5.065034003400998
red,square,0,48,0.9562743938458542,0.7467203085342884,0.7755423050923582,7.117831369597269
purple,triangle,0,51,0.4355354501763202,0.8591292672156728,0.8122903963006748,5.753094629505863
.. code-block:: none
:emphasize-lines: 1-1
head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat
yellow,triangle,1,11,0.6321695890307647,0.9887207810889004,0.4364983936735774,5.7981881667050565
red,square,1,15,0.21966833570651523,0.001257332190235938,0.7927778364718627,2.944117399716207
red,circle,1,16,0.20901671281497636,0.29005231936593445,0.13810280912907674,5.065034003400998
red,square,0,48,0.9562743938458542,0.7467203085342884,0.7755423050923582,7.117831369597269
purple,triangle,0,51,0.4355354501763202,0.8591292672156728,0.8122903963006748,5.753094629505863
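The ``sed 1d`` alternative mentioned above can be sketched with a small stand-in file (the file contents here are hypothetical; any CSV with a header row works the same way):

```shell
# Drop the header line from already-produced CSV output using sed:
# "1d" deletes line 1, leaving only the data rows.
tmpcsv=$(mktemp)
printf 'color,shape\nyellow,triangle\nred,square\n' > "$tmpcsv"
sed 1d "$tmpcsv"
```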
Lastly, often we say "CSV" or "TSV" when we have positionally indexed data in columns which are separated by commas or tabs, respectively. In this case it's perhaps simpler to **just use NIDX format** which was designed for this purpose. (See also :doc:`file-formats`.) For example:
.. code-block:: none
:emphasize-lines: 1-1
mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv
1 John
3 present
1 Fred
3 present
1 Alice
3 missing
1 Carol
3 present
Headerless CSV with duplicate field values
------------------------------------------
Miller is (by central design) a mapping from name to value, rather than from integer position to value as in most tools in the Unix toolkit such as ``sort``, ``cut``, ``awk``, etc. So given the input ``Yea=1,Yea=2`` on the same input line, first ``Yea=1`` is stored, then updated with ``Yea=2``. This happens in the input parser, and the value ``Yea=1`` is unavailable to any further processing. The following example line comes from a headerless CSV file and includes the string (value) ``'NA'`` five times:
.. code-block:: none
:emphasize-lines: 1-1
ag '0.9' nas.csv | head -1
2:-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
The repeated ``'NA'`` strings (values) in the same line will be treated as fields (columns) with the same name, so only one is kept in the output.
This can be worked around by telling ``mlr`` that there is no header row by using ``--implicit-csv-header`` or changing the input format by using ``nidx`` like so:
.. code-block:: none
ag '0.9' nas.csv | mlr --n2c --fs "," label xsn,ysn,x,y,t,a,e29,e31,e32 then head
Regularizing ragged CSV
----------------------------------------------------------------
Miller handles compliant CSV: in particular, it's an error if the number of data fields in a given data line doesn't match the number of fields in the header line. But in the event that you have a CSV file in which some lines have fewer than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, in which each line stands on its own, without respect to a header line.
.. code-block:: none
:emphasize-lines: 1-1
cat data/ragged.csv
a,b,c
1,2,3
4,5
6,7,8,9
.. code-block:: none
:emphasize-lines: 1-8
mlr --from data/ragged.csv --fs comma --nidx put '
@maxnf = max(@maxnf, NF);
@nf = NF;
while(@nf < @maxnf) {
@nf += 1;
$[@nf] = ""
}
'
a,b,c
1,2,3
4,5
6,7,8,9
or, more simply,
.. code-block:: none
:emphasize-lines: 1-6
mlr --from data/ragged.csv --fs comma --nidx put '
@maxnf = max(@maxnf, NF);
while(NF < @maxnf) {
$[NF+1] = "";
}
'
a,b,c
1,2,3
4,5
6,7,8,9
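For comparison -- a sketch, not part of Miller's own recipe -- the same running-maximum padding idea can be written in plain ``awk``:

```shell
# Pad ragged comma-separated lines out to the widest width seen so far,
# mirroring the @maxnf logic of the Miller DSL example above.
ragged=$(mktemp)
printf 'a,b,c\n1,2,3\n4,5\n6,7,8,9\n' > "$ragged"
awk -F, -v OFS=, '{
    if (NF > max) max = NF;              # track widest line seen so far
    for (i = NF + 1; i <= max; i++) $i = "";  # assigning $i extends NF
    print
}' "$ragged"
```

Like the Miller version, this pads only relative to the maximum width seen *so far*, so a wider line late in the file won't retroactively pad earlier lines.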

View file

@ -1,72 +0,0 @@
CSV, with and without headers
=============================
Headerless CSV on input or output
----------------------------------------------------------------
Sometimes we get CSV files which lack a header. For example (`data/headerless.csv <./data/headerless.csv>`_):
GENRST_RUN_COMMAND
cat data/headerless.csv
GENRST_EOF
You can use Miller to add a header. The ``--implicit-csv-header`` flag applies positionally indexed labels:
GENRST_RUN_COMMAND
mlr --csv --implicit-csv-header cat data/headerless.csv
GENRST_EOF
Following that, you can rename the positionally indexed labels to names with meaning for your context. For example:
GENRST_RUN_COMMAND
mlr --csv --implicit-csv-header label name,age,status data/headerless.csv
GENRST_EOF
Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller's output to the system command ``sed 1d``, or you can use Miller's ``--headerless-csv-output`` option:
GENRST_RUN_COMMAND
head -5 data/colored-shapes.dkvp | mlr --ocsv cat
GENRST_EOF
GENRST_RUN_COMMAND
head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat
GENRST_EOF
Lastly, often we say "CSV" or "TSV" when we have positionally indexed data in columns which are separated by commas or tabs, respectively. In this case it's perhaps simpler to **just use NIDX format** which was designed for this purpose. (See also :doc:`file-formats`.) For example:
GENRST_RUN_COMMAND
mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv
GENRST_EOF
Headerless CSV with duplicate field values
------------------------------------------
Miller is (by central design) a mapping from name to value, rather than from integer position to value as in most tools in the Unix toolkit such as ``sort``, ``cut``, ``awk``, etc. So given the input ``Yea=1,Yea=2`` on the same input line, first ``Yea=1`` is stored, then updated with ``Yea=2``. This happens in the input parser, and the value ``Yea=1`` is unavailable to any further processing. The following example line comes from a headerless CSV file and includes the string (value) ``'NA'`` five times:
GENRST_CARDIFY_HIGHLIGHT_ONE
ag '0.9' nas.csv | head -1
2:-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
GENRST_EOF
The repeated ``'NA'`` strings (values) in the same line will be treated as fields (columns) with the same name, so only one is kept in the output.
This can be worked around by telling ``mlr`` that there is no header row by using ``--implicit-csv-header`` or changing the input format by using ``nidx`` like so:
GENRST_CARDIFY
ag '0.9' nas.csv | mlr --n2c --fs "," label xsn,ysn,x,y,t,a,e29,e31,e32 then head
GENRST_EOF
Regularizing ragged CSV
----------------------------------------------------------------
Miller handles compliant CSV: in particular, it's an error if the number of data fields in a given data line doesn't match the number of fields in the header line. But in the event that you have a CSV file in which some lines have fewer than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, in which each line stands on its own, without respect to a header line.
GENRST_RUN_COMMAND
cat data/ragged.csv
GENRST_EOF
GENRST_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv.sh)
or, more simply,
GENRST_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv-2.sh)

View file

@ -1,96 +0,0 @@
..
PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
Customization: .mlrrc
================================================================
How to use .mlrrc
----------------------------------------------------------------
Suppose you always use CSV files. Then instead of always having to type ``--csv`` as in
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv cut -x -f extra mydata.csv
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv sort -n id mydata.csv
and so on, you can instead put the following into your ``$HOME/.mlrrc``:
.. code-block:: none
--csv
Then you can just type things like
.. code-block:: none
:emphasize-lines: 1-1
mlr cut -x -f extra mydata.csv
.. code-block:: none
:emphasize-lines: 1-1
mlr sort -n id mydata.csv
and the ``--csv`` part will automatically be understood. (If you do want to process, say, a JSON file then ``mlr --json ...`` at the command line will override the default from your ``.mlrrc``.)
What you can put in your .mlrrc
----------------------------------------------------------------
* You can include any command-line flags, except the "terminal" ones such as ``--help``.
* The ``--prepipe``, ``--load``, and ``--mload`` flags aren't allowed in ``.mlrrc``, since they control code execution and could result in your scripts running things you don't expect if you receive a data directory from someone with a ``.mlrrc`` file in it.
* The formatting rule is you need to put one flag beginning with ``--`` per line: for example, ``--csv`` on one line and ``--nr-progress-mod 1000`` on a separate line.
* Since every line starts with a ``--`` option, you can leave off the initial ``--`` if you want. For example, ``ojson`` is the same as ``--ojson``, and ``nr-progress-mod 1000`` is the same as ``--nr-progress-mod 1000``.
* Comments are from a ``#`` to the end of the line.
* Empty lines are ignored -- including lines which are empty after comments are removed.
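Taken together, the rules above amount to a simple per-line transformation. Here's a rough Python sketch of those rules as stated (an illustration, not Miller's implementation):

```python
def mlrrc_line_to_flag(line):
    # Strip comments ('#' to end of line) and surrounding whitespace;
    # skip lines that end up empty; restore the leading '--' if omitted.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    return line if line.startswith("--") else "--" + line

print(mlrrc_line_to_flag("csv  # default format"))
```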
Here is an example ``.mlrrc`` file:
.. code-block:: none
# Input and output formats are CSV by default (unless otherwise specified
# on the mlr command line):
csv
# If a data line has fewer fields than the header line, instead of erroring
# (which is the default), just insert empty values for the missing ones:
allow-ragged-csv-input
# These are no-ops for CSV, but when I do use JSON output, I want these
# pretty-printing options to be used:
jvstack
jlistwrap
# Use "@", rather than "#", for comments within data files:
skip-comments-with @
Where to put your .mlrrc
----------------------------------------------------------------
If the environment variable ``MLRRC`` is set:
* If its value is ``__none__`` then no ``.mlrrc`` files are processed. (This is nice for things like regression testing.)
* Otherwise, its value (as a filename) is loaded and processed. If there are syntax errors, they abort ``mlr`` with a usage message (as if you had mistyped something on the command line). If the file can't be loaded at all, though, it is silently skipped.
* Any ``.mlrrc`` in your home directory or current directory is ignored whenever ``MLRRC`` is set in the environment.
* Example line in your shell's rc file: ``export MLRRC=/path/to/my/mlrrc``
Otherwise:
* If ``$HOME/.mlrrc`` exists, it's processed as above.
* If ``./.mlrrc`` exists, it's then also processed as above.
* The idea is you can have all your settings in your ``$HOME/.mlrrc``, then override maybe one or two for your current directory if you like.
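The lookup order can be summarized as a small function. This is a hypothetical Python sketch of the rules above, not Miller's code:

```python
import os

def mlrrc_candidate_paths(env):
    # If MLRRC is set: "__none__" disables rc-file processing entirely;
    # any other value names the single file to load.
    mlrrc = env.get("MLRRC")
    if mlrrc is not None:
        return [] if mlrrc == "__none__" else [mlrrc]
    # Otherwise: $HOME/.mlrrc first, then ./.mlrrc.
    paths = []
    home = env.get("HOME")
    if home:
        paths.append(os.path.join(home, ".mlrrc"))
    paths.append("./.mlrrc")
    return paths

print(mlrrc_candidate_paths({"HOME": "/home/u"}))
```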


@ -1,73 +0,0 @@
Customization: .mlrrc
================================================================
How to use .mlrrc
----------------------------------------------------------------
Suppose you always use CSV files. Then instead of always having to type ``--csv`` as in
GENRST_CARDIFY_HIGHLIGHT_ONE
mlr --csv cut -x -f extra mydata.csv
GENRST_EOF
or
GENRST_CARDIFY_HIGHLIGHT_ONE
mlr --csv sort -n id mydata.csv
GENRST_EOF
and so on, you can instead put the following into your ``$HOME/.mlrrc``:
GENRST_CARDIFY
--csv
GENRST_EOF
Then you can just type things like
GENRST_CARDIFY_HIGHLIGHT_ONE
mlr cut -x -f extra mydata.csv
GENRST_EOF
GENRST_CARDIFY_HIGHLIGHT_ONE
mlr sort -n id mydata.csv
GENRST_EOF
and the ``--csv`` part will automatically be understood. (If you do want to process, say, a JSON file then ``mlr --json ...`` at the command line will override the default from your ``.mlrrc``.)
What you can put in your .mlrrc
----------------------------------------------------------------
* You can include any command-line flags, except the "terminal" ones such as ``--help``.
* The ``--prepipe``, ``--load``, and ``--mload`` flags aren't allowed in ``.mlrrc``, since they control code execution and could result in your scripts running things you don't expect if you receive a data directory from someone with a ``.mlrrc`` file in it.
* The formatting rule is you need to put one flag beginning with ``--`` per line: for example, ``--csv`` on one line and ``--nr-progress-mod 1000`` on a separate line.
* Since every line starts with a ``--`` option, you can leave off the initial ``--`` if you want. For example, ``ojson`` is the same as ``--ojson``, and ``nr-progress-mod 1000`` is the same as ``--nr-progress-mod 1000``.
* Comments are from a ``#`` to the end of the line.
* Empty lines are ignored -- including lines which are empty after comments are removed.
Here is an example ``.mlrrc`` file:
GENRST_INCLUDE_ESCAPED(sample_mlrrc)
Where to put your .mlrrc
----------------------------------------------------------------
If the environment variable ``MLRRC`` is set:
* If its value is ``__none__`` then no ``.mlrrc`` files are processed. (This is nice for things like regression testing.)
* Otherwise, its value (as a filename) is loaded and processed. If there are syntax errors, they abort ``mlr`` with a usage message (as if you had mistyped something on the command line). If the file can't be loaded at all, though, it is silently skipped.
* Any ``.mlrrc`` in your home directory or current directory is ignored whenever ``MLRRC`` is set in the environment.
* Example line in your shell's rc file: ``export MLRRC=/path/to/my/mlrrc``
Otherwise:
* If ``$HOME/.mlrrc`` exists, it's processed as above.
* If ``./.mlrrc`` exists, it's then also processed as above.
* The idea is you can have all your settings in your ``$HOME/.mlrrc``, then override maybe one or two for your current directory if you like.


@ -1,77 +0,0 @@
..
PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
Data-cleaning examples
================================================================
Here are some ways to use the type-checking options described in :ref:`reference-dsl-type-tests-and-assertions`. Suppose you have the following data file, with inconsistent typing for booleans. (Also imagine that, for the sake of discussion, we have a million-line file rather than a four-line file, so we can't see it all at once and some automation is called for.)
.. code-block:: none
:emphasize-lines: 1-1
cat data/het-bool.csv
name,reachable
barney,false
betty,true
fred,true
wilma,1
One option is to coerce everything to boolean, or integer:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint put '$reachable = boolean($reachable)' data/het-bool.csv
name reachable
barney false
betty true
fred true
wilma true
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint put '$reachable = int(boolean($reachable))' data/het-bool.csv
name reachable
barney 0
betty 1
fred 1
wilma 1
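The coercion performed here can be mimicked in Python. This is a rough sketch for illustration only, assuming (as the output above shows for this dataset) that anything besides ``false``/``0`` counts as true:

```python
def coerce_bool(value):
    # Rough mimic for this dataset's encodings: "false"/"0" map to False,
    # everything else ("true", "1", ...) maps to True.
    return str(value).strip().lower() not in ("false", "0")

def coerce_int(value):
    # Boolean-then-integer coercion, as in int(boolean($reachable)).
    return int(coerce_bool(value))

print([coerce_int(v) for v in ["false", "true", "true", "1"]])
```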
A second option is to flag badly formatted data within the output stream:
.. code-block:: none
:emphasize-lines: 1-1
mlr --icsv --opprint put '$format_ok = is_string($reachable)' data/het-bool.csv
name reachable format_ok
barney false false
betty true false
fred true false
wilma 1 false
Or perhaps to flag badly formatted data outside the output stream:
.. code-block:: none
:emphasize-lines: 1-3
mlr --icsv --opprint put '
if (!is_string($reachable)) {eprint "Malformed at NR=".NR}
' data/het-bool.csv
Malformed at NR=1
Malformed at NR=2
Malformed at NR=3
Malformed at NR=4
name reachable
barney false
betty true
fred true
wilma 1
A third way is to abort the process on the first instance of bad data:
.. code-block:: none
:emphasize-lines: 1-1
mlr --csv put '$reachable = asserting_string($reachable)' data/het-bool.csv
Miller: is_string type-assertion failed at NR=1 FNR=1 FILENAME=data/het-bool.csv


@ -1,38 +0,0 @@
Data-cleaning examples
================================================================
Here are some ways to use the type-checking options described in :ref:`reference-dsl-type-tests-and-assertions`. Suppose you have the following data file, with inconsistent typing for booleans. (Also imagine that, for the sake of discussion, we have a million-line file rather than a four-line file, so we can't see it all at once and some automation is called for.)
GENRST_RUN_COMMAND
cat data/het-bool.csv
GENRST_EOF
One option is to coerce everything to boolean, or integer:
GENRST_RUN_COMMAND
mlr --icsv --opprint put '$reachable = boolean($reachable)' data/het-bool.csv
GENRST_EOF
GENRST_RUN_COMMAND
mlr --icsv --opprint put '$reachable = int(boolean($reachable))' data/het-bool.csv
GENRST_EOF
A second option is to flag badly formatted data within the output stream:
GENRST_RUN_COMMAND
mlr --icsv --opprint put '$format_ok = is_string($reachable)' data/het-bool.csv
GENRST_EOF
Or perhaps to flag badly formatted data outside the output stream:
GENRST_RUN_COMMAND
mlr --icsv --opprint put '
if (!is_string($reachable)) {eprint "Malformed at NR=".NR}
' data/het-bool.csv
GENRST_EOF
A third way is to abort the process on the first instance of bad data:
GENRST_RUN_COMMAND_TOLERATING_ERROR
mlr --csv put '$reachable = asserting_string($reachable)' data/het-bool.csv
GENRST_EOF


@ -1,234 +0,0 @@
..
PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
Data-diving examples
================================================================
flins data
----------------------------------------------------------------
The `flins.csv <data/flins.csv>`_ file is some sample data obtained from https://support.spatialkey.com/spatialkey-sample-csv-data.
Vertical-tabular format is good for a quick look at CSV data layout -- seeing what columns you have to work with:
.. code-block:: none
:emphasize-lines: 1-1
head -n 2 data/flins.csv | mlr --icsv --oxtab cat
county Seminole
tiv_2011 22890.55
tiv_2012 20848.71
line Residential
A few simple queries:
.. code-block:: none
:emphasize-lines: 1-1
mlr --from data/flins.csv --icsv --opprint count-distinct -f county | head
county count
Seminole 1
Miami Dade 2
Palm Beach 1
Highlands 2
Duval 1
St. Johns 1
.. code-block:: none
:emphasize-lines: 1-1
mlr --from data/flins.csv --icsv --opprint count-distinct -f construction,line
Categorization of total insured value:
.. code-block:: none
:emphasize-lines: 1-1
mlr --from data/flins.csv --icsv --opprint stats1 -a min,mean,max -f tiv_2012
tiv_2012_min tiv_2012_mean tiv_2012_max
19757.91 1061531.4637499999 2785551.63
.. code-block:: none
:emphasize-lines: 1-2
mlr --from data/flins.csv --icsv --opprint \
stats1 -a min,mean,max -f tiv_2012 -g construction,line
.. code-block:: none
:emphasize-lines: 1-2
mlr --from data/flins.csv --icsv --oxtab \
stats1 -a p0,p10,p50,p90,p95,p99,p100 -f hu_site_deductible
.. code-block:: none
:emphasize-lines: 1-3
mlr --from data/flins.csv --icsv --opprint \
stats1 -a p95,p99,p100 -f hu_site_deductible -g county \
then sort -f county | head
county
Duval
Highlands
Miami Dade
Palm Beach
Seminole
St. Johns
.. code-block:: none
:emphasize-lines: 1-2
mlr --from data/flins.csv --icsv --oxtab \
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
tiv_2011_tiv_2012_corr 0.9353629581411828
tiv_2011_tiv_2012_ols_m 1.0890905877734807
tiv_2011_tiv_2012_ols_b 103095.52335638746
tiv_2011_tiv_2012_ols_n 8
tiv_2011_tiv_2012_r2 0.8749038634626236
.. code-block:: none
:emphasize-lines: 1-2
mlr --from data/flins.csv --icsv --opprint \
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county
county tiv_2011_tiv_2012_corr tiv_2011_tiv_2012_ols_m tiv_2011_tiv_2012_ols_b tiv_2011_tiv_2012_ols_n tiv_2011_tiv_2012_r2
Seminole - - - 1 -
Miami Dade 1 0.9306426512386247 -2311.1543275160047 2 0.9999999999999999
Palm Beach - - - 1 -
Highlands 0.9999999999999997 1.055692910750992 -4529.7939388307705 2 0.9999999999999992
Duval - - - 1 -
St. Johns - - - 1 -
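For reference, the ``linreg-ols`` slope and intercept above come from the standard ordinary-least-squares formulas. A small Python sketch (textbook formulas, not Miller's code):

```python
def linreg_ols(xs, ys):
    # Fit y = m*x + b by ordinary least squares over n points.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    d = n * sxx - sx * sx
    m = (n * sxy - sx * sy) / d
    b = (sy * sxx - sx * sxy) / d
    return m, b, n

print(linreg_ols([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))  # slope 2, intercept 1
```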
Color/shape data
----------------------------------------------------------------
The `colored-shapes.dkvp <https://github.com/johnkerl/miller/blob/master/docs/data/colored-shapes.dkvp>`_ file is some sample data produced by the `mkdat2 <data/mkdat2>`_ script. The idea is:
* Produce some data with known distributions and correlations, and verify that Miller recovers those properties empirically.
* Each record is labeled with one of a few colors and one of a few shapes.
* The ``flag`` field is 0 or 1, with probability dependent on color.
* The ``u`` field is plain uniform on the unit interval.
* The ``v`` field is the same, except tightly correlated with ``u`` for red circles.
* The ``w`` field is autocorrelated for each color/shape pair.
* The ``x`` field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.
Peek at the data:
.. code-block:: none
:emphasize-lines: 1-1
wc -l data/colored-shapes.dkvp
10078 data/colored-shapes.dkvp
.. code-block:: none
:emphasize-lines: 1-1
head -n 6 data/colored-shapes.dkvp | mlr --opprint cat
color shape flag i u v w x
yellow triangle 1 11 0.6321695890307647 0.9887207810889004 0.4364983936735774 5.7981881667050565
red square 1 15 0.21966833570651523 0.001257332190235938 0.7927778364718627 2.944117399716207
red circle 1 16 0.20901671281497636 0.29005231936593445 0.13810280912907674 5.065034003400998
red square 0 48 0.9562743938458542 0.7467203085342884 0.7755423050923582 7.117831369597269
purple triangle 0 51 0.4355354501763202 0.8591292672156728 0.8122903963006748 5.753094629505863
red square 0 64 0.2015510269821953 0.9531098083420033 0.7719912015786777 5.612050466474166
Look at uncategorized stats (using `creach <https://github.com/johnkerl/scripts/blob/master/fundam/creach>`_ for spacing).
Here it looks reasonable that ``u`` is unit-uniform; something's up with ``v`` but we can't yet see what:
.. code-block:: none
:emphasize-lines: 1-1
mlr --oxtab stats1 -a min,mean,max -f flag,u,v data/colored-shapes.dkvp | creach 3
flag_min 0
flag_mean 0.39888866838658465
flag_max 1
u_min 0.000043912454007477564
u_mean 0.4983263438118866
u_max 0.9999687954968421
v_min -0.09270905318501277
v_mean 0.49778696527477023
v_max 1.0724998185026013
The histogram shows the different distribution of 0/1 flags:
.. code-block:: none
:emphasize-lines: 1-1
mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp
bin_lo bin_hi flag_count u_count v_count
-0.010000000000000002 0.09000000000000002 6058 0 36
0.09000000000000002 0.19000000000000003 0 1062 988
0.19000000000000003 0.29000000000000004 0 985 1003
0.29000000000000004 0.39000000000000007 0 1024 1014
0.39000000000000007 0.4900000000000001 0 1002 991
0.4900000000000001 0.5900000000000002 0 989 1041
0.5900000000000002 0.6900000000000002 0 1001 1016
0.6900000000000002 0.7900000000000001 0 972 962
0.7900000000000001 0.8900000000000002 0 1035 1070
0.8900000000000002 0.9900000000000002 0 995 993
0.9900000000000002 1.0900000000000003 4020 1013 939
1.0900000000000003 1.1900000000000002 0 0 25
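The binning logic behind ``histogram`` is straightforward. Here is a minimal Python sketch using equal-width bins over ``[lo, hi)`` (an illustration; Miller's implementation may treat edge cases differently):

```python
def histogram_counts(values, lo, hi, nbins):
    # Count values into nbins equal-width bins over [lo, hi);
    # values outside the range are simply dropped in this sketch.
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts

print(histogram_counts([1, 2, 2, 9], 0, 10, 5))  # [1, 2, 0, 0, 1]
```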
Look at univariate stats by color and shape. In particular, color-dependent flag probabilities pop out, aligning with their original Bernoulli probabilities from the data-generator script:
.. code-block:: none
:emphasize-lines: 1-3
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g color \
then sort -f color \
data/colored-shapes.dkvp
color flag_min flag_mean flag_max u_min u_mean u_max v_min v_mean v_max
blue 0 0.5843537414965987 1 0.000043912454007477564 0.517717155039078 0.9999687954968421 0.0014886830387470518 0.49105642841387653 0.9995761761685742
green 0 0.20919747520288548 1 0.00048750676198217047 0.5048610622924616 0.9999361779701204 0.0005012669003675585 0.49908475928072205 0.9996764373885353
orange 0 0.5214521452145214 1 0.00123537823160913 0.49053241689014415 0.9988853487546249 0.0024486660337188493 0.4877637745987629 0.998475130432018
purple 0 0.09019264448336252 1 0.0002655214518428872 0.4940049543793683 0.9996465731736793 0.0003641137096487279 0.497050699948439 0.9999751864255598
red 0 0.3031674208144796 1 0.0006711367180041172 0.49255964831571375 0.9998822102016469 -0.09270905318501277 0.4965350959465078 1.0724998185026013
yellow 0 0.8924274593064402 1 0.001300228762057487 0.49712912165196765 0.99992313390574 0.0007109695568577878 0.510626599360317 0.9999189897724752
.. code-block:: none
:emphasize-lines: 1-3
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g shape \
then sort -f shape \
data/colored-shapes.dkvp
shape flag_min flag_mean flag_max u_min u_mean u_max v_min v_mean v_max
circle 0 0.3998456194519491 1 0.000043912454007477564 0.49855450951394115 0.99992313390574 -0.09270905318501277 0.49552415740048406 1.0724998185026013
square 0 0.39611178614823817 1 0.0001881939925673093 0.499385458061097 0.9999687954968421 0.00008930277299445954 0.49653825501903986 0.9999751864255598
triangle 0 0.4015421115065243 1 0.000881025170573424 0.4968585405884252 0.9996614910922645 0.000716883409890845 0.501049532862137 0.9999946837499262
Look at bivariate stats by color and shape. In particular, ``u,v`` pairwise correlation for red circles pops out:
.. code-block:: none
:emphasize-lines: 1-1
mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp
u_v_corr w_x_corr
0.13341803768384553 -0.011319938208638764
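For reference, the pairwise correlation that ``stats2 -a corr`` reports is the standard Pearson coefficient; a Python sketch:

```python
from math import sqrt

def pearson_corr(xs, ys):
    # Pearson correlation: covariance normalized by both standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

print(pearson_corr([1, 2, 3], [2, 4, 6]))  # perfectly correlated: 1.0
```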
.. code-block:: none
:emphasize-lines: 1-3
mlr --opprint --right \
stats2 -a corr -f u,v,w,x -g color,shape then sort -nr u_v_corr \
data/colored-shapes.dkvp
color shape u_v_corr w_x_corr
red circle 0.9807984157534667 -0.018565046320623148
orange square 0.17685846147882145 -0.07104374629148885
green circle 0.05764430126828069 0.011795210176784067
red square 0.055744791559722166 -0.0006802175149145207
yellow triangle 0.04457267106380469 0.02460476240108526
yellow square 0.04379171794446621 -0.04462267239937856
purple circle 0.03587354791796681 0.13411247530136805
blue square 0.03241156493114544 -0.05350791240143263
blue triangle 0.015356295190464324 -0.0006084778850362686
orange circle 0.01051866723398945 -0.1627949723421722
red triangle 0.00809781003735548 0.012485753551391776
purple triangle 0.005155038421780437 -0.04505792148014131
purple square -0.02568020549187632 0.05769444883779078
green square -0.025775985300150128 -0.003265248022084335
orange triangle -0.030456930370361554 -0.131870019629393
yellow circle -0.06477338560056926 0.07369474300245252
blue circle -0.1023476302678634 -0.030529007506883508
green triangle -0.10901830007460846 -0.0484881707807228


@ -1,118 +0,0 @@
Data-diving examples
================================================================
flins data
----------------------------------------------------------------
The `flins.csv <data/flins.csv>`_ file is some sample data obtained from https://support.spatialkey.com/spatialkey-sample-csv-data.
Vertical-tabular format is good for a quick look at CSV data layout -- seeing what columns you have to work with:
GENRST_RUN_COMMAND
head -n 2 data/flins.csv | mlr --icsv --oxtab cat
GENRST_EOF
A few simple queries:
GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint count-distinct -f county | head
GENRST_EOF
GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint count-distinct -f construction,line
GENRST_EOF
Categorization of total insured value:
GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint stats1 -a min,mean,max -f tiv_2012
GENRST_EOF
GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint \
stats1 -a min,mean,max -f tiv_2012 -g construction,line
GENRST_EOF
GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --oxtab \
stats1 -a p0,p10,p50,p90,p95,p99,p100 -f hu_site_deductible
GENRST_EOF
GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint \
stats1 -a p95,p99,p100 -f hu_site_deductible -g county \
then sort -f county | head
GENRST_EOF
GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --oxtab \
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
GENRST_EOF
GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint \
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county
GENRST_EOF
Color/shape data
----------------------------------------------------------------
The `colored-shapes.dkvp <https://github.com/johnkerl/miller/blob/master/docs/data/colored-shapes.dkvp>`_ file is some sample data produced by the `mkdat2 <data/mkdat2>`_ script. The idea is:
* Produce some data with known distributions and correlations, and verify that Miller recovers those properties empirically.
* Each record is labeled with one of a few colors and one of a few shapes.
* The ``flag`` field is 0 or 1, with probability dependent on color.
* The ``u`` field is plain uniform on the unit interval.
* The ``v`` field is the same, except tightly correlated with ``u`` for red circles.
* The ``w`` field is autocorrelated for each color/shape pair.
* The ``x`` field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.
Peek at the data:
GENRST_RUN_COMMAND
wc -l data/colored-shapes.dkvp
GENRST_EOF
GENRST_RUN_COMMAND
head -n 6 data/colored-shapes.dkvp | mlr --opprint cat
GENRST_EOF
Look at uncategorized stats (using `creach <https://github.com/johnkerl/scripts/blob/master/fundam/creach>`_ for spacing).
Here it looks reasonable that ``u`` is unit-uniform; something's up with ``v`` but we can't yet see what:
GENRST_RUN_COMMAND
mlr --oxtab stats1 -a min,mean,max -f flag,u,v data/colored-shapes.dkvp | creach 3
GENRST_EOF
The histogram shows the different distribution of 0/1 flags:
GENRST_RUN_COMMAND
mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp
GENRST_EOF
Look at univariate stats by color and shape. In particular, color-dependent flag probabilities pop out, aligning with their original Bernoulli probabilities from the data-generator script:
GENRST_RUN_COMMAND
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g color \
then sort -f color \
data/colored-shapes.dkvp
GENRST_EOF
GENRST_RUN_COMMAND
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g shape \
then sort -f shape \
data/colored-shapes.dkvp
GENRST_EOF
Look at bivariate stats by color and shape. In particular, ``u,v`` pairwise correlation for red circles pops out:
GENRST_RUN_COMMAND
mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp
GENRST_EOF
GENRST_RUN_COMMAND
mlr --opprint --right \
stats2 -a corr -f u,v,w,x -g color,shape then sort -nr u_v_corr \
data/colored-shapes.dkvp
GENRST_EOF


@ -1,126 +0,0 @@
..
PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
Dates and times
===============
How can I filter by date?
----------------------------------------------------------------
Given input like
.. code-block:: none
:emphasize-lines: 1-1
cat dates.csv
date,event
2018-02-03,initialization
2018-03-07,discovery
2018-02-03,allocation
we can use ``strptime`` to parse the date field into seconds-since-epoch and then do numeric comparisons. Simply match your input dataset's date-formatting to the :ref:`reference-dsl-strptime` format-string. For example:
.. code-block:: none
:emphasize-lines: 1-3
mlr --csv filter '
strptime($date, "%Y-%m-%d") > strptime("2018-03-03", "%Y-%m-%d")
' dates.csv
date,event
2018-03-07,discovery
Caveat: localtime-handling in timezones with DST is still a work in progress; see https://github.com/johnkerl/miller/issues/170. See also https://github.com/johnkerl/miller/issues/208 -- thanks @aborruso!
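The same parse-then-compare idea works in Python, for illustration (``datetime.strptime`` plays the role of the DSL's ``strptime``):

```python
from datetime import datetime

def is_after(date_str, cutoff_str, fmt="%Y-%m-%d"):
    # Parse both sides with the same format string, then compare.
    return datetime.strptime(date_str, fmt) > datetime.strptime(cutoff_str, fmt)

print(is_after("2018-03-07", "2018-03-03"))  # True
```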
Finding missing dates
----------------------------------------------------------------
Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates:
.. code-block:: none
:emphasize-lines: 1-1
head -n 10 data/miss-date.csv
date,qoh
2012-03-05,10055
2012-03-06,10486
2012-03-07,10430
2012-03-08,10674
2012-03-09,10880
2012-03-10,10718
2012-03-11,10795
2012-03-12,11043
2012-03-13,11177
.. code-block:: none
:emphasize-lines: 1-1
wc -l data/miss-date.csv
1372 data/miss-date.csv
Since there are 1372 lines in the data file, some automation is called for. To find the missing dates, you can convert the dates to seconds since the epoch using ``strptime``, then compute adjacent differences (the ``cat -n`` simply inserts record-counters):
.. code-block:: none
:emphasize-lines: 1-5
mlr --from data/miss-date.csv --icsv \
cat -n \
then put '$datestamp = strptime($date, "%Y-%m-%d")' \
then step -a delta -f datestamp \
| head
n=1,date=2012-03-05,qoh=10055,datestamp=1330905600,datestamp_delta=0
n=2,date=2012-03-06,qoh=10486,datestamp=1330992000,datestamp_delta=86400
n=3,date=2012-03-07,qoh=10430,datestamp=1331078400,datestamp_delta=86400
n=4,date=2012-03-08,qoh=10674,datestamp=1331164800,datestamp_delta=86400
n=5,date=2012-03-09,qoh=10880,datestamp=1331251200,datestamp_delta=86400
n=6,date=2012-03-10,qoh=10718,datestamp=1331337600,datestamp_delta=86400
n=7,date=2012-03-11,qoh=10795,datestamp=1331424000,datestamp_delta=86400
n=8,date=2012-03-12,qoh=11043,datestamp=1331510400,datestamp_delta=86400
n=9,date=2012-03-13,qoh=11177,datestamp=1331596800,datestamp_delta=86400
n=10,date=2012-03-14,qoh=11498,datestamp=1331683200,datestamp_delta=86400
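The ``step -a delta`` computation is just adjacent differences. In Python terms (an illustrative sketch, with 0 for the first record as in the output above):

```python
def deltas(values):
    # Adjacent differences; the first record gets delta 0,
    # matching datestamp_delta=0 on the first output line.
    return [0] + [b - a for a, b in zip(values, values[1:])]

print(deltas([1330905600, 1330992000, 1331078400]))  # [0, 86400, 86400]
```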
Then, filter for adjacent difference not being 86400 (the number of seconds in a day):
.. code-block:: none
:emphasize-lines: 1-5
mlr --from data/miss-date.csv --icsv \
cat -n \
then put '$datestamp = strptime($date, "%Y-%m-%d")' \
then step -a delta -f datestamp \
then filter '$datestamp_delta != 86400 && $n != 1'
n=774,date=2014-04-19,qoh=130140,datestamp=1397865600,datestamp_delta=259200
n=1119,date=2015-03-31,qoh=181625,datestamp=1427760000,datestamp_delta=172800
Given this, it's now easy to see where the gaps are:
.. code-block:: none
:emphasize-lines: 1-1
mlr cat -n then filter '$n >= 770 && $n <= 780' data/miss-date.csv
n=770,1=2014-04-12,2=129435
n=771,1=2014-04-13,2=129868
n=772,1=2014-04-14,2=129797
n=773,1=2014-04-15,2=129919
n=774,1=2014-04-16,2=130181
n=775,1=2014-04-19,2=130140
n=776,1=2014-04-20,2=130271
n=777,1=2014-04-21,2=130368
n=778,1=2014-04-22,2=130368
n=779,1=2014-04-23,2=130849
n=780,1=2014-04-24,2=131026
.. code-block:: none
:emphasize-lines: 1-1
mlr cat -n then filter '$n >= 1115 && $n <= 1125' data/miss-date.csv
n=1115,1=2015-03-25,2=181006
n=1116,1=2015-03-26,2=180995
n=1117,1=2015-03-27,2=181043
n=1118,1=2015-03-28,2=181112
n=1119,1=2015-03-29,2=181306
n=1120,1=2015-03-31,2=181625
n=1121,1=2015-04-01,2=181494
n=1122,1=2015-04-02,2=181718
n=1123,1=2015-04-03,2=181835
n=1124,1=2015-04-04,2=182104
n=1125,1=2015-04-05,2=182528
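As a cross-check, the same gaps can be found outside Miller by scanning parsed dates directly. A hypothetical Python sketch that reports any adjacent pair more than one day apart:

```python
from datetime import datetime, timedelta

def find_date_gaps(dates, fmt="%Y-%m-%d"):
    # Report adjacent pairs whose spacing is not exactly one day.
    stamps = [datetime.strptime(d, fmt) for d in dates]
    return [(a.strftime(fmt), b.strftime(fmt))
            for a, b in zip(stamps, stamps[1:])
            if b - a != timedelta(days=1)]

print(find_date_gaps(["2014-04-15", "2014-04-16", "2014-04-19"]))
```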


@ -1,52 +0,0 @@
Dates and times
===============
How can I filter by date?
----------------------------------------------------------------
Given input like
GENRST_RUN_COMMAND
cat dates.csv
GENRST_EOF
we can use ``strptime`` to parse the date field into seconds-since-epoch and then do numeric comparisons. Simply match your input dataset's date-formatting to the :ref:`reference-dsl-strptime` format-string. For example:
GENRST_RUN_COMMAND
mlr --csv filter '
strptime($date, "%Y-%m-%d") > strptime("2018-03-03", "%Y-%m-%d")
' dates.csv
GENRST_EOF
Caveat: localtime-handling in timezones with DST is still a work in progress; see https://github.com/johnkerl/miller/issues/170. See also https://github.com/johnkerl/miller/issues/208 -- thanks @aborruso!
Finding missing dates
----------------------------------------------------------------
Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates:
GENRST_RUN_COMMAND
head -n 10 data/miss-date.csv
GENRST_EOF
GENRST_RUN_COMMAND
wc -l data/miss-date.csv
GENRST_EOF
Since there are 1372 lines in the data file, some automation is called for. To find the missing dates, you can convert the dates to seconds since the epoch using ``strptime``, then compute adjacent differences (the ``cat -n`` simply inserts record-counters):
GENRST_INCLUDE_AND_RUN_ESCAPED(data/miss-date-1.sh)
Then, filter for adjacent difference not being 86400 (the number of seconds in a day):
GENRST_INCLUDE_AND_RUN_ESCAPED(data/miss-date-2.sh)
Given this, it's now easy to see where the gaps are:
GENRST_RUN_COMMAND
mlr cat -n then filter '$n >= 770 && $n <= 780' data/miss-date.csv
GENRST_EOF
GENRST_RUN_COMMAND
mlr cat -n then filter '$n >= 1115 && $n <= 1125' data/miss-date.csv
GENRST_EOF


@ -1,253 +0,0 @@
..
PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.
DKVP I/O examples
======================
DKVP I/O in Python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here are the I/O routines:
.. code-block:: none
#!/usr/bin/env python
# ================================================================
# Example of DKVP I/O using Python.
#
# Key point: Use Miller for what it's good at; pass data into/out of tools in
# other languages to do what they're good at.
#
# bash$ python -i dkvp_io.py
#
# # READ
# >>> map = dkvpline2map('x=1,y=2', '=', ',')
# >>> map
# OrderedDict([('x', '1'), ('y', '2')])
#
# # MODIFY
# >>> map['z'] = map['x'] + map['y']
# >>> map
# OrderedDict([('x', '1'), ('y', '2'), ('z', 3)])
#
# # WRITE
# >>> line = map2dkvpline(map, '=', ',')
# >>> line
# 'x=1,y=2,z=3'
#
# ================================================================
import re
import collections
# ----------------------------------------------------------------
# ips and ifs (input pair separator and input field separator) are nominally '=' and ','.
def dkvpline2map(line, ips, ifs):
pairs = re.split(ifs, line)
map = collections.OrderedDict()
for pair in pairs:
key, value = re.split(ips, pair, 1)
# Type inference:
try:
value = int(value)
except:
try:
value = float(value)
except:
pass
map[key] = value
return map
# ----------------------------------------------------------------
# ops and ofs (output pair separator and output field separator) are nominally '=' and ','.
def map2dkvpline(map, ops, ofs):
line = ''
pairs = []
for key in map:
pairs.append(str(key) + ops + str(map[key]))
return str.join(ofs, pairs)
And here is an example using them:
.. code-block:: none
:emphasize-lines: 1-1
cat polyglot-dkvp-io/example.py
#!/usr/bin/env python
import sys
import re
import copy
import dkvp_io
while True:
# Read the original record:
line = sys.stdin.readline().strip()
if line == '':
break
map = dkvp_io.dkvpline2map(line, '=', ',')
# Drop a field:
map.pop('x')
# Compute some new fields:
map['ab'] = map['a'] + map['b']
map['iy'] = map['i'] + map['y']
# Add new fields which show type of each already-existing field:
omap = copy.copy(map) # since otherwise the for-loop will modify what it loops over
keys = omap.keys()
for key in keys:
# Convert "<type 'int'>" to just "int", etc.:
type_string = str(map[key].__class__)
type_string = re.sub("<type '", "", type_string) # python2
type_string = re.sub("<class '", "", type_string) # python3
type_string = re.sub("'>", "", type_string)
map['t'+key] = type_string
# Write the modified record:
print(dkvp_io.map2dkvpline(map, '=', ','))
Run as-is:
.. code-block:: none
:emphasize-lines: 1-1
python polyglot-dkvp-io/example.py < data/small
a=pan,b=pan,i=1,y=0.7268028627434533,ab=panpan,iy=1.7268028627434533,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float
a=eks,b=pan,i=2,y=0.5221511083334797,ab=ekspan,iy=2.5221511083334796,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float
a=wye,b=wye,i=3,y=0.33831852551664776,ab=wyewye,iy=3.3383185255166477,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float
a=eks,b=wye,i=4,y=0.13418874328430463,ab=ekswye,iy=4.134188743284304,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float
a=wye,b=pan,i=5,y=0.8636244699032729,ab=wyepan,iy=5.863624469903273,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float
Run as-is, then pipe to Miller for pretty-printing:
.. code-block:: none
:emphasize-lines: 1-1
python polyglot-dkvp-io/example.py < data/small | mlr --opprint cat
a b i y ab iy ta tb ti ty tab tiy
pan pan 1 0.7268028627434533 panpan 1.7268028627434533 str str int float str float
eks pan 2 0.5221511083334797 ekspan 2.5221511083334796 str str int float str float
wye wye 3 0.33831852551664776 wyewye 3.3383185255166477 str str int float str float
eks wye 4 0.13418874328430463 ekswye 4.134188743284304 str str int float str float
wye pan 5 0.8636244699032729 wyepan 5.863624469903273 str str int float str float
DKVP I/O in Ruby
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here are the I/O routines:
.. code-block:: none

    #!/usr/bin/env ruby

    # ================================================================
    # Example of DKVP I/O using Ruby.
    #
    # Key point: Use Miller for what it's good at; pass data into/out of tools in
    # other languages to do what they're good at.
    #
    #   bash$ irb -I. -r dkvp_io.rb
    #
    #   # READ
    #   irb(main):001:0> map = dkvpline2map('x=1,y=2', '=', ',')
    #   => {"x"=>1, "y"=>2}
    #
    #   # MODIFY
    #   irb(main):002:0> map['z'] = map['x'] + map['y']
    #   => 3
    #
    #   # WRITE
    #   irb(main):003:0> line = map2dkvpline(map, '=', ',')
    #   => "x=1,y=2,z=3"
    #
    # ================================================================

    # ----------------------------------------------------------------
    # ips and ifs (input pair separator and input field separator) are nominally '=' and ','.
    def dkvpline2map(line, ips, ifs)
      map = {}
      line.split(ifs).each do |pair|
        (k, v) = pair.split(ips, 2)
        # Type inference:
        begin
          v = Integer(v)
        rescue ArgumentError
          begin
            v = Float(v)
          rescue ArgumentError
            # Leave as string
          end
        end
        map[k] = v
      end
      map
    end

    # ----------------------------------------------------------------
    # ops and ofs (output pair separator and output field separator) are nominally '=' and ','.
    def map2dkvpline(map, ops, ofs)
      map.collect{|k,v| k.to_s + ops + v.to_s}.join(ofs)
    end
And here is an example using them:
.. code-block:: none
    :emphasize-lines: 1-1

    cat polyglot-dkvp-io/example.rb
    #!/usr/bin/env ruby

    require 'dkvp_io'

    ARGF.each do |line|
      # Read the original record:
      map = dkvpline2map(line.chomp, '=', ',')

      # Drop a field:
      map.delete('x')

      # Compute some new fields:
      map['ab'] = map['a'] + map['b']
      map['iy'] = map['i'] + map['y']

      # Add new fields which show type of each already-existing field:
      keys = map.keys
      keys.each do |key|
        map['t'+key] = map[key].class
      end

      # Write the modified record:
      puts map2dkvpline(map, '=', ',')
    end
Run as-is:
.. code-block:: none
    :emphasize-lines: 1-1

    ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small
    a=pan,b=pan,i=1,y=0.7268028627434533,ab=panpan,iy=1.7268028627434533,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float
    a=eks,b=pan,i=2,y=0.5221511083334797,ab=ekspan,iy=2.5221511083334796,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float
    a=wye,b=wye,i=3,y=0.33831852551664776,ab=wyewye,iy=3.3383185255166477,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float
    a=eks,b=wye,i=4,y=0.13418874328430463,ab=ekswye,iy=4.134188743284304,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float
    a=wye,b=pan,i=5,y=0.8636244699032729,ab=wyepan,iy=5.863624469903273,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float
Run as-is, then pipe to Miller for pretty-printing:
.. code-block:: none
    :emphasize-lines: 1-1

    ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small | mlr --opprint cat
    a   b   i y                   ab     iy                 ta     tb     ti      ty    tab    tiy
    pan pan 1 0.7268028627434533  panpan 1.7268028627434533 String String Integer Float String Float
    eks pan 2 0.5221511083334797  ekspan 2.5221511083334796 String String Integer Float String Float
    wye wye 3 0.33831852551664776 wyewye 3.3383185255166477 String String Integer Float String Float
    eks wye 4 0.13418874328430463 ekswye 4.134188743284304  String String Integer Float String Float
    wye pan 5 0.8636244699032729  wyepan 5.863624469903273  String String Integer Float String Float
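Both examples end by tagging each field with its host-language type name (``str``/``int``/``float`` in Python, ``String``/``Integer``/``Float`` in Ruby). In Python this can be done without the regex-on-``__class__`` workaround shown above, since ``type(v).__name__`` gives the bare name directly on both Python 2 and 3. The helper name ``add_type_fields`` is invented here for illustration:

```python
def add_type_fields(record):
    """For each field in the dict, add a 't'-prefixed field holding
    that field's Python type name, e.g. 'int', 'float', 'str'."""
    for key in list(record.keys()):  # snapshot keys: we add fields while iterating
        record['t' + key] = type(record[key]).__name__

record = {'a': 'pan', 'i': 1, 'y': 0.7268028627434533}
add_type_fields(record)
# record now additionally carries ta='str', ti='int', ty='float'
```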
@ -1,52 +0,0 @@
DKVP I/O examples
======================
DKVP I/O in Python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here are the I/O routines:
GENRST_INCLUDE_ESCAPED(polyglot-dkvp-io/dkvp_io.py)
And here is an example using them:
GENRST_RUN_COMMAND
cat polyglot-dkvp-io/example.py
GENRST_EOF
Run as-is:
GENRST_RUN_COMMAND
python polyglot-dkvp-io/example.py < data/small
GENRST_EOF
Run as-is, then pipe to Miller for pretty-printing:
GENRST_RUN_COMMAND
python polyglot-dkvp-io/example.py < data/small | mlr --opprint cat
GENRST_EOF
DKVP I/O in Ruby
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here are the I/O routines:
GENRST_INCLUDE_ESCAPED(polyglot-dkvp-io/dkvp_io.rb)
And here is an example using them:
GENRST_RUN_COMMAND
cat polyglot-dkvp-io/example.rb
GENRST_EOF
Run as-is:
GENRST_RUN_COMMAND
ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small
GENRST_EOF
Run as-is, then pipe to Miller for pretty-printing:
GENRST_RUN_COMMAND
ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small | mlr --opprint cat
GENRST_EOF
@ -1,7 +0,0 @@
linkify introduction.rst.in? too much too soon?
quicklinks maybe? redundant w/ TOC?
so i didn't want to pop out to R just to compute that 'one last thing'; hence covar/linreg/etc