Mirror of https://github.com/johnkerl/miller.git, synced 2026-01-23 02:14:13 +00:00

Replace miller-6 sphinx docs with mkdocs docs (#618)

This commit is contained in: parent fa3ee05822, commit afd3c9c149
1313 changed files with 32 additions and 287985 deletions
.gitignore (vendored), 2 changes:

    @ -120,4 +120,4 @@ experiments/dsl-parser/two/main
    experiments/cli-parser/cliparse
    experiments/cli-parser/cliparse.exe
    docs6b/site/
    docs6/site/

Deleted symlinks (@ -1 +0,0 @@ each): ../multi-join, ../ngrams, ../polyglot-dkvp-io

Deleted editor-mapping file (@ -1,2 +0,0 @@):

    map \d :w<C-m>:!clear;build-one %<C-m>
    map \f :w<C-m>:!clear;make html<C-m>

docs6/10min.rst, 653 lines deleted:
..
    PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

Miller in 10 minutes
====================

Obtaining Miller
^^^^^^^^^^^^^^^^

You can install Miller for various platforms as follows:

* Linux: ``yum install miller`` or ``apt-get install miller``, depending on your flavor of Linux.
* MacOS: ``brew install miller`` or ``port install miller``, depending on your preference of `Homebrew <https://brew.sh>`_ or `MacPorts <https://macports.org>`_.
* Windows: ``choco install miller`` using `Chocolatey <https://chocolatey.org>`_.
* You can get the latest builds for Linux, MacOS, and Windows by visiting https://github.com/johnkerl/miller/actions, selecting the latest build, and clicking *Artifacts*. (These are retained for 5 days after each commit.)
* See also :doc:`build` if you prefer -- in particular, if your platform's package manager doesn't have the latest release.

As a first check, you should be able to run ``mlr --version`` at your system's command prompt and see something like the following:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --version
    Miller v6.0.0-dev

As a second check, given `example.csv <./example.csv>`_ you should be able to do

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv cat example.csv
    color,shape,flag,index,quantity,rate
    yellow,triangle,true,11,43.6498,9.8870
    red,square,true,15,79.2778,0.0130
    red,circle,true,16,13.8103,2.9010
    red,square,false,48,77.5542,7.4670
    purple,triangle,false,51,81.2290,8.5910
    red,square,false,64,77.1991,9.5310
    purple,triangle,false,65,80.1405,5.8240
    yellow,circle,true,73,63.9785,4.2370
    yellow,circle,true,87,63.5058,8.3350
    purple,square,false,91,72.3735,8.2430

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint cat example.csv
    color  shape    flag  index quantity rate
    yellow triangle true  11    43.6498  9.8870
    red    square   true  15    79.2778  0.0130
    red    circle   true  16    13.8103  2.9010
    red    square   false 48    77.5542  7.4670
    purple triangle false 51    81.2290  8.5910
    red    square   false 64    77.1991  9.5310
    purple triangle false 65    80.1405  5.8240
    yellow circle   true  73    63.9785  4.2370
    yellow circle   true  87    63.5058  8.3350
    purple square   false 91    72.3735  8.2430

If you run into issues on these checks, please check out the resources on the :doc:`community` page for help.

Miller verbs
^^^^^^^^^^^^

Let's take a quick look at some of the most useful Miller verbs -- file-format-aware, name-index-empowered equivalents of standard system commands.

``mlr cat`` is like the system ``cat`` (or ``type`` on Windows) -- it passes the data through unmodified:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv cat example.csv
    color,shape,flag,index,quantity,rate
    yellow,triangle,true,11,43.6498,9.8870
    red,square,true,15,79.2778,0.0130
    red,circle,true,16,13.8103,2.9010
    red,square,false,48,77.5542,7.4670
    purple,triangle,false,51,81.2290,8.5910
    red,square,false,64,77.1991,9.5310
    purple,triangle,false,65,80.1405,5.8240
    yellow,circle,true,73,63.9785,4.2370
    yellow,circle,true,87,63.5058,8.3350
    purple,square,false,91,72.3735,8.2430

But ``mlr cat`` can also do format conversion -- for example, you can pretty-print in tabular format:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint cat example.csv
    color  shape    flag  index quantity rate
    yellow triangle true  11    43.6498  9.8870
    red    square   true  15    79.2778  0.0130
    red    circle   true  16    13.8103  2.9010
    red    square   false 48    77.5542  7.4670
    purple triangle false 51    81.2290  8.5910
    red    square   false 64    77.1991  9.5310
    purple triangle false 65    80.1405  5.8240
    yellow circle   true  73    63.9785  4.2370
    yellow circle   true  87    63.5058  8.3350
    purple square   false 91    72.3735  8.2430

``mlr head`` and ``mlr tail`` count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv head -n 4 example.csv
    color,shape,flag,index,quantity,rate
    yellow,triangle,true,11,43.6498,9.8870
    red,square,true,15,79.2778,0.0130
    red,circle,true,16,13.8103,2.9010
    red,square,false,48,77.5542,7.4670

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv tail -n 4 example.csv
    color,shape,flag,index,quantity,rate
    purple,triangle,false,65,80.1405,5.8240
    yellow,circle,true,73,63.9785,4.2370
    yellow,circle,true,87,63.5058,8.3350
    purple,square,false,91,72.3735,8.2430

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --ojson tail -n 2 example.csv
    {
      "color": "yellow",
      "shape": "circle",
      "flag": true,
      "index": 87,
      "quantity": 63.5058,
      "rate": 8.3350
    }
    {
      "color": "purple",
      "shape": "square",
      "flag": false,
      "index": 91,
      "quantity": 72.3735,
      "rate": 8.2430
    }

You can sort on a single field:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint sort -f shape example.csv
    color  shape    flag  index quantity rate
    red    circle   true  16    13.8103  2.9010
    yellow circle   true  73    63.9785  4.2370
    yellow circle   true  87    63.5058  8.3350
    red    square   true  15    79.2778  0.0130
    red    square   false 48    77.5542  7.4670
    red    square   false 64    77.1991  9.5310
    purple square   false 91    72.3735  8.2430
    yellow triangle true  11    43.6498  9.8870
    purple triangle false 51    81.2290  8.5910
    purple triangle false 65    80.1405  5.8240

Or, you can sort primarily alphabetically on one field, then secondarily numerically descending on another field, and so on:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint sort -f shape -nr index example.csv
    color  shape    flag  index quantity rate
    yellow circle   true  87    63.5058  8.3350
    yellow circle   true  73    63.9785  4.2370
    red    circle   true  16    13.8103  2.9010
    purple square   false 91    72.3735  8.2430
    red    square   false 64    77.1991  9.5310
    red    square   false 48    77.5542  7.4670
    red    square   true  15    79.2778  0.0130
    purple triangle false 65    80.1405  5.8240
    purple triangle false 51    81.2290  8.5910
    yellow triangle true  11    43.6498  9.8870
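The mixed-direction sort above (ascending alphabetically on one field, descending numerically on another) is easy to picture as a composite sort key. Here is a minimal Python sketch of the same idea; the ``records`` list is a hypothetical stand-in for Miller's record stream, not Miller's own code:

```python
# Records as Miller sees them: maps from field names to values.
records = [
    {"shape": "triangle", "index": 11},
    {"shape": "square", "index": 15},
    {"shape": "square", "index": 64},
    {"shape": "circle", "index": 16},
]

# Analogue of "sort -f shape -nr index": ascending by shape, then
# numerically descending by index. Negating the numeric component
# reverses only that part of the key.
ordered = sorted(records, key=lambda r: (r["shape"], -r["index"]))
print([(r["shape"], r["index"]) for r in ordered])
```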
If there are fields you don't want to see in your data, you can use ``cut`` to keep only the ones you want, in the same order they appeared in the input data:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint cut -f flag,shape example.csv
    shape    flag
    triangle true
    square   true
    circle   true
    square   false
    triangle false
    square   false
    triangle false
    circle   true
    circle   true
    square   false

You can also use ``cut -o`` to keep specified fields, but in your preferred order:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint cut -o -f flag,shape example.csv
    flag  shape
    true  triangle
    true  square
    true  circle
    false square
    false triangle
    false square
    false triangle
    true  circle
    true  circle
    false square

You can use ``cut -x`` to omit fields you don't care about:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint cut -x -f flag,shape example.csv
    color  index quantity rate
    yellow 11    43.6498  9.8870
    red    15    79.2778  0.0130
    red    16    13.8103  2.9010
    red    48    77.5542  7.4670
    purple 51    81.2290  8.5910
    red    64    77.1991  9.5310
    purple 65    80.1405  5.8240
    yellow 73    63.9785  4.2370
    yellow 87    63.5058  8.3350
    purple 91    72.3735  8.2430

You can use ``filter`` to keep only records you care about:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint filter '$color == "red"' example.csv
    color shape  flag  index quantity rate
    red   square true  15    79.2778  0.0130
    red   circle true  16    13.8103  2.9010
    red   square false 48    77.5542  7.4670
    red   square false 64    77.1991  9.5310

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint filter '$color == "red" && $flag == true' example.csv
    color shape  flag index quantity rate
    red   square true 15    79.2778  0.0130
    red   circle true 16    13.8103  2.9010

You can use ``put`` to create new fields which are computed from other fields:

.. code-block:: none
    :emphasize-lines: 1-4

    mlr --icsv --opprint put '
      $ratio = $quantity / $rate;
      $color_shape = $color . "_" . $shape
    ' example.csv
    color  shape    flag  index quantity rate   ratio              color_shape
    yellow triangle true  11    43.6498  9.8870 4.414868008496004  yellow_triangle
    red    square   true  15    79.2778  0.0130 6098.292307692308  red_square
    red    circle   true  16    13.8103  2.9010 4.760530851430541  red_circle
    red    square   false 48    77.5542  7.4670 10.386259541984733 red_square
    purple triangle false 51    81.2290  8.5910 9.455127458968688  purple_triangle
    red    square   false 64    77.1991  9.5310 8.099790158430384  red_square
    purple triangle false 65    80.1405  5.8240 13.760388049450551 purple_triangle
    yellow circle   true  73    63.9785  4.2370 15.09995279679018  yellow_circle
    yellow circle   true  87    63.5058  8.3350 7.619172165566886  yellow_circle
    purple square   false 91    72.3735  8.2430 8.779995147397793  purple_square
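The ``put`` expressions above amount to per-record field assignments. For comparison only, here is a sketch of the same computation in plain Python on a single record (field names taken from example.csv; this is not Miller's implementation):

```python
# One record from example.csv, as a field-name-to-value map.
record = {"color": "red", "shape": "square", "quantity": 79.2778, "rate": 0.0130}

# $ratio = $quantity / $rate
record["ratio"] = record["quantity"] / record["rate"]

# $color_shape = $color . "_" . $shape  (Miller's "." is string concatenation)
record["color_shape"] = record["color"] + "_" + record["shape"]

print(record["ratio"], record["color_shape"])
```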
Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use ``$[[3]]`` to access the name of field 3 or ``$[[[3]]]`` to access the value of field 3:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv
    color  shape    NEW   index quantity rate
    yellow triangle true  11    43.6498  9.8870
    red    square   true  15    79.2778  0.0130
    red    circle   true  16    13.8103  2.9010
    red    square   false 48    77.5542  7.4670
    purple triangle false 51    81.2290  8.5910
    red    square   false 64    77.1991  9.5310
    purple triangle false 65    80.1405  5.8240
    yellow circle   true  73    63.9785  4.2370
    yellow circle   true  87    63.5058  8.3350
    purple square   false 91    72.3735  8.2430

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv
    color  shape    flag index quantity rate
    yellow triangle NEW  11    43.6498  9.8870
    red    square   NEW  15    79.2778  0.0130
    red    circle   NEW  16    13.8103  2.9010
    red    square   NEW  48    77.5542  7.4670
    purple triangle NEW  51    81.2290  8.5910
    red    square   NEW  64    77.1991  9.5310
    purple triangle NEW  65    80.1405  5.8240
    yellow circle   NEW  73    63.9785  4.2370
    yellow circle   NEW  87    63.5058  8.3350
    purple square   NEW  91    72.3735  8.2430
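The distinction between ``$[[3]]`` (rename the third field) and ``$[[[3]]]`` (overwrite its value) can be mimicked in Python with an insertion-ordered dict. This is only an illustrative sketch of the semantics, relying on Python 3.7+ dict ordering:

```python
record = {"color": "yellow", "shape": "triangle", "flag": "true"}

# $[[3]] = "NEW": rename the third field, keeping its value and position.
third_key = list(record)[2]
record = {("NEW" if k == third_key else k): v for k, v in record.items()}

# $[[[3]]] = "NEW": overwrite the third field's value, keeping its name.
record[list(record)[2]] = "NEW"

print(record)
```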
You can find the full list of verbs at the :doc:`reference-verbs` page.

Multiple input files
^^^^^^^^^^^^^^^^^^^^

Miller takes all the files from the command line as an input stream. But it's format-aware, so it doesn't repeat CSV header lines. For example, with input files `data/a.csv <data/a.csv>`_ and `data/b.csv <data/b.csv>`_, the system ``cat`` command will repeat header lines:

.. code-block:: none
    :emphasize-lines: 1-1

    cat data/a.csv
    a,b,c
    1,2,3
    4,5,6

.. code-block:: none
    :emphasize-lines: 1-1

    cat data/b.csv
    a,b,c
    7,8,9

.. code-block:: none
    :emphasize-lines: 1-1

    cat data/a.csv data/b.csv
    a,b,c
    1,2,3
    4,5,6
    a,b,c
    7,8,9

However, ``mlr cat`` will not:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv cat data/a.csv data/b.csv
    a,b,c
    1,2,3
    4,5,6
    7,8,9

Chaining verbs together
^^^^^^^^^^^^^^^^^^^^^^^

Often we want to chain queries together -- for example, sorting by a field and taking the top few values. We can do this using pipes:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv sort -nr index example.csv | mlr --icsv --opprint head -n 3
    color  shape  flag  index quantity rate
    purple square false 91    72.3735  8.2430
    yellow circle true  87    63.5058  8.3350
    yellow circle true  73    63.9785  4.2370

This works fine -- but Miller also lets you chain verbs together using the word ``then``. Think of this as a Miller-internal pipe that lets you use fewer keystrokes:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint sort -nr index then head -n 3 example.csv
    color  shape  flag  index quantity rate
    purple square false 91    72.3735  8.2430
    yellow circle true  87    63.5058  8.3350
    yellow circle true  73    63.9785  4.2370

As another convenience, you can put the filename first using ``--from``. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint --from example.csv sort -nr index then head -n 3
    color  shape  flag  index quantity rate
    purple square false 91    72.3735  8.2430
    yellow circle true  87    63.5058  8.3350
    yellow circle true  73    63.9785  4.2370

.. code-block:: none
    :emphasize-lines: 1-4

    mlr --icsv --opprint --from example.csv \
      sort -nr index \
      then head -n 3 \
      then cut -f shape,quantity
    shape  quantity
    square 72.3735
    circle 63.5058
    circle 63.9785
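One way to picture ``then``-chaining: each verb is a function from a record stream to a record stream, and ``then`` feeds one verb's output into the next. A toy Python sketch of that composition (the function names here are illustrative, not Miller internals):

```python
def sort_desc(field):
    # Analogue of "sort -nr field".
    return lambda recs: sorted(recs, key=lambda r: -r[field])

def head(n):
    # Analogue of "head -n N".
    return lambda recs: recs[:n]

def chain(recs, *verbs):
    # Analogue of "verb1 then verb2 then ...": pipe the record stream
    # through each verb in turn.
    for verb in verbs:
        recs = verb(recs)
    return recs

records = [{"index": 11}, {"index": 91}, {"index": 87}, {"index": 73}]
top3 = chain(records, sort_desc("index"), head(3))
print([r["index"] for r in top3])
```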
Sorts and stats
^^^^^^^^^^^^^^^

Now suppose you want to sort the data on a given column, *and then* take the top few in that ordering. You can use Miller's ``then`` feature to pipe commands together.

Here are the records with the top three ``index`` values:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint sort -nr index then head -n 3 example.csv
    color  shape  flag  index quantity rate
    purple square false 91    72.3735  8.2430
    yellow circle true  87    63.5058  8.3350
    yellow circle true  73    63.9785  4.2370

Lots of Miller commands take a ``-g`` option for group-by: here, ``head -n 1 -g shape`` outputs the first record for each distinct value of the ``shape`` field. This means we're finding the record with the highest ``index`` field for each distinct ``shape`` field:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv
    color  shape    flag  index quantity rate
    yellow circle   true  87    63.5058  8.3350
    purple square   false 91    72.3735  8.2430
    purple triangle false 65    80.1405  5.8240

Statistics can be computed with or without group-by field(s):

.. code-block:: none
    :emphasize-lines: 1-2

    mlr --icsv --opprint --from example.csv \
      stats1 -a count,min,mean,max -f quantity -g shape
    shape    quantity_count quantity_min quantity_mean     quantity_max
    triangle 3              43.6498      68.33976666666666 81.229
    square   4              72.3735      76.60114999999999 79.2778
    circle   3              13.8103      47.0982           63.9785

.. code-block:: none
    :emphasize-lines: 1-2

    mlr --icsv --opprint --from example.csv \
      stats1 -a count,min,mean,max -f quantity -g shape,color
    shape    color  quantity_count quantity_min quantity_mean      quantity_max
    triangle yellow 1              43.6498      43.6498            43.6498
    square   red    3              77.1991      78.01036666666666  79.2778
    circle   red    1              13.8103      13.8103            13.8103
    triangle purple 2              80.1405      80.68475000000001  81.229
    circle   yellow 2              63.5058      63.742149999999995 63.9785
    square   purple 1              72.3735      72.3735            72.3735

If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:

.. code-block:: none
    :emphasize-lines: 1-2

    mlr --icsv --oxtab --from example.csv \
      stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate
    rate_p0   0.0130
    rate_p10  2.9010
    rate_p25  4.2370
    rate_p50  8.2430
    rate_p75  8.5910
    rate_p90  9.8870
    rate_p99  9.8870
    rate_p100 9.8870
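The grouped ``stats1`` computation boils down to bucketing values by the group-by field and then reducing each bucket. A toy dict-of-accumulators sketch in Python (for intuition only, not Miller's code; the ``records`` data is an abridged sample):

```python
from collections import defaultdict

records = [
    {"shape": "triangle", "quantity": 43.6498},
    {"shape": "square", "quantity": 79.2778},
    {"shape": "triangle", "quantity": 81.2290},
]

# Analogue of "stats1 -a count,min,mean,max -f quantity -g shape":
# bucket quantity values by shape, then reduce each bucket.
groups = defaultdict(list)
for r in records:
    groups[r["shape"]].append(r["quantity"])

stats = {
    shape: {
        "quantity_count": len(vs),
        "quantity_min": min(vs),
        "quantity_mean": sum(vs) / len(vs),
        "quantity_max": max(vs),
    }
    for shape, vs in groups.items()
}
print(stats)
```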
File formats and format conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Miller supports the following formats:

* CSV (comma-separated values)
* TSV (tab-separated values)
* JSON (JavaScript Object Notation)
* PPRINT (pretty-printed tabular)
* XTAB (vertical-tabular or sideways-tabular)
* NIDX (numerically indexed, label-free, with implicit labels ``"1"``, ``"2"``, etc.)
* DKVP (delimited key-value pairs)

What's a CSV file, really? It's an array of rows, or *records*, each being a list of key-value pairs, or *fields*: for CSV it so happens that all the keys are shared in the header line and the values vary from one data line to another.

For example, if you have:

.. code-block:: none

    shape,flag,index
    circle,1,24
    square,0,36

then that's a way of saying:

.. code-block:: none

    shape=circle,flag=1,index=24
    shape=square,flag=0,index=36

Other ways to write the same data:

.. code-block:: none

    CSV                              PPRINT
    shape,flag,index                 shape  flag index
    circle,1,24                      circle 1    24
    square,0,36                      square 0    36

    JSON                             XTAB
    {                                shape circle
      "shape": "circle",             flag  1
      "flag": 1,                     index 24
      "index": 24
    }                                shape square
    {                                flag  0
      "shape": "square",             index 36
      "flag": 0,
      "index": 36
    }

    DKVP
    shape=circle,flag=1,index=24
    shape=square,flag=0,index=36

Anything we can do with CSV input we can do with input in any other format. And you can read one format, do any record-processing, and write output in the same format as the input, or in a different one.

How to specify these to Miller:

* If you use ``--csv`` or ``--json`` or ``--pprint``, etc., then Miller will use that format for both input and output.
* If you use ``--icsv`` and ``--ojson`` (note the extra ``i`` and ``o``), then Miller will use CSV for input and JSON for output, etc. See also :doc:`keystroke-savers` for even shorter options like ``--c2j``.

You can read more about this at the :doc:`file-formats` page.
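Since every format above is just a different serialization of the same key-value records, round-tripping one of them is a small exercise. A minimal Python sketch for DKVP (illustrative only; it ignores quoting and escaping, which Miller handles for real):

```python
def dkvp_parse(line):
    # "shape=circle,flag=1,index=24" -> {"shape": "circle", "flag": "1", ...}
    return dict(pair.split("=", 1) for pair in line.split(","))

def dkvp_format(record):
    # The inverse: a record back to one DKVP line.
    return ",".join(f"{k}={v}" for k, v in record.items())

rec = dkvp_parse("shape=circle,flag=1,index=24")
print(rec)
print(dkvp_format(rec))
```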
.. _10min-choices-for-printing-to-files:

Choices for printing to files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.

Sometimes, though, we want to print output to another file. Just use **> outputfilenamegoeshere** at the end of your command:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint cat example.csv > newfile.csv
    # Output goes to the new file;
    # nothing is printed to the screen.

.. code-block:: none
    :emphasize-lines: 1-1

    cat newfile.csv
    color  shape    flag  index quantity rate
    yellow triangle true  11    43.6498  9.8870
    red    square   true  15    79.2778  0.0130
    red    circle   true  16    13.8103  2.9010
    red    square   false 48    77.5542  7.4670
    purple triangle false 51    81.2290  8.5910
    red    square   false 64    77.1991  9.5310
    purple triangle false 65    80.1405  5.8240
    yellow circle   true  73    63.9785  4.2370
    yellow circle   true  87    63.5058  8.3350
    purple square   false 91    72.3735  8.2430

Other times we want our files to be **changed in-place**: use **mlr -I**:

.. code-block:: none
    :emphasize-lines: 1-1

    cp example.csv newfile.txt

.. code-block:: none
    :emphasize-lines: 1-1

    cat newfile.txt
    color,shape,flag,index,quantity,rate
    yellow,triangle,true,11,43.6498,9.8870
    red,square,true,15,79.2778,0.0130
    red,circle,true,16,13.8103,2.9010
    red,square,false,48,77.5542,7.4670
    purple,triangle,false,51,81.2290,8.5910
    red,square,false,64,77.1991,9.5310
    purple,triangle,false,65,80.1405,5.8240
    yellow,circle,true,73,63.9785,4.2370
    yellow,circle,true,87,63.5058,8.3350
    purple,square,false,91,72.3735,8.2430

.. code-block:: none
    :emphasize-lines: 1-1

    mlr -I --csv sort -f shape newfile.txt

.. code-block:: none
    :emphasize-lines: 1-1

    cat newfile.txt
    color,shape,flag,index,quantity,rate
    red,circle,true,16,13.8103,2.9010
    yellow,circle,true,73,63.9785,4.2370
    yellow,circle,true,87,63.5058,8.3350
    red,square,true,15,79.2778,0.0130
    red,square,false,48,77.5542,7.4670
    red,square,false,64,77.1991,9.5310
    purple,square,false,91,72.3735,8.2430
    yellow,triangle,true,11,43.6498,9.8870
    purple,triangle,false,51,81.2290,8.5910
    purple,triangle,false,65,80.1405,5.8240

Using ``mlr -I``, you can also bulk-operate on many files, e.g.:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr -I --csv cut -x -f unwanted_column_name *.csv

If you like, you can first copy your original data somewhere else before doing in-place operations.

Lastly, using ``tee`` within ``put``, you can split your input data into separate files per one or more field names:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'

.. code-block:: none
    :emphasize-lines: 1-1

    cat circle.csv
    color,shape,flag,index,quantity,rate
    red,circle,true,16,13.8103,2.9010
    yellow,circle,true,73,63.9785,4.2370
    yellow,circle,true,87,63.5058,8.3350

.. code-block:: none
    :emphasize-lines: 1-1

    cat square.csv
    color,shape,flag,index,quantity,rate
    red,square,true,15,79.2778,0.0130
    red,square,false,48,77.5542,7.4670
    red,square,false,64,77.1991,9.5310
    purple,square,false,91,72.3735,8.2430

.. code-block:: none
    :emphasize-lines: 1-1

    cat triangle.csv
    color,shape,flag,index,quantity,rate
    yellow,triangle,true,11,43.6498,9.8870
    purple,triangle,false,51,81.2290,8.5910
    purple,triangle,false,65,80.1405,5.8240
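The ``tee``-per-shape split above writes one output file per distinct field value, each with its own header. A rough Python analogue using the standard ``csv`` module and a temporary directory (a sketch of the idea, not Miller's ``tee``):

```python
import csv
import os
import tempfile

records = [
    {"color": "red", "shape": "circle"},
    {"color": "yellow", "shape": "triangle"},
    {"color": "yellow", "shape": "circle"},
]

outdir = tempfile.mkdtemp()
writers = {}  # one open CSV writer per distinct shape value
handles = {}
for r in records:
    path = os.path.join(outdir, r["shape"] + ".csv")
    if path not in writers:
        f = open(path, "w", newline="")
        handles[path] = f
        w = csv.DictWriter(f, fieldnames=list(r))
        w.writeheader()  # each split file gets its own header line
        writers[path] = w
    writers[path].writerow(r)
for f in handles.values():
    f.close()

print(sorted(os.listdir(outdir)))
```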
@ -1,378 +0,0 @@
|
|||
Miller in 10 minutes
|
||||
====================
|
||||
|
||||
Obtaining Miller
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
You can install Miller for various platforms as follows:
|
||||
|
||||
* Linux: ``yum install miller`` or ``apt-get install miller`` depending on your flavor of Linux
|
||||
* MacOS: ``brew install miller`` or ``port install miller`` depending on your preference of `Homebrew <https://brew.sh>`_ or `MacPorts <https://macports.org>`_.
|
||||
* Windows: ``choco install miller`` using `Chocolatey <https://chocolatey.org>`_.
|
||||
* You can get latest builds for Linux, MacOS, and Windows by visiting https://github.com/johnkerl/miller/actions, selecting the latest build, and clicking _Artifacts_. (These are retained for 5 days after each commit.)
|
||||
* See also :doc:`build` if you prefer -- in particular, if your platform's package manager doesn't have the latest release.
|
||||
|
||||
As a first check, you should be able to run ``mlr --version`` at your system's command prompt and see something like the following:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --version
|
||||
GENRST_EOF
|
||||
|
||||
As a second check, given (`example.csv <./example.csv>`_) you should be able to do
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --csv cat example.csv
|
||||
GENRST_EOF
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint cat example.csv
|
||||
GENRST_EOF
|
||||
|
||||
If you run into issues on these checks, please check out the resources on the :doc:`community` page for help.
|
||||
|
||||
Miller verbs
|
||||
^^^^^^^^^^^^
|
||||
|
||||
Let's take a quick look at some of the most useful Miller verbs -- file-format-aware, name-index-empowered equivalents of standard system commands.
|
||||
|
||||
``mlr cat`` is like system ``cat`` (or ``type`` on Windows) -- it passes the data through unmodified:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --csv cat example.csv
|
||||
GENRST_EOF
|
||||
|
||||
But ``mlr cat`` can also do format conversion -- for example, you can pretty-print in tabular format:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint cat example.csv
|
||||
GENRST_EOF
|
||||
|
||||
``mlr head`` and ``mlr tail`` count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --csv head -n 4 example.csv
|
||||
GENRST_EOF
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --csv tail -n 4 example.csv
|
||||
GENRST_EOF
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --ojson tail -n 2 example.csv
|
||||
GENRST_EOF
|
||||
|
||||
You can sort on a single field:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint sort -f shape example.csv
|
||||
GENRST_EOF
|
||||
|
||||
Or, you can sort primarily alphabetically on one field, then secondarily numerically descending on another field, and so on:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint sort -f shape -nr index example.csv
|
||||
GENRST_EOF
|
||||
|
||||
If there are fields you don't want to see in your data, you can use ``cut`` to keep only the ones you want, in the same order they appeared in the input data:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint cut -f flag,shape example.csv
|
||||
GENRST_EOF
|
||||
|
||||
You can also use ``cut -o`` to keep specified fields, but in your preferred order:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint cut -o -f flag,shape example.csv
|
||||
GENRST_EOF
|
||||
|
||||
You can use ``cut -x`` to omit fields you don't care about:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint cut -x -f flag,shape example.csv
|
||||
GENRST_EOF
|
||||
|
||||
You can use ``filter`` to keep only records you care about:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint filter '$color == "red"' example.csv
|
||||
GENRST_EOF
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint filter '$color == "red" && $flag == true' example.csv
|
||||
GENRST_EOF
|
||||
|
||||
You can use ``put`` to create new fields which are computed from other fields:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint put '
|
||||
$ratio = $quantity / $rate;
|
||||
$color_shape = $color . "_" . $shape
|
||||
' example.csv
|
||||
GENRST_EOF
|
||||
|
||||
Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use ``$[[3]]`` to access the name of field 3 or ``$[[[3]]]`` to access the value of field 3:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv
|
||||
GENRST_EOF
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv
|
||||
GENRST_EOF
|
||||
|
||||
You can find the full list of verbs at the :doc:`reference-verbs` page.
|
||||
|
||||
Multiple input files
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Miller takes all the files from the command line as an input stream. But it's format-aware, so it doesn't repeat CSV header lines. For example, with input files (`data/a.csv <data/a.csv>`_) and (`data/b.csv <data/b.csv>`_), the system ``cat`` command will repeat header lines:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
cat data/a.csv
|
||||
GENRST_EOF
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
cat data/b.csv
|
||||
GENRST_EOF
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
cat data/a.csv data/b.csv
|
||||
GENRST_EOF
|
||||
|
||||
However, ``mlr cat`` will not:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --csv cat data/a.csv data/b.csv
|
||||
GENRST_EOF
|
||||
|
||||
Chaining verbs together
|
||||
^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Often we want to chain queries together -- for example, sorting by a field and taking the top few values. We can do this using pipes:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --csv sort -nr index example.csv | mlr --icsv --opprint head -n 3
|
||||
GENRST_EOF
|
||||
|
||||
This works fine -- but Miller also lets you chain verbs together using the word ``then``. Think of this as a Miller-internal pipe that lets you use fewer keystrokes:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
|
||||
GENRST_EOF
|
||||
|
||||
As another convenience, you can put the filename first using ``--from``. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint --from example.csv sort -nr index then head -n 3
|
||||
GENRST_EOF
|
||||
|
||||
GENRST_RUN_COMMAND
|
||||
mlr --icsv --opprint --from example.csv \
|
||||
sort -nr index \
|
||||
then head -n 3 \
|
||||
then cut -f shape,quantity
|
||||
GENRST_EOF

Sorts and stats
^^^^^^^^^^^^^^^

Now suppose you want to sort the data on a given column, *and then* take the top few in that ordering. You can use Miller's ``then`` feature to pipe commands together.

Here are the records with the top three ``index`` values:

GENRST_RUN_COMMAND
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
GENRST_EOF

Lots of Miller commands take a ``-g`` option for group-by: here, ``head -n 1 -g shape`` outputs the first record for each distinct value of the ``shape`` field. This means we're finding the record with highest ``index`` field for each distinct ``shape`` field:

GENRST_RUN_COMMAND
mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv
GENRST_EOF

Statistics can be computed with or without group-by field(s):

GENRST_RUN_COMMAND
mlr --icsv --opprint --from example.csv \
  stats1 -a count,min,mean,max -f quantity -g shape
GENRST_EOF

GENRST_RUN_COMMAND
mlr --icsv --opprint --from example.csv \
  stats1 -a count,min,mean,max -f quantity -g shape,color
GENRST_EOF

If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:

GENRST_RUN_COMMAND
mlr --icsv --oxtab --from example.csv \
  stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate
GENRST_EOF
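Grouped statistics like these are, conceptually, a single accumulate-per-key pass over the records. Here is a minimal Python sketch of the idea behind ``stats1 -a count,min,mean,max -f quantity -g shape`` -- an illustration of the concept only, not Miller's implementation; the sample rows are a small subset of ``example.csv``:

```python
from collections import defaultdict

# A few records from example.csv, in Miller's records-as-key-value-pairs model
rows = [
    {"shape": "triangle", "quantity": 43.6498},
    {"shape": "square",   "quantity": 79.2778},
    {"shape": "circle",   "quantity": 13.8103},
    {"shape": "square",   "quantity": 77.5542},
]

# Bucket the quantity values by the group-by field
groups = defaultdict(list)
for row in rows:
    groups[row["shape"]].append(row["quantity"])

# One stats record per distinct shape
stats = {
    shape: {
        "count": len(vs),
        "min": min(vs),
        "mean": sum(vs) / len(vs),
        "max": max(vs),
    }
    for shape, vs in groups.items()
}
```

Miller does this streaming over its input, but the per-group accumulators are the same shape.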

File formats and format conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Miller supports the following formats:

* CSV (comma-separated values)
* TSV (tab-separated values)
* JSON (JavaScript Object Notation)
* PPRINT (pretty-printed tabular)
* XTAB (vertical-tabular or sideways-tabular)
* NIDX (numerically indexed, label-free, with implicit labels ``"1"``, ``"2"``, etc.)
* DKVP (delimited key-value pairs)

What's a CSV file, really? It's an array of rows, or *records*, each being a list of key-value pairs, or *fields*: for CSV it so happens that all the keys are shared in the header line and the values vary from one data line to another.

For example, if you have:

GENRST_CARDIFY
shape,flag,index
circle,1,24
square,0,36
GENRST_EOF

then that's a way of saying:

GENRST_CARDIFY
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
GENRST_EOF

Other ways to write the same data:

GENRST_CARDIFY
CSV                   PPRINT
shape,flag,index      shape  flag index
circle,1,24           circle 1    24
square,0,36           square 0    36

JSON                  XTAB
{                     shape circle
  "shape": "circle",  flag  1
  "flag": 1,          index 24
  "index": 24         .
}                     shape square
{                     flag  0
  "shape": "square",  index 36
  "flag": 0,
  "index": 36
}

DKVP
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
GENRST_EOF

Anything we can do with CSV input data, we can do with any other format input data. And you can read from one format, do any record-processing, and output to the same format as the input, or to a different output format.

How to specify these to Miller:

* If you use ``--csv`` or ``--json`` or ``--pprint``, etc., then Miller will use that format for input and output.
* If you use ``--icsv`` and ``--ojson`` (note the extra ``i`` and ``o``) then Miller will use CSV for input and JSON for output, etc. See also :doc:`keystroke-savers` for even shorter options like ``--c2j``.

You can read more about this at the :doc:`file-formats` page.
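Since every record is just an ordered list of key-value pairs, the same data model is easy to mimic outside Miller. Here is a minimal Python sketch -- illustrative only, not Miller's implementation -- showing that the CSV and DKVP spellings above decode to the same records:

```python
import csv
import io

def records_from_csv(text):
    # Header line supplies the keys; each data line supplies the values.
    return list(csv.DictReader(io.StringIO(text)))

def records_from_dkvp(text):
    # Each line is its own record: key=value pairs joined by commas.
    return [
        dict(pair.split("=", 1) for pair in line.split(","))
        for line in text.strip().splitlines()
    ]

csv_text = "shape,flag,index\ncircle,1,24\nsquare,0,36\n"
dkvp_text = "shape=circle,flag=1,index=24\nshape=square,flag=0,index=36\n"

# Same records either way
assert records_from_csv(csv_text) == records_from_dkvp(dkvp_text)
```

The only difference between the formats is where the keys are written down.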

.. _10min-choices-for-printing-to-files:

Choices for printing to files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.

Sometimes, though, we want to print output to another file. Just use **> outputfilenamegoeshere** at the end of your command:

.. code-block:: none
   :emphasize-lines: 1,1

   mlr --icsv --opprint cat example.csv > newfile.csv
   # Output goes to the new file;
   # nothing is printed to the screen.

.. code-block:: none
   :emphasize-lines: 1,1

   cat newfile.csv
   color  shape    flag  index quantity rate
   yellow triangle true  11    43.6498  9.8870
   red    square   true  15    79.2778  0.0130
   red    circle   true  16    13.8103  2.9010
   red    square   false 48    77.5542  7.4670
   purple triangle false 51    81.2290  8.5910
   red    square   false 64    77.1991  9.5310
   purple triangle false 65    80.1405  5.8240
   yellow circle   true  73    63.9785  4.2370
   yellow circle   true  87    63.5058  8.3350
   purple square   false 91    72.3735  8.2430

Other times we just want our files to be **changed in-place**: just use **mlr -I**:

.. code-block:: none
   :emphasize-lines: 1,1

   cp example.csv newfile.txt

.. code-block:: none
   :emphasize-lines: 1,1

   cat newfile.txt
   color,shape,flag,index,quantity,rate
   yellow,triangle,true,11,43.6498,9.8870
   red,square,true,15,79.2778,0.0130
   red,circle,true,16,13.8103,2.9010
   red,square,false,48,77.5542,7.4670
   purple,triangle,false,51,81.2290,8.5910
   red,square,false,64,77.1991,9.5310
   purple,triangle,false,65,80.1405,5.8240
   yellow,circle,true,73,63.9785,4.2370
   yellow,circle,true,87,63.5058,8.3350
   purple,square,false,91,72.3735,8.2430

.. code-block:: none
   :emphasize-lines: 1,1

   mlr -I --csv sort -f shape newfile.txt

.. code-block:: none
   :emphasize-lines: 1,1

   cat newfile.txt
   color,shape,flag,index,quantity,rate
   red,circle,true,16,13.8103,2.9010
   yellow,circle,true,73,63.9785,4.2370
   yellow,circle,true,87,63.5058,8.3350
   red,square,true,15,79.2778,0.0130
   red,square,false,48,77.5542,7.4670
   red,square,false,64,77.1991,9.5310
   purple,square,false,91,72.3735,8.2430
   yellow,triangle,true,11,43.6498,9.8870
   purple,triangle,false,51,81.2290,8.5910
   purple,triangle,false,65,80.1405,5.8240

Also using ``mlr -I`` you can bulk-operate on lots of files: e.g.:

.. code-block:: none
   :emphasize-lines: 1,1

   mlr -I --csv cut -x -f unwanted_column_name *.csv

If you like, you can first copy off your original data somewhere else, before doing in-place operations.

Lastly, using ``tee`` within ``put``, you can split your input data into separate files per one or more field names:

GENRST_RUN_COMMAND
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
GENRST_EOF

GENRST_RUN_COMMAND
cat circle.csv
GENRST_EOF

GENRST_RUN_COMMAND
cat square.csv
GENRST_EOF

GENRST_RUN_COMMAND
cat triangle.csv
GENRST_EOF
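The ``tee``-within-``put`` split above amounts to bucketing records by a field value, opening one output file per distinct value. A rough Python equivalent of that behavior -- a sketch of the concept, not Miller's implementation, using made-up sample rows -- looks like this:

```python
import csv
import io
import os
import tempfile

def split_by_field(csv_text, field, outdir):
    # Write one CSV file per distinct value of `field`, each with its own
    # header line -- roughly what 'tee > $shape.".csv", $*' does.
    writers, files = {}, []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[field]
        if key not in writers:
            f = open(os.path.join(outdir, key + ".csv"), "w", newline="")
            files.append(f)
            w = csv.DictWriter(f, fieldnames=list(row.keys()))
            w.writeheader()
            writers[key] = w
        writers[key].writerow(row)
    for f in files:
        f.close()

csv_text = "shape,flag,index\ncircle,1,24\nsquare,0,36\ncircle,0,99\n"
with tempfile.TemporaryDirectory() as d:
    split_by_field(csv_text, "shape", d)
    names = sorted(os.listdir(d))  # one file per distinct shape
```

Miller additionally lets the filename be any expression of the record's fields, so the bucketing key need not be a single column.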

@@ -1,28 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# Note: run this after make in the ../c directory and make in the ../man directory
# since ../c/mlr is used to autogenerate ../man/manpage.txt which is used in this directory.
# See also https://miller.readthedocs.io/en/latest/build.html#creating-a-new-release-for-developers

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS  ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR   = .
BUILDDIR    = _build

# Respective MANPATH entries would include /usr/local/share/man or $HOME/man.
INSTALLDIR=/usr/local/share/man/man1
INSTALLHOME=$(HOME)/man/man1

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	./genrsts
	$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@@ -1,40 +1,44 @@
# Miller Sphinx docs
# Miller docs

## Why use Sphinx
## Why use Mkdocs

* Connects to https://miller.readthedocs.io so people can get their docmods onto the web instead of the self-hosted https://johnkerl.org/miller/doc. Thanks to @pabloab for the great advice!
* More standard look and feel -- lots of people use readthedocs for other things so this should feel familiar
* We get a Search feature for free
* More standard look and feel -- lots of people use readthedocs for other things so this should feel familiar.
* We get a Search feature for free.
* Mkdocs vs Sphinx: these are similar tools, but I find that I more easily get better desktop+mobile formatting using Mkdocs.

## Contributing

* You need `pip install sphinx` (or `pip3 install sphinx`)
* The docs include lots of live code examples which will be invoked using `mlr` which must be somewhere in your `$PATH`
* Clone https://github.com/johnkerl/miller and cd into `docs/` within your clone
* Editing loop:
    * Edit `*.rst.in`
    * Run `make html`
    * Either `open _build/html/index.html` (MacOS) or point your browser to `file:///path/to/your/clone/of/miller/docs/_build/html/index.html`
* You need `pip install mkdocs` (or `pip3 install mkdocs`).
* The docs include lots of live code examples which will be invoked using `mlr` which must be somewhere in your `$PATH`.
* Clone https://github.com/johnkerl/miller and cd into `docs/` within your clone.
* Quick-editing loop:
    * In one terminal, cd to this directory and leave `mkdocs serve` running.
    * In another terminal, cd to the `docs` subdirectory and edit `*.md.in`.
    * Run `genmds` to re-create all the `*.md` files, or `genmds foo.md.in` to just re-create the `foo.md.in` file you just edited.
    * In your browser, visit http://127.0.0.1:8000
* Alternate editing loop:
    * Leave one terminal open as a place you will run `mkdocs build`
    * In one terminal, cd to the `docs` subdirectory and edit `*.md.in`.
    * Run `genmds` to re-create all the `*.md` files, or `genmds foo.md.in` to just re-create the `foo.md.in` file you just edited.
    * In the first terminal, run `mkdocs build` which will populate the `site` directory.
    * In your browser, visit `file:///your/path/to/miller/docs/site/index.html`
* Link-checking:
    * `sudo pip3 install git+https://github.com/linkchecker/linkchecker.git`
    * `cd site` and `linkchecker .`
* Submitting:
    * `git add` your modified files, `git commit`, `git push`, and submit a PR at https://github.com/johnkerl/miller
    * A nice markup reference: https://www.sphinx-doc.org/en/1.8/usage/restructuredtext/basics.html
    * `git add` your modified files, `git commit`, `git push`, and submit a PR at https://github.com/johnkerl/miller.

## Notes

* CSS:
    * I used the Sphinx Classic theme which I like a lot except the colors -- it's a blue scheme and Miller has never been blue.
    * Files are in `docs/_static/*.css` where I marked my mods with `/* CHANGE ME */`.
    * If you modify the CSS you must run `make clean html` (not just `make html`) then reload in your browser.
    * I used the Mkdocs Readthedocs theme which I like a lot. I customized `docs/extra.css` for Miller coloring/branding.
* Live code:
    * I didn't find a way to include non-Python live-code examples within Sphinx so I adapted the pre-Sphinx Miller-doc strategy which is to have a generator script read a template file (here, `foo.rst.in`), run the marked lines, and generate the output file (`foo.rst`).
    * Edit the `*.rst.in` files, not `*.rst` directly.
    * Within the `*.rst.in` files are lines like `GENRST_RUN_COMMAND`. These will be run, and their output included, by `make html` which calls the `genrsts` script for you.
    * I didn't find a way to include non-Python live-code examples within Mkdocs so I adapted the pre-Mkdocs Miller-doc strategy which is to have a generator script read a template file (here, `foo.md.in`), run the marked lines, and generate the output file (`foo.md`). This is `genmds`.
    * Edit the `*.md.in` files, not `*.md` directly.
    * Within the `*.md.in` files are lines like `GENMD_RUN_COMMAND`. These will be run, and their output included, when you run the `genmds` script.
* readthedocs:
    * https://readthedocs.org/
    * https://readthedocs.org/projects/miller/
    * https://readthedocs.org/projects/miller/builds/
    * https://miller.readthedocs.io/en/latest/

## To do

* Let's all discuss if/how we want the v2 docs to be structured better than the v1 docs.
@@ -1,13 +0,0 @@
#!/bin/bash
set -euo pipefail

if [ $# -ge 1 ]; then
  for name; do
    if [[ $name == *.rst.in ]]; then
      genrsts $name;
    fi
  done
else
  for rstin in *.rst.in; do genrsts $rstin; done
fi
sphinx-build -M html . _build

121
docs6/build.rst

@@ -1,121 +0,0 @@
..
   PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

Building from source
================================================================

Please also see :doc:`installation` for information about pre-built executables.

Miller license
----------------------------------------------------------------

Two-clause BSD license https://github.com/johnkerl/miller/blob/master/LICENSE.txt.

From release tarball
----------------------------------------------------------------

* Obtain ``mlr-i.j.k.tar.gz`` from https://github.com/johnkerl/miller/tags, replacing ``i.j.k`` with the desired release, e.g. ``6.1.0``.
* ``tar zxvf mlr-i.j.k.tar.gz``
* ``cd mlr-i.j.k``
* ``cd go``
* ``./build`` creates the ``go/mlr`` executable and runs regression tests
* ``go build mlr.go`` creates the ``go/mlr`` executable without running regression tests

From git clone
----------------------------------------------------------------

* ``git clone https://github.com/johnkerl/miller``
* ``cd miller/go``
* ``./build`` creates the ``go/mlr`` executable and runs regression tests
* ``go build mlr.go`` creates the ``go/mlr`` executable without running regression tests

In case of problems
----------------------------------------------------------------

If you have any build errors, feel free to open an issue with "New Issue" at https://github.com/johnkerl/miller/issues.

Dependencies
----------------------------------------------------------------

Required external dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These are necessary to produce the ``mlr`` executable.

* Go version 1.16 or higher
* Others packaged within ``go.mod`` and ``go.sum`` which you don't need to deal with manually -- the Go build process handles them for us

Optional external dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This documentation pageset is built using Sphinx. Please see https://github.com/johnkerl/miller/blob/main/docs6/README.md for details.

Creating a new release: for developers
----------------------------------------------------------------

At present I'm the primary developer so this is just my checklist for making new releases.

In this example I am using version 6.1.0 to 6.2.0; of course that will change for subsequent revisions.

* Update version found in ``mlr --version`` and ``man mlr``:

  * Edit ``go/src/version/version.go`` from ``6.1.0-dev`` to ``6.2.0``.
  * Likewise ``docs6/conf.py``
  * ``cd ../docs6``
  * ``export PATH=../go:$PATH``
  * ``make html``
  * The ordering is important: the first build creates ``mlr``; the second runs ``mlr`` to create ``manpage.txt``; the third includes ``manpage.txt`` into one of its outputs.
  * Commit and push.

* Create the release tarball and SRPM:

  * TBD for the Go port ...
  * Linux/MacOS/Windows binaries from GitHub Actions ...
  * Pull back release tarball ``mlr-6.2.0.tar.gz`` from buildbox, and ``mlr.{arch}`` binaries from whatever buildboxes.

* Create the Github release tag:

  * Don't forget the ``v`` in ``v6.2.0``
  * Write the release notes
  * Attach the release tarball and binaries. Double-check assets were successfully uploaded.
  * Publish the release

* Check the release-specific docs:

  * Look at https://miller.readthedocs.io for new-version docs, after a few minutes' propagation time.

* Notify:

  * Submit ``brew`` pull request; notify any other distros which don't appear to have autoupdated since the previous release (notes below)
  * Similarly for ``macports``: https://github.com/macports/macports-ports/blob/master/textproc/miller/Portfile.
  * Social-media updates.

.. code-block:: none

   git remote add upstream https://github.com/Homebrew/homebrew-core # one-time setup only
   git fetch upstream
   git rebase upstream/master
   git checkout -b miller-6.1.0
   shasum -a 256 /path/to/mlr-6.1.0.tar.gz
   edit Formula/miller.rb
   # Test the URL from the line like
   # url "https://github.com/johnkerl/miller/releases/download/v6.1.0/mlr-6.1.0.tar.gz"
   # in a browser for typos
   # A '@BrewTestBot Test this please' comment within the homebrew-core pull request will restart the homebrew travis build
   git add Formula/miller.rb
   git commit -m 'miller 6.1.0'
   git push -u origin miller-6.1.0
   (submit the pull request)

* Afterwork:

  * Edit ``go/src/version/version.go`` and ``docs6/conf.py`` to change version from ``6.2.0`` to ``6.2.0-dev``.
  * ``cd go``
  * ``./build``
  * Commit and push.

Misc. development notes
----------------------------------------------------------------

I use terminal width 120 and tabwidth 4.
@@ -1,118 +0,0 @@
Building from source
================================================================

Please also see :doc:`installation` for information about pre-built executables.

Miller license
----------------------------------------------------------------

Two-clause BSD license https://github.com/johnkerl/miller/blob/master/LICENSE.txt.

From release tarball
----------------------------------------------------------------

* Obtain ``mlr-i.j.k.tar.gz`` from https://github.com/johnkerl/miller/tags, replacing ``i.j.k`` with the desired release, e.g. ``6.1.0``.
* ``tar zxvf mlr-i.j.k.tar.gz``
* ``cd mlr-i.j.k``
* ``cd go``
* ``./build`` creates the ``go/mlr`` executable and runs regression tests
* ``go build mlr.go`` creates the ``go/mlr`` executable without running regression tests

From git clone
----------------------------------------------------------------

* ``git clone https://github.com/johnkerl/miller``
* ``cd miller/go``
* ``./build`` creates the ``go/mlr`` executable and runs regression tests
* ``go build mlr.go`` creates the ``go/mlr`` executable without running regression tests

In case of problems
----------------------------------------------------------------

If you have any build errors, feel free to open an issue with "New Issue" at https://github.com/johnkerl/miller/issues.

Dependencies
----------------------------------------------------------------

Required external dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These are necessary to produce the ``mlr`` executable.

* Go version 1.16 or higher
* Others packaged within ``go.mod`` and ``go.sum`` which you don't need to deal with manually -- the Go build process handles them for us

Optional external dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This documentation pageset is built using Sphinx. Please see https://github.com/johnkerl/miller/blob/main/docs6/README.md for details.

Creating a new release: for developers
----------------------------------------------------------------

At present I'm the primary developer so this is just my checklist for making new releases.

In this example I am using version 6.1.0 to 6.2.0; of course that will change for subsequent revisions.

* Update version found in ``mlr --version`` and ``man mlr``:

  * Edit ``go/src/version/version.go`` from ``6.1.0-dev`` to ``6.2.0``.
  * Likewise ``docs6/conf.py``
  * ``cd ../docs6``
  * ``export PATH=../go:$PATH``
  * ``make html``
  * The ordering is important: the first build creates ``mlr``; the second runs ``mlr`` to create ``manpage.txt``; the third includes ``manpage.txt`` into one of its outputs.
  * Commit and push.

* Create the release tarball and SRPM:

  * TBD for the Go port ...
  * Linux/MacOS/Windows binaries from GitHub Actions ...
  * Pull back release tarball ``mlr-6.2.0.tar.gz`` from buildbox, and ``mlr.{arch}`` binaries from whatever buildboxes.

* Create the Github release tag:

  * Don't forget the ``v`` in ``v6.2.0``
  * Write the release notes
  * Attach the release tarball and binaries. Double-check assets were successfully uploaded.
  * Publish the release

* Check the release-specific docs:

  * Look at https://miller.readthedocs.io for new-version docs, after a few minutes' propagation time.

* Notify:

  * Submit ``brew`` pull request; notify any other distros which don't appear to have autoupdated since the previous release (notes below)
  * Similarly for ``macports``: https://github.com/macports/macports-ports/blob/master/textproc/miller/Portfile.
  * Social-media updates.

GENRST_CARDIFY
git remote add upstream https://github.com/Homebrew/homebrew-core # one-time setup only
git fetch upstream
git rebase upstream/master
git checkout -b miller-6.1.0
shasum -a 256 /path/to/mlr-6.1.0.tar.gz
edit Formula/miller.rb
# Test the URL from the line like
# url "https://github.com/johnkerl/miller/releases/download/v6.1.0/mlr-6.1.0.tar.gz"
# in a browser for typos
# A '@BrewTestBot Test this please' comment within the homebrew-core pull request will restart the homebrew travis build
git add Formula/miller.rb
git commit -m 'miller 6.1.0'
git push -u origin miller-6.1.0
(submit the pull request)
GENRST_EOF

* Afterwork:

  * Edit ``go/src/version/version.go`` and ``docs6/conf.py`` to change version from ``6.2.0`` to ``6.2.0-dev``.
  * ``cd go``
  * ``./build``
  * Commit and push.

Misc. development notes
----------------------------------------------------------------

I use terminal width 120 and tabwidth 4.
@@ -1,10 +0,0 @@
..
   PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

Community
=========

* See `Miller GitHub Discussions <https://github.com/johnkerl/miller/discussions>`_ for general Q&A, advice, sharing success stories, etc.
* See also `Miller-tagged questions on Stack Overflow <https://stackoverflow.com/questions/tagged/miller?tab=Newest>`_
* See `Miller GitHub Issues <https://github.com/johnkerl/miller/issues>`_ for bug reports and feature requests
* Other correspondence: mailto:kerl.john.r+miller@gmail.com
@@ -1,7 +0,0 @@
Community
=========

* See `Miller GitHub Discussions <https://github.com/johnkerl/miller/discussions>`_ for general Q&A, advice, sharing success stories, etc.
* See also `Miller-tagged questions on Stack Overflow <https://stackoverflow.com/questions/tagged/miller?tab=Newest>`_
* See `Miller GitHub Issues <https://github.com/johnkerl/miller/issues>`_ for bug reports and feature requests
* Other correspondence: mailto:kerl.john.r+miller@gmail.com

112
docs6/conf.py
@ -1,112 +0,0 @@
|
|||
# Configuration file for the Sphinx documentation builder.
|
||||
#
|
||||
# This file only contains a selection of the most common options. For a full
|
||||
# list see the documentation:
|
||||
# https://www.sphinx-doc.org/en/master/usage/configuration.html
|
||||
|
||||
# -- Path setup --------------------------------------------------------------
|
||||
|
||||
# If extensions (or modules to document with autodoc) are in another directory,
|
||||
# add these directories to sys.path here. If the directory is relative to the
|
||||
# documentation root, use os.path.abspath to make it absolute, like shown here.
|
||||
#
|
||||
# import os
|
||||
# import sys
|
||||
# sys.path.insert(0, os.path.abspath('.'))
|
||||
|
||||
|
||||
# -- Project information -----------------------------------------------------
|
||||
|
||||
project = 'Miller'
|
||||
copyright = '2021, John Kerl'
|
||||
author = 'John Kerl'
|
||||
|
||||
# The full version, including alpha/beta/rc tags
|
||||
release = '6.0.0-alpha'
|
||||
|
||||
# -- General configuration ---------------------------------------------------
|
||||
master_doc = 'index'
|
||||
|
||||
# Add any Sphinx extension module names here, as strings. They can be
|
||||
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
|
||||
# ones.
|
||||
extensions = [
|
||||
]
|
||||
|
||||
# Add any paths that contain templates here, relative to this directory.
|
||||
templates_path = ['_templates']
|
||||
|
||||
# List of patterns, relative to source directory, that match files and
|
||||
# directories to ignore when looking for source files.
|
||||
# This pattern also affects html_static_path and html_extra_path.
|
||||
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
|
||||
|
||||
# -- Options for HTML output -------------------------------------------------
|
||||
|
||||
# The theme to use for HTML and HTML Help pages. See the documentation for
|
||||
# a list of builtin themes.
|
||||
#
|
||||
#html_theme = 'alabaster'
|
||||
#html_theme = 'classic'
|
||||
#html_theme = 'sphinxdoc'
|
||||
#html_theme = 'nature'
|
||||
html_theme = 'scrolls'
|
||||
|
||||
# Add any paths that contain custom static files (such as style sheets) here,
|
||||
# relative to this directory. They are copied after the builtin static files,
|
||||
# so a file named "default.css" will overwrite the builtin "default.css".
|
||||
html_static_path = ['_static']
|
||||
|
||||
# ----------------------------------------------------------------
|
||||
# Include code-sample files in the Sphinx build tree.
|
||||
# See also https://github.com/johnkerl/miller/issues/560.
|
||||
#
|
||||
# There is a problem, and an opportunity for a hack.
|
||||
#
|
||||
# * Our data files are in ./data/* (and a few other subdirs of .).
|
||||
#
|
||||
# * We want them copied to ./_build/data/* so that we can symlink from our doc
|
||||
# files (written in ./*.rst. autogenned to HTML in ./_build/html/*.html) to
|
||||
# relative paths like ./data/a.csv.
|
||||
#
|
||||
# * If we use html_extra_path = ['data'] then the files like ./data/a.csv
|
||||
# are copied to _build/html/a.csv -- one directory 'down'. This means that
|
||||
# example Miller commands are shown in the generated HTML using 'mlr --csv
|
||||
# cat data/a.csv' but 'data/a.csv' doesn't exist relative to _build/html.
|
||||
# This is bad enough for local Sphinx builds but worse for readthedocs
|
||||
# (https://miller.readthedocs.io) since *only* _build/html files are put into
|
||||
# readthedocs.
|
||||
#
|
||||
# * In our Makefile it's easy enough to do some cp -a commands from ./data
|
||||
# to ./_build_html/data etc. for local Sphinx builds -- however, readthedocs
|
||||
# doesn't use the Makefile at all, only this conf.py file.
|
||||
#
|
||||
# * Hence the hack: we have a subdir ./sphinx-hack which has a symlink
|
||||
# ./sphinx-hack/data pointing to ./data. So when the Sphinx build executes
|
||||
# html_extra_path and removes one directory level, it's an 'extra' level we
|
||||
# can do without.
|
||||
#
|
||||
# * This all relies on symlinks being propagated through GitHub version
|
||||
# control, readthedocs, and Sphinx build at readthedocs.
|
||||
|
||||
html_extra_path = [
|
||||
'sphinx-hack',
|
||||
'10-1.sh',
|
||||
'10-2.sh',
|
||||
'circle.csv',
|
||||
'commas.csv',
|
||||
'dates.csv',
|
||||
'example.csv',
|
||||
'expo-sample.sh',
|
||||
'log.txt',
|
||||
'make.bat',
|
||||
'manpage.txt',
|
||||
'oosvar-example-ewma.sh',
|
||||
'oosvar-example-sum-grouped.sh',
|
||||
'oosvar-example-sum.sh',
|
||||
'sample_mlrrc',
|
||||
'square.csv',
|
||||
'triangle.csv',
|
||||
'variance.mlr',
|
||||
'verb-example-ewma.sh',
|
||||
]
|
||||
|
|
@@ -1,47 +0,0 @@
..
   PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

How to contribute
================================================================

Community
----------------------------------------------------------------

You can ask questions -- or answer them! -- following the links at :doc:`community`.

Documentation improvements
----------------------------------------------------------------

Pre-release Miller documentation is at https://github.com/johnkerl/miller/tree/main/docs6.

Clone https://github.com/johnkerl/miller and `cd` into `docs6`.

After ``sudo pip install sphinx`` (or ``pip3``) you should be able to do ``make html``.

Edit ``*.rst.in`` files, then run ``make html`` to generate ``*.rst`` and run the Sphinx document-generator.

Open ``_build/html/index.html`` in your browser, e.g. ``file:///Users/yourname/git/miller/docs6/_build/html/contributing.html``, to verify.

PRs are welcome at https://github.com/johnkerl/miller.

Once PRs are merged, readthedocs creates https://miller.readthedocs.io using the following configs:

* https://readthedocs.org/projects/miller/
* https://readthedocs.org/projects/miller/builds/
* https://github.com/johnkerl/miller/settings/hooks

Testing
----------------------------------------------------------------

As of Miller-6's current pre-release status, the best way to test is to either build from source via :doc:`build`, or by getting a recent binary at https://github.com/johnkerl/miller/actions, then click latest build, then *Artifacts*. Then simply use Miller for whatever you do, and create an issue at https://github.com/johnkerl/miller/issues.

Do note that as of 2021-06-17 a few things have not been ported to Miller 6 -- most notably, regex captures and localtime DSL functions.

Feature development
----------------------------------------------------------------

Issues: https://github.com/johnkerl/miller/issues

Developer notes: https://github.com/johnkerl/miller/blob/main/go/README.md

PRs which pass regression test (https://github.com/johnkerl/miller/blob/main/go/regtest/README.md) are always welcome!
@ -1,44 +0,0 @@
How to contribute
================================================================

Community
----------------------------------------------------------------

You can ask questions -- or answer them! -- following the links at :doc:`community`.

Documentation improvements
----------------------------------------------------------------

Pre-release Miller documentation is at https://github.com/johnkerl/miller/tree/main/docs6.

Clone https://github.com/johnkerl/miller and ``cd`` into ``docs6``.

After ``sudo pip install sphinx`` (or ``pip3``) you should be able to do ``make html``.

Edit ``*.rst.in`` files, then run ``make html`` to regenerate the ``*.rst`` files and run the Sphinx document-generator.

Open ``_build/html/index.html`` in your browser, e.g. ``file:///Users/yourname/git/miller/docs6/_build/html/contributing.html``, to verify.

PRs are welcome at https://github.com/johnkerl/miller.

Once PRs are merged, readthedocs creates https://miller.readthedocs.io using the following configs:

* https://readthedocs.org/projects/miller/
* https://readthedocs.org/projects/miller/builds/
* https://github.com/johnkerl/miller/settings/hooks

Testing
----------------------------------------------------------------

Given Miller 6's current pre-release status, the best way to test is either to build from source via :doc:`build`, or to get a recent binary at https://github.com/johnkerl/miller/actions (click the latest build, then *Artifacts*). Then simply use Miller for whatever you do, and create an issue at https://github.com/johnkerl/miller/issues.

Do note that as of 2021-06-17 a few things have not yet been ported to Miller 6 -- most notably, regex captures and localtime DSL functions.

Feature development
----------------------------------------------------------------

Issues: https://github.com/johnkerl/miller/issues

Developer notes: https://github.com/johnkerl/miller/blob/main/go/README.md

PRs which pass the regression test (https://github.com/johnkerl/miller/blob/main/go/regtest/README.md) are always welcome!

@ -1,149 +0,0 @@
..
    PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

CSV, with and without headers
=============================

Headerless CSV on input or output
----------------------------------------------------------------

Sometimes we get CSV files which lack a header. For example (`data/headerless.csv <./data/headerless.csv>`_):

.. code-block:: none
    :emphasize-lines: 1-1

    cat data/headerless.csv
    John,23,present
    Fred,34,present
    Alice,56,missing
    Carol,45,present

You can use Miller to add a header. The ``--implicit-csv-header`` flag applies positionally indexed labels:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv --implicit-csv-header cat data/headerless.csv
    1,2,3
    John,23,present
    Fred,34,present
    Alice,56,missing
    Carol,45,present

Following that, you can rename the positionally indexed labels to names with meaning for your context. For example:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv --implicit-csv-header label name,age,status data/headerless.csv
    name,age,status
    John,23,present
    Fred,34,present
    Alice,56,missing
    Carol,45,present

Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller's output to the system command ``sed 1d``, or you can use Miller's ``--headerless-csv-output`` option:

.. code-block:: none
    :emphasize-lines: 1-1

    head -5 data/colored-shapes.dkvp | mlr --ocsv cat
    color,shape,flag,i,u,v,w,x
    yellow,triangle,1,11,0.6321695890307647,0.9887207810889004,0.4364983936735774,5.7981881667050565
    red,square,1,15,0.21966833570651523,0.001257332190235938,0.7927778364718627,2.944117399716207
    red,circle,1,16,0.20901671281497636,0.29005231936593445,0.13810280912907674,5.065034003400998
    red,square,0,48,0.9562743938458542,0.7467203085342884,0.7755423050923582,7.117831369597269
    purple,triangle,0,51,0.4355354501763202,0.8591292672156728,0.8122903963006748,5.753094629505863

.. code-block:: none
    :emphasize-lines: 1-1

    head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat
    yellow,triangle,1,11,0.6321695890307647,0.9887207810889004,0.4364983936735774,5.7981881667050565
    red,square,1,15,0.21966833570651523,0.001257332190235938,0.7927778364718627,2.944117399716207
    red,circle,1,16,0.20901671281497636,0.29005231936593445,0.13810280912907674,5.065034003400998
    red,square,0,48,0.9562743938458542,0.7467203085342884,0.7755423050923582,7.117831369597269
    purple,triangle,0,51,0.4355354501763202,0.8591292672156728,0.8122903963006748,5.753094629505863

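As a quick way to see the ``sed 1d`` header-stripping trick mentioned above in action without Miller installed, here is a self-contained sketch on made-up data (the ``/tmp`` path is hypothetical):

```shell
# Write a small CSV with a header, then strip the header line with sed,
# as the text above suggests for producing headerless output.
printf 'name,age,status\nJohn,23,present\nFred,34,present\n' > /tmp/with-header.csv
sed 1d /tmp/with-header.csv
# prints:
# John,23,present
# Fred,34,present
```

The same effect is available natively via ``mlr --headerless-csv-output``, as shown in the examples above.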
Lastly, often we say "CSV" or "TSV" when we have positionally indexed data in columns which are separated by commas or tabs, respectively. In this case it's perhaps simpler to **just use NIDX format**, which was designed for this purpose. (See also :doc:`file-formats`.) For example:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv
    1 John
    3 present

    1 Fred
    3 present

    1 Alice
    3 missing

    1 Carol
    3 present

Headerless CSV with duplicate field values
------------------------------------------

Miller is (by central design) a mapping from name to value, rather than from integer position to value as in most tools in the Unix toolkit such as ``sort``, ``cut``, ``awk``, etc. So given the input ``Yea=1,Yea=2`` on the same input line, first ``Yea=1`` is stored, then updated with ``Yea=2``. This happens in the input parser, so the value ``Yea=1`` is unavailable to any further processing. The following example line comes from a headerless CSV file and includes the string (value) ``'NA'`` five times:

.. code-block:: none
    :emphasize-lines: 1-1

    ag '0.9' nas.csv | head -1
    2:-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA

The repeated ``'NA'`` strings (values) in the same line will be treated as fields (columns) with the same name, so only one is kept in the output.

This can be worked around by telling ``mlr`` that there is no header row, using ``--implicit-csv-header``, or by changing the input format to ``nidx``, like so:

.. code-block:: none

    ag '0.9' nas.csv | mlr --n2c --fs "," label xsn,ysn,x,y,t,a,e29,e31,e32 then head

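The last-writer-wins collapse described above can be illustrated without Miller at all; this is a hedged awk sketch of the same map-based behavior, not Miller's actual parser:

```shell
# Emulate a name-to-value reader: for key=value pairs on one line, a
# later occurrence of the same key overwrites the earlier one.
echo 'Yea=1,Yea=2' |
awk -F, '{for (i = 1; i <= NF; i++) {split($i, kv, "="); m[kv[1]] = kv[2]}
          for (k in m) print k "=" m[k]}'
# prints:
# Yea=2
```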
Regularizing ragged CSV
----------------------------------------------------------------

Miller handles compliant CSV: in particular, it's an error if the number of data fields in a given data line doesn't match the number of header fields. But in the event that you have a CSV file in which some lines have fewer than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, for which each line stands on its own without respect to a header line.

.. code-block:: none
    :emphasize-lines: 1-1

    cat data/ragged.csv
    a,b,c
    1,2,3
    4,5
    6,7,8,9

.. code-block:: none
    :emphasize-lines: 1-8

    mlr --from data/ragged.csv --fs comma --nidx put '
      @maxnf = max(@maxnf, NF);
      @nf = NF;
      while(@nf < @maxnf) {
        @nf += 1;
        $[@nf] = ""
      }
    '
    a,b,c
    1,2,3
    4,5
    6,7,8,9

or, more simply,

.. code-block:: none
    :emphasize-lines: 1-6

    mlr --from data/ragged.csv --fs comma --nidx put '
      @maxnf = max(@maxnf, NF);
      while(NF < @maxnf) {
        $[NF+1] = "";
      }
    '
    a,b,c
    1,2,3
    4,5
    6,7,8,9

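For readers without ``mlr`` at hand, the padding idea above can be sketched in awk -- an analogue of the one-pass Miller ``put`` statement with its running maximum, not Miller's implementation:

```shell
# One-pass analogue of the Miller put-statement above: track the running
# maximum field count and pad each line out to it with empty fields.
printf 'a,b,c\n1,2,3\n4,5\n6,7,8,9\n' |
awk -F, -v OFS=, '{if (NF > maxnf) maxnf = NF;
                   for (i = NF + 1; i <= maxnf; i++) $i = "";
                   print}'
# prints:
# a,b,c
# 1,2,3
# 4,5,
# 6,7,8,9
```

As with the Miller version, a line longer than anything seen so far (here ``6,7,8,9``) raises the running maximum rather than being truncated.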
@ -1,72 +0,0 @@
CSV, with and without headers
=============================

Headerless CSV on input or output
----------------------------------------------------------------

Sometimes we get CSV files which lack a header. For example (`data/headerless.csv <./data/headerless.csv>`_):

GENRST_RUN_COMMAND
cat data/headerless.csv
GENRST_EOF

You can use Miller to add a header. The ``--implicit-csv-header`` flag applies positionally indexed labels:

GENRST_RUN_COMMAND
mlr --csv --implicit-csv-header cat data/headerless.csv
GENRST_EOF

Following that, you can rename the positionally indexed labels to names with meaning for your context. For example:

GENRST_RUN_COMMAND
mlr --csv --implicit-csv-header label name,age,status data/headerless.csv
GENRST_EOF

Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller's output to the system command ``sed 1d``, or you can use Miller's ``--headerless-csv-output`` option:

GENRST_RUN_COMMAND
head -5 data/colored-shapes.dkvp | mlr --ocsv cat
GENRST_EOF

GENRST_RUN_COMMAND
head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat
GENRST_EOF

Lastly, often we say "CSV" or "TSV" when we have positionally indexed data in columns which are separated by commas or tabs, respectively. In this case it's perhaps simpler to **just use NIDX format**, which was designed for this purpose. (See also :doc:`file-formats`.) For example:

GENRST_RUN_COMMAND
mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv
GENRST_EOF

Headerless CSV with duplicate field values
------------------------------------------

Miller is (by central design) a mapping from name to value, rather than from integer position to value as in most tools in the Unix toolkit such as ``sort``, ``cut``, ``awk``, etc. So given the input ``Yea=1,Yea=2`` on the same input line, first ``Yea=1`` is stored, then updated with ``Yea=2``. This happens in the input parser, so the value ``Yea=1`` is unavailable to any further processing. The following example line comes from a headerless CSV file and includes the string (value) ``'NA'`` five times:

GENRST_CARDIFY_HIGHLIGHT_ONE
ag '0.9' nas.csv | head -1
2:-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
GENRST_EOF

The repeated ``'NA'`` strings (values) in the same line will be treated as fields (columns) with the same name, so only one is kept in the output.

This can be worked around by telling ``mlr`` that there is no header row, using ``--implicit-csv-header``, or by changing the input format to ``nidx``, like so:

GENRST_CARDIFY
ag '0.9' nas.csv | mlr --n2c --fs "," label xsn,ysn,x,y,t,a,e29,e31,e32 then head
GENRST_EOF

Regularizing ragged CSV
----------------------------------------------------------------

Miller handles compliant CSV: in particular, it's an error if the number of data fields in a given data line doesn't match the number of header fields. But in the event that you have a CSV file in which some lines have fewer than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, for which each line stands on its own without respect to a header line.

GENRST_RUN_COMMAND
cat data/ragged.csv
GENRST_EOF

GENRST_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv.sh)

or, more simply,

GENRST_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv-2.sh)

@ -1,96 +0,0 @@
..
    PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

Customization: .mlrrc
================================================================

How to use .mlrrc
----------------------------------------------------------------

Suppose you always use CSV files. Then instead of always having to type ``--csv`` as in

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv cut -x -f extra mydata.csv

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv sort -n id mydata.csv

and so on, you can instead put the following into your ``$HOME/.mlrrc``:

.. code-block:: none

    --csv

Then you can just type things like

.. code-block:: none
    :emphasize-lines: 1-1

    mlr cut -x -f extra mydata.csv

.. code-block:: none
    :emphasize-lines: 1-1

    mlr sort -n id mydata.csv

and the ``--csv`` part will automatically be understood. (If you do want to process, say, a JSON file then ``mlr --json ...`` at the command line will override the default from your ``.mlrrc``.)

What you can put in your .mlrrc
----------------------------------------------------------------

* You can include any command-line flags, except the "terminal" ones such as ``--help``.

* The ``--prepipe``, ``--load``, and ``--mload`` flags aren't allowed in ``.mlrrc`` since they control code execution, and could result in your scripts running things you don't expect if you receive data from someone with a ``.mlrrc`` in it.

* The formatting rule is that you need to put one flag beginning with ``--`` per line: for example, ``--csv`` on one line and ``--nr-progress-mod 1000`` on a separate line.

* Since every line starts with a ``--`` option, you can leave off the initial ``--`` if you want. For example, ``ojson`` is the same as ``--ojson``, and ``nr-progress-mod 1000`` is the same as ``--nr-progress-mod 1000``.

* Comments run from a ``#`` to the end of the line.

* Empty lines are ignored -- including lines which are empty after comments are removed.

Here is an example ``.mlrrc`` file:

.. code-block:: none

    # Input and output formats are CSV by default (unless otherwise specified
    # on the mlr command line):
    csv

    # If a data line has fewer fields than the header line, instead of erroring
    # (which is the default), just insert empty values for the missing ones:
    allow-ragged-csv-input

    # These are no-ops for CSV, but when I do use JSON output, I want these
    # pretty-printing options to be used:
    jvstack
    jlistwrap

    # Use "@", rather than "#", for comments within data files:
    skip-comments-with @

Where to put your .mlrrc
----------------------------------------------------------------

If the environment variable ``MLRRC`` is set:

* If its value is ``__none__`` then no ``.mlrrc`` files are processed. (This is nice for things like regression testing.)

* Otherwise, its value (as a filename) is loaded and processed. If there are syntax errors, they abort ``mlr`` with a usage message (as if you had mistyped something on the command line). If the file can't be loaded at all, though, it is silently skipped.

* Any ``.mlrrc`` in your home directory or current directory is ignored whenever ``MLRRC`` is set in the environment.

* Example line in your shell's rc file: ``export MLRRC=/path/to/my/mlrrc``

Otherwise:

* If ``$HOME/.mlrrc`` exists, it's processed as above.

* If ``./.mlrrc`` exists, it's then also processed as above.

* The idea is that you can have all your settings in your ``$HOME/.mlrrc``, then override maybe one or two for your current directory if you like.

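The comment-and-blank-line rules above can be mimicked in plain shell; this sketch strips a sample ``.mlrrc`` the way the text describes (an illustration only, not mlr's actual parser, and the ``/tmp`` path is hypothetical):

```shell
# Strip '#' comments, trailing whitespace, and blank lines from a sample
# .mlrrc, per the parsing rules described above.
cat > /tmp/mlrrc.sample <<'EOF'
# default to CSV
csv

allow-ragged-csv-input  # pad short data lines
EOF
sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^$/d' /tmp/mlrrc.sample
# prints:
# csv
# allow-ragged-csv-input
```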
@ -1,73 +0,0 @@
Customization: .mlrrc
================================================================

How to use .mlrrc
----------------------------------------------------------------

Suppose you always use CSV files. Then instead of always having to type ``--csv`` as in

GENRST_CARDIFY_HIGHLIGHT_ONE
mlr --csv cut -x -f extra mydata.csv
GENRST_EOF

GENRST_CARDIFY_HIGHLIGHT_ONE
mlr --csv sort -n id mydata.csv
GENRST_EOF

and so on, you can instead put the following into your ``$HOME/.mlrrc``:

GENRST_CARDIFY
--csv
GENRST_EOF

Then you can just type things like

GENRST_CARDIFY_HIGHLIGHT_ONE
mlr cut -x -f extra mydata.csv
GENRST_EOF

GENRST_CARDIFY_HIGHLIGHT_ONE
mlr sort -n id mydata.csv
GENRST_EOF

and the ``--csv`` part will automatically be understood. (If you do want to process, say, a JSON file then ``mlr --json ...`` at the command line will override the default from your ``.mlrrc``.)

What you can put in your .mlrrc
----------------------------------------------------------------

* You can include any command-line flags, except the "terminal" ones such as ``--help``.

* The ``--prepipe``, ``--load``, and ``--mload`` flags aren't allowed in ``.mlrrc`` since they control code execution, and could result in your scripts running things you don't expect if you receive data from someone with a ``.mlrrc`` in it.

* The formatting rule is that you need to put one flag beginning with ``--`` per line: for example, ``--csv`` on one line and ``--nr-progress-mod 1000`` on a separate line.

* Since every line starts with a ``--`` option, you can leave off the initial ``--`` if you want. For example, ``ojson`` is the same as ``--ojson``, and ``nr-progress-mod 1000`` is the same as ``--nr-progress-mod 1000``.

* Comments run from a ``#`` to the end of the line.

* Empty lines are ignored -- including lines which are empty after comments are removed.

Here is an example ``.mlrrc`` file:

GENRST_INCLUDE_ESCAPED(sample_mlrrc)

Where to put your .mlrrc
----------------------------------------------------------------

If the environment variable ``MLRRC`` is set:

* If its value is ``__none__`` then no ``.mlrrc`` files are processed. (This is nice for things like regression testing.)

* Otherwise, its value (as a filename) is loaded and processed. If there are syntax errors, they abort ``mlr`` with a usage message (as if you had mistyped something on the command line). If the file can't be loaded at all, though, it is silently skipped.

* Any ``.mlrrc`` in your home directory or current directory is ignored whenever ``MLRRC`` is set in the environment.

* Example line in your shell's rc file: ``export MLRRC=/path/to/my/mlrrc``

Otherwise:

* If ``$HOME/.mlrrc`` exists, it's processed as above.

* If ``./.mlrrc`` exists, it's then also processed as above.

* The idea is that you can have all your settings in your ``$HOME/.mlrrc``, then override maybe one or two for your current directory if you like.

@ -1,77 +0,0 @@
..
    PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

Data-cleaning examples
================================================================

Here are some ways to use the type-checking options as described in :ref:`reference-dsl-type-tests-and-assertions`. Suppose you have the following data file, with inconsistent typing for booleans. (Also imagine that, for the sake of discussion, we have a million-line file rather than a four-line file, so we can't see it all at once and some automation is called for.)

.. code-block:: none
    :emphasize-lines: 1-1

    cat data/het-bool.csv
    name,reachable
    barney,false
    betty,true
    fred,true
    wilma,1

One option is to coerce everything to boolean, or to integer:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint put '$reachable = boolean($reachable)' data/het-bool.csv
    name   reachable
    barney false
    betty  true
    fred   true
    wilma  true

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint put '$reachable = int(boolean($reachable))' data/het-bool.csv
    name   reachable
    barney 0
    betty  1
    fred   1
    wilma  1

A second option is to flag badly formatted data within the output stream:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --icsv --opprint put '$format_ok = is_string($reachable)' data/het-bool.csv
    name   reachable format_ok
    barney false     false
    betty  true      false
    fred   true      false
    wilma  1         false

Or perhaps to flag badly formatted data outside the output stream:

.. code-block:: none
    :emphasize-lines: 1-3

    mlr --icsv --opprint put '
      if (!is_string($reachable)) {eprint "Malformed at NR=".NR}
    ' data/het-bool.csv
    Malformed at NR=1
    Malformed at NR=2
    Malformed at NR=3
    Malformed at NR=4
    name   reachable
    barney false
    betty  true
    fred   true
    wilma  1

A third way is to abort the process on the first instance of bad data:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --csv put '$reachable = asserting_string($reachable)' data/het-bool.csv
    Miller: is_string type-assertion failed at NR=1 FNR=1 FILENAME=data/het-bool.csv

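As an mlr-free illustration of the "flag badly formatted data" idea above, here is an awk sketch that appends a ``format_ok`` column. Note it checks for literal true/false words, roughly the inverse of the ``is_string`` test used in the Miller examples, and the inline data is made up:

```shell
# Append a format_ok column: true when the reachable field is a literal
# boolean word, false otherwise.
printf 'name,reachable\nbarney,false\nbetty,true\nwilma,1\n' |
awk -F, 'NR == 1 {print $0 ",format_ok"; next}
         {print $0 "," (($2 == "true" || $2 == "false") ? "true" : "false")}'
# prints:
# name,reachable,format_ok
# barney,false,true
# betty,true,true
# wilma,1,false
```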
@ -1,38 +0,0 @@
Data-cleaning examples
================================================================

Here are some ways to use the type-checking options as described in :ref:`reference-dsl-type-tests-and-assertions`. Suppose you have the following data file, with inconsistent typing for booleans. (Also imagine that, for the sake of discussion, we have a million-line file rather than a four-line file, so we can't see it all at once and some automation is called for.)

GENRST_RUN_COMMAND
cat data/het-bool.csv
GENRST_EOF

One option is to coerce everything to boolean, or to integer:

GENRST_RUN_COMMAND
mlr --icsv --opprint put '$reachable = boolean($reachable)' data/het-bool.csv
GENRST_EOF

GENRST_RUN_COMMAND
mlr --icsv --opprint put '$reachable = int(boolean($reachable))' data/het-bool.csv
GENRST_EOF

A second option is to flag badly formatted data within the output stream:

GENRST_RUN_COMMAND
mlr --icsv --opprint put '$format_ok = is_string($reachable)' data/het-bool.csv
GENRST_EOF

Or perhaps to flag badly formatted data outside the output stream:

GENRST_RUN_COMMAND
mlr --icsv --opprint put '
  if (!is_string($reachable)) {eprint "Malformed at NR=".NR}
' data/het-bool.csv
GENRST_EOF

A third way is to abort the process on the first instance of bad data:

GENRST_RUN_COMMAND_TOLERATING_ERROR
mlr --csv put '$reachable = asserting_string($reachable)' data/het-bool.csv
GENRST_EOF

@ -1,234 +0,0 @@
..
    PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

Data-diving examples
================================================================

flins data
----------------------------------------------------------------

The `flins.csv <data/flins.csv>`_ file is some sample data obtained from https://support.spatialkey.com/spatialkey-sample-csv-data.

Vertical-tabular format is good for a quick look at CSV data layout -- seeing what columns you have to work with:

.. code-block:: none
    :emphasize-lines: 1-1

    head -n 2 data/flins.csv | mlr --icsv --oxtab cat
    county   Seminole
    tiv_2011 22890.55
    tiv_2012 20848.71
    line     Residential

A few simple queries:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --from data/flins.csv --icsv --opprint count-distinct -f county | head
    county     count
    Seminole   1
    Miami Dade 2
    Palm Beach 1
    Highlands  2
    Duval      1
    St. Johns  1

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --from data/flins.csv --icsv --opprint count-distinct -f construction,line

Categorization of total insured value:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --from data/flins.csv --icsv --opprint stats1 -a min,mean,max -f tiv_2012
    tiv_2012_min tiv_2012_mean      tiv_2012_max
    19757.91     1061531.4637499999 2785551.63

.. code-block:: none
    :emphasize-lines: 1-2

    mlr --from data/flins.csv --icsv --opprint \
      stats1 -a min,mean,max -f tiv_2012 -g construction,line

.. code-block:: none
    :emphasize-lines: 1-2

    mlr --from data/flins.csv --icsv --oxtab \
      stats1 -a p0,p10,p50,p90,p95,p99,p100 -f hu_site_deductible

.. code-block:: none
    :emphasize-lines: 1-3

    mlr --from data/flins.csv --icsv --opprint \
      stats1 -a p95,p99,p100 -f hu_site_deductible -g county \
      then sort -f county | head
    county
    Duval
    Highlands
    Miami Dade
    Palm Beach
    Seminole
    St. Johns

.. code-block:: none
    :emphasize-lines: 1-2

    mlr --from data/flins.csv --icsv --oxtab \
      stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
    tiv_2011_tiv_2012_corr  0.9353629581411828
    tiv_2011_tiv_2012_ols_m 1.0890905877734807
    tiv_2011_tiv_2012_ols_b 103095.52335638746
    tiv_2011_tiv_2012_ols_n 8
    tiv_2011_tiv_2012_r2    0.8749038634626236

.. code-block:: none
    :emphasize-lines: 1-2

    mlr --from data/flins.csv --icsv --opprint \
      stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county
    county     tiv_2011_tiv_2012_corr tiv_2011_tiv_2012_ols_m tiv_2011_tiv_2012_ols_b tiv_2011_tiv_2012_ols_n tiv_2011_tiv_2012_r2
    Seminole   -                      -                       -                       1                       -
    Miami Dade 1                      0.9306426512386247      -2311.1543275160047     2                       0.9999999999999999
    Palm Beach -                      -                       -                       1                       -
    Highlands  0.9999999999999997     1.055692910750992       -4529.7939388307705     2                       0.9999999999999992
    Duval      -                      -                       -                       1                       -
    St. Johns  -                      -                       -                       1                       -

Color/shape data
----------------------------------------------------------------

The `colored-shapes.dkvp <https://github.com/johnkerl/miller/blob/master/docs/data/colored-shapes.dkvp>`_ file is some sample data produced by the `mkdat2 <data/mkdat2>`_ script. The idea is:

* Produce some data with known distributions and correlations, and verify that Miller recovers those properties empirically.
* Each record is labeled with one of a few colors and one of a few shapes.
* The ``flag`` field is 0 or 1, with probability dependent on color.
* The ``u`` field is plain uniform on the unit interval.
* The ``v`` field is the same, except tightly correlated with ``u`` for red circles.
* The ``w`` field is autocorrelated for each color/shape pair.
* The ``x`` field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.

Peek at the data:

.. code-block:: none
    :emphasize-lines: 1-1

    wc -l data/colored-shapes.dkvp
    10078 data/colored-shapes.dkvp

.. code-block:: none
    :emphasize-lines: 1-1

    head -n 6 data/colored-shapes.dkvp | mlr --opprint cat
    color  shape    flag i  u                   v                    w                   x
    yellow triangle 1    11 0.6321695890307647  0.9887207810889004   0.4364983936735774  5.7981881667050565
    red    square   1    15 0.21966833570651523 0.001257332190235938 0.7927778364718627  2.944117399716207
    red    circle   1    16 0.20901671281497636 0.29005231936593445  0.13810280912907674 5.065034003400998
    red    square   0    48 0.9562743938458542  0.7467203085342884   0.7755423050923582  7.117831369597269
    purple triangle 0    51 0.4355354501763202  0.8591292672156728   0.8122903963006748  5.753094629505863
    red    square   0    64 0.2015510269821953  0.9531098083420033   0.7719912015786777  5.612050466474166

Look at uncategorized stats (using `creach <https://github.com/johnkerl/scripts/blob/master/fundam/creach>`_ for spacing).

Here it looks reasonable that ``u`` is unit-uniform; something's up with ``v`` but we can't yet see what:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --oxtab stats1 -a min,mean,max -f flag,u,v data/colored-shapes.dkvp | creach 3
    flag_min  0
    flag_mean 0.39888866838658465
    flag_max  1

    u_min  0.000043912454007477564
    u_mean 0.4983263438118866
    u_max  0.9999687954968421

    v_min  -0.09270905318501277
    v_mean 0.49778696527477023
    v_max  1.0724998185026013

The histogram shows the different distribution of 0/1 flags:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp
    bin_lo                bin_hi              flag_count u_count v_count
    -0.010000000000000002 0.09000000000000002 6058       0       36
    0.09000000000000002   0.19000000000000003 0          1062    988
    0.19000000000000003   0.29000000000000004 0          985     1003
    0.29000000000000004   0.39000000000000007 0          1024    1014
    0.39000000000000007   0.4900000000000001  0          1002    991
    0.4900000000000001    0.5900000000000002  0          989     1041
    0.5900000000000002    0.6900000000000002  0          1001    1016
    0.6900000000000002    0.7900000000000001  0          972     962
    0.7900000000000001    0.8900000000000002  0          1035    1070
    0.8900000000000002    0.9900000000000002  0          995     993
    0.9900000000000002    1.0900000000000003  4020       1013    939
    1.0900000000000003    1.1900000000000002  0          0       25

Look at univariate stats by color and shape. In particular, color-dependent flag probabilities pop out, aligning with their original Bernoulli probabilities from the data-generator script:

.. code-block:: none
    :emphasize-lines: 1-3

    mlr --opprint stats1 -a min,mean,max -f flag,u,v -g color \
      then sort -f color \
      data/colored-shapes.dkvp
    color  flag_min flag_mean           flag_max u_min                   u_mean              u_max              v_min                 v_mean              v_max
    blue   0        0.5843537414965987  1        0.000043912454007477564 0.517717155039078   0.9999687954968421 0.0014886830387470518 0.49105642841387653 0.9995761761685742
    green  0        0.20919747520288548 1        0.00048750676198217047  0.5048610622924616  0.9999361779701204 0.0005012669003675585 0.49908475928072205 0.9996764373885353
    orange 0        0.5214521452145214  1        0.00123537823160913     0.49053241689014415 0.9988853487546249 0.0024486660337188493 0.4877637745987629  0.998475130432018
    purple 0        0.09019264448336252 1        0.0002655214518428872   0.4940049543793683  0.9996465731736793 0.0003641137096487279 0.497050699948439   0.9999751864255598
    red    0        0.3031674208144796  1        0.0006711367180041172   0.49255964831571375 0.9998822102016469 -0.09270905318501277  0.4965350959465078  1.0724998185026013
    yellow 0        0.8924274593064402  1        0.001300228762057487    0.49712912165196765 0.99992313390574   0.0007109695568577878 0.510626599360317   0.9999189897724752

.. code-block:: none
    :emphasize-lines: 1-3

    mlr --opprint stats1 -a min,mean,max -f flag,u,v -g shape \
      then sort -f shape \
      data/colored-shapes.dkvp
    shape    flag_min flag_mean           flag_max u_min                   u_mean              u_max              v_min                  v_mean              v_max
    circle   0        0.3998456194519491  1        0.000043912454007477564 0.49855450951394115 0.99992313390574   -0.09270905318501277   0.49552415740048406 1.0724998185026013
    square   0        0.39611178614823817 1        0.0001881939925673093   0.499385458061097   0.9999687954968421 0.00008930277299445954 0.49653825501903986 0.9999751864255598
    triangle 0        0.4015421115065243  1        0.000881025170573424    0.4968585405884252  0.9996614910922645 0.000716883409890845   0.501049532862137   0.9999946837499262

Look at bivariate stats by color and shape. In particular, ``u,v`` pairwise correlation for red circles pops out:

.. code-block:: none
    :emphasize-lines: 1-1

    mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp
               u_v_corr              w_x_corr
    0.13341803768384553 -0.011319938208638764

.. code-block:: none
    :emphasize-lines: 1-3

    mlr --opprint --right \
      stats2 -a corr -f u,v,w,x -g color,shape then sort -nr u_v_corr \
      data/colored-shapes.dkvp
     color    shape             u_v_corr               w_x_corr
       red   circle   0.9807984157534667  -0.018565046320623148
    orange   square  0.17685846147882145   -0.07104374629148885
     green   circle  0.05764430126828069   0.011795210176784067
       red   square 0.055744791559722166 -0.0006802175149145207
    yellow triangle  0.04457267106380469    0.02460476240108526
    yellow   square  0.04379171794446621   -0.04462267239937856
    purple   circle  0.03587354791796681    0.13411247530136805
      blue   square  0.03241156493114544   -0.05350791240143263
      blue triangle 0.015356295190464324 -0.0006084778850362686
|
||||
orange circle 0.01051866723398945 -0.1627949723421722
|
||||
red triangle 0.00809781003735548 0.012485753551391776
|
||||
purple triangle 0.005155038421780437 -0.04505792148014131
|
||||
purple square -0.02568020549187632 0.05769444883779078
|
||||
green square -0.025775985300150128 -0.003265248022084335
|
||||
orange triangle -0.030456930370361554 -0.131870019629393
|
||||
yellow circle -0.06477338560056926 0.07369474300245252
|
||||
blue circle -0.1023476302678634 -0.030529007506883508
|
||||
green triangle -0.10901830007460846 -0.0484881707807228
|
||||
|
|
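The grouped ``corr`` computation above is easy to sanity-check outside Miller. Here is a minimal Python sketch (the function names are hypothetical, and records are assumed already parsed into dicts with ``color``, ``shape``, ``u``, ``v`` keys) of Pearson correlation per color/shape group:

```python
import math
from collections import defaultdict

def pearson_corr(xs, ys):
    # Pearson correlation of two equal-length numeric sequences.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def grouped_uv_corr(records):
    # Mirrors `stats2 -a corr -f u,v -g color,shape`: bucket the (u, v)
    # pairs by (color, shape), then correlate within each bucket.
    groups = defaultdict(lambda: ([], []))
    for r in records:
        us, vs = groups[(r['color'], r['shape'])]
        us.append(r['u'])
        vs.append(r['v'])
    return {k: pearson_corr(us, vs) for k, (us, vs) in groups.items()}
```

Note that the normalization convention (n vs. n-1) cancels in the correlation ratio, so this agrees with either choice.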
@ -1,118 +0,0 @@
Data-diving examples
================================================================

flins data
----------------------------------------------------------------

The `flins.csv <data/flins.csv>`_ file is some sample data obtained from https://support.spatialkey.com/spatialkey-sample-csv-data.

Vertical-tabular format is good for a quick look at CSV data layout -- seeing what columns you have to work with:

GENRST_RUN_COMMAND
head -n 2 data/flins.csv | mlr --icsv --oxtab cat
GENRST_EOF

A few simple queries:

GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint count-distinct -f county | head
GENRST_EOF

GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint count-distinct -f construction,line
GENRST_EOF

Categorization of total insured value:

GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint stats1 -a min,mean,max -f tiv_2012
GENRST_EOF

GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint \
  stats1 -a min,mean,max -f tiv_2012 -g construction,line
GENRST_EOF

GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --oxtab \
  stats1 -a p0,p10,p50,p90,p95,p99,p100 -f hu_site_deductible
GENRST_EOF

GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint \
  stats1 -a p95,p99,p100 -f hu_site_deductible -g county \
  then sort -f county | head
GENRST_EOF

GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --oxtab \
  stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
GENRST_EOF

GENRST_RUN_COMMAND
mlr --from data/flins.csv --icsv --opprint \
  stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county
GENRST_EOF

Color/shape data
----------------------------------------------------------------

The `colored-shapes.dkvp <https://github.com/johnkerl/miller/blob/master/docs/data/colored-shapes.dkvp>`_ file is some sample data produced by the `mkdat2 <data/mkdat2>`_ script. The idea is:

* Produce some data with known distributions and correlations, and verify that Miller recovers those properties empirically.
* Each record is labeled with one of a few colors and one of a few shapes.
* The ``flag`` field is 0 or 1, with probability dependent on color.
* The ``u`` field is plain uniform on the unit interval.
* The ``v`` field is the same, except tightly correlated with ``u`` for red circles.
* The ``w`` field is autocorrelated for each color/shape pair.
* The ``x`` field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.

Peek at the data:

GENRST_RUN_COMMAND
wc -l data/colored-shapes.dkvp
GENRST_EOF

GENRST_RUN_COMMAND
head -n 6 data/colored-shapes.dkvp | mlr --opprint cat
GENRST_EOF

Look at uncategorized stats (using `creach <https://github.com/johnkerl/scripts/blob/master/fundam/creach>`_ for spacing).

Here it looks reasonable that ``u`` is unit-uniform; something's up with ``v`` but we can't yet see what:

GENRST_RUN_COMMAND
mlr --oxtab stats1 -a min,mean,max -f flag,u,v data/colored-shapes.dkvp | creach 3
GENRST_EOF

The histogram shows the different distribution of 0/1 flags:

GENRST_RUN_COMMAND
mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp
GENRST_EOF

Look at univariate stats by color and shape. In particular, color-dependent flag probabilities pop out, aligning with their original Bernoulli probabilities from the data-generator script:

GENRST_RUN_COMMAND
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g color \
  then sort -f color \
  data/colored-shapes.dkvp
GENRST_EOF

GENRST_RUN_COMMAND
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g shape \
  then sort -f shape \
  data/colored-shapes.dkvp
GENRST_EOF

Look at bivariate stats by color and shape. In particular, ``u,v`` pairwise correlation for red circles pops out:

GENRST_RUN_COMMAND
mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp
GENRST_EOF

GENRST_RUN_COMMAND
mlr --opprint --right \
  stats2 -a corr -f u,v,w,x -g color,shape then sort -nr u_v_corr \
  data/colored-shapes.dkvp
GENRST_EOF
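The bullet-point properties above can be sketched in a few lines of Python. This is a hypothetical stand-in for the actual ``mkdat2`` script, with per-color Bernoulli flag probabilities eyeballed from the per-color ``flag_mean`` outputs, a fixed noise width for the red-circle ``u``/``v`` coupling, and the autocorrelated ``w`` field omitted:

```python
import random

# Approximate per-color flag probabilities, read off the flag_mean column
# of the stats1-by-color output (assumption, not the generator's exact values):
FLAG_P = {'red': 0.3, 'green': 0.2, 'blue': 0.6,
          'orange': 0.5, 'purple': 0.1, 'yellow': 0.9}

def make_record(rng, color, shape):
    # u is plain uniform on the unit interval.
    u = rng.random()
    # v tracks u tightly for red circles, and is independent uniform otherwise.
    if color == 'red' and shape == 'circle':
        v = u + rng.gauss(0, 0.02)
    else:
        v = rng.random()
    # flag is Bernoulli with color-dependent probability.
    flag = 1 if rng.random() < FLAG_P[color] else 0
    # x is boring Gaussian with mean 5 and stddev about 1.2.
    x = rng.gauss(5, 1.2)
    return {'color': color, 'shape': shape, 'flag': flag, 'u': u, 'v': v, 'x': x}
```

Feeding such records back through ``stats1``/``stats2`` should recover the stated distributions, which is the point of the exercise.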
@ -1,126 +0,0 @@
..
   PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

Dates and times
===============

How can I filter by date?
----------------------------------------------------------------

Given input like

.. code-block:: none
   :emphasize-lines: 1-1

   cat dates.csv
   date,event
   2018-02-03,initialization
   2018-03-07,discovery
   2018-02-03,allocation

we can use ``strptime`` to parse the date field into seconds-since-epoch and then do numeric comparisons. Simply match your input dataset's date-formatting to the :ref:`reference-dsl-strptime` format-string. For example:

.. code-block:: none
   :emphasize-lines: 1-3

   mlr --csv filter '
     strptime($date, "%Y-%m-%d") > strptime("2018-03-03", "%Y-%m-%d")
   ' dates.csv
   date,event
   2018-03-07,discovery

Caveat: localtime-handling in timezones with DST is still a work in progress; see https://github.com/johnkerl/miller/issues/170. See also https://github.com/johnkerl/miller/issues/208 -- thanks @aborruso!

Finding missing dates
----------------------------------------------------------------

Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates:

.. code-block:: none
   :emphasize-lines: 1-1

   head -n 10 data/miss-date.csv
   date,qoh
   2012-03-05,10055
   2012-03-06,10486
   2012-03-07,10430
   2012-03-08,10674
   2012-03-09,10880
   2012-03-10,10718
   2012-03-11,10795
   2012-03-12,11043
   2012-03-13,11177

.. code-block:: none
   :emphasize-lines: 1-1

   wc -l data/miss-date.csv
   1372 data/miss-date.csv

Since there are 1372 lines in the data file, some automation is called for. To find the missing dates, you can convert the dates to seconds since the epoch using ``strptime``, then compute adjacent differences (the ``cat -n`` simply inserts record-counters):

.. code-block:: none
   :emphasize-lines: 1-5

   mlr --from data/miss-date.csv --icsv \
     cat -n \
     then put '$datestamp = strptime($date, "%Y-%m-%d")' \
     then step -a delta -f datestamp \
   | head
   n=1,date=2012-03-05,qoh=10055,datestamp=1330905600,datestamp_delta=0
   n=2,date=2012-03-06,qoh=10486,datestamp=1330992000,datestamp_delta=86400
   n=3,date=2012-03-07,qoh=10430,datestamp=1331078400,datestamp_delta=86400
   n=4,date=2012-03-08,qoh=10674,datestamp=1331164800,datestamp_delta=86400
   n=5,date=2012-03-09,qoh=10880,datestamp=1331251200,datestamp_delta=86400
   n=6,date=2012-03-10,qoh=10718,datestamp=1331337600,datestamp_delta=86400
   n=7,date=2012-03-11,qoh=10795,datestamp=1331424000,datestamp_delta=86400
   n=8,date=2012-03-12,qoh=11043,datestamp=1331510400,datestamp_delta=86400
   n=9,date=2012-03-13,qoh=11177,datestamp=1331596800,datestamp_delta=86400
   n=10,date=2012-03-14,qoh=11498,datestamp=1331683200,datestamp_delta=86400

Then, filter for adjacent difference not being 86400 (the number of seconds in a day):

.. code-block:: none
   :emphasize-lines: 1-5

   mlr --from data/miss-date.csv --icsv \
     cat -n \
     then put '$datestamp = strptime($date, "%Y-%m-%d")' \
     then step -a delta -f datestamp \
     then filter '$datestamp_delta != 86400 && $n != 1'
   n=774,date=2014-04-19,qoh=130140,datestamp=1397865600,datestamp_delta=259200
   n=1119,date=2015-03-31,qoh=181625,datestamp=1427760000,datestamp_delta=172800

Given this, it's now easy to see where the gaps are:

.. code-block:: none
   :emphasize-lines: 1-1

   mlr cat -n then filter '$n >= 770 && $n <= 780' data/miss-date.csv
   n=770,1=2014-04-12,2=129435
   n=771,1=2014-04-13,2=129868
   n=772,1=2014-04-14,2=129797
   n=773,1=2014-04-15,2=129919
   n=774,1=2014-04-16,2=130181
   n=775,1=2014-04-19,2=130140
   n=776,1=2014-04-20,2=130271
   n=777,1=2014-04-21,2=130368
   n=778,1=2014-04-22,2=130368
   n=779,1=2014-04-23,2=130849
   n=780,1=2014-04-24,2=131026

.. code-block:: none
   :emphasize-lines: 1-1

   mlr cat -n then filter '$n >= 1115 && $n <= 1125' data/miss-date.csv
   n=1115,1=2015-03-25,2=181006
   n=1116,1=2015-03-26,2=180995
   n=1117,1=2015-03-27,2=181043
   n=1118,1=2015-03-28,2=181112
   n=1119,1=2015-03-29,2=181306
   n=1120,1=2015-03-31,2=181625
   n=1121,1=2015-04-01,2=181494
   n=1122,1=2015-04-02,2=181718
   n=1123,1=2015-04-03,2=181835
   n=1124,1=2015-04-04,2=182104
   n=1125,1=2015-04-05,2=182528
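The same strptime-then-compare logic can be expressed outside the Miller DSL. Here is a minimal Python sketch of the date filter above (the helper name is hypothetical; dates are pinned to UTC to sidestep the DST caveat):

```python
from datetime import datetime, timezone

def to_epoch(datestr, fmt="%Y-%m-%d"):
    # Parse a date string into seconds since the epoch, pinned to UTC --
    # the Python analogue of the DSL's strptime().
    return datetime.strptime(datestr, fmt).replace(tzinfo=timezone.utc).timestamp()

# The same filter as the mlr command above, over the dates.csv rows:
rows = [
    ("2018-02-03", "initialization"),
    ("2018-03-07", "discovery"),
    ("2018-02-03", "allocation"),
]
kept = [(d, e) for d, e in rows if to_epoch(d) > to_epoch("2018-03-03")]
```

As with the DSL version, the comparison happens on numbers, so any consistent date format works as long as the format string matches the data.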
@ -1,52 +0,0 @@
Dates and times
===============

How can I filter by date?
----------------------------------------------------------------

Given input like

GENRST_RUN_COMMAND
cat dates.csv
GENRST_EOF

we can use ``strptime`` to parse the date field into seconds-since-epoch and then do numeric comparisons. Simply match your input dataset's date-formatting to the :ref:`reference-dsl-strptime` format-string. For example:

GENRST_RUN_COMMAND
mlr --csv filter '
  strptime($date, "%Y-%m-%d") > strptime("2018-03-03", "%Y-%m-%d")
' dates.csv
GENRST_EOF

Caveat: localtime-handling in timezones with DST is still a work in progress; see https://github.com/johnkerl/miller/issues/170. See also https://github.com/johnkerl/miller/issues/208 -- thanks @aborruso!

Finding missing dates
----------------------------------------------------------------

Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates:

GENRST_RUN_COMMAND
head -n 10 data/miss-date.csv
GENRST_EOF

GENRST_RUN_COMMAND
wc -l data/miss-date.csv
GENRST_EOF

Since there are 1372 lines in the data file, some automation is called for. To find the missing dates, you can convert the dates to seconds since the epoch using ``strptime``, then compute adjacent differences (the ``cat -n`` simply inserts record-counters):

GENRST_INCLUDE_AND_RUN_ESCAPED(data/miss-date-1.sh)

Then, filter for adjacent difference not being 86400 (the number of seconds in a day):

GENRST_INCLUDE_AND_RUN_ESCAPED(data/miss-date-2.sh)

Given this, it's now easy to see where the gaps are:

GENRST_RUN_COMMAND
mlr cat -n then filter '$n >= 770 && $n <= 780' data/miss-date.csv
GENRST_EOF

GENRST_RUN_COMMAND
mlr cat -n then filter '$n >= 1115 && $n <= 1125' data/miss-date.csv
GENRST_EOF
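The delta-then-filter pipeline for finding missing dates reduces to a few lines in any language. A Python sketch (the function name is hypothetical; dates are parsed as UTC so every clean adjacent difference is exactly 86400 seconds):

```python
from datetime import datetime, timezone

DAY = 86400  # seconds per day

def missing_gaps(dates, fmt="%Y-%m-%d"):
    # Mirror of the cat-n / step-delta / filter pipeline: report
    # (record number, date, delta-seconds) wherever the adjacent
    # difference is not exactly one day.
    stamps = [
        datetime.strptime(d, fmt).replace(tzinfo=timezone.utc).timestamp()
        for d in dates
    ]
    gaps = []
    for n in range(1, len(stamps)):
        delta = stamps[n] - stamps[n - 1]
        if delta != DAY:
            gaps.append((n + 1, dates[n], int(delta)))
    return gaps
```

Each reported delta of 172800 or 259200 corresponds to one or two skipped days before the named record, matching the ``datestamp_delta`` output above.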
@ -1,253 +0,0 @@
..
   PLEASE DO NOT EDIT DIRECTLY. EDIT THE .rst.in FILE PLEASE.

DKVP I/O examples
======================

DKVP I/O in Python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here are the I/O routines:

.. code-block:: none

   #!/usr/bin/env python

   # ================================================================
   # Example of DKVP I/O using Python.
   #
   # Key point: Use Miller for what it's good at; pass data into/out of tools in
   # other languages to do what they're good at.
   #
   # bash$ python -i dkvp_io.py
   #
   # # READ
   # >>> map = dkvpline2map('x=1,y=2', '=', ',')
   # >>> map
   # OrderedDict([('x', '1'), ('y', '2')])
   #
   # # MODIFY
   # >>> map['z'] = map['x'] + map['y']
   # >>> map
   # OrderedDict([('x', '1'), ('y', '2'), ('z', 3)])
   #
   # # WRITE
   # >>> line = map2dkvpline(map, '=', ',')
   # >>> line
   # 'x=1,y=2,z=3'
   #
   # ================================================================

   import re
   import collections

   # ----------------------------------------------------------------
   # ips and ifs (input pair separator and input field separator) are nominally '=' and ','.
   def dkvpline2map(line, ips, ifs):
       pairs = re.split(ifs, line)
       map = collections.OrderedDict()
       for pair in pairs:
           key, value = re.split(ips, pair, 1)

           # Type inference:
           try:
               value = int(value)
           except:
               try:
                   value = float(value)
               except:
                   pass

           map[key] = value
       return map

   # ----------------------------------------------------------------
   # ops and ofs (output pair separator and output field separator) are nominally '=' and ','.
   def map2dkvpline(map, ops, ofs):
       line = ''
       pairs = []
       for key in map:
           pairs.append(str(key) + ops + str(map[key]))
       return str.join(ofs, pairs)

And here is an example using them:

.. code-block:: none
   :emphasize-lines: 1-1

   cat polyglot-dkvp-io/example.py
   #!/usr/bin/env python

   import sys
   import re
   import copy
   import dkvp_io

   while True:
       # Read the original record:
       line = sys.stdin.readline().strip()
       if line == '':
           break
       map = dkvp_io.dkvpline2map(line, '=', ',')

       # Drop a field:
       map.pop('x')

       # Compute some new fields:
       map['ab'] = map['a'] + map['b']
       map['iy'] = map['i'] + map['y']

       # Add new fields which show type of each already-existing field:
       omap = copy.copy(map) # since otherwise the for-loop will modify what it loops over
       keys = omap.keys()
       for key in keys:
           # Convert "<type 'int'>" to just "int", etc.:
           type_string = str(map[key].__class__)
           type_string = re.sub("<type '", "", type_string) # python2
           type_string = re.sub("<class '", "", type_string) # python3
           type_string = re.sub("'>", "", type_string)
           map['t'+key] = type_string

       # Write the modified record:
       print(dkvp_io.map2dkvpline(map, '=', ','))

Run as-is:

.. code-block:: none
   :emphasize-lines: 1-1

   python polyglot-dkvp-io/example.py < data/small
   a=pan,b=pan,i=1,y=0.7268028627434533,ab=panpan,iy=1.7268028627434533,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float
   a=eks,b=pan,i=2,y=0.5221511083334797,ab=ekspan,iy=2.5221511083334796,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float
   a=wye,b=wye,i=3,y=0.33831852551664776,ab=wyewye,iy=3.3383185255166477,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float
   a=eks,b=wye,i=4,y=0.13418874328430463,ab=ekswye,iy=4.134188743284304,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float
   a=wye,b=pan,i=5,y=0.8636244699032729,ab=wyepan,iy=5.863624469903273,ta=str,tb=str,ti=int,ty=float,tab=str,tiy=float

Run as-is, then pipe to Miller for pretty-printing:

.. code-block:: none
   :emphasize-lines: 1-1

   python polyglot-dkvp-io/example.py < data/small | mlr --opprint cat
   a   b   i y                   ab     iy                 ta  tb  ti  ty    tab tiy
   pan pan 1 0.7268028627434533  panpan 1.7268028627434533 str str int float str float
   eks pan 2 0.5221511083334797  ekspan 2.5221511083334796 str str int float str float
   wye wye 3 0.33831852551664776 wyewye 3.3383185255166477 str str int float str float
   eks wye 4 0.13418874328430463 ekswye 4.134188743284304  str str int float str float
   wye pan 5 0.8636244699032729  wyepan 5.863624469903273  str str int float str float

DKVP I/O in Ruby
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here are the I/O routines:

.. code-block:: none

   #!/usr/bin/env ruby

   # ================================================================
   # Example of DKVP I/O using Ruby.
   #
   # Key point: Use Miller for what it's good at; pass data into/out of tools in
   # other languages to do what they're good at.
   #
   # bash$ irb -I. -r dkvp_io.rb
   #
   # # READ
   # irb(main):001:0> map = dkvpline2map('x=1,y=2', '=', ',')
   # => {"x"=>"1", "y"=>"2"}
   #
   # # MODIFY
   # irb(main):001:0> map['z'] = map['x'] + map['y']
   # => 3
   #
   # # WRITE
   # irb(main):002:0> line = map2dkvpline(map, '=', ',')
   # => "x=1,y=2,z=3"
   #
   # ================================================================

   # ----------------------------------------------------------------
   # ips and ifs (input pair separator and input field separator) are nominally '=' and ','.
   def dkvpline2map(line, ips, ifs)
     map = {}
     line.split(ifs).each do |pair|
       (k, v) = pair.split(ips, 2)

       # Type inference:
       begin
         v = Integer(v)
       rescue ArgumentError
         begin
           v = Float(v)
         rescue ArgumentError
           # Leave as string
         end
       end

       map[k] = v
     end
     map
   end

   # ----------------------------------------------------------------
   # ops and ofs (output pair separator and output field separator) are nominally '=' and ','.
   def map2dkvpline(map, ops, ofs)
     map.collect{|k,v| k.to_s + ops + v.to_s}.join(ofs)
   end

And here is an example using them:

.. code-block:: none
   :emphasize-lines: 1-1

   cat polyglot-dkvp-io/example.rb
   #!/usr/bin/env ruby

   require 'dkvp_io'

   ARGF.each do |line|
     # Read the original record:
     map = dkvpline2map(line.chomp, '=', ',')

     # Drop a field:
     map.delete('x')

     # Compute some new fields:
     map['ab'] = map['a'] + map['b']
     map['iy'] = map['i'] + map['y']

     # Add new fields which show type of each already-existing field:
     keys = map.keys
     keys.each do |key|
       map['t'+key] = map[key].class
     end

     # Write the modified record:
     puts map2dkvpline(map, '=', ',')
   end

Run as-is:

.. code-block:: none
   :emphasize-lines: 1-1

   ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small
   a=pan,b=pan,i=1,y=0.7268028627434533,ab=panpan,iy=1.7268028627434533,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float
   a=eks,b=pan,i=2,y=0.5221511083334797,ab=ekspan,iy=2.5221511083334796,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float
   a=wye,b=wye,i=3,y=0.33831852551664776,ab=wyewye,iy=3.3383185255166477,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float
   a=eks,b=wye,i=4,y=0.13418874328430463,ab=ekswye,iy=4.134188743284304,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float
   a=wye,b=pan,i=5,y=0.8636244699032729,ab=wyepan,iy=5.863624469903273,ta=String,tb=String,ti=Integer,ty=Float,tab=String,tiy=Float

Run as-is, then pipe to Miller for pretty-printing:

.. code-block:: none
   :emphasize-lines: 1-1

   ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small | mlr --opprint cat
   a   b   i y                   ab     iy                 ta     tb     ti      ty    tab    tiy
   pan pan 1 0.7268028627434533  panpan 1.7268028627434533 String String Integer Float String Float
   eks pan 2 0.5221511083334797  ekspan 2.5221511083334796 String String Integer Float String Float
   wye wye 3 0.33831852551664776 wyewye 3.3383185255166477 String String Integer Float String Float
   eks wye 4 0.13418874328430463 ekswye 4.134188743284304  String String Integer Float String Float
   wye pan 5 0.8636244699032729  wyepan 5.863624469903273  String String Integer Float String Float
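The Python routines above round-trip cleanly: parse a DKVP line, get typed values back, and re-serialize to the same line. This sketch restates the ``dkvp_io.py`` logic compactly (using the keyword ``maxsplit`` spelling and plain string joins; it is not the file itself):

```python
import re
import collections

def dkvpline2map(line, ips, ifs):
    # Split on the field separator, then once on the pair separator,
    # with int-then-float type inference -- as in dkvp_io.py above.
    out = collections.OrderedDict()
    for pair in re.split(ifs, line):
        key, value = re.split(ips, pair, maxsplit=1)
        try:
            value = int(value)
        except ValueError:
            try:
                value = float(value)
            except ValueError:
                pass  # leave as string
        out[key] = value
    return out

def map2dkvpline(m, ops, ofs):
    # Re-serialize: key, pair separator, stringified value, field separator.
    return ofs.join(str(k) + ops + str(v) for k, v in m.items())
```

Because values are type-inferred on read, arithmetic like ``m['i'] + m['y']`` in ``example.py`` is numeric rather than string concatenation.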
@ -1,52 +0,0 @@
DKVP I/O examples
======================

DKVP I/O in Python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here are the I/O routines:

GENRST_INCLUDE_ESCAPED(polyglot-dkvp-io/dkvp_io.py)

And here is an example using them:

GENRST_RUN_COMMAND
cat polyglot-dkvp-io/example.py
GENRST_EOF

Run as-is:

GENRST_RUN_COMMAND
python polyglot-dkvp-io/example.py < data/small
GENRST_EOF

Run as-is, then pipe to Miller for pretty-printing:

GENRST_RUN_COMMAND
python polyglot-dkvp-io/example.py < data/small | mlr --opprint cat
GENRST_EOF

DKVP I/O in Ruby
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here are the I/O routines:

GENRST_INCLUDE_ESCAPED(polyglot-dkvp-io/dkvp_io.rb)

And here is an example using them:

GENRST_RUN_COMMAND
cat polyglot-dkvp-io/example.rb
GENRST_EOF

Run as-is:

GENRST_RUN_COMMAND
ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small
GENRST_EOF

Run as-is, then pipe to Miller for pretty-printing:

GENRST_RUN_COMMAND
ruby -I./polyglot-dkvp-io polyglot-dkvp-io/example.rb data/small | mlr --opprint cat
GENRST_EOF
@ -1,7 +0,0 @@

linkify introduction.rst.in? too much too soon?

quicklinks maybe? redundant w/ TOC?

so i didn't want to pop out to R just to compute that 'one last thing'; hence covar/linreg/etc