miller/docs6/10min.rst.in

Miller in 10 minutes
====================

CSV-file examples
^^^^^^^^^^^^^^^^^

Suppose you have this CSV data file::

POKI_RUN_COMMAND{{cat example.csv}}HERE

``mlr cat`` is like cat -- it passes the data through unmodified::

POKI_RUN_COMMAND{{mlr --csv cat example.csv}}HERE

but it can also do format conversion (here, you can pretty-print in tabular format)::

POKI_RUN_COMMAND{{mlr --icsv --opprint cat example.csv}}HERE

``mlr head`` and ``mlr tail`` count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way::

POKI_RUN_COMMAND{{mlr --csv head -n 4 example.csv}}HERE

::

POKI_RUN_COMMAND{{mlr --csv tail -n 4 example.csv}}HERE

You can sort primarily alphabetically on one field, then secondarily numerically descending on another field::

POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index example.csv}}HERE

You can use ``cut`` to retain only specified fields, in the same order they appeared in the input data::

POKI_RUN_COMMAND{{mlr --icsv --opprint cut -f flag,shape example.csv}}HERE

You can also use ``cut -o`` to retain only specified fields in your preferred order::

POKI_RUN_COMMAND{{mlr --icsv --opprint cut -o -f flag,shape example.csv}}HERE

You can use ``cut -x`` to omit fields you don't care about::

POKI_RUN_COMMAND{{mlr --icsv --opprint cut -x -f flag,shape example.csv}}HERE

You can use ``filter`` to keep only records you care about::

POKI_RUN_COMMAND{{mlr --icsv --opprint filter '$color == "red"' example.csv}}HERE

::

POKI_RUN_COMMAND{{mlr --icsv --opprint filter '$color == "red" && $flag == 1' example.csv}}HERE

You can use ``put`` to create new fields which are computed from other fields::

POKI_RUN_COMMAND{{mlr --icsv --opprint put '$ratio = $quantity / $rate; $color_shape = $color . "_" . $shape' example.csv}}HERE

Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use ``$[[3]]`` to access the name of field 3 or ``$[[[3]]]`` to access the value of field 3::

POKI_RUN_COMMAND{{mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv}}HERE

::

POKI_RUN_COMMAND{{mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv}}HERE

JSON-file examples
^^^^^^^^^^^^^^^^^^

OK, CSV and pretty-print are fine. But Miller can also convert between a few other formats -- let's take a look at JSON output::

POKI_RUN_COMMAND{{mlr --icsv --ojson put '$ratio = $quantity/$rate; $shape = toupper($shape)' example.csv}}HERE

Sorts and stats
^^^^^^^^^^^^^^^

Now suppose you want to sort the data on a given column, *and then* take the top few in that ordering. You can use Miller's ``then`` feature to pipe commands together.

Here are the records with the top three ``index`` values::

POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index then head -n 3 example.csv}}HERE

Lots of Miller commands take a ``-g`` option for group-by: here, ``head -n 1 -g shape`` outputs the first record for each distinct value of the ``shape`` field. This means we're finding the record with highest ``index`` field for each distinct ``shape`` field::

POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv}}HERE

Statistics can be computed with or without group-by field(s)::

POKI_RUN_COMMAND{{mlr --icsv --opprint --from example.csv stats1 -a count,min,mean,max -f quantity -g shape}}HERE

::

POKI_RUN_COMMAND{{mlr --icsv --opprint --from example.csv stats1 -a count,min,mean,max -f quantity -g shape,color}}HERE

If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead::

POKI_RUN_COMMAND{{mlr --icsv --oxtab --from example.csv stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate}}HERE

.. _10min-choices-for-printing-to-files:

Choices for printing to files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.

Sometimes we want to print output to another file: just use **> outputfilenamegoeshere** at the end of your command:

::

    % mlr --icsv --opprint cat example.csv > newfile.csv
    # Output goes to the new file;
    # nothing is printed to the screen.

::

    % cat newfile.csv
    color  shape    flag index quantity rate
    yellow triangle 1    11    43.6498  9.8870
    red    square   1    15    79.2778  0.0130
    red    circle   1    16    13.8103  2.9010
    red    square   0    48    77.5542  7.4670
    purple triangle 0    51    81.2290  8.5910
    red    square   0    64    77.1991  9.5310
    purple triangle 0    65    80.1405  5.8240
    yellow circle   1    73    63.9785  4.2370
    yellow circle   1    87    63.5058  8.3350
    purple square   0    91    72.3735  8.2430

Other times we just want our files to be **changed in-place**: just use **mlr -I**::

    % cp example.csv newfile.txt

    % cat newfile.txt
    color,shape,flag,index,quantity,rate
    yellow,triangle,1,11,43.6498,9.8870
    red,square,1,15,79.2778,0.0130
    red,circle,1,16,13.8103,2.9010
    red,square,0,48,77.5542,7.4670
    purple,triangle,0,51,81.2290,8.5910
    red,square,0,64,77.1991,9.5310
    purple,triangle,0,65,80.1405,5.8240
    yellow,circle,1,73,63.9785,4.2370
    yellow,circle,1,87,63.5058,8.3350
    purple,square,0,91,72.3735,8.2430

    % mlr -I --icsv --opprint cat newfile.txt

    % cat newfile.txt
    color  shape    flag index quantity rate
    yellow triangle 1    11    43.6498  9.8870
    red    square   1    15    79.2778  0.0130
    red    circle   1    16    13.8103  2.9010
    red    square   0    48    77.5542  7.4670
    purple triangle 0    51    81.2290  8.5910
    red    square   0    64    77.1991  9.5310
    purple triangle 0    65    80.1405  5.8240
    yellow circle   1    73    63.9785  4.2370
    yellow circle   1    87    63.5058  8.3350
    purple square   0    91    72.3735  8.2430

Also using ``mlr -I`` you can bulk-operate on lots of files: e.g.::

    mlr -I --csv cut -x -f unwanted_column_name *.csv

If you like, you can first copy off your original data somewhere else, before doing in-place operations.

Lastly, using ``tee`` within ``put``, you can split your input data into separate files per one or more field names::

POKI_RUN_COMMAND{{mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'}}HERE

::

POKI_RUN_COMMAND{{cat circle.csv}}HERE

::

POKI_RUN_COMMAND{{cat square.csv}}HERE

::

POKI_RUN_COMMAND{{cat triangle.csv}}HERE

Other-format examples
^^^^^^^^^^^^^^^^^^^^^

What's a CSV file, really? It's an array of rows, or *records*, each being a list of key-value pairs, or *fields*: for CSV it so happens that all the keys are shared in the header line and the values vary data line by data line.

For example, if you have::

    shape,flag,index
    circle,1,24
    square,0,36

then that's a way of saying::

    shape=circle,flag=1,index=24
    shape=square,flag=0,index=36

Data written this way are called **DKVP**, for *delimited key-value pairs*.

We've also already seen other ways to write the same data::

    CSV                               PPRINT                 JSON
    shape,flag,index                  shape  flag index      [
    circle,1,24                       circle 1    24           {
    square,0,36                       square 0    36             "shape": "circle",
                                                                 "flag": 1,
                                                                 "index": 24
                                                               },
    DKVP                              XTAB                     {
    shape=circle,flag=1,index=24      shape circle               "shape": "square",
    shape=square,flag=0,index=36      flag  1                    "flag": 0,
                                      index 24                   "index": 36
                                                               }
                                      shape square           ]
                                      flag  0
                                      index 36

Anything we can do with CSV input data, we can do with any other format input data.  And you can read from one format, do any record-processing, and output to the same format as the input, or to a different output format.