miller/docs6/10min.rst.in
2021-05-24 22:02:43 -04:00


Miller in 10 minutes
====================
CSV-file examples
^^^^^^^^^^^^^^^^^
Suppose you have this CSV data file::
POKI_RUN_COMMAND{{cat example.csv}}HERE
``mlr cat`` is like ``cat`` -- it passes the data through unmodified::
POKI_RUN_COMMAND{{mlr --csv cat example.csv}}HERE
but it can also do format conversion (here, it pretty-prints the CSV data as an aligned table)::
POKI_RUN_COMMAND{{mlr --icsv --opprint cat example.csv}}HERE
``mlr head`` and ``mlr tail`` count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way::
POKI_RUN_COMMAND{{mlr --csv head -n 4 example.csv}}HERE
::
POKI_RUN_COMMAND{{mlr --csv tail -n 4 example.csv}}HERE
You can sort primarily alphabetically on one field, then secondarily numerically descending on another field::
POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index example.csv}}HERE
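To make the two-key ordering concrete, here is a minimal Python sketch of the same idea. It is not how Miller is implemented; it assumes ``-f`` means ascending lexical order and ``-nr`` means descending numeric order, and the name ``sort_records`` is made up for this illustration. The data is the ``example.csv`` content used throughout this page.

```python
import csv
import io

# Contents of example.csv, as used throughout this page.
EXAMPLE_CSV = """\
color,shape,flag,index,quantity,rate
yellow,triangle,1,11,43.6498,9.8870
red,square,1,15,79.2778,0.0130
red,circle,1,16,13.8103,2.9010
red,square,0,48,77.5542,7.4670
purple,triangle,0,51,81.2290,8.5910
red,square,0,64,77.1991,9.5310
purple,triangle,0,65,80.1405,5.8240
yellow,circle,1,73,63.9785,4.2370
yellow,circle,1,87,63.5058,8.3350
purple,square,0,91,72.3735,8.2430
"""


def sort_records(records):
    """Primary key: shape, ascending lexical. Secondary key: index, descending numeric."""
    return sorted(records, key=lambda r: (r["shape"], -float(r["index"])))


records = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
for r in sort_records(records):
    print(r["shape"], r["index"])
```

The secondary key only matters among records that share the same ``shape``, which is why all the circles come first regardless of their ``index`` values.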
You can use ``cut`` to retain only specified fields, in the same order they appeared in the input data::
POKI_RUN_COMMAND{{mlr --icsv --opprint cut -f flag,shape example.csv}}HERE
You can also use ``cut -o`` to retain only specified fields in your preferred order::
POKI_RUN_COMMAND{{mlr --icsv --opprint cut -o -f flag,shape example.csv}}HERE
You can use ``cut -x`` to omit fields you don't care about::
POKI_RUN_COMMAND{{mlr --icsv --opprint cut -x -f flag,shape example.csv}}HERE
You can use ``filter`` to keep only records you care about::
POKI_RUN_COMMAND{{mlr --icsv --opprint filter '$color == "red"' example.csv}}HERE
::
POKI_RUN_COMMAND{{mlr --icsv --opprint filter '$color == "red" && $flag == 1' example.csv}}HERE
You can use ``put`` to create new fields which are computed from other fields::
POKI_RUN_COMMAND{{mlr --icsv --opprint put '$ratio = $quantity / $rate; $color_shape = $color . "_" . $shape' example.csv}}HERE
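As a rough mental model of what that ``put`` expression does to each record, here is a small Python sketch. It is only an analogy under simple assumptions: the helper name ``add_fields`` is hypothetical, only the first three rows of ``example.csv`` are embedded, and Miller's own number formatting of ``ratio`` may differ from Python's.

```python
import csv
import io

# First three data rows of example.csv.
EXAMPLE_CSV = """\
color,shape,flag,index,quantity,rate
yellow,triangle,1,11,43.6498,9.8870
red,square,1,15,79.2778,0.0130
red,circle,1,16,13.8103,2.9010
"""


def add_fields(record):
    """Return a copy of the record with ratio and color_shape appended at the end."""
    out = dict(record)
    # Numeric division of two existing fields.
    out["ratio"] = float(record["quantity"]) / float(record["rate"])
    # String concatenation, like the dot operator in the put expression.
    out["color_shape"] = record["color"] + "_" + record["shape"]
    return out


records = [add_fields(r) for r in csv.DictReader(io.StringIO(EXAMPLE_CSV))]
for r in records:
    print(r)
```

New fields are appended after the existing ones, so the original field order is preserved.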
Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field by its positional index, counting from 1. Use ``$[[3]]`` to access the name of field 3 or ``$[[[3]]]`` to access the value of field 3. Assigning to the former renames the field; assigning to the latter changes its value::
POKI_RUN_COMMAND{{mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv}}HERE
::
POKI_RUN_COMMAND{{mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv}}HERE
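The difference between the two positional forms can be sketched in Python by treating a record as an ordered mapping. This is an illustration of the semantics described above, not Miller code; the helper names ``rename_field_at`` and ``set_value_at`` are invented here.

```python
def rename_field_at(record, position, new_name):
    """Like assigning to $[[position]]: new field name, same value, same position."""
    items = list(record.items())
    _, value = items[position - 1]  # positions are 1-based
    items[position - 1] = (new_name, value)
    return dict(items)


def set_value_at(record, position, new_value):
    """Like assigning to $[[[position]]]: same field name, new value."""
    items = list(record.items())
    name, _ = items[position - 1]
    items[position - 1] = (name, new_value)
    return dict(items)


# First record of example.csv, trimmed to four fields for brevity.
record = {"color": "yellow", "shape": "triangle", "flag": "1", "index": "11"}
print(rename_field_at(record, 3, "NEW"))
print(set_value_at(record, 3, "NEW"))
```

In the first result the third key becomes ``NEW`` and keeps the value ``1``; in the second the key stays ``flag`` and its value becomes ``NEW``.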
JSON-file examples
^^^^^^^^^^^^^^^^^^
OK, CSV and pretty-print are fine. But Miller can also convert between a few other formats -- let's take a look at JSON output::
POKI_RUN_COMMAND{{mlr --icsv --ojson put '$ratio = $quantity/$rate; $shape = toupper($shape)' example.csv}}HERE
Sorts and stats
^^^^^^^^^^^^^^^
Now suppose you want to sort the data on a given column, *and then* take the top few in that ordering. You can use Miller's ``then`` feature to pipe commands together.
Here are the first three records in that sorted order (by ``shape``, then by descending ``index``)::
POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index then head -n 3 example.csv}}HERE
Lots of Miller commands take a ``-g`` option for group-by: here, ``head -n 1 -g shape`` outputs the first record for each distinct value of the ``shape`` field. Since the records are already sorted by descending ``index`` within each shape, this finds the record with the highest ``index`` value for each distinct ``shape`` value::
POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv}}HERE
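The group-by behavior of ``head`` can be pictured with a short Python sketch: keep a per-group counter and pass a record through only while its group's count is at most ``n``. This is a simplified model, not Miller's implementation; ``head_per_group`` is a hypothetical name, and the input list is the ``shape`` and ``index`` columns of ``example.csv`` already sorted as in the command above.

```python
def head_per_group(records, n, group_field):
    """Keep the first n records seen for each distinct value of group_field."""
    seen = {}
    kept = []
    for r in records:
        key = r[group_field]
        seen[key] = seen.get(key, 0) + 1
        if seen[key] <= n:
            kept.append(r)
    return kept


# (shape, index) pairs from example.csv, sorted by shape then descending index.
pairs = [
    ("circle", 87), ("circle", 73), ("circle", 16),
    ("square", 91), ("square", 64), ("square", 48), ("square", 15),
    ("triangle", 65), ("triangle", 51), ("triangle", 11),
]
sorted_records = [{"shape": s, "index": i} for s, i in pairs]

for r in head_per_group(sorted_records, 1, "shape"):
    print(r["shape"], r["index"])
```

Because the sort puts the largest ``index`` first within each shape, taking one record per group yields the per-shape maximum.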
Statistics can be computed with or without group-by field(s)::
POKI_RUN_COMMAND{{mlr --icsv --opprint --from example.csv stats1 -a count,min,mean,max -f quantity -g shape}}HERE
::
POKI_RUN_COMMAND{{mlr --icsv --opprint --from example.csv stats1 -a count,min,mean,max -f quantity -g shape,color}}HERE
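For readers who want to check the grouped statistics by hand, the following Python sketch computes count, min, mean, and max of ``quantity`` per ``shape``. It is an independent illustration rather than Miller's algorithm; the function name ``stats_by`` is an assumption of this example, and floating-point output formatting will not necessarily match Miller's.

```python
import csv
import io
from collections import defaultdict

# The shape and quantity columns of example.csv.
EXAMPLE_CSV = """\
shape,quantity
triangle,43.6498
square,79.2778
circle,13.8103
square,77.5542
triangle,81.2290
square,77.1991
triangle,80.1405
circle,63.9785
circle,63.5058
square,72.3735
"""


def stats_by(records, value_field, group_field):
    """Group values by group_field, then summarize each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_field]].append(float(r[value_field]))
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "mean": sum(values) / len(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }


records = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
stats = stats_by(records, "quantity", "shape")
for shape, summary in stats.items():
    print(shape, summary)
```

Grouping by two fields, as in ``-g shape,color``, would amount to using the pair of values as the dictionary key.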
If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead::
POKI_RUN_COMMAND{{mlr --icsv --oxtab --from example.csv stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate}}HERE
.. _10min-choices-for-printing-to-files:
Choices for printing to files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.
Sometimes we want to send output to a file instead: just use your shell's redirection, **> outputfilenamegoeshere**, at the end of your command:
::

    % mlr --icsv --opprint cat example.csv > newfile.csv
    # Output goes to the new file;
    # nothing is printed to the screen.

::

    % cat newfile.csv
    color  shape    flag index quantity rate
    yellow triangle 1    11    43.6498  9.8870
    red    square   1    15    79.2778  0.0130
    red    circle   1    16    13.8103  2.9010
    red    square   0    48    77.5542  7.4670
    purple triangle 0    51    81.2290  8.5910
    red    square   0    64    77.1991  9.5310
    purple triangle 0    65    80.1405  5.8240
    yellow circle   1    73    63.9785  4.2370
    yellow circle   1    87    63.5058  8.3350
    purple square   0    91    72.3735  8.2430

Other times we just want our files to be **changed in-place**: just use **mlr -I**::

    % cp example.csv newfile.txt
    % cat newfile.txt
    color,shape,flag,index,quantity,rate
    yellow,triangle,1,11,43.6498,9.8870
    red,square,1,15,79.2778,0.0130
    red,circle,1,16,13.8103,2.9010
    red,square,0,48,77.5542,7.4670
    purple,triangle,0,51,81.2290,8.5910
    red,square,0,64,77.1991,9.5310
    purple,triangle,0,65,80.1405,5.8240
    yellow,circle,1,73,63.9785,4.2370
    yellow,circle,1,87,63.5058,8.3350
    purple,square,0,91,72.3735,8.2430
    % mlr -I --icsv --opprint cat newfile.txt
    % cat newfile.txt
    color  shape    flag index quantity rate
    yellow triangle 1    11    43.6498  9.8870
    red    square   1    15    79.2778  0.0130
    red    circle   1    16    13.8103  2.9010
    red    square   0    48    77.5542  7.4670
    purple triangle 0    51    81.2290  8.5910
    red    square   0    64    77.1991  9.5310
    purple triangle 0    65    80.1405  5.8240
    yellow circle   1    73    63.9785  4.2370
    yellow circle   1    87    63.5058  8.3350
    purple square   0    91    72.3735  8.2430

You can also use ``mlr -I`` to bulk-operate on many files at once, e.g.::

    mlr -I --csv cut -x -f unwanted_column_name *.csv

Since ``mlr -I`` overwrites each file, you may want to copy your original data somewhere else before doing in-place operations.
Lastly, using ``tee`` within ``put``, you can split your input data into separate files, one per distinct value of one or more fields::
POKI_RUN_COMMAND{{mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'}}HERE
::
POKI_RUN_COMMAND{{cat circle.csv}}HERE
::
POKI_RUN_COMMAND{{cat square.csv}}HERE
::
POKI_RUN_COMMAND{{cat triangle.csv}}HERE
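The splitting step above can be modeled in Python as "open one output file per distinct ``shape`` value and route each record to its file". This is a sketch under simple assumptions, not a description of how Miller's ``tee`` works internally; ``split_by_field`` is a made-up helper, and it writes into a temporary directory rather than the current one.

```python
import csv
import io
import os
import tempfile

# Contents of example.csv, as used throughout this page.
EXAMPLE_CSV = """\
color,shape,flag,index,quantity,rate
yellow,triangle,1,11,43.6498,9.8870
red,square,1,15,79.2778,0.0130
red,circle,1,16,13.8103,2.9010
red,square,0,48,77.5542,7.4670
purple,triangle,0,51,81.2290,8.5910
red,square,0,64,77.1991,9.5310
purple,triangle,0,65,80.1405,5.8240
yellow,circle,1,73,63.9785,4.2370
yellow,circle,1,87,63.5058,8.3350
purple,square,0,91,72.3735,8.2430
"""


def split_by_field(records, field, out_dir):
    """Write one CSV file per distinct value of field; return the sorted file paths."""
    handles = {}
    writers = {}
    for r in records:
        key = r[field]
        if key not in writers:
            # First record for this value: create <value>.csv and write the header.
            path = os.path.join(out_dir, key + ".csv")
            handles[key] = open(path, "w", newline="")
            writers[key] = csv.DictWriter(
                handles[key], fieldnames=list(r.keys()), lineterminator="\n"
            )
            writers[key].writeheader()
        writers[key].writerow(r)
    for handle in handles.values():
        handle.close()
    return sorted(handle.name for handle in handles.values())


records = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
out_dir = tempfile.mkdtemp()
paths = split_by_field(records, "shape", out_dir)
print([os.path.basename(p) for p in paths])
```

Each output file carries its own header line, so every piece is a complete CSV file in its own right.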
Other-format examples
^^^^^^^^^^^^^^^^^^^^^
What's a CSV file, really? It's an array of rows, or *records*, each being a list of key-value pairs, or *fields*: for CSV it so happens that all the keys are shared in the header line, while the values vary from one data line to the next.
For example, if you have::

    shape,flag,index
    circle,1,24
    square,0,36

then that's a way of saying::

    shape=circle,flag=1,index=24
    shape=square,flag=0,index=36

Data written this way are called **DKVP**, for *delimited key-value pairs*.
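The correspondence between a CSV file and its DKVP form is mechanical, as this small Python sketch shows: pair each value with the key from the header line. The function name ``csv_to_dkvp`` is hypothetical, and the sketch ignores quoting and escaping details that a real converter would need to handle.

```python
import csv
import io


def csv_to_dkvp(text):
    """Turn CSV text into a list of DKVP lines, one per record."""
    reader = csv.DictReader(io.StringIO(text))
    return [",".join(f"{key}={value}" for key, value in row.items()) for row in reader]


csv_text = "shape,flag,index\ncircle,1,24\nsquare,0,36\n"
for line in csv_to_dkvp(csv_text):
    print(line)
# prints:
# shape=circle,flag=1,index=24
# shape=square,flag=0,index=36
```

In DKVP every line carries its own keys, so records with different sets of fields can sit side by side, at the cost of repeating the key names.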
We've also already seen other ways to write the same data::

    CSV                             PPRINT                JSON
    shape,flag,index                shape  flag index     [
    circle,1,24                     circle 1    24          {
    square,0,36                     square 0    36            "shape": "circle",
                                                              "flag": 1,
                                                              "index": 24
                                                            },
    DKVP                            XTAB                    {
    shape=circle,flag=1,index=24    shape circle              "shape": "square",
    shape=square,flag=0,index=36    flag  1                   "flag": 0,
                                    index 24                  "index": 36
                                                            }
                                    shape square          ]
                                    flag  0
                                    index 36

Anything we can do with CSV input data, we can do with input data in any other format. We can also read from one format, do any record-processing, and write to the same format as the input or to a different one.