* Accept more passing emit cases * Port docs from sphinx to mkdocs * iterating * rephrase internal-link syntax using mkdocs * iterating
20 KiB
Miller in 10 minutes
Obtaining Miller
You can install Miller for various platforms as follows:
- Linux:
yum install millerorapt-get install millerdepending on your flavor of Linux - MacOS:
brew install millerorport install millerdepending on your preference of [Homebrew](https://brew.sh>_ orMacPorts <https://macports.org). - Windows:
choco install millerusing Chocolatey. - You can get latest builds for Linux, MacOS, and Windows by visiting https://github.com/johnkerl/miller/actions, selecting the latest build, and clicking Artifacts. (These are retained for 5 days after each commit.)
- See also the build page if you prefer -- in particular, if your platform's package manager doesn't have the latest release.
As a first check, you should be able to run mlr --version at your system's command prompt and see something like the following:
mlr --version Miller v6.0.0-dev
As a second check, given example.csv you should be able to do
mlr --csv cat example.csv color,shape,flag,index,quantity,rate yellow,triangle,true,11,43.6498,9.8870 red,square,true,15,79.2778,0.0130 red,circle,true,16,13.8103,2.9010 red,square,false,48,77.5542,7.4670 purple,triangle,false,51,81.2290,8.5910 red,square,false,64,77.1991,9.5310 purple,triangle,false,65,80.1405,5.8240 yellow,circle,true,73,63.9785,4.2370 yellow,circle,true,87,63.5058,8.3350 purple,square,false,91,72.3735,8.2430
mlr --icsv --opprint cat example.csv color shape flag index quantity rate yellow triangle true 11 43.6498 9.8870 red square true 15 79.2778 0.0130 red circle true 16 13.8103 2.9010 red square false 48 77.5542 7.4670 purple triangle false 51 81.2290 8.5910 red square false 64 77.1991 9.5310 purple triangle false 65 80.1405 5.8240 yellow circle true 73 63.9785 4.2370 yellow circle true 87 63.5058 8.3350 purple square false 91 72.3735 8.2430
If you run into issues on these checks, please check out the resources on the community page for help.
Miller verbs
Let's take a quick look at some of the most useful Miller verbs -- file-format-aware, name-index-empowered equivalents of standard system commands.
mlr cat is like system cat (or type on Windows) -- it passes the data through unmodified:
mlr --csv cat example.csv color,shape,flag,index,quantity,rate yellow,triangle,true,11,43.6498,9.8870 red,square,true,15,79.2778,0.0130 red,circle,true,16,13.8103,2.9010 red,square,false,48,77.5542,7.4670 purple,triangle,false,51,81.2290,8.5910 red,square,false,64,77.1991,9.5310 purple,triangle,false,65,80.1405,5.8240 yellow,circle,true,73,63.9785,4.2370 yellow,circle,true,87,63.5058,8.3350 purple,square,false,91,72.3735,8.2430
But mlr cat can also do format conversion -- for example, you can pretty-print in tabular format:
mlr --icsv --opprint cat example.csv color shape flag index quantity rate yellow triangle true 11 43.6498 9.8870 red square true 15 79.2778 0.0130 red circle true 16 13.8103 2.9010 red square false 48 77.5542 7.4670 purple triangle false 51 81.2290 8.5910 red square false 64 77.1991 9.5310 purple triangle false 65 80.1405 5.8240 yellow circle true 73 63.9785 4.2370 yellow circle true 87 63.5058 8.3350 purple square false 91 72.3735 8.2430
mlr head and mlr tail count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way:
mlr --csv head -n 4 example.csv color,shape,flag,index,quantity,rate yellow,triangle,true,11,43.6498,9.8870 red,square,true,15,79.2778,0.0130 red,circle,true,16,13.8103,2.9010 red,square,false,48,77.5542,7.4670
mlr --csv tail -n 4 example.csv color,shape,flag,index,quantity,rate purple,triangle,false,65,80.1405,5.8240 yellow,circle,true,73,63.9785,4.2370 yellow,circle,true,87,63.5058,8.3350 purple,square,false,91,72.3735,8.2430
mlr --icsv --ojson tail -n 2 example.csv
{
"color": "yellow",
"shape": "circle",
"flag": true,
"index": 87,
"quantity": 63.5058,
"rate": 8.3350
}
{
"color": "purple",
"shape": "square",
"flag": false,
"index": 91,
"quantity": 72.3735,
"rate": 8.2430
}
You can sort on a single field:
mlr --icsv --opprint sort -f shape example.csv color shape flag index quantity rate red circle true 16 13.8103 2.9010 yellow circle true 73 63.9785 4.2370 yellow circle true 87 63.5058 8.3350 red square true 15 79.2778 0.0130 red square false 48 77.5542 7.4670 red square false 64 77.1991 9.5310 purple square false 91 72.3735 8.2430 yellow triangle true 11 43.6498 9.8870 purple triangle false 51 81.2290 8.5910 purple triangle false 65 80.1405 5.8240
Or, you can sort primarily alphabetically on one field, then secondarily numerically descending on another field, and so on:
mlr --icsv --opprint sort -f shape -nr index example.csv color shape flag index quantity rate yellow circle true 87 63.5058 8.3350 yellow circle true 73 63.9785 4.2370 red circle true 16 13.8103 2.9010 purple square false 91 72.3735 8.2430 red square false 64 77.1991 9.5310 red square false 48 77.5542 7.4670 red square true 15 79.2778 0.0130 purple triangle false 65 80.1405 5.8240 purple triangle false 51 81.2290 8.5910 yellow triangle true 11 43.6498 9.8870
If there are fields you don't want to see in your data, you can use cut to keep only the ones you want, in the same order they appeared in the input data:
mlr --icsv --opprint cut -f flag,shape example.csv shape flag triangle true square true circle true square false triangle false square false triangle false circle true circle true square false
You can also use cut -o to keep specified fields, but in your preferred order:
mlr --icsv --opprint cut -o -f flag,shape example.csv flag shape true triangle true square true circle false square false triangle false square false triangle true circle true circle false square
You can use cut -x to omit fields you don't care about:
mlr --icsv --opprint cut -x -f flag,shape example.csv color index quantity rate yellow 11 43.6498 9.8870 red 15 79.2778 0.0130 red 16 13.8103 2.9010 red 48 77.5542 7.4670 purple 51 81.2290 8.5910 red 64 77.1991 9.5310 purple 65 80.1405 5.8240 yellow 73 63.9785 4.2370 yellow 87 63.5058 8.3350 purple 91 72.3735 8.2430
You can use filter to keep only records you care about:
mlr --icsv --opprint filter '$color == "red"' example.csv color shape flag index quantity rate red square true 15 79.2778 0.0130 red circle true 16 13.8103 2.9010 red square false 48 77.5542 7.4670 red square false 64 77.1991 9.5310
mlr --icsv --opprint filter '$color == "red" && $flag == true' example.csv color shape flag index quantity rate red square true 15 79.2778 0.0130 red circle true 16 13.8103 2.9010
You can use put to create new fields which are computed from other fields:
mlr --icsv --opprint put ' $ratio = $quantity / $rate; $color_shape = $color . "_" . $shape ' example.csv color shape flag index quantity rate ratio color_shape yellow triangle true 11 43.6498 9.8870 4.414868008496004 yellow_triangle red square true 15 79.2778 0.0130 6098.292307692308 red_square red circle true 16 13.8103 2.9010 4.760530851430541 red_circle red square false 48 77.5542 7.4670 10.386259541984733 red_square purple triangle false 51 81.2290 8.5910 9.455127458968688 purple_triangle red square false 64 77.1991 9.5310 8.099790158430384 red_square purple triangle false 65 80.1405 5.8240 13.760388049450551 purple_triangle yellow circle true 73 63.9785 4.2370 15.09995279679018 yellow_circle yellow circle true 87 63.5058 8.3350 7.619172165566886 yellow_circle purple square false 91 72.3735 8.2430 8.779995147397793 purple_square
Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use $[[3]] to access the name of field 3 or $[[[3]]] to access the value of field 3:
mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv color shape NEW index quantity rate yellow triangle true 11 43.6498 9.8870 red square true 15 79.2778 0.0130 red circle true 16 13.8103 2.9010 red square false 48 77.5542 7.4670 purple triangle false 51 81.2290 8.5910 red square false 64 77.1991 9.5310 purple triangle false 65 80.1405 5.8240 yellow circle true 73 63.9785 4.2370 yellow circle true 87 63.5058 8.3350 purple square false 91 72.3735 8.2430
mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv color shape flag index quantity rate yellow triangle NEW 11 43.6498 9.8870 red square NEW 15 79.2778 0.0130 red circle NEW 16 13.8103 2.9010 red square NEW 48 77.5542 7.4670 purple triangle NEW 51 81.2290 8.5910 red square NEW 64 77.1991 9.5310 purple triangle NEW 65 80.1405 5.8240 yellow circle NEW 73 63.9785 4.2370 yellow circle NEW 87 63.5058 8.3350 purple square NEW 91 72.3735 8.2430
You can find the full list of verbs at the Verbs Reference page.
Multiple input files
Miller takes all the files from the command line as an input stream. But it's format-aware, so it doesn't repeat CSV header lines. For example, with input files [data/a.csv](data/a.csv and data/b.csv, the system cat command will repeat header lines:
cat data/a.csv a,b,c 1,2,3 4,5,6
cat data/b.csv a,b,c 7,8,9
cat data/a.csv data/b.csv a,b,c 1,2,3 4,5,6 a,b,c 7,8,9
However, mlr cat will not:
mlr --csv cat data/a.csv data/b.csv a,b,c 1,2,3 4,5,6 7,8,9
Chaining verbs together
Often we want to chain queries together -- for example, sorting by a field and taking the top few values. We can do this using pipes:
mlr --csv sort -nr index example.csv | mlr --icsv --opprint head -n 3 color shape flag index quantity rate purple square false 91 72.3735 8.2430 yellow circle true 87 63.5058 8.3350 yellow circle true 73 63.9785 4.2370
This works fine -- but Miller also lets you chain verbs together using the word then. Think of this as a Miller-internal pipe that lets you use fewer keystrokes:
mlr --icsv --opprint sort -nr index then head -n 3 example.csv color shape flag index quantity rate purple square false 91 72.3735 8.2430 yellow circle true 87 63.5058 8.3350 yellow circle true 73 63.9785 4.2370
As another convenience, you can put the filename first using --from. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:
mlr --icsv --opprint --from example.csv sort -nr index then head -n 3 color shape flag index quantity rate purple square false 91 72.3735 8.2430 yellow circle true 87 63.5058 8.3350 yellow circle true 73 63.9785 4.2370
mlr --icsv --opprint --from example.csv \ sort -nr index \ then head -n 3 \ then cut -f shape,quantity shape quantity square 72.3735 circle 63.5058 circle 63.9785
Sorts and stats
Now suppose you want to sort the data on a given column, and then take the top few in that ordering. You can use Miller's then feature to pipe commands together.
Here are the records with the top three index values:
mlr --icsv --opprint sort -nr index then head -n 3 example.csv color shape flag index quantity rate purple square false 91 72.3735 8.2430 yellow circle true 87 63.5058 8.3350 yellow circle true 73 63.9785 4.2370
Lots of Miller commands take a -g option for group-by: here, head -n 1 -g shape outputs the first record for each distinct value of the shape field. This means we're finding the record with highest index field for each distinct shape field:
mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv color shape flag index quantity rate yellow circle true 87 63.5058 8.3350 purple square false 91 72.3735 8.2430 purple triangle false 65 80.1405 5.8240
Statistics can be computed with or without group-by field(s):
mlr --icsv --opprint --from example.csv \ stats1 -a count,min,mean,max -f quantity -g shape shape quantity_count quantity_min quantity_mean quantity_max triangle 3 43.6498 68.33976666666666 81.229 square 4 72.3735 76.60114999999999 79.2778 circle 3 13.8103 47.0982 63.9785
mlr --icsv --opprint --from example.csv \ stats1 -a count,min,mean,max -f quantity -g shape,color shape color quantity_count quantity_min quantity_mean quantity_max triangle yellow 1 43.6498 43.6498 43.6498 square red 3 77.1991 78.01036666666666 79.2778 circle red 1 13.8103 13.8103 13.8103 triangle purple 2 80.1405 80.68475000000001 81.229 circle yellow 2 63.5058 63.742149999999995 63.9785 square purple 1 72.3735 72.3735 72.3735
If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:
mlr --icsv --oxtab --from example.csv \ stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate rate_p0 0.0130 rate_p10 2.9010 rate_p25 4.2370 rate_p50 8.2430 rate_p75 8.5910 rate_p90 9.8870 rate_p99 9.8870 rate_p100 9.8870
File formats and format conversion
Miller supports the following formats:
- CSV (comma-separared values)
- TSV (tab-separated values)
- JSON (JavaScript Object Notation)
- PPRINT (pretty-printed tabular)
- XTAB (vertical-tabular or sideways-tabular)
- NIDX (numerically indexed, label-free, with implicit labels
"1","2", etc.) - DKVP (delimited key-value pairs).
What's a CSV file, really? It's an array of rows, or records, each being a list of key-value pairs, or fields: for CSV it so happens that all the keys are shared in the header line and the values vary from one data line to another.
For example, if you have:
shape,flag,index circle,1,24 square,0,36
then that's a way of saying:
shape=circle,flag=1,index=24 shape=square,flag=0,index=36
Other ways to write the same data:
CSV PPRINT
shape,flag,index shape flag index
circle,1,24 circle 1 24
square,0,36 square 0 36
JSON XTAB
{ shape circle
"shape": "circle", flag 1
"flag": 1, index 24
"index": 24 .
} shape square
{ flag 0
"shape": "square", index 36
"flag": 0,
"index": 36
}
DKVP
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
Anything we can do with CSV input data, we can do with any other format input data. And you can read from one format, do any record-processing, and output to the same format as the input, or to a different output format.
How to specify these to Miller:
- If you use
--csvor--jsonor--pprint, etc., then Miller will use that format for input and output. - If you use
--icsvand--ojson(note the extraiando) then Miller will use CSV for input and JSON for output, etc. See also Keystroke Savers for even shorter options like--c2j.
You can read more about this at the File Formats page.
.. _10min-choices-for-printing-to-files:
Choices for printing to files
Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.
Sometimes, though, we want to print output to another file. Just use > outputfilenamegoeshere at the end of your command:
.. code-block:: none :emphasize-lines: 1,1
mlr --icsv --opprint cat example.csv > newfile.csv
# Output goes to the new file;
# nothing is printed to the screen.
.. code-block:: none :emphasize-lines: 1,1
cat newfile.csv
color shape flag index quantity rate
yellow triangle true 11 43.6498 9.8870
red square true 15 79.2778 0.0130
red circle true 16 13.8103 2.9010
red square false 48 77.5542 7.4670
purple triangle false 51 81.2290 8.5910
red square false 64 77.1991 9.5310
purple triangle false 65 80.1405 5.8240
yellow circle true 73 63.9785 4.2370
yellow circle true 87 63.5058 8.3350
purple square false 91 72.3735 8.2430
Other times we just want our files to be changed in-place: just use mlr -I:
.. code-block:: none :emphasize-lines: 1,1
cp example.csv newfile.txt
.. code-block:: none :emphasize-lines: 1,1
cat newfile.txt
color,shape,flag,index,quantity,rate
yellow,triangle,true,11,43.6498,9.8870
red,square,true,15,79.2778,0.0130
red,circle,true,16,13.8103,2.9010
red,square,false,48,77.5542,7.4670
purple,triangle,false,51,81.2290,8.5910
red,square,false,64,77.1991,9.5310
purple,triangle,false,65,80.1405,5.8240
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
purple,square,false,91,72.3735,8.2430
.. code-block:: none :emphasize-lines: 1,1
mlr -I --csv sort -f shape newfile.txt
.. code-block:: none :emphasize-lines: 1,1
cat newfile.txt
color,shape,flag,index,quantity,rate
red,circle,true,16,13.8103,2.9010
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
red,square,true,15,79.2778,0.0130
red,square,false,48,77.5542,7.4670
red,square,false,64,77.1991,9.5310
purple,square,false,91,72.3735,8.2430
yellow,triangle,true,11,43.6498,9.8870
purple,triangle,false,51,81.2290,8.5910
purple,triangle,false,65,80.1405,5.8240
Also using mlr -I you can bulk-operate on lots of files: e.g.:
.. code-block:: none :emphasize-lines: 1,1
mlr -I --csv cut -x -f unwanted_column_name *.csv
If you like, you can first copy off your original data somewhere else, before doing in-place operations.
Lastly, using tee within put, you can split your input data into separate files per one or more field names:
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
cat circle.csv color,shape,flag,index,quantity,rate red,circle,true,16,13.8103,2.9010 yellow,circle,true,73,63.9785,4.2370 yellow,circle,true,87,63.5058,8.3350
cat square.csv color,shape,flag,index,quantity,rate red,square,true,15,79.2778,0.0130 red,square,false,48,77.5542,7.4670 red,square,false,64,77.1991,9.5310 purple,square,false,91,72.3735,8.2430
cat triangle.csv color,shape,flag,index,quantity,rate yellow,triangle,true,11,43.6498,9.8870 purple,triangle,false,51,81.2290,8.5910 purple,triangle,false,65,80.1405,5.8240