Miller verbs
Let's take a quick look at some of the most useful Miller verbs -- file-format-aware, name-index-empowered equivalents of standard system commands.
mlr cat is like system cat (or type on Windows) -- it passes the data through unmodified:
mlr --csv cat example.csv
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
purple,triangle,false,5,51,81.2290,8.5910
red,square,false,6,64,77.1991,9.5310
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430
But mlr cat can also do format conversion -- for example, you can pretty-print in tabular format:
mlr --icsv --opprint cat example.csv
color  shape    flag  k  index quantity rate
yellow triangle true  1  11    43.6498  9.8870
red    square   true  2  15    79.2778  0.0130
red    circle   true  3  16    13.8103  2.9010
red    square   false 4  48    77.5542  7.4670
purple triangle false 5  51    81.2290  8.5910
red    square   false 6  64    77.1991  9.5310
purple triangle false 7  65    80.1405  5.8240
yellow circle   true  8  73    63.9785  4.2370
yellow circle   true  9  87    63.5058  8.3350
purple square   false 10 91    72.3735  8.2430
mlr head and mlr tail count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way:
mlr --csv head -n 4 example.csv
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
mlr --csv tail -n 4 example.csv
color,shape,flag,k,index,quantity,rate
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430
mlr --icsv --ojson tail -n 2 example.csv
{
  "color": "yellow",
  "shape": "circle",
  "flag": "true",
  "k": 9,
  "index": 87,
  "quantity": 63.5058,
  "rate": 8.3350
}
{
  "color": "purple",
  "shape": "square",
  "flag": "false",
  "k": 10,
  "index": 91,
  "quantity": 72.3735,
  "rate": 8.2430
}
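For comparison, here is a rough sketch of record-counting head and tail built from standard tools. The helper names csv_head and csv_tail and the sample file are made up for this illustration, and the sketch assumes a one-line header and no quoted fields containing newlines. With Miller, the header is taken care of for you.

```shell
# Sketch only: csv_head and csv_tail are hypothetical helpers, not Miller commands.
# Assumes a one-line CSV header and no embedded newlines inside fields.

# Print the header plus the first N data records.
csv_head() {
  awk -v n="$1" 'NR <= n + 1' "$2"
}

# Print the header plus the last N data records.
csv_tail() {
  head -n 1 "$2"
  tail -n +2 "$2" | tail -n "$1"
}

# Small sample taken from the first rows of example.csv.
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
color,shape,k
yellow,triangle,1
red,square,2
red,circle,3
red,square,4
EOF

csv_head 2 "$tmp/sample.csv"
csv_tail 2 "$tmp/sample.csv"
```

Both helpers have to treat line 1 specially, which is exactly the bookkeeping that mlr head and mlr tail make unnecessary.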
You can sort on a single field:
mlr --icsv --opprint sort -f shape example.csv
color  shape    flag  k  index quantity rate
red    circle   true  3  16    13.8103  2.9010
yellow circle   true  8  73    63.9785  4.2370
yellow circle   true  9  87    63.5058  8.3350
red    square   true  2  15    79.2778  0.0130
red    square   false 4  48    77.5542  7.4670
red    square   false 6  64    77.1991  9.5310
purple square   false 10 91    72.3735  8.2430
yellow triangle true  1  11    43.6498  9.8870
purple triangle false 5  51    81.2290  8.5910
purple triangle false 7  65    80.1405  5.8240
Or, you can sort primarily alphabetically on one field, then secondarily numerically descending on another field, and so on:
mlr --icsv --opprint sort -f shape -nr index example.csv
color  shape    flag  k  index quantity rate
yellow circle   true  9  87    63.5058  8.3350
yellow circle   true  8  73    63.9785  4.2370
red    circle   true  3  16    13.8103  2.9010
purple square   false 10 91    72.3735  8.2430
red    square   false 6  64    77.1991  9.5310
red    square   false 4  48    77.5542  7.4670
red    square   true  2  15    79.2778  0.0130
purple triangle false 7  65    80.1405  5.8240
purple triangle false 5  51    81.2290  8.5910
yellow triangle true  1  11    43.6498  9.8870
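To see what the field names buy you, here is a sketch of the same two-key sort using system sort. The function name csv_sort_by_shape_then_index is hypothetical, the column positions 2 and 5 are hard-coded because sort knows nothing about the header, and the sketch assumes fields contain no quoted commas.

```shell
# Sketch only: a standard-tools analogue of 'sort -f shape -nr index'.
# Assumption: shape is column 2 and index is column 5, as in example.csv.
csv_sort_by_shape_then_index() {
  head -n 1 "$1"                                        # keep the header on top
  tail -n +2 "$1" | LC_ALL=C sort -t, -k2,2 -k5,5nr     # lexical on 2, numeric descending on 5
}

# Sample rows taken from example.csv.
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
yellow,circle,true,8,73,63.9785,4.2370
EOF

csv_sort_by_shape_then_index "$tmp/sample.csv"
```

If the columns were ever reordered, the positional version would silently sort on the wrong data, whereas the Miller command would keep working.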
If there are fields you don't want to see in your data, you can use cut to keep only the ones you want, in the same order they appeared in the input data:
mlr --icsv --opprint cut -f flag,shape example.csv
shape    flag
triangle true
square   true
circle   true
square   false
triangle false
square   false
triangle false
circle   true
circle   true
square   false
You can also use cut -o to keep specified fields, but in your preferred order:
mlr --icsv --opprint cut -o -f flag,shape example.csv
flag  shape
true  triangle
true  square
true  circle
false square
false triangle
false square
false triangle
true  circle
true  circle
false square
You can use cut -x to omit fields you don't care about:
mlr --icsv --opprint cut -x -f flag,shape example.csv
color  k  index quantity rate
yellow 1  11    43.6498  9.8870
red    2  15    79.2778  0.0130
red    3  16    13.8103  2.9010
red    4  48    77.5542  7.4670
purple 5  51    81.2290  8.5910
red    6  64    77.1991  9.5310
purple 7  65    80.1405  5.8240
yellow 8  73    63.9785  4.2370
yellow 9  87    63.5058  8.3350
purple 10 91    72.3735  8.2430
You can use filter to keep only records you care about:
mlr --icsv --opprint filter '$color == "red"' example.csv
color shape  flag  k index quantity rate
red   square true  2 15    79.2778  0.0130
red   circle true  3 16    13.8103  2.9010
red   square false 4 48    77.5542  7.4670
red   square false 6 64    77.1991  9.5310
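You can combine conditions with &&. Values such as true and false read from a data file are strings rather than booleans, so compare flag against the string "true":
mlr --icsv --opprint filter '$color == "red" && $flag == "true"' example.csv
color shape  flag k index quantity rate
red   square true 2 15    79.2778  0.0130
red   circle true 3 16    13.8103  2.9010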
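As a rough illustration of what name-indexed filtering involves, the sketch below does the same two-condition filter in awk by first building a lookup from header names to column numbers. The function name csv_filter_red_true is hypothetical, and the sketch assumes the color and flag columns exist and that no field contains a quoted comma.

```shell
# Sketch only: keep records where color is "red" and flag is the string "true".
csv_filter_red_true() {
  awk -F, '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # header name -> column number
      print                                   # always keep the header
      next
    }
    $(col["color"]) == "red" && $(col["flag"]) == "true"
  ' "$1"
}

# Sample rows taken from example.csv.
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
EOF

csv_filter_red_true "$tmp/sample.csv"
```

The header lookup in the first block is the part Miller does for you whenever you write $color or $flag.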
You can use put to create new fields which are computed from other fields:
mlr --icsv --opprint put '
  $ratio = $quantity / $rate;
  $color_shape = $color . "_" . $shape
' example.csv
color  shape    flag  k  index quantity rate   ratio              color_shape
yellow triangle true  1  11    43.6498  9.8870 4.414868008496004  yellow_triangle
red    square   true  2  15    79.2778  0.0130 6098.292307692308  red_square
red    circle   true  3  16    13.8103  2.9010 4.760530851430541  red_circle
red    square   false 4  48    77.5542  7.4670 10.386259541984733 red_square
purple triangle false 5  51    81.2290  8.5910 9.455127458968688  purple_triangle
red    square   false 6  64    77.1991  9.5310 8.099790158430384  red_square
purple triangle false 7  65    80.1405  5.8240 13.760388049450551 purple_triangle
yellow circle   true  8  73    63.9785  4.2370 15.09995279679018  yellow_circle
yellow circle   true  9  87    63.5058  8.3350 7.619172165566886  yellow_circle
purple square   false 10 91    72.3735  8.2430 8.779995147397793  purple_square
Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use $[[3]] to access the name of field 3 or $[[[3]]] to access the value of field 3:
mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv
color  shape    NEW   k  index quantity rate
yellow triangle true  1  11    43.6498  9.8870
red    square   true  2  15    79.2778  0.0130
red    circle   true  3  16    13.8103  2.9010
red    square   false 4  48    77.5542  7.4670
purple triangle false 5  51    81.2290  8.5910
red    square   false 6  64    77.1991  9.5310
purple triangle false 7  65    80.1405  5.8240
yellow circle   true  8  73    63.9785  4.2370
yellow circle   true  9  87    63.5058  8.3350
purple square   false 10 91    72.3735  8.2430
mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv
color  shape    flag k  index quantity rate
yellow triangle NEW  1  11    43.6498  9.8870
red    square   NEW  2  15    79.2778  0.0130
red    circle   NEW  3  16    13.8103  2.9010
red    square   NEW  4  48    77.5542  7.4670
purple triangle NEW  5  51    81.2290  8.5910
red    square   NEW  6  64    77.1991  9.5310
purple triangle NEW  7  65    80.1405  5.8240
yellow circle   NEW  8  73    63.9785  4.2370
yellow circle   NEW  9  87    63.5058  8.3350
purple square   NEW  10 91    72.3735  8.2430
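The difference between the two forms is whether the assignment touches the header or the data. A loose awk analogy, assuming plain comma-separated input with no quoted commas (the function names rename_field3 and overwrite_field3 are hypothetical):

```shell
# Sketch only: awk analogues of assigning to the name vs. the value of field 3.

# Like $[[3]] = "NEW": change the name of column 3, leave the data alone.
rename_field3() {
  awk -F, -v OFS=, 'NR == 1 { $3 = "NEW" } { print }' "$1"
}

# Like $[[[3]]] = "NEW": keep the name, overwrite the value in every record.
overwrite_field3() {
  awk -F, -v OFS=, 'NR > 1 { $3 = "NEW" } { print }' "$1"
}

# Sample rows taken from the first three columns of example.csv.
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
color,shape,flag
yellow,triangle,true
red,square,false
EOF

rename_field3 "$tmp/sample.csv"
overwrite_field3 "$tmp/sample.csv"
```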
You can find the full list of verbs at the Verbs Reference page.
Multiple input files
Miller takes all the files from the command line as an input stream. But it's format-aware, so it doesn't repeat CSV header lines. For example, with input files data/a.csv and data/b.csv, the system cat command will repeat header lines:
cat data/a.csv
a,b,c
1,2,3
4,5,6
cat data/b.csv
a,b,c
7,8,9
cat data/a.csv data/b.csv
a,b,c
1,2,3
4,5,6
a,b,c
7,8,9
However, mlr cat will not:
mlr --csv cat data/a.csv data/b.csv
a,b,c
1,2,3
4,5,6
7,8,9
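Without a format-aware tool you would have to drop the repeated headers yourself. A minimal sketch, assuming every input file starts with the same one-line header (the function name csv_cat is hypothetical):

```shell
# Sketch only: concatenate CSV files, printing the header line just once.
# FNR is the line number within the current file; NR counts across all files.
csv_cat() {
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Recreate the two small inputs shown above.
tmp=$(mktemp -d)
printf 'a,b,c\n1,2,3\n4,5,6\n' > "$tmp/a.csv"
printf 'a,b,c\n7,8,9\n' > "$tmp/b.csv"

csv_cat "$tmp/a.csv" "$tmp/b.csv"
```

This sketch blindly discards the first line of every later file, so it is only safe when the headers really do match.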
Chaining verbs together
Often we want to chain queries together -- for example, sorting by a field and taking the top few values. We can do this using pipes:
mlr --csv sort -nr index example.csv | mlr --icsv --opprint head -n 3
color  shape  flag  k  index quantity rate
purple square false 10 91    72.3735  8.2430
yellow circle true  9  87    63.5058  8.3350
yellow circle true  8  73    63.9785  4.2370
This works fine -- but Miller also lets you chain verbs together using the word then. Think of this as a Miller-internal pipe that lets you use fewer keystrokes:
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
color  shape  flag  k  index quantity rate
purple square false 10 91    72.3735  8.2430
yellow circle true  9  87    63.5058  8.3350
yellow circle true  8  73    63.9785  4.2370
As another convenience, you can put the filename first using --from. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:
mlr --icsv --opprint --from example.csv sort -nr index then head -n 3
color  shape  flag  k  index quantity rate
purple square false 10 91    72.3735  8.2430
yellow circle true  9  87    63.5058  8.3350
yellow circle true  8  73    63.9785  4.2370
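Putting the filename first also makes it easy to spread a longer chain over several lines and append more verbs at the end:
mlr --icsv --opprint --from example.csv \
  sort -nr index \
  then head -n 3 \
  then cut -f shape,quantity
shape  quantity
square 72.3735
circle 63.5058
circle 63.9785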
Sorts and stats
Now suppose you want to sort the data on a given column, and then take the top few in that ordering. You can use Miller's then feature to pipe commands together.
Here are the records with the top three index values:
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
color  shape  flag  k  index quantity rate
purple square false 10 91    72.3735  8.2430
yellow circle true  9  87    63.5058  8.3350
yellow circle true  8  73    63.9785  4.2370
Lots of Miller commands take a -g option for group-by: here, head -n 1 -g shape outputs the first record for each distinct value of the shape field. Since the records are already sorted by descending index within each shape, this finds the record with the highest index value for each distinct shape:
mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv
color  shape    flag  k  index quantity rate
yellow circle   true  9  87    63.5058  8.3350
purple square   false 10 91    72.3735  8.2430
purple triangle false 7  65    80.1405  5.8240
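The idea behind head -n 1 -g shape is "first record seen per group". A sketch of the same idea with standard tools, assuming shape is column 2 and index is column 5 and that no field contains a quoted comma (the function name csv_top_index_per_shape is hypothetical):

```shell
# Sketch only: for each shape, keep the record with the highest index.
csv_top_index_per_shape() {
  head -n 1 "$1"
  tail -n +2 "$1" \
    | LC_ALL=C sort -t, -k2,2 -k5,5nr \
    | awk -F, '!seen[$2]++'          # print a record only the first time its shape appears
}

# Sample rows taken from example.csv.
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
yellow,circle,true,8,73,63.9785,4.2370
EOF

csv_top_index_per_shape "$tmp/sample.csv"
```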
Statistics can be computed with or without group-by field(s):
mlr --icsv --opprint --from example.csv \
  stats1 -a count,min,mean,max -f quantity -g shape
shape    quantity_count quantity_min quantity_mean     quantity_max
triangle 3              43.6498      68.33976666666666 81.229
square   4              72.3735      76.60114999999999 79.2778
circle   3              13.8103      47.0982           63.9785
mlr --icsv --opprint --from example.csv \
  stats1 -a count,min,mean,max -f quantity -g shape,color
shape    color  quantity_count quantity_min quantity_mean      quantity_max
triangle yellow 1              43.6498      43.6498            43.6498
square   red    3              77.1991      78.01036666666666  79.2778
circle   red    1              13.8103      13.8103            13.8103
triangle purple 2              80.1405      80.68475000000001  81.229
circle   yellow 2              63.5058      63.742149999999995 63.9785
square   purple 1              72.3735      72.3735            72.3735
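To make the group-by arithmetic concrete, here is a sketch that computes count, min, mean, and max of quantity per shape with awk. The function name csv_quantity_stats_by_shape is hypothetical, the column positions (shape in 2, quantity in 6) are hard-coded assumptions, and the output is rounded and sorted by shape, so it is formatted differently from Miller's output above.

```shell
# Sketch only: per-shape count/min/mean/max of the quantity column.
csv_quantity_stats_by_shape() {
  echo "shape,count,min,mean,max"
  awk -F, '
    NR == 1 { next }                       # skip the header
    {
      g = $2; q = $6 + 0                   # group key and numeric value
      if (!(g in count) || q < min[g]) min[g] = q
      if (!(g in count) || q > max[g]) max[g] = q
      count[g]++; sum[g] += q
    }
    END {
      for (g in count)
        printf "%s,%d,%.4f,%.2f,%.4f\n", g, count[g], min[g], sum[g] / count[g], max[g]
    }
  ' "$1" | LC_ALL=C sort                   # awk array order is unspecified, so sort the rows
}

# The full example.csv from above.
tmp=$(mktemp -d)
cat > "$tmp/example.csv" <<'EOF'
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
purple,triangle,false,5,51,81.2290,8.5910
red,square,false,6,64,77.1991,9.5310
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430
EOF

csv_quantity_stats_by_shape "$tmp/example.csv"
```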
If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:
mlr --icsv --oxtab --from example.csv \
  stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate
rate_p0   0.0130
rate_p10  2.9010
rate_p25  4.2370
rate_p50  8.2430
rate_p75  8.5910
rate_p90  9.8870
rate_p99  9.8870
rate_p100 9.8870
File formats and format conversion
Miller supports the following formats:
- CSV (comma-separated values)
- TSV (tab-separated values)
- JSON (JavaScript Object Notation)
- PPRINT (pretty-printed tabular)
- XTAB (vertical-tabular or sideways-tabular)
- NIDX (numerically indexed, label-free, with implicit labels "1","2", etc.)
- DKVP (delimited key-value pairs).
What's a CSV file, really? It's an array of rows, or records, each being a list of key-value pairs, or fields: for CSV it so happens that all the keys are shared in the header line and the values vary from one data line to another.
For example, if you have:
shape,flag,index
circle,1,24
square,0,36
then that's a way of saying:
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
Other ways to write the same data:
CSV
shape,flag,index
circle,1,24
square,0,36

PPRINT
shape  flag index
circle 1    24
square 0    36

JSON
{
  "shape": "circle",
  "flag": 1,
  "index": 24
}
{
  "shape": "square",
  "flag": 0,
  "index": 36
}

XTAB
shape circle
flag  1
index 24

shape square
flag  0
index 36

DKVP
shape=circle,flag=1,index=24
shape=square,flag=0,index=36
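The correspondence between the CSV and DKVP forms can be spelled out in a few lines of awk: remember the keys from the header, then pair them with the values on each data line. This is only a sketch, assuming simple CSV with no quoted commas; the function name csv_to_dkvp is hypothetical.

```shell
# Sketch only: rewrite simple CSV as comma-delimited key=value pairs.
csv_to_dkvp() {
  awk -F, '
    NR == 1 { for (i = 1; i <= NF; i++) key[i] = $i; next }   # keys come from the header
    {
      line = ""
      for (i = 1; i <= NF; i++)
        line = line (i > 1 ? "," : "") key[i] "=" $i
      print line
    }
  ' "$1"
}

# The small example used above.
tmp=$(mktemp -d)
printf 'shape,flag,index\ncircle,1,24\nsquare,0,36\n' > "$tmp/small.csv"

csv_to_dkvp "$tmp/small.csv"
```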
Anything we can do with CSV input, we can do with input in any other format. You can read one format, do any record processing, and write the result in the same format as the input or in a different one.
How to specify these to Miller:
- If you use --csv or --json or --pprint, etc., then Miller will use that format for input and output.
- If you use --icsv and --ojson (note the extra i and o) then Miller will use CSV for input and JSON for output, etc. See also Keystroke Savers for even shorter options like --c2j.
You can read more about this at the File Formats page.
Choices for printing to files
Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.
Sometimes, though, we want to print output to another file. Just use > outputfilenamegoeshere at the end of your command:
mlr --icsv --opprint cat example.csv > newfile.csv
# Output goes to the new file;
# nothing is printed to the screen.
cat newfile.csv
color  shape    flag  k  index quantity rate
yellow triangle true  1  11    43.6498  9.8870
red    square   true  2  15    79.2778  0.0130
red    circle   true  3  16    13.8103  2.9010
red    square   false 4  48    77.5542  7.4670
purple triangle false 5  51    81.2290  8.5910
red    square   false 6  64    77.1991  9.5310
purple triangle false 7  65    80.1405  5.8240
yellow circle   true  8  73    63.9785  4.2370
yellow circle   true  9  87    63.5058  8.3350
purple square   false 10 91    72.3735  8.2430
Other times we want a file to be changed in place. For that, use mlr -I:
cp example.csv newfile.txt
cat newfile.txt
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
purple,triangle,false,5,51,81.2290,8.5910
red,square,false,6,64,77.1991,9.5310
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430
mlr -I --csv sort -f shape newfile.txt
cat newfile.txt
color,shape,flag,k,index,quantity,rate
red,circle,true,3,16,13.8103,2.9010
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
purple,square,false,10,91,72.3735,8.2430
yellow,triangle,true,1,11,43.6498,9.8870
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
You can also use mlr -I to operate on many files at once, for example:
mlr -I --csv cut -x -f unwanted_column_name *.csv
Since in-place mode overwrites the files it is given, you may want to copy your original data somewhere else before doing in-place operations.
Lastly, using tee within put, you can split your input data into separate files, one per distinct value of one or more fields:
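One simple way to keep a safety copy is a small helper run just before the in-place command. This is only a sketch: backup_files and the backup directory name are hypothetical, and it assumes the files have distinct base names.

```shell
# Sketch only: copy the given files into a backup directory before editing in place.
# Usage: backup_files BACKUP_DIR FILE...
backup_files() {
  dest="$1"
  shift
  mkdir -p "$dest"
  for f in "$@"; do
    cp -p "$f" "$dest/"      # -p preserves timestamps and permissions
  done
}

# Demo on throwaway files.
tmp=$(mktemp -d)
printf 'a,b,c\n1,2,3\n4,5,6\n' > "$tmp/a.csv"
printf 'a,b,c\n7,8,9\n' > "$tmp/b.csv"

backup_files "$tmp/backup" "$tmp/a.csv" "$tmp/b.csv"

# After this you could run an in-place command on the originals, e.g.:
#   mlr -I --csv cut -x -f unwanted_column_name "$tmp"/*.csv
```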
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
cat circle.csv
color,shape,flag,k,index,quantity,rate
red,circle,true,3,16,13.8103,2.9010
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
cat square.csv
color,shape,flag,k,index,quantity,rate
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
purple,square,false,10,91,72.3735,8.2430
cat triangle.csv
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
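For a sense of what the tee redirect is doing, here is a sketch of the same split with awk: each record is written to a file named after its shape, and the header is written once per output file. The function name csv_split_by_shape and the output directory argument are hypothetical, and shape is assumed to be column 2 with no quoted commas in the data.

```shell
# Sketch only: split a CSV into one file per distinct shape value.
# Usage: csv_split_by_shape INPUT.csv OUTPUT_DIR
csv_split_by_shape() {
  mkdir -p "$2"
  awk -F, -v dir="$2" '
    NR == 1 { header = $0; next }
    {
      out = dir "/" $2 ".csv"
      if (!(out in started)) {     # first record for this shape: start the file with the header
        print header > out
        started[out] = 1
      }
      print > out                  # later prints append while awk keeps the file open
    }
  ' "$1"
}

# Sample rows taken from example.csv.
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
EOF

csv_split_by_shape "$tmp/sample.csv" "$tmp/out"
cat "$tmp/out/square.csv"
```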