miller/docs6/docs/10min.md at a34fdafe8aec0efd57ecaaee8f2291c05a055f21

Mirrors/miller

Fork 0

mirror of https://github.com/johnkerl/miller.git synced 2026-01-23 18:25:45 +00:00

John Kerl be1b026ff3 stub for new flag-list page

2021-09-06 21:57:15 -04:00

21 KiB

Raw Blame History

Quick links: Flag list Verb list Function list Glossary Repository ↗

# Miller in 10 minutes

Miller verbs

Let's take a quick look at some of the most useful Miller verbs -- file-format-aware, name-index-empowered equivalents of standard system commands.

mlr cat is like system cat (or type on Windows) -- it passes the data through unmodified:

mlr --csv cat example.csv

color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
purple,triangle,false,5,51,81.2290,8.5910
red,square,false,6,64,77.1991,9.5310
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430

But mlr cat can also do format conversion -- for example, you can pretty-print in tabular format:

mlr --icsv --opprint cat example.csv

color  shape    flag  k  index quantity rate
yellow triangle true  1  11    43.6498  9.8870
red    square   true  2  15    79.2778  0.0130
red    circle   true  3  16    13.8103  2.9010
red    square   false 4  48    77.5542  7.4670
purple triangle false 5  51    81.2290  8.5910
red    square   false 6  64    77.1991  9.5310
purple triangle false 7  65    80.1405  5.8240
yellow circle   true  8  73    63.9785  4.2370
yellow circle   true  9  87    63.5058  8.3350
purple square   false 10 91    72.3735  8.2430

mlr head and mlr tail count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way:

mlr --csv head -n 4 example.csv

color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670

mlr --csv tail -n 4 example.csv

color,shape,flag,k,index,quantity,rate
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430

mlr --icsv --ojson tail -n 2 example.csv

{
  "color": "yellow",
  "shape": "circle",
  "flag": "true",
  "k": 9,
  "index": 87,
  "quantity": 63.5058,
  "rate": 8.3350
}
{
  "color": "purple",
  "shape": "square",
  "flag": "false",
  "k": 10,
  "index": 91,
  "quantity": 72.3735,
  "rate": 8.2430
}

You can sort on a single field:

mlr --icsv --opprint sort -f shape example.csv

color  shape    flag  k  index quantity rate
red    circle   true  3  16    13.8103  2.9010
yellow circle   true  8  73    63.9785  4.2370
yellow circle   true  9  87    63.5058  8.3350
red    square   true  2  15    79.2778  0.0130
red    square   false 4  48    77.5542  7.4670
red    square   false 6  64    77.1991  9.5310
purple square   false 10 91    72.3735  8.2430
yellow triangle true  1  11    43.6498  9.8870
purple triangle false 5  51    81.2290  8.5910
purple triangle false 7  65    80.1405  5.8240

Or, you can sort primarily alphabetically on one field, then secondarily numerically descending on another field, and so on:

mlr --icsv --opprint sort -f shape -nr index example.csv

color  shape    flag  k  index quantity rate
yellow circle   true  9  87    63.5058  8.3350
yellow circle   true  8  73    63.9785  4.2370
red    circle   true  3  16    13.8103  2.9010
purple square   false 10 91    72.3735  8.2430
red    square   false 6  64    77.1991  9.5310
red    square   false 4  48    77.5542  7.4670
red    square   true  2  15    79.2778  0.0130
purple triangle false 7  65    80.1405  5.8240
purple triangle false 5  51    81.2290  8.5910
yellow triangle true  1  11    43.6498  9.8870

If there are fields you don't want to see in your data, you can use cut to keep only the ones you want, in the same order they appeared in the input data:

mlr --icsv --opprint cut -f flag,shape example.csv

shape    flag
triangle true
square   true
circle   true
square   false
triangle false
square   false
triangle false
circle   true
circle   true
square   false

You can also use cut -o to keep specified fields, but in your preferred order:

mlr --icsv --opprint cut -o -f flag,shape example.csv

flag  shape
true  triangle
true  square
true  circle
false square
false triangle
false square
false triangle
true  circle
true  circle
false square

You can use cut -x to omit fields you don't care about:

mlr --icsv --opprint cut -x -f flag,shape example.csv

color  k  index quantity rate
yellow 1  11    43.6498  9.8870
red    2  15    79.2778  0.0130
red    3  16    13.8103  2.9010
red    4  48    77.5542  7.4670
purple 5  51    81.2290  8.5910
red    6  64    77.1991  9.5310
purple 7  65    80.1405  5.8240
yellow 8  73    63.9785  4.2370
yellow 9  87    63.5058  8.3350
purple 10 91    72.3735  8.2430

You can use filter to keep only records you care about:

mlr --icsv --opprint filter '$color == "red"' example.csv

color shape  flag  k index quantity rate
red   square true  2 15    79.2778  0.0130
red   circle true  3 16    13.8103  2.9010
red   square false 4 48    77.5542  7.4670
red   square false 6 64    77.1991  9.5310

mlr --icsv --opprint filter '$color == "red" && $flag == true' example.csv

You can use put to create new fields which are computed from other fields:

mlr --icsv --opprint put '
  $ratio = $quantity / $rate;
  $color_shape = $color . "_" . $shape
' example.csv

color  shape    flag  k  index quantity rate   ratio              color_shape
yellow triangle true  1  11    43.6498  9.8870 4.414868008496004  yellow_triangle
red    square   true  2  15    79.2778  0.0130 6098.292307692308  red_square
red    circle   true  3  16    13.8103  2.9010 4.760530851430541  red_circle
red    square   false 4  48    77.5542  7.4670 10.386259541984733 red_square
purple triangle false 5  51    81.2290  8.5910 9.455127458968688  purple_triangle
red    square   false 6  64    77.1991  9.5310 8.099790158430384  red_square
purple triangle false 7  65    80.1405  5.8240 13.760388049450551 purple_triangle
yellow circle   true  8  73    63.9785  4.2370 15.09995279679018  yellow_circle
yellow circle   true  9  87    63.5058  8.3350 7.619172165566886  yellow_circle
purple square   false 10 91    72.3735  8.2430 8.779995147397793  purple_square

Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use $[[3]] to access the name of field 3 or $[[[3]]] to access the value of field 3:

mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv

color  shape    NEW   k  index quantity rate
yellow triangle true  1  11    43.6498  9.8870
red    square   true  2  15    79.2778  0.0130
red    circle   true  3  16    13.8103  2.9010
red    square   false 4  48    77.5542  7.4670
purple triangle false 5  51    81.2290  8.5910
red    square   false 6  64    77.1991  9.5310
purple triangle false 7  65    80.1405  5.8240
yellow circle   true  8  73    63.9785  4.2370
yellow circle   true  9  87    63.5058  8.3350
purple square   false 10 91    72.3735  8.2430

mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv

color  shape    flag k  index quantity rate
yellow triangle NEW  1  11    43.6498  9.8870
red    square   NEW  2  15    79.2778  0.0130
red    circle   NEW  3  16    13.8103  2.9010
red    square   NEW  4  48    77.5542  7.4670
purple triangle NEW  5  51    81.2290  8.5910
red    square   NEW  6  64    77.1991  9.5310
purple triangle NEW  7  65    80.1405  5.8240
yellow circle   NEW  8  73    63.9785  4.2370
yellow circle   NEW  9  87    63.5058  8.3350
purple square   NEW  10 91    72.3735  8.2430

You can find the full list of verbs at the Verbs Reference page.

Multiple input files

Miller takes all the files from the command line as an input stream. But it's format-aware, so it doesn't repeat CSV header lines. For example, with input files data/a.csv and data/b.csv, the system cat command will repeat header lines:

cat data/a.csv

a,b,c
1,2,3
4,5,6

cat data/b.csv

a,b,c
7,8,9

cat data/a.csv data/b.csv

a,b,c
1,2,3
4,5,6
a,b,c
7,8,9

However, mlr cat will not:

mlr --csv cat data/a.csv data/b.csv

a,b,c
1,2,3
4,5,6
7,8,9

Chaining verbs together

Often we want to chain queries together -- for example, sorting by a field and taking the top few values. We can do this using pipes:

mlr --csv sort -nr index example.csv | mlr --icsv --opprint head -n 3

color  shape  flag  k  index quantity rate
purple square false 10 91    72.3735  8.2430
yellow circle true  9  87    63.5058  8.3350
yellow circle true  8  73    63.9785  4.2370

This works fine -- but Miller also lets you chain verbs together using the word then. Think of this as a Miller-internal pipe that lets you use fewer keystrokes:

mlr --icsv --opprint sort -nr index then head -n 3 example.csv

color  shape  flag  k  index quantity rate
purple square false 10 91    72.3735  8.2430
yellow circle true  9  87    63.5058  8.3350
yellow circle true  8  73    63.9785  4.2370

As another convenience, you can put the filename first using --from. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:

mlr --icsv --opprint --from example.csv sort -nr index then head -n 3

color  shape  flag  k  index quantity rate
purple square false 10 91    72.3735  8.2430
yellow circle true  9  87    63.5058  8.3350
yellow circle true  8  73    63.9785  4.2370

mlr --icsv --opprint --from example.csv \
  sort -nr index \
  then head -n 3 \
  then cut -f shape,quantity

shape  quantity
square 72.3735
circle 63.5058
circle 63.9785

Sorts and stats

Now suppose you want to sort the data on a given column, and then take the top few in that ordering. You can use Miller's then feature to pipe commands together.

Here are the records with the top three index values:

mlr --icsv --opprint sort -nr index then head -n 3 example.csv

color  shape  flag  k  index quantity rate
purple square false 10 91    72.3735  8.2430
yellow circle true  9  87    63.5058  8.3350
yellow circle true  8  73    63.9785  4.2370

Lots of Miller commands take a -g option for group-by: here, head -n 1 -g shape outputs the first record for each distinct value of the shape field. This means we're finding the record with highest index field for each distinct shape field:

mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv

color  shape    flag  k  index quantity rate
yellow circle   true  9  87    63.5058  8.3350
purple square   false 10 91    72.3735  8.2430
purple triangle false 7  65    80.1405  5.8240

Statistics can be computed with or without group-by field(s):

mlr --icsv --opprint --from example.csv \
  stats1 -a count,min,mean,max -f quantity -g shape

shape    quantity_count quantity_min quantity_mean     quantity_max
triangle 3              43.6498      68.33976666666666 81.229
square   4              72.3735      76.60114999999999 79.2778
circle   3              13.8103      47.0982           63.9785

mlr --icsv --opprint --from example.csv \
  stats1 -a count,min,mean,max -f quantity -g shape,color

shape    color  quantity_count quantity_min quantity_mean      quantity_max
triangle yellow 1              43.6498      43.6498            43.6498
square   red    3              77.1991      78.01036666666666  79.2778
circle   red    1              13.8103      13.8103            13.8103
triangle purple 2              80.1405      80.68475000000001  81.229
circle   yellow 2              63.5058      63.742149999999995 63.9785
square   purple 1              72.3735      72.3735            72.3735

If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:

mlr --icsv --oxtab --from example.csv \
  stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate

rate_p0   0.0130
rate_p10  2.9010
rate_p25  4.2370
rate_p50  8.2430
rate_p75  8.5910
rate_p90  9.8870
rate_p99  9.8870
rate_p100 9.8870

File formats and format conversion

Miller supports the following formats:

CSV (comma-separared values)
TSV (tab-separated values)
JSON (JavaScript Object Notation)
PPRINT (pretty-printed tabular)
XTAB (vertical-tabular or sideways-tabular)
NIDX (numerically indexed, label-free, with implicit labels "1", "2", etc.)
DKVP (delimited key-value pairs).

What's a CSV file, really? It's an array of rows, or records, each being a list of key-value pairs, or fields: for CSV it so happens that all the keys are shared in the header line and the values vary from one data line to another.

For example, if you have:

shape,flag,index
circle,1,24
square,0,36

then that's a way of saying:

shape=circle,flag=1,index=24
shape=square,flag=0,index=36

Other ways to write the same data:

CSV                   PPRINT
shape,flag,index      shape  flag index
circle,1,24           circle 1    24
square,0,36           square 0    36

JSON                  XTAB
{                     shape circle
  "shape": "circle",  flag  1
  "flag": 1,          index 24
  "index": 24         .
}                     shape square
{                     flag  0
  "shape": "square",  index 36
  "flag": 0,
  "index": 36
}

DKVP
shape=circle,flag=1,index=24
shape=square,flag=0,index=36

Anything we can do with CSV input data, we can do with any other format input data. And you can read from one format, do any record-processing, and output to the same format as the input, or to a different output format.

How to specify these to Miller:

If you use --csv or --json or --pprint, etc., then Miller will use that format for input and output.
If you use --icsv and --ojson (note the extra i and o) then Miller will use CSV for input and JSON for output, etc. See also Keystroke Savers for even shorter options like --c2j.

You can read more about this at the File Formats page.

Choices for printing to files

Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.

Sometimes, though, we want to print output to another file. Just use > outputfilenamegoeshere at the end of your command:

mlr --icsv --opprint cat example.csv > newfile.csv

# Output goes to the new file;
# nothing is printed to the screen.

cat newfile.csv

color  shape    flag     index quantity rate
yellow triangle true     11    43.6498  9.8870
red    square   true     15    79.2778  0.0130
red    circle   true     16    13.8103  2.9010
red    square   false    48    77.5542  7.4670
purple triangle false    51    81.2290  8.5910
red    square   false    64    77.1991  9.5310
purple triangle false    65    80.1405  5.8240
yellow circle   true     73    63.9785  4.2370
yellow circle   true     87    63.5058  8.3350
purple square   false    91    72.3735  8.2430

Other times we just want our files to be changed in-place: just use mlr -I:

cp example.csv newfile.txt

cat newfile.txt

color,shape,flag,index,quantity,rate
yellow,triangle,true,11,43.6498,9.8870
red,square,true,15,79.2778,0.0130
red,circle,true,16,13.8103,2.9010
red,square,false,48,77.5542,7.4670
purple,triangle,false,51,81.2290,8.5910
red,square,false,64,77.1991,9.5310
purple,triangle,false,65,80.1405,5.8240
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
purple,square,false,91,72.3735,8.2430

mlr -I --csv sort -f shape newfile.txt

cat newfile.txt

color,shape,flag,index,quantity,rate
red,circle,true,16,13.8103,2.9010
yellow,circle,true,73,63.9785,4.2370
yellow,circle,true,87,63.5058,8.3350
red,square,true,15,79.2778,0.0130
red,square,false,48,77.5542,7.4670
red,square,false,64,77.1991,9.5310
purple,square,false,91,72.3735,8.2430
yellow,triangle,true,11,43.6498,9.8870
purple,triangle,false,51,81.2290,8.5910
purple,triangle,false,65,80.1405,5.8240

Also using mlr -I you can bulk-operate on lots of files: e.g.:

mlr -I --csv cut -x -f unwanted_column_name *.csv

If you like, you can first copy off your original data somewhere else, before doing in-place operations.

Lastly, using tee within put, you can split your input data into separate files per one or more field names:

mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'

cat circle.csv

color,shape,flag,k,index,quantity,rate
red,circle,true,3,16,13.8103,2.9010
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350

cat square.csv

color,shape,flag,k,index,quantity,rate
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
purple,square,false,10,91,72.3735,8.2430

cat triangle.csv

color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240

21 KiB Raw Blame History