<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<!-- PAGE GENERATED FROM template.html and content-for-reference.html BY poki. -->
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
<meta name="description" content="Miller documentation"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
<meta name="keywords"
content="John Kerl, Kerl, Miller, miller, mlr, OLAP, data analysis software, regression, correlation, variance, data tools, " />
<title> Reference </title>
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
</head>
<!-- ================================================================ -->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-15651652-1");
pageTracker._trackPageview();
} catch(err) {}
</script>
<script type="text/javascript">
function toggle(divName) {
var eleDiv = document.getElementById(divName);
if (eleDiv != null) {
if (eleDiv.style.display == "block") {
eleDiv.style.display = "none";
} else {
eleDiv.style.display = "block";
}
}
}
</script>
<!--
The background image is from a screenshot of a Google search for "data analysis
tools", lightened and sepia-toned. Over this was placed a Mac Terminal app with
very light-grey font and translucent background, in which a few statistical
Miller commands were run with pretty-print-tabular output format.
-->
<body background="pix/sepia-overlay.jpg">
<!-- ================================================================ -->
<table width="100%">
<tr>
<!-- navbar -->
<td width="15%">
<!--
<img src="pix/mlr.jpg" />
<img style="border-width:1px; color:black;" src="pix/mlr.jpg" />
-->
<div class="pokinav">
<center><titleinbody>Miller</titleinbody></center>
<!-- PAGE LIST GENERATED FROM template.html BY poki -->
<br/>User info:
<br/>&bull;&nbsp;<a href="index.html">About Miller</a>
<br/>&bull;&nbsp;<a href="file-formats.html">File formats</a>
<br/>&bull;&nbsp;<a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a>
<br/>&bull;&nbsp;<a href="record-heterogeneity.html">Record-heterogeneity</a>
<br/>&bull;&nbsp;<a href="performance.html">Performance</a>
<br/>&bull;&nbsp;<a href="etymology.html">Why call it Miller?</a>
<br/>&bull;&nbsp;<a href="originality.html">How original is Miller?</a>
<br/>&bull;&nbsp;<a href="reference.html"><b>Reference</b></a>
<br/>&bull;&nbsp;<a href="data-examples.html">Data examples</a>
<br/>&bull;&nbsp;<a href="internationalization.html">Internationalization</a>
<br/>&bull;&nbsp;<a href="to-do.html">Things to do</a>
<br/>Developer info:
<br/>&bull;&nbsp;<a href="build.html">Compiling, portability, dependencies, and testing</a>
<br/>&bull;&nbsp;<a href="whyc.html">Why C?</a>
<br/>&bull;&nbsp;<a href="contact.html">Contact information</a>
<br/>&bull;&nbsp;<a href="https://github.com/johnkerl/miller">GitHub repo</a>
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
<br/> <br/> <br/> <br/> <br/> <br/>
</div>
</td>
<!-- page body -->
<td>
<div style="overflow-y:scroll;height:1500px">
<center> <titleinbody> Reference </titleinbody> </center>
<p/>
<!-- BODY COPIED FROM content-for-reference.html BY poki -->
<div class="pokitoc">
<center><b>Contents:</b></center>
&bull;&nbsp;<a href="#Command_overview">Command overview</a><br/>
&bull;&nbsp;<a href="#On-line_help">On-line help</a><br/>
&bull;&nbsp;<a href="#then-chaining">then-chaining</a><br/>
&bull;&nbsp;<a href="#I/O_options">I/O options</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#Formats">Formats</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#Record/field/pair_separators">Record/field/pair separators</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#Number_formatting">Number formatting</a><br/>
&bull;&nbsp;<a href="#Data_transformations">Data transformations</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#cat">cat</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#check">check</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#count-distinct">count-distinct</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#cut">cut</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#filter">filter</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#group-by">group-by</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#group-like">group-like</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#having-fields">having-fields</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#head">head</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#histogram">histogram</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#join">join</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#label">label</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#put">put</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#regularize">regularize</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#rename">rename</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#reorder">reorder</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#sort">sort</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#stats1">stats1</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#stats2">stats2</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#step">step</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#tac">tac</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#tail">tail</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#top">top</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&bull;&nbsp;<a href="#uniq">uniq</a><br/>
&bull;&nbsp;<a href="#Functions_for_filter_and_put">Functions for filter and put</a><br/>
&bull;&nbsp;<a href="#Data_types">Data types</a><br/>
&bull;&nbsp;<a href="#Null_data">Null data</a><br/>
</div>
<p/>
<a id="Command_overview"/><h1>Command overview</h1>
<p>
Whereas the Unix toolkit is made of the separate executables <tt>cat</tt>, <tt>tail</tt>, <tt>cut</tt>,
<tt>sort</tt>, etc., Miller has subcommands, invoked as follows:
<p/>
<div class="pokipanel">
<pre>
mlr tac *.dat
mlr cut --complement -f os_version *.dat
mlr sort -f hostname,uptime *.dat
</pre>
</div>
<p/>
<p/>These fall into categories as follows:
<table border=1>
<tr class="mlrbg">
<th>Commands </th>
<th>Description</th>
</tr>
<tr>
<td>
<a href="#cat"><tt>cat</tt></a>,
<a href="#cut"><tt>cut</tt></a>,
<a href="#head"><tt>head</tt></a>,
<a href="#sort"><tt>sort</tt></a>,
<a href="#tac"><tt>tac</tt></a>,
<a href="#tail"><tt>tail</tt></a>,
<a href="#top"><tt>top</tt></a>,
<a href="#uniq"><tt>uniq</tt></a>
</td>
<td> Analogs of their Unix-toolkit namesakes, discussed below as well as in
<a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a> </td>
</tr>
<tr>
<td>
<a href="#filter"><tt>filter</tt></a>,
<a href="#put"><tt>put</tt></a>,
<a href="#step"><tt>step</tt></a>
</td>
<td> <tt>awk</tt>-like functionality </td>
</tr>
<tr>
<td>
<a href="#histogram"><tt>histogram</tt></a>,
<a href="#stats1"><tt>stats1</tt></a>,
<a href="#stats2"><tt>stats2</tt></a>
</td>
<td> Statistically oriented </td>
</tr>
<tr>
<td>
<a href="#group-by"><tt>group-by</tt></a>,
<a href="#group-like"><tt>group-like</tt></a>,
<a href="#having-fields"><tt>having-fields</tt></a>
</td>
<td> Particularly oriented toward <a href="record-heterogeneity.html">Record-heterogeneity</a>, although
all Miller commands can handle heterogeneous records. </td>
</tr>
<tr>
<td>
<a href="#count-distinct"><tt>count-distinct</tt></a>,
<a href="#label"><tt>label</tt></a>,
<a href="#regularize"><tt>regularize</tt></a>,
<a href="#rename"><tt>rename</tt></a>,
<a href="#reorder"><tt>reorder</tt></a>
</td>
<td> These draw from other sources (see also <a href="originality.html">How original is Miller?</a>):
<a href="#count-distinct"><tt>count-distinct</tt></a> is SQL-ish, and
<a href="#rename"><tt>rename</tt></a> can be done by <tt>sed</tt> (which does it faster:
see <a href="performance.html">Performance</a>).
</td>
</tr>
</table>
<a id="On-line_help"/><h1>On-line help</h1>
<p/>Examples:<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --help
Usage: mlr [I/O options] {verb} [verb-dependent options ...] {file names}
Verbs:
cat check count-distinct cut filter group-by group-like having-fields head
histogram join label put regularize rename reorder sort stats1 stats2 step tac tail top
uniq
Example: mlr --csv --rs lf --fs tab cut -f hostname,uptime file1.csv file2.csv
Please use "mlr {verb name} --help" for verb-specific help.
Please use "mlr --help-all-verbs" for help on all verbs.
Functions for filter and put:
abs acos acosh asin asinh atan atan2 atanh cbrt ceil cos cosh erf erfc exp
expm1 floor invqnorm log log10 log1p max min pow qnorm round roundm sin sinh sqrt tan
tanh urand + - - * / % ** == != &gt; &gt;= &lt; &lt;= &amp;&amp; || ! strlen sub tolower toupper .
boolean float int string hexfmt fmtnum systime sec2gmt gmt2sec sec2hms sec2dhms hms2sec
dhms2sec fsec2hms fsec2dhms hms2fsec dhms2fsec
Please use "mlr --help-function {function name}" for function-specific help.
Please use "mlr --help-all-functions" or "mlr -f" for help on all functions.
Data-format options, for input, output, or both:
--dkvp --idkvp --odkvp Delimited key-value pairs, e.g "a=1,b=2" (default)
--nidx --inidx --onidx Implicitly-integer-indexed fields (Unix-toolkit style)
--csv --icsv --ocsv Comma-separated value (or tab-separated with --fs tab, etc.)
--pprint --ipprint --opprint --right Pretty-printed tabular (produces no output until all input is in)
--xtab --ixtab --oxtab Pretty-printed vertical-tabular
-p is a keystroke-saver for --nidx --fs space --repifs
Separator options, for input, output, or both:
--rs --irs --ors Record separators, e.g. 'lf' or '\r\n'
--fs --ifs --ofs --repifs Field separators, e.g. comma
--ps --ips --ops Pair separators, e.g. equals sign
Notes:
* IPS/OPS are only used for DKVP and XTAB formats, since only in these formats do key-value pairs appear juxtaposed.
* IRS/ORS are ignored for XTAB format. Nominally IFS and OFS are newlines; XTAB records are separated by
two or more consecutive IFS/OFS -- i.e. a blank line.
* OPS must be single-character for XTAB format, and OFS must be single-character for PPRINT format.
This is because they are used with repetition for alignment; multi-character separators
would make alignment impossible.
* DKVP, NIDX, CSVLITE, PPRINT, and XTAB formats are intended to handle platform-native text data.
In particular, this means LF line-terminators by default on Linux/OSX.
You can use "--dkvp --rs crlf" for CRLF-terminated DKVP files, and so on.
* CSV is intended to handle RFC-4180-compliant data.
In particular, this means it uses CRLF line-terminators by default.
You can use "--csv --rs lf" for Linux-native CSV files.
* You can specify separators in any of the following ways, shown by example:
- Type them out, quoting as necessary for shell escapes, e.g. "--fs '|' --ips :"
- C-style escape sequences, e.g. "--rs '\r\n' --fs '\t'".
- To avoid backslashing, you can use any of the following names:
cr crcr newline lf lflf crlf crlfcrlf tab space comma pipe slash colon semicolon equals
* Default separators by format:
File format RS FS PS
dkvp \n , =
csv \r\n , (N/A)
csvlite \n , (N/A)
nidx \n space (N/A)
xtab (N/A) \n space
pprint \n space (N/A)
Double-quoting for CSV output:
--quote-all Wrap all fields in double quotes
--quote-none Do not wrap any fields in double quotes, even if they have OFS or ORS in them
--quote-minimal Wrap fields in double quotes only if they have OFS or ORS in them (default)
--quote-numeric Wrap fields in double quotes only if they have numbers in them
Numerical formatting:
--ofmt {format} E.g. %.18lf, %.0lf. Please use sprintf-style codes for double-precision.
Applies to verbs which compute new values, e.g. put, stats1, stats2.
See also the fmtnum function within mlr put (mlr --help-all-functions).
Other options:
--seed {n} with n of the form 12345678 or 0xcafefeed. For put/filter urand().
Output of one verb may be chained as input to another using "then", e.g.
mlr stats1 -a min,mean,max -f flag,u,v -g color then sort -f color
Please see http://johnkerl.org/miller/doc and/or http://github.com/johnkerl/miller for more information.
This is Miller version &gt;= v2.2.0.
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr sort --help
Usage: mlr sort {flags}
Flags:
-f {comma-separated field names} Lexical ascending
-n {comma-separated field names} Numerical ascending; nulls sort last
-nf {comma-separated field names} Numerical ascending; nulls sort last
-r {comma-separated field names} Lexical descending
-nr {comma-separated field names} Numerical descending; nulls sort first
Sorts records primarily by the first specified field, secondarily by the second field, and so on.
Example:
mlr sort -f a,b -nr x,y,z
which is the same as:
mlr sort -f a -f b -nr x -nr y -nr z
</pre>
</div>
<p/>
<a id="then-chaining"/><h1>then-chaining</h1>
<p/>
In accord with the
<a href="http://en.wikipedia.org/wiki/Unix_philosophy">Unix philosophy</a>, you can pipe data into or out of
Miller. For example:
<p/>
<div class="pokipanel">
<pre>
mlr cut --complement -f os_version *.dat | mlr sort -f hostname,uptime
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
% cat piped.sh
mlr cut -x -f i,y data/big | mlr sort -n y &gt; /dev/null
% time sh piped.sh
real 0m2.828s
user 0m3.183s
sys 0m0.137s
% cat chained.sh
mlr cut -x -f i,y then sort -n y data/big &gt; /dev/null
% time sh chained.sh
real 0m2.082s
user 0m1.933s
sys 0m0.137s
</pre>
</div>
<p/>
<p/>
For better performance (avoiding redundant string-parsing and string-formatting
when you pipe Miller commands together), you can instead chain commands
together within a single Miller invocation using the <tt>then</tt> keyword:
<p/>
<div class="pokipanel">
<pre>
mlr cut --complement -f os_version then sort -f hostname,uptime *.dat
</pre>
</div>
<p/>
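<p/> The saving comes from parsing each record once on input and formatting it
once on output, however many verbs sit in between. The following Python sketch
only illustrates that idea; it is not Miller's implementation, and the helper
names <tt>parse_dkvp</tt>, <tt>format_dkvp</tt>, <tt>cut_complement</tt>,
<tt>sort_lexical</tt>, and <tt>run_chain</tt> are hypothetical.

```python
def parse_dkvp(line, ifs=",", ips="="):
    """Parse one DKVP line such as 'a=1,b=2' into an ordered dict."""
    return dict(pair.split(ips, 1) for pair in line.split(ifs))


def format_dkvp(record, ofs=",", ops="="):
    """Format a record back into a DKVP line."""
    return ofs.join(f"{k}{ops}{v}" for k, v in record.items())


def cut_complement(records, names):
    """Like 'cut -x -f names': drop the named fields from every record."""
    return [{k: v for k, v in r.items() if k not in names} for r in records]


def sort_lexical(records, name):
    """Like 'sort -f name': lexical ascending sort on one field."""
    return sorted(records, key=lambda r: r[name])


def run_chain(lines, verbs):
    """Parse once, apply every verb to in-memory records, format once."""
    records = [parse_dkvp(line) for line in lines]
    for verb in verbs:
        records = verb(records)
    return [format_dkvp(r) for r in records]


def drop_os_version(records):
    return cut_complement(records, {"os_version"})


def sort_by_hostname(records):
    return sort_lexical(records, "hostname")


lines = ["hostname=b,uptime=3,os_version=9", "hostname=a,uptime=5,os_version=8"]

# Chained: one parse and one format for the whole pipeline.
chained = run_chain(lines, [drop_os_version, sort_by_hostname])

# Piped: the intermediate text is formatted and then parsed again.
piped = run_chain(run_chain(lines, [drop_os_version]), [sort_by_hostname])

print(chained)
```

Both forms give the same records; the chained form simply skips the
intermediate round trip through text.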
<!-- ================================================================ -->
<a id="I/O_options"/><h1>I/O options</h1>
<!-- ================================================================ -->
<a id="Formats"/><h2>Formats</h2>
<p/> Options:
<pre>
--dkvp --idkvp --odkvp
--nidx --inidx --onidx
--csv --icsv --ocsv
--csvlite --icsvlite --ocsvlite
--pprint --ipprint --opprint --right
--xtab --ixtab --oxtab
</pre>
<p/> These are as discussed in <a href="file-formats.html">File formats</a>, with the exception of <tt>--right</tt>
which makes pretty-printed output right-aligned:
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint cat data/small
a b i x y
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint --right cat data/small
a b i x y
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td></tr></table>
<p/>Additional notes:
<ul>
<li/> Use <tt>--csv</tt>, <tt>--pprint</tt>, etc. when the input and output formats are the same.
<li/> Use <tt>--icsv --opprint</tt>, etc. when you want format conversion as part of what Miller does to your data.
<li/> DKVP (key-value-pair) format is the default for input and output. So,
<tt>--oxtab</tt> is the same as <tt>--idkvp --oxtab</tt>.
</ul>
<!-- ================================================================ -->
<a id="Record/field/pair_separators"/><h2>Record/field/pair separators</h2>
<p/> Miller has record separators <tt>IRS</tt> and <tt>ORS</tt>, field
separators <tt>IFS</tt> and <tt>OFS</tt>, and pair separators <tt>IPS</tt> and
<tt>OPS</tt>. For example, in the DKVP line <tt>a=1,b=2,c=3</tt>, the record
separator is newline, field separator is comma, and pair separator is the
equals sign. These are the default values.
<p/> Options:
<pre>
--rs --irs --ors
--fs --ifs --ofs --repifs
--ps --ips --ops
</pre>
<ul>
<li/> You can change a separator from input to output via e.g. <tt>--ifs =
--ofs :</tt>. Or, you can specify that the same separator is to be used for
input and output via e.g. <tt>--fs :</tt>.
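<li/> The pair separator is only relevant to DKVP and XTAB formats, since only
in these formats do key-value pairs appear juxtaposed.
<li/> XTAB format ignores the record separators, and the separators used for
alignment must be single-character: <tt>OPS</tt> for XTAB and <tt>OFS</tt> for
pretty-print.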
<li/> The <tt>--repifs</tt> option means that multiple successive occurrences of the
field separator count as one. For example, in CSV data we often signify nulls
by empty strings, e.g. <tt>2,9,,,,,6,5,4</tt>. On the other hand, if the field
separator is a space, it might be more natural to parse <tt>2 4&nbsp;&nbsp;&nbsp;&nbsp;5</tt> the
same as <tt>2 4 5</tt>: <tt>--repifs --ifs ' '</tt> lets this happen. In fact,
the <tt>--ipprint</tt> option above is internally implemented in terms of
<tt>--repifs</tt>.
<li/> Just write out the desired separator, e.g. <tt>--ofs '|'</tt>. But you
may use the symbolic names <tt>newline</tt>, <tt>space</tt>, <tt>tab</tt>,
<tt>pipe</tt>, or <tt>semicolon</tt> if you like.
</ul>
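<p/> The effect of <tt>--repifs</tt> on field splitting can be sketched in a
few lines of Python. This is an illustration of the behavior described above
rather than Miller's own parser, and the function name <tt>split_fields</tt> is
hypothetical.

```python
def split_fields(line, ifs=",", repifs=False):
    """Split a line on ifs; with repifs, a run of ifs counts as one separator."""
    fields = line.split(ifs)
    if repifs:
        # Consecutive separators produce empty strings; drop them.
        fields = [f for f in fields if f != ""]
    return fields


# Without repifs, empty strings between commas are kept as null fields.
csv_fields = split_fields("2,9,,,,,6,5,4")

# With repifs, repeated spaces collapse, so both lines parse identically.
wide = split_fields("2 4    5", ifs=" ", repifs=True)
narrow = split_fields("2 4 5", ifs=" ", repifs=True)

print(csv_fields)
print(wide, narrow)
```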
<!-- ================================================================ -->
<a id="Number_formatting"/><h2>Number formatting</h2>
<p/> The command-line option <tt>--ofmt {format string}</tt> is the global
number format for commands which generate numeric output, e.g.
<tt>stats1</tt>, <tt>stats2</tt>, <tt>histogram</tt>, and <tt>step</tt>, as
well as <tt>mlr put</tt>. Examples:
<p/>
<div class="pokipanel">
<pre>
--ofmt %.9le --ofmt %.6lf --ofmt %.0lf
</pre>
</div>
<p/>
<p/> These are just C <tt>printf</tt> formats applied to double-precision
numbers. Please don&rsquo;t use <tt>%s</tt> or <tt>%d</tt>. Additionally, if
you use leading width (e.g. <tt>%18.12lf</tt>) then the output will contain
embedded whitespace, which may not be what you want if you pipe the output to
something else, particularly CSV. I use Miller&rsquo;s pretty-print format
(<tt>mlr --opprint</tt>) to column-align numerical data.
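<p/> To see what a given format string will do before handing it to
<tt>--ofmt</tt>, it can help to try it elsewhere first. The sketch below uses
Python's <tt>%</tt> operator, which accepts the same C-style codes and ignores
the <tt>l</tt> length modifier; the helper name <tt>apply_ofmt</tt> is
hypothetical, and Miller's own output may differ in edge cases.

```python
def apply_ofmt(value, fmt):
    """Format one double-precision value with a printf-style code."""
    return fmt % value


value = 3.1 * 4.3

fixed = apply_ofmt(value, "%.6lf")      # six decimal places
rounded = apply_ofmt(value, "%.0lf")    # no decimal places
padded = apply_ofmt(value, "%18.12lf")  # leading width pads on the left

print(fixed)
print(rounded)
print("[" + padded + "]")  # brackets make the embedded whitespace visible
```

The padded form is 18 characters wide with leading spaces, which is the
embedded whitespace mentioned above.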
<p/> To apply formatting to a single field, overriding the global
<tt>ofmt</tt>, use the <tt>fmtnum</tt> function within <tt>mlr put</tt>. For example:
<p/>
<div class="pokipanel">
<pre>
$ echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08lf")'
x=3.1,y=4.3,z=13.330000
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08llx")'
x=0xffff,y=0xff,z=00feff01
</pre>
</div>
<p/>
<p/>Input conversion from hexadecimal is done automatically on fields handled
by <tt>mlr put</tt> and <tt>mlr filter</tt> as long as the field value begins
with "0x". To apply output conversion to hexadecimal on a single column, you
may use <tt>fmtnum</tt>, or the keystroke-saving <tt>hexfmt</tt> function.
Example:
<p/>
<div class="pokipanel">
<pre>
$ echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'
x=0xffff,y=0xff,z=0xfeff01
</pre>
</div>
<p/>
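<p/> The arithmetic behind the two hexadecimal examples above can be checked
with a short Python sketch. It assumes, as described, that a value beginning
with "0x" is read as a hexadecimal integer; the function name
<tt>parse_number</tt> is hypothetical and this is not Miller's own type
inference.

```python
def parse_number(text):
    """Return an int for '0x'-prefixed hex or plain integers, else a float."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    try:
        return int(text)
    except ValueError:
        return float(text)


x = parse_number("0xffff")
y = parse_number("0xff")

prefixed = hex(x * y)            # comparable to hexfmt($x*$y)
zero_padded = "%08x" % (x * y)   # comparable to fmtnum(int($x*$y),"%08llx")

print(prefixed)
print(zero_padded)
```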
<!-- ================================================================ -->
<a id="Data_transformations"/><h1>Data transformations</h1>
<!-- ================================================================ -->
<a id="cat"/><h2>cat</h2>
<p/> Most useful for format conversions (see
<a href="file-formats.html">File formats</a>), and for concatenating multiple
same-schema CSV files so that the output has a single header:
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ cat a.csv
a,b,c
1,2,3
4,5,6
</pre>
</div>
<p/>
</td> <td>
<p/>
<div class="pokipanel">
<pre>
$ cat b.csv
a,b,c
7,8,9
</pre>
</div>
<p/>
</td> <td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --csv cat a.csv b.csv
a,b,c
1,2,3
4,5,6
7,8,9
</pre>
</div>
<p/>
</td> <td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsv --oxtab cat a.csv b.csv
a 1
b 2
c 3
a 4
b 5
c 6
a 7
b 8
c 9
</pre>
</div>
<p/>
</td></tr></table>
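<p/> The single-header behavior can be sketched with Python's standard
<tt>csv</tt> module. This is only an illustration using the two small files
above as in-memory strings; the function name <tt>cat_csv</tt> is hypothetical.

```python
import csv
import io


def cat_csv(*texts):
    """Concatenate same-schema CSV texts, emitting the header only once."""
    out = io.StringIO()
    writer = None
    for text in texts:
        reader = csv.DictReader(io.StringIO(text))
        for row in reader:
            if writer is None:
                # The first data row fixes the header for the whole output.
                writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                                        lineterminator="\n")
                writer.writeheader()
            writer.writerow(row)
    return out.getvalue()


a_csv = "a,b,c\n1,2,3\n4,5,6\n"
b_csv = "a,b,c\n7,8,9\n"
combined = cat_csv(a_csv, b_csv)
print(combined, end="")
```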
<!-- ================================================================ -->
<a id="check"/><h2>check</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr check --help
Usage: mlr check
Consumes records without printing any output.
Useful for doing a well-formatted check on input data.
</pre>
</div>
<p/>
<!-- ================================================================ -->
<a id="count-distinct"/><h2>count-distinct</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr count-distinct --help
Usage: mlr count-distinct [options]
-f {a,b,c} Field names for distinct count.
Prints number of records having distinct values for specified field names. Same as uniq -c.
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr count-distinct -f a,b then sort -nr count data/medium
a=zee,b=wye,count=455
a=pan,b=eks,count=429
a=pan,b=pan,count=427
a=wye,b=hat,count=426
a=hat,b=wye,count=423
a=pan,b=hat,count=417
a=eks,b=hat,count=417
a=eks,b=eks,count=413
a=pan,b=zee,count=413
a=zee,b=hat,count=409
a=eks,b=wye,count=407
a=zee,b=zee,count=403
a=pan,b=wye,count=395
a=wye,b=pan,count=392
a=zee,b=eks,count=391
a=zee,b=pan,count=389
a=hat,b=eks,count=389
a=wye,b=eks,count=386
a=hat,b=zee,count=385
a=wye,b=zee,count=385
a=hat,b=hat,count=381
a=wye,b=wye,count=377
a=eks,b=pan,count=371
a=hat,b=pan,count=363
a=eks,b=zee,count=357
</pre>
</div>
<p/>
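<p/> Conceptually this is a tally keyed on the combination of the named
fields. A minimal Python sketch, assuming records are dicts and every record
has the named fields (the function name <tt>count_distinct</tt> and the sample
records are made up for illustration):

```python
from collections import Counter


def count_distinct(records, names):
    """Count records per distinct combination of the named fields."""
    counts = Counter(tuple(r[n] for n in names) for r in records)
    # One output record per combination, in first-seen order.
    return [dict(zip(names, key), count=n) for key, n in counts.items()]


records = [
    {"a": "pan", "b": "pan", "i": 1},
    {"a": "eks", "b": "pan", "i": 2},
    {"a": "pan", "b": "pan", "i": 3},
]
tallies = count_distinct(records, ["a", "b"])

# Comparable to chaining 'then sort -nr count'.
by_count = sorted(tallies, key=lambda r: r["count"], reverse=True)
print(by_count)
```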
<!-- ================================================================ -->
<a id="cut"/><h2>cut</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr cut --help
Usage: mlr cut [options]
-f {a,b,c} Field names to include for cut.
-o Retain fields in the order specified here in the argument list.
Default is to retain them in the order found in the input data.
-x|--complement Exclude, rather that include, field names specified by -f.
Passes through input records with specified fields included/excluded.
</pre>
</div>
<p/>
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint cat data/small
a b i x y
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint cut -f y,x,i data/small
i x y
1 0.3467901443380824 0.7268028627434533
2 0.7586799647899636 0.5221511083334797
3 0.20460330576630303 0.33831852551664776
4 0.38139939387114097 0.13418874328430463
5 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ echo 'a=1,b=2,c=3' | mlr cut -f b,c,a
a=1,b=2,c=3
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ echo 'a=1,b=2,c=3' | mlr cut -o -f b,c,a
b=2,c=3,a=1
</pre>
</div>
<p/>
</td></tr></table>
<p/>
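<p/> The difference between the default ordering, <tt>-o</tt>, and
<tt>-x</tt> can be summarized in a small Python sketch. It models a record as
an ordered dict; the function name <tt>cut</tt> and its keyword arguments are
hypothetical stand-ins for the flags.

```python
def cut(record, names, ordered=False, complement=False):
    """Keep (or, with complement, drop) the named fields of one record."""
    if complement:
        # Like -x: everything except the named fields, in input order.
        return {k: v for k, v in record.items() if k not in names}
    if ordered:
        # Like -o: fields in the order given on the command line.
        return {k: record[k] for k in names if k in record}
    # Default: fields in the order found in the input data.
    return {k: v for k, v in record.items() if k in names}


record = {"a": "1", "b": "2", "c": "3"}
default_order = list(cut(record, ["b", "c", "a"]))
given_order = list(cut(record, ["b", "c", "a"], ordered=True))
excluded = list(cut(record, ["b"], complement=True))
print(default_order, given_order, excluded)
```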
<!-- ================================================================ -->
<a id="filter"/><h2>filter</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr filter --help
Usage: mlr filter [-v] {expression}
Prints records for which {expression} evaluates to true.
With -v, first prints the AST (abstract syntax tree) for the expression, which
gives full transparency on the precedence and associativity rules of Miller's grammar.
Please use a dollar sign for field names and double-quotes for string literals.
Miller built-in variables are NF NR FNR FILENUM FILENAME PI E.
Examples:
mlr filter 'log10($count) &gt; 4.0'
mlr filter 'FNR == 2' (second record in each file)
mlr filter 'urand() &lt; 0.001' (subsampling)
mlr filter '$color != "blue" &amp;&amp; $value &gt; 4.2'
mlr filter '($x&lt;.5 &amp;&amp; $y&lt;.5) || ($x&gt;.5 &amp;&amp; $y&gt;.5)'
Please see http://johnkerl.org/miller/doc/reference.html for more information including function list.
</pre>
</div>
<p/>
<p/>Field names must be specified using a <tt>$</tt> in <tt>filter</tt> and
<a href="#put"><tt>put</tt></a> expressions, even though the dollar signs don&rsquo;t appear in the data
stream. For integer-indexed data, this looks like <tt>awk</tt>&rsquo;s
<tt>$1,$2,$3</tt>. Likewise, enclose string literals in double quotes in
<tt>filter</tt> expressions even though the quotes don&rsquo;t appear in file data.
In particular, <tt>mlr filter '$x=="abc"'</tt> passes through the record
<tt>x=abc</tt>.
<p/>The <tt>filter</tt> command supports the same built-in variables as
<a href="#put"><tt>put</tt></a>, all <tt>awk</tt>-inspired: <tt>NF</tt>,
<tt>NR</tt>, <tt>FNR</tt>, <tt>FILENUM</tt>, and <tt>FILENAME</tt>. This
selects the 2nd record from each matching file:
<p/>
<div class="pokipanel">
<pre>
$ mlr filter 'FNR == 2' data/small*
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
1=pan,2=pan,3=1,4=0.3467901443380824,5=0.7268028627434533
a=wye,b=eks,i=10000,x=0.734806020620654365,y=0.884788571337605134
</pre>
</div>
<p/>
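<p/> The distinction between the record counters can be sketched as follows.
This Python illustration assumes the <tt>awk</tt> convention, in which
<tt>NR</tt> counts across all files while <tt>FNR</tt> restarts at 1 for each
file; the function name <tt>number_records</tt> and the file names are
hypothetical.

```python
def number_records(files):
    """Yield (NR, FNR, FILENUM, FILENAME, record) for every record.

    files is a list of (filename, list-of-records) pairs.
    """
    nr = 0
    for filenum, (filename, records) in enumerate(files, start=1):
        for fnr, record in enumerate(records, start=1):
            nr += 1  # NR keeps counting across file boundaries
            yield nr, fnr, filenum, filename, record


files = [("first.dkvp", ["r1", "r2", "r3"]), ("second.dkvp", ["s1", "s2"])]

# Comparable to filter 'FNR == 2': the second record of each file.
second_per_file = [rec for nr, fnr, _, _, rec in number_records(files) if fnr == 2]

# Comparable to filter 'NR == 2': only the second record overall.
second_overall = [rec for nr, fnr, _, _, rec in number_records(files) if nr == 2]

print(second_per_file, second_overall)
```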
<p/>Expressions may be arbitrarily complex:
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint filter '$a == "pan" || $b == "wye"' data/small
a b i x y
pan pan 1 0.3467901443380824 0.7268028627434533
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
</pre>
</div>
<p/>
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint filter '($x &gt; 0.5 &amp;&amp; $y &gt; 0.5) || ($x &lt; 0.5 &amp;&amp; $y &lt; 0.5)' then stats2 -a corr -f x,y data/medium
x_y_corr
0.756439
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint filter '($x &gt; 0.5 &amp;&amp; $y &lt; 0.5) || ($x &lt; 0.5 &amp;&amp; $y &gt; 0.5)' then stats2 -a corr -f x,y data/medium
x_y_corr
-0.747994
</pre>
</div>
<p/>
</td></tr></table>
Newlines within the expression are ignored, which can help increase legibility of complex expressions:
<p/>
<div class="pokipanel">
<pre>
mlr --opprint filter '
($x &gt; 0.5 &amp;&amp; $y &lt; 0.5)
||
($x &lt; 0.5 &amp;&amp; $y &gt; 0.5)' \
then stats2 -a corr -f x,y data/medium
</pre>
</div>
<p/>
<!-- ================================================================ -->
<a id="group-by"/><h2>group-by</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr group-by --help
Usage: mlr group-by {comma-separated field names}
Outputs records in batches having identical values at specified field names.
</pre>
</div>
<p/>
<p/>This is similar to <tt>sort</tt> but with less work. Namely, Miller&rsquo;s
sort has three steps: read through the data and append each record to a linked
list of records, one list for each unique combination of the key-field values;
after all records are read, sort the key-field values; then print each
record-list. The group-by operation simply omits the middle sort. An example
should make this clearer.
<table><tr> <td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint group-by a data/small
a b i x y
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
eks wye 4 0.38139939387114097 0.13418874328430463
wye wye 3 0.20460330576630303 0.33831852551664776
wye pan 5 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td> <td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint sort -f a data/small
a b i x y
eks pan 2 0.7586799647899636 0.5221511083334797
eks wye 4 0.38139939387114097 0.13418874328430463
pan pan 1 0.3467901443380824 0.7268028627434533
wye wye 3 0.20460330576630303 0.33831852551664776
wye pan 5 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td> </tr></table>
<p/>In this example, since the sort is on field <tt>a</tt>, the first step is
to group together all records having the same value for field <tt>a</tt>; the
second step is to sort the distinct <tt>a</tt>-field values <tt>pan</tt>,
<tt>eks</tt>, and <tt>wye</tt> into <tt>eks</tt>, <tt>pan</tt>, and
<tt>wye</tt>; the third step is to print out the record-list for
<tt>a=eks</tt>, then the record-list for <tt>a=pan</tt>, then the record-list
for <tt>a=wye</tt>. The group-by operation omits the middle sort and just puts
like records together, for those times when a sort isn&rsquo;t desired. In
particular, group-by emits the groups in the order in which their key-field
values were first encountered in the data stream, which in some cases may be
more interesting to you.
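<p/> The three steps can be sketched in Python, with an insertion-ordered dict
of lists standing in for the linked lists of records. This is an illustration
of the description above, not Miller's C code; the function names
<tt>bucket</tt>, <tt>group_by</tt>, and <tt>sort_f</tt> are hypothetical, and
only the <tt>a</tt> and <tt>i</tt> fields of <tt>data/small</tt> are reproduced.

```python
def bucket(records, name):
    """Step 1: one list of records per distinct key, in first-seen order."""
    groups = {}
    for r in records:
        groups.setdefault(r[name], []).append(r)
    return groups


def group_by(records, name):
    """Emit each record-list in the order its key was first encountered."""
    groups = bucket(records, name)
    return [r for key in groups for r in groups[key]]


def sort_f(records, name):
    """Same as group_by, but with the middle step: sort the keys first."""
    groups = bucket(records, name)
    return [r for key in sorted(groups) for r in groups[key]]


small = [
    {"a": "pan", "i": 1},
    {"a": "eks", "i": 2},
    {"a": "wye", "i": 3},
    {"a": "eks", "i": 4},
    {"a": "wye", "i": 5},
]
grouped = [r["i"] for r in group_by(small, "a")]
ordered = [r["i"] for r in sort_f(small, "a")]
print(grouped, ordered)
```

The two results reproduce the record order of the two panels above.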
<!-- ================================================================ -->
<a id="group-like"/><h2>group-like</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr group-like --help
Usage: mlr group-like
Outputs records in batches having identical field names.
</pre>
</div>
<p/>
<p/> This groups together records having the same schema (i.e. same ordered list of field names)
which is useful for making sense of time-ordered output as described in
<a href="record-heterogeneity.html">Record-heterogeneity</a> &mdash; in particular, in
preparation for CSV or pretty-print output.
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr cat data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint group-like data/het.dkvp
resource loadsec ok
/path/to/file 0.45 true
/path/to/second/file 0.32 true
/some/other/path 0.97 false
record_count resource
100 /path/to/file
150 /path/to/second/file
</pre>
</div>
<p/>
</td></tr></table>
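<p/> Grouping by schema amounts to using each record's ordered list of field
names as the grouping key. A minimal Python sketch under that reading (the
function name <tt>group_like</tt> is hypothetical; the records mirror
<tt>data/het.dkvp</tt>):

```python
def group_like(records):
    """Batch records by their ordered field names, in first-seen order."""
    groups = {}
    for r in records:
        # tuple(r) is the record's field names in order.
        groups.setdefault(tuple(r), []).append(r)
    return list(groups.values())


het = [
    {"resource": "/path/to/file", "loadsec": "0.45", "ok": "true"},
    {"record_count": "100", "resource": "/path/to/file"},
    {"resource": "/path/to/second/file", "loadsec": "0.32", "ok": "true"},
    {"record_count": "150", "resource": "/path/to/second/file"},
    {"resource": "/some/other/path", "loadsec": "0.97", "ok": "false"},
]
batches = group_like(het)
for batch in batches:
    print(list(batch[0]), len(batch))
```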
<!-- ================================================================ -->
<a id="having-fields"/><h2>having-fields</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr having-fields --help
Usage: mlr having-fields [options]
--at-least {a,b,c}
--which-are {a,b,c}
--at-most {a,b,c}
Conditionally passes through records depending on each record's field names.
</pre>
</div>
<p/>
<p/> Similar to <a href="#group-like"><tt>group-like</tt></a>, this retains records with a specified schema.
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr cat data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr having-fields --at-least resource data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr having-fields --which-are resource,ok,loadsec data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true
resource=/path/to/second/file,loadsec=0.32,ok=true
resource=/some/other/path,loadsec=0.97,ok=false
</pre>
</div>
<p/>
</td></tr></table>
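<p/> One way to read the three options is as set comparisons between a
record's field names and the names given. The Python sketch below follows the
two examples above for <tt>--at-least</tt> and <tt>--which-are</tt>; the
reading of <tt>--at-most</tt> as "no fields outside the given list" is an
assumption, and the function name <tt>having_fields</tt> is hypothetical.

```python
def having_fields(record, names, mode):
    """Decide whether a record passes, based only on its field names."""
    have, want = set(record), set(names)
    if mode == "at-least":
        return want.issubset(have)   # all given names are present
    if mode == "which-are":
        return have == want          # exactly the given names, in any order
    if mode == "at-most":
        return have.issubset(want)   # assumed: no names beyond those given
    raise ValueError("unknown mode: " + mode)


full = {"resource": "/path/to/file", "loadsec": "0.45", "ok": "true"}
short = {"record_count": "100", "resource": "/path/to/file"}

print(having_fields(full, ["resource"], "at-least"))
print(having_fields(short, ["resource", "ok", "loadsec"], "which-are"))
```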
<!-- ================================================================ -->
<a id="head"/><h2>head</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr head --help
Usage: mlr head [options]
-n {count} Head count to print; default 10
-g {a,b,c} Optional group-by-field names for head counts
Passes through the first n records, optionally by category.
</pre>
</div>
<p/>
Note that <tt>head</tt> is distinct from <a href="#top"><tt>top</tt></a>:
<tt>head</tt> shows records which appear first in the data stream, while
<tt>top</tt> shows records whose field values are numerically largest (or smallest).
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint head -n 4 data/medium
a b i x y
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint head -n 1 -g b data/medium
a b i x y
pan pan 1 0.3467901443380824 0.7268028627434533
wye wye 3 0.20460330576630303 0.33831852551664776
eks zee 7 0.6117840605678454 0.1878849191181694
zee eks 17 0.29081949506712723 0.054478717073354166
wye hat 24 0.7286126830627567 0.19441962592638418
</pre>
</div>
<p/>
</td></tr></table>
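<p/>Grouped <tt>head</tt> can be pictured as a per-category counter that lets a record through only while its category is under quota. Below is a minimal Python sketch under that assumption; it is not Miller's code, and grouping records that lack a group-by field under <tt>None</tt> is a choice made here for simplicity.

```python
from collections import defaultdict

def head(records, n=10, group_by=()):
    """Yield the first n records per group, in stream order (sketch only)."""
    seen = defaultdict(int)
    for rec in records:
        # With no group-by fields, every record shares the empty key ().
        key = tuple(rec.get(g) for g in group_by)
        if seen[key] < n:
            seen[key] += 1
            yield rec

stream = [
    {"a": "pan", "b": "pan", "i": 1},
    {"a": "eks", "b": "pan", "i": 2},
    {"a": "wye", "b": "wye", "i": 3},
    {"a": "eks", "b": "wye", "i": 4},
]
for rec in head(stream, n=1, group_by=("b",)):
    print(rec)
```

Because records are passed through as soon as they qualify, nothing needs to be buffered, in contrast with <tt>tail</tt>.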
<!-- ================================================================ -->
<a id="histogram"/><h2>histogram</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr histogram --help
Usage: mlr histogram [options]
-f {a,b,c} Value-field names for histogram counts
--lo {lo} Histogram low value
--hi {hi} Histogram high value
--nbins {n} Number of histogram bins
Just a histogram. Input values &lt; lo or &gt; hi are not counted.
</pre>
</div>
<p/>
This is just a histogram; there&rsquo;s not too much to say here. A note about
binning, by example: Suppose you use <tt>--lo 0.0 --hi 1.0 --nbins 10 -f
x</tt>. The input numbers less than 0 or greater than 1 aren&rsquo;t counted
in any bin. Input numbers equal to 1 are counted in the last bin. That is, bin
0 has <tt>0.0 &le; x &lt; 0.1</tt>, bin 1 has <tt>0.1 &le; x &lt; 0.2</tt>,
etc., but bin 9 has <tt>0.9 &le; x &le; 1.0</tt>.
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint put '$x2=$x**2;$x3=$x2*$x' then histogram -f x,x2,x3 --lo 0 --hi 1 --nbins 10 data/medium
bin_lo bin_hi x_count x2_count x3_count
0.000000 0.100000 1072 3231 4661
0.100000 0.200000 938 1254 1184
0.200000 0.300000 1037 988 845
0.300000 0.400000 988 832 676
0.400000 0.500000 950 774 576
0.500000 0.600000 1002 692 476
0.600000 0.700000 1007 591 438
0.700000 0.800000 1007 560 420
0.800000 0.900000 986 571 383
0.900000 1.000000 1013 507 341
</pre>
</div>
<p/>
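<p/>The binning rule described above (half-open bins, except that the last bin also includes <tt>hi</tt>) can be sketched in a few lines of Python. This is an illustration of the rule as stated, not Miller's source; the names <tt>bin_index</tt> and <tt>histogram</tt> are hypothetical, and Miller's floating-point rounding at bin edges may differ.

```python
def bin_index(x, lo, hi, nbins):
    """Return the bin number for x, or None when x is out of range."""
    if x < lo or x > hi:
        return None            # not counted in any bin
    if x == hi:
        return nbins - 1       # the top edge goes in the last bin
    return int((x - lo) * nbins / (hi - lo))

def histogram(values, lo, hi, nbins):
    """Count how many values land in each of nbins equal-width bins."""
    counts = [0] * nbins
    for x in values:
        i = bin_index(x, lo, hi, nbins)
        if i is not None:
            counts[i] += 1
    return counts

print(histogram([0.0, 0.05, 0.1, 0.95, 1.0, 1.5, -0.1], 0.0, 1.0, 10))
```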
<!-- ================================================================ -->
<a id="join"/><h2>join</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr join --help
Usage: mlr join [options]
Joins records from specified left file name with records from all file names at the end of the Miller argument list.
Functionality is essentially the same as the system "join" command, but for record streams.
Options:
-f {left file name}
-j {a,b,c} Comma-separated join-field names for output
-l {a,b,c} Comma-separated join-field names for left input file; defaults to -j values if omitted.
-r {a,b,c} Comma-separated join-field names for right input file(s); defaults to -j values if omitted.
--lp {text} Additional prefix for non-join output field names from the left file
--rp {text} Additional prefix for non-join output field names from the right file(s)
--np Do not emit paired records
--ul Emit unpaired records from the left file
--ur Emit unpaired records from the right file(s)
-u Enable unsorted input. In this case, the entire left file will be loaded into memory.
Without -u, records must be sorted lexically by their join-field names, else not all
records will be paired.
File-format options default to those for the right file names on the Miller argument list, but may be overridden
for the left file as follows. Please see the main "mlr --help" for more information on syntax for these arguments.
-i {one of csv,dkvp,nidx,pprint,xtab}
--irs {record-separator character}
--ifs {field-separator character}
--ips {pair-separator character}
--repifs
--repips
--use-mmap
--no-mmap
Please see http://johnkerl.org/miller/doc/reference.html for more information including examples.
</pre>
</div>
<p/>
Examples:
<p/>Join a larger table of ID codes against a smaller ID-to-name lookup table, showing only paired records:
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint cat data/join-left-example.csv
id name
100 alice
200 bob
300 carol
400 david
500 edgar
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint cat data/join-right-example.csv
status idcode
present 400
present 100
missing 200
present 100
present 200
missing 100
missing 200
present 300
missing 600
present 400
present 400
present 300
present 100
missing 400
present 200
present 200
present 200
present 200
present 400
present 300
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint join -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv
id name status
400 david present
100 alice present
200 bob missing
100 alice present
200 bob present
100 alice missing
200 bob missing
300 carol present
400 david present
400 david present
300 carol present
100 alice present
400 david missing
200 bob present
200 bob present
200 bob present
200 bob present
400 david present
300 carol present
</pre>
</div>
<p/>
</td></tr></table>
<p/>Same, but sorting the right-hand input first, so that <tt>-u</tt> is not needed:
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint sort -f idcode then join -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv
id name status
100 alice present
100 alice present
100 alice missing
100 alice present
200 bob missing
200 bob present
200 bob missing
200 bob present
200 bob present
200 bob present
200 bob present
300 carol present
300 carol present
300 carol present
400 david present
400 david present
400 david present
400 david missing
400 david present
</pre>
</div>
<p/>
</td></tr></table>
<p/>Same, but showing only unpaired records:
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint join --np --ul --ur -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv
status idcode
missing 600
id name
500 edgar
</pre>
</div>
<p/>
</td></tr></table>
<p/>Use prefixing options to disambiguate between otherwise identical non-join field names:
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --csvlite --opprint cat data/self-join.csv data/self-join.csv
a b c
1 2 3
1 4 5
1 2 3
1 4 5
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --csvlite --opprint join -j a --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv
a left_b left_c right_b right_c
1 2 3 2 3
1 4 5 2 3
1 2 3 4 5
1 4 5 4 5
</pre>
</div>
<p/>
</td></tr></table>
<p/>Use zero join columns, which pairs every left record with every right record:
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --csvlite --opprint join -j "" --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv
left_a left_b left_c right_a right_b right_c
1 2 3 1 2 3
1 4 5 1 2 3
1 2 3 1 4 5
1 4 5 1 4 5
</pre>
</div>
<p/>
</td></tr></table>
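<p/>The unsorted (<tt>-u</tt>) mode described in the help text loads the whole left file into memory and then streams the right-hand records past it. A minimal Python sketch of that idea follows. It is an illustration rather than Miller's implementation: the function name and keyword arguments are invented here, the output join field simply keeps the left-hand name, and the <tt>--lp</tt>/<tt>--rp</tt> prefixing is left out.

```python
from collections import defaultdict

def join_unsorted(left, right, left_field, right_field,
                  emit_paired=True, emit_left_unpaired=False,
                  emit_right_unpaired=False):
    """Hash-join sketch: bucket the left records, then stream the right ones."""
    buckets = defaultdict(list)
    for rec in left:
        buckets[rec[left_field]].append(rec)

    matched = set()
    out = []
    for rrec in right:
        key = rrec[right_field]
        if key in buckets:
            matched.add(key)
            if emit_paired:
                for lrec in buckets[key]:
                    merged = dict(lrec)
                    # Add the right-hand fields other than its join field.
                    merged.update({k: v for k, v in rrec.items() if k != right_field})
                    out.append(merged)
        elif emit_right_unpaired:
            out.append(rrec)

    if emit_left_unpaired:
        # Left records whose key never appeared on the right.
        for key, recs in buckets.items():
            if key not in matched:
                out.extend(recs)
    return out

left = [{"id": "100", "name": "alice"}, {"id": "500", "name": "edgar"}]
right = [{"status": "present", "idcode": "100"}, {"status": "missing", "idcode": "600"}]
print(join_unsorted(left, right, "id", "idcode"))
```

Setting <tt>emit_paired=False</tt> together with both unpaired flags corresponds to the <tt>--np --ul --ur</tt> example above. Left-unpaired records can only be emitted at end of stream, since only then is it known that no right-hand record matched them.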
<!-- ================================================================ -->
<a id="label"/><h2>label</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr label --help
Usage: mlr label {new1,new2,new3,...}
Given n comma-separated names, renames the first n fields of each record to
have the respective name. (Fields past the nth are left with their original
names.) Particularly useful with --inidx, to give useful names to otherwise
integer-indexed fields.
</pre>
</div>
<p/>
See also <a href="#rename"><tt>rename</tt></a>.
<p/>Example: Files such as <tt>/etc/passwd</tt>, <tt>/etc/group</tt>, and so on
have implicit field names which are found in section-5 manpages. These field names may be made explicit as follows:
<p/>
<div class="pokipanel">
<pre>
% grep -v '^#' /etc/passwd | mlr --nidx --fs : --opprint label name,password,uid,gid,gecos,home_dir,shell | head
name password uid gid gecos home_dir shell
nobody * -2 -2 Unprivileged User /var/empty /usr/bin/false
root * 0 0 System Administrator /var/root /bin/sh
daemon * 1 1 System Services /var/root /usr/bin/false
_uucp * 4 4 Unix to Unix Copy Protocol /var/spool/uucp /usr/sbin/uucico
_taskgated * 13 13 Task Gate Daemon /var/empty /usr/bin/false
_networkd * 24 24 Network Services /var/networkd /usr/bin/false
_installassistant * 25 25 Install Assistant /var/empty /usr/bin/false
_lp * 26 26 Printing Services /var/spool/cups /usr/bin/false
_postfix * 27 27 Postfix Mail Server /var/spool/postfix /usr/bin/false
</pre>
</div>
<p/>
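<p/>Positional renaming as described in the help text can be sketched in Python as follows. This is only an illustration; the function name is hypothetical, and the behavior when a new name collides with a later existing field name is not specified by the text above (in this sketch the later field silently wins).

```python
def label(record, new_names):
    """Rename the first len(new_names) fields by position; keep the rest."""
    out = {}
    for i, key in enumerate(record):
        new_key = new_names[i] if i < len(new_names) else key
        out[new_key] = record[key]
    return out

# Integer-indexed fields, as with --inidx.
rec = {"1": "root", "2": "*", "3": "0"}
print(label(rec, ["name", "password"]))
```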
<!-- ================================================================ -->
<a id="put"/><h2>put</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr put --help
Usage: mlr put [-v] {expression}
Adds/updates specified field(s).
With -v, first prints the AST (abstract syntax tree) for the expression, which
gives full transparency on the precedence and associativity rules of Miller's grammar.
Please use a dollar sign for field names and double-quotes for string literals.
Miller built-in variables are NF NR FNR FILENUM FILENAME PI E.
Multiple assignments may be separated with a semicolon.
Examples:
mlr put '$y = log10($x); $z = sqrt($y)'
mlr put '$filename = FILENAME'
mlr put '$colored_shape = $color . "_" . $shape'
mlr put '$y = cos($theta); $z = atan2($y, $x)'
Please see http://johnkerl.org/miller/doc/reference.html for more information including function list.
</pre>
</div>
<p/>
<p/>Field names must be specified using a <tt>$</tt> in <a href="#filter"><tt>filter</tt></a> and <tt>put</tt>
expressions, even though the dollar signs don&rsquo;t appear in the data stream. For
integer-indexed data, this looks like <tt>awk</tt>&rsquo;s <tt>$1,$2,$3</tt>.
Likewise, enclose string literals in double quotes in <tt>put</tt>
expressions even though the quotes don&rsquo;t appear in file data. In particular,
<tt>mlr put '$x="abc"'</tt> creates the field <tt>x=abc</tt>.
<p/>Multiple expressions may be given, separated by semicolons, and each may refer to the ones before:
<p/>
<div class="pokipanel">
<pre>
$ ruby -e '10.times{|i|puts "i=#{i}"}' | mlr --opprint put '$j=$i+1;$k=$i+$j'
i j k
0 1.000000 1.000000
1 2.000000 3.000000
2 3.000000 5.000000
3 4.000000 7.000000
4 5.000000 9.000000
5 6.000000 11.000000
6 7.000000 13.000000
7 8.000000 15.000000
8 9.000000 17.000000
9 10.000000 19.000000
</pre>
</div>
<p/>
<p/>Miller supports the following five <tt>awk</tt>-inspired built-in variables for
<a href="#filter"><tt>filter</tt></a> and <tt>put</tt>: <tt>NF</tt>, <tt>NR</tt>,
<tt>FNR</tt>, <tt>FILENUM</tt>, and <tt>FILENAME</tt>. The constants <tt>PI</tt> and <tt>E</tt> are also built in.
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint put '$nf=NF; $nr=NR; $fnr=FNR; $filenum=FILENUM; $filename=FILENAME' data/small data/small2
a b i x y nf nr fnr filenum filename
pan pan 1 0.3467901443380824 0.7268028627434533 5 1 1 1 data/small
eks pan 2 0.7586799647899636 0.5221511083334797 5 2 2 1 data/small
wye wye 3 0.20460330576630303 0.33831852551664776 5 3 3 1 data/small
eks wye 4 0.38139939387114097 0.13418874328430463 5 4 4 1 data/small
wye pan 5 0.5732889198020006 0.8636244699032729 5 5 5 1 data/small
pan eks 9999 0.267481232652199086 0.557077185510228001 5 6 1 2 data/small2
wye eks 10000 0.734806020620654365 0.884788571337605134 5 7 2 2 data/small2
pan wye 10001 0.870530722602517626 0.009854780514656930 5 8 3 2 data/small2
hat wye 10002 0.321507044286237609 0.568893318795083758 5 9 4 2 data/small2
pan zee 10003 0.272054845593895200 0.425789896597056627 5 10 5 2 data/small2
</pre>
</div>
<p/>
Newlines within the expression are ignored, which can help increase legibility of complex expressions:
<p/>
<div class="pokipanel">
<pre>
mlr --opprint put '
$nf = NF;
$nr = NR;
$fnr = FNR;
$filenum = FILENUM;
$filename = FILENAME' \
data/small data/small2
</pre>
</div>
<p/>
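<p/>The relationship between the built-in counters in the table above can be sketched with two nested loops in Python. This is an illustration of what the counters mean, not of how Miller evaluates expressions; the function name and the in-memory representation of files are assumptions made for the example.

```python
def with_builtins(files):
    """files: list of (filename, list of records). Sketch of NF/NR/FNR/FILENUM/FILENAME."""
    nr = 0                                   # NR: record number across all files
    for filenum, (filename, records) in enumerate(files, start=1):
        for fnr, rec in enumerate(records, start=1):   # FNR restarts per file
            nr += 1
            nf = len(rec)                    # NF: field count before new fields are added
            out = dict(rec)
            out.update(nf=nf, nr=nr, fnr=fnr, filenum=filenum, filename=filename)
            yield out

files = [
    ("data/small", [{"a": "pan"}, {"a": "eks"}]),
    ("data/small2", [{"a": "wye"}]),
]
for row in with_builtins(files):
    print(row)
```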
<!-- ================================================================ -->
<a id="regularize"/><h2>regularize</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr regularize --help
Usage: mlr regularize
For records seen earlier in the data stream with same field names in a different order,
outputs them with field names in the previously encountered order.
Example: input records a=1,c=2,b=3, then e=4,d=5, then c=7,a=6,b=8
output as a=1,c=2,b=3, then e=4,d=5, then a=6,c=7,b=8
</pre>
</div>
<p/>
<p/>This exists since hash-map software in various languages and tools
encountered in the wild does not always print similar rows with fields in the
same order: <tt>mlr regularize</tt> helps clean that up.
<p/>See also <a href="#reorder"><tt>reorder</tt></a>.
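<p/>One way to picture <tt>regularize</tt> is as a lookup keyed by the sorted field names, remembering the order in which that set of names was first seen. The Python sketch below takes that approach; it is an assumption about how such a feature could work, not a description of Miller's internals.

```python
def regularize(records):
    """Emit each record with its fields in the first-seen order for that field set."""
    first_seen = {}
    for rec in records:
        key = tuple(sorted(rec))                 # order-independent signature
        order = first_seen.setdefault(key, list(rec))
        yield {name: rec[name] for name in order}

stream = [
    {"a": 1, "c": 2, "b": 3},
    {"e": 4, "d": 5},
    {"c": 7, "a": 6, "b": 8},
]
for rec in regularize(stream):
    print(rec)
```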
<!-- ================================================================ -->
<a id="rename"/><h2>rename</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr rename --help
Usage: mlr rename {old1,new1,old2,new2,...}
Renames specified fields.
</pre>
</div>
<p/>
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint cat data/small
a b i x y
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint rename i,INDEX,b,COLUMN2 data/small
a COLUMN2 INDEX x y
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td></tr></table>
<p/>As discussed in <a href="performance.html">Performance</a>, <tt>sed</tt>
is significantly faster than Miller at doing this. However, Miller is
format-aware, so it knows to do renames only within specified field keys and
not any others, nor in field values which may happen to contain the same
pattern. Example:
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ sed 's/y/COLUMN5/g' data/small
a=pan,b=pan,i=1,x=0.3467901443380824,COLUMN5=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,COLUMN5=0.5221511083334797
a=wCOLUMN5e,b=wCOLUMN5e,i=3,x=0.20460330576630303,COLUMN5=0.33831852551664776
a=eks,b=wCOLUMN5e,i=4,x=0.38139939387114097,COLUMN5=0.13418874328430463
a=wCOLUMN5e,b=pan,i=5,x=0.5732889198020006,COLUMN5=0.8636244699032729
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr rename y,COLUMN5 data/small
a=pan,b=pan,i=1,x=0.3467901443380824,COLUMN5=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,COLUMN5=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,COLUMN5=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,COLUMN5=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,COLUMN5=0.8636244699032729
</pre>
</div>
<p/>
</td></tr></table>
See also <a href="#label"><tt>label</tt></a>.
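<p/>The difference between a format-aware rename and a plain text substitution can be shown in a few lines of Python. This sketch handles only the default DKVP format with <tt>,</tt> and <tt>=</tt> separators; the function name is made up, and it is not how Miller is implemented.

```python
def rename_keys(line, old, new, fs=",", ps="="):
    """Rename a field key in one DKVP line, leaving all values untouched."""
    fields = []
    for field in line.split(fs):
        key, sep, value = field.partition(ps)
        fields.append((new if key == old else key) + sep + value)
    return fs.join(fields)

line = "a=wye,b=pan,i=5,x=0.57,y=0.86"
print(rename_keys(line, "y", "COLUMN5"))   # only the key y changes
print(line.replace("y", "COLUMN5"))        # sed-like: also rewrites the value wye
```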
<!-- ================================================================ -->
<a id="reorder"/><h2>reorder</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr reorder --help
Usage: mlr reorder [options]
-f {a,b,c} Field names to reorder.
-e Put specified field names at record end: default is to put at record start.
Example: mlr reorder -f a,b sends input record "d=4,b=2,a=1,c=3" to "a=1,b=2,d=4,c=3".
Example: mlr reorder -e -f a,b sends input record "d=4,b=2,a=1,c=3" to "d=4,c=3,a=1,b=2".
</pre>
</div>
<p/>
This pivots specified field names to the start or end of the record. This is useful,
for example, when you have data with many columns and you want to bring a field or
two to the front of the line where you can give them a quick visual scan.
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint cat data/small
a b i x y
pan pan 1 0.3467901443380824 0.7268028627434533
eks pan 2 0.7586799647899636 0.5221511083334797
wye wye 3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan 5 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint reorder -f i,b data/small
i b a x y
1 pan pan 0.3467901443380824 0.7268028627434533
2 pan eks 0.7586799647899636 0.5221511083334797
3 wye wye 0.20460330576630303 0.33831852551664776
4 wye eks 0.38139939387114097 0.13418874328430463
5 pan wye 0.5732889198020006 0.8636244699032729
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint reorder -e -f i,b data/small
a x y i b
pan 0.3467901443380824 0.7268028627434533 1 pan
eks 0.7586799647899636 0.5221511083334797 2 pan
wye 0.20460330576630303 0.33831852551664776 3 wye
eks 0.38139939387114097 0.13418874328430463 4 wye
wye 0.5732889198020006 0.8636244699032729 5 pan
</pre>
</div>
<p/>
</td></tr></table>
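<p/>The two examples in the help text can be reproduced with a short Python sketch. It is written to match those examples only; the function name is hypothetical, and names not present in a record are simply skipped here, which is an assumption.

```python
def reorder(record, names, at_end=False):
    """Move the named fields to the start (or end) of the record."""
    moved = [n for n in names if n in record]
    rest = [k for k in record if k not in names]
    order = rest + moved if at_end else moved + rest
    return {k: record[k] for k in order}

rec = {"d": 4, "b": 2, "a": 1, "c": 3}
print(reorder(rec, ["a", "b"]))
print(reorder(rec, ["a", "b"], at_end=True))
```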
<!-- ================================================================ -->
<a id="sort"/><h2>sort</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr sort --help
Usage: mlr sort {flags}
Flags:
-f {comma-separated field names} Lexical ascending
-n {comma-separated field names} Numerical ascending; nulls sort last
-nf {comma-separated field names} Numerical ascending; nulls sort last
-r {comma-separated field names} Lexical descending
-nr {comma-separated field names} Numerical descending; nulls sort first
Sorts records primarily by the first specified field, secondarily by the second field, and so on.
Example:
mlr sort -f a,b -nr x,y,z
which is the same as:
mlr sort -f a -f b -nr x -nr y -nr z
</pre>
</div>
<p/>
<p/>Example:
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint sort -f a -nr x data/small
a b i x y
eks pan 2 0.7586799647899636 0.5221511083334797
eks wye 4 0.38139939387114097 0.13418874328430463
pan pan 1 0.3467901443380824 0.7268028627434533
wye pan 5 0.5732889198020006 0.8636244699032729
wye wye 3 0.20460330576630303 0.33831852551664776
</pre>
</div>
<p/>
<p/>Here&rsquo;s an example sorting log data: suppose multiple threads (labeled here by color) are all logging progress counts to a single log file. The log file is (by nature) chronological, so the progress of various threads is interleaved:
<p/>
<div class="pokipanel">
<pre>
$ head -n 10 data/multicountdown.dat
upsec=0.002,color=green,count=1203
upsec=0.083,color=red,count=3817
upsec=0.188,color=red,count=3801
upsec=0.395,color=blue,count=2697
upsec=0.526,color=purple,count=953
upsec=0.671,color=blue,count=2684
upsec=0.899,color=purple,count=926
upsec=0.912,color=red,count=3798
upsec=1.093,color=blue,count=2662
upsec=1.327,color=purple,count=917
</pre>
</div>
<p/>
<p/> We can group these by thread by sorting on the thread ID (here,
<tt>color</tt>). Since Miller&rsquo;s sort is stable, this means that
timestamps within each thread&rsquo;s log data are still chronological:
<p/>
<div class="pokipanel">
<pre>
$ head -n 20 data/multicountdown.dat | mlr --opprint sort -f color
upsec color count
0.395 blue 2697
0.671 blue 2684
1.093 blue 2662
2.064 blue 2659
2.2880000000000003 blue 2647
0.002 green 1203
1.407 green 1187
1.448 green 1177
2.313 green 1161
0.526 purple 953
0.899 purple 926
1.327 purple 917
1.703 purple 908
0.083 red 3817
0.188 red 3801
0.912 red 3798
1.416 red 3788
1.587 red 3782
1.601 red 3755
1.832 red 3717
</pre>
</div>
<p/>
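<p/>The same stability property can be seen with Python's built-in <tt>sorted</tt>, which is also a stable sort. The sketch below uses the first ten log records shown above; it illustrates the property, not Miller's sort implementation.

```python
log = [
    {"upsec": 0.002, "color": "green", "count": 1203},
    {"upsec": 0.083, "color": "red", "count": 3817},
    {"upsec": 0.188, "color": "red", "count": 3801},
    {"upsec": 0.395, "color": "blue", "count": 2697},
    {"upsec": 0.526, "color": "purple", "count": 953},
    {"upsec": 0.671, "color": "blue", "count": 2684},
    {"upsec": 0.899, "color": "purple", "count": 926},
    {"upsec": 0.912, "color": "red", "count": 3798},
    {"upsec": 1.093, "color": "blue", "count": 2662},
    {"upsec": 1.327, "color": "purple", "count": 917},
]

# Sort on the thread label only; ties keep their original (chronological) order.
by_thread = sorted(log, key=lambda r: r["color"])
for rec in by_thread:
    print(rec["upsec"], rec["color"], rec["count"])
```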
<!-- ================================================================ -->
<a id="stats1"/><h2>stats1</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr stats1 --help
Usage: mlr stats1 [options]
Options:
-a {sum,count,...} Names of accumulators: p10 p25.2 p50 p98 p100 etc. and/or one or more of
count mode sum mean stddev var meaneb min max
-f {a,b,c} Value-field names on which to compute statistics
-g {d,e,f} Optional group-by-field names
Example: mlr stats1 -a min,p10,p50,p90,max -f value -g size,shape
Example: mlr stats1 -a count,mode -f size
Example: mlr stats1 -a count,mode -f size -g shape
Notes:
* p50 is a synonym for median.
* min and max output the same results as p0 and p100, respectively, but use less memory.
* count and mode allow text input; the rest require numeric input. In particular, 1 and 1.0
are distinct text for count and mode.
* When there are mode ties, the first-encountered datum wins.
</pre>
</div>
<p/>
These are simple univariate statistics on one or more number-valued fields
(<tt>count</tt> and <tt>mode</tt> apply to non-numeric fields as well),
optionally categorized by one or more other fields.
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --oxtab stats1 -a count,sum,min,p10,p50,mean,p90,max -f x,y data/medium
x_count 10000
x_sum 4986.019682
x_min 0.000045
x_p10 0.093322
x_p50 0.501159
x_mean 0.498602
x_p90 0.900794
x_max 0.999953
y_count 10000
y_sum 5062.057445
y_min 0.000088
y_p10 0.102132
y_p50 0.506021
y_mean 0.506206
y_p90 0.905366
y_max 0.999965
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint stats1 -a mean -f x,y -g b then sort -f b data/medium
b x_mean y_mean
eks 0.506361 0.510293
hat 0.487899 0.513118
pan 0.497304 0.499599
wye 0.497593 0.504596
zee 0.504242 0.502997
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint stats1 -a p50,p99 -f u,v -g color then put '$ur=$u_p99/$u_p50;$vr=$v_p99/$v_p50' data/colored-shapes.dkvp
color u_p50 u_p99 v_p50 v_p99 ur vr
yellow 0.501019 0.989046 0.520630 0.987034 1.974069 1.895845
red 0.485038 0.990054 0.492586 0.994444 2.041189 2.018823
purple 0.501319 0.988893 0.504571 0.988287 1.972582 1.958668
green 0.502015 0.990764 0.505359 0.990175 1.973574 1.959350
blue 0.525226 0.992655 0.485170 0.993873 1.889958 2.048505
orange 0.483548 0.993635 0.480913 0.989102 2.054884 2.056717
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint count-distinct -f shape then sort -nr count data/colored-shapes.dkvp
shape count
square 4115
triangle 3372
circle 2591
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint stats1 -a mode -f color -g shape data/colored-shapes.dkvp
shape color_mode
triangle red
square red
circle red
</pre>
</div>
<p/>
</td></tr></table>
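<p/>The notes on <tt>count</tt> and <tt>mode</tt> in the help text (text input is allowed, <tt>1</tt> and <tt>1.0</tt> are distinct, and the first-encountered value wins ties) can be sketched in Python as below. The function name is invented for illustration and this is not Miller's accumulator code.

```python
def count_and_mode(values):
    """Return (count, mode) over text values; ties go to the first-seen value."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # dicts keep insertion order and max() returns the first maximum it meets,
    # so the earliest value among tied candidates is chosen.
    mode = max(counts, key=counts.get) if counts else None
    return len(values), mode

print(count_and_mode(["red", "blue", "blue", "red"]))
print(count_and_mode(["1", "1.0", "1.0"]))
```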
<!-- ================================================================ -->
<a id="stats2"/><h2>stats2</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr stats2 --help
Usage: mlr stats2 [options]
-a {linreg-ols,corr,...} Names of accumulators: one or more of
linreg-pca linreg-ols r2 corr cov covx
r2 is a quality metric for linreg-ols; linrec-pca outputs its own quality metric.
-f {a,b,c,d} Value-field name-pairs on which to compute statistics.
There must be an even number of names.
-g {e,f,g} Optional group-by-field names.
-v Print additional output for linreg-pca.
Example: mlr stats2 -a linreg-pca -f x,y
Example: mlr stats2 -a linreg-ols,r2 -f x,y -g size,shape
Example: mlr stats2 -a corr -f x,y
</pre>
</div>
<p/>
These are simple bivariate statistics on one or more pairs of number-valued
fields, optionally categorized by one or more fields.
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --oxtab put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a cov,corr -f x,y,y,y,x2,xy,x2,y2 data/medium
x_y_cov 0.000043
x_y_corr 0.000504
y_y_cov 0.084611
y_y_corr 1.000000
x2_xy_cov 0.041884
x2_xy_corr 0.630174
x2_y2_cov -0.000310
x2_y2_corr -0.003425
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a linreg-ols,r2 -f x,y,y,y,xy,y2 -g a data/medium
a x_y_ols_m x_y_ols_b x_y_ols_n x_y_r2 y_y_ols_m y_y_ols_b y_y_ols_n y_y_r2 xy_y2_ols_m xy_y2_ols_b xy_y2_ols_n xy_y2_r2
pan 0.017026 0.500403 2081 0.000287 1.000000 0.000000 2081 1.000000 0.878132 0.119082 2081 0.417498
eks 0.040780 0.481402 1965 0.001646 1.000000 0.000000 1965 1.000000 0.897873 0.107341 1965 0.455632
wye -0.039153 0.525510 1966 0.001505 1.000000 0.000000 1966 1.000000 0.853832 0.126745 1966 0.389917
zee 0.002781 0.504307 2047 0.000008 1.000000 0.000000 2047 1.000000 0.852444 0.124017 2047 0.393566
hat -0.018621 0.517901 1941 0.000352 1.000000 0.000000 1941 1.000000 0.841230 0.135573 1941 0.368794
</pre>
</div>
<p/>
</td></tr></table>
<p/>Here&rsquo;s an example of a simple line fit. The <tt>x</tt> and <tt>y</tt>
fields of the <tt>data/medium</tt> dataset are independent and uniformly
distributed on the unit interval. Here we remove half the data and fit a line to what remains.
<p/>
<div class="pokipanel">
<pre>
mlr filter '($x&lt;.5 &amp;&amp; $y&lt;.5) || ($x&gt;.5 &amp;&amp; $y&gt;.5)' data/medium &gt; data/medium-squares
mlr --ofs newline stats2 -a linreg-pca -f x,y data/medium-squares
x_y_pca_m=1.014419
x_y_pca_b=0.000308
x_y_pca_quality=0.861354
# Set x_y_pca_m and x_y_pca_b as shell variables
eval $(mlr --ofs newline stats2 -a linreg-pca -f x,y data/medium-squares)
# In addition to x and y, make a new yfit which is the line fit. Plot using your favorite tool.
mlr --onidx put '$yfit='$x_y_pca_m'*$x+'$x_y_pca_b then cut -x -f a,b,i data/medium-squares \
| pgr -p -title 'linreg-pca example' -xmin 0 -xmax 1 -ymin 0 -ymax 1
</pre>
</div>
<p/>
<p/>I use <a href="https://github.com/johnkerl/pgr"><tt>pgr</tt></a> for
plotting; here&rsquo;s a screenshot.
<center>
<img src="data/linreg-example.jpg"/>
</center>
<p/> (Thanks Drew Kunas for a good conversation about PCA!)
<p/> Here&rsquo;s an example estimating time-to-completion for a set of jobs.
Input data comes from a log file, with number of work units left to do in the
<tt>count</tt> field and accumulated seconds in the <tt>upsec</tt> field,
labeled by the <tt>color</tt> field:
<p/>
<div class="pokipanel">
<pre>
$ head -n 10 data/multicountdown.dat
upsec=0.002,color=green,count=1203
upsec=0.083,color=red,count=3817
upsec=0.188,color=red,count=3801
upsec=0.395,color=blue,count=2697
upsec=0.526,color=purple,count=953
upsec=0.671,color=blue,count=2684
upsec=0.899,color=purple,count=926
upsec=0.912,color=red,count=3798
upsec=1.093,color=blue,count=2662
upsec=1.327,color=purple,count=917
</pre>
</div>
<p/>
We can do a linear regression on count remaining as a function of time: with <tt>c = m*u+b</tt> we want to find the
time when the count goes to zero, i.e. <tt>u=-b/m</tt>.
<p/>
<div class="pokipanel">
<pre>
$ mlr --oxtab stats2 -a linreg-pca -f upsec,count -g color then put '$donesec = -$upsec_count_pca_b/$upsec_count_pca_m' data/multicountdown.dat
color green
upsec_count_pca_m -32.756917
upsec_count_pca_b 1213.722730
upsec_count_pca_n 24
upsec_count_pca_quality 0.999984
donesec 37.052410
color red
upsec_count_pca_m -37.367646
upsec_count_pca_b 3810.133400
upsec_count_pca_n 30
upsec_count_pca_quality 0.999989
donesec 101.963431
color blue
upsec_count_pca_m -29.231212
upsec_count_pca_b 2698.932820
upsec_count_pca_n 25
upsec_count_pca_quality 0.999959
donesec 92.330514
color purple
upsec_count_pca_m -39.030097
upsec_count_pca_b 979.988341
upsec_count_pca_n 21
upsec_count_pca_quality 0.999991
donesec 25.108529
</pre>
</div>
<p/>
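<p/>The arithmetic behind <tt>donesec</tt> can be sketched in Python. Note that this sketch fits the line by ordinary least squares, which is a different fit from the <tt>linreg-pca</tt> accumulator used in the command above; it is meant only to show how <tt>u = -b/m</tt> falls out of a fitted slope and intercept. The function names and the sample data are made up.

```python
def linreg_ols(xs, ys):
    """Ordinary least-squares fit y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def done_time(upsecs, counts):
    """Time at which the fitted count reaches zero: solve 0 = m*u + b for u."""
    m, b = linreg_ols(upsecs, counts)
    return -b / m

# Hypothetical job that starts at 100 units and completes 4 units per second.
upsecs = [0, 1, 2, 3, 4]
counts = [100, 96, 92, 88, 84]
print(done_time(upsecs, counts))
```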
<!-- ================================================================ -->
<a id="step"/><h2>step</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr step --help
Usage: mlr step [options]
-a {delta,rsum,...} Names of steppers: one or more of
delta ratio rsum counter
-f {a,b,c} Value-field names on which to compute statistics
-g {d,e,f} Group-by-field names
Computes values dependent on the previous record, optionally grouped by category.
</pre>
</div>
<p/>
Most Miller commands are record-at-a-time, with the exception of <tt>stats1</tt>,
<tt>stats2</tt>, and <tt>histogram</tt> which compute aggregate output. The
<tt>step</tt> command is intermediate: it allows the option of adding fields
which are functions of fields from previous records. Rsum is short for <i>running sum</i>.
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint step -a delta,rsum,counter -f x data/medium | head -15
a b i x y x_delta x_rsum x_counter
pan pan 1 0.3467901443380824 0.7268028627434533 0.346790 0.346790 1
eks pan 2 0.7586799647899636 0.5221511083334797 0.411890 1.105470 2
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077 1.310073 3
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796 1.691473 4
wye pan 5 0.5732889198020006 0.8636244699032729 0.191890 2.264762 5
zee pan 6 0.5271261600918548 0.49322128674835697 -0.046163 2.791888 6
eks zee 7 0.6117840605678454 0.1878849191181694 0.084658 3.403672 7
zee wye 8 0.5985540091064224 0.976181385699006 -0.013230 4.002226 8
hat wye 9 0.03144187646093577 0.7495507603507059 -0.567112 4.033668 9
pan wye 10 0.5026260055412137 0.9526183602969864 0.471184 4.536294 10
pan pan 11 0.7930488423451967 0.6505816637259333 0.290423 5.329343 11
zee pan 12 0.3676141320555616 0.23614420670296965 -0.425435 5.696957 12
eks pan 13 0.4915175580479536 0.7709126592971468 0.123903 6.188474 13
eks zee 14 0.5207382318405251 0.34141681118811673 0.029221 6.709213 14
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint step -a delta,rsum,counter -f x -g a data/medium | head -15
a b i x y x_delta x_rsum x_counter
pan pan 1 0.3467901443380824 0.7268028627434533 0.346790 0.346790 1
eks pan 2 0.7586799647899636 0.5221511083334797 0.758680 0.758680 1
wye wye 3 0.20460330576630303 0.33831852551664776 0.204603 0.204603 1
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281 1.140079 2
wye pan 5 0.5732889198020006 0.8636244699032729 0.368686 0.777892 2
zee pan 6 0.5271261600918548 0.49322128674835697 0.527126 0.527126 1
eks zee 7 0.6117840605678454 0.1878849191181694 0.230385 1.751863 3
zee wye 8 0.5985540091064224 0.976181385699006 0.071428 1.125680 2
hat wye 9 0.03144187646093577 0.7495507603507059 0.031442 0.031442 1
pan wye 10 0.5026260055412137 0.9526183602969864 0.155836 0.849416 2
pan pan 11 0.7930488423451967 0.6505816637259333 0.290423 1.642465 3
zee pan 12 0.3676141320555616 0.23614420670296965 -0.230940 1.493294 3
eks pan 13 0.4915175580479536 0.7709126592971468 -0.120267 2.243381 4
eks zee 14 0.5207382318405251 0.34141681118811673 0.029221 2.764119 5
</pre>
</div>
<p/>
</td></tr></table>
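<p/>The <tt>delta</tt>, <tt>rsum</tt>, and <tt>counter</tt> steppers can be sketched in Python as per-group state carried from one record to the next. This is an illustration, not Miller's code. In the output shown above, the first delta in each group equals the value itself, which this sketch reproduces by starting the previous value at zero; that starting value is inferred from the output rather than documented.

```python
from collections import defaultdict

def step(records, field, group_by=()):
    """Add <field>_delta, <field>_rsum and <field>_counter to each record."""
    prev = defaultdict(float)     # previous value per group, initially 0.0
    rsum = defaultdict(float)     # running sum per group
    counter = defaultdict(int)    # record count per group
    for rec in records:
        key = tuple(rec.get(g) for g in group_by)
        x = float(rec[field])
        rsum[key] += x
        counter[key] += 1
        out = dict(rec)
        out[field + "_delta"] = x - prev[key]
        out[field + "_rsum"] = rsum[key]
        out[field + "_counter"] = counter[key]
        prev[key] = x
        yield out

stream = [
    {"a": "pan", "x": "0.3"},
    {"a": "eks", "x": "0.7"},
    {"a": "eks", "x": "0.4"},
]
for row in step(stream, "x", group_by=("a",)):
    print(row)
```

Only a few numbers per group are retained, so memory use grows with the number of distinct groups, not with the number of records.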
Example deriving the change in the five-minute load average (field 11) from <tt>uptime</tt> output:
<p/>
<div class="pokipanel">
<pre>
$ each 10 uptime | mlr -p step -a delta -f 11
...
20:08 up 36 days, 10:38, 5 users, load averages: 1.42 1.62 1.73 0.000000
20:08 up 36 days, 10:38, 5 users, load averages: 1.55 1.64 1.74 0.020000
20:08 up 36 days, 10:38, 7 users, load averages: 1.58 1.65 1.74 0.010000
20:08 up 36 days, 10:38, 9 users, load averages: 1.78 1.69 1.76 0.040000
20:08 up 36 days, 10:39, 9 users, load averages: 2.12 1.76 1.78 0.070000
20:08 up 36 days, 10:39, 9 users, load averages: 2.51 1.85 1.81 0.090000
20:08 up 36 days, 10:39, 8 users, load averages: 2.79 1.92 1.83 0.070000
20:08 up 36 days, 10:39, 4 users, load averages: 2.64 1.90 1.83 -0.020000
</pre>
</div>
<p/>
<!-- ================================================================ -->
<a id="tac"/><h2>tac</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr tac --help
Usage: mlr tac
Prints records in reverse order from the order in which they were encountered.
</pre>
</div>
<p/>
<p/>Prints the records in the input stream in reverse order. Note: this
requires Miller to retain all input records in memory before any output records
are produced.
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsv --opprint cat a.csv
a b c
1 2 3
4 5 6
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsv --opprint cat b.csv
a b c
7 8 9
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsv --opprint tac a.csv b.csv
a b c
7 8 9
4 5 6
1 2 3
</pre>
</div>
<p/>
</td></tr></table>
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsv --opprint put '$filename=FILENAME' then tac a.csv b.csv
a b c filename
7 8 9 b.csv
4 5 6 a.csv
1 2 3 a.csv
</pre>
</div>
<p/>
</td></tr></table>
<!-- ================================================================ -->
<a id="tail"/><h2>tail</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr tail --help
Usage: mlr tail [options]
-n {count} Tail count to print; default 10
-g {a,b,c} Optional group-by-field names for tail counts
Passes through the last n records, optionally by category.
</pre>
</div>
<p/>
<p/> Prints the last <i>n</i> records in the input stream, optionally by category.
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint tail -n 4 data/colored-shapes.dkvp
color shape flag i u v w x
blue square 1 99974 0.6189062525431605 0.2637962404841453 0.5311465405784674 6.210738209085753
blue triangle 0 99976 0.008110504040268474 0.8267274952432482 0.4732962944898885 6.146956761817328
yellow triangle 0 99990 0.3839424618160777 0.55952913620132 0.5113763011485609 4.307973891915119
yellow circle 1 99994 0.764950884927175 0.25284227383991364 0.49969878539567425 5.013809741826425
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint tail -n 1 -g shape data/colored-shapes.dkvp
color shape flag i u v w x
yellow triangle 0 99990 0.3839424618160777 0.55952913620132 0.5113763011485609 4.307973891915119
blue square 1 99974 0.6189062525431605 0.2637962404841453 0.5311465405784674 6.210738209085753
yellow circle 1 99994 0.764950884927175 0.25284227383991364 0.49969878539567425 5.013809741826425
</pre>
</div>
<p/>
</td></tr></table>
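<p/> The grouped behavior of <tt>tail -n ... -g ...</tt> can be sketched in a few lines of Python. This is only an illustration of the semantics shown above, not Miller's implementation: the function name <tt>tail</tt>, its parameters, and the choice to skip records lacking a group-by field are assumptions made for the example.

```python
from collections import OrderedDict, deque

def tail(records, n=10, group_by=None):
    """Keep the last n records, per distinct group-by value when group_by is given."""
    groups = OrderedDict()  # groups are emitted in first-seen order
    for rec in records:
        if group_by and any(name not in rec for name in group_by):
            continue  # assumption: records lacking a group-by field are skipped
        key = tuple(rec[name] for name in group_by) if group_by else ()
        # A bounded deque silently drops the oldest record once n are held.
        groups.setdefault(key, deque(maxlen=n)).append(rec)
    return [rec for kept in groups.values() for rec in kept]
```

<p/> With a bounded buffer per group, memory use grows with <i>n</i> times the number of distinct groups rather than with the size of the input.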
<!-- ================================================================ -->
<a id="top"/><h2>top</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr top --help
Usage: mlr top [options]
-f {a,b,c} Value-field names for top counts
-g {d,e,f} Optional group-by-field names for top counts
-n {count} How many records to print per category; default 1
-a Print all fields for top-value records; default is
to print only value and group-by fields.
--min Print top smallest values; default is top largest values
Prints the n records with smallest/largest values at specified fields, optionally by category.
</pre>
</div>
<p/>
Note that <tt>top</tt> is distinct from <a href="#head"><tt>head</tt></a>:
<tt>head</tt> shows the records which appear first in the data stream, while
<tt>top</tt> shows the records with the numerically largest (or smallest) values in the specified fields.
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint top -n 4 -f x data/medium
top_idx x_top
1 0.999953
2 0.999823
3 0.999733
4 0.999563
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint top -n 2 -f x -g a then sort -f a data/medium
a top_idx x_top
eks 1 0.998811
eks 2 0.998534
hat 1 0.999953
hat 2 0.999733
pan 1 0.999403
pan 2 0.999044
wye 1 0.999823
wye 2 0.999264
zee 1 0.999490
zee 2 0.999438
</pre>
</div>
<p/>
</td></tr></table>
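<p/> As a rough sketch of what <tt>top -n ... -f ... -g ...</tt> computes, the following Python groups records and picks the largest (or smallest) values per group. The function name and parameters are hypothetical; only the output field names <tt>top_idx</tt> and <tt>x_top</tt> follow the examples above. It is not Miller's implementation, and for simplicity it keeps every value per group in memory.

```python
import heapq
from collections import OrderedDict

def top(records, field, n=1, group_by=(), use_min=False):
    """Return the n largest (or smallest) values of `field` per group."""
    groups = OrderedDict()  # group key -> list of numeric values
    for rec in records:
        # Assumption: records lacking the value field or a group-by field are skipped.
        if field not in rec or any(g not in rec for g in group_by):
            continue
        key = tuple(rec[g] for g in group_by)
        groups.setdefault(key, []).append(float(rec[field]))
    pick = heapq.nsmallest if use_min else heapq.nlargest
    out = []
    for key, values in groups.items():
        for idx, value in enumerate(pick(n, values), start=1):
            row = OrderedDict(zip(group_by, key))
            row["top_idx"] = idx
            row[field + "_top"] = value
            out.append(row)
    return out
```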
<!-- ================================================================ -->
<a id="uniq"/><h2>uniq</h2>
<p/>
<div class="pokipanel">
<pre>
$ mlr uniq --help
Usage: mlr uniq [options]
-g {d,e,f} Group-by-field names for uniq counts
-c Show repeat counts in addition to unique values
Prints distinct values for specified field names. With -c, same as count-distinct.
</pre>
</div>
<p/>
<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ wc -l data/colored-shapes.dkvp
10078 data/colored-shapes.dkvp
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr uniq -g color,shape data/colored-shapes.dkvp
color=yellow,shape=triangle
color=red,shape=square
color=red,shape=circle
color=purple,shape=triangle
color=yellow,shape=circle
color=purple,shape=square
color=yellow,shape=square
color=red,shape=triangle
color=green,shape=triangle
color=green,shape=square
color=blue,shape=circle
color=blue,shape=triangle
color=purple,shape=circle
color=blue,shape=square
color=green,shape=circle
color=orange,shape=triangle
color=orange,shape=square
color=orange,shape=circle
</pre>
</div>
<p/>
</td></tr><tr><td>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint uniq -g color,shape -c then sort -f color,shape data/colored-shapes.dkvp
color shape count
blue circle 384
blue square 589
blue triangle 497
green circle 287
green square 454
green triangle 368
orange circle 68
orange square 128
orange triangle 107
purple circle 289
purple square 481
purple triangle 372
red circle 1207
red square 1874
red triangle 1560
yellow circle 356
yellow square 589
yellow triangle 468
</pre>
</div>
<p/>
</td></tr></table>
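<p/> The two <tt>uniq</tt> examples above amount to collecting distinct combinations of the group-by fields, optionally with a count. A minimal Python sketch, assuming first-seen output order (as in the unsorted example) and a hypothetical function name and signature:

```python
from collections import OrderedDict

def uniq(records, group_by, show_counts=False):
    """Distinct combinations of the group-by fields, optionally with repeat counts."""
    counts = OrderedDict()  # preserves first-seen order of each combination
    for rec in records:
        # Assumption: records lacking any group-by field are skipped.
        if any(name not in rec for name in group_by):
            continue
        key = tuple(rec[name] for name in group_by)
        counts[key] = counts.get(key, 0) + 1
    out = []
    for key, count in counts.items():
        row = OrderedDict(zip(group_by, key))
        if show_counts:
            row["count"] = count
        out.append(row)
    return out
```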
<!-- ================================================================ -->
<a id="Functions_for_filter_and_put"/><h1>Functions for filter and put</h1>
<p/>
<div class="pokipanel">
<pre>
$ mlr --help-all-functions
abs (math: #args=1): Absolute value.
acos (math: #args=1): Inverse trigonometric cosine.
acosh (math: #args=1): Inverse hyperbolic cosine.
asin (math: #args=1): Inverse trigonometric sine.
asinh (math: #args=1): Inverse hyperbolic sine.
atan (math: #args=1): One-argument arctangent.
atan2 (math: #args=2): Two-argument arctangent.
atanh (math: #args=1): Inverse hyperbolic tangent.
cbrt (math: #args=1): Cube root.
ceil (math: #args=1): Ceiling: nearest integer at or above.
cos (math: #args=1): Trigonometric cosine.
cosh (math: #args=1): Hyperbolic cosine.
erf (math: #args=1): Error function.
erfc (math: #args=1): Complementary error function.
exp (math: #args=1): Exponential function e**x.
expm1 (math: #args=1): e**x - 1.
floor (math: #args=1): Floor: nearest integer at or below.
invqnorm (math: #args=1): Inverse of normal cumulative distribution function. Note that invqnorm(urand()) is normally distributed.
log (math: #args=1): Natural (base-e) logarithm.
log10 (math: #args=1): Base-10 logarithm.
log1p (math: #args=1): log(1+x).
max (math: #args=2): max of two numbers; null loses
min (math: #args=2): min of two numbers; null loses
pow (math: #args=2): Exponentiation; same as **.
qnorm (math: #args=1): Normal cumulative distribution function.
round (math: #args=1): Round to nearest integer.
roundm (math: #args=2): Round to nearest multiple of m: roundm($x,$m) is the same as round($x/$m)*$m
sin (math: #args=1): Trigonometric sine.
sinh (math: #args=1): Hyperbolic sine.
sqrt (math: #args=1): Square root.
tan (math: #args=1): Trigonometric tangent.
tanh (math: #args=1): Hyperbolic tangent.
urand (math: #args=0): Floating-point numbers on the unit interval. Int-valued example: '$n=floor(20+urand()*11)'.
+ (math: #args=2): Addition.
- (math: #args=1): Unary minus.
- (math: #args=2): Subtraction.
* (math: #args=2): Multiplication.
/ (math: #args=2): Division.
% (math: #args=2): Remainder; never negative-valued.
** (math: #args=2): Exponentiation; same as pow.
== (boolean: #args=2): String/numeric equality. Mixing number and string results in string compare.
!= (boolean: #args=2): String/numeric inequality. Mixing number and string results in string compare.
&gt; (boolean: #args=2): String/numeric greater-than. Mixing number and string results in string compare.
&gt;= (boolean: #args=2): String/numeric greater-than-or-equals. Mixing number and string results in string compare.
&lt; (boolean: #args=2): String/numeric less-than. Mixing number and string results in string compare.
&lt;= (boolean: #args=2): String/numeric less-than-or-equals. Mixing number and string results in string compare.
&amp;&amp; (boolean: #args=2): Logical AND.
|| (boolean: #args=2): Logical OR.
! (boolean: #args=1): Logical negation.
strlen (string: #args=1): String length.
sub (string: #args=3): Example: '$name=sub($name, "old", "new")'. Regexes not supported.
tolower (string: #args=1): Convert string to lowercase.
toupper (string: #args=1): Convert string to uppercase.
. (string: #args=2): String concatenation.
boolean (conversion: #args=1): Convert int/float/bool/string to boolean.
float (conversion: #args=1): Convert int/float/bool/string to float.
int (conversion: #args=1): Convert int/float/bool/string to int.
string (conversion: #args=1): Convert int/float/bool/string to string.
hexfmt (conversion: #args=1): Convert int to string, e.g. 255 to "0xff".
fmtnum (conversion: #args=2): Convert int/float/bool to string using printf-style format string, e.g. "%06lld".
systime (time: #args=0): Floating-point seconds since the epoch, e.g. 1440768801.748936.
sec2gmt (time: #args=1): Formats seconds since epoch (integer part only) as GMT timestamp, e.g. sec2gmt(1440768801.7) = "2015-08-28T13:33:21Z".
gmt2sec (time: #args=1): Parses GMT timestamp as integer seconds since epoch.
sec2hms (time: #args=1): Formats integer seconds as in sec2hms(5000) = "01:23:20"
sec2dhms (time: #args=1): Formats integer seconds as in sec2dhms(500000) = "5d18h53m20s"
hms2sec (time: #args=1): Recovers integer seconds as in hms2sec("01:23:20") = 5000
dhms2sec (time: #args=1): Recovers integer seconds as in dhms2sec("5d18h53m20s") = 500000
fsec2hms (time: #args=1): Formats floating-point seconds as in fsec2hms(5000.25) = "01:23:20.250000"
fsec2dhms (time: #args=1): Formats floating-point seconds as in fsec2dhms(500000.25) = "5d18h53m20.250000s"
hms2fsec (time: #args=1): Recovers floating-point seconds as in hms2fsec("01:23:20.250000") = 5000.250000
dhms2fsec (time: #args=1): Recovers floating-point seconds as in dhms2fsec("5d18h53m20.250000s") = 500000.250000
To set the seed for urand, you may specify decimal or hexadecimal 32-bit
numbers of the form "mlr --seed 123456789" or "mlr --seed 0xcafefeed".
Miller's built-in variables are NF, NR, FNR, FILENUM, and FILENAME (awk-like)
along with the mathematical constants PI and E.
</pre>
</div>
<p/>
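<p/> A few of the functions listed above are defined by small formulas, and a Python sketch may make the help text more concrete. The Python functions below are illustrative stand-ins that reproduce the examples quoted in the listing; they are not Miller's code. They assume non-negative integer seconds, and Python's <tt>round</tt> may treat exact halves differently from Miller's.

```python
def roundm(x, m):
    """Round x to the nearest multiple of m, i.e. round(x/m)*m."""
    return round(x / m) * m

def sec2hms(seconds):
    """Format integer seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return "%02d:%02d:%02d" % (hours, minutes, secs)

def sec2dhms(seconds):
    """Format integer seconds as days, hours, minutes, and seconds."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return "%dd%02dh%02dm%02ds" % (days, hours, minutes, secs)

def hms2sec(text):
    """Recover integer seconds from an HH:MM:SS string."""
    hours, minutes, secs = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + secs
```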
<!-- ================================================================ -->
<a id="Data_types"/><h1>Data types</h1>
<p/> Miller&rsquo;s input and output are all string-oriented: there is (as of
August 2015 anyway) no support for binary record packing. In this sense,
everything is a string in and out of Miller. During processing, field names
are always strings, even if they have names like "3"; field values are usually
strings. Whether a field value can be interpreted as a non-string type
matters only when a comparison or function operation is done on it. It is an
error if Miller encounters non-numeric (or otherwise mistyped) data in a field
on which it has been asked to do numeric (or otherwise type-specific)
operations.
<p/> Field values are treated as numeric for the following:
<ul>
<li/> Numeric sort: <tt>mlr sort -n</tt>, <tt>mlr sort -nr</tt>.
<li/> Statistics: <tt>mlr histogram</tt>, <tt>mlr stats1</tt>, <tt>mlr stats2</tt>.
<li/> Cross-record arithmetic: <tt>mlr step</tt>.
</ul>
<p/>For <tt>mlr put</tt> and <tt>mlr filter</tt>:
<ul>
<li/> Miller&rsquo;s types for function processing are <b>null</b> (empty string), <b>error</b>, <b>string</b>, <b>float</b> (double-precision), <b>int</b> (64-bit signed), and <b>boolean</b>.
<li/> On input, string values representable as numbers (e.g. "3" or "3.1") are treated as float. If a record
has <tt>x=1,y=2</tt> then <tt>mlr put '$z=$x+$y'</tt> will produce <tt>x=1,y=2,z=3.000000</tt> (the sum is a float), and
<tt>mlr put '$z=$x.$y'</tt> gives an error, since <tt>.</tt> concatenates strings and both values have been read as floats. To coerce back to string for
processing, use the <tt>string</tt> function:
<tt>mlr put '$z=string($x).string($y)'</tt> will produce <tt>x=1,y=2,z=12</tt>.
<li/> On input, string values representable as boolean (e.g. <tt>"true"</tt>,
<tt>"false"</tt>) are <i>not</i> automatically treated as boolean.
(This is because <tt>"true"</tt> and <tt>"false"</tt> are ordinary words, and auto string-to-boolean
on a column consisting of words would result in some strings mixed with some booleans.)
Use the <tt>boolean</tt> function to coerce: e.g. given the record <tt>x=1,y=2,w=false</tt>,
<tt>mlr put '$z=($x&lt;$y) || boolean($w)'</tt> sets <tt>z</tt> to true.
<li/> Functions take types as described in <tt>mlr --help-all-functions</tt>: for example, <tt>log10</tt>
takes float input and produces float output, <tt>gmt2sec</tt> maps string to int, and <tt>sec2gmt</tt>
maps int to string.
<li/> All math functions described in <tt>mlr --help-all-functions</tt> take integer as well as float input.
</ul>
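<p/> The input-typing rules in the list above can be summarized by a small Python sketch. The names <tt>infer</tt> and <tt>to_boolean</tt> are hypothetical, and Python's <tt>float</tt> parser accepts some spellings (such as "inf" or surrounding whitespace) that Miller may not, so treat this as an approximation of the rules rather than a specification.

```python
def infer(value):
    """Classify an input field value as null, float, or string."""
    if value == "":
        return None           # null: key present, value empty
    try:
        return float(value)   # "3" and "3.1" are both read as float
    except ValueError:
        return value          # anything else, including "true", stays a string

def to_boolean(value):
    """Explicit coercion, standing in for the boolean function (strings only here)."""
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError("not representable as boolean: %r" % (value,))

# The record x=1,y=2,w=false from the text: the comparison is true, so z is true.
rec = {"x": "1", "y": "2", "w": "false"}
z = infer(rec["y"]) > infer(rec["x"]) or to_boolean(rec["w"])
```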
<!-- ================================================================ -->
<a id="Null_data"/><h1>Null data</h1>
<p/> One of Miller&rsquo;s key features is its support for <b>heterogeneous</b> data.
Accordingly, if you try to sort on field <tt>hostname</tt> when not all records in the data
stream <i>have</i> a field named <tt>hostname</tt>, it is not an error (although you could
pre-filter the data stream using <tt>mlr having-fields --at-least hostname then sort ...</tt>).
Rather, records lacking one or more sort keys are simply output contiguously by <tt>mlr sort</tt>.
<p/> A field value may also be null: the key is present but the value is empty,
e.g. when sending <tt>x=,y=2</tt> to <tt>mlr put '$z=$x+$y'</tt>.
<p/>
Rules for null-handling:
<ul>
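<li> In ascending sorts, records with one or more null sort-field values sort after records with all sort-field values present; descending sorts reverse this, so such records come first. (Pretty-print output shows an empty value as <tt>-</tt>.)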
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint cat data/sort-null.dat
a b
3 2
1 8
- 4
5 7
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint sort -n a data/sort-null.dat
a b
1 8
3 2
5 7
- 4
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint sort -nr a data/sort-null.dat
a b
- 4
5 7
3 2
1 8
</pre>
</div>
<p/>
<li> Functions which have one or more null arguments produce null output: e.g.
<p/>
<div class="pokipanel">
<pre>
$ echo 'x=2,y=3' | mlr put '$a=$x+$y'
x=2,y=3,a=5.000000
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo 'x=,y=3' | mlr put '$a=$x+$y'
x=,y=3,a=
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'
x=,y=3,a=,b=1.098612
</pre>
</div>
<p/>
<li> The <tt>min</tt> and <tt>max</tt> functions are special: if one argument is non-null, it wins:
<p/>
<div class="pokipanel">
<pre>
$ echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'
x=,y=3,a=3.000000,b=3.000000
</pre>
</div>
<p/>
</ul>
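<p/> The three null-handling rules above can be modeled in Python as follows. This is a sketch under stated assumptions, not Miller's implementation: null is represented as the empty string, all helper names are hypothetical, and non-null arguments are assumed to parse as numbers.

```python
import math

def null_propagating(fn):
    """Wrap fn so that any null (empty-string) argument yields null output."""
    def wrapped(*args):
        if any(arg == "" for arg in args):
            return ""
        return fn(*(float(arg) for arg in args))
    return wrapped

add = null_propagating(lambda x, y: x + y)
log = null_propagating(math.log)

def null_losing(pick):
    """Wrap min or max so that a non-null argument wins over a null one."""
    def wrapped(x, y):
        present = [float(v) for v in (x, y) if v != ""]
        return pick(present) if present else ""
    return wrapped

mlr_min = null_losing(min)
mlr_max = null_losing(max)

def sort_numeric(records, field, reverse=False):
    """Numeric sort: nulls last when ascending, first when descending."""
    def key(rec):
        value = rec[field]
        return (value == "", 0.0 if value == "" else float(value))
    return sorted(records, key=key, reverse=reverse)
```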
</div>
</td>
</table>
</body>
</html>