mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 10:15:36 +00:00
2487 lines
80 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
<html lang="en">
|
|
|
|
<!-- PAGE GENERATED FROM template.html and content-for-reference.html BY poki. -->
|
|
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
|
|
<head>
|
|
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
|
|
<meta name="description" content="Miller documentation"/>
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
|
|
<meta name="keywords"
|
|
content="John Kerl, Kerl, Miller, miller, mlr, OLAP, data analysis software, regression, correlation, variance, data tools, " />
|
|
|
|
<title> Reference </title>
|
|
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
|
|
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
|
|
</head>
|
|
|
|
<!-- ================================================================ -->
|
|
<script type="text/javascript">
|
|
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
|
|
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
|
|
</script>
|
|
<script type="text/javascript">
|
|
try {
|
|
var pageTracker = _gat._getTracker("UA-15651652-1");
|
|
pageTracker._trackPageview();
|
|
} catch(err) {}
|
|
</script>
|
|
|
|
<script type="text/javascript">
|
|
function toggle(divName) {
|
|
var eleDiv = document.getElementById(divName);
|
|
if (eleDiv != null) {
|
|
if (eleDiv.style.display == "block") {
|
|
eleDiv.style.display = "none";
|
|
} else {
|
|
eleDiv.style.display = "block";
|
|
}
|
|
}
|
|
}
|
|
</script>
|
|
|
|
<!--
|
|
The background image is from a screenshot of a Google search for "data analysis
|
|
tools", lightened and sepia-toned. Over this was placed a Mac Terminal app with
|
|
very light-grey font and translucent background, in which a few statistical
|
|
Miller commands were run with pretty-print-tabular output format.
|
|
-->
|
|
<body background="pix/sepia-overlay.jpg">
|
|
|
|
<!-- ================================================================ -->
|
|
<table width="100%">
|
|
<tr>
|
|
|
|
<!-- navbar -->
|
|
<td width="15%">
|
|
<!--
|
|
<img src="pix/mlr.jpg" />
|
|
<img style="border-width:1px; color:black;" src="pix/mlr.jpg" />
|
|
-->
|
|
|
|
<div class="pokinav">
|
|
<center><titleinbody>Miller</titleinbody></center>
|
|
|
|
<!-- PAGE LIST GENERATED FROM template.html BY poki -->
|
|
<br/>User info:
|
|
<br/>• <a href="index.html">About Miller</a>
|
|
<br/>• <a href="file-formats.html">File formats</a>
|
|
<br/>• <a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a>
|
|
<br/>• <a href="record-heterogeneity.html">Record-heterogeneity</a>
|
|
<br/>• <a href="performance.html">Performance</a>
|
|
<br/>• <a href="etymology.html">Why call it Miller?</a>
|
|
<br/>• <a href="originality.html">How original is Miller?</a>
|
|
<br/>• <a href="reference.html"><b>Reference</b></a>
|
|
<br/>• <a href="data-examples.html">Data examples</a>
|
|
<br/>• <a href="internationalization.html">Internationalization</a>
|
|
<br/>• <a href="to-do.html">Things to do</a>
|
|
<br/>Developer info:
|
|
<br/>• <a href="build.html">Compiling, portability, dependencies, and testing</a>
|
|
<br/>• <a href="whyc.html">Why C?</a>
|
|
<br/>• <a href="contact.html">Contact information</a>
|
|
<br/>• <a href="https://github.com/johnkerl/miller">GitHub repo</a>
|
|
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
|
|
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
|
|
<br/> <br/> <br/> <br/> <br/> <br/>
|
|
</div>
|
|
</td>
|
|
|
|
<!-- page body -->
|
|
<td>
|
|
<div style="overflow-y:scroll;height:1500px">
|
|
<center> <titleinbody> Reference </titleinbody> </center>
|
|
<p/>
|
|
|
|
<!-- BODY COPIED FROM content-for-reference.html BY poki -->
|
|
<div class="pokitoc">
|
|
<center><b>Contents:</b></center>
|
|
• <a href="#Command_overview">Command overview</a><br/>
|
|
• <a href="#On-line_help">On-line help</a><br/>
|
|
• <a href="#then-chaining">then-chaining</a><br/>
|
|
• <a href="#I/O_options">I/O options</a><br/>
|
|
• <a href="#Formats">Formats</a><br/>
|
|
• <a href="#Record/field/pair_separators">Record/field/pair separators</a><br/>
|
|
• <a href="#Number_formatting">Number formatting</a><br/>
|
|
• <a href="#Data_transformations">Data transformations</a><br/>
|
|
• <a href="#cat">cat</a><br/>
|
|
• <a href="#check">check</a><br/>
|
|
• <a href="#count-distinct">count-distinct</a><br/>
|
|
• <a href="#cut">cut</a><br/>
|
|
• <a href="#filter">filter</a><br/>
|
|
• <a href="#group-by">group-by</a><br/>
|
|
• <a href="#group-like">group-like</a><br/>
|
|
• <a href="#having-fields">having-fields</a><br/>
|
|
• <a href="#head">head</a><br/>
|
|
• <a href="#histogram">histogram</a><br/>
|
|
• <a href="#join">join</a><br/>
|
|
• <a href="#label">label</a><br/>
|
|
• <a href="#put">put</a><br/>
|
|
• <a href="#regularize">regularize</a><br/>
|
|
• <a href="#rename">rename</a><br/>
|
|
• <a href="#reorder">reorder</a><br/>
|
|
• <a href="#sort">sort</a><br/>
|
|
• <a href="#stats1">stats1</a><br/>
|
|
• <a href="#stats2">stats2</a><br/>
|
|
• <a href="#step">step</a><br/>
|
|
• <a href="#tac">tac</a><br/>
|
|
• <a href="#tail">tail</a><br/>
|
|
• <a href="#top">top</a><br/>
|
|
• <a href="#uniq">uniq</a><br/>
|
|
• <a href="#Functions_for_filter_and_put">Functions for filter and put</a><br/>
|
|
• <a href="#Data_types">Data types</a><br/>
|
|
• <a href="#Null_data">Null data</a><br/>
|
|
</div>
|
|
<p/>
|
|
|
|
<a id="Command_overview"/><h1>Command overview</h1>
|
|
|
|
<p>
|
|
Whereas the Unix toolkit is made of the separate executables <tt>cat</tt>, <tt>tail</tt>, <tt>cut</tt>,
|
|
<tt>sort</tt>, etc., Miller has subcommands, invoked as follows:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
mlr tac *.dat
|
|
mlr cut --complement -f os_version *.dat
|
|
mlr sort -f hostname,uptime *.dat
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>These fall into categories as follows:
|
|
|
|
<table border=1>
|
|
<tr class="mlrbg">
|
|
<th>Commands </th>
|
|
<th>Description</th>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<a href="#cat"><tt>cat</tt></a>,
|
|
<a href="#cut"><tt>cut</tt></a>,
|
|
<a href="#head"><tt>head</tt></a>,
|
|
<a href="#sort"><tt>sort</tt></a>,
|
|
<a href="#tac"><tt>tac</tt></a>,
|
|
<a href="#tail"><tt>tail</tt></a>,
|
|
<a href="#top"><tt>top</tt></a>,
|
|
<a href="#uniq"><tt>uniq</tt></a>
|
|
</td>
|
|
<td> Analogs of their Unix-toolkit namesakes, discussed below as well as in
|
|
<a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a> </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="#filter"><tt>filter</tt></a>,
|
|
<a href="#put"><tt>put</tt></a>,
|
|
<a href="#step"><tt>step</tt></a>
|
|
</td>
|
|
<td> <tt>awk</tt>-like functionality </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="#histogram"><tt>histogram</tt></a>,
|
|
<a href="#stats1"><tt>stats1</tt></a>,
|
|
<a href="#stats2"><tt>stats2</tt></a>
|
|
</td>
|
|
<td> Statistically oriented </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="#group-by"><tt>group-by</tt></a>,
|
|
<a href="#group-like"><tt>group-like</tt></a>,
|
|
<a href="#having-fields"><tt>having-fields</tt></a>
|
|
</td>
|
|
<td> Particularly oriented toward <a href="record-heterogeneity.html">Record-heterogeneity</a>, although
|
|
all Miller commands can handle heterogeneous records </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="#count-distinct"><tt>count-distinct</tt></a>,
|
|
<a href="#label"><tt>label</tt></a>,
|
|
<a href="#regularize"><tt>regularize</tt></a>,
|
|
<a href="#rename"><tt>rename</tt></a>,
|
|
<a href="#reorder"><tt>reorder</tt></a>
|
|
</td>
|
|
<td> These draw from other sources (see also <a href="originality.html">How original is Miller?</a>):
|
|
<a href="#count-distinct"><tt>count-distinct</tt></a> is SQL-ish, and
|
|
<a href="#rename"><tt>rename</tt></a> can be done by <tt>sed</tt> (which does it faster:
|
|
see <a href="performance.html">Performance</a>).
|
|
</td>
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
<a id="On-line_help"/><h1>On-line help</h1>
|
|
|
|
<p/>Examples:<p/>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --help
|
|
Usage: mlr [I/O options] {verb} [verb-dependent options ...] {file names}
|
|
Verbs:
|
|
cat check count-distinct cut filter group-by group-like having-fields head
|
|
histogram join label put regularize rename reorder sort stats1 stats2 step tac tail top
|
|
uniq
|
|
Example: mlr --csv --rs lf --fs tab cut -f hostname,uptime file1.csv file2.csv
|
|
Please use "mlr {verb name} --help" for verb-specific help.
|
|
Please use "mlr --help-all-verbs" for help on all verbs.
|
|
|
|
Functions for filter and put:
|
|
abs acos acosh asin asinh atan atan2 atanh cbrt ceil cos cosh erf erfc exp
|
|
expm1 floor invqnorm log log10 log1p max min pow qnorm round roundm sin sinh sqrt tan
|
|
tanh urand + - - * / % ** == != > >= < <= && || ! strlen sub tolower toupper .
|
|
boolean float int string hexfmt fmtnum systime sec2gmt gmt2sec sec2hms sec2dhms hms2sec
|
|
dhms2sec fsec2hms fsec2dhms hms2fsec dhms2fsec
|
|
Please use "mlr --help-function {function name}" for function-specific help.
|
|
Please use "mlr --help-all-functions" or "mlr -f" for help on all functions.
|
|
|
|
Data-format options, for input, output, or both:
|
|
--dkvp --idkvp --odkvp Delimited key-value pairs, e.g "a=1,b=2" (default)
|
|
--nidx --inidx --onidx Implicitly-integer-indexed fields (Unix-toolkit style)
|
|
--csv --icsv --ocsv Comma-separated value (or tab-separated with --fs tab, etc.)
|
|
--pprint --ipprint --opprint --right Pretty-printed tabular (produces no output until all input is in)
|
|
--xtab --ixtab --oxtab Pretty-printed vertical-tabular
|
|
-p is a keystroke-saver for --nidx --fs space --repifs
|
|
Separator options, for input, output, or both:
|
|
--rs --irs --ors Record separators, e.g. 'lf' or '\r\n'
|
|
--fs --ifs --ofs --repifs Field separators, e.g. comma
|
|
--ps --ips --ops Pair separators, e.g. equals sign
|
|
Notes:
|
|
* IPS/OPS are only used for DKVP and XTAB formats, since only in these formats do key-value pairs appear juxtaposed.
|
|
* IRS/ORS are ignored for XTAB format. Nominally IFS and OFS are newlines; XTAB records are separated by
|
|
two or more consecutive IFS/OFS -- i.e. a blank line.
|
|
* OPS must be single-character for XTAB format, and OFS must be single-character for PPRINT format.
|
|
This is because they are used with repetition for alignment; multi-character separators
|
|
would make alignment impossible.
|
|
* DKVP, NIDX, CSVLITE, PPRINT, and XTAB formats are intended to handle platform-native text data.
|
|
In particular, this means LF line-terminators by default on Linux/OSX.
|
|
You can use "--dkvp --rs crlf" for CRLF-terminated DKVP files, and so on.
|
|
* CSV is intended to handle RFC-4180-compliant data.
|
|
In particular, this means it uses CRLF line-terminators by default.
|
|
You can use "--csv --rs lf" for Linux-native CSV files.
|
|
* You can specify separators in any of the following ways, shown by example:
|
|
- Type them out, quoting as necessary for shell escapes, e.g. "--fs '|' --ips :"
|
|
- C-style escape sequences, e.g. "--rs '\r\n' --fs '\t'".
|
|
- To avoid backslashing, you can use any of the following names:
|
|
cr crcr newline lf lflf crlf crlfcrlf tab space comma pipe slash colon semicolon equals
|
|
* Default separators by format:
|
|
File format RS FS PS
|
|
dkvp \n , =
|
|
csv \r\n , (N/A)
|
|
csvlite \n , (N/A)
|
|
nidx \n space (N/A)
|
|
xtab (N/A) \n space
|
|
pprint \n space (N/A)
|
|
Double-quoting for CSV output:
|
|
--quote-all Wrap all fields in double quotes
|
|
--quote-none Do not wrap any fields in double quotes, even if they have OFS or ORS in them
|
|
--quote-minimal Wrap fields in double quotes only if they have OFS or ORS in them (default)
|
|
--quote-numeric Wrap fields in double quotes only if they have numbers in them
|
|
Numerical formatting:
|
|
--ofmt {format} E.g. %.18lf, %.0lf. Please use sprintf-style codes for double-precision.
|
|
Applies to verbs which compute new values, e.g. put, stats1, stats2.
|
|
See also the fmtnum function within mlr put (mlr --help-all-functions).
|
|
Other options:
|
|
--seed {n} with n of the form 12345678 or 0xcafefeed. For put/filter urand().
|
|
Output of one verb may be chained as input to another using "then", e.g.
|
|
mlr stats1 -a min,mean,max -f flag,u,v -g color then sort -f color
|
|
Please see http://johnkerl.org/miller/doc and/or http://github.com/johnkerl/miller for more information.
|
|
This is Miller version >= v2.2.0.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr sort --help
|
|
Usage: mlr sort {flags}
|
|
Flags:
|
|
-f {comma-separated field names} Lexical ascending
|
|
-n {comma-separated field names} Numerical ascending; nulls sort last
|
|
-nf {comma-separated field names} Numerical ascending; nulls sort last
|
|
-r {comma-separated field names} Lexical descending
|
|
-nr {comma-separated field names} Numerical descending; nulls sort first
|
|
Sorts records primarily by the first specified field, secondarily by the second field, and so on.
|
|
Example:
|
|
mlr sort -f a,b -nr x,y,z
|
|
which is the same as:
|
|
mlr sort -f a -f b -nr x -nr y -nr z
|
|
</pre>
|
|
</div>
|
|
<p/>
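<p/> The multi-key behavior described in the help text above (sort primarily by the first field, secondarily by the second, and so on) can be sketched in Python as below. This is a minimal illustration, not Miller's implementation: the function name <tt>sort_records</tt> and the <tt>(field, numeric, descending)</tt> tuples are hypothetical, every record is assumed to have every sort field, and null handling is left out.

```python
def sort_records(records, keys):
    """Sort dict records by several keys.

    keys is a list of (field, numeric, descending) tuples in priority
    order, e.g. [("a", False, False), ("x", True, True)] for
    'sort -f a -nr x'.  Hypothetical helper, for illustration only.
    """
    out = list(records)
    # Python's sort is stable, so sorting by the least significant key
    # first and the most significant key last gives multi-key ordering.
    for field, numeric, descending in reversed(keys):
        convert = float if numeric else str
        out.sort(key=lambda r: convert(r[field]), reverse=descending)
    return out


records = [
    {"a": "pan", "x": "1"},
    {"a": "eks", "x": "2"},
    {"a": "pan", "x": "3"},
    {"a": "eks", "x": "10"},
]
# Lexical ascending on a, then numerical descending on x.
result = sort_records(records, [("a", False, False), ("x", True, True)])
print([(r["a"], r["x"]) for r in result])
```

<p/> Here the <tt>eks</tt> records come first, and within each value of <tt>a</tt> the larger <tt>x</tt> comes first.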
|
|
|
|
<a id="then-chaining"/><h1>then-chaining</h1>
|
|
|
|
<p/>
|
|
In accord with the
|
|
<a href="http://en.wikipedia.org/wiki/Unix_philosophy">Unix philosophy</a>, you can pipe data into or out of
|
|
Miller. For example:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
mlr cut --complement -f os_version *.dat | mlr sort -f hostname,uptime
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
% cat piped.sh
|
|
mlr cut -x -f i,y data/big | mlr sort -n y > /dev/null
|
|
|
|
% time sh piped.sh
|
|
real 0m2.828s
|
|
user 0m3.183s
|
|
sys 0m0.137s
|
|
|
|
|
|
% cat chained.sh
|
|
mlr cut -x -f i,y then sort -n y data/big > /dev/null
|
|
|
|
% time sh chained.sh
|
|
real 0m2.082s
|
|
user 0m1.933s
|
|
sys 0m0.137s
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
For better performance (avoiding redundant string-parsing and string-formatting
|
|
when you pipe Miller commands together) you can, if you like, instead simply
|
|
chain commands together using the <tt>then</tt> keyword:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
mlr cut --complement -f os_version then sort -f hostname,uptime *.dat
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="I/O_options"/><h1>I/O options</h1>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="Formats"/><h2>Formats</h2>
|
|
|
|
<p/> Options:
|
|
|
|
<pre>
|
|
--dkvp --idkvp --odkvp
|
|
--nidx --inidx --onidx
|
|
--csv --icsv --ocsv
|
|
--csvlite --icsvlite --ocsvlite
|
|
--pprint --ipprint --opprint --right
|
|
--xtab --ixtab --oxtab
|
|
</pre>
|
|
|
|
<p/> These are as discussed in <a href="file-formats.html">File formats</a>, with the exception of <tt>--right</tt>
|
|
which makes pretty-printed output right-aligned:
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint cat data/small
|
|
a b i x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
wye pan 5 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint --right cat data/small
|
|
a b i x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
wye pan 5 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<p/>Additional notes:
|
|
|
|
<ul>
|
|
|
|
<li/> Use <tt>--csv</tt>, <tt>--pprint</tt>, etc. when the input and output formats are the same.
|
|
|
|
<li/> Use <tt>--icsv --opprint</tt>, etc. when you want format conversion as part of what Miller does to your data.
|
|
|
|
<li/> DKVP (key-value-pair) format is the default for input and output. So,
|
|
<tt>--oxtab</tt> is the same as <tt>--idkvp --oxtab</tt>.
|
|
|
|
</ul>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="Record/field/pair_separators"/><h2>Record/field/pair separators</h2>
|
|
|
|
<p/> Miller has record separators <tt>IRS</tt> and <tt>ORS</tt>, field
|
|
separators <tt>IFS</tt> and <tt>OFS</tt>, and pair separators <tt>IPS</tt> and
|
|
<tt>OPS</tt>. For example, in the DKVP line <tt>a=1,b=2,c=3</tt>, the record
|
|
separator is newline, field separator is comma, and pair separator is the
|
|
equals sign. These are the default values.
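<p/> As a rough illustration of how the three separators divide up DKVP text, here is a minimal Python sketch. The function names <tt>parse_dkvp</tt> and <tt>format_dkvp</tt> are hypothetical, and the sketch assumes every field contains a pair separator; it is not how Miller itself reads or writes data.

```python
def parse_dkvp(text, irs="\n", ifs=",", ips="="):
    """Split text into records, records into fields, fields into key/value."""
    records = []
    for line in text.split(irs):
        if line == "":
            continue  # skip the empty piece after the final record separator
        record = {}
        for pair in line.split(ifs):
            key, _, value = pair.partition(ips)
            record[key] = value
        records.append(record)
    return records


def format_dkvp(records, ors="\n", ofs=",", ops="="):
    """Join records back into text using the output separators."""
    lines = [ofs.join(k + ops + v for k, v in r.items()) for r in records]
    return "".join(line + ors for line in lines)


parsed = parse_dkvp("a=1,b=2,c=3\n")
print(parsed)
# Same data written with a different field and pair separator.
print(format_dkvp(parsed, ofs=";", ops=":"), end="")
```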
|
|
|
|
<p/> Options:
|
|
<pre>
|
|
--rs --irs --ors
|
|
--fs --ifs --ofs --repifs
|
|
--ps --ips --ops
|
|
</pre>
|
|
|
|
<ul>
|
|
|
|
<li/> You can change a separator from input to output via e.g. <tt>--ifs =
|
|
--ofs :</tt>. Or, you can specify that the same separator is to be used for
|
|
input and output via e.g. <tt>--fs :</tt>.
|
|
|
|
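<li/> The pair separator is only relevant to DKVP and XTAB formats, since only in those formats do keys and values appear side by side.

<li/> XTAB format ignores the record-separator arguments. For XTAB the output pair separator, and for pretty-print the output field separator, must be a single character, since each is repeated for alignment.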
|
|
|
|
<li/> The <tt>--repifs</tt> option means that multiple successive occurrences of the
field separator count as one. For example, in CSV data we often signify nulls
by empty strings, e.g. <tt>2,9,,,,,6,5,4</tt>. On the other hand, if the field
separator is a space, it might be more natural to parse <tt>2 4&nbsp;&nbsp;&nbsp;5</tt> the
same as <tt>2 4 5</tt>: <tt>--repifs --ifs ' '</tt> lets this happen. In fact,
the <tt>--ipprint</tt> option above is internally implemented in terms of
<tt>--repifs</tt>.
|
|
|
|
<li/> Just write out the desired separator, e.g. <tt>--ofs '|'</tt>. But you
|
|
may use the symbolic names <tt>newline</tt>, <tt>space</tt>, <tt>tab</tt>,
|
|
<tt>pipe</tt>, or <tt>semicolon</tt> if you like.
|
|
|
|
</ul>
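<p/> The difference that <tt>--repifs</tt> makes can be sketched in Python as follows. The helper <tt>split_fields</tt> is hypothetical and only illustrates the idea of treating a run of separators as one; it makes no claim about how Miller treats leading or trailing separators.

```python
import re


def split_fields(line, ifs=",", repifs=False):
    """Split a line on ifs; with repifs, a run of separators counts as one."""
    if repifs:
        # One or more consecutive copies of the separator form a single break.
        return re.split("(?:" + re.escape(ifs) + ")+", line)
    return line.split(ifs)


# Without repifs, adjacent commas delimit empty (null) fields.
print(split_fields("2,9,,,,,6,5,4"))
# With repifs, repeated spaces collapse into a single separator.
print(split_fields("2 4    5", ifs=" ", repifs=True))
```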
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="Number_formatting"/><h2>Number formatting</h2>
|
|
|
|
<p/> The command-line option <tt>--ofmt {format string}</tt> is the global
|
|
number format for commands which generate numeric output, e.g.
|
|
<tt>stats1</tt>, <tt>stats2</tt>, <tt>histogram</tt>, and <tt>step</tt>, as
|
|
well as <tt>mlr put</tt>. Examples:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
--ofmt %.9le --ofmt %.6lf --ofmt %.0lf
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> These are just C <tt>printf</tt> formats applied to double-precision
|
|
numbers. Please don’t use <tt>%s</tt> or <tt>%d</tt>. Additionally, if
|
|
you use leading width (e.g. <tt>%18.12lf</tt>) then the output will contain
|
|
embedded whitespace, which may not be what you want if you pipe the output to
|
|
something else, particularly CSV. I use Miller’s pretty-print format
|
|
(<tt>mlr --opprint</tt>) to column-align numerical data.
|
|
|
|
<p/> To apply formatting to a single field, overriding the global
<tt>ofmt</tt>, use the <tt>fmtnum</tt> function within <tt>mlr put</tt>. For example:
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08lf")'
|
|
x=3.1,y=4.3,z=13.330000
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08llx")'
|
|
x=0xffff,y=0xff,z=00feff01
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>Input conversion from hexadecimal is done automatically on fields handled
|
|
by <tt>mlr put</tt> and <tt>mlr filter</tt> as long as the field value begins
|
|
with "0x". To apply output conversion to hexadecimal on a single column, you
|
|
may use <tt>fmtnum</tt>, or the keystroke-saving <tt>hexfmt</tt> function.
|
|
Example:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'
|
|
x=0xffff,y=0xff,z=0xfeff01
|
|
</pre>
|
|
</div>
|
|
<p/>
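<p/> To see what the format strings in the examples above do on their own, the following Python sketch applies them with Python's <tt>%</tt> operator as a stand-in for C's <tt>printf</tt>. The names <tt>fmtnum_sketch</tt> and <tt>hexfmt_sketch</tt> are hypothetical and are not Miller functions; Python accepts only a single length modifier, so <tt>ll</tt> is reduced to <tt>l</tt> before formatting.

```python
def fmtnum_sketch(value, fmt):
    """Apply a C-style format such as "%08lf" or "%08llx" to one number."""
    # Python ignores a single 'l' length modifier but rejects "ll".
    return fmt.replace("ll", "l") % value


def hexfmt_sketch(value):
    """Format an integer as hexadecimal with a leading 0x."""
    return "0x%x" % value


x, y = 3.1, 4.3
print(fmtnum_sketch(x * y, "%08lf"))            # 13.330000
print(fmtnum_sketch(0xffff * 0xff, "%08llx"))   # 00feff01
print(hexfmt_sketch(0xffff * 0xff))             # 0xfeff01
```

<p/> Note that the width in <tt>%08lf</tt> has no visible effect here, since the formatted value is already wider than eight characters.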
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="Data_transformations"/><h1>Data transformations</h1>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="cat"/><h2>cat</h2>
|
|
|
|
<p/> Most useful for format conversions (see
<a href="file-formats.html">File formats</a>), and for concatenating multiple
same-schema CSV files under a single header:
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat a.csv
|
|
a,b,c
|
|
1,2,3
|
|
4,5,6
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td> <td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat b.csv
|
|
a,b,c
|
|
7,8,9
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td> <td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --csv cat a.csv b.csv
|
|
a,b,c
|
|
1,2,3
|
|
4,5,6
|
|
7,8,9
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td> <td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --oxtab cat a.csv b.csv
|
|
a 1
|
|
b 2
|
|
c 3
|
|
|
|
a 4
|
|
b 5
|
|
c 6
|
|
|
|
a 7
|
|
b 8
|
|
c 9
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="check"/><h2>check</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr check --help
|
|
Usage: mlr check
|
|
Consumes records without printing any output.
|
|
Useful for doing a well-formatted check on input data.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="count-distinct"/><h2>count-distinct</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr count-distinct --help
|
|
Usage: mlr count-distinct [options]
|
|
-f {a,b,c} Field names for distinct count.
|
|
Prints number of records having distinct values for specified field names. Same as uniq -c.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr count-distinct -f a,b then sort -nr count data/medium
|
|
a=zee,b=wye,count=455
|
|
a=pan,b=eks,count=429
|
|
a=pan,b=pan,count=427
|
|
a=wye,b=hat,count=426
|
|
a=hat,b=wye,count=423
|
|
a=pan,b=hat,count=417
|
|
a=eks,b=hat,count=417
|
|
a=eks,b=eks,count=413
|
|
a=pan,b=zee,count=413
|
|
a=zee,b=hat,count=409
|
|
a=eks,b=wye,count=407
|
|
a=zee,b=zee,count=403
|
|
a=pan,b=wye,count=395
|
|
a=wye,b=pan,count=392
|
|
a=zee,b=eks,count=391
|
|
a=zee,b=pan,count=389
|
|
a=hat,b=eks,count=389
|
|
a=wye,b=eks,count=386
|
|
a=hat,b=zee,count=385
|
|
a=wye,b=zee,count=385
|
|
a=hat,b=hat,count=381
|
|
a=wye,b=wye,count=377
|
|
a=eks,b=pan,count=371
|
|
a=hat,b=pan,count=363
|
|
a=eks,b=zee,count=357
|
|
</pre>
|
|
</div>
|
|
<p/>
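<p/> Conceptually, <tt>count-distinct</tt> tallies how many records share each combination of values for the named fields. A minimal Python sketch of that idea follows; the function name <tt>count_distinct</tt> is hypothetical, the sample records are made up for illustration, and records lacking one of the fields are assumed not to occur.

```python
from collections import Counter


def count_distinct(records, fields):
    """Count records per distinct combination of values of the given fields."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    out = []
    # Counter keeps keys in first-seen order.
    for values, n in counts.items():
        row = dict(zip(fields, values))
        row["count"] = n
        out.append(row)
    return out


records = [
    {"a": "pan", "b": "pan", "i": 1},
    {"a": "eks", "b": "pan", "i": 2},
    {"a": "pan", "b": "pan", "i": 3},
]
for row in count_distinct(records, ["a", "b"]):
    print(row)
```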
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="cut"/><h2>cut</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr cut --help
|
|
Usage: mlr cut [options]
|
|
-f {a,b,c} Field names to include for cut.
|
|
-o Retain fields in the order specified here in the argument list.
|
|
Default is to retain them in the order found in the input data.
|
|
-x|--complement Exclude, rather that include, field names specified by -f.
|
|
Passes through input records with specified fields included/excluded.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint cat data/small
|
|
a b i x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
wye pan 5 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint cut -f y,x,i data/small
|
|
i x y
|
|
1 0.3467901443380824 0.7268028627434533
|
|
2 0.7586799647899636 0.5221511083334797
|
|
3 0.20460330576630303 0.33831852551664776
|
|
4 0.38139939387114097 0.13418874328430463
|
|
5 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'a=1,b=2,c=3' | mlr cut -f b,c,a
|
|
a=1,b=2,c=3
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'a=1,b=2,c=3' | mlr cut -o -f b,c,a
|
|
b=2,c=3,a=1
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
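<p/> The effect of <tt>-o</tt> and <tt>-x</tt> on field selection and ordering can be sketched in Python as follows. The function <tt>cut</tt> and its keyword arguments are hypothetical names chosen for this illustration, not Miller's internals.

```python
def cut(record, fields, ordered=False, complement=False):
    """Keep (or, with complement, drop) the named fields of one record."""
    if complement:
        # Like -x: everything except the named fields, in input order.
        return {k: v for k, v in record.items() if k not in fields}
    if ordered:
        # Like -o: fields come out in the order given in the argument list.
        return {f: record[f] for f in fields if f in record}
    # Default: fields come out in the order found in the input record.
    return {k: v for k, v in record.items() if k in fields}


record = {"a": "1", "b": "2", "c": "3"}
print(list(cut(record, ["b", "c", "a"])))                # ['a', 'b', 'c']
print(list(cut(record, ["b", "c", "a"], ordered=True)))  # ['b', 'c', 'a']
print(list(cut(record, ["b"], complement=True)))         # ['a', 'c']
```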
|
|
|
|
<p/>
|
|
<!-- ================================================================ -->
|
|
<a id="filter"/><h2>filter</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr filter --help
|
|
Usage: mlr filter [-v] {expression}
|
|
Prints records for which {expression} evaluates to true.
|
|
With -v, first prints the AST (abstract syntax tree) for the expression, which
|
|
gives full transparency on the precedence and associativity rules of Miller's grammar.
|
|
Please use a dollar sign for field names and double-quotes for string literals.
|
|
Miller built-in variables are NF NR FNR FILENUM FILENAME PI E.
|
|
Examples:
|
|
mlr filter 'log10($count) > 4.0'
|
|
mlr filter 'FNR == 2 (second record in each file)'
|
|
mlr filter 'urand() < 0.001' (subsampling)
|
|
mlr filter '$color != "blue" && $value > 4.2'
|
|
mlr filter '($x<.5 && $y<.5) || ($x>.5 && $y>.5)'
|
|
Please see http://johnkerl.org/miller/doc/reference.html for more information including function list.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>Field names must be specified using a <tt>$</tt> in <tt>filter</tt> and
|
|
<a href="#put"><tt>put</tt></a> expressions, even though they don’t appear in the data
|
|
stream. For integer-indexed data, this looks like <tt>awk</tt>’s
|
|
<tt>$1,$2,$3</tt>. Likewise, enclose string literals in double quotes in
|
|
<tt>filter</tt> expressions even though they don’t appear in file data.
|
|
In particular, <tt>mlr filter '$x=="abc"'</tt> passes through the record
|
|
<tt>x=abc</tt>.
|
|
|
|
<p/>The <tt>filter</tt> command supports the same built-in variables as for
|
|
<a href="#put"><tt>put</tt></a>, all <tt>awk</tt>-inspired: <tt>NF</tt>,
|
|
<tt>NR</tt>, <tt>FNR</tt>, <tt>FILENUM</tt>, and <tt>FILENAME</tt>. This
|
|
selects the 2nd record from each matching file:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr filter 'FNR == 2' data/small*
|
|
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
|
|
1=pan,2=pan,3=1,4=0.3467901443380824,5=0.7268028627434533
|
|
a=wye,b=eks,i=10000,x=0.734806020620654365,y=0.884788571337605134
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>Expressions may be arbitrarily complex:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint filter '$a == "pan" || $b == "wye"' data/small
|
|
a b i x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint filter '($x > 0.5 && $y > 0.5) || ($x < 0.5 && $y < 0.5)' then stats2 -a corr -f x,y data/medium
|
|
x_y_corr
|
|
0.756439
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint filter '($x > 0.5 && $y < 0.5) || ($x < 0.5 && $y > 0.5)' then stats2 -a corr -f x,y data/medium
|
|
x_y_corr
|
|
-0.747994
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<p/>Newlines within the expression are ignored, which can help increase legibility of complex expressions:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
mlr --opprint filter '
|
|
($x > 0.5 && $y < 0.5)
|
|
||
|
|
($x < 0.5 && $y > 0.5)' \
|
|
then stats2 -a corr -f x,y data/medium
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="group-by"/><h2>group-by</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr group-by --help
|
|
Usage: mlr group-by {comma-separated field names}
|
|
Outputs records in batches having identical values at specified field names.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>This is similar to <tt>sort</tt> but with less work. Namely, Miller’s
sort has three steps: read through the data, appending each record to a linked list of records,
one list for each unique combination of the key-field values; after all records
are read, sort the key-field values; then print each record-list. The group-by
operation simply omits the middle sort. An example should make this clearer.
|
|
|
|
<table><tr> <td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint group-by a data/small
|
|
a b i x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
wye pan 5 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td> <td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint sort -f a data/small
|
|
a b i x y
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
wye pan 5 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td> </tr></table>
|
|
|
|
<p/>In this example, since the sort is on field <tt>a</tt>, the first step is
|
|
to group together all records having the same value for field <tt>a</tt>; the
|
|
second step is to sort the distinct <tt>a</tt>-field values <tt>pan</tt>,
|
|
<tt>eks</tt>, and <tt>wye</tt> into <tt>eks</tt>, <tt>pan</tt>, and
|
|
<tt>wye</tt>; the third step is to print out the record-list for
|
|
<tt>a=eks</tt>, then the record-list for <tt>a=pan</tt>, then the record-list
|
|
for <tt>a=wye</tt>. The group-by operation omits the middle sort and just puts
|
|
like records together, for those times when a sort isn’t desired. In
particular, group-by emits the groups in the order in which their key values were
first encountered in the data stream, which in some cases may be more interesting
to you.
|
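<p/> The three steps described above, and the way group-by skips the middle one, can be sketched in Python as follows. This is an illustration of the idea only: the function names are hypothetical, and plain Python lists and dicts stand in for the linked lists mentioned in the text.

```python
def group_records(records, field):
    """Step 1: append each record to the list for its key value."""
    groups = {}  # dicts keep keys in first-seen order
    for r in records:
        groups.setdefault(r[field], []).append(r)
    return groups


def group_by(records, field):
    """Steps 1 and 3 only: emit the groups in first-seen key order."""
    groups = group_records(records, field)
    return [r for key in groups for r in groups[key]]


def sort_f(records, field):
    """Steps 1, 2 and 3: sort the distinct keys, then emit each group."""
    groups = group_records(records, field)
    return [r for key in sorted(groups) for r in groups[key]]


# The a and i columns of data/small from the examples above.
small = [
    {"a": "pan", "i": 1},
    {"a": "eks", "i": 2},
    {"a": "wye", "i": 3},
    {"a": "eks", "i": 4},
    {"a": "wye", "i": 5},
]
print([r["i"] for r in group_by(small, "a")])  # [1, 2, 4, 3, 5]
print([r["i"] for r in sort_f(small, "a")])    # [2, 4, 1, 3, 5]
```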
|
|
|
<!-- ================================================================ -->
|
|
<a id="group-like"/><h2>group-like</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr group-like --help
|
|
Usage: mlr group-like
|
|
Outputs records in batches having identical field names.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> This groups together records having the same schema (i.e. same ordered list of field names)
|
|
which is useful for making sense of time-ordered output as described in
|
|
<a href="record-heterogeneity.html">Record-heterogeneity</a> — in particular, in
|
|
preparation for CSV or pretty-print output.
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr cat data/het.dkvp
|
|
resource=/path/to/file,loadsec=0.45,ok=true
|
|
record_count=100,resource=/path/to/file
|
|
resource=/path/to/second/file,loadsec=0.32,ok=true
|
|
record_count=150,resource=/path/to/second/file
|
|
resource=/some/other/path,loadsec=0.97,ok=false
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint group-like data/het.dkvp
|
|
resource loadsec ok
|
|
/path/to/file 0.45 true
|
|
/path/to/second/file 0.32 true
|
|
/some/other/path 0.97 false
|
|
|
|
record_count resource
|
|
100 /path/to/file
|
|
150 /path/to/second/file
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
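<p/>As a rough Python sketch of the idea (an illustration under the assumption that a schema is the ordered tuple of field names, as stated above; <tt>group_like</tt> here is a made-up helper, not Miller code):

```python
def group_like(records):
    """Batch records by their ordered tuple of field names."""
    batches = {}
    for rec in records:
        batches.setdefault(tuple(rec.keys()), []).append(rec)
    return list(batches.values())


# The first three records of data/het.dkvp as shown above.
records = [
    {"resource": "/path/to/file", "loadsec": "0.45", "ok": "true"},
    {"record_count": "100", "resource": "/path/to/file"},
    {"resource": "/path/to/second/file", "loadsec": "0.32", "ok": "true"},
]
batches = group_like(records)
for batch in batches:
    print(list(batch[0].keys()), len(batch))
```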
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="having-fields"/><h2>having-fields</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr having-fields --help
|
|
Usage: mlr having-fields [options]
|
|
--at-least {a,b,c}
|
|
--which-are {a,b,c}
|
|
--at-most {a,b,c}
|
|
Conditionally passes through records depending on each record's field names.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> Similar to <a href="#group-like"><tt>group-like</tt></a>, this selects records by schema: it passes through only those records whose field names satisfy the given option, and drops the rest.
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr cat data/het.dkvp
|
|
resource=/path/to/file,loadsec=0.45,ok=true
|
|
record_count=100,resource=/path/to/file
|
|
resource=/path/to/second/file,loadsec=0.32,ok=true
|
|
record_count=150,resource=/path/to/second/file
|
|
resource=/some/other/path,loadsec=0.97,ok=false
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr having-fields --at-least resource data/het.dkvp
|
|
resource=/path/to/file,loadsec=0.45,ok=true
|
|
record_count=100,resource=/path/to/file
|
|
resource=/path/to/second/file,loadsec=0.32,ok=true
|
|
record_count=150,resource=/path/to/second/file
|
|
resource=/some/other/path,loadsec=0.97,ok=false
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr having-fields --which-are resource,ok,loadsec data/het.dkvp
|
|
resource=/path/to/file,loadsec=0.45,ok=true
|
|
resource=/path/to/second/file,loadsec=0.32,ok=true
|
|
resource=/some/other/path,loadsec=0.97,ok=false
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
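<p/>The three options can be read as set comparisons on field names. The Python sketch below is an interpretation, not Miller’s code: the order-independence of <tt>--which-are</tt> is inferred from the example above (the option lists <tt>resource,ok,loadsec</tt> while the records have <tt>resource,loadsec,ok</tt>), and the subset reading of <tt>--at-most</tt> is an assumption based on its name.

```python
def at_least(rec, names):
    """True if the record has every named field (and possibly more)."""
    return set(names) <= set(rec)


def which_are(rec, names):
    """True if the record's field names are exactly the named ones."""
    return set(names) == set(rec)


def at_most(rec, names):
    """True if the record has no fields outside the named ones (assumed)."""
    return set(rec) <= set(names)


full = {"resource": "/path/to/file", "loadsec": "0.45", "ok": "true"}
short = {"record_count": "100", "resource": "/path/to/file"}
print(at_least(short, ["resource"]), which_are(short, ["resource", "ok", "loadsec"]))
```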
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="head"/><h2>head</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr head --help
|
|
Usage: mlr head [options]
|
|
-n {count} Head count to print; default 10
|
|
-g {a,b,c} Optional group-by-field names for head counts
|
|
Passes through the first n records, optionally by category.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
Note that <tt>head</tt> is distinct from <a href="#top"><tt>top</tt></a>:
<tt>head</tt> shows the records which appear first in the data stream, while
<tt>top</tt> shows the records whose specified field values are numerically largest (or smallest).
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint head -n 4 data/medium
|
|
a b i x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint head -n 1 -g b data/medium
|
|
a b i x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
eks zee 7 0.6117840605678454 0.1878849191181694
|
|
zee eks 17 0.29081949506712723 0.054478717073354166
|
|
wye hat 24 0.7286126830627567 0.19441962592638418
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="histogram"/><h2>histogram</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr histogram --help
|
|
Usage: mlr histogram [options]
|
|
-f {a,b,c} Value-field names for histogram counts
|
|
--lo {lo} Histogram low value
|
|
--hi {hi} Histogram high value
|
|
--nbins {n} Number of histogram bins
|
|
Just a histogram. Input values < lo or > hi are not counted.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
This is just a histogram; there’s not too much to say here. A note about
|
|
binning, by example: Suppose you use <tt>--lo 0.0 --hi 1.0 --nbins 10 -f
|
|
x</tt>. The input numbers less than 0 or greater than 1 aren’t counted
|
|
in any bin. Input numbers equal to 1 are counted in the last bin. That is, bin
|
|
0 has <tt>0.0 ≤ x < 0.1</tt>, bin 1 has <tt>0.1 ≤ x < 0.2</tt>,
|
|
etc., but bin 9 has <tt>0.9 ≤ x ≤ 1.0</tt>.
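<p/>The binning rule just described can be written out as a short Python sketch. It illustrates the rule as stated here (half-open bins, with the top edge folded into the last bin) and is not a copy of Miller’s arithmetic; <tt>bin_index</tt> and <tt>histogram</tt> are names invented for the example.

```python
def bin_index(x, lo, hi, nbins):
    """Return the bin number for x, or None if x is out of range."""
    if x < lo or x > hi:
        return None  # values below lo or above hi are not counted
    if x == hi:
        return nbins - 1  # the top edge is counted in the last bin
    idx = int((x - lo) / (hi - lo) * nbins)
    return min(idx, nbins - 1)  # guard against floating-point round-up


def histogram(values, lo, hi, nbins):
    counts = [0] * nbins
    for x in values:
        idx = bin_index(x, lo, hi, nbins)
        if idx is not None:
            counts[idx] += 1
    return counts


counts = histogram([0.05, 0.25, 0.95, 1.0, 1.5, -0.2], 0.0, 1.0, 10)
print(counts)
```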
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint put '$x2=$x**2;$x3=$x2*$x' then histogram -f x,x2,x3 --lo 0 --hi 1 --nbins 10 data/medium
|
|
bin_lo bin_hi x_count x2_count x3_count
|
|
0.000000 0.100000 1072 3231 4661
|
|
0.100000 0.200000 938 1254 1184
|
|
0.200000 0.300000 1037 988 845
|
|
0.300000 0.400000 988 832 676
|
|
0.400000 0.500000 950 774 576
|
|
0.500000 0.600000 1002 692 476
|
|
0.600000 0.700000 1007 591 438
|
|
0.700000 0.800000 1007 560 420
|
|
0.800000 0.900000 986 571 383
|
|
0.900000 1.000000 1013 507 341
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="join"/><h2>join</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr join --help
|
|
Usage: mlr join [options]
|
|
Joins records from specified left file name with records from all file names at the end of the Miller argument list.
|
|
Functionality is essentially the same as the system "join" command, but for record streams.
|
|
Options:
|
|
-f {left file name}
|
|
-j {a,b,c} Comma-separated join-field names for output
|
|
-l {a,b,c} Comma-separated join-field names for left input file; defaults to -j values if omitted.
|
|
-r {a,b,c} Comma-separated join-field names for right input file(s); defaults to -j values if omitted.
|
|
--lp {text} Additional prefix for non-join output field names from the left file
|
|
--rp {text} Additional prefix for non-join output field names from the right file(s)
|
|
--np Do not emit paired records
|
|
--ul Emit unpaired records from the left file
|
|
--ur Emit unpaired records from the right file(s)
|
|
-u Enable unsorted input. In this case, the entire left file will be loaded into memory.
|
|
Without -u, records must be sorted lexically by their join-field names, else not all
|
|
records will be paired.
|
|
File-format options default to those for the right file names on the Miller argument list, but may be overridden
|
|
for the left file as follows. Please see the main "mlr --help" for more information on syntax for these arguments.
|
|
-i {one of csv,dkvp,nidx,pprint,xtab}
|
|
--irs {record-separator character}
|
|
--ifs {field-separator character}
|
|
--ips {pair-separator character}
|
|
--repifs
|
|
--repips
|
|
--use-mmap
|
|
--no-mmap
|
|
Please see http://johnkerl.org/miller/doc/reference.html for more information including examples.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
Examples:
|
|
|
|
<p/>Join larger table with IDs with smaller ID-to-name lookup table, showing only paired records:
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsvlite --opprint cat data/join-left-example.csv
|
|
id name
|
|
100 alice
|
|
200 bob
|
|
300 carol
|
|
400 david
|
|
500 edgar
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsvlite --opprint cat data/join-right-example.csv
|
|
status idcode
|
|
present 400
|
|
present 100
|
|
missing 200
|
|
present 100
|
|
present 200
|
|
missing 100
|
|
missing 200
|
|
present 300
|
|
missing 600
|
|
present 400
|
|
present 400
|
|
present 300
|
|
present 100
|
|
missing 400
|
|
present 200
|
|
present 200
|
|
present 200
|
|
present 200
|
|
present 400
|
|
present 300
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsvlite --opprint join -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv
|
|
id name status
|
|
400 david present
|
|
100 alice present
|
|
200 bob missing
|
|
100 alice present
|
|
200 bob present
|
|
100 alice missing
|
|
200 bob missing
|
|
300 carol present
|
|
400 david present
|
|
400 david present
|
|
300 carol present
|
|
100 alice present
|
|
400 david missing
|
|
200 bob present
|
|
200 bob present
|
|
200 bob present
|
|
200 bob present
|
|
400 david present
|
|
300 carol present
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<p/>Same, but with sorting the input first:
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsvlite --opprint sort -f idcode then join -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv
|
|
id name status
|
|
100 alice present
|
|
100 alice present
|
|
100 alice missing
|
|
100 alice present
|
|
200 bob missing
|
|
200 bob present
|
|
200 bob missing
|
|
200 bob present
|
|
200 bob present
|
|
200 bob present
|
|
200 bob present
|
|
300 carol present
|
|
300 carol present
|
|
300 carol present
|
|
400 david present
|
|
400 david present
|
|
400 david present
|
|
400 david missing
|
|
400 david present
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<p/>Same, but showing only unpaired records:
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsvlite --opprint join --np --ul --ur -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv
|
|
status idcode
|
|
missing 600
|
|
|
|
id name
|
|
500 edgar
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
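<p/>The unsorted (<tt>-u</tt>) mode can be pictured as a hash join: the left file is held in memory, keyed by its join field, and the right records are streamed past it. The Python sketch below follows that picture under stated assumptions: the function <tt>join_unsorted</tt> and its parameters are hypothetical, the right-hand data is a three-record subset of the example file, and emitting unpaired left records at the end is an assumption consistent with the output shown above.

```python
def join_unsorted(left, right, left_field, right_field, out_field, *,
                  emit_paired=True, emit_left_unpaired=False,
                  emit_right_unpaired=False):
    # Load the whole left side into memory, bucketed by join value.
    buckets = {}
    for lrec in left:
        buckets.setdefault(lrec[left_field], []).append(lrec)

    matched = set()
    out = []
    for rrec in right:  # stream the right side
        key = rrec[right_field]
        if key in buckets:
            matched.add(key)
            if emit_paired:
                for lrec in buckets[key]:
                    joined = {out_field: key}
                    joined.update({k: v for k, v in lrec.items() if k != left_field})
                    joined.update({k: v for k, v in rrec.items() if k != right_field})
                    out.append(joined)
        elif emit_right_unpaired:
            out.append(dict(rrec))

    if emit_left_unpaired:  # left records never matched by any right record
        for key, lrecs in buckets.items():
            if key not in matched:
                out.extend(dict(lrec) for lrec in lrecs)
    return out


left = [{"id": "100", "name": "alice"}, {"id": "200", "name": "bob"},
        {"id": "300", "name": "carol"}, {"id": "400", "name": "david"},
        {"id": "500", "name": "edgar"}]
right = [{"status": "present", "idcode": "400"},
         {"status": "present", "idcode": "100"},
         {"status": "missing", "idcode": "600"}]

paired = join_unsorted(left, right, "id", "idcode", "id")
unpaired = join_unsorted(left, right, "id", "idcode", "id", emit_paired=False,
                         emit_left_unpaired=True, emit_right_unpaired=True)
print(paired)
print(unpaired)
```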
|
|
|
|
<p/>Use prefixing options to disambiguate between otherwise identical non-join field names:
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --csvlite --opprint cat data/self-join.csv data/self-join.csv
|
|
a b c
|
|
1 2 3
|
|
1 4 5
|
|
1 2 3
|
|
1 4 5
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --csvlite --opprint join -j a --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv
|
|
a left_b left_c right_b right_c
|
|
1 2 3 2 3
|
|
1 4 5 2 3
|
|
1 2 3 4 5
|
|
1 4 5 4 5
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<p/>Use zero join columns, which pairs every left record with every right record:
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --csvlite --opprint join -j "" --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv
|
|
left_a left_b left_c right_a right_b right_c
|
|
1 2 3 1 2 3
|
|
1 4 5 1 2 3
|
|
1 2 3 1 4 5
|
|
1 4 5 1 4 5
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="label"/><h2>label</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr label --help
|
|
Usage: mlr label {new1,new2,new3,...}
|
|
Given n comma-separated names, renames the first n fields of each record to
|
|
have the respective name. (Fields past the nth are left with their original
|
|
names.) Particularly useful with --inidx, to give useful names to otherwise
|
|
integer-indexed fields.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
See also <a href="#rename"><tt>rename</tt></a>.
|
|
|
|
<p/>Example: Files such as <tt>/etc/passwd</tt>, <tt>/etc/group</tt>, and so on
|
|
have implicit field names which are found in section-5 manpages. These field names may be made explicit as follows:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ grep -v '^#' /etc/passwd | mlr --nidx --fs : --opprint label name,password,uid,gid,gecos,home_dir,shell | head
|
|
name password uid gid gecos home_dir shell
|
|
nobody * -2 -2 Unprivileged User /var/empty /usr/bin/false
|
|
root * 0 0 System Administrator /var/root /bin/sh
|
|
daemon * 1 1 System Services /var/root /usr/bin/false
|
|
_uucp * 4 4 Unix to Unix Copy Protocol /var/spool/uucp /usr/sbin/uucico
|
|
_taskgated * 13 13 Task Gate Daemon /var/empty /usr/bin/false
|
|
_networkd * 24 24 Network Services /var/networkd /usr/bin/false
|
|
_installassistant * 25 25 Install Assistant /var/empty /usr/bin/false
|
|
_lp * 26 26 Printing Services /var/spool/cups /usr/bin/false
|
|
_postfix * 27 27 Postfix Mail Server /var/spool/postfix /usr/bin/false
|
|
</pre>
|
|
</div>
|
|
<p/>
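<p/>In Python terms, the renaming rule from the usage text can be sketched as follows. The helper <tt>label</tt> is illustrative only, and it does not try to handle the case where a new name collides with a later existing field name.

```python
def label(rec, new_names):
    """Rename the first len(new_names) fields; leave the rest unchanged."""
    out = {}
    for position, (old_name, value) in enumerate(rec.items()):
        name = new_names[position] if position < len(new_names) else old_name
        out[name] = value
    return out


# Integer-indexed record, as produced by NIDX-style input.
rec = {"1": "root", "2": "*", "3": "0"}
labeled = label(rec, ["name", "password"])
print(labeled)
```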
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="put"/><h2>put</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr put --help
|
|
Usage: mlr put [-v] {expression}
|
|
Adds/updates specified field(s).
|
|
With -v, first prints the AST (abstract syntax tree) for the expression, which
|
|
gives full transparency on the precedence and associativity rules of Miller's grammar.
|
|
Please use a dollar sign for field names and double-quotes for string literals.
|
|
Miller built-in variables are NF NR FNR FILENUM FILENAME PI E.
|
|
Multiple assignments may be separated with a semicolon.
|
|
Examples:
|
|
mlr put '$y = log10($x); $z = sqrt($y)'
|
|
mlr put '$filename = FILENAME'
|
|
mlr put '$colored_shape = $color . "_" . $shape'
|
|
mlr put '$y = cos($theta); $z = atan2($y, $x)'
|
|
Please see http://johnkerl.org/miller/doc/reference.html for more information including function list.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>Field names must be specified using a <tt>$</tt> in <a href="#filter"><tt>filter</tt></a> and <tt>put</tt>
expressions, even though the dollar signs don’t appear in the data stream. For
integer-indexed data, this looks like <tt>awk</tt>’s <tt>$1,$2,$3</tt>.
Likewise, enclose string literals in double quotes in <tt>put</tt>
expressions even though the quotes don’t appear in file data. In particular,
<tt>mlr put '$x="abc"'</tt> creates the field <tt>x=abc</tt>.
|
|
|
|
<p/>Multiple expressions may be given, separated by semicolons, and each may refer to the ones before:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ ruby -e '10.times{|i|puts "i=#{i}"}' | mlr --opprint put '$j=$i+1;$k=$i+$j'
|
|
i j k
|
|
0 1.000000 1.000000
|
|
1 2.000000 3.000000
|
|
2 3.000000 5.000000
|
|
3 4.000000 7.000000
|
|
4 5.000000 9.000000
|
|
5 6.000000 11.000000
|
|
6 7.000000 13.000000
|
|
7 8.000000 15.000000
|
|
8 9.000000 17.000000
|
|
9 10.000000 19.000000
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>Miller supports the following five <tt>awk</tt>-inspired built-in variables for <a href="#filter"><tt>filter</tt></a>
and <tt>put</tt>: <tt>NF</tt>, <tt>NR</tt>,
<tt>FNR</tt>, <tt>FILENUM</tt>, and <tt>FILENAME</tt>. The usage text above also lists the constants <tt>PI</tt> and <tt>E</tt>.
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint put '$nf=NF; $nr=NR; $fnr=FNR; $filenum=FILENUM; $filename=FILENAME' data/small data/small2
|
|
a b i x y nf nr fnr filenum filename
|
|
pan pan 1 0.3467901443380824 0.7268028627434533 5 1 1 1 data/small
|
|
eks pan 2 0.7586799647899636 0.5221511083334797 5 2 2 1 data/small
|
|
wye wye 3 0.20460330576630303 0.33831852551664776 5 3 3 1 data/small
|
|
eks wye 4 0.38139939387114097 0.13418874328430463 5 4 4 1 data/small
|
|
wye pan 5 0.5732889198020006 0.8636244699032729 5 5 5 1 data/small
|
|
pan eks 9999 0.267481232652199086 0.557077185510228001 5 6 1 2 data/small2
|
|
wye eks 10000 0.734806020620654365 0.884788571337605134 5 7 2 2 data/small2
|
|
pan wye 10001 0.870530722602517626 0.009854780514656930 5 8 3 2 data/small2
|
|
hat wye 10002 0.321507044286237609 0.568893318795083758 5 9 4 2 data/small2
|
|
pan zee 10003 0.272054845593895200 0.425789896597056627 5 10 5 2 data/small2
|
|
</pre>
|
|
</div>
|
|
<p/>
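<p/>How these counters relate to one another can be sketched in Python. This is an illustration consistent with the output above, not Miller’s code: <tt>annotate</tt> is a made-up helper, and the in-memory "files" are hypothetical stand-ins for <tt>data/small</tt> and <tt>data/small2</tt>.

```python
def annotate(files):
    """files: list of (filename, list_of_records) pairs, in argument order."""
    out = []
    nr = 0  # record number across all files
    for filenum, (filename, records) in enumerate(files, start=1):
        for fnr, rec in enumerate(records, start=1):  # restarts per file
            nr += 1
            new = dict(rec)
            new.update(nf=len(rec), nr=nr, fnr=fnr,
                       filenum=filenum, filename=filename)
            out.append(new)
    return out


files = [
    ("data/small", [{"a": "pan", "i": "1"}, {"a": "eks", "i": "2"}]),
    ("data/small2", [{"a": "pan", "i": "9999"}]),
]
annotated = annotate(files)
print(annotated[2])
```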
|
|
|
|
Newlines within the expression are ignored, which can help increase legibility of complex expressions:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
mlr --opprint put '
|
|
$nf = NF;
|
|
$nr = NR;
|
|
$fnr = FNR;
|
|
$filenum = FILENUM;
|
|
$filename = FILENAME' \
|
|
data/small data/small2
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="regularize"/><h2>regularize</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr regularize --help
|
|
Usage: mlr regularize
|
|
For records seen earlier in the data stream with same field names in a different order,
|
|
outputs them with field names in the previously encountered order.
|
|
Example: input records a=1,c=2,b=3, then e=4,d=5, then c=7,a=6,b=8
|
|
output as a=1,c=2,b=3, then e=4,d=5, then a=6,c=7,b=8
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>This exists since hash-map software in various languages and tools
|
|
encountered in the wild does not always print similar rows with fields in the
|
|
same order: <tt>mlr regularize</tt> helps clean that up.
|
|
|
|
<p/>See also <a href="#reorder"><tt>reorder</tt></a>.
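<p/>A minimal Python sketch of the idea, using the example from the usage text: remember the first-seen field order for each set of field names, and reuse it for later records with the same set. The helper name <tt>regularize</tt> mirrors the verb but the code is only an illustration.

```python
def regularize(records):
    first_seen = {}  # sorted field names -> first-encountered ordering
    out = []
    for rec in records:
        key = tuple(sorted(rec))
        order = first_seen.setdefault(key, list(rec))
        out.append({name: rec[name] for name in order})
    return out


records = [
    {"a": "1", "c": "2", "b": "3"},
    {"e": "4", "d": "5"},
    {"c": "7", "a": "6", "b": "8"},
]
regularized = regularize(records)
print(regularized[2])
```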
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="rename"/><h2>rename</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr rename --help
|
|
Usage: mlr rename {old1,new1,old2,new2,...}
|
|
Renames specified fields.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint cat data/small
|
|
a b i x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
wye pan 5 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint rename i,INDEX,b,COLUMN2 data/small
|
|
a COLUMN2 INDEX x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
wye pan 5 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<p/>As discussed in <a href="performance.html">Performance</a>, <tt>sed</tt>
|
|
is significantly faster than Miller at doing this. However, Miller is
|
|
format-aware, so it knows to do renames only within specified field keys and
|
|
not any others, nor in field values which may happen to contain the same
|
|
pattern. Example:
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ sed 's/y/COLUMN5/g' data/small
|
|
a=pan,b=pan,i=1,x=0.3467901443380824,COLUMN5=0.7268028627434533
|
|
a=eks,b=pan,i=2,x=0.7586799647899636,COLUMN5=0.5221511083334797
|
|
a=wCOLUMN5e,b=wCOLUMN5e,i=3,x=0.20460330576630303,COLUMN5=0.33831852551664776
|
|
a=eks,b=wCOLUMN5e,i=4,x=0.38139939387114097,COLUMN5=0.13418874328430463
|
|
a=wCOLUMN5e,b=pan,i=5,x=0.5732889198020006,COLUMN5=0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr rename y,COLUMN5 data/small
|
|
a=pan,b=pan,i=1,x=0.3467901443380824,COLUMN5=0.7268028627434533
|
|
a=eks,b=pan,i=2,x=0.7586799647899636,COLUMN5=0.5221511083334797
|
|
a=wye,b=wye,i=3,x=0.20460330576630303,COLUMN5=0.33831852551664776
|
|
a=eks,b=wye,i=4,x=0.38139939387114097,COLUMN5=0.13418874328430463
|
|
a=wye,b=pan,i=5,x=0.5732889198020006,COLUMN5=0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
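<p/>The same contrast can be shown in Python: a textual substitution touches every occurrence of the pattern, while a key-based rename touches only the field name. The helpers <tt>parse_dkvp</tt>, <tt>format_dkvp</tt>, and <tt>rename_field</tt> are hypothetical and handle only simple comma-separated <tt>key=value</tt> lines.

```python
def parse_dkvp(line):
    """Split 'k1=v1,k2=v2' into an ordered dict of fields."""
    return dict(pair.split("=", 1) for pair in line.split(","))


def format_dkvp(rec):
    return ",".join(f"{k}={v}" for k, v in rec.items())


def rename_field(rec, old, new):
    """Rename one key, leaving values and other keys untouched."""
    return {(new if k == old else k): v for k, v in rec.items()}


line = "a=wye,b=pan,y=0.86"
text_based = line.replace("y", "COLUMN5")  # also rewrites the value 'wye'
key_based = format_dkvp(rename_field(parse_dkvp(line), "y", "COLUMN5"))
print(text_based)
print(key_based)
```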
|
|
|
|
See also <a href="#label"><tt>label</tt></a>.
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="reorder"/><h2>reorder</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr reorder --help
|
|
Usage: mlr reorder [options]
|
|
-f {a,b,c} Field names to reorder.
|
|
-e Put specified field names at record end: default is to put at record start.
|
|
Example: mlr reorder -f a,b sends input record "d=4,b=2,a=1,c=3" to "a=1,b=2,d=4,c=3".
|
|
Example: mlr reorder -e -f a,b sends input record "d=4,b=2,a=1,c=3" to "d=4,c=3,a=1,b=2".
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
This moves the specified fields to the start or end of each record, for
example when you have data with many columns and you want to bring a field or
two to the front of the line, where you can scan them quickly.
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint cat data/small
|
|
a b i x y
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
wye pan 5 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint reorder -f i,b data/small
|
|
i b a x y
|
|
1 pan pan 0.3467901443380824 0.7268028627434533
|
|
2 pan eks 0.7586799647899636 0.5221511083334797
|
|
3 wye wye 0.20460330576630303 0.33831852551664776
|
|
4 wye eks 0.38139939387114097 0.13418874328430463
|
|
5 pan wye 0.5732889198020006 0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint reorder -e -f i,b data/small
|
|
a x y i b
|
|
pan 0.3467901443380824 0.7268028627434533 1 pan
|
|
eks 0.7586799647899636 0.5221511083334797 2 pan
|
|
wye 0.20460330576630303 0.33831852551664776 3 wye
|
|
eks 0.38139939387114097 0.13418874328430463 4 wye
|
|
wye 0.5732889198020006 0.8636244699032729 5 pan
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
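<p/>A Python sketch that reproduces the two examples from the usage text. It is an illustration of the field ordering shown in this section, with <tt>reorder</tt> as a made-up helper; fields named in the list but absent from the record are simply skipped, which is an assumption.

```python
def reorder(rec, names, at_end=False):
    """Move the named fields to the start (default) or end of the record."""
    moved = {name: rec[name] for name in names if name in rec}
    rest = {k: v for k, v in rec.items() if k not in moved}
    return {**rest, **moved} if at_end else {**moved, **rest}


rec = {"d": "4", "b": "2", "a": "1", "c": "3"}
at_start = list(reorder(rec, ["a", "b"]))
at_end = list(reorder(rec, ["a", "b"], at_end=True))
print(at_start, at_end)
```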
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="sort"/><h2>sort</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr sort --help
|
|
Usage: mlr sort {flags}
|
|
Flags:
|
|
-f {comma-separated field names} Lexical ascending
|
|
-n {comma-separated field names} Numerical ascending; nulls sort last
|
|
-nf {comma-separated field names} Numerical ascending; nulls sort last
|
|
-r {comma-separated field names} Lexical descending
|
|
-nr {comma-separated field names} Numerical descending; nulls sort first
|
|
Sorts records primarily by the first specified field, secondarily by the second field, and so on.
|
|
Example:
|
|
mlr sort -f a,b -nr x,y,z
|
|
which is the same as:
|
|
mlr sort -f a -f b -nr x -nr y -nr z
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>Example:
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint sort -f a -nr x data/small
|
|
a b i x y
|
|
eks pan 2 0.7586799647899636 0.5221511083334797
|
|
eks wye 4 0.38139939387114097 0.13418874328430463
|
|
pan pan 1 0.3467901443380824 0.7268028627434533
|
|
wye pan 5 0.5732889198020006 0.8636244699032729
|
|
wye wye 3 0.20460330576630303 0.33831852551664776
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>Here’s an example using <tt>sort</tt> to untangle log data: suppose multiple threads (labeled here by color) are all logging progress counts to a single log file. The log file is (by nature) chronological, so the progress of various threads is interleaved:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ head -n 10 data/multicountdown.dat
|
|
upsec=0.002,color=green,count=1203
|
|
upsec=0.083,color=red,count=3817
|
|
upsec=0.188,color=red,count=3801
|
|
upsec=0.395,color=blue,count=2697
|
|
upsec=0.526,color=purple,count=953
|
|
upsec=0.671,color=blue,count=2684
|
|
upsec=0.899,color=purple,count=926
|
|
upsec=0.912,color=red,count=3798
|
|
upsec=1.093,color=blue,count=2662
|
|
upsec=1.327,color=purple,count=917
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> We can group these by thread by sorting on the thread ID (here,
|
|
<tt>color</tt>). Since Miller’s sort is stable, this means that
|
|
timestamps within each thread’s log data are still chronological:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ head -n 20 data/multicountdown.dat | mlr --opprint sort -f color
|
|
upsec color count
|
|
0.395 blue 2697
|
|
0.671 blue 2684
|
|
1.093 blue 2662
|
|
2.064 blue 2659
|
|
2.2880000000000003 blue 2647
|
|
0.002 green 1203
|
|
1.407 green 1187
|
|
1.448 green 1177
|
|
2.313 green 1161
|
|
0.526 purple 953
|
|
0.899 purple 926
|
|
1.327 purple 917
|
|
1.703 purple 908
|
|
0.083 red 3817
|
|
0.188 red 3801
|
|
0.912 red 3798
|
|
1.416 red 3788
|
|
1.587 red 3782
|
|
1.601 red 3755
|
|
1.832 red 3717
|
|
</pre>
|
|
</div>
|
|
<p/>
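<p/>The effect of stability is easy to reproduce in Python, whose built-in <tt>sorted</tt> is also stable. The sketch below uses the first ten records of <tt>data/multicountdown.dat</tt> as shown above; it illustrates the property and says nothing about how Miller implements its sort.

```python
records = [
    {"upsec": 0.002, "color": "green", "count": 1203},
    {"upsec": 0.083, "color": "red", "count": 3817},
    {"upsec": 0.188, "color": "red", "count": 3801},
    {"upsec": 0.395, "color": "blue", "count": 2697},
    {"upsec": 0.526, "color": "purple", "count": 953},
    {"upsec": 0.671, "color": "blue", "count": 2684},
    {"upsec": 0.899, "color": "purple", "count": 926},
    {"upsec": 0.912, "color": "red", "count": 3798},
    {"upsec": 1.093, "color": "blue", "count": 2662},
    {"upsec": 1.327, "color": "purple", "count": 917},
]

# Sort on color only; ties keep their original (chronological) order.
by_thread = sorted(records, key=lambda rec: rec["color"])
upsecs = [rec["upsec"] for rec in by_thread]
print(upsecs)
```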
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="stats1"/><h2>stats1</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr stats1 --help
|
|
Usage: mlr stats1 [options]
|
|
Options:
|
|
-a {sum,count,...} Names of accumulators: p10 p25.2 p50 p98 p100 etc. and/or one or more of
|
|
count mode sum mean stddev var meaneb min max
|
|
-f {a,b,c} Value-field names on which to compute statistics
|
|
-g {d,e,f} Optional group-by-field names
|
|
Example: mlr stats1 -a min,p10,p50,p90,max -f value -g size,shape
|
|
Example: mlr stats1 -a count,mode -f size
|
|
Example: mlr stats1 -a count,mode -f size -g shape
|
|
Notes:
|
|
* p50 is a synonym for median.
|
|
* min and max output the same results as p0 and p100, respectively, but use less memory.
|
|
* count and mode allow text input; the rest require numeric input. In particular, 1 and 1.0
|
|
are distinct text for count and mode.
|
|
* When there are mode ties, the first-encountered datum wins.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
These are simple univariate statistics on one or more number-valued fields
|
|
(<tt>count</tt> and <tt>mode</tt> apply to non-numeric fields as well),
|
|
optionally categorized by one or more other fields.
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --oxtab stats1 -a count,sum,min,p10,p50,mean,p90,max -f x,y data/medium
|
|
x_count 10000
|
|
x_sum 4986.019682
|
|
x_min 0.000045
|
|
x_p10 0.093322
|
|
x_p50 0.501159
|
|
x_mean 0.498602
|
|
x_p90 0.900794
|
|
x_max 0.999953
|
|
y_count 10000
|
|
y_sum 5062.057445
|
|
y_min 0.000088
|
|
y_p10 0.102132
|
|
y_p50 0.506021
|
|
y_mean 0.506206
|
|
y_p90 0.905366
|
|
y_max 0.999965
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint stats1 -a mean -f x,y -g b then sort -f b data/medium
|
|
b x_mean y_mean
|
|
eks 0.506361 0.510293
|
|
hat 0.487899 0.513118
|
|
pan 0.497304 0.499599
|
|
wye 0.497593 0.504596
|
|
zee 0.504242 0.502997
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint stats1 -a p50,p99 -f u,v -g color then put '$ur=$u_p99/$u_p50;$vr=$v_p99/$v_p50' data/colored-shapes.dkvp
|
|
color u_p50 u_p99 v_p50 v_p99 ur vr
|
|
yellow 0.501019 0.989046 0.520630 0.987034 1.974069 1.895845
|
|
red 0.485038 0.990054 0.492586 0.994444 2.041189 2.018823
|
|
purple 0.501319 0.988893 0.504571 0.988287 1.972582 1.958668
|
|
green 0.502015 0.990764 0.505359 0.990175 1.973574 1.959350
|
|
blue 0.525226 0.992655 0.485170 0.993873 1.889958 2.048505
|
|
orange 0.483548 0.993635 0.480913 0.989102 2.054884 2.056717
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint count-distinct -f shape then sort -nr count data/colored-shapes.dkvp
|
|
shape count
|
|
square 4115
|
|
triangle 3372
|
|
circle 2591
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint stats1 -a mode -f color -g shape data/colored-shapes.dkvp
|
|
shape color_mode
|
|
triangle red
|
|
square red
|
|
circle red
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
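<p/>Two of the notes in the usage text, that <tt>mode</tt> compares values as text and that ties go to the first-encountered value, can be sketched in Python as follows. The function <tt>mode</tt> here is an illustration of those two rules, not Miller’s accumulator.

```python
def mode(values):
    """Most frequent value; on ties, the first-encountered value wins."""
    counts = {}  # insertion order records first encounter
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    best = None
    for value, count in counts.items():
        if best is None or count > counts[best]:  # strict '>' keeps the earlier value
            best = value
    return best


# "1" and "1.0" are distinct text, so this is a two-way tie won by "1".
tie_winner = mode(["1", "1.0", "1.0", "1"])
clear_winner = mode(["red", "blue", "blue"])
print(tie_winner, clear_winner)
```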
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="stats2"/><h2>stats2</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr stats2 --help
|
|
Usage: mlr stats2 [options]
|
|
-a {linreg-ols,corr,...} Names of accumulators: one or more of
|
|
linreg-pca linreg-ols r2 corr cov covx
|
|
r2 is a quality metric for linreg-ols; linreg-pca outputs its own quality metric.
|
|
-f {a,b,c,d} Value-field name-pairs on which to compute statistics.
|
|
There must be an even number of names.
|
|
-g {e,f,g} Optional group-by-field names.
|
|
-v Print additional output for linreg-pca.
|
|
Example: mlr stats2 -a linreg-pca -f x,y
|
|
Example: mlr stats2 -a linreg-ols,r2 -f x,y -g size,shape
|
|
Example: mlr stats2 -a corr -f x,y
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
These are simple bivariate statistics on one or more pairs of number-valued
|
|
fields, optionally categorized by one or more fields.
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --oxtab put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a cov,corr -f x,y,y,y,x2,xy,x2,y2 data/medium
|
|
x_y_cov 0.000043
|
|
x_y_corr 0.000504
|
|
y_y_cov 0.084611
|
|
y_y_corr 1.000000
|
|
x2_xy_cov 0.041884
|
|
x2_xy_corr 0.630174
|
|
x2_y2_cov -0.000310
|
|
x2_y2_corr -0.003425
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a linreg-ols,r2 -f x,y,y,y,xy,y2 -g a data/medium
|
|
a x_y_ols_m x_y_ols_b x_y_ols_n x_y_r2 y_y_ols_m y_y_ols_b y_y_ols_n y_y_r2 xy_y2_ols_m xy_y2_ols_b xy_y2_ols_n xy_y2_r2
|
|
pan 0.017026 0.500403 2081 0.000287 1.000000 0.000000 2081 1.000000 0.878132 0.119082 2081 0.417498
|
|
eks 0.040780 0.481402 1965 0.001646 1.000000 0.000000 1965 1.000000 0.897873 0.107341 1965 0.455632
|
|
wye -0.039153 0.525510 1966 0.001505 1.000000 0.000000 1966 1.000000 0.853832 0.126745 1966 0.389917
|
|
zee 0.002781 0.504307 2047 0.000008 1.000000 0.000000 2047 1.000000 0.852444 0.124017 2047 0.393566
|
|
hat -0.018621 0.517901 1941 0.000352 1.000000 0.000000 1941 1.000000 0.841230 0.135573 1941 0.368794
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
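<p/>For readers who want to see what the <tt>m</tt>, <tt>b</tt>, <tt>n</tt>, and <tt>r2</tt> columns stand for, here is a textbook ordinary-least-squares fit in Python. It is a sketch under assumptions: <tt>linreg_ols</tt> is a made-up function, <tt>r2</tt> is computed as the usual coefficient of determination (which may or may not match Miller’s exact formula), and the input is a toy data set rather than <tt>data/medium</tt>.

```python
def linreg_ols(xs, ys):
    """Fit y = m*x + b by least squares; return (m, b, n, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return m, b, n, r2


# Points lying exactly on y = 2x + 1.
m, b, n, r2 = linreg_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(m, b, n, r2)
```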
|
|
|
|
<p/>Here’s an example of a simple line fit. The <tt>x</tt> and <tt>y</tt>
fields of the <tt>data/medium</tt> dataset are independent and uniformly
distributed on the unit interval. Here we keep only the records in the lower-left and upper-right quadrants of the unit square (about half the data) and fit a line to what remains.
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
|
|
mlr filter '($x<.5 && $y<.5) || ($x>.5 && $y>.5)' data/medium > data/medium-squares
|
|
|
|
mlr --ofs newline stats2 -a linreg-pca -f x,y data/medium-squares
|
|
x_y_pca_m=1.014419
|
|
x_y_pca_b=0.000308
|
|
x_y_pca_quality=0.861354
|
|
|
|
# Set x_y_pca_m and x_y_pca_b as shell variables
|
|
eval $(mlr --ofs newline stats2 -a linreg-pca -f x,y data/medium-squares)
|
|
|
|
# In addition to x and y, make a new yfit which is the line fit. Plot using your favorite tool.
|
|
mlr --onidx put '$yfit='$x_y_pca_m'*$x+'$x_y_pca_b then cut -x -f a,b,i data/medium-squares \
|
|
| pgr -p -title 'linreg-pca example' -xmin 0 -xmax 1 -ymin 0 -ymax 1
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>I use <a href="https://github.com/johnkerl/pgr"><tt>pgr</tt></a> for
|
|
plotting; here’s a screenshot.
|
|
|
|
<center>
|
|
<img src="data/linreg-example.jpg"/>
|
|
</center>
|
|
|
|
<p/> (Thanks Drew Kunas for a good conversation about PCA!)
|
|
|
|
<p/> Here’s an example estimating time-to-completion for a set of jobs.
|
|
Input data comes from a log file, with number of work units left to do in the
|
|
<tt>count</tt> field and accumulated seconds in the <tt>upsec</tt> field,
|
|
labeled by the <tt>color</tt> field:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ head -n 10 data/multicountdown.dat
|
|
upsec=0.002,color=green,count=1203
|
|
upsec=0.083,color=red,count=3817
|
|
upsec=0.188,color=red,count=3801
|
|
upsec=0.395,color=blue,count=2697
|
|
upsec=0.526,color=purple,count=953
|
|
upsec=0.671,color=blue,count=2684
|
|
upsec=0.899,color=purple,count=926
|
|
upsec=0.912,color=red,count=3798
|
|
upsec=1.093,color=blue,count=2662
|
|
upsec=1.327,color=purple,count=917
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
We can do a linear regression on count remaining as a function of time: with <tt>c = m*u+b</tt>, where <tt>c</tt> is the count and <tt>u</tt> is the elapsed time in <tt>upsec</tt>, we want to find the
time when the count goes to zero, i.e. <tt>u=-b/m</tt>.
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --oxtab stats2 -a linreg-pca -f upsec,count -g color then put '$donesec = -$upsec_count_pca_b/$upsec_count_pca_m' data/multicountdown.dat
|
|
color green
|
|
upsec_count_pca_m -32.756917
|
|
upsec_count_pca_b 1213.722730
|
|
upsec_count_pca_n 24
|
|
upsec_count_pca_quality 0.999984
|
|
donesec 37.052410
|
|
|
|
color red
|
|
upsec_count_pca_m -37.367646
|
|
upsec_count_pca_b 3810.133400
|
|
upsec_count_pca_n 30
|
|
upsec_count_pca_quality 0.999989
|
|
donesec 101.963431
|
|
|
|
color blue
|
|
upsec_count_pca_m -29.231212
|
|
upsec_count_pca_b 2698.932820
|
|
upsec_count_pca_n 25
|
|
upsec_count_pca_quality 0.999959
|
|
donesec 92.330514
|
|
|
|
color purple
|
|
upsec_count_pca_m -39.030097
|
|
upsec_count_pca_b 979.988341
|
|
upsec_count_pca_n 21
|
|
upsec_count_pca_quality 0.999991
|
|
donesec 25.108529
|
|
</pre>
|
|
</div>
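<p/> The <tt>donesec</tt> arithmetic can be checked by hand. The following is a minimal Python sketch, not part of Miller; the function name <tt>done_time</tt> is a hypothetical helper, and the slope and intercept are copied from the green job in the output above.

```python
# Sketch: solve c = m*u + b for the time u at which the count c reaches zero.
def done_time(m, b):
    """Return u = -b/m, the zero crossing of the fitted line c = m*u + b."""
    if m == 0:
        raise ValueError("slope is zero: the count never reaches zero")
    return -b / m

# Slope (upsec_count_pca_m) and intercept (upsec_count_pca_b) for the green job.
print("%.6f" % done_time(-32.756917, 1213.722730))
```

The printed value should agree with the <tt>donesec</tt> field for the green job, up to rounding of the six-digit inputs.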
|
|
<p/>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="step"/><h2>step</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr step --help
|
|
Usage: mlr step [options]
|
|
-a {delta,rsum,...} Names of steppers: one or more of
|
|
delta ratio rsum counter
|
|
-f {a,b,c} Value-field names on which to compute statistics
|
|
-g {d,e,f} Group-by-field names
|
|
Computes values dependent on the previous record, optionally grouped by category.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
Most Miller commands are record-at-a-time, with the exception of <tt>stats1</tt>,
<tt>stats2</tt>, and <tt>histogram</tt>, which compute aggregate output. The
<tt>step</tt> command is intermediate: it adds fields which are functions of
fields from previous records. In the examples below, <tt>delta</tt> is the change
from the previous record&rsquo;s value, <tt>rsum</tt> (short for <i>running sum</i>)
is the cumulative total, and <tt>counter</tt> is a running count of records.
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint step -a delta,rsum,counter -f x data/medium | head -15
|
|
a b i x y x_delta x_rsum x_counter
|
|
pan pan 1 0.3467901443380824 0.7268028627434533 0.346790 0.346790 1
|
|
eks pan 2 0.7586799647899636 0.5221511083334797 0.411890 1.105470 2
|
|
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077 1.310073 3
|
|
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796 1.691473 4
|
|
wye pan 5 0.5732889198020006 0.8636244699032729 0.191890 2.264762 5
|
|
zee pan 6 0.5271261600918548 0.49322128674835697 -0.046163 2.791888 6
|
|
eks zee 7 0.6117840605678454 0.1878849191181694 0.084658 3.403672 7
|
|
zee wye 8 0.5985540091064224 0.976181385699006 -0.013230 4.002226 8
|
|
hat wye 9 0.03144187646093577 0.7495507603507059 -0.567112 4.033668 9
|
|
pan wye 10 0.5026260055412137 0.9526183602969864 0.471184 4.536294 10
|
|
pan pan 11 0.7930488423451967 0.6505816637259333 0.290423 5.329343 11
|
|
zee pan 12 0.3676141320555616 0.23614420670296965 -0.425435 5.696957 12
|
|
eks pan 13 0.4915175580479536 0.7709126592971468 0.123903 6.188474 13
|
|
eks zee 14 0.5207382318405251 0.34141681118811673 0.029221 6.709213 14
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint step -a delta,rsum,counter -f x -g a data/medium | head -15
|
|
a b i x y x_delta x_rsum x_counter
|
|
pan pan 1 0.3467901443380824 0.7268028627434533 0.346790 0.346790 1
|
|
eks pan 2 0.7586799647899636 0.5221511083334797 0.758680 0.758680 1
|
|
wye wye 3 0.20460330576630303 0.33831852551664776 0.204603 0.204603 1
|
|
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281 1.140079 2
|
|
wye pan 5 0.5732889198020006 0.8636244699032729 0.368686 0.777892 2
|
|
zee pan 6 0.5271261600918548 0.49322128674835697 0.527126 0.527126 1
|
|
eks zee 7 0.6117840605678454 0.1878849191181694 0.230385 1.751863 3
|
|
zee wye 8 0.5985540091064224 0.976181385699006 0.071428 1.125680 2
|
|
hat wye 9 0.03144187646093577 0.7495507603507059 0.031442 0.031442 1
|
|
pan wye 10 0.5026260055412137 0.9526183602969864 0.155836 0.849416 2
|
|
pan pan 11 0.7930488423451967 0.6505816637259333 0.290423 1.642465 3
|
|
zee pan 12 0.3676141320555616 0.23614420670296965 -0.230940 1.493294 3
|
|
eks pan 13 0.4915175580479536 0.7709126592971468 -0.120267 2.243381 4
|
|
eks zee 14 0.5207382318405251 0.34141681118811673 0.029221 2.764119 5
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
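<p/> The stepper logic in the two examples above can be sketched in a few lines of Python. This is an illustration only, not Miller&rsquo;s implementation: the function name <tt>step</tt> and the dict-per-record representation are assumptions. It reproduces the behavior visible in the sample output, where the first <tt>delta</tt> in each group equals the value itself; other Miller versions may differ on that point.

```python
from collections import defaultdict

def step(records, field, group_by=None):
    """Yield copies of records with FIELD_delta, FIELD_rsum, FIELD_counter added."""
    prev = defaultdict(float)   # last value seen per group (0.0 before the first record)
    rsum = defaultdict(float)   # running sum per group
    count = defaultdict(int)    # record counter per group
    for rec in records:
        # Simplification: every record is assumed to carry the group-by field.
        key = rec[group_by] if group_by else None
        x = float(rec[field])
        rsum[key] += x
        count[key] += 1
        out = dict(rec)
        out[field + "_delta"] = x - prev[key]
        out[field + "_rsum"] = rsum[key]
        out[field + "_counter"] = count[key]
        prev[key] = x
        yield out

# First four records of data/medium, reduced to the fields used here.
records = [
    {"a": "pan", "x": "0.3467901443380824"},
    {"a": "eks", "x": "0.7586799647899636"},
    {"a": "wye", "x": "0.20460330576630303"},
    {"a": "eks", "x": "0.38139939387114097"},
]
for out in step(records, "x", group_by="a"):
    print(out["a"], "%.6f %.6f %d" % (out["x_delta"], out["x_rsum"], out["x_counter"]))
```

Keeping one small state entry per group is what lets the computation stay streaming: no record needs to be retained once its output has been emitted.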
|
|
|
|
Example deriving load-average deltas from successive <tt>uptime</tt> outputs (field 11 is the second of the three load averages):
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ each 10 uptime | mlr -p step -a delta -f 11
|
|
...
|
|
20:08 up 36 days, 10:38, 5 users, load averages: 1.42 1.62 1.73 0.000000
|
|
20:08 up 36 days, 10:38, 5 users, load averages: 1.55 1.64 1.74 0.020000
|
|
20:08 up 36 days, 10:38, 7 users, load averages: 1.58 1.65 1.74 0.010000
|
|
20:08 up 36 days, 10:38, 9 users, load averages: 1.78 1.69 1.76 0.040000
|
|
20:08 up 36 days, 10:39, 9 users, load averages: 2.12 1.76 1.78 0.070000
|
|
20:08 up 36 days, 10:39, 9 users, load averages: 2.51 1.85 1.81 0.090000
|
|
20:08 up 36 days, 10:39, 8 users, load averages: 2.79 1.92 1.83 0.070000
|
|
20:08 up 36 days, 10:39, 4 users, load averages: 2.64 1.90 1.83 -0.020000
|
|
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="tac"/><h2>tac</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr tac --help
|
|
Usage: mlr tac
|
|
Prints records in reverse order from the order in which they were encountered.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>Prints the records in the input stream in reverse order. Note: this
|
|
requires Miller to retain all input records in memory before any output records
|
|
are produced.
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --opprint cat a.csv
|
|
a b c
|
|
1 2 3
|
|
4 5 6
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --opprint cat b.csv
|
|
a b c
|
|
7 8 9
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --opprint tac a.csv b.csv
|
|
a b c
|
|
7 8 9
|
|
4 5 6
|
|
1 2 3
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --opprint put '$filename=FILENAME' then tac a.csv b.csv
|
|
a b c filename
|
|
7 8 9 b.csv
|
|
4 5 6 a.csv
|
|
1 2 3 a.csv
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="tail"/><h2>tail</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr tail --help
|
|
Usage: mlr tail [options]
|
|
-n {count} Tail count to print; default 10
|
|
-g {a,b,c} Optional group-by-field names for tail counts
|
|
Passes through the last n records, optionally by category.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> Prints the last <i>n</i> records in the input stream, optionally by category.
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint tail -n 4 data/colored-shapes.dkvp
|
|
color shape flag i u v w x
|
|
blue square 1 99974 0.6189062525431605 0.2637962404841453 0.5311465405784674 6.210738209085753
|
|
blue triangle 0 99976 0.008110504040268474 0.8267274952432482 0.4732962944898885 6.146956761817328
|
|
yellow triangle 0 99990 0.3839424618160777 0.55952913620132 0.5113763011485609 4.307973891915119
|
|
yellow circle 1 99994 0.764950884927175 0.25284227383991364 0.49969878539567425 5.013809741826425
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint tail -n 1 -g shape data/colored-shapes.dkvp
|
|
color shape flag i u v w x
|
|
yellow triangle 0 99990 0.3839424618160777 0.55952913620132 0.5113763011485609 4.307973891915119
|
|
blue square 1 99974 0.6189062525431605 0.2637962404841453 0.5311465405784674 6.210738209085753
|
|
yellow circle 1 99994 0.764950884927175 0.25284227383991364 0.49969878539567425 5.013809741826425
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
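<p/> A bounded buffer per group is enough to implement this behavior. The Python sketch below is an assumption about one reasonable approach, not Miller&rsquo;s source; the function name <tt>tail</tt> and the dict-per-record representation are hypothetical. Groups are emitted in order of first appearance, which is consistent with the grouped example above.

```python
from collections import deque

def tail(records, n=10, group_by=None):
    """Return the last n records, or the last n per group when group_by is given."""
    buffers = {}  # group key -> bounded buffer; dicts keep first-appearance order
    for rec in records:
        # Simplification: records lacking the group-by field share the key None.
        key = rec.get(group_by) if group_by else None
        if key not in buffers:
            buffers[key] = deque(maxlen=n)
        buffers[key].append(rec)  # a full deque silently drops its oldest entry
    return [rec for buf in buffers.values() for rec in buf]

records = [
    {"shape": "triangle", "i": 1},
    {"shape": "square", "i": 2},
    {"shape": "triangle", "i": 3},
]
print(tail(records, n=1, group_by="shape"))
```

Memory use is proportional to <i>n</i> times the number of distinct groups rather than to the size of the input.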
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="top"/><h2>top</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr top --help
|
|
Usage: mlr top [options]
|
|
-f {a,b,c} Value-field names for top counts
|
|
-g {d,e,f} Optional group-by-field names for top counts
|
|
-n {count} How many records to print per category; default 1
|
|
-a Print all fields for top-value records; default is
|
|
to print only value and group-by fields.
|
|
--min Print top smallest values; default is top largest values
|
|
Prints the n records with smallest/largest values at specified fields, optionally by category.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
Note that <tt>top</tt> is distinct from <a href="#head"><tt>head</tt></a>:
<tt>head</tt> shows the records which appear first in the data stream, while
<tt>top</tt> shows the records whose values in the specified fields are numerically largest (or smallest).
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint top -n 4 -f x data/medium
|
|
top_idx x_top
|
|
1 0.999953
|
|
2 0.999823
|
|
3 0.999733
|
|
4 0.999563
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint top -n 2 -f x -g a then sort -f a data/medium
|
|
a top_idx x_top
|
|
eks 1 0.998811
|
|
eks 2 0.998534
|
|
hat 1 0.999953
|
|
hat 2 0.999733
|
|
pan 1 0.999403
|
|
pan 2 0.999044
|
|
wye 1 0.999823
|
|
wye 2 0.999264
|
|
zee 1 0.999490
|
|
zee 2 0.999438
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
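<p/> The default output shape (<tt>top_idx</tt> plus a <tt>_top</tt> field, optionally per group) can be sketched in Python as follows. This is an illustration under stated assumptions, not Miller&rsquo;s implementation: the function name <tt>top</tt> and its parameters are hypothetical, and for brevity it collects all values per group before selecting, where a production version would keep only the best <i>n</i> seen so far.

```python
import heapq

def top(records, field, n=1, group_by=None, smallest=False):
    """Return rows holding the n largest (or smallest) values of field per group."""
    values = {}  # group key -> list of numeric values, in first-appearance order
    for rec in records:
        if field not in rec:
            continue  # assumption: records lacking the value field are ignored
        key = rec.get(group_by) if group_by else None
        values.setdefault(key, []).append(float(rec[field]))
    pick = heapq.nsmallest if smallest else heapq.nlargest
    rows = []
    for key, vals in values.items():
        for idx, value in enumerate(pick(n, vals), start=1):
            row = {}
            if group_by:
                row[group_by] = key
            row["top_idx"] = idx
            row[field + "_top"] = value
            rows.append(row)
    return rows

records = [
    {"a": "pan", "x": "0.3"},
    {"a": "eks", "x": "0.9"},
    {"a": "pan", "x": "0.7"},
]
print(top(records, "x", n=1, group_by="a"))
```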
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="uniq"/><h2>uniq</h2>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr uniq --help
|
|
Usage: mlr uniq [options]
|
|
-g {d,e,f} Group-by-field names for uniq counts
|
|
-c Show repeat counts in addition to unique values
|
|
Prints distinct values for specified field names. With -c, same as count-distinct.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ wc -l data/colored-shapes.dkvp
|
|
10078 data/colored-shapes.dkvp
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr uniq -g color,shape data/colored-shapes.dkvp
|
|
color=yellow,shape=triangle
|
|
color=red,shape=square
|
|
color=red,shape=circle
|
|
color=purple,shape=triangle
|
|
color=yellow,shape=circle
|
|
color=purple,shape=square
|
|
color=yellow,shape=square
|
|
color=red,shape=triangle
|
|
color=green,shape=triangle
|
|
color=green,shape=square
|
|
color=blue,shape=circle
|
|
color=blue,shape=triangle
|
|
color=purple,shape=circle
|
|
color=blue,shape=square
|
|
color=green,shape=circle
|
|
color=orange,shape=triangle
|
|
color=orange,shape=square
|
|
color=orange,shape=circle
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint uniq -g color,shape -c then sort -f color,shape data/colored-shapes.dkvp
|
|
color shape count
|
|
blue circle 384
|
|
blue square 589
|
|
blue triangle 497
|
|
green circle 287
|
|
green square 454
|
|
green triangle 368
|
|
orange circle 68
|
|
orange square 128
|
|
orange triangle 107
|
|
purple circle 289
|
|
purple square 481
|
|
purple triangle 372
|
|
red circle 1207
|
|
red square 1874
|
|
red triangle 1560
|
|
yellow circle 356
|
|
yellow square 589
|
|
yellow triangle 468
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
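<p/> Distinct-value counting of this kind amounts to a dictionary keyed by the tuple of group-by values. The Python sketch below is a hedged illustration rather than Miller&rsquo;s code; the function name <tt>uniq</tt> and the <tt>with_counts</tt> parameter are hypothetical stand-ins for the command and its <tt>-c</tt> flag, and skipping records that lack a group-by field is an assumption.

```python
def uniq(records, group_by, with_counts=False):
    """Return one row per distinct combination of the group_by field values."""
    counts = {}  # tuple of values -> number of records seen; insertion-ordered
    for rec in records:
        if not all(name in rec for name in group_by):
            continue  # assumption: records missing a group-by field are skipped
        key = tuple(rec[name] for name in group_by)
        counts[key] = counts.get(key, 0) + 1
    rows = []
    for key, count in counts.items():
        row = dict(zip(group_by, key))
        if with_counts:
            row["count"] = count
        rows.append(row)
    return rows

records = [
    {"color": "yellow", "shape": "triangle"},
    {"color": "red", "shape": "square"},
    {"color": "yellow", "shape": "triangle"},
]
print(uniq(records, ["color", "shape"], with_counts=True))
```

Memory use grows with the number of distinct combinations, not with the number of input records.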
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="Functions_for_filter_and_put"/><h1>Functions for filter and put</h1>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --help-all-functions
|
|
abs (math: #args=1): Absolute value.
|
|
acos (math: #args=1): Inverse trigonometric cosine.
|
|
acosh (math: #args=1): Inverse hyperbolic cosine.
|
|
asin (math: #args=1): Inverse trigonometric sine.
|
|
asinh (math: #args=1): Inverse hyperbolic sine.
|
|
atan (math: #args=1): One-argument arctangent.
|
|
atan2 (math: #args=2): Two-argument arctangent.
|
|
atanh (math: #args=1): Inverse hyperbolic tangent.
|
|
cbrt (math: #args=1): Cube root.
|
|
ceil (math: #args=1): Ceiling: nearest integer at or above.
|
|
cos (math: #args=1): Trigonometric cosine.
|
|
cosh (math: #args=1): Hyperbolic cosine.
|
|
erf (math: #args=1): Error function.
|
|
erfc (math: #args=1): Complementary error function.
|
|
exp (math: #args=1): Exponential function e**x.
|
|
expm1 (math: #args=1): e**x - 1.
|
|
floor (math: #args=1): Floor: nearest integer at or below.
|
|
invqnorm (math: #args=1): Inverse of normal cumulative distribution function. Note that invqnorm(urand()) is normally distributed.
|
|
log (math: #args=1): Natural (base-e) logarithm.
|
|
log10 (math: #args=1): Base-10 logarithm.
|
|
log1p (math: #args=1): log(1+x).
|
|
max (math: #args=2): max of two numbers; null loses
|
|
min (math: #args=2): min of two numbers; null loses
|
|
pow (math: #args=2): Exponentiation; same as **.
|
|
qnorm (math: #args=1): Normal cumulative distribution function.
|
|
round (math: #args=1): Round to nearest integer.
|
|
roundm (math: #args=2): Round to nearest multiple of m: roundm($x,$m) is the same as round($x/$m)*$m
|
|
sin (math: #args=1): Trigonometric sine.
|
|
sinh (math: #args=1): Hyperbolic sine.
|
|
sqrt (math: #args=1): Square root.
|
|
tan (math: #args=1): Trigonometric tangent.
|
|
tanh (math: #args=1): Hyperbolic tangent.
|
|
urand (math: #args=0): Floating-point numbers on the unit interval. Int-valued example: '$n=floor(20+urand()*11)'.
|
|
+ (math: #args=2): Addition.
|
|
- (math: #args=1): Unary minus.
|
|
- (math: #args=2): Subtraction.
|
|
* (math: #args=2): Multiplication.
|
|
/ (math: #args=2): Division.
|
|
% (math: #args=2): Remainder; never negative-valued.
|
|
** (math: #args=2): Exponentiation; same as pow.
|
|
== (boolean: #args=2): String/numeric equality. Mixing number and string results in string compare.
|
|
!= (boolean: #args=2): String/numeric inequality. Mixing number and string results in string compare.
|
|
> (boolean: #args=2): String/numeric greater-than. Mixing number and string results in string compare.
|
|
>= (boolean: #args=2): String/numeric greater-than-or-equals. Mixing number and string results in string compare.
|
|
< (boolean: #args=2): String/numeric less-than. Mixing number and string results in string compare.
|
|
<= (boolean: #args=2): String/numeric less-than-or-equals. Mixing number and string results in string compare.
|
|
&& (boolean: #args=2): Logical AND.
|
|
|| (boolean: #args=2): Logical OR.
|
|
! (boolean: #args=1): Logical negation.
|
|
strlen (string: #args=1): String length.
|
|
sub (string: #args=3): Example: '$name=sub($name, "old", "new")'. Regexes not supported.
|
|
tolower (string: #args=1): Convert string to lowercase.
|
|
toupper (string: #args=1): Convert string to uppercase.
|
|
. (string: #args=2): String concatenation.
|
|
boolean (conversion: #args=1): Convert int/float/bool/string to boolean.
|
|
float (conversion: #args=1): Convert int/float/bool/string to float.
|
|
int (conversion: #args=1): Convert int/float/bool/string to int.
|
|
string (conversion: #args=1): Convert int/float/bool/string to string.
|
|
hexfmt (conversion: #args=1): Convert int to string, e.g. 255 to "0xff".
|
|
fmtnum (conversion: #args=2): Convert int/float/bool to string using printf-style format string, e.g. "%06lld".
|
|
systime (time: #args=0): Floating-point seconds since the epoch, e.g. 1440768801.748936.
|
|
sec2gmt (time: #args=1): Formats seconds since epoch (integer part only) as GMT timestamp, e.g. sec2gmt(1440768801.7) = "2015-08-28T13:33:21Z".
|
|
gmt2sec (time: #args=1): Parses GMT timestamp as integer seconds since epoch.
|
|
sec2hms (time: #args=1): Formats integer seconds as in sec2hms(5000) = "01:23:20"
|
|
sec2dhms (time: #args=1): Formats integer seconds as in sec2dhms(500000) = "5d18h53m20s"
|
|
hms2sec (time: #args=1): Recovers integer seconds as in hms2sec("01:23:20") = 5000
|
|
dhms2sec (time: #args=1): Recovers integer seconds as in dhms2sec("5d18h53m20s") = 500000
|
|
fsec2hms (time: #args=1): Formats floating-point seconds as in fsec2hms(5000.25) = "01:23:20.250000"
|
|
fsec2dhms (time: #args=1): Formats floating-point seconds as in fsec2dhms(500000.25) = "5d18h53m20.250000s"
|
|
hms2fsec (time: #args=1): Recovers floating-point seconds as in hms2fsec("01:23:20.250000") = 5000.250000
|
|
dhms2fsec (time: #args=1): Recovers floating-point seconds as in dhms2fsec("5d18h53m20.250000s") = 500000.250000
|
|
To set the seed for urand, you may specify decimal or hexadecimal 32-bit
|
|
numbers of the form "mlr --seed 123456789" or "mlr --seed 0xcafefeed".
|
|
Miller's built-in variables are NF, NR, FNR, FILENUM, and FILENAME (awk-like)
|
|
along with the mathematical constants PI and E.
|
|
</pre>
|
|
</div>
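<p/> Several of the worked examples in the function list above can be reproduced with a few lines of Python. The sketch below is an approximation for checking the arithmetic, not Miller&rsquo;s implementation: the Python function names simply mirror the Miller ones, only non-negative inputs like those in the examples are considered, and Python&rsquo;s <tt>round</tt> may treat exact half-way cases differently from Miller.

```python
import time

def roundm(x, m):
    """Round x to the nearest multiple of m, as round(x/m)*m."""
    return round(x / m) * m

def sec2hms(sec):
    """Format integer seconds as HH:MM:SS."""
    hours, rem = divmod(int(sec), 3600)
    minutes, seconds = divmod(rem, 60)
    return "%02d:%02d:%02d" % (hours, minutes, seconds)

def sec2dhms(sec):
    """Format integer seconds as e.g. 5d18h53m20s."""
    days, rem = divmod(int(sec), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return "%dd%02dh%02dm%02ds" % (days, hours, minutes, seconds)

def sec2gmt(sec):
    """Format seconds since the epoch (integer part only) as a GMT timestamp."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(int(sec)))

print(sec2hms(5000))
print(sec2dhms(500000))
print(sec2gmt(1440768801.7))
print(roundm(7, 3))
```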
|
|
<p/>
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="Data_types"/><h1>Data types</h1>
|
|
|
|
<p/> Miller’s input and output are all string-oriented: there is (as of
|
|
August 2015 anyway) no support for binary record packing. In this sense,
|
|
everything is a string in and out of Miller. During processing, field names
|
|
are always strings, even if they have names like "3"; field values are usually
|
|
strings. Field values’ ability to be interpreted as a non-string type
|
|
only has meaning when comparison or function operations are done on them. And
|
|
it is an error condition if Miller encounters non-numeric (or otherwise
|
|
mistyped) data in a field in which it has been asked to do numeric (or
|
|
otherwise type-specific) operations.
|
|
|
|
<p/> Field values are treated as numeric for the following:
|
|
<ul>
|
|
<li/> Numeric sort: <tt>mlr sort -n</tt>, <tt>mlr sort -nr</tt>.
|
|
<li/> Statistics: <tt>mlr histogram</tt>, <tt>mlr stats1</tt>, <tt>mlr stats2</tt>.
|
|
<li/> Cross-record arithmetic: <tt>mlr step</tt>.
|
|
</ul>
|
|
|
|
<p/>For <tt>mlr put</tt> and <tt>mlr filter</tt>:
|
|
<ul>
|
|
<li/> Miller’s types for function processing are <b>null</b> (empty string), <b>error</b>, <b>string</b>, <b>float</b> (double-precision), <b>int</b> (64-bit signed), and <b>boolean</b>.
|
|
<li/> On input, string values representable as numbers (e.g. "3" or "3.1") are treated as float. If a record
|
|
has <tt>x=1,y=2</tt> then <tt>mlr put '$z=$x+$y'</tt> will produce <tt>x=1,y=2,z=3</tt>, and
|
|
<tt>mlr put '$z=$x.$y'</tt> gives an error. To coerce back to string for
|
|
processing, use the <tt>string</tt> function:
|
|
<tt>mlr put '$z=string($x).string($y)'</tt> will produce <tt>x=1,y=2,z=12</tt>.
|
|
<li/> On input, string values representable as boolean (e.g. <tt>"true"</tt>,
|
|
<tt>"false"</tt>) are <i>not</i> automatically treated as boolean.
|
|
(This is because <tt>"true"</tt> and <tt>"false"</tt> are ordinary words, and automatic string-to-boolean conversion
on a column consisting of words would leave some values as strings and turn others into booleans.)
|
|
Use the <tt>boolean</tt> function to coerce: e.g. giving the record <tt>x=1,y=2,w=false</tt> to
|
|
<tt>mlr put '$z=($x<$y) || boolean($w)'</tt>.
|
|
<li/> Functions take types as described in <tt>mlr --help-all-functions</tt>: for example, <tt>log10</tt>
|
|
takes float input and produces float output, <tt>gmt2sec</tt> maps string to int, and <tt>sec2gmt</tt>
|
|
maps int to string.
|
|
<li/> All math functions described in <tt>mlr --help-all-functions</tt> take integer as well as float input.
|
|
</ul>
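<p/> The input-typing rules above can be summarized in a short Python sketch. This is an illustration of the rules as stated on this page, not Miller&rsquo;s parser: the helper names <tt>infer</tt> and <tt>boolean</tt> are hypothetical, <tt>None</tt> stands in for Miller&rsquo;s null type, and Python&rsquo;s <tt>float</tt> accepts some spellings (such as <tt>"inf"</tt> or surrounding whitespace) that Miller may treat differently.

```python
def infer(value):
    """Type a raw field value: empty is null, number-like is float, else string."""
    if value == "":
        return None          # null: key present, value empty
    try:
        return float(value)  # "3" and "3.1" are both treated as float
    except ValueError:
        return value         # everything else, including "true", stays a string

def boolean(value):
    """Explicitly coerce the strings "true" and "false" to booleans."""
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError("not representable as boolean: %r" % (value,))

record = {"x": "1", "y": "2", "w": "false"}
typed = {name: infer(value) for name, value in record.items()}
print(typed)
print(boolean(record["w"]))
```

Making the boolean conversion explicit keeps a column of ordinary words uniformly string-typed even when some of those words happen to be <tt>true</tt> or <tt>false</tt>.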
|
|
|
|
<!-- ================================================================ -->
|
|
<a id="Null_data"/><h1>Null data</h1>
|
|
|
|
<p/> One of Miller’s key features is its support for <b>heterogeneous</b> data.
|
|
Accordingly, if you try to sort on field <tt>hostname</tt> when not all records in the data
|
|
stream <i>have</i> a field named <tt>hostname</tt>, it is not an error (although you could
|
|
pre-filter the data stream using <tt>mlr having-fields --at-least hostname then sort ...</tt>).
|
|
Rather, records lacking one or more sort keys are simply output contiguously by <tt>mlr sort</tt>.
|
|
|
|
<p/> A field value is also null when its key is present but its value is empty:
e.g. sending <tt>x=,y=2</tt> to <tt>mlr put '$z=$x+$y'</tt>.
|
|
|
|
<p/>
|
|
Rules for null-handling:
|
|
<ul>
|
|
<li> Records with one or more null sort-field values sort after records with all sort-field values present (or before them when the sort is reversed, as with <tt>-nr</tt>):
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint cat data/sort-null.dat
|
|
a b
|
|
3 2
|
|
1 8
|
|
- 4
|
|
5 7
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint sort -n a data/sort-null.dat
|
|
a b
|
|
1 8
|
|
3 2
|
|
5 7
|
|
- 4
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint sort -nr a data/sort-null.dat
|
|
a b
|
|
- 4
|
|
5 7
|
|
3 2
|
|
1 8
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<li> Functions which have one or more null arguments produce null output: e.g.
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'x=2,y=3' | mlr put '$a=$x+$y'
|
|
x=2,y=3,a=5.000000
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'x=,y=3' | mlr put '$a=$x+$y'
|
|
x=,y=3,a=
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'
|
|
x=,y=3,a=,b=1.098612
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<li> The <tt>min</tt> and <tt>max</tt> functions are special: if one argument is non-null, it wins:
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'
|
|
x=,y=3,a=3.000000,b=3.000000
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</ul>
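<p/> The null-handling rules above can be sketched in Python as follows. This is a hedged illustration rather than Miller&rsquo;s implementation: null is represented by the empty string, and the helper names <tt>plus</tt>, <tt>null_loses</tt>, and <tt>sort_numeric</tt> are hypothetical.

```python
def is_null(value):
    """A field is null when its value is the empty string."""
    return value == ""

def plus(x, y):
    """Ordinary functions propagate null: any null argument gives null output."""
    if is_null(x) or is_null(y):
        return ""
    return float(x) + float(y)

def null_loses(pick, x, y):
    """min and max are special: a non-null argument wins over a null one."""
    if is_null(x):
        return "" if is_null(y) else float(y)
    if is_null(y):
        return float(x)
    return pick(float(x), float(y))

def sort_numeric(records, field, reverse=False):
    """Numeric sort with nulls last; reversing the sort puts them first."""
    def key(rec):
        value = rec[field]
        return (is_null(value), 0.0 if is_null(value) else float(value))
    return sorted(records, key=key, reverse=reverse)

# Same data as data/sort-null.dat in the examples above.
records = [{"a": "3", "b": "2"}, {"a": "1", "b": "8"}, {"a": "", "b": "4"}, {"a": "5", "b": "7"}]
print([rec["a"] for rec in sort_numeric(records, "a")])
print(repr(plus("", "3")), null_loses(min, "", "3"), null_loses(max, "", "3"))
```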
|
|
</div>
|
|
</td>
|
|
|
|
</table>
|
|
</body>
|
|
</html>
|