mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 10:15:36 +00:00
775 lines
34 KiB
HTML
775 lines
34 KiB
HTML
POKI_PUT_TOC_HERE
|
|
|
|
<p/>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="expand_all();" href="javascript:;">Expand all sections</button>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="collapse_all();" href="javascript:;">Collapse all sections</button>
|
|
|
|
<h1>Command overview</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_overview');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_overview" style="display: block">
|
|
|
|
<p>
|
|
Whereas the Unix toolkit is made of the separate executables <tt>cat</tt>, <tt>tail</tt>, <tt>cut</tt>,
|
|
<tt>sort</tt>, etc., Miller has subcommands, invoked as follows:
|
|
|
|
POKI_INCLUDE_ESCAPED(data/subcommand-example.txt)HERE
|
|
|
|
<p/>These fall into categories as follows:
|
|
|
|
<table border=1>
|
|
<tr class="mlrbg">
|
|
<th>Commands </th>
|
|
<th>Description</th>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#cat"><tt>cat</tt></a>,
|
|
<a href="reference-verbs.html#cut"><tt>cut</tt></a>,
|
|
<a href="reference-verbs.html#grep"><tt>grep</tt></a>,
|
|
<a href="reference-verbs.html#head"><tt>head</tt></a>,
|
|
<a href="reference-verbs.html#join"><tt>join</tt></a>,
|
|
<a href="reference-verbs.html#sort"><tt>sort</tt></a>,
|
|
<a href="reference-verbs.html#tac"><tt>tac</tt></a>,
|
|
<a href="reference-verbs.html#tail"><tt>tail</tt></a>,
|
|
<a href="reference-verbs.html#top"><tt>top</tt></a>,
|
|
<a href="reference-verbs.html#uniq"><tt>uniq</tt></a>
|
|
</td>
|
|
<td> Analogs of their Unix-toolkit namesakes, discussed below as well as in
|
|
POKI_PUT_LINK_FOR_PAGE(feature-comparison.html)HERE </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#filter"><tt>filter</tt></a>,
|
|
<a href="reference-verbs.html#put"><tt>put</tt></a>,
|
|
<a href="reference-verbs.html#sec2gmt"><tt>sec2gmt</tt></a>,
|
|
<a href="reference-verbs.html#sec2gmtdate"><tt>sec2gmtdate</tt></a>,
|
|
<a href="reference-verbs.html#step"><tt>step</tt></a>,
|
|
<a href="reference-verbs.html#tee"><tt>tee</tt></a>
|
|
</td>
|
|
<td> <tt>awk</tt>-like functionality </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#bar"><tt>bar</tt></a>,
|
|
<a href="reference-verbs.html#bootstrap"><tt>bootstrap</tt></a>,
|
|
<a href="reference-verbs.html#decimate"><tt>decimate</tt></a>,
|
|
<a href="reference-verbs.html#histogram"><tt>histogram</tt></a>,
|
|
<a href="reference-verbs.html#least-frequent"><tt>least-frequent</tt></a>,
|
|
<a href="reference-verbs.html#most-frequent"><tt>most-frequent</tt></a>,
|
|
<a href="reference-verbs.html#sample"><tt>sample</tt></a>,
|
|
<a href="reference-verbs.html#shuffle"><tt>shuffle</tt></a>,
|
|
<a href="reference-verbs.html#stats1"><tt>stats1</tt></a>,
|
|
<a href="reference-verbs.html#stats2"><tt>stats2</tt></a>
|
|
</td>
|
|
<td> Statistically oriented </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#group-by"><tt>group-by</tt></a>,
|
|
<a href="reference-verbs.html#group-like"><tt>group-like</tt></a>,
|
|
<a href="reference-verbs.html#having-fields"><tt>having-fields</tt></a>
|
|
</td>
|
|
<td> Particularly oriented toward POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE, although
|
|
all Miller commands can handle heterogeneous records
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#check"><tt>check</tt></a>,
|
|
<a href="reference-verbs.html#count-distinct"><tt>count-distinct</tt></a>,
|
|
<a href="reference-verbs.html#label"><tt>label</tt></a>,
|
|
<a href="reference-verbs.html#merge-fields"><tt>merge-fields</tt></a>,
|
|
<a href="reference-verbs.html#nest"><tt>nest</tt></a>,
|
|
<a href="reference-verbs.html#nothing"><tt>nothing</tt></a>,
|
|
<a href="reference-verbs.html#regularize"><tt>rename</tt></a>,
|
|
<a href="reference-verbs.html#rename"><tt>rename</tt></a>,
|
|
<a href="reference-verbs.html#reorder"><tt>reorder</tt></a>,
|
|
<a href="reference-verbs.html#reshape"><tt>reshape</tt></a>,
|
|
<a href="reference-verbs.html#seqgen"><tt>seqgen</tt></a>
|
|
</td>
|
|
<td> These draw from other sources (see also POKI_PUT_LINK_FOR_PAGE(originality.html)HERE):
|
|
<a href="reference-verbs.html#count-distinct"><tt>count-distinct</tt></a> is SQL-ish, and
|
|
<a href="reference-verbs.html#rename"><tt>rename</tt></a> can be done by <tt>sed</tt> (which does it faster:
|
|
see POKI_PUT_LINK_FOR_PAGE(performance.html)HERE).
|
|
</td>
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
</div>
|
|
<!-- ================================================================ -->
|
|
<h1>I/O options</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_io_options');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_io_options" style="display: block">
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>Formats</h2>
|
|
|
|
<p/> Options:
|
|
|
|
<pre>
|
|
--dkvp --idkvp --odkvp
|
|
--nidx --inidx --onidx
|
|
--csv --icsv --ocsv
|
|
--csvlite --icsvlite --ocsvlite
|
|
--pprint --ipprint --opprint --right
|
|
--xtab --ixtab --oxtab
|
|
--json --ijson --ojson
|
|
</pre>
|
|
|
|
<p/> These are as discussed in POKI_PUT_LINK_FOR_PAGE(file-formats.html)HERE, with the exception of <tt>--right</tt>
|
|
which makes pretty-printed output right-aligned:
|
|
|
|
<table><tr><td>
|
|
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
|
|
</td><td>
|
|
POKI_RUN_COMMAND{{mlr --opprint --right cat data/small}}HERE
|
|
</td></tr></table>
|
|
|
|
<p/>Additional notes:
|
|
|
|
<ul>
|
|
|
|
<li/> Use <tt>--csv</tt>, <tt>--pprint</tt>, etc. when the input and output formats are the same.
|
|
|
|
<li/> Use <tt>--icsv --opprint</tt>, etc. when you want format conversion as part of what Miller does to your data.
|
|
|
|
<li/> DKVP (key-value-pair) format is the default for input and output. So,
|
|
<tt>--oxtab</tt> is the same as <tt>--idkvp --oxtab</tt>.
|
|
|
|
</ul>
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>In-place mode</h2>
|
|
|
|
<p/> Use the <tt>mlr -I</tt> flag to process files in-place. For example,
|
|
<tt>mlr -I --csv cut -x -f unwanted_column_name mydata/*.csv</tt> will remove
|
|
<tt>unwanted_column_name</tt> from all your <tt>*.csv</tt> files in your
|
|
<tt>mydata/</tt> subdirectory.
|
|
|
|
<p/> By default, Miller output goes to the screen (or you can redirect a file
|
|
using <tt>></tt> or to another process using <tt>|</tt>). With <tt>-I</tt>,
|
|
for each file name on the command line, output is written to a temporary file
|
|
in the same directory. Miller writes its output into that temp file, which is
|
|
then renamed over the original. Then, processing continues on the next file.
|
|
Each file is processed in isolation: if the output format is CSV, CSV headers
|
|
will be present in each output file; statistics are only over each file's own
|
|
records; and so on.
|
|
|
|
<p/> Please see <a href="10-min.html#Choices_for_printing_to_files">here</a>
|
|
for examples.
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>Compression</h2>
|
|
|
|
<p/> Options:
|
|
|
|
<pre>
|
|
--prepipe {command}
|
|
</pre>
|
|
|
|
<p/>The prepipe command is anything which reads from standard input and produces data acceptable to
|
|
Miller. Nominally this allows you to use whichever decompression utilities you have installed on your
|
|
system, on a per-file basis. If the command has flags, quote them: e.g. <tt>mlr --prepipe 'zcat -cf'</tt>. Examples:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
# These two produce the same output:
|
|
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime
|
|
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz
|
|
# With multiple input files you need --prepipe:
|
|
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz
|
|
$ mlr --prepipe gunzip --idkvp --oxtab cut -f hostname,uptime myfile1.dat.gz myfile2.dat.gz
|
|
|
|
# Similar to the above, but with compressed output as well as input:
|
|
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
|
|
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz | gzip > outfile.csv.gz
|
|
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz | gzip > outfile.csv.gz
|
|
|
|
# Similar to the above, but with different compression tools for input and output:
|
|
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | xz -z > outfile.csv.xz
|
|
$ xz -cd < myfile1.csv.xz | mlr cut -f hostname,uptime | gzip > outfile.csv.xz
|
|
$ mlr --prepipe 'xz -cd' cut -f hostname,uptime myfile1.csv.xz myfile2.csv.xz | xz -z > outfile.csv.xz
|
|
|
|
... etc.
|
|
</pre>
|
|
</div>
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>Record/field/pair separators</h2>
|
|
|
|
<p/> Miller has record separators <tt>IRS</tt> and <tt>ORS</tt>, field
|
|
separators <tt>IFS</tt> and <tt>OFS</tt>, and pair separators <tt>IPS</tt> and
|
|
<tt>OPS</tt>. For example, in the DKVP line <tt>a=1,b=2,c=3</tt>, the record
|
|
separator is newline, field separator is comma, and pair separator is the
|
|
equals sign. These are the default values.
|
|
|
|
<p/> Options:
|
|
<pre>
|
|
--rs --irs --ors
|
|
--fs --ifs --ofs --repifs
|
|
--ps --ips --ops
|
|
</pre>
|
|
|
|
<ul>
|
|
|
|
<li/> You can change a separator from input to output via e.g. <tt>--ifs =
|
|
--ofs :</tt>. Or, you can specify that the same separator is to be used for
|
|
input and output via e.g. <tt>--fs :</tt>.
|
|
|
|
<li/> The pair separator is only relevant to DKVP format.
|
|
|
|
<li/> Pretty-print and xtab formats ignore the separator arguments altogether.
|
|
|
|
<li/> The <tt>--repifs</tt> means that multiple successive occurrences of the
|
|
field separator count as one. For example, in CSV data we often signify nulls
|
|
by empty strings, e.g. <tt>2,9,,,,,6,5,4</tt>. On the other hand, if the field
|
|
separator is a space, it might be more natural to parse <tt>2 4 5</tt> the
|
|
same as <tt>2 4 5</tt>: <tt>--repifs --ifs ' '</tt> lets this happen. In fact,
|
|
the <tt>--ipprint</tt> option above is internally implemented in terms of
|
|
<tt>--repifs</tt>.
|
|
|
|
<li/> Just write out the desired separator, e.g. <tt>--ofs '|'</tt>. But you
|
|
may use the symbolic names <tt>newline</tt>, <tt>space</tt>, <tt>tab</tt>,
|
|
<tt>pipe</tt>, or <tt>semicolon</tt> if you like.
|
|
|
|
</ul>
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>Number formatting</h2>
|
|
|
|
<p/> The command-line option <tt>--ofmt {format string}</tt> is the global
|
|
number format for commands which generate numeric output, e.g.
|
|
<tt>stats1</tt>, <tt>stats2</tt>, <tt>histogram</tt>, and <tt>step</tt>, as
|
|
well as <tt>mlr put</tt>. Examples:
|
|
|
|
POKI_CARDIFY(--ofmt %.9le --ofmt %.6lf --ofmt %.0lf)HERE
|
|
|
|
<p/> These are just C <tt>printf</tt> formats applied to double-precision
|
|
numbers. Please don’t use <tt>%s</tt> or <tt>%d</tt>. Additionally, if
|
|
you use leading width (e.g. <tt>%18.12lf</tt>) then the output will contain
|
|
embedded whitespace, which may not be what you want if you pipe the output to
|
|
something else, particularly CSV. I use Miller’s pretty-print format
|
|
(<tt>mlr --opprint</tt>) to column-align numerical data.
|
|
|
|
<p/> To apply formatting to a single field, overriding the global
|
|
<tt>ofmt</tt>, use <tt>fmtnum</tt> function within <tt>mlr put</tt>. For example:
|
|
POKI_RUN_COMMAND{{echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08lf")'}}HERE
|
|
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08llx")'}}HERE
|
|
|
|
<p/>Input conversion from hexadecimal is done automatically on fields handled
|
|
by <tt>mlr put</tt> and <tt>mlr filter</tt> as long as the field value begins
|
|
with "0x". To apply output conversion to hexadecimal on a single column, you
|
|
may use <tt>fmtnum</tt>, or the keystroke-saving <tt>hexfmt</tt> function.
|
|
Example:
|
|
|
|
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'}}HERE
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Data transformations (verbs)</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_data_transformations');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_data_transformations" style="display: block">
|
|
|
|
<p/> Please see <a href="reference-verbs.html">the separate page here</a>.
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Expression language for filter and put</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_dsl_ref');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_dsl_ref" style="display: block">
|
|
|
|
<p/> Please see <a href="reference-dsl.html">the separate page here</a>.
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>then-chaining</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_then_chaining');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_then_chaining" style="display: block">
|
|
|
|
<p/>
|
|
In accord with the
|
|
<a href="http://en.wikipedia.org/wiki/Unix_philosophy">Unix philosophy</a>, you can pipe data into or out of
|
|
Miller. For example:
|
|
|
|
POKI_CARDIFY(mlr cut --complement -f os_version *.dat | mlr sort -f hostname,uptime)HERE
|
|
|
|
<p/>
|
|
You can, if you like, instead simply chain commands together using the
|
|
<tt>then</tt> keyword:
|
|
|
|
POKI_CARDIFY(mlr cut --complement -f os_version then sort -f hostname,uptime *.dat)HERE
|
|
|
|
<p/>(You can precede the very first verb with <tt>then</tt>, if you like, for symmetry.)
|
|
|
|
Here’s a performance comparison:
|
|
|
|
POKI_INCLUDE_ESCAPED(data/then-chaining-performance.txt)HERE
|
|
|
|
There are two reasons to use then-chaining: one is for performance, although I
|
|
don’t expect this to be a win in all cases. Using then-chaining avoids
|
|
redundant string-parsing and string-formatting at each pipeline step: instead
|
|
input records are parsed once, they are fed through each pipeline stage in
|
|
memory, and then output records are formatted once. On the other hand, Miller
|
|
is single-threaded, while modern systems are usually multi-processor, and when
|
|
streaming-data programs operate through pipes, each one can use a CPU. Rest
|
|
assured you get the same results either way.
|
|
|
|
<p/>The other reason to use then-chaining is for simplicity: you don’t
|
|
have re-type formatting flags (e.g. <tt>--csv --fs tab</tt>) at every
|
|
pipeline stage.
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Auxiliary commands</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_auxents');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_auxents" style="display: block">
|
|
|
|
<p/> There are a few nearly-standalone programs which have nothing to do with the rest of Miller, do not
|
|
participate in record streams, and do not deal with file formats. They might as well be little standalone executables
|
|
but they’re delivered within the main Miller executable for convenience.
|
|
|
|
POKI_RUN_COMMAND{{mlr aux-list}}HERE
|
|
|
|
<p/> Examples:
|
|
|
|
POKI_RUN_COMMAND{{echo 'Hello, world!' | mlr hex}}HERE
|
|
POKI_RUN_COMMAND{{echo 'Hello, world!' | mlr hex -r}}HERE
|
|
POKI_RUN_COMMAND{{echo 'Hello, world!' | mlr hex -r | mlr unhex}}HERE
|
|
POKI_RUN_COMMAND{{echo 'Hello, world!' | mlr lecat --mono}}HERE
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Data types</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_data_types');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_data_types" style="display: block">
|
|
|
|
<p/> Miller’s input and output are all string-oriented: there is (as of
|
|
August 2015 anyway) no support for binary record packing. In this sense,
|
|
everything is a string in and out of Miller. During processing, field names
|
|
are always strings, even if they have names like "3"; field values are usually
|
|
strings. Field values’ ability to be interpreted as a non-string type
|
|
only has meaning when comparison or function operations are done on them. And
|
|
it is an error condition if Miller encounters non-numeric (or otherwise
|
|
mistyped) data in a field in which it has been asked to do numeric (or
|
|
otherwise type-specific) operations.
|
|
|
|
<p/> Field values are treated as numeric for the following:
|
|
<ul>
|
|
<li/> Numeric sort: <tt>mlr sort -n</tt>, <tt>mlr sort -nr</tt>.
|
|
<li/> Statistics: <tt>mlr histogram</tt>, <tt>mlr stats1</tt>, <tt>mlr stats2</tt>.
|
|
<li/> Cross-record arithmetic: <tt>mlr step</tt>.
|
|
</ul>
|
|
|
|
<p/>For <tt>mlr put</tt> and <tt>mlr filter</tt>:
|
|
|
|
<ul>
|
|
|
|
<li/> Miller’s types for function processing are <b>empty-null</b> (empty
|
|
string), <b>absent-null</b> (reads of unset right-hand sides, or fall-through
|
|
non-explicit return values from user-defined functions), <b>error</b>,
|
|
<b>string</b>, <b>float</b> (double-precision), <b>int</b> (64-bit signed), and
|
|
<b>boolean</b>.
|
|
|
|
<li/> On input, string values representable as numbers, e.g. "3" or "3.1", are
|
|
treated as int or float, respectively. If a record has <tt>x=1,y=2</tt> then
|
|
<tt>mlr put '$z=$x+$y'</tt> will produce <tt>x=1,y=2,z=3</tt>, and <tt>mlr put
|
|
'$z=$x.$y'</tt> does not give an error simply because the dot operator has been
|
|
generalized to stringify non-strings. To coerce back to string for processing,
|
|
use the <tt>string</tt> function: <tt>mlr put '$z=string($x).string($y)'</tt>
|
|
will produce <tt>x=1,y=2,z=12</tt>.
|
|
|
|
<li/> On input, string values representable as boolean (e.g. <tt>"true"</tt>,
|
|
<tt>"false"</tt>) are <i>not</i> automatically treated as boolean. (This is
|
|
because <tt>"true"</tt> and <tt>"false"</tt> are ordinary words, and auto
|
|
string-to-boolean on a column consisting of words would result in some strings
|
|
mixed with some booleans.) Use the <tt>boolean</tt> function to coerce: e.g.
|
|
giving the record <tt>x=1,y=2,w=false</tt> to <tt>mlr put '$z=($x<$y) ||
|
|
boolean($w)'</tt>.
|
|
|
|
<li/> Functions take types as described in <tt>mlr --help-all-functions</tt>:
|
|
for example, <tt>log10</tt> takes float input and produces float output,
|
|
<tt>gmt2sec</tt> maps string to int, and <tt>sec2gmt</tt> maps int to string.
|
|
|
|
<li/> All math functions described in <tt>mlr --help-all-functions</tt> take
|
|
integer as well as float input.
|
|
|
|
</ul>
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Null data: empty and absent</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_null_data');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_null_data" style="display: block">
|
|
|
|
<p/> One of Miller’s key features is its support for <b>heterogeneous</b>
|
|
data. For example, take <tt>mlr sort</tt>: if you try to sort on field
|
|
<tt>hostname</tt> when not all records in the data stream <i>have</i> a field
|
|
named <tt>hostname</tt>, it is not an error (although you could pre-filter the
|
|
data stream using <tt>mlr having-fields --at-least hostname then sort
|
|
...</tt>). Rather, records lacking one or more sort keys are simply output
|
|
contiguously by <tt>mlr sort</tt>.
|
|
|
|
<p/> Miller has two kinds of null data:
|
|
|
|
<ul>
|
|
|
|
<li/> <b>Empty (key present, value empty)</b>: a field name is present in a
|
|
record (or in an out-of-stream variable) with empty value: e.g. <tt>x=,y=2</tt>
|
|
in the data input stream, or assignment <tt>$x=""</tt> or <tt>@x=""</tt> in
|
|
<tt>mlr put</tt>.
|
|
|
|
<li/> <b>Absent (key not present)</b>: a field name is not present, e.g. input
|
|
record is <tt>x=1,y=2</tt> and a <tt>put</tt> or <tt>filter</tt> expression
|
|
refers to <tt>$z</tt>. Or, reading an out-of-stream variable which hasn’t
|
|
been assigned a value yet, e.g. <tt>mlr put -q '@sum += $x'; end{emit
|
|
@sum}'</tt> or <tt>mlr put -q '@sum[$a][$b] += $x'; end{emit @sum, "a",
|
|
"b"}'</tt>.
|
|
|
|
</ul>
|
|
|
|
<p/>You can test these programatically using the functions
|
|
<tt>is_empty</tt>/<tt>is_not_empty</tt>, <tt>is_absent</tt>/<tt>is_present</tt>, and
|
|
<tt>is_null</tt>/<tt>is_not_null</tt>. For the last pair, note that null means
|
|
either empty or absent.
|
|
|
|
<p/>
|
|
Rules for null-handling:
|
|
|
|
<ul>
|
|
|
|
<li/> Records with one or more empty sort-field values sort after records with
|
|
all sort-field values present:
|
|
POKI_RUN_COMMAND{{mlr cat data/sort-null.dat}}HERE
|
|
POKI_RUN_COMMAND{{mlr sort -n a data/sort-null.dat}}HERE
|
|
POKI_RUN_COMMAND{{mlr sort -nr a data/sort-null.dat}}HERE
|
|
|
|
<li/> Functions/operators which have one or more <i>empty</i> arguments produce empty output: e.g.
|
|
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=$x+$y'}}HERE
|
|
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=$x+$y'}}HERE
|
|
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'}}HERE
|
|
|
|
with the exception that the <tt>min</tt> and <tt>max</tt> functions are
|
|
special: if one argument is non-null, it wins:
|
|
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'}}HERE
|
|
|
|
<li/> Functions of <i>absent</i> variables (e.g. <tt>mlr put '$y =
|
|
log10($nonesuch)'</tt>) evaluate to absent, and arithmetic/bitwise/boolean
|
|
operators with both operands being absent evaluate to absent.
|
|
Arithmetic operators with one absent operand return the other operand.
|
|
More specifically, absent values act like zero for addition/subtraction, and
|
|
one for multiplication: Furthermore, <b>any expression which evaluates to
|
|
absent is not stored in the left-hand side of an assignment statement </b>:
|
|
|
|
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'}}HERE
|
|
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'}}HERE
|
|
|
|
<li/> Likewise, for assignment to maps, <b>absent-valued keys or values result
|
|
in a skipped assignment</b>.
|
|
|
|
</ul>
|
|
|
|
The reasoning is as follows:
|
|
|
|
<ul>
|
|
|
|
<li/> Empty values are explicit in the data so they should explicitly affect accumulations:
|
|
<tt>mlr put '@sum += $x'</tt>
|
|
should accumulate numeric <tt>x</tt> values into the sum but an empty
|
|
<tt>x</tt>, when encountered in the input data stream, should make the sum
|
|
non-numeric. To work around this you can use the
|
|
<tt>is_not_null</tt> function as follows:
|
|
<tt>mlr put 'is_not_null($x) { @sum += $x }'</tt>
|
|
|
|
<li/> Absent stream-record values should not break accumulations, since Miller
|
|
by design handles heterogenous data: the running <tt>@sum</tt> in
|
|
<tt>mlr put '@sum += $x'</tt>
|
|
should not be invalidated for records which have no <tt>x</tt>.
|
|
|
|
<li/> Absent out-of-stream-variable values are precisely what allow you to write
|
|
<tt>mlr put '@sum += $x'</tt>. Otherwise you would have to write
|
|
<tt>mlr put 'begin{@sum = 0}; @sum += $x'</tt> —
|
|
which is tolerable — but for
|
|
<tt>mlr put 'begin{...}; @sum[$a][$b] += $x'</tt>
|
|
you’d have to pre-initialize <tt>@sum</tt> for all values of <tt>$a</tt> and <tt>$b</tt> in your
|
|
input data stream, which is intolerable.
|
|
|
|
<li/> The penalty for the absent feature is that misspelled variables can be hard to find:
|
|
e.g. in <tt>mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}'</tt>
|
|
the accumulator is spelt <tt>@sumx</tt> in the begin-block but <tt>@sum</tt> in the end-block, where since it
|
|
is absent, <tt>@sum*2</tt> evaluates to 2. See also the section on
|
|
<a href="reference-dsl.html#Errors_and_transparency">errors and transparency</a>.
|
|
|
|
</ul>
|
|
|
|
<p/>Since absent plus absent is absent (and likewise for other operators),
|
|
accumulations such as <tt>@sum += $x</tt> work correctly on heterogenous data,
|
|
as do within-record formulas if both operands are absent. If one operand is
|
|
present, you may get behavior you don’t desire. To work around this
|
|
— namely, to set an output field only for records which have all the
|
|
inputs present — you can use a pattern-action block with
|
|
<tt>is_present</tt>:
|
|
|
|
POKI_RUN_COMMAND{{mlr cat data/het.dkvp}}HERE
|
|
POKI_RUN_COMMAND{{mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp}}HERE
|
|
POKI_RUN_COMMAND{{mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp}}HERE
|
|
|
|
<p/> If you’re interested in a formal description of how empty and absent
|
|
fields participate in arithmetic, here’s a table for plus (other
|
|
arithmetic/boolean/bitwise operators are similar):
|
|
|
|
POKI_RUN_COMMAND{{mlr --print-type-arithmetic-info}}HERE
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>String literals</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_string_literals');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_string_literals" style="display: block">
|
|
|
|
<p/>
|
|
You can use the following backslash escapes for strings such as between the double quotes in contexts such as
|
|
<tt>mlr filter '$name =~ "..."'</tt>,
|
|
<tt>mlr put '$name = $othername . "..."'</tt>,
|
|
<tt>mlr put '$name = sub($name, "...", "...")</tt>, etc.:
|
|
|
|
<ul>
|
|
<li/> <tt>\a</tt>: ASCII code 0x07 (alarm/bell)
|
|
<li/> <tt>\b</tt>: ASCII code 0x08 (backspace)
|
|
<li/> <tt>\f</tt>: ASCII code 0x0c (formfeed)
|
|
<li/> <tt>\n</tt>: ASCII code 0x0a (LF/linefeed/newline)
|
|
<li/> <tt>\r</tt>: ASCII code 0x0d (CR/carriage return)
|
|
<li/> <tt>\t</tt>: ASCII code 0x09 (tab)
|
|
<li/> <tt>\v</tt>: ASCII code 0x0b (vertical tab)
|
|
<li/> <tt>\\</tt>: backslash
|
|
<li/> <tt>\"</tt>: double quote
|
|
<li/> <tt>\123</tt>: Octal 123, etc. for <tt>\000</tt> up to <tt>\377</tt>
|
|
<li/> <tt>\x7f</tt>: Hexadecimal 7f, etc. for <tt>\x00</tt> up to <tt>\xff</tt>
|
|
</ul>
|
|
|
|
<p/>See also <a href="https://en.wikipedia.org/wiki/Escape_sequences_in_C">https://en.wikipedia.org/wiki/Escape_sequences_in_C</a>.
|
|
|
|
<p/>These replacements apply only to strings you key in for the DSL expressions for <tt>filter</tt> and <tt>put</tt>:
|
|
that is, if you type <tt>\t</tt> in a string literal for a <tt>filter</tt>/<tt>put</tt> expression, it will be turned into a tab character. If you want a backslash followed by a <tt>t</tt>, then please type <tt>\\t</tt>.
|
|
|
|
<p/>However, these replacements are not done automatically within your data stream. If you wish to make these
|
|
replacements, you can do, for example, for a field named <tt>field</tt>, <tt> mlr put '$field = gsub($field, "\\t",
|
|
"\t")'</tt>. If you need to make such a replacement for all fields in your data, you should probably simply use the
|
|
system <tt>sed</tt> command.
|
|
|
|
</div>
|
|
<h1>Regular expressions</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_regular_expressions');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_regular_expressions" style="display: block">
|
|
|
|
<p/>Miller lets you use regular expressions (of type POSIX.2) in the following contexts:
|
|
|
|
<ul>
|
|
|
|
<li/> In <tt>mlr filter</tt> with <tt>=~</tt> or <tt>!=~</tt>, e.g. <tt>mlr
|
|
filter '$url =~ "http.*com"'</tt>
|
|
|
|
<li/> In <tt>mlr put</tt> with <tt>sub</tt> or <tt>gsub</tt>, e.g. <tt>mlr put
|
|
'$url = sub($url, "http.*com", "")'</tt>
|
|
|
|
<li/> In <tt>mlr having-fields</tt>, e.g. <tt>mlr having-fields
|
|
--any-matching '^sda[0-9]'</tt>
|
|
|
|
<li/> In <tt>mlr cut</tt>, e.g. <tt>mlr cut -r -f '^status$,^sda[0-9]'</tt>
|
|
|
|
<li/> In <tt>mlr rename</tt>, e.g. <tt>mlr rename -r '^(sda[0-9]).*$,dev/\1'</tt>
|
|
|
|
<li/> In <tt>mlr grep</tt>, e.g. <tt>mlr --csv grep 00188555487 myfiles*.csv</tt>
|
|
|
|
</ul>
|
|
|
|
<p/>Points demonstrated by the above examples:
|
|
|
|
<ul>
|
|
|
|
<li/> There are no implicit start-of-string or end-of-string anchors; please
|
|
use <tt>^</tt> and/or <tt>$</tt> explicitly.
|
|
|
|
<li/> Miller regexes are wrapped with double quotes rather than slashes.
|
|
|
|
<li/> The <tt>i</tt> after the ending double quote indicates a case-insensitive
|
|
regex.
|
|
|
|
<li/> Capture groups are wrapped with <tt>(...)</tt> rather than
|
|
<tt>\(...\)</tt>; use <tt>\(</tt> and <tt>\)</tt> to match against parentheses.
|
|
|
|
</ul>
|
|
|
|
<p/>For <tt>filter</tt> and <tt>put</tt>, if the regular expression is a string
|
|
literal (the normal case), it is precompiled at process start and reused
|
|
thereafter, which is efficient. If the regular expression is a more complex
|
|
expression, including string concatenation using <tt>.</tt>, or a column name
|
|
(in which case you can take regular expressions from input data!), then regexes
|
|
are compiled on each record which works but is less efficient. As well, in this
|
|
case there is no way to specify case-insensitive matching.
|
|
|
|
<p/>Example:
|
|
|
|
POKI_RUN_COMMAND{{cat data/regex-in-data.dat}}HERE
|
|
POKI_RUN_COMMAND{{mlr filter '$name =~ $regex' data/regex-in-data.dat}}HERE
|
|
|
|
<h2>Regex captures</h2>
|
|
|
|
<p/>Regex captures of the form <tt>\0</tt> through <tt>\9</tt> are supported as
|
|
follows: <ul>
|
|
|
|
<li/> Captures have in-function context for <tt>sub</tt> and <tt>gsub</tt>.
|
|
For example, the first <tt>\1,\2</tt> pair belong to the first <tt>sub</tt> and
|
|
the second <tt>\1,\2</tt> pair belong to the second <tt>sub</tt>:
|
|
|
|
<p/>
|
|
<div class=pokipanel>
|
|
<pre>
|
|
mlr put '$b = sub($a, "(..)_(...)", "\2-\1"); $c = sub($a, "(..)_(.)(..)", ":\1:\2:\3")'
|
|
</pre>
|
|
</div>
|
|
|
|
<li/> Captures endure for the entirety of a <tt>put</tt> for the <tt>=~</tt>
|
|
and <tt>!=~</tt> operators. For example, here the <tt>\1,\2</tt> are set by the
|
|
<tt>=~</tt> operator and are used by both subsequent assignment statements:
|
|
|
|
<p/>
|
|
<div class=pokipanel>
|
|
<pre>
|
|
mlr put '$a =~ "(..)_(....); $b = "left_\1"; $c = "right_\2"'
|
|
</pre>
|
|
</div>
|
|
|
|
<li/>The captures are not retained across multiple puts. For example, here the
|
|
<tt>\1,\2</tt> won’t be expanded from the regex capture:
|
|
|
|
<p/>
|
|
<div class=pokipanel>
|
|
<pre>
|
|
mlr put '$a =~ "(..)_(....)' then {... something else ...} then put '$b = "left_\1"; $c = "right_\2"'
|
|
</pre>
|
|
</div>
|
|
|
|
<li/> Captures are ignored in <tt>filter</tt> for the <tt>=~</tt> and
|
|
<tt>!=~</tt> operators. For example, there is no mechanism provided to refer to
|
|
the first <tt>(..)</tt> as <tt>\1</tt> or to the second <tt>(....)</tt> as
|
|
<tt>\2</tt> in the following filter statement:
|
|
|
|
<p/>
|
|
<div class=pokipanel>
|
|
<pre>
|
|
mlr filter '$a =~ "(..)_(....)'
|
|
</pre>
|
|
</div>
|
|
|
|
<li/> Up to nine matches are supported: <tt>\1</tt> through <tt>\9</tt>, while
|
|
<tt>\0</tt> is the entire match string; <tt>\15</tt> is treated as <tt>\1</tt>
|
|
followed by an unrelated <tt>5</tt>.
|
|
</ul>
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Arithmetic</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_arithmetic');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_arithmetic" style="display: block">
|
|
|
|
<h2>Input scanning</h2>
|
|
|
|
<p/>Numbers in Miller are double-precision float or 64-bit signed integers.
|
|
Anything scannable as int, e.g <tt>123</tt> or <tt>0xabcd</tt>, is treated as
|
|
an integer; otherwise, input scannable as float (<tt>4.56</tt> or <tt>8e9</tt>)
|
|
is treated as float; everything else is a string.
|
|
|
|
<p/>If you want all numbers to be treated as floats, then you may use
|
|
<tt>float()</tt> in your filter/put expressions (e.g. replacing <tt>$c = $a *
|
|
$b</tt> with <tt>$c = float($a) * float($b)</tt>) — or, more simply, use
|
|
<tt>mlr filter -F</tt> and <tt>mlr put -F</tt> which forces all numeric input,
|
|
whether from expression literals or field values, to float. Likewise <tt>mlr
|
|
stats1 -F</tt> and <tt>mlr step -F</tt> force integerable accumulators (such as
|
|
<tt>count</tt>) to be done in floating-point.
|
|
|
|
<h2>Conversion by math routines</h2>
|
|
|
|
<p/>For most math functions, integers are cast to float on input, and produce
|
|
float output: e.g. <tt>exp(0) = 1.0</tt> rather than <tt>1</tt>. The
|
|
following, however, produce integer output if their inputs are integers:
|
|
<tt>+</tt> <tt>-</tt> <tt>*</tt> <tt>/</tt> <tt>//</tt> <tt>%</tt> <tt>abs</tt>
|
|
<tt>ceil</tt> <tt>floor</tt> <tt>max</tt> <tt>min</tt> <tt>round</tt>
|
|
<tt>roundm</tt> <tt>sgn</tt>. As well, <tt>stats1 -a min</tt>, <tt>stats1 -a
|
|
max</tt>, <tt>stats1 -a sum</tt>, <tt>step -a delta</tt>, and <tt>step -a
|
|
rsum</tt> produce integer output if their inputs are integers.
|
|
|
|
<h2>Conversion by arithmetic operators</h2>
|
|
|
|
<p/>The sum, difference, and product of integers is again integer, except for
|
|
when that would overflow a 64-bit integer at which point Miller converts the
|
|
result to float.
|
|
|
|
<p/>The short of it is that Miller does this transparently for you so you
|
|
needn’t think about it.
|
|
|
|
<p/>Implementation details of this, for the interested: integer adds and
|
|
subtracts overflow by at most one bit so it suffices to check sign-changes.
|
|
Thus, Miller allows you to add and subtract arbitrary 64-bit signed integers,
|
|
converting only to float precisely when the result is less than -2<sup>63</sup>
|
|
or greater than 2<sup>63</sup>-1. Multiplies, on the other hand, can overflow
|
|
by a word size and a sign-change technique does not suffice to detect overflow.
|
|
Instead Miller tests whether the floating-point product exceeds the
|
|
representable integer range. Now, 64-bit integers have 64-bit precision while
|
|
IEEE-doubles have only 52-bit mantissas — so, there are 53 bits including
|
|
implicit leading one. The following experiment explicitly demonstrates the
|
|
resolution at this range:
|
|
|
|
<div class=pokipanel>
|
|
<pre>
|
|
64-bit integer 64-bit integer Casted to double Back to 64-bit
|
|
in hex in decimal integer
|
|
0x7ffffffffffff9ff 9223372036854774271 9223372036854773760.000000 0x7ffffffffffff800
|
|
0x7ffffffffffffa00 9223372036854774272 9223372036854773760.000000 0x7ffffffffffff800
|
|
0x7ffffffffffffbff 9223372036854774783 9223372036854774784.000000 0x7ffffffffffffc00
|
|
0x7ffffffffffffc00 9223372036854774784 9223372036854774784.000000 0x7ffffffffffffc00
|
|
0x7ffffffffffffdff 9223372036854775295 9223372036854774784.000000 0x7ffffffffffffc00
|
|
0x7ffffffffffffe00 9223372036854775296 9223372036854775808.000000 0x8000000000000000
|
|
0x7ffffffffffffffe 9223372036854775806 9223372036854775808.000000 0x8000000000000000
|
|
0x7fffffffffffffff 9223372036854775807 9223372036854775808.000000 0x8000000000000000
|
|
</pre>
|
|
</div>
|
|
|
|
<p/>That is, one cannot check an integer product to see if it is precisely
|
|
greater than 2<sup>63</sup>-1 or less than -2<sup>63</sup> using either integer
|
|
arithmetic (it may have already overflowed) or using double-precision (due to
|
|
granularity). Instead Miller checks for overflow in 64-bit integer
|
|
multiplication by seeing whether the absolute value of the double-precision
|
|
product exceeds the largest representable IEEE double less than 2<sup>63</sup>,
|
|
which we see from the listing above is 9223372036854774784. (An alternative
|
|
would be to do all integer multiplies using handcrafted multi-word 128-bit
|
|
arithmetic. This approach is not taken.)
|
|
|
|
<h2>Pythonic division</h2>
|
|
|
|
<p/>Division and remainder are
|
|
<a href="http://python-history.blogspot.com/2010/08/why-pythons-integer-division-floors.html">
|
|
pythonic</a>:
|
|
<ul>
|
|
<li/> Quotient of integers is floating-point: <tt>7/2</tt> is <tt>3.5</tt>.
|
|
<li/> Integer division is done with <tt>//</tt>: <tt>7//2</tt> is <tt>3</tt>.
|
|
This rounds toward the negative.
|
|
<li/> Remainders are non-negative.
|
|
</ul>
|
|
|
|
</div>
|
|
<!-- ================================================================ -->
|
|
<h1>On-line help</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_online_help');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_online_help" style="display: block">
|
|
|
|
<p/>Examples:<p/>
|
|
|
|
POKI_RUN_COMMAND{{mlr --help}}HERE
|
|
|
|
POKI_RUN_COMMAND{{mlr sort --help}}HERE
|
|
|
|
</div>
|