mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 10:15:36 +00:00
790 lines
37 KiB
HTML
POKI_PUT_TOC_HERE
|
|
|
|
<p/>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="bodyToggler.expandAll();" href="javascript:;">Expand all sections</button>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="bodyToggler.collapseAll();" href="javascript:;">Collapse all sections</button>
|
|
|
|
<h1>Command overview</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_overview');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_overview" style="display: block">
|
|
|
|
<p>
|
|
Whereas the Unix toolkit is made of the separate executables <code>cat</code>, <code>tail</code>, <code>cut</code>,
|
|
<code>sort</code>, etc., Miller has subcommands, invoked as follows:
|
|
|
|
POKI_INCLUDE_ESCAPED(data/subcommand-example.txt)HERE
|
|
|
|
<p/>These fall into categories as follows:
|
|
|
|
<table border=1>
|
|
<tr class="mlrbg">
|
|
<th>Commands </th>
|
|
<th>Description</th>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#cat"><code>cat</code></a>,
|
|
<a href="reference-verbs.html#cut"><code>cut</code></a>,
|
|
<a href="reference-verbs.html#grep"><code>grep</code></a>,
|
|
<a href="reference-verbs.html#head"><code>head</code></a>,
|
|
<a href="reference-verbs.html#join"><code>join</code></a>,
|
|
<a href="reference-verbs.html#sort"><code>sort</code></a>,
|
|
<a href="reference-verbs.html#tac"><code>tac</code></a>,
|
|
<a href="reference-verbs.html#tail"><code>tail</code></a>,
|
|
<a href="reference-verbs.html#top"><code>top</code></a>,
|
|
<a href="reference-verbs.html#uniq"><code>uniq</code></a>
|
|
</td>
|
|
<td> Analogs of their Unix-toolkit namesakes, discussed below as well as in
|
|
POKI_PUT_LINK_FOR_PAGE(feature-comparison.html)HERE </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#filter"><code>filter</code></a>,
|
|
<a href="reference-verbs.html#put"><code>put</code></a>,
|
|
<a href="reference-verbs.html#sec2gmt"><code>sec2gmt</code></a>,
|
|
<a href="reference-verbs.html#sec2gmtdate"><code>sec2gmtdate</code></a>,
|
|
<a href="reference-verbs.html#step"><code>step</code></a>,
|
|
<a href="reference-verbs.html#tee"><code>tee</code></a>
|
|
</td>
|
|
<td> <code>awk</code>-like functionality </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#bar"><code>bar</code></a>,
|
|
<a href="reference-verbs.html#bootstrap"><code>bootstrap</code></a>,
|
|
<a href="reference-verbs.html#decimate"><code>decimate</code></a>,
|
|
<a href="reference-verbs.html#histogram"><code>histogram</code></a>,
|
|
<a href="reference-verbs.html#least-frequent"><code>least-frequent</code></a>,
|
|
<a href="reference-verbs.html#most-frequent"><code>most-frequent</code></a>,
|
|
<a href="reference-verbs.html#sample"><code>sample</code></a>,
|
|
<a href="reference-verbs.html#shuffle"><code>shuffle</code></a>,
|
|
<a href="reference-verbs.html#stats1"><code>stats1</code></a>,
|
|
<a href="reference-verbs.html#stats2"><code>stats2</code></a>
|
|
</td>
|
|
<td> Statistically oriented </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#group-by"><code>group-by</code></a>,
|
|
<a href="reference-verbs.html#group-like"><code>group-like</code></a>,
|
|
<a href="reference-verbs.html#having-fields"><code>having-fields</code></a>
|
|
</td>
|
|
<td> Particularly oriented toward POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE, although
|
|
all Miller commands can handle heterogeneous records </td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td>
|
|
<a href="reference-verbs.html#check"><code>check</code></a>,
|
|
<a href="reference-verbs.html#count-distinct"><code>count-distinct</code></a>,
|
|
<a href="reference-verbs.html#label"><code>label</code></a>,
|
|
<a href="reference-verbs.html#merge-fields"><code>merge-fields</code></a>,
|
|
<a href="reference-verbs.html#nest"><code>nest</code></a>,
|
|
<a href="reference-verbs.html#nothing"><code>nothing</code></a>,
|
|
<a href="reference-verbs.html#regularize"><code>regularize</code></a>,
|
|
<a href="reference-verbs.html#rename"><code>rename</code></a>,
|
|
<a href="reference-verbs.html#reorder"><code>reorder</code></a>,
|
|
<a href="reference-verbs.html#reshape"><code>reshape</code></a>,
|
|
<a href="reference-verbs.html#seqgen"><code>seqgen</code></a>
|
|
</td>
|
|
<td> These draw from other sources (see also POKI_PUT_LINK_FOR_PAGE(originality.html)HERE):
|
|
<a href="reference-verbs.html#count-distinct"><code>count-distinct</code></a> is SQL-ish, and
|
|
<a href="reference-verbs.html#rename"><code>rename</code></a> can be done by <code>sed</code> (which does it faster:
|
|
see POKI_PUT_LINK_FOR_PAGE(performance.html)HERE).
|
|
</td>
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
</div>
|
|
<!-- ================================================================ -->
|
|
<h1>I/O options</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_io_options');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_io_options" style="display: block">
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>Formats</h2>
|
|
|
|
<p/> Options:
|
|
|
|
<pre>
|
|
--dkvp --idkvp --odkvp
|
|
--nidx --inidx --onidx
|
|
--csv --icsv --ocsv
|
|
--csvlite --icsvlite --ocsvlite
|
|
--pprint --ipprint --opprint --right
|
|
--xtab --ixtab --oxtab
|
|
--json --ijson --ojson
|
|
</pre>
|
|
|
|
<p/> These are as discussed in POKI_PUT_LINK_FOR_PAGE(file-formats.html)HERE, with the exception of <code>--right</code>
|
|
which makes pretty-printed output right-aligned:
|
|
|
|
<table><tr><td>
|
|
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
|
|
</td><td>
|
|
POKI_RUN_COMMAND{{mlr --opprint --right cat data/small}}HERE
|
|
</td></tr></table>
|
|
|
|
<p/>Additional notes:
|
|
|
|
<ul>
|
|
|
|
<li/> Use <code>--csv</code>, <code>--pprint</code>, etc. when the input and output formats are the same.
|
|
|
|
<li/> Use <code>--icsv --opprint</code>, etc. when you want format conversion as part of what Miller does to your data.
|
|
|
|
<li/> DKVP (key-value-pair) format is the default for input and output. So,
|
|
<code>--oxtab</code> is the same as <code>--idkvp --oxtab</code>.
|
|
|
|
</ul>
|
|
|
|
<b>Pro-tip:</b> Please use either <b>--format1</b>, or <b>--iformat1
|
|
--oformat2</b>. If you use <b>--format1 --oformat2</b> then what happens is
|
|
that flags are set up for input <i>and</i> output for format1, some of which
|
|
are overwritten for output in format2. For technical reasons, having
|
|
<code>--oformat2</code> clobber all the output-related effects of
|
|
<code>--format1</code> also removes some flexibility from the command-line
|
|
interface. See also
|
|
<a href="https://github.com/johnkerl/miller/issues/180">https://github.com/johnkerl/miller/issues/180</a> and
|
|
<a href="https://github.com/johnkerl/miller/issues/199">https://github.com/johnkerl/miller/issues/199</a>.
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>In-place mode</h2>
|
|
|
|
<p/> Use the <code>mlr -I</code> flag to process files in-place. For example,
|
|
<code>mlr -I --csv cut -x -f unwanted_column_name mydata/*.csv</code> will remove
|
|
<code>unwanted_column_name</code> from all your <code>*.csv</code> files in your
|
|
<code>mydata/</code> subdirectory.
|
|
|
|
<p/> By default, Miller output goes to the screen (or you can redirect it to a file
|
|
using <code>></code> or to another process using <code>|</code>). With <code>-I</code>,
|
|
for each file name on the command line, output is written to a temporary file
|
|
in the same directory. Miller writes its output into that temp file, which is
|
|
then renamed over the original. Then, processing continues on the next file.
|
|
Each file is processed in isolation: if the output format is CSV, CSV headers
|
|
will be present in each output file; statistics are only over each file's own
|
|
records; and so on.
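<p/> The write-to-temp-then-rename sequence described above can be sketched in Python. This is an illustration of the idea only, not Miller's implementation: <code>process_in_place</code>, <code>drop_first_column</code>, and <code>demo</code> are hypothetical names, and the transform is a stand-in for whatever verb you would run.

```python
import os
import tempfile


def process_in_place(path, transform):
    """Rewrite one file via a temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r") as src:
        lines = src.read().splitlines()
    # Each file is transformed in isolation from the others.
    out_lines = transform(lines)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write("".join(line + "\n" for line in out_lines))
    # The temp file is renamed over the original once it is fully written.
    os.replace(tmp_path, path)


def drop_first_column(lines):
    # Hypothetical stand-in for a verb such as `cut -x -f ...` on CSV text.
    return [",".join(line.split(",")[1:]) for line in lines]


def read_text(path):
    with open(path, "r") as f:
        return f.read()


def demo():
    with tempfile.TemporaryDirectory() as d:
        paths = [os.path.join(d, name) for name in ("a.csv", "b.csv")]
        for p in paths:
            with open(p, "w") as f:
                f.write("x,y\n1,2\n")
        for p in paths:  # one file at a time, as with -I
            process_in_place(p, drop_first_column)
        return [read_text(p) for p in paths]
```

Because each file gets its own pass, both outputs in <code>demo</code> carry their own header line.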
|
|
|
|
<p/> Please see <a href="10-min.html#Choices_for_printing_to_files">here</a>
|
|
for examples.
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>Compression</h2>
|
|
|
|
<p/> Options:
|
|
|
|
<pre>
|
|
--prepipe {command}
|
|
</pre>
|
|
|
|
<p/>The prepipe command is anything which reads from standard input and produces data acceptable to
|
|
Miller. Nominally this allows you to use whichever decompression utilities you have installed on your
|
|
system, on a per-file basis. If the command has flags, quote them: e.g. <code>mlr --prepipe 'zcat -cf'</code>. Examples:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
# These two produce the same output:
|
|
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime
|
|
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz
|
|
# With multiple input files you need --prepipe:
|
|
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz
|
|
$ mlr --prepipe gunzip --idkvp --oxtab cut -f hostname,uptime myfile1.dat.gz myfile2.dat.gz
|
|
|
|
# Similar to the above, but with compressed output as well as input:
|
|
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
|
|
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz | gzip > outfile.csv.gz
|
|
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz | gzip > outfile.csv.gz
|
|
|
|
# Similar to the above, but with different compression tools for input and output:
|
|
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | xz -z > outfile.csv.xz
|
|
$ xz -cd < myfile1.csv.xz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
|
|
$ mlr --prepipe 'xz -cd' cut -f hostname,uptime myfile1.csv.xz myfile2.csv.xz | xz -z > outfile.csv.xz
|
|
|
|
... etc.
|
|
</pre>
|
|
</div>
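<p/> As a rough Python sketch of what a prepipe amounts to: the command is run once per input file, with that file on its standard input, and the command's output is what gets parsed. The names <code>read_through_prepipe</code>, <code>read_all</code>, and <code>GUNZIP_LIKE</code> are hypothetical, and the decompressor is written in Python only so that the sketch needs no external tool.

```python
import subprocess
import sys


def read_through_prepipe(command, path):
    """Run `command < path` and return its standard output as text."""
    with open(path, "rb") as f:
        result = subprocess.run(command, stdin=f, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")


def read_all(command, paths):
    # The command is applied to each file separately, not to a concatenation.
    return "".join(read_through_prepipe(command, p) for p in paths)


# Stand-in for `gunzip`: reads gzip data on stdin, writes plain data on stdout.
GUNZIP_LIKE = [
    sys.executable,
    "-c",
    "import sys, gzip; "
    "sys.stdout.buffer.write(gzip.decompress(sys.stdin.buffer.read()))",
]
```

Running the command per file is why multiple compressed inputs need <code>--prepipe</code> rather than a single shell pipe.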
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>Record/field/pair separators</h2>
|
|
|
|
<p/> Miller has record separators <code>IRS</code> and <code>ORS</code>, field
|
|
separators <code>IFS</code> and <code>OFS</code>, and pair separators <code>IPS</code> and
|
|
<code>OPS</code>. For example, in the DKVP line <code>a=1,b=2,c=3</code>, the record
|
|
separator is newline, field separator is comma, and pair separator is the
|
|
equals sign. These are the default values.
|
|
|
|
<p/> Options:
|
|
<pre>
|
|
--rs --irs --ors
|
|
--fs --ifs --ofs --repifs
|
|
--ps --ips --ops
|
|
</pre>
|
|
|
|
<ul>
|
|
|
|
<li/> You can change a separator from input to output via e.g. <code>--ifs =
|
|
--ofs :</code>. Or, you can specify that the same separator is to be used for
|
|
input and output via e.g. <code>--fs :</code>.
|
|
|
|
<li/> The pair separator is only relevant to DKVP format.
|
|
|
|
<li/> Pretty-print and xtab formats ignore the separator arguments altogether.
|
|
|
|
<li/> The <code>--repifs</code> means that multiple successive occurrences of the
|
|
field separator count as one. For example, in CSV data we often signify nulls
|
|
by empty strings, e.g. <code>2,9,,,,,6,5,4</code>. On the other hand, if the field
|
|
separator is a space, it might be more natural to parse <code>2 4    5</code> the
|
|
same as <code>2 4 5</code>: <code>--repifs --ifs ' '</code> lets this happen. In fact,
|
|
the <code>--ipprint</code> option above is internally implemented in terms of
|
|
<code>--repifs</code>.
|
|
|
|
<li/> Just write out the desired separator, e.g. <code>--ofs '|'</code>. But you
|
|
may use the symbolic names <code>newline</code>, <code>space</code>, <code>tab</code>,
|
|
<code>pipe</code>, or <code>semicolon</code> if you like.
|
|
|
|
</ul>
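<p/> The roles of the field separator, the pair separator, and <code>--repifs</code> can be sketched in Python as follows. This is a simplified model, not Miller's parser: the function names are hypothetical, and keying a field that lacks a pair separator by its position is an assumption made for the sketch.

```python
import re


def split_fields(line, ifs=",", repifs=False):
    """Split a line on the field separator, optionally collapsing repeats."""
    if repifs:
        # A run of separators counts as a single separator.
        return re.split("(?:%s)+" % re.escape(ifs), line)
    return line.split(ifs)


def parse_dkvp_line(line, ifs=",", ips="=", repifs=False):
    """Parse one DKVP line into a dict of field name -> string value."""
    record = {}
    for position, field in enumerate(split_fields(line, ifs, repifs), start=1):
        if ips in field:
            key, value = field.split(ips, 1)
        else:
            # Assumption for this sketch: key such fields by position.
            key, value = str(position), field
        record[key] = value
    return record


def format_dkvp_line(record, ofs=",", ops="="):
    """Write a record back out with the output separators."""
    return ofs.join(key + ops + value for key, value in record.items())
```

With these, <code>split_fields("2,9,,,,,6,5,4")</code> keeps the empty strings, while <code>split_fields("2 4    5", " ", repifs=True)</code> yields three fields.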
|
|
|
|
<!-- ================================================================ -->
|
|
<h2>Number formatting</h2>
|
|
|
|
<p/> The command-line option <code>--ofmt {format string}</code> is the global
|
|
number format for commands which generate numeric output, e.g.
|
|
<code>stats1</code>, <code>stats2</code>, <code>histogram</code>, and <code>step</code>, as
|
|
well as <code>mlr put</code>. Examples:
|
|
|
|
POKI_CARDIFY(--ofmt %.9le --ofmt %.6lf --ofmt %.0lf)HERE
|
|
|
|
<p/> These are just C <code>printf</code> formats applied to double-precision
|
|
numbers. Please don’t use <code>%s</code> or <code>%d</code>. Additionally, if
|
|
you use leading width (e.g. <code>%18.12lf</code>) then the output will contain
|
|
embedded whitespace, which may not be what you want if you pipe the output to
|
|
something else, particularly CSV. I use Miller’s pretty-print format
|
|
(<code>mlr --opprint</code>) to column-align numerical data.
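<p/> To see what these format strings do, here is a small Python sketch. Python's <code>%</code> operator is used as a stand-in for C <code>printf</code>, and the C <code>l</code> length modifier is left out of the format strings; <code>format_number</code> is a hypothetical helper.

```python
def format_number(value, fmt):
    """Apply a printf-style floating-point format to one value."""
    return fmt % float(value)


scientific = format_number(3.1, "%.9e")   # nine digits after the point, exponent form
fixed = format_number(3.1, "%.6f")        # six digits after the point
rounded = format_number(3.1, "%.0f")      # no digits after the point
padded = format_number(3.1, "%18.12f")    # leading width pads with spaces on the left
```

The last value is the case to watch for: the padding spaces become part of the field value.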
|
|
|
|
<p/> To apply formatting to a single field, overriding the global
|
|
<code>ofmt</code>, use the <code>fmtnum</code> function within <code>mlr put</code>. For example:
|
|
POKI_RUN_COMMAND{{echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08lf")'}}HERE
|
|
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08llx")'}}HERE
|
|
|
|
<p/>Input conversion from hexadecimal is done automatically on fields handled
|
|
by <code>mlr put</code> and <code>mlr filter</code> as long as the field value begins
|
|
with "0x". To apply output conversion to hexadecimal on a single column, you
|
|
may use <code>fmtnum</code>, or the keystroke-saving <code>hexfmt</code> function.
|
|
Example:
|
|
|
|
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'}}HERE
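<p/> The arithmetic behind the hexadecimal examples can be checked with a short Python sketch. The helper names <code>parse_int</code> and <code>hexfmt_like</code> are hypothetical; they model the idea of hex input and hex output, not Miller's code.

```python
def parse_int(text):
    """Parse decimal or 0x-prefixed hexadecimal text as an integer."""
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def hexfmt_like(n):
    # Render a non-negative integer in hexadecimal with a 0x prefix.
    return "0x%x" % n


x = parse_int("0xffff")
y = parse_int("0xff")
product = x * y                 # 65535 * 255
zero_padded = "%08x" % product  # eight hex digits, zero-filled, no prefix
prefixed = hexfmt_like(product)
```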
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Data transformations (verbs)</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_data_transformations');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_data_transformations" style="display: block">
|
|
|
|
<p/> Please see <a href="reference-verbs.html">the separate page here</a>.
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Expression language for filter and put</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_dsl_ref');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_dsl_ref" style="display: block">
|
|
|
|
<p/> Please see <a href="reference-dsl.html">the separate page here</a>.
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>then-chaining</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_then_chaining');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_then_chaining" style="display: block">
|
|
|
|
<p/>
|
|
In accord with the
|
|
<a href="http://en.wikipedia.org/wiki/Unix_philosophy">Unix philosophy</a>, you can pipe data into or out of
|
|
Miller. For example:
|
|
|
|
POKI_CARDIFY(mlr cut --complement -f os_version *.dat | mlr sort -f hostname,uptime)HERE
|
|
|
|
<p/>
|
|
You can, if you like, instead simply chain commands together using the
|
|
<code>then</code> keyword:
|
|
|
|
POKI_CARDIFY(mlr cut --complement -f os_version then sort -f hostname,uptime *.dat)HERE
|
|
|
|
<p/>(You can precede the very first verb with <code>then</code>, if you like, for symmetry.)
|
|
|
|
Here’s a performance comparison:
|
|
|
|
POKI_INCLUDE_ESCAPED(data/then-chaining-performance.txt)HERE
|
|
|
|
There are two reasons to use then-chaining: one is for performance, although I
|
|
don’t expect this to be a win in all cases. Using then-chaining avoids
|
|
redundant string-parsing and string-formatting at each pipeline step: instead
|
|
input records are parsed once, they are fed through each pipeline stage in
|
|
memory, and then output records are formatted once. On the other hand, Miller
|
|
is single-threaded, while modern systems are usually multi-processor, and when
|
|
streaming-data programs operate through pipes, each one can use a CPU. Rest
|
|
assured you get the same results either way.
|
|
|
|
<p/>The other reason to use then-chaining is for simplicity: you don’t
|
|
have to re-type formatting flags (e.g. <code>--csv --fs tab</code>) at every
|
|
pipeline stage.
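<p/> The difference between piping and then-chaining can be sketched in Python. This is a toy model with hypothetical function names: <code>piped</code> formats and re-parses text between the two stages, as two separate processes would, while <code>chained</code> parses once, passes records in memory, and formats once.

```python
def parse(text):
    """Parse DKVP text into a list of records (dicts of strings)."""
    return [dict(pair.split("=", 1) for pair in line.split(","))
            for line in text.splitlines() if line]


def fmt(records):
    """Format records back to DKVP text."""
    return "".join(",".join(k + "=" + v for k, v in r.items()) + "\n"
                   for r in records)


def cut_complement(records, name):
    return [{k: v for k, v in r.items() if k != name} for r in records]


def sort_by(records, *names):
    # Lexical ascending sort on the given field names.
    return sorted(records, key=lambda r: tuple(r.get(n, "") for n in names))


def piped(text):
    # Two processes: the intermediate text is formatted, then parsed again.
    intermediate = fmt(cut_complement(parse(text), "os_version"))
    return fmt(sort_by(parse(intermediate), "hostname", "uptime"))


def chained(text):
    # One process: parse once, pass records in memory, format once.
    records = parse(text)
    records = cut_complement(records, "os_version")
    records = sort_by(records, "hostname", "uptime")
    return fmt(records)
```

Both functions return the same text; the chained form simply skips one format step and one parse step.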
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Auxiliary commands</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_auxents');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_auxents" style="display: block">
|
|
|
|
<p/> There are a few nearly-standalone programs which have nothing to do with the rest of Miller, do not
|
|
participate in record streams, and do not deal with file formats. They might as well be little standalone executables
|
|
but they’re delivered within the main Miller executable for convenience.
|
|
|
|
POKI_RUN_COMMAND{{mlr aux-list}}HERE
|
|
POKI_RUN_COMMAND{{mlr lecat --help}}HERE
|
|
POKI_RUN_COMMAND{{mlr termcvt --help}}HERE
|
|
POKI_RUN_COMMAND{{mlr hex --help}}HERE
|
|
POKI_RUN_COMMAND{{mlr unhex --help}}HERE
|
|
|
|
<p/> Examples:
|
|
|
|
POKI_RUN_COMMAND{{echo 'Hello, world!' | mlr lecat --mono}}HERE
|
|
POKI_RUN_COMMAND{{echo 'Hello, world!' | mlr termcvt --lf2crlf | mlr lecat --mono}}HERE
|
|
POKI_RUN_COMMAND{{mlr hex data/budget.csv}}HERE
|
|
POKI_RUN_COMMAND{{mlr hex -r data/budget.csv}}HERE
|
|
POKI_RUN_COMMAND{{mlr hex -r data/budget.csv | sed 's/20/2a/g' | mlr unhex}}HERE
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Data types</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_data_types');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_data_types" style="display: block">
|
|
|
|
<p/> Miller’s input and output are all string-oriented: there is (as of
|
|
August 2015 anyway) no support for binary record packing. In this sense,
|
|
everything is a string in and out of Miller. During processing, field names
|
|
are always strings, even if they have names like "3"; field values are usually
|
|
strings. Field values’ ability to be interpreted as a non-string type
|
|
only has meaning when comparison or function operations are done on them. And
|
|
it is an error condition if Miller encounters non-numeric (or otherwise
|
|
mistyped) data in a field in which it has been asked to do numeric (or
|
|
otherwise type-specific) operations.
|
|
|
|
<p/> Field values are treated as numeric for the following:
|
|
<ul>
|
|
<li/> Numeric sort: <code>mlr sort -n</code>, <code>mlr sort -nr</code>.
|
|
<li/> Statistics: <code>mlr histogram</code>, <code>mlr stats1</code>, <code>mlr stats2</code>.
|
|
<li/> Cross-record arithmetic: <code>mlr step</code>.
|
|
</ul>
|
|
|
|
<p/>For <code>mlr put</code> and <code>mlr filter</code>:
|
|
|
|
<ul>
|
|
|
|
<li/> Miller’s types for function processing are <b>empty-null</b> (empty
|
|
string), <b>absent-null</b> (reads of unset right-hand sides, or fall-through
|
|
non-explicit return values from user-defined functions), <b>error</b>,
|
|
<b>string</b>, <b>float</b> (double-precision), <b>int</b> (64-bit signed), and
|
|
<b>boolean</b>.
|
|
|
|
<li/> On input, string values representable as numbers, e.g. "3" or "3.1", are
|
|
treated as int or float, respectively. If a record has <code>x=1,y=2</code> then
|
|
<code>mlr put '$z=$x+$y'</code> will produce <code>x=1,y=2,z=3</code>, and <code>mlr put
|
|
'$z=$x.$y'</code> does not give an error simply because the dot operator has been
|
|
generalized to stringify non-strings. To coerce back to string for processing,
|
|
use the <code>string</code> function: <code>mlr put '$z=string($x).string($y)'</code>
|
|
will produce <code>x=1,y=2,z=12</code>.
|
|
|
|
<li/> On input, string values representable as boolean (e.g. <code>"true"</code>,
|
|
<code>"false"</code>) are <i>not</i> automatically treated as boolean. (This is
|
|
because <code>"true"</code> and <code>"false"</code> are ordinary words, and auto
|
|
string-to-boolean on a column consisting of words would result in some strings
|
|
mixed with some booleans.) Use the <code>boolean</code> function to coerce: e.g.
|
|
giving the record <code>x=1,y=2,w=false</code> to <code>mlr put '$z=($x<$y) ||
|
|
boolean($w)'</code>.
|
|
|
|
<li/> Functions take types as described in <code>mlr --help-all-functions</code>:
|
|
for example, <code>log10</code> takes float input and produces float output,
|
|
<code>gmt2sec</code> maps string to int, and <code>sec2gmt</code> maps int to string.
|
|
|
|
<li/> All math functions described in <code>mlr --help-all-functions</code> take
|
|
integer as well as float input.
|
|
|
|
</ul>
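<p/> The input-typing rules above can be approximated in Python as follows. This is a sketch under assumptions: <code>infer</code> and <code>dot</code> are hypothetical names, and Python's own number parsing is used, which accepts some spellings (such as <code>inf</code> or digit-group underscores) that need not match Miller's rules.

```python
def infer(text):
    """Return an int or float when the text looks numeric, else the string itself."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    # Words such as "true" and "false" deliberately stay strings.
    return text


def dot(a, b):
    # The dot operator stringifies its operands before concatenating.
    return str(a) + str(b)


total = infer("1") + infer("2")         # numeric addition
joined = dot(infer("1"), infer("2"))    # string concatenation
```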
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Null data: empty and absent</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_null_data');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_null_data" style="display: block">
|
|
|
|
<p/> One of Miller’s key features is its support for <b>heterogeneous</b>
|
|
data. For example, take <code>mlr sort</code>: if you try to sort on field
|
|
<code>hostname</code> when not all records in the data stream <i>have</i> a field
|
|
named <code>hostname</code>, it is not an error (although you could pre-filter the
|
|
data stream using <code>mlr having-fields --at-least hostname then sort
|
|
...</code>). Rather, records lacking one or more sort keys are simply output
|
|
contiguously by <code>mlr sort</code>.
|
|
|
|
<p/> Miller has two kinds of null data:
|
|
|
|
<ul>
|
|
|
|
<li/> <b>Empty (key present, value empty)</b>: a field name is present in a
|
|
record (or in an out-of-stream variable) with empty value: e.g. <code>x=,y=2</code>
|
|
in the data input stream, or assignment <code>$x=""</code> or <code>@x=""</code> in
|
|
<code>mlr put</code>.
|
|
|
|
<li/> <b>Absent (key not present)</b>: a field name is not present, e.g. input
|
|
record is <code>x=1,y=2</code> and a <code>put</code> or <code>filter</code> expression
|
|
refers to <code>$z</code>. Or, reading an out-of-stream variable which hasn’t
|
|
been assigned a value yet, e.g. <code>mlr put -q '@sum += $x; end{emit
|
|
@sum}'</code> or <code>mlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a",
|
|
"b"}'</code>.
|
|
|
|
</ul>
|
|
|
|
<p/>You can test these programmatically using the functions
|
|
<code>is_empty</code>/<code>is_not_empty</code>, <code>is_absent</code>/<code>is_present</code>, and
|
|
<code>is_null</code>/<code>is_not_null</code>. For the last pair, note that null means
|
|
either empty or absent.
|
|
|
|
<p/>
|
|
Rules for null-handling:
|
|
|
|
<ul>
|
|
|
|
<li/> Records with one or more empty sort-field values sort after records with
|
|
all sort-field values present:
|
|
POKI_RUN_COMMAND{{mlr cat data/sort-null.dat}}HERE
|
|
POKI_RUN_COMMAND{{mlr sort -n a data/sort-null.dat}}HERE
|
|
POKI_RUN_COMMAND{{mlr sort -nr a data/sort-null.dat}}HERE
|
|
|
|
<li/> Functions/operators which have one or more <i>empty</i> arguments produce empty output: e.g.
|
|
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=$x+$y'}}HERE
|
|
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=$x+$y'}}HERE
|
|
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'}}HERE
|
|
|
|
with the exception that the <code>min</code> and <code>max</code> functions are
|
|
special: if one argument is non-null, it wins:
|
|
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'}}HERE
|
|
|
|
<li/> Functions of <i>absent</i> variables (e.g. <code>mlr put '$y =
|
|
log10($nonesuch)'</code>) evaluate to absent, and arithmetic/bitwise/boolean
|
|
operators with both operands being absent evaluate to absent.
|
|
Arithmetic operators with one absent operand return the other operand.
|
|
More specifically, absent values act like zero for addition/subtraction, and
|
|
one for multiplication. Furthermore, <b>any expression which evaluates to
|
|
absent is not stored in the left-hand side of an assignment statement </b>:
|
|
|
|
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'}}HERE
|
|
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'}}HERE
|
|
|
|
<li/> Likewise, for assignment to maps, <b>absent-valued keys or values result
|
|
in a skipped assignment</b>.
|
|
|
|
</ul>
|
|
|
|
The reasoning is as follows:
|
|
|
|
<ul>
|
|
|
|
<li/> Empty values are explicit in the data so they should explicitly affect accumulations:
|
|
<code>mlr put '@sum += $x'</code>
|
|
should accumulate numeric <code>x</code> values into the sum but an empty
|
|
<code>x</code>, when encountered in the input data stream, should make the sum
|
|
non-numeric. To work around this you can use the
|
|
<code>is_not_null</code> function as follows:
|
|
<code>mlr put 'is_not_null($x) { @sum += $x }'</code>
|
|
|
|
<li/> Absent stream-record values should not break accumulations, since Miller
|
|
by design handles heterogeneous data: the running <code>@sum</code> in
|
|
<code>mlr put '@sum += $x'</code>
|
|
should not be invalidated for records which have no <code>x</code>.
|
|
|
|
<li/> Absent out-of-stream-variable values are precisely what allow you to write
|
|
<code>mlr put '@sum += $x'</code>. Otherwise you would have to write
|
|
<code>mlr put 'begin{@sum = 0}; @sum += $x'</code> —
|
|
which is tolerable — but for
|
|
<code>mlr put 'begin{...}; @sum[$a][$b] += $x'</code>
|
|
you’d have to pre-initialize <code>@sum</code> for all values of <code>$a</code> and <code>$b</code> in your
|
|
input data stream, which is intolerable.
|
|
|
|
<li/> The penalty for the absent feature is that misspelled variables can be hard to find:
|
|
e.g. in <code>mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}'</code>
|
|
the accumulator is spelt <code>@sumx</code> in the begin-block but <code>@sum</code> in the end-block, where since it
|
|
is absent, <code>@sum*2</code> evaluates to 2. See also the section on
|
|
<a href="reference-dsl.html#Errors_and_transparency">errors and transparency</a>.
|
|
|
|
</ul>
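<p/> The reasoning above about accumulators can be illustrated with a Python sketch of a two-level sum in the spirit of <code>@sum[$a][$b] += $x</code>. The function <code>accumulate</code> is hypothetical, the values are assumed to be numeric already, and empty values are not modeled.

```python
def accumulate(records, group_fields, value_field):
    """Sum value_field per combination of group_fields, skipping absent inputs."""
    sums = {}
    for record in records:
        names = list(group_fields) + [value_field]
        if any(name not in record for name in names):
            continue  # an absent key or value leaves the running sums untouched
        level = sums
        for g in group_fields[:-1]:
            level = level.setdefault(record[g], {})
        last = record[group_fields[-1]]
        # A slot not yet assigned behaves like absent, so the first addition
        # just stores the value: no pre-initialization per key is needed.
        level[last] = level.get(last, 0) + record[value_field]
    return sums
```

Without the absent rule, every combination of keys would have to be initialized before the first record is read.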
|
|
|
|
<p/>Since absent plus absent is absent (and likewise for other operators),
|
|
accumulations such as <code>@sum += $x</code> work correctly on heterogeneous data,
|
|
as do within-record formulas if both operands are absent. If one operand is
|
|
present, you may get behavior you don’t desire. To work around this
|
|
— namely, to set an output field only for records which have all the
|
|
inputs present — you can use a pattern-action block with
|
|
<code>is_present</code>:
|
|
|
|
POKI_RUN_COMMAND{{mlr cat data/het.dkvp}}HERE
|
|
POKI_RUN_COMMAND{{mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp}}HERE
|
|
POKI_RUN_COMMAND{{mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp}}HERE
|
|
|
|
<p/> If you’re interested in a formal description of how empty and absent
|
|
fields participate in arithmetic, here’s a table for plus (other
|
|
arithmetic/boolean/bitwise operators are similar):
|
|
|
|
POKI_RUN_COMMAND{{mlr --print-type-arithmetic-info}}HERE
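<p/> As a compact restatement of the rules for plus given in this section, here is a Python sketch. It models only absent, empty, and numbers (the error type is left out), and <code>ABSENT</code>, <code>EMPTY</code>, <code>field</code>, <code>plus</code>, and <code>assign</code> are hypothetical names chosen for the sketch.

```python
ABSENT = object()  # key not present
EMPTY = ""         # key present, value empty


def field(record, name):
    """Read a field; a missing key reads as absent."""
    return record.get(name, ABSENT)


def plus(a, b):
    """Addition following the empty/absent rules described in this section."""
    if a is ABSENT:
        return b        # absent acts like zero; absent + absent stays absent
    if b is ABSENT:
        return a
    if a == EMPTY or b == EMPTY:
        return EMPTY    # an empty operand makes the result empty
    return a + b


def assign(record, key, value):
    # An expression which evaluates to absent is not stored.
    if value is not ABSENT:
        record[key] = value
```

Applied to the record <code>x=2,y=3</code> with <code>$a=$u+$v; $b=$u+$y; $c=$x+$y</code>, this sketch leaves <code>a</code> unset and sets <code>b</code> and <code>c</code>.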
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>String literals</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_string_literals');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_string_literals" style="display: block">
|
|
|
|
<p/>
|
|
You can use the following backslash escapes within double-quoted string literals, in contexts such as
|
|
<code>mlr filter '$name =~ "..."'</code>,
|
|
<code>mlr put '$name = $othername . "..."'</code>,
|
|
<code>mlr put '$name = sub($name, "...", "...")'</code>, etc.:
|
|
|
|
<ul>
|
|
<li/> <code>\a</code>: ASCII code 0x07 (alarm/bell)
|
|
<li/> <code>\b</code>: ASCII code 0x08 (backspace)
|
|
<li/> <code>\f</code>: ASCII code 0x0c (formfeed)
|
|
<li/> <code>\n</code>: ASCII code 0x0a (LF/linefeed/newline)
|
|
<li/> <code>\r</code>: ASCII code 0x0d (CR/carriage return)
|
|
<li/> <code>\t</code>: ASCII code 0x09 (tab)
|
|
<li/> <code>\v</code>: ASCII code 0x0b (vertical tab)
|
|
<li/> <code>\\</code>: backslash
|
|
<li/> <code>\"</code>: double quote
|
|
<li/> <code>\123</code>: Octal 123, etc. for <code>\000</code> up to <code>\377</code>
|
|
<li/> <code>\x7f</code>: Hexadecimal 7f, etc. for <code>\x00</code> up to <code>\xff</code>
|
|
</ul>
|
|
|
|
<p/>See also <a href="https://en.wikipedia.org/wiki/Escape_sequences_in_C">https://en.wikipedia.org/wiki/Escape_sequences_in_C</a>.
|
|
|
|
<p/>These replacements apply only to strings you key in for the DSL expressions for <code>filter</code> and <code>put</code>:
|
|
that is, if you type <code>\t</code> in a string literal for a <code>filter</code>/<code>put</code> expression, it will be turned into a tab character. If you want a backslash followed by a <code>t</code>, then please type <code>\\t</code>.
|
|
|
|
<p/>However, these replacements are not done automatically within your data stream. If you wish to make these
|
|
replacements for, say, a field named <code>field</code>, you can use <code>mlr put '$field = gsub($field, "\\t",
|
|
"\t")'</code>. If you need to make such a replacement for all fields in your data, you should probably simply use the
|
|
system <code>sed</code> command.
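<p/> The distinction between escapes in a literal you type and backslashes already present in data can be sketched in Python. The helper names are hypothetical, and Python's <code>unicode_escape</code> decoder is used as a stand-in: it understands the escapes listed above, plus some others that are outside the scope of this section.

```python
def decode_escapes(literal):
    """Turn backslash escapes typed in a string literal into the characters they name."""
    return literal.encode("ascii").decode("unicode_escape")


def replace_literal_tabs(value):
    """In data, replace the two characters backslash and t with one tab character."""
    return value.replace("\\t", "\t")


typed = decode_escapes(r"a\tb")           # the escape becomes a tab
kept = decode_escapes(r"a\\tb")           # a doubled backslash yields backslash, t
from_data = replace_literal_tabs("a\\tb") # data is only changed when asked
```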
|
|
|
|
</div>
|
|
<h1>Regular expressions</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_regular_expressions');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_regular_expressions" style="display: block">
|
|
|
|
<p/>Miller lets you use regular expressions (of type POSIX.2) in the following contexts:
|
|
|
|
<ul>
|
|
|
|
<li/> In <code>mlr filter</code> with <code>=~</code> or <code>!=~</code>, e.g. <code>mlr
|
|
filter '$url =~ "http.*com"'</code> or, case-insensitively, <code>mlr filter '$url =~ "http.*com"i'</code>
|
|
|
|
<li/> In <code>mlr put</code> with <code>sub</code> or <code>gsub</code>, e.g. <code>mlr put
|
|
'$url = sub($url, "http.*com", "")'</code>
|
|
|
|
<li/> In <code>mlr having-fields</code>, e.g. <code>mlr having-fields
|
|
--any-matching '^sda[0-9]'</code>
|
|
|
|
<li/> In <code>mlr cut</code>, e.g. <code>mlr cut -r -f '^status$,^sda[0-9]'</code>
|
|
|
|
<li/> In <code>mlr rename</code>, e.g. <code>mlr rename -r '^(sda[0-9]).*$,dev/\1'</code>
|
|
|
|
<li/> In <code>mlr grep</code>, e.g. <code>mlr --csv grep 00188555487 myfiles*.csv</code>
|
|
|
|
</ul>
|
|
|
|
<p/>Points demonstrated by the above examples:
|
|
|
|
<ul>
|
|
|
|
<li/> There are no implicit start-of-string or end-of-string anchors; please
|
|
use <code>^</code> and/or <code>$</code> explicitly.
|
|
|
|
<li/> Miller regexes are wrapped with double quotes rather than slashes.
|
|
|
|
<li/> The <code>i</code> after the ending double quote indicates a case-insensitive
|
|
regex.
|
|
|
|
<li/> Capture groups are wrapped with <code>(...)</code> rather than
|
|
<code>\(...\)</code>; use <code>\(</code> and <code>\)</code> to match against parentheses.
|
|
|
|
</ul>
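<p/> The same points can be tried out with Python's <code>re</code> module. Note the assumption: Python's regex dialect is not POSIX.2, so this is only an analogy for the behaviors listed above, using hypothetical sample strings.

```python
import re

# No implicit anchors: an unanchored pattern may match anywhere in the string.
unanchored = re.search("sda[0-9]", "dev_sda1_io") is not None
anchored = re.search("^sda[0-9]", "dev_sda1_io") is not None

# Case-insensitive matching, analogous to the trailing i after the closing quote.
insensitive = re.search("http.*com", "HTTP://EXAMPLE.COM", re.IGNORECASE) is not None

# Bare parentheses capture; \1 refers back to the first capture group.
renamed = re.sub(r"^(sda[0-9]).*$", r"dev/\1", "sda1_reads")

# Backslash-escaped parentheses match literal parentheses.
literal_parens = re.search(r"\(x\)", "f(x)") is not None
```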
|
|
|
|
<p/>For <code>filter</code> and <code>put</code>, if the regular expression is a string
|
|
literal (the normal case), it is precompiled at process start and reused
|
|
thereafter, which is efficient. If the regular expression is a more complex
|
|
expression, including string concatenation using <code>.</code>, or a column name
|
|
(in which case you can take regular expressions from input data!), then regexes
|
|
are compiled on each record, which works but is less efficient. Also, in this
|
|
case there is no way to specify case-insensitive matching.
|
|
|
|
<p/>Example:
|
|
|
|
POKI_RUN_COMMAND{{cat data/regex-in-data.dat}}HERE
|
|
POKI_RUN_COMMAND{{mlr filter '$name =~ $regex' data/regex-in-data.dat}}HERE
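<p/>The trade-off between a literal regex and a regex taken from data can be
sketched in Python. This is not Miller's implementation: the record layout and
the function names <code>filter_with_literal</code> and
<code>filter_with_field</code> are hypothetical, and Python's <code>re</code>
module stands in for the POSIX regex library.

```python
import re

# Made-up records: each carries a name and a regex to test it against.
records = [
    {"name": "abc", "regex": "^a.*c$"},
    {"name": "abc", "regex": "^x"},
    {"name": "xyz", "regex": "y"},
]

def filter_with_literal(records, pattern):
    """Literal pattern: compile once up front, reuse for every record."""
    compiled = re.compile(pattern)
    return [r for r in records if compiled.search(r["name"])]

def filter_with_field(records):
    """Pattern taken from each record: it can only be compiled per record."""
    kept = []
    for r in records:
        compiled = re.compile(r["regex"])  # compiled inside the loop
        if compiled.search(r["name"]):
            kept.append(r)
    return kept

print(len(filter_with_literal(records, "^a")))
print(len(filter_with_field(records)))
```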
|
|
|
|
<h2>Regex captures</h2>
|
|
|
|
<p/>Regex captures of the form <code>\0</code> through <code>\9</code> are supported as
|
|
follows: <ul>
|
|
|
|
<li/> Captures have in-function context for <code>sub</code> and <code>gsub</code>.
|
|
For example, the first <code>\1,\2</code> pair belongs to the first <code>sub</code> and
the <code>\1,\2,\3</code> triple belongs to the second <code>sub</code>:
|
|
|
|
<p/>
|
|
<div class=pokipanel>
|
|
<pre>
|
|
mlr put '$b = sub($a, "(..)_(...)", "\2-\1"); $c = sub($a, "(..)_(.)(..)", ":\1:\2:\3")'
|
|
</pre>
|
|
</div>
|
|
|
|
<li/> Captures endure for the entirety of a <code>put</code> for the <code>=~</code>
|
|
and <code>!=~</code> operators. For example, here the <code>\1,\2</code> are set by the
|
|
<code>=~</code> operator and are used by both subsequent assignment statements:
|
|
|
|
<p/>
|
|
<div class=pokipanel>
|
|
<pre>
|
|
mlr put '$a =~ "(..)_(....)"; $b = "left_\1"; $c = "right_\2"'
|
|
</pre>
|
|
</div>
|
|
|
|
<li/>The captures are not retained across multiple puts. For example, here the
|
|
<code>\1,\2</code> won’t be expanded from the regex capture:
|
|
|
|
<p/>
|
|
<div class=pokipanel>
|
|
<pre>
|
|
mlr put '$a =~ "(..)_(....)"' then {... something else ...} then put '$b = "left_\1"; $c = "right_\2"'
|
|
</pre>
|
|
</div>
|
|
|
|
<li/> Captures are ignored in <code>filter</code> for the <code>=~</code> and
|
|
<code>!=~</code> operators. For example, there is no mechanism provided to refer to
|
|
the first <code>(..)</code> as <code>\1</code> or to the second <code>(....)</code> as
|
|
<code>\2</code> in the following filter statement:
|
|
|
|
<p/>
|
|
<div class=pokipanel>
|
|
<pre>
|
|
mlr filter '$a =~ "(..)_(....)"'
|
|
</pre>
|
|
</div>
|
|
|
|
<li/> Up to nine captures are supported: <code>\1</code> through <code>\9</code>, while
<code>\0</code> is the entire match string; <code>\15</code> is treated as <code>\1</code>
followed by a literal <code>5</code>.
|
|
</ul>
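<p/>The single-digit rule for <code>\0</code> through <code>\9</code> can be
sketched as follows. This is an illustrative Python model, not Miller's code:
the function name <code>interpolate_captures</code> is hypothetical, and
expanding a missing capture to the empty string is an assumption.

```python
import re

def interpolate_captures(template, match):
    r"""Replace \0 through \9 in template with the groups of match.

    Only one digit is consumed after the backslash, so "\15" means
    capture 1 followed by a literal "5".
    """
    out = []
    i = 0
    while i != len(template):
        ch = template[i]
        if ch == "\\" and i + 1 != len(template) and template[i + 1] in "0123456789":
            index = int(template[i + 1])
            # Assumption: captures that do not exist expand to the empty string.
            if index > match.re.groups:
                out.append("")
            else:
                out.append(match.group(index) or "")
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

m = re.search("(..)_(...)", "ab_cde")
print(interpolate_captures(r"\2-\1", m))
print(interpolate_captures(r"\15", m))
print(interpolate_captures(r"[\0]", m))
```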
|
|
|
|
<!-- ================================================================ -->
|
|
</div>
|
|
<h1>Arithmetic</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_arithmetic');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_arithmetic" style="display: block">
|
|
|
|
<h2>Input scanning</h2>
|
|
|
|
<p/>Numbers in Miller are double-precision float or 64-bit signed integers.
|
|
Anything scannable as int, e.g. <code>123</code> or <code>0xabcd</code>, is treated as
|
|
an integer; otherwise, input scannable as float (<code>4.56</code> or <code>8e9</code>)
|
|
is treated as float; everything else is a string.
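<p/>The int-then-float-then-string order can be sketched in Python. The function
name <code>infer_type</code> is hypothetical, and Python's own scanning rules
differ from Miller's in corner cases (for example, Python also accepts
<code>0b</code> and <code>0o</code> prefixes, and words such as
<code>inf</code> scan as floats), so treat this only as a model of the ordering.

```python
def infer_type(text):
    """Return text as an int if it scans as one, else a float, else the string."""
    try:
        # Base 0 accepts decimal as well as 0x-prefixed hexadecimal.
        return int(text, 0)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text

for field in ["123", "0xabcd", "4.56", "8e9", "hello"]:
    value = infer_type(field)
    print(field, type(value).__name__, value)
```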
|
|
|
|
<p/>If you want all numbers to be treated as floats, then you may use
|
|
<code>float()</code> in your filter/put expressions (e.g. replacing <code>$c = $a *
|
|
$b</code> with <code>$c = float($a) * float($b)</code>) — or, more simply, use
|
|
<code>mlr filter -F</code> and <code>mlr put -F</code> which forces all numeric input,
|
|
whether from expression literals or field values, to float. Likewise <code>mlr
|
|
stats1 -F</code> and <code>mlr step -F</code> force integerable accumulators (such as
|
|
<code>count</code>) to be done in floating-point.
|
|
|
|
<h2>Conversion by math routines</h2>
|
|
|
|
<p/>For most math functions, integers are cast to float on input, and produce
|
|
float output: e.g. <code>exp(0) = 1.0</code> rather than <code>1</code>. The
|
|
following, however, produce integer output if their inputs are integers:
|
|
<code>+</code> <code>-</code> <code>*</code> <code>/</code> <code>//</code> <code>%</code> <code>abs</code>
|
|
<code>ceil</code> <code>floor</code> <code>max</code> <code>min</code> <code>round</code>
|
|
<code>roundm</code> <code>sgn</code>. As well, <code>stats1 -a min</code>, <code>stats1 -a
|
|
max</code>, <code>stats1 -a sum</code>, <code>step -a delta</code>, and <code>step -a
|
|
rsum</code> produce integer output if their inputs are integers.
|
|
|
|
<h2>Conversion by arithmetic operators</h2>
|
|
|
|
<p/>The sum, difference, and product of integers are again integers, except
when the result would overflow a 64-bit integer, in which case Miller converts
the result to float.
|
|
|
|
<p/>The short of it is that Miller does this transparently for you so you
|
|
needn’t think about it.
|
|
|
|
<p/>Implementation details of this, for the interested: integer adds and
|
|
subtracts overflow by at most one bit so it suffices to check sign-changes.
|
|
Thus, Miller allows you to add and subtract arbitrary 64-bit signed integers,
|
|
converting only to float precisely when the result is less than -2<sup>63</sup>
|
|
or greater than 2<sup>63</sup>-1. Multiplies, on the other hand, can overflow
|
|
by a word size and a sign-change technique does not suffice to detect overflow.
|
|
Instead Miller tests whether the floating-point product exceeds the
|
|
representable integer range. Now, 64-bit integers have 64-bit precision while
IEEE doubles have only 52-bit mantissas (53 bits including the implicit leading
one), so doubles cannot represent every integer near 2<sup>63</sup>. The
following experiment explicitly demonstrates the resolution at this range:
|
|
|
|
<div class=pokipanel>
|
|
<pre>
|
|
64-bit integer      64-bit integer       Cast to double              Back to 64-bit
in hex              in decimal                                       integer
0x7ffffffffffff9ff  9223372036854774271  9223372036854773760.000000  0x7ffffffffffff800
0x7ffffffffffffa00  9223372036854774272  9223372036854773760.000000  0x7ffffffffffff800
0x7ffffffffffffbff  9223372036854774783  9223372036854774784.000000  0x7ffffffffffffc00
0x7ffffffffffffc00  9223372036854774784  9223372036854774784.000000  0x7ffffffffffffc00
0x7ffffffffffffdff  9223372036854775295  9223372036854774784.000000  0x7ffffffffffffc00
0x7ffffffffffffe00  9223372036854775296  9223372036854775808.000000  0x8000000000000000
0x7ffffffffffffffe  9223372036854775806  9223372036854775808.000000  0x8000000000000000
0x7fffffffffffffff  9223372036854775807  9223372036854775808.000000  0x8000000000000000
|
|
</pre>
|
|
</div>
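<p/>A listing of this shape can be regenerated with a short Python script.
Python is used here only because its integers are arbitrary-precision and its
floats are IEEE doubles; the helper name <code>round_trip</code> is
hypothetical.

```python
samples = [
    0x7ffffffffffff9ff,
    0x7ffffffffffffa00,
    0x7ffffffffffffbff,
    0x7ffffffffffffc00,
    0x7ffffffffffffdff,
    0x7ffffffffffffe00,
    0x7ffffffffffffffe,
    0x7fffffffffffffff,
]

def round_trip(n):
    """Convert an integer to the nearest double and back to an integer."""
    return int(float(n))

for n in samples:
    # Columns: hex input, decimal input, value as a double, hex after round trip.
    print(f"{n:#018x} {n} {float(n):.6f} {round_trip(n):#018x}")
```

<p/>Near 2<sup>63</sup> adjacent doubles are 1024 apart, which is why each input
snaps to a multiple of <code>0x400</code>.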
|
|
|
|
<p/>That is, one cannot check an integer product to see if it is precisely
|
|
greater than 2<sup>63</sup>-1 or less than -2<sup>63</sup> using either integer
|
|
arithmetic (it may have already overflowed) or using double-precision (due to
|
|
granularity). Instead Miller checks for overflow in 64-bit integer
|
|
multiplication by seeing whether the absolute value of the double-precision
|
|
product exceeds the largest representable IEEE double less than 2<sup>63</sup>,
|
|
which we see from the listing above is 9223372036854774784. (An alternative
|
|
would be to do all integer multiplies using handcrafted multi-word 128-bit
|
|
arithmetic. This approach is not taken.)
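<p/>The two overflow checks can be modeled in Python. This is a sketch of the
idea rather than Miller's source: the names <code>wrap_int64</code>,
<code>add_auto</code>, and <code>mul_auto</code> are hypothetical, and because
Python integers never overflow, 64-bit wraparound is simulated explicitly.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1
# Largest IEEE double below 2**63, as read off the listing in the text.
MUL_LIMIT = 9223372036854774784.0

def wrap_int64(n):
    """Reduce a Python int to what a signed 64-bit register would hold."""
    return (n - INT64_MIN) % (2 ** 64) + INT64_MIN

def add_auto(a, b):
    """Add two 64-bit ints, falling back to float on overflow."""
    c = wrap_int64(a + b)
    # Overflow occurred exactly when the inputs share a sign
    # and the wrapped sum has the opposite sign.
    same_sign_inputs = (a >= 0) == (b >= 0)
    if same_sign_inputs and (c >= 0) != (a >= 0):
        return float(a) + float(b)
    return c

def mul_auto(a, b):
    """Multiply two 64-bit ints, falling back to float when the product is too large."""
    d = float(a) * float(b)
    if abs(d) > MUL_LIMIT:
        return d
    return a * b

print(add_auto(INT64_MAX, 1))
print(add_auto(1, 2))
print(mul_auto(2 ** 32, 2 ** 32))
print(mul_auto(2 ** 31, 2 ** 31))
```

<p/>Note that the multiplication check is conservative: a few products just
under 2<sup>63</sup> are converted to float even though they would fit.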
|
|
|
|
<h2>Pythonic division</h2>
|
|
|
|
<p/>Division and remainder are
|
|
<a href="http://python-history.blogspot.com/2010/08/why-pythons-integer-division-floors.html">
|
|
pythonic</a>:
|
|
<ul>
|
|
<li/> Quotient of integers is floating-point: <code>7/2</code> is <code>3.5</code>.
|
|
<li/> Integer division is done with <code>//</code>: <code>7//2</code> is <code>3</code>.
This rounds toward negative infinity, so <code>-7//2</code> is <code>-4</code>.
|
|
<li/> Remainders are non-negative.
|
|
</ul>
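<p/>Since these rules are borrowed from Python, Python itself shows the
behavior being described. The snippet below runs in Python, not in Miller, and
the variable names are only for illustration; the remainder example uses a
positive divisor.

```python
quotient = 7 / 2        # true division gives a float
floor_pos = 7 // 2      # floor division of positive operands
floor_neg = -7 // 2     # floor division rounds toward negative infinity
remainder = -7 % 5      # remainder is non-negative for a positive divisor

print(quotient, floor_pos, floor_neg, remainder)
```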
|
|
|
|
</div>
|
|
<!-- ================================================================ -->
|
|
<h1>On-line help</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_online_help');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_online_help" style="display: block">
|
|
|
|
<p/>Examples:<p/>
|
|
|
|
POKI_RUN_COMMAND{{mlr --help}}HERE
|
|
|
|
POKI_RUN_COMMAND{{mlr sort --help}}HERE
|
|
|
|
</div>
|