miller/docs/reference.rst.in
2020-09-29 22:51:32 -04:00

515 lines
28 KiB
ReStructuredText

Main reference
================================================================
.. _reference-command-overview:
Command overview
----------------------------------------------------------------
Whereas the Unix toolkit is made of the separate executables ``cat``, ``tail``, ``cut``,
``sort``, etc., Miller has subcommands, or **verbs**, invoked as follows:
::
POKI_INCLUDE_ESCAPED(data/subcommand-example.txt)HERE
These fall into categories as follows:
* Analogs of their Unix-toolkit namesakes, discussed below as well as in :doc:`feature-comparison`: :ref:`reference-verbs-cat` :ref:`reference-verbs-cut` :ref:`reference-verbs-grep` :ref:`reference-verbs-head` :ref:`reference-verbs-join` :ref:`reference-verbs-sort` :ref:`reference-verbs-tac` :ref:`reference-verbs-tail` :ref:`reference-verbs-top` :ref:`reference-verbs-uniq`.
* ``awk``-like functionality: :ref:`reference-verbs-filter` :ref:`reference-verbs-put` :ref:`reference-verbs-sec2gmt` :ref:`reference-verbs-sec2gmtdate` :ref:`reference-verbs-step` :ref:`reference-verbs-tee`.
* Statistically oriented: :ref:`reference-verbs-bar` :ref:`reference-verbs-bootstrap` :ref:`reference-verbs-decimate` :ref:`reference-verbs-histogram` :ref:`reference-verbs-least-frequent` :ref:`reference-verbs-most-frequent` :ref:`reference-verbs-sample` :ref:`reference-verbs-shuffle` :ref:`reference-verbs-stats1` :ref:`reference-verbs-stats2`.
* Particularly oriented toward :doc:`record-heterogeneity`, although all Miller commands can handle heterogeneous records: :ref:`reference-verbs-group-by` :ref:`reference-verbs-group-like` :ref:`reference-verbs-having-fields`.
* These draw from other sources (see also :doc:`originality`): :ref:`reference-verbs-count-distinct` is SQL-ish, and :ref:`reference-verbs-rename` can be done by ``sed`` (which does it faster: see :doc:`performance`. Verbs: :ref:`reference-verbs-check` :ref:`reference-verbs-count-distinct` :ref:`reference-verbs-label` :ref:`reference-verbs-merge-fields` :ref:`reference-verbs-nest` :ref:`reference-verbs-nothing` :ref:`reference-verbs-regularize` :ref:`reference-verbs-rename` :ref:`reference-verbs-reorder` :ref:`reference-verbs-reshape` :ref:`reference-verbs-seqgen`.
I/O options
----------------------------------------------------------------
Formats
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Options:
::
--dkvp --idkvp --odkvp
--nidx --inidx --onidx
--csv --icsv --ocsv
--csvlite --icsvlite --ocsvlite
--pprint --ipprint --opprint --right
--xtab --ixtab --oxtab
--json --ijson --ojson
These are as discussed in :doc:`file-formats`, with the exception of ``--right`` which makes pretty-printed output right-aligned:
::
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
::
POKI_RUN_COMMAND{{mlr --opprint --right cat data/small}}HERE
Additional notes:
* Use ``--csv``, ``--pprint``, etc. when the input and output formats are the same.
* Use ``--icsv --opprint``, etc. when you want format conversion as part of what Miller does to your data.
* DKVP (key-value-pair) format is the default for input and output. So, ``--oxtab`` is the same as ``--idkvp --oxtab``.
**Pro-tip:** Please use either **--format1**, or **--iformat1 --oformat2**. If you use **--format1 --oformat2** then what happens is that flags are set up for input *and* output for format1, some of which are overwritten for output in format2. For technical reasons, having ``--oformat2`` clobber all the output-related effects of ``--format1`` also removes some flexibility from the command-line interface. See also https://github.com/johnkerl/miller/issues/180 and https://github.com/johnkerl/miller/issues/199.
In-place mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use the ``mlr -I`` flag to process files in-place. For example, ``mlr -I --csv cut -x -f unwanted_column_name mydata/*.csv`` will remove ``unwanted_column_name`` from all your ``*.csv`` files in your ``mydata/`` subdirectory.
By default, Miller output goes to the screen (or you can redirect a file using ``>`` or to another process using ``|``). With ``-I``, for each file name on the command line, output is written to a temporary file in the same directory. Miller writes its output into that temp file, which is then renamed over the original. Then, processing continues on the next file. Each file is processed in isolation: if the output format is CSV, CSV headers will be present in each output file; statistics are only over each file's own records; and so on.
Please see :ref:`10min-choices-for-printing-to-files` for examples.
Compression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Options:
::
--prepipe {command}
The prepipe command is anything which reads from standard input and produces data acceptable to Miller. Nominally this allows you to use whichever decompression utilities you have installed on your system, on a per-file basis. If the command has flags, quote them: e.g. ``mlr --prepipe 'zcat -cf'``. Examples:
::
# These two produce the same output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz
# With multiple input files you need --prepipe:
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz
$ mlr --prepipe gunzip --idkvp --oxtab cut -f hostname,uptime myfile1.dat.gz myfile2.dat.gz
::
# Similar to the above, but with compressed output as well as input:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz | gzip > outfile.csv.gz
::
# Similar to the above, but with different compression tools for input and output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | xz -z > outfile.csv.xz
$ xz -cd < myfile1.csv.xz | mlr cut -f hostname,uptime | gzip > outfile.csv.xz
$ mlr --prepipe 'xz -cd' cut -f hostname,uptime myfile1.csv.xz myfile2.csv.xz | xz -z > outfile.csv.xz
.. _reference-separators:
Record/field/pair separators
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Miller has record separators ``IRS`` and ``ORS``, field separators ``IFS`` and ``OFS``, and pair separators ``IPS`` and ``OPS``. For example, in the DKVP line ``a=1,b=2,c=3``, the record separator is newline, field separator is comma, and pair separator is the equals sign. These are the default values.
Options:
::
--rs --irs --ors
--fs --ifs --ofs --repifs
--ps --ips --ops
* You can change a separator from input to output via e.g. ``--ifs = --ofs :``. Or, you can specify that the same separator is to be used for input and output via e.g. ``--fs :``.
* The pair separator is only relevant to DKVP format.
* Pretty-print and xtab formats ignore the separator arguments altogether.
* The ``--repifs`` means that multiple successive occurrences of the field separator count as one. For example, in CSV data we often signify nulls by empty strings, e.g. ``2,9,,,,,6,5,4``. On the other hand, if the field separator is a space, it might be more natural to parse ``2 4 5`` the same as ``2 4 5``: ``--repifs --ifs ' '`` lets this happen. In fact, the ``--ipprint`` option above is internally implemented in terms of ``--repifs``.
* Just write out the desired separator, e.g. ``--ofs '|'``. But you may use the symbolic names ``newline``, ``space``, ``tab``, ``pipe``, or ``semicolon`` if you like.
Number formatting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The command-line option ``--ofmt {format string}`` is the global number format for commands which generate numeric output, e.g. ``stats1``, ``stats2``, ``histogram``, and ``step``, as well as ``mlr put``. Examples:
::
--ofmt %.9le --ofmt %.6lf --ofmt %.0lf
These are just C ``printf`` formats applied to double-precision numbers. Please don't use ``%s`` or ``%d``. Additionally, if you use leading width (e.g. ``%18.12lf``) then the output will contain embedded whitespace, which may not be what you want if you pipe the output to something else, particularly CSV. I use Miller's pretty-print format (``mlr --opprint``) to column-align numerical data.
To apply formatting to a single field, overriding the global ``ofmt``, use ``fmtnum`` function within ``mlr put``. For example:
::
POKI_RUN_COMMAND{{echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08lf")'}}HERE
::
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08llx")'}}HERE
Input conversion from hexadecimal is done automatically on fields handled by ``mlr put`` and ``mlr filter`` as long as the field value begins with "0x". To apply output conversion to hexadecimal on a single column, you may use ``fmtnum``, or the keystroke-saving ``hexfmt`` function. Example:
::
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'}}HERE
Data transformations (verbs)
----------------------------------------------------------------
Please see the separate page :doc:`reference-verbs`.
Expression language for filter and put
----------------------------------------------------------------
Please see the separate page :doc:`reference-dsl`.
then-chaining
----------------------------------------------------------------
In accord with the `Unix philosophy <http://en.wikipedia.org/wiki/Unix_philosophy>`_, you can pipe data into or out of Miller. For example:
::
mlr cut --complement -f os_version *.dat | mlr sort -f hostname,uptime
You can, if you like, instead simply chain commands together using the ``then`` keyword:
::
mlr cut --complement -f os_version then sort -f hostname,uptime *.dat
(You can precede the very first verb with ``then``, if you like, for symmetry.)
Here's a performance comparison:
::
POKI_INCLUDE_ESCAPED(data/then-chaining-performance.txt)HERE
There are two reasons to use then-chaining: one is for performance, although I don't expect this to be a win in all cases. Using then-chaining avoids redundant string-parsing and string-formatting at each pipeline step: instead input records are parsed once, they are fed through each pipeline stage in memory, and then output records are formatted once. On the other hand, Miller is single-threaded, while modern systems are usually multi-processor, and when streaming-data programs operate through pipes, each one can use a CPU. Rest assured you get the same results either way.
The other reason to use then-chaining is for simplicity: you don't have re-type formatting flags (e.g. ``--csv --fs tab``) at every pipeline stage.
Auxiliary commands
----------------------------------------------------------------
There are a few nearly-standalone programs which have nothing to do with the rest of Miller, do not participate in record streams, and do not deal with file formats. They might as well be little standalone executables but they're delivered within the main Miller executable for convenience.
::
POKI_RUN_COMMAND{{mlr aux-list}}HERE
::
POKI_RUN_COMMAND{{mlr lecat --help}}HERE
::
POKI_RUN_COMMAND{{mlr termcvt --help}}HERE
::
POKI_RUN_COMMAND{{mlr hex --help}}HERE
::
POKI_RUN_COMMAND{{mlr unhex --help}}HERE
Examples:
::
POKI_RUN_COMMAND{{echo 'Hello, world!' | mlr lecat --mono}}HERE
::
POKI_RUN_COMMAND{{echo 'Hello, world!' | mlr termcvt --lf2crlf | mlr lecat --mono}}HERE
::
POKI_RUN_COMMAND{{mlr hex data/budget.csv}}HERE
::
POKI_RUN_COMMAND{{mlr hex -r data/budget.csv}}HERE
::
POKI_RUN_COMMAND{{mlr hex -r data/budget.csv | sed 's/20/2a/g' | mlr unhex}}HERE
Data types
----------------------------------------------------------------
Miller's input and output are all string-oriented: there is (as of August 2015 anyway) no support for binary record packing. In this sense, everything is a string in and out of Miller. During processing, field names are always strings, even if they have names like "3"; field values are usually strings. Field values' ability to be interpreted as a non-string type only has meaning when comparison or function operations are done on them. And it is an error condition if Miller encounters non-numeric (or otherwise mistyped) data in a field in which it has been asked to do numeric (or otherwise type-specific) operations.
Field values are treated as numeric for the following:
* Numeric sort: ``mlr sort -n``, ``mlr sort -nr``.
* Statistics: ``mlr histogram``, ``mlr stats1``, ``mlr stats2``.
* Cross-record arithmetic: ``mlr step``.
For ``mlr put`` and ``mlr filter``:
* Miller's types for function processing are **empty-null** (empty string), **absent-null** (reads of unset right-hand sides, or fall-through non-explicit return values from user-defined functions), **error**, **string**, **float** (double-precision), **int** (64-bit signed), and **boolean**.
* On input, string values representable as numbers, e.g. "3" or "3.1", are treated as int or float, respectively. If a record has ``x=1,y=2`` then ``mlr put '$z=$x+$y'`` will produce ``x=1,y=2,z=3``, and ``mlr put '$z=$x.$y'`` does not give an error simply because the dot operator has been generalized to stringify non-strings. To coerce back to string for processing, use the ``string`` function: ``mlr put '$z=string($x).string($y)'`` will produce ``x=1,y=2,z=12``.
* On input, string values representable as boolean (e.g. ``"true"``, ``"false"``) are *not* automatically treated as boolean. (This is because ``"true"`` and ``"false"`` are ordinary words, and auto string-to-boolean on a column consisting of words would result in some strings mixed with some booleans.) Use the ``boolean`` function to coerce: e.g. giving the record ``x=1,y=2,w=false`` to ``mlr put '$z=($x<$y) || boolean($w)'``.
* Functions take types as described in ``mlr --help-all-functions``: for example, ``log10`` takes float input and produces float output, ``gmt2sec`` maps string to int, and ``sec2gmt`` maps int to string.
* All math functions described in ``mlr --help-all-functions`` take integer as well as float input.
.. _reference-null-data:
Null data: empty and absent
----------------------------------------------------------------
One of Miller's key features is its support for **heterogeneous** data. For example, take ``mlr sort``: if you try to sort on field ``hostname`` when not all records in the data stream *have* a field named ``hostname``, it is not an error (although you could pre-filter the data stream using ``mlr having-fields --at-least hostname then sort ...``). Rather, records lacking one or more sort keys are simply output contiguously by ``mlr sort``.
Miller has two kinds of null data:
* **Empty (key present, value empty)**: a field name is present in a record (or in an out-of-stream variable) with empty value: e.g. ``x=,y=2`` in the data input stream, or assignment ``$x=""`` or ``@x=""`` in ``mlr put``.
* **Absent (key not present)**: a field name is not present, e.g. input record is ``x=1,y=2`` and a ``put`` or ``filter`` expression refers to ``$z``. Or, reading an out-of-stream variable which hasn't been assigned a value yet, e.g. ``mlr put -q '@sum += $x; end{emit @sum}'`` or ``mlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'``.
You can test these programatically using the functions ``is_empty``/``is_not_empty``, ``is_absent``/``is_present``, and ``is_null``/``is_not_null``. For the last pair, note that null means either empty or absent.
Rules for null-handling:
* Records with one or more empty sort-field values sort after records with all sort-field values present:
::
POKI_RUN_COMMAND{{mlr cat data/sort-null.dat}}HERE
::
POKI_RUN_COMMAND{{mlr sort -n a data/sort-null.dat}}HERE
::
POKI_RUN_COMMAND{{mlr sort -nr a data/sort-null.dat}}HERE
* Functions/operators which have one or more *empty* arguments produce empty output: e.g.
::
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=$x+$y'}}HERE
::
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=$x+$y'}}HERE
::
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'}}HERE
with the exception that the ``min`` and ``max`` functions are special: if one argument is non-null, it wins:
::
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'}}HERE
* Functions of *absent* variables (e.g. ``mlr put '$y = log10($nonesuch)'``) evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication: Furthermore, **any expression which evaluates to absent is not stored in the left-hand side of an assignment statement**:
::
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'}}HERE
::
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'}}HERE
* Likewise, for assignment to maps, **absent-valued keys or values result in a skipped assignment**.
The reasoning is as follows:
* Empty values are explicit in the data so they should explicitly affect accumulations: ``mlr put '@sum += $x'`` should accumulate numeric ``x`` values into the sum but an empty ``x``, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use the ``is_not_null`` function as follows: ``mlr put 'is_not_null($x) { @sum += $x }'``
* Absent stream-record values should not break accumulations, since Miller by design handles heterogenous data: the running ``@sum`` in ``mlr put '@sum += $x'`` should not be invalidated for records which have no ``x``.
* Absent out-of-stream-variable values are precisely what allow you to write ``mlr put '@sum += $x'``. Otherwise you would have to write ``mlr put 'begin{@sum = 0}; @sum += $x'`` -- which is tolerable -- but for ``mlr put 'begin{...}; @sum[$a][$b] += $x'`` you'd have to pre-initialize ``@sum`` for all values of ``$a`` and ``$b`` in your input data stream, which is intolerable.
* The penalty for the absent feature is that misspelled variables can be hard to find: e.g. in ``mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}'`` the accumulator is spelt ``@sumx`` in the begin-block but ``@sum`` in the end-block, where since it is absent, ``@sum*2`` evaluates to 2. See also the section on :ref:`reference-dsl-errors-and-transparency`.
Since absent plus absent is absent (and likewise for other operators), accumulations such as ``@sum += $x`` work correctly on heterogenous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don't desire. To work around this -- namely, to set an output field only for records which have all the inputs present -- you can use a pattern-action block with ``is_present``:
::
POKI_RUN_COMMAND{{mlr cat data/het.dkvp}}HERE
::
POKI_RUN_COMMAND{{mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp}}HERE
::
POKI_RUN_COMMAND{{mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp}}HERE
If you're interested in a formal description of how empty and absent fields participate in arithmetic, here's a table for plus (other arithmetic/boolean/bitwise operators are similar):
::
POKI_RUN_COMMAND{{mlr --print-type-arithmetic-info}}HERE
String literals
----------------------------------------------------------------
You can use the following backslash escapes for strings such as between the double quotes in contexts such as ``mlr filter '$name =~ "..."'``, ``mlr put '$name = $othername . "..."'``, ``mlr put '$name = sub($name, "...", "...")``, etc.:
* ``\a``: ASCII code 0x07 (alarm/bell)
* ``\b``: ASCII code 0x08 (backspace)
* ``\f``: ASCII code 0x0c (formfeed)
* ``\n``: ASCII code 0x0a (LF/linefeed/newline)
* ``\r``: ASCII code 0x0d (CR/carriage return)
* ``\t``: ASCII code 0x09 (tab)
* ``\v``: ASCII code 0x0b (vertical tab)
* ``\\``: backslash
* ``\"``: double quote
* ``\123``: Octal 123, etc. for ``\000`` up to ``\377``
* ``\x7f``: Hexadecimal 7f, etc. for ``\x00`` up to ``\xff``
See also https://en.wikipedia.org/wiki/Escape_sequences_in_C.
These replacements apply only to strings you key in for the DSL expressions for ``filter`` and ``put``: that is, if you type ``\t`` in a string literal for a ``filter``/``put`` expression, it will be turned into a tab character. If you want a backslash followed by a ``t``, then please type ``\\t``.
However, these replacements are not done automatically within your data stream. If you wish to make these replacements, you can do, for example, for a field named ``field``, ``mlr put '$field = gsub($field, "\\t", "\t")'``. If you need to make such a replacement for all fields in your data, you should probably simply use the system ``sed`` command.
Regular expressions
----------------------------------------------------------------
Miller lets you use regular expressions (of type POSIX.2) in the following contexts:
* In ``mlr filter`` with ``=~`` or ``!=~``, e.g. ``mlr filter '$url =~ "http.*com"'``
* In ``mlr put`` with ``sub`` or ``gsub``, e.g. ``mlr put '$url = sub($url, "http.*com", "")'``
* In ``mlr having-fields``, e.g. ``mlr having-fields --any-matching '^sda[0-9]'``
* In ``mlr cut``, e.g. ``mlr cut -r -f '^status$,^sda[0-9]'``
* In ``mlr rename``, e.g. ``mlr rename -r '^(sda[0-9]).*$,dev/\1'``
* In ``mlr grep``, e.g. ``mlr --csv grep 00188555487 myfiles*.csv``
Points demonstrated by the above examples:
* There are no implicit start-of-string or end-of-string anchors; please use ``^`` and/or ``$`` explicitly.
* Miller regexes are wrapped with double quotes rather than slashes.
* The ``i`` after the ending double quote indicates a case-insensitive regex.
* Capture groups are wrapped with ``(...)`` rather than ``\(...\)``; use ``\(`` and ``\)`` to match against parentheses.
For ``filter`` and ``put``, if the regular expression is a string literal (the normal case), it is precompiled at process start and reused thereafter, which is efficient. If the regular expression is a more complex expression, including string concatenation using ``.``, or a column name (in which case you can take regular expressions from input data!), then regexes are compiled on each record which works but is less efficient. As well, in this case there is no way to specify case-insensitive matching.
Example:
::
POKI_RUN_COMMAND{{cat data/regex-in-data.dat}}HERE
::
POKI_RUN_COMMAND{{mlr filter '$name =~ $regex' data/regex-in-data.dat}}HERE
Regex captures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Regex captures of the form ``\0`` through ``\9`` are supported as
* Captures have in-function context for ``sub`` and ``gsub``. For example, the first ``\1,\2`` pair belong to the first ``sub`` and the second ``\1,\2`` pair belong to the second ``sub``:
::
mlr put '$b = sub($a, "(..)_(...)", "\2-\1"); $c = sub($a, "(..)_(.)(..)", ":\1:\2:\3")'
* Captures endure for the entirety of a ``put`` for the ``=~`` and ``!=~`` operators. For example, here the ``\1,\2`` are set by the ``=~`` operator and are used by both subsequent assignment statements:
::
mlr put '$a =~ "(..)_(....); $b = "left_\1"; $c = "right_\2"'
* The captures are not retained across multiple puts. For example, here the ``\1,\2`` won't be expanded from the regex capture:
::
mlr put '$a =~ "(..)_(....)' then {... something else ...} then put '$b = "left_\1"; $c = "right_\2"'
* Captures are ignored in ``filter`` for the ``=~`` and ``!=~`` operators. For example, there is no mechanism provided to refer to the first ``(..)`` as ``\1`` or to the second ``(....)`` as ``\2`` in the following filter statement:
::
mlr filter '$a =~ "(..)_(....)'
* Up to nine matches are supported: ``\1`` through ``\9``, while ``\0`` is the entire match string; ``\15`` is treated as ``\1`` followed by an unrelated ``5``.
Arithmetic
----------------------------------------------------------------
Input scanning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Numbers in Miller are double-precision float or 64-bit signed integers. Anything scannable as int, e.g ``123`` or ``0xabcd``, is treated as an integer; otherwise, input scannable as float (``4.56`` or ``8e9``) is treated as float; everything else is a string.
If you want all numbers to be treated as floats, then you may use ``float()`` in your filter/put expressions (e.g. replacing ``$c = $a * $b`` with ``$c = float($a) * float($b)``) -- or, more simply, use ``mlr filter -F`` and ``mlr put -F`` which forces all numeric input, whether from expression literals or field values, to float. Likewise ``mlr stats1 -F`` and ``mlr step -F`` force integerable accumulators (such as ``count``) to be done in floating-point.
Conversion by math routines
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For most math functions, integers are cast to float on input, and produce float output: e.g. ``exp(0) = 1.0`` rather than ``1``. The following, however, produce integer output if their inputs are integers: ``+`` ``-`` ``*`` ``/`` ``//`` ``%`` ``abs`` ``ceil`` ``floor`` ``max`` ``min`` ``round`` ``roundm`` ``sgn``. As well, ``stats1 -a min``, ``stats1 -a max``, ``stats1 -a sum``, ``step -a delta``, and ``step -a rsum`` produce integer output if their inputs are integers.
Conversion by arithmetic operators
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sum, difference, and product of integers is again integer, except for when that would overflow a 64-bit integer at which point Miller converts the result to float.
The short of it is that Miller does this transparently for you so you needn't think about it.
Implementation details of this, for the interested: integer adds and subtracts overflow by at most one bit so it suffices to check sign-changes. Thus, Miller allows you to add and subtract arbitrary 64-bit signed integers, converting only to float precisely when the result is less than -2\ :sup:`63` or greater than 2\ :sup:`63`\ -1. Multiplies, on the other hand, can overflow by a word size and a sign-change technique does not suffice to detect overflow. Instead Miller tests whether the floating-point product exceeds the representable integer range. Now, 64-bit integers have 64-bit precision while IEEE-doubles have only 52-bit mantissas -- so, there are 53 bits including implicit leading one. The following experiment explicitly demonstrates the resolution at this range:
::
64-bit integer 64-bit integer Casted to double Back to 64-bit
in hex in decimal integer
0x7ffffffffffff9ff 9223372036854774271 9223372036854773760.000000 0x7ffffffffffff800
0x7ffffffffffffa00 9223372036854774272 9223372036854773760.000000 0x7ffffffffffff800
0x7ffffffffffffbff 9223372036854774783 9223372036854774784.000000 0x7ffffffffffffc00
0x7ffffffffffffc00 9223372036854774784 9223372036854774784.000000 0x7ffffffffffffc00
0x7ffffffffffffdff 9223372036854775295 9223372036854774784.000000 0x7ffffffffffffc00
0x7ffffffffffffe00 9223372036854775296 9223372036854775808.000000 0x8000000000000000
0x7ffffffffffffffe 9223372036854775806 9223372036854775808.000000 0x8000000000000000
0x7fffffffffffffff 9223372036854775807 9223372036854775808.000000 0x8000000000000000
That is, one cannot check an integer product to see if it is precisely greater than 2\ :sup:`63`\ -1 or less than -2\ :sup:`63` using either integer arithmetic (it may have already overflowed) or using double-precision (due to granularity). Instead Miller checks for overflow in 64-bit integer multiplication by seeing whether the absolute value of the double-precision product exceeds the largest representable IEEE double less than 2\ :sup:`63`, which we see from the listing above is 9223372036854774784. (An alternative would be to do all integer multiplies using handcrafted multi-word 128-bit arithmetic. This approach is not taken.)
Pythonic division
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Division and remainder are `pythonic <http://python-history.blogspot.com/2010/08/why-pythons-integer-division-floors.html>`_:
* Quotient of integers is floating-point: ``7/2`` is ``3.5``.
* Integer division is done with ``//``: ``7//2`` is ``3``. This rounds toward the negative.
* Remainders are non-negative.
On-line help
----------------------------------------------------------------
Examples:
::
POKI_RUN_COMMAND{{mlr --help}}HERE
::
POKI_RUN_COMMAND{{mlr sort --help}}HERE