mirror of
https://github.com/johnkerl/miller.git
# Separators

## Record, field, and pair separators

Miller has record separators, field separators, and pair separators. For
example, given the following [DKVP](file-formats.md#dkvp-key-value-pairs)
records:

GENMD_RUN_COMMAND
cat data/a.dkvp
GENMD_EOF

* the **record separator** is newline -- it separates records from one another;
* the **field separator** is `,` -- it separates fields (key-value pairs) from one another;
* and the **pair separator** is `=` -- it separates the key from the value within each key-value pair.

These are the default values, which you can override with flags such as `--ips`
and `--ops` (below).

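To make the three roles concrete, here is a minimal Python sketch of how DKVP text could be split using a record, field, and pair separator. It is an illustration only, not Miller's implementation: the function name `parse_dkvp` and the sample text are made up for this example, and the sketch does not handle fields that lack a pair separator.

```python
def parse_dkvp(text, rs="\n", fs=",", ps="="):
    """Split DKVP text into a list of dicts, one dict per record."""
    records = []
    for line in text.split(rs):
        if line == "":
            continue  # skip the empty piece after a trailing record separator
        record = {}
        for field in line.split(fs):
            # Split each field at the first pair separator only.
            key, _, value = field.partition(ps)
            record[key] = value
        records.append(record)
    return records


# Hypothetical sample input using the default separators.
sample = "a=1,b=2,c=3\na=4,b=5,c=6\n"
print(parse_dkvp(sample))
# [{'a': '1', 'b': '2', 'c': '3'}, {'a': '4', 'b': '5', 'c': '6'}]
```

Passing different `fs` and `ps` arguments to the sketch plays the same role as overriding the defaults on the command line.
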
Not all [file formats](file-formats.md) have all three of these: for example,
CSV does not have a pair separator, since keys are on the header line and
values are on each data line.

Also, separators are not programmable for all file formats. For example, in
[JSON objects](file-formats.md#json), the pair separator is `:` and the
field separator is `,` -- we write `{"a":1,"b":2,"c":3}` -- but these aren't
modifiable. If you do `mlr --json --ips : --ops '=' cat myfile.json` then you
don't get `{"a"=1,"b"=2,"c"=3}`. This is because the pair separator `:` is
part of the JSON specification.

## Input and output separators

Miller lets you use the same separators for input and output, or change
them between input and output if you wish to transform your data in that way.

Miller uses the names `IRS` and `ORS` for the input and output record
separators, `IFS` and `OFS` for the input and output field separators, and
`IPS` and `OPS` for input and output pair separators.

For example:

GENMD_RUN_COMMAND
mlr --ifs , --ofs ';' --ips = --ops : cut -o -f c,a,b data/a.dkvp
GENMD_EOF

If your data has non-default separators and you don't want to change those
between input and output, you can use `--rs`, `--fs`, and `--ps`. Setting
`--fs :` is the same as setting `--ifs : --ofs :`, but with fewer keystrokes.

GENMD_RUN_COMMAND
mlr --fs ';' --ps : cut -o -f c,a,b data/modsep.dkvp
GENMD_EOF

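The split between input and output separators can be sketched in a few lines of Python: read a line with one pair of separators, then write it back with another. This is only an illustration under stated assumptions, not Miller's code. The function name `convert_line` is hypothetical, and the sketch leaves out the field reordering that `cut -o` performs in the examples above.

```python
def convert_line(line, ifs=",", ips="=", ofs=",", ops="="):
    """Re-emit one DKVP line, swapping input separators for output ones."""
    # Parse with the input separators.
    pairs = [field.partition(ips) for field in line.split(ifs)]
    # Write with the output separators.
    return ofs.join(key + ops + value for key, _, value in pairs)


print(convert_line("a=1,b=2,c=3", ofs=";", ops=":"))
# a:1;b:2;c:3
```

Leaving `ofs` and `ops` at the same values as `ifs` and `ips` reproduces the line unchanged, which is the situation the `--fs` and `--ps` shorthands cover.
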
## Multi-character separators

The separators default to single characters, but can be multiple characters if you like:

GENMD_RUN_COMMAND
mlr --ifs ';' --ips : --ofs ';;;' --ops := cut -o -f c,a,b data/modsep.dkvp
GENMD_EOF

While the separators can be multiple characters, they cannot (as of mid-2021)
be [regular expressions](reference-main-regular-expressions.md), even though
Miller supports regular expressions in many other ways. So, in the above
example, you can say the field separator is one semicolon, or three, but with
`--ifs ';;;'` a run of two or four semicolons won't be recognized as a separator.

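The difference between a literal multi-character separator and a pattern can be seen in a short Python sketch. It only illustrates literal matching in general and is not Miller's implementation; the function name `split_literal` and the sample lines are assumptions made for this example.

```python
def split_literal(line, ifs):
    """Split on an exact, literal separator string (no pattern matching)."""
    return line.split(ifs)


# Exactly three semicolons between fields: the separator matches.
print(split_literal("a:1;;;b:2;;;c:3", ";;;"))
# ['a:1', 'b:2', 'c:3']

# Only two semicolons: the three-character separator never matches,
# so the whole line comes back as a single field.
print(split_literal("a:1;;b:2", ";;;"))
# ['a:1;;b:2']
```
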
To fill this need, in the absence of full regular-expression support, Miller
has a `--repifs` option for input. This means, for example, using
`--ifs ' ' --repifs` you can have the field separator be one _or more_ spaces.
(Mixes of spaces and tabs, however, won't be recognized as a separator.)

The `--repifs` flag means that multiple successive occurrences of the field
separator count as one. For example, in CSV data we often signify nulls by
empty strings, e.g. `2,9,,,,,6,5,4`. On the other hand, if the field separator
is a space, it might be more natural to parse `2 4    5` the same as `2 4 5`:
`--repifs --ifs ' '` lets this happen. In fact, the `--ipprint` option
is internally implemented in terms of `--repifs`.

For example:

GENMD_RUN_COMMAND
cat data/extra-spaces.txt
GENMD_EOF

(TODO: FIXME)

GENMD_RUN_COMMAND
mlr --ifs ' ' --repifs --inidx --oxtab cat data/extra-spaces.txt
GENMD_EOF

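As a rough model of what "repeated separators count as one" means, the following Python sketch splits a line with and without that behavior. It is an assumption-laden illustration rather than Miller's implementation: the function name `split_fields` and its `repifs` parameter are hypothetical, and the sketch says nothing about how Miller treats leading or trailing separators.

```python
import re


def split_fields(line, ifs=" ", repifs=False):
    """Split a line on a literal separator, optionally collapsing repeats."""
    if repifs:
        # One or more back-to-back copies of the literal separator count as one.
        return re.split("(?:" + re.escape(ifs) + ")+", line)
    return line.split(ifs)


print(split_fields("2 4    5"))
# ['2', '4', '', '', '', '5']

print(split_fields("2 4    5", repifs=True))
# ['2', '4', '5']

# A space followed by a tab is not a repeat of the space separator,
# so the tab stays attached to the next field.
print(split_fields("2 \t4", repifs=True))
# ['2', '\t4']
```

Without collapsing, each extra space produces an empty field, which is the right reading for CSV-style nulls but usually not for space-aligned data.
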
## Command-line flags

Given the above, we have now seen the following flags:

GENMD_CARDIFY
--rs --irs --ors
--fs --ifs --ofs --repifs
--ps --ips --ops
GENMD_EOF

Also note that you can use names for certain characters: e.g. `--fs space` is
the same as `--fs ' '`. A full list is: `colon`, `comma`, `equals`, `newline`,
`pipe`, `semicolon`, `slash`, `space`, `tab`.

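One way to picture the named separators is as a lookup table consulted before the flag's value is used. The Python sketch below assumes that reading; the names `SEPARATOR_ALIASES` and `resolve_separator` are hypothetical, and the character each name maps to is inferred from the name itself rather than taken from Miller's source.

```python
# Alias names from the list above, mapped to the characters they suggest.
SEPARATOR_ALIASES = {
    "colon": ":",
    "comma": ",",
    "equals": "=",
    "newline": "\n",
    "pipe": "|",
    "semicolon": ";",
    "slash": "/",
    "space": " ",
    "tab": "\t",
}


def resolve_separator(value):
    """Return the character for a known alias, else the value unchanged."""
    return SEPARATOR_ALIASES.get(value, value)


print(repr(resolve_separator("space")))  # ' '
print(repr(resolve_separator(";;;")))    # ';;;'
```
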
## DSL built-in variables

Miller exposes read-only [built-in variables](reference-dsl-variables.md#built-in-variables) named
`IRS`, `ORS`, `IFS`, `OFS`, `IPS`, and `OPS`. Unlike in AWK, you can't set these in begin-blocks --
their values indicate what you set at the command line -- so their use is limited.

GENMD_RUN_COMMAND
mlr --ifs , --ofs ';' --ips = --ops : --from data/a.dkvp put '$d = ">>>" . IFS . "|||" . OFS . "<<<"'
GENMD_EOF

## Which separators apply to which file formats

* CSV/TSV/ASV/USV/etc.:
    * Record separator is newline (Linux/BSDs/macOS) or carriage-return-newline (Windows); programmable in Miller 5 and below; support in the Miller 6 release is still a TODO.
    * If field separator is tab, we have TSV; see more examples (ASV, USV, etc.) in the [CSV section](file-formats.md#csvtsvasvusvetc).
    * No pair separator.
* JSON: ignores all separator flags from the command line.
* PPRINT
    * Record separator is newline (Linux/BSDs/macOS) or carriage-return-newline (Windows); programmable in Miller 5 and below; support in the Miller 6 release is still a TODO.
    * TODO: write up
    * TODO: write up
* Markdown tabular: ignores all separator flags from the command line.
* XTAB
    * TODO: write up
    * TODO: write up
    * TODO: write up
* DKVP: lets you specify record, field, and pair separators.
* NIDX
    * TODO: write up
    * TODO: write up
    * No pair separator.