mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 02:14:13 +00:00
New doc page: Parsing and formatting fields (#973)
parent
9350fed34d
commit
1eae19421b
9 changed files with 596 additions and 216 deletions
|
|
@ -59,11 +59,6 @@ nav:
|
|||
- "Two-pass algorithms": "two-pass-algorithms.md"
|
||||
- "Programming-language examples": "programming-examples.md"
|
||||
- "Miscellaneous examples": "misc-examples.md"
|
||||
- 'Background':
|
||||
- "Why?": "why.md"
|
||||
- "Why call it Miller?": "etymology.md"
|
||||
- "How original is Miller?": "originality.md"
|
||||
- "Performance": "performance.md"
|
||||
- 'Main reference':
|
||||
- "Miller command structure": "reference-main-overview.md"
|
||||
- "Then-chaining": "reference-main-then-chaining.md"
|
||||
|
|
@ -72,6 +67,7 @@ nav:
|
|||
- "In-place mode": "reference-main-in-place-processing.md"
|
||||
- "Number formatting": "reference-main-number-formatting.md"
|
||||
- "Separators": "reference-main-separators.md"
|
||||
- "Parsing and formatting fields": "parsing-and-formatting-fields.md"
|
||||
- "Flatten/unflatten: converting between JSON and tabular formats": "flatten-unflatten.md"
|
||||
- "Sorting": "sorting.md"
|
||||
- "Streaming processing, and memory usage": "streaming-and-memory.md"
|
||||
|
|
@ -103,6 +99,11 @@ nav:
|
|||
- "DSL errors and transparency": "reference-dsl-errors.md"
|
||||
- "Differences from other programming languages": "reference-dsl-differences.md"
|
||||
- "A note on the complexity of Miller's expression language": "reference-dsl-complexity.md"
|
||||
- 'Background':
|
||||
- "Why?": "why.md"
|
||||
- "Why call it Miller?": "etymology.md"
|
||||
- "How original is Miller?": "originality.md"
|
||||
- "Performance": "performance.md"
|
||||
- 'Misc. reference':
|
||||
- "Auxiliary commands": "reference-main-auxiliary-commands.md"
|
||||
- "Manual page": "manpage.md"
|
||||
|
|
|
|||
5
docs/src/data/sec2dhms.csv
Normal file
|
|
@ -0,0 +1,5 @@
|
|||
sec
|
||||
1
|
||||
100
|
||||
10000
|
||||
1000000
|
||||
|
3
docs/src/data/split1.csv
Normal file
|
|
@ -0,0 +1,3 @@
|
|||
name,nicknames,codes
|
||||
Alice,"Allie,Skater","1,3,5"
|
||||
Robert,"Bob,Bobby,Biker","2,4,6"
|
||||
|
5
docs/src/data/split2.csv
Normal file
|
|
@ -0,0 +1,5 @@
|
|||
stamp,event
|
||||
5-18:53:20,open
|
||||
5-18:53:22,close
|
||||
5-19:07:34,open
|
||||
5-19:07:56,close
|
||||
|
385
docs/src/parsing-and-formatting-fields.md
Normal file
|
|
@ -0,0 +1,385 @@
|
|||
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
|
||||
<div>
|
||||
<span class="quicklinks">
|
||||
Quick links:
|
||||
|
||||
<a class="quicklink" href="../reference-main-flag-list/index.html">Flags</a>
|
||||
|
||||
<a class="quicklink" href="../reference-verbs/index.html">Verbs</a>
|
||||
|
||||
<a class="quicklink" href="../reference-dsl-builtin-functions/index.html">Functions</a>
|
||||
|
||||
<a class="quicklink" href="../glossary/index.html">Glossary</a>
|
||||
|
||||
<a class="quicklink" href="../release-docs/index.html">Release docs</a>
|
||||
</span>
|
||||
</div>
|
||||
# Parsing and formatting fields
|
||||
|
||||
Miller offers several ways to split strings into pieces (parsing them), and to put things together
|
||||
into a string (formatting them).
|
||||
|
||||
## Splitting and joining with the same separator
|
||||
|
||||
A common pattern is a list of items joined by a single separator, e.g. a field with value
|
||||
`1;2;3;4` -- with a `;` between every pair of items. There are several useful
|
||||
[DSL](miller-programming-language.md) [functions](reference-dsl-builtin-functions.md) for splitting
|
||||
a string into pieces, and joining pieces into a string.
|
||||
|
||||
For example, suppose we have a CSV file like this:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>cat data/split1.csv</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
name,nicknames,codes
|
||||
Alice,"Allie,Skater","1,3,5"
|
||||
Robert,"Bob,Bobby,Biker","2,4,6"
|
||||
</pre>
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --icsv --ojson cat data/split1.csv</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
[
|
||||
{
|
||||
"name": "Alice",
|
||||
"nicknames": "Allie,Skater",
|
||||
"codes": "1,3,5"
|
||||
},
|
||||
{
|
||||
"name": "Robert",
|
||||
"nicknames": "Bob,Bobby,Biker",
|
||||
"codes": "2,4,6"
|
||||
}
|
||||
]
|
||||
</pre>
|
||||
|
||||
Then we can use the [`splita`](reference-dsl-builtin-functions.md#splita) function to split the
|
||||
`nicknames` string into an array of strings:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --icsv --ojson --from data/split1.csv put '$nicknames = splita($nicknames, ",")'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
[
|
||||
{
|
||||
"name": "Alice",
|
||||
"nicknames": ["Allie", "Skater"],
|
||||
"codes": "1,3,5"
|
||||
},
|
||||
{
|
||||
"name": "Robert",
|
||||
"nicknames": ["Bob", "Bobby", "Biker"],
|
||||
"codes": "2,4,6"
|
||||
}
|
||||
]
|
||||
</pre>
|
||||
|
||||
Likewise we can split the `codes` field. Since these look like numbers, we can again use `splita`
|
||||
which tries to type-infer ints and floats when it finds them -- or, we can use
|
||||
[splitax](reference-dsl-builtin-functions.md#splitax) to ask for the string to be split up into
|
||||
substrings, with no type inference:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --icsv --ojson --from data/split1.csv put '$codes = splita($codes, ",")'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
[
|
||||
{
|
||||
"name": "Alice",
|
||||
"nicknames": "Allie,Skater",
|
||||
"codes": [1, 3, 5]
|
||||
},
|
||||
{
|
||||
"name": "Robert",
|
||||
"nicknames": "Bob,Bobby,Biker",
|
||||
"codes": [2, 4, 6]
|
||||
}
|
||||
]
|
||||
</pre>
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
[
|
||||
{
|
||||
"name": "Alice",
|
||||
"nicknames": "Allie,Skater",
|
||||
"codes": ["1", "3", "5"]
|
||||
},
|
||||
{
|
||||
"name": "Robert",
|
||||
"nicknames": "Bob,Bobby,Biker",
|
||||
"codes": ["2", "4", "6"]
|
||||
}
|
||||
]
|
||||
</pre>
|
||||
|
||||
We can operate on the array's elements, then use [joinv](reference-dsl-builtin-functions.md#joinv) to put them
|
||||
back together:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --icsv --ojson --from data/split1.csv put '</b>
|
||||
<b> $codes = splita($codes, ","); # split into array of integers</b>
|
||||
<b> $codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers</b>
|
||||
<b> $codes = joinv($codes, ","); # join the updated array back into a string</b>
|
||||
<b>'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
[
|
||||
{
|
||||
"name": "Alice",
|
||||
"nicknames": "Allie,Skater",
|
||||
"codes": "100,300,500"
|
||||
},
|
||||
{
|
||||
"name": "Robert",
|
||||
"nicknames": "Bob,Bobby,Biker",
|
||||
"codes": "200,400,600"
|
||||
}
|
||||
]
|
||||
</pre>
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --csv --from data/split1.csv put '</b>
|
||||
<b> $codes = splita($codes, ","); # split into array of integers</b>
|
||||
<b> $codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers</b>
|
||||
<b> $codes = joinv($codes, ","); # join the updated array back into a string</b>
|
||||
<b>'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
name,nicknames,codes
|
||||
Alice,"Allie,Skater","100,300,500"
|
||||
Robert,"Bob,Bobby,Biker","200,400,600"
|
||||
</pre>
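The split, transform, and join steps above can be mirrored in a minimal Python sketch. This is not Miller's implementation: the helper names `infer`, `splita_sketch`, `splitax_sketch`, and `joinv_sketch` are made up for illustration, and the type inference is deliberately simplified (it only tries Python's own int and float parsing).

```python
def infer(piece):
    """Simplified stand-in for type inference: try int, then float, else keep the string."""
    for convert in (int, float):
        try:
            return convert(piece)
        except ValueError:
            pass
    return piece


def splita_sketch(s, sep):
    """Split into a list, inferring numbers (hypothetical analogue of splita)."""
    return [infer(p) for p in s.split(sep)]


def splitax_sketch(s, sep):
    """Split into a list of strings, with no inference (hypothetical analogue of splitax)."""
    return s.split(sep)


def joinv_sketch(values, sep):
    """Join values back into one string (hypothetical analogue of joinv)."""
    return sep.join(str(v) for v in values)


codes = splita_sketch("1,3,5", ",")   # split into a list of integers
codes = [e * 100 for e in codes]      # do math on each element
result = joinv_sketch(codes, ",")     # join back into a string
print(result)
```

Skipping inference, as `splitax_sketch` does, is the safer choice when the pieces are identifiers that merely look numeric.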
|
||||
|
||||
The full list of split functions includes
|
||||
[splita](reference-dsl-builtin-functions.md#splita),
|
||||
[splitax](reference-dsl-builtin-functions.md#splitax),
|
||||
[splitkv](reference-dsl-builtin-functions.md#splitkv),
|
||||
[splitkvx](reference-dsl-builtin-functions.md#splitkvx),
|
||||
[splitnv](reference-dsl-builtin-functions.md#splitnv), and
|
||||
[splitnx](reference-dsl-builtin-functions.md#splitnx). The flavors have to do with what the output is
|
||||
-- arrays or maps -- and whether or not type-inference is done.
|
||||
|
||||
The full list of join functions includes [joink](reference-dsl-builtin-functions.md#joink),
|
||||
[joinv](reference-dsl-builtin-functions.md#joinv), and
|
||||
[joinkv](reference-dsl-builtin-functions.md#joinkv). Here the flavors have to do with whether we put
|
||||
array/map keys, values, or both into the resulting string.
|
||||
|
||||
## Example: shortening hostnames
|
||||
|
||||
Suppose you want to just keep the first two components of the hostnames:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>cat data/hosts.csv</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
host,status
|
||||
xy01.east.acme.org,up
|
||||
ab02.west.acme.org,down
|
||||
ac91.west.acme.org,up
|
||||
</pre>
|
||||
|
||||
Using the [`splita`](reference-dsl-builtin-functions.md#splita) and
|
||||
[`joinv`](reference-dsl-builtin-functions.md#joinv) functions, along with
|
||||
[array slicing](reference-main-arrays.md#slicing), we get
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
host,status
|
||||
xy01.east,up
|
||||
ab02.west,down
|
||||
ac91.west,up
|
||||
</pre>
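Miller arrays are indexed from 1 and the slice `[1:2]` includes both endpoints, which is why the command above keeps exactly two components. For readers more used to 0-up, half-open slices, here is a small Python sketch of the same operation; `shorten_host` is a hypothetical helper name, not part of Miller.

```python
def shorten_host(host, keep=2):
    """Keep the first `keep` dot-separated components of a hostname."""
    parts = host.split(".")
    # Miller's 1-up, inclusive [1:keep] corresponds to Python's 0-up, half-open [0:keep].
    return ".".join(parts[0:keep])


print(shorten_host("xy01.east.acme.org"))
```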
|
||||
|
||||
## Flatten/unflatten: representing arrays in CSV
|
||||
|
||||
In the above examples, when we split a string field into an array, we used JSON output. That's
|
||||
because JSON permits nested data structures. For CSV output, Miller uses, by default, a
|
||||
_flatten/unflatten strategy_: array-valued fields are turned into multiple CSV columns. For example:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
[
|
||||
{
|
||||
"name": "Alice",
|
||||
"nicknames": "Allie,Skater",
|
||||
"codes": ["1", "3", "5"]
|
||||
},
|
||||
{
|
||||
"name": "Robert",
|
||||
"nicknames": "Bob,Bobby,Biker",
|
||||
"codes": ["2", "4", "6"]
|
||||
}
|
||||
]
|
||||
</pre>
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --csv --from data/split1.csv put '$codes = splitax($codes, ",")'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
name,nicknames,codes.1,codes.2,codes.3
|
||||
Alice,"Allie,Skater",1,3,5
|
||||
Robert,"Bob,Bobby,Biker",2,4,6
|
||||
</pre>
|
||||
|
||||
See the [flatten/unflatten: converting between JSON and tabular formats](flatten-unflatten.md) page
|
||||
for more on this default behavior, including how to override it when you prefer.
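The column naming seen in the CSV output above (`codes.1`, `codes.2`, `codes.3`) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `flatten_sketch` is a hypothetical name, it handles a single level of arrays, and it ignores maps, deeper nesting, and empty arrays, all of which Miller's real flattening also has to deal with.

```python
def flatten_sketch(record, sep="."):
    """Turn each array-valued field into numbered columns: key.1, key.2, ..."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            # Positions are numbered from 1, matching the CSV header shown above.
            for position, item in enumerate(value, start=1):
                flat[f"{key}{sep}{position}"] = item
        else:
            flat[key] = value
    return flat


row = flatten_sketch(
    {"name": "Alice", "nicknames": "Allie,Skater", "codes": ["1", "3", "5"]}
)
print(row)
```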
|
||||
|
||||
## Splitting and joining with different separators
|
||||
|
||||
The above is well and good when a string contains pieces with multiple instances of the same
|
||||
separator. However, sometimes we have input like `5-18:53:20`, where the separators differ. Here we can use the more flexible
|
||||
[unformat](reference-dsl-builtin-functions.md#unformat) and
|
||||
[format](reference-dsl-builtin-functions.md#format) DSL functions. (As above, there's an
|
||||
[unformatx](reference-dsl-builtin-functions.md#unformatx) function if you want Miller to just split
|
||||
the string into string pieces without trying to type-infer them.)
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>cat data/split2.csv</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
stamp,event
|
||||
5-18:53:20,open
|
||||
5-18:53:22,close
|
||||
5-19:07:34,open
|
||||
5-19:07:56,close
|
||||
</pre>
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --icsv --ojson --from data/split2.csv put '$pieces = unformat("{}-{}:{}:{}", $stamp)'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
[
|
||||
{
|
||||
"stamp": "5-18:53:20",
|
||||
"event": "open",
|
||||
"pieces": [5, 18, 53, 20]
|
||||
},
|
||||
{
|
||||
"stamp": "5-18:53:22",
|
||||
"event": "close",
|
||||
"pieces": [5, 18, 53, 22]
|
||||
},
|
||||
{
|
||||
"stamp": "5-19:07:34",
|
||||
"event": "open",
|
||||
"pieces": [5, 19, "07", 34]
|
||||
},
|
||||
{
|
||||
"stamp": "5-19:07:56",
|
||||
"event": "close",
|
||||
"pieces": [5, 19, "07", 56]
|
||||
}
|
||||
]
|
||||
</pre>
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --icsv --opprint --from data/split2.csv put '</b>
|
||||
<b> pieces = unformat("{}-{}:{}:{}", $stamp);</b>
|
||||
<b> $description = format("{} day(s) {} hour(s) {} minute(s) {} seconds(s)", pieces[1], pieces[2], pieces[3], pieces[4]);</b>
|
||||
<b>'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
stamp event description
|
||||
5-18:53:20 open 5 day(s) 18 hour(s) 53 minute(s) 20 seconds(s)
|
||||
5-18:53:22 close 5 day(s) 18 hour(s) 53 minute(s) 22 seconds(s)
|
||||
5-19:07:34 open 5 day(s) 19 hour(s) 07 minute(s) 34 seconds(s)
|
||||
5-19:07:56 close 5 day(s) 19 hour(s) 07 minute(s) 56 seconds(s)
|
||||
</pre>
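One way to think about the `{}` template used above is as a pattern whose literal parts must match exactly while each `{}` captures whatever lies between them. The Python sketch below implements that idea with a regular expression. It is an illustration only, not how Miller implements `unformat`: the name `unformatx_sketch` is hypothetical, the pieces are returned as strings (as with `unformatx`), and a non-match returns `None` here rather than a Miller error value.

```python
import re


def unformatx_sketch(template, text):
    """Extract the pieces of `text` that line up with each {} in `template`."""
    literals = template.split("{}")
    # Escape the literal separators and capture lazily between them.
    pattern = "(.*?)".join(re.escape(literal) for literal in literals)
    match = re.fullmatch(pattern, text)
    return list(match.groups()) if match else None


pieces = unformatx_sketch("{}-{}:{}:{}", "5-19:07:34")
# Python's str.format happens to use the same {} placeholder.
description = "{} day(s) {} hour(s) {} minute(s) {} seconds(s)".format(*pieces)
print(description)
```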
|
||||
|
||||
## Using regular expressions and capture groups
|
||||
|
||||
If you prefer [regular expressions](reference-main-regular-expressions.md), those can be used in this context as well:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --icsv --opprint --from data/split2.csv put '</b>
|
||||
<b> if ($stamp =~ "([0-9]+)-([0-9]+):([0-9]+):([0-9]+)") {</b>
|
||||
<b> $description = "\1 day(s) \2 hour(s) \3 minute(s) \4 seconds(s)";</b>
|
||||
<b> }</b>
|
||||
<b>'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
stamp event description
|
||||
5-18:53:20 open 5 day(s) 18 hour(s) 53 minute(s) 20 seconds(s)
|
||||
5-18:53:22 close 5 day(s) 18 hour(s) 53 minute(s) 22 seconds(s)
|
||||
5-19:07:34 open 5 day(s) 19 hour(s) 07 minute(s) 34 seconds(s)
|
||||
5-19:07:56 close 5 day(s) 19 hour(s) 07 minute(s) 56 seconds(s)
|
||||
</pre>
|
||||
|
||||
## Special case: timestamps
|
||||
|
||||
Timestamps are complex enough to merit their own handling: see the
|
||||
[DSL datetime/timezone functions page](reference-dsl-time.md), in particular the
|
||||
[strptime](reference-dsl-builtin-functions.md#strptime)
|
||||
and
|
||||
[strftime](reference-dsl-builtin-functions.md#strftime)
|
||||
functions.
|
||||
|
||||
## Special case: dhms and seconds
|
||||
|
||||
For historical reasons, Miller has a way to represent seconds in a more human-readable format, using days,
|
||||
hours, minutes, and seconds. For example:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --c2p --from data/sec2dhms.csv put '$dhms = sec2dhms($sec)'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
sec dhms
|
||||
1 1s
|
||||
100 1m40s
|
||||
10000 2h46m40s
|
||||
1000000 11d13h46m40s
|
||||
</pre>
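The arithmetic behind the table above is repeated division by 60, 60, and 24. The following Python sketch reproduces the four rows shown; `sec2dhms_sketch` is a hypothetical name, it assumes non-negative integer input, and the zero-padding of the inner components is an assumption that the examples above do not exercise.

```python
def sec2dhms_sketch(total_seconds):
    """Format a non-negative integer number of seconds as days/hours/minutes/seconds."""
    minutes, s = divmod(total_seconds, 60)
    hours, m = divmod(minutes, 60)
    d, h = divmod(hours, 24)
    # Leading zero-valued units are omitted; padding inner units to two digits is assumed.
    if d > 0:
        return f"{d}d{h:02d}h{m:02d}m{s:02d}s"
    if h > 0:
        return f"{h}h{m:02d}m{s:02d}s"
    if m > 0:
        return f"{m}m{s:02d}s"
    return f"{s}s"


for sec in (1, 100, 10000, 1000000):
    print(sec, sec2dhms_sketch(sec))
```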
|
||||
|
||||
Please see
|
||||
[sec2dhms](reference-dsl-builtin-functions.md#sec2dhms)
|
||||
and
|
||||
[dhms2sec](reference-dsl-builtin-functions.md#dhms2sec) for details.
|
||||
|
||||
## Special case: financial values
|
||||
|
||||
One way to handle currency values is to remove the currency marker (like `$`) and any commas, then convert the result to a number:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>echo 'd=$1234.56' | mlr put '$d = float(gsub(ssub($d, "$", ""), ",", ""))'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
d=1234.56
|
||||
</pre>
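In the command above, `ssub` does plain-string substitution, so the `$` is treated as a literal character rather than as a regex end-of-string anchor. The same cleanup looks like this in a Python sketch; `parse_currency` is a hypothetical helper, and it assumes a single known marker and commas used only as thousands separators.

```python
def parse_currency(text, marker="$"):
    """Strip the currency marker and thousands separators, then parse as a float."""
    # str.replace is literal (no regex), like ssub in the Miller example.
    cleaned = text.replace(marker, "").replace(",", "")
    return float(cleaned)


print(parse_currency("$1,234.56"))
```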
|
||||
|
||||
## Nesting and unnesting fields
|
||||
|
||||
Sometimes we want to split a string not into an array within one record, but into multiple records, one per piece.
|
||||
|
||||
For example:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --c2p cat data/split1.csv</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
name nicknames codes
|
||||
Alice Allie,Skater 1,3,5
|
||||
Robert Bob,Bobby,Biker 2,4,6
|
||||
</pre>
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --c2p nest --evar , -f nicknames data/split1.csv</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
name nicknames codes
|
||||
Alice Allie 1,3,5
|
||||
Alice Skater 1,3,5
|
||||
Robert Bob 2,4,6
|
||||
Robert Bobby 2,4,6
|
||||
Robert Biker 2,4,6
|
||||
</pre>
|
||||
|
||||
See [documentation on the nest verb](reference-verbs.md#nest) for general information on how to do this.
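The `--evar ,` flag used above is shorthand for exploding values across records with `,` as the nested-field separator: each input record yields one output record per piece, with all other fields copied unchanged. A minimal Python sketch of that behavior follows; `explode_values_across_records` is a hypothetical name, and records are modeled as plain dictionaries.

```python
def explode_values_across_records(records, field, sep=","):
    """Emit one copy of each record per separator-delimited piece of `field`."""
    out = []
    for record in records:
        for piece in record[field].split(sep):
            new_record = dict(record)   # copy so the input record is not modified
            new_record[field] = piece
            out.append(new_record)
    return out


rows = [
    {"name": "Alice", "nicknames": "Allie,Skater", "codes": "1,3,5"},
    {"name": "Robert", "nicknames": "Bob,Bobby,Biker", "codes": "2,4,6"},
]
exploded = explode_values_across_records(rows, "nicknames")
for row in exploded:
    print(row)
```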
|
||||
190
docs/src/parsing-and-formatting-fields.md.in
Normal file
|
|
@ -0,0 +1,190 @@
|
|||
# Parsing and formatting fields
|
||||
|
||||
Miller offers several ways to split strings into pieces (parsing them), and to put things together
|
||||
into a string (formatting them).
|
||||
|
||||
## Splitting and joining with the same separator
|
||||
|
||||
A common pattern is a list of items joined by a single separator, e.g. a field with value
|
||||
`1;2;3;4` -- with a `;` between every pair of items. There are several useful
|
||||
[DSL](miller-programming-language.md) [functions](reference-dsl-builtin-functions.md) for splitting
|
||||
a string into pieces, and joining pieces into a string.
|
||||
|
||||
For example, suppose we have a CSV file like this:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
cat data/split1.csv
|
||||
GENMD-EOF
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --icsv --ojson cat data/split1.csv
|
||||
GENMD-EOF
|
||||
|
||||
Then we can use the [`splita`](reference-dsl-builtin-functions.md#splita) function to split the
|
||||
`nicknames` string into an array of strings:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --icsv --ojson --from data/split1.csv put '$nicknames = splita($nicknames, ",")'
|
||||
GENMD-EOF
|
||||
|
||||
Likewise we can split the `codes` field. Since these look like numbers, we can again use `splita`
|
||||
which tries to type-infer ints and floats when it finds them -- or, we can use
|
||||
[splitax](reference-dsl-builtin-functions.md#splitax) to ask for the string to be split up into
|
||||
substrings, with no type inference:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --icsv --ojson --from data/split1.csv put '$codes = splita($codes, ",")'
|
||||
GENMD-EOF
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'
|
||||
GENMD-EOF
|
||||
|
||||
We can operate on the array's elements, then use [joinv](reference-dsl-builtin-functions.md#joinv) to put them
|
||||
back together:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --icsv --ojson --from data/split1.csv put '
|
||||
$codes = splita($codes, ","); # split into array of integers
|
||||
$codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers
|
||||
$codes = joinv($codes, ","); # join the updated array back into a string
|
||||
'
|
||||
GENMD-EOF
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --csv --from data/split1.csv put '
|
||||
$codes = splita($codes, ","); # split into array of integers
|
||||
$codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers
|
||||
$codes = joinv($codes, ","); # join the updated array back into a string
|
||||
'
|
||||
GENMD-EOF
|
||||
|
||||
The full list of split functions includes
|
||||
[splita](reference-dsl-builtin-functions.md#splita),
|
||||
[splitax](reference-dsl-builtin-functions.md#splitax),
|
||||
[splitkv](reference-dsl-builtin-functions.md#splitkv),
|
||||
[splitkvx](reference-dsl-builtin-functions.md#splitkvx),
|
||||
[splitnv](reference-dsl-builtin-functions.md#splitnv), and
|
||||
[splitnx](reference-dsl-builtin-functions.md#splitnx). The flavors have to do with what the output is
|
||||
-- arrays or maps -- and whether or not type-inference is done.
|
||||
|
||||
The full list of join functions includes [joink](reference-dsl-builtin-functions.md#joink),
|
||||
[joinv](reference-dsl-builtin-functions.md#joinv), and
|
||||
[joinkv](reference-dsl-builtin-functions.md#joinkv). Here the flavors have to do with whether we put
|
||||
array/map keys, values, or both into the resulting string.
|
||||
|
||||
## Example: shortening hostnames
|
||||
|
||||
Suppose you want to just keep the first two components of the hostnames:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
cat data/hosts.csv
|
||||
GENMD-EOF
|
||||
|
||||
Using the [`splita`](reference-dsl-builtin-functions.md#splita) and
|
||||
[`joinv`](reference-dsl-builtin-functions.md#joinv) functions, along with
|
||||
[array slicing](reference-main-arrays.md#slicing), we get
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'
|
||||
GENMD-EOF
|
||||
|
||||
## Flatten/unflatten: representing arrays in CSV
|
||||
|
||||
In the above examples, when we split a string field into an array, we used JSON output. That's
|
||||
because JSON permits nested data structures. For CSV output, Miller uses, by default, a
|
||||
_flatten/unflatten strategy_: array-valued fields are turned into multiple CSV columns. For example:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'
|
||||
GENMD-EOF
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --csv --from data/split1.csv put '$codes = splitax($codes, ",")'
|
||||
GENMD-EOF
|
||||
|
||||
See the [flatten/unflatten: converting between JSON and tabular formats](flatten-unflatten.md) page
|
||||
for more on this default behavior, including how to override it when you prefer.
|
||||
|
||||
## Splitting and joining with different separators
|
||||
|
||||
The above is well and good when a string contains pieces with multiple instances of the same
|
||||
separator. However, sometimes we have input like `5-18:53:20`, where the separators differ. Here we can use the more flexible
|
||||
[unformat](reference-dsl-builtin-functions.md#unformat) and
|
||||
[format](reference-dsl-builtin-functions.md#format) DSL functions. (As above, there's an
|
||||
[unformatx](reference-dsl-builtin-functions.md#unformatx) function if you want Miller to just split
|
||||
the string into string pieces without trying to type-infer them.)
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
cat data/split2.csv
|
||||
GENMD-EOF
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --icsv --ojson --from data/split2.csv put '$pieces = unformat("{}-{}:{}:{}", $stamp)'
|
||||
GENMD-EOF
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --icsv --opprint --from data/split2.csv put '
|
||||
pieces = unformat("{}-{}:{}:{}", $stamp);
|
||||
$description = format("{} day(s) {} hour(s) {} minute(s) {} seconds(s)", pieces[1], pieces[2], pieces[3], pieces[4]);
|
||||
'
|
||||
GENMD-EOF
|
||||
|
||||
## Using regular expressions and capture groups
|
||||
|
||||
If you prefer [regular expressions](reference-main-regular-expressions.md), those can be used in this context as well:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --icsv --opprint --from data/split2.csv put '
|
||||
if ($stamp =~ "([0-9]+)-([0-9]+):([0-9]+):([0-9]+)") {
|
||||
$description = "\1 day(s) \2 hour(s) \3 minute(s) \4 seconds(s)";
|
||||
}
|
||||
'
|
||||
GENMD-EOF
|
||||
|
||||
## Special case: timestamps
|
||||
|
||||
Timestamps are complex enough to merit their own handling: see the
|
||||
[DSL datetime/timezone functions page](reference-dsl-time.md), in particular the
|
||||
[strptime](reference-dsl-builtin-functions.md#strptime)
|
||||
and
|
||||
[strftime](reference-dsl-builtin-functions.md#strftime)
|
||||
functions.
|
||||
|
||||
## Special case: dhms and seconds
|
||||
|
||||
For historical reasons, Miller has a way to represent seconds in a more human-readable format, using days,
|
||||
hours, minutes, and seconds. For example:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --c2p --from data/sec2dhms.csv put '$dhms = sec2dhms($sec)'
|
||||
GENMD-EOF
|
||||
|
||||
Please see
|
||||
[sec2dhms](reference-dsl-builtin-functions.md#sec2dhms)
|
||||
and
|
||||
[dhms2sec](reference-dsl-builtin-functions.md#dhms2sec) for details.
|
||||
|
||||
## Special case: financial values
|
||||
|
||||
One way to handle currency values is to remove the currency marker (like `$`) and any commas, then convert the result to a number:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
echo 'd=$1234.56' | mlr put '$d = float(gsub(ssub($d, "$", ""), ",", ""))'
|
||||
GENMD-EOF
|
||||
|
||||
## Nesting and unnesting fields
|
||||
|
||||
Sometimes we want to split a string not into an array within one record, but into multiple records, one per piece.
|
||||
|
||||
For example:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --c2p cat data/split1.csv
|
||||
GENMD-EOF
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --c2p nest --evar , -f nicknames data/split1.csv
|
||||
GENMD-EOF
|
||||
|
||||
See [documentation on the nest verb](reference-verbs.md#nest) for general information on how to do this.
|
||||
|
|
@ -302,133 +302,6 @@ yellow,circle,true,9,87,63.5058,8.3350,3
|
|||
|
||||
The difference is a matter of taste (although `mlr cat -n` puts the counter first).
|
||||
|
||||
## Splitting a string and taking a few of the components
|
||||
|
||||
Suppose you want to just keep the first two components of the hostnames:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>cat data/hosts.csv</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
host,status
|
||||
xy01.east.acme.org,up
|
||||
ab02.west.acme.org,down
|
||||
ac91.west.acme.org,up
|
||||
</pre>
|
||||
|
||||
Using the [`splita`](reference-dsl-builtin-functions.md#splita) and
|
||||
[`joinv`](reference-dsl-builtin-functions.md#joinv) functions, along with
|
||||
[array slicing](reference-main-arrays.md#slicing), we get
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
host,status
|
||||
xy01.east,up
|
||||
ab02.west,down
|
||||
ac91.west,up
|
||||
</pre>
|
||||
|
||||
## Splitting nested fields
|
||||
|
||||
Suppose you have a TSV file like this:
|
||||
|
||||
<pre class="pre-non-highlight-non-pair">
|
||||
a b
|
||||
x z
|
||||
s u:v:w
|
||||
</pre>
|
||||
|
||||
The simplest option is to use [nest](reference-verbs.md#nest):
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --tsv nest --explode --values --across-records -f b --nested-fs : data/nested.tsv</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
a b
|
||||
x z
|
||||
s u
|
||||
s v
|
||||
s w
|
||||
</pre>
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --tsv nest --explode --values --across-fields -f b --nested-fs : data/nested.tsv</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
a b_1
|
||||
x z
|
||||
|
||||
a b_1 b_2 b_3
|
||||
s u v w
|
||||
</pre>
|
||||
|
||||
While `mlr nest` is simplest, let's also take a look at a few ways to do this using the `put` DSL.
|
||||
|
||||
One option to split out the colon-delimited values in the `b` column is to use `splitnv` to create an integer-indexed map and loop over it, adding new fields to the current record:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --from data/nested.tsv --itsv --oxtab put '</b>
|
||||
<b> o = splitnv($b, ":");</b>
|
||||
<b> for (k,v in o) {</b>
|
||||
<b> $["p".k]=v</b>
|
||||
<b> }</b>
|
||||
<b>'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
a x
|
||||
b z
|
||||
p1 z
|
||||
|
||||
a s
|
||||
b u:v:w
|
||||
p1 u
|
||||
p2 v
|
||||
p3 w
|
||||
</pre>
|
||||
|
||||
while another is to loop over the same map from `splitnv` and use it (with `put -q` to suppress printing the original record) to produce multiple records:
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --from data/nested.tsv --itsv --oxtab put -q '</b>
|
||||
<b> o = splitnv($b, ":");</b>
|
||||
<b> for (k,v in o) {</b>
|
||||
<b> x = mapsum($*, {"b":v});</b>
|
||||
<b> emit x</b>
|
||||
<b> }</b>
|
||||
<b>'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
a x
|
||||
b z
|
||||
|
||||
a s
|
||||
b u
|
||||
|
||||
a s
|
||||
b v
|
||||
|
||||
a s
|
||||
b w
|
||||
</pre>
|
||||
|
||||
<pre class="pre-highlight-in-pair">
|
||||
<b>mlr --from data/nested.tsv --tsv put -q '</b>
|
||||
<b> o = splitnv($b, ":");</b>
|
||||
<b> for (k,v in o) {</b>
|
||||
<b> x = mapsum($*, {"b":v}); emit x</b>
|
||||
<b> }</b>
|
||||
<b>'</b>
|
||||
</pre>
|
||||
<pre class="pre-non-highlight-in-pair">
|
||||
a b
|
||||
x z
|
||||
s u
|
||||
s v
|
||||
s w
|
||||
</pre>
|
||||
|
||||
## Options for dealing with duplicate rows
|
||||
|
||||
If your data has records appearing multiple times, you can use [uniq](reference-verbs.md#uniq) to show and/or count the unique records.
|
||||
|
|
|
|||
|
|
@ -168,72 +168,6 @@ GENMD-EOF
|
|||
|
||||
The difference is a matter of taste (although `mlr cat -n` puts the counter first).
|
||||
|
||||
## Splitting a string and taking a few of the components
|
||||
|
||||
Suppose you want to just keep the first two components of the hostnames:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
cat data/hosts.csv
|
||||
GENMD-EOF
|
||||
|
||||
Using the [`splita`](reference-dsl-builtin-functions.md#splita) and
|
||||
[`joinv`](reference-dsl-builtin-functions.md#joinv) functions, along with
|
||||
[array slicing](reference-main-arrays.md#slicing), we get
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'
|
||||
GENMD-EOF
|
||||
|
||||
## Splitting nested fields
|
||||
|
||||
Suppose you have a TSV file like this:
|
||||
|
||||
GENMD-INCLUDE-ESCAPED(data/nested.tsv)
|
||||
|
||||
The simplest option is to use [nest](reference-verbs.md#nest):
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --tsv nest --explode --values --across-records -f b --nested-fs : data/nested.tsv
|
||||
GENMD-EOF
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --tsv nest --explode --values --across-fields -f b --nested-fs : data/nested.tsv
|
||||
GENMD-EOF
|
||||
|
||||
While `mlr nest` is simplest, let's also take a look at a few ways to do this using the `put` DSL.
|
||||
|
||||
One option to split out the colon-delimited values in the `b` column is to use `splitnv` to create an integer-indexed map and loop over it, adding new fields to the current record:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --from data/nested.tsv --itsv --oxtab put '
|
||||
o = splitnv($b, ":");
|
||||
for (k,v in o) {
|
||||
$["p".k]=v
|
||||
}
|
||||
'
|
||||
GENMD-EOF
|
||||
|
||||
while another is to loop over the same map from `splitnv` and use it (with `put -q` to suppress printing the original record) to produce multiple records:
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --from data/nested.tsv --itsv --oxtab put -q '
|
||||
o = splitnv($b, ":");
|
||||
for (k,v in o) {
|
||||
x = mapsum($*, {"b":v});
|
||||
emit x
|
||||
}
|
||||
'
|
||||
GENMD-EOF
|
||||
|
||||
GENMD-RUN-COMMAND
|
||||
mlr --from data/nested.tsv --tsv put -q '
|
||||
o = splitnv($b, ":");
|
||||
for (k,v in o) {
|
||||
x = mapsum($*, {"b":v}); emit x
|
||||
}
|
||||
'
|
||||
GENMD-EOF
|
||||
|
||||
## Options for dealing with duplicate rows
|
||||
|
||||
If your data has records appearing multiple times, you can use [uniq](reference-verbs.md#uniq) to show and/or count the unique records.
|
||||
|
|
|
|||
20
todo.txt
|
|
@ -2,8 +2,8 @@
|
|||
RELEASES
|
||||
* plan 6.1.0
|
||||
o unsparsify -f CSV by default -- ? into CSV record-writer -- ? caveat that record 1 controls all ...
|
||||
o fmt/unfmt/regex doc
|
||||
o FAQ/examples reorg
|
||||
- \d etc to DSL :( -- & parsing-and-formatting-fields.md.in
|
||||
o mlr split -- needs an example page along with the tee DSL function
|
||||
|
||||
o https://github.com/johnkerl/miller/issues?q=is%3Aissue+is%3Aopen+label%3Aneeds-documentation
|
||||
|
||||
|
|
@ -121,22 +121,6 @@ strict-mode ideas
|
|||
* srec:
|
||||
o abend unless $?x -- ?
|
||||
|
||||
----------------------------------------------------------------
|
||||
mlr join --left-fields a,b,c
|
||||
|
||||
----------------------------------------------------------------
|
||||
! Better functions for values manipulation, e.g. easier conversion of strings like "$1,234.56" into numeric values
|
||||
o note on is_error(x) (or string(x) == "(error)")
|
||||
? dhms w/ optional separgs -- ? what about fenceposting? ["d","h","m","s"] vs ["-",":",":",""] -- ?
|
||||
o 'Ability to specify some formats that are fixed. Like we can process
|
||||
"5d18h53m20s" format in *dhms* commands, but what about "5-18:53:20"? This is
|
||||
a common format used by the SLURM resource manager.'
|
||||
o linked-to faqent w/ -f -s etc ...
|
||||
|
||||
----------------------------------------------------------------
|
||||
k better print-interpolate with {} etc
|
||||
doc: mlr --csv --from example.csv put -q 'print format("Index {} at quantity {} and rate {}", $index, $quantity, $rate)'
|
||||
|
||||
----------------------------------------------------------------
|
||||
! sysdate, sysdate_local; datediff ...
|
||||
|
||||
|
|
|
|||