New doc page: Parsing and formatting fields (#973)

John Kerl 2022-03-06 23:28:16 -05:00 committed by GitHub
parent 9350fed34d
commit 1eae19421b
9 changed files with 596 additions and 216 deletions

@ -59,11 +59,6 @@ nav:
- "Two-pass algorithms": "two-pass-algorithms.md"
- "Programming-language examples": "programming-examples.md"
- "Miscellaneous examples": "misc-examples.md"
- 'Background':
- "Why?": "why.md"
- "Why call it Miller?": "etymology.md"
- "How original is Miller?": "originality.md"
- "Performance": "performance.md"
- 'Main reference':
- "Miller command structure": "reference-main-overview.md"
- "Then-chaining": "reference-main-then-chaining.md"
@ -72,6 +67,7 @@ nav:
- "In-place mode": "reference-main-in-place-processing.md"
- "Number formatting": "reference-main-number-formatting.md"
- "Separators": "reference-main-separators.md"
- "Parsing and formatting fields": "parsing-and-formatting-fields.md"
- "Flatten/unflatten: converting between JSON and tabular formats": "flatten-unflatten.md"
- "Sorting": "sorting.md"
- "Streaming processing, and memory usage": "streaming-and-memory.md"
@ -103,6 +99,11 @@ nav:
- "DSL errors and transparency": "reference-dsl-errors.md"
- "Differences from other programming languages": "reference-dsl-differences.md"
- "A note on the complexity of Miller's expression language": "reference-dsl-complexity.md"
- 'Background':
- "Why?": "why.md"
- "Why call it Miller?": "etymology.md"
- "How original is Miller?": "originality.md"
- "Performance": "performance.md"
- 'Misc. reference':
- "Auxiliary commands": "reference-main-auxiliary-commands.md"
- "Manual page": "manpage.md"

@ -0,0 +1,5 @@
sec
1
100
10000
1000000

docs/src/data/split1.csv Normal file
@ -0,0 +1,3 @@
name,nicknames,codes
Alice,"Allie,Skater","1,3,5"
Robert,"Bob,Bobby,Biker","2,4,6"

docs/src/data/split2.csv Normal file
@ -0,0 +1,5 @@
stamp,event
5-18:53:20,open
5-18:53:22,close
5-19:07:34,open
5-19:07:56,close

@ -0,0 +1,385 @@
<!--- PLEASE DO NOT EDIT DIRECTLY. EDIT THE .md.in FILE PLEASE. --->
<div>
<span class="quicklinks">
Quick links:
&nbsp;
<a class="quicklink" href="../reference-main-flag-list/index.html">Flags</a>
&nbsp;
<a class="quicklink" href="../reference-verbs/index.html">Verbs</a>
&nbsp;
<a class="quicklink" href="../reference-dsl-builtin-functions/index.html">Functions</a>
&nbsp;
<a class="quicklink" href="../glossary/index.html">Glossary</a>
&nbsp;
<a class="quicklink" href="../release-docs/index.html">Release docs</a>
</span>
</div>
# Parsing and formatting fields
Miller offers several ways to split strings into pieces (parsing them), and to put things together
into a string (formatting them).
## Splitting and joining with the same separator
One pattern we often have is items separated by the same separator, e.g. a field with value
`1;2;3;4` -- with a `;` between every pair of items. There are several useful
[DSL](miller-programming-language.md) [functions](reference-dsl-builtin-functions.md) for splitting
a string into pieces, and joining pieces into a string.
For example, suppose we have a CSV file like this:
<pre class="pre-highlight-in-pair">
<b>cat data/split1.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
name,nicknames,codes
Alice,"Allie,Skater","1,3,5"
Robert,"Bob,Bobby,Biker","2,4,6"
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --ojson cat data/split1.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
{
"name": "Alice",
"nicknames": "Allie,Skater",
"codes": "1,3,5"
},
{
"name": "Robert",
"nicknames": "Bob,Bobby,Biker",
"codes": "2,4,6"
}
]
</pre>
Then we can use the [`splita`](reference-dsl-builtin-functions.md#splita) function to split the
`nicknames` string into an array of strings:
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --ojson --from data/split1.csv put '$nicknames = splita($nicknames, ",")'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
{
"name": "Alice",
"nicknames": ["Allie", "Skater"],
"codes": "1,3,5"
},
{
"name": "Robert",
"nicknames": ["Bob", "Bobby", "Biker"],
"codes": "2,4,6"
}
]
</pre>
Likewise we can split the `codes` field. Since these look like numbers, we can again use `splita`,
which tries to type-infer ints and floats when it finds them -- or we can use
[splitax](reference-dsl-builtin-functions.md#splitax) to ask for the string to be split up into
substrings, with no type inference:
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --ojson --from data/split1.csv put '$codes = splita($codes, ",")'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
{
"name": "Alice",
"nicknames": "Allie,Skater",
"codes": [1, 3, 5]
},
{
"name": "Robert",
"nicknames": "Bob,Bobby,Biker",
"codes": [2, 4, 6]
}
]
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
{
"name": "Alice",
"nicknames": "Allie,Skater",
"codes": ["1", "3", "5"]
},
{
"name": "Robert",
"nicknames": "Bob,Bobby,Biker",
"codes": ["2", "4", "6"]
}
]
</pre>
We can do operations on the array, then use [joinv](reference-dsl-builtin-functions.md#joinv) to put the
elements back together into a string:
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --ojson --from data/split1.csv put '</b>
<b> $codes = splita($codes, ","); # split into array of integers</b>
<b> $codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers</b>
<b> $codes = joinv($codes, ","); # join the updated array back into a string</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
{
"name": "Alice",
"nicknames": "Allie,Skater",
"codes": "100,300,500"
},
{
"name": "Robert",
"nicknames": "Bob,Bobby,Biker",
"codes": "200,400,600"
}
]
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --csv --from data/split1.csv put '</b>
<b> $codes = splita($codes, ","); # split into array of integers</b>
<b> $codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers</b>
<b> $codes = joinv($codes, ","); # join the updated array back into a string</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
name,nicknames,codes
Alice,"Allie,Skater","100,300,500"
Robert,"Bob,Bobby,Biker","200,400,600"
</pre>
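For readers who know Python better than the Miller DSL, the split/apply/join pipeline above is roughly analogous to the following sketch. This is an illustration only, not how Miller is implemented: the helper name `scale_codes` is made up here, and `int()` stands in for Miller's type inference, which is broader (it also recognizes floats, for example).

```python
def scale_codes(codes: str, factor: int = 100, sep: str = ",") -> str:
    """Split a delimited string, multiply each integer piece, and rejoin."""
    # Roughly what splita does for integer-looking pieces: "1,3,5" -> [1, 3, 5]
    pieces = [int(p) for p in codes.split(sep)]
    # Roughly what apply does with a function literal
    scaled = [p * factor for p in pieces]
    # Roughly what joinv does: values back into one string
    return sep.join(str(p) for p in scaled)


print(scale_codes("1,3,5"))  # prints 100,300,500
```

The three statements map one-to-one onto the three DSL statements in the `put` expression, which is why the DSL version reads naturally as a pipeline.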
The full list of split functions includes
[splita](reference-dsl-builtin-functions.md#splita),
[splitax](reference-dsl-builtin-functions.md#splitax),
[splitkv](reference-dsl-builtin-functions.md#splitkv),
[splitkvx](reference-dsl-builtin-functions.md#splitkvx),
[splitnv](reference-dsl-builtin-functions.md#splitnv), and
[splitnvx](reference-dsl-builtin-functions.md#splitnvx). The flavors have to do with what the output is
-- arrays or maps -- and whether or not type inference is done.
The full list of join functions includes [joink](reference-dsl-builtin-functions.md#joink),
[joinv](reference-dsl-builtin-functions.md#joinv), and
[joinkv](reference-dsl-builtin-functions.md#joinkv). Here the flavors have to do with whether we put
array/map keys, values, or both into the resulting string.
## Example: shortening hostnames
Suppose you want to just keep the first two components of the hostnames:
<pre class="pre-highlight-in-pair">
<b>cat data/hosts.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
host,status
xy01.east.acme.org,up
ab02.west.acme.org,down
ac91.west.acme.org,up
</pre>
Using the [`splita`](reference-dsl-builtin-functions.md#splita) and
[`joinv`](reference-dsl-builtin-functions.md#joinv) functions, along with
[array slicing](reference-main-arrays.md#slicing), we get
<pre class="pre-highlight-in-pair">
<b>mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
host,status
xy01.east,up
ab02.west,down
ac91.west,up
</pre>
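One detail worth spelling out is the slice `[1:2]`: Miller arrays are indexed starting at 1, and slices include both endpoints, so `[1:2]` selects the first two elements. A rough Python analogue (the helper name `shorten_host` is hypothetical) uses the half-open, zero-based slice `[0:2]` to select the same items:

```python
def shorten_host(host: str, keep: int = 2, sep: str = ".") -> str:
    """Keep only the first `keep` separator-delimited components."""
    parts = host.split(sep)
    # Python's parts[0:2] picks the same two items as Miller's [1:2]
    return sep.join(parts[:keep])


print(shorten_host("xy01.east.acme.org"))  # prints xy01.east
```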
## Flatten/unflatten: representing arrays in CSV
In the above examples, when we split a string field into an array, we used JSON output. That's
because JSON permits nested data structures. For CSV output, Miller uses, by default, a
_flatten/unflatten strategy_: array-valued fields are turned into multiple CSV columns. For example:
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
{
"name": "Alice",
"nicknames": "Allie,Skater",
"codes": ["1", "3", "5"]
},
{
"name": "Robert",
"nicknames": "Bob,Bobby,Biker",
"codes": ["2", "4", "6"]
}
]
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --csv --from data/split1.csv put '$codes = splitax($codes, ",")'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
name,nicknames,codes.1,codes.2,codes.3
Alice,"Allie,Skater",1,3,5
Robert,"Bob,Bobby,Biker",2,4,6
</pre>
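The key naming in the CSV header above (`codes.1`, `codes.2`, `codes.3`) can be mimicked with a small Python sketch. It is a simplified illustration with a made-up helper name, `flatten_record`; it handles only one level of array-valued fields, and it assumes the `.` separator and 1-up positions shown in the output above.

```python
def flatten_record(record: dict, sep: str = ".") -> dict:
    """Turn array-valued fields into one key per element: codes -> codes.1, codes.2, ..."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            # Positions are 1-up, matching the column names in the CSV output
            for position, item in enumerate(value, start=1):
                flat[f"{key}{sep}{position}"] = item
        else:
            flat[key] = value
    return flat


row = {"name": "Alice", "nicknames": "Allie,Skater", "codes": ["1", "3", "5"]}
print(flatten_record(row))
```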
See the [Flatten/unflatten: converting between JSON and tabular formats](flatten-unflatten.md) page
for more on this default behavior, including how to override it when you prefer.
## Splitting and joining with different separators
The above is well and good when a string contains pieces with multiple instances of the same
separator. However, sometimes we have input like `5-18:53:20`. Here we can use the more flexible
[unformat](reference-dsl-builtin-functions.md#unformat) and
[format](reference-dsl-builtin-functions.md#format) DSL functions. (As above, there's an
[unformatx](reference-dsl-builtin-functions.md#unformatx) function if you want Miller to just split
the string into string pieces without trying to type-infer them.)
<pre class="pre-highlight-in-pair">
<b>cat data/split2.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
stamp,event
5-18:53:20,open
5-18:53:22,close
5-19:07:34,open
5-19:07:56,close
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --ojson --from data/split2.csv put '$pieces = unformat("{}-{}:{}:{}", $stamp)'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
[
{
"stamp": "5-18:53:20",
"event": "open",
"pieces": [5, 18, 53, 20]
},
{
"stamp": "5-18:53:22",
"event": "close",
"pieces": [5, 18, 53, 22]
},
{
"stamp": "5-19:07:34",
"event": "open",
"pieces": [5, 19, "07", 34]
},
{
"stamp": "5-19:07:56",
"event": "close",
"pieces": [5, 19, "07", 56]
}
]
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --opprint --from data/split2.csv put '</b>
<b> pieces = unformat("{}-{}:{}:{}", $stamp);</b>
<b> $description = format("{} day(s) {} hour(s) {} minute(s) {} second(s)", pieces[1], pieces[2], pieces[3], pieces[4]);</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
stamp event description
5-18:53:20 open 5 day(s) 18 hour(s) 53 minute(s) 20 second(s)
5-18:53:22 close 5 day(s) 18 hour(s) 53 minute(s) 22 second(s)
5-19:07:34 open 5 day(s) 19 hour(s) 07 minute(s) 34 second(s)
5-19:07:56 close 5 day(s) 19 hour(s) 07 minute(s) 56 second(s)
</pre>
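Conceptually, the template-based splitting above treats each `{}` as a placeholder and everything else as literal text. The Python sketch below shows one way to get that effect; it is an assumption-laden illustration, not Miller's implementation. The names `unformatx_sketch` and `format_sketch` are made up, the sketch returns strings only (like `unformatx`, with no type inference), and returning `None` on a mismatch or raising on a count mismatch are choices of this sketch rather than documented Miller behavior.

```python
import re
from typing import List, Optional


def unformatx_sketch(template: str, text: str) -> Optional[List[str]]:
    """Extract the pieces of `text` that line up with each {} in `template`."""
    literals = template.split("{}")
    # Escape the literal text and put a non-greedy capture group at each {}
    pattern = "(.*?)".join(re.escape(lit) for lit in literals)
    match = re.fullmatch(pattern, text)
    return None if match is None else list(match.groups())


def format_sketch(template: str, *args) -> str:
    """Fill each {} in `template` with the corresponding argument, in order."""
    literals = template.split("{}")
    if len(args) != len(literals) - 1:
        raise ValueError("number of arguments must match number of {} placeholders")
    out = [literals[0]]
    for arg, literal in zip(args, literals[1:]):
        out.append(str(arg))
        out.append(literal)
    return "".join(out)


pieces = unformatx_sketch("{}-{}:{}:{}", "5-18:53:20")
print(pieces)  # prints ['5', '18', '53', '20']
print(format_sketch("{} day(s) {} hour(s)", pieces[0], pieces[1]))  # prints 5 day(s) 18 hour(s)
```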
## Using regular expressions and capture groups
If you prefer [regular expressions](reference-main-regular-expressions.md), those can be used in this context as well:
<pre class="pre-highlight-in-pair">
<b>mlr --icsv --opprint --from data/split2.csv put '</b>
<b> if ($stamp =~ "([0-9]+)-([0-9]+):([0-9]+):([0-9]+)") {</b>
<b> $description = "\1 day(s) \2 hour(s) \3 minute(s) \4 second(s)";</b>
<b> }</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
stamp event description
5-18:53:20 open 5 day(s) 18 hour(s) 53 minute(s) 20 second(s)
5-18:53:22 close 5 day(s) 18 hour(s) 53 minute(s) 22 second(s)
5-19:07:34 open 5 day(s) 19 hour(s) 07 minute(s) 34 second(s)
5-19:07:56 close 5 day(s) 19 hour(s) 07 minute(s) 56 second(s)
</pre>
## Special case: timestamps
Timestamps are complex enough to merit their own handling: see the
[DSL datetime/timezone functions page](reference-dsl-time.md), in particular the
[strptime](reference-dsl-builtin-functions.md#strptime)
and
[strftime](reference-dsl-builtin-functions.md#strftime)
functions.
## Special case: dhms and seconds
For historical reasons, Miller has a way to represent seconds in a more human-readable format, using days,
hours, minutes, and seconds. For example:
<pre class="pre-highlight-in-pair">
<b>mlr --c2p --from data/sec2dhms.csv put '$dhms = sec2dhms($sec)'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
sec dhms
1 1s
100 1m40s
10000 2h46m40s
1000000 11d13h46m40s
</pre>
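The arithmetic behind this conversion is repeated division by 86400, 3600, and 60. The following Python sketch reproduces the four outputs shown above; the function name `sec2dhms_sketch` is made up, it handles non-negative integers only, and the zero-padding of the trailing fields is an assumption of this sketch that the sample data above does not exercise.

```python
def sec2dhms_sketch(sec: int) -> str:
    """Render a non-negative number of seconds as days/hours/minutes/seconds."""
    days, rem = divmod(sec, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    # Leading zero-valued units are dropped; later units are zero-padded (assumed)
    if days > 0:
        return f"{days}d{hours:02d}h{minutes:02d}m{seconds:02d}s"
    if hours > 0:
        return f"{hours}h{minutes:02d}m{seconds:02d}s"
    if minutes > 0:
        return f"{minutes}m{seconds:02d}s"
    return f"{seconds}s"


for value in (1, 100, 10000, 1000000):
    print(value, sec2dhms_sketch(value))
```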
Please see the
[sec2dhms](reference-dsl-builtin-functions.md#sec2dhms)
and
[dhms2sec](reference-dsl-builtin-functions.md#dhms2sec)
functions for details.
## Special case: financial values
One way to handle currencies is to sub out the currency marker (like `$`) as well as commas:
<pre class="pre-highlight-in-pair">
<b>echo 'd=$1234.56' | mlr put '$d = float(gsub(ssub($d, "$", ""), ",", ""))'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
d=1234.56
</pre>
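The same two-step cleanup (drop the currency marker, drop the thousands separators, then convert) looks like this in Python. The helper name `parse_money` is hypothetical, and the sketch assumes a US-style format with `,` for thousands and `.` for decimals.

```python
def parse_money(text: str, marker: str = "$") -> float:
    """Strip a currency marker and thousands separators, then convert to float."""
    # Plain (non-regex) replacement, like ssub, so "$" needs no escaping
    without_marker = text.replace(marker, "")
    without_commas = without_marker.replace(",", "")
    return float(without_commas)


print(parse_money("$1,234.56"))  # prints 1234.56
```

Using a literal (non-regex) substitution for the marker matters because `$` is a regex metacharacter, which is why the Miller example uses `ssub` for that step.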
## Nesting and unnesting fields
Sometimes we want to split a string not into an array within one record, but into multiple records,
one per piece. For example:
<pre class="pre-highlight-in-pair">
<b>mlr --c2p cat data/split1.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
name nicknames codes
Alice Allie,Skater 1,3,5
Robert Bob,Bobby,Biker 2,4,6
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --c2p nest --evar , -f nicknames data/split1.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
name nicknames codes
Alice Allie 1,3,5
Alice Skater 1,3,5
Robert Bob 2,4,6
Robert Bobby 2,4,6
Robert Biker 2,4,6
</pre>
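Here `--evar ,` is shorthand for `--explode --values --across-records --nested-fs ,`. What that does to each record can be pictured with the Python sketch below; the function name `explode_values_across_records` is made up for illustration, and records are modeled as plain dictionaries.

```python
from typing import Dict, List


def explode_values_across_records(
    records: List[Dict[str, str]], field: str, sep: str = ","
) -> List[Dict[str, str]]:
    """Emit one copy of each record per separator-delimited piece of `field`."""
    exploded = []
    for record in records:
        for piece in record[field].split(sep):
            # Copy the record, replacing only the exploded field's value
            exploded.append({**record, field: piece})
    return exploded


rows = [
    {"name": "Alice", "nicknames": "Allie,Skater", "codes": "1,3,5"},
    {"name": "Robert", "nicknames": "Bob,Bobby,Biker", "codes": "2,4,6"},
]
for row in explode_values_across_records(rows, "nicknames"):
    print(row["name"], row["nicknames"], row["codes"])
```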
See the [documentation on the nest verb](reference-verbs.md#nest) for general information on how to do this.

@ -0,0 +1,190 @@
# Parsing and formatting fields
Miller offers several ways to split strings into pieces (parsing them), and to put things together
into a string (formatting them).
## Splitting and joining with the same separator
One pattern we often have is items separated by the same separator, e.g. a field with value
`1;2;3;4` -- with a `;` between every pair of items. There are several useful
[DSL](miller-programming-language.md) [functions](reference-dsl-builtin-functions.md) for splitting
a string into pieces, and joining pieces into a string.
For example, suppose we have a CSV file like this:
GENMD-RUN-COMMAND
cat data/split1.csv
GENMD-EOF
GENMD-RUN-COMMAND
mlr --icsv --ojson cat data/split1.csv
GENMD-EOF
Then we can use the [`splita`](reference-dsl-builtin-functions.md#splita) function to split the
`nicknames` string into an array of strings:
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/split1.csv put '$nicknames = splita($nicknames, ",")'
GENMD-EOF
Likewise we can split the `codes` field. Since these look like numbers, we can again use `splita`,
which tries to type-infer ints and floats when it finds them -- or we can use
[splitax](reference-dsl-builtin-functions.md#splitax) to ask for the string to be split up into
substrings, with no type inference:
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/split1.csv put '$codes = splita($codes, ",")'
GENMD-EOF
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'
GENMD-EOF
We can do operations on the array, then use [joinv](reference-dsl-builtin-functions.md#joinv) to put the
elements back together into a string:
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/split1.csv put '
$codes = splita($codes, ","); # split into array of integers
$codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers
$codes = joinv($codes, ","); # join the updated array back into a string
'
GENMD-EOF
GENMD-RUN-COMMAND
mlr --csv --from data/split1.csv put '
$codes = splita($codes, ","); # split into array of integers
$codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers
$codes = joinv($codes, ","); # join the updated array back into a string
'
GENMD-EOF
The full list of split functions includes
[splita](reference-dsl-builtin-functions.md#splita),
[splitax](reference-dsl-builtin-functions.md#splitax),
[splitkv](reference-dsl-builtin-functions.md#splitkv),
[splitkvx](reference-dsl-builtin-functions.md#splitkvx),
[splitnv](reference-dsl-builtin-functions.md#splitnv), and
[splitnvx](reference-dsl-builtin-functions.md#splitnvx). The flavors have to do with what the output is
-- arrays or maps -- and whether or not type inference is done.
The full list of join functions includes [joink](reference-dsl-builtin-functions.md#joink),
[joinv](reference-dsl-builtin-functions.md#joinv), and
[joinkv](reference-dsl-builtin-functions.md#joinkv). Here the flavors have to do with whether we put
array/map keys, values, or both into the resulting string.
## Example: shortening hostnames
Suppose you want to just keep the first two components of the hostnames:
GENMD-RUN-COMMAND
cat data/hosts.csv
GENMD-EOF
Using the [`splita`](reference-dsl-builtin-functions.md#splita) and
[`joinv`](reference-dsl-builtin-functions.md#joinv) functions, along with
[array slicing](reference-main-arrays.md#slicing), we get
GENMD-RUN-COMMAND
mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'
GENMD-EOF
## Flatten/unflatten: representing arrays in CSV
In the above examples, when we split a string field into an array, we used JSON output. That's
because JSON permits nested data structures. For CSV output, Miller uses, by default, a
_flatten/unflatten strategy_: array-valued fields are turned into multiple CSV columns. For example:
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'
GENMD-EOF
GENMD-RUN-COMMAND
mlr --csv --from data/split1.csv put '$codes = splitax($codes, ",")'
GENMD-EOF
See the [Flatten/unflatten: converting between JSON and tabular formats](flatten-unflatten.md) page
for more on this default behavior, including how to override it when you prefer.
## Splitting and joining with different separators
The above is well and good when a string contains pieces with multiple instances of the same
separator. However, sometimes we have input like `5-18:53:20`. Here we can use the more flexible
[unformat](reference-dsl-builtin-functions.md#unformat) and
[format](reference-dsl-builtin-functions.md#format) DSL functions. (As above, there's an
[unformatx](reference-dsl-builtin-functions.md#unformatx) function if you want Miller to just split
the string into string pieces without trying to type-infer them.)
GENMD-RUN-COMMAND
cat data/split2.csv
GENMD-EOF
GENMD-RUN-COMMAND
mlr --icsv --ojson --from data/split2.csv put '$pieces = unformat("{}-{}:{}:{}", $stamp)'
GENMD-EOF
GENMD-RUN-COMMAND
mlr --icsv --opprint --from data/split2.csv put '
pieces = unformat("{}-{}:{}:{}", $stamp);
$description = format("{} day(s) {} hour(s) {} minute(s) {} second(s)", pieces[1], pieces[2], pieces[3], pieces[4]);
'
GENMD-EOF
## Using regular expressions and capture groups
If you prefer [regular expressions](reference-main-regular-expressions.md), those can be used in this context as well:
GENMD-RUN-COMMAND
mlr --icsv --opprint --from data/split2.csv put '
if ($stamp =~ "([0-9]+)-([0-9]+):([0-9]+):([0-9]+)") {
$description = "\1 day(s) \2 hour(s) \3 minute(s) \4 second(s)";
}
'
GENMD-EOF
## Special case: timestamps
Timestamps are complex enough to merit their own handling: see the
[DSL datetime/timezone functions page](reference-dsl-time.md), in particular the
[strptime](reference-dsl-builtin-functions.md#strptime)
and
[strftime](reference-dsl-builtin-functions.md#strftime)
functions.
## Special case: dhms and seconds
For historical reasons, Miller has a way to represent seconds in a more human-readable format, using days,
hours, minutes, and seconds. For example:
GENMD-RUN-COMMAND
mlr --c2p --from data/sec2dhms.csv put '$dhms = sec2dhms($sec)'
GENMD-EOF
Please see the
[sec2dhms](reference-dsl-builtin-functions.md#sec2dhms)
and
[dhms2sec](reference-dsl-builtin-functions.md#dhms2sec)
functions for details.
## Special case: financial values
One way to handle currencies is to sub out the currency marker (like `$`) as well as commas:
GENMD-RUN-COMMAND
echo 'd=$1234.56' | mlr put '$d = float(gsub(ssub($d, "$", ""), ",", ""))'
GENMD-EOF
## Nesting and unnesting fields
Sometimes we want to split a string not into an array within one record, but into multiple records,
one per piece. For example:
GENMD-RUN-COMMAND
mlr --c2p cat data/split1.csv
GENMD-EOF
GENMD-RUN-COMMAND
mlr --c2p nest --evar , -f nicknames data/split1.csv
GENMD-EOF
See the [documentation on the nest verb](reference-verbs.md#nest) for general information on how to do this.

@ -302,133 +302,6 @@ yellow,circle,true,9,87,63.5058,8.3350,3
The difference is a matter of taste (although `mlr cat -n` puts the counter first).
## Splitting a string and taking a few of the components
Suppose you want to just keep the first two components of the hostnames:
<pre class="pre-highlight-in-pair">
<b>cat data/hosts.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
host,status
xy01.east.acme.org,up
ab02.west.acme.org,down
ac91.west.acme.org,up
</pre>
Using the [`splita`](reference-dsl-builtin-functions.md#splita) and
[`joinv`](reference-dsl-builtin-functions.md#joinv) functions, along with
[array slicing](reference-main-arrays.md#slicing), we get
<pre class="pre-highlight-in-pair">
<b>mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
host,status
xy01.east,up
ab02.west,down
ac91.west,up
</pre>
## Splitting nested fields
Suppose you have a TSV file like this:
<pre class="pre-non-highlight-non-pair">
a b
x z
s u:v:w
</pre>
The simplest option is to use [nest](reference-verbs.md#nest):
<pre class="pre-highlight-in-pair">
<b>mlr --tsv nest --explode --values --across-records -f b --nested-fs : data/nested.tsv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a b
x z
s u
s v
s w
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --tsv nest --explode --values --across-fields -f b --nested-fs : data/nested.tsv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a b_1
x z
a b_1 b_2 b_3
s u v w
</pre>
While `mlr nest` is simplest, let's also take a look at a few ways to do this using the `put` DSL.
One option to split out the colon-delimited values in the `b` column is to use `splitnv` to create an integer-indexed map and loop over it, adding new fields to the current record:
<pre class="pre-highlight-in-pair">
<b>mlr --from data/nested.tsv --itsv --oxtab put '</b>
<b> o = splitnv($b, ":");</b>
<b> for (k,v in o) {</b>
<b> $["p".k]=v</b>
<b> }</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a x
b z
p1 z
a s
b u:v:w
p1 u
p2 v
p3 w
</pre>
while another is to loop over the same map from `splitnv` and use it (with `put -q` to suppress printing the original record) to produce multiple records:
<pre class="pre-highlight-in-pair">
<b>mlr --from data/nested.tsv --itsv --oxtab put -q '</b>
<b> o = splitnv($b, ":");</b>
<b> for (k,v in o) {</b>
<b> x = mapsum($*, {"b":v});</b>
<b> emit x</b>
<b> }</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a x
b z
a s
b u
a s
b v
a s
b w
</pre>
<pre class="pre-highlight-in-pair">
<b>mlr --from data/nested.tsv --tsv put -q '</b>
<b> o = splitnv($b, ":");</b>
<b> for (k,v in o) {</b>
<b> x = mapsum($*, {"b":v}); emit x</b>
<b> }</b>
<b>'</b>
</pre>
<pre class="pre-non-highlight-in-pair">
a b
x z
s u
s v
s w
</pre>
## Options for dealing with duplicate rows
If your data has records appearing multiple times, you can use [uniq](reference-verbs.md#uniq) to show and/or count the unique records.

@ -168,72 +168,6 @@ GENMD-EOF
The difference is a matter of taste (although `mlr cat -n` puts the counter first).
## Splitting a string and taking a few of the components
Suppose you want to just keep the first two components of the hostnames:
GENMD-RUN-COMMAND
cat data/hosts.csv
GENMD-EOF
Using the [`splita`](reference-dsl-builtin-functions.md#splita) and
[`joinv`](reference-dsl-builtin-functions.md#joinv) functions, along with
[array slicing](reference-main-arrays.md#slicing), we get
GENMD-RUN-COMMAND
mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'
GENMD-EOF
## Splitting nested fields
Suppose you have a TSV file like this:
GENMD-INCLUDE-ESCAPED(data/nested.tsv)
The simplest option is to use [nest](reference-verbs.md#nest):
GENMD-RUN-COMMAND
mlr --tsv nest --explode --values --across-records -f b --nested-fs : data/nested.tsv
GENMD-EOF
GENMD-RUN-COMMAND
mlr --tsv nest --explode --values --across-fields -f b --nested-fs : data/nested.tsv
GENMD-EOF
While `mlr nest` is simplest, let's also take a look at a few ways to do this using the `put` DSL.
One option to split out the colon-delimited values in the `b` column is to use `splitnv` to create an integer-indexed map and loop over it, adding new fields to the current record:
GENMD-RUN-COMMAND
mlr --from data/nested.tsv --itsv --oxtab put '
o = splitnv($b, ":");
for (k,v in o) {
$["p".k]=v
}
'
GENMD-EOF
while another is to loop over the same map from `splitnv` and use it (with `put -q` to suppress printing the original record) to produce multiple records:
GENMD-RUN-COMMAND
mlr --from data/nested.tsv --itsv --oxtab put -q '
o = splitnv($b, ":");
for (k,v in o) {
x = mapsum($*, {"b":v});
emit x
}
'
GENMD-EOF
GENMD-RUN-COMMAND
mlr --from data/nested.tsv --tsv put -q '
o = splitnv($b, ":");
for (k,v in o) {
x = mapsum($*, {"b":v}); emit x
}
'
GENMD-EOF
## Options for dealing with duplicate rows
If your data has records appearing multiple times, you can use [uniq](reference-verbs.md#uniq) to show and/or count the unique records.

@ -2,8 +2,8 @@
RELEASES
* plan 6.1.0
o unsparsify -f CSV by default -- ? into CSV record-writer -- ? caveat that record 1 controls all ...
o fmt/unfmt/regex doc
o FAQ/examples reorg
- \d etc to DSL :( -- & parsing-and-formatting-fields.md.in
o mlr split -- needs an example page along with the tee DSL function
o https://github.com/johnkerl/miller/issues?q=is%3Aissue+is%3Aopen+label%3Aneeds-documentation
@ -121,22 +121,6 @@ strict-mode ideas
* srec:
o abend unless $?x -- ?
----------------------------------------------------------------
mlr join --left-fields a,b,c
----------------------------------------------------------------
! Better functions for values manipulation, e.g. easier conversion of strings like "$1,234.56" into numeric values
o note on is_error(x) (or string(x) == "(error)")
? dhms w/ optional separgs -- ? what about fenceposting? ["d","h","m","s"] vs ["-",":",":",""] -- ?
o 'Ability to specify some formats that are fixed. Like we can process
"5d18h53m20s" format in *dhms* commands, but what about "5-18:53:20"? This is
a common format used by the SLURM resource manager.'
o linked-to faqent w/ -f -s etc ...
----------------------------------------------------------------
k better print-interpolate with {} etc
doc: mlr --csv --from example.csv put -q 'print format("Index {} at quantity {} and rate {}", $index, $quantity, $rate)'
----------------------------------------------------------------
! sysdate, sysdate_local; datediff ...