filename options for split (iss. #1365) (#1366)

* #1365 - filename options for `split`

* Don't use joiner string when prefix is empty.
* Add option to specify joiner string.
* Add option to not URL-escape file names.

* #1365 - update documentation

* #1365 - don't URL-escape file name prefix

I **_thought_** it'd be cool to apply URL-escaping to the file name prefix as well, just in case it included spaces or other characters.  I forgot that a common use for the prefix is to specify a directory path that will contain the file.  When the slashes ("`/`") of the path are URL-escaped, they become "`%2F`" and the directories will not be created.  So, I moved the prefix handling code to come after the URL-escaping.

* #1365 - new `split` options for CLI help output

* #1365 - fix escape/suffix logic error

Trying to make the `return` statement cleaner, I thought it'd be good to add the file name suffix immediately after the file name is URL-escaped.  I'd forgotten that the suffix will not be added if the new `-e` option is used to skip URL-escaping.  So, I put the suffix back where I had it.

* #1365 - add `split` to the "10 minutes" document

Not strictly part of this issue, but as I was checking for docs that I should update as a result of my changes, I noticed this document showed how to split data using the `put` and `tee` combination, but not about the `split` verb.

* #1365 - updated manpage

When I ran `make dev`, generating `data-diving-examples.md` failed.  The two `manpage.txt` files ended up empty, but `mlr.1` seems to be correct.

---------

Co-authored-by: Mr. Lance E Sloan (sloanlance) <sloanlance@users.noreply.github.com>
This commit is contained in:
Mr. Lance E Sloan 2023-08-23 16:08:48 -04:00 committed by GitHub
parent 12f3b14ce6
commit e2338195ba
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
6 changed files with 113 additions and 26 deletions

View file

@ -909,3 +909,40 @@ yellow,triangle,true,1,11,43.6498,9.8870
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
</pre>
Alternatively, the `split` verb can do the same thing:
<pre class="pre-highlight-non-pair">
<b>mlr --csv --from example.csv split -g shape</b>
</pre>
<pre class="pre-highlight-in-pair">
<b>cat split_circle.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate
red,circle,true,3,16,13.8103,2.9010
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
</pre>
<pre class="pre-highlight-in-pair">
<b>cat split_square.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
purple,square,false,10,91,72.3735,8.2430
</pre>
<pre class="pre-highlight-in-pair">
<b>cat split_triangle.csv</b>
</pre>
<pre class="pre-non-highlight-in-pair">
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
</pre>

View file

@ -434,3 +434,21 @@ GENMD-EOF
GENMD-RUN-COMMAND
cat triangle.csv
GENMD-EOF
Alternatively, the `split` verb can do the same thing:
GENMD-RUN-COMMAND
mlr --csv --from example.csv split -g shape
GENMD-EOF
GENMD-RUN-COMMAND
cat split_circle.csv
GENMD-EOF
GENMD-RUN-COMMAND
cat split_square.csv
GENMD-EOF
GENMD-RUN-COMMAND
cat split_triangle.csv
GENMD-EOF

View file

@ -1813,6 +1813,8 @@ MILLER(1) MILLER(1)
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-e Do NOT URL-escape names of output files.
-j {j} Use string J to join filename parts; default "_".
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000

View file

@ -1,7 +1,6 @@
package transformers
import (
"bytes"
"container/list"
"fmt"
"net/url"
@ -17,6 +16,7 @@ import (
// ----------------------------------------------------------------
const verbNameSplit = "split"
const splitDefaultOutputFileNamePrefix = "split"
const splitDefaultFileNamePartJoiner = "_"
var SplitSetup = TransformerSetup{
Verb: verbNameSplit,
@ -39,6 +39,8 @@ Exactly one of -m, -n, or -g must be supplied.
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-e Do NOT URL-escape names of output files.
-j {j} Use string J to join filename parts; default "`+splitDefaultFileNamePartJoiner+`".
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
@ -88,6 +90,8 @@ func transformerSplitParseCLI(
var doSize bool = false
var groupByFieldNames []string = nil
var emitDownstream bool = false
var escapeFileNameCharacters bool = true
var fileNamePartJoiner string = splitDefaultFileNamePartJoiner
var doAppend bool = false
var outputFileNamePrefix string = splitDefaultOutputFileNamePrefix
var outputFileNameSuffix string = "uninit"
@ -138,6 +142,12 @@ func transformerSplitParseCLI(
} else if opt == "-v" {
emitDownstream = true
} else if opt == "-e" {
escapeFileNameCharacters = false
} else if opt == "-j" {
fileNamePartJoiner = cli.VerbGetStringArgOrDie(verb, opt, args, &argi, argc)
} else {
// This is inelegant. For error-proofing we advance argi already in our
// loop (so individual if-statements don't need to). However,
@ -180,6 +190,8 @@ func transformerSplitParseCLI(
doSize,
groupByFieldNames,
emitDownstream,
escapeFileNameCharacters,
fileNamePartJoiner,
doAppend,
outputFileNamePrefix,
outputFileNameSuffix,
@ -195,14 +207,16 @@ func transformerSplitParseCLI(
// ----------------------------------------------------------------
type TransformerSplit struct {
n int64
outputFileNamePrefix string
outputFileNameSuffix string
emitDownstream bool
ungroupedCounter int64
groupByFieldNames []string
recordWriterOptions *cli.TWriterOptions
doAppend bool
n int64
outputFileNamePrefix string
outputFileNameSuffix string
emitDownstream bool
escapeFileNameCharacters bool
fileNamePartJoiner string
ungroupedCounter int64
groupByFieldNames []string
recordWriterOptions *cli.TWriterOptions
doAppend bool
// For doSize ungrouped: only one file open at a time
outputHandler output.OutputHandler
@ -220,6 +234,8 @@ func NewTransformerSplit(
doSize bool,
groupByFieldNames []string,
emitDownstream bool,
escapeFileNameCharacters bool,
fileNamePartJoiner string,
doAppend bool,
outputFileNamePrefix string,
outputFileNameSuffix string,
@ -227,14 +243,16 @@ func NewTransformerSplit(
) (*TransformerSplit, error) {
tr := &TransformerSplit{
n: n,
outputFileNamePrefix: outputFileNamePrefix,
outputFileNameSuffix: outputFileNameSuffix,
emitDownstream: emitDownstream,
ungroupedCounter: 0,
groupByFieldNames: groupByFieldNames,
recordWriterOptions: recordWriterOptions,
doAppend: doAppend,
n: n,
outputFileNamePrefix: outputFileNamePrefix,
outputFileNameSuffix: outputFileNameSuffix,
emitDownstream: emitDownstream,
escapeFileNameCharacters: escapeFileNameCharacters,
fileNamePartJoiner: fileNamePartJoiner,
ungroupedCounter: 0,
groupByFieldNames: groupByFieldNames,
recordWriterOptions: recordWriterOptions,
doAppend: doAppend,
outputHandler: nil,
previousQuotient: -1,
@ -402,13 +420,21 @@ func (tr *TransformerSplit) makeUngroupedOutputFileName(k int64) string {
func (tr *TransformerSplit) makeGroupedOutputFileName(
groupByFieldValues []*mlrval.Mlrval,
) string {
var buffer bytes.Buffer
buffer.WriteString(tr.outputFileNamePrefix)
var fileNameParts []string
for _, groupByFieldValue := range groupByFieldValues {
buffer.WriteString("_")
buffer.WriteString(url.QueryEscape(groupByFieldValue.String()))
fileNameParts = append(fileNameParts, groupByFieldValue.String())
}
buffer.WriteString(".")
buffer.WriteString(tr.outputFileNameSuffix)
return buffer.String()
fileName := strings.Join(fileNameParts, tr.fileNamePartJoiner)
if tr.escapeFileNameCharacters {
fileName = url.QueryEscape(fileName)
}
if tr.outputFileNamePrefix != "" {
fileName = tr.outputFileNamePrefix + tr.fileNamePartJoiner + fileName
}
return fileName + "." + tr.outputFileNameSuffix
}

View file

@ -2,12 +2,12 @@
.\" Title: mlr
.\" Author: [see the "AUTHOR" section]
.\" Generator: ./mkman.rb
.\" Date: 2023-08-20
.\" Date: 2023-08-22
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "MILLER" "1" "2023-08-20" "\ \&" "\ \&"
.TH "MILLER" "1" "2023-08-22" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Portability definitions
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -2296,6 +2296,8 @@ Exactly one of -m, -n, or -g must be supplied.
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-e Do NOT URL-escape names of output files.
-j {j} Use string J to join filename parts; default "_".
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000

View file

@ -997,6 +997,8 @@ Exactly one of -m, -n, or -g must be supplied.
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-e Do NOT URL-escape names of output files.
-j {j} Use string J to join filename parts; default "_".
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000