mlr split verb (#898)

* mlr split

* regression-test cases

* doc-build artifacts
This commit is contained in:
John Kerl 2022-01-26 23:16:25 -05:00 committed by GitHub
parent dad6456022
commit f7ff63124b
62 changed files with 969 additions and 19 deletions

View file

@ -195,8 +195,8 @@ VERB LIST
json-stringify join label least-frequent merge-fields most-frequent nest
nothing put regularize remove-empty-columns rename reorder repeat reshape
sample sec2gmtdate sec2gmt seqgen shuffle skip-trivial-records sort
-sort-within-records stats1 stats2 step tac tail tee template top unflatten
-uniq unsparsify
+sort-within-records split stats1 stats2 step tac tail tee template top
+unflatten uniq unsparsify
FUNCTION LIST
abs acos acosh any append apply arrayify asin asinh asserting_absent
@ -1737,6 +1737,46 @@ VERBS
-r Recursively sort subobjects/submaps, e.g. for JSON input.
-h|--help Show this message.
split
Usage: mlr split [options] {filename}
Options:
-n {n}: Cap file sizes at N records.
-m {m}: Produce M files, round-robining records among them.
-g {a,b,c}: Write separate files with records having distinct values for fields named a,b,c.
Exactly one of -m, -n, or -g must be supplied.
--prefix {p} Specify filename prefix; default "split".
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
the input is CSV, but the output files are JSON.
Examples: Suppose myfile.csv has 1,000,000 records.
100 output files, 10,000 records each. First 10,000 records in split_1.csv, next in split_2.csv, etc.
mlr --csv --from myfile.csv split -n 10000
10 output files, 100,000 records each. Records 1,11,21,etc in split_1.csv, records 2,12,22, etc in split_2.csv, etc.
mlr --csv --from myfile.csv split -m 10
Same, but with JSON output.
mlr --csv --from myfile.csv split -m 10 -o json
Same but instead of split_1.csv, split_2.csv, etc. there are test_1.dat, test_2.dat, etc.
mlr --csv --from myfile.csv split -m 10 --prefix test --suffix dat
Same, but written to the /tmp/ directory.
mlr --csv --from myfile.csv split -m 10 --prefix /tmp/test --suffix dat
If the shape field has values triangle and square, then there will be split_triangle.csv and split_square.csv.
mlr --csv --from myfile.csv split -g shape
If the color field has values yellow and green, and the shape field has values triangle and square,
then there will be split_yellow_triangle.csv, split_yellow_square.csv, etc.
mlr --csv --from myfile.csv split -g color,shape
See also the "tee" DSL function which lets you do more ad-hoc customization.
stats1
Usage: mlr stats1 [options]
Computes univariate statistics for one or more given fields, accumulated across
@ -3091,5 +3131,5 @@ SEE ALSO
-2022-01-25 MILLER(1)
+2022-01-27 MILLER(1)
</pre>

View file

@ -174,8 +174,8 @@ VERB LIST
json-stringify join label least-frequent merge-fields most-frequent nest
nothing put regularize remove-empty-columns rename reorder repeat reshape
sample sec2gmtdate sec2gmt seqgen shuffle skip-trivial-records sort
-sort-within-records stats1 stats2 step tac tail tee template top unflatten
-uniq unsparsify
+sort-within-records split stats1 stats2 step tac tail tee template top
+unflatten uniq unsparsify
FUNCTION LIST
abs acos acosh any append apply arrayify asin asinh asserting_absent
@ -1716,6 +1716,46 @@ VERBS
-r Recursively sort subobjects/submaps, e.g. for JSON input.
-h|--help Show this message.
split
Usage: mlr split [options] {filename}
Options:
-n {n}: Cap file sizes at N records.
-m {m}: Produce M files, round-robining records among them.
-g {a,b,c}: Write separate files with records having distinct values for fields named a,b,c.
Exactly one of -m, -n, or -g must be supplied.
--prefix {p} Specify filename prefix; default "split".
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
the input is CSV, but the output files are JSON.
Examples: Suppose myfile.csv has 1,000,000 records.
100 output files, 10,000 records each. First 10,000 records in split_1.csv, next in split_2.csv, etc.
mlr --csv --from myfile.csv split -n 10000
10 output files, 100,000 records each. Records 1,11,21,etc in split_1.csv, records 2,12,22, etc in split_2.csv, etc.
mlr --csv --from myfile.csv split -m 10
Same, but with JSON output.
mlr --csv --from myfile.csv split -m 10 -o json
Same but instead of split_1.csv, split_2.csv, etc. there are test_1.dat, test_2.dat, etc.
mlr --csv --from myfile.csv split -m 10 --prefix test --suffix dat
Same, but written to the /tmp/ directory.
mlr --csv --from myfile.csv split -m 10 --prefix /tmp/test --suffix dat
If the shape field has values triangle and square, then there will be split_triangle.csv and split_square.csv.
mlr --csv --from myfile.csv split -g shape
If the color field has values yellow and green, and the shape field has values triangle and square,
then there will be split_yellow_triangle.csv, split_yellow_square.csv, etc.
mlr --csv --from myfile.csv split -g color,shape
See also the "tee" DSL function which lets you do more ad-hoc customization.
stats1
Usage: mlr stats1 [options]
Computes univariate statistics for one or more given fields, accumulated across
@ -3070,4 +3110,4 @@ SEE ALSO
-2022-01-25 MILLER(1)
+2022-01-27 MILLER(1)

View file

@ -2978,6 +2978,52 @@ a b c
9 8 7
</pre>
## split
<pre class="pre-highlight-in-pair">
<b>mlr split --help</b>
</pre>
<pre class="pre-non-highlight-in-pair">
Usage: mlr split [options] {filename}
Options:
-n {n}: Cap file sizes at N records.
-m {m}: Produce M files, round-robining records among them.
-g {a,b,c}: Write separate files with records having distinct values for fields named a,b,c.
Exactly one of -m, -n, or -g must be supplied.
--prefix {p} Specify filename prefix; default "split".
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
the input is CSV, but the output files are JSON.
Examples: Suppose myfile.csv has 1,000,000 records.
100 output files, 10,000 records each. First 10,000 records in split_1.csv, next in split_2.csv, etc.
mlr --csv --from myfile.csv split -n 10000
10 output files, 100,000 records each. Records 1,11,21,etc in split_1.csv, records 2,12,22, etc in split_2.csv, etc.
mlr --csv --from myfile.csv split -m 10
Same, but with JSON output.
mlr --csv --from myfile.csv split -m 10 -o json
Same but instead of split_1.csv, split_2.csv, etc. there are test_1.dat, test_2.dat, etc.
mlr --csv --from myfile.csv split -m 10 --prefix test --suffix dat
Same, but written to the /tmp/ directory.
mlr --csv --from myfile.csv split -m 10 --prefix /tmp/test --suffix dat
If the shape field has values triangle and square, then there will be split_triangle.csv and split_square.csv.
mlr --csv --from myfile.csv split -g shape
If the color field has values yellow and green, and the shape field has values triangle and square,
then there will be split_yellow_triangle.csv, split_yellow_square.csv, etc.
mlr --csv --from myfile.csv split -g color,shape
See also the "tee" DSL function which lets you do more ad-hoc customization.
</pre>
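As a rough illustration of how `-n` and `-m` assign records to numbered files, the following Go sketch reproduces the index arithmetic of the verb's ungrouped modes: a zero-based record counter mapped to one-based file numbers. The function names `sizeFileIndex` and `modFileIndex` are hypothetical and chosen for illustration; only the arithmetic comes from the implementation in this commit.

```go
package main

import "fmt"

// sizeFileIndex returns the 1-based file number for "split -n":
// the first n records go to file 1, the next n to file 2, and so on.
// Hypothetical helper; counter is the 0-based count of records seen so far.
func sizeFileIndex(counter, n int) int {
	return 1 + counter/n
}

// modFileIndex returns the 1-based file number for "split -m":
// records are dealt round-robin among m files.
// Hypothetical helper; counter is the 0-based count of records seen so far.
func modFileIndex(counter, m int) int {
	return 1 + counter%m
}

func main() {
	// Ten records, matching the size of the regression-test input.
	for counter := 0; counter < 10; counter++ {
		fmt.Printf("record %2d: -n 2 -> split_%d.csv, -m 2 -> split_%d.csv\n",
			counter+1, sizeFileIndex(counter, 2), modFileIndex(counter, 2))
	}
}
```

With ten input records, `-n 2` spreads them over five files of two records each, while `-m 2` alternates records between two files of five records each.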
## stats1
<pre class="pre-highlight-in-pair">

View file

@ -936,6 +936,12 @@ GENMD-RUN-COMMAND
mlr --ijson --opprint sort-within-records data/sort-within-records.json
GENMD-EOF
## split
GENMD-RUN-COMMAND
mlr split --help
GENMD-EOF
## stats1
GENMD-RUN-COMMAND

View file

@ -56,6 +56,17 @@ type MultiOutputHandlerManager struct {
}
// ----------------------------------------------------------------
func NewFileOutputHandlerManager(
recordWriterOptions *cli.TWriterOptions,
doAppend bool,
) *MultiOutputHandlerManager {
if doAppend {
return NewFileAppendHandlerManager(recordWriterOptions)
} else {
return NewFileWritetHandlerManager(recordWriterOptions)
}
}
func NewFileWritetHandlerManager(
recordWriterOptions *cli.TWriterOptions,
) *MultiOutputHandlerManager {
@ -228,6 +239,18 @@ func newOutputHandlerCommon(
}
// ----------------------------------------------------------------
func NewFileOutputHandler(
filename string,
recordWriterOptions *cli.TWriterOptions,
doAppend bool,
) (*FileOutputHandler, error) {
if doAppend {
return NewFileAppendOutputHandler(filename, recordWriterOptions)
} else {
return NewFileWriteOutputHandler(filename, recordWriterOptions)
}
}
func NewFileWriteOutputHandler(
filename string,
recordWriterOptions *cli.TWriterOptions,

View file

@ -59,6 +59,7 @@ var TRANSFORMER_LOOKUP_TABLE = []TransformerSetup{
SkipTrivialRecordsSetup,
SortSetup,
SortWithinRecordsSetup,
SplitSetup,
Stats1Setup,
Stats2Setup,
StepSetup,

View file

@ -0,0 +1,437 @@
package transformers
import (
"bytes"
"container/list"
"fmt"
"net/url"
"os"
"strings"
"github.com/johnkerl/miller/internal/pkg/cli"
"github.com/johnkerl/miller/internal/pkg/mlrval"
"github.com/johnkerl/miller/internal/pkg/output"
"github.com/johnkerl/miller/internal/pkg/types"
)
// ----------------------------------------------------------------
const verbNameSplit = "split"
const splitDefaultOutputFileNamePrefix = "split"
var SplitSetup = TransformerSetup{
Verb: verbNameSplit,
UsageFunc: transformerSplitUsage,
ParseCLIFunc: transformerSplitParseCLI,
IgnoresInput: false,
}
func transformerSplitUsage(
o *os.File,
doExit bool,
exitCode int,
) {
fmt.Fprintf(o, "Usage: %s %s [options] {filename}\n", "mlr", verbNameSplit)
fmt.Fprintf(o,
`Options:
-n {n}: Cap file sizes at N records.
-m {m}: Produce M files, round-robining records among them.
-g {a,b,c}: Write separate files with records having distinct values for fields named a,b,c.
Exactly one of -m, -n, or -g must be supplied.
--prefix {p} Specify filename prefix; default "`+splitDefaultOutputFileNamePrefix+`".
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
the input is CSV, but the output files are JSON.
Examples: Suppose myfile.csv has 1,000,000 records.
100 output files, 10,000 records each. First 10,000 records in split_1.csv, next in split_2.csv, etc.
mlr --csv --from myfile.csv split -n 10000
10 output files, 100,000 records each. Records 1,11,21,etc in split_1.csv, records 2,12,22, etc in split_2.csv, etc.
mlr --csv --from myfile.csv split -m 10
Same, but with JSON output.
mlr --csv --from myfile.csv split -m 10 -o json
Same but instead of split_1.csv, split_2.csv, etc. there are test_1.dat, test_2.dat, etc.
mlr --csv --from myfile.csv split -m 10 --prefix test --suffix dat
Same, but written to the /tmp/ directory.
mlr --csv --from myfile.csv split -m 10 --prefix /tmp/test --suffix dat
If the shape field has values triangle and square, then there will be split_triangle.csv and split_square.csv.
mlr --csv --from myfile.csv split -g shape
If the color field has values yellow and green, and the shape field has values triangle and square,
then there will be split_yellow_triangle.csv, split_yellow_square.csv, etc.
mlr --csv --from myfile.csv split -g color,shape
See also the "tee" DSL function which lets you do more ad-hoc customization.
`)
if doExit {
os.Exit(exitCode)
}
}
func transformerSplitParseCLI(
pargi *int,
argc int,
args []string,
mainOptions *cli.TOptions,
doConstruct bool, // false for first pass of CLI-parse, true for second pass
) IRecordTransformer {
// Skip the verb name from the current spot in the mlr command line
argi := *pargi
verb := args[argi]
argi++
var n int = 0
var doMod bool = false
var doSize bool = false
var groupByFieldNames []string = nil
var emitDownstream bool = false
var doAppend bool = false
var outputFileNamePrefix string = splitDefaultOutputFileNamePrefix
var outputFileNameSuffix string = "uninit"
haveOutputFileNameSuffix := false
var localOptions *cli.TOptions = nil
if mainOptions != nil {
copyThereof := *mainOptions // struct copy
localOptions = &copyThereof
}
// Parse local flags.
for argi < argc /* variable increment: 1 or 2 depending on flag */ {
opt := args[argi]
if !strings.HasPrefix(opt, "-") {
break // No more flag options to process
}
if args[argi] == "--" {
break // All transformers must do this so main-flags can follow verb-flags
}
argi++
if opt == "-h" || opt == "--help" {
transformerSplitUsage(os.Stdout, true, 0)
} else if opt == "-n" {
n = cli.VerbGetIntArgOrDie(verb, opt, args, &argi, argc)
doSize = true
} else if opt == "-m" {
n = cli.VerbGetIntArgOrDie(verb, opt, args, &argi, argc)
doMod = true
} else if opt == "-g" {
groupByFieldNames = cli.VerbGetStringArrayArgOrDie(verb, opt, args, &argi, argc)
} else if opt == "--prefix" {
outputFileNamePrefix = cli.VerbGetStringArgOrDie(verb, opt, args, &argi, argc)
} else if opt == "--suffix" {
outputFileNameSuffix = cli.VerbGetStringArgOrDie(verb, opt, args, &argi, argc)
haveOutputFileNameSuffix = true
} else if opt == "-a" {
doAppend = true
} else if opt == "-v" {
emitDownstream = true
} else {
// This is inelegant. For error-proofing we advance argi already in our
// loop (so individual if-statements don't need to). However,
// cli.FLAG_TABLE.Parse expects it unadvanced.
largi := argi - 1
if cli.FLAG_TABLE.Parse(args, argc, &largi, localOptions) {
// This lets mlr main and mlr split have different output formats.
// Nothing else to handle here.
argi = largi
} else {
transformerSplitUsage(os.Stderr, true, 1)
}
}
}
doGroup := groupByFieldNames != nil
if !doMod && !doSize && !doGroup {
fmt.Fprintf(os.Stderr, "mlr %s: At least one of -m, -n, or -g is required.\n", verb)
os.Exit(1)
}
if (doMod && doSize) || (doMod && doGroup) || (doSize && doGroup) {
fmt.Fprintf(os.Stderr, "mlr %s: Only one of -m, -n, or -g may be supplied.\n", verb)
os.Exit(1)
}
cli.FinalizeWriterOptions(&localOptions.WriterOptions)
if !haveOutputFileNameSuffix {
outputFileNameSuffix = localOptions.WriterOptions.OutputFileFormat
}
*pargi = argi
if !doConstruct { // All transformers must do this for main command-line parsing
return nil
}
transformer, err := NewTransformerSplit(
n,
doMod,
doSize,
groupByFieldNames,
emitDownstream,
doAppend,
outputFileNamePrefix,
outputFileNameSuffix,
&localOptions.WriterOptions,
)
if err != nil {
// Error message already printed out
os.Exit(1)
}
return transformer
}
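The two mode checks above enforce the "exactly one of -m, -n, or -g" rule pairwise. One alternative way to express the same rule is to count the flags that were set; the sketch below assumes nothing beyond the three booleans used above, and `checkExactlyOne` is a hypothetical helper name, not something this commit defines.

```go
package main

import (
	"errors"
	"fmt"
)

// checkExactlyOne is a hypothetical helper: it returns an error unless
// exactly one of the three mode flags is set.
func checkExactlyOne(doMod, doSize, doGroup bool) error {
	count := 0
	for _, set := range []bool{doMod, doSize, doGroup} {
		if set {
			count++
		}
	}
	switch {
	case count == 0:
		return errors.New("exactly one of -m, -n, or -g is required; none was given")
	case count > 1:
		return errors.New("exactly one of -m, -n, or -g is required; more than one was given")
	}
	return nil
}

func main() {
	fmt.Println(checkExactlyOne(true, false, false))  // valid: only -m
	fmt.Println(checkExactlyOne(false, false, false)) // invalid: no mode
	fmt.Println(checkExactlyOne(true, true, false))   // invalid: -m and -n together
}
```

Counting scales more gracefully than pairwise comparison if further mutually exclusive modes are ever added.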
// ----------------------------------------------------------------
type TransformerSplit struct {
n int
outputFileNamePrefix string
outputFileNameSuffix string
emitDownstream bool
ungroupedCounter int
groupByFieldNames []string
recordWriterOptions *cli.TWriterOptions
doAppend bool
// For doSize ungrouped: only one file open at a time
outputHandler output.OutputHandler
previousQuotient int
// For all other cases: multiple files open at a time
outputHandlerManager output.OutputHandlerManager
recordTransformerFunc RecordTransformerFunc
}
func NewTransformerSplit(
n int,
doMod bool,
doSize bool,
groupByFieldNames []string,
emitDownstream bool,
doAppend bool,
outputFileNamePrefix string,
outputFileNameSuffix string,
recordWriterOptions *cli.TWriterOptions,
) (*TransformerSplit, error) {
tr := &TransformerSplit{
n: n,
outputFileNamePrefix: outputFileNamePrefix,
outputFileNameSuffix: outputFileNameSuffix,
emitDownstream: emitDownstream,
ungroupedCounter: 0,
groupByFieldNames: groupByFieldNames,
recordWriterOptions: recordWriterOptions,
doAppend: doAppend,
outputHandler: nil,
previousQuotient: -1,
}
tr.outputHandlerManager = output.NewFileOutputHandlerManager(recordWriterOptions, doAppend)
if groupByFieldNames != nil {
tr.recordTransformerFunc = tr.splitGrouped
} else if doMod {
tr.recordTransformerFunc = tr.splitModUngrouped
} else {
tr.recordTransformerFunc = tr.splitSizeUngrouped
}
return tr, nil
}
func (tr *TransformerSplit) Transform(
inrecAndContext *types.RecordAndContext,
outputRecordsAndContexts *list.List, // list of *types.RecordAndContext
inputDownstreamDoneChannel <-chan bool,
outputDownstreamDoneChannel chan<- bool,
) {
HandleDefaultDownstreamDone(inputDownstreamDoneChannel, outputDownstreamDoneChannel)
tr.recordTransformerFunc(inrecAndContext, outputRecordsAndContexts, inputDownstreamDoneChannel,
outputDownstreamDoneChannel)
}
func (tr *TransformerSplit) splitModUngrouped(
inrecAndContext *types.RecordAndContext,
outputRecordsAndContexts *list.List, // list of *types.RecordAndContext
inputDownstreamDoneChannel <-chan bool,
outputDownstreamDoneChannel chan<- bool,
) {
if !inrecAndContext.EndOfStream {
remainder := 1 + (tr.ungroupedCounter % tr.n)
filename := tr.makeUngroupedOutputFileName(remainder)
err := tr.outputHandlerManager.WriteRecordAndContext(inrecAndContext, filename)
if err != nil {
fmt.Fprintf(os.Stderr, "mlr: file-write error: %v\n", err)
os.Exit(1)
}
if tr.emitDownstream {
outputRecordsAndContexts.PushBack(inrecAndContext)
}
tr.ungroupedCounter++
} else {
outputRecordsAndContexts.PushBack(inrecAndContext) // end-of-stream marker
errs := tr.outputHandlerManager.Close()
if len(errs) > 0 {
for _, err := range errs {
fmt.Fprintf(os.Stderr, "mlr: file-close error: %v\n", err)
}
os.Exit(1)
}
}
}
func (tr *TransformerSplit) splitSizeUngrouped(
inrecAndContext *types.RecordAndContext,
outputRecordsAndContexts *list.List, // list of *types.RecordAndContext
inputDownstreamDoneChannel <-chan bool,
outputDownstreamDoneChannel chan<- bool,
) {
var err error
if !inrecAndContext.EndOfStream {
quotient := 1 + (tr.ungroupedCounter / tr.n)
if quotient != tr.previousQuotient {
if tr.outputHandler != nil {
err = tr.outputHandler.Close()
if err != nil {
fmt.Fprintf(os.Stderr, "mlr: file-close error: %v\n", err)
os.Exit(1)
}
}
filename := tr.makeUngroupedOutputFileName(quotient)
tr.outputHandler, err = output.NewFileOutputHandler(
filename,
tr.recordWriterOptions,
tr.doAppend,
)
if err != nil {
fmt.Fprintf(os.Stderr, "mlr: file-open error: %v\n", err)
os.Exit(1)
}
tr.previousQuotient = quotient
}
err = tr.outputHandler.WriteRecordAndContext(inrecAndContext)
if err != nil {
fmt.Fprintf(os.Stderr, "mlr: file-write error: %v\n", err)
os.Exit(1)
}
if tr.emitDownstream {
outputRecordsAndContexts.PushBack(inrecAndContext)
}
tr.ungroupedCounter++
} else {
outputRecordsAndContexts.PushBack(inrecAndContext) // end-of-stream marker
if tr.outputHandler != nil {
err := tr.outputHandler.Close()
if err != nil {
fmt.Fprintf(os.Stderr, "mlr: file-close error: %v\n", err)
os.Exit(1)
}
}
}
}
func (tr *TransformerSplit) splitGrouped(
inrecAndContext *types.RecordAndContext,
outputRecordsAndContexts *list.List, // list of *types.RecordAndContext
inputDownstreamDoneChannel <-chan bool,
outputDownstreamDoneChannel chan<- bool,
) {
if !inrecAndContext.EndOfStream {
var filename string
groupByFieldValues, ok := inrecAndContext.Record.GetSelectedValues(tr.groupByFieldNames)
if !ok {
filename = fmt.Sprintf("%s_ungrouped.%s", tr.outputFileNamePrefix, tr.outputFileNameSuffix)
} else {
filename = tr.makeGroupedOutputFileName(groupByFieldValues)
}
err := tr.outputHandlerManager.WriteRecordAndContext(inrecAndContext, filename)
if err != nil {
fmt.Fprintf(os.Stderr, "mlr: %v\n", err)
os.Exit(1)
}
if tr.emitDownstream {
outputRecordsAndContexts.PushBack(inrecAndContext)
}
} else {
outputRecordsAndContexts.PushBack(inrecAndContext) // emit end-of-stream marker
errs := tr.outputHandlerManager.Close()
if len(errs) > 0 {
for _, err := range errs {
fmt.Fprintf(os.Stderr, "mlr: file-close error: %v\n", err)
}
os.Exit(1)
}
}
}
// makeUngroupedOutputFileName example: "split_53.csv"
func (tr *TransformerSplit) makeUngroupedOutputFileName(k int) string {
return fmt.Sprintf("%s_%d.%s", tr.outputFileNamePrefix, k, tr.outputFileNameSuffix)
}
// makeGroupedOutputFileName example: "split_orange.csv"
func (tr *TransformerSplit) makeGroupedOutputFileName(
groupByFieldValues []*mlrval.Mlrval,
) string {
var buffer bytes.Buffer
buffer.WriteString(tr.outputFileNamePrefix)
for _, groupByFieldValue := range groupByFieldValues {
buffer.WriteString("_")
buffer.WriteString(url.QueryEscape(groupByFieldValue.String()))
}
buffer.WriteString(".")
buffer.WriteString(tr.outputFileNameSuffix)
return buffer.String()
}
// makeGroupedIndexedOutputFileName example: "split_yellow_53.csv"
func (tr *TransformerSplit) makeGroupedIndexedOutputFileName(
groupByFieldValues []*mlrval.Mlrval,
index int,
) string {
// URL-escape the fields which come from data and which may have '/'
// etc within. Don't URL-escape the prefix since people may want to
// use prefixes like '/tmp/split' to write to the /tmp directory, etc.
var buffer bytes.Buffer
buffer.WriteString(tr.outputFileNamePrefix)
for _, groupByFieldValue := range groupByFieldValues {
buffer.WriteString("_")
buffer.WriteString(url.QueryEscape(groupByFieldValue.String()))
}
buffer.WriteString(fmt.Sprintf("_%d", index))
buffer.WriteString(".")
buffer.WriteString(tr.outputFileNameSuffix)
return buffer.String()
}
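To make the escaping behavior of the grouped filename builders concrete, here is a small standalone sketch that mirrors their logic using plain strings instead of Miller's `mlrval.Mlrval` values. `groupedFileName` is a hypothetical name used only for this illustration. Field values are passed through `url.QueryEscape`, while the prefix is left untouched so that it can carry a directory path.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// groupedFileName is a hypothetical stand-in for the grouped filename
// builders: prefix, then "_" plus each URL-escaped field value, then
// "." and the suffix. The prefix is deliberately not escaped.
func groupedFileName(prefix, suffix string, values []string) string {
	var sb strings.Builder
	sb.WriteString(prefix)
	for _, v := range values {
		sb.WriteString("_")
		sb.WriteString(url.QueryEscape(v))
	}
	sb.WriteString(".")
	sb.WriteString(suffix)
	return sb.String()
}

func main() {
	// Ordinary values pass through unchanged.
	fmt.Println(groupedFileName("split", "csv", []string{"yellow", "triangle"}))
	// A slash inside a data value is escaped; the slash in the prefix is kept.
	fmt.Println(groupedFileName("/tmp/split", "csv", []string{"a/b"}))
}
```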

View file

@ -174,8 +174,8 @@ VERB LIST
json-stringify join label least-frequent merge-fields most-frequent nest
nothing put regularize remove-empty-columns rename reorder repeat reshape
sample sec2gmtdate sec2gmt seqgen shuffle skip-trivial-records sort
-sort-within-records stats1 stats2 step tac tail tee template top unflatten
-uniq unsparsify
+sort-within-records split stats1 stats2 step tac tail tee template top
+unflatten uniq unsparsify
FUNCTION LIST
abs acos acosh any append apply arrayify asin asinh asserting_absent
@ -1716,6 +1716,46 @@ VERBS
-r Recursively sort subobjects/submaps, e.g. for JSON input.
-h|--help Show this message.
split
Usage: mlr split [options] {filename}
Options:
-n {n}: Cap file sizes at N records.
-m {m}: Produce M files, round-robining records among them.
-g {a,b,c}: Write separate files with records having distinct values for fields named a,b,c.
Exactly one of -m, -n, or -g must be supplied.
--prefix {p} Specify filename prefix; default "split".
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
the input is CSV, but the output files are JSON.
Examples: Suppose myfile.csv has 1,000,000 records.
100 output files, 10,000 records each. First 10,000 records in split_1.csv, next in split_2.csv, etc.
mlr --csv --from myfile.csv split -n 10000
10 output files, 100,000 records each. Records 1,11,21,etc in split_1.csv, records 2,12,22, etc in split_2.csv, etc.
mlr --csv --from myfile.csv split -m 10
Same, but with JSON output.
mlr --csv --from myfile.csv split -m 10 -o json
Same but instead of split_1.csv, split_2.csv, etc. there are test_1.dat, test_2.dat, etc.
mlr --csv --from myfile.csv split -m 10 --prefix test --suffix dat
Same, but written to the /tmp/ directory.
mlr --csv --from myfile.csv split -m 10 --prefix /tmp/test --suffix dat
If the shape field has values triangle and square, then there will be split_triangle.csv and split_square.csv.
mlr --csv --from myfile.csv split -g shape
If the color field has values yellow and green, and the shape field has values triangle and square,
then there will be split_yellow_triangle.csv, split_yellow_square.csv, etc.
mlr --csv --from myfile.csv split -g color,shape
See also the "tee" DSL function which lets you do more ad-hoc customization.
stats1
Usage: mlr stats1 [options]
Computes univariate statistics for one or more given fields, accumulated across
@ -3070,4 +3110,4 @@ SEE ALSO
-2022-01-25 MILLER(1)
+2022-01-27 MILLER(1)

View file

@ -2,12 +2,12 @@
.\" Title: mlr
.\" Author: [see the "AUTHOR" section]
.\" Generator: ./mkman.rb
-.\" Date: 2022-01-25
+.\" Date: 2022-01-27
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
-.TH "MILLER" "1" "2022-01-25" "\ \&" "\ \&"
+.TH "MILLER" "1" "2022-01-27" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Portability definitions
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -215,8 +215,8 @@ fraction gap grep group-by group-like having-fields head histogram json-parse
json-stringify join label least-frequent merge-fields most-frequent nest
nothing put regularize remove-empty-columns rename reorder repeat reshape
sample sec2gmtdate sec2gmt seqgen shuffle skip-trivial-records sort
-sort-within-records stats1 stats2 step tac tail tee template top unflatten
-uniq unsparsify
+sort-within-records split stats1 stats2 step tac tail tee template top
+unflatten uniq unsparsify
.fi
.if n \{\
.RE
@ -2169,6 +2169,52 @@ Options:
.fi
.if n \{\
.RE
.SS "split"
.if n \{\
.RS 0
.\}
.nf
Usage: mlr split [options] {filename}
Options:
-n {n}: Cap file sizes at N records.
-m {m}: Produce M files, round-robining records among them.
-g {a,b,c}: Write separate files with records having distinct values for fields named a,b,c.
Exactly one of -m, -n, or -g must be supplied.
--prefix {p} Specify filename prefix; default "split".
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
the input is CSV, but the output files are JSON.
Examples: Suppose myfile.csv has 1,000,000 records.
100 output files, 10,000 records each. First 10,000 records in split_1.csv, next in split_2.csv, etc.
mlr --csv --from myfile.csv split -n 10000
10 output files, 100,000 records each. Records 1,11,21,etc in split_1.csv, records 2,12,22, etc in split_2.csv, etc.
mlr --csv --from myfile.csv split -m 10
Same, but with JSON output.
mlr --csv --from myfile.csv split -m 10 -o json
Same but instead of split_1.csv, split_2.csv, etc. there are test_1.dat, test_2.dat, etc.
mlr --csv --from myfile.csv split -m 10 --prefix test --suffix dat
Same, but written to the /tmp/ directory.
mlr --csv --from myfile.csv split -m 10 --prefix /tmp/test --suffix dat
If the shape field has values triangle and square, then there will be split_triangle.csv and split_square.csv.
mlr --csv --from myfile.csv split -g shape
If the color field has values yellow and green, and the shape field has values triangle and square,
then there will be split_yellow_triangle.csv, split_yellow_square.csv, etc.
mlr --csv --from myfile.csv split -g color,shape
See also the "tee" DSL function which lets you do more ad-hoc customization.
.fi
.if n \{\
.RE
.SS "stats1"
.if n \{\
.RS 0

View file

@ -929,6 +929,47 @@ Options:
-r Recursively sort subobjects/submaps, e.g. for JSON input.
-h|--help Show this message.
================================================================
split
Usage: mlr split [options] {filename}
Options:
-n {n}: Cap file sizes at N records.
-m {m}: Produce M files, round-robining records among them.
-g {a,b,c}: Write separate files with records having distinct values for fields named a,b,c.
Exactly one of -m, -n, or -g must be supplied.
--prefix {p} Specify filename prefix; default "split".
--suffix {s} Specify filename suffix; default is from mlr output format, e.g. "csv".
-a Append to existing file(s), if any, rather than overwriting.
-v Send records along to downstream verbs as well as splitting to files.
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h). For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
the input is CSV, but the output files are JSON.
Examples: Suppose myfile.csv has 1,000,000 records.
100 output files, 10,000 records each. First 10,000 records in split_1.csv, next in split_2.csv, etc.
mlr --csv --from myfile.csv split -n 10000
10 output files, 100,000 records each. Records 1,11,21,etc in split_1.csv, records 2,12,22, etc in split_2.csv, etc.
mlr --csv --from myfile.csv split -m 10
Same, but with JSON output.
mlr --csv --from myfile.csv split -m 10 -o json
Same but instead of split_1.csv, split_2.csv, etc. there are test_1.dat, test_2.dat, etc.
mlr --csv --from myfile.csv split -m 10 --prefix test --suffix dat
Same, but written to the /tmp/ directory.
mlr --csv --from myfile.csv split -m 10 --prefix /tmp/test --suffix dat
If the shape field has values triangle and square, then there will be split_triangle.csv and split_square.csv.
mlr --csv --from myfile.csv split -g shape
If the color field has values yellow and green, and the shape field has values triangle and square,
then there will be split_yellow_triangle.csv, split_yellow_square.csv, etc.
mlr --csv --from myfile.csv split -g color,shape
See also the "tee" DSL function which lets you do more ad-hoc customization.
================================================================
stats1
Usage: mlr stats1 [options]

View file

@ -0,0 +1 @@
mlr --csv split -m 2 --prefix ${CASEDIR}/split test/input/example.csv

View file

@ -0,0 +1,3 @@
${CASEDIR}/split_1.csv.expect ${CASEDIR}/split_1.csv
${CASEDIR}/split_2.csv.expect ${CASEDIR}/split_2.csv

View file

@ -0,0 +1,6 @@
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,circle,true,3,16,13.8103,2.9010
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,9,87,63.5058,8.3350

View file

@ -0,0 +1,6 @@
color,shape,flag,k,index,quantity,rate
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
yellow,circle,true,8,73,63.9785,4.2370
purple,square,false,10,91,72.3735,8.2430

View file

@ -0,0 +1 @@
mlr --csv split -n 2 --prefix ${CASEDIR}/split test/input/example.csv

View file

@ -0,0 +1,6 @@
${CASEDIR}/split_1.csv.expect ${CASEDIR}/split_1.csv
${CASEDIR}/split_2.csv.expect ${CASEDIR}/split_2.csv
${CASEDIR}/split_3.csv.expect ${CASEDIR}/split_3.csv
${CASEDIR}/split_4.csv.expect ${CASEDIR}/split_4.csv
${CASEDIR}/split_5.csv.expect ${CASEDIR}/split_5.csv

View file

@ -0,0 +1,3 @@
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130

View file

@ -0,0 +1,3 @@
color,shape,flag,k,index,quantity,rate
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670

View file

@ -0,0 +1,3 @@
color,shape,flag,k,index,quantity,rate
purple,triangle,false,5,51,81.2290,8.5910
red,square,false,6,64,77.1991,9.5310

View file

@ -0,0 +1,3 @@
color,shape,flag,k,index,quantity,rate
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370

View file

@ -0,0 +1,3 @@
color,shape,flag,k,index,quantity,rate
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430

View file

@ -0,0 +1 @@
mlr --csv split -g shape --prefix ${CASEDIR}/split test/input/example.csv

View file

@ -0,0 +1,3 @@
${CASEDIR}/split_square.csv.expect ${CASEDIR}/split_square.csv
${CASEDIR}/split_circle.csv.expect ${CASEDIR}/split_circle.csv
${CASEDIR}/split_triangle.csv.expect ${CASEDIR}/split_triangle.csv

View file

@ -0,0 +1,4 @@
color,shape,flag,k,index,quantity,rate
red,circle,true,3,16,13.8103,2.9010
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350

@ -0,0 +1,5 @@
color,shape,flag,k,index,quantity,rate
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
purple,square,false,10,91,72.3735,8.2430

@ -0,0 +1,4 @@
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240

@ -0,0 +1 @@
mlr --csv split -g color,shape --prefix ${CASEDIR}/split test/input/example.csv

@ -0,0 +1,7 @@
${CASEDIR}/split_purple_square.csv.expect ${CASEDIR}/split_purple_square.csv
${CASEDIR}/split_purple_triangle.csv.expect ${CASEDIR}/split_purple_triangle.csv
${CASEDIR}/split_red_circle.csv.expect ${CASEDIR}/split_red_circle.csv
${CASEDIR}/split_red_square.csv.expect ${CASEDIR}/split_red_square.csv
${CASEDIR}/split_yellow_circle.csv.expect ${CASEDIR}/split_yellow_circle.csv
${CASEDIR}/split_yellow_triangle.csv.expect ${CASEDIR}/split_yellow_triangle.csv

@ -0,0 +1,2 @@
color,shape,flag,k,index,quantity,rate
purple,square,false,10,91,72.3735,8.2430

@ -0,0 +1,3 @@
color,shape,flag,k,index,quantity,rate
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240

@ -0,0 +1,2 @@
color,shape,flag,k,index,quantity,rate
red,circle,true,3,16,13.8103,2.9010

@ -0,0 +1,4 @@
color,shape,flag,k,index,quantity,rate
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310

@ -0,0 +1,3 @@
color,shape,flag,k,index,quantity,rate
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350

@ -0,0 +1,2 @@
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
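
The six expected files above suggest that with `-g color,shape` each output name is the prefix, then the grouping-field values joined by underscores, then the suffix. A minimal Python sketch of that naming and grouping follows; it only models the fixture, it is not Miller's implementation, and `group_filename` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Contents of test/input/example.csv, inlined so the sketch is self-contained.
CSV_TEXT = """color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
purple,triangle,false,5,51,81.2290,8.5910
red,square,false,6,64,77.1991,9.5310
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430
"""


def group_filename(record, fields, prefix="split", suffix="csv"):
    """Build a name like split_red_square.csv from the record's grouping values."""
    return prefix + "_" + "_".join(record[f] for f in fields) + "." + suffix


# Map each output filename to the k values of the records routed to it.
groups = defaultdict(list)
for record in csv.DictReader(io.StringIO(CSV_TEXT)):
    groups[group_filename(record, ["color", "shape"])].append(record["k"])

for name in sorted(groups):
    print(name, groups[name])
```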

@ -0,0 +1 @@
mlr --csv split -m 2 --prefix ${CASEDIR}/split --suffix dat test/input/example.csv

@ -0,0 +1,3 @@
${CASEDIR}/split_1.dat.expect ${CASEDIR}/split_1.dat
${CASEDIR}/split_2.dat.expect ${CASEDIR}/split_2.dat

@ -0,0 +1,6 @@
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,circle,true,3,16,13.8103,2.9010
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,9,87,63.5058,8.3350

@ -0,0 +1,6 @@
color,shape,flag,k,index,quantity,rate
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
yellow,circle,true,8,73,63.9785,4.2370
purple,square,false,10,91,72.3735,8.2430
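
The two expected files above match round-robin assignment for `-m 2`: odd-numbered records land in split_1.dat and even-numbered records in split_2.dat. A minimal Python sketch of that assignment follows; it is an illustration under that assumption rather than Miller's implementation, and `round_robin_index` is a hypothetical name.

```python
def round_robin_index(record_number: int, m: int) -> int:
    """Return the 1-based file index for a 1-based record number,
    cycling through m files."""
    return (record_number - 1) % m + 1


# The k column of test/input/example.csv runs 1..10 in input order,
# so it doubles as the record number here.
assignment = {}
for k in range(1, 11):
    assignment.setdefault(f"split_{round_robin_index(k, 2)}.dat", []).append(k)

for name, ks in assignment.items():
    print(name, ks)
```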

@ -0,0 +1 @@
mlr --csv split -m 2 --prefix ${CASEDIR}/split --ojson test/input/example.csv

@ -0,0 +1,3 @@
${CASEDIR}/split_1.json.expect ${CASEDIR}/split_1.json
${CASEDIR}/split_2.json.expect ${CASEDIR}/split_2.json

@ -0,0 +1,47 @@
[
{
  "color": "yellow",
  "shape": "triangle",
  "flag": "true",
  "k": 1,
  "index": 11,
  "quantity": 43.6498,
  "rate": 9.8870
},
{
  "color": "red",
  "shape": "circle",
  "flag": "true",
  "k": 3,
  "index": 16,
  "quantity": 13.8103,
  "rate": 2.9010
},
{
  "color": "purple",
  "shape": "triangle",
  "flag": "false",
  "k": 5,
  "index": 51,
  "quantity": 81.2290,
  "rate": 8.5910
},
{
  "color": "purple",
  "shape": "triangle",
  "flag": "false",
  "k": 7,
  "index": 65,
  "quantity": 80.1405,
  "rate": 5.8240
},
{
  "color": "yellow",
  "shape": "circle",
  "flag": "true",
  "k": 9,
  "index": 87,
  "quantity": 63.5058,
  "rate": 8.3350
}
]

@ -0,0 +1,47 @@
[
{
  "color": "red",
  "shape": "square",
  "flag": "true",
  "k": 2,
  "index": 15,
  "quantity": 79.2778,
  "rate": 0.0130
},
{
  "color": "red",
  "shape": "square",
  "flag": "false",
  "k": 4,
  "index": 48,
  "quantity": 77.5542,
  "rate": 7.4670
},
{
  "color": "red",
  "shape": "square",
  "flag": "false",
  "k": 6,
  "index": 64,
  "quantity": 77.1991,
  "rate": 9.5310
},
{
  "color": "yellow",
  "shape": "circle",
  "flag": "true",
  "k": 8,
  "index": 73,
  "quantity": 63.9785,
  "rate": 4.2370
},
{
  "color": "purple",
  "shape": "square",
  "flag": "false",
  "k": 10,
  "index": 91,
  "quantity": 72.3735,
  "rate": 8.2430
}
]

@ -0,0 +1 @@
mlr --csv split -m 2 -v --prefix ${CASEDIR}/split test/input/example.csv

@ -0,0 +1,11 @@
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
purple,triangle,false,5,51,81.2290,8.5910
red,square,false,6,64,77.1991,9.5310
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430

@ -0,0 +1,3 @@
${CASEDIR}/split_1.csv.expect ${CASEDIR}/split_1.csv
${CASEDIR}/split_2.csv.expect ${CASEDIR}/split_2.csv

@ -0,0 +1,6 @@
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,circle,true,3,16,13.8103,2.9010
purple,triangle,false,5,51,81.2290,8.5910
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,9,87,63.5058,8.3350

@ -0,0 +1,6 @@
color,shape,flag,k,index,quantity,rate
red,square,true,2,15,79.2778,0.0130
red,square,false,4,48,77.5542,7.4670
red,square,false,6,64,77.1991,9.5310
yellow,circle,true,8,73,63.9785,4.2370
purple,square,false,10,91,72.3735,8.2430

test/input/example.csv

@ -0,0 +1,11 @@
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130
red,circle,true,3,16,13.8103,2.9010
red,square,false,4,48,77.5542,7.4670
purple,triangle,false,5,51,81.2290,8.5910
red,square,false,6,64,77.1991,9.5310
purple,triangle,false,7,65,80.1405,5.8240
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
purple,square,false,10,91,72.3735,8.2430

@ -1,4 +1,4 @@
================================================================
===============================================================
RELEASES
* follow ...
@ -26,6 +26,10 @@ FEATURES
o format/unformat
o strmatch
o =~
* separate examples from FAQs
* mlr split -- needs an example page along with the tee DSL function
* new example entry, with ccump and pgr
o slwin --prune (or somesuch) to only emit averages over full windows -- ?
----------------------------------------------------------------
k better print-interpolate with {} etc
@ -33,15 +37,10 @@ k better print-interpolate with {} etc
----------------------------------------------------------------
! sysdate, sysdate_local; datediff ...
----------------------------------------------------------------
mlr split ... -n, -g -- ?
- how to specify filenames?
----------------------------------------------------------------
! strmatch https://github.com/johnkerl/miller/issues/77#issuecomment-538790927
----------------------------------------------------------------
* new example entry, with ccump and pgr
* make a lag-by-n and lead-by-n
----------------------------------------------------------------