Mirror of https://github.com/johnkerl/miller.git, synced 2026-01-23 02:14:13 +00:00
Make TSV finally true TSV (#923)
* Spec-TSV
* doc mods; more test cases
This commit is contained in: parent ac47c7052a, commit 66c4a077fd
30 changed files with 705 additions and 139 deletions
.vimrc (3 changes)
@@ -1,4 +1,5 @@
 map \d :w<C-m>:!clear;echo Building ...; echo; make mlr<C-m>
 map \f :w<C-m>:!clear;echo Building ...; echo; make ut<C-m>
-map \r :w<C-m>:!clear;echo Building ...; echo; make ut-scan ut-mlv<C-m>
+"map \r :w<C-m>:!clear;echo Building ...; echo; make ut-scan ut-mlv<C-m>
+map \r :w<C-m>:!clear;echo Building ...; echo; make ut-lib<C-m>
 map \t :w<C-m>:!clear;go test github.com/johnkerl/miller/internal/pkg/transformers/...<C-m>
@@ -104,36 +104,34 @@ NIDX: implicitly numerically indexed (Unix-toolkit style)
 
 When `mlr` is invoked with the `--csv` or `--csvlite` option, key names are found on the first record and values are taken from subsequent records. This includes the case of CSV-formatted files. See [Record Heterogeneity](record-heterogeneity.md) for how Miller handles changes of field names within a single data stream.
 
-Miller has record separator `RS` and field separator `FS`, just as `awk` does. For TSV, use `--fs tab`; to convert TSV to CSV, use `--ifs tab --ofs comma`, etc. (See also the [separators page](reference-main-separators.md).)
+Miller has record separator `RS` and field separator `FS`, just as `awk` does. (See also the [separators page](reference-main-separators.md).)
 
-**TSV (tab-separated values):** the following are synonymous pairs:
-
-* `--tsv` and `--csv --fs tab`
-* `--itsv` and `--icsv --ifs tab`
-* `--otsv` and `--ocsv --ofs tab`
-* `--tsvlite` and `--csvlite --fs tab`
-* `--itsvlite` and `--icsvlite --ifs tab`
-* `--otsvlite` and `--ocsvlite --ofs tab`
+**TSV (tab-separated values):** `FS` is tab and `RS` is newline (or carriage return + linefeed for
+Windows). On input, if fields have `\r`, `\n`, `\t`, or `\\`, those are decoded as carriage return,
+newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field
+has an embedded newline, that newline is replaced by `\n`.
 
-**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS 0x1f and 0x1e, respectively.
+**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS `0x1f` and `0x1e`, respectively.
 
-**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS U+241F (UTF-8 0xe2909f) and U+241E (UTF-8 0xe2909e), respectively.
+**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS `U+241F` (UTF-8 `0xe2909f`) and `U+241E` (UTF-8 `0xe2909e`), respectively.
 
 Miller's `--csv` flag supports [RFC-4180 CSV](https://tools.ietf.org/html/rfc4180). This includes CRLF line-terminators by default, regardless of platform.
 
 Here are the differences between CSV and CSV-lite:
 
 * CSV-lite naively splits lines on newline, and fields on comma -- embedded commas and newlines are not escaped in any way.
 
 * CSV supports [RFC-4180](https://tools.ietf.org/html/rfc4180)-style double-quoting, including the ability to have commas and/or LF/CRLF line-endings contained within an input field; CSV-lite does not.
 
 * CSV does not allow heterogeneous data; CSV-lite does (see also [Record Heterogeneity](record-heterogeneity.md)).
 
 * The CSV-lite input-reading code is fractionally more efficient than the CSV input-reader.
+* TSV-lite is simply CSV-lite with field separator set to tab instead of comma.
 
 Here are things they have in common:
+* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character.
 
-* The ability to specify record/field separators other than the default, e.g. CR-LF vs. LF, or tab instead of comma for TSV, and so on.
+* In short, use-cases for CSV-lite and TSV-lite are often found when dealing with CSV/TSV files which are formatted in some non-standard way -- you have a little more flexibility available to you. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.)
 
-* The `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
+CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
 
 ## JSON
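The decode/encode rules the new TSV paragraph describes (backslash escapes for tab, newline, carriage return, and backslash itself) can be sketched in a few lines of Go. This is a minimal illustration only, not Miller's actual reader/writer code; the helper names `escapeTSVField` and `unescapeTSVField` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeTSVField applies the escaping the spec-TSV writer performs:
// backslash must be escaped first so later substitutions aren't re-escaped.
func escapeTSVField(s string) string {
	s = strings.ReplaceAll(s, `\`, `\\`)
	s = strings.ReplaceAll(s, "\r", `\r`)
	s = strings.ReplaceAll(s, "\n", `\n`)
	s = strings.ReplaceAll(s, "\t", `\t`)
	return s
}

// unescapeTSVField reverses escapeTSVField in a single left-to-right pass,
// which avoids double-unescaping sequences like `\\t`.
func unescapeTSVField(s string) string {
	var sb strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] == '\\' && i+1 < len(s) {
			switch s[i+1] {
			case 'r':
				sb.WriteByte('\r')
			case 'n':
				sb.WriteByte('\n')
			case 't':
				sb.WriteByte('\t')
			case '\\':
				sb.WriteByte('\\')
			default: // unknown escape: keep both bytes as-is
				sb.WriteByte(s[i])
				sb.WriteByte(s[i+1])
			}
			i++
		} else {
			sb.WriteByte(s[i])
		}
	}
	return sb.String()
}

func main() {
	// prints: line1\nline2\tend
	fmt.Println(escapeTSVField("line1\nline2\tend"))
}
```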
@@ -16,36 +16,34 @@ GENMD-EOF
 
 When `mlr` is invoked with the `--csv` or `--csvlite` option, key names are found on the first record and values are taken from subsequent records. This includes the case of CSV-formatted files. See [Record Heterogeneity](record-heterogeneity.md) for how Miller handles changes of field names within a single data stream.
 
-Miller has record separator `RS` and field separator `FS`, just as `awk` does. For TSV, use `--fs tab`; to convert TSV to CSV, use `--ifs tab --ofs comma`, etc. (See also the [separators page](reference-main-separators.md).)
+Miller has record separator `RS` and field separator `FS`, just as `awk` does. (See also the [separators page](reference-main-separators.md).)
 
-**TSV (tab-separated values):** the following are synonymous pairs:
-
-* `--tsv` and `--csv --fs tab`
-* `--itsv` and `--icsv --ifs tab`
-* `--otsv` and `--ocsv --ofs tab`
-* `--tsvlite` and `--csvlite --fs tab`
-* `--itsvlite` and `--icsvlite --ifs tab`
-* `--otsvlite` and `--ocsvlite --ofs tab`
+**TSV (tab-separated values):** `FS` is tab and `RS` is newline (or carriage return + linefeed for
+Windows). On input, if fields have `\r`, `\n`, `\t`, or `\\`, those are decoded as carriage return,
+newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field
+has an embedded newline, that newline is replaced by `\n`.
 
-**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS 0x1f and 0x1e, respectively.
+**ASV (ASCII-separated values):** the flags `--asv`, `--iasv`, `--oasv`, `--asvlite`, `--iasvlite`, and `--oasvlite` are analogous except they use ASCII FS and RS `0x1f` and `0x1e`, respectively.
 
-**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS U+241F (UTF-8 0xe2909f) and U+241E (UTF-8 0xe2909e), respectively.
+**USV (Unicode-separated values):** likewise, the flags `--usv`, `--iusv`, `--ousv`, `--usvlite`, `--iusvlite`, and `--ousvlite` use Unicode FS and RS `U+241F` (UTF-8 `0xe2909f`) and `U+241E` (UTF-8 `0xe2909e`), respectively.
 
 Miller's `--csv` flag supports [RFC-4180 CSV](https://tools.ietf.org/html/rfc4180). This includes CRLF line-terminators by default, regardless of platform.
 
 Here are the differences between CSV and CSV-lite:
 
 * CSV-lite naively splits lines on newline, and fields on comma -- embedded commas and newlines are not escaped in any way.
 
 * CSV supports [RFC-4180](https://tools.ietf.org/html/rfc4180)-style double-quoting, including the ability to have commas and/or LF/CRLF line-endings contained within an input field; CSV-lite does not.
 
 * CSV does not allow heterogeneous data; CSV-lite does (see also [Record Heterogeneity](record-heterogeneity.md)).
 
 * The CSV-lite input-reading code is fractionally more efficient than the CSV input-reader.
+* TSV-lite is simply CSV-lite with field separator set to tab instead of comma.
 
 Here are things they have in common:
+* CSV-lite allows changing FS and/or RS to any values, perhaps multi-character.
 
-* The ability to specify record/field separators other than the default, e.g. CR-LF vs. LF, or tab instead of comma for TSV, and so on.
+* In short, use-cases for CSV-lite and TSV-lite are often found when dealing with CSV/TSV files which are formatted in some non-standard way -- you have a little more flexibility available to you. (As an example of this flexibility: ASV and USV are nothing more than CSV-lite with different values for FS and RS.)
 
-* The `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
+CSV, TSV, CSV-lite, and TSV-lite have in common the `--implicit-csv-header` flag for input and the `--headerless-csv-output` flag for output.
 
 ## JSON
@@ -92,11 +92,11 @@ If there's more than one input file, you can use `--mfrom`, then however many fi
 The following have even shorter versions:
 
 * `-c` is the same as `--csv`
-* `-t` is the same as `--tsvlite`
+* `-t` is the same as `--tsv`
 * `-j` is the same as `--json`
 
 I don't use these within these documents, since I want the docs to be self-explanatory on every page, and
-I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're there for you to use.
+I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're always there for you to use.
 
 ## .mlrrc file
@@ -37,11 +37,11 @@ GENMD-EOF
 The following have even shorter versions:
 
 * `-c` is the same as `--csv`
-* `-t` is the same as `--tsvlite`
+* `-t` is the same as `--tsv`
 * `-j` is the same as `--json`
 
 I don't use these within these documents, since I want the docs to be self-explanatory on every page, and
-I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're there for you to use.
+I think `mlr --csv ...` explains itself better than `mlr -c ...`. Nonetheless, they're always there for you to use.
 
 ## .mlrrc file
@@ -386,7 +386,7 @@ FILE-FORMAT FLAGS
 --oxtab Use XTAB format for output data.
 --pprint Use PPRINT format for input and output data.
 --tsv Use TSV format for input and output data.
---tsvlite or -t Use TSV-lite format for input and output data.
+--tsv or -t Use TSV-lite format for input and output data.
 --usv or --usvlite Use USV format for input and output data.
 --xtab Use XTAB format for input and output data.
 -i {format name} Use format name for input data. For example: `-i csv`
@@ -708,7 +708,6 @@ SEPARATOR FLAGS
 alignment impossible.
 * OPS may be multi-character for XTAB format, in which case alignment is
 disabled.
-* TSV is simply CSV using tab as field separator (`--fs tab`).
 * FS/PS are ignored for markdown format; RS is used.
 * All FS and PS options are ignored for JSON format, since they are not relevant
 to the JSON format.
@@ -763,6 +762,7 @@ SEPARATOR FLAGS
 markdown " " N/A "\n"
 nidx " " N/A "\n"
 pprint " " N/A "\n"
+tsv "\t" N/A "\n"
 xtab "\n" " " "\n\n"
 
 --fs {string} Specify FS for input and output.
|
@ -3157,5 +3157,5 @@ SEE ALSO
|
|||
|
||||
|
||||
|
||||
2022-02-05 MILLER(1)
|
||||
2022-02-06 MILLER(1)
|
||||
</pre>
|
||||
|
|
|
|||
|
|
@ -365,7 +365,7 @@ FILE-FORMAT FLAGS
|
|||
--oxtab Use XTAB format for output data.
|
||||
--pprint Use PPRINT format for input and output data.
|
||||
--tsv Use TSV format for input and output data.
|
||||
--tsvlite or -t Use TSV-lite format for input and output data.
|
||||
--tsv or -t Use TSV-lite format for input and output data.
|
||||
--usv or --usvlite Use USV format for input and output data.
|
||||
--xtab Use XTAB format for input and output data.
|
||||
-i {format name} Use format name for input data. For example: `-i csv`
|
||||
|
|
@ -687,7 +687,6 @@ SEPARATOR FLAGS
|
|||
alignment impossible.
|
||||
* OPS may be multi-character for XTAB format, in which case alignment is
|
||||
disabled.
|
||||
* TSV is simply CSV using tab as field separator (`--fs tab`).
|
||||
* FS/PS are ignored for markdown format; RS is used.
|
||||
* All FS and PS options are ignored for JSON format, since they are not relevant
|
||||
to the JSON format.
|
||||
|
|
@ -742,6 +741,7 @@ SEPARATOR FLAGS
|
|||
markdown " " N/A "\n"
|
||||
nidx " " N/A "\n"
|
||||
pprint " " N/A "\n"
|
||||
tsv " " N/A "\n"
|
||||
xtab "\n" " " "\n\n"
|
||||
|
||||
--fs {string} Specify FS for input and output.
|
||||
|
|
@ -3136,4 +3136,4 @@ SEE ALSO
|
|||
|
||||
|
||||
|
||||
2022-02-05 MILLER(1)
|
||||
2022-02-06 MILLER(1)
|
||||
|
|
|
|||
|
|
@@ -177,7 +177,7 @@ are overridden in all cases by setting output format to `format2`.
 * `--oxtab`: Use XTAB format for output data.
 * `--pprint`: Use PPRINT format for input and output data.
 * `--tsv`: Use TSV format for input and output data.
-* `--tsvlite or -t`: Use TSV-lite format for input and output data.
+* `--tsv`: Use TSV format for input and output data.
 * `--usv or --usvlite`: Use USV format for input and output data.
 * `--xtab`: Use XTAB format for input and output data.
 * `-i {format name}`: Use format name for input data. For example: `-i csv` is the same as `--icsv`.
|
@ -405,7 +405,6 @@ Notes about all other separators:
|
|||
alignment impossible.
|
||||
* OPS may be multi-character for XTAB format, in which case alignment is
|
||||
disabled.
|
||||
* TSV is simply CSV using tab as field separator (`--fs tab`).
|
||||
* FS/PS are ignored for markdown format; RS is used.
|
||||
* All FS and PS options are ignored for JSON format, since they are not relevant
|
||||
to the JSON format.
|
||||
|
|
@ -460,6 +459,7 @@ Notes about all other separators:
|
|||
markdown " " N/A "\n"
|
||||
nidx " " N/A "\n"
|
||||
pprint " " N/A "\n"
|
||||
tsv " " N/A "\n"
|
||||
xtab "\n" " " "\n\n"
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@@ -261,8 +261,9 @@ a:4;b:5;c:6;d:>>>,|||;<<<
 
 Notes:
 
-* If CSV field separator is tab, we have TSV; see more examples (ASV, USV, etc.) in the [CSV section](file-formats.md#csvtsvasvusvetc).
 * CSV IRS and ORS must be newline, and CSV IFS must be a single character. (CSV-lite does not have these restrictions.)
+* TSV IRS and ORS must be newline, and TSV IFS must be a tab. (TSV-lite does not have these restrictions.)
+* See the [CSV section](file-formats.md#csvtsvasvusvetc) for information about ASV and USV.
 * JSON: ignores all separator flags from the command line.
 * Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on [CSV with and without headers](csv-with-and-without-headers.md).
 * For XTAB, the record separator is a repetition of the field separator. For example, if one record has `x=1,y=2` and the next has `x=3,y=4`, and OFS is newline, then output lines are `x 1`, then `y 2`, then an extra newline, then `x 3`, then `y 4`. This means: to customize XTAB, set `OFS` rather than `ORS`.
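The XTAB note above (record separator as a repetition of the field separator) can be sketched in Go. This is a minimal illustration under the stated assumptions, not Miller's writer; `writeXTAB` is a hypothetical helper, and an explicit key slice is passed because Go map iteration order is not deterministic.

```go
package main

import (
	"fmt"
	"strings"
)

// writeXTAB emits one "key value" line per field, with records separated
// by a blank line -- i.e. the record separator is a repeated OFS when
// OFS is newline, as the note describes.
func writeXTAB(records []map[string]string, keys []string) string {
	var blocks []string
	for _, rec := range records {
		var lines []string
		for _, k := range keys {
			lines = append(lines, fmt.Sprintf("%s %s", k, rec[k]))
		}
		blocks = append(blocks, strings.Join(lines, "\n"))
	}
	return strings.Join(blocks, "\n\n") + "\n"
}

func main() {
	recs := []map[string]string{
		{"x": "1", "y": "2"},
		{"x": "3", "y": "4"},
	}
	// prints: x 1 / y 2 / (blank) / x 3 / y 4
	fmt.Print(writeXTAB(recs, []string{"x", "y"}))
}
```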
@@ -151,8 +151,9 @@ GENMD-EOF
 
 Notes:
 
-* If CSV field separator is tab, we have TSV; see more examples (ASV, USV, etc.) in the [CSV section](file-formats.md#csvtsvasvusvetc).
 * CSV IRS and ORS must be newline, and CSV IFS must be a single character. (CSV-lite does not have these restrictions.)
+* TSV IRS and ORS must be newline, and TSV IFS must be a tab. (TSV-lite does not have these restrictions.)
+* See the [CSV section](file-formats.md#csvtsvasvusvetc) for information about ASV and USV.
 * JSON: ignores all separator flags from the command line.
 * Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on [CSV with and without headers](csv-with-and-without-headers.md).
 * For XTAB, the record separator is a repetition of the field separator. For example, if one record has `x=1,y=2` and the next has `x=3,y=4`, and OFS is newline, then output lines are `x 1`, then `y 2`, then an extra newline, then `x 3`, then `y 4`. This means: to customize XTAB, set `OFS` rather than `ORS`.
@@ -147,7 +147,6 @@ Notes about all other separators:
 alignment impossible.
 * OPS may be multi-character for XTAB format, in which case alignment is
 disabled.
-* TSV is simply CSV using tab as field separator (` + "`--fs tab`" + `).
 * FS/PS are ignored for markdown format; RS is used.
 * All FS and PS options are ignored for JSON format, since they are not relevant
 to the JSON format.
@@ -629,9 +628,7 @@ var FileFormatFlagSection = FlagSection{
     name: "--itsv",
     help: "Use TSV format for input data.",
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
-        options.ReaderOptions.ifsWasSpecified = true
+        options.ReaderOptions.InputFileFormat = "tsv"
         *pargi += 1
     },
 },
@@ -824,7 +821,7 @@ var FileFormatFlagSection = FlagSection{
     name: "--otsv",
     help: "Use TSV format for output data.",
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.WriterOptions.OutputFileFormat = "csv"
+        options.WriterOptions.OutputFileFormat = "tsv"
         options.WriterOptions.OFS = "\t"
         options.WriterOptions.ofsWasSpecified = true
         *pargi += 1
@@ -981,27 +978,19 @@ var FileFormatFlagSection = FlagSection{
     name: "--tsv",
     help: "Use TSV format for input and output data.",
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.WriterOptions.OutputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
-        options.WriterOptions.OFS = "\t"
-        options.ReaderOptions.ifsWasSpecified = true
-        options.WriterOptions.ofsWasSpecified = true
+        options.ReaderOptions.InputFileFormat = "tsv"
+        options.WriterOptions.OutputFileFormat = "tsv"
         *pargi += 1
     },
 },
 
 {
-    name: "--tsvlite",
+    name: "--tsv",
     help: "Use TSV-lite format for input and output data.",
     altNames: []string{"-t"},
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csvlite"
-        options.WriterOptions.OutputFileFormat = "csvlite"
-        options.ReaderOptions.IFS = "\t"
-        options.WriterOptions.OFS = "\t"
-        options.ReaderOptions.ifsWasSpecified = true
-        options.WriterOptions.ofsWasSpecified = true
+        options.ReaderOptions.InputFileFormat = "tsv"
+        options.WriterOptions.OutputFileFormat = "tsv"
        *pargi += 1
     },
 },
@@ -1181,11 +1170,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
         options.ReaderOptions.InputFileFormat = "csv"
-        options.WriterOptions.OutputFileFormat = "csv"
-        options.WriterOptions.OFS = "\t"
+        options.WriterOptions.OutputFileFormat = "tsv"
         options.ReaderOptions.irsWasSpecified = true
-        options.WriterOptions.ofsWasSpecified = true
-        options.WriterOptions.orsWasSpecified = true
         *pargi += 1
     },
 },
@@ -1308,12 +1294,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     // need to print a tedious 60-line list.
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
+        options.ReaderOptions.InputFileFormat = "tsv"
         options.WriterOptions.OutputFileFormat = "csv"
-        options.ReaderOptions.ifsWasSpecified = true
-        options.ReaderOptions.irsWasSpecified = true
-        options.WriterOptions.orsWasSpecified = true
         *pargi += 1
     },
 },
@@ -1324,11 +1306,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     // need to print a tedious 60-line list.
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
+        options.ReaderOptions.InputFileFormat = "tsv"
         options.WriterOptions.OutputFileFormat = "dkvp"
-        options.ReaderOptions.ifsWasSpecified = true
-        options.ReaderOptions.irsWasSpecified = true
         *pargi += 1
     },
 },
@@ -1339,12 +1318,9 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     // need to print a tedious 60-line list.
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
+        options.ReaderOptions.InputFileFormat = "tsv"
         options.WriterOptions.OutputFileFormat = "nidx"
         options.WriterOptions.OFS = " "
-        options.ReaderOptions.ifsWasSpecified = true
-        options.ReaderOptions.irsWasSpecified = true
         options.WriterOptions.ofsWasSpecified = true
         *pargi += 1
     },
@@ -1356,13 +1332,10 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     // need to print a tedious 60-line list.
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
+        options.ReaderOptions.InputFileFormat = "tsv"
         options.WriterOptions.OutputFileFormat = "json"
         options.WriterOptions.WrapJSONOutputInOuterList = true
         options.WriterOptions.JSONOutputMultiline = true
-        options.ReaderOptions.ifsWasSpecified = true
-        options.ReaderOptions.irsWasSpecified = true
         *pargi += 1
     },
 },
@@ -1373,13 +1346,10 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     // need to print a tedious 60-line list.
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
+        options.ReaderOptions.InputFileFormat = "tsv"
         options.WriterOptions.OutputFileFormat = "json"
         options.WriterOptions.WrapJSONOutputInOuterList = false
         options.WriterOptions.JSONOutputMultiline = false
-        options.ReaderOptions.ifsWasSpecified = true
-        options.ReaderOptions.irsWasSpecified = true
         *pargi += 1
     },
 },
@@ -1390,11 +1360,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     // need to print a tedious 60-line list.
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
+        options.ReaderOptions.InputFileFormat = "tsv"
         options.WriterOptions.OutputFileFormat = "pprint"
-        options.ReaderOptions.ifsWasSpecified = true
-        options.ReaderOptions.irsWasSpecified = true
         *pargi += 1
     },
 },
@@ -1405,12 +1372,9 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     // need to print a tedious 60-line list.
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
+        options.ReaderOptions.InputFileFormat = "tsv"
         options.WriterOptions.OutputFileFormat = "pprint"
         options.WriterOptions.BarredPprintOutput = true
-        options.ReaderOptions.ifsWasSpecified = true
-        options.ReaderOptions.irsWasSpecified = true
         *pargi += 1
     },
 },
@@ -1421,11 +1385,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     // need to print a tedious 60-line list.
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
+        options.ReaderOptions.InputFileFormat = "tsv"
         options.WriterOptions.OutputFileFormat = "xtab"
-        options.ReaderOptions.ifsWasSpecified = true
-        options.ReaderOptions.irsWasSpecified = true
         *pargi += 1
     },
 },
@@ -1436,11 +1397,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     // need to print a tedious 60-line list.
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
-        options.ReaderOptions.InputFileFormat = "csv"
-        options.ReaderOptions.IFS = "\t"
+        options.ReaderOptions.InputFileFormat = "tsv"
         options.WriterOptions.OutputFileFormat = "markdown"
-        options.ReaderOptions.ifsWasSpecified = true
-        options.ReaderOptions.irsWasSpecified = true
         *pargi += 1
     },
 },
@@ -1465,7 +1423,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
         options.ReaderOptions.InputFileFormat = "dkvp"
-        options.WriterOptions.OutputFileFormat = "csv"
+        options.WriterOptions.OutputFileFormat = "tsv"
         options.WriterOptions.OFS = "\t"
         options.WriterOptions.ofsWasSpecified = true
         options.WriterOptions.orsWasSpecified = true
@@ -1585,10 +1543,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
         options.ReaderOptions.InputFileFormat = "nidx"
-        options.WriterOptions.OutputFileFormat = "csv"
-        options.WriterOptions.OFS = "\t"
-        options.WriterOptions.ofsWasSpecified = true
-        options.WriterOptions.orsWasSpecified = true
+        options.WriterOptions.OutputFileFormat = "tsv"
         *pargi += 1
     },
 },
@@ -1703,10 +1658,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
         options.ReaderOptions.InputFileFormat = "json"
-        options.WriterOptions.OutputFileFormat = "csv"
-        options.WriterOptions.OFS = "\t"
-        options.WriterOptions.ofsWasSpecified = true
-        options.WriterOptions.orsWasSpecified = true
+        options.WriterOptions.OutputFileFormat = "tsv"
         *pargi += 1
     },
 },
@@ -1805,10 +1757,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
         options.ReaderOptions.InputFileFormat = "json"
-        options.WriterOptions.OutputFileFormat = "csv"
-        options.WriterOptions.OFS = "\t"
-        options.WriterOptions.ofsWasSpecified = true
-        options.WriterOptions.orsWasSpecified = true
+        options.WriterOptions.OutputFileFormat = "tsv"
         *pargi += 1
     },
 },
@@ -1910,11 +1859,8 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
         options.ReaderOptions.InputFileFormat = "pprint"
         options.ReaderOptions.IFS = " "
-        options.WriterOptions.OutputFileFormat = "csv"
-        options.WriterOptions.OFS = "\t"
+        options.WriterOptions.OutputFileFormat = "tsv"
         options.ReaderOptions.ifsWasSpecified = true
-        options.WriterOptions.ofsWasSpecified = true
-        options.WriterOptions.orsWasSpecified = true
         *pargi += 1
     },
 },
@@ -2028,10 +1974,7 @@ var FormatConversionKeystrokeSaverFlagSection = FlagSection{
     suppressFlagEnumeration: true,
     parser: func(args []string, argc int, pargi *int, options *TOptions) {
         options.ReaderOptions.InputFileFormat = "xtab"
-        options.WriterOptions.OutputFileFormat = "csv"
-        options.WriterOptions.OFS = "\t"
-        options.WriterOptions.ofsWasSpecified = true
-        options.WriterOptions.orsWasSpecified = true
+        options.WriterOptions.OutputFileFormat = "tsv"
         *pargi += 1
     },
 },
@@ -89,6 +89,7 @@ var defaultFSes = map[string]string{
     "nidx": " ",
     "markdown": " ",
     "pprint": " ",
+    "tsv": "\t",
     "xtab": "\n", // todo: windows-dependent ...
 }
 
@@ -100,6 +101,7 @@ var defaultPSes = map[string]string{
     "markdown": "N/A",
     "nidx": "N/A",
     "pprint": "N/A",
+    "tsv": "N/A",
     "xtab": " ",
 }
 
@@ -111,6 +113,7 @@ var defaultRSes = map[string]string{
     "markdown": "\n",
     "nidx": "\n",
     "pprint": "\n",
+    "tsv": "\n",
     "xtab": "\n\n", // todo: maybe jettison the idea of this being alterable
 }
 
@@ -122,5 +125,6 @@ var defaultAllowRepeatIFSes = map[string]bool{
     "markdown": false,
     "nidx": false,
     "pprint": true,
+    "tsv": false,
     "xtab": false,
 }
@@ -20,6 +20,8 @@ func Create(readerOptions *cli.TReaderOptions, recordsPerBatch int64) (IRecordRe
     return NewRecordReaderNIDX(readerOptions, recordsPerBatch)
 case "pprint":
     return NewRecordReaderPPRINT(readerOptions, recordsPerBatch)
+case "tsv":
+    return NewRecordReaderTSV(readerOptions, recordsPerBatch)
 case "xtab":
     return NewRecordReaderXTAB(readerOptions, recordsPerBatch)
 case "gen":
internal/pkg/input/record_reader_tsv.go (new file, 378 lines)
@@ -0,0 +1,378 @@
+package input
+
+import (
+    "container/list"
+    "fmt"
+    "io"
+    "strconv"
+    "strings"
+
+    "github.com/johnkerl/miller/internal/pkg/cli"
+    "github.com/johnkerl/miller/internal/pkg/lib"
+    "github.com/johnkerl/miller/internal/pkg/mlrval"
+    "github.com/johnkerl/miller/internal/pkg/types"
+)
+
+// recordBatchGetterTSV points to either an explicit-TSV-header or
+// implicit-TSV-header record-batch getter.
+type recordBatchGetterTSV func(
+    reader *RecordReaderTSV,
+    linesChannel <-chan *list.List,
+    filename string,
+    context *types.Context,
+    errorChannel chan error,
+) (
+    recordsAndContexts *list.List,
+    eof bool,
+)
+
+type RecordReaderTSV struct {
+    readerOptions *cli.TReaderOptions
+    recordsPerBatch int64 // distinct from readerOptions.RecordsPerBatch for join/repl
+
+    fieldSplitter iFieldSplitter
+    recordBatchGetter recordBatchGetterTSV
+
+    inputLineNumber int64
+    headerStrings []string
+}
+
+func NewRecordReaderTSV(
+    readerOptions *cli.TReaderOptions,
+    recordsPerBatch int64,
+) (*RecordReaderTSV, error) {
+    if readerOptions.IFS != "\t" {
+        return nil, fmt.Errorf("for TSV, IFS cannot be altered")
+    }
+    if readerOptions.IRS != "\n" && readerOptions.IRS != "\r\n" {
+        return nil, fmt.Errorf("for TSV, IRS cannot be altered; LF vs CR/LF is autodetected")
+    }
+    reader := &RecordReaderTSV{
+        readerOptions: readerOptions,
+        recordsPerBatch: recordsPerBatch,
+        fieldSplitter: newFieldSplitter(readerOptions),
+    }
+    if reader.readerOptions.UseImplicitCSVHeader {
+        reader.recordBatchGetter = getRecordBatchImplicitTSVHeader
+    } else {
+        reader.recordBatchGetter = getRecordBatchExplicitTSVHeader
+    }
+    return reader, nil
+}
+
+func (reader *RecordReaderTSV) Read(
+    filenames []string,
+    context types.Context,
+    readerChannel chan<- *list.List, // list of *types.RecordAndContext
+    errorChannel chan error,
+    downstreamDoneChannel <-chan bool, // for mlr head
+) {
+    if filenames != nil { // nil for mlr -n
+        if len(filenames) == 0 { // read from stdin
+            handle, err := lib.OpenStdin(
+                reader.readerOptions.Prepipe,
+                reader.readerOptions.PrepipeIsRaw,
+                reader.readerOptions.FileInputEncoding,
+            )
+            if err != nil {
+                errorChannel <- err
+                return
+            }
+            reader.processHandle(
+                handle,
+                "(stdin)",
+                &context,
+                readerChannel,
+                errorChannel,
+                downstreamDoneChannel,
+            )
+        } else {
+            for _, filename := range filenames {
+                handle, err := lib.OpenFileForRead(
+                    filename,
+                    reader.readerOptions.Prepipe,
+                    reader.readerOptions.PrepipeIsRaw,
+                    reader.readerOptions.FileInputEncoding,
+                )
+                if err != nil {
+                    errorChannel <- err
+                    return
+                }
+                reader.processHandle(
+                    handle,
+                    filename,
+                    &context,
+                    readerChannel,
+                    errorChannel,
+                    downstreamDoneChannel,
|
||||
)
|
||||
handle.Close()
|
||||
}
|
||||
}
|
||||
}
|
||||
readerChannel <- types.NewEndOfStreamMarkerList(&context)
|
||||
}
|
||||
|
||||
func (reader *RecordReaderTSV) processHandle(
|
||||
handle io.Reader,
|
||||
filename string,
|
||||
context *types.Context,
|
||||
readerChannel chan<- *list.List, // list of *types.RecordAndContext
|
||||
errorChannel chan error,
|
||||
downstreamDoneChannel <-chan bool, // for mlr head
|
||||
) {
|
||||
context.UpdateForStartOfFile(filename)
|
||||
reader.inputLineNumber = 0
|
||||
reader.headerStrings = nil
|
||||
|
||||
recordsPerBatch := reader.recordsPerBatch
|
||||
lineScanner := NewLineScanner(handle, reader.readerOptions.IRS)
|
||||
linesChannel := make(chan *list.List, recordsPerBatch)
|
||||
go channelizedLineScanner(lineScanner, linesChannel, downstreamDoneChannel, recordsPerBatch)
|
||||
|
||||
for {
|
||||
recordsAndContexts, eof := reader.recordBatchGetter(reader, linesChannel, filename, context, errorChannel)
|
||||
if recordsAndContexts.Len() > 0 {
|
||||
readerChannel <- recordsAndContexts
|
||||
}
|
||||
if eof {
|
||||
break
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func getRecordBatchExplicitTSVHeader(
|
||||
reader *RecordReaderTSV,
|
||||
linesChannel <-chan *list.List,
|
||||
filename string,
|
||||
context *types.Context,
|
||||
errorChannel chan error,
|
||||
) (
|
||||
recordsAndContexts *list.List,
|
||||
eof bool,
|
||||
) {
|
||||
recordsAndContexts = list.New()
|
||||
dedupeFieldNames := reader.readerOptions.DedupeFieldNames
|
||||
|
||||
lines, more := <-linesChannel
|
||||
if !more {
|
||||
return recordsAndContexts, true
|
||||
}
|
||||
|
||||
for e := lines.Front(); e != nil; e = e.Next() {
|
||||
line := e.Value.(string)
|
||||
|
||||
reader.inputLineNumber++
|
||||
|
||||
// Check for comments-in-data feature
|
||||
// TODO: function-pointer this away
|
||||
if reader.readerOptions.CommentHandling != cli.CommentsAreData {
|
||||
if strings.HasPrefix(line, reader.readerOptions.CommentString) {
|
||||
if reader.readerOptions.CommentHandling == cli.PassComments {
|
||||
recordsAndContexts.PushBack(types.NewOutputString(line+"\n", context))
|
||||
continue
|
||||
} else if reader.readerOptions.CommentHandling == cli.SkipComments {
|
||||
continue
|
||||
}
|
||||
// else comments are data
|
||||
}
|
||||
}
|
||||
|
||||
if line == "" {
|
||||
// Reset to new schema
|
||||
reader.headerStrings = nil
|
||||
continue
|
||||
}
|
||||
|
||||
fields := reader.fieldSplitter.Split(line)
|
||||
|
||||
if reader.headerStrings == nil {
|
||||
reader.headerStrings = fields
|
||||
// Get data lines on subsequent loop iterations
|
||||
} else {
|
||||
if !reader.readerOptions.AllowRaggedCSVInput && len(reader.headerStrings) != len(fields) {
|
||||
err := fmt.Errorf(
|
||||
"mlr: TSV header/data length mismatch %d != %d "+
|
||||
"at filename %s line %d.\n",
|
||||
len(reader.headerStrings), len(fields), filename, reader.inputLineNumber,
|
||||
)
|
||||
errorChannel <- err
|
||||
return
|
||||
}
|
||||
|
||||
record := mlrval.NewMlrmapAsRecord()
|
||||
if !reader.readerOptions.AllowRaggedCSVInput {
|
||||
for i, field := range fields {
|
||||
field = lib.TSVDecodeField(field)
|
||||
value := mlrval.FromDeferredType(field)
|
||||
_, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], value, dedupeFieldNames)
|
||||
if err != nil {
|
||||
errorChannel <- err
|
||||
return
|
||||
}
|
||||
}
|
||||
} else {
|
||||
nh := int64(len(reader.headerStrings))
|
||||
nd := int64(len(fields))
|
||||
n := lib.IntMin2(nh, nd)
|
||||
var i int64
|
||||
for i = 0; i < n; i++ {
|
||||
field := lib.TSVDecodeField(fields[i])
|
||||
value := mlrval.FromDeferredType(field)
|
||||
_, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], value, dedupeFieldNames)
|
||||
if err != nil {
|
||||
errorChannel <- err
|
||||
return
|
||||
}
|
||||
}
|
||||
if nh < nd {
|
||||
// if header shorter than data: use 1-up itoa keys
|
||||
for i = nh; i < nd; i++ {
|
||||
key := strconv.FormatInt(i+1, 10)
|
||||
field := lib.TSVDecodeField(fields[i])
|
||||
value := mlrval.FromDeferredType(field)
|
||||
_, err := record.PutReferenceMaybeDedupe(key, value, dedupeFieldNames)
|
||||
if err != nil {
|
||||
errorChannel <- err
|
||||
return
|
||||
}
|
||||
}
|
||||
}
|
||||
if nh > nd {
|
||||
// if header longer than data: use "" values
|
||||
for i = nd; i < nh; i++ {
|
||||
record.PutCopy(reader.headerStrings[i], mlrval.VOID)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
context.UpdateForInputRecord()
|
||||
recordsAndContexts.PushBack(types.NewRecordAndContext(record, context))
|
||||
}
|
||||
}
|
||||
|
||||
return recordsAndContexts, false
|
||||
}
|
||||
|
||||
func getRecordBatchImplicitTSVHeader(
|
||||
reader *RecordReaderTSV,
|
||||
linesChannel <-chan *list.List,
|
||||
filename string,
|
||||
context *types.Context,
|
||||
errorChannel chan error,
|
||||
) (
|
||||
recordsAndContexts *list.List,
|
||||
eof bool,
|
||||
) {
|
||||
recordsAndContexts = list.New()
|
||||
dedupeFieldNames := reader.readerOptions.DedupeFieldNames
|
||||
|
||||
lines, more := <-linesChannel
|
||||
if !more {
|
||||
return recordsAndContexts, true
|
||||
}
|
||||
|
||||
for e := lines.Front(); e != nil; e = e.Next() {
|
||||
line := e.Value.(string)
|
||||
|
||||
reader.inputLineNumber++
|
||||
|
||||
// Check for comments-in-data feature
|
||||
// TODO: function-pointer this away
|
||||
if reader.readerOptions.CommentHandling != cli.CommentsAreData {
|
||||
if strings.HasPrefix(line, reader.readerOptions.CommentString) {
|
||||
if reader.readerOptions.CommentHandling == cli.PassComments {
|
||||
recordsAndContexts.PushBack(types.NewOutputString(line+"\n", context))
|
||||
continue
|
||||
} else if reader.readerOptions.CommentHandling == cli.SkipComments {
|
||||
continue
|
||||
}
|
||||
// else comments are data
|
||||
}
|
||||
}
|
||||
|
||||
// This is how to do a chomp:
|
||||
line = strings.TrimRight(line, reader.readerOptions.IRS)
|
||||
|
||||
line = strings.TrimRight(line, "\r")
|
||||
|
||||
if line == "" {
|
||||
// Reset to new schema
|
||||
reader.headerStrings = nil
|
||||
continue
|
||||
}
|
||||
|
||||
fields := reader.fieldSplitter.Split(line)
|
||||
|
||||
if reader.headerStrings == nil {
|
||||
n := len(fields)
|
||||
reader.headerStrings = make([]string, n)
|
||||
for i := 0; i < n; i++ {
|
||||
reader.headerStrings[i] = strconv.Itoa(i + 1)
|
||||
}
|
||||
} else {
|
||||
if !reader.readerOptions.AllowRaggedCSVInput && len(reader.headerStrings) != len(fields) {
|
||||
err := fmt.Errorf(
|
||||
"mlr: TSV header/data length mismatch %d != %d "+
|
||||
"at filename %s line %d.\n",
|
||||
len(reader.headerStrings), len(fields), filename, reader.inputLineNumber,
|
||||
)
|
||||
errorChannel <- err
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
record := mlrval.NewMlrmapAsRecord()
|
||||
if !reader.readerOptions.AllowRaggedCSVInput {
|
||||
for i, field := range fields {
|
||||
field = lib.TSVDecodeField(field)
|
||||
value := mlrval.FromDeferredType(field)
|
||||
_, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], value, dedupeFieldNames)
|
||||
if err != nil {
|
||||
errorChannel <- err
|
||||
return
|
||||
}
|
||||
}
|
||||
} else {
|
||||
nh := int64(len(reader.headerStrings))
|
||||
nd := int64(len(fields))
|
||||
n := lib.IntMin2(nh, nd)
|
||||
var i int64
|
||||
for i = 0; i < n; i++ {
|
||||
field := lib.TSVDecodeField(fields[i])
|
||||
value := mlrval.FromDeferredType(field)
|
||||
_, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], value, dedupeFieldNames)
|
||||
if err != nil {
|
||||
errorChannel <- err
|
||||
return
|
||||
}
|
||||
}
|
||||
if nh < nd {
|
||||
// if header shorter than data: use 1-up itoa keys
|
||||
key := strconv.FormatInt(i+1, 10)
|
||||
field := lib.TSVDecodeField(fields[i])
|
||||
value := mlrval.FromDeferredType(field)
|
||||
_, err := record.PutReferenceMaybeDedupe(key, value, dedupeFieldNames)
|
||||
if err != nil {
|
||||
errorChannel <- err
|
||||
return
|
||||
}
|
||||
}
|
||||
if nh > nd {
|
||||
// if header longer than data: use "" values
|
||||
for i = nd; i < nh; i++ {
|
||||
_, err := record.PutReferenceMaybeDedupe(reader.headerStrings[i], mlrval.VOID.Copy(), dedupeFieldNames)
|
||||
if err != nil {
|
||||
errorChannel <- err
|
||||
return
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
context.UpdateForInputRecord()
|
||||
recordsAndContexts.PushBack(types.NewRecordAndContext(record, context))
|
||||
}
|
||||
|
||||
return recordsAndContexts, false
|
||||
}
|
||||
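The `AllowRaggedCSVInput` branches above pair each data row against the header, keying extra fields with 1-up positional names and filling missing fields with empty values. A minimal standalone sketch of that pairing logic (`pairRagged` is a hypothetical helper, not Miller's API):

```go
package main

import (
	"fmt"
	"strconv"
)

// pairRagged mimics the reader's ragged-input handling: for the overlapping
// prefix, header names key the data; extra data fields get 1-up positional
// keys; extra header names get "" values.
func pairRagged(header, fields []string) map[string]string {
	record := make(map[string]string)
	n := len(header)
	if len(fields) < n {
		n = len(fields)
	}
	for i := 0; i < n; i++ {
		record[header[i]] = fields[i]
	}
	for i := n; i < len(fields); i++ { // header shorter than data
		record[strconv.Itoa(i+1)] = fields[i]
	}
	for i := n; i < len(header); i++ { // header longer than data
		record[header[i]] = ""
	}
	return record
}

func main() {
	header := []string{"a", "b", "c"}
	fmt.Println(pairRagged(header, []string{"1", "2", "3", "4"})) // extra field keyed "4"
	fmt.Println(pairRagged(header, []string{"1"}))                // b and c filled with ""
}
```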
68 internal/pkg/lib/tsv_codec.go Normal file
@@ -0,0 +1,68 @@
package lib

import (
	"bytes"
)

// * https://en.wikipedia.org/wiki/Tab-separated_values
// * https://www.iana.org/assignments/media-types/text/tab-separated-values
// \n for newline,
// \r for carriage return,
// \t for tab,
// \\ for backslash.

// TSVDecodeField is for the TSV record-reader.
func TSVDecodeField(input string) string {
	var buffer bytes.Buffer
	n := len(input)
	for i := 0; i < n; /* increment in loop */ {
		c := input[i]
		if c == '\\' && i < n-1 {
			d := input[i+1]
			if d == '\\' {
				buffer.WriteByte('\\')
				i += 2
			} else if d == 'n' {
				buffer.WriteByte('\n')
				i += 2
			} else if d == 'r' {
				buffer.WriteByte('\r')
				i += 2
			} else if d == 't' {
				buffer.WriteByte('\t')
				i += 2
			} else {
				buffer.WriteByte(c)
				i++
			}
		} else {
			buffer.WriteByte(c)
			i++
		}
	}
	return buffer.String()
}

// TSVEncodeField is for the TSV record-writer.
func TSVEncodeField(input string) string {
	var buffer bytes.Buffer
	for i := range input {
		c := input[i]
		if c == '\\' {
			buffer.WriteByte('\\')
			buffer.WriteByte('\\')
		} else if c == '\n' {
			buffer.WriteByte('\\')
			buffer.WriteByte('n')
		} else if c == '\r' {
			buffer.WriteByte('\\')
			buffer.WriteByte('r')
		} else if c == '\t' {
			buffer.WriteByte('\\')
			buffer.WriteByte('t')
		} else {
			buffer.WriteByte(c)
		}
	}
	return buffer.String()
}
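The codec above implements the four spec-TSV escapes (`\\`, `\n`, `\r`, `\t`), which is what makes encode/decode a lossless round trip. A self-contained sketch of the same scheme (standalone reimplementation for illustration, not Miller's exported API):

```go
package main

import (
	"fmt"
	"strings"
)

// encodeTSVField applies the four spec-TSV escapes. Escaping backslash first
// keeps the encoding prefix-free, so decoding is unambiguous.
func encodeTSVField(s string) string {
	r := strings.NewReplacer("\\", `\\`, "\n", `\n`, "\r", `\r`, "\t", `\t`)
	return r.Replace(s)
}

// decodeTSVField reverses the escapes; unrecognized backslash pairs and a
// trailing lone backslash pass through unchanged, as in the reader codec.
func decodeTSVField(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		if s[i] == '\\' && i+1 < len(s) {
			switch s[i+1] {
			case '\\':
				b.WriteByte('\\')
				i += 2
				continue
			case 'n':
				b.WriteByte('\n')
				i += 2
				continue
			case 'r':
				b.WriteByte('\r')
				i += 2
				continue
			case 't':
				b.WriteByte('\t')
				i += 2
				continue
			}
		}
		b.WriteByte(s[i])
		i++
	}
	return b.String()
}

func main() {
	raw := "abc\r\ndef\tg\\h" // embedded CR/LF, tab, and backslash
	enc := encodeTSVField(raw)
	fmt.Println(enc)                        // single printable line: abc\r\ndef\tg\\h
	fmt.Println(decodeTSVField(enc) == raw) // true: round trip is lossless
}
```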
35 internal/pkg/lib/tsv_codec_test.go Normal file
@@ -0,0 +1,35 @@
package lib

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestTSVDecodeField(t *testing.T) {
	assert.Equal(t, "", TSVDecodeField(""))
	assert.Equal(t, "a", TSVDecodeField("a"))
	assert.Equal(t, "abc", TSVDecodeField("abc"))
	assert.Equal(t, `\`, TSVDecodeField(`\`))
	assert.Equal(t, "\n", TSVDecodeField(`\n`))
	assert.Equal(t, "\r", TSVDecodeField(`\r`))
	assert.Equal(t, "\t", TSVDecodeField(`\t`))
	assert.Equal(t, "\\", TSVDecodeField(`\\`))
	assert.Equal(t, `\n`, TSVDecodeField(`\\n`))
	assert.Equal(t, "\\\n", TSVDecodeField(`\\\n`))
	assert.Equal(t, "abc\r\ndef\r\n", TSVDecodeField(`abc\r\ndef\r\n`))
}

func TestTSVEncodeField(t *testing.T) {
	assert.Equal(t, "", TSVEncodeField(""))
	assert.Equal(t, "a", TSVEncodeField("a"))
	assert.Equal(t, "abc", TSVEncodeField("abc"))
	assert.Equal(t, `\\`, TSVEncodeField(`\`))
	assert.Equal(t, `\n`, TSVEncodeField("\n"))
	assert.Equal(t, `\r`, TSVEncodeField("\r"))
	assert.Equal(t, `\t`, TSVEncodeField("\t"))
	assert.Equal(t, `\\`, TSVEncodeField("\\"))
	assert.Equal(t, `\\n`, TSVEncodeField("\\n"))
	assert.Equal(t, `\\\n`, TSVEncodeField("\\\n"))
	assert.Equal(t, `abc\r\ndef\r\n`, TSVEncodeField("abc\r\ndef\r\n"))
}
@@ -22,6 +22,8 @@ func Create(writerOptions *cli.TWriterOptions) (IRecordWriter, error) {
		return NewRecordWriterNIDX(writerOptions)
	case "pprint":
		return NewRecordWriterPPRINT(writerOptions)
	case "tsv":
		return NewRecordWriterTSV(writerOptions)
	case "xtab":
		return NewRecordWriterXTAB(writerOptions)
	default:
104 internal/pkg/output/record_writer_tsv.go Normal file
@@ -0,0 +1,104 @@
package output

import (
	"bufio"
	"fmt"
	"strings"

	"github.com/johnkerl/miller/internal/pkg/cli"
	"github.com/johnkerl/miller/internal/pkg/colorizer"
	"github.com/johnkerl/miller/internal/pkg/lib"
	"github.com/johnkerl/miller/internal/pkg/mlrval"
)

type RecordWriterTSV struct {
	writerOptions *cli.TWriterOptions
	// For reporting schema changes: we print a newline and the new header
	lastJoinedHeader *string
	// Only write one blank line for schema changes / blank input lines
	justWroteEmptyLine bool
}

func NewRecordWriterTSV(writerOptions *cli.TWriterOptions) (*RecordWriterTSV, error) {
	if writerOptions.OFS != "\t" {
		return nil, fmt.Errorf("for TSV, OFS cannot be altered")
	}
	if writerOptions.ORS != "\n" && writerOptions.ORS != "\r\n" {
		return nil, fmt.Errorf("for TSV, ORS cannot be altered")
	}
	return &RecordWriterTSV{
		writerOptions:      writerOptions,
		lastJoinedHeader:   nil,
		justWroteEmptyLine: false,
	}, nil
}

func (writer *RecordWriterTSV) Write(
	outrec *mlrval.Mlrmap,
	bufferedOutputStream *bufio.Writer,
	outputIsStdout bool,
) {
	// End of record stream: nothing special for this output format
	if outrec == nil {
		return
	}

	if outrec.IsEmpty() {
		if !writer.justWroteEmptyLine {
			bufferedOutputStream.WriteString(writer.writerOptions.ORS)
		}
		joinedHeader := ""
		writer.lastJoinedHeader = &joinedHeader
		writer.justWroteEmptyLine = true
		return
	}

	needToPrintHeader := false
	joinedHeader := strings.Join(outrec.GetKeys(), ",")
	if writer.lastJoinedHeader == nil || *writer.lastJoinedHeader != joinedHeader {
		if writer.lastJoinedHeader != nil {
			if !writer.justWroteEmptyLine {
				bufferedOutputStream.WriteString(writer.writerOptions.ORS)
			}
			writer.justWroteEmptyLine = true
		}
		writer.lastJoinedHeader = &joinedHeader
		needToPrintHeader = true
	}

	if needToPrintHeader && !writer.writerOptions.HeaderlessCSVOutput {
		for pe := outrec.Head; pe != nil; pe = pe.Next {
			bufferedOutputStream.WriteString(
				colorizer.MaybeColorizeKey(
					lib.TSVEncodeField(
						pe.Key,
					),
					outputIsStdout,
				),
			)

			if pe.Next != nil {
				bufferedOutputStream.WriteString(writer.writerOptions.OFS)
			}
		}

		bufferedOutputStream.WriteString(writer.writerOptions.ORS)
	}

	for pe := outrec.Head; pe != nil; pe = pe.Next {
		bufferedOutputStream.WriteString(
			colorizer.MaybeColorizeValue(
				lib.TSVEncodeField(
					pe.Value.String(),
				),
				outputIsStdout,
			),
		)
		if pe.Next != nil {
			bufferedOutputStream.WriteString(writer.writerOptions.OFS)
		}
	}
	bufferedOutputStream.WriteString(writer.writerOptions.ORS)

	writer.justWroteEmptyLine = false
}
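The writer tracks the comma-joined key list (`lastJoinedHeader`) and re-emits a header block, preceded by a blank line, whenever the record's schema changes. A standalone sketch of that schema-change logic (the `pair` type and `renderTSV` are illustrative names, not Miller's API), with encoding and colorization omitted:

```go
package main

import (
	"fmt"
	"strings"
)

// pair is one key/value cell of a record, in insertion order.
type pair struct{ key, value string }

// renderTSV re-emits the header (preceded by a blank line) whenever the
// record's key set differs from the previous record's, as the writer does.
func renderTSV(records [][]pair) string {
	var sb strings.Builder
	last := ""
	for _, rec := range records {
		keys := make([]string, len(rec))
		vals := make([]string, len(rec))
		for i, p := range rec {
			keys[i] = p.key
			vals[i] = p.value
		}
		joined := strings.Join(keys, ",")
		if joined != last {
			if last != "" {
				sb.WriteString("\n") // blank line separates schema blocks
			}
			sb.WriteString(strings.Join(keys, "\t") + "\n")
			last = joined
		}
		sb.WriteString(strings.Join(vals, "\t") + "\n")
	}
	return sb.String()
}

func main() {
	fmt.Print(renderTSV([][]pair{
		{{"a", "1"}, {"b", "2"}},
		{{"a", "3"}, {"b", "4"}},
		{{"x", "5"}}, // schema change: blank line + new header
	}))
}
```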
@@ -365,7 +365,7 @@ FILE-FORMAT FLAGS
--oxtab                  Use XTAB format for output data.
--pprint                 Use PPRINT format for input and output data.
--tsv                    Use TSV format for input and output data.
--tsvlite or -t          Use TSV-lite format for input and output data.
--tsv or -t              Use TSV-lite format for input and output data.
--usv or --usvlite       Use USV format for input and output data.
--xtab                   Use XTAB format for input and output data.
-i {format name}         Use format name for input data. For example: `-i csv`

@@ -687,7 +687,6 @@ SEPARATOR FLAGS
  alignment impossible.
* OPS may be multi-character for XTAB format, in which case alignment is
  disabled.
* TSV is simply CSV using tab as field separator (`--fs tab`).
* FS/PS are ignored for markdown format; RS is used.
* All FS and PS options are ignored for JSON format, since they are not relevant
  to the JSON format.

@@ -742,6 +741,7 @@ SEPARATOR FLAGS
    markdown " "    N/A    "\n"
    nidx     " "    N/A    "\n"
    pprint   " "    N/A    "\n"
    tsv      "\t"   N/A    "\n"
    xtab     "\n"   " "    "\n\n"

--fs {string}            Specify FS for input and output.

@@ -3136,4 +3136,4 @@ SEE ALSO
2022-02-05                     MILLER(1)
2022-02-06                     MILLER(1)
@@ -2,12 +2,12 @@
.\" Title: mlr
.\" Author: [see the "AUTHOR" section]
.\" Generator: ./mkman.rb
.\" Date: 2022-02-05
.\" Date: 2022-02-06
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "MILLER" "1" "2022-02-05" "\ \&" "\ \&"
.TH "MILLER" "1" "2022-02-06" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Portability definitions
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -444,7 +444,7 @@ are overridden in all cases by setting output format to `format2`.
--oxtab                  Use XTAB format for output data.
--pprint                 Use PPRINT format for input and output data.
--tsv                    Use TSV format for input and output data.
--tsvlite or -t          Use TSV-lite format for input and output data.
--tsv or -t              Use TSV-lite format for input and output data.
--usv or --usvlite       Use USV format for input and output data.
--xtab                   Use XTAB format for input and output data.
-i {format name}         Use format name for input data. For example: `-i csv`

@@ -830,7 +830,6 @@ Notes about all other separators:
  alignment impossible.
* OPS may be multi-character for XTAB format, in which case alignment is
  disabled.
* TSV is simply CSV using tab as field separator (`--fs tab`).
* FS/PS are ignored for markdown format; RS is used.
* All FS and PS options are ignored for JSON format, since they are not relevant
  to the JSON format.

@@ -885,6 +884,7 @@ Notes about all other separators:
    markdown " "    N/A    "\en"
    nidx     " "    N/A    "\en"
    pprint   " "    N/A    "\en"
    tsv      "\et"  N/A    "\en"
    xtab     "\en"  " "    "\en\en"

--fs {string}            Specify FS for input and output.
@@ -1,5 +1,5 @@
"a b i x y"
"pan pan 1 0.3467901443380824 0.7268028627434533"
"eks pan 2 0.7586799647899636 0.5221511083334797"
"wye wye 3 0.20460330576630303 0.33831852551664776"
"eks wye 4 0.38139939387114097 0.13418874328430463"
a\tb\ti\tx\ty
pan\tpan\t1\t0.3467901443380824\t0.7268028627434533
eks\tpan\t2\t0.7586799647899636\t0.5221511083334797
wye\twye\t3\t0.20460330576630303\t0.33831852551664776
eks\twye\t4\t0.38139939387114097\t0.13418874328430463

1 test/cases/io-spec-tsv/0001/cmd Normal file
@@ -0,0 +1 @@
mlr --itsv --ojson cat ${CASEDIR}/data.tsv

2 test/cases/io-spec-tsv/0001/data.tsv Normal file
@@ -0,0 +1,2 @@
a\tb,c\nd,e
1\r2,3\\4,5

0 test/cases/io-spec-tsv/0001/experr Normal file

5 test/cases/io-spec-tsv/0001/expout Normal file
@@ -0,0 +1,5 @@
[
  {
    "a\\tb,c\\nd,e": "1\r2,3\\4,5"
  }
]

1 test/cases/io-spec-tsv/0002/cmd Normal file
@@ -0,0 +1 @@
mlr --ijson --otsv cat ${CASEDIR}/data.json

5 test/cases/io-spec-tsv/0002/data.json Normal file
@@ -0,0 +1,5 @@
[
  {
    "a\\tb,c\\nd,e": "1\r2,3\\4,5"
  }
]

0 test/cases/io-spec-tsv/0002/experr Normal file

2 test/cases/io-spec-tsv/0002/expout Normal file
@@ -0,0 +1,2 @@
a\\tb,c\\nd,e
1\r2,3\\4,5
19 todo.txt
@@ -2,26 +2,41 @@
RELEASES

* plan 6.1.0
  ! IANA-TSV w/ \{X}
  ? w/ natural sort order
  ? strptime
  ? datediff et al.
  ? mlr join --left-fields a,b,c
  ? rank
  ? ?foo and ??foo @ repl help
  o fmt/unfmt/regex doc
  o FAQ/examples reorg
  k default colors; bold/underline/reverse
  k array concat
  k format/unformat
  k split
  k split verb
  k slwin & shift-lead
  m unicode string literals
  k 0o.. octal literals in the DSL
  k codeql/codespell/goreleaser binaries/zips
  k :rb
  k ?foo and ??foo @ repl help
  k doc-improves
* plan 6.2.0
  ? YAML

================================================================
FEATURES

----------------------------------------------------------------
TSV etc

? also: some escapes perhaps for dkvp, xtab, pprint -- ?
o nidx is a particular pure-text, leave-as-is
  ? try out nidx single-line w/ \001, \002 FS/PS & \n or \n\n RS
  o make/publicize a shorthand for this -- ?
  o --words && --lines & --paragraphs -- ?
* still need csv --lazy-quotes

----------------------------------------------------------------
* natural sort order
  https://github.com/facette/natsort