miller/docs6/docs/reference-main-compressed-data.md.in
2021-09-06 21:57:15 -04:00

109 lines
4.7 KiB
Markdown

# Compressed data
As of [Miller 6](new-in-miller-6.md), Miller supports reading GZIP, BZIP2, and
ZLIB formats transparently, and in-process. And (as before Miller 6) you have a
more general `--prepipe` option to support other decompression programs.
## Automatic detection on input
If your files end in `.gz`, `.bz2`, or `.z` then Miller will autodetect by file extension:
GENMD_CARDIFY_HIGHLIGHT_ONE
file gz-example.csv.gz
gz-example.csv.gz: gzip compressed data, was "gz-example.csv", last modified: Mon Aug 23 02:04:34 2021, from Unix, original size modulo 2^32 429
GENMD_EOF
GENMD_RUN_COMMAND
mlr --csv sort -f color gz-example.csv.gz
GENMD_EOF
This will decompress the input data on the fly, while leaving the disk file unmodified. This helps you save disk space, at the cost of some additional runtime CPU usage to decompress the data.
## Manual detection on input
If the filename doesn't in in `.gz`, `.bz2`, or `.z` then you can use the flags `--gzin`, `--bz2in`, or `--zin` to let Miller know:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --csv --gzin sort -f color myfile.bin # myfile.bin has gzip contents
GENMD_EOF
## External decompressors on input
Using the `--prepipe` flag, you can provide the name of any decompression
program in your `$PATH` and Miller will run it on each input file, effectively
piping the standard output of that program to Miller's standard input.
You can, of course, already do without this for single input files, for example:
GENMD_RUN_COMMAND
gunzip < gz-example.csv.gz | mlr --csv sort -f color
GENMD_EOF
The benefit of `--prepipe` is that Miller will run the specified program once per
file, respecting file boundaries.
The prepipe command can be anything which reads from standard input and produces
data acceptable to Miller. Nominally this allows you to use whichever
decompression utilities you have installed on your system, on a per-file basis.
If the command has flags, quote them: e.g. `mlr --prepipe 'zcat -cf'`.
In your [.mlrrc file](customization.md), `--prepipe` and `--prepipex` are not
allowed as they could be used for unexpected code execution. You can use
`--prepipe-bz2`, `--prepipe-gunzip`, and `--prepipe-zcat` in `.mlrrc`, though.
Note that this feature is quite general and is not limited to decompression
utilities. You can use it to apply per-file filters of your choice: e.g. `mlr
--prepipe head -n 10 ...`, if you like.
There is a `--prepipe` and a `--prepipex`:
* If the command normally runs with `nameofprogram < filename.ext` (such as `gunzip` or `zcat -cf` or `xz -cd`) then use `--prepipe`.
* If the command normally runs with `nameofprogram filename.ext` (such as `unzip -qc`) then use `--prepipex`.
Lastly, note that if `--prepipe` or `--prepipex` is specified on the Miller
command line, it replaces any autodetect decisions that might have been made
based on the filename extension. Likewise, `--gzin`/`--bz2in`/`--zin` are ignored if
`--prepipe` or `--prepipex` is also specified.
## Compressed output
Everything said so far on this page has to do with compressed input.
For compressed output:
* Normally Miller output is to stdout, so you can pipe the output: `mlr sort -n quantity foo.csv | gzip > sorted.csv.gz`.
* For [`tee` statements](reference-dsl-output-statements.md#tee-statements), which write output to files rather than stdout, use `tee`'s redirect syntax:
<!---
gzip by default puts timestamp into the header, so every regen of these *.md.in files makes a modified
file, which is annoying for version control. That can be suppressed by using 'gzip -n' but then that's
confusing for the reader, who has no need for -n. We handle this by making this code sample non-live.
--->
GENMD_CARDIFY_HIGHLIGHT_FOUR
mlr --from example.csv --csv put -q '
filename = $color.".csv.gz";
tee | "gzip > ".filename, $*
'
GENMD_EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
file red.csv.gz purple.csv.gz yellow.csv.gz
red.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 185
purple.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 164
yellow.csv.gz: gzip compressed data, last modified: Mon Aug 23 02:34:05 2021, from Unix, original size modulo 2^32 158
GENMD_EOF
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --csv cat yellow.csv.gz
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
yellow,circle,true,8,73,63.9785,4.2370
yellow,circle,true,9,87,63.5058,8.3350
GENMD_EOF
* Using the [in-place flag](reference-main-io-options.md#in-place-mode) `-I`,
as of August 2021 the overwritten file will _not_ be compressed as it was when it was read:
e.g. `mlr -I --csv cat gz-example.csv.gz` will write `gz-example.csv.gz` which contains
a plain, uncompressed CSV contents. This is a bug and will be fixed.