miller/c/input
2017-06-12 23:51:39 -04:00
byte_reader.h zcat iterate 2015-12-13 17:55:37 -07:00
byte_readers.h [read performance iterate] byte-reader iterate 2015-09-04 09:19:33 -04:00
file_ingestor_stdio.c windows-port iterate 2017-03-05 10:39:06 -05:00
file_ingestor_stdio.h valgrind findings 2016-04-29 19:12:25 -04:00
file_reader_mmap.c windows-port iterate 2017-03-05 11:32:16 -05:00
file_reader_mmap.h zcat iterate 2015-12-13 17:55:37 -07:00
file_reader_stdio.c windows-port iterate 2017-03-05 10:39:06 -05:00
file_reader_stdio.h zcat iterate 2015-12-13 17:55:37 -07:00
json_parser.c autodetect iterate: lrec readers 2017-01-30 19:07:52 -05:00
json_parser.h json-leak debug 2016-04-23 12:00:08 -04:00
line_readers.c valgrind findings 2017-06-12 23:12:55 -04:00
line_readers.h new line-readers -> getlines performance comparator 2017-03-13 17:52:40 -04:00
lrec_reader.h json-input iterate 2016-01-31 23:40:52 -05:00
lrec_reader_in_memory.c neaten 2016-05-25 07:33:59 -04:00
lrec_reader_mmap_csv.c Strip UTF-8 BOM (0xefbbbf) from start of CSV files 2017-04-30 22:14:14 -04:00
lrec_reader_mmap_csvlite.c neaten 2017-02-01 21:16:28 -05:00
lrec_reader_mmap_dkvp.c neaten 2017-02-01 21:16:28 -05:00
lrec_reader_mmap_json.c json-array-to-map iterate 2017-03-16 09:13:10 -04:00
lrec_reader_mmap_nidx.c neaten 2017-02-01 21:16:28 -05:00
lrec_reader_mmap_xtab.c neaten 2017-02-01 21:22:13 -05:00
lrec_reader_stdio_csv.c valgrind findings 2017-06-12 23:51:39 -04:00
lrec_reader_stdio_csvlite.c line-readers iterate 2017-03-12 23:24:36 -04:00
lrec_reader_stdio_dkvp.c line-readers iterate 2017-03-12 23:08:30 -04:00
lrec_reader_stdio_json.c json-array-to-map iterate 2017-03-16 09:13:10 -04:00
lrec_reader_stdio_nidx.c line-readers iterate 2017-03-12 23:12:59 -04:00
lrec_reader_stdio_xtab.c line-readers iterate 2017-03-12 23:32:01 -04:00
lrec_readers.c json-array-to-map iterate 2017-03-16 09:13:10 -04:00
lrec_readers.h json-array-to-map iterate 2017-03-16 09:13:10 -04:00
Makefile.am JSON stdio iterate 2016-02-04 17:27:48 -05:00
Makefile.in Include ./configure et al.: output from autoreconf -fiv 2017-04-27 15:11:19 -04:00
mlr_json_adapter.c valgrind findings 2017-04-13 23:33:36 -04:00
mlr_json_adapter.h json-array-to-map iterate 2017-03-16 09:13:10 -04:00
mmap_byte_reader.c windows-port iterate 2017-03-05 11:32:16 -05:00
peek_file_reader.c move manual tests to unit tests: checkpoint 2015-09-11 18:37:04 -04:00
peek_file_reader.h windows-port iterate 2017-03-07 22:51:44 -05:00
README.md multi-char input separators for mmap DKVP 2015-09-16 23:54:24 -04:00
stdio_byte_reader.c windows-port iterate 2017-03-05 11:10:13 -05:00
string_byte_reader.c neaten 2016-05-25 07:33:59 -04:00

Miller file/record input

These are readers for Miller file formats, in stdio and mmap versions. The stdio and mmap record parsers are similar but not identical, because they process input in opposite orders: the stdio readers get an entire malloc'ed line and then split it by separators, whereas the mmap readers split fields while discovering the end of line. The code duplication could largely be removed by having the mmap readers find end-of-lines and then split up the lines; however, that requires two passes through the input strings, and for performance I want just a single pass.
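To make the single-pass idea concrete, here is a minimal sketch of splitting a DKVP line while scanning for the end of line. It is not the actual Miller code: `kv_pair_t` and `split_line_single_pass` are hypothetical names, the separators are assumed to be single characters, and the buffer is assumed to be writable so that NUL terminators can be written in place.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical pair type for this sketch only.
typedef struct {
    char* key;
    char* value;
} kv_pair_t;

// Walks the buffer once, NUL-terminating keys and values in place while
// looking for the record separator. Returns the number of pairs stored and
// sets *pnext to the byte just past the line. A final line with no record
// separator before eof is not emitted; that case is omitted for brevity.
static int split_line_single_pass(char* p, char* eof, char irs, char ifs, char ips,
    kv_pair_t* pairs, int max_pairs, char** pnext)
{
    assert(p != NULL && eof != NULL && p <= eof);
    int n = 0;
    char* key = p;
    char* value = NULL;
    while (p < eof) {
        char c = *p;
        if (c == ips && value == NULL) {
            *p = 0;                 // terminate the key
            value = p + 1;
        } else if (c == ifs || c == irs) {
            *p = 0;                 // terminate the value
            if (n < max_pairs) {
                pairs[n].key = key;
                // No pair separator seen: use the empty string at p.
                pairs[n].value = value ? value : p;
                n++;
            }
            key = p + 1;
            value = NULL;
            if (c == irs) {
                p++;                // step past the record separator
                break;
            }
        }
        p++;
    }
    *pnext = p;
    return n;
}
```

A two-pass variant would first scan for the record separator and then rescan the same bytes for field and pair separators; the loop above touches each byte once.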

While there are separate record-writers for CSV and pretty-print, there is just one common record-reader: pretty-print input is read as CSV with the field separator being a space and allow_repeat_ifs set to true.
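The effect of allow_repeat_ifs can be illustrated with a small field splitter. This is a sketch under assumptions, not the Miller reader itself: `split_fields` is a hypothetical helper, the separator is a single character, and the line is split in place.

```c
#include <assert.h>
#include <stddef.h>

// Splits a NUL-terminated line in place on a single-character separator.
// When allow_repeat_ifs is nonzero, a run of separators counts as one and
// leading or trailing runs produce no fields. Returns the field count.
static int split_fields(char* line, char ifs, int allow_repeat_ifs,
    char** fields, int max_fields)
{
    assert(line != NULL && fields != NULL && ifs != 0);
    int n = 0;
    char* p = line;
    if (allow_repeat_ifs) {
        while (*p == ifs)
            p++;                    // skip leading padding
    }
    if (*p == 0)
        return 0;                   // nothing left to split
    while (n < max_fields) {
        fields[n++] = p;
        while (*p != 0 && *p != ifs)
            p++;
        if (*p == 0)
            break;
        *p++ = 0;                   // terminate this field
        if (allow_repeat_ifs) {
            while (*p == ifs)
                p++;                // collapse the rest of the run
            if (*p == 0)
                break;              // trailing padding only
        }
    }
    return n;
}
```

With a space separator and allow_repeat_ifs set, the column padding in pretty-printed input collapses to single field boundaries; with a comma and allow_repeat_ifs clear, two adjacent commas delimit an empty field.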

The idea behind header_keeper objects for CSV: each header_keeper object retains the input-line backing and the slls_t for a CSV header line which is used by one or more CSV data lines. Meanwhile, some mappers (e.g. sort, tac) retain input records from the entire data stream, which may include header-schema changes in the input stream. This means we need to keep headers intact as long as any lrecs are pointing to them. One option is reference-counting, which I experimented with; it was messy and error-prone. The approach used here is to keep a hash map from header-schema to header_keeper object. The reader's current pheader_keeper points to one of those. Then when the reader is freed, all its header-keepers are freed.
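The ownership scheme can be sketched as follows. This is only an illustration: `keeper_t`, `reader_state_t`, and both functions are hypothetical names, a linked list stands in for the hash map to keep the example short, and the header is kept as a plain string where the real header_keeper holds an slls_t.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for a header_keeper: owns the header text.
typedef struct keeper {
    char* header_line;
    struct keeper* next;
} keeper_t;

typedef struct {
    keeper_t* keepers;   // every header schema seen so far
    keeper_t* current;   // keeper for the most recent header line
} reader_state_t;

// Returns the keeper for this header, creating one the first time the
// schema is seen. Records may point into a keeper for as long as the
// reader lives; no per-record reference counts are needed.
static keeper_t* reader_set_header(reader_state_t* r, const char* header_line) {
    assert(r != NULL && header_line != NULL);
    for (keeper_t* k = r->keepers; k != NULL; k = k->next) {
        if (strcmp(k->header_line, header_line) == 0) {
            r->current = k;
            return k;
        }
    }
    keeper_t* k = malloc(sizeof *k);
    if (k == NULL)
        abort();
    size_t len = strlen(header_line) + 1;
    k->header_line = malloc(len);
    if (k->header_line == NULL)
        abort();
    memcpy(k->header_line, header_line, len);
    k->next = r->keepers;
    r->keepers = k;
    r->current = k;
    return k;
}

// Frees every keeper at once; intended to run only when the reader is freed.
static void reader_free_keepers(reader_state_t* r) {
    keeper_t* k = r->keepers;
    while (k != NULL) {
        keeper_t* next = k->next;
        free(k->header_line);
        free(k);
        k = next;
    }
    r->keepers = NULL;
    r->current = NULL;
}
```

If the input switches from one header to another and back again, the second switch reuses the first keeper, so records retained from before the change still point at valid header storage.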

There is some code duplication involving single-character and multi-character IRS, IFS, and IPS. While single-character is a special case of multi-character, keeping separate implementations for single-character and multi-character versions is worthwhile for performance. The difference is between *p == ifs and streqn(p, ifs, ifslen): even with function inlining, the latter is more expensive than the former in the single-character case.
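The two inner loops can be contrasted with a minimal sketch. The function names are hypothetical, and the standard `strncmp` is used here as a stand-in for streqn, whose definition is not shown in this README.

```c
#include <assert.h>
#include <string.h>

// Single-character separator: one byte comparison per input byte.
static int count_fields_single(const char* line, char ifs) {
    assert(line != NULL);
    int n = 1;
    for (const char* p = line; *p != 0; p++) {
        if (*p == ifs)
            n++;
    }
    return n;
}

// Multi-character separator: a bounded string comparison per input byte.
// This also handles a one-character separator, but does more work per byte.
static int count_fields_multi(const char* line, const char* ifs) {
    assert(line != NULL && ifs != NULL && ifs[0] != 0);
    size_t ifslen = strlen(ifs);
    int n = 1;
    const char* p = line;
    while (*p != 0) {
        if (strncmp(p, ifs, ifslen) == 0) {
            n++;
            p += ifslen;            // skip the whole separator
        } else {
            p++;
        }
    }
    return n;
}
```

Both functions give the same count for a one-character separator; the only difference is the cost of the comparison in the inner loop.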

Example timing info for a million-line file is as follows:

TIME IN SECONDS 0.945 -- mlr --irs lf   --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.139 -- mlr --irs crlf --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.291 -- mlr --irs lf   --ifs /, --ips =: check ../data/big.dkvp2
TIME IN SECONDS 1.443 -- mlr --irs crlf --ifs /, --ips =: check ../data/big.dkvp2

That is, even when averaged over multiple runs, special-casing the single-character-separator code yields performance improvements of 20-30%, which is worth doing.