mirror of https://github.com/johnkerl/miller.git synced 2026-01-23 10:15:36 +00:00

John Kerl 60a7857dca update tar.gz path in miller.spec		2021-11-16 22:56:26 -05:00
byte_reader.h	zcat iterate	2015-12-13 17:55:37 -07:00
byte_readers.h	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
file_ingestor_stdio.c	windows-port iterate	2017-03-05 10:39:06 -05:00
file_ingestor_stdio.h	valgrind findings	2016-04-29 19:12:25 -04:00
file_reader_stdio.c	windows-port iterate	2017-03-05 10:39:06 -05:00
file_reader_stdio.h	zcat iterate	2015-12-13 17:55:37 -07:00
json_parser.c	fix old buffer overrun in C JSON parser	2020-11-23 22:43:39 -05:00
json_parser.h	fix 32-bit signed int at one spot in the Miller JSON reader	2020-07-06 17:45:55 -04:00
line_readers.c	autodetect <-> skip/pass-comments testing results	2018-01-06 15:23:05 -05:00
line_readers.h	fix line-counting for comments in stdio-csvlite	2018-01-05 21:12:22 -05:00
lrec_reader.h	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
lrec_reader_gen.c	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
lrec_reader_in_memory.c	neaten	2016-05-25 07:33:59 -04:00
lrec_reader_stdio_csv.c	fix rssle handling in XSV reader after comments line has more fields than data line	2021-02-23 22:01:14 -05:00
lrec_reader_stdio_csvlite.c	valgrind findings	2020-03-16 22:54:56 -04:00
lrec_reader_stdio_dkvp.c	pass-comments iterate	2018-01-01 22:25:37 -05:00
lrec_reader_stdio_json.c	fix 32-bit signed int at one spot in the Miller JSON reader	2020-07-06 17:45:55 -04:00
lrec_reader_stdio_nidx.c	pass-comments iterate	2018-01-01 22:25:37 -05:00
lrec_reader_stdio_xtab.c	pass-comments iterate	2018-01-01 22:25:37 -05:00
lrec_readers.c	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
lrec_readers.h	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
Makefile.am	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
Makefile.in	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
mlr_json_adapter.c	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
mlr_json_adapter.h	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
peek_file_reader.c	move manual tests to unit tests: checkpoint	2015-09-11 18:37:04 -04:00
peek_file_reader.h	windows-port iterate	2017-03-07 22:51:44 -05:00
README.md	multi-char input separators for mmap DKVP	2015-09-16 23:54:24 -04:00
stdio_byte_reader.c	windows-port iterate	2017-03-05 11:10:13 -05:00
string_byte_reader.c	neaten	2016-05-25 07:33:59 -04:00

README.md

Miller file/record input

These are readers for Miller file formats, in stdio and mmap versions. The stdio and mmap record parsers are similar but not identical, because they process input in opposite orders: the stdio readers get an entire mallocked line and then split it by separators, whereas the mmap readers split while discovering the end of line. The code duplication could be largely removed by having the mmap readers first find each end of line and then split the line. However, that requires two passes through each input string, and for performance I want just a single pass.
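
The two processing orders described above can be sketched as follows. This is only an illustration under simplifying assumptions (single-character separators, a null-terminated buffer, in-place splitting); the function names `split_two_pass` and `split_single_pass` are hypothetical and are not taken from the Miller source.

```c
#include <stddef.h>
#include <string.h>

// Two-pass order (in the spirit of the stdio readers): first find the end of
// the line, then split the null-terminated line on the field separator.
// Returns the number of fields; *pnext is set to the start of the next line.
// Hypothetical helper for illustration, not Miller's actual API.
static int split_two_pass(char* buf, char ifs, char irs,
    char** fields, int max_fields, char** pnext)
{
    // Pass 1: locate the record separator and terminate the line there.
    char* eol = strchr(buf, irs);
    if (eol != NULL) {
        *eol = '\0';
        *pnext = eol + 1;
    } else {
        *pnext = buf + strlen(buf);
    }

    // Pass 2: split the line in place.
    int n = 0;
    if (n < max_fields)
        fields[n++] = buf;
    for (char* p = buf; *p != '\0'; p++) {
        if (*p == ifs) {
            *p = '\0';
            if (n < max_fields)
                fields[n++] = p + 1;
        }
    }
    return n;
}

// Single-pass order (in the spirit of the mmap readers): split on the field
// separator while scanning for the record separator, touching each byte once.
static int split_single_pass(char* buf, char ifs, char irs,
    char** fields, int max_fields, char** pnext)
{
    int n = 0;
    if (n < max_fields)
        fields[n++] = buf;
    char* p = buf;
    for (; *p != '\0'; p++) {
        if (*p == irs) {
            // End of record found during the same scan.
            *p = '\0';
            *pnext = p + 1;
            return n;
        }
        if (*p == ifs) {
            *p = '\0';
            if (n < max_fields)
                fields[n++] = p + 1;
        }
    }
    *pnext = p;
    return n;
}
```

Both functions produce the same fields for the same input; the difference is only how many times each byte of the line is visited.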

While there are separate record-writers for CSV and pretty-print, there is just one common record-reader: pretty-print input is read as CSV with the field separator set to a space and allow_repeat_ifs set to true.
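
As a rough illustration of what a repeated-separator flag changes, here is a minimal field splitter. It is a sketch, not the Miller reader: the name `split_fields` and its signature are hypothetical, and the choice to skip leading and trailing separators when repeats are allowed is an assumption made for this example.

```c
// Split a null-terminated line in place on a single-character separator.
// When allow_repeat_ifs is nonzero, a run of separators counts as one, and
// leading/trailing separators are treated as padding (an assumption of this
// sketch). Returns the number of fields found.
static int split_fields(char* line, char ifs, int allow_repeat_ifs,
    char** fields, int max_fields)
{
    int n = 0;
    char* p = line;

    if (allow_repeat_ifs) {
        while (*p == ifs)
            p++;
    }
    if (n < max_fields)
        fields[n++] = p;

    while (*p != '\0') {
        if (*p == ifs) {
            *p++ = '\0';
            if (allow_repeat_ifs) {
                // Collapse the rest of the run of separators.
                while (*p == ifs)
                    p++;
                // Trailing padding does not start a new field.
                if (*p == '\0')
                    break;
            }
            if (n < max_fields)
                fields[n++] = p;
        } else {
            p++;
        }
    }
    return n;
}
```

With the flag off, `a,,c` splits into three fields with an empty middle field, as expected for CSV. With the flag on and a space separator, `a   b  c` also splits into three fields, which is what aligned pretty-print columns need.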

The idea of header_keeper objects for CSV is as follows. Each header_keeper object retains the input-line backing and the slls_t for a CSV header line, which is used by one or more CSV data lines. Meanwhile, some mappers (e.g. sort, tac) retain input records from the entire data stream, which may include header-schema changes. This means we need to keep headers intact as long as any lrecs are pointing to them. One option is reference counting, which I experimented with, but it was messy and error-prone. The approach used here is to keep a hash map from header-schema to header_keeper object. The current pheader_keeper is a pointer to one of those. Then, when the reader is freed, all its header_keeper objects are freed.
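
The ownership scheme above can be sketched in a few lines. This is a simplified illustration, not the Miller implementation: the types `header_keeper_t` and `reader_t` and the functions `reader_set_header` and `reader_free` are hypothetical, the header is stored as a plain string rather than an slls_t, and a linked list stands in for the hash map.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for a header_keeper: owns the header-line backing.
typedef struct header_keeper {
    char* header_line;           // owned copy of the header text
    struct header_keeper* next;  // chain of all keepers owned by the reader
} header_keeper_t;

// Hypothetical reader: owns every keeper it has ever created.
typedef struct reader {
    header_keeper_t* keepers;    // list standing in for the schema-to-keeper map
    header_keeper_t* current;    // keeper for the most recently seen header
} reader_t;

// Look up the keeper for this header schema, creating it on first sight.
// Records may point at the returned keeper for the reader's whole lifetime.
static header_keeper_t* reader_set_header(reader_t* r, const char* header_line)
{
    for (header_keeper_t* k = r->keepers; k != NULL; k = k->next) {
        if (strcmp(k->header_line, header_line) == 0) {
            r->current = k;
            return k;
        }
    }

    header_keeper_t* k = malloc(sizeof *k);
    if (k == NULL)
        return NULL;
    size_t len = strlen(header_line) + 1;
    k->header_line = malloc(len);
    if (k->header_line == NULL) {
        free(k);
        return NULL;
    }
    memcpy(k->header_line, header_line, len);
    k->next = r->keepers;
    r->keepers = k;
    r->current = k;
    return k;
}

// Free all keepers at once when the reader goes away; no per-record
// reference counts are needed. Returns the number of keepers freed.
static int reader_free(reader_t* r)
{
    int nfreed = 0;
    header_keeper_t* k = r->keepers;
    while (k != NULL) {
        header_keeper_t* next = k->next;
        free(k->header_line);
        free(k);
        k = next;
        nfreed++;
    }
    r->keepers = NULL;
    r->current = NULL;
    return nfreed;
}
```

The trade-off in this design is that headers live until the reader is freed even if no record still uses them, in exchange for much simpler bookkeeping than reference counting.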

There is some code duplication involving single-character and multi-character IRS, IFS, and IPS. While single-character is a special case of multi-character, keeping separate implementations for the single-character and multi-character versions is worthwhile for performance. The difference is between *p == ifs and streqn(p, ifs, ifslen): even with function inlining, the latter is more expensive than the former in the single-character case.
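
To make the two comparison styles concrete, here is a small sketch that counts fields both ways. The function names are hypothetical, and the standard `strncmp` is used as a stand-in for Miller's streqn, whose exact definition is not shown here.

```c
#include <stddef.h>
#include <string.h>

// Single-character separator: one byte comparison per input character.
static int count_fields_single(const char* line, char ifs)
{
    int n = 1;
    for (const char* p = line; *p != '\0'; p++) {
        if (*p == ifs)
            n++;
    }
    return n;
}

// Multi-character separator: a bounded string comparison per input
// character. strncmp stands in for streqn (an assumption of this sketch).
// ifslen must be at least 1.
static int count_fields_multi(const char* line, const char* ifs, size_t ifslen)
{
    int n = 1;
    const char* p = line;
    while (*p != '\0') {
        if (strncmp(p, ifs, ifslen) == 0) {
            n++;
            p += ifslen;  // skip the whole separator
        } else {
            p++;
        }
    }
    return n;
}
```

The multi-character version also handles a one-character separator correctly, which is why the single-character version is purely a performance specialization.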

Example timing info for a million-line file is as follows:

TIME IN SECONDS 0.945 -- mlr --irs lf   --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.139 -- mlr --irs crlf --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.291 -- mlr --irs lf   --ifs /, --ips =: check ../data/big.dkvp2
TIME IN SECONDS 1.443 -- mlr --irs crlf --ifs /, --ips =: check ../data/big.dkvp2

That is, even when averaged over multiple runs, special-casing the single-character-separator code reduces run time by 20-30% (0.945 versus 1.291 seconds with LF line endings, and 1.139 versus 1.443 seconds with CRLF). This is worth doing.