mirror of https://github.com/johnkerl/miller.git synced 2026-01-23 10:15:36 +00:00

John Kerl d09de40b75 valgrind findings		2017-06-12 23:51:39 -04:00
byte_reader.h	zcat iterate	2015-12-13 17:55:37 -07:00
byte_readers.h	[read performance iterate] byte-reader iterate	2015-09-04 09:19:33 -04:00
file_ingestor_stdio.c	windows-port iterate	2017-03-05 10:39:06 -05:00
file_ingestor_stdio.h	valgrind findings	2016-04-29 19:12:25 -04:00
file_reader_mmap.c	windows-port iterate	2017-03-05 11:32:16 -05:00
file_reader_mmap.h	zcat iterate	2015-12-13 17:55:37 -07:00
file_reader_stdio.c	windows-port iterate	2017-03-05 10:39:06 -05:00
file_reader_stdio.h	zcat iterate	2015-12-13 17:55:37 -07:00
json_parser.c	autodetect iterate: lrec readers	2017-01-30 19:07:52 -05:00
json_parser.h	json-leak debug	2016-04-23 12:00:08 -04:00
line_readers.c	valgrind findings	2017-06-12 23:12:55 -04:00
line_readers.h	new line-readers -> getlines performance comparator	2017-03-13 17:52:40 -04:00
lrec_reader.h	json-input iterate	2016-01-31 23:40:52 -05:00
lrec_reader_in_memory.c	neaten	2016-05-25 07:33:59 -04:00
lrec_reader_mmap_csv.c	Strip UTF-8 BOM (0xefbbbf) from start of CSV files	2017-04-30 22:14:14 -04:00
lrec_reader_mmap_csvlite.c	neaten	2017-02-01 21:16:28 -05:00
lrec_reader_mmap_dkvp.c	neaten	2017-02-01 21:16:28 -05:00
lrec_reader_mmap_json.c	json-array-to-map iterate	2017-03-16 09:13:10 -04:00
lrec_reader_mmap_nidx.c	neaten	2017-02-01 21:16:28 -05:00
lrec_reader_mmap_xtab.c	neaten	2017-02-01 21:22:13 -05:00
lrec_reader_stdio_csv.c	valgrind findings	2017-06-12 23:51:39 -04:00
lrec_reader_stdio_csvlite.c	line-readers iterate	2017-03-12 23:24:36 -04:00
lrec_reader_stdio_dkvp.c	line-readers iterate	2017-03-12 23:08:30 -04:00
lrec_reader_stdio_json.c	json-array-to-map iterate	2017-03-16 09:13:10 -04:00
lrec_reader_stdio_nidx.c	line-readers iterate	2017-03-12 23:12:59 -04:00
lrec_reader_stdio_xtab.c	line-readers iterate	2017-03-12 23:32:01 -04:00
lrec_readers.c	json-array-to-map iterate	2017-03-16 09:13:10 -04:00
lrec_readers.h	json-array-to-map iterate	2017-03-16 09:13:10 -04:00
Makefile.am	JSON stdio iterate	2016-02-04 17:27:48 -05:00
Makefile.in	Include ./configure et al.: output from autoreconf -fiv	2017-04-27 15:11:19 -04:00
mlr_json_adapter.c	valgrind findings	2017-04-13 23:33:36 -04:00
mlr_json_adapter.h	json-array-to-map iterate	2017-03-16 09:13:10 -04:00
mmap_byte_reader.c	windows-port iterate	2017-03-05 11:32:16 -05:00
peek_file_reader.c	move manual tests to unit tests: checkpoint	2015-09-11 18:37:04 -04:00
peek_file_reader.h	windows-port iterate	2017-03-07 22:51:44 -05:00
README.md	multi-char input separators for mmap DKVP	2015-09-16 23:54:24 -04:00
stdio_byte_reader.c	windows-port iterate	2017-03-05 11:10:13 -05:00
string_byte_reader.c	neaten	2016-05-25 07:33:59 -04:00

README.md

Miller file/record input

These are readers for Miller file formats, in stdio and mmap versions. The stdio and mmap record parsers are similar but not identical, because they process input in opposite orders: the stdio readers get an entire malloc-allocated line and then split it by separators, whereas the mmap readers split by separators while discovering end of line. The code duplication could be largely removed by having the mmap readers find end of line first and then split up the line; however, that requires two passes through the input strings, and for performance I want just a single pass.
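To make the two processing orders concrete, here is a minimal sketch. It is not Miller's code: the function names, the fixed-size `fields` array, and the single-character separators are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper (not Miller's API): stdio-style order. The caller has
// already read a complete, null-terminated line, so this loop is a second
// pass over the same bytes. Separators are overwritten with '\0'. Returns the
// number of fields stored; fields beyond max_fields are dropped.
int split_line_in_place(char* line, char ifs, char** fields, int max_fields) {
    assert(max_fields > 0);
    int n = 0;
    fields[n++] = line;
    for (char* p = line; *p; p++) {
        if (*p == ifs) {
            *p = 0;
            if (n == max_fields)
                break;
            fields[n++] = p + 1;
        }
    }
    return n;
}

// Hypothetical helper: mmap-style order. A single pass over [*pp, eof) looks
// for field separators and the record separator at the same time. The buffer
// must be writable. On return, *pp points just past the consumed record.
// A final record lacking a trailing irs is not null-terminated here; a real
// reader has to handle that case.
int split_next_record(char** pp, char* eof, char ifs, char irs,
                      char** fields, int max_fields) {
    assert(max_fields > 0);
    char* p = *pp;
    int n = 0;
    if (p >= eof)
        return 0; // no more input
    fields[n++] = p;
    for (; p < eof; p++) {
        if (*p == irs) {
            *p = 0;
            p++;
            break;
        } else if (*p == ifs) {
            *p = 0;
            if (n < max_fields)
                fields[n++] = p + 1;
        }
    }
    *pp = p;
    return n;
}
```

In the first function, finding end of line has already cost one scan before the split begins; in the second, every byte is examined once.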

While there are separate record-writers for CSV and pretty-print, there is just one common record-reader: pretty-print input is read as CSV with a space as the field separator and allow_repeat_ifs set to true.
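A rough sketch of why one reader can serve both formats is shown below. Only the `allow_repeat_ifs` name comes from the text above; the function `split_fields`, its signature, and its handling of leading and trailing padding are assumptions for illustration, not Miller's implementation.

```c
#include <assert.h>

// Hypothetical sketch: split a null-terminated line in place on ifs.
// When allow_repeat_ifs is nonzero, a run of consecutive separators counts
// as a single separator, so column-aligned pretty-print text can be read
// like space-separated CSV. Returns the number of fields stored.
int split_fields(char* line, char ifs, int allow_repeat_ifs,
                 char** fields, int max_fields) {
    assert(max_fields > 0);
    int n = 0;
    char* p = line;
    if (allow_repeat_ifs)
        while (*p == ifs)
            p++; // skip leading padding
    fields[n++] = p;
    while (*p) {
        if (*p != ifs) {
            p++;
            continue;
        }
        *p++ = 0; // terminate the current field
        if (allow_repeat_ifs) {
            while (*p == ifs)
                p++; // collapse the rest of the run
            if (*p == 0)
                break; // trailing padding: no empty final field
        }
        if (n == max_fields)
            break;
        fields[n++] = p;
    }
    return n;
}
```

With `allow_repeat_ifs` set to 0 and a comma as `ifs`, the same function keeps empty fields, which is the behavior CSV needs.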

The idea behind header_keeper objects for CSV: each header_keeper object retains the input-line backing and the slls_t for a CSV header line, which is shared by one or more CSV data lines. Meanwhile, some mappers (e.g. sort, tac) retain input records from the entire data stream, which may include header-schema changes. This means headers must be kept intact for as long as any lrecs point to them. One option is reference counting, which I experimented with; it was messy and error-prone. The approach used here is to keep a hash map from header schema to header_keeper object; the current pheader_keeper is a pointer to one of those. When the reader is freed, all its header_keepers are freed.
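The ownership scheme can be sketched roughly as follows. The types and function names are simplified, hypothetical stand-ins rather than Miller's actual definitions, and a linked list keyed by the header text is swapped in for the hash map to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical, simplified stand-in for a header_keeper: it owns the
// backing storage for one header line.
typedef struct header_keeper {
    char* line;                 // copy of the header line (the schema key)
    struct header_keeper* next;
} header_keeper_t;

// Hypothetical reader state. The list stands in for the hash map from
// header schema to header_keeper.
typedef struct reader_state {
    header_keeper_t* keepers;   // every header seen so far
    header_keeper_t* pcurrent;  // header for the data lines now being read
} reader_state_t;

// Look up a header by its text; create and retain a keeper on first sight.
// Returns NULL on allocation failure.
header_keeper_t* reader_set_header(reader_state_t* r, const char* header_line) {
    assert(r != NULL && header_line != NULL);
    for (header_keeper_t* k = r->keepers; k != NULL; k = k->next) {
        if (strcmp(k->line, header_line) == 0) {
            r->pcurrent = k;
            return k;
        }
    }
    header_keeper_t* k = malloc(sizeof *k);
    if (k == NULL)
        return NULL;
    size_t len = strlen(header_line) + 1;
    k->line = malloc(len);
    if (k->line == NULL) {
        free(k);
        return NULL;
    }
    memcpy(k->line, header_line, len);
    k->next = r->keepers;
    r->keepers = k;
    r->pcurrent = k;
    return k;
}

// Free all keepers in one place, when the reader itself is freed.
void reader_free_headers(reader_state_t* r) {
    assert(r != NULL);
    header_keeper_t* k = r->keepers;
    while (k != NULL) {
        header_keeper_t* next = k->next;
        free(k->line);
        free(k);
        k = next;
    }
    r->keepers = NULL;
    r->pcurrent = NULL;
}
```

No per-record bookkeeping is needed in this scheme: records only borrow pointers into a keeper, and every keeper lives until the reader is freed.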

There is some code duplication involving single-character and multi-character IRS, IFS, and IPS. While single-character is a special case of multi-character, keeping separate implementations for the single-character and multi-character versions is worthwhile for performance. The difference is between *p == ifs and streqn(p, ifs, ifslen): even with function inlining, the latter is more expensive than the former in the single-character case.
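The two inner loops look roughly like the sketch below. The function names are hypothetical, and the standard `strncmp` stands in for Miller's `streqn`, whose definition is not shown in this README.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Single-character case: one byte comparison per input position.
int count_fields_single(const char* p, char ifs) {
    int n = 1;
    for (; *p; p++) {
        if (*p == ifs)
            n++;
    }
    return n;
}

// Multi-character case: a bounded string comparison per input position.
// strncmp is an assumption standing in for streqn.
int count_fields_multi(const char* p, const char* ifs, size_t ifslen) {
    assert(ifslen > 0);
    int n = 1;
    while (*p) {
        if (strncmp(p, ifs, ifslen) == 0) {
            n++;
            p += ifslen; // skip the whole separator
        } else {
            p++;
        }
    }
    return n;
}
```

Calling `count_fields_multi` with `ifslen` equal to 1 gives the same answer as `count_fields_single`, but pays for a function-style comparison at every byte.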

Example timing info for a million-line file is as follows:

TIME IN SECONDS 0.945 -- mlr --irs lf   --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.139 -- mlr --irs crlf --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.291 -- mlr --irs lf   --ifs /, --ips =: check ../data/big.dkvp2
TIME IN SECONDS 1.443 -- mlr --irs crlf --ifs /, --ips =: check ../data/big.dkvp2

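In these runs the single-character versions take about 27% (lf) and 21% (crlf) less time than their multi-character counterparts. That is, even when averaged over multiple runs, special-casing the single-character-separator code yields performance improvements of 20-30%, which makes it worth doing.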