mirror of https://github.com/johnkerl/miller.git synced 2026-01-23 18:25:45 +00:00

History

John Kerl 001321a508 miller 5.6.2		2019-09-21 20:06:13 -04:00
..
auxents	autoreconf -fiv	2019-09-07 13:11:32 -04:00
cli	Fix "interger" spelling	2019-09-13 08:48:47 +02:00
containers	autoreconf -fiv	2019-09-07 13:11:32 -04:00
dsl	rough draft of system DSL function	2019-09-11 21:53:16 -04:00
experimental	STRIDX -> TOKEN in CSV readers	2019-09-21 10:53:57 -04:00
input	Merge branch 'master' of git+ssh://github.com/johnkerl/miller	2019-09-21 19:54:37 -04:00
lib	fix memory leak in system DSL function	2019-09-12 21:57:38 -04:00
mapping	autoreconf -fiv	2019-09-07 13:11:32 -04:00
msys2	windows line endings, take two	2017-03-05 14:27:41 -05:00
output	autoreconf -fiv	2019-09-07 13:11:32 -04:00
parsing	autoreconf -fiv	2019-09-07 13:11:32 -04:00
reg_test	regression test for system DSL function	2019-09-12 21:00:24 -04:00
stream	autoreconf -fiv	2019-09-07 13:11:32 -04:00
tools	fix travis build; neaten	2018-01-06 12:51:09 -05:00
unit_test	STRIDX -> TOKEN in CSV readers	2019-09-21 10:53:57 -04:00
.gitignore	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
.vimrc	split up cst_statement_alloc	2016-03-15 09:33:25 -04:00
asanmk	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
camake	directory renames	2016-12-02 20:07:02 -05:00
cmake	directory renames	2016-12-02 20:07:02 -05:00
csv-rfc.txt	neaten	2015-08-25 22:57:20 -04:00
draft-release-notes.md	draft-release-notes.md	2019-09-12 21:48:44 -04:00
Makefile.am	appveyor iterate: -static	2017-07-02 21:06:16 -04:00
Makefile.in	autoreconf -fiv	2019-09-07 13:11:32 -04:00
Makefile.no-autoconfig	fix label verb for overlap between old and new names	2019-09-02 11:33:20 -04:00
Makefile.windows	windows makefile for outside of autoconfig	2017-03-13 22:40:46 -04:00
mlrmain.c	aux is a reserved name on windows; smh	2017-04-14 21:47:41 -04:00
mlrvers.h	miller 5.6.2	2019-09-21 20:06:13 -04:00
oo	regression-testing for positional-indexing feature	2019-08-28 22:30:13 -04:00
perf-mand.txt	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
pre-travis.sh	subpackage-dependency-DAG iterate	2015-09-23 09:08:34 -04:00
push2	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
pushl	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
README.md	braced-oosvar-name doc	2016-05-06 10:14:04 -07:00
regdiff	fix travis build	2018-02-10 12:34:26 -05:00
stdlib.mlr	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
todo.txt	doc neatens	2019-09-18 22:52:32 -04:00
vgrun	neaten	2016-07-09 14:47:16 -04:00
winpatch.diff	windows-port iterate	2017-04-08 01:00:53 -04:00
winrun.bat	winpatch	2017-04-07 23:55:10 -04:00
winstall.txt	windows-port notes	2017-04-08 01:10:47 -04:00

README.md

Data flow

Miller data flow is records produced by a record-reader in input/, followed by one or more mappers in mapping/, written by a record-writer in output/, controlled by logic in stream/. Argument parsing for initial stream setup is in cli/.

Container names

The user-visible concept of stream record (or srec) is implemented in the lrec_t (linked-record type) data structure. The user-visible concept of out-of-stream variables is implemented using the mlhmmv_t (multi-level hashmap of mlrvals) structure. Source-code comments and names within the code refer to srec/lrec and oosvar/mlhmmv depending on the context.

While those two data structures contain user-visible data structures, others are used in Miller's implementation: slls and sllv are singly-linked lists of string and void-star respectively; lhmss is a linked hashmap from string to string; lhmsi is a linked hashmap from string to int; and so on.

Memory management

Miller is streaming and as near stateless as possible. For most Miller functions, you can ingest a 20GB file with 4GB RAM, no problem. For example, mlr cat of a DKVP file retains no data in memory from one line to another; mlr cat of a CSV file retains only the field names from the header line. The stats1 and stats2 commands retain only aggregation state (e.g. count and sum over specified fields needed to compute mean of specified fields). The mlr tac and mlr sort commands, obviously, need to consume and retain all input records before emitting any output records.

Miller classes are in general modular, following a constructor/destructor model with minimal dependencies between classes. As a general rule, void-star payloads (sllv, lhmslv) must be freed by the callee (which has access to the data type) whereas non-void-star payloads (slls, hss) are freed by the container class.

One complication is for free-flags in lrec and slls: the idea is that an entire line is mallocked and presented by the record reader; then individual fields are split out and populated into linked list or records. To reduce the amount of strduping there, free-flags are used to track which fields should be freed by the destructor and which are freed elsewhere.

The header_keeper object is an elaboration on this theme: suppose there is a CSV file with header line a,b,c and data lines 1,2,3, then 4,5,6, then 7,8,9. Then the keys a, b, and c are shared between all three records; they are retained in a single header_keeper object.

A bigger complication to the otherwise modular nature of Miller is its baton-passing memory-management model. Namely, one class may be responsible for freeing memory allocated by another class.

For example, using mlr cat: The record-reader produces records and returns pointers to them. The record-mapper is just a pass-through; it returns the record-pointers it receives. The record-writer formats the records to stdout and does not return them, so it is responsible for freeing them.

Similarly, mlr cut -x and any other mappers which modify record objects without creating new ones. By contrast,stats1 et al. produce their own records; they free what they do not pass on.

Null-lrec conventions

Record-readers return a null lrec-pointer to signify end of input stream.

Each mapper takes an lrec-pointer as input and returns a linked list of lrec-pointer.

Null-lrec is input to mappers to signify end of stream: e.g. sort or tac should use this as a signal to deliver the sorted/reversed list of rows.

When a mapper has no output before end of stream (e.g. sort or tac while accumulating inputs) it returns a null lrec-pointer which is treated as synonymous with returning an empty list.

At end of stream, a mapper returns a linked list of records ending in a null lrec-pointer.

A null lrec-pointer at end of stream is passed to lrec writers so that they may produce final output (e.g. pretty-print which produces no output until end of stream).

Performance optimizations

The initial implementation of Miller used lhmss (insertion-ordered string-to-string hash map) for record objects. Keys and values were strduped out of file-input lines. Each of the following produced from 5 to 30 percent performance gains:

The lrec object is a hashless map suited to low access-to-creation ratio. See detailed comments in https://github.com/johnkerl/miller/blob/master/c/containers/lrec.h.
Free-flags as discussed above removed additional occurrences of string copies.
Using mmap to read files gets rid of double passes on record parsing (one to find end of line, and another to separate fields) as well as most use of malloc. Note however that standard input cannot be mmapped, so both record-reader options are retained.