miller/c
2019-09-21 20:06:13 -04:00
auxents autoreconf -fiv 2019-09-07 13:11:32 -04:00
cli Fix "interger" spelling 2019-09-13 08:48:47 +02:00
containers autoreconf -fiv 2019-09-07 13:11:32 -04:00
dsl rough draft of system DSL function 2019-09-11 21:53:16 -04:00
experimental STRIDX -> TOKEN in CSV readers 2019-09-21 10:53:57 -04:00
input Merge branch 'master' of git+ssh://github.com/johnkerl/miller 2019-09-21 19:54:37 -04:00
lib fix memory leak in system DSL function 2019-09-12 21:57:38 -04:00
mapping autoreconf -fiv 2019-09-07 13:11:32 -04:00
msys2 windows line endings, take two 2017-03-05 14:27:41 -05:00
output autoreconf -fiv 2019-09-07 13:11:32 -04:00
parsing autoreconf -fiv 2019-09-07 13:11:32 -04:00
reg_test regression test for system DSL function 2019-09-12 21:00:24 -04:00
stream autoreconf -fiv 2019-09-07 13:11:32 -04:00
tools fix travis build; neaten 2018-01-06 12:51:09 -05:00
unit_test STRIDX -> TOKEN in CSV readers 2019-09-21 10:53:57 -04:00
.gitignore draft 5.1.0 release notes 2017-04-14 07:23:24 -04:00
.vimrc split up cst_statement_alloc 2016-03-15 09:33:25 -04:00
asanmk draft 5.1.0 release notes 2017-04-14 07:23:24 -04:00
camake directory renames 2016-12-02 20:07:02 -05:00
cmake directory renames 2016-12-02 20:07:02 -05:00
csv-rfc.txt neaten 2015-08-25 22:57:20 -04:00
draft-release-notes.md draft-release-notes.md 2019-09-12 21:48:44 -04:00
Makefile.am appveyor iterate: -static 2017-07-02 21:06:16 -04:00
Makefile.in autoreconf -fiv 2019-09-07 13:11:32 -04:00
Makefile.no-autoconfig fix label verb for overlap between old and new names 2019-09-02 11:33:20 -04:00
Makefile.windows windows makefile for outside of autoconfig 2017-03-13 22:40:46 -04:00
mlrmain.c aux is a reserved name on windows; smh 2017-04-14 21:47:41 -04:00
mlrvers.h miller 5.6.2 2019-09-21 20:06:13 -04:00
oo regression-testing for positional-indexing feature 2019-08-28 22:30:13 -04:00
perf-mand.txt draft 5.1.0 release notes 2017-04-14 07:23:24 -04:00
pre-travis.sh subpackage-dependency-DAG iterate 2015-09-23 09:08:34 -04:00
push2 draft 5.1.0 release notes 2017-04-14 07:23:24 -04:00
pushl draft 5.1.0 release notes 2017-04-14 07:23:24 -04:00
README.md braced-oosvar-name doc 2016-05-06 10:14:04 -07:00
regdiff fix travis build 2018-02-10 12:34:26 -05:00
stdlib.mlr draft 5.1.0 release notes 2017-04-14 07:23:24 -04:00
todo.txt doc neatens 2019-09-18 22:52:32 -04:00
vgrun neaten 2016-07-09 14:47:16 -04:00
winpatch.diff windows-port iterate 2017-04-08 01:00:53 -04:00
winrun.bat winpatch 2017-04-07 23:55:10 -04:00
winstall.txt windows-port notes 2017-04-08 01:10:47 -04:00

Data flow

In Miller, records are produced by a record-reader in input/, passed through one or more mappers in mapping/, and written by a record-writer in output/. The whole flow is controlled by logic in stream/. Argument parsing for initial stream setup is in cli/.
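The flow above can be sketched roughly as follows. This is a simplified illustration, not Miller's actual code: record_t, reader_fn, mapper_fn, writer_fn, run_stream, and the example reader, mappers, and writer are all hypothetical names, and each mapper here returns at most one record rather than a list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a record; Miller's real record type is lrec_t. */
typedef struct record { int value; } record_t;

/* Reader: returns the next record, or NULL at end of input. */
typedef record_t* (*reader_fn)(void* state);
/* Mapper, simplified to one-in/at-most-one-out: NULL means "dropped". */
typedef record_t* (*mapper_fn)(record_t* rec);
/* Writer: consumes one record. */
typedef void (*writer_fn)(record_t* rec, void* state);

/* Example reader state: walks a caller-owned array of records. */
typedef struct array_source {
    record_t* records;
    size_t count;
    size_t next;
} array_source_t;

static record_t* array_reader(void* state) {
    array_source_t* src = state;
    return (src->next < src->count) ? &src->records[src->next++] : NULL;
}

static record_t* double_mapper(record_t* rec) {
    rec->value *= 2;
    return rec;
}

static record_t* keep_multiples_of_four(record_t* rec) {
    return (rec->value % 4 == 0) ? rec : NULL;
}

static void sum_writer(record_t* rec, void* state) {
    *(int*)state += rec->value;
}

/* Driver in the role of stream/: reader -> mapper chain -> writer.
 * Returns the number of records that reached the writer. */
static int run_stream(reader_fn reader, void* reader_state,
                      mapper_fn* mappers, size_t mapper_count,
                      writer_fn writer, void* writer_state) {
    assert(reader != NULL && writer != NULL);
    int written = 0;
    record_t* rec;
    while ((rec = reader(reader_state)) != NULL) {
        for (size_t i = 0; i < mapper_count && rec != NULL; i++)
            rec = mappers[i](rec);
        if (rec != NULL) {
            writer(rec, writer_state);
            written++;
        }
    }
    return written;
}
```

Calling run_stream with array_reader, the two example mappers, and sum_writer pulls one record at a time through the chain, so nothing is retained between records unless a mapper chooses to keep it.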

Container names

The user-visible concept of stream record (or srec) is implemented in the lrec_t (linked-record type) data structure. The user-visible concept of out-of-stream variables is implemented using the mlhmmv_t (multi-level hashmap of mlrvals) structure. Source-code comments and names within the code refer to srec/lrec and oosvar/mlhmmv depending on the context.

While those two data structures contain user-visible data structures, others are used in Miller's implementation: slls and sllv are singly-linked lists of string and void-star respectively; lhmss is a linked hashmap from string to string; lhmsi is a linked hashmap from string to int; and so on.

Memory management

Miller is streaming and as near stateless as possible. For most Miller functions, you can ingest a 20GB file with 4GB RAM, no problem. For example, mlr cat of a DKVP file retains no data in memory from one line to another; mlr cat of a CSV file retains only the field names from the header line. The stats1 and stats2 commands retain only aggregation state (e.g. count and sum over specified fields needed to compute mean of specified fields). The mlr tac and mlr sort commands, obviously, need to consume and retain all input records before emitting any output records.

Miller classes are in general modular, following a constructor/destructor model with minimal dependencies between classes. As a general rule, void-star payloads (sllv, lhmslv) must be freed by the callee (which has access to the data type) whereas non-void-star payloads (slls, hss) are freed by the container class.

One complication is the free-flags in lrec and slls. The idea is that an entire line is malloced and presented by the record reader; individual fields are then split out and populated into linked lists or records. To reduce the amount of strduping there, free-flags track which fields should be freed by the destructor and which are freed elsewhere.
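A minimal sketch of the free-flag idea, assuming a bitmask per field. The names here (field_t, record_t, FREE_KEY, FREE_VALUE, record_put, record_free, dup_string) are hypothetical and chosen for illustration; the real layout is in containers/lrec.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag bits: which strings does the record own? */
#define FREE_KEY   0x01
#define FREE_VALUE 0x02

typedef struct field {
    char* key;
    char* value;
    unsigned char free_flags;
    struct field* next;
} field_t;

typedef struct record {
    field_t* head;
    field_t* tail;
    int field_count;
} record_t;

/* Heap copy of a string, for keys/values the record should own. */
static char* dup_string(const char* s) {
    size_t n = strlen(s) + 1;
    char* copy = malloc(n);
    assert(copy != NULL);
    memcpy(copy, s, n);
    return copy;
}

static record_t* record_alloc(void) {
    record_t* rec = calloc(1, sizeof *rec);
    assert(rec != NULL);
    return rec;
}

/* Appends a field; free_flags says which of key/value the record owns. */
static void record_put(record_t* rec, char* key, char* value,
                       unsigned char free_flags) {
    field_t* f = malloc(sizeof *f);
    assert(f != NULL);
    f->key = key;
    f->value = value;
    f->free_flags = free_flags;
    f->next = NULL;
    if (rec->tail == NULL) rec->head = f; else rec->tail->next = f;
    rec->tail = f;
    rec->field_count++;
}

/* Frees the record and only the strings it owns.
 * Returns how many strings were freed, purely for illustration. */
static int record_free(record_t* rec) {
    int freed = 0;
    field_t* f = rec->head;
    while (f != NULL) {
        field_t* next = f->next;
        if (f->free_flags & FREE_KEY)   { free(f->key);   freed++; }
        if (f->free_flags & FREE_VALUE) { free(f->value); freed++; }
        free(f);
        f = next;
    }
    free(rec);
    return freed;
}
```

A field whose key and value point into the reader's line buffer is stored with flags 0, so no copy is made and the destructor leaves those pointers alone; a key produced by dup_string is stored with FREE_KEY so the destructor releases it.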

The header_keeper object is an elaboration on this theme: suppose there is a CSV file with header line a,b,c and data lines 1,2,3, then 4,5,6, then 7,8,9. Then the keys a, b, and c are shared between all three records; they are retained in a single header_keeper object.
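The sharing described above might look like the following sketch. It is an assumption-laden illustration rather than the actual header_keeper code: the struct fields, row_t, header_keeper_alloc, header_keeper_free, and row_key are hypothetical, and the split is a naive comma split with no quoting support.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header keeper: owns one copy of the header line and
 * an array of pointers to the keys inside it. */
typedef struct header_keeper {
    char* line;   /* owned; commas are replaced by NULs */
    char** keys;  /* owned array; elements point into line */
    int nkeys;
} header_keeper_t;

/* A data row borrows its keys from the header keeper. */
typedef struct row {
    header_keeper_t* header;  /* shared, not owned */
    const char** values;      /* caller-owned in this sketch */
} row_t;

static header_keeper_t* header_keeper_alloc(const char* header_line) {
    assert(header_line != NULL);
    header_keeper_t* hk = malloc(sizeof *hk);
    assert(hk != NULL);
    size_t len = strlen(header_line);
    hk->line = malloc(len + 1);
    assert(hk->line != NULL);
    memcpy(hk->line, header_line, len + 1);

    /* One more key than there are commas. */
    hk->nkeys = 1;
    for (size_t i = 0; i < len; i++)
        if (hk->line[i] == ',') hk->nkeys++;
    hk->keys = malloc((size_t)hk->nkeys * sizeof *hk->keys);
    assert(hk->keys != NULL);

    /* Split in place: each comma becomes a NUL terminator. */
    int k = 0;
    hk->keys[k++] = hk->line;
    for (size_t i = 0; i < len; i++) {
        if (hk->line[i] == ',') {
            hk->line[i] = '\0';
            hk->keys[k++] = &hk->line[i + 1];
        }
    }
    return hk;
}

/* Freed once, after the last row that refers to it is done. */
static void header_keeper_free(header_keeper_t* hk) {
    free(hk->keys);
    free(hk->line);
    free(hk);
}

static const char* row_key(const row_t* r, int i) {
    assert(i >= 0 && i < r->header->nkeys);
    return r->header->keys[i];
}
```

With header line a,b,c, every row built against the same keeper returns the identical pointer for a given key, so the keys are stored once no matter how many data lines follow.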

A bigger complication to the otherwise modular nature of Miller is its baton-passing memory-management model. Namely, one class may be responsible for freeing memory allocated by another class.

For example, using mlr cat: The record-reader produces records and returns pointers to them. The record-mapper is just a pass-through; it returns the record-pointers it receives. The record-writer formats the records to stdout and does not return them, so it is responsible for freeing them.

The same holds for mlr cut -x and any other mappers which modify record objects without creating new ones. By contrast, stats1 et al. produce their own records; they free what they do not pass on.
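The baton-passing model can be illustrated with a small sketch that counts live records. All names here (record_t, live_records, passthrough_mapper, sum_mapper, writer_consume) are hypothetical, and the aggregating mapper is only a stand-in for the kind of work stats1 does.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct record { int value; } record_t;

/* Number of records allocated and not yet freed. */
static int live_records = 0;

static record_t* record_alloc(int value) {
    record_t* rec = malloc(sizeof *rec);
    assert(rec != NULL);
    rec->value = value;
    live_records++;
    return rec;
}

static void record_free(record_t* rec) {
    free(rec);
    live_records--;
}

/* Pass-through mapper (in the spirit of cat): hands the same pointer
 * on and frees nothing, so the baton moves to the next stage. */
static record_t* passthrough_mapper(record_t* rec) {
    return rec;
}

/* Aggregating mapper: it does not pass its inputs on, so it frees
 * them. At end of stream (rec == NULL) it emits a record of its own. */
static record_t* sum_mapper(int* running_sum, record_t* rec) {
    if (rec != NULL) {
        *running_sum += rec->value;
        record_free(rec);
        return NULL;
    }
    return record_alloc(*running_sum);
}

/* Writer: the last holder of the baton, so it frees the record.
 * Returns the value it "wrote". */
static int writer_consume(record_t* rec) {
    int value = rec->value;
    record_free(rec);
    return value;
}
```

In both paths every record is freed exactly once, but by different stages: the writer frees what the pass-through mapper forwards, while the aggregating mapper frees its own inputs and the writer frees only the summary record.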

Null-lrec conventions

Record-readers return a null lrec-pointer to signify end of input stream.

Each mapper takes an lrec-pointer as input and returns a linked list of lrec-pointers.

A null lrec-pointer is passed as input to mappers to signify end of stream: e.g. sort or tac should use this as a signal to deliver the sorted/reversed list of rows.

When a mapper has no output before end of stream (e.g. sort or tac while accumulating inputs), it returns a null lrec-pointer, which is treated as synonymous with returning an empty list.

At end of stream, a mapper returns a linked list of records ending in a null lrec-pointer.

A null lrec-pointer at end of stream is passed to lrec writers so that they may produce final output (e.g. pretty-print, which produces no output until end of stream).
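As an illustration of these conventions, here is a sketch of a tac-like mapper. The types and function names (record_t, node_t, tac_state_t, tac_process, list_push_front, list_free) are hypothetical; Miller's own mappers use lrec_t and sllv rather than these stand-ins.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct record { int value; } record_t;

/* List node carrying a record pointer; rec == NULL marks end of stream. */
typedef struct node {
    record_t* rec;
    struct node* next;
} node_t;

typedef struct tac_state {
    node_t* retained;  /* records seen so far, newest first */
} tac_state_t;

static node_t* list_push_front(node_t* head, record_t* rec) {
    node_t* n = malloc(sizeof *n);
    assert(n != NULL);
    n->rec = rec;
    n->next = head;
    return n;
}

static void list_free(node_t* head) {
    while (head != NULL) {
        node_t* next = head->next;
        free(head);
        head = next;
    }
}

/* While accumulating, returns NULL (treated as an empty list).
 * On a NULL input (end of stream), returns the retained records in
 * reverse arrival order, followed by a node holding a NULL record. */
static node_t* tac_process(tac_state_t* state, record_t* rec) {
    if (rec != NULL) {
        state->retained = list_push_front(state->retained, rec);
        return NULL;
    }
    node_t* marker = list_push_front(NULL, NULL);
    node_t* out = state->retained;
    state->retained = NULL;
    if (out == NULL)
        return marker;
    node_t* tail = out;
    while (tail->next != NULL)
        tail = tail->next;
    tail->next = marker;
    return out;
}
```

The trailing node with a NULL record is what lets the next stage, whether another mapper or the writer, see end of stream in turn.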

Performance optimizations

The initial implementation of Miller used lhmss (insertion-ordered string-to-string hash map) for record objects. Keys and values were strduped out of file-input lines. Each of the following produced from 5 to 30 percent performance gains:

  • The lrec object is a hashless map suited to a low access-to-creation ratio. See detailed comments in https://github.com/johnkerl/miller/blob/master/c/containers/lrec.h.
  • Free-flags as discussed above removed additional occurrences of string copies.
  • Using mmap to read files gets rid of double passes on record parsing (one to find end of line, and another to separate fields) as well as most use of malloc. Note however that standard input cannot be mmapped, so both record-reader options are retained.
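To make the first point concrete, a hashless map can be as simple as a linked list searched linearly. The sketch below uses hypothetical names (field_t, fields_get) and is not the lrec implementation; it only shows why creation is cheap when lookups are rare.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical field node: no hash codes, no buckets, just a list
 * in insertion order. */
typedef struct field {
    const char* key;
    const char* value;
    struct field* next;
} field_t;

/* Linear search by key: O(number of fields) per lookup, but building
 * the record needs no hashing and no table allocation. */
static const char* fields_get(const field_t* head, const char* key) {
    assert(key != NULL);
    for (const field_t* f = head; f != NULL; f = f->next) {
        if (strcmp(f->key, key) == 0)
            return f->value;
    }
    return NULL;
}
```

When most records are created, touched in a few fields at most, and then written out, the cost of the linear scan tends to be smaller than the cost of hashing every key on creation.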