mirror of https://github.com/johnkerl/miller.git synced 2026-01-23 02:14:13 +00:00

John Kerl fdaf8f88ec Fix #1108 (#1129 )		2022-11-25 23:05:21 -05:00
auxents	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
cli	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
containers	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
dsl	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
experimental	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
input	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
lib	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
mapping	Fix #1108 (#1129 )	2022-11-25 23:05:21 -05:00
msys2	windows line endings, take two	2017-03-05 14:27:41 -05:00
output	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
parsing	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
reg_test	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
stream	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
tools	fix travis build; neaten	2018-01-06 12:51:09 -05:00
u	update mandelbrot-benchmark source files	2021-02-23 23:00:57 -05:00
unit_test	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
.gitignore	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
.vimrc	split up cst_statement_alloc	2016-03-15 09:33:25 -04:00
asanmk	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
camake	directory renames	2016-12-02 20:07:02 -05:00
cmake	directory renames	2016-12-02 20:07:02 -05:00
csv-rfc.txt	neaten	2015-08-25 22:57:20 -04:00
draft-release-notes.md	draft-release-notes.md	2019-09-12 21:48:44 -04:00
Makefile.am	remove configure.ac refs to doc/Makefile (now docs)	2020-10-27 08:18:09 -04:00
Makefile.in	update tar.gz path in miller.spec	2021-11-16 22:56:26 -05:00
Makefile.no-autoconfig	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
Makefile.windows	remove mmap-readers, which were high-maintenance and not able to be used when most needed	2020-01-26 10:21:31 -05:00
mlrmain.c	aux is a reserved name on windows; smh	2017-04-14 21:47:41 -04:00
mlrvers.h	autoreconf.fiv; manpage & docs w/ 5.10.3	2021-11-16 22:33:46 -05:00
oo	regression-testing for positional-indexing feature	2019-08-28 22:30:13 -04:00
perf-mand.txt	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
pre-travis.sh	subpackage-dependency-DAG iterate	2015-09-23 09:08:34 -04:00
pushl	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
README.md	Merge branch 'master' of https://github.com/johnkerl/miller	2020-12-31 11:39:16 -05:00
regdiff	fix travis build	2018-02-10 12:34:26 -05:00
stdlib.mlr	draft 5.1.0 release notes	2017-04-14 07:23:24 -04:00
todo.txt	whitespace to trigger travis	2020-08-17 13:57:02 -04:00
vgrun	mlr put -s foo=bar doc	2020-03-07 23:19:58 -05:00
winpatch.diff	windows-port iterate	2017-04-08 01:00:53 -04:00
winrun.bat	winpatch	2017-04-07 23:55:10 -04:00
winstall.txt	windows-port notes	2017-04-08 01:10:47 -04:00

README.md

Data flow

In Miller's data flow, records are produced by a record-reader in input/, transformed by one or more mappers in mapping/, and written by a record-writer in output/, with the whole sequence controlled by logic in stream/. Argument parsing for initial stream setup is in cli/.
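
As a rough illustration of that flow, here is a minimal C sketch. Every name in it (reader_t, mapper_fn, writer_t, run_stream) is a hypothetical stand-in rather than one of Miller's actual types, records are reduced to plain strings, and each mapper returns a single record or NULL instead of the list of records that Miller's mappers return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not Miller's real types. */
typedef struct {
    const char **lines; /* pretend input: one record per line */
    size_t count;
    size_t next;
} reader_t;

/* In this sketch a mapper returns the record to pass on, or NULL to drop it. */
typedef const char *(*mapper_fn)(const char *rec);

typedef struct {
    char buf[256]; /* pretend output sink; must start zero-filled */
} writer_t;

/* Returns the next record, or NULL at end of input. */
const char *reader_next(reader_t *r) {
    return r->next < r->count ? r->lines[r->next++] : NULL;
}

const char *map_pass_through(const char *rec) { return rec; }

const char *map_drop_comments(const char *rec) {
    return rec[0] == '#' ? NULL : rec;
}

/* Appends the record and a newline to the sink, truncating if full. */
void writer_put(writer_t *w, const char *rec) {
    size_t used = strlen(w->buf);
    snprintf(w->buf + used, sizeof w->buf - used, "%s\n", rec);
}

/* The stream loop: reader, then each mapper in order, then writer. */
size_t run_stream(reader_t *r, mapper_fn *chain, size_t nchain, writer_t *w) {
    assert(r != NULL && w != NULL);
    size_t written = 0;
    const char *rec;
    while ((rec = reader_next(r)) != NULL) {
        for (size_t i = 0; i < nchain && rec != NULL; i++)
            rec = chain[i](rec);
        if (rec != NULL) {
            writer_put(w, rec);
            written++;
        }
    }
    return written;
}
```

The loop in run_stream plays the role the text assigns to stream/: it owns no record logic of its own and only sequences the reader, the mapper chain, and the writer.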

Container names

The user-visible concept of stream record (or srec) is implemented in the lrec_t (linked-record type) data structure. The user-visible concept of out-of-stream variables is implemented using the mlhmmv_t (multi-level hashmap of mlrvals) structure. Source-code comments and names within the code refer to srec/lrec and oosvar/mlhmmv depending on the context.

While those two data structures hold user-visible data, others are used internally in Miller's implementation: slls and sllv are singly-linked lists of string and void-star respectively; lhmss is a linked hashmap from string to string; lhmsi is a linked hashmap from string to int; and so on.

Memory management

Miller is streaming and as near stateless as possible. For most Miller functions, you can ingest a 20GB file with 4GB RAM, no problem. For example, mlr cat of a DKVP file retains no data in memory from one line to another; mlr cat of a CSV file retains only the field names from the header line. The stats1 and stats2 commands retain only aggregation state (e.g. count and sum over specified fields needed to compute mean of specified fields). The mlr tac and mlr sort commands, obviously, need to consume and retain all input records before emitting any output records.
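
To make "retains only aggregation state" concrete, the following sketch keeps a count and a sum and nothing else, which is enough to compute a mean over any number of records. The names mean_acc_t, mean_acc_ingest, and mean_acc_emit are invented for this example and are not taken from the stats1 source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator: memory use is constant regardless of input size. */
typedef struct {
    unsigned long count;
    double sum;
} mean_acc_t;

/* Fold one field value into the running state; the record can then be released. */
void mean_acc_ingest(mean_acc_t *acc, double x) {
    assert(acc != NULL);
    acc->count++;
    acc->sum += x;
}

/* Compute the mean at end of stream; returns 0.0 if nothing was ingested. */
double mean_acc_emit(const mean_acc_t *acc) {
    assert(acc != NULL);
    return acc->count > 0 ? acc->sum / (double)acc->count : 0.0;
}
```

A sort or tac, by contrast, has no such fixed-size summary, which is why those commands must hold every record until end of stream.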

Miller classes are in general modular, following a constructor/destructor model with minimal dependencies between classes. As a general rule, void-star payloads (sllv, lhmslv) must be freed by the callee (which has access to the data type) whereas non-void-star payloads (slls, hss) are freed by the container class.

One complication is the free-flags in lrec and slls. The idea is that an entire line is malloced and presented by the record reader; individual fields are then split out and populated into linked lists or records. To reduce the amount of strduping there, free-flags track which fields should be freed by the destructor and which are freed elsewhere.
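
The free-flag idea might look roughly like the sketch below. The field_t type, the FREE_VALUE bit, and the function names are assumptions made for illustration; the actual flag names and layout in lrec and slls may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FREE_VALUE 0x01 /* hypothetical flag: this field owns its value */

typedef struct {
    char *value;
    unsigned char free_flags;
} field_t;

/* Split a line in place at each separator. Fields point into the line
 * buffer, so nothing is copied and no free-flag is set. */
size_t split_in_place(char *line, char sep, field_t *fields, size_t max) {
    assert(line != NULL && fields != NULL);
    size_t n = 0;
    char *p = line;
    while (n < max) {
        fields[n].value = p;
        fields[n].free_flags = 0;
        n++;
        char *q = strchr(p, sep);
        if (q == NULL)
            break;
        *q = '\0';
        p = q + 1;
    }
    return n;
}

/* Replace a field's value with a private copy and mark it as owned.
 * Returns 0 on success, -1 if allocation fails. */
int field_set_owned(field_t *f, const char *s) {
    size_t len = strlen(s);
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, s, len + 1);
    if (f->free_flags & FREE_VALUE)
        free(f->value);
    f->value = copy;
    f->free_flags |= FREE_VALUE;
    return 0;
}

/* Destructor: free only the values this container owns. */
void fields_free(field_t *fields, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (fields[i].free_flags & FREE_VALUE)
            free(fields[i].value);
    }
}
```

In this sketch the line buffer itself is freed by whoever allocated it, while fields_free releases only the values that were copied later.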

The header_keeper object is an elaboration on this theme: suppose there is a CSV file with header line a,b,c and data lines 1,2,3, then 4,5,6, then 7,8,9. Then the keys a, b, and c are shared between all three records; they are retained in a single header_keeper object.
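
A simplified picture of key sharing, under the assumption that each record holds a borrowed pointer to one shared header object: the struct layouts, the MAX_FIELDS cap, and the function names below are invented for this sketch and do not mirror the real header_keeper code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FIELDS 16 /* arbitrary cap for this sketch */

/* Hypothetical keeper: owns one copy of the header line, split in place. */
typedef struct {
    char *buf;
    char *keys[MAX_FIELDS];
    size_t nkeys;
} header_keeper_t;

/* Hypothetical record: borrows its keys from the keeper, never frees them. */
typedef struct {
    const header_keeper_t *hk;
    const char *values[MAX_FIELDS];
} csv_rec_t;

/* Copy and split a comma-separated header line. Returns NULL on allocation failure. */
header_keeper_t *header_keeper_alloc(const char *header_line) {
    assert(header_line != NULL);
    header_keeper_t *hk = malloc(sizeof *hk);
    if (hk == NULL)
        return NULL;
    size_t len = strlen(header_line);
    hk->buf = malloc(len + 1);
    if (hk->buf == NULL) {
        free(hk);
        return NULL;
    }
    memcpy(hk->buf, header_line, len + 1);
    hk->nkeys = 0;
    char *p = hk->buf;
    while (hk->nkeys < MAX_FIELDS) {
        hk->keys[hk->nkeys++] = p;
        char *q = strchr(p, ',');
        if (q == NULL)
            break;
        *q = '\0';
        p = q + 1;
    }
    return hk;
}

/* Freed once, after the last record that refers to it is gone. */
void header_keeper_free(header_keeper_t *hk) {
    if (hk != NULL) {
        free(hk->buf);
        free(hk);
    }
}
```

With the a,b,c example, all three data records would point at the same keeper, so the key strings exist once rather than three times.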

A bigger complication to the otherwise modular nature of Miller is its baton-passing memory-management model. Namely, one class may be responsible for freeing memory allocated by another class.

For example, using mlr cat: The record-reader produces records and returns pointers to them. The record-mapper is just a pass-through; it returns the record-pointers it receives. The record-writer formats the records to stdout and does not return them, so it is responsible for freeing them.

The same holds for mlr cut -x and any other mapper that modifies record objects without creating new ones. By contrast, stats1 et al. produce their own records; they free what they do not pass on.
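
The baton-passing model can be sketched as follows. The types and functions are hypothetical, records are reduced to a single int, and the live_records counter exists only so the example can show that every allocation is eventually freed by whichever stage ends up holding the record.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int value; } rec_t; /* hypothetical one-field record */

int live_records = 0; /* bookkeeping for this sketch only */

rec_t *rec_alloc(int v) {
    rec_t *r = malloc(sizeof *r);
    assert(r != NULL); /* sketch: abort rather than handle allocation failure */
    r->value = v;
    live_records++;
    return r;
}

void rec_free(rec_t *r) {
    free(r);
    live_records--;
}

/* Reader: allocates a record and hands it on. */
rec_t *reader_produce(int v) { return rec_alloc(v); }

/* Pass-through mapper (like cat): returns the pointer it received. */
rec_t *mapper_cat(rec_t *in) { return in; }

typedef struct { int sum; } sum_state_t;

/* Aggregating mapper (in the spirit of stats1): frees each input it does
 * not pass on, and allocates its own output record at end of stream (NULL). */
rec_t *mapper_sum(sum_state_t *st, rec_t *in) {
    if (in != NULL) {
        st->sum += in->value;
        rec_free(in);
        return NULL;
    }
    return rec_alloc(st->sum);
}

/* Writer: last holder of the baton, so it frees the record. */
int writer_consume(rec_t *r) {
    int v = r->value;
    rec_free(r);
    return v;
}
```

The point of the sketch is that no stage frees a record it has passed downstream, and every stage frees any record it keeps.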

Null-lrec conventions

Record-readers return a null lrec-pointer to signify end of input stream.

Each mapper takes an lrec-pointer as input and returns a linked list of lrec-pointers.

Null-lrec is input to mappers to signify end of stream: e.g. sort or tac should use this as a signal to deliver the sorted/reversed list of rows.

When a mapper has no output before end of stream (e.g. sort or tac while accumulating inputs) it returns a null lrec-pointer which is treated as synonymous with returning an empty list.

At end of stream, a mapper returns a linked list of records ending in a null lrec-pointer.

A null lrec-pointer at end of stream is passed to lrec writers so that they may produce final output (e.g. pretty-print which produces no output until end of stream).
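
These conventions can be illustrated with a tac-like mapper. This is a sketch with made-up types (rec_t, node_t, tac_state_t), not the actual mapper interface: it returns NULL while accumulating, and at end of stream returns the reversed list terminated by a node holding a null record pointer.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int id; } rec_t; /* hypothetical record */

/* Hypothetical list of record pointers. */
typedef struct node {
    rec_t *rec;
    struct node *next;
} node_t;

typedef struct { node_t *stack; } tac_state_t;

node_t *node_push_front(node_t *head, rec_t *rec) {
    node_t *n = malloc(sizeof *n);
    assert(n != NULL); /* sketch: abort rather than handle allocation failure */
    n->rec = rec;
    n->next = head;
    return n;
}

/* Non-null input: retain it and emit nothing (NULL means empty list).
 * Null input: end of stream, so emit everything in reverse order,
 * followed by a null record pointer to pass end-of-stream downstream. */
node_t *tac_process(tac_state_t *st, rec_t *in) {
    assert(st != NULL);
    if (in != NULL) {
        st->stack = node_push_front(st->stack, in);
        return NULL;
    }
    node_t *end = node_push_front(NULL, NULL);
    node_t *out = st->stack;
    st->stack = NULL;
    if (out == NULL)
        return end;
    node_t *tail = out;
    while (tail->next != NULL)
        tail = tail->next;
    tail->next = end;
    return out;
}

/* Frees list nodes only; the records belong to whoever consumes them. */
void list_free(node_t *head) {
    while (head != NULL) {
        node_t *next = head->next;
        free(head);
        head = next;
    }
}
```

Pushing each record onto the front of the list means the accumulated list is already in reverse order when end of stream arrives.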

Performance optimizations

The initial implementation of Miller used lhmss (insertion-ordered string-to-string hash map) for record objects. Keys and values were strduped out of file-input lines. Each of the following produced from 5 to 30 percent performance gains:

* The lrec object is a hashless map suited to low access-to-creation ratio. See detailed comments in https://github.com/johnkerl/miller/blob/master/c/containers/lrec.h.
* Free-flags as discussed above removed additional occurrences of string copies.
* Using mmap to read files gets rid of double passes on record parsing (one to find end of line, and another to separate fields) as well as most use of malloc. Note however that standard input cannot be mmapped, so both record-reader options are retained.
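
To give a feel for what a hashless map means here, the sketch below stores key-value pairs in an insertion-ordered linked list and finds keys by linear scan. It is a simplified stand-in with assumed names (entry_t, rec_t, rec_put, rec_get), not the lrec implementation; see lrec.h for the real design.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct entry {
    const char *key;   /* borrowed, not copied */
    const char *value; /* borrowed, not copied */
    struct entry *next;
} entry_t;

typedef struct {
    entry_t *head;
    entry_t *tail;
    size_t count;
} rec_t;

/* Linear scan instead of hashing: creation is cheap, and lookup cost is
 * acceptable when each record is accessed only a few times. */
entry_t *rec_find(const rec_t *r, const char *key) {
    for (entry_t *e = r->head; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0)
            return e;
    }
    return NULL;
}

const char *rec_get(const rec_t *r, const char *key) {
    entry_t *e = rec_find(r, key);
    return e != NULL ? e->value : NULL;
}

/* Update in place if the key exists, else append to preserve insertion
 * order. Returns 0 on success, -1 if allocation fails. */
int rec_put(rec_t *r, const char *key, const char *value) {
    assert(r != NULL && key != NULL);
    entry_t *e = rec_find(r, key);
    if (e != NULL) {
        e->value = value;
        return 0;
    }
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->key = key;
    e->value = value;
    e->next = NULL;
    if (r->tail == NULL)
        r->head = e;
    else
        r->tail->next = e;
    r->tail = e;
    r->count++;
    return 0;
}

/* Frees the entries only; keys and values are borrowed in this sketch. */
void rec_free(rec_t *r) {
    entry_t *e = r->head;
    while (e != NULL) {
        entry_t *next = e->next;
        free(e);
        e = next;
    }
    r->head = r->tail = NULL;
    r->count = 0;
}
```

Because keys and values are borrowed pointers, building a record this way involves no string copies and no hash computation, which is the trade the first bullet describes.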

Source-code indexing

Please see https://sourcegraph.com/github.com/johnkerl/miller