# Quickstart for developers

* `make`, `make check`, `make docs`, etc.: see [Makefile](Makefile) in the repo base directory.
* Software-testing methodology: see [./test/README.md](./test/README.md).
* Source-code indexing: please see [https://sourcegraph.com/github.com/johnkerl/miller](https://sourcegraph.com/github.com/johnkerl/miller)
* Godoc: as of September 2021, `godoc` support is minimal: package-level synopses exist, but most `func`/`const`/etc. content lacks `godoc`-style comments. To view the doc material that does exist, you can:
  * `go get golang.org/x/tools/cmd/godoc`
  * `cd go`
  * `godoc -http=:6060 -goroot .`
  * Browse to `http://localhost:6060`
  * Note: control-C and restart the server, then reload in the browser, to pick up edits to source files.
# Continuous integration

The Go implementation is auto-built using GitHub Actions: see [.github/workflows/go.yml](.github/workflows/go.yml). This works splendidly on Linux, macOS, and Windows.
# Benefits of porting to Go

* The lack of a streaming (record-by-record) JSON reader in the C implementation ([issue 99](https://github.com/johnkerl/miller/issues/99)) is immediately solved in the Go implementation.
* In the C implementation, arrays were not supported in the DSL; in the Go implementation they are.
* Flattening nested map structures to output records was clumsy in the C implementation. Now, if your inputs and outputs are both JSON, Miller will be a JSON-to-JSON processor, and JSON input and output will be idiomatic.
* The quoted-DKVP feature from [issue 266](https://github.com/johnkerl/miller/issues/266) will be easily addressed.
* String/number-formatting issues in [issue 211](https://github.com/johnkerl/miller/issues/211), [issue 178](https://github.com/johnkerl/miller/issues/178), [issue 151](https://github.com/johnkerl/miller/issues/151), and [issue 259](https://github.com/johnkerl/miller/issues/259) will be fixed during the Go port.
* I think some DST/timezone issues such as [issue 359](https://github.com/johnkerl/miller/issues/359) will be easier to fix using the Go datetime library than the C one.
* The code will be easier to read and, I hope, easier for others to contribute to. This means it should be quicker and easier to add new features to Miller -- after the development-time cost of the port itself is paid, of course.
# Why Go

* As noted above, multiple Miller issues will benefit from stronger library support.
* Channels and goroutines are an excellent fit for Miller's reader/mapper/mapper/mapper/writer record-stream architecture.
* Go has gotten faster since I did my timing experiments back in 2015.
* In terms of CPU-cycle count, Go is a bit slower than C (it does more things, like bounds-checking arrays and so on) -- but by leveraging concurrency over a couple of processors, I find that it's competitive in terms of wall time.
* Go is an up-and-coming language, with good reason -- it's mature and stable, with few of C's weaknesses and many of C's strengths.
* The source code will be easier to read/maintain/write, by myself and others.
# Efficiency of the Go port

As I wrote [here](https://johnkerl.org//miller-docs-by-release/1.0.0/performance.html) back in 2015, I couldn't get Rust or Go (or any other language I tried) to do some test-case processing as quickly as C, so I stuck with C.

Either Go has improved since 2015, or I'm a better Go programmer than I used to be, or both -- but as of 2020 I can get Go-Miller to process data about as quickly as C-Miller.

Note: in some sense Go-Miller is *less* efficient, but in a way that doesn't significantly affect wall time. Namely, doing `mlr cat` on a million-record data file on my bargain-value MacBook Pro, the C version takes about 2.5 seconds and the Go version takes about 3 seconds. So in terms of wall time -- which is what we care most about, namely how long we have to wait -- it's about the same.

A way to look a little deeper at resource usage is to run `htop` while processing a 10x larger file, so that the job takes 25 or 30 seconds rather than 2.5 or 3. This way we can look at the steady-state resource consumption. I found that the C version -- which is purely single-threaded -- takes 100% CPU, while the Go version -- which uses concurrency and channels and `MAXPROCS=4`, with reader/transformer/writer each on their own CPU -- takes about 240% CPU. So Go-Miller uses not just a little more CPU but a lot more -- yet it does more work in parallel, and finishes the job in about the same amount of time.

Even commodity hardware has multiple CPUs these days -- and the Go code is *much* easier to read, extend, and improve than the C code -- so I'll call this a net win for Miller.
# Developer information

## Source-code goals

Donald Knuth famously said: *Programs are meant to be read by humans and only incidentally for computers to execute.*

During the coding of Miller, I've been guided by the following:

* *Miller should be pleasant to read.*
  * If you want to fix a bug, you should be able to quickly and confidently find out where and how.
  * If you want to learn something about Go channels, or lexing/parsing in Go -- especially if you don't already know much about them -- the comments should help you learn what you want to.
  * If you're the kind of person who reads other people's code for fun, well, the code should be fun, as well as readable.
  * `README.md` files throughout the directory tree are intended to give you a sense of what is where, what to read first and what doesn't need reading right away, and so on -- so you spend a minimum of time being confused or frustrated.
  * Names of files, variables, functions, etc. should be fully spelled out (e.g. `NewEvaluableLeafNode`), except for a small number of most-used names where a longer name would cause unnecessary line wraps (e.g. `Mlrval` instead of `MillerValue`, since this appears very, very often).
  * Code should not be too clever. This includes accepting a reasonable amount of code duplication from time to time, to keep things inline rather than ending up with lasagna code.
  * Things should be transparent. For example, the `-v` in `mlr -n put -v '$y = 3 + 0.1 * $x'` shows you the abstract syntax tree derived from the DSL expression.
  * Comments should be robust with respect to reasonably anticipated changes. For example, one package should cross-link to another in its comments, but I try to avoid mentioning specific filenames too much in comments and README files, since these may change over time. I make an exception for stable points such as [cmd/mlr/main.go](./cmd/mlr/main.go), [mlr.bnf](./pkg/parsing/mlr.bnf), [stream.go](./pkg/stream/stream.go), etc.
* *Miller should be pleasant to write.*
  * It should be quick to answer the question *Did I just break anything?* -- hence the `mlr regtest` functionality.
  * It should be quick to find out what to do next as you iteratively develop -- see for example [cst/README.md](./pkg/dsl/cst/README.md).
* *The language should be an asset, not a liability.*
  * One of the reasons I chose Go is that (personally, anyway) I find it to be reasonably efficient, well-supported with standard libraries, straightforward, and fun. I hope you enjoy it as much as I have.
## Directory structure

Information here is for the benefit of anyone reading/using the Miller Go code. To use the Miller tool at the command line, you don't need to know any of this if you don't want to. :)

## Directory-structure overview

Miller is a multi-format record-stream processor, where a **record** is a sequence of key-value pairs. The basic **stream** operation is:

* **read** records in some specified file format;
* **transform** the input records to output records in some user-specified way, using a **chain** of **transformers** (also sometimes called **verbs**) -- sort, filter, cut, put, etc.;
* **write** the records in some specified file format.
So, in broad overview, the key packages are:

* [pkg/stream](./pkg/stream) -- connect input -> transforms -> output via Go channels
* [pkg/input](./pkg/input) -- read input records
* [pkg/transformers](./pkg/transformers) -- transform input records to output records
* [pkg/output](./pkg/output) -- write output records
* The rest are details to support this.
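
Here is a minimal, self-contained sketch of that reader -> transformer -> writer wiring over Go channels. The `Record` type, the close-the-channel-as-end-of-stream convention, and the function names here are illustrative assumptions, not Miller's actual API -- the real plumbing lives in [pkg/stream](./pkg/stream) and passes `RecordAndContext` values (in batches):

```go
// A hypothetical sketch, not Miller's actual API: a reader goroutine, one
// transformer goroutine, and a writer, connected by Go channels.
package main

import "fmt"

// Record stands in for Miller's ordered key-value record type.
type Record map[string]string

func reader(out chan<- Record) {
	// Pretend these were parsed from an input file in some format.
	out <- Record{"a": "1", "b": "2"}
	out <- Record{"a": "3", "b": "4"}
	close(out) // closing the channel stands in for the end-of-stream marker
}

func transformer(in <-chan Record, out chan<- Record) {
	for rec := range in {
		rec["c"] = "seen" // a put/cat-like pass-through with modification
		out <- rec
	}
	close(out)
}

func writer(in <-chan Record) {
	for rec := range in {
		fmt.Println(rec) // stands in for format-specific record-writing
	}
}

func main() {
	c1 := make(chan Record)
	c2 := make(chan Record)
	go reader(c1)
	go transformer(c1, c2)
	writer(c2) // runs on the main goroutine until end of stream
}
```

The real pipeline has the same shape, with the record-reader, each verb in the `then`-chain, and the record-writer all running concurrently.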
## Directory-structure details

### Dependencies

* Miller's dependencies are all in the Go standard library, except two:
  * The GOCC lexer/parser code-generator from [github.com/goccmack/gocc](https://github.com/goccmack/gocc):
    * Forked at [github.com/johnkerl/gocc](https://github.com/johnkerl/gocc).
    * This package defines the grammar for Miller's domain-specific language (DSL) for the Miller `put` and `filter` verbs. And, GOCC is a joy to use. :)
    * It is used on the terms of its open-source license.
  * [golang.org/x/term](https://pkg.go.dev/golang.org/x/term):
    * Just a one-line Miller callsite for is-a-terminal checking for the [Miller REPL](./pkg/terminals/repl/README.md).
    * It is used on the terms of its open-source license.
* See also [go.mod](./go.mod). Setup:
  * `go get github.com/johnkerl/gocc`
  * `go get golang.org/x/term`
### Miller per se

* The main entry point is [cmd/mlr/main.go](./cmd/mlr/main.go); everything else is in [pkg](./pkg).
* [pkg/entrypoint](./pkg/entrypoint): All the usual contents of `main()` are here, for ease of testing.
* [pkg/platform](./pkg/platform): Platform-dependent code, which as of early 2021 is the command-line parser. Handling single quotes and double quotes is different on Windows unless particular care is taken, which is what this package does.
* [pkg/lib](./pkg/lib):
  * Implementation of the [`Mlrval`](./pkg/types/mlrval.go) datatype, which includes string/int/float/boolean/void/absent/error types. These are used for record values, as well as expression/variable values in the Miller `put`/`filter` DSL. See also below for more details.
  * [`Mlrmap`](./pkg/types/mlrmap.go) is the sequence of key-value pairs which represents a Miller record. The key-lookup mechanism is optimized for Miller read/write usage patterns -- please see [mlrmap.go](./pkg/types/mlrmap.go) for more details.
  * [`context`](./pkg/types/context.go) supports AWK-like variables such as `FILENAME`, `NF`, `NR`, and so on.
* [pkg/cli](./pkg/cli) is the flag-parsing logic for supporting Miller's command-line interface. When you type something like `mlr --icsv --ojson put '$sum = $a + $b' then filter '$sum > 1000' myfile.csv`, it's the CLI parser which makes it possible for Miller to construct a CSV record-reader, a transformer chain of `put` then `filter`, and a JSON record-writer.
* [pkg/climain](./pkg/climain) contains a layer which invokes `pkg/cli`; it was split out to avoid a Go package-import cycle.
* [pkg/stream](./pkg/stream) is as above -- it uses Go channels to pipe together file reads, to record reading/parsing, to a chain of record-transformers, to record writing/formatting, to terminal standard output.
* [pkg/input](./pkg/input) is as above -- one record-reader type per supported input file format, and a factory method.
* [pkg/output](./pkg/output) is as above -- one record-writer type per supported output file format, and a factory method.
* [pkg/transformers](./pkg/transformers) contains the abstract record-transformer interface datatype, as well as the Go-channel chaining mechanism for piping one transformer into the next. It also contains all the concrete record-transformers such as `cat`, `tac`, `sort`, `put`, and so on.
* [pkg/parsing](./pkg/parsing) contains a single source file, `mlr.bnf`, which is the lexical/semantic grammar file for the Miller `put`/`filter` DSL using the GOCC framework. All subdirectories of `pkg/parsing/` are autogen code created by GOCC's processing of `mlr.bnf`. If you need to edit `mlr.bnf`, please use [tools/build-dsl](./tools/build-dsl) to autogenerate Go code from it (using the GOCC tool). (This takes several minutes to run.)
* [pkg/dsl](./pkg/dsl) contains [`ast_types.go`](./pkg/dsl/ast_types.go), which is the abstract syntax tree datatype shared between GOCC and Miller. I didn't use a `pkg/dsl/ast` naming convention, although that would have been nice, in order to avoid a Go package-dependency cycle.
* [pkg/dsl/cst](./pkg/dsl/cst) is the concrete syntax tree, constructed from an AST produced by GOCC. The CST is what is actually executed on every input record when you do things like `$z = $x * 0.3 * $y`. Please see [pkg/dsl/cst/README.md](./pkg/dsl/cst/README.md) for more information.
## Nil-record conventions

Throughout the code, records are passed by reference (as are most things, for that matter, to reduce unnecessary data copies). In particular, records can be nil through the reader/transformer/writer sequence.

* Record-readers produce an end-of-stream marker (within the `RecordAndContext` struct) to signify the end of the input stream.
* Each transformer takes a record-pointer as input and produces a sequence of zero or more record-pointers.
  * Many transformers, such as `cat`, `cut`, `rename`, etc., produce one output record per input record.
  * The `filter` transformer produces one or zero output records per input record, depending on whether the record passed the filter.
  * The `nothing` transformer produces zero output records.
  * The `sort` and `tac` transformers are *non-streaming* -- they produce zero output records per input record, and instead retain each input record in a list. Then, when the end-of-stream marker is received, they sort/reverse the records and emit them, then emit the end-of-stream marker.
  * Many transformers such as `stats1` and `count` also retain input records, then produce output once there is no more input to them.
* An end-of-stream marker is passed to record-writers so that they may produce final output.
  * Most writers produce their output one record at a time.
  * The pretty-print writer produces no output until end of stream (or schema change), since it needs to compute the max width down each column.
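
As an illustration of these conventions, here is a minimal sketch of a `tac`-like transformer which retains records and emits them only when the end-of-stream marker arrives. The `Message` struct and channel shapes are simplified assumptions standing in for Miller's `RecordAndContext` and transformer interface:

```go
// A hypothetical sketch of the end-of-stream convention: a tac-like
// transformer buffers records and emits them, reversed, at end of stream.
package main

import "fmt"

// Record and Message are simplified stand-ins for Miller's record type and
// its RecordAndContext struct (which carries the end-of-stream marker).
type Record map[string]string

type Message struct {
	Record      Record
	EndOfStream bool
}

func tac(in <-chan Message, out chan<- Message) {
	retained := []Record{}
	for msg := range in {
		if msg.EndOfStream {
			// Emit retained records in reverse order, then forward the marker
			// so the writer can produce any final output.
			for i := len(retained) - 1; i >= 0; i-- {
				out <- Message{Record: retained[i]}
			}
			out <- msg
			close(out)
			return
		}
		retained = append(retained, msg.Record)
	}
}

func main() {
	in := make(chan Message)
	out := make(chan Message)
	go tac(in, out)
	go func() {
		in <- Message{Record: Record{"n": "1"}}
		in <- Message{Record: Record{"n": "2"}}
		in <- Message{EndOfStream: true}
		close(in)
	}()
	for msg := range out {
		fmt.Println(msg.Record, "end-of-stream:", msg.EndOfStream)
	}
}
```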
## Memory management

* Go has garbage collection, which immediately simplifies the coding compared to the C implementation.
* Pointers are used freely for record processing: record-readers allocate pointed records; pointed records are passed on Go channels from record-readers to record-transformers to record-writers.
  * Any transformer which passes an input record through is fine -- be it unmodified as in `mlr cat` or modified as in `mlr cut`.
  * If a transformer drops a record (`mlr filter` in false cases, for example, or `mlr nothing`), it will be GCed.
  * One caveat is any transformer which produces multiples, e.g. `mlr repeat` -- this needs to explicitly copy records instead of producing multiple pointers to the same record (see the sketch after this list).
* Right-hand sides of DSL expressions all pass around pointers to records and Mlrvals.
  * Lvalue expressions return pointed `*types.Mlrmap` so they can be assigned to; rvalue expressions return non-pointed `types.Mlrval`, but these are very shallow copies -- the int/string/etc. types are copied, but maps/arrays are passed by reference in the rvalue expression-evaluators.
  * Copy-on-write is done on map/array put -- for example, in the assignment phase of a DSL statement, where an rvalue is assigned to an lvalue.
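
The multiple-output caveat is easy to see with a small sketch. The `Record` and `copyRecord` names here are hypothetical, not Miller's actual `repeat` code; the point is just that emitting the same reference twice aliases one record, while emitting copies does not:

```go
// A hypothetical sketch of the repeat-style caveat: emitting the same
// reference twice aliases one record; emitting copies does not.
package main

import "fmt"

type Record map[string]string

// copyRecord is the explicit copy a repeat-like transformer needs to make.
func copyRecord(rec Record) Record {
	out := make(Record, len(rec))
	for k, v := range rec {
		out[k] = v
	}
	return out
}

func main() {
	rec := Record{"x": "1"}

	aliased := []Record{rec, rec}                        // two references to one record
	copied := []Record{copyRecord(rec), copyRecord(rec)} // two independent records

	rec["x"] = "mutated"
	fmt.Println(aliased[0], aliased[1]) // both show the mutation
	fmt.Println(copied[0], copied[1])   // neither shows the mutation
}
```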
## More about mlrvals

[`Mlrval`](./pkg/types/mlrval.go) is the datatype of record values, as well as of expression/variable values in the Miller `put`/`filter` DSL. It includes string/int/float/boolean/void/absent/error types, not unlike PHP's `zval`.

* Miller's `absent` type is like JavaScript's `undefined` -- it's for times when there is no such key, as in a DSL expression `$out = $foo` when the input record is `x=3,y=4` -- there is no `$foo`, so `$foo` has `absent` type. Nothing is written to the `$out` field in this case. See also [here](https://miller.readthedocs.io/en/latest/reference-main-null-data) for more information.
* Miller's `void` type is like JavaScript's `null` -- it's for times when there is a key with no value, as in `$out = $x` when the input record is `x=,y=4`. This overlaps with the `string` type, since a void value looks like an empty string. I've gone back and forth on this (including when I was writing the C implementation) -- whether to retain `void` as a distinct type from empty-string, or not. I ended up keeping it, as it made the `Mlrval` logic easier to understand.
* Miller's `error` type is for things like doing type-uncoerced addition of strings. Data-dependent errors are intended to result in `(error)`-valued output, rather than crashing Miller. See also [here](https://miller.readthedocs.io/en/latest/reference-main-data-types) for more information.
* Miller's number handling makes auto-overflow from int to float transparent, while preserving the possibility of 64-bit bitwise arithmetic. A minimal sketch of this idea follows the list below.
  * This is different from JavaScript, which has only double-precision floats and thus no support for 64-bit integers (note however that there is now [`BigInt`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt)).
  * This is also different from C and Go, wherein casts are necessary -- without which int arithmetic overflows.
  * See also [here](https://miller.readthedocs.io/en/latest/reference-main-arithmetic) for the semantics of Miller arithmetic, which the [`Mlrval`](./pkg/types/mlrval.go) class implements.
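
Here is a minimal sketch of the auto-overflow idea -- an illustration only, not the actual `Mlrval` arithmetic -- in which int addition is checked for overflow and transparently falls back to a float result:

```go
// A hypothetical sketch of int-to-float auto-overflow on addition. This is
// an illustration of the idea, not Miller's actual Mlrval arithmetic.
package main

import "fmt"

// addAutoOverflow adds two int64s. If the sum overflows, it returns the
// float64 sum and isInt=false; otherwise the int64 sum and isInt=true.
func addAutoOverflow(a, b int64) (i int64, f float64, isInt bool) {
	sum := a + b // Go wraps on signed-integer overflow
	// Overflow happened iff the operands share a sign and the wrapped
	// result's sign differs from it.
	if (a > 0 && b > 0 && sum < 0) || (a < 0 && b < 0 && sum >= 0) {
		return 0, float64(a) + float64(b), false
	}
	return sum, 0, true
}

func main() {
	if i, _, isInt := addAutoOverflow(2, 3); isInt {
		fmt.Println("int result:", i) // 5
	}
	if _, f, isInt := addAutoOverflow(1<<62, 1<<62); !isInt {
		fmt.Println("float result:", f) // ~9.22e18, transparently widened
	}
}
```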
## Performance optimizations

Key performance-related PRs for the Go port include:

* [#424](https://github.com/johnkerl/miller/pull/424): In C, lots of little mallocs are fine. In Go, not so much. This is not the garbage-collection penalty -- it's the penalty of _allocating_ -- lots of `duffcopy` and `madvise` appearing in the flame graphs. The idea was to reduce data copies in the DSL.
* [#765](https://github.com/johnkerl/miller/pull/765): In C, prints to `stdout` are buffered a line at a time if the output is to the terminal, or a few KB at a time if not (i.e. file or pipe). Note the cost is how often the process does a `write` system call, with the associated overhead of context-switching into the kernel and back out. The C behavior is the right thing to do. In the Go port, very early on, writes were all unbuffered -- several per record. Then buffering was soon switched to per-record, which was an improvement. But as of #765, the buffering is done at the library level, and it's done C-style -- much less frequently when output is not to a terminal.
* [#774](https://github.com/johnkerl/miller/pull/774): For CSV-lite and DKVP, this avoids using regexes to split strings when `strings.Split` will do.
* [#779](https://github.com/johnkerl/miller/pull/779): The basic idea of the Miller Go port was that the record-reader writes records one at a time over a channel to the first verb; the first verb writes records one at a time to the second verb, and so on; the last verb writes records one at a time to the record-writer. This is very simple, but for large files the Go runtime-scheduler overhead is too large -- data are chopped up into too many pieces. On #779, records are written 500 (or fewer) per batch, and all the channels from record-reader, to verbs, to record-writer carry record-batches. This lets Miller spend more time doing its job and less time yielding to the goroutine scheduler.
* [#786](https://github.com/johnkerl/miller/pull/786): In the C version, all values were strings until operated on specifically (explicitly) by a verb. In the Go port, initially, all values were type-inferred on read, with types retained throughout the processing chain. This was an incredibly elegant and empowering design decision -- central to the Go port, in fact -- but it came with the cost that _all_ fields were being scanned as float/int even if they weren't used in the processing chain. On #786, fields are left as raw strings with type "pending", and are just-in-time inferred to string/int/float only when used within the processing chain.
* [#787](https://github.com/johnkerl/miller/pull/787): This removed an unnecessary data copy in the `mlrval.String()` method. Originally this method had a non-pointer receiver to conform with the `fmt.Stringer` interface. However, that's a false economy: `fmt.Println(someMlrval)` is a corner case, and stream processing is the primary concern. Implementing this as a pointer-receiver method was a performance improvement.
* [#809](https://github.com/johnkerl/miller/pull/809): This reduced the number of passes through fields for just-in-time type inference. For example, for `$y = $x + 1`, each record's `$x` field's raw string (if not already accessed in the processing chain) needs to be checked to see if it's an int (like `123`), a float (like `123.4` or `1.2e3`), or a string (anything else). Previously, succinct calls to built-in Go library functions were used. That was easy to code, but made too many expensive calls that were avoidable by lighter peeking at strings. In particular, an is-octal regex was being invoked unnecessarily on every field type-infer operation.
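
To illustrate the idea behind #786 and #809, here is a minimal sketch of just-in-time inference with a cheap leading-byte peek before any call into `strconv`. The function and type names are hypothetical, and the real inference also handles things like hex and octal literals, which this sketch ignores:

```go
// A hypothetical sketch of lightweight just-in-time type inference: peek at
// the leading byte before paying for a full numeric parse.
package main

import (
	"fmt"
	"strconv"
)

type inferredType string

const (
	typeString inferredType = "string"
	typeInt    inferredType = "int"
	typeFloat  inferredType = "float"
)

func inferType(s string) inferredType {
	if s == "" {
		return typeString
	}
	c := s[0]
	// Cheap peek: anything not starting like a number is a string, with no
	// call into strconv at all.
	if c != '-' && c != '+' && c != '.' && (c < '0' || c > '9') {
		return typeString
	}
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return typeInt
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return typeFloat
	}
	return typeString
}

func main() {
	for _, s := range []string{"123", "123.4", "1.2e3", "hello"} {
		fmt.Println(s, "->", inferType(s))
	}
}
```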
See also [./README-profiling.md](./README-profiling.md) and [https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks](https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks).

In summary:

* [#765](https://github.com/johnkerl/miller/pull/765), [#774](https://github.com/johnkerl/miller/pull/774), and [#787](https://github.com/johnkerl/miller/pull/787) were low-hanging fruit.
* [#424](https://github.com/johnkerl/miller/pull/424) was a bit more involved, and reveals that memory allocation -- not just GC -- needs to be handled more mindfully in Go than in C.
* [#779](https://github.com/johnkerl/miller/pull/779) was a bit more involved, and reveals that Go's elegant goroutine/channel processing model comes with the caveat that channelized data should not be organized in many small pieces.
* [#809](https://github.com/johnkerl/miller/pull/809) was also a bit more involved, and reveals that library functions are convenient, but profiling and analysis can sometimes reveal an opportunity for an impactful, custom solution.
* [#786](https://github.com/johnkerl/miller/pull/786) was a massive refactor involving about 10KLOC -- in hindsight it would have been best to do this work at the start of the Go port, not at the end.