mirror of https://github.com/johnkerl/miller.git
synced 2026-01-23 02:14:13 +00:00

Update README-profiling.md (#810)

This commit is contained in:
parent b6097a2793
commit 2e431d4234

5 changed files with 38 additions and 119 deletions
@@ -163,6 +163,7 @@ nil through the reader/transformer/writer sequence.

Key performance-related PRs for the Go port include:

* [#424](https://github.com/johnkerl/miller/pull/424): In C, lots of little mallocs are fine. In Go, not so much. This is not the garbage-collection penalty -- it's the penalty of _allocating_ -- lots of `duffcopy` and `madvise` appearing in the flame graphs. The idea was to reduce data copies in the DSL.

* [#765](https://github.com/johnkerl/miller/pull/765): In C, prints to `stdout` are buffered a line at a time if the output is to the terminal, or a few KB at a time if not (i.e. a file or pipe). The cost is how often the process does a `write` system call, with the associated overhead of context-switching into the kernel and back out. The C behavior is the right thing to do. In the Go port, very early on, writes were all unbuffered -- several per record. Buffering was soon switched to per-record, which was an improvement. But as of #765, the buffering is done at the library level, and it's done C-style -- much less frequently when output is not to a terminal.

* [#774](https://github.com/johnkerl/miller/pull/774): For CSV-lite and DKVP, this avoids using regexes to split strings when `strings.Split` will do.

* [#779](https://github.com/johnkerl/miller/pull/779): The basic idea of the Miller Go port was that the record-reader writes records one at a time over a channel to the first verb; the first verb writes records one at a time to the second verb, and so on; and the last verb writes records one at a time to the record-writer. This is very simple, but for large files the Go runtime-scheduler overhead is too large -- the data are chopped up into too many pieces. As of #779, records are written 500 (or fewer) per batch, and all the channels from record-reader, to verbs, to record-writer carry record batches. This lets Miller spend more time doing its job and less time yielding to the goroutine scheduler.
@@ -170,9 +171,12 @@ Key performance-related PRs for the Go port include:

* [#787](https://github.com/johnkerl/miller/pull/787): This removed an unnecessary data copy in the `mlrval.String()` method. Originally this method had a non-pointer receiver to conform with the `fmt.Stringer` interface. However, that's a false economy: `fmt.Println(someMlrval)` is a corner case, and stream processing is the primary concern. Implementing this as a pointer-receiver method was a performance improvement.

* [#809](https://github.com/johnkerl/miller/pull/809): This reduced the number of passes through fields for just-in-time type-inference. For example, for `$y = $x + 1`, each record's `$x` field's raw string (if not already accessed in the processing chain) needs to be checked to see if it's an int (like `123`), a float (like `123.4` or `1.2e3`), or a string (anything else). Previously, succinct calls to built-in Go library functions were used. That was easy to code, but made too many expensive calls that were avoidable by lighter peeking at strings. In particular, an is-octal regex was being invoked unnecessarily on every field type-infer operation.
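The kind of "lighter peeking" #809 describes can be sketched like this. This is a hedged illustration only -- the function name and the exact rules here are assumptions, not Miller's actual inference logic -- showing how a one-byte peek rejects obviously non-numeric fields before paying for a full parse or a regex.

```go
package main

import (
	"fmt"
	"strconv"
)

// inferType is a hypothetical sketch of just-in-time type inference:
// peek at the string's first byte, and only call the (more expensive)
// strconv parsers when the field looks numeric at all. Non-numeric
// fields never trigger a parse attempt or a regex.
func inferType(s string) string {
	if s == "" {
		return "string"
	}
	c := s[0]
	if c != '-' && c != '+' && c != '.' && (c < '0' || c > '9') {
		return "string" // cheap rejection: no strconv call needed
	}
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return "int"
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return "float"
	}
	return "string"
}

func main() {
	for _, s := range []string{"123", "123.4", "1.2e3", "hello", "-7"} {
		fmt.Println(s, inferType(s))
	}
}
```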
See also [./README-profiling.md](./README-profiling.md) and [https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks](https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks).

In summary:

* #765, #774, and #787 were low-hanging fruit.
* #424 was a bit more involved, and reveals that memory allocation -- not just GC -- needs to be handled more mindfully in Go than in C.
* #779 was a bit more involved, and reveals that Go's elegant goroutine/channel processing model comes with the caveat that channelized data should not be organized in many small pieces.
* #809 was also a bit more involved, and reveals that library functions are convenient, but profiling and analysis can sometimes reveal an opportunity for an impactful custom solution.
* #786 was a massive refactor involving about 10KLOC -- in hindsight it would have been best to do this work at the start of the Go port, not at the end.
@@ -1,60 +1,56 @@
# How to run the profiler
# See also

Running Miller:

* [https://github.com/johnkerl/miller/blob/readme-profiling/README-go-port.md#performance-optimizations](https://github.com/johnkerl/miller/blob/readme-profiling/README-go-port.md#performance-optimizations)
* [https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks](https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks)

# How to view profile data

Run the profiler:

```
mlr --cpuprofile cpu.pprof put -f u/example.mlr then nothing ~/tmp/huge > /dev/null
mlr --cpuprofile cpu.pprof --csv put -f scripts/chain-1.mlr ~/tmp/big.csv > /dev/null
```

(or whatever command-line flags).
(or whatever command-line flags for Miller).

# How to view the profiling results

## Text mode
Text mode:

```
go tool pprof mlr cpu.pprof
top10
```

## PDF mode
Graphical mode:

```
go tool pprof --pdf mlr cpu.pprof > mlr-call-graph.pdf
mv mlr-call-graph.pdf ~/Desktop
go tool pprof -http=:8080 cpu.pprof
```

## Flame-graph mode
and let it pop open a browser window.

One-time setup:
# Benchmark scripts

Scripts:

```
export GOPATH=$HOME/go
mkdir -p $HOME/go
```

* [./scripts/make-big-files](./scripts/make-big-files) -- Create million-record data files in various formats.
* [./scripts/chain-cmps.sh](./scripts/chain-cmps.sh) -- Run a few processing scenarios on the million-record CSV file.
* [./scripts/chain-1.mlr](./scripts/chain-1.mlr) -- An example `mlr put` used by the previous script.
* [./scripts/time-big-files](./scripts/time-big-files) -- Runs `mlr cat` for million-record files of various file formats. Catting files isn't intrinsically interesting, but it shows how input and output processing vary across file formats.
* [./scripts/time-big-file](./scripts/time-big-file) -- Helper script for the former.
* [./scripts/chain-lengths.sh](./scripts/chain-lengths.sh) -- Run longer and longer chains of `scripts/chain-1.mlr`, showing how Miller handles multicore and concurrency.
* [./scripts/make-data-stream](./scripts/make-data-stream) -- Create an endless stream of data to be piped into Miller for steady-state load-testing: e.g. `scripts/make-data-stream | mlr ...`, then look at `htop` in another window.

```
go get -u github.com/google/pprof
ll ~/go/bin/pprof
go get -u github.com/uber/go-torch
```

Notes:

```
mkdir -p ~/git/brendangregg
cd ~/git/brendangregg
git clone https://github.com/brendangregg/FlameGraph
```

* Any of the above can be run using the profiler. I find flame-graph mode particularly informative for drill-down.
* The above refer to `mlr5` and `~/tmp/miller/mlr` as well as `./mlr`. The idea is that I have a copy of Miller 5.10.3 (the C implementation) saved off in my path as `mlr5`. Then I keep `~/tmp/miller` on recent HEAD. Then I have `.` on a dev branch. Comparing `mlr5` to `./mlr` shows the relative performance of the C and Go implementations. Comparing `~/tmp/miller/mlr` to `./mlr` shows the relative performance of whatever optimization I'm currently working on.
* Several of the above scripts use [justtime](https://github.com/johnkerl/scripts/blob/main/fundam/justtime) to get one-line timing information.

Per run:
# How to vary compiler versions

```
cd /path/to/mlr/go
export PATH=${PATH}:~/git/brendangregg/FlameGraph/
go-torch cpu.pprof
mv torch.svg ~/Desktop/
```

* [./scripts/compiler-versions-install](./scripts/compiler-versions-install)
* [./scripts/compiler-versions-build](./scripts/compiler-versions-build)
* [./scripts/compiler-versions-time](./scripts/compiler-versions-time)

# How to control garbage collection

@@ -69,79 +65,3 @@ GOGC=1000 GODEBUG=gctrace=1 mlr -n put -q -f u/mand.mlr 1> /dev/null
# Turn off GC entirely and see where time is spent:
GOGC=off GODEBUG=gctrace=1 mlr -n put -q -f u/mand.mlr 1> /dev/null
```

# Findings from the profiler as of 2021-02

* GC on: lots of GC
* GC off: lots of `duffcopy` and `madvise`
* From benchmark `u/mand.mlr`: the issue is allocation created by expressions
* Things like `type BinaryFunc func(input1 *Mlrval, input2 *Mlrval) (output Mlrval)`
* `z = 2 + x + 4 + y * 3` results in an AST, mapped to a CST, with a malloc on the output of every unary/binary/ternary function
* Idea: replace with `type BinaryFunc func(output *Mlrval, input1 *Mlrval, input2 *Mlrval)`: no allocation at all.
  * Breaks the Fibonacci test since the binary-operator node is no longer re-entrant
* Idea: replace with `type BinaryFunc func(input1 *Mlrval, input2 *Mlrval) (output *Mlrval)`: better.
  * Makes possible zero-copy eval of literal nodes, etc.
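The two candidate signatures can be sketched as follows, using a toy `Mlrval` stand-in rather than Miller's real type. The point is that the value-returning form copies an `Mlrval` out of every operator node, while the pointer-returning form hands back a freshly allocated (or reused) value with no copy, and stays re-entrant because no caller-owned output slot is shared.

```go
package main

import "fmt"

// Mlrval is a toy stand-in for Miller's value type, just to
// illustrate the signature change.
type Mlrval struct {
	ival int
}

// Value-returning style: every call copies an Mlrval out, so an
// expression tree like 2 + x + 4 copies at every node.
type BinaryFuncByValue func(a, b *Mlrval) Mlrval

// Pointer-returning style: the callee returns a pointer, enabling
// zero-copy evaluation of literal nodes while keeping the operator
// node re-entrant (unlike writing into a caller-owned output).
type BinaryFuncByPointer func(a, b *Mlrval) *Mlrval

func addByValue(a, b *Mlrval) Mlrval    { return Mlrval{a.ival + b.ival} }
func addByPointer(a, b *Mlrval) *Mlrval { return &Mlrval{a.ival + b.ival} }

// Compile-time checks that the functions match the two signatures.
var (
	_ BinaryFuncByValue   = addByValue
	_ BinaryFuncByPointer = addByPointer
)

func main() {
	x, y := &Mlrval{2}, &Mlrval{3}
	fmt.Println(addByValue(x, y).ival, addByPointer(x, y).ival)
}
```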

```
for i in 100 200 300 400 500 600 700 800 900 1000 ; do
  for j in 1 2 3 4 5 ; do
    echo $i;
    justtime GOGC=$i mlr -n put -q -f u/mand.mlr > /dev/null
  done
done
```

```
100 23.754
100 23.883
100 24.021
100 24.022
100 24.305
200 20.864
200 20.211
200 19.980
200 20.251
200 20.691
300 19.140
300 18.610
300 18.793
300 19.111
300 19.027
400 18.067
400 18.274
400 18.344
400 18.378
400 18.250
500 17.791
500 17.644
500 17.814
500 18.064
500 18.403
600 17.878
600 17.892
600 18.034
600 18.125
600 18.008
700 18.153
700 18.286
700 17.342
700 21.136
700 20.729
800 19.585
800 19.116
800 17.170
800 18.549
800 18.236
900 16.950
900 17.883
900 17.532
900 17.551
900 17.804
1000 20.076
1000 20.745
1000 19.657
1000 18.733
1000 18.560
```

Sweet spot around 500. Note https://golang.org/pkg/runtime/debug/#SetGCPercent.

@@ -1,4 +1,3 @@

# Introduction

**Miller is a command-line tool for querying, shaping, and reformatting data files in various formats including CSV, TSV, and JSON.**

@@ -35,3 +34,5 @@ which allow you to succinctly express your own logic.

In the following you can see how CSV, TSV, tabular, JSON, and other **file formats** share a common theme, which is **lists of key-value-pairs**. Miller embraces this common theme.
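As an illustration of that common theme -- a minimal Python sketch, not Miller code -- the same ordered list of key-value pairs can be rendered as DKVP, CSV, or JSON:

```python
import csv
import io
import json

# One record: an ordered list of key-value pairs.
record = {"name": "alice", "age": "34", "city": "boston"}

# DKVP: delimited key-value pairs, one k=v per field.
dkvp = ",".join(f"{k}={v}" for k, v in record.items())

# CSV: the keys become the header line, the values the data line.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)

# JSON: the keys and values map naturally onto an object.
as_json = json.dumps(record)

print(dkvp)            # name=alice,age=34,city=boston
print(buf.getvalue(), end="")
print(as_json)
```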

**The big picture:** Even well into the 21st century, our world is full of text-formatted data like CSV. Google _CSV memes_, for example. We need tooling to _thrive in this world_, nimbly manipulating data that lives in CSV files. And we need tooling to _move beyond CSV_, to be able to pull data out of and into other storage and processing systems. Miller is designed for both of these goals.

@@ -1,6 +1,6 @@

#!/bin/bash

mlr="mlr -S"
mlr="mlr"
suffix="dkvp"

iflag=""

8 todo.txt

@@ -1,20 +1,14 @@

================================================================
PUNCHDOWN LIST

* numeric-inference perf
  o README-profiling.md -- update ...
  o README-go-port.md re various scripts

* blockers:
  - fractional-strptime
  - improved regex doc w/ lots of examples
  - cmp-matrices
  - all-contribs: twi dm
    https://github.com/all-contributors/all-contributors
  - license triple-checks
  - `mlr put` -> coverart
  - 0b1011 olh/webdoc
  - big-picture note
  ? array/map fields: marshal as JSON_SINGLE_LINE

* nikos materials -> fold in