Miller 6 is written in Go, which supports multicore programming.
Miller uses Go's goroutines and channels. The following all run as separate goroutines, connected to one another by channels:
- One goroutine for the input-record reader, parsing input file(s) into record objects
- One goroutine for each verb in a then-chain
- One goroutine for the output-record writer, formatting output records as text
- One controller goroutine which coordinates all of these, without much work to do
For example, `mlr --csv cut -f somefield then sort -f otherfield then put '$z = $x + $y' a.csv b.csv c.csv` will have six goroutines running: the input-reader,
cut, sort, put, the output-writer, and the controller. A minimal sketch of this goroutine-and-channel layout is shown below.
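The sketch below is not Miller's actual source; it is a minimal, self-contained illustration of the same shape, with a hypothetical `Record` type and a single hypothetical `cutVerb` standing in for a then-chain. The reader, the verb, and the writer each run as their own goroutine and hand records along via channels.

```go
package main

import (
	"fmt"
	"strings"
)

// Record is a hypothetical stand-in for Miller's record objects.
type Record map[string]string

// reader parses simple "k=v,k=v" lines into records and sends them downstream.
func reader(lines []string, out chan<- Record) {
	for _, line := range lines {
		rec := Record{}
		for _, pair := range strings.Split(line, ",") {
			kv := strings.SplitN(pair, "=", 2)
			rec[kv[0]] = kv[1]
		}
		out <- rec
	}
	close(out) // closing the channel signals end-of-stream to downstream goroutines
}

// cutVerb is a streaming verb: it handles each record as it arrives and passes it on.
func cutVerb(field string, in <-chan Record, out chan<- Record) {
	for rec := range in {
		out <- Record{field: rec[field]}
	}
	close(out)
}

// writer formats each record as text as it arrives.
func writer(in <-chan Record) {
	for rec := range in {
		fmt.Println(rec)
	}
}

func main() {
	c1 := make(chan Record)
	c2 := make(chan Record)
	go reader([]string{"x=1,y=2", "x=3,y=4"}, c1)
	go cutVerb("x", c1, c2)
	writer(c2) // the writer runs on the main goroutine and drains the final channel
}
```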
If all the verbs in the chain are streaming --
operating on each record as it arrives, then passing it on -- then all verbs in
the chain will be active at once. On the other hand, if there is a
non-streaming verb in the chain -- one which produces output only after
receiving all of its input, for example sort -- then the verbs after it will
sit idle until the end of the input stream is reached, at which point the sort
does its computation and sends its output on to the downstream verbs. A sketch
of such a buffering verb follows.
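Continuing the hypothetical sketch above (again, not Miller's real implementation), a non-streaming verb looks roughly like this: it buffers every record until its input channel is closed, and only then emits anything downstream.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is the same hypothetical stand-in used in the previous sketch.
type Record map[string]string

// sortVerb is a non-streaming verb: it emits nothing until the input channel
// is closed, i.e. until the end of the input stream has been reached.
func sortVerb(field string, in <-chan Record, out chan<- Record) {
	var buffered []Record
	for rec := range in {
		buffered = append(buffered, rec) // downstream verbs see nothing yet
	}
	// End of stream: sort the buffered records, then release them downstream.
	sort.Slice(buffered, func(i, j int) bool {
		return buffered[i][field] < buffered[j][field]
	})
	for _, rec := range buffered {
		out <- rec
	}
	close(out)
}

func main() {
	in := make(chan Record)
	out := make(chan Record)
	go sortVerb("x", in, out)
	go func() {
		for _, v := range []string{"3", "1", "2"} {
			in <- Record{"x": v}
		}
		close(in)
	}()
	for rec := range out {
		fmt.Println(rec) // prints the records in sorted order: 1, 2, 3
	}
}
```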
In practice, profiling has shown that the input-reader uses the most CPU of all the above. This means CPUs running verbs may not be 100% utilized, since they are likely to be spending some of their time waiting for input data.
Running Miller on a machine with more CPUs than active goroutines (as listed above) won't speed up a given invocation of Miller. However, you can of course run more invocations of Miller at the same time if you like.
You can set the Go-standard environment variable GOMAXPROCS if you like. If
you don't, Miller will (as is standard for Go programs in Go 1.16 and above)
use up to all available CPUs.
If you set GOMAXPROCS=1 in the environment, that's fine -- the Go runtime
will simply multiplex the different channel-handling goroutines onto the same CPU.
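If you want to check what the runtime will actually use, a tiny Go program (not part of Miller) reports it; run it with GOMAXPROCS=1 set in the environment to see the override being picked up, or without it to see the default of all available CPUs.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```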