miller/todo.txt

================================================================
PUNCHDOWN LIST

* blockers:
  - fractional-strptime
  - improved regex doc w/ lots of examples
  - cmp-matrices
  - all-contribs: twi dm
    https://github.com/all-contributors/all-contributors
    yarn all-contributors add namegoeshere ideas
  - license triple-checks
  - `mlr put` -> coverart

* nikos materials -> fold in

* cases/dsl-min-max-types: cmp-matrices need to be fixed to follow the advertised rule for mixed types
  NUMERICS < BOOL < VOID < STRING
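
A minimal sketch of the advertised cross-type rule, assuming a simplified value type. The names kind, value, and less are hypothetical and are not Miller's mlrval_cmp.go code; the false-before-true choice within bools is also an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// kind is a hypothetical type tag. The constants are declared in the
// advertised collation order: NUMERICS < BOOL < VOID < STRING.
type kind int

const (
	kindNumeric kind = iota
	kindBool
	kindVoid
	kindString
)

// value is a hypothetical, simplified value: a type tag plus one payload per kind.
type value struct {
	k kind
	n float64
	b bool
	s string
}

// less orders by kind first; only values of the same kind are compared by payload.
func less(a, b value) bool {
	if a.k != b.k {
		return a.k < b.k
	}
	switch a.k {
	case kindNumeric:
		return a.n < b.n
	case kindBool:
		return !a.b && b.b // false before true: an assumption for this sketch
	case kindString:
		return a.s < b.s
	default:
		return false // all voids compare equal
	}
}

func main() {
	vals := []value{
		{k: kindString, s: "abc"},
		{k: kindVoid},
		{k: kindBool, b: true},
		{k: kindNumeric, n: 3},
		{k: kindNumeric, n: -1},
	}
	sort.SliceStable(vals, func(i, j int) bool { return less(vals[i], vals[j]) })
	for _, v := range vals {
		fmt.Println(v.k, v.n, v.b, v.s)
	}
}
```

A min or max over mixed types then falls out of the same comparator, which is one way to keep the cmp matrices and the sort functions in agreement.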

* regex
  o authoritative regex docs accompanied by thorough UT
    - expand existing regex webdoc
  o r-strings/implicit-r/297: double-check end of reference-main-data-types.md.in
  o reference-main-regular-expressions:
    separate escaping for "\t" etc in arg-2/regex position -- "\t"."\t" example as well ...

* datetime
  o sysdate, sysdate_local; datediff ...
  o .6S bugfix -- separate PR -- ?
  o strptime w/ ...00.Z -> error
  o strptime/strftime experiments ...
    - verb sec2gmtdate
      > leaves non-numbers as-is -- ?
      > check sec2gmt as well -- ?
    ! strptime:
      strptime("1970-01-01T00:00:00.Z", "%Y-%m-%dT%H:%M:%SZ")
      (error)

* doc
  o new-in-miller-6: missings:
    - dump syntax -- ?
    - emittable constraints -- ?
  o wut h1 spacing before/after ...
  o shell-commands: while-read example from issues
  ? special-symbols-and-formatting: How to escape '?' in regexes? -> still true? link to torbiak297?
  E reference-dsl-user-defined-functions: UDSes -> non-factorial example -- maybe some useful aggregator
  o reference-main-arithmetic: ? test stats1/step -F flag
  o reference-dsl-control-structures:
    e while (NR < 10) will never terminate as NR is only incremented between
      records -> and each expression is invoked once per record so once for NR=1,
      once for NR=2, etc.
  o C-style triple-for loops: loop to NR -> NO!!!
  o Since uninitialized out-of-stream variables default to 0 for
    addition/subtraction and 1 for multiplication when they appear on expression
    right-hand sides (not quite as in awk, where they'd default to 0 either way)
    <-> xlink to other page
  r fzf-ish w/ head -n 4, --from, up-arrow & append verb, then cat -- find & update the existing section
  ! https://github.com/johnkerl/miller/issues/653 -- stats1 w/ empties? check stats2
    - needs UTs as well

* release ordering?
  conda
  brew macports chocolatey
  ubuntu debian fedora gentoo prolinux archlinux
  netbsd freebsd

* post-release:
  w installing-miller.md.in

================================================================
NON-BLOCKERS

* JSON perf -- try alternate packages to encoding/json

* pos/neg 0x/0b/0o UTs

* 0o into BNF

? BIFs as FCFs?

* pv: 'mlr --prepipex pv --gzin tail -n 10 ~/tmp/zhuge.gz' needs --gzin & --prepipex both
  o plus corresponding docwork

* JIT follow-ons:
  o UT:
    - JSON I/O
    - mlrval_cmp.go
    - mv from-array/from-map w/ copy & mutate orig & check new -- & vice versa
    - dash-O and octal infer
    - populate each bifs/X_test.go for each bifs/X.go etc etc
  o neatens
    - carefully read all $mlv files
    - check $types & $bifs as well
    - proofread all mlrval_cmp.go dispo mxes
    - update rmds x several
  o misc
    - grep tmiller map put ref vs map put copy
    - krepl miller6 new doclink-comments
    - git rm cmd/mprof{n}

* https://staticcheck.io/docs
  o lots of nice little things to clean up -- no bugs per se, all stylistic i *think* ...

* xtab splitter UT; nidx too

* integrate:
  o https://www.libhunt.com/r/miller
  o https://repology.org/project/miller/information

* verslink old relnotes

* more perf?
  - batchify source-quench -- experiment 1st
  - further channelize (CSV-first focus) mlrval infer vs record-put ?
  ? coalesce errchan & done-writing w/ Err to RAC, and close-chan *and* EOSMarker -- ?

* []Mlrval -> []*Mlrval ?

* funcptr away the ifs/ifsregex check in record-readers

* handling for --cpuprofile not at args[1] slot

* try simpler-than-regex-split-string for repeated-single -- especially for XTAB reader

* UT-per-se of XTAB channelizedStanzaScanner

* main-level (verb-level?) flag for "," -> X in verbs -- in case commas in field names
* golinter

* single UT, hard to invoke w/ new full go.mod path
  go test $(ls internal/pkg/lib/*.go|grep -v test) internal/pkg/lib/unbackslash_test.go
  etc

* file-formats: NIDX link to headerless CSV

* fmtnum(98, "%3d%%") -- ? workaround: fmtnum(98, "%3d") . "%"

* link to SE table ...
  https://github.com/johnkerl/miller/discussions/609#discussioncomment-1115715

* single-line JSON for DKVP/CSV/etc ...
  o mlr --j2x --no-auto-flatten cat $mlg/regtest/input/flatten-input-2.json
    - code: make sure this does single-line json ...
  o mlr --j2c --no-auto-flatten cat $mlg/regtest/input/flatten-input-2.json
    - code: this is ok ... maybe prefer single-line -- ?

* hofs section on typedecls
  o hofs+typedecls RT cases
* glossary re natural ordering
  o separate webdoc section ... somewhere ...
  o hofs.md.in link
  o numerics < bool < string

* nr, nf, keys
* transpose ...
* -colname syntax ... -x colname maybe into more verbs ...

* consider expanding '(error)' to have more useful error-text

* mlr -k
  o c,go
  o various test cases
  o OLH re limitations
  o check JSON-parser x 2 -- is there really a 'restart'?
    - infinite-loop avoidance for sure

* -S/-A/-O page with examples -- ?

* sysdate & sysdate_local

* broadly rethink os.Exit, especially as affecting mlr repl

* mlr stdlib -- ? how to deliver?
* some support for UDF help-strings -- ?

* separate -S/-A/-O inference page w/ examples of whywhens

* mlr -k
* transpose
* print w/ #{...}; defer variadic printf
* meta: nf,nr,keys?

* mlr -f {arg}, mlr -F {arg}, etc

* non-streaming DSL-enabled cut
  https://github.com/johnkerl/miller/discussions/613

* single cheatsheet page -- put out RFH?
  https://twitter.com/icymi_py/status/1426622817785765898/photo/1

* mlrval_json -- get file/line in internal-coding-error detected

? $0 as raw-record string -- ? would make mlr grep simpler and more natural ...

* IIFEs: 'func f = (func(){ return udf})()'
* BIFs in non-sigil context (UDFs already are)
* non-top-level func defs

* non-lite DKVP reader/writer

* precedence for `:` in slicing syntax

* full format-string parser for corner cases like "X%08lldX"

* more of:
  o colored-shapes.dkvp -> csv; also mkdat2
  o data/small -> csv throughout. and/or just use example.csv

* json-triple-quote -- what can be done here?

* godoc neatens at func/const/etc level

* unarrayify function

* XYZWasSpecified -> XYZ = "default" w/ check-after -- ?

* parquet -- ?

* case auxfiles: cat them too

* uniqify-field-names in record-readers -- which issue?

* non-blocker: commenting passes ...

* non-blocker: array and string slices on LHS of assignments

* non-blocker: feature/shorthand for repl newline before prompt

* non-blocker: new functions:
  o new columns-to-arrays and arrays-to-columns for stan format

? gzout, bz2out -- ? make sure this works through tee? bleah ...
? zip -- but that is an archive case too not just an encoding case
  ? miller support for archive-traversal; directory-traversal even w/o zip -- ?
  ? as 6.1 follow-on work -- ?

* more about HTTP logs in miller -- doc and encapsulate:
  mlr --icsv --implicit-csv-header --ifs space --oxtab --from elb-log put -q 'print $27'
* PR-template etc checklists

* clean up TODO/xxx in internal/pkg/platform
* mlr regtest doc -- focus on either go/regtest or internal/pkg/auxents/regtest, one linking to the other

* also: write up how git status after test should show any missed extra-outs

* help-refactor:
  o audit for DEFAULT_FOOs @
  o audit for '-z {zzz}'
  o audit for consistent usage style

* new columns-to-arrays and arrays-to-columns for stan format

* https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/

* c/go both:
  o https://brandur.org/logfmt is simply DKVP w/ IFS = space (need dquot though)
  o https://docs.fluentbit.io/manual/pipeline/parsers/ltsv is just DKVP with IFS tab and IPS colon
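
A rough sketch of the point above: one key-value splitter covers both formats once IFS and IPS are parameters. splitDKVP and pair are hypothetical names, not Miller's reader. Double-quoted values (the "need dquot" caveat for logfmt) are not handled here, and keying separator-less fields by position is an illustrative choice.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pair is one key-value field, kept in input order.
type pair struct{ key, value string }

// splitDKVP splits a line on ifs into fields, then each field on the first ips.
// Hypothetical helper for illustration only; it does no quote handling.
func splitDKVP(line, ifs, ips string) []pair {
	var out []pair
	for i, field := range strings.Split(line, ifs) {
		if field == "" {
			continue // skip empty fields from repeated separators
		}
		kv := strings.SplitN(field, ips, 2)
		if len(kv) == 2 {
			out = append(out, pair{kv[0], kv[1]})
		} else {
			// No pair separator: key the value by its 1-up position.
			out = append(out, pair{strconv.Itoa(i + 1), field})
		}
	}
	return out
}

func main() {
	// logfmt-style: IFS is space, IPS is "=".
	fmt.Println(splitDKVP("level=info msg=started port=8080", " ", "="))
	// LTSV-style: IFS is tab, IPS is ":". Only the first colon splits.
	fmt.Println(splitDKVP("host:10.0.0.1\tstatus:200", "\t", ":"))
}
```
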
* do some profiling every so often

* UDF nexts:
  o more functions (see below)
  o strmatch https://github.com/johnkerl/miller/issues/77#issuecomment-538790927
  o DSL sort function https://github.com/johnkerl/miller/issues/77#issuecomment-321916921

* bash completion script https://github.com/johnkerl/miller/issues/77#issuecomment-308247402
  https://iridakos.com/programming/2018/03/01/bash-programmable-completion-tutorial#:~:text=Bash%20completion%20is%20a%20functionality,key%20while%20typing%20a%20command.

* sliding-window averages into mapper step (C + Go)
* stats1 rank

* double-check rand-seeding
  o all rand invocations should go through the seeder for UT/other determinism

* comment-handling
  - delegator for CSV ...

! quoted NIDX
  - how with whitespace regex -- ?
! quoted DKVP
  - what about csvlite-style -- ? needs a --dkvplite ?
! pprint emit on schema change, not all-at-end.

* widen DSL coverage
  o err-return for array/map get/put if incorrect types ... currently go-void ...
    ! the DSL needs a full, written-down-and-published spell-out of error-eval semantics
  o profile mand.mlr & check for need for idx-assign
    -> most definitely needed
  o multiple-valued return/assign -- ?
    - array destructure at LHS for multi-retval assign (maps too?)

* UT per se for lrec ops

* libify errors.New callsites for DSL/CST
* record-readers are fully in-channel/loop; record-writers are multi with in-channel/loop being
  done by ChannelWriter, which is very small. opportunity to refactor.
* address all manner of xxx and TODO comments
* empty csv ... reminder ...

* functions as first-class objects; then sortaf/sortmf take f not "f"

* godoc notes:
  o go get golang.org/x/tools/cmd/godoc
  o dev mode:
    godoc -http=:6060 -goroot .
  o publish:
    godoc -http=:6060 -goroot .
    cd ~/tmp/bar
    wget -p -k http://localhost:6060/pkg
    mv localhost:6060 miller6
    file:///Users/kerl/tmp/bar/miller6/pkg
    maybe publish to ISP space

* het ifmt-reading
  - separate out InputFormat into per-file (or tbd) & have autodetect on file endings -- ?
  - maybe a TBD reader -- ?
  - InputFormat into Context
  - TBD writer -- defer factory until first context?
  - deeper refactor pulling format out of reader/writer options entirely -- ?

================================================================
MAYBES

* dotted-syntax support in verbs?

* repl as verb -- ?  'put --repl' maybe

* json-triple-quote -- what can be done here?

* non-blocker: _ variable feature?

* headerful/headerless mix -- ?
  TOptions as list, not single -- ?

* miller extensibility re golang plugins -- ?!?
  ? verbs ?
  ? DSL functions ?

* pkg graph:
  go get github.com/kisielk/godepgraph
  godepgraph miller | dot -Tpng -o ~/Desktop/mlrdeps.png
  flamegraph etc double-check

* more data formats:
  https://indico.cern.ch/event/613842/contributions/2585787/attachments/1463230/2260889/pivarski-data-formats.pdf

----------------------------------------------------------------
DEFER:

* once on go 1.16: get around ioutil.ReadFile deprecation
  o build-dsl hand-edit / sedder
  o io/ioutil -> os
  o ioutil.ReadFile -> os.ReadFile on internal/pkg/parsing/lexer/lexer.go
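
For reference, a small sketch of the io/ioutil -> os swap, assuming Go 1.16 or later. readAll is a hypothetical wrapper and is not the lexer's code; the temp-file setup exists only to make the example self-contained.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// readAll is a hypothetical wrapper. os.ReadFile (Go 1.16+) has the same
// signature as the older ioutil.ReadFile, so the call site changes by name only.
func readAll(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	// os.MkdirTemp and os.WriteFile are the 1.16 replacements for
	// ioutil.TempDir and ioutil.WriteFile.
	dir, err := os.MkdirTemp("", "readfile-demo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "sample.mlr")
	if err := os.WriteFile(path, []byte("$z = $x + $y\n"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	text, err := readAll(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(text)
}
```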

* parser-fu:
  o iterative LALR grok
    - jackson notes
    - gocc .txt/.go for simple grammars
  o find/bookmark/grok rob's lexer slides
  o iterate on a parser-generator with JSON config file
    no need to bootstrap a parser for the parser-generator language

----------------------------------------------------------------
INFO

i go tool nm -size mlr | sort -nrk 2

----------------------------------------------------------------
GOCC UPSTREAMS:

? support "abc" (not just 'a' 'b' 'c') in the lexer part

----------------------------------------------------------------
TBF:
* go 1.16 at some point
* tools/perf:
  o https://eng.uber.com/pprof-go-profiler/
  o profile mlr --j2x cat mappings.json
  o golang static-analysis tool -- ?
* iconv note
* AST insertions: make a simple NodeFromToken & have all interface{} be *ASTNode, not *token.Token
* cst printer with reflect.TypeOf -- ?
? makefile for build-dsl: if $bnf newer than productionstable.go
* I/O perf delta between C & Go is smaller for CSV, middle for DKVP, large for JSON -- debug
* neaten/error-proof:
  o mlrmapEntry -> own keys/mlrvals -- keep the kcopy/vcopy & be very clear,
    or remove. (keeping pointers allows nil-check which is good.)
  o inrec *types.Mlrmap is good for default no-copy across channels ... needs
    a big red flag though for things like the repeat verb (maybe *only* that one ...)
! clean up the AST API. ish! :^/
* json:
  d thorough UT for json mlrval-parser including various expect-fail error cases
  d doc re no jlistwrap on input if they want to get streaming input
  d UT JSON-to-JSON cat-mapping should be identical
  d JSON-like accessor syntax in the grammar: $field[3]["bar"]
  d flatten/unflatten for non-JSON I/O formats -- maybe just double-quoted JSON strings -- ?
    - make a force-single-line writer
    - make a jsonparse DSL function -- ?
  d other formats: use JSON marshaler for collection types, maybe double-quoted
  o research gocc support
  o maybe a case for hand-roll
? dsl/ast.go -> parsing/ast.go? then, put new-ast ctor -> parsing package
  o if so, update r.mds
* relnotes: label b,i,x vs x,i,b change
* double-check dump CR-terminators depending on expression type
* good example of wording for why/when to make a breaking release:
  https://webpack.js.org/blog/2020-10-10-webpack-5-release/
* unset, unassign, remove -- too many different names. also assign/put ... maybe stick w/ 2?
* huge commenting pass
* profile mlr sort
* go exe 17MB, wut. try to discover. (gocc presumably but verify.)
* fill-down make columns required. also, --all.
* check triple-dash at mlr fill-down-h ; check others
* clean up unused exitCode arg in sort/put usage.
  o also document pre/post conditions for flag and non-flag usages x all mappers
? emit @x or emit x -- should make k/v pairs w/ "x" & value -- ? check against C impl
i emitp/emitf -- note for-loops didn't appear until 4.1.0 & emits are much older (emitp 3.5.0).
  if i were starting clean-slate, i'd have had just a single `emit`.
* asserting_{type}: os.Exit(1) -> return nil, err flow?
* test put/filter w/ various combinations of -s/-e/-f
* mt_void keep-or-not .......
  o check dispo matrices
  o if keep, need careful MT_VOID at from-string constructor -- ? or not ?
  o comment clearly regardless
* bitwise_and_dispositions et al should not have _absn for collections -- _erro instead
* ast-parex separate mlr auxents entrypoint?
* port u/window*.mlr from mlrc to mlrgo (actually, fix mlrgo of course)
* line/column caret at parse-error messages -- would require some GOCC refactoring
  in order to get the full DSL string and the line/number info into the same method
* csvlite rd/wr: comment for USV/ASV too. no need for escaping then.
* comment schema-change supported only in csvlite reader, not csv reader
* for-multi: C semantics 'k1: data k2: a:x v:1', sigh ...
* neaten mlr gap -g (default) print
! write out thorough min/max/cmp cases for all orderings by type
* silent zero-pass for-loops on non-collections:
  o intended as a heterogeneity feature ...
  o consider a --errors or --strict mode; something
* note about non-determinism for DSL print/dump vs record output-stream now ...
* put/filter updates:
* [[...]] / [[[...]]]:
  o put '$array = [1,2,3,[4,5]]' is a syntax error unfortunately; need '$array = [1,2,3,[4,5] ]'
i https://en.wikipedia.org/wiki/Delimiter#Delimiter_collision
* reorder locations of get/put/remove methods in mlrval/mlrmap
* grep out all error messages from regtest outputs & doc them all & make sure index-searchable at readthedocs
* short 'asserting' functions (absent/error); and/or put --strict or somesuch
* function metadata: auto-sort on mlr -f?
* --x2b @ help-doc .go; etc
? remove flagSet x all -- ? for consistency?
* os.Args[0] etc -> "mlr" throughout the codebase
* emitx later: 'emit([a,b,c],d,e,f)' for SR-conflict issues

* genmds multi-line something something for autogen of repl examples -- ?

* maybe split Context into varying & non-varying -- separate structs entirely

* idea: records as mlrmap -> mlrval?
  o reduce $* copy ...
  o opens the door to some (verb-subset) truly arbitrary-JSON processing ...

* mlr --opprint put '$[[1]] = $[[2]]; unset $["a"]' ./regtest/input/abixy
  o squint at pointer-handling
  o output varied after flatten-mods

* join
  > clean up VERBOSE in joiner-files
  > joinBucketKeeper & joinBucket need to be privatized
  > rewrite join-bucket-keeper.go entirely
  > also needs UT per se (not just regression)
* cli-doc --no-auto-flatten and --no-auto-unflatten
* note (fix? doc?) flatten of '$x={}' expands to nothing. not invertible.
* parex print regtest -- what about new ast-node types?
* all case-files could use top-notes
* dev-note on why `int` not `int64` -- processor-arch & those who most need it get it
* doc auto-flatten/auto-unflatten -- incl narrative from mlrcli_parse.go
* doc6: default flatsep is now "." not ":" in keeping with JSON culture
? allow [[...]] / [[[...]]] at assignment LHS

* readeropts/writeropts/readerwriteropts -> cliutil funcs
  o then put into join.go, put.go, & repl
* mlr inp parse error failstring retback?
* https://blog.golang.org/go1.13-errors
* split REPL lines on ';' -- ?
* tilde-expand for REPL load/open: if '~' is at the start of the string, run it through 'sh -c echo'
* doc shift/unshift as using [2:] and append
? ctx invars -> ptr w/ cmt
? string/array slices on assignment LHS -- ?
* beyond:
  o support 'x[1]["a"]' etc notation in various verbs?
  o sort within nested data structures?
  o array-sort, map-key sort, map-value sort in the DSL?
  o closures for sorting and more -- ?!?
  o or maybe just use UDFs ...
* optimize MlrvalLessThanForSort
  o mlr --cpuprofile cpu.pprof --from ~/tmp/big sort -f a -nr x then nothing
  o GOGC=1000 mlr --cpuprofile cpu.pprof --from /Users/kerl/tmp/huge sort -f a -nr x then nothing
  o wc -l ~/tmp/big
    1000000 /Users/kerl/tmp/big
  o wc -l ~/tmp/huge
    10000000 /Users/kerl/tmp/huge
* optimize MlrvalGetMeanEB et al.
* data-copy reduction wup:
  o literal-type nodes -- now zero-copy
  x modify Evaluate to return pointer -- too much copying
  o wup for it was the binary-operator node, w/ the '*', that broke w/ no-output-copy & fibo UT
  o bonus: return MlrvalSqrt(MlrvalDivide(input1, input2))
  o type-gated mv -- should use passed-in storage slot -- ?
  o nice narrative write-up w/ the C stack-allocator problem, Go non-solution,
    profiling methods, GC readings/findings, before-and-after CST data structures,
    final perf results.
  o next round of data-copy reduction:
    - $z = min($x, $y) -- needs to return pointer to x or y
    - $z = $x + $y -- needs to have space for sum, and return pointer to it
    - therefore type BinaryFunc func(input1, input2 *Mlrval) *types.Mlrval
      > have the function z-allocate outputs when needed
      > the outputs must be on the stack, not statically allocated, to make them re-entrant
        and OK for recursive functions
      > var output types.Mlrval w/ field-setters, rather than return &Mlrval{... all of them ...}
    - then IEvaluable: Evaluate(state *runtime.State) *types.Mlrval
    - invalidate CopyFrom
    - check for under/over copy at Assign
    - global *ERROR / *ABSENT / etc
* for i, e in range c optimization -- always *copies* e
  o try and benchmark/compare ...
  o lots of array-of-pointer stuff, this is totally fine
  o take care w/ copying (non-pointer) mlrvals though
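
A small illustration of the range-copy point, using a hypothetical val struct in place of a non-pointer mlrval. With a slice of values the loop variable is a copy, so writes to it are lost; indexing or a slice of pointers avoids that.

```go
package main

import "fmt"

// val is a hypothetical stand-in for a non-pointer value type.
type val struct{ n int }

// bumpByValue increments the loop variable, which is a copy of each element,
// so the slice itself is unchanged.
func bumpByValue(vals []val) {
	for _, e := range vals {
		e.n++
	}
}

// bumpByIndex increments the elements in place: no per-element copy.
func bumpByIndex(vals []val) {
	for i := range vals {
		vals[i].n++
	}
}

// bumpPointers copies only the pointer, so the pointee is modified.
func bumpPointers(ptrs []*val) {
	for _, p := range ptrs {
		p.n++
	}
}

func main() {
	vals := []val{{1}, {2}}
	bumpByValue(vals)
	fmt.Println(vals) // prints [{1} {2}]
	bumpByIndex(vals)
	fmt.Println(vals) // prints [{2} {3}]

	ptrs := []*val{{10}}
	bumpPointers(ptrs)
	fmt.Println(ptrs[0].n) // prints 11
}
```

Whether the copy costs anything measurable depends on the struct size, hence the benchmark item above.
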
* more copy-on-retain for concurrent pointer-mods
  o make a thorough audit, and warn everywhere
  o either do copy for all retainers, or treat inrecs as immutable ...
  o 'this.recordsAndContexts.PushBack(inrecAndContext)' idiom needs copy everywhere ...
* consider -w/-W to stderr not stdout -- ?
* doc6 re warnings
* -W to mlr main also -- so it can be used with .mlrrc
* push/pop/shift/unshift functions
* 0035.cmd
  begin{@x=1} func f(x) { dump; print "hello"; tee  > ENV["outdir"]."/udf-x", $* } $o=f($i)

* zlib: n.b. brew install pigz, then pigz -z

* regex-capture follow-on: https://github.com/johnkerl/miller/issues/388 is much cleaner
  o keep current syntax for backward compatibility
  o but encourage use of this

* put -T -- ?

----------------------------------------------------------------
DOC6:

* mlrdoc false && 4, true || 4 because of short-circuiting requirement
* error if UDF has same name as built-in
* more text examples in mlr-put doc
* window.mlr, window2.mlr -> doc somewhere
* doc: substr in inferred-numeric fields: https://github.com/johnkerl/miller/issues/290.
  o xref to 1-up note.
* document --jvstack is now the default; --no-jvstack
* doc about put -S/-F cannot make sense anymore
* why not flagSet. can't be supported everywhere, so don't confuse the user by
  supporting it some places and not others.
* back-incompat:
  mlr -n put $vflag '@x=1; dump > stdout, @x'
  mlr -n put $vflag '@x=1; dump > stdout @x'

* document tee -p

* why no flagSet:

	Unlike other transformers, we can't use flagSet here. The syntax of 'mlr
	sort' is it needs to take things like 'mlr sort -f a -n b -f c', i.e.
	first sort lexically on field a, then numerically on field b, then
	lexically on field c. The flagSet API would let the '-f c' clobber the
	'-f a', while we want both.

	Unlike other transformers, we can't use flagSet here. The syntax of 'mlr put'
	and 'mlr filter' is they need to be able to take -f and/or -e more than
	once, and Go flags can't handle that.
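
A minimal demo of the clobbering described above, assuming the flags are declared as plain strings on a standard-library flag.FlagSet. parseSortFlags is a hypothetical helper, not Miller's sort parser.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseSortFlags declares -f and -n as ordinary string flags and parses args.
// Hypothetical helper for illustration only.
func parseSortFlags(args []string) (f, n string, err error) {
	fs := flag.NewFlagSet("sort", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the demo output
	fp := fs.String("f", "", "lexical ascending sort field")
	np := fs.String("n", "", "numerical ascending sort field")
	if err := fs.Parse(args); err != nil {
		return "", "", err
	}
	return *fp, *np, nil
}

func main() {
	f, n, err := parseSortFlags([]string{"-f", "a", "-n", "b", "-f", "c"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// The second -f overwrote the first, and the relative order of -f and -n
	// is not recorded anywhere.
	fmt.Printf("f=%q n=%q\n", f, n) // prints f="c" n="b"
}
```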

* doc re multi-load: can't '$x >' and '3' in separate -f anymore. no worries.
* sec2gmt --millis/--micros/--nanos doc
* sort-within-records --recursive doc

* docs nest simplers now that we have getoptish
* mongo examples to doc :D
* doclink re https://readthedocs.org/projects/miller/ & https://github.com/johnkerl/miller/settings/hooks
* dotted-map doc ...
  o $*.foo["bar"] = NR b04k b/c precedence :(
  o change precedence?

? * would LOVE to have small prev-page/next-page links at the *top* not bottom ...
  https://squidfunk.github.io/mkdocs-material/customization/#extending-the-theme

* go test single files:
  $ go test src/types/mlrval_functions_test.go $(ls src/types/*.go | grep -v test)
  ok   command-line-arguments 0.100s
  $ lsr \*test\*.go
  ./regression_test.go
  ./src/types/mlrval_functions_test.go
  ./src/types/mlrval_format_test.go
  ./src/auxents/regtest/regtester.go
  ./src/lib/regex_test.go
  ./src/lib/unbackslash_test.go
  $ go test src/types/mlrval_functions_test.go $(ls src/types/*.go|grep -v test)
  ok   command-line-arguments 0.097s
  $ go test src/types/mlrval_format_test.go $(ls src/types/*.go|grep -v test)
  ok   command-line-arguments 0.093s
  $ go test src/lib/regex_test.go src/lib/regex.go
  ok   command-line-arguments 0.083s
  $ go test src/lib/unbackslash_test.go src/lib/unbackslash.go
  ok   command-line-arguments 0.081s