miller/todo.txt

===============================================================
* 1050 mlr check w/ empty csv column name
*  283 strmatch DSL function
*  440 strict mode
* 1128 bash/zsh autocompletions
* 1025 emitv2
* 1082 summary/type
* 1105 too many open files

* opt-in type-infers for inf, true, etc
* 982 '.' ',' etc
* extended field accessors for #763 and #948 (positional & json.nested)
* awk-like exit
* mrpl exits ...

================================================================
STRICT MODE

? re silent zero-pass for-loops on non-collections:

i theme is handling of 'absent'

? what about handling of 'error' ?

* improve wording:
  mlr: couldn't assign variable int function return value from value absent (absent)

* need $?x and @?x in the grammar & CST

* flags:
  o mlr -z and mlr put -z
  o note put has -w (warn) and -W (fatal)
    - then strict mode includes -W?

* tests:
  mlr --csv --from $exv put -z 'x = 1'
  mlr --csv --from $exv put -z 'var x = a'
  mlr --csv --from $exv put -z 'var x = $nonesuch'
  mlr --csv --from $exv put -z 'var x = $["asdf"]'
  mlr --csv --from $exv put -z 'var x = $[nonesuch]'
  mlr --csv --from $exv put -z 'var x = $[[999]]'
  mlr --csv --from $exv put -z 'var x = $[[[999]]]'
  mlr --csv --from $exv put -z 'begin { var m = $* }'
  mlr --csv --from $exv put -z 'var x = @nonesuch'
  mlr --csv --from $exv put -z 'var x = @["nonesuch"]'
  mlr --csv --from $exv put -z 'func f(): int {}; $x = f()'
  mlr --csv --from $exv put -z 'func f() {}; $x = f()'
  mlr --csv --from $exv put -z 'func f() {return nonesuch}; $x = f()'
  mlr --csv --from $exv put -z '$env = ENV[nonesuch]'
  mlr --csv --from $exv put -z '$env = ENV["nonesuch"]'

----------------------------------------------------------------
EXTENDED FIELD ACCESSORS

* extract/simplify getter logic outside of full CST/runtime/etc
* sketch out putter ...
? what does 'cut -f a.b[3]' even mean?
  o 'cut -xf a.b[3]' is easy enough ...
  o field name '@graph' is an issue :(
* maybe ......... not DSL per se but a separate "parser" here
  o delimiters '[' ']' '.' '{' '}'
  o $foo
  o ${foo} -- in case the field name has a . in it or what have you
  x $["foo"] -- don't support this
  o $foo.bar -- treat this like $foo["bar"]
  o $foo["bar"] or foo[1] -- contents just a string or an integer (no expressions supported)
  o $[[2]] -- contents just an integer (no expressions supported)
  o $[[[3]]] -- contents just an integer (no expressions supported)
  o note $foo.bar[[[2]]][[3]] needs to work
    $foo["bar"]={"a":1,"b":{"x":3},"c":4}
    $foo["bar"][[[2]]][[1]]  is  "x"
  o parse as walk-through (with {...} special) until . [] [[]] [[[]]]
    - example $foo.bar[[[2]]][[3]]
      L1: $* -> foo
      L2: -> map-index ["bar"]
      L3: positional value 2
      L4: positional name 3
  o maybe ... eval as simply composition of classes
    - level 1 mlrmap -> mlrval -- $foo or $[[2]] or $[[[3]]]
    - level 2+ mlrval -> mlrval -- [], [[]], [[[]]], or .
* code-comment why special classes & not just DSL
  o inside [] [[]] [[[]]] are just int/string (maybe slices?); no expressions
  o no $[...] with strings
  o no need for runtime.State
  o move POC from $cst to $mlrval -- ? or maybe its own package?
* note that '.' is quite fine in CSV files etc -- also maybe in JSON ........
  ? so, maybe only enable this with mlr -e? also maybe extended/non-extended at each & every verb -- ?!?
  ? maybe if extended/non-extended is unspecified, default it to on for json/jsonl input & off otherwise -- ?
* maybe also 'raise' and 'raise_array' verbs (need better names though)

----------------------------------------------------------------
TSV etc

? also: some escapes perhaps for dkvp, xtab, pprint -- ?
  o nidx is a particular pure-text, leave-as-is
? try out nidx single-line w/ \001, \002 FS/PS & \n or \n\n RS
  o make/publicize a shorthand for this -- ?
  o --words && --lines & --paragraphs -- ?

----------------------------------------------------------------
example pages as feature catch-up:
* example page for format/scan options
  o format/unformat
    x=unformat(1,2); x; string(x); is_error(x)
  o strmatch
  o =~
* separate examples section from (new) FAQs section
* mlr split -- needs an example page along with the tee DSL function
  o narrative about in-memory, shard-out to workers, holdouts, sample/bootstrap, etc etc.
  o ulimit note
* new slwin/ewma example entry, with ccump and pgr
  o slwin --prune (or somesuch) to only emit averages over full windows -- ?

----------------------------------------------------------------
inference:

* inf/nan (unquoted) -> DSL
  o webdocs as in #933 description
* for data files: --symbol-true yes --symbol-false off --symbol-infinity inf --symbol-not-available N/A

----------------------------------------------------------------
! sysdate, sysdate_local; datediff ...

----------------------------------------------------------------
! strmatch https://github.com/johnkerl/miller/issues/77#issuecomment-538790927

----------------------------------------------------------------
* make a lag-by-n and lead-by-n

----------------------------------------------------------------
! rank

----------------------------------------------------------------
! CSV default unsparsify ... note this must trust the *first* record .......
* CSV: acceptance somehow for \" in place of ""
  o & more flex more generally

----------------------------------------------------------------
! quoted NIDX
  - how with whitespace regex -- ?
! quoted DKVP
  - what about csvlite-style -- ? needs a --dkvplite ?
! pprint emit on schema change, not all-at-end.

* new columns-to-arrays and arrays-to-columns for stan format
* transpose ...
* DKVP/XTAB encodings/decodings for \n etc; non-lite DKVP reader -- or nah? broken already for these cases
* csvstat / pandas describe
* r-strings/implicit-r/297: double-check end of reference-main-data-types.md.in
* non-streaming DSL-enabled cut: https://github.com/johnkerl/miller/discussions/613

* pv: 'mlr --prepipex pv --gzin tail -n 10 ~/tmp/zhuge.gz' needs --gzin & --prepipex both
  o plus corresponding docwork
* -W to mlr main also -- so it can be used with .mlrrc

? json-triple-quote -- what can be done here?
? dotted-syntax support in verbs?
? full format-string parser for corner cases like "X%08lldX"
? array and string slices on LHS of assignments
? feature/shorthand for repl newline before prompt
? meta: nr, nf, keys
? BIFs as FCFs?
? IP addresses and ranges as a datatype so one could do membership tests like "if 10.10.10.10 in 10.0.0.0/8".
? push/pop/shift/unshift functions
? IIFEs: 'func f = (func(){ return udf})()'
? non-top-level func defs
? support 'x[1]["a"]' etc notation in various verbs?
? sort within nested data structures?
? ast-parex separate mlr auxents entrypoint?
? het ifmt-reading
  - separate out InputFormat into per-file (or tbd) & have autodetect on file endings -- ?
  - maybe a TBD reader -- ?
  - InputFormat into Context
  - TBD writer -- defer factory until first context?
  - deeper refactor pulling format out of reader/writer options entirely -- ?
? multiple-valued return/assign -- ?
  ? array destructure at LHS for multi-retval assign (maps too?)
? new file formats
  o https://brandur.org/logfmt is simply DKVP w/ IFS = space (need dquot though)
  o https://docs.fluentbit.io/manual/pipeline/parsers/ltsv is just DKVP with IFS tab and IPS colon
? string/array slices on assignment LHS -- ?
? zip -- but that is an archive case too not just an encoding case
  ? miller support for archive-traversal; directory-traversal even w/o zip -- ?
  ? as 6.1 follow-on work -- ?

* mlr -k
  o various test cases
  o OLH re limitations
  o check JSON-parser x 2 -- is there really a 'restart'?
    - infinite-loop avoidance for sure

================================================================
UX

? NF?

! bnf fix for '[[' ']]' etc -- make it a nesting of singles. since otherwise no '[[3,4]]' literals :(
! broadly rethink os.Exit, especially as affecting mlr repl

* consider expanding '(error)' to have more useful error-text
* sync-print option; or (yuck) another xprint variant; or ...; emph dump/eprint
* strptime w/ ...00.Z -> error
* short 'asserting' functions (absent/error); and/or strict or somesuch (DSL, or global ...)
* header-only csv ... reminder ...
* bash completion script https://github.com/johnkerl/miller/issues/77#issuecomment-308247402
  https://iridakos.com/programming/2018/03/01/bash-programmable-completion-tutorial#:~:text=Bash%20completion%20is%20a%20functionality,key%20while%20typing%20a%20command.
* tilde-expand for REPL load/open: if '~' is at the start of the string, run it though 'sh -c echo'
  o or, simpler, just slice[1:] and prepend with os.Getenv("HOME")

? fmtnum(98, "%3d%%") -- ? workaround: fmtnum(98, "%3d") . "%"
? main-level (verb-level?) flag for "," -> X in verbs -- in case commas in field names
? trace-mode ?
? $0 as raw-record string -- ? would make mlr grep simpler and more natural ...

================================================================
DOC

* issue on gnu parallel -> link to scripting page

* https://stackoverflow.com/questions/71011603/is-there-a-simple-way-to-convert-a-csv-with-0-indexed-paths-as-keys-to-json-with/71015190#71015190

! slwin / unformat narrative:
  o prototype in DSL
  o mlr put -f / mlr -s
  o and, if it's frequent then submit a feature request b/c other people probably also would like it! :)

! link to SE table ...
  https://github.com/johnkerl/miller/discussions/609#discussioncomment-1115715

* put -f link in scrpting.html
* no --load in .mlrrc b/c
  https://github.com/johnkerl/miller/security/advisories/GHSA-mw2v-4q78-j2cw
  (system() in begin{})
  but suggest alias

* surface readme-versions.md @ install & tell them how to get current
  o in particular, older distros won't auto-update

* how-to-contribute @ rmd
* more clear miller docs == head

* go version & why not generics (not all distros w/ newer go compilers)

* document the fill-empty verb in
  https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling

* mlr R numpy/scipy/pandas ref.
* ffmts page <-> R read.table read.csv ...
* https://github.com/agarrharr/awesome-cli-apps#data-manipulation

* "Miller, if I understand correctly, was born for those who already frequented
  the command line, awk, for users with at least average data skills. While for
  me, a basic user, it made me discover how it gives the possibility of
  analyzing and transforming data, even complex ones, with a very low learning
  curve."

* document cloudthings, e.g.
  o go.yml
  o codespell.yml
    - codespell --check-filenames --skip *.csv,*.dkvp,*.txt,*.js,*.html,*.map,./tags,./test/cases --ignore-words-list denom,inTerm,inout,iput,nd,nin,numer,Wit,te,wee
  o readthedocs triggers
* doc
  o wut h1 spacing before/after ...
  o shell-commands: while-read example from issues
  E reference-dsl-user-defined-functions: UDSes -> non-factorial example -- maybe some useful aggregator
  o reference-main-arithmetic: ? test stats1/step -F flag
  o reference-dsl-control-structures:
    e while (NR < 10) will never terminate as NR is only incremented between
      records -> and each expression is invoked once per record so once for NR=1,
      once for NR=2, etc.
  o C-style triple-for loops: loop to NR -> NO!!!
  o Since uninitialized out-of-stream variables default to 0 for
    addition/subtraction and 1 for multiplication when they appear on expression
    right-hand sides (not quite as in awk, where they'd default to 0 either way)
    <-> xlink to other page
  r fzf-ish w/ head -n 4, --from, up-arrow & append verb, then cat -- find & update the existing section
  ! https://github.com/johnkerl/miller/issues/653 -- stats1 w/ empties? check stats2
    - needs UTs as well
  o while-read example from issues
w contact re https://jsonlines.org/on_the_web/
* verslink old relnotes
* single UT, hard to invoke w/ new full go.mod path
  go test $(ls pkg/lib/*.go|grep -v test) pkg/lib/unbackslash_test.go
  etc
* file-formats: NIDX link to headerless CSV
* window.mlr, window2.mlr -> doc somewhere
* sec2gmt --millis/--micros/--nanos doc
* sort-within-records --recursive doc
* back-incompat:
  mlr -n put $vflag '@x=1; dump > stdout, @x'
  mlr -n put $vflag '@x=1; dump > stdout @x'
* single cheatsheet page -- put out RFH?
  https://twitter.com/icymi_py/status/1426622817785765898/photo/1
* more about HTTP logs in miller -- doc and encapsulate:
  mlr --icsv --implicit-csv-header --ifs space --oxtab --from elb-log put -q 'print $27'
* hofs section on typedecls
* write up how git status after test should show any missed extra-outs
* docs nest simplers now that we have getoptish
* zlib: n.b. brew install pigz, then pigz -z
* doc shift/unshift as using [2:] and append
* comment schema-change supported only in csvlite reader, not csv reader
* consider -w/-W to stderr not stdout -- ?

? grep out all error message from regtest outputs & doc them all & make sure index-searchable at readthedocs

 ===============================================================
TESTING

! pos/neg 0x/0b/0o UTs

* RT ngrams.sh -v -o 1 one-word-list.txt
* JIT UT:
  o JSON I/O
  o mlrval_cmp.go
  o mv from-array/from-map w/ copy & mutate orig & check new -- & vice versa
  o dash-O and octal infer
  o populate each bifs/X_test.go for each bifs/X.go etc etc
* xtab splitter UT; nidx too
* hofs+typedecls RT cases
* UT per se for lrec ops

================================================================
PERFORMANCE

! profile sort

* JSON perf -- try alternate packages to encoding/json
* more perf?
  - batchify source-quench -- experiment 1st
  - further channelize (CSV-first focus) mlrval infer vs record-put ?
  ? coalesce errchan & done-writing w/ Err to RAC, and close-chan *and* EOSMarker -- ?

================================================================
CODE-NEATENS / QUALITY

* go-style assert.Equal(t, actual, expected) in place of R style expect_equal(expected, actual) in .../*_test.go

! ast namings etc
  ? make a simple NodeFromToken & have all interface{} be *ASTNode, not *token.Token
! broad commenting pass / TODO

* []Mlrval -> []*Mlrval ?
* cmp for array & map
o JIT neatens
  - carefully read all $mlv files
  - check $types & $bifs as well
  - proofread all mlrval_cmp.go dispo mxes
  - update rmds x several
* https://staticcheck.io/docs
  o lots of nice little things to clean up -- no bugs per se, all stylistic i *think* ...
* funcptr away the ifs/ifsregex check in record-readers
* all case-files could use top-notes
* godoc neatens at func/const/etc level
* unset, unassign, remove -- too many different names. also assign/put ... maybe stick w/ 2?
* check triple-dash at mlr fill-down-h ; check others
* clean up unused exitCode arg in sort/put usage.
  o also document pre/post conditions for flag and non-flag usages x all mappers
* neaten mlr gap -g (default) print
* typedecls w/ for-multi: C semantics 'k1: data k2: a:x v:1', sigh ...
* fill-down make columns required. also, --all.
* bitwise_and_dispositions et al should not have _absn for collections -- _erro instead
* libify errors.New callsites for DSL/CST
* record-readers are fully in-channel/loop; record-writers are multi with in-channel/loop being
  done by ChannelWriter, which is very small. opportunity to refactor.
* widen DSL coverage
  o err-return for array/map get/put if incorrect types ... currently go-void ...
    ! the DSL needs a full, written-down-and-published spell-out of error-eval semantics
* double-check rand-seeding
  o all rand invocations should go through the seeder for UT/other determinism
* dev-note on why `int` not `int64` -- processor-arch & those who most need it get it

? gzout, bz2out -- ? make sure this works through tee? bleah ...
? golinter?
? go exe 17MB b/c gocc -- gogll ?

================================================================
INFO

i @all-contributors please add @person for ideas
i @all-contributors please add @person for documentation

i https://en.wikipedia.org/wiki/Delimiter#Delimiter_collision

i https://framework.frictionlessdata.io/docs/tutorials/working-with-cli/

i https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/

i go tool nm -size mlr | sort -nrk 2

i https://go.dev/play/p/hrB77U3d0S3?v=gotip

* godoc notes:
  o go get golang.org/x/tools/cmd/godoc
  o dev mode:
    godoc -http=:6060 -goroot .
  o publish:
    godoc -http=:6060 -goroot .
    cd ~/tmp/bar
    wget -p -k http://localhost:6060/pkg
    mv localhost:6060 miller6
    file:///Users/kerl/tmp/bar/miller6/pkg
    maybe publish to ISP space

* pkg graph:
  go get github.com/kisielk/godepgraph
  godepgraph miller | dot -Tpng -o ~/Desktop/mlrdeps.png
  flamegraph etc double-check

* more data formats:
  https://indico.cern.ch/event/613842/contributions/2585787/attachments/1463230/2260889/pivarski-data-formats.pdf