miller/todo.txt
John Kerl 268a96d002
Export library code in pkg/ (#1391)
* Export library code in `pkg/`

* new doc page
2023-09-10 17:15:13 -04:00

410 lines
16 KiB
Text

===============================================================
* 1050 mlr check w/ empty csv column name
* 283 strmatch DSL function
* 440 strict mode
* 1128 bash/zsh autocompletions
* 1025 emitv2
* 1082 summary/type
* 1105 too many open files
* opt-in type-infers for inf, true, etc
* 982 '.' ',' etc
* extended field accessors for #763 and #948 (positional & json.nested)
* awk-like exit
* mrpl exits ...
================================================================
STRICT MODE
? re silent zero-pass for-loops on non-collections:
i theme is handling of 'absent'
? what about handling of 'error' ?
* improve wording:
mlr: couldn't assign variable int function return value from value absent (absent)
* need $?x and @?x in the grammar & CST
* flags:
o mlr -z and mlr put -z
o note put has -w (warn) and -W (fatal)
- then strict mode includes -W?
* tests:
mlr --csv --from $exv put -z 'x = 1'
mlr --csv --from $exv put -z 'var x = a'
mlr --csv --from $exv put -z 'var x = $nonesuch'
mlr --csv --from $exv put -z 'var x = $["asdf"]'
mlr --csv --from $exv put -z 'var x = $[nonesuch]'
mlr --csv --from $exv put -z 'var x = $[[999]]'
mlr --csv --from $exv put -z 'var x = $[[[999]]]'
mlr --csv --from $exv put -z 'begin { var m = $* }'
mlr --csv --from $exv put -z 'var x = @nonesuch'
mlr --csv --from $exv put -z 'var x = @["nonesuch"]'
mlr --csv --from $exv put -z 'func f(): int {}; $x = f()'
mlr --csv --from $exv put -z 'func f() {}; $x = f()'
mlr --csv --from $exv put -z 'func f() {return nonesuch}; $x = f()'
mlr --csv --from $exv put -z '$env = ENV[nonesuch]'
mlr --csv --from $exv put -z '$env = ENV["nonesuch"]'
----------------------------------------------------------------
EXTENDED FIELD ACCESSORS
* extract/simplify getter logic outside of full CST/runtime/etc
* sketch out putter ...
? what does 'cut -f a.b[3]' even mean?
o 'cut -xf a.b[3]' is easy enough ...
o field name '@graph' is an issue :(
* maybe ......... not DSL per se but a separate "parser" here
o delimiters '[' ']' '.' '{' '}'
o $foo
o ${foo} -- in case the field name has a . in it or what have you
x $["foo"] -- don't support this
o $foo.bar -- treat this like $foo["bar"]
o $foo["bar"] or foo[1] -- contents just a string or an integer (no expressions supported)
o $[[2]] -- contents just an integer (no expressions supported)
o $[[[3]]] -- contents just an integer (no expressions supported)
o note $foo.bar[[[2]]][[3]] needs to work
$foo["bar"]={"a":1,"b":{"x":3},"c":4}
$foo["bar"][[[2]]][[1]] is "x"
o parse as walk-through (with {...} special) until . [] [[]] [[[]]]
- example $foo.bar[[[2]]][[3]]
L1: $* -> foo
L2: -> map-index ["bar"]
L3: positional value 2
L4: positional name 3
o maybe ... eval as simply composition of classes
- level 1 mlrmap -> mlrval -- $foo or $[[2]] or $[[[3]]]
- level 2+ mlrval -> mlrval -- [], [[]], [[[]]], or .
* code-comment why special classes & not just DSL
o inside [] [[]] [[[]]] are just int/string (maybe slices?); no expressions
o no $[...] with strings
o no need for runtime.State
o move POC from $cst to $mlrval -- ? or maybe its own package?
* note that '.' is quite fine in CSV files etc -- also maybe in JSON ........
? so, maybe only enable this with mlr -e? also maybe extended/non-extended at each & every verb -- ?!?
? maybe if extended/non-extended is unspecified, default it to on for json/jsonl input & off otherwise -- ?
* maybe also 'raise' and 'raise_array' verbs (need better names though)
----------------------------------------------------------------
TSV etc
? also: some escapes perhaps for dkvp, xtab, pprint -- ?
o nidx is a particular pure-text, leave-as-is
? try out nidx single-line w/ \001, \002 FS/PS & \n or \n\n RS
o make/publicize a shorthand for this -- ?
o --words && --lines & --paragraphs -- ?
----------------------------------------------------------------
example pages as feature catch-up:
* example page for format/scan options
o format/unformat
x=unformat(1,2); x; string(x); is_error(x)
o strmatch
o =~
* separate examples section from (new) FAQs section
* mlr split -- needs an example page along with the tee DSL function
o narrative about in-memory, shard-out to workers, holdouts, sample/bootstrap, etc etc.
o ulimit note
* new slwin/ewma example entry, with ccump and pgr
o slwin --prune (or somesuch) to only emit averages over full windows -- ?
----------------------------------------------------------------
inference:
* inf/nan (unquoted) -> DSL
o webdocs as in #933 description
* for data files: --symbol-true yes --symbol-false off --symbol-infinity inf --symbol-not-available N/A
----------------------------------------------------------------
! sysdate, sysdate_local; datediff ...
----------------------------------------------------------------
! strmatch https://github.com/johnkerl/miller/issues/77#issuecomment-538790927
----------------------------------------------------------------
* make a lag-by-n and lead-by-n
----------------------------------------------------------------
! rank
----------------------------------------------------------------
! CSV default unsparsify ... note this must trust the *first* record .......
* CSV: acceptance somehow for \" in place of ""
o & more flex more generally
----------------------------------------------------------------
! quoted NIDX
- how with whitespace regex -- ?
! quoted DKVP
- what about csvlite-style -- ? needs a --dkvplite ?
! pprint emit on schema change, not all-at-end.
* new columns-to-arrays and arrays-to-columns for stan format
* transpose ...
* DKVP/XTAB encodings/decodings for \n etc; non-lite DKVP reader -- or nah? broken already for these cases
* csvstat / pandas describe
* r-strings/implicit-r/297: double-check end of reference-main-data-types.md.in
* non-streaming DSL-enabled cut: https://github.com/johnkerl/miller/discussions/613
* pv: 'mlr --prepipex pv --gzin tail -n 10 ~/tmp/zhuge.gz' needs --gzin & --prepipex both
o plus corresponding docwork
* -W to mlr main also -- so it can be used with .mlrrc
? json-triple-quote -- what can be done here?
? dotted-syntax support in verbs?
? full format-string parser for corner cases like "X%08lldX"
? array and string slices on LHS of assignments
? feature/shorthand for repl newline before prompt
? meta: nr, nf, keys
? BIFs as FCFs?
? IP addresses and ranges as a datatype so one could do membership tests like "if 10.10.10.10 in 10.0.0.0/8".
? push/pop/shift/unshift functions
? IIFEs: 'func f = (func(){ return udf})()'
? non-top-level func defs
? support 'x[1]["a"]' etc notation in various verbs?
? sort within nested data structures?
? ast-parex separate mlr auxents entrypoint?
? het ifmt-reading
- separate out InputFormat into per-file (or tbd) & have autodetect on file endings -- ?
- maybe a TBD reader -- ?
- InputFormat into Context
- TBD writer -- defer factory until first context?
- deeper refactor pulling format out of reader/writer options entirely -- ?
? multiple-valued return/assign -- ?
? array destructure at LHS for multi-retval assign (maps too?)
? new file formats
o https://brandur.org/logfmt is simply DKVP w/ IFS = space (need dquot though)
o https://docs.fluentbit.io/manual/pipeline/parsers/ltsv is just DKVP with IFS tab and IPS colon
? string/array slices on assignment LHS -- ?
? zip -- but that is an archive case too not just an encoding case
? miller support for archive-traversal; directory-traversal even w/o zip -- ?
? as 6.1 follow-on work -- ?
* mlr -k
o various test cases
o OLH re limitations
o check JSON-parser x 2 -- is there really a 'restart'?
- infinite-loop avoidance for sure
================================================================
UX
? NF?
! bnf fix for '[[' ']]' etc -- make it a nesting of singles. since otherwise no '[[3,4]]' literals :(
! broadly rethink os.Exit, especially as affecting mlr repl
* consider expanding '(error)' to have more useful error-text
* sync-print option; or (yuck) another xprint variant; or ...; emph dump/eprint
* strptime w/ ...00.Z -> error
* short 'asserting' functions (absent/error); and/or strict or somesuch (DSL, or global ...)
* header-only csv ... reminder ...
* bash completion script https://github.com/johnkerl/miller/issues/77#issuecomment-308247402
https://iridakos.com/programming/2018/03/01/bash-programmable-completion-tutorial#:~:text=Bash%20completion%20is%20a%20functionality,key%20while%20typing%20a%20command.
* tilde-expand for REPL load/open: if '~' is at the start of the string, run it though 'sh -c echo'
o or, simpler, just slice[1:] and prepend with os.Getenv("HOME")
? fmtnum(98, "%3d%%") -- ? workaround: fmtnum(98, "%3d") . "%"
? main-level (verb-level?) flag for "," -> X in verbs -- in case commas in field names
? trace-mode ?
? $0 as raw-record string -- ? would make mlr grep simpler and more natural ...
================================================================
DOC
* issue on gnu parallel -> link to scripting page
* https://stackoverflow.com/questions/71011603/is-there-a-simple-way-to-convert-a-csv-with-0-indexed-paths-as-keys-to-json-with/71015190#71015190
! slwin / unformat narrative:
o prototype in DSL
o mlr put -f / mlr -s
o and, if it's frequent then submit a feature request b/c other people probably also would like it! :)
! link to SE table ...
https://github.com/johnkerl/miller/discussions/609#discussioncomment-1115715
* put -f link in scrpting.html
* no --load in .mlrrc b/c
https://github.com/johnkerl/miller/security/advisories/GHSA-mw2v-4q78-j2cw
(system() in begin{})
but suggest alias
* surface readme-versions.md @ install & tell them how to get current
o in particular, older distros won't auto-update
* how-to-contribute @ rmd
* more clear miller docs == head
* go version & why not generics (not all distros w/ newer go compilers)
* document the fill-empty verb in
https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling
* mlr R numpy/scipy/pandas ref.
* ffmts page <-> R read.table read.csv ...
* https://github.com/agarrharr/awesome-cli-apps#data-manipulation
* "Miller, if I understand correctly, was born for those who already frequented
the command line, awk, for users with at least average data skills. While for
me, a basic user, it made me discover how it gives the possibility of
analyzing and transforming data, even complex ones, with a very low learning
curve."
* document cloudthings, e.g.
o go.yml
o codespell.yml
- codespell --check-filenames --skip *.csv,*.dkvp,*.txt,*.js,*.html,*.map,./tags,./test/cases --ignore-words-list denom,inTerm,inout,iput,nd,nin,numer,Wit,te,wee
o readthedocs triggers
* doc
o wut h1 spacing before/after ...
o shell-commands: while-read example from issues
E reference-dsl-user-defined-functions: UDSes -> non-factorial example -- maybe some useful aggregator
o reference-main-arithmetic: ? test stats1/step -F flag
o reference-dsl-control-structures:
e while (NR < 10) will never terminate as NR is only incremented between
records -> and each expression is invoked once per record so once for NR=1,
once for NR=2, etc.
o C-style triple-for loops: loop to NR -> NO!!!
o Since uninitialized out-of-stream variables default to 0 for
addition/subtraction and 1 for multiplication when they appear on expression
right-hand sides (not quite as in awk, where they'd default to 0 either way)
<-> xlink to other page
r fzf-ish w/ head -n 4, --from, up-arrow & append verb, then cat -- find & update the existing section
! https://github.com/johnkerl/miller/issues/653 -- stats1 w/ empties? check stats2
- needs UTs as well
o while-read example from issues
w contact re https://jsonlines.org/on_the_web/
* verslink old relnotes
* single UT, hard to invoke w/ new full go.mod path
go test $(ls pkg/lib/*.go|grep -v test) pkg/lib/unbackslash_test.go
etc
* file-formats: NIDX link to headerless CSV
* window.mlr, window2.mlr -> doc somewhere
* sec2gmt --millis/--micros/--nanos doc
* sort-within-records --recursive doc
* back-incompat:
mlr -n put $vflag '@x=1; dump > stdout, @x'
mlr -n put $vflag '@x=1; dump > stdout @x'
* single cheatsheet page -- put out RFH?
https://twitter.com/icymi_py/status/1426622817785765898/photo/1
* more about HTTP logs in miller -- doc and encapsulate:
mlr --icsv --implicit-csv-header --ifs space --oxtab --from elb-log put -q 'print $27'
* hofs section on typedecls
* write up how git status after test should show any missed extra-outs
* docs nest simplers now that we have getoptish
* zlib: n.b. brew install pigz, then pigz -z
* doc shift/unshift as using [2:] and append
* comment schema-change supported only in csvlite reader, not csv reader
* consider -w/-W to stderr not stdout -- ?
? grep out all error message from regtest outputs & doc them all & make sure index-searchable at readthedocs
===============================================================
TESTING
! pos/neg 0x/0b/0o UTs
* RT ngrams.sh -v -o 1 one-word-list.txt
* JIT UT:
o JSON I/O
o mlrval_cmp.go
o mv from-array/from-map w/ copy & mutate orig & check new -- & vice versa
o dash-O and octal infer
o populate each bifs/X_test.go for each bifs/X.go etc etc
* xtab splitter UT; nidx too
* hofs+typedecls RT cases
* UT per se for lrec ops
================================================================
PERFORMANCE
! profile sort
* JSON perf -- try alternate packages to encoding/json
* more perf?
- batchify source-quench -- experiment 1st
- further channelize (CSV-first focus) mlrval infer vs record-put ?
? coalesce errchan & done-writing w/ Err to RAC, and close-chan *and* EOSMarker -- ?
================================================================
CODE-NEATENS / QUALITY
* go-style assert.Equal(t, actual, expected) in place of R style expect_equal(expected, actual) in .../*_test.go
! ast namings etc
? make a simple NodeFromToken & have all interface{} be *ASTNode, not *token.Token
! broad commenting pass / TODO
* []Mlrval -> []*Mlrval ?
* cmp for array & map
o JIT neatens
- carefully read all $mlv files
- check $types & $bifs as well
- proofread all mlrval_cmp.go dispo mxes
- update rmds x several
* https://staticcheck.io/docs
o lots of nice little things to clean up -- no bugs per se, all stylistic i *think* ...
* funcptr away the ifs/ifsregex check in record-readers
* all case-files could use top-notes
* godoc neatens at func/const/etc level
* unset, unassign, remove -- too many different names. also assign/put ... maybe stick w/ 2?
* check triple-dash at mlr fill-down-h ; check others
* clean up unused exitCode arg in sort/put usage.
o also document pre/post conditions for flag and non-flag usages x all mappers
* neaten mlr gap -g (default) print
* typedecls w/ for-multi: C semantics 'k1: data k2: a:x v:1', sigh ...
* fill-down make columns required. also, --all.
* bitwise_and_dispositions et al should not have _absn for collections -- _erro instead
* libify errors.New callsites for DSL/CST
* record-readers are fully in-channel/loop; record-writers are multi with in-channel/loop being
done by ChannelWriter, which is very small. opportunity to refactor.
* widen DSL coverage
o err-return for array/map get/put if incorrect types ... currently go-void ...
! the DSL needs a full, written-down-and-published spell-out of error-eval semantics
* double-check rand-seeding
o all rand invocations should go through the seeder for UT/other determinism
* dev-note on why `int` not `int64` -- processor-arch & those who most need it get it
? gzout, bz2out -- ? make sure this works through tee? bleah ...
? golinter?
? go exe 17MB b/c gocc -- gogll ?
================================================================
INFO
i @all-contributors please add @person for ideas
i @all-contributors please add @person for documentation
i https://en.wikipedia.org/wiki/Delimiter#Delimiter_collision
i https://framework.frictionlessdata.io/docs/tutorials/working-with-cli/
i https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/
i go tool nm -size mlr | sort -nrk 2
i https://go.dev/play/p/hrB77U3d0S3?v=gotip
* godoc notes:
o go get golang.org/x/tools/cmd/godoc
o dev mode:
godoc -http=:6060 -goroot .
o publish:
godoc -http=:6060 -goroot .
cd ~/tmp/bar
wget -p -k http://localhost:6060/pkg
mv localhost:6060 miller6
file:///Users/kerl/tmp/bar/miller6/pkg
maybe publish to ISP space
* pkg graph:
go get github.com/kisielk/godepgraph
godepgraph miller | dot -Tpng -o ~/Desktop/mlrdeps.png
flamegraph etc double-check
* more data formats:
https://indico.cern.ch/event/613842/contributions/2585787/attachments/1463230/2260889/pivarski-data-formats.pdf