miller/todo.txt
John Kerl b7f76cd622
Multiple on-line-help issues from #908 (#921)
* More on issue 908

* doc-build artifacts
2022-02-05 00:18:55 -05:00

358 lines
14 KiB
Text

===============================================================
RELEASES
* plan 6.1.0
? w/ natural sort order
? strptime
? datediff et al.
? mlr join --left-fields a,b,c
? rank
? ?foo and ??foo @ repl help
o fmt/unfmt/regex doc
o FAQ/examples reorg
k default colors; bold/underline/reverse
k array concat
k format/unformat
k split
k slwin & shift-lead
k 0o.. octal literals in the DSL
* plan 6.2.0
? YAML
================================================================
908
* is_not_empty (class=typing #args=1) False if field is present in input with empty value, true
otherwise -> That means that and absent value will return true. While technically it makes sense,
many people will ask if "is not empty" to get the green flag before processing the value, which will
be nonexistent if the field is absent. Would it make sense to consider absent fields as empty ones?
Probably you already though about those concerns, but just in case.
-> engage in issue
================================================================
FEATURES
----------------------------------------------------------------
* natural sort order
https://github.com/facette/natsort
----------------------------------------------------------------
example pages as feature catch-up:
* example page for format/scan options
o format/unformat
x=unformat(1,2); x; string(x); is_error(x)
o strmatch
o =~
* separate examples section from (new) FAQs section
* mlr split -- needs an example page along with the tee DSL function
o narrative about in-memory, shard-out to workers, holdouts, sample/bootstrap, etc etc.
o ulimit note
* new slwin/ewma example entry, with ccump and pgr
o slwin --prune (or somesuch) to only emit averages over full windows -- ?
----------------------------------------------------------------
mlr join --left-fields a,b,c
----------------------------------------------------------------
! Better functions for values manipulation, e.g. easier conversion of strings like "$1,234.56" into numeric values
o note on is_error(x) (or string(x) == "(error)")
? dhms w/ optional separgs -- ? what about fenceposting? ["d","h","m","s"] vs ["-",":",":",""] -- ?
o 'Ability to specify some formats that are fixed. Like we can process
"5d18h53m20s" format in *dhms* commands, but what about "5-18:53:20"? This is
a common format used by the SLURM resource manager.'
o linked-to faqent w/ -f -s etc ...
----------------------------------------------------------------
k better print-interpolate with {} etc
doc: mlr --csv --from example.csv put -q 'print format("Index {} at quantity {} and rate {}", $index, $quantity, $rate)'
----------------------------------------------------------------
! sysdate, sysdate_local; datediff ...
----------------------------------------------------------------
! strmatch https://github.com/johnkerl/miller/issues/77#issuecomment-538790927
----------------------------------------------------------------
* make a lag-by-n and lead-by-n
----------------------------------------------------------------
! rank
----------------------------------------------------------------
! CSV default unsparsify ... note this must trust the *first* record .......
* CSV: acceptance somehow for \" in place of ""
o & more flex more generally
----------------------------------------------------------------
! quoted NIDX
- how with whitespace regex -- ?
! quoted DKVP
- what about csvlite-style -- ? needs a --dkvplite ?
! pprint emit on schema change, not all-at-end.
* new columns-to-arrays and arrays-to-columns for stan format
* transpose ...
* DKVP/XTAB encodings/decodings for \n etc; non-lite DKVP reader -- or nah? broken already for these cases
* csvstat / pandas describe
* r-strings/implicit-r/297: double-check end of reference-main-data-types.md.in
* non-streaming DSL-enabled cut: https://github.com/johnkerl/miller/discussions/613
* pv: 'mlr --prepipex pv --gzin tail -n 10 ~/tmp/zhuge.gz' needs --gzin & --prepipex both
o plus corresponding docwork
* -W to mlr main also -- so it can be used with .mlrrc
? json-triple-quote -- what can be done here?
? dotted-syntax support in verbs?
? full format-string parser for corner cases like "X%08lldX"
? array and string slices on LHS of assignments
? feature/shorthand for repl newline before prompt
? meta: nr, nf, keys
? BIFs as FCFs?
? IP addresses and ranges as a datatype so one could do membership tests like "if 10.10.10.10 in 10.0.0.0/8".
? push/pop/shift/unshift functions
? IIFEs: 'func f = (func(){ return udf})()'
? non-top-level func defs
? support 'x[1]["a"]' etc notation in various verbs?
? sort within nested data structures?
? ast-parex separate mlr auxents entrypoint?
? het ifmt-reading
- separate out InputFormat into per-file (or tbd) & have autodetect on file endings -- ?
- maybe a TBD reader -- ?
- InputFormat into Context
- TBD writer -- defer factory until first context?
- deeper refactor pulling format out of reader/writer options entirely -- ?
? multiple-valued return/assign -- ?
? array destructure at LHS for multi-retval assign (maps too?)
? new file formats
o https://brandur.org/logfmt is simply DKVP w/ IFS = space (need dquot though)
o https://docs.fluentbit.io/manual/pipeline/parsers/ltsv is just DKVP with IFS tab and IPS colon
? string/array slices on assignment LHS -- ?
? zip -- but that is an archive case too not just an encoding case
? miller support for archive-traversal; directory-traversal even w/o zip -- ?
? as 6.1 follow-on work -- ?
* mlr -k
o various test cases
o OLH re limitations
o check JSON-parser x 2 -- is there really a 'restart'?
- infinite-loop avoidance for sure
================================================================
UX
* ?xyz and ??xyz in repl, for :help and :help find respectively
? NF?
! bnf fix for '[[' ']]' etc -- make it a nesting of singles. since otherwise no '[[3,4]]' literals :(
! broadly rethink os.Exit, especially as affecting mlr repl
* consider expanding '(error)' to have more useful error-text
* sync-print option; or (yuck) another xprint variant; or ...; emph dump/eprint
* strptime w/ ...00.Z -> error
* short 'asserting' functions (absent/error); and/or strict or somesuch (DSL, or global ...)
* header-only csv ... reminder ...
* bash completion script https://github.com/johnkerl/miller/issues/77#issuecomment-308247402
https://iridakos.com/programming/2018/03/01/bash-programmable-completion-tutorial#:~:text=Bash%20completion%20is%20a%20functionality,key%20while%20typing%20a%20command.
* tilde-expand for REPL load/open: if '~' is at the start of the string, run it though 'sh -c echo'
o or, simpler, just slice[1:] and prepend with os.Getenv("HOME")
? fmtnum(98, "%3d%%") -- ? workaround: fmtnum(98, "%3d") . "%"
? main-level (verb-level?) flag for "," -> X in verbs -- in case commas in field names
? trace-mode ?
? re silent zero-pass for-loops on non-collections:
o intended as a heterogenity feature ...
o consider a --errors or --strict mode; something
? $0 as raw-record string -- ? would make mlr grep simpler and more natural ...
================================================================
DOC
! slwin / unformat narrative:
o prototype in DSL
o mlr put -f / mlr -s
o and, if it's frequent then submit a feature request b/c other people probably also would like it! :)
! link to SE table ...
https://github.com/johnkerl/miller/discussions/609#discussioncomment-1115715
* put -f link in scrpting.html
* no --load in .mlrrc b/c
https://github.com/johnkerl/miller/security/advisories/GHSA-mw2v-4q78-j2cw
(system() in begin{})
but suggest alias
* surface readme-versions.md @ install & tell them how to get current
o in particular, older distros won't auto-update
* how-to-contribute @ rmd
* more clear miller docs == head
* go version & why not generics (not all distros w/ newer go compilers)
* document the fill-empty verb in
https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling
* mlr R numpy/scipy/pandas ref.
* ffmts page <-> R read.table read.csv ...
* https://github.com/agarrharr/awesome-cli-apps#data-manipulation
* "Miller, if I understand correctly, was born for those who already frequented
the command line, awk, for users with at least average data skills. While for
me, a basic user, it made me discover how it gives the possibility of
analyzing and transforming data, even complex ones, with a very low learning
curve."
* document cloudthings, e.g.
o go.yml
o codespell.yml
- codespell --check-filenames --skip *.csv,*.dkvp,*.txt,*.js,*.html,*.map,./tags,./test/cases --ignore-words-list denom,inTerm,inout,iput,nd,nin,numer,Wit,te,wee
o readthedocs triggers
* doc
o wut h1 spacing before/after ...
o shell-commands: while-read example from issues
E reference-dsl-user-defined-functions: UDSes -> non-factorial example -- maybe some useful aggregator
o reference-main-arithmetic: ? test stats1/step -F flag
o reference-dsl-control-structures:
e while (NR < 10) will never terminate as NR is only incremented between
records -> and each expression is invoked once per record so once for NR=1,
once for NR=2, etc.
o C-style triple-for loops: loop to NR -> NO!!!
o Since uninitialized out-of-stream variables default to 0 for
addition/subtraction and 1 for multiplication when they appear on expression
right-hand sides (not quite as in awk, where they'd default to 0 either way)
<-> xlink to other page
r fzf-ish w/ head -n 4, --from, up-arrow & append verb, then cat -- find & update the existing section
! https://github.com/johnkerl/miller/issues/653 -- stats1 w/ empties? check stats2
- needs UTs as well
o while-read example from issues
w contact re https://jsonlines.org/on_the_web/
* verslink old relnotes
* single UT, hard to invoke w/ new full go.mod path
go test $(ls internal/pkg/lib/*.go|grep -v test) internal/pkg/lib/unbackslash_test.go
etc
* file-formats: NIDX link to headerless CSV
* window.mlr, window2.mlr -> doc somewhere
* sec2gmt --millis/--micros/--nanos doc
* sort-within-records --recursive doc
* back-incompat:
mlr -n put $vflag '@x=1; dump > stdout, @x'
mlr -n put $vflag '@x=1; dump > stdout @x'
* single cheatsheet page -- put out RFH?
https://twitter.com/icymi_py/status/1426622817785765898/photo/1
* more about HTTP logs in miller -- doc and encapsulate:
mlr --icsv --implicit-csv-header --ifs space --oxtab --from elb-log put -q 'print $27'
* hofs section on typedecls
* write up how git status after test should show any missed extra-outs
* docs nest simplers now that we have getoptish
* zlib: n.b. brew install pigz, then pigz -z
* doc shift/unshift as using [2:] and append
* comment schema-change supported only in csvlite reader, not csv reader
* consider -w/-W to stderr not stdout -- ?
? grep out all error message from regtest outputs & doc them all & make sure index-searchable at readthedocs
===============================================================
TESTING
! ./mlr vs mlr ...
! pos/neg 0x/0b/0o UTs
* RT ngrams.sh -v -o 1 one-word-list.txt
* JIT UT:
o JSON I/O
o mlrval_cmp.go
o mv from-array/from-map w/ copy & mutate orig & check new -- & vice versa
o dash-O and octal infer
o populate each bifs/X_test.go for each bifs/X.go etc etc
* xtab splitter UT; nidx too
* hofs+typedecls RT cases
* UT per se for lrec ops
================================================================
PERFORMANCE
! profile sort
* JSON perf -- try alternate packages to encoding/json
* more perf?
- batchify source-quench -- experiment 1st
- further channelize (CSV-first focus) mlrval infer vs record-put ?
? coalesce errchan & done-writing w/ Err to RAC, and close-chan *and* EOSMarker -- ?
================================================================
CODE-NEATENS / QUALITY
* go-style assert.Equal(t, actual, expected) in place of R style expect_equal(expected, actual) in .../*_test.go
! ast namings etc
? make a simple NodeFromToken & have all interface{} be *ASTNode, not *token.Token
! broad commenting pass / TODO
* []Mlrval -> []*Mlrval ?
* cmp for array & map
o JIT neatens
- carefully read all $mlv files
- check $types & $bifs as well
- proofread all mlrval_cmp.go dispo mxes
- update rmds x several
* https://staticcheck.io/docs
o lots of nice little things to clean up -- no bugs per se, all stylistic i *think* ...
* funcptr away the ifs/ifsregex check in record-readers
* all case-files could use top-notes
* godoc neatens at func/const/etc level
* unset, unassign, remove -- too many different names. also assign/put ... maybe stick w/ 2?
* check triple-dash at mlr fill-down-h ; check others
* clean up unused exitCode arg in sort/put usage.
o also document pre/post conditions for flag and non-flag usages x all mappers
* neaten mlr gap -g (default) print
* typedecls w/ for-multi: C semantics 'k1: data k2: a:x v:1', sigh ...
* fill-down make columns required. also, --all.
* bitwise_and_dispositions et al should not have _absn for collections -- _erro instead
* libify errors.New callsites for DSL/CST
* record-readers are fully in-channel/loop; record-writers are multi with in-channel/loop being
done by ChannelWriter, which is very small. opportunity to refactor.
* widen DSL coverage
o err-return for array/map get/put if incorrect types ... currently go-void ...
! the DSL needs a full, written-down-and-published spell-out of error-eval semantics
* double-check rand-seeding
o all rand invocations should go through the seeder for UT/other determinism
* dev-note on why `int` not `int64` -- processor-arch & those who most need it get it
? gzout, bz2out -- ? make sure this works through tee? bleah ...
? golinter?
? go exe 17MB b/c gocc -- gogll ?
================================================================
INFO
i https://en.wikipedia.org/wiki/Delimiter#Delimiter_collision
i https://framework.frictionlessdata.io/docs/tutorials/working-with-cli/
i https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/
i go tool nm -size mlr | sort -nrk 2
i https://go.dev/play/p/hrB77U3d0S3?v=gotip
* godoc notes:
o go get golang.org/x/tools/cmd/godoc
o dev mode:
godoc -http=:6060 -goroot .
o publish:
godoc -http=:6060 -goroot .
cd ~/tmp/bar
wget -p -k http://localhost:6060/pkg
mv localhost:6060 miller6
file:///Users/kerl/tmp/bar/miller6/pkg
maybe publish to ISP space
* pkg graph:
go get github.com/kisielk/godepgraph
godepgraph miller | dot -Tpng -o ~/Desktop/mlrdeps.png
flamegraph etc double-check
* more data formats:
https://indico.cern.ch/event/613842/contributions/2585787/attachments/1463230/2260889/pivarski-data-formats.pdf