mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-22 18:06:52 +00:00
410 lines
16 KiB
Text
410 lines
16 KiB
Text
===============================================================
|
|
* 1050 mlr check w/ empty csv column name
|
|
* 283 strmatch DSL function
|
|
* 440 strict mode
|
|
* 1128 bash/zsh autocompletions
|
|
* 1025 emitv2
|
|
* 1082 summary/type
|
|
* 1105 too many open files
|
|
|
|
* opt-in type-infers for inf, true, etc
|
|
* 982 '.' ',' etc
|
|
* extended field accessors for #763 and #948 (positional & json.nested)
|
|
* awk-like exit
|
|
* mrpl exits ...
|
|
|
|
================================================================
|
|
STRICT MODE
|
|
|
|
? re silent zero-pass for-loops on non-collections:
|
|
|
|
i theme is handling of 'absent'
|
|
|
|
? what about handling of 'error' ?
|
|
|
|
* improve wording:
|
|
mlr: couldn't assign variable int function return value from value absent (absent)
|
|
|
|
* need $?x and @?x in the grammar & CST
|
|
|
|
* flags:
|
|
o mlr -z and mlr put -z
|
|
o note put has -w (warn) and -W (fatal)
|
|
- then strict mode includes -W?
|
|
|
|
* tests:
|
|
mlr --csv --from $exv put -z 'x = 1'
|
|
mlr --csv --from $exv put -z 'var x = a'
|
|
mlr --csv --from $exv put -z 'var x = $nonesuch'
|
|
mlr --csv --from $exv put -z 'var x = $["asdf"]'
|
|
mlr --csv --from $exv put -z 'var x = $[nonesuch]'
|
|
mlr --csv --from $exv put -z 'var x = $[[999]]'
|
|
mlr --csv --from $exv put -z 'var x = $[[[999]]]'
|
|
mlr --csv --from $exv put -z 'begin { var m = $* }'
|
|
mlr --csv --from $exv put -z 'var x = @nonesuch'
|
|
mlr --csv --from $exv put -z 'var x = @["nonesuch"]'
|
|
mlr --csv --from $exv put -z 'func f(): int {}; $x = f()'
|
|
mlr --csv --from $exv put -z 'func f() {}; $x = f()'
|
|
mlr --csv --from $exv put -z 'func f() {return nonesuch}; $x = f()'
|
|
mlr --csv --from $exv put -z '$env = ENV[nonesuch]'
|
|
mlr --csv --from $exv put -z '$env = ENV["nonesuch"]'
|
|
|
|
----------------------------------------------------------------
|
|
EXTENDED FIELD ACCESSORS
|
|
|
|
* extract/simplify getter logic outside of full CST/runtime/etc
|
|
* sketch out putter ...
|
|
? what does 'cut -f a.b[3]' even mean?
|
|
o 'cut -xf a.b[3]' is easy enough ...
|
|
o field name '@graph' is an issue :(
|
|
* maybe ......... not DSL per se but a separate "parser" here
|
|
o delimiters '[' ']' '.' '{' '}'
|
|
o $foo
|
|
o ${foo} -- in case the field name has a . in it or what have you
|
|
x $["foo"] -- don't support this
|
|
o $foo.bar -- treat this like $foo["bar"]
|
|
o $foo["bar"] or foo[1] -- contents just a string or an integer (no expressions supported)
|
|
o $[[2]] -- contents just an integer (no expressions supported)
|
|
o $[[[3]]] -- contents just an integer (no expressions supported)
|
|
o note $foo.bar[[[2]]][[3]] needs to work
|
|
$foo["bar"]={"a":1,"b":{"x":3},"c":4}
|
|
$foo["bar"][[[2]]][[1]] is "x"
|
|
o parse as walk-through (with {...} special) until . [] [[]] [[[]]]
|
|
- example $foo.bar[[[2]]][[3]]
|
|
L1: $* -> foo
|
|
L2: -> map-index ["bar"]
|
|
L3: positional value 2
|
|
L4: positional name 3
|
|
o maybe ... eval as simply composition of classes
|
|
- level 1 mlrmap -> mlrval -- $foo or $[[2]] or $[[[3]]]
|
|
- level 2+ mlrval -> mlrval -- [], [[]], [[[]]], or .
|
|
* code-comment why special classes & not just DSL
|
|
o inside [] [[]] [[[]]] are just int/string (maybe slices?); no expressions
|
|
o no $[...] with strings
|
|
o no need for runtime.State
|
|
o move POC from $cst to $mlrval -- ? or maybe its own package?
|
|
* note that '.' is quite fine in CSV files etc -- also maybe in JSON ........
|
|
? so, maybe only enable this with mlr -e? also maybe extended/non-extended at each & every verb -- ?!?
|
|
? maybe if extended/non-extended is unspecified, default it to on for json/jsonl input & off otherwise -- ?
|
|
* maybe also 'raise' and 'raise_array' verbs (need better names though)
|
|
|
|
----------------------------------------------------------------
|
|
TSV etc
|
|
|
|
? also: some escapes perhaps for dkvp, xtab, pprint -- ?
|
|
o nidx is a particular pure-text, leave-as-is
|
|
? try out nidx single-line w/ \001, \002 FS/PS & \n or \n\n RS
|
|
o make/publicize a shorthand for this -- ?
|
|
o --words && --lines & --paragraphs -- ?
|
|
|
|
----------------------------------------------------------------
|
|
example pages as feature catch-up:
|
|
* example page for format/scan options
|
|
o format/unformat
|
|
x=unformat(1,2); x; string(x); is_error(x)
|
|
o strmatch
|
|
o =~
|
|
* separate examples section from (new) FAQs section
|
|
* mlr split -- needs an example page along with the tee DSL function
|
|
o narrative about in-memory, shard-out to workers, holdouts, sample/bootstrap, etc etc.
|
|
o ulimit note
|
|
* new slwin/ewma example entry, with ccump and pgr
|
|
o slwin --prune (or somesuch) to only emit averages over full windows -- ?
|
|
|
|
----------------------------------------------------------------
|
|
inference:
|
|
|
|
* inf/nan (unquoted) -> DSL
|
|
o webdocs as in #933 description
|
|
* for data files: --symbol-true yes --symbol-false off --symbol-infinity inf --symbol-not-available N/A
|
|
|
|
----------------------------------------------------------------
|
|
! sysdate, sysdate_local; datediff ...
|
|
|
|
----------------------------------------------------------------
|
|
! strmatch https://github.com/johnkerl/miller/issues/77#issuecomment-538790927
|
|
|
|
----------------------------------------------------------------
|
|
* make a lag-by-n and lead-by-n
|
|
|
|
----------------------------------------------------------------
|
|
! rank
|
|
|
|
----------------------------------------------------------------
|
|
! CSV default unsparsify ... note this must trust the *first* record .......
|
|
* CSV: acceptance somehow for \" in place of ""
|
|
o & more flex more generally
|
|
|
|
----------------------------------------------------------------
|
|
! quoted NIDX
|
|
- how with whitespace regex -- ?
|
|
! quoted DKVP
|
|
- what about csvlite-style -- ? needs a --dkvplite ?
|
|
! pprint emit on schema change, not all-at-end.
|
|
|
|
* new columns-to-arrays and arrays-to-columns for stan format
|
|
* transpose ...
|
|
* DKVP/XTAB encodings/decodings for \n etc; non-lite DKVP reader -- or nah? broken already for these cases
|
|
* csvstat / pandas describe
|
|
* r-strings/implicit-r/297: double-check end of reference-main-data-types.md.in
|
|
* non-streaming DSL-enabled cut: https://github.com/johnkerl/miller/discussions/613
|
|
|
|
* pv: 'mlr --prepipex pv --gzin tail -n 10 ~/tmp/zhuge.gz' needs --gzin & --prepipex both
|
|
o plus corresponding docwork
|
|
* -W to mlr main also -- so it can be used with .mlrrc
|
|
|
|
? json-triple-quote -- what can be done here?
|
|
? dotted-syntax support in verbs?
|
|
? full format-string parser for corner cases like "X%08lldX"
|
|
? array and string slices on LHS of assignments
|
|
? feature/shorthand for repl newline before prompt
|
|
? meta: nr, nf, keys
|
|
? BIFs as FCFs?
|
|
? IP addresses and ranges as a datatype so one could do membership tests like "if 10.10.10.10 in 10.0.0.0/8".
|
|
? push/pop/shift/unshift functions
|
|
? IIFEs: 'func f = (func(){ return udf})()'
|
|
? non-top-level func defs
|
|
? support 'x[1]["a"]' etc notation in various verbs?
|
|
? sort within nested data structures?
|
|
? ast-parex separate mlr auxents entrypoint?
|
|
? het ifmt-reading
|
|
- separate out InputFormat into per-file (or tbd) & have autodetect on file endings -- ?
|
|
- maybe a TBD reader -- ?
|
|
- InputFormat into Context
|
|
- TBD writer -- defer factory until first context?
|
|
- deeper refactor pulling format out of reader/writer options entirely -- ?
|
|
? multiple-valued return/assign -- ?
|
|
? array destructure at LHS for multi-retval assign (maps too?)
|
|
? new file formats
|
|
o https://brandur.org/logfmt is simply DKVP w/ IFS = space (need dquot though)
|
|
o https://docs.fluentbit.io/manual/pipeline/parsers/ltsv is just DKVP with IFS tab and IPS colon
|
|
? string/array slices on assignment LHS -- ?
|
|
? zip -- but that is an archive case too not just an encoding case
|
|
? miller support for archive-traversal; directory-traversal even w/o zip -- ?
|
|
? as 6.1 follow-on work -- ?
|
|
|
|
* mlr -k
|
|
o various test cases
|
|
o OLH re limitations
|
|
o check JSON-parser x 2 -- is there really a 'restart'?
|
|
- infinite-loop avoidance for sure
|
|
|
|
================================================================
|
|
UX
|
|
|
|
? NF?
|
|
|
|
! bnf fix for '[[' ']]' etc -- make it a nesting of singles. since otherwise no '[[3,4]]' literals :(
|
|
! broadly rethink os.Exit, especially as affecting mlr repl
|
|
|
|
* consider expanding '(error)' to have more useful error-text
|
|
* sync-print option; or (yuck) another xprint variant; or ...; emph dump/eprint
|
|
* strptime w/ ...00.Z -> error
|
|
* short 'asserting' functions (absent/error); and/or strict or somesuch (DSL, or global ...)
|
|
* header-only csv ... reminder ...
|
|
* bash completion script https://github.com/johnkerl/miller/issues/77#issuecomment-308247402
|
|
https://iridakos.com/programming/2018/03/01/bash-programmable-completion-tutorial#:~:text=Bash%20completion%20is%20a%20functionality,key%20while%20typing%20a%20command.
|
|
* tilde-expand for REPL load/open: if '~' is at the start of the string, run it though 'sh -c echo'
|
|
o or, simpler, just slice[1:] and prepend with os.Getenv("HOME")
|
|
|
|
? fmtnum(98, "%3d%%") -- ? workaround: fmtnum(98, "%3d") . "%"
|
|
? main-level (verb-level?) flag for "," -> X in verbs -- in case commas in field names
|
|
? trace-mode ?
|
|
? $0 as raw-record string -- ? would make mlr grep simpler and more natural ...
|
|
|
|
================================================================
|
|
DOC
|
|
|
|
* issue on gnu parallel -> link to scripting page
|
|
|
|
* https://stackoverflow.com/questions/71011603/is-there-a-simple-way-to-convert-a-csv-with-0-indexed-paths-as-keys-to-json-with/71015190#71015190
|
|
|
|
! slwin / unformat narrative:
|
|
o prototype in DSL
|
|
o mlr put -f / mlr -s
|
|
o and, if it's frequent then submit a feature request b/c other people probably also would like it! :)
|
|
|
|
! link to SE table ...
|
|
https://github.com/johnkerl/miller/discussions/609#discussioncomment-1115715
|
|
|
|
* put -f link in scrpting.html
|
|
* no --load in .mlrrc b/c
|
|
https://github.com/johnkerl/miller/security/advisories/GHSA-mw2v-4q78-j2cw
|
|
(system() in begin{})
|
|
but suggest alias
|
|
|
|
* surface readme-versions.md @ install & tell them how to get current
|
|
o in particular, older distros won't auto-update
|
|
|
|
* how-to-contribute @ rmd
|
|
* more clear miller docs == head
|
|
|
|
* go version & why not generics (not all distros w/ newer go compilers)
|
|
|
|
* document the fill-empty verb in
|
|
https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling
|
|
|
|
* mlr R numpy/scipy/pandas ref.
|
|
* ffmts page <-> R read.table read.csv ...
|
|
* https://github.com/agarrharr/awesome-cli-apps#data-manipulation
|
|
|
|
* "Miller, if I understand correctly, was born for those who already frequented
|
|
the command line, awk, for users with at least average data skills. While for
|
|
me, a basic user, it made me discover how it gives the possibility of
|
|
analyzing and transforming data, even complex ones, with a very low learning
|
|
curve."
|
|
|
|
* document cloudthings, e.g.
|
|
o go.yml
|
|
o codespell.yml
|
|
- codespell --check-filenames --skip *.csv,*.dkvp,*.txt,*.js,*.html,*.map,./tags,./test/cases --ignore-words-list denom,inTerm,inout,iput,nd,nin,numer,Wit,te,wee
|
|
o readthedocs triggers
|
|
* doc
|
|
o wut h1 spacing before/after ...
|
|
o shell-commands: while-read example from issues
|
|
E reference-dsl-user-defined-functions: UDSes -> non-factorial example -- maybe some useful aggregator
|
|
o reference-main-arithmetic: ? test stats1/step -F flag
|
|
o reference-dsl-control-structures:
|
|
e while (NR < 10) will never terminate as NR is only incremented between
|
|
records -> and each expression is invoked once per record so once for NR=1,
|
|
once for NR=2, etc.
|
|
o C-style triple-for loops: loop to NR -> NO!!!
|
|
o Since uninitialized out-of-stream variables default to 0 for
|
|
addition/subtraction and 1 for multiplication when they appear on expression
|
|
right-hand sides (not quite as in awk, where they'd default to 0 either way)
|
|
<-> xlink to other page
|
|
r fzf-ish w/ head -n 4, --from, up-arrow & append verb, then cat -- find & update the existing section
|
|
! https://github.com/johnkerl/miller/issues/653 -- stats1 w/ empties? check stats2
|
|
- needs UTs as well
|
|
o while-read example from issues
|
|
w contact re https://jsonlines.org/on_the_web/
|
|
* verslink old relnotes
|
|
* single UT, hard to invoke w/ new full go.mod path
|
|
go test $(ls pkg/lib/*.go|grep -v test) pkg/lib/unbackslash_test.go
|
|
etc
|
|
* file-formats: NIDX link to headerless CSV
|
|
* window.mlr, window2.mlr -> doc somewhere
|
|
* sec2gmt --millis/--micros/--nanos doc
|
|
* sort-within-records --recursive doc
|
|
* back-incompat:
|
|
mlr -n put $vflag '@x=1; dump > stdout, @x'
|
|
mlr -n put $vflag '@x=1; dump > stdout @x'
|
|
* single cheatsheet page -- put out RFH?
|
|
https://twitter.com/icymi_py/status/1426622817785765898/photo/1
|
|
* more about HTTP logs in miller -- doc and encapsulate:
|
|
mlr --icsv --implicit-csv-header --ifs space --oxtab --from elb-log put -q 'print $27'
|
|
* hofs section on typedecls
|
|
* write up how git status after test should show any missed extra-outs
|
|
* docs nest simplers now that we have getoptish
|
|
* zlib: n.b. brew install pigz, then pigz -z
|
|
* doc shift/unshift as using [2:] and append
|
|
* comment schema-change supported only in csvlite reader, not csv reader
|
|
* consider -w/-W to stderr not stdout -- ?
|
|
|
|
? grep out all error message from regtest outputs & doc them all & make sure index-searchable at readthedocs
|
|
|
|
===============================================================
|
|
TESTING
|
|
|
|
! pos/neg 0x/0b/0o UTs
|
|
|
|
* RT ngrams.sh -v -o 1 one-word-list.txt
|
|
* JIT UT:
|
|
o JSON I/O
|
|
o mlrval_cmp.go
|
|
o mv from-array/from-map w/ copy & mutate orig & check new -- & vice versa
|
|
o dash-O and octal infer
|
|
o populate each bifs/X_test.go for each bifs/X.go etc etc
|
|
* xtab splitter UT; nidx too
|
|
* hofs+typedecls RT cases
|
|
* UT per se for lrec ops
|
|
|
|
================================================================
|
|
PERFORMANCE
|
|
|
|
! profile sort
|
|
|
|
* JSON perf -- try alternate packages to encoding/json
|
|
* more perf?
|
|
- batchify source-quench -- experiment 1st
|
|
- further channelize (CSV-first focus) mlrval infer vs record-put ?
|
|
? coalesce errchan & done-writing w/ Err to RAC, and close-chan *and* EOSMarker -- ?
|
|
|
|
================================================================
|
|
CODE-NEATENS / QUALITY
|
|
|
|
* go-style assert.Equal(t, actual, expected) in place of R style expect_equal(expected, actual) in .../*_test.go
|
|
|
|
! ast namings etc
|
|
? make a simple NodeFromToken & have all interface{} be *ASTNode, not *token.Token
|
|
! broad commenting pass / TODO
|
|
|
|
* []Mlrval -> []*Mlrval ?
|
|
* cmp for array & map
|
|
o JIT neatens
|
|
- carefully read all $mlv files
|
|
- check $types & $bifs as well
|
|
- proofread all mlrval_cmp.go dispo mxes
|
|
- update rmds x several
|
|
* https://staticcheck.io/docs
|
|
o lots of nice little things to clean up -- no bugs per se, all stylistic i *think* ...
|
|
* funcptr away the ifs/ifsregex check in record-readers
|
|
* all case-files could use top-notes
|
|
* godoc neatens at func/const/etc level
|
|
* unset, unassign, remove -- too many different names. also assign/put ... maybe stick w/ 2?
|
|
* check triple-dash at mlr fill-down-h ; check others
|
|
* clean up unused exitCode arg in sort/put usage.
|
|
o also document pre/post conditions for flag and non-flag usages x all mappers
|
|
* neaten mlr gap -g (default) print
|
|
* typedecls w/ for-multi: C semantics 'k1: data k2: a:x v:1', sigh ...
|
|
* fill-down make columns required. also, --all.
|
|
* bitwise_and_dispositions et al should not have _absn for collections -- _erro instead
|
|
* libify errors.New callsites for DSL/CST
|
|
* record-readers are fully in-channel/loop; record-writers are multi with in-channel/loop being
|
|
done by ChannelWriter, which is very small. opportunity to refactor.
|
|
* widen DSL coverage
|
|
o err-return for array/map get/put if incorrect types ... currently go-void ...
|
|
! the DSL needs a full, written-down-and-published spell-out of error-eval semantics
|
|
* double-check rand-seeding
|
|
o all rand invocations should go through the seeder for UT/other determinism
|
|
* dev-note on why `int` not `int64` -- processor-arch & those who most need it get it
|
|
|
|
? gzout, bz2out -- ? make sure this works through tee? bleah ...
|
|
? golinter?
|
|
? go exe 17MB b/c gocc -- gogll ?
|
|
|
|
================================================================
|
|
INFO
|
|
|
|
i @all-contributors please add @person for ideas
|
|
i @all-contributors please add @person for documentation
|
|
|
|
i https://en.wikipedia.org/wiki/Delimiter#Delimiter_collision
|
|
|
|
i https://framework.frictionlessdata.io/docs/tutorials/working-with-cli/
|
|
|
|
i https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/
|
|
|
|
i go tool nm -size mlr | sort -nrk 2
|
|
|
|
i https://go.dev/play/p/hrB77U3d0S3?v=gotip
|
|
|
|
* godoc notes:
|
|
o go get golang.org/x/tools/cmd/godoc
|
|
o dev mode:
|
|
godoc -http=:6060 -goroot .
|
|
o publish:
|
|
godoc -http=:6060 -goroot .
|
|
cd ~/tmp/bar
|
|
wget -p -k http://localhost:6060/pkg
|
|
mv localhost:6060 miller6
|
|
file:///Users/kerl/tmp/bar/miller6/pkg
|
|
maybe publish to ISP space
|
|
|
|
* pkg graph:
|
|
go get github.com/kisielk/godepgraph
|
|
godepgraph miller | dot -Tpng -o ~/Desktop/mlrdeps.png
|
|
flamegraph etc double-check
|
|
|
|
* more data formats:
|
|
https://indico.cern.ch/event/613842/contributions/2585787/attachments/1463230/2260889/pivarski-data-formats.pdf
|