mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 18:25:45 +00:00
224 lines
6.3 KiB
Markdown
224 lines
6.3 KiB
Markdown
# Miscellaneous examples
|
|
|
|
Column select:
|
|
|
|
GENMD_SHOW_COMMAND
|
|
mlr --csv cut -f hostname,uptime mydata.csv
|
|
GENMD_EOF
|
|
|
|
Add new columns as function of other columns:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --nidx put '$sum = $7 < 0.0 ? 3.5 : $7 + 2.1*$8' *.dat
|
|
GENMD_EOF
|
|
|
|
Row filter:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --csv filter '$status != "down" && $upsec >= 10000' *.csv
|
|
GENMD_EOF
|
|
|
|
Apply column labels and pretty-print:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
grep -v '^#' /etc/group | mlr --ifs : --nidx --opprint label group,pass,gid,member then sort -f group
|
|
GENMD_EOF
|
|
|
|
Join multiple data sources on key columns:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr join -j account_id -f accounts.dat then group-by account_name balances.dat
|
|
GENMD_EOF
|
|
|
|
Mulltiple formats including JSON:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --json put '$attr = sub($attr, "([0-9]+)_([0-9]+)_.*", "\1:\2")' data/*.json
|
|
GENMD_EOF
|
|
|
|
Aggregate per-column statistics:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr stats1 -a min,mean,max,p10,p50,p90 -f flag,u,v data/*
|
|
GENMD_EOF
|
|
|
|
Linear regression:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr stats2 -a linreg-pca -f u,v -g shape data/*
|
|
GENMD_EOF
|
|
|
|
Aggregate custom per-column statistics:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr put -q '@sum[$a][$b] += $x; end {emit @sum, "a", "b"}' data/*
|
|
GENMD_EOF
|
|
|
|
Iterate over data using DSL expressions:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --from estimates.tbl put '
|
|
for (k,v in $*) {
|
|
if (is_numeric(v) && k =~ "^[t-z].*$") {
|
|
$sum += v; $count += 1
|
|
}
|
|
}
|
|
$mean = $sum / $count # no assignment if count unset
|
|
'
|
|
GENMD_EOF
|
|
|
|
Run DSL expressions from a script file:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --from infile.dat put -f analyze.mlr
|
|
GENMD_EOF
|
|
|
|
Split/reduce output to multiple filenames:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --from infile.dat put 'tee > "./taps/data-".$a."-".$b, $*'
|
|
GENMD_EOF
|
|
|
|
Compressed I/O:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --from infile.dat put 'tee | "gzip > ./taps/data-".$a."-".$b.".gz", $*'
|
|
GENMD_EOF
|
|
|
|
Interoperate with other data-processing tools using standard pipes:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --from infile.dat put -q '@v=$*; dump | "jq .[]"'
|
|
GENMD_EOF
|
|
|
|
Tap/trace:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --from infile.dat put '(NR % 1000 == 0) { print > stderr, "Checkpoint ".NR}'
|
|
GENMD_EOF
|
|
|
|
## Program timing
|
|
|
|
This admittedly artificial example demonstrates using Miller time and stats functions to introspectively acquire some information about Miller's own runtime. The `delta` function computes the difference between successive timestamps.
|
|
|
|
GENMD_INCLUDE_ESCAPED(data/timing-example.txt)
|
|
|
|
## Showing differences between successive queries
|
|
|
|
Suppose you have a database query which you run at one point in time, producing the output on the left, then again later producing the output on the right:
|
|
|
|
GENMD_RUN_COMMAND
|
|
cat data/previous_counters.csv
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
cat data/current_counters.csv
|
|
GENMD_EOF
|
|
|
|
And, suppose you want to compute the differences in the counters between adjacent keys. Since the color names aren't all in the same order, nor are they all present on both sides, we can't just paste the two files side-by-side and do some column-four-minus-column-two arithmetic.
|
|
|
|
First, rename counter columns to make them distinct:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --csv rename count,previous_count data/previous_counters.csv > data/prevtemp.csv
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
cat data/prevtemp.csv
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --csv rename count,current_count data/current_counters.csv > data/currtemp.csv
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
cat data/currtemp.csv
|
|
GENMD_EOF
|
|
|
|
Then, join on the key field(s), and use unsparsify to zero-fill counters absent on one side but present on the other. Use `--ul` and `--ur` to emit unpaired records (namely, purple on the left and yellow on the right):
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --icsv --opprint \
|
|
join -j color --ul --ur -f data/prevtemp.csv \
|
|
then unsparsify --fill-with 0 \
|
|
then put '$count_delta = $current_count - $previous_count' \
|
|
data/currtemp.csv
|
|
GENMD_EOF
|
|
|
|
## Memoization with out-of-stream variables
|
|
|
|
The recursive function for the Fibonacci sequence is famous for its computational complexity. Namely, using f(0)=1, f(1)=1, f(n)=f(n-1)+f(n-2) for n>=2, the evaluation tree branches left as well as right at each non-trivial level, resulting in millions or more paths to the root 0/1 nodes for larger n. This program
|
|
|
|
GENMD_INCLUDE_ESCAPED(data/fibo-uncached.sh)
|
|
|
|
produces output like this:
|
|
|
|
GENMD_CARDIFY
|
|
i o fcount seconds_delta
|
|
1 1 1 0
|
|
2 2 3 0.000039101
|
|
3 3 5 0.000015974
|
|
4 5 9 0.000019073
|
|
5 8 15 0.000026941
|
|
6 13 25 0.000036955
|
|
7 21 41 0.000056028
|
|
8 34 67 0.000086069
|
|
9 55 109 0.000134945
|
|
10 89 177 0.000217915
|
|
11 144 287 0.000355959
|
|
12 233 465 0.000506163
|
|
13 377 753 0.000811815
|
|
14 610 1219 0.001297235
|
|
15 987 1973 0.001960993
|
|
16 1597 3193 0.003417969
|
|
17 2584 5167 0.006215811
|
|
18 4181 8361 0.008294106
|
|
19 6765 13529 0.012095928
|
|
20 10946 21891 0.019592047
|
|
21 17711 35421 0.031193972
|
|
22 28657 57313 0.057254076
|
|
23 46368 92735 0.080307961
|
|
24 75025 150049 0.129482031
|
|
25 121393 242785 0.213325977
|
|
26 196418 392835 0.334423065
|
|
27 317811 635621 0.605969906
|
|
28 514229 1028457 0.971235037
|
|
GENMD_EOF
|
|
|
|
Note that the time it takes to evaluate the function is blowing up exponentially as the input argument increases. Using `@`-variables, which persist across records, we can cache and reuse the results of previous computations:
|
|
|
|
GENMD_INCLUDE_ESCAPED(data/fibo-cached.sh)
|
|
|
|
with output like this:
|
|
|
|
GENMD_CARDIFY
|
|
i o fcount seconds_delta
|
|
1 1 1 0
|
|
2 2 3 0.000053883
|
|
3 3 3 0.000035048
|
|
4 5 3 0.000045061
|
|
5 8 3 0.000014067
|
|
6 13 3 0.000028849
|
|
7 21 3 0.000028133
|
|
8 34 3 0.000027895
|
|
9 55 3 0.000014067
|
|
10 89 3 0.000015020
|
|
11 144 3 0.000012875
|
|
12 233 3 0.000033140
|
|
13 377 3 0.000014067
|
|
14 610 3 0.000012875
|
|
15 987 3 0.000029087
|
|
16 1597 3 0.000013828
|
|
17 2584 3 0.000013113
|
|
18 4181 3 0.000012875
|
|
19 6765 3 0.000013113
|
|
20 10946 3 0.000012875
|
|
21 17711 3 0.000013113
|
|
22 28657 3 0.000013113
|
|
23 46368 3 0.000015974
|
|
24 75025 3 0.000012875
|
|
25 121393 3 0.000013113
|
|
26 196418 3 0.000012875
|
|
27 317811 3 0.000013113
|
|
28 514229 3 0.000012875
|
|
GENMD_EOF
|