miller/docs6/docs/misc-examples.md.in
2021-08-25 22:01:32 -04:00

224 lines
6.3 KiB
Markdown

# Miscellaneous examples
Column select:
GENMD_SHOW_COMMAND
mlr --csv cut -f hostname,uptime mydata.csv
GENMD_EOF
Add new columns as function of other columns:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --nidx put '$sum = $7 < 0.0 ? 3.5 : $7 + 2.1*$8' *.dat
GENMD_EOF
Row filter:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --csv filter '$status != "down" && $upsec >= 10000' *.csv
GENMD_EOF
Apply column labels and pretty-print:
GENMD_CARDIFY_HIGHLIGHT_ONE
grep -v '^#' /etc/group | mlr --ifs : --nidx --opprint label group,pass,gid,member then sort -f group
GENMD_EOF
Join multiple data sources on key columns:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr join -j account_id -f accounts.dat then group-by account_name balances.dat
GENMD_EOF
Mulltiple formats including JSON:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --json put '$attr = sub($attr, "([0-9]+)_([0-9]+)_.*", "\1:\2")' data/*.json
GENMD_EOF
Aggregate per-column statistics:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr stats1 -a min,mean,max,p10,p50,p90 -f flag,u,v data/*
GENMD_EOF
Linear regression:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr stats2 -a linreg-pca -f u,v -g shape data/*
GENMD_EOF
Aggregate custom per-column statistics:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr put -q '@sum[$a][$b] += $x; end {emit @sum, "a", "b"}' data/*
GENMD_EOF
Iterate over data using DSL expressions:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --from estimates.tbl put '
for (k,v in $*) {
if (is_numeric(v) && k =~ "^[t-z].*$") {
$sum += v; $count += 1
}
}
$mean = $sum / $count # no assignment if count unset
'
GENMD_EOF
Run DSL expressions from a script file:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --from infile.dat put -f analyze.mlr
GENMD_EOF
Split/reduce output to multiple filenames:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --from infile.dat put 'tee > "./taps/data-".$a."-".$b, $*'
GENMD_EOF
Compressed I/O:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --from infile.dat put 'tee | "gzip > ./taps/data-".$a."-".$b.".gz", $*'
GENMD_EOF
Interoperate with other data-processing tools using standard pipes:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --from infile.dat put -q '@v=$*; dump | "jq .[]"'
GENMD_EOF
Tap/trace:
GENMD_CARDIFY_HIGHLIGHT_ONE
mlr --from infile.dat put '(NR % 1000 == 0) { print > stderr, "Checkpoint ".NR}'
GENMD_EOF
## Program timing
This admittedly artificial example demonstrates using Miller time and stats functions to introspectively acquire some information about Miller's own runtime. The `delta` function computes the difference between successive timestamps.
GENMD_INCLUDE_ESCAPED(data/timing-example.txt)
## Showing differences between successive queries
Suppose you have a database query which you run at one point in time, producing the output on the left, then again later producing the output on the right:
GENMD_RUN_COMMAND
cat data/previous_counters.csv
GENMD_EOF
GENMD_RUN_COMMAND
cat data/current_counters.csv
GENMD_EOF
And, suppose you want to compute the differences in the counters between adjacent keys. Since the color names aren't all in the same order, nor are they all present on both sides, we can't just paste the two files side-by-side and do some column-four-minus-column-two arithmetic.
First, rename counter columns to make them distinct:
GENMD_RUN_COMMAND
mlr --csv rename count,previous_count data/previous_counters.csv > data/prevtemp.csv
GENMD_EOF
GENMD_RUN_COMMAND
cat data/prevtemp.csv
GENMD_EOF
GENMD_RUN_COMMAND
mlr --csv rename count,current_count data/current_counters.csv > data/currtemp.csv
GENMD_EOF
GENMD_RUN_COMMAND
cat data/currtemp.csv
GENMD_EOF
Then, join on the key field(s), and use unsparsify to zero-fill counters absent on one side but present on the other. Use `--ul` and `--ur` to emit unpaired records (namely, purple on the left and yellow on the right):
GENMD_RUN_COMMAND
mlr --icsv --opprint \
join -j color --ul --ur -f data/prevtemp.csv \
then unsparsify --fill-with 0 \
then put '$count_delta = $current_count - $previous_count' \
data/currtemp.csv
GENMD_EOF
## Memoization with out-of-stream variables
The recursive function for the Fibonacci sequence is famous for its computational complexity. Namely, using f(0)=1, f(1)=1, f(n)=f(n-1)+f(n-2) for n>=2, the evaluation tree branches left as well as right at each non-trivial level, resulting in millions or more paths to the root 0/1 nodes for larger n. This program
GENMD_INCLUDE_ESCAPED(data/fibo-uncached.sh)
produces output like this:
GENMD_CARDIFY
i o fcount seconds_delta
1 1 1 0
2 2 3 0.000039101
3 3 5 0.000015974
4 5 9 0.000019073
5 8 15 0.000026941
6 13 25 0.000036955
7 21 41 0.000056028
8 34 67 0.000086069
9 55 109 0.000134945
10 89 177 0.000217915
11 144 287 0.000355959
12 233 465 0.000506163
13 377 753 0.000811815
14 610 1219 0.001297235
15 987 1973 0.001960993
16 1597 3193 0.003417969
17 2584 5167 0.006215811
18 4181 8361 0.008294106
19 6765 13529 0.012095928
20 10946 21891 0.019592047
21 17711 35421 0.031193972
22 28657 57313 0.057254076
23 46368 92735 0.080307961
24 75025 150049 0.129482031
25 121393 242785 0.213325977
26 196418 392835 0.334423065
27 317811 635621 0.605969906
28 514229 1028457 0.971235037
GENMD_EOF
Note that the time it takes to evaluate the function is blowing up exponentially as the input argument increases. Using `@`-variables, which persist across records, we can cache and reuse the results of previous computations:
GENMD_INCLUDE_ESCAPED(data/fibo-cached.sh)
with output like this:
GENMD_CARDIFY
i o fcount seconds_delta
1 1 1 0
2 2 3 0.000053883
3 3 3 0.000035048
4 5 3 0.000045061
5 8 3 0.000014067
6 13 3 0.000028849
7 21 3 0.000028133
8 34 3 0.000027895
9 55 3 0.000014067
10 89 3 0.000015020
11 144 3 0.000012875
12 233 3 0.000033140
13 377 3 0.000014067
14 610 3 0.000012875
15 987 3 0.000029087
16 1597 3 0.000013828
17 2584 3 0.000013113
18 4181 3 0.000012875
19 6765 3 0.000013113
20 10946 3 0.000012875
21 17711 3 0.000013113
22 28657 3 0.000013113
23 46368 3 0.000015974
24 75025 3 0.000012875
25 121393 3 0.000013113
26 196418 3 0.000012875
27 317811 3 0.000013113
28 514229 3 0.000012875
GENMD_EOF