mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-24 02:36:15 +00:00
697 lines
No EOL
32 KiB
HTML
697 lines
No EOL
32 KiB
HTML
|
||
<!DOCTYPE html>
|
||
|
||
<html>
|
||
<head>
|
||
<meta charset="utf-8" />
|
||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||
<title>Two-pass algorithms — Miller 6.0.0-alpha documentation</title>
|
||
|
||
<link rel="stylesheet" href="_static/scrolls.css" type="text/css" />
|
||
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
|
||
<link rel="stylesheet" href="_static/print.css" type="text/css" />
|
||
|
||
<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
|
||
<script src="_static/jquery.js"></script>
|
||
<script src="_static/underscore.js"></script>
|
||
<script src="_static/doctools.js"></script>
|
||
<script src="_static/language_data.js"></script>
|
||
<script src="_static/theme_extras.js"></script>
|
||
<link rel="index" title="Index" href="genindex.html" />
|
||
<link rel="search" title="Search" href="search.html" />
|
||
<link rel="next" title="DKVP I/O examples" href="dkvp-examples.html" />
|
||
<link rel="prev" title="Randomizing examples" href="randomizing-examples.html" />
|
||
</head><body>
|
||
<div id="content">
|
||
<div class="header">
|
||
<h1 class="heading"><a href="index.html"
|
||
title="back to the documentation overview"><span>Two-pass algorithms</span></a></h1>
|
||
</div>
|
||
<div class="relnav" role="navigation" aria-label="related navigation">
|
||
<a href="randomizing-examples.html">« Randomizing examples</a> |
|
||
<a href="#">Two-pass algorithms</a>
|
||
| <a href="dkvp-examples.html">DKVP I/O examples »</a>
|
||
</div>
|
||
<div id="contentwrapper">
|
||
<div id="toc" role="navigation" aria-label="table of contents navigation">
|
||
<h3>Table of Contents</h3>
|
||
<ul>
|
||
<li><a class="reference internal" href="#">Two-pass algorithms</a><ul>
|
||
<li><a class="reference internal" href="#overview">Overview</a></li>
|
||
<li><a class="reference internal" href="#computation-of-percentages">Computation of percentages</a></li>
|
||
<li><a class="reference internal" href="#line-number-ratios">Line-number ratios</a></li>
|
||
<li><a class="reference internal" href="#records-having-max-value">Records having max value</a></li>
|
||
<li><a class="reference internal" href="#feature-counting">Feature-counting</a></li>
|
||
<li><a class="reference internal" href="#unsparsing">Unsparsing</a></li>
|
||
<li><a class="reference internal" href="#mean-without-with-oosvars">Mean without/with oosvars</a></li>
|
||
<li><a class="reference internal" href="#keyed-mean-without-with-oosvars">Keyed mean without/with oosvars</a></li>
|
||
<li><a class="reference internal" href="#variance-and-standard-deviation-without-with-oosvars">Variance and standard deviation without/with oosvars</a></li>
|
||
<li><a class="reference internal" href="#min-max-without-with-oosvars">Min/max without/with oosvars</a></li>
|
||
<li><a class="reference internal" href="#keyed-min-max-without-with-oosvars">Keyed min/max without/with oosvars</a></li>
|
||
<li><a class="reference internal" href="#delta-without-with-oosvars">Delta without/with oosvars</a></li>
|
||
<li><a class="reference internal" href="#keyed-delta-without-with-oosvars">Keyed delta without/with oosvars</a></li>
|
||
<li><a class="reference internal" href="#exponentially-weighted-moving-averages-without-with-oosvars">Exponentially weighted moving averages without/with oosvars</a></li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
|
||
</div>
|
||
<div role="main">
|
||
|
||
<div class="section" id="two-pass-algorithms">
|
||
<h1>Two-pass algorithms<a class="headerlink" href="#two-pass-algorithms" title="Permalink to this headline">¶</a></h1>
|
||
<div class="section" id="overview">
|
||
<h2>Overview<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2>
|
||
<p>Miller is a streaming record processor; commands are performed once per record. This makes Miller particularly suitable for single-pass algorithms, allowing many of its verbs to process files that are (much) larger than the amount of RAM present in your system. (Of course, Miller verbs such as <code class="docutils literal notranslate"><span class="pre">sort</span></code>, <code class="docutils literal notranslate"><span class="pre">tac</span></code>, etc. all must ingest and retain all input records before emitting any output records.) You can also use out-of-stream variables to perform multi-pass computations, at the price of retaining all input records in memory.</p>
|
||
<p>One of Miller’s strengths is its compact notation: for example, given input of the form</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> head -n 5 ../data/medium
|
||
</span> a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
|
||
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
|
||
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
|
||
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
|
||
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
|
||
</pre></div>
|
||
</div>
|
||
<p>you can simply do</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab stats1 -a sum -f x ../data/medium
|
||
</span> x_sum 4986.019681679581
|
||
</pre></div>
|
||
</div>
|
||
<p>or</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint stats1 -a sum -f x -g b ../data/medium
|
||
</span> b x_sum
|
||
pan 965.7636699425815
|
||
wye 1023.5484702619565
|
||
zee 979.7420161495838
|
||
eks 1016.7728571314786
|
||
hat 1000.192668193983
|
||
</pre></div>
|
||
</div>
|
||
<p>rather than the more tedious</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab put -q '
|
||
</span><span class="hll"> @x_sum += $x;
|
||
</span><span class="hll"> end {
|
||
</span><span class="hll"> emit @x_sum
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> ' data/medium
|
||
</span> x_sum 4986.019681679581
|
||
</pre></div>
|
||
</div>
|
||
<p>or</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put -q '
|
||
</span><span class="hll"> @x_sum[$b] += $x;
|
||
</span><span class="hll"> end {
|
||
</span><span class="hll"> emit @x_sum, "b"
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> ' data/medium
|
||
</span> b x_sum
|
||
pan 965.7636699425815
|
||
wye 1023.5484702619565
|
||
zee 979.7420161495838
|
||
eks 1016.7728571314786
|
||
hat 1000.192668193983
|
||
</pre></div>
|
||
</div>
|
||
<p>The former (<code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">stats1</span></code> et al.) has the advantages of being easier to type, being less error-prone to type, and running faster.</p>
|
||
<p>Nonetheless, out-of-stream variables (which I whimsically call <em>oosvars</em>), begin/end blocks, and emit statements give you the ability to implement logic – if you wish to do so – which isn’t present in other Miller verbs. (If you find yourself often using the same out-of-stream-variable logic over and over, please file a request at <a class="reference external" href="https://github.com/johnkerl/miller/issues">https://github.com/johnkerl/miller/issues</a> to get it implemented directly in Go as a Miller verb of its own.)</p>
|
||
<p>The following examples compute some things using oosvars which are already computable using Miller verbs, by way of providing food for thought.</p>
|
||
</div>
|
||
<div class="section" id="computation-of-percentages">
|
||
<h2>Computation of percentages<a class="headerlink" href="#computation-of-percentages" title="Permalink to this headline">¶</a></h2>
|
||
<p>For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values, then on the second, map each record’s value to a percentage.</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --from data/small --opprint put -q '
|
||
</span><span class="hll"> # These are executed once per record, which is the first pass.
|
||
</span><span class="hll"> # The key is to use NR to index an out-of-stream variable to
|
||
</span><span class="hll"> # retain all the x-field values.
|
||
</span><span class="hll"> @x_min = min($x, @x_min);
|
||
</span><span class="hll"> @x_max = max($x, @x_max);
|
||
</span><span class="hll"> @x[NR] = $x;
|
||
</span><span class="hll">
|
||
</span><span class="hll"> # The second pass is in a for-loop in an end-block.
|
||
</span><span class="hll"> end {
|
||
</span><span class="hll"> for (nr, x in @x) {
|
||
</span><span class="hll"> @x_pct[nr] = 100 * (x - @x_min) / (@x_max - @x_min);
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> emit (@x, @x_pct), "NR"
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> '
|
||
</span> NR x x_pct
|
||
1 0.3467901443380824 25.66194338926441
|
||
2 0.7586799647899636 100
|
||
3 0.20460330576630303 0
|
||
4 0.38139939387114097 31.90823602213647
|
||
5 0.5732889198020006 66.54054236562845
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="line-number-ratios">
|
||
<h2>Line-number ratios<a class="headerlink" href="#line-number-ratios" title="Permalink to this headline">¶</a></h2>
|
||
<p>Similarly, finding the total record count requires first reading through all the data:</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint --from data/small put -q '
|
||
</span><span class="hll"> @records[NR] = $*;
|
||
</span><span class="hll"> end {
|
||
</span><span class="hll"> for((I,k),v in @records) {
|
||
</span><span class="hll"> @records[I]["I"] = I;
|
||
</span><span class="hll"> @records[I]["N"] = NR;
|
||
</span><span class="hll"> @records[I]["PCT"] = 100*I/NR
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> emit @records,"I"
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> ' then reorder -f I,N,PCT
|
||
</span> I N PCT a b i x y
|
||
1 5 (error) pan pan 1 0.3467901443380824 0.7268028627434533
|
||
2 5 (error) eks pan 2 0.7586799647899636 0.5221511083334797
|
||
3 5 (error) wye wye 3 0.20460330576630303 0.33831852551664776
|
||
4 5 (error) eks wye 4 0.38139939387114097 0.13418874328430463
|
||
5 5 (error) wye pan 5 0.5732889198020006 0.8636244699032729
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="records-having-max-value">
|
||
<h2>Records having max value<a class="headerlink" href="#records-having-max-value" title="Permalink to this headline">¶</a></h2>
|
||
<p>The idea is to retain records having the largest value of <code class="docutils literal notranslate"><span class="pre">n</span></code> in the following data:</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --itsv --opprint cat data/maxrows.tsv
|
||
</span> a b n score
|
||
purple red 5 0.743231
|
||
blue purple 2 0.093710
|
||
red purple 2 0.802103
|
||
purple red 5 0.389055
|
||
red purple 2 0.880457
|
||
orange red 2 0.540349
|
||
purple purple 1 0.634451
|
||
orange purple 5 0.257223
|
||
orange purple 5 0.693499
|
||
red red 4 0.981355
|
||
blue purple 5 0.157052
|
||
purple purple 1 0.441784
|
||
red purple 1 0.124912
|
||
orange blue 1 0.921944
|
||
blue purple 4 0.490909
|
||
purple red 5 0.454779
|
||
green purple 4 0.198278
|
||
orange blue 5 0.705700
|
||
red red 3 0.940705
|
||
purple red 5 0.072936
|
||
orange blue 3 0.389463
|
||
orange purple 2 0.664985
|
||
blue purple 1 0.371813
|
||
red purple 4 0.984571
|
||
green purple 5 0.203577
|
||
green purple 3 0.900873
|
||
purple purple 0 0.965677
|
||
blue purple 2 0.208785
|
||
purple purple 1 0.455077
|
||
red purple 4 0.477187
|
||
blue red 4 0.007487
|
||
</pre></div>
|
||
</div>
|
||
<p>Of course, the largest value of <code class="docutils literal notranslate"><span class="pre">n</span></code> isn’t known until after all data have been read. Using an out-of-stream variable we can retain all records as they are read, then filter them at the end:</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> cat data/maxrows.mlr
|
||
</span> # Retain all records
|
||
@records[NR] = $*;
|
||
# Track max value of n
|
||
@maxn = max(@maxn, $n);
|
||
|
||
# After all records have been read, loop through retained records
|
||
# and print those with the max n value.
|
||
end {
|
||
for (nr in @records) {
|
||
map record = @records[nr];
|
||
if (record["n"] == @maxn) {
|
||
emit record;
|
||
}
|
||
}
|
||
}
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv
|
||
</span> a b n score
|
||
purple red 5 0.743231
|
||
purple red 5 0.389055
|
||
orange purple 5 0.257223
|
||
orange purple 5 0.693499
|
||
blue purple 5 0.157052
|
||
purple red 5 0.454779
|
||
orange blue 5 0.705700
|
||
purple red 5 0.072936
|
||
green purple 5 0.203577
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="feature-counting">
|
||
<h2>Feature-counting<a class="headerlink" href="#feature-counting" title="Permalink to this headline">¶</a></h2>
|
||
<p>Suppose you have some heterogeneous data like this:</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>{ "qoh": 29874, "rate": 1.68, "latency": 0.02 }
|
||
{ "name": "alice", "uid": 572 }
|
||
{ "qoh": 1227, "rate": 1.01, "latency": 0.07 }
|
||
{ "qoh": 13458, "rate": 1.72, "latency": 0.04 }
|
||
{ "qoh": 56782, "rate": 1.64 }
|
||
{ "qoh": 23512, "rate": 1.71, "latency": 0.03 }
|
||
{ "qoh": 9876, "rate": 1.89, "latency": 0.08 }
|
||
{ "name": "bill", "uid": 684 }
|
||
{ "name": "chuck", "uid2": 908 }
|
||
{ "name": "dottie", "uid": 440 }
|
||
{ "qoh": 0, "rate": 0.40, "latency": 0.01 }
|
||
{ "qoh": 5438, "rate": 1.56, "latency": 0.17 }
|
||
</pre></div>
|
||
</div>
|
||
<p>A reasonable question to ask is, how many occurrences of each field are there? And, what percentage of total row count has each of them? Since the denominator of the percentage is not known until the end, this is a two-pass algorithm:</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>for (key in $*) {
|
||
@key_counts[key] += 1;
|
||
}
|
||
@record_count += 1;
|
||
|
||
end {
|
||
for (key in @key_counts) {
|
||
@key_fraction[key] = @key_counts[key] / @record_count
|
||
}
|
||
emit @record_count;
|
||
emit @key_counts, "key";
|
||
emit @key_fraction,"key"
|
||
}
|
||
</pre></div>
|
||
</div>
|
||
<p>Then</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --json put -q -f data/feature-count.mlr data/features.json
|
||
</span> {
|
||
"record_count": 12
|
||
}
|
||
{
|
||
"key": "qoh",
|
||
"key_counts": 8
|
||
}
|
||
{
|
||
"key": "rate",
|
||
"key_counts": 8
|
||
}
|
||
{
|
||
"key": "latency",
|
||
"key_counts": 7
|
||
}
|
||
{
|
||
"key": "name",
|
||
"key_counts": 4
|
||
}
|
||
{
|
||
"key": "uid",
|
||
"key_counts": 3
|
||
}
|
||
{
|
||
"key": "uid2",
|
||
"key_counts": 1
|
||
}
|
||
{
|
||
"key": "qoh",
|
||
"key_fraction": 0.6666666666666666
|
||
}
|
||
{
|
||
"key": "rate",
|
||
"key_fraction": 0.6666666666666666
|
||
}
|
||
{
|
||
"key": "latency",
|
||
"key_fraction": 0.5833333333333334
|
||
}
|
||
{
|
||
"key": "name",
|
||
"key_fraction": 0.3333333333333333
|
||
}
|
||
{
|
||
"key": "uid",
|
||
"key_fraction": 0.25
|
||
}
|
||
{
|
||
"key": "uid2",
|
||
"key_fraction": 0.08333333333333333
|
||
}
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json
|
||
</span> record_count
|
||
12
|
||
|
||
key key_counts
|
||
qoh 8
|
||
rate 8
|
||
latency 7
|
||
name 4
|
||
uid 3
|
||
uid2 1
|
||
|
||
key key_fraction
|
||
qoh 0.6666666666666666
|
||
rate 0.6666666666666666
|
||
latency 0.5833333333333334
|
||
name 0.3333333333333333
|
||
uid 0.25
|
||
uid2 0.08333333333333333
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="unsparsing">
|
||
<h2>Unsparsing<a class="headerlink" href="#unsparsing" title="Permalink to this headline">¶</a></h2>
|
||
<p>The previous section discussed how to fill out missing data fields within CSV with full header line – so the list of all field names is present within the header line. Next, let’s look at a related problem: we have data where each record has various key names but we want to produce rectangular output having the union of all key names.</p>
|
||
<p>For example, suppose you have JSON input like this:</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> cat data/sparse.json
|
||
</span> {"a":1,"b":2,"v":3}
|
||
{"u":1,"b":2}
|
||
{"a":1,"v":2,"x":3}
|
||
{"v":1,"w":2}
|
||
</pre></div>
|
||
</div>
|
||
<p>There are field names <code class="docutils literal notranslate"><span class="pre">a</span></code>, <code class="docutils literal notranslate"><span class="pre">b</span></code>, <code class="docutils literal notranslate"><span class="pre">v</span></code>, <code class="docutils literal notranslate"><span class="pre">u</span></code>, <code class="docutils literal notranslate"><span class="pre">x</span></code>, <code class="docutils literal notranslate"><span class="pre">w</span></code> in the data – but not all in every record. Since we don’t know the names of all the keys until we’ve read them all, this needs to be a two-pass algorithm. On the first pass, remember all the unique key names and all the records; on the second pass, loop through the records filling in absent values, then producing output. Use <code class="docutils literal notranslate"><span class="pre">put</span> <span class="pre">-q</span></code> since we don’t want to produce per-record output, only emitting output in the <code class="docutils literal notranslate"><span class="pre">end</span></code> block:</p>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> cat data/unsparsify.mlr
|
||
</span> # First pass:
|
||
# Remember all unique key names:
|
||
for (k in $*) {
|
||
@all_keys[k] = 1;
|
||
}
|
||
# Remember all input records:
|
||
@records[NR] = $*;
|
||
|
||
# Second pass:
|
||
end {
|
||
for (nr in @records) {
|
||
# Get the sparsely keyed input record:
|
||
irecord = @records[nr];
|
||
# Fill in missing keys with empty string:
|
||
map orecord = {};
|
||
for (k in @all_keys) {
|
||
if (haskey(irecord, k)) {
|
||
orecord[k] = irecord[k];
|
||
} else {
|
||
orecord[k] = "";
|
||
}
|
||
}
|
||
# Produce the output:
|
||
emit orecord;
|
||
}
|
||
}
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --json put -q -f data/unsparsify.mlr data/sparse.json
|
||
</span> {
|
||
"a": 1,
|
||
"b": 2,
|
||
"v": 3,
|
||
"u": "",
|
||
"x": "",
|
||
"w": ""
|
||
}
|
||
{
|
||
"a": "",
|
||
"b": 2,
|
||
"v": "",
|
||
"u": 1,
|
||
"x": "",
|
||
"w": ""
|
||
}
|
||
{
|
||
"a": 1,
|
||
"b": "",
|
||
"v": 2,
|
||
"u": "",
|
||
"x": 3,
|
||
"w": ""
|
||
}
|
||
{
|
||
"a": "",
|
||
"b": "",
|
||
"v": 1,
|
||
"u": "",
|
||
"x": "",
|
||
"w": 2
|
||
}
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --ijson --ocsv put -q -f data/unsparsify.mlr data/sparse.json
|
||
</span> a,b,v,u,x,w
|
||
1,2,3,,,
|
||
,2,,1,,
|
||
1,,2,,3,
|
||
,,1,,,2
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json
|
||
</span> a b v u x w
|
||
1 2 3 - - -
|
||
- 2 - 1 - -
|
||
1 - 2 - 3 -
|
||
- - 1 - - 2
|
||
</pre></div>
|
||
</div>
|
||
<p>There is a keystroke-saving verb for this: <a class="reference internal" href="reference-verbs.html#reference-verbs-unsparsify"><span class="std std-ref">mlr unsparsify</span></a>.</p>
|
||
</div>
|
||
<div class="section" id="mean-without-with-oosvars">
|
||
<h2>Mean without/with oosvars<a class="headerlink" href="#mean-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint stats1 -a mean -f x data/medium
|
||
</span> x_mean
|
||
0.49860196816795804
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put -q '
|
||
</span><span class="hll"> @x_sum += $x;
|
||
</span><span class="hll"> @x_count += 1;
|
||
</span><span class="hll"> end {
|
||
</span><span class="hll"> @x_mean = @x_sum / @x_count;
|
||
</span><span class="hll"> emit @x_mean
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> ' data/medium
|
||
</span> x_mean
|
||
0.49860196816795804
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="keyed-mean-without-with-oosvars">
|
||
<h2>Keyed mean without/with oosvars<a class="headerlink" href="#keyed-mean-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint stats1 -a mean -f x -g a,b data/medium
|
||
</span> a b x_mean
|
||
pan pan 0.5133141190437597
|
||
eks pan 0.48507555383425127
|
||
wye wye 0.49150092785839306
|
||
eks wye 0.4838950517724162
|
||
wye pan 0.4996119901034838
|
||
zee pan 0.5198298297816007
|
||
eks zee 0.49546320772681596
|
||
zee wye 0.5142667998230479
|
||
hat wye 0.49381326184632596
|
||
pan wye 0.5023618498923658
|
||
zee eks 0.4883932942792647
|
||
hat zee 0.5099985721987774
|
||
hat eks 0.48587864619953547
|
||
wye hat 0.4977304763723314
|
||
pan eks 0.5036718595143479
|
||
eks eks 0.5227992666570941
|
||
hat hat 0.47993053101017374
|
||
hat pan 0.4643355557376876
|
||
zee zee 0.5127559183726382
|
||
pan hat 0.492140950155604
|
||
pan zee 0.4966041598627583
|
||
zee hat 0.46772617655014515
|
||
wye zee 0.5059066170573692
|
||
eks hat 0.5006790659966355
|
||
wye eks 0.5306035254809106
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put -q '
|
||
</span><span class="hll"> @x_sum[$a][$b] += $x;
|
||
</span><span class="hll"> @x_count[$a][$b] += 1;
|
||
</span><span class="hll"> end{
|
||
</span><span class="hll"> for ((a, b), v in @x_sum) {
|
||
</span><span class="hll"> @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> emit @x_mean, "a", "b"
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> ' data/medium
|
||
</span> a b x_mean
|
||
pan pan 0.5133141190437597
|
||
pan wye 0.5023618498923658
|
||
pan eks 0.5036718595143479
|
||
pan hat 0.492140950155604
|
||
pan zee 0.4966041598627583
|
||
eks pan 0.48507555383425127
|
||
eks wye 0.4838950517724162
|
||
eks zee 0.49546320772681596
|
||
eks eks 0.5227992666570941
|
||
eks hat 0.5006790659966355
|
||
wye wye 0.49150092785839306
|
||
wye pan 0.4996119901034838
|
||
wye hat 0.4977304763723314
|
||
wye zee 0.5059066170573692
|
||
wye eks 0.5306035254809106
|
||
zee pan 0.5198298297816007
|
||
zee wye 0.5142667998230479
|
||
zee eks 0.4883932942792647
|
||
zee zee 0.5127559183726382
|
||
zee hat 0.46772617655014515
|
||
hat wye 0.49381326184632596
|
||
hat zee 0.5099985721987774
|
||
hat eks 0.48587864619953547
|
||
hat hat 0.47993053101017374
|
||
hat pan 0.4643355557376876
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="variance-and-standard-deviation-without-with-oosvars">
|
||
<h2>Variance and standard deviation without/with oosvars<a class="headerlink" href="#variance-and-standard-deviation-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
|
||
</span> x_count 10000
|
||
x_sum 4986.019681679581
|
||
x_mean 0.49860196816795804
|
||
x_var 0.08426974433144456
|
||
x_stddev 0.2902925151144007
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> cat variance.mlr
|
||
</span> @n += 1;
|
||
@sumx += $x;
|
||
@sumx2 += $x**2;
|
||
end {
|
||
@mean = @sumx / @n;
|
||
@var = (@sumx2 - @mean * (2 * @sumx - @n * @mean)) / (@n - 1);
|
||
@stddev = sqrt(@var);
|
||
emitf @n, @sumx, @sumx2, @mean, @var, @stddev
|
||
}
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab put -q -f variance.mlr data/medium
|
||
</span> n 10000
|
||
sumx 4986.019681679581
|
||
sumx2 3328.652400179729
|
||
mean 0.49860196816795804
|
||
var 0.08426974433144456
|
||
stddev 0.2902925151144007
|
||
</pre></div>
|
||
</div>
|
||
<p>You can also do this keyed, of course, imitating the keyed-mean example above.</p>
|
||
</div>
|
||
<div class="section" id="min-max-without-with-oosvars">
|
||
<h2>Min/max without/with oosvars<a class="headerlink" href="#min-max-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab stats1 -a min,max -f x data/medium
|
||
</span> x_min 0.00004509679127584487
|
||
x_max 0.999952670371898
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab put -q '
|
||
</span><span class="hll"> @x_min = min(@x_min, $x);
|
||
</span><span class="hll"> @x_max = max(@x_max, $x);
|
||
</span><span class="hll"> end{emitf @x_min, @x_max}
|
||
</span><span class="hll"> ' data/medium
|
||
</span> x_min 0.00004509679127584487
|
||
x_max 0.999952670371898
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="keyed-min-max-without-with-oosvars">
|
||
<h2>Keyed min/max without/with oosvars<a class="headerlink" href="#keyed-min-max-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint stats1 -a min,max -f x -g a data/medium
|
||
</span> a x_min x_max
|
||
pan 0.00020390740306253097 0.9994029107062516
|
||
eks 0.0006917972627396018 0.9988110946859143
|
||
wye 0.0001874794831505655 0.9998228522652893
|
||
zee 0.0005486114815762555 0.9994904324789629
|
||
hat 0.00004509679127584487 0.999952670371898
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint --from data/medium put -q '
|
||
</span><span class="hll"> @min[$a] = min(@min[$a], $x);
|
||
</span><span class="hll"> @max[$a] = max(@max[$a], $x);
|
||
</span><span class="hll"> end{
|
||
</span><span class="hll"> emit (@min, @max), "a";
|
||
</span><span class="hll"> }
|
||
</span><span class="hll"> '
|
||
</span> a min max
|
||
pan 0.00020390740306253097 0.9994029107062516
|
||
eks 0.0006917972627396018 0.9988110946859143
|
||
wye 0.0001874794831505655 0.9998228522652893
|
||
zee 0.0005486114815762555 0.9994904324789629
|
||
hat 0.00004509679127584487 0.999952670371898
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="delta-without-with-oosvars">
|
||
<h2>Delta without/with oosvars<a class="headerlink" href="#delta-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint step -a delta -f x data/small
|
||
</span> a b i x y x_delta
|
||
pan pan 1 0.3467901443380824 0.7268028627434533 0
|
||
eks pan 2 0.7586799647899636 0.5221511083334797 0.41188982045188116
|
||
wye wye 3 0.20460330576630303 0.33831852551664776 -0.5540766590236605
|
||
eks wye 4 0.38139939387114097 0.13418874328430463 0.17679608810483793
|
||
wye pan 5 0.5732889198020006 0.8636244699032729 0.19188952593085962
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put '
|
||
</span><span class="hll"> $x_delta = is_present(@last) ? $x - @last : 0;
|
||
</span><span class="hll"> @last = $x
|
||
</span><span class="hll"> ' data/small
|
||
</span> a b i x y x_delta
|
||
pan pan 1 0.3467901443380824 0.7268028627434533 0
|
||
eks pan 2 0.7586799647899636 0.5221511083334797 0.41188982045188116
|
||
wye wye 3 0.20460330576630303 0.33831852551664776 -0.5540766590236605
|
||
eks wye 4 0.38139939387114097 0.13418874328430463 0.17679608810483793
|
||
wye pan 5 0.5732889198020006 0.8636244699032729 0.19188952593085962
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="keyed-delta-without-with-oosvars">
|
||
<h2>Keyed delta without/with oosvars<a class="headerlink" href="#keyed-delta-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint step -a delta -f x -g a data/small
|
||
</span> a b i x y x_delta
|
||
pan pan 1 0.3467901443380824 0.7268028627434533 0
|
||
eks pan 2 0.7586799647899636 0.5221511083334797 0
|
||
wye wye 3 0.20460330576630303 0.33831852551664776 0
|
||
eks wye 4 0.38139939387114097 0.13418874328430463 -0.3772805709188226
|
||
wye pan 5 0.5732889198020006 0.8636244699032729 0.36868561403569755
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put '
|
||
</span><span class="hll"> $x_delta = is_present(@last[$a]) ? $x - @last[$a] : 0;
|
||
</span><span class="hll"> @last[$a]=$x
|
||
</span><span class="hll"> ' data/small
|
||
</span> a b i x y x_delta
|
||
pan pan 1 0.3467901443380824 0.7268028627434533 0
|
||
eks pan 2 0.7586799647899636 0.5221511083334797 0
|
||
wye wye 3 0.20460330576630303 0.33831852551664776 0
|
||
eks wye 4 0.38139939387114097 0.13418874328430463 -0.3772805709188226
|
||
wye pan 5 0.5732889198020006 0.8636244699032729 0.36868561403569755
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
<div class="section" id="exponentially-weighted-moving-averages-without-with-oosvars">
|
||
<h2>Exponentially weighted moving averages without/with oosvars<a class="headerlink" href="#exponentially-weighted-moving-averages-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint step -a ewma -d 0.1 -f x data/small
|
||
</span> a b i x y x_ewma_0.1
|
||
pan pan 1 0.3467901443380824 0.7268028627434533 0.3467901443380824
|
||
eks pan 2 0.7586799647899636 0.5221511083334797 0.3879791263832706
|
||
wye wye 3 0.20460330576630303 0.33831852551664776 0.36964154432157387
|
||
eks wye 4 0.38139939387114097 0.13418874328430463 0.37081732927653055
|
||
wye pan 5 0.5732889198020006 0.8636244699032729 0.3910644883290776
|
||
</pre></div>
|
||
</div>
|
||
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put '
|
||
</span><span class="hll"> begin{ @a=0.1 };
|
||
</span><span class="hll"> $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
|
||
</span><span class="hll"> @e=$e
|
||
</span><span class="hll"> ' data/small
|
||
</span> a b i x y e
|
||
pan pan 1 0.3467901443380824 0.7268028627434533 0.3467901443380824
|
||
eks pan 2 0.7586799647899636 0.5221511083334797 0.3879791263832706
|
||
wye wye 3 0.20460330576630303 0.33831852551664776 0.36964154432157387
|
||
eks wye 4 0.38139939387114097 0.13418874328430463 0.37081732927653055
|
||
wye pan 5 0.5732889198020006 0.8636244699032729 0.3910644883290776
|
||
</pre></div>
|
||
</div>
|
||
</div>
|
||
</div>
|
||
|
||
|
||
</div>
|
||
</div>
|
||
</div>
|
||
|
||
<div class="footer" role="contentinfo">
|
||
© Copyright 2021, John Kerl.
|
||
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.2.1.
|
||
</div>
|
||
</body>
|
||
</html> |