

<!DOCTYPE html>

<html>
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Two-pass algorithms &#8212; Miller 6.0.0-alpha documentation</title>

    <link rel="stylesheet" href="_static/scrolls.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <link rel="stylesheet" href="_static/print.css" type="text/css" />

    <script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
    <script src="_static/jquery.js"></script>
    <script src="_static/underscore.js"></script>
    <script src="_static/doctools.js"></script>
    <script src="_static/language_data.js"></script>
    <script src="_static/theme_extras.js"></script>
    <link rel="index" title="Index" href="genindex.html" />
    <link rel="search" title="Search" href="search.html" />
    <link rel="next" title="DKVP I/O examples" href="dkvp-examples.html" />
    <link rel="prev" title="Randomizing examples" href="randomizing-examples.html" />
  </head><body>
    <div id="content">
      <div class="header">
        <h1 class="heading"><a href="index.html"
          title="back to the documentation overview"><span>Two-pass algorithms</span></a></h1>
      </div>
      <div class="relnav" role="navigation" aria-label="related navigation">
        <a href="randomizing-examples.html">&laquo; Randomizing examples</a> |
        <a href="#">Two-pass algorithms</a>
        | <a href="dkvp-examples.html">DKVP I/O examples &raquo;</a>
      </div>
      <div id="contentwrapper">
        <div id="toc" role="navigation" aria-label="table of contents navigation">
          <h3>Table of Contents</h3>
          <ul>
<li><a class="reference internal" href="#">Two-pass algorithms</a><ul>
<li><a class="reference internal" href="#overview">Overview</a></li>
<li><a class="reference internal" href="#computation-of-percentages">Computation of percentages</a></li>
<li><a class="reference internal" href="#line-number-ratios">Line-number ratios</a></li>
<li><a class="reference internal" href="#records-having-max-value">Records having max value</a></li>
<li><a class="reference internal" href="#feature-counting">Feature-counting</a></li>
<li><a class="reference internal" href="#unsparsing">Unsparsing</a></li>
<li><a class="reference internal" href="#mean-without-with-oosvars">Mean without/with oosvars</a></li>
<li><a class="reference internal" href="#keyed-mean-without-with-oosvars">Keyed mean without/with oosvars</a></li>
<li><a class="reference internal" href="#variance-and-standard-deviation-without-with-oosvars">Variance and standard deviation without/with oosvars</a></li>
<li><a class="reference internal" href="#min-max-without-with-oosvars">Min/max without/with oosvars</a></li>
<li><a class="reference internal" href="#keyed-min-max-without-with-oosvars">Keyed min/max without/with oosvars</a></li>
<li><a class="reference internal" href="#delta-without-with-oosvars">Delta without/with oosvars</a></li>
<li><a class="reference internal" href="#keyed-delta-without-with-oosvars">Keyed delta without/with oosvars</a></li>
<li><a class="reference internal" href="#exponentially-weighted-moving-averages-without-with-oosvars">Exponentially weighted moving averages without/with oosvars</a></li>
</ul>
</li>
</ul>

        </div>
        <div role="main">

  <div class="section" id="two-pass-algorithms">
<h1>Two-pass algorithms<a class="headerlink" href="#two-pass-algorithms" title="Permalink to this headline">¶</a></h1>
<div class="section" id="overview">
<h2>Overview<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2>
<p>Miller is a streaming record processor: commands are executed once per record. This makes Miller particularly suitable for single-pass algorithms, and lets many of its verbs process files that are (much) larger than the amount of RAM present in your system. (Some verbs, such as <code class="docutils literal notranslate"><span class="pre">sort</span></code> and <code class="docutils literal notranslate"><span class="pre">tac</span></code>, must by their nature ingest and retain all input records before emitting any output records.) You can also use out-of-stream variables to perform multi-pass computations, at the price of retaining all input records in memory.</p>
<p>One of Miller’s strengths is its compact notation: for example, given input of the form</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> head -n 5 ../data/medium
</span> a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
 a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
 a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
 a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
 a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre></div>
</div>
<p>you can simply do</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab stats1 -a sum -f x ../data/medium
</span> x_sum 4986.019681679581
</pre></div>
</div>
<p>or</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint stats1 -a sum -f x -g b ../data/medium
</span> b   x_sum
 pan 965.7636699425815
 wye 1023.5484702619565
 zee 979.7420161495838
 eks 1016.7728571314786
 hat 1000.192668193983
</pre></div>
</div>
<p>rather than the more tedious</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab put -q &#39;
</span><span class="hll">   @x_sum += $x;
</span><span class="hll">   end {
</span><span class="hll">     emit @x_sum
</span><span class="hll">   }
</span><span class="hll"> &#39; data/medium
</span> x_sum 4986.019681679581
</pre></div>
</div>
<p>or</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put -q &#39;
</span><span class="hll">   @x_sum[$b] += $x;
</span><span class="hll">   end {
</span><span class="hll">     emit @x_sum, &quot;b&quot;
</span><span class="hll">   }
</span><span class="hll"> &#39; data/medium
</span> b   x_sum
 pan 965.7636699425815
 wye 1023.5484702619565
 zee 979.7420161495838
 eks 1016.7728571314786
 hat 1000.192668193983
</pre></div>
</div>
<p>The former (<code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">stats1</span></code> et al.) is easier to type, less error-prone, and faster to run.</p>
<p>Nonetheless, out-of-stream variables (which I whimsically call <em>oosvars</em>), begin/end blocks, and emit statements give you the ability to implement logic, if you wish to do so, which isn’t present in Miller’s built-in verbs.  (If you find yourself using the same out-of-stream-variable logic over and over, please file a request at <a class="reference external" href="https://github.com/johnkerl/miller/issues">https://github.com/johnkerl/miller/issues</a> to get it implemented directly in Go as a Miller verb of its own.)</p>
<p>The following examples use oosvars to compute things which Miller verbs can already compute, by way of providing food for thought.</p>
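<p>If the accumulate-then-emit pattern is unfamiliar, it may help to see it outside of Miller. The following is a minimal Python sketch, not Miller code: the function name <code class="docutils literal notranslate"><span class="pre">grouped_sum</span></code> and the five sample records are made up for illustration. The dictionary plays the role of the oosvar <code class="docutils literal notranslate"><span class="pre">@x_sum[$b]</span></code>, the loop body plays the role of the main block, and the return statement plays the role of the <code class="docutils literal notranslate"><span class="pre">end</span></code> block.</p>

```python
from collections import defaultdict


def grouped_sum(records, group_field, value_field):
    """Accumulate per-group sums in one pass, then hand them back at the end."""
    sums = defaultdict(float)      # stands in for the oosvar @x_sum[$b]
    for record in records:         # main block: runs once per record
        sums[record[group_field]] += record[value_field]
    return dict(sums)              # end block: emit @x_sum, "b"


# Hypothetical sample records, shaped like the DKVP data shown above.
records = [
    {"b": "pan", "x": 0.3467901443380824},
    {"b": "pan", "x": 0.7586799647899636},
    {"b": "wye", "x": 0.20460330576630303},
    {"b": "wye", "x": 0.38139939387114097},
    {"b": "pan", "x": 0.5732889198020006},
]
print(grouped_sum(records, "b", "x"))
```

<p>Only the running sums are kept, not the records themselves, so memory use grows with the number of distinct groups rather than with the size of the input.</p>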
</div>
<div class="section" id="computation-of-percentages">
<h2>Computation of percentages<a class="headerlink" href="#computation-of-percentages" title="Permalink to this headline">¶</a></h2>
<p>For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values, then on the second, map each record’s value to a percentage.</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --from data/small --opprint put -q &#39;
</span><span class="hll">   # These are executed once per record, which is the first pass.
</span><span class="hll">   # The key is to use NR to index an out-of-stream variable to
</span><span class="hll">   # retain all the x-field values.
</span><span class="hll">   @x_min = min($x, @x_min);
</span><span class="hll">   @x_max = max($x, @x_max);
</span><span class="hll">   @x[NR] = $x;
</span><span class="hll">
</span><span class="hll">   # The second pass is in a for-loop in an end-block.
</span><span class="hll">   end {
</span><span class="hll">     for (nr, x in @x) {
</span><span class="hll">       @x_pct[nr] = 100 * (x - @x_min) / (@x_max - @x_min);
</span><span class="hll">     }
</span><span class="hll">     emit (@x, @x_pct), &quot;NR&quot;
</span><span class="hll">   }
</span><span class="hll"> &#39;
</span> NR x                   x_pct
 1  0.3467901443380824  25.66194338926441
 2  0.7586799647899636  100
 3  0.20460330576630303 0
 4  0.38139939387114097 31.90823602213647
 5  0.5732889198020006  66.54054236562845
</pre></div>
</div>
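<p>The same two passes can be sketched in Python, under the assumption that all values fit in memory. The function name <code class="docutils literal notranslate"><span class="pre">min_max_percentages</span></code> is hypothetical, and the handling of empty or constant input is a choice made for this sketch; the Miller example above does not address those cases.</p>

```python
def min_max_percentages(values):
    """Map each value to its percentage position between the min and the max."""
    # First pass: retain every value and track the extremes as we go.
    retained = []
    lo = hi = None
    for v in values:
        retained.append(v)
        lo = v if lo is None else min(lo, v)
        hi = v if hi is None else max(hi, v)

    if not retained:
        return []

    # Second pass: lo and hi are only final once all input has been seen.
    span = hi - lo
    if span == 0:
        # Assumption for this sketch: constant input maps to 0 percent.
        return [0.0 for _ in retained]
    return [100 * (v - lo) / span for v in retained]


# The x values from data/small, as shown in the output above.
xs = [0.3467901443380824, 0.7586799647899636, 0.20460330576630303,
      0.38139939387114097, 0.5732889198020006]
for nr, pct in enumerate(min_max_percentages(xs), start=1):
    print(nr, pct)
```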
</div>
<div class="section" id="line-number-ratios">
<h2>Line-number ratios<a class="headerlink" href="#line-number-ratios" title="Permalink to this headline">¶</a></h2>
<p>Similarly, finding the total record count requires first reading through all the data:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint --from data/small put -q &#39;
</span><span class="hll">   @records[NR] = $*;
</span><span class="hll">   end {
</span><span class="hll">     for((I,k),v in @records) {
</span><span class="hll">       @records[I][&quot;I&quot;] = I;
</span><span class="hll">       @records[I][&quot;N&quot;] = NR;
</span><span class="hll">       @records[I][&quot;PCT&quot;] = 100*I/NR
</span><span class="hll">     }
</span><span class="hll">     emit @records,&quot;I&quot;
</span><span class="hll">   }
</span><span class="hll"> &#39; then reorder -f I,N,PCT
</span> I N PCT     a   b   i x                   y
 1 5 (error) pan pan 1 0.3467901443380824  0.7268028627434533
 2 5 (error) eks pan 2 0.7586799647899636  0.5221511083334797
 3 5 (error) wye wye 3 0.20460330576630303 0.33831852551664776
 4 5 (error) eks wye 4 0.38139939387114097 0.13418874328430463
 5 5 (error) wye pan 5 0.5732889198020006  0.8636244699032729
</pre></div>
</div>
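<p>As a cross-check on what the computation is meant to produce, here is a small Python sketch of the same idea. It is not Miller code, and the name <code class="docutils literal notranslate"><span class="pre">add_position_fields</span></code> is made up for this illustration. Each record gets its 1-up position <code class="docutils literal notranslate"><span class="pre">I</span></code>, the total count <code class="docutils literal notranslate"><span class="pre">N</span></code>, and the ratio <code class="docutils literal notranslate"><span class="pre">PCT</span></code> placed ahead of its own fields.</p>

```python
def add_position_fields(records):
    """Prefix each record with its position I, the total N, and 100*I/N."""
    # First pass: retain all records, as @records[NR] = $* does.
    retained = list(records)
    # The total is only known once the input is exhausted.
    n = len(retained)
    # Second pass: I, N, PCT go first, mirroring 'reorder -f I,N,PCT'.
    out = []
    for i, record in enumerate(retained, start=1):
        out.append({"I": i, "N": n, "PCT": 100 * i / n, **record})
    return out


# Hypothetical input: only the a and b fields of data/small.
rows = add_position_fields([
    {"a": "pan", "b": "pan"},
    {"a": "eks", "b": "pan"},
    {"a": "wye", "b": "wye"},
    {"a": "eks", "b": "wye"},
    {"a": "wye", "b": "pan"},
])
for row in rows:
    print(row)
```

<p>For five input records, the arithmetic <code class="docutils literal notranslate"><span class="pre">100*I/N</span></code> gives 20, 40, 60, 80, and 100.</p>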
</div>
<div class="section" id="records-having-max-value">
<h2>Records having max value<a class="headerlink" href="#records-having-max-value" title="Permalink to this headline">¶</a></h2>
<p>The idea is to retain records having the largest value of <code class="docutils literal notranslate"><span class="pre">n</span></code> in the following data:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --itsv --opprint cat data/maxrows.tsv
</span> a      b      n score
 purple red    5 0.743231
 blue   purple 2 0.093710
 red    purple 2 0.802103
 purple red    5 0.389055
 red    purple 2 0.880457
 orange red    2 0.540349
 purple purple 1 0.634451
 orange purple 5 0.257223
 orange purple 5 0.693499
 red    red    4 0.981355
 blue   purple 5 0.157052
 purple purple 1 0.441784
 red    purple 1 0.124912
 orange blue   1 0.921944
 blue   purple 4 0.490909
 purple red    5 0.454779
 green  purple 4 0.198278
 orange blue   5 0.705700
 red    red    3 0.940705
 purple red    5 0.072936
 orange blue   3 0.389463
 orange purple 2 0.664985
 blue   purple 1 0.371813
 red    purple 4 0.984571
 green  purple 5 0.203577
 green  purple 3 0.900873
 purple purple 0 0.965677
 blue   purple 2 0.208785
 purple purple 1 0.455077
 red    purple 4 0.477187
 blue   red    4 0.007487
</pre></div>
</div>
<p>Of course, the largest value of <code class="docutils literal notranslate"><span class="pre">n</span></code> isn’t known until after all data have been read. Using an out-of-stream variable we can retain all records as they are read, then filter them at the end:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> cat data/maxrows.mlr
</span> # Retain all records
 @records[NR] = $*;
 # Track max value of n
 @maxn = max(@maxn, $n);

 # After all records have been read, loop through retained records
 # and print those with the max n value.
 end {
   for (nr in @records) {
     map record = @records[nr];
     if (record[&quot;n&quot;] == @maxn) {
       emit record;
     }
   }
 }
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv
</span> a      b      n score
 purple red    5 0.743231
 purple red    5 0.389055
 orange purple 5 0.257223
 orange purple 5 0.693499
 blue   purple 5 0.157052
 purple red    5 0.454779
 orange blue   5 0.705700
 purple red    5 0.072936
 green  purple 5 0.203577
</pre></div>
</div>
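<p>For comparison, the retain-then-filter logic looks like this as a Python sketch. The function name <code class="docutils literal notranslate"><span class="pre">records_with_max</span></code> is hypothetical, and the sample input is just the first four rows of the table above.</p>

```python
def records_with_max(records, field):
    """Return the records whose value of `field` equals the overall maximum."""
    # First pass: retain everything; the maximum is unknown until the end.
    retained = list(records)
    if not retained:
        return []
    best = max(record[field] for record in retained)
    # Second pass: keep only the records that attain the maximum.
    return [record for record in retained if record[field] == best]


# First four rows of data/maxrows.tsv as shown above.
sample = [
    {"a": "purple", "b": "red", "n": 5, "score": 0.743231},
    {"a": "blue", "b": "purple", "n": 2, "score": 0.093710},
    {"a": "red", "b": "purple", "n": 2, "score": 0.802103},
    {"a": "purple", "b": "red", "n": 5, "score": 0.389055},
]
for record in records_with_max(sample, "n"):
    print(record)
```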
</div>
<div class="section" id="feature-counting">
<h2>Feature-counting<a class="headerlink" href="#feature-counting" title="Permalink to this headline">¶</a></h2>
<p>Suppose you have some heterogeneous data like this:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>{ &quot;qoh&quot;: 29874, &quot;rate&quot;: 1.68, &quot;latency&quot;: 0.02 }
{ &quot;name&quot;: &quot;alice&quot;, &quot;uid&quot;: 572 }
{ &quot;qoh&quot;: 1227, &quot;rate&quot;: 1.01, &quot;latency&quot;: 0.07 }
{ &quot;qoh&quot;: 13458, &quot;rate&quot;: 1.72, &quot;latency&quot;: 0.04 }
{ &quot;qoh&quot;: 56782, &quot;rate&quot;: 1.64 }
{ &quot;qoh&quot;: 23512, &quot;rate&quot;: 1.71, &quot;latency&quot;: 0.03 }
{ &quot;qoh&quot;: 9876, &quot;rate&quot;: 1.89, &quot;latency&quot;: 0.08 }
{ &quot;name&quot;: &quot;bill&quot;, &quot;uid&quot;: 684 }
{ &quot;name&quot;: &quot;chuck&quot;, &quot;uid2&quot;: 908 }
{ &quot;name&quot;: &quot;dottie&quot;, &quot;uid&quot;: 440 }
{ &quot;qoh&quot;: 0, &quot;rate&quot;: 0.40, &quot;latency&quot;: 0.01 }
{ &quot;qoh&quot;: 5438, &quot;rate&quot;: 1.56, &quot;latency&quot;: 0.17 }
</pre></div>
</div>
<p>Two reasonable questions to ask are: how many times does each field occur, and in what fraction of the records does each one appear? Since the denominator of the fraction is not known until the end, this is a two-pass algorithm:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>for (key in $*) {
  @key_counts[key] += 1;
}
@record_count += 1;

end {
  for (key in @key_counts) {
    @key_fraction[key] = @key_counts[key] / @record_count
  }
  emit @record_count;
  emit @key_counts, &quot;key&quot;;
  emit @key_fraction,&quot;key&quot;
}
</pre></div>
</div>
<p>Then</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --json put -q -f data/feature-count.mlr data/features.json
</span> {
   &quot;record_count&quot;: 12
 }
 {
   &quot;key&quot;: &quot;qoh&quot;,
   &quot;key_counts&quot;: 8
 }
 {
   &quot;key&quot;: &quot;rate&quot;,
   &quot;key_counts&quot;: 8
 }
 {
   &quot;key&quot;: &quot;latency&quot;,
   &quot;key_counts&quot;: 7
 }
 {
   &quot;key&quot;: &quot;name&quot;,
   &quot;key_counts&quot;: 4
 }
 {
   &quot;key&quot;: &quot;uid&quot;,
   &quot;key_counts&quot;: 3
 }
 {
   &quot;key&quot;: &quot;uid2&quot;,
   &quot;key_counts&quot;: 1
 }
 {
   &quot;key&quot;: &quot;qoh&quot;,
   &quot;key_fraction&quot;: 0.6666666666666666
 }
 {
   &quot;key&quot;: &quot;rate&quot;,
   &quot;key_fraction&quot;: 0.6666666666666666
 }
 {
   &quot;key&quot;: &quot;latency&quot;,
   &quot;key_fraction&quot;: 0.5833333333333334
 }
 {
   &quot;key&quot;: &quot;name&quot;,
   &quot;key_fraction&quot;: 0.3333333333333333
 }
 {
   &quot;key&quot;: &quot;uid&quot;,
   &quot;key_fraction&quot;: 0.25
 }
 {
   &quot;key&quot;: &quot;uid2&quot;,
   &quot;key_fraction&quot;: 0.08333333333333333
 }
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json
</span> record_count
 12

 key     key_counts
 qoh     8
 rate    8
 latency 7
 name    4
 uid     3
 uid2    1

 key     key_fraction
 qoh     0.6666666666666666
 rate    0.6666666666666666
 latency 0.5833333333333334
 name    0.3333333333333333
 uid     0.25
 uid2    0.08333333333333333
</pre></div>
</div>
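<p>The counting logic is compact enough to sketch in Python as well. This is an illustration rather than Miller code: the name <code class="docutils literal notranslate"><span class="pre">feature_counts</span></code> is made up, and the sample input is the first four records of the JSON data above.</p>

```python
from collections import Counter


def feature_counts(records):
    """Count occurrences of each key, and the fraction of records having it."""
    key_counts = Counter()
    record_count = 0
    # First pass: tally keys and records.
    for record in records:
        key_counts.update(record.keys())
        record_count += 1
    # Second pass over the tallies: the denominator is now known.
    key_fraction = {k: c / record_count for k, c in key_counts.items()}
    return record_count, dict(key_counts), key_fraction


# First four records of data/features.json as shown above.
features = [
    {"qoh": 29874, "rate": 1.68, "latency": 0.02},
    {"name": "alice", "uid": 572},
    {"qoh": 1227, "rate": 1.01, "latency": 0.07},
    {"qoh": 13458, "rate": 1.72, "latency": 0.04},
]
print(feature_counts(features))
```

<p>Unlike the earlier examples, nothing here retains the records themselves; the second pass runs over the per-key tallies only.</p>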
</div>
<div class="section" id="unsparsing">
<h2>Unsparsing<a class="headerlink" href="#unsparsing" title="Permalink to this headline">¶</a></h2>
<p>When CSV data has a full header line, the list of all field names is present in that header, which makes filling in missing data fields straightforward. Next, let’s look at a related problem: we have data where each record has its own set of key names, but we want to produce rectangular output having the union of all key names.</p>
<p>For example, suppose you have JSON input like this:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> cat data/sparse.json
</span> {&quot;a&quot;:1,&quot;b&quot;:2,&quot;v&quot;:3}
 {&quot;u&quot;:1,&quot;b&quot;:2}
 {&quot;a&quot;:1,&quot;v&quot;:2,&quot;x&quot;:3}
 {&quot;v&quot;:1,&quot;w&quot;:2}
</pre></div>
</div>
<p>There are field names <code class="docutils literal notranslate"><span class="pre">a</span></code>, <code class="docutils literal notranslate"><span class="pre">b</span></code>, <code class="docutils literal notranslate"><span class="pre">v</span></code>, <code class="docutils literal notranslate"><span class="pre">u</span></code>, <code class="docutils literal notranslate"><span class="pre">x</span></code>, <code class="docutils literal notranslate"><span class="pre">w</span></code> in the data, but not every record has all of them.  Since we don’t know the names of all the keys until we’ve read them all, this needs to be a two-pass algorithm. On the first pass, remember all the unique key names and all the records; on the second pass, loop through the records, fill in absent values, and produce output. Use <code class="docutils literal notranslate"><span class="pre">put</span> <span class="pre">-q</span></code> since we don’t want per-record output; output is emitted only in the <code class="docutils literal notranslate"><span class="pre">end</span></code> block:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> cat data/unsparsify.mlr
</span> # First pass:
 # Remember all unique key names:
 for (k in $*) {
   @all_keys[k] = 1;
 }
 # Remember all input records:
 @records[NR] = $*;

 # Second pass:
 end {
   for (nr in @records) {
     # Get the sparsely keyed input record:
     irecord = @records[nr];
     # Fill in missing keys with empty string:
     map orecord = {};
     for (k in @all_keys) {
       if (haskey(irecord, k)) {
         orecord[k] = irecord[k];
       } else {
         orecord[k] = &quot;&quot;;
       }
     }
     # Produce the output:
     emit orecord;
   }
 }
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --json put -q -f data/unsparsify.mlr data/sparse.json
</span> {
   &quot;a&quot;: 1,
   &quot;b&quot;: 2,
   &quot;v&quot;: 3,
   &quot;u&quot;: &quot;&quot;,
   &quot;x&quot;: &quot;&quot;,
   &quot;w&quot;: &quot;&quot;
 }
 {
   &quot;a&quot;: &quot;&quot;,
   &quot;b&quot;: 2,
   &quot;v&quot;: &quot;&quot;,
   &quot;u&quot;: 1,
   &quot;x&quot;: &quot;&quot;,
   &quot;w&quot;: &quot;&quot;
 }
 {
   &quot;a&quot;: 1,
   &quot;b&quot;: &quot;&quot;,
   &quot;v&quot;: 2,
   &quot;u&quot;: &quot;&quot;,
   &quot;x&quot;: 3,
   &quot;w&quot;: &quot;&quot;
 }
 {
   &quot;a&quot;: &quot;&quot;,
   &quot;b&quot;: &quot;&quot;,
   &quot;v&quot;: 1,
   &quot;u&quot;: &quot;&quot;,
   &quot;x&quot;: &quot;&quot;,
   &quot;w&quot;: 2
 }
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --ijson --ocsv put -q -f data/unsparsify.mlr data/sparse.json
</span> a,b,v,u,x,w
 1,2,3,,,
 ,2,,1,,
 1,,2,,3,
 ,,1,,,2
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json
</span> a b v u x w
 1 2 3 - - -
 - 2 - 1 - -
 1 - 2 - 3 -
 - - 1 - - 2
</pre></div>
</div>
<p>There is a keystroke-saving verb for this: <a class="reference internal" href="reference-verbs.html#reference-verbs-unsparsify"><span class="std std-ref">mlr unsparsify</span></a>.</p>
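<p>For readers who think more readily in a general-purpose language, the two passes of <code class="docutils literal notranslate"><span class="pre">data/unsparsify.mlr</span></code> can be sketched in Python as follows. The function name <code class="docutils literal notranslate"><span class="pre">unsparsify</span></code> and its <code class="docutils literal notranslate"><span class="pre">fill</span></code> parameter belong to this sketch only; it says nothing about how the Miller verb is implemented.</p>

```python
def unsparsify(records, fill=""):
    """Give every record the union of all keys, filling absent ones."""
    # First pass: remember all records and all unique key names.
    retained = []
    all_keys = {}   # a dict used as an insertion-ordered set
    for record in records:
        retained.append(record)
        for key in record:
            all_keys.setdefault(key, True)
    # Second pass: fill in whatever each record is missing.
    return [{key: record.get(key, fill) for key in all_keys}
            for record in retained]


# The contents of data/sparse.json as shown above.
sparse = [
    {"a": 1, "b": 2, "v": 3},
    {"u": 1, "b": 2},
    {"a": 1, "v": 2, "x": 3},
    {"v": 1, "w": 2},
]
for record in unsparsify(sparse):
    print(record)
```

<p>Keys appear in first-seen order, which gives the same column order (a, b, v, u, x, w) as the Miller output above.</p>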
</div>
<div class="section" id="mean-without-with-oosvars">
<h2>Mean without/with oosvars<a class="headerlink" href="#mean-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint stats1 -a mean -f x data/medium
</span> x_mean
 0.49860196816795804
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put -q &#39;
</span><span class="hll">   @x_sum += $x;
</span><span class="hll">   @x_count += 1;
</span><span class="hll">   end {
</span><span class="hll">     @x_mean = @x_sum / @x_count;
</span><span class="hll">     emit @x_mean
</span><span class="hll">   }
</span><span class="hll"> &#39; data/medium
</span> x_mean
 0.49860196816795804
</pre></div>
</div>
</div>
<div class="section" id="keyed-mean-without-with-oosvars">
<h2>Keyed mean without/with oosvars<a class="headerlink" href="#keyed-mean-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint stats1 -a mean -f x -g a,b data/medium
</span> a   b   x_mean
 pan pan 0.5133141190437597
 eks pan 0.48507555383425127
 wye wye 0.49150092785839306
 eks wye 0.4838950517724162
 wye pan 0.4996119901034838
 zee pan 0.5198298297816007
 eks zee 0.49546320772681596
 zee wye 0.5142667998230479
 hat wye 0.49381326184632596
 pan wye 0.5023618498923658
 zee eks 0.4883932942792647
 hat zee 0.5099985721987774
 hat eks 0.48587864619953547
 wye hat 0.4977304763723314
 pan eks 0.5036718595143479
 eks eks 0.5227992666570941
 hat hat 0.47993053101017374
 hat pan 0.4643355557376876
 zee zee 0.5127559183726382
 pan hat 0.492140950155604
 pan zee 0.4966041598627583
 zee hat 0.46772617655014515
 wye zee 0.5059066170573692
 eks hat 0.5006790659966355
 wye eks 0.5306035254809106
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put -q &#39;
</span><span class="hll">   @x_sum[$a][$b] += $x;
</span><span class="hll">   @x_count[$a][$b] += 1;
</span><span class="hll">   end{
</span><span class="hll">     for ((a, b), v in @x_sum) {
</span><span class="hll">       @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
</span><span class="hll">     }
</span><span class="hll">     emit @x_mean, &quot;a&quot;, &quot;b&quot;
</span><span class="hll">   }
</span><span class="hll"> &#39; data/medium
</span> a   b   x_mean
 pan pan 0.5133141190437597
 pan wye 0.5023618498923658
 pan eks 0.5036718595143479
 pan hat 0.492140950155604
 pan zee 0.4966041598627583
 eks pan 0.48507555383425127
 eks wye 0.4838950517724162
 eks zee 0.49546320772681596
 eks eks 0.5227992666570941
 eks hat 0.5006790659966355
 wye wye 0.49150092785839306
 wye pan 0.4996119901034838
 wye hat 0.4977304763723314
 wye zee 0.5059066170573692
 wye eks 0.5306035254809106
 zee pan 0.5198298297816007
 zee wye 0.5142667998230479
 zee eks 0.4883932942792647
 zee zee 0.5127559183726382
 zee hat 0.46772617655014515
 hat wye 0.49381326184632596
 hat zee 0.5099985721987774
 hat eks 0.48587864619953547
 hat hat 0.47993053101017374
 hat pan 0.4643355557376876
</pre></div>
</div>
</div>
<div class="section" id="variance-and-standard-deviation-without-with-oosvars">
<h2>Variance and standard deviation without/with oosvars<a class="headerlink" href="#variance-and-standard-deviation-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
</span> x_count  10000
 x_sum    4986.019681679581
 x_mean   0.49860196816795804
 x_var    0.08426974433144456
 x_stddev 0.2902925151144007
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> cat variance.mlr
</span> @n += 1;
 @sumx += $x;
 @sumx2 += $x**2;
 end {
   @mean = @sumx / @n;
   @var = (@sumx2 - @mean * (2 * @sumx - @n * @mean)) / (@n - 1);
   @stddev = sqrt(@var);
   emitf @n, @sumx, @sumx2, @mean, @var, @stddev
 }
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab put -q -f variance.mlr data/medium
</span> n      10000
 sumx   4986.019681679581
 sumx2  3328.652400179729
 mean   0.49860196816795804
 var    0.08426974433144456
 stddev 0.2902925151144007
</pre></div>
</div>
<p>You can also do this keyed, of course, imitating the keyed-mean example above.</p>
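<p>The expression for <code class="docutils literal notranslate"><span class="pre">@var</span></code> in <code class="docutils literal notranslate"><span class="pre">variance.mlr</span></code> is the usual sum of squared deviations, expanded so that it needs only the three running totals. The Python sketch below spells out that expansion in a comment and checks the result against the standard library. The function name <code class="docutils literal notranslate"><span class="pre">streaming_variance</span></code> is hypothetical, and rejecting inputs with fewer than two values is a choice made for this sketch.</p>

```python
import math
import statistics


def streaming_variance(values):
    """Sample variance from running totals, as in variance.mlr."""
    n = 0
    sumx = 0.0
    sumx2 = 0.0
    for x in values:            # main block: three accumulators, no retention
        n += 1
        sumx += x
        sumx2 += x ** 2
    if n in (0, 1):
        # Assumption for this sketch: the (n - 1) denominator needs two values.
        raise ValueError("need at least two values")
    mean = sumx / n
    # sum((x - mean)**2) = sumx2 - 2*mean*sumx + n*mean**2
    #                    = sumx2 - mean * (2*sumx - n*mean)
    var = (sumx2 - mean * (2 * sumx - n * mean)) / (n - 1)
    return mean, var, math.sqrt(var)


# The x values from data/small.
xs = [0.3467901443380824, 0.7586799647899636, 0.20460330576630303,
      0.38139939387114097, 0.5732889198020006]
mean, var, stddev = streaming_variance(xs)
print(mean, var, stddev)
print(statistics.variance(xs))   # reference value from the standard library
```

<p>One caveat worth knowing: formulas built on a running sum of squares can lose precision when the mean is large compared with the spread of the data, so for such inputs an incremental method may be the safer choice.</p>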
</div>
<div class="section" id="min-max-without-with-oosvars">
<h2>Min/max without/with oosvars<a class="headerlink" href="#min-max-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab stats1 -a min,max -f x data/medium
</span> x_min 0.00004509679127584487
 x_max 0.999952670371898
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --oxtab put -q &#39;
</span><span class="hll">   @x_min = min(@x_min, $x);
</span><span class="hll">   @x_max = max(@x_max, $x);
</span><span class="hll">   end{emitf @x_min, @x_max}
</span><span class="hll"> &#39; data/medium
</span> x_min 0.00004509679127584487
 x_max 0.999952670371898
</pre></div>
</div>
</div>
<div class="section" id="keyed-min-max-without-with-oosvars">
<h2>Keyed min/max without/with oosvars<a class="headerlink" href="#keyed-min-max-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint stats1 -a min,max -f x -g a data/medium
</span> a   x_min                  x_max
 pan 0.00020390740306253097 0.9994029107062516
 eks 0.0006917972627396018  0.9988110946859143
 wye 0.0001874794831505655  0.9998228522652893
 zee 0.0005486114815762555  0.9994904324789629
 hat 0.00004509679127584487 0.999952670371898
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint --from data/medium put -q &#39;
</span><span class="hll">   @min[$a] = min(@min[$a], $x);
</span><span class="hll">   @max[$a] = max(@max[$a], $x);
</span><span class="hll">   end{
</span><span class="hll">     emit (@min, @max), &quot;a&quot;;
</span><span class="hll">   }
</span><span class="hll"> &#39;
</span> a   min                    max
 pan 0.00020390740306253097 0.9994029107062516
 eks 0.0006917972627396018  0.9988110946859143
 wye 0.0001874794831505655  0.9998228522652893
 zee 0.0005486114815762555  0.9994904324789629
 hat 0.00004509679127584487 0.999952670371898
</pre></div>
</div>
</div>
<div class="section" id="delta-without-with-oosvars">
<h2>Delta without/with oosvars<a class="headerlink" href="#delta-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint step -a delta -f x data/small
</span> a   b   i x                   y                   x_delta
 pan pan 1 0.3467901443380824  0.7268028627434533  0
 eks pan 2 0.7586799647899636  0.5221511083334797  0.41188982045188116
 wye wye 3 0.20460330576630303 0.33831852551664776 -0.5540766590236605
 eks wye 4 0.38139939387114097 0.13418874328430463 0.17679608810483793
 wye pan 5 0.5732889198020006  0.8636244699032729  0.19188952593085962
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put &#39;
</span><span class="hll">   $x_delta = is_present(@last) ? $x - @last : 0;
</span><span class="hll">   @last = $x
</span><span class="hll"> &#39; data/small
</span> a   b   i x                   y                   x_delta
 pan pan 1 0.3467901443380824  0.7268028627434533  0
 eks pan 2 0.7586799647899636  0.5221511083334797  0.41188982045188116
 wye wye 3 0.20460330576630303 0.33831852551664776 -0.5540766590236605
 eks wye 4 0.38139939387114097 0.13418874328430463 0.17679608810483793
 wye pan 5 0.5732889198020006  0.8636244699032729  0.19188952593085962
</pre></div>
</div>
</div>
<div class="section" id="keyed-delta-without-with-oosvars">
<h2>Keyed delta without/with oosvars<a class="headerlink" href="#keyed-delta-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint step -a delta -f x -g a data/small
</span> a   b   i x                   y                   x_delta
 pan pan 1 0.3467901443380824  0.7268028627434533  0
 eks pan 2 0.7586799647899636  0.5221511083334797  0
 wye wye 3 0.20460330576630303 0.33831852551664776 0
 eks wye 4 0.38139939387114097 0.13418874328430463 -0.3772805709188226
 wye pan 5 0.5732889198020006  0.8636244699032729  0.36868561403569755
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put &#39;
</span><span class="hll">   $x_delta = is_present(@last[$a]) ? $x - @last[$a] : 0;
</span><span class="hll">   @last[$a]=$x
</span><span class="hll"> &#39; data/small
</span> a   b   i x                   y                   x_delta
 pan pan 1 0.3467901443380824  0.7268028627434533  0
 eks pan 2 0.7586799647899636  0.5221511083334797  0
 wye wye 3 0.20460330576630303 0.33831852551664776 0
 eks wye 4 0.38139939387114097 0.13418874328430463 -0.3772805709188226
 wye pan 5 0.5732889198020006  0.8636244699032729  0.36868561403569755
</pre></div>
</div>
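<p>Note that the delta computations need no second pass at all: they only carry one previous value per key from record to record. A Python sketch of the keyed version, with the made-up function name <code class="docutils literal notranslate"><span class="pre">keyed_delta</span></code>, might look like this:</p>

```python
def keyed_delta(records, key_field, value_field):
    """Append <value_field>_delta: the change since the last record with the same key."""
    last = {}   # stands in for the oosvar @last[$a]
    out = []
    for record in records:
        key = record[key_field]
        value = record[value_field]
        # The first record seen for a key has nothing to compare with.
        delta = value - last[key] if key in last else 0
        last[key] = value
        out.append({**record, value_field + "_delta": delta})
    return out


# The a and x fields of data/small.
small = [
    {"a": "pan", "x": 0.3467901443380824},
    {"a": "eks", "x": 0.7586799647899636},
    {"a": "wye", "x": 0.20460330576630303},
    {"a": "eks", "x": 0.38139939387114097},
    {"a": "wye", "x": 0.5732889198020006},
]
for record in keyed_delta(small, "a", "x"):
    print(record)
```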
</div>
<div class="section" id="exponentially-weighted-moving-averages-without-with-oosvars">
<h2>Exponentially weighted moving averages without/with oosvars<a class="headerlink" href="#exponentially-weighted-moving-averages-without-with-oosvars" title="Permalink to this headline">¶</a></h2>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint step -a ewma -d 0.1 -f x data/small
</span> a   b   i x                   y                   x_ewma_0.1
 pan pan 1 0.3467901443380824  0.7268028627434533  0.3467901443380824
 eks pan 2 0.7586799647899636  0.5221511083334797  0.3879791263832706
 wye wye 3 0.20460330576630303 0.33831852551664776 0.36964154432157387
 eks wye 4 0.38139939387114097 0.13418874328430463 0.37081732927653055
 wye pan 5 0.5732889198020006  0.8636244699032729  0.3910644883290776
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --opprint put &#39;
</span><span class="hll">   begin{ @a=0.1 };
</span><span class="hll">   $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
</span><span class="hll">   @e=$e
</span><span class="hll"> &#39; data/small
</span> a   b   i x                   y                   e
 pan pan 1 0.3467901443380824  0.7268028627434533  0.3467901443380824
 eks pan 2 0.7586799647899636  0.5221511083334797  0.3879791263832706
 wye wye 3 0.20460330576630303 0.33831852551664776 0.36964154432157387
 eks wye 4 0.38139939387114097 0.13418874328430463 0.37081732927653055
 wye pan 5 0.5732889198020006  0.8636244699032729  0.3910644883290776
</pre></div>
</div>
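<p>The recurrence in the <code class="docutils literal notranslate"><span class="pre">put</span></code> expression above is easy to restate in Python. In this sketch the function name <code class="docutils literal notranslate"><span class="pre">ewma</span></code> is made up, and <code class="docutils literal notranslate"><span class="pre">alpha</span></code> corresponds to <code class="docutils literal notranslate"><span class="pre">@a</span></code>: the first output is the first input, and each later output blends the new input with the previous output.</p>

```python
def ewma(values, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    out = []
    previous = None             # stands in for the oosvar @e
    for x in values:
        if previous is None:
            current = x         # the NR==1 case
        else:
            current = alpha * x + (1 - alpha) * previous
        out.append(current)
        previous = current
    return out


# The x values from data/small.
xs = [0.3467901443380824, 0.7586799647899636, 0.20460330576630303,
      0.38139939387114097, 0.5732889198020006]
for e in ewma(xs, 0.1):
    print(e)
```

<p>Like the delta examples, this is single-pass: the only state carried between records is the previous average.</p>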
</div>
</div>


        </div>
      </div>
    </div>

    <div class="footer" role="contentinfo">
        &#169; Copyright 2021, John Kerl.
      Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.2.1.
    </div>
  </body>
</html>