miller/docs6/docs/_build/html/faq.html

503 lines
No EOL
31 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>FAQ &#8212; Miller 6.0.0-alpha documentation</title>
<link rel="stylesheet" href="_static/classic.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/doctools.js"></script>
<script src="_static/language_data.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Cookbook part 1: common patterns" href="cookbook.html" />
<link rel="prev" title="Joins" href="joins.html" />
</head><body>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="cookbook.html" title="Cookbook part 1: common patterns"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="joins.html" title="Joins"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">Miller 6.0.0-alpha documentation</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">FAQ</a></li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<div class="section" id="faq">
<h1>FAQ<a class="headerlink" href="#faq" title="Permalink to this headline"></a></h1>
<div class="section" id="how-do-i-suppress-numeric-conversion">
<h2>How do I suppress numeric conversion?<a class="headerlink" href="#how-do-i-suppress-numeric-conversion" title="Permalink to this headline"></a></h2>
<p><strong>TL;DR use put -S</strong>.</p>
<p>Within <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">put</span></code> and <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">filter</span></code>, the default behavior for scanning input records is to parse them as integer, if possible, then as float, if possible, else leave them as string:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ cat data/scan-example-1.tbl
</span> value
1
2.0
3x
hello
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --pprint put &#39;$copy = $value; $type = typeof($value)&#39; data/scan-example-1.tbl
</span> value copy type
1 1 int
2.0 2.0 float
3x 3x string
hello hello string
</pre></div>
</div>
<p>The numeric-conversion rule is simple:</p>
<ul class="simple">
<li><p>Try to scan as integer (<code class="docutils literal notranslate"><span class="pre">&quot;1&quot;</span></code> should be int);</p></li>
<li><p>If that doesnt succeed, try to scan as float (<code class="docutils literal notranslate"><span class="pre">&quot;1.0&quot;</span></code> should be float);</p></li>
<li><p>If that doesnt succeed, leave the value as a string (<code class="docutils literal notranslate"><span class="pre">&quot;1x&quot;</span></code> is string).</p></li>
</ul>
<p>This is a sensible default: you should be able to put <code class="docutils literal notranslate"><span class="pre">'$z</span> <span class="pre">=</span> <span class="pre">$x</span> <span class="pre">+</span> <span class="pre">$y'</span></code> without having to write <code class="docutils literal notranslate"><span class="pre">'$z</span> <span class="pre">=</span> <span class="pre">int($x)</span> <span class="pre">+</span> <span class="pre">float($y)'</span></code>. Also note that default output format for floating-point numbers created by <code class="docutils literal notranslate"><span class="pre">put</span></code> (and other verbs such as <code class="docutils literal notranslate"><span class="pre">stats1</span></code>) is six decimal places; you can override this using <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">--ofmt</span></code>. Also note that Miller uses your systems Go library functions whenever possible: e.g. <code class="docutils literal notranslate"><span class="pre">sscanf</span></code> for converting strings to integer or floating-point.</p>
<p>But now suppose you have data like these:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ cat data/scan-example-2.tbl
</span> value
0001
0002
0005
0005WA
0006
0007
0007WA
0008
0009
0010
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --pprint put &#39;$copy = $value; $type = typeof($value)&#39; data/scan-example-2.tbl
</span> value copy type
0001 0001 int
0002 0002 int
0005 0005 int
0005WA 0005WA string
0006 0006 int
0007 0007 int
0007WA 0007WA string
0008 0008 float
0009 0009 float
0010 0010 int
</pre></div>
</div>
<p>The same conversion rules as above are being used. Namely:</p>
<ul class="simple">
<li><p>By default field values are inferred to int, else float, else string;</p></li>
<li><p>leading zeroes indicate octal for integers (<code class="docutils literal notranslate"><span class="pre">sscanf</span></code> semantics);</p></li>
<li><p>since <code class="docutils literal notranslate"><span class="pre">0008</span></code> doesnt scan as integer (leading 0 requests octal but 8 isnt a valid octal digit), the float scan is tried next and it succeeds;</p></li>
<li><p>default floating-point output format is 6 decimal places (override with <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">--ofmt</span></code>).</p></li>
</ul>
<p>Taken individually the rules make sense; taken collectively they produce a mishmash of types here.</p>
<p>The solution is to <strong>use the -S flag</strong> for <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">put</span></code> and/or <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">filter</span></code>. Then all field values are left as string. You can type-coerce on demand using syntax like <code class="docutils literal notranslate"><span class="pre">'$z</span> <span class="pre">=</span> <span class="pre">int($x)</span> <span class="pre">+</span> <span class="pre">float($y)'</span></code>. (See also <a class="reference internal" href="reference-dsl.html"><span class="doc">DSL reference</span></a>; see also <a class="reference external" href="https://github.com/johnkerl/miller/issues/150">https://github.com/johnkerl/miller/issues/150</a>.)</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --pprint put -S &#39;$copy = $value; $type = typeof($value)&#39; data/scan-example-2.tbl
</span> value copy type
0001 0001 int
0002 0002 int
0005 0005 int
0005WA 0005WA string
0006 0006 int
0007 0007 int
0007WA 0007WA string
0008 0008 float
0009 0009 float
0010 0010 int
</pre></div>
</div>
</div>
<div class="section" id="how-do-i-examine-then-chaining">
<h2>How do I examine then-chaining?<a class="headerlink" href="#how-do-i-examine-then-chaining" title="Permalink to this headline"></a></h2>
<p>Then-chaining found in Miller is intended to function the same as Unix pipes, but with less keystroking. You can print your data one pipeline step at a time, to see what intermediate output at one step becomes the input to the next step.</p>
<p>First, look at the input data:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ cat data/then-example.csv
</span> Status,Payment_Type,Amount
paid,cash,10.00
pending,debit,20.00
paid,cash,50.00
pending,credit,40.00
paid,debit,30.00
</pre></div>
</div>
<p>Next, run the first step of your command, omitting anything from the first <code class="docutils literal notranslate"><span class="pre">then</span></code> onward:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsv --opprint count-distinct -f Status,Payment_Type data/then-example.csv
</span> Status Payment_Type count
paid cash 2
pending debit 1
pending credit 1
paid debit 1
</pre></div>
</div>
<p>After that, run it with the next <code class="docutils literal notranslate"><span class="pre">then</span></code> step included:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsv --opprint count-distinct -f Status,Payment_Type then sort -nr count data/then-example.csv
</span> Status Payment_Type count
paid cash 2
pending debit 1
pending credit 1
paid debit 1
</pre></div>
</div>
<p>Now if you use <code class="docutils literal notranslate"><span class="pre">then</span></code> to include another verb after that, the columns <code class="docutils literal notranslate"><span class="pre">Status</span></code>, <code class="docutils literal notranslate"><span class="pre">Payment_Type</span></code>, and <code class="docutils literal notranslate"><span class="pre">count</span></code> will be the input to that verb.</p>
<p>Note, by the way, that youll get the same results using pipes:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --csv count-distinct -f Status,Payment_Type data/then-example.csv | mlr --icsv --opprint sort -nr count
</span> Status Payment_Type count
paid cash 2
pending debit 1
pending credit 1
paid debit 1
</pre></div>
</div>
</div>
<div class="section" id="how-can-i-filter-by-date">
<h2>How can I filter by date?<a class="headerlink" href="#how-can-i-filter-by-date" title="Permalink to this headline"></a></h2>
<p>Given input like</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ cat dates.csv
</span> date,event
2018-02-03,initialization
2018-03-07,discovery
2018-02-03,allocation
</pre></div>
</div>
<p>we can use <code class="docutils literal notranslate"><span class="pre">strptime</span></code> to parse the date field into seconds-since-epoch and then do numeric comparisons. Simply match your input datasets date-formatting to the <a class="reference internal" href="reference-dsl.html#reference-dsl-strptime"><span class="std std-ref">strptime</span></a> format-string. For example:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --csv filter &#39;strptime($date, &quot;%Y-%m-%d&quot;) &gt; strptime(&quot;2018-03-03&quot;, &quot;%Y-%m-%d&quot;)&#39; dates.csv
</span> date,event
2018-03-07,discovery
</pre></div>
</div>
<p>Caveat: localtime-handling in timezones with DST is still a work in progress; see <a class="reference external" href="https://github.com/johnkerl/miller/issues/170">https://github.com/johnkerl/miller/issues/170</a>. See also <a class="reference external" href="https://github.com/johnkerl/miller/issues/208">https://github.com/johnkerl/miller/issues/208</a> thanks &#64;aborruso!</p>
</div>
<div class="section" id="how-can-i-handle-commas-as-data-in-various-formats">
<h2>How can I handle commas-as-data in various formats?<a class="headerlink" href="#how-can-i-handle-commas-as-data-in-various-formats" title="Permalink to this headline"></a></h2>
<p><a class="reference internal" href="file-formats.html"><span class="doc">CSV</span></a> handles this well and by design:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ cat commas.csv
</span> Name,Role
&quot;Xiao, Lin&quot;,administrator
&quot;Khavari, Darius&quot;,tester
</pre></div>
</div>
<p>Likewise <a class="reference internal" href="file-formats.html#file-formats-json"><span class="std std-ref">Tabular JSON</span></a>:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsv --ojson cat commas.csv
</span> {
&quot;Name&quot;: &quot;Xiao, Lin&quot;,
&quot;Role&quot;: &quot;administrator&quot;
}
{
&quot;Name&quot;: &quot;Khavari, Darius&quot;,
&quot;Role&quot;: &quot;tester&quot;
}
</pre></div>
</div>
<p>For Millers <a class="reference internal" href="file-formats.html#file-formats-xtab"><span class="std std-ref">vertical-tabular format</span></a> there is no escaping for carriage returns, but commas work fine:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsv --oxtab cat commas.csv
</span> Name Xiao, Lin
Role administrator
Name Khavari, Darius
Role tester
</pre></div>
</div>
<p>But for <a class="reference internal" href="file-formats.html#file-formats-dkvp"><span class="std std-ref">Key-value_pairs</span></a> and <a class="reference internal" href="file-formats.html#file-formats-nidx"><span class="std std-ref">index-numbered</span></a>, commas are the default field separator. And as of Miller 5.4.0 anyway there is no CSV-style double-quote-handling like there is for CSV. So commas within the data look like delimiters:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsv --odkvp cat commas.csv
</span> Name=Xiao, Lin,Role=administrator
Name=Khavari, Darius,Role=tester
</pre></div>
</div>
<p>One solution is to use a different delimiter, such as a pipe character:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsv --odkvp --ofs pipe cat commas.csv
</span> Name=Xiao, Lin|Role=administrator
Name=Khavari, Darius|Role=tester
</pre></div>
</div>
<p>To be extra-sure to avoid data/delimiter clashes, you can also use control
characters as delimiters here, control-A:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsv --odkvp --ofs &#39;\001&#39; cat commas.csv | cat -v
</span> Name=Xiao, Lin\001Role=administrator
Name=Khavari, Darius\001Role=tester
</pre></div>
</div>
</div>
<div class="section" id="how-can-i-handle-field-names-with-special-symbols-in-them">
<h2>How can I handle field names with special symbols in them?<a class="headerlink" href="#how-can-i-handle-field-names-with-special-symbols-in-them" title="Permalink to this headline"></a></h2>
<p>Simply surround the field names with curly braces:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ echo &#39;x.a=3,y:b=4,z/c=5&#39; | mlr put &#39;${product.all} = ${x.a} * ${y:b} * ${z/c}&#39;
</span> x.a=3,y:b=4,z/c=5,product.all=60
</pre></div>
</div>
</div>
<div class="section" id="how-to-escape-in-regexes">
<h2>How to escape ? in regexes?<a class="headerlink" href="#how-to-escape-in-regexes" title="Permalink to this headline"></a></h2>
<p>One way is to use square brackets; an alternative is to use simple string-substitution rather than a regular expression.</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ cat data/question.dat
</span> a=is it?,b=it is!
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --oxtab put &#39;$c = gsub($a, &quot;[?]&quot;,&quot; ...&quot;)&#39; data/question.dat
</span> a is it?
b it is!
c is it ...
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --oxtab put &#39;$c = ssub($a, &quot;?&quot;,&quot; ...&quot;)&#39; data/question.dat
</span> a is it?
b it is!
c is it ...
</pre></div>
</div>
<p>The <code class="docutils literal notranslate"><span class="pre">ssub</span></code> function exists precisely for this reason: so you dont have to escape anything.</p>
</div>
<div class="section" id="how-can-i-put-single-quotes-into-strings">
<h2>How can I put single-quotes into strings?<a class="headerlink" href="#how-can-i-put-single-quotes-into-strings" title="Permalink to this headline"></a></h2>
<p>This is a little tricky due to the shells handling of quotes. For simplicity, lets first put an update script into a file:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>$a = &quot;It&#39;s OK, I said, then &#39;for now&#39;.&quot;
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ echo a=bcd | mlr put -f data/single-quote-example.mlr
</span> a=It&#39;s OK, I said, then &#39;for now&#39;.
</pre></div>
</div>
<p>So, its simple: Millers DSL uses double quotes for strings, and you can put single quotes (or backslash-escaped double-quotes) inside strings, no problem.</p>
<p>Without putting the update expression in a file, its messier:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ echo a=bcd | mlr put &#39;$a=&quot;It&#39;\&#39;&#39;s OK, I said, &#39;\&#39;&#39;for now&#39;\&#39;&#39;.&quot;&#39;
</span> a=It&#39;s OK, I said, &#39;for now&#39;.
</pre></div>
</div>
<p>The idea is that the outermost single-quotes are to protect the <code class="docutils literal notranslate"><span class="pre">put</span></code> expression from the shell, and the double quotes within them are for Miller. To get a single quote in the middle there, you need to actually put it <em>outside</em> the single-quoting for the shell. The pieces are the following, all concatenated together:</p>
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">$a=&quot;It</span></code></p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">\'</span></code></p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">s</span> <span class="pre">OK,</span> <span class="pre">I</span> <span class="pre">said,</span></code></p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">\'</span></code></p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">for</span> <span class="pre">now</span></code></p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">\'</span></code></p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">.</span></code></p></li>
</ul>
</div>
<div class="section" id="nr-is-not-consecutive-after-then-chaining">
<h2>NR is not consecutive after then-chaining<a class="headerlink" href="#nr-is-not-consecutive-after-then-chaining" title="Permalink to this headline"></a></h2>
<p>Given this input data:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ cat data/small
</span> a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre></div>
</div>
<p>why dont I see <code class="docutils literal notranslate"><span class="pre">NR=1</span></code> and <code class="docutils literal notranslate"><span class="pre">NR=2</span></code> here??</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr filter &#39;$x &gt; 0.5&#39; then put &#39;$NR = NR&#39; data/small
</span> a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,NR=2
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,NR=5
</pre></div>
</div>
<p>The reason is that <code class="docutils literal notranslate"><span class="pre">NR</span></code> is computed for the original input records and isnt dynamically updated. By contrast, <code class="docutils literal notranslate"><span class="pre">NF</span></code> is dynamically updated: its the number of fields in the current record, and if you add/remove a field, the value of <code class="docutils literal notranslate"><span class="pre">NF</span></code> will change:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ echo x=1,y=2,z=3 | mlr put &#39;$nf1 = NF; $u = 4; $nf2 = NF; unset $x,$y,$z; $nf3 = NF&#39;
</span> nf1=3,u=4,nf2=5,nf3=3
</pre></div>
</div>
<p><code class="docutils literal notranslate"><span class="pre">NR</span></code>, by contrast (and <code class="docutils literal notranslate"><span class="pre">FNR</span></code> as well), retains the value from the original input stream, and records may be dropped by a <code class="docutils literal notranslate"><span class="pre">filter</span></code> within a <code class="docutils literal notranslate"><span class="pre">then</span></code>-chain. To recover consecutive record numbers, you can use out-of-stream variables as follows:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --opprint --from data/small put &#39;
</span> begin{ @nr1 = 0 }
@nr1 += 1;
$nr1 = @nr1
&#39; \
then filter &#39;$x&gt;0.5&#39; \
then put &#39;
begin{ @nr2 = 0 }
@nr2 += 1;
$nr2 = @nr2
&#39;
a b i x y nr1 nr2
eks pan 2 0.7586799647899636 0.5221511083334797 2 1
wye pan 5 0.5732889198020006 0.8636244699032729 5 2
</pre></div>
</div>
<p>Or, simply use <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">cat</span> <span class="pre">-n</span></code>:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr filter &#39;$x &gt; 0.5&#39; then cat -n data/small
</span> n=1,a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
n=2,a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre></div>
</div>
</div>
<div class="section" id="why-am-i-not-seeing-all-possible-joins-occur">
<h2>Why am I not seeing all possible joins occur?<a class="headerlink" href="#why-am-i-not-seeing-all-possible-joins-occur" title="Permalink to this headline"></a></h2>
<p><strong>This section describes behavior before Miller 5.1.0. As of 5.1.0, -u is the default.</strong></p>
<p>For example, the right file here has nine records, and the left file should add in the <code class="docutils literal notranslate"><span class="pre">hostname</span></code> column so the join output should also have 9 records:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsvlite --opprint cat data/join-u-left.csv
</span> hostname ipaddr
nadir.east.our.org 10.3.1.18
zenith.west.our.org 10.3.1.27
apoapsis.east.our.org 10.4.5.94
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsvlite --opprint cat data/join-u-right.csv
</span> ipaddr timestamp bytes
10.3.1.27 1448762579 4568
10.3.1.18 1448762578 8729
10.4.5.94 1448762579 17445
10.3.1.27 1448762589 12
10.3.1.18 1448762588 44558
10.4.5.94 1448762589 8899
10.3.1.27 1448762599 0
10.3.1.18 1448762598 73425
10.4.5.94 1448762599 12200
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsvlite --opprint join -s -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
</span> ipaddr hostname timestamp bytes
10.3.1.27 zenith.west.our.org 1448762579 4568
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.4.5.94 apoapsis.east.our.org 1448762599 12200
</pre></div>
</div>
<p>The issue is that Millers <code class="docutils literal notranslate"><span class="pre">join</span></code>, by default (before 5.1.0), took input sorted (lexically ascending) by the sort keys on both the left and right files. This design decision was made intentionally to parallel the Unix/Linux system <code class="docutils literal notranslate"><span class="pre">join</span></code> command, which has the same semantics. The benefit of this default is that the joiner program can stream through the left and right files, needing to load neither entirely into memory. The drawback, of course, is that is requires sorted input.</p>
<p>The solution (besides pre-sorting the input files on the join keys) is to simply use <strong>mlr join -u</strong> (which is now the default). This loads the left file entirely into memory (while the right file is still streamed one line at a time) and does all possible joins without requiring sorted input:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
</span> ipaddr hostname timestamp bytes
10.3.1.27 zenith.west.our.org 1448762579 4568
10.3.1.18 nadir.east.our.org 1448762578 8729
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.3.1.27 zenith.west.our.org 1448762589 12
10.3.1.18 nadir.east.our.org 1448762588 44558
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.3.1.27 zenith.west.our.org 1448762599 0
10.3.1.18 nadir.east.our.org 1448762598 73425
10.4.5.94 apoapsis.east.our.org 1448762599 12200
</pre></div>
</div>
<p>General advice is to make sure the left-file is relatively small, e.g. containing name-to-number mappings, while saving large amounts of data for the right file.</p>
</div>
<div class="section" id="how-to-rectangularize-after-joins-with-unpaired">
<h2>How to rectangularize after joins with unpaired?<a class="headerlink" href="#how-to-rectangularize-after-joins-with-unpaired" title="Permalink to this headline"></a></h2>
<p>Suppose you have the following two data files:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>id,code
3,0000ff
2,00ff00
4,ff0000
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>id,color
4,red
2,green
</pre></div>
</div>
<p>Joining on color the results are as expected:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --csv join -j id -f data/color-codes.csv data/color-names.csv
</span> id,code,color
4,ff0000,red
2,00ff00,green
</pre></div>
</div>
<p>However, if we ask for left-unpaireds, since theres no <code class="docutils literal notranslate"><span class="pre">color</span></code> column, we get a row not having the same column names as the other:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --csv join --ul -j id -f data/color-codes.csv data/color-names.csv
</span> id,code,color
4,ff0000,red
2,00ff00,green
id,code
3,0000ff
</pre></div>
</div>
<p>To fix this, we can use <strong>unsparsify</strong>:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> $ mlr --csv join --ul -j id -f data/color-codes.csv then unsparsify --fill-with &quot;&quot; data/color-names.csv
</span> id,code,color
4,ff0000,red
2,00ff00,green
3,0000ff,
</pre></div>
</div>
<p>Thanks to &#64;aborruso for the tip!</p>
</div>
</div>
<div class="clearer"></div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<h3><a href="index.html">Table of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">FAQ</a><ul>
<li><a class="reference internal" href="#how-do-i-suppress-numeric-conversion">How do I suppress numeric conversion?</a></li>
<li><a class="reference internal" href="#how-do-i-examine-then-chaining">How do I examine then-chaining?</a></li>
<li><a class="reference internal" href="#how-can-i-filter-by-date">How can I filter by date?</a></li>
<li><a class="reference internal" href="#how-can-i-handle-commas-as-data-in-various-formats">How can I handle commas-as-data in various formats?</a></li>
<li><a class="reference internal" href="#how-can-i-handle-field-names-with-special-symbols-in-them">How can I handle field names with special symbols in them?</a></li>
<li><a class="reference internal" href="#how-to-escape-in-regexes">How to escape ? in regexes?</a></li>
<li><a class="reference internal" href="#how-can-i-put-single-quotes-into-strings">How can I put single-quotes into strings?</a></li>
<li><a class="reference internal" href="#nr-is-not-consecutive-after-then-chaining">NR is not consecutive after then-chaining</a></li>
<li><a class="reference internal" href="#why-am-i-not-seeing-all-possible-joins-occur">Why am I not seeing all possible joins occur?</a></li>
<li><a class="reference internal" href="#how-to-rectangularize-after-joins-with-unpaired">How to rectangularize after joins with unpaired?</a></li>
</ul>
</li>
</ul>
<h4>Previous topic</h4>
<p class="topless"><a href="joins.html"
title="previous chapter">Joins</a></p>
<h4>Next topic</h4>
<p class="topless"><a href="cookbook.html"
title="next chapter">Cookbook part 1: common patterns</a></p>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="_sources/faq.rst.txt"
rel="nofollow">Show Source</a></li>
</ul>
</div>
<div id="searchbox" style="display: none" role="search">
<h3 id="searchlabel">Quick search</h3>
<div class="searchformwrapper">
<form class="search" action="search.html" method="get">
<input type="text" name="q" aria-labelledby="searchlabel" />
<input type="submit" value="Go" />
</form>
</div>
</div>
<script>$('#searchbox').show(0);</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="cookbook.html" title="Cookbook part 1: common patterns"
>next</a> |</li>
<li class="right" >
<a href="joins.html" title="Joins"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">Miller 6.0.0-alpha documentation</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">FAQ</a></li>
</ul>
</div>
<div class="footer" role="contentinfo">
&#169; Copyright 2020, John Kerl.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.2.1.
</div>
</body>
</html>