miller/doc/faq.html
2017-05-08 18:42:16 -04:00


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<!-- PAGE GENERATED FROM template.html and content-for-faq.html BY poki. -->
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
<meta name="description" content="Miller documentation"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
<meta name="keywords"
content="John Kerl, Kerl, Miller, miller, mlr, OLAP, data analysis software, regression, correlation, variance, data tools, " />
<title> FAQ </title>
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
</head>
<!-- ================================================================ -->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-15651652-1");
pageTracker._trackPageview();
} catch(err) {}
</script>
<!-- ================================================================ -->
<script type="text/javascript">
function toggle_div(div) {
if (div != null) {
if (div.id.startsWith("section_toggle_")) {
var state = div.style.display;
if (state == "block") {
div.style.display = "none";
} else {
div.style.display = "block";
}
}
}
}
function expand_div(div) {
if (div != null) {
if (div.id.startsWith("section_toggle_")) {
div.style.display = "block";
}
}
}
function collapse_div(div) {
if (div != null) {
if (div.id.startsWith("section_toggle_")) {
div.style.display = "none";
}
}
}
function toggle_by_name(divName) {
toggle_div(document.getElementById(divName));
}
function expand_by_name(divName) {
expand_div(document.getElementById(divName));
}
function collapse_by_name(divName) {
collapse_div(document.getElementById(divName));
}
function expand_all() {
var divs = document.getElementsByTagName("div");
for(var i = 0; i < divs.length; i++) {
expand_div(divs[i]);
}
}
function collapse_all() {
var divs = document.getElementsByTagName("div");
for(var i = 0; i < divs.length; i++){
collapse_div(divs[i]);
}
}
</script>
<!--
The background image is from a screenshot of a Google search for "data analysis
tools", lightened and sepia-toned. Over this was placed a Mac Terminal app with
very light-grey font and translucent background, in which a few statistical
Miller commands were run with pretty-print-tabular output format.
<body background="pix/sepia-overlay.jpg">
-->
<body bgcolor="#ffffff">
<!-- ================================================================ -->
<table width="100%">
<tr>
<!-- navbar -->
<td width="15%">
<!--
<img src="pix/mlr.jpg" />
<img style="border-width:1px; color:black;" src="pix/mlr.jpg" />
-->
<div class="pokinav">
<center><titleinbody>Miller</titleinbody></center>
<!-- PAGE LIST GENERATED FROM template.html BY poki -->
<br/><b>Overview:</b>
<br/>&bull;&nbsp;<a href="index.html">About Miller</a>
<br/>&bull;&nbsp;<a href="10-min.html">Miller in 10 minutes</a>
<br/>&bull;&nbsp;<a href="file-formats.html">File formats</a>
<br/>&bull;&nbsp;<a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a>
<br/>&bull;&nbsp;<a href="record-heterogeneity.html">Record-heterogeneity</a>
<br/>&bull;&nbsp;<a href="internationalization.html">Internationalization</a>
<br/><b>Using Miller:</b>
<br/>&bull;&nbsp;<a href="faq.html"><b>FAQ</b></a>
<br/>&bull;&nbsp;<a href="cookbook.html">Cookbook part 1</a>
<br/>&bull;&nbsp;<a href="cookbook2.html">Cookbook part 2</a>
<br/>&bull;&nbsp;<a href="cookbook3.html">Cookbook part 3</a>
<br/>&bull;&nbsp;<a href="data-examples.html">Data-diving examples</a>
<br/>&bull;&nbsp;<a href="manpage.html">Manpage</a>
<br/>&bull;&nbsp;<a href="reference.html">Reference</a>
<br/>&bull;&nbsp;<a href="reference-verbs.html">Reference: Verbs</a>
<br/>&bull;&nbsp;<a href="reference-dsl.html">Reference: DSL</a>
<br/>&bull;&nbsp;<a href="release-docs.html">Documents by release</a>
<br/>&bull;&nbsp;<a href="build.html">Installation, portability, dependencies, and testing</a>
<br/><b>Background:</b>
<br/>&bull;&nbsp;<a href="why.html">Why?</a>
<br/>&bull;&nbsp;<a href="whyc.html">Why C?</a>
<br/>&bull;&nbsp;<a href="etymology.html">Why call it Miller?</a>
<br/>&bull;&nbsp;<a href="originality.html">How original is Miller?</a>
<br/>&bull;&nbsp;<a href="performance.html">Performance</a>
<br/><b>Repository:</b>
<br/>&bull;&nbsp;<a href="to-do.html">Things to do</a>
<br/>&bull;&nbsp;<a href="contact.html">Contact information</a>
<br/>&bull;&nbsp;<a href="https://github.com/johnkerl/miller">GitHub repo</a>
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
<br/> <br/> <br/> <br/> <br/> <br/>
</div>
</td>
<!-- page body -->
<td>
<!--
This is a visually gorgeous feature (here & in the CSS): it allows for
independent scroll of the nav and body panels. In particular the nav
stays on-screen as you scroll the body.
However, two problems:
(1) In Firefox & Chrome both I get janky end-of-body scrolls: there is
more content but I can't scroll down to it unless I repeatedly retry the
scrolldown. Which is weird.
(2) Worse, only the first page renders in PDF (again, Firefox & Chrome).
For now I'm disabling this separate-scroll feature. A frontender, I am
not ... maybe someday I'll find a config which gets *all* the features
I want; for now, it's a tradeoff.
-->
<!-- Implementation details: one bit is right here:
div style="overflow-y:scroll;height:1500px"
and the other bit is in css/poki-callbacks.css:
.pokinav {
display: inline-block;
background: #e8d9bc;
border: 1;
box-shadow: 0px 0px 3px 3px #C9C9C9;
margin: 10px;
padding-top: 10px;
padding-bottom: 10px;
padding-left: 10px;
padding-right: 10px;
overflow-y: scroll; < - - - - - - here
height: 1500px;
}
-->
<div>
<center> <titleinbody> FAQ </titleinbody> </center>
<p/>
<!-- BODY COPIED FROM content-for-faq.html BY poki -->
<div class="pokitoc">
<center><b>Contents:</b></center>
&bull;&nbsp;<a href="#No_output_at_all">No output at all</a><br/>
&bull;&nbsp;<a href="#Fields_not_selected">Fields not selected</a><br/>
&bull;&nbsp;<a href="#Diagnosing_delimiter_specifications">Diagnosing delimiter specifications</a><br/>
&bull;&nbsp;<a href="#How_do_I_examine_then-chaining?">How do I examine then-chaining?</a><br/>
&bull;&nbsp;<a href="#I_assigned_$9_and_it&rsquo;s_not_9th">I assigned $9 and it&rsquo;s not 9th</a><br/>
&bull;&nbsp;<a href="#How_can_I_handle_field_names_with_special_symbols_in_them?">How can I handle field names with special symbols in them?</a><br/>
&bull;&nbsp;<a href="#How_can_I_put_single-quotes_into_strings?">How can I put single-quotes into strings?</a><br/>
&bull;&nbsp;<a href="#Why_doesn&rsquo;t_mlr_cut_put_fields_in_the_order_I_want?">Why doesn&rsquo;t mlr cut put fields in the order I want?</a><br/>
&bull;&nbsp;<a href="#NR_is_not_consecutive_after_then-chaining">NR is not consecutive after then-chaining</a><br/>
&bull;&nbsp;<a href="#Why_am_I_not_seeing_all_possible_joins_occur?">Why am I not seeing all possible joins occur?</a><br/>
&bull;&nbsp;<a href="#What_about_XML_or_JSON_file_formats?">What about XML or JSON file formats?</a><br/>
</div>
<p/>
<p/>
<button style="font-weight:bold;color:maroon;border:0" onclick="expand_all();" href="javascript:;">Expand all sections</button>
<button style="font-weight:bold;color:maroon;border:0" onclick="collapse_all();" href="javascript:;">Collapse all sections</button>
<a id="No_output_at_all"/><h1>No output at all</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_no_output_at_all');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_no_output_at_all" style="display: block">
<p/>Try <tt>od -xcv</tt> and/or <tt>cat -e</tt> on your file to check for non-printable characters.
<p/>If you&rsquo;re using a Miller version older than 5.0.0, the release in
which the line-ending-autodetect feature was introduced (try
<tt>mlr --version</tt> on your system to find out), please see the
<a href="http://johnkerl.org/miller-releases/miller-4.5.0/doc/index.html">Miller 4.5.0 documentation</a>.
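<p/>As a minimal sketch of the <tt>cat -e</tt> check, the snippet below feeds a
made-up two-line CSV with CR/LF line endings through <tt>cat -e</tt>. The sample
data and the variable name <tt>crlf_view</tt> are illustrative assumptions, not
part of Miller. With GNU or BSD <tt>cat</tt>, each carriage return should show
up as <tt>^M</tt> just before the end-of-line <tt>$</tt> marker.

```shell
# Made-up sample data with Windows (CR/LF) line endings; no real file is needed.
# cat -e marks each end of line with "$" and shows non-printing characters,
# so a stray carriage return appears as "^M" right before the "$".
crlf_view=$(printf 'a,b\r\n1,2\r\n' | cat -e)
printf '%s\n' "$crlf_view"
```

<p/>If <tt>^M</tt> appears on your own file, the line endings are a likely
suspect.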
</div>
<a id="Fields_not_selected"/><h1>Fields not selected</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_fields_not_selected');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_fields_not_selected" style="display: block">
<p/>Check the field separators of the data, e.g. with the command-line
<tt>head</tt> program. Example: for CSV, Miller&rsquo;s default field
separator is comma; if your data is tab-delimited, e.g. <tt>aTABbTABc</tt>,
then Miller won&rsquo;t find three fields named <tt>a</tt>, <tt>b</tt>, and
<tt>c</tt> but rather just one named <tt>aTABbTABc</tt>. Solution in this
case: <tt>mlr --fs tab {remaining arguments ...}</tt>.
<p/>Also try <tt>od -xcv</tt> and/or <tt>cat -e</tt> on your file to check for non-printable characters.
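<p/>One rough way to guess the separator is to count candidate characters in
the header line. The sketch below is only an illustration: the helper name
<tt>count_sep</tt> and the sample header are assumptions, and in practice you
would take the header from <tt>head -n 1</tt> on your own file.

```shell
# Hypothetical helper: count how often a one-character separator occurs in a line.
count_sep() {
  # tr -cd deletes every character except the separator; wc -c counts what is left.
  printf '%s' "$1" | tr -cd "$2" | wc -c | tr -d '[:space:]'
}

# Sample header line, assumed to be semicolon-delimited.
header='KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR'
tab=$(printf '\t')

for sep in ',' ';' "$tab"; do
  printf 'separator [%s] occurs %s times\n' "$sep" "$(count_sep "$header" "$sep")"
done
```

<p/>The candidate with a nonzero count that is the same on every line is
usually the one to pass to <tt>--fs</tt>.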
</div>
<a id="Diagnosing_delimiter_specifications"/><h1>Diagnosing delimiter specifications</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_diagnosing_delimiter_specifications');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_diagnosing_delimiter_specifications" style="display: block">
<p/>
<div class="pokipanel">
<pre>
# Use the `file` command to see if there are CR/LF terminators (in this case,
# there are not):
$ file data/colours.csv
data/colours.csv: UTF-8 Unicode text
# Look at the file to find names of fields
$ cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR
masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
# Extract a few fields:
$ mlr --csv cut -f KEY,PL,RO data/colours.csv
(only blank lines appear)
# Use XTAB output format to get a sharper picture of where records/fields
# are being split:
$ mlr --icsv --oxtab cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
# Using XTAB output format makes it clearer that KEY;DE;...;RO;TR is being
# treated as a single field name in the CSV header, and likewise each
# subsequent line is being treated as a single field value. This is because
# the default field separator is a comma but we have semicolons here.
# Use XTAB again with a different input field separator (--ifs semicolon):
$ mlr --icsv --ifs semicolon --oxtab cat data/colours.csv
KEY masterdata_colourcode_1
DE Weiß
EN White
ES Blanco
FI Valkoinen
FR Blanc
IT Bianco
NL Wit
PL Biały
RO Alb
TR Beyaz
KEY masterdata_colourcode_2
DE Schwarz
EN Black
ES Negro
FI Musta
FR Noir
IT Nero
NL Zwart
PL Czarny
RO Negru
TR Siyah
# Using the new field separator for both input and output (--fs semicolon), retry the cut:
$ mlr --csv --fs semicolon cut -f KEY,PL,RO data/colours.csv
KEY;PL;RO
masterdata_colourcode_1;Biały;Alb
masterdata_colourcode_2;Czarny;Negru
</pre>
</div>
<p/>
</div>
<a id="How_do_I_examine_then-chaining?"/><h1>How do I examine then-chaining?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_examine_then_chaining');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_examine_then_chaining" style="display: block">
<p/>Then-chaining in Miller is intended to work the same way as Unix pipes,
but with less typing. You can run your command one pipeline step at a time, to
see the intermediate output of each step, which becomes the input to the next
step.
<p/>First, look at the input data:
<p/>
<div class="pokipanel">
<pre>
$ cat data/then-example.csv
Status,Payment_Type,Amount
paid,cash,10.00
pending,debit,20.00
paid,cash,50.00
pending,credit,40.00
paid,debit,30.00
</pre>
</div>
<p/>
Next, run the first step of your command, omitting anything from the first <tt>then</tt> onward:
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsv --opprint count-distinct -f Status,Payment_Type data/then-example.csv
Status Payment_Type count
paid cash 2
pending debit 1
pending credit 1
paid debit 1
</pre>
</div>
<p/>
After that, run it with the next <tt>then</tt> step included:
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsv --opprint count-distinct -f Status,Payment_Type then sort -nr count data/then-example.csv
Status Payment_Type count
paid cash 2
pending debit 1
pending credit 1
paid debit 1
</pre>
</div>
<p/>
Now if you use <tt>then</tt> to include another verb after that, the columns
<tt>Status</tt>, <tt>Payment_Type</tt>, and <tt>count</tt> will be the input to
that verb.
<p/>Note, by the way, that you&rsquo;ll get the same results using pipes:
<p/>
<div class="pokipanel">
<pre>
$ mlr --csv count-distinct -f Status,Payment_Type data/then-example.csv | mlr --icsv --opprint sort -nr count
Status Payment_Type count
paid cash 2
pending debit 1
pending credit 1
paid debit 1
</pre>
</div>
<p/>
</div>
<a id="I_assigned_$9_and_it&rsquo;s_not_9th"/><h1>I assigned $9 and it&rsquo;s not 9th</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_9_not_9th');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_9_not_9th" style="display: block">
<p/> Miller records are ordered lists of key-value pairs. For NIDX format, DKVP
format when keys are missing, or CSV/CSV-lite format with
<tt>--implicit-csv-header</tt>, Miller will sequentially assign keys of the
form <tt>1</tt>, <tt>2</tt>, etc. But these are not integer array indices:
they&rsquo;re just field names taken from the initial field ordering in the
input data.
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --dkvp cat
1=x,2=y,3=z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --dkvp put '$6="a";$4="b";$55="cde"'
1=x,2=y,3=z,6=a,4=b,55=cde
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --nidx cat
x,y,z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --csv --implicit-csv-header cat
1,2,3
x,y,z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --dkvp rename 2,999
1=x,999=y,3=z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --dkvp rename 2,newname
1=x,newname=y,3=z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --csv --implicit-csv-header reorder -f 3,1,2
3,1,2
z,x,y
</pre>
</div>
<p/>
</div>
<a id="How_can_I_handle_field_names_with_special_symbols_in_them?"/><h1>How can I handle field names with special symbols in them?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_field_names_with_special_symbols');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_field_names_with_special_symbols" style="display: block">
<p/>Simply surround the field names with curly braces:
<p/>
<div class="pokipanel">
<pre>
$ echo 'x.a=3,y:b=4,z/c=5' | mlr put '${product.all} = ${x.a} * ${y:b} * ${z/c}'
x.a=3,y:b=4,z/c=5,product.all=60
</pre>
</div>
<p/>
</div>
<a id="How_can_I_put_single-quotes_into_strings?"/><h1>How can I put single-quotes into strings?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_single_quotes_in_strings');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_single_quotes_in_strings" style="display: block">
<p/> This is a little tricky due to the shell&rsquo;s handling of quotes. For simplicity, let&rsquo;s first put
an update script into a file, <tt>data/single-quote-example.mlr</tt>:
<p/>
<div class="pokipanel">
<pre>
$a = "It's OK, I said, then 'for now'."
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo a=bcd | mlr put -f data/single-quote-example.mlr
a=It's OK, I said, then 'for now'.
</pre>
</div>
<p/>
<p/>So, it&rsquo;s simple: Miller&rsquo;s DSL uses double quotes for strings,
and you can put single quotes (or backslash-escaped double-quotes) inside
strings, no problem.
<p/> Without putting the update expression in a file, it&rsquo;s messier:
<p/>
<div class="pokipanel">
<pre>
$ echo a=bcd | mlr put '$a="It'\''s OK, I said, '\''for now'\''."'
a=It's OK, I said, 'for now'.
</pre>
</div>
<p/>
<p/> The idea is that the outermost single quotes protect the
<tt>put</tt> expression from the shell, and the double quotes within them are
for Miller. To get a single quote in the middle, you need to put it <i>outside</i> the shell&rsquo;s
single-quoting: close the single-quoted string, write a backslash-escaped single quote, then reopen
the single-quoted string. The pieces are
<ul>
<li/> <tt>$a="It</tt>
<li/> <tt>\'</tt>
<li/> <tt>s OK, I said,</tt>
<li/> <tt>\'</tt>
<li/> <tt>for now</tt>
<li/> <tt>\'</tt>
<li/> <tt>."</tt>
</ul>
all concatenated together.
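<p/>To check what the shell actually hands to Miller, you can print the quoted
expression without running <tt>mlr</tt> at all. This is a small sketch; the
variable name <tt>put_expr</tt> is just an illustrative choice.

```shell
# Same quoting as in the mlr put example: each '\'' closes the single-quoted
# string, adds one literal single quote, and reopens the single-quoted string.
put_expr='$a="It'\''s OK, I said, '\''for now'\''."'

# Print the expression exactly as a program would receive it as an argument.
printf '%s\n' "$put_expr"
```

<p/>The printed text is the DSL expression with ordinary single quotes inside
the double-quoted string.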
</div>
<a id="Why_doesn&rsquo;t_mlr_cut_put_fields_in_the_order_I_want?"/><h1>Why doesn&rsquo;t mlr cut put fields in the order I want?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_cut_out_of_order');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_cut_out_of_order" style="display: block">
<p/>Example: columns <tt>x,i,a</tt> were requested but they appear here in the order <tt>a,i,x</tt>:
<p/>
<div class="pokipanel">
<pre>
$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr cut -f x,i,a data/small
a=pan,i=1,x=0.3467901443380824
a=eks,i=2,x=0.7586799647899636
a=wye,i=3,x=0.20460330576630303
a=eks,i=4,x=0.38139939387114097
a=wye,i=5,x=0.5732889198020006
</pre>
</div>
<p/>
<p/>The issue is that Miller&rsquo;s <tt>cut</tt>, by default, outputs cut fields in the order they
appear in the input data. This design decision was made intentionally to parallel the *nix system <tt>cut</tt>
command, which has the same semantics.
<p/>The solution is to use the <tt>-o</tt> option:
<p/>
<div class="pokipanel">
<pre>
$ mlr cut -o -f x,i,a data/small
x=0.3467901443380824,i=1,a=pan
x=0.7586799647899636,i=2,a=eks
x=0.20460330576630303,i=3,a=wye
x=0.38139939387114097,i=4,a=eks
x=0.5732889198020006,i=5,a=wye
</pre>
</div>
<p/>
</div>
<a id="NR_is_not_consecutive_after_then-chaining"/><h1>NR is not consecutive after then-chaining</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_NR_not_consecutive_after_then');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_NR_not_consecutive_after_then" style="display: block">
<p/> Given this input data:
<p/>
<div class="pokipanel">
<pre>
$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre>
</div>
<p/>
why don&rsquo;t I see <tt>NR=1</tt> and <tt>NR=2</tt> here?
<p/>
<div class="pokipanel">
<pre>
$ mlr filter '$x &gt; 0.5' then put '$NR = NR' data/small
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,NR=2
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,NR=5
</pre>
</div>
<p/>
<p/>The reason is that <tt>NR</tt> is computed for the original input records and isn&rsquo;t dynamically
updated. By contrast, <tt>NF</tt> is dynamically updated: it&rsquo;s the number of fields in the
current record, and if you add/remove a field, the value of <tt>NF</tt> will change:
<p/>
<div class="pokipanel">
<pre>
$ echo x=1,y=2,z=3 | mlr put '$nf1 = NF; $u = 4; $nf2 = NF; unset $x,$y,$z; $nf3 = NF'
nf1=3,u=4,nf2=5,nf3=3
</pre>
</div>
<p/>
<p/><tt>NR</tt>, by contrast (and <tt>FNR</tt> as well), retains the value from the original input stream,
so when a <tt>filter</tt> within a <tt>then</tt>-chain drops records, the surviving records keep their
original, non-consecutive numbers. To recover consecutive record
numbers, you can use out-of-stream variables as follows:
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint --from data/small put '
begin{ @nr1 = 0 }
@nr1 += 1;
$nr1 = @nr1
' \
then filter '$x&gt;0.5' \
then put '
begin{ @nr2 = 0 }
@nr2 += 1;
$nr2 = @nr2
'
a b i x y nr1 nr2
eks pan 2 0.7586799647899636 0.5221511083334797 2 1
wye pan 5 0.5732889198020006 0.8636244699032729 5 2
</pre>
</div>
<p/>
<p/>Or, simply use <tt>mlr cat -n</tt>:
<p/>
<div class="pokipanel">
<pre>
$ mlr filter '$x &gt; 0.5' then cat -n data/small
n=1,a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
n=2,a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre>
</div>
<p/>
</div>
<a id="Why_am_I_not_seeing_all_possible_joins_occur?"/><h1>Why am I not seeing all possible joins occur?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_not_all_possible_joins');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_not_all_possible_joins" style="display: block">
<p/><b>This section describes behavior before Miller 5.1.0. As of 5.1.0, <tt>-u</tt> is the default.</b>
<p/>For example, the right file here has nine records, and the left file should
add in the <tt>hostname</tt> column &mdash; so the join output should also have
nine records:
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint cat data/join-u-left.csv
hostname ipaddr
nadir.east.our.org 10.3.1.18
zenith.west.our.org 10.3.1.27
apoapsis.east.our.org 10.4.5.94
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint cat data/join-u-right.csv
ipaddr timestamp bytes
10.3.1.27 1448762579 4568
10.3.1.18 1448762578 8729
10.4.5.94 1448762579 17445
10.3.1.27 1448762589 12
10.3.1.18 1448762588 44558
10.4.5.94 1448762589 8899
10.3.1.27 1448762599 0
10.3.1.18 1448762598 73425
10.4.5.94 1448762599 12200
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint join -s -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr hostname timestamp bytes
10.3.1.27 zenith.west.our.org 1448762579 4568
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.4.5.94 apoapsis.east.our.org 1448762599 12200
</pre>
</div>
<p/>
<p/>The issue is that Miller&rsquo;s <tt>join</tt>, by default (before 5.1.0),
expected input sorted (lexically ascending) by the join keys in both the left and
right files. This design decision was made intentionally to parallel the *nix
system <tt>join</tt> command, which has the same semantics. The benefit of this
default is that the joiner program can stream through the left and right files,
needing to load neither entirely into memory. The drawback, of course, is that
it requires sorted input.
<p/>The solution (besides pre-sorting the input files on the join keys) is to
simply use <b>mlr join -u</b> (which is now the default). This loads the left
file entirely into memory (while the right file is still streamed one line at a
time) and does all possible joins without requiring sorted input:
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr hostname timestamp bytes
10.3.1.27 zenith.west.our.org 1448762579 4568
10.3.1.18 nadir.east.our.org 1448762578 8729
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.3.1.27 zenith.west.our.org 1448762589 12
10.3.1.18 nadir.east.our.org 1448762588 44558
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.3.1.27 zenith.west.our.org 1448762599 0
10.3.1.18 nadir.east.our.org 1448762598 73425
10.4.5.94 apoapsis.east.our.org 1448762599 12200
</pre>
</div>
<p/>
<p/>General advice is to make sure the left-file is relatively small, e.g.
containing name-to-number mappings, while saving large amounts of data for the
right file.
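<p/>For comparison, here is a rough sketch of the pre-sorting approach using
the system <tt>sort</tt> and <tt>join</tt> commands rather than Miller. The
scratch files, their space-separated layout, and the variable names are
assumptions made up for this illustration.

```shell
# Hypothetical scratch files with space-separated fields; the join key is field 1.
tmpdir=$(mktemp -d)
printf '%s\n' '10.3.1.18 nadir' '10.3.1.27 zenith' '10.4.5.94 apoapsis' > "$tmpdir/left.txt"
printf '%s\n' '10.3.1.27 4568' '10.3.1.18 8729' '10.4.5.94 17445' '10.3.1.27 12' > "$tmpdir/right.txt"

# Sort both sides lexically on the join field, using the same locale for sort and join.
LC_ALL=C sort -k 1,1 "$tmpdir/left.txt" > "$tmpdir/left.sorted"
LC_ALL=C sort -k 1,1 "$tmpdir/right.txt" > "$tmpdir/right.sorted"

# With sorted input, the system join can stream through both files and pair every match.
joined=$(LC_ALL=C join "$tmpdir/left.sorted" "$tmpdir/right.sorted")
printf '%s\n' "$joined"

rm -r "$tmpdir"
```

<p/>Each of the four right-file lines finds its left-file partner once both
inputs are sorted on the key.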
</div>
<a id="What_about_XML_or_JSON_file_formats?"/><h1>What about XML or JSON file formats?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_xml_or_json');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_xml_or_json" style="display: block">
<p/>Miller handles <boldmaroon>tabular data</boldmaroon>, which is a list of
records each having fields which are key-value pairs. Miller also doesn&rsquo;t
require that each record have the same field names (see also <a
href="record-heterogeneity.html">here</a>). Regardless, tabular data is a
<boldmaroon>non-recursive data structure</boldmaroon>.
<p/> XML, JSON, etc. are, by contrast, all <boldmaroon>recursive</boldmaroon>
or <boldmaroon>nested</boldmaroon> data structures. For example, in JSON
you can represent a hash map whose values are lists of lists.
<p/>Now, you can put tabular data into these formats &mdash; since list-of-key-value-pairs
is one of the things representable in XML or JSON. Example:
<p/>
<div class="pokipanel">
<pre>
# DKVP
x=1,y=2
z=3
# XML
&lt;table&gt;
&lt;record&gt;
&lt;field&gt;
&lt;key&gt; x &lt;/key&gt; &lt;value&gt; 1 &lt;/value&gt;
&lt;/field&gt;
&lt;field&gt;
&lt;key&gt; y &lt;/key&gt; &lt;value&gt; 2 &lt;/value&gt;
&lt;/field&gt;
&lt;/record&gt;
&lt;record&gt;
&lt;field&gt;
&lt;key&gt; z &lt;/key&gt; &lt;value&gt; 3 &lt;/value&gt;
&lt;/field&gt;
&lt;/record&gt;
&lt;/table&gt;
# JSON
[{"x":1,"y":2},{"z":3}]
</pre>
</div>
<p/>However, a tool like Miller which handles non-recursive data is never going
to be able to handle full XML/JSON semantics &mdash; only a small subset. If
tabular data represented in XML/JSON/etc are sufficiently well-structured, it
may be easy to grep/sed out the data into a simpler text form &mdash; this is a
general text-processing problem.
<p/>Miller does support tabular data represented in JSON: please see
<a href="file-formats.html">File formats</a>. See also <a
href="http://stedolan.github.io/jq/">jq</a> for a truly powerful, JSON-specific
tool.
<p/>For XML, my suggestion is to use a tool like
<a href="http://ff-extractor.sourceforge.net/">ff-extractor</a> to do format
conversion.
</div>
</div>
</td>
</table>
</body>
</html>