mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 18:25:45 +00:00
807 lines
27 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">

<!-- PAGE GENERATED FROM template.html and content-for-faq.html BY poki. -->
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
<meta name="description" content="Miller documentation"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
<meta name="keywords"
content="John Kerl, Kerl, Miller, miller, mlr, OLAP, data analysis software, regression, correlation, variance, data tools, " />

<title> FAQ </title>
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
</head>

<!-- ================================================================ -->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-15651652-1");
pageTracker._trackPageview();
} catch(err) {}
</script>

<!-- ================================================================ -->
<script type="text/javascript">
function toggle_div(div) {
  if (div != null) {
    if (div.id.startsWith("section_toggle_")) {
      var state = div.style.display;
      if (state == "block") {
        div.style.display = "none";
      } else {
        div.style.display = "block";
      }
    }
  }
}
function expand_div(div) {
  if (div != null) {
    if (div.id.startsWith("section_toggle_")) {
      div.style.display = "block";
    }
  }
}
function collapse_div(div) {
  if (div != null) {
    if (div.id.startsWith("section_toggle_")) {
      div.style.display = "none";
    }
  }
}

function toggle_by_name(divName) {
  toggle_div(document.getElementById(divName));
}
function expand_by_name(divName) {
  expand_div(document.getElementById(divName));
}
function collapse_by_name(divName) {
  collapse_div(document.getElementById(divName));
}

function expand_all() {
  var divs = document.getElementsByTagName("div");
  for(var i = 0; i < divs.length; i++) {
    expand_div(divs[i]);
  }
}
function collapse_all() {
  var divs = document.getElementsByTagName("div");
  for(var i = 0; i < divs.length; i++){
    collapse_div(divs[i]);
  }
}
</script>

<!--
The background image is from a screenshot of a Google search for "data analysis
tools", lightened and sepia-toned. Over this was placed a Mac Terminal app with
very light-grey font and translucent background, in which a few statistical
Miller commands were run with pretty-print-tabular output format.
<body background="pix/sepia-overlay.jpg">
-->
<body bgcolor="#ffffff">

<!-- ================================================================ -->
<table width="100%">
<tr>

<!-- navbar -->
<td width="15%">
<!--
<img src="pix/mlr.jpg" />
<img style="border-width:1px; color:black;" src="pix/mlr.jpg" />
-->

<div class="pokinav">
<center><titleinbody>Miller</titleinbody></center>

<!-- PAGE LIST GENERATED FROM template.html BY poki -->
<br/><b>Overview:</b>
<br/>• <a href="index.html">About Miller</a>
<br/>• <a href="10-min.html">Miller in 10 minutes</a>
<br/>• <a href="file-formats.html">File formats</a>
<br/>• <a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a>
<br/>• <a href="record-heterogeneity.html">Record-heterogeneity</a>
<br/>• <a href="internationalization.html">Internationalization</a>
<br/><b>Using Miller:</b>
<br/>• <a href="faq.html"><b>FAQ</b></a>
<br/>• <a href="cookbook.html">Cookbook part 1</a>
<br/>• <a href="cookbook2.html">Cookbook part 2</a>
<br/>• <a href="cookbook3.html">Cookbook part 3</a>
<br/>• <a href="data-examples.html">Data-diving examples</a>
<br/>• <a href="manpage.html">Manpage</a>
<br/>• <a href="reference.html">Reference</a>
<br/>• <a href="reference-verbs.html">Reference: Verbs</a>
<br/>• <a href="reference-dsl.html">Reference: DSL</a>
<br/>• <a href="release-docs.html">Documents by release</a>
<br/>• <a href="build.html">Installation, portability, dependencies, and testing</a>
<br/><b>Background:</b>
<br/>• <a href="why.html">Why?</a>
<br/>• <a href="whyc.html">Why C?</a>
<br/>• <a href="etymology.html">Why call it Miller?</a>
<br/>• <a href="originality.html">How original is Miller?</a>
<br/>• <a href="performance.html">Performance</a>
<br/><b>Repository:</b>
<br/>• <a href="to-do.html">Things to do</a>
<br/>• <a href="contact.html">Contact information</a>
<br/>• <a href="https://github.com/johnkerl/miller">GitHub repo</a>
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
<br/> <br/> <br/> <br/> <br/> <br/>
</div>
</td>

<!-- page body -->
<td>
<!--
This is a visually gorgeous feature (here & in the CSS): it allows for
independent scroll of the nav and body panels. In particular the nav
stays on-screen as you scroll the body.

However, two problems:

(1) In Firefox & Chrome both I get janky end-of-body scrolls: there is
more content but I can't scroll down to it unless I repeatedly retry the
scrolldown. Which is weird.

(2) Worse, only the first page renders in PDF (again, Firefox & Chrome).

For now I'm disabling this separate-scroll feature. A frontender, I am
not ... maybe someday I'll find a config which gets *all* the features
I want; for now, it's a tradeoff.
-->

<!-- Implementation details: one bit is right here:

div style="overflow-y:scroll;height:1500px"

and the other bit is in css/poki-callbacks.css:

.pokinav {
  display: inline-block;
  background: #e8d9bc;
  border: 1;
  box-shadow: 0px 0px 3px 3px #C9C9C9;
  margin: 10px;
  padding-top: 10px;
  padding-bottom: 10px;
  padding-left: 10px;
  padding-right: 10px;
  overflow-y: scroll; < - - - - - - here
  height: 1500px;
}

-->
<div>
<center> <titleinbody> FAQ </titleinbody> </center>
<p/>

<!-- BODY COPIED FROM content-for-faq.html BY poki -->
<div class="pokitoc">
<center><b>Contents:</b></center>
• <a href="#No_output_at_all">No output at all</a><br/>
• <a href="#Fields_not_selected">Fields not selected</a><br/>
• <a href="#Diagnosing_delimiter_specifications">Diagnosing delimiter specifications</a><br/>
• <a href="#How_do_I_examine_then-chaining?">How do I examine then-chaining?</a><br/>
• <a href="#I_assigned_$9_and_it’s_not_9th">I assigned $9 and it’s not 9th</a><br/>
• <a href="#How_can_I_handle_field_names_with_special_symbols_in_them?">How can I handle field names with special symbols in them?</a><br/>
• <a href="#How_can_I_put_single-quotes_into_strings?">How can I put single-quotes into strings?</a><br/>
• <a href="#Why_doesn’t_mlr_cut_put_fields_in_the_order_I_want?">Why doesn’t mlr cut put fields in the order I want?</a><br/>
• <a href="#NR_is_not_consecutive_after_then-chaining">NR is not consecutive after then-chaining</a><br/>
• <a href="#Why_am_I_not_seeing_all_possible_joins_occur?">Why am I not seeing all possible joins occur?</a><br/>
• <a href="#What_about_XML_or_JSON_file_formats?">What about XML or JSON file formats?</a><br/>
</div>
<p/>

<p/>
<button style="font-weight:bold;color:maroon;border:0" onclick="expand_all();" href="javascript:;">Expand all sections</button>
<button style="font-weight:bold;color:maroon;border:0" onclick="collapse_all();" href="javascript:;">Collapse all sections</button>

<a id="No_output_at_all"/><h1>No output at all</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_no_output_at_all');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_no_output_at_all" style="display: block">

<p/>Try <tt>od -xcv</tt> and/or <tt>cat -e</tt> on your file to check for non-printable characters.

<p/>If you’re using a Miller version older than 5.0.0, the release in which the
line-ending-autodetect feature was introduced (try <tt>mlr --version</tt> on
your system to find out), please see the
<a href="http://johnkerl.org/miller-releases/miller-4.5.0/doc/index.html">Miller 4.5.0 documentation</a>.

</div>
<a id="Fields_not_selected"/><h1>Fields not selected</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_fields_not_selected');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_fields_not_selected" style="display: block">

<p/>Check the field separators of the data, e.g. with the command-line
<tt>head</tt> program. Example: for CSV, Miller’s default field
separator is comma; if your data is tab-delimited, e.g. <tt>aTABbTABc</tt>,
then Miller won’t find three fields named <tt>a</tt>, <tt>b</tt>, and
<tt>c</tt> but rather just one named <tt>aTABbTABc</tt>. Solution in this
case: <tt>mlr --fs tab {remaining arguments ...}</tt>.

<p/>Also try <tt>od -xcv</tt> and/or <tt>cat -e</tt> on your file to check for non-printable characters.

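<p/>To see why a mismatched separator hides fields, here is a minimal sketch in
Python. It is an illustration only, not Miller’s implementation, and the helper
name <tt>split_fields</tt> is made up for this example. Splitting a
tab-delimited line on commas yields a single field; splitting on tabs yields
three:

<p/>
<div class="pokipanel">
<pre>
```python
# Illustrative model only: split one line of text on a chosen field separator.
def split_fields(line, fs=","):
    return line.rstrip("\n").split(fs)

line = "a\tb\tc"

# With the comma default, the whole line comes back as one field.
print(split_fields(line))

# With the separator that matches the data, there are three fields.
print(split_fields(line, fs="\t"))
```
</pre>
</div>
<p/>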
</div>
<a id="Diagnosing_delimiter_specifications"/><h1>Diagnosing delimiter specifications</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_diagnosing_delimiter_specifications');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_diagnosing_delimiter_specifications" style="display: block">

<p/>
<div class="pokipanel">
<pre>
# Use the `file` command to see if there are CR/LF terminators (in this case,
# there are not):
$ file data/colours.csv
data/colours.csv: UTF-8 Unicode text

# Look at the file to find names of fields
$ cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR
masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah

# Extract a few fields:
$ mlr --csv cut -f KEY,PL,RO data/colours.csv
(only blank lines appear)

# Use XTAB output format to get a sharper picture of where records/fields
# are being split:
$ mlr --icsv --oxtab cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz

KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah

# Using XTAB output format makes it clearer that KEY;DE;...;RO;TR is being
# treated as a single field name in the CSV header, and likewise each
# subsequent line is being treated as a single field value. This is because
# the default field separator is a comma but we have semicolons here.
# Use XTAB again with a different input field separator (--ifs semicolon):
$ mlr --icsv --ifs semicolon --oxtab cat data/colours.csv
KEY masterdata_colourcode_1
DE  Weiß
EN  White
ES  Blanco
FI  Valkoinen
FR  Blanc
IT  Bianco
NL  Wit
PL  Biały
RO  Alb
TR  Beyaz

KEY masterdata_colourcode_2
DE  Schwarz
EN  Black
ES  Negro
FI  Musta
FR  Noir
IT  Nero
NL  Zwart
PL  Czarny
RO  Negru
TR  Siyah

# Using the new field separator for both input and output, retry the cut:
$ mlr --csv --fs semicolon cut -f KEY,PL,RO data/colours.csv
KEY;PL;RO
masterdata_colourcode_1;Biały;Alb
masterdata_colourcode_2;Czarny;Negru
</pre>
</div>
<p/>

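<p/>A quick way to guess the separator before reaching for Miller is to count
candidate characters in the header line. The following Python sketch does that
for the header shown above; the function name <tt>count_separators</tt> and the
candidate list are assumptions made for this example, and the heuristic can be
fooled by headers that contain the separator inside quoted names.

<p/>
<div class="pokipanel">
<pre>
```python
# Illustrative helper (not part of Miller): count candidate separators in a header line.
def count_separators(header, candidates=(",", ";", "\t", "|")):
    return {c: header.count(c) for c in candidates}

header = "KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR"
counts = count_separators(header)

# The candidate seen most often is the most plausible field separator.
likely = max(counts, key=counts.get)
print(counts)
print("likely field separator:", repr(likely))
```
</pre>
</div>
<p/>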
</div>
<a id="How_do_I_examine_then-chaining?"/><h1>How do I examine then-chaining?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_examine_then_chaining');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_examine_then_chaining" style="display: block">

<p/>Then-chaining found in Miller is intended to function the same as Unix
pipes, but with less keystroking. You can print your data one pipeline step at
a time, to see what intermediate output at one step becomes the input to the
next step.

<p/>First, look at the input data:

<p/>
<div class="pokipanel">
<pre>
$ cat data/then-example.csv
Status,Payment_Type,Amount
paid,cash,10.00
pending,debit,20.00
paid,cash,50.00
pending,credit,40.00
paid,debit,30.00
</pre>
</div>
<p/>

Next, run the first step of your command, omitting anything from the first <tt>then</tt> onward:

<p/>
<div class="pokipanel">
<pre>
$ mlr --icsv --opprint count-distinct -f Status,Payment_Type data/then-example.csv
Status  Payment_Type count
paid    cash         2
pending debit        1
pending credit       1
paid    debit        1
</pre>
</div>
<p/>

After that, run it with the next <tt>then</tt> step included:

<p/>
<div class="pokipanel">
<pre>
$ mlr --icsv --opprint count-distinct -f Status,Payment_Type then sort -nr count data/then-example.csv
Status  Payment_Type count
paid    cash         2
pending debit        1
pending credit       1
paid    debit        1
</pre>
</div>
<p/>

Now if you use <tt>then</tt> to include another verb after that, the columns
<tt>Status</tt>, <tt>Payment_Type</tt>, and <tt>count</tt> will be the input to
that verb.

<p/>Note, by the way, that you’ll get the same results using pipes:
<p/>
<div class="pokipanel">
<pre>
$ mlr --csv count-distinct -f Status,Payment_Type data/then-example.csv | mlr --icsv --opprint sort -nr count
Status  Payment_Type count
paid    cash         2
pending debit        1
pending credit       1
paid    debit        1
</pre>
</div>
<p/>

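<p/>The idea that each step’s output is the next step’s input can be modeled
as plain function composition. The Python sketch below is an analogy under
stated assumptions, not how Miller is implemented: <tt>count_distinct</tt> and
<tt>sort_nr</tt> are hypothetical stand-ins for the two verbs, and the rows are
copied from the input file above.

<p/>
<div class="pokipanel">
<pre>
```python
from collections import Counter

rows = [
    {"Status": "paid", "Payment_Type": "cash", "Amount": "10.00"},
    {"Status": "pending", "Payment_Type": "debit", "Amount": "20.00"},
    {"Status": "paid", "Payment_Type": "cash", "Amount": "50.00"},
    {"Status": "pending", "Payment_Type": "credit", "Amount": "40.00"},
    {"Status": "paid", "Payment_Type": "debit", "Amount": "30.00"},
]

def count_distinct(records, fields):
    # Stand-in for the first step: count records per distinct combination
    # of field values, in order of first appearance.
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return [dict(zip(fields, key), count=n) for key, n in counts.items()]

def sort_nr(records, field):
    # Stand-in for the second step: numerical descending sort (stable).
    return sorted(records, key=lambda r: -r[field])

# Inspect the intermediate result first, then feed it to the next step.
step1 = count_distinct(rows, ["Status", "Payment_Type"])
step2 = sort_nr(step1, "count")
for r in step2:
    print(r)
```
</pre>
</div>
<p/>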
</div>
<a id="I_assigned_$9_and_it’s_not_9th"/><h1>I assigned $9 and it’s not 9th</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_9_not_9th');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_9_not_9th" style="display: block">

<p/> Miller records are ordered lists of key-value pairs. For NIDX format, DKVP
format when keys are missing, or CSV/CSV-lite format with
<tt>--implicit-csv-header</tt>, Miller will sequentially assign keys of the
form <tt>1</tt>, <tt>2</tt>, etc. But these are not integer array indices:
they’re just field names taken from the initial field ordering in the
input data. Assigning to a new field such as <tt>$6</tt> therefore appends a
field named <tt>6</tt> at the end of the record; it does not place the value
in the sixth position.

<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --dkvp cat
1=x,2=y,3=z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --dkvp put '$6="a";$4="b";$55="cde"'
1=x,2=y,3=z,6=a,4=b,55=cde
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --nidx cat
x,y,z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --csv --implicit-csv-header cat
1,2,3
x,y,z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --dkvp rename 2,999
1=x,999=y,3=z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --dkvp rename 2,newname
1=x,newname=y,3=z
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ echo x,y,z | mlr --csv --implicit-csv-header reorder -f 3,1,2
3,1,2
z,x,y
</pre>
</div>
<p/>

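<p/>One way to picture this is an insertion-ordered map keyed by strings. The
Python sketch below is only a mental model (Miller’s internal data structure
is not described here): it mirrors the <tt>put</tt> example above and shows
that new keys land in assignment order, not numeric order.

<p/>
<div class="pokipanel">
<pre>
```python
# Model a record as an insertion-ordered mapping from field names to values.
record = {"1": "x", "2": "y", "3": "z"}

# Assigning to new keys appends them in assignment order,
# regardless of how the names would sort as numbers.
record["6"] = "a"
record["4"] = "b"
record["55"] = "cde"

# Render the record the way DKVP output would look.
dkvp = ",".join(f"{k}={v}" for k, v in record.items())
print(dkvp)
```
</pre>
</div>
<p/>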
</div>
<a id="How_can_I_handle_field_names_with_special_symbols_in_them?"/><h1>How can I handle field names with special symbols in them?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_field_names_with_special_symbols');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_field_names_with_special_symbols" style="display: block">

<p/>Simply surround the field names with curly braces:

<p/>
<div class="pokipanel">
<pre>
$ echo 'x.a=3,y:b=4,z/c=5' | mlr put '${product.all} = ${x.a} * ${y:b} * ${z/c}'
x.a=3,y:b=4,z/c=5,product.all=60
</pre>
</div>
<p/>

</div>
<a id="How_can_I_put_single-quotes_into_strings?"/><h1>How can I put single-quotes into strings?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_single_quotes_in_strings');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_single_quotes_in_strings" style="display: block">

<p/> This is a little tricky due to the shell’s handling of quotes. For simplicity, let’s first put
an update script into a file, here <tt>data/single-quote-example.mlr</tt>:

<p/>
<div class="pokipanel">
<pre>
$a = "It's OK, I said, then 'for now'."
</pre>
</div>
<p/>

<p/>
<div class="pokipanel">
<pre>
$ echo a=bcd | mlr put -f data/single-quote-example.mlr
a=It's OK, I said, then 'for now'.
</pre>
</div>
<p/>

<p/>So, it’s simple: Miller’s DSL uses double quotes for strings,
and you can put single quotes (or backslash-escaped double-quotes) inside
strings, no problem.

<p/> Without putting the update expression in a file, it’s messier:

<p/>
<div class="pokipanel">
<pre>
$ echo a=bcd | mlr put '$a="It'\''s OK, I said, '\''for now'\''."'
a=It's OK, I said, 'for now'.
</pre>
</div>
<p/>

<p/> The idea is that the outermost single-quotes are to protect the
<tt>put</tt> expression from the shell, and the double quotes within them are
for Miller. To get a single quote in the middle there, you need to actually put it <i>outside</i> the single-quoting
for the shell. The pieces, each of which is single-quoted on the command line
except for the backslash-escaped single quotes, are

<ul>
<li/> <tt>$a="It</tt>
<li/> <tt>\'</tt>
<li/> <tt>s OK, I said, </tt> (with a trailing space)
<li/> <tt>\'</tt>
<li/> <tt>for now</tt>
<li/> <tt>\'</tt>
<li/> <tt>."</tt>
</ul>

all concatenated together.

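<p/>To check how a POSIX shell would tokenize the command line above without
actually running it, Python’s standard <tt>shlex</tt> module can be used. This
is a side check rather than anything Miller does; the variable names are
chosen for this example.

<p/>
<div class="pokipanel">
<pre>
```python
import shlex

# The command line exactly as typed at the shell prompt.
command = r"""mlr put '$a="It'\''s OK, I said, '\''for now'\''."'"""

# shlex.split applies POSIX-style quoting rules: adjacent quoted pieces and
# backslash-escaped quotes are joined into a single argument.
argv = shlex.split(command)
for arg in argv:
    print(arg)
```
</pre>
</div>

<p/>The last argument printed is the expression Miller receives, with the
single quotes inside the double-quoted DSL string.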
</div>
<a id="Why_doesn’t_mlr_cut_put_fields_in_the_order_I_want?"/><h1>Why doesn’t mlr cut put fields in the order I want?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_cut_out_of_order');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_cut_out_of_order" style="display: block">

<p/>Example: columns <tt>x,i,a</tt> were requested but they appear here in the order <tt>a,i,x</tt>:

<p/>
<div class="pokipanel">
<pre>
$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr cut -f x,i,a data/small
a=pan,i=1,x=0.3467901443380824
a=eks,i=2,x=0.7586799647899636
a=wye,i=3,x=0.20460330576630303
a=eks,i=4,x=0.38139939387114097
a=wye,i=5,x=0.5732889198020006
</pre>
</div>
<p/>

<p/>The issue is that Miller’s <tt>cut</tt>, by default, outputs cut fields in the order they
appear in the input data. This design decision was made intentionally to parallel the *nix system <tt>cut</tt>
command, which has the same semantics.

<p/>The solution is to use the <tt>-o</tt> option:

<p/>
<div class="pokipanel">
<pre>
$ mlr cut -o -f x,i,a data/small
x=0.3467901443380824,i=1,a=pan
x=0.7586799647899636,i=2,a=eks
x=0.20460330576630303,i=3,a=wye
x=0.38139939387114097,i=4,a=eks
x=0.5732889198020006,i=5,a=wye
</pre>
</div>
<p/>

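<p/>The difference between the two orderings can be sketched in a few lines of
Python. This is a model of the behavior shown above, not Miller’s code; the
function <tt>cut</tt> and its <tt>ordered</tt> parameter are hypothetical names
standing in for the verb and its <tt>-o</tt> flag.

<p/>
<div class="pokipanel">
<pre>
```python
def cut(record, fields, ordered=False):
    if ordered:
        # Like -o: follow the order in which the fields were requested.
        return {f: record[f] for f in fields if f in record}
    # Default: walk the record and keep requested fields in record order.
    wanted = set(fields)
    return {k: v for k, v in record.items() if k in wanted}

record = {"a": "pan", "b": "pan", "i": "1",
          "x": "0.3467901443380824", "y": "0.7268028627434533"}

print(list(cut(record, ["x", "i", "a"])))
print(list(cut(record, ["x", "i", "a"], ordered=True)))
```
</pre>
</div>
<p/>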
</div>
<a id="NR_is_not_consecutive_after_then-chaining"/><h1>NR is not consecutive after then-chaining</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_NR_not_consecutive_after_then');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_NR_not_consecutive_after_then" style="display: block">

<p/> Given this input data:
<p/>
<div class="pokipanel">
<pre>
$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre>
</div>
<p/>

why don’t I see <tt>NR=1</tt> and <tt>NR=2</tt> here?

<p/>
<div class="pokipanel">
<pre>
$ mlr filter '$x > 0.5' then put '$NR = NR' data/small
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,NR=2
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,NR=5
</pre>
</div>
<p/>

<p/>The reason is that <tt>NR</tt> is computed for the original input records and isn’t dynamically
updated. By contrast, <tt>NF</tt> is dynamically updated: it’s the number of fields in the
current record, and if you add/remove a field, the value of <tt>NF</tt> will change:

<p/>
<div class="pokipanel">
<pre>
$ echo x=1,y=2,z=3 | mlr put '$nf1 = NF; $u = 4; $nf2 = NF; unset $x,$y,$z; $nf3 = NF'
nf1=3,u=4,nf2=5,nf3=3
</pre>
</div>
<p/>

<p/><tt>NR</tt> (and <tt>FNR</tt> as well), on the other hand, retains the value from the original input stream,
even when records are dropped by a <tt>filter</tt> earlier in a <tt>then</tt>-chain. To recover consecutive record
numbers, you can use out-of-stream variables as follows:

<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint --from data/small put '
  begin{ @nr1 = 0 }
  @nr1 += 1;
  $nr1 = @nr1
' \
then filter '$x>0.5' \
then put '
  begin{ @nr2 = 0 }
  @nr2 += 1;
  $nr2 = @nr2
'
a   b   i x                  y                  nr1 nr2
eks pan 2 0.7586799647899636 0.5221511083334797 2   1
wye pan 5 0.5732889198020006 0.8636244699032729 5   2
</pre>
</div>
<p/>

<p/>Or, simply use <tt>mlr cat -n</tt>:

<p/>
<div class="pokipanel">
<pre>
$ mlr filter '$x > 0.5' then cat -n data/small
n=1,a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
n=2,a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre>
</div>
<p/>

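<p/>The same effect shows up in any pipeline where numbering happens at read
time and filtering happens later. The Python sketch below is an analogy, with
made-up names (<tt>read_records</tt>, <tt>kept</tt>) and only the <tt>a</tt>
and <tt>x</tt> fields from the data above; it is not Miller’s implementation.

<p/>
<div class="pokipanel">
<pre>
```python
def read_records(rows):
    # The record number is assigned once, as records are read from the input.
    for nr, row in enumerate(rows, start=1):
        yield nr, row

rows = [
    {"a": "pan", "x": 0.3467901443380824},
    {"a": "eks", "x": 0.7586799647899636},
    {"a": "wye", "x": 0.20460330576630303},
    {"a": "eks", "x": 0.38139939387114097},
    {"a": "wye", "x": 0.5732889198020006},
]

# Filtering drops records but does not change the numbers already assigned.
kept = [(nr, row) for nr, row in read_records(rows) if row["x"] > 0.5]
original_nrs = [nr for nr, _ in kept]

# Counting again after the filter gives consecutive numbers.
renumbered = list(range(1, len(kept) + 1))

print(original_nrs)
print(renumbered)
```
</pre>
</div>
<p/>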
</div>
<a id="Why_am_I_not_seeing_all_possible_joins_occur?"/><h1>Why am I not seeing all possible joins occur?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_not_all_possible_joins');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_not_all_possible_joins" style="display: block">

<p/><b>This section describes behavior before Miller 5.1.0. As of 5.1.0, <tt>-u</tt> is the default.</b>

<p/>For example, the right file here has nine records, and the left file should
add in the <tt>hostname</tt> column, so the join output should also have
nine records:

<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint cat data/join-u-left.csv
hostname              ipaddr
nadir.east.our.org    10.3.1.18
zenith.west.our.org   10.3.1.27
apoapsis.east.our.org 10.4.5.94
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint cat data/join-u-right.csv
ipaddr    timestamp  bytes
10.3.1.27 1448762579 4568
10.3.1.18 1448762578 8729
10.4.5.94 1448762579 17445
10.3.1.27 1448762589 12
10.3.1.18 1448762588 44558
10.4.5.94 1448762589 8899
10.3.1.27 1448762599 0
10.3.1.18 1448762598 73425
10.4.5.94 1448762599 12200
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint join -s -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr    hostname              timestamp  bytes
10.3.1.27 zenith.west.our.org   1448762579 4568
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.4.5.94 apoapsis.east.our.org 1448762599 12200
</pre>
</div>
<p/>

<p/>The issue is that Miller’s <tt>join</tt>, by default (before 5.1.0),
expected input sorted (lexically ascending) by the join keys in both the left and
right files. This design decision was made intentionally to parallel the *nix
system <tt>join</tt> command, which has the same semantics. The benefit of this
default is that the joiner program can stream through the left and right files,
needing to load neither entirely into memory. The drawback, of course, is that
it requires sorted input.

<p/>The solution (besides pre-sorting the input files on the join keys) is to
simply use <b>mlr join -u</b> (which is now the default). This loads the left
file entirely into memory (while the right file is still streamed one line at a
time) and does all possible joins without requiring sorted input:

<p/>
<div class="pokipanel">
<pre>
$ mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr    hostname              timestamp  bytes
10.3.1.27 zenith.west.our.org   1448762579 4568
10.3.1.18 nadir.east.our.org    1448762578 8729
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.3.1.27 zenith.west.our.org   1448762589 12
10.3.1.18 nadir.east.our.org    1448762588 44558
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.3.1.27 zenith.west.our.org   1448762599 0
10.3.1.18 nadir.east.our.org    1448762598 73425
10.4.5.94 apoapsis.east.our.org 1448762599 12200
</pre>
</div>
<p/>

<p/>General advice is to make sure the left-file is relatively small, e.g.
containing name-to-number mappings, while saving large amounts of data for the
right file.

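<p/>The unsorted strategy described above, holding the left input in memory
and streaming the right input, is essentially a hash join. Here is a minimal
Python sketch of that technique under stated assumptions: the function name
<tt>unsorted_join</tt> is made up, the sample data is a shortened version of
the files above, and the output field order is whatever the dictionary merge
produces rather than Miller’s.

<p/>
<div class="pokipanel">
<pre>
```python
from collections import defaultdict

def unsorted_join(left, right, key):
    # Load the (small) left input into an in-memory index by join key.
    index = defaultdict(list)
    for rec in left:
        index[rec[key]].append(rec)
    # Stream the right input, pairing each record with every matching left record.
    for rec in right:
        for match in index.get(rec[key], []):
            yield {**match, **rec}

left = [
    {"hostname": "nadir.east.our.org", "ipaddr": "10.3.1.18"},
    {"hostname": "zenith.west.our.org", "ipaddr": "10.3.1.27"},
]
right = [
    {"ipaddr": "10.3.1.27", "bytes": "4568"},
    {"ipaddr": "10.3.1.18", "bytes": "8729"},
    {"ipaddr": "10.3.1.27", "bytes": "12"},
]

# Neither input is sorted by ipaddr, yet every right record finds its match.
joined = list(unsorted_join(left, right, "ipaddr"))
for rec in joined:
    print(rec)
```
</pre>
</div>

<p/>Memory use grows with the left input only, which is why keeping the left
file small matters.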
</div>
<a id="What_about_XML_or_JSON_file_formats?"/><h1>What about XML or JSON file formats?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_xml_or_json');" href="javascript:;">Toggle section visibility</button>
<div id="section_toggle_xml_or_json" style="display: block">

<p/>Miller handles <boldmaroon>tabular data</boldmaroon>, which is a list of
records each having fields which are key-value pairs. Miller also doesn’t
require that each record have the same field names (see also <a
href="record-heterogeneity.html">here</a>). Regardless, tabular data is a
<boldmaroon>non-recursive data structure</boldmaroon>.

<p/> XML, JSON, etc. are, by contrast, all <boldmaroon>recursive</boldmaroon>
or <boldmaroon>nested</boldmaroon> data structures. For example, in JSON
you can represent a hash map whose values are lists of lists.

<p/>Now, you can put tabular data into these formats, since list-of-key-value-pairs
is one of the things representable in XML or JSON. Example:

<p/>
<div class="pokipanel">
<pre>
# DKVP
x=1,y=2
z=3

# XML
&lt;table&gt;
  &lt;record&gt;
    &lt;field&gt;
      &lt;key&gt; x &lt;/key&gt; &lt;value&gt; 1 &lt;/value&gt;
    &lt;/field&gt;
    &lt;field&gt;
      &lt;key&gt; y &lt;/key&gt; &lt;value&gt; 2 &lt;/value&gt;
    &lt;/field&gt;
  &lt;/record&gt;
  &lt;record&gt;
    &lt;field&gt;
      &lt;key&gt; z &lt;/key&gt; &lt;value&gt; 3 &lt;/value&gt;
    &lt;/field&gt;
  &lt;/record&gt;
&lt;/table&gt;

# JSON
[{"x":1,"y":2},{"z":3}]
</pre>
</div>

<p/>However, a tool like Miller which handles non-recursive data is never going
to be able to handle full XML/JSON semantics: only a small subset. If
tabular data represented in XML/JSON/etc. is sufficiently well-structured, it
may be easy to grep/sed out the data into a simpler text form; this is a
general text-processing problem.

<p/>Miller does support tabular data represented in JSON: please see
<a href="file-formats.html">File formats</a>. See also <a
href="http://stedolan.github.io/jq/">jq</a> for a truly powerful, JSON-specific
tool.

<p/>For XML, my suggestion is to use a tool like
<a href="http://ff-extractor.sourceforge.net/">ff-extractor</a> to do format
conversion.

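<p/>To make the tabular-versus-nested distinction concrete, here is a small
Python sketch that tests whether a JSON document is a flat list of records.
The definition used (a list of objects whose values are all scalars) is one
reasonable reading of the paragraphs above, and <tt>is_tabular</tt> is a
hypothetical helper; it says nothing about which JSON inputs Miller itself
accepts.

<p/>
<div class="pokipanel">
<pre>
```python
import json

def is_tabular(text):
    # Tabular here means: a list of objects whose values are all scalars
    # (no nested lists or objects).
    data = json.loads(text)
    if not isinstance(data, list):
        return False
    return all(
        isinstance(rec, dict)
        and not any(isinstance(v, (list, dict)) for v in rec.values())
        for rec in data
    )

# The tabular example from above: two flat records with different keys.
print(is_tabular('[{"x":1,"y":2},{"z":3}]'))

# A hash map whose value is a list of lists is nested, not tabular.
print(is_tabular('{"h": [[1, 2], [3]]}'))
```
</pre>
</div>
<p/>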
</div>
</div>
</td>

</table>
</body>
</html>