mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 10:15:36 +00:00
1071 lines
33 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en">
|
|
|
|
<!-- PAGE GENERATED FROM template.html and content-for-faq.html BY poki. -->
|
|
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
|
|
<head>
|
|
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
|
|
<meta name="description" content="Miller documentation"/>
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
|
|
<meta name="keywords"
|
|
content="John Kerl, Kerl, Miller, miller, mlr, OLAP, data analysis software, regression, correlation, variance, data tools, " />
|
|
|
|
<title> FAQ </title>
|
|
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
|
|
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
|
|
</head>
|
|
|
|
<!-- ================================================================ -->
|
|
<script type="text/javascript">
|
|
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
|
|
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
|
|
</script>
|
|
<script type="text/javascript">
|
|
try {
|
|
var pageTracker = _gat._getTracker("UA-15651652-1");
|
|
pageTracker._trackPageview();
|
|
} catch(err) {}
|
|
</script>
|
|
<!-- ================================================================ -->
|
|
|
|
<body bgcolor="#ffffff">
|
|
|
|
<!-- ================================================================ -->
|
|
|
|
<!-- navbar -->
|
|
<div class="pokinav">
|
|
<center><titleinbody>Miller</titleinbody></center>
|
|
|
|
<!-- NAVBAR GENERATED FROM template.html BY poki -->
|
|
<br/>
|
|
<a class="poki-navbar-element" href="index.html">Overview</a>
|
|
|
|
<a class="poki-navbar-element" href="faq.html"><b>Using</b></a>
|
|
|
|
<a class="poki-navbar-element" href="reference.html">Reference</a>
|
|
|
|
<a class="poki-navbar-element" href="why.html">Background</a>
|
|
|
|
<a class="poki-navbar-element" href="contact.html">Repository</a>
|
|
|
|
<br/>
|
|
<br/><a href="faq.html"><b>FAQ</b></a>
|
|
<br/><a href="data-sharing.html">Mixing with other languages</a>
|
|
<br/><a href="cookbook.html">Cookbook part 1</a>
|
|
<br/><a href="cookbook2.html">Cookbook part 2</a>
|
|
<br/><a href="cookbook3.html">Cookbook part 3</a>
|
|
<br/><a href="data-examples.html">Data-diving examples</a>
|
|
</div>
|
|
|
|
<!-- page body -->
|
|
<p/>
|
|
|
|
<!-- BODY COPIED FROM content-for-faq.html BY poki -->
|
|
<div class="pokitoc">
|
|
<center><titleinbody>FAQ</titleinbody></center>
|
|
• <a href="#No_output_at_all">No output at all</a><br/>
|
|
• <a href="#Fields_not_selected">Fields not selected</a><br/>
|
|
• <a href="#Diagnosing_delimiter_specifications">Diagnosing delimiter specifications</a><br/>
|
|
• <a href="#How_do_I_suppress_numeric_conversion?">How do I suppress numeric conversion?</a><br/>
|
|
• <a href="#How_do_I_examine_then-chaining?">How do I examine then-chaining?</a><br/>
|
|
• <a href="#I_assigned_$9_and_it’s_not_9th">I assigned $9 and it’s not 9th</a><br/>
|
|
• <a href="#How_can_I_filter_by_date?">How can I filter by date?</a><br/>
|
|
• <a href="#How_can_I_handle_commas-as-data_in_various_formats?">How can I handle commas-as-data in various formats?</a><br/>
|
|
• <a href="#How_can_I_handle_field_names_with_special_symbols_in_them?">How can I handle field names with special symbols in them?</a><br/>
|
|
• <a href="#How_to_escape_'?'_in_regexes?">How to escape '?' in regexes?</a><br/>
|
|
• <a href="#How_can_I_put_single-quotes_into_strings?">How can I put single-quotes into strings?</a><br/>
|
|
• <a href="#Why_doesn’t_mlr_cut_put_fields_in_the_order_I_want?">Why doesn’t mlr cut put fields in the order I want?</a><br/>
|
|
• <a href="#NR_is_not_consecutive_after_then-chaining">NR is not consecutive after then-chaining</a><br/>
|
|
• <a href="#Why_am_I_not_seeing_all_possible_joins_occur?">Why am I not seeing all possible joins occur?</a><br/>
|
|
• <a href="#How_to_rectangularize_after_joins_with_unpaired?">How to rectangularize after joins with unpaired?</a><br/>
|
|
• <a href="#What_about_XML_or_JSON_file_formats?">What about XML or JSON file formats?</a><br/>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="bodyToggler.expandAll();" href="javascript:;">Expand all sections</button>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="bodyToggler.collapseAll();" href="javascript:;">Collapse all sections</button>
|
|
|
|
<a id="No_output_at_all"/><h1>No output at all</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_no_output_at_all');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_no_output_at_all" style="display: block">
|
|
|
|
<p/>Try <code>od -xcv</code> and/or <code>cat -e</code> on your file to check for non-printable characters.
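<p/>A minimal sketch of what such a check can reveal, using only standard tools. The file name <code>sample.csv</code> and its contents are made up for illustration; the sample deliberately uses Windows-style CRLF line endings, one common source of invisible characters:

```shell
# Create a throwaway sample file with CRLF line endings (hypothetical data).
tmpdir=$(mktemp -d)
printf 'a,b\r\n1,2\r\n' > "$tmpdir/sample.csv"

# od -c prints each byte; carriage returns are displayed as \r.
od -c "$tmpdir/sample.csv"

# Count the lines that contain a carriage-return byte.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" "$tmpdir/sample.csv")
echo "lines containing CR: $crlf_lines"

rm -rf "$tmpdir"
```

<p/>If the count is nonzero for a file you expected to have plain LF line endings, the line endings are a likely suspect.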
|
|
|
|
<p/>Line-ending autodetection was introduced in Miller 5.0.0. If you’re using
an earlier version (run <code>mlr --version</code> to find out), please see the
<a href="http://johnkerl.org/miller-releases/miller-4.5.0/doc/index.html">Miller 4.5.0 documentation</a>.
|
|
|
|
</div>
|
|
<a id="Fields_not_selected"/><h1>Fields not selected</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_fields_not_selected');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_fields_not_selected" style="display: block">
|
|
|
|
<p/>Check the field-separators of the data, e.g. with the command-line
|
|
<code>head</code> program. Example: for CSV, Miller’s default field
|
|
separator is comma; if your data is tab-delimited, e.g. <code>aTABbTABc</code>,
|
|
then Miller won’t find three fields named <code>a</code>, <code>b</code>, and
|
|
<code>c</code> but rather just one named <code>aTABbTABc</code>. Solution in this
|
|
case: <code>mlr --fs tab {remaining arguments ...}</code>.
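<p/>One way to make the <code>head</code> check less dependent on eyeballing is to count candidate delimiters in the header line. This is only a sketch with standard tools; the file <code>sample.tsv</code> and its contents are hypothetical:

```shell
# Hypothetical sample: a tab-delimited file that might be mistaken for CSV.
tmpdir=$(mktemp -d)
printf 'a\tb\tc\n1\t2\t3\n' > "$tmpdir/sample.tsv"

# Take the header line and count how many commas and tabs it contains.
header=$(head -n 1 "$tmpdir/sample.tsv")
commas=$(printf '%s' "$header" | tr -cd ',' | wc -c | tr -d ' ')
tabs=$(printf '%s' "$header" | tr -cd '\t' | wc -c | tr -d ' ')
echo "commas=$commas tabs=$tabs"

rm -rf "$tmpdir"
```

<p/>A header with no commas but several tabs suggests the field separator should be tab rather than the CSV default.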
|
|
|
|
<p/>Also try <code>od -xcv</code> and/or <code>cat -e</code> on your file to check for non-printable characters.
|
|
|
|
</div>
|
|
<a id="Diagnosing_delimiter_specifications"/><h1>Diagnosing delimiter specifications</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_diagnosing_delimiter_specifications');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_diagnosing_delimiter_specifications" style="display: block">
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
# Use the `file` command to see if there are CR/LF terminators (in this case,
|
|
# there are not):
|
|
$ file data/colours.csv
|
|
data/colours.csv: UTF-8 Unicode text
|
|
|
|
# Look at the file to find names of fields
|
|
$ cat data/colours.csv
|
|
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR
|
|
masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
|
|
masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
|
|
|
|
# Extract a few fields:
|
|
$ mlr --csv cut -f KEY,PL,RO data/colours.csv
|
|
(only blank lines appear)
|
|
|
|
# Use XTAB output format to get a sharper picture of where records/fields
|
|
# are being split:
|
|
$ mlr --icsv --oxtab cat data/colours.csv
|
|
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
|
|
|
|
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
|
|
|
|
# Using XTAB output format makes it clearer that KEY;DE;...;RO;TR is being
|
|
# treated as a single field name in the CSV header, and likewise each
|
|
# subsequent line is being treated as a single field value. This is because
|
|
# the default field separator is a comma but we have semicolons here.
|
|
# Use XTAB again with a different input field separator (--ifs semicolon):
|
|
$ mlr --icsv --ifs semicolon --oxtab cat data/colours.csv
|
|
KEY masterdata_colourcode_1
|
|
DE Weiß
|
|
EN White
|
|
ES Blanco
|
|
FI Valkoinen
|
|
FR Blanc
|
|
IT Bianco
|
|
NL Wit
|
|
PL Biały
|
|
RO Alb
|
|
TR Beyaz
|
|
|
|
KEY masterdata_colourcode_2
|
|
DE Schwarz
|
|
EN Black
|
|
ES Negro
|
|
FI Musta
|
|
FR Noir
|
|
IT Nero
|
|
NL Zwart
|
|
PL Czarny
|
|
RO Negru
|
|
TR Siyah
|
|
|
|
# Using the new field-separator, retry the cut:
|
|
$ mlr --csv --fs semicolon cut -f KEY,PL,RO data/colours.csv
|
|
KEY;PL;RO
|
|
masterdata_colourcode_1;Biały;Alb
|
|
masterdata_colourcode_2;Czarny;Negru
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
</div>
|
|
<a id="How_do_I_suppress_numeric_conversion?"/><h1>How do I suppress numeric conversion?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_suppress_numeric_conversion');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_suppress_numeric_conversion" style="display: block">
|
|
|
|
<p/><b>TL;DR: use <code>put -S</code></b>.
|
|
|
|
<p/> Within <code>mlr put</code> and <code>mlr filter</code>, the default behavior for
|
|
scanning input records is to parse them as integer, if possible, then as float,
|
|
if possible, else leave them as string:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat data/scan-example-1.tbl
|
|
value
|
|
1
|
|
2.0
|
|
3x
|
|
hello
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --pprint put '$copy = $value; $type = typeof($value)' data/scan-example-1.tbl
|
|
value copy type
|
|
1 1 int
|
|
2.0 2.000000 float
|
|
3x 3x string
|
|
hello hello string
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>The numeric-conversion rule is simple:
|
|
|
|
<ul>
|
|
<li/> Try to scan as integer (<code>"1"</code> should be int);
|
|
<li/> if that doesn’t succeed, try to scan as float (<code>"1.0"</code> should be float);
|
|
<li/> if that doesn’t succeed, leave the value as a string (<code>"1x"</code> is string).
|
|
</ul>
|
|
|
|
<p/>This is a sensible default: you should be able to put <code>'$z = $x +
|
|
$y'</code> without having to write <code>'$z = int($x) + float($y)'</code>. Also
|
|
note that default output format for floating-point numbers created by
|
|
<code>put</code> (and other verbs such as <code>stats1</code>) is six decimal places;
|
|
you can override this using <code>mlr --ofmt</code>. Also note that Miller uses
|
|
your system’s C library functions whenever possible: e.g. <code>sscanf</code>
|
|
for converting strings to integer or floating-point.
|
|
|
|
<p/>But now suppose you have data like these:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat data/scan-example-2.tbl
|
|
value
|
|
0001
|
|
0002
|
|
0005
|
|
0005WA
|
|
0006
|
|
0007
|
|
0007WA
|
|
0008
|
|
0009
|
|
0010
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --pprint put '$copy = $value; $type = typeof($value)' data/scan-example-2.tbl
|
|
value copy type
|
|
0001 1 int
|
|
0002 2 int
|
|
0005 5 int
|
|
0005WA 0005WA string
|
|
0006 6 int
|
|
0007 7 int
|
|
0007WA 0007WA string
|
|
0008 8.000000 float
|
|
0009 9.000000 float
|
|
0010 8 int
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> The same conversion rules as above are being used. Namely:
|
|
|
|
<ul>
|
|
<li/> By default field values are inferred to int, else float, else string;
|
|
|
|
<li/> leading zeroes indicate octal for integers (<code>sscanf</code> semantics);
|
|
|
|
<li/> since <code>0008</code> doesn't scan as integer (leading 0 requests octal but 8
|
|
isn't a valid octal digit), the float scan is tried next and it succeeds;
|
|
|
|
<li/> default floating-point output format is 6 decimal places (override with <code>mlr --ofmt</code>).
|
|
</ul>
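<p/>The leading-zero convention can be observed outside Miller as well. The sketch below uses the shell’s own <code>printf</code>, not Miller, on the assumption that it follows the same C-style integer parsing; it illustrates the convention and is not a description of Miller’s internals:

```shell
# A leading zero requests octal in C-style integer parsing: 0010 is eight.
as_int=$(printf '%d' 0010)
echo "0010 parsed as integer: $as_int"

# 8 is not an octal digit, so integer parsing of 0008 is rejected.
if printf '%d' 0008 >/dev/null 2>&1; then
  int_ok=yes
else
  int_ok=no
fi
echo "0008 accepted as integer: $int_ok"

# The floating-point conversion has no octal rule and takes the same text.
LC_ALL=C printf '%f\n' 0008
```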
|
|
|
|
<p/> Taken individually the rules make sense; taken collectively they produce a mishmash of types here.
|
|
|
|
<p/>The solution is to <b>use the -S flag</b> for <code>mlr put</code> and/or <code>mlr filter</code>.
|
|
Then all field values are left as string. You can type-coerce on demand using syntax like
|
|
<code>'$z = int($x) + float($y)'</code>. (See also the
|
|
<a href="reference-verbs.html#put">put documentation</a>; see also
|
|
<a href="https://github.com/johnkerl/miller/issues/150">https://github.com/johnkerl/miller/issues/150</a>.)
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --pprint put -S '$copy = $value; $type = typeof($value)' data/scan-example-2.tbl
|
|
value copy type
|
|
0001 0001 string
|
|
0002 0002 string
|
|
0005 0005 string
|
|
0005WA 0005WA string
|
|
0006 0006 string
|
|
0007 0007 string
|
|
0007WA 0007WA string
|
|
0008 0008 string
|
|
0009 0009 string
|
|
0010 0010 string
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
</div>
|
|
<a id="How_do_I_examine_then-chaining?"/><h1>How do I examine then-chaining?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_examine_then_chaining');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_examine_then_chaining" style="display: block">
|
|
|
|
<p/>Then-chaining found in Miller is intended to function the same as Unix
|
|
pipes, but with less keystroking. You can print your data one pipeline step at
|
|
a time, to see what intermediate output at one step becomes the input to the
|
|
next step.
|
|
|
|
<p/>First, look at the input data:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat data/then-example.csv
|
|
Status,Payment_Type,Amount
|
|
paid,cash,10.00
|
|
pending,debit,20.00
|
|
paid,cash,50.00
|
|
pending,credit,40.00
|
|
paid,debit,30.00
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
Next, run the first step of your command, omitting anything from the first <code>then</code> onward:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --opprint count-distinct -f Status,Payment_Type data/then-example.csv
|
|
Status Payment_Type count
|
|
paid cash 2
|
|
pending debit 1
|
|
pending credit 1
|
|
paid debit 1
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
After that, run it with the next <code>then</code> step included:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --opprint count-distinct -f Status,Payment_Type then sort -nr count data/then-example.csv
|
|
Status Payment_Type count
|
|
paid cash 2
|
|
pending debit 1
|
|
pending credit 1
|
|
paid debit 1
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
Now if you use <code>then</code> to include another verb after that, the columns
|
|
<code>Status</code>, <code>Payment_Type</code>, and <code>count</code> will be the input to
|
|
that verb.
|
|
|
|
<p/>Note, by the way, that you’ll get the same results using pipes:
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --csv count-distinct -f Status,Payment_Type data/then-example.csv | mlr --icsv --opprint sort -nr count
|
|
Status Payment_Type count
|
|
paid cash 2
|
|
pending debit 1
|
|
pending credit 1
|
|
paid debit 1
|
|
</pre>
|
|
</div>
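<p/>The same step-at-a-time habit works for any pipeline. As a rough analogue using only standard Unix tools (not Miller), with a hypothetical headerless file <code>payments.txt</code> holding the status and payment-type columns:

```shell
# Hypothetical sample data: status,payment_type per line, no header.
tmpdir=$(mktemp -d)
printf 'paid,cash\npending,debit\npaid,cash\npending,credit\npaid,debit\n' \
  > "$tmpdir/payments.txt"

# Step 1 only: count distinct lines.
sort "$tmpdir/payments.txt" | uniq -c

# Steps 1 and 2: additionally sort by descending count.
sort "$tmpdir/payments.txt" | uniq -c | sort -nr

# Keep the top row, normalizing the padding that uniq -c emits.
top=$(sort "$tmpdir/payments.txt" | uniq -c | sort -nr | head -n 1 | awk '{print $1, $2}')
echo "top row: $top"

rm -rf "$tmpdir"
```

<p/>Running each prefix of the pipeline separately shows exactly what the next stage receives, which is the same thing that dropping trailing <code>then</code> steps shows in Miller.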
|
|
<p/>
|
|
|
|
</div>
|
|
<a id="I_assigned_$9_and_it’s_not_9th"/><h1>I assigned $9 and it’s not 9th</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_9_not_9th');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_9_not_9th" style="display: block">
|
|
|
|
<p/> Miller records are ordered lists of key-value pairs. For NIDX format, DKVP
|
|
format when keys are missing, or CSV/CSV-lite format with
|
|
<code>--implicit-csv-header</code>, Miller will sequentially assign keys of the
|
|
form <code>1</code>, <code>2</code>, etc. But these are not integer array indices:
|
|
they’re just field names taken from the initial field ordering in the
|
|
input data.
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo x,y,z | mlr --dkvp cat
|
|
1=x,2=y,3=z
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo x,y,z | mlr --dkvp put '$6="a";$4="b";$55="cde"'
|
|
1=x,2=y,3=z,6=a,4=b,55=cde
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo x,y,z | mlr --nidx cat
|
|
x,y,z
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo x,y,z | mlr --csv --implicit-csv-header cat
|
|
1,2,3
|
|
x,y,z
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo x,y,z | mlr --dkvp rename 2,999
|
|
1=x,999=y,3=z
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo x,y,z | mlr --dkvp rename 2,newname
|
|
1=x,newname=y,3=z
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo x,y,z | mlr --csv --implicit-csv-header reorder -f 3,1,2
|
|
3,1,2
|
|
z,x,y
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
</div>
|
|
<a id="How_can_I_filter_by_date?"/><h1>How can I filter by date?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_date_filtering');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_date_filtering" style="display: block">
|
|
|
|
<p/> Given input like
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat dates.csv
|
|
date,event
|
|
2018-02-03,initialization
|
|
2018-03-07,discovery
|
|
2018-02-03,allocation
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
we can use <code>strptime</code> to parse the date field into seconds-since-epoch
|
|
and then do numeric comparisons. Simply match your input dataset’s
|
|
date-formatting to the <a href="reference-dsl.html#strptime">strptime</a>
|
|
format-string. For example:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --csv filter 'strptime($date, "%Y-%m-%d") > strptime("2018-03-03", "%Y-%m-%d")' dates.csv
|
|
date,event
|
|
2018-03-07,discovery
|
|
</pre>
|
|
</div>
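<p/>For comparison, here is a sketch of a different technique that does not use <code>strptime</code> at all: zero-padded <code>YYYY-MM-DD</code> dates sort the same way as text and as dates, so a plain string comparison is enough. This uses <code>awk</code> rather than Miller, and it only holds for that exact date layout:

```shell
# Recreate the sample input shown above in a temporary directory.
tmpdir=$(mktemp -d)
printf 'date,event\n2018-02-03,initialization\n2018-03-07,discovery\n2018-02-03,allocation\n' \
  > "$tmpdir/dates.csv"

# Skip the header, then keep rows whose date sorts after the cutoff as text.
matches=$(awk -F, 'NR > 1 && $1 > "2018-03-03"' "$tmpdir/dates.csv")
echo "$matches"

rm -rf "$tmpdir"
```

<p/>For any other date layout (for example <code>03/07/2018</code>), text order and date order differ, and parsing to seconds-since-epoch is the safer route.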
|
|
<p/>
|
|
|
|
<p/>Caveat: localtime-handling in timezones with DST is still a work in progress; see
|
|
<a href="https://github.com/johnkerl/miller/issues/170">https://github.com/johnkerl/miller/issues/170</a>.
|
|
See also <a href="https://github.com/johnkerl/miller/issues/208">https://github.com/johnkerl/miller/issues/208</a>
|
|
— thanks @aborruso!
|
|
|
|
</div>
|
|
<a id="How_can_I_handle_commas-as-data_in_various_formats?"/><h1>How can I handle commas-as-data in various formats?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_comma_handling');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_comma_handling" style="display: block">
|
|
|
|
<p/> <a href="file-formats.html#CSV/TSV/ASV/USV/etc.">CSV</a> handles this well and by design:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat commas.csv
|
|
Name,Role
|
|
"Xiao, Lin",administrator
|
|
"Khavari, Darius",tester
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> Likewise <a href="file-formats.html#Tabular_JSON">JSON</a>:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --ojson cat commas.csv
|
|
{ "Name": "Xiao, Lin", "Role": "administrator" }
|
|
{ "Name": "Khavari, Darius", "Role": "tester" }
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> For Miller’s <a href="file-formats.html#XTAB:_Vertical_tabular">XTAB</a>
|
|
there is no escaping for carriage returns, but commas work fine:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --oxtab cat commas.csv
|
|
Name Xiao, Lin
|
|
Role administrator
|
|
|
|
Name Khavari, Darius
|
|
Role tester
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> But for <a href="file-formats.html#DKVP:_Key-value_pairs">DKVP</a>
|
|
and
|
|
<a href="file-formats.html#NIDX:_Index-numbered_(toolkit_style)">NIDX</a>, commas
|
|
are the default field separator. And — as of Miller 5.4.0 anyway —
|
|
there is no double-quote handling like there is for CSV. So commas within the data
|
|
look like delimiters:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --odkvp cat commas.csv
|
|
Name=Xiao, Lin,Role=administrator
|
|
Name=Khavari, Darius,Role=tester
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> One solution is to use a different delimiter, such as a pipe character:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --odkvp --ofs pipe cat commas.csv
|
|
Name=Xiao, Lin|Role=administrator
|
|
Name=Khavari, Darius|Role=tester
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> To be extra-sure to avoid data/delimiter clashes, you can also use control
|
|
characters as delimiters — here, control-A:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsv --odkvp --ofs '\001' cat commas.csv | cat -v
|
|
Name=Xiao, Lin^ARole=administrator
|
|
Name=Khavari, Darius^ARole=tester
|
|
</pre>
|
|
</div>
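<p/>Downstream tools can split on the same control character. A small sketch with standard <code>cut</code>, where the input line is typed in by hand to mimic the output above (it is not produced by Miller here):

```shell
# Control-A as a one-byte delimiter (octal 001).
soh=$(printf '\001')

# A hand-written line shaped like the DKVP output above.
line=$(printf 'Name=Xiao, Lin\001Role=administrator')

# Split on control-A; the comma inside the name is left alone.
name_field=$(printf '%s\n' "$line" | cut -d "$soh" -f 1)
role_field=$(printf '%s\n' "$line" | cut -d "$soh" -f 2)
echo "$name_field"
echo "$role_field"
```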
|
|
<p/>
|
|
|
|
</div>
|
|
<a id="How_can_I_handle_field_names_with_special_symbols_in_them?"/><h1>How can I handle field names with special symbols in them?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_field_names_with_special_symbols');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_field_names_with_special_symbols" style="display: block">
|
|
|
|
<p/>Simply surround the field names with curly braces:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo 'x.a=3,y:b=4,z/c=5' | mlr put '${product.all} = ${x.a} * ${y:b} * ${z/c}'
|
|
x.a=3,y:b=4,z/c=5,product.all=60
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
</div>
|
|
<a id="How_to_escape_'?'_in_regexes?"/><h1>How to escape '?' in regexes?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_escape_in_regex');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_escape_in_regex" style="display: block">
|
|
|
|
<p/> One way is to use square brackets; an alternative is to use simple
|
|
string-substitution rather than a regular expression.
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat data/question.dat
|
|
a=is it?,b=it is!
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --oxtab put '$c = gsub($a, "[?]"," ...")' data/question.dat
|
|
a is it?
|
|
b it is!
|
|
c is it ...
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --oxtab put '$c = ssub($a, "?"," ...")' data/question.dat
|
|
a is it?
|
|
b it is!
|
|
c is it ...
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> The <code>ssub</code> function exists precisely for this reason: so you don’t have to escape anything.
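<p/>The square-bracket trick is ordinary regular-expression behavior rather than anything specific to Miller. As a sketch, the same pattern works in <code>sed</code> under both basic and extended syntax (the <code>-E</code> flag is assumed to be available, as it is in GNU and BSD <code>sed</code>):

```shell
input='is it?'

# Inside [...] the question mark is an ordinary character (basic regex).
basic=$(printf '%s\n' "$input" | sed 's/[?]/ .../')

# The same bracket expression in extended regex, where a bare ? is a quantifier.
extended=$(printf '%s\n' "$input" | sed -E 's/[?]/ .../')

echo "$basic"
echo "$extended"
```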
|
|
|
|
</div>
|
|
<a id="How_can_I_put_single-quotes_into_strings?"/><h1>How can I put single-quotes into strings?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_single_quotes_in_strings');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_single_quotes_in_strings" style="display: block">
|
|
|
|
<p/> This is a little tricky due to the shell’s handling of quotes. For simplicity, let’s first put
|
|
an update script into a file:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$a = "It's OK, I said, then 'for now'."
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo a=bcd | mlr put -f data/single-quote-example.mlr
|
|
a=It's OK, I said, then 'for now'.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>So, it’s simple: Miller’s DSL uses double quotes for strings,
|
|
and you can put single quotes (or backslash-escaped double-quotes) inside
|
|
strings, no problem.
|
|
|
|
<p/> Without putting the update expression in a file, it’s messier:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo a=bcd | mlr put '$a="It'\''s OK, I said, '\''for now'\''."'
|
|
a=It's OK, I said, 'for now'.
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> The idea is that the outermost single-quotes are to protect the
|
|
<code>put</code> expression from the shell, and the double quotes within them are
|
|
for Miller. To get a single quote in the middle there, you need to actually put it <i>outside</i> the single-quoting
|
|
for the shell. The pieces are
|
|
|
|
<ul>
|
|
<li/> <code>$a="It</code>
|
|
<li/> <code>\'</code>
|
|
<li/> <code>s OK, I said,</code>
|
|
<li/> <code>\'</code>
|
|
<li/> <code>for now</code>
|
|
<li/> <code>\'</code>
|
|
<li/> <code>."</code>
|
|
</ul>
|
|
|
|
all concatenated together.
|
|
|
|
</div>
|
|
<a id="Why_doesn’t_mlr_cut_put_fields_in_the_order_I_want?"/><h1>Why doesn’t mlr cut put fields in the order I want?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_cut_out_of_order');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_cut_out_of_order" style="display: block">
|
|
|
|
<p/>Example: columns <code>x,i,a</code> were requested but they appear here in the order <code>a,i,x</code>:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat data/small
|
|
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
|
|
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
|
|
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
|
|
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
|
|
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr cut -f x,i,a data/small
|
|
a=pan,i=1,x=0.3467901443380824
|
|
a=eks,i=2,x=0.7586799647899636
|
|
a=wye,i=3,x=0.20460330576630303
|
|
a=eks,i=4,x=0.38139939387114097
|
|
a=wye,i=5,x=0.5732889198020006
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>The issue is that Miller’s <code>cut</code>, by default, outputs cut fields in the order they
|
|
appear in the input data. This design decision was made intentionally to parallel the *nix system <code>cut</code>
|
|
command, which has the same semantics.
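<p/>The system <code>cut</code> behavior referred to here is easy to see on a single comma-separated line. In this sketch the sample line is made up, and <code>awk</code> is shown only as the usual standard-tools way to get a caller-specified order:

```shell
line='pan,1,0.3467901443380824'

# cut ignores the order of the -f list and emits fields in input order.
cut_out=$(printf '%s\n' "$line" | cut -d, -f3,1)
echo "$cut_out"

# awk prints fields in whatever order they are named.
awk_out=$(printf '%s\n' "$line" | awk -F, -v OFS=, '{print $3, $1}')
echo "$awk_out"
```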
|
|
|
|
<p/>The solution is to use the <code>-o</code> option:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr cut -o -f x,i,a data/small
|
|
x=0.3467901443380824,i=1,a=pan
|
|
x=0.7586799647899636,i=2,a=eks
|
|
x=0.20460330576630303,i=3,a=wye
|
|
x=0.38139939387114097,i=4,a=eks
|
|
x=0.5732889198020006,i=5,a=wye
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
</div>
|
|
<a id="NR_is_not_consecutive_after_then-chaining"/><h1>NR is not consecutive after then-chaining</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_NR_not_consecutive_after_then');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_NR_not_consecutive_after_then" style="display: block">
|
|
|
|
<p/> Given this input data:
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ cat data/small
|
|
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
|
|
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
|
|
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
|
|
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
|
|
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
why don’t I see <code>NR=1</code> and <code>NR=2</code> here?
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr filter '$x > 0.5' then put '$NR = NR' data/small
|
|
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,NR=2
|
|
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,NR=5
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>The reason is that <code>NR</code> is computed for the original input records and isn’t dynamically
|
|
updated. By contrast, <code>NF</code> is dynamically updated: it’s the number of fields in the
|
|
current record, and if you add/remove a field, the value of <code>NF</code> will change:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ echo x=1,y=2,z=3 | mlr put '$nf1 = NF; $u = 4; $nf2 = NF; unset $x,$y,$z; $nf3 = NF'
|
|
nf1=3,u=4,nf2=5,nf3=3
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/><code>NR</code>, by contrast (and <code>FNR</code> as well), retains the value from the original input stream,
|
|
and records may be dropped by a <code>filter</code> within a <code>then</code>-chain. To recover consecutive record
|
|
numbers, you can use out-of-stream variables as follows:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --opprint --from data/small put '
|
|
begin{ @nr1 = 0 }
|
|
@nr1 += 1;
|
|
$nr1 = @nr1
|
|
' \
|
|
then filter '$x>0.5' \
|
|
then put '
|
|
begin{ @nr2 = 0 }
|
|
@nr2 += 1;
|
|
$nr2 = @nr2
|
|
'
|
|
a b i x y nr1 nr2
|
|
eks pan 2 0.7586799647899636 0.5221511083334797 2 1
|
|
wye pan 5 0.5732889198020006 0.8636244699032729 5 2
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>Or, simply use <code>mlr cat -n</code>:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr filter '$x > 0.5' then cat -n data/small
|
|
n=1,a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
|
|
n=2,a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
</div>
|
|
<a id="Why_am_I_not_seeing_all_possible_joins_occur?"/><h1>Why am I not seeing all possible joins occur?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_not_all_possible_joins');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_not_all_possible_joins" style="display: block">
|
|
|
|
<p/><b>This section describes behavior before Miller 5.1.0. As of 5.1.0, <code>-u</code> is the default.</b>
|
|
|
|
<p/>For example, the right file here has nine records, and the left file should
|
|
add in the <code>hostname</code> column — so the join output should also have
|
|
nine records:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsvlite --opprint cat data/join-u-left.csv
|
|
hostname ipaddr
|
|
nadir.east.our.org 10.3.1.18
|
|
zenith.west.our.org 10.3.1.27
|
|
apoapsis.east.our.org 10.4.5.94
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsvlite --opprint cat data/join-u-right.csv
|
|
ipaddr timestamp bytes
|
|
10.3.1.27 1448762579 4568
|
|
10.3.1.18 1448762578 8729
|
|
10.4.5.94 1448762579 17445
|
|
10.3.1.27 1448762589 12
|
|
10.3.1.18 1448762588 44558
|
|
10.4.5.94 1448762589 8899
|
|
10.3.1.27 1448762599 0
|
|
10.3.1.18 1448762598 73425
|
|
10.4.5.94 1448762599 12200
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsvlite --opprint join -s -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
|
|
ipaddr hostname timestamp bytes
|
|
10.3.1.27 zenith.west.our.org 1448762579 4568
|
|
10.4.5.94 apoapsis.east.our.org 1448762579 17445
|
|
10.4.5.94 apoapsis.east.our.org 1448762589 8899
|
|
10.4.5.94 apoapsis.east.our.org 1448762599 12200
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>The issue is that Miller’s <code>join</code>, by default (before 5.1.0),
|
|
required input sorted (lexically ascending) by the join keys in both the left and
|
|
right files. This design decision was made intentionally to parallel the *nix
|
|
system <code>join</code> command, which has the same semantics. The benefit of this
|
|
default is that the joiner program can stream through the left and right files,
|
|
needing to load neither entirely into memory. The drawback, of course, is that
|
|
it requires sorted input.
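<p/>The sorted-input requirement of the system <code>join</code> command can be sketched as follows. The two small files are hypothetical, headerless cut-downs of the data above, and <code>LC_ALL=C</code> is set so that <code>sort</code> and <code>join</code> agree on ordering:

```shell
# Hypothetical headerless samples: ipaddr,hostname and ipaddr,bytes.
tmpdir=$(mktemp -d)
printf '10.3.1.18,nadir\n10.3.1.27,zenith\n10.4.5.94,apoapsis\n' > "$tmpdir/left.csv"
printf '10.3.1.27,4568\n10.3.1.18,8729\n10.4.5.94,17445\n' > "$tmpdir/right.csv"

# Sort both inputs on the join field before joining.
LC_ALL=C sort -t, -k1,1 "$tmpdir/left.csv" > "$tmpdir/left.sorted"
LC_ALL=C sort -t, -k1,1 "$tmpdir/right.csv" > "$tmpdir/right.sorted"

# With sorted inputs, every key pairs up.
joined=$(LC_ALL=C join -t, "$tmpdir/left.sorted" "$tmpdir/right.sorted")
echo "$joined"

rm -rf "$tmpdir"
```

<p/>Skipping the <code>sort</code> steps on the right-hand file is what leads to missed pairings of the kind shown above.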
|
|
|
|
<p/>The solution (besides pre-sorting the input files on the join keys) is to
|
|
simply use <b>mlr join -u</b> (which is now the default). This loads the left
|
|
file entirely into memory (while the right file is still streamed one line at a
|
|
time) and does all possible joins without requiring sorted input:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
|
|
ipaddr hostname timestamp bytes
|
|
10.3.1.27 zenith.west.our.org 1448762579 4568
|
|
10.3.1.18 nadir.east.our.org 1448762578 8729
|
|
10.4.5.94 apoapsis.east.our.org 1448762579 17445
|
|
10.3.1.27 zenith.west.our.org 1448762589 12
|
|
10.3.1.18 nadir.east.our.org 1448762588 44558
|
|
10.4.5.94 apoapsis.east.our.org 1448762589 8899
|
|
10.3.1.27 zenith.west.our.org 1448762599 0
|
|
10.3.1.18 nadir.east.our.org 1448762598 73425
|
|
10.4.5.94 apoapsis.east.our.org 1448762599 12200
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>The general advice is to keep the left file relatively small (e.g. one
containing name-to-number mappings), since it is held in memory, and to save the
large amounts of data for the right file, which is streamed.
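<p/>The memory trade-off behind this advice can be sketched in Python as a
simple hash join. This is only an illustration of the idea, not Miller's actual
code; <code>unsorted_join</code> is a hypothetical helper, and the
<code>10.9.9.9</code> record is made up to show an unmatched right-side key.

```python
from collections import defaultdict


def unsorted_join(left, right, key):
    """Hash join sketch: hold the left side in memory, stream the right side."""
    # Memory use grows with the size of the left input only.
    buckets = defaultdict(list)
    for lrec in left:
        buckets[lrec[key]].append(lrec)
    # The right input is read once, one record at a time, in any order.
    for rrec in right:
        # .get avoids creating empty buckets for unmatched right keys.
        for lrec in buckets.get(rrec[key], []):
            yield {**lrec, **rrec}


left = [
    {"hostname": "nadir.east.our.org", "ipaddr": "10.3.1.18"},
    {"hostname": "zenith.west.our.org", "ipaddr": "10.3.1.27"},
]
# An iterator stands in for a large file that is streamed, not loaded.
right = iter([
    {"ipaddr": "10.3.1.27", "bytes": 4568},
    {"ipaddr": "10.3.1.18", "bytes": 8729},
    {"ipaddr": "10.9.9.9", "bytes": 1},  # made-up address with no left match
    {"ipaddr": "10.3.1.27", "bytes": 12},
])

joined = list(unsorted_join(left, right, "ipaddr"))
for rec in joined:
    print(rec["ipaddr"], rec["hostname"], rec["bytes"])
```

<p/>Only the left records are kept in the lookup table, so swapping a large
file into the left position is what would make memory use grow.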
|
|
|
|
</div>
|
|
<a id="How_to_rectangularize_after_joins_with_unpaired?"/><h1>How to rectangularize after joins with unpaired?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_rectangularize_after_join');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_rectangularize_after_join" style="display: block">
|
|
|
|
<p/> Suppose you have the following two data files:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
id,code
|
|
3,0000ff
|
|
2,00ff00
|
|
4,ff0000
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
id,color
|
|
4,red
|
|
2,green
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> Joining on <code>id</code>, the results are as expected:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --csv join -j id -f data/color-codes.csv data/color-names.csv
|
|
id,code,color
|
|
4,ff0000,red
|
|
2,00ff00,green
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> However, if we ask for left-unpaired records, the unpaired record has no
<code>color</code> column, so we get a row whose column names differ from those
of the others:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --csv join --ul -j id -f data/color-codes.csv data/color-names.csv
|
|
id,code,color
|
|
4,ff0000,red
|
|
2,00ff00,green
|
|
|
|
id,code
|
|
3,0000ff
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/> To fix this, we can use <b>unsparsify</b>:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
$ mlr --csv join --ul -j id -f data/color-codes.csv then unsparsify --fill-with "" data/color-names.csv
|
|
id,code,color
|
|
4,ff0000,red
|
|
2,00ff00,green
|
|
3,0000ff,
|
|
</pre>
|
|
</div>
|
|
<p/>
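<p/> For intuition, the effect of unsparsifying can be sketched in a few lines of
Python: collect the union of field names, then fill in whatever each record is
missing. This is an illustrative sketch rather than Miller's implementation, and
the function name and its <code>fill_with</code> parameter are chosen here to
mirror the command above.

```python
def unsparsify(records, fill_with=""):
    """Give every record the same fields, filling gaps with `fill_with`."""
    # First pass: union of field names, in first-seen order.
    keys = list(dict.fromkeys(k for rec in records for k in rec))
    # Second pass: emit each record with all fields present.
    return [{k: rec.get(k, fill_with) for k in keys} for rec in records]


# The three records produced by the join with left-unpaired rows shown above.
joined = [
    {"id": "4", "code": "ff0000", "color": "red"},
    {"id": "2", "code": "00ff00", "color": "green"},
    {"id": "3", "code": "0000ff"},
]

rectangular = unsparsify(joined)
print(",".join(rectangular[0].keys()))
for rec in rectangular:
    print(",".join(rec.values()))
```

<p/> Because the full set of field names is only known after every record has
been seen, this sketch has to hold all records in memory before emitting any.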
|
|
|
|
<p/> Thanks to @aborruso for the tip!
|
|
|
|
</div>
|
|
<a id="What_about_XML_or_JSON_file_formats?"/><h1>What about XML or JSON file formats?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_xml_or_json');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="body_section_toggle_xml_or_json" style="display: block">
|
|
|
|
<p/>Miller handles <span class="boldmaroon">tabular data</span>, which is a list of
|
|
records each having fields which are key-value pairs. Miller also doesn’t
|
|
require that each record have the same field names (see also <a
|
|
href="record-heterogeneity.html">here</a>). Regardless, tabular data is a
|
|
<span class="boldmaroon">non-recursive data structure</span>.
|
|
|
|
<p/> XML, JSON, etc. are, by contrast, all <span class="boldmaroon">recursive</span>
|
|
or <span class="boldmaroon">nested</span> data structures. For example, in JSON
|
|
you can represent a hash map whose values are lists of lists.
|
|
|
|
<p/>Now, you can put tabular data into these formats — since list-of-key-value-pairs
|
|
is one of the things representable in XML or JSON. Example:
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
# DKVP
|
|
x=1,y=2
|
|
z=3
|
|
|
|
# XML
|
|
<table>
|
|
<record>
|
|
<field>
|
|
<key> x </key> <value> 1 </value>
|
|
</field>
|
|
<field>
|
|
<key> y </key> <value> 2 </value>
|
|
</field>
|
|
</record>
|
|
<record>
|
|
<field>
|
|
<key> z </key> <value> 3 </value>
|
|
</field>
|
|
</record>
|
|
</table>
|
|
|
|
# JSON
|
|
[{"x":1,"y":2},{"z":3}]
|
|
</pre>
|
|
</div>
|
|
|
|
<p/>However, a tool like Miller which handles non-recursive data is never going
|
|
to be able to handle full XML/JSON semantics — only a small subset. If
|
|
tabular data represented in XML/JSON/etc are sufficiently well-structured, it
|
|
may be easy to grep/sed out the data into a simpler text form — this is a
|
|
general text-processing problem.
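<p/>As an illustration of that tabular subset, the following Python sketch
converts a JSON list of flat objects into DKVP lines and rejects anything nested.
It uses only the standard <code>json</code> module; the function name
<code>json_to_dkvp</code> is hypothetical, and the sketch says nothing about how
Miller itself reads JSON.

```python
import json


def json_to_dkvp(text):
    """Convert a JSON list of flat objects into DKVP lines."""
    lines = []
    for rec in json.loads(text):
        for value in rec.values():
            # Nested lists or maps have no tabular equivalent.
            if isinstance(value, (dict, list)):
                raise ValueError("nested value: not tabular data")
        lines.append(",".join(f"{k}={v}" for k, v in rec.items()))
    return lines


dkvp_lines = json_to_dkvp('[{"x":1,"y":2},{"z":3}]')
print("\n".join(dkvp_lines))
```

<p/>For the JSON example above this prints <code>x=1,y=2</code> and
<code>z=3</code>, the same two records as the DKVP form.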
|
|
|
|
<p/>Miller does support tabular data represented in JSON: please see
|
|
<a href="file-formats.html">File formats</a>. See also <a
|
|
href="http://stedolan.github.io/jq/">jq</a> for a truly powerful, JSON-specific
|
|
tool.
|
|
|
|
<p/>For XML, my suggestion is to use a tool like
|
|
<a href="http://ff-extractor.sourceforge.net/">ff-extractor</a> to do format
|
|
conversion.
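<p/>If the XML follows exactly the simple layout shown earlier, Python's standard
<code>xml.etree.ElementTree</code> module can also do the conversion. This is a
sketch under that assumption: the element names <code>table</code>,
<code>record</code>, <code>field</code>, <code>key</code> and <code>value</code>
come from the example above, and <code>xml_to_dkvp</code> is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<table>
  <record>
    <field> <key> x </key> <value> 1 </value> </field>
    <field> <key> y </key> <value> 2 </value> </field>
  </record>
  <record>
    <field> <key> z </key> <value> 3 </value> </field>
  </record>
</table>
"""


def xml_to_dkvp(text):
    """Convert the table/record/field layout into DKVP lines."""
    root = ET.fromstring(text)
    lines = []
    for record in root.iter("record"):
        pairs = []
        for field in record.iter("field"):
            # Strip the padding whitespace around each key and value.
            key = field.findtext("key").strip()
            value = field.findtext("value").strip()
            pairs.append(f"{key}={value}")
        lines.append(",".join(pairs))
    return lines


xml_lines = xml_to_dkvp(XML_TEXT)
print("\n".join(xml_lines))
```

<p/>Real-world XML rarely has this shape, so the element names would need
adjusting for each input.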
|
|
|
|
</div>
|
|
|
|
<!-- ================================================================ -->
|
|
<script type="text/javascript" src="js/miller-doc-toggler.js"></script>
|
|
<!-- wtf -->
|
|
<script type="text/javascript">
|
|
// Put this at the bottom of the page since its constructor scans the
|
|
// document's div tags to find the toggleables.
|
|
const bodyToggler = new MillerDocToggler(
|
|
"body_section_toggle_",
|
|
'maroon',
|
|
'maroon',
|
|
);
|
|
</script>
|
|
|
|
</body>
|
|
</html>
|