
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">

<!-- PAGE GENERATED FROM template.html and content-for-performance.html BY poki. -->
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
<head>
  <meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
  <meta name="description" content="Miller documentation"/>
  <meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
  <title> Performance </title>
  <link rel="stylesheet" type="text/css" href="css/miller.css"/>
  <link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
</head>

<!-- ================================================================ -->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-15651652-1");
pageTracker._trackPageview();
} catch(err) {}
</script>

<script type="text/javascript">
	function toggle(divName) {
		var eleDiv = document.getElementById(divName);
		if (eleDiv != null) {
			if (eleDiv.style.display == "block") {
				eleDiv.style.display = "none";
			} else {
				eleDiv.style.display = "block";
			}
		}
	}
</script>

<!--
The background image is from a screenshot of a Google search for "data analysis
tools", lightened and sepia-toned. Over this was placed a Mac Terminal app with
very light-grey font and translucent background, in which a few statistical
Miller commands were run with pretty-print-tabular output format.
-->
<body background="pix/sepia-overlay.jpg">

<!-- ================================================================ -->
<table width="100%">
<tr>

  <!-- navbar -->
  <td width="15%">
    <!--
    <img src="pix/mlr.jpg" />
    <img style="border-width:1px; color:black;" src="pix/mlr.jpg" />
    -->

    <div class="pokinav">
      <center><titleinbody>Miller</titleinbody></center>

<!-- PAGE LIST GENERATED FROM template.html BY poki -->
<br/>User info:
<br/>&bull;&nbsp;<a href="index.html">About</a>
<br/>&bull;&nbsp;<a href="file-formats.html">File formats</a>
<br/>&bull;&nbsp;<a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a>
<br/>&bull;&nbsp;<a href="record-heterogeneity.html">Record-heterogeneity</a>
<br/>&bull;&nbsp;<a href="performance.html"><b>Performance</b></a>
<br/>&bull;&nbsp;<a href="etymology.html">Why call it Miller?</a>
<br/>&bull;&nbsp;<a href="originality.html">How original is Miller?</a>
<br/>&bull;&nbsp;<a href="reference.html">Reference</a>
<br/>&bull;&nbsp;<a href="data-examples.html">Data examples</a>
<br/>&bull;&nbsp;<a href="to-do.html">Things to do</a>
<br/>Developer info:
<br/>&bull;&nbsp;<a href="build.html">Compiling, portability, dependencies, and testing</a>
<br/>&bull;&nbsp;<a href="whyc.html">Why C?</a>
<br/>&bull;&nbsp;<a href="contact.html">Contact information</a>
<br/>&bull;&nbsp;<a href="https://github.com/johnkerl/miller">GitHub repo</a>
      <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
      <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
      <br/> <br/> <br/> <br/> <br/> <br/>
    </div>
  </td>

  <!-- page body -->
  <td>
    <div style="overflow-y:scroll;height:1500px">
    <center> <titleinbody> Performance </titleinbody> </center>
    <p/>

<!-- BODY COPIED FROM content-for-performance.html BY poki -->
<div class="pokitoc">
<center><b>Contents:</b></center>
&bull;&nbsp;<a href="#Data">Data</a><br/>
&bull;&nbsp;<a href="#Comparands">Comparands</a><br/>
&bull;&nbsp;<a href="#Raw_results">Raw results</a><br/>
&bull;&nbsp;<a href="#Analysis">Analysis</a><br/>
&bull;&nbsp;<a href="#Conclusion">Conclusion</a><br/>
</div>
<p/>

<a id="Data"/><h1>Data</h1>

Test data were of the form

<table><tr><td>
<p/>
<div class="pokipanel">
<pre>
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
</pre>
</div>
<p/>
</td><td>
<p/>
<div class="pokipanel">
<pre>
a,b,i,x,y
pan,pan,1,0.3467901443380824,0.7268028627434533
eks,pan,2,0.7586799647899636,0.5221511083334797
wye,wye,3,0.20460330576630303,0.33831852551664776
eks,wye,4,0.38139939387114097,0.13418874328430463
wye,pan,5,0.5732889198020006,0.8636244699032729
</pre>
</div>
<p/>
</td></tr></table>

for DKVP and CSV, respectively, where fields <tt>a</tt> and <tt>b</tt> take one of five text values,
uniformly distributed; <tt>i</tt> is a 1-up line counter; <tt>x</tt> and <tt>y</tt>
are independent uniformly distributed floating-point numbers in the unit
interval.

<p>Data files of one million lines (totalling about 50MB for CSV and 60MB for
DKVP) were used. In experiments not shown here, I also varied the file sizes;
run times scaled linearly with file size, as expected, so I produced no
file-size-dependent plots.
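<p/>Files of this shape can be produced with a few lines of Python. The
following is a minimal sketch, not the generator actually used for these
tests. The sample rows show only <tt>pan</tt>, <tt>eks</tt>, and <tt>wye</tt>,
so the remaining two text values (<tt>zee</tt> and <tt>hat</tt> here) are
assumptions, as are the function names and the fixed seed.

```python
import random

# Values seen in the sample rows are pan, eks, wye; "zee" and "hat" are
# assumed stand-ins for the remaining two of the five text values.
KEYS = ["pan", "eks", "wye", "zee", "hat"]
FIELDS = ["a", "b", "i", "x", "y"]


def make_records(n, seed=0):
    """Yield n records, each a list of (field, value) pairs."""
    rng = random.Random(seed)
    for i in range(1, n + 1):
        values = [rng.choice(KEYS), rng.choice(KEYS), i, rng.random(), rng.random()]
        yield list(zip(FIELDS, values))


def to_dkvp(record):
    """Format one record as a DKVP line: key=value pairs joined by commas."""
    return ",".join(f"{k}={v}" for k, v in record)


def to_csv_row(record):
    """Format one record as a CSV data line (values only)."""
    return ",".join(str(v) for _, v in record)


def write_files(n, dkvp_path, csv_path, seed=0):
    """Write the same n records to a DKVP file and a CSV file."""
    with open(dkvp_path, "w") as dkvp, open(csv_path, "w") as csv:
        csv.write(",".join(FIELDS) + "\n")  # CSV header line
        for record in make_records(n, seed):
            dkvp.write(to_dkvp(record) + "\n")
            csv.write(to_csv_row(record) + "\n")
```

Calling <tt>write_files(1000000, "data.dkvp", "data.csv")</tt> would write
one million records to each file; the file names are placeholders.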

<a id="Comparands"/><h1>Comparands</h1>

The <tt>cat</tt>, <tt>cut</tt>, <tt>awk</tt>, <tt>sed</tt>, and <tt>sort</tt> tools
were compared to <tt>mlr</tt> on an 8-core Darwin laptop; RAM capacity was
nowhere near challenged. The <tt>catc</tt> program is a simple line-oriented
line printer (<a
href="https://github.com/johnkerl/miller/blob/master/c/tools/catc.c">source
here</a>) which is intermediate between Miller (which is record-aware as well
as line-aware) and <tt>cat</tt> (which is only byte-aware).

<a id="Raw_results"/><h1>Raw results</h1>

Note that for CSV data, the command is <tt>mlr --csv ...</tt> rather than <tt>mlr ...</tt>.

<p/>
<div class="pokipanel">
<pre>
   Mac     Mac         Comparand
   DKVP    CSV
  seconds seconds

   0.016   0.013       cat
   0.189   0.189       catc
   3.657   4.388       awk -F, '{print}'
   2.027   1.795       mlr cat

   2.292   1.940       cut -d , -f 1,4
   3.540   4.516       awk -F, '{print $1,$4}'
   1.600   1.390       mlr cut -f a,x
   1.694   1.648       mlr cut -x -f a,x

   0.845   0.643       sed -e 's/x=/EKS=/' -e 's/b=/BEE=/'
   2.076   1.842       mlr rename x,EKS,b,BEE

   5.643   5.031       awk -F, '{gsub("x=","",$4);gsub("y=","",$5);print $4+$5}'
   4.019   3.679       mlr put '$z=$x+$y'

   2.481   2.628       mlr stats1 -a mean -f x,y -g a,b

   2.587   2.389       mlr stats2 -a corr -f x,y -g a,b

  23.247  14.466       sort -t, -k 1,2
   3.023   5.658       mlr sort -f a,b

  17.224  15.523       sort -t, -k 4,5
   5.807   5.194       mlr sort -n x,y
</pre>
</div>
<p/>
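<p/>The timing method is not described on this page. One way to collect
comparable wall-clock numbers is sketched below, assuming the input file is
fed on standard input and output is discarded; the function name
<tt>time_command</tt> and the best-of-three policy are illustrative choices,
not what produced the table above.

```python
import subprocess
import time


def time_command(argv, input_path, repeats=3):
    """Return the best wall-clock seconds over `repeats` runs of argv.

    The input file is supplied on stdin and stdout is discarded, so
    terminal rendering does not inflate the measurement.
    """
    best = float("inf")
    for _ in range(repeats):
        with open(input_path, "rb") as stdin:
            start = time.perf_counter()
            subprocess.run(argv, stdin=stdin, stdout=subprocess.DEVNULL, check=True)
            elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best
```

For example, <tt>time_command(["mlr", "cat"], "data.dkvp")</tt> and
<tt>time_command(["awk", "-F,", "{print}"], "data.dkvp")</tt> would time two of
the comparands on a placeholder file <tt>data.dkvp</tt>.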

<a id="Analysis"/><h1>Analysis</h1>

<ul>
<li/> As expected, <tt>cat</tt> is very fast &mdash; it needs only to stream bytes as quickly as possible; it doesn&rsquo;t even need to touch individual bytes.
<li/> My <tt>catc</tt> is also faster than Miller: it needs to read and write lines, but it doesn&rsquo;t segment lines into records; in fact it does no iteration over bytes in each line.
<li/> Miller does not outperform <tt>sed</tt>, which is string-oriented rather than record-oriented.
<li/> For the tools which do need to pick apart fields (<tt>cut</tt>,
<tt>awk</tt>, <tt>sort</tt>), Miller is comparable to them or outperforms them. As noted above,
run times scale linearly with file size, so this holds across file sizes.
<li/> For univariate and bivariate statistics, I didn&rsquo;t attempt to
compare to other tools wherein such computations are less straightforward;
rather, I attempted only to show that Miller&rsquo;s processing time here is comparable to its own processing time for other problems.
</ul>
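<p/>The comparisons in this list can be made concrete by dividing each
tool&rsquo;s time by the time of the corresponding Miller command. The sketch
below copies the seconds from the raw-results table; the pairings and the
short labels are my own shorthand, and a ratio above 1 means Miller was faster.

```python
# (label, baseline (DKVP, CSV) seconds, Miller (DKVP, CSV) seconds),
# with all seconds copied from the raw-results table.
PAIRS = [
    ("awk print vs mlr cat", (3.657, 4.388), (2.027, 1.795)),
    ("cut vs mlr cut -f", (2.292, 1.940), (1.600, 1.390)),
    ("sed vs mlr rename", (0.845, 0.643), (2.076, 1.842)),
    ("awk sum vs mlr put", (5.643, 5.031), (4.019, 3.679)),
    ("sort (text) vs mlr sort -f", (23.247, 14.466), (3.023, 5.658)),
    ("sort (numeric) vs mlr sort -n", (17.224, 15.523), (5.807, 5.194)),
]


def ratios(pairs):
    """Return (label, DKVP ratio, CSV ratio); each ratio is baseline / Miller."""
    return [
        (label, base[0] / mlr[0], base[1] / mlr[1])
        for label, base, mlr in pairs
    ]


if __name__ == "__main__":
    for label, dkvp, csv in ratios(PAIRS):
        print(f"{label:32s} DKVP {dkvp:5.2f}x  CSV {csv:5.2f}x")
```

Only the <tt>sed</tt> row comes out below 1, which matches the observation
that Miller does not outperform <tt>sed</tt>.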

<a id="Conclusion"/><h1>Conclusion</h1>

For record-oriented data transformations, Miller meets or beats the Unix
toolkit in many contexts. Field renames are the exception: <tt>sed</tt> is
faster there, so renames are worth doing with <tt>sed</tt> as a pre-pipe or
post-pipe step.
    </div>
  </td>

</table>
</body>
</html>