miller/doc/performance.html
2020-01-29 21:18:25 -05:00

135 lines
5.3 KiB
HTML

<!DOCTYPE html>
<html lang="en">
<!-- PAGE GENERATED FROM template.html and content-for-performance.html BY poki. -->
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
<meta name="description" content="Miller documentation"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
<meta name="keywords"
content="John Kerl, Kerl, Miller, miller, mlr, OLAP, data analysis software, regression, correlation, variance, data tools, " />
<title> Performance </title>
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
</head>
<!-- ================================================================ -->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-15651652-1");
pageTracker._trackPageview();
} catch(err) {}
</script>
<!-- ================================================================ -->
<body bgcolor="#ffffff">
<!-- ================================================================ -->
<!-- navbar -->
<div class="pokinav">
<center><titleinbody>Miller</titleinbody></center>
<!-- NAVBAR GENERATED FROM template.html BY poki -->
<br/>
<a class="poki-navbar-element" href="index.html">Overview</a>
&nbsp;
<a class="poki-navbar-element" href="faq.html">Using</a>
&nbsp;
<a class="poki-navbar-element" href="reference.html">Reference</a>
&nbsp;
<a class="poki-navbar-element" href="why.html"><b>Background</b></a>
&nbsp;
<a class="poki-navbar-element" href="contact.html">Repository</a>
&nbsp;
<br/>
<br/><a href="why.html">Why?</a>
<br/><a href="whyc.html">Why C?</a>
<br/><a href="whyc-details.html">Why C: details</a>
<br/><a href="etymology.html">Why call it Miller?</a>
<br/><a href="originality.html">How original is Miller?</a>
<br/><a href="performance.html"><b>Performance</b></a>
</div>
<!-- page body -->
<p/>
<!-- BODY COPIED FROM content-for-performance.html BY poki -->
<div class="pokitoc">
<center><titleinbody>Performance</titleinbody></center>
&bull;&nbsp;<a href="#Disclaimer">Disclaimer</a><br/>
&bull;&nbsp;<a href="#Summary">Summary</a><br/>
</div>
<p/>
<p/>
<button style="font-weight:bold;color:maroon;border:0" onclick="bodyToggler.expandAll();" href="javascript:;">Expand all sections</button>
<button style="font-weight:bold;color:maroon;border:0" onclick="bodyToggler.collapseAll();" href="javascript:;">Collapse all sections</button>
<a id="Disclaimer"/><h1>Disclaimer</h1>
In a previous version of this page (see <a
href="http://johnkerl.org/miller-releases/miller-5.1.0/doc/performance.html">here</a>)
I compared Miller to some items in the Unix toolkit in terms of run time. But
such comparisons are very much not apples-to-apples:
<ul>
<li/> Miller&rsquo;s principal strength is that it handles <b>key-value data in
various formats</b> while the system tools <b>do not</b>. So if you time
<code>mlr sort</code> on a CSV file against system <code>sort</code>, it's not relevant
to say which is faster by how many percent &mdash; Miller will respect the
header line, leaving it in place, while the system sort will move it, sorting
it along with all the other header lines. This would be comparing the run times
of two programs produce different outputs. Likewise, <code>awk</code>
doesn&rsquo;t respect header lines, although you can code up some CSV-handling
using <code>if (NR==1) { ... } else { ... }</code>. And that&rsquo;s just CSV: I
don&rsquo;t know any simple way to get <code>sort</code>, <code>awk</code>, etc. to
handle DKVP, JSON, etc. &mdash; which is the main rreason I wrote Miller.
<li/> <b>Implementations differ by platform</b>: one <code>awk</code> may be
fundamentally faster than another, and <code>mawk</code> has a very efficient
bytecode implementation &mdash; which handles positionally indexed data
far faster than Miller does.
<li/> The system <code>sort</code> command will, on some systems, handle
too-large-for-RAM datasets by spilling to disk; Miller (as of version 5.2.0,
mid-2017) does not. Miller sorts are always stable; GNU supports stable and
unstable variants.
<li/> Etc.
</ul>
<a id="Summary"/><h1>Summary</h1>
<p/> Miller can do many kinds of processing on key-value-pair data using
elapsed time roughly of the same order of magnitude as items in the Unix
toolkit can handle positionally indexed data. Specific results vary widely
by platform, implementation details, multi-core use (or not). Lastly,
specific special-purpose non-record-aware processing will run far faster
in <code>grep</code>, <code>sed</code>, etc.
</div>
<!-- ================================================================ -->
<script type="text/javascript" src="js/miller-doc-toggler.js"></script>
<!-- wtf -->
<script type="text/javascript">
// Put this at the bottom of the page since its constructor scans the
// document's div tags to find the toggleables.
const bodyToggler = new MillerDocToggler(
"body_section_toggle_",
'maroon',
'maroon',
);
</script>
</body>
</html>