mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 18:25:45 +00:00
333 lines
12 KiB
HTML
333 lines
12 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
<html lang="en">
|
|
|
|
<!-- PAGE GENERATED FROM template.html and content-for-performance.html BY poki. -->
|
|
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
|
|
<head>
|
|
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
|
|
<meta name="description" content="Miller documentation"/>
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
|
|
<meta name="keywords"
|
|
content="John Kerl, Kerl, Miller, miller, mlr, OLAP, data analysis software, regression, correlation, variance, data tools, " />
|
|
|
|
<title> Performance </title>
|
|
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
|
|
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
|
|
</head>
|
|
|
|
<!-- ================================================================ -->
|
|
<script type="text/javascript">
|
|
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
|
|
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
|
|
</script>
|
|
<script type="text/javascript">
|
|
try {
|
|
var pageTracker = _gat._getTracker("UA-15651652-1");
|
|
pageTracker._trackPageview();
|
|
} catch(err) {}
|
|
</script>
|
|
|
|
<!-- ================================================================ -->
|
|
<script type="text/javascript">
|
|
function toggle_div(div) {
|
|
if (div != null) {
|
|
if (div.id.startsWith("section_toggle_")) {
|
|
var state = div.style.display;
|
|
if (state == "block") {
|
|
div.style.display = "none";
|
|
} else {
|
|
div.style.display = "block";
|
|
}
|
|
}
|
|
}
|
|
}
|
|
function expand_div(div) {
|
|
if (div != null) {
|
|
if (div.id.startsWith("section_toggle_")) {
|
|
div.style.display = "block";
|
|
}
|
|
}
|
|
}
|
|
function collapse_div(div) {
|
|
if (div != null) {
|
|
if (div.id.startsWith("section_toggle_")) {
|
|
div.style.display = "none";
|
|
}
|
|
}
|
|
}
|
|
|
|
function toggle_by_name(divName) {
|
|
toggle_div(document.getElementById(divName));
|
|
}
|
|
function expand_by_name(divName) {
|
|
expand_div(document.getElementById(divName));
|
|
}
|
|
function collapse_by_name(divName) {
|
|
collapse_div(document.getElementById(divName));
|
|
}
|
|
|
|
function expand_all() {
|
|
var divs = document.getElementsByTagName("div");
|
|
for(var i = 0; i < divs.length; i++) {
|
|
expand_div(divs[i]);
|
|
}
|
|
}
|
|
function collapse_all() {
|
|
var divs = document.getElementsByTagName("div");
|
|
for(var i = 0; i < divs.length; i++){
|
|
collapse_div(divs[i]);
|
|
}
|
|
}
|
|
</script>
|
|
|
|
<!--
|
|
The background image is from a screenshot of a Google search for "data analysis
|
|
tools", lightened and sepia-toned. Over this was placed a Mac Terminal app with
|
|
very light-grey font and translucent background, in which a few statistical
|
|
Miller commands were run with pretty-print-tabular output format.
|
|
<body background="pix/sepia-overlay.jpg">
|
|
-->
|
|
<body bgcolor="#ffffff">
|
|
|
|
<!-- ================================================================ -->
|
|
<table width="100%">
|
|
<tr>
|
|
|
|
<!-- navbar -->
|
|
<td width="15%">
|
|
<!--
|
|
<img src="pix/mlr.jpg" />
|
|
<img style="border-width:1px; color:black;" src="pix/mlr.jpg" />
|
|
-->
|
|
|
|
<div class="pokinav">
|
|
<center><titleinbody>Miller</titleinbody></center>
|
|
|
|
<!-- PAGE LIST GENERATED FROM template.html BY poki -->
|
|
<br/><b>Overview:</b>
|
|
<br/>• <a href="index.html">About Miller</a>
|
|
<br/>• <a href="10-min.html">Miller in 10 minutes</a>
|
|
<br/>• <a href="file-formats.html">File formats</a>
|
|
<br/>• <a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a>
|
|
<br/>• <a href="record-heterogeneity.html">Record-heterogeneity</a>
|
|
<br/>• <a href="internationalization.html">Internationalization</a>
|
|
<br/><b>Using Miller:</b>
|
|
<br/>• <a href="faq.html">FAQ</a>
|
|
<br/>• <a href="cookbook.html">Cookbook part 1</a>
|
|
<br/>• <a href="cookbook2.html">Cookbook part 2</a>
|
|
<br/>• <a href="cookbook3.html">Cookbook part 3</a>
|
|
<br/>• <a href="data-examples.html">Data-diving examples</a>
|
|
<br/>• <a href="manpage.html">Manpage</a>
|
|
<br/>• <a href="reference.html">Reference</a>
|
|
<br/>• <a href="reference-verbs.html">Reference: Verbs</a>
|
|
<br/>• <a href="reference-dsl.html">Reference: DSL</a>
|
|
<br/>• <a href="release-docs.html">Documents by release</a>
|
|
<br/>• <a href="build.html">Installation, portability, dependencies, and testing</a>
|
|
<br/><b>Background:</b>
|
|
<br/>• <a href="why.html">Why?</a>
|
|
<br/>• <a href="whyc.html">Why C?</a>
|
|
<br/>• <a href="etymology.html">Why call it Miller?</a>
|
|
<br/>• <a href="originality.html">How original is Miller?</a>
|
|
<br/>• <a href="performance.html"><b>Performance</b></a>
|
|
<br/><b>Repository:</b>
|
|
<br/>• <a href="to-do.html">Things to do</a>
|
|
<br/>• <a href="contact.html">Contact information</a>
|
|
<br/>• <a href="https://github.com/johnkerl/miller">GitHub repo</a>
|
|
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
|
|
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
|
|
<br/> <br/> <br/> <br/> <br/> <br/>
|
|
</div>
|
|
</td>
|
|
|
|
<!-- page body -->
|
|
<td>
|
|
<!--
|
|
This is a visually gorgeous feature (here & in the CSS): it allows for
|
|
independent scroll of the nav and body panels. In particular the nav
|
|
stays on-screen as you scroll the body.
|
|
|
|
However, two problems:
|
|
|
|
(1) In Firefox & Chrome both I get janky end-of-body scrolls: there is
|
|
more content but I can't scroll down to it unless I repeatedly retry the
|
|
scrolldown. Which is weird.
|
|
|
|
(2) Worse, only the first page renders in PDF (again, Firefox & Chrome).
|
|
|
|
For now I'm disabling this separate-scroll feature. A frontender, I am
|
|
not ... maybe someday I'll find a config which gets *all* the features
|
|
I want; for now, it's a tradeoff.
|
|
-->
|
|
|
|
<!-- Implementation details: one bit is right here:
|
|
|
|
div style="overflow-y:scroll;height:1500px"
|
|
|
|
and the other bit is in css/poki-callbacks.css:
|
|
|
|
.pokinav {
|
|
display: inline-block;
|
|
background: #e8d9bc;
|
|
border: 1;
|
|
box-shadow: 0px 0px 3px 3px #C9C9C9;
|
|
margin: 10px;
|
|
padding-top: 10px;
|
|
padding-bottom: 10px;
|
|
padding-left: 10px;
|
|
padding-right: 10px;
|
|
overflow-y: scroll; < - - - - - - here
|
|
height: 1500px;
|
|
}
|
|
|
|
-->
|
|
<div>
|
|
<center> <titleinbody> Performance </titleinbody> </center>
|
|
<p/>
|
|
|
|
<!-- BODY COPIED FROM content-for-performance.html BY poki -->
|
|
<div class="pokitoc">
|
|
<center><b>Contents:</b></center>
|
|
• <a href="#Data">Data</a><br/>
|
|
• <a href="#Comparands">Comparands</a><br/>
|
|
• <a href="#Raw_results">Raw results</a><br/>
|
|
• <a href="#Analysis">Analysis</a><br/>
|
|
• <a href="#Conclusion">Conclusion</a><br/>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="expand_all();" href="javascript:;">Expand all sections</button>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="collapse_all();" href="javascript:;">Collapse all sections</button>
|
|
|
|
<a id="Data"/><h1>Data</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_data');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_data" style="display: block">
|
|
|
|
Test data were of the form
|
|
|
|
<table><tr><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
|
|
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
|
|
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
|
|
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
|
|
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td><td>
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
a,b,i,x,y
|
|
pan,pan,1,0.3467901443380824,0.7268028627434533
|
|
eks,pan,2,0.7586799647899636,0.5221511083334797
|
|
wye,wye,3,0.20460330576630303,0.33831852551664776
|
|
eks,wye,4,0.38139939387114097,0.13418874328430463
|
|
wye,pan,5,0.5732889198020006,0.8636244699032729
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
</td></tr></table>
|
|
|
|
for DKVP and CSV, respectively, where fields <tt>a</tt> and <tt>b</tt> take one of five text values,
|
|
uniformly distributed; <tt>i</tt> is a 1-up line counter; <tt>x</tt> and <tt>y</tt>
|
|
are independent uniformly distributed floating-point numbers in the unit
|
|
interval.
|
|
|
|
<p>Data files of one million lines (totalling about 50MB for CSV and 60MB for
|
|
DKVP) were used. In experiments not shown here, I also varied the file sizes;
|
|
the size-dependent results were the expected, completely unsurprising
|
|
linearities and so I produced no file-size-dependent plots for your viewing pleasure.
|
|
|
|
</div>
|
|
<a id="Comparands"/><h1>Comparands</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_comparands');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_comparands" style="display: block">
|
|
|
|
The <tt>cat</tt>, <tt>cut</tt>, <tt>awk</tt>, <tt>sed</tt>, <tt>sort</tt> tools
|
|
were compared to <tt>mlr</tt> on an 8-core Darwin laptop; RAM capacity was
|
|
nowhere near challenged . The <tt>catc</tt> program is a simple line-oriented
|
|
line-printer (<a
|
|
href="https://github.com/johnkerl/miller/blob/master/c/tools/catc.c">source
|
|
here</a>) which is intermediate between Miller (which is record-aware as well
|
|
as line-aware) and <tt>cat</tt> (which is only byte-aware).
|
|
|
|
</div>
|
|
<a id="Raw_results"/><h1>Raw results</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_raw_results');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_raw_results" style="display: block">
|
|
|
|
Note that for CSV data, the command is <tt>mlr --csvlite ...</tt> rather than <tt>mlr ...</tt>.
|
|
|
|
<p/>
|
|
<div class="pokipanel">
|
|
<pre>
|
|
Mac Mac Comparand
|
|
DKVP CSV
|
|
seconds seconds
|
|
|
|
0.016 0.013 cat
|
|
0.189 0.189 catc
|
|
3.657 4.388 awk -F, '{print}'
|
|
2.027 1.795 mlr cat
|
|
|
|
2.292 1.940 cut -d , -f 1,4
|
|
3.540 4.516 awk -F, '{print $1,$4}'
|
|
1.600 1.390 mlr cut -f a,x
|
|
1.694 1.648 mlr cut -x -f a,x
|
|
|
|
0.845 0.643 sed -e 's/x=/EKS=/' -e 's/b=/BEE=/'
|
|
2.076 1.842 mlr rename x,EKS,b,BEE
|
|
|
|
5.643 5.031 awk -F, '{gsub("x=","",$4);gsub("y=","",$5);print $4+$5}'
|
|
4.019 3.679 mlr put '$z=$x+$y'
|
|
|
|
2.481 2.628 mlr stats1 -a mean -f x,y -g a,b
|
|
|
|
2.587 2.389 mlr stats2 -a corr -f x,y -g a,b
|
|
|
|
23.247 14.466 sort -t, -k 1,2
|
|
3.023 5.658 mlr sort -f a,b
|
|
|
|
17.224 15.523 sort -t, -k 4,5
|
|
5.807 5.194 mlr sort -n x,y
|
|
</pre>
|
|
</div>
|
|
<p/>
|
|
|
|
</div>
|
|
<a id="Analysis"/><h1>Analysis</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_analysis');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_analysis" style="display: block">
|
|
|
|
<ul>
|
|
<li/> As expected, <tt>cat</tt> is very fast — it needs only stream bytes as quickly as possible; it doesn’t even need to touch individual bytes.
|
|
<li/> My <tt>catc</tt> is also faster than Miller: it needs to read and write lines, but it doesn’t segment lines into records; in fact it does no iteration over bytes in each line.
|
|
<li/> Miller does not outperform <tt>sed</tt>, which is string-oriented rather than record-oriented.
|
|
<li/> For the tools which do need to pick apart fields (<tt>cut</tt>,
|
|
<tt>awk</tt>, <tt>sort</tt>), Miller is comparable or outperforms. As noted above, this effect
|
|
persists linearly across file sizes.
|
|
<li/> For univariate and bivariate statistics, I didn’t attempt to
|
|
compare to other tools wherein such computations are less straightforward;
|
|
rather, I attempted only to show that Miller’s processing time here is comparable to its own processing time for other problems.
|
|
</ul>
|
|
|
|
</div>
|
|
<a id="Conclusion"/><h1>Conclusion</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_conclusion');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_conclusion" style="display: block">
|
|
|
|
For record-oriented data transformations, Miller meets or beats the Unix
|
|
toolkit in many contexts. Field renames in particular are worth doing as a
|
|
pre-pipe or post-pipe using <tt>sed</tt>.
|
|
|
|
</div>
|
|
</div>
|
|
</td>
|
|
|
|
</table>
|
|
</body>
|
|
</html>
|