mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 18:25:45 +00:00
391 lines
19 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
<html lang="en">
|
|
|
|
<!-- PAGE GENERATED FROM template.html and content-for-why.html BY poki. -->
|
|
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
|
|
<head>
|
|
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
|
|
<meta name="description" content="Miller documentation"/>
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
|
|
<meta name="keywords"
|
|
content="John Kerl, Kerl, Miller, miller, mlr, OLAP, data analysis software, regression, correlation, variance, data tools, " />
|
|
|
|
<title> Why? </title>
|
|
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
|
|
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
|
|
</head>
|
|
|
|
<!-- ================================================================ -->
|
|
<script type="text/javascript">
|
|
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
|
|
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
|
|
</script>
|
|
<script type="text/javascript">
|
|
try {
|
|
var pageTracker = _gat._getTracker("UA-15651652-1");
|
|
pageTracker._trackPageview();
|
|
} catch(err) {}
|
|
</script>
|
|
|
|
<!-- ================================================================ -->
|
|
<script type="text/javascript">
|
|
function toggle_div(div) {
|
|
if (div != null) {
|
|
if (div.id.startsWith("section_toggle_")) {
|
|
var state = div.style.display;
|
|
if (state == "block") {
|
|
div.style.display = "none";
|
|
} else {
|
|
div.style.display = "block";
|
|
}
|
|
}
|
|
}
|
|
}
|
|
function expand_div(div) {
|
|
if (div != null) {
|
|
if (div.id.startsWith("section_toggle_")) {
|
|
div.style.display = "block";
|
|
}
|
|
}
|
|
}
|
|
function collapse_div(div) {
|
|
if (div != null) {
|
|
if (div.id.startsWith("section_toggle_")) {
|
|
div.style.display = "none";
|
|
}
|
|
}
|
|
}
|
|
|
|
function toggle_by_name(divName) {
|
|
toggle_div(document.getElementById(divName));
|
|
}
|
|
function expand_by_name(divName) {
|
|
expand_div(document.getElementById(divName));
|
|
}
|
|
function collapse_by_name(divName) {
|
|
collapse_div(document.getElementById(divName));
|
|
}
|
|
|
|
function expand_all() {
|
|
var divs = document.getElementsByTagName("div");
|
|
for(var i = 0; i < divs.length; i++) {
|
|
expand_div(divs[i]);
|
|
}
|
|
}
|
|
function collapse_all() {
|
|
var divs = document.getElementsByTagName("div");
|
|
for(var i = 0; i < divs.length; i++){
|
|
collapse_div(divs[i]);
|
|
}
|
|
}
|
|
</script>
|
|
|
|
<!--
|
|
The background image is from a screenshot of a Google search for "data analysis
|
|
tools", lightened and sepia-toned. Over this was placed a Mac Terminal app with
|
|
very light-grey font and translucent background, in which a few statistical
|
|
Miller commands were run with pretty-print-tabular output format.
|
|
<body background="pix/sepia-overlay.jpg">
|
|
-->
|
|
<body bgcolor="#ffffff">
|
|
|
|
<!-- ================================================================ -->
|
|
<table width="100%">
|
|
<tr>
|
|
|
|
<!-- navbar -->
|
|
<td width="15%">
|
|
<!--
|
|
<img src="pix/mlr.jpg" />
|
|
<img style="border-width:1px; color:black;" src="pix/mlr.jpg" />
|
|
-->
|
|
|
|
<div class="pokinav">
|
|
<center><titleinbody>Miller</titleinbody></center>
|
|
|
|
<!-- PAGE LIST GENERATED FROM template.html BY poki -->
|
|
<br/><b>Overview:</b>
|
|
<br/>• <a href="index.html">About Miller</a>
|
|
<br/>• <a href="10-min.html">Miller in 10 minutes</a>
|
|
<br/>• <a href="file-formats.html">File formats</a>
|
|
<br/>• <a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a>
|
|
<br/>• <a href="record-heterogeneity.html">Record-heterogeneity</a>
|
|
<br/>• <a href="internationalization.html">Internationalization</a>
|
|
<br/><b>Using Miller:</b>
|
|
<br/>• <a href="faq.html">FAQ</a>
|
|
<br/>• <a href="cookbook.html">Cookbook part 1</a>
|
|
<br/>• <a href="cookbook2.html">Cookbook part 2</a>
|
|
<br/>• <a href="cookbook3.html">Cookbook part 3</a>
|
|
<br/>• <a href="data-examples.html">Data-diving examples</a>
|
|
<br/>• <a href="manpage.html">Manpage</a>
|
|
<br/>• <a href="reference.html">Reference</a>
|
|
<br/>• <a href="reference-verbs.html">Reference: Verbs</a>
|
|
<br/>• <a href="reference-dsl.html">Reference: DSL</a>
|
|
<br/>• <a href="release-docs.html">Documents by release</a>
|
|
<br/>• <a href="build.html">Installation, portability, dependencies, and testing</a>
|
|
<br/><b>Background:</b>
|
|
<br/>• <a href="why.html"><b>Why?</b></a>
|
|
<br/>• <a href="whyc.html">Why C?</a>
|
|
<br/>• <a href="etymology.html">Why call it Miller?</a>
|
|
<br/>• <a href="originality.html">How original is Miller?</a>
|
|
<br/>• <a href="performance.html">Performance</a>
|
|
<br/><b>Repository:</b>
|
|
<br/>• <a href="to-do.html">Things to do</a>
|
|
<br/>• <a href="contact.html">Contact information</a>
|
|
<br/>• <a href="https://github.com/johnkerl/miller">GitHub repo</a>
|
|
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
|
|
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
|
|
<br/> <br/> <br/> <br/> <br/> <br/>
|
|
</div>
|
|
</td>
|
|
|
|
<!-- page body -->
|
|
<td>
|
|
<!--
|
|
This is a visually gorgeous feature (here & in the CSS): it allows for
|
|
independent scroll of the nav and body panels. In particular the nav
|
|
stays on-screen as you scroll the body.
|
|
|
|
However, two problems:
|
|
|
|
(1) In Firefox & Chrome both I get janky end-of-body scrolls: there is
|
|
more content but I can't scroll down to it unless I repeatedly retry the
|
|
scrolldown. Which is weird.
|
|
|
|
(2) Worse, only the first page renders in PDF (again, Firefox & Chrome).
|
|
|
|
For now I'm disabling this separate-scroll feature. A frontender, I am
|
|
not ... maybe someday I'll find a config which gets *all* the features
|
|
I want; for now, it's a tradeoff.
|
|
-->
|
|
|
|
<!-- Implementation details: one bit is right here:
|
|
|
|
div style="overflow-y:scroll;height:1500px"
|
|
|
|
and the other bit is in css/poki-callbacks.css:
|
|
|
|
.pokinav {
|
|
display: inline-block;
|
|
background: #e8d9bc;
|
|
border: 1;
|
|
box-shadow: 0px 0px 3px 3px #C9C9C9;
|
|
margin: 10px;
|
|
padding-top: 10px;
|
|
padding-bottom: 10px;
|
|
padding-left: 10px;
|
|
padding-right: 10px;
|
|
overflow-y: scroll; < - - - - - - here
|
|
height: 1500px;
|
|
}
|
|
|
|
-->
|
|
<div>
|
|
<center> <titleinbody> Why? </titleinbody> </center>
|
|
<p/>
|
|
|
|
<!-- BODY COPIED FROM content-for-why.html BY poki -->
|
|
<div class="pokitoc">
|
|
<center><b>Contents:</b></center>
|
|
• <a href="#Who_is_Miller_for?">Who is Miller for?</a><br/>
|
|
• <a href="#What_was_Miller_created_to_do?">What was Miller created to do?</a><br/>
|
|
• <a href="#Tradeoffs">Tradeoffs</a><br/>
|
|
• <a href="#Related_tools">Related tools</a><br/>
|
|
• <a href="#Moving_forward">Moving forward</a><br/>
|
|
</div>
|
|
<p/>
|
|
|
|
<p/>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="expand_all();" href="javascript:;">Expand all sections</button>
|
|
<button style="font-weight:bold;color:maroon;border:0" onclick="collapse_all();" href="javascript:;">Collapse all sections</button>
|
|
|
|
<p/> Someone asked me the other day about design, tradeoffs, thought process,
|
|
why I felt it necessary to build Miller, etc. Here are some answers.
|
|
|
|
<a id="Who_is_Miller_for?"/><h1>Who is Miller for?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_who_is_miller_for');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_who_is_miller_for" style="display: block">
|
|
|
|
<p/> For background, I’m a software engineer, with a heavy devops bent
|
|
and a non-trivial amount of data-engineering in my career.
|
|
<boldmaroon>Initially I wrote Miller mainly for myself:</boldmaroon> I’m
|
|
coder-friendly (being a coder); I’m GitHub-friendly; most of my data are
|
|
well-structured or easily structurable (TSV-formatted SQL-query output, CSV
|
|
files, log files, JSON data structures); I care about interoperability between
|
|
all the various formats Miller supports (I’ve encountered them all); I do
|
|
all my work on Linux or OSX.
|
|
|
|
<p/> But now there’s this neat little tool <boldmaroon>which seems to be
|
|
useful for people in various disciplines</boldmaroon>. I don’t even know
|
|
entirely <i>who</i>. I can click through GitHub starrers and read a bit about
what they seem to do, but not everyone’s <i>on</i> GitHub (or stars
things). I’ve gotten a lot of feature requests through GitHub, but
only from people who are GitHub users. For sure, not everyone’s on Linux
or OSX (I have a Windows port underway). Not everyone’s a coder (it seems
like a lot of Miller’s GitHub starrers are devops folks like myself, or
data-science-ish people, or biology/genomics folks). A lot of people care 100%
|
|
about CSV. And so on.
|
|
|
|
<p/> So I wonder (please drop a note at <a
href="https://github.com/johnkerl/miller/issues">https://github.com/johnkerl/miller/issues</a>):
does Miller do what you need? Do you use it for all sorts of things, or just
|
|
one or two nice things? Are there things you wish it did but it doesn’t?
|
|
Is it almost there, or just nowhere near what you want? Are there not enough
|
|
features or way too many? Are the docs too complicated; do you have a hard time
|
|
finding out how to do what you want? Should I think differently about what this
|
|
tool even <i>is</i> in the first place? Should I think differently about who
|
|
it’s for?
|
|
|
|
</div>
|
|
<a id="What_was_Miller_created_to_do?"/><h1>What was Miller created to do?</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_what_was_miller_created_to_do');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_what_was_miller_created_to_do" style="display: block">
|
|
|
|
<p/> First: there are tools like <tt>xsv</tt>, which handles CSV marvelously, and
<tt>jq</tt>, which handles JSON marvelously, and so on. But over the
years of my career in the software industry I’ve found myself, and
|
|
others, doing a lot of ad-hoc things which really were fundamentally the same
|
|
<i>except</i> for format. So the number one thing about Miller is doing common
|
|
things while supporting <boldmaroon>multiple formats</boldmaroon>: (a) ingest a
|
|
list of records where a record is a list of key-value pairs (however
|
|
represented in the input files); (b) transform that stream of records; (c) emit
|
|
the transformed stream — either in the same format as input, or in a
|
|
different format.
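<p/> A minimal sketch of this ingest/transform/emit idea, written in Python
purely for illustration (it is not Miller’s C code, and the function names and
the <tt>rate</tt>/<tt>time</tt> fields are made up for the example): records are
read from CSV as key-value pairs, given one extra field, and written out as
JSON, one object per line.

<pre>
```python
import csv
import io
import json


def read_csv_records(text):
    """Ingest: yield each CSV row as a dict of key-value pairs."""
    for row in csv.DictReader(io.StringIO(text)):
        yield dict(row)


def add_distance(records):
    """Transform: attach a computed field to every record."""
    for record in records:
        record["distance"] = float(record["rate"]) * float(record["time"])
        yield record


def write_json_lines(records):
    """Emit: render each record as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical input data, used only to exercise the three stages.
csv_text = "rate,time\n2,3\n4,5\n"
output = write_json_lines(add_distance(read_csv_records(csv_text)))
print(output)
```
</pre>

<p/> Only the first and last functions know about a file format; the transform
in the middle sees nothing but key-value pairs, which is why the same transform
could sit between any reader and any writer.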
|
|
|
|
<p/> Second thing, a lot like the first: just as I didn’t want to build
|
|
something only for a single file format, I didn’t want to build something
|
|
only for one problem domain. In my work doing software engineering, devops,
|
|
data engineering, etc. I saw a lot of commonalities and I wanted to
|
|
<boldmaroon>solve as many problems simultaneously as possible</boldmaroon>.
|
|
|
|
<p/> Third: it had to be <boldmaroon>streaming</boldmaroon>. As time goes by
|
|
and we (some of us, sometimes) have machines with tens or hundreds of GB of
|
|
RAM, it’s maybe less important, but I’m unhappy with tools which
|
|
ingest all data, then do stuff, then emit all data. One reason is to be able to
|
|
handle files bigger than available RAM. Another reason is to be able to handle
|
|
input which trickles in, e.g. you have some process emitting data now and then
|
|
and you can pipe it to Miller and it will emit transformed records one at a
|
|
time.
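<p/> To make the streaming point concrete, here is a small Python sketch (an
illustration under assumptions, not how Miller is implemented): the
<tt>key=value</tt> line format, the <tt>trickle</tt> stand-in for a slow
producer, and the <tt>double_x</tt> transform are all hypothetical. Each record
is handed to the caller as soon as its line has been read, and nothing but the
current line is held in memory.

<pre>
```python
def parse_line(line):
    """Parse one 'key=value,key=value' line into a record (a dict)."""
    return dict(pair.split("=", 1) for pair in line.strip().split(","))


def stream(lines, transform):
    """Yield one transformed record per non-blank input line, as lines arrive."""
    for line in lines:
        if line.strip():
            yield transform(parse_line(line))


def double_x(record):
    """Hypothetical transform: add a field y equal to twice x."""
    record["y"] = 2 * int(record["x"])
    return record


def trickle():
    """Stand-in for a process that emits lines now and then."""
    yield "x=1,tag=a\n"
    yield "x=2,tag=b\n"


results = []
for record in stream(trickle(), double_x):
    # Each record is available here before the next line has been read.
    results.append(record)
    print(record)
```
</pre>

<p/> Because <tt>stream</tt> is a generator, the input could just as well be a
file larger than RAM or a pipe that never ends.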
|
|
|
|
<p/> Fourth: it had to be <boldmaroon>fast</boldmaroon>. This precludes all
|
|
sorts of very nice things written in Ruby, for example. I love Ruby as a very
|
|
expressive language, and I have several very useful little utility scripts
|
|
written in Ruby. But a few years ago I ported some of my old
tried-and-true C programs to Ruby, and the lines-of-code count was a <i>lot</i> lower.
It was great, until I ran them on multi-GB files and realized they took
60x as long to complete. So I couldn’t write Miller in Ruby, or in
|
|
languages like it. I was going to have to do something in a low-level language
|
|
in order to make it performant. I did simple experiments in several languages,
|
|
and nothing was as fast as C, so I used C: see also <a
|
|
href="whyc.html#C_vs._Go,_D,_Rust,_etc.;_C_is_fast">here</a>.
|
|
|
|
<p/> Fifth thing: I wanted Miller to be <boldmaroon>pipe-friendly and
|
|
interoperate with other command-line tools</boldmaroon>. Since the basic
|
|
paradigm is ingest records, transform records, emit records — where the
|
|
input and output formats can be the same or different, and the transform can be
|
|
complex, or just pass-through — this means you can use it to transform
|
|
data, or re-format it, or both. So if you just want to do
|
|
data-cleaning/prep/formatting and do all the "real" work in R, you can. If you
|
|
just want a little glue script between other tools you can get that. And if you
|
|
want to do non-trivial data-reduction in Miller you can.
|
|
|
|
<p/> Sixth thing: it must have <boldmaroon>comprehensive documentation and
unit tests</boldmaroon>. Since Miller handles a lot of formats and solves a lot
|
|
of problems, there’s a lot to test and a lot to keep working correctly as
|
|
I add features or optimize. And I wanted it to be able to explain itself
|
|
— not only through web docs like the one you’re reading but also
|
|
through <tt>man mlr</tt> and <tt>mlr --help</tt>, <tt>mlr sort --help</tt>,
|
|
etc.
|
|
|
|
<p/> Seventh thing: <boldmaroon>Must have a domain-specific
|
|
language</boldmaroon> (DSL) <boldmaroon>but also must let you do common things
|
|
without it</boldmaroon>. All those little verbs Miller has to help you
|
|
<i>avoid</i> having to write for-loops are great. I use them for
keystroke-saving: <tt>mlr stats1 -a mean,stddev,min,max -f quantity</tt>, for
example, computes all four statistics with no for-loops or accumulator variables to write.
|
|
But you also have to be able to break out of that and write arbitrary code when
|
|
you want to: <tt>mlr put '$distance = $rate * $time'</tt> or anything else you
|
|
can think up. In Perl/AWK/etc. it’s all DSL. In xsv et al. it’s
|
|
all verbs. In Miller I like having the combination.
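<p/> The difference between the two styles can be sketched in Python (again an
illustration only, not Miller’s implementation; the function names merely echo
the <tt>stats1</tt> and <tt>put</tt> verbs, and the sample rows are made up). A
verb hides the accumulator variables inside a ready-made helper, while the
DSL-style escape hatch runs whatever per-record code the caller supplies.

<pre>
```python
import math


def stats1(records, field):
    """Verb style: mean, sample stddev, min and max of one field in a single pass.

    Assumes at least two records, since the sample variance divides by n - 1.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    lo = math.inf
    hi = -math.inf
    for record in records:
        x = float(record[field])
        n += 1
        total += x
        total_sq += x * x
        lo = min(lo, x)
        hi = max(hi, x)
    mean = total / n
    var = (total_sq - n * mean * mean) / (n - 1)
    return {"mean": mean, "stddev": math.sqrt(var), "min": lo, "max": hi}


def put(records, assign):
    """DSL style: apply an arbitrary caller-supplied function to each record."""
    for record in records:
        assign(record)
        yield record


def set_distance(record):
    """Hypothetical per-record expression: distance = rate * time."""
    record["distance"] = float(record["rate"]) * float(record["time"])


# Made-up sample records.
rows = [
    {"quantity": "2", "rate": "3", "time": "1"},
    {"quantity": "4", "rate": "4", "time": "2"},
    {"quantity": "6", "rate": "5", "time": "3"},
]
summary = stats1(rows, "quantity")
with_distance = list(put(rows, set_distance))
print(summary)
print(with_distance)
```
</pre>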
|
|
|
|
<p/> Eighth thing: It’s an <boldmaroon>awful lot of fun to
|
|
write</boldmaroon>. I hadn’t found any tool which was at once
multi-format, streaming, efficient, and multi-purpose, with both DSL and non-DSL usage, so I
|
|
wrote one. But I don’t guarantee it’s unique in the world. It fills
|
|
a niche in the world (people use it) but it also fills a niche in my life.
|
|
|
|
</div>
|
|
<a id="Tradeoffs"/><h1>Tradeoffs</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_tradeoffs');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_tradeoffs" style="display: block">
|
|
|
|
<p/> Miller is command-line-only by design. People who want a graphical user
|
|
interface won’t find it here. This is in part (a) accommodating my
|
|
personal preferences, and in part (b) guided by my experience/belief that the
|
|
command line is very expressive. Steep learning curve, yes. I consider that
|
|
price worth paying.
|
|
|
|
<p/> Another tradeoff: supporting lists of records, each only one level
deep, keeps me supporting only what can be expressed in <i>all</i> of
the supported formats. E.g. in JSON you can have lists of lists of lists, which Miller
|
|
just doesn’t handle. So Miller can’t (and won’t) handle
|
|
arbitrary JSON because it only handles tabular data which can be expressed in a
|
|
variety of formats.
|
|
|
|
<p/> A third tradeoff is doing build-from-scratch in a low-level language.
|
|
It’d be quicker to write (but slower to run) if written in a high-level
|
|
language. If Miller were written in Python, it would be implemented in
|
|
significantly fewer lines of code than its current C implementation. The DSL
|
|
would just be an <tt>eval</tt> of Python code. And it would run slower, but
|
|
maybe not enough slower to be a problem for most folks. Later I found out about
|
|
the <a href="https://github.com/turicas/rows">rows</a> tool — if you find
|
|
Miller useful, you should check out <tt>rows</tt> as well.
|
|
|
|
<p/> A fourth tradeoff is in the DSL (more visibly so in 5.0.0 but already in
|
|
pre-5.0.0): how much to make it dynamically typed — so you can just say
|
|
y=x+1 with a minimum number of keystrokes — vs. having it do a good job
|
|
of telling you when you’ve made a typo. This tension is common across
<i>all</i> languages. In some, like Ruby, you don’t declare anything:
they’re quick for coding little things, but programs of even a few thousand
lines (which isn’t large in the software world) become insanely
unmanageable. Java is at the other extreme: it is very typesafe, but you
have to type in a lot of punctuation, angle brackets, datatypes, repetition,
etc. just to be able to get anything done. And some in the middle, like Go,
are typesafe but use type inference, aiming for the best of both. In the
|
|
Miller (5.0.0) DSL you get y=x+1 by default but you can have things like int y
|
|
= x+1 etc. so the typesafety is opt-in. See also <a
|
|
href="reference-dsl.html#Type-checking">here</a> for more information on
|
|
type-checking.
|
|
|
|
</div>
|
|
<a id="Related_tools"/><h1>Related tools</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_related_tools');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_related_tools" style="display: block">
|
|
|
|
<p/>Here’s a comprehensive list:
|
|
<a href="https://github.com/dbohdan/structured-text-tools">https://github.com/dbohdan/structured-text-tools</a>.
|
|
It doesn’t mention <a href="https://github.com/turicas/rows">rows</a> so here’s a plug for that as well.
|
|
|
|
</div>
|
|
<a id="Moving_forward"/><h1>Moving forward</h1>
|
|
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_moving_forward');" href="javascript:;">Toggle section visibility</button>
|
|
<div id="section_toggle_moving_forward" style="display: block">
|
|
|
|
<p/> I originally aimed Miller at people who already know what
|
|
<tt>sed</tt>/<tt>awk</tt>/<tt>cut</tt>/<tt>sort</tt>/<tt>join</tt> are and
|
|
wanted some options. But as time goes by I realize that tools like this can be
|
|
useful to folks who <i>don’t</i> know what those things are; people who
|
|
aren’t primarily coders; people who are scientists, or data scientists.
|
|
These days some journalists do data analysis. So moving forward in terms of
|
|
docs, I am working on having more cookbook, follow-by-example stuff in addition
|
|
to the existing language-reference kinds of stuff. And prioritizing a Windows
|
|
port — which is way overdue. And continuing to seek out input from people
|
|
who use Miller on where to go next.
|
|
|
|
</div>
|
|
</div>
|
|
</td>
|
|
|
|
</table>
|
|
</body>
|
|
</html>
|