mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 18:25:45 +00:00
275 lines
15 KiB
HTML
<!DOCTYPE html>
<html lang="en">

<!-- PAGE GENERATED FROM template.html and content-for-why.html BY poki. -->
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
<meta name="description" content="Miller documentation"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
<meta name="keywords"
content="John Kerl, Kerl, Miller, miller, mlr, OLAP, data analysis software, regression, correlation, variance, data tools, " />

<title> Why? </title>
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
</head>

<!-- ================================================================ -->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-15651652-1");
pageTracker._trackPageview();
} catch(err) {}
</script>
<!-- ================================================================ -->

<body bgcolor="#ffffff">

<!-- ================================================================ -->

<!-- navbar -->
<div class="pokinav">
<center><titleinbody>Miller</titleinbody></center>

<!-- NAVBAR GENERATED FROM template.html BY poki -->
<br/>
<a class="poki-navbar-element" href="index.html">Overview</a>

<a class="poki-navbar-element" href="faq.html">Using</a>

<a class="poki-navbar-element" href="reference.html">Reference</a>

<a class="poki-navbar-element" href="why.html"><b>Background</b></a>

<a class="poki-navbar-element" href="contact.html">Repository</a>

<br/>
<br/><a href="why.html"><b>Why?</b></a>
<br/><a href="whyc.html">Why C?</a>
<br/><a href="whyc-details.html">Why C: details</a>
<br/><a href="etymology.html">Why call it Miller?</a>
<br/><a href="originality.html">How original is Miller?</a>
<br/><a href="performance.html">Performance</a>
</div>

<!-- page body -->
<p/>

<!-- BODY COPIED FROM content-for-why.html BY poki -->
<div class="pokitoc">
<center><titleinbody>Why?</titleinbody></center>
• <a href="#Who_is_Miller_for?">Who is Miller for?</a><br/>
• <a href="#What_was_Miller_created_to_do?">What was Miller created to do?</a><br/>
• <a href="#Tradeoffs">Tradeoffs</a><br/>
• <a href="#Related_tools">Related tools</a><br/>
• <a href="#Moving_forward">Moving forward</a><br/>
</div>
<p/>

<p/>
<button style="font-weight:bold;color:maroon;border:0" onclick="bodyToggler.expandAll();" href="javascript:;">Expand all sections</button>
<button style="font-weight:bold;color:maroon;border:0" onclick="bodyToggler.collapseAll();" href="javascript:;">Collapse all sections</button>

<p/> Someone asked me the other day about design, tradeoffs, thought process,
why I felt it necessary to build Miller, etc. Here are some answers.

<a id="Who_is_Miller_for?"/><h1>Who is Miller for?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_who_is_miller_for');" href="javascript:;">Toggle section visibility</button>
<div id="body_section_toggle_who_is_miller_for" style="display: block">

<p/> For background, I’m a software engineer with a heavy devops bent
and a non-trivial amount of data engineering in my career.
<span class="boldmaroon">Initially I wrote Miller mainly for myself:</span> I’m
coder-friendly (being a coder); I’m GitHub-friendly; most of my data are
well-structured or easily structurable (TSV-formatted SQL-query output, CSV
files, log files, JSON data structures); I care about interoperability between
all the various formats Miller supports (I’ve encountered them all); I do
all my work on Linux or OSX.

<p/> But now there’s this neat little tool <span class="boldmaroon">which seems to be
useful for people in various disciplines</span>. I don’t even know
entirely <i>who</i>. I can click through GitHub starrers and read a bit about
what they seem to do, but not everyone’s <i>on</i> GitHub (or stars
things). I’ve gotten a lot of feature requests through GitHub, but
only from people who are GitHub users. Not everyone’s a coder (it seems
like a lot of Miller’s GitHub starrers are devops folks like myself, or
data-science-ish people, or biology/genomics folks). A lot of people care 100%
about CSV. And so on.

<p/> So I wonder (please drop a note at <a
href="https://github.com/johnkerl/miller/issues">https://github.com/johnkerl/miller/issues</a>):
does Miller do what you need? Do you use it for all sorts of things, or just
one or two nice things? Are there things you wish it did but it doesn’t?
Is it almost there, or just nowhere near what you want? Are there not enough
features or way too many? Are the docs too complicated; do you have a hard time
finding out how to do what you want? Should I think differently about what this
tool even <i>is</i> in the first place? Should I think differently about who
it’s for?

</div>
<a id="What_was_Miller_created_to_do?"/><h1>What was Miller created to do?</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_what_was_miller_created_to_do');" href="javascript:;">Toggle section visibility</button>
<div id="body_section_toggle_what_was_miller_created_to_do" style="display: block">

<p/> First: there are tools like <code>xsv</code>, which handles CSV marvelously, and
<code>jq</code>, which handles JSON marvelously, and so on. But over the
years of my career in the software industry I’ve found myself, and
others, doing a lot of ad-hoc things which really were fundamentally the same
<i>except</i> for format. So the number one thing about Miller is doing common
things while supporting <span class="boldmaroon">multiple formats</span>: (a) ingest a
list of records where a record is a list of key-value pairs (however
represented in the input files); (b) transform that stream of records; (c) emit
the transformed stream, either in the same format as input or in a
different format.
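
<p/> To make the ingest/transform/emit idea concrete, here is a minimal sketch
in Python. It is not how Miller is implemented (Miller is written in C); the
function names <code>parse_key_value_line</code>, <code>add_distance</code>, and
<code>to_json</code> are made up for this illustration, and the parser assumes
simple <code>key=value,key=value</code> lines with no quoting or escaping.

<pre>
```python
import json


def parse_key_value_line(line):
    """Ingest: turn one 'key=value,key=value' line into a record (a dict)."""
    record = {}
    for pair in line.strip().split(","):
        key, _, value = pair.partition("=")
        record[key] = value
    return record


def add_distance(record):
    """Transform: derive a new field from two existing ones."""
    out = dict(record)
    out["distance"] = float(record["rate"]) * float(record["time"])
    return out


def to_json(record):
    """Emit: write the record in a different format than it arrived in."""
    return json.dumps(record)


if __name__ == "__main__":
    for line in ["rate=3,time=2", "rate=1.5,time=4"]:
        print(to_json(add_distance(parse_key_value_line(line))))
```
</pre>

<p/> The transform step only ever sees key-value records, so in a design like
this the reader or the writer could be swapped for another format without
touching it.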

<p/> Second thing, a lot like the first: just as I didn’t want to build
something only for a single file format, I didn’t want to build something
only for one problem domain. In my work doing software engineering, devops,
data engineering, etc., I saw a lot of commonalities and I wanted to
<span class="boldmaroon">solve as many problems simultaneously as possible</span>.

<p/> Third: it had to be <span class="boldmaroon">streaming</span>. As time goes by
and we (some of us, sometimes) have machines with tens or hundreds of GB of
RAM, it’s maybe less important, but I’m unhappy with tools which
ingest all data, then do stuff, then emit all data. One reason is to be able to
handle files bigger than available RAM. Another reason is to be able to handle
input which trickles in, e.g. you have some process emitting data now and then
and you can pipe it to Miller and it will emit transformed records one at a
time.
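
<p/> As a rough illustration of what streaming means here, the following
Python sketch reads, converts, and writes one record at a time instead of
loading the whole input first. It only illustrates the idea and is not Miller’s
code; the names <code>stream_records</code>, <code>format_record</code>, and
<code>run_filter</code> are hypothetical, and the input is assumed to be plain
<code>key=value,key=value</code> lines.

<pre>
```python
import sys


def stream_records(lines):
    """Lazily yield one record (a dict) per non-empty 'key=value,...' line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # partition() returns (key, "=", value); [::2] keeps key and value.
        yield dict(pair.partition("=")[::2] for pair in line.split(","))


def format_record(record):
    """Render a record back to 'key=value,...' text."""
    return ",".join(f"{key}={value}" for key, value in record.items())


def run_filter(source=None, sink=None):
    """Copy records from source to sink one at a time, flushing after each."""
    source = sys.stdin if source is None else source
    sink = sys.stdout if sink is None else sink
    for record in stream_records(source):
        sink.write(format_record(record) + "\n")
        sink.flush()  # emit each record as soon as it is ready
```
</pre>

<p/> Calling <code>run_filter()</code> with no arguments would read standard
input and write standard output, so it could sit in a pipeline. Because only
one record is held at a time, memory use does not grow with the size of the
input, and records from a slow producer are passed along as they arrive.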

<p/> Fourth: it had to be <span class="boldmaroon">fast</span>. This precludes all
sorts of very nice things written in Ruby, for example. I love Ruby as a very
expressive language, and I have several very useful little utility scripts
written in Ruby. But a few years ago I ported some of my old
tried-and-true C programs to Ruby and the lines-of-code count was a <i>lot</i> lower.
It was great! Until I ran them on multi-GB files and realized they took
60x as long to complete. So I couldn’t write Miller in Ruby, or in
languages like it. I was going to have to do something in a low-level language
in order to make it performant. I did simple experiments in several languages,
and nothing was as fast as C, so I used C: see also <a
href="whyc.html#C_vs._Go,_D,_Rust,_etc.;_C_is_fast">here</a>.

<p/> Fifth thing: I wanted Miller to be <span class="boldmaroon">pipe-friendly and
interoperate with other command-line tools</span>. Since the basic
paradigm is ingest records, transform records, emit records (where the
input and output formats can be the same or different, and the transform can be
complex or just pass-through), this means you can use it to transform
data, or re-format it, or both. So if you just want to do
data-cleaning/prep/formatting and do all the "real" work in R, you can. If you
just want a little glue script between other tools you can get that. And if you
want to do non-trivial data-reduction in Miller you can.

<p/> Sixth thing: Must have <span class="boldmaroon">comprehensive documentation and
unit tests</span>. Since Miller handles a lot of formats and solves a lot
of problems, there’s a lot to test and a lot to keep working correctly as
I add features or optimize. And I wanted it to be able to explain itself,
not only through web docs like the one you’re reading but also
through <code>man mlr</code> and <code>mlr --help</code>, <code>mlr sort --help</code>,
etc.

<p/> Seventh thing: <span class="boldmaroon">Must have a domain-specific
language</span> (DSL) <span class="boldmaroon">but also must let you do common things
without it</span>. All those little verbs Miller has to help you
<i>avoid</i> having to write for-loops are great. I use them for
keystroke-saving: <code>mlr stats1 -a mean,stddev,min,max -f quantity</code>, for
example, computes those statistics without you having to write for-loops or
define accumulator variables.
But you also have to be able to break out of that and write arbitrary code when
you want to: <code>mlr put '$distance = $rate * $time'</code> or anything else you
can think up. In Perl/AWK/etc. it’s all DSL. In xsv et al. it’s
all verbs. In Miller I like having the combination.
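
<p/> For a sense of what a verb like <code>stats1</code> saves you, here is
roughly the loop-and-accumulator code one would otherwise write by hand in a
general-purpose language. This is a hand-written Python sketch, not Miller’s
implementation: it leaves out the standard deviation for brevity, the function
name <code>summarize</code> is hypothetical, and the output keys are chosen for
this example rather than copied from Miller’s output field names.

<pre>
```python
def summarize(records, field):
    """Compute count, mean, min, and max of one numeric field in one pass."""
    count = 0
    total = 0.0
    lowest = None
    highest = None
    for record in records:
        if field not in record:
            continue  # skip records that lack the field
        value = float(record[field])
        count += 1
        total += value
        lowest = value if lowest is None else min(lowest, value)
        highest = value if highest is None else max(highest, value)
    if count == 0:
        return {}
    return {"count": count, "mean": total / count, "min": lowest, "max": highest}


if __name__ == "__main__":
    rows = [{"quantity": "2"}, {"quantity": "4"}, {"name": "no quantity here"}]
    print(summarize(rows, "quantity"))
```
</pre>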

<p/> Eighth thing: It’s an <span class="boldmaroon">awful lot of fun to
write</span>. In my experience I didn’t find any tools which are
multi-format, streaming, efficient, and multi-purpose, with both DSL and
non-DSL usage, so I
wrote one. But I don’t guarantee it’s unique in the world. It fills
a niche in the world (people use it) but it also fills a niche in my life.

</div>
<a id="Tradeoffs"/><h1>Tradeoffs</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_tradeoffs');" href="javascript:;">Toggle section visibility</button>
<div id="body_section_toggle_tradeoffs" style="display: block">

<p/> Miller is command-line-only by design. People who want a graphical user
interface won’t find it here. This is in part (a) accommodating my
personal preferences, and in part (b) guided by my experience/belief that the
command line is very expressive. Steep learning curve, yes. I consider that
price worth paying.

<p/> Another tradeoff: supporting lists of records, each with only one
level of depth, keeps me supporting only what can be expressed in <i>all</i> of
those formats. E.g. in JSON you can have lists of lists of lists which Miller
just doesn’t handle. So Miller can’t (and won’t) handle
arbitrary JSON because it only handles tabular data which can be expressed in a
variety of formats.

<p/> A third tradeoff is doing build-from-scratch in a low-level language.
It’d be quicker to write (but slower to run) if written in a high-level
language. If Miller were written in Python, it would be implemented in
significantly fewer lines of code than its current C implementation. The DSL
would just be an <code>eval</code> of Python code. And it would run slower, but
maybe not enough slower to be a problem for most folks. Later I found out about
the <a href="https://github.com/turicas/rows">rows</a> tool: if you find
Miller useful, you should check out <code>rows</code> as well.

<p/> A fourth tradeoff is in the DSL (more visibly so in 5.0.0 but already in
pre-5.0.0): how much to make it dynamically typed, so you can just say
<code>y=x+1</code> with a minimum number of keystrokes, vs. having it do a good job
of telling you when you’ve made a typo. This is a common tradeoff across
<i>all</i> languages. In some, like Ruby, you don’t declare anything:
they’re quick to code little things in, but programs of even a few thousand
lines (which isn’t large in the software world) become insanely
unmanageable. Then there’s Java at the other extreme, which is very typesafe but
makes you type in a lot of punctuation, angle brackets, datatypes, repetition,
etc. just to be able to get anything done. And there are some in the middle, like Go, which
are typesafe but use type inference, aiming for the best of both. In the
Miller (5.0.0) DSL you get <code>y=x+1</code> by default but you can have things like
<code>int y = x+1</code> etc., so the typesafety is opt-in. See also <a
href="reference-dsl.html#Type-checking">here</a> for more information on
type-checking.

</div>
<a id="Related_tools"/><h1>Related tools</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_related_tools');" href="javascript:;">Toggle section visibility</button>
<div id="body_section_toggle_related_tools" style="display: block">

<p/>Here’s a comprehensive list:
<a href="https://github.com/dbohdan/structured-text-tools">https://github.com/dbohdan/structured-text-tools</a>.
It doesn’t mention <a href="https://github.com/turicas/rows">rows</a> so here’s a plug for that as well.

</div>
<a id="Moving_forward"/><h1>Moving forward</h1>
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="bodyToggler.toggle('body_section_toggle_moving_forward');" href="javascript:;">Toggle section visibility</button>
<div id="body_section_toggle_moving_forward" style="display: block">

<p/> I originally aimed Miller at people who already know what
<code>sed</code>/<code>awk</code>/<code>cut</code>/<code>sort</code>/<code>join</code> are and
wanted some more options. But as time goes by I realize that tools like this can be
useful to folks who <i>don’t</i> know what those things are; people who
aren’t primarily coders; people who are scientists, or data scientists.
These days some journalists do data analysis. So moving forward in terms of
docs, I am working on having more cookbook, follow-by-example material in addition
to the existing language-reference material. I am also continuing to seek out
input from people who use Miller on where to go next.

</div>

<!-- ================================================================ -->
<script type="text/javascript" src="js/miller-doc-toggler.js"></script>
<!-- wtf -->
<script type="text/javascript">
// Put this at the bottom of the page since its constructor scans the
// document's div tags to find the toggleables.
const bodyToggler = new MillerDocToggler(
"body_section_toggle_",
'maroon',
'maroon',
);
</script>

</body>
</html>