miller/docs6/docs/_build/html/why.html

99 lines
No EOL
14 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Why? &#8212; Miller 6.0.0-alpha documentation</title>
<link rel="stylesheet" href="_static/scrolls.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/print.css" type="text/css" />
<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/doctools.js"></script>
<script src="_static/language_data.js"></script>
<script src="_static/theme_extras.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Why call it Miller?" href="etymology.html" />
<link rel="prev" title="Miscellaneous examples" href="misc-examples.html" />
</head><body>
<div id="content">
<div class="header">
<h1 class="heading"><a href="index.html"
title="back to the documentation overview"><span>Why?</span></a></h1>
</div>
<div class="relnav" role="navigation" aria-label="related navigation">
<a href="misc-examples.html">&laquo; Miscellaneous examples</a> |
<a href="#">Why?</a>
| <a href="etymology.html">Why call it Miller? &raquo;</a>
</div>
<div id="contentwrapper">
<div id="toc" role="navigation" aria-label="table of contents navigation">
<h3>Table of Contents</h3>
<ul>
<li><a class="reference internal" href="#">Why?</a><ul>
<li><a class="reference internal" href="#who-is-miller-for">Who is Miller for?</a></li>
<li><a class="reference internal" href="#what-was-miller-created-to-do">What was Miller created to do?</a></li>
<li><a class="reference internal" href="#tradeoffs">Tradeoffs</a></li>
<li><a class="reference internal" href="#related-tools">Related tools</a></li>
<li><a class="reference internal" href="#moving-forward">Moving forward</a></li>
</ul>
</li>
</ul>
</div>
<div role="main">
<div class="section" id="why">
<h1>Why?<a class="headerlink" href="#why" title="Permalink to this headline"></a></h1>
<p>Someone asked me the other day about design, tradeoffs, thought process, why I felt it necessary to build Miller, etc. Here are some answers.</p>
<div class="section" id="who-is-miller-for">
<h2>Who is Miller for?<a class="headerlink" href="#who-is-miller-for" title="Permalink to this headline"></a></h2>
<p>For background, Im a software engineer, with a heavy devops bent and a non-trivial amount of data-engineering in my career. <strong>Initially I wrote Miller mainly for myself:</strong> Im coder-friendly (being a coder); Im Github-friendly; most of my data are well-structured or easily structurable (TSV-formatted SQL-query output, CSV files, log files, JSON data structures); I care about interoperability between all the various formats Miller supports (Ive encountered them all); I do all my work on Linux or OSX.</p>
<p>But now theres this neat little tool <strong>which seems to be useful for people in various disciplines</strong>. I dont even know entirely <em>who</em>. I can click through Github starrers and read a bit about what they seem to do, but not everyone that uses Miller is even <em>on</em> Github (or stars things). Ive gotten a lot of feature requests through Github but only from people who are Github users. Not everyones a coder (it seems like a lot of Millers Github starrers are devops folks like myself, or data-science-ish people, or biology/genomics folks.) A lot of people care 100% about CSV. And so on.</p>
<p>So I wonder (please drop a note at <a class="reference external" href="https://github.com/johnkerl/miller/issues">https://github.com/johnkerl/miller/issues</a>) does Miller do what you need? Do you use it for all sorts of things, or just one or two nice things? Are there things you wish it did but it doesnt? Is it almost there, or just nowhere near what you want? Are there not enough features or way too many? Are the docs too complicated; do you have a hard time finding out how to do what you want? Should I think differently about what this tool even <em>is</em> in the first place? Should I think differently about who its for?</p>
</div>
<div class="section" id="what-was-miller-created-to-do">
<h2>What was Miller created to do?<a class="headerlink" href="#what-was-miller-created-to-do" title="Permalink to this headline"></a></h2>
<p>First: there are tools like <code class="docutils literal notranslate"><span class="pre">xsv</span></code> which handles CSV marvelously and <code class="docutils literal notranslate"><span class="pre">jq</span></code> which handles JSON marvelously, and so on but I over the years of my career in the software industry Ive found myself, and others, doing a lot of ad-hoc things which really were fundamentally the same <em>except</em> for format. So the number one thing about Miller is doing common things while supporting <strong>multiple formats</strong>: (a) ingest a list of records where a record is a list of key-value pairs (however represented in the input files); (b) transform that stream of records; (c) emit the transformed stream either in the same format as input, or in a different format.</p>
<p>Second thing, a lot like the first: just as I didnt want to build something only for a single file format, I didnt want to build something only for one problem domain. In my work doing software engineering, devops, data engineering, etc. I saw a lot of commonalities and I wanted to <strong>solve as many problems simultaneously as possible</strong>.</p>
<p>Third: it had to be <strong>streaming</strong>. As time goes by and we (some of us, sometimes) have machines with tens or hundreds of GB of RAM, its maybe less important, but Im unhappy with tools which ingest all data, then do stuff, then emit all data. One reason is to be able to handle files bigger than available RAM. Another reason is to be able to handle input which trickles in, e.g. you have some process emitting data now and then and you can pipe it to Miller and it will emit transformed records one at a time.</p>
<p>Fourth: it had to be <strong>fast</strong>. This precludes all sorts of very nice things written in Ruby, for example. I love Ruby as a very expressive language, and I have several very useful little utility scripts written in Ruby. But a few years ago I ported over some of my old tried-and-true C programs and the lines-of-code count was a <em>lot</em> lower it was great! Until I ran them on multi-GB files and realized they took 60x as long to complete. So I couldnt write Miller in Ruby, or in languages like it. I was going to have to do something in a low-level language in order to make it performant.</p>
<p>Fifth thing: I wanted Miller to be <strong>pipe-friendly and interoperate with other command-line tools</strong>. Since the basic paradigm is ingest records, transform records, emit records where the input and output formats can be the same or different, and the transform can be complex, or just pass-through this means you can use it to transform data, or re-format it, or both. So if you just want to do data-cleaning/prep/formatting and do all the “real” work in R, you can. If you just want a little glue script between other tools you can get that. And if you want to do non-trivial data-reduction in Miller you can.</p>
<p>Sixth thing: Must have <strong>comprehensive documentation and unit-test</strong>. Since Miller handles a lot of formats and solves a lot of problems, theres a lot to test and a lot to keep working correctly as I add features or optimize. And I wanted it to be able to explain itself not only through web docs like the one youre reading but also through <code class="docutils literal notranslate"><span class="pre">man</span> <span class="pre">mlr</span></code> and <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">--help</span></code>, <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">sort</span> <span class="pre">--help</span></code>, etc.</p>
<p>Seventh thing: <strong>Must have a domain-specific language</strong> (DSL) <strong>but also must let you do common things without it</strong>. All those little verbs Miller has to help you <em>avoid</em> having to write for-loops are great. I use them for keystroke-saving: <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">stats1</span> <span class="pre">-a</span> <span class="pre">mean,stddev,min,max</span> <span class="pre">-f</span> <span class="pre">quantity</span></code>, for example, without you having to write for-loops or define accumulator variables. But you also have to be able to break out of that and write arbitrary code when you want to: <code class="docutils literal notranslate"><span class="pre">mlr</span> <span class="pre">put</span> <span class="pre">'$distance</span> <span class="pre">=</span> <span class="pre">$rate</span> <span class="pre">*</span> <span class="pre">$time'</span></code> or anything else you can think up. In Perl/AWK/etc. its all DSL. In xsv et al. its all verbs. In Miller I like having the combination.</p>
<p>Eighth thing: Its an <strong>awful lot of fun to write</strong>. In my experience I didnt find any tools which do multi-format, streaming, efficient, multi-purpose, with DSL and non-DSL, so I wrote one. But I dont guarantee its unique in the world. It fills a niche in the world (people use it) but it also fills a niche in my life.</p>
</div>
<div class="section" id="tradeoffs">
<h2>Tradeoffs<a class="headerlink" href="#tradeoffs" title="Permalink to this headline"></a></h2>
<p>Miller is command-line-only by design. People who want a graphical user interface wont find it here. This is in part (a) accommodating my personal preferences, and in part (b) guided by my experience/belief that the command line is very expressive. Steep learning curve, yes. I consider that price worth paying.</p>
<p>Another tradeoff: supporting lists of records each with only one depth keeps me supporting only what can be expressed in <em>all</em> of those formats. E.g. in JSON you can have lists of lists of lists which Miller just doesnt handle. So Miller cant (and wont) handle arbitrary JSON because it only handles tabular data which can be expressed in a variety of formats.</p>
<p>A third tradeoff is doing build-from-scratch in a low-level language. Itd be quicker to write (but slower to run) if written in a high-level language. If Miller were written in Python, it would be implemented in significantly fewer lines of code than its current Go implementation. The DSL would just be an <code class="docutils literal notranslate"><span class="pre">eval</span></code> of Python code. And it would run slower, but maybe not enough slower to be a problem for most folks. Later I found out about the <a class="reference external" href="https://github.com/turicas/rows">rows</a> tool if you find Miller useful, you should check out <code class="docutils literal notranslate"><span class="pre">rows</span></code> as well.</p>
<p>A fourth tradeoff is in the DSL (more visibly so in 5.0.0 but already in pre-5.0.0): how much to make it dynamically typed so you can just say y=x+1 with a minimum number of keystrokes vs. having it do a good job of telling you when youve made a typo. This is a common paradigm across <em>all</em> languages. Some like Ruby you dont declare anything and theyre quick to code little stuff in but programs of even a few thousand lines (which isnt large in the software world) become insanely unmanageable. Then Java at the other extreme which is very typesafe but you have to type in a lot of punctuation, angle brackets, datatypes, repetition, etc. just to be able to get anything done. And some in the middle like Go which are typesafe but with type inference which aim to do the best of both. In the Miller (5.0.0) DSL you get <code class="docutils literal notranslate"><span class="pre">y=x+1</span></code> by default but you can have things like <code class="docutils literal notranslate"><span class="pre">int</span> <span class="pre">y</span> <span class="pre">=</span> <span class="pre">x+1</span></code> etc. so the typesafety is opt-in. See also <a class="reference internal" href="reference-dsl-variables.html#reference-dsl-type-checking"><span class="std std-ref">Type-checking</span></a> for more information on type-checking.</p>
</div>
<div class="section" id="related-tools">
<h2>Related tools<a class="headerlink" href="#related-tools" title="Permalink to this headline"></a></h2>
<p>Heres a comprehensive list: <a class="reference external" href="https://github.com/dbohdan/structured-text-tools">https://github.com/dbohdan/structured-text-tools</a>. It doesnt mention <a class="reference external" href="https://github.com/turicas/rows">rows</a> so heres a plug for that as well.</p>
</div>
<div class="section" id="moving-forward">
<h2>Moving forward<a class="headerlink" href="#moving-forward" title="Permalink to this headline"></a></h2>
<p>I originally aimed Miller at people who already know what <code class="docutils literal notranslate"><span class="pre">sed</span></code>/<code class="docutils literal notranslate"><span class="pre">awk</span></code>/<code class="docutils literal notranslate"><span class="pre">cut</span></code>/<code class="docutils literal notranslate"><span class="pre">sort</span></code>/<code class="docutils literal notranslate"><span class="pre">join</span></code> are and wanted some options. But as time goes by I realize that tools like this can be useful to folks who <em>dont</em> know what those things are; people who arent primarily coders; people who are scientists, or data scientists. These days some journalists do data analysis. So moving forward in terms of docs, I am working on having more cookbook, follow-by-example stuff in addition to the existing language-reference kinds of stuff. And continuing to seek out input from people who use Miller on where to go next.</p>
</div>
</div>
</div>
</div>
</div>
<div class="footer" role="contentinfo">
&#169; Copyright 2021, John Kerl.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.2.1.
</div>
</body>
</html>