miller/docs6/docs/_build/html/data-cleaning-examples.html

107 lines
No EOL
4.6 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Data-cleaning examples &#8212; Miller 6.0.0-alpha documentation</title>
<link rel="stylesheet" href="_static/scrolls.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/print.css" type="text/css" />
<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/doctools.js"></script>
<script src="_static/language_data.js"></script>
<script src="_static/theme_extras.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Statistics examples" href="statistics-examples.html" />
<link rel="prev" title="SQL examples" href="sql-examples.html" />
</head><body>
<div id="content">
<div class="header">
<h1 class="heading"><a href="index.html"
title="back to the documentation overview"><span>Data-cleaning examples</span></a></h1>
</div>
<div class="relnav" role="navigation" aria-label="related navigation">
<a href="sql-examples.html">&laquo; SQL examples</a> |
<a href="#">Data-cleaning examples</a>
| <a href="statistics-examples.html">Statistics examples &raquo;</a>
</div>
<div id="contentwrapper">
<div role="main">
<div class="section" id="data-cleaning-examples">
<h1>Data-cleaning examples<a class="headerlink" href="#data-cleaning-examples" title="Permalink to this headline"></a></h1>
<p>Here are some ways to use the type-checking options as described in <a class="reference internal" href="reference-dsl-variables.html#reference-dsl-type-tests-and-assertions"><span class="std std-ref">Type-test and type-assertion expressions</span></a> Suppose you have the following data file, with inconsistent typing for boolean. (Also imagine that, for the sake of discussion, we have a million-line file rather than a four-line file, so we cant see it all at once and some automation is called for.)</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> cat data/het-bool.csv
</span> name,reachable
barney,false
betty,true
fred,true
wilma,1
</pre></div>
</div>
<p>One option is to coerce everything to boolean, or integer:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --icsv --opprint put &#39;$reachable = boolean($reachable)&#39; data/het-bool.csv
</span> name reachable
barney false
betty true
fred true
wilma true
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --icsv --opprint put &#39;$reachable = int(boolean($reachable))&#39; data/het-bool.csv
</span> name reachable
barney 0
betty 1
fred 1
wilma 1
</pre></div>
</div>
<p>A second option is to flag badly formatted data within the output stream:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --icsv --opprint put &#39;$format_ok = is_string($reachable)&#39; data/het-bool.csv
</span> name reachable format_ok
barney false false
betty true false
fred true false
wilma 1 false
</pre></div>
</div>
<p>Or perhaps to flag badly formatted data outside the output stream:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --icsv --opprint put &#39;
</span><span class="hll"> if (!is_string($reachable)) {eprint &quot;Malformed at NR=&quot;.NR}
</span><span class="hll"> &#39; data/het-bool.csv
</span> Malformed at NR=1
Malformed at NR=2
Malformed at NR=3
Malformed at NR=4
name reachable
barney false
betty true
fred true
wilma 1
</pre></div>
</div>
<p>A third way is to abort the process on first instance of bad data:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll"> mlr --csv put &#39;$reachable = asserting_string($reachable)&#39; data/het-bool.csv
</span> Miller: is_string type-assertion failed at NR=1 FNR=1 FILENAME=data/het-bool.csv
</pre></div>
</div>
</div>
</div>
</div>
</div>
<div class="footer" role="contentinfo">
&#169; Copyright 2021, John Kerl.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.2.1.
</div>
</body>
</html>