<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Data-cleaning examples — Miller 6.0.0-alpha documentation</title>

<link rel="stylesheet" href="_static/scrolls.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/print.css" type="text/css" />

<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/doctools.js"></script>
<script src="_static/language_data.js"></script>
<script src="_static/theme_extras.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Statistics examples" href="statistics-examples.html" />
<link rel="prev" title="SQL examples" href="sql-examples.html" />
</head><body>
<div id="content">
<div class="header">
<h1 class="heading"><a href="index.html"
title="back to the documentation overview"><span>Data-cleaning examples</span></a></h1>
</div>
<div class="relnav" role="navigation" aria-label="related navigation">
<a href="sql-examples.html">« SQL examples</a> |
<a href="#">Data-cleaning examples</a>
| <a href="statistics-examples.html">Statistics examples »</a>
</div>
<div id="contentwrapper">
<div role="main">

<div class="section" id="data-cleaning-examples">
<h1>Data-cleaning examples<a class="headerlink" href="#data-cleaning-examples" title="Permalink to this headline">¶</a></h1>
<p>Here are some ways to use the type-checking options described in <a class="reference internal" href="reference-dsl-variables.html#reference-dsl-type-tests-and-assertions"><span class="std std-ref">Type-test and type-assertion expressions</span></a>. Suppose you have the following data file, with inconsistent typing for a boolean field. (Imagine also that, for the sake of discussion, the file has a million records rather than four, so we can’t see it all at once and some automation is called for.)</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll">cat data/het-bool.csv
</span>name,reachable
barney,false
betty,true
fred,true
wilma,1
</pre></div>
</div>
<p>One option is to coerce everything to boolean, or to integer:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll">mlr --icsv --opprint put '$reachable = boolean($reachable)' data/het-bool.csv
</span>name   reachable
barney false
betty  true
fred   true
wilma  true
</pre></div>
</div>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll">mlr --icsv --opprint put '$reachable = int(boolean($reachable))' data/het-bool.csv
</span>name   reachable
barney 0
betty  1
fred   1
wilma  1
</pre></div>
</div>
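<p>For readers who want to see the coercion rule spelled out, here is a minimal Python sketch of the same idea. It is not Miller's implementation. The rule in the hypothetical <code>to_bool</code> helper is an assumption chosen to reproduce the output above: the text <code>true</code> and <code>false</code> map directly, and anything else is parsed as a number that counts as true when nonzero. The records are inlined so the sketch runs without <code>data/het-bool.csv</code>.</p>

```python
import csv
import io

# Same four records as data/het-bool.csv, inlined so the sketch is self-contained.
DATA = "name,reachable\nbarney,false\nbetty,true\nfred,true\nwilma,1\n"


def to_bool(text):
    """Coerce one CSV field to a bool.

    Assumed rule (not taken from Miller's source): "true"/"false" map
    directly; any other text must parse as a number, true when nonzero.
    """
    if text == "true":
        return True
    if text == "false":
        return False
    return float(text) != 0.0  # raises ValueError for non-numeric text


def coerce_rows(csv_text):
    """Return (name, as_bool, as_int) for every record in the CSV text."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        flag = to_bool(record["reachable"])
        rows.append((record["name"], flag, int(flag)))
    return rows


if __name__ == "__main__":
    for name, flag, as_int in coerce_rows(DATA):
        print(name, str(flag).lower(), as_int)
```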
<p>A second option is to flag badly formatted data within the output stream:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll">mlr --icsv --opprint put '$format_ok = is_string($reachable)' data/het-bool.csv
</span>name   reachable format_ok
barney false     false
betty  true      false
fred   true      false
wilma  1         false
</pre></div>
</div>
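<p>The in-stream flagging pattern can also be sketched in Python. Note that this sketch uses a different, stricter test than <code>is_string</code>: a field counts as well formatted only if its text is exactly <code>true</code> or <code>false</code>. That definition, and the names <code>BOOLEAN_LITERALS</code> and <code>flag_rows</code>, are assumptions made for illustration, so its flags need not match the Miller output above.</p>

```python
import csv
import io

# Same four records as data/het-bool.csv, inlined so the sketch is self-contained.
DATA = "name,reachable\nbarney,false\nbetty,true\nfred,true\nwilma,1\n"

# Hypothetical definition of "well formatted" for this sketch only.
BOOLEAN_LITERALS = {"true", "false"}


def flag_rows(csv_text):
    """Return each record with an added format_ok field.

    format_ok is True only when the reachable field is literally
    "true" or "false"; the record itself is passed through unchanged.
    """
    flagged = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = dict(record)
        row["format_ok"] = row["reachable"] in BOOLEAN_LITERALS
        flagged.append(row)
    return flagged


if __name__ == "__main__":
    for row in flag_rows(DATA):
        print(row["name"], row["reachable"], str(row["format_ok"]).lower())
```

<p>Keeping the flag in the record, rather than dropping bad records, leaves the decision about what to do with them to a later step.</p>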
<p>Alternatively, flag badly formatted data outside the output stream, on standard error:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll">mlr --icsv --opprint put '
</span><span class="hll">  if (!is_string($reachable)) {eprint "Malformed at NR=".NR}
</span><span class="hll">' data/het-bool.csv
</span>Malformed at NR=1
Malformed at NR=2
Malformed at NR=3
Malformed at NR=4
name   reachable
barney false
betty  true
fred   true
wilma  1
</pre></div>
</div>
<p>A third option is to abort the process at the first instance of bad data:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span><span class="hll">mlr --csv put '$reachable = asserting_string($reachable)' data/het-bool.csv
</span>Miller: is_string type-assertion failed at NR=1 FNR=1 FILENAME=data/het-bool.csv
</pre></div>
</div>
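<p>The fail-fast pattern looks like this as a Python sketch. As an assumption for illustration, the hypothetical <code>assert_boolean_literals</code> function checks for the literal text <code>true</code> or <code>false</code> rather than reproducing <code>asserting_string</code>, so it stops at a different record than the Miller command above. The record counter starts at 1 and skips the header line, in the same spirit as <code>NR</code>.</p>

```python
import csv
import io

# Same four records as data/het-bool.csv, inlined so the sketch is self-contained.
DATA = "name,reachable\nbarney,false\nbetty,true\nfred,true\nwilma,1\n"


def assert_boolean_literals(csv_text):
    """Return all records, or raise ValueError at the first malformed one.

    A record is malformed here when its reachable field is not literally
    "true" or "false" (a check assumed for this sketch only).
    """
    records = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for nr, record in enumerate(reader, start=1):
        if record["reachable"] not in ("true", "false"):
            raise ValueError(
                f"reachable={record['reachable']!r} is not true/false at NR={nr}"
            )
        records.append(record)
    return records


if __name__ == "__main__":
    try:
        assert_boolean_literals(DATA)
    except ValueError as err:
        # Processing stops here; later records are never examined.
        print(f"aborting: {err}")
```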
</div>


</div>
</div>
</div>

<div class="footer" role="contentinfo">
© Copyright 2021, John Kerl.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.2.1.
</div>
</body>
</html>