miller/doc/data-examples.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">

<!-- PAGE GENERATED FROM template.html and content-for-data-examples.html BY poki. -->
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
<head>
  <meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
  <meta name="description" content="Miller documentation"/>
  <meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
  <title> Data examples </title>
  <link rel="stylesheet" type="text/css" href="css/miller.css"/>
  <link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
</head>

<!-- ================================================================ -->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-15651652-1");
pageTracker._trackPageview();
} catch(err) {}</script>

<!--
The background image is from a screenshot of a Google search for "data analysis
tools", lightened and sepia-toned. Over this was placed a Mac Terminal app with
very light-grey font and translucent background, in which a few statistical
Miller commands were run with pretty-print-tabular output format.
-->
<body background="pix/sepia-overlay.jpg">

<!-- ================================================================ -->
<table width="100%">
<tr>

  <!-- navbar -->
  <td width="15%">
    <div class="pokinav">
      <center><titleinbody>Miller</titleinbody></center>

<!-- PAGE LIST GENERATED FROM template.html BY poki -->
<br/>User info:
<br/>&bull;&nbsp;<a href="index.html">About</a>
<br/>&bull;&nbsp;<a href="file-formats.html">File formats</a>
<br/>&bull;&nbsp;<a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a>
<br/>&bull;&nbsp;<a href="record-heterogeneity.html">Record-heterogeneity</a>
<br/>&bull;&nbsp;<a href="performance.html">Performance</a>
<br/>&bull;&nbsp;<a href="etymology.html">Why call it Miller?</a>
<br/>&bull;&nbsp;<a href="originality.html">How original is Miller?</a>
<br/>&bull;&nbsp;<a href="reference.html">Reference</a>
<br/>&bull;&nbsp;<a href="data-examples.html">Data examples</a>
<br/>&bull;&nbsp;<a href="to-do.html">Things to do</a>
<br/>Developer info:
<br/>&bull;&nbsp;<a href="build.html">Compiling, portability, dependencies, and testing</a>
<br/>&bull;&nbsp;<a href="whyc.html">Why C?</a>
<br/>&bull;&nbsp;<a href="contact.html">Contact information</a>
<br/>&bull;&nbsp;<a href="https://github.com/johnkerl/miller">GitHub repo</a>
      <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
      <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
      <br/> <br/> <br/> <br/> <br/> <br/>
    </div>
  </td>

  <!-- page body -->
  <td>
    <center> <titleinbody> Data examples </titleinbody> </center>
    <p>

<!-- BODY COPIED FROM content-for-data-examples.html BY poki -->
<div class="pokitoc">
<center><b>Contents:</b></center>
&bull;&nbsp;<a href="#flins_data">flins data</a><br/>
&bull;&nbsp;<a href="#Color/shape_data">Color/shape data</a><br/>
&bull;&nbsp;<a href="#Program_timing">Program timing</a><br/>
</div>
<p/>
<h1>flins data</h1>  <a id="flins_data"/>
<p/> The <a href="data/flins.csv">flins.csv</a> file is some sample data
obtained from <a href="https://support.spatialkey.com/spatialkey-sample-csv-data">https://support.spatialkey.com/spatialkey-sample-csv-data</a>.

<p/>Vertical-tabular format is good for a quick look at CSV data layout &mdash; seeing what columns you have to work with:
<p/>
<div class="pokipanel">
<pre>
$ head -n 2 data/flins.csv | mlr --icsv --oxtab cat
policyID           119736
statecode          FL
county             CLAY COUNTY
eq_site_limit      498960
hu_site_limit      498960
fl_site_limit      498960
fr_site_limit      498960
tiv_2011           498960
tiv_2012           792148.9
eq_site_deductible 0
hu_site_deductible 9979.2
fl_site_deductible 0
fr_site_deductible 0
point_latitude     30.102261
point_longitude    -81.711777
line               Residential
construction       Masonry
point_granularity  1
</pre>
</div>
<p/>
<p/> A few simple queries:
<p/>
<div class="pokipanel">
<pre>
$ cat data/flins.csv | mlr --icsv --opprint count-distinct -f county | head
county              count
CLAY COUNTY         363
SUWANNEE COUNTY     154
NASSAU COUNTY       135
COLUMBIA COUNTY     125
ST  JOHNS COUNTY    657
BAKER COUNTY        70
BRADFORD COUNTY     31
HAMILTON COUNTY     35
UNION COUNTY        15
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ cat data/flins.csv | mlr --icsv --opprint count-distinct -f construction,line
construction        line        count
Masonry             Residential 9257
Wood                Residential 21581
Reinforced Concrete Commercial  1299
Reinforced Masonry  Commercial  4225
Steel Frame         Commercial  272
</pre>
</div>
<p/>
<p/> Categorization of total insured value:
<p/>
<div class="pokipanel">
<pre>
$ cat data/flins.csv | mlr --icsv --opprint stats1 -a min,avg,max -f tiv_2012
tiv_2012_min tiv_2012_avg   tiv_2012_max
73.370000    2571004.097342 1701000000.000000
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ cat data/flins.csv | mlr --icsv --opprint stats1 -a min,avg,max -f tiv_2012 -g construction,line
construction        line        tiv_2012_min    tiv_2012_avg     tiv_2012_max
Masonry             Residential 261168.070000   1041986.129217   3234970.920000
Wood                Residential 73.370000       113493.017049    649046.120000
Reinforced Concrete Commercial  6416016.010000  20212428.681840  60570000.000000
Reinforced Masonry  Commercial  1287817.340000  4621372.981117   16650000.000000
Steel Frame         Commercial  29790000.000000 133492500.000000 1701000000.000000
</pre>
</div>
<p/>

<p/> xxx more Q's after sort -nr and quantiles:
<pre>
cat flins.csv | mlr --icsv --oxtab stats1 -a min,avg,max -f eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible
echo
cat flins.csv | mlr --icsv --oxtab stats1 -a min,avg,max -f eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible -g county
echo
cat flins.csv | mlr --icsv --opprint stats2 -a corr,linreg-ols,r2 -f eq_site_deductible,tiv_2012 -g county
echo
cat flins.csv | mlr --icsv --opprint stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
echo
cat flins.csv | mlr --icsv --opprint stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county
</pre>

<p/> xxx plaintext hardlinks

<h1>Color/shape data</h1>  <a id="Color/shape_data"/>
<p/> The <a href="data/colored-shapes.dkvp">colored-shapes.dkvp</a> file is some sample data produced by the
<a href="https://github.com/johnkerl/miller/blob/master/doc/datagen/mkdat2">mkdat2</a> script. The idea is
<ul>
<li> Produce some data with known distributions and correlations, and verify that Miller recovers those properties empirically.
<li> Each record is labeled with one of a few colors and one of a few shapes.
<li> The <tt>flag</tt> field is 0 or 1, with probability dependent on color
<li> The <tt>u</tt> field is plain uniform on the unit interval.
<li> The <tt>v</tt> field is the same, except tightly correlated with <tt>u</tt> for red circles.
<li> The <tt>w</tt> field is autocorrelated for each color/shape pair.
<li> The <tt>x</tt> field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.
</ul>

<p/> Peek at the data:
<p/>
<div class="pokipanel">
<pre>
$ wc -l data/colored-shapes.dkvp
  100000 data/colored-shapes.dkvp
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ head -n 6 data/colored-shapes.dkvp | mlr --opprint cat
color  shape    flag i u                  v                   w                    x
green  circle   1    1 0.3333681345949695 0.21376251997142037 0.500764608334147    5.335248996137616
red    square   0    2 0.9698237361677843 0.49622821725976396 0.8040548802590068   4.860269966304398
yellow circle   1    3 0.9879523154172577 0.8487346180375205  0.6423100990686053   5.277708912462331
red    circle   0    4 0.4622496635675988 0.5046956653096718  0.13094461569540355  5.370828092344023
blue   circle   1    5 0.9992738066015238 0.08847422779789516 0.020449531719886935 4.712152492660485
purple triangle 0    6 0.983871885308542  0.5390716976565088  0.8211678541134113   3.873918555931081
</pre>
</div>
<p/>

<p/> Look at uncategorized stats (using <a href="https://github.com/johnkerl/scripts/blob/master/fundam/creach"><tt>creach</tt></a> for spacing).
Here it looks reasonable that <tt>u</tt> is unit-uniform; something&rsquo;s up with <tt>v</tt> but we can't yet see what:
<p/>
<div class="pokipanel">
<pre>
$ mlr --oxtab stats1 -a min,avg,max -f flag,u,v data/colored-shapes.dkvp | creach 3
flag_min 0.000000
flag_avg 0.397960
flag_max 1.000000

u_min    0.000010
u_avg    0.501086
u_max    0.999983

v_min    -0.096195
v_avg    0.500016
v_max    1.095540
</pre>
</div>
<p/>
<p/>The histogram shows the different distribution of 0/1 flags:
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp
bin_lo    bin_hi   flag_count u_count v_count
-0.100000 0.000000 60204      0       287
0.000000  0.100000 0          10089   9613
0.100000  0.200000 0          9844    10022
0.200000  0.300000 0          9976    10115
0.300000  0.400000 0          9923    10027
0.400000  0.500000 0          9918    9988
0.500000  0.600000 0          10017   9934
0.600000  0.700000 0          9998    9968
0.700000  0.800000 0          10202   10124
0.800000  0.900000 0          10003   9970
0.900000  1.000000 39796      10030   9650
1.000000  1.100000 0          0       302
</pre>
</div>
<p/>

<p/> Look at univariate stats by color and shape. In particular,
color-dependent flag probabilities pop out, aligning with their original
Bernoulli probablities from the data-generator script:

<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint stats1 -a min,avg,max -f flag,u,v -g color then sort color data/colored-shapes.dkvp
color  flag_min flag_avg flag_max u_min    u_avg    u_max    v_min     v_avg    v_max
blue   0.000000 0.599086 1.000000 0.000044 0.503581 0.999976 0.000057  0.496972 0.999870
green  0.000000 0.207102 1.000000 0.000010 0.504358 0.999983 0.000079  0.498956 0.999889
orange 0.000000 0.505079 1.000000 0.000464 0.496712 0.999574 0.000185  0.493220 0.999967
purple 0.000000 0.098957 1.000000 0.000037 0.498311 0.999693 0.000011  0.502892 0.999993
red    0.000000 0.297820 1.000000 0.000019 0.499904 0.999946 -0.096195 0.500635 1.095540
yellow 0.000000 0.898270 1.000000 0.000327 0.502877 0.999923 0.000138  0.501022 0.999983
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint stats1 -a min,avg,max -f flag,u,v -g shape then sort shape data/colored-shapes.dkvp
shape    flag_min flag_avg flag_max u_min    u_avg    u_max    v_min     v_avg    v_max
circle   0.000000 0.398115 1.000000 0.000037 0.502212 0.999976 -0.096195 0.496996 1.095540
square   0.000000 0.397398 1.000000 0.000037 0.499935 0.999970 0.000088  0.500389 0.999975
triangle 0.000000 0.398542 1.000000 0.000010 0.501676 0.999983 0.000010  0.501804 0.999995
</pre>
</div>
<p/>

<p/> Look at bivariate stats by color and shape. In particular, <tt>u,v</tt> pairwise correlation for red circles pops out (xxx sort -n oppo):
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp
u_v_corr w_x_corr
0.115078 0.000020
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint --right stats2 -a corr -f u,v,w,x -g color,shape then sort u_v_corr data/colored-shapes.dkvp
 color    shape  u_v_corr  w_x_corr
   red triangle -0.000349  0.003179
yellow triangle -0.001735  0.032633
yellow   circle -0.002083 -0.012606
purple triangle -0.004290  0.006341
 green   square -0.006983  0.001269
purple   square -0.011754 -0.003372
  blue   circle -0.017771  0.014975
  blue   square -0.019537 -0.006952
 green   circle -0.034313 -0.049415
orange triangle  0.000620 -0.013529
yellow   square  0.001905 -0.002849
orange   square  0.005342  0.042003
   red   square  0.006553 -0.007587
 green triangle  0.010633 -0.025945
purple   circle  0.021701  0.037687
orange   circle  0.021871 -0.040626
  blue triangle  0.029301  0.017138
   red   circle  0.980343  0.004277
</pre>
</div>
<p/>

<h1>Program timing</h1>  <a id="Program_timing"/>
This admittedly artificial example demonstrates using Miller time and stats
functions to introspectly acquire some information about Miller&rsquo;s own
runtime. The <tt>delta</tt> function computes the difference between successive
timestamps..

<p/>
<div class="pokipanel">
<pre>
$ ruby -e '10000.times{|i|puts "i=#{i+1}"}' &gt; lines.txt

$ head -n 5 lines.txt
i=1
i=2
i=3
i=4
i=5

mlr --ofmt '%.9le' --opprint put '$t=systime()' then step -a delta -f t lines.txt | head -n 7
i     t                 t_delta
1     1430603027.018016 1.430603027e+09
2     1430603027.018043 2.694129944e-05
3     1430603027.018048 5.006790161e-06
4     1430603027.018052 4.053115845e-06
5     1430603027.018055 2.861022949e-06
6     1430603027.018058 3.099441528e-06

mlr --ofmt '%.9le' --oxtab \
  put '$t=systime()' then \
  step -a delta -f t then \
  filter '$i&gt;1' then \
  stats1 -a min,avg,max -f t_delta \
  lines.txt
t_delta_min 2.861022949e-06
t_delta_avg 4.077508505e-06
t_delta_max 5.388259888e-05
</pre>
</div>
<p/>
  </td>

</table>
</body>
</html>