mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 18:25:45 +00:00
368 lines
13 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">

<!-- PAGE GENERATED FROM template.html and content-for-data-examples.html BY poki. -->
<!-- PLEASE MAKE CHANGES THERE AND THEN RE-RUN poki. -->
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
<meta name="description" content="Miller documentation"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/> <!-- mobile-friendly -->
<title> Data examples </title>
<link rel="stylesheet" type="text/css" href="css/miller.css"/>
<link rel="stylesheet" type="text/css" href="css/poki-callbacks.css"/>
</head>

<!-- ================================================================ -->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-15651652-1");
pageTracker._trackPageview();
} catch(err) {}</script>

<!--
The background image is from a screenshot of a Google search for "data analysis
tools", lightened and sepia-toned. Over this was placed a Mac Terminal app with
very light-grey font and translucent background, in which a few statistical
Miller commands were run with pretty-print-tabular output format.
-->
<body background="pix/sepia-overlay.jpg">

<!-- ================================================================ -->
<table width="100%">
<tr>

<!-- navbar -->
<td width="15%">
<div class="pokinav">
<center><titleinbody>Miller</titleinbody></center>

<!-- PAGE LIST GENERATED FROM template.html BY poki -->
<br/>User info:
<br/>&bull;&nbsp;<a href="index.html">About</a>
<br/>&bull;&nbsp;<a href="file-formats.html">File formats</a>
<br/>&bull;&nbsp;<a href="feature-comparison.html">Miller features in the context of the Unix toolkit</a>
<br/>&bull;&nbsp;<a href="record-heterogeneity.html">Record-heterogeneity</a>
<br/>&bull;&nbsp;<a href="performance.html">Performance</a>
<br/>&bull;&nbsp;<a href="etymology.html">Why call it Miller?</a>
<br/>&bull;&nbsp;<a href="originality.html">How original is Miller?</a>
<br/>&bull;&nbsp;<a href="reference.html">Reference</a>
<br/>&bull;&nbsp;<a href="data-examples.html">Data examples</a>
<br/>&bull;&nbsp;<a href="to-do.html">Things to do</a>
<br/>Developer info:
<br/>&bull;&nbsp;<a href="build.html">Compiling, portability, dependencies, and testing</a>
<br/>&bull;&nbsp;<a href="whyc.html">Why C?</a>
<br/>&bull;&nbsp;<a href="contact.html">Contact information</a>
<br/>&bull;&nbsp;<a href="https://github.com/johnkerl/miller">GitHub repo</a>
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
<br/> <br/> <br/> <br/> <br/> <br/>
</div>
</td>

<!-- page body -->
<td>
<center> <titleinbody> Data examples </titleinbody> </center>
<p>

<!-- BODY COPIED FROM content-for-data-examples.html BY poki -->
<div class="pokitoc">
<center><b>Contents:</b></center>
&bull;&nbsp;<a href="#flins_data">flins data</a><br/>
&bull;&nbsp;<a href="#Color/shape_data">Color/shape data</a><br/>
&bull;&nbsp;<a href="#Program_timing">Program timing</a><br/>
</div>
<p/>
<h1>flins data</h1> <a id="flins_data"/>
<p/> The <a href="data/flins.csv">flins.csv</a> file is some sample data
obtained from <a href="https://support.spatialkey.com/spatialkey-sample-csv-data">https://support.spatialkey.com/spatialkey-sample-csv-data</a>.

<p/>Vertical-tabular format is good for a quick look at CSV data layout &mdash; seeing what columns you have to work with:
<p/>
<div class="pokipanel">
<pre>
$ head -n 2 data/flins.csv | mlr --icsv --oxtab cat
policyID           119736
statecode          FL
county             CLAY COUNTY
eq_site_limit      498960
hu_site_limit      498960
fl_site_limit      498960
fr_site_limit      498960
tiv_2011           498960
tiv_2012           792148.9
eq_site_deductible 0
hu_site_deductible 9979.2
fl_site_deductible 0
fr_site_deductible 0
point_latitude     30.102261
point_longitude    -81.711777
line               Residential
construction       Masonry
point_granularity  1
</pre>
</div>
<p/>
<p/> A few simple queries:
<p/>
<div class="pokipanel">
<pre>
$ cat data/flins.csv | mlr --icsv --opprint count-distinct -f county | head
county          count
CLAY COUNTY     363
SUWANNEE COUNTY 154
NASSAU COUNTY   135
COLUMBIA COUNTY 125
ST JOHNS COUNTY 657
BAKER COUNTY    70
BRADFORD COUNTY 31
HAMILTON COUNTY 35
UNION COUNTY    15
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ cat data/flins.csv | mlr --icsv --opprint count-distinct -f construction,line
construction        line        count
Masonry             Residential 9257
Wood                Residential 21581
Reinforced Concrete Commercial  1299
Reinforced Masonry  Commercial  4225
Steel Frame         Commercial  272
</pre>
</div>
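<p/> For readers who want to see what the <tt>count-distinct</tt> queries above compute, here is a minimal Python sketch. It is not Miller's implementation; the function name <tt>count_distinct</tt> and the three-row sample are made up for illustration, and the first-seen ordering of the results is an assumption based on the unsorted output shown above.

```python
import csv
import io
from collections import Counter


def count_distinct(rows, fields):
    """Count how often each distinct combination of `fields` occurs.

    Counter keeps keys in first-seen order, which is assumed here to
    resemble the unsorted output shown in the panels above.
    """
    return Counter(tuple(row[f] for f in fields) for row in rows)


# Tiny made-up sample; these rows are NOT taken from flins.csv.
SAMPLE = """county,construction,line
CLAY COUNTY,Masonry,Residential
CLAY COUNTY,Wood,Residential
UNION COUNTY,Wood,Residential
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

for key, n in count_distinct(rows, ["county"]).items():
    print(*key, n)

for key, n in count_distinct(rows, ["construction", "line"]).items():
    print(*key, n)
```

Passing one field name mirrors <tt>-f county</tt>; passing two mirrors <tt>-f construction,line</tt>, where each distinct pair gets its own count.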
<p/>
<p/> Categorization of total insured value:
<p/>
<div class="pokipanel">
<pre>
$ cat data/flins.csv | mlr --icsv --opprint stats1 -a min,avg,max -f tiv_2012
tiv_2012_min tiv_2012_avg   tiv_2012_max
73.370000    2571004.097342 1701000000.000000
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ cat data/flins.csv | mlr --icsv --opprint stats1 -a min,avg,max -f tiv_2012 -g construction,line
construction        line        tiv_2012_min    tiv_2012_avg     tiv_2012_max
Masonry             Residential 261168.070000   1041986.129217   3234970.920000
Wood                Residential 73.370000       113493.017049    649046.120000
Reinforced Concrete Commercial  6416016.010000  20212428.681840  60570000.000000
Reinforced Masonry  Commercial  1287817.340000  4621372.981117   16650000.000000
Steel Frame         Commercial  29790000.000000 133492500.000000 1701000000.000000
</pre>
</div>
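<p/> The grouped <tt>stats1 -a min,avg,max</tt> query above can be pictured as "bucket the records by the group-by fields, then summarize each bucket". The Python below is a sketch under that assumption, not Miller's code; the helper name <tt>stats1_min_avg_max</tt> and the sample values are invented for illustration.

```python
from collections import defaultdict


def stats1_min_avg_max(rows, field, group_by):
    """Return {group key: {"min", "avg", "max"}} for one numeric field."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[g] for g in group_by)
        groups[key].append(float(row[field]))
    return {
        key: {"min": min(vals), "avg": sum(vals) / len(vals), "max": max(vals)}
        for key, vals in groups.items()
    }


# Made-up records; the numbers are NOT from flins.csv.
rows = [
    {"construction": "Wood", "line": "Residential", "tiv_2012": "100.0"},
    {"construction": "Wood", "line": "Residential", "tiv_2012": "300.0"},
    {"construction": "Masonry", "line": "Residential", "tiv_2012": "500.0"},
]

result = stats1_min_avg_max(rows, "tiv_2012", ["construction", "line"])
for key, stats in result.items():
    print(*key, stats["min"], stats["avg"], stats["max"])
```

An empty <tt>group_by</tt> list puts every record under the single key <tt>()</tt>, which corresponds to the ungrouped query.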
<p/>

<p/> xxx more Q's after sort -nr and quantiles:
<pre>
cat flins.csv | mlr --icsv --oxtab stats1 -a min,avg,max -f eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible
echo
cat flins.csv | mlr --icsv --oxtab stats1 -a min,avg,max -f eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible -g county
echo
cat flins.csv | mlr --icsv --opprint stats2 -a corr,linreg-ols,r2 -f eq_site_deductible,tiv_2012 -g county
echo
cat flins.csv | mlr --icsv --opprint stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
echo
cat flins.csv | mlr --icsv --opprint stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county
</pre>

<p/> xxx plaintext hardlinks

<h1>Color/shape data</h1> <a id="Color/shape_data"/>
<p/> The <a href="data/colored-shapes.dkvp">colored-shapes.dkvp</a> file is some sample data produced by the
<a href="https://github.com/johnkerl/miller/blob/master/doc/datagen/mkdat2">mkdat2</a> script. The idea is:
<ul>
<li> Produce some data with known distributions and correlations, and verify that Miller recovers those properties empirically.
<li> Each record is labeled with one of a few colors and one of a few shapes.
<li> The <tt>flag</tt> field is 0 or 1, with probability dependent on color.
<li> The <tt>u</tt> field is plain uniform on the unit interval.
<li> The <tt>v</tt> field is the same, except tightly correlated with <tt>u</tt> for red circles.
<li> The <tt>w</tt> field is autocorrelated for each color/shape pair.
<li> The <tt>x</tt> field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.
</ul>
</ul>
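<p/> To make the list above concrete, here is a hypothetical Python generator with the same flavor. It is not the <tt>mkdat2</tt> script: the per-color probabilities in <tt>FLAG_PROB</tt>, the noise width of 0.1 for red circles, and all function names are assumptions chosen for illustration, and the autocorrelated <tt>w</tt> field is left out to keep the sketch short.

```python
import random

# Hypothetical flag probabilities per color (illustration only).
FLAG_PROB = {"red": 0.3, "green": 0.2, "blue": 0.6}
SHAPES = ["circle", "square", "triangle"]


def make_record(rng, i):
    """Build one record; only red circles get a v that tracks u."""
    color = rng.choice(list(FLAG_PROB))
    shape = rng.choice(SHAPES)
    u = rng.random()  # uniform on the unit interval
    if color == "red" and shape == "circle":
        v = u + rng.uniform(-0.1, 0.1)  # assumed noise width
    else:
        v = rng.random()  # independent of u
    return {
        "color": color,
        "shape": shape,
        "flag": 1 if rng.random() < FLAG_PROB[color] else 0,
        "i": i,
        "u": u,
        "v": v,
        "x": rng.gauss(5.0, 1.2),  # same for every color and shape
    }


def make_records(n, seed=0):
    """Generate n records reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [make_record(rng, i + 1) for i in range(n)]


records = make_records(1000)
# Print the first record as comma-separated key=value pairs.
print(",".join(f"{k}={v}" for k, v in records[0].items()))
```

Because <tt>v</tt> for red circles is <tt>u</tt> plus bounded noise, it can fall slightly outside the unit interval, which is the kind of signature the statistics further down are meant to surface.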

<p/> Peek at the data:
<p/>
<div class="pokipanel">
<pre>
$ wc -l data/colored-shapes.dkvp
100000 data/colored-shapes.dkvp
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ head -n 6 data/colored-shapes.dkvp | mlr --opprint cat
color  shape    flag i u                  v                   w                    x
green  circle   1    1 0.3333681345949695 0.21376251997142037 0.500764608334147    5.335248996137616
red    square   0    2 0.9698237361677843 0.49622821725976396 0.8040548802590068   4.860269966304398
yellow circle   1    3 0.9879523154172577 0.8487346180375205  0.6423100990686053   5.277708912462331
red    circle   0    4 0.4622496635675988 0.5046956653096718  0.13094461569540355  5.370828092344023
blue   circle   1    5 0.9992738066015238 0.08847422779789516 0.020449531719886935 4.712152492660485
purple triangle 0    6 0.983871885308542  0.5390716976565088  0.8211678541134113   3.873918555931081
</pre>
</div>
<p/>

<p/> Look at uncategorized stats (using <a href="https://github.com/johnkerl/scripts/blob/master/fundam/creach"><tt>creach</tt></a> for spacing).
Here it looks reasonable that <tt>u</tt> is unit-uniform; something&rsquo;s up with <tt>v</tt> but we can't yet see what:
<p/>
<div class="pokipanel">
<pre>
$ mlr --oxtab stats1 -a min,avg,max -f flag,u,v data/colored-shapes.dkvp | creach 3
flag_min 0.000000
flag_avg 0.397960
flag_max 1.000000

u_min    0.000010
u_avg    0.501086
u_max    0.999983

v_min    -0.096195
v_avg    0.500016
v_max    1.095540
</pre>
</div>
<p/>
<p/>The histogram shows the different distribution of 0/1 flags:
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp
bin_lo    bin_hi   flag_count u_count v_count
-0.100000 0.000000 60204      0       287
0.000000  0.100000 0          10089   9613
0.100000  0.200000 0          9844    10022
0.200000  0.300000 0          9976    10115
0.300000  0.400000 0          9923    10027
0.400000  0.500000 0          9918    9988
0.500000  0.600000 0          10017   9934
0.600000  0.700000 0          9998    9968
0.700000  0.800000 0          10202   10124
0.800000  0.900000 0          10003   9970
0.900000  1.000000 39796      10030   9650
1.000000  1.100000 0          0       302
</pre>
</div>
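<p/> The <tt>--lo</tt>, <tt>--hi</tt>, and <tt>--nbins</tt> options above describe equal-width binning. A minimal Python sketch of that idea follows; the function <tt>histogram</tt> here is a hypothetical stand-in, and Miller's own bin arithmetic may differ in detail.

```python
def histogram(values, lo, hi, nbins):
    """Count values into nbins equal-width bins covering [lo, hi)."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for x in values:
        if lo <= x < hi:  # values outside the range are ignored
            index = int((x - lo) / width)
            index = min(index, nbins - 1)  # guard against rounding at the top edge
            counts[index] += 1
    return counts


# Made-up values, none of which sits exactly on a bin boundary.
counts = histogram([0.05, 0.15, 0.17, 0.95], lo=0.0, hi=1.0, nbins=10)
print(counts)
```

Values that sit exactly on a bin boundary are sensitive to floating-point rounding in this kind of computation. That may be why the 0 and 1 flags above are counted in the bins just below them, but this is a guess rather than something the page states.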
<p/>

<p/> Look at univariate stats by color and shape. In particular,
color-dependent flag probabilities pop out, aligning with their original
Bernoulli probabilities from the data-generator script:

<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint stats1 -a min,avg,max -f flag,u,v -g color then sort color data/colored-shapes.dkvp
color  flag_min flag_avg flag_max u_min    u_avg    u_max    v_min     v_avg    v_max
blue   0.000000 0.599086 1.000000 0.000044 0.503581 0.999976 0.000057  0.496972 0.999870
green  0.000000 0.207102 1.000000 0.000010 0.504358 0.999983 0.000079  0.498956 0.999889
orange 0.000000 0.505079 1.000000 0.000464 0.496712 0.999574 0.000185  0.493220 0.999967
purple 0.000000 0.098957 1.000000 0.000037 0.498311 0.999693 0.000011  0.502892 0.999993
red    0.000000 0.297820 1.000000 0.000019 0.499904 0.999946 -0.096195 0.500635 1.095540
yellow 0.000000 0.898270 1.000000 0.000327 0.502877 0.999923 0.000138  0.501022 0.999983
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint stats1 -a min,avg,max -f flag,u,v -g shape then sort shape data/colored-shapes.dkvp
shape    flag_min flag_avg flag_max u_min    u_avg    u_max    v_min     v_avg    v_max
circle   0.000000 0.398115 1.000000 0.000037 0.502212 0.999976 -0.096195 0.496996 1.095540
square   0.000000 0.397398 1.000000 0.000037 0.499935 0.999970 0.000088  0.500389 0.999975
triangle 0.000000 0.398542 1.000000 0.000010 0.501676 0.999983 0.000010  0.501804 0.999995
</pre>
</div>
<p/>

<p/> Look at bivariate stats by color and shape. In particular, <tt>u,v</tt> pairwise correlation for red circles pops out. Note that the grouped output is sorted on <tt>u_v_corr</tt> lexically rather than numerically, so the negative values are listed first:
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp
u_v_corr w_x_corr
0.115078 0.000020
</pre>
</div>
<p/>
<p/>
<div class="pokipanel">
<pre>
$ mlr --opprint --right stats2 -a corr -f u,v,w,x -g color,shape then sort u_v_corr data/colored-shapes.dkvp
 color    shape  u_v_corr  w_x_corr
   red triangle -0.000349  0.003179
yellow triangle -0.001735  0.032633
yellow   circle -0.002083 -0.012606
purple triangle -0.004290  0.006341
 green   square -0.006983  0.001269
purple   square -0.011754 -0.003372
  blue   circle -0.017771  0.014975
  blue   square -0.019537 -0.006952
 green   circle -0.034313 -0.049415
orange triangle  0.000620 -0.013529
yellow   square  0.001905 -0.002849
orange   square  0.005342  0.042003
   red   square  0.006553 -0.007587
 green triangle  0.010633 -0.025945
purple   circle  0.021701  0.037687
orange   circle  0.021871 -0.040626
  blue triangle  0.029301  0.017138
   red   circle  0.980343  0.004277
</pre>
</div>
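<p/> The <tt>corr</tt> values above are pairwise correlations between two fields. As a reference point, this is the textbook Pearson correlation coefficient in Python; whether Miller computes it with exactly this arithmetic is an assumption, and the function name <tt>pearson_corr</tt> and the sample lists are made up.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up samples: v_tight follows u closely, v_anti runs opposite to it.
u = [0.1, 0.4, 0.5, 0.9]
v_tight = [0.12, 0.38, 0.53, 0.88]
v_anti = [1.0 - x for x in u]

print(pearson_corr(u, v_tight))
print(pearson_corr(u, v_anti))
```

A value near 1, as for red circles above, means the two fields move together almost linearly; values near 0, as for every other color/shape pair, mean no linear relationship is visible.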
<p/>

<h1>Program timing</h1> <a id="Program_timing"/>
This admittedly artificial example demonstrates using Miller time and stats
functions to introspectively acquire some information about Miller&rsquo;s own
runtime. The <tt>delta</tt> function (<tt>step -a delta</tt>) computes the difference between successive
timestamps. The first record has no predecessor, so its <tt>t_delta</tt> is just the
timestamp itself; the second command filters that record out before computing statistics.

<p/>
<div class="pokipanel">
<pre>
$ ruby -e '10000.times{|i|puts "i=#{i+1}"}' > lines.txt

$ head -n 5 lines.txt
i=1
i=2
i=3
i=4
i=5

$ mlr --ofmt '%.9le' --opprint put '$t=systime()' then step -a delta -f t lines.txt | head -n 7
i t                 t_delta
1 1430603027.018016 1.430603027e+09
2 1430603027.018043 2.694129944e-05
3 1430603027.018048 5.006790161e-06
4 1430603027.018052 4.053115845e-06
5 1430603027.018055 2.861022949e-06
6 1430603027.018058 3.099441528e-06

$ mlr --ofmt '%.9le' --oxtab \
  put '$t=systime()' then \
  step -a delta -f t then \
  filter '$i>1' then \
  stats1 -a min,avg,max -f t_delta \
  lines.txt
t_delta_min 2.861022949e-06
t_delta_avg 4.077508505e-06
t_delta_max 5.388259888e-05
</pre>
</div>
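<p/> The pipeline above (difference successive timestamps, drop the first record, summarize) can be sketched in a few lines of Python. This is an illustration rather than Miller's code: the name <tt>step_delta</tt> and the timestamps are invented, and differencing the first value against zero simply mirrors the first <tt>t_delta</tt> shown in the output above.

```python
def step_delta(values):
    """Successive differences; the first value is differenced against 0."""
    deltas = []
    previous = 0.0
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


# Made-up timestamps in seconds.
timestamps = [100.0, 100.5, 100.75, 101.75]
deltas = step_delta(timestamps)

# Skip the first delta, as filter '$i>1' does in the command above.
rest = deltas[1:]
summary = {"min": min(rest), "avg": sum(rest) / len(rest), "max": max(rest)}
print(deltas)
print(summary)
```

Without the filtering step, the first delta (the full timestamp) would dominate both the average and the maximum.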
<p/>
</td>

</table>
</body>
</html>