miller/doc/content-for-data-examples.html
2015-05-05 20:28:05 -07:00

POKI_PUT_TOC_HERE
<h1>flins data</h1>
<p/> The <a href="data/flins.csv">flins.csv</a> file is some sample data
obtained from <a href="https://support.spatialkey.com/spatialkey-sample-csv-data">https://support.spatialkey.com/spatialkey-sample-csv-data</a>.
<p/>Vertical-tabular format is good for a quick look at CSV data layout &mdash; seeing what columns you have to work with:
POKI_RUN_COMMAND{{head -n 2 data/flins.csv | mlr --icsv --oxtab cat}}HERE
<p/> A few simple queries:
POKI_RUN_COMMAND{{cat data/flins.csv | mlr --icsv --opprint count-distinct -f county | head}}HERE
POKI_RUN_COMMAND{{cat data/flins.csv | mlr --icsv --opprint count-distinct -f construction,line}}HERE
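<p/> As a rough mental model of what <tt>count-distinct</tt> reports, the following Python sketch tallies records by the values of the named fields. It is not Miller's implementation; the function name <tt>count_distinct</tt> and the sample rows are made up for illustration and are not taken from <tt>flins.csv</tt>.

```python
from collections import Counter

def count_distinct(records, fields):
    """Count records per distinct combination of the given field values."""
    counts = Counter(tuple(rec[f] for f in fields) for rec in records)
    # Counter keeps keys in first-seen order, so output follows input order.
    return [dict(zip(fields, key), count=n) for key, n in counts.items()]

# Hypothetical rows for illustration only.
rows = [
    {"county": "A", "line": "Residential"},
    {"county": "B", "line": "Commercial"},
    {"county": "A", "line": "Residential"},
]
```

<p/> Passing several field names, as in <tt>-f construction,line</tt>, counts each distinct combination of values rather than each field separately.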
<p/> Categorization of total insured value:
POKI_RUN_COMMAND{{cat data/flins.csv | mlr --icsv --opprint stats1 -a min,avg,max -f tiv_2012}}HERE
POKI_RUN_COMMAND{{cat data/flins.csv | mlr --icsv --opprint stats1 -a min,avg,max -f tiv_2012 -g construction,line}}HERE
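<p/> To make the grouped form concrete, here is a minimal Python sketch of what <tt>stats1 -a min,avg,max</tt> with <tt>-g</tt> computes: one output row per distinct group key, holding the minimum, mean, and maximum of the value field. The function name, the output field names, and the sample rows are assumptions for illustration; Miller's actual output naming may differ.

```python
from collections import defaultdict

def stats1_min_avg_max(records, field, group_by):
    """Per-group min, mean, and max of one numeric field."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[g] for g in group_by)
        groups[key].append(float(rec[field]))
    out = []
    for key, values in groups.items():
        row = dict(zip(group_by, key))
        # Output field names here are illustrative, not Miller's guaranteed names.
        row[field + "_min"] = min(values)
        row[field + "_avg"] = sum(values) / len(values)
        row[field + "_max"] = max(values)
        out.append(row)
    return out

# Hypothetical rows for illustration only.
rows = [
    {"construction": "Wood", "tiv_2012": "10"},
    {"construction": "Wood", "tiv_2012": "30"},
    {"construction": "Masonry", "tiv_2012": "20"},
]
```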
<p/> xxx more Q's after sort -nr and quantiles:
<pre>
cat data/flins.csv | mlr --icsv --oxtab stats1 -a min,avg,max -f eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible
echo
cat data/flins.csv | mlr --icsv --oxtab stats1 -a min,avg,max -f eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible -g county
echo
cat data/flins.csv | mlr --icsv --opprint stats2 -a corr,linreg-ols,r2 -f eq_site_deductible,tiv_2012 -g county
echo
cat data/flins.csv | mlr --icsv --opprint stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
echo
cat data/flins.csv | mlr --icsv --opprint stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county
</pre>
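<p/> The <tt>stats2</tt> commands above operate on pairs of fields such as <tt>tiv_2011,tiv_2012</tt>. As a sketch of the textbook quantities behind <tt>linreg-ols</tt> and <tt>r2</tt> (ordinary least-squares slope and intercept, and the coefficient of determination), the following Python may help. The function names are hypothetical and Miller's numerical details may differ.

```python
def linreg_ols(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def r_squared(xs, ys, m, b):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot
```

<p/> For points lying exactly on a line, such as x = 0, 1, 2, 3 with y = 1, 3, 5, 7, the fit recovers slope 2 and intercept 1, with an R-squared of 1.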
<p/> xxx plaintext hardlinks
<h1>Color/shape data</h1>
<p/> The <a href="data/colored-shapes.dkvp">colored-shapes.dkvp</a> file is some sample data produced by the
<a href="https://github.com/johnkerl/miller/blob/master/doc/datagen/mkdat2">mkdat2</a> script. The idea is:
<ul>
<li> Produce some data with known distributions and correlations, and verify that Miller recovers those properties empirically.
<li> Each record is labeled with one of a few colors and one of a few shapes.
<li> The <tt>flag</tt> field is 0 or 1, with probability dependent on color.
<li> The <tt>u</tt> field is plain uniform on the unit interval.
<li> The <tt>v</tt> field is the same, except tightly correlated with <tt>u</tt> for red circles.
<li> The <tt>w</tt> field is autocorrelated for each color/shape pair.
<li> The <tt>x</tt> field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.
</ul>
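<p/> The field descriptions above can be sketched as a small generator. This is not the <tt>mkdat2</tt> script: the color and shape labels, the flag probabilities, the noise level for <tt>v</tt>, and the autocorrelation scheme for <tt>w</tt> are all assumptions chosen only to illustrate the listed properties.

```python
import random

# All constants below are assumptions for illustration; the actual
# mkdat2 script may use different labels, probabilities, and noise levels.
COLORS = ["red", "blue", "yellow"]
SHAPES = ["circle", "square", "triangle"]
FLAG_PROB = {"red": 0.3, "blue": 0.6, "yellow": 0.9}

def make_records(n, seed=0):
    """Generate n records with the distributional properties listed above."""
    rng = random.Random(seed)
    last_w = {}  # previous w per (color, shape) pair, used for autocorrelation
    records = []
    for _ in range(n):
        color = rng.choice(COLORS)
        shape = rng.choice(SHAPES)
        # Bernoulli flag whose probability depends on color.
        flag = 1 if rng.random() < FLAG_PROB[color] else 0
        u = rng.random()  # uniform on the unit interval
        if (color, shape) == ("red", "circle"):
            # Tightly tied to u; may stray slightly outside [0, 1).
            v = u + rng.gauss(0, 0.01)
        else:
            v = rng.random()
        # Each w leans heavily on the previous w for the same color/shape pair.
        prev = last_w.get((color, shape), 0.5)
        w = 0.9 * prev + 0.1 * rng.random()
        last_w[(color, shape)] = w
        x = rng.gauss(5, 1.2)  # independent of color and shape
        records.append({"color": color, "shape": shape, "flag": flag,
                        "u": u, "v": v, "w": w, "x": x})
    return records
```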
<p/> Peek at the data:
POKI_RUN_COMMAND{{wc -l data/colored-shapes.dkvp}}HERE
POKI_RUN_COMMAND{{head -n 6 data/colored-shapes.dkvp | mlr --opprint cat}}HERE
<p/> Look at uncategorized stats (using <a href="https://github.com/johnkerl/scripts/blob/master/fundam/creach"><tt>creach</tt></a> for spacing).
Here it looks reasonable that <tt>u</tt> is unit-uniform; something&rsquo;s up with <tt>v</tt> but we can&rsquo;t yet see what:
POKI_RUN_COMMAND{{mlr --oxtab stats1 -a min,avg,max -f flag,u,v data/colored-shapes.dkvp | creach 3}}HERE
<p/>The histogram shows the different distribution of 0/1 flags:
POKI_RUN_COMMAND{{mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp}}HERE
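<p/> The <tt>--lo</tt>, <tt>--hi</tt>, and <tt>--nbins</tt> options describe equal-width bins over a range. A minimal Python sketch of that kind of binning follows, assuming half-open bins and assuming that out-of-range values are simply dropped; Miller's exact edge handling may differ, and the function name is hypothetical.

```python
def histogram(values, lo, hi, nbins):
    """Count values into nbins equal-width bins covering [lo, hi)."""
    counts = [0] * nbins
    scale = nbins / (hi - lo)
    for v in values:
        if lo <= v < hi:  # assumption: values outside the range are ignored
            # min() guards against float rounding pushing the index past the end.
            idx = min(int((v - lo) * scale), nbins - 1)
            counts[idx] += 1
    return counts
```

<p/> Choosing a range slightly wider than the data, as the command above does, keeps the extreme values 0 and 1 away from the outer edges of the range.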
<p/> Look at univariate stats by color and shape. In particular,
color-dependent flag probabilities pop out, aligning with their original
Bernoulli probabilities from the data-generator script:
POKI_RUN_COMMAND{{mlr --opprint stats1 -a min,avg,max -f flag,u,v -g color then sort color data/colored-shapes.dkvp}}HERE
POKI_RUN_COMMAND{{mlr --opprint stats1 -a min,avg,max -f flag,u,v -g shape then sort shape data/colored-shapes.dkvp}}HERE
<p/> Look at bivariate stats by color and shape. In particular, <tt>u,v</tt> pairwise correlation for red circles pops out (xxx sort -n oppo):
POKI_RUN_COMMAND{{mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp}}HERE
POKI_RUN_COMMAND{{mlr --opprint --right stats2 -a corr -f u,v,w,x -g color,shape then sort u_v_corr data/colored-shapes.dkvp}}HERE
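<p/> For reference, the standard Pearson correlation coefficient for a pair of fields such as <tt>u</tt> and <tt>v</tt> can be sketched in a few lines of Python. The function name is hypothetical and this is the textbook formula, not necessarily the exact computation Miller performs.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

<p/> Values near 1 or -1 indicate a nearly linear relationship, which is why a tightly correlated group stands out from groups whose fields are independent and so sit near 0.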
<h1>Program timing</h1>
<p/> This admittedly artificial example uses Miller&rsquo;s time and stats
functions to introspectively gather some information about Miller&rsquo;s own
runtime. The <tt>delta</tt> function computes the difference between successive
timestamps.
POKI_INCLUDE_ESCAPED(data/timing-example.txt)HERE
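<p/> The differencing idea can be sketched in Python as follows. The function name is hypothetical, and reporting 0 for the first record (which has no predecessor) is an assumption about the convention, not a statement of Miller's documented behavior.

```python
def step_delta(values):
    """Difference between each value and the one before it."""
    deltas = []
    prev = None
    for v in values:
        # Assumption: the first record has no predecessor, so report 0.
        deltas.append(0 if prev is None else v - prev)
        prev = v
    return deltas
```

<p/> Applied to timestamps recorded at successive points in a run, the resulting differences are the elapsed times between those points, which can then be fed to summary statistics such as min, mean, and max.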