miller/docs6/docs/data-diving-examples.md.in
2021-08-25 22:01:32 -04:00

# Data-diving examples
## flins data
The [flins.csv](data/flins.csv) file is some sample data obtained from [https://support.spatialkey.com/spatialkey-sample-csv-data](https://support.spatialkey.com/spatialkey-sample-csv-data).
Vertical-tabular (XTAB) format is good for a quick look at the CSV data layout, that is, for seeing what columns you have to work with. This file is too big to view on a single screenful:
GENMD_RUN_COMMAND
wc -l data/flins.csv
GENMD_EOF
GENMD_RUN_COMMAND
mlr --c2x --from data/flins.csv head -n 2
GENMD_EOF
A few simple queries:
GENMD_RUN_COMMAND
mlr --c2p --from data/flins.csv count-distinct -f county | head
GENMD_EOF
GENMD_RUN_COMMAND
mlr --c2p --from data/flins.csv count-distinct -f line
GENMD_EOF
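
For readers who want to see what a count-distinct query amounts to, here is a minimal Python sketch of the same idea: tally how many records carry each distinct value of one field. The function name `count_distinct` and the sample rows are hypothetical stand-ins, not taken from `data/flins.csv`, and this is not Miller's implementation.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in rows; the real data/flins.csv has more columns and rows.
SAMPLE_CSV = """county,line
county a,Residential
county a,Commercial
county b,Residential
county a,Residential
"""


def count_distinct(csv_text, field):
    """Count records per distinct value of `field`, in first-seen order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[field] for row in reader)


if __name__ == "__main__":
    for value, count in count_distinct(SAMPLE_CSV, "county").items():
        print(f"{value} {count}")
```

Keeping first-seen order, rather than sorting by count, makes the tally easy to compare against the input by eye.
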
Categorization of total insured value:
GENMD_RUN_COMMAND
mlr --c2x --from data/flins.csv stats1 -a min,mean,max -f tiv_2012
GENMD_EOF
GENMD_RUN_COMMAND
mlr --c2p --from data/flins.csv \
stats1 -a min,mean,max -f tiv_2012 -g construction,line
GENMD_EOF
GENMD_RUN_COMMAND
mlr --c2x --from data/flins.csv \
stats1 -a p0,p10,p50,p90,p95,p99,p100 -f hu_site_deductible
GENMD_EOF
GENMD_RUN_COMMAND
mlr --c2p --from data/flins.csv \
stats1 -a p95,p99,p100 -f hu_site_deductible -g county \
then sort -f county | head
GENMD_EOF
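
The percentile queries above can be read as "sort the values, then pick the one a given fraction of the way through". The following is a minimal sketch of one common non-interpolated definition. The indexing rule (index `int(p * n / 100)`, clamped to the valid range) is an assumption made for illustration, so Miller's exact rule may differ, and the deductible values are hypothetical.

```python
def percentile(values, p):
    """Non-interpolated percentile: return an element of the sorted data.

    Assumed rule: index = int(p * n / 100), clamped to [0, n - 1].
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)
    index = int(p * n / 100)
    index = max(0, min(index, n - 1))  # p100 would otherwise index past the end
    return ordered[index]


if __name__ == "__main__":
    # Hypothetical deductibles: mostly zero, with a small tail of large values.
    deductibles = [0, 0, 0, 0, 0, 0, 0, 0, 500, 2500]
    for p in (0, 10, 50, 90, 95, 99, 100):
        print(f"p{p} {percentile(deductibles, p)}")
```

With skewed data like this, the low and middle percentiles are all identical and only the high percentiles reveal the tail, which is why the queries above ask for `p95` and `p99` as well as the median.
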
GENMD_RUN_COMMAND
mlr --c2x --from data/flins.csv \
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
GENMD_EOF
GENMD_RUN_COMMAND
mlr --c2x --from data/flins.csv --ofmt '%.4f' \
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county \
then head -n 5
GENMD_EOF
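
As a reference for what the bivariate statistics measure, here is a small Python sketch that computes the Pearson correlation, an ordinary-least-squares line `y = m*x + b`, and the R-squared of that fit. The function name, the dictionary keys, and the sample values are hypothetical; this illustrates the textbook formulas and is not Miller's code.

```python
import math


def bivariate_stats(xs, ys):
    """Return Pearson correlation, OLS slope/intercept, and R^2 for y ~ m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both fields must vary")
    m = sxy / sxx                      # OLS slope
    b = mean_y - m * mean_x            # OLS intercept
    corr = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    r2 = 1.0 - ss_res / syy            # fraction of variance explained by the fit
    return {"corr": corr, "m": m, "b": b, "r2": r2}


if __name__ == "__main__":
    # Hypothetical insured values for two years, not taken from flins.csv.
    tiv_2011 = [100.0, 200.0, 300.0, 400.0]
    tiv_2012 = [120.0, 210.0, 340.0, 430.0]
    for name, value in bivariate_stats(tiv_2011, tiv_2012).items():
        print(f"{name} {value:.4f}")
```

For a straight-line fit with an intercept, R-squared equals the square of the correlation, so the two numbers carry the same information here; the slope and intercept add the scale of the relationship.
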
## Color/shape data
The [data/colored-shapes.dkvp](data/colored-shapes.dkvp) file is some sample data produced by the [mkdat2](../data/mkdat2) script. The idea is:
* Produce some data with known distributions and correlations, and verify that Miller recovers those properties empirically.
* Each record is labeled with one of a few colors and one of a few shapes.
* The `flag` field is 0 or 1, with probability dependent on color.
* The `u` field is plain uniform on the unit interval.
* The `v` field is the same, except tightly correlated with `u` for red circles.
* The `w` field is autocorrelated for each color/shape pair.
* The `x` field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.
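
The recipe in the list above can be sketched in Python as follows. This is not the `mkdat2` script: the color names other than red, the shape names other than circle, the flag probabilities, and the noise level are all hypothetical, and the autocorrelated `w` field is left out for brevity.

```python
import random

# Hypothetical parameters, chosen only for illustration.
FLAG_PROBABILITY = {"red": 0.8, "blue": 0.5, "yellow": 0.2}
SHAPES = ["circle", "square", "triangle"]


def make_record(rng):
    """Generate one record with known distributions and correlations."""
    color = rng.choice(list(FLAG_PROBABILITY))
    shape = rng.choice(SHAPES)
    # Bernoulli flag whose probability depends on color.
    flag = 1 if rng.random() < FLAG_PROBABILITY[color] else 0
    # u is uniform on the unit interval.
    u = rng.random()
    if color == "red" and shape == "circle":
        # v tracks u closely (may stray slightly outside the unit interval).
        v = u + rng.gauss(0, 0.01)
    else:
        v = rng.random()
    # x is Gaussian with mean 5 and standard deviation 1.2, independent of labels.
    x = rng.gauss(5, 1.2)
    return {"color": color, "shape": shape, "flag": flag, "u": u, "v": v, "x": x}


def make_records(n, seed=0):
    """Generate n records reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [make_record(rng) for _ in range(n)]


if __name__ == "__main__":
    # Print a few records in DKVP (key=value, comma-separated) form.
    for record in make_records(3):
        print(",".join(f"{key}={value}" for key, value in record.items()))
```

Because the generating parameters are known, any summary statistic computed afterwards can be checked against them, which is the point of the exercise.
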
Peek at the data:
GENMD_RUN_COMMAND
wc -l data/colored-shapes.dkvp
GENMD_EOF
GENMD_RUN_COMMAND
head -n 6 data/colored-shapes.dkvp | mlr --opprint cat
GENMD_EOF
Look at uncategorized stats (using [creach](https://github.com/johnkerl/scripts/blob/master/fundam/creach) for spacing).
Here it looks reasonable that `u` is unit-uniform; something's up with `v` but we can't yet see what:
GENMD_RUN_COMMAND
mlr --oxtab stats1 -a min,mean,max -f flag,u,v data/colored-shapes.dkvp | creach 3
GENMD_EOF
The histogram shows the different distribution of 0/1 flags:
GENMD_RUN_COMMAND
mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp
GENMD_EOF
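
To make the roles of the low bound, high bound, and bin count concrete, here is a minimal Python sketch of fixed-width binning. The function name and the sample values are hypothetical, and the handling of out-of-range values (dropped here) is an assumption rather than a statement about Miller's behavior.

```python
def histogram(values, lo, hi, nbins):
    """Count values into nbins equal-width bins covering [lo, hi).

    Assumption: values outside [lo, hi) are dropped.
    Returns a list of (bin_lo, bin_hi, count) tuples.
    """
    if nbins <= 0 or hi <= lo:
        raise ValueError("need nbins > 0 and hi > lo")
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for value in values:
        if value < lo or value >= hi:
            continue
        index = int((value - lo) / width)
        index = min(index, nbins - 1)  # guard against floating-point rounding
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(nbins)]


if __name__ == "__main__":
    # Hypothetical 0/1 flags plus one fractional value.
    flags = [0, 1, 1, 0, 1, 0.25]
    for bin_lo, bin_hi, count in histogram(flags, lo=-0.5, hi=1.5, nbins=4):
        print(f"{bin_lo:.2f} {bin_hi:.2f} {count}")
```

Choosing bounds slightly wider than the data range, as the command above does with -0.1 and 1.1, keeps the extreme values 0 and 1 inside the binned range. A value that lands exactly on a bin edge can fall on either side depending on floating-point rounding, so it helps to pick edges that avoid the values of interest when that matters.
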
Look at univariate stats by color and shape. In particular, color-dependent flag probabilities pop out, aligning with their original Bernoulli probabilities from the data-generator script:
GENMD_RUN_COMMAND
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g color \
then sort -f color \
data/colored-shapes.dkvp
GENMD_EOF
GENMD_RUN_COMMAND
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g shape \
then sort -f shape \
data/colored-shapes.dkvp
GENMD_EOF
Look at bivariate stats by color and shape. In particular, `u,v` pairwise correlation for red circles pops out:
GENMD_RUN_COMMAND
mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp
GENMD_EOF
GENMD_RUN_COMMAND
mlr --opprint --right \
stats2 -a corr -f u,v,w,x -g color,shape then sort -nr u_v_corr \
data/colored-shapes.dkvp
GENMD_EOF