mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 18:25:45 +00:00
120 lines
3.7 KiB
Markdown
120 lines
3.7 KiB
Markdown
# Data-diving examples
|
|
|
|
## flins data
|
|
|
|
The [flins.csv](data/flins.csv) file is some sample data obtained from [https://support.spatialkey.com/spatialkey-sample-csv-data](https://support.spatialkey.com/spatialkey-sample-csv-data).
|
|
|
|
Vertical-tabular format is good for a quick look at CSV data layout -- seeing what columns you have to work with, as this is a file big enough that we can't just see it on a single screenful:
|
|
|
|
GENMD_RUN_COMMAND
|
|
wc -l data/flins.csv
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2x --from data/flins.csv head -n 2
|
|
GENMD_EOF
|
|
|
|
A few simple queries:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2p --from data/flins.csv count-distinct -f county | head
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2p --from data/flins.csv count-distinct -f line
|
|
GENMD_EOF
|
|
|
|
Categorization of total insured value:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2x --from data/flins.csv stats1 -a min,mean,max -f tiv_2012
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2p --from data/flins.csv \
|
|
stats1 -a min,mean,max -f tiv_2012 -g construction,line
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2x --from data/flins.csv \
|
|
stats1 -a p0,p10,p50,p90,p95,p99,p100 -f hu_site_deductible
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2p --from data/flins.csv \
|
|
stats1 -a p95,p99,p100 -f hu_site_deductible -g county \
|
|
then sort -f county | head
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2x --from data/flins.csv \
|
|
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --c2x --from data/flins.csv --ofmt '%.4f' \
|
|
stats2 -a corr,linreg-ols,r2 -f tiv_2011,tiv_2012 -g county \
|
|
then head -n 5
|
|
GENMD_EOF
|
|
|
|
## Color/shape data
|
|
|
|
The [data/colored-shapes.dkvp](data/colored-shapes.dkvp) file is some sample data produced by the [mkdat2](../data/mkdat2) script. The idea is:
|
|
|
|
* Produce some data with known distributions and correlations, and verify that Miller recovers those properties empirically.
|
|
* Each record is labeled with one of a few colors and one of a few shapes.
|
|
* The `flag` field is 0 or 1, with probability dependent on color
|
|
* The `u` field is plain uniform on the unit interval.
|
|
* The `v` field is the same, except tightly correlated with `u` for red circles.
|
|
* The `w` field is autocorrelated for each color/shape pair.
|
|
* The `x` field is boring Gaussian with mean 5 and standard deviation about 1.2, with no dependence on color or shape.
|
|
|
|
Peek at the data:
|
|
|
|
GENMD_RUN_COMMAND
|
|
wc -l data/colored-shapes.dkvp
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
head -n 6 data/colored-shapes.dkvp | mlr --opprint cat
|
|
GENMD_EOF
|
|
|
|
Look at uncategorized stats (using [creach](https://github.com/johnkerl/scripts/blob/master/fundam/creach) for spacing).
|
|
|
|
Here it looks reasonable that `u` is unit-uniform; something's up with `v` but we can't yet see what:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --oxtab stats1 -a min,mean,max -f flag,u,v data/colored-shapes.dkvp | creach 3
|
|
GENMD_EOF
|
|
|
|
The histogram shows the different distribution of 0/1 flags:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --opprint histogram -f flag,u,v --lo -0.1 --hi 1.1 --nbins 12 data/colored-shapes.dkvp
|
|
GENMD_EOF
|
|
|
|
Look at univariate stats by color and shape. In particular, color-dependent flag probabilities pop out, aligning with their original Bernoulli probablities from the data-generator script:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g color \
|
|
then sort -f color \
|
|
data/colored-shapes.dkvp
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --opprint stats1 -a min,mean,max -f flag,u,v -g shape \
|
|
then sort -f shape \
|
|
data/colored-shapes.dkvp
|
|
GENMD_EOF
|
|
|
|
Look at bivariate stats by color and shape. In particular, `u,v` pairwise correlation for red circles pops out:
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --opprint --right stats2 -a corr -f u,v,w,x data/colored-shapes.dkvp
|
|
GENMD_EOF
|
|
|
|
GENMD_RUN_COMMAND
|
|
mlr --opprint --right \
|
|
stats2 -a corr -f u,v,w,x -g color,shape then sort -nr u_v_corr \
|
|
data/colored-shapes.dkvp
|
|
GENMD_EOF
|