mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 10:15:36 +00:00
weighted-means cookbook example
parent
c4759fc351
commit
4e9d32ed7f
8 changed files with 138 additions and 33 deletions
|
|
@ -1,26 +1,49 @@
|
|||
This is a relatively minor release, containing feature requests.
|
||||
This release contains mostly feature requests.
|
||||
|
||||
**Features:**
|
||||
|
||||
* There is a new DSL function [**mapexcept**](http://johnkerl.org/miller-releases/miller-5.2.0/doc/reference-dsl.html#mapexcept) which returns a copy of the argument with specified key(s), if any, unset. Likewise, [**mapselect**](http://johnkerl.org/miller-releases/miller-5.2.0/doc/reference-dsl.html#mapselect) returns a copy of the argument with only specified key(s), if any, set. This resolves https://github.com/johnkerl/miller/issues/137.
|
||||
* There is a new DSL function
|
||||
[**mapexcept**](http://johnkerl.org/miller-releases/miller-5.2.0/doc/reference-dsl.html#mapexcept) which returns a
|
||||
copy of the argument with specified key(s), if any, unset. The motivating use-case is to split records to multiple
|
||||
filenames depending on a particular field value, which is omitted from the output: `mlr --from f.dat put 'tee >
|
||||
"/tmp/data-".$a, mapexcept($*, "a")'` Likewise,
|
||||
[**mapselect**](http://johnkerl.org/miller-releases/miller-5.2.0/doc/reference-dsl.html#mapselect) returns a copy of the
|
||||
argument with only specified key(s), if any, set. This resolves https://github.com/johnkerl/miller/issues/137.
|
||||
|
||||
* xxx min/max functions and stats1/merge-fields min/max/percentile mix int and string. esp. string-only order statistics. doclink for mixed case. interpolation obv nonsensical.
|
||||
* The [**min**](http://johnkerl.org/miller-releases/miller-5.2.0/doc/reference-dsl.html#min)
|
||||
and [**max**](http://johnkerl.org/miller-releases/miller-5.2.0/doc/reference-dsl.html#max) DSL functions, and the
|
||||
min/max/percentile aggregators for the
|
||||
[**stats1**](http://johnkerl.org/miller-releases/miller-5.2.0/doc/reference-verbs.html#stats1) and
|
||||
[**merge-fields**](http://johnkerl.org/miller-releases/miller-5.2.0/doc/reference-verbs.html#merge-fields) verbs, now
|
||||
support numeric as well as string field values. (For mixed string/numeric fields, numbers compare before strings.) This
|
||||
means in particular that order statistics are now possible on string-only fields. Interpolation is obviously nonsensical
|
||||
for strings, so interpolated percentiles such as `mlr stats1 -a p50 -f a -i` yield an error for string-only fields.
|
||||
Likewise, any other aggregations requiring arithmetic, such as <tt>mean</tt>, also produce an error on string-valued
|
||||
input.
|
||||
|
||||
* A new **-u** option for [**count-distinct**](http://johnkerl.org/miller-releases/miller-5.2.0/doc/reference-verbs.html#count-distinct) allows unlashed counts for multiple field names. For example, with `-f a,b` and
|
||||
without `-u`, `count-distinct` computes counts for distinct pairs of `a` and `b` field values. With `-f a,b` and with `-u`, it computes counts
|
||||
for distinct `a` field values and counts for distinct `b` field values separately.
|
||||
|
||||
* xxx `./configure` vs. `autoreconf -fiv` 1st, and which issue is resolved by this.
|
||||
* If you [build from source](http://johnkerl.org/miller-releases/miller-5.2.0/doc/build.html), you can now
|
||||
do `./configure` without first doing `autoreconf -fiv`. This resolves https://github.com/johnkerl/miller/issues/xxx.
|
||||
**xxx to do**: figure out and fix the timestamp issue.
|
||||
**xxx to do**: update the build.html page.
|
||||
|
||||
* xxx UTF-8 BOM strip for CSV files; resolves xxx
|
||||
* The UTF-8 BOM sequence `0xef` `0xbb` `0xbf` is now automatically stripped from the start of CSV files. (The same is
|
||||
already done for JSON files.) This resolves https://github.com/johnkerl/miller/issues/xxx.
|
||||
|
||||
* For `put` and `filter` with `-S`, program literals such as the `6` in `$x = 6` were being parsed as strings. This is not sensible, since the `-S` option for `put` and `filter` is intended to suppress numeric conversion of record data, not program literals. To get string `6` one may use `$x = "6"`.
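
To make the lashed/unlashed distinction for `count-distinct -u` concrete, here is a minimal Python sketch. It is not Miller's implementation (that is the C code changed in this commit); the function names `count_lashed`, `count_unlashed`, and `unlashed_rows` are hypothetical, and the sketch assumes every record carries all of the requested fields.

```python
from collections import Counter


def count_lashed(records, fields):
    """Count distinct tuples of field values (the behaviour without -u)."""
    return Counter(tuple(rec[f] for f in fields) for rec in records)


def count_unlashed(records, fields):
    """Count distinct values of each field separately (the behaviour with -u)."""
    return {f: Counter(rec[f] for rec in records) for f in fields}


def unlashed_rows(records, fields):
    """Flatten the unlashed counts into one (field, value, count) row per value,
    mirroring the field=...,value=...,count=... output shape in this commit."""
    counts = count_unlashed(records, fields)
    return [(f, value, n) for f in fields for value, n in counts[f].items()]


# Hypothetical sample records, for illustration only.
records = [
    {"a": "pan", "b": "pan"},
    {"a": "pan", "b": "wye"},
    {"a": "eks", "b": "wye"},
]

for row in unlashed_rows(records, ["a", "b"]):
    print("field={},value={},count={}".format(*row))
```

Emitting one row per field value, rather than one wide row per field, keeps the output keys fixed (`field`, `value`, `count`) regardless of which values occur in the data.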
|
||||
|
||||
**Documentation:**
|
||||
|
||||
* Suppose you have counters in a SQL database with different values in successive queries. A new cookbook example shows [**how to compute differences between successive queries**](http://www.johnkerl.org/miller-releases/miller-5.2.0/doc/cookbook.html#Showing_differences_between_successive_queries).
|
||||
* A new cookbook example shows [**how to compute differences between successive
|
||||
queries**](http://www.johnkerl.org/miller-releases/miller-5.2.0/doc/cookbook.html#Showing_differences_between_successive_queries),
|
||||
e.g. to find out what changed in time-varying data when you run and rerun a SQL query.
|
||||
|
||||
* Another new cookbook example shows [**how to compute interquartile ranges**](http://www.johnkerl.org/miller-releases/miller-5.2.0/doc/cookbook2.html#Computing_interquartile_ranges)
|
||||
* Another new cookbook example shows [**how to compute interquartile ranges**](http://www.johnkerl.org/miller-releases/miller-5.2.0/doc/cookbook2.html#Computing_interquartile_ranges).
|
||||
|
||||
* A third new cookbook example shows [**how to compute weighted means**](http://www.johnkerl.org/miller-releases/miller-5.2.0/doc/cookbook2.html#Computing_weighted_means).
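
For readers unfamiliar with the Miller DSL, the weighted-means cookbook example accumulates four running sums per group and divides at the end. The following Python sketch mirrors that arithmetic under stated assumptions: the function name `weighted_means` and the sample records are hypothetical, the default field names (`a` for grouping, `i` for values, `y` for weights) follow the cookbook script, and each group's total weight is assumed to be nonzero.

```python
from collections import defaultdict


def weighted_means(records, group="a", value="i", weight="y"):
    """Return {group_key: {"wmean": ..., "mean": ...}} for the given records."""
    sumwx = defaultdict(float)  # sum of weight * value, per group
    sumw = defaultdict(float)   # sum of weights, per group
    sumx = defaultdict(float)   # sum of values, per group
    sumn = defaultdict(int)     # record count, per group

    for rec in records:
        key = rec[group]
        w = rec[weight]
        x = rec[value]
        sumwx[key] += w * x
        sumw[key] += w
        sumx[key] += x
        sumn[key] += 1

    # Assumes sumw[key] != 0 for every group; otherwise this raises ZeroDivisionError.
    return {
        key: {"wmean": sumwx[key] / sumw[key], "mean": sumx[key] / sumn[key]}
        for key in sumwx
    }


# Hypothetical sample records, for illustration only.
records = [
    {"a": "pan", "i": 1, "y": 0.5},
    {"a": "pan", "i": 3, "y": 1.5},
    {"a": "eks", "i": 10, "y": 2.0},
]

for key, stats in weighted_means(records).items():
    print(key, stats["wmean"], stats["mean"])
```

The unweighted mean is computed alongside the weighted one only for comparison, as in the cookbook script.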
|
||||
|
||||
**Bugfixes:**
|
||||
|
||||
|
|
|
|||
|
|
@ -223,16 +223,16 @@ static sllv_t* mapper_uniq_process_unlashed(lrec_t* pinrec, context_t* pctx, voi
|
|||
else {
|
||||
sllv_t* poutrecs = sllv_alloc();
|
||||
for (lhmsve_t* pe = pstate->pcounts_unlashed->phead; pe != NULL; pe = pe->pnext) {
|
||||
lrec_t* poutrec = lrec_unbacked_alloc();
|
||||
char* field_name= pe->key;
|
||||
lhmsll_t* pcounts_for_field_name = pe->pvvalue;
|
||||
lrec_put(poutrec, "field", field_name, NO_FREE);
|
||||
for (lhmslle_t* pf = pcounts_for_field_name->phead; pf != NULL; pf = pf->pnext) {
|
||||
char* field_value = pf->key;
|
||||
lrec_put(poutrec, mlr_paste_2_strings(field_value, "_count"), mlr_alloc_string_from_ll(pf->value),
|
||||
FREE_ENTRY_KEY|FREE_ENTRY_VALUE);
|
||||
lrec_t* poutrec = lrec_unbacked_alloc();
|
||||
lrec_put(poutrec, "field", field_name, NO_FREE);
|
||||
lrec_put(poutrec, "value", field_value, NO_FREE);
|
||||
lrec_put(poutrec, "count", mlr_alloc_string_from_ll(pf->value), FREE_ENTRY_VALUE);
|
||||
sllv_append(poutrecs, poutrec);
|
||||
}
|
||||
sllv_append(poutrecs, poutrec);
|
||||
}
|
||||
sllv_append(poutrecs, NULL);
|
||||
return poutrecs;
|
||||
|
|
|
|||
|
|
@ -919,8 +919,14 @@ a=hat,b=wye,count=2
|
|||
a=pan,b=wye,count=2
|
||||
|
||||
mlr count-distinct -f a,b -u ./reg_test/input/small ./reg_test/input/abixy
|
||||
field=a,pan_count=4,eks_count=6,wye_count=4,zee_count=4,hat_count=2
|
||||
field=b,pan_count=8,wye_count=10,zee_count=2
|
||||
field=a,value=pan,count=4
|
||||
field=a,value=eks,count=6
|
||||
field=a,value=wye,count=4
|
||||
field=a,value=zee,count=4
|
||||
field=a,value=hat,count=2
|
||||
field=b,value=pan,count=8
|
||||
field=b,value=wye,count=10
|
||||
field=b,value=zee,count=2
|
||||
|
||||
mlr count-distinct -f a -n ./reg_test/input/small ./reg_test/input/abixy
|
||||
count=5
|
||||
|
|
|
|||
20
c/todo.txt
|
|
@ -11,37 +11,23 @@ BUGFIXES
|
|||
x=9223372036854775802,y=-9223372036854775806
|
||||
x=9223372036854775805,y=-9223372036854775802
|
||||
|
||||
mlr cat then sec2gmt:
|
||||
Usage: mlr (null) [options] {comma-separated list of field names}
|
||||
Replaces a numeric field representing seconds since the epoch with the
|
||||
corresponding GMT timestamp; leaves non-numbers as-is. This is nothing
|
||||
more than a keystroke-saver for the sec2gmt function:
|
||||
mlr (null) time1,time2
|
||||
is the same as
|
||||
|
||||
================================================================
|
||||
FUNDAM:
|
||||
|
||||
* synctool alias/flag handling ...
|
||||
|
||||
================================================================
|
||||
5.2.0 TO-DO:
|
||||
5.3.0 TO-DO:
|
||||
|
||||
----------------------------------------------------------------
|
||||
airable:
|
||||
|
||||
? count-distinct -u wtf ?
|
||||
|
||||
! termcvt -I
|
||||
!!! aux-list -> main help; dox too
|
||||
* UT unhex
|
||||
! faqent/cookbook/more:
|
||||
mlr termcvt --cr2lf foo.csv.cr > foo.csv
|
||||
|
||||
* IQR:
|
||||
- IQR-put faqent
|
||||
? pn-pm aggr @ stats1 ?!?
|
||||
|
||||
! !autoreconf doc note w/ as-of-5.2.0 caveat
|
||||
|
||||
* reg_test/run --mlrexec flag
|
||||
|
|
@ -141,13 +127,13 @@ MAPVAR CHECKLIST:
|
|||
* clarify ownership semantics in localstack & mlhmmv via function names, & top-of-file comments
|
||||
|
||||
================================================================
|
||||
5.2.0 ideas:
|
||||
5.3.0 ideas:
|
||||
|
||||
----------------------------------------------------------------
|
||||
! multi-field x many verbs: -f/-r field-name-spec opportunities throughout
|
||||
|
||||
which verbs:
|
||||
* stats1
|
||||
k stats1 (done in 5.2.0)
|
||||
* stats2
|
||||
* merge-fields -x
|
||||
- count-distinct
|
||||
|
|
|
|||
|
|
@ -117,6 +117,17 @@ POKI_INCLUDE_AND_RUN_ESCAPED(data/iqr1.sh)HERE
|
|||
|
||||
POKI_INCLUDE_AND_RUN_ESCAPED(data/iqrn.sh)HERE
|
||||
|
||||
</div>
|
||||
<!-- ================================================================ -->
|
||||
<h1>Computing weighted means</h1>
|
||||
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_weighted_mean');" href="javascript:;">Toggle section visibility</button>
|
||||
<div id="section_toggle_weighted_mean" style="display: block">
|
||||
|
||||
<p> This might be more elegantly implemented as an option within the <tt>stats1</tt> verb. Meanwhile, it’s
|
||||
expressible within the DSL:
|
||||
|
||||
POKI_INCLUDE_AND_RUN_ESCAPED(data/weighted-mean.sh)HERE
|
||||
|
||||
</div>
|
||||
<!-- ================================================================ -->
|
||||
<h1>Generating random numbers from various distributions</h1>
|
||||
|
|
|
|||
|
|
@ -197,6 +197,7 @@ Miller commands were run with pretty-print-tabular output format.
|
|||
• <a href="#Randomly_generating_jabberwocky_words">Randomly generating jabberwocky words</a><br/>
|
||||
• <a href="#Program_timing">Program timing</a><br/>
|
||||
• <a href="#Computing_interquartile_ranges">Computing interquartile ranges</a><br/>
|
||||
• <a href="#Computing_weighted_means">Computing weighted means</a><br/>
|
||||
• <a href="#Generating_random_numbers_from_various_distributions">Generating random numbers from various distributions</a><br/>
|
||||
• <a href="#Sieve_of_Eratosthenes">Sieve of Eratosthenes</a><br/>
|
||||
• <a href="#Mandelbrot-set_generator">Mandelbrot-set generator</a><br/>
|
||||
|
|
@ -374,6 +375,51 @@ y_iqr 0.511866
|
|||
</div>
|
||||
<p/>
|
||||
|
||||
</div>
|
||||
<!-- ================================================================ -->
|
||||
<a id="Computing_weighted_means"/><h1>Computing weighted means</h1>
|
||||
<button style="font-weight:bold;color:maroon;border:0" padding=0 onclick="toggle_by_name('section_toggle_weighted_mean');" href="javascript:;">Toggle section visibility</button>
|
||||
<div id="section_toggle_weighted_mean" style="display: block">
|
||||
|
||||
<p> This might be more elegantly implemented as an option within the <tt>stats1</tt> verb. Meanwhile, it’s
|
||||
expressible within the DSL:
|
||||
|
||||
<p/>
|
||||
<div class="pokipanel">
|
||||
<pre>
|
||||
$ mlr --from data/medium put -q '
|
||||
# Using the y field for weighting in this example
|
||||
weight = $y;
|
||||
|
||||
# Using the a field for weighted aggregation in this example
|
||||
@sumwx[$a] += weight * $i;
|
||||
@sumw[$a] += weight;
|
||||
|
||||
@sumx[$a] += $i;
|
||||
@sumn[$a] += 1;
|
||||
|
||||
end {
|
||||
map wmean = {};
|
||||
map mean = {};
|
||||
for (a in @sumwx) {
|
||||
wmean[a] = @sumwx[a] / @sumw[a]
|
||||
}
|
||||
for (a in @sumx) {
|
||||
mean[a] = @sumx[a] / @sumn[a]
|
||||
}
|
||||
#emit wmean, "a";
|
||||
#emit mean, "a";
|
||||
emit (wmean, mean), "a";
|
||||
}'
|
||||
a=pan,wmean=4979.563722,mean=5028.259010
|
||||
a=eks,wmean=4890.381593,mean=4956.290076
|
||||
a=wye,wmean=4946.987746,mean=4920.001017
|
||||
a=zee,wmean=5164.719685,mean=5123.092330
|
||||
a=hat,wmean=4925.533162,mean=4967.743946
|
||||
</pre>
|
||||
</div>
|
||||
<p/>
|
||||
|
||||
</div>
|
||||
<!-- ================================================================ -->
|
||||
<a id="Generating_random_numbers_from_various_distributions"/><h1>Generating random numbers from various distributions</h1>
|
||||
|
|
|
|||
25
doc/data/weighted-mean.sh
Normal file
|
|
@ -0,0 +1,25 @@
|
|||
mlr --from data/medium put -q '
|
||||
# Using the y field for weighting in this example
|
||||
weight = $y;
|
||||
|
||||
# Using the a field for weighted aggregation in this example
|
||||
@sumwx[$a] += weight * $i;
|
||||
@sumw[$a] += weight;
|
||||
|
||||
@sumx[$a] += $i;
|
||||
@sumn[$a] += 1;
|
||||
|
||||
end {
|
||||
map wmean = {};
|
||||
map mean = {};
|
||||
for (a in @sumwx) {
|
||||
wmean[a] = @sumwx[a] / @sumw[a]
|
||||
}
|
||||
for (a in @sumx) {
|
||||
mean[a] = @sumx[a] / @sumn[a]
|
||||
}
|
||||
#emit wmean, "a";
|
||||
#emit mean, "a";
|
||||
emit (wmean, mean), "a";
|
||||
}'
|
||||
|
||||
|
|
@ -655,8 +655,16 @@ a=eks,b=zee,count=357
|
|||
<div class="pokipanel">
|
||||
<pre>
|
||||
$ mlr count-distinct -u -f a,b data/medium
|
||||
field=a,pan_count=2081,eks_count=1965,wye_count=1966,zee_count=2047,hat_count=1941
|
||||
field=b,pan_count=1942,wye_count=2057,zee_count=1943,eks_count=2008,hat_count=2050
|
||||
field=a,value=pan,count=2081
|
||||
field=a,value=eks,count=1965
|
||||
field=a,value=wye,count=1966
|
||||
field=a,value=zee,count=2047
|
||||
field=a,value=hat,count=1941
|
||||
field=b,value=pan,count=1942
|
||||
field=b,value=wye,count=2057
|
||||
field=b,value=zee,count=1943
|
||||
field=b,value=eks,count=2008
|
||||
field=b,value=hat,count=2050
|
||||
</pre>
|
||||
</div>
|
||||
<p/>
|
||||
|
|
|
|||