mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 18:25:45 +00:00
83 lines
2.6 KiB
Markdown
83 lines
2.6 KiB
Markdown
# Randomizing examples
|
|
|
|
## Generating random numbers from various distributions
|
|
|
|
Here we can chain together a few simple building blocks:
|
|
|
|
GENMD_RUN_COMMAND
|
|
cat expo-sample.sh
|
|
GENMD_EOF
|
|
|
|
Namely:
|
|
|
|
* Set the Miller random-number seed so this webdoc looks the same every time I regenerate it.
|
|
* Use pretty-printed tabular output.
|
|
* Use `seqgen` to produce 100,000 records `i=0`, `i=1`, etc.
|
|
* Send those to a `put` step which defines an inverse-transform-sampling function and calls it twice, then computes the sum and product of samples.
|
|
* Send those to a histogram, and from there to a bar-plotter. This is just for visualization; you could just as well output CSV and send that off to your own plotting tool, etc.
|
|
|
|
The output is as follows:
|
|
|
|
GENMD_RUN_COMMAND
|
|
sh expo-sample.sh
|
|
GENMD_EOF
|
|
|
|
## Randomly selecting words from a list
|
|
|
|
Given this [word list](./data/english-words.txt), first take a look to see what the first few lines look like:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
head data/english-words.txt
|
|
a
|
|
aa
|
|
aal
|
|
aalii
|
|
aam
|
|
aardvark
|
|
aardwolf
|
|
aba
|
|
abac
|
|
abaca
|
|
GENMD_EOF
|
|
|
|
Then the following will randomly sample ten words with four to eight characters in them:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --from data/english-words.txt --nidx filter -S 'n=strlen($1);4<=n&&n<=8' then sample -k 10
|
|
thionine
|
|
birchman
|
|
mildewy
|
|
avigate
|
|
addedly
|
|
abaze
|
|
askant
|
|
aiming
|
|
insulant
|
|
coinmate
|
|
GENMD_EOF
|
|
|
|
## Randomly generating jabberwocky words
|
|
|
|
These are simple *n*-grams as [described here](http://johnkerl.org/randspell/randspell-slides-ts.pdf). Some common functions are [located here](https://github.com/johnkerl/miller/blob/master/docs/ngrams/ngfuncs.mlr.txt). Then here are scripts for [1-grams](https://github.com/johnkerl/miller/blob/master/docs/ngrams/ng1.mlr.txt), [2-grams](https://github.com/johnkerl/miller/blob/master/docs/ngrams/ng2.mlr.txt), [3-grams](https://github.com/johnkerl/miller/blob/master/docs/ngrams/ng3.mlr.txt), [4-grams](https://github.com/johnkerl/miller/blob/master/docs/ngrams/ng4.mlr.txt), and [5-grams](https://github.com/johnkerl/miller/blob/master/docs/ngrams/ng5.mlr.txt).
|
|
|
|
The idea is that words from the input file are consumed, then taken apart and pasted back together in ways which imitate the letter-to-letter transitions found in the word list -- giving us automatically generated words in the same vein as *bromance* and *spork*:
|
|
|
|
GENMD_CARDIFY_HIGHLIGHT_ONE
|
|
mlr --nidx --from ./ngrams/gsl-2000.txt put -q -f ./ngrams/ngfuncs.mlr -f ./ngrams/ng5.mlr
|
|
beard
|
|
plastinguish
|
|
politicially
|
|
noise
|
|
loan
|
|
country
|
|
controductionary
|
|
suppery
|
|
lose
|
|
lessors
|
|
dollar
|
|
judge
|
|
rottendence
|
|
lessenger
|
|
diffendant
|
|
suggestional
|
|
GENMD_EOF
|