Export library code in pkg/ (#1391)

* Export library code in `pkg/`
* new doc page
2026-01-23 02:14:13 +00:00 · 2023-09-10 17:15:13 -04:00 · 2023-09-10 17:15:13 -04:00 · 268a96d002
commit 268a96d002
parent 93b7c8eac0
358 changed files with 1076 additions and 693 deletions
--- a/pkg/types/README.md
+++ b/pkg/types/README.md
@@ -0,0 +1,39 @@
+This contains the implementation of the [`types.Mlrval`](./mlrval.go) datatype which is used for record values, as well as expression/variable values in the Miller `put`/`filter` DSL.
+
+# Mlrval
+
+The [`types.Mlrval`](./mlrval.go) structure includes **string, int, float, boolean, array-of-mlrval, map-string-to-mlrval, void, absent, and error** types as well as type-conversion logic for various operators.
+
+* Miller's `absent` type is like JavaScript's `undefined` -- it's for times when there is no such key, as in a DSL expression `$out = $foo` when the input record is `x=3,y=4` -- there is no `$foo` so `$foo` has `absent` type. Nothing is written to the `$out` field in this case. See also [here](https://miller.readthedocs.io/en/latest/reference-main-null-data) for more information.
+* Miller's `void` type is like JavaScript's `null` -- it's for times when there is a key with no value, as in `$out = $x` when the input record is `x=,y=4`. This overlaps with the `string` type, since a void value looks like an empty string. I've gone back and forth on this (including when I was writing the C implementation) -- whether to retain `void` as a distinct type from empty-string, or not. I ended up keeping it as it made the `Mlrval` logic easier to understand.
+* Miller's `error` type is for things like doing type-uncoerced addition of strings. Data-dependent errors are intended to result in `(error)`-valued output, rather than crashing Miller. See also [here](https://miller.readthedocs.io/en/latest/reference-main-data-types) for more information.
+* Miller's number handling makes auto-overflow from int to float transparent, while preserving the possibility of 64-bit bitwise arithmetic.
+  * This is different from JavaScript, which has only double-precision floats and thus no support for 64-bit numbers (note however that there is now [`BigInt`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt)).
+  * This is also different from C and Go, wherein casts are necessary -- without which int arithmetic overflows.
+  * Using `$a * $b` in Miller will auto-overflow to float. Using `$a .* $b` will stick with 64-bit integers (if `$a` and `$b` are already 64-bit integers).
+  * More generally:
+    * Bitwise operators such as `|`, `&`, and `^` map ints to ints.
+    * The auto-overflowing math operators `+`, `*`, etc. map ints to ints unless they overflow in which case float is produced.
+    * The int-preserving math operators `.+`, `.*`, etc. map ints to ints even if they overflow.
+  * See also [here](https://miller.readthedocs.io/en/latest/reference-main-arithmetic) for the semantics of Miller arithmetic, which the `Mlrval` class implements.
+* Since a Mlrval can be of type array-of-mlrval or map-string-to-mlrval, a Mlrval is suited for JSON decoding/encoding.
+
+# Mlrmap
+
+[`types.Mlrmap`](./mlrmap.go) is the sequence of key-value pairs which represents a Miller record. The key-lookup mechanism is optimized for Miller read/write usage patterns -- please see `mlrmap.go` for more details.
+
+It's also an ordered map structure, with string keys and Mlrval values. This is used within Mlrval itself.
+
+# Context
+
+[`types.Context`](./context.go) supports AWK-like variables such as `FILENAME`, `NF`, `NR`, and so on.
+
+# A note on JSON
+
+* The code for JSON I/O is mixed between `Mlrval` and `Mlrmap`. This is unsurprising since JSON is a mutually recursive data structure -- arrays can contain maps and vice versa.
+* JSON has non-collection types (string, int, float, etc) as well as collection types (array and object). Support for objects is principally in [./mlrmap_json.go](mlrmap_json.go); support for non-collection types as well as arrays is in [./mlrval_json.go](mlrval_json.go).
+* Both multi-line and single-line formats are supported.
+* Callsites for JSON output are record-writing (e.g. `--ojson`), the `dump` and `print` DSL routines, and the `json_stringify` DSL function.
+  * The choice between single-line and multi-line for JSON record-writing is controlled by `--jvstack` and `--no-jvstack`, the former (multiline) being the default.
+  * The `dump` and `print` DSL routines produce multi-line output without a way for the user to choose single-line output.
+  * The `json_stringify` DSL function lets the user specify multi-line or single-line, with the former being the default.
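
The number-handling rules in the README above (`*` auto-overflows to float, `.*` stays in 64-bit ints) can be illustrated with a small standalone sketch. This is not Miller's code: the `number` type and the functions `times`, `dotTimes`, and `mulOverflows` are hypothetical names made up for illustration, and the overflow check is one common way to detect signed overflow, not necessarily the one `Mlrval` uses.

```go
package main

import (
	"fmt"
	"math"
)

// number is a hypothetical stand-in for an int-or-float value.
// It is not Miller's Mlrval.
type number struct {
	isInt bool
	i     int64
	f     float64
}

// mulOverflows reports whether a*b falls outside the int64 range.
func mulOverflows(a, b int64) bool {
	if a == 0 || b == 0 {
		return false
	}
	// MinInt64 * -1 wraps back to MinInt64, and the division check
	// below cannot see that case, so handle it explicitly.
	if (a == -1 && b == math.MinInt64) || (b == -1 && a == math.MinInt64) {
		return true
	}
	return (a*b)/b != a
}

// times sketches the auto-overflowing `*`: the result is an int when the
// product fits in 64 bits, and a float otherwise.
func times(a, b int64) number {
	if mulOverflows(a, b) {
		return number{isInt: false, f: float64(a) * float64(b)}
	}
	return number{isInt: true, i: a * b}
}

// dotTimes sketches the int-preserving `.*`: the result is always an int,
// with two's-complement wraparound on overflow.
func dotTimes(a, b int64) number {
	return number{isInt: true, i: a * b}
}

func main() {
	big := int64(1) << 62
	fmt.Printf("times(3, 4)      = %+v\n", times(3, 4))
	fmt.Printf("times(big, 4)    = %+v\n", times(big, 4))
	fmt.Printf("dotTimes(big, 4) = %+v\n", dotTimes(big, 4))
}
```

The point of the two entry points is that the caller picks the semantics by operator, so ordinary arithmetic never silently wraps while bit-twiddling code keeps exact 64-bit behavior.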
--- a/pkg/types/context.go
+++ b/pkg/types/context.go
@@ -0,0 +1,166 @@
+package types
+
+import (
+	"bytes"
+	"container/list"
+	"strconv"
+
+	"github.com/johnkerl/miller/pkg/mlrval"
+)
+
+// Since Go is concurrent, the context struct (AWK-like variables such as
+// FILENAME, NF, NR, FNR, etc.) needs to be duplicated and passed through the
+// channels along with each record.
+//
+// Strings to be printed from put/filter DSL print/dump/etc statements are
+// passed along to the output channel via this OutputString rather than
+// fmt.Println directly in the put/filter handlers since we want all print
+// statements and record-output to be in the same goroutine, for deterministic
+// output ordering.
+
+type RecordAndContext struct {
+	Record       *mlrval.Mlrmap
+	Context      Context
+	OutputString string
+	EndOfStream  bool
+}
+
+func NewRecordAndContext(
+	record *mlrval.Mlrmap,
+	context *Context,
+) *RecordAndContext {
+	return &RecordAndContext{
+		Record: record,
+		// Since Go is concurrent, the context struct needs to be duplicated and
+		// passed through the channels along with each record. Here is where
+		// the copy happens, via the '*' in *context.
+		Context:      *context,
+		OutputString: "",
+		EndOfStream:  false,
+	}
+}
+
+// Copy returns a new RecordAndContext holding a copy of the record and a copy
+// of the context. OutputString and EndOfStream are reset in the copy.
+func (rac *RecordAndContext) Copy() *RecordAndContext {
+	if rac == nil {
+		return nil
+	}
+	recordCopy := rac.Record.Copy()
+	contextCopy := rac.Context
+	return &RecordAndContext{
+		Record:       recordCopy,
+		Context:      contextCopy,
+		OutputString: "",
+		EndOfStream:  false,
+	}
+}
+
+// For print/dump/etc to insert strings sequenced into the record-output
+// stream.  This avoids race conditions between different goroutines printing
+// to stdout: we have a single designated goroutine printing to stdout. This
+// makes output more predictable and intuitive for users; it also makes our
+// regression tests run reliably the same each time.
+func NewOutputString(
+	outputString string,
+	context *Context,
+) *RecordAndContext {
+	return &RecordAndContext{
+		Record:       nil,
+		Context:      *context,
+		OutputString: outputString,
+		EndOfStream:  false,
+	}
+}
+
+// NewEndOfStreamMarker creates the marker passed down the channel to signal
+// that no more records will follow.
+func NewEndOfStreamMarker(context *Context) *RecordAndContext {
+	return &RecordAndContext{
+		Record:       nil,
+		Context:      *context,
+		OutputString: "",
+		EndOfStream:  true,
+	}
+}
+
+// NewEndOfStreamMarkerList wraps an end-of-stream marker in a new
+// single-element list.
+func NewEndOfStreamMarkerList(context *Context) *list.List {
+	ell := list.New()
+	ell.PushBack(NewEndOfStreamMarker(context))
+	return ell
+}
+
+// ----------------------------------------------------------------
+type Context struct {
+	FILENAME string
+	FILENUM  int64
+
+	// This is computed dynamically from the current record's field-count
+	// NF int
+	NR  int64
+	FNR int64
+}
+
+// NewContext returns a context with its initial values, as they are before
+// any input file has been opened.
+func NewContext() *Context {
+	context := &Context{
+		FILENAME: "(stdin)",
+		FILENUM:  0,
+		NR:       0,
+		FNR:      0,
+	}
+
+	return context
+}
+
+// NewNilContext currently returns the same initial values as NewContext.
+func NewNilContext() *Context { // TODO: rename
+	context := &Context{
+		FILENAME: "(stdin)",
+		FILENUM:  0,
+		NR:       0,
+		FNR:      0,
+	}
+
+	return context
+}
+
+// For the record-readers to update their initial context as each new file is opened.
+func (context *Context) UpdateForStartOfFile(filename string) {
+	context.FILENAME = filename
+	context.FILENUM++
+	context.FNR = 0
+}
+
+// For the record-readers to update their initial context as each new record is read.
+func (context *Context) UpdateForInputRecord() {
+	context.NR++
+	context.FNR++
+}
+
+func (context *Context) Copy() *Context {
+	other := *context
+	return &other
+}
+
+func (context *Context) GetStatusString() string {
+
+	var buffer bytes.Buffer // stdio is non-buffered in Go, so buffer for speed increase
+	buffer.WriteString("FILENAME=\"")
+	buffer.WriteString(context.FILENAME)
+
+	buffer.WriteString("\",FILENUM=")
+	buffer.WriteString(strconv.FormatInt(context.FILENUM, 10))
+
+	buffer.WriteString(",NR=")
+	buffer.WriteString(strconv.FormatInt(context.NR, 10))
+
+	buffer.WriteString(",FNR=")
+	buffer.WriteString(strconv.FormatInt(context.FNR, 10))
+
+	return buffer.String()
+}
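
The comments in `context.go` above explain that the context is copied by value and sent down the channel with each record. A minimal standalone sketch of why that matters follows. It does not import Miller: the lower-case `context`, `recordAndContext`, `inputFile`, `readFiles`, and `collect` names are hypothetical, and records are plain strings here rather than `*mlrval.Mlrmap`.

```go
package main

import "fmt"

// context is a trimmed, hypothetical copy of the Context struct in the diff.
type context struct {
	FILENAME string
	FILENUM  int64
	NR       int64
	FNR      int64
}

// recordAndContext pairs a record with a snapshot of the context. The
// context is held by value, so each message owns its own snapshot.
type recordAndContext struct {
	Record  string
	Context context
}

// inputFile is a made-up in-memory stand-in for a file of records.
type inputFile struct {
	name  string
	lines []string
}

// readFiles plays the role of a record-reader: it keeps mutating one
// context and sends a value copy (*ctx) along with each record.
func readFiles(files []inputFile, out chan<- recordAndContext) {
	ctx := &context{FILENAME: "(stdin)"}
	for _, f := range files {
		ctx.FILENAME = f.name
		ctx.FILENUM++
		ctx.FNR = 0
		for _, line := range f.lines {
			ctx.NR++
			ctx.FNR++
			out <- recordAndContext{Record: line, Context: *ctx}
		}
	}
	close(out)
}

// collect drains the channel. The buffer lets the reader run ahead of the
// consumer, which is exactly when a shared pointer would show stale values.
func collect(files []inputFile) []recordAndContext {
	out := make(chan recordAndContext, 16)
	go readFiles(files, out)
	var got []recordAndContext
	for rac := range out {
		got = append(got, rac)
	}
	return got
}

func main() {
	files := []inputFile{
		{name: "a.csv", lines: []string{"x=1", "x=2"}},
		{name: "b.csv", lines: []string{"x=3"}},
	}
	for _, rac := range collect(files) {
		fmt.Printf("%s %+v\n", rac.Record, rac.Context)
	}
}
```

If `recordAndContext` held a `*context` instead, every buffered message would point at the same struct, and a slow consumer could see the `NR` and `FILENAME` of a later record.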
--- a/pkg/types/doc.go
+++ b/pkg/types/doc.go
@@ -0,0 +1,4 @@
+// Package types contains the implementation of the Mlrval datatype which is
+// used for record values, as well as expression/variable values in the Miller
+// put/filter DSL.
+package types
--- a/pkg/types/indexed-lvalues.md
+++ b/pkg/types/indexed-lvalues.md
@@ -0,0 +1,40 @@
+# Supported indexable lvalues
+
+* Direct/indirect field name like `$x` or `$["x"]`
+* Direct/indirect oosvar like `@x` or `@["x"]`
+* Local variable like `x`
+* Full srec `$*`
+* Full oosvar `@*`
+
+# Supported indexing
+
+Each level by int or string:
+
+* `$x[1]` or `$x["a"]`
+* `@x[1]` or `@x["a"]`
+* `x[1]` or `x["a"]`
+* `$*[1]` or `$*["a"]`
+* `@*[1]` (not supported) or `@*["a"]`
+
+Multiple levels:
+
+* Each can be further indexed, e.g. `$x[1]["a"][3]`
+
+Auto-deepen:
+
+* `x[1][2][3] = 4` should auto-deepen
+* `x["a"]["b"]["c"] = 4` should auto-deepen
+  * Create new maps at each level if necessary, unless something else is already there -- e.g. when `x["a"]` is already an int, an array, etc.
+
+# Indexed types
+
+* `$x` is a `Mlrval`
+* `@x` is a `Mlrval`
+* `x` is a `Mlrval`
+* `$*` is a `Mlrmap`
+* `@*` is a `Mlrmap`
+
+# Implementation
+
+* `*Mlrval` needs a `PutIndexed` which takes `indices []*Mlrval` and `rvalue *Mlrval`.
+* `*Mlrmap` needs a `PutIndexed` which takes `indices []*Mlrval` and `rvalue *Mlrval`.
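
The auto-deepen rule in `indexed-lvalues.md` above can be sketched on plain Go maps. This is only an illustration under simplifying assumptions: the `putIndexed` function here is hypothetical, takes string keys only (no int or array indexing), and works on `map[string]interface{}` rather than on `Mlrval` or `Mlrmap`.

```go
package main

import "fmt"

// putIndexed assigns value at m[keys[0]][keys[1]]..., creating intermediate
// maps where a level is missing. It returns an error if an intermediate
// level already holds something that is not a map.
func putIndexed(m map[string]interface{}, keys []string, value interface{}) error {
	if len(keys) == 0 {
		return fmt.Errorf("putIndexed: need at least one key")
	}
	level := m
	for _, key := range keys[:len(keys)-1] {
		next, present := level[key]
		if !present {
			// Auto-deepen: nothing here yet, so create a new map.
			child := map[string]interface{}{}
			level[key] = child
			level = child
			continue
		}
		child, isMap := next.(map[string]interface{})
		if !isMap {
			return fmt.Errorf("putIndexed: %q already holds a non-map value", key)
		}
		level = child
	}
	level[keys[len(keys)-1]] = value
	return nil
}

func main() {
	x := map[string]interface{}{}
	err := putIndexed(x, []string{"a", "b", "c"}, 4)
	fmt.Println(x, err)

	y := map[string]interface{}{"a": 1}
	err = putIndexed(y, []string{"a", "b"}, 4)
	fmt.Println(y, err)
}
```

Returning an error for the "already something else" case is a choice made for this sketch; the note above only says new maps are not created in that situation.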
--- a/pkg/types/mlrval_typing.go
+++ b/pkg/types/mlrval_typing.go
@@ -0,0 +1,100 @@
+// ================================================================
+// Support for things like 'num x = $a + $b' in the DSL, wherein we check types
+// at assignment time.
+// ================================================================
+
+package types
+
+import (
+	"fmt"
+
+	"github.com/johnkerl/miller/pkg/mlrval"
+)
+
+// ----------------------------------------------------------------
+type TypeGatedMlrvalName struct {
+	Name     string
+	TypeName string
+	TypeMask int
+}
+
+func NewTypeGatedMlrvalName(
+	name string, // e.g. "x"
+	typeName string, // e.g. "num"
+) (*TypeGatedMlrvalName, error) {
+	typeMask, ok := mlrval.TypeNameToMask(typeName)
+	if !ok {
+		return nil, fmt.Errorf("mlr: couldn't resolve type name \"%s\".", typeName)
+	}
+	return &TypeGatedMlrvalName{
+		Name:     name,
+		TypeName: typeName,
+		TypeMask: typeMask,
+	}, nil
+}
+
+func (tname *TypeGatedMlrvalName) Check(value *mlrval.Mlrval) error {
+	bit := value.GetTypeBit()
+	if bit&tname.TypeMask != 0 {
+		return nil
+	} else {
+		return fmt.Errorf(
+			"mlr: couldn't assign variable %s %s from value %s %s\n",
+			tname.TypeName, tname.Name, value.GetTypeName(), value.String(),
+		)
+	}
+}
+
+// ----------------------------------------------------------------
+type TypeGatedMlrvalVariable struct {
+	typeGatedMlrvalName *TypeGatedMlrvalName
+	value               *mlrval.Mlrval
+}
+
+func NewTypeGatedMlrvalVariable(
+	name string, // e.g. "x"
+	typeName string, // e.g. "num"
+	value *mlrval.Mlrval,
+) (*TypeGatedMlrvalVariable, error) {
+	typeGatedMlrvalName, err := NewTypeGatedMlrvalName(name, typeName)
+	if err != nil {
+		return nil, err
+	}
+
+	err = typeGatedMlrvalName.Check(value)
+	if err != nil {
+		return nil, err
+	}
+
+	return &TypeGatedMlrvalVariable{
+		typeGatedMlrvalName,
+		value.Copy(),
+	}, nil
+}
+
+func (tvar *TypeGatedMlrvalVariable) GetName() string {
+	return tvar.typeGatedMlrvalName.Name
+}
+
+func (tvar *TypeGatedMlrvalVariable) GetValue() *mlrval.Mlrval {
+	return tvar.value
+}
+
+func (tvar *TypeGatedMlrvalVariable) ValueString() string {
+	return tvar.value.String()
+}
+
+func (tvar *TypeGatedMlrvalVariable) Assign(value *mlrval.Mlrval) error {
+	err := tvar.typeGatedMlrvalName.Check(value)
+	if err != nil {
+		return err
+	}
+
+	// TODO: revisit copy-reduction
+	tvar.value = value.Copy()
+	return nil
+}
+
+func (tvar *TypeGatedMlrvalVariable) Unassign() {
+	tvar.value = mlrval.ABSENT
+}
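
The type gate in `mlrval_typing.go` above tests `bit&tname.TypeMask != 0`: each value type has one bit, and a declared type name such as `num` is a mask that may cover several bits. A minimal standalone sketch of that idea follows. The bit values, the `typeNameToMask` table, and the `checkAssignable` function are hypothetical and chosen for illustration; they are not the values returned by `mlrval.TypeNameToMask`.

```go
package main

import "fmt"

// One bit per value type. These particular values are made up.
const (
	bitInt = 1 << iota
	bitFloat
	bitString
)

// typeNameToMask maps a declared type name to the set of value-type bits
// it accepts. "num" accepts either int or float.
var typeNameToMask = map[string]int{
	"int":   bitInt,
	"float": bitFloat,
	"str":   bitString,
	"num":   bitInt | bitFloat,
}

// checkAssignable reports whether a value with the given type bit may be
// assigned to a variable declared with typeName.
func checkAssignable(typeName string, valueBit int) error {
	mask, ok := typeNameToMask[typeName]
	if !ok {
		return fmt.Errorf("couldn't resolve type name %q", typeName)
	}
	if valueBit&mask == 0 {
		return fmt.Errorf("value with type bit %d is not assignable to %s", valueBit, typeName)
	}
	return nil
}

func main() {
	fmt.Println(checkAssignable("num", bitInt))
	fmt.Println(checkAssignable("num", bitString))
	fmt.Println(checkAssignable("bogus", bitInt))
}
```

With this layout, adding a new union type is a matter of adding one table entry, and the check itself stays a single bitwise AND.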