make toupper and tolower DSL functions UTF-8 aware

John Kerl 2019-09-02 10:08:26 -04:00
parent 3cb7d78fb7
commit 1751ec87e4
6 changed files with 1293 additions and 40 deletions

@ -1,37 +1,13 @@
# Positional-indexing and other data-cleaning features
# Title here
## Features:
* The new [**positional-indexing feature**](http://johnkerl.org/miller/doc/reference-dsl.html#Positional_field_names) resolves https://github.com/johnkerl/miller/issues/236 from @aborruso. You can now get the name of the 3rd field of each record via <tt>$[[3]]</tt>, and its value by <tt>$[[[3]]]</tt>. These are both usable on either the left-hand or right-hand side of assignment statements, so you can more easily do things like renaming fields programmatically within the DSL.
* There is a new [**capitalize**](http://johnkerl.org/miller/doc/reference-dsl.html#capitalize) DSL function, complementing the already-existing <tt>toupper</tt>. This stems from https://github.com/johnkerl/miller/issues/236.
* There is a new [**skip-trivial-records**](http://johnkerl.org/miller/doc/reference-verbs.html#skip-trivial-records) verb, resolving https://github.com/johnkerl/miller/issues/197. Similarly, there is a new [**remove-empty-columns**](http://johnkerl.org/miller/doc/reference-verbs.html#remove-empty-columns) verb, resolving https://github.com/johnkerl/miller/issues/206. Both are useful for **data-cleaning use-cases**.
* Another pair is https://github.com/johnkerl/miller/issues/181 and https://github.com/johnkerl/miller/issues/256. While Miller uses <tt>mmap</tt> internally (and invisibly) to get approximately a 20% performance boost over not using it, this can cause out-of-memory issues when reading either large files or too many small ones. Now, Miller automatically avoids <tt>mmap</tt> in these cases. You can still use <tt>--mmap</tt> or <tt>--no-mmap</tt> if you want manual control of this.
* There is a new [**--ivar option for the nest verb**](http://johnkerl.org/miller/doc/reference-verbs.html#nest) which complements the already-existing <tt>--evar</tt>. This is from https://github.com/johnkerl/miller/pull/260 thanks to @jgreely.
* There is a new keystroke-saving [**urandrange**](http://johnkerl.org/miller/doc/reference-dsl.html#urand) DSL function: <tt>urandrange(low, high)</tt> is the same as <tt>low + (high - low) * urand()</tt>.
* There is a new [**-v option for the cat verb**](http://johnkerl.org/miller/doc/reference-verbs.html#cat) which writes a low-level record-structure dump to standard error.
* There is a new [**-N option for mlr**](http://johnkerl.org/miller/doc/manpage.html) which is a keystroke-saver for <tt>--implicit-csv-header --headerless-csv-output</tt>.
* The [**toupper**](http://johnkerl.org/miller/doc/reference-dsl.html#toupper), [**tolower**](http://johnkerl.org/miller/doc/reference-dsl.html#tolower), and [**capitalize**](http://johnkerl.org/miller/doc/reference-dsl.html#capitalize) DSL functions are now UTF-8 aware, thanks to @sheredom's marvelous https://github.com/sheredom/utf8.h.
## Documentation:
* The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_to_escape_'?'_in_regexes resolves https://github.com/johnkerl/miller/issues/203.
* The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_can_I_filter_by_date resolves https://github.com/johnkerl/miller/issues/208.
* https://github.com/johnkerl/miller/issues/244 fixes a documentation issue while highlighting the need for https://github.com/johnkerl/miller/issues/241.
* ...
## Bugfixes:
* There was a SEGV using `nest` within `then`-chains, fixed in response to https://github.com/johnkerl/miller/issues/220.
* Quotes and backslashes weren't being escaped in JSON output with <tt>--jvquoteall</tt>; reported on https://github.com/johnkerl/miller/issues/222.
## Note:
I've never code-named releases but if I were to code-name 5.5.0 I would call it "aborruso". Andrea has contributed many fantastic feature requests, as well as driving a huge volume of Miller-related discussions in StackExchange (https://github.com/johnkerl/miller/issues/212). Mille grazie al mio amico @aborruso!
* ...

@ -35,7 +35,8 @@ libmlr_la_SOURCES= free_flags.h \
string_builder.c \
string_builder.h \
mlr_test_util.c \
mlr_test_util.h
mlr_test_util.h \
utf8.h
AM_CPPFLAGS= -I${srcdir}/../
AM_CFLAGS= -std=gnu99

@ -340,7 +340,8 @@ libmlr_la_SOURCES = free_flags.h \
string_builder.c \
string_builder.h \
mlr_test_util.c \
mlr_test_util.h
mlr_test_util.h \
utf8.h
AM_CPPFLAGS = -I${srcdir}/../
AM_CFLAGS = -std=gnu99

@ -7,6 +7,7 @@
#include "lib/mlrdatetime.h"
#include "lib/mlrregex.h"
#include "lib/mvfuncs.h"
#include "lib/utf8.h"
// ================================================================
// See important notes at the top of mlrval.h.
@ -403,8 +404,14 @@ mv_t s_x_typeof_func(mv_t* pval1) {
// ----------------------------------------------------------------
mv_t s_s_tolower_func(mv_t* pval1) {
char* string = mlr_strdup_or_die(pval1->u.strv);
#if 0
// ASCII only
for (char* c = string; *c; c++)
*c = tolower((unsigned char)*c);
#else
// UTF-8
utf8lwr(string);
#endif
mv_free(pval1);
pval1->u.strv = NULL;
@ -413,8 +420,14 @@ mv_t s_s_tolower_func(mv_t* pval1) {
mv_t s_s_toupper_func(mv_t* pval1) {
char* string = mlr_strdup_or_die(pval1->u.strv);
#if 0
// ASCII only
for (char* c = string; *c; c++)
*c = toupper((unsigned char)*c);
#else
// UTF-8
utf8upr(string);
#endif
mv_free(pval1);
pval1->u.strv = NULL;
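
The hunks above swap a byte-wise `tolower`/`toupper` loop for `utf8lwr`/`utf8upr`. The following is a minimal sketch of why the byte-wise loop is not enough for UTF-8 input. It is not the code in `utf8.h`: the function names `ascii_lower`, `sketch_utf8_lower`, and `sketch_demo` are hypothetical, and the UTF-8 branch only handles ASCII plus the Latin-1 letters U+00C0..U+00DE, which is an assumption made to keep the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

// ASCII-only lowering, shaped like the disabled "#if 0" branch. In the
// default "C" locale, tolower leaves bytes >= 0x80 untouched, so multi-byte
// UTF-8 letters pass through unchanged.
void ascii_lower(char* s) {
	for (char* c = s; *c; c++)
		*c = (char)tolower((unsigned char)*c);
}

// Hypothetical, deliberately limited UTF-8-aware lowering. It handles ASCII
// and the two-byte sequences 0xC3 0x80..0x9E (U+00C0..U+00DE), skipping
// 0xC3 0x97 (U+00D7, the multiplication sign, which is not a letter).
// Every other sequence is copied through as-is.
void sketch_utf8_lower(char* s) {
	unsigned char* p = (unsigned char*)s;
	while (*p) {
		if (*p < 0x80) {
			*p = (unsigned char)tolower(*p);
			p++;
		} else if (*p == 0xC3 && p[1] >= 0x80 && p[1] <= 0x9E && p[1] != 0x97) {
			p[1] += 0x20; // U+00C0..U+00DE -> U+00E0..U+00FE
			p += 2;
		} else {
			p++; // other lead or continuation byte: leave unchanged
		}
	}
}

// Compares the two approaches on "ÉTÉ", written as explicit UTF-8 bytes.
void sketch_demo(void) {
	char a[] = "\xC3\x89T\xC3\x89";
	char b[] = "\xC3\x89T\xC3\x89";
	ascii_lower(a);
	sketch_utf8_lower(b);
	assert(strcmp(a, "\xC3\x89t\xC3\x89") == 0); // "ÉtÉ": only the ASCII T changed
	assert(strcmp(b, "\xC3\xA9t\xC3\xA9") == 0); // "été"
}
```

Calling `sketch_demo()` from a `main` exercises both paths. A real case-mapping routine would need to cover many more code points than this sketch does, which is the reason for delegating to a dedicated library rather than extending the loop by hand.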

c/lib/utf8.h (new file, 1262 lines added; diff suppressed because it is too large)

configure (vendored, 20 lines changed)

@ -1,6 +1,6 @@
#! /bin/sh
# Guess values for system-dependent variables and create Makefiles.
# Generated by GNU Autoconf 2.69 for mlr 5.5.0.
# Generated by GNU Autoconf 2.69 for mlr 5.5.0-dev.
#
#
# Copyright (C) 1992-1996, 1998-2012 Free Software Foundation, Inc.
@ -587,8 +587,8 @@ MAKEFLAGS=
# Identity of this package.
PACKAGE_NAME='mlr'
PACKAGE_TARNAME='mlr'
PACKAGE_VERSION='5.5.0'
PACKAGE_STRING='mlr 5.5.0'
PACKAGE_VERSION='5.5.0-dev'
PACKAGE_STRING='mlr 5.5.0-dev'
PACKAGE_BUGREPORT=''
PACKAGE_URL=''
@ -1313,7 +1313,7 @@ if test "$ac_init_help" = "long"; then
# Omit some internal or obsolete options to make the list less imposing.
# This message is too long to be a string in the A/UX 3.1 sh.
cat <<_ACEOF
\`configure' configures mlr 5.5.0 to adapt to many kinds of systems.
\`configure' configures mlr 5.5.0-dev to adapt to many kinds of systems.
Usage: $0 [OPTION]... [VAR=VALUE]...
@ -1383,7 +1383,7 @@ fi
if test -n "$ac_init_help"; then
case $ac_init_help in
short | recursive ) echo "Configuration of mlr 5.5.0:";;
short | recursive ) echo "Configuration of mlr 5.5.0-dev:";;
esac
cat <<\_ACEOF
@ -1493,7 +1493,7 @@ fi
test -n "$ac_init_help" && exit $ac_status
if $ac_init_version; then
cat <<\_ACEOF
mlr configure 5.5.0
mlr configure 5.5.0-dev
generated by GNU Autoconf 2.69
Copyright (C) 2012 Free Software Foundation, Inc.
@ -1771,7 +1771,7 @@ cat >config.log <<_ACEOF
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.
It was created by mlr $as_me 5.5.0, which was
It was created by mlr $as_me 5.5.0-dev, which was
generated by GNU Autoconf 2.69. Invocation command line was
$ $0 $@
@ -2638,7 +2638,7 @@ fi
# Define the identity of the package.
PACKAGE='mlr'
VERSION='5.5.0'
VERSION='5.5.0-dev'
cat >>confdefs.h <<_ACEOF
@ -12714,7 +12714,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
# report actual input values of CONFIG_FILES etc. instead of their
# values after options handling.
ac_log="
This file was extended by mlr $as_me 5.5.0, which was
This file was extended by mlr $as_me 5.5.0-dev, which was
generated by GNU Autoconf 2.69. Invocation command line was
CONFIG_FILES = $CONFIG_FILES
@ -12784,7 +12784,7 @@ _ACEOF
cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
ac_cs_version="\\
mlr config.status 5.5.0
mlr config.status 5.5.0-dev
configured by $0, generated by GNU Autoconf 2.69,
with options \\"\$ac_cs_config\\"