mirror of
https://github.com/johnkerl/miller.git
synced 2026-01-23 02:14:13 +00:00
make toupper and tolower DSL functions UTF-8 aware
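As a rough illustration of why the byte-wise loop disabled by this commit is not enough for UTF-8 input, here is a minimal C sketch. `ascii_tolower_copy` is a hypothetical helper written for this note and is not part of Miller. It mirrors the ASCII-only `tolower` loop and assumes the default "C" locale, where bytes of multibyte UTF-8 sequences pass through unchanged. The commit itself calls `utf8lwr` and `utf8upr` from `utf8.h` instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (not Miller code): returns a newly allocated copy of
// `input`, lowercased one byte at a time like the ASCII-only branch.
// The caller frees the result. Returns NULL if allocation fails.
char* ascii_tolower_copy(const char* input) {
    assert(input != NULL);
    size_t n = strlen(input);
    char* out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    // Copy through the terminating NUL; tolower(0) is 0.
    for (size_t i = 0; i <= n; i++)
        out[i] = (char)tolower((unsigned char)input[i]);
    return out;
}
```

Given the UTF-8 bytes for "ÉCOLE" (`"\xC3\x89" "COLE"`), this sketch lowercases only the ASCII letters, so the two-byte "É" is left in upper case. That gap is what a UTF-8-aware routine is meant to close.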
parent
3cb7d78fb7
commit
1751ec87e4
6 changed files with 1293 additions and 40 deletions
@ -1,37 +1,13 @@
-# Positional-indexing and other data-cleaning features
+# Title here

 ## Features:

-* The new [**positional-indexing feature**](http://johnkerl.org/miller/doc/reference-dsl.html#Positional_field_names) resolves https://github.com/johnkerl/miller/issues/236 from @aborruso. You can now get the name of the 3rd field of each record via <tt>$[[3]]</tt>, and its value by <tt>$[[[3]]]</tt>. These are both usable on either the left-hand or right-hand side of assignment statements, so you can more easily do things like renaming fields progrmatically within the DSL.
-
-* There is a new [**capitalize**](http://johnkerl.org/miller/doc/reference-dsl.html#capitalize) DSL function, complementing the already-existing <tt>toupper</tt>. This stems from https://github.com/johnkerl/miller/issues/236.
-
-* There is a new [**skip-trivial-records**](http://johnkerl.org/miller/doc/reference-verbs.html#skip-trivial-records) verb, resolving https://github.com/johnkerl/miller/issues/197. Similarly, there is a new [**remove-empty-columns**](http://johnkerl.org/miller/doc/reference-verbs.html#remove-empty-columns) verb, resolving https://github.com/johnkerl/miller/issues/206. Both are useful for **data-cleaning use-cases**.
-
-* Another pair is https://github.com/johnkerl/miller/issues/181 and https://github.com/johnkerl/miller/issues/256. While Miller uses <tt>mmap</tt> internally (and invisibily) to get approximately a 20% performance boost over not using it, this can cause out-of-memory issues with reading either large files, or too many small ones. Now, Miller automatically avoids <tt>mmap</tt> in these cases. You can still use <tt>--mmap</tt> or <tt>--no-mmap</tt> if you want manual control of this.
-
-* There is a new [**--ivar option for the nest verb**](http://johnkerl.org/miller/doc/reference-verbs.html#nest) which complements the already-existing <tt>--evar</tt>. This is from https://github.com/johnkerl/miller/pull/260 thanks to @jgreely.
-
-* There is a new keystroke-saving [**urandrange**](http://johnkerl.org/miller/doc/reference-dsl.html#urand) DSL function: <tt>urandrange(low, high)</tt> is the same as <tt>low + (high - low) * urand()</tt>.
-
-* There is a new [**-v option for the cat verb**](http://johnkerl.org/miller/doc/reference-verbs.html#cat) which writes a low-level record-structure dump to standard error.
-
-* There is a new [**-N option for mlr**](http://johnkerl.org/miller/doc/manpage.html) which is a keystroke-saver for <tt>--implicit-csv-header --headerless-csv-output</tt>.
+* The [**toupper**](http://johnkerl.org/miller/doc/reference-dsl.html#toupper), [**tolower**](http://johnkerl.org/miller/doc/reference-dsl.html#tolower), and [**capitalize**](http://johnkerl.org/miller/doc/reference-dsl.html#capitalize) DSL functions are now UTF-8 aware, thanks to @sheredom's marvelous https://github.com/sheredom/utf8.h.

 ## Documentation:

-* The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_to_escape_'?'_in_regexes resolves https://github.com/johnkerl/miller/issues/203.
-
-* The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_can_I_filter_by_date resolves https://github.com/johnkerl/miller/issues/208.
-
-* https://github.com/johnkerl/miller/issues/244 fixes a documentation issue while highlighting the need for https://github.com/johnkerl/miller/issues/241.
+* ...

 ## Bugfixes:

-* There was a SEGV using `nest` within `then`-chains, fixed in response to https://github.com/johnkerl/miller/issues/220.
-
-* Quotes and backslashes weren't being escaped in JSON output with <tt>--jvquoteall</tt>; reported on https://github.com/johnkerl/miller/issues/222.
-
-## Note:
-
-I've never code-named releases but if I were to code-name 5.5.0 I would call it "aborruso". Andrea has contributed many fantastic feature requests, as well as driving a huge volume of Miller-related discussions in StackExchange (https://github.com/johnkerl/miller/issues/212). Mille grazie al mio amico @aborruso!
+* ...

@ -35,7 +35,8 @@ libmlr_la_SOURCES= free_flags.h \
 string_builder.c \
 string_builder.h \
 mlr_test_util.c \
-mlr_test_util.h
+mlr_test_util.h \
+utf8.h

 AM_CPPFLAGS= -I${srcdir}/../
 AM_CFLAGS= -std=gnu99

@ -340,7 +340,8 @@ libmlr_la_SOURCES = free_flags.h \
 string_builder.c \
 string_builder.h \
 mlr_test_util.c \
-mlr_test_util.h
+mlr_test_util.h \
+utf8.h

 AM_CPPFLAGS = -I${srcdir}/../
 AM_CFLAGS = -std=gnu99

@ -7,6 +7,7 @@
 #include "lib/mlrdatetime.h"
 #include "lib/mlrregex.h"
 #include "lib/mvfuncs.h"
+#include "lib/utf8.h"

 // ================================================================
 // See important notes at the top of mlrval.h.
@ -403,8 +404,14 @@ mv_t s_x_typeof_func(mv_t* pval1) {
 // ----------------------------------------------------------------
 mv_t s_s_tolower_func(mv_t* pval1) {
     char* string = mlr_strdup_or_die(pval1->u.strv);
+#if 0
+    // ASCII only
     for (char* c = string; *c; c++)
         *c = tolower((unsigned char)*c);
+#else
+    // UTF-8
+    utf8lwr(string);
+#endif
     mv_free(pval1);
     pval1->u.strv = NULL;

@ -413,8 +420,14 @@ mv_t s_s_tolower_func(mv_t* pval1) {

 mv_t s_s_toupper_func(mv_t* pval1) {
     char* string = mlr_strdup_or_die(pval1->u.strv);
+#if 0
+    // ASCII only
     for (char* c = string; *c; c++)
         *c = toupper((unsigned char)*c);
+#else
+    // UTF-8
+    utf8upr(string);
+#endif
     mv_free(pval1);
     pval1->u.strv = NULL;

c/lib/utf8.h (1262 lines changed, normal file)
File diff suppressed because it is too large

configure (20 lines changed, vendored)
@ -1,6 +1,6 @@
 #! /bin/sh
 # Guess values for system-dependent variables and create Makefiles.
-# Generated by GNU Autoconf 2.69 for mlr 5.5.0.
+# Generated by GNU Autoconf 2.69 for mlr 5.5.0-dev.
 #
 #
 # Copyright (C) 1992-1996, 1998-2012 Free Software Foundation, Inc.
@ -587,8 +587,8 @@ MAKEFLAGS=
 # Identity of this package.
 PACKAGE_NAME='mlr'
 PACKAGE_TARNAME='mlr'
-PACKAGE_VERSION='5.5.0'
-PACKAGE_STRING='mlr 5.5.0'
+PACKAGE_VERSION='5.5.0-dev'
+PACKAGE_STRING='mlr 5.5.0-dev'
 PACKAGE_BUGREPORT=''
 PACKAGE_URL=''

@ -1313,7 +1313,7 @@ if test "$ac_init_help" = "long"; then
 # Omit some internal or obsolete options to make the list less imposing.
 # This message is too long to be a string in the A/UX 3.1 sh.
 cat <<_ACEOF
-\`configure' configures mlr 5.5.0 to adapt to many kinds of systems.
+\`configure' configures mlr 5.5.0-dev to adapt to many kinds of systems.

 Usage: $0 [OPTION]... [VAR=VALUE]...

@ -1383,7 +1383,7 @@ fi

 if test -n "$ac_init_help"; then
 case $ac_init_help in
-short | recursive ) echo "Configuration of mlr 5.5.0:";;
+short | recursive ) echo "Configuration of mlr 5.5.0-dev:";;
 esac
 cat <<\_ACEOF

@ -1493,7 +1493,7 @@ fi
 test -n "$ac_init_help" && exit $ac_status
 if $ac_init_version; then
 cat <<\_ACEOF
-mlr configure 5.5.0
+mlr configure 5.5.0-dev
 generated by GNU Autoconf 2.69

 Copyright (C) 2012 Free Software Foundation, Inc.
@ -1771,7 +1771,7 @@ cat >config.log <<_ACEOF
 This file contains any messages produced by compilers while
 running configure, to aid debugging if configure makes a mistake.

-It was created by mlr $as_me 5.5.0, which was
+It was created by mlr $as_me 5.5.0-dev, which was
 generated by GNU Autoconf 2.69. Invocation command line was

 $ $0 $@
@ -2638,7 +2638,7 @@ fi

 # Define the identity of the package.
 PACKAGE='mlr'
-VERSION='5.5.0'
+VERSION='5.5.0-dev'


 cat >>confdefs.h <<_ACEOF
@ -12714,7 +12714,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
 # report actual input values of CONFIG_FILES etc. instead of their
 # values after options handling.
 ac_log="
-This file was extended by mlr $as_me 5.5.0, which was
+This file was extended by mlr $as_me 5.5.0-dev, which was
 generated by GNU Autoconf 2.69. Invocation command line was

 CONFIG_FILES = $CONFIG_FILES
@ -12784,7 +12784,7 @@ _ACEOF
 cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
 ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
 ac_cs_version="\\
-mlr config.status 5.5.0
+mlr config.status 5.5.0-dev
 configured by $0, generated by GNU Autoconf 2.69,
 with options \\"\$ac_cs_config\\"
