diff --git a/.github/workflows/codespell.yml b/.github/workflows/codespell.yml
index e5cfb1704..8b6cb20ec 100644
--- a/.github/workflows/codespell.yml
+++ b/.github/workflows/codespell.yml
@@ -33,7 +33,4 @@ jobs:
         with:
           check_filenames: true
           ignore_words_file: .codespellignore
-          # ignore_words_list: denom,inout,iput,nd,nin,numer,te,wee
-          # There is a word "RO" in docs/src/shapes-of-data.md.in and docs/src/shapes-of-data.md
-          # which is listed in .codespellignore but which codespell refuses to ignore. Not sure why.
           skip: "*.csv,*.dkvp,*.txt,*.js,*.html,*.map,./tags,./test/cases,./docs/src/shapes-of-data.md.in,./docs/src/shapes-of-data.md"
diff --git a/docs/src/data/colours.csv b/docs/src/data/colours.csv
index f6dbe24aa..e0ce75494 100644
--- a/docs/src/data/colours.csv
+++ b/docs/src/data/colours.csv
@@ -1,3 +1,3 @@
-KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR
+KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR
 masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
 masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
diff --git a/docs/src/shapes-of-data.md b/docs/src/shapes-of-data.md
index 75ead4426..bab58b7f0 100644
--- a/docs/src/shapes-of-data.md
+++ b/docs/src/shapes-of-data.md
@@ -36,7 +36,7 @@ Use the `file` command to see if there are CR/LF terminators (in this case, ther
 file data/colours.csv
-data/colours.csv: UTF-8 Unicode text
+data/colours.csv: Unicode text, UTF-8 text
 
 Look at the file to find names of fields:
@@ -45,18 +45,15 @@ Look at the file to find names of fields:
 cat data/colours.csv
-KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR
-masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Witter;Biały;Alb;Beyaz
+KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR
+masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
 masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
 
 Extract a few fields:
 
-mlr --csv cut -f KEY,PL,RO data/colours.csv
-(only blank lines appear)
+mlr --csv cut -f KEY,PL,TO data/colours.csv
 
 Use XTAB output format to get a sharper picture of where records/fields are being split:
@@ -65,12 +62,12 @@ Use XTAB output format to get a sharper picture of where records/fields are bein
 mlr --icsv --oxtab cat data/colours.csv
-KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Witter;Biały;Alb;Beyaz
+KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
 
-KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
+KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
 
-Using XTAB output format makes it clearer that `KEY;DE;...;RO;TR` is being treated as a single field name in the CSV header, and likewise each subsequent line is being treated as a single field value. This is because the default field separator is a comma but we have semicolons here. Use XTAB again with different field separator (`--fs semicolon`):
+Using XTAB output format makes it clearer that `KEY;DE;...;TR` is being treated as a single field name in the CSV header, and likewise each subsequent line is being treated as a single field value. This is because the default field separator is a comma but we have semicolons here. Use XTAB again with different field separator (`--fs semicolon`):
 mlr --icsv --ifs semicolon --oxtab cat data/colours.csv
@@ -83,9 +80,9 @@ ES Blanco
 FI Valkoinen
 FR Blanc
 IT Bianco
-NL Witter
+NL Wit
 PL Biały
-RO Alb
+TO Alb
 TR Beyaz
 
 KEY masterdata_colourcode_2
@@ -97,17 +94,17 @@ FR Noir
 IT Nero
 NL Zwart
 PL Czarny
-RO Negru
+TO Negru
 TR Siyah
 
 Using the new field-separator, retry the cut:
-mlr --csv --fs semicolon cut -f KEY,PL,RO data/colours.csv
+mlr --csv --fs semicolon cut -f KEY,PL,TO data/colours.csv
-KEY;PL;RO
+KEY;PL;TO
 masterdata_colourcode_1;Biały;Alb
 masterdata_colourcode_2;Czarny;Negru
diff --git a/docs/src/shapes-of-data.md.in b/docs/src/shapes-of-data.md.in
index b54719a1f..c32b0dad1 100644
--- a/docs/src/shapes-of-data.md.in
+++ b/docs/src/shapes-of-data.md.in
@@ -18,35 +18,34 @@ Use the `file` command to see if there are CR/LF terminators (in this case, ther
 
 GENMD-CARDIFY-HIGHLIGHT-ONE
 file data/colours.csv
-data/colours.csv: UTF-8 Unicode text
+data/colours.csv: Unicode text, UTF-8 text
 GENMD-EOF
 
 Look at the file to find names of fields:
 
 GENMD-CARDIFY-HIGHLIGHT-ONE
 cat data/colours.csv
-KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR
-masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Witter;Biały;Alb;Beyaz
+KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR
+masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
 masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
 GENMD-EOF
 
 Extract a few fields:
 
 GENMD-CARDIFY-HIGHLIGHT-ONE
-mlr --csv cut -f KEY,PL,RO data/colours.csv
-(only blank lines appear)
+mlr --csv cut -f KEY,PL,TO data/colours.csv
 GENMD-EOF
 
 Use XTAB output format to get a sharper picture of where records/fields are being split:
 
 GENMD-CARDIFY-HIGHLIGHT-ONE
 mlr --icsv --oxtab cat data/colours.csv
-KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Witter;Biały;Alb;Beyaz
+KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
 
-KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
+KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
 GENMD-EOF
 
-Using XTAB output format makes it clearer that `KEY;DE;...;RO;TR` is being treated as a single field name in the CSV header, and likewise each subsequent line is being treated as a single field value. This is because the default field separator is a comma but we have semicolons here. Use XTAB again with different field separator (`--fs semicolon`):
+Using XTAB output format makes it clearer that `KEY;DE;...;TR` is being treated as a single field name in the CSV header, and likewise each subsequent line is being treated as a single field value. This is because the default field separator is a comma but we have semicolons here. Use XTAB again with different field separator (`--fs semicolon`):
 
 GENMD-CARDIFY-HIGHLIGHT-ONE
 mlr --icsv --ifs semicolon --oxtab cat data/colours.csv
@@ -57,9 +56,9 @@ ES Blanco
 FI Valkoinen
 FR Blanc
 IT Bianco
-NL Witter
+NL Wit
 PL Biały
-RO Alb
+TO Alb
 TR Beyaz
 
 KEY masterdata_colourcode_2
@@ -71,15 +70,15 @@ FR Noir
 IT Nero
 NL Zwart
 PL Czarny
-RO Negru
+TO Negru
 TR Siyah
 GENMD-EOF
 
 Using the new field-separator, retry the cut:
 
 GENMD-CARDIFY-HIGHLIGHT-ONE
-mlr --csv --fs semicolon cut -f KEY,PL,RO data/colours.csv
-KEY;PL;RO
+mlr --csv --fs semicolon cut -f KEY,PL,TO data/colours.csv
+KEY;PL;TO
 masterdata_colourcode_1;Biały;Alb
 masterdata_colourcode_2;Czarny;Negru
 GENMD-EOF
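
The separator mismatch that the changed docs walk through can be reproduced outside Miller. The following is a minimal sketch in Python, assuming the standard-library `csv` module as a rough stand-in for a CSV reader with a comma default; it is not how Miller is implemented. The helper names `read_records` and `cut` are hypothetical and exist only for this illustration, and the sample rows are copied from `docs/src/data/colours.csv` as it stands after this change.

```python
import csv
import io

# Sample copied from docs/src/data/colours.csv as it stands after this change.
SAMPLE = (
    "KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR\n"
    "masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz\n"
    "masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah\n"
)


def read_records(text, delimiter=","):
    """Parse CSV text into a list of dicts keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def cut(records, fields):
    """Keep only the named fields; absent fields are skipped (an assumption of this sketch)."""
    return [{f: r[f] for f in fields if f in r} for r in records]


# Comma separator: the data contains no commas, so each line is one field
# and the whole header line becomes a single field name.
comma_records = read_records(SAMPLE)
print(list(comma_records[0].keys()))
print(cut(comma_records, ["KEY", "PL", "TO"]))

# Semicolon separator: eleven fields per record, so the cut finds its columns.
semi_records = read_records(SAMPLE, delimiter=";")
print(cut(semi_records, ["KEY", "PL", "TO"]))
```

With the comma default, the cut has nothing to select because no field is named `KEY`, `PL`, or `TO`; with the semicolon separator, the same cut returns the three requested columns for both records.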