Hi, I appreciate you trying to help, but setting LC_CTYPE=C won’t work
for me. The text I’m processing is UTF-8 encoded, and definitely contains
non-ASCII characters. If sed isn’t told the correct encoding, even pretty
basic regular expressions break, such as ‘find all the three-letter
words.’
$ (echo foo; echo bar; echo føo) | sed -e 's/^...$/three-letter-word/g'
three-letter-word
three-letter-word
three-letter-word
$ (echo foo; echo bar; echo føo) | LC_CTYPE=C sed -e 's/^...$/three-letter-word/g'
three-letter-word
three-letter-word
føo
There is an edge case here: that regular expression won’t work as
expected if the input contains combining characters, perhaps from an
unexpected Unicode normalization form, but that’s more advanced than I
need right now.
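To make the combining-character edge case concrete, here is a minimal sketch. The helper name count_codepoints and the sample word are my own, not from any tool. It counts code points by dropping UTF-8 continuation bytes, so it gives the same answer in any locale.

```shell
# Count Unicode code points in a UTF-8 string without relying on the locale:
# every byte except a continuation byte (0x80-0xBF, octal 200-277) starts one.
# count_codepoints is a hypothetical helper, for illustration only.
count_codepoints() {
    printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d '[:space:]'
}

precomposed=$(printf 'caf\303\251')    # "café" with precomposed U+00E9 (NFC)
decomposed=$(printf 'cafe\314\201')    # "café" as "e" + combining U+0301 (NFD)

count_codepoints "$precomposed"; echo  # 4
count_codepoints "$decomposed"; echo   # 5
```

Both strings render the same, but the second has five code points, so as far as I understand a pattern like ^....$ would only match the first one even in a UTF-8 locale.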
Some relevant links:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html#useful-websites
https://ftfy.vercel.app
Since GNU sed and FreeBSD sed handle s/^/x/g properly on inputs starting
with multi-byte characters, I’m certain this is a bug in macOS
sed.
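For anyone who wants to compare their own sed, this is a small check I would attach to a report. It is only a sketch: the function name check_sed_anchor and the test word are made up, and it assumes the correct result is simply an "x" in front of the unchanged bytes.

```shell
# Feed sed a line that starts with a multi-byte character ("ø", bytes C3 B8)
# and compare the result of s/^/x/g against the expected bytes.
# check_sed_anchor is a hypothetical name used for this sketch.
check_sed_anchor() {
    expected=$(printf 'x\303\270re')
    actual=$(printf '\303\270re\n' | sed -e 's/^/x/g' 2>/dev/null) || true
    if [ "$actual" = "$expected" ]; then
        echo ok
    else
        echo mismatch
    fi
}

check_sed_anchor
```

It prints ok when the output is the expected bytes and mismatch otherwise, so the same script can be run on each system being compared.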
If anyone has guidance for categorizing tickets for unix command-line tools
in Feedback Assistant, that would be great.