andrew_n’s Profile | Apple Developer Forums

Reply to Bug in sed

Hi, I appreciate you trying to help, but setting LC_CTYPE=C won’t work for me. The text I’m processing is UTF-8 encoded and definitely contains non-ASCII characters. If sed isn’t given the correct encoding, that breaks pretty basic regular expressions like ‘find all the three-letter words.’

    $ (echo foo; echo bar; echo føo) | sed -e 's/^...$/three-letter-word/g'
    three-letter-word
    three-letter-word
    three-letter-word
    $ (echo foo; echo bar; echo føo) | LC_CTYPE=C sed -e 's/^...$/three-letter-word/g'
    three-letter-word
    three-letter-word
    føo

There is an edge case here: that regular expression won’t work as expected if the input contains combining characters, perhaps from an unexpected Unicode normalization form, but that’s more advanced than I need right now.

Some relevant links:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html#useful-websites
https://ftfy.vercel.app

Since GNU sed and FreeBSD sed handle s/^/x/g properly on inputs starting with multi-byte characters, I’m certain this is a bug in macOS sed. If anyone has guidance for categorizing tickets for Unix command-line tools in Feedback Assistant, that would be great.
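To make the byte-versus-character point concrete, here is a minimal sketch that forces the C locale and shows why ‘føo’ stops matching: in that locale the line is four bytes long, not three characters. The function names match_three_bytes and match_four_bytes are hypothetical helpers introduced only for this illustration, and the input is written with octal escapes (\303\270 is the UTF-8 encoding of ø) so the result doesn’t depend on the terminal’s encoding.

```shell
# Hypothetical helpers for illustration only (not part of sed or macOS).
# Under LC_ALL=C, sed treats every byte as one character.
match_three_bytes() {
  LC_ALL=C sed -e 's/^...$/three-letter-word/'
}

match_four_bytes() {
  LC_ALL=C sed -e 's/^....$/four-bytes/'
}

# "foo" is three bytes, so it matches; "føo" is four bytes (f, 0xc3, 0xb8, o),
# so the three-dot pattern leaves it unchanged.
printf 'foo\nf\303\270o\n' | match_three_bytes

# The same line does match a four-dot pattern in the C locale.
printf 'f\303\270o\n' | match_four_bytes
```

This is why LC_CTYPE=C is not a usable workaround for UTF-8 text: any pattern that counts characters silently starts counting bytes instead.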

App & System Services Core OS

Nov ’22

Reply to Bug in sed

Yes, the input is definitely UTF-8. You can double-check the “UTF-8 (hex)” line of https://www.fileformat.info/info/unicode/char/00f8/index.htm

    $ echo ø > foo
    $ hexdump -C foo
    00000000  c3 b8 0a  |...|
    00000003
    $ LANG=en_US.UTF-8 sed -e 's/^/o/g' < foo
    sed: RE error: illegal byte sequence
    Returned 1.
    # Almost everyone’s terminal is going to be UTF-8 anyway; I only included $LANG to be explicit
    $ sed -e 's/^/o/g' < foo
    sed: RE error: illegal byte sequence
    Returned 1.

ISO-8859-1 isn’t a multi-byte encoding. If you want to convert a UTF-8 character to it, echo ø | iconv -f UTF-8 -t ISO-8859-1 will do it, but then you’ll also need to supply the sed regex in the correct encoding, and you still won’t be able to trigger this bug that way, because it is a bug in sed’s handling of multibyte characters.
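As a rough sketch of the encoding difference described above, the following prints the bytes of ø in UTF-8 and after conversion to ISO-8859-1. The hex_bytes function is a hypothetical helper introduced for this example; it assumes od, tr and iconv are available, as they are on macOS and most Linux systems.

```shell
# Hypothetical helper for illustration: print stdin as a compact hex string.
hex_bytes() {
  od -An -tx1 | tr -d ' \n'
  echo
}

# ø in UTF-8 is two bytes (c3 b8), followed by the newline (0a).
printf '\303\270\n' | hex_bytes                                 # c3b80a

# Converted to ISO-8859-1, ø becomes the single byte f8.
printf '\303\270\n' | iconv -f UTF-8 -t ISO-8859-1 | hex_bytes  # f80a
```

Because the ISO-8859-1 form is a single byte, it never reaches sed’s multibyte handling, which is why converting the input cannot reproduce the error.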

App & System Services Core OS

Nov ’22

Reply to Bug in sed

Yeah, I know that’s a multi-byte quote character. The bug is with multi-byte characters at the start of a line. Try:

    $ echo ø | LANG=en_US.UTF-8 sed -e 's/^/o/g'
    sed: RE error: illegal byte sequence

Sed processes the same character correctly in other very similar regexes:

    $ echo ø | LANG=en_US.UTF-8 sed -e 's/ø/o/g'
    o

App & System Services Core OS

Nov ’22

andrew_n
