Bug in sed

Question

Created Nov ’22

Replies 9

Participants 4

With a UTF-8 locale, this sed command, which should insert a character at the beginning of each line, instead fails with an error about an illegal byte sequence:

$ echo “hi | LANG=en_US.UTF-8 sed -e s'/^/x/g'
sed: RE error: illegal byte sequence

If you change the ^ to an ordinary character, or drop the g flag, it works fine. I’m guessing the g makes it check the line again, but it gets messed up trying to find the start of the line with the multi-byte character there.

I’m seeing this on both Ventura on Arm and Monterey on Intel; I haven’t checked further back than that. I know that macOS sed is BSD-derived, so I also tested this on FreeBSD 13.1, which does not have this bug.
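
A quick way to check whether a given machine is affected is sketched below. The function name sed_caret_g_check is made up for illustration, and the sketch assumes the en_US.UTF-8 locale is installed; if it is not, sed may fall back to the C locale and report "not affected" even on an affected system. It sets LC_ALL rather than LANG so that a pre-existing LC_ALL or LC_CTYPE cannot override the locale.

```shell
# Hypothetical helper: reports whether the local sed fails on s/^/x/g
# when a line starts with a multi-byte UTF-8 character.
sed_caret_g_check() {
    # \303\270 is the UTF-8 encoding of "ø" (bytes c3 b8).
    # The pipeline's exit status is sed's exit status.
    if printf '\303\270\n' | LC_ALL=en_US.UTF-8 sed -e 's/^/x/g' >/dev/null 2>&1; then
        echo "not affected"
    else
        echo "affected"
    fi
}

sed_caret_g_check
```

Any non-zero exit from sed is treated as "affected", so a sed that fails for an unrelated reason would be misreported.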

I’ve never filed a bug with Apple before. If I file this as a bug with Feedback Assistant, what on earth do I tag it with in the hopes that the Apple sed maintainer(s) might see it? There’s no ‘sed’ or ‘unix’ or ‘command line tools’ option.

Answer 1

darkpaw

Nov ’22

Are you sure? You're using a non-standard double-quote before hi.

When you use this, it works fine:

Command: echo "hi" | LANG=en_US.UTF-8 sed -e s'/^/x/g'

Output: xhi

Answer 2

andrew_n OP

Nov ’22

Yeah, I know that’s a multi-byte quote character. The bug is with multi-byte characters at the start of a line. Try:

$ echo ø | LANG=en_US.UTF-8 sed -e 's/^/o/g'
sed: RE error: illegal byte sequence

Sed processes the same character correctly in other very similar regexes:

$ echo ø | LANG=en_US.UTF-8 sed -e 's/ø/o/g'
o

Answer 3

darkpaw

Nov ’22

Oh, I get you now. Yes, but doesn't that suggest that the characters you're using aren't UTF-8? For example, this works:

Command: echo ø | LANG=ISO-8859-1 sed -e 's/^/o/g'

Output: oø

Answer 4

andrew_n OP

Nov ’22

Yes, the input is definitely UTF-8. You can double-check the “UTF-8 (hex)” line of https://www.fileformat.info/info/unicode/char/00f8/index.htm

$ echo ø > foo
$ hexdump -C foo
00000000  c3 b8 0a                                          |...|
00000003
$ LANG=en_US.UTF-8 sed -e 's/^/o/g' < foo
sed: RE error: illegal byte sequence
Returned 1.
# Almost everyone’s terminal is going to be UTF-8 anyway; I only included $LANG to be explicit
$ sed -e 's/^/o/g' < foo
sed: RE error: illegal byte sequence
Returned 1.

ISO-8859-1 isn’t a multi-byte encoding. If you want to convert a UTF-8 character to it, echo ø | iconv -f UTF-8 -t ISO-8859-1 will do it, but then you’ll also need to supply the sed regex in the correct encoding. Even then, you still won’t be able to trigger this bug that way, because it is a bug in sed’s handling of multi-byte characters.
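
To make the byte-level difference concrete, the sketch below dumps the same character in both encodings. hex_bytes is a hypothetical helper name; the sketch assumes od and iconv are available, and it writes ø with octal escapes so the result does not depend on the encoding of the script file.

```shell
# Hypothetical helper: print stdin as one compact hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# "ø" as UTF-8: two bytes, c3 b8, followed by the newline 0a.
printf '\303\270\n' | hex_bytes                                   # prints c3b80a

# The same character converted to ISO-8859-1: a single byte, f8.
printf '\303\270\n' | iconv -f UTF-8 -t ISO-8859-1 | hex_bytes    # prints f80a
```

In ISO-8859-1 every character is exactly one byte, so there is no multi-byte sequence for sed to mishandle.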

Answer 5

darkpaw

Nov ’22

Try using LC_CTYPE=C: echo “hi | LC_CTYPE=C LANG=en_US.UTF-8 sed -e s'/^/x/g'

If that doesn't help, then yes, raise a feedback report.

There's a lot of useful info here: https://stackoverflow.com/questions/19242275/re-error-illegal-byte-sequence-on-mac-os-x/19770395#19770395
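
For context on why setting LC_CTYPE changes sed's behavior even with LANG set: POSIX locale variables are consulted in the order LC_ALL, then the category variable (here LC_CTYPE), then LANG, so LC_CTYPE=C takes priority over LANG=en_US.UTF-8 unless LC_ALL is also set. The sketch below models that lookup. resolve_ctype is a hypothetical helper, and the final fallback to C when nothing is set is an assumption, since that default is implementation-defined.

```shell
# Hypothetical helper: given the values of LC_ALL, LC_CTYPE and LANG
# (an empty string means "unset or empty"), print the locale name that
# governs character handling.
resolve_ctype() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then the LC_CTYPE category variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf '%s\n' "C"       # assumption: default to the C locale
    fi
}

resolve_ctype "" "C" "en_US.UTF-8"    # prints C
resolve_ctype "" "" "en_US.UTF-8"     # prints en_US.UTF-8

# For the current environment:
resolve_ctype "${LC_ALL-}" "${LC_CTYPE-}" "${LANG-}"
```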

Answer 6

andrew_n OP

Nov ’22

Hi, I appreciate you trying to help, but setting LC_CTYPE=C won’t work for me. The text I’m processing is UTF-8 encoded, and definitely contains non-ASCII characters. If sed isn’t told the correct encoding, that breaks pretty basic regular expressions like, ‘find all the three-letter words.’

    $ (echo foo; echo bar; echo føo) | sed -e 's/^...$/three-letter-word/g'
    three-letter-word
    three-letter-word
    three-letter-word
    $ (echo foo; echo bar; echo føo) | LC_CTYPE=C sed -e 's/^...$/three-letter-word/g'
    three-letter-word
    three-letter-word
    føo

Now there is an edge case here, where that regular expression won’t work as expected if there are combining characters in the input, perhaps from an unexpected Unicode normalization form, but that’s more advanced than I need right now.
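
As a small illustration of that edge case, the sketch below shows that the same visible letter can arrive as two different byte sequences. byte_count is a hypothetical helper; the example uses é only because its precomposed and decomposed forms are easy to write as octal escapes.

```shell
# Hypothetical helper: count the bytes on stdin, without wc's padding.
byte_count() {
    wc -c | tr -d ' '
}

# Precomposed "é" (U+00E9) is the UTF-8 bytes c3 a9: one code point.
printf '\303\251' | byte_count     # prints 2

# Decomposed "é" is "e" plus U+0301 (combining acute accent),
# bytes 65 cc 81: two code points that render as one letter.
printf 'e\314\201' | byte_count    # prints 3
```

Because . in a regular expression matches one character of the current encoding rather than one visible letter, a word containing the decomposed form counts as one character longer than it looks.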

Some relevant links:

https://crashcourse.housegordon.org/coreutils-multibyte-support.html#useful-websites
https://ftfy.vercel.app

Since GNU sed and FreeBSD sed handle s/^/x/g properly on inputs starting with multi-byte characters, I’m certain this is a bug in macOS sed.

If anyone has guidance for categorizing tickets for unix command-line tools in Feedback Assistant, that would be great.

Answer 7

darkpaw

Nov ’22

As you said, GNU sed handles it correctly, so can't you just use that?

brew install gsed

echo ø | LANG=en_US.UTF-8 gsed -e 's/^/o/g'

Output: oø

echo “hi | LANG=en_US.UTF-8 gsed -e s'/^/x/g'

Output: x“hi
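
For scripts that have to run both on machines with GNU sed installed and on machines without it, one option is to pick the binary at run time. This is a minimal sketch; pick_sed is a hypothetical helper, and it assumes GNU sed is on PATH under the name gsed, as in the commands above.

```shell
# Hypothetical helper: prefer gsed when it is on PATH,
# otherwise fall back to the system sed.
pick_sed() {
    if command -v gsed >/dev/null 2>&1; then
        echo gsed
    else
        echo sed
    fi
}

SED=$(pick_sed)
printf 'hi\n' | "$SED" -e 's/^/x/'    # prints xhi
```

On the fallback path the script is still running the system sed, so it should avoid the s/^/.../g form there.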

Answer 8

DTS Engineer

Apple

Dec ’22

darkpaw wrote:

so can't you just use that?

Sure, but it’d be nice to get a bug on file about this as well.

andrew_n wrote:

If anyone has guidance for categorizing tickets for unix command-line tools in Feedback Assistant, that would be great.

At the first page, choose macOS. In the “Which area are you seeing an issue with?” popup, choose “Something else not on this list”.

For more bug reporting hints and tips, see Bug Reporting: How and Why?

And please post your bug number, just for the record.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Answer 9

Sokobania

Jan ’23

The g character in 's/^/o/g' means "replace each occurrence in the current line". Without it, sed replaces only the first occurrence in each line.

In the particular case where the regexp begins with the caret ^, which anchors the match to the beginning of the line, there can be at most one match per line, so the g flag is redundant.

On my mac, I get the following:

% echo 'ø\nø' | sed -e 's/^/o/g'
sed: RE error: illegal byte sequence

% echo 'ø\nø' | sed -e 's/^/o/' 
oø
oø
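
Wrapped up as a reusable function, the form without g might look like the sketch below. prefix_lines is a hypothetical helper name, and the prefix is spliced directly into the sed script, so it is only safe for prefixes that contain no /, & or backslash.

```shell
# Hypothetical helper: prefix every input line with the string in $1.
# Uses s/^/.../ without the g flag; ^ can match only once per line anyway.
prefix_lines() {
    sed -e "s/^/$1/"
}

# \303\270 is the UTF-8 encoding of "ø".
printf '\303\270\n\303\270\n' | prefix_lines o
```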
