Fix CGI.unescapeHTML CompatibilityError in the pure-Ruby fallback #127

Merged
hsbt merged 1 commit into master from fix-pure-ruby-escape-issue-103
Jun 23, 2026
hsbt commented Jun 23, 2026

The pure-Ruby fallback of CGI.unescapeHTML raised Encoding::CompatibilityError for strings that mix non-ASCII bytes with a numeric character reference, while the C extension decoded them correctly. Issue #103 reported this on TruffleRuby, and it also affects CRuby whenever the C extension fails to load.

The ASCII-compatible path builds a binary buffer, but it decoded numeric character references with chr(enc), which returns a string tagged with the source encoding. Appending such a non-ASCII replacement to a buffer that already held non-ASCII bytes therefore failed. The fix decodes references into the binary buffer instead and lets the trailing force_encoding retag the whole string.
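The failure mode can be reproduced with plain String operations, independent of the cgi code. The snippet below is a minimal sketch rather than the gem's implementation: the variable names are illustrative, and UTF-8 is assumed as the source encoding.

```ruby
# Binary (ASCII-8BIT) buffer that already holds the non-ASCII bytes of "é".
buf = "\u00E9".b

# Old approach: append the decoded reference as a UTF-8 tagged string.
# Both sides contain non-ASCII bytes, so the encodings are incompatible.
old_error = begin
  buf.dup << 0x3042.chr(Encoding::UTF_8)
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# New approach: append the reference as raw bytes, then retag the whole
# buffer once at the end.
fixed = (buf.dup << 0x3042.chr(Encoding::UTF_8).b).force_encoding(Encoding::UTF_8)

p old_error       # the exception class raised by the old approach
p fixed.encoding  # encoding of the retagged result
```

Keeping every append binary means the encoding check never sees two different non-ASCII encodings, and the single force_encoding at the end restores the original tag.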

To keep the two implementations in lockstep I compared every method against the C extension across many encodings and inputs. The remaining divergences were all in unescapeHTML. Out-of-range references now stay verbatim with their leading zeros preserved, and surrogate code points are emitted as raw bytes the way rb_enc_mbcput does rather than raising RangeError. Regression tests covering these cases run against both the C extension and the pure-Ruby path.
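The behaviors described above can be sketched with a simplified decoder. This is not the code from the pull request: unescape_numeric_refs is a hypothetical helper that handles only numeric references, assumes UTF-8 input, and assumes 0x110000 as the out-of-range limit. It uses Array#pack("U"), which emits bytes for surrogate code points where Integer#chr would raise RangeError.

```ruby
# Hypothetical helper, not the gem's implementation. Assumes UTF-8 input
# and treats any code point >= 0x110000 as out of range (an assumption).
def unescape_numeric_refs(string)
  enc = string.encoding
  buf = string.b.gsub(/&#(?:([0-9]+)|x([0-9A-Fa-f]+));/) do
    match = Regexp.last_match
    code = match[1] ? match[1].to_i(10) : match[2].to_i(16)
    if code < 0x110000
      # pack("U") does not reject surrogates, so they come out as raw bytes.
      [code].pack("U").b
    else
      # Out of range: keep the reference verbatim, leading zeros included.
      match[0]
    end
  end
  # Retag the binary buffer with the original encoding in one step.
  buf.force_encoding(enc)
end

p unescape_numeric_refs("\u00E9&#x3042;")
p unescape_numeric_refs("&#0001114112;")
p unescape_numeric_refs("&#xD800;").bytes
```

A surrogate result is not valid UTF-8, so callers that need well-formed output would still have to check valid_encoding? themselves.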

Fixes #103

The ascii-compatible path builds a binary buffer but returned numeric
character references via chr(enc), so a non-ASCII replacement appended
to a buffer that already held non-ASCII bytes raised
Encoding::CompatibilityError. Decode into the binary buffer instead,
matching the C extension's optimized_unescape_html for out-of-range
references (kept verbatim, leading zeros included) and surrogate code
points (emitted as raw bytes).

#103

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@hsbt hsbt merged commit 09970b0 into master Jun 23, 2026
70 checks passed
@hsbt hsbt deleted the fix-pure-ruby-escape-issue-103 branch June 23, 2026 07:05

Development

Successfully merging this pull request may close these issues.

unescape_html - Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
