Fix CGI.unescapeHTML CompatibilityError in the pure-Ruby fallback #127
Merged
The ASCII-compatible path built a binary buffer but returned numeric character references via chr(enc), so a non-ASCII replacement appended to a buffer that already held non-ASCII bytes raised Encoding::CompatibilityError. Decode into the binary buffer instead, matching the C extension's optimized_unescape_html for out-of-range references (kept verbatim, leading zeros included) and surrogate code points (emitted as raw bytes).

#103

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pure-Ruby fallback of CGI.unescapeHTML raised Encoding::CompatibilityError for strings that mix non-ASCII bytes with a numeric character reference, while the C extension decoded them correctly. Issue #103 reported this on TruffleRuby, and it also affects CRuby whenever the C extension fails to load.
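A minimal reproduction sketch, assuming an input of the kind described above (non-ASCII text followed by a numeric reference whose replacement is also non-ASCII). The helper name `try_unescape` is hypothetical and only wraps the call so the failure is reported rather than aborting the script.

```ruby
require "cgi/escape"

# Hypothetical helper: returns the decoded string, or the exception class
# if decoding raises Encoding::CompatibilityError.
def try_unescape(input)
  CGI.unescapeHTML(input)
rescue Encoding::CompatibilityError => e
  e.class
end

# "\u00E9" (é) places non-ASCII bytes ahead of the reference, and
# "&#12354;" refers to U+3042 (あ), so the replacement is non-ASCII as well.
result = try_unescape("\u00E9&#12354;")
puts result.inspect
```

With the C extension loaded, or with the fix applied to the fallback, this should print the decoded string; an unfixed pure-Ruby path would be expected to print the exception class instead.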
The ASCII-compatible path built a binary buffer but returned numeric character references through chr(enc), so appending a non-ASCII replacement to a buffer that already held non-ASCII bytes failed. The fix decodes references into the binary buffer instead and lets the trailing force_encoding retag the whole string.
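The failure mode can be illustrated in isolation from the library. This is a sketch only: `append_with_chr` and `append_as_bytes` are hypothetical names, and neither is the code from the patch.

```ruby
# Hypothetical helpers illustrating the encoding behaviour; not the library's code.

# Mirrors the failing shape: a binary buffer receives a string tagged with enc.
def append_with_chr(prefix, codepoint, enc)
  buf = prefix.b                    # binary copy of the text scanned so far
  buf << codepoint.chr(enc)         # raises when both sides hold non-ASCII bytes
  buf.force_encoding(enc)
end

# Mirrors the fixed shape: the replacement is appended as binary bytes.
def append_as_bytes(prefix, codepoint, enc)
  buf = prefix.b
  buf << codepoint.chr(enc).b       # same bytes, tagged binary like the buffer
  buf.force_encoding(enc)           # retag the whole buffer once at the end
end

begin
  append_with_chr("\u00E9", 0x3042, Encoding::UTF_8)
rescue Encoding::CompatibilityError => e
  puts "raised: #{e.class}"
end

puts append_as_bytes("\u00E9", 0x3042, Encoding::UTF_8)
```

Appending only fails when both the buffer and the appended string contain non-ASCII bytes, which is why purely ASCII inputs never hit the error.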
To keep the two implementations in lockstep I compared every method against the C extension across many encodings and inputs. The remaining divergences were all in unescapeHTML. Out-of-range references now stay verbatim with their leading zeros preserved, and surrogate code points are emitted as raw bytes the way rb_enc_mbcput does rather than raising RangeError. Regression tests covering these cases run against both the C extension and the pure-Ruby path.
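The two edge cases can be sketched for UTF-8 as follows. Everything here is an assumption made for illustration: `decode_decimal_reference` and `UNICODE_LIMIT` are hypothetical names, the bound is an assumed value rather than the library's actual limit, and `Array#pack("U")` stands in for rb_enc_mbcput because it encodes surrogates without raising.

```ruby
# Hypothetical UTF-8-only sketch of the rules described above; not the library's code.
UNICODE_LIMIT = 0x110000  # assumed exclusive upper bound, for illustration only

def decode_decimal_reference(ref)
  m = /\A&#([0-9]+);\z/.match(ref)
  return ref.b unless m             # not a decimal reference: leave untouched
  n = m[1].to_i
  # Out of range: return the reference exactly as written, leading zeros included.
  return ref.b if n >= UNICODE_LIMIT
  # pack("U") emits the bytes for a surrogate such as U+D800,
  # whereas Integer#chr(Encoding::UTF_8) raises RangeError for it.
  [n].pack("U").b
end

puts decode_decimal_reference("&#0001114112;")          # kept verbatim
puts decode_decimal_reference("&#55296;").bytes.inspect # raw bytes for U+D800
```

Returning binary strings keeps the sketch consistent with the binary-buffer approach, where the caller retags the assembled result once at the end.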
Fixes #103