Ticket #5617 (closed defect: fixed)

Bug contains 6 commit(s) | SVN Diffs for #5617

 

Opened 2 years ago

Last modified 3 months ago

Regex "\b" seems to be interpreting ZWJ as a word boundary

Reported by: bob_eaton@... Assigned to: andy
Priority: minor Milestone: 4.0
Component: regexp Version: 3.6
Keywords: Cc:
Load: Xref:
Java Version: Operating System: XP Pro
Project (C/J): all Weeks: 0.2
Review: mark

Description

Using RegexMatcher, I'm searching with a Find string of "\bते" (or \u005C\u0062\u0924\u0947, which is the '\b' word boundary code followed by a two-letter Devanagari word). The string I am searching is: वक्‍ते (or \u0935\u0915\u094D\u200D\u0924\u0947). The \u200D is the Zero Width Joiner, which occurs just before the two Devanagari characters at the end of the string being searched. The ZWJ is there because the first of those two letters is actually part of a consonant cluster: the halant (\u094D) followed by the ZWJ is how we signal that the cluster continues (i.e. that we are still within a word).

Anyway, it appears that the word boundary code in the find string (i.e. '\b') treats the ZWJ as a word break, which it is not.

I also checked BreakIterator just to make sure, and it correctly treats that location as being within a word rather than at a break, unlike RegexMatcher. So at the very least, BreakIterator does not behave the same way as RegexMatcher with '\b'.

Bob Eaton

P.S. Attached is a short C++ program that demonstrates this error.

Attachments

TestICU.cpp (1.8 kB) - added by bob_eaton@indiamvps.net on 02/24/07 01:39:18.
c++ source with example code (utf-8 encoding)

Change History

02/24/07 01:39:18 changed by bob_eaton@...

  • attachment TestICU.cpp added.

c++ source with example code (utf-8 encoding)

02/25/07 00:25:02 changed by grhoten

  • keywords deleted.
  • weeks changed.
  • xref changed.
  • revw changed.

04/04/07 11:46:44 changed by grhoten

  • load changed.
  • owner changed from somebody to andy.
  • priority changed from major to minor.
  • weeks set to 0.2.
  • milestone changed from UNSCH to 3.8.

07/13/07 14:57:51 changed by andy

  • status changed from new to assigned.

The problem has been fixed by updating the set of characters that are neither word nor non-word, so that it corresponds to the characters listed in rule WB4 of Unicode UAX #29. This applies to the simple (traditional, default) option for word boundaries. To get full Unicode UAX #29 style word boundaries, use the 'w' flag with the pattern.
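The idea of "neutral" characters can be illustrated with a small standalone sketch. This is not the ICU source: the function names (`isWordChar`, `isNeutral`, `naiveBoundary`, `neutralAwareBoundary`) are hypothetical, and the character classes are hard-coded to a tiny subset (ASCII word characters, Devanagari letters, ZWNJ/ZWJ, and a few Devanagari nonspacing marks) chosen only to cover the string from this ticket. A real implementation would consult Unicode properties instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, deliberately tiny "word character" test (an assumption for
// this sketch): ASCII letters, digits, underscore, and Devanagari letters.
bool isWordChar(char32_t c) {
    return (c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') ||
           (c >= U'a' && c <= U'z') || c == U'_' ||
           (c >= 0x0904 && c <= 0x0939);  // Devanagari vowels and consonants
}

// Hypothetical "neutral" test: characters that are neither word nor non-word.
// Only ZWNJ, ZWJ, and a few Devanagari nonspacing marks are listed here.
bool isNeutral(char32_t c) {
    return c == 0x200C || c == 0x200D ||    // ZWNJ, ZWJ
           c == 0x094D ||                   // halant (virama)
           (c >= 0x0941 && c <= 0x0948);    // nonspacing vowel signs
}

// Naive \b: compare the characters immediately on each side of pos.
bool naiveBoundary(const std::u32string& s, std::size_t pos) {
    assert(pos <= s.size());
    bool cur = pos < s.size() && isWordChar(s[pos]);
    bool prev = pos > 0 && isWordChar(s[pos - 1]);
    return cur != prev;
}

// Neutral-aware \b: never break directly before a neutral character, and
// skip backwards over neutral characters to find the previous "real" one.
bool neutralAwareBoundary(const std::u32string& s, std::size_t pos) {
    assert(pos <= s.size());
    bool cur = false;
    if (pos < s.size()) {
        if (isNeutral(s[pos])) {
            return false;
        }
        cur = isWordChar(s[pos]);
    }
    bool prev = false;
    for (std::size_t i = pos; i > 0; --i) {
        if (!isNeutral(s[i - 1])) {
            prev = isWordChar(s[i - 1]);
            break;
        }
    }
    return cur != prev;
}
```

With the string from the description, `U"\u0935\u0915\u094D\u200D\u0924\u0947"`, the position just before \u0924 is index 4. There `naiveBoundary` reports a boundary, because the preceding ZWJ is not a word character, while `neutralAwareBoundary` skips back over the ZWJ and the halant to \u0915 and reports no boundary, so "\bते" would not match at that position.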

08/31/07 15:38:52 changed by andy

  • milestone changed from 3.8 to 4.0.

This seems perfectly reasonable. We should generalize the question and consider which other characters should be neutral with respect to traditional (non-Unicode) regular expression word boundaries. I think it is just combining marks at the moment, but I need to check. "Neutral" characters are neither word nor non-word.

10/10/07 10:51:49 changed by andy

  • status changed from assigned to new.

07/15/08 14:56:34 changed by andy

  • status changed from new to assigned.
  • revw set to mark.

09/30/08 01:34:38 changed by mark

  • status changed from assigned to closed.
  • resolution set to fixed.

09/30/08 01:35:29 changed by mark

  • status changed from closed to reopened.
  • resolution deleted.

09/30/08 01:35:42 changed by mark

  • status changed from reopened to closed.
  • resolution set to fixed.
