Ticket #4184 (assigned defect)

Bug contains 10 commit(s) | SVN Diffs for #4184

 

Opened 4 years ago

Last modified 2 months ago

Searching for NULL characters does not work

Reported by: vargaz(at)gmail.com
Assigned to: eric (accepted)
Priority: assess
Milestone: 4.2
Component: collation
Version: 3.0
Keywords: collation
Cc:
Load:
Xref: 6576 4562
Java Version:
Operating System: linux
Project (C/J): ICU4C
Weeks: 1
Review: srl

Description (Last modified by srl)

It is not possible to search for NULL characters using the usearch engine. When the pattern consists of a single NULL character, the engine reports a match at every position in the text. Here is a test case:

#include <assert.h>
#include <stdio.h>
#include <unicode/utypes.h>
#include <unicode/ustring.h>
#include <unicode/ucol.h>
#include <unicode/usearch.h>

int main (void)
{
	UCollator *coll;
	UErrorCode ec;
	UStringSearch *search;
	U_STRING_DECL (pattern, "0", 1);
	U_STRING_DECL (text, "IS 0 OK ?", 9);
	int32_t pos;

	U_STRING_INIT (pattern, "0", 1);
	U_STRING_INIT (text, "IS 0 OK ?", 9);

	/* Replace the '0' placeholders with real NULL (U+0000) characters. */
	pattern [0] = pattern [1] = 0;
	text [3] = 0;

	ec=U_ZERO_ERROR;

	coll = ucol_open ("en-US", &ec);
	assert (U_SUCCESS (ec));

	search=usearch_openFromCollator (pattern, 1, text, 9, coll, NULL, &ec);
	assert (U_SUCCESS (ec));

	/* Print the position and length of every match. */
	for(pos=usearch_first (search, &ec);
		pos!=USEARCH_DONE;
		pos=usearch_next (search, &ec)) {
		printf ("POS: %d %d\n", (int) pos, (int) usearch_getMatchedLength (search));
	}

	usearch_close (search);
	ucol_close (coll);
	return 0;
}

Change History

12/31/69 17:29:12 changed by notes2

Do assessment as suggested in Mark's reply, either return if working correctly, or reassign to Vladimir if it's a collation problem.

12/31/69 17:29:13 changed by auditor

  • Tue Nov 30 02:09:48 2004 weiv changed notes2: assign: "" to "weiv", target: "UNSCH" to "",
  • Tue Nov 30 02:09:48 2004 weiv moved from incoming to collation
  • Tue Nov 30 12:38:10 2004 guest sent reply 1
  • Fri Jan 7 02:01:50 2005 weiv changed notes2: priority: "" to "critical", target: "UNSCH" to "3.4", weeks: "" to "1",
  • Wed Jul 13 15:05:05 2005 weiv changed notes2: target: "3.4" to "3.6",
  • Fri Mar 31 14:22:39 2006 ram changed notes2: assign: "weiv" to "andy", target: "3.6" to "3.8",
  • Fri Oct 20 22:34:23 2006 andy changed notes2: comments: "" to "Do assessment as suggested in Mark's reply, either return if working correctly, or reassign to Vladimir if it's a collation problem.",

11/30/04 11:38:10 changed by mark.davis(at)us.ibm.com

(Guest Reply)

If you want ignorable characters to be significant, then you have to set the strength to IDENTICAL. Otherwise, the character counts as ignored, and thus as an empty string. And an empty string will match at each position in the string.

If the search doesn't work when you set the strength to IDENTICAL, then there is a real bug.

P.S. We should, however, clarify this in the text.
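
The reasoning in this reply can be illustrated with a small toy model. This is only a sketch and not how ICU implements string search: the names toy_is_ignorable and toy_count_matches are hypothetical, the only character treated as ignorable is U+0000, and the "identical" flag merely stands in for setting the collator strength to IDENTICAL. It shows why a pattern made up entirely of ignorables behaves like an empty string and therefore matches at every position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy rule (an assumption for illustration): U+0000 is ignorable unless
   the comparison is done at "identical" strength. Real ICU decides this
   from collation data, not from a hard-coded check. */
int toy_is_ignorable(uint16_t c, int identical)
{
    return !identical && c == 0;
}

/* Count the start positions in text at which the pattern matches once
   ignorable code units have been skipped on both sides. */
size_t toy_count_matches(const uint16_t *pat, size_t plen,
                         const uint16_t *text, size_t tlen,
                         int identical)
{
    size_t hits = 0;

    assert(pat != NULL && text != NULL);

    for (size_t start = 0; start < tlen; start++) {
        size_t p = 0, t = start;
        for (;;) {
            /* Skip ignorable units in the pattern. */
            while (p < plen && toy_is_ignorable(pat[p], identical))
                p++;
            if (p == plen) {        /* nothing significant left: a match */
                hits++;
                break;
            }
            /* Skip ignorable units in the text. */
            while (t < tlen && toy_is_ignorable(text[t], identical))
                t++;
            if (t == tlen || pat[p] != text[t])
                break;              /* text exhausted or mismatch */
            p++;
            t++;
        }
    }
    return hits;
}
```

With the text from the test case ("IS", space, U+0000, " OK ?", 9 code units) and a pattern of one U+0000, this toy model reports 9 hits when U+0000 is ignorable and a single hit (at index 3) when "identical" is set.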

08/30/07 16:59:11 changed by andy

  • load changed.
  • xref changed.
  • java changed.
  • revw changed.
  • milestone changed from 3.8 to 4.0.

03/21/08 10:35:35 changed by andy

  • priority changed from critical to assess.

Eric, since you are up to your eyeballs in string search, I've reassigned this one to you. Let me know if it's a problem.

07/15/08 14:45:47 changed by andy

  • owner changed from andy to eric.
  • milestone changed from 4.0 to 4.2.

07/21/08 09:02:01 changed by hchapman

  • priority changed from assess to major.

07/21/08 09:02:22 changed by srl

  • description changed.

07/22/08 10:20:43 changed by hchapman

  • owner changed from eric to bdrower.

07/22/08 11:28:38 changed by bdrower

  • status changed from new to assigned.

08/05/08 10:58:27 changed by bdrower

Have been unable to reproduce error. Created a new test case in the string test file. Sent bug reporter message asking for code sample if still having problems. Pending reply.

08/05/08 11:06:17 changed by bdrower

User replied: Hi,

It was a long time ago, and I don't really remember this stuff any more, also, we no longer use ICU, so I think bug report can be closed.

Zoltan

Bug will stay open until new test case is committed.

08/08/08 12:11:16 changed by bdrower

  • revw set to srl.

Not a problem anymore, but created a test case to make sure this is not a problem in the future. Code committed. Set for review.

08/11/08 11:25:45 changed by bdrower

  • revw deleted.

08/22/08 17:32:02 changed by bdrower

  • revw set to srl.

After the test was committed, it failed on some systems. On review, the problem turned out to be in the test case, not in the ICU library. The test case was updated, and the test now succeeds on all systems.

09/17/08 15:47:30 changed by srl

test code is missing these lines:

	pattern [0] = pattern [1] = 0;
	text [3] = 0;

09/17/08 15:51:49 changed by srl

added:

+    pattern[0]=0;
+    text[0]=0;
+    text[4]=0;

got:

 Expected search result length: 1; Got instead: 0
 Expected 2 search hits, found 8

09/22/08 18:13:04 changed by srl

  • owner changed from bdrower to srl.
  • status changed from assigned to new.
  • milestone changed from 4.2 to 4.1.2.

Current test is not correct, see above. Fix and reassign post M1.

09/24/08 15:09:48 changed by srl

  • milestone changed from 4.1.2 to 4.1.1.

09/26/08 14:15:04 changed by srl

  • owner changed from srl to eric.
  • priority changed from major to assess.
  • milestone changed from 4.1.1 to 4.1.2.

Updated the test case to use IDENTICAL strength, and it still fails.

It seems that other ignorables (U+0001, U+0002, etc.) also show the same behavior. Perhaps the linear search does not handle ignorables?

A search for NULL that fails at any strength other than IDENTICAL is NOT a bug but expected behavior.

09/26/08 15:58:04 changed by srl

  • revw deleted.

10/07/08 16:33:32 changed by eric

  • xref set to 6576.

The test code in this bug produces a pattern that is all ignorables. The test code fails because StringSearch cannot handle a pattern that starts or ends with ignorables. This bug was submitted four years ago, and I have verified that it exists in ICU 3.8.1. Strictly speaking, then, it is not a regression caused by the linear search code's incorrect handling of strength UCOL_IDENTICAL, though that problem must also be fixed for the test to pass.

See ticket:6576 for details of the UCOL_IDENTICAL bug.

10/07/08 17:11:23 changed by eric

  • milestone changed from 4.1.2 to 4.2.

Moving this bug to milestone 4.2. Since it requires a major overhaul of the search logic to handle leading (and trailing?) ignorables, it's better to fold this into the Boyer-Moore update.

10/31/08 10:04:18 changed by eric

  • status changed from new to assigned.
  • xref changed from 6576 to 6576 4562.
  • revw set to srl.

Since there is code submitted against this bug that's included in ICU 4.0, I'm going to close this bug. The problem described here is actually a degenerate case of the "medial match" described in TR#10 and referenced in ticket:4562.

