Ticket #2554 (closed defect: invalid)

SVN Diffs for #2554

 

Opened 6 years ago

Last modified 4 months ago

Consider adding circular buffer to search code

Reported by: mark.davis(at)us.ibm.com Assigned to: eric
Priority: assess Milestone: UNSCH
Component: collation Version:
Keywords: collation Cc:
Load: Xref:
Java Version: Operating System: all
Project (C/J): ICU4C,ICU4J and ICU4JNI Weeks: 2
Review:

Description (Last modified by srl)

May be worth adding circular buffer for both collator search and for normalized searching.

See following memo.

"Graham Asher" <graham.asher@btinternet.com> 2002.12.04 02:07

To: Mark Davis/Cupertino/IBM@IBMUS cc: Subject: RE: Quaternary sort keys

Mark,

thanks for your latest comments. I was interested in your remark that "There can actually be more levels -- in ICU -- since we allow the introduction of an algorithmically-derived 'case' level. This provides a few capabilities, among them the ability to reverse the order of upper vs lower.". I can see that that might be a good way of doing it. When I implemented my last collation system - for Symbian in 1999 - I handled that problem by changing the level-3 weights. But I suppose your way is better because you can separate out the case from other level-3 distinctions and perform a comparison in which, among level-3 attributes, only case is significant.

I have now incorporated the idea of the five levels into my current collation system (for Research In Motion) and implemented all the variable-element treatments, and it seems to work again ;-) I have also implemented a string search system. I've read about your way of implementing the Boyer-Moore system (using a patented algorithm, so that method's out) described by Laura Werner in http://oss.software.ibm.com/icu/docs/papers/efficient_text_searching_in_java .html.

The difficulty I have with Boyer-Moore used for searching using the Unicode Collation Algorithm is that the shifts are not by a certain number of characters but by a certain number of collation elements; we are not comparing raw text but a derived stream of collation elements with a stored buffer of collation elements representing the target. Boyer-Moore is only useful if we can jump forward cheaply. Here, however, we jump forward by setting the collation element iterator position. I don't believe that operation is very cheap. To move forward by 10 collation elements, we have to chug forward over the source text, canonically expanding it and looking up elements. There are no shortcuts - we don't know ahead of time what expansions, contractions and ignored characters will be encountered. And even if there is no possible jump forward, we will still have to iterate forward, eventually, over that exact same text. Therefore the only saving is in the comparisons, which are trivial bitwise comparisons of strings of integers.

My conclusion is that the Boyer-Moore search algorithm's time savings are increasingly small as the cost of skipping increases relative to the cost of comparison; and in this case the cost of skipping is much greater than the cost of comparison.

I'd be very happy if you told me that I'd missed a fundamental point about Boyer-Moore. I would much rather it was significantly better than brute force in this case. But for the moment I cannot see the point of adding the extra complexity to my code.

By the way, Laura Werner's code would be simpler and faster if it stored collation elements from the search text in a circular buffer. Then there'd be no need to run the collation element iterator backwards over the text when performing a comparison. Just compare the two buffers. This is the method I use, and it saves me a lot of trouble - mainly, my collation element iterator need only run forwards.

There may be another problem with this code. The CollationElementIterator documentation states that setOffset 'sets the offset of the currently processed character...' But the code needs to set the collation element offset, which is not the same thing.

Best regards,

Graham

Attachments

Change History

12/31/69 17:28:06 changed by notes2

Let's please at least investigate and discuss the concerns raised! markus 20021210

12/31/69 17:28:07 changed by auditor

  • 12/10/02 20:01:28 schererm changed notes2
  • 12/10/02 20:01:28 schererm moved from incoming to collation
  • 04/14/03 18:34:13 ram changed notes2
  • 06/25/03 14:31:18 srl changed notes2
  • 01/26/04 00:24:17 grhoten changed notes2
  • 02/06/04 21:11:34 mark changed notes2
  • Fri Oct 13 17:41:56 2006 andy changed notes2: assign: "vladimir" to "andy",

07/07/08 11:48:32 changed by srl

  • load changed.
  • xref changed.
  • java changed.
  • description changed.
  • priority changed from major to assess.
  • owner changed from andy to eric.
  • revw changed.

07/21/08 08:51:02 changed by hchapman

  • status changed from new to closed.
  • resolution set to invalid.

6 years old; obsolete; close.


Add/Change #2554 (Consider adding circular buffer to search code)




Anti spam check: