May be worth adding circular buffer for both collator search and for normalized
searching.
See following memo.
"Graham Asher" <graham.asher@btinternet.com>
2002.12.04 02:07
To: Mark Davis/Cupertino/IBM@IBMUS
cc:
Subject: RE: Quaternary sort keys
Mark,
thanks for your latest comments. I was interested in your remark that "There
can actually be more levels -- in ICU -- since we allow the introduction of
an algorithmically-derived 'case' level. This provides a few capabilities,
among them the ability to reverse the order of upper vs lower.". I can see
that that might be a good way of doing it. When I implemented my last
collation system - for Symbian in 1999 - I handled that problem by changing
the level-3 weights. But I suppose your way is better because you can
separate out the case from other level-3 distinctions and perform a
comparison in which, among level-3 attributes, only case is significant.
I have now incorporated the idea of the five levels into my current
collation system (for Research In Motion) and implemented all the
variable-element treatments, and it seems to work again ;-) I have also
implemented a string search system. I've read about your way of implementing
the Boyer-Moore system (using a patented algorithm, so that method's out)
described by Laura Werner in
http://oss.software.ibm.com/icu/docs/papers/efficient_text_searching_in_java
.html.
The difficulty I have with Boyer-Moore used for searching using the Unicode
Collation Algorithm is that the shifts are not by a certain number of
characters but by a certain number of collation elements; we are not
comparing raw text but a derived stream of collation elements with a stored
buffer of collation elements representing the target. Boyer-Moore is only
useful if we can jump forward cheaply. Here, however, we jump forward by
setting the collation element iterator position. I don't believe that
operation is very cheap. To move forward by 10 collation elements, we have
to chug forward over the source text, canonically expanding it and looking
up elements. There are no shortcuts - we don't know ahead of time what
expansions, contractions and ignored characters will be encountered. And
even if there is no possible jump forward, we will still have to iterate
forward, eventually, over that exact same text. Therefore the only saving is
in the comparisons, which are trivial bitwise comparisons of strings of
integers.
My conclusion is that the Boyer-Moore search algorithm's time savings are
increasingly small as the cost of skipping increases relative to the cost of
comparison; and in this case the cost of skipping is much greater than the
cost of comparison.
I'd be very happy if you told me that I'd missed a fundamental point about
Boyer-Moore. I would much rather it was significantly better than brute
force in this case. But for the moment I cannot see the point of adding the
extra complexity to my code.
By the way, Laura Werner's code would be simpler and faster if it stored
collation elements from the search text in a circular buffer. Then there'd
be no need to run the collation element iterator backwards over the text
when performing a comparison. Just compare the two buffers. This is the
method I use, and it saves me a lot of trouble - mainly, my collation
element iterator need only run forwards.
There may be another problem with this code. The CollationElementIterator
documentation states that setOffset 'sets the offset of the currently
processed character...' But the code needs to set the collation element
offset, which is not the same thing.
Best regards,
Graham