Peter N Lewis wrote:
That's not really true. Boyer-Moore is typically O(M/N) where N is the search string length and M is the text length - that is the longer the search string, the quicker the match. It's also relatively easy to implement, since all you have to do is search checking the last character, and when it fails slide forward a precomputed amount based on the character in the text and the search string (ie, a table of characters is needed). Of coruse, in non-ASCII days, this is more challenging, but for UTF-8 you could probably treate all high bit set characters as equal and still get good results on mostly roman text.
You could also do the matching byte-wise (not character-wise) AFAICS (at least if you want exact matches, not case-insensitive matches or so). -- Well, if you have to take into account different representations of the same characters (accents or so), which can even be of different lengths, then it gets hairy ...
There is some code I use in a shipping product below which you are welcome to adapt. It is quite old code, so the style is not something I'm proud of these days, but that's hardly unuusal. s is a string, already uppercased. It's reading through a file on disk so it is more complex than it needs to be if the content is in memory (and given how simple it is it should give you some idea that Boyer-Moore really is worthwhile for any case where a decent amount of text is expected).
The built-in `Pos' currently does the naive method. I'd thought about replacing it with a more sophisticated algorithm, but so far it hasn't been important enough to me.
But if anyone wants to, feel free to send me a subsitute (function PosFrom in p/rts/string1.pas). Of course, since `Pos' can be used for all kinds of purposes, it must not be less efficient in simple cases (at least not significantly so).
Frank