Re: "pos" and whole words

10 Apr 2005


      ...
You are basically talking about Boyer-Moore and Knuth-Morris-Pratt
searching algorithms, of which there is a good discussion in
Sedgewick.  In practice, when there is a good chance of match
failure on the first char. comparison, little is to be gained over
the elementary techniques.  The other methods are likely to gain on
such examples as "duckbill" in the presence of many "ducktails" or
"they" amidst many "the..." prefixes.
That's not really true.  Boyer-Moore is typically O(M/N) where N is 
the search string length and M is the text length - that is the 
longer the search string, the quicker the match.  It's also 
relatively easy to implement, since all you have to do is search 
checking the last character, and when it fails slide forward a 
precomputed amount based on the character in the text and the search 
string (ie, a table of characters is needed).  Of coruse, in 
non-ASCII days, this is more challenging, but for UTF-8 you could 
probably treate all high bit set characters as equal and still get 
good results on mostly roman text.
There is some code I use in a shipping product below which you are 
welcome to adapt.  It is quite old code, so the style is not 
something I'm proud of these days, but that's hardly unuusal.  s is a 
string, already uppercased.  It's reading through a file on disk so 
it is more complex than it needs to be if the content is in memory 
(and given how simple it is it should give you some idea that 
Boyer-Moore really is worthwhile for any case where a decent amount 
of text is expected).
That said, why not just add PCRE to your project.  Its BSD licensed 
so should be useable with any project, and is relatively painless. 
Then you can do your search with the regexp /b\Qsearchstrin\E\b and 
it will get all the edge cases right for you, and handle unicode, and 
all you to offer regex searching too.
http://www.pcre.org/
Enjoy,
    Peter.
function BoyerMooreFind: OSErr;
{ Boyer-Moore basic search }
{ From Algorithms by Robert Sedgewick, Second Edition, pg 287-288 }
   var
     skip: array[char] of integer;
     i, j: integer;
     pos, count: longInt;
     oe: OSErr;
     ch: char;
begin
   for i := 0 to 255 do
     skip[chr(i)] := len;
   for i := 1 to len do
     skip[s[i]] := len - i;
   oe := GetFPos(ob.refnum, pos);
   for i := 1 to len do
     ob.buffer[i] := chr(0);
   i := len + 1;
   j := len;
   while (oe = noErr) and (j >= 1) and (pos < ob.filelen) do begin
     BlockMove(@ob.buffer[i - len], @ob.buffer, len - 1);
     i := len;
     count := SizeOf(OBBufferType) - len;
     if count + pos > ob.filelen then
       count := ob.filelen - pos;
     oe := FSRead(ob.refnum, count, @ob.buffer[len]);
     if (oe = noErr) & (count = 0) then
       oe := errGeneric;
     pos := pos + count;
     if oe = noErr then
       while (i <= count) and (j >= 1) do begin
         ch := UpperCaseChar(ob.buffer[i]);
         if ch = s[j] then begin
           i := i - 1;
           j := j - 1;
         end else begin
           if len - j + 1 > skip[ch] then begin
             i := i + len - j + 1;
           end else begin
             i := i + skip[ch];
           end;
           j := len;
         end;
       end;
   end;
   if j >= 1 then begin
     if oe = noErr then
       oe := errGeneric;
   end else begin
     oe := SetFPos(ob.refnum, fsFromStart, pos - count + i);
   end;
   BoyerMooreFind := oe;
end;
-- 
http://www.stairways.com/  http://download.stairways.com/

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

1996

1995

Re: "pos" and whole words