Re: "pos" and whole words

10 Apr 2005


      Peter N Lewis wrote:
...
That's not really true.  Boyer-Moore is typically O(M/N) where N is
the search string length and M is the text length - that is the 
longer the search string, the quicker the match.  It's also 
relatively easy to implement, since all you have to do is search 
checking the last character, and when it fails slide forward a 
precomputed amount based on the character in the text and the search 
string (ie, a table of characters is needed).  Of coruse, in 
non-ASCII days, this is more challenging, but for UTF-8 you could 
probably treate all high bit set characters as equal and still get 
good results on mostly roman text.
You could also do the matching byte-wise (not character-wise) AFAICS
(at least if you want exact matches, not case-insensitive matches or
so). -- Well, if you have to take into account different
representations of the same characters (accents or so), which can
even be of different lengths, then it gets hairy ...
...
There is some code I use in a shipping product below which you are 
welcome to adapt.  It is quite old code, so the style is not 
something I'm proud of these days, but that's hardly unuusal.  s is a 
string, already uppercased.  It's reading through a file on disk so 
it is more complex than it needs to be if the content is in memory 
(and given how simple it is it should give you some idea that 
Boyer-Moore really is worthwhile for any case where a decent amount 
of text is expected).
The built-in `Pos' currently does the naive method. I'd thought
about replacing it with a more sophisticated algorithm, but so far
it hasn't been important enough to me.
But if anyone wants to, feel free to send me a subsitute
(function PosFrom in p/rts/string1.pas). Of course, since `Pos' can
be used for all kinds of purposes, it must not be less efficient in
simple cases (at least not significantly so).
Frank
-- 
Frank Heckenbach, frank@g-n-u.de, http://fjf.gnu.de/, 7977168E
GPC To-Do list, latest features, fixed bugs:
http://www.gnu-pascal.de/todo.html
GPC download signing key: ACB3 79B2 7EB2 B7A7 EFDE  D101 CD02 4C9D 0FE0 E5E8

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

1996

1995

Re: "pos" and whole words