Re: "pos" and whole words

9 Apr 2005


      Tom Verhoeff wrote:
...
Prof A Olowofoyeku (The African Chief) wrote:
...
I have been trying to implement a "pos" function that only matches
whole words. Below is my best effort. Is there a more efficient (or
"better") way of implementing this? Thanks.
The efficiency of your solution (after correction) depends on the
efficiency of pos.
Better efficiency might be obtained, depending on the characteristics of
the actual parameters.  Searching for short identifiers in relatively
short strings versus searching for `tricky'/longer identifiers in
trickier/longer texts (with many partial matches), but this requires
that you rewrite the implementation of pos.
A typical implementation of pos traverses the text from left to right
(low to high index values), and matches it against the substring,
also from left to right.  When the match fails, the next character in
the text is taken as starting position of the next matching attempt.
If you match the string from right to left against the text, you can
often make a bigger jump in the text for the next candidate position.
You are basically talking about Boyer-Moore and Knuth-Morris-Pratt
searching algorithms, of which there is a good discussion in
Sedgewick.  In practice, when there is a good chance of match
failure on the first char. comparison, little is to be gained over
the elementary techniques.  The other methods are likely to gain on
such examples as "duckbill" in the presence of many "ducktails" or
"they" amidst many "the..." prefixes.
KMP is especially easy to install in place of the elementary
technique, and has the primary advantage of never requiring
backtracking over the input.  This makes the operation become, at
worst, O(M+N) where M and N are the sizes of the needle and the
haystack.  The elementary can be as bad as O(M*N), but normally
will be about O(M+N) anyhow.  BM has the same O(M+N) worst case,
but can (low probability) significantly exceed it.  However, it
requires scanning in the reverse direction.  It is especially
attractive when looking for a rare suffix in the presence of a
common prefix.
...
A New Taxonomy of Sublinear Keyword Pattern Matching Algorithms,
by L.G.W.A. Cleophas, B.W. Watson and G. Zwaan, Computing Science
Report 2004/03, Faculty of Mathematics and Computing Science,
Eindhoven University of Technology, Eindhoven, The Netherlands,
March 2004.
http://alexandria.tue.nl/extra1/wskrap/publichtml/200407.pdf
A Taxonomy of Keyword Pattern Matching Algorithms, by B.W. Watson
and G. Zwaan, Computing Science Report 92/27, Faculty of
Mathematics and Computing Science, Eindhoven University of
Technology, Eindhoven, The Netherlands, 1992, ISSN 0926-4515.
http://alexandria.tue.nl/extra1/wskrap/publichtml/9310412.pdf
I am not familiar with these references, but they sound excellent,
and I shall have to collect them later.  I hope they are in
English. :-)
-- 
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
 "show options" at the top of the article, then click on the 
 "Reply" at the bottom of the article headers." - Keith Thompson

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

1996

1995

Re: "pos" and whole words