Tom Verhoeff wrote:
Prof A Olowofoyeku (The African Chief) wrote:
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
The efficiency of your solution (after correction) depends on the efficiency of pos.
Better efficiency might be obtained, depending on the characteristics of the actual parameters. Searching for short identifiers in relatively short strings versus searching for `tricky'/longer identifiers in trickier/longer texts (with many partial matches), but this requires that you rewrite the implementation of pos.
A typical implementation of pos traverses the text from left to right (low to high index values), and matches it against the substring, also from left to right. When the match fails, the next character in the text is taken as starting position of the next matching attempt. If you match the string from right to left against the text, you can often make a bigger jump in the text for the next candidate position.
You are basically talking about Boyer-Moore and Knuth-Morris-Pratt searching algorithms, of which there is a good discussion in Sedgewick. In practice, when there is a good chance of match failure on the first char. comparison, little is to be gained over the elementary techniques. The other methods are likely to gain on such examples as "duckbill" in the presence of many "ducktails" or "they" amidst many "the..." prefixes.
KMP is especially easy to install in place of the elementary technique, and has the primary advantage of never requiring backtracking over the input. This makes the operation become, at worst, O(M+N) where M and N are the sizes of the needle and the haystack. The elementary can be as bad as O(M*N), but normally will be about O(M+N) anyhow. BM has the same O(M+N) worst case, but can (low probability) significantly exceed it. However, it requires scanning in the reverse direction. It is especially attractive when looking for a rare suffix in the presence of a common prefix.
A New Taxonomy of Sublinear Keyword Pattern Matching Algorithms, by L.G.W.A. Cleophas, B.W. Watson and G. Zwaan, Computing Science Report 2004/03, Faculty of Mathematics and Computing Science, Eindhoven University of Technology, Eindhoven, The Netherlands, March 2004.
http://alexandria.tue.nl/extra1/wskrap/publichtml/200407.pdf
A Taxonomy of Keyword Pattern Matching Algorithms, by B.W. Watson and G. Zwaan, Computing Science Report 92/27, Faculty of Mathematics and Computing Science, Eindhoven University of Technology, Eindhoven, The Netherlands, 1992, ISSN 0926-4515.
http://alexandria.tue.nl/extra1/wskrap/publichtml/9310412.pdf
I am not familiar with these references, but they sound excellent, and I shall have to collect them later. I hope they are in English. :-)