KMP Algorithm for Pattern Searching: Efficient String Matching


10 min read 07-11-2024

Introduction

Imagine searching for a specific word or phrase within a large document. The brute-force approach tries the pattern at every possible starting position in the text and, after a mismatch, starts over one position later. In the worst case this costs O(nm) character comparisons for a text of length n and a pattern of length m. This is where the Knuth-Morris-Pratt (KMP) algorithm shines. It never re-reads text characters that have already been matched, which brings the worst-case cost down to O(n + m).

The Power of the KMP Algorithm

The KMP algorithm finds a specific pattern within a larger string. It still compares characters one at a time, but after a mismatch it consults a precomputed lookup table, the "prefix function," to decide how far the pattern can safely shift without skipping a possible match. The text pointer never moves backward, so the redundant comparisons of the brute-force approach disappear.

Understanding the Algorithm's Core Concepts

At the heart of the KMP algorithm lies the concept of the "prefix function." It is computed from the pattern alone, before the search starts, and determines how far the pattern can safely shift after a mismatch. Let's break down this concept:

1. Prefix Function: The Key to Efficient Shifts

The prefix function for a pattern is a table that stores, for each index i in the pattern, the length of the longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i]. A proper prefix is a prefix that is not the entire string. For example, in "ABAB" the longest proper prefix that is also a suffix is "AB", so the value at index 3 is 2.
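To make the definition concrete, here is a minimal sketch that computes the table directly from the definition by trying every proper prefix length. The function name `prefix_function_naive` is a hypothetical choice for illustration, and this version is deliberately slow; it is meant for checking small examples, not for use in a real search.

```python
def prefix_function_naive(pattern: str) -> list[int]:
    """Compute the prefix function straight from its definition.

    Illustration only: this does far more work than the iterative method.
    """
    table = []
    for end in range(len(pattern)):
        sub = pattern[: end + 1]  # substring ending at index `end`
        best = 0
        # Try every proper prefix length (1 .. len(sub) - 1) and keep the
        # longest one that also appears as a suffix of `sub`.
        for length in range(1, len(sub)):
            if sub[:length] == sub[-length:]:
                best = length
        table.append(best)
    return table


print(prefix_function_naive("ABAB"))  # [0, 0, 1, 2]
```

For "ABAB", the last entry is 2 because "AB" is both a proper prefix and a suffix of the whole string.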

2. Computing the Prefix Function: Building the Lookup Table

We calculate the prefix function iteratively. Consider the pattern "AABAACAADA". The prefix function for this pattern would be:

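| Index | Character | Longest proper prefix that is also a suffix | Prefix function value |
|---|---|---|---|
| 0 | A | (none) | 0 |
| 1 | A | A | 1 |
| 2 | B | (none) | 0 |
| 3 | A | A | 1 |
| 4 | A | AA | 2 |
| 5 | C | (none) | 0 |
| 6 | A | A | 1 |
| 7 | A | AA | 2 |
| 8 | D | (none) | 0 |
| 9 | A | A | 1 |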

This table serves as a lookup for the KMP algorithm, guiding the pattern shifts.
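The iterative computation can be sketched as follows. This is one common way to build the table in linear time, not necessarily the only one; the names `compute_prefix_function` and `k` are assumptions made for this example. The idea is to reuse the values already computed: when the next character does not extend the current prefix, fall back to the next shorter candidate instead of starting from scratch.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Build the prefix function table in O(len(pattern)) time."""
    prefix = [0] * len(pattern)
    k = 0  # length of the longest prefix-suffix found for the previous index
    for i in range(1, len(pattern)):
        # Fall back to shorter candidates until the next character fits
        # or no candidate is left.
        while k > 0 and pattern[i] != pattern[k]:
            k = prefix[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        prefix[i] = k
    return prefix


print(compute_prefix_function("AABAACAADA"))
# [0, 1, 0, 1, 2, 0, 1, 2, 0, 1]
```

The printed values match the last column of the table for "AABAACAADA".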

3. Pattern Matching with the KMP Algorithm: A Step-by-Step Process

The KMP algorithm operates by iteratively comparing the pattern with the text. Let's illustrate this with an example:

  • Text: "ABABDABACDABABCABAB"
  • Pattern: "ABABCABAB"
  1. Initialization: Initialize two pointers, i for the text and j for the pattern, both starting at 0.
  2. Comparison: Compare characters at positions i and j.
  3. Match: If they match, increment both i and j.
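  4. Mismatch with j > 0: Set j to prefixFunction[j - 1] and compare again without moving i. This is equivalent to shifting the pattern right by j - prefixFunction[j - 1] positions.
  5. Mismatch with j = 0: No prefix of the pattern can end here, so increment i and keep j at 0.
  6. Termination: Repeat steps 2-5 until j equals the length of the pattern, in which case a match starts at index i - j, or until i reaches the end of the text, in which case the pattern does not occur.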
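The steps above can be sketched in Python as shown below. This is a minimal illustration under a few assumptions: the function names `compute_prefix_function` and `kmp_search` are hypothetical, an empty pattern is treated as having no matches, and the search keeps going after a hit so that it reports every occurrence rather than only the first.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Build the prefix function table for the pattern."""
    prefix = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = prefix[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        prefix[i] = k
    return prefix


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []  # assumption: an empty pattern yields no matches
    prefix = compute_prefix_function(pattern)
    matches = []
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):  # i only moves forward
        # Mismatch with j > 0: fall back using the prefix function.
        while j > 0 and ch != pattern[j]:
            j = prefix[j - 1]
        # Match: advance the pattern pointer (the loop advances i).
        if ch == pattern[j]:
            j += 1
        # Whole pattern matched: record where it starts and keep searching.
        if j == len(pattern):
            matches.append(i - j + 1)
            j = prefix[j - 1]
    return matches


print(kmp_search("ABABDABACDABABCABAB", "ABABCABAB"))  # [10]
```

Because `i` never decreases and each fallback strictly reduces `j`, the search does a number of comparisons proportional to the length of the text plus the length of the pattern.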

Example:

For the pattern "ABABCABAB" the prefix function values are 0, 0, 1, 2, 0, 1, 2, 3, 4. The table below lists every comparison. On a mismatch with j > 0, j falls back to prefixFunction[j - 1] and i stays where it is; the shift is j - prefixFunction[j - 1].

| i | Text[i] | j | Pattern[j] | Result | Next step |
|---|---|---|---|---|---|
| 0 | A | 0 | A | match | i = 1, j = 1 |
| 1 | B | 1 | B | match | i = 2, j = 2 |
| 2 | A | 2 | A | match | i = 3, j = 3 |
| 3 | B | 3 | B | match | i = 4, j = 4 |
| 4 | D | 4 | C | mismatch | j = prefixFunction[3] = 2 (shift 2) |
| 4 | D | 2 | A | mismatch | j = prefixFunction[1] = 0 (shift 2) |
| 4 | D | 0 | A | mismatch | j is 0, so i = 5 |
| 5 | A | 0 | A | match | i = 6, j = 1 |
| 6 | B | 1 | B | match | i = 7, j = 2 |
| 7 | A | 2 | A | match | i = 8, j = 3 |
| 8 | C | 3 | B | mismatch | j = prefixFunction[2] = 1 (shift 2) |
| 8 | C | 1 | B | mismatch | j = prefixFunction[0] = 0 (shift 1) |
| 8 | C | 0 | A | mismatch | j is 0, so i = 9 |
| 9 | D | 0 | A | mismatch | j is 0, so i = 10 |
| 10 | A | 0 | A | match | i = 11, j = 1 |
| 11 | B | 1 | B | match | i = 12, j = 2 |
| 12 | A | 2 | A | match | i = 13, j = 3 |
| 13 | B | 3 | B | match | i = 14, j = 4 |
| 14 | C | 4 | C | match | i = 15, j = 5 |
| 15 | A | 5 | A | match | i = 16, j = 6 |
| 16 | B | 6 | B | match | i = 17, j = 7 |
| 17 | A | 7 | A | match | i = 18, j = 8 |
| 18 | B | 8 | B | match | i = 19, j = 9 |

After the last row j equals the pattern length (9), so a match is found starting at index i - j = 19 - 9 = 10 of the text. The text pointer i never moved backward, and the whole search took 23 comparisons for a text of 19 characters.
