Add Bitap fuzzy string match for small strings to GLib

What I would like to add to GLib.

“Bitap Algorithm” fuzzy string match for strings of size:

  • less than 32 characters
  • less than 64 characters

What it does:

“Find the first match of pattern in string with at most k substitution errors.”

The motivation for this addition is that these are actually viable maximum sizes in many contexts, for string keys that should be searchable by the user, such as:

  • image metadata tag names
  • names of application settings
  • names of people and places
  • etc.

It would be so cool to be able to just do fuzzy search on small strings with GLib, without re-implementing it yourself, or leaning on a library which, for example, you may not want to pull into your GTK application.

The implementation of both 32- and 64-character max-length cases would be short. For the 32-character max-length case, here is a working code example. Keep in mind this is just an example. We would probably want to return an array of match objects, ordered by increasing “k” value ( the number of substitution errors ), instead of just the first match. Bitap finds every match with k or fewer substitution errors ( k=0 is exact match ). It also tells you the number of errors in a given match. These features should be fully utilized if this suggestion makes it to actual implementation.

/*

COMPILE

gcc -o bitapp bitapp.c

RUN

./bitap

*/

 #include <stdlib.h>
 #include <string.h>
 #include <limits.h>
 #include <stdio.h>
 
/* Find the first match of @pattern in @string with at most @k substitution
 * errors.
 *
 * @text - The string we are searching to find @pattern.
 *
 * @pattern - The string we are trying to find in @text.
 *
 * @k - The maximum number of allowed errors. If k is zero, then the match is
 * exact.
 *
 * @returns A pointer to the character in @text where the first match found
 * starts.
 */
const char *bitap_fuzzy_bitwise_search( const char *text,
	const char *pattern, int k)
{
	const char *result = NULL;
	int m = strlen(pattern);
	unsigned long *R;
	unsigned long pattern_mask[CHAR_MAX+1];
	int i, d;

	if (pattern[0] == '\0') return text;
	if (m > 31) return "The pattern is too long!";

	/* Initialize the bit array R */
	R = malloc((k+1) * sizeof *R);
	for (i=0; i <= k; ++i)
		R[i] = ~1;

	/* Initialize the pattern bitmasks */
	for (i=0; i <= CHAR_MAX; ++i)
		pattern_mask[i] = ~0;
	for (i=0; i < m; ++i)
		pattern_mask[pattern[i]] &= ~(1UL << i);

	for (i=0; text[i] != '\0'; ++i) {
		/* Update the bit arrays */
		unsigned long old_Rd1 = R[0];

		R[0] |= pattern_mask[text[i]];
		R[0] <<= 1;

		for (d=1; d <= k; ++d) {
			unsigned long tmp = R[d];
			/* Substitution is all we care about */
			R[d] = (old_Rd1 & (R[d] | pattern_mask[text[i]])) << 1;
			old_Rd1 = tmp;
		}

		if (0 == (R[k] & (1UL << m))) {
			result = (text+i - m) + 1;
			break;
		}
	}

	free(R);
	return result;
}

int main() {
	const char *text = "hhhhhhhhhhhhereeeeeeeeeeeeeeeee";
	const char *pattern = "heee";
	int k = 1;
	const char *a =
	bitap_fuzzy_bitwise_search( text, pattern, k );
	puts( a );
}

Anyways, that is what the code would look like, but less like I just stole it from Wikipedia, which I totally did. ( I can explain every line of code and how it fits into the Bitap algorithm with pictures and stuff. )

The 64-character max-length case is the same code, but with some textual substitutions:

  • “unsigned long long” instead of “unsigned long”
  • “63” instead of “31”.

This is a nice writeup for a nice idea, but this algorithm belongs in a separate library, and that’s fine. People should not be afraid of adding additional dependencies to their applications.

It’s too specific to certain use cases to be useful in GLib — to use it you’d need to know that bitap is the most appropriate algorithm for your fuzzy search, rather than some other fuzzy search algorithm. Adding it to GLib would open the door to building a whole set of fuzzy search algorithms in GLib, but GLib is not a fuzzy search library.

1 Like

That makes sense. This is such a good response. This is my first suggestion for a contribution to GLib - thanks for the great feedback.

2 Likes