Add Bitap fuzzy string match for small strings to GLib

What I would like to add to GLib.

“Bitap Algorithm” fuzzy string match for strings of size:

  • less than 32 characters
  • less than 64 characters

What it does:

“Find the first match of pattern in string with at most k substitution errors.”

The motivation for this addition is that these are actually viable maximum sizes in many contexts, for string keys that should be searchable by the user, such as:

  • image metadata tag names
  • names of application settings
  • names of people and places
  • etc.

It would be so cool to be able to just do fuzzy search on small strings with GLib, without re-implementing it yourself, or leaning on a library which, for example, you may not want to pull into your GTK application.

The implementation of both 32- and 64-character max-length cases would be short. For the 32-character max-length case, here is a working code example. Keep in mind this is just an example. We would probably want to return an array of match objects, ordered by increasing “k” value ( the number of substitution errors ), instead of just the first match. Bitap finds every match with k or fewer substitution errors ( k=0 is exact match ). It also tells you the number of errors in a given match. These features should be fully utilized if this suggestion makes it to actual implementation.

/*

COMPILE

gcc -o bitapp bitapp.c

RUN

./bitap

*/

 #include <stdlib.h>
 #include <string.h>
 #include <limits.h>
 #include <stdio.h>
 
/* Find the first match of @pattern in @string with at most @k substitution
 * errors.
 *
 * @text - The string we are searching to find @pattern.
 *
 * @pattern - The string we are trying to find in @text.
 *
 * @k - The maximum number of allowed errors. If k is zero, then the match is
 * exact.
 *
 * @returns A pointer to the character in @text where the first match found
 * starts.
 */
const char *bitap_fuzzy_bitwise_search( const char *text,
	const char *pattern, int k)
{
	const char *result = NULL;
	int m = strlen(pattern);
	unsigned long *R;
	unsigned long pattern_mask[CHAR_MAX+1];
	int i, d;

	if (pattern[0] == '\0') return text;
	if (m > 31) return "The pattern is too long!";

	/* Initialize the bit array R */
	R = malloc((k+1) * sizeof *R);
	for (i=0; i <= k; ++i)
		R[i] = ~1;

	/* Initialize the pattern bitmasks */
	for (i=0; i <= CHAR_MAX; ++i)
		pattern_mask[i] = ~0;
	for (i=0; i < m; ++i)
		pattern_mask[pattern[i]] &= ~(1UL << i);

	for (i=0; text[i] != '\0'; ++i) {
		/* Update the bit arrays */
		unsigned long old_Rd1 = R[0];

		R[0] |= pattern_mask[text[i]];
		R[0] <<= 1;

		for (d=1; d <= k; ++d) {
			unsigned long tmp = R[d];
			/* Substitution is all we care about */
			R[d] = (old_Rd1 & (R[d] | pattern_mask[text[i]])) << 1;
			old_Rd1 = tmp;
		}

		if (0 == (R[k] & (1UL << m))) {
			result = (text+i - m) + 1;
			break;
		}
	}

	free(R);
	return result;
}

int main() {
	const char *text = "hhhhhhhhhhhhereeeeeeeeeeeeeeeee";
	const char *pattern = "heee";
	int k = 1;
	const char *a =
	bitap_fuzzy_bitwise_search( text, pattern, k );
	puts( a );
}

Anyways, that is what the code would look like, but less like I just stole it from Wikipedia, which I totally did. ( I can explain every line of code and how it fits into the Bitap algorithm with pictures and stuff. )

The 64-character max-length case is the same code, but with some textual substitutions:

  • “unsigned long long” instead of “unsigned long”
  • “63” instead of “31”.

This is a nice writeup for a nice idea, but this algorithm belongs in a separate library, and that’s fine. People should not be afraid of adding additional dependencies to their applications.

It’s too specific to certain use cases to be useful in GLib — to use it you’d need to know that bitap is the most appropriate algorithm for your fuzzy search, rather than some other fuzzy search algorithm. Adding it to GLib would open the door to building a whole set of fuzzy search algorithms in GLib, but GLib is not a fuzzy search library.

1 Like

That makes sense. This is such a good response. This is my first suggestion for a contribution to GLib - thanks for the great feedback.

2 Likes

In case anyone ends up here looking for “fuzzy match in C”, here is a more useful approximate string matching algorithm, which allows for mismatches (substitutions) like Bitapp does, but has no limit on pattern size. It’s authored by the same guy Baeza-Yates. Here is a C implementation.

You can easily adapt that code to be used in the GLib ecosystem by simply wrapping the Match structs in GList nodes, or something similar. That’s what I plan on doing for my humble search feature of the GTK application I’m contributing to.

Thanks. I love GLib & GTK.

Last comment. This version uses GLib lists and does not make the optimization that imposes a maximum pattern length of 256 characters. The original version by Baeza-Yates and Perleberg is heavily optimized.

This version is not as optimized, but the max pattern length is an argument to the search and preprocess functions, instead of a macro constant.

This is not a GLib function obviously - I just prefixed it with a g_ for my sake, because it uses GLib.

g_match_with_mismatches_v3.c

By leveraging more GNOME things, this fuzzy search code (which uses GLib) has grown into a demoable fuzzy search feature (which uses GTK4). A maximum number of allowed mismatches of k=3 was selected.

gtk4_search_with_mismatches_demo

The critical GNOME things for this fuzzy search feature:

GLib

  • GList
  • GString
  • g_strdup_printf

GTK4

  • GtkSearch / GtkSearchEntry

GList eliminated the need for custom linked lists and provided many useful methods.

GString methods and g_strdup_printf eliminated the need for writing string manipulation methods.

GtkSearch and GtkSearchEntry eliminated a lot of work by providing a signal that is emitted whenever the search entry text stops changing.

I am seriously finally done commenting on this thread. Hope it helps somebody.

1 Like

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.