Function like strncpy_s() but ensuring correct truncation of UTF-8 strings

lb90 · January 2, 2023, 3:10pm

I have a UTF-8 string that I need to copy into a fixed-size char[] array. It’s not a problem if the string gets truncated, but I want to ensure that the result is a well-formed UTF-8 string. How can I achieve that using GLib?

mcatanzaro · January 2, 2023, 3:27pm

You can use g_utf8_strncpy().

Of course, this is inferior to g_strdup(), and should only be used when you really have no choice but to use a fixed-size char array.

lb90 · January 2, 2023, 3:36pm

Thanks Michael! However, it seems that g_utf8_strncpy is not quite what I need. The count passed to g_utf8_strncpy is a number of characters to take from the source string; it doesn’t take into account the size of the destination buffer in bytes.

For some context, I am working in GIMP. In particular, I have to copy a path into the fixed-size object_name field declared in app/core/gimpbacktrace.h (GIMP_2_99_14 · GNOME / GIMP · GitLab). I’d like to turn object_name into a plain pointer, but that’s out of scope for the work I’m doing at the moment. Hope to do that in the future!

lb90 · January 2, 2023, 4:39pm

Ended up doing the following (NOTE: untested)

/* A byte starts a character unless it is a continuation byte (10xxxxxx). */
#define IS_BEGINNING_OF_UTF8_CHARACTER(c) \
((((unsigned char) (c)) & 0xc0) != 0x80)

static void
utf8_copy_sized (char *dest,
                 const char *src,
                 size_t size)
{
  if (size == 0)
    return;

  memset (dest, 0, size);
  strncpy (dest, src, size);

  if (dest[size - 1] != 0)
    {
      char *p = &dest[size - 1];

      /* Checking for p > dest is not actually needed,
       * but it's useful in case of malformed source string. */
      while (!IS_BEGINNING_OF_UTF8_CHARACTER (*p) && G_LIKELY (p > dest))
        *p-- = 0;

      *p = 0;
    }
}

There’s also the question of strings ending with combining characters, but that shouldn’t be an actual issue here.

PeterB · January 2, 2023, 5:52pm

memset (dest, 0, size);

is redundant:

If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure
that a total of n bytes are written.

lb90 · January 2, 2023, 5:54pm

True, I didn’t remember that!
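
For reference, a minimal sketch of the function above with the memset dropped, written in plain C so it compiles without GLib. The names utf8_copy_sized_sketch and IS_UTF8_CHAR_START are made up for illustration, G_LIKELY is simply omitted, and the start-of-character test treats every byte that is not a continuation byte (10xxxxxx) as the start of a character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True for any byte that is not a UTF-8 continuation byte (10xxxxxx). */
#define IS_UTF8_CHAR_START(c) ((((unsigned char) (c)) & 0xc0) != 0x80)

/* Hypothetical helper: copy src into dest (size bytes in total),
 * truncating on a character boundary and always NUL-terminating. */
static void
utf8_copy_sized_sketch (char       *dest,
                        const char *src,
                        size_t      size)
{
  assert (dest != NULL && src != NULL);

  if (size == 0)
    return;

  /* strncpy() pads with NUL bytes when src is shorter than size,
   * so no memset() is needed beforehand. */
  strncpy (dest, src, size);

  if (dest[size - 1] != '\0')
    {
      char *p = &dest[size - 1];

      /* Clear trailing continuation bytes; the p > dest check only
       * matters for a malformed source string. */
      while (p > dest && !IS_UTF8_CHAR_START (*p))
        *p-- = '\0';

      /* Clear the first byte of the last character as well, which
       * also makes room for the terminator. */
      *p = '\0';
    }
}
```

With a 4-byte buffer, copying "ab€" (5 bytes) should leave "ab", since the 3-byte euro sign no longer fits next to the terminator.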

halfline · January 2, 2023, 8:58pm

i think you can just call g_utf8_strlen after doing the copy and NUL terminate the result. partial characters don’t get included in the returned length.

lb90 · January 2, 2023, 10:20pm

The problem is that g_utf8_strlen returns a “character count”, not a byte length. Anyway, a similar approach is to iterate using g_utf8_next_char until it returns a pointer that is either out of bounds or points to the terminating character '\0'. From what I gather, g_utf8_next_char(p) only reads *p and nothing past p, so it’s safe. Well, technically the last g_utf8_next_char could overflow…

pwithnall · January 3, 2023, 12:06am

If potentially truncating the last character isn’t an issue, just stop the iteration and add your nul terminator when a character is returned which starts within 4 bytes of the end of the buffer. That’s the maximum encoded length of a codepoint. Note that this could split a multi-codepoint character though.
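
The forward-scanning idea could be sketched roughly as below. This is a variant rather than exactly what was described: instead of a fixed 4-byte margin, it reads each character's encoded length from its lead byte and stops as soon as the next character would not fit. It is plain C with no GLib dependency; utf8_seq_len and utf8_copy_forward_sketch are hypothetical names, and the sketch assumes src is valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encoded length implied by a lead byte. ASCII and stray continuation
 * bytes count as 1. Assumes otherwise valid UTF-8. */
static size_t
utf8_seq_len (unsigned char c)
{
  if (c >= 0xf0)
    return 4;
  if (c >= 0xe0)
    return 3;
  if (c >= 0xc0)
    return 2;
  return 1;
}

/* Hypothetical helper: copy as many whole characters of src as fit in
 * dest (size bytes in total, terminator included). Returns the number
 * of bytes copied, not counting the terminator. */
static size_t
utf8_copy_forward_sketch (char       *dest,
                          const char *src,
                          size_t      size)
{
  size_t len = 0;

  assert (dest != NULL && src != NULL);

  if (size == 0)
    return 0;

  while (src[len] != '\0')
    {
      size_t n = utf8_seq_len ((unsigned char) src[len]);

      /* Stop before a character that would run past size - 1 bytes. */
      if (len + n > size - 1)
        break;

      len += n;
    }

  memcpy (dest, src, len);
  dest[len] = '\0';

  return len;
}
```

Like the fixed-margin approach, this only respects code point boundaries, so it can still split a character made of several code points.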

madmurphy · January 3, 2023, 2:39am

How about

/*  Maximum length in bytes, WITHOUT counting the NUL terminator  */
#define MAX_LENGTH_IN_BYTES 100

gsize len = strlen(src);

if (len > MAX_LENGTH_IN_BYTES) {

	g_strlcpy(
		dest,
		src,
		g_utf8_prev_char(src + MAX_LENGTH_IN_BYTES + 1) - src + 1
	);

} else {

	memcpy(dest, src, len + 1);

}

?

―madmurphy

lb90 · January 3, 2023, 11:51am

Yeah, looks good

I’ll probably use that since it’s less code

halfline · January 3, 2023, 12:05pm

yea, sure, but you can convert between a character count and byte length easily using g_utf8_offset_to_pointer

Granted, all these gymnastics are iterating over the string several times, but probably not a big deal for short strings and clearer to read than manually looking for bit patterns, imo.

halfline · January 3, 2023, 12:09pm

ah…

 * g_utf8_find_prev_char:
...
 * @p does not have to be at the beginning of a UTF-8 character. No check
 * is made to see if the character found is actually valid other than
 * it starts with an appropriate byte.

So g_utf8_prev_char does work when started in the middle of a character. Using it is probably the most efficient way to go while still maintaining clarity indeed.

lb90 · January 3, 2023, 12:21pm

you can convert between a character count and byte length easily using g_utf8_offset_to_pointer

Ah yes, totally missed that func!

madmurphy · January 3, 2023, 1:48pm

There are also other variations possible. This might be the shortest and most efficient:

/*  Maximum length in bytes, WITHOUT counting the NUL terminator  */
#define MAX_LENGTH_IN_BYTES 100

gsize len = strlen(src);

if (len > MAX_LENGTH_IN_BYTES) {

	len = g_utf8_prev_char(src + MAX_LENGTH_IN_BYTES + 1) - src;

}

memcpy(dest, src, len);
dest[len] = '\0';

―madmurphy
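
To try the snippet above outside GLib, it could be wrapped in a function along these lines. This is only a sketch: prev_char_standin is a hypothetical stand-in that mimics the documented behavior of g_utf8_prev_char (step back at least one byte, then keep going while on a continuation byte), utf8_truncate_copy_sketch and its dest_size parameter are made-up names, and src is assumed to be valid UTF-8, since the backward scan is unbounded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for g_utf8_prev_char(): move back at least one byte, then
 * continue while the current byte is a continuation byte (10xxxxxx). */
static const char *
prev_char_standin (const char *p)
{
  do
    p--;
  while ((((unsigned char) *p) & 0xc0) == 0x80);

  return p;
}

/* Hypothetical helper. dest_size is the full size of dest, including
 * room for the NUL terminator. Returns the number of bytes copied. */
static size_t
utf8_truncate_copy_sketch (char       *dest,
                           size_t      dest_size,
                           const char *src)
{
  size_t max_len;
  size_t len;

  assert (dest != NULL && src != NULL && dest_size > 0);

  max_len = dest_size - 1;
  len = strlen (src);

  /* If src is too long, cut at the start of the character that
   * contains the byte at index max_len. */
  if (len > max_len)
    len = (size_t) (prev_char_standin (src + max_len + 1) - src);

  memcpy (dest, src, len);
  dest[len] = '\0';

  return len;
}
```

Taking the buffer size as a parameter instead of a MAX_LENGTH_IN_BYTES macro is just a convenience for reuse; the truncation logic is the same.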
