I have a UTF-8 string and I have to copy it into a fixed-size char[] array. It’s not a problem if the string gets truncated, but I want to ensure that the result is a well-formed UTF-8 string. How can I achieve that using GLib?
You can use g_utf8_strncpy().
Of course, this is inferior to g_strdup(), and should only be used when you really have no choice but to use a fixed-size char[] array.
Thanks Michael! However, it seems that g_utf8_strncpy is not quite what I need. The number of characters passed to g_utf8_strncpy relates to the source string; it doesn’t take into account the size of the destination buffer.
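To illustrate why a character count cannot bound a byte buffer, here is a minimal sketch in plain C (so it builds without GLib). utf8_char_count is a hypothetical helper, not a GLib function, and it assumes the input is valid UTF-8: it simply counts every byte that is not a continuation byte.

```c
#include <stddef.h>

/* Hypothetical helper (not part of GLib): number of UTF-8 characters
 * in a NUL-terminated string. Assumes valid UTF-8 and counts every
 * byte that is not a continuation byte (10xxxxxx). */
static size_t
utf8_char_count (const char *s)
{
  size_t n = 0;

  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xc0) != 0x80)
      n++;

  return n;
}
```

For a string made of two euro signs ("\xe2\x82\xac\xe2\x82\xac") this returns 2 while strlen() returns 6, so a limit expressed in characters says little about how many bytes the destination needs.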
For some context, I am working on GIMP; in particular, I have to copy a path into app/core/gimpbacktrace.h (GIMP_2_99_14 · GNOME / GIMP · GitLab). I’d like to turn object_name into a plain pointer, but that’s out of scope for the work I’m doing at the moment. Hope to do that in the future!
Ended up doing the following (NOTE: untested)
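#include <string.h>
#include <glib.h>

/* A byte starts a UTF-8 character unless it is a continuation
 * byte, i.e. unless it has the form 10xxxxxx. */
#define IS_BEGINNING_OF_UTF8_CHARACTER(c) \
  (((c) & 0xc0) != 0x80)

static void
utf8_copy_sized (char       *dest,
                 const char *src,
                 size_t      size)
{
  if (size == 0)
    return;

  memset (dest, 0, size);
  strncpy (dest, src, size);

  /* A non-NUL last byte means src did not fit: drop the last
   * (possibly partial) character to make room for the terminator. */
  if (dest[size - 1] != 0)
    {
      char *p = &dest[size - 1];

      /* Checking for p > dest is not actually needed,
       * but it's useful in case of malformed source string. */
      while (!IS_BEGINNING_OF_UTF8_CHARACTER (*p) && G_LIKELY (p > dest))
        *p-- = 0;
      *p = 0;
    }
}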
There’s also the question of strings ending with combining characters, but that shouldn’t be an actual issue.
memset (dest, 0, size); is redundant:
If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written.
True, I didn’t remember that!
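The padding behaviour is easy to check. A small sketch in plain C; strncpy_pads_with_nul is a hypothetical name used only for this demonstration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical demo helper: returns 1 if strncpy() zero-filled the
 * remainder of the destination when the source was shorter than n. */
static int
strncpy_pads_with_nul (void)
{
  char dest[8];
  size_t i;

  memset (dest, 'x', sizeof dest);    /* poison the buffer first */
  strncpy (dest, "ab", sizeof dest);  /* source is shorter than n */

  /* Every byte after "ab" should now be NUL. */
  for (i = 2; i < sizeof dest; i++)
    if (dest[i] != '\0')
      return 0;

  return 1;
}
```

The converse is what the truncation check relies on: when the source is at least n bytes long, strncpy() writes no terminator at all, so a non-NUL last byte signals that the string did not fit.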
I think you can just call g_utf8_strlen after doing the copy and NUL-terminate the result. Partial characters don’t get included in the returned length.
The problem is that g_utf8_strlen returns a “character count”, not a byte length. Anyway, a similar approach is to iterate using g_utf8_next_char until it returns a pointer that is either out of bounds or points to the terminating character '\0'. From what I gather, g_utf8_next_char(p) only reads *p and nothing past p, so it’s safe. Well, technically the last g_utf8_next_char could overflow…
If potentially truncating the last character isn’t an issue, just stop the iteration and add your nul terminator when a character is returned which starts within 4 bytes of the end of the buffer. That’s the maximum encoded length of a codepoint. Note that this could split a multi-codepoint character though.
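The forward scan discussed above can be sketched in plain C as follows. Instead of a fixed 4-byte safety margin, it checks the exact length implied by each lead byte before advancing, so it never steps past the byte budget. Both utf8_seq_len and utf8_fitting_prefix are hypothetical helpers (not GLib functions), and the sketch assumes src is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: encoded length implied by a lead byte.
 * Returns 1 for ASCII and for any unexpected byte, so that the
 * scan below always advances. */
static size_t
utf8_seq_len (unsigned char lead)
{
  if ((lead & 0xe0) == 0xc0) return 2;
  if ((lead & 0xf0) == 0xe0) return 3;
  if ((lead & 0xf8) == 0xf0) return 4;
  return 1;
}

/* Hypothetical helper: length in bytes of the longest prefix of src
 * that consists of whole characters and fits in max_bytes (the NUL
 * terminator is NOT counted). Assumes src is valid UTF-8. */
static size_t
utf8_fitting_prefix (const char *src, size_t max_bytes)
{
  size_t used = 0;

  while (src[used] != '\0')
    {
      size_t n = utf8_seq_len ((unsigned char) src[used]);

      if (n > max_bytes - used)  /* next character would not fit */
        break;
      used += n;
    }

  return used;
}
```

The caller would then memcpy() that many bytes into a buffer of at least max_bytes + 1 bytes and write the terminator. As noted above, this only avoids splitting a code point; a multi-codepoint character can still be cut.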
How about

/* Maximum length in bytes, WITHOUT counting the NUL terminator */
#define MAX_LENGTH_IN_BYTES 100

gsize len = strlen(src);

if (len > MAX_LENGTH_IN_BYTES) {
    g_strlcpy(
        dest,
        src,
        g_utf8_prev_char(src + MAX_LENGTH_IN_BYTES + 1) - src + 1
    );
} else {
    memcpy(dest, src, len + 1);
}

?
―madmurphy
Yeah, looks good
I’ll probably use that since it’s less code
yea, sure, but you can convert between a character count and byte length easily using g_utf8_offset_to_pointer
Granted, all these gymnastics are iterating over the string several times, but probably not a big deal for short strings and clearer to read than manually looking for bit patterns, imo.
ah…
 * g_utf8_find_prev_char:
 ...
 * @p does not have to be at the beginning of a UTF-8 character. No check
 * is made to see if the character found is actually valid other than
 * it starts with an appropriate byte.
So g_utf8_prev_char does work when started in the middle of a character. Using it is probably the most efficient way to go while still maintaining clarity indeed.
you can convert between a character count and byte length easily using g_utf8_offset_to_pointer
Ah yes, totally missed that func!
There are also other variations possible. This might be the shortest and most efficient:
/* Maximum length in bytes, WITHOUT counting the NUL terminator */
#define MAX_LENGTH_IN_BYTES 100

gsize len = strlen(src);

if (len > MAX_LENGTH_IN_BYTES) {
    len = g_utf8_prev_char(src + MAX_LENGTH_IN_BYTES + 1) - src;
}

memcpy(dest, src, len);
dest[len] = '\0';
―madmurphy
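For reference, the last variant can be wrapped in a function that takes the destination size. The sketch below is a GLib-free stand-in, so the backward scan that g_utf8_prev_char performs is written out by hand. utf8_copy_truncated, its parameters and its return value are assumptions for illustration, not part of GLib or of the posts above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of GLib): copy src into dest, which
 * holds dest_size bytes INCLUDING the NUL terminator, truncating only
 * at a character boundary. Returns the number of bytes copied, not
 * counting the terminator. */
static size_t
utf8_copy_truncated (char *dest, const char *src, size_t dest_size)
{
  size_t len;

  if (dest_size == 0)
    return 0;

  len = strlen (src);
  if (len >= dest_size)
    {
      /* Too long: look at the first byte that does not fit and walk
       * back while it is a continuation byte (10xxxxxx), so that the
       * cut lands on the start of a character. The len > 0 guard
       * only matters for malformed input. */
      len = dest_size - 1;
      while (len > 0 && ((unsigned char) src[len] & 0xc0) == 0x80)
        len--;
    }

  memcpy (dest, src, len);
  dest[len] = '\0';
  return len;
}
```

With "h\xc3\xa9llo" as the source, a 3-byte destination receives just "h" (the two-byte é would not fit next to the terminator), while a 4-byte destination receives "h\xc3\xa9".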