A working example for g_utf8_validate()?

Some colleagues and I are having problems with a program in which libpango calls the GLib function ‘g_utf8_validate()’. Regardless of platform, the results are consistent for very simple UTF-8 strings, i.e. the low-value characters such as digits and A-Z. But for anything more complicated, it gives different results on different platforms and compilers. The higher-value UTF-8 characters seem to be displayed okay if compiled for Linux (with gcc), and some of them work if compiled for Windows (with gcc). But I haven’t yet found any that display properly if compiled for Windows with MSVC.

So can anyone point me in the direction of a coding example that’ll show ‘g_utf8_validate()’ being used correctly?

Just to discount this possibility first off: are you sure the strings you’re handling on Windows are UTF-8 and not UTF-16? The native encoding on Windows is UTF-16 and depending on where you’re getting the strings from, they potentially are encoded as that.

Many thanks Philip - here’s a very simple example of a call which fails:-

gboolean valid = g_utf8_validate ("\u00A9", -1, NULL); // <-- Returns FALSE !!!

IIUC the string "\u00A9" should equate to the UTF-8 character for a copyright symbol, but after g_utf8_validate() returns, valid is FALSE. Admittedly this is just a single character, so does it need to be NUL-terminated maybe?

That’s not how you use g_utf8_validate(): \u00A9 is an ASCII encoding that you have to transform into a Unicode code point.

You use g_utf8_validate() on an actual UTF-8 string, e.g.:

g_assert (g_utf8_validate ("©", -1, NULL));

If you want to decode a \u-escaped string, you can use something like this:

static inline gboolean
is_hex_digit (char c)
{
  return (c >= '0' && c <= '9') ||
         (c >= 'a' && c <= 'f') ||
         (c >= 'A' && c <= 'F');
}

static char
to_hex_digit (char c)
{
  return (c <= '9') ? c - '0' : (c & 7) + 9;
}

gunichar
get_unichar (const char *str)
{
  g_assert (str[0] == '\\' && str[1] == 'u');
  g_assert (strlen (str) == 6);

  gunichar uchar = 0;
  for (int i = 0; i < 4; i++)
    {
      char ch = *(str + (2 + i));

      if (is_hex_digit (ch))
        uchar += ((gunichar) to_hex_digit (ch) << ((3 - i) * 4));
      else
        break;
    }

  g_assert (g_unichar_validate (uchar));

  return uchar;
}

At that point, you can use GString and g_string_append_unichar() to append the gunichar to a buffer.

Are you certain?

The compiler should interpret that. On both POSIX and Windows systems, "\u1234" and "\U12345678" produce a UTF-8 string, while a literal "©" is ambiguous.

#include <stdio.h>
#include <string.h>

int main ()
{
  char const* foo = "\u00A9";
  for (size_t i = 0; i < strlen (foo); ++i) {
    printf ("%02x ", foo[i] & 0xff);
  }
  return 0;
}

prints

c2 a9

The “\uXXXX” escape sequence is a C99 feature, and it’s converted by the compiler, sure; but it’s only going to be applied to string literals, not to any random string.

Good point, in our case they are all string literals which are interpreted at compile time in C++11 code [1]. This works fine with gcc, clang and mingw, but apparently fails with MSVC for @johne53 for some reason.


[1]

Unicode literals in Visual C++ - Stack Overflow suggests that MSVC might be interpreting the \u escapes into the wrong codepage, so they are compiled as something which isn’t UTF-8. Is that the case?


You may need to set an appropriate flag for MSVC:

/execution-charset:utf-8

Or simply:

/utf-8

See:

  1. https://pspdfkit.com/blog/2021/string-literals-character-encodings-and-multiplatform-cpp/
  2. /execution-charset (Set execution character set) | Microsoft Learn
  3. String and character literals (C++) | Microsoft Learn

Let me know if that’s not enough to address the issue!

Adding u8 before the relevant strings has helped a lot and doesn’t seem to be upsetting the non-MSVC compilers so far! Many thanks for everyone’s help with this.