A working example for g_utf8_validate()?

johne53 · January 14, 2023, 4:33pm

Some colleagues and I are having problems with a program in which libpango calls the glib function ‘g_utf8_validate()’. Regardless of platform, it seems consistent for very simple UTF8 strings - i.e. the low value characters, such as numbers and A-Z characters. But for anything more complicated it’s giving different results for different platforms and compilers. The higher value UTF8 characters seem to get displayed okay if compiled for Linux (with gcc). And some of them work if compiled for Windows (with gcc…) But I haven’t found any yet that’ll display properly if compiled for Windows with MSVC.

So can anyone point me in the direction of a coding example that’ll show ‘g_utf8_validate()’ being used correctly?

pwithnall · January 14, 2023, 6:37pm

Just to discount this possibility first off: are you sure the strings you’re handling on Windows are UTF-8 and not UTF-16? The native encoding on Windows is UTF-16 and depending on where you’re getting the strings from, they potentially are encoded as that.

johne53 · January 14, 2023, 6:56pm

Many thanks Philip - here’s a very simple example of a call which fails:-

gboolean valid = g_utf8_validate ("\u00A9", -1, NULL); // <-- Returns FALSE !!!

IIUC the string "\u00A9" should equate to the UTF-8 character for a copyright symbol, but after g_utf8_validate() returns, valid is FALSE. Admittedly this is just a single character, so does it maybe need to be NUL-terminated?

ebassi · January 14, 2023, 7:07pm

That’s not how you use g_utf8_validate(): \u00A9 is an ASCII encoding that you have to transform into a Unicode code point.

You use g_utf8_validate() on an actual UTF-8 string, e.g.:

g_assert (g_utf8_validate ("©", -1, NULL));

If you want to decode a \u-escaped string, you can use something like this:

#include <string.h>
#include <glib.h>

static inline gboolean
is_hex_digit (char c)
{
  return (c >= '0' && c <= '9') ||
         (c >= 'a' && c <= 'f') ||
         (c >= 'A' && c <= 'F');
}

static char
to_hex_digit (char c)
{
  return (c <= '9') ? c - '0' : (c & 7) + 9;
}

gunichar
get_unichar (const char *str)
{
  g_assert (str[0] == '\\' && str[1] == 'u');
  g_assert (strlen (str) == 6);

  gunichar uchar = 0;
  for (int i = 0; i < 4; i++)
    {
      char ch = *(str + (2 + i));

      if (is_hex_digit (ch))
        uchar += ((gunichar) to_hex_digit (ch) << ((3 - i) * 4));
      else
        break;
    }

  g_assert (g_unichar_validate (uchar));

  return uchar;
}

At that point, you can use GString and g_string_append_unichar() to append the gunichar to a buffer.

x42 · January 14, 2023, 7:25pm

Are you certain?

The compiler should interpret that. On both POSIX and Windows systems, "\u1234" and "\U12345678" produce a UTF-8 string, while a literal "©" is ambiguous, since its bytes depend on the encoding of the source file.

#include <stdio.h>
#include <string.h>

int main ()
{
  char const* foo = "\u00A9";
  for (size_t i = 0; i < strlen (foo); ++i) {
    printf ("%02x ", foo[i] & 0xff);
  }
  return 0;
}

prints

c2 a9

ebassi · January 14, 2023, 7:29pm

The “\uXXXX” escape sequence is a C99 feature, and it’s converted by the compiler, sure; but it’s only going to be applied to string literals, not to any random string.

x42 · January 14, 2023, 7:37pm

Good point, in our case they are all string literals which are interpreted at compile time in C++11 code [1]. This works fine with gcc, clang and mingw, but apparently fails with MSVC for @johne53 for some reason.

[1] github.com: Ardour/ardour/blob/d1b72b28ece1cad2715d52f092dd0c2aa59d9592/gtk2_ardour/export_report.cc#L771-L777

  layout->set_text ("\u274C"); // cross mark
} else if (lufs < pi->LUFS_range[1]) {
  cr->set_source_rgba (.6, .7, 0, 1.0);
  layout->set_text ("\U0001F509"); // speaker icon w/1 bar
} else {
  cr->set_source_rgba (.1, 1, .1, 1.0);
  layout->set_text ("\u2714"); // heavy check mark

pwithnall · January 14, 2023, 8:45pm

The Stack Overflow question “Unicode literals in Visual C++” suggests that MSVC might be converting the \u escapes into the wrong code page, so they are compiled to something which isn’t UTF-8. Is that the case?

lb90 · January 14, 2023, 11:25pm

You may need to set an appropriate flag for MSVC:

/execution-charset:utf-8

Or simply:

/utf-8

Let me know if that’s not enough to address the issue!

johne53 · January 16, 2023, 1:21pm

Adding u8 before the relevant strings has helped a lot and doesn’t seem to be upsetting the non-MSVC compilers so far! Many thanks for everyone’s help with this.
