Calling check functions like ‘g_utf8_validate’ twice

emergenz · August 17, 2021, 1:43pm

It is recommended to validate text before passing it to UTF-8-based text-processing functions. For this reason we usually write a code fragment like this one:

if (g_utf8_validate(textpointer, -1, NULL))
    gtk_text_buffer_set_text(textbuffer, textpointer, -1);
else
    generate_error_message("Not a valid UTF-8 text");

However, I found out that the function gtk_text_buffer_set_text also checks the validity of the text to be inserted. As a result, we call g_utf8_validate twice. I’m quite uncomfortable with this because I don’t like wasting resources, even if CPU power is cheap nowadays.

The problem is that gtk_text_buffer_set_text has no return value for errors. I considered using another function, gtk_text_buffer_insert, but this wouldn’t help, for the same reason. So I must carefully check all input data to be sure that the function will work. My question is whether it is possible to get error information from these functions in any way, something like errno? Ideally this would be a meaningful message that tells me the cause of the failure.
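One way to get a more specific message is to locate the first invalid byte yourself before inserting. GLib's g_utf8_validate() accepts a third argument, end, that receives a pointer to the start of the first invalid character. The sketch below is a dependency-free stand-in in plain C, not GLib's code: first_invalid_utf8_offset is a hypothetical helper that mimics that behaviour so the idea can be tried without GTK.

```c
#include <stddef.h>

/* Hypothetical helper (not a GLib function): returns the byte offset of the
 * first invalid UTF-8 sequence in the NUL-terminated string s, or -1 if the
 * whole string is valid UTF-8. */
long first_invalid_utf8_offset(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    while (p[i] != 0) {
        unsigned char c = p[i];
        size_t need;         /* number of continuation bytes expected */
        unsigned long cp;    /* code point decoded so far */
        unsigned long min;   /* smallest code point allowed for this length */

        if (c < 0x80) {                     /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return (long)i;                 /* stray continuation or bad lead byte */
        }

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80)        /* also stops at a premature NUL */
                return (long)i;
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* Reject overlong encodings, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (long)i;

        i += need + 1;
    }
    return -1;
}
```

An application could put the returned offset into its own error message ("invalid byte at offset 3"). With GLib itself, the same information should be available by passing a const gchar *end to g_utf8_validate() and computing end - textpointer.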

One method would be to listen to the changed or insert-text signal from the text buffer. I guess that it is emitted only if the insertion is successful. However, I would still have no information about the cause of a failure.

Of course, I can simply copy the implementation of gtk_text_buffer_emit_insert, which is used by both functions mentioned above. Whether that would be a proper solution depends on the status of this function. Can a ‘common’ application developer use it? gtk_text_buffer_emit_insert is not listed in the documentation; it appears only in the source code.

I also read the page about error handling in general: https://docs.gtk.org/glib/error-reporting.html but it was not very helpful. In particular, it does not explain why lots of GTK functions that can produce recoverable, user-side errors (e.g. because of the wrong text encoding) don’t use GError.

Any ideas?

ebassi · August 17, 2021, 1:54pm

Because:

it would be terribly non-ergonomic to turn every single failure point into a recoverable error
validation of the input can be disabled: in the case of GtkTextBuffer.insert_text(), the call to g_utf8_validate() is conditional on GTK being built without G_DISABLE_CHECKS
validation and error recovery are two different things

For instance, you may very well convert any random text you get into UTF-8, according to your own set of constraints; you could also decide to fail early, and not call gtk_text_buffer_set_text() at all. Validating UTF-8 is not processing intensive—it’s definitely cheaper than encoding conversions.
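As an illustration of the "convert according to your own constraints" option, here is a minimal sketch that assumes the application has decided (or the user has chosen) that the input is ISO-8859-1. The helper latin1_to_utf8 is hypothetical and written in plain C so it runs without GLib; a GLib application would more likely reach for g_convert() and its GError argument.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated ISO-8859-1 string into a
 * newly allocated UTF-8 string. Returns NULL if allocation fails; the caller
 * frees the result with free(). */
char *latin1_to_utf8(const char *in)
{
    size_t n = strlen(in);
    char *out = malloc(2 * n + 1);   /* each input byte yields at most 2 bytes */
    char *w = out;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            *w++ = (char)c;                       /* ASCII is unchanged */
        } else {
            *w++ = (char)(0xC0 | (c >> 6));       /* lead byte */
            *w++ = (char)(0x80 | (c & 0x3F));     /* continuation byte */
        }
    }
    *w = '\0';
    return out;
}
```

Because every byte sequence is a valid ISO-8859-1 string, this conversion cannot fail on the input itself, which is why the choice of fallback encoding is a policy decision for the application rather than something the text buffer could make.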

The UTF-8 validation in gtk_text_buffer_set_text() is there to prevent programmer errors. You get a critical warning on your terminal, and you’re supposed to fix it. It’s there to catch you when you’ve done something wrong, not when the user did something wrong.

The gtk_text_buffer_emit_insert() function is static to the GtkTextBuffer, so you cannot override it even if you wanted to. Your only option would be to build GTK with all precondition checks disabled.

emergenz · August 17, 2021, 4:02pm

Thank you very much, Emmanuele, for a detailed answer. I see that the error handling must be done to an appropriate extent, not too much and not too little.

I’m very sorry, but I cannot understand this idea. If my code has already validated the text, then the second call to g_utf8_validate within gtk_text_buffer_set_text or gtk_text_buffer_insert is not necessary at all. If my code has not validated the raw data read from a file and it turns out to be non-UTF-8 text, then the insertion function fails and the user gets an empty window without any text. In this case, however, the user can select another encoding (provided the feature to switch between encodings is implemented) and get a positive result. So it depends on the user’s behavior. In other words, the result of the g_utf8_validate call within the insertion functions depends on run-time options set by the user.

I mean something very simple, a replacement for gtk_text_buffer_insert as follows:

if (g_utf8_validate(textpointer, -1, NULL))
    g_signal_emit(textbuffer, signals[INSERT_TEXT], 0, startiter, textpointer, strlen(textpointer));
else
    error_message_to_the_user_not_the_developer("Select an encoding different from UTF-8");

If I understand correctly, G_DISABLE_CHECKS cannot be used when compiling ordinary applications?

emergenz · August 19, 2021, 3:21pm

Meanwhile, I explored the source code doing the text insertion. The architecture of this part of GTK is remarkable. You have a chain of helper (better: convenience) functions that repeatedly do lots of identical checks. In general, I understand why it is so. A developer can use these functions in different combinations and you cannot be sure that the caller has already checked the input data.

My solution is to insert text by emitting the insert-text signal. By doing so, I skip all gtk_text_buffer_* insertion functions and one call to g_utf8_validate. I think that an application developer is allowed to do so because this signal is well documented. In other words, it is not for use by GTK developers only.

I know now that the real insertion function called by the signal handler validates the text again. This means that by using gtk_text_buffer_* insertion functions your text is validated at least twice. The title of this topic gets a slightly different meaning.

BTW, the situation is much more dramatic. The insertion function _gtk_text_btree_insert processes the text line by line. Every line is validated with a call to g_utf8_validate. For one thousand lines, we have one thousand function calls. As mentioned above, I managed to skip ONE call to this function.
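A rough way to reason about that cost: validating a text chunk by chunk still scans each byte once, just as a single call over the whole text does, so what grows with the number of lines is the per-call overhead rather than the amount of data examined. The sketch below illustrates this with a hypothetical instrumented validator in plain C. It is not the _gtk_text_btree_insert code, and counting_validate accepts only ASCII to stay short.

```c
#include <stddef.h>
#include <string.h>

/* Instrumentation counters for the sketch. */
static size_t validate_calls = 0;
static size_t bytes_scanned = 0;

/* Hypothetical stand-in for a length-bounded validator. It accepts only
 * ASCII; the point here is the counting, not the check itself. */
static int counting_validate(const char *s, size_t len)
{
    validate_calls++;
    for (size_t i = 0; i < len; i++) {
        bytes_scanned++;
        if ((unsigned char)s[i] >= 0x80)
            return 0;
    }
    return 1;
}

/* Validates text one line at a time (each chunk includes its newline). */
int validate_by_lines(const char *text, size_t len)
{
    size_t sol = 0;   /* start of the current line */

    while (sol < len) {
        const char *nl = memchr(text + sol, '\n', len - sol);
        size_t chunk_len = nl ? (size_t)(nl - (text + sol)) + 1 : len - sol;

        if (!counting_validate(text + sol, chunk_len))
            return 0;
        sol += chunk_len;
    }
    return 1;
}
```

For a three-line input, validate_by_lines makes three calls but scans the same number of bytes as one call to counting_validate over the whole buffer. Whether the extra calls matter in practice is something only a profile can answer.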

pwithnall · August 19, 2021, 3:34pm

Are the g_utf8_validate() checks showing up on a profiling trace for the code you’re running?

emergenz · August 19, 2021, 4:54pm

Well, the statement about multiple calls to g_utf8_validate within _gtk_text_btree_insert is based on the analysis of the source code.

However, I’m pretty sure that this check is inside, because if I deliberately emit a signal with invalid (non-UTF-8) data, the program crashes with the message:
assertion failed: (g_utf8_validate (&text[sol], chunk_len, NULL))

ebassi · August 19, 2021, 5:02pm

That’s not what @pwithnall asked, though.

Are you profiling this code? Does it have a measurable impact?

emergenz · August 19, 2021, 5:23pm

No, I didn’t. So I cannot say anything about the impact.

You mean that the overhead is negligible? Probably, it is.

It might be because of my background in image processing, where optimizing inner loops is really necessary, and a kind of sport.

matthiasc · August 19, 2021, 6:59pm

Best to find the inner loop first, though.

emergenz · August 19, 2021, 7:53pm

Are there any reasons why the validation function sits in the loop and is not called once somewhere at the start of the function? E.g. just after strlen?

chergert · August 19, 2021, 8:14pm

The reason people are asking for profiling data is because a number of validations are part of development builds only and are compiled out of production builds.

For example, there is a g_utf8_validate() in gtk_text_buffer_emit_insert() which emits a signal containing G_SIGNAL_TYPE_STATIC_SCOPE on the text parameter to avoid copies/checks when boxing GValue for FFI trampolines.

Then in the default handler, _gtk_text_btree_insert() is called. That function also calls g_utf8_validate() but within a g_assert() macro. The point here is that in production builds, g_assert() should be compiled out through the use of -DG_DISABLE_ASSERT.
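To make the "compiled out" point concrete, here is a minimal plain-C sketch of how a check placed inside an assertion macro disappears entirely when the macro is disabled. SKETCH_ASSERT, SKETCH_DISABLE_ASSERT, sketch_validate and sketch_insert are all hypothetical names standing in for g_assert(), G_DISABLE_ASSERT, g_utf8_validate() and the insert routine; this is not GTK's source.

```c
#include <stddef.h>
#include <stdlib.h>

/* How many times the validator actually ran. */
static int validate_calls = 0;

/* Hypothetical validator: only counts invocations and reports success. */
static int sketch_validate(const char *s, size_t len)
{
    (void)s;
    (void)len;
    validate_calls++;
    return 1;
}

/* Stand-in for g_assert(): when SKETCH_DISABLE_ASSERT is defined the macro
 * expands to nothing, so its argument is never evaluated. */
#ifdef SKETCH_DISABLE_ASSERT
#define SKETCH_ASSERT(expr) ((void)0)
#else
#define SKETCH_ASSERT(expr) ((expr) ? (void)0 : abort())
#endif

/* Hypothetical insert routine: validation lives only inside the assertion. */
size_t sketch_insert(const char *text, size_t len)
{
    SKETCH_ASSERT(sketch_validate(text, len));
    return len;   /* pretend len bytes were inserted */
}
```

In a default build each sketch_insert call runs the validator once. Compiling the same file with -DSKETCH_DISABLE_ASSERT removes the call, which is why a source-code reading alone cannot tell you what a given binary actually executes.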

emergenz · August 19, 2021, 9:30pm

You’ve touched an interesting point. If I understand your statement right, a production build should have disabled g_assert macros. If a ‘normal’ Linux distro uses GTK with g_asserts, does it mean that this is virtually a debug version? Are there any sound reasons to offer such a version to an average user?

chergert · August 19, 2021, 9:42pm

Are there any sound reasons to offer such a version to an average user?

I’m not sure that is the best way to frame the question. Rather, I think it should be stated that upstream expects that distributions disable assertions in production builds. That means that we do an incredible amount of verification in debug builds, such as validating btrees, UTF-8, and more, at a rate that is just barely passable in debug builds and non-existent in production builds.

This, of course, is nothing new in C software. Virtually every piece of major software I’ve worked on in the past 20+ years has followed this convention.

mcatanzaro · August 19, 2021, 10:39pm

FWIW we don’t have a consensus on this, and it’s unheard of for distributions to manually add -DG_DISABLE_ASSERT when building packages.

My recommendation is to use some alternative method to disable performance-sensitive assertions at runtime, e.g. #ifndef NDEBUG. Most asserts are not performance-sensitive and provide valuable confidence in release builds.

chergert · August 19, 2021, 11:18pm

Why? That just creates two things distributions do stupidly.

This is literally the entire point of assert from the beginning of time, not to give runtime assurance but to give development time assurance. To make it do something else is just a reinterpretation of a word by people who didn’t live through it.

ebassi · August 19, 2021, 11:46pm

That’s entirely untrue.

That’s a funny way to word the fact that you’re the only one still complaining about this. Everyone else got along with the program.

People building for embedded environments, or simply for products, would inject their own compiler flags all the time; it’s always been expected, that’s why it’s documented. Some general purpose Linux distributions decided not to do that under the impression that “general purpose” means “minimum common denominator”—except they also go around injecting random compiler flags to “fortify” binaries, or to remove things that make debugging and profiling actually work reliably.

mcatanzaro · August 20, 2021, 12:59pm

I can’t imagine having to support Epiphany or glib-networking if g_assert()s were not running in release builds. I need these to make sure the software is not messed up. We actually have a pretty serious case of misbehavior right now that we wouldn’t even know about if g_assert() were to be disabled in release builds because it’s only hit by Evolution users, never by upstream tests.

I’ve done nothing upstream to alter the default behavior of g_assert(), and similarly downstreams have done nothing to disable it. I’m bamboozled as to what you expect here. You really think distributions should add -DG_DISABLE_ASSERT to their default CPPFLAGS? That’s extremely unlikely. If you want asserts disabled you’re going to have to do that manually in your meson.build, like GLib and GTK sort of attempt to do depending on build type (almost no other projects do this afaik).

pwithnall · August 20, 2021, 1:15pm

Just to provide some contextual information: GLib has several macros for disabling and enabling different types of checks. -DG_DISABLE_ASSERT is one of them, but there’s also -DG_DISABLE_CHECKS and -DG_ENABLE_DEBUG.

See the documentation for the differences and for details of which of these are defined/undefined for various types of debug/release build of GLib.

chergert · August 20, 2021, 5:36pm

Exactly. If you want these checks at runtime, you should have it ingrained to use g_return_if_fail()/g_return_val_if_fail()/g_return_val_if_reached() macros, not g_assert(). Those can only be disabled if using -DG_DISABLE_CHECKS which nobody is recommending you do.

Setting -DG_DISABLE_CAST_CHECKS also only turns a FOO_BAR() into a cast without a type check and is also something I generally recommend in production code. That doesn’t affect FOO_IS_BAR() at all so if you leave checks in, you still get type checks at the API boundaries assuming you’re doing the proper thing of precondition checks in public API.

If you are using g_assert() instead of the proper macros for control flow, that’s on you.
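The difference between the two families of macros can be sketched in plain C. SKETCH_RETURN_VAL_IF_FAIL and SKETCH_DISABLE_CHECKS below are hypothetical stand-ins for g_return_val_if_fail() and G_DISABLE_CHECKS, and sketch_name_length is an invented public function. The point is that a failed precondition is reported and control returns to the caller instead of aborting the process.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for g_return_val_if_fail(): on failure, report the violated
 * precondition and return val to the caller. Defining SKETCH_DISABLE_CHECKS
 * compiles the check out. */
#ifdef SKETCH_DISABLE_CHECKS
#define SKETCH_RETURN_VAL_IF_FAIL(expr, val) ((void)0)
#else
#define SKETCH_RETURN_VAL_IF_FAIL(expr, val) \
    do { \
        if (!(expr)) { \
            fprintf(stderr, "%s: check '%s' failed\n", __func__, #expr); \
            return (val); \
        } \
    } while (0)
#endif

/* Hypothetical public API: returns the length of name, or -1 when the
 * precondition (name must not be NULL) is violated. */
long sketch_name_length(const char *name)
{
    long n = 0;

    SKETCH_RETURN_VAL_IF_FAIL(name != NULL, -1);

    while (name[n] != '\0')
        n++;
    return n;
}
```

Calling sketch_name_length(NULL) prints a diagnostic and yields -1, so the program keeps running. An assertion in the same place would terminate it, which is why the two are not interchangeable for control flow.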
