DTD parser changed in 2.14, now throws error

Hi everyone,

I’ve been using libxml for 10+ years (since about v2.6) in one of my projects, so some (or a lot) of the things that I’m using might be old school now, but fortunately never had any issues with new libxml versions.
Until recently, when my distro got the package update from 2.13.8 to 2.14.2.

I have my DTD compiled into my project in one of my header files:

#define KC_DTD  "\
<!ELEMENT kc (keychain)*> \
\
<!ELEMENT keychain (key)*> \
<!ATTLIST keychain \
        name CDATA #REQUIRED \
        description CDATA #REQUIRED \
        created CDATA #REQUIRED \
        modified CDATA #REQUIRED> \
\
<!ELEMENT key EMPTY> \
<!ATTLIST key \
        name CDATA #REQUIRED \
        value CDATA #REQUIRED \
        created CDATA #REQUIRED \
        modified CDATA #REQUIRED> \
"

I’m loading this like so - more or less:

xmlParserInputBufferPtr buf = NULL;
xmlDtdPtr               dtd = NULL;

buf = xmlParserInputBufferCreateMem(KC_DTD, sizeof(KC_DTD), XML_CHAR_ENCODING_NONE);

dtd = xmlIOParseDTD(NULL, buf, XML_CHAR_ENCODING_NONE);
if (!dtd) {
        xmlGenericError(xmlGenericErrorContext, "ERROR: Could not parse kc DTD.\n");
        return(0);
}

A couple of printf()s and other shenanigans are missing from the example but the culprit turned out to be xmlIOParseDTD().

The error output is:

Entity: line 1: parser error : Content error in the external subset
IRED    value CDATA #REQUIRED   created CDATA #REQUIRED         modified CDATA #REQUIRED>
                                                                                         ^

Only thing I could link this with in the libxml changelog was this:

  • xmlIOParseDTD doesn’t allow null bytes at the end of the input anymore.

Which as far as I could trace relates to issue #868, and in turn commit 05bd1720.

I’m at a loss trying to figure out how null bytes are related to this in my case. I’ve been trying to reformat my embedded DTD to iron out possible collation/new-line errors introduced by the pre-processor, to no avail.

Looking at a sample XML and the DTD checked with xmllint(1):

test.xml

<?xml version="1.0" encoding="UTF-8"?>
<kc>
        <keychain name="default" created="1746443463" modified="1746443463" description="">
        </keychain>
</kc>

DTD:

<!ELEMENT kc (keychain)*>

<!ELEMENT keychain (key)*>
<!ATTLIST keychain
        name CDATA #REQUIRED
        description CDATA #REQUIRED
        created CDATA #REQUIRED
        modified CDATA #REQUIRED>

<!ELEMENT key EMPTY>
<!ATTLIST key
        name CDATA #REQUIRED
        value CDATA #REQUIRED
        created CDATA #REQUIRED
        modified CDATA #REQUIRED>

Test (doesn’t report any issues):

$ xmllint --noout --dtdvalid kc.dtd test.xml

To double check, when I deliberately mess up the XML or the DTD, it checks out:

$ xmllint --noout --dtdvalid kc.dtd  regress/test.xml
Document test.xml does not validate against kc.dtd

Am I on the right track in that this change (xmlIOParseDTD doesn’t allow null bytes at the end of the input anymore) actually introduced this? Have I been using this badly the whole time and now that it’s fixed it started to show?

Context:
I’ve had a similar issue with e.g. xmlIOParseDTD() not free()ing its input buffer although the documentation states: input will be freed by the function in any case, so I’ve been free()ing it with xmlFreeParserInputBuffer(buf) since forever. But now with this new libxml version it seems this is actually being free()d and now I get a segfault for a double free() - which is a good thing, I’ll adjust the code, I’m just thinking I might’ve been using this in a way that something’s now been fixed in libxml, my application started to show symptoms of using this the wrong way.

Any help, suggestions would be appreciated,
Daniel

You have to use sizeof(KC_DTD) - 1 here to account for the trailing null byte.

This is sounds like a regression. I’ll make sure that the old behavior will be restored.

This is sounds like a regression. I’ll make sure that the old behavior will be restored.

I had a quick look but I couldn’t find any obvious issues.

I’ve had a similar issue with e.g. xmlIOParseDTD() not free()ing its input buffer although the documentation states: input will be freed by the function in any case , so I’ve been free()ing it with xmlFreeParserInputBuffer(buf) since forever.

Can you provide a test case where the old code didn’t free the input buffer?

First of all, thank you for the clue stick re -1 for the size, I totally misinterpreted the description of the change…

As for providing a test for the free() issue, of course I can’t… The oldest version I could find atm was 2.12.10 and naturally xmlFreeParserInputBuffer() segfaulted there too with an invalid pointer / double free after using xmlIOParseDTD().

It was this silly snippet, btw:

#define KC_DTD  "\
<!ELEMENT kc (keychain)*> \
\
<!ELEMENT keychain (key)*> \
<!ATTLIST keychain \
        nme CDATA #REQUIRED \
        dscription CDATA #REQUIRED \
        ceated CDATA #REQUIRED \
        mdified CDATA #REQUIRED> \
\
<!ELEMENT key EMPTY> \
<!ATTLIST key \
        name CDATA #REQUIRED \
        value CDATA #REQUIRED \
        created CDATA #REQUIRED \
        modified CDATA #REQUIRED> \
"

#include <stdio.h>
#include <stdlib.h>

#include <libxml/parser.h>
#include <libxml/xmlversion.h>


int main(int argc, char *argv[]) {
        xmlParserInputBufferPtr buf = NULL;
        xmlDtdPtr               dtd = NULL;


        buf = xmlParserInputBufferCreateMem(KC_DTD,
                sizeof(KC_DTD) - (LIBXML_VERSION < 21401 ? 0 : 1),
                XML_CHAR_ENCODING_NONE);
        if (!buf) {
                xmlGenericError(xmlGenericErrorContext, "ERROR: Could not allocate buffer for DTD.\n");
                exit(1);
        }

        dtd = xmlIOParseDTD(NULL, buf, XML_CHAR_ENCODING_NONE);
        if (!dtd) {
                xmlGenericError(xmlGenericErrorContext, "ERROR: Could not parse DTD.\n");
                exit(1);
        }

        xmlFreeParserInputBuffer(buf);
        /*
        xmlFreeParserInputBuffer(buf);
        */
}

So either I tested this veery long ago (the commit for the comment in my code dates back to 2013, I’m not even sure which version I was using back then), or it was never actually an issue; it might’ve been that it was on a code path that was already exiting (thus the cleaning up) because of the validation issue, and I actually missed the memory corruption late in the game.

I wouldn’t be surprised if the code for xmlIOParseDTD() in libxml or at least the part that does its clean-up would not have been touched for at least as long a time, too, and it was simply PEBKAC on my end.

Anyway, thanks again for the quick pointer!

Daniel