Libgweather data validator

In Increasing location and city coverage - #6 @ebassi highlighted that a big issue when it comes to accepting more data into libgweather is the effort required to review it. I indicated I wanted to do something about it, but then I got sidetracked by life, and in the meantime the thread got locked.

I’ve pushed after a few hours of hacking today, and having tried it out on a number of the PRs with the Locations label, it seems to work out pretty well. It will never be 100% correct because names are hard and humans are complicated, but it’ll hopefully help.

As the commit notes, the code is currently horrible, but that’s fixable if this is something that would be useful to develop further, potentially including direct GitLab integration so the results can be posted as comments on the PR.

I’d be curious to hear what folks think of it.

That’s very interesting, thanks!

Another thing that would be interesting to include is the population threshold, though it does require filling out that field—see:

Looking at that issue, there are two things that jump out:

  • Freebase is gone :sweat:
  • Having the Wikimedia database available would be hard

That said, OSM does publish population data for things like cities, sometimes. So at the very least I could try and grab the population tag for a city if it’s there and see if it passes the threshold.

In the meantime I’ve reworked the code a bit, so it’s gone from a high-speed train wreck to a regular train wreck.

Looking at locations.dtd, it doesn’t seem we currently track population at all?

We don’t track population at the moment, no. The point of that issue is to add the population field to the XML (and the GVariant-based blob) for the cities we currently have. Then we could figure out how to get the population of new cities and validate them.

I imagine we’re talking about only tracking population size on city, since location seems to primarily be aerodromes, where that doesn’t make much sense?

I added a really crude validation for it. Assuming a population tag on a city, it’ll check if that number exceeds 100k or not. I’ve added some placeholder code to also check if it matches OSM population data, but I haven’t implemented retrieving that just yet.
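To make the check concrete, here is a minimal sketch of what such a threshold test could look like. It is not the validator's actual code: the population element name and the helper name are assumptions, since locations.xml does not carry population data yet.

```python
import xml.etree.ElementTree as ET

POPULATION_THRESHOLD = 100_000  # the 100k cut-off mentioned above


def population_ok(city, threshold=POPULATION_THRESHOLD):
    """Check a <city> element against the population threshold.

    Returns True or False when a usable population value is present,
    and None when the (hypothetical) <population> tag is missing or
    does not contain an integer.
    """
    tag = city.find("population")  # hypothetical element, not in locations.dtd today
    if tag is None or tag.text is None:
        return None
    try:
        value = int(tag.text.strip())
    except ValueError:
        return None
    return value > threshold


# Made-up sample data, purely for illustration.
sample = ET.fromstring(
    "<city><_name>Example City</_name><population>250000</population></city>"
)
print(population_ok(sample))
```

Returning None rather than False for a missing tag keeps "no data" separate from "too small", which matters while most cities have no population recorded.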

Had a bit more fun with this today.

For one, the city search is now much more robust, which resolves a number of false positives on one of the PRs. I have no idea why I decided to do a reverse search based on the coordinates instead of searching based on city, state and country name, but here we are.

It also fetches the OSM population data now and, assuming it is available, checks it against the threshold. Once we get population data in the libgweather data itself, it’ll also check for that.
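For anyone curious what a name-based lookup plus population check could look like, here is a rough sketch. It assumes the lookup goes through Nominatim's structured search with extratags enabled, which may not be what the validator really does, and the function names are made up for illustration. It only builds the query URL and interprets a result, so it runs without network access.

```python
from urllib.parse import urlencode

# Assumption: the public Nominatim search endpoint is used for lookups.
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_city_query(city, state, country):
    """Build a structured search URL from city, state and country name."""
    params = {
        "city": city,
        "country": country,
        "format": "jsonv2",
        "extratags": 1,  # ask for extra OSM tags such as population
        "limit": 1,
    }
    if state:
        params["state"] = state
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def osm_population(result):
    """Extract the population tag from one search result, if present."""
    extratags = result.get("extratags") or {}  # may be null in the response
    raw = extratags.get("population")
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None  # OSM tag values are free text and not always numeric


def passes_threshold(result, threshold=100_000):
    """True/False when OSM has a population, None when it does not."""
    population = osm_population(result)
    return None if population is None else population > threshold


print(build_city_query("Mohali", "Punjab", "India"))
```

A missing or non-numeric population tag yields None, so the check is skipped for that city instead of counting as a failure.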

There’s an interesting validation issue with city names that’s come up.

MR 219 includes the city of Mohali. However, that one fails validation because Mohali is the colloquial name, whereas its official name is Sahibzada Ajit Singh Nagar. We have no way to encode this currently, so I guess this comes down to what the maintainer decides to do.

MR 230 raises a similar issue. The city is added as Freiburg Littenweiler, which OSM has no idea about, since it’s actually named Littenweiler. Based on the MR, though, I’m guessing folks colloquially refer to it as Freiburg Littenweiler.

A similar but way more fun issue arises for The Netherlands. There are two cities, colloquially known as Den Bosch and Den Haag, whose official names are 's-Hertogenbosch and 's-Gravenhage. Den Haag is The Hague in English.

However, for Den Bosch the official name 's-Hertogenbosch is still used quite often, for example on road signage and in train announcements, whereas 's-Gravenhage has fallen completely out of use except in official documents, and Den Haag is what appears on road signage and the like.

For the city of 's-Hertogenbosch, which is also the name in English, the ambiguity is a bit less, but it would still be nice to be able to encode the alternative name in both the main locations.xml and the Dutch localisation. Especially for folks unfamiliar with the distinction or new to the country, it might take a bit to pick up that they’re the same.

For The Hague, however, it would be nice to be able to encode it in the Dutch localisation as both 's-Gravenhage, the official name, and Den Haag, the actual name everyone uses and probably searches for.

The comments do currently mention this, but it results in the curious situation of the city being named Den Bosch, its colloquial name, in locations.xml, even though the official English name is 's-Hertogenbosch. In the Dutch localisation Den Bosch is absent, since it’s the same thing, but The Hague is translated as Den Haag. This is probably the most correct, but it would be useful to have 's-Gravenhage in there too somehow.

All this to say, is there perhaps a need to make a proper distinction in libgweather between a city’s official name and one or more additional names it’s also referred to as? A city like Mohali would then result in:

<_name>Sahibzada Ajit Singh Nagar</_name>
<otherName>Mohali</otherName>

Den Bosch would theoretically become:

<_name>'s-Hertogenbosch</_name>
<otherName>Den Bosch</otherName>

But The Hague presents an interesting situation:

<_name>The Hague</_name>
<otherName>Den Haag</otherName>

With the name in Dutch then becoming 's-Gravenhage, or

<_name>The Hague</_name>

With the name in Dutch then becoming Den Haag, which is much closer in translation to the English The Hague and is the most commonly used.
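On the validator side, an otherName element like the one proposed above would let a city pass if OSM's name matches either the official name or any of the additional ones. A minimal sketch of that matching follows, assuming otherName sits next to _name inside the city element. Neither that layout nor the helper names exist in libgweather today.

```python
import xml.etree.ElementTree as ET


def known_names(city):
    """Collect the official name plus any (hypothetical) <otherName> entries."""
    names = []
    official = city.findtext("_name")
    if official:
        names.append(official.strip())
    names.extend(e.text.strip() for e in city.findall("otherName") if e.text)
    return names


def matches_osm(city, osm_name):
    """Case-insensitive match of the OSM name against all known names."""
    wanted = osm_name.casefold()
    return any(name.casefold() == wanted for name in known_names(city))


mohali = ET.fromstring(
    "<city>"
    "<_name>Sahibzada Ajit Singh Nagar</_name>"
    "<otherName>Mohali</otherName>"
    "</city>"
)
print(known_names(mohali))
print(matches_osm(mohali, "Mohali"))
```

With something like this, the Mohali case from MR 219 would validate whichever of the two names OSM returns, without the maintainer having to pick one.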