Increasing location and city coverage

daenney · July 28, 2022, 9:11am

I’d like to discuss the option of, sustainably, providing a lot more locations
and cities in libgweather. By sustainable I mean without overloading the libgweather
maintainers.

My reason for wanting to do this is that the current 100k population suggestion
for inclusion can be extremely limiting in some cases. Though the policy says this
is not a hard limit, that makes the decision to include a location or city
a judgement call on the part of the maintainers, who might not be well versed
enough in a region’s human and political geography to make that call.

I’ll make this a bit more concrete by using Sweden as an example. The coverage
of Sweden is a bit odd, in that it includes things like Ljungbyhed, a small
town with a population of roughly 2k, but a city like Uppsala, the capital of
Uppsala County with a population of 170k+ and Sweden’s 4th largest city, is
missing. Similarly, Jämtland county and its capital of Östersund and Gävleborg
county with its capital of Gävle appear to be missing. Those two total about
7% of Sweden’s entire population. For me, this currently puts GNOME Weather
in the “cute but useless” category, as it picks Stockholm as my closest
location. That is correct given what libgweather knows of Sweden, but entirely
useless for weather data. (I’ll be submitting PRs to fix these particular
cases, but there are more.)

I can completely understand how this situation arises: ultimately, someone either
cared to submit a PR for this or not, and the maintainers decided to include it. But
I’d like to explore if there’s a way to improve things without that resulting
in a potential avalanche of updates to Locations.xml for the current
maintainers to deal with. With that in mind, I’m wondering if it would be possible
to have additional location and city sources, in the style of drop-in files. If
libgweather could have a set of predefined directories it reads, then it would be
possible for external repositories to be used to maintain the data.

This would let interested folks maintain a libgweather-data-swe, for example,
where we can include significantly more data for Sweden, with rules for inclusion
tailored to the realities of the country. Additionally, this would make it
easier to use data sources local to the country to generate some of this.
In the particular case of Sweden, we could build some automation based on the
data published by Statistiska Centralbyrån (the national statistics office).
Though this makes sense in the context of Sweden, the approach doesn’t necessarily
translate to other countries, which is why I’m looking at whether it’s possible to
provide the data to libgweather through something like drop-in files.

I can see some problems arising from supporting something like this:

What to do when duplicates are found: as long as an explicit policy is set,
such as parsing files in ASCII-betical order with the last definition winning,
the behaviour becomes predictable and easily overridable by users.

What to do about translations: this one I’m less clear about, though in a
lot of cases having accurate city name translations isn’t crucial (since
they’re primarily relevant to locals who know how to spell them) and in the
majority of cases location names aren’t translated. As the po-locations
translation document states:

“The state/province and city names are more likely to only be seen by people
living in those states/cities, and so are less critical to have 100% translation
of”

What to do about Locations.bin, the database that gets compiled when libgweather
is built: I’m unsure how to solve this one. I can imagine distributions packaging
something like build-aux/meson/gen_locations_variant.py and running that
whenever a package is installed with files in a libgweather drop-in location,
but I’m not sure if that’s workable, or how to handle any user override
scenario (like ~/.local/share/libgweather/...).
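
As a rough sketch of the duplicate policy described above (parse files in ASCII-betical order, last definition wins), something like the following could work. This is only an illustration: the function name `merge_dropins`, the idea of keying entries on a station code, and the flat `location`/`name`/`code`/`coordinates` element layout are all assumptions made for the example, not libgweather's actual schema or loader.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def merge_dropins(directories):
    """Merge location entries from drop-in XML files; the last definition wins.

    Hypothetical helper: the element names used here are assumed for
    illustration and are not taken from libgweather.
    """
    files = []
    for directory in directories:
        files.extend(Path(directory).glob("*.xml"))

    merged = {}
    # Sort by file name only. For ASCII names this is "ASCII-betical" order.
    # The sort is stable, so files with the same name keep the order of the
    # directories they came from, and later directories override earlier ones.
    for path in sorted(files, key=lambda p: p.name):
        for loc in ET.parse(path).getroot().iter("location"):
            code = loc.findtext("code")
            if not code:
                continue  # entries without a key cannot be deduplicated
            merged[code] = {
                "name": loc.findtext("name"),
                "coordinates": loc.findtext("coordinates"),
                "source": str(path),  # record which file supplied the entry
            }
    return merged
```

With this policy, a file named `20-override.xml` redefining an entry from `10-base.xml` would replace it, and passing a user directory last in `directories` would let a same-named user file take precedence.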

What do folks think of this? Is this reasonable, feasible, desirable? Do folks
see other ways of solving this?

mcatanzaro · July 28, 2022, 2:48pm

If external repositories are required to get extra location data, users won’t find them and won’t use them.

I’d prefer to focus on fixing this problem out-of-the-box. The 100k population limit is clearly too high.

daenney · July 28, 2022, 4:21pm

“If external repositories are required to get extra location data, users won’t find them and won’t use them.”

I suppose that’s true, though if distributions were to package them up, it would be less of an issue, but still not ideal.

“I’d prefer to focus on fixing this problem out-of-the-box. The 100k population limit is clearly too high.”

If we want to reconsider a limit for inclusion, I would propose we don’t define it in terms of a minimum population number, but more along the lines of administrative geography. Something like including at least the first administrative division of each country and the top 3-5 cities within it?

For Sweden, for example, this would make sure to include the regions/counties and their capitals, but that isn’t quite enough to cover the population in a meaningful manner. We’d have to add a few cities per county. The top 3 would probably do it in this case, though I suspect this would vary a lot by country.

A model like this would reasonably also work for a country like the Netherlands, where each province plus its respective capital would already cover quite a bit, but you’d still want a few more cities per province. To make this one more concrete, Noord-Brabant for example has 's-Hertogenbosch as its capital, but the largest city is Eindhoven. Tilburg and Breda should be included at the very least too (all of these cities exceed the 100k population mark, so they’d already be up for inclusion under the current rules).

However, if we stretched that model to the US, the first administrative division is the state, and limiting it to that would be too restrictive, especially for big states. The same would go for France: only including the regions would be too limiting, and it would make sense to have representation of each of the departments. China presents a similar issue: you’d not only want provinces but also prefectures at the very least.

I’m not sure if going by administrative division makes for a better end result, but it at least feels a bit less constraining than 100k population.
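
To make the administrative-division idea a bit more concrete, here is a minimal sketch of one possible reading of it: keep every division's capital, plus the `top_n` most populous other cities in that division. The function name `select_cities` and the `division`, `population` and `is_capital` keys are hypothetical, chosen only for this example, and as noted above the right division level and value of `top_n` would likely differ per country.

```python
from collections import defaultdict


def select_cities(cities, top_n=3):
    """Pick each division's capital plus its top_n most populous other cities.

    `cities` is a list of dicts with hypothetical keys "name", "division",
    "population" and an optional "is_capital" flag.
    """
    by_division = defaultdict(list)
    for city in cities:
        by_division[city["division"]].append(city)

    selected = []
    for division in sorted(by_division):  # deterministic output order
        members = by_division[division]
        capitals = [c for c in members if c.get("is_capital")]
        others = sorted(
            (c for c in members if not c.get("is_capital")),
            key=lambda c: c["population"],
            reverse=True,  # largest first
        )
        selected.extend(capitals + others[:top_n])
    return selected
```

A capital is always kept regardless of its size, so a case like a province whose capital is not its largest city still ends up with both the capital and the largest cities represented.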

One thing does remain, though: if the limit is eased and more locations thus become reasonable candidates for inclusion, it has the potential of blowing up the size of Locations.xml (and Locations.bin) as well as the number of PRs to libgweather. I’m not sure how the current maintainers feel about that, or if it would make sense to move the curation of the data out of libgweather to a separate repo (that’s then pulled in at build time or something like that).

ebassi · July 28, 2022, 5:31pm

I’d rather avoid that, as it makes building things exceedingly complicated—especially in the context of sandboxed builds.

Currently, the most time-consuming part of reviewing changes to the locations database is making sure that the merge request conforms to the requirements. People submit their closest location, and I have to go and check on Wikipedia or OpenStreetMap that the coordinates are correct, that the location is relevant, and that there is a METAR tag associated with it. Having some tool that does simple validation and that I can put inside the CI pipeline would already be extremely helpful.
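
A simple validation pass of the kind described here might look roughly like the sketch below. It only covers the mechanical checks (a station code is present and looks like a four-character ICAO-style identifier, and the coordinates parse and fall within valid ranges); whether a location is relevant still needs a human. The function name `validate_locations`, the `location`/`name`/`code`/`coordinates` element names and the "latitude longitude" text format are assumptions for this example and would need to be checked against the real Locations.xml.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of a METAR station code: four upper-case letters or digits.
STATION_CODE_RE = re.compile(r"^[A-Z0-9]{4}$")


def validate_locations(xml_text):
    """Return a list of human-readable problems found in the given XML text."""
    problems = []
    root = ET.fromstring(xml_text)
    for loc in root.iter("location"):
        name = loc.findtext("name") or "<unnamed>"

        code = (loc.findtext("code") or "").strip()
        if not STATION_CODE_RE.match(code):
            problems.append(f"{name}: missing or malformed station code {code!r}")

        coords = (loc.findtext("coordinates") or "").split()
        try:
            # Raises ValueError on a wrong number of fields or non-numeric text.
            lat, lon = (float(c) for c in coords)
        except ValueError:
            problems.append(f"{name}: coordinates are not two numbers: {coords!r}")
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append(f"{name}: coordinates out of range: {lat}, {lon}")
    return problems
```

In a CI job, a script could call this on the changed file and exit with a non-zero status when the returned list is not empty.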

I’m entirely open to lowering the threshold for new cities; the problem is not really adding stuff—with the caveat that some places are more politically complicated than others. The problem, really, is ensuring that the data is up to date. Some locations are renamed; some are removed; some are really multiple locations described as one. Just like time zones, it’s not just a case of dropping a new location and then walking away without looking at the explosion.

daenney · July 29, 2022, 2:04pm

“Having some tool that does simple validation and that I can put inside the CI pipeline would already be extremely helpful.”

That’s good to know. I figured that was part of it, just wasn’t sure how far that would go in helping expand the data.

If I understand you correctly, on the tooling side we’d need something to more accurately validate the location? I’m also curious if there always needs to be a location for a city and vice versa, or if those can be added/removed and treated independently? For example, it’s not too hard to get a list of airport codes for a country and generate locations for them, but if that ends up including every airfield, it’s potentially a metric ton of data which won’t always be useful. So from a tooling/validation perspective, should a location only get included if it’s near a city, and so on?
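
If the answer turns out to be "only include a location when it is near a city", that filter could be sketched along these lines, using the haversine great-circle distance. The function names, the `lat`/`lon` dictionary keys and the 50 km default are all placeholders picked for illustration, not a proposed policy.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stations_near_cities(stations, cities, max_km=50.0):
    """Keep only stations within max_km of at least one city.

    `stations` and `cities` are lists of dicts with hypothetical "lat" and
    "lon" keys holding decimal degrees.
    """
    return [
        s
        for s in stations
        if any(
            haversine_km(s["lat"], s["lon"], c["lat"], c["lon"]) <= max_km
            for c in cities
        )
    ]
```

This compares every station against every city, which is fine for a per-country list but would want a spatial index if run over the whole world at once.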
