Four Short Rants on International Data

On Naming Conventions

I'm going to name some countries, and there's nothing you can do to stop me:

- "Macedonia"
- "The Former Yugoslav Republic of Macedonia"
- "Macedonia, the Former Yugoslav Republic of"
- "Republic of Macedonia"
- "Република Македонија"
- "Республика Македония"
- "Republik Mazedonien"
- "Makedonian tasavalta"
- "Republika Makedonija"

Can you tell what all these country names have in common? Chances are, yes, you probably can. You know who can't tell? Computers. Each one of these strings looks different to whatever naive automated process is going through your country data. When you use an idiosyncratic naming convention, you force a slow, expensive, fleshy human (like me) to manually interpret your data.

We have this great little thing called the ISO 3166 standard. It was invented in the 1970s alongside floppy discs and the VCR, and it eliminates these ambiguities. Please use this in your international dataset. I'm looking at you, United Nations Development Programme.
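To make the complaint concrete, here is a minimal Python sketch of the chore a fleshy human ends up doing. The alias table is hand-built from the list above and is purely illustrative; the helper names (`normalise`, `to_alpha3`) are my own inventions, not part of any standard library. The only fact borrowed from ISO 3166-1 is that the alpha-3 code for this country is `MKD`.

```python
import unicodedata
from typing import Optional

# The nine spellings from the list above.
NAMES = [
    "Macedonia",
    "The Former Yugoslav Republic of Macedonia",
    "Macedonia, the Former Yugoslav Republic of",
    "Republic of Macedonia",
    "Република Македонија",
    "Республика Македония",
    "Republik Mazedonien",
    "Makedonian tasavalta",
    "Republika Makedonija",
]


def normalise(name: str) -> str:
    """Apply Unicode NFKC, fold case, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.split())


# Hypothetical, hand-maintained alias table: every spelling maps to the
# ISO 3166-1 alpha-3 code. A real table would need an entry per alias per country.
ALIASES_TO_ALPHA3 = {normalise(name): "MKD" for name in NAMES}


def to_alpha3(name: str) -> Optional[str]:
    """Return the alpha-3 code for a known alias, or None if unrecognised."""
    return ALIASES_TO_ALPHA3.get(normalise(name))


# A naive process sees nine different countries...
distinct_raw = len(set(NAMES))
# ...whereas the codes collapse to a single one.
distinct_codes = {to_alpha3(name) for name in NAMES}
```

If the dataset had shipped with the code in the first place, none of this table would need to exist, and nobody would have to keep it up to date.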

On Availability

There is no well-established indicator for a country's attitudes towards LGBT issues, but enterprising people build their own anyway. The Wikipedia page on LGBT rights by country and the Spartacus International Gay Guide both have something approaching a composite index on these subjects. Both rely heavily on the ILGA annual report on state-sponsored homophobia as a source. The ILGA website itself has a (broken) interactive map on international LGBT legislation, presumably based on its own data.

The ILGA data is only publicly available in the form of prose in its annual report, which is a PDF document. When those Wikipedia contributors, Spartacus guide editors, and ILGA map-makers produced their various data-based offerings, they almost certainly had to go through that document manually, country by country, territory by territory. This is a staggering waste of time, effort and resources.

Looking at its website, the ILGA clearly doesn't have the internal resources to offer up all-singing, all-dancing techno-stats wizardry, but if it published its raw data, it wouldn't have to. Other people could, and would, do it for them.

On Access

Transparency International is an NGO dedicated to combatting corruption. Every year they compile a composite index called the Corruption Perceptions Index, which ranks countries based on their publicly perceived level of corruption. This is useful data for a lot of organisations, and is referenced extensively in work with an international scope. The full index data is available on the Transparency International website.

In an Excel file. In a zip file.

I will say three very positive things about Transparency International:

1) From a scholastic perspective, the data they provide is impeccable. That Excel file is a really good Excel file, with lots of salient metadata, and it's bundled with their full methodology for compiling the index. Mad props. A++.

2) They practise what they preach regarding openness. Their operational budget for 2015 is €21,559,000, and finding this out took me a matter of seconds on their website.

3) They have clearly invested heavily in their website and presumably understand the importance of a good web presence.

I would accept an Excel file from ILGA because they're a much less well-funded organisation. (Their current operational budget is more like €1,400,000, which I found in a Googled job advert.) But Transparency International clearly has all the pieces in place for offering this up through a lightweight REST API, or at least as more easily interrogable data. As it stands, someone's had to do it for them.
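For the avoidance of doubt about what I mean by "easily interrogable", here is a rough Python sketch. Everything in it is an assumption made for illustration: the column names, the country codes (`XAA`, `XBB`, `XCC` are made up, not real countries) and the scores are invented placeholders, not real index values, and this is not how Transparency International actually publishes anything.

```python
import csv
import io

# Placeholder data only: invented codes, invented column names, invented scores.
CPI_CSV = """iso_alpha3,score
XAA,71
XBB,38
XCC,55
"""


def load_scores(text: str) -> dict:
    """Parse a flat CSV into a {country_code: score} dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["iso_alpha3"]: int(row["score"]) for row in reader}


scores = load_scores(CPI_CSV)

# Once the data is flat and keyed on a standard code, a question like
# "which countries score above 50?" is a one-liner.
above_50 = sorted(code for code, score in scores.items() if score > 50)
```

No unzipping, no spreadsheet application, no hunting for the right worksheet. A flat file keyed on a standard country code gets you most of the way there before anyone has to build an API.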

On Maps

I like a good map as much as the next nerd. In fact, "what's the most interesting map you've seen recently" is currently on my list of conversation-starters. Maps are an informative visualisation tool. Maps are good at telling stories. Maps are pretty.

But there are some things that aren't maps, and I think the international dataset "scene" has forgotten this.

It seems like every time I try and find some international dataset, someone wants to show me an interactive map, and I can't help but wonder if they've maybe forgotten that data has other uses, like helping people to make decisions or conduct scientific inquiry.

This becomes more annoying when they'll gladly give you a whole smorgasbord of maps, but won't let you get within ten feet of the actual data. "Why would you possibly want all those numbers?" they say. "We've made the maps for you already! Maps! Yay maps!"