Guest research: Census results, beware the raw data

Our Research posts are about the latest research and this week we have a guest post:


by David Wilson (IHS)

For most South Africans, census releases probably seem like one-off events that are only relevant for the short time during which newspaper articles are dedicated to the major findings. For many socio-economic researchers, however, much of our daily working life is defined by the gap between census releases – or more importantly, how to fill that gap with meaningful data. So pull up a chair and let me take you through some of that gap-filling process, and point out why census data is unique – and where it sometimes fails.

A mere twelve months after the mammoth count, the Census 2011 results were released. If you missed it, the official release date was October 30, which is a very quick turnaround time compared to previous censuses. If you didn’t miss it, you would have seen reports that immediately focussed on the data. Most of the time, those reports looked at income and unemployment – two topics that occupy the forefront of South Africa’s consciousness. Unfortunately, only certain variables from a census are useful in a direct way, and neither income nor unemployment is one of them.

For starters, censuses do not measure unemployment very well. That’s because measuring one’s employment status turns out to be rather complex. Students of economics will remember from their studies that there is a difference between strict and expanded unemployment, and a difference between unemployment and economic inactivity. A person without a job is not automatically unemployed. They could be a student, or a housewife – or perhaps they have given up looking for work. Sometimes, the line between those options isn’t very clear. Therefore, a good employment survey needs to tease out all of these additional layers of complexity by asking the appropriate questions.
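To make the strict versus expanded distinction concrete, here is a minimal sketch in Python. The head counts are made up for illustration, and the function name is my own. It assumes the usual textbook definitions: the strict rate counts only people actively looking for work, while the expanded rate also counts discouraged work-seekers, in both the numerator and the labour force. Students and homemakers fall outside the labour force under either definition.

```python
def unemployment_rates(employed, searching, discouraged):
    """Return (strict, expanded) unemployment rates from head counts.

    employed    -- people with a job
    searching   -- jobless people actively looking for work
    discouraged -- jobless people who want work but have stopped looking
    """
    strict_labour_force = employed + searching
    expanded_labour_force = strict_labour_force + discouraged
    strict = searching / strict_labour_force
    expanded = (searching + discouraged) / expanded_labour_force
    return strict, expanded


# Hypothetical counts, for illustration only.
strict_rate, expanded_rate = unemployment_rates(
    employed=600, searching=200, discouraged=100
)
print(f"strict: {strict_rate:.1%}, expanded: {expanded_rate:.1%}")
```

The same group of people yields two quite different rates, which is why a survey needs enough questions to sort each respondent into the right category.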

The census questions on employment are not designed to tease out this complexity. Consider that the census questionnaire uses only three questions to determine one’s employment status. This compares poorly with the Quarterly Labour Force Survey, the survey from which South Africa’s official unemployment rates are derived, and in which half of the survey – a full 25 questions – is dedicated to determining whether someone is employed or not.

Unemployment is only one example. Another is household income, which should also not be used directly from a census. If employment status is hard to measure in general, then household income is next to impossible. Even if the survey respondents are comfortable telling enumerators their incomes, they may not know exactly how much they earn. Disparate sources of income, garnishee orders (which still need to be counted as income), imputed rent (my personal favourite), annual bonuses, income in kind (this is when your neighbour gives you a cup of sugar), incentive income – and a host of others – need to be dealt with appropriately when designing an income survey.

In fact, the only correct methodology to use when measuring income is to measure all of the expenditure of the household. This is best done by having the household keep a diary of everything they buy over a given period. This is the method that StatsSA uses in their Income and Expenditure Survey. The census, however, has only one question on household income.

This all means that many of the variables lifted from the Census 2011 release and used ‘voetstoots’ in media reports need to be taken with a small grain of salt. Of course this doesn’t mean that census data is not useful. On the contrary, it is one of the most useful data sources that South Africans will have at their disposal for many years to come – but it needs to be interpreted and used in the appropriate manner.

It is useful to know that the power of census data is in the size of its sample. Although the aim of a census is to reach every single person in the country, this has never happened in South Africa. Nonetheless, the sample size is still significant. Census 2011 reached over 13 million households. By comparison, the labour force and income surveys mentioned above both have sample sizes of around 30 000 households. With such small sample sizes, researchers are unable to ‘pivot’ their data into very narrow slices.

For example, if I wanted to use the Quarterly Labour Force Survey to calculate the unemployment rate of a given race in a given province, I would probably end up with a final sample size of fewer than 800 households – and my result would therefore have a very high standard error. After all, 800 households is not a lot of households to survey if one wanted to know the unemployment rate of an entire province. The size of the census sample, however, allows me to extract meaningful data about a single suburb, and I may have a total sample size of 800 households for that one suburb, yielding a significantly more stable result.
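One way to see the difference is through the finite population correction. The sketch below is a simplification: it assumes simple random sampling (the real surveys use more complex designs), and the population sizes and the 25% rate are hypothetical numbers chosen only to illustrate the point. The same 800 households give a much tighter estimate when they make up nearly the whole suburb than when they stand in for an entire province.

```python
import math


def proportion_standard_error(p, n, population):
    """Standard error of a sample proportion under simple random sampling,
    including the finite population correction."""
    if not 0 < n <= population:
        raise ValueError("sample size must be between 1 and the population size")
    if population == 1:
        return 0.0  # the whole population was observed
    uncorrected = math.sqrt(p * (1 - p) / n)
    correction = math.sqrt((population - n) / (population - 1))
    return uncorrected * correction


# Hypothetical figures: 800 households drawn from a province of one million
# households, versus 800 households covered in a suburb of 850 households.
survey_se = proportion_standard_error(p=0.25, n=800, population=1_000_000)
census_se = proportion_standard_error(p=0.25, n=800, population=850)

print(f"province-wide survey: {survey_se:.4f}")
print(f"single-suburb census: {census_se:.4f}")
```

Under these assumptions the suburb-level standard error comes out at roughly a quarter of the province-level one, even though the number of households is identical.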

Consumers of census data must therefore keep in mind the design of census data before taking a decision based on the published results. In the case of income and unemployment data from a census, the appropriate methodology would be to use the ratios from the official labour and income surveys weighted against a demographic model that has taken all of the recent censuses into account. This technique not only ensures that we balance the depth of the recent census against the correct measure from the national survey – it also ensures that we have accounted for other problems that might have arisen in the current census. This is because we are able to include similar data from other censuses. There is, of course, one additional benefit from following the appropriate modelling methodology: it allows researchers to draw conclusions about certain variables in years for which no census was undertaken, but where only the national surveys are available.
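As a rough illustration of the idea, and not the actual model described above, the simplest form of rebalancing scales the local census counts so that their total agrees with the rate from the national survey. The function name, area names and numbers below are all hypothetical, and the sketch leaves out the demographic model and the earlier censuses entirely.

```python
def rebalance_to_survey(census_unemployed, census_labour_force, survey_rate):
    """Scale local census unemployment counts so that the overall rate
    matches the survey rate, keeping the census pattern between areas."""
    total_labour_force = sum(census_labour_force.values())
    total_unemployed = sum(census_unemployed.values())
    if total_unemployed == 0:
        raise ValueError("census reports no unemployed people to scale")
    target_unemployed = survey_rate * total_labour_force
    factor = target_unemployed / total_unemployed
    # Note: a real model would also cap each area at its labour force.
    return {area: count * factor for area, count in census_unemployed.items()}


# Hypothetical census counts for two areas in one province.
census_unemployed = {"Area A": 300, "Area B": 100}
census_labour_force = {"Area A": 1000, "Area B": 1000}

# Hypothetical provincial unemployment rate from the labour force survey.
rebalanced = rebalance_to_survey(
    census_unemployed, census_labour_force, survey_rate=0.25
)
print(rebalanced)
```

The census supplies the local detail (Area A has three times the unemployment of Area B), while the survey supplies the level (25% for the province as a whole).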

So although it may be interesting and fun to look at the raw numbers as they are published, serious data users would be well advised to follow an appropriate rebalancing methodology before making a serious claim based on that raw data.

David is a Senior Analyst at IHS, focussing on regional economic analysis and data modelling. He maintains the Regional eXplorer suite of models, which provide detailed socio-economic data on a local municipal level. IHS is a leading global provider of critical technical information, related decision-support tools and strategic and operational services. The company combines highly reliable technical content with deep domain expertise in its focused industries.


  1. Lovely article. My greatest frustration is Stats SA’s “refusal” to make the database publicly available, unlike in some of the more developed countries where data democratization has taken root. It would be so much more useful if one could do one’s own analysis and pick up on relationships with other dimensions that the standard results do not begin to look at.

    • Vusi, thanks for the reply. I cannot agree more that everyone should have access to more data. It enables analysis and adds to the discourse. The World Bank has recently opened up lots of their data and I think it sets an example for national stats agencies. With the 2001 Census it was possible to get your hands on the 10% sample after a while. I hope that it is sooner rather than later for the 2011 data.

      • Apologies for two comments… I had thought the first one did not post, hence the second one. Stats SA wrote back to me to let me know that they will hopefully be in a position to release the 10% samples in March next year. Wish there was a way to lobby for more data than that.

  2. Great, enjoyable blog. My greatest frustration is that, unlike in the more developed countries, data democratization has not firmly taken root, in that Stats SA “refuses” to release the complete raw data set in order to allow us to analyse it and find correlations that they do not consider for their statistics. One has to wonder why they keep such a tight leash on the data.
