Tutorial: Extracting World Bank Data from CSV using Python and Pandas

One of the most important activities in data science is the extraction and collection of data, followed by its transformation into formats that can be analysed and interpreted.

In this post, I will use the Python programming language (version 3.6.2) and the Pandas library to extract and convert data that the World Bank website provides in CSV format. For simplicity, I use Jupyter, so if some of the commands do not work for you, try searching for differences between Jupyter and other IDEs in the search engine of your choice.

I assume that you are familiar with the basics of Python and Pandas. If you are not, I suggest that you have a look at one of the many introductory tutorials out there (both free and paid). One that I followed and that was helpful to me as a beginner is the online course by Jose Portilla on Udemy (hint: they run sales with reduced prices on a regular basis).

Loading the CSV File into Pandas

With all this out of the way, let’s get started. The first thing to do is to navigate to the webpage where the World Bank stores its data sets. I randomly picked Germany and downloaded the corresponding CSV file, which in our case is called API_DEU_DS2_en_csv_v2.csv.

Pandas provides a plethora of methods to deal with all kinds of adversity. One of its simpler tasks is reading a CSV file into a DataFrame, so let’s do that. First we import the necessary libraries, then we can read in the CSV file.

import numpy as np
import pandas as pd

df = pd.read_csv('API_DEU_DS2_en_csv_v2.csv')

And? Did you follow along and run straight into an error? Yep, that is exactly what happened to me.

ParserError                               Traceback (most recent call last)
<ipython-input-2-5772fbbc0698> in <module>()
----> 1 df = pd.read_csv('API_DEU_DS2_en_csv_v2.csv')


pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()

ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 62

I do not know how it was for you, but every tutorial I had read before assumed that the pd.read_csv() method would work out of the box.

So let’s have a look at what happened. For that, I checked the format of the CSV file first, which looks as follows:

"Data Source","World Development Indicators",

"Last Updated Date","2017-08-02",

"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016",
"Germany","DEU","Mobile account, income, richest 60% (% ages 15+)","WP15163_4.9","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",
"Germany","DEU","Mobile account, income, poorest 40% (% ages 15+)","WP15163_4.8","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",

If you look carefully, you can already spot the issue. The actual data only starts at line 5. Based on the first line (whose trailing comma yields a third, empty field), the parser assumes there are 3 fields per row, but is then in for a rude awakening when line 5 suddenly contains 62.

One quick look at the Pandas documentation and this is fairly easy to deal with. All we need to do is add the skiprows argument and set it to 4, i.e. tell Pandas to skip the first four rows before parsing the file, et voilà, we have loaded our data into a DataFrame.
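Put together, the fix is a one-line change. Since the original file is not bundled with this post, the sketch below recreates a miniature version of its layout in memory (the values are made up) and reads it with skiprows=4:

```python
import io

import pandas as pd

# A miniature stand-in for the World Bank file: two metadata rows and
# two blank lines precede the real header, so the data starts at line 5.
sample = '''"Data Source","World Development Indicators"

"Last Updated Date","2017-08-02"

"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961"
"Germany","DEU","Population, total","SP.POP.TOTL","72814900","73377632"
'''

# skiprows=4 makes pandas ignore everything before the header row.
df = pd.read_csv(io.StringIO(sample), skiprows=4)
print(df.columns.tolist())
print(df.shape)  # (1, 6)
```

With the real file, the call is simply pd.read_csv('API_DEU_DS2_en_csv_v2.csv', skiprows=4).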

Transforming the Data

Now that we have our data in a Pandas dataframe, let’s have a look at the format.


What you should see is something like the following: a handful of text columns, followed by dozens of year columns.
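To take that look yourself, head() and shape are the usual tools. A quick sketch on a stand-in frame (the values are made up; the real frame has over 1,500 rows):

```python
import pandas as pd

# Stand-in for the loaded World Bank frame.
df = pd.DataFrame({
    'Country Name': ['Germany', 'Germany'],
    'Country Code': ['DEU', 'DEU'],
    'Indicator Name': ['Population, total', 'Population growth (annual %)'],
    'Indicator Code': ['SP.POP.TOTL', 'SP.POP.GROW'],
    '1960': [72814900.0, None],
})

print(df.shape)   # (rows, columns) -> (2, 5)
print(df.head())  # the first five rows (here: all two of them)
```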

Looking at this information, we see four named columns: Country Name, Country Code, Indicator Name and Indicator Code. After those, we have one column per year from 1960 to the present.

As a side note, do not worry about the many NaN values. We are only seeing 5 rows out of over 1,500 and only a fraction of the columns. If you look carefully, there is plenty of valid data here. This will become interesting once we look into data visualization.

What comes now is a personal preference of mine when looking at data sets. To me, dates should always live in the rows, not in the columns. In addition, there are several columns that are not necessary for our purposes.

Removing unnecessary columns

We know that all these data were collected for Germany, so there is no need for the Country Name and Country Code columns. There would be, if we had several countries here… but we do not. The Indicator Code is not useful to me either, so I want to remove that as well. How do we do this? Pandas’ .drop() method to the rescue.

df.drop(df.columns[[0, 1, 3]], axis=1, inplace=True)

Note that we want to drop several columns here, which means we have to pass axis=1 to denote that we are operating on columns; otherwise Pandas will look for those labels in the row index and raise an error (a ValueError or KeyError, depending on your Pandas version). Furthermore, we set the inplace=True flag in order to make this change permanent in df.
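On a toy frame the drop looks like this. The sketch returns a new frame rather than using inplace=True, which works just as well; newer pandas versions also accept the more readable df.drop(columns=[...]) form:

```python
import pandas as pd

df = pd.DataFrame({
    'Country Name': ['Germany'],
    'Country Code': ['DEU'],
    'Indicator Name': ['Population, total'],
    'Indicator Code': ['SP.POP.TOTL'],
    '1960': [72814900],
})

# Drop columns 0, 1 and 3 by position; axis=1 means "these are columns".
trimmed = df.drop(df.columns[[0, 1, 3]], axis=1)
print(trimmed.columns.tolist())  # ['Indicator Name', '1960']
```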

Flipping the table

As I said before, I prefer to have the dates in the rows rather than as column names, and ideally each data series as a column rather than a row. That is just the way I was brought up, and it improves readability for me. In order to achieve this, we need to flip the table using

dft = df.transpose()

If you check the output, you will notice that the index is now the column names and vice versa. To make life easier for us later on, I will make the first row the new column names.

dft.columns = dft.iloc[0]
dft.drop(dft.index[0], inplace=True)

As you can see, I first named the columns after the first row and then removed said first row. Note that I did not pass the axis flag here, as it defaults to 0 (rows).

Now we only need to set the name of the index to Year, and the formatting is done to a satisfactory level.

dft.index.name = 'Year'
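The steps above, run end to end on a toy frame (made-up numbers), yield a year-indexed table:

```python
import pandas as pd

df = pd.DataFrame({
    'Indicator Name': ['Population, total', 'Population growth (annual %)'],
    '1960': [72814900, 0.7],
    '1961': [73377632, 0.8],
})

dft = df.transpose()                  # the years move into the index
dft.columns = dft.iloc[0]             # first row ('Indicator Name') becomes the header
dft.drop(dft.index[0], inplace=True)  # remove that now-redundant row
dft.index.name = 'Year'

print(dft.index.tolist())    # ['1960', '1961']
print(dft.columns.tolist())
```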

Slicing the Data Set in Preparation for Data Visualization

We should now have a data set with 1,519 columns. Let’s be honest with ourselves: it is unlikely that we will visualise all of these in one graph or analyse them all in a meaningful way. At this point, it is more important to pick and choose our data.

What interested me in this particular data set was the change in the composition of Germany’s population over the years. If you look carefully through our data set, you will find a lot of information on that.

Filtering by Population will give you a rather large number of possible columns.

[In]: dft.filter(regex='Population').columns.tolist()


['Population, female (% of total)',
 'Population, total',
 'Population growth (annual %)',
 'Population ages 80 and above, male (% of male population)',
 'Population ages 80 and above, female (% of female population)',
 'Population ages 75-79, male (% of male population)',
 'Population ages 75-79, female (% of female population)',
 'Population ages 30-34, male (% of male population)',
 'Population ages 30-34, female (% of female population)',
 'Population in urban agglomerations of more than 1 million (% of total population)',
 'Population in urban agglomerations of more than 1 million',
 'Population in the largest city (% of urban population)',
 'Population in largest city',
 'Population living in slums (% of urban population)',
 'Population living in areas where elevation is below 5 meters (% of total population)',
 'Population density (people per sq. km of land area)',
 'Population ages 25-29, male (% of male population)',
 'Population ages 25-29, female (% of female population)',
 'Population ages 0-4, male (% of male population)',
 'Population ages 0-4, female (% of female population)']

So to round this lengthy blog post off, let’s prepare a data set that will allow us to visualize the age composition of German females from 1960 to 2016.

In order to do that, I will need to look again at the filter method and choose a fitting regular expression. The one that suits my purposes looks like this:

popln = dft.filter(regex=r'^.*?\bPopulation\b.*?\bfemale\b.*$')

For those of you who are new to regular expressions, I strongly suggest that you follow a tutorial on them (such as the regular expressions chapter of Dive into Python 3), as they are a whole sub-discipline of their own (popularised by Perl, I believe) that will come in very handy whenever you need to search strings.
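To see what the expression does, here it is on a handful of toy columns: it keeps only those names in which the word Population is followed, somewhere later on, by the word female. Note that the match is case-sensitive, so lowercase "population" does not count:

```python
import pandas as pd

dft = pd.DataFrame(
    [[49.2, 81000000, 8.1, 24.6]],
    index=['2016'],
    columns=[
        'Population, female (% of total)',
        'Population, total',
        'Population ages 0-4, female (% of female population)',
        'Rural population (% of total population)',
    ],
)

# \b marks a word boundary; .*? matches as little as possible in between.
popln = dft.filter(regex=r'^.*?\bPopulation\b.*?\bfemale\b.*$')
print(popln.columns.tolist())
```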

To round this data set off, we shall change the columns’ name to something more fitting, and then I believe we are done.

popln.columns.name = 'Female Population in Germany across Age Groups'


I hope that sharing my own experiences with you was helpful and that you can put what you have learned here to good use.

I’d love to hear what you think on this topic, so please comment below and thanks a lot for reading.

Please note: As with all the tutorials that I write, this may not necessarily be the only way of doing it (or even the best way). All I want to do is share my experiences and the pitfalls I encountered, to give back to the community and to spark a discussion. If you know a better way, please comment below and share the wisdom.

  1. https://www.python.org/
  2. http://pandas.pydata.org/
  3. http://data.worldbank.org/
  4. http://jupyter.org/
  5. https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp/learn/v4/overview/
  6. http://data.worldbank.org/country/germany?view=chart
  7. http://www.diveintopython3.net/regular-expressions.html
  8. https://en.wikipedia.org/wiki/Regular_expression#History