{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# statsmodels Principal Component Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Key ideas:* Principal component analysis, World Bank data, fertility\n", "\n", "In this notebook, we use principal component analysis (PCA) to analyze the time series of fertility rates in 192 countries, using data obtained from the World Bank. The main goal is to understand how the trends in fertility over time differ from country to country. This is a slightly atypical illustration of PCA because the data are time series. Methods such as functional PCA have been developed for this setting, but since the fertility data are very smooth, there is no real disadvantage to using standard PCA in this case." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:53:26.476753Z", "iopub.status.busy": "2021-02-02T06:53:26.465989Z", "iopub.status.idle": "2021-02-02T06:53:28.077960Z", "shell.execute_reply": "2021-02-02T06:53:28.078978Z" } }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import matplotlib.pyplot as plt\n", "import statsmodels.api as sm\n", "from statsmodels.multivariate.pca import PCA\n", "\n", "plt.rc(\"figure\", figsize=(16,8))\n", "plt.rc(\"font\", size=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data can be obtained from the [World Bank web site](http://data.worldbank.org/indicator/SP.DYN.TFRT.IN), but here we work with a slightly cleaned-up version of the data:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2021-02-02T06:53:28.083551Z", "iopub.status.busy": "2021-02-02T06:53:28.082349Z", "iopub.status.idle": "2021-02-02T06:53:28.133672Z", "shell.execute_reply": "2021-02-02T06:53:28.134541Z" } }, "outputs": [ { "data": { "text/html": [ "
|  | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Fertility rate, total (births per woman) | SP.DYN.TFRT.IN | 4.820 | 4.655 | 4.471 | 4.271 | 4.059 | 3.842 | ... | 1.786 | 1.769 | 1.754 | 1.739 | 1.726 | 1.713 | 1.701 | 1.690 | NaN | NaN |
| 1 | Andorra | AND | Fertility rate, total (births per woman) | SP.DYN.TFRT.IN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 1.240 | 1.180 | 1.250 | 1.190 | 1.220 | NaN | NaN | NaN |
| 2 | Afghanistan | AFG | Fertility rate, total (births per woman) | SP.DYN.TFRT.IN | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | ... | 7.136 | 6.930 | 6.702 | 6.456 | 6.196 | 5.928 | 5.659 | 5.395 | NaN | NaN |
| 3 | Angola | AGO | Fertility rate, total (births per woman) | SP.DYN.TFRT.IN | 7.316 | 7.354 | 7.385 | 7.410 | 7.425 | 7.430 | ... | 6.704 | 6.657 | 6.598 | 6.523 | 6.434 | 6.331 | 6.218 | 6.099 | NaN | NaN |
| 4 | Albania | ALB | Fertility rate, total (births per woman) | SP.DYN.TFRT.IN | 6.186 | 6.076 | 5.956 | 5.833 | 5.711 | 5.594 | ... | 2.004 | 1.919 | 1.849 | 1.796 | 1.761 | 1.744 | 1.741 | 1.748 | NaN | NaN |
5 rows × 58 columns
| Country Name | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aruba | 4.820 | 4.655 | 4.471 | 4.271 | 4.059 | 3.842 | 3.625 | 3.417 | 3.226 | 3.054 | ... | 1.825 | 1.805 | 1.786 | 1.769 | 1.754 | 1.739 | 1.726 | 1.713 | 1.701 | 1.690 |
| Afghanistan | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | ... | 7.484 | 7.321 | 7.136 | 6.930 | 6.702 | 6.456 | 6.196 | 5.928 | 5.659 | 5.395 |
| Angola | 7.316 | 7.354 | 7.385 | 7.410 | 7.425 | 7.430 | 7.422 | 7.403 | 7.375 | 7.339 | ... | 6.778 | 6.743 | 6.704 | 6.657 | 6.598 | 6.523 | 6.434 | 6.331 | 6.218 | 6.099 |
| Albania | 6.186 | 6.076 | 5.956 | 5.833 | 5.711 | 5.594 | 5.483 | 5.376 | 5.268 | 5.160 | ... | 2.195 | 2.097 | 2.004 | 1.919 | 1.849 | 1.796 | 1.761 | 1.744 | 1.741 | 1.748 |
| United Arab Emirates | 6.928 | 6.910 | 6.893 | 6.877 | 6.861 | 6.841 | 6.816 | 6.783 | 6.738 | 6.679 | ... | 2.428 | 2.329 | 2.236 | 2.149 | 2.071 | 2.004 | 1.948 | 1.903 | 1.868 | 1.841 |
5 rows × 52 columns
\n", "