The Datasets Package¶
statsmodels
provides data sets (i.e. data and meta-data) for use in
examples, tutorials, model testing, etc.
Using Datasets from Stata¶
|
Download and return an example dataset from Stata. |
Using Datasets from R¶
The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset
function. The actual data is accessible by the data
attribute. For example:
In [1]: import statsmodels.api as sm
In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")
In [3]: print(duncan_prestige.__doc__)
.. container::
.. container::
====== ===============
Duncan R Documentation
====== ===============
.. rubric:: Duncan's Occupational Prestige Data
:name: duncans-occupational-prestige-data
.. rubric:: Description
:name: description
The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
prestige and other characteristics of 45 U. S. occupations in
1950.
.. rubric:: Usage
:name: usage
.. code:: R
Duncan
.. rubric:: Format
:name: format
This data frame contains the following columns:
type
Type of occupation. A factor with the following levels:
``prof``, professional and managerial; ``wc``, white-collar;
``bc``, blue-collar.
income
Percentage of occupational incumbents in the 1950 US Census who
earned $3,500 or more per year (about $36,000 in 2017 US
dollars).
education
Percentage of occupational incumbents in 1950 who were high
school graduates (which, were we cynical, we would say is
roughly equivalent to a PhD in 2017)
prestige
Percentage of respondents in a social survey who rated the
occupation as “good” or better in prestige
.. rubric:: Source
:name: source
Duncan, O. D. (1961) A socioeconomic index for all occupations. In
Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free
Press [Table VI-1].
.. rubric:: References
:name: references
Fox, J. (2016) *Applied Regression Analysis and Generalized Linear
Models*, Third Edition. Sage.
Fox, J. and Weisberg, S. (2019) *An R Companion to Applied
Regression*, Third Edition, Sage.
In [4]: duncan_prestige.data.head(5)
Out[4]:
type income education prestige
rownames
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90
R Datasets Function Reference¶
|
download and return R dataset |
|
Return the path of the statsmodels data dir. |
|
Delete all the content of the data home cache. |
Available Datasets¶
- American National Election Survey 1996
- Breast Cancer Data
- Bill Greene’s credit scoring data.
- Smoking and lung cancer in eight cities in China.
- Mauna Loa Weekly Atmospheric CO2 Data
- First 100 days of the US House of Representatives 1995
- World Copper Market 1951-1975 Dataset
- US Capital Punishment dataset.
- Danish Money Demand Data
- El Nino - Sea Surface Temperatures
- Engel (1857) food expenditure data
- Affairs dataset
- World Bank Fertility Data
- Grunfeld (1950) Investment Data
- Transplant Survival Data
- (West) German interest and inflation rate 1972-1998
- Longley dataset
- United States Macroeconomic data
- Travel Mode Choice
- Nile River flows at Ashwan 1871-1970
- RAND Health Insurance Experiment Data
- Taxation Powers Vote for the Scottish Parliament 1997
- Spector and Mazzeo (1980) - Program Effectiveness Data
- Stack loss data
- Star98 Educational Dataset
- Statewide Crime Data 2009
- U.S. Strike Duration Data
- Yearly sunspots data 1700-2008
Usage¶
Load a dataset:
In [5]: import statsmodels.api as sm
In [6]: data = sm.datasets.longley.load_pandas()
The Dataset object follows the bunch pattern. The full dataset is available
in the data
attribute.
In [7]: data.data
Out[7]:
TOTEMP GNPDEFL GNP UNEMP ARMED POP YEAR
0 60323.0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 61122.0 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 60171.0 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 61187.0 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 63221.0 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
5 63639.0 98.1 346999.0 1932.0 3594.0 113270.0 1952.0
6 64989.0 99.0 365385.0 1870.0 3547.0 115094.0 1953.0
7 63761.0 100.0 363112.0 3578.0 3350.0 116219.0 1954.0
8 66019.0 101.2 397469.0 2904.0 3048.0 117388.0 1955.0
9 67857.0 104.6 419180.0 2822.0 2857.0 118734.0 1956.0
10 68169.0 108.4 442769.0 2936.0 2798.0 120445.0 1957.0
11 66513.0 110.8 444546.0 4681.0 2637.0 121950.0 1958.0
12 68655.0 112.6 482704.0 3813.0 2552.0 123366.0 1959.0
13 69564.0 114.2 502601.0 3931.0 2514.0 125368.0 1960.0
14 69331.0 115.7 518173.0 4806.0 2572.0 127852.0 1961.0
15 70551.0 116.9 554894.0 4007.0 2827.0 130081.0 1962.0
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [8]: data.endog.iloc[:5]
Out[8]:
0 60323.0
1 61122.0
2 60171.0
3 61187.0
4 63221.0
Name: TOTEMP, dtype: float64
In [9]: data.exog.iloc[:5,:]
Out[9]:
GNPDEFL GNP UNEMP ARMED POP YEAR
0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
Univariate datasets, however, do not have an exog attribute.
Variable names can be obtained by typing:
In [10]: data.endog_name
Out[10]: 'TOTEMP'
In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.
In [12]: type(data.data)
Out[12]: pandas.core.frame.DataFrame
In [13]: type(data.raw_data)
Out[13]: pandas.core.frame.DataFrame
In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
Loading data as pandas objects¶
For many users it may be preferable to get the datasets as a pandas DataFrame or
Series object. Each of the dataset modules is equipped with a load_pandas
method which returns a Dataset
instance with the data readily available as pandas objects:
In [15]: data = sm.datasets.longley.load_pandas()
In [16]: data.exog
Out[16]:
GNPDEFL GNP UNEMP ARMED POP YEAR
0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
5 98.1 346999.0 1932.0 3594.0 113270.0 1952.0
6 99.0 365385.0 1870.0 3547.0 115094.0 1953.0
7 100.0 363112.0 3578.0 3350.0 116219.0 1954.0
8 101.2 397469.0 2904.0 3048.0 117388.0 1955.0
9 104.6 419180.0 2822.0 2857.0 118734.0 1956.0
10 108.4 442769.0 2936.0 2798.0 120445.0 1957.0
11 110.8 444546.0 4681.0 2637.0 121950.0 1958.0
12 112.6 482704.0 3813.0 2552.0 123366.0 1959.0
13 114.2 502601.0 3931.0 2514.0 125368.0 1960.0
14 115.7 518173.0 4806.0 2572.0 127852.0 1961.0
15 116.9 554894.0 4007.0 2827.0 130081.0 1962.0
In [17]: data.endog
Out[17]:
0 60323.0
1 61122.0
2 60171.0
3 61187.0
4 63221.0
5 63639.0
6 64989.0
7 63761.0
8 66019.0
9 67857.0
10 68169.0
11 66513.0
12 68655.0
13 69564.0
14 69331.0
15 70551.0
Name: TOTEMP, dtype: float64
The full DataFrame is available in the data
attribute of the Dataset object
In [18]: data.data
Out[18]:
TOTEMP GNPDEFL GNP UNEMP ARMED POP YEAR
0 60323.0 83.0 234289.0 2356.0 1590.0 107608.0 1947.0
1 61122.0 88.5 259426.0 2325.0 1456.0 108632.0 1948.0
2 60171.0 88.2 258054.0 3682.0 1616.0 109773.0 1949.0
3 61187.0 89.5 284599.0 3351.0 1650.0 110929.0 1950.0
4 63221.0 96.2 328975.0 2099.0 3099.0 112075.0 1951.0
5 63639.0 98.1 346999.0 1932.0 3594.0 113270.0 1952.0
6 64989.0 99.0 365385.0 1870.0 3547.0 115094.0 1953.0
7 63761.0 100.0 363112.0 3578.0 3350.0 116219.0 1954.0
8 66019.0 101.2 397469.0 2904.0 3048.0 117388.0 1955.0
9 67857.0 104.6 419180.0 2822.0 2857.0 118734.0 1956.0
10 68169.0 108.4 442769.0 2936.0 2798.0 120445.0 1957.0
11 66513.0 110.8 444546.0 4681.0 2637.0 121950.0 1958.0
12 68655.0 112.6 482704.0 3813.0 2552.0 123366.0 1959.0
13 69564.0 114.2 502601.0 3931.0 2514.0 125368.0 1960.0
14 69331.0 115.7 518173.0 4806.0 2572.0 127852.0 1961.0
15 70551.0 116.9 554894.0 4007.0 2827.0 130081.0 1962.0
With pandas integration in the estimation classes, the metadata will be attached to model results:
In [19]: y, x = data.endog, data.exog
In [20]: res = sm.OLS(y, x).fit()
In [21]: res.params
Out[21]:
GNPDEFL -52.993570
GNP 0.071073
UNEMP -0.423466
ARMED -0.572569
POP -0.414204
YEAR 48.417866
dtype: float64
In [22]: res.summary()
Out[22]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
=======================================================================================
Dep. Variable: TOTEMP R-squared (uncentered): 1.000
Model: OLS Adj. R-squared (uncentered): 1.000
Method: Least Squares F-statistic: 5.052e+04
Date: Mon, 20 Jan 2025 Prob (F-statistic): 8.20e-22
Time: 16:29:28 Log-Likelihood: -117.56
No. Observations: 16 AIC: 247.1
Df Residuals: 10 BIC: 251.8
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
GNPDEFL -52.9936 129.545 -0.409 0.691 -341.638 235.650
GNP 0.0711 0.030 2.356 0.040 0.004 0.138
UNEMP -0.4235 0.418 -1.014 0.335 -1.354 0.507
ARMED -0.5726 0.279 -2.052 0.067 -1.194 0.049
POP -0.4142 0.321 -1.289 0.226 -1.130 0.302
YEAR 48.4179 17.689 2.737 0.021 9.003 87.832
==============================================================================
Omnibus: 1.443 Durbin-Watson: 1.277
Prob(Omnibus): 0.486 Jarque-Bera (JB): 0.605
Skew: 0.476 Prob(JB): 0.739
Kurtosis: 3.031 Cond. No. 4.56e+05
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 4.56e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
Extra Information¶
If you want to know more about the dataset itself, you can access the following, again using the Longley dataset as an example
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
Additional information¶
The idea for a datasets package was originally proposed by David Cournapeau.
To add datasets, see the notes on adding a dataset.