Contingency tables
Statsmodels supports a variety of approaches for analyzing contingency tables, including methods for assessing independence, symmetry, and homogeneity, as well as methods for working with collections of tables from a stratified population.
The methods described here are mainly for two-way tables. Multi-way tables can be analyzed using log-linear models. Statsmodels does not currently have a dedicated API for log-linear modeling, but Poisson regression in statsmodels.genmod.GLM can be used for this purpose.
A contingency table is a multi-way table that describes a data set in which each observation belongs to one category for each of several variables. For example, if there are two variables, one with \(r\) levels and one with \(c\) levels, then we have an \(r \times c\) contingency table. The table can be described in terms of the number of observations that fall into a given cell of the table, e.g. \(T_{ij}\) is the number of observations that have level \(i\) for the first variable and level \(j\) for the second variable. Note that each variable must have a finite number of levels (or categories), which can be either ordered or unordered. In different contexts, the variables defining the axes of a contingency table may be called categorical variables or factor variables. They may be either nominal (if their levels are unordered) or ordinal (if their levels are ordered).
The underlying population for a contingency table is described by a distribution table \(P_{ij}\), where \(P_{ij}\) is the probability that an observation has level \(i\) for the first variable and level \(j\) for the second. The elements of \(P\) are probabilities, and the sum of all elements in \(P\) is 1. Methods for analyzing contingency tables use the data in \(T\) to learn about properties of \(P\).
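To make the relationship between raw observations, the count table \(T\), and the distribution \(P\) concrete, here is a minimal sketch using pandas only. The eight observations and the variable names `smoker` and `exercise` are hypothetical and serve only to illustrate the bookkeeping.

```python
import pandas as pd

# Hypothetical raw data: one row per observation, two categorical variables.
obs = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "exercise": ["low", "high", "low", "low", "high", "high", "high", "low"],
})

# T[i, j] counts the observations with level i of the first variable
# and level j of the second variable.
T = pd.crosstab(obs["smoker"], obs["exercise"])

# Dividing by the total count gives the empirical proportions,
# a simple plug-in estimate of the population distribution P.
P_hat = T / T.to_numpy().sum()

print(T)
print(P_hat)
```

The empirical proportions sum to 1 by construction, just as the elements of \(P\) do.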
statsmodels.stats.Table is the most basic class for working with contingency tables. We can create a Table object directly from any rectangular array-like object containing the contingency table cell counts:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import statsmodels.api as sm
In [4]: df = sm.datasets.get_rdataset("Arthritis", "vcd").data
In [5]: tab = pd.crosstab(df['Treatment'], df['Improved'])
In [6]: tab = tab.loc[:, ["None", "Some", "Marked"]]
In [7]: table = sm.stats.Table(tab)
Alternatively, we can pass the raw data and let the Table class construct the array of cell counts for us:
In [8]: table = sm.stats.Table.from_data(df[["Treatment", "Improved"]])
Independence
Independence is the property that the row and column factors occur independently. Association is the lack of independence. If the row and column factors are independent, the joint distribution can be written as the outer product of the row and column marginal distributions:
\[P_{ij} = \left(\sum_k P_{ik}\right) \cdot \left(\sum_k P_{kj}\right) \quad \text{for all} \quad i, j\]
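The factorization can be checked numerically with plain NumPy. This sketch takes the arthritis counts shown on this page, estimates the row and column marginals from them, and forms their outer product. It is a hand computation for illustration, not a description of how statsmodels computes its results internally; the Pearson residual used at the end is the usual (observed - expected) / sqrt(expected).

```python
import numpy as np

# Observed counts. Rows: Placebo, Treated. Columns: Marked, None, Some.
T = np.array([[7, 29, 7],
              [21, 13, 7]])
n = T.sum()

# Estimated marginal distributions of the row and column factors.
row = T.sum(axis=1) / n
col = T.sum(axis=0) / n

# Under independence, the joint distribution is the outer product
# of the two marginal distributions.
P_indep = np.outer(row, col)

# Expected counts under independence, and Pearson residuals that
# show which cells depart most from the independence fit.
fitted = n * P_indep
pearson = (T - fitted) / np.sqrt(fitted)

print(np.round(fitted, 3))
print(np.round(pearson, 3))
```

The fitted counts reproduce the observed row and column totals exactly; only the allocation of counts within the table differs from the data.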
We can obtain the best-fitting independent distribution for our observed data, and then view residuals which identify particular cells that most strongly violate independence:
In [9]: print(table.table_orig)
Improved   Marked  None  Some
Treatment
Placebo         7    29     7
Treated        21    13     7
In [10]: print(table.fittedvalues)