
Estimating Marginal Means and Pairwise Tests By Hand in Python

Worked examples of estimating marginal means and conducting pairwise tests for mixed effects models (including random effects and unbalanced data) using matrix multiplication in Python.

This is the second in a series of blog posts working through how to generate standard errors for estimated marginal means, and pooled standard errors for pairwise tests of those estimates, from mixed effects models. The first post shows how this can be done by hand in R for models fitted with lme4, largely replicating the results provided by the excellent emmeans package. This post aims to show how the same can be done in Python, while the next will show how it can be done in Julia. The series was motivated primarily by exploring methods for fitting mixed effects models in Python and Julia and finding no suitable alternative to R’s emmeans for estimating marginal effects and conducting pairwise tests.

While in Python we can use rpy2 (or, more easily, Pymer4) to interface with R, fitting models with lme4 and running follow-up tests with emmeans, there are no packages that recreate the emmeans functionality for mixed effects models fitted directly in Python, for example after model fitting with the statsmodels package. This post will therefore work through fitting a mixed effects model from a factorial design in Python using the statsmodels package before calculating estimated marginal means and conducting pairwise tests by hand.

This post shows working in both Python and R so the results can be compared, using the reticulate R package to bridge the two languages.

Getting Started

Here’s a quick breakdown of the R packages used here, and why they are used:

  • lme4: used for fitting mixed effects models.
  • afex: used to obtain the data set with categorical predictors.
  • emmeans: used for generating estimated marginal means and pairwise tests between group estimates.
  • dplyr (as always!): used for having access to the pipe operator and for general data cleaning.
  • reticulate: used for interfacing between R and Python (i.e. passing data between the two and showing this Python code on a blogdown site!)
library(lme4)
library(afex)
library(emmeans)
library(dplyr)
library(reticulate)

Next, we need to tell R where our version of Python lives. For me, on a Mac, this is as follows. Yours might differ; running reticulate::py_config() will show which Python reticulate has found.

use_python("/usr/local/bin/python3")

From Python we’ll need a number of packages, with their breakdown as follows:

  • statsmodels.api: Used for fitting mixed effects models.
  • statsmodels.formula.api: Used for allowing an R-like formula syntax for model fitting.
  • numpy: Used for numerical computing (e.g. using arrays, matrices, and maths functions including matrix multiplication).
  • pandas: Used for creating data frames.
  • scipy.stats: Used for working with distributions.
  • dmatrix from patsy: Used for creating a design matrix to calculate estimated marginal means.
  • combinations from itertools: Used for generating unique combinations for pairwise tests.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import scipy.stats as st
from patsy import dmatrix
from itertools import combinations

We’ll first clean the data in R to show that the data set is consistent across both R and Python methods.

fhch <- get(data("fhch2010")) %>% # load data from afex
  filter(correct == TRUE) %>% # subset the data to correct answers
  select(id, task, stimulus, item, rt, log_rt) # keep only used cols

We’ll then make the data accessible in Python before inspecting it.

dat = r.fhch # grab data from r
print(dat)
##         id    task stimulus    item     rt    log_rt
## 0       N1  naming     word  potted  1.091  0.087095
## 1       N1  naming     word  engine  0.876 -0.132389
## 2       N1  naming     word   ideal  0.710 -0.342490
## 3       N1  naming  nonword   uares  1.210  0.190620
## 4       N1  naming  nonword    xazz  0.843 -0.170788
## ...    ...     ...      ...     ...    ...       ...
## 12955  L28  lexdec     word   idiom  1.688  0.523544
## 12956  L28  lexdec     word   lawns  1.563  0.446607
## 12957  L28  lexdec     word  spiced  1.641  0.495306
## 12958  L28  lexdec     word  weighs  1.797  0.586119
## 12959  L28  lexdec     word  darted  1.531  0.425921
## 
## [12960 rows x 6 columns]

Fitting the Models

Now we’ll fit the linear mixed effects model in statsmodels. For R users, there are a few differences here compared to lme4:

  • We can explicitly mark columns in our data set as identifying levels of factors, or categories, using the C() function in our formulae (essential for numeric columns, which would otherwise be treated as continuous). Thus task, stimulus, item, and id, all variables that indicate IDs for groups, get this C() wrapper. I quite like how explicit this is.

  • Crossed random effects are difficult to specify in statsmodels. To get results similar to lme4 we need to collapse our independent groups of participants and items: we’ll create a group column in our data set where everyone gets the same ID. Next, we need to specify our random effects variance components separately. Using the 0 + C(...) notation we specify random intercepts for each group. For independent random slopes we would specify the term which varies by group within the vcf component; this is commented out below as that model doesn’t converge in statsmodels. For comparison we will also fit a random-intercepts-only model in lme4. Needless to say, this is a bit of a pain and seems to enforce some strong assumptions on the random effects.

# specify random effects
dat["group"] = 1
vcf = {"item": "0 + C(item)", "id": "0 + C(id)"} 
# vcf = {"item" : "0 + C(task):C(item)", "id" : "0 + C(stimulus):C(id)"}

# define the model
mod = smf.mixedlm(
  'log_rt ~ C(task) * C(stimulus)', 
  data = dat,
  groups = "group",
  vc_formula = vcf,
  re_formula = "0"
)

# fit the model, then print it
mod_fit = mod.fit() 
## /usr/local/lib/python3.9/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
##   warnings.warn(msg, ConvergenceWarning)
print(mod_fit.summary())
##                        Mixed Linear Model Regression Results
## ====================================================================================
## Model:                      MixedLM          Dependent Variable:          log_rt    
## No. Observations:           12960            Method:                      REML      
## No. Groups:                 1                Scale:                       0.0991    
## Min. group size:            12960            Log-Likelihood:              -3693.8467
## Max. group size:            12960            Converged:                   Yes       
## Mean group size:            12960.0                                                 
## ------------------------------------------------------------------------------------
##                                          Coef.  Std.Err.    z    P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------------
## Intercept                                -0.353    0.044  -8.057 0.000 -0.439 -0.267
## C(task)[T.lexdec]                         0.344    0.059   5.863 0.000  0.229  0.458
## C(stimulus)[T.nonword]                    0.312    0.010  32.602 0.000  0.293  0.331
## C(task)[T.lexdec]:C(stimulus)[T.nonword] -0.259    0.011 -23.133 0.000 -0.280 -0.237
## id Var                                    0.037    0.026                            
## item Var                                  0.003    0.002                            
## ====================================================================================

How does the model look in R?

m <- lme4::lmer(
  log_rt ~ 1 + task * stimulus + 
    (1 | id) +
    (1 | item), 
  data = fhch
)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: log_rt ~ 1 + task * stimulus + (1 | id) + (1 | item)
##    Data: fhch
## 
## REML criterion at convergence: 7388
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -6.026 -0.608 -0.106  0.488  5.292 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  item     (Intercept) 0.0035   0.0591  
##  id       (Intercept) 0.0374   0.1935  
##  Residual             0.0991   0.3148  
## Number of obs: 12960, groups:  item, 600; id, 45
## 
## Fixed effects:
##                            Estimate Std. Error t value
## (Intercept)                -0.35293    0.04378   -8.06
## tasklexdec                  0.34360    0.05857    5.87
## stimulusnonword             0.31202    0.00957   32.60
## tasklexdec:stimulusnonword -0.25856    0.01118  -23.14
## 
## Correlation of Fixed Effects:
##             (Intr) tsklxd stmlsn
## tasklexdec  -0.743              
## stimlsnnwrd -0.109  0.060       
## tsklxdc:stm  0.069 -0.095 -0.638

Those results look pretty similar. The main differences are that statsmodels doesn’t separate out the random effects from the fixed effects in its summary, and that we don’t get correlations between the terms in the model.

Estimating Marginal Means

First, we need to set up a grid containing the unique combinations of the levels of each factor. Nicely, the numpy function meshgrid() allows us to do this if we pass it the unique values for each factor. We then reorganise this combination of values into a dataframe for later use. (I’m sure there’s a better way to do this; one alternative is sketched after the output below.)

grid = np.array(np.meshgrid(
  dat["task"].unique(), 
  dat["stimulus"].unique()
)).reshape(2, 4).T

grid = pd.DataFrame(grid, columns = ['task','stimulus'])

print(grid)
##      task stimulus
## 0  naming     word
## 1  lexdec     word
## 2  naming  nonword
## 3  lexdec  nonword
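As an aside, one tidier way to build the same grid (a sketch, not what the analysis above uses) is itertools.product, which returns the cartesian product of the factor levels directly. The rows may come out in a different order to the meshgrid() version, but the set of combinations is identical.

# alternative sketch: build the grid with itertools.product
from itertools import product

grid_alt = pd.DataFrame(
  list(product(dat["task"].unique(), dat["stimulus"].unique())),
  columns = ["task", "stimulus"]
)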

Once we have the unique combinations of our levels we can create the design matrix using the dmatrix() function. To this, we need to pass our formula used to fit the model, including the C() notation to indicate the variables are categorical. In this case, we specify main effects and an interaction between task and stimulus.

Crucially, we have to include the coding scheme at this point. Here, we used the default treatment coding scheme in both statsmodels and lme4. With treatment coding, the reference level is usually defined alphabetically for each factor. With task (lexdec and naming) and stimulus (nonword and word), these factors would be coded as lexdec = 0, naming = 1, and nonword = 0, word = 1, respectively. The intercept in the model should therefore represent scores where all variables are equal to 0. This doesn’t seem to be the case: check the model results above and you’ll see that the slope terms correspond to lexdec and nonwords, so these can’t be the reference levels! Instead, the reference level for each factor is the one at index 1, such that the intercept represents naming for words. When we make our design matrix we need to be careful to pass this information on using Treatment(1).
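As an illustrative check (not part of the workflow itself), we can build a design matrix with patsy’s default treatment coding and inspect the column names; the indicator columns should come out as naming and word rather than lexdec and nonword, confirming that the default reference levels would not match the fitted model.

# illustrative only: patsy's default treatment coding takes the
# alphabetically-first level of each factor as the reference
check = dmatrix("C(task)*C(stimulus)", grid, return_type = "dataframe")
print(check.columns.tolist())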

So, we will define a design matrix, ensuring that our variables are coded as categories and with the correct coding scheme and reference level.

mat = dmatrix(
  "C(task, Treatment(1))*C(stimulus, Treatment(1))", 
  grid, 
  return_type = "matrix"
)
print(mat)
## [[1. 0. 0. 0.]
##  [1. 1. 0. 0.]
##  [1. 0. 1. 0.]
##  [1. 1. 1. 1.]]

Now that we have a design matrix, we can get the betas from our fixed effects like so.

# grab betas
betas = mod_fit.fe_params
print(betas)
## Intercept                                  -0.352931
## C(task)[T.lexdec]                           0.343598
## C(stimulus)[T.nonword]                      0.312021
## C(task)[T.lexdec]:C(stimulus)[T.nonword]   -0.258556
## dtype: float64

Then, using matrix multiplication of our design matrix and fixed effect parameter estimates, we can calculate our estimated marginal means. Note here that we use the numpy @ operator for matrix multiplication. This is a little more concise than using np.matmul(), especially when multiplying several terms.
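In matrix form, the estimated marginal means are simply

\[ \hat{m} = X_{\text{grid}}\,\hat{\beta} \]

where \(X_{\text{grid}}\) is the design matrix above and \(\hat{\beta}\) is the vector of fixed effect estimates.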

emmeans = grid.copy() # copy so the original grid is left intact
emmeans['means'] = mat @ betas
print(emmeans)
##      task stimulus     means
## 0  naming     word -0.352931
## 1  lexdec     word -0.009333
## 2  naming  nonword -0.040910
## 3  lexdec  nonword  0.044132

Following this we’ll get the standard errors for our estimated marginal means, starting with the variance-covariance matrix from the cov_params() function. We need to reduce this matrix to exclude our random effects. We do that here by filtering out any terms containing the strings “Var” or “Cor”, which statsmodels uses by default to label random effects terms.

vcov = mod_fit.cov_params()

vcov = vcov[~vcov.index.str.contains('Var|Cor')]
vcov = vcov.loc[:,~vcov.columns.str.contains('Var|Cor')]
print(vcov)
##                                           Intercept  ...  C(task)[T.lexdec]:C(stimulus)[T.nonword]
## Intercept                                  0.001919  ...                                  0.000034
## C(task)[T.lexdec]                         -0.001907  ...                                 -0.000062
## C(stimulus)[T.nonword]                    -0.000046  ...                                 -0.000068
## C(task)[T.lexdec]:C(stimulus)[T.nonword]   0.000034  ...                                  0.000125
## 
## [4 rows x 4 columns]

Next we’ll add standard errors to the means by taking the square root of the diagonal of the design matrix multiplied by the variance covariance matrix multiplied by the transpose of the design matrix. (Phew…)
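In matrix notation, that sentence is just

\[ \mathrm{SE}(\hat{m}) = \sqrt{\mathrm{diag}\left(X_{\text{grid}}\,\hat{V}\,X_{\text{grid}}^\top\right)} \]

where \(\hat{V}\) is the variance-covariance matrix of the fixed effects.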

emmeans['SE'] = np.sqrt(np.diagonal(mat @ vcov @ mat.T))
print(emmeans)
##      task stimulus     means        SE
## 0  naming     word -0.352931  0.043805
## 1  lexdec     word -0.009333  0.039184
## 2  naming  nonword -0.040910  0.043800
## 3  lexdec  nonword  0.044132  0.039164

Let’s see how these stack up against emmeans.

emm <- emmeans(m, ~task*stimulus, lmer.df = "a")
emm
##  task   stimulus emmean    SE  df asymp.LCL asymp.UCL
##  naming word      -0.35 0.044 Inf     -0.44    -0.267
##  lexdec word      -0.01 0.039 Inf     -0.09     0.068
##  naming nonword   -0.04 0.044 Inf     -0.13     0.045
##  lexdec nonword    0.04 0.039 Inf     -0.03     0.121
## 
## Degrees-of-freedom method: asymptotic 
## Confidence level used: 0.95

Perfect, if a little long-winded.
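Incidentally, emmeans also prints asymptotic confidence limits. If we wanted those for our hand-computed means, a quick sketch using the same normal approximation we use for the pairwise tests below would be:

# sketch: asymptotic 95% confidence limits for the marginal means,
# mirroring the asymp.LCL/asymp.UCL columns from emmeans
emm_lci, emm_uci = st.norm.interval(
  0.95,
  loc = emmeans['means'],
  scale = emmeans['SE']
)
emmeans['lci'] = emm_lci
emmeans['uci'] = emm_uci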

Pairwise tests between marginal means

Making pairwise mean differences

Here we’ll skip straight to the best method we found for conducting pairwise tests in the previous post.

We’ll first get the names for the unique combinations in each contrast. We paste together the values in each column of our estimated marginal means, list every pairwise combination of these labels, and then join each pair to create a vector of contrast names.

combo_names = emmeans.task + "_" + emmeans.stimulus
contrast_pairs = list(combinations(combo_names, 2))

contrast_names = []
for names in contrast_pairs:
    contrast_names.append('-'.join(names))

print(contrast_names)
## ['naming_word-lexdec_word', 'naming_word-naming_nonword', 'naming_word-lexdec_nonword', 'lexdec_word-naming_nonword', 'lexdec_word-lexdec_nonword', 'naming_nonword-lexdec_nonword']

We’ll then get indices for each pairwise comparison from our estimated marginal means. This just requires every combination of the numbers 0 through 3 (the maximum row index in our estimated marginal means, given Python is 0-indexed), arranged in a data frame whose columns indicate which rows in our contrast matrix we’ll subtract from one another. We’ll give these the generic column names A and B.

limits = list(combo_names.index)
combo_cols = list(combinations(limits, 2))
combo_cols = pd.DataFrame(combo_cols, columns = ['A', 'B'])
print(combo_cols)
##    A  B
## 0  0  1
## 1  0  2
## 2  0  3
## 3  1  2
## 4  1  3
## 5  2  3

To figure out which terms need to be switched on and off for each comparison, we can use these indices to perform vectorised subtraction on the rows of the design matrix. We then transpose the result so that each column is a contrast we can use in matrix multiplication with our parameter estimates.
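Formally, the contrast for comparing rows \(A\) and \(B\) of the grid is just the difference between the corresponding rows of the design matrix, and the estimated mean difference is its product with the fixed effects:

\[ c_{AB} = x_A - x_B, \qquad \hat{\Delta}_{AB} = c_{AB}\,\hat{\beta} \]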

model_pairwise_contrasts = mat[combo_cols.A, :] - mat[combo_cols.B, :]
model_pairwise_contrasts = model_pairwise_contrasts.T # one column per contrast
print(model_pairwise_contrasts.T) # transposed back here only for display
## [[ 0. -1.  0.  0.]
##  [ 0.  0. -1.  0.]
##  [ 0. -1. -1. -1.]
##  [ 0.  1. -1.  0.]
##  [ 0.  0. -1. -1.]
##  [ 0. -1.  0. -1.]]

Here we’ll take the contrast names and present them in a table, along with the estimates for the mean differences based on the matrix multiplication outlined above. Since betas is one-dimensional, the product is already a flat vector, so no reshaping or transposing is needed.

pairwise = pd.DataFrame(contrast_names, columns = ['contrast'])
pairwise['mean'] = betas @ model_pairwise_contrasts
print(pairwise)
##                         contrast      mean
## 0        naming_word-lexdec_word -0.343598
## 1     naming_word-naming_nonword -0.312021
## 2     naming_word-lexdec_nonword -0.397063
## 3     lexdec_word-naming_nonword  0.031576
## 4     lexdec_word-lexdec_nonword -0.053465
## 5  naming_nonword-lexdec_nonword -0.085041

For getting the standard errors we’ll again use the best method from the previous post. Below we’ll append the standard errors to our table under the heading “SE”. We do this by stepping through each contrast and calculating its standard error as the contrast vector multiplied by the variance-covariance matrix multiplied by the transpose of the contrast vector.
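That is, for each contrast vector \(c\) the pooled standard error is

\[ \mathrm{SE}(\hat{\Delta}) = \sqrt{c\,\hat{V}\,c^\top} \]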

for i in pairwise.index:
    grad_g = model_pairwise_contrasts[:, i]
    pairwise.at[i, 'SE'] = np.sqrt(
      grad_g @ vcov @ grad_g.T
    )
    
print(pairwise)
##                         contrast      mean        SE
## 0        naming_word-lexdec_word -0.343598  0.058600
## 1     naming_word-naming_nonword -0.312021  0.009570
## 2     naming_word-lexdec_nonword -0.397063  0.058803
## 3     lexdec_word-naming_nonword  0.031576  0.058803
## 4     lexdec_word-lexdec_nonword -0.053465  0.008943
## 5  naming_nonword-lexdec_nonword -0.085041  0.058609

Now the calculation is a little easier. We make the z-scores by dividing each mean difference by its standard error. Using these scores, we can calculate the p-value using the normal approximation, which for a two-tailed test is 2 \(\times\) the upper-tail probability of the absolute z-score.
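That is,

\[ z = \frac{\hat{\Delta}}{\mathrm{SE}(\hat{\Delta})}, \qquad p = 2\left(1 - \Phi(|z|)\right) \]

where \(\Phi\) is the standard normal cumulative distribution function; st.norm.sf() gives the upper-tail probability \(1 - \Phi\) directly.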

pairwise['z'] = pairwise['mean']/pairwise['SE']
pairwise['p'] = 2*st.norm.sf(abs(pairwise['z']))

pairwise.round(3)
##                         contrast   mean     SE       z      p
## 0        naming_word-lexdec_word -0.344  0.059  -5.863  0.000
## 1     naming_word-naming_nonword -0.312  0.010 -32.602  0.000
## 2     naming_word-lexdec_nonword -0.397  0.059  -6.752  0.000
## 3     lexdec_word-naming_nonword  0.032  0.059   0.537  0.591
## 4     lexdec_word-lexdec_nonword -0.053  0.009  -5.978  0.000
## 5  naming_nonword-lexdec_nonword -0.085  0.059  -1.451  0.147

Next we can add confidence intervals around the mean differences and tidy up the table.
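These are the usual normal-approximation intervals,

\[ \hat{\Delta} \pm z_{0.975} \times \mathrm{SE}(\hat{\Delta}), \qquad z_{0.975} \approx 1.96 \]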

confint = st.norm.interval(
  0.95, 
  loc=pairwise['mean'], 
  scale=pairwise['SE']
)
pairwise['lci'] = confint[0]
pairwise['uci'] = confint[1]

pairwise = pairwise[[
  'contrast', 
  'mean', 
  'lci', 
  'uci', 
  'SE', 
  'z', 
  'p'
]].round(3)

print(pairwise)
##                         contrast   mean    lci    uci     SE       z      p
## 0        naming_word-lexdec_word -0.344 -0.458 -0.229  0.059  -5.863  0.000
## 1     naming_word-naming_nonword -0.312 -0.331 -0.293  0.010 -32.602  0.000
## 2     naming_word-lexdec_nonword -0.397 -0.512 -0.282  0.059  -6.752  0.000
## 3     lexdec_word-naming_nonword  0.032 -0.084  0.147  0.059   0.537  0.591
## 4     lexdec_word-lexdec_nonword -0.053 -0.071 -0.036  0.009  -5.978  0.000
## 5  naming_nonword-lexdec_nonword -0.085 -0.200  0.030  0.059  -1.451  0.147

Does this match the output from emmeans?

pairs(emm, adjust = "none")
##  contrast                        estimate    SE  df z.ratio p.value
##  naming word - lexdec word          -0.34 0.059 Inf  -6.000 <.0001 
##  naming word - naming nonword       -0.31 0.010 Inf -33.000 <.0001 
##  naming word - lexdec nonword       -0.40 0.059 Inf  -7.000 <.0001 
##  lexdec word - naming nonword        0.03 0.059 Inf   1.000 0.5900 
##  lexdec word - lexdec nonword       -0.05 0.009 Inf  -6.000 <.0001 
##  naming nonword - lexdec nonword    -0.09 0.059 Inf  -1.000 0.1500 
## 
## Degrees-of-freedom method: asymptotic

That looks pretty good!

That said, doing this took a lot of effort, and the code is fairly fragile: it relies on us carefully specifying our contrasts, including the coding scheme, and large parts of it require us to know how many variables we have in order to build the contrasts properly. I’m sure this could be neatly packaged into a few functions with more flexibility, but that probably requires more time and patience with using Python for this than is worth it; a rough starting point for the means is sketched below.
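For what it’s worth, here’s a minimal sketch of what one such helper might look like for the marginal means, assuming a statsmodels MixedLM fit and the same treatment-coded formula as above. The function name and arguments are hypothetical, and it makes no attempt at the flexibility emmeans offers.

def emmeans_by_hand(fit, grid, rhs_formula):
    # sketch: estimated marginal means with asymptotic standard errors
    # fit: a fitted statsmodels MixedLM results object
    # grid: DataFrame of unique factor combinations
    # rhs_formula: patsy formula matching the model's fixed effects
    X = dmatrix(rhs_formula, grid, return_type = "matrix")
    betas = fit.fe_params
    # drop the random effects terms from the variance-covariance matrix
    V = fit.cov_params()
    V = V[~V.index.str.contains('Var|Cor')]
    V = V.loc[:, ~V.columns.str.contains('Var|Cor')]
    out = grid.copy()
    out['means'] = X @ betas
    out['SE'] = np.sqrt(np.diagonal(X @ V @ np.transpose(X)))
    return out

# e.g.
# emmeans_by_hand(
#   mod_fit, grid, "C(task, Treatment(1))*C(stimulus, Treatment(1))"
# )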

While statsmodels is great, at the time of writing it’s missing the functionality for generalised linear mixed effects models with various common link functions under a frequentist framework (e.g. logit models). Using crossed random effects also seems particularly difficult compared to lme4. Of course, as an R user I bring a heavy bias towards R, but right now I’d recommend either sticking with the options in R for fitting mixed effects models or delving into Bayesian methods using packages like PyMC3 or Bambi, the latter of which will feel familiar to those used to brms in R.

In the next post we’ll look at calculating estimated marginal means and pairwise tests from mixed effects models fitted using the MixedModels.jl package in Julia. Given my love for the speed and flexibility of this package, we’ll perhaps put this post together with the aim of building a working package!

Thanks for reading! Comments and criticisms are welcome.

Updates and Clarifications

Last updated 2021-10-26.

2021-10-26: Replaced every instance of np.matmul() with the numpy @ operator. This is a little more concise and is especially easier to read when multiplying several arrays (e.g. np.matmul(np.matmul(A, B), C) becomes simply A @ B @ C).