This is the **second in a series of blog posts** working through how to generate standard errors for estimated marginal means and pooled standard errors for pairwise tests of these estimates from mixed effects models. The first shows how this can be done in R by hand for models fitted with `lme4`

, largely replicating results provided by the excellent `emmeans`

package. This post aims to show how this can be done in Python, while the next post will show how this can be done in Julia. This was motivated primarily by exploring methods for fitting mixed effects models in Python and Julia and finding no suitable alternative to `emmeans`

in R for estimating marginal effects and conducting pairwise tests.

While in Python we can use rpy2 or more easily Pymer4 to interface with R to perform model fitting with `lme4`

and follow-up tests using `emmeans`

there are no packages that allow us to recreate the `emmeans`

functionality for mixed effects models fitted directly in Python, for example following model fitting using the `statsmodels`

package. This post will therefore work through fitting a mixed effects model from a factorial design in Python using the `statsmodels`

package before calculating estimated marginal means and conducting pairwise tests by hand.

This post will show working in Python and R for comparing results using the reticulate R-package.

# Getting Started

Here’s a quick breakdown of the R packages used here, and why they are used:

`lme4`

: used for fitting mixed effects models.`afex`

: used to obtain the data set with categorical predictors.`emmeans`

: used for generating estimated marginal means and pairwise tests between group estimates.`dplyr`

(as always!): used for having access to the pipe operator and for general data cleaning.`reticulate`

: used for interfacing between R and Python (i.e. passing data between the two and showing this Python code on a blogdown site!)

```
library(lme4)
library(afex)
library(emmeans)
library(dplyr)
library(reticulate)
```

Next, we need to tell R where our version of Python lives. For me, on a Mac, this is as follows. Yours might differ.

`use_python("usr/local/bin/python3")`

From Python we’ll need a number of packages, with their breakdown as follows:

`statsmodels.api`

: Used for fitting mixed effects models.`statsmodels.formula.api`

: Used for allowing an R-like formula syntax for model fitting.`numpy`

: Used for numerical computing (e.g. using arrays, matrices, and maths functions including matrix multiplication).`pandas`

: Used for creating data frames.`scipy.stats`

: Used for working with distributions.`dmatrix`

from`patsy`

: Used for creating a design matrix to calculate estimated marginal means.`combinations`

from`itertools`

: Used for generating unique combinations for pairwise tests.

```
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
import scipy.stats as st
from patsy import dmatrix
from itertools import combinations
```

We’ll first clean the data in R to show that the data set is consistent across both R and Python methods.

```
fhch <- get(data("fhch2010")) %>% # load data from afex %>%
filter(correct == TRUE) %>% # subset the data to correct answers
select(id, task, stimulus, item, rt, log_rt) # keep only used cols
```

We’ll then make the data accessible in Python before inspecting it.

```
dat = r.fhch # grab data from r
print(dat)
## id task stimulus item rt log_rt
## 0 N1 naming word potted 1.091 0.087095
## 1 N1 naming word engine 0.876 -0.132389
## 2 N1 naming word ideal 0.710 -0.342490
## 3 N1 naming nonword uares 1.210 0.190620
## 4 N1 naming nonword xazz 0.843 -0.170788
## ... ... ... ... ... ... ...
## 12955 L28 lexdec word idiom 1.688 0.523544
## 12956 L28 lexdec word lawns 1.563 0.446607
## 12957 L28 lexdec word spiced 1.641 0.495306
## 12958 L28 lexdec word weighs 1.797 0.586119
## 12959 L28 lexdec word darted 1.531 0.425921
##
## [12960 rows x 6 columns]
```

# Fitting the Models

Now we’ll fit the linear mixed effects model in `statsmodels`

. For R users, there’s a few differences here when compared to `lme4`

:

We can specify numeric columns in our data set as identifying levels of factors, or categories, using the

`C()`

function in our formulae. Thus,`task`

,`stimulus`

,`item`

, and`id`

– all variables that indicate IDs for groups – get this`C()`

wrapper. I quite like how explicit this is.Crossed random effects are difficult to specify in

`statsmodels`

. To get similar results to`lme4`

we need to collapse our independent groups of participants and items. We’ll create a group column in our data set where everyone gets the same ID. Next, we need to separately specify our random effects variance components. Using the`0 + C(...)`

notation we specify random intercepts for each group. For independent random slopes we would specify the term which varies by group within the`vcf`

component. This is commented out below as this model doesn’t converge in`statsmodels`

. For comparison we will also fit a random intercepts only model in`lme4`

. Needless to say this is a bit of a pain and seems to enforce some strong assumptions on the random effects.

```
# specify random effects
dat["group"] = 1
vcf = {"item": "0 + C(item)", "id": "0 + C(id)"}
# vcf = {"item" : "0 + C(task):C(item)", "id" : "0 + C(stimulus):C(id)"}
# define the model
mod = smf.mixedlm(
'log_rt ~ C(task) * C(stimulus)',
data = dat,
groups = "group",
vc_formula = vcf,
re_formula = "0"
)
# fit the model, then print it
mod_fit = mod.fit()
## /usr/local/lib/python3.9/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
## warnings.warn(msg, ConvergenceWarning)
print(mod_fit.summary())
## Mixed Linear Model Regression Results
## ====================================================================================
## Model: MixedLM Dependent Variable: log_rt
## No. Observations: 12960 Method: REML
## No. Groups: 1 Scale: 0.0991
## Min. group size: 12960 Log-Likelihood: -3693.8467
## Max. group size: 12960 Converged: Yes
## Mean group size: 12960.0
## ------------------------------------------------------------------------------------
## Coef. Std.Err. z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------------
## Intercept -0.353 0.044 -8.057 0.000 -0.439 -0.267
## C(task)[T.lexdec] 0.344 0.059 5.863 0.000 0.229 0.458
## C(stimulus)[T.nonword] 0.312 0.010 32.602 0.000 0.293 0.331
## C(task)[T.lexdec]:C(stimulus)[T.nonword] -0.259 0.011 -23.133 0.000 -0.280 -0.237
## id Var 0.037 0.026
## item Var 0.003 0.002
## ====================================================================================
```

How does the model look in R?

```
m <- lme4::lmer(
log_rt ~ 1 + task * stimulus +
(1 | id) +
(1 | item),
data = fhch
)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: log_rt ~ 1 + task * stimulus + (1 | id) + (1 | item)
## Data: fhch
##
## REML criterion at convergence: 7388
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -6.026 -0.608 -0.106 0.488 5.292
##
## Random effects:
## Groups Name Variance Std.Dev.
## item (Intercept) 0.0035 0.0591
## id (Intercept) 0.0374 0.1935
## Residual 0.0991 0.3148
## Number of obs: 12960, groups: item, 600; id, 45
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.35293 0.04378 -8.06
## tasklexdec 0.34360 0.05857 5.87
## stimulusnonword 0.31202 0.00957 32.60
## tasklexdec:stimulusnonword -0.25856 0.01118 -23.14
##
## Correlation of Fixed Effects:
## (Intr) tsklxd stmlsn
## tasklexdec -0.743
## stimlsnnwrd -0.109 0.060
## tsklxdc:stm 0.069 -0.095 -0.638
```

Those results look pretty similar. The main difference is `statsmodels`

doesn’t separate out the random effects from the fixed effects and we don’t get correlations between our terms in the model.

# Estimating Marginal Means

First, we need to set up a grid allowing us to see the unique combinations for the levels of each factor. Nicely, the `numpy`

function `meshgrid()`

allows us to do this if we pass it our unique values for each factor. We then reorganise and make this combination of values into a dataframe for later use. (I’m sure there’s a better way to do this…)

```
grid = np.array(np.meshgrid(
dat["task"].unique(),
dat["stimulus"].unique()
)).reshape(2, 4).T
grid = pd.DataFrame(grid, columns = ['task','stimulus'])
print(grid)
## task stimulus
## 0 naming word
## 1 lexdec word
## 2 naming nonword
## 3 lexdec nonword
```

Once we have the unique combinations of our levels we can create the design matrix using the `dmatrix()`

function. To this, we need to pass our formula used to fit the model, including the `C()`

notation to indicate the variables are categorical. In this case, we specify main effects and an interaction between task and stimulus.

Crucially, we have to include the coding scheme at this point. Here, we used the default treatment coding scheme in both `statsmodels`

and `lme4`

. With treatment coding the intercept is usually defined alphabetically for each factor. With `task`

(lexdec and naming) and `stimulus`

(non-word and word) these factors should be coded as lexdec = 0, naming = 1, and non-word = 0 and word = 1 respectively. The intercept in the model should therefore represent scores where all variables are equal to 0 if treatment coding is used. This doesn’t seem to be the case. Check the model results above and you’ll see that the slope terms correspond to lexdec and non-words, so they can’t be the intercept! Instead, the reference level for the intercept is 1, such that it represents naming for words. When we make our design matrix we need to be careful to pass this information on using `Treatment(1)`

.

So, we will define a design matrix, ensuring that our variables are coded as categories and with the correct coding scheme and reference level.

```
mat = dmatrix(
"C(task, Treatment(1))*C(stimulus, Treatment(1))",
grid,
return_type = "matrix"
)
print(mat)
## [[1. 0. 0. 0.]
## [1. 1. 0. 0.]
## [1. 0. 1. 0.]
## [1. 1. 1. 1.]]
```

Now we have a design matrix we can get the betas from our fixed effects like so.

```
# grab betas
betas = mod_fit.fe_params
print(betas)
## Intercept -0.352931
## C(task)[T.lexdec] 0.343598
## C(stimulus)[T.nonword] 0.312021
## C(task)[T.lexdec]:C(stimulus)[T.nonword] -0.258556
## dtype: float64
```

Then using some matrix multiplication from our grid and fixed effect parameter estimates we can calculate our estimated marginal means. Note here that we use the numpy `@`

operator for matrix multiplication. This is a little more concise that using `np.matmul()`

, especially when multiplying several terms.

```
emmeans = grid
emmeans['means'] = mat @ betas
print(emmeans)
## task stimulus means
## 0 naming word -0.352931
## 1 lexdec word -0.009333
## 2 naming nonword -0.040910
## 3 lexdec nonword 0.044132
```

Following this we’ll get the standard errors for our estimated marginal means by first getting the variance covariance matrix using the `cov_params()`

function. We need to reduce the variance covariance matrix to exclude our random effects. We do that here by filtering out any terms including the strings “Var” or “Cor” which by default `statsmodels`

uses to indicate random effects terms.

```
vcov = mod_fit.cov_params()
vcov = vcov[~vcov.index.str.contains('Var|Cor')]
vcov = vcov.loc[:,~vcov.columns.str.contains('Var|Cor')]
print(vcov)
## Intercept ... C(task)[T.lexdec]:C(stimulus)[T.nonword]
## Intercept 0.001919 ... 0.000034
## C(task)[T.lexdec] -0.001907 ... -0.000062
## C(stimulus)[T.nonword] -0.000046 ... -0.000068
## C(task)[T.lexdec]:C(stimulus)[T.nonword] 0.000034 ... 0.000125
##
## [4 rows x 4 columns]
```

Next we’ll add standard errors to the means by getting the square root of the diagonal for the design matrix multiplied by the variance covariance matrix multiplied by the the transpose of the design matrix. (Phew…)

```
emmeans['SE'] = np.sqrt(np.diagonal(mat @ vcov) @ mat.T)
print(emmeans)
## task stimulus means SE
## 0 naming word -0.352931 0.043805
## 1 lexdec word -0.009333 0.058699
## 2 naming nonword -0.040910 0.044328
## 3 lexdec nonword 0.044132 0.059332
```

Let’s see how these stack up against emmeans.

```
emm <- emmeans(m, ~task*stimulus, lmer.df = "a")
emm
## task stimulus emmean SE df asymp.LCL asymp.UCL
## naming word -0.35 0.044 Inf -0.44 -0.267
## lexdec word -0.01 0.039 Inf -0.09 0.068
## naming nonword -0.04 0.044 Inf -0.13 0.045
## lexdec nonword 0.04 0.039 Inf -0.03 0.121
##
## Degrees-of-freedom method: asymptotic
## Confidence level used: 0.95
```

Perfect, if a little long-winded.

# Pairwise tests between marginal means

## Making pairwise mean differences

Here we’ll skip straight to the best method we found for conducting pairwise tests in the previous post.

We’ll first get the names for the unique combinations for each contrast. We will paste the values in each column together and then list the combinations. We will then join each one together to create a vector of contrasts.

```
combo_names = emmeans.task + "_" + emmeans.stimulus
contrast_pairs = list(combinations(combo_names, 2))
contrast_names = []
for names in contrast_pairs:
contrast_names.append('-'.join(names))
print(contrast_names)
## ['naming_word-lexdec_word', 'naming_word-naming_nonword', 'naming_word-lexdec_nonword', 'lexdec_word-naming_nonword', 'lexdec_word-lexdec_nonword', 'naming_nonword-lexdec_nonword']
```

We’ll then get indices for each pairwise comparison from our estimated marginal means. This just requires us to get every combination of the numbers 0 through to 3 (the maximum row length in our estimated marginal means given Python is 0-indexed) and again transpose this so we have columns indicating which rows in our contrast matrix we’ll subtract from one another. We’ll give these the generic column names A and B.

```
limits = list(combo_names.index)
combo_cols = list(combinations(limits, 2))
combo_cols = pd.DataFrame(combo_cols, columns = ['A', 'B'])
print(combo_cols)
## A B
## 0 0 1
## 1 0 2
## 2 0 3
## 3 1 2
## 4 1 3
## 5 2 3
```

To figure out which terms need to be switched on and off in our parameter estimates we can perform vectorised subtraction on the indices. We transpose this to turn the values back into a matrix which we can use for matrix multiplication with our parameter estimates.

```
model_pairwise_contrasts = mat[combo_cols.A, :] - mat[combo_cols.B, :]
model_pairwise_contrasts = model_pairwise_contrasts.T
print(model_pairwise_contrasts.T)
## [[ 0. -1. 0. 0.]
## [ 0. 0. -1. 0.]
## [ 0. -1. -1. -1.]
## [ 0. 1. -1. 0.]
## [ 0. 0. -1. -1.]
## [ 0. -1. 0. -1.]]
```

Here we’ll get the contrast names and present them in a table. Along with these names we’ll put the estimates for the mean differences based on the matrix multiplication outlined above. Here, we use `np.transpose()`

in place of the `.T`

shorthand to ensure the result of the multiplication is transposed.

```
pairwise = pd.DataFrame(contrast_names, columns = ['contrast'])
pairwise['mean'] = np.transpose(betas @ model_pairwise_contrasts)
print(pairwise)
## contrast mean
## 0 naming_word-lexdec_word -0.343598
## 1 naming_word-naming_nonword -0.312021
## 2 naming_word-lexdec_nonword -0.397063
## 3 lexdec_word-naming_nonword 0.031576
## 4 lexdec_word-lexdec_nonword -0.053465
## 5 naming_nonword-lexdec_nonword -0.085041
```

For getting the standard errors we’ll again use our best method for doing so in the previous post. Below we’ll append the standard errors to our table under the heading “SE”. We do this by stepping through each contrast and calculating the standard errors for that contrast using the matrix multiplication method outlined above.

```
for i in pairwise.index:
grad_g = model_pairwise_contrasts[:, i]
pairwise.at[i, 'SE'] = np.sqrt(
grad_g @ vcov @ grad_g.T
)
print(pairwise)
## contrast mean SE
## 0 naming_word-lexdec_word -0.343598 0.058600
## 1 naming_word-naming_nonword -0.312021 0.009570
## 2 naming_word-lexdec_nonword -0.397063 0.058803
## 3 lexdec_word-naming_nonword 0.031576 0.058803
## 4 lexdec_word-lexdec_nonword -0.053465 0.008943
## 5 naming_nonword-lexdec_nonword -0.085041 0.058609
```

Now the calculation is a little easier. We make the *z*-scores by dividing each mean score by the standard error. Using these scores, we can calculate the *p*-value using the normal approximation, which for a two-tailed test is 2 \(\times\) the probability of our *z*-scores on the negative scale.

```
pairwise['z'] = pairwise['mean']/pairwise['SE']
pairwise['p'] = 2*st.norm.sf(abs(pairwise['z']))
pairwise.round(3)
## contrast mean SE z p
## 0 naming_word-lexdec_word -0.344 0.059 -5.863 0.000
## 1 naming_word-naming_nonword -0.312 0.010 -32.602 0.000
## 2 naming_word-lexdec_nonword -0.397 0.059 -6.752 0.000
## 3 lexdec_word-naming_nonword 0.032 0.059 0.537 0.591
## 4 lexdec_word-lexdec_nonword -0.053 0.009 -5.978 0.000
## 5 naming_nonword-lexdec_nonword -0.085 0.059 -1.451 0.147
```

Next we can add confidence intervals around the means and tidy up the table.

```
confint = st.norm.interval(
0.95,
loc=pairwise['mean'],
scale=pairwise['SE']
)
pairwise['lci'] = confint[0]
pairwise['uci'] = confint[1]
pairwise = pairwise[[
'contrast',
'mean',
'lci',
'uci',
'SE',
'z',
'p'
]].round(3)
print(pairwise)
## contrast mean lci uci SE z p
## 0 naming_word-lexdec_word -0.344 -0.458 -0.229 0.059 -5.863 0.000
## 1 naming_word-naming_nonword -0.312 -0.331 -0.293 0.010 -32.602 0.000
## 2 naming_word-lexdec_nonword -0.397 -0.512 -0.282 0.059 -6.752 0.000
## 3 lexdec_word-naming_nonword 0.032 -0.084 0.147 0.059 0.537 0.591
## 4 lexdec_word-lexdec_nonword -0.053 -0.071 -0.036 0.009 -5.978 0.000
## 5 naming_nonword-lexdec_nonword -0.085 -0.200 0.030 0.059 -1.451 0.147
```

Does this match the output from `emmeans`

?

```
pairs(emm, adjust = "none")
## contrast estimate SE df z.ratio p.value
## naming word - lexdec word -0.34 0.059 Inf -6.000 <.0001
## naming word - naming nonword -0.31 0.010 Inf -33.000 <.0001
## naming word - lexdec nonword -0.40 0.059 Inf -7.000 <.0001
## lexdec word - naming nonword 0.03 0.059 Inf 1.000 0.5900
## lexdec word - lexdec nonword -0.05 0.009 Inf -6.000 <.0001
## naming nonword - lexdec nonword -0.09 0.059 Inf -1.000 0.1500
##
## Degrees-of-freedom method: asymptotic
```

That looks pretty good!

That said, to do this took a lot of effort and the code is fairly fragile in that it relies on us carefully specifying our contrasts, including the coding scheme, and with large parts of the code requiring us to know how many variables we have to properly build the contrasts. I’m sure this could be neatly packaged into a few functions with more flexibility, but that probably requires more time and patience with using Python for this than is worth it.

While `statsmodels`

is great, at the time of writing it’s missing the functionality for generalised linear mixed effects models with various common link functions under a frequentist framework (e.g. logit models). Using crossed random effects also seems particularly difficult compared to `lme4`

. Of course, as an R user this comes with a lot of heavy bias for R, but right now I’d recommend either sticking with the options in R for fitting mixed effects models or delving more into Bayesian methods using packages like PyMC3 or Bambi for those used to brms in R.

In the next post we’ll look at calculating estimated marginal means and pairwise tests from mixed effects models fitted using the `MixedModels.jl`

package in Julia. Given my love for the speed and flexibility of this package, we’ll perhaps put this post together with the aim of building a working package!

Thanks for reading! Comments and criticisms are welcome.

**Updates and Clarifications**

*Last updated 2021-10-26.*

**2021-10-26**: Replaced every instance of `np.matmul()`

with the numpy `@`

operator. This is a little more concise and is especially easier to read when multiplying several arrays (e.g. `np.matmul(np.matmul(A, B), C)`

becomes simply `A @ B @ C`

).