Estimating Marginal Means and Pairwise Tests Manually in R

This is the second in a series of blog posts working through how to generate standard errors for estimated marginal means and pooled standard errors for pairwise tests of these estimates from mixed effects models. In this one, we will recreate the emmeans functionality in R. In the second, we will do so in Python.

The aim of this post is to explore how to generate standard errors for estimated marginal means and pooled standard errors for pairwise tests of these estimates. This was motivated primarily by exploring methods for fitting mixed effects models in Python and Julia and finding no suitable alternative to emmeans in R for estimating marginal effects and conducting pairwise tests. Given the heavy focus on factorial designs in my own research, I wanted to see how to recreate some of the output from emmeans in R before potentially applying this to Python and finally to Julia for the great speedup gains in model fitting in the MixedModels.jl package. While the JellyMe4.jl package allows one to pass models between Julia and R, completely sidestepping this need, I was still motivated by a need to understand how emmeans might work and how to do this all in a way that can easily be ported from one language to another.

In the process of trying to figure out how to do this, a number of posts were essential in producing the code below and in learning a little bit about the procedures involved. These are linked in text throughout.

The main conclusion at the end of this post should be that the R-package emmeans is magic.

First, we’ll load two packages that are essential for the content of the post:

lme4: used for fitting mixed effects models.
afex: used to obtain the data set with categorical predictors.
emmeans: used for generating estimated marginal means and pairwise tests between group estimates.
dplyr (as always!): used for having access to the pipe operator and for general data cleaning.

library(lme4)
library(afex)
library(emmeans)
library(dplyr)

We’ll load some data and do some processing to get it ready for analysis.

data("fhch2010") # load data from afex

# subset the data to correct answers and keep 
# only the columns we need
fhch <- fhch2010 %>% 
  filter(correct == TRUE) %>% 
  select(id, task, stimulus, item, rt, log_rt)

head(fhch) # inspect it!
##   id   task stimulus   item   rt log_rt
## 1 N1 naming     word potted 1.09  0.087
## 2 N1 naming     word engine 0.88 -0.132
## 3 N1 naming     word  ideal 0.71 -0.342
## 4 N1 naming  nonword  uares 1.21  0.191
## 5 N1 naming  nonword   xazz 0.84 -0.171
## 6 N1 naming     word   fill 0.79 -0.242

This data set is from an experiment where participants either name words out loud or make a lexical decision (clicking one button to indicate a word is a real word, e.g. fill, or another to indicate it is a non-word, e.g. xazz). The words themselves are defined in groups as being words or non-words. Along with this we have id, the participant’s ID number; item, the written form of the target stimulus; rt, or reaction time in seconds; and log_rt, the log transformed form of the rt column. (Note log transformation is often used for reaction times to make them more ‘normal’, though other methods for handling the skew in this sort of data can be used.)

When we have categorical predictors and want to do some formal analysis, each level of the factor is given a numeric code. The specific coding scheme we use has important implications for the interpretation of the model coefficients. In R, the default coding scheme is to use dummy (or treatment) coding. For factors with two levels, the codes used are 0 and 1. We’ll first fit out model using this coding scheme.

We can check these codes by using the contrasts() function:

contrasts(fhch$task)
##        lexdec
## naming      0
## lexdec      1
contrasts(fhch$stimulus)
##         nonword
## word          0
## nonword       1

For task, naming = 0 and lexical decision = 1, while for stimulus word = 0 and nonword = 1.

Next, we’ll fit a mixed effects model to our data to predict log reaction times. This model contains fixed effects including an intercept and slopes for stimulus and task and random effects including random intercepts and slopes of stimulus by participant ID and random intercepts and slopes of task by item. For further details on why you might use mixed effects models and the impact of random effects and concepts like partial pooling, check out TJ Mahr’s great post here.

m <- lme4::lmer(
  log_rt ~ 1 + task * stimulus + 
    (1 + stimulus | id) +
    (1 + task | item), 
  data = fhch
)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: log_rt ~ 1 + task * stimulus + (1 + stimulus | id) + (1 + task |  
##     item)
##    Data: fhch
## 
## REML criterion at convergence: 6899
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -6.087 -0.593 -0.105  0.471  5.573 
## 
## Random effects:
##  Groups   Name            Variance Std.Dev. Corr 
##  item     (Intercept)     0.00855  0.0925        
##           tasklexdec      0.02683  0.1638   -0.78
##  id       (Intercept)     0.03592  0.1895        
##           stimulusnonword 0.00793  0.0891   -0.02
##  Residual                 0.09059  0.3010        
## Number of obs: 12960, groups:  item, 600; id, 45
## 
## Fixed effects:
##                            Estimate Std. Error t value
## (Intercept)                 -0.3524     0.0431   -8.18
## tasklexdec                   0.3434     0.0581    5.91
## stimulusnonword              0.3123     0.0227   13.73
## tasklexdec:stimulusnonword  -0.2550     0.0318   -8.03
## 
## Correlation of Fixed Effects:
##             (Intr) tsklxd stmlsn
## tasklexdec  -0.745              
## stimlsnnwrd -0.078  0.066       
## tsklxdc:stm  0.063 -0.095 -0.745

Our model fit nicely with no convergence warnings. While we might be interested in a lot of the output here, for now we’ll draw our attention to the fixed effects parameter estimates. Let’s break down each term. The intercept generally represents predicted values on our outcome when predictors are set to 0 and the random effects are set to 0 (so that the prediction applies to a “typical” participant and for a “typical” item, whereby random effects are set to 0). In this instance, this means a log reaction time of -0.35 (or a real reaction time of 0.7 seconds) can be expected for the naming task for real words. Why is this the case? Our dummy codes define the naming task as 0 for task and real words as 0 for stimulus.

Slope terms generally denote the increase in our outcome variable for a 1 unit change in our predictors. This means that the slope term for task (here tasklexdec) defines the difference in log reaction times between the naming and lexical decision tasks for real words. Similarly the slope term for stimulus (here stimulusnonword) defines the difference in log reaction times between the real words and non-words for the naming task. Finally, we can understand the interaction term as one which modifies these two slope terms. Understanding this is a little easier if we formally define at least the fixed effects terms of our model above where \(X_1\) represents our dummy codes for task and \(X_2\) represents our dummy code for stimulus:

\[ \alpha + \beta_{task}X_1 + \beta_{stimulus}X_2 + \beta_{taskstimulus}X_1X_2 \] Crucially, because the intercept (\(\alpha\)) represents a specific combination of our factors (here, naming words) the slope terms represent differences between naming words and other combinations of our factors. Thus to get the difference between this combination of conditions and others we simply have to manipulate our slope terms. Thus, the difference between naming words and a lexical decision for words is defined as:

\[ (\beta_{task} \times 1) + ( \beta_{stimulus}\times 0) + (\beta_{taskstimulus}\times 1 \times 0) \] Or (0.34 \(\times\) 1) + (0 \(\times\) 0.34) + (-0.26 \(\times\) 1 \(\times\) 0) which is simply 0.34.

If we work through every combination of this to switch on or off each slope term, we might end up with a model matrix that looks like so. (Note, this could be done easily by hand, which we’ll do later.)

grid <- expand.grid(
  task = levels(fhch$task), 
  stimulus = levels(fhch$stimulus)
)

emmeans_contrasts <- model.matrix(~ task * stimulus, data = grid)

# transpose the matrix for printing
t(emmeans_contrasts) 
##                            1 2 3 4
## (Intercept)                1 1 1 1
## tasklexdec                 0 1 0 1
## stimulusnonword            0 0 1 1
## tasklexdec:stimulusnonword 0 0 0 1
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$task
## [1] "contr.treatment"
## 
## attr(,"contrasts")$stimulus
## [1] "contr.treatment"

Here, the first column represents only having our intercept term switched on. This is essentially the naming word condition. The second one is lexical decision for words, the third naming non-words, and the last one (which uses all parameters) is the lexical decision for non-words.

We could get these estimates for each group by simply adding the parameter estimates together, but we can use our model matrix defined above and a little matrix multiplication with the parameter estimates to arrive at the same answer more easily.

Estimating marginal means

# get fixed effect parameter estimates
betas <- fixef(m) 

# matrix multiply contrasts by estimates
grid$emmean <- c(emmeans_contrasts %*% betas) 

grid 
##     task stimulus emmean
## 1 naming     word -0.352
## 2 lexdec     word -0.009
## 3 naming  nonword -0.040
## 4 lexdec  nonword  0.048

Now we have the means for each group, we might want to also get their standard errors. For balanced designs without random effects, this is as easy as taking the standard errors directly from the model coefficients. However, with this data and model we need a more complicated approach. We can take the variance covariance matrix from our model:

v <- vcov(m)
v
## 4 x 4 Matrix of class "dpoMatrix"
##                            (Intercept) tasklexdec stimulusnonword
## (Intercept)                   0.001856  -0.001866       -0.000076
## tasklexdec                   -0.001866   0.003380        0.000087
## stimulusnonword              -0.000076   0.000087        0.000517
## tasklexdec:stimulusnonword    0.000087  -0.000176       -0.000538
##                            tasklexdec:stimulusnonword
## (Intercept)                                  0.000087
## tasklexdec                                  -0.000176
## stimulusnonword                             -0.000538
## tasklexdec:stimulusnonword                   0.001009

And then multiply the contrast matrix by the variance covariance matrix and the transpose of the contrast matrix. Taking the diagonal of this we have the variance for the marginal effects. Finally taking the square root of this variance we arrive at the marginal standard errors.

grid$SE <- sqrt(diag(
  emmeans_contrasts %*% v %*% t(emmeans_contrasts)
))

grid
##     task stimulus emmean    SE
## 1 naming     word -0.352 0.043
## 2 lexdec     word -0.009 0.039
## 3 naming  nonword -0.040 0.047
## 4 lexdec  nonword  0.048 0.042

We can confirm the accuracy of this method by exploring the estimated marginal means directly from emmeans. Note that this uses the normal approximation for calculating p-values as by default lme4 doesn’t estimate degrees of freedom. Thus we see the degrees of freedom here as Inf for infinity.

emms <- emmeans(m, ~ task * stimulus, lmer.df = "a")
emms
##  task   stimulus emmean    SE  df asymp.LCL asymp.UCL
##  naming word      -0.35 0.043 Inf     -0.44    -0.268
##  lexdec word      -0.01 0.039 Inf     -0.08     0.067
##  naming nonword   -0.04 0.047 Inf     -0.13     0.052
##  lexdec nonword    0.05 0.042 Inf     -0.03     0.131
## 
## Degrees-of-freedom method: asymptotic 
## Confidence level used: 0.95

Pairwise tests between marginal means

Making pairwise mean differences

Now we have our estimated marginal means we might want to calculate differences between these means. To do this, we’ll take a similar approach to before and split this process into calculating the mean differences (which is relatively easy) and calculating the standard errors of the mean differences (which is pretty hard).

As before, we can calculate mean differences manually. For example, to get the mean differences between naming words and for doing a lexical decision for words, we simply have to subtract the estimated marginal mean for naming words and for doing a lexical decision for words, i.e. -0.35 - -0.01 = -0.34. However, we can do this by also using a contrast matrix to get all mean differences. Using our table of estimated marginal means as a reference we can define some contrasts between each condition. So, for example, if we want to get the mean difference between naming words and for doing a lexical decision for words we will take the value for naming words and multiply that by 1. We will then take the value for making a lexical decision for words and multiply that by -1. The remaining values will then by multiplied by 0 (i.e. set to 0). We will then sum all those values together. This could be done manually in a loop or we could again take advantage of matrix multiplication.

First, we’ll determine the pairwise contrast matrix of means then multiply our grid of means with these contrasts.

pairwise_contrasts <- cbind(
  "naming word - lexdec word" = c(1, -1, 0, 0),
  "naming word - naming nonword" = c(1, 0, -1, 0),
  "naming word - lexdec nonword" = c(1, 0, 0, -1),
  "lexdec word - naming nonword" = c(0, 1, -1, 0),
  "lexdec word - lexdec nonword" = c(0, 1, 0, -1),
  "naming nonword - lexdec nonword" = c(0, 0, -1, 1)
)

pairwise_means <- t(grid$emmean %*% pairwise_contrasts)

# fix for printing
pairwise_means <- data.frame(
  contrast = rownames(pairwise_means), 
  est = as.numeric(pairwise_means)
)
pairwise_means
##                          contrast    est
## 1       naming word - lexdec word -0.343
## 2    naming word - naming nonword -0.312
## 3    naming word - lexdec nonword -0.401
## 4    lexdec word - naming nonword  0.031
## 5    lexdec word - lexdec nonword -0.057
## 6 naming nonword - lexdec nonword  0.088

However, while this works nicely for the means we can’t use these estimates to determine the standard errors of the pairwise tests, which we need to generate z-statistics and p-values. Instead, we can get around this issue by using estimates from the model directly and specifying our contrasts based on the coding scheme in our model.

For our treatment-coded effects, we can get each contrast between conditions as follows:

model_pairwise_contrasts <- cbind(
  "naming word - lexdec word" = c(0, -1, 0, 0),
  "naming word - naming nonword" = c(0, 0, -1, 0),
  "naming word - lexdec nonword" = c(0, -1, -1, -1),
  "lexdec word - naming nonword" = c(0, 1, -1, 0),
  "lexdec word - lexdec nonword" = c(0, 0, -1, -1),
  "naming nonword - lexdec nonword" = c(0, -1, 0, -1)
)

t(model_pairwise_contrasts) # transpose for printing
##                                 [,1] [,2] [,3] [,4]
## naming word - lexdec word          0   -1    0    0
## naming word - naming nonword       0    0   -1    0
## naming word - lexdec nonword       0   -1   -1   -1
## lexdec word - naming nonword       0    1   -1    0
## lexdec word - lexdec nonword       0    0   -1   -1
## naming nonword - lexdec nonword    0   -1    0   -1

This might look a little confusing, so we can break down each contrast by understanding how we coded our effects in the model. Because the intercept represents values of our outcome when all predictors are set to 0, this means the baseline is stimulus = 0 (word) and task = 0 (naming), or naming words. Remember each slope term represents the change in our outcome when we move from a code of 0 to 1.

Naming word - lexdec word: To get the difference between naming words and making a lexical decision about words we subtract the effect of task = 1 (or making lexical decision) from our baseline. This is the difference between naming words and making a lexical decision about words.
Naming words - Naming nonwords: Similarly subtract the effect of stimulus = 1 (or nonwords) from our baseline.
Naming words - lexdec nonwords: Getting more complicated, we take the baseline (naming words) and subtract task = 1 (making a lexical decision) stimulus = 1 (nonwords) and their interaction from the baseline. Why subtract the interaction? When an interaction is present we have to account for it affecting our pairwise contrasts. Using our linear model formula above when we increase each predictor by 1 this is propagated to the interaction term because 1 (predictor 1) \(\times\) 1 (predictor 2) = 1.
Lexdec words - naming nonword: If we want the difference between conditions other than the baseline, we have to figure out the codes for each level before we determine our contrasts. Making a lexical decision about words is the same as the baseline plus the slope effect of task (i.e. 1 = lexdec). Naming nonwords is the same as the baseline plus the slope effect of stimulus (i.e. 1 = nonword). Thus, if we subtract one slope term from the other we get the difference between making a lexical decision about words and naming nonwords.
Lexdec word - lexdec nonword: We’re getting even more complicated now, but the easiest way to think of this contrast is that the interaction term is present when stimulus = 1 (word) and task = 1 (lexdec), or when making a lexical decision about words. If we want to subtract the lexical decision about words estimates from anything else, we set this interaction term to -1. Thus, subtracting the slope term for stimulus from this (i.e. 1 = nonword) gets us the difference between making a lexical decision about words and nonwords…(Hopefully you can see just how useful emmeans is at this point!)
Naming nonword - lexdec nonword: This final contrast has a similar logic to the previous one. Get the interaction term present to set the contrast to a lexical decision about nonwords, then subtract the effect of task to get the difference between the two tasks for nonwords.

Note that flipping the sign on the last 3 contrasts just changes which condition is subtracted from the other.

However, this approach requires a fair bit of thinking and is likely to be error-prone due to the manual nature of setting contrasts. Instead, we can use our initial contrasts for the estimated marginal means in each condition and generate a contrast matrix. The logic here is that pairwise tests take a mean for one condition and subtract the mean for another condition. Thus the means for difference scores can be taken directly from the model by taking the contrast for one condition and subtracting the contrast from another. This can be done repetitively like so:

model_pairwise_contrasts <- as.matrix(t(rbind(
  emmeans_contrasts[1,] - emmeans_contrasts[2,],
  emmeans_contrasts[1,] - emmeans_contrasts[3,],
  emmeans_contrasts[1,] - emmeans_contrasts[4,],
  emmeans_contrasts[2,] - emmeans_contrasts[3,],
  emmeans_contrasts[2,] - emmeans_contrasts[4,],
  emmeans_contrasts[3,] - emmeans_contrasts[4,]
)))

t(model_pairwise_contrasts) # transpose for printing
##      (Intercept) tasklexdec stimulusnonword tasklexdec:stimulusnonword
## [1,]           0         -1               0                          0
## [2,]           0          0              -1                          0
## [3,]           0         -1              -1                         -1
## [4,]           0          1              -1                          0
## [5,]           0          0              -1                         -1
## [6,]           0         -1               0                         -1

But again, this isn’t satisfactory because we’re still establishing the contrasts manually. A better way might be instead to figure our each combination of contrasts programatically, and then use the indices of these contrasts to construct our final matrix for pairwise comparisons.

To do this, we first want to get labels for each combination of conditions we have. We can grab these from our current grid:

# get names for the combinations for the contrasts
combo_names <- paste(grid$task, grid$stimulus)
combo_names
## [1] "naming word"    "lexdec word"    "naming nonword" "lexdec nonword"

We can then get every combination of pairs for these names and transpose them so that we have rows corresponding to every contrast between conditions. We’ll take advantage of the combn() function in R for this, which should have equivalents in any language.

contrast_names <- t(combn(combo_names, 2))
contrast_names
##      [,1]             [,2]            
## [1,] "naming word"    "lexdec word"   
## [2,] "naming word"    "naming nonword"
## [3,] "naming word"    "lexdec nonword"
## [4,] "lexdec word"    "naming nonword"
## [5,] "lexdec word"    "lexdec nonword"
## [6,] "naming nonword" "lexdec nonword"

Finally, we’ll collapse these two columns and add a sign so we know which differences we’ll calculate. Nicely, given that we’re going to calculate contrasts based on the table in our estimated marginal means this order will be correct.

contrast_names <- paste(contrast_names[,1], contrast_names[,2], sep = "-")
contrast_names
## [1] "naming word-lexdec word"       "naming word-naming nonword"   
## [3] "naming word-lexdec nonword"    "lexdec word-naming nonword"   
## [5] "lexdec word-lexdec nonword"    "naming nonword-lexdec nonword"

Next, we’ll figure out the indices for each pairwise comparison from our estimated marginal means. This just requires us to get every combination of the numbers 1 through to 4 (the maximum row length in our estimated marginal means) and again transpose this so we have columns indicating which rows in our contrast matrix we’ll subtract from one another.

# get combination of columns
combo_cols <- t(combn(c(1:length(combo_names)), 2))
combo_cols
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## [3,]    1    4
## [4,]    2    3
## [5,]    2    4
## [6,]    3    4

Above, this tells us that we’ll subtract row 2 from row 1, row 3 from row 1, row 4 from row 1 etc. in our estimated marginal means contrast matrix. We’ll do this vectorised subtraction below. We’ll finally transpose this and ensure it’s stored as a matrix. We’ll then assign the names for each contrast to each column in the matrix. This way when we generate the contrasts through matrix multiplication later on we’ll know which contrast each score belongs to.

model_pairwise_contrasts <- as.matrix(t(
  emmeans_contrasts[combo_cols[,1],] - emmeans_contrasts[combo_cols[,2],]
))

# give colnames to the martrix
colnames(model_pairwise_contrasts) <- contrast_names
t(model_pairwise_contrasts) # transpose back for printing
##                               (Intercept) tasklexdec stimulusnonword
## naming word-lexdec word                 0         -1               0
## naming word-naming nonword              0          0              -1
## naming word-lexdec nonword              0         -1              -1
## lexdec word-naming nonword              0          1              -1
## lexdec word-lexdec nonword              0          0              -1
## naming nonword-lexdec nonword           0         -1               0
##                               tasklexdec:stimulusnonword
## naming word-lexdec word                                0
## naming word-naming nonword                             0
## naming word-lexdec nonword                            -1
## lexdec word-naming nonword                             0
## lexdec word-lexdec nonword                            -1
## naming nonword-lexdec nonword                         -1

Now that we’ve defined our model-based contrasts we can get estimates of the differences in the marginal means by matrix multiplying our fixed effects parameter estimates by our contrasts.

pairwise <- t(betas %*% model_pairwise_contrasts)

# fix for printing
pairwise <- data.frame(
  contrast = rownames(pairwise), 
  est = as.numeric(pairwise)
)
pairwise
##                        contrast    est
## 1       naming word-lexdec word -0.343
## 2    naming word-naming nonword -0.312
## 3    naming word-lexdec nonword -0.401
## 4    lexdec word-naming nonword  0.031
## 5    lexdec word-lexdec nonword -0.057
## 6 naming nonword-lexdec nonword -0.088

Does this match what we get from emmeans? We can recreate these means with emmeans by using the pairs() function. Here, we’ll ensure that we don’t adjust our p-values for multiplicity by setting adjust to “none” (the default is Tukey). We do this here just for confirmation of our hand-calculated values later, but typically you would want to adjust these.

pairs(emms, adjust = "none")
##  contrast                        estimate    SE  df z.ratio p.value
##  naming word - lexdec word          -0.34 0.058 Inf  -5.900  <.0001
##  naming word - naming nonword       -0.31 0.023 Inf -13.700  <.0001
##  naming word - lexdec nonword       -0.40 0.060 Inf  -6.600  <.0001
##  lexdec word - naming nonword        0.03 0.061 Inf   0.500  0.6100
##  lexdec word - lexdec nonword       -0.06 0.021 Inf  -2.700  0.0100
##  naming nonword - lexdec nonword    -0.09 0.064 Inf  -1.400  0.1600
## 
## Degrees-of-freedom method: asymptotic

Nice! We got the same means. Though emmeans is much more useful in also providing standard errors, z-values and p-values. Can we recreate these? First we need to calculate the standard errors from which we can easily recreate our other statistics. There’s a few ways we can do this, but all must take into account that our model contains an interaction. First, we could calculate this by hand.

Let’s calculate the standard error for the difference between naming words and a lexical decision for words. The formula for this is defined as:

\[ Var(b_{task}) + Var(b_{task*stimulus}) \times stimulus^2 + 2 \times stimulus \times Cov(b_{task*stimulus}) \] In this case, we set the stimulus effect to zero, so the formula is a little easier to read as:

\[ Var(b_{task}) + Var(b_{task*stimulus}) \times 0^2 + 2 \times 0 \times Cov(b_{task*stimulus}) \] Or even more simply:

\[ Var(b_{task})\] Remember that we can get the variance covariance matrix from our model using the vcov() function. We did this above and stored it in the variable v. Here’s a reminder of what this looks like:

v
## 4 x 4 Matrix of class "dpoMatrix"
##                            (Intercept) tasklexdec stimulusnonword
## (Intercept)                   0.001856  -0.001866       -0.000076
## tasklexdec                   -0.001866   0.003380        0.000087
## stimulusnonword              -0.000076   0.000087        0.000517
## tasklexdec:stimulusnonword    0.000087  -0.000176       -0.000538
##                            tasklexdec:stimulusnonword
## (Intercept)                                  0.000087
## tasklexdec                                  -0.000176
## stimulusnonword                             -0.000538
## tasklexdec:stimulusnonword                   0.001009

So our standard error from this formula is the square root of the variance for the slope of lexdec.

sqrt(v["tasklexdec","tasklexdec"])
## [1] 0.058

What about a more complicated case? Here, we’ll get the standard error for the naming *nonword - lexdec nonword contrast. We take the variance for the slope of task, add the variance for the interaction of task and stimulus multiplied by our slope contrast code squared (which here doesn’t change the value at all), and then add two times the contrast code times the covariance of the interaction. (Why do we multiply the latter two by our contrast code of 1 at points? This doesn’t have much effect here but is really useful if we have continuous variables where we want marginal effects by a given value, say the mean of one variable and is an important consideration for effects that should have these values switched off.)

sqrt(
    v["tasklexdec","tasklexdec"] + # var(task)
        v["tasklexdec:stimulusnonword", "tasklexdec:stimulusnonword"] * 1^2 + # var(task*stimulus)
        2 * 1 * v["tasklexdec", "tasklexdec:stimulusnonword"] # cov(task*stimulus)
)
## [1] 0.064

As you can see, this process is quite complicated and error prone. Let’s try another method instead; the delta method. To do this, we’ll return to our matrix multiplication approach. The numDeriv package contains a function, jacobian() which “calculate[s] the m by n numerical approximation of the gradient of a real m-vector valued function with n-vector argument”. To be honest, the logic behind this approach is a bit beyond me at this moment, but as I understand it we pass a function to jacobian(), indicating which slope terms should be switched on or off for a given contrast. We then take this index, matrix multiply this by our variance covariance matrix and matrix multiply that by the transpose of our variance covariance matrix. Finally we take the square root of this to get our standard error. Let’s do this again for the naming word - lexdec word contrast.

# load numerical derivative package
library(numDeriv) 

# gradient function; switch on the slope for task and nothing else
g <- function(b){
  b[2]*1 + b[3]*0 + b[4]*0
}

# gives dimensions of terms to switch on or off (including the intercept)
grad_g <-  jacobian(g, fixef(m))

sqrt(grad_g %*% v %*% t(grad_g)) 
## 1 x 1 Matrix of class "dgeMatrix"
##       [,1]
## [1,] 0.058

Again, we could step through this by changing our codes, but that seems still a little complicated. Remember, we started this whole exercise to see if we could reproduce the standard errors for the pairwise tests without packages, so using numDeriv seems a little like cheating. Nicely, the jacobian() function in this case just gives us our model_pairwise_contrasts table back (only with one contrast each time). So, we can reproduce these values by stepping through our contrasts and applying the same matrix multiplication steps and finally square rooting the result.

Or you can define your contrasts by combinations of the coefficients from the linear model.

# make an SE data holder in our pairwise table
pairwise$SE <- NA_real_

# step through each contrast and apply our function to each one
# add these to the SE column
for(i in 1:nrow(pairwise)) {
    grad_g <- matrix(model_pairwise_contrasts[, i], nrow = 1)
    pairwise$SE[i] <- sqrt(grad_g %*% v %*% t(grad_g)) 
}

# print the result
pairwise
##                        contrast    est    SE
## 1       naming word-lexdec word -0.343 0.058
## 2    naming word-naming nonword -0.312 0.023
## 3    naming word-lexdec nonword -0.401 0.060
## 4    lexdec word-naming nonword  0.031 0.061
## 5    lexdec word-lexdec nonword -0.057 0.021
## 6 naming nonword-lexdec nonword -0.088 0.064

That seems to match what we got from the pairs function!

Making z-scores and p-values

The only last steps we need to complete are to calculate the z-scores and the p-values. Nicely, z-scores are simply are parameter estimates divided by the standard error. Using these scores, we can calculate the p-value using the normal approximation, which for a two-tailed test is \(2 \times\) the probability of our z-scores on the negative scale.

pairwise$z <- pairwise$est/pairwise$SE
pairwise$p <- 2*pnorm(-abs(pairwise$z))

pairwise
##                        contrast    est    SE      z       p
## 1       naming word-lexdec word -0.343 0.058  -5.91 3.5e-09
## 2    naming word-naming nonword -0.312 0.023 -13.73 6.3e-43
## 3    naming word-lexdec nonword -0.401 0.060  -6.63 3.3e-11
## 4    lexdec word-naming nonword  0.031 0.061   0.51 6.1e-01
## 5    lexdec word-lexdec nonword -0.057 0.021  -2.70 6.9e-03
## 6 naming nonword-lexdec nonword -0.088 0.064  -1.39 1.6e-01

Finally, let’s confirm that these results match the ones from the pairs() function in emmeans.

pairs(emms, adjust = "none")
##  contrast                        estimate    SE  df z.ratio p.value
##  naming word - lexdec word          -0.34 0.058 Inf  -5.900  <.0001
##  naming word - naming nonword       -0.31 0.023 Inf -13.700  <.0001
##  naming word - lexdec nonword       -0.40 0.060 Inf  -6.600  <.0001
##  lexdec word - naming nonword        0.03 0.061 Inf   0.500  0.6100
##  lexdec word - lexdec nonword       -0.06 0.021 Inf  -2.700  0.0100
##  naming nonword - lexdec nonword    -0.09 0.064 Inf  -1.400  0.1600
## 
## Degrees-of-freedom method: asymptotic

That looks spot on!

I guess one last thing we could do is add some confidence intervals around the estimates. Because we’re using the normal approximation we can assume the degrees of freedom are infinite, but if we wanted more conservative estimates for this (and all stats throughout) we might use Kenward-Roger or Satterthwaite degrees of freedom. Though this is beyond the scope of this post.

pairwise <- pairwise %>% 
  mutate(
    lci = est + SE * qt(c(0.5 * 0.05), Inf), # lower ci
    uci = est + SE * qt(c(1 - 0.5 * 0.05), Inf), # upper ci
    p = round(p, 3) # nicer printing for p-vals
  ) %>% 
  select(contrast, est, lci, uci, everything()) # reorder it

pairwise
##                        contrast    est    lci    uci    SE      z     p
## 1       naming word-lexdec word -0.343 -0.457 -0.229 0.058  -5.91 0.000
## 2    naming word-naming nonword -0.312 -0.357 -0.268 0.023 -13.73 0.000
## 3    naming word-lexdec nonword -0.401 -0.519 -0.282 0.060  -6.63 0.000
## 4    lexdec word-naming nonword  0.031 -0.088  0.151 0.061   0.51 0.610
## 5    lexdec word-lexdec nonword -0.057 -0.099 -0.016 0.021  -2.70 0.007
## 6 naming nonword-lexdec nonword -0.088 -0.213  0.036 0.064  -1.39 0.164

All done!

Conclusions

This blog post started with me wondering how to reproduce effects in emmeans by hand. Along the way, we’ve learn that this is possible but maybe a lot more complicated than expected, especially in the case of standard errors. However, I think doing this is useful because (1) it helps us to understand our models more if we know how things are estimated, (2) it allows us to reproduce effects using different software without being tied down to the one with the existing packages, and (3) it allows us to see just how genius the emmeans package is! There’s a lot here still that is unique to this scenario while emmeans can nicely and consistently handle a lot of different scenarios. Thanks for reading! Comments and questions are welcomed.

Future Goals

Next up, we’ll might want to look at what we need to do to adapt this to different coding schemes, such as a sum-coding scheme which allows us to interpret the (non-interaction) slopes for our factors in this multiple two-level case as main effects rather than simple effects. But that’s a post for another day!

Updates and Clarifications

Last updated 2021-09-07.

2021-05-14: Added a method for calculating pairwise contrasts between conditions without the need to manually define a contrast matrix, instead using methods such as manual pairwise subtraction of contrasts for condition means and vectorised subtraction of contrasts for condition means more programatically.

2021-03-25: Thanks to Isabella R. Ghement for clarifying that the fixed effects predictions apply to a typical participant and item. A previous version of this post neglected the implications of the random effects.

2021-09-07: Added a reference to the Python post.