If you have a local machine you can install programs on, you may want to install R and RStudio, an IDE for working with R that makes many tasks easier to manage and has version control integration via Git. However, if you don’t have a local machine, please make an account at RStudio Cloud. This provides an online interface to RStudio that is managed by the makers of RStudio. If you have any issues with your local install, this is always a reliable option.
Once installed, please download the content for this session from GitHub. If you have RStudio installed and double click the ioc_quickstats.Rproj file, this will open up RStudio and set your file paths to be relative to the ioc_quickstats folder. This is important for ensuring that things like loading data happen without any problems. Generally, it’s a good idea to have an RProject associated with any project you’re carrying out for this reason: it makes collaboration easier. (To start a new project in RStudio, just use File –> New Project.)
R comes with a suite of pre-installed packages that provide functions for many common tasks. However, we’ll download a popular package from CRAN, the Comprehensive R Archive Network, which acts as a quality control for R packages. We’ll use the tidyverse package, which is a package of packages that simplify wrangling, summarising, and plotting data. Run the following chunk to check whether you already have it installed and to install it if not. Then be sure to load it in every new session.
if(!require(tidyverse)) {install.packages("tidyverse")} # optionally install it
library(tidyverse) # load it; run this every session.
We’ll also use the here package. This makes working with file paths across different systems relatively easy (no more switching between / and \).
if(!require(here)) {install.packages("here")} # optionally install it
library(here) # load it; run this every session.
Finally, for reproducibility, we’ll set a seed for our random number generation. Why? This just allows you to get the same numbers as me when we simulate “random” data.
set.seed(1000)
What do we want to know from our data? Your philosophy of science dictates how you’ll analyse and interpret your data, so having a background in common approaches is useful. Here, we’ll cover two popular approaches to statistics. First, however, it’s important to understand some basics of probability and sampling data.
If we have the whole population, we don’t need inferential statistics.
These are used only when we want to draw inferences about the population based on our sample.
e.g. Are basketball players in your local club taller than football players in your local club? Just measure them, there aren’t that many people in the clubs and the data are easily accessible.
e.g. Are basketball players in the population generally taller than football players? Get a representative sample of basketball and football players from the population, measure their heights, then use inferential statistics to come to a conclusion.
We use the sample to draw inferences about the entire population. How we do so depends on our definition of probability or the types of questions we’d like to answer.
There are many approaches to statistical inference. Two of the most commonly used approaches are Neyman-Pearson frequentist statistics (often using null-hypothesis significance testing), and Bayesian inference (often using estimation or model comparison approaches).
e.g. Get 7 heads out of 10 coin flips. Is the coin fair? Assume it is, i.e. P(heads) = 0.5, and estimate how often you would get a result at least as extreme as 7 heads in 10 flips across an infinite series of such experiments. If this is very rare under a pre-defined cutoff (e.g. 5%), we reject the hypothesis that the coin is fair.
Using this approach, we will only make an incorrect decision (e.g. false rejection, false acceptance of hypothesis) at a known and controlled rate.
e.g. A coin gets 7 heads out of 10 flips. Which probabilities of getting heads are most plausible (e.g. 0.6, 0.7, 0.8)? We can estimate this for a range of probabilities or to a degree of credibility (i.e. I’m 90% certain it is between 0.6 and 0.8).
Using this approach, we often get the closest estimates of the true probabilities. We can also incorporate our prior beliefs about the data into the model. This is useful when data are hard or expensive to get. (A code sketch of both coin-flip examples follows below.)
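To make these two approaches concrete, here is a minimal sketch of each coin-flip example in R. These chunks are illustrative only and are not part of the session’s required code.
# Frequentist check: how often would a fair coin give 7 or more heads in 10 flips?
sum(dbinom(7:10, size = 10, prob = 0.5)) # about 0.17, so not rare under a 5% cutoff
binom.test(x = 7, n = 10, p = 0.5) # the equivalent built-in exact test
# Bayesian sketch: grid approximation with a flat prior over P(heads)
p_grid <- seq(from = 0, to = 1, by = 0.01) # candidate probabilities of heads
likelihood <- dbinom(7, size = 10, prob = p_grid) # likelihood of 7/10 heads for each candidate
posterior <- likelihood / sum(likelihood) # flat prior, so just normalise
p_grid[which.max(posterior)] # most plausible value: 0.7
quantile(sample(p_grid, size = 1e4, replace = TRUE, prob = posterior), c(0.05, 0.95)) # rough 90% credible interval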
If we aim to describe our sample and draw inferences from it that apply to a population, we often need to make an assumption about the sampling distribution of the data.
Many “standard” approaches to statistical inference assume data are drawn from a Gaussian/normal distribution. Why? Some explanations (and an example) are provided by Richard McElreath (2016):
Imagine lining up at the centre line of a football field. Flip a coin 20 times. Every time it lands heads, step left. Every time it lands tails, step right. The distance of the step taken is allowed to vary between -1 and 1 yard.
We’ll demonstrate this using a few inbuilt functions in R: replicate(), which repeats a process a set number of times (here 1000), and runif(), which gives n random draws from a uniform distribution between set minimum and maximum values (here -1 and 1 yard). Finally, we sum the steps together using sum() and assign the 1000 summed positions to a variable called positions.
# get 1000 people, sum up their 20 flips of heads or tails
positions <- replicate(1000, sum(runif(n = 20, min = -1, max = 1)))
I’ve plotted the density of the data at each distance from the middle line (0) against the normal distribution below.
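If you would like to recreate a plot like this yourself, something along these lines should work (a sketch, not necessarily the exact code used for the figure):
ggplot(data.frame(positions), aes(x = positions)) +
  geom_density() + # density of the 1000 simulated end positions
  stat_function( # overlay a normal curve with the same mean and SD, dashed
    fun = dnorm,
    args = list(mean = mean(positions), sd = sd(positions)),
    linetype = "dashed"
  )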
Why is this captured well by the normal distribution? There are many more ways to realise a position of 0 from combining left-right steps than there are extreme values (e.g. all right steps).
There are of course many other ways that data can be generated, but we commonly rely on the exponential family. Which one you choose to model your data with depends on domain expertise and knowledge of the underlying data generation process.
e.g. if modelling something with a binomial response (i.e. 1s or 0s), a binomial distribution will work best.
Your choice matters, because if you model data with the wrong distribution you might get nonsensical model outputs. For example, modelling probabilities (bounded between 0 and 1) with a Gaussian distribution could give estimates below 0 or above 1, which is impossible.
We will approach modelling our data assuming a Gaussian distribution here. This is what the general linear model – a very flexible family of models in statistics – assumes. However, we can choose to fit our data under a different assumption about the data generating process by using different distributions. In R, it is also common to fit data using logistic regression, which models data from a binomial process. Typically, these different distributions come under the family of the generalised linear model and apply link functions to our data.
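For instance, a logistic regression on a hypothetical binary outcome (a made-up won variable, simulated purely for illustration) could be fitted like this:
# simulate a hypothetical binary outcome related to weight
binary_dat <- data.frame(weight = rnorm(n = 100, mean = 90, sd = 10))
binary_dat$won <- rbinom(n = 100, size = 1, prob = plogis(0.1 * (binary_dat$weight - 90)))
# fit a generalised linear model with a binomial family and logit link
logit_mod <- glm(won ~ weight, family = binomial(link = "logit"), data = binary_dat)
summary(logit_mod) # coefficients are on the log-odds scale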
If you want to learn more about this or love the figure (or probability, or statistics in general), please check out Statistical Rethinking by Richard McElreath.
Now that we have a background in statistics and how we might model the data, let’s look at how we’ll go about doing so in R. First off, we need to describe the sample of data we have.
This description can be broken down into two types of measure: central tendency and dispersion. Central tendency represents the “average” of the data. Whereabouts is the most common score?
Mean: Often used to describe data that are normally distributed. Can be misleading when outliers are present, which drag the mean towards an unusually high or low value.
Median: Often used to describe data that are skewed away from a normal distribution. This does a good job of suppressing the effect of outliers dragging scores in one direction.
Let’s see how they compare:
normal_data <- rnorm(n = 20, mean = 10, sd = 1)
outlier_data <- c(normal_data[1:18], 30, 55) # replace last 2 values with outliers
# print values to see what they look like
data.frame(normal = normal_data, outlier = outlier_data)
## normal outlier
## 1 10.947085 10.947085
## 2 9.625501 9.625501
## 3 9.770286 9.770286
## 4 9.992245 9.992245
## 5 11.153806 11.153806
## 6 7.525531 7.525531
## 7 8.146430 8.146430
## 8 10.805776 10.805776
## 9 8.402324 8.402324
## 10 10.059105 10.059105
## 11 9.499599 9.499599
## 12 8.205292 8.205292
## 13 8.419617 8.419617
## 14 10.368419 10.368419
## 15 9.981481 9.981481
## 16 10.444425 10.444425
## 17 9.346919 9.346919
## 18 9.538571 9.538571
## 19 9.934868 30.000000
## 20 11.549903 55.000000
comparing_central <- data.frame(
normal_mean = mean(normal_data),
normal_median = median(normal_data),
outlier_mean = mean(outlier_data),
outlier_median = median(outlier_data)
)
comparing_central # print output
## normal_mean normal_median outlier_mean outlier_median
## 1 9.685859 9.852577 12.86162 9.875883
Notice how the mean is substantially affected by the presence of 2 large outliers, while the median stays much closer to most values.
Standard Deviation: A measure of the spread of the data around the sample mean. With a normal distribution, approximately 1, 2, and 3 SDs capture approximately 68%, 95%, and 99.7% of the data. Use sd() in R.
Standard Error: An estimate of how far the sample mean is likely to be from the population mean. Use sd(x)/sqrt(length(x)) in R, where x is your outcome variable.
Interquartile Range: Often used when data are described by a different distribution to the normal. Rank order the data into four equal parts; this gives three cut points: the top of the first quarter – or quartile – (i.e. 25%), the median (i.e. 50%), and the top of the third quartile (i.e. 75%). The interquartile range is the distance between the 25% and 75% points; use IQR() in R (or quantile() for the cut points themselves).
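As a quick illustration using the normally distributed sample from above (a sketch; exact values will depend on the simulated data):
sd(normal_data) # standard deviation
sd(normal_data) / sqrt(length(normal_data)) # standard error of the mean
IQR(normal_data) # distance between the 25% and 75% points
quantile(normal_data, probs = c(0.25, 0.5, 0.75)) # the three quartile cut points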
We simulated some data in R looking at how heights of basketball and football players can be predicted by their weight and the sport they play. For full details on the simulation, please see the attached .Rmd file.
If you’re viewing this online and want to follow along without downloading the files from GitHub, please click here.
sports_dat <- read_csv(here("data", "sports_dat.csv"))
## Parsed with column specification:
## cols(
## sport = col_character(),
## sport_num = col_double(),
## weight = col_double(),
## height = col_double()
## )
Nicely, we can see with read_csv() what the data types of our data look like. It’s generally a good idea to inspect our data to get an idea of how it’s laid out. The function glimpse() from the tibble package in the tidyverse gives a good overview:
glimpse(sports_dat)
## Rows: 200
## Columns: 4
## $ sport <chr> "basketball", "basketball", "basketball", "basketball", "ba…
## $ sport_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ weight <dbl> 70.13634, 106.22125, 81.42445, 95.98709, 94.95682, 92.90130…
## $ height <dbl> 202.1078, 211.1086, 204.5480, 201.0060, 208.4662, 205.9269,…
First off, we’ll analyse our data just assuming that we only care about the relationship between height and weight. We are ignoring the sports variable for now just for simplicity.
Your first step to analysing your data should always be to plot it. This allows you to get a good overview of your data and to inspect any subtleties that may not be apparent in point summaries.
ggplot2 in the tidyverse provides a convenient way to visualise data using the Grammar of Graphics. We essentially build up our plot from a series of layers of visual elements.
This consists of a few parameters that must be defined: the data we’re plotting, and a mapping of variables onto aesthetics (e.g. which variables go on the x and y axes).
With only these parameters we will get a blank background with our x and y variables on the axis labels. However, we won’t have any data within the plot. Why not? Because we need to explain how the data should be presented and add this as a new layer. We do so using geom_XXX(), where we can define different geoms (read: geometric objects) by replacing XXX with something like point, bar, or boxplot to get different types of plots.
Here’s a basic scatterplot of the sports_dat data:
ggplot(
data = sports_dat,
mapping = aes(x = weight, y = height)
) +
geom_point() + # dots
geom_smooth(formula = y ~ x, method = "lm")
We can change the axis labels, ticks, legend, and general aesthetics of the plot using different arguments (which we won’t cover here).
However, the main message from this plot is that there seems to be a linear relationship between weight and height, such that as weight increases, so too does height (on average). Notice that we fitted a line of best fit to our observations using the geom_smooth() function. This takes a formula using R’s concise syntax and fits a linear model to the data (from which we will get coefficients later on). Nicely, it also draws a ribbon showing the 95% confidence interval around this estimate. This will be defined more properly later, but it essentially gives a measure of uncertainty in the estimate from the line of best fit.
Our next step should be to describe the patterns within our data. We can do so using descriptive statistics.
Here, we take our data and use the pipe function %>% from magrittr, part of the tidyverse, which allows us to chain functions together. We then summarise the data by creating columns for the means and SDs of the heights and weights, and we use the function n() to count the number of observations.
sports_dat %>%
summarise(
mean_weight = mean(weight),
sd_weight = sd(weight),
mean_height = mean(height),
sd_height = sd(height),
n = n()
)
## # A tibble: 1 x 5
## mean_weight sd_weight mean_height sd_height n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 89.9 14.0 200. 6.84 200
Notice that we use the mean here as we have no reason to assume that the data are skewed in some way, and the plot reveals that there are no outliers to further bias our estimates.
Optional: With missing data, n() can be misleading because it counts rows including NAs. Often, it’s better to take your outcome variable and count the non-missing observations with sum(!is.na(x)), where x is the label of your outcome variable. Additionally, mean() returns NA if you have missing data. You can explicitly ignore NAs in the calculation with mean(x, na.rm = TRUE), where again x is the label of your outcome variable.
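For example (the missing value here is appended purely for illustration):
height_with_na <- c(sports_dat$height, NA) # add a hypothetical missing value
mean(height_with_na) # returns NA
mean(height_with_na, na.rm = TRUE) # ignores the NA
sum(!is.na(height_with_na)) # number of non-missing observations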
Here, we will focus on modelling our data using a frequentist approach. This is the most common approach to statistics in most of the social sciences, and is computationally easy to perform. Bayesian statistics requires some further overheads and is more computationally expensive, but in my opinion is often worth the effort.
Now it’s time to fit a simple model to our data. We want to know whether or not the trend in our sample will scale up to the general population (of all basketball and football players). Here, we assume that there is a linear relationship between height and weight, such that as weight increases, height increases by some constant value. We do this by using a formula structure that is like linear algebra. We estimate that height is predicted by weight as follows:
In R: y ~ x, here height ~ weight.
In algebra: \(height = \alpha + \beta_w \times weight + e\), where \(\alpha\) is the intercept, \(\beta_w\) is the slope for weight, and \(e\) is the residual error.
This way, we’ll determine both the predicted height when weight is 0 (quite nonsensical in this case) and how much height increases, in cm, for each kg increase in weight.
We fit this model using the lm() function in R and see the summary of model coefficients using summary():
hw_mod <- lm(height ~ weight, sports_dat)
summary(hw_mod)
##
## Call:
## lm(formula = height ~ weight, data = sports_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1535 -2.7274 -0.1339 2.6274 10.4438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 163.34871 1.78692 91.41 <2e-16 ***
## weight 0.40372 0.01965 20.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.872 on 198 degrees of freedom
## Multiple R-squared: 0.6807, Adjusted R-squared: 0.6791
## F-statistic: 422.2 on 1 and 198 DF, p-value: < 2.2e-16
This model by default is fitted using a frequentist approach to modelling. This means that not only do we estimate what the true parameters are for the relationship between height and weight, but we also get test statistics such as the t and p-value.
The t-value is compared against a t distribution rather than a normal distribution, which is a little more conservative with small samples (but converges to the normal with large samples). This just allows our inferences to be more accurate out of sample.
We then get a measure of how much of the t distribution lies at or beyond our observed test statistic if there were no relationship between our variables (i.e. if the true value of the intercept or the slope for weight were 0). This is called the p-value.
Here, the p-value is very small (<.001) for both the intercept and the slope. For the intercept, this means that at a weight of 0kg, height is reliably non-zero. This doesn’t actually make much sense here. We can fix this by normalising our data, but we won’t bother as we’re mainly interested in the slope: the linear increase in height by weight. Its p-value is also very small, meaning that if we assume there is no relationship between height and weight, the result we have would be very rare (i.e. its probability is <.001).
We can then confidently reject the null hypothesis that there is no relationship between height and weight (and by implication argue for support for our alternative hypothesis that there is a relationship). If we do this, we will only be wrong to reject this null hypothesis at a rate described by our cut-off for the p-value. Typically this is 0.05 in social science.
Notice though that we have no measure of how accurate our alternative model is, i.e. that there is a linear relationship between height and weight. This is a subtle thing to think about, but with this method we only control our error rate in rejecting the null hypothesis (and other error rates if we do further checks). We cannot say whether our alternative hypothesis is likely true. Indeed, it could be that we can reject the null hypothesis, but some other alternative hypothesis is at play that we didn’t think about. For example, there could be both a linear and a quadratic relationship between the variables.
What can we conclude here? First, we have estimates of the parameters, i.e. the population values. We can work out the most likely heights given someone’s weight. For example, what is the best estimate of height given a weight of 100kg?
# access model coefficients
intercept <- coef(summary(hw_mod))[1, 1] # row 1, column 1 of the coefficients
slope <- coef(summary(hw_mod))[2, 1] # row 2, column 1 of the coefficients
# predicted height = height at 0kg + 100 times the per-kg increase in height
intercept + (slope * 100)
## [1] 203.7204
Check the plot above. This seems to correspond well with the parameter estimate we just calculated. But let’s return to the ribbon around the line. This is the 95% confidence interval. Essentially, this interval is calculated such that 95% of all intervals constructed this way will contain the true parameter value. We can get this by using the confint() function on our model object in R.
confint(hw_mod)
## 2.5 % 97.5 %
## (Intercept) 159.8248773 166.8725429
## weight 0.3649698 0.4424646
This gives you the lower and upper bounds of the confidence interval around each parameter estimate. For example, height may plausibly increase with weight by as little as 0.36 or as much as 0.44 cm per kg, rather than the point estimate of approximately 0.40 in the main model.
So, we have now both estimated parameters and run a null hypothesis significance test for the presence of an effect, i.e. that weight predicts height. We also know the uncertainty in our estimates.
We know though that our sample is made up of both basketball players and football players. Maybe we should model that directly.
Let’s first look at how the data look grouped by sport. Here, we will add sport to the colour mapping in our plot code. This pushes the mapping down to the geom_point() and geom_smooth() layers, so we get two sets of coloured points and lines indicating the sport played by the people in the sample. We also get a legend labelling the colours for us.
ggplot(
data = sports_dat,
mapping = aes(x = weight, y = height, colour = sport)
) +
geom_point() + # dots
geom_smooth(formula = y ~ x, method = "lm") # line of best fit
ggplot picks default colours, labels, axes, ticks, legend placement, text, etc. for us, but again we can readily change these using additional commands. However, this is beyond the scope of this session.
Our takeaway message from this plot should be that as weight increases so does height, but people who play basketball on average have a baseline of higher heights and weights than those who play football. Additionally, the rate of increase in height by weight within groups might differ.
We can next summarise our data with the added grouping of sport by adding the group_by() command from dplyr to our summary code from above. This simply splits your data by the grouping variable you provide and gives a summary for each group.
sports_dat %>%
group_by(sport) %>%
summarise(
mean_weight = mean(weight),
sd_weight = sd(weight),
mean_height = mean(height),
sd_height = sd(height),
n = n()
)
## # A tibble: 2 x 6
## sport mean_weight sd_weight mean_height sd_height n
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 basketball 99.7 9.83 205. 3.78 100
## 2 football 80.0 10.0 194. 4.15 100
We can see that on average basketball players are both heavier and taller than football players in our sample. This aligns well with what we can see in the plot. Next up, we should model our data. Let’s add a categorical predictor to our model.
We will use the same code as before; however, our formula is now changed from height ~ weight (height is predicted by weight) to height ~ weight * sport, indicating that height varies by weight, by the sport played, and by their interaction.
Before we do this though, it’s important to know how R handles modelling with character/factor predictors. R detects that you want to model your group as a factor with levels (i.e. sport with levels of basketball and football). As with our numerical predictor, R simply assigns a number to these levels. However, it can’t read minds. So, by default, it assigns the value 0 to the label of the factor which comes first in the alphabet. Basketball comes before football, so R gives these the values of 0 and 1 respectively. However, remember that in our data we have assigned basketball a 1 and football a 0. To keep our estimates as similar as possible to our prior model, we’ll force R to give basketball the 1 and football the 0.
Your choice of coding can have important ramifications for what your parameter estimates represent. So, always be sure to check how your factors are coded and that you fully understand what your parameter estimates mean. This is especially important when your factors have more than 2 levels or when you have multiple categorical variables/factors.
sports_dat$sport <- factor(sports_dat$sport) # force this variable to be a factor
# get "treatment" contrasts for a 2 level variable and
# make the second level (football) be the baseline parameter (base = 2)
contrasts(sports_dat$sport) <- contr.treatment(2, base = 2)
# inspect the factor contrasts to ensure it worked
contrasts(sports_dat$sport)
## 1
## basketball 1
## football 0
We will then model our data as we described above. Our parameter estimates should have similar meanings as with the previous model.
hwp_mod <- lm(height ~ weight*sport, sports_dat)
summary(hwp_mod)
##
## Call:
## lm(formula = height ~ weight * sport, data = sports_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1905 -2.1738 -0.0141 2.3648 7.1614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 170.58312 2.52601 67.531 < 2e-16 ***
## weight 0.29349 0.03131 9.372 < 2e-16 ***
## sport1 15.44379 4.07772 3.787 0.000202 ***
## weight:sport1 -0.10132 0.04474 -2.264 0.024638 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.125 on 196 degrees of freedom
## Multiple R-squared: 0.7941, Adjusted R-squared: 0.7909
## F-statistic: 251.9 on 3 and 196 DF, p-value: < 2.2e-16
Now, we have an estimate for the height at the baseline, i.e. for football players where weight is 0, which is approximately 170.58cm. We also have an estimate of the increase in height by weight for the baseline (football) group, which is approximately 0.29cm per kg. Next, we see that as sport changes from football to basketball, height increases by approximately 15.44cm. Finally, we have an interaction between weight and sport, such that the slope for weight decreases by approximately 0.10 for basketball players (to roughly 0.19cm per kg). Their increase in height by weight isn’t as steep as that for football players.
We also can see that all of these effects are statistically significant such that if these parameter estimates had true values of 0 in the population, our results would be very unlikely to arise due to chance.
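If you’d like to check what the fitted model implies, one option (not part of the session code, but built into R) is the predict() function, asking for the predicted height of a 100kg player in each sport:
# predicted heights at 100kg for each sport from the interaction model
predict(
  hwp_mod,
  newdata = data.frame(weight = 100, sport = c("basketball", "football"))
)
This should come out at roughly 205cm and 200cm, which you can verify by adding up the relevant coefficients above.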
There are also ways to explore interactions when, for example, we have multiple categorical predictors. One nice package for this is the emmeans package, which allows for comparisons between different combinations of factor levels.
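If you have emmeans installed, a minimal sketch looks something like this; it estimates the marginal mean height for each sport (holding weight at its average) and compares them:
# if(!require(emmeans)) {install.packages("emmeans")} # optionally install it
library(emmeans)
emmeans(hwp_mod, pairwise ~ sport) # marginal means per sport plus their pairwise difference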
You may have noticed some dodgy statistics going around (some of it by government bodies) during the Covid-19 crisis. In some cases, this comes down to making erroneous assumptions about how data are modelled.
We can fit our data assuming not a Gaussian distribution, but an exponential one. This works well when we expect a massive uptick in growth; the number of infections in a pandemic as time goes on seems to fit this nicely on a local scale. However, the exponential will keep growing: the model is powerful within the range of data that you have, but it can predict very poorly out of sample.
First, we can simulate some data imagining exponential growth in some outcome. We’ll make an example by building a parabola. This is maybe not the best example, as the model then accurately fits the data generation process, but with censored data it at least follows a local exponential-like trend similar to the real-world example above.
# make a parabola with some random noise
dodgy_timecourse_exponential <- tibble(
time = seq(1, 200),
number = time^2 + 6 + rnorm(200, 0, 100)
)
We can then plot this and see how the model behaves outside the observed data.
ggplot(dodgy_timecourse_exponential, mapping = aes(x = time, y = number)) +
geom_point() +
geom_smooth(formula = y ~ poly(x, 2), method = "lm", fullrange = TRUE) +
scale_x_continuous(limits = c(0, 300), breaks = c(0, 300)) +
scale_y_continuous(limits = c(0, 60000))
You can see here that the number of cases, according to our model, will never decrease. This y-axis scale could go on forever.
The same can be said of the use of polynomial functions in modelling growth in infections over time. Using not just a linear term, but orthogonal linear and quadratic polynomials, can allow for curvature in the growth. However, again, these functions can go wrong out of sample. In this case, the model can predict negative infections just a few days after a massive number of infections, given the data we have.
Let’s make another example by building a parabola. Again, similar limitations on this simulation apply, but with censored data it follows a local polynomial trend similar to the real-world example above.
# make a parabola with some random noise
x <- seq(from = -100, to = 100)
y <- -0.25*x^2 + 6 + rnorm(n = length(x), mean = 0, sd = 50)
# put it together in a censored fashion
dodgy_timecourse_polynomial <- tibble(
time = seq(from = 1, to = 110),
number = y[1:110]
)
# scale observations to be strictly positive
dodgy_timecourse_polynomial$number <-
dodgy_timecourse_polynomial$number - min(dodgy_timecourse_polynomial$number)
Now, we can fit the data using a polynomial function of time plus time squared. This allows a curve in the linear model. But look at what the model does beyond the data we have. Crucially, in real-world examples where we know the data can’t go below 0, this would be a bad model.
ggplot(dodgy_timecourse_polynomial, mapping = aes(x = time, y = number)) +
geom_point() +
geom_smooth(formula = y ~ poly(x, 2), method = "lm", fullrange = TRUE) +
scale_x_continuous(limits = c(0, 300), breaks = c(0, 300)) +
scale_y_continuous(limits = c(-1000, 3000))
Knowing the limits of your model and where it can go wrong requires some thought about what your model predicts. Even our linear model of height and weight predicts a tall person who weighs nothing! Using real-world knowledge and domain expertise allows us to use these models well, while remaining skeptical of the parts that might not make sense.
What are some assumptions of the general linear model?
Linear relationship: your variables must be linearly related. Simply check the scale of the data here; Likert scales are likely inappropriate.
No auto-correlation: your observations for your dependent variable must not impact one another. When using mean scores, this is generally met. If plotting change over time, this is often violated.
No perfect multicollinearity: variables shouldn’t be perfectly related to one another. For example, don’t model both height in cm and height in feet in the same model. Check with correlations if unsure.
Normality: your residuals must be normally distributed. Check this by plotting your residuals with a histogram or density plot.
Homoscedasticity (homogeneity of variance): the variance of your residuals should be equal across the regression line. You shouldn’t, for example, get much more variability in scores at certain parts of the line. (A quick sketch of checking the last two assumptions follows below.)
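Here is a quick sketch of checking those last two assumptions on our sports model; base R’s resid() and fitted() functions pull out what we need:
hist(resid(hwp_mod)) # roughly bell-shaped? residuals look normal
plot(fitted(hwp_mod), resid(hwp_mod)) # even spread around 0 across the range? homoscedastic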
R4Psych: A 10-lesson course and 10-chapter book on using R for research from the perspective of a psychologist. This is appropriate for anyone in the applied social sciences.