If you have a local machine you can install programs on, you may want to install R and RStudio, an IDE for working with R that makes many tasks easier to manage and has version control integration via Git. However, if you don’t have a local machine, please make an account at RStudio Cloud. This provides an online interface to RStudio that is managed by the makers of RStudio. If you have any issues with your local install, this is always a reliable option.

Once installed, please download the content for this session from GitHub. If you have RStudio installed, double-clicking the `ioc_quickstats.Rproj` file will open RStudio and set your file paths to be relative to the `ioc_quickstats` folder. This is important for ensuring that things like loading data happen without any problems. Generally, it’s a good idea to have an RProject associated with any project you’re carrying out for this reason: it makes collaboration easier. (To start a new project in RStudio, just use File –> New Project.)

R comes with a suite of pre-installed packages that provide functions for many common tasks. However, we’ll download a popular package from CRAN, the Comprehensive R Archive Network, which acts as quality control for R packages. We’ll use the `tidyverse` package, a collection of packages that simplify wrangling, summarising, and plotting data. Run the following chunk to check whether you already have it installed and to install it if not. Then be sure to load it in every new session.

```r
if (!require(tidyverse)) {install.packages("tidyverse")} # install it if missing
library(tidyverse) # load it; run this every session
```

We’ll also use the `here` package. This makes working with file paths across different systems relatively easy (no more switching between `/` and `\`).

```r
if (!require(here)) {install.packages("here")} # install it if missing
library(here) # load it; run this every session
```

Finally, for reproducibility, we’ll set a seed for our random number generation. Why? This just allows you to get the same numbers as me when we simulate “random” data.

```r
set.seed(1000)
```
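To see what the seed does, here is a quick sketch (not part of the session materials): resetting the seed replays exactly the same sequence of “random” draws.

```r
# Setting the same seed before each draw makes "random" numbers reproducible
set.seed(1000)
first_draw <- runif(3)  # three random numbers between 0 and 1

set.seed(1000)          # reset the seed to the same value
second_draw <- runif(3) # same seed, so the same three numbers

identical(first_draw, second_draw) # TRUE
```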

What do we want to know from our data? Your philosophy of science dictates how you’ll analyse and interpret your data, so having a background in common approaches is useful. Here, we’ll cover two popular approaches to statistics. First, however, it’s important to understand some basics of probability and sampling data.

If we have the whole population, we don’t need inferential statistics.

These are used only when we want to draw inferences about the population based on our sample.

e.g. Are basketball players in your local club taller than football players in your local club? Just measure them, there aren’t that many people in the clubs and the data are easily accessible.

e.g. Are basketball players in the population generally taller than football players? Get a representative sample of basketball and football players from the population, measure their heights, then use inferential statistics to come to a conclusion.

We use the sample to draw inferences about the entire population. How we do so depends on our definition of probability or the types of questions we’d like to answer.

There are many approaches to statistical inference. Two of the most commonly used approaches are Neyman-Pearson frequentist statistics (often using null-hypothesis significance testing), and Bayesian inference (often using estimation or model comparison approaches).

**Frequentist (Neyman-Pearson) statistics:**

- Approaches probability from an “objective” standpoint.
- Probability refers to the likelihood of an event vs. a collection of events.
- Concerned with **long-run error control**.

e.g. Get 7 heads out of 10 coin flips. Is the coin fair? Assume it is, i.e. P(heads) = 0.5, and estimate how often you would get 7 heads in 10 flips across an infinite set of flips. If very rare under a pre-defined cutoff (e.g. 5%), we reject the hypothesis that the coin is fair.

Using this approach, we will only make an incorrect decision (e.g. false rejection, false acceptance of hypothesis) at a known and controlled rate.
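The coin example can be sketched with base R’s exact binomial test, `binom.test()`, which computes how often a result at least as extreme as 7/10 heads would occur if the coin were fair:

```r
# Exact binomial test: 7 heads in 10 flips, assuming a fair coin (P(heads) = 0.5)
result <- binom.test(x = 7, n = 10, p = 0.5)

result$p.value # 0.34375: 7 heads in 10 flips is not rare under a fair coin
```

Since 0.34 is well above a 5% cutoff, we fail to reject the hypothesis that the coin is fair.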

**Bayesian inference:**

- Approaches probability from a “subjective” standpoint.
- Probability refers to the degree of belief in a hypothesis.
- Concerned with **maximal use of available data** to understand how beliefs should change.

e.g. A coin gets 7 heads out of 10 flips. Which probabilities of getting heads are most plausible (e.g. 0.6, 0.7, 0.8)? We can estimate this for a range of probabilities or to a degree of credibility (i.e. I’m 90% certain it is between 0.6 and 0.8).

Using this approach, we often get estimates close to the true probabilities. We can also incorporate our prior beliefs about the data in the model. This is useful when data are hard or expensive to get.
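As a sketch (not part of the session materials), a simple grid approximation in base R shows how 7 heads in 10 flips update a flat prior over P(heads):

```r
# Grid approximation of the posterior for P(heads) after 7 heads in 10 flips
p_grid     <- seq(from = 0, to = 1, by = 0.01)    # candidate probabilities
prior      <- rep(1, length(p_grid))              # flat prior: all values equally plausible
likelihood <- dbinom(7, size = 10, prob = p_grid) # probability of the data at each candidate
posterior  <- likelihood * prior
posterior  <- posterior / sum(posterior)          # normalise so it sums to 1

p_grid[which.max(posterior)] # 0.7 is the most plausible value of P(heads)
```

Summing the normalised posterior over a range of `p_grid` values gives the credibility statements described above (e.g. the probability that P(heads) lies between 0.6 and 0.8).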

If we aim to describe our sample and draw inferences from it that apply to a population, we often need to make an assumption about the **sampling distribution** of the data.

Many “standard” approaches to statistical inference assume data are drawn from a Gaussian/normal distribution. Why? Some explanations (and an example) are provided by Richard McElreath (2016):

- Many processes that arise in nature tend towards a Gaussian distribution, so it tends to be a good fit.
- When we are ignorant as to how the data arise, the Gaussian is a conservative choice: it has maximum entropy.

Imagine lining up at the centre line of a football field. Flip a coin 20 times. Every time it lands heads, step left. Every time it lands tails, step right. The distance of the step taken is allowed to vary between -1 and 1 yard.

We’ll demonstrate this using a few inbuilt functions in R: `replicate()`, which repeats a process a set number of times (here 1000), and `runif()`, which gives n draws from a uniform distribution between set minimum and maximum values (here -1 and 1 yard). Finally, we sum the values together using `sum()` and assign the 1000 summed walks to a variable called `positions`.

```r
# get 1000 people, sum up their 20 flips of heads or tails
positions <- replicate(1000, sum(runif(n = 20, min = -1, max = 1)))
```

I’ve plotted the density of the data at each distance from the middle line (0) against the normal distribution below.
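One way to produce such a plot in base R (a sketch of the idea; the session’s actual figure may be built with `ggplot2`) is to draw the density of `positions` and overlay a normal curve with the same mean and standard deviation:

```r
# Simulate the 1000 random walks, then plot their density against a normal curve
set.seed(1000)
positions <- replicate(1000, sum(runif(n = 20, min = -1, max = 1)))

plot(density(positions),
     main = "Position after 20 random steps",
     xlab = "Distance from the centre line (yards)")
curve(dnorm(x, mean = mean(positions), sd = sd(positions)),
      add = TRUE, lty = 2) # dashed line: normal with matching mean and SD
```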