1 Getting Started in R

If you have a local machine you can install programs on, you may want to install R and RStudio, an IDE for working with R that makes many tasks easier to manage and has version control integration via Git. However, if you don’t have a local machine, please make an account at RStudio Cloud. This provides an online interface to RStudio that is managed by the makers of RStudio. If you have any issues with your local install, this is always a reliable option.

Once installed, please download the content for this session from GitHub. If you have RStudio installed and double-click the ioc_quickstats.Rproj file, this will open RStudio and set your file paths to be relative to the ioc_quickstats folder. This is important for ensuring that things like loading data happen without any problems. Generally, it’s a good idea to have an RProject associated with any project you’re carrying out for this reason: relative paths make collaboration easier. (To start a new project, in RStudio just use File –> New Project).

R comes with a suite of pre-installed packages that provide functions for many common tasks. However, we’ll download a popular package from CRAN, the Comprehensive R Archive Network, which acts as a quality control for R packages. We’ll use the tidyverse package, which is a package of packages that simplify wrangling, summarising, and plotting data. Run the following chunk to check whether you already have it installed and to install it if not. Then be sure to load it in every new session.

if(!require(tidyverse)) {install.packages("tidyverse")} # install it if not already installed
library(tidyverse) # load it; run this every session.

We’ll also use the here package. This makes working with file paths across different systems relatively easy (no more switching between / and \).

if(!require(here)) {install.packages("here")} # install it if not already installed
library(here) # load it; run this every session.
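
As a quick sketch of how here() can be used (the folder and file names below are just hypothetical examples), you pass folder and file names as separate arguments and it builds a full path relative to the project root:

here("data", "example_data.csv") # a path to a (hypothetical) data file inside the project folder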

Finally, for reproducibility, we’ll set a seed for our random number generation. Why? This just allows you to get the same numbers as me when we simulate “random” data.

set.seed(1000)
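
As a quick demonstration (a minimal sketch, not part of the analysis), re-running the two lines below always returns the same three “random” numbers, because the seed is reset each time:

set.seed(1000) # reset the seed
rnorm(n = 3, mean = 0, sd = 1) # the same three numbers on every re-run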

2 Philosophy of Science

What do we want to know from our data? Your philosophy of science dictates how you’ll analyse and interpret your data, so having a background in common approaches is useful. Here, we’ll cover two popular approaches to statistics. First, however, it’s important to understand some basics of probability and sampling data.

3 Samples and Populations

If we have the whole population, we don’t need inferential statistics.

  • These are used only when we want to draw inferences about the population based on our sample.

  • e.g. Are basketball players in your local club taller than football players in your local club? Just measure them: there aren’t that many people in the clubs, and the data are easily accessible.

  • e.g. Are basketball players in the population generally taller than football players? Get a representative sample of basketball and football players from the population, measure their heights, then use inferential statistics to come to a conclusion.

We use the sample to draw inferences about the entire population. How we do so depends on our definition of probability or the types of questions we’d like to answer.
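To make this concrete, here is a minimal sketch using made-up numbers: we simulate a “population” of heights, draw a sample from it, and use the sample mean as an estimate of the population mean.

# simulate a hypothetical population of 10,000 heights (in cm)
population_heights <- rnorm(n = 10000, mean = 180, sd = 10)
# draw a sample of 50 people from that population
sample_heights <- sample(population_heights, size = 50)
mean(population_heights) # the population mean: no inference needed
mean(sample_heights) # the sample mean: an estimate of the population mean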

4 Probability and Inference in Many Flavours

There are many approaches to statistical inference. Two of the most commonly used approaches are Neyman-Pearson frequentist statistics (often using null-hypothesis significance testing), and Bayesian inference (often using estimation or model comparison approaches).

4.1 Frequentist Statistics:

  • Approaches probability from an “objective” standpoint.
  • Probability refers to the long-run frequency of an event relative to a (hypothetically infinite) collection of events.
  • Concerned with long-run error control.

e.g. You get 7 heads out of 10 coin flips. Is the coin fair? Assume it is, i.e. P(heads) = 0.5, and estimate how often you would get a result at least as extreme as 7 heads in 10 flips across an infinite set of repetitions. If such a result would occur less often than a pre-defined cutoff (e.g. 5% of the time), we reject the hypothesis that the coin is fair.
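
One way to run this kind of test in R is with the inbuilt binom.test(), which reports the probability of a result at least as extreme as 7 heads in 10 flips under the assumption of a fair coin:

# exact binomial test: 7 heads out of 10 flips, assuming P(heads) = 0.5
binom.test(x = 7, n = 10, p = 0.5)
# if the reported p-value is below our cutoff (e.g. .05), we reject the fair-coin hypothesis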

Using this approach, we will only make an incorrect decision (e.g. false rejection, false acceptance of hypothesis) at a known and controlled rate.

4.2 Bayesian Statistics:

  • Approaches probability from a “subjective” standpoint.
  • Probability refers to the degree of belief in a hypothesis.
  • Concerned with maximal use of available data to understand how beliefs should change.

e.g. A coin comes up heads 7 times out of 10 flips. Which probabilities of getting heads are most plausible (e.g. 0.6, 0.7, 0.8)? We can estimate this for a range of probabilities or express it as a degree of credibility (e.g. I’m 90% certain the probability of heads is between 0.6 and 0.8).
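
As a rough sketch of this idea (a simple grid approximation with a flat prior; not the only way to do this), we can score how well each candidate probability of heads predicts 7 heads out of 10 flips:

# candidate values for P(heads)
p_grid <- seq(from = 0, to = 1, by = 0.1)
# likelihood of 7 heads in 10 flips for each candidate value
likelihood <- dbinom(x = 7, size = 10, prob = p_grid)
# with a flat prior, the posterior is just the likelihood rescaled to sum to 1
posterior <- likelihood / sum(likelihood)
round(posterior, 2) # candidate values near 0.7 get the most credibility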

Using this approach, we often get the estimates that are closest to the true probabilities. We can also incorporate our prior beliefs into the model, which is useful when data are hard or expensive to collect.

5 Probability Distributions

5.1 The Gaussian Distribution

If we aim to describe our sample and draw inferences from it that apply to a population, we often need to make an assumption about the sampling distribution of the data.

Many “standard” approaches to statistical inference assume data are drawn from a Gaussian/normal distribution. Why? Some explanations (and an example) are provided by Richard McElreath (2016):

  • Many processes that arise in nature tend towards a Gaussian distribution, so it tends to be a good fit.
  • When we are ignorant as to how the data arise, the Gaussian is a conservative choice: it has maximum entropy.

Imagine lining up at the centre line of a football field. Flip a coin 20 times. Every time it lands heads, step left. Every time it lands tails, step right. The distance of each step is allowed to vary between 0 and 1 yard; in the simulation below, a step to the left is coded as a negative value and a step to the right as a positive value, so each step is a number between -1 and 1 yards.

We’ll demonstrate this using a few inbuilt functions in R: replicate(), which repeats a process a set number of times (here 1000); runif(), which draws random values from a uniform distribution, giving n draws between set minimum and maximum values (here -1 and 1 yards); and sum(), which adds the 20 steps together. We assign the 1000 summed positions to a variable called positions.

# simulate 1000 people, each taking 20 random steps, and record their final positions
positions <- replicate(1000, sum(runif(n = 20, min = -1, max = 1)))

I’ve plotted the density of the data at each distance from the middle line (0) against the normal distribution below.
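
If you’d like to recreate something similar yourself, one way (a sketch using ggplot2 from the tidyverse) is to plot the density of positions and overlay a normal curve with the same mean and standard deviation:

# density of the simulated positions with a matching normal curve overlaid (dashed)
ggplot(tibble(position = positions), aes(x = position)) +
  geom_density() +
  stat_function(
    fun = dnorm,
    args = list(mean = mean(positions), sd = sd(positions)),
    linetype = "dashed"
  )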