If you have a local machine you can install programs on, you may want to install R and RStudio, an IDE for working with R that makes many tasks easier to manage and has version control integration via Git. However, if you don’t have a local machine, please make an account at RStudio Cloud. This provides an online interface to RStudio that is managed by the makers of RStudio. If you have any issues with your local install, this is always a reliable option.

Once installed, please download the content for this session from GitHub. If you have RStudio installed, double-clicking the `ioc_quickstats.Rproj` file will open RStudio and set your file paths to be relative to the `ioc_quickstats` folder. This is important for ensuring that things like loading data happen without any problems. Generally, it’s a good idea to have an RProject associated with any project you’re carrying out for this reason: it makes collaboration easier. (To start a new project in RStudio, just use File –> New Project.)

R comes with a suite of pre-installed packages that provide functions for many common tasks. However, we’ll download a popular package from CRAN, the Comprehensive R Archive Network, which acts as quality control for R packages. We’ll use the `tidyverse` package, a collection of packages that simplify wrangling, summarising, and plotting data. Run the following chunk to check whether you already have it installed and to install it if not. Then be sure to load it in every new session.

```r
if (!require(tidyverse)) {install.packages("tidyverse")} # install it if missing
library(tidyverse) # load it; run this every session
```

We’ll also use the `here` package. This makes working with file paths across different systems relatively easy (no more switching between `/` and `\`).

```r
if (!require(here)) {install.packages("here")} # install it if missing
library(here) # load it; run this every session
```

Finally, for reproducibility, we’ll set a seed for our random number generation. Why? This just allows you to get the same numbers as me when we simulate “random” data.

```r
set.seed(1000)
```
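To see what the seed does, here is a quick sketch (not part of the session materials): resetting the seed replays exactly the same sequence of “random” draws.

```r
# Setting the same seed before each draw makes "random" numbers reproducible
set.seed(1000)
first_draw <- runif(3)  # three random numbers between 0 and 1

set.seed(1000)          # reset the seed to the same value
second_draw <- runif(3) # same seed, so the same three numbers

identical(first_draw, second_draw) # TRUE
```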

What do we want to know from our data? Your philosophy of science dictates how you’ll analyse and interpret your data, so having a background in common approaches is useful. Here, we’ll cover two popular approaches to statistics. First, however, it’s important to understand some basics of probability and sampling data.

If we have the whole population, we don’t need inferential statistics.

These are used only when we want to draw inferences about the population based on our sample.

e.g. Are basketball players in your local club taller than football players in your local club? Just measure them, there aren’t that many people in the clubs and the data are easily accessible.

e.g. Are basketball players in the population generally taller than football players? Get a representative sample of basketball and football players from the population, measure their heights, then use inferential statistics to come to a conclusion.

We use the sample to draw inferences about the entire population. How we do so depends on our definition of probability or the types of questions we’d like to answer.

There are many approaches to statistical inference. Two of the most commonly used approaches are Neyman-Pearson frequentist statistics (often using null-hypothesis significance testing), and Bayesian inference (often using estimation or model comparison approaches).

**Frequentist (Neyman-Pearson) statistics:**

- Approaches probability from an “objective” standpoint.
- Probability refers to the likelihood of an event vs. a collection of events.
- Concerned with **long-run error control**.

e.g. Get 7 heads out of 10 coin flips. Is the coin fair? Assume it is, i.e. P(heads) = 0.5, and estimate how often you would get 7 heads in 10 flips across an infinite set of flips. If very rare under a pre-defined cutoff (e.g. 5%), we reject the hypothesis that the coin is fair.

Using this approach, we will only make an incorrect decision (e.g. false rejection, false acceptance of hypothesis) at a known and controlled rate.
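The coin example can be sketched with base R’s exact binomial test, `binom.test()`, which computes how often a result at least as extreme as 7/10 heads would occur if the coin were fair:

```r
# Exact binomial test: 7 heads in 10 flips, assuming a fair coin (P(heads) = 0.5)
result <- binom.test(x = 7, n = 10, p = 0.5)

result$p.value # 0.34375: 7 heads in 10 flips is not rare under a fair coin
```

Since 0.34 is well above a 5% cutoff, we fail to reject the hypothesis that the coin is fair.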

**Bayesian inference:**

- Approaches probability from a “subjective” standpoint.
- Probability refers to the degree of belief in a hypothesis.
- Concerned with **maximal use of available data** to understand how beliefs should change.

e.g. A coin gets 7 heads out of 10 flips. Which probabilities of getting heads are most plausible (e.g. 0.6, 0.7, 0.8)? We can estimate this for a range of probabilities or to a degree of credibility (i.e. I’m 90% certain it is between 0.6 and 0.8).

Using this approach, we often get estimates close to the true probabilities. We can also incorporate our prior beliefs about the data in the model. This is useful when data are hard or expensive to get.
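As a sketch (not part of the session materials), a simple grid approximation in base R shows how 7 heads in 10 flips update a flat prior over P(heads):

```r
# Grid approximation of the posterior for P(heads) after 7 heads in 10 flips
p_grid     <- seq(from = 0, to = 1, by = 0.01)    # candidate probabilities
prior      <- rep(1, length(p_grid))              # flat prior: all values equally plausible
likelihood <- dbinom(7, size = 10, prob = p_grid) # probability of the data at each candidate
posterior  <- likelihood * prior
posterior  <- posterior / sum(posterior)          # normalise so it sums to 1

p_grid[which.max(posterior)] # 0.7 is the most plausible value of P(heads)
```

Summing the normalised posterior over a range of `p_grid` values gives the credibility statements described above (e.g. the probability that P(heads) lies between 0.6 and 0.8).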

If we aim to describe our sample and draw inferences from it that apply to a population, we often need to make an assumption about the **sampling distribution** of the data.

Many “standard” approaches to statistical inference assume data are drawn from a Gaussian/normal distribution. Why? Some explanations (and an example) are provided by Richard McElreath (2016):

- Many processes that arise in nature tend towards a Gaussian distribution, so it tends to be a good fit.
- When we are ignorant as to how the data arise, the Gaussian is a conservative choice: it has maximum entropy.

Imagine lining up at the centre line of a football field. Flip a coin 20 times. Every time it lands heads, step left. Every time it lands tails, step right. The distance of the step taken is allowed to vary between -1 and 1 yard.

We’ll demonstrate this using a few inbuilt functions in R: `replicate()`, which repeats a process a set number of times (here 1000), and `runif()`, which gives n draws from a uniform distribution between set minimum and maximum values (here -1 and 1 yard). Finally, we sum the values together using `sum()` and assign the 1000 summed walks to a variable called `positions`.

```r
# get 1000 people, sum up their 20 flips of heads or tails
positions <- replicate(1000, sum(runif(n = 20, min = -1, max = 1)))
```

I’ve plotted the density of the data at each distance from the middle line (0) against the normal distribution below.
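One way to produce such a plot in base R (a sketch of the idea; the session’s actual figure may be built with `ggplot2`) is to draw the density of `positions` and overlay a normal curve with the same mean and standard deviation:

```r
# Simulate the 1000 random walks, then plot their density against a normal curve
set.seed(1000)
positions <- replicate(1000, sum(runif(n = 20, min = -1, max = 1)))

plot(density(positions),
     main = "Position after 20 random steps",
     xlab = "Distance from the centre line (yards)")
curve(dnorm(x, mean = mean(positions), sd = sd(positions)),
      add = TRUE, lty = 2) # dashed line: normal with matching mean and SD
```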