Introduction to R

Summary of Week 1 of the FutureLearn course on Introduction to R for Data Science (Purdue University).

I used R to analyse the data for my dissertation, but I have become rusty since then (as I found out when doing the data analysis MOOC), and this is my attempt to keep my skills sharp!

Why R? R is an industry-standard tool in the field of data science, and allows for robust data analysis. Week 1 went through the basics of importing data into R, extracting the head and tail of a data set, creating subsets, and identifying properties of the data set.


Endogeneity, Instrumental Variables, and Experimental Design

Review of Module 11 of Data Analysis for Social Scientists (MITx, edX) – Intro to Machine Learning and Data Visualisation

Endogeneity problems can occur when there is simultaneous causality (i.e. the outcome variable also affects the regressor of interest). One example is health and exercise: exercise may improve health, but healthier people may also be more able to exercise.

Instrumental variables are a way to indirectly measure causal relationships. For example, randomly assigned scholarships can be used as an instrument for education. One challenge with using instrumental variables is that the instrument must affect the outcome only through the regressor of interest and not directly (the exclusion restriction). For example, it can be argued that scholarships create confidence, which then, together with years of education, increases test scores. In that case the scholarship would affect test scores through a channel other than education.
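To convince myself of how an instrument works, here is a minimal sketch in Python (rather than R), using only the standard library. Everything in it is made up for illustration: the simulated data, the coefficients, and the helper names `simulate`, `ols_slope` and `iv_slope`. It assumes the scholarship is randomly assigned and affects test scores only through education, which is exactly the assumption questioned above.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


def simulate(n=20000, true_effect=2.0, seed=0):
    """Made-up data: unobserved ability drives both education and test scores."""
    rng = random.Random(seed)
    scholarship, education, score = [], [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)   # unobserved confounder
        z = rng.randint(0, 1)       # randomly assigned scholarship (the instrument)
        x = 10 + 2 * z + ability + rng.gauss(0, 1)                # years of education
        y = 50 + true_effect * x + 5 * ability + rng.gauss(0, 1)  # test score
        scholarship.append(z)
        education.append(x)
        score.append(y)
    return scholarship, education, score


def ols_slope(x, y):
    """Slope from regressing y on x directly."""
    return cov(x, y) / cov(x, x)


def iv_slope(z, x, y):
    """Effect of the instrument on the outcome divided by its effect on the regressor."""
    return cov(z, y) / cov(z, x)


z, x, y = simulate()
print("OLS slope:", round(ols_slope(x, y), 2))    # pushed upwards by ability
print("IV slope: ", round(iv_slope(z, x, y), 2))  # should land near true_effect
```

In this simulation the direct regression overstates the effect of education because ability is left out, while the instrument recovers something close to the value I built in.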

When designing experiments, things to think about are: what is being randomised; who is being randomised; how randomisation is introduced; and how many units are being randomised. Randomisation can be simple, stratified or clustered. Experimental designs include phase-in, randomising at the cutoff, encouragement design, etc.
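As an illustration of stratified randomisation, here is a small Python sketch (standard library only). The function name `stratified_assignment`, the schools and the urban/rural strata are all hypothetical. The idea is simply to shuffle the units within each stratum and treat half of them, so that treatment and control are balanced on the stratifying variable.

```python
import random
from collections import defaultdict


def stratified_assignment(units, seed=0):
    """units: list of (unit_id, stratum) pairs.

    Returns a dict mapping each unit_id to "treatment" or "control",
    with half of every stratum (rounded down) assigned to treatment.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit_id, stratum in units:
        strata[stratum].append(unit_id)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)            # random order within the stratum
        half = len(members) // 2
        for unit_id in members[:half]:
            assignment[unit_id] = "treatment"
        for unit_id in members[half:]:
            assignment[unit_id] = "control"
    return assignment


# Hypothetical example: eight schools, stratified by location.
schools = [("A", "urban"), ("B", "urban"), ("C", "urban"), ("D", "urban"),
           ("E", "rural"), ("F", "rural"), ("G", "rural"), ("H", "rural")]
assignment = stratified_assignment(schools)
print(assignment)
```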

Practical issues and omitted variable bias

Review of Module 9 of Data Analysis for Social Scientists (MITx, edX) – Practical Issues in Running Regressions, and Omitted Variable Bias

It has been challenging to fully understand the technical concepts taught in this course, as well as use R to complete the homework, given my intense workload. So, I have settled for understanding the general ideas, and I hope to revisit this (or a similar course) again in the future. Anyway, since I have already gained enough credit to pass, there is no need to work too hard 😛 (joking!)

A random selection of what I learnt this week…

Most statistical packages will provide the F-test (null hypothesis: all slope coefficients are jointly zero) and the t-test (null hypothesis: an individual coefficient is zero). The standard error is also provided – this allows the confidence interval of the coefficient to be constructed.
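For my own reference, this is roughly how the confidence interval and t-statistic follow from a coefficient and its standard error. It is a sketch in Python with made-up numbers (an estimate of 0.8 and a standard error of 0.25), and it assumes the large-sample critical value of 1.96 for a 95% interval; with small samples a package would use the t distribution instead.

```python
def confidence_interval(estimate, std_error, critical_value=1.96):
    """Interval of the form estimate +/- critical_value * standard error."""
    margin = critical_value * std_error
    return estimate - margin, estimate + margin


def t_statistic(estimate, std_error, null_value=0.0):
    """How many standard errors the estimate lies from the null value."""
    return (estimate - null_value) / std_error


# Hypothetical regression output: coefficient 0.8, standard error 0.25.
low, high = confidence_interval(0.8, 0.25)
t = t_statistic(0.8, 0.25)
print(round(low, 2), round(high, 2), round(t, 2))
```

If the interval excludes zero, the t-test rejects the hypothesis that the coefficient is zero at the corresponding significance level.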

Prof Esther discussed some practical issues in running regressions, including regressions with categorical variables, and interaction effects. With her examples I got a much better feel for how to interpret linear regressions.

It is possible to use a linear regression framework when the relationship between the independent and dependent variable is non-linear. For example, polynomial terms (such as the square of the regressor) can be added as extra regressors: the model is then non-linear in the variable but still linear in the coefficients. Regression discontinuities were also discussed.
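Here is a minimal Python sketch of that idea, using only the standard library. The helper names `solve` and `fit_polynomial` and the data are my own inventions. The data follow an exact quadratic with no noise, so least squares on the columns 1, x and x squared should recover the coefficients I used to generate them.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_polynomial(x, y, degree=2):
    """Least squares on the regressors 1, x, x^2, ... via the normal equations."""
    design = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up, noise-free data generated from y = 1 + 2x + 3x^2.
x = [-2, -1, 0, 1, 2, 3]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]
coefs = fit_polynomial(x, y)
print([round(c, 6) for c in coefs])   # intercept, coefficient on x, coefficient on x^2
```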

Finally, omitted variable bias occurs when a model incorrectly leaves out one or more important factors. The “bias” is created when the model compensates for the missing factor by over- or underestimating the effect of one of the other factors (Source: Wikipedia).
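A tiny worked example helped me see where the bias comes from. This is a Python sketch with made-up, noise-free data in which the outcome depends on both x and z, and z is correlated with x. If z is left out, the slope on x picks up the effect of z in proportion to how strongly z moves with x. The names `slope`, `short_slope` and `delta` are my own.

```python
def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Made-up data: the true model is y = 1 + 2x + 3z, and z is correlated with x.
x = [1, 2, 3, 4, 5, 6]
z = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * xi + 3 * zi for xi, zi in zip(x, z)]

short_slope = slope(x, y)     # regression that omits z
delta = slope(x, z)           # how z moves with x
predicted = 2 + 3 * delta     # true effect of x plus (effect of z) * delta

print(round(short_slope, 4), round(predicted, 4))
```

The two printed numbers agree, and both are larger than the true coefficient of 2: the regression without z overestimates the effect of x.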

Single and multivariate linear models

Review of Module 8 of Data Analysis for Social Scientists (MITx, edX) – Single and multivariate linear models

Estimating the parameters of joint distributions can be used for prediction, determining causality and just understanding the world better. In linear regression, the regression coefficients can be estimated by using least squares, least absolute deviations or reverse least squares. By performing an analysis of variance, we can get a measure of the goodness-of-fit of the regression obtained. Linear regressions can also be used for non-linear relationships.
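As a reminder to myself of what least squares and the analysis of variance actually compute, here is a short Python sketch for a single regressor. The data are made up and the function names `fit_line` and `r_squared` are my own. R squared is computed as one minus the residual sum of squares over the total sum of squares.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variation in y explained by the fitted line."""
    my = mean(y)
    total = sum((yi - my) ** 2 for yi in y)
    residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - residual / total


# Made-up data lying close to the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
r2 = r_squared(x, y, intercept, slope)
print(round(intercept, 3), round(slope, 3), round(r2, 4))
```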

In the lectures, Prof Sara discussed the single and multivariate linear models and their assumptions in detail, but I will not get into that here!

I was on the road this week and rushed to get the module done. So, I didn’t quite absorb everything, but I think the general concept of fitting relationships between variables is quite straightforward and the homework was not too challenging (unlike the week on Functions of Random Variables!).

Randomisation is not a substitute for thinking

As quoted from Prof Duflo during the lecture…

Review of Module 7 of Data Analysis for Social Scientists (MITx, edX) – Causality, Analysing Randomised Experiments, and Nonparametric Regression

The lectures were very interesting and I thought I had grasped the overall concepts. However, I found the homework was extremely difficult to work through. I think not having problem sets to work through makes it hard to be comfortable/confident with the calculations.

Here’s a summary of what I learnt from the lectures:


Causality

We make causal statements all the time. Causality may be thought of as the effect of manipulating a cause, where we compare our best approximation of what would have happened absent that cause with what actually happened. The Rubin causal model, for example, considers potential outcomes. This forces us to think about the counterfactual. The problem of causal inference is that at most one of the potential outcomes can be realised for each unit, which means we are missing a lot of data about the other potential outcomes. Complete randomisation eliminates selection bias (in expectation), which arises when there are underlying differences between those in the treatment group and those in the control group.
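To see selection bias in the potential outcomes framework, I put together a toy Python example. The six units and their potential outcomes are made up, and in real life we would never observe both columns. When the units with the best untreated outcomes select themselves into treatment, the naive comparison mixes the treatment effect with selection bias; averaged over all possible random assignments, the comparison recovers the true average effect.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


# Made-up potential outcomes for six units: y0 without treatment, y1 with it.
# The treatment effect is 3 for every unit.
y0 = [10, 12, 14, 20, 22, 24]
y1 = [13, 15, 17, 23, 25, 27]
units = range(len(y0))


def diff_in_means(treated):
    """Observed treated mean minus observed control mean for a given assignment."""
    t = [y1[i] for i in treated]
    c = [y0[i] for i in units if i not in treated]
    return mean(t) - mean(c)


# Self-selection: the three units with the highest y0 take the treatment.
self_selected = {3, 4, 5}
naive = diff_in_means(self_selected)
selection_bias = (mean([y0[i] for i in self_selected])
                  - mean([y0[i] for i in units if i not in self_selected]))

# Complete randomisation: average over every way of treating three of the six units.
randomised = mean([diff_in_means(set(c)) for c in combinations(units, 3)])

print(naive, selection_bias, randomised)
```

The naive difference equals the true effect plus the selection bias, while the average over random assignments equals the true effect of 3.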

Analysing randomised experiments

Even without any knowledge of regression, it is easy to analyse completely randomised experiments using the Fisher exact test or Neyman’s approach. In designing an experiment, the power calculation helps us determine the sample size required, although there are many assumptions involved!
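Here is my rough understanding of the two approaches as a Python sketch, with made-up outcomes for four treated and four control units. The Fisher part assumes the sharp null that the treatment has no effect on any unit, so every possible assignment can be enumerated and compared with the observed difference in means. The Neyman part takes the difference in means as the estimate, with a standard error built from the two sample variances. The function names are my own.

```python
from itertools import combinations
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def fisher_p_value(treated, control):
    """Share of all possible assignments whose absolute difference in means
    is at least as large as the observed one, under the sharp null."""
    pooled = treated + control
    observed = abs(mean(treated) - mean(control))
    extreme = total = 0
    for chosen in combinations(range(len(pooled)), len(treated)):
        chosen = set(chosen)
        t = [pooled[i] for i in chosen]
        c = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(mean(t) - mean(c)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


def neyman_estimate(treated, control):
    """Difference in means and its (conservative) standard error."""
    diff = mean(treated) - mean(control)
    se = sqrt(sample_variance(treated) / len(treated)
              + sample_variance(control) / len(control))
    return diff, se


# Made-up outcomes.
treated = [7.0, 9.0, 8.0, 10.0]
control = [5.0, 6.0, 7.0, 4.0]

p_value = fisher_p_value(treated, control)
diff, se = neyman_estimate(treated, control)
print(round(p_value, 4), diff, round(se, 4))
```

With only eight units there are 70 possible assignments, so the enumeration is instant; for larger experiments one would sample assignments at random instead.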

Randomised controlled trials (RCTs) are considered the gold standard and have traditionally been used in clinical trials. In practice, there are incentives for selective reporting. Mitigating solutions include registries and pre-analysis plans. RCTs also have a long history in the social sciences, and randomisation is also used in web design and marketing.

Non-parametric regressions

Kernel regression is one common way to estimate the relationship between two variables without assuming a particular functional form.
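A minimal Python sketch of one version of kernel regression follows. It assumes a Gaussian kernel and the weighted-average (Nadaraya-Watson) form; I am not sure this is exactly the estimator used in the lectures. The data, the bandwidth of 1.0 and the function names are made up. The prediction at a point is a weighted average of the observed outcomes, with more weight on observations whose x is close to that point.

```python
from math import exp


def gaussian_kernel(u):
    """Unnormalised Gaussian weight; the constant cancels in the ratio below."""
    return exp(-0.5 * u * u)


def kernel_regression(x_train, y_train, x0, bandwidth):
    """Weighted average of y_train, weighting points by their closeness to x0."""
    weights = [gaussian_kernel((x0 - xi) / bandwidth) for xi in x_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)


# Made-up data from a curved relationship.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [0.0, 1.0, 4.0, 9.0, 16.0]

for x0 in [0.5, 2.0, 3.5]:
    print(x0, round(kernel_regression(x_train, y_train, x0, bandwidth=1.0), 3))
```

A smaller bandwidth follows the data more closely and a larger one smooths more. Far away from all the data the weights can underflow to zero, which this sketch does not guard against.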