


How To Add Regression Line In Ggplot2

Those of you who know me well know I am not really a ggplot2 type of guy. Not because I do not recognise that it is good work—I would be an idiot to claim that probably the most downloaded package in the history of CRAN is not good—but because I predate both Hadley and ggplot2 (yep I admit it, I am a dinosaur now) I can do most of what I want in base R. However, there are times in exploratory data analysis where going the ggplot2 route makes more sense. One situation where that happens for me is when I am fitting linear models with a mixture of continuous explanatory variables and factors. Some people have an old-fashioned name for this situation—but really repeating that name would be a get-burned-at-a-stake type of offence. It is useful when fitting linear models of this type to be able to see the behaviour of the response with respect to a continuous predictor for each combination of the levels of the factors. ggplot2 makes it fairly easy to produce this type of plot through its faceting mechanism. It also makes it really easy to add a fitted line with a pretty confidence interval to each facet. I should say at this point that this is not restricted to linear models, and in fact works for generalised linear models as well, and for semi-parametric models like smoothers. It is cool stuff.

I want a pony, and I want it NOW Daddy

This is, as I have said, made easy to do in ggplot2, and half an hour of Googling will get you to the point where you can do it with your data. However, I found myself with the following problem. I had a situation where there was a suggestion that an interaction might be significant, and so I wanted to explore visually how the fitted models differed with and without the interaction. You often find yourself in this situation with tests suggesting the interactions are significant, only to find that it is driven by one combination of the factors, or even worse, by a single outlier. Making things explicit, in this problem I was interested in the difference between this model:

$$ Y \sim X * F_1 * F_2 $$

and this model

$$Y \sim X * F_1 + F_2$$

These models are expressed in standard Wilkinson and Rogers notation, where \(Y\) and \(X\) are continuous, and \(F_1\) and \(F_2\) are factors. However, the same issues arise if we are interested in

$$ Y \sim X * F $$

and this model
$$ Y \sim X + F $$

The only difference is that the different levels of the factor \(F\) can be represented on a single row—if there are not too many levels—or in a wrapped facet plot if there are quite a few levels.
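In R these formula comparisons translate directly into lm calls; a minimal sketch (with hypothetical names dat, Y, X, and F—not from this post's data) would be:

```r
## Hypothetical data frame `dat` with continuous Y, X and a factor F
fit_int = lm(Y ~ X * F, data = dat)  # expands to X + F + X:F — a slope per level of F
fit_add = lm(Y ~ X + F, data = dat)  # parallel lines — one common slope
anova(fit_add, fit_int)              # F-test for dropping the interaction terms
```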

We need some data

The data in this post come from Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J. & Ostrowski, E. (1994). A Handbook of Small Data Sets. Boca Raton, Florida: Chapman and Hall/CRC. The data try to relate the mortality in United Kingdom towns and cities—measured as deaths per 100,000 head of population—to the hardness of the water—measured by calcium concentration (in parts-per-million = ppm). The original authors also had the theory that this relationship changed when the town or city was in "The North." I have always found the English concept of "Up North" rather bizarre. As you leave the outskirts of London you start seeing road signs telling you how far it is to "The North." In this data set, "The North" is defined as being north of the English city of Derby. The data is available here: water.csv.

water.df = read.csv("water.csv")

It is worth having a quick look at the summary information

library(skimr)
skim(water.df)
## Skim summary statistics
##  n obs: 61 
##  n variables: 4 
## 
## Variable type: factor 
##  variable missing complete  n n_unique                     top_counts
##      town       0       61 61       61 Bat: 1, Bir: 1, Bir: 1, Bla: 1
##  ordered
##    FALSE
## 
## Variable type: integer 
##   variable missing complete  n    mean     sd   p0  p25  p50  p75 p100
##    calcium       0       61 61   47.18  38.09    5   14   39   75  138
##  mortality       0       61 61 1524.15 187.67 1096 1379 1555 1668 1987
##      north       0       61 61    0.57   0.5     0    0    1    1    1

We can see from this that our factor north is not being treated as such. I will recode it, and since we are using ggplot2 we might as well go into the tidyverse too. I think I am going to violate a Hadley principle by overwriting the existing data, but I can live with this.

library(tidyverse)
water.df = water.df %>% 
  mutate(north = factor(ifelse(north == 1, "north", "south")))

The initial plot

I am going to plot mortality as a function of calcium, and I am going to facet (if that is a verb) by our factor north. There is a good argument to be made that because both variables are rates (deaths per 100,000 and parts per million) we should work on a log-log scale. However, given this is not a post on regression analysis, I will not make this transformation (Spoiler: there is no evidence that the factor makes any difference on this scale, which kind of makes it useless for showing what I want to show ☺)

library(ggplot2)
p = water.df %>% 
  ggplot(aes(x = calcium, y = mortality)) + geom_point()
p = p + facet_wrap(~north)
p
Figure 1: An initial plot of the data

It looks like there is evidence of a shift, but is there evidence of a difference in slopes?

water.fit = lm(mortality ~ calcium * north, data = water.df)
summary(water.fit)
## 
## Call:
## lm(formula = mortality ~ calcium * north, data = water.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -221.27  -75.45    8.52   87.48  310.14 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1692.3128    32.2005  52.555   <2e-16 ***
## calcium              -1.9313     0.8081  -2.390   0.0202 *  
## northsouth         -169.4978    58.5916  -2.893   0.0054 ** 
## calcium:northsouth   -0.1614     1.0127  -0.159   0.8740    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.2 on 57 degrees of freedom
## Multiple R-squared:  0.5909, Adjusted R-squared:  0.5694 
## F-statistic: 27.44 on 3 and 57 DF,  p-value: 4.102e-11

Ignoring all model checking (it's not so bad, STATS 201x kids), there does not appear to be much support for the different slopes model. However, there is support for the different intercepts model. I would like to explore this visually.

So about those lines…

What I would like to do is place the fitted lines from each model on each facet of the plot. In addition, I would like the confidence intervals for each line to be added to each line. There might be a way to do this with geom_smooth, and I would be happy to hear about it. My solution involves three steps:

  1. Fitting each of the models, and using predict to get the confidence intervals
  2. Adding the fitted line from each model to each facet
  3. Adding the confidence intervals from each model to each facet

Step 1—getting the data for the confidence intervals

This step involves, as previously mentioned, using the predict function (in fact predict.lm). This should be fairly straightforward. The key here is to add the interval data and the fitted values to either a new data frame or our existing data frame. I will opt for the latter.

water.fit1 = water.fit ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)

water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")

water.df = water.df %>% 
  mutate(fit1 = water.pred1[,"fit"],
         lwr1 = water.pred1[,"lwr"],
         upr1 = water.pred1[,"upr"],
         fit2 = water.pred2[,"fit"],
         lwr2 = water.pred2[,"lwr"],
         upr2 = water.pred2[,"upr"])

Step 2—Adding the fitted lines

Recall that our plot is stored in the variable p. We will add the fitted lines using the geom_line function. We could have done this with geom_abline and just the coefficients; however, this would have made the method less flexible because we could not accommodate a simple quadratic model, for example.
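For the straight-line case, the geom_abline route mentioned above would look roughly like this (a sketch only; the coefficient names assume the treatment coding shown in the earlier summary output):

```r
## Build an intercept and slope for each facet from the interaction model
b = coef(water.fit)
ab.df = data.frame(north = c("north", "south"),
                   intercept = c(b["(Intercept)"], b["(Intercept)"] + b["northsouth"]),
                   slope = c(b["calcium"], b["calcium"] + b["calcium:northsouth"]))
p + geom_abline(data = ab.df,
                mapping = aes(intercept = intercept, slope = slope),
                col = "blue")
```

This draws the right lines, but offers no obvious way to add confidence bands, which is why the geom_line/geom_ribbon approach below is more flexible.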

p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit1), 
                  col = "blue", size = 1.2)
p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit2), 
                  col = "red", size = 1.2)
p
Figure 2: Adding the lines

We can already see the lack of support for the different slopes model; nevertheless, let's add the confidence intervals.

Step 3—Adding the confidence intervals

We add the confidence intervals using the geom_ribbon function. The aesthetic for geom_ribbon requires two sets of y-values, ymin and ymax. These, clearly, are the values we calculated for each of the confidence intervals. We set the colour of the bands by setting fill. Setting col controls the border colour of the ribbon. We make these transparent (so they do not obscure our fitted lines) by setting an alpha value closer to zero than to one (0.2 seems to be a good choice).

p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
                    fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
                    fill = "red", alpha = 0.2)
p
Figure 3: Lines with confidence intervals (as promised)

As expected the intervals completely overlap, thus supporting our decision to choose the simpler model.

Well that sucked

What if there really was a slope effect? I have artificially added one, and redrawn the plot just so we can see what happens when there really is a difference.

## just so we all get the same outcome
set.seed(123)

## get the original coefficients
b = coef(water.fit1)

## and change the effect for the interaction
b[4] = -5 # make a big difference to the slopes

## get the standard deviation of the residuals
sigma = summary(water.fit1)$sigma

## and use it to make a new set of responses
## (b[3] is the northsouth intercept shift, b[4] the slope shift)
water.df = water.df %>% 
  mutate(mortality = (b[1] + (north == "south") * b[3]) + 
           (b[2] + (north == "south") * b[4]) * calcium + 
           rnorm(61, 0, sigma))

## Now we repeat all the steps above
water.fit1 = lm(mortality ~ calcium * north, data = water.df) ## interaction model
water.fit2 = lm(mortality ~ calcium + north, data = water.df)

water.pred1 = predict(water.fit1, water.df, interval = "confidence")
water.pred2 = predict(water.fit2, water.df, interval = "confidence")

water.df = water.df %>% 
  mutate(fit1 = water.pred1[,"fit"],
         lwr1 = water.pred1[,"lwr"],
         upr1 = water.pred1[,"upr"],
         fit2 = water.pred2[,"fit"],
         lwr2 = water.pred2[,"lwr"],
         upr2 = water.pred2[,"upr"])

p = water.df %>% 
  ggplot(aes(x = calcium, y = mortality)) + geom_point()
p = p + facet_wrap(~north)
p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit1), 
                  col = "blue", size = 1.2)
p = p + geom_line(data = water.df, 
                  mapping = aes(x = calcium, y = fit2), 
                  col = "red", size = 1.2)
p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr1, ymax = upr1),
                    fill = "blue", alpha = 0.2)
p = p + geom_ribbon(data = water.df, 
                    mapping = aes(x = calcium, ymin = lwr2, ymax = upr2),
                    fill = "red", alpha = 0.2)
p
Figure 4: Data with a real interaction

One final note

I suspect, but I have not checked properly, that when geom_smooth is used, the fitted lines on each facet are derived from the subset of data in that facet, and therefore the standard errors might be slightly larger (because less data is used to estimate \(\sigma\)). However, I am happy to be corrected (and if I am wrong I will remove this section :-))
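If my suspicion is correct, the per-facet behaviour could be compared against Figure 3 with something like the following (an untested sketch):

```r
## geom_smooth(method = "lm") fits a separate lm to the points in each facet,
## so its confidence band estimates sigma from that facet's data alone
p2 = water.df %>% 
  ggplot(aes(x = calcium, y = mortality)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ x) + 
  facet_wrap(~north)
p2
```

If the bands here are noticeably wider than those built from the single pooled model, that would support the suspicion.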


Source: http://jamescurran.co.nz/2018/06/adding-multiple-regression-lines-to-a-faceted-ggplot2-plot/
