5 min read

US Election Polling Redux

So Democrat candidate Biden has a massive lead over the incumbent? Sounds familiar. Do the polling results show much difference between 2020 and 2016?

Biden is getting better percentages than Clinton pulled in. And Trump’s numbers are tanking. Enough for some optimism amongst democarts but not enough for complancency.

For actual insight fivethirtyeight.com reliably deliver with authority and flair. But in our relentless pursuit of data and R mastery, we have to give it a go ourselves.

Where can data be obtained?

The only dataset which covers both elections is Wikipedia, so not much choice.

library(readxl)
library(tidyverse)
library(lubridate)
# library(plyr)
library(tidyquant)
  • Setup a data structure for the project
  • Point to folder containing data and any other assests
use <- list()  # (us election)
use$folder <- "/folder/containing/csv/data"
  • Ive copied the wikipedia pages into a open office spreadsheet, exported it to xls.
  • Inspect the xls.

Read the three sheets

use$ws <- list() # worksheets
use$ws$e2020 <- read_excel(use$filename, use$sheetNames[1])
use$ws$e2016postConvention <- read_excel(use$filename, use$sheetNames[2])
use$ws$e2016preConvention <- read_excel(use$filename, use$sheetNames[3])

inspect the data

glimpse(use$ws$e2020)
## Rows: 340
## Columns: 8
## $ `Poll source`      <chr> "Ipsos/Reuters", "YouGov/Economist", "IBD/TIPP", "…
## $ Date               <chr> "Jun 29-30, 2020", "Jun 28-30, 2020", "Jun 27-30, …
## $ `Sample\nsize[b]`  <chr> "943 (RV)", "1,198 (RV)", "1,005 (RV)", "733 (RV)"…
## $ `Margin\nof error` <chr> "± 3.6%", "± 3.4%", "± 3.1%", "± 3.6%", "± 3.1%", …
## $ Trump              <chr> "0.38", "0.40000000000000002", "0.4000000000000000…
## $ Biden              <chr> "0.46000000000000002", "0.48999999999999999", "0.4…
## $ Other              <chr> "10%[c]", "6%[d]", "–", "4%[e]", NA, "1%[i]", "–",…
## $ Undecided          <chr> "0.059999999999999998", "0.040000000000000001", "–…

A few problems

  • 2016 pre convention structure differs from post convention structure.
  • The date column contains a range, difficult to parse.
  • There is another gotcha in date column that glimpse misses. When the date crosses a month boundary the format is different. e.g August 16 – September 12, 2016.
  • Another problem not revealed by glimpse in the 2020 sheet, a few of the percentages we are interested in are annotated with letters, used like * or † for notes, this has also broken the spreadsheet export percentage representation. So for a few items instead of a decimal we have an integer followed by a letter.

Approach

  • Solve structure issues with trivial remmaping in mutate.
  • Provide function for cleaning the date range. To make this function even more annoying, we need range start and range end in separate columns, so cannot do it with a single mutate.
  • Provide a function to clean the annotated percentages.

Date ranges

Turns out this is a lot more difficult that it looks. It would have been so easy if strptime allowed exclusions in the format string. Instead I had to use regex. Regex should have been easy for me but R’s implementation defied my expectations.

use$r1 <- "([A-Za-z]*) ([0-9]{1,2}).([0-9]{1,2}), ([0-9]{4})"
use$r2 <- "([A-Za-z]*) ([0-9]{1,2}) . ([A-Za-z]*) ([0-9]{1,2}), ([0-9]{4})"
use$r3 <- "([A-Za-z]*) ([0-9]{1,2}), ([0-9]{4})"

# There are three different date formats mixed in the dataset.
# This function identified the format by a number.  
# 1. Intramonth, 2. Intermonth, 3. Single Day.
dateCase <- Vectorize(function(range) {
  if (grepl(use$r1, trimws(range) )) {
    return(1)
  } else if (grepl(use$r2, trimws(range) )) {
    return(2)
  } else {
    return(3)
  }
})

dateStrFrom <- Vectorize(function(range, case) {
  if (case==1) {
    sub(use$r1, "\\2-\\1-\\4", range )    
  } else if (case==2) {
    sub(use$r2, "\\2-\\1-\\5", range )
  } else {
    sub(use$r3, "\\2-\\1-\\3", range )    
  }
})

dateStrTo <- Vectorize(function(range, case) {
  if (case==1) {
    sub(use$r1, "\\3-\\1-\\4", range )    
  } else if (case==2) {
    sub(use$r2, "\\4-\\3-\\5", range )
  } else {
    sub(use$r3, "\\2-\\1-\\3", range )    
  }
})

# Take a tibble worksheet tibble and separate the range in Date column into two separate columns.
cleanDate <- function(df) {
  df %>%
    mutate(DateCase=dateCase(Date)) %>%
    mutate(DateFrom=dmy(dateStrFrom(Date,DateCase ))) %>%
    mutate(DateTo=dmy(dateStrTo(Date,DateCase ))) %>%
    select(-DateCase)
}    

Combine and Transform

  • Join the data sets
  • Mutate for consistency
  • Convert from wide to long format
use$ws$e2016preConvention <- use$ws$e2016preConvention %>%
  filter(stringr::str_detect(DemocraticCandidate, 'Hillary Clinton') & stringr::str_detect(RepublicanCandidate, 'Donald Trump'))


setYear <- function(date) {
  year(date) <- 2020  
  date
}

use$combined <- 
  use$ws$e2020 %>% 
  dplyr::rename(Democrat = Biden, Republican = Trump) %>%
  plyr::rbind.fill(use$ws$e2016preConvention, use$ws$e2016postConvention) %>%
  cleanDate() %>%
  select(PollSource = "Poll source", DateFrom, DateTo, Democrat, Republican) %>%
  gather(Party, Percentage, Democrat:Republican) %>%
  mutate(Year=as.character(year(DateTo))) %>%
  mutate(ChartDate=setYear(DateTo)) %>%
  mutate(Key=paste(Party, Year)) %>%
  arrange(DateTo) %>%
  filter(Year %in% c(2020,2016)) %>%
  filter(Percentage < 1) %>%
  mutate(Percentage = as.numeric(Percentage))
## Warning: 1 failed to parse.

## Warning: 1 failed to parse.
use$combined %>% head(4) %>% knitr::kable() 
PollSource DateFrom DateTo Party Percentage Year ChartDate Key
Fox News[400] 2016-01-04 2016-01-07 Democrat 0.44 2016 2020-01-07 Democrat 2016
Fox News[400] 2016-01-04 2016-01-07 Republican 0.47 2016 2020-01-07 Republican 2016
Morning Consult[398] 2016-01-08 2016-01-10 Democrat 0.42 2016 2020-01-10 Democrat 2016
Gravis Marketing/One America News Network[399] 2016-01-10 2016-01-10 Democrat 0.49 2016 2020-01-10 Democrat 2016

Plot

ggplot(use$combined, aes(x=ChartDate, y=Percentage, color=Key)) +
geom_point(aes(shape=Key, alpha=0.75)) +
scale_color_manual(values=c("#aaaaff", "#0000aa", "#ffaaaa","#aa0000" )) +
scale_shape_manual(values = c(5, 16, 5,16)) +
scale_y_continuous(labels = scales::percent_format(accuracy=1)) +
geom_smooth( method = loess,  formula = y ~ x, aes(linetype = Year), show.legend = FALSE) +
scale_linetype_manual(values=c("twodash", "solid")) +
guides(alpha = F, Key = T)  +
theme_tq() 

Testing

case1 <- "June 28–29, 2016"
case2 <- "September 26 – October 2, 2016"
case3 <- "Jun 23, 2020"


assertthat::assert_that(dateCase(case1) == 1)
## [1] TRUE
assertthat::assert_that(dateCase(case2) == 2)
## [1] TRUE
assertthat::assert_that(dateCase(case3) == 3)
## [1] TRUE
assertthat::are_equal(
  dmy(dateStrFrom(case1, dateCase(case1))), 
  ymd("2016-06-28"))
## [1] TRUE
assertthat::are_equal(
  dmy(dateStrTo(case1, dateCase(case1))), 
  ymd("2016-06-29"))
## [1] TRUE
assertthat::are_equal(
  dmy(dateStrFrom(case2, dateCase(case2))), 
  ymd("2016-09-26"))
## [1] TRUE
assertthat::are_equal(
  dmy(dateStrTo(case2, dateCase(case2))), 
  ymd("2016-10-2"))
## [1] TRUE