Getting Weather and Covid Data - Kieran Shah's Data Science Examples

Weather Data

Many epidemiological experts have linked weather to rising or falling Covid-19 cases. The first two steps of the process of getting data is as follows:

The Covid-19 cases by county in the United States (US) can be found here:.
The weather data comes from Darksky. We use the package darksky. darksky is a wrapper for the API DarkSky provides.

The package darksky and its documentation is here:. Importantly, the API will shutdown at the end of 2021.

In order to get the weather for each county for a given day, we need the longitude and latitude for each of the counties. The dataset with every longitude and latitude in a given county is provided by the Open Data initiative of the U.S. Government. The complete raw dataset can be downloaded here:. To get a longitude and latitude for each county, we just take the means by county. It is not a perfect estimation for weather, but depending on the geographical size of the county, it should work well.

library(covdata)
library(tidyverse)
library(darksky)
library(ggmap)
## Darksky API: 
DS = "Weather API"
darksky_api_key(DS)
## google maps API: 
GoogleAPI = "API Key"
register_google(key = GoogleAPI)

longlat = readr::read_csv("C:/Users/Kieran Shah/Downloads/Geocodes_USA_with_Counties.csv") %>% 
  filter(!is.na(county)) %>% 
  group_by(county, state) %>% 
  summarise(long = mean(longitude ),
            lat = mean(latitude)) %>% 
  ungroup()

st_crosswalk <- tibble(state = state.name) %>%
  bind_cols(tibble(abb = state.abb)) %>% 
  bind_rows(tibble(state = "District of Columbia", abb = "DC"))

IndlCounties = nytcovcounty %>% 
  distinct(county, state) %>% 
  inner_join(st_crosswalk, by = "state") %>% 
  inner_join(longlat, by = c("county" = "county", "abb" = "state")) %>% 
  inner_join(nytcovcounty , by = c("county", "state")) %>%
  filter(lubridate::month(date) %in% c(3,4,5,6) ) %>% 
  mutate(timeStamp = paste0(date, "T13:00:00")) %>%  
  mutate(TempData = purrr::pmap(list(lat, long, timeStamp), get_forecast_for, 
                                header = TRUE, 
                                exclude = "currently,minutely,hourly,alerts,flags"))

IndlCountiesUnnest = IndlCounties %>%
  mutate(Data2Weather = purrr::map(TempData, ~ .x$daily %>% 
                                     select(temperatureMax , temperatureLow, 
                                                            precipProbability , precipIntensity))) %>% 
  tidyr::unnest(c(Data2Weather))

Other predictors:

There are many factors which are associated with rising, or declining Covid-19 cases. This post cannot get all of the predictors. Furthermore, all the relevant predictors are probably unknown at this time.

To get other predictors, we use the package tidycensus. After registering for an API key, you can check all the variables in the American Community Survey (ACS) with the load_variables function.

I pulled median age and median income for this analysis.

The weather data, Covid-19 data, and ACS data are then combined together.

library(dplyr)

tidycensus::census_api_key("API Key")
v17 <- tidycensus::load_variables(2017, "acs5", cache = TRUE)

MedianAgeCounty <- tidycensus::get_acs(geography = "county",
                                       variables = c(MedianAge = "B01002_001"),
                                       #state = "VT",
                                       year = 2017)

Poverty = tidycensus::get_acs(geography = "county",
                              variables = c(MedianIncome = "B17020_001"),
                              #state = "VT",
                              year = 2017)

ExtaVariables = inner_join(Poverty %>% 
                             select(MedianIncome = estimate, GEOID), 
                           MedianAgeCounty %>% 
                             select(GEOID, NAME, MedianAge = estimate ), by = "GEOID")

Focus on California

While using the entire U.S. dataset provides the greatest amount of variation, there are 50 state governments provided different policies. In order to isolate government policy without researching every 50 state policies, we can focus on California. California has both a lot of variation in temperatures and cases by county, it provides an interesting case study.

California has opened up to Stage 3 as of July 22.

California closed for shelter-at-home on March 19, 2020.
California opened for stage 2 on May 8, 2020.
California opened for stage 3 on June June 12, 2020.

An important caveat is that some of the openings are based on meeting county thresholds.

The first step is to create multiple lags for weather. Covid-19 cases will show up after temperature changes. To do so, we follow this blog post.

lags <- function(var, n=10){
  var <- enquo(var)
  
  indices <- seq_len(n)
  purrr::map( indices, ~quo(lag(!!var, !!.x)) ) %>% 
    purrr::set_names(sprintf("lag_%s_%02d", quo_text(var), indices))
  
}

Basic Correlations:

Next, we take a look at the data. How do the scatter plots look comparing Covid-19 cases and the four different weather variables.

Max Temperature:

The below table shows the distribution by range of max temperature. The rows are the number of days in lags. The columns are the temperature ranges.

TempDaySince	Range
TempDaySince	[27,59.3]	(59.3,66.8]	(66.8,74.1]	(74.1,83]	(83,112]
5	24.0	60.5	104.5	163.5	316.5
6	27.0	66.0	106.0	171.0	325.0
7	30.0	74.0	110.0	175.0	337.0
8	33.0	80.5	116.0	184.0	349.5
9	34.0	87.0	120.0	184.0	358.0
10	36.5	95.0	126.0	178.0	358.0
11	39.0	102.0	137.5	177.5	367.0
12	41.0	105.0	140.0	172.0	375.5
13	42.0	110.0	145.0	176.5	386.0
14	44.0	113.0	149.0	187.5	389.0

Both the plots and the table above show that the lag and the higher temperatures seem to be correlated with higher cases. The greater the lag the more cases, and the greater the temperature, the greater the new cases.

Precipitation Probability:

The below table shows the distribution by range of max temperature.

TempDaySince	Range
TempDaySince	[0,0.02]	(0.02,0.04]	(0.04,0.09]	(0.09,0.37]	(0.37,1]
5	270.0	113.0	91.0	49.5	35
6	273.5	118.0	94.5	52.0	39
7	278.0	123.5	97.0	53.0	41
8	285.5	129.0	101.5	55.0	44
9	291.5	136.0	104.0	62.0	48
10	297.0	129.0	108.0	71.0	50
11	301.0	129.0	110.0	77.5	54
12	309.0	140.0	112.0	82.0	64
13	312.0	140.0	112.0	89.0	71
14	317.0	145.0	116.0	100.0	77

Precipitation Intensity:

The below table shows the distribution by range of max temperature.

TempDaySince	Range
TempDaySince	[0,0.0004]	(0.0004,0.0006]	(0.0006,0.0008]	(0.0008,0.0022]	(0.0022,0.139]
5	525.0	160.0	54.0	45.0	33.0
6	547.5	163.0	55.0	47.0	34.0
7	564.0	172.0	56.0	50.0	36.0
8	578.5	177.0	58.5	52.0	40.0
9	587.0	186.0	62.5	54.0	42.0
10	579.0	188.0	66.5	55.0	44.0
11	595.0	193.0	67.0	57.0	47.5
12	595.0	194.5	70.0	61.5	50.0
13	587.0	199.0	74.5	64.0	54.0
14	589.0	199.0	79.0	67.0	62.0

Both the precipitation probability, and the precipitation intensity show a similar pattern. Greater amounts of rain are associated with fewer cases, and the greater the lag of precipation probability of intensity, the greater the number of cases.

Cases and Stage Openings:

The plot below shows the change in new cases over time. The background colours show the period of each stage. For the pre-lockdown, the background is red.

The next plot shows the same idea, but instead of showing the stages, it shows the seven day lag since the new stage of lockdown was imposed.

Finally, the plot below shows the twelve day lag since the last stage of lockdown occurred. It is important to note that we do not have enough data to do the 14 day lag.

Overall, it looks like the one week plot is the most predictive, but this is simply just a visually inspection.