1 Objective

Let’s demonstrate file merging today.

1.1 Read in the PH data

First we read the Philippines data back in, using the same read_csv() function as yesterday

library(tidyverse)
library(readr)

PH <- read_csv("Data/DOH COVID Data Drop_ 20201110 - 04 Case Information.csv")
## Warning: 339284 parsing failures.
##   row          col           expected             actual                                                           file
## 22857 BarangayRes  1/0/T/F/TRUE/FALSE SAN LORENZO (POB.) 'Data/DOH COVID Data Drop_ 20201110 - 04 Case Information.csv'
## 22857 BarangayPSGC 1/0/T/F/TRUE/FALSE PH012804012        'Data/DOH COVID Data Drop_ 20201110 - 04 Case Information.csv'
## 22858 BarangayRes  1/0/T/F/TRUE/FALSE CALLAGUIP (POB.)   'Data/DOH COVID Data Drop_ 20201110 - 04 Case Information.csv'
## 22858 BarangayPSGC 1/0/T/F/TRUE/FALSE PH012805012        'Data/DOH COVID Data Drop_ 20201110 - 04 Case Information.csv'
## 22870 BarangayRes  1/0/T/F/TRUE/FALSE SAN ANTONIO        'Data/DOH COVID Data Drop_ 20201110 - 04 Case Information.csv'
## ..... ............ .................. .................. ..............................................................
## See problems(...) for more details.

1.2 Merging

1.2.1 Prepare cases and deaths to merge

First, tabulate cases and deaths by age and sex. Let’s choose a reference date that isn’t too affected by registration lags in deaths. This is its own little challenge.

Let’s say most cases have entered the system by 2 weeks before today. Note, this assumes the data are from Nov 10. I modified the ref_date calcs to account for this code being run at future dates. If you’re using a different version of the data, you’ll need to account for it.

We declare the earliest_date as the earliest of onset, specimen, release, or entering the statistics. This last one has no missings.

library(lubridate)
# ref_date <- today() - 14
ref_date <- today() - (today() - ymd("2020-11-11") + 14)

Cases <-
  PH %>% 
  # see pmin() explanation in later section.
  mutate(earliest_date = pmin(DateOnset, 
                              DateSpecimen, 
                              DateResultRelease, 
                              DateRepConf, 
                              na.rm =TRUE)) %>% 
  filter(earliest_date <= ref_date) %>% 
  mutate(Age5 = Age - Age %% 5,
         Age5 = ifelse(Age5 > 100, 100, Age5)) %>% 
         # Age5 = case_when(Age5 > 100 ~ 100,
         #                  TRUE ~ Age5)) %>% 
  group_by(Age5) %>% 
  summarize(Cases = n()) %>% 
  mutate(UNK = Cases[is.na(Age5)]) %>% 
  filter(!is.na(Age5)) %>% 
  mutate(dist = Cases / sum(Cases),
         Cases = Cases + dist * UNK) %>% 
  select(Age5, Cases)
## `summarise()` ungrouping output (override with `.groups` argument)

We filter out all those cases whose earliest date is before our 2-weeks-ago reference date. We then create 5-year age groups as in previous days, taking care to group ages > 100 so that we can later merge with WPP population data (we made that modification later). Note you can do this with ifelse(), case_when() or surely something else. This is a 3-step process: 1) create grouping variable 2) declare the groups 3) count the rows per group.

Then we redistribute unknowns without leaving the pipeline. The idea is to move unknown ages (a single value) to a new column UNK using mutate(), then turn cases by age into a distribution dist, then redistribute using that distribution, and finally select() just the columns we want to keep.

Deaths are filtered on a less stringent citerion, but otherwise follow the same logic.

Deaths <-
  PH %>% 
  filter(!is.na(DateDied), # commas are like ANDs &
    DateDied <= ref_date) %>% 
  mutate(Age5 = Age - Age %% 5,
          Age5 = ifelse(Age5 > 100, 100, Age5)) %>% 
  group_by(Age5) %>% 
  summarize(Deaths = n()) %>% 
  mutate(UNK = Deaths[is.na(Age5)]) %>% 
  filter(!is.na(Age5)) %>% 
  mutate(dist = Deaths / sum(Deaths),
         Deaths = Deaths + dist * UNK) %>% 
  select(Age5, Deaths)
## `summarise()` ungrouping output (override with `.groups` argument)

1.2.2 merging cases and deaths together

We referred to a data wrangling cheat sheet https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf Nevermind the gather() spread() info on that sheet, we’ll do the new thing. left_join() treats the left-side dataset as holy, and removing none of its rows. right_join() does the opposite. inner_join() just keeps rows that are matched, where the key variable combinations are present in both datasets. full_join() keeps everything from both sides. Me, personally, I most often use left_join(), it’s just easier to remember it and set things up accordingly.

CFR <- Cases %>% 
  left_join(Deaths) %>% 
  mutate(ASCFR = Deaths / Cases)
## Joining, by = "Age5"

Main points to remember: If you don’t tell it which columns to match on, it will compare column names in both datasets, and by default it will try to join using any names in common. The columns classes must be the same for that to work. In our case, it joins using out variable Age5, which satisfies this condition. If you want finer control, read the help file.

2 Plot CFR

let’s plot it!

CFR %>% 
  ggplot(aes(x = Age5, y = ASCFR)) + 
  geom_line() +
  scale_y_log10()

The infant part has been popping up in other countries too, the deceleration in older ages is inconsistent between countries, but it is unlikely that there is actually a drop in extreme old age. We suspect it’s a data quality artifact, and therefore chop the x axis at 100.

3 Merge in WPP data

Now lets merge on population counts from the WPP2019. There’s a package containing selected WPP outputs in 5-year age groups, which serves us here.

Males and females are given separately. We first extract the columns we need with select(), then cut it down to just the Philippines. We then stack them, then group on age and sum accordingly to give both-sex totals by age. Note this data stops at 100+, and this is why we grouped the earlier case and death data down to 100. Note the extensive use of case_when() to recode age groups. There are slicker tricks that take up less space, but this is the most explicit and elegant way of doing this sort of thing.

library(wpp2019)
data(popM)
data(popF)
# popM %>% dim()
# popM %>% pull(name) %>% unique()
males <- 
  popM %>% 
  select(name, age, `2020`) %>% 
  filter(name == "Philippines")
females <- 
  popF %>% 
  select(name, age, `2020`) %>% 
  filter(name == "Philippines")

Pop <- bind_rows(
  males,
  females
) %>% 
  group_by(age) %>% 
  summarize(Population = sum(`2020`)) %>% 
  mutate(Age5 = 
           case_when(
             age == "0-4" ~ 0,
             age == "5-9" ~ 5,
             age == "10-14" ~ 10,
             age == "15-19" ~ 15,
             age == "20-24" ~ 20,
             age == "25-29" ~ 25,
             age == "30-34" ~ 30,
             age == "35-39" ~ 35,
             age == "40-44" ~ 40,
             age == "45-49" ~ 45,
             age == "50-54" ~ 50,
             age == "55-59" ~ 55,
             age == "60-64" ~ 60,
             age == "65-69" ~ 65,
             age == "70-74" ~ 70,
             age == "75-79" ~ 75,
             age == "80-84" ~ 80,
             age == "85-89" ~ 85,
             age == "90-94" ~ 90,
             age == "95-99" ~ 95,
             age == "100+" ~ 100,
           )) %>% 
  arrange(Age5) %>% 
  select(-age)
## `summarise()` ungrouping output (override with `.groups` argument)

Let’s merge with CFR, but note there’s a difference of just one row. WPP stops at 100+. PH stops at 105+.

CFR <- CFR %>% 
  left_join(Pop)
## Joining, by = "Age5"

4 Exercise

Merge 3 datasets: OWD, COVerAGE-DB, and STMF. Pick a country included in all 3 datasets. Ideally one whose inputdata in STMF is given in 5-year age groups. I chose Denmark. Each country will imply different challenges.

4.1 Download and read into R

I examined what was available in the STMF inputs and selected from it. Demark ought to be in all three sources.

OWD <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/testing/covid-testing-all-observations.csv")
## 
## ── Column specification ─────────────────────
## cols(
##   Entity = col_character(),
##   Date = col_date(format = ""),
##   `ISO code` = col_character(),
##   `Source URL` = col_character(),
##   `Source label` = col_character(),
##   Notes = col_character(),
##   `Cumulative total` = col_double(),
##   `Daily change in cumulative total` = col_double(),
##   `Cumulative total per thousand` = col_double(),
##   `Daily change in cumulative total per thousand` = col_double(),
##   `7-day smoothed daily change` = col_double(),
##   `7-day smoothed daily change per thousand` = col_double(),
##   `Short-term tests per case` = col_double(),
##   `Short-term positive rate` = col_double()
## )
COV <- read_csv("Data/Output_5.zip",
                     skip = 3,
                     col_types = "ccccciiddd")
# this one is specifically Denmark!
STMF <- read_csv(unzip("Data/STMFinput.zip","DNKstmf.csv"))
## 
## ── Column specification ─────────────────────
## cols(
##   PopCode = col_character(),
##   Area = col_double(),
##   Year = col_double(),
##   Week = col_double(),
##   Sex = col_character(),
##   Age = col_character(),
##   AgeInterval = col_character(),
##   Deaths = col_double(),
##   Type = col_character(),
##   Access = col_character()
## )

the COV read_csv() part works straight from the zip file because the contents have the same name. However the STMF zip files has lots of csvs in it, so we need to grab out just the one we want using unzip(): no need to manually unzip.

4.2 Filter to Denmark 2020

First filter down OWD and COVerAGE-DB to Denmark. Filter down STMF to 2020.

COV <-
  COV %>% 
  filter(Country == "Denmark",
         Region == "All") %>% 
  mutate(Date = dmy(Date))

OWD <- 
  OWD %>% 
  filter(`ISO code` == "DNK")

STMF <- 
  STMF %>% 
  filter(Year == 2020)

Trick: OWD is just daily figures, both cumulative and new, smoothed and raw. STMF is new deaths, in week bins whereas COVerAGE-DB is cumulative counts. We need to make things match somewhat! Well technically, we just want to achieve a merge, but let’s make things match a bit better from the start.

4.3 Decumulate COVerAGE-DB

Step 1: let’s decumulate COVerAGE-DB data over time so that it’s new cases, deaths, and tests. Do we have steady daily observations even?

A visual check and a programmatic check: do we have daily data?

COV %>% 
  group_by(Date) %>% 
  mutate(Deaths = sum(Deaths)) %>% 
  ggplot(aes(x = Date, y = Deaths)) +
  geom_point()
## Warning: Removed 1218 rows containing missing
## values (geom_point).

# check for day gaps: ruh roh! There are some gaps we need to interpolate, no? Weekends I guess.
COV %>% pull(Date) %>% unique() %>% sort() %>% diff()
## Time differences in days
##   [1] 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [21] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [41] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [61] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [81] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [101] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [121] 1 1 1 1 1 3 1 1 1 1 3 1 1 1 1 3 1 1 1 1
## [141] 3 1 1 1 1 3 1 1 1 1 3 1 1 1 1 3 1 1 1 1
## [161] 3 1 1 1 1 3 1 1 1 1 3 1 1 1 1 3 1 1 1 1
## [181] 3 1 1 2 3 1 1 1 1 3 1 1 1 1 3 1 1 1 1 3
## [201] 1 1 1

Conclusion: it’s fairly consistent, but there are often gaps. Probably weekends where data was either not released or not captured. No problem: STMF is weekly data anyway, so let’s figure out how to turn COVerAGE-DB into weekly new data.

First idea: interpolate to get a clean daily series, then decumulate, then group to weeks? Lots of challenges to figure out if we do that.

Second idea: Maybe there’s a way to get to new counts by week without needing to interpolate to single days first? This would simplify things a lot.

Check to see if there is some weekday() consistently in the data, like Monday.

 weekdays(today()) 
## [1] "Friday"
COV %>% pull(Date) %>% unique() %>% sort() %>% weekdays() %>% table()
## .
##    Friday    Monday  Saturday    Sunday 
##        34        34        17        17 
##  Thursday   Tuesday Wednesday 
##        34        34        34

OK, we can select just the Mondays in the data. Note: it will choose the day names based on your locale! If you want to ensure spelling, try the tricks here https://stackoverflow.com/a/17031207/889960

Our steps: select only Mondays (we’re not missing any), then derive the ISO week from these (STMF follows that convention), then sort first by Sex, then by Age within Sex then by week within Sex and Age. (the innermost sort is last). Then group by Sex and Age and decumulate counts using diff(). Note the result must be the same length, so we pad with NA at the end. Finally, we select() just the columns we need to retain.

a <- c(1,2,3,4,5,6)
diff(a)
## [1] 1 1 1 1 1
COV <- 
  COV %>% 
  filter(weekdays(Date) == "Monday") %>% 
  mutate(Week = week(Date)) %>% 
  # sort just to be sure decumulation works right
  arrange(Sex, Age, Week) %>% 
  group_by(Sex, Age) %>% 
  # decumulate, pad w NA
  mutate(cov_deaths_wk = c(diff(Deaths),NA),
         cov_cases_wk = c(diff(Cases), NA),
         cov_tests_wk = c(diff(Tests), NA)) %>% 
  # keep just what we want
  select(Sex, 
         Week, 
         Age, 
         cov_deaths_wk, 
         cov_cases_wk, 
         cov_tests_wk)

Now we coerced COVerAGE-DB to the STMF dimensions, sort of.

4.4 prepare STMF data

Now Let’s examine STMF, recalling this is input data, and it may have unknown ages that need to be redistributed, or similar.

STMF %>% pull(Age) %>% unique()
##  [1] "0"   "5"   "10"  "15"  "20"  "25" 
##  [7] "30"  "35"  "40"  "45"  "50"  "55" 
## [13] "60"  "65"  "70"  "75"  "80"  "85" 
## [19] "90"  "95"  "100" "TOT"

OK, looks like no explicit unknown ages. They do give totals, but we don’t know for sure whether marginal sums match these totals. Typically, we would trust TOT more than we would trust the marginal sum. We’ll just make sure, that this is so by forcing a rescale, using the same trick from before to redistribute unknowns. Really, should check that each week has a total. For the moment, we’re doing this on faith, but we could also check..

STMF <-
  STMF %>% 
  group_by(Week, Sex) %>% 
  # move total to column
  mutate(TOT = Deaths[Age == "TOT"]) %>% 
  filter(Age != "TOT") %>% 
  mutate(dist = Deaths / sum(Deaths),
         Deaths = dist * TOT,
         Age = as.integer(Age)) %>% 
  select(Sex, Week, Age, Deaths) %>% 
  arrange(Sex, Week, Age)

4.5 Ready to merge STMF and COVerAGE-DB

I’ll put COVerAGE-DB on the left, since it has fewer weeks, and that’s all we want to compare anyway. We’ve already taken care to make columns Sex, Week and Age commensurable.

head(STMF)
## # A tibble: 6 x 4
## # Groups:   Week, Sex [1]
##   Sex    Week   Age Deaths
##   <chr> <dbl> <int>  <dbl>
## 1 b         1     0      4
## 2 b         1     5      1
## 3 b         1    10      1
## 4 b         1    15      0
## 5 b         1    20      4
## 6 b         1    25      5
head(COV)
## # A tibble: 6 x 6
## # Groups:   Sex, Age [1]
##   Sex    Week   Age cov_deaths_wk
##   <chr> <dbl> <int>         <dbl>
## 1 b        11     0            NA
## 2 b        12     0            NA
## 3 b        13     0            NA
## 4 b        14     0             0
## 5 b        15     0             0
## 6 b        16     0             0
## # … with 2 more variables:
## #   cov_cases_wk <dbl>, cov_tests_wk <dbl>
OUT <- left_join(COV, STMF) %>% 
  arrange(Sex, Week, Age)
## Joining, by = c("Sex", "Week", "Age")

We’ll the actually joining is the least of our worries it seems! All the work went into prepping the data!

4.6 Merge on OWD data

First, we also need a Week variable for OWD.

OWD <-
  OWD %>% 
  mutate(Week = week(Date))

Oh snap, OWD is mostly missing data for this date range in Denmark!!

But guess what? We can still use the testing data present in COVerAGE-DB. Likely it was discarded by OWD due to differences in definitions or something like this.

4.7 Calculate and plot

Let’s see new positivity by age

OUT %>% 
  mutate(Positivity = cov_cases_wk / cov_tests_wk) %>% 
  filter(Sex == "b") %>% 
  ggplot(aes(x = Week, 
             y = Positivity, 
             color = Age, 
             group = Age)) +
  geom_line() 
## Warning: Removed 21 row(s) containing missing
## values (geom_path).

Maybe clearer picture in 20-year age groups?

OUT %>% 
  mutate(Age = Age - Age %% 20) %>% 
  group_by(Week,Sex,Age) %>% 
  summarize(cov_cases_wk = sum(cov_cases_wk),
            cov_tests_wk = sum(cov_tests_wk)) %>% 
  mutate(Positivity = cov_cases_wk / cov_tests_wk) %>% 
  filter(Sex == "b") %>% 
  ggplot(aes(x = Week, 
             y = Positivity, 
             color = Age, 
             group = Age)) +
  geom_line() 
## `summarise()` regrouping output by 'Week', 'Sex' (override with `.groups` argument)
## Warning: Removed 6 row(s) containing missing
## values (geom_path).

Negatives (impossible) have a few potential sources. 1) could be cumualtive totals are reported each day, but are not back-updated. 2) could be due to erratic behavior in the closeout, depending on the open age group in the source data for Denmark (numbers are small). 3) data entry errors, totally possible! My guess is it’s likely (1) because the open age is 90+ in the source data, and because in the limit all input errors are removed via repeated diagnostics and user feedback. Will need to look closer into it!

5 Mini lessons

5.1 details on logical conditions behavior

Or checks if at least one instance of a logical is TRUE. If that is the case, i.e. c(TRUE, NA, NA, FALSE), then it will return TRUE. However, if it can’t find a single TRUE but there are NAs, it cannot do the opposite, because one of those NAs could be TRUE but we just don’t know it, so it returns NA instead.

1 > 0 | NA > 0 # TRUE, NA -> TRUE
## [1] TRUE
# likewise:
any(c(1,NA)>0) # TRUE
## [1] TRUE
# but not:
1 < 0 | NA > 0 # FALSE, NA -> NA
## [1] NA

5.2 pmin() trick

While organically coding the PH example, I wanted to find the minimum of a set of dates, where one of the dates was never missing, but other earlier dates (preferred) were sometimes missing and sometimes not. When I tried this with min(date1,date2,date3) inside mutate() it revealed some behavior to be aware of.

  1. min() returns NA unless you tell it to throw them out with na.rm=TRUE.
  2. min() takes everything you give it, and sticks it in c(), a single vector, then takes the overall minimum.

Therefore, min(date1,date2,date3,na.rm=TRUE) gave the date of the first case as the same value for the whole series! Which is not what we wanted. So, I remembered pmin() does the same thing, but elementwise for sets of vectors, and that trick saved the day. Examine the below code to see these behaviors demonstrated.

a <- 1:10
b <- 10:1
d <- runif(10)
e <- c(runif(9),NA)

 # takes the global min
min(a,b)
## [1] 1
# NA always wins
min(a,e) 
## [1] NA
# unless you declare otherwise!
min(a,e, na.rm=TRUE) 
## [1] 0.07750548
 # takes the elementwise min
pmin(a,b)  
##  [1] 1 2 3 4 5 5 4 3 2 1
# works with multiple vectors :-)
pmin(a,b,d) 
##  [1] 0.948933459 0.219073185 0.021221673
##  [4] 0.482982325 0.748399175 0.763619276
##  [7] 0.078663250 0.291185562 0.001042514
## [10] 0.447819927
# na.rm works as expected too :-)
pmin(a, b, d, e, na.rm = TRUE) 
##  [1] 0.453721729 0.077505483 0.021221673
##  [4] 0.482982325 0.748399175 0.176956060
##  [7] 0.078663250 0.291185562 0.001042514
## [10] 0.447819927

This last example is how we ended up using it inside mutate() in order to find the earliest of a set of dates, and it worked as expected, just took a bit of trial and error to get there.

5.2.1 Save out our results

I save the results and add them to the repository so that they can be used to demonstate pivoting used a contrived example.

saveRDS(OUT, file = "Data/DNK_STMF_COV.rds")
write_csv(OUT, file = "Data/DNK_STMF_COV.csv")