COVerAGE-DB shares several files, ranging from untransformed inputs to fully harmonized outputs. The files have a similar structure, are all in .csv format, and should be straightforward to read into any common statistical package.
The inputDB is quite diverse in terms of the measures, metrics, and age groups it includes. This file includes everything we collect except for subsets that we are unable to process for one reason or another; in other words, it contains all prima facie valid data that we’ve collected. It has a long structure, in that all data collected are in a single column, Value. Aside from Country, Region, Code, and Date, you can filter on Metric ("Ratio", "Count", or "Fraction") or Measure ("Cases", "Deaths", "Tests", "ASCFR"). At some point we will incorporate Rate metrics. Note that if you’re interested in Count data, it may already be included as such, or you may need to calculate it from other data types. This is one of the things we take care of in the harmonization routines.
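For example, here is a minimal sketch of both situations (assuming inputDB has been read in as shown further below; this is only an illustration, not the official harmonization code). Rows already reported as Counts can simply be filtered, while Fractions can be scaled up by the "TOT" count of the same subset:
# Deaths already reported as Counts
inputDB %>%
  dplyr::filter(Metric == "Count", Measure == "Deaths")
# Deaths reported as Fractions, scaled up by the "TOT" count of the same subset
inputDB %>%
  dplyr::filter(Measure == "Deaths", Sex == "b") %>%
  group_by(Country, Region, Date) %>%
  mutate(TOT = Value[Metric == "Count" & Age == "TOT"][1]) %>%
  ungroup() %>%
  dplyr::filter(Metric == "Fraction", !is.na(TOT)) %>%
  mutate(Value = Value * TOT,
         Metric = "Count")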
The output files Output_5 and Output_10 are harmonized to "Count" metrics and to 5- and 10-year age groups from 0 to 100+. The column layout is also different: for the time being, each Measure sits in its own column. These files can be used as-is for many kinds of analyses, but you may also consider further smoothing, especially for the 5-year age group data, which can contain stochastic 0s.
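If you prefer not to smooth, one crude alternative (a sketch, assuming Output_5 has been read in the same way as Output_10 below) is to aggregate the 5-year output to wider age groups, here 20-year groups, which removes many of the stochastic 0s at the cost of age detail:
Output_5 %>%
  mutate(Age = Age - Age %% 20) %>%
  group_by(Country, Region, Code, Date, Sex, Age) %>%
  summarize(Cases = sum(Cases),
            Deaths = sum(Deaths),
            Tests = sum(Tests),
            .groups = "drop")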
An interactive data downloading tool will soon be released, which will facilitate smaller data extracts for specific analyses.
Details: commas are column separators, decimals are dots. When reading in, you may need to skip a row or more. You can typically read these files straight from the repository using the raw version. See the examples below.
Each COVerAGE-DB file is given a random-looking id on OSF, which can be used to download the files from R. Here are the ids and other important file parameters for the main files, for easy copy-pasting (updated 10 May 2022):
| File | id | col_types | skip |
|---|---|---|---|
| inputDB.zip | 9dsfk | "cccccciccdc" | 1 |
| Output_5.zip | 7tnfh | "ccccciiddd" | 3 |
| Output_10.zip | 43ucn | "ccccciiddd" | 3 |
| qualityMetrics.zip | qpfw5 | "ccDcdddcciiiiiiiiiiildd" | 1 |
| offsets.csv | unf6v | "idccc" | 1 |
col_types gives the column classes, where c means character, i is integer, d is double, D is a proper date format, and l is logical. The files have headers with timestamps, so some rows also need to be skipped (skip) when reading in.
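For instance, the table translates directly into read_csv() arguments. Here is a minimal sketch for offsets.csv, assuming readr is loaded and the file has already been downloaded from OSF (e.g. via osfr, as in the chunks below):
offsets <- read_csv("offsets.csv",
                    skip = 1,
                    col_types = "idccc")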
Here are the main packages you’ll want to have installed to read the files directly into an R session, with no need to manually download or unzip.
library(tidyverse)
library(readr)
library(lubridate)
library(ggplot2)
library(osfr)
library(covidAgeData)
inputDB file
# This downloads the file, preserving the name inputDB.zip
cdb_repo <- osf_retrieve_node("mpwjq")
osf_ls_files(cdb_repo, path = "Data") %>%
dplyr::filter(name == "inputDB.zip") %>%
osf_download(conflicts = "overwrite")
## # A tibble: 1 x 4
## name id local_path meta
## <chr> <chr> <chr> <list>
## 1 inputDB.zip 5f3ed659746a8100ad1a2420 ./inputDB.zip <named list [3]>
# This reads it in
inputDB <- read_csv("inputDB.zip",
skip = 1,
col_types = "cccccciccdc")
glimpse(inputDB)
## Rows: 33,806,302
## Columns: 11
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afg~
## $ Region <chr> "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All~
## $ Code <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "~
## $ Date <chr> "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020~
## $ Sex <chr> "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "f",~
## $ Age <chr> "0", "10", "20", "30", "40", "50", "60", "70", "80", "TOT", "0", "10", "20", "30", "40", "50", "60", "70", "8~
## $ AgeInt <int> 10, 10, 10, 10, 10, 10, 10, 10, 25, NA, 10, 10, 10, 10, 10, 10, 10, 10, 25, NA, NA, NA, 10, 10, 10, 10, 10, 1~
## $ Metric <chr> "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "~
## $ Measure <chr> "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Deaths", "Deaths",~
## $ Value <dbl> 169, 1525, 7948, 7592, 5362, 3747, 2267, 840, 301, 29751, 0, 5, 18, 54, 117, 161, 221, 95, 55, 726, 73515, 82~
## $ templateID <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "~
Notice that Age is character because it can contain "TOT" or "UNK" values. Therefore, if you sort on it, sorting will be alphabetical, in which case "10" comes before "5", and so on. This is the file from which we begin the harmonization routines. There is a lot in there that currently does not make it through processing for one reason or another. Maybe you have a use for it?
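A minimal sketch of one way to handle this (the filtering and ordering here are just illustrative): drop the "TOT" and "UNK" rows and convert Age to integer before sorting or plotting.
inputDB %>%
  dplyr::filter(!Age %in% c("TOT", "UNK")) %>%
  mutate(Age = as.integer(Age)) %>%
  arrange(Country, Region, Date, Sex, Measure, Metric, Age)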
Output files are friendlier for several reasons. Since measures from like subsets sit next to each other, it’s easier to calculate things. Since age groups are uniform, it’s easy to compare subsets. Since Age is a proper integer, it’s easy to plot on the abscissa. However, you’ll still want to convert the Date class to use it in plotting. Dates come as character strings following the format "DD.MM.YYYY", and can be converted using lubridate::dmy() without further ado.
osf_ls_files(cdb_repo, path = "Data") %>%
dplyr::filter(name == "Output_10.zip") %>%
osf_download(conflicts = "overwrite")
## # A tibble: 1 x 4
## name id local_path meta
## <chr> <chr> <chr> <list>
## 1 Output_10.zip 5f3ed666bacde800a533bb10 ./Output_10.zip <named list [3]>
# This reads it in
Output_10 <- read_csv("Output_10.zip",
skip = 3,
col_types = "ccccciiddd")
# convert to date class
Output_10 <-
Output_10 %>%
mutate(Date = dmy(Date))
By now the files have grown too big to handle easily in Excel. Here’s a way to filter down and save out so that you can work with them in Excel if need be (a slicker interactive solution is in the works).
# Filter down to just one country
Out <-
Output_10 %>%
filter(Country == "Spain", Region == "All")
# save it out
write_csv(Out, file = "Spain_Output_10.csv")
# or for the inputDB
In <-
inputDB %>%
filter(Country == "Spain", Region == "All")
write_csv(In, file = "Spain_InputDB.csv")
covidAgeData package
The process of reading and filtering COVerAGE-DB files is automated in the covidAgeData package. It is not yet available on CRAN, but can be installed from GitHub via the remotes package.
remotes::install_github("eshom/covid-age-data")
# This downloads and reads in one line
inputDB <- download_covid("inputDB", progress = FALSE)
covidAgeData includes wrappers for either very fast in-memory data subsetting, or slower but more memory-efficient subsetting. The next example demonstrates both methods on inputDB.
subset_covid(inputDB, Country = "Brazil", Region = "All") %>%
head()
## Country Region Code Date Sex Age AgeInt Metric Measure Value templateID
## 1 Brazil All BR 16.03.2020 b 0 10 Count Deaths 0 BR
## 2 Brazil All BR 16.03.2020 b 10 10 Count Deaths 0 BR
## 3 Brazil All BR 16.03.2020 b 20 10 Count Deaths 0 BR
## 4 Brazil All BR 16.03.2020 b 30 10 Count Deaths 0 BR
## 5 Brazil All BR 16.03.2020 b 40 10 Count Deaths 0 BR
## 6 Brazil All BR 16.03.2020 b 50 10 Count Deaths 0 BR
# Drastically conserve memory using this memory-efficient version, powered by
# the `vroom` package
read_subset_covid("inputDB.zip", "inputDB", Country = "Sweden",
Region = "All") %>%
head()
## Country Region Code Date Sex Age AgeInt Metric Measure Value templateID
## 1 Sweden All SE 06.04.2020 b 0 10 Count Cases 43 SE
## 2 Sweden All SE 06.04.2020 b 10 10 Count Cases 143 SE
## 3 Sweden All SE 06.04.2020 b 20 10 Count Cases 503 SE
## 4 Sweden All SE 06.04.2020 b 30 10 Count Cases 651 SE
## 5 Sweden All SE 06.04.2020 b 40 10 Count Cases 935 SE
## 6 Sweden All SE 06.04.2020 b 50 10 Count Cases 1291 SE
Let’s walk through some commonly made plots of this kind of data. These are a pretty random selection. Different examples are found in the tabs. Code donations gladly accepted.
How can we plot log CFR by age? Subset to any population with both Cases and Deaths, and select a not-so-noisy date range (eyeball it). Here we go ahead and plot a time series of the ASCFR. It works OK because the CFR has changed relatively smoothly over time. Apparent decreases could be real or data artifacts.
Output_10 %>%
filter(Country == "Germany",
Region == "All",
Sex == "b",
Date >= dmy("01.05.2020")) %>%
mutate(ASCFR = Deaths / Cases) %>%
filter(!is.na(ASCFR)) %>%
ggplot(aes(x = Age, y = ASCFR, group = Date, color = Date)) +
geom_line(alpha = .2) +
scale_y_log10()
How about a composition plot of new cases by age over time? Here we group the data to 20-year age bands, then decumulate using New = Cases - lead(Cases), then convert to fractions with Frac = New / N and plot using geom_area(). Could be cleaned up a bit more, but this gets it started.
library(colorspace)
Output_10 %>%
filter(Country == "Germany",
Region == "All",
Date >= dmy("01.04.2020")) %>%
mutate(Age20 = Age - Age %% 20) %>%
group_by(Date, Age20) %>%
summarize(Cases = sum(Cases),
.groups = "drop") %>%
group_by(Age20) %>%
arrange(Date) %>%
mutate(New = Cases - lead(Cases)) %>%
ungroup() %>%
group_by(Date) %>%
mutate(N = sum(New),
Frac = New / N) %>%
ungroup() %>%
ggplot(aes(x = Date,
y = Frac,
fill = as.factor(Age20))) +
geom_area() +
scale_fill_discrete_sequential("Emrld")
In general, taking ratios of variables is a good diagnostic tool. In the first run of this plot, we notice a major 1-day rupture early in the series. Major age crossovers since October are notable.
Output_10 %>%
dplyr::filter(Country == "Denmark",
Region == "All",
Sex == "b",
Date >= dmy("01.05.2020")) %>%
mutate(Positivity = Cases / Tests) %>%
dplyr::filter(!is.na(Positivity)) %>%
ggplot(aes(x = Date, y = Positivity, group = Age, color = as.factor(Age))) +
geom_line() +
scale_color_discrete_sequential("Magenta")
This section covers how to merge with other global databases. Different sources are found in the tabs. We will sometimes use the countrycode package to help with matching country names.
Let’s merge the 2020 WPP population estimates to the 10-year age group data. We’ll use the wpp2019 package for this for the time being. Let’s see, how about South Africa?
library(wpp2019)
data(popM)
Pop <-
popM %>%
dplyr::filter(name == "South Africa") %>%
select(Age = age, Population = `2020`) %>%
mutate(Population = Population * 1000,
Age = as.character(Age)) %>%
separate(Age,
into = c("Age",NA),
sep = "-") %>%
mutate(Age = ifelse(Age == "100+", 100, as.integer(Age)),
Age = Age - Age %% 10) %>%
group_by(Age) %>%
summarize(Population = sum(Population), .groups = "drop")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [21].
## Warning in ifelse(Age == "100+", 100, as.integer(Age)): NAs introduced by coercion
ZA <-
Output_10 %>%
dplyr::filter(Country == "South Africa",
Region == "All",
Sex == "m") %>%
left_join(Pop)
## Joining, by = "Age"
Let’s see what we get. Note that Cases / Population is not a rate per se: Population isn’t an exposure, so handle population denominators with care. An alternative would be to decumulate Cases, group to weeks, and divide by Population / 52 or some better approximation (see the sketch after the next chunk). Either way, you need to remember that Cases are only those infections that have been detected, so they are necessarily a subset of infections.
ZA %>%
dplyr::filter(Date >= dmy("10.07.2020")) %>%
mutate(CPop = Cases / Population) %>%
ggplot(aes(x = Date, y = CPop, group = Age, color = as.factor(Age))) +
geom_line() +
scale_color_discrete_sequential("Magenta") +
scale_y_log10()
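As mentioned above, a rough alternative is to decumulate Cases to weekly new cases and divide by an approximate weekly exposure, Population / 52. Here is a minimal sketch of that idea (the week grouping and the NewCases / WeeklyPerCapita names are just illustrative):
ZA %>%
  dplyr::filter(!is.na(Cases)) %>%
  # collapse to one cumulative value per week and age group
  mutate(Week = floor_date(Date, "week")) %>%
  group_by(Age, Week) %>%
  summarize(Cases = max(Cases),
            Population = first(Population),
            .groups = "drop") %>%
  # decumulate within age groups
  arrange(Age, Week) %>%
  group_by(Age) %>%
  mutate(NewCases = Cases - lag(Cases)) %>%
  ungroup() %>%
  dplyr::filter(NewCases > 0) %>%
  # crude weekly "rate": new cases over an approximate weekly exposure
  mutate(WeeklyPerCapita = NewCases / (Population / 52)) %>%
  ggplot(aes(x = Week, y = WeeklyPerCapita, color = as.factor(Age))) +
  geom_line() +
  scale_color_discrete_sequential("Magenta") +
  scale_y_log10()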
Our World in Data provides different kinds of testing aggregates and other interesting things: https://github.com/owid/covid-19-data/tree/master/public/data. COVerAGE-DB collects a lot of testing aggregates too. When not broken down by age, these are delivered in the inputDB. Test counts broken down by age pass through to the output files. Our collected testing measures could be either Tests or Tested individuals, and the only way to know is to check the metadata. OWID applies a consistent definition at this time.
There are different files on the OWID GitHub repository. Here’s one that contains the full available time series for each country.
library(countrycode)
OWID <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/testing/covid-testing-all-observations.csv",
col_types= "ccDcccdddddddd") %>%
dplyr::filter(!is.na(`ISO code`)) %>%
dplyr::filter(! `ISO code` %in% c("OWID_KOS","OWID_WRL")) %>%
mutate(ISO2 = countrycode(`ISO code`,
origin = 'iso3c',
destination = 'iso2c')) %>%
select(ISO2,
Date,
`Short-term tests per case`,
`Short-term positive rate`)
# merge for coverage with positivity
COVwPOS <- Output_10 %>%
dplyr::filter(Region == "All") %>%
rename(ISO2 = Code) %>%
group_by(Country) %>%
left_join(OWID)
We’ll add a plot using the merged data when we think of something.
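In the meantime, here is a minimal sketch (just an illustration, not a polished figure) comparing total COVerAGE-DB positivity, summed over ages, with OWID’s short-term positive rate for one country:
COVwPOS %>%
  ungroup() %>%
  dplyr::filter(Country == "Denmark", Sex == "b") %>%
  group_by(Date) %>%
  summarize(COV_positivity = sum(Cases) / sum(Tests),
            OWID_rate = first(`Short-term positive rate`),
            .groups = "drop") %>%
  dplyr::filter(!is.na(COV_positivity), !is.na(OWID_rate)) %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = COV_positivity, color = "COVerAGE-DB")) +
  geom_line(aes(y = OWID_rate, color = "OWID")) +
  labs(y = "Positivity", color = "Source")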
The Short Term Mortality Fluctuations project is part of the Human Mortality Database <www.mortality.org>. They deliver all-cause deaths data standardized in wide age groups. This is what you want to use to estimate all-cause excess mortality, which may be larger or smaller than registered COVID-19 deaths, depending on the place and date. Much of the input data is given in 5-year age groups. To work with that, you need to take care of some data processing things, like redistributing deaths of unknown age. There’s also a fundamental difference in that STMF is given as new deaths per week, whereas COVerAGE-DB is cumulative daily data, but it has gaps.
In this example, adapted from a 2020 EDSD Data Wrangling Example, we’ll pick out a single country to merge, Denmark. This will require some prep on both sides of the merge. On the COVerAGE-DB side, we’ll convert COVerAGE-DB data to new counts in isoweeks. On the STMF side we’ll redistribute any deaths of unknown age, and we’ll group input data to 10-year age groups. Then it should be ready to merge.
The easier alternative is to group COVerAGE-DB data to the same large age groups, then slightly less data wrangling is required, but the age resolution is lower.
First prepare the STMF data. In this case there are no ages coded "UNK", but we rescale to "TOT" just to be sure.
download.file("https://www.mortality.org/Public/STMF/Inputs/STMFinput.zip",
destfile ="STMFinput.zip")
STMF <-
read_csv(utils::unzip("STMFinput.zip", "DNKstmf.csv")) %>%
dplyr::filter(Year == 2020) %>%
group_by(Week, Sex) %>%
# move total to column
mutate(TOT = Deaths[Age == "TOT"]) %>%
dplyr::filter(Age != "TOT") %>%
mutate(dist = Deaths / sum(Deaths),
Deaths = dist * TOT,
Age = as.integer(Age),
Age = Age - Age %% 10) %>%
group_by(Week, Sex, Age) %>%
summarize(Deaths = sum(Deaths), .groups = "drop") %>%
arrange(Sex, Week, Age)
## Rows: 52932 Columns: 10
## -- Column specification ----------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr (6): PopCode, Sex, Age, AgeInterval, Type, Access
## dbl (4): Area, Year, Week, Deaths
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now prep the COVerAGE-DB data to new events by week. Denmark reports every Monday, which is also the start day of the ISO week definition that STMF uses. Since the series is cumulative, we can just take differences between the Monday totals.
COV <-
Output_10 %>%
dplyr::filter(Country == "Denmark",
Region == "All") %>%
dplyr::filter(weekdays(Date) == "Monday") %>%
# isoweek() matches the ISO week numbering used by STMF
mutate(Week = isoweek(Date)) %>%
# sort just to be sure decumulation works right
arrange(Sex, Age, Week) %>%
group_by(Sex, Age) %>%
# decumulate, pad w NA
mutate(cov_deaths_wk = c(diff(Deaths),NA),
cov_cases_wk = c(diff(Cases), NA),
cov_tests_wk = c(diff(Tests), NA)) %>%
# keep just what we want
select(Sex,
Week,
Age,
cov_deaths_wk,
cov_cases_wk,
cov_tests_wk)
Now ready to join. It would be the same if you had some excess mortality estimates done on the STMF data, for example.
Joined <- left_join(COV, STMF) %>%
arrange(Sex, Week, Age)
## Joining, by = c("Sex", "Week", "Age")
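As one illustration of what the merged data can support, here is a minimal sketch (the covid_share name is just illustrative) of the share of all-cause deaths registered as COVID-19 deaths, by week and 10-year age group:
Joined %>%
  dplyr::filter(Sex == "b", Age >= 40,
                !is.na(Deaths), !is.na(cov_deaths_wk)) %>%
  mutate(covid_share = cov_deaths_wk / Deaths) %>%
  ggplot(aes(x = Week, y = covid_share, color = as.factor(Age))) +
  geom_line() +
  scale_color_discrete_sequential("Emrld")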