COVerAGE-DB shares several files, ranging from untransformed inputs to fully harmonized outputs. The files have a similar structure, are all in .csv format, and should be straightforward to read into any common statistical package.
The inputDB is quite diverse in terms of the measures, metrics, and age groups it includes. This file includes everything we collect except for subsets that we are unable to process for one reason or another; in other words, it contains all prima facie valid data that we’ve collected. It has a long structure, in that all data collected are in a single column, Value. Aside from Country, Region, Code, and Date, you can filter on Metric ("Ratio", "Count", or "Fraction") or Measure ("Cases", "Deaths", "Tests", "ASCFR"). At some point we will incorporate Rate metrics. Note that if you’re interested in Count data, it may already be included as such, or you may need to calculate it from other data types. This is one of the things we take care of in the harmonization routines.
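For example, here is a minimal sketch of both situations (assuming inputDB has been read in as shown further below; this is only an illustration, not the official harmonization code). Rows already reported as Counts can simply be filtered, while Fractions can be scaled up by the "TOT" count of the same subset:
# Deaths already reported as Counts
inputDB %>%
  dplyr::filter(Metric == "Count", Measure == "Deaths")
# Deaths reported as Fractions, scaled up by the "TOT" count of the same subset
inputDB %>%
  dplyr::filter(Measure == "Deaths", Sex == "b") %>%
  group_by(Country, Region, Date) %>%
  mutate(TOT = Value[Metric == "Count" & Age == "TOT"][1]) %>%
  ungroup() %>%
  dplyr::filter(Metric == "Fraction", !is.na(TOT)) %>%
  mutate(Value = Value * TOT,
         Metric = "Count")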
The output files Output_5 and Output_10 are harmonized to "Count" metrics and to 5- and 10-year age groups from 0 to 100+. The column layout is also different: for the time being, each Measure sits in its own column. These files can be used as-is for many kinds of analyses, but you may also consider further smoothing, especially for the 5-year age group data, which can contain stochastic 0s.
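If you prefer not to smooth, one crude alternative (a sketch, assuming Output_5 has been read in the same way as Output_10 below) is to aggregate the 5-year output to wider age groups, here 20-year groups, which removes many of the stochastic 0s at the cost of age detail:
Output_5 %>%
  mutate(Age = Age - Age %% 20) %>%
  group_by(Country, Region, Code, Date, Sex, Age) %>%
  summarize(Cases = sum(Cases),
            Deaths = sum(Deaths),
            Tests = sum(Tests),
            .groups = "drop")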
An interactive data downloading tool will soon be released, which will facilitate smaller data extracts for specific analyses.
Details: commas are column separators, decimals are dots. When reading in, you may need to skip a row or more. You can typically read these files straight from the repository using the raw version. See the examples below.
Each COVerAGE-DB file is given a random-looking id on OSF, which can be used to download the files from R. Here are the ids and other important file parameters for the main files, for easy copy-pasting (updated 10 May 2022):
| File | id | col_types | skip |
|---|---|---|---|
| inputDB.zip | 9dsfk | "cccccciccdc" | 1 |
| Output_5.zip | 7tnfh | "ccccciiddd" | 3 |
| Output_10.zip | 43ucn | "ccccciiddd" | 3 |
| qualityMetrics.zip | qpfw5 | "ccDcdddcciiiiiiiiiiildd" | 1 |
| offsets.csv | unf6v | "idccc" | 1 |
col_types gives the column classes, where c means character, i is integer, d is double, D is a proper date format, and l is logical. The files have headers with timestamps, so some rows also need to be skipped (skip) when reading in.
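For instance, the table translates directly into read_csv() arguments. Here is a minimal sketch for offsets.csv, assuming readr is loaded and the file has already been downloaded from OSF (e.g. via osfr, as in the chunks below):
offsets <- read_csv("offsets.csv",
                    skip = 1,
                    col_types = "idccc")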
Here are the main packages you’ll want to have installed to read the files directly into an R session, with no need to manually download or unzip.
library(tidyverse)
library(readr)
library(lubridate)
library(ggplot2)
library(osfr)
library(covidAgeData)
inputDB file
# This downloads the file, preserving the name inputDB.zip
cdb_repo <- osf_retrieve_node("mpwjq")
osf_ls_files(cdb_repo, path = "Data") %>%
dplyr::filter(name == "inputDB.zip") %>%
osf_download(conflicts = "overwrite")
## # A tibble: 1 x 4
## name id local_path meta
## <chr> <chr> <chr> <list>
## 1 inputDB.zip 5f3ed659746a8100ad1a2420 ./inputDB.zip <named list [3]>
# This reads it in
inputDB <- read_csv("inputDB.zip",
skip = 1,
col_types = "cccccciccdc")
glimpse(inputDB)
## Rows: 33,806,302
## Columns: 11
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afg~
## $ Region <chr> "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All", "All~
## $ Code <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "~
## $ Date <chr> "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020", "01.07.2020~
## $ Sex <chr> "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "f",~
## $ Age <chr> "0", "10", "20", "30", "40", "50", "60", "70", "80", "TOT", "0", "10", "20", "30", "40", "50", "60", "70", "8~
## $ AgeInt <int> 10, 10, 10, 10, 10, 10, 10, 10, 25, NA, 10, 10, 10, 10, 10, 10, 10, 10, 25, NA, NA, NA, 10, 10, 10, 10, 10, 1~
## $ Metric <chr> "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "Count", "~
## $ Measure <chr> "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Cases", "Deaths", "Deaths",~
## $ Value <dbl> 169, 1525, 7948, 7592, 5362, 3747, 2267, 840, 301, 29751, 0, 5, 18, 54, 117, 161, 221, 95, 55, 726, 73515, 82~
## $ templateID <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "~
Notice that Age is character because it can contain "TOT" or "UNK" values. Therefore, if you sort on it, sorting will be alphabetical, in which case "10" comes before "5", and so on. This is the file from which we begin the harmonization routines. There is a lot in there that currently does not make it through processing for one reason or another. Maybe you have a use for it?
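A minimal sketch of one way to handle this (the filtering and ordering here are just illustrative): drop the "TOT" and "UNK" rows and convert Age to integer before sorting or plotting.
inputDB %>%
  dplyr::filter(!Age %in% c("TOT", "UNK")) %>%
  mutate(Age = as.integer(Age)) %>%
  arrange(Country, Region, Date, Sex, Measure, Metric, Age)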
Output files are friendlier for several reasons. Since measures from like subsets sit next to each other, it’s easier to calculate things. Since age groups are uniform, it’s easy to compare subsets. Since Age is a proper integer, it’s easy to plot on the abscissa. However, you’ll still want to convert the Date class to use it in plotting. Dates come as character strings following the format "DD.MM.YYYY", and can be converted using lubridate::dmy() without further ado.
osf_ls_files(cdb_repo, path = "Data") %>%
dplyr::filter(name == "Output_10.zip") %>%
osf_download(conflicts = "overwrite")
## # A tibble: 1 x 4
## name id local_path meta
## <chr> <chr> <chr> <list>
## 1 Output_10.zip 5f3ed666bacde800a533bb10 ./Output_10.zip <named list [3]>
# This reads it in
Output_10 <- read_csv("Output_10.zip",
skip = 3,
col_types = "ccccciiddd")
# convert to date class
Output_10 <-
Output_10 %>%
mutate(Date = dmy(Date))
By now the files have grown too big to handle easily in Excel. Here’s a way to filter down and save out so that you can work with them in Excel if need be (a slicker interactive solution is in the works).
# Filter down to just one country
Out <-
Output_10 %>%
filter(Country == "Spain", Region == "All")
# save it out
write_csv(Out, file = "Spain_Output_10.csv")
# or for the inputDB
In <-
inputDB %>%
filter(Country == "Spain", Region == "All")
write_csv(In, file = "Spain_InputDB.csv")
covidAgeData package
The process of reading and filtering COVerAGE-DB files is automated in the covidAgeData package. It is not yet available on CRAN, but can be installed from GitHub via the remotes package.
remotes::install_github("eshom/covid-age-data")
# This downloads and reads in one line
inputDB <- download_covid("inputDB", progress = FALSE)
covidAgeData includes wrappers for either very fast in-memory data subsetting, or slower but more memory-efficient subsetting. The next example demonstrates both methods on inputDB.
subset_covid(inputDB, Country = "Brazil", Region = "All") %>%
head()
## Country Region Code Date Sex Age AgeInt Metric Measure Value templateID
## 1 Brazil All BR 16.03.2020 b 0 10 Count Deaths 0 BR
## 2 Brazil All BR 16.03.2020 b 10 10 Count Deaths 0 BR
## 3 Brazil All BR 16.03.2020 b 20 10 Count Deaths 0 BR
## 4 Brazil All BR 16.03.2020 b 30 10 Count Deaths 0 BR
## 5 Brazil All BR 16.03.2020 b 40 10 Count Deaths 0 BR
## 6 Brazil All BR 16.03.2020 b 50 10 Count Deaths 0 BR
# Drastically conserve memory using this memory-efficient version, powered by
# the `vroom` package
read_subset_covid("inputDB.zip", "inputDB", Country = "Sweden",
Region = "All") %>%
head()
## Country Region Code Date Sex Age AgeInt Metric Measure Value templateID
## 1 Sweden All SE 06.04.2020 b 0 10 Count Cases 43 SE
## 2 Sweden All SE 06.04.2020 b 10 10 Count Cases 143 SE
## 3 Sweden All SE 06.04.2020 b 20 10 Count Cases 503 SE
## 4 Sweden All SE 06.04.2020 b 30 10 Count Cases 651 SE
## 5 Sweden All SE 06.04.2020 b 40 10 Count Cases 935 SE
## 6 Sweden All SE 06.04.2020 b 50 10 Count Cases 1291 SE
Let’s walk through some commonly made plots of this kind of data. These are a pretty random selection. Different examples are found in the tabs. Code donations gladly accepted.
How can we plot log CFR by age? Subset to any population with both Cases and Deaths, and select a not-so-noisy date range (eyeball it). Here we go ahead and plot a time series of the ASCFR. It works OK because the CFR has changed relatively smoothly over time. Apparent decreases could be real or data artifacts.
Output_10 %>%
filter(Country == "Germany",
Region == "All",
Sex == "b",
Date >= dmy("01.05.2020")) %>%
mutate(ASCFR = Deaths / Cases) %>%
filter(!is.na(ASCFR)) %>%
ggplot(aes(x = Age, y = ASCFR, group = Date, color = Date)) +
geom_line(alpha = .2) +
scale_y_log10()
How about a composition plot of new cases by age over time? Here we group the data to 20-year age bands, then decumulate using New = Cases - lead(Cases), then convert to fractions with Frac = New / N and plot using geom_area(). Could be cleaned up a bit more, but this gets it started.
library(colorspace)
Output_10 %>%
filter(Country == "Germany",
Region == "All",
Date >= dmy("01.04.2020")) %>%
mutate(Age20 = Age - Age %% 20) %>%
group_by(Date, Age20) %>%
summarize(Cases = sum(Cases),
.groups = "drop") %>%
group_by(Age20) %>%
arrange(Date) %>%
mutate(New = Cases - lead(Cases)) %>%
ungroup() %>%
group_by(Date) %>%
mutate(N = sum(New),
Frac = New / N) %>%
ungroup() %>%
ggplot(aes(x = Date,
y = Frac,
fill = as.factor(Age20))) +
geom_area() +
scale_fill_discrete_sequential("Emrld")
In general, taking ratios of variables is a good diagnostic tool. In the first run of this plot, we notice a major 1-day rupture early in the series. Major age crossovers since October are notable.
Output_10 %>%
dplyr::filter(Country == "Denmark",
Region == "All",
Sex == "b",
Date >= dmy("01.05.2020")) %>%
mutate(Positivity = Cases / Tests) %>%
dplyr::filter(!is.na(Positivity)) %>%
ggplot(aes(x = Date, y = Positivity, group = Age, color = as.factor(Age))) +
geom_line() +
scale_color_discrete_sequential("Magenta")
This section covers how to merge with other global databases. Different sources are found in the tabs. We will sometimes use the countrycode package to help with matching country names.
Let’s merge the 2020 WPP population estimates to the 10-year age group data. We’ll use the wpp2019 package for this for the time being. Let’s see, how about South Africa?
library(wpp2019)
data(popM)
Pop <-
popM %>%
dplyr::filter(name == "South Africa") %>%
select(Age = age, Population = `2020`) %>%
mutate(Population = Population * 1000,
Age = as.character(Age)) %>%
separate(Age,
into = c("Age",NA),
sep = "-") %>%
mutate(Age = ifelse(Age == "100+", 100, as.integer(Age)),
Age = Age - Age %% 10) %>%
group_by(Age) %>%
summarize(Population = sum(Population), .groups = "drop")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [21].
## Warning in ifelse(Age == "100+", 100, as.integer(Age)): NAs introduced by coercion
ZA <-
Output_10 %>%
dplyr::filter(Country == "South Africa",
Region == "All",
Sex == "m") %>%
left_join(Pop)
## Joining, by = "Age"
Let’s see what we get. Note that Cases / Population is not a rate per se: Population isn’t an exposure, so handle population denominators with care. An alternative would be to decumulate Cases, group to weeks, and divide by Population / 52 or some better approximation (see the sketch after the next chunk). Either way, you need to remember that Cases are only those infections that have been detected, so they are necessarily a subset of infections.
ZA %>%
dplyr::filter(Date >= dmy("10.07.2020")) %>%
mutate(CPop = Cases / Population) %>%
ggplot(aes(x = Date, y = CPop, group = Age, color = as.factor(Age))) +
geom_line() +
scale_color_discrete_sequential("Magenta") +
scale_y_log10()
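As mentioned above, a rough alternative is to decumulate Cases to weekly new cases and divide by an approximate weekly exposure, Population / 52. Here is a minimal sketch of that idea (the week grouping and the NewCases / WeeklyPerCapita names are just illustrative):
ZA %>%
  dplyr::filter(!is.na(Cases)) %>%
  # collapse to one cumulative value per week and age group
  mutate(Week = floor_date(Date, "week")) %>%
  group_by(Age, Week) %>%
  summarize(Cases = max(Cases),
            Population = first(Population),
            .groups = "drop") %>%
  # decumulate within age groups
  arrange(Age, Week) %>%
  group_by(Age) %>%
  mutate(NewCases = Cases - lag(Cases)) %>%
  ungroup() %>%
  dplyr::filter(NewCases > 0) %>%
  # crude weekly "rate": new cases over an approximate weekly exposure
  mutate(WeeklyPerCapita = NewCases / (Population / 52)) %>%
  ggplot(aes(x = Week, y = WeeklyPerCapita, color = as.factor(Age))) +
  geom_line() +
  scale_color_discrete_sequential("Magenta") +
  scale_y_log10()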
Our World in Data provides different kinds of testing aggregates and other interesting things: https://github.com/owid/covid-19-data/tree/master/public/data. COVerAGE-DB collects a lot of testing aggregates too. When not broken down by age, these are delivered in the inputDB. Test counts broken down by age pass through to the output files. Our collected testing measures could be either Tests or Tested individuals, and the only way to know is to check the metadata. OWID applies a consistent definition at this time.
There are different files on the OWID GitHub repository. Here’s one that contains the full available time series for each country.
library(countrycode)
OWID <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/testing/covid-testing-all-observations.csv",
col_types= "ccDcccdddddddd") %>%
dplyr::filter(!is.na(`ISO code`)) %>%
dplyr::filter(! `ISO code` %in% c("OWID_KOS","OWID_WRL")) %>%
mutate(ISO2 = countrycode(`ISO code`,
origin = 'iso3c',
destination = 'iso2c')) %>%
select(ISO2,
Date,
`Short-term tests per case`,
`Short-term positive rate`)
# merge for coverage with positivity
COVwPOS <- Output_10 %>%
dplyr::filter(Region == "All") %>%
rename(ISO2 = Code) %>%
group_by(Country) %>%
left_join(OWID)
We’ll add a plot using the merged data when we think of something.
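In the meantime, here is a minimal sketch (just an illustration, not a polished figure) comparing total COVerAGE-DB positivity, summed over ages, with OWID’s short-term positive rate for one country:
COVwPOS %>%
  ungroup() %>%
  dplyr::filter(Country == "Denmark", Sex == "b") %>%
  group_by(Date) %>%
  summarize(COV_positivity = sum(Cases) / sum(Tests),
            OWID_rate = first(`Short-term positive rate`),
            .groups = "drop") %>%
  dplyr::filter(!is.na(COV_positivity), !is.na(OWID_rate)) %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = COV_positivity, color = "COVerAGE-DB")) +
  geom_line(aes(y = OWID_rate, color = "OWID")) +
  labs(y = "Positivity", color = "Source")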
The Short Term Mortality Fluctuations project is part of the Human Mortality Database <www.mortality.org>. They deliver all-cause deaths data standardized in wide age groups. This is what you want to use to estimate all-cause excess mortality, which may be larger or smaller than registered COVID-19 deaths, depending on the place and date. Much of the input data is given in 5-year age groups. To work with that, you need to take care of some data processing things, like redistributing deaths of unknown age. There’s also a fundamental difference in that STMF is given as new deaths per week, whereas COVerAGE-DB is cumulative daily data, but it has gaps.
In this example, adapted from a 2020 EDSD Data Wrangling Example, we’ll pick out a single country to merge, Denmark. This will require some prep on both sides of the merge. On the COVerAGE-DB side, we’ll convert COVerAGE-DB data to new counts in isoweeks. On the STMF side we’ll redistribute any deaths of unknown age, and we’ll group input data to 10-year age groups. Then it should be ready to merge.
The easier alternative is to group COVerAGE-DB data to the same large age groups, then slightly less data wrangling is required, but the age resolution is lower.
First prepare the STMF data. In this case there are no ages coded "UNK", but we rescale to "TOT" just to be sure.
download.file("https://www.mortality.org/Public/STMF/Inputs/STMFinput.zip",
destfile ="STMFinput.zip")
STMF <-
read_csv(utils::unzip("STMFinput.zip", "DNKstmf.csv")) %>%
dplyr::filter(Year == 2020) %>%
group_by(Week, Sex) %>%
# move total to column
mutate(TOT = Deaths[Age == "TOT"]) %>%
dplyr::filter(Age != "TOT") %>%
mutate(dist = Deaths / sum(Deaths),
Deaths = dist * TOT,
Age = as.integer(Age),
Age = Age - Age %% 10) %>%
group_by(Week, Sex, Age) %>%
summarize(Deaths = sum(Deaths), .groups = "drop") %>%
arrange(Sex, Week, Age)
## Rows: 52932 Columns: 10
## -- Column specification ----------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr (6): PopCode, Sex, Age, AgeInterval, Type, Access
## dbl (4): Area, Year, Week, Deaths
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now prep the COVerAGE-DB data to new events by week. Denmark reports every Monday, which is also the start day of the ISO week definition that STMF uses. Since the series is cumulative, we can just take differences between the Monday totals.
COV <-
Output_10 %>%
dplyr::filter(Country == "Denmark",
Region == "All") %>%
dplyr::filter(weekdays(Date) == "Monday") %>%
# isoweek() matches the ISO week numbering used by STMF
mutate(Week = isoweek(Date)) %>%
# sort just to be sure decumulation works right
arrange(Sex, Age, Week) %>%
group_by(Sex, Age) %>%
# decumulate, pad w NA
mutate(cov_deaths_wk = c(diff(Deaths),NA),
cov_cases_wk = c(diff(Cases), NA),
cov_tests_wk = c(diff(Tests), NA)) %>%
# keep just what we want
select(Sex,
Week,
Age,
cov_deaths_wk,
cov_cases_wk,
cov_tests_wk)
Now ready to join. It would be the same if you had some excess mortality estimates done on the STMF data, for example.
Joined <- left_join(COV, STMF) %>%
arrange(Sex, Week, Age)
## Joining, by = c("Sex", "Week", "Age")
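As one illustration of what the merged data can support, here is a minimal sketch (the covid_share name is just illustrative) of the share of all-cause deaths registered as COVID-19 deaths, by week and 10-year age group:
Joined %>%
  dplyr::filter(Sex == "b", Age >= 40,
                !is.na(Deaths), !is.na(cov_deaths_wk)) %>%
  mutate(covid_share = cov_deaths_wk / Deaths) %>%
  ggplot(aes(x = Week, y = covid_share, color = as.factor(Age))) +
  geom_line() +
  scale_color_discrete_sequential("Emrld")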