How to use R to analyze US COVID pandemic waves and peaks

Posted on 2023-01-12 Edited on 2024-01-27 In Tech , Big Data Views:

Goal

According to weekly data between 2020 to 2022, we want to get to know the waves and peaks of COVID pandemic in these years.

Download the data

We will continue to use NCHS(National Center for Health Statistics) as our data source.

Visit https://data.cdc.gov/browse?category=NCHS&sortBy=last_modified, and search Provisional COVID-19 Death Counts by Week, we will find the data we are intrest.

https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-by-Week-Ending-D/r8kw-7aab, in this page, we can export data into csv file.

With that, we may get the data source csv, Provisional_COVID-19_Death_Counts_by_Week_Ending_Date_and_State.csv.

Load the data

ibrary("dplyr")
ibrary("janitor")
ibrary("tidyr")
ibrary("readr")
f <- readr::read_csv(file.path(getwd(), "Provisional_COVID-19_Death_Counts_by_Week_Ending_Date_and_State.csv"), col_names = TRUE)
f <- clean_names(df)
mp_start_date <- strptime(df$start_date, "%m/%d/%Y")
f$start_date <- format(tmp_start_date, "%Y-%m-%d")
> glimpse(df)
Rows: 10,800
Columns: 17
$ data_as_of                             <chr> "01/09/2023", "01/09/2023", "01…
$ start_date                             <chr> "2019-12-29", "2020-01-05", "20…
$ end_date                               <chr> "01/04/2020", "01/11/2020", "01…
$ group                                  <chr> "By Week", "By Week", "By Week"…
$ year                                   <chr> "2019/2020", "2020", "2020", "2…
$ month                                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA,…
$ mmwr_week                              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
$ week_ending_date                       <chr> "01/04/2020", "01/11/2020", "01…
$ state                                  <chr> "United States", "United States…
$ covid_19_deaths                        <dbl> 0, 1, 2, 3, 0, 4, 6, 6, 9, 38, …
$ total_deaths                           <dbl> 60176, 60734, 59362, 59162, 588…
$ percent_of_expected_deaths             <dbl> 98, 97, 98, 99, 99, 100, 100, 1…
$ pneumonia_deaths                       <dbl> 4111, 4153, 4066, 3915, 3818, 3…
$ pneumonia_and_covid_19_deaths          <dbl> 0, 1, 2, 0, 0, 1, 1, 3, 5, 19, …
$ influenza_deaths                       <dbl> 434, 475, 468, 500, 481, 520, 5…
$ pneumonia_influenza_or_covid_19_deaths <dbl> 4545, 4628, 4534, 4418, 4299, 4…
$ footnote                               <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
!> df
# A tibble: 10,800 × 17
   data_as_of start_date end_d…¹ group year  month mmwr_…² week_…³ state covid…⁴
   <chr>      <chr>      <chr>   <chr> <chr> <dbl>   <dbl> <chr>   <chr>   <dbl>
 1 01/09/2023 2019-12-29 01/04/… By W… 2019…    NA       1 01/04/… Unit…       0
 2 01/09/2023 2020-01-05 01/11/… By W… 2020     NA       2 01/11/… Unit…       1
 3 01/09/2023 2020-01-12 01/18/… By W… 2020     NA       3 01/18/… Unit…       2
 4 01/09/2023 2020-01-19 01/25/… By W… 2020     NA       4 01/25/… Unit…       3
 5 01/09/2023 2020-01-26 02/01/… By W… 2020     NA       5 02/01/… Unit…       0
 6 01/09/2023 2020-02-02 02/08/… By W… 2020     NA       6 02/08/… Unit…       4
 7 01/09/2023 2020-02-09 02/15/… By W… 2020     NA       7 02/15/… Unit…       6
 8 01/09/2023 2020-02-16 02/22/… By W… 2020     NA       8 02/22/… Unit…       6
 9 01/09/2023 2020-02-23 02/29/… By W… 2020     NA       9 02/29/… Unit…       9
10 01/09/2023 2020-03-01 03/07/… By W… 2020     NA      10 03/07/… Unit…      38
# … with 10,790 more rows, 7 more variables: total_deaths <dbl>,
#   percent_of_expected_deaths <dbl>, pneumonia_deaths <dbl>,
#   pneumonia_and_covid_19_deaths <dbl>, influenza_deaths <dbl>,
#   pneumonia_influenza_or_covid_19_deaths <dbl>, footnote <chr>, and
#   abbreviated variable names ¹ end_date, ² mmwr_week, ³ week_ending_date,
#   ⁴ covid_19_deaths
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Identify the data we want to focus

As we can see, there are 4 diffrent groups, and it has the whole United states and each state’s data

> unique(df$group)
[1] "By Week"  "By Month" "By Year"  "By Total"
>
```   

We only want to get the weekly data, so we may want to `filter` with "group=By Week", and "state=United states"

in the mean time, we may only want to `select` only 2 columns.

- start_date
- covid_19_deaths

df1 <- df %>%
filter(state == “United States” & group == “By Week”) %>%
select(start_date, covid_19_deaths)
print(df1,n=20)

A tibble: 158 × 2
start_date covid_19_deaths

1 2019-12-29 0
2 2020-01-05 1
3 2020-01-12 2
4 2020-01-19 3
5 2020-01-26 0
6 2020-02-02 4
7 2020-02-09 6
8 2020-02-16 6
9 2020-02-23 9
10 2020-03-01 38
11 2020-03-08 60
12 2020-03-15 588
13 2020-03-22 3226
14 2020-03-29 10141
15 2020-04-05 16347
16 2020-04-12 17221
17 2020-04-19 15557
18 2020-04-26 13223
19 2020-05-03 11243
20 2020-05-10 9239
…

1
2
3


## Draw the graph to see the wave

library(“ggplot2”)
library(“sjPlot”)
p = ggplot(df1, aes( x=start_date, y=covid_19_deaths, group=1)) +
geom_line(color=”blue”) +
theme(axis.text.x=element_text(angle=45,hjust=1,size=5))
save_plot(“covid_plot_weekly_wave.svg”, fig = p, width=60, height=20)


![Image](/images/covid_plot_weekly_wave.svg)

## Find the peak by R mark it in the graph

From above graph, we can easily to figure out the waves and peaks, but we also can let R help us to do it, it's pretty useful if we have to deal with many data and many graphs.

To achive it, firstly we can call `findpeaks` from `pracma` library to find the peaks

library(“pracma”)

peaks = findpeaks(df1$covid_19_deaths, npeaks=5, sortstr=TRUE)

peaks
[,1] [,2] [,3] [,4]

[1,] 26027 54 40 66
[2,] 21364 108 98 121
[3,] 17221 16 8 26
[4,] 15536 88 79 98
[5,] 8308 31 26 38


The 2nd column means the the row index of the peak. in this case, we can tell, the 54th row has the top peak covid death number `26027`.

It's not very obvious which week(start_date) is hitting the peak, so we can do something like this.

is_peak <- vector( “logical” , length(df1$covid_19_deaths ))
df1$is_peak = is_peak

for (x in peaks[,2]) {
df1$is_peak[x] = TRUE
}

1
2
3


As you can see, we added a new column `is_peak`, so we can use it to filter out those none peak data, sort the peak data points.

!> df2 = df1 %>% filter(is_peak == TRUE)

df2[order(-df2$covid_19_deaths),]

A tibble: 5 × 3
start_date covid_19_deaths is_peak

1 2021-01-03 26027 TRUE
2 2022-01-16 21364 TRUE
3 2020-04-12 17221 TRUE
4 2021-08-29 15536 TRUE
5 2020-07-26 8308 TRUE
1
2
3

## Hightlight the peak points
p = ggplot(df1, aes(x=start_date, y=covid_19_deaths, group=1)) +
geom_line(color=”blue”) +
geom_point(data = . %>% filter(is_peak == TRUE), stat=”identity”, size = 4, color = “red”) +
scale_y_continuous(breaks=seq(0,30000,4000)) +
theme(axis.text.x=element_text(angle=45,hjust=1,size=5))

save_plot(“covid_plot_weekly_peak.svg”, fig = p, width=60, height=20)


![Image](/images/covid_plot_weekly_peak.svg)

## Other finding

!> > sum(df1$covid_19_deaths)
[1] 1089714 ===> the total covid_19_deaths death number from 2020 to 2022

!> summary(df1$covid_19_deaths)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2223 4428 6897 9862 26027

!> df3 <- df %>%

filter(state == "United States" & group == "By Week") %>%

```
select(start_date, total_deaths)
```
sum(df3$total_deaths)
[1] 10077273 ===> the total death number from 2020 to 2022
summary(df3$total_deaths)

Min. 1st Qu. Median Mean 3rd Qu. Max.
7100 58522 60451 63780 68610 87415

Goal

Download the data

Load the data

Identify the data we want to focus

A tibble: 158 × 2

A tibble: 5 × 3