Using R to analyze US COVID pandemic waves and peaks

Goal

According to weekly data between 2020 to 2022, we want to get to know the waves and peaks of COVID pandemic in these years.

Download the data

We will continue to use NCHS(National Center for Health Statistics) as our data source.

Visit https://data.cdc.gov/browse?category=NCHS&sortBy=last_modified, and search Provisional COVID-19 Death Counts by Week, we will find the data we are intrest.

https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-by-Week-Ending-D/r8kw-7aab, in this page, we can export data into csv file.

With that, we may get the data source csv, Provisional_COVID-19_Death_Counts_by_Week_Ending_Date_and_State.csv.

Load the data

library("dplyr")
library("janitor")
library("tidyr")
library("readr")
df <- readr::read_csv(file.path(getwd(), "Provisional_COVID-19_Death_Counts_by_Week_Ending_Date_and_State.csv"), col_names = TRUE)
df <- clean_names(df)
tmp_start_date <- strptime(df$start_date, "%m/%d/%Y")
df$start_date <- format(tmp_start_date, "%Y-%m-%d")

 > glimpse(df)
 Rows: 10,800
 Columns: 17
 $ data_as_of                             <chr> "01/09/2023", "01/09/2023", "01…
 $ start_date                             <chr> "2019-12-29", "2020-01-05", "20…
 $ end_date                               <chr> "01/04/2020", "01/11/2020", "01…
 $ group                                  <chr> "By Week", "By Week", "By Week"…
 $ year                                   <chr> "2019/2020", "2020", "2020", "2…
 $ month                                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA,…
 $ mmwr_week                              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
 $ week_ending_date                       <chr> "01/04/2020", "01/11/2020", "01…
 $ state                                  <chr> "United States", "United States…
 $ covid_19_deaths                        <dbl> 0, 1, 2, 3, 0, 4, 6, 6, 9, 38, …
 $ total_deaths                           <dbl> 60176, 60734, 59362, 59162, 588…
 $ percent_of_expected_deaths             <dbl> 98, 97, 98, 99, 99, 100, 100, 1…
 $ pneumonia_deaths                       <dbl> 4111, 4153, 4066, 3915, 3818, 3…
 $ pneumonia_and_covid_19_deaths          <dbl> 0, 1, 2, 0, 0, 1, 1, 3, 5, 19, …
 $ influenza_deaths                       <dbl> 434, 475, 468, 500, 481, 520, 5…
 $ pneumonia_influenza_or_covid_19_deaths <dbl> 4545, 4628, 4534, 4418, 4299, 4…
 $ footnote                               <chr> NA, NA, NA, NA, NA, NA, NA, NA,…

 !> df
 # A tibble: 10,800 × 17
    data_as_of start_date end_d…¹ group year  month mmwr_…² week_…³ state covid…⁴
    <chr>      <chr>      <chr>   <chr> <chr> <dbl>   <dbl> <chr>   <chr>   <dbl>
  1 01/09/2023 2019-12-29 01/04/… By W… 2019…    NA       1 01/04/… Unit…       0
  2 01/09/2023 2020-01-05 01/11/… By W… 2020     NA       2 01/11/… Unit…       1
  3 01/09/2023 2020-01-12 01/18/… By W… 2020     NA       3 01/18/… Unit…       2
  4 01/09/2023 2020-01-19 01/25/… By W… 2020     NA       4 01/25/… Unit…       3
  5 01/09/2023 2020-01-26 02/01/… By W… 2020     NA       5 02/01/… Unit…       0
  6 01/09/2023 2020-02-02 02/08/… By W… 2020     NA       6 02/08/… Unit…       4
  7 01/09/2023 2020-02-09 02/15/… By W… 2020     NA       7 02/15/… Unit…       6
  8 01/09/2023 2020-02-16 02/22/… By W… 2020     NA       8 02/22/… Unit…       6
  9 01/09/2023 2020-02-23 02/29/… By W… 2020     NA       9 02/29/… Unit…       9
 10 01/09/2023 2020-03-01 03/07/… By W… 2020     NA      10 03/07/… Unit…      38
 # … with 10,790 more rows, 7 more variables: total_deaths <dbl>,
 #   percent_of_expected_deaths <dbl>, pneumonia_deaths <dbl>,
 #   pneumonia_and_covid_19_deaths <dbl>, influenza_deaths <dbl>,
 #   pneumonia_influenza_or_covid_19_deaths <dbl>, footnote <chr>, and
 #   abbreviated variable names ¹ end_date, ² mmwr_week, ³ week_ending_date,
 #   ⁴ covid_19_deaths
 # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Identify the data we want to focus

As we can see, there are 4 diffrent groups, and it has the whole United states and each state’s data

> unique(df$group)
[1] "By Week"  "By Month" "By Year"  "By Total"
>

We only want to get the weekly data, so we may want to filter with “group=By Week”, and “state=United states”

in the mean time, we may only want to select only 2 columns.

  • start_date

  • covid_19_deaths

    df1 <- df %>%
    filter(state == “United States” & group == “By Week”) %>%
    select(start_date, covid_19_deaths)
    print(df1,n=20)

    • A tibble: 158 × 2

      start_date covid_19_deaths

      1 2019-12-29 0
      2 2020-01-05 1
      3 2020-01-12 2
      4 2020-01-19 3
      5 2020-01-26 0
      6 2020-02-02 4
      7 2020-02-09 6
      8 2020-02-16 6
      9 2020-02-23 9
      10 2020-03-01 38
      11 2020-03-08 60
      12 2020-03-15 588
      13 2020-03-22 3226
      14 2020-03-29 10141
      15 2020-04-05 16347
      16 2020-04-12 17221
      17 2020-04-19 15557
      18 2020-04-26 13223
      19 2020-05-03 11243
      20 2020-05-10 9239

Draw the graph to see the wave

library("ggplot2")
library("sjPlot")
p = ggplot(df1, aes( x=start_date, y=covid_19_deaths, group=1)) +
    geom_line(color="blue") +
    theme(axis.text.x=element_text(angle=45,hjust=1,size=5))
save_plot("covid_plot_weekly_wave.svg", fig = p, width=60, height=20)

Image

Find the peak by R mark it in the graph

From above graph, we can easily to figure out the waves and peaks, but we also can let R help us to do it, it’s pretty useful if we have to deal with many data and many graphs.

To achive it, firstly we can call findpeaks from pracma library to find the peaks

library("pracma")
+ peaks = findpeaks(df1$covid_19_deaths, npeaks=5,  sortstr=TRUE)
> > peaks
      [,1] [,2] [,3] [,4]
[1,] 26027   54   40   66
[2,] 21364  108   98  121
[3,] 17221   16    8   26
[4,] 15536   88   79   98
[5,]  8308   31   26   38
>

The 2nd column means the the row index of the peak. in this case, we can tell, the 54th row has the top peak covid death number 26027.

It’s not very obvious which week(start_date) is hitting the peak, so we can do something like this.

is_peak <- vector( "logical" , length(df1$covid_19_deaths ))
df1$is_peak = is_peak

for (x in peaks[,2]) {
  df1$is_peak[x] = TRUE
}

As you can see, we added a new column is_peak, so we can use it to filter out those none peak data, sort the peak data points.

!> df2 = df1 %>% filter(is_peak == TRUE)
 + df2[order(-df2$covid_19_deaths),]
 > # A tibble: 5 × 3
   start_date covid_19_deaths is_peak
   <chr>                <dbl> <lgl>
 1 2021-01-03           26027 TRUE
 2 2022-01-16           21364 TRUE
 3 2020-04-12           17221 TRUE
 4 2021-08-29           15536 TRUE
 5 2020-07-26            8308 TRUE
 > >

Hightlight the peak points

p = ggplot(df1, aes(x=start_date, y=covid_19_deaths, group=1)) +
    geom_line(color="blue") +
    geom_point(data = . %>% filter(is_peak == TRUE), stat="identity", size = 4, color = "red") +
    scale_y_continuous(breaks=seq(0,30000,4000)) +
    theme(axis.text.x=element_text(angle=45,hjust=1,size=5))

save_plot("covid_plot_weekly_peak.svg", fig = p, width=60, height=20)

Image

Other finding

!> > sum(df1$covid_19_deaths)
 [1] 1089714   ===> the total covid_19_deaths death number from 2020 to 2022

!> summary(df1$covid_19_deaths)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       0    2223    4428    6897    9862   26027
 >
!> df3 <- df %>%
 +     filter(state == "United States" & group == "By Week") %>%
 +     select(start_date, total_deaths)
 + sum(df3$total_deaths)
 + > [1] 10077273  ===> the total death number from 2020 to 2022 
 + summary(df3$total_deaths)
 >    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    7100   58522   60451   63780   68610   87415
 >