Final Project

Daniel Tafmizi

Lis 4273

Dr. Ajani 

April 21, 2024

Final Project

    My final project uses a World Health Organization data set that tracks weekly deaths and cases (D/C) of Covid 19. I subsetted the United States of America and Canada data rows. I am interested in learning about the D/C ratio between the years of each country, as well as the D/C ratio between the countries. This offers useful insight into the healthcare and social systems of the countries. Having the D/C ratio decrease over time shows that better healthcare and social practices are being implemented. It also acts as a measure of the country's overall healthcare systems.

    I will use what I have learned in this class to run Anova analysis on the data. I will detail the process I took as I go along. As a sidenote, it is important I note that the data may have some departures due to local adaptations (WHO). Such as, what they consider a covid death or how they timeline the report. However, taking a yearly approach should avoid these issues.

data <- read.csv("C:/Users/dtafm/Downloads/WHO-COVID-19-global-data.csv") 

USData <- na.omit(sqldf("SELECT Date_reported, New_cases, New_deaths FROM data WHERE Country LIKE 'United States of America'")) 

USData$relation <- USData$New_deaths/USData$New_cases


    I start by reading in the data file. I use the slqdf function to subset all non-na US data. Then I create a new column to hold the D/C ratios of each week.


data2020 <- as.vector(sqldf("SELECT relation FROM USData WHERE date_reported LIKE '2020%'"))

data2021 <- as.vector(sqldf("SELECT relation FROM USData WHERE date_reported LIKE '2021%'"))

data2022 <- as.vector(sqldf("SELECT relation FROM USData WHERE date_reported LIKE '2022%'"))

data2023 <- as.vector(sqldf("SELECT relation FROM USData WHERE date_reported LIKE '2023%'"))

max_length <- length(data2021$relation)

USdf <- data.frame(us2020 = c(data2020$relation, rep(NA, max_length - length(data2020$relation))),

                   us2021 = c(data2021$relation, rep(NA, max_length - length(data2021$relation))),

                   us2022 = c(data2022$relation, rep(NA, max_length - length(data2022$relation))),

                   us2023 = c(data2023$relation, rep(NA, max_length - length(data2023$relation))))

newdfus <- data.frame(Group = rep(c("us2020","us2021", "us2022", "us2023"),each=52),

                 data = c(USdf$us2020, USdf$us2021, USdf$us2022,USdf$us2023 ))

I then data munged by subsetting yearly rows, filling with NAs to create equal lengths, and finally turning it into a usable dataframe for Anova analysis.

boxplot(USdf)

I used a box plot to see that there are unequal variances and uneven distributions. Furthermore, my datasets have different sample sizes. Because of this, I decided to use the nonparametric Kruskal-Wallis test. This fits my requirements because it compares the between group sum of squares calculated from the average ranks (Dalgaard, 2008). Thus, it is not reliant on equal variances and even distribution.

Null Hypothesis: The median is equal across all groups
Alternative Hypothesis: The median is not equal across all groups

kruskal.test(data~Group, data = newdfus)

                      Kruskal-Wallis rank sum test

data:  data by Group
Kruskal-Wallis chi-squared = 57.855, df = 3, p-value = 1.688e-12


Given the extremely low P-value, we reject the null hypothesis. We agree that there are significant differences between the medians of the years. To see where these differences lie, I will use the post hoc Dunn's Test.

dunnTest(data~Group, data = newdfus)

Dunn (1964) Kruskal-Wallis multiple comparison
  p-values adjusted with the Holm method.

       Comparison         Z      P.unadj        P.adj
1 us2020 - us2021  2.178917 2.933785e-02 5.867570e-02
2 us2020 - us2022  7.274403 3.479565e-13 2.087739e-12
3 us2021 - us2022  5.322061 1.025979e-07 5.129894e-07
4 us2020 - us2023  3.176151 1.492432e-03 5.969728e-03
5 us2021 - us2023  1.559085 1.189764e-01 1.189764e-01
6 us2022 - us2023 -2.407746 1.605135e-02 4.815405e-02



This test shows us that there were significant differences between the years 2020-2022, 2021-2022, 2020-2023, and 2022-2023. 

USdataTS <-ts(USData$relation, frequency = 52, start=c(2020,10), end = c(2023, 21))
plot(USdataTS)

The visualization shows a major spike in early 2020 that levels out by the latter half of the year.
The following years show a great deal of volatility. The Anova analysis fits well with the visual-
ization. One thing I find interesting is the intense volatility. I expected a continual negative trend
over time. Our mean D/C is about 0.016. I wonder how this compares to Canada.


I redid the same data munging to retrieve the subset of Canadian rows.

CanData <- na.omit(sqldf("SELECT Date_reported, New_cases, New_deaths FROM data WHERE Country LIKE 'Canada'"))
CanData$New_deaths <- abs(CanData$New_deaths) # one of the values is negative. I am assuming it was meant to be positive CanData$relation <- CanData$New_deaths/CanData$New_cases Cdata2020 <- as.vector(sqldf("SELECT relation FROM CanData WHERE date_reported LIKE '2020%'")) Cdata2021 <- as.vector(sqldf("SELECT relation FROM CanData WHERE date_reported LIKE '2021%'")) Cdata2022 <- as.vector(sqldf("SELECT relation FROM CanData WHERE date_reported LIKE '2022%'")) Cdata2023 <- as.vector(sqldf("SELECT relation FROM CanData WHERE date_reported LIKE '2023%'")) max_lengthc<- length(Cdata2021$relation) Candf <- data.frame(can2020 = c(Cdata2020$relation, rep(NA, max_lengthc - length(Cdata2020$relation))), can2021 = c(Cdata2021$relation, rep(NA, max_lengthc - length(Cdata2021$relation))), can2022 = c(Cdata2022$relation, rep(NA, max_lengthc - length(Cdata2022$relation))), can2023 = c(Cdata2023$relation, rep(NA, max_lengthc - length(Cdata2023$relation)))) newdfcan <- data.frame(Group = rep(c("can2020","can2021", "can2022", "can2023"),each=52), data = c(Candf$can2020, Candf$can2021, Candf$can2022,Candf$can2023 ))

I performed the same tests.

boxplot(Candf)


The box plot shows unequal variances and uneven distributions. The datasets also have
different sample sizes. Thus, I chose to use the same Kruskal-Wallis test.

Null Hypothesis: The median is equal across all groups
Alternative Hypothesis: The median is not equal across all groups

kruskal.test(data~Group, data = newdfcan) Kruskal-Wallis rank sum test data: data by Group Kruskal-Wallis chi-squared = 34.609, df = 3, p-value = 1.474e-07


Given the extremely low P-value, we reject the null hypothesis. We agree that there are 
significant differences between the medians of the years. To see where these differences lie, I will use the post hoc Dunn's Test.

dunnTest(data~Group, data = newdfcan)

Dunn (1964) Kruskal-Wallis multiple comparison
  p-values adjusted with the Holm method.

         Comparison           Z      P.unadj        P.adj
1 can2020 - can2021  3.69125444 2.231508e-04 6.694523e-04
2 can2020 - can2022  3.76318265 1.677646e-04 6.710583e-04
3 can2021 - can2022  0.07608929 9.393481e-01 9.393481e-01
4 can2020 - can2023 -0.78249371 4.339245e-01 8.678490e-01
5 can2021 - can2023 -4.48913858 7.151177e-06 3.575588e-05
6 can2022 - can2023 -4.56058639 5.101097e-06 3.060658e-05

This test shows us that there were significant differences between the years 2021-2022 and 2020-2023. 


CanDataTS <-ts(CanData$relation, frequency = 52, start=c(2020,10), end = c(2023, 21))

plot(CanDataTS, yaxt = "n")
axis(side=2, at=seq(0,.15,by = .02))

Canda has a similar trend to the United States. We see a major spike in early 2020, that levels out by the latter half of the year. We continue to see volatility. The mean D/C ratio is about
0.02. This is 25% higher than the US D/C ratio. I guess it is true that you get what you pay for.

For my last section, I will do an Anova between US and Canada joined years.

trimmed <- data.frame(USData[3:168,4] ,CanData[1:166, 4]) colnames(trimmed) <- c("USratio", "Canratio") boxplot(trimmed)
They have similar variances and distributions. A normal two-way Anova will suffice.
I will start with a T-test of trimmed data. I trimmed the data to include weeks where both
countries reported their D/C. This ensures there are differences in the means before I use
Anova with different sample sizes.
Tres <- t.test(data = trimmed, trimmed$USratio, trimmed$Canratio) print(Tres)

Welch Two Sample t-test data: trimmed$USratio and trimmed$Canratio t = -1.9545, df = 262.88, p-value = 0.0517 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -8.261529e-03 3.057302e-05 sample estimates: mean of x mean of y 0.01574676 0.01986224

While the P-value is barely greater than .05. I would still like to do more research into the
differences.

allDF <- data.frame(Country = rep(c("US", "Canda"), each = 208), year = rep(c("us2020","us2021", "us2022", "us2023","can2020","can2021", "can2022", "can2023"),each=52), data = c(USdf$us2020, USdf$us2021, USdf$us2022,USdf$us2023, Candf$can2020, Candf$can2021, Candf$can2022,Candf$can2023 ))

Null Hypothesis: There is no significant difference between US and Canada D/C ratio means
Alternative Hypothesis: There is significant difference between US and Canada D/C ratio 
means


model <- aov(data ~ Country * year, data = allDF) summary(model)
    Df  Sum Sq  Mean Sq F value Pr(>F)    
Country       1 0.00100 0.001002    3.69 0.0556 .  
year          6 0.02909 0.004849   17.85 <2e-16 ***
Residuals   347 0.09424 0.000272                   

TukeyHSD(model)

Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = data ~ Country * year, data = allDF)

$Country
                 diff          lwr          upr     p adj
US-Canda -0.003365008 -0.006810499 8.048297e-05 0.0555658

$year
                         diff          lwr           upr     p adj
us2020-can2020  -0.0082691606 -0.019110368  0.0025720471 0.2824439
us2021-can2021   0.0063207912 -0.003535028  0.0161766110 0.5135135
us2022-can2022   0.0005957571 -0.009260063  0.0104515768 0.9999996
us2023-can2023  -0.0029329410 -0.016639783  0.0107739013 0.9980541


I trimmed the post hoc results to meet my needs. I am interested in seeing the year-to-year
comparison between US and Canada. Given the high P-values for each comparison, we can
accept the null hypothesis. That means the year-to-year interchange of US and Canada D/C
ratio does not significantly differ. The years 2020 and 2021 show some differences, but not a
significant amount. The years 2022 and 2023 show nearly no difference.

Summary:

Expectations: I expected that both countries would see a continual negative trend in D/C ratio.
I expected that Canada would have a lower D/C ratio. My uncle lives in Toronto, Ca, and I
visit yearly. The area was extremely strict on masks and vaccinations, at a time where Florida was becoming more lenient.

Results: Both countries experienced similar year-to-year interchange of D/C ratio. Each
country's Anova analysis shows significant differences over the years. The joined
comparison Anova analysis shows no significant differences. US and Canda experienced.
similar spikes and volatility, as opposed to my negative trend expectation. Canada's D/C ratio was 25% higher than the United States, as opposed to my expectation that Canada will have
a lower D/C ratio.

References

Dalgaard, P. (2008). Introductory statistics with R (2nd ed.). Springer.

World Health Organization. (Accessed April, 2024). Covid-19 data | who covid-19 dashboard. World Health Organization. https://data.who.int/dashboards/covid19/data?n=c








Comments

Popular posts from this blog

First Day Assignment

Module 10 Assignment