Synopsis
Severe weather events, and storms in particular, can cause significant harm to public
health and damage to national economy. It might be tempting to allocate larger
chunks of public budgets to preparation for those types of storms that occur
more frequently.
This research paper switches focus from frequency of occurrence of a particular storm to measurable consequences. By analyzing NOAA Storm Database maintained since 1950,
this paper answers two questions:
- Across the United States, which types of events are most harmful with respect
to population health? - Across the United States, which types of events have the greatest economic
consequences?
Introduction
- Raw data of severe storm events from U.S. National Oceanic and Atmospheric Administration
is available at the Coursera web site (Storm Data). - R was used to conduct analysis and RStudio/RPubs to prepare and publish report.
Full reproducibility was one of the governing principles of the report - Plan of research:
- Data Processing: load data from web and read data into R. Transform data
into tidy data set containing
date of event, type of event, harm to public health and damage to economy (both
to be defined in due course of analysis). - Data Analysis: vizually analize obtained data.
- Results: present findings.
- Data Processing: load data from web and read data into R. Transform data
Data Processing
The aim of this section is to show and explain logic behind of all steps involved in obtaining tidy data set suitable for further analysis. The author of the report assumes
that no other information than the:
- link where the Storm Data can be obtained and
- accompanying documentation (Storm Data Documentation, FAQ)
is known to the reader at this point. So I will read, explore, and transform data
together with the reader of the report.
Load data
# download bzipped file if it has not been downlowded before
if (!file.exists("./data/StormData.csv.bz2")) {
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dir.create("data")
download.file(url, "./data/StormData.csv.bz2", method="curl")
}
Read CSV
# read StormData.csv into R object stormData if it has not been read already
# no manipulations on stormData object are allowed !
# before making any change to data, copy stormData into another object
if (!exists("stormData")) {
stormData <- read.csv(bzfile("./data/StormData.csv.bz2"),
stringsAsFactors=F)
}
Transform data
# explore stormData object
str(stormData)
## 'data.frame': 902297 obs. of 37 variables: ## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ... ## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ... ## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ... ## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ... ## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ... ## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ... ## $ STATE : chr "AL" "AL" "AL" "AL" ... ## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ... ## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ... ## $ BGN_AZI : chr "" "" "" "" ... ## $ BGN_LOCATI: chr "" "" "" "" ... ## $ END_DATE : chr "" "" "" "" ... ## $ END_TIME : chr "" "" "" "" ... ## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ... ## $ COUNTYENDN: logi NA NA NA NA NA NA ... ## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ... ## $ END_AZI : chr "" "" "" "" ... ## $ END_LOCATI: chr "" "" "" "" ... ## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ... ## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ... ## $ F : int 3 2 2 2 2 2 2 1 3 3 ... ## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ... ## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ... ## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ... ## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ... ## $ PROPDMGEXP: chr "K" "K" "K" "K" ... ## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ... ## $ CROPDMGEXP: chr "" "" "" "" ... ## $ WFO : chr "" "" "" "" ... ## $ STATEOFFIC: chr "" "" "" "" ... ## $ ZONENAMES : chr "" "" "" "" ... ## $ LATITUDE : num 3040 3042 3340 3458 3412 ... ## $ LONGITUDE : num 8812 8755 8742 8626 8642 ... ## $ LATITUDE_E: num 3051 0 0 0 0 ... ## $ LONGITUDE_: num 8806 0 0 0 0 ... ## $ REMARKS : chr "" "" "" "" ... ## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
After perusing these two documents (Storm Data Documentation, FAQ) and
looking at the structure of stormData, it is becoming clear that harm to public health
and damage to economy contain in variables that have to do with “FATALITIES”, “INJURIES”, and
those names containing “DMG” string. Let's subset these variables together with
date and event type.
# subset data related to health and economic damages
ind <- grep("fat|inj|dmg", colnames(stormData), ignore.case = T)
str(stormData[,ind])
## 'data.frame': 902297 obs. of 6 variables: ## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ... ## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ... ## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ... ## $ PROPDMGEXP: chr "K" "K" "K" "K" ... ## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ... ## $ CROPDMGEXP: chr "" "" "" "" ...
damage <- stormData[,c(2, 8, ind)]# 2 and 8 for date and event type
colnames(damage) <- tolower(colnames(damage))
# check if we got what we intended to
str(damage)
## 'data.frame': 902297 obs. of 8 variables: ## $ bgn_date : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ... ## $ evtype : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ... ## $ fatalities: num 0 0 0 0 0 0 0 0 1 0 ... ## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ... ## $ propdmg : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ... ## $ propdmgexp: chr "K" "K" "K" "K" ... ## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ... ## $ cropdmgexp: chr "" "" "" "" ...
Storm Data Documentation further tells us that dollar value of property and crop damages consists
of numeric figure and character label: K for '000, M for '000'000, and B for '000'000'000.
Let's explore if we should use this line verbatim in our analysis or rather as a hint.
# check do we have "K", "M", "B" labels consistently
table(damage$propdmgexp)
## ## - ? + 0 1 2 3 4 5 ## 465934 1 8 5 216 25 13 4 4 28 ## 6 7 8 B h H K m M ## 4 5 1 40 1 6 424665 7 11330
As it turns out people who entered data made quite a few mistakes. Thus, our
rule for combining figures and labels into dollar values will be as follows:
- use “k” or “K” for '000
- use “m” or “M” for '000'000
- use “b” or “B” for '000'000'000
- substitute all other with numeric “0”
# calculate tables of multiples for property and crop
propMultiples <- data.frame(propdmgexp=c("k","K","m","M","b","B"),
propMultiples= c(1000,
1000,
1000000,
1000000,
1000000000,
1000000000))
# explore
propMultiples
## propdmgexp propMultiples ## 1 k 1e+03 ## 2 K 1e+03 ## 3 m 1e+06 ## 4 M 1e+06 ## 5 b 1e+09 ## 6 B 1e+09
table(damage$cropdmgexp)
## ## ? 0 2 B k K m M ## 618413 7 19 1 9 21 281832 1 1994
cropMultiples <- data.frame(cropdmgexp=c("k","K","m","M","b","B"),
cropMultiples= c(1000,
1000,
1000000,
1000000,
1000000000,
1000000000))
# explore
cropMultiples
## cropdmgexp cropMultiples ## 1 k 1e+03 ## 2 K 1e+03 ## 3 m 1e+06 ## 4 M 1e+06 ## 5 b 1e+09 ## 6 B 1e+09
library(plyr)
# join multiples with damage table
damage <- join(damage, propMultiples)
## Joining by: propdmgexp
damage <- join(damage, cropMultiples)
## Joining by: cropdmgexp
#explore
str(damage)
## 'data.frame': 902297 obs. of 10 variables: ## $ bgn_date : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ... ## $ evtype : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ... ## $ fatalities : num 0 0 0 0 0 0 0 0 1 0 ... ## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ... ## $ propdmg : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ... ## $ propdmgexp : chr "K" "K" "K" "K" ... ## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ... ## $ cropdmgexp : chr "" "" "" "" ... ## $ propMultiples: num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ... ## $ cropMultiples: num NA NA NA NA NA NA NA NA NA NA ...
# calculate dollar values of property and crop damage
damageValue <- mutate(damage,
propDamage = propdmg * propMultiples,
cropDamage = cropdmg * cropMultiples)
# explore
str(damageValue)
## 'data.frame': 902297 obs. of 12 variables: ## $ bgn_date : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ... ## $ evtype : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ... ## $ fatalities : num 0 0 0 0 0 0 0 0 1 0 ... ## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ... ## $ propdmg : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ... ## $ propdmgexp : chr "K" "K" "K" "K" ... ## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ... ## $ cropdmgexp : chr "" "" "" "" ... ## $ propMultiples: num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ... ## $ cropMultiples: num NA NA NA NA NA NA NA NA NA NA ... ## $ propDamage : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ... ## $ cropDamage : num NA NA NA NA NA NA NA NA NA NA ...
# explore data
damageValue[1:5,1]
## [1]"4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" ## [4]"6/8/1951 0:00:00" "11/15/1951 0:00:00"
# coerce date column to Date class
damageValue$bgn_date <- as.Date(damageValue$bgn_date, format="%m/%d/%Y")
# substitute NA's with 0
damageValue[is.na(damageValue)]<-0
Before we can proceed to the next step of creating tidy data set we need to define
harm to public health and damage to economy due to a particular event.
For the sake of this report, I use the following definitions:
Harm to public health: total amount of fatalities and injuries occurred due
to a particular storm eventDamage to economy: total dollar amount of crop and property damage occurred due
to a particular storm event
Now let's calculate a tidy data set containing one column for every variable:
date, type of event, health damage, and economic damage
# create tidy data set for further analysis
data <- with(damageValue, data.frame(date=bgn_date,
eventType=evtype,
healthDamage = fatalities + injuries,
economicDamage = propDamage + cropDamage)
)
summary(data)
## date eventType healthDamage ## Min. :1950-01-03 HAIL :288661 Min. : 0.0 ## 1st Qu.:1995-04-20 TSTM WIND :219940 1st Qu.: 0.0 ## Median :2002-03-18 THUNDERSTORM WIND: 82563 Median : 0.0 ## Mean :1998-12-27 TORNADO : 60652 Mean : 0.2 ## 3rd Qu.:2007-07-28 FLASH FLOOD : 54277 3rd Qu.: 0.0 ## Max. :2011-11-30 FLOOD : 25326 Max. :1742.0 ## (Other) :170878 ## economicDamage ## Min. :0.00e+00 ## 1st Qu.:0.00e+00 ## Median :0.00e+00 ## Mean :5.28e+05 ## 3rd Qu.:1.00e+03 ## Max. :1.15e+11 ##
Data Analysis
Different types of storm by annual frequency
Let's visualize types of storms by average annual occurrences. To do so,
I calculate an eventFreq object that contains two variables, type of event and
frequency of occurrence, and subset it for top 10 events.
eventFreq <- arrange(count(data, "eventType"), desc(freq))
topEvents <- eventFreq[1:10,]# save top 3 for results presentation
results <- data.frame(eventFreq = eventFreq[1:3, "eventType"])
library(ggplot2)
ggplot(data=topEvents, aes(y=freq/62, x = reorder(eventType, freq))) + #calculate mean over 62 years
geom_bar(stat="identity",
fill="#0D3F65",
width=.7) +
coord_flip() +
labs(y="Number of storm occurrences",
x=NULL,
title = "Average number of occurrences per annum \n of a particular type of storm, US, 1950 - 2011")
Harm to public health
Let's analize vizually what are the top 10 types of storm that cause most damage
to public health as defined earlier:
healthDamage <- ddply(data,
"eventType",
summarize,
healthDamage=sum(healthDamage))
healthDamage <- arrange(healthDamage, desc(healthDamage))
topHealthDamage <- healthDamage[1:10,]topHealthDamage
## eventType healthDamage ## 1 TORNADO 96979 ## 2 EXCESSIVE HEAT 8428 ## 3 TSTM WIND 7461 ## 4 FLOOD 7259 ## 5 LIGHTNING 6046 ## 6 HEAT 3037 ## 7 FLASH FLOOD 2755 ## 8 ICE STORM 2064 ## 9 THUNDERSTORM WIND 1621 ## 10 WINTER STORM 1527
# save top3 for results presentation
results$healthDamage <- topHealthDamage[1:3, "eventType"]# plot top 10 as average per annum
ggplot(data=topHealthDamage, aes(y=healthDamage/62,
x = reorder(eventType, healthDamage))) +
geom_bar(stat="identity",
fill="#0D3F65",
width=.7) +
coord_flip() +
labs(y="Number of injuries and fatalities per annum",
x=NULL,
title = "Average annual harm to public health\n by a particular type of storm, US, 1950 - 2011")
! harm-to-public-health-by-storms
Damage to economy
Let's analyze vizually what are the top 10 event causing most damage to economy
head(data)
## date eventType healthDamage economicDamage ## 1 1950-04-18 TORNADO 15 25000 ## 2 1950-04-18 TORNADO 0 2500 ## 3 1951-02-20 TORNADO 2 25000 ## 4 1951-06-08 TORNADO 2 2500 ## 5 1951-11-15 TORNADO 2 2500 ## 6 1951-11-15 TORNADO 6 2500
economicDamage <- ddply(data,
"eventType",
summarize,
economicDamage=sum(economicDamage))
economicDamage <- arrange(economicDamage, desc(economicDamage))
topEconomicDamage <- economicDamage[1:10,]# save top3 for results presentation
results$economicDamage <- topEconomicDamage[1:3, "eventType"]# plot top 10 as average per annum
ggplot(data=topEconomicDamage, aes(y=economicDamage/62,
x = reorder(eventType, economicDamage))) +
geom_bar(stat="identity",
fill="#0D3F65",
width=.7) +
coord_flip() +
labs(y="Economic damage per annum, USD",
x=NULL,
title = "Average annual economic damage \n by a particular type of storm, US, 1950 - 2011")
Results
Let's summarize in a table top 3 types of storms in all the categories that
we have vizualized before:
- Frequency of occurrence
- Harm to public health (defined as deaths and injuries combined)
- Damage to economy (defined as damage to property and crop combined).
library(xtable)
colnames(results) <- c("By frequency of occurrence",
"By harm to public health",
"By damage to economy")
table <- xtable(results,
caption = "Table 1: Top 3 types of storm")
print(table, type="html")
By frequency of occurrence | By harm to public health | By damage to economy | |
---|---|---|---|
1 | HAIL | TORNADO | FLOOD |
2 | TSTM WIND | EXCESSIVE HEAT | HURRICANE/TYPHOON |
3 | THUNDERSTORM WIND | TSTM WIND | TORNADO |
It is becoming clear now, after analyzing 60 years worth of data, that most
attention need to be paid to preparation for Tornadoes and Floods. These events do not occur as frequently as hails, but the consequences are by far more damaging.