# A Data Story on Natural Disasters in Quotebank

## Abstract

### A brief introduction to the project

Every year, natural disasters happen and often take many lives. After such events, newspapers are full of quotes from people expressing their sorrow, and the events often stay in people's memory for a lifetime. What influences how long these events are talked about? In this research project, we explore how much is said about the biggest earthquakes after they have occurred and which factors influence this. We look for answers in Quotebank quotes from 2008 to 2020 about disasters listed in the international disasters database (EM-DAT), combined with world development indicators from the World Data Bank. To simplify the detection of disaster quotes, we also look into classifying quotes by whether they talk about a disaster or not.

## Research Questions

### Questions tackled in the research

In this research project, we explore two questions.

First, how accurately do NLP models trained on disaster tweets, as in the Kaggle disaster tweets challenge, generalize to classifying disaster quotes in Quotebank? This question matters for analyzing the robustness of models and their transfer learning capabilities.

Second, what factors influence how long an earthquake will be talked about in Quotebank quotes from 2008 to 2020? The interesting factors include total deaths, total damage in dollars, country of disaster, wealth indicators of the country, etc.

There are various related research questions that are interesting and worth further study, such as “what is the sentiment towards different disasters and why” and “how does the country of the speaker affect which disasters they talk about”.

## Timeline

### An introduction of the steps taken on a timeline

To recap, we ask how accurately NLP models trained on disaster tweets generalize to classifying disaster quotes in Quotebank, and which factors influence how long an earthquake is talked about in Quotebank quotes from 2008 to 2020.

To answer these questions, we took the steps shown on the timeline in this section.

#### Always Need More Data!

We started by gathering, preprocessing, and visualizing the data for our project. We collected indicators of the severity of each disaster, development indicators of different countries in different years, the geographic location of news domains, and, of course, tweets from Twitter.

#### NLP Models

We used a combination of state-of-the-art NLP models trained on the Twitter Disaster Dataset to detect disasters in the quotes.

#### Filtering Quotes with RegEx

We also filtered earthquake quotes from Quotebank using keyword regular expressions. For each disaster, we gathered the most important keywords, such as the associated event name.
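The keyword filtering described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the keyword list, the sample quotes, and the helper names `build_pattern` and `filter_quotes` are all hypothetical.

```python
import re

# Hypothetical keywords for a single event; the real keyword lists are not given in the text.
EVENT_KEYWORDS = ["earthquake", "Tohoku", "tsunami"]


def build_pattern(keywords):
    """Compile one case-insensitive pattern that matches any keyword as a whole word."""
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def filter_quotes(quotes, pattern):
    """Keep only the quotes that mention at least one keyword."""
    return [quote for quote in quotes if pattern.search(quote)]


# Made-up quotes, used only to exercise the filter.
quotes = [
    "The earthquake destroyed most of the old town.",
    "We played well and deserved the win.",
    "Relief teams reached Tohoku within two days.",
]
pattern = build_pattern(EVENT_KEYWORDS)
matched = filter_quotes(quotes, pattern)
```

Because the pattern matches whole words only, variants such as plurals would need their own entries in the keyword list.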

#### Media Coverage Analysis

We analyzed in depth the factors that may influence how much people talk about these horrible events.

## Approach

### A brief introduction to the project

Here we give a brief introduction to our methodology and the data we used in the project.

## The International Disasters Database

We use the international disasters database (EM-DAT) to introduce the natural disasters of this century together with their most important attributes. This dataset was compiled from various sources, including UN agencies, governmental and non-governmental agencies, insurance companies, research institutes, and press agencies. In the majority of cases, a disaster is only entered into EM-DAT if at least two sources report its occurrence in terms of deaths and/or affected persons.

Below, we present the number of disasters per country on the world map.

## World Development Indicators

One important factor in how much people talk about a disaster might be the country where it happened and that country's attributes. This dataset contains the most important development indicators of each country, for example GDP, population, and fertility rate. A detailed per-indicator source and description is given in databank_wdi_metadata.csv. We would like to see whether these indicators are connected to how long a disaster is talked about and how the quotes about it are distributed over time.

Below, we present the correlation of the development indicators as a heatmap.
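As a rough sketch of what such a heatmap is built from, the pairwise Pearson correlations between indicators can be computed as below. The indicator names and all values are made up for illustration and are not taken from the World Development Indicators data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for four hypothetical countries (one list entry per country).
indicators = {
    "gdp_per_capita": [1000.0, 5000.0, 20000.0, 40000.0],
    "fertility_rate": [5.0, 3.0, 2.0, 1.5],
    "population_millions": [30.0, 120.0, 60.0, 10.0],
}

# Correlation matrix stored as a dict keyed by (row indicator, column indicator).
names = list(indicators)
corr = {(a, b): pearson(indicators[a], indicators[b]) for a in names for b in names}
```

Each entry of `corr` corresponds to one cell of the heatmap; the diagonal is 1 by construction and the matrix is symmetric.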

## GDELT Geographic Lookup of Domains

The geographical location of a newspaper could affect the quotes it contains. Although the quotes in the Quotebank dataset link to the articles in which they were found, we cannot determine the true geographical location of the news source from the link itself. For example, theguardian.com and nytimes.com both use the .com top-level domain, but they report on events in different countries. That is why we chose a GDELT dataset that associates each domain with the country the news source comes from. This dataset was derived from the enormous GDELT dataset and relies on the fact that news outlets cover events physically close to them far more often than events on the other side of the world.
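A minimal sketch of such a lookup, assuming the GDELT data has already been loaded into a plain domain-to-country dictionary. The two entries, the example URLs, and the helper names `domain_of` and `country_of` are illustrative only; the real lookup table covers far more domains.

```python
from urllib.parse import urlparse

# Hypothetical two-entry excerpt standing in for the full domain-to-country lookup.
DOMAIN_COUNTRY = {
    "theguardian.com": "United Kingdom",
    "nytimes.com": "United States",
}


def domain_of(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def country_of(url, lookup=DOMAIN_COUNTRY):
    """Map an article URL to the country of its news source, or None if unknown."""
    return lookup.get(domain_of(url))


guardian_country = country_of("https://www.theguardian.com/world/some-article")
unknown_country = country_of("https://example.org/some-article")
```

Returning `None` for unknown domains keeps missing matches explicit, so they can be counted or dropped later in the analysis.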

We also use a Twitter dataset: a collection of tweets by Twitter users about disasters and about random other topics. Each tweet is labeled by whether it is about a disaster or not, and we use the tweets to train NLP models for transfer learning. The structure of the dataset is as follows:

Columns:

- id: a unique identifier for each tweet
- text: the text of the tweet
- location: the location the tweet was sent from (may be blank)
- keyword: a particular keyword from the tweet (may be blank)
- target: in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
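To make the column layout concrete, here is a small sketch that parses rows with these columns using only the standard library. It assumes a comma-separated file with a header row; the two sample rows are shortened versions of example tweets from this story, and `load_tweets` is a hypothetical helper name.

```python
import csv
import io

# Two shortened sample rows in the assumed layout of train.csv.
SAMPLE_CSV = """id,keyword,location,text,target
46,ablaze,London,Birmingham Wholesale Market is ablaze,1
140,accident,,this happened on accident but I like it,0
"""


def load_tweets(handle):
    """Read tweets from a CSV file handle, turning blank optional fields into None."""
    tweets = []
    for row in csv.DictReader(handle):
        tweets.append(
            {
                "id": int(row["id"]),
                "keyword": row["keyword"] or None,
                "location": row["location"] or None,
                "text": row["text"],
                "target": int(row["target"]),
            }
        )
    return tweets


tweets = load_tweets(io.StringIO(SAMPLE_CSV))
```

For the real file, the same function could be called with `open("train.csv", newline="", encoding="utf-8")` in place of the in-memory sample.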

## NLP from Tweets

### Training transformers on Tweets and evaluating on the Quotebank

To simplify the classification of quotes by whether they talk about disasters or not, we looked into taking a model trained to classify whether a tweet talks about a disaster and using it to classify Quotebank quotes. We started from this Kaggle challenge. Twitter has become an important communication channel in times of emergency, which is why, the challenge authors argue, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter for disaster detection. In that competition, data scientists are challenged to build a machine learning model that predicts which tweets are about real disasters and which ones are not. The participants are provided with a dataset of 10,000 hand-classified tweets, with a disclaimer that the dataset contains text that may be considered profane, vulgar, or offensive. To give a taste of what these tweets look like, consider the following examples:

| id | keyword | location | text | target |
|---|---|---|---|---|
| 5636 | flooding | United States | #BREAKING Business at the Sponge Docks washed out by rain flooding /#news | 1 |
| 46 | ablaze | London | Birmingham Wholesale Market is ablaze BBC News - Fire breaks out at Birmingham's Wholesale Market http://t.co/irWqCEZWEU | 1 |
| 140 | accident | | @Calum5SOS this happened on accident but I like it http://t.co/QHmXuljSX9 | 0 |
| 230 | airplane accident | #BlackLivesMatter | @thugIauren I had myself on airplane mode by accident ?? | 0 |
| 473 | armageddon | | #Turkey invades #Israel - Halfway to #Armageddon http://t.co/xUOh3sJNXF | 1 |
| 497 | army | Campinas Sp | You da One #MTVSummerStar #VideoVeranoMTV #MTVHottest Britney Spears Lana Del Rey | 0 |
| 6291 | hostage | Seattle, WA | Sinai branch of Islamic State threatens to execute Croatian hostage in 48 hours http://t.co/YvtcXrPt34 | 1 |
| 701 | attacked | | Aus need career best of @davidwarner31 along with Steve smith to even try competing.. SURA virus has attacked team Australia brutally | 0 |
| 2798 | curfew | ! | Sex-themed e-books given curfew in Germany on http://t.co/7NLEnCph8X | 0 |

These short texts are labeled by whether they talk about a disaster in a broad sense (accident, attack, battle, catastrophe, etc.), which we will refer to as general-disaster. We are interested in natural disasters such as earthquakes, but a well-performing general-disaster classifier could still be used to narrow down the search for natural disasters in a dataset as big as Quotebank.

Among the numerous models submitted for the challenge, we have picked two with high test scores, one of which is a BERT transformer and the other a DISTILBERT. We will use BERT and DISTILBERT as their code names. We have modified them, performed a small hyperparameter search to find the best parameters, and trained the final model on all available data (train+test). Their respective test performance on the tweets dataset is shown below:

Given the large number of quotes in Quotebank (234,994,989), predicting the class of every quote would have taken about 80 GPU days. Because of this, we applied the final models to a randomly chosen sample of 0.1% of the whole dataset, which is still an enormous sample of 234,991 quotes and gives tight bounds on the resulting estimates.
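To give a sense of how tight such bounds are, the sketch below computes the worst-case 95% margin of error for a proportion estimated from a simple random sample of this size. It uses the standard normal approximation, which is our choice for illustration and not necessarily the calculation behind the statement above; it also covers sampling error only, not classification errors.

```python
import math


def worst_case_margin(sample_size, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion.

    The variance p * (1 - p) is largest at p = 0.5, so this is an upper bound
    on the margin for any estimated proportion.
    """
    return z * math.sqrt(0.25 / sample_size)


SAMPLE_SIZE = 234_991  # number of sampled quotes stated in the text
margin = worst_case_margin(SAMPLE_SIZE)
print(f"worst-case margin of error: {margin:.4%}")
```

Under these assumptions the margin comes out at roughly 0.2 percentage points, so any share estimated from the sample should be close to the share in the full dataset.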

BERT predicted 9.74% of the quotes to be about disasters, and DISTILBERT slightly fewer, 8.89%. This seems rather unrealistic given the nature of Quotebank quotes: do about 1 in 10 quotes really talk about a general disaster? To evaluate the precision of these results, we hand-labeled a subset of 400 positive classification results as true positives or false positives. In the hand labeling, we distinguished the general disasters the models were trained on from the natural disasters we are interested in. The resulting precisions are quite low and are shown in the table below:

| Model | General Disaster Precision | Natural Disaster Precision |
|---|---|---|
| BERT | 5.50% | 1.00% |
| DISTILBERT | 36.50% | 4.50% |
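Precision here is the share of hand-labeled positive predictions that were actually about a disaster. A minimal sketch of that calculation follows; as toy input it uses only the five labeled snippet rows per model shown later in this section, so the resulting numbers are not the precisions in the table, which come from the full hand-labeled subset. The `precision` helper and the data layout are our own illustration.

```python
def precision(labels):
    """Share of predicted positives that were hand-labeled as true positives (1)."""
    return sum(labels) / len(labels)


# Hand labels as (general disaster, natural disaster) pairs, one pair per quote
# that the model predicted as positive. Toy input: the five snippet rows per model.
hand_labels = {
    "BERT": [(0, 1), (0, 0), (0, 0), (0, 0), (0, 0)],
    "DISTILBERT": [(0, 0), (0, 1), (0, 0), (0, 1), (0, 0)],
}

results = {}
for model, pairs in hand_labels.items():
    general = [g for g, _ in pairs]
    natural = [n for _, n in pairs]
    results[model] = {
        "general": precision(general),
        "natural": precision(natural),
    }
```

Since every labeled quote was predicted positive by the model, no true negatives or false negatives are needed: precision depends only on how many of the predicted positives were correct.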

DISTILBERT generalized far better than BERT, as measured by precision, for both general and natural disasters. Overall, however, the transfer learning did not succeed: the models were not robust to the change in text distribution from tweets to Quotebank quotes. This could have been expected given how much tweets differ from quotes: tweets are much shorter and make heavy use of hashtags, URLs, emojis, etc.

Looking beyond the numbers, we noticed many strange things when labeling the quotes. First, BERT worked almost the opposite way from what we wanted, as if it had deliberately chosen quotes with no connection to any disaster. They were about Formula 1, politics, and celebrity news. DISTILBERT was more promising. Although it produced many false positives, these were often quotes about sports that framed matches as wars and players as fighting for triumph. For example: “Our defense did a great job. There were some times where they could have been better, but I saw the puck well.” Interestingly, many quotes were also about veterans and their rehabilitation. Fires, health problems, cybercrime, and financial crises were frequent among the false positives as well. The overall sentiment of these predictions was negative, which is to be expected when predicting the topic of catastrophic events. All in all, neither model performed well, but DISTILBERT showed some robustness qualitatively.

Below are snippets of positives for BERT and DISTILBERT.

## BERT

| General disaster (1=yes, 0=no) | Natural disaster (1=yes, 0=no) | Quotation | Speaker |
|---|---|---|---|
| 0 | 1 | We cannot ignore the issues of migration and need to bring them out from the shadows of the climate change and disaster risk reduction debate, | Henry Puna |
| 0 | 0 | i was, like, a super nerdy kid... fully invested in creating narrative from a very young age. writing stories for my little sister and me to act out. like a full-on nerd. | paul dano |
| 0 | 0 | It would be great if we won Friday, but it won't end our season. | Eric Hayes |
| 0 | 0 | I want to ask the minister whether or not you have said a single word in your speech about the poor man of Pakistan. Certainly not. The PML-N is a party of mill owners, businessmen and..., | Aitzaz Ahsan |
| 0 | 0 | the ftc's coppa rule requires parental notice and consent before collecting children's personal information online, whether through a website or a mobile app, | jon leibowitz |

## DISTILBERT

| General disaster (1=yes, 0=no) | Natural disaster (1=yes, 0=no) | Quotation | Speaker |
|---|---|---|---|
| 0 | 0 | including seven months building the database to list all the names on it and ensure it was correct. this list contains 58,227 names of american military who lost their lives in vietnam. | ruth ross |
| 0 | 1 | The flood had a big impact on everyone here | None |
| 0 | 0 | was the complete beacon of inspiration in the depression for this country. it's sort of an incredible coincidence that we are in this slump. i think it's an excellent time to remind people of the heroes that kept us afloat -- more than afloat -- in amelia's time. | amelia earhart |
| 0 | 1 | Waters cooled more than they should have, and there was a lot of cooling by late March. But right on cue, things changed as we put out the forecast. Then there was all sorts of warming. | Philip Klotzbach |
| 0 | 0 | at that time, we did not know there were occupants in the house. we arrived, deployed a nozzle, had the fire knocked down fairly quickly. had to force entry, the doors were locked. | tom crawford |

For the details of the transfer learning effort, you can head over to the relevant notebooks: tweets-bert, tweets-transfer-bert, tweets-transfer-distilbert, quotebank-sample.

## Media Coverage Analysis

### Analysis of the media coverage of natural disasters in the Quotebank dataset

What makes an earthquake worth talking about?

News portals look for the events that reach the most readers, and natural disasters are a topic that interests many of them. People do not want to fall behind on the world's most important events, and they are also much more likely to read negative news than positive news, which gives news portals further reason to publish articles about these events. The same tendencies hold for celebrities, politicians, and other media influencers whose thoughts and ideas are quoted in the media.

As with other news, the total number of quotes and the number of quotes in the first period after the event are meaningful indicators of an event's popularity in the media. For the earthquakes, we used these measures to analyze how each disaster appeared in the news. We also included the time between the first and last appearance of the earthquake in the quotes, which turned out to be less meaningful in this respect.
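The three measures can be sketched as follows for a single disaster. The 30-day window for the "first period" is an assumption made for this example, since the text does not state its length; the event date and the quote dates are made up, and `coverage_metrics` is a hypothetical helper name.

```python
from datetime import date, timedelta


def coverage_metrics(event_date, quote_dates, window_days=30):
    """Total quotes, quotes in the first window after the event, and first-to-last span.

    window_days=30 is an assumed length for the 'first period'.
    """
    after = sorted(d for d in quote_dates if d >= event_date)
    if not after:
        return {"total": 0, "early": 0, "span_days": 0}
    cutoff = event_date + timedelta(days=window_days)
    return {
        "total": len(after),
        "early": sum(1 for d in after if d < cutoff),
        "span_days": (after[-1] - after[0]).days,
    }


# Made-up dates for one hypothetical earthquake and the quotes that mention it.
event = date(2014, 4, 25)
quote_dates = [
    date(2014, 4, 25),
    date(2014, 4, 26),
    date(2014, 5, 10),
    date(2014, 7, 1),
    date(2014, 8, 3),
]
metrics = coverage_metrics(event, quote_dates)
```

Quotes dated before the event are ignored here, on the assumption that they are keyword matches referring to something else.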

Due to Plotly's auto-slideshow feature, the plots are not automatically rescaled. You can use the Autoscale and Reset Axis buttons in the top-right corner of the plot to better view the quote distribution of a particular disaster in the plot below.