A Data Story on

Natural Disasters in Quotebank

Let's Explore!

Abstract

A brief introduction to the project

Every year, natural disasters happen and often take many lives. After such events, the pages of newspapers are full of quotes from people expressing regret for the unfortunate event. These events often remain in people’s memory for a lifetime. What influences how long these events will be talked about? In this research project, we explore how much is said about the biggest earthquakes after they have occurred and what factors influence this. We look for answers in Quotebank quotes from 2008 to 2020 about disasters listed in the International Disasters Database, combined with World Development Indicators from the World Bank. To simplify disaster quote detection, we also look into classifying the quotes by whether they talk about a disaster or not.

Research Questions

Questions tackled in the research

In this research project, we explore two questions.

First, how accurately will NLP models trained on disaster tweets, like those in this Kaggle challenge, generalize to classifying disaster quotes in Quotebank? This question matters for analyzing the robustness of models and their transfer learning capabilities.

Second, what factors influence how long an earthquake will be talked about in Quotebank quotes from 2008 to 2020? The interesting factors include total deaths, total damage in dollars, country of disaster, wealth indicators of the country, etc.

There are various other related research questions that are interesting and worth further research, like “what is the sentiment towards different disasters and why” and “how does the country of the speaker affect which disasters they talk about”.

Timeline

An introduction of the steps taken on a timeline

To recap: first, how accurately do NLP models trained on disaster tweets generalize to classifying disaster quotes in Quotebank? Second, what factors influence how long an earthquake is talked about in Quotebank quotes from 2008 to 2020?

The timeline below shows the steps we took to answer these questions.

  • Tape

    Always Need More Data!

    We started by gathering, preprocessing, and visualizing the data for our project. We gathered indicators of the severity of each disaster, development indicators of different countries in different years, the geographic location of news domains and, of course, tweets.

  • Twitter logo

    NLP Models

    We used a combination of state-of-the-art (SotA) NLP models trained on the Twitter Disaster Dataset to detect quotes that talk about disasters.

  • Filter

    Filtering Quotes with RegEx

    We also filtered earthquake quotes from Quotebank using keyword regular expressions. For each disaster, we gathered the most important keywords, such as the name of the associated event.

  • Media

    Media Coverage Analysis

    We analyzed in depth which factors may influence how much people talk about these horrible events.

  • How about
    some more
    details?
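The regex filtering step from the timeline could look roughly like the sketch below. It is only an illustration: the function name `build_disaster_pattern`, the keyword list, and the example quotes are our assumptions here, not the keywords or code used in the project.

```python
import re

def build_disaster_pattern(keywords):
    """Compile one case-insensitive pattern that matches any keyword as a whole word or phrase."""
    # Longer keywords first, so multi-word phrases are tried before their sub-words.
    escaped = (re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

# Hypothetical keywords for a single event (assumption, not the project's list).
pattern = build_disaster_pattern(["earthquake", "Tohoku", "tsunami"])

# Made-up quotes, for illustration only.
quotes = [
    "The earthquake destroyed most of the old town.",
    "We played well and deserved the win.",
    "Relief teams reached the Tohoku coast yesterday.",
]
matching = [q for q in quotes if pattern.search(q)]
```

Because of the word boundaries, a keyword such as "earthquake" does not match "earthquakes" in this sketch, so plural forms would have to be listed as separate keywords.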

Approach

A brief introduction to the project

This section briefly introduces our methodology and the data we used in the project.

Additional Datasets

Additional datasets used during the project

  1. The International Disasters Database
  2. World Development Indicators
  3. GDELT Geographic Lookup of Domains
  4. Twitter Disaster Dataset

The International Disasters Database

We use the International Disasters Database (EM-DAT) as our source of the natural disasters of this century and their most important attributes. This dataset was compiled from various sources including UN, governmental and non-governmental agencies, insurance companies, research institutes, and press agencies. In the majority of cases, a disaster will only be entered into EM-DAT if at least two sources report the disaster’s occurrence in terms of deaths and/or affected persons.

Below, we present the number of disasters per country on the world map.

Number of disasters per country

World Development Indicators

One important factor in how much people talk about a disaster might be the country where it happened and that country’s attributes. This dataset contains the most important development indicators of each country, for example GDP, population, and fertility rate. A detailed per-indicator source and description is given in databank_wdi_metadata.csv. We would like to observe whether there is a connection between these indicators and how long, and with what distribution over time, people talk about a disaster.

Below, we present the correlation of the development indicators as a heatmap.

Correlation of World Development Indicators
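As a rough idea of what sits behind such a heatmap, the sketch below computes pairwise Pearson correlations between indicator columns using only the standard library. The indicator names and values are made up for illustration, and the use of Pearson correlation is our assumption; the actual analysis may have used a different measure or a library such as pandas.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every ordered pair of column names to the Pearson correlation of the two columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Made-up toy values for four observations; not taken from the WDI data.
indicators = {
    "gdp_per_capita": [1.0, 2.0, 4.0, 8.0],
    "fertility_rate": [5.0, 4.0, 3.0, 2.0],
    "population": [10.0, 30.0, 20.0, 40.0],
}
matrix = correlation_matrix(indicators)
```

Each cell of the heatmap then corresponds to one entry of `matrix`, with the diagonal equal to 1.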

GDELT Geographic Lookup of Domains

The geographical location of a newspaper could affect the quotes it contains. Although the quotes in the Quotebank dataset contain links to the articles in which they were found, we cannot find out the true geographical location of the news source from the link itself. For example, theguardian.com and nytimes.com both use the .com top-level domain, but they report on events in different countries. That’s why we decided to use a GDELT dataset that associates each domain with the country the news source comes from. This dataset was created from the enormous GDELT dataset and relies on the fact that news outlets cover events physically close to them far more often than events on the other side of the world.
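To make the lookup idea concrete, here is a minimal sketch of mapping an article URL to the country of its outlet. The dictionary `DOMAIN_COUNTRY` is a hypothetical two-entry excerpt and the function `source_country` is our own illustrative helper; the real GDELT lookup is a much larger file with its own format.

```python
from urllib.parse import urlparse

# Hypothetical excerpt of a domain-to-country lookup (assumption, not the GDELT file itself).
DOMAIN_COUNTRY = {
    "theguardian.com": "United Kingdom",
    "nytimes.com": "United States",
}

def source_country(url, lookup=DOMAIN_COUNTRY):
    """Return the country of the outlet behind an article URL, or None if the domain is unknown."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Try the full host first, then drop leading subdomains one at a time.
    for start in range(len(parts) - 1):
        candidate = ".".join(parts[start:])
        if candidate in lookup:
            return lookup[candidate]
    return None
```

For example, `source_country("https://www.theguardian.com/world")` returns "United Kingdom" under this toy lookup, even though the top-level domain alone says nothing about the country.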

Twitter Disaster Dataset

The Twitter Disaster Dataset is a collection of tweets about disasters and about unrelated topics. The dataset is available for download from the Kaggle website.

Each tweet is labeled by whether it is about a disaster or not, and we use the tweets to train NLP models for transfer learning. The structure of the dataset is as follows:

Columns
id - a unique identifier for each tweet
text - the text of the tweet
location - the location the tweet was sent from (may be blank)
keyword - a particular keyword from the tweet (may be blank)
target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
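As a small illustration of this structure, the sketch below parses a three-row stand-in for train.csv and counts labels and blank locations. The rows are adapted from the example tweets shown later in this story, and the blank location of the third row is our reading of that example rather than something we verified against the file.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for train.csv (three rows only, for illustration).
SAMPLE_CSV = """id,keyword,location,text,target
5636,flooding,United States,#BREAKING Business at the Sponge Docks washed out by rain flooding /#news,1
497,army,Campinas Sp,You da One #MTVSummerStar #VideoVeranoMTV #MTVHottest Britney Spears Lana Del Rey,0
140,accident,,@Calum5SOS this happened on accident but I like it http://t.co/QHmXuljSX9,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# How many tweets are labeled as a real disaster (1) versus not (0).
label_counts = Counter(int(row["target"]) for row in rows)

# The location column may be blank, so count the rows where it is missing.
missing_location = sum(1 for row in rows if not row["location"].strip())
```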

NLP from Tweets

Training transformers on Tweets and evaluating on the Quotebank

To simplify classifying quotes by whether they talk about disasters, we looked into taking a model trained to classify whether a tweet is about a disaster and using it to classify Quotebank quotes. To do so, we started from this Kaggle challenge. Twitter has become an important communication channel in times of emergency, which, the Kaggle challenge authors argue, is why more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter for disaster detection. In that competition, data scientists are challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. The challenge participants are provided with a dataset of 10,000 hand-classified tweets, with a disclaimer that the dataset contains text that may be considered profane, vulgar, or offensive. To give a taste of what these tweets look like, consider the following examples:

id keyword location text target
5636 flooding United States #BREAKING Business at the Sponge Docks washed out by rain flooding /#news 1
46 ablaze London Birmingham Wholesale Market is ablaze BBC News - Fire breaks out at Birmingham's Wholesale Market http://t.co/irWqCEZWEU 1
140 accident @Calum5SOS this happened on accident but I like it http://t.co/QHmXuljSX9 0
230 airplane accident #BlackLivesMatter @thugIauren I had myself on airplane mode by accident ?? 0
473 armageddon #Turkey invades #Israel - Halfway to #Armageddon http://t.co/xUOh3sJNXF 1
497 army Campinas Sp You da One #MTVSummerStar #VideoVeranoMTV #MTVHottest Britney Spears Lana Del Rey 0
6291 hostage Seattle, WA Sinai branch of Islamic State threatens to execute Croatian hostage in 48 hours http://t.co/YvtcXrPt34 1
701 attacked Aus need career best of @davidwarner31 along with Steve smith to even try competing.. SURA virus has attacked team Australia brutally 0
2798 curfew ! Sex-themed e-books given curfew in Germany on http://t.co/7NLEnCph8X 0

These short texts are labeled by whether they talk about a disaster in a general sense (accident, attack, battle, catastrophe, etc.), which we will refer to as general-disaster. We are, however, interested in natural disasters like earthquakes. Still, a well-performing general-disaster classification model could be used to narrow down the search for natural disasters in a dataset as big as Quotebank.

Bert and DistilBert

Among the numerous models submitted for the challenge, we have picked two with high test scores, one of which is a BERT transformer and the other a DISTILBERT. We will use BERT and DISTILBERT as their code names. We have modified them, performed a small hyperparameter search to find the best parameters, and trained the final model on all available data (train+test). Their respective test performance on the tweets dataset is shown below:

Bert and DistilBert Confusion Matrix

Given the large number of quotes in Quotebank (234’994’989), the computational time needed to predict the class of all Quotebank quotes was about 80 GPU days. Because of this, we applied the final models to a randomly chosen sample of 0.1% of the whole dataset, which is still a large sample of 234’991 quotes and gives tight bounds for the resulting estimates.
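To see why a sample of this size gives tight bounds, one can compute the margin of error of an estimated proportion. The sketch below uses the textbook normal approximation at a 95% level and assumes a simple random sample of independent quotes; the helper name `margin_of_error` is ours, and this is not necessarily how the bounds were derived in the project.

```python
from math import sqrt

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * sqrt(proportion * (1 - proportion) / sample_size)

# Worst case (proportion = 0.5) for the sample of 234'991 quotes.
worst_case = margin_of_error(234_991)
```

Under these assumptions, `worst_case` is about 0.002, so any share estimated on the sample is within roughly 0.2 percentage points of the share in the full dataset.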

BERT predicted 9.74% of the quotes to be about disasters, and DISTILBERT slightly fewer, 8.89%. This, however, seems rather unrealistic given the nature of Quotebank quotes: do about 1 in 10 quotes really talk about a general disaster? To evaluate the precision of these results, we hand-labeled a subset of 400 positive classification results as true positives or false positives. In the hand labeling, we distinguished the general disasters the models were trained on from the natural disasters we are interested in. The resulting precisions are quite low and are shown in the table below:

Model | General Disaster Precision | Natural Disaster Precision
BERT | 5.50% | 1.00%
DISTILBERT | 36.50% | 4.50%

DISTILBERT generalized far better than BERT, as measured by precision, for both general disasters and natural disasters. Overall, however, the transfer learning didn’t succeed, and the models were not robust to the change of text distribution from tweets to Quotebank quotes. This could have been expected given the significant differences between tweets and quotes: tweets are much shorter and use loads of hashtags, URLs, emojis, etc.
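The precisions in the table above are the share of predicted positives that a human labeler confirmed. A minimal sketch of that computation is given below, assuming each hand-labeled positive is stored as a pair of 0/1 flags (general disaster, natural disaster). The function name and the four example labels are made up for illustration and are not our actual labels.

```python
def precisions(labels):
    """Return (general, natural) precision from hand labels of predicted positives.

    Each label is a pair of 0/1 flags: (is_general_disaster, is_natural_disaster).
    """
    total = len(labels)
    general = sum(g for g, _ in labels) / total
    natural = sum(n for _, n in labels) / total
    return general, natural

# Made-up labels for four predicted positives.
example_labels = [(1, 0), (0, 0), (1, 1), (0, 0)]
general_precision, natural_precision = precisions(example_labels)
```

Since every labeled quote was predicted positive by the model, no true negatives or false negatives are involved, which is why only precision (and not recall) can be estimated this way.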

Looking beyond the numbers, we noticed many strange things when labeling the quotes. Firstly, BERT worked almost the opposite way from what we wanted, as if it deliberately chose quotes that have no connection to any disaster. They were about Formula 1, politics, and celebrity news. DISTILBERT was more promising. Even though it had many false positives, they were often quotes about sports that describe matches as wars and people fighting for triumph. For example: “Our defense did a great job. There were some times where they could have been better, but I saw the puck well.” Interestingly, many quotes were also about veterans and their rehabilitation. Fires, health problems, cybercrime, and financial crises were also frequent among the false positives. The overall sentiment of these predictions was negative, which is to be expected when predicting the topic of catastrophic events. All in all, neither model performed well, but DISTILBERT showed some robustness qualitatively.

Below are snippets of positives for BERT and DISTILBERT.

BERT

GENERAL DISASTER (1=YES, 0=NO) | NATURAL DISASTER (1=YES, 0=NO) | quotation | speaker
0 | 1 | We cannot ignore the issues of migration and need to bring them out from the shadows of the climate change and disaster risk reduction debate, | Henry Puna
0 | 0 | i was, like, a super nerdy kid... fully invested in creating narrative from a very young age. writing stories for my little sister and me to act out. like a full-on nerd. | paul dano
0 | 0 | It would be great if we won Friday, but it won't end our season. | Eric Hayes
0 | 0 | I want to ask the minister whether or not you have said a single word in your speech about the poor man of Pakistan. Certainly not. The PML-N is a party of mill owners, businessmen and..., | Aitzaz Ahsan
0 | 0 | the ftc's coppa rule requires parental notice and consent before collecting children's personal information online, whether through a website or a mobile app, | jon leibowitz

DISTILBERT

GENERAL DISASTER (1=YES, 0=NO) | NATURAL DISASTER (1=YES, 0=NO) | quotation | speaker
0 | 0 | including seven months building the database to list all the names on it and ensure it was correct. this list contains 58,227 names of american military who lost their lives in vietnam. | ruth ross
0 | 1 | The flood had a big impact on everyone here | None
0 | 0 | was the complete beacon of inspiration in the depression for this country. it's sort of an incredible coincidence that we are in this slump. i think it's an excellent time to remind people of the heroes that kept us afloat -- more than afloat -- in amelia's time. | amelia earhart
0 | 1 | Waters cooled more than they should have, and there was a lot of cooling by late March. But right on cue, things changed as we put out the forecast. Then there was all sorts of warming. | Philip Klotzbach
0 | 0 | at that time, we did not know there were occupants in the house. we arrived, deployed a nozzle, had the fire knocked down fairly quickly. had to force entry, the doors were locked. | tom crawford

For the details of the transfer learning effort, you can head over to the relevant notebooks: tweets-bert, tweets-transfer-bert, tweets-transfer-distilbert, quotebank-sample.

Media Coverage Analysis

Analysis of the media coverage of natural disasters in the Quotebank dataset

What makes an earthquake worth talking about?

News portals try to find the events that reach the most readers, and natural disasters are a topic that interests many of them. Naturally, people don’t want to miss the world’s most important events. The fact that people are much more likely to read negative news than positive news also makes news portals keen to publish articles about these events. The same tendencies hold for celebrities, politicians, and other media influencers whose thoughts and ideas are quoted in the media.

As with other news, the total number of quotes and the number of quotes in the first period after the event are meaningful indicators of an event’s popularity in the media. We used these measures to analyse how earthquakes appear in the news. We also included the length of time between the first and last appearance of an earthquake in the quotes, which turned out to be less meaningful in this respect.
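The three coverage measures described above can be sketched as follows. The function name, the returned keys, and the 365-day default for the "first period" are our assumptions for illustration, and the example dates are made up.

```python
from datetime import date, timedelta

def coverage_metrics(event_date, quote_dates, first_period_days=365):
    """Summarise how much and for how long an event is quoted after it happened."""
    # Keep only quotes published on or after the event date.
    after = sorted(d for d in quote_dates if d >= event_date)
    if not after:
        return {"total": 0, "first_period": 0, "span_days": 0}
    cutoff = event_date + timedelta(days=first_period_days)
    return {
        "total": len(after),                                   # number of quotes in total
        "first_period": sum(1 for d in after if d < cutoff),   # quotes in the first period
        "span_days": (after[-1] - after[0]).days,              # first to last appearance
    }

# Hypothetical event and quote dates.
metrics = coverage_metrics(
    date(2015, 1, 1),
    [date(2015, 1, 2), date(2015, 1, 20), date(2016, 1, 1), date(2016, 1, 11)],
)
```

Note that a single late quote, such as an anniversary mention, can stretch `span_days` considerably while barely changing the totals, which is one possible reason why the span is the least informative of the three measures.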

Due to Plotly’s auto-slideshow feature, the plots are not automatically rescaled. You can use the Autoscale and Reset Axis buttons in the top right corner of the plot below to get a better view of the quote distribution of a particular disaster.

As we would expect, people talk about a disaster most in the first few months after the earthquake. As with any news, an event is interesting if it happened in the recent past. For earthquakes, one would expect a longer presence in the news than for ordinary news, because of aftershocks and the long recovery time of the affected area. In our analysis, we also observed a peak in daily quotes one year after the date of the event, which may be the result of anniversary commemorations.

However, our main focus was on what influences the media coverage of earthquakes. To analyse this question, we gathered a lot of data about each disaster and its location, and examined the correlation between these indicators and the media coverage indicators. We expected the factors that measure human exposure to matter most.

As can be seen in the heatmap below, we observed a strong correlation between the number of quotes and

  • the number of deaths associated with the event
  • the number of injured
  • the reconstruction cost

For the number of quotes in the first year, the list additionally includes

  • aid contribution
  • the under-5 mortality rate
  • the number of days required to start a business
Heatmap
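Disaster figures such as deaths and quote counts are typically heavy-tailed, so one way to check such relationships is a rank correlation, which depends only on the ordering of the values. The sketch below implements Spearman's rank correlation with the standard library. Using Spearman here is our assumption, since the text above does not state which correlation measure the heatmap is based on, and the numbers are made up.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks, where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (spread_x * spread_y)

# Made-up values for four hypothetical earthquakes.
deaths = [10, 200, 50, 9000]
quote_counts = [3, 40, 15, 800]
rho = spearman(deaths, quote_counts)
```

In this toy example the two orderings agree exactly, so `rho` is 1 even though the relationship between the raw numbers is far from linear.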

Conclusion

Some closing words on our project

All in all, we can see that NLP is an area of science where there is still room for improvement. The simplest method, regex search, in the end worked best for detecting the quotes in our use case. Furthermore, from the dataset we extracted, we can learn a wide range of valuable information about the world, including what factors make a disaster break through in the news. As a result, this field has huge potential to reform science.

Team

The Amazing ADA HiveMind Team

Hilda Horvath

Lovro Nuic

Frano Rajic

Batuhan Faik Derinbay