Disease Spread Prediction

Abstract

Dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue fever can cause severe bleeding, low blood pressure, and even death.

Millions of cases of dengue infection occur worldwide each year. Dengue fever is most common in Southeast Asia and the western Pacific islands, but the disease has been increasing rapidly in Latin America and the Caribbean. This project aimed to establish a statistical model to predict the number of dengue cases in two Latin American cities.

Effect of Climate on Dengue Spread

Because it is carried by mosquitoes, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant public health implications worldwide.

An understanding of the relationship between climate and dengue dynamics can improve research initiatives and resource allocation to help fight life-threatening pandemics. Using environmental data collected by the Centers for Disease Control and Prevention (CDC) and the National Oceanic and Atmospheric Administration (NOAA) in the U.S., we set out to predict the number of dengue fever cases reported each week in San Juan, Puerto Rico, and Iquitos, Peru.

Data Description

To model dengue fever incidence, several climatic factors were used as explanatory variables, including air temperature (minimum, maximum, and average), diurnal temperature range, specific humidity, relative humidity, and precipitation. Weekly climate data were collected from three sources:

  • Daily climate data from the weather station
  • Satellite measurements
  • Climate Forecast System Reanalysis measurements


Data Preparation

    Missing Value Imputation

    Missing values were imputed with the average of the nearest non-missing values on either side. Missing values at the start or end of a series, which have a neighbour on one side only, were filled by backward fill or forward fill respectively.
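
    The imputation rule above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the function name `impute_missing` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_missing(series: pd.Series) -> pd.Series:
    """Fill gaps with the mean of the bounding observed values.

    Gaps at the start or end of the series have only one neighbour,
    so they fall back to backward fill and forward fill respectively.
    """
    forward = series.ffill()    # last observed value before each gap
    backward = series.bfill()   # next observed value after each gap
    averaged = (forward + backward) / 2  # NaN wherever one side is missing
    return averaged.fillna(forward).fillna(backward)


# Hypothetical weekly readings with interior and edge gaps (illustrative only).
temps = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])
filled = impute_missing(temps)
```

    In this sketch every value inside a multi-week gap receives the same average of the two bounding observations. Linear interpolation would be a reasonable alternative if a gradual transition is preferred.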

    Outlier Treatment

    Visual inspection of the dependent variable (total dengue cases) confirmed the presence of outliers, so a Hampel filter was used to treat them.

    Hampel Filter

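    The Hampel filter identifies and replaces outliers in time series data. It slides a window of configurable width over the data and computes the median of each window. The standard deviation is then estimated from the median absolute deviation (MAD) of the window, and any point that lies more than a chosen number of standard deviations from the window median is treated as an outlier and replaced by that median.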

    The Hampel filter has two configurable parameters:

  • The size of the sliding window
  • The number of standard deviations beyond which a point is flagged as an outlier

    Both parameters were chosen to suit the use case. After this step, the data were ready for modeling.
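
    A minimal sketch of a Hampel filter with these two parameters is shown below. The function name, the default values, and the sample series are assumptions made for illustration, since the settings used in the project are not stated. The constant 1.4826 is the usual factor that converts a MAD into a standard deviation estimate for normally distributed data. The sketch assumes missing values have already been imputed.

```python
import numpy as np
import pandas as pd

MAD_TO_SIGMA = 1.4826  # scales the MAD to a standard deviation for Gaussian data


def hampel_filter(series: pd.Series, half_window: int = 3,
                  n_sigmas: float = 3.0) -> pd.Series:
    """Replace outliers with the median of a centred sliding window.

    half_window: number of points on each side of the current point.
    n_sigmas:    threshold, in estimated standard deviations.
    Assumes the series has no missing values.
    """
    window = 2 * half_window + 1
    rolling = series.rolling(window=window, center=True, min_periods=1)
    median = rolling.median()
    # Median absolute deviation of each window around that window's median.
    mad = rolling.apply(lambda w: np.median(np.abs(w - np.median(w))), raw=True)
    sigma = MAD_TO_SIGMA * mad
    is_outlier = (series - median).abs() > n_sigmas * sigma
    return series.where(~is_outlier, median)


# Made-up weekly case counts with one obvious spike at position 4.
cases = pd.Series([1, 2, 3, 2, 100, 3, 2, 1, 2, 3], dtype=float)
cleaned = hampel_filter(cases, half_window=2, n_sigmas=3.0)
```

    A wider window makes the median less sensitive to short bursts of cases, while a larger threshold flags fewer points. For epidemic data, a threshold that is too strict may flatten genuine outbreaks.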

Statistical Modelling

    The problem was approached with an ensemble of two models. A negative binomial regression predicted the spread of dengue from climate variables, and a SARIMA model produced a time series forecast. The predictions of the two models were then averaged to give the final prediction.

    SARIMA

    Auto-Regressive Integrated Moving Average (ARIMA) models are widely used to forecast time series data. Seasonal ARIMA (SARIMA) models extend ARIMA to time series that exhibit seasonality. A SARIMA model has the following hyperparameters:

  • p: AR order
  • d: differencing order
  • q: MA order
  • P: Seasonal AR order
  • D: Seasonal differencing order
  • Q: Seasonal MA order
  • S: length of a period

    A SARIMA model applies differencing at lag S to remove additive seasonal effects, just as differencing at lag 1 removes a trend. It also includes autoregressive and moving average terms at lag S.
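
    The effect of the two kinds of differencing can be seen on a small synthetic series. The series below is invented for illustration (a linear trend plus a yearly sine cycle) and is not the dengue data.

```python
import numpy as np
import pandas as pd

S = 52  # length of one seasonal period, in weeks
t = np.arange(4 * S)

# Synthetic series (illustrative only): linear trend plus an annual cycle.
y = pd.Series(0.5 * t + 10.0 * np.sin(2 * np.pi * t / S))

# Differencing at lag S: y_t - y_{t-S}. The annual cycle cancels exactly,
# leaving only the constant growth over one period (0.5 * 52 = 26).
seasonal_diff = y.diff(S).dropna()

# Differencing at lag 1: y_t - y_{t-1}. The trend becomes a constant,
# but the seasonal cycle is still visible in the result.
trend_diff = y.diff(1).dropna()
```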

    For our prediction scenario:

  • As there was no integrating effect in the observed total cases, no differencing was needed and d was set to 0
  • Appropriate values for q and p were read from the ACF and PACF plots, respectively
  • S was set to 52 because the data are weekly and the seasonality in the climate data repeats every year
  • P and Q were found from the PACF and ACF plots by examining the correlation values at lags near the length of a period

    A SARIMA model was built with the chosen parameters and used to make predictions.

    Negative Binomial Regression

    Negative binomial regression is a supervised learning technique for predicting count variables. It is typically used for over-dispersed count data, i.e. when the variance exceeds the mean. As the weekly number of reported dengue cases was over-dispersed, a negative binomial (NB) model was used to forecast it with climate parameters as independent variables.

    Lagged variables were created from the climate data, and the features with good predictive power were kept. For feature selection, a random forest model was fitted to the raw data, and variables with high feature importance (based on Gini importance) were selected.
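
    A sketch of the lag construction and the importance ranking is given below, assuming scikit-learn's random forest. The helper names, column naming scheme, and forest settings are hypothetical. Note that for a regression forest, scikit-learn's `feature_importances_` measures the mean decrease in impurity (variance), which plays the same role as Gini importance in classification.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def add_lag_features(df: pd.DataFrame, columns, lags) -> pd.DataFrame:
    """Build lagged copies of the given columns, named '<col>_lag<k>'."""
    lagged = {
        f"{col}_lag{lag}": df[col].shift(lag)
        for col in columns
        for lag in lags
    }
    return pd.DataFrame(lagged, index=df.index)


def rank_features(features: pd.DataFrame, target: pd.Series,
                  n_estimators: int = 200, random_state: int = 0) -> pd.Series:
    """Rank features by impurity-based importance from a random forest."""
    # Drop the rows that shifting left incomplete.
    data = features.join(target.rename("target")).dropna()
    forest = RandomForestRegressor(n_estimators=n_estimators,
                                   random_state=random_state)
    forest.fit(data.drop(columns="target"), data["target"])
    importances = pd.Series(forest.feature_importances_,
                            index=features.columns)
    return importances.sort_values(ascending=False)
```

    The top entries of the returned series would then be passed to the regression model. Impurity-based importances can favour correlated or high-variance features, so the cut-off is a judgment call.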

    Hyperparameter tuning was done using cross-validation, and the final model was trained on Google Colab GPUs.

Further Steps

    The final forecast was the average of the predictions made by the SARIMA and negative binomial models. The models were validated on a holdout dataset, and the final performance was recorded as a mean squared error (MSE) of 14.8.
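
    The averaging step and the error metric are simple enough to write out directly. The numbers below are made up to show the arithmetic and are unrelated to the reported result.

```python
import numpy as np


def ensemble_average(sarima_pred, nb_pred) -> np.ndarray:
    """Average the two models' predictions week by week."""
    sarima_pred = np.asarray(sarima_pred, dtype=float)
    nb_pred = np.asarray(nb_pred, dtype=float)
    return (sarima_pred + nb_pred) / 2


def mean_squared_error(actual, predicted) -> float:
    """Mean of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))


# Made-up holdout values for three weeks (illustrative only).
sarima_pred = [10.0, 12.0, 8.0]
nb_pred = [14.0, 10.0, 6.0]
actual = [11.0, 12.0, 9.0]

final = ensemble_average(sarima_pred, nb_pred)   # [12.0, 11.0, 7.0]
mse = mean_squared_error(actual, final)          # (1 + 1 + 4) / 3 = 2.0
```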

    Dengue incidence is subject to many factors, only some of which fell within the scope of this project. External parameters such as vegetation index, population density, and other socio-economic factors strongly influence the spread of dengue. Extending this work to include them should lead to more accurate predictions, which would help governments and healthcare workers suppress the spread of dengue.