Machine learning for vector control decision making

Dataset

This dataset was collected by KY Lee and group in Seoul, South Korea. It consists of the number of mosquitoes per specific area in Seoul city.
The data were collected daily from May 2016 to December 2019.
The authors have made the dataset freely available at https://www.kaggle.com/kukuroo3/mosquito-indicator-in-seoul-korea.

Question

Let's assume that Seoul city health officials need to decide whether indoor residual spraying (IRS) is necessary based on climatic factors. To do so, they need to know the number of mosquitoes per specific area (the mosquito indicator). In this sample project I will develop a machine learning model that predicts the mosquito indicator value from climate data.

Importing required python libraries

In [22]:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler #To standardize the data to a common range
from sklearn.model_selection import train_test_split #to split data
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

Loading the dataset as a pandas dataframe

In [23]:

mosData = pd.read_csv('mosIndicator.csv') 

Exploring the data

In [24]:

mosData.head()

Out[24]:

   date        mosquito_Indicator  rain(mm)  mean_T(℃)  min_T(℃)  max_T(℃)
0  2016-05-01  254.4               0.0       18.8       12.2      26.0
1  2016-05-02  273.5               16.5      21.1       16.5      28.4
2  2016-05-03  304.0               27.0      12.9       8.9       17.6
3  2016-05-04  256.2               0.0       15.7       10.2      20.6
4  2016-05-05  243.8               7.5       18.9       10.2      26.9

In [25]:

mosData.describe()

Out[25]:

       mosquito_Indicator  rain(mm)     mean_T(℃)    min_T(℃)     max_T(℃)
count  1342.000000         1342.000000  1342.000000  1342.000000  1342.000000
mean   251.991803          3.539866     14.166021    10.005663    19.096870
std    295.871336          13.868106    10.943990    11.109489    11.063394
min    0.000000            0.000000     -14.800000   -17.800000   -10.700000
25%    5.500000            0.000000     4.500000     0.300000     9.300000
50%    91.900000           0.000000     16.500000    11.500000    21.900000
75%    480.400000          0.400000     23.300000    19.500000    28.175000
max    1000.000000         144.500000   33.700000    30.300000    39.600000

In [26]:

# Renaming columns to shorter names

mosData = mosData.rename(columns={"date": "Date", "mosquito_Indicator": "Mosquitos",
                                  "rain(mm)": "Rain", "mean_T(℃)": "Meant",
                                  "min_T(℃)": "Mint", "max_T(℃)": "Maxt"})

In [27]:

mosData.head()

Out[27]:

   Date        Mosquitos  Rain  Meant  Mint  Maxt
0  2016-05-01  254.4      0.0   18.8   12.2  26.0
1  2016-05-02  273.5      16.5  21.1   16.5  28.4
2  2016-05-03  304.0      27.0  12.9   8.9   17.6
3  2016-05-04  256.2      0.0   15.7   10.2  20.6
4  2016-05-05  243.8      7.5   18.9   10.2  26.9

In [28]:

# Checking for missing values

mosData.isnull().values.any()

#If we had null values
# mosData = mosData.dropna()

Out[28]:

False
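
No values are missing here, but if some were, dropping rows is not the only option. As a minimal sketch (hypothetical, since this dataset has no nulls), scikit-learn's SimpleImputer can fill gaps with column means:

# Hypothetical alternative to dropna(): fill missing values with column means
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
numeric_cols = ["Mosquitos", "Rain", "Meant", "Mint", "Maxt"]
mosData[numeric_cols] = imputer.fit_transform(mosData[numeric_cols])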

In [42]:

#If some categories had very low counts, we could merge them with the following function

#def drop_lows(category, cutoff):
#    categorical_map = {}
#    for i in range(len(category)):
#        if category.values[i] >= cutoff:
#            categorical_map[category.index[i]] = category.index[i]
#        else:
#            categorical_map[category.index[i]] = "Other"
#    return categorical_map

In [43]:

# Area is a hypothetical column to illustrate how to do with different datasets
# country_map = drop_lows(mosData.Area.value_counts(), 400)
# mosData["Area"] = mosData["Area"].map(country_map)

In [44]:

# Removing outliers - as evident from boxplots

#fig, ax = plt.subplots(1, 1, figsize=(12, 7))
#mosData.boxplot("Mosquitos", "Area", ax=ax)
#plt.suptitle("Mosquitos vs Area")
#plt.title("")
#plt.ylabel("Mosquitos")
#plt.xticks(rotation=90)
#plt.show()

In [45]:

# If there are outliers, they appear as dots outside the boxes.
# Suppose the boxes (including whiskers) stay between 50 and 700;
# then we could remove values outside that range, and drop the
# low-count "Other" categories as well
#mosData = mosData[mosData["Mosquitos"] <= 700]
#mosData = mosData[mosData["Mosquitos"] >= 50]
#mosData = mosData[mosData["Area"] != "Other"]
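
Before separating the features from the target, it can also help to check how each climate variable correlates with the mosquito indicator. A minimal sketch using pandas:

# Pearson correlation of each numeric column with the mosquito indicator
corr = mosData[["Mosquitos", "Rain", "Meant", "Mint", "Maxt"]].corr()
print(corr["Mosquitos"].sort_values(ascending=False))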

Separating the data and the features

In [29]:

xtemp = mosData.drop("Mosquitos", axis = 1)
x = xtemp.drop("Date",  axis = 1)
y = mosData["Mosquitos"]
* Both the features and the target have reasonable counts.
* All the values are numerical, so no conversion is needed.

Data standardisation

Rainfall is on a different scale from the temperatures. Thus, the features must be standardised to a common range.

In [30]:

scaler = StandardScaler() # Creating a StandardScaler instance named scaler

In [31]:

scaler.fit(x) #Computing the mean and standard deviation of each column of x

Out[31]:

StandardScaler()

In [32]:

st_data_x = scaler.transform(x)

In [33]:

x = st_data_x
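
As a quick sanity check, every standardised column should now have a mean of approximately 0 and a standard deviation of approximately 1:

# Each standardised column should have mean ~0 and standard deviation ~1
print(x.mean(axis=0).round(6))
print(x.std(axis=0).round(6))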

Splitting train and test datasets

In [41]:

X_train, X_test, Y_train, Y_test = train_test_split(x,y, test_size = 0.2)
# Test size is 20% of the data
# stratify is not used here: it keeps class proportions similar across the
# split, which applies to classification targets, not a continuous one
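
For results that are reproducible across runs, the split can be fixed with a random_state (the value 42 below is an arbitrary choice):

# Fixing random_state makes the split (and the error values below) reproducible
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)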

Training the model

Selecting the correct algorithm requires considering several factors, including the type of target variable and the size of the dataset. Here the target variable is a continuous number, so I will use regression algorithms: linear regression, lasso regression, support vector regression (Wu et al 2004), random forest (Wang et al 2018) and stochastic gradient boosting (Ebrahimi-Khusfi et al 2021). I will then compare which one performs best.

In [98]:

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error

1. Linear regression

In [99]:

lin_reg = LinearRegression() 

lin_reg.fit(X_train, Y_train)

Ytrpreds = lin_reg.predict(X_train)
Ytestpred = lin_reg.predict(X_test)

trerror = np.sqrt(mean_squared_error(Y_train, Ytrpreds))
testerror = np.sqrt(mean_squared_error(Y_test, Ytestpred))

print(trerror)
print(testerror)
192.98943977171356
202.67642370681074

2. Lasso Regression

In [100]:

lasso = Lasso()
lasso.fit(X_train, Y_train)

lasso_Ytrpreds = lasso.predict(X_train)
lasso_Ytestpred = lasso.predict(X_test)

lasso_trerror = np.sqrt(mean_squared_error(Y_train, lasso_Ytrpreds))
lasso_testerror = np.sqrt(mean_squared_error(Y_test, lasso_Ytestpred))

print(lasso_trerror)
print(lasso_testerror)
193.17517395879298
203.53583520781154

3. SVR

In [101]:

svr = SVR(kernel="linear", C=100, gamma="auto")
svr.fit(X_train, Y_train)

svr_Ytrpreds = svr.predict(X_train)
svr_Ytestpred = svr.predict(X_test)

svr_trerror = np.sqrt(mean_squared_error(Y_train, svr_Ytrpreds))
svr_testerror = np.sqrt(mean_squared_error(Y_test, svr_Ytestpred))

print(svr_trerror)
print(svr_testerror)
194.02301601725773
206.40789365465545

4. RF

In [102]:

rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(X_train, Y_train)

rf_Ytrpreds = rf.predict(X_train)
rf_Ytestpred = rf.predict(X_test)

rf_trerror = np.sqrt(mean_squared_error(Y_train, rf_Ytrpreds))
rf_testerror = np.sqrt(mean_squared_error(Y_test, rf_Ytestpred))

print(rf_trerror)
print(rf_testerror)
62.597274333213534
165.61398484014964

5. Stochastic gradient boosting

In [103]:

sgb = GradientBoostingRegressor(n_estimators = 1000, random_state = 0)
sgb.fit(X_train, Y_train)

sgb_Ytrpreds = sgb.predict(X_train)
sgb_Ytestpred = sgb.predict(X_test)

sgb_trerror = np.sqrt(mean_squared_error(Y_train, sgb_Ytrpreds))
sgb_testerror = np.sqrt(mean_squared_error(Y_test, sgb_Ytestpred))

print(sgb_trerror)
print(sgb_testerror)
33.115401993709604
184.6828138793723
Random forest gives the lowest test error, but the large gap between its training and test errors suggests overfitting. Linear regression shows the most consistent training and test errors, so I will use it for the prediction system.
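
A single random split also makes these error figures somewhat noisy. As a sketch, k-fold cross-validation (reusing the models defined above) would give a more stable comparison:

# Sketch: 5-fold cross-validated RMSE for each model fitted above
from sklearn.model_selection import cross_val_score

for name, model in [("Linear", lin_reg), ("Lasso", lasso), ("SVR", svr),
                    ("RF", rf), ("SGB", sgb)]:
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_root_mean_squared_error")
    print(name, round(-scores.mean(), 2))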

Creating a prediction system using linear regression

In [109]:

x = np.array([[0.0,16.4,10.4,23.5]]) # Pasting the input variables here: rainfall (0.0),
# mean temperature (16.4), minimum temperature (10.4), maximum temperature (23.5)

# ensuring the array is 2-D, as I predict for only one instance
x = x.reshape(1,-1)

# standardizing the data
x = scaler.transform(x)

predMosquitos = lin_reg.predict(x)

predMosquitos

Out[109]:

array([230.12553066])
Note: In this model no feature selection or extraction was done. Such methods are needed when there is a large number of features, some of which are redundant or irrelevant; otherwise the model may learn noise from those features and overfit or perform below its optimum level.
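
With only four climate features this is not critical here, but as a sketch, the random forest fitted above already gives a quick view of which features matter most:

# Feature importances from the random forest fitted earlier
feature_names = ["Rain", "Meant", "Mint", "Maxt"]
for name, importance in zip(feature_names, rf.feature_importances_):
    print(name, round(importance, 3))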

Saving the model

In [95]:

import pickle

In [96]:

with open("lin_reg.pkl", "wb") as file:
    pickle.dump(lin_reg, file)
#wb-write binary
To make a real-time predictor app, we can wrap the saved model in a web app built with Flask (or Streamlit). A platform such as Heroku can host such applications.
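
As a minimal sketch of such an app (the /predict endpoint, port and JSON payload format below are arbitrary choices, not part of this project):

# Minimal Flask sketch: load the pickled model and serve predictions
# (endpoint name, port and payload format are hypothetical choices)
import pickle

import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("lin_reg.pkl", "rb") as file:
    model = pickle.load(file)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [rain, mean_t, min_t, max_t]}
    features = np.array(request.json["features"]).reshape(1, -1)
    # Note: the fitted scaler would also need to be pickled and applied here
    return jsonify({"mosquito_indicator": float(model.predict(features)[0])})

if __name__ == "__main__":
    app.run(port=5000)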

References

Ebrahimi-Khusfi Z, Nafarzadegan AR & Dargahian F (2021). Predicting the number of dusty days around the desert wetlands in southeastern Iran using feature selection and machine learning techniques. Ecological Indicators 125, 107499.
Wang Z, Wang Y, Zeng R, Srinivasan RS & Ahrentzen S (2018). Random Forest based hourly building energy prediction. Energy and Buildings 171, 11–25.
Wu CH, Ho JM & Lee DT (2004). Travel-time prediction with support vector regression. IEEE Transactions on Intelligent Transportation Systems 5, 276–281.
