Machine learning for vector control decision making
Dataset
This dataset was collected by KY Lee and group in Seoul, South Korea. It consists of the number of mosquitoes per specific area in Seoul city.
The data were collected daily from May 2016 to December 2019.
The authors have made the dataset freely available at https://www.kaggle.com/kukuroo3/mosquito-indicator-in-seoul-korea.
Question
Let's assume that Seoul city health officials need to decide whether indoor residual spraying (IRS) is necessary based on climatic factors. For this they need the number of mosquitoes per specific area (the mosquito indicator). In this sample project I will develop a machine learning model that predicts the mosquito indicator value from climate data.
Importing required python libraries
In [22]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler  # to standardise features (zero mean, unit variance)
from sklearn.model_selection import train_test_split  # to split the data
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
Loading the dataset as a pandas dataframe
In [23]:
mosData = pd.read_csv('mosIndicator.csv')
Exploring the data
In [24]:
mosData.head()
Out[24]:
| | date | mosquito_Indicator | rain(mm) | mean_T(℃) | min_T(℃) | max_T(℃) |
|---|---|---|---|---|---|---|
| 0 | 2016-05-01 | 254.4 | 0.0 | 18.8 | 12.2 | 26.0 |
| 1 | 2016-05-02 | 273.5 | 16.5 | 21.1 | 16.5 | 28.4 |
| 2 | 2016-05-03 | 304.0 | 27.0 | 12.9 | 8.9 | 17.6 |
| 3 | 2016-05-04 | 256.2 | 0.0 | 15.7 | 10.2 | 20.6 |
| 4 | 2016-05-05 | 243.8 | 7.5 | 18.9 | 10.2 | 26.9 |
In [25]:
mosData.describe()
Out[25]:
| | mosquito_Indicator | rain(mm) | mean_T(℃) | min_T(℃) | max_T(℃) |
|---|---|---|---|---|---|
| count | 1342.000000 | 1342.000000 | 1342.000000 | 1342.000000 | 1342.000000 |
| mean | 251.991803 | 3.539866 | 14.166021 | 10.005663 | 19.096870 |
| std | 295.871336 | 13.868106 | 10.943990 | 11.109489 | 11.063394 |
| min | 0.000000 | 0.000000 | -14.800000 | -17.800000 | -10.700000 |
| 25% | 5.500000 | 0.000000 | 4.500000 | 0.300000 | 9.300000 |
| 50% | 91.900000 | 0.000000 | 16.500000 | 11.500000 | 21.900000 |
| 75% | 480.400000 | 0.400000 | 23.300000 | 19.500000 | 28.175000 |
| max | 1000.000000 | 144.500000 | 33.700000 | 30.300000 | 39.600000 |
In [26]:
# Changing the column names to shorter, easier-to-type labels
mosData = mosData.rename(columns={"date": "Date",
                                  "mosquito_Indicator": "Mosquitos",
                                  "rain(mm)": "Rain",
                                  "mean_T(℃)": "Meant",
                                  "min_T(℃)": "Mint",
                                  "max_T(℃)": "Maxt"})
In [27]:
mosData.head()
Out[27]:
| | Date | Mosquitos | Rain | Meant | Mint | Maxt |
|---|---|---|---|---|---|---|
| 0 | 2016-05-01 | 254.4 | 0.0 | 18.8 | 12.2 | 26.0 |
| 1 | 2016-05-02 | 273.5 | 16.5 | 21.1 | 16.5 | 28.4 |
| 2 | 2016-05-03 | 304.0 | 27.0 | 12.9 | 8.9 | 17.6 |
| 3 | 2016-05-04 | 256.2 | 0.0 | 15.7 | 10.2 | 20.6 |
| 4 | 2016-05-05 | 243.8 | 7.5 | 18.9 | 10.2 | 26.9 |
In [28]:
# Checking for missing values
mosData.isnull().values.any()
# If there were null values we could drop them:
# mosData = mosData.dropna()
Out[28]:
False
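Since the check returned False, no cleanup is needed here. For completeness, a minimal sketch of what we could do if missing values were present (the mean-imputation strategy is an illustrative choice, not from the original notebook):

# Sketch: two common ways to handle missing values, had any been found
cleaned = mosData.dropna()                                # drop incomplete rows
filled = mosData.fillna(mosData.mean(numeric_only=True))  # or impute numeric columns with their means
print(len(mosData), len(cleaned), len(filled))            # identical here, since there are no NaNs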
In [42]:
# If some categories had very low counts, we could merge them into an
# "Other" category with the following function:
# def drop_lows(category, cutoff):
#     categorical_map = {}
#     for i in range(len(category)):
#         if category.values[i] >= cutoff:
#             categorical_map[category.index[i]] = category.index[i]
#         else:
#             categorical_map[category.index[i]] = "Other"
#     return categorical_map
In [43]:
# "Area" is a hypothetical column, used to illustrate how this would work
# with other datasets:
# country_map = drop_lows(mosData.Area.value_counts(), 400)
# mosData["Area"] = mosData["Area"].map(country_map)
In [44]:
# Removing outliers, as identified from boxplots (hypothetical "Area" column again)
# fig, ax = plt.subplots(1, 1, figsize=(12, 7))
# mosData.boxplot(column="Mosquitos", by="Area", ax=ax)
# plt.suptitle("Mosquitos vs Area")
# plt.title("")
# plt.ylabel("Mosquitos")
# plt.xticks(rotation=90)
# plt.show()
In [45]:
# If there are outliers, they appear as dots outside the boxes. Suppose the
# boxes (including whiskers) do not go beyond 700 and 50; then we could
# remove those outliers, along with the merged "Other" rows, like this:
# mosData = mosData[mosData["Mosquitos"] <= 700]
# mosData = mosData[mosData["Mosquitos"] >= 50]
# mosData = mosData[mosData["Area"] != "Other"]
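As a data-driven alternative to reading cut-offs from a boxplot, here is a minimal sketch of interquartile-range (IQR) based outlier bounds on the real Mosquitos column (the 1.5 multiplier is the usual convention, an assumption rather than anything from the original analysis):

# Sketch: IQR-based outlier bounds for the Mosquitos column
q1, q3 = mosData["Mosquitos"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(f"Keeping values in [{lower:.1f}, {upper:.1f}]")
mosData_trimmed = mosData[mosData["Mosquitos"].between(lower, upper)]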
Separating the features and the target
In [29]:
xtemp = mosData.drop("Mosquitos", axis=1)  # dropping the target column
x = xtemp.drop("Date", axis=1)             # dropping the date; only climate features remain
y = mosData["Mosquitos"]                   # target variable
* All remaining feature values are numerical, so no categorical encoding is needed (a quick check is sketched below).
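A quick way to verify this, assuming the cells above have run:

# Checking that every remaining feature column is numeric
print(x.dtypes)          # expect float64 for Rain, Meant, Mint, Maxt
print(x.shape, y.shape)  # feature matrix and target vector sizes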
Data standardisation
The rainfall values are on a different scale from the temperatures, so the features must be standardised before modelling.
In [30]:
scaler = StandardScaler()  # creating a StandardScaler instance
In [31]:
scaler.fit(x)  # learning the mean and standard deviation of each feature (transforming comes next)
Out[31]:
StandardScaler()
In [32]:
st_data_x = scaler.transform(x)  # applying the standardisation to x
In [33]:
x = st_data_x
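As a sanity check (a small addition, not in the original notebook), the standardised features should now have a mean of roughly 0 and a standard deviation of roughly 1:

# Verifying the standardisation: column means ~0, standard deviations ~1
print(x.mean(axis=0).round(3))
print(x.std(axis=0).round(3))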
Splitting train and test datasets
In [41]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)
# Test size is 20% of the data. The stratify option only applies to
# classification targets, so it is not used for this regression problem.
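One caveat: because no random_state is passed, the split (and therefore every error value printed below) changes on each run. A sketch of a reproducible alternative (the seed 42 is an arbitrary choice):

# Sketch: fixing the seed makes the split, and the errors below, reproducible
# X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=42)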
Training the model
Selecting the correct algorithm requires considering several factors, including the type of target variable and the size of the dataset. Here the target variable is a continuous number, so a regression algorithm is appropriate. I will use linear regression, lasso regression, support vector regression (SVR; Wu et al. 2004), random forest (Wang et al. 2018) and stochastic gradient boosting (Ebrahimi-Khusfi et al. 2021), and compare which one performs best.
In [98]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
1. Linear regression
In [99]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)
Ytrpreds = lin_reg.predict(X_train)
Ytestpred = lin_reg.predict(X_test)
trerror = np.sqrt(mean_squared_error(Y_train, Ytrpreds))    # training RMSE
testerror = np.sqrt(mean_squared_error(Y_test, Ytestpred))  # test RMSE
print(trerror)
print(testerror)
192.98943977171356
202.67642370681074
2. Lasso Regression
In [100]:
lasso = Lasso()
lasso.fit(X_train, Y_train)
lasso_Ytrpreds = lasso.predict(X_train)
lasso_Ytestpred = lasso.predict(X_test)
lasso_trerror = np.sqrt(mean_squared_error(Y_train, lasso_Ytrpreds))
lasso_testerror = np.sqrt(mean_squared_error(Y_test, lasso_Ytestpred))
print(lasso_trerror)
print(lasso_testerror)
193.17517395879298
203.53583520781154
3. SVR
In [101]:
svr = SVR(kernel="linear", C=100, gamma="auto")
svr.fit(X_train, Y_train)
svr_Ytrpreds = svr.predict(X_train)
svr_Ytestpred = svr.predict(X_test)
svr_trerror = np.sqrt(mean_squared_error(Y_train, svr_Ytrpreds))
svr_testerror = np.sqrt(mean_squared_error(Y_test, svr_Ytestpred))
print(svr_trerror)
print(svr_testerror)
194.02301601725773
206.40789365465545
4. Random forest
In [102]:
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(X_train, Y_train)
rf_Ytrpreds = rf.predict(X_train)
rf_Ytestpred = rf.predict(X_test)
rf_trerror = np.sqrt(mean_squared_error(Y_train, rf_Ytrpreds))
rf_testerror = np.sqrt(mean_squared_error(Y_test, rf_Ytestpred))
print(rf_trerror)
print(rf_testerror)
62.597274333213534
165.61398484014964
5. Stochastic gradient boosting
In [103]:
# Note: with the default subsample=1.0 this is plain gradient boosting;
# setting subsample < 1.0 would make it truly "stochastic"
sgb = GradientBoostingRegressor(n_estimators=1000, random_state=0)
sgb.fit(X_train, Y_train)
sgb_Ytrpreds = sgb.predict(X_train)
sgb_Ytestpred = sgb.predict(X_test)
sgb_trerror = np.sqrt(mean_squared_error(Y_train, sgb_Ytrpreds))
sgb_testerror = np.sqrt(mean_squared_error(Y_test, sgb_Ytestpred))
print(sgb_trerror)
print(sgb_testerror)
33.115401993709604
184.6828138793723
Random forest gives the lowest test RMSE, although the large gap between its training and test errors suggests some overfitting. The linear models perform worse on the test set but generalise most consistently. For the prediction system below I will use linear regression for its simplicity and stability.
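Because a single random split can flip this ranking, a more robust comparison would use cross-validation. A minimal sketch (cv=5 and the scorer are choices of this sketch, not part of the original notebook):

# Sketch: 5-fold cross-validated RMSE for the same five models
from sklearn.model_selection import cross_val_score

models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(),
    "SVR": SVR(kernel="linear", C=100, gamma="auto"),
    "RF": RandomForestRegressor(n_estimators=1000, random_state=42),
    "SGB": GradientBoostingRegressor(n_estimators=1000, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE {-scores.mean():.1f} (+/- {scores.std():.1f})")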
Creating a prediction system using linear regression
In [109]:
# Pasting new climate values here gives a prediction; this example uses
# rainfall 0.0 mm, mean temperature 16.4 ℃, minimum 10.4 ℃, maximum 23.5 ℃
x = np.array([[0.0, 16.4, 10.4, 23.5]])
x = x.reshape(1, -1)     # reshaping, as we predict for a single instance
x = scaler.transform(x)  # standardising with the scaler fitted on the training data
predMosquitos = lin_reg.predict(x)
predMosquitos
Out[109]:
array([230.12553066])
Note: no feature selection or extraction was performed in this model. Such methods are needed when there are many features, some of which are redundant or irrelevant; otherwise the model may learn from noise, overfit, or perform below its optimum.
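If feature selection were needed, a simple univariate approach is sketched below (SelectKBest with f_regression is one option among many; k=3 is an arbitrary illustration):

# Sketch: scoring each climate feature against the target and keeping the best k
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=3)
x_selected = selector.fit_transform(x, y)
print(selector.scores_)   # univariate F-scores, one per feature
print(x_selected.shape)   # reduced feature matrix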
Saving the model
In [95]:
import pickle
In [96]:
with open("lin_reg.pkl", "wb") as file:
    pickle.dump(lin_reg, file)
# "wb" = write binary
To make a real-time predictor app, we can wrap the saved model in a web app using Flask (or Streamlit). A server such as Heroku can host the application.
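As a rough sketch of how that could look with Flask (the route name, JSON field names, and port are illustrative assumptions; in a real app the fitted scaler should be pickled and applied too):

# app.py - minimal sketch of a Flask endpoint around the saved model
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("lin_reg.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    d = request.get_json()
    # NOTE: the model was trained on standardised inputs, so the fitted
    # StandardScaler should also be saved and applied to these features
    features = np.array([[d["rain"], d["mean_t"], d["min_t"], d["max_t"]]])
    return jsonify({"mosquito_indicator": float(model.predict(features)[0])})

if __name__ == "__main__":
    app.run(port=5000)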
References
Ebrahimi-Khusfi Z, Nafarzadegan AR & Dargahian F (2021). Predicting the number of dusty days around the desert wetlands in southeastern Iran using feature selection and machine learning techniques. Ecological Indicators 125, 107499.
Wang Z, Wang Y, Zeng R, Srinivasan RS & Ahrentzen S (2018). Random Forest based hourly building energy prediction. Energy and Buildings 171, 11–25.
Wu CH, Ho JM & Lee DT (2004). Travel-time prediction with support vector regression. IEEE Transactions on Intelligent Transportation Systems 5, 276–281.
