Modelling possible habitats for poplar invasion in South Africa

Conservation implications: Poplars are among the most aggressive invasive plant species in South Africa. The results of this study are expected to help conservation authorities understand the current climatic factors affecting the species' distribution and identify potentially suitable sites.


Introduction
Species distribution modelling (SDM) has emerged as a powerful tool in ecological research (Peterson et al. 2011), greatly facilitating the understanding of geo-ecology and the prediction of potentially suitable environments for species (Stewart et al. 2022). In principle, SDMs link known species locations to predictor variables to evaluate patterns of species occurrence and habitat suitability (Elith, Kearney & Phillips 2010; Guisan, Thuiller & Zimmermann 2017; Wolmarans, Robertson & Van Rensburg 2010). They may even be used to predict species survival in response to future environmental shifts (Peterson et al. 2011). While SDMs often integrate climatic variables such as temperature and precipitation (Zhang et al. 2020), they can also include other abiotic variables such as soil, land use, land cover and topography (Buri et al. 2017; Dubuis et al. 2013). Because the latter variables are available at fine spatial resolution and capture local variation, they are typically included when a study aims to build an SDM at the local scale. Given their unique modelling ability, SDMs have become one of the most widely used tools for characterising species distributions and predicting habitat suitability (Araújo et al. 2019).
Advances in computing resources and machine learning algorithms, as well as the availability of fine-scale remote sensing products and climate data, are an important step forward in the use of SDMs for various applications (Guisan et al. 2017; Kass et al. 2021; Thuiller et al. 2009). These developments offer the opportunity to implement SDMs with great effect because the resolution at which they are computed greatly influences their overall performance (Lee-Yaw et al. 2022). In addition, the growing utility of SDMs lies in their relatively simple application with publicly available software packages, as well as their low data requirements (Zurell 2019) and the availability of guidelines (Guisan et al. 2017).
Poplars (Populus alba, Populus canadensis, Populus canescens, Populus deltoides, Populus fremontii, Populus nigra and Populus simonii) are found throughout the world and are invasive in South Africa, where they are permitted only in certain areas under controlled conditions, as specified in the country's invasive species legislation. To better trace their geographic distribution, this study predicts the potentially suitable habitat of poplar trees in South Africa using a generalised linear model (GLM), Random Forests (RF) and Support Vector Machines (SVM), and also assesses the climatic variables with the greatest impact on prediction performance. The results show excellent performance for all models (area under the receiver operating characteristic curve [AUC] > 0.9) in predicting the poplar distribution, with RF achieving the best performance (r = 0.83 and AUC = 0.965), followed by SVM (r = 0.72 and AUC = 0.959) and then GLM (r = 0.65 and AUC = 0.937). From a geographical perspective, all models show a similar pattern, with the highest concentrations in the south-western parts of the Western Cape, the southern Cape on the Garden Route, the central-eastern Free State, Mpumalanga, and the southern parts of Limpopo. The evaluation of the relative importance of the bioclimatic variables showed that precipitation of the warmest quarter, precipitation of the driest quarter and annual precipitation contribute most to predicting poplar occurrence. These results demonstrate the power of machine learning and regression models for predicting suitable habitats and extracting valuable environmental-climatic knowledge for monitoring and managing invasive tree species such as poplars.
Several studies have reported high model performance in SDM applications. Zhang et al. (2020) predicted the possible distribution of Anredera cordifolia using Random Forest (RF) and identified temperature as the most important factor influencing predictive performance for this species. Yudaputra, Pujiastuti and Cropper (2019) predicted the potential current distribution of Guettarda speciosa in Indonesia using Maximum Entropy (MaxEnt), Support Vector Machine (SVM), RF, Generalized Linear Model (GLM), Domain, and Bioclim. Their study showed that MaxEnt outperformed its competitors, with the highest area under the receiver operating characteristic curve (AUC) value of 0.89. Kaky et al. (2020) evaluated the performance of ensemble and MaxEnt modelling approaches in predicting the potential spread of Egyptian medicinal plants and found that RF and MaxEnt performed better than SVM and Classification and Regression Tree (CART) in predicting the potential distribution of the target species. However, the low performance of some models does not make them weak modelling approaches. A model will perform poorly if its requirements are not all met. For example, SVM sometimes performs better than MaxEnt and RF, possibly because the data structure is conducive to optimal performance. It is therefore essential to ensure that the model requirements are fully met before application and evaluation. In general, studies have demonstrated the high potential of machine learning and regression techniques for predicting species distributions in landscapes and ranking the explanatory factors affecting distribution trends (Ahmed, Atzberger & Zewdie 2021; Shabani, Kumar & Ahmadi 2016). While these techniques have been widely and successfully used to quantify the dispersal of many native and non-native plant species, they have not been applied to the study of poplars in the Global South.
In South Africa, Populus alba and Populus canescens are classified as Category 2 weeds under the National Environmental Management: Biodiversity Act (NEMBA) (No. 10 of 2004) and the Conservation of Agricultural Resources Act (CARA) (No. 43 of 1983) of South Africa's invasive species legislation (Henderson 2007). This means that poplars, as Category 2 plants, may not occur on land or inland water surfaces other than a demarcated area or a biological control reserve (Roy, Pauchard & Stoett 2023). For example, in most conservation areas under South African National Parks (SANParks), they are classified as a species of very high concern, encroaching on ecologically sensitive landscapes such as riparian zones; they are therefore cleared under continuous surveillance (SANParks 2020). Poplars consume large amounts of water from streams, thereby worsening conditions for aquatic species (Théroux Rancourt, Éthier & Pepin 2015). Like any woody species, poplars compete with native species for resources, and their growth form and crown give them an advantage by intercepting rain and sunlight that would otherwise be used by native species (Kumschick et al. 2020; Zhang et al. 2021). For these reasons, it is important to understand their spatial distribution and climatic drivers so that they can be appropriately monitored and managed in the country. This study was conducted at the national scale because bioclimatic variables show limited variation over smaller spatial extents (Fournier et al. 2017).
To achieve the aim of this study, the performance of GLM, RF and SVM models in predicting the occurrence and distribution of poplar trees in South Africa was tested and compared. The predictive power of each model under current climatic conditions was examined, and the most influential climatic parameters for poplar distribution in the country were evaluated. The results of this study could provide valuable insights to support invasive alien species management actions and decision-making processes.

Study area
This study was conducted in South Africa, the southernmost country on the African continent (Figure 1). The country lies between latitudes 22° S and 35° S and longitudes 16° E and 33° E (Van Wilgen et al. 2020). The country's interior consists of a large, almost flat plateau lying about 1000 m above sea level in most places (King 1942). South Africa has a temperate climate with warm and humid summers and cold and dry winters (Blamey et al. 2017). The country's annual average rainfall is 464 mm, with the Western Cape receiving winter rainfall (June to August) and the rest of the country receiving summer rainfall (December to February) (Roffe, Fitchett & Curtis 2019). South Africa consists of nine vegetation biomes or units: Forests, Savannah, Fynbos, Grassland, Desert, Nama-Karoo, Succulent Karoo, Indian Ocean Coastal Belt and Albany Thicket (Dayaram et al. 2019; Mucina & Rutherford 2006). Large impact craters, orogenic belts, granite belts and cratons are only a few examples of South Africa's extremely diverse geology (Du Toit 1940). South Africa's topography is made up of the craggy mountains of the Cape Fold Belt and the Great Escarpment, which encircle the country to the west, south and southeast. Beyond this lies a narrow coastal plain (Haughton 1969).

Data collection
To quantitatively predict poplar (Populus nigra, Populus simonii, Populus fremontii, Populus deltoides, P. canescens, Populus canadensis and P. alba) distribution in South Africa, spatially referenced poplar occurrence points were obtained from the Global Biodiversity Information Facility (GBIF) (https://www.gbif.org/, accessed on 12 June 2023 and updated on 20 December 2023). After filtering the data by the South African border and removing duplicates, 467 poplar occurrence points remained and were eligible for analysis. The breakdown of species by location is shown in Table 1. About 600 absence points were generated using the random point function in the R environment.
This study also attempted to identify the South African protected areas most likely to be affected by poplars by superimposing the best-modelled poplar distribution results on the protected area shapefiles and extracting the protected areas where poplars occur using the intersection function in Quantum Geographic Information System (QGIS). Only the most affected protected areas are reported; however, the wider distribution is provided.
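The data preparation steps described above (duplicate removal, spatial filtering and random absence generation) can be sketched as follows. This is a minimal Python illustration and not the R workflow used in the study: the function names are hypothetical, the sample records are invented, and a rectangular bounding box taken from the coordinates given in the study area description stands in for the actual national border.

```python
import random

# Assumption: a rectangle (west, south, east, north) built from the study-area
# coordinates; the study filtered by the real border, not a box.
SA_BBOX = (16.0, -35.0, 33.0, -22.0)

def deduplicate(points, decimals=4):
    """Drop repeated (lon, lat) records, keeping the first occurrence."""
    seen, unique = set(), []
    for lon, lat in points:
        key = (round(lon, decimals), round(lat, decimals))
        if key not in seen:
            seen.add(key)
            unique.append((lon, lat))
    return unique

def in_bbox(point, bbox):
    """True if a (lon, lat) point falls inside the bounding box."""
    lon, lat = point
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north

def random_absences(n, bbox, seed=42):
    """Draw n random (lon, lat) points inside the bounding box."""
    rng = random.Random(seed)
    west, south, east, north = bbox
    return [(rng.uniform(west, east), rng.uniform(south, north)) for _ in range(n)]

# Invented records: one duplicate and one point outside the box.
records = [(18.42, -33.92), (18.42, -33.92), (28.50, -28.50), (2.35, 48.85)]
presences = [p for p in deduplicate(records) if in_bbox(p, SA_BBOX)]
absences = random_absences(600, SA_BBOX)
print(len(presences), len(absences))
```

Sampling longitude and latitude uniformly is not strictly equal-area, and a rectangle includes ocean and neighbouring countries, so a real application would sample within the border polygon instead.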

Environmental predictors
The Bioclim algorithm, a commonly used climate-envelope model (Booth et al. 2014), was used in this study; it was implemented in the R environment using the raster package, with environmental predictors taken from the WorldClim database (Table 2). Soil, land-use cover and topography have been documented as important environmental variables for plant species distribution (Huang et al. 2021; Krauss et al. 2008). However, in this study, the focus was only on the current climate variables affecting the range of poplar species in South Africa.
All Bioclim variables were spatially scaled to a 1 km grid and cropped to the extent of the South African border. Temperature variables in the WorldClim database are usually multiplied by 10 to reduce storage size. The temperature values were therefore divided by 10 to convert them back to degrees Celsius.
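The temperature rescaling described above amounts to dividing the stored values by the storage multiplier. The short Python sketch below illustrates this on a handful of invented cell values; the missing-data flag and the function name are assumptions made for the example only.

```python
SCALE_FACTOR = 10  # storage multiplier described in the text
NODATA = -9999     # hypothetical missing-data flag, not taken from the source

def to_celsius(stored_values):
    """Convert stored temperature values (degrees Celsius x 10) to degrees Celsius."""
    return [None if v == NODATA else v / SCALE_FACTOR for v in stored_values]

stored = [215, 187, -34, NODATA]  # invented cell values
print(to_celsius(stored))  # [21.5, 18.7, -3.4, None]
```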

Modelling methods
This study was conducted using all available poplar species data as indicated in section 'Distribution of poplar in South Africa'. Some species (e.g. Populus fremontii) had insufficient occurrence data and could not meet the modelling requirements for separate analyses. Commonly applied species distribution models, namely GLMs (Melo-Merino, Reyes-Bonilla & Lira-Noriega 2020), RFs (Breiman 2001) and SVMs (Vapnik 1998), were used in the R statistical environment. Stepwise logistic regression was used to select the best combination of predictors. This process was repeated 10 times in a row to ensure robust modelling output. The predictors from the model that yielded the lowest Akaike information criterion (AIC) score were selected for modelling. This step is important because it promotes a properly fitted model, that is, one free of multicollinearity and redundant data.
For all models, the same initial set of inputs was included, that is, poplar occurrence and absence points as the response variable and 13 selected Bioclim variables as environmental predictors. Species presence was coded as 1 and absence as 0 for modelling. The distribution data were divided into two parts, 75% for training and 25% for testing. The GLM was implemented using the 'glm' function, with the link function and error distribution specified through the family argument. Random Forests was implemented with 'randomForest', and the SVMs were implemented in R using 'ksvm' in the package 'kernlab' because it contains many different SVM formulations and kernels and provides useful options and functions such as a plotting method. Next, RF was used to evaluate the most important variables (Genuer, Poggi & Tuleau-Malot 2010) in determining the occurrence and distribution of poplar in the South African landscape. All the models were repeated 10 times to ensure that the modelling results are consistent and accurate. The accuracy of all models was assessed by the area under the curve (AUC).
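To make the selection criterion concrete, the sketch below computes AIC = 2k - 2 ln(L) for two candidate logistic models and keeps the one with the lower score. It is a Python illustration only: the candidate names, the fitted probabilities and the parameter counts are invented, and the stepwise search itself is not reproduced.

```python
import math

def bernoulli_log_likelihood(y_true, p_pred, eps=1e-12):
    """Log-likelihood of 0/1 observations under predicted presence probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

y = [1, 1, 0, 0, 1, 0]  # invented presence (1) / absence (0) labels

# Hypothetical candidates: (fitted probabilities, number of parameters incl. intercept)
candidates = {
    "one_predictor": ([0.7, 0.6, 0.4, 0.3, 0.6, 0.5], 2),
    "two_predictors": ([0.9, 0.8, 0.2, 0.1, 0.7, 0.3], 3),
}

scores = {name: aic(bernoulli_log_likelihood(y, p), k)
          for name, (p, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best)
```

Here the larger model is retained because its improved fit outweighs the penalty of one extra parameter; with a smaller gain in likelihood the simpler model would win.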
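The 75%/25% partition and the repeated runs can be illustrated with a simple shuffled split. This Python sketch is an assumption-laden stand-in for the R workflow: the function name is hypothetical, 600 is used for the approximately 600 absence points, and drawing a fresh split for each of the 10 repetitions is one possible reading of the text rather than a confirmed detail.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=1):
    """Shuffle the rows reproducibly and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# (record id, label): 467 presences coded 1 and about 600 absences coded 0.
rows = [(i, 1) for i in range(467)] + [(i, 0) for i in range(467, 1067)]

train, test = train_test_split(rows)
print(len(train), len(test))  # 800 267

# Assumption: each of the 10 repetitions uses its own random split.
splits = [train_test_split(rows, seed=s) for s in range(10)]
```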

Ethical considerations
This article followed all ethical standards for research without direct contact with human or animal subjects.

Model performance and validation
The geographic distribution of poplar trees (P. alba, P. canadensis, P. canescens, Populus deltoides, P. fremontii, Populus nigra and Populus simonii) in South Africa was modelled using GLM, RF and SVM models, and their performance is shown in Table 3. The results show excellent performance for all models, with RF achieving the highest correlation (0.83) and AUC (0.965). This was followed by SVM and then GLM, with correlation and AUC values of 0.72 and 0.959, and 0.65 and 0.937, respectively (Table 3). A key feature of these results is the superior performance of the machine learning methods over the regression model. Figure 2 compares the model accuracies using the receiver operating characteristic (ROC) curve. Here, the ROC curve plots the true positive rate (presences correctly predicted) against the false positive rate (absences wrongly predicted as presences) for poplars. The higher AUC of RF indicates a better ability to distinguish poplar presence from absence samples (Figure 2a), followed by SVM (Figure 2b) and then GLM (Figure 2c).
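As a reminder of what the AUC measures, the following Python sketch computes it directly as the proportion of presence/absence pairs in which the presence point receives the higher predicted suitability, with ties counted as half. The labels and scores are invented for illustration and are unrelated to the values in Table 3.

```python
def auc(labels, scores):
    """AUC as the share of presence/absence pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one presence and one absence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]              # invented test labels
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]  # invented predicted suitabilities
print(auc(labels, scores))  # 8 of the 9 presence/absence pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values above 0.9 are read as excellent discrimination.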

Distribution of poplar in South Africa
In this section, the poplar distribution modelled for South Africa using RF, SVM and GLM is presented, and the results are shown in Figure 3. All models showed a similar pattern of poplar distribution, with higher concentrations in the south-western part of the Western Cape, the southern Cape on the Garden Route, the central-eastern Free State, and the north-eastern part of the country in Mpumalanga and Limpopo (Figure 3a-c). However, RF had a better prediction performance than its counterparts (see Figure 2).
Populus nigra, P. alba, and P. canescens appear to have suitable habitats throughout most of the country, except for the extremely hot and arid northern parts of the Northern Cape (Figure 4a). This conclusion was reached after overlaying all poplar species points on the results of the RF model. Many provinces have populations of P. deltoides; however, it has not been recorded in provinces such as the North West and the Free State. In contrast, there are only a few records of P. simonii (Western Cape), P. fremontii (KwaZulu-Natal), and P. canadensis (Free State and Mpumalanga) in South Africa (see Figure 4a). Overall, poplar species appear to be widely distributed in temperate regions with adequate rainfall and less so in arid desert areas (see Figure 4b and Appendix 3). This is not surprising because the majority of poplar species require large amounts of water to flourish and survive.
In assessing the occurrence status of poplars in South African protected areas, this study found that poplars are common in a few protected areas where conditions are more favourable. These include Golden Gate Highlands National Park, Table Mountain National Park, Cape Peninsula Nature Reserve, Tweefontein Reserve and Platberg Private Nature Reserve.
The study ranked the contribution of all 13 environmental variables in determining poplar occurrence and distribution and found precipitation of the warmest quarter, precipitation of the driest quarter and annual precipitation to be the top three contributing factors (refer to Figure 5). These findings highlight the significance of wet conditions and water availability for poplar presence in South Africa.

Discussion
This study provided a successful prediction of poplar species distribution in South African landscapes and allowed (1) an understanding of the geographic distribution and pattern of poplars in the country, (2) the identification of the environmental conditions that most affect poplar occurrence and dispersion, and (3) testing of the superior performance of machine learning models (RF and SVM) over a regression model for modelling species distribution.
The results showed that poplars are mainly found in regions of warm temperature and high rainfall in South Africa, including the south-western parts of the Western Cape and Northern Cape, the southern Cape on the Garden Route, the central-eastern Free State, western parts of KwaZulu-Natal, Gauteng, eastern parts of North West, and the north-eastern part of the country in Mpumalanga and Limpopo. These also include protected areas found in the aforementioned regions, such as Golden Gate Highlands National Park (Daemane, Van Wyk & Moteetee 2010), Table Mountain National Park, Cape Peninsula Nature Reserve, Tweefontein Reserve, and Platberg Private Nature Reserve, to name but a few. The findings also suggest that species such as P. nigra, P. alba, P. canescens, and P. deltoides are found across much of the country, mainly in the wetter and warmer provinces.
Their ability to thrive in semi-dry to wet conditions may explain this pattern (Caudullo & De Rigo 2016; De Rigo et al. 2016). Most establish naturally and thrive when enough water and sunlight are available. Poplar species including P. simonii, P. fremontii, and P. canadensis are planted around homes and along streets as windbreaks and ornamentals and are carefully tended with frequent irrigation. Overall, the results are consistent with Ntshidi et al. (2018) and Mtengwana et al. (2021) because, from a geographic perspective, invasive species predominate in the wetter parts of the country, which are important water sources for the country's major rivers. Poplars need sufficient water and warmer temperatures for better growth and survival (Fischer et al. 2018; Kalcsits, Silim & Tanino 2009). This explains their occurrence in riparian zones and in the warmer and wetter parts of the country (Sperandio et al. 2022; Xi et al. 2021). While the invasiveness of poplar has not been extensively studied, like any other invasive species it competes with important native species for resources and may outcompete them (Poudel et al. 2019; Zhang et al. 2021). Therefore, the results of this study are important because they clarify the patterns of poplar spread across the country and in selected protected areas, which informs the interventions needed to monitor and manage them. In protected areas under SANParks, poplar is categorised as a species of special concern, which requires constant monitoring and management (SANParks 2020).
However, knowledge about what contributes to its spread is not yet available. It is hoped that this gap will be filled as an extension of this study.
It is well known that the geospatial distribution of plants is largely influenced by an interplay of multiple mechanisms, including climate and other environmental characteristics (Chen et al. 2014). Although some of these parameters are not part of this modelling approach, they are likely to help determine poplar distribution. To some extent, variations in topography and land cover are related to climatic variations (Huang et al. 2021; Krauss et al. 2008); hence, this study focused solely on the bioclimatic variables. It is therefore assumed that the observed distribution of poplars depends solely on the current climate. For this reason, RF was used to evaluate the most influential climatic variables. In doing so, this study found that precipitation of the warmest quarter, precipitation of the driest quarter and annual precipitation are the most important variables in the prediction of poplar trees. Warm and humid conditions are favourable for poplars, while cold and dry conditions appear to be unsuitable. Likewise, Yudaputra, Robiansyah and Rinandio (2019), using RF, examined the variables essential to the dispersal of Eusideroxylon zwageri in Indonesia and found that precipitation in the coldest months, precipitation seasonality, and isothermality are the most important variables affecting the dispersal of the target species. This information is important for predicting the habitat suitability of landscapes for the species and thus their future spread (Burns, Clemann & White 2020; Zhang et al. 2019).
The results of this study are consistent with previous studies in which machine learning methods (RF and SVM) outperformed a regression model (GLM) (Ahmed et al. 2021).
The superior performance of RF was also confirmed by Jensen et al. (2020) in their prediction of the invasive kudzu vine in the USA using Sentinel-2 imagery. Random Forest uses only a random subset of the predictor variables at each split as each tree grows. This creates decorrelated trees and reduces the variance of the final model (Hastie, Tibshirani & Friedman 2009). Additionally, RF is resistant to overfitting even though it can fit quite complex responses, which makes it a strong choice for modelling species distribution. As the second-best model, SVM works by constructing a series of hyperplanes to separate data points based on their class (Ahmed et al. 2021; Kass et al. 2021). Its performance is often poor when the number of background samples significantly exceeds the number of occurrence records, largely because of class overlap (García-Roselló et al. 2019). This issue was addressed by applying weights, which significantly increase the cost of misclassifying presence points (Duan et al. 2014). Support Vector Machine has been acknowledged for its performance in SDMs in many studies. For example, Bedia, Busqué and Gutiérrez (2011) used SVM and other models to predict the distribution of plant species in an alpine rangeland in northern Spain. Hailu et al. (2017) reported that SVM performed well together with MaxEnt in assessing the spatial distribution of Coffea arabica in the Ethiopian highlands. SVM has been reported to perform better than RF, especially when the number of observed occurrences is small, because it can be trained with few meaningful pixels (Pouteau et al. 2012). The generalised linear model was the poorest-performing model in this study; it is commonly used to determine the distribution pattern of species using iteratively weighted linear regression to obtain maximum likelihood estimates of the parameters (Fukuda et al. 2013; Godsoe 2014). Its performance relative to RF and SVM is consistent with other similar studies. However, in some applications, the GLM has been found to perform better than the Boosted Regression Tree in predicting the distribution of eight different species in Australia (Shabani et al. 2016). Overall, all three models performed well, achieving an AUC > 0.9, which confirms their value for modelling species distributions, especially the machine learning-based techniques.
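The weighting idea mentioned above for SVM can be sketched with a simple inverse-frequency scheme, in which each class receives a weight proportional to the inverse of its sample count so that errors on the rarer class cost more. The exact weighting used in the study is not stated, so this Python example is an assumption; the class counts reuse the 467 presences and approximately 600 absences reported in the data collection section.

```python
def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

labels = [1] * 467 + [0] * 600  # presences coded 1, absences coded 0
weights = balanced_class_weights(labels)

# The rarer class (presence) receives the larger misclassification cost.
print(weights)
```

With these weights, the total weight assigned to each class is equal, so neither class dominates the fitted decision boundary purely through its sample size.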

Conclusions
This study predicted poplar distribution in South Africa using machine learning (RF and SVM) and regression (GLM) models. The results showed that poplars are mostly distributed in the warmer regions of the country that receive above-average rainfall, from the southwest of the Western Cape and Northern Cape, the central Free State, western parts of KwaZulu-Natal, eastern parts of North West, Mpumalanga and Gauteng, to the southern parts of Limpopo. From a conservation perspective, this distribution appears to affect several protected areas, including Golden Gate Highlands National Park, Table Mountain National Park, Cape Peninsula Nature Reserve, Tweefontein Reserve, and Platberg Private Nature Reserve, to name but a few. The results of this study showed that machine learning methods (RF, AUC = 0.965; SVM, AUC = 0.959) outperformed GLM (AUC = 0.937) in predicting the occurrence and distribution of poplar trees under the current climatic conditions, with RF performing the best. Based on RF, precipitation of the warmest quarter, precipitation of the driest quarter and annual precipitation were the most influential climatic parameters for poplar distribution in South Africa. With the continued evolution of SDMs, more poplar distribution data and better data availability are expected to further improve the understanding of species distribution and of the environmental variables that influence predictive performance.

FIGURE 2: Receiver operating characteristic curves for (a) random forests, (b) support vector machine, and (c) generalised linear model.

FIGURE 3: Modelled distribution of poplars in South Africa using (a) random forests, (b) support vector machine, and (c) generalised linear model.
The legend/ramp on the y axis represents suitability levels, with green representing highly suitable sites and white-pinkish colours representing low-suitability areas.

FIGURE 4:

FIGURE 5: The importance of bioclimatic variables measured using random forests (a) percentage IncMSE and (b) IncNodePurity.

TABLE 1: Poplar species gathered from the Global Biodiversity Information Facility, including localities and counts. Depicts the distribution of different Populus species in South Africa.

TABLE 2: Bioclimatic variables and their descriptions.

TABLE 3: A summary of model performance.
AUC, area under the receiver operating characteristic curve.

TABLE 4: Evaluation of environmental factors for predicting the distributions of poplars.