Sampling bias in reptile occurrence data for the Kruger National Park

To effectively conserve and manage species, it is important to (1) understand how they are spatially distributed across the globe at both broad and fine spatial resolutions and (2) elucidate the determinants of these distributions. However, information pertaining to the distributions of many species remains poor as occurrence data are often scarce or collected with varying motivations, making the resulting patterns susceptible to sampling bias. Exacerbating an already limited quantity of occurrence data with an assortment of biases hinders their effectiveness for research, thus making it important to identify and understand the biases present within species occurrence data sets. We quantitatively assessed occurrence records of 126 reptile species occurring in the Kruger National Park (KNP), South Africa, to quantify the severity of sampling bias within this data set. We collated a data set of 7118 occurrence records from museum, literature and citizen science sources and analysed these at a biologically relevant spatial resolution of 1 km × 1 km. As a result of logistical challenges associated with sampling in KNP, approximately 92% of KNP is data deficient for reptile occurrences at the 1 km × 1 km resolution. Additionally, the spatial coverage of available occurrences varied at species and family levels, and the majority of occurrence records were strongly associated with publicly accessible human infrastructure. Furthermore, we found that sampled areas within KNP were not necessarily ecologically representative of KNP as a whole, suggesting that areas of unique environmental space remain to be sampled.
Our findings highlight the need for substantially greater sampling effort for reptiles across KNP and emphasise the need to carefully consider the sampling biases within existing data should these be used for conservation management decision-making. Modelling species distributions could potentially serve as a short-term solution, but a concomitant increase in surveys across the park is needed.

Introduction
Effective conservation and management of organisms require an understanding of how species are spatially distributed at both broad and fine spatial resolutions, and ideally also the underlying determinants of their distribution patterns (Hurlbert & Jetz 2007; Kery 2011). However, species geographic data that may help inform conservation management decisions are often limited and biased in their collection strategies (Franklin 2010). For example, although museum databases often include occurrence data of collected specimens, the principal purpose of most museum collections is to act as reference catalogues for species identification rather than for species distribution mapping (Newbold 2010). It is important to note that although several museum specimens are collected directly as a result of systematic sampling, many specimens are collected opportunistically (Kadmon, Farber & Danin 2004; Pyke & Ehrlich 2010). As a result, collection effort and spatial coverage within museum data naturally vary depending on the interests of the collection. Despite this, a recently increased urgency in the need for species distribution information has placed a greater emphasis on the use of museum databases for amassing species occurrence records (Syfert, Smith & Coomes 2013).
In recent years, the capture of museum data within electronic databases, the establishment and continued activities of atlasing projects, and the growth of citizen science projects have provided a wealth of species occurrence data that are accessible online (Newbold 2010). These data are undoubtedly valuable but are subject to a multitude of biases, errors and uncertainties that need to be considered should these data be used for environmental research. Generally, species occurrence records are susceptible to geospatial or taxonomic sampling biases and on their own do not explain the full extents of species distributions (Bird et al. 2014; Botts, Erasmus & Alexander 2011; Reddy & Davalos 2003). For example, museum data are often biased towards heavily sampled areas (Newbold 2010), atlas data tend to be vulnerable to omission errors (Botts et al. 2011) and citizen science records are often imprecise (Geldmann et al. 2016; McGrath et al. 2015).
For rare and understudied species, bias in occurrence data sets exacerbates an already severe issue of misinformation and overall data deficiency (Reddy & Davalos 2003). With recorded occurrences of these animals already limited, the presence of an assortment of sampling biases within databases further restricts our understanding of these species' distributions and curtails our ability to manage them effectively. For cryptic species such as some species of reptiles (McGrath et al. 2015), there is often a distinct lack of high-quality records of these animals' occurrences within their natural environments (Böhm et al. 2013; Tolley et al. 2016), even within areas specifically designated for conservation (Ferreira et al. 2011; Venter et al. 2008; Zielinski 2001).
In South Africa, the Kruger National Park (KNP) is home to approximately 126 reptile species (Branch 1998; Pienaar 1978). The presence of reptiles promotes ecological diversity within KNP, and more broadly southern Africa, as many reptile species are likely to have important ecological roles or carry out ecologically beneficial functions within a variety of habitats and ecosystems (Trimble & Aarde 2014). Overall, reptiles comprise approximately 14% of vertebrate species within KNP (Parr, Woinarski & Pienaar 2009), and the conservation of these animals is essential for maintaining diversity within this important protected area (Gascon et al. 2015; Parr et al. 2009; Venter et al. 2008).
The KNP biological reference collection houses thousands of preserved specimens across a wide variety of taxa and includes hundreds of individual reptiles collected within the park over the past 80 years. The collection also includes an extensive electronic database that catalogues each specimen along with its respective biological and locality information where available. This collection places KNP among the best sampled protected areas in South Africa (and probably in Africa) for reptiles. However, the very nature of such reference collections is that sampling intensity and objectives vary over time, with earlier sampling efforts focused primarily on compiling inventory lists and collecting reference material. As such, the KNP biological reference collection database for reptiles was never intended as a systematic survey across all habitats, reflecting all patterns of occurrence within the park. In recent years, however, the need for spatially explicit species occurrence data sets to inform modern conservation tools has required that the data from biological reference collections be co-opted into conservation analyses. Accordingly, there is a need to critically evaluate such existing data sets to understand any inherent patterns of bias they possess.
In this study, we (1) collate and synthesise available occurrence data for reptile species in KNP from reference collections, museum databases and literature sources; (2) assess patterns of geographic and taxonomic biases within this data set; and (3) evaluate whether areas of spatial bias are environmentally representative of KNP as a whole, including under-sampled regions.

Reptile occurrence data
We collated reptile locality and occurrence data from literature sources, museum and reference collection databases, a virtual museum platform, citizen science sightings from social media platforms and field data gathered under various teaching, monitoring and inventorying exercises by the Organization for Tropical Studies (Table 1). Additionally, two of the authors (J.M.B. and B.M.) provided 151 novel records from personal observations in KNP (listed as 'this study'). In total, we collated 14 533 records, but after georeferencing these to match locality descriptions and removing duplicates across sources, we had a final data set of 7118 records representing 126 reptile species occurring in KNP. This data set is available upon request from SANParks Scientific Services.
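The removal of duplicates across sources can be illustrated with a minimal Python sketch. The text does not state how duplicates were identified, so the matching key used here (species name plus coordinates rounded to four decimal places) is purely an assumption, and the records are hypothetical placeholders rather than entries from the actual data set.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    species: str
    lat: float
    lon: float
    source: str


def deduplicate(records: Iterable[Record], decimals: int = 4) -> List[Record]:
    """Keep the first record seen for each (species, rounded lat, rounded lon) key.

    The key is an assumption made for illustration; the original study does
    not describe its duplicate-matching rule.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (
            rec.species.strip().lower(),
            round(rec.lat, decimals),
            round(rec.lon, decimals),
        )
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical records: the first two describe the same locality from two sources.
records = [
    Record("Species A", -24.99251, 31.59690, "museum"),
    Record("species a ", -24.99249, 31.59691, "literature"),
    Record("Species B", -23.94500, 31.16300, "citizen science"),
]
unique = deduplicate(records)
print(len(records), "->", len(unique))
```

A key of this kind would also merge genuine repeat observations of a species at the same locality, so a real workflow would probably need to include a date or collector field as well.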

Coverage biases
We summarised reptile species occurrence data to identify geographic and taxonomic biases in coverage across KNP. By carrying out regression analyses, we tested if reptile families were evenly represented across KNP by comparing the relationship between the number of occurrences for each reptile family to the extent of the geographical areas (in km²) surrounding those occurrences of each reptile family (i.e. the area of the minimum convex polygon enclosing all such occurrences).
To evaluate the proportion of KNP for which reptile occurrence data exist and quantify the extent of unsampled areas, we divided KNP into equal-sized grid cells at 1 km × 1 km, 2 km × 2 km, 4 km × 4 km and 9 km × 9 km (pentad scale) resolutions, respectively. These resolutions allowed us to identify patterns of geographic sampling bias across a range of biologically appropriate spatial resolutions. However, the 1 km × 1 km resolution was preferred for most analyses. This resolution subjectively offered the best trade-off between the spatial error associated with historical records of occurrence data (Newbold 2010) and the relatively small spatial scale at which many reptiles utilise landscapes (Fischer, Lindenmayer & Cowling 2004; Price, Kutt & McAlpine 2010). We plotted reptile occurrences across the grid cells of each spatial resolution by using Quantum Geographic Information System (QGIS) version 3.4 (QGIS Development Team 2018) and counted the number of occurrences per grid cell. By carrying out regression analyses, we also tested whether a relationship exists between the numbers of reptile occurrences recorded within each grid cell and the proximity of those grid cells to the nearest publicly accessible infrastructure areas of KNP (defined here as all camps, gates, picnic sites and public roads) at the finest spatial resolution (1 km × 1 km).
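The two spatial summaries described above, occurrences per grid cell and the area of a minimum convex polygon, can be sketched in plain Python. This is an illustration only and not the QGIS workflow used in the study. It assumes coordinates are already projected to metres, and the function names and example points are hypothetical.

```python
from collections import Counter
from math import floor


def cell_counts(points, cell_size_m=1000.0):
    """Count occurrences per square grid cell; points are (x, y) in metres."""
    return Counter(
        (floor(x / cell_size_m), floor(y / cell_size_m)) for x, y in points
    )


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area_km2(vertices):
    """Shoelace formula; vertices in metres, result in square kilometres."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0 / 1e6


# Hypothetical projected coordinates (metres), not real KNP records.
points = [(0, 0), (4000, 0), (4000, 3000), (0, 3000),
          (2000, 1500), (2100, 1600), (500, 500)]
counts = cell_counts(points)
mcp_area = polygon_area_km2(convex_hull(points))
print(len(counts), "occupied cells;", mcp_area, "km^2")
```

Changing `cell_size_m` to 2000, 4000 or 9000 would give the coarser resolutions. The share of occupied cells is then the number of keys in `counts` divided by the total number of cells covering the park.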

Are sampled areas representative of the Kruger National Park as a whole?
We downloaded environmental and infrastructural data layers at a spatial resolution of 1 km × 1 km to represent the overall environmental space of KNP. These included 20 bioclimatic layers representing current climate and elevation from the Worldclim database (http://www.worldclim.org), soil type classifications of South Africa from the International Soil Resource and Information Centre (ISRIC; https://www.isric.org), vegetation type classifications of South Africa from the South African National Biodiversity Institute (SANBI; www.bgis.sanbi.org) and infrastructural layers for publicly accessible camps, gates, picnic sites and roads within KNP from South African National Parks (SANParks; http://dataknp.sanparks.org/sanparks). We also generated 'slope', 'aspect' and 'distance to water bodies' layers for KNP by using ArcGIS version 10.4 (ESRI 2016), resulting in a total of 27 representative layers for the environmental space of KNP.
To reduce the effects of spatial autocorrelation between layers, we performed a principal component analysis by using R version 3.5.3 (R Core Team 2018) to summarise the layers into 27 new, uncorrelated principal component layers. We retained the first six principal component layers as representatives of the overall environmental variability of KNP as they cumulatively represented 85% of all modelled variation, which we selected as an effective stopping point as per Jackson (1993).
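The stopping rule described above, retaining components until a cumulative share of variance is reached, can be sketched as follows. The principal component analysis itself is not reproduced here. The function name is hypothetical and the eigenvalues are made-up placeholders, not values from the study.

```python
def components_to_retain(eigenvalues, threshold=0.85):
    """Return (k, cumulative share) for the fewest leading components whose
    cumulative proportion of variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return k, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues (component variances) for illustration only.
eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.4]
k, share = components_to_retain(eigenvalues, threshold=0.85)
print(k, round(share, 2))
```

With 27 input layers the eigenvalue list would have 27 entries. The study reports that six components were needed to reach the 85% threshold.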

Ethical considerations
This article followed all ethical standards for research without direct contact with human or animal subjects.

Summary of occurrence data
Our database contained 7118 reptile occurrence records, unevenly distributed across 60 lizard species, 59 snake species, 6 testudine species and 1 crocodilian species (Table 2). As such, the majority of occurrences were of squamates (lizards: 48% of all records; snakes: 41% of all records), with the less speciose testudine and crocodilian groups having less representation (8% and 3% of all records, respectively). This was not the case at the species level where the Nile crocodile (210 records) and the leopard tortoise (232 records) ranked only below the rainbow rock skink (242 records) for species with the highest numbers of occurrence records in our data set. The number of occurrence records per reptile family was positively related to the number of species per family (linear regression analysis: F(1, 17) = 28.45, p < 0.01, R² = 0.63). The uneven distribution of records across reptile families therefore likely reflects both collection bias and the particular combination of species occurring within KNP, rather than collection bias alone.

Coverage biases
Representation based on spatial coverage was unevenly distributed across reptile families in KNP. We found a significant positive relationship between the cumulative number of records and the cumulative extents of the areas encompassing records of each reptile species per reptile family (linear regression analysis: F(1, 17) = 76.60, p < 0.01, R² = 0.81). We identified reptile families that appeared to be significantly under-represented (Lacertidae, Leptotyphlopidae and Typhlopidae) and those that were over-represented (Crocodylidae and Scincidae) geographically across KNP (Figure 1), providing evidence of taxonomic sampling bias at the family level within our data set.
At the biologically appropriate spatial resolution of 1 km × 1 km, we found that only 1751 of 21 761 grid cells (8%) contained any reptile occurrence records at all (Figure 2). Moreover, 52% of these grid cells contained only a single record (911 grid cells; Figure 3). We found that as the numbers of records per grid cell increased, the numbers of grid cells containing records decreased (regression analysis: F(1, 114) = 9.34, p < 0.01, R² = 0.08). This pattern held true at resolutions of 2 km × 2 km and 4 km × 4 km, respectively, but was not present at 9 km × 9 km (Figure 3), demonstrating that geographic sampling bias is strongest at fine resolutions but weakens as resolution becomes coarser. We also found a significant relationship between the number of records present within each grid cell and its proximity to publicly accessible human infrastructure within the park (regression analyses: public roads: F(1, 1749) = 6.75, p < 0.01, R² = 0.06; camp sites and picnic spots: F(1, 1749) = 9.01, p < 0.01, R² = 0.10; Figure 4). As the distance to infrastructure increased, the frequency of recorded reptile occurrences per grid cell significantly decreased, providing evidence of sampling bias towards publicly accessible areas.
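The kind of regression summarised above can be illustrated with a minimal ordinary least squares sketch that returns the slope, R² and the F statistic with its degrees of freedom (1, n − 2). The article does not specify the exact model fitted, so a simple linear regression is assumed here. The data are hypothetical and the p-value is omitted because it requires the F distribution, which is not in the Python standard library.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit of y on x with R^2 and the F statistic."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_regression = slope * sxy          # variation explained by the line
    ss_residual = syy - ss_regression    # variation left unexplained
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": ss_regression / syy,
        # Undefined (division by zero) for a perfect fit with no residual.
        "f_statistic": ss_regression / (ss_residual / (n - 2)),
        "df": (1, n - 2),
    }


# Hypothetical data: distance to nearest public road (km) and records per cell.
distance_km = [0, 1, 2, 3, 4, 5]
records_per_cell = [9, 7, 6, 4, 3, 1]
fit = simple_linear_regression(distance_km, records_per_cell)
print(fit["slope"], fit["r_squared"], fit["df"])
```

The residual degrees of freedom equal the number of observations minus two, which matches the reported F(1, 1749) for the 1751 grid cells containing records.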

Environmental representation of sampled areas
At the spatial resolution of 1 km × 1 km, sampled areas of KNP were not representative of the full range of environmental space of KNP as a whole (Figure 5). The results of six separate Kolmogorov-Smirnov tests showed that there were significant differences in environmental variability between sampled and unsampled areas across each of the six principal components representing the overall environmental space of KNP (D = 0.06-0.27, p < 0.01 in all cases). Grid cells containing records of reptile occurrences were thus not statistically representative of the overall ecological variability of KNP.
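The D statistic reported above is the largest vertical gap between two empirical cumulative distribution functions. A minimal sketch of that calculation follows, using hypothetical principal component scores for sampled and unsampled cells. The p-value is omitted; a library routine such as `scipy.stats.ks_2samp` would provide it.

```python
from bisect import bisect_right


def ks_two_sample_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: maximum absolute difference between
    the empirical cumulative distribution functions of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    for value in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, value) / n
        cdf_b = bisect_right(b_sorted, value) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical scores on one principal component, for illustration only.
sampled_scores = [0.1, 0.4, 0.5, 0.9]
unsampled_scores = [0.2, 0.6, 0.7, 0.8, 1.2]
d_stat = ks_two_sample_statistic(sampled_scores, unsampled_scores)
print(round(d_stat, 2))
```

In the study this comparison was repeated once per retained principal component, giving six separate tests.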

Discussion
At a fine spatial resolution ecologically relevant to reptiles, occurrence data for reptile species in KNP are geographically and taxonomically biased. As a consequence of overall data deficiency, representation within our reptile occurrence data set varied, with highly detectable reptile species and families having significantly more records than those with comparatively lower detectability. Moreover, the majority of reptile occurrence data were associated with human infrastructure. Approximately 68% of all records occurred in close proximity (< 2 km) to publicly accessible human infrastructure areas in KNP. Unsurprisingly, grid cells associated with major tourist camps and surrounding areas were considerably better sampled than the remainder of the park. Importantly, sampled areas were not representative of the complete range of environmental variability across KNP. This suggests that regions of the park that comprised unique environmental space are not represented in the current data set.
Spatial sampling biases associated with human infrastructure are common in biological sampling data sets (Newbold 2010). Most notably, presence-only data sets derived from atlas projects, citizen science data and museum records are typically susceptible to geographic bias in collection effort (Botts et al. 2011; Geldmann et al. 2016; Reddy & Davalos 2003; Zielinski 2001). Geographic bias in collection effort is often present within these data sets as a result of sampling being inhibited in certain areas but facilitated in others (Botts et al. 2011; Pyke & Ehrlich 2010). For example, some areas may be difficult to sample because of extreme weather conditions, rough terrain, the presence of dangerous animals, distance from roads or restricted access (Bird et al. 2014; Freitag et al. 1998). Conversely, other areas facilitate more complete sampling by providing ease of access and associated increased visitation. In this context, sampling intensity within certain areas is likely to be dramatically lower in comparison with that of less restrictive areas that offer greater accessibility. This is certainly the case in KNP where publicly accessible infrastructural areas have increased human presence and accessibility from staff and visitors alike in comparison with the remainder of the park. Consequently, our data set seemingly represents areas of high sampling intensity rather than true biological patterns.
Areas of high sampling intensity seldom represent the full range of environmental space and ecological factors associated with determining species distributions (Tolley et al. 2016). Several studies have found substantial differences in climate between well- and under-sampled areas (see Botts et al. 2011; Kadmon et al. 2004; Martinez & Wool 2006; Reddy & Davalos 2003; Stockwell & Peters 1999), with many of these highlighting the significance of climatic biases towards assessments of the true biological distributions of species. Botts et al. (2011) found that sparse sampling effort in areas away from human infrastructure resulted in incomplete representations of amphibian distributions across South Africa. Here, where we have encountered similar geographical sampling biases with our reptile data set to those encountered by Botts et al. (2011), we similarly conclude that our data set is unlikely to reflect the complete range of real-world distributions of reptile species across KNP.
The biased nature of our data set has important implications for SANParks' management of reptiles in KNP. Despite being perhaps the most comprehensive collation of reptile occurrence data for KNP to date, the use of this data set for informing robust conservation management decisions would need to be considered with caution. Because of the large variation in geographical sampling intensity within our data set and the associated biases within the underlying occurrence data, it would be inappropriate to use this data set in its current form within the context of spatial planning for species conservation management. Because a large proportion of our data was not collected explicitly for the purposes of mapping species distributions, the biological patterns as presented within our data set may confound comparisons among species, or comparisons of a particular species' abundance in different habitat types, across environmental gradients, or across time series (Bird et al. 2014; Fischer et al. 2004). However, should the geographic biases be minimised or reversed without concurrently increasing other forms of bias (Botts et al. 2011), this data set could become an important resource for KNP conservation management.
Minimising the biases within our data set could be achieved by targeted sampling of data-deficient areas in KNP. Although the majority of 1 km × 1 km grid cells in KNP could benefit from supplemental sampling, priority should be given to grid cells that contain no data and are distant from publicly accessible areas that are regularly subject to human visitation.
In particular, we recommend that the mopane woodlands-dominated north-western region of the park (i.e. areas demarcated as ecozones P and P1 as per SANParks 2016) should be targeted for additional sampling. The majority of this region is poorly sampled, with most of its grid cells containing no reptile records. Moreover, this region has few public roads and is largely lacking in human visitation. Here, we recommend that an approach emulating the Karoo Biogaps project led by SANBI (Main et al. 2019) should be implemented, in which specific grid cells are selected as sites in which to extensively collect occurrence records on the basis of a statistical sampling design. This method of grid cell selection would involve the use of a statistical algorithm (such as Latin hypercube sampling) that seeks to maximise coverage across the region while minimising the total number of grid cells to be sampled. Traversal to grid cells targeted for sampling could be facilitated through the use of management roads unavailable to the general public. Together with public tourist roads, this offers the widest range of vehicle access across KNP. Although these roads do not cover the full extent of KNP, their usage can alleviate some of the challenges associated with inaccessibility for many grid cells and offers a feasible option towards future sampling campaigns.
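The idea behind Latin hypercube sampling for grid cell selection can be sketched as follows. This is not the Karoo BioGaps design nor a conditioned Latin hypercube. It is a simplified illustration that draws stratified target points in environmental space and then picks the nearest unselected candidate cell for each. All names, the two-variable environmental space and the candidate cells are hypothetical.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Draw n_samples points in the unit hypercube so that, in every
    dimension, each of the n_samples equal-width strata is used exactly once."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_samples for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def select_cells(candidates, n_samples, rng):
    """Pick n_samples distinct cells whose environmental values (scaled to
    [0, 1]) lie closest to the Latin hypercube target points."""
    n_dims = len(next(iter(candidates.values())))
    remaining = dict(candidates)
    chosen = []
    for target in latin_hypercube(n_samples, n_dims, rng):
        best = min(
            remaining,
            key=lambda c: sum((u - v) ** 2 for u, v in zip(remaining[c], target)),
        )
        chosen.append(best)
        del remaining[best]
    return chosen


# Hypothetical unsampled cells with two scaled environmental variables each.
rng = random.Random(42)
candidates = {f"cell_{i}": (rng.random(), rng.random()) for i in range(50)}
chosen = select_cells(candidates, n_samples=5, rng=rng)
print(chosen)
```

Restricting `candidates` to cells reachable from management roads would be one way to combine this design with the access constraints mentioned above.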
Systematically sampling for reptile occurrences in targeted and supplementary areas within KNP will substantially improve upon the overall coverage and comprehensiveness of available data. It is important to note, however, that the challenges associated with reptile sampling, such as low detectability (McGrath et al. 2015), may result in underestimations of species richness and occurrences at specific sites. False absences arising from such underestimations can misinform assessments of species' performance within monitoring frameworks, including those relating to thresholds of potential concern, and may result in incorrect assignments of conservation priority (Botts et al. 2011; Ferreira et al. 2011).
Compiling complete inventories for targeted grid cells will thus be vitally important, but this may require several sampling trips to ensure that comprehensive species lists are compiled. Such an approach would be unavoidably costly, time-consuming and could delay investigations of the statuses of reptile species within the park.
In the meantime, other options are available to fill gaps in sampling within KNP reptile data. Over the last two decades, an increasing number of studies have used species distribution models (SDMs) to extrapolate spatially explicit predictions of the distributional ranges of species (Bird et al. 2014;Stockwell & Peters 1999). Species distribution models predict environmental suitability for species, which can be used to infer species' presence or absence within a given area (Guisan et al. 2013;Hurlbert & Jetz 2007;Kery 2011). These types of models are typically referred to as ecological niche models as the predictions produced are based on statistical relationships between species occurrences and environmental descriptor variables (Guisan et al. 2013;Kadmon et al. 2004). Importantly, studies have found that SDMs based on biased data with limited occurrences can produce strong models with accurate predictions (e.g. Pearson et al. 2007;Syfert et al. 2013), if the underlying biases are corrected for during model production and high-quality predictor variables are available.
Studies that aim to identify sources of data bias, particularly within presence-only data sets, offer invaluable insights into bias correction within the context of SDMs (Syfert et al. 2013). By understanding the sources of bias, it may be possible to correct historical and current population distributions modelled through SDM frameworks using mathematically inferred or experimentally determined bias correction factors. A good example of this is the use of visibility bias correction factors on aerial census data to improve the accuracy of geographical distributions and population size estimates of large herbivores in KNP (see Redfern et al. 2002). Potential SDM frameworks for KNP reptile species should seek to correct for the proximity of reptile occurrences to publicly accessible areas within the park. Overcoming the challenges associated with this bias may require innovative solutions; however, if implemented correctly, such a framework could offer a feasible approach towards obtaining meaningful reptile distribution information for use within conservation management in KNP.

Conclusion
We sought to collate occurrence records for KNP reptile species and provide a quantitative assessment of the sampling biases within these data. We have shown that at biologically relevant resolutions, KNP is largely data deficient for reptile occurrences, with existing data being geographically biased towards publicly accessible areas. We further show that sampled areas were not environmentally representative of KNP as a whole and from this, we conclude that our data set does not provide a true reflection of real-world reptile species distributions across KNP. Because the majority of the data within our database were not explicitly collected with species mapping in mind, additional sampling is needed to reverse the biases present. We recommend that future sampling efforts should target historically poorly sampled regions in the park that are distant from publicly accessible locations. Finally, we suggest that in the meantime SDMs may offer a more feasible approach for use within conservation management decision-making relating to reptile species within KNP.