PLoS ONE
Public Library of Science
SimSurvey: An R package for comparing the design and analysis of surveys by simulating spatially-correlated populations
Volume: 15, Issue: 5
DOI 10.1371/journal.pone.0232822
•
•
•
• Altmetric

### Notes

Abstract

Populations often show complex spatial and temporal dynamics, creating challenges in designing and implementing effective surveys. Inappropriate sampling designs can potentially lead to both under-sampling (reducing precision) and over-sampling (through the extensive and potentially expensive sampling of correlated metrics). These issues can be difficult to identify and avoid in sample surveys of fish populations as they tend to be costly and comprised of multiple levels of sampling. Population estimates are therefore affected by each level of sampling as well as the pathway taken to analyze such data. Though simulations are a useful tool for exploring the efficacy of specific sampling strategies and statistical methods, there are a limited number of tools that facilitate the simulation testing of a range of sampling and analytical pathways for multi-stage survey data. Here we introduce the R package SimSurvey, which has been designed to simplify the process of simulating surveys of age-structured and spatially-distributed populations. The package allows the user to simulate age-structured populations that vary in space and time and explore the efficacy of a range of built-in or user-defined sampling protocols to reproduce the population parameters of the known population. SimSurvey also includes a function for estimating the stratified mean and variance of the population from the simulated survey data. We demonstrate the use of this package using a case study and show that it can reveal unexpected sources of bias and be used to explore design-based solutions to such problems. In summary, SimSurvey can serve as a convenient, accessible and flexible platform for simulating a wide range of sampling strategies for fish stocks and other populations that show complex structuring. Various statistical approaches can then be applied to the results to test the efficacy of different analytical approaches.

Regular, Robertson, Lewis, Babyn, Healey, Mowbray, and Duplisea: SimSurvey: An R package for comparing the design and analysis of surveys by simulating spatially-correlated populations

## Introduction

Good survey design is a critical prerequisite for obtaining accurate and precise estimates of a population of interest. The need for sound sampling techniques is as relevant today as it was in the 1930’s when much of the theory and principles behind the collection of scientific data were developed by statisticians such as R.A. Fisher and Jerzy Neyman [1]. Then and now, we are experiencing a period of increasing human activity and robust methods are required to detect the impacts of our actions on the natural world [2]. Survey quality is especially important for conservation-oriented science as monitoring data are used to identify species at risk of extinction [3], to assess the efficacy of recovery plans [4], and to determine precautionary levels of exploitation [5]. Surveys with a solid basis in sampling theory have therefore become a core component of many ecological monitoring programs throughout the world.

One of the biggest challenges associated with surveys is the cost of planning and implementing effective and efficient sampling designs [6]. Despite their costs, surveys have become a mainstay in the management of dynamic fish stocks as survey data are not easily replaced by other sources of information [7]. For instance, large population declines can be overlooked by relying solely on catch and effort data from commercial fisheries [8]. Such difficulties have contributed to the growing reliance on fisheries-independent surveys to monitor fish populations. Planning or improving such surveys is an important but non-trivial task because of the inherent complexity of both the survey and target population. Specifically, fisheries-independent surveys usually involve multiple levels of sampling and fish populations often show size-structured spatial and temporal dynamics [9]. These features are not easily captured by simple statistical models, which makes it difficult to determine optimal sampling and sub-sampling schemes [1012]. In such situations it is common for practitioners to resort to simulations to test various sampling and analytical pathways (e.g. [13]). However, such tests are rare in fisheries research (but see [14,15]), perhaps because of the complexities associated with fisheries-independent surveys.

Here we document SimSurvey, an R package designed to facilitate realistic simulations of fisheries-independent trawl surveys. In short, the package allows for the simulation of random or stratified-random surveys of an age-structured population that varies in space and time. The package has two main components: the first focuses on mimicking realistic fish stocks by simulating a spatially and age-correlated population distributed across a habitat gradient, and the second component focuses on simulating various surveys of these virtual fish stocks. Although constructed to assess the design of fisheries-independent trawl surveys, this package can be used to model any population that shows complex spatial and age structure.

This simulation framework has similarities to those presented by Schnute and Haigh [14] and Puerta et al. [15], however, we focused our efforts on developing a series of general and accessible functions to simplify the process of testing multiple sampling scenarios and analytical pathways. The steps taken to simulate surveys of spatial, age-structure populations are outlined below. We first outline the equations underlying the package (Model structure section) and then we demonstrate how to use its core functions (Core functions section). The core functions of the package are largely demonstrated using default settings, and these settings are based on a case study (see S1 Appendix for details on the case study, and see S2 Appendix for guidance on how to modify default settings to suit specific needs). Several of the results from the case study are described and discussed in the Interpretation section as they highlight one use case of the package. Finally, we discuss the broader research opportunities and future directions of the package (Discussion section).

## Model structure

In this section, we describe the framework currently implemented in the SimSurvey package. With this framework, we tried to strike a balance between realism, simplicity, generality and computational feasibility. The framework follows four general steps: 1) simulate a spatially-aggregated age-structured population; 2) distribute the population throughout a spatial grid, imposing correlation across space, time and age; 3) sample the population using random sampling; and, 4) obtain population estimates using a design-based analysis. Though there is a degree of flexibility in each of these steps, users can circumvent specific components by applying user defined equations, inputs and/or analyses. Details on how to use the package and, if desired, circumvent some aspects of its structure are outlined in the Core functions section.

### Simulate abundance

The simulation starts with an exponential decay cohort model where the abundance at age a in year y (Na,y) equals the abundance of that cohort in the previous year multiplied by the associated survival rate, which is expressed in terms of total mortality (Za,y):

${N}_{a,y}={N}_{a-1,y-1}{e}^{-{Z}_{a-1,y-1}}$

Here, numbers at age in the first year are filled via exponential decay, ${N}_{a,1}={N}_{a-1,1}{e}^{-{Z}_{a-1,1}}$, numbers at age 1 (i.e. recruits) vary around a baseline value, log(N1,y) = log(μr) + ϵy, and total mortality is set to a baseline level plus process error, log(Za,y) = log(μZ) + δa,y. The error around the recruitment process was set to follow a random walk, ${ϵ}_{y}\sim N\left({ϵ}_{y-1},{\sigma }_{r}^{2}\right)$, and the process error was simulated using the covariance structure described in Cadigan [16], δa,yN(0,Σa,y). The covariance across ages and years is controlled by a process error variance parameter (${\sigma }_{\delta }^{2}$) along with age and year correlation parameters (φδ,age and φδ,year, respectively). This structure allows for autocorrelation in process errors across ages and years (i.e. total mortality can be made to be more similar for fish that are closer together in age and/or time). Note that a plus group is not modeled as the number of ages can easily be extended to include groups with zero fish.

In practice, abundance at age is often inferred from length data as it is easier to collect. Abundance at length is therefore simulated from abundance at age using the original von Bertalanffy growth curve [17]:

$log\left(L\right)=log\left({L}_{\mathrm{\infty }}-\left({L}_{\mathrm{\infty }}-{L}_{0}\right){e}^{-Ka}\right)+\epsilon$
Where L is the mean asymptotic length, L0 is length at birth, K is the growth rate parameter and the error is assumed to follow the normal distribution, $\epsilon \sim N\left(0,{\sigma }_{L}^{2}\right)$. Numbers at age are distributed across discrete length groups following a lognormal distribution by calculating the probability of being in a specific length group given age, ϕa,l. These probabilities are calculated using the standard normal cumulative density function Φ for a sequence of length groups l from length 0 to 10 times the maximum predicted length L at an interval of ${l}_{group}^{N}$:
${\varphi }_{a,l}=\Phi \left(\frac{log\left({L}_{l}\right)}{{\sigma }_{L}}\right)-\Phi \left(\frac{log\left({L}_{l-1}\right)}{{\sigma }_{L}}\right)$

Overall, this formulation facilitates the simulation of a dynamic length and age structured population. Though some typical relationships have yet to be implemented (e.g. stock-recruitment), sufficient information can be simulated to assess survey performance across a range of abundance levels across years, lengths and ages.

### Simulate spatial distribution

Rather than developing a full spatially-explicit model, population and spatial dynamics are modeled as independent processes for simplicity. The complexities of spatial population dynamics—such as larval dispersal, spatial differences in growth and population connectivity—are not explicitly accounted for and, as such, the model is a necessary simplification of reality. Despite this limitation, the approach taken facilitates the simulation of spatial, age-structured populations with sufficient complexity for testing the efficacy of various survey designs. The simplicity also limits the number of unknown parameters that need to be specified to simulate a population. Parameter estimates from spatially-aggregated age-structured models, which are commonly used in stock assessments, can therefore be used to simulate a population using the cohort model and the resultant abundance at age values can be distributed across a spatial grid. Here, a grid of s cells is generated where each cell has an area of S and depth d; depth is defined using a sigmoid curve, applied across one spatial axis, with a depth range of [dmin, dmax], shelf depth of dshelf and a shelf width of wshelf. We use depth as our main stratification variable, but note that any other appropriate stratification variables could be used. The grid can be divided into two hierarchical levels, management divisions and habitat-based survey strata. For demonstration purposes, we envision these levels as part of a stratified-random survey within international fishery divisions, i.e., Hstrat depth-based strata within Hdiv divisions (e.g. NAFO or ICES divisions, or any other geographically bounded area). The simulated population is distributed through the grid by simulating spatial-temporal noise controlled by a parabolic relationship with depth and covariance between ages, years and space. This noise term, ηa,y,s, is scaled to sum to 1 to ensure that the total population of each age for each year through the grid equals the number simulated by the cohort model:

$\begin{array}{cc}\hfill {N}_{a,y,s}& ={N}_{a,y}\frac{{\eta }_{a,y,s}}{\sum _{s=1}^{{N}_{s}}{\eta }_{a,y,s}}\hfill \\ \hfill {\eta }_{a,y,s}& =\frac{\left({d}_{s}-{\mu }_{d}{\right)}^{2}}{2{\sigma }_{d}^{2}}+{\xi }_{a,y,s}\hfill \end{array}$
Where ds is the depth in a specific cell of the grid, μd is the mean depth where abundance is typically highest and σd controls the width or dispersion of abundance around the mean depth. Residual noise ξa,y,s is added to this depth relationship using a combination of Matérn covariance, to control the level of spatial aggregation within ages and years, and a two dimension AR1 age-year covariance described in Cadigan [16], to control the level of similarity in distributions across ages and years. The rate at which point-to-point spatial correlation decays with distance is controlled by a smoothing (λ) and a scaling parameter (κ) (here κ is approximated from range parameter r, $\kappa =\sqrt{8\lambda }/r$; [18]) and correlation across ages and years is controlled by φξ,age and φξ,year, respectively. The overall variance of the spatial process is controlled by ${\sigma }_{\xi }^{2}$ (see S3 Appendix for a more detailed description of the space-age-year covariance structure). In short, this formulation allows control of depth preferences, the level of spatial aggregation and the degree of age and year specific clustering.

### Simulate survey

The final step in the simulation is to sample the simulated population over the age-year-space array generated. The sampling is random or stratified random, emulating surveys conducted by many research institutions around the world. The area of each strata Astrat is calculated and this is used to define the number of sampling stations Hsets, hereafter referred to as sets, allocated to one or more strata under a particular set density, Dsets (i.e. Hsets = AstratDsets). The allocated number of cells are randomly selected in each strata and the number of fish caught in each set is calculated by applying binomial sampling of the fish in each sampled cell by the proportion of the area covered by the trawl and the catchability of each age:

${n}_{a,y,s}\sim Bin\left({N}_{a,y,s},\frac{{A}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{w}\mathrm{l}}}{{A}_{\mathrm{c}\mathrm{e}\mathrm{l}\mathrm{l}}}{q}_{a}\right)$
Where na,y,s is the number of fish of age a in year y sampled by a set at location s, Atrawl indicates the area covered by the trawl, Acell is the area of a grid cell, and qa is the catchability coefficient of each age (i.e. the ability of the trawling gear to catch specific age groups). Here, catchabilities were defined using a logistic curve controlled by a steepness, k, and midpoint parameter, x0. In cases where there are multiple sets in one cell, the population in that cell is divided across the sets. While this means that numbers caught in an isolated simulation cannot exceed the numbers in the population, keep in mind that the survey, no matter how intense, is assumed to have no impact on the population from one year to the next.

Once catches are simulated, lengths of the fish sampled by each set are simulated using the von Bertalanffy growth equation found above in the Simulate abundance section. Sub-sampling is then conducted whereby a subset of fish are sampled for length measurements and a subset of this subset are sampled for age determination. Specifically, a maximum number of lengths are measured per set (Mlengths) and a maximum number of ages (Mages) are sampled per length group (lgroup) per division, strata or set (sgroup). Such sub-sampling is common in fisheries-independent surveys as it is costly, impractical and unnecessary to sample every fish captured. Age determination is especially time-consuming, which is why otoliths for age-determination tend to be sub-sampled by length-bin to obtain a representative age sample across a wider range of lengths than would be obtained via random sampling.

### Stratified analysis

While there are many model-based options for obtaining an abundance index from survey data (e.g. [19]), design-based approaches, such as stratified analyses, are often used. Here we apply formulae presented in Smith and Somerton [20] (equations are replicated in S4 Appendix) to calculate year y and simulation j specific stratified estimates of total abundance (${\stackrel{^}{I}}_{y,j}$), abundance at length (${\stackrel{^}{I}}_{l,y,j}$) and abundance at age (${\stackrel{^}{I}}_{a,y,j}$). Note that estimates of total abundance are based on total numbers caught at each set, ni, while abundance at length requires the sub-sampled length frequencies at each set, ml,i, to be scaled up using set-specific ratios of the number of fish measured, mi, to numbers caught, ni:

${n}_{l,i}={m}_{l,i}×\frac{{n}_{i}}{{m}_{i}}$

Likewise, age frequencies need to be calculated to obtain stratified estimates of abundance at age. This is done by constructing an age-length key, which is the proportion of fish in each length bin that fall into specific age classes. Once these proportions are calculated, they are applied to the bumped up length frequencies, nl,i, to approximate age frequencies, na,i, that is:

${n}_{a,i}=\sum _{l=1}^{L}{n}_{l,i}×\frac{{m}_{a,l,g}^{\mathrm{\text{'}}}}{{m}_{l,g}^{\mathrm{\text{'}}}}$

Here, the prime symbol ′ indicates that these values are tertiary sampling units as they represent fish sub-sampled for age determination from those sub-sampled for length measurements. Within this sub-sample, ${m}_{l,g}^{\text{'}}$ represents the number of fish within length bin l and from area g, and ${m}_{a,l,g}^{\text{'}}$ represents the number of fish within age-class a, length bin l and area g. The area g can be division, strata or set specific. Though age-length keys are typically constructed and applied over large spatial scales (e.g. division), Aanes and Vølstad [21] recommends calculating proportions at finer scales to better account for hierarchical sample designs.

We used root-mean-squared error (RMSE) as a measure of the precision and bias of the abundance at age estimates from each survey:

$RMSE=\sqrt{\frac{\sum _{a=1}^{A}\sum _{y=1}^{Y}\sum _{j=1}^{J}\left({\stackrel{^}{I}}_{a,y,j}-{I}_{a,y}{\right)}^{2}}{A×Y×J}}$
Where A, Y, and J are the number of ages, years and simulations, respectively, and Ia,y is the true abundance available to the survey (i.e. catchability corrected abundance; Ia,y = qaNa,y). RMSE was also calculated for abundance at length estimates, where the above formula is indexed by length groups l, and total abundance, which lacks a group index of a or l.

## Core functions

The SimSurvey package was written in the programming language R [22] and it holds a series of functions for 1) simulating the abundance and distribution of virtual fish populations with correlation across space, time and age (sim_abundance, sim_distribution), 2) simulating surveys with a range of sampling strategies and intensities (sim_survey), and 3) estimating the stratified mean and variance of simulated survey data (run_strat; Table 1). SimSurvey relies heavily on functions from the data.table [23], raster [24] and plotly [25] packages for their efficient data processing, geographic and plotting facilities, respectively. Package documentation has been published online using pkgdown (https://paulregular.github.io/SimSurvey/) and all the source R code behind SimSurvey is available on GitHub (https://github.com/PaulRegular/SimSurvey). SimSurvey can be installed via GitHub using the remotes package:

Table 1
Functions in bold font are core functions and those in medium font are designed for use inside the core functions. The latter are typically closures, which are functions that contain data and return functions [26]; here they are used to store parameter values and return functions that require dimensions, such as ages or years, to be supplied.
Names and descriptions of the key functions of SimSurvey.
FunctionDescription
sim_abundanceSimulate a basic age-structured population dynamics model
sim_R, sim_Z, sim_N0, sim_vonBClosures, to use inside sim_abundance, for simulating recruitment, total mortality, initial abundance and growth, respectively
sim_distributionSimulate spatial and temporal distribution of an age-structured population
sim_ays_covar, sim_parabolaClosures, to use inside sim_distribution, for simulating age-year-space covariance and parabolic relationships with covariates (e.g. depth), respectively
make_gridMake a basic depth stratified square grid to use inside sim_distribution
sim_surveySimulate a survey of a spatial, age-structured population
sim_logisticClosure, to use inside sim_survey, for simulating age-specific catchability as a logistic curve
run_stratRun a stratified analysis on simulated survey data
strat_errorCalculate the error of stratified estimates (e.g. root mean squared error of stratified estimates from true values)
test_surveysTest the sampling design of multiple surveys using a stratified analysis (internally loops over sim_survey, run_strat and strat_error)
expand_surveysCreate a data frame, for use in test_surveys, with all combinations of supplied survey settings

The equations behind the functions listed in Table 1 are detailed in the Model structure section. Note that several of the core equations are implemented using “closures”, which are functions that contain data and return functions [26]. For example, sim_R returns a function that holds the supplied parameter values and requires a sequence of years to be supplied.

Here, the R_vec object holds 100 years of simulated recruitment values and each run of the R_fun function will result in different simulated values using the internal formulation and the parameters supplied to sim_R. The other closures included in the package operate in a similar way in that parameter inputs are supplied to the closure and the functions returned by the closure requires inputs such as ages and/or years. This was done to avoid the repeated specifications of key arguments, such as ages and years. Moreover, this approach provides an option for advanced R users to inspect and modify the closures implemented in the package to supply custom closures with alternate equations (see https://paulregular.github.io/SimSurvey/articles/custom_closures.html for a short vignette on creating custom closures). Also note that each of the closures implemented in the package includes a plot argument such that quick visuals can be obtained using a line of code like this: sim_R(log_mean = log(500), log_sd = 0.5, plot = TRUE)(years = 1:100)).

## sim_abundance

Abundance at age and length is simulated using the sim_abundance function and a default function call is described in Table 2 along with associated symbols from the equations outlined in the Simulate abundance section. This function has a simple structure and requires the specification of a series of ages and years along with a series of closures, such as sim_R, sim_Z and sim_vonB, for simulating recruitment (R), total mortality (Z) and growth (growth), respectively. Overall, the function provides a simple tool for simulating a range of dynamic age-structured populations. For instance, below we provide examples where we simulate a relatively long and short lived species (note that default variance, starting abundance and growth settings were used in both simulations).

Table 2
Default sim_abundance function call, with descriptions, default values and associated parameter symbols of key arguments.
Function callDescriptionSymbol
sim_abundance(
ages = 1:20,Agesa
years = 1:20,Yearsy
R = sim_R(log_mean = log(30000000),Mean log recruitment1μr
log_sd = 0.5),Standard deviation of log recruitmentσr
Z = sim_Z(log_mean = log(0.5),Mean log total mortality2μZ
log_sd = 0.2,Standard deviation of log total mortalityσZ
phi_age = 0.9,Correlation across ages in error around total mortalityφδ,age
phi_year = 0.5),Correlation across years in error around total mortalityφδ,year
growth = sim_vonB(Linf = 120,Mean asymptotic length (cm)L
L0 = 5,Length in birth year (cm)L0
K = 0.1,Growth rateK
log_sd = 0.1,Standard deviation of the von Bertalanffy growth curveσL
length_group = 3))Length group bin size for abundance at length (cm)
1 Can be a vector of means with a length equal to the number of years in the simulation.
2 Can be a matrix of means with number of rows and columns equaling the number of ages and years in the simulation, respectively.

The sim_abundance function returns a list with the sequence of ages (ages), sequence of years (years), sequence of lengths (lengths), numbers of recruits across all years (R), numbers at age in the first year (N0), total mortality matrix (Z), abundance at age matrix (N), abundance at length matrix (N_at_length) and the function supplied to the growth argument (sim_length). The growth function is retained for later use in sim_survey to simulate lengths given simulated catch at age in a simulated survey.

The package also includes several plotting functions for making quick plotly-based [25] interactive visuals of the simulated population. For instance, the plot_surface function can be used to make quick visuals of matrices contained within the list returned by sim_abundance. As an example, we display the abundance at age matrix (object named N in the list produced by sim_abundance; Fig 1); other names can be supplied to the mat argument to visualize a different matrix from the sim_abundance list, such as Z.

Fig 1
These plots were produced by plot_surface when supplied a list produced by sim_abundance.Surface plots of simulated abundance at age of a relatively a) long lived and b) short lived species.

## sim_distribution

The equations outlined in the Simulate spatial distribution section are used in the make_grid, sim_ays_covar and sim_parabola functions, and these functions are used within sim_distribution to distribute a population simulated using sim_abundance throughout a grid (Table 3). The output from make_grid is a raster object [24] with four layers: depth, cell, division and strat. If a more detailed and realistic grid is required, users can manually generate their own survey grid using real data and this grid can be supplied as a raster to sim_distribution if the same structure and correct projection is used. The package includes a manually constructed survey grid of NAFO Subdivision 3Ps off the southern coast of Newfoundland (named survey_grid) and the data-raw folder in the GitHub directory includes the data and code used to construct this grid. However, for simplicity, we use make_grid to construct a square grid for a default run of sim_distribution. Below we generate and plot (Fig 2) a default grid, another grid with the number of strata increased by increasing the number of strat_splits, and another with the number of divisions increased using n_div and a linear depth gradient (the sigmoid curve is forced to be linear when shelf_width is set to zero).

Fig 2
In these plots, the yellow to purple color gradient represents depth, the thick grey lines delineate divisions and thin white lines delineate strata.Plots produced by plot_grid when supplied a raster object produced by make_grid a) using default settings, b) settings that increase the number of times depth strata are split (strat_splits = 3), and c) settings that produce a more linear depth gradient (shelf_width = 0) and increase the number of divisions (n_div = 4).
Table 3
Default sim_distribution function call, with descriptions and associated parameter symbols of key arguments.
Function callDescriptionSymbol
sim_distribution(
sim,Simulated population from sim_abundance
grid = make_grid(x_range = c(-140, 140),Range of grid in the x dimension
y_range = c(-140, 140),Range of grid in the y dimension
res = c(3.5, 3.5),Grid resolution in x and y dimensions (km)—i.e. cell areaAcell
shelf_depth = 200,Shelf depth (m)dshelf
shelf_width = 100,Shelf width (km)wshelf
depth_range = c(0, 1000),Depth range from coast to slope (m)[dmin, dmax]
n_div = 1,Number of divisionsHdiv
strat_splits = 2,Number of strata within each depth class
strat_breaks = seq(0, 1000, 40)),Series of depth breaks for defining strata
ays_covar = sim_ays_covar(sd = 2.8,Standard deviation of age-year-space distributionσξ
range = 300,Range of spatial correlation (km)r
lambda = 1,Smoothness of spatial correlationλ
phi_age = 0.5,Correlation across ages in spatial distributionφξ,age
phi_year = 0.9,Correlation across years in spatial distributionφξ,year
group_ages = 5:20,Make space-age-year noise equal across these ages1
group_years = NULL),Make space-age-year noise equal across these years1
depth_par = sim_parabola(mu = 200,Depth at which abundance is typically highest (m)μd
sigma = 70))Dispersion around depth of peak abundance (m)σd
1 All ages or years are independent if argument values is NULL.

In addition to supplying objects produced by sim_abundance and make_grid, the sim_distribution function requires two closures that describe the age-year-space covariance and the relationship with depth. Here we use sim_ays_covar and sim_parabola to control these relationships and a wide range of age and year specific distributions can be obtained by adjusting a few parameters in these closures (Fig 3). While here we only describe one option for simulating a spatial noise (see S3 Appendix for details), custom closures can be used that leverage simulation models provided by packages such as RandomFields [27] or INLA [28] (see the code behind sim_ays_covar_sped [https://github.com/PaulRegular/SimSurvey/blob/master/R/sim_dist_spde.R] for an example of how the sim_ays_covar closure was modified to apply a Stochastic Partial Differential Equation approach using the INLA package). Below we run a default sim_distribution call, which generates a population that forms tight clusters that are more strongly correlated across years than ages, and another call that generates a population that is more diffuse (i.e. wider range) and exhibits stronger correlation across ages than years (i.e. lower phi_year and higher phi_age). Distributions can also be forced to be the same across ages and years by using the group_ages and group_years arguments, respectively, in the sim_ays_covar closure. Variance in the size of the clusters can also be modified by changing the sd argument in the sim_ays_covar function. In other words, these parameters can be modified to control the degree of age-specific clustering and inter-annual site-fidelity exhibited by the simulated population. Note that the resolution of the default grid is high and, as such, the simulations below may take minutes to complete. Also note that the key functions in the SimSurvey package have been set-up to be pipe (%>%; [29]) friendly such that values from one function call are forwarded to the next function (i.e. output from the two calls below are functionally the same though the approach is slightly different).

Fig 3
This plot is a facet of plots produced by plot_distribution when supplied simulations from sim_distribution.Distribution plots of simulated populations that form a) tight clusters with stronger correlation through years than ages (default settings), and b) relatively diffuse clusters with stronger correlation through ages than years.

The sim_distribution function retains all the data simulated by sim_abundance and adds a data.table [23], named sp_N, with abundance (N) split by age, year and cell. The function also retains the grid object and converts these data into a data.table, named grid_xy, with headers x, y, depth, cell, division and strat. The sp_N object can be merged with the grid_xy data by cell to associate abundance with specific locations, depth, divisions or strata. The plot_distribution function can be used to provide a quick visual of the distribution across ages and years. The code below will generate interactive plots with an Age-Year slider, however, for this paper we present a facet plot of the simulated data (Fig 3).

## sim_survey

The function sim_survey can be used to simulate data from one survey over a population created using sim_distribution. A default function call is described in Table 4. The sim_survey function simulates the sampling process of the survey and, as such, requires a closure for defining catchability as a function of age and survey protocol settings. Specifically, the q argument requires a closure, such as sim_logistic, for defining the probability of catching specific age groups, trawl dimensions are defined in the trawl_dim argument, and set, length and age sampling effort are defined using the set_den, lengths_cap and ages_cap arguments, respectively. Also note that the min_sets argument imposes a minimum of number of sets to conduct per strata, regardless of its allocation given strata area and set density. This argument imposes a useful constraint for generating data to be analyzed using design-based approaches that require more than one value for the calculation of a mean.

Table 4
Default sim_survey function call, with descriptions and associated parameter symbols of key arguments.
Function callDescriptionSymbol
sim_survey(
sim,Simulated spatial population from sim_distribution
n_sims = 1Number of times to repeat the survey
q = sim_logistic(k = 2,Steepness of logistic curve of catchabilityk
x0 = 3),Midpoint of logistic curve of catchability (age)x0
trawl_dim = c(1.5, 0.02),Trawl dimensions (distance towed, trawl width; km)—i.e. area trawledAtrawl
min_sets = 2Minimum number of sets to conduct per strata
set_den = 2/1000,Set density (km-2)Dsets
lengths_cap = 500,Maximum number of lengths to collect / setMlengths
ages_cap = 10,Maximum number of ages to sample / length group / spatial group 1Mages
age_sampling = "stratified",Controls whether age sampling is length "stratified" or "random"
age_length_group = 1,Length group bin size for stratified age sampling (cm)lgroup
age_space_group = "division")Spatial scale of stratified age sampling ("division", "strat", "set")sgroup
1 Length group and spatial group are defined using the age_length_group and age_space_group arguments, respectively. These arguments are ignored if age_sampling is set to "random" and the value supplied to ages_cap represents the maximum number of ages to sample per set.

Like sim_abundance and sim_distribution, custom closures can be supplied to sim_survey to impose alternate parametric curves for catchability at age (i.e. a closure including an equation for a dome can be constructed and used in lieu of sim_logistic to impose a dome-shaped catchability). Multiple simulations of the same survey can be run using the n_sims argument, however, requesting large numbers of simulations can be computationally demanding depending on the processing capacity available. Below we use sim_survey to simulate two surveys over a default population, of which one is set-up to have higher set density (set_den) than the other.

Again, this function retains all the objects listed in the output of sim_distribution and adds data.tables that detail the set locations (setdet) and sampling details (samp). Catchability corrected abundance matrices (abundance at age matrix multiplied by survey catchability), named I and I_at_length, are also produced and added to the output; these matrices are useful for comparing the true abundance available to the survey to abundance estimates obtained using design-based or model-based analyses of the simulated survey data. Specific surveys can be explored using the plot_survey function, which uses plotly [25] and crosstalk [30] in the background to link the bubble plot of aggregate set catch to the histogram of lengths and ages sampled to facilitate explorations of set-specific catches (Fig 4).

Fig 4
Point size and color are scaled by abundance in the bubble plots. Histograms of length and age composition include the distribution of all fish caught next to those sampled overlaid with a line of the true distribution of lengths and ages available to the survey. Note that the first simulation of the survey in year 20 is depicted here. These plots are produced by plot_survey when supplied survey data simulated using sim_survey.Bubble plots of abundance and histograms of set catches from a simulated stratified-random survey of a default population under relatively a) low and b) high sampling effort.

As noted above, available RAM may limit the utility of the sim_survey function for running thousands of simulations of the same survey. The sim_survey_parallel function was therefore constructed to facilitate this process. This function is set-up to run multiple sim_survey calls in parallel using the doParallel package [31] and, as such, multiple loops can be run using the n_loops argument and, within each loop, multiple simulations can be run (controlled using the n_sims argument). Total simulations will be the product of n_loops and n_sims arguments. If more than one core (cores) is specified, then the simulations will be run in parallel to speed up the process. Low numbers of n_sims and high numbers of n_loops will be easier on RAM, but may be slower. The optimum ratio of n_sims to n_loops will depend on the amount of RAM and number of cores in a given computer. In any case, this function simplifies the process of running thousands of simulations of the same survey and the simulated data can then be supplied to survey-based or model-based analyses that require simulation testing.

## run_strat

Stratified estimates of abundance are obtained by supplying the output from sim_survey to the run_strat function. There are only four arguments in the run_strat function: length_group, alk_scale, strat_data_fun, and strat_means_fun. The length_group argument defines the size of the length frequency bins for both abundance at length calculations and age-length key construction; this argument is set to "inherit", by default, to utilize the length_group value defined inside the closure supplied to the growth argument in sim_abundance. The alk_scale argument defines the scale at which to construct and apply age-length keys ("division" [default], "strat" or "set"). Finally, strat_data_fun and strat_means_fun allow users to supply custom functions for processing the simulated data and calculating stratified means, respectively; the built in functions strat_data and strat_means are supplied by default. RMSE of the stratified estimates from run_strat can then be calculated using the strat_error function. Results and error of a stratified analysis of one survey over a population are obtained using the following code (using default values):

The returned object will include all the objects accumulated through the sim_abundance to strat_error. The run_strat function adds three data.tables called total_strat, length_strat and age_strat that include stratified estimates of total abundance, abundance at length, and abundance at age, respectively. To this, strat_error adds data.tables ending with _strat_error or _strat_error_stats. The _strat_error objects simply contain stratified estimates of abundance (column named I_hat) with corresponding true values of abundance available to the survey (column named I) and the strat_error_stats data.frame includes metrics of mean error (ME), mean absolute error (MAE), mean-squared error (MSE), and root-mean-squared error (RMSE).

## test_surveys

Assuming a stratified analysis as the default method for obtaining an index of abundance, a series of survey protocols can be tested using the test_surveys function. Provided a simulated population from sim_distribution and a series of survey protocols from expand_surveys, this function will simulate and analyze data from each survey using the sim_survey, run_strat, and strat_error functions. Like sim_survey_parallel, this function operates in parallel and allows the specification of n_sims and n_loops, and the product of these two arguments equals the number of times each survey is simulated. Keep in mind that low numbers of n_sims and high numbers of n_loops will be less demanding on RAM, but may be slower, especially if the work is spread across few cores. Because most of the default settings of the functions match the case study settings, the code below will replicate the results from our case study (see S1 Appendix for more detail). The expand_surveys function sets up a series of 175 surveys to test (i.e. all possible combinations of the set_den, lengths_cap and ages_cap vectors) and the test_surveys function will run 1000 simulations of each survey and compare stratified estimates of abundance to the true abundance available to the survey. While test_surveys is set-up for testing key sampling effort settings (set_den, lengths_cap, and ages_cap), other options can be assessed using independent calls of test_surveys as it accepts several arguments from sim_survey (q, trawl_dim, min_sets, age_sampling, age_length_group, and age_space_group) and run_strat (length_group and alk_scale).

Processing time will be system (i.e. amount of RAM and number of cores) and setting (i.e. n_loops and n_sims ratio) dependent. The test_survey function will print a progress bar, generated using the progress package [32], which details percent completion and will also include an estimate time of arrival (eta) after the first step of the loop completes. The test_surveys function therefore includes an option for exporting intermediate results to a local directory, via the export_dir argument, and the resume_test function can be used to resume a test_surveys run that had to be stopped part way through the process. The final object produced will be a list that includes all objects from sim_abundance and sim_distribution with the table of survey designs tested (named surveys) and tables produced by strat_error that end with the names _strat_error and _strat_error_stats. These tables include a survey column to allow merging of the survey protocol table with the error tables. Objects produced by sim_survey (set and sampling details) and run_strat (full stratified analysis results) are not retained to minimize object size.

Like other core functions, some convenience functions are included in SimSurvey for creating interactive plots of the results from test_surveys. For instance a series of plotting functions ending in _fan produces fan charts where stratified estimates of abundance from each simulated survey are converted into a series of quantiles to depict the probability that estimates fall within a particular range. True values of abundance available to the survey are overlaid on the series of probability envelopes. These plots help visually assess the level of precision and bias from a specific set of survey protocol; ideally, the probability envelopes will be tightly centered around the true values. The three lines of code below will produce interactive fan charts for stratified estimates of total abundance, abundance at length and abundance at age, respectively (e.g. Figs 5, 6 and 7).

Fig 5
The thick black line indicates the true trend in the total population available to the survey and the yellow to purple color gradient represents a range of probability envelopes from 10% to 90%. This plot is a facet of plots produced by plot_total_strat_fan when supplied results from test_surveys.Fan chart of stratified estimates of the trend in total abundance from surveys with different set densities, Dsets.
Fig 6
The thick black line indicates the true trend in the total population available to the survey and the yellow to purple color gradient represents a range of probability envelopes from 10% to 90%. This plot is a facet of plots produced by plot_total_strat_fan when supplied results from test_surveys.Fan chart of stratified estimates of abundance at length from year seven of the simulation from surveys with different set densities, Dsets, and the maximum number of length samples, Mlengths.
Fig 7
Number of ages sampled per length group, Mages, was 10 in all scenarios. The thick black line indicates the true trend in the total population available to the survey and the yellow to purple color gradient represents a range of probability envelopes from 10% to 90%. This plot is a facet of plots produced by plot_total_strat_fan when supplied results from test_surveys.Fan chart of stratified estimates of abundance at age from year seven of the simulation from surveys with different set densities, Dsets, and the maximum number of length samples, Mlengths.

The relative performance of the surveys tested can be compared using plot_survey_rank and plot_error_surface. The plot_survey_rank function produces a divergent dot plot of the results which ranks the surveys by RMSE, where lower values indicate sampling strategies that minimize bias and maximize precision. Using the which_strat argument, the plot can be focused on total, length or age based stratified results (Fig 8). The plot_error_surface displays the age based stratified results by plotting a surface of RMSE (z-axis) by set (drop down selection), length (y-axis), and age (z-axis) sampling effort. The sampling effort axes can be rule or sample size based (plot_by = "rule" or plot_by = "samples", respectively; Fig 9).

Fig 8
Records are ranked by lowest to highest RMSE score. Within each plot, a yellow to purple color gradient is applied from lowest to highest value as an additional visual aid. Note that the exponent format of the axes defaults to SI unit symbols (e.g. M for million).Divergent dot plot of the precision and accuracy (root-mean-squared error; RMSE) of length based stratified estimates of abundance, and total sampling effort (number of sets [Nsets] and length measurements [Nmeasured]), under various sampling protocols (set density [Dsets] and maximum number of lengths measured per set [Mlengths]).
Fig 9
Panels represent surveys with different set densities (Dsets), x-axes represent the maximum sampling effort of lengths per set (Mlengths), and y-axes represent the maximum number of ages to collect per length group (Mages). This plot is a facet of plots produced by plot_error_surface when supplied results from test_surveys. Note that RMSE scales are different across Dsets panels. Note that the exponent format of the axes defaults to SI unit symbols (e.g. M for million).Surface plots of root-mean-squared error (RMSE) from an array of surveys with different sampling protocol.

## Interpretation

Results shown in the Core functions section demonstrates one use of the SimSurvey package that is focused on evaluating the efficacy of increasing the sampling effort of a stratified random survey of a cod population (see S1 Appendix for details). These results largely align with expectations from sampling theory that 1) design-based estimators for stratified random surveys are unbiased, and 2) precision is increased by increasing the number of primary sampling units [33]. Specifically, estimates of total abundance and abundance at length are centered around true values and their probability envelopes tighten as set density increases (Figs 5 and 6). Our case study results also echo the growing body of literature which concludes that extra sub-sampling is an ineffective means of improving estimates relative to sampling more locations [1012,3436]. This is exemplified by the relatively large drops in RMSE when set density is increased compared to when sub-sampling effort is increased (Figs 8 and 9). Moreover, it appears that it is more advantageous to measure fewer total fish at more locations than measuring many fish at fewer locations (Fig 8). Again, this result was expected because the size-structured spatial clustering imposed by the simulation causes set-to-set variation to be greater than within-set variation in characteristics such as length and age. This structure was imposed because fish caught together tend to have similar characteristics and it is well known that this reduces the effective sample size [10]. Increasing sub-sampling effort, however, should not lead to poorer population estimates like those observed for abundance at age (Figs 7 and 9). Bias, in particular, appears to be a problem with stratified estimates of abundance at age under certain scenarios (Fig 7); in contrast, there appears to be little bias in the stratified estimates of abundance at length (Fig 6). This result was unexpected as these estimates of abundance at age were presumed to be unbiased. The contrast between the length and age based analysis indicates that the problem lies with the intervening age-length-key.

Age-length keys have long been used in conjunction with length frequency tables to estimate the age distribution of fish populations over large scales [37]. By default, sim_survey spreads length-stratified age sampling across the division and run_strat constructs and applies age-length keys at the division scale. This default was chosen because it is standard protocol for the actual survey the case study is based on. There is, however, a potential cost to the spatial scale of the key. Namely, it is unlikely that one age-length key is representative for the whole region as the probability of being a specific age given length varies in space [38]. Because different age groups sometimes occur in different places, the translation of lengths to ages may be biased by the samples used to generate the age-length key. In short, multi-stage sampling of a clustered population violates the assumption that the age-length key is generated from a simple random sample [21]. Bias introduced by this assumption can be resolved by explicitly accounting for the hierarchical sampling by using designed-based estimators [21]. For the case study, a proper design-based estimator will need to account for cluster sampling in a stratified random survey. Following recommendations in Aanes and Vølstad [21], we used SimSurvey to 1) conduct a survey with concurrent length and age sampling at every set, 2) construct and apply age-length keys on a set-by-set basis, and 3) weight the resultant age frequencies at each set using stratified random estimators. This test was accomplished using this code:

This is a minor modification of the code used in the test_surveys section whereby the same simulated population was used (pop object) but set density scenarios were reduced to one option for simplicity, and age sampling protocol and stratified analysis options were changed. That is, 1) age_length_group and age_space_group were changed to simulate a survey that conducts concurrent length and age sampling at every set (e.g. collect age samples from a maximum of 1 fish per 3 cm length group per set), 2) alk_scale was changed to "set" to construct and apply age-length keys on a set-by-set basis, and 3) the resultant age frequencies at each set are weighted accordingly using the design-based estimators built into run_strat. Again, results are visualized using plot_age_strat_fan and plot_error_surface:

Unlike the default approach to collecting and analyzing age samples, results from this test appear to be unbiased and additional sub-sampling effort beyond a certain point appears to have little to no effect on the estimates (Fig 10). This is an important result with real-world implications as they imply that 1) inference from the actual survey can be improved by altering the sampling and analysis of ages and, 2) valuable survey time can be saved by cutting back on sub-sampling effort. This example demonstrates how SimSurvey can be used to reveal problems and explore solutions to the sampling designs of fisheries-independent surveys.

Fig 10
Specifically, a) is a fan chart of stratified estimates of abundance at age from year seven of the simulation from a survey with a set density (Dsets) of 0.002 sets / km2, maximum number of length samples (Mlengths) of 100, and maximum number of age samples per 3 cm length group per set (Mages) (the thick black line indicates the true trend in the total population available to the survey and the yellow to purple color gradient represents a range of probability envelopes from 10% to 90%); and b) is a surface plot of root-mean-squared error (RMSE) from an array of surveys with set density (Dsets) fixed to 0.002 sets / km2 but different sub-sampling effort, where the x-axes represent the maximum number of lengths to measure per set (Mlengths), and y-axes represent the maximum number of ages to collect per 3 cm length group per set (Mages).Main results from an alternative design-based sampling and analysis procedure that accounts for multi-stage cluster sampling of ages.

## Discussion

Though somewhat narrow in scope, results from our case study demonstrate why it is important to evaluate the design and analysis of complex surveys. Specifically, we tested the design and analysis of a multi-staged stratified random survey of a cod population. This type of sampling is common in fisheries research, however, the clustered nature of the samples are rarely taken into consideration [9]. Instead, it is commonly assumed that individuals are randomly drawn from the target populations even though such sampling is virtually impossible to accomplish in practice. The assumption of simple random sampling implies that the characteristics of fish caught at the same location are independent and, if estimators that assume independence are used, this can introduce bias and underestimate variance [9,21]. Moreover, this assumption may contribute to the perception that more sub-samples will lead to better population estimates. Increasing sub-sampling effort, however, has been shown to be an ineffective means of improving estimates because samples at the same location are usually correlated; the best way to increase the number of independent samples is to sample more locations [1012,3436]. Results from the case study reiterate these points and indicate that there is room for improvements to the actual survey the case study was based on. Specifically, it appears that valuable human resources may be wasted by collecting excessive samples of correlated metrics, and abundance at age estimates may be biased by the assumption that age samples are from a simple random sample. We have used SimSurvey to identify solutions to these problems and, given current results, we suggest that time can be saved by decreasing sub-sampling effort and bias can be reduced by using an approach that account for the hierarchical design of the survey.

### Research opportunities

The case study used here provides one example of how SimSurvey can be used to simulation test the design of fisheries-independent trawl surveys. Default settings can, of course, be modified to emulate data from an array of different surveys of different populations (see S2 Appendix for guidance on how to modify default settings to suit specific needs). The package therefore facilitates the exploration of a range of research questions. Below we outline some examples where the SimSurvey package may aid future research efforts.

#### Design or model-based approach

The analysis of data from fisheries-independent surveys have generally been confined to design-based mean and variance estimates of abundance using standard formulae (e.g. [33]). Nevertheless, there has long been interest in using model-based approaches to improve abundance estimates (e.g. [19,39,40]). SimSurvey can serve as a convenient tool for simulation testing mean and variance estimates provided by a range of different approaches (design-based analyses, bootstrap estimates, generalized additive models, geostatistical models, etc.). Moreover, the full analytical pathway for obtaining age-disaggregated estimates of abundance has rarely been simulation tested. Existing and future approaches for calculating age-based indices of abundance can be simulation tested using SimSurvey.

#### Growth analyses

Assessing ages for a large number of fish is very time-consuming and, as such, length-stratified sampling is often used to estimate age frequencies of fish populations. The resultant sub-sample is used to construct an age-length key (i.e. the probability a fish is a specific age given length) and age frequencies are obtained by applying this key to length-frequencies obtained via more expansive random sampling. One age-length key is typically assumed to be representative of the whole stock area, however, spatial variability in the relationship may introduce bias in abundance-at-age estimates [21,38]. Results from the case study (see S1 Appendix) reiterate this point and SimSurvey may serve as a platform for testing potential model-based solutions to this problem (e.g. [38]).

#### Random or stratified sampling

SimSurvey can be used to compare the precision and bias of population estimates obtained using random or stratified sampling. Simple random sampling can be implemented using a grid with one strata (e.g. make_grid(depth_range = c(0, 1000), strat_breaks = c(0, 1000), strat_split = 0)). Sub-sampling of ages can also be random rather than length-stratified by setting the age_sampling argument in the sim_survey function to "random" rather than "stratified". This can facilitate research similar to work presented in Puerta et al. [15].

### Future directions

Up to now, the package has focused on the effects of sampling design on the precision and bias of population estimates obtained from fisheries-independent surveys; however, the costs associated with sampling has yet to be considered. In future iterations of SimSurvey, we hope to add options for integrating data on the time and monetary costs associated with each level of sampling (sets, length measurements, age determination) to facilitate cost-benefit analyses. We also realize that a single fisheries-independent survey may have multiple goals as data obtained are often used to assess multiple species or to conduct community analyses. We will therefore endeavor to add functions for simulating multi-species surveys. Another limitation of the package is that we have yet to implement alternatives to random or stratified random survey designs (e.g. systematic sampling); expanding these options would allow for a more comprehensive evaluations of various designs. Finally, it would be useful to add an option for testing the consequences of surveys with partial coverage of a population as survey coverage is a frequent concern in stock assessment.

### Assumptions

Like any model, this simulation is a simplification of a much more complex reality. For instance, the population is assumed to aggregate by age-class and be uniformly distributed within a cell, instead fish may aggregate by length and form finer-scale clusters. The survey is also an instantaneous snapshot of the population, meaning that the population is assumed to be in the same location from the beginning to the end of the survey. Also, fish are aged at random within length bins and ages are estimated without error. Finally, area trawled is assumed to be perfectly standard. These assumptions, plus a range of others, will surely under-represent the natural variability of fish populations and survey protocol. Nevertheless, the SimSurvey package provides a relatively complex and flexible operating model for simulating survey data from a population that varies across age, year and space dimensions.

### Summary

The SimSurvey package serves as a tool for simulating sample surveys of dynamic populations that vary across ages, time and space. The core of the simulation is based on the widely used cohort equation and, even though the processes that define recruitment and total mortality are simple, a wide range of stock dynamics can be simulated by changing a few parameters. This base population can then be distributed through a grid and relationships with depth and correlation across ages, years and space can be defined. Together, two functions (sim_abundance and sim_distribution) are capable of simulating a wide range of populations with different life histories, depth associations and spatial properties. The next step to generating data similar to actual observations is to conduct a survey. In this package we implement a function, sim_survey, that conducts a survey of the population. The sampling process is governed by the area covered by the trawl as well as age-specific catchability. Sub-sampling protocol (length and age sampling) can also be varied. As such, data from a wide range of surveys can be simulated.

A large number of statistical models can be tested using these simulated data, but, implementing a variety of analytical approaches was outside the scope of this package. Instead we focus on analyzing simulated stratified-random survey data using a design-based stratified analysis. A stratified analysis is facilitated using the run_strat function and the precision and accuracy of the results (e.g. RMSE) can be calculated using strat_error. This is a simple and widely-used analysis, and the speed at which it runs allows for a wide range of survey designs to be tested, via the test_surveys function, in a reasonable time-frame.

Simulation testing is an important tool in the field of fisheries science as the inferred status of fish stocks hinge on the data and models used to assess fish populations. Simulations provide an opportunity to explore survey and model performance, and such explorations are becoming increasingly important as model complexity increases. It is also important to continually assess the efficacy and efficiency of sampling programs given their costs and the constant scrutiny of the value added by such surveys. These are some of the reasons multiple simulation frameworks, including SimSurvey, have been developed to test the design and analyses of complex surveys. We have made SimSurvey as open and accessible as possible to allow the broader community to validate, reuse and improve this package. We hope that open-source sharing will extend the value of such simulation frameworks and we encourage users to extend the package for their own needs and contribute to future versions.

## Acknowledgements

This work has benefited from valuable feedback from numerous colleagues, including Aaron Adamack, Alejandro Buren, Noel Cadigan, Karen Dwyer, Geoff Evans, Paul Higdon, Danny Ings, Mariano Koen-Alonso, Joanne Morgan, Derek Osborne, Pierre Pepin, Dwayne Pittman, Don Power, Craig Purchase, Martha Robertson, Mark Simpson, Brad Squires, Don Stansbury and Peter Upward. We also thank Dave Cote and Joanne Morgan for providing constructive comments on a previous version of this manuscript. Finally, the package and the manuscript were greatly improved by feedback from editor Daniel Duplisea and two anonymous reviewers. This work was supported by the NSERC visiting-fellow program and Fisheries and Oceans Canada.

## References

1

SE Fienberg, JM Tanur. . Reconsidering the fundamental contributions of Fisher and Neyman on experimentation and sampling. International Statistical Review. 1996; , pp.237–253.

2

JD Nichols, BK Williams. . Monitoring for conservation. Trends in Ecology & Evolution. 2006;21: , pp.668–673.

3

J Martin, WM Kitchens, JE Hines. . Importance of well-designed monitoring programs for the conservation of endangered species: Case study of the snail kite. Conservation Biology. 2007;21: , pp.472–481.

4

SP Campbell, JA Clark, LH Crampton, AD Guerry, LT Hatch, PR Hosseini, et al. An assessment of monitoring efforts in endangered species recovery plans. Ecological Applications. 2002;12: , pp.674–681.

5

WJ Sutherland. . Sustainable exploitation: A review of principles and methods. Wildlife Biology. 2001;7: , pp.131–140.

6

JH Reynolds, WL Thompson, B Russell. . Planning for success: Identifying effective and efficient survey designs for monitoring. Biological Conservation. 2011;144: , pp.1278–1284.

7

M Pennington, T Strømme. . Surveys as a research tool for managing dynamic stocks. Fisheries Research. 1998;37: , pp.97–106.

8

GA Rose, DW Kulka. . Hyperaggregation of fish and fisheries: How catch-per-unit-effort increased as the northern cod (Gadus morhua) declined. Canadian Journal of Fisheries and Aquatic Sciences. 1999;56: , pp.118–127.

9

GA Nelson. . Cluster sampling: a pervasive, yet little recognized survey design in fisheries research. Transactions of the American Fisheries Society. 2014;143: , pp.926–938.

10

M Pennington, JH Vølstad. . Assessing the effect of intra-haul correlation and variable density on estimates of population characteristics from marine surveys. Biometrics. 1994;50: , pp.725–732.

11

M Pennington, L-M Burmeister, V Hjellvik. . Assessing the precision of frequency distributions estimated from trawl-survey samples. Fisheries Bulletin. 2002;100: , pp.74–80.

12

IJ Stewart, OS Hamel, K Rose. . Bootstrapping of sample sizes for length- or age-composition data used in stock assessments. Canadian Journal of Fisheries & Aquatic Sciences. 2014;71: , pp.581–588.

13

L Thomas, ST Buckland, EA Rexstad, JL Laake, S Strindberg, SL Hedley, et al. Distance software: Design and analysis of distance sampling surveys for estimating population size. Journal of Applied Ecology. 2010;47: , pp.5–14. , doi: 10.1111/j.1365-2664.2009.01737.x

14

JT Schnute, R Haigh. . A simulation model for designing groundfish trawl surveys. Canadian Journal of Fisheries and Aquatic Sciences. 2003;60: , pp.640–656.

15

P Puerta, L Ciannelli, B Johnson. . A simulation framework for evaluating multi-stage sampling designs in populations with spatially structured traits. PeerJ. 2019;7: , pp.e6471, doi: 10.7717/peerj.6471

16

NG Cadigan. . A state-space stock assessment model for northern cod, including under-reported catches and variable natural mortality rates. Canadian Journal of Fisheries and Aquatic Sciences. 2016;73: , pp.296–308.

17

L Von Bertalanffy. . A quantitative theory of organic growth (inquiries on growth laws. II). Human biology. 1938;10: , pp.181–213.

18

M Blangiardo, M Cameletti. Spatial and spatio-temporal bayesian models with R-INLA. John Wiley & Sons; 2015.

19

JT Thorson, AO Shelton, EJ Ward, HJ Skaug. . Geostatistical delta-generalized linear mixed models improve precision for estimated abundance indices for West Coast groundfishes. ICES Journal of Marine Science. 2015;72: , pp.1297–1310.

20

Smith S, Somerton G. STRAP: A User-Oriented Computer Analysis System for Groundfish Research Trawl Survey Data. Canadian Technical Report of Fisheries; Aquatic Sciences No. 1030; 1981. p. 66.

21

S Aanes, JH Vølstad. . Efficient statistical estimators and sampling strategies for estimating the age composition of fish. Canadian Journal of Fisheries and Aquatic Science. 2015;72: , pp.938–953.

22

R Core Team. R: A language and environment for statistical computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2017. https://www.R-project.org/

23

Dowle M, Srinivasan A. Data.table: Extension of ‘data.frame‘ [Internet]. 2017. https://CRAN.R-project.org/package=data.table

24

Hijmans RJ. Raster: Geographic data analysis and modeling [Internet]. 2016. https://CRAN.R-project.org/package=raster

25

Sievert C. Plotly for R [Internet]. 2018. https://plotly-book.cpsievert.me

26

Wickham H. Advanced R. Chapman; Hall/CRC; 2014.

27

Schlather M, Malinowski A, Oesting M, Boecker D, Strokorb K, Engelke S, et al. RandomFields: Simulation and analysis of random fields [Internet]. 2020. https://cran.r-project.org/package=RandomFields

28

H Rue, S Martino, N Chopin. . Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2009;71: , pp.319–392.

29

Bache SM, Wickham H. Magrittr: A forward-pipe operator for r [Internet]. 2014. https://CRAN.R-project.org/package=magrittr

30

Cheng J. Crosstalk: Inter-widget interactivity for html widgets [Internet]. 2016. https://CRAN.R-project.org/package=crosstalk

31

Revolution Analytics, Weston S. DoParallel: Foreach parallel adaptor for the ‘parallel’ package [Internet]. 2015. https://CRAN.R-project.org/package=doParallel

32

Csárdi G, FitzJohn R. Progress: Terminal progress bars [Internet]. 2016. https://CRAN.R-project.org/package=progress

33

WG Cochran. Sampling techniques. 3rd edJohn Wiley & Sons; 1977 p. , pp.428.

34

B Bogstad, M Pennington, JH Vølstad. . Cost-efficient survey designs for estimating food consumption by fish. Fisheries Research. 1995;23: , pp.37–46.

35

LG Coggins, DC Gwinn, MS Allen. . Evaluation of Age-Length Key Sample Sizes Required to Estimate Fish Total Mortality and Growth. Transactions of the American Fisheries Society. 2013;142: , pp.832–840.

36

Y Zhang, SX Cadrin. . Estimating Effective Sample Size for Monitoring Length Distributions: A Comparative Study of Georges Bank Groundfish. Transactions of the American Fisheries Society. 2013;142: , pp.59–67.

37

A Fridriksson. . On the calculation of age-distribution within a stock of cod by means of relatively few age-determinations as a key to measurements on a large scale. Rapports Et Proces-Verbaux Des Reunions, Conseil International Pour l’Exploration De La Mer. 1934;86: , pp.1–5.

38

CW Berg, K Kristensen. . Spatial age-length key modelling using continuation ratio logits. Fisheries Research. 2012;129–130: , pp.119–126.

39

SJ Smith. . Use of statistical models for the estimation of abundance from groundfish trawl survey data. Canadian Journal of Fisheries and Aquatic Sciences. 1990;47: , pp.894–903.

40

CW Berg, A Nielsen, K Kristensen. . Evaluation of alternative age-based methods for estimating relative abundance from survey data in relation to assessment models. Fisheries Research. 2014;151: , pp.91–99.

13 Nov 2019

PONE-D-19-26904

SimSurvey: an R package to optimize the design and analysis of fisheries surveys by simulating spatially-correlated fish stocks

PLOS ONE

Dear Dr. Regular,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

These two thorough reviews raise excellent points that should greatly improve the utility of the package if they can be addressed. Please consider each point particularly keeping in mind the research content of the manuscript beyond a description of the software.

We would appreciate receiving your revised manuscript by Dec 28 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Daniel E. Duplisea, PhD

PLOS ONE

Journal Requirements:

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

[Note: HTML markup is below. Please do not edit.]

Reviewer's Responses to Questions

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Reviewer #1: This paper describes the R package “SimSurvey” which is a package to test the design and analysis of fisheries independent surveys.

I found the paper very interesting and the package of great potential value and interest to many fishery scientists. However, I have several comments that I would like the authors to address first before I recommend acceptance of this paper. My main comments relate to:

1. Verify some of the equations and model description used in the model. I think that some of them lack description to fully understand what was done and few others are misleading, even incorrect. The detail can be found below but in general, they are linked to the problem of bias correction for log-normal distribution, re-scaling”, moving from abundance at age to abundance at length.

2. Add additional explanation to justify the choices. E.g. why population dynamics is modeled independently of the spatial distribution. Is this reasonable, limitation, etc.

3. Few suggestions (add-ons, renaming) to increase the generality of the paper (without requiring too much work)

Abstract:

What about the flexibility to include new sampling strategies (user defined one) or including new estimation approaches

Introduction:

L43: quality of

It would be good to add few references here e.g. L50,

Method:

L69-73: I think it would be good to add the reference to the specific function name listed in Table2. This way, I would be more explicit and avoid possible confusion. For example, in my case, when I first read the section header “simulate abundance”, I was expecting to see the way you dealt with the spatial distribution of abundance.

L94&95&104: I just want to make sure that the bias correction has been applied to these two equations. If logN (or logZ) is normally distribution with mean mu and sigma, then the N (or Z) is lognormally distributed with mean exp(mu+sigma^2/2). Therefore, when you write “mean” in table 3 for R and Z, I hope that you modified in your actual equation for log(Z) to:

log(Z) ~ normal(log(mean)-sigma^2/2, sigma)

In other word, the mean(R) is not simply exp(meanR_log_scale). exp(meanR_log_scale) is the median of the distribution of R.

L107: there is not enough description to understand how you converted the abundance at age to abundance at length. What were the choice of length bin groups? Every 5 cm, 10cm? Such information is also missing in the tables. Please add that. But then you have to calculate some cumulative distribution (until length l) for the length at age distribution (using the VB growth equation) and do some subtractions to be able to allocate the abundance at age to each length group.

See equation A.1.14 in here for example. https://www.nwfsc.noaa.gov/news/events/program_reviews/documents/C.2_Methot_Wetzel_SSTechnicalDescription.pdf

“Simulate spatial distribution”:

I have a few comments on this section:

1. First of all, I think it is important to state that you assume that the population dynamic model and the spatial distribution is independent i.e. you do not have a spatially-explicit population dynamics model. And talk about what it means (is it realistic, limitation, etc).

2. L161. I think it would be better to rename “depth” as “main covariate influencing the species distribution”. And “depth” happens to be one of them for many species, but there are cases when this is not the case.

3. L161. I think it would be good to say upfront that there are two ways of defining the spatial structure. Using the function in the package or user defined

4. L161: I think another possibility is to generate a random field by using the package “Randomfields” for example. This way you can generate a map where “depth” is patchily distributed (maybe more like an island type case study)

5. L163-164: please add more description about the division or strata. How can we set it up and what do you specifically mean by division and strata? Is one nested within the other, or not necessarily? Your examples are based on Atlantic Ocean and people in other regions might not be familiar with how these divisions are created.

6. L163: Why only focus on “depth-based strata”? I think it would be good to allow the user to choose their own stratification approach. It could be depth based as you did (which happens most often in surveys) but it could technically be any other thing (user supplied). This allows more flexibility.

7. L167: the equation is misleading and I am not sure it is right based on what is written in the text. You mentioned later on L178: that you “re-scaled” so that the total number of fish in a specific year and age across space is equal to the single number from the population dynamics model. If so, the re-scaling should be done in the identify (natural) scale, not in log scale. In log scale, even if ,, sums to zero across space, the sum in the identity scale won’t match. This is often refer to as “bias correction” for the log normal models. And you should ideally show how the rescaling was done in terms of the equations too.

8. L167: this “depth preference” function is very simplistic and gives only very “smooth” symmetric distribution. More often, fish have a skewed depth preference: often right-skewed.

9. L173-174: I think it might be worth adding, in simpler terms, the meaning of the spatial smoothing and scaling parameters.

10. L178: it is another question of scaling. How did you exactly do the scaling? In the identity scale? By dividing my the sum of the effects? I am asking this because depending on how you did the rescaling, your correlation structure in space and age might have been affected and is not the same as the one specified in L172. Did you verify that?

Table 4. “group_ages”. Ok but how is the variance controlled for the other age classes?

L189: “user supplied”. This is a good feature. However, I think it is important to mention here that user have to make sure that they use the correct projection method to ensure that each grid is of the same dimension.

L216: reference to figure 3?

L221: there is not “group_years” argument in Table 4.

L237: “this function”. I think it would be better to replace “this function” with “sim_distribution” as you do not mention the word “sim_distribution” in the sentence above this.

L158-251: In general, I think re-organising this section using sub-section headers could be useful. Just to guide the readers

L254: you say that sampling is stratified random but SRS is also an option based on Table5. Please correct.

L257: what does this mean? Does this control the number of set but how is this calculated?

L257: I do not see how you control for the total number of set in the survey? How do you control it?

L261: I think you should mention here that you can also force the sample size (as seen on table 5). Moreover, in table 5, it would be good to set-up a “ages_min” for the minimum number of ages to sample […]” so that it gives the ability to fix the sample size if needed by writing the same value for “ages_min” and “ages_cap”.

L261: How are you making sure that the number of sampled fish for that specific cell, age and year won’t be above the total number of fish in that cell, age, and year? The probability value could be close to one and if you fish in a few a time, then you are at risk. Especially because your population dynamics model is not spatially explicit and is completely independent of the distribution function itself i.e. you can technically fish out all the fish in an area but it will be populated back the year after the way you implemented in this study… Maybe you need to put a condition (or just a note) for general users to make sure that this probability value is much below 1?

L267: I recommend to clarify something here. 1. Depending on the number of fish caught? What do you mean? What is the rule you used? 2. The way you coded, sample by age is first decided, then the corresponding length is calculated, then age-subsample is determined. In reality, length sample is taken in the field, then age sub-samples are taken. While similar, I do not think it always equal. Especially, when you start including some correlation structure in the sampling. By the way, did you consider including some correlation structure in the sampling process to make more realistic?

L275 Table 1 on should be table 5

L275: Table5: “age_sammpling” should be “age_sampling”

L275: “min_sets” you have not described it yet and what is it? You have sample from all cells? If no, this is not realistic.

L279: Table5 not Table1?

L285-286: Could you be more specific on how custom closures can be supplied and where?

L306: how are these catchability corrected abundance matrices calculated? It is important to write this information somewhere (or write “please refer to the section “Stratified analysis” for further information on the calculation of abundance indices”) or something alike and Appendix S3.

L336: I think it would be good to say that other methods exist and people can use it in this package (maybe)?

L421: color gradient. Even though it is obvious it might be good to say green to purple gradient.

L427: instead of “sampling protocol”, I think it would be more meaning full to say the maximum number of length samples.

L452: say that the color ramps from yellow to purple

S1 appendix: missing figure in S1

Reviewer #2: This manuscript describes an R package called SimSurvey. The package includes a set of functions for simulating point-based fisheries survey designs, e.g. bottom trawl surveys, for estimating abundance indices. It focuses on number of stations and number of fish sampled ignores other constraints such as distance between stations and day-time duration which impose strong constraints on real surveys. The functions included allow the user to first simulate age-and length structured population dynamics, distribute individuals randomly in space (assuming a certain correlation structure), carry out a survey and finally calculate abundance indices from the simulated data. The package will likely be of interest to researchers wanting to explore the precision achievable with different survey designs. My comments regarding the package and the presentation are summarised below.

1. Optimization

The title announces a package for optimizing survey designs. As far as I can see the package does not allow survey design optimization, neither in terms of defining survey strata nor in terms of number of stations per stratum. The strata are defined by the user. The only option available for the number of stations is proportional to stratum surface; the user sets the minimum number of stations taken in the smallest stratum. It would be useful to be able to specify the total number of stations and test different allocation schemes, such as proportional to surface area (implemented), equal number per stratum, Neyman allocation (accounting for surface area and abundance variability), etc.

Please consider revising the title (e.g. “compare” and instead of “optimize)” and spell out the available sampling design options.

2. Manuscript structure

The manuscript might be easier to follow if the manuscript was restructured: 1) Model description, 2) Using SimSurvey. The later section would then group all example code which could again be subdivided into running simulations and exploring results (plot functions).

3. Parameterisation

To use SimSurvey for a real world problem realistic parameter values are needed. The package comes with default values chosen for a particular case study. However, no mention is made in the manuscript how to choose appropriate values for the many model parameters to tailor the simulations to a population of interest. I suggest the authors add a section on parameterisation and a table summarizing all parameters with a column specifying how to parameterize. For example, parameters for population dynamics and growth could probably be taken from the literature (or a stock assessment report). However this is not possible for the parameters of the spatial distribution function sim_distribution() such as correlation between ages etc. Ideally the package would include a fitting function for estimating these parameters from actual data. These input data could come from a pilot survey and include location (lat, long) and numbers by length/age.

Minor issues

- line 93: I assume there is an age plus but this needs to be mentioned. Also, please specify how you set the initial numbers for plus group ().

- line 121: I don’t understand the explanation of a closure. What do you mean by “return functions”? Do you mean it returns an object with different attributes?

- line 125: the number of right and left brackets is unbalanced, please check

- line 126 “This structure was chosen to avoid the repeated specifications of ages and years”. As far as I can see the example code only specifies years, not ages.

- line 227 Please explain what a pipe is and how it is used. In the example I understand that the output of sim_abundance( ) is provided to (piped) into sim_distribution(). I am unclear what the object b contains. Is it the result of sim_distribution()?

- There is no table 1, please revise table numbering.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

28 Dec 2019

Daniel E. Duplisea, PhD

850 Route de la Mer

Mont‐Joli, Quebec

2019-12-29

Dear Dr. Duplisea,

Thank you for considering a revision of the manuscript PONE-D-19-26904,

**“SimSurvey: an R package to optimize the design and analysis of

fisheries surveys by simulating spatially-correlated fish stocks”** by

Paul M. Regular, Gregory J. Robertson, Keith P. Lewis, Jonathan Babyn,

Brian Healey and Fran Mowbray. In accordance to one of the reviewers’

suggestions, the paper has been renamed **“SimSurvey: an R package

for comparing the design and analysis of fisheries surveys by simulating

spatially-correlated fish stocks”**.

We thank the Reviewers for their thoughtful and thorough comments which

will clarify and improve the manuscript. We agree with many of the

manuscript. Below are the reviewers’ comments with a brief accounting of

our response to each concern or suggestion.

We submit this revised manuscript for your consideration and look

Sincerely,

Paul Regular

Northwest Atlantic Fisheries Center

80 East White Hills, St. John’s, NL

E-mail:

Paul.Regular@dfo-mpo.gc.ca

Phone: (709) 772-2067

------------------------------------------------------------------------

Reviewer \\#1:

This paper describes the R package “SimSurvey” which is a package to

test the design and analysis of fisheries independent surveys.

I found the paper very interesting and the package of great potential

value and interest to many fishery scientists. However, I have several

comments that I would like the authors to address first before I

recommend acceptance of this paper. My main comments relate to:

1. Verify some of the equations and model description used in the

model. I think that some of them lack description to fully

understand what was done and few others are misleading, even

incorrect. The detail can be found below but in general, they are

linked to the problem of bias correction for log-normal

distribution, re-scaling”, moving from abundance at age to abundance

at length.

population dynamics is modeled independently of the spatial

distribution. Is this reasonable, limitation, etc.

3. Few suggestions (add-ons, renaming) to increase the generality of

the paper (without requiring too much work)

*We hope we have addressed each of these main comments by a) clarifying

several equations, b) adding more details and justifications, and c)

modifying naming conventions and clarifying extensibility. See replies

below for more details.*

Abstract:

What about the flexibility to include new sampling strategies (user

defined one) or including new estimation approaches

*We have noted that both built-in and user-defined sampling strategies

can be utilised.*

Introduction:

L43: quality of

*We have made the suggested change.*

It would be good to add few references here e.g. L50,

*We have added references as suggested.*

Method:

L69-73: I think it would be good to add the reference to the specific

function name listed in Table2. This way, I would be more explicit and

avoid possible confusion. For example, in my case, when I first read the

section header “simulate abundance”, I was expecting to see the way you

dealt with the spatial distribution of abundance.

*We have made the suggested change.*

L94&95&104: I just want to make sure that the bias correction has been

applied to these two equations. If logN (or logZ) is normally

distribution with mean mu and sigma, then the N (or Z) is lognormally

distributed with mean exp(mu+sigma^2/2). Therefore, when you write

“mean” in table 3 for R and Z, I hope that you modified in your actual

equation for log(Z) to: log(Z) ~ normal(log(mean)-sigma^2/2, sigma) In

other word, the mean(R) is not simply exp(meanR\\_log\\_scale).

exp(meanR\\_log\\_scale) is the median of the distribution of R.

*Good catch. The answer to the question is no, bias correction was not

applied, but nor was it needed, as all demographic parameters operated

in log space. The issue is a matter of naming convention that we

admittedly struggled with. When developing the functions, we vacillated

on whether to use log_mean or mean as the argument name and we

landed on mean as we believed this would be a more relatable value for

users to supply. To be more explicit with regards to what value should

be supplied, we have renamed the arguments to log_mean.*

L107: there is not enough description to understand how you converted

the abundance at age to abundance at length. What were the choice of

length bin groups? Every 5 cm, 10cm? Such information is also missing in

the tables. Please add that. But then you have to calculate some

cumulative distribution (until length l) for the length at age

distribution (using the VB growth equation) and do some subtractions to

be able to allocate the abundance at age to each length group. See

equation A.1.14 in here for example.

https://www.nwfsc.noaa.gov/news/events/program_reviews/documents/C.2_Methot_Wetzel_SSTechnicalDescription.pdf

*Another good catch as we did not provide enough information to

replicate the approach. We have added more detail on the calculations

behind the sim\\_vonB function; turns out our approach was similar to the

one described in the exemplar the reviewer provided.*

“Simulate spatial distribution”:

I have a few comments on this section:

1. First of all, I think it is important to state that you assume that

the population dynamic model and the spatial distribution is

independent i.e. you do not have a spatially-explicit population

dynamics model. And talk about what it means (is it realistic,

limitation, etc).

*We have prefaced this section with the caveat that population and

spatial dynamics are modeled as independent processes and we note the

pros and cons of this approach. In short: the con is that the approach

is a simplification of reality as it does not explicitly account for

dynamics such as larval dispersal, spatial differences in growth and

population connectivity; the pro is that this simplification limits the

number of unknown parameters that need to be specified while

facilitating the simulation of a sufficiently complex population for

testing the efficacy of various survey designs.*

1. L161. I think it would be better to rename “depth” as “main

covariate influencing the species distribution”. And “depth” happens

to be one of them for many species, but there are cases when this is

not the case.

*A good point, this variable could be used for any habitat gradient over

which stratification is applied. However, we retain the term “depth” for

three reasons: 1) it is an important covariate for many species (as the

reviewer noted), 2) most surveys that we are aware of are

depth-stratified (also noted by the reviewer below) and 3) it is a

concise variable name (as opposed to something like “habitat\\_covar”).

We therefore retained the “depth” naming as it ought to satisfy most use

cases. We now note directly in the text that any covariate could be

supplied and used as ‘depth’*

1. L161. I think it would be good to say upfront that there are two

ways of defining the spatial structure. Using the function in the

package or user defined.

*Given a suggestion from Reviewer \\#2, we have restructured the paper to

have two core sections: “Model structure” and “Using **SimSurvey**”.

We have included a blanket statement under the “Model structure” section

noting that users can circumvent specific components of the

**SimSurvey** framework and supply their own objects, including

spatial structure of the survey grid.*

1. L161: I think another possibility is to generate a random field by

using the package “Randomfields” for example. This way you can

generate a map where “depth” is patchily distributed (maybe more

like an island type case study)

*Good idea! We actually pursued this idea in an earlier iteration of

make_grid, however, we abandoned the option because it was difficult

to automate because the random field sometimes created small strata (one

cell) or strata that were split in space. We would certainly welcome

more development integrating other packages with **SimSurvey***

How can we set it up and what do you specifically mean by division

and strata? Is one nested within the other, or not necessarily? Your

examples are based on Atlantic Ocean and people in other regions

might not be familiar with how these divisions are created.

*We have added more detail on the structure of the divisions and strata.

In short, divisions are pre-defined areas, generally within which a

fisheries management framework is applied. Age-length keys are specified

at the level of the division, so for populations spanning large habitat

gradients (in temperature for example) in which age-length relationships

change along that gradient, multiple divisions can be supplied. Strata,

on the other hand, are defined by an important habitat gradient. As

structured in **SimSurvey**, strata are nested in divisions. *

1. L163: Why only focus on “depth-based strata”? I think it would be

good to allow the user to choose their own stratification approach.

It could be depth based as you did (which happens most often in

surveys) but it could technically be any other thing (user

supplied). This allows more flexibility.

*We agree, see reply to point 2. above, and note that any habitat

variable could be used for stratification.*

1. L167: the equation is misleading and I am not sure it is right based

on what is written in the text. You mentioned later on L178: that

you “re-scaled” so that the total number of fish in a specific year

and age across space is equal to the single number from the

population dynamics model. If so, the re-scaling should be done in

the identify (natural) scale, not in log scale. In log scale, even

if ,, sums to zero across space, the sum in the identity scale won’t

match. This is often refer to as “bias correction” for the log

normal models. And you should ideally show how the rescaling was

done in terms of the equations too.

*Right, the equation presented was not an accurate reflection of the

calculations. We have revised the equation to explicitly show how the

values were normalized to sum to 1.*

1. L167: this “depth preference” function is very simplistic and gives

only very “smooth” symmetric distribution. More often, fish have a

skewed depth preference: often right-skewed.

*True. Some users may find this parameterization insufficient for their

species and we hope they will implement their own closure to use in the

sim_distribution function to better simulate the effect. In addition

to our blanket statement under the “Model structure” section, we have

added a more specific statement under the “Using **SimSurvey**”

section stating that alternate formulations can be used by supplying

alternate closures to the core functions.*

1. L173-174: I think it might be worth adding, in simpler terms, the

meaning of the spatial smoothing and scaling parameters.

*We have prefaced that sentence with “The rate at which point-to-point

spatial correlation decays with distance is controlled by…”.*

1. L178: it is another question of scaling. How did you exactly do the

scaling? In the identity scale? By dividing my the sum of the

effects? I am asking this because depending on how you did the

rescaling, your correlation structure in space and age might have

been affected and is not the same as the one specified in L172. Did

you verify that?

*See reply to point 7. above*

Table 4. “group\\_ages”. Ok but how is the variance controlled for the

other age classes?

*“Variance” was a poor word choice. We have replaced it with “noise” as

it is the simulated noise that we fix across multiple age groups.*

L189: “user supplied”. This is a good feature. However, I think it is

important to mention here that user have to make sure that they use the

correct projection method to ensure that each grid is of the same

dimension.

*We have made the suggested change.*

L216: reference to figure 3?

*We have made the suggested change.*

L221: there is not “group\\_years” argument in Table 4.

*We have added it to the table.*

L237: “this function”. I think it would be better to replace “this

function” with “sim\\_distribution” as you do not mention the word

“sim\\_distribution” in the sentence above this.

*We have made the suggested change.*

L158-251: In general, I think re-organising this section using

*This is an excellent suggestion. By following a suggestion by Reviewer

\\#2 to re-organize the paper into two core sections we have added more

L254: you say that sampling is stratified random but SRS is also an

option based on Table5. Please correct.

L257: what does this mean? Does this control the number of set but how

is this calculated?

*We have clarified how number of sets per strata is calculated.*

L257: I do not see how you control for the total number of set in the

survey? How do you control it?

*We have clarified how number of sets per strata is calculated; it is

done within the sim_survey function using the set_den argument, with

further adjustment possible with the min_sets argument.*

L261: I think you should mention here that you can also force the sample

size (as seen on table 5). Moreover, in table 5, it would be good to

set-up a “ages\\_min” for the minimum number of ages to sample \$…\$” so

that it gives the ability to fix the sample size if needed by writing

the same value for “ages\\_min” and “ages\\_cap”.

*We are not sure what the reviewer would like to have implemented here.

Is the suggestion to impose a minimum number of ages to collect across

all length groups? In practice, an ages_min constraint is already

applied for length bins where the total number of fish caught is lower

than ages_cap. A specific ages_min constraint cannot really be

applied, as if there are too few fish of a certain size caught in any

one survey, there is no means of getting more fish of that size.*

L261: How are you making sure that the number of sampled fish for that

specific cell, age and year won’t be above the total number of fish in

that cell, age, and year? The probability value could be close to one

and if you fish in a few a time, then you are at risk. Especially

because your population dynamics model is not spatially explicit and is

completely independent of the distribution function itself i.e. you can

technically fish out all the fish in an area but it will be populated

back the year after the way you implemented in this study… Maybe you

need to put a condition (or just a note) for general users to make sure

that this probability value is much below 1?

*The sampling is implemented such that the number of fish sampled in a

cell cannot exceed the number of fish in a cell because the population

is split across sets in cases where more than one set is conducted in a

cell. We have added this missing detail to our manuscript. We also added

a note that the survey is assumed to have no impact on the population

from one year to the next, and users planning to conduct surveys at

levels that will affect the overall population will need to use caution

in interpreting their results.*

L267: I recommend to clarify something here. 1. Depending on the number

of fish caught? What do you mean? What is the rule you used? 2. The way

you coded, sample by age is first decided, then the corresponding length

is calculated, then age-subsample is determined. In reality, length

sample is taken in the field, then age sub-samples are taken. While

similar, I do not think it always equal. Especially, when you start

including some correlation structure in the sampling. By the way, did

you consider including some correlation structure in the sampling

process to make more realistic?

*Our wording “depending on the number of fish caught” was unfortunately

vague and, in hindsight, is not adding anything to the paper. Therefore,

we have removed the statement to minimize confusion. We have also

clarified the sub-sampling sequence. Finally, we have yet to consider

including correlation structure in the sampling process as we went about

imposing correlation via the spatial correlation of age groups

(i.e. age-specific clustering tends to result in sets with high

intraclass correlation). Correlation in the sampling procedure could be

a future development for **SimSurvey**, but we feel at this time the

package is sufficiently complex and we did not want to overwhelm users

with too many options.*

L275 Table 1 on should be table 5

*We have changed the page numbers accordingly.*

L275: Table5: “age\\_sammpling” should be “age\\_sampling”

*We have made the suggested change.*

L275: “min\\_sets” you have not described it yet and what is it? You have

sample from all cells? If no, this is not realistic.

*We have clarified the meaning and utility of the min_sets argument

(i.e. a small strata may be allocated only one set under a low set

density scenario; this argument overrides the allocation and imposes the

min_sets if it is greater than the allocation).*

L279: Table5 not Table1?

*We have changed the numbers accordingly.*

L285-286: Could you be more specific on how custom closures can be

supplied and where?

*We have included an example in the second paragraph of the “Using

**SimSurvey**” section that we hope will clarify how a user can supply

a custom closure.*

L306: how are these catchability corrected abundance matrices

calculated? It is important to write this information somewhere (or

write “please refer to the section “Stratified analysis” for further

information on the calculation of abundance indices”) or something alike

and Appendix S3.

*We have clarified how this was calculated.*

L336: I think it would be good to say that other methods exist and

people can use it in this package (maybe)?

*Good point, however, we think this is covered by referencing a paper

that describes a geostatistical R package and we also note that other

options can be used under the “Research opportunities” section.*

L421: color gradient. Even though it is obvious it might be good to say

*We have made the suggested change.*

L427: instead of “sampling protocol”, I think it would be more meaning

full to say the maximum number of length samples.

*We have made the suggested change.*

L452: say that the color ramps from yellow to purple

*We have made the suggested change.*

S1 appendix: missing figure in S1

*We have included the figure*

------------------------------------------------------------------------

Reviewer \\#2:

This manuscript describes an R package called SimSurvey. The package

includes a set of functions for simulating point-based fisheries survey

designs, e.g. bottom trawl surveys, for estimating abundance indices. It

focuses on number of stations and number of fish sampled ignores other

constraints such as distance between stations and day-time duration

which impose strong constraints on real surveys. The functions included

allow the user to first simulate age-and length structured population

dynamics, distribute individuals randomly in space (assuming a certain

correlation structure), carry out a survey and finally calculate

abundance indices from the simulated data. The package will likely be of

interest to researchers wanting to explore the precision achievable with

different survey designs. My comments regarding the package and the

presentation are summarised below.

1. Optimization

The title announces a package for optimizing survey designs. As far as I

can see the package does not allow survey design optimization, neither

in terms of defining survey strata nor in terms of number of stations

per stratum. The strata are defined by the user. The only option

available for the number of stations is proportional to stratum surface;

the user sets the minimum number of stations taken in the smallest

stratum. It would be useful to be able to specify the total number of

stations and test different allocation schemes, such as proportional to

surface area (implemented), equal number per stratum, Neyman allocation

(accounting for surface area and abundance variability), etc. Please

consider revising the title (e.g. “compare” and instead of “optimize)”

and spell out the available sampling design options.

*We agree with the Reviewer and retitled the ms “SimSurvey: an R

package for comparing the design and analysis of fisheries surveys by

simulating spatially-correlated fish stocks”. Currently the package

provides options for stratified random sampling proportional to surface

area, but other survey designs could added in future versions of

**SimSurvey**.*

1. Manuscript structure

The manuscript might be easier to follow if the manuscript was

restructured: 1) Model description, 2) Using SimSurvey. The later

section would then group all example code which could again be

subdivided into running simulations and exploring results (plot

functions).

*This is an excellent suggestion! We have re-structured our manuscript

accordingly and feel that this structure will be much easier for a

1. Parameterisation To use SimSurvey for a real world problem realistic

parameter values are needed. The package comes with default values

chosen for a particular case study. However, no mention is made in

the manuscript how to choose appropriate values for the many model

parameters to tailor the simulations to a population of interest. I

suggest the authors add a section on parameterisation and a table

summarizing all parameters with a column specifying how to

parameterize. For example, parameters for population dynamics and

growth could probably be taken from the literature (or a stock

assessment report). However this is not possible for the parameters

of the spatial distribution function sim\\_distribution() such as

correlation between ages etc. Ideally the package would include a

fitting function for estimating these parameters from actual data.

These input data could come from a pilot survey and include location

(lat, long) and numbers by length/age.

*This is yet another very helpful suggestion and, now that we have

included a Parameterisation section, it is easy to see the value of such

a section to prospective users of the package. Instead of a table that

summarizes how to specify the parameters, we wrote an entire section

that outlines recommended steps for setting up a their own simulation.

We started drafting a table, however, it quickly became apparent that it

would be too big and cumbersome and somewhat redundant with the core

tables that describe the arguments and parameters. We hope that the

narrative/outline we included will serve the practical purpose the

Minor issues - line 93: I assume there is an age plus but this needs to

be mentioned. Also, please specify how you set the initial numbers for

plus group ().

*We have now noted in the manuscript that a plus group is not modeled

explicitly as the number of ages can easily be extended to include

groups with zero fish. This choice simplifies the simulation, including

the setting of initial numbers which is done via exponential decay.

Further, the lack of a plus group is inconsequential for survey based

estimates of abundance at age.*

- line 121: I don’t understand the explanation of a closure. What do

you mean by “return functions”? Do you mean it returns an object

with different attributes?

*We have improved our explanation of a closure at the beginning of the

Using **SimSurvey** section*

- line 125: the number of right and left brackets is unbalanced,

*We have provided an improved description of a closure; the logic behind

this line of code should be clearer now*

- line 126 “This structure was chosen to avoid the repeated

specifications of ages and years”. As far as I can see the example

code only specifies years, not ages.

*Again, we hope that our improved description of a closure will clarify

what we mean by this.*

- line 227 Please explain what a pipe is and how it is used. In the

example I understand that the output of sim\\_abundance( ) is

provided to (piped) into sim\\_distribution(). I am unclear what the

object b contains. Is it the result of sim\\_distribution()?

*We have clarified how a pipe works, noting that it forwards values from

one function call to the next function call, and we now state that the

output from the two examples provided (nested approach vs. pipe

approach) are functionally the same though the approach is slightly

different.*

- There is no table 1, please revise table numbering.

*We have made the suggested change.*

6 Jan 2020

PONE-D-19-26904R1

SimSurvey: an R package for comparing the design and analysis of fisheries surveys by simulating spatially-correlated fish stocks

PLOS ONE

Dear Dr. Regular,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please see the detailed suggestions below.

We would appreciate receiving your revised manuscript by Feb 20 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Daniel E. Duplisea, PhD

PLOS ONE

Dear Paul, you have made lots of excellent changes that I think help with the use of the package and responded to the many specific comments of reviewers which has no doubt corrected many technical issues and clarifications. The one thing I would say that you have not really done is get at the deeper research merits of this work beyond introducing a new piece of software. I think paper, as it stands, lacks a larger context and content which is important for the primary publication. My diagnosis for why this is is that the manuscript does not conform very well to more typical scientific reports (Intro, M&M, Result, Discussion) which can make it difficult for readers to find the larger scientific merits of the work. It is useful of course for those who already understand the merits of this kind of work but this work is for primary publication and it needs to appeal more to the former than the latter group. There is a very "how-to" feel to it (e.g. line 64 "In this section") which I think detracts from getting at the larger purpose of the work.

I would really like you to address this issue of moving it from a software manual to a primary scientific publication. I do not think it should involve that much work but there will be some restructuring of sections as well as places to put in content and bring out conclusions. Here are my suggestions for this:

Try to follow a more traditional paper structure. This will help readers and it likely will also make it clearer for you on how you can inject content into the paper to move it beyond the software manual approach:

Introduction:

You need to talk about wider issues and examples. e.g. examples of when poor survey design meant that scientific questions could not be properly addressed when good survey design meant they were. Examples of when good survey design allowed researchers to address needs that were unanticipated at the time it was designed. i.e. you need to build a better case (not just cost) of why survey design is really important and it is most powerful to do this with examples. You should try to bring in ideas related to ecosystem and climate changes and being able to track communities. Perhaps bring in species at risk ideas and tracking decline, you could bring in ideas related to MSC certification. These are just examples of specifics but you get the idea: the Introduction needs to have more general information outlining in both a broad and specific sense why we should be concerned about this.

Keep in mind an educated reader who may not be in fisheries but is interested in why anyone should care about this or could, for example, be interested in surveying say caribou or songbirds or something outside of fisheries but where many of the motivation and concepts may be similar and they are doing a more general literature search before designing their own survey.

Methods:

This will start at your "Model Structure" section. You should put a higher level heading just before that called Methods.

I can see that it is hard to separate your Methods from Results. Your results are really in the section "Using SimSurvey" I would not be opposed to putting this as a separate Results section but I leave it to you to decide. Essentially what you have is a case study as an example of how it works and therefore specific results are less interesting than how you got there with the package. You might title it something like "Results: running a SimSurvey simulation". This section also has a lot of content without a lot of explanation. for example, you have several large and complicated figures in a row. You should discuss not only what figures are meant to offer in the package but also what they mean. So for example on line 473 you refer to three figures (5,6,7) but you offer little interpretation of those figures just why you can make them which is another example of the limitations of the how-to manual approach. So you need to think of it as a case study and help someone decide the implications of their survey design.

Results:

see comment above

Discussion:

This starts just before "Research Opportunities". You can keep these sections but there should be more preamble before jumping right into research opportunities. I suggest a general paragraph(s) that segue into your subsections of "research opportunities", "future research". In the Discussion you should address again some of the broader issues from the Introduction, e.g. how could spatial (depth) distribution changes anticipated under climate change for some population be tackled by survey design exploration now so that we can continue to track these changing populations 20 years down the road and do not lose the signal. How can this software help with that?

I think something that can be useful for readers is to outline the steps in a thought process a researcher might undertake when setting up a survey (perhaps a separate subsection) and then how one might go about a SimSurvey run for this. It also gives you a good opportunity to discuss your multispecies ideas:
What are the constraints (HR and $perspective) What are the constraints from a biological, physical perspective Resources available What are your anticipated future needs (i.e. if climate change is going to make cod go deeper, will you be able to track it in 20 years?) How could you consider some or all of these with SimSurvey Your "Assumptions" section should come down into the Discussion "Future Directions" should say something about randomfields re: Reviewer 1. Even if you just outline that you have considered it. You might also try to say something about optimisation of design which Reviewer 2 mentioned. [Note: HTML markup is below. Please do not edit.] [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step. 20 Feb 2020 Additional Editor Comments: Dear Paul, you have made lots of excellent changes that I think help with the use of the package and responded to the many specific comments of reviewers which has no doubt corrected many technical issues and clarifications. The one thing I would say that you have not really done is get at the deeper research merits of this work beyond introducing a new piece of software. I think paper, as it stands, lacks a larger context and content which is important for the primary publication. My diagnosis for why this is is that the manuscript does not conform very well to more typical scientific reports (Intro, M&M, Result, Discussion) which can make it difficult for readers to find the larger scientific merits of the work. It is useful of course for those who already understand the merits of this kind of work but this work is for primary publication and it needs to appeal more to the former than the latter group. There is a very "how-to" feel to it (e.g. line 64 "In this section") which I think detracts from getting at the larger purpose of the work. I would really like you to address this issue of moving it from a software manual to a primary scientific publication. I do not think it should involve that much work but there will be some restructuring of sections as well as places to put in content and bring out conclusions. Here are my suggestions for this: Try to follow a more traditional paper structure. This will help readers and it likely will also make it clearer for you on how you can inject content into the paper to move it beyond the software manual approach: Introduction: - You need to talk about wider issues and examples. e.g. examples of when poor survey design meant that scientific questions could not be properly addressed when good survey design meant they were. Examples of when good survey design allowed researchers to address needs that were unanticipated at the time it was designed. i.e. you need to build a better case (not just cost) of why survey design is really important and it is most powerful to do this with examples. You should try to bring in ideas related to ecosystem and climate changes and being able to track communities. Perhaps bring in species at risk ideas and tracking decline, you could bring in ideas related to MSC certification. These are just examples of specifics but you get the idea: the Introduction needs to have more general information outlining in both a broad and specific sense why we should be concerned about this. - Keep in mind an educated reader who may not be in fisheries but is interested in why anyone should care about this or could, for example, be interested in surveying say caribou or songbirds or something outside of fisheries but where many of the motivation and concepts may be similar and they are doing a more general literature search before designing their own survey. *Thank you for thinking beyond a fisheries biologist audience, as this is the audience we have been targeting from the onset. In hindsight we see that was short-sighted so we have modified our introduction, as suggested, to pitch the concept to a broader audience. The reason we did not think beyond fisheries biology is because we tend to have access to extensive age-based data while others (e.g. songbird and caribou biologists) do not. However, there are certainly other marine or terrestrial situations that could use our approach, the general framework we now present should be useful to adapt our work to systems we have not even considered.* *Specifically, we have added an additional paragraph at the start of the paper discussing the importance of good survey design at a very general level. We also have removed references to fish and fisheries from some places in the abstract, to show our work can easily be generalized to other systems.* Methods: - This will start at your "Model Structure" section. You should put a higher level heading just before that called Methods. - I can see that it is hard to separate your Methods from Results. Your results are really in the section "Using SimSurvey" I would not be opposed to putting this as a separate Results section but I leave it to you to decide. Essentially what you have is a case study as an example of how it works and therefore specific results are less interesting than how you got there with the package. You might title it something like "Results: running a SimSurvey simulation". This section also has a lot of content without a lot of explanation. for example, you have several large and complicated figures in a row. You should discuss not only what figures are meant to offer in the package but also what they mean. So for example on line 473 you refer to three figures (5,6,7) but you offer little interpretation of those figures just why you can make them which is another example of the limitations of the how-to manual approach. So you need to think of it as a case study and help someone decide the implications of their survey design. *We really struggled to accommodate traditional Methods and Results sections because all of our re-structuring attempts defied the definitions of these sections. To maintain a logical flow that describes the how and why to use the package, we landed on a mix of methods and case study results/discussion through the body of the paper. We first describe the "Model structure", then the "Core functions", and then we describe and discuss the case study results in the "Interpretation" section.* *The new "Interpretation" section is essentially a shortened version of the appendix on the case study, however, this iteration of the manuscript includes one new result. In the process of revising the paper and reading more literature, we stumbled upon a potential design-based solution to the bias introduced by the division-level age-length key (default approach). The design-based fix was easily implemented using the package and, after running the necessary simulations, we discovered that the resultant estimates appear to be unbiased. We feel this is an interesting and useful addition to the paper as it shows that the package can be used to identify issues and explore solutions.* *We should also note that we have moved the "Parameterisation" section to an appendix. This was the section where we provide some guidance on how to modify default settings to suit specific needs. Following the addition of the "Interpretation" section, it became clear that this section no longer fits with the logical flow of the paper.* *Overall, we feel that the structural changes we have made to the paper have minimized the how-to content, and the increased focus on the case study should bring more meaning to the complex figures shown through the paper and highlight the real-world implications of the package.* *Below we discuss a number of our failed attempts to place the content into Methods and Results sections:* *1) We tried to start the methods section with "Model Structure", as suggested, and shunt the figures to a "Results" section; this, however, broke the flow and left a gap in the methods section which held demonstrations of the core simulation functions but not the plotting functions.* *2) We tried to keep the methods section as is and describe some of the case study results in the results section, however it seemed awkward to describe and discuss a solution to the age-length key problem in a results section.* *3) Earlier iterations of the paper were focused entirely on the case study, the format was traditional and the package was a side-note in the methods section. It became apparent, however, that the package was more interesting and generally applicable than the case study results. We then began re-working the paper into a how-to format and, while moving away from a conventional format, we looked to several papers as a guide on how to document a R package (e.g. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0092725).* Results: - see comment above *Please see reply to Methods comment.* Discussion: - This starts just before "Research Opportunities". You can keep these sections but there should be more preamble before jumping right into research opportunities. I suggest a general paragraph(s) that segue into your subsections of "research opportunities", "future research". In the Discussion you should address again some of the broader issues from the Introduction, e.g. how could spatial (depth) distribution changes anticipated under climate change for some population be tackled by survey design exploration now so that we can continue to track these changing populations 20 years down the road and do not lose the signal. How can this software help with that? *In contrast to a Methods or Results section, a Discussion section was easier to accommodate and we have provided one. Here we lean on the case study results to reiterate the importance of planning a survey or testing an existing survey. We then use this preamble to segue into the "Research opportunities" section as suggested.* - I think something that can be useful for readers is to outline the steps in a thought process a researcher might undertake when setting up a survey (perhaps a separate subsection) and then how one might go about a SimSurvey run for this. It also gives you a good opportunity to discuss your multispecies ideas: - What is your current problem/needs - What are the constraints (HR and$ perspective)

- What are the constraints from a biological, physical perspective

- Resources available

- What are your anticipated future needs (i.e. if climate change is going to make cod go deeper, will you be able to track it in 20 years?)

- How could you consider some or all of these with SimSurvey

*As much as we would like to address each of these points and work towards a survey design handbook, it would be hard to be as cohesive and comprehensive as Cochran (1977; "Sampling techniques") or Sutherland (2006; "Ecological census techniques: a handbook"). Moreover, we have yet to implement features that could help users assess some of these trade-offs (i.e. cost-benefit analysis). We are hopeful that these are questions that future versions of SimSurvey may be able to help address as more features are added to elevate it from a purely statistical toolbox to something that also considers the operational constraints of delivering a survey.*

- Your "Assumptions" section should come down into the Discussion

*Agreed, we have made the suggested change.*

- "Future Directions" should say something about randomfields re: Reviewer 1. Even if you just outline that you have considered it. You might also try to say something about optimisation of design which Reviewer 2 mentioned.

*Good point. We have noted in the "sim_distribution" section that a custom closure can be created that uses randomfields to simulate spatial noise. We have also noted in the "Future directions" section that we hope to implement alternatives to random or stratified random designs to allow for more comprehensive evaluations of various designs.*

2 Apr 2020

PONE-D-19-26904R2

SimSurvey: an R package for comparing the design and analysis of surveys by simulating spatially-correlated populations

PLOS ONE

Dear Dr. Regular,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by May 17 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Daniel E. Duplisea, PhD

PLOS ONE

Reviewer 2

Reviewer 2

[Note: HTML markup is below. Please do not edit.]

Reviewer's Responses to Questions

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Reviewer #1: This is my second time reviewing this manuscript and the authors did a great job addressing people's comments. Great job for that. The manuscript is now much clearer and read well, I think.

This time, I only have a few minor comments:

L19 : «there is a limited number of»

L103: Maybe it would be a good idea to justify why you are converting to abundance at length.

L240-242: I think it could be a good idea to show how to exactly do this (maybe obvious to advanced people but not to the general audience). This could be a simple reference to an example on the github repo.

Table 2: If bias correction was not implemented, I would suggest to change the labeling in the table here and elsewhere (if needed) and replace the “mean” by “median” (if the variable is transformed back to the original scale). If not, I would make clear that you talk about mean in log scale.

L326: I did not see how one can use INLA or RandomFields in Appendix S3. This is a similar comment as the one for line 240-242.

L481-490: the processing time is surprisingly long… it is not an issue in itself but this might refrain some people to using this package… I think some code optimization could help (e.g. vectorize, use matrix operation as much as possible, etc)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

16 Apr 2020

Reviewer #1: This is my second time reviewing this manuscript and the authors did a great job addressing people's comments. Great job for that. The manuscript is now much clearer and read well, I think.

*We thank the reviewer for the kind words.*

This time, I only have a few minor comments:

L19 : «there is a limited number of»

*We have removed the 'the' that should not have been in that sentence.*

L103: Maybe it would be a good idea to justify why you are converting to abundance at length.

*Agreed, we have started a new paragraph regarding abundance at length which begins: "In practice, abundance at age is often inferred from length data as it is easier to collect. Abundance at length is therefore simulated from abundance at age using the original von Bertalanffy growth curve"*

L240-242: I think it could be a good idea to show how to exactly do this (maybe obvious to advanced people but not to the general audience). This could be a simple reference to an example on the github repo.

*Indeed, this is a helpful idea and we have now created a short vignette on creating closures and this can be accessed via the pkgdown site (https://paulregular.github.io/SimSurvey/articles/custom_closures.html). A link to this vignette has been added to the paper.*

Table 2: If bias correction was not implemented, I would suggest to change the labeling in the table here and elsewhere (if needed) and replace the “mean” by “median” (if the variable is transformed back to the original scale). If not, I would make clear that you talk about mean in log scale.

*Good catch, we have added log after mean to ensure that readers know the functions are operating in log scale.*

L326: I did not see how one can use INLA or RandomFields in Appendix S3. This is a similar comment as the one for line 240-242.

*We have added the following in parentheses: see the code behind sim_ays_covar_sped [https://github.com/PaulRegular/SimSurvey/blob/master/R/sim_dist_spde.R] for an example of how the sim_ays_covar closure was modified to apply a Stochastic Partial Differential Equation approach using the INLA package.*

L481-490: the processing time is surprisingly long… it is not an issue in itself but this might refrain some people to using this package… I think some code optimization could help (e.g. vectorize, use matrix operation as much as possible, etc)

*We have stretched our programming skills to the limit to reduce the processing time as much as possible. Of course, said programming skills are largely self-taught, so there is surely room for improvement. While developing the package, we repeatedly profiled our code and vectorized as much as we could, leaning heavily on the data.table package. We have also applied parallel processing in places in an attempt to speed up the code. In other parts of the package, we learned that there is little we can do to reduce the processing time (e.g. Cholesky decomposition is a factor limiting the speed of the sim_distribution function). Another step to further optimize the code would be to translate some of the core functions to C++, however this is on the wish list and well outside the scope of the project at this point in time. We are hopeful that we have produced enough here to catalyze collaborations that help extend and optimize the package.*

23 Apr 2020

SimSurvey: an R package for comparing the design and analysis of surveys by simulating spatially-correlated populations

PONE-D-19-26904R3

Dear Dr. Regular,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Daniel E. Duplisea, PhD

PLOS ONE

27 Apr 2020

PONE-D-19-26904R3

SimSurvey: an R package for comparing the design and analysis of surveys by simulating spatially-correlated populations

Dear Dr. Regular:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr Daniel E. Duplisea