Practical Field Ecology. C. Philip Wheater
we are in the realm of inferential statistics. These usually involve the testing of hypotheses. It is standard practice to set up a null hypothesis alongside the questions to be asked. The null hypothesis states that there is no significant difference between samples (or relationship between variables, or association between categories of variables), and the statistical test assesses how likely the observed result would be if this were true. So if we wish to know whether there is a difference between two samples (e.g. comparing the number of birds found in deciduous woodlands with the number found in coniferous woodlands), then we actually test the null hypothesis that there is no significant difference between the number of birds in deciduous and coniferous woodlands. Note that we are looking at ‘significant’ differences. These are differences that are unlikely to have resulted from random variation in the individual woodlands sampled. For this we need a method that tests the null hypothesis that there is no significant difference in the sample averages. In addition to difference tests between samples, there are also relationship tests between variables, and tests designed to examine associations between categories of variables. Table 1.3 summarises some commonly used, relatively simple, statistical approaches to these research questions.
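As a minimal sketch of what testing a null hypothesis of no difference involves, the example below uses a permutation test on the difference in sample means. This is only one of several possible difference tests and is not necessarily among those listed in Table 1.3; the function name `permutation_test` and the bird counts are hypothetical and chosen purely for illustration.

```python
import random
from statistics import mean


def permutation_test(sample_a, sample_b, n_permutations=10000, seed=0):
    """Two-sided permutation test of the null hypothesis that the two
    samples do not differ in their means. Returns an approximate p-value."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so we shuffle the pooled counts and split them again.
        rng.shuffle(pooled)
        difference = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if difference >= observed:
            as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (as_extreme + 1) / (n_permutations + 1)


# Hypothetical bird counts from eight woodlands of each type.
deciduous = [24, 31, 28, 35, 27, 30, 33, 29]
coniferous = [15, 18, 22, 17, 20, 16, 19, 21]

p_value = permutation_test(deciduous, coniferous)
print(f"Approximate p-value: {p_value:.4f}")
```

A small p-value indicates that a difference as large as the one observed would rarely arise from random variation alone, so the null hypothesis would be rejected at the chosen significance level (conventionally 0.05).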
Since there are various questions that we might ask as part of an investigation, it is important to be clear about possible analysis methods in advance of any sampling. The choice of test depends not only on the question being asked, but also on the data types being used. Where data are ranked, but not measured (i.e. ordinal data – p. 27) then a suite of tests called nonparametric tests may be used. The alternative (using parametric tests) is more powerful and generally preferred, but requires data to be on a measurement scale (i.e. interval/ratio data). Therefore, it is usually an advantage to obtain measurement data rather than to rank data wherever possible. Even where measurements are taken, parametric tests may not be the most appropriate. This is because most parametric tests require the data to conform to a type of distribution called a normal distribution. Briefly, this is assessed by examining histograms of the data (with the variable of interest plotted on the x axis and the frequency of its occurrence on the y axis) to see whether they have a symmetrical pattern (see Figure 1.6). For further details about the shape of distributions, and of which test to use, see Chapter 5. There are also different tests depending on whether the data are matched or unmatched (p. 305).
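The histogram check described above can be sketched as follows. The text histogram mirrors the visual inspection; the skewness statistic is an addition not mentioned in the text, included here only as a rough numerical indication of asymmetry (values near zero suggest a symmetrical pattern). The function names and both data sets are hypothetical.

```python
from collections import Counter
from statistics import mean


def text_histogram(values, bin_width):
    """Return one text row per bin, from the lowest to the highest
    occupied bin, with one asterisk per observation."""
    counts = Counter(int(v // bin_width) for v in values)
    rows = []
    for b in range(min(counts), max(counts) + 1):
        lower = b * bin_width
        rows.append(f"{lower:>4}-{lower + bin_width:<4}| {'*' * counts[b]}")
    return rows


def skewness(values):
    """Moment-based sample skewness: third central moment divided by
    the second central moment raised to the power 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical measurements: one roughly symmetrical, one right-skewed.
symmetric = [12, 14, 15, 16, 16, 17, 17, 17, 18, 18, 19, 20, 22]
skewed = [0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 6, 9, 15]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, f"skewness = {skewness(data):.2f}")
    print("\n".join(text_histogram(data, bin_width=2)))
```

A strongly skewed histogram, such as the second one here, would suggest either transforming the data or choosing a nonparametric test.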
To illustrate some of the considerations in project design and data collection, we start with a research question that sounds relatively simple on the face of it: is there a relationship between the size of trees and the number of squirrels' dreys in the canopy of the trees? Ideally, we would want to measure the canopy height with some degree of accuracy. This would enable us to work out whether the relationship exists using a parametric statistical technique called Pearson's product moment correlation analysis (p. 308). However, it may be difficult even to see the tops of very tall trees and those obscured by other trees. Thus, we may estimate tree height, perhaps into several groupings. We can of course rank these data, but this means that we need an alternative approach for analysis that is suitable for ordinal data. This is Spearman's rank correlation coefficient analysis, which is not quite as powerful as Pearson's method. The power of a test is its ability to detect a true relationship (or difference, or association) if one exists. If we knew that any such relationship was likely to be fairly weak, then the less powerful technique might not reveal it, and we would be wasting our time unless we measured the trees accurately enough to obtain measurement data and so could employ the more powerful test. Alternatively, if we are only interested in revealing strong relationships, then using ranked size classes to indicate tree height may be acceptable. The other complexities in this apparently simple question include ensuring that all other aspects are as constant as possible (e.g. species of tree, surrounding landscape, density of the squirrel colony, etc.).
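The two correlation approaches above can be compared in a short sketch. Pearson's coefficient is calculated from the measured heights, while Spearman's coefficient is obtained here by applying the same formula to the ranks (with tied values sharing their average rank). The tree heights, height classes, and drey counts are hypothetical, as are the function names; significance testing of the coefficients is not shown.

```python
from statistics import mean


def pearson(x, y):
    """Pearson's product moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data for eight trees.
height_m = [8.5, 12.0, 15.5, 18.0, 21.5, 24.0, 27.5, 30.0]  # measured
height_class = [1, 1, 2, 2, 3, 3, 4, 4]                     # estimated classes
dreys = [0, 1, 1, 2, 2, 4, 3, 5]

print(f"Pearson (measured heights): {pearson(height_m, dreys):.3f}")
print(f"Spearman (height classes):  {spearman(height_class, dreys):.3f}")
```

Grouping the heights into four classes produces many tied ranks, which is one reason why the ranked approach carries less information than the measured one.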
Figure 1.6 Data set approximating to a normal distribution.
Predictive analysis
We may wish to collect data to set up a model that enables us to predict the outcome in a hypothetical situation. One of the simplest such models is known as a linear regression model. Thus, if we are interested in a possible relationship between woodland size and the number of birds, and know that this is likely to be a significant linear relationship, then we may wish to use this fact to calculate the expected number of birds found in any woodland. This could be used theoretically or in conservation management to check that we have the sort of bird biodiversity that we expect from other data. Here, it is important to note that any such prediction should only be made if the woodland area in which we are interested lies between the minimum and maximum values of the data set we used to establish the model. We first need to establish which variable is the dependent and which is the independent variable: that is, which is likely to be affected (the dependent or response variable – plotted on the y axis of a scatterplot) by the other (independent variable – plotted on the x axis of a scatterplot). Here, obviously, the number of birds (the dependent variable) is more likely to be dependent on the size of the woodland (the independent variable) than vice versa. We can think of this as woodland size driving the size of the bird count. We can extend the technique to cover the case where there are a number of independent variables (e.g. woodland area, habitat diversity, area of associated green space, distance to nearest waterbody) that might influence or drive bird numbers.
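A minimal sketch of such a model is given below, assuming an ordinary least squares fit with a single independent variable. The `predict` function refuses to make predictions outside the range of woodland areas used to fit the line, reflecting the caution given above. The woodland areas, bird counts, and function names are hypothetical, and testing the significance of the relationship is not shown.

```python
from statistics import mean


def fit_line(x, y):
    """Least squares estimates of the slope and intercept of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept


def predict(x_new, slope, intercept, x_observed):
    """Predict y for x_new, but only within the observed range of x."""
    lowest, highest = min(x_observed), max(x_observed)
    if not lowest <= x_new <= highest:
        raise ValueError(
            f"{x_new} lies outside the observed range {lowest} to {highest}"
        )
    return intercept + slope * x_new


# Hypothetical data: woodland area (ha) is the independent variable,
# number of birds counted is the dependent variable.
area_ha = [2, 5, 8, 12, 20]
birds = [8, 12, 20, 26, 44]

slope, intercept = fit_line(area_ha, birds)
expected = predict(10, slope, intercept, area_ha)
print(f"Expected number of birds in a 10 ha woodland: {expected:.1f}")
```

Asking for a prediction for a 50 ha woodland would raise an error here, since the largest woodland in the data set is only 20 ha.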
Multivariate analysis
Where the question to be asked is a complicated one involving a number of dependent and/or independent variables, then multivariate analyses may be appropriate. The choice of analysis depends on whether the dependent variable is a category, or a ranked or measured variable, and on whether the independent variables are categories, ranked, or measured (or even a mixture). Whilst most such analyses only have one dependent variable, there may be multiple independent variables. For example, we may want to know whether the number of birds differs in different types of woodland when we take into account the woodland size (measured variable), woodland type (nominal variable), distance to the nearest neighbouring woodland (measured variable), age of woodland (measured variable), and the land use type surrounding the woodland (nominal variable). Here, we could enter all of the data into one analysis that would take into account the interrelationships among the variables and produce a model describing the relative importance of each variable in determining the number of birds (this particular example could be analysed using a generalized linear model – p. 319). Such techniques are powerful but require a full understanding of the data set and its attributes, and may be quite complex to interpret.
Examining patterns and structure in communities
Ecological data sets can be very complex and difficult to visualise. For example, a data set might include many variables collected as measurements (including counts), as ranks (e.g. scores of abundance), or in a binary form (e.g. presence or absence data). Chapter 5 introduces a number of techniques for visualising complex data sets that can accommodate a range of different types of data. Variables with large numbers of observations of zero (as can occur when surveying relatively rare species), cases where data are heavily skewed, or situations where variables are measured on scales of greatly differing magnitude, may require data transformation before using these techniques (p. 285).
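Two transformations that are commonly used in such situations are sketched below. The text does not specify which transformations to apply, so the choice here is an assumption: a log(x + 1) transformation for skewed counts containing zeros, and standardisation (subtracting the mean and dividing by the standard deviation) for variables on very different scales. The data and function names are hypothetical.

```python
import math
from statistics import mean, stdev


def log_transform(counts):
    """log10(x + 1), which is defined for zero counts and compresses
    the long right-hand tail of skewed count data."""
    return [math.log10(c + 1) for c in counts]


def standardise(values):
    """Rescale to a mean of 0 and a sample standard deviation of 1, so
    that variables on different scales become comparable."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical counts of a rare species across ten sites.
counts = [0, 0, 0, 0, 1, 0, 2, 0, 9, 99]
print([round(v, 2) for v in log_transform(counts)])

# Hypothetical variables on very different scales.
woodland_area_ha = [2, 5, 8, 12, 20]
woodland_age_years = [40, 150, 85, 300, 220]
print([round(v, 2) for v in standardise(woodland_area_ha)])
print([round(v, 2) for v in standardise(woodland_age_years)])
```

After standardisation, both woodland variables are expressed in units of standard deviations from their own mean, so neither dominates an analysis simply because of its units.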
As an example, we might collect information about woodlands on the basis of their size, age, distance to the nearest neighbouring woodland, etc. Since some of these variables will be related to each other, we might wish to find out the underlying pattern of interrelationships within the data and hence identify a number of unrelated factors that can be used instead of our large number of variables. This is a data reduction exercise, reducing the number of variables we have measured into a smaller number of unrelated factors that take into account the interrelationships between the