Introduction

Tree-based models partition the data into increasingly homogeneous groups of presences and absences based on their relationship to a set of environmental variables, the predictor variables. The single classification tree is the most basic form of a decision tree model. As the name suggests, a classification tree resembles a tree and consists of three types of nodes, connected by directed edges (branches):

  • Root node: no incoming branches - this represents the undivided data at the top
  • Internal nodes: have exactly 1 incoming branch, and 2 or more outgoing branches
  • Leaf nodes (= terminal nodes): have exactly 1 incoming branch, and no outgoing branches
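
To make this node structure concrete, here is a minimal sketch in R that fits and prints a single classification tree with the rpart package (the package underlying BCCVL's implementation). The data frame `species_data` and its columns are hypothetical placeholders, not part of BCCVL.

    library(rpart)

    # Fit a single classification tree: presence/absence ('occurrence',
    # 1 = presence, 0 = absence) as a function of two hypothetical
    # environmental predictors.
    fit <- rpart(occurrence ~ temperature + rainfall,
                 data = species_data, method = "class")

    # The printed tree lists the root node, the internal split nodes,
    # and the leaf (terminal) nodes, which rpart marks with '*'.
    print(fit)
    plot(fit); text(fit)   # quick dendrogram view of the same tree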

Classification Tree Analysis consists of three steps:

  1. Growing: calibration of the tree starts with the complete dataset as one group, forming the root node. The tree is then grown by repeatedly splitting the data into increasingly homogeneous groups. Each split uses the environmental variable that best divides the data into two groups, such that at least one of the resulting groups is as homogeneous as possible (a worked example of the split criterion follows this list). If a group is not homogeneous, for example because it still contains a mix of presence and absence records, it is split further. Splitting continues until a stopping criterion is met (step 2).
  2. Stopping: the splitting process stops when a set of predefined criteria is met. This can happen when further splitting is impossible because all remaining observations have similar values of the predictor variables, so all groups are relatively homogeneous and no further improvement to the model can be made. Splitting is also stopped when the number of observations in a terminal node would fall below a predefined minimum, or when a maximum number of splits in the tree is reached.
  3. Pruning: reducing the complexity of the tree to avoid overfitting the data. This is achieved by keeping only the most important splits (see the second sketch below).
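
The split criterion in step 1 can be made concrete with a small worked example. For classification trees, rpart uses the Gini impurity index by default; the node counts below are invented purely for illustration.

    # Gini impurity of a set of presence/absence labels:
    # G = 1 - p_presence^2 - p_absence^2 (0 = pure, 0.5 = maximally mixed).
    gini <- function(labels) {
      p <- table(labels) / length(labels)
      1 - sum(p^2)
    }

    # Hypothetical parent node: 10 presences (1) and 10 absences (0).
    parent <- c(rep(1, 10), rep(0, 10))

    # A candidate split (e.g. 'temperature < 18') yielding two
    # fairly pure child groups.
    left  <- c(rep(1, 9), rep(0, 1))    # 9 presences, 1 absence
    right <- c(rep(1, 1), rep(0, 9))    # 1 presence,  9 absences

    # The tree keeps the split with the largest weighted impurity decrease.
    n <- length(parent)
    gini(parent) -
      (length(left) / n) * gini(left) -
      (length(right) / n) * gini(right)   # 0.5 - 0.18 = 0.32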
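
Steps 2 and 3 map onto rpart's control and pruning functions. The sketch below is illustrative only: it reuses the hypothetical `species_data` from the earlier sketch, and the parameter values shown are not BCCVL defaults.

    library(rpart)

    # Stopping (step 2): rpart.control() limits how far the tree is grown.
    ctrl <- rpart.control(minsplit  = 20,    # don't split groups of < 20 obs
                          minbucket = 7,     # each terminal node keeps >= 7 obs
                          maxdepth  = 10,    # cap the number of split levels
                          cp        = 0.01,  # a split must improve fit by >= cp
                          xval      = 10)    # 10-fold CV for the pruning table

    fit <- rpart(occurrence ~ temperature + rainfall,
                 data = species_data, method = "class", control = ctrl)

    # Pruning (step 3): cut the tree back to the complexity parameter (cp)
    # with the lowest cross-validated error, keeping only the key splits.
    printcp(fit)                                       # inspect the cp table
    best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned  <- prune(fit, cp = best_cp)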

Although classification trees are a useful tool for visualizing the hierarchical effects of multiple environmental variables on species occurrence, they are often criticized for being unstable and having relatively low predictive accuracy. This has led to the development of other methods that build upon classification trees, such as random forests and boosted regression trees.

Advantages

  • Simple to understand and interpret
  • Can handle both numerical and categorical data
  • Identify hierarchical interactions between predictors
  • Characterize threshold effects of predictors on species occurrence
  • Robust to missing values and outliers

Limitations

  • Less effective for linear or smooth species responses due to the stepwise approach
  • Requires large datasets to detect patterns, especially with many predictors
  • Very unstable: small changes in the data can change the tree considerably

Assumptions

Classification trees make no formal distributional assumptions; they are non-parametric and can thus handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.

Requires absence data?

Yes. Classification requires examples of both classes, so absence (or pseudo-absence) records are needed alongside presence records.

Configuration options

BCCVL uses the 'rpart' package, implemented in biomod2. The user can set a number of configuration options, which correspond to the stopping and pruning controls of rpart described above.
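
As an indication of how such options are passed through, the sketch below assumes a biomod2 version that exposes BIOMOD_ModelingOptions() (newer biomod2 releases use a different options interface); the parameter values are illustrative, not BCCVL defaults.

    library(biomod2)

    # Hedged sketch: CTA options forwarded to rpart via biomod2
    # (assumes the BIOMOD_ModelingOptions() interface; values illustrative).
    cta_opts <- BIOMOD_ModelingOptions(
      CTA = list(method  = "class",
                 control = list(minsplit  = 20,
                                minbucket = 7,
                                maxdepth  = 10,
                                cp        = 0.01,
                                xval      = 10)))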

References

  • Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman and Hall, New York, USA.
  • De'ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81(11): 3178-3192.
  • Franklin J (2010) Mapping species distributions: spatial inference and prediction. Cambridge University Press, Cambridge, UK.