AFLP data of Veronica plants

Downloads: Veronica.arff, Veronica.csv, Veronica.RData

If you use this data set in publications please cite
Martinez-Ortega, M. M., L. Delgado, D. C. Albach, J. A. Elena-Rossello, and E. Rico (2004). Species boundaries and phylogeographic patterns in cryptic taxa inferred from AFLP markers: Veronica subgen. Pentasepalae (Scrophulariaceae) in the Western Mediterranean. Systematic Botany 29, 965-986.

Summary of the data

Contributor:
Christian Hennig
Source:
The dataset appeared in Martinez-Ortega, M. M., L. Delgado, D. C. Albach, J. A. Elena-Rossello, and E. Rico (2004). Species boundaries and phylogeographic patterns in cryptic taxa inferred from AFLP markers: Veronica subgen. Pentasepalae (Scrophulariaceae) in the Western Mediterranean. Systematic Botany 29, 965-986. It was retrieved from www.treebase.org in nexus format and converted by Christian Hennig; spatial coordinates were provided by Bernhard Hausdorf using maps given in Martinez-Ortega et al. (2004).
License:
PDDL

General information about the data

Abstract:
Binary (0/1) data indicating, for each of 207 plant individuals of Veronica (Pentasepalae) from the Iberian Peninsula and Morocco, whether dominant markers are present at 583 different AFLP bands ranging from 61 to 454 bp (Martinez-Ortega et al., 2004). Additionally, the species according to Martinez-Ortega et al. (2004) and regional coordinates are given.
Subject matter background:
The data set can be used for species delimitation, so the issue is to find out how many plant species there are in this dataset, and which plants belong to which species. Species are defined by interbreeding (although there is some controversy about the definition of species, see Hausdorf (2011)), which means that a much larger genetic similarity is expected within species than between species. There is a vector of species already in the data set, which may be seen as the "true classification", but this can be challenged because the technique used to determine it is far from perfect (although more information than just the given data set was used for this).

Bernhard Hausdorf told me that even within species lower similarity can occur provided that the individuals are from very different locations, because then interbreeding may not have happened even though possible in principle. For this reason, locations are provided.
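
To relate genetic similarity to geographic separation as described above, one needs a distance between two locations. The following is a minimal sketch, assuming the coordinates are decimal degrees as described under "Data values" and that a great-circle (haversine) distance on a spherical Earth is accurate enough. The function name and the radius constant are choices made for this illustration, not part of the data set.

```python
import math

# Mean Earth radius in km; an assumption of this sketch (spherical Earth).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees.

    Negative latitude is South and negative longitude is West,
    matching the convention of the coordinate variables.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

Two individuals of the same species with low genetic similarity and a large value of haversine_km between their locations would be consistent with the explanation given above.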

Data structure:
object x variables data matrix
Data objects and variables:
Objects are 207 Veronica plants. There are 586 variables. The first variable gives species as indicated in Martinez-Ortega et al. (2004). Clustering should use the binary variables 2-584 (AFLP bands). The last two variables give location coordinates of the plants; see above for why these may be relevant.
Data values:
Variable 1: abbreviated names of 8 species; the full names are V. (Veronica) tenuifolia, V. javalambrensis, V. fontqueri, V. rosea, V. orsiniana, V. aragonensis, V. scheereri, V. sennenii. Variables 2-584 take values 0 and 1. Variables 585-586 are coordinates of locations of individuals in decimal format, i.e. the first number is latitude (negative values are South), with minutes and seconds converted to fractions. The second number is longitude (negative values are West).
Preprocessing:
These are original data; preprocessing may not be necessary and is left to the user.
Other relevant papers:
The dataset was analysed here: Hausdorf, B. and C. Hennig (2010). Species delimitation using dominant and codominant multilocus markers. Systematic Biology 59, 491-503.

Some background on species: Hausdorf, B. (2011). Progress toward a general species concept. Evolution 65, 923-931.
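
The column layout described under "Data objects and variables" and "Data values" can be unpacked as in the following sketch. It assumes that the CSV file is comma-separated, has one header row, has no extra row-name column, and keeps the variable order given above (species first, coordinates last). None of this is confirmed here, so it should be checked against the actual file. The function name is hypothetical.

```python
import csv
import io


def split_veronica_rows(csv_text, has_header=True):
    """Split rows into species labels, 0/1 band vectors, and (lat, lon) pairs.

    Assumes column 1 is the species, the last two columns are latitude and
    longitude, and everything in between is an AFLP band (0 or 1).
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if has_header:
        rows = rows[1:]
    species, bands, coords = [], [], []
    for row in rows:
        band_values = [int(value) for value in row[1:-2]]
        if any(value not in (0, 1) for value in band_values):
            raise ValueError("AFLP band columns must contain only 0 and 1")
        species.append(row[0])
        bands.append(band_values)
        coords.append((float(row[-2]), float(row[-1])))
    return species, bands, coords
```

If the file matches the description, reading Veronica.csv and passing its text to this function should give 207 rows with 583 band values each; a different count would indicate that one of the assumptions above does not hold.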

Justification for clustering:
Delimitation of plant species

External criteria for clustering quality

External variable that represents the underlying true clustering
Variable "Species"; species as given by Martinez-Ortega et al. (2004). However, there are no species delimitation methods that are 100% reliable and Martinez-Ortega et al. themselves in their original paper raise some controversy about their species delimitation; Table 1 in that paper gives alternative proposals for species. This is useful as a reference but cannot be used as ultimate ground truth.
Substantive justification
See p.966 of Martinez-Ortega et al. (2004).
Pragmatic aim of clustering
Doesn't apply.
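
Agreement between a clustering and the "Species" variable can be quantified with a partition similarity index. The sketch below uses the adjusted Rand index, which is my choice for illustration and is not prescribed by this description. Because the species variable is only a reference and not an ultimate ground truth, a value below 1 does not necessarily mean that the clustering is wrong.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label vectors must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: partitions trivially agree
    # Pairs placed together in both partitions (contingency table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both partitions are all-in-one or all-singletons
    return (together_both - expected) / (maximum - expected)
```

The index does not depend on how clusters are named, so cluster numbers from an algorithm can be compared directly with the abbreviated species names.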

Internal criteria for clustering quality: cluster membership

Nature of clusters
Crisp; species are traditionally seen as a crisp concept in biology
Cluster overlap
Non-overlapping; species are normally seen as non-overlapping, although one could argue for overlap because hybrids exist.

Internal criteria for clustering quality: within and between cluster features

Ground for objects to belong to the same cluster:
There should be genetic exchange within species; this means that members of the same species should be genetically connected, i.e. there should be "paths" from one to the other. These paths should not be too thin (like chaining in Single Linkage), because there are hybrids between species.

Unfortunately, in the data, paths may be broken by incomplete sampling of individuals.
Weight: 4, moderately important
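
The idea of "paths" within a cluster can be illustrated with a threshold graph: two plants are linked if their dissimilarity is at most some threshold, and a cluster is connected if every member can be reached from every other through such links. The sketch below makes two assumptions that are not part of this description: it uses the Jaccard dissimilarity on the 0/1 band vectors (a common choice for presence/absence data because shared absences are ignored), and the threshold is chosen by the user. Note that this plain check is exactly the kind of connectivity that Single Linkage uses, so on its own it does not rule out thin chains.

```python
from collections import deque


def jaccard_dissimilarity(a, b):
    """1 - (bands present in both) / (bands present in at least one)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Two all-zero profiles are treated as identical (an assumption).
    return 0.0 if either == 0 else 1.0 - both / either


def is_connected(profiles, threshold):
    """True if all profiles are linked by steps of dissimilarity <= threshold."""
    n = len(profiles)
    if n == 0:
        return True
    seen = {0}
    queue = deque([0])
    # Breadth-first search over the threshold graph, starting at profile 0.
    while queue:
        i = queue.popleft()
        for j in range(n):
            if j not in seen and jaccard_dissimilarity(profiles[i], profiles[j]) <= threshold:
                seen.add(j)
                queue.append(j)
    return len(seen) == n
```

A stricter variant could additionally require each plant to have several neighbours within the threshold, which would be closer to the demand that paths should not be too thin.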

Ground for objects to belong to different clusters:
The opposite of the "common grounds"; genetic gaps should appear between species.
Weight: 5, very important
Stability of clustering:
The occurrence of hybrids between species shouldn't destroy the clustering (although this may be an unrealistic demand for small clusters).
Weight: 3, of some importance
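
The "genetic gaps" criterion above can be checked on a given clustering by comparing the smallest dissimilarity between plants of different clusters with the dissimilarities to nearest neighbours within clusters. The following is a rough sketch under the assumption that a symmetric dissimilarity matrix has already been computed from the AFLP bands; the function name and the particular summary are chosen for illustration only.

```python
def gap_summary(dist, labels):
    """Return (min between-cluster dissimilarity,
               max within-cluster nearest-neighbour dissimilarity).

    dist is a square matrix (list of lists); labels gives the cluster of
    each object. An entry is None if it cannot be computed, e.g. when
    there is only one cluster or every cluster is a singleton.
    """
    n = len(labels)
    min_between = None
    max_nn_within = None
    for i in range(n):
        nearest_same = None
        for j in range(n):
            if i == j:
                continue
            d = dist[i][j]
            if labels[i] == labels[j]:
                if nearest_same is None or d < nearest_same:
                    nearest_same = d
            elif min_between is None or d < min_between:
                min_between = d
        # Singletons have no within-cluster neighbour and are skipped.
        if nearest_same is not None and (max_nn_within is None or nearest_same > max_nn_within):
            max_nn_within = nearest_same
    return min_between, max_nn_within
```

A clear gap would show up as a first value that is noticeably larger than the second. Hybrids between species can reduce the first value, which is one reason why the stability criterion above matters.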

Further aspects

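Location data (last two variables) are not part of the clustering input (variables 2-584), but they may help to interpret low similarity within species between individuals from very different locations, as explained in the subject matter background.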