AFLP data of Veronica plants
Downloads: Veronica.arff,
Veronica.csv,
Veronica.RData
If you use this data set in publications please cite
Martinez-Ortega, M. M., L. Delgado, D. C. Albach, J. A. Elena-Rossello, and E. Rico (2004). Species boundaries and phylogeographic patterns in cryptic taxa inferred from AFLP markers: Veronica subgen. Pentasepalae (Scrophulariaceae) in the Western Mediterranean. Systematic Botany 29, 965-986.
Summary of the data
- Contributor:
- Christian Hennig
- Source:
- The dataset appeared in Martinez-Ortega, M. M., L. Delgado, D. C. Albach, J. A. Elena-Rossello, and E. Rico (2004). Species boundaries and phylogeographic patterns in cryptic taxa inferred from AFLP markers: Veronica subgen. Pentasepalae (Scrophulariaceae) in the Western Mediterranean. Systematic Botany 29, 965-986. It was retrieved from www.treebase.org in nexus format and converted by Christian Hennig; spatial coordinates were provided by Bernhard Hausdorf using maps given in Martinez-Ortega et al. (2004).
- License:
- PDDL
General information about the data
- Abstract:
- 0-1 data indicating the presence or absence of dominant markers at 583 different AFLP bands (ranging from 61 to 454 bp) for 207 plant individuals of Veronica (Pentasepalae) from the Iberian Peninsula and Morocco (Martinez-Ortega et al., 2004). Additionally, species according to Martinez-Ortega et al. (2004) and regional coordinates are given.
- Subject matter background:
- The data set can be used for species delimitation, so the issue is to find out how many plant species there are in this dataset, and which plants belong to which species. Species are defined by interbreeding (although there is some controversy about the definition of species, see Hausdorf (2011)), which means that a much larger genetic similarity is expected within species than between species. There is a vector of species already in the data set, which may be seen as the "true classification", but this can be challenged because the technique used to determine the species is far from perfect (although more information than just the given data set was used for this).
Bernhard Hausdorf told me that even within species lower similarity can occur provided that the individuals are from very different locations, because then interbreeding may not have happened even though possible in principle. For this reason, locations are provided.
- Data structure:
- object x variables data matrix
- Data objects and variables:
- Objects are 207 Veronica plants. There are 586 variables. The first variable gives species as indicated in Martinez-Ortega et al. (2004). Clustering should use the binary variables 2-584 (AFLP bands). The last two variables give location coordinates of the plants; see above for why these may be relevant.
- Data values:
- Variable 1: abbreviated names of 8 species; the full names are V. (Veronica) tenuifolia, V. javalambrensis, V. fontqueri, V. rosea, V. orsiniana, V. aragonensis, V. scheereri, V. sennenii.
Variables 2-584 take values 0 and 1.
Variables 585-586 are coordinates of locations of individuals in decimal format, i.e. the first number is latitude (negative values are South), with minutes and seconds converted to fractions. The second number is longitude (negative values are West).
- Preprocessing:
- These are original data; preprocessing may not be necessary and is left to the user.
- Other relevant papers:
- The dataset was analysed here:
B. Hausdorf and C. Hennig (2010). Species Delimitation Using Dominant and Codominant Multilocus Markers. Systematic Biology 59, 491-503.
Some background on species:
Hausdorf B (2011) Progress toward a general species concept. Evolution 65: 923–931
- Justification for clustering:
- Delimitation of plant species
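The column layout described above (species, then 583 binary AFLP bands, then latitude and longitude) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `split_row` and `jaccard_dissimilarity` are hypothetical, the toy rows are made up and use 4 bands instead of 583, and the Jaccard dissimilarity is one common choice for presence/absence data, not necessarily the measure used in the papers cited here.

```python
from typing import List, Sequence, Tuple

N_BANDS = 583  # variables 2-584 in the description above


def split_row(row: Sequence, n_bands: int = N_BANDS) -> Tuple[str, List[int], Tuple[float, float]]:
    """Split one record into (species, bands, (latitude, longitude))."""
    if len(row) != n_bands + 3:
        raise ValueError(f"expected {n_bands + 3} values, got {len(row)}")
    species = str(row[0])
    bands = [int(v) for v in row[1:1 + n_bands]]
    if any(b not in (0, 1) for b in bands):
        raise ValueError("AFLP band columns must be 0 or 1")
    return species, bands, (float(row[-2]), float(row[-1]))


def jaccard_dissimilarity(a: Sequence[int], b: Sequence[int]) -> float:
    """1 - (bands present in both) / (bands present in at least one)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return 0.0 if either == 0 else 1.0 - both / either


# Made-up toy rows (NOT taken from the data set), 4 bands each.
toy_rows = [
    ["sp_a", 1, 1, 0, 0, 40.5, -1.2],
    ["sp_a", 1, 1, 1, 0, 40.6, -1.1],
    ["sp_b", 0, 0, 1, 1, 35.0, -5.0],
]
parsed = [split_row(r, n_bands=4) for r in toy_rows]
bands = [p[1] for p in parsed]

# Pairwise dissimilarity matrix computed from the band columns only.
dist = [[jaccard_dissimilarity(a, b) for b in bands] for a in bands]
```

Only the band columns enter the dissimilarity; the species label and the coordinates are kept aside for later comparison and interpretation.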
External criteria for clustering quality
- External variable that represents the underlying true clustering
- Variable "Species"; species as given by Martinez-Ortega et al. (2004). However, there are no species delimitation methods that are 100% reliable and Martinez-Ortega et al. themselves in their original paper raise some controversy about their species delimitation; Table 1 in that paper gives alternative proposals for species.
This is useful as a reference but cannot be used as ultimate ground truth.
- Substantive justification
- See p.966 of Martinez-Ortega et al. (2004).
- Pragmatic aim of clustering
- Doesn't apply.
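Since the "Species" variable is a reference rather than an ultimate ground truth, a clustering is best compared with it by an agreement index rather than by a hard error count. The sketch below uses the adjusted Rand index, which is our choice for illustration and is not prescribed by this description; the function name and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Adjusted Rand index between two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two objects")

    # Number of object pairs that share a cluster in both, in a, and in b.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # both partitions trivial (one cluster or all singletons)
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Made-up labels: a reference "species" vector and two candidate clusterings.
reference = ["x", "x", "y", "y"]
same_up_to_renaming = [0, 0, 1, 1]
crossed = [0, 1, 0, 1]

ari_same = adjusted_rand_index(reference, same_up_to_renaming)
ari_crossed = adjusted_rand_index(reference, crossed)
```

A value near 1 indicates close agreement with the species given by Martinez-Ortega et al. (2004); because that classification is itself disputed, a lower value does not by itself show that a clustering is wrong.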
Internal criteria for clustering quality: cluster membership
- Nature of clusters
- Crisp; species are traditionally seen as a crisp concept in biology
- Cluster overlap
- Non-overlapping; species are normally seen as non-overlapping, although one could argue for overlap because hybrids exist.
Internal criteria for clustering quality: within and between cluster features
- Ground for objects to belong to the same cluster:
- There should be genetic exchange within species; this means that members of the same species should be genetically connected, there should be "paths" from one to the other. These paths should not be too thin (like chaining in Single Linkage), because there are hybrids between species.
Unfortunately, in the data, paths may be broken by incomplete sampling of individuals.
Weight: 4, moderately important
- Ground for objects to belong to different clusters:
- The opposite of the "common grounds"; genetic gaps should appear between species.
Weight: 5, very important
- Stability of clustering:
- The occurrence of hybrids between species shouldn't destroy the clustering (although this may be an unrealistic demand for small clusters).
Weight: 3, of some importance
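The connectivity and gap criteria above can be made concrete with two simple statistics on a dissimilarity matrix. The following is a rough sketch, not a validated index: `max_mst_edge` (the largest step needed to connect all members of a cluster) and `min_separation` (the smallest dissimilarity between objects in different clusters) are hypothetical helper names, and the toy matrix is made up. Note that the spanning-tree statistic only checks that a path exists; it does not penalise the "thin" chaining paths that the criteria above warn about.

```python
from typing import Hashable, List, Sequence


def max_mst_edge(dist: Sequence[Sequence[float]], members: List[int]) -> float:
    """Largest edge of a minimum spanning tree over `members` (Prim's algorithm)."""
    if len(members) < 2:
        return 0.0
    # best[m] = smallest dissimilarity from m to any object already in the tree
    best = {m: dist[members[0]][m] for m in members[1:]}
    widest = 0.0
    while best:
        nxt = min(best, key=best.get)
        widest = max(widest, best.pop(nxt))
        for m in best:
            best[m] = min(best[m], dist[nxt][m])
    return widest


def min_separation(dist: Sequence[Sequence[float]], labels: Sequence[Hashable]) -> float:
    """Smallest dissimilarity between two objects in different clusters."""
    n = len(labels)
    return min(
        (dist[i][j] for i in range(n) for j in range(i + 1, n) if labels[i] != labels[j]),
        default=float("inf"),  # only one cluster present
    )


# Made-up symmetric dissimilarities for four objects in two clusters.
toy_dist = [
    [0.00, 0.20, 0.80, 0.90],
    [0.20, 0.00, 0.70, 0.85],
    [0.80, 0.70, 0.00, 0.10],
    [0.90, 0.85, 0.10, 0.00],
]
toy_labels = ["A", "A", "B", "B"]

within = {
    lab: max_mst_edge(toy_dist, [i for i, l in enumerate(toy_labels) if l == lab])
    for lab in set(toy_labels)
}
between = min_separation(toy_dist, toy_labels)

# A "genetic gap" in this crude sense: every cluster is internally connected
# by steps smaller than the smallest between-cluster dissimilarity.
has_gap = all(w < between for w in within.values())
```

With incomplete sampling of individuals, as noted above, the within-cluster statistic can be inflated even for a genuine species, so a failed gap check should be read with care.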
Further aspects
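Location data (last two variables) are not needed for the clustering itself, but they may help to interpret low within-species similarity between individuals from very different locations (see the subject matter background above).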
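If one wants to relate genetic dissimilarity to geographic separation, as the subject matter background suggests, the decimal latitude/longitude pairs can be turned into approximate distances. The sketch below uses the haversine formula on a spherical Earth with a mean radius of 6371 km; the function name and the example coordinates are hypothetical and are not taken from the data set.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in km between two points given in decimal degrees.

    Follows the convention described above: negative latitude is South,
    negative longitude is West.
    """
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # min() guards against rounding pushing sqrt(h) slightly above 1
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(h)))


# Made-up coordinates for two individuals (latitude, longitude).
site_1 = (40.0, -1.0)
site_2 = (35.0, -5.0)
separation_km = great_circle_km(*site_1, *site_2)
```

Pairs of individuals with high genetic dissimilarity and large `separation_km` could then be inspected before concluding that they belong to different species.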