Tetragonula bee species

Downloads: Tetragonula.arff, Tetragonula.csv, Tetragonula.RData

If you use this data set in publications please cite
Franck, P., E. Cameron, G. Good, J.-Y. Rasplus, and B. P. Oldroyd (2004) Nest architecture and genetic differentiation in a species complex of Australian stingless bees. _Mol. Ecol._ 13, 2317-2331.

Summary of the data

Contributor:
Christian Hennig
Source:
The data set was provided to me by Bernhard Hausdorf, Hausdorf@zoologie.uni-hamburg.de. Originally it appeared in Franck, P., E. Cameron, G. Good, J.-Y. Rasplus, and B. P. Oldroyd (2004) Nest architecture and genetic differentiation in a species complex of Australian stingless bees. _Mol. Ecol._ 13, 2317-2331.
License:
ODbL

General information about the data

Abstract:
Genetic data for 236 Tetragonula (Apidae) bees from Australia and Southeast Asia, see Franck et al. (2004). The data give pairs of alleles (codominant markers) for 13 microsatellite loci. Additionally, locations of bees and species according to Franck et al. (2004) are provided.
Subject matter background:
The data set can be used for species delimitation, so the issue is to find out how many bee species there are in this dataset, and which bees belong to which species. Species are defined by interbreeding (although there is some controversy about the definition of species, see Hausdorf (2011)), which means that within species a much larger genetic similarity is expected than between species. There is a vector of species already in the data set, which may be seen as the "true classification" but this can be challenged because the technique to determine these is far from perfect (although more information than just the given data set was used for this).

Bernhard Hausdorf told me that even within species lower similarity can occur provided that the individuals are from very different locations, because then interbreeding may not have happened even though possible in principle. For this reason, locations are provided.

Data structure:
object x variables data matrix
Data objects and variables:
Objects are 236 Tetragonula bees. There are 16 variables. Clustering should use the first 13 (L1-L13). The next two variables (C1, C2) give location coordinates of the bees; see above for why these may be relevant. The last variable "Species" gives the species (clustering) according to Franck et al. (2004).
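As a rough illustration of this layout, the sketch below splits one record into the 13 clustering variables, the location and the reference label. The column names follow the description above; the record itself is hypothetical and not taken from the data, and the helper name `split_record` is an assumption made for this example. A row read with `csv.DictReader` would have the same shape, provided the file header uses these names.

```python
from typing import Dict, List, Tuple

# Column names as given in the description: L1 ... L13, C1, C2, Species.
LOCUS_COLUMNS = [f"L{i}" for i in range(1, 14)]


def split_record(row: Dict[str, str]) -> Tuple[List[str], Tuple[float, float], str]:
    """Separate clustering variables, location and reference species."""
    genotypes = [row[name] for name in LOCUS_COLUMNS]  # used for clustering
    location = (float(row["C1"]), float(row["C2"]))    # latitude, longitude
    return genotypes, location, row["Species"]         # label kept as a string


# Hypothetical record for illustration only; values are not from the data set.
example = {name: "258260" for name in LOCUS_COLUMNS}
example.update({"C1": "-27.5", "C2": "153.01", "Species": "1"})

genotypes, location, species = split_record(example)
print(len(genotypes), location, species)
```

Keeping "Species" out of the clustering input makes it harder to leak the reference classification into the analysis by accident.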
Data values:
Variables L1-L13: These are strings that consist of six digits each (these are in quotation marks, as are variable names in the first line). The format is derived from the data format used by the software GENEPOP (Rousset 2010). Each string encodes two alleles. Alleles have a three-digit code, so a value of "258260" on variable L10 means that on locus 10 the two alleles have codes 258 and 260. These codes are categorical; there is no numeric information in them. "000" refers to missing values.
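The encoding just described can be decoded as follows. This is a minimal sketch, not the behaviour of GENEPOP or prabclus: the function name `parse_genotype` and the choice to represent a missing allele as `None` are assumptions made for this example.

```python
from typing import Optional, Tuple

MISSING = "000"  # allele code that the description reserves for missing values


def parse_genotype(value: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a six-digit genotype string into two three-digit allele codes."""
    value = value.strip().strip('"')  # tolerate surrounding quotation marks
    if len(value) != 6 or not value.isdigit():
        raise ValueError(f"expected six digits, got {value!r}")
    first, second = value[:3], value[3:]
    # Codes stay strings: they are categorical labels, not quantities.
    return (
        None if first == MISSING else first,
        None if second == MISSING else second,
    )


print(parse_genotype("258260"))
print(parse_genotype("000000"))
```

Leaving the codes as strings avoids treating them as numbers by mistake, for example by averaging them.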

Variables C1 and C2 are coordinates of locations of individuals in decimal format, i.e. the first number is latitude (negative values are South), with minutes and seconds converted to fractions. The second number is longitude (negative values are West).
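For readers who want to compare these values with coordinates given in degrees, minutes and seconds, the conversion can be sketched as below. The function name `dms_to_decimal` and the example coordinates are hypothetical; they are not taken from the data set.

```python
def dms_to_decimal(degrees: int, minutes: int, seconds: float, hemisphere: str) -> float:
    """Convert degrees/minutes/seconds to signed decimal degrees.

    South and West are returned as negative values, matching the sign
    convention described for C1 and C2.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere {hemisphere!r}")
    magnitude = abs(degrees) + minutes / 60 + seconds / 3600
    return -magnitude if hemisphere in ("S", "W") else magnitude


# Hypothetical location: 27 deg 30 min 0 sec South, 153 deg 0 min 36 sec East.
print(dms_to_decimal(27, 30, 0, "S"))
print(dms_to_decimal(153, 0, 36, "E"))
```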

Variable "Species" takes values 1 to 9, which are categorical indicators for different species according to Franck et al. (2004).

Preprocessing:
The data can be used as they are, although users may want to change the format of the allele coding. The R-package prabclus has a function alleleconvert that offers some other formats. The Tetragonula bees data is an example dataset there (although Franck's species are not given in the data in the package).
Other relevant papers:
The dataset was analysed here:

B. Hausdorf and C. Hennig Species Delimitation Using Dominant and Codominant Multilocus Markers. Systematic Biology 59, 491-503 (2010).

C. Hennig How many bee species? A case study in determining the number of clusters. M. Spiliopoulou, L. Schmidt-Thieme, R. Janning (eds.): Data Analysis, Machine Learning and Knowledge Discovery Springer, Berlin (2013), 41-49.

Results are not provided with the dataset.

Some background on species: B. Hausdorf Progress toward a general species concept. Evolution 65, 923-931 (2011).

J. K. Pritchard, M. Stephens, and P. Donnelly Inference of population structure using multilocus genotype data. Genetics 155(2), 945-959 (2000).

URLs:
Justification for clustering:
Delimitation of bee species

External criteria for clustering quality

External variable that represents the underlying true clustering
Variable "Species"; species as given by Franck et al. (2004). However, no species delimitation method is 100% reliable, and Franck et al. themselves raise some controversy about their species delimitation in their original paper. Their delimitation does use more information (nest architecture, morphology) than is given in the dataset, though.

Their result is useful as a reference but cannot be used as ultimate ground truth.
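One way to use the "Species" variable as a reference is to measure agreement between it and a computed clustering, for example with the adjusted Rand index. The index is a swapped-in choice for illustration; the source does not prescribe any particular agreement measure. The sketch below is a plain implementation of the standard pair-counting formula, and the label vectors in the example are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Adjusted Rand index between two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two objects")
    pairs_total = comb(n, 2)
    # Pairs of objects placed together in both partitions (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / pairs_total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical label vectors; cluster names need not match the reference.
reference = [1, 1, 2, 2]
clustering = ["a", "a", "b", "b"]
print(adjusted_rand_index(reference, clustering))
```

A value of 1 means the two partitions coincide up to relabelling, and values near 0 indicate agreement at chance level. Given the caveats above, a value below 1 does not by itself show that the computed clustering is wrong.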

Substantive justification
See Franck et al. (2004) and above.
Pragmatic aim of clustering
Doesn't apply.

Internal criteria for clustering quality: cluster membership

Nature of clusters
Crisp; species are traditionally seen as a crisp concept in biology.
Cluster overlap
Non-overlapping; species are normally seen as non-overlapping, although one could argue for overlap because hybrids exist.

Internal criteria for clustering quality: within and between cluster features

Ground for objects to belong to the same cluster:
There should be genetic exchange within species; this means that members of the same species should be genetically connected, there should be "paths" from one to the other. These paths should not be too thin (like chaining in Single Linkage), because there are hybrids between species. Unfortunately, in the data, paths may be broken by incomplete sampling of individuals.
Weight: 4, moderately important
Within-cluster heterogeneity:
Pritchard et al. (2000) connect the Hardy-Weinberg equilibrium to local independence of alleles within species. This has been criticised as an unrealistic assumption in later literature, though.
Weight: 2, of little importance
Ground for objects to belong to different clusters:
The opposite of the "common grounds"; genetic gaps should appear between species.
Weight: 5, very important
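To look for such gaps, a dissimilarity between individuals is needed. The sketch below computes a simple shared-allele dissimilarity from genotypes that have already been split into allele pairs. It illustrates one common choice and is not claimed to reproduce the computation in prabclus or in the papers cited above. The function name, the `None` convention for missing alleles and the rule of skipping any locus with a missing allele are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

Genotype = Tuple[Optional[str], Optional[str]]  # two allele codes, None if missing


def shared_allele_dissimilarity(a: Sequence[Genotype], b: Sequence[Genotype]) -> Optional[float]:
    """1 minus the proportion of alleles shared over loci observed in both."""
    if len(a) != len(b):
        raise ValueError("both individuals need the same number of loci")
    shared = 0
    compared = 0
    for (a1, a2), (b1, b2) in zip(a, b):
        if None in (a1, a2, b1, b2):
            continue  # assumption: skip loci with any missing allele
        # Multiset intersection counts 0, 1 or 2 shared alleles at this locus.
        overlap = Counter((a1, a2)) & Counter((b1, b2))
        shared += sum(overlap.values())
        compared += 1
    if compared == 0:
        return None  # no locus observed in both individuals
    return 1.0 - shared / (2 * compared)


# Hypothetical two-locus genotypes for illustration only.
bee_1 = [("258", "260"), ("101", "101")]
bee_2 = [("258", "258"), ("101", "101")]
print(shared_allele_dissimilarity(bee_1, bee_2))
```

The result lies between 0 (all alleles shared) and 1 (none shared), so a gap between species would show up as a band of large values between groups.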
Stability of clustering:
The occurrence of hybrids between species shouldn't destroy the clustering (although this may be an unrealistic demand for small clusters).
Weight: 3, of some importance

Further aspects

No.