K. Harris's spike sorting data

Downloads: spike_sorting.arff, spike_sorting.csv, spike_sorting.RData

If you use this data set in publications please cite
These are not really citation requests, but some background and a clustering approach can be found in:

Einevoll, G. T., F. Franke, E. Hagen, C. Pouzat, and K. D. Harris (2012). Towards reliable spike-train recordings from thousands of neurons with multielectrodes. Current Opinion in Neurobiology 22, 11–17.

Kadir, S. N., D. F. M. Goodman, and K. D. Harris (2013). High-dimensional cluster analysis with the Masked EM Algorithm. arXiv:1309.2848, preprint available at http://arxiv.org/pdf/1309.2848.

Summary of the data

Contributor:
Christian Hennig
Source:
Kenneth Harris, Institute of Neurology, Faculty of Brain Sciences, UCL, kenneth.harris@ucl.ac.uk

License:
ODbL

General information about the data

Abstract:
This data set illustrates a clustering problem common in experimental neuroscience, known as “spike sorting”. Each row (object) holds 96 features characterizing the spatiotemporal electric field of one neuronal action potential (or “spike”); there are 20,000 such objects. The action potentials are generated by multiple neurons. Separate clusters in this high-dimensional space correspond to the neurons.
Subject matter background:
Any unsupervised clustering algorithm should reveal several clusters, most of which pertain to putative neurons. Some clusters, however, will pertain to noise or to electrical artifacts in the recording, so not all clusters will correspond to neurons.

The features in the file were produced by an automatic spike detection algorithm that takes the first 3 principal components per channel of the filtered waveforms from the recording. Waveforms were recorded using a probe inserted into the brain consisting of 32 channels (labelled 0-31), arranged spatially according to the adjacency graph below. Each pair corresponds to channels that are nearest neighbours, e.g. channel 12 is next to channels 10, 11, 13, and 14.

probes = {
    # Probe 1
    1: [
        (0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5),
        (4, 5), (4, 6), (5, 6), (5, 7), (6, 7), (6, 8), (7, 8), (7, 9),
        (8, 9), (8, 10), (9, 10), (9, 11), (10, 11), (10, 12), (11, 12), (11, 13),
        (12, 13), (12, 14), (13, 14), (13, 15), (14, 15), (14, 16), (15, 16), (15, 17),
        (16, 17), (16, 18), (17, 18), (17, 19), (18, 19), (18, 20), (19, 20), (19, 21),
        (20, 21), (20, 22), (21, 22), (21, 23), (22, 23), (22, 24), (23, 24), (23, 25),
        (24, 25), (24, 26), (25, 26), (25, 27), (26, 27), (26, 28), (27, 28), (27, 29),
        (28, 29), (28, 30), (29, 30), (29, 31), (30, 31),
    ]
}
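The adjacency graph can be turned into a per-channel neighbour lookup. The following is only a sketch: it regenerates the edge list from the pattern visible in the listed pairs (each channel i is paired with i+1 and i+2 where those exist), and the names `edges`, `build_neighbours`, and `neighbours` are illustrative, not part of the data set.

```python
from collections import defaultdict

N_CHANNELS = 32

# Regenerate the listed pairs: channel i is adjacent to i+1 and i+2
# whenever those channels exist (61 pairs over channels 0-31).
edges = [(i, j) for i in range(N_CHANNELS) for j in (i + 1, i + 2) if j < N_CHANNELS]


def build_neighbours(edge_list):
    """Map each channel to the sorted list of channels it shares an edge with."""
    neighbours = defaultdict(set)
    for a, b in edge_list:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {channel: sorted(adjacent) for channel, adjacent in neighbours.items()}


neighbours = build_neighbours(edges)
print(neighbours[12])  # [10, 11, 13, 14]
```

Channels at the ends of the probe have fewer neighbours; for instance, channel 0 is adjacent only to channels 1 and 2.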

For each neuronal cluster, it is expected that only a subset of the variables will be informative. The number of relevant dimensions depends on the neuron and on how many channels of the probe picked up signals from it. For example, if a neuron's signal was only picked up on channels 10, 11, 12, 13, and 14, then the informative features are features 3k+1, 3k+2, and 3k+3 with k = 10, 11, 12, 13, and 14. A neuron could potentially have as many as 36 relevant features or as few as 3. The remaining features will correspond to background noise. At the same time, many neurons may share the same set of relevant channels.
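The channel-to-feature rule (features 3k+1, 3k+2, 3k+3 for channel k) can be written out as a small helper. This is a sketch assuming the 96 feature columns are numbered from 1 and ordered channel by channel as the rule implies; the function name `features_for_channels` is hypothetical.

```python
def features_for_channels(channels, n_components=3):
    """Return the 1-based feature numbers 3k+1, 3k+2, 3k+3 for each channel k."""
    return [
        n_components * k + component
        for k in sorted(channels)
        for component in range(1, n_components + 1)
    ]


relevant = features_for_channels([10, 11, 12, 13, 14])
print(relevant)
# [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]

# Zero-based positions, e.g. for indexing columns of an array (assumed layout).
zero_based = [feature - 1 for feature in relevant]
```

With 32 channels (k = 0 to 31), the largest feature number produced is 3 * 31 + 3 = 96, which matches the 96 features per object.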

Data structure:
object x variables data matrix
Data objects and variables:
See background: The features in the file were produced by an automatic spike detection algorithm taking the first 3 principal components from some filtered waveforms coming from the recording, i.e., all observations are "spikes" or special events in the recording after a standard noise level was filtered out. Waveforms were recorded using a probe inserted into the brain consisting of 32 different channels.
Data values:
The values happen to be rounded to integers, but there are no restrictions on the value range in principle.
Preprocessing:
As explained before, the variables are the result of some preprocessing, which is based on experience that more than 3 principal components per channel are not normally useful.

Any further processing is not necessary from a subject matter perspective, although it may be required for certain clustering algorithms. According to experience, the variables should behave well in the sense that there should be no skewness problem that would require nonlinear transformation.

Other relevant papers:
Not as far as I know.
Justification for clustering:
It is of interest to detect the activity of individual neurons, and to separate them from each other, based on the "action potentials" recorded and detected as "spikes" on the channels.

External criteria for clustering quality

External variable that represents the underlying true clustering
The last variable "truecluster" is an indicator variable marking a single artificial cluster that was added to the data by K. Harris. Optimally this should correspond to one of the found clusters. Please note that further "true" clusters can be expected in the data; the cluster indicated here is not the only one.
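One possible way to check whether the artificial cluster corresponds to one of the found clusters is to look for the found cluster with the largest overlap. The sketch below uses the Jaccard index for this, which is an assumption of this example and not a criterion prescribed by the data contributor; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def best_matching_cluster(truecluster, labels):
    """Return (label, jaccard) for the found cluster that best overlaps the rows flagged 1."""
    true_size = sum(1 for flag in truecluster if flag == 1)
    sizes = Counter(labels)
    overlap = Counter(label for flag, label in zip(truecluster, labels) if flag == 1)
    best_label, best_score = None, 0.0
    for label, intersection in overlap.items():
        union = true_size + sizes[label] - intersection
        score = intersection / union
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy example: three artificial spikes, two of which fall into found cluster "a".
truecluster = [1, 1, 1, 0, 0, 0, 0, 0]
labels = ["a", "a", "b", "b", "b", "c", "c", "c"]
label, score = best_matching_cluster(truecluster, labels)
print(label, round(score, 3))  # a 0.667
```

A score of 1 would mean that one found cluster contains exactly the artificial spikes and nothing else.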

As the added cluster is artificial, one may even consider using the real data only not including the observations indicated by "1" on the "truecluster" variable.
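Restricting the analysis to the real data amounts to dropping the rows flagged "1". A minimal sketch follows, using a tiny in-memory stand-in for the CSV file; the feature column names `V1` and `V2`, the presence of a header row, and the 0/1 coding of the other rows are assumptions, and only the column name `truecluster` comes from the description above.

```python
import csv
import io


def drop_artificial_rows(rows, flag_column="truecluster"):
    """Keep only the rows whose flag column is not 1."""
    return [row for row in rows if int(float(row[flag_column])) != 1]


# Stand-in for the real file (hypothetical columns and values).
sample = io.StringIO("V1,V2,truecluster\n3,-7,0\n12,5,1\n-4,9,0\n")
rows = list(csv.DictReader(sample))
real_only = drop_artificial_rows(rows)
print(len(rows), len(real_only))  # 3 2
```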

Substantive justification
The added artificial cluster represents neurological knowledge about a typical cluster generated by a neuron.

Internal criteria for clustering quality: cluster membership

All objects clustered or not
One would expect that some spikes are not generated by neurons but are rather artifacts or noise and should therefore not be assigned to any cluster.
Cluster overlap
The aim is to separate neurons, so clusters are not thought of as genuinely overlapping, although they may overlap in a "mixture model" sense, i.e., there may be an overlap of neural activity in data space and not all neurons are separated by clear gaps. (In other words, a probabilistic clustering is fine, but overlapping crisp sets of points are not.)
Cluster sizes
Very large clusters containing half of the data set or more are very unlikely; very small clusters (say <10 points or so) are unlikely to indicate neurons and are probably artifacts. However, it may be wrong to merge small clusters with bigger ones; they are better kept separate (either not clustered at all or left as small clusters).
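These size expectations can be checked on any clustering result. The sketch below flags clusters with fewer than 10 points as "small" and clusters holding at least half of all objects as "large"; both thresholds are rough readings of the guidance above rather than fixed rules, unclustered objects are assumed to carry the label `None`, and the function name is hypothetical.

```python
from collections import Counter


def flag_cluster_sizes(labels, min_size=10, max_fraction=0.5):
    """Label each cluster 'small', 'large', or 'ok' according to its size."""
    n_objects = len(labels)  # unclustered objects (None) still count towards the total
    sizes = Counter(label for label in labels if label is not None)
    flags = {}
    for label, size in sizes.items():
        if size < min_size:
            flags[label] = "small"  # probably an artifact; keep separate, do not merge
        elif size >= max_fraction * n_objects:
            flags[label] = "large"  # very unlikely to be a single neuron
        else:
            flags[label] = "ok"
    return flags


labels = [0] * 60 + [1] * 35 + [2] * 5
print(flag_cluster_sizes(labels))  # {0: 'large', 1: 'ok', 2: 'small'}
```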

Internal criteria for clustering quality: within and between cluster features

Ground for objects to belong to the same cluster:
Spikes belonging to the same neuron can be expected to be similar on certain channels (namely where they can be detected), but may not have anything in common on the other channels.

Density gaps are not expected within clusters and would indicate the presence of several active neurons (clusters).
Weight: 4, moderately important

Within-cluster heterogeneity:
There is a certain trend in the literature to assume normal distributions for spike clusters, but I don't think this is based on strong subject matter arguments; it could be a mixture of wishful thinking and some limited experience.
Weight: 2, of little importance
Ground for objects to belong to different clusters:
Clusters can be separated by (potentially rather weak) density gaps or by different sets of channels on which they are "active".
Weight: 4, moderately important
Stability of clustering:
The data are potentially "noisy", and therefore clusters that appear or disappear when noise is added or a few observations are changed should not be trusted, although stability is not an end in itself here.
Weight: 2, of little importance

Further aspects

The neurologists would probably be happy with clear graphs that give them some idea of the extent to which the clusters have typical features of neurons; it would be better to describe what that means, but this is not formalised as far as I know.