IFCS Benchmark Data Questionnaire Preview

Questions below with no space to enter answers will have a textbox in the online version. The textboxes have been omitted from this preview to save paper when printing.

Contributor

(most items mandatory)
Corresponding person: Name and email address of the corresponding contributor of the data set. This must be the name and email address of a single living person, not an institution, alias, mailing list, etc.
  • Name:
     
  • Email:
     
Source: Provide additional names, email addresses, institutions, and other contact information of the donors and creators of the data set.
License: Please specify under which license terms you want to submit the data set to the repository.
List of admissible licenses
I confirm that I am legally entitled to contribute the data set to the IFCS repository under the terms and on behalf of the sources listed above.
  • Yes
  • No

Data

(most items mandatory)
Dataset name: Please specify a title (several words, at most one line) and a short version that can be used as an object name in statistical software packages (one word, or a few words separated by dots or underscores, no blanks).
  • Title:
     
  • Short name:
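
As an illustration only, the short-name rule above could be checked along the following lines. This is a minimal sketch, not the check the repository itself applies: the function name `is_valid_short_name` is hypothetical, and the restriction to ASCII letters and digits with a leading letter is an assumption made so that the name works as an object name in most statistical software.

```python
import re

# Assumed rule (not stated by the questionnaire beyond "words separated by
# dots or underscores, no blanks"): ASCII letters and digits only, starting
# with a letter, with single dots or underscores between words.
SHORT_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:[._][A-Za-z0-9]+)*")


def is_valid_short_name(name: str) -> bool:
    """Return True if `name` matches the assumed short-name pattern."""
    return SHORT_NAME_PATTERN.fullmatch(name) is not None
```

Under these assumptions, names such as `wine.quality` or `gene_expr` pass, while `my data` (contains a blank) and `.hidden` (leading separator) do not.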
     
Abstract: Provide a short description of your data set (fewer than 200 characters).
Subject matter background: Please describe the subject matter background of the data set, the scientific problem and the research questions.
Data structure:
  • object x variables data matrix
  • object x object proximities/(dis)similarities matrix
  • other (textbox to explain structure will open)
What are the objects, and, as far as this applies, what is the meaning of the proximities/(dis)similarities or of the variables? Does the data set contain external variables that are not to be used for the actual clustering? If so, please identify which ones.
Data values: What are the admissible (as opposed to "observed in the data") values for the variables or proximities/(dis)similarities involved? What information do the values represent? Are there missing values and, if so, how are they denoted?
Preprocessing: Should preprocessing be applied to the data prior to clustering, or has it already been applied? Please give the substantive reasons and describe the preprocessing that has been or should be applied in as much detail as possible.
Links: Are any analyses or related software (e.g., for preprocessing) available on the web? Please provide URLs.
Citation: Researchers using this data set in publications should cite the following paper(s):
Other relevant papers: In which publications has this data set appeared previously? Has this data set already been clustered in previous publications? Are the clustering results from previous publications provided in the data and, if so, which variable(s) represent these? Are there papers relevant for understanding the background and methodology (even if they do not use this particular data set)?
Do you want to submit more than 100MB of data?

Justification

(item optional)
If available, give a substantive justification of why a clustering of these data is required and what it will be used for.

Quality concerns: external criteria

(all items optional)

Is an external variable (not part of the data used for clustering) to be used to evaluate the clustering result?

Analysis should reveal a known underlying true clustering: Which external variable represents the underlying true clustering? Should the to-be-found clustering fully or partly correspond to the underlying true clustering?

Please give substantive justification.

Analysis has pragmatic aim (i.e., the data have to be clustered in order to help with some kind of task, whether this corresponds with a true underlying structure or not): Which external variables are relevant for this pragmatic aim, and which type of relationship should they bear with the clustering?

Examples: clustering should provide good prediction of external variable, clusters should be homogeneous with respect to external variable, external variables should discriminate between clusters, clustering should more or less correspond with one or more known useful clusterings as represented by external variables, prespecified subsets of objects (as characterized by external variables) should be assigned to the same or to different clusters, etc.

Please give substantive justification.

Quality concerns: internal criteria - cluster membership

(all items optional)
Number of clusters: Is there a preferred value or range of values for the number of clusters?

If yes: What are these? Please give substantive justification.

Nature of clusters: Is there a clear preference for either a crisp clustering or for varying degrees of membership (fuzzy/probabilistic) for objects in all clusters?

If yes: Please specify your preference. Please give substantive justification.

All objects clustered or not: Is there a clear preference for all objects being clustered or for not all objects being clustered?

If yes: Please specify your preference. In case of a preference for not all objects being clustered, please specify which types of objects may not be assigned to any cluster. If objects that are not to be assigned to any cluster are known, please specify which ones. Please give substantive justification.

Cluster overlap: Is there a clear preference for clusters to be nonoverlapping or overlapping?

If yes: Please specify your preference. In case of a preference for overlapping clusters, please specify whether overlapping clusters are required to be nested and whether a hierarchy is required. Please give substantive justification.

Cluster sizes: Are there any constraints on the cluster sizes?

If yes: Is there a minimum or maximum cluster size, and, if yes, which one(s)? Should clusters be similar or dissimilar in size, and if yes, in which respect? Please give substantive justification.

Quality concerns: internal criteria - within and between cluster features

(all items optional)
Unifying/common ground: Are there requirements on what should be the unifying/common ground for objects to belong to the same cluster?

If yes: What are these requirements? Small within-cluster dissimilarities (and, if yes, in which respect)? Common pattern of values (and, if yes, which type of pattern)? Other (and, if yes, what form do these requirements take)? Please give substantive justification.

Within-cluster heterogeneity: Are there requirements on the form of within-cluster heterogeneity?

If yes: Should the within-cluster heterogeneity take the form of a particular geometric pattern (and, if yes, which one)? Within-cluster independence of variables or a specific type of within-cluster dependence structure (and, if yes, which one)? Other (and, if yes, what form do these requirements take)? Please give substantive justification.

Discriminating ground: Are there requirements on what should be the discriminating ground for objects to belong to different clusters?

If yes: What form do these requirements take? E.g., large between-cluster dissimilarities (and, if yes, in which respect)? Separation (and, if yes, of which kind)? Other (and, if yes, what form do these requirements take)? Please give substantive justification.

Between-cluster differences: Are there requirements on the between-cluster heterogeneity, that is, the structure of between-cluster differences?

If yes: What form do these requirements take? E.g., between-cluster differences should lie in a low-dimensional space. Please give substantive justification.

Between-cluster similarity in within-cluster features: Should clusters be similar with regard to within-cluster features such as variance or within-cluster structure?

If yes: What form should this similarity take? Please give substantive justification.

Stability: Should the clustering be stable (with regard to the influence of outliers, the subset of variables under study, the choice of a dissimilarity measure, other)?

If yes: Please specify in which respect the clustering should be stable. Please give substantive justification.

Inferential quality: Is the quality of inferences about population characteristics an issue?

If yes: Please specify for which population characteristics inferential quality is an issue and in which respect. Please give substantive justification.

Further comments

(both items optional)
Are there any further aspects of quality relevant for these data but not mentioned before?
Do you have any further comments or suggestions about the questionnaire or the benchmark repository?