Call for Papers


ICMLA 2010 Speaker Clustering Challenge



Learning methods for sequential data have received widespread attention in recent years. This kind of data arises in many interesting scenarios where the individual semantic units are no longer single vectors but collections of vectors. Examples of such scenarios include multimedia analysis (e.g., video understanding, speaker recognition) and bioinformatics (e.g., DNA or protein sequences). Sequences can have different lengths, so standard distance measures for vector spaces are not directly applicable.

Moreover, the information conveyed by the sequences is sometimes encoded not just in the individual vectors themselves, but also in the dynamics under which these vectors evolve over time. To capture such information, it is usual to employ dynamic models such as hidden Markov models or more general dynamic Bayesian networks. Distances between sequences can then be defined using the learned models. This and other alternatives are reviewed in [1].

However, there are many scenarios where the sequences can be accurately classified or clustered without attending to their dynamic characteristics. Examples include bag-of-words models for image analysis, speech-independent speaker verification, etc. In these cases the sequences can be viewed as sets of independent and identically distributed (i.i.d.) samples, and can thus be characterized in terms of their underlying probability density function (PDF). There are many ways of defining affinities or distances between PDFs, from the classic Kullback-Leibler or Bhattacharyya divergences (even in feature space, as in [2]) to the recently proposed Probability Product Kernels [3].
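As an illustration of the PDF-based view described above, the following minimal sketch fits a diagonal-covariance Gaussian to each sequence and compares two sequences with a symmetrized Kullback-Leibler divergence. The single-Gaussian model, the diagonal covariance, the variance floor and all function names are assumptions made for this example only; they are not prescribed by the challenge.

```python
import math


def fit_diag_gaussian(seq, var_floor=1e-6):
    """Return per-dimension (means, variances) for a sequence of equal-length vectors."""
    n = len(seq)
    dim = len(seq[0])
    means = [sum(x[d] for x in seq) / n for d in range(dim)]
    # Floor the variances so that constant dimensions do not cause division by zero.
    variances = [
        max(sum((x[d] - means[d]) ** 2 for x in seq) / n, var_floor)
        for d in range(dim)
    ]
    return means, variances


def kl_diag_gaussians(p, q):
    """KL(p || q) for two diagonal-covariance Gaussians given as (means, variances)."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def symmetric_kl(seq_a, seq_b):
    """Symmetrized KL divergence between Gaussians fitted to two sequences."""
    p = fit_diag_gaussian(seq_a)
    q = fit_diag_gaussian(seq_b)
    return kl_diag_gaussians(p, q) + kl_diag_gaussians(q, p)
```

Because each sequence is summarized by its fitted density, the two sequences may have different lengths. Evaluating `symmetric_kl` on every pair of sequences yields a distance matrix that a clustering algorithm can consume.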

In this challenge we propose to focus on unsupervised methods for sequential data, specifically the clustering of speech data. Clustering tries to find coherent (in some sense) disjoint groups within a dataset. It does not require labelled training examples, so it is an important tool for exploratory data analysis. Furthermore, clustering algorithms can be easily extended into semi-supervised methods, which are very useful when the labelling process is costly.


This challenge proposes two different tasks:

The first task is 2-class speaker clustering. For this task we provide 7 datasets, each comprising speech from two different speakers. Participants should identify two clusters within each dataset.
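One possible way to approach the 2-class task, once a pairwise distance matrix between sequences is available, is a two-medoid clustering. The sketch below is only an example under that assumption; the choice of k-medoids, the initialization from the two most distant items and the function name `two_medoids` are illustrative and not part of the challenge definition.

```python
def two_medoids(dist, max_iter=100):
    """Split items into two clusters given a symmetric distance matrix (list of lists)."""
    n = len(dist)
    # Initialise the medoids with the two most distant items.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    medoids = [first, second]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        labels = [
            0 if dist[i][medoids[0]] <= dist[i][medoids[1]] else 1
            for i in range(n)
        ]
        # Update step: the new medoid minimises the total distance within its cluster.
        new_medoids = []
        for c in (0, 1):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                best = min(members, key=lambda m: sum(dist[m][i] for i in members))
            else:
                best = medoids[c]  # keep the old medoid if the cluster emptied
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels
```

The function returns one label (0 or 1) per sequence. Working with medoids rather than means keeps the method applicable to any distance between sequences, since no vector-space average is needed.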

The more advanced task is multiclass speaker clustering. This task is carried out on a single dataset formed by sequences from an unknown number of speakers in the range [3,6]. Participants should discover the number of speakers and cluster the sequences accordingly. An example of a system performing this task can be found in [4].
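For the multiclass task, one simple heuristic for choosing the number of speakers is to cluster the data for each candidate value in the range [3,6] and keep the value with the highest mean silhouette score. The sketch below assumes a precomputed distance matrix and a hypothetical dictionary `candidates` that maps each candidate number of clusters to a labelling produced by any clustering algorithm. The silhouette criterion is an assumption of this example, not a method taken from the challenge or from [4].

```python
def silhouette(dist, labels):
    """Mean silhouette score of a labelling, given a symmetric distance matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0 by convention
        # a: mean distance to the other members of the item's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


def pick_num_speakers(dist, candidates):
    """Return the key of `candidates` (k -> labels) with the best silhouette."""
    return max(candidates, key=lambda k: silhouette(dist, candidates[k]))
```

Scores close to 1 indicate compact, well-separated clusters, so the selected value is the one whose partition best matches the structure of the distance matrix.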

Both tasks are based on a speech database recorded using a PDA. It includes both male and female speakers. Each subject recorded 50 isolated words, and the mean length of each utterance is around 1.3 seconds. The original audio files were processed using the HTK software to obtain a standard parametrization consisting of 12 Mel-frequency cepstral coefficients (MFCCs), an energy term and their respective increments (delta coefficients), giving a total of 26 parameters. These parameters were obtained every 10 ms with a 25 ms analysis window, yielding 26-dimensional sequences of around 130 samples. Any further pre-processing (normalization, filtering, …) is up to the participants.
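The figures quoted above can be checked with a little arithmetic, and the optional normalization step can be sketched in a few lines. The code below assumes that only full analysis windows are counted (no padding at the utterance edges) and uses per-sequence mean and variance normalization as one possible pre-processing choice. The function names and the `eps` constant are hypothetical.

```python
def num_frames(duration_ms, window_ms=25, shift_ms=10):
    """Number of full analysis windows that fit, assuming no edge padding."""
    if duration_ms < window_ms:
        return 0
    return (duration_ms - window_ms) // shift_ms + 1


def feature_dim(num_mfcc=12, with_energy=True, with_deltas=True):
    """Static coefficients (MFCCs plus optional energy), doubled if deltas are appended."""
    static = num_mfcc + (1 if with_energy else 0)
    return static * (2 if with_deltas else 1)


def mean_variance_normalize(seq, eps=1e-8):
    """Shift each dimension of a sequence to zero mean and scale it to unit variance."""
    n, dim = len(seq), len(seq[0])
    means = [sum(x[d] for x in seq) / n for d in range(dim)]
    stds = [
        (sum((x[d] - means[d]) ** 2 for x in seq) / n) ** 0.5
        for d in range(dim)
    ]
    # eps avoids division by zero for dimensions that are constant in the sequence.
    return [[(x[d] - means[d]) / (stds[d] + eps) for d in range(dim)] for x in seq]
```

Under these assumptions a 1.3-second utterance gives `num_frames(1300) == 128`, consistent with the roughly 130 samples per sequence, and `feature_dim()` returns 26.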

Participants can submit their results for one of the tasks or for both. For details on how to format the results, please contact the organizers.


Data for each task are available in Matlab (.MAT) format at the following address:


Apart from the actual results, a short paper (4 pages) describing the proposed algorithms should be submitted through the main conference submission website. These papers will be reviewed mainly based on:

· Originality and technical soundness of the employed distance measures

· Coherence of the discovered clusters w.r.t. the speakers

· In the multiclass task, special attention will be paid to the steps toward the correct identification of the number of speakers
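The coherence of a clustering with respect to the true speakers can be quantified in several ways. As one common example, the sketch below computes cluster purity: the fraction of sequences that belong to the majority speaker of their cluster. This measure is an assumption chosen for illustration; the call does not state which measure the reviewers will use.

```python
from collections import Counter


def cluster_purity(true_speakers, cluster_labels):
    """Fraction of items that share the majority speaker of their cluster."""
    counts = {}
    for speaker, label in zip(true_speakers, cluster_labels):
        counts.setdefault(label, Counter())[speaker] += 1
    # Sum the size of the dominant speaker group in every cluster.
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(true_speakers)
```

A purity of 1.0 means that every cluster contains a single speaker. Purity alone rewards over-segmentation, so it is usually read together with the number of clusters found.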



Accepted papers will be published in the ICMLA’10 conference proceedings.



Paper Submission Deadline: July 15, 2010

Notification of Acceptance: September 7, 2010

Camera-Ready Papers & Pre-registration: October 1, 2010

The ICMLA Conference: December 12-14, 2010

Authors should submit their papers through the main conference submission website. Papers must conform to the requirements detailed in the instructions to authors. Accepted papers must be presented by one of the authors in order to be published in the conference proceedings. If you have any questions, do not hesitate to contact the organizers.

All challenge submissions will be handled electronically. Detailed instructions for submitting a paper are provided on the conference home page at:



[1] Liao, T.W.; Clustering of time series data – A survey; Pattern Recognition, 2005, 38, pp. 1857-1874

[2] Zhou et al.; From sample similarity to ensemble similarity: Probabilistic distance measures in Reproducing Kernel Hilbert Space; IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28, pp. 917-929

[3] Jebara et al.; Probability Product Kernels; Journal of Machine Learning Research (JMLR), 2004, 5, pp. 819-844

[4] Lapidot et al.; Unknown multiple speaker clustering using HMM; International Conference on Spoken Language Processing, 2002

[5] Sanguinetti et al.; Automatic determination of the number of clusters using spectral algorithms; IEEE Workshop on Machine Learning for Signal Processing, 2005, pp. 55-60





ICMLA 2010 Challenge Organizers: