As we explored in previous sections on classification and feature selection, gene expression data can reliably identify cancer subtypes. Next, we will investigate whether it is possible to uncover subtype separations in an unsupervised manner using only transcriptomic data. Specifically, we will employ clustering techniques to group samples based on their similarities.
1. Hierarchical Clustering
The definition of similarity, especially between clusters, can significantly affect the outcome of hierarchical clustering. Clustering methods generally compute the similarity (or distance) between individual samples and then aggregate these pairwise values into a single measure between clusters. Two choices therefore shape the result: the distance metric between samples (for example, Euclidean or correlation-based distance) and the linkage criterion that aggregates it (for example, single, complete, or average linkage).
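To make the aggregation step concrete, here is a minimal sketch of three common linkage criteria computed with NumPy. The function names and the toy two-gene clusters are made up for illustration; they are not taken from the analysis in this article.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def cluster_distance(a, b, linkage="average"):
    """Aggregate sample-to-sample distances into one cluster-to-cluster value."""
    d = pairwise_distances(a, b)
    if linkage == "single":      # distance of the closest cross-cluster pair
        return d.min()
    if linkage == "complete":    # distance of the farthest cross-cluster pair
        return d.max()
    if linkage == "average":     # mean over all cross-cluster pairs
        return d.mean()
    raise ValueError(f"unknown linkage: {linkage}")

# Two toy clusters in a 2-gene space (made-up values, for illustration only).
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [4.0, 0.0]])

for name in ("single", "complete", "average"):
    print(name, round(float(cluster_distance(cluster_a, cluster_b, name)), 3))
```

The same pair of clusters gets a different distance under each criterion, which is why the choice of linkage can change which clusters are merged first.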
In this study, I applied agglomerative (bottom-up) hierarchical clustering, which begins by assigning each sample to its own cluster. The algorithm identifies the two most similar clusters and merges them into a single cluster, treating it as a “quasi-sample.” This merging repeats until all samples are consolidated into one cluster. The results are typically visualized as a dendrogram, which shows the order in which clusters were merged. The height at which two branches join indicates the dissimilarity between the merged clusters: the greater the height, the less similar they are. By cutting the dendrogram at a specific height, I can obtain the desired number of clusters: a cut closer to the leaves yields more, smaller clusters, while a cut near the root yields fewer, larger clusters.
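The procedure above can be sketched with SciPy's `linkage` and `fcluster` functions. This is an illustration under stated assumptions, not the exact pipeline used here: the data are synthetic, and average linkage with Euclidean distance is an arbitrary choice, since the settings used for this analysis are not specified.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for an expression matrix: 12 samples x 50 genes,
# built as three well-separated groups of four samples each.
rng = np.random.default_rng(0)
expr = np.vstack([
    rng.normal(loc=offset, scale=1.0, size=(4, 50))
    for offset in (0.0, 10.0, 20.0)
])

# Build the merge tree. Each row of Z records one merge:
# the two clusters joined, the height of the merge, and the new cluster size.
Z = linkage(expr, method="average", metric="euclidean")

# Option 1: ask directly for (at most) three clusters.
labels_by_count = fcluster(Z, t=3, criterion="maxclust")

# Option 2: cut at an explicit height, here halfway between the
# merge that leaves three clusters and the merge that leaves two.
cut_height = (Z[-3, 2] + Z[-2, 2]) / 2
labels_by_height = fcluster(Z, t=cut_height, criterion="distance")

print(labels_by_count)
print(labels_by_height)
```

`scipy.cluster.hierarchy.dendrogram(Z)` draws the corresponding tree when matplotlib is available. Lowering `cut_height` moves the cut toward the leaves and produces more clusters.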
1.1. Input
For this analysis, I used a dataset of expression values for 7,000 genes across 52 samples as input for clustering. The excerpt below shows genes as rows and samples as columns.
| ID | S01 | S02 | S03 | S04 | S05 | S06 | S07 | S08 | S09 | S10 | S11 | S12 | S13 | S14 | S15 | S16 | S17 | S18 | S19 | S20 | S21 | S22 | S23 | S24 | S25 | S26 | S27 | S28 | S29 | S30 | S31 | S32 | S33 | S34 | S35 | S36 | S37 | S38 | S39 | S40 | S41 | S42 | S43 | S44 | S45 | S46 | S47 | S48 | S49 | S50 | S51 | S52 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000419 | 6.038 | 5.219 | 5.115 | 5.326 | 5.186 | 5.287 | 5.268 | 6.624 | 5.571 | 6.597 | 5.211 | 6.652 | 6.611 | 6.395 | 5.151 | 6.138 | 7.278 | 4.737 | 6.125 | 5.63 | 5.777 | 6.18 | 4.644 | 6.481 | 5.724 | 5.459 | 4.74 | 7.281 | 5.198 | 5.609 | 4.776 | 5.615 | 6.047 | 4.743 | 6.229 | 5.152 | 5.778 | 4.583 | 5.963 | 6.168 | 4.925 | 4.074 | 5.797 | 6.381 | 4.624 | 5.986 | 6.375 | 7.702 | 5.679 | 5.832 | 5.308 | 5.036 |
| ENSG00000001036 | 4.408 | 5.363 | 5.249 | 5.796 | 5.731 | 6.092 | 4.759 | 4.412 | 4.953 | 6.929 | 4.959 | 5.074 | 5.39 | 4.528 | 4.696 | 6.009 | 5.443 | 4.256 | 6.142 | 5.864 | 5.341 | 5.992 | 4.676 | 5.139 | 6.946 | 5.942 | 6.075 | 6.336 | 4.982 | 5.277 | 6.352 | 5.205 | 6.271 | 5.783 | 4.085 | 6.551 | 5.889 | 5.155 | 3.923 | 5.453 | 4.065 | 5.878 | 5.485 | 5.107 | 6.206 | 5.034 | 5.27 | 5.734 | 5.365 | 4.205 | 4.986 | 5.652 |
| ENSG00000001084 | 4.417 | 5.983 | 5.749 | 5.736 | 5.738 | 5.758 | 3.925 | 5.101 | 5.073 | 3.528 | 5.907 | 4.03 | 3.577 | 3.093 | 4.969 | 5.789 | 4.443 | 5.138 | 5.585 | 6.172 | 4.977 | 6.776 | 5.952 | 4.72 | 4.586 | 5.829 | 3.988 | 4.735 | 4.201 | 6.793 | 6.236 | 4.439 | 5.045 | 5.176 | 4.172 | 4.044 | 5.978 | 4.084 | 3.042 | 4.221 | 3.871 | 5.231 | 6.616 | 4.314 | 6.365 | 4.468 | 4.084 | 3.095 | 4.29 | 2.249 | 5.871 | 4.303 |
| ENSG00000001497 | 5.242 | 4.2 | 6.563 | 5.509 | 5.539 | 4.889 | 6.14 | 5.325 | 5.067 | 6.147 | 5.469 | 6.077 | 6.871 | 5.967 | 6.682 | 4.986 | 5.3 | 6.449 | 4.947 | 5.46 | 5.438 | 5.933 | 3.964 | 4.578 | 6.386 | 4.93 | 4.866 | 6.102 | 4.718 | 5.275 | 4.747 | 5.592 | 5.071 | 5.305 | 4.131 | 5.444 | 3.832 | 5.106 | 5.93 | 4.696 | 4.91 | 5.225 | 4.644 | 6.144 | 5.069 | 4.746 | 4.907 | 5.808 | 4.85 | 5.3 | 4.157 | 6.014 |
| ENSG00000001617 | 6.028 | 7.304 | 3.906 | 5.913 | 5.957 | 3.985 | 7.524 | 4.648 | 6.675 | 3.168 | 5.361 | 6.536 | 6.393 | 5.222 | 1.325 | 3.853 | 4.694 | 4.617 | 2.5 | 3.199 | 5 | 5.301 | 5.471 | 5.007 | 1.436 | 5.811 | 2.974 | 3.771 | 7.3 | 5.09 | 5.116 | 5.192 | 5.487 | 6.753 | 5.083 | 0.373 | 5.46 | 7.737 | 0.795 | 6.797 | 1.26 | 4.758 | 3.704 | 5.568 | 5.234 | 4.296 | 5.062 | 6.554 | 6.172 | 6.783 | 4.914 | 5.93 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
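Because the table stores genes as rows and samples as columns, clustering samples typically requires transposing the matrix first so that each row is one sample. The sketch below uses the first four samples of the five genes shown above; the variable names and the choice of Euclidean distance are assumptions made for illustration.

```python
import numpy as np

genes = ["ENSG00000000419", "ENSG00000001036", "ENSG00000001084",
         "ENSG00000001497", "ENSG00000001617"]
samples = ["S01", "S02", "S03", "S04"]

# Rows are genes, columns are samples (values copied from the excerpt above).
expr = np.array([
    [6.038, 5.219, 5.115, 5.326],
    [4.408, 5.363, 5.249, 5.796],
    [4.417, 5.983, 5.749, 5.736],
    [5.242, 4.200, 6.563, 5.509],
    [6.028, 7.304, 3.906, 5.913],
])

# Transpose so that each row is a sample and each column is a gene.
X = expr.T

# Euclidean distance between every pair of samples.
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

print(X.shape)             # (number of samples, number of genes)
print(np.round(dist, 3))   # symmetric matrix with zeros on the diagonal
```

With only five genes these distances say little about the full dataset; the point is the orientation of the matrix passed to the clustering step.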
1.2. Result
The resulting dendrogram revealed that similar samples were grouped together. For instance, samples S03, S04, and S05 formed a small branch, while S12, S13, and S14 were located in a branch significantly distant from the first.

If the number of clusters is set to five, the algorithm will return the assigned cluster for each sample.
| Sample | Cluster ID |
|---|---|
| S01 | 1 |
| S02 | 1 |
| S03 | 1 |
| S04 | 1 |
| S05 | 1 |
| S06 | 2 |
| S07 | 2 |
| S08 | 2 |
| S09 | 2 |
| S10 | 3 |
| S11 | 2 |
| S12 | 2 |
| S13 | 2 |
| S14 | 2 |
| S15 | 1 |
| S16 | 3 |
| S17 | 2 |
| S18 | 2 |
| S19 | 1 |
| S20 | 1 |
| S21 | 1 |
| S22 | 1 |
| S23 | 4 |
| S24 | 1 |
| S25 | 3 |
| S26 | 1 |
| S27 | 3 |
| S28 | 1 |
| S29 | 2 |
| S30 | 1 |
| S31 | 1 |
| S32 | 1 |
| S33 | 2 |
| S34 | 2 |
| S35 | 2 |
| S36 | 3 |
| S37 | 2 |
| S38 | 2 |
| S39 | 1 |
| S40 | 2 |
| S41 | 3 |
| S42 | 1 |
| S43 | 2 |
| S44 | 1 |
| S45 | 2 |
| S46 | 2 |
| S47 | 2 |
| S48 | 2 |
| S49 | 2 |
| S50 | 2 |
| S51 | 2 |
| S52 | 2 |
2. K-means Clustering
Another commonly used clustering method is k-means. In k-means, the number of clusters, “K,” is set as an input parameter at the beginning, and “K” initial centroids are randomly selected within the sample space. Each sample is then assigned to the cluster associated with the nearest centroid. After the initial assignments, the algorithm updates each cluster’s centroid to reflect the actual mean of the samples it contains, and the process iterates with these new centroids. This cycle continues until convergence is reached or a predefined stopping condition is met, such as the maximum number of iterations.
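The assignment and update steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used for the results below: the function name `kmeans` is hypothetical, the data are synthetic, and the initial centroids are drawn from the samples themselves, which is one common way to pick random starting points.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Minimal k-means sketch. X has one row per sample."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise centroids as k distinct, randomly chosen samples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic data: 20 samples x 5 genes drawn around two different means.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (10, 5)), rng.normal(8.0, 1.0, (10, 5))])

labels, centroids = kmeans(X, k=2, seed=0)
print(labels + 1)  # shift to 1-based cluster IDs, as in the tables here
```

Passing a fixed `seed` makes a run reproducible; leaving it as `None` gives a different random initialization on each call.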
It is important to note that due to random initialization, running k-means multiple times with the same parameters on the same data can yield different results, especially if the true clustering structure is not clear.
2.1. Input
We continue to work with the same input data of 52 samples as described in 1.1.
2.2. Result
The resulting table is straightforward, with each sample assigned to a cluster.
| Sample | Cluster ID |
|---|---|
| S01 | 1 |
| S02 | 1 |
| S03 | 1 |
| S04 | 1 |
| S05 | 1 |
| S06 | 3 |
| S07 | 2 |
| S08 | 3 |
| S09 | 3 |
| S10 | 4 |
| S11 | 3 |
| S12 | 2 |
| S13 | 2 |
| S14 | 2 |
| S15 | 1 |
| S16 | 4 |
| S17 | 3 |
| S18 | 3 |
| S19 | 1 |
| S20 | 1 |
| S21 | 1 |
| S22 | 1 |
| S23 | 3 |
| S24 | 1 |
| S25 | 4 |
| S26 | 1 |
| S27 | 4 |
| S28 | 1 |
| S29 | 2 |
| S30 | 1 |
| S31 | 1 |
| S32 | 1 |
| S33 | 3 |
| S34 | 3 |
| S35 | 3 |
| S36 | 4 |
| S37 | 3 |
| S38 | 2 |
| S39 | 1 |
| S40 | 2 |
| S41 | 4 |
| S42 | 1 |
| S43 | 2 |
| S44 | 1 |
| S45 | 2 |
| S46 | 2 |
| S47 | 2 |
| S48 | 2 |
| S49 | 2 |
| S50 | 2 |
| S51 | 3 |
| S52 | 3 |
It is advisable to run k-means several times to assess how stable its cluster assignments are on a particular dataset. If the results vary significantly across runs, they should be validated against prior knowledge or with other methods.
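One way to quantify how much repeated runs agree is the Rand index, the fraction of sample pairs that two clusterings treat the same way (together in both, or apart in both). It ignores how clusters are numbered, so runs that differ only in their labels still score 1. The sketch below is a hedged illustration on synthetic data; the helper names, the compact k-means routine, and the choice of four clusters and five runs are all assumptions.

```python
import numpy as np
from itertools import combinations

def kmeans_labels(X, k, seed, max_iter=100):
    """Compact k-means sketch returning only the cluster labels."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels

def rand_index(a, b):
    """Fraction of sample pairs on which two clusterings agree."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float((same_a[upper] == same_b[upper]).mean())

# Synthetic stand-in data: 52 samples x 20 genes without strong structure.
rng = np.random.default_rng(1)
X = rng.normal(5.0, 1.0, size=(52, 20))

# Five runs that differ only in their random initialisation.
runs = [kmeans_labels(X, k=4, seed=s) for s in range(5)]
scores = [rand_index(a, b) for a, b in combinations(runs, 2)]
print(f"mean pairwise Rand index: {np.mean(scores):.3f}")
```

Values close to 1 indicate that the runs found essentially the same grouping. The adjusted Rand index, available in scikit-learn as `adjusted_rand_score`, additionally corrects for agreement expected by chance.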
Acknowledgement
These projects were supported by the Louisiana Biomedical Research Network and Pine Biotech under the LBRN Summer Bioinformatics Workshop in 2018.