Van La, PhD – Computational Biologist

As we explored in previous sections on classification and feature selection, gene expression data can reliably identify cancer subtypes. Next, we will investigate whether it is possible to uncover subtype separations in an unsupervised manner using only transcriptomic data. Specifically, we will employ clustering techniques to group samples based on their similarities.

1. Hierarchical Clustering

The definition of similarity, especially between clusters, can significantly impact the outcomes of hierarchical clustering. Generally, clustering methods calculate the similarities between individual samples and then aggregate these similarities into a collective measure for the clusters. Each method utilizes a specific metric for similarity computation.
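To make the idea of aggregating sample-level similarities into a cluster-level measure concrete, here is a minimal sketch of three common linkage rules (single, complete, average) built on Euclidean distance. The function name `linkage_distance` and the toy points are illustrative assumptions; the text does not state which metric or linkage was used in this analysis.

```python
from itertools import product
from math import dist
from statistics import mean


def linkage_distance(cluster_a, cluster_b, method="average"):
    """Distance between two clusters of points under a given linkage rule."""
    # All pairwise Euclidean distances between members of the two clusters.
    pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":    # distance of the closest pair
        return min(pairwise)
    if method == "complete":  # distance of the farthest pair
        return max(pairwise)
    if method == "average":   # mean distance over all pairs
        return mean(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Toy 2-D points, purely illustrative (not taken from the dataset).
a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
for method in ("single", "complete", "average"):
    print(method, round(linkage_distance(a, b, method), 3))
```

The same pair of clusters can be "close" under single linkage and "far" under complete linkage, which is why the choice of rule can change the resulting tree.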

In this study, I applied agglomerative (bottom-up) hierarchical clustering, which begins by assigning each sample to its own cluster. The algorithm identifies the two most similar clusters (initially, individual samples) and merges them into a single cluster, which is then treated as a “quasi-sample.” This merging continues until all samples are consolidated into one cluster. The result is typically visualized as a dendrogram, which shows the order in which clusters were merged. The height at which two branches join indicates the dissimilarity between the merged clusters: the greater the height, the more dissimilar they are. Cutting the dendrogram at a chosen height yields the desired number of clusters: a cut closer to the leaves produces more, smaller clusters, while a cut near the root produces fewer, larger ones.
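The merging procedure described above can be sketched in a few lines of plain Python. This is a naive illustration under stated assumptions (Euclidean distance, average linkage by default, a hypothetical function name `agglomerative`), not the implementation used for the results in this article. Stopping the merges once `n_clusters` remain plays the role of cutting the dendrogram, and the recorded merge heights are what a dendrogram would draw.

```python
from itertools import combinations
from math import dist
from statistics import mean


def agglomerative(points, n_clusters, linkage=mean):
    """Merge the two closest clusters until n_clusters remain.

    `linkage` reduces the pairwise distances between two clusters to one
    number: min (single), max (complete) or mean (average linkage).
    Returns (labels, merges); each merge is (members_a, members_b, height).
    """
    clusters = [[i] for i in range(len(points))]  # every sample starts alone
    merges = []

    def cluster_distance(a, b):
        return linkage([dist(points[i], points[j]) for i in a for j in b])

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        height = cluster_distance(clusters[a], clusters[b])
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]

    # Turn the remaining clusters into one label per sample (numbering is arbitrary).
    labels = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels, merges


# Toy 2-D points, purely illustrative (not taken from the dataset).
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
labels, merges = agglomerative(points, n_clusters=3)
print(labels)
print([round(height, 3) for _, _, height in merges])
```

This version recomputes every pairwise distance at each step, so it is only suitable for small examples; for a dataset of realistic size, an established library routine such as those in SciPy's `scipy.cluster.hierarchy` module would be the more practical choice.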

1.1. Input

For this analysis, I used a dataset of expression values for 7,000 genes across 52 samples as input for clustering. Each row of the table is a gene and each column is a sample.

ID	S01	S02	S03	S04	S05	S06	S07	S08	S09	S10	S11	S12	S13	S14	S15	S16	S17	S18	S19	S20	S21	S22	S23	S24	S25	S26	S27	S28	S29	S30	S31	S32	S33	S34	S35	S36	S37	S38	S39	S40	S41	S42	S43	S44	S45	S46	S47	S48	S49	S50	S51	S52
ENSG00000000419	6.038	5.219	5.115	5.326	5.186	5.287	5.268	6.624	5.571	6.597	5.211	6.652	6.611	6.395	5.151	6.138	7.278	4.737	6.125	5.63	5.777	6.18	4.644	6.481	5.724	5.459	4.74	7.281	5.198	5.609	4.776	5.615	6.047	4.743	6.229	5.152	5.778	4.583	5.963	6.168	4.925	4.074	5.797	6.381	4.624	5.986	6.375	7.702	5.679	5.832	5.308	5.036
ENSG00000001036	4.408	5.363	5.249	5.796	5.731	6.092	4.759	4.412	4.953	6.929	4.959	5.074	5.39	4.528	4.696	6.009	5.443	4.256	6.142	5.864	5.341	5.992	4.676	5.139	6.946	5.942	6.075	6.336	4.982	5.277	6.352	5.205	6.271	5.783	4.085	6.551	5.889	5.155	3.923	5.453	4.065	5.878	5.485	5.107	6.206	5.034	5.27	5.734	5.365	4.205	4.986	5.652
ENSG00000001084	4.417	5.983	5.749	5.736	5.738	5.758	3.925	5.101	5.073	3.528	5.907	4.03	3.577	3.093	4.969	5.789	4.443	5.138	5.585	6.172	4.977	6.776	5.952	4.72	4.586	5.829	3.988	4.735	4.201	6.793	6.236	4.439	5.045	5.176	4.172	4.044	5.978	4.084	3.042	4.221	3.871	5.231	6.616	4.314	6.365	4.468	4.084	3.095	4.29	2.249	5.871	4.303
ENSG00000001497	5.242	4.2	6.563	5.509	5.539	4.889	6.14	5.325	5.067	6.147	5.469	6.077	6.871	5.967	6.682	4.986	5.3	6.449	4.947	5.46	5.438	5.933	3.964	4.578	6.386	4.93	4.866	6.102	4.718	5.275	4.747	5.592	5.071	5.305	4.131	5.444	3.832	5.106	5.93	4.696	4.91	5.225	4.644	6.144	5.069	4.746	4.907	5.808	4.85	5.3	4.157	6.014
ENSG00000001617	6.028	7.304	3.906	5.913	5.957	3.985	7.524	4.648	6.675	3.168	5.361	6.536	6.393	5.222	1.325	3.853	4.694	4.617	2.5	3.199	5	5.301	5.471	5.007	1.436	5.811	2.974	3.771	7.3	5.09	5.116	5.192	5.487	6.753	5.083	0.373	5.46	7.737	0.795	6.797	1.26	4.758	3.704	5.568	5.234	4.296	5.062	6.554	6.172	6.783	4.914	5.93
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
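Because the table stores genes in rows and samples in columns, it has to be transposed before samples can be clustered. The sketch below shows one way to do this with the standard library. It assumes the input is a tab-separated file laid out like the table above; the function name `load_samples` is hypothetical, and the embedded text holds only the first three samples of the first two genes as a stand-in for the full file.

```python
import csv
import io

# First three samples of the first two genes from the table above,
# used here as a tiny stand-in for the full tab-separated file.
text = (
    "ID\tS01\tS02\tS03\n"
    "ENSG00000000419\t6.038\t5.219\t5.115\n"
    "ENSG00000001036\t4.408\t5.363\t5.249\n"
)


def load_samples(handle):
    """Read a genes-by-samples table; return (sample_ids, gene_ids, profiles)."""
    rows = list(csv.reader(handle, delimiter="\t"))
    sample_ids = rows[0][1:]                  # header row without the "ID" cell
    gene_ids = [row[0] for row in rows[1:]]   # first column of each data row
    values = [[float(x) for x in row[1:]] for row in rows[1:]]
    # Transpose so that profiles[i] is the expression vector of sample i.
    profiles = [list(column) for column in zip(*values)]
    return sample_ids, gene_ids, profiles


sample_ids, gene_ids, profiles = load_samples(io.StringIO(text))
print(sample_ids)
print(profiles[0])  # expression of S01 across the loaded genes
```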

1.2. Result

The resulting dendrogram revealed that similar samples were grouped together. For instance, samples S03, S04, and S05 formed a small branch, while S12, S13, and S14 were located in a branch significantly distant from the first.

If the number of clusters is set to five, the algorithm will return the assigned cluster for each sample.

Sample	Cluster ID
S01	1
S02	1
S03	1
S04	1
S05	1
S06	2
S07	2
S08	2
S09	2
S10	3
S11	2
S12	2
S13	2
S14	2
S15	1
S16	3
S17	2
S18	2
S19	1
S20	1
S21	1
S22	1
S23	4
S24	1
S25	3
S26	1
S27	3
S28	1
S29	2
S30	1
S31	1
S32	1
S33	2
S34	2
S35	2
S36	3
S37	2
S38	2
S39	1
S40	2
S41	3
S42	1
S43	2
S44	1
S45	2
S46	2
S47	2
S48	2
S49	2
S50	2
S51	2
S52	2

2. K-means Clustering

Another commonly used clustering method is k-means. In k-means, the number of clusters, “K,” is set as an input parameter at the beginning, and “K” initial centroids are randomly selected within the sample space. Each sample is then assigned to the cluster associated with the nearest centroid. After the initial assignments, the algorithm updates each cluster’s centroid to reflect the actual mean of the samples it contains, and the process iterates with these new centroids. This cycle continues until convergence is reached or a predefined stopping condition is met, such as the maximum number of iterations.
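The assign-then-update cycle described above can be sketched as follows. This is a minimal illustration, not the implementation behind the results in this article. It assumes Euclidean distance and initializes the centroids at K randomly chosen samples, which is one common way of picking starting points; the function name `kmeans` and the `max_iter` and `seed` parameters are illustrative.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=None):
    """Plain k-means; returns (labels, centroids) with labels in 0..k-1."""
    rng = random.Random(seed)
    # Assumption: start from k distinct samples chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # converged: no assignment changed
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy 2-D points forming two well-separated groups (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2, seed=0)
print(labels)
print(centroids)
```

The cluster numbers themselves carry no meaning: a different seed may produce the same grouping with the labels swapped, so runs should be compared by which samples end up together rather than by label value.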

It is important to note that due to random initialization, running k-means multiple times with the same parameters on the same data can yield different results, especially if the true clustering structure is not clear.

2.1. Input

We continue to work with the same input data of 52 samples as described in 1.1.

2.2. Result

The resulting table is straightforward, with each sample assigned to a cluster.

Sample	Cluster ID
S01	1
S02	1
S03	1
S04	1
S05	1
S06	3
S07	2
S08	3
S09	3
S10	4
S11	3
S12	2
S13	2
S14	2
S15	1
S16	4
S17	3
S18	3
S19	1
S20	1
S21	1
S22	1
S23	3
S24	1
S25	4
S26	1
S27	4
S28	1
S29	2
S30	1
S31	1
S32	1
S33	3
S34	3
S35	3
S36	4
S37	3
S38	2
S39	1
S40	2
S41	4
S42	1
S43	2
S44	1
S45	2
S46	2
S47	2
S48	2
S49	2
S50	2
S51	3
S52	3

It is advisable to repeat the clustering several times to assess how stable the k-means results are on a particular dataset. If the assignments vary significantly across runs, they should be validated against prior knowledge or the results of other methods.
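One simple way to quantify how much two sets of assignments differ is the Rand index: the fraction of sample pairs that both clusterings treat the same way (together in both, or apart in both). It is offered here only as an illustration and is not a measure used in the original analysis; the function name `rand_index` is an assumption. The example compares the cluster IDs of S01 to S10 from the two result tables above, and the same function could be applied to repeated k-means runs.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Cluster IDs of S01-S10, copied from the two result tables above.
hierarchical = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3]
kmeans_run = [1, 1, 1, 1, 1, 3, 2, 3, 3, 4]
print(round(rand_index(hierarchical, kmeans_run), 3))  # 42 of 45 pairs agree -> 0.933
```

A value of 1.0 means the two groupings are identical up to renaming of the cluster IDs, so the measure is unaffected by the arbitrary numbering of clusters.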

Acknowledgement

These projects were supported by the Louisiana Biomedical Research Network and Pine Biotech under the LBRN Summer Bioinformatics Workshop in 2018.
