In my work with transcriptomics, I developed and executed RNA-Seq pipelines, focusing on the importance of selecting the most appropriate algorithms for the data being analyzed. The project involved gene expression analysis using RNA-Seq data processed through RSEM for expression measurement. I further refined the data by filtering out genes with zero expression values and applying quantile normalization to ensure consistency across samples.

Once the data tables were prepared, I utilized machine learning algorithms to classify these subtypes automatically, identifying the most informative genes for distinguishing between the known classes.

1. Classification

Breast cancer was reported to be able divided into distinct subtypes. In this work, I worked on 38 datasets from 5 cell lines (luminal, basal, claudin-low, and normal-like cases). We want to identify the most informative features (genes) in these subtypes using supervised machine learning methods, specifically classification algorithms. In this case, I applied Linear Discriminant Analysis (LDA). LDA works similarly to Principal Component Analysis (PCA) but takes subtype information into account when determining the components. Like other classification methods, LDA uses a training set where the subtypes are already known. From this training set, LDA constructs a classifier that can then assign new samples to one of the known subtypes.

1.1. Input

The gene expression data and subtype information are combined into a single table to serve as the training set. Each sample (or cell line) is represented by a column and is labeled with its respective subtype. For this analysis, the subtypes were encoded as follows:

  • B – Basal
  • CL – Claudin-low
  • L – Luminal
  • N – Normal-like
ID S01 S02 S03 S04 S05 S06 S07 S08 S09 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20 S21 S22 S23 S24 S25 S26 S27 S28 S29 S30 S31 S32 S33 S34 S35 S36 S37 S38
class Type_L Type_L Type_L Type_L Type_CL Type_L Type_L Type_L Type_L Type_B Type_CL Type_L Type_L Type_B Type_B Type_B Type_B Type_L Type_B Type_B Type_B Type_L Type_N Type_N Type_N Type_L Type_L Type_L Type_CL Type_L Type_L Type_B Type_L Type_CL Type_B Type_L Type_B Type_L
ENSG00000000419 5.32 5.3 6.64 5.6 6.61 5.24 6.66 6.62 6.41 5.19 6.16 7.29 4.78 6.14 5.66 5.8 6.2 4.69 6.5 5.49 7.29 5.23 5.63 4.82 5.64 6.07 4.79 6.25 5.19 5.8 4.63 5.98 6.19 4.97 4.15 5.82 6.4 4.67
ENSG00000001036 6.11 4.8 4.47 4.99 6.94 5 5.11 5.42 4.58 4.74 6.03 5.47 4.32 6.16 5.89 5.37 6.01 4.72 5.17 5.96 6.35 5.02 5.31 6.37 5.24 6.29 5.81 4.16 6.56 5.91 5.19 4 5.48 4.14 5.9 5.51 5.14 6.22
ENSG00000001084 5.78 4.01 5.14 5.11 3.63 5.93 4.11 3.68 3.23 5.01 5.81 4.5 5.17 5.61 6.19 5.02 6.79 5.97 4.77 5.85 4.78 4.27 6.8 6.25 4.5 5.08 5.21 4.24 4.12 6 4.16 3.19 4.29 3.95 5.26 6.63 4.38 6.38

For the test set, I used a similar table that did not include class labels, allowing LDA to predict the subtypes.

ID T01 T02 T03 T04 T05 T06 T07 T08 T09 T10 T11 T12 T13 T14
ENSG00000000419 6.06 5.25 5.15 5.36 5.22 5.75 4.79 6.01 6.39 7.71 5.7 5.85 5.34 5.07
ENSG00000001036 4.47 5.39 5.28 5.82 5.75 6.96 6.09 5.07 5.3 5.76 5.4 4.27 5.03 5.68
ENSG00000001084 4.47 6 5.77 5.76 5.76 4.64 4.07 4.52 4.16 3.24 4.35 2.49 5.89 4.37
ENSG00000001497 5.27 4.27 6.58 5.54 5.57 6.4 4.91 4.79 4.95 5.83 4.89 5.33 4.23 6.03

1.2. Result

The effectiveness of the LDA classifier constructed for the training set could be assessed by examining the results presented in LDA_plot. For example, the first two axes achieve a separation of type N (Normal-like) and type CL (Claudin-low).

LDA plot

Another results provided by LDA is the confusion matrix, which presents the predicted classifications in rows against the actual classifications in columns. The diagonal elements represent the number of correct predictions (true positives), while the values above the diagonal indicate true observations that were not predicted (false negatives), and those below represent predictions that were incorrect (false positives). According to this matrix, the classification accuracy for the training set around 87%.

Prediction result

Sample Class
T01 Type_B
T02 Type_B
T03 Type_B
T04 Type_B
T05 Type_B
T06 Type_CL
T07 Type_B
T08 Type_L
T09 Type_L
T10 Type_L
T11 Type_L
T12 Type_L
T13 Type_L
T14 Type_L

Confusion matrix

True predicted Type_B Type_CL Type_L Type_N
Type_B 9 1 1 1
Type_CL 0 3 0 0
Type_L 1 0 19 0
Type_N 1 0 0 2

In addition to LDA, various classification methods, including Support Vector Machine (SVM), Random Forest, and Naive Bayes were introduced. Each of these methods employs different mathematical approaches to tackle the same classification task, providing multiple options for the best fit for specific data and analysis purposes.

2. Feature Selection

One effective method for this is stepwise feature selection. Similar to LDA, this approach begins by testing individual features and selecting the one that provides the best classification quality for the training set. The process then evaluates pairs of features, where the first feature is the previously selected one, and identifies the pair that yields the highest classification accuracy. This method continues with triples, quadruples, and so forth. Although this greedy strategy may not be optimal, it delivers results within a reasonable timeframe.

2.1. Input

To implement this, I utilized the swLDA (step-wise LDA) algorithm, using the same training and testing datasets.

2.2. Result

Unlike LDA, swLDA requires an additional parameter called Niveau, which determines the stopping criterion. This parameter should be adjusted based on the size of the transcriptome; in this case, with nearly 7,000 genes, I set it to 0.0005.

Prediction result

Sample Class
T01 Type_N
T02 Type_B
T03 Type_B
T04 Type_B
T05 Type_B
T06 Type_CL
T07 Type_CL
T08 Type_L
T09 Type_L
T10 Type_L
T11 Type_L
T12 Type_L
T13 Type_L
T14 Type_L

Confusion matrix

True predicted Type_B Type_CL Type_L Type_N
Type_B 11 0 0 0
Type_CL 0 4 0 0
Type_L 0 0 20 0
Type_N 0 0 0 3

swLDA plot

Compared to LDA, swLDA achieved a higher accuracy in subtype classification, with a prediction rate of 100%. The feature mean and coefficient tables below revealed that 7 genes were selected, which enabled perfect classification of the training set.

swLDA_Features_means

Class ENSG00000116299 ENSG00000166145 ENSG00000243509 ENSG00000119888 ENSG00000168918 ENSG00000261040 ENSG00000049283
Basal 1.1 7.2 0.4 6.9 0.1 2.0 4.7
Claudin-low 0.7 0.8 1.4 0.7 0.2 3.9 0.2
Luminal 6.5 7.7 0.4 7.0 1.1 0.5 6.5
Normal-like 0.5 6.6 5.8 3.9 3.4 4.2 1.1

swLDA_Features_coeffs

scaling LD1 scaling LD2 scaling LD3  
ENSG00000116299 0.8 -1.1 0.5
ENSG00000166145 1.0 1.0 0.1
ENSG00000243509 0.0 0.4 1.6
ENSG00000119888 1.0 1.3 -0.3
ENSG00000168918 0.4 0.6 1.0
ENSG00000261040 -0.3 0.3 -0.8
ENSG00000049283 0.5 -0.6 -0.4

Acknowledgement

These projects were supported by Louisiana Biomedical Research Network and Pine Biotech under LBRN Summer Bioinformatics Workshop in 2018.