Systematic Evolution of Ligands by EXponential enrichment (SELEX) is an experimental method for identifying aptamer sequences (Ellington and Szostak 1990; Tuerk and Gold 1990). Combining SELEX with next-generation sequencing (NGS), known as HT-SELEX, yields a huge number of oligonucleotide sequences from SELEX pools. This makes it possible to search for different types of aptamers against different epitopes, and to find aptamers hidden among oligonucleotides that are PCR artifacts and/or bead binders. However, it is not feasible to evaluate every sequence for binding to the target molecule. A clustering method is therefore important for strategically selecting aptamer candidates from such a large dataset. We developed fast string-based clustering (FSBC) for HT-SELEX data in R (R Core Team 2013) using Bioconductor packages (Gentleman et al. 2004). This document shows an example of using the FSBC package to cluster an HT-SELEX dataset.
The sample dataset “sample.fst” contains 1000 oligonucleotide sequences without primer regions. The first and last few sequences in the data are shown below.
library(Biostrings)  # provides readDNAStringSet()
DS <- readDNAStringSet("data/sample.fst")
DS
## A DNAStringSet instance of length 1000
## width seq names
## [1] 30 ATGGATGGGGGTCGGGGGTCGGGTGGGTGG 1
## [2] 30 GCGGGGGGTGCTAGGGCGGAGGTGGGCGTT 2
## [3] 30 GCGGGGGGTGCTAGGGCGGAGGTGGGCGTT 3
## [4] 30 TGGGGTGGGCGCAGGTGAGGGGGTGGGGGT 4
## [5] 30 GCGGGGGGTGCTAGGGCGGAGGTGGGCGTT 5
## ... ... ...
## [996] 30 TGGGGTGGGCGCAGGTGAGGGGGTGGGGGT 996
## [997] 30 GCGGGGGGTGCTAGGGCGGAGGTGGGCGTT 997
## [998] 30 TGGGGTGGGCGCAAGTGAGGGGGTGGGGGT 998
## [999] 30 TGGGGTGGGCGCAGGTGAGGGGGTGGGGGT 999
## [1000] 30 TGGGGTGGGCACAGGTGAGGGGGTGGGGAT 1000
The following script shows the FSBC workflow for generating clusters from raw data. The workflow consists of calculating sequence frequencies, calculating nucleobase ratios, selecting over-represented strings, and clustering with the selected strings.
# Minimum and maximum lengths of the strings considered.
lmin <- 5
lmax <- 10
# Calculate frequency.
DS.freq <- fsbc_calc_freq(DS)
# Get probability of nucleobases.
BR <- fsbc_get_base_ratio(DS)
# Select subsequences.
DF.subseq <- fsbc_search_subseq(DS.freq, DS.freq@metadata$freq, symbols = BR)
# Generate clusters with selected subsequences.
L.cluster <- fsbc_seq_cluster(rownames(DF.subseq), DS.freq)
# Add cluster ID to the cluster object.
DS.cluster <- fsbc_label_cluster(L.cluster, DS.freq)
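As a rough illustration of the first two steps, the following base-R sketch counts identical reads and computes nucleobase ratios for the first five reads of the sample output shown earlier. It is not the FSBC implementation: the variable names (reads, freq_df, base_ratio) are made up for this example, and the package functions operate on DNAStringSet objects rather than character vectors.

```r
# Toy reads copied from the sample output shown earlier (not the full dataset).
reads <- c(
  "ATGGATGGGGGTCGGGGGTCGGGTGGGTGG",
  "GCGGGGGGTGCTAGGGCGGAGGTGGGCGTT",
  "GCGGGGGGTGCTAGGGCGGAGGTGGGCGTT",
  "TGGGGTGGGCGCAGGTGAGGGGGTGGGGGT",
  "GCGGGGGGTGCTAGGGCGGAGGTGGGCGTT"
)

# Count identical reads and sort by decreasing frequency.
freq <- sort(table(reads), decreasing = TRUE)
freq_df <- data.frame(
  seq  = names(freq),
  freq = as.integer(freq),
  rank = seq_along(freq),
  stringsAsFactors = FALSE
)

# Nucleobase ratio over all reads (duplicates included).
bases <- unlist(strsplit(reads, ""))
base_ratio <- table(factor(bases, levels = c("A", "C", "G", "T"))) / length(bases)

print(freq_df)
print(round(base_ratio, 3))
```

Counting over all reads, duplicates included, means that highly enriched sequences dominate the base ratio, which matters when the ratio is later used as a background model.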
The following figure shows the frequency of oligonucleotide sequences.
plot(DS.freq@metadata$rank, DS.freq@metadata$freq, pch = 19,
     xlab = "Ranking", ylab = "Frequency")
Frequency of oligonucleotide sequences
The following figure shows ratios of nucleobases.
barplot(as.vector(BR), col = 1:4, names = names(BR), ylim = c(0,1),
        main = "Nucleobase ratio")
Nucleobase ratio
The top 12 over-represented strings are shown below. Many of the oligonucleotide sequences contain G-quadruplex motifs, but the selected strings do not include such G-quadruplex sequences. Because the ratio of guanine is quite high in this data, G-rich strings are expected to occur often by chance, so the G-quadruplex motifs were not scored as strongly over-represented.
DF.subseq[1:12,]
## F R P Z ZZ L rank
## CGCAGGTGA 421 0.421 4.139253e-05 2069.1281 5.089707 9 1
## TGCTA 401 0.401 3.864192e-03 202.4186 4.431933 5 2
## CAGGTGA 436 0.436 6.791627e-04 528.4088 3.775269 7 3
## CGCAGGTGAG 417 0.417 2.569151e-05 2601.4776 3.618246 10 4
## CTAGGGCGGA 414 0.414 2.569151e-05 2582.7608 3.587151 10 5
## GCGCAGGTGA 412 0.412 2.569151e-05 2570.2829 3.566422 10 6
## GTGCTA 398 0.398 2.417405e-03 254.7347 3.390683 6 7
## TGCTAG 396 0.396 2.417405e-03 253.4468 3.369902 6 8
## CGCAGGT 428 0.428 9.040224e-04 449.3992 3.105398 7 9
## CTAGGGC 427 0.427 9.037726e-04 448.4091 3.097004 7 10
## CGCAG 435 0.435 8.797275e-03 144.3314 2.979802 5 11
## GCGCA 431 0.431 8.797275e-03 142.9768 2.945938 5 12
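The Z column above is numerically consistent with a one-proportion z-score, \(Z = (R - P) / \sqrt{P(1 - P)/N}\), where \(R = F/N\) is the observed fraction of reads containing the string, \(P\) is its expected probability, and \(N = 1000\) is the number of reads. This reading is inferred from the printed values rather than taken from the package source, and it says nothing about how \(P\) or the \(Z^*\)-score (column ZZ) are computed. A short check against the first three rows, with z_score as a hypothetical helper:

```r
# Values copied from the first three rows of the table above.
tab <- data.frame(
  string = c("CGCAGGTGA", "TGCTA", "CAGGTGA"),
  F = c(421, 401, 436),
  P = c(4.139253e-05, 3.864192e-03, 6.791627e-04),
  Z = c(2069.1281, 202.4186, 528.4088)
)
N <- 1000  # number of reads in the sample dataset

# Hypothetical helper: one-proportion z-score for an observed count.
z_score <- function(count, p, n) {
  r <- count / n
  (r - p) / sqrt(p * (1 - p) / n)
}

z_check <- z_score(tab$F, tab$P, N)

# Compare the recomputed values with the printed Z column.
print(cbind(tab, Z_recomputed = round(z_check, 4)))
```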
The distributions of \(Z\)-score and \(Z^*\)-score are shown in the following figures.
par(mfrow = c(1,2))
boxplot(DF.subseq$Z ~ DF.subseq$L, main = "Z-score")
boxplot(DF.subseq$ZZ ~ DF.subseq$L, main = "Z*-score")
Distribution of \(Z\)-score and \(Z^*\)-score
The ratio of selected strings to all possible strings of lengths lmin to lmax is shown below.
all <- sum(sapply(lmin:lmax, function(i) 4^i))
nrow(DF.subseq) / all
## [1] 0.0003076351
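The denominator is the number of distinct strings over a four-letter alphabet with lengths from lmin to lmax, which is a geometric sum: \(\sum_{i=5}^{10} 4^i = (4^{11} - 4^5)/3 = 1397760\). The printed ratio therefore corresponds to 430 selected strings. The sketch below checks this arithmetic in base R; the count of 430 is inferred from the printed ratio, not read from DF.subseq.

```r
lmin <- 5
lmax <- 10

# Number of possible strings of each length over A, C, G, T.
n_by_length <- 4^(lmin:lmax)
n_all <- sum(n_by_length)

# Closed form of the same geometric sum.
n_closed <- (4^(lmax + 1) - 4^lmin) / 3

# 430 is inferred from the printed ratio, not read from DF.subseq.
n_selected <- 430
ratio <- n_selected / n_all

print(c(n_all = n_all, n_closed = n_closed))
print(ratio)
```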
The following figure shows the frequency of each sequence against its cluster ranking.
freq <- DS.cluster@metadata$freq # Frequency of sequence
cid <- DS.cluster@metadata$cluster.id # Cluster ranking
plot(cid, freq, pch = 19, ylab = "Frequency", xlab = "Cluster ranking", log = "x")
The following result shows the number of unique sequences in each cluster.
cluster.div <- sapply(L.cluster, length)
l <- factor(nchar(names(cluster.div)))
bp <- barplot(cluster.div, names = seq_along(cluster.div), col = l, ylab = "Number of unique sequences")
legend("topright", title = " Length of strings ", levels(l), col = 1:nlevels(l), pch = 15)
text(bp[,1], cluster.div, names(cluster.div), pos = 3, xpd = TRUE)
Clustering Results
cluster.freq <- by(DS.cluster@metadata$freq, DS.cluster@metadata$cluster.id, sum)
l <- factor(nchar(names(L.cluster)))
bp <- barplot(cluster.freq, names = seq_along(cluster.freq), col = l, ylab = "Total frequency")
legend("topright", title = " Length of strings ", levels(l), col = 1:nlevels(l), pch = 15)
text(bp[,1], cluster.freq, names(L.cluster), pos = 3, xpd = TRUE)
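The two bar plots summarise the clusters in different ways: the first counts unique sequences per cluster, while the second sums the read frequencies of those sequences. The base-R sketch below shows the difference on a small, entirely hypothetical table (the cluster IDs and frequencies are invented for illustration).

```r
# Hypothetical toy table: one row per unique sequence.
toy <- data.frame(
  cluster.id = c(1, 1, 1, 2, 2, 3),
  freq       = c(319, 5, 1, 278, 2, 8)
)

# Number of unique sequences per cluster (as in the first bar plot).
n_unique <- tapply(toy$freq, toy$cluster.id, length)

# Total read count per cluster (as in the second bar plot).
n_reads <- tapply(toy$freq, toy$cluster.id, sum)

print(rbind(n_unique, n_reads))
```

A cluster with few unique sequences can still account for most reads, so the two views can rank clusters differently.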
The following result shows the top-ranked sequence from each of the top 5 clusters.
head(lapply(L.cluster, head, 1), 5)
## $CGCAGGTGA
## A DNAStringSet instance of length 1
## width seq names
## [1] 30 TGGGGTGGGCGCAGGTGAGGGGGTGGGGGT R:1.F:319
##
## $TGCTA
## A DNAStringSet instance of length 1
## width seq names
## [1] 30 GCGGGGGGTGCTAGGGCGGAGGTGGGCGTT R:2.F:278
##
## $CAGGTGA
## A DNAStringSet instance of length 1
## width seq names
## [1] 30 TGGGGTGGGCACAGGTGAGGGGGTGGGGGT R:6.F:8
##
## $CTAGGGCGGA
## A DNAStringSet instance of length 1
## width seq names
## [1] 30 GCGGGGGGCGCTAGGGCGGAGGTGGGCGTT R:4.F:16
##
## $CGCAGGT
## A DNAStringSet instance of length 1
## width seq names
## [1] 30 TGGGGTGGGCGCAGGTAAGGGGATGGGGGT R:66.F:1
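The cluster names above skip several higher-ranked strings (for example CGCAGGTGAG, rank 4). That is what one would expect if each sequence is assigned to the first selected string it contains, in rank order, and empty clusters are dropped. The base-R sketch below illustrates such a greedy assignment on the five sequences printed above. It is an assumption about the behaviour, not the package's code, and assign_greedy is a hypothetical helper.

```r
# Selected strings in rank order (first nine rows of the table of
# over-represented strings).
motifs <- c("CGCAGGTGA", "TGCTA", "CAGGTGA", "CGCAGGTGAG", "CTAGGGCGGA",
            "GCGCAGGTGA", "GTGCTA", "TGCTAG", "CGCAGGT")

# The five representative sequences printed above.
reads <- c(
  "TGGGGTGGGCGCAGGTGAGGGGGTGGGGGT",
  "GCGGGGGGTGCTAGGGCGGAGGTGGGCGTT",
  "TGGGGTGGGCACAGGTGAGGGGGTGGGGGT",
  "GCGGGGGGCGCTAGGGCGGAGGTGGGCGTT",
  "TGGGGTGGGCGCAGGTAAGGGGATGGGGGT"
)

# Assign each read to the first motif (in rank order) that it contains.
assign_greedy <- function(reads, motifs) {
  cluster <- rep(NA_character_, length(reads))
  for (m in motifs) {
    hit <- is.na(cluster) & grepl(m, reads, fixed = TRUE)
    cluster[hit] <- m
  }
  # drop = TRUE removes motifs that ended up with no reads.
  split(reads, factor(cluster, levels = motifs), drop = TRUE)
}

clusters <- assign_greedy(reads, motifs)
print(names(clusters))
```

Strings such as CGCAGGTGAG and GTGCTA contain a higher-ranked string, so every read that matches them has already been claimed and their clusters stay empty.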
Ellington, Andrew D, and Jack W Szostak. 1990. “In Vitro Selection of RNA Molecules That Bind Specific Ligands.” Nature 346 (6287). Nature Publishing Group: 818.
Gentleman, Robert C, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, et al. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biology 5 (10). BioMed Central: R80.
R Core Team. 2013. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.
Tuerk, Craig, and Larry Gold. 1990. “Systematic Evolution of Ligands by Exponential Enrichment: RNA Ligands to Bacteriophage T4 DNA Polymerase.” Science 249 (4968). JSTOR: 505–10.