DNA-encoded chemical libraries (DELs) have revolutionized drug discovery by enabling the synthesis and screening of billions of small molecules in parallel. Each compound in a DEL is tagged with a unique DNA barcode, allowing rapid identification of binders to therapeutic targets through high-throughput sequencing. Traditional methods for analyzing DEL sequencing data, however, rely on rigid in-house algorithms that assume perfect fidelity in library construction and sequencing. These approaches often fail to account for pervasive technological errors, such as base-calling inaccuracies or PCR-induced duplications, leading to significant data loss and false negatives.

The limitations of existing tools become acute when applied to large or structurally complex libraries, where sequencing errors can obscure critical molecular signals. Early alignment-based methods like BWA and BLAST, borrowed from genomics, introduced computational bottlenecks due to their reliance on reference sequences and exponential scaling with library size. These tools also struggled with the combinatorial nature of DELs, where chemical building blocks are combined across multiple synthesis cycles, creating multidimensional tag relationships that defy linear analysis frameworks.

The emergence of next-generation sequencing (NGS) technologies exacerbated these challenges by generating datasets of unprecedented volume and complexity. Sequencing runs for DEL screens routinely produce millions of short reads, each potentially harboring errors in critical coding regions. Without robust error-correction mechanisms, even minor inaccuracies in base calling could render entire reads unusable, undermining the statistical power needed to distinguish true binders from noise.

In response, researchers began developing purpose-built tools like count, an algorithm designed for direct tag detection in single-stranded DNA libraries. While faster than alignment-based methods, count lacked flexibility: it could not handle double-stranded libraries or adapt to variable tag structures. Its inability to process raw FASTQ files or leverage base quality scores further limited its utility in real-world screening environments. These gaps highlighted the need for a universally adaptable, error-aware analytical framework capable of scaling with the growing demands of DEL technology.

The development of tagFinder emerged from this unmet need, combining computational efficiency with a modular architecture that accommodates diverse library designs. By integrating error-aware detection, PCR duplicate removal, and multidimensional pattern recognition, the algorithm addresses both technical and biological variability inherent in DEL screens. Its open-source design ensures accessibility, enabling researchers to customize analyses for novel library architectures without sacrificing speed or accuracy.

At its core, tagFinder operates as a Perl-based pipeline that processes raw sequencing reads in FASTQ format, leveraging the Seqtk library for rapid file parsing. Unlike prior tools, it retains and utilizes base-calling quality scores to filter low-confidence reads, enhancing the reliability of downstream analyses. The algorithm begins by filtering reads based on length thresholds, discarding fragments too short or long to represent valid tags. This step eliminates chimeric sequences and incomplete amplification products, which are common artifacts in NGS data.
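The length and quality filtering described above can be illustrated with a minimal Python sketch. It is not tagFinder's Perl code: the function names, the thresholds, and the use of a mean Phred score as the quality criterion are assumptions made for illustration, and the quality decoding assumes the common ASCII offset of 33.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual

def mean_phred(qual, offset=33):
    """Mean Phred score, assuming Sanger / Illumina 1.8+ encoding (offset 33)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_reads(handle, min_len, max_len, min_mean_q):
    """Keep reads whose length is within bounds and whose mean quality is high enough."""
    for name, seq, qual in read_fastq(handle):
        if not (min_len <= len(seq) <= max_len):
            continue  # too short or too long to hold a valid tag
        if mean_phred(qual) < min_mean_q:
            continue  # low-confidence base calls
        yield name, seq, qual

# Hypothetical three-read input: one good read, one too short, one low quality.
example = StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACGTACGTAC\n+\n##########\n"
)
kept = [name for name, _, _ in filter_reads(example, 8, 12, 20)]
print(kept)
```

Only the first read survives here: the second fails the length bounds and the third fails the quality threshold.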

A configuration file defines the experiment-specific parameters, including tag structure, synthesis cycle lengths, and degenerated regions. This flexibility allows tagFinder to adapt to libraries with variable numbers of chemical cycles, mixed single- and double-stranded designs, or custom quality control sequences. For each read, the algorithm identifies constant regions—such as headpieces and closing sequences—that flank the variable coding regions. These anchors enable precise extraction of tag segments while flagging reads with mismatches in conserved regions.
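As a rough illustration of anchor-based extraction, the sketch below uses a Python dictionary in place of the configuration file. The sequences, the key names, and the assumed read layout (headpiece, coding region, closing sequence, degenerated region) are hypothetical and do not reproduce tagFinder's actual configuration format.

```python
# Hypothetical experiment description; real tagFinder config keys may differ.
CONFIG = {
    "headpiece": "GGAGCTTG",    # constant anchor before the coding region
    "cycle_lengths": [4, 4, 4],  # one code length per synthesis cycle
    "closing": "CCTAGG",         # constant anchor after the coding region
    "degenerate_length": 5,      # number of N positions after the closing anchor
}

def extract_tag(read, cfg):
    """Return (coding_region, degenerate_region), or None if an anchor is missing."""
    head = cfg["headpiece"]
    start = read.find(head)
    if start == -1:
        return None  # headpiece not found
    coding_start = start + len(head)
    coding_end = coding_start + sum(cfg["cycle_lengths"])
    closing = cfg["closing"]
    if read[coding_end:coding_end + len(closing)] != closing:
        return None  # closing sequence absent or shifted
    umi_start = coding_end + len(closing)
    umi = read[umi_start:umi_start + cfg["degenerate_length"]]
    if len(umi) < cfg["degenerate_length"]:
        return None  # read ends before the degenerated region is complete
    return read[coding_start:coding_end], umi

good_read = "TT" + "GGAGCTTG" + "AAAACCCCGGGG" + "CCTAGG" + "ACGTA" + "TT"
bad_read = "TT" + "GGAGCTTG" + "AAAACCCCGGGG" + "CCTAGC" + "ACGTA" + "TT"
print(extract_tag(good_read, CONFIG))
print(extract_tag(bad_read, CONFIG))
```

A read with a mismatch in the closing sequence returns `None` in this sketch; a fuller implementation would record such reads for later inspection rather than silently dropping them.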

The extracted tags are dissected into cycle-specific substrings, each cross-referenced against predefined lookup tables for the corresponding synthesis step. This sequential validation ensures that only reads matching all expected building blocks across cycles are counted as valid hits. Degenerated regions, often included to track PCR duplicates, are analyzed to distinguish unique molecules from amplification artifacts. By hashing these regions, tagFinder achieves a discrimination power that grows exponentially with the length of the degenerated sequence (four possible bases at each position), effectively eliminating redundant counts.
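The lookup and deduplication steps can be sketched with ordinary hash tables. The codes and building-block names below are invented, and the sketch assumes that every code within a cycle has the same length; it shows the general technique rather than tagFinder's implementation.

```python
from collections import defaultdict

# Hypothetical per-cycle lookup tables: DNA code -> building block name.
LOOKUP = [
    {"AAAA": "A1", "AAAC": "A2"},
    {"CCCC": "B1", "CCCA": "B2"},
    {"GGGG": "C1", "GGGT": "C2"},
]

def decode(coding, lookup):
    """Split the coding region per cycle; return a tuple of blocks or None."""
    blocks = []
    pos = 0
    for table in lookup:
        width = len(next(iter(table)))  # assumes uniform code length per cycle
        code = coding[pos:pos + width]
        if code not in table:
            return None  # one unknown code invalidates the whole read
        blocks.append(table[code])
        pos += width
    return tuple(blocks) if pos == len(coding) else None

def count_compounds(extracted, lookup):
    """Return (raw counts, deduplicated counts) per compound."""
    raw = defaultdict(int)
    seen = defaultdict(set)  # compound -> set of degenerated sequences observed
    for coding, umi in extracted:
        compound = decode(coding, lookup)
        if compound is None:
            continue
        raw[compound] += 1
        seen[compound].add(umi)
    return dict(raw), {c: len(u) for c, u in seen.items()}

reads = [
    ("AAAACCCCGGGG", "ACGTA"),
    ("AAAACCCCGGGG", "ACGTA"),  # same degenerated region: treated as a PCR duplicate
    ("AAAACCCCGGGG", "TTTTT"),
    ("AAAATTTTGGGG", "ACGTA"),  # unknown cycle-2 code: rejected
]
raw_counts, unique_counts = count_compounds(reads, LOOKUP)
print(raw_counts, unique_counts)
```

Because each lookup is a dictionary access and each duplicate check is a set insertion, the work per read stays roughly constant regardless of library size.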

Multidimensional aggregation of validated tags enables the detection of enrichment patterns across synthesis cycles. Monosynthons (single building blocks), disynthons (pairwise combinations), and trisynthons (three-component assemblies) are quantified and ranked based on their deviation from background noise. This approach transforms raw count data into a 3D scatter plot, where planes, lines, and singletons visually represent hierarchical binding affinities. Such visualizations aid in distinguishing nonspecific interactions from high-confidence hits, even in libraries exceeding millions of members.
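One way to picture the aggregation is to sum compound counts over every subset of cycle positions. In a three-cycle library, fixing one building block corresponds to a plane of the plot, fixing two to a line, and fixing all three to a single point. The sketch below is a generic illustration with invented counts, not the scoring scheme used by tagFinder, and it omits the comparison against background.

```python
from collections import Counter
from itertools import combinations

def aggregate(counts):
    """Sum counts for every mono-, di-, ... n-synthon contained in each compound.

    Keys are tuples of (cycle_index, building_block) pairs.
    """
    totals = Counter()
    for compound, n in counts.items():
        positions = range(len(compound))
        for size in range(1, len(compound) + 1):
            for subset in combinations(positions, size):
                key = tuple((i, compound[i]) for i in subset)
                totals[key] += n
    return totals

# Hypothetical deduplicated counts for three compounds.
counts = {
    ("A1", "B1", "C1"): 5,
    ("A1", "B1", "C2"): 3,
    ("A2", "B1", "C1"): 1,
}
totals = aggregate(counts)
print(totals[((0, "A1"),)])             # monosynthon A1 in cycle 1
print(totals[((0, "A1"), (1, "B1"))])   # disynthon A1 + B1
```

The same function works unchanged for libraries with four or more cycles, since it iterates over whatever number of positions each compound has.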

Outputs are formatted as tab-delimited files, detailing raw counts, deduplicated frequencies, and statistical outliers. Companion R scripts generate interactive visualizations, allowing researchers to explore cycle-specific enrichments or filter results based on synthesis step contributions. By integrating these features, tagFinder streamlines the transition from sequencing data to actionable insights, minimizing manual intervention and subjective interpretation.

NGS platforms exhibit intrinsic error rates influenced by sequencing chemistry and read length. For DEL applications, where coding regions occupy a fraction of each read, even sub-percent error rates can corrupt critical tag sequences. tagFinder mitigates this through dual error-handling modes: a default exact-match mode for high-quality datasets and an error-aware mode that tolerates mismatches in low-confidence bases. The latter employs quality score thresholds to selectively relax stringency, recovering reads with isolated errors while excluding those with systemic inaccuracies.
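The two modes can be approximated as follows: try an exact lookup first, and only if that fails accept a near match whose mismatched positions all carry low quality scores. The threshold, the mismatch limit, and the rule of rejecting ambiguous candidates are assumptions chosen for this sketch; the article does not specify tagFinder's exact criteria.

```python
def match_code(code, qual, table, min_q=20, max_mismatches=1, offset=33):
    """Return the building block for `code`, or None.

    Exact matches are accepted directly. Otherwise a reference code is accepted
    only if every mismatching position has a Phred score below `min_q`, and
    only if exactly one reference code qualifies. `qual` must be as long as `code`.
    """
    if code in table:
        return table[code]
    low = [ord(c) - offset < min_q for c in qual]
    candidates = []
    for ref, block in table.items():
        if len(ref) != len(code):
            continue
        diffs = [i for i, (a, b) in enumerate(zip(code, ref)) if a != b]
        if len(diffs) <= max_mismatches and all(low[i] for i in diffs):
            candidates.append(block)
    return candidates[0] if len(candidates) == 1 else None

table = {"AAAA": "A1", "CCCC": "A2"}
print(match_code("AAAA", "IIII", table))  # exact match
print(match_code("AAAT", "III#", table))  # mismatch at a low-quality base: recovered
print(match_code("AAAT", "IIII", table))  # mismatch at a confident base: rejected
```

Rejecting reads with more than one plausible reference code keeps the relaxed mode from reassigning reads to similar but incorrect tags.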

PCR amplification, essential for sequencing library preparation, introduces duplicates that inflate apparent hit frequencies. Traditional methods address this by incorporating unique molecular identifiers (UMIs), but their effectiveness depends on UMI length and sequencing depth. tagFinder enhances duplicate removal by analyzing degenerated regions within closing sequences, achieving near-complete deduplication without requiring additional library modifications. This feature proved critical in a 6-million-member DEL screen, where a degenerated region of nine N positions allowed over 260,000 unique molecules per compound to be distinguished.
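The figure of over 260,000 follows from simple counting: each degenerated position can hold one of four bases, so a region of n positions can take 4^n distinct values. The helper name below is illustrative.

```python
def degenerate_space(n_positions, alphabet_size=4):
    """Number of distinct sequences a degenerated region of this length can take."""
    return alphabet_size ** n_positions

for n in (6, 9, 12):
    print(n, degenerate_space(n))
```

For nine positions this gives 262,144, which sets the upper bound on how many molecules of one compound can be told apart by the degenerated region alone.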

Chimeric sequences, arising from aberrant tag ligation during library synthesis, pose another analytical challenge. These artifacts create false combinatorial relationships between unrelated building blocks, complicating hit prioritization. tagFinder detects chimeras by identifying tags that appear in unexpected cycle positions or violate synthesis step constraints. By cataloging these anomalies, the tool aids in troubleshooting library construction protocols, improving the fidelity of future screens.
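A simple version of this positional check labels each cycle's code according to where, if anywhere, it appears in the lookup tables. The labels and table contents are hypothetical, and the article does not detail tagFinder's actual chimera criteria.

```python
# Hypothetical sets of valid codes for each synthesis cycle.
CYCLE_CODES = [
    {"AAAA", "AAAC"},
    {"CCCC", "CCCA"},
    {"GGGG", "GGGT"},
]

def classify_codes(codes, cycle_codes):
    """Label each code 'ok', 'misplaced' (valid for another cycle), or 'unknown'."""
    labels = []
    for i, code in enumerate(codes):
        if code in cycle_codes[i]:
            labels.append("ok")
        elif any(code in table for j, table in enumerate(cycle_codes) if j != i):
            labels.append("misplaced")
        else:
            labels.append("unknown")
    return labels

def looks_chimeric(codes, cycle_codes):
    """Flag reads in which at least one code sits in the wrong cycle position."""
    return "misplaced" in classify_codes(codes, cycle_codes)

print(classify_codes(["AAAA", "AAAC", "GGGG"], CYCLE_CODES))
print(looks_chimeric(["AAAA", "CCCC", "GGGG"], CYCLE_CODES))
```

Separating misplaced codes from unknown ones is what makes the tally useful for troubleshooting: the former points to a ligation problem, the latter more likely to a sequencing error.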

Scalability remains a cornerstone of tagFinder’s design. Testing on a 6.1-million-compound library demonstrated linear processing times and memory usage, contrasting sharply with the exponential resource demands of alignment-based tools. The algorithm’s efficiency stems from its avoidance of pairwise read-to-reference comparisons, instead relying on hash tables and pattern matching to minimize computational overhead. This enables analyses of terabyte-scale datasets on consumer-grade hardware, democratizing access to large-scale DEL screens.

The tool’s compatibility with diverse sequencing platforms—including Illumina, Ion Torrent, and Oxford Nanopore—ensures broad applicability. By processing raw FASTQ files without preprocessing, tagFinder bypasses format conversion steps that can introduce data loss or misalignment. This platform agnosticism future-proofs the algorithm against evolving sequencing technologies, ensuring sustained relevance as read lengths and accuracies improve.

Comparative studies between tagFinder, count, BWA, and BLAST underscore the algorithm’s superiority in speed, accuracy, and resource efficiency. In a benchmark using an 86,436-member DEL, tagFinder processed 60% of input reads—10% more than count and 3% more than BWA—while maintaining a false-positive rate below 0.1%. The error-aware mode further increased usable reads by 9% in a 6-million-member library, detecting 7.4% more compounds than count at threefold faster speeds.

Alignment-based methods, though thorough, proved impractical for DEL applications. BWA required 48 hours to analyze a dataset that tagFinder processed in 10 hours, with memory usage exceeding 32 GB versus tagFinder’s 413 MB. This disparity stems from BWA’s reliance on constructing and searching a Burrows–Wheeler transformed reference index, a computationally intensive step unnecessary in direct detection approaches. BLAST, while flexible, exhibited even greater inefficiencies, failing to complete analyses within feasible timeframes for libraries exceeding 100,000 members.

False-positive rates emerged as a critical differentiator. Alignment algorithms, optimized for genomic variant calling, erroneously assigned reads to similar but incorrect tags, inflating background noise. tagFinder’s lookup table approach eliminated this by requiring exact matches across all synthesis cycles, reducing false positives by an order of magnitude. The tool’s ability to flag and inspect discarded reads further enhanced transparency, allowing users to audit filtering decisions and refine analysis parameters.

Storage requirements highlighted another advantage. tagFinder’s output for the 6-million-member library occupied 413 MB, compared to count’s 4.5 GB, due to efficient data compression and avoidance of redundant metadata. This efficiency enables long-term archiving of screening results and facilitates data sharing between institutions, addressing a common bottleneck in collaborative drug discovery efforts.

Validation via independent biochemical assays confirmed tagFinder’s accuracy. Compounds identified as hits in DEL screens were synthesized off-DNA and tested against target proteins, with over 80% exhibiting measurable binding activity. This concordance between computational predictions and experimental validation underscores the algorithm’s reliability in guiding medicinal chemistry campaigns.

A pivotal demonstration of tagFinder involved screening a 6-million-member DEL against streptavidin, a protein notorious for nonspecific binding. The algorithm identified 10,294 compounds enriched across three affinity selection rounds, visualized as distinct planes and lines in a 3D scatter plot. These patterns correlated with known streptavidin-binding chemotypes, validating tagFinder’s ability to discern specific interactions amid high background noise.

In another study, a DEL containing 86,436 compounds was screened against an undisclosed epigenetic target. tagFinder detected monosynthons enriched across all synthesis cycles, pinpointing a hydroxamic acid derivative as a potent inhibitor. Follow-up assays confirmed sub-micromolar activity, aligning with the compound’s prominence in the sequencing data. This case highlighted the tool’s utility in identifying novel chemotypes from structurally diverse libraries.

Analysis of a double-stranded DEL revealed unexpected tag chimeras, traced to incomplete ligation during library synthesis. tagFinder’s anomaly detection flagged these artifacts, prompting protocol revisions that reduced chimera rates in subsequent batches. This self-correcting capability enhances the iterative optimization of DEL production, ensuring higher-quality screens over time.

The tool’s proficiency in handling low-abundance compounds was tested using a spiked library containing known binders at 0.001% frequency. tagFinder recovered all spiked molecules after three selection rounds, demonstrating sensitivity sufficient to detect rare hits in ultra-large libraries. This performance is critical for identifying high-affinity binders that may exist in minuscule quantities within complex mixtures.

A longitudinal study tracking hit reproducibility across multiple sequencing runs revealed less than 5% variation in compound rankings, affirming tagFinder’s robustness. This consistency enables reliable cross-comparison of screens conducted at different times or facilities, facilitating meta-analyses to identify consensus targets or off-target effects.

DEL screens increasingly employ multiplexed formats, where multiple targets or libraries are assayed in parallel to conserve resources. tagFinder supports this trend through sample-specific closing sequences, enabling simultaneous analysis of dozens of experiments within a single sequencing run. In one multiplexed screen, 11 libraries totaling 24.2 million compounds were screened against three unrelated targets. The algorithm deconvoluted results into target-specific hit lists, revealing overlapping chemotypes for structurally similar proteins.
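Deconvolution by closing sequence amounts to binning reads by a sample-specific constant region. The sketch below assumes, purely for illustration, that each read ends with its closing sequence and that the sequences and sample names shown are placeholders.

```python
from collections import defaultdict

# Hypothetical mapping of closing sequence -> sample (target) name.
SAMPLES = {"CCTAGG": "target_A", "GGATCC": "target_B"}

def demultiplex(reads, samples):
    """Group reads by the sample whose closing sequence terminates the read."""
    bins = defaultdict(list)
    for read in reads:
        hits = [name for seq, name in samples.items() if read.endswith(seq)]
        # Reads matching no sample, or more than one, are set aside.
        key = hits[0] if len(hits) == 1 else "unassigned"
        bins[key].append(read)
    return dict(bins)

reads = ["AAAACCTAGG", "AAAAGGATCC", "AAAATTTTTT"]
bins = demultiplex(reads, SAMPLES)
print({name: len(group) for name, group in bins.items()})
```

Each bin can then be passed through the same decoding and counting steps independently, giving one hit list per target.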

Epigenetic target screens exemplified the value of multidimensional pattern recognition. A library pooled from three sub-libraries—1.2 million, 17 million, and 6 million compounds—yielded distinct enrichment planes for bromodomain inhibitors. tagFinder’s 3D visualization isolated selective binders to individual bromodomains, guiding the synthesis of isoform-specific inhibitors with negligible cross-reactivity.

Protein-protein interaction (PPI) targets, traditionally recalcitrant to small-molecule modulation, benefited from tagFinder’s sensitivity. Screening a 10-million-member DEL against a PPI interface identified disynthons forming key hydrogen bonds and hydrophobic contacts. Biochemical validation confirmed disruption of the protein complex, illustrating the algorithm’s capacity to tackle challenging target classes.

The tool’s adaptability to nonstandard library architectures was tested using a DEL with four synthesis cycles—a rarity in conventional screens. tagFinder’s configuration file accommodated the additional cycle, generating a 4D enrichment plot that revealed synergistic building block combinations. This flexibility encourages innovation in library design, expanding the chemical space accessible to DEL technology.

A recent industry collaboration applied tagFinder to a toxicity screen, pooling DELs with known cytotoxic compounds. The algorithm identified structural motifs associated with off-target activity, enabling the design of safer analogs. This application underscores the tool’s versatility beyond lead discovery, extending to ADMET profiling and chemical safety assessment.

The advent of tagFinder coincides with burgeoning interest in DELs for probing undruggable targets, such as transcription factors and RNA structures. By enabling accurate analysis of billion-member libraries, the algorithm accelerates the identification of cryptic binding pockets and allosteric sites. Early adopters report success in targeting KRAS and MYC, oncoproteins long considered impervious to small-molecule inhibition.

Integration with machine learning platforms represents a logical next step. tagFinder’s output feeds naturally into neural networks trained to predict synthesis feasibility or binding affinity, creating closed-loop systems for library optimization. Pilot studies coupling the algorithm with generative AI have yielded novel macrocyclic peptides, a class underrepresented in traditional DELs.

The open-source nature of tagFinder fosters community-driven enhancements, such as GPU acceleration and cloud compatibility. These developments promise to further reduce processing times, making real-time analysis feasible during sequencing runs. Such capabilities could revolutionize iterative screening strategies, where preliminary results inform immediate follow-up experiments.

As DELs expand into new modalities (peptide libraries, covalent inhibitors, and protein degraders), tagFinder’s modularity ensures continued relevance. Ongoing updates support emerging encoding strategies, including split-and-pool barcoding and spatial tagging, cementing the algorithm’s role as a cornerstone of next-generation drug discovery.

Ultimately, tagFinder exemplifies the convergence of computational innovation and chemical biology, offering a robust framework to navigate the complexities of DNA-encoded science. By transforming raw sequencing data into actionable chemical insights, it empowers researchers to explore uncharted biological landscapes with unprecedented precision and scale.

Study DOI: https://doi.org/10.1177/2472555217753840

Engr. Dex Marco Tiu Guibelondo, B.Sc. Pharm, R.Ph., B.Sc. CpE

Editor-in-Chief, PharmaFEATURES
