Revealing the Wide-Ranging Functions Within Uncommon CRISPR-Cas Systems Through Deep Terascale Clustering
Structured Abstract
INTRODUCTION
Systematic exploration of sequencing databases is a powerful approach for discovering protein families and functional systems. This approach has revealed a diverse array of CRISPR-Cas systems, which act as microbial RNA-guided adaptive immune systems and serve as the foundation for numerous molecular technologies, notably programmable genome editing. However, current sequence-mining techniques struggle to keep pace with the exponential growth of databases that now hold billions of proteins, which hampers the discovery of rare protein families and their associations.
RATIONALE
The aim was to comprehensively catalog CRISPR-related gene modules in all publicly available sequencing data. Recent discoveries have connected previously unidentified biochemical activities, including transposition and protease activity, to programmable nucleic acid recognition by CRISPR systems. The hypothesis was that many more diverse enzymatic activities might be linked to CRISPR systems, and that many of them could be poorly represented in existing sequence databases.
RESULTS
The team devised fast locality-sensitive hashing–based clustering (FLSHclust), a parallelized deep clustering algorithm built on locality-sensitive hashing, with linearithmic scaling (run time growing as n log n in the number of sequences n). In clustering efficacy, FLSHclust performs comparably to MMseqs2, a well-established algorithm with quadratic (n²) scaling. FLSHclust was applied in a sensitive CRISPR discovery pipeline, which identified 188 previously unreported CRISPR-associated systems, including several rare ones.
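To make the idea of locality-sensitive hashing concrete, here is a minimal sketch of how similar sequences can be routed into shared buckets without comparing every pair. It is not the FLSHclust implementation: it substitutes a generic MinHash-with-banding scheme over k-mers, and the function names (`kmers`, `minhash_signature`, `lsh_candidate_pairs`), the parameter values, and the toy sequences are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    if len(seq) < k:
        return {seq}  # very short sequence: treat it as a single token
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(text, seed):
    """Deterministic 64-bit hash of a string under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{text}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(seq, num_hashes=32, k=3):
    """MinHash signature: the minimum k-mer hash under each seeded hash."""
    tokens = kmers(seq, k)
    return tuple(
        min(stable_hash(tok, seed) for tok in tokens)
        for seed in range(num_hashes)
    )


def lsh_candidate_pairs(seqs, num_hashes=32, bands=16, k=3):
    """Bucket sequences by signature bands; return pairs sharing a bucket.

    Each sequence is hashed once, so bucketing costs time linear in the
    number of sequences. Only pairs inside the same bucket are ever
    enumerated, instead of all n * (n - 1) / 2 pairs.
    """
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for name, seq in seqs.items():
        sig = minhash_signature(seq, num_hashes, k)
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].append(name)

    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs


if __name__ == "__main__":
    # Toy protein-like strings, invented for this example.
    toy = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSRQ",
        "c": "GGGGGWWWWWPPPPPCCCCC",
    }
    print(sorted(lsh_candidate_pairs(toy)))
```

Identical sequences always land in the same buckets, and sequences with no k-mers in common collide only if two different k-mers happen to hash to the same value. A real pipeline would still verify each candidate pair by alignment before clustering, since bucket collisions only suggest similarity.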
Four newly discovered systems were characterized experimentally. First, a type IV system with an HNH nuclease domain inserted in the CRISPR-associated DNA damage-inducible gene G (DinG)–like helicase showed RNA-guided, protospacer-adjacent motif (PAM)–dependent, directional degradation of double-stranded DNA (dsDNA). This degradation required both adenosine triphosphate (ATP) hydrolysis and the HNH nuclease activity of the DinG-HNH protein, making it the first demonstration of a type IV system with a specified interference mechanism. Second, two type I systems harboring HNH nuclease domains inserted in different subunits of Cascade (Cas8-HNH and Cas5-HNH) displayed precise cleavage of dsDNA and of single-stranded DNA (ssDNA). The Cas5-HNH system also exhibited collateral cleavage of ssDNA. Both systems demonstrated potential for genome editing in human cells, with the Cas8-HNH system displaying high specificity. Finally, the study examined candidate type VII systems, which comprise a minimal Cas7-Cas5 effector complex and a distinct interference protein featuring a β-CASP domain. These systems target RNA and were likely derived from type III-E CRISPR systems.
Further CRISPR-linked findings included potential effector and adaptation components, novel associations of Mu transposons with CRISPR systems, and numerous newly identified proteins and domains associated with type V systems. The study also observed a potential instance of Cas9 being co-opted as an anti-CRISPR mechanism and noted several non-CRISPR hypervariable regularly interspersed repeat arrays.
CONCLUSION
This study presents FLSHclust as an efficient tool for rapidly clustering millions of sequences, with broad applications in mining large sequence databases. The CRISPR-associated systems uncovered in this research represent an unexplored reserve of varied biochemical activities linked to RNA-guided mechanisms, with significant potential for developing new biotechnologies.