Uncovering the functional diversity of rare CRISPR-Cas systems with deep terascale clustering
Microbial systems underpin many biotechnologies, including CRISPR, but the exponential growth of sequence databases makes it difficult to find previously unidentified systems. In this work, we develop the fast locality-sensitive hashing–based clustering (FLSHclust) algorithm, which performs deep clustering on massive datasets in linearithmic time. We incorporated FLSHclust into a CRISPR discovery pipeline and identified 188 previously unreported CRISPR-linked gene modules, revealing many additional biochemical functions coupled to adaptive immunity. We experimentally characterized three HNH nuclease–containing CRISPR systems, including the first type IV system with a specified interference mechanism, and engineered them for genome editing. We also identified and characterized a candidate type VII system, which we show acts on RNA. This work opens new avenues for harnessing CRISPR and for the broader exploration of the vast functional diversity of microbial proteins.

Editor’s summary
Microbial biochemical systems are incredibly diverse, and computational tools to analyze sequence data are essential in identifying new and valuable components for biotechnology development. Using an approach called deep terascale clustering, Altae-Tran et al. found more than 200 new functional systems linked to CRISPR, a DNA-editing technology. Some of the discovered genes are linked to precise DNA-editing systems that may enable safer therapeutic genome editing. The authors also identified a CRISPR-Cas enzyme, Cas14, which cuts RNA precisely. These discoveries may help to further improve DNA- and RNA-editing technologies, with wide-ranging applications in medicine and biotechnology. —Di Jiang

A clustering algorithm, FLSHclust, was developed and applied to discover 188 previously unreported CRISPR-linked gene modules.

INTRODUCTION
Systematic mining of sequencing databases is a powerful method for discovering protein families and functional systems.
This approach has uncovered diverse CRISPR-Cas systems, which are microbial RNA–guided adaptive immune systems that have served as the basis of several molecular technologies, notably programmable genome editing. However, existing methods for sequence mining lag behind the exponentially growing databases that now contain billions of proteins, which restricts the discovery of rare protein families and associations.

RATIONALE
We sought to comprehensively enumerate CRISPR-linked gene modules in all existing publicly available sequencing data. Recently, several previously unknown biochemical activities have been linked to programmable nucleic acid recognition by CRISPR systems, including transposition and protease activity. We reasoned that many more diverse enzymatic activities may be associated with CRISPR systems, many of which could be of low abundance in existing sequence databases.

RESULTS
We developed fast locality-sensitive hashing–based clustering (FLSHclust), a parallelized, deep clustering algorithm with linearithmic scaling based on locality-sensitive hashing. FLSHclust approaches MMseqs2, a gold-standard quadratic-scaling algorithm, in clustering performance. We applied FLSHclust in a sensitive CRISPR discovery pipeline and identified 188 previously unreported CRISPR-associated systems, including many rare systems.

We experimentally characterized four of the newly discovered systems. We examined a type IV system with an HNH nuclease domain inserted in the CRISPR-associated DNA damage-inducible gene G (DinG)–like helicase. We found that this system exhibited RNA-guided protospacer-adjacent motif (PAM)–dependent directional double-stranded DNA (dsDNA) degradation, which required both the adenosine triphosphate (ATP) hydrolysis and HNH nuclease functions of the DinG-HNH protein. This is the first demonstration of a type IV system with a specified interference mechanism.
We characterized two type I systems containing HNH nuclease domains inserted in different subunits of Cascade (Cas8-HNH and Cas5-HNH). We found that both of these systems performed precise dsDNA cleavage and single-stranded DNA (ssDNA) cleavage. We additionally observed collateral cleavage of ssDNA by the Cas5-HNH system. We demonstrated that both systems can be applied for genome editing in human cells and that the Cas8-HNH system is highly specific.

We also studied candidate type VII systems, including a minimal Cas7-Cas5 effector complex and a distinctive interference protein including a β-CASP domain. We showed that these systems are likely derived from type III-E CRISPR systems and are RNA targeting.

Other CRISPR-linked systems that we found include additional potential effector and adaptation components, two previously unknown associations of Mu transposons with CRISPR systems, and numerous newly identified proteins and domains associated with type V systems. We also identified an instance of potential co-option of a Cas9 as an anti-CRISPR mechanism and noted several non-CRISPR hypervariable regularly interspersed repeat arrays.

CONCLUSION
This study introduces FLSHclust as a tool to cluster millions of sequences quickly and efficiently, with broad applications in mining large sequence databases. The CRISPR-linked systems that we discovered represent an untapped trove of diverse biochemical activities linked to RNA-guided mechanisms, with great potential for development as biotechnologies.

Identification and characterization of previously unreported CRISPR-Cas systems. (A) Schematic of FLSHclust algorithm. (B) Applications of protein clustering in CRISPR discovery. CARF, CRISPR-associated Rossmann fold. (C) Locus diagrams of three newly identified CRISPR-Cas systems experimentally characterized in this work.
(D) Small RNA sequencing of candidate type VII Cas7-Cas5 ribonucleoprotein (RNP) (top), and targeted RNA cleavage by candidate type VII CRISPR-Cas system (bottom). DR, direct repeat; nt, nucleotide; bp, base pair; TBE, tris-boric acid–EDTA buffer.
TLDR
The fast locality-sensitive hashing–based clustering (FLSHclust) algorithm, which performs deep clustering on massive datasets in linearithmic time, is developed and used to uncover more than 200 new functional systems linked to CRISPR, a DNA-editing technology.
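
To illustrate why locality-sensitive hashing can avoid the all-against-all comparisons of quadratic-scaling clustering, the following is a minimal Python sketch. It is not the FLSHclust implementation and makes several assumptions that do not come from the text above: similarity is approximated by MinHash over k-mers, candidates are found by banding the signature, and bucket members are merged with union-find (single linkage). The function names and the parameters `num_hashes`, `bands`, and `k` are hypothetical choices for illustration only.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, num_hashes=32, k=3):
    """MinHash signature: for each salted hash, keep the minimum over all k-mers."""
    shingles = kmers(seq, k) or {seq}  # fall back to the whole sequence if shorter than k
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt emulates a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in shingles))
    return signature


def lsh_cluster(seqs, num_hashes=32, bands=8, k=3):
    """Assign a cluster label to each sequence (illustrative sketch, not FLSHclust)."""
    rows = num_hashes // bands
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Sequences that agree on every row of at least one band share a bucket.
    buckets = defaultdict(list)
    for idx, seq in enumerate(seqs):
        sig = minhash_signature(seq, num_hashes, k)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)

    # Merge each bucket by linking members to its first element,
    # so no pairwise comparison within or across buckets is needed.
    for members in buckets.values():
        root = find(members[0])
        for other in members[1:]:
            parent[find(other)] = root

    return [find(i) for i in range(len(seqs))]
```

In this sketch, identical sequences always receive the same label because they produce identical signatures, while sequences with no shared k-mers are separated unless a 64-bit hash collision occurs. Whether two similar but non-identical sequences are grouped depends on the assumed `bands` and `num_hashes` settings, which trade sensitivity against false merges.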