Because the negative data originated from another source, it could contain biases that possibly confound with specificity (22)

B cell receptors (BCRs) recognize antigens present on the surface of pathogens, an essential step in initiating adaptive immune responses. Each B cell expresses a virtually unique BCR, and this diversity allows the immune system to recognize and respond to any harmful pathogens. High-throughput sequencing technologies enable large-scale characterization of BCR repertoires, producing massive datasets that can benefit from natural language processing (NLP) methods (1-3). These NLP methods learn representations (embeddings) for amino acids or groups of amino acids and summarize them across the sequence to generate meaningful representations for downstream tasks, such as supervised prediction. Well-known embedding-based models include word2vec (4) and deep transformer models (5).

Immune2vec (1) is a word2vec model that learns to represent BCRs as vectors. It does this by breaking each BCR into smaller units of three amino acids (3-mers), where each unit is embedded into a fixed-length representation and the unit embeddings are averaged across the sequence to produce a single vector for the given BCR. Recent deep protein transformer models produce contextualized embeddings and achieve state-of-the-art performance in downstream prediction tasks, such as secondary structure and protein-protein binding prediction (6-8). ESM2 (6) and ProtT5 (7) are two examples of transformer models trained on large corpora of protein sequences to generate amino acid representations that account for sequence context. To capture the influence of neighboring amino acids on one another, these models generate local embeddings for each amino acid that depend on the whole sequence and compute a global embedding for the entire sequence by averaging the local embeddings. Similar approaches have also recently been applied to immune receptor sequences to train models for tasks including predicting binding-related properties of immune cell receptors (2,9,10).

The abundance of embedding approaches calls for comparative studies to examine their biological relevance (11). A critical evaluation objective is how well low-dimensional representations preserve information for downstream prediction tasks (12). Previous work observed that neighboring BCRs in the embedding space have similar gene usage and somatic hypermutation frequency (1,2). However, no quantitative assessment of these representations across prediction tasks exists, and the comparative advantages of each embedding are underexplored. For instance, even though transformer models are highly expressive and can encode complicated context-based relationships, they require more training data to create meaningful and generalizable representations. Models like immune2vec, though less expressive, can be trained on more specific datasets, potentially allowing for a more informative BCR-specific embedding.

Here we evaluated multiple embedding methods over prediction tasks, including BCR sequence properties and receptor specificity, to assess how well they preserve biological information. Previous machine-learning studies on BCRs mainly focused on the complementarity-determining region 3 (CDR3) of heavy chain BCR sequences (1,13), a determinant of antibody specificity (14).
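To make the k-mer averaging idea concrete, the following is a minimal Python sketch, not the authors' implementation: it splits a BCR amino acid sequence into overlapping 3-mers, looks up a fixed-length vector for each 3-mer, and averages the vectors into one sequence-level representation. The `kmer_vectors` mapping here is a hypothetical stand-in for a trained word2vec-style model such as immune2vec.

```python
# Sketch of k-mer averaging for a BCR sequence embedding (illustrative only).
import numpy as np

def split_into_kmers(sequence: str, k: int = 3) -> list[str]:
    """Return the overlapping k-mers of an amino acid sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def embed_sequence(sequence: str, kmer_vectors: dict, k: int = 3) -> np.ndarray:
    """Average the k-mer vectors across the sequence to obtain one fixed-length vector."""
    vectors = [kmer_vectors[kmer] for kmer in split_into_kmers(sequence, k) if kmer in kmer_vectors]
    return np.mean(vectors, axis=0)

# Toy usage: random 100-dimensional vectors stand in for learned 3-mer embeddings.
rng = np.random.default_rng(0)
cdr3 = "CARDRSTGWYFDYW"
toy_vectors = {kmer: rng.normal(size=100) for kmer in split_into_kmers(cdr3)}
print(embed_sequence(cdr3, toy_vectors).shape)  # (100,)
```

The same mean-pooling step applies to the transformer models discussed above: per-residue (local) embeddings from ESM2 or ProtT5 are averaged over the sequence to yield a single global embedding per BCR.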
The recent development of single-cell technologies has led to the increasing availability of paired full-length heavy and light chain BCR sequences, which brings the opportunity to include regions outside the CDR as well as the light chain. However, few studies have examined the effect of incorporating full-length heavy and light chain sequences in receptor specificity prediction tasks using sequence-based embedding models (15). In this study, we compared the performance of protein language models, including a BCR-specific word2vec model (immune2vec), transformer-based protein language models (ESM2, ProtT5, antiBERTy), and traditional amino acid encodings (physicochemical encoding, amino acid frequency), in predicting BCR sequence properties and receptor specificity. We also examined the effect of incorporating full-length and paired light chain sequences on prediction performance. We found that the BCR-specific models, including immune2vec and antiBERTy, perform similarly to or slightly outperform general protein language models in receptor specificity prediction tasks. We also found an improvement in specificity prediction performance from incorporating full-length heavy and paired light chain sequences. These observations offer insights into the performance characteristics of embedding methods trained with different types of BCR sequence input and downstream prediction tasks.

== Materials and methods ==

== Data sources and processing ==

We collected 1 million single-cell paired heavy and light chain full-length BCR V(D)J sequences from ten datasets (Table 1). Only cells with one productive heavy chain and one productive light chain were retained.
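A minimal sketch of this pairing filter is shown below; it is not the authors' pipeline. It assumes an AIRR-style rearrangement table with `cell_id`, `locus`, and `productive` columns (column names and values are assumptions) and keeps only cells with exactly one productive heavy chain (IGH) and exactly one productive light chain (IGK or IGL).

```python
# Illustrative filter for cells with one productive heavy and one productive light chain.
import pandas as pd

def filter_paired_cells(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows from cells with exactly one productive IGH and one productive IGK/IGL chain."""
    productive = df[df["productive"].isin([True, "T", "TRUE", "true"])]
    heavy_counts = productive[productive["locus"] == "IGH"].groupby("cell_id").size()
    light_counts = productive[productive["locus"].isin(["IGK", "IGL"])].groupby("cell_id").size()
    keep = set(heavy_counts[heavy_counts == 1].index) & set(light_counts[light_counts == 1].index)
    return productive[productive["cell_id"].isin(keep)]

# Toy usage: only cell "c1" has a single productive heavy chain paired with one light chain.
toy = pd.DataFrame({
    "cell_id": ["c1", "c1", "c2", "c2", "c2"],
    "locus": ["IGH", "IGK", "IGH", "IGH", "IGL"],
    "productive": [True, True, True, True, True],
})
print(filter_paired_cells(toy)["cell_id"].unique())  # ['c1']
```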