The slate.30m.english.rtrvr model is a standard sentence-transformers model based on bi-encoders. The model produces an embedding for a given input, for example a query, a passage, or a document. At a high level, our model is trained to maximize the cosine similarity between two pieces of input text, for example text A (query text) and text B (passage text), which result in the sentence embeddings q and p. These sentence embeddings can then be compared using cosine similarity.
Figure 1. Bi-encoder Embeddings Model for Retrieval
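For reference, the cosine similarity used to compare a query embedding q and a passage embedding p is the standard normalized dot product:
\cos(\mathbf{q}, \mathbf{p}) = \frac{\mathbf{q} \cdot \mathbf{p}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{p} \rVert}
A score close to 1 indicates that the query and passage are semantically close; scores near 0 or below indicate unrelated text.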
Base Language Model
The underlying Language Model (LM) for our embeddings is “slate.30m.english”. It has the same architecture as a small-RoBERTa transformer model (6 layers), with ~30 million parameters and an embedding dimension of 384. Specifically, “slate.30m.english” was distilled from “slate.125m.english” (formerly, WatBERT). Our final model is called slate.30m.english.rtrvr; the suffix denotes that the underlying model architecture is fine-tuned for retrieval-based tasks.
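If you load the model with the sentence-transformers library (see the usage example later on this page), you can confirm the embedding dimension programmatically. The model path below is a placeholder, not a published identifier:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('path_to_slate_model')   # placeholder path to the downloaded model
print(model.get_sentence_embedding_dimension())      # expected to print 384 for slate.30m.english.rtrvr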
Training Algorithm
Most embedding models that are either state-of-the-art or at the top of the MTEB leaderboard are typically trained in three stages:
Task-specific (retrieval-based) pre-training
Task-specific fine-tuning on mined pairs
Fine-tuning on supervised pairs
We follow a similar approach, combining the final two stages into a single fine-tuning step.
slate.30m.english.rtrvr is produced by distilling from the “slate.125m.english.rtrvr-06-30-2024” model in the fine-tuning step (details of the larger teacher model can be found at this location). Knowledge distillation transfers the knowledge of a high-performing teacher model into a smaller student model by training the student’s output probability distribution to match that of the teacher as closely as possible, improving the student’s performance compared to stand-alone fine-tuning.
Task-specific pre-training
This stage uses the RetroMAE framework to make our underlying LM more retrieval-oriented. We initialize our base LM with “slate.30m.english” and continue with RetroMAE pre-training, using the data in Table 1. Here, instead of only using the hard labels from the data, we also distill the predictions of the RetroMAE encoder used by slate.125m.english.rtrvr. Our hyper-parameters are: learning rate 2e-5, number of steps 435,000, and 8 A100 (80GB) GPUs.
Note: this model is our base LM for the following two stages.
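As a rough illustration only, the pre-training setup described above can be summarized as a configuration along the following lines. The field names are hypothetical; the actual RetroMAE training scripts and argument names may differ:

# Hypothetical summary of the RetroMAE pre-training stage described above.
# Field names are illustrative and do not correspond to a specific training script.
retromae_pretraining_config = {
    "init_checkpoint": "slate.30m.english",      # base LM used for initialization
    "objective": "RetroMAE + distillation",      # hard labels plus teacher predictions
    "teacher": "RetroMAE encoder used by slate.125m.english.rtrvr",
    "learning_rate": 2e-5,
    "num_steps": 435_000,
    "hardware": "8 x A100 80GB",
}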
Distillation using Unsupervised and Supervised Pairs
We use a bi-encoder framework for training the embedding model, as in Figure 1. We initialize with the RetroMAE pre-trained model and then apply knowledge distillation over <query, passage> text pairs, using a contrastive loss objective with in-batch negatives. Knowledge distillation trains the student’s output probability distribution to match that of the teacher as closely as possible. In the context of retriever models, the output distribution consists of the similarity scores between pairs of text. Specifically, for each <query, passage> pair, the distribution of the teacher’s scores between the embeddings of the query and the passage, that is, the cosine similarity between the embeddings, is distilled into the student.
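A minimal sketch of this kind of distillation objective is shown below. It is not the exact training code; the temperature and the KL-divergence formulation are common choices for matching teacher and student in-batch similarity distributions and are assumptions here:

import torch.nn.functional as F

def similarity_distillation_loss(student_q, student_p, teacher_q, teacher_p, temperature=1.0):
    # student_q, student_p, teacher_q, teacher_p are (batch_size, dim) embedding tensors.
    # Cosine similarity of every query against every passage in the batch;
    # the off-diagonal entries act as in-batch negatives.
    s_scores = F.normalize(student_q, dim=-1) @ F.normalize(student_p, dim=-1).T
    t_scores = F.normalize(teacher_q, dim=-1) @ F.normalize(teacher_p, dim=-1).T

    # Match the student's score distribution to the teacher's soft targets.
    student_log_probs = F.log_softmax(s_scores / temperature, dim=-1)
    teacher_probs = F.softmax(t_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")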
The teacher used for distillation is the “slate.125m.english.rtrvr” model trained on the same data mentioned below. The teacher model was trained with a multi-stage paradigm in which a RetroMAE-pretrained model first saw unsupervised data in the pre-training stage and was then fine-tuned on cleaner, mined or gold-labeled data for a few hundred steps. To ensure robustness across datasets, this fine-tuned model was then fused with another model trained with different hyper-parameters on the same data. For more details, please refer to the slate.125m.english.rtrvr model card.
The flow of knowledge transfer is shown in Figure 2.
Figure 2. Knowledge Distillation
We mine large-scale pairs from various domains, as indicated in the Training Data section. Furthermore, we include high-quality pairs for the retrieval task from the following datasets: SQuAD, Natural Questions, Specter, Stack Exchange (Title, Body) pairs, S2ORC, SearchQA, HotpotQA, FEVER, and MIRACL. For these supervised datasets, we also include hard negatives mined with a previous version of the slate.125m.english.rtrvr model. Moreover, we synthetically generate triples with `Mixtral-8x7B-Instruct-v0.1` to create good-quality pairs for question answering, fact verification, and similar tasks. To provide better performance for IBM-specific use cases, we also include pairs created from IBM Software Support data and IBM Docs.
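As an illustration of the hard-negative mining step mentioned above, the sketch below retrieves the highest-scoring non-gold passages for a query with an earlier retriever checkpoint. The model path, corpus, and top-k value are placeholders, not the exact pipeline used:

from sentence_transformers import SentenceTransformer, util

miner = SentenceTransformer('path_to_previous_slate_checkpoint')   # placeholder path

query = "example question"
corpus = ["passage 1 ...", "passage 2 ...", "gold passage ..."]     # placeholder passages
gold_idx = 2                                                        # known positive for this query

query_emb = miner.encode(query, convert_to_tensor=True)
corpus_emb = miner.encode(corpus, convert_to_tensor=True)

# High-scoring passages that are not the gold passage become hard negatives.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
hard_negatives = [corpus[hit['corpus_id']] for hit in hits if hit['corpus_id'] != gold_idx]
print(hard_negatives)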
Distillation hyper-parameters: learning rate 7e-4, number of steps 500,000, effective batch size 2048, and 4 A100 (80GB) GPUs.
# Make sure you have sentence-transformers installed:
#   pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util

# Load the slate model from its local path
model = SentenceTransformer('path_to_slate_model')

input_queries = [
    'Who made the song My achy breaky heart?',
    'summit define']
input_passages = [
    "Achy Breaky Heart is a country song written by Don Von Tress. Originally titled Don't Tell My Heart and performed by The Marcy Brothers in 1991.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."]

# Encode queries and passages into 384-dimensional sentence embeddings
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

# Cosine similarity between every query and every passage
print(util.cos_sim(query_embeddings, passage_embeddings))
The maximum sequence length of this model is 512 tokens.
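When using the model through sentence-transformers, you can inspect this limit via the max_seq_length attribute; inputs longer than the limit are truncated before encoding. The lower value in the second line is only an example of shortening the limit, not a recommendation:

print(model.max_seq_length)    # expected to report 512 for this model
model.max_seq_length = 256     # optionally truncate inputs more aggressively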
Evaluation
Baselines
For a fair comparison, we compare with the following baselines:
BM25 (a traditional model based on tf-idf)
ELSER (a commercial search algorithm provided by Elastic)
all-MiniLM-L6-v2: a popular open-source sentence-transformers model. This model shares the same architecture as slate.30m.english.rtrvr but has been trained on more data, including data without commercially friendly licenses. Please see the Hugging Face model card for more details.
E5-base: a recent open-source transformer model with very good performance on the BEIR benchmark. This is a base-sized model with the same architecture as slate.125m.english.rtrvr. [Reference: Wang et al., 2022: Text Embeddings by Weakly-Supervised Contrastive Pre-training]. Hugging Face model card
E5-small: a smaller model in the open-source E5 family. The embedding dimension of this model matches that of slate.30m.english.rtrvr (384); however, it has 12 layers and is therefore larger and slightly slower. [Reference: Wang et al., 2022: Text Embeddings by Weakly-Supervised Contrastive Pre-training]. Hugging Face model card
BGE-base: a recent open-source transformer model with one of the best performances on the BEIR benchmark for the 768 embedding dimension. Hugging Face model card
BGE-small: a recent open-source transformer model with one of the best performances on the BEIR benchmark for the 384 embedding dimension. Hugging Face model card
We also compare the performance of these models with the older versions of the slate models, slate.125m.english.rtrvr-012024 and slate.30m.english.rtrvr-012024.
The BEIR benchmark consists of 15 open-source retrieval tasks evaluated in a zero-shot setting. BEIR focuses on diversity, including nine different retrieval tasks: fact checking, citation prediction, duplicate question retrieval, argument retrieval, news retrieval, question answering, tweet retrieval, bio-medical IR, and entity retrieval. Further, it includes datasets from diverse text domains: datasets that cover broad topics (like Wikipedia) and specialized topics (like COVID-19 publications), different text types (news articles vs. tweets), datasets of various sizes (3.6k to 15M documents), and datasets with different query lengths (average query length between 3 and 192 words) and document lengths (average document length between 11 and 635 words). BEIR uses the Normalized Discounted Cumulative Gain metric (specifically, nDCG@10) for evaluation.
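For reference, a minimal sketch of how nDCG@10 can be computed from graded relevance labels for a single ranked list is shown below; it is independent of the evaluation tooling actually used for BEIR:

import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain over the top-k ranked results.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # nDCG@k: DCG of the actual ranking divided by the DCG of the ideal ranking.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of the documents returned for one query, in ranked order.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))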
Long NQ
Long NQ is an IBM dataset designed for evaluating the full RAG pipeline, based on a subset of the NaturalQuestions dataset. The dev set has 300 answerable questions with a corpus of 178,891 passages from 2,345 Wikipedia documents. Long NQ
also provides gold Wikipedia passages that are relevant for each question. During retrieval, the task is to obtain the relevant gold passage from the corpus for every question.
Results
Table 3. Performance comparison on the BEIR benchmark (MTEB retrieval tab)
Model                                  BEIR-15 (NDCG@10)
BM25                                   42.02
ELSER                                  49.01
all-MiniLM-L6-v2                       41.95
E5-small-v2                            49.04
E5-base-v2                             50.29
BGE-small                              51.68
BGE-base                               53.25
slate.30m.english.rtrvr-01.20.2024     46.91
slate.125m.english.rtrvr-01.20.2024    49.37
slate.30m.english.rtrvr-06.30.2024     49.06
slate.125m.english.rtrvr-06.30.2024    51.26
Figure 3. Performance comparison on the BEIR benchmark (MTEB retrieval tab)
Table 4. Performance comparison on the Long NQ dataset
Model                                  Long NQ (NDCG@10)
all-MiniLM-L6-v2                       58.10
BGE-small                              59.33
BGE-base                               61.29
E5-small-v2                            61.88
E5-base-v2                             63.80
slate.30m.english.rtrvr-01.20.2024     59.94
slate.125m.english.rtrvr-01.20.2024    65.01
slate.30m.english.rtrvr-06.30.2024     62.07
slate.125m.english.rtrvr-06.30.2024    66.80
Figure 4. Performance comparison on the Long NQ dataset
Runtime Performance
Runtime performance is measured on a re-ranking task with 466 queries. For each query, we re-rank the top-100 passages obtained by BM25 and report the average time over all queries. The re-ranking was performed on an A100 (40GB) GPU.
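The measurement setup can be reproduced in outline with the sketch below; the query list, candidate passages, and model path are placeholders:

import time
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('path_to_slate_model')        # placeholder path

def rerank(query, passages):
    # Re-rank BM25 candidates by bi-encoder cosine similarity.
    query_emb = model.encode(query, convert_to_tensor=True)
    passage_emb = model.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, passage_emb)[0]
    return scores.argsort(descending=True)

queries = ["example query"]                                       # placeholder for the 466 queries
bm25_top100 = {q: ["candidate passage"] * 100 for q in queries}   # placeholder BM25 results

start = time.perf_counter()
for q in queries:
    rerank(q, bm25_top100[q])
avg_seconds = (time.perf_counter() - start) / len(queries)
print(f"Average re-ranking time per query: {avg_seconds:.4f} s")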