
BM25S Retriever

BM25 (Wikipedia), also known as Okapi BM25, is a ranking function used by information retrieval systems to estimate the relevance of documents to a given search query.

The BM25SRetriever uses the bm25s package, which leverages SciPy sparse matrices to store eagerly computed scores for all document tokens. This allows extremely fast scoring at query time, improving performance over popular libraries such as rank_bm25 by orders of magnitude.
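To make the ranking function concrete, here is a minimal pure-Python sketch of BM25 scoring over pre-tokenized documents. This is an illustration of the formula only, not how bm25s is implemented internally (bm25s precomputes these scores into sparse matrices at index time); the function name and parameter defaults are this sketch's own.

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # Document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            f = d.count(t)  # term frequency in this document
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "a cat is a feline and likes to purr".split(),
    "a fish is a creature that lives in water and swims".split(),
]
scores = bm25_scores(["cat", "purr"], docs)
```

A document containing none of the query terms scores zero; documents that match rare terms score higher, damped by document length via the `b` parameter.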

Setup

%pip install --upgrade --quiet bm25s langchain_community langchain_openai

Note: you may need to restart the kernel to use updated packages.
from langchain_community.retrievers import BM25SRetriever
API Reference: BM25SRetriever

Instantiation

The retriever can be instantiated from a list of texts and (optionally) metadata or directly from a list of Documents. If a persist_directory is provided, the retriever will persist the index to that directory.

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

metadata = [
    {"descr": "I am doc1"},
    {"descr": "I am doc2"},
    {"descr": "I am doc3"},
    {"descr": "I am doc4"},
]

retriever = BM25SRetriever.from_texts(
    corpus, metadata, k=2, persist_directory="animal_index_bm25"
)
BM25S Create Vocab:   0%|          | 0/4 [00:00<?, ?it/s]
BM25S Convert tokens to indices:   0%|          | 0/4 [00:00<?, ?it/s]
BM25S Count Tokens:   0%|          | 0/4 [00:00<?, ?it/s]
BM25S Compute Scores:   0%|          | 0/4 [00:00<?, ?it/s]

Alternatively, you can instantiate the retriever from a persisted directory:

retriever_2 = BM25SRetriever.from_persisted_directory("animal_index_bm25", k=2)

Usage

query = "does the fish purr like a cat?"
retrieved_chunks = retriever.invoke(query)
retrieved_chunks
Split strings:   0%|          | 0/1 [00:00<?, ?it/s]
BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]
[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims')]
retrieved_chunks_2 = retriever_2.invoke(query)
retrieved_chunks_2
Split strings:   0%|          | 0/1 [00:00<?, ?it/s]
BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]
[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims')]
retrieved_chunks == retrieved_chunks_2
True

Use within a chain

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAI

llm = OpenAI(temperature=0.0)

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke(query)
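The `format_docs` helper simply joins the retrieved documents' `page_content` fields with blank lines before they are substituted into the prompt's `{context}` slot. A standalone sketch of that step, using a hypothetical `Doc` stand-in for LangChain's `Document` class:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    # Hypothetical stand-in for langchain_core.documents.Document
    page_content: str

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

context = format_docs([
    Doc("a cat is a feline and likes to purr"),
    Doc("a fish is a creature that lives in water and swims"),
])
```

The resulting string is what the LLM actually sees as its context.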

API reference

For detailed documentation of all BM25SRetriever features and configurations, head to the API reference.
