BM25S Retriever
BM25 (Wikipedia), also known as Okapi BM25, is a ranking function used in information retrieval systems to estimate the relevance of documents to a given search query.
The BM25SRetriever uses the bm25s package, which leverages SciPy sparse matrices to store eagerly computed scores for all document tokens. This allows extremely fast scoring at query time, improving performance over popular libraries such as rank_bm25 by orders of magnitude.
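To see where the speed comes from, here is a minimal sketch of using the bm25s library on its own, following its tokenize/index/retrieve API (the corpus and query below are illustrative only):

import bm25s

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
]

# Tokenize the corpus and build the index; scores for all document tokens
# are computed eagerly and held in a SciPy sparse matrix.
corpus_tokens = bm25s.tokenize(corpus)
bm25 = bm25s.BM25()
bm25.index(corpus_tokens)

# Query time is then essentially a sparse lookup-and-sum over precomputed scores.
query_tokens = bm25s.tokenize("does the cat purr?")
results, scores = bm25.retrieve(query_tokens, corpus=corpus, k=1)
print(results[0, 0], scores[0, 0])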
Setup
%pip install --upgrade --quiet bm25s langchain_openai
Note: you may need to restart the kernel to use updated packages.
from langchain_community.retrievers import BM25SRetriever
Instantiation
The retriever can be instantiated from a list of texts and (optionally) metadata, or directly from a list of Documents (a Document-based variant is sketched after the example below). If a persist_directory is provided, the retriever will persist the index to that directory.
# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
metadata = [
    {"descr": "I am doc1"},
    {"descr": "I am doc2"},
    {"descr": "I am doc3"},
    {"descr": "I am doc4"},
]
retriever = BM25SRetriever.from_texts(
    corpus, metadata, k=2, persist_directory="animal_index_bm25"
)
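The same index can also be built directly from Document objects. A minimal sketch, assuming BM25SRetriever mirrors the from_documents constructor that other LangChain retrievers expose (the constructor name is an assumption here):

from langchain_core.documents import Document

docs = [
    Document(page_content="a cat is a feline and likes to purr", metadata={"descr": "I am doc1"}),
    Document(page_content="a fish is a creature that lives in water and swims", metadata={"descr": "I am doc4"}),
]
# Assumed: a from_documents constructor analogous to from_texts above.
retriever_from_docs = BM25SRetriever.from_documents(docs, k=2)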
Alternatively, you can instantiate the retriever from a persisted directory.
retriever_2 = BM25SRetriever.from_persisted_directory("animal_index_bm25", k=2)
Usage
query = "does the fish purr like a cat?"
retrieved_chunks = retriever.invoke(query)
retrieved_chunks
[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims')]
retrieved_chunks_2 = retriever_2.invoke(query)
retrieved_chunks_2
[Document(metadata={'descr': 'I am doc1'}, page_content='a cat is a feline and likes to purr'),
Document(metadata={'descr': 'I am doc4'}, page_content='a fish is a creature that lives in water and swims')]
retrieved_chunks == retrieved_chunks_2
True
Use within a chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAI
llm = OpenAI(temperature=0.0)
prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.
Context: {context}
Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke(query)
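Since the prompt is a ChatPromptTemplate, you may prefer to pair it with a chat model instead of the completion-style OpenAI class. A minimal variant of the same chain, assuming ChatOpenAI from langchain_openai (the model name below is illustrative):

from langchain_openai import ChatOpenAI

chat_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)  # model name is an assumption
chat_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | chat_llm
    | StrOutputParser()
)
chat_chain.invoke(query)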
API reference
For detailed documentation of all BM25SRetriever features and configurations, head to the API reference.
Related
- Retriever conceptual guide
- Retriever how-to guides