Our pipeline consists of three main components: a vector database, an LLM, and a client interface. For the vector database, we use ChromaDB, an open-source vector database, hosted in a serverless container on Modal.
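A minimal sketch of what such a deployment might look like, assuming Modal's App/function API with a persistent volume for the Chroma data (the app, volume, and collection names are illustrative placeholders, not our actual configuration):

```python
import modal

app = modal.App("rag-vectordb")  # illustrative app name
image = modal.Image.debian_slim().pip_install("chromadb")
volume = modal.Volume.from_name("chroma-data", create_if_missing=True)  # persistent storage


@app.function(image=image, volumes={"/data": volume})
def query_collection(query_embedding: list[float], k: int = 5) -> dict:
    """Run a nearest-neighbour query against the hosted Chroma collection."""
    import chromadb

    client = chromadb.PersistentClient(path="/data")
    collection = client.get_or_create_collection("documents")
    return collection.query(query_embeddings=[query_embedding], n_results=k)
```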
For the embedding model, we use instructor-xl, an instruction-finetuned embedding model based on GTR. The embedding model is also hosted on Modal alongside the database.
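A minimal sketch of calling instructor-xl through the InstructorEmbedding package; the instruction string below is illustrative, since Instructor embeds each text together with a task instruction:

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")

# Instructor models embed (instruction, text) pairs so the same model can
# produce task-specific embeddings; this instruction is an example, not ours.
embeddings = model.encode(
    [["Represent the document for retrieval:", "Some chunk of text to embed."]]
)
print(embeddings.shape)  # (1, embedding_dim)
```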
API for OpenAI's GPT-3.5 Turbo (the model behind ChatGPT). This API essentially lets us use the model for free, subject to a daily rate limit. The limit is comparable to using ChatGPT directly in the browser, which is more than enough for our use case.
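Assuming the proxy exposes an OpenAI-compatible endpoint, calls go through the standard openai client with a custom base URL (the URL and key below are placeholders, not the proxy we actually use):

```python
from openai import OpenAI

# Placeholder endpoint and key; the real proxy URL is configured separately.
client = OpenAI(base_url="https://example-proxy.invalid/v1", api_key="PLACEHOLDER")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the retrieved documents."}],
)
print(response.choices[0].message.content)
```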
For ingestion, we first use document loaders from LangChain to turn files stored on disk into text. The text is then split into chunks of 1,000-4,000 characters (depending on the specific use case) with 200 characters of overlap. These chunks are fed into the embedding model to generate embeddings, which are then stored in the vector database.
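A minimal sketch of this ingestion path, assuming a plain-text file, a 1,000-character chunk size, and the loader/splitter import locations of older LangChain releases (newer versions move them into langchain_community and langchain_text_splitters); the file path and instruction string are illustrative:

```python
import chromadb
from InstructorEmbedding import INSTRUCTOR
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a file from disk and split it into overlapping chunks
# (chunk_size ranges from 1000 to 4000 depending on the use case).
docs = TextLoader("data/example.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = [c.page_content for c in splitter.split_documents(docs)]

# Embed each chunk with instructor-xl and store it in ChromaDB.
model = INSTRUCTOR("hkunlp/instructor-xl")
embeddings = model.encode(
    [["Represent the document for retrieval:", c] for c in chunks]
)

client = chromadb.PersistentClient(path="chroma-data")  # local path for the sketch
collection = client.get_or_create_collection("documents")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
)
```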
For querying (a code sketch of the full flow follows this list):
- We first pass the query to the LLM to generate a more concise, structured query better suited to the vector database.
- The structured query is then passed to the embedding model to produce a query embedding.
- The embedding is passed to the vector database to retrieve the top-k most relevant documents.
- The pipeline also samples k random examples from the training data to use as few-shot examples.
- The retrieved documents are combined with the few-shot examples and the original query to form a single large prompt.
- This prompt is passed to the LLM one final time to generate the final response.
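Putting the steps together, a sketch of the query flow under the same assumptions as above (placeholder proxy URL, illustrative instruction strings, and a simple prompt template standing in for the one we actually use):

```python
import random

import chromadb
from InstructorEmbedding import INSTRUCTOR
from openai import OpenAI

llm = OpenAI(base_url="https://example-proxy.invalid/v1", api_key="PLACEHOLDER")
embedder = INSTRUCTOR("hkunlp/instructor-xl")
collection = chromadb.PersistentClient(path="chroma-data").get_or_create_collection("documents")


def chat(prompt: str) -> str:
    resp = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def answer(query: str, training_examples: list[str], k: int = 4) -> str:
    # 1. Ask the LLM to rewrite the query into a concise, structured form.
    structured = chat(f"Rewrite this as a concise search query: {query}")

    # 2. Embed the structured query.
    query_emb = embedder.encode(
        [["Represent the question for retrieving documents:", structured]]
    )[0]

    # 3. Retrieve the top-k most relevant chunks from the vector database.
    retrieved = collection.query(query_embeddings=[query_emb.tolist()], n_results=k)
    context = "\n\n".join(retrieved["documents"][0])

    # 4. Sample k random few-shot examples from the training data.
    shots = "\n\n".join(random.sample(training_examples, min(k, len(training_examples))))

    # 5. Combine everything into one prompt and ask the LLM for the final answer.
    prompt = f"Examples:\n{shots}\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
    return chat(prompt)
```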
Our architecture and data flow are illustrated in the diagrams below.