Creating a simple service that lets you chat with your PDF files can seem like a straightforward task. There are numerous open-source examples, and several companies are attempting to establish themselves in this niche. In this article, we guide you through the creation of such an application.
We recently examined an online PDF chat service and had the opportunity to delve into its source code. Here's a condensed version of what we found:
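// prompt: 'you are a helpful AI assistant'
app.post('/ask-question', async (req, res) => {
  // The whole PDF is read and parsed on every request (file_path and model omitted)
  const data = await pdfParse(fs.readFileSync(file_path));
  // chunk 2k
  const contents = data.text.match(/[\s\S]{1,2000}/g) ?? [];
  // Every chunk is sent to the model as a user message
  const messages = contents.map((content) => ({ role: 'user', content }));
  res.json(await openai.chat.completions.create({ model, messages }));
});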
We have omitted some details for brevity and rewritten some code for readability, but this is essentially all there is. With some adept digital marketing, you can build a brand and start serving users. However, to those in the know, this is akin to replacing the mugs in a bar with paper cups. A paper cup holding a full beer crushes easily, leaks, and ultimately degrades the beer-drinking experience. A production-grade version of this code would run to thousands of lines; as written, it is unusable for anything but a toy project.
The code above is taken from a real-life application deployed in the wild. We won't name names, but this is what users experience and pay for. How does this affect the end user? Some users may not encounter any issues. However, if your book spans 500 pages, you might experience significant performance issues, or the chat may not respond at all. The handler re-reads and re-parses the entire file on every request, and fs.readFileSync blocks Node's event loop while it does so. If 100 users are using the site concurrently, the server could grind to a halt or even crash when it runs out of memory. So, why does code like this exist in the wild? The answer is simple: there just aren't enough users to cause an issue.
Let's assume you have a book of a few hundred pages, and the service suits your needs. You are the only user of the app, so you're not overloading the server. What issues could you encounter? The first is accuracy: the model will be more prone to copying passages from the text and less flexible in how it answers. This is almost the same as uploading a file to an LLM and chatting with it, which many people do every day. So, what's the problem?
Eventually, the conversation ends. The context window is limited, and once it fills up, the LLM will start to get confused. You will then need to restart your conversation, and all your previous discussions will be forgotten. The experience is akin to talking to a different person each time, one with no knowledge of your past conversations. The code provided above works only at a bare minimum. It's much like speaking to someone who provides subpar responses, sometimes ignores you, and eventually forgets everything you talked about.
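To make the forgetting concrete, here is a minimal sketch of what a chat client typically ends up doing once the history outgrows the context window: it keeps the newest messages and silently drops the oldest. The names `estimateTokens` and `fitToWindow` are hypothetical, and the figure of roughly four characters per token is only a rule of thumb for English text; a real service would count tokens with the model's own tokenizer.

```javascript
// Assumption: ~4 characters per token, a rough rule of thumb for English text.
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: crude token estimate for a piece of text.
function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical helper: walk the history from newest to oldest and keep
// messages until the token budget is exhausted. Everything older is dropped.
function fitToWindow(messages, maxTokens) {
  const kept = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]); // preserve chronological order
    used += cost;
  }
  return { kept, dropped: messages.length - kept.length, used };
}

// Usage: with a small budget, the earliest turns fall out of the window.
const history = [
  { role: 'user', content: 'Who is the narrator of the first chapter?' },
  { role: 'assistant', content: 'The first chapter is narrated by the sailor.' },
  { role: 'user', content: 'And what happens to him at the end?' },
];
const windowed = fitToWindow(history, 20);
console.log(`kept ${windowed.kept.length}, dropped ${windowed.dropped}`);
```

If the parsed book is also placed in the prompt, it consumes most of the budget before the conversation even starts, which is why the chat degrades so quickly.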
So, how can we improve this?
A common strategy is Retrieval-Augmented Generation (RAG): instead of feeding the model the whole book before the chat starts, you retrieve only the passages relevant to each question and pass those to the model. Feeding the whole book isn't feasible anyway with even moderately sized books. Moreover, it produces comparatively poor responses and limits the length of the conversation, as mentioned earlier. To build the best chat experience, we need proper engineering that allows for scalability and performance.
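The retrieval step can be sketched in a few dozen lines. This is a toy illustration under stated assumptions, not a production design: word-count vectors with cosine similarity stand in for a real embedding model and vector store, and every function name here (`chunkText`, `termFrequencies`, `cosineSimilarity`, `retrieve`, `buildMessages`) is hypothetical. The chunk size and overlap defaults are likewise arbitrary.

```javascript
// Split text into overlapping chunks of at most `size` characters.
function chunkText(text, size = 2000, overlap = 200) {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// Stand-in for an embedding: a map from lowercase word to its count.
function termFrequencies(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two word-count maps (0 when either is empty).
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) ?? 0);
  const norm = (v) => Math.sqrt([...v.values()].reduce((s, c) => s + c * c, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the question and return the k best matches.
function retrieve(question, chunks, k = 3) {
  const q = termFrequencies(question);
  return chunks
    .map((text, index) => ({ index, text, score: cosineSimilarity(q, termFrequencies(text)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Build a prompt that contains only the retrieved excerpts, not the whole book.
function buildMessages(question, retrieved) {
  const context = retrieved.map((c) => c.text).join('\n---\n');
  return [
    { role: 'system', content: 'Answer using only the provided excerpts.' },
    { role: 'user', content: `Excerpts:\n${context}\n\nQuestion: ${question}` },
  ];
}

// Usage: only the two most relevant passages reach the model.
const passages = [
  'The whale surfaced near the ship at dawn.',
  'Taxes are due in April every year.',
  'The captain chased the white whale across the sea.',
];
const top = retrieve('Who chased the whale?', passages, 2);
console.log(buildMessages('Who chased the whale?', top));
```

In a real service the chunks would be embedded once, at upload time, and stored, so that each question costs one lookup rather than a full re-parse of the PDF. The prompt then stays small no matter how long the book is, which leaves room in the context window for the conversation itself.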
