AI ASSISTANCE WITH SMALL LANGUAGE MODELS

Background
When ChatGPT was launched, like many developers, I felt that technology had reached a turning point for those of us who enjoy creating applications. A system capable of handling vast amounts of text and organising it in a way that allows users to communicate in natural language is, without doubt, a technological milestone of the 21st century.
Indeed, this "miracle" was achieved through years of research and the efforts of many companies and individual contributors with one goal in mind: communicating with devices in natural language. While these stories were breaking, I was busy programming this website, wondering whether there was a way to apply the latest trends to a "traditional" website built by an individual. I am aware that forcing technology into a service that does not need it is poor practice, so I looked for a specific problem on Plaudere that an LLM could solve. After testing the site with colleagues, I realised users struggled with two things: how to use the website and why it was created.
Navigating Constraints: Cloud vs Local
To address this, I began studying generative AI. While many LLMs are available via cloud services, they typically charge per API call. For the initial development stage of Plaudere, my goal was to build capabilities without relying on third parties, especially if a local solution was feasible. It would be poor practice to use a massive, expensive model when a smaller one could meet the website's needs. Consequently, I began searching Hugging Face for open-source models compatible with Node.js.
I explored several models while trying to keep the service within 3 GB of memory and 3 vCPUs. I tested Flan-T5, TinyLlama, Llama and GPT-2. Most of these required more than 3 GB once loaded into memory; combined with the memory the website itself needs, most models failed to respond or triggered Node.js errors, a common frustration for developers.
The model I chose, Qwen 1.5 0.5B (developed by Alibaba Cloud), is extremely efficient: it fits within 3 GB of RAM and 3 vCPUs. With roughly half a billion parameters, it was classified as a "Small Language Model" (SLM) on its release in early 2024. Despite its size, it can handle a context window of around 25,000 words, which is very impressive. It follows a decoder-only transformer architecture and supports 12 languages, with English being one of its strongest.
Before settling on Qwen, I researched older NLP-based architectures. While efficient, those older models could not respond in natural language: they could identify the general meaning of a text and answer a specific question, but they lacked a grasp of grammar and context, and the answer was often a simple "Yes" or "No". For a supportive agent, that lack of natural-language fluency made them unfeasible. This is why I turned to transformer-based LLMs, which have effectively made the older Q&A architectures obsolete.
The FastAPI Solution and Qwen 1.5
My next step was to move the Qwen model to a separate FastAPI service. Keeping the LLM and the website server separate prevented memory conflicts. This dedicated API "wakes up" when a user asks a question; the website waits for the remote service to respond, and if no further questions are asked, the service can "sleep" to save server resources.
Implementing the Context (RAG)
To integrate the LLM, I adjusted the website to include a chat pop-up that sends the user's prompt to the server. I defined three support cases:
- General website support.
- Questions about articles created on the site.
- Information about users (only if shared by them).
Even though Qwen accepts around 25,000 words, processing that much data on 3 GB of RAM is impossible. To solve this, I used Retrieval-Augmented Generation (RAG): instead of sending the whole text, I use Fuse.js to find the sentences most relevant to the user's prompt. If no specific match is found, it falls back to a summary of the text. This minimises the volume of data sent to the LLM.
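The retrieval step can be sketched as follows. This is a deliberately simplified Python stand-in for what Fuse.js does on the Node side: it ranks sentences by plain keyword overlap with the prompt, whereas Fuse.js performs proper fuzzy matching.

```python
# Minimal keyword-overlap retriever: a simplified stand-in for the
# fuzzy search that Fuse.js performs in the real setup.
import re

def top_sentences(document: str, prompt: str, k: int = 3) -> list[str]:
    # Split the document into sentences on end-of-sentence punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    prompt_words = set(re.findall(r"\w+", prompt.lower()))

    def score(sentence: str) -> int:
        # Number of words the sentence shares with the prompt.
        return len(prompt_words & set(re.findall(r"\w+", sentence.lower())))

    ranked = sorted(sentences, key=score, reverse=True)
    # Keep only sentences that actually share a word with the prompt;
    # the real system falls back to a summary when nothing matches.
    return [s for s in ranked[:k] if score(s) > 0]
```

Only the handful of sentences this returns, rather than the full article, is sent on to the model.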
The LLM receives the prompt "augmented" with this context. This stops the model from simply "guessing" the next word from its general knowledge; instead, it grounds its answer in the provided content. One drawback of this Qwen setup is that it cannot handle multiple questions simultaneously: questions are placed in a queue and processed one by one. To handle concurrent requests, I would need to spin up multiple instances of the FastAPI service.
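The one-question-at-a-time behaviour can be sketched with a single worker thread draining a queue. This is a hedged illustration of the pattern, not the actual service code; `run_model` stands in for the real Qwen inference call.

```python
# Serialising requests: a single worker thread drains a queue, so the
# model only ever handles one prompt at a time even if questions
# arrive "simultaneously".
import queue
import threading

requests: queue.Queue = queue.Queue()
answers: dict[int, str] = {}

def run_model(prompt: str) -> str:
    # Placeholder for the real (single-threaded) Qwen inference call.
    return f"answer to: {prompt}"

def worker() -> None:
    while True:
        req_id, prompt = requests.get()
        if req_id is None:  # sentinel value: shut the worker down
            break
        answers[req_id] = run_model(prompt)

t = threading.Thread(target=worker, daemon=True)
t.start()

# Two questions queued at once are still answered strictly in order.
for i, p in enumerate(["How do I sign up?", "What is Plaudere?"]):
    requests.put((i, p))
requests.put((None, None))  # stop the worker
t.join()
```

Running several FastAPI instances would correspond to running several such workers, each with its own copy of the model, which is why concurrency multiplies the memory cost.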
Conclusions and Future Improvements
I believe the initial goal of giving users better tools to understand the website has been accomplished. Quality context is key to achieving good results. While the model occasionally "hallucinates" or makes basic factual mistakes because it is an SLM rather than a massive LLM, it successfully answers questions based on the provided support info.
Future goals include:
- Reducing hallucinations by moving to a more recent Qwen iteration.
- Enhancing support by including images and videos in AI responses.
- Refining context rules for user-generated content.
- Ensuring multi-language consistency so the AI responds in the same language as the interface.
This project was well worth the effort. It allowed me to experiment with GenAI and understand the limitations of hardware constraints while applying the RAG methodology to make an LLM respond to context it has never seen before.