AI ASSISTANCE WITH SMALL LANGUAGE MODELS

Background:

When ChatGPT launched, technology reached a turning point for anyone building applications. A system capable of organising vast amounts of text, identifying key information and producing human-like responses in natural language is, without a doubt, a defining technological milestone of the 21st century.

It is clear that this miracle is the result of years of research and the collective efforts of major organisations and thousands of independent contributors, all driven by a single goal: enabling seamless natural language communication with machines. While following these developments, I was busy programming this website and wondered whether these trends could be applied to a traditional site built entirely on my own. I looked for a specific problem within Plaudere that a language model could actually solve. After a period of reflection, and after revisiting previous versions of the site, I realised that time away from the logic and daily use of those earlier iterations had made the application feel difficult to navigate and understand. Despite my efforts to evolve the site into something more intuitive, a barrier to entry remained. I concluded that friction during navigation and a lack of clarity about the site's practical usefulness were recurring problems that a well-targeted use of language models could mitigate.

Navigating Constraints: Cloud vs Local:

To integrate language models into the website, I first researched the high-level fundamentals of generative AI. Although many hosted models are available, they typically charge per API call or impose strict usage caps, which did not suit an early-stage project like Plaudere. Since I had no data on potential traffic, committing to a third-party plan carried a significant risk of over-provisioning, that is, paying for capacity the site might never use. My core objective was to build native capabilities without external dependencies, provided that a local solution remained feasible. For Plaudere's specific use case, where extreme precision was not the primary requirement, it made little sense to deploy a large language model if a smaller, more efficient one could meet the criteria. Consequently, I began searching Hugging Face for open-source models compatible with Node.js, as my priority was to embed AI directly into the site's architecture and avoid managing additional external components beyond basic authentication services. Later, once the solution proved stable, I decoupled the model into its own service to ensure long-term independence and scalability.

I tested several models on infrastructure already constrained by low RAM and CPU capacity, across various parameter sizes and quantisation levels, including TinyLlama and GPT-2. They ultimately either exceeded the memory threshold once loaded or triggered Node.js errors when competing with the website's own overhead.

After several attempts, one of the models that proved viable was Qwen, a language model developed by Alibaba Cloud. It is remarkably efficient: version 2.5 requires 2 vCPUs and 4 GB of RAM in its quantised form, and 6 GB of RAM unquantised. With roughly 500 million parameters, the variant used here is classified as a Small Language Model (SLM). Despite its compact size, it supports a long context window: up to 128k tokens in the Qwen 2.5 family (the smallest checkpoints are documented with 32k) and 32k tokens in Qwen 1.5. It follows a decoder-only transformer architecture and supports more than 12 languages in version 1.5 and more than 29 languages in version 2.5, showing strong performance in English and impressive capability across other languages despite its small footprint.

Official Qwen 1.5 model documentation.

Official Qwen 2.5 model documentation.
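
To put the parameter count in perspective, the following back-of-the-envelope sketch estimates the memory taken by the weights alone at different precisions. It assumes exactly 500 million parameters and ignores the runtime, activations and context cache, which is why the RAM figures quoted above for a working service are considerably higher. The function name is illustrative.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# "Roughly 500 million parameters", rounded for the estimate.
PARAMS = 500_000_000

if __name__ == "__main__":
    for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{label}: {weight_memory_gib(PARAMS, bits):.2f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is the main reason quantisation matters on a small server.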

Before settling on Qwen, I researched older question-answering architectures from classical NLP. While efficient, those models could not respond in natural language: they could identify the general meaning of a text and answer specific questions, but they could not compose grammatical, context-aware sentences of their own. Often, the answer was a simple "Yes" or "No." For a support agent in Plaudere, that lack of fluency made them unfeasible. This is why I turned to transformer-based SLMs, which have effectively made those older Q&A architectures outdated by providing a much more human and adaptive interaction.

The FastAPI Solution and Qwen:

My next step was to move the Qwen model into a separate API service built with FastAPI. Keeping the SLM and the website server separate prevented memory conflicts. This dedicated API "wakes up" when a user asks a question; the website waits for the remote service to respond, and if no further questions are asked, the service can "sleep" to save server resources.

Plaudere SLM backend architecture.
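
The wake and sleep behaviour described above can be sketched in plain Python. The source does not say whether sleeping happens at the hosting level or inside the service, so this is only one possible reading: a holder that loads the model on the first question and releases it after an idle period. The class name `SleepyModel`, the `loader` callable and the idle timeout are all hypothetical, and a real service would wrap this behind its FastAPI endpoint.

```python
import time
from typing import Callable, Optional


class SleepyModel:
    """Loads a model lazily and unloads it after a period of inactivity."""

    def __init__(self, loader: Callable[[], Callable[[str], str]],
                 idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader            # hypothetical: returns a callable model
        self._idle_seconds = idle_seconds
        self._clock = clock              # injectable so the logic can be tested
        self._model: Optional[Callable[[str], str]] = None
        self._last_used = 0.0

    @property
    def awake(self) -> bool:
        return self._model is not None

    def ask(self, prompt: str) -> str:
        # "Wake up": load the model only when a question actually arrives.
        if self._model is None:
            self._model = self._loader()
        self._last_used = self._clock()
        return self._model(prompt)

    def sleep_if_idle(self) -> bool:
        # "Sleep": drop the reference so the memory can be reclaimed.
        if self._model is not None and \
                self._clock() - self._last_used >= self._idle_seconds:
            self._model = None
            return True
        return False
```

A background task or scheduler would call `sleep_if_idle()` periodically; the first request after a sleep pays the loading cost, which is why the website has to wait for the remote service to respond.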

Implementing the Context (RAG):

To integrate the SLM, the website was adjusted to include a chat pop-up that sends the user's prompt to the server. Three support cases are defined:

General website support.
Questions about posts created on the site.
Information about users (only if shared by them).

Although Qwen supports a long context window, trying to push that amount of text through a resource-constrained server is not a viable approach. Once the model is loaded, the remaining memory available for text processing becomes limited, so the application needs to be selective about what information it sends to the SLM. That is why the architecture is built as a compact Retrieval-Augmented Generation pipeline: not to feed everything to the model, but to recover only the pieces of text that are most relevant and most likely to support a grounded answer.

At first, the solution used a simpler fuzzy-search approach using fuse.js, but it was too limited for the level of context understanding the application needed. The current approach uses a custom semantic fuse, a more advanced local retrieval layer created for Plaudere. It converts the text into deterministic 64-dimensional representations and ranks fragments with a hybrid scoring model that combines vector similarity, Jaccard overlap, phrase matching, and typo tolerance. The engine also relies on Levenshtein distance and n-gram similarity, so it can recover relevant snippets even when the wording is not exact. This makes the retrieval stage much closer to a semantic selection process than to a simple keyword search.
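
The hybrid scoring idea can be illustrated with a small, self-contained sketch. It is not Plaudere's actual engine (which runs on the website side): the hashing scheme, the component weights and every function name below are assumptions chosen only to show how a deterministic 64-dimensional vector, Jaccard overlap, phrase matching, Levenshtein-based typo tolerance and character n-grams might be combined into one score.

```python
import hashlib
import math
import re

DIM = 64  # vector size mentioned in the text


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> list[float]:
    """Deterministic hashed bag-of-words vector, L2-normalised."""
    vec = [0.0] * DIM
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # inputs are already normalised


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def char_ngrams(text: str, n: int = 3) -> set[str]:
    s = " ".join(tokens(text))
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def typo_overlap(query_toks: list[str], frag_toks: list[str], max_dist: int = 1) -> float:
    """Share of query tokens with an exact or near match in the fragment."""
    if not query_toks:
        return 0.0
    hits = sum(1 for q in query_toks
               if any(levenshtein(q, f) <= max_dist for f in frag_toks))
    return hits / len(query_toks)


def hybrid_score(query: str, fragment: str) -> float:
    q, f = tokens(query), tokens(fragment)
    q_text, f_text = " ".join(q), " ".join(f)
    phrase = 1.0 if q_text and q_text in f_text else 0.0
    # Illustrative weights that sum to 1.0; real values would need tuning.
    return (0.35 * cosine(embed(query), embed(fragment))
            + 0.20 * jaccard(set(q), set(f))
            + 0.15 * phrase
            + 0.15 * typo_overlap(q, f)
            + 0.15 * jaccard(char_ngrams(query), char_ngrams(fragment)))
```

With this sketch, a misspelled query such as "crete a new pst" still scores higher against a fragment about creating a new post than against an unrelated one, because the typo-tolerant and n-gram components compensate for the missing exact matches.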

The retrieval step is not only about finding related text, but about choosing the best snippets for the model with enough precision to keep the prompt focused. The worker builds optimised chunks, protects sentence structure, keeps the number of top snippets bounded, and uses confidence thresholds to decide whether a snippet should be kept or discarded. In practice, this means the SLM does not receive the full document blindly. It receives a small set of ranked fragments that have already passed through a semantic selection layer designed to behave like a reduced RAG model.
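
A minimal sketch of the chunking and selection step follows, under stated assumptions: sentences are detected with a simple punctuation rule, chunk size is measured in characters, and `word_overlap` is a stand-in scorer used only to keep the example self-contained. The limits (`max_chars`, `top_k`, `min_score`) are hypothetical values, not the ones used in production.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks; a sentence is never cut in half."""
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def word_overlap(query: str, chunk: str) -> float:
    """Stand-in scorer: Jaccard overlap of lowercase word sets."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q | c) if q | c else 0.0


def select_snippets(query: str, chunks: list[str],
                    score_fn: Callable[[str, str], float],
                    top_k: int = 3, min_score: float = 0.2) -> list[str]:
    """Keep at most top_k chunks whose score reaches the confidence threshold."""
    scored = [(score_fn(query, c), c) for c in chunks]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:top_k]]
```

The threshold matters as much as the cap: when nothing scores high enough, the function returns an empty list, and the application can decline to answer instead of passing weak context to the model.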

Before the SLM generates anything, the application also performs intent detection. It distinguishes between greeting, farewell, thanks, non-relevant prompt, summary, summary operation, and other questions, so each request follows the correct route. Some intents are answered directly, some go through summary logic, and others enter the full RAG flow. This routing layer is important because it prevents the application from treating every input the same way and allows the response strategy to match the user's actual intent.
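
The routing idea can be sketched with a few rules. This is an assumption-heavy illustration: the source does not describe how intents are detected, so the regular expressions, the intent labels and the route names below are hypothetical, and the "summary operation" intent is left out because its exact meaning is not specified.

```python
import re

# Ordered rules: the first pattern that matches decides the intent.
INTENT_PATTERNS = [
    ("greeting", re.compile(r"^\s*(hi|hello|hey|good (morning|afternoon|evening))\b", re.I)),
    ("farewell", re.compile(r"^\s*(bye|goodbye|see you)\b", re.I)),
    ("thanks", re.compile(r"\b(thanks|thank you)\b", re.I)),
    ("summary", re.compile(r"\b(summary|summari[sz]e)\b", re.I)),
]

# Which handling path each intent takes (names are illustrative).
ROUTES = {
    "greeting": "direct",
    "farewell": "direct",
    "thanks": "direct",
    "non_relevant": "direct",
    "summary": "summary",
    "question": "rag",
}


def detect_intent(prompt: str) -> str:
    # A prompt with no word characters at all carries nothing to answer.
    if not re.search(r"\w", prompt):
        return "non_relevant"
    for name, pattern in INTENT_PATTERNS:
        if pattern.search(prompt):
            return name
    return "question"


def route(prompt: str) -> str:
    return ROUTES[detect_intent(prompt)]
```

Only prompts that fall through every rule reach the retrieval and generation path, so greetings and thanks never cost a model call.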

Once the intent is identified, the application extracts context, builds the prompt, and injects the relevant snippets and the truncated chat history into the model input. The prompt is designed to keep the answer grounded in the provided context and to avoid unsupported generation. The result is a controlled generation step rather than an open-ended answer from raw model memory.
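
Prompt assembly might look like the sketch below. The instruction wording, the plain-text layout and the `max_turns` limit are assumptions made for illustration; a real deployment would more likely format the messages with the model's own chat template.

```python
def build_prompt(question: str,
                 snippets: list[str],
                 history: list[tuple[str, str]],
                 max_turns: int = 4) -> str:
    """Combine grounding instructions, ranked snippets and recent history."""
    # Number the snippets so the model can tell them apart.
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, 1))
    if not context:
        context = "(no context found)"

    # Truncate the chat history to the most recent turns to save memory.
    recent = history[-max_turns:] if max_turns > 0 else []
    turns = "\n".join(f"{role}: {text}" for role, text in recent)

    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"User: {question}\nAssistant:"
    )
```

Keeping the instruction, context and history in fixed positions makes the prompt size predictable, which helps on a server where memory for text processing is limited.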

So the overall architecture is intentionally compact, but conceptually close to a standard RAG stack: local retrieval first, intent routing next, prompt assembly after that, generation in the SLM, and response assessment at the end. It preserves the structure of a bigger language-system pipeline, while staying small enough to run within the memory limits of the server.

Plaudere SLM request flow.

Conclusions and Future Improvements:

The initial goal was to provide better support tools for users and to help them understand the content of articles and user profiles. Quality context is key to achieving good results. While the model occasionally "hallucinates" or makes basic factual mistakes because it is an SLM rather than an LLM, it successfully answers questions based on the provided support information.

Future goals include, among others:

Reducing hallucinations by moving to a more recent Qwen iteration or to other, better models.
Enhancing support by including images and videos in AI responses.
Refining context rules for user-generated content.
Improving multi-language consistency so that the AI responds in the same language as the user.
Adopting an embedded vector database to improve the coherence and context of the answers.

This project was essential for experimenting with generative AI under real-world hardware constraints, applying the RAG methodology to enable a model to respond to context it had never seen during its original training. The development demonstrates that with clever engineering, smart and efficient AI-powered solutions can be successfully implemented even within highly constrained environments.