
AI ASSISTANCE WITH SMALL LANGUAGE MODELS

image
[+]

Cerrar

English

English

Image Text

image image

0

image
[+]

Cerrar

Joe Esteves

2 months ago

202 views

Professional

#artificialintelligence #engineering #plaudere


Background:

When ChatGPT was launched, I realised that technology had reached a turning point for those of us building applications. Seeing a system capable of organising vast amounts of text, identifying key information and simulating human-like responses was a breakthrough. It enabled natural language communication with computers, allowing us to embed intelligence directly into our data transformation and response workflows. It is, without a doubt, a defining technological milestone of the 21st century.

It is clear that this "miracle" is the result of years of research and the collective efforts of major organisations and thousands of independent contributors, all driven by a singular goal: enabling seamless natural language communication with machines. While following these developments, I was busy programming this website and wondered if there was a way to apply these trends to a "traditional" site built entirely on my own. I sought out a specific problem within Plaudere that a language model could actually solve. After a period of reflection, and after revisiting previous versions of the site, I realised that having been away from the logic and daily use of those earlier iterations made the application feel difficult to navigate and understand. Despite my efforts to evolve the site into something more intuitive, a barrier to entry remained. I therefore identified that user friction during navigation and a lack of clarity regarding the site’s core purpose were recurring challenges that could be mitigated through the clever application of language models.

Navigating Constraints: Cloud vs Local:

To address the integration of language models into the website, I researched the high-level fundamentals of generative AI. Although many models are available, they typically involve costs per API call or strict usage caps that did not align with a project in its early stages like Plaudere. Since I lacked data regarding potential traffic, relying on third parties presented a significant risk of service interruptions that could frustrate users. My core objective was to build native capabilities without external dependencies, provided that a local solution remained feasible. I concluded that for Plaudere's specific use case, where extreme precision was not the primary requirement, it made little sense to deploy massive or expensive models if a smaller, more efficient one could meet the criteria. Consequently, I began searching Hugging Face for open-source models compatible with Node.js, as my priority was to embed AI directly into the site's current architecture and avoid the complexity of managing additional external components beyond basic authentication services.

I explored several models while trying to keep the service within the strict limits of 3GB of RAM and 3 vCPUs. I tested Flan-T5, TinyLlama, Llama, and GPT-2, but most exceeded the 3GB threshold once loaded into memory, or triggered Node.js errors when combined with the website's own overhead, a common frustration in development. After several attempts, the only model that proved viable was Qwen 1.5 0.5B (developed by Alibaba Cloud). It is remarkably efficient, requiring less than 3GB of RAM and 3 vCPUs when using a quantised version. With roughly 500 million parameters, it is classified as a Small Language Model (SLM). Despite its compact size, it handles a context window of roughly 25,000 words. It follows a decoder-only transformer architecture and supports 12 languages, showing strong performance in English and impressive capability across other languages despite its small footprint.

Qwen1.5 0.5B model repository on Hugging Face.

Official Qwen 1.5 0.5B model documentation.
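
The article does not include the loading code itself, so here is a minimal sketch of how the chat-tuned variant can be loaded with the Hugging Face transformers library in Python. The exact checkpoint name, precision and generation settings are my own assumptions rather than the production configuration:

# Minimal sketch: loading Qwen 1.5 0.5B for CPU inference.
# The checkpoint, dtype and generation settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen1.5-0.5B-Chat"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float32,   # CPU-friendly; a quantised build reduces memory further
    low_cpu_mem_usage=True,
)

def ask(prompt: str) -> str:
    """Run a single chat turn and return only the newly generated text."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    output = model.generate(input_ids, max_new_tokens=200)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(ask("What is Plaudere?"))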

Before settling on Qwen, I researched older NLP-based architectures. While efficient, those older models could not respond in natural language; they could identify the general meaning of a text and answer specific questions, but they lacked an understanding of subject-verb conjugation and context. Often, the answer was a simple "Yes" or "No." To create a supportive agent for Plaudere, the lack of natural language fluidity in those older models made them unfeasible. This is why I turned to transformer-based LLMs, which have effectively made older Q&A architectures outdated by providing a much more human and adaptive interaction.

The FastAPI Solution and Qwen 1.5:

My next step was to move the Qwen model to a separate FastAPI service. Keeping the LLM and the website server separate prevented memory conflicts. This dedicated API "wakes up" when a user asks a question; the website waits for the remote service to respond, and if no further questions are asked, the service can "sleep" to save server power.
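
The service code is not shown in the post, but a rough sketch of the idea, with the model loaded lazily so an idle process keeps a small footprint (in practice the hosting platform may also spin the whole service down), could look like this; module and function names are hypothetical:

# Sketch of the dedicated FastAPI process that hosts the model.
# Keeping it separate from the Node.js site avoids memory conflicts, and
# loading the model only when the first question arrives approximates the
# "wake up" behaviour described above.
from functools import lru_cache
from fastapi import FastAPI

app = FastAPI(title="Plaudere assistant")

@lru_cache(maxsize=1)
def get_ask():
    """Load the model on first use (the 'wake up') and reuse it afterwards."""
    from plaudere_model import ask   # hypothetical module holding the loading sketch above
    return ask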

Implementing the Context (RAG):

To integrate the LLM, I adjusted the website to include a chat pop-up that sends the user's prompt to the server. I defined three support cases (a sketch of the request payload follows the list):

  • General website support.
  • Questions about articles created on the site.
  • Information about users (only if shared by them).
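
The post does not spell out the payload the pop-up sends, but one way to model the contract, including a field that records which of the three support cases applies, might be the following; the field names and topic values are my own:

# Sketch of the request/response contract between the chat pop-up and the
# FastAPI service. Field names and topic values are illustrative.
from typing import Literal
from pydantic import BaseModel

class ChatRequest(BaseModel):
    question: str                               # the user's prompt
    topic: Literal["site", "article", "user"]   # which of the three support cases applies
    context: str = ""                           # support text selected on the website side

class ChatResponse(BaseModel):
    answer: str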

Even though Qwen supports a 25,000-word context window, trying to push that much data into a server with only 3GB of RAM is simply not feasible. Once the model is loaded, the space left for processing text is so tight that sending huge blocks of data would crash the service instantly. To fix this, I built a Retrieval-Augmented Generation architecture focused entirely on efficiency. Instead of overwhelming the model with irrelevant information, I use Fuse.js to pinpoint the exact sentences that actually match the user’s query. Fuse.js is a very lightweight fuzzy-search library that does the job without needing any external infrastructure. Its main strength is that it runs entirely in memory and is incredibly fast, which means I can skip using a vector database for now. There is no network latency and the memory footprint is basically zero. If the function does not find a direct match, it simply pulls a summary to give a general answer.
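
On the service side, the augmented prompt can then be assembled from whatever the website matched, falling back to a stored summary when nothing relevant arrives. This is a sketch only, with the instruction wording and the summary text invented for illustration:

# Sketch of the prompt-assembly step of the handcrafted RAG.
# The instruction wording and the fallback summary are illustrative.
SITE_SUMMARY = "Plaudere is a site where users publish articles and discuss them."

def build_prompt(question: str, context: str) -> str:
    """Combine the retrieved snippets (or a generic summary) with the user's question."""
    support_text = context.strip() or SITE_SUMMARY   # fall back when no direct match was found
    return (
        "Answer using only the information below. "
        "If the answer is not there, say you do not know.\n\n"
        f"Information:\n{support_text}\n\n"
        f"Question: {question}"
    )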

I know this fuzzy-matching approach has its limits, because it relies on finding specific words rather than understanding deep meaning the way a vector database would. But for where Plaudere is at the moment, setting up a full vector stack, with all the extra costs and resources it needs, just didn't make sense. I went for this handcrafted RAG because it is practical and lets me test the Small Language Model without breaking the server. The model receives an augmented prompt, which stops it from making things up and forces it to stick to the content provided by Fuse.js. The only real downside is that FastAPI has to handle questions one by one in a queue, because processing multiple requests at once would blow past the 3GB limit, but that is a trade-off I am happy to make to keep the whole thing stable.
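
Tying the earlier sketches together, the endpoint can serialise generations with a simple lock, which is one straightforward way to implement the one-by-one queue described above (again a sketch, not the actual implementation):

# Sketch of the /chat endpoint with single-request serialisation.
# app, get_ask, build_prompt, ChatRequest and ChatResponse come from the
# earlier sketches; asyncio.Lock is one simple way to queue questions.
import asyncio

inference_lock = asyncio.Lock()

@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest) -> ChatResponse:
    ask = get_ask()                                    # loads the model on the first question
    prompt = build_prompt(req.question, req.context)
    async with inference_lock:                         # only one generation runs at a time
        answer = await asyncio.to_thread(ask, prompt)  # keep the event loop responsive
    return ChatResponse(answer=answer)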

Conclusions and Future Improvements:

I believe the initial goal of giving users better tools to understand the website has been accomplished. Quality context is key to achieving good results. While the model occasionally "hallucinates" or makes basic factual mistakes because it is an SLM rather than an LLM, it successfully answers questions based on the provided support info.

Future goals include, among others:

  • Reducing hallucinations by moving to a more recent Qwen iteration or another, more capable model.
  • Enhancing support by including images and videos in AI responses.
  • Refining context rules for user-generated content.
  • Ensuring multi-language consistency so the AI responds in the same language as the interface.
  • Using an embedded vector database to improve the coherence and context of the answers.

This project was well worth the effort. It allowed me to experiment with GenAI and understand the limitations imposed by hardware constraints while applying the RAG methodology to make an LLM respond to context it has never seen before. It has been a real eye-opener on how to push hardware to its limits and prove that smart, efficient solutions are possible even with tight resources.
