Unleashing the Power of Small Language Models: Running SLMs in Your Browser

Tarun Gudipati
5 min read · Jun 30, 2024


We have all witnessed the power of Large Language Models (LLMs) and the many ways they boost our productivity across different tasks.

Photo by Andreas Klassen on Unsplash

However, one challenge these LLMs pose is the scale of computing power they consume to operate.
Historically, accessing this kind of compute has meant the classic client-server model: the client sends a request, essentially a curated prompt, to an inferencing server that hosts the language model. These servers typically contain GPUs, which greatly accelerate the machine learning models and perform the heavy lifting of inference.

What does this mean for end users?

  1. They need a good enough internet connection to communicate with these servers.
  2. Running LLMs on a server is not cheap; it involves acquiring GPUs and maintaining the compute infrastructure, which translates into subscription fees that end users have to pay.
  3. They must be comfortable sending the server any contextual data that might help with the inference.
    Think of a task like summarizing a boring document that you want to email to your boss 😉.
    While the companies hosting these mighty LLMs do promise that they don’t use your data, you still have to be comfortable with the privacy implications.
  4. Typically, LLMs run on shared compute infrastructure, which means your request to the LLM server carries the same priority as every other user’s request, unless you are on a special paid tier.
    This means there is a limit on the number of requests you can make.

So what are our options?

While you could host your own LLM on your own server, or pay for a dedicated one, for personal and hobbyist use this is overkill.

Enter Small Language Models

Small Language Models (SLMs), as their name suggests, are small but highly capable language models.

Photo by Lenin Estrada on Unsplash

The goal of SLMs is to run with much lower compute cost and as efficiently as possible. This is achieved through a variety of optimization techniques like knowledge distillation, pruning and quantization (self-promotion link alert 😁).

While not as capable as a full-fledged LLM, SLMs retain a good portion of the capabilities that LLMs offer.

Source: Microsoft Research

This makes SLMs ideal candidates for running inference at the edge or on consumer devices, which is typically also better for user privacy.

Most modern devices come with a dedicated GPU/NPU that greatly accelerates inference on-device.

As full-stack developers, we have access to the underlying device GPU from the client’s browser through the WebGPU standard.

With recent advancements in both SLMs and WebGPU, we are reaching a point where SLMs come in reasonably sized packages, and WebGPU improvements keep extending how far we can push the GPU/NPU on users’ devices.
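Before trying to run a model, it’s worth checking that the browser actually exposes WebGPU, since support still varies. Here’s a minimal sketch of such a check; the hasWebGPU helper is just an illustrative name, while navigator.gpu and requestAdapter() are the standard WebGPU APIs.

```ts
// Minimal WebGPU availability check before attempting in-browser inference.
async function hasWebGPU(): Promise<boolean> {
  // navigator.gpu is only defined in browsers that implement WebGPU.
  const gpu = (navigator as any).gpu;
  if (!gpu) {
    return false;
  }
  try {
    // requestAdapter() resolves to null when no suitable GPU adapter exists.
    const adapter = await gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```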

Example time!

Just like all my other write-ups, this one wouldn’t be complete without an end-to-end example.

This time, we will be running a full-fledged SLM, Microsoft’s Phi-3, right in the browser!

The stack includes React + TypeScript + Vite, along with Transformers.js, an awesome open-source project that lets us run transformer models right in the browser using JavaScript/TypeScript. For this example I’ll be using the v3 pre-release of the library, since WebGPU support for SLMs is still in preview.
Hopefully the stable release will come out in the near future and we can play with that.

Because running inference on the GPU is a heavy task and can stall the main UI thread, we’ll use a dedicated web worker to do the heavy lifting for us.

We start with the web worker.
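The full worker lives in the repo linked at the end of this post; below is a condensed sketch of its shape. The message names ('init', 'generate', 'token', …) and a few details are illustrative, while AutoTokenizer, AutoModelForCausalLM, TextStreamer and the device / use_external_data_format options come from the Transformers.js v3 pre-release.

```ts
// worker.ts — a condensed sketch of the dedicated web worker.
// The message names ('init', 'generate', 'token', ...) are illustrative.
import {
  AutoTokenizer,
  AutoModelForCausalLM,
  TextStreamer,
} from '@xenova/transformers'; // Transformers.js v3 pre-release

class SLMGenerator {
  // Singleton: the tokenizer and model are loaded once and reused across requests.
  static model_id = 'Xenova/Phi-3-mini-4k-instruct_fp16';
  static tokenizer: any = null;
  static model: any = null;

  static async getInstance() {
    if (this.tokenizer === null) {
      this.tokenizer = await AutoTokenizer.from_pretrained(this.model_id);
    }
    if (this.model === null) {
      this.model = await AutoModelForCausalLM.from_pretrained(this.model_id, {
        device: 'webgpu',               // run inference through the WebGPU backend
        use_external_data_format: true, // allows ONNX weights larger than 2GB, split across files
      });
      // Warm up the GPU shaders with a single throwaway inference pass.
      await this.model.generate({ ...this.tokenizer('warm up'), max_new_tokens: 1 });
    }
    return { tokenizer: this.tokenizer, model: this.model };
  }
}

self.onmessage = async (event: MessageEvent) => {
  const { type, prompt } = event.data;

  if (type === 'init') {
    // Downloads and caches the model + tokenizer files, then warms up the GPU.
    await SLMGenerator.getInstance();
    self.postMessage({ type: 'ready' });
    return;
  }

  if (type === 'generate') {
    const { tokenizer, model } = await SLMGenerator.getInstance();
    // Tokenize the user prompt using the model's chat template.
    const inputs = tokenizer.apply_chat_template(
      [{ role: 'user', content: prompt }],
      { add_generation_prompt: true, return_dict: true },
    );
    // Stream each decoded token back to the UI thread as soon as it is produced.
    const streamer = new TextStreamer(tokenizer, {
      skip_prompt: true,
      skip_special_tokens: true,
      callback_function: (text: string) => self.postMessage({ type: 'token', token: text }),
    });
    await model.generate({ ...inputs, max_new_tokens: 512, streamer });
    self.postMessage({ type: 'done' });
  }
};
```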

At a high level, the web worker handles two main responsibilities through a singleton SLMGenerator class.

  1. It initializes the language model by downloading and caching the files needed for both the model and the tokenizer in the browser’s local cache, and then warms up the GPU shaders by running one simple inference pass.
  2. When initializing the model we need to pass device: 'webgpu' and use_external_data_format: true. The first option tells the library to use the WebGPU API; the second enables the external data format, which lets us load ONNX models larger than 2GB by splitting the model weights into separate files.
  3. If you are familiar with the transformers library in Python, you’ll recognize that the APIs exposed by AutoTokenizer and AutoModelForCausalLM are very similar in programmatic syntax.
  4. It also performs inference whenever the UI thread sends a prompt for the language model.
    It first tokenizes the input using the model’s tokenizer and then passes it through the model.generate(...) API.
  5. As individual tokens are generated, it decodes them and streams them back to the UI thread.

Now let’s take a look at the Chat component, which is the UI container and orchestrator for most of the chat-related work.
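The complete component is in the repo as well; the sketch below only captures its shape, using the same illustrative message protocol as the worker sketch above (the real component handles more UI than this).

```tsx
// Chat.tsx — a sketch of the Chat container component.
import { useEffect, useRef, useState } from 'react';

export function Chat() {
  const workerRef = useRef<Worker | null>(null);
  const [ready, setReady] = useState(false);
  const [output, setOutput] = useState('');

  useEffect(() => {
    // Boot the dedicated web worker and kick off the model download.
    const worker = new Worker(new URL('./worker.ts', import.meta.url), {
      type: 'module',
    });
    worker.onmessage = (event: MessageEvent) => {
      const { type, token } = event.data;
      if (type === 'ready') setReady(true);
      // Append streamed tokens to the visible output as they arrive.
      if (type === 'token') setOutput((prev) => prev + token);
    };
    worker.postMessage({ type: 'init' });
    workerRef.current = worker;
    return () => worker.terminate();
  }, []);

  const sendPrompt = (prompt: string) => {
    setOutput('');
    workerRef.current?.postMessage({ type: 'generate', prompt });
  };

  return (
    <div>
      <button disabled={!ready} onClick={() => sendPrompt('Summarize WebGPU in one line')}>
        {ready ? 'Ask the SLM' : 'Loading model…'}
      </button>
      <p>{output}</p>
    </div>
  );
}
```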

As we can see, the chat UI component boots up the web worker and kicks off the model download process in a useEffect(...) hook.

After that, we follow an event-driven pattern, where messages to and from the worker are communicated as events.
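For reference, the event shapes assumed in the sketches above could be typed roughly like this (the names are illustrative, not the exact ones from the repo):

```ts
// Illustrative message shapes for the UI <-> worker event protocol.
type ToWorkerMessage =
  | { type: 'init' }                      // ask the worker to download and warm up the model
  | { type: 'generate'; prompt: string }; // ask the worker to run inference on a prompt

type FromWorkerMessage =
  | { type: 'ready' }                     // model is cached and warmed up
  | { type: 'token'; token: string }      // one decoded token, streamed as it is generated
  | { type: 'done' };                     // generation finished
```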

In this case we are using the Xenova/Phi-3-mini-4k-instruct_fp16 model, a slightly tuned, web-ready ONNX version of Microsoft’s Phi-3-mini-4k-instruct model.

This model is a lightweight, state-of-the-art open model built upon the datasets used for Phi-2 — synthetic data and filtered websites — with a focus on very high-quality, reasoning-dense data.

Alright enough theory, let’s see the SLM in action!

As you can see, the model generates about 9–10 tokens per second on a 2020 M1 MacBook Pro, which is certainly not bad!

Jim from the Office

You can also experience the SLM in your own browser by visiting:

https://tarun047.github.io/slm-demo/

Anyway, I thought I’d share something I found very interesting. That’s it for this story, thanks for making it this far 😄.

As always, the code for this write-up is available on my GitHub profile here:

Connect with me on LinkedIn and X for more interesting reads.

References:
https://github.com/xenova/transformers.js

https://www.microsoft.com/en-us/research/publication/phi-3-technical-report-a-highly-capable-language-model-locally-on-your-phone/
