A masochist’s guide to web development

Introduction
Setting things up
Hello world
Intermezzo I: What is WebAssembly?
Building a library
Intermezzo II: JavaScript and the DOM
Loading the library and making it a module
Multithreading
Intermezzo III: Web Workers and Spectre
Don’t block the main thread!
Callback functions
Persistent storage
Closing thoughts

Introduction

I have recently worked on making a web application out of my latest Rubik’s cube optimal solver. This involved building a rather complex C code base (with multithreading, SIMD, callback functions and whatnot) to WebAssembly via Emscripten, and writing a minimal amount of JavaScript and HTML for the frontend.

This whole process was complex, tiring and at times frustrating - but eventually it was a success! Not only I accomplished my goal, but I have learnt a lot along the way. After finishing the work, I decided to write down all that I have learnt and share it with the world with this post.

You may be wondering why one should do such a thing instead of either rewriting their code base in a more web-friendly language, or distributing their app using a native GUI framework. The main reason to use WebAssembly is that it can provide near-native performance (or so they claim) while running inside a web browser; this gives you all the portability of a web app without too much of a performance drawback, something that would not be possible with an interpreted language such as JavaScript.

So, what is this blog post? A tutorial for web development? I am not sure about this, but if it is, it is definitely not a normal one. As the title suggests, you should not start from this guide unless you just love banging your head against the wall. If you are looking for a sane guide to web development, I strongly advise you head on to the Mozilla Developer Network tutorials page and start from there.

But if you are a C or C++ developer looking to port a program or library to the web, then you are in the right place. With this post I am going to walk you through the process of building an increasingly complex library that can run in a web browser. Make sure you are sitting comfortably and be ready to sweat, because I am not going to shy away from the hard stuff and the complicated details.

To follow this tutorial you won’t need much experience with web development, but some familiarity with HTML and an idea of what JavaScript will be useful. It will also help to know that you can access your browser’s JavaScript console and other developer tools by pressing F12, at least on Firefox or Chrome - but I guess I have literally just taught you that, if you did not already know it. For all the rest, I’ll make sure to add many hyperlinks throughout the text, so you can follow them if something is new to you.

A little disclaimer: although I am a somewhat experienced C developer, I had very little web development experience before embarking in this adventure. If you are a web developer, you may find errors in this post that are going to make you laugh at my ignorance. If you do, I’d appreciate it if you could report them to me by sending an email to sebastiano@tronto.net!

With this out of the way, let’s get started!

Setting things up

The examples used in this tutorial are all contained in a git repository, which you can find either on my git page or on github.

In order to follow them you are going to need:

A working installation of Emscripten (which also includes Node.js). Refer to the official website for installation instructions.
A web server such darkhttpd or the Python http.server package; the examples will use darkhttpd.

I have only tested all of this on Linux, but everything should work exactly the same on any UNIX system. If you are a Windows user, you can either run everything inside WSL, or you can try and adjust the examples to your system - if you choose this second option, I’ll happily accept patches or pull requests :)

Hello world

Let’s start with the classic Hello World program:

#include <stdio.h>

int main() {
    printf("Hello, web!\n");
}

You can compile the code above with

emcc -o index.html hello.c

And if you now start a web server in the current folder, for example with darkhttpd . (the dot at the end is important), and open a web browser to localhost:8080 (or whatever port your web server uses), you should see something like this:

Hello world in a browser

As you can see, the compiler generated a bunch of extra stuff around you print statement. You may or may not want this, but for now we can take it as a convenient way to check that our program works as expected.

There are other ways to run this compiled code. With the command above, the compiler should have generated for you 3 files:

index.html - the web page in the screenshot above.
index.wasm - the actual compiled code of your program; this file contains WebAssembly bytecode.
index.js - some JavaScript glue code to make it possible for index.wasm to actually run in a browser.

If you don’t specify -o index.html, or if your specify -o followed by a filename ending in .js, the .html page is not going to be generated. In this case (but also if you do generate the html page), you can run the JavaScript code in your terminal with:

node index.js

In later examples, the same code may not work seamlessly in both a web browser and in Node.js - for example, when dealing with persistent data storage. But until then, we can generate all three files with a single command and run our code in either way.

It is also possible to ask Emscripten to generate only the .wasm file, in case you want to write the JavaScript glue code by yourself. To do this, you can pass the -sSTANDALONE_WASM option to emcc. However, in some cases the .js file is going to be generated even when this option is used, for example when building a source file without a main() entry point. Since this is something we’ll do soon, we can forget about this option and just take it as a fact that the .wasm files generated by emscripten require some glue JavaScript code to actually run, but in case you are interested you can check out the official documentation.

You can find the code for this example, as well as scripts to build it and run the web server, in the directory 00_hello_world of the git repository (git.tronto.net, github).

Anyway, now we can build our C code to run in a web page. But this is probably not the way we want to run it. First of all, we don’t want to use the HTML template provided by Emscripten; but more importantly, we probably don’t want to write a program that just prints stuff to standard output. More likely, we want to write some kind of library of functions that can be called from the front-end, so that the user can interact with our program via an HTML + JavaScript web page. Before going into that, let’s take a break to discuss what we are actually compiling our code to.

Intermezzo I: What is WebAssembly?

The logo of WebAssembly

WebAssembly is a low-level language meant to run in a virtual machine inside a web browser. The main motivation behind it is running higher-performance web applications compared to JavaScript; this is made possible, by its compact bytecode and its stack-based virtual machine.

WebAssembly (or WASM for short) is supported by all major browsers since around 2017. Interestingly, Emscripten, the compiler we are using to translate our C code to WASM, first appeared in 2011, predating WASM by a few years. Early on, Emscripten would compile C and C++ code into JavaScript, or rather a subset thereof called asm.js.

Just like regular assembly, WASM also has a text-based representation. This means that one could write WASM code directly, assemble it to bytecode, and then run it. We are not going to do it, but if you are curious here is a simple example (computing the factorial of a number, taken from Wikipedia):

(func (param i64) (result i64)
    local.get 0
    i64.eqz
    if (result i64)
        i64.const 1
    else
        local.get 0
        local.get 0
        i64.const 1
        i64.sub
        call 0
        i64.mul
    end)

As you can see, it looks like a strange mix of assembly and Lisp. If you want to try and run WASM locally, outside of a web browser, you could use something like Wasmtime.

Until early 2025, the WASM “architecture” was 32-bit only. One big limitation that this brings is that you cannot use more that 4GB (2³² bytes) of memory, because pointers are only 32 bits long; moreover, your C / C++ code may need some adjustments if it relied on the assumption that e.g. sizeof(size_t) == 8. At the time writing a new standard that enables 64 bit pointers, called WASM64, is supported on Firefox and Chrome, but not on Webkit-based browsers such as Safari yet. Depending on when you are reading this, this may have changed - you can check the status of WASM64 support here.

Building a library

Back to the main topic. Where were we? Oh yes, we wanted to build a C library to WASM and call it from JavaScript. Our complex, high-performance, math-heavy library probably looks something like this:

library.h (actually, we are not going to need this):

int multiply(int, int);

library.c:

int multiply(int a, int b) {
    return a * b;
}

Or maybe it is a bit more complicated than that. But we said we are going to build up in complexity, and this is just the beginning, so let’s stick to multiply().

To build this library you can use:

emcc -o library.js library.c

As we saw before, this is going to generate both a library.js and a library.wasm file. Now we would like to call our library function with something like this

program.js:

var library = require("./library.js");
const result = library.multiply(6, 7);
console.log("The answer is " + result);

(The require() syntax above is valid when running this code in Node.js, but not, for example when running in a browser. We’ll see in the next session what to do in that case, but for now let’s stick to this.)

Unfortunately, this will not work for a couple of reasons. The reason first is that Emscripten is going to add an underscore _ to all our function names; so we’ll have to call library._multiply(). But this still won’t work, because by default the compiler does not export all the functions in your code - that is, it does not make them visible to the outside. To specify which functions you want to export, you can use the -sEXPORTED_FUNCTIONS flag, like so:

emcc -sEXPORTED_FUNCTION=_multiply -o library.js library.c

And now we finally have access to our multiply() function…

$ node program.js
Aborted(Assertion failed: native function `multiply` called before runtime initialization)

…or maybe not. If you are new to JavaScript like I was a few weeks ago, you may find this error message surprising. Some runtime must be initialized, but can’t it just, like… initialize before trying to run the next instruction?

Things are not that simple. A lot of things in JavaScript happen asynchronously, and in these situations you’ll have to either use await or a callback function. So we’ll have to do something like this:

var library = require("./build/library.js");

library.onRuntimeInitialized = () => {
    const result = library._multiply(6, 7);
    console.log("The answer is " + result);
};

And now we can finally run our program:

$ node program.js 
The answer is 42

The code for this example can be found in the 01_library folder in the git repository (git.tronto.net, github).

Intermezzo II: JavaScript and the DOM

The logos of HTML, CSS and JavaScript

If we want to build an interactive web page using JavaScript, we’ll need a way for our script to communicate with the page, i.e. a way to access the HTML structure from JavaScript code. What we are looking for is called Document Object Model, or DOM for short.

For example, if you have a paragraph with some text in your HTML:

<p id="myParagraph">Hello!</p>

you can access this text from JavaScript like this:

var paragraph = document.getElementById("myParagraph");
paragraph.innerText = "New text!";

Here we are selecting the paragraph HTML element using its ID, and we are changing its text via its innerText property, all from JavaScript.

Let’s see a more complex example:

HTML:

<button id="theButton">Press me!</button>

JS:

var button = document.getElementById("theButton");
var counter = 0;

button.addEventListener("click", () => {
    counter++;
    button.innerText = "I have been pressed " + counter + " times!";
});

In the example above we add an event listener to a button: the (anonymous) function we defined is going to be called every time the button is clicked. And since this is a web page, I guess I can show you what this actually looks like.

Behold, the dynamic button:

If you are completely new to web development, you may be wondering where you should write this JavaScript code. One option is to write it in the same HTML file as the rest of the page, inside a <script> tag; this is how I did it in the example above, as you can check by viewing the source of this page (press Ctrl+U, or right-click and select “view source”, or prepend view-source: to this page’s URL; hopefully at least one of these methods should work in your browser).

However, if the script gets too large you may want to split it off in a separate file, which we’ll demonstrate in this next example.

Let’s now make a template web page for using our powerful library. Let’s start with the HTML, which is in large part boilerplate:

index.html:

<!doctype html>
<html lang="en-US">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width">
    <title>Multiply two numbers</title>
    <script src="./script.js" defer></script>
</head>

<body>
    <p>
    <input id="aInput"> x <input id="bInput">
    <button id="goButton">=</button>
    <span id="resultText"></span>
    </p>
</body>

</html>

Besides the <body> element, the only important line for us is line 7, which loads the script from a file. Notice that we use the defer keyword here: this is telling the browser to wait until the whole page has been loaded before executing the script. If we did not do this, we could run in the situation where we document.getElementById() returns null, because the element we are trying to get is not loaded yet (yes, this happened to me while I was writing this post). If you want to know more, check out this MDN page.

Now to the JavaScript code. For now we are going to use the built-in * operator to multiply the two numbers, but in the next section we are going to replace it with our own library.

script.js (in the same folder as index.html):

var aInput = document.getElementById("aInput");
var bInput = document.getElementById("bInput");
var button = document.getElementById("goButton");
var resultText = document.getElementById("resultText");

button.addEventListener("click", () => {
    var a = Number(aInput.value);
    var b = Number(bInput.value);
    resultText.innerText = a * b;
});

The final result will look something like this:

In a real-world scenario you would probably want to check that the text provided in the input fields is actually a number, or perhaps use the type="number" attribute for the input fields. But we’ll ignore these issues here - we are going to have more serious problems to deal with.

Loading the library and making it a module

With what we have learned in the previous intermezzo (you are not skipping those, right?) we can finally run our library code in a real web page. The code is pretty much the same as above; we just need to include both the library and the script file in the HTML:

    <script src="./library.js" defer></script>
    <script src="./script.js" defer></script>

and of course we have to change the line where we perform the multiplication:

    resultText.innerText = Module._multiply(a, b);

Here Module is the default name given to our library by Emscripten. Apart from being too generic a name, this leads to another problem: we can’t include more than one Emscripten-built library in our page in this way - otherwise, both are going to be called Module.

Luckily, there is another way: we can build a modularized library, i.e. obtain a JavaScript Module. This may sound a bit strange, because the name Module kind of implies there is already a module. The way I understand it is that by default Emscripten produces a script that contains a module named Module; when building a modularized library, the whole resulting file is a module.

Modularizing our build is not necessary right now, but there are a couple of other advantages to it:

As mentioned above, we can change the name of our module and include more than one Emscripten-built library, if we want.
We will be able to use the module in the same way in Node.js and in our web page script. This way we can minimize the differences between the two versions of our code, which can be useful for testing.
In case we want to build a more complex layer of JavaScript between our library and our web page, with a modularized build we can easily include the module in another file, which can then be included in the main script.

So let’s go ahead and build our library like so:

emcc -sEXPORTED_FUNCTION=_multiply -sMODULARIZE -sEXPORT_NAME=MyLibrary \
     -o library.mjs library.c

Notice I have changed the extension from .js to .mjs. Don’t worry, either extension can be used. And you are going to run into issues with either choice:

If you run your code in Node.js, it will understand that the library file is a module only if you use the .mjs extension. Alternatively, you can change some settings in a local configuration file to enforce this.
If you run your code in a web page, your web server may not be configured to serve .mjs files as JavaScript files. This can easily be changed by adding a configuration line somewhere.

In my examples I chose to use the .mjs extensions to make Node.js happy, and I changed the configuration of my web servers as needed. For example, for darkhttpd I added a file called mime.txt with a single line text/javascript mjs, and launched the server with the --mimetypes mime.txt option.

Now we have to make a couple of changes. Our program.js, for running in node, becomes:

import MyLibrary from "./library.mjs"

var myLibraryInstance = await MyLibrary();

const result = myLibraryInstance(6, 7);
console.log("The answer is " + result);

By the way, I have renamed this file to program.mjs. This is because only modules can use the static import statement; alternatively, I could have used the dynamic import() and kept the .js extension.

Similary, we have to update our script.js (or script.mjs) to import the module and create an instance. Moreover, we have to specify in the HTML that the script is now a module:

    <script src="./script.mjs" type="module" defer></script>

And we can get rid of the other <script> tag, since now the library is included directly in script.mjs.

You can find the full the code for this example the folder 02_library_modularized in the git repository (git.tronto.net, github).

Multithreading

"Gotta go fast" meme

Let’s move on to a more interesting example. If one of the goals of WebAssembly is performance, there is no point in using only 1/16th of your CPU - let’s port a multithreaded application to the web!

As a more complicated example, let’s write a function that counts how many prime numbers there are in a given range. This function takes two integers as input and returns a single integer as output, but it does a non-trivial amount of work under the hood. A simple implementation of this routine would be something like this:

bool isprime(int n) {
    if (n < 2)
        return false;

    for (int i = 2; i*i <= n; i++)
        if (n % i == 0)
            return false;
    return true;
}

int primes_in_range(int low, int high) {
    if (low < 0 || high < low)
        return 0;

    int count = 0;
    for (int i = low; i < high; i++)
        if (isprime(i))
            count++;

    return count;
}

This algorithm is embarassingly parallelizable: we can split the interval [low, high) into smaller sub-intervals and process each one of them in a separate thread; then we just need to add up the results of the sub-intervals.

For the actual implementation, we are going to use pthreads, for the simple reason that it is supported by Emscripten. In practice, assuming we are working on a UNIX platform, we could also use C11’s threads.h or C++’s std::thread, but only because they happen to be wrappers around pthreads. On other platforms, or in other implementations of the C and C++ standard library, this may not be the case; so we’ll stick to old-school pthreads.

This is my parallel version of primes_in_range():

primes.c:

#include <stdbool.h>
#include <pthread.h>

#define NTHREADS 16

bool isprime(int);
void *pthread_routine(void *);

struct interval { int low; int high; int count; };

int primes_in_range(int low, int high) {
    pthread_t threads[NTHREADS];
    struct interval args[NTHREADS];

    if (low < 0 || high < low)
        return 0;

    int interval_size = (high-low)/NTHREADS + 1;
    for (int i = 0; i < NTHREADS; i++) {
        args[i].low = low + i*interval_size;
        args[i].high = args[i].low + interval_size;
        pthread_create(&threads[i], NULL, pthread_routine, &args[i]);
    }

    int result = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(threads[i], NULL);
        result += args[i].count;
    }

    return result;
}

bool isprime(int n) {
    if (n < 2)
        return false;

    for (int i = 2; i*i <= n; i++)
        if (n % i == 0)
            return false;
    return true;
}

void *pthread_routine(void *arg) {
    struct interval *interval = arg;

    interval->count = 0;
    for (int i = interval->low; i < interval->high; i++)
        if (isprime(i))
            interval->count++;

    return NULL;
}

(Pro tip: if you take the number of threads as an extra parameter for your function, you can pass to it the value navigator.hardwareConcurrency from the JavaScript front-end and use exactly the maximum number of threads that can run in parallel on the host platform.)

To build this with Emscripten we’ll have to pass the -pthread option and, optionally, a suitable value for -sPTHREAD_POOL_SIZE.

If we want to run our multithreaded code in an actual browser, we’ll have to scratch our head a bit harder. The code we are supposed to write is exactly what we expect, but once again we have to tinker with our web server configuration. For technical reasons that we’ll cover in the next intermezzo, in order to run multithreaded code in a browser we must add a couple of HTTP headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

These headers are part of the response your browser will receive when it requests any web page from the server. The way you set these depends on the server you are using; with darkhttpd you can use the --header option.

With your server correctly set up, you can enjoy a multithreaded program running in your browser! As always, you can check out this example from the 03_threads folder of the git repository (git.tronto.net, github).

Intermezzo III: Web Workers and Spectre

The logo of the Spectre vulnerability

On a low level, threads are implemented by Emscripten using web workers, which are processes separated from the main web page process and communicate with it and with each other by passing messages. Web workers are commonly used to run slow operations in the background without blocking the UI threads, so the web page remains responsive while these operations run - we’ll do this in the next section.

Web workers do not have regular access to the same memory as the main process, and this is something that will give us some issues in later sections. However, there are ways around this limitation. One of these ways is provided by SharedArrayBuffer, which we won’t use directly in this tutorial, but is used by Emscripten under the hood.

And this is why we had to set the Cross-Origin-* headers. In 2018, a CPU vulnerability called Spectre was found, and it was shown that an attacker could take advantage of shared memory between the main browser thread and web workers to execute code remotely. As a counter-measure, most browsers now require your app to be in a secure context and cross-origin isolated to allow using SharedArrayBuffers.

Even if you do not plan to use web workers directly, it is still good to have a rough idea of how they work, because of the law of leaky abstractions: all abstractions are leaky. The fact that we had to mess around with our Cross-Origin-* headers despite not caring at all about SharedArrayBuffers is a blatant example of this.

Don’t block the main thread!

If you have run the previous example, may have noticed a scary warning like this in your browser’s console:

A warning saying "Blocking on the main thread is very dangerous, see [link]"

The link points to this page in Emscripten’s documentation.

The issue here is that our heavy computation is not running “in the background”, but its main thread (the one spawning the other threads) coincides with the browser’s main thread, the one that is responsible for drawing the UI and handling user interaction. So if our computation really takes long, the browser is going to freeze - and after a few seconds it will ask us if we want to kill this long-running script.

As we anticipated in the previous intermezzo, we are going to solve this with a web worker. We will structure this solution as follows:

The main script will be responsible for reading the user input, sending a message to the worker to ask it to compute the result, and handling the result that the worker is going to send back once it is done. No slow operation is performed by this script, so that it won’t block the main thread.
The worker will be responsible for receiving mesages from the main script, handling them by calling the library, and sending a message with the response back once it is done computing.

In practice, this will look like this:

script.mjs:

var aInput = document.getElementById("aInput");
var bInput = document.getElementById("bInput");
var button = document.getElementById("goButton");
var resultText = document.getElementById("resultText");

var worker = new Worker("./worker.mjs", { type: "module" });

button.addEventListener("click", () => worker.postMessage({
    a: Number(aInput.value),
    b: Number(bInput.value)
}));

worker.onmessage = (e) => resultText.innerText = "There are " +
    e.data.result + " primes between " + e.data.a + " and " + e.data.b;

worker.mjs:

import Primes from "./build/primes.mjs";

var primes = await Primes();

onmessage = (e) => {
    const count = primes._primes_in_range(e.data.a, e.data.b);
    postMessage({ result: count, a: e.data.a, b: e.data.b });
};

More complicated than before, but nothing crazy. Notice how we are using postMessage() and onmessage() to pass events back and forth. The argument of postMessage() is the actual data we want to send in JSON format, while the argument of onmessage() is an event whose data property contains the object that was sent with postMessage().

You can check out this example in the directory 04_no_block in the repository (git.tronto.net, github). Try also large numbers, in the range of millions or tens of millions, and compare it with the previous example - but not don’t go too large, we only support 32-bit integers for now. Notice how, with this new setup, the browser remains responsive while it is loading the response.

Oh and by the way, a nice exercise for you now could be making the script show some kind of "Loading result..." message while the worker is working. This is not hard to do, but a huge improvement in user experience!

Callback functions

The hand of a person using a phone

For one reason or another, your library function may take as parameter another function. For example, you may use this other function to print log messages regardless of where your library code is run: a command-line tool may pass printf() to log to console, while a GUI application may want to show these messages to some text area in a window, and it will pass the appropriate function pointer parameter. This is the use case that we are going to take as an example here, but it is not the only one.

Implementing this was probably the step that took me the longest in my endeavor to port my Rubik’s cube solver to the web. Luckily for you, when writing this post I found a simpler method, so you won’t have to endure the same pain.

First, we’ll have to adapt our library function like this:

int primes_in_range(int low, int high, void (*log)(const char *)) {
    /* The old code, with calls to log() whenever we want */
};

Tip: when using callback functions like this, it is good practice to have them accept an extra void * parameter, and the library function should also accept an extra void * parameter that it then passes on to the callback. So our function would look something like this: int primes_int_range(int low, int high, void (*log)(const char *, void *), void *log_data). This makes the setup extremely flexible, and allows passing callback functions in situation where this may be tricky. For example, this way you could pass a C++ member function by passing an object as log_data and a function that call log_data’s member function as log. Since we are not going to use this in this example, I’ll stick to the simpler setup.

Now, to call our function from the JavaScript side we would like to do something like this:

int result = primes_in_range(a, b, console.log); // Logging to console

Unfortunately, this will not work, because console.log, a JavaScript function object, does not get automatically converted to a function pointer, which is what C expects. So we’ll have to do something slightly more complicated:

import Primes from "./build/primes.mjs"

var primes = await Primes();
const logPtr = primes.addFunction((cstr) => {
    console.log(primes.UTF8ToString(cstr));
}, "vp");

const count = primes._primes_in_range(1, 100, logPtr);

Here addFunction() is a function generated by Emscripten. Notice also that we are wrapping our console.log() in a call to UTF8ToString(), an Emscripten utility to convert C strings to JavaScript strings, and that we are passing the function’s signature "vp" (returns void, takes a pointer) as an argument; see here for more information.

Other than that, you just need to add a couple of compiler flags:

-sEXPORTED_RUNTIME_METHODS=addFunction,UTF8ToString to tell the compiler to make these two methods available.
-sALLOW_TABLE_GROWTH to make it possible to add functions to out module at runtime with addFunction().

And as you can check by running the example 05_callback from the repo (git.tronto.net, github), everything works as expected, both in Node.js and in a web page. To make the examples more interesting, the web page one is not only not logging the messages to console, but it also shows them as text in the web page.

Note: you must be careful where you call this callback function from. If you try to call it from outside the main thread - for example, in one of the threads that are spawned to count the primes in the sub-intervals - you’ll get a horrible crash. This is because web workers do not have access to the functions that reside in another worker’s memory.

Persistent storage

3D rendering of a spinning hard drive

Our multithreaded implementation of primes_in_range() is not slow, but it could be faster. One possible way to speed it up is to use a look-up table to make is_prime() run in constant time; for this we’ll need to memorize which numbers below 2³¹ (the maximum value of 32-bit signed integer) are prime. This will require 2³¹ bits of data, or 256MB. It would be nice if we could store this data persistently in the user’s browser, so that if they use our app again in the future we won’t need to repeat expensive calculations or re-download large files.

Putting aside the question of whether any of the above is a good idea, and assuming you know how to generate such a table, in C you would read and store the data like this:

#include <stdio.h>

#define FILENAME "./build/primes_table"

void read_table(unsigned char *table) {
    FILE *f = fopen(FILENAME, "rb");
    fread(table, TABLESIZE, 1, f);
    fclose(f);
}

void store_table(const unsigned char *table) {
    FILE *f = fopen(FILENAME, "wb");
    fwrite(table, TABLESIZE, 1, f);
    fclose(f);
}

Note: the code snippet above is extremely simplified, you probably want to add some error-handling code if you implement something like this.

The good news is that we can use the same code when building with Emscripten! The bad news is that… well, it’s a bit more complicated than that.

First of all, it is important to know that Emscripten’s File System API supports different “backends”, by which I mean ways of translating the C / C++ file operations to WASM / JavaScript. I am not going to discuss all of them here, but I want to highlight a few key points:

The default backend is called MEMFS. It is a virtual file system that resides in RAM, and all data written to it is lost when the application is closed.
Only one of these backends (NODERAWFS) gives access to the actual local file system, and it is only usable when running your app with Node.js. Browsers are sandboxed, and the filesystem is not normally accessible to them. There are ways, such as the File System API, to access files, but as far as I understand each file you want to access requires explicit actions from the user. We would like to manage our data automatically, so we are not going to use this API.
The backend we are going to use is called IDBFS. It provides access to the IndexedDB API, which allows to persistently store large quantities of data in the browser’s cache. The data is only removed if the user asks for it, for example by cleaning it from the browser’s settings page.

To activate the IDBFS backend, we are going to add --lidbfs.js to our compiler options. The Indexed DB is not the only way to store data persistently in the browser. For an overview of all the options, you can take a look at this page on MDN.

The compiler flag is not enough, however. We also need to:

Create a directory (for the virtual file system) where our data file is going to be stored. We are going to call this directory assets, but you can pick any other name; it does not have to coincide with the name of a directory that exists on your local file system.
Mount the directory we have just created in the indexed DB.
Synchronize the virtual file system, so that our script is able to read pre-existing files.

All of the above has to be done from JavaScript, which makes things a little bit complicated, because we are reading our files from C code. We have a couple of ways to work around this issue:

Using inline JavaScript in our C code with the EM_JS() or EM_ASYNC_JS() Emscripten macros.
Setting up the indexed DB file system when the module loads using the --pre-js compiler option.

Here we are going to use the second solution, but the first option is good to keep in mind, because it allows us to call JavaScript code at any point rather than just at startup.

Note: if you do end up using EM_ASYNC_JS() to make asynchronous JavasScript functions callable from C, keep in mind that any C function that, directly or indirectly, calls an async JavaScript function, will now return a promise when called from JavaScript. But wether an async function is called is determined at runtime, so you C function may return a value one time and a promise another time, depending on how exactly it runs!

So we are going to add --pre-js init_idbfs.js to our compiler options, with init_idbfs.js containing the following:

Module['preRun'] = [
    async () => {
        const dir = "/assets";

        FS.mkdir(dir);
        FS.mount(IDBFS, { autoPersist: true }, dir);

        Module.fileSystemLoaded = new Promise((resolve, reject) => {
            FS.syncfs(true, (err) => {
                if (err) reject(err);
                else resolve(true);
            });
        });

    }
];

As you can see, the syncing operation is more complicated, the main reason being that it is an asynchronous operation. For this reason, we are wrapping it in a Promise, so we can detect when this operation is done and react accordingly. We are going to do so from our worker script, which will send a message to the main script to communicate that the file system is ready to go:

primes.fileSystemLoaded.then(() => {
    postMessage({ type: "readySignal" });
});

The main script can then handle this signal as it prefers, for example by enabling the Compute button, if it was previously marked as disabled.

One last thing: since we are now using a large amount of memory and loading the virtual file system at the start, the compiler will complain that we are not reserving enough memory for our application. Adding a -sINITIAL_MEMORY=272629760 compiler flag will do the trick (watch out: the number you provide must be a multiple of 2¹⁶). I am not entirely sure why this is the case, since we are not loading the file in memory statically, but only at runtime, and only when the primes_in_range() function is called. I would expect that using -sALLOW_MEMORY_GROWTH would be enough - and indeed this is the case if we use the EM_ASYNC_JS() macro to load the file system on-demand.

And with all this, we are ready to run our optimized version of the primes_in_range() algorithm, all from within our browser! As always, you can check out the complete code in the folder 06_storage of the repository (git.tronto.net, github).

If generating this data on the user’s side seems redundant, you can also have it downloaded from the server. I won’t explain how to it here, since there are many possible ways to achieve this - after all, the indexed DB is also accessible from JavaScript. If you want to experiment more with Emscripten you can try to use the Fetch API; in my project I was not able to make its synchronous version work together with -sMODULARIZE, so I ended up using fetch() directly from within an EM_ASYNC_JS() function. This tutorial is already too long, so I am going to leave this as an exercise for the reader.

Closing thoughts

I have discussed almost everything that I have learned about building a webapp in C / C++ with Emscripten. I ended up using C, not C++, for all of my example, so I did not have a chance to discuss some neat C++-specific features such as EMBIND() and emscripten::val - do check them out if you plan to use C++ for your web app!

Even if this page is structured like a tutorial, this is probably better described as a collection of personal notes, a “brain dump” that I wrote for myself as is the case with many of my blog posts. Writing this piece was a great occasion for me to review the work that I have done and the things I have learned. And while reflecting on all of this I was able to isolate a specific impression that I had while working on this, and I summarized it in on sentence:

It’s leaky abstractions all the way down.

If you have not encountered this term before (but you should, I have already used it in this post), leaky abstraction is a term used to describe the failure of an abstraction to hide the low-level details it is abstracting. The so-called law of leaky abstractions says that all abtractions are leaky. But, in my opinion not all abstractions leak in the same way - some leak way more than others.

Emscripten is a great project that tries to abstract away all the web (JavaScript, WASM, web workers, local storage…) so that you can build and run your C / C++ code in a web browser. Frankly, this is mind-blowing, and I have mad respect for the Emscripten developers.

But as soon as the complexity of your codebase bumps up a notch, you immediately find out that the abstractions don’t hold anymore. If yor app is multithreaded, you have to learn what a web worker is. If you want to read some data from a file, welcome to the world of client-side storage. You need 64-bit memory support because you are processing more than 2GB of data? Sure, but first make sure that your users are not using Safari.

But I am not complaining about this. A browser is a very different beast from a bare-metal operating system, and it is to be expected that you have to know something about the system you are deploying to. I am happy that I could learn about all of this, and I believe this knowledge is going to give me an extra edge whenever I’ll work on the web again.