Wat is up with WebAssembly

April 8, 2019

I’ve always been bullish on WebAssembly, and I still am. I think it has a lot of potential to change the way we develop web applications over the long term, and indeed to change what we consider a web application to begin with.

Wasm is really quite simple, in its way. The specification defines only four numerical types and a handful of operations upon them, plus standards for importing and exporting interfaces and shared memory buffers from and to the surrounding context, whether that is a browser or node, and a few other things. Of course, simple does not mean easy, and in this case, actually using it for anything substantial requires quite a bit of knowledge and context on top of the specification: toolchains, conventions, standard library polyfills for the executing context, that sort of thing.

That is not to say that there is no value in understanding the spec as it exists! Quite to the contrary, I think there is great value in that. Generally, this is how I prefer to try to learn: start from the simplest, most atomic concepts, get a hook in, get some purchase, and then build up from there. So let’s do that. This post starts from almost nothing and builds up to slightly more than nothing.

In my various halfhearted attempts to understand wasm basics, I haven’t had much luck. I get as far as setting up the emscripten toolchain, and compiling a hello world.

But alas, even the simplest thing comes out the other side of emscripten with a dense glob of glue code all over and around it. The javascript calling context alone is several thousand lines! I find this to be overwhelming. What are all these autogenerated functions doing? Why are they necessary? Why can I not just compile a simple function to wasm and see it in the binary? Emscripten is a very powerful tool, but it was originally designed around making it possible to compile existing projects (specifically c/c++ projects) to asm.js and now wasm, and as such it does a lot of work behind the scenes to facilitate that.

I’m interested in starting with a more stripped down and basic view of WebAssembly. How can I understand the simplest things about it? How does it run? Where does it run? How can I interoperate with the calling context whether it’s in a browser or a node process?

https://webassembly.org/ has everything I need to get started with this. It links to several of the full unabridged specifications, but also has some terser but still pretty comprehensive documentation on the binary format here. It is the latter off of which I will be working right now. That document has not been updated since I started this post, and it now points to the more comprehensive documentation instead, but honestly I found it to be a lot easier to reason about, and it’s not deprecated as of 1.0, so I will refer to it here for now. The full spec is what it says on the tin, and as such is much denser and more detailed than this simple, well written overview document.

The smallest thing

Wasm is a binary format. This means that if I download a tiny addTwo example from somewhere and cat it, I will get something like this:

 asm       `
  addTwo
	       j    name 	   addTwo

Which looks like a bunch of nothing. Opening the file in a text editor directly is only slightly more illuminating: my vim replaces the non-printable characters with ‘control character’ cyphers meant to represent them in place.

^@asm^A^@^@^@^A^G^A ^B^?^?^A^?^C^B^A^@^G
^A^FaddTwo^@^@
        ^A^G^@ ^@ ^Aj^K^@^Y^Dname^A     ^A^@^FaddTwo^B^G^A^@^B^@^@^A^@

Of course, these aren’t control characters that are controlling anything, it’s just binary data. This is not a good way to look at this file. I would like to be able to look at the bytes directly, yes, but represented in some human readable format. There is of course a tool for this. *nix systems will have xxd installed, a tool for working with hexdumps.

xxd test.wasm will yield:

00000000: 0061 736d 0100 0000 0107 0160 027f 7f01  .asm.......`....
00000010: 7f03 0201 0007 0a01 0661 6464 5477 6f00  .........addTwo.
00000020: 000a 0901 0700 2000 2001 6a0b 0019 046e  ...... . .j....n
00000030: 616d 6501 0901 0006 6164 6454 776f 0207  ame.....addTwo..
00000040: 0100 0200 0001 00                        .......

I’m looking at the wasm code now, those are WebAssembly instructions that are laid out in the specification. How can I validate that this is really a WebAssembly module? I can actually try to use it.

Since late 2017, all the major browsers have implemented WebAssembly but I’d like to avoid the complications of running these small examples in a browser just yet, so for now, I will use Node locally for this. Node is built on Google’s V8 engine, which is also used in Chrome, which has WebAssembly support, so naturally, Node does too, since around version 7 I believe.

Let’s say that I’ve downloaded that addTwo wasm example from earlier and saved it in a file called addTwo.wasm. First, I’ll read that local wasm file into a variable in a node script, using the blocking synchronous readFileSync for simplicity’s sake:

const fs = require('fs');
const buf = fs.readFileSync('./addTwo.wasm');

This returns a Buffer containing the bytes read from the file. I can use the globally available WebAssembly object to validate this buffer against the specification.

const isValid = WebAssembly.validate(buf);
console.log(isValid);
true

As expected, it’s a valid wasm file.

This works in this case because the node Buffer type is a subclass of Uint8Array, which is itself a view over an ArrayBuffer, so it already implements the correct interface. Many tutorials will nonetheless ask you to copy the buffer into a plain typed array, which looks like this:

const typedBuffer = new Uint8Array(buf);

That typed array can then be passed to validate, compile or instantiate. It’s probably safer that way. ¯_(ツ)_/¯
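Here is a small sketch of what that copy does and does not change. It assumes a current Node version, and the eight bytes are only a stand-in for the contents of a real file.

```javascript
// Stand-in for fs.readFileSync('./addTwo.wasm'): just the 8-byte wasm preamble.
const buf = Buffer.from([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

// A Buffer is already a Uint8Array, which is why validate() accepts it as is.
console.log(buf instanceof Uint8Array);

// new Uint8Array(typedArray) copies the bytes into a fresh ArrayBuffer...
const typedBuffer = new Uint8Array(buf);
console.log(typedBuffer.buffer !== buf.buffer);

// ...so a later change to the original does not affect the copy.
buf[0] = 0x01;
console.log(WebAssembly.validate(buf), WebAssembly.validate(typedBuffer));
```

So the copy buys isolation from later writes to the original Buffer, at the cost of duplicating the bytes.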

We take the buffer and compile it:

const compiledModule = WebAssembly.compile(buf); // returns a Promise
compiledModule.then(console.log);

And for our trouble we receive a compiled Module.

Module [WebAssembly.Module] {}

One more step! This compiled module can be instantiated using WebAssembly.Instance as a constructor.

WebAssembly.compile(buf).then(compiledModule => {
  console.log(new WebAssembly.Instance(compiledModule));
});
Instance [WebAssembly.Instance] {}

This Instance has an exports property which contains all of the exports of the module. You may not be surprised then to find an addTwo function on this one!

WebAssembly.compile(buf).then(compiledModule => {
  const wasmInstance = new WebAssembly.Instance(compiledModule);
  console.log(wasmInstance.exports); // { addTwo: [Function: 0] }
  console.log(wasmInstance.exports.addTwo(40, 2)); // 42
});

We’re calling a wasm function!

Right now this takes a few steps…

  1. fetch the wasm
  2. cast the resulting Buffer to a Uint8Array typed buffer (this seems optional but is likely good practice)
  3. compile the buffer with WebAssembly.compile
  4. instantiate the Instance with new WebAssembly.Instance

The last two steps can be combined by passing the source buffer to WebAssembly.instantiate instead. This function returns a Promise that resolves to an object with a module and a pre constructed initial instance on it.

Further, in a browser context the entire series of steps can be combined into a single call: WebAssembly.instantiateStreaming. Node doesn’t support this (yet, anyway) for good reasons.
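A minimal sketch of the combined call in Node. To keep it self-contained, the bytes are decoded from a hex string copied from the hexdump above rather than read from disk.

```javascript
// The 71 bytes of the addTwo example, copied from the hexdump above.
const hex =
  '0061736d0100000001070160027f7f017f03020100070a01066164645477' +
  '6f00000a09010700200020016a0b0019046e616d65010901000661646454' +
  '776f020701000200000100';
const bytes = Buffer.from(hex, 'hex');

// instantiate() compiles and instantiates in one step. The promise
// resolves to an object carrying both the module and the instance.
WebAssembly.instantiate(bytes).then(({ module, instance }) => {
  console.log(module instanceof WebAssembly.Module);
  console.log(instance.exports.addTwo(40, 2));
});
```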

Let’s go back to that hexdump, shall we?

xxd test.wasm > test.hex && vim test.hex

00000000: 0061 736d 0100 0000 0107 0160 027f 7f01  .asm.......`....
00000010: 7f03 0201 0007 0a01 0661 6464 5477 6f00  .........addTwo.
00000020: 000a 0901 0700 2000 2001 6a0b 0019 046e  ...... . .j....n
00000030: 616d 6501 0901 0006 6164 6454 776f 0207  ame.....addTwo..
00000040: 0100 0200 0001 00                        .......

What if I… change a tiny number by one? Say, incrementing that first byte…

00000000: 0161 736d 0100 0000 0107 0160 027f 7f01  .asm.......`....
00000010: 7f03 0201 0007 0a01 0661 6464 5477 6f00  .........addTwo.
00000020: 000a 0901 0700 2000 2001 6a0b 0019 046e  ...... . .j....n
00000030: 616d 6501 0901 0006 6164 6454 776f 0207  ame.....addTwo..
00000040: 0100 0200 0001 00                        .......

I can use xxd in the other direction, too, I can turn this edited test.hex file back into a test.wasm binary file with the -r flag.

xxd -r test.hex > test.wasm

Now, back in my node script, the source is no longer valid, and attempting to compile the source yields a verbose error:

const fs = require('fs');
const buf = fs.readFileSync('./test.wasm');
console.log(WebAssembly.validate(buf)); // false
WebAssembly.instantiate(buf);
(node:10392) UnhandledPromiseRejectionWarning: CompileError: AsyncCompile: Wasm decoding failed: expected magic word 00 61 73 6d, found 01 61 73 6d @+0

Well, that makes sense.
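The warning appears because nothing handles the rejected promise. A sketch of catching it, using just the corrupted preamble as input instead of the whole file:

```javascript
// The wasm preamble with its first byte changed from 00 to 01.
const broken = new Uint8Array([0x01, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

// validate() reports bad bytes by returning false rather than throwing.
console.log(WebAssembly.validate(broken));

// Attaching a catch handler keeps the rejection from going unhandled.
WebAssembly.instantiate(broken).catch((err) => {
  console.log(err instanceof WebAssembly.CompileError);
  console.log(err.message);
});
```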

The smallest thing

Forget addTwo. What’s the absolute smallest valid WebAssembly binary?

First of all, let’s make it easier to edit these hexdumps. xxd -p test.wasm > test.hex will remove the extra information and leave us with only the “bytes.”

0061736d0100000001070160027f7f017f03020100070a01066164645477
6f00000a09010700200020016a0b0019046e616d65010901000661646454
776f020701000200000100

I’m going to remove everything except for the first 8 bytes.

0061736d01000000

This is the smallest possible valid wasm binary, consisting of the magic number \x00asm and the WebAssembly version number (version 1) in little endian format.

xxd -r -p test.hex > test.wasm

and now:

const fs = require('fs');
const buf = fs.readFileSync('./test.wasm');
console.log(WebAssembly.validate(buf))
true
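For comparison, here is a sketch that builds the same eight bytes directly in Node, without going through xxd, and reads the version field back as a little endian integer.

```javascript
// Magic number: a NUL byte followed by the ASCII characters "asm".
const magic = [0x00, 0x61, 0x73, 0x6d];
// Version 1 as a 32-bit little-endian integer: least significant byte first.
const version = [0x01, 0x00, 0x00, 0x00];

const smallest = new Uint8Array([...magic, ...version]);

// Read the version back out; the `true` argument asks for little-endian.
const view = new DataView(smallest.buffer);
console.log(view.getUint32(4, true));

console.log(WebAssembly.validate(smallest));

// An empty module still instantiates; it just has nothing to export.
const instance = new WebAssembly.Instance(new WebAssembly.Module(smallest));
console.log(Object.keys(instance.exports).length);
```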

I am editing these bytes about as directly as I could be, but I am still seeing textual symbols in the text editor to represent them. Obviously, I must do it this way, but it’s interesting to note that here, xxd is acting as an extremely bare bones compiler, taking source code (the hexdump text) and turning it into machine code (by simply translating the hexadecimal bytes into their actual numerical values and writing that to disk). It’s not doing much of anything we usually think of as “compilation,” no analysis or optimization or transformations or any of those things, but nevertheless, it is compiling: translating code from one form to another.

You may be tempted to call this a “transpiler” instead, which I would like to discourage.

Here, I will add comments. The sed command will filter out everything after a # to the end of that line, and then I pass the result to the xxd command from before:

0061736d  # magic number
01000000  # wasm version number

sed 's/\#.*$//' test.hex | xxd -r -p > test.wasm

Maybe this is the world’s simplest compiler.
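The same pipeline can be sketched in a few lines of Node. The function name assembleHex is made up for this example; it strips comments the way the sed expression does and then decodes the remaining hex digits.

```javascript
// Hypothetical helper: turn annotated hex text into bytes.
function assembleHex(source) {
  const hex = source
    .split('\n')
    .map((line) => line.replace(/#.*$/, '')) // drop comments, like the sed step
    .join('')
    .replace(/\s+/g, ''); // ignore whitespace and line breaks
  if (hex.length % 2 !== 0 || /[^0-9a-fA-F]/.test(hex)) {
    throw new Error('expected an even number of hex digits');
  }
  // Decode each pair of hex digits into one byte.
  return Uint8Array.from(hex.match(/../g) || [], (pair) => parseInt(pair, 16));
}

const source = `
0061736d  # magic number
01000000  # wasm version number
`;
const bytes = assembleHex(source);
console.log(bytes.length, WebAssembly.validate(bytes));
```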

Hello single static global variable world

Now it’s time to refer to the binary specification overview. I won’t rehash everything on that page, the only thing more thorough is the official spec itself. Following along from “High-level structure”, we can see that after the magic number / version number preamble, a well formed module will have a sequence of sections.

“The module preamble is followed by a sequence of sections. Each section is identified by a 1-byte section code that encodes either a known section or a custom section. The section length and payload data then follow.”

There are 11 standard sections, and a mechanism for creating arbitrarily named sections.

“Each known section is optional and may appear at most once. Custom sections all have the same id (0), and can be named non-uniquely (all bytes composing their names may be identical). Custom sections are intended to be used for debugging information, future evolution, or third party extensions. For MVP, we use a specific custom section (the Name Section) for debugging information.”
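To see that structure in the addTwo example from earlier, here is a rough sketch that walks the bytes and lists each section's id and payload length. The listSections function and the name table are illustrative, and the sketch assumes every length fits in a single byte, which holds for this file but not in general.

```javascript
// The addTwo example from the hexdump earlier in the post.
const bytes = Buffer.from(
  '0061736d0100000001070160027f7f017f03020100070a01066164645477' +
  '6f00000a09010700200020016a0b0019046e616d65010901000661646454' +
  '776f020701000200000100',
  'hex'
);

// Section ids from the binary format documentation.
const names = ['custom', 'type', 'import', 'function', 'table', 'memory',
  'global', 'export', 'start', 'element', 'code', 'data'];

// Hypothetical helper: list { id, name, length } for every section.
function listSections(bytes) {
  const sections = [];
  let offset = 8; // skip the magic number and version
  while (offset < bytes.length) {
    const id = bytes[offset];
    const length = bytes[offset + 1];
    // Assumption: single-byte lengths only (values below 128).
    if (length >= 0x80) throw new Error('multi-byte length not handled in this sketch');
    sections.push({ id, name: names[id], length });
    offset += 2 + length; // id byte + length byte + payload
  }
  return sections;
}

console.log(listSections(bytes));
```

For this file the walk finds five sections: type, function, export, code, and one custom section at the end, which is the name section.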

I want to start by picking a simple section and implementing it in bytecode. The global section is a good candidate for this, because it has no dependency on any other sections and is fairly straightforward in its utility. This section is used to declare globally available variables within the module. Let’s take another look at that hexfile! I’ll need to add the id and the payload length for the global section.

Per the spec, these two are typed as varuint7 and varuint32, respectively, which you can read as “variable unsigned integer n” where n is the maximum number of bits encoded. A varuint7 is always exactly one byte long, while a varuint32 can take up to 5 bytes (32 bits at 7 bits per byte) but doesn’t have to. They are LEB128 (Little Endian Base 128) types.
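A sketch of the unsigned LEB128 encoding may make this more concrete. The function name encodeVaruint32 is made up for this example. Each output byte carries 7 bits of the value, lowest bits first, and the high bit of a byte is set when more bytes follow.

```javascript
// Hypothetical helper: encode an unsigned 32-bit integer as LEB128 bytes.
function encodeVaruint32(value) {
  let n = value >>> 0; // force into the unsigned 32-bit range
  const out = [];
  do {
    let byte = n & 0x7f; // take the low 7 bits
    n >>>= 7;
    if (n !== 0) byte |= 0x80; // set the continuation bit if more follows
    out.push(byte);
  } while (n !== 0);
  return out;
}

const toHex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, '0')).join(' ');

console.log(toHex(encodeVaruint32(6)));          // 06
console.log(toHex(encodeVaruint32(128)));        // 80 01
console.log(toHex(encodeVaruint32(0xffffffff))); // ff ff ff ff 0f
```

Small values such as the section lengths in this post encode to a single byte that looks just like the number itself, which is why they can be typed directly into the hex file.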

The id for the global section is 6, and the length is as yet unknown, so I will placehold it with 0

0061736d  # magic number
01000000  # wasm version number

06  # global section id
00  # global section payload length

I will update the length as I go along so that it matches the actual payload.

Now, what actually goes into the global section? It consists of a count of all the globals followed by a series of global declarations. I want to make just one!

0061736d  # magic number
01000000  # wasm version number

06  # global section id
00  # global section payload length
01  # globals count

Now, what actually goes into a global entry? A type declaration followed by a mutability flag set to 0 or 1 (for false or true respectively), and an initialization expression.

0061736d  # magic number
01000000  # wasm version number

06  # global section id
00  # global section payload length
01  # globals count
7f  # i32 type declaration
00  # mutability

Now, what actually goes into an initialization expression? In the MVP, you may only initialize a global variable with a constant or with the value of an imported global. I’ll go for the first option. This means we use the bytecode for i32.const, which is 41, followed by a literal value (here 2a, which is 42 in decimal), and ending with the opcode for end (0b) as a delimiter. All of these bytecodes are described in detail on the docs page; they can be a little tough to follow/untangle, but they are all there.

0061736d  # magic number
01000000  # wasm version number

06  # global section id
06  # global section payload length
01  # globals count
7f  # i32 type declaration
00  # mutability
41  # i32.const
2a  # i32 literal
0b  # end

Note that I’ve updated the payload length to be 06 now.

So, that’s it! This should be a valid wasm module. Is it?

const fs = require('fs');
const buf = fs.readFileSync('./test.wasm');
console.log(WebAssembly.validate(buf))
true

And yes it is.
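Keeping the payload length in sync by hand is easy to get wrong, so here is a sketch of letting code do the counting. The section helper is hypothetical, and it assumes payloads shorter than 128 bytes so that the length fits in one byte.

```javascript
// Hypothetical helper: prefix a payload with its section id and length.
// Assumption: payload is shorter than 128 bytes, so the length is one byte.
function section(id, payload) {
  if (payload.length >= 0x80) throw new Error('payload too long for this sketch');
  return [id, payload.length, ...payload];
}

const preamble = [0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00];

const globalSection = section(0x06, [
  0x01, // globals count
  0x7f, // i32 type declaration
  0x00, // mutability (immutable)
  0x41, // i32.const
  0x2a, // i32 literal (42)
  0x0b, // end
]);

const bytes = new Uint8Array([...preamble, ...globalSection]);

console.log(globalSection[1]); // 6, the computed payload length
console.log(WebAssembly.validate(bytes));
```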

How does this look different from the first example? Well, the eventual compiled instance will probably be slightly bigger (or perhaps, if the global is immutable, the compiler will reference some internal canonical value, I don’t know how the internals work). But from the perspective of the consumer of this module, nothing about the interface changes. “Global” means global to the module. There is one more step to expose the value, and that is by referencing it in the exports section.

0061736d  # magic number
01000000  # wasm version number

06  # global section id
06  # global section payload length
01  # globals count
7f  # i32 type declaration
00  # mutability
41  # i32.const
2a  # i32 literal
0b  # end

07  # export section id
05  # export section payload length
01  # export section count
01  # field string (name) length
78  # field string (name) - here 78 is the char code for the letter 'x'
03  # kind (global)
00  # index of the global being exported

Now, from the node script, I can see that global as an export named “x”

const fs = require('fs');
const buf = fs.readFileSync('./test.wasm');
WebAssembly.instantiate(buf).then(e => {
  console.log(e.instance.exports)
})
{ x: 42 }
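Newer Node releases may print this export differently, because engines that implement the later mutable globals addition to the spec hand back a WebAssembly.Global wrapper instead of a plain number. A sketch that reads the value either way, with the module bytes inlined so it runs on its own:

```javascript
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic number and version
  0x06, 0x06, 0x01, 0x7f, 0x00, 0x41, 0x2a, 0x0b, // global section
  0x07, 0x05, 0x01, 0x01, 0x78, 0x03, 0x00,       // export section
]);

// The synchronous constructors are used here only to keep the sketch short.
const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
const x = instance.exports.x;

// Depending on the engine version, x is either a plain number or a
// WebAssembly.Global object whose current value lives on .value.
const value = typeof x === 'number' ? x : x.value;
console.log(value);
```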

Wat?

As you can see, it is entirely possible to produce functioning wasm by hand, and I could continue the exercise in this way. There is some value in understanding how to read the bytes themselves, but this isn’t really a sustainable workflow for a lot of reasons. Most obviously, any substantial real program would be impenetrable.

Luckily, the WebAssembly spec includes a textual encoding using s-expressions. This format is much easier to both read and to write, and also the file suffix is wat, which is terrific. wat uses s-expressions, but it is not a lisp. Lisps are also based around s-expressions, but the things that make a lisp a lisp are how the s-expressions are stored and evaluated, not the format alone.

The WebAssembly binary toolkit includes among its many tools a couple of utilities to translate between wasm and wat in both directions. They are unceremoniously named wasm2wat and wat2wasm.

So what does my test file look like in wat?

$ wasm2wat test.wasm
(module
  (global (;0;) i32 (i32.const 42))
  (export "x" (global 0)))

This should look structurally familiar. It’s essentially a 1:1 match to the bytecode with a couple of small exceptions.

module defines a top level s-expression for the module, just like the preamble with the magic number and version number does.

You’ll also notice (;0;), which is a comment that wasm2wat has inserted to help with identifying the index of the global declaration. See that it is referred to in the export section as 0. Globals are always referred to by their index, assigned in the order they were declared, so you could see how this would become unwieldy with a lot of them. If you run wat2wasm on this output, you will get back exactly what was put in.

From this base, I can start exploring the other sections in the module spec and investigating how wasm could be and already is used in the real world. Some questions I would like to explore in future posts:

  • How does wasm interact with the calling context when passing complex values like javascript objects?
  • What really is the interop overhead when passing large amounts of data between js and wasm? When does taking on that overhead make sense for the greater application performance?
  • Are the current use cases, toolchains, and possibilities of wasm mature enough, and/or compelling enough, to drive its adoption in a meaningful way in the future?
  • I haven’t touched on this, but wasm as currently conceived is absolutely not intended to replace javascript. But could it potentially do that someday in the far future? Browser apis would have to be fully exposed and source language compilers and standard libraries would have to implement support for those apis on a per language basis, but it could be done. Is this something we want to work towards?

All this and more coming to you in my very next post in ::checks post history:: mid 2020.