I love the book Analyzing Computer System Performance with Perl::PDQ. It’s the clearest applied Queuing Theory book I’ve found.

The mostly green filled cover of Analyzing Computer System Performance with Perl::PDQ.

The book in question

IMO I think the Universal Scalability Law should be called Gunther’s law, given how much the author has done to popularize it. (Plus it sounds less grandiose than the Universal Scalability Law, and it would be clearer that it’s on the same level as Amdahl’s Law).

Sadly, I hate to say it, but the book sounds outdated with Perl in the title. I wish it wasn’t so but it sounds Perl/PDQ specific, when it’s a great applied queuing theory book all around.

I’m going to be re-reading it over the winter break, and one of my goals is to translate the examples to Javascript with Observable notebooks. I think Queuing Theory overall needs some heavy marketing/outreach/explanation, and the combined accessiblity and visualization excellence of Observable notebooks would help a lot. Plus the book would sound so much more enticing if it had a version called Analyzing Computer System Performance with PDQ.js ;)

So I looked into porting or rewriting PDQ to Javascript. Turns out it’s written as a C library (with separate bindings for R and python2), so it should be compilable to WASM with emscripten!

Emscripten

While “bringing performance to the browser” nowadays often brings up WASM+Rust, Emscripten and asm.js are the grandparents of this movement, and still used all the time (likely more than the Rust->Wasm backend). Emscripten compiles C/C++ code to WASM and provides several shims to run existing code in the WASM environment.

It turns out Emscripten is really easy to use! It wrap/replaces all the common C building commands:

  • emconfigure
  • emmake
  • emcc

It also provides shims for things like stdio.h, malloc/free and a mock file system that lives in WASM memory. For example, you can run a regular Hello World program in the browser:

#include <stdio.h>

void main() {
	printf("Hello World\n");
}

And the printf will be redirected to console.log() output. If you compile with emcc hello.c -o hello.html it will build you an html file that runs the code and displays its output directly in the browser, which is nice!

Using Emscripten with PDQ

PDQ (Pretty Damn Quick) is a small (3KLOC) library for computing properties of queuing theory models.

I was able to generate both MacOS (Mach-O) and WASM binaries the following example program:

#include "PDQ_Lib.h"

double requests = 400;
double threads = 300;
double service_time =  0.444;

int main() {
  PDQ_Init("AWS Tomcat Model");
  PDQ_CreateClosed("Requests", BATCH, requests, 0.0);
  PDQ_CreateMultiNode(threads, "Threads", MSC, FCFS);
  PDQ_SetDemand("Threads", "Requests", service_time);
  PDQ_SetWUnit("Reqs");
  PDQ_Solve(EXACT);
  PDQ_Report();
  return 0;
}

The emcc command I used was (starting from pdq-qnm-pkg/lib/ in the PDQ repo):

emcc PDQ_*.c MVA*.c test.c -o test.html -s INITIAL_MEMORY=94240768 -s ALLOW_MEMORY_GROWTH=1

It was really nice to be able to compile first to C, run the program and then run it in the browser with the -o test.html flag on emcc. This is the output:

               ==========================================
               ********     PDQ Model OUTPUTS    ********
               ==========================================

Solution method: EXACT

               ********   SYSTEM Performance     ********

Metric                   Value      Unit
------                  -------     ----
Workload: "Requests"
Mean concurrency        400.0000    Reqs
Mean throughput         675.6757    Reqs/Sec
Response time             0.5920    Sec
Stretch factor            1.3333

Bounds Analysis:
Max throughput          675.6757    Reqs/Sec
Min response              0.4440    Sec
Max demand                0.0015    Sec
Total demand              0.4440    Sec
Optimal jobs            300.0000    Jobs

               ********   RESOURCE Performance   ********

Metric          Resource     Work               Value    Unit
------          --------     ----              -------   ----
Capacity        Threads      Requests              300   Servers
Throughput      Threads      Requests         675.6757   Reqs/Sec
In service      Threads      Requests         300.0000   Reqs
Utilization     Threads      Requests         100.0000   Percent
Queue length    Threads      Requests         400.0000   Reqs
Waiting line    Threads      Requests         100.0000   Reqs
Waiting time    Threads      Requests           0.1480   Sec
Residence time  Threads      Requests           0.5920   Sec

However the WASM build was giving different answers than the macOS build! Half the report was NaNs. I dug into it and the execution was behaving so weirdly: I found an array involved in the subcomputation (sm_x) was having its values set to 0 in the middle of a loop…

Here is the code in question, just to show that sm_x was never being written to :

	// Call the submodel defined below to calculate sm_x[] array
    MServers(N, m, D);


	// Solve the composite model
    for (n = 1; n <= N; n++) {
    
        R = 0.0;  // reset

        // response time at the FESC
        for (j = 1; j <= n; j++) {                     ///sm_x  has different values over time here
            R += (j / sm_x[j]) * pq[j - 1][n - 1];
        }

        // mean thruput and mean qlength at the FESC
        X = n / (Z + R);
        Q = X * R;
        
        // compute qlength dsn at the FESC
        for (j = 1; j <= n; j++) {
            pq[j][n] = (X / sm_x[j]) * pq[j - 1][n - 1];
        }

        // include j = 0 row
        pq[0][n] = 1.0;
        for (j = 1; j <= n; j++) {
            pq[0][n] -= pq[j][n];
        }
        
    } // loop over n

The subroutine where the computations diverged. MServers() sets the sm_x array correctly, but the second inner loop sometimes reads 0 values out of it.

I read everything I could find on Emscripten memory corruption, and the only hunch I had was some kind of alignment issue. I found this cool blog post from the Figma team from 2016 debugging an alignment issue deep in their codebase, but my issue wasn’t the same.

I also tried -s SAFE_HEAP=1 but that didn’t change the behavior or call out any errors. My array was only being read inside this simple numerical loop. It was never written after being computed, just read a number of times in the loop.

I was able to get the program above to run after an afternoon of throwing pasta at the wall, by compiling it with both [UBSan] (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html) (undefined behavior sanitizer) and ASan (address sanitizer). These are extra checks that are compiled into the program to check for either undefined behavior or memory issues (like reading a null pointer). This extra instrumetation makes your program slower but it catches important errors. I used this command:

emcc PDQ_*.c MVA*.c test.c -o test.html -s TOTAL_MEMORY=358744064 -s ALLOW_MEMORY_GROWTH=1 -fsanitize=undefined -fsanitize=address

Woof Asan makes my WASM program require 384MB of memory! IMO that’s a lot for a browser program.

Since ASan got the program to work I suspected some kind of memory corruption.

I then found the Emscripten ASSERTIONS checks, and by compiling with -s ASSERTIONS=2 -s STACK_OVERFLOW_CHECK=2 I was able to get an exception thrown that told me a stack overflow was happening in the pre-computation step for that array.

I went looking for stack allocations and found that before the pre-computation we initialized a huge array on the stack:

//A little bit above the corrupted loop, in the same function. MAX_USERS = 1200
float pq[MAX_USERS + 1][MAX_USERS+1];

I count 4 * 1201 * 1201 = 5,769,604 bytes, and the Emscripten default stack size is 5MB. That’ll overflow the stack!

A note on stack overflows: I had never considered how big stacks, are, but they’re usually not that big! The default stack size on Mac OS X is 8MB (you can check with ulimit -a). If you’re doing any kind of numerical math storing 1 million+ numbers like above, you really need to do it on the heap!

Also stack overflows have a theme song, to the tune of Under the Sea:

I turned up the WASM stack size to 10MB and was able to run the program without any errors or needing sanitizers! The final command I used was

emcc PDQ_*.c MVA*.c test.c -o test.html -s ASSERTIONS=2 -s STACK_OVERFLOW_CHECK=2 -s TOTAL_STACK=10485760 -s INITIAL_MEMORY=94240768 -s ALLOW_MEMORY_GROWTH=1

Now using “only” ~94MB of memory and 10MB of stack. Wow!

I’m not really sure why ASan was initially protecting the program but not giving me an error. My guess is that since it required so much memory (384MB) it allowed the stack overflow to happen without touching anything important on the heap and getting the sm_x array overwritten. But I have no clue! I compiled the same program with twice the array size (so it would take 11MB and be sure to overflow) on macOS and Asan threw an error during execution. So I don’t know why it wasn’t doing it when compiled to WASM. It looks like the Emscripten flag I needed was really -s STACK_OVERFLOW_CHECK=2.

I searched the codebase for other 2D arrays (by grepping for \]\[.*\]) and didn’t find any others allocated on the stack (ie by a function), so this 10MB limit should be fine.

Conclusion

Emscripten is not magic fairy dust you can sprinkle on C libraries (even small ones) to run them in the browser! Especially if the work is numerical.

Porting programs to different environments is not easy. I thought it would be quick because PDQ is small and not very platform dependent, but there are so many variables at play in the execution of a program (including the stack size) you can’t assume that. I recommend starting the compilations with as many sanitizers and ASSERTIONS as possible, to find places where things are breaking and assumptions your program makes about its environment.

C programming is not my day job, so I try to stay positive when I run against some low-level issues. I’m getting better at it! At the Recurse Center I learned about linkers and how they relocate symbols, how to investigate object files, and the difference between shared objects and archive files. I’m much less intimidated by binary programs and their memory layouts. Here I got to debug a real stack overflow and see its downstream effects, I really internalized the difference between allocating on the stack and the heap. And I submitted a tiny patch to make Emscripten easier to use.

I’m going to continue to port to JS pieces of PDQ as I re-read Peformance System Analysis with Perl:PDQ, with a goal of making an npm package we can use.


comments powered by Disqus