In a previous post, I made an implicit assumption that calling into rust via python-specific wrappers would be more efficient than making remote procedure calls over HTTP. While that seems like a decent assumption, I wanted to verify it with a quick benchmark!

As usual, feel free to jump to the source.

Define the Interface

To make sure we’re comparing apples to apples, we want our two rpc functions to share the same interface as the rust/pyO3 wrappers, which look like this:

def setup():
    container = set()
    _gol.setup(container)
    return container

def step(board, ntimes=1):
    _gol.step(board, ntimes)

Thus setup should allocate and fill a new set, while step modifies a set in-place. Using HTTP and JSON, our rpc-wrappers look as follows:

import json
import requests

RPC_URL = 'http://localhost:3000'

def rpc_setup():
    resp = requests.get(RPC_URL + '/setup/')
    return set(tuple(x) for x in json.loads(resp.content.decode()))

def rpc_step(board, ntimes=1):
    data = json.dumps((ntimes, list(board)))
    resp = requests.post(RPC_URL + '/step/', data=data)
    board.clear()
    board.update(set(tuple(x) for x in json.loads(resp.content.decode())))

Now we’re ready to switch over to rust!

Hyper

In picking between a web framework (eg rocket) and an http server (eg hyper), I opted for the latter: mostly because I thought a lower-level choice would be more fun/educational, and partly because there are simply too many framework choices.

Though I’d love to benchmark the differences between rocket & hyper too (maybe a future post!).

A few imports to get us started:

extern crate game;

extern crate futures;
extern crate hyper;
extern crate serde;
extern crate serde_json;

use futures::{Stream, future::Future};
use hyper::server::{const_service, service_fn, Http, Request, Response};

In my experience, most hyper examples create servers by declaring a zero-sized struct and implementing the Service trait for it. Eg

struct RPC;
impl Service for RPC {
    // the trait makes us spell out all of these associated types
    type Request = Request;
    type Response = Response;
    type Error = hyper::Error;
    type Future = Box<Future<Item = Response, Error = hyper::Error>>;

    fn call(&self, req: Request) -> Self::Future {
        ... // service implementation
    }
}

However, I find that pattern slightly unwieldy, and prefer to use a service function instead.

When using a service function, running the main server is pretty simple:

fn main() {
    let addr = "127.0.0.1:3000".parse().unwrap();
    let service = const_service(service_fn(rpc_service));
    let server = Http::new().bind(&addr, service).unwrap();
    server.run().unwrap();
}

Where rpc_service is our function that eventually transforms requests to responses.

fn rpc_service(req: Request) -> RpcFuture {
    let responder = router(&req);

    Box::new(req.body().concat2().map(responder))
}

So why the RpcFuture instead of Response? Because hyper uses async non-blocking I/O.

Then what’s the type of RpcFuture? Hyper lets us customize this (which is why I’m using a type alias), but basically we want a container holding a future for an eventual result. In rust, that’s a boxed future, or

type RpcFuture = Box<Future<Item = Response, Error = hyper::Error>>;

Since this is async, the request and response bodies are streams of chunks (a futures::Stream with Item = hyper::Chunk). Because I’d rather think about my RPCs as functions from requests to responses, I can use some more aliases,

type Data = hyper::Chunk;
type Responder = fn(Data) -> Response;

and hopefully it is now clear that req.body().concat2().map(...) is what buffers the streaming request body into one object that we can process all at once (which really helps when parsing JSON from it).
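
Spelling that composition out, here’s rpc_service again with the intermediate types noted in comments (a re-statement of the code above, not anything new):

fn rpc_service(req: Request) -> RpcFuture {
    let responder: Responder = router(&req); // pick a handler before req is consumed
    let body = req.body();                   // a Stream of hyper::Chunk
    let buffered = body.concat2();           // a Future resolving to one hyper::Chunk
    let response = buffered.map(responder);  // a Future resolving to a Response
    Box::new(response)
}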

Those aliases also let us define a simple matching router:

fn router(req: &Request) -> Responder {
    match (req.method(), req.path()) {
        (&hyper::Method::Get, "/setup/") => setup,
        (&hyper::Method::Post, "/step/") => step,
        _ => not_found,
    }
}

This may feel like a lot of type aliases for relatively few lines of server code, but it helps me break the computation into manageable pieces.

In particular, I’ll get an async service while still being able to write my responders in a synchronous manner:

fn step(data: Data) -> Response {
    let (ntimes, mut board): (usize, game::Board) = serde_json::from_slice(&data).unwrap();
    for _ in 0..ntimes {
        board = game::next_generation(&board);
    }
    let resp = serde_json::to_string(&board).unwrap();

    Response::new().with_body(resp)
}

Nice! Notice that the real magic here is in Serde, which can transform the raw content bytes into arbitrary native rust types and back. Hoorah! Really great stuff.
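
For instance, here’s a minimal round-trip sketch with a hypothetical Cell type (not part of the game crate), where the derive macros do all the work:

extern crate serde;
#[macro_use]
extern crate serde_derive;
extern crate serde_json;

// a hypothetical point type, just to demonstrate the derive round-trip
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Cell {
    x: i64,
    y: i64,
}

fn main() {
    let cell = Cell { x: 3, y: -1 };

    // native rust type -> JSON string
    let json = serde_json::to_string(&cell).unwrap();
    assert_eq!(json, r#"{"x":3,"y":-1}"#);

    // JSON bytes -> native rust type, just like in step above
    let parsed: Cell = serde_json::from_slice(json.as_bytes()).unwrap();
    assert_eq!(parsed, cell);
}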

Compare Serde’s auto-serialization to the cruft that crops up in other libraries/languages, eg python’s:

TypeError: {1, 2} is not JSON serializable
TypeError: datetime.datetime(2018, 5, 5, 10, 30, 11, 497078) is not JSON serializable

...etc

The other two responders happen to be essentially static responses, but I’ll keep the Data param so that router retains its clean monomorphic interface.

fn not_found(_: Data) -> Response {
    Response::new().with_status(hyper::StatusCode::NotFound)
}

fn setup(_: Data) -> Response {
    let acorn = serde_json::to_string(&game::setup()).unwrap();

    Response::new().with_body(acorn)
}

And that’s it! Now just `cargo run --release` and we’re serving game-of-life RPCs on localhost!
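
As a quick smoke test, something like this hypothetical client snippet (using hyper 0.11’s Client plus tokio-core; not part of the post’s source) should print the acorn board:

extern crate futures;
extern crate hyper;
extern crate tokio_core;

use futures::{Future, Stream};

fn main() {
    let mut core = tokio_core::reactor::Core::new().unwrap();
    let client = hyper::Client::new(&core.handle());

    // GET /setup/ and print the JSON-encoded board
    let uri = "http://localhost:3000/setup/".parse().unwrap();
    let work = client.get(uri).and_then(|resp| {
        resp.body().concat2().map(|chunk| {
            println!("{}", String::from_utf8_lossy(&chunk));
        })
    });
    core.run(work).unwrap();
}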

Benchmarking

Before running benchmarks I like to make some hypotheses. Not because it feels good to predict correctly (which it does), but because I learn a lot whenever results are wildly unexpected! And it’s only easy to see how unexpected a result is if you actually went ahead and wrote your hypotheses down beforehand. Otherwise it’s easy to fudge the accounting in your head (“umm, oh yeah… I’m sure I expected that to happen…”).

So, hypotheses:

  • pyO3 bindings should be more efficient than RPC for all workloads,
  • but those efficiencies will diminish for more computationally intensive tasks.

I’m not sure how serde’s Deserialize trait compares to pyO3’s FromPyObject, nor how easy that would be to isolate in a test…
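
For the serde half at least, a rough isolation sketch might look like the following (hypothetical, using std::time::Instant rather than a proper bench harness):

extern crate serde_json;

use std::time::Instant;

fn main() {
    // a board-like payload: a JSON list of [x, y] pairs
    let payload = serde_json::to_string(&vec![(0i64, 0i64), (1, 2), (3, 4)]).unwrap();

    // time just the deserialization step, in isolation
    let start = Instant::now();
    for _ in 0..1_000_000 {
        let _board: Vec<(i64, i64)> = serde_json::from_str(&payload).unwrap();
    }
    println!("deserialize x 1,000,000: {:?}", start.elapsed());
}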

Since we’ve already defined both sets of high-level python wrappers, we can compare the end-to-end time of getting a python result from a python argument:

import timeit

def bench(gc=False):
    print('SETUP')
    output(rpc_setup, gol_py.setup, 1000, gc=gc)

    a = rpc_setup()
    b = gol_py.setup()

    for i in range(4):
        time_step(a, b, 10**i, gc=gc)

    assert (a == b)

def time_step(a, b, n, gc=False):
    loops = 10000 // n
    print('\nSTEP: {} loops of step_{}'.format(loops, n))
    output(
        lambda: rpc_step(a, ntimes=n),
        lambda: gol_py.step(b, ntimes=n),
        loops,
        gc=gc,
    )

def output(f1, f2, loops, gc=False):
    # timeit disables gc by default; re-enable it via the setup string if asked
    gc = 'gc.enable()' if gc else ''
    print('rpc:  ', round(timeit.timeit(f1, gc, number=loops), 4))
    print('pyo3: ', round(timeit.timeit(f2, gc, number=loops), 4))

Note: since step modifies the board in-place, the results are very similar whether the garbage collector is enabled or disabled.

Here’s a sample run:

SETUP
rpc:   2.8057
pyo3:  0.0022

STEP: 10000 loops of step_1
rpc:   48.712
pyo3:  7.9649

STEP: 1000 loops of step_10
rpc:   11.7823
pyo3:  6.8943

STEP: 100 loops of step_100
rpc:   7.2779
pyo3:  6.6566

STEP: 10 loops of step_1000
rpc:   7.1569
pyo3:  6.507

How does this stack up? Hypotheses #1 and #2 both check out: pyO3 wins in every workload, and its edge shrinks as each call does more work. Which is nice justification for bothering to tinker with pyO3 in the first place ;)

I’d still love to benchmark hyper against rocket, but that’ll have to wait…