Linkerd 2.11
is here and with it are some cool new updates. One I am particularly
excited about (full disclosure: I worked on it), is retries for HTTP requests
with bodies. Linkerd has supported HTTP retries
since version 2.2,
but until now, we would only retry requests without bodies. Retrying
requests with bodies is especially important for anyone using Linkerd with gRPC.
Since all gRPC requests are HTTP/2 POST
requests with bodies, this feature
enables retries to be configured for gRPC traffic.
Retrying a request with a body may sound simple (just send the body again, right?), but it’s not that straightforward. In order to send a body again, the entire body has to be buffered in memory until the original request completes. This means that proxies will have to use more memory to store those bodies, and buffering the body increases latency. We would like to be able to retry these requests while keeping the impact on latency and proxy memory usage as low as possible.
Additionally, some requests, like
client-streaming requests in HTTP/2
and
Transfer-Encoding: chunked
requests
in HTTP/1.1, can have long bodies that are sent in multiple pieces. If the
proxy were to buffer the entire request body before forwarding the request
to the server, it would have to wait for the body to complete —
potentially introducing significant latency. And, in many cases, the server
might expect to process those request bodies chunk-by-chunk. Imagine
uploading a multi-gigabyte video file or a client-streaming gRPC request
where the client pushes events to the server as they occur. Waiting to
buffer the entire body before forwarding it could break the behavior that
the server expects. Instead, we want to be able to forward each chunk as
it’s received while also buffering it in the proxy in case the request
needs to be retried.
To reduce the overhead of buffering body data, we also want to minimize
copying data from one buffer to another (i.e. memcpy
calls). This can be
quite time-consuming when there’s a lot of data and we want to avoid
unnecessary memory allocations (i.e. malloc
calls).
In addition, we want to make sure we’re correctly handling potential edge
cases. What if the server responds with an error before the body stream
ends? We might have to retry the request before we’ve received the whole
body from the client. Or, what if the client sends a longer body than the
request’s
Content-Length
header?
A standards-compliant client shouldn’t do this, but it might happen due to
a bug or (if the request came from an untrusted client outside the cluster)
an HTTP request smuggling attack.
If this occurs, we want to ensure that the proxy doesn’t use potentially
unbounded amounts of memory.
The bottom line is that there are a lot of potential challenges and edge
cases involved in buffering and re-sending request bodies. Fortunately,
we can use some excellent libraries from the Rust ecosystem to help solve
them. In particular, the bytes
and http-body
crates (both written
by emeritus Linkerd maintainers), were vital parts of our implementation.
Buffering
The bytes
crate provides
an implementation
of reference-counted byte buffers. This is very useful because it
allows us to clone chunks of body data without having to copy all the
bytes into a new array.
The Rust standard library’s growable array type,
Vec
,
is represented simply as a pointer to an array in memory, plus a length.
This representation is simple and lightweight. However, it means that if
we want to clone a Vec
of bytes representing an HTTP body chunk, we have
to do this by allocating a new array and copying all the bytes from the
existing array into it. This is quite time-consuming when there’s a large
amount of data.
To solve this problem the bytes crate provides the
Bytes
type, a
reference-counted byte buffer. Where a Vec
is a pointer to an array and
a length, a Bytes
is a pointer to an array, a length, and an atomic reference
count. This means that multiple owning references to the byte buffer can
exist at the same time, as it will only be deallocated when all those
references go away. Now, cloning buffers only requires incrementing a
reference count and copying a pointer. That’s much faster than copying
all the data — 100% malloc
and memcpy
free!
Hyper
, the Rust HTTP implementation used
by the proxy, can be configured to represent HTTP body data using the Bytes
type. This means that we can take a chunk of data received from the client,
clone it by incrementing its reference count, and send one clone to the server,
while retaining the other for a potential retry. But because both clones
of the buffer are just pointers to the same array in memory, we don’t
have to copy all of the bytes.
This solves half of the problem. But, recall that a body might be streaming: we might receive several chunks of data over time. How do we add new data to our buffer?
The bytes
crate’s
BytesMut::extend_from_slice
method appends the data from a slice of bytes to a mutable byte buffer.
But, because we are using the shared, reference-counted Bytes
type, we can’t
use this. Mutating the contents of the byte buffer while it might be
referenced elsewhere could result in a data race, so the shared Bytes
type
does not provide this API. We could copy the data from the Bytes
into a new
buffer that we can freely mutate, but this would mean allocating and copying
the actual array of bytes, which defeats the purpose of using reference-counted
buffers in the first place.
Instead, our solution was to implement a new type called
BufList
.
A BufList
is a Vec
of multiple Bytes
buffers, in the order that they were
received. Now we can append a new chunk of data to the buffered body by simply
cloning the Bytes
and appending it to the BufList
’s vector. By doing this,
we can avoid copying the bytes, and (most of the time) avoid allocating as
well. If the BufList
’s Vec
array is at capacity it may need to be resized to
append a new chunk, but because it consists only of a pointer to each byte buffer,
rather than all the bytes received as part of the request body, the array that
needs to be allocated and copied is quite small, reducing the overhead of
the allocation significantly.
The http-body
crate contains Rust traits providing interfaces that can be
implemented types that represent HTTP bodies and HTTP body data chunks.
Implementing these traits allows us to provide our BufList
type to hyper
as
body data when forwarding the request. Although the byte buffers are not
contiguous in memory, we can still send them on the network with a single
system call by using
vectored writes.
Retrying requests
Now that we have an efficient strategy for buffering body data, how do we
actually retry requests? First, we wrote a new type, called
ReplayBody
,
that implements http-body
’s
Body
trait.
A ReplayBody
exists in one of two states: it is either receiving the initial
request body from the client or playing back a buffered body for a retry. When
we are receiving the initial body from the client, we lazily append each chunk
to a BufList
while forwarding that data to the server. If the request fails
and we have to retry it, we switch to the “replay” state and play back the data
buffered in the BufList
. After replaying from the buffer, if the original
request body has not completed (i.e. the server returned an error before we
received the end of the body), we can switch back to the initial state, and
continue forwarding from the received body while buffering new data. This means
that from the client’s perspective, everything is fine — the retry is performed
completely transparently, even when the body has not yet completed.
Next, we need to determine whether a given request can be retried. The
proxy already has logic for determining whether a request is retryable based
on the
ServiceProfile
configuration
and retry budget. Previously, that logic would always determine requests that
have bodies to be non-retryable. All we do is modify this logic to allow
retrying requests with bodies. To avoid potentially unbounded buffering,
we set a maximum Content-Length
that the proxy will buffer for retries.
Requests with Content-Length
headers over 64 KB are never considered
retryable.
Additionally, as a safeguard against situations where the Content-Length
header
is incorrect and the body is longer than its advertised length, we added a
check in the ReplayBody
type that stops buffering if the buffer ever exceeds
the limit. If this occurs, any previously buffered data is discarded and
the request will no longer be retried. This means that the proxy cannot run
out of memory due to a bug in the client, or a malicious request from the
outside world.
Summary
Retries are one of Linkerd’s most important reliability features. Before Linkerd 2.11, though, only requests without bodies could be retried, limiting the cases where this feature could be used. In 2.11, however, we added support for retrying requests with bodies.
Retrying requests with bodies involves some interesting implementation challenges, especially when we take streaming bodies into consideration.
In this post, we looked at how the Linkerd proxy minimizes the performance overhead of buffering request bodies by reducing copying and allocation. We also discussed how the proxy determines which requests can be retried, and some of the edge cases that had to be taken into consideration.
We hope you’re as excited as we are for this new Linkerd feature. You can try it out yourself by upgrading to Linkerd 2.11 (if you haven’t already) and enabling retries on routes with request bodies!