
Datagram BIO API revisions for sendmmsg/recvmmsg

We need to evolve the API surface of BIO which is relevant to BIO_dgram (and the eventual BIO_dgram_mem) to support APIs which allow multiple datagrams to be sent or received simultaneously, such as sendmmsg(2)/recvmmsg(2).

The adopted design

Design decisions

The adopted design makes the following design decisions:

  • We use a sendmmsg/recvmmsg-like API. The alternative API was not considered for adoption because it is an explicit goal that the adopted API be suitable for concurrent use on the same BIO.

  • We define our own structures rather than using the OS's struct mmsghdr. The motivations for this are:

    • It ensures portability between OSes and allows the API to be used on OSes which do not support sendmmsg or sendmsg.

    • It allows us to use structures in keeping with OpenSSL's existing abstraction layers (e.g. BIO_ADDR rather than struct sockaddr).

    • We do not have to expose functionality which we cannot guarantee we can support on all platforms (for example, arbitrary control messages).

    • It avoids the need to include OS headers in our own public headers, which would pollute the environment of applications which include our headers, potentially undesirably.

  • For OSes which do not support sendmmsg, we emulate it using repeated calls to sendmsg. For OSes which do not support sendmsg, we emulate it using sendto to the extent feasible. This avoids the need for code consuming these new APIs to define a fallback code path.

  • We do not define any flags at this time, as the flags previously considered for adoption cannot be supported on all platforms (Win32 does not have MSG_DONTWAIT).

  • We ensure the extensibility of our BIO_MSG structure in a way that preserves ABI compatibility using a stride argument which callers must set to sizeof(BIO_MSG). Implementations can examine the stride argument to determine whether a given field is part of the caller's BIO_MSG. This allows us to add optional fields to BIO_MSG at a later time without breaking ABI. All new fields must be added to the end of the structure. (A sketch illustrating this stride check appears after this list.)

  • The BIO methods are designed to support stateless operation in which they are simply calls to the equivalent system calls, where supported, without changing BIO state. In particular, this means that things like retry flags are not set or cleared by BIO_sendmmsg or BIO_recvmmsg.

    The motivation for this is that these functions are intended to support concurrent use on the same BIO. If they read or modify BIO state, they would need to be synchronised with a lock, undermining performance on what (for BIO_dgram) would otherwise be a straight system call.

  • We do not support iovecs. The motivations for this are:

    • Not all platforms can support iovecs (e.g. Windows).

    • The only way we could emulate iovecs on platforms which don't support them is by copying the data to be sent into a staging buffer. This would defeat all of the advantages of iovecs and prevent us from meeting our zero/single-copy requirements. Moreover, it would lead to extremely surprising performance variations for consumers of the API.

    • We do not believe iovecs are needed to meet our performance requirements for QUIC. The reason is that, aside from a minimal packet header, all data in QUIC is encrypted, so all data sent via QUIC must pass through an encryption step anyway. This means the data sent will already be copied, and there is no issue with depositing the ciphertext in a staging buffer together with the frame header.

    • Even if we did support iovecs, we would have to impose a limit on the number of iovecs supported, because we translate from our own structures (as discussed above) and also intend these functions to be stateless and not require locking. Therefore the OS-native iovec structures would need to be allocated on the stack.

  • Sometimes, an application may wish to learn the local interface address associated with a receive operation or specify the local interface address to be used for a send operation. We support this, but require this functionality to be explicitly enabled before use.

    The reason for this is that enabling this functionality generally requires that the socket be reconfigured using setsockopt on most platforms. Doing this on demand would require state in the BIO to track whether the functionality is currently switched on, which would require otherwise unnecessary locking, undermining performance when this API is used concurrently on a given BIO. Requiring this functionality to be enabled explicitly before use allows the necessary initialization to be done up front without a performance cost. It also helps users of the API understand that this functionality is not always available, and to detect in advance whether it is.
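
To illustrate the stride-based extensibility decision above, the following sketch (purely hypothetical) shows how an implementation might detect whether a caller's structure contains a field added in a later release. The BIO_MSG_V2 structure, its ts field and the helper functions are illustrative only and are not part of the proposed API.

#include <stddef.h>
#include <stdint.h>
#include <openssl/bio.h>

/*
 * Hypothetical future revision of BIO_MSG in which a new optional field
 * (ts) has been appended to the end of the structure.
 */
typedef struct bio_msg_v2_st {
    void *data;
    size_t data_len;
    BIO_ADDR *peer, *local;
    uint64_t flags;
    uint64_t ts;                        /* illustrative new field */
} BIO_MSG_V2;

/*
 * Callers always pass stride = sizeof(BIO_MSG) as compiled against their
 * own headers, so an implementation can tell whether the caller's
 * structure is large enough to contain the new field.
 */
static int msg_has_ts(size_t stride)
{
    return stride >= offsetof(BIO_MSG_V2, ts) + sizeof(uint64_t);
}

/*
 * The i-th message in the caller's array is located using the
 * caller-supplied stride, not sizeof(BIO_MSG_V2), so old and new callers
 * interoperate.
 */
static void *msg_at(void *msg, size_t stride, size_t i)
{
    return (unsigned char *)msg + i * stride;
}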

Design

The currently proposed design is as follows:

typedef struct bio_msg_st {
    void *data;
    size_t data_len;
    BIO_ADDR *peer, *local;
    uint64_t flags;
} BIO_MSG;

#define BIO_UNPACK_ERRNO(e)     /*...*/
#define BIO_IS_ERRNO(e)         /*...*/

ossl_ssize_t BIO_sendmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
ossl_ssize_t BIO_recvmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);

The API is used as follows:

  • msg points to an array of num_msg BIO_MSG structures.

  • Both functions have identical prototypes, and return the number of messages processed in the array. If no messages were processed due to an error, -1 is returned. If an OS-level socket error occurs, a negative value v is returned. The caller should determine that v is an OS-level socket error by calling BIO_IS_ERRNO(v) and may obtain the OS-level socket error code by calling BIO_UNPACK_ERRNO(v).

  • stride must be set to sizeof(BIO_MSG).

  • data points to the buffer of data to be sent or to be filled with received data. data_len is the size of the buffer in bytes on call. If the given message in the array is processed (i.e., if the return value exceeds the index of that message in the array), data_len is updated to the actual amount of data sent or received at return time.

  • flags in the BIO_MSG structure provides per-message flags to the BIO_sendmmsg or BIO_recvmmsg call. If the given message in the array is processed, flags is written with zero or more result flags at return time. The flags argument to the call itself provides for global flags affecting all messages in the array. Currently, no per-message or global flags are defined and all of these fields are set to zero on call and on return.

  • peer and local are optional pointers to BIO_ADDR structures into which the remote and local addresses are to be filled. If either of these are NULL, the given addressing information is not requested. Local address support may not be available in all circumstances, in which case processing of the message fails. (This means that the function returns the number of messages processed, or -1 if the message in question is the first message.)

    Support for local must be explicitly enabled before use, otherwise attempts to use it fail.

Local address support is enabled as follows:

int BIO_dgram_set_local_addr_enable(BIO *b, int enable);
int BIO_dgram_get_local_addr_enable(BIO *b);
int BIO_dgram_get_local_addr_cap(BIO *b);

BIO_dgram_get_local_addr_cap() returns 1 if local address support is available. It is then enabled using BIO_dgram_set_local_addr_enable(), which fails if support is not available.
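
As a minimal usage sketch of the API described above, the following receives a batch of datagrams on a datagram BIO b attached to a bound socket, enabling local address reporting only where the capability is present. The function name recv_batch, the buffer size and the message count are arbitrary choices for illustration.

#include <stdio.h>
#include <openssl/bio.h>

#define NUM_MSG 4
#define BUF_LEN 1500

/* Receive up to NUM_MSG datagrams with a single BIO_recvmmsg() call. */
static int recv_batch(BIO *b)
{
    unsigned char bufs[NUM_MSG][BUF_LEN];
    BIO_MSG msgs[NUM_MSG];
    BIO_ADDR *peer[NUM_MSG], *local[NUM_MSG];
    ossl_ssize_t ret;
    size_t i;
    int have_local, ok = 0;

    /* Local address support must be explicitly enabled, where available. */
    have_local = BIO_dgram_get_local_addr_cap(b) == 1
                 && BIO_dgram_set_local_addr_enable(b, 1);

    for (i = 0; i < NUM_MSG; ++i) {
        peer[i]          = BIO_ADDR_new();
        local[i]         = have_local ? BIO_ADDR_new() : NULL;
        msgs[i].data     = bufs[i];
        msgs[i].data_len = BUF_LEN;     /* buffer size on call */
        msgs[i].peer     = peer[i];     /* request the remote address */
        msgs[i].local    = local[i];    /* request the local address, if enabled */
        msgs[i].flags    = 0;           /* no per-message flags defined yet */
    }

    ret = BIO_recvmmsg(b, msgs, sizeof(BIO_MSG), NUM_MSG, 0);
    if (ret < 0) {
        if (BIO_IS_ERRNO(ret))
            fprintf(stderr, "socket error: %d\n", (int)BIO_UNPACK_ERRNO(ret));
    } else {
        /* data_len of each processed message now holds the datagram size. */
        for (i = 0; i < (size_t)ret; ++i)
            printf("datagram %zu: %zu bytes\n", i, msgs[i].data_len);
        ok = 1;
    }

    for (i = 0; i < NUM_MSG; ++i) {
        BIO_ADDR_free(peer[i]);
        BIO_ADDR_free(local[i]);
    }
    return ok;
}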

Options which were considered

Options for the API surface which were considered included:

sendmmsg/recvmmsg-like API

This design was chosen to form the basis of the adopted design, which is described above.

int BIO_readm(BIO *b, BIO_mmsghdr *msgvec,
              unsigned len, int flags, struct timespec *timeout);
int BIO_writem(BIO *b, BIO_mmsghdr *msgvec,
              unsigned len, int flags, struct timespec *timeout);

We can either define BIO_mmsghdr as a typedef of struct mmsghdr or redefine an equivalent structure. The former has the advantage that we can just pass the structures through to the syscall without copying them.

Note that in BIO_mem_dgram we will have to process and therefore understand the contents of struct mmsghdr ourselves. Therefore, initially we define a subset of struct mmsghdr as being supported: no control messages, and msg_name and msg_iov only.

The flags argument is defined by us. Initially we can support something like MSG_DONTWAIT (say, BIO_DONTWAIT).
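
For concreteness, had this option been adopted with independently defined structures (one of the possibilities examined under “Implementation Questions” below), they might have looked something like the following sketch, mirroring the msg_name/msg_iov subset described above. The member layout is illustrative only and was never formally proposed.

#include <stddef.h>
#include <openssl/bio.h>

/*
 * Illustrative only: a minimal BIO-level analogue of the OS structures,
 * restricted to a peer address and scatter/gather data (no control
 * messages).
 */
typedef struct bio_iovec_st {
    void *iov_base;                     /* data buffer */
    size_t iov_len;                     /* buffer length in bytes */
} BIO_iovec;

typedef struct bio_msghdr_st {
    BIO_ADDR *msg_name;                 /* peer address, or NULL */
    BIO_iovec *msg_iov;                 /* scatter/gather array */
    size_t msg_iovlen;                  /* number of elements in msg_iov */
} BIO_msghdr;

typedef struct bio_mmsghdr_st {
    BIO_msghdr msg_hdr;                 /* message header */
    unsigned int msg_len;               /* bytes sent/received, set on return */
} BIO_mmsghdr;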

Implementation Questions

If we go with this, there are some issues that arise:

  • Are BIO_mmsghdr, BIO_msghdr and BIO_iovec simple typedefs for OS-provided structures, or our own independent structure definitions?

    • If we use OS-provided structures:

      • We would need to include the OS headers which provide these structures in our public API headers.

      • If we choose to support these functions when OS support is not available (see discussion below), we would need to define our own structures (a “polyfill” approach).

    • If we use our own structures:

      • We would need to translate these structures during every call.

        But we would need to have storage inside the BIO_dgram for m struct msghdr structures, m*v iovecs, etc. Since we want to support multithreaded use, these allocations will probably need to be on the stack, and therefore must be limited.

        Limiting m isn't a problem, because sendmmsg returns the number of messages sent, so the existing semantics we are trying to match let us simply send or receive fewer messages than we were asked to.

        However, it does seem that we will need to limit v, the number of iovecs per message. What limit should we choose for v? We would need a fixed stack allocation of OS iovec structures, from which we allocate as we iterate through the BIO_msghdr structures we have been given. So in practice we could simply send messages until we reach our iovec limit, and then return. (A sketch of such a translation loop appears after this list.)

        For example, suppose we allocate 64 iovecs internally:

        struct iovec vecs[64];
        

        If the first message passed to a call to BIO_writem has 64 iovecs attached to it, no further messages can be sent and BIO_writem returns 1.

        If three messages are passed, with 32, 32, and 1 iovecs respectively, only the first two messages are sent and BIO_writem returns 2.

        So the only important thing we would need to document in this API is the limit on the number of iovecs for a single message; in other words, the number of iovecs which must not be exceeded if a forward progress guarantee is to be made. For example, if we allocate 64 iovecs internally, a call to BIO_writem with a single message carrying 65 iovecs will never work, and this becomes part of the API contract.

        Obviously these quantities of iovecs are unrealistically large. iovecs are small, so we can afford to set the limit high enough that it shouldn't cause any problems in practice. We can increase the limit later without a breaking API change, but we cannot decrease it later. So we might want to start with something small, like 8.

  • We also need to decide what to do for OSes which don't support at least sendmsg/recvmsg.

    • Don't provide these functions and require all users of these functions to have an alternate code path which doesn't rely on them?

      • Not providing these functions on OSes that don't support at least sendmsg/recvmsg is a simple solution, but it adds complexity to code using BIO_dgram. (Though it does communicate more realistic performance expectations to calling code, since that code knows when these functions are actually available.)
    • Provide these functions and emulate the functionality:

      • However there is a question here as to how we implement the iovec arguments on platforms without sendmsg/recvmsg. (We cannot use writev/readv because we need peer address information.) Logically implementing these would then have to be done by copying buffers around internally before calling sendto/recvfrom, defeating the point of iovecs and providing a performance profile which is surprising to code using BIO_dgram.

      • Another option could be a variable limit on the number of iovecs, which can be queried from BIO_dgram. This would be a constant set when libcrypto is compiled. It would be 1 for platforms not supporting sendmsg/recvmsg. This again adds burdens on the code using BIO_dgram, but it seems the only way to avoid the surprising performance pitfall of buffer copying to emulate iovec support. There is a fair risk of code being written which accidentally works on one platform but not another, because the author didn't realise the iovec limit is 1 on some platforms. Possibly we could have an “iovec limit” variable in the BIO_dgram which is 1 by default, which can be increased by a call to a function BIO_set_iovec_limit, but not beyond the fixed size discussed above. It would return failure if not possible and this would give client code a clear way to determine if its expectations are met.
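
As a sketch of the translation approach discussed above (our own structures, a fixed internal iovec pool, and forward progress by processing fewer messages), the loop below translates as many caller messages as fit into a 64-entry pool and then issues a single sendmmsg call for them. It assumes the illustrative BIO_msghdr layout sketched earlier and a platform providing sendmmsg, omits BIO_ADDR-to-sockaddr translation, and uses arbitrary names and limits.

#define _GNU_SOURCE                     /* for sendmmsg on Linux */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define OSSL_IOVEC_POOL 64              /* fixed per-call iovec pool */
#define OSSL_MSG_LIMIT  32              /* arbitrary cap on messages per call */

/*
 * Translate up to n caller messages into OS structures on the stack and
 * send them with one sendmmsg() call; returns the number sent, or -1.
 */
static int writem_translate(int fd, const BIO_msghdr *msgs, unsigned int n)
{
    struct iovec pool[OSSL_IOVEC_POOL];
    struct mmsghdr os[OSSL_MSG_LIMIT];
    size_t used = 0, i;
    unsigned int count;

    if (n > OSSL_MSG_LIMIT)
        n = OSSL_MSG_LIMIT;

    for (count = 0; count < n; ++count) {
        const BIO_msghdr *m = &msgs[count];

        /*
         * Stop once this message's iovecs would overflow the pool; the
         * caller simply sees fewer messages processed (forward progress).
         */
        if (m->msg_iovlen > OSSL_IOVEC_POOL - used)
            break;

        for (i = 0; i < m->msg_iovlen; ++i) {
            pool[used + i].iov_base = m->msg_iov[i].iov_base;
            pool[used + i].iov_len  = m->msg_iov[i].iov_len;
        }

        memset(&os[count], 0, sizeof(os[count]));
        os[count].msg_hdr.msg_iov    = &pool[used];
        os[count].msg_hdr.msg_iovlen = m->msg_iovlen;
        /* msg_name translation from BIO_ADDR omitted for brevity */

        used += m->msg_iovlen;
    }

    if (count == 0)
        return 0;

    return sendmmsg(fd, os, count, 0);
}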

Alternate API

Could we use a simplified API? For example, we could have an API that returns one datagram at a time, where BIO_dgram uses recvmmsg internally and queues the returned datagrams, thereby still avoiding extra syscalls while offering a simple API.

The problem here is that we want to support “single-copy” (where the data is only copied as it is decrypted). Thus BIO_dgram needs to know the final resting place of the encrypted data at the time it makes the recvmmsg call.

One option would be to allow the user to set a callback on BIO_dgram it can use to request a new buffer, then have an API which returns the buffer:

int BIO_dgram_set_read_callback(BIO *b,
                                void *(*cb)(size_t len, void *arg),
                                void *arg);
int BIO_dgram_set_read_free_callback(BIO *b,
                                     void (*cb)(void *buf,
                                                size_t buf_len,
                                                void *arg),
                                     void *arg);
int BIO_read_dequeue(BIO *b, void **buf, size_t *buf_len);

The BIO_dgram calls the specified callback when it needs to generate internal iovecs for its recvmmsg call, and the received datagrams can then be popped by the application and freed as it likes. (The read free callback above is only used in rare circumstances, such as when calls to BIO_read and BIO_read_dequeue are alternated, or when the BIO_dgram is destroyed before all read buffers have been dequeued; see below.) For convenience we could have an extra call allowing a buffer to be pushed back into the BIO_dgram's internal queue of unused read buffers, which avoids the need for the application to do its own management of such recycled buffers:

int BIO_dgram_push_read_buffer(BIO *b, void *buf, size_t buf_len);
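
Had this option been pursued, the read side might have been used roughly as in the sketch below. The names my_alloc, my_free and read_one are illustrative, and none of the BIO_dgram functions shown here were adopted.

#include <openssl/bio.h>
#include <openssl/crypto.h>

/* Illustrative allocation and free routines handed to the BIO_dgram. */
static void *my_alloc(size_t len, void *arg)
{
    (void)arg;
    return OPENSSL_malloc(len);
}

static void my_free(void *buf, size_t buf_len, void *arg)
{
    (void)buf_len;
    (void)arg;
    OPENSSL_free(buf);
}

static int read_one(BIO *b)
{
    void *buf;
    size_t buf_len;

    BIO_dgram_set_read_callback(b, my_alloc, NULL);
    BIO_dgram_set_read_free_callback(b, my_free, NULL);

    /*
     * Pop one datagram; the BIO_dgram refills its RX queue via recvmmsg as
     * needed, allocating buffers through my_alloc.
     */
    if (!BIO_read_dequeue(b, &buf, &buf_len))
        return 0;

    /* ... process buf[0..buf_len) ... */

    /* Recycle the buffer rather than freeing it ourselves. */
    return BIO_dgram_push_read_buffer(b, buf, buf_len);
}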

On the write side, the application provides buffers and can get a callback when they are freed. BIO_write_queue just queues for transmission, and the sendmmsg call is made when calling BIO_flush. (TBD: whether it is reasonable to overload the semantics of BIO_flush in this way.)

int BIO_dgram_set_write_done_callback(BIO *b,
                                      void (*cb)(const void *buf,
                                                 size_t buf_len,
                                                 int status,
                                                 void *arg),
                                      void *arg);
int BIO_write_queue(BIO *b, const void *buf, size_t buf_len);
int BIO_flush(BIO *b);

The status argument to the write done callback will be 1 on success, some negative value on failure, and some special negative value if the BIO_dgram is being freed before the write could be completed.
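
The corresponding write side, under the same caveat that this option was not adopted; on_write_done and send_two are illustrative names.

#include <stdio.h>
#include <openssl/bio.h>

/* Invoked once per queued buffer when the BIO_dgram has finished with it. */
static void on_write_done(const void *buf, size_t buf_len, int status,
                          void *arg)
{
    (void)buf;
    (void)buf_len;
    (void)arg;
    if (status != 1)
        fprintf(stderr, "datagram was not sent (status %d)\n", status);
}

/* Queue two datagrams and emit them with a single sendmmsg-backed flush. */
static int send_two(BIO *b, const void *d1, size_t l1,
                    const void *d2, size_t l2)
{
    BIO_dgram_set_write_done_callback(b, on_write_done, NULL);

    if (!BIO_write_queue(b, d1, l1) || !BIO_write_queue(b, d2, l2))
        return 0;

    return BIO_flush(b);
}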

For send/receive addresses, we import the BIO_(set|get)_dgram_(origin|dest) APIs proposed in the sendmsg/recvmsg PR (#5257). BIO_get_dgram_(origin|dest) should be called immediately after BIO_read_dequeue and BIO_set_dgram_(origin|dest) should be called immediately before BIO_write_queue.

This approach allows BIO_dgram to support myriad options via composition of successive function calls in a “builder” style rather than via a single function call with an excessive number of arguments or pointers to unwieldy ever-growing argument structures, requiring constant revision of the central read/write functions of the BIO API.

Note that since BIO_set_dgram_(origin|dest) sets data on outgoing packets and BIO_get_dgram_(origin|dest) gets data on incoming packets, it doesn't follow that these access the same data (they are not setters and getters of variables called "dgram origin" and "dgram destination", even though their names make them look like setters and getters of the same variables). We probably want to separate these, as there is no need for a getter for the outgoing packet destination, for example, and by separating them we allow the possibility of multithreaded use (one thread reads, one thread writes) in the future. Possibly we should choose less confusing names for these functions, such as BIO_set_outgoing_dgram_(origin|dest) and BIO_get_incoming_dgram_(origin|dest).

Pros of this approach:

  • The application can generate one datagram at a time and still get the advantages of sendmmsg/recvmmsg (fewer syscalls, etc.)

    We probably want this for our own QUIC implementation built on top of this anyway. Otherwise we will need another piece to do basically the same thing and agglomerate multiple datagrams into a single BIO call. Unless we only want to use sendmmsg constructively in trivial cases (e.g. where we send two datagrams from the same function immediately after one another, which doesn't seem like a common use case).

  • Flexible support for single-copy (zero-copy).

Cons of this approach:

  • Very different way of doing reads/writes might be strange to existing applications. But the primary consumer of this new API will be our own QUIC implementation so probably not a big deal. We can always support BIO_read/BIO_write as a less efficient fallback for existing third party users of BIO_dgram.

Compatibility interop

Suppose the following sequence happens:

  1. BIO_read (legacy call path)
  2. BIO_read_dequeue (recvmmsg based call path with callback-allocated buffer)
  3. BIO_read (legacy call path)

For (1) we have two options:

a. Use recvmmsg and add the received datagrams to an RX queue just as for the BIO_read_dequeue path. We use an OpenSSL-provided default allocator (OPENSSL_malloc) and flag these datagrams as needing to be freed by OpenSSL, not the application.

When the application calls BIO_read, a copy is performed and the internal buffer is freed.

b. Use recvfrom directly. This means we have a recvmmsg path and a recvfrom path depending on what API is being used.

The disadvantage of (a) is that it yields an extra copy relative to what we have now, whereas with (b) the buffer passed to BIO_read is passed through to the syscall and we do not have to copy anything.

Since we will probably need to support platforms without sendmmsg/recvmmsg support anyway, (b) seems like the better option.

For (2) the new API is used. Since the previous call to BIO_read is essentially “stateless” (it's just a simple call to recvfrom, and doesn't require mutation of any internal BIO state other than maybe the last datagram source/destination address fields), BIO_dgram can go ahead and start using the recvmmsg code path. Since the RX queue will obviously be empty at this point, it is initialised and filled using recvmmsg, then one datagram is popped from it.

For (3) we have a legacy BIO_read but we have several datagrams still in the RX queue. In this case we do have to copy - we have no choice. However this only happens in circumstances where a user of BIO_dgram alternates between old and new APIs, which should be very unusual.

Subsequently for (3) we have to free the buffer using the free callback. This is an unusual case where BIO_dgram is responsible for freeing read buffers and not the application (the only other case being premature destruction, see below). But since this seems a very strange API usage pattern, we may just want to fail in this case.

Probably not worth supporting this. So we can have the following rule:

  • After the first call to BIO_read_dequeue is made on a BIO_dgram, all subsequent calls to ordinary BIO_read will fail.

Of course, all of the above applies analogously to the TX side.

BIO_dgram_pair

We will also implement from scratch a BIO_dgram_pair. This will be provided as a BIO pair which provides identical semantics to the BIO_dgram above, both for the legacy and zero-copy code paths.

Thread safety

It is a functional assumption of the above design that we would never want to have more than one thread doing TX on the same BIO and never have more than one thread doing RX on the same BIO.

If we did ever want to do this, multiple BIOs on the same FD is one possibility (for the BIO_dgram case at least). But I don't believe there is any general intention to support multithreaded use of a single BIO at this time (unless I am mistaken), so this seems like it isn't an issue.

If we wanted to support multithreaded use of the same FD using the same BIO, we would need to revisit the set-call-then-execute-call API approach above (BIO_(set|get)_dgram_(origin|dest)), as this would pose a problem. But I mention this mainly for completeness. The lessons we have recently learnt about cache contention suggest that this probably wouldn't be a good idea anyway.

Other questions

BIO_dgram will call the allocation function to get buffers for recvmmsg to fill. We might want to have a way to specify how many buffers it should offer to recvmmsg, and thus how many buffers it allocates in advance.

Premature destruction

If BIO_dgram is freed before all datagrams are read, the read buffer free callback is used to free any unreturned read buffers.
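
A sketch of what this cleanup might look like internally; the queue structure and field names are hypothetical.

#include <stddef.h>
#include <openssl/crypto.h>

/*
 * Hypothetical internal RX queue entry for a buffer obtained via the
 * application's read allocation callback.
 */
struct rx_buf_st {
    void *buf;
    size_t buf_len;
    struct rx_buf_st *next;
};

/*
 * On destruction of the BIO_dgram, hand every unread buffer back to the
 * application via its read free callback rather than leaking it.
 */
static void dgram_free_rx_queue(struct rx_buf_st *head,
                                void (*free_cb)(void *buf, size_t buf_len,
                                                void *arg),
                                void *free_cb_arg)
{
    while (head != NULL) {
        struct rx_buf_st *next = head->next;

        free_cb(head->buf, head->buf_len, free_cb_arg);
        OPENSSL_free(head);
        head = next;
    }
}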