authors | state |
---|---|
Andrew Lytvynov ([email protected]) |
draft |
Support HSM devices for CA private key storage.
In simple terms, HSMs are hardware devices that store private keys and expose only cryptographic operations using them (like sign/verify and encrypt/decrypt). HSMs provide stronger protections for storing private keys compared to disks or remote databases. Private keys stored in the HSM can be non-exportable, meaning they will never leave the device. HSMs are also required by some regulation regimes, like FIPS and PCI.
First, some background on HSMs and Teleport CAs.
It's important to understand that HSMs are a product category (like "storage devices" or "monitors") and not a specific standard (like NVMe, SATA or HDMI). There are commonalities between HSM devices, but also many vendor-specific variations, extensions and features.
That said, most HSM devices support the following minimum set of operations that Teleport needs:
- generating asymmetric key pairs, with persistence
- using user-specific parameters, like algorithm and key size
- the set of valid parameters is limited (e.g. you can have RSA but not EdDSA)
- exporting the public key of the pair
- using those keys for signing/verification of arbitrary data
- lookup of keys by ID specified during creation
- import of an existing private key from disk
Some devices support other interesting operations, but they are not universal:
- backup/restore/migration of keys to a different HSM device
- remote access
By far, the most ubiquitous interface to HSM devices is PKCS#11. It is often
despised because of complexity and usage difficulty. PKCS#11 works by loading a
module (.so
library) provided by the vendor, which has a known C API. To
cover the largest number of devices, we have to support PKCS#11.
Cloud HSM (and KMS) products are also available from the biggest cloud providers, usually over a custom API. AWS CloudHSM is an exception - it mounts a real HSM over PKCS#11. Since cloud HSM could become a popular request, we should design with the assumption of multiple HSM interfaces. But we will not implement their support initially.
Each HSM vendor typically also provides a custom SDK or library for their specific product. We should not support these in favor of the standard PKCS#11 interface.
yubihsm-connector is an interesting HTTP-to-USB proxy for the device. It allows a single HSM to be shared by multiple clients. It doesn't seem to do authentication though and only works for YubiHSM 2.
Despite popular belief, HSMs don't guarantee that private keys can't be stolen. Using key wrapping and special attributes on private keys, it's possible to copy keys created on one device to a different one or extract them in plaintext. Yubico even provides CLI tools for this, specific to YubiHSM.
This may be used to distribute a single CA private key to multiple HSMs over the network. However, it significantly weakens the security properties of using an HSM in the first place.
Each Teleport cluster has multiple CAs:
- Host CA (SSH + TLS)
- User CA (SSH + TLS)
- JWT CA (TLS only)
Private keys in each CA are stored as a list to enable CA rotation. For example, when rotating a Host CA, we shuffle the key fields of the existing CA object instead of creating a brand new backend object. The expectation of only 1 CA of each type is deeply integrated in Teleport.
Each Auth server may have an HSM device available, configured in
teleport.yaml
. Auth servers create and manage unique CA keys in their local
HSMs, distinct from the keys on other Auth servers. Each Auth server encodes
its local HSM information in the private key field of the CA backend object
(instead of the PEM-encoded raw private key).
CA storage schema has to support multiple active private keys. Auth servers decide on which key to use based on their local configuration.
CA rotation also needs to update, to support multiple active key pairs and multiple trusted key pairs.
Implementing key creation in the Auth server (instead of asking users to manually create keys) gives us the flexibility to change key algorithms in the future. Support for multiple private keys sets us up for CA backup/restore/migration and full federation in the future.
There are several PKCS#11 libraries available, we'll use crypto11. This library allows both key creation and usage, and is a nice usable abstraction over the raw PKCS#11 protocol.
We could use miekg/pkcs11
directly, which represents the PKCS#11 API as closely as possible. But we'd end
up with a lot of verbose code, essentially reimplementing most of crypto11
(which is based on miekg/pkcs11
anyway).
Current CA rotation logic operates on one CA type at a time and performs a phased rotation to give clients and servers a chance to fetch updated CA certs list and reissue their credentials with the new CA.
The rotation mechanism switches new and old CA keys to be "signing" key and "trusted" keys in different phases:
- "signing" key is the first item in the keys list (e.g.
CertificateAuthorityV2.Spec.TLSKeyPairs[0]
) and is used to sign any new certificates - "trusted" key is the second item in the keys list and is trusted for certificate validation, but does not sign any new certificates
This index-based approach will break down with multiple active CA keys (in different HSMs on different Auth servers). Instead, we'll update the storage schema to support multiple "signing" and "trusted" keys for each CA.
Clients will trust all "signing" + "trusted" CA certs for validation. Auth servers will use their corresponding HSM key for signing (see below).
The current backend schema stores CA keys as lists:
message CertAuthoritySpecV2 {
// Note: some fields are omitted below.
// SSH keys. CheckingKeys are public keys, SigningKeys are private keys.
repeated bytes CheckingKeys = 3 [ (gogoproto.jsontag) = "checking_keys,omitempty" ];
repeated bytes SigningKeys = 4 [ (gogoproto.jsontag) = "signing_keys,omitempty" ];
// TLS keys and certs.
repeated TLSKeyPair TLSKeyPairs = 7 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "tls_key_pairs,omitempty" ];
// JWT keys.
repeated JWTKeyPair JWTKeyPairs = 10 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "jwt_key_pairs,omitempty" ];
}
This looks like it supports multiple private keys, but in practice only the
first item of each list is used for signing. Since we can't add a bool signing
flag to SSH keys (because they are byte slices), there isn't an
incremental way to mark multiple signing keys.
I propose revamping the schema entirely and doing a migration to move existing data:
message CertAuthoritySpecV2 {
// Note: some fields are omitted below.
// Deprecated fields for backwards-compatibility.
// Delete in teleport v8.
repeated bytes CheckingKeys = 3 [ (gogoproto.jsontag) = "checking_keys,omitempty" ];
repeated bytes SigningKeys = 4 [ (gogoproto.jsontag) = "signing_keys,omitempty" ];
repeated TLSKeyPair TLSKeyPairs = 7 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "tls_key_pairs,omitempty" ];
repeated JWTKeyPair JWTKeyPairs = 10 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "jwt_key_pairs,omitempty" ];
// New fields in teleport v7.
CAKeySet SigningKeys = 11;
CAKeySet TrustedKeys = 12;
}
message CAKeySet {
repeated SSHKeyPair SSH = 1;
repeated TLSKeyPair TLS = 2;
repeated JWTKeyPair JWT = 3;
}
message SSHKeyPair {
bytes PublicKey = 1;
bytes PrivateKey = 2;
PrivateKeyType PrivateKeyType = 3;
}
// TLSKeyPair already exists.
message TLSKeyPair {
bytes Cert = 1;
bytes Key = 2;
// New field.
PrivateKeyType KeyType = 3;
}
// JWTKeyPair already exists.
message JWTKeyPair {
bytes PublicKey = 1;
bytes PrivateKey = 2;
// New field.
PrivateKeyType PrivateKeyType = 3;
}
enum PrivateKeyType {
RAW = 0;
PKCS11 = 1;
}
Clients will use a concatenation of both CAKeySet
s. Auth servers will use
SigningKeys
to sign new certificates. CA rotation can now swap SigningKeys
and TrustedKeys
instead of relying on indexes.
When PrivateKeyType
is set to PKCS11
, the PrivateKey
/Key
fields contain
encoded information about the PKCS#11 key, as a JSON object prefixed with
pkcs11:
:
pkcs11:{"host_id": "host-uuid", "key_id": "key-uuid"}
Where host-uuid
is the UUID of the Auth server that created the key and
key_id
is the UUID of the key set in the CKA_ID
attribute on the HSM.
Auth servers, when signing, will use:
- if configured with PKCS#11, the
SigningKeys
PKCS#11 key that belongs to their host (matched viahost_id
field) - if not configured with PKCS#11, the first non-PKCS#11 item in
SigningKeys
Auth servers need PKCS#11 configuration, passed in via teleport.yaml
:
auth_service:
ca_key_params:
pkcs11:
# Path to PKCS#11 module from the HSM vendor.
module_path: "/path/to/pkcs11.so"
# CKA_LABEL of the hSM, set during HSM initialization.
label: "myhsm"
# PIN for connecting to the HSM, set during HSM initialization.
# This can also be a path to a file containing the PIN.
pin: "password"
If pkcs11
field is not set, Auth server will start with plaintext keys stored
in the backend. If Auth server fails to connect to the token at startup with
these parameters, the process will exit.
The fields from PKCS#11 configuration will also be added to the auth_server
API resource as read-only.
Migration of existing clusters to use HSMs will involve a CA rotation. Until the CA rotation, Auth servers will not use the HSM for signing even when configured. CA rotation is needed to propagate the CA certificates associated with the new private key to all users and hosts. Without the rotation, all certificates issued by this Auth server would not be trusted by any client.
On startup, Auth server will fetch existing CAs from the backend and:
- if no CAs exist, create a key in HSM and register it in
SigningKeys
- this takes care of brand new Teleport clusters
- if CAs exist, but don't have an HSM key from this host registered in
SigningKeys
, require a CA rotation to add the key- print a warning on startup and in
tctl status
until HSM key is added - all signing operations against this Auth server will fail until rotation; when adding a new HSM-enabled Auth server to the cluster, admins should not route any traffic to it until the rotation is complete
- for HA setup in a new cluster, admins can run
tctl auth rotate --grace-period=0
to perform fast rotation after adding all Auth servers
- print a warning on startup and in
- if CAs exist and have an HSM key from this host, load the key and use it for
signing
- if the key is not found in the HSM (e.g. it was factory reset), the Auth
server will exit; to resolve this, admins have remove
/var/lib/teleport/host_uuid
to explicitly change the UUID and dis-associate the old HSM key from this instance; from there, Auth server behaves as above when registering a new key
- if the key is not found in the HSM (e.g. it was factory reset), the Auth
server will exit; to resolve this, admins have remove
When rotation is initiated through a different Auth server, Auth servers with
HSMs will add a new PKCS#11 key reference to CA TrustedKeys
in the init
phase of rotation.
Using HSMs is inherently more fragile than using plaintext keys. Auth servers become much less dynamic and are tied to a specific machine with an HSM device. CA rotation becomes more complex, with new failure modes.
This design attempts to automate as much HSM management as possible, but still relies on human operators to:
- provision and initialize HSM devices
- not remove/reset/swap HSM devices when in use
- carefully manage
teleport.yaml
on Auth servers to ensure correct PKCS#11 configuration - trigger CA rotations when adding new Auth servers with HSMs
- and check logs /
tctl status
to ensure the new HSM keys are in use
- and check logs /
HSM support should support any PKCS#11-compatible device, but will be tested with: