At QCon London, John Spray, a storage engineering lead @neon.tech, discussed the often-overlooked complexities of stateful cloud service design, using Neon Serverless Postgres as a case study. His session was part of the Cloud-Native Engineering track on the first day of the conference.

In his talk, Spray discussed the key considerations for data management and storage within modern IT infrastructures. He addressed questions about data localization and replication, the optimal strategies for storing data, and determining the necessary number of copies to maintain data integrity and availability. 

Spray also tackled the challenge of ensuring service availability during the initialization of a new node with an empty cache drive and discussed strategies for efficiently scaling services that rely on local disk storage. His analysis further extended to assessing Kubernetes’ influence in this domain and the financial ramifications of attaining data durability across multiple availability zones or regions, underpinning the talk’s focus on balancing cost, performance, and reliability.

InfoQ interviewed John before his talk at QCon London.

InfoQ: To ensure data durability and service availability, especially for databases like Neon Serverless Postgres, could you discuss the trade-offs in choosing between synchronous and asynchronous data replication methods? 

John Spray: Where practical, synchronous replication is usually preferred: it is easier for the user to reason about because it avoids the “time travel” problem of async systems when switching from the primary to a secondary location.  For example, internally within Neon, we use fully synchronous replication between Safekeeper nodes: latency stays within ~1 ms, and the resulting behavior is what the user expects (i.e., nothing changes from their point of view if one of our nodes is lost).
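The trade-off Spray describes can be sketched in a few lines. This is a toy model, not Neon's implementation: in synchronous mode the primary acknowledges a write only after every replica has applied it, while in asynchronous mode it acknowledges immediately and replicas catch up later, which is where the "time travel" on failover comes from.

```python
class Replica:
    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)


class Primary:
    def __init__(self, replicas, synchronous=True):
        self.replicas = replicas
        self.synchronous = synchronous
        self.log = []
        self.pending = []  # records not yet shipped to replicas (async mode)

    def write(self, record):
        self.log.append(record)
        if self.synchronous:
            # Block until every replica has the record; a replica failure
            # surfaces to the caller instead of being hidden.
            for r in self.replicas:
                r.apply(record)
        else:
            # Acknowledge immediately; replicas may lag behind the primary.
            self.pending.append(record)
        return "ack"

    def ship_pending(self):
        # Background catch-up: in async mode, replicas only converge here.
        for record in self.pending:
            for r in self.replicas:
                r.apply(record)
        self.pending.clear()
```

With `synchronous=True`, a replica promoted after primary loss holds every acknowledged write; with `synchronous=False`, acknowledged writes may be missing until `ship_pending` runs.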


Asynchronous replication allows the primary to proceed regardless of the secondary’s responsiveness. This is obviously useful over high latency links, but it is also important when the secondary is subject to high loads, such as read-intensive analytics workloads. Ensuring that a primary can maintain high performance irrespective of the secondary’s workload is a valuable tool for building robust architectures.


Neon’s Read Replica endpoints add one further level of isolation. Because the read replica can read directly from our disaggregated storage backend, the primary doesn’t even have to transmit updates to the replica, so any number of replicas may be run without putting extra load on the primary Postgres instance.
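The isolation property above can be illustrated with a minimal sketch (class names are hypothetical, not Neon's): because both primary and replicas talk to a shared page store, adding replicas adds readers of the store, not load on the primary.

```python
class SharedPageStore:
    """Stand-in for a disaggregated storage backend shared by all endpoints."""

    def __init__(self):
        self.pages = {}

    def put(self, page_id, data):
        self.pages[page_id] = data

    def get(self, page_id):
        return self.pages[page_id]


class ReadReplica:
    """Serves reads straight from shared storage; never contacts the primary."""

    def __init__(self, store):
        self.store = store

    def read(self, page_id):
        return self.store.get(page_id)
```

Any number of `ReadReplica` instances can be attached to the same store without the primary transmitting updates to each of them.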

InfoQ: Additionally, how do recovery strategies differ when utilizing local disk storage versus block or object storage, and what factors should influence the choice of one over the others?

Spray: It’s not quite clear what kind of recovery is meant; I’ll assume we’re discussing recovery from infrastructure failures.


Within Neon, we provide data durability through a combination of an initial 3x cross-AZ replication of users’ incoming writes (WAL) to our “Safekeeper” service and a subsequent upload to object storage (also replicated across AZs), from which the data can later be read via our “Pageserver” service.


This provides a useful example of the contrast between local disk and object storage for recovering from failures:


  • Safekeeper node failures require a new (replacement) node to re-fill its local storage from peers. No user data is lost, but we must make fresh copies quickly to restore the 3x replication and return to a fully healthy state.
  • Pageserver node failures do not impact the replication of underlying data (it is in S3), but they impact its availability, as we must re-download the hot set of objects.  We mitigate this by keeping warm-standby caches on other pageservers at the cost of using some extra disk space.
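The Safekeeper case above amounts to a re-replication step: find segments left with fewer than three copies and copy them from surviving peers to the replacement node. A minimal sketch, with hypothetical names and a trivial source-selection policy:

```python
def replacement_plan(placement, failed_node, new_node, replication_factor=3):
    """placement maps segment id -> set of node names holding a copy.

    Returns (segment, source_node, target_node) copy tasks for every
    segment the failed node left under-replicated.
    """
    plan = []
    for segment, nodes in placement.items():
        survivors = nodes - {failed_node}
        if len(survivors) < replication_factor:
            # Naive policy: copy from the first surviving peer. A real
            # system would balance load across sources.
            source = sorted(survivors)[0]
            plan.append((segment, source, new_node))
    return plan


def apply_plan(placement, failed_node, plan):
    # Record the new copies and drop the failed node from the placement map.
    for segment, _source, target in plan:
        placement[segment] = (placement[segment] - {failed_node}) | {target}
```

Segments whose three copies all survive need no work; only the under-replicated ones generate copy traffic, which is why finishing this quickly matters.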


Using replicas on a local disk is more expensive than object storage for primary storage, but we accept that cost to provide a lower latency for our users’ writes.  Failure strategies when using object storage are more flexible; for example, we can avoid holding extra warm copies of objects in the local disk cache for a less active database.  This enables finer-grained optimization of how much hardware resource we use per tenant, compared with designs that must always maintain 3+ local disk copies of everything (we only keep three local disk copies of the most recent writes).


Using replicated/network block devices such as EBS can simplify some designs, but we avoid them because of the poor cost/durability trade-off: EBS volumes are only replicated within a single AZ, while users typically expect their databases to be durable against an AZ failure.

InfoQ: Deploying stateful services across multiple availability zones or regions is crucial for high availability, but often has significant cost implications. Could you share insights on how organizations can balance the cost and performance when designing multi-region deployments for stateful services? 

Spray: Multi-AZ deployments are a frequent source of “bill shock” for cloud users: replicating a high rate of writes between storage instances in two or more AZs can have a comparable cost to the underlying storage instances.


Therefore, cross-AZ replication traffic should be limited to what is essential: incoming writes with tight latency requirements. Avoid using cross-AZ replication for later data management/housekeeping: get the data into object storage as soon as possible so you can access it from different AZs without the egress costs.
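A back-of-envelope calculation shows why this traffic dominates bills. The rates below are illustrative assumptions, not quoted prices: cross-AZ traffic billed per GB in each direction, object storage billed per GB-month.

```python
CROSS_AZ_PER_GB_EACH_WAY = 0.01      # assumed $/GB, billed on both sides
OBJECT_STORAGE_PER_GB_MONTH = 0.023  # assumed $/GB-month, for comparison


def monthly_replication_cost(write_mb_per_sec, extra_copies):
    """Cost of streaming writes to `extra_copies` replicas in other AZs."""
    gb_per_month = write_mb_per_sec / 1024 * 60 * 60 * 24 * 30
    # Each replicated GB is billed as egress on one side and ingress on the other.
    return gb_per_month * extra_copies * CROSS_AZ_PER_GB_EACH_WAY * 2


# Example: 20 MB/s of WAL replicated to two other AZs.
cost = monthly_replication_cost(20, extra_copies=2)  # ~$2,025/month
```

At these assumed rates, a sustained 20 MB/s write stream replicated twice costs on the order of two thousand dollars a month in transfer fees alone, comparable to the instances doing the work.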


How can we mitigate this?


  • Cloud vendors’ storage services are not subject to their own egress fees and sometimes offer cross-AZ replication at better value than doing it yourself, with the downside that, e.g., S3 has higher latency than replicating between local disks.
  • Thoughtful use of compression can significantly help: hot data is often highly compressible. This is true of typical OLTP workloads and streaming workloads. Modern CPUs can apply lightweight compression like LZ4 without imposing much extra latency, and the CPU cycles are cheaper than the egress fees.
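The compressibility claim is easy to demonstrate. LZ4 is not in Python's standard library, so `zlib` at its fastest level stands in here for a lightweight codec; the synthetic rows below mimic the repeated keys and similar values typical of OLTP data.

```python
import json
import zlib

# Synthetic OLTP-like rows: repeated field names and near-identical values.
rows = [
    {"account_id": i % 100, "event": "transfer", "amount_cents": 1999}
    for i in range(1000)
]
payload = json.dumps(rows).encode()

# level=1 is the fastest setting, trading ratio for low CPU cost.
compressed = zlib.compress(payload, level=1)
ratio = len(payload) / len(compressed)
```

Even at the fastest compression level, repetitive row data typically shrinks several-fold, so each byte saved is a byte that never incurs cross-AZ egress.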


Similar issues apply to multi-region deployments, but there is less scope for mitigation: moving data longer distances over fiber optic cable has an intrinsic cost.  For industries with a regulatory requirement for cross-region replication for durability, this is simply a cost of doing business.  Others should carefully consider whether the business benefit of having a presence in a remote region is sufficient to justify the cost of replicating data inter-region.

InfoQ: Are there specific patterns or Kubernetes features that can help minimize costs while maintaining or enhancing service performance and data durability?

Spray: I’ll cover this in some detail in my talk. The short version is that one must be careful, as there are pitfalls when using Kubernetes StatefulSets, and one must consider how node replacements in managed Kubernetes services will impact the maintenance of your service. Kubernetes is still sometimes the right tool for the job, but using it requires more careful thought for stateful services than for typical stateless use cases.

Access recorded QCon London talks with a Video-Only Pass.