The payment system question is the canonical staff-level system design interview. It tests correctness under failure — a property that most distributed systems can trade away, but payment systems cannot. Getting a charge wrong costs real money and real customers. Interviewers use this question to find engineers who understand that distributed systems correctness is not optional when money is involved.
Payment systems are uniquely unforgiving. Every other system in this series can tolerate some level of eventual consistency, approximate correctness, or graceful degradation. Payment systems cannot. Charging a customer twice and failing to record a completed charge are equally catastrophic: the first causes customer complaints and chargebacks, the second causes revenue leakage and reconciliation failures. The interviewer is specifically evaluating whether you understand which correctness properties are non-negotiable and how to achieve them in a distributed environment that will inevitably experience network partitions and crashes.
Idempotency. The single most important property in payment system design. Network timeouts are inevitable. Client retries are necessary. Without idempotency, a retry after a timeout results in a double charge. The interviewer expects you to open with idempotency keys as a first-class design requirement, not as an afterthought.
ACID over availability. Most distributed systems questions reward you for choosing availability over consistency. Payment systems reward the opposite choice. Choosing NoSQL for its availability properties, or trading ACID for write throughput, signals a fundamental misunderstanding of financial data requirements. The interviewer wants to hear you explicitly say: "I will accept lower write throughput and reduced availability in exchange for ACID guarantees, because a double charge or lost transaction costs more than a brief outage."
State machine design. A payment is not a binary (succeeded/failed). It moves through a series of states: initiated → PSP called → PSP responded → ledger updated → notification sent → reconciled. Each state transition must be atomic and observable. The interviewer is checking that you model payment as a state machine, not a function call.
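A minimal sketch of the state machine idea, assuming the linear happy path listed above. The enum names, the `ALLOWED_TRANSITIONS` table, and the `transition` helper are illustrative, not part of any particular framework; a production model would add failure and retry branches.

```python
from enum import Enum


class PaymentState(Enum):
    INITIATED = "initiated"
    PSP_CALLED = "psp_called"
    PSP_RESPONDED = "psp_responded"
    LEDGER_UPDATED = "ledger_updated"
    NOTIFICATION_SENT = "notification_sent"
    RECONCILED = "reconciled"


# Each state maps to the set of states it may move to next.
# Only the happy path from the text is modelled here (an assumption).
ALLOWED_TRANSITIONS = {
    PaymentState.INITIATED: {PaymentState.PSP_CALLED},
    PaymentState.PSP_CALLED: {PaymentState.PSP_RESPONDED},
    PaymentState.PSP_RESPONDED: {PaymentState.LEDGER_UPDATED},
    PaymentState.LEDGER_UPDATED: {PaymentState.NOTIFICATION_SENT},
    PaymentState.NOTIFICATION_SENT: {PaymentState.RECONCILED},
    PaymentState.RECONCILED: set(),  # terminal state
}


class InvalidTransition(Exception):
    """Raised when a caller tries to skip or reverse a state."""


def transition(current: PaymentState, target: PaymentState) -> PaymentState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.name} -> {target.name} is not allowed")
    return target
```

Rejecting illegal moves in one place makes every transition observable: the same function is a natural point to write the audit log row.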
Reconciliation. Even with perfect idempotency and ACID transactions, distributed systems create discrepancies. The PSP might report a charge as successful that your system recorded as failed. Your database might be updated but the notification failed. Reconciliation — the process of comparing your internal records against PSP records and resolving discrepancies — is a required component of every production payment system. Candidates who skip reconciliation are signalling that they have not operated a real payment system.
Payment systems are not high-throughput by typical distributed systems standards. The major card networks combined average on the order of thousands to tens of thousands of transactions per second worldwide, and any single merchant sees only a small fraction of that. A large e-commerce platform typically sees hundreds of payment transactions per second, rising to the low thousands at peak. This is well below Twitter's 23,000 RPS or YouTube's 50,000 playback RPS.
Throughput target. For a mid-to-large e-commerce platform: 1,000 payment transactions per second at peak (Black Friday). This is achievable on a single well-provisioned PostgreSQL instance. The challenge is not throughput — it is correctness under failure.
Storage. Each payment record: payment ID (UUID, 16 bytes), user ID (8 bytes), amount (8 bytes), currency (3 bytes), status (4 bytes), PSP transaction ID (32 bytes), idempotency key (64 bytes), timestamps (16 bytes), metadata JSON (~500 bytes). Total: ~650 bytes per record. At a sustained 1,000 TPS, that's 650 KB/sec or ~56 GB/day of new payment records, and ~20 TB for a year of payment data. Because 1,000 TPS is the peak rate, these figures are an upper bound. Manageable on standard PostgreSQL with appropriate partitioning by date.
Audit log storage. Each payment generates approximately 5–8 audit log entries (state transitions). Each entry: ~200 bytes. At 1,000 TPS: 5,000–8,000 audit entries/sec = 1–1.6 MB/sec. Year of audit logs: ~32–50 TB. Audit logs are typically archived to cold storage (S3-compatible) after 90 days, keeping the hot audit table to recent history.
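The storage arithmetic can be checked with a few lines of Python. This is a back-of-envelope sketch that assumes the peak rate is sustained all year (so the results are upper bounds) and uses decimal terabytes; the field names in the dictionary are just labels for the sizes listed above.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365

# Field sizes in bytes, taken from the payment record estimate.
PAYMENT_RECORD_FIELDS = {
    "payment_id": 16,
    "user_id": 8,
    "amount": 8,
    "currency": 3,
    "status": 4,
    "psp_transaction_id": 32,
    "idempotency_key": 64,
    "timestamps": 16,
    "metadata_json": 500,
}


def bytes_per_year(bytes_per_item: int, items_per_second: int) -> int:
    """Bytes written in one year at a constant rate (assumes sustained peak)."""
    return bytes_per_item * items_per_second * SECONDS_PER_DAY * DAYS_PER_YEAR


record_bytes = sum(PAYMENT_RECORD_FIELDS.values())           # 651
payments_tb = bytes_per_year(record_bytes, 1_000) / 1e12     # ~20.5 TB
audit_low_tb = bytes_per_year(200, 5_000) / 1e12             # ~31.5 TB
audit_high_tb = bytes_per_year(200, 8_000) / 1e12            # ~50.5 TB

print(f"payment records: {payments_tb:.1f} TB/year")
print(f"audit log: {audit_low_tb:.1f} to {audit_high_tb:.1f} TB/year")
```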
PSP latency. External PSP calls (Stripe, Braintree, Adyen) typically take 500ms–3,000ms. If the request handler holds a database connection while it waits for the PSP response, each payment request ties up a connection for up to 3 seconds. At 1,000 TPS with 1,500ms average PSP latency, that is 1,500 concurrent database connections (arrival rate × hold time). This is a practical constraint: PostgreSQL's default max_connections is 100. You need a connection pooler (PgBouncer) and connection pool sizing matched to PSP latency, not just request rate.
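The connection estimate is an application of Little's law (average concurrency = arrival rate × time held). A small sketch, where the function names and the 30% headroom figure are assumptions chosen for illustration:

```python
import math


def concurrent_connections(requests_per_second: int, hold_ms: int) -> float:
    """Little's law: average number of connections in use at once."""
    return requests_per_second * hold_ms / 1000


def pool_size(requests_per_second: int, hold_ms: int, headroom_pct: int = 30) -> int:
    """Average concurrency plus a safety margin (the margin is an assumption)."""
    needed = concurrent_connections(requests_per_second, hold_ms)
    return math.ceil(needed * (100 + headroom_pct) / 100)


# Connection held for the whole PSP call (1,500 ms average).
print(concurrent_connections(1_000, 1_500))  # 1500.0

# Connection held only for short database work (assumed 20 ms per request).
print(concurrent_connections(1_000, 20))     # 20.0
```

The second figure shows why hold time matters more than request rate: releasing the connection before the PSP call shrinks the required pool by roughly two orders of magnitude under these assumed numbers.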
Idempotency keys: the foundation of everything. Every payment request from the client includes an idempotency key — a UUID generated by the client before the request. The server stores this key in a dedicated idempotency_keys table with a unique constraint. Before processing any payment: attempt to insert the key. If the insert succeeds, proceed with payment processing and store the result against the key. If the insert fails (duplicate key constraint), return the stored result of the original request. This entire check-and-process must happen in a single database transaction to prevent race conditions between concurrent requests with the same key.
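The insert-or-return-stored-result pattern can be sketched with Python's built-in `sqlite3` module standing in for PostgreSQL. The table layout, `handle_payment`, and the `process` callback are hypothetical names for illustration; `process` represents whatever work produces the payment result.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE idempotency_keys (
            key    TEXT PRIMARY KEY,  -- uniqueness is what rejects duplicates
            result TEXT               -- JSON result of the original request
        )
        """
    )
    return conn


def handle_payment(conn: sqlite3.Connection, key: str, process) -> dict:
    """Run process() once per key; replay the stored result on duplicates."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("INSERT INTO idempotency_keys (key) VALUES (?)", (key,))
            result = process()
            conn.execute(
                "UPDATE idempotency_keys SET result = ? WHERE key = ?",
                (json.dumps(result), key),
            )
        return result
    except sqlite3.IntegrityError:
        # Duplicate key: the original request already committed its result.
        row = conn.execute(
            "SELECT result FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0])
```

Calling `handle_payment` twice with the same key runs `process` only once; the second call returns the stored result. The unique constraint, not application-level checking, is what makes concurrent duplicates safe.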
Two-phase payment flow. Never attempt to charge a card and update your database in the same operation — these are calls to different systems (PSP and your DB) and cannot be atomically coordinated. The correct sequence: (1) Create a payment record in state INITIATED in your database. (2) Call the PSP. (3) Based on the PSP response, update the payment record to SUCCEEDED or FAILED. If step (2) times out — you don't know whether the charge succeeded — set the payment to UNCERTAIN and trigger a reconciliation check. Never assume failure on timeout; the PSP may have charged the card successfully.
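The three steps and the timeout rule can be sketched as follows. Everything here is a stand-in: `store` is a dictionary playing the role of the database, `psp_charge` is a hypothetical PSP client function, and the two exception classes are invented for the example.

```python
from dataclasses import dataclass


class PspTimeout(Exception):
    """Hypothetical: the PSP did not answer in time, outcome unknown."""


class PspDeclined(Exception):
    """Hypothetical: the PSP answered and definitively refused the charge."""


@dataclass
class Payment:
    payment_id: str
    amount_cents: int
    state: str = "INITIATED"


def process_payment(payment, store, psp_charge, needs_reconciliation):
    # Step 1: record the payment as INITIATED before contacting the PSP.
    store[payment.payment_id] = payment

    # Step 2: call the PSP. Step 3: record the outcome.
    try:
        psp_charge(payment.payment_id, payment.amount_cents)
    except PspTimeout:
        # Unknown outcome: the card may have been charged, so do not mark FAILED.
        payment.state = "UNCERTAIN"
        needs_reconciliation.append(payment.payment_id)
    except PspDeclined:
        payment.state = "FAILED"
    else:
        payment.state = "SUCCEEDED"
    return payment.state
```

The important branch is the timeout: it produces a third outcome that is neither success nor failure and hands the payment to reconciliation.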
PostgreSQL with ACID transactions. The payment service writes to PostgreSQL exclusively. Each write uses a transaction that atomically: inserts/updates the payment record, appends to the audit log, and appends the corresponding ledger entries. If any of these three writes fail, the entire transaction rolls back. This is the ACID guarantee in action: there is no partial state where the payment record says SUCCEEDED but the ledger wasn't updated. Note that PostgreSQL never exposes uncommitted data, even at its default Read Committed level. The serialisable isolation level goes further: it guarantees that concurrent transactions behave as if they had run one at a time, at the cost of occasional serialisation failures that the application must retry.
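The all-or-nothing behaviour of the three writes can be demonstrated with `sqlite3` as a stand-in for PostgreSQL. The schema and the `record_success` function are assumptions made for this sketch, not a prescribed layout.

```python
import sqlite3

SCHEMA = """
CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE audit_log (payment_id TEXT NOT NULL, event TEXT NOT NULL);
CREATE TABLE ledger_entries (
    payment_id   TEXT NOT NULL,
    account      TEXT NOT NULL,
    amount_cents INTEGER NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_success(conn, payment_id, account, amount_cents):
    """Apply the payment update, audit row, and ledger row as one unit."""
    with conn:  # commits if all three succeed, rolls back if any raises
        conn.execute(
            "UPDATE payments SET status = 'SUCCEEDED' WHERE id = ?", (payment_id,)
        )
        conn.execute(
            "INSERT INTO audit_log VALUES (?, 'SUCCEEDED')", (payment_id,)
        )
        conn.execute(
            "INSERT INTO ledger_entries VALUES (?, ?, ?)",
            (payment_id, account, amount_cents),
        )
```

If the ledger insert violates a constraint (for example a missing account), the earlier status update and audit row are rolled back with it, so the payment stays in its previous state.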
Async reconciliation worker. A background worker runs every 15 minutes and queries the PSP for all transactions in the reconciliation window. It compares PSP records against your internal payment records and flags discrepancies: charges the PSP shows as successful that you have as failed (potential revenue lost), and charges you show as successful that the PSP has no record of (data integrity issue). Discrepancies are queued for human review. This is not a failure recovery mechanism — it is a correctness validation that runs continuously in production.
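The comparison step of the reconciliation worker might look like the sketch below. It assumes both sides have already been fetched into dictionaries mapping payment ID to status; the `reconcile` function and the discrepancy labels are illustrative names.

```python
def reconcile(internal: dict, psp: dict) -> list:
    """Return (payment_id, discrepancy_kind) pairs for human review.

    internal and psp both map payment ID -> status string.
    """
    discrepancies = []

    # Charges the PSP reports as successful that we did not record as successful.
    for payment_id, psp_status in psp.items():
        if psp_status == "SUCCEEDED" and internal.get(payment_id) != "SUCCEEDED":
            discrepancies.append((payment_id, "psp_succeeded_not_recorded"))

    # Charges we recorded as successful that the PSP has no record of.
    for payment_id, our_status in internal.items():
        if our_status == "SUCCEEDED" and payment_id not in psp:
            discrepancies.append((payment_id, "recorded_success_missing_at_psp"))

    return sorted(discrepancies)
```

For example, `reconcile({"p1": "FAILED"}, {"p1": "SUCCEEDED"})` flags `p1` as charged at the PSP but not recorded as successful. The output feeds a review queue rather than triggering automatic corrections.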
Webhook handling for async PSPs. Some payment methods (bank transfers, some international cards) do not respond synchronously. The PSP accepts the charge initiation and sends a webhook callback minutes or hours later with the final status. Your system must: create the payment in PENDING state, return a response to the user ("payment is processing"), and have a webhook endpoint that updates payment state when the PSP calls back. The webhook endpoint must be idempotent — PSPs often send the same webhook multiple times.
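An idempotent webhook handler can be sketched by deduplicating on an event identifier. The event fields (`event_id`, `payment_id`, `status`) are hypothetical and will differ per PSP; the in-memory set and dictionary stand in for database tables, where the seen-event check would be a unique constraint written in the same transaction as the state update.

```python
def handle_webhook(event: dict, payments: dict, seen_event_ids: set) -> str:
    """Apply a PSP callback at most once, however many times it is delivered."""
    event_id = event["event_id"]
    if event_id in seen_event_ids:
        return "duplicate_ignored"

    payment = payments[event["payment_id"]]
    # Only move out of PENDING; a late or repeated callback must not
    # overwrite a state that is already final.
    if payment["state"] == "PENDING":
        payment["state"] = event["status"]

    seen_event_ids.add(event_id)
    return "applied"
```

Delivering the same event twice changes state once and reports the second delivery as a duplicate, so the PSP can safely retry.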
Separation of payment service and ledger service. The payment service handles external PSP calls and idempotency. The ledger service maintains account balances as a series of double-entry bookkeeping entries (debit one account, credit another). These are separate services because their consistency requirements differ slightly and they scale differently. The ledger is append-only — balances are derived by summing ledger entries, never stored as a mutable balance field (which would require careful locking to update).
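A minimal sketch of an append-only, double-entry ledger with derived balances. The class and field names are assumptions for illustration, and the sign convention (negative for debit, positive for credit) is one of several in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an entry cannot be modified once written
class LedgerEntry:
    transfer_id: str
    account: str
    amount_cents: int  # negative = debit, positive = credit (assumed convention)


class Ledger:
    def __init__(self):
        self._entries = []  # append-only list standing in for a table

    def transfer(self, transfer_id, from_account, to_account, amount_cents):
        """Record one movement of money as a matching debit and credit."""
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        self._entries.append(LedgerEntry(transfer_id, from_account, -amount_cents))
        self._entries.append(LedgerEntry(transfer_id, to_account, amount_cents))

    def balance(self, account: str) -> int:
        """Derive the balance by summing entries; nothing is stored mutably."""
        return sum(e.amount_cents for e in self._entries if e.account == account)

    def total(self) -> int:
        """Sum over all accounts; always zero if every debit has its credit."""
        return sum(e.amount_cents for e in self._entries)
```

Because each transfer writes a debit and an equal credit, the sum across all accounts stays at zero, which gives reconciliation a cheap invariant to check.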
Load the Payment Processing blueprint in SysSimulator. The blueprint models the full payment flow: API gateway, payment service, idempotency key store, PSP integration (simulated), PostgreSQL ledger, audit log, and a reconciliation worker.
Run at 500 TPS and observe: payment success rate, PSP call latency (simulated at 800ms average), database connection pool utilisation, and audit log write rate. At healthy load, success rate should be near 100% and connection pool should be under 70% utilised.
Inject a PSP timeout spike — simulate PSP latency jumping from 800ms to 8,000ms (PSP degradation). Watch: connection pool depth climbs as requests wait for PSP responses. When the pool exhausts, new payment requests fail immediately. The payment service is now effectively down — not because it's overloaded, but because it's waiting on an external dependency. This is the argument for async PSP calls with a queue buffer for high-volume systems.
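The queue-buffer idea can be sketched with the standard library's `queue` and `threading` modules. This is an in-process illustration only: a real deployment would use a durable queue, and `psp_charge`, `submit_payment`, and `start_workers` are hypothetical names.

```python
import queue
import threading


def submit_payment(intents: queue.Queue, payment_id: str, amount_cents: int) -> dict:
    """Request handler: enqueue the intent and return without calling the PSP."""
    intents.put({"payment_id": payment_id, "amount_cents": amount_cents})
    return {"payment_id": payment_id, "status": "PROCESSING"}


def start_workers(intents, psp_charge, results, worker_count=4):
    """Start a fixed pool of workers; the pool size caps concurrent PSP calls."""

    def worker():
        while True:
            intent = intents.get()
            try:
                if intent is None:  # sentinel: stop this worker
                    return
                try:
                    results[intent["payment_id"]] = psp_charge(intent)
                except Exception:
                    # Outcome unknown: leave it for reconciliation.
                    results[intent["payment_id"]] = "UNCERTAIN"
            finally:
                intents.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    return threads
```

With this shape, a slow PSP makes the queue grow instead of exhausting the request handlers' connection pool. The handler's response time no longer depends on PSP latency, at the cost of the user waiting for a later status update.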
Then inject a database partition and observe the ACID failure mode: payments that cannot write to the database are rejected with 503 rather than attempting the PSP charge first (which would create unrecorded charges). This is the correct failure behaviour for a payment system — fail before charging, not after.
"I'll inject a PSP latency spike — simulating a payment provider experiencing slowness, which is a real scenario that happens during high-traffic events like Black Friday when payment networks are under load."
"[inject] PSP response time jumps from 800ms to 6,000ms. Watch the database connection pool — it climbs from 35% to 87% utilisation within 30 seconds because connections are being held open waiting for PSP responses. At 92% pool utilisation, new requests start queueing. At 100%, they start failing immediately with 503s. Payment success rate drops to 41%."
"The blast radius: during PSP degradation, a large fraction of users cannot complete checkout. The critical property is what does NOT happen: no double charges, because requests that couldn't get a database connection never reached the PSP at all. No lost charges, because the idempotency key and payment record were written to the database before the PSP call — any retry will be reconciled against the existing record."
"The mitigation I'd add: an async payment queue. Instead of making the PSP call synchronously in the request handler, write the payment intent to a queue and return immediately with a 'payment processing' response. A pool of workers pulls payment intents from the queue and calls the PSP at a controlled rate. This decouples the request-response latency from the PSP latency and allows the payment service to remain responsive even during PSP slowness — at the cost of some additional processing latency for the user."
"What happens if you write to the database but the PSP call fails?" The payment record is in state INITIATED in your database, no charge has been made. On retry (with the same idempotency key), the system finds the existing INITIATED record, retries the PSP call, and updates the record. This is the correct and safe failure mode — the customer has not been charged.
"What happens if the PSP succeeds but your database write fails?" This is the dangerous case. The customer has been charged but your system has no record. The idempotency key was never written (or was rolled back). On retry, the system has no record and may attempt to charge again. Prevention: write the idempotency key and payment record to the database in a transaction that commits before the PSP call. If that write fails, reject the request — never reach the PSP. This is called "write-first" design.
"How do you handle refunds?" Refunds are new payment records with type REFUND referencing the original payment ID. They go through the same idempotency machinery. Refunds can partially fail (PSP accepts the refund but your database write fails) — same write-first design applies. The ledger records a credit entry that offsets the original debit.
"Why not use a distributed transaction (2PC) to coordinate the PSP call and database write?" Two-phase commit requires the PSP to participate in your distributed transaction protocol — it won't. External payment providers have their own transaction semantics. You cannot have a single atomic operation that spans your database and an external HTTP call. This is why write-first design and reconciliation exist: they are the practical substitute for distributed transactions that include external systems.
What is an idempotency key in payment systems?
A unique ID attached to every payment request. If the same request is retried after a timeout, the server recognises the key and returns the original result instead of processing a new charge. Without idempotency keys, network timeouts followed by client retries cause double charges.
How do you prevent double charges?
Idempotency keys with a unique database constraint. Before processing, attempt to insert the key. Success means proceed. Failure (duplicate constraint) means return the cached result. The check-and-process must be a single atomic database transaction.
Should a payment system use SQL or NoSQL?
SQL. ACID transactions are non-negotiable for financial data. NoSQL databases that trade ACID for write throughput are inappropriate for payments. The throughput requirements of payment systems (hundreds to low thousands of TPS) are well within PostgreSQL's capabilities.
How does a payment system integrate with a PSP?
The PSP's client-side SDK tokenises card data and returns a token to your backend. Your backend calls the PSP's charge API with the token. Your system never touches raw card data, which greatly reduces your PCI DSS compliance scope (it does not eliminate it). For async payment methods, the PSP sends a webhook callback when the charge settles.
How do you design a payment audit log?
Append-only, immutable rows: no updates or deletes. Every state transition is a new row. Written in the same database transaction as the payment record update. Archived to cold storage after 90 days and retained there long-term (legal and compliance requirements typically mandate 7 years).
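One way to enforce append-only behaviour at the database level is with triggers that reject updates and deletes. The sketch below uses `sqlite3` to stay self-contained; the table and function names are assumptions, and in PostgreSQL the equivalent would typically be a trigger or revoked UPDATE/DELETE privileges.

```python
import sqlite3

SCHEMA = """
CREATE TABLE audit_log (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    payment_id TEXT NOT NULL,
    event      TEXT NOT NULL
);
CREATE TRIGGER audit_log_no_update BEFORE UPDATE ON audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit_log is append-only');
END;
CREATE TRIGGER audit_log_no_delete BEFORE DELETE ON audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit_log is append-only');
END;
"""


def open_audit_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def append_event(conn, payment_id: str, event: str) -> None:
    """Every state transition becomes a new row."""
    with conn:
        conn.execute(
            "INSERT INTO audit_log (payment_id, event) VALUES (?, ?)",
            (payment_id, event),
        )


def history(conn, payment_id: str) -> list:
    """Return the payment's events in the order they were written."""
    rows = conn.execute(
        "SELECT event FROM audit_log WHERE payment_id = ? ORDER BY seq",
        (payment_id,),
    )
    return [row[0] for row in rows]
```

Any attempt to `UPDATE` or `DELETE` rows in `audit_log` raises a database error, so immutability does not depend on application code behaving correctly.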