On 20 October 2025, Amazon Web Services went down. Within minutes, Coinbase, Robinhood, MetaMask, and multiple other crypto platforms were offline. Trading froze. API integrations stopped responding.
For nearly three hours, millions of users could not access their funds or execute trades. Six months earlier, in April 2025, an AWS outage had triggered the same cascade. Separately, also in October 2025, a market sell-off triggered outages on Binance and Coinbase as $19 billion in leveraged positions were liquidated within 24 hours.
These are not hypothetical scenarios for business continuity planning. They are recent, documented events that demonstrate why Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements for crypto exchanges are materially different from those of traditional financial services.
Crypto markets run 24/7/365 with no closing bell, no circuit breakers, and no central authority to halt trading during a crisis. Every minute of downtime is a minute where traders cannot manage risk, positions are unhedged, and customer trust erodes.
This article explains what RTO and RPO mean in the context of crypto exchange operations, what the emerging regulatory and industry benchmarks are, and how to set and validate these targets using a structured Business Impact Analysis approach.
RTO and RPO: A Quick Refresher
Recovery Time Objective (RTO) is the maximum acceptable duration that a system, application, or business process can be offline after a disruption before the impact becomes unacceptable.
It answers the question: how fast must we be back up? An RTO of 15 minutes means your platform must be fully operational within 15 minutes of going down. An RTO of four hours means you can tolerate four hours of downtime before the business damage crosses a defined threshold.
Recovery Point Objective (RPO) is the maximum acceptable amount of data loss, measured in time. It answers: how much data can we afford to lose? An RPO of zero means you cannot lose any transaction data at all, which requires real-time synchronous replication.
An RPO of 15 minutes means you accept that the last 15 minutes of data before the disruption might be lost, which requires backups or replication to run at least every 15 minutes.
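To make the two definitions concrete, here is a minimal sketch that measures downtime and data loss for a single incident and compares them with the targets. The Incident class, the function names, and the timestamps are all illustrative assumptions, not taken from any real outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps for one outage (all values used below are illustrative)."""
    last_recoverable_point: datetime  # newest data that survived (backup or replica)
    failure_at: datetime              # moment the system went down
    restored_at: datetime             # moment service was fully operational again


def measure(incident: Incident) -> tuple[timedelta, timedelta]:
    """Return (downtime, data_loss_window) for an incident."""
    downtime = incident.restored_at - incident.failure_at
    data_loss = incident.failure_at - incident.last_recoverable_point
    return downtime, data_loss


def meets_targets(incident: Incident, rto: timedelta, rpo: timedelta) -> dict[str, bool]:
    """Compare measured downtime with the RTO and measured data loss with the RPO."""
    downtime, data_loss = measure(incident)
    return {"rto_met": downtime <= rto, "rpo_met": data_loss <= rpo}


# Hypothetical incident: last good backup 10 minutes before failure,
# service restored 20 minutes after failure.
incident = Incident(
    last_recoverable_point=datetime(2025, 1, 1, 11, 50),
    failure_at=datetime(2025, 1, 1, 12, 0),
    restored_at=datetime(2025, 1, 1, 12, 20),
)
result = meets_targets(incident, rto=timedelta(minutes=15), rpo=timedelta(minutes=15))
print(result)  # {'rto_met': False, 'rpo_met': True}
```

In this example the 10-minute data loss is inside the 15-minute RPO, but the 20-minute outage breaches the 15-minute RTO. The two targets are measured independently and can fail independently.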
Together, RTO and RPO form the backbone of any disaster recovery strategy. They drive infrastructure investment, backup architecture, failover design, and service-level agreements with cloud providers and custodians.
For a deeper introduction to how these metrics fit into the broader continuity lifecycle, see our guide on the IT risk management lifecycle.
Why Crypto Exchanges Need Tighter RTO/RPO Than Traditional Finance
Traditional financial markets have structural buffers that crypto does not. Stock exchanges operate during defined hours (typically 6.5 hours per day for US equities) with overnight periods for reconciliation, batch processing, and maintenance.
They have circuit breakers that automatically halt trading during extreme volatility. They have centralised clearing houses that guarantee settlement finality. Crypto has none of this.
24/7/365 operations. Crypto never closes. A disruption at 2 AM on a Sunday can carry the same trading volume and urgency as one at noon on a Tuesday.
There is no low-activity window for maintenance. Your RTO must assume worst-case timing, which means peak volatility, peak volume, and peak user demand all happening simultaneously.
Irreversible transactions. Blockchain transactions cannot be reversed. If a system failure causes a transaction error, an incorrect liquidation, or a missed withdrawal, there is no central authority to unwind it.
Data integrity is existential. An RPO greater than zero for on-chain transaction records is effectively unacceptable for any regulated exchange.
Volatility-driven demand spikes. Exchange outages disproportionately occur during the moments when uptime matters most: sharp price movements. When Bitcoin drops 15% in an hour, trading volume can surge 10x and every user tries to log in simultaneously.
The October 2025 sell-off that liquidated $19 billion in leveraged positions within 24 hours is a textbook example. Systems that cannot scale under stress fail precisely when failure costs the most.
Custody risk. Unlike a stock exchange, where a clearing house sits between the trading venue and the assets, a crypto exchange directly custodies customer funds. Downtime does not just prevent trading; it prevents customers from accessing and moving their own assets.
The reputational and legal exposure is fundamentally different. For an overview of how custody architecture affects these decisions, see our article on hot wallet vs cold wallet architecture.
Setting RTO/RPO by System Tier: A Practical Framework
Not every system in a crypto exchange requires the same recovery targets. A tiered approach allocates investment where the impact is highest. Here is how to structure it based on industry practice and regulatory expectations.
Tier 0: Trade Matching Engine and Order Book
Recommended RTO: < 5 minutes. Recommended RPO: Zero (real-time replication).
The matching engine is the core of any exchange. If it stops, trading stops. Latency costs money: high-frequency trading clients will abandon an exchange that cannot guarantee sub-millisecond order matching under normal conditions and failover within minutes during a disruption.
Active-active deployment across geographically separated data centres is the standard architectural pattern. The RPO must be zero because any loss of order state means unfilled orders, incorrect balances, or disputed trades. This requires synchronous replication with ACID transaction guarantees at scale.
Tier 1: Wallet Infrastructure and Custody Systems
Recommended RTO: < 15 minutes for hot wallets; 4-8 hours for cold storage access. Recommended RPO: Zero for transaction records; near-zero for wallet state.
Hot wallet systems must recover fast enough that withdrawal and deposit processing resumes before users perceive an extended outage. Cold storage access inherently has a longer RTO because it involves physical devices and multi-party authorisation, but this is by design rather than by failure.
The critical point is that the transaction ledger (what was sent, received, and confirmed on-chain) must have zero data loss. Any discrepancy between your internal records and the blockchain is an audit finding, a regulatory exposure, and a potential financial loss. Your data integrity risk assessment should include wallet transaction reconciliation as a critical control.
Tier 2: Customer-Facing Applications (Web, Mobile, API)
Recommended RTO: < 15 minutes. Recommended RPO: 5 minutes.
The user interface is what customers see. If the matching engine is running but the web interface is down, the exchange is effectively unavailable to retail users. API access is equally critical for institutional and algorithmic traders.
The AWS outages of 2025 demonstrated exactly this failure mode: underlying asset custody was secure, but front-end access was disrupted because the platform depended on a single cloud provider.
Multi-cloud or hybrid architecture is the mitigation. An RPO of 5 minutes is acceptable because user session data and interface state are reconstructable from backend systems.
Tier 3: KYC/AML, Compliance, and Reporting Systems
Recommended RTO: 2-4 hours. Recommended RPO: 15 minutes.
Compliance systems are important but do not need to recover at the same speed as the trading engine. A four-hour RTO is typically acceptable because regulatory reporting deadlines operate in hours or days, not seconds.
However, the RPO must be tight enough that no completed KYC verification or suspicious activity report is lost. Losing compliance data creates regulatory exposure that persists long after the technical disruption is resolved.
Your GDPR risk assessment should explicitly cover personal data stored in KYC systems, and your privacy risk assessment should evaluate recovery procedures for this data category.
Tier 4: Analytics, Reporting Dashboards, Marketing Systems
Recommended RTO: 8-24 hours. Recommended RPO: 4 hours.
Internal analytics, marketing platforms, and non-critical administrative tools can tolerate longer recovery times. These systems do not affect customer access, trading, or custody.
A 24-hour RTO and a 4-hour RPO are cost-effective for this tier and allow you to focus infrastructure investment on the tiers where downtime has material financial impact.
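The tier framework above can be captured as a small lookup table that engineering and risk teams share. The sketch below is one possible encoding, under these assumptions: where the article gives a range or a "less than" value, the upper bound is stored; Tier 1 covers hot wallets only; and the system names in SYSTEM_TIERS are a hypothetical inventory.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    name: str
    rto: timedelta  # upper bound of the recommended RTO
    rpo: timedelta  # upper bound of the recommended RPO


# Values taken from the tier recommendations above (upper bounds).
TIERS = {
    0: RecoveryTarget("Trade matching engine and order book", timedelta(minutes=5), timedelta(0)),
    1: RecoveryTarget("Hot wallet infrastructure", timedelta(minutes=15), timedelta(0)),
    2: RecoveryTarget("Customer-facing applications", timedelta(minutes=15), timedelta(minutes=5)),
    3: RecoveryTarget("KYC/AML, compliance, reporting", timedelta(hours=4), timedelta(minutes=15)),
    4: RecoveryTarget("Analytics, dashboards, marketing", timedelta(hours=24), timedelta(hours=4)),
}

# Hypothetical system inventory: system name -> tier number.
SYSTEM_TIERS = {
    "matching-engine": 0,
    "hot-wallet-signer": 1,
    "public-api": 2,
    "kyc-service": 3,
    "bi-dashboard": 4,
}


def target_for(system: str) -> RecoveryTarget:
    """Look up the recovery target that applies to a system via its tier."""
    return TIERS[SYSTEM_TIERS[system]]


def replication_mode(target: RecoveryTarget) -> str:
    """Zero RPO implies synchronous replication; anything else allows lagged copies."""
    if target.rpo == timedelta(0):
        return "synchronous replication"
    return f"asynchronous replication or backups at least every {target.rpo}"


for system in SYSTEM_TIERS:
    t = target_for(system)
    print(f"{system}: RTO {t.rto}, RPO {t.rpo}, {replication_mode(t)}")
```

Keeping the targets in one structure makes it harder for a system to acquire an undocumented or inconsistent recovery target, and gives failover drills a single source of truth to test against.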
What Regulators Expect: DORA, MiCA, NYDFS, and ISO 22301
There is no single global standard that prescribes exact RTO/RPO numbers for crypto exchanges. But several regulatory frameworks create de facto requirements.
EU DORA (Digital Operational Resilience Act). In application since 17 January 2025, DORA applies to all EU financial entities, including Crypto-Asset Service Providers (CASPs) licensed under MiCA.
DORA requires comprehensive ICT risk management frameworks, mandatory resilience testing (including threat-led penetration testing for significant entities), and incident reporting within hours of detection.
While DORA does not prescribe specific RTO/RPO numbers, it requires that recovery objectives are defined, documented, tested, and aligned with business criticality. If your exchange cannot demonstrate that it has set and validated recovery targets, you are non-compliant.
DORA also requires oversight of third-party ICT providers: if your exchange runs on AWS and AWS goes down, your regulator will ask what your contractual recovery commitments are and whether you tested them.
EU MiCA (Markets in Crypto-Assets). MiCA requires CASPs to maintain operational resilience, segregate customer assets, and demonstrate adequate safeguards for custody arrangements.
CASPs must have business continuity plans that cover the full range of disruption scenarios. The grandfathering deadline for existing operators is July 2026, making this an immediate planning concern for any exchange serving EU customers.
NYDFS BitLicense and 23 NYCRR Part 500. New York’s Department of Financial Services requires BitLicensees to maintain comprehensive books and records, implement business continuity and disaster recovery plans, and report cybersecurity incidents promptly.
Part 500 (cybersecurity regulation) mandates that covered entities maintain systems designed for the availability of their critical infrastructure.
These requirements imply near-zero RPO for transaction data and aggressive RTOs for customer-facing systems. For detailed regulatory context, see our guide on MiCA, DORA, and BitLicense regulatory requirements.
ISO 22301 (Business Continuity Management). While not crypto-specific, ISO 22301 provides the international standard framework for setting RTO and RPO. It requires organisations to conduct a Business Impact Analysis, identify critical activities and dependencies, set recovery objectives based on maximum tolerable period of disruption (MTPD), and test those objectives through exercises.
If your exchange claims ISO 22301 alignment, your RTO/RPO targets must be documented, justified by BIA, and validated through regular testing. ISO 22301 is the natural companion to ISO 27001 for exchanges that want to demonstrate both information security and operational resilience to regulators and institutional clients.
Running a BIA to Set Your RTO/RPO: Step by Step
RTO and RPO targets should not be guesses. They should emerge from a structured Business Impact Analysis that quantifies the cost of downtime and data loss for each system tier. Here is how to run the analysis.
Step 1: Identify critical business functions. For a crypto exchange, these typically include order matching and trade execution, deposit and withdrawal processing, wallet custody and key management, user authentication and account access, compliance monitoring and suspicious activity detection, and customer support ticketing.
Step 2: Map dependencies. For each critical function, document every technology dependency: servers, databases, APIs, cloud providers, third-party services (custodians, blockchain node providers, KYC vendors), and human dependencies (key holders, on-call engineers, compliance officers).
The AWS outages of 2025 showed that cloud provider concentration is itself a dependency risk. If your matching engine, wallet infrastructure, and front-end all depend on a single availability zone from a single provider, your effective RTO is whatever AWS’s RTO is, and you have no control over that.
Step 3: Quantify impact over time. For each critical function, estimate the financial, regulatory, reputational, and legal impact at defined time intervals after disruption: 5 minutes, 15 minutes, 1 hour, 4 hours, 24 hours.
For a high-volume exchange, the cost of even 5 minutes of matching engine downtime during a volatility spike can run into millions in lost trading fees, forced liquidation errors, and customer compensation claims.
Financial institutions report average downtime costs of $9.3 million per hour, and crypto exchanges with 24/7 operations face comparable or higher exposure during peak periods.
Step 4: Determine MTPD. The Maximum Tolerable Period of Disruption is the point beyond which the business cannot survive the disruption. For a Tier 0 system like the matching engine, the MTPD might be measured in minutes. For Tier 4 systems, it might be days. Your RTO must be shorter than your MTPD with a meaningful margin.
Step 5: Set RTO and RPO. Based on the impact analysis, set recovery targets for each system tier that balance cost against risk.
Document the assumptions and the financial justification for each target. Present the results to your board risk committee for approval, because RTO/RPO decisions are resource allocation decisions that carry strategic consequences.
For a structured approach to this analysis, use the five-step risk management process (identify, analyse, evaluate, treat, monitor).
Step 6: Validate through testing. An RTO of 5 minutes that has never been tested is a wish, not a target. Schedule quarterly failover drills for Tier 0 and Tier 1 systems and annual drills for Tier 2 and Tier 3. Measure actual recovery time against the target.
Document gaps and remediate. Your three lines of defence model should assign testing responsibility: 1st line (IT operations) executes the drill, 2nd line (risk and compliance) verifies results and flags gaps, 3rd line (internal audit) independently assesses whether the testing programme is adequate.
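Steps 3 to 6 can be sketched in a few lines. The example below is a simplification under stated assumptions: the impact figures are placeholders rather than data, the MTPD is taken conservatively as the last assessed interval at which cumulative impact is still tolerable, and the 50% margin between RTO and MTPD is an arbitrary illustrative choice. All function names are hypothetical.

```python
from datetime import timedelta
from typing import Optional

# Hypothetical cumulative impact estimates (USD) for one business function,
# at the intervals named in Step 3. Placeholder figures, not real data.
impact_curve = [
    (timedelta(minutes=5), 200_000),
    (timedelta(minutes=15), 900_000),
    (timedelta(hours=1), 5_000_000),
    (timedelta(hours=4), 30_000_000),
    (timedelta(hours=24), 150_000_000),
]


def estimate_mtpd(curve, tolerable_loss: float) -> Optional[timedelta]:
    """Last assessed interval whose cumulative impact is still tolerable.

    Returns None when even the first interval exceeds the tolerable loss.
    """
    mtpd = None
    for elapsed, impact in sorted(curve):
        if impact > tolerable_loss:
            break
        mtpd = elapsed
    return mtpd


def propose_rto(mtpd: timedelta, margin: float = 0.5) -> timedelta:
    """Set the RTO to a fraction of the MTPD (the 0.5 margin is an assumption)."""
    return mtpd * margin


def drill_gap(actual_recovery: timedelta, rto: timedelta) -> timedelta:
    """Amount by which a drill missed the RTO (zero if the target was met)."""
    return max(actual_recovery - rto, timedelta(0))


mtpd = estimate_mtpd(impact_curve, tolerable_loss=2_000_000)
rto = propose_rto(mtpd)
print(mtpd, rto)  # 0:15:00 0:07:30
print(drill_gap(timedelta(minutes=11), rto))  # 0:03:30
```

With a tolerable loss of $2 million, the impact is still acceptable at 15 minutes but not at one hour, so the MTPD is taken as 15 minutes and the proposed RTO as 7.5 minutes. A drill that recovers in 11 minutes would be logged as a 3.5-minute gap to remediate.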
Infrastructure Patterns That Deliver Aggressive RTO/RPO
Meeting sub-5-minute RTO and zero RPO for mission-critical crypto systems requires specific architectural patterns.
Active-active deployment. Run your matching engine and order book across two or more geographically separated data centres simultaneously.
Both handle live traffic. If one fails, the other absorbs the load with no switchover delay. This is the only architecture that reliably delivers sub-minute RTO for the trading engine.
Synchronous replication. For zero RPO, every write to the primary database must be confirmed by the secondary before the transaction is committed.
This adds latency (typically single-digit milliseconds for well-designed systems) but guarantees that no committed transaction is lost during failover. Asynchronous replication is cheaper and faster, but on failover it can lose every write made within the replication lag.
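The difference between the two replication modes can be shown with a toy in-memory model. This is a sketch of the principle only, not of any real database: the Primary and Replica classes are hypothetical, and the network, acknowledgements, and timing are reduced to plain method calls.

```python
class Replica:
    """Secondary node: holds whatever entries it has received."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def apply(self, entry: str) -> None:
        self.log.append(entry)


class Primary:
    """Primary node that replicates either synchronously or asynchronously."""

    def __init__(self, replica: Replica, synchronous: bool) -> None:
        self.replica = replica
        self.synchronous = synchronous
        self.log: list[str] = []      # entries committed on the primary
        self.pending: list[str] = []  # async only: committed but not yet shipped

    def commit(self, entry: str) -> None:
        if self.synchronous:
            # The replica confirms the write before the primary commits it.
            self.replica.apply(entry)
            self.log.append(entry)
        else:
            # The primary commits immediately; shipping happens later.
            self.log.append(entry)
            self.pending.append(entry)

    def ship_pending(self) -> None:
        """Async only: send the backlog to the replica (the replication lag closes)."""
        for entry in self.pending:
            self.replica.apply(entry)
        self.pending.clear()


def simulate(synchronous: bool) -> list[str]:
    """Commit five orders, then lose the primary; return the writes that were lost."""
    replica = Replica()
    primary = Primary(replica, synchronous)
    for i in range(1, 6):
        primary.commit(f"order-{i}")
        if i == 3:
            primary.ship_pending()  # last async batch goes out after the third write
    # The primary fails here. Only the replica's log survives.
    return [entry for entry in primary.log if entry not in replica.log]


print(simulate(synchronous=True))   # []
print(simulate(synchronous=False))  # ['order-4', 'order-5']
```

In the asynchronous run, the two orders committed after the last shipment exist only on the failed primary. That backlog is the replication lag, and it is what a non-zero RPO accepts as potential loss.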
Multi-cloud / hybrid architecture. The repeated AWS outages of 2025 make the case for multi-cloud design. Critical systems should be deployable across at least two cloud providers, or a combination of cloud and owned infrastructure.
This greatly reduces single-provider dependency risk. The trade-off is complexity and cost, but for a Tier 0 system, the cost of a three-hour outage during a $19 billion liquidation event dwarfs the cost of multi-cloud infrastructure.
Automated failover. Manual failover procedures introduce human latency. For Tier 0 systems, failover should be automated with continuous health checks and pre-validated recovery scripts. Human intervention should be limited to monitoring, verification, and exception handling.
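A minimal sketch of the health-check logic behind automated failover follows. The FailoverController class and its check_primary and promote_standby callbacks are hypothetical names; in a real deployment they would wrap your own probes and promotion scripts. Requiring several consecutive failures before promoting is a common way to avoid failing over on a single dropped probe, and is an assumption here rather than something the article prescribes.

```python
from typing import Callable


class FailoverController:
    """Promotes the standby after N consecutive failed health checks."""

    def __init__(
        self,
        check_primary: Callable[[], bool],     # returns True when the primary is healthy
        promote_standby: Callable[[], None],   # pre-validated recovery action
        failure_threshold: int = 3,
    ) -> None:
        self.check_primary = check_primary
        self.promote_standby = promote_standby
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.failed_over = False

    def tick(self) -> bool:
        """Run one health check; return True if this tick triggered failover."""
        if self.failed_over:
            return False
        if self.check_primary():
            self.consecutive_failures = 0  # a healthy probe resets the counter
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.promote_standby()
            self.failed_over = True
            return True
        return False


def worst_case_detection_seconds(check_interval_seconds: float, failure_threshold: int) -> float:
    """Approximate time to detect a failure, ignoring probe timeouts."""
    return check_interval_seconds * failure_threshold


# Scripted probe results stand in for a real health endpoint.
results = iter([True, False, True, False, False, False])
promoted: list[str] = []
controller = FailoverController(lambda: next(results), lambda: promoted.append("standby"))
triggered_on = [i for i in range(6) if controller.tick()]
print(triggered_on, promoted)  # [5] ['standby']
print(worst_case_detection_seconds(10, 3))  # 30
```

The isolated failure on the second probe does not trigger promotion; only the run of three consecutive failures does. With a 10-second check interval and a threshold of three, detection alone consumes about 30 seconds of a 5-minute RTO, so the interval and threshold should be chosen with the target in mind.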
Immutable backup architecture. For ransomware resilience, adopt the 3-2-1-1-0 backup rule: three copies, two different media, one offsite, one immutable (cannot be encrypted or deleted by an attacker), zero errors (verified through automated recovery testing). Your cybersecurity key risk indicators should track backup verification success rate and time since last successful recovery test.
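The 3-2-1-1-0 rule lends itself to an automated check against a backup inventory. The sketch below assumes a hypothetical BackupCopy record, assumes the inventory lists every copy that counts towards the rule, and treats a copy that has never been restore-tested as failing the "zero errors" condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupCopy:
    media: str                     # e.g. "disk", "object-storage", "tape"
    offsite: bool
    immutable: bool
    restore_errors: Optional[int]  # errors in the last restore test; None = never tested


def check_3_2_1_1_0(copies: list[BackupCopy]) -> dict[str, bool]:
    """Evaluate each element of the 3-2-1-1-0 rule for a backup inventory."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
        "one_immutable": any(c.immutable for c in copies),
        # An untested copy (None) does not count as error-free.
        "zero_errors": all(c.restore_errors == 0 for c in copies),
    }


# Hypothetical inventory: the tape copy has never been restore-tested.
inventory = [
    BackupCopy("disk", offsite=False, immutable=False, restore_errors=0),
    BackupCopy("object-storage", offsite=True, immutable=True, restore_errors=0),
    BackupCopy("tape", offsite=True, immutable=False, restore_errors=None),
]
report = check_3_2_1_1_0(inventory)
print(report)
print("compliant:", all(report.values()))  # compliant: False
```

The inventory satisfies the first four conditions but fails the last one because one copy has no verified restore. That is exactly the gap a backup verification success-rate indicator is meant to surface.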
Common Mistakes in Crypto Exchange RTO/RPO Planning
Setting uniform RTO/RPO across all systems. Not every system is equally critical. Applying the same aggressive targets everywhere wastes resources on low-impact systems and risks under-investing in high-impact ones. Tiering is essential.
Confusing backup with recovery. Having backups is not the same as having a tested recovery capability. Many exchanges discover during an actual incident that their backup restoration takes three times longer than assumed because the process was never tested end-to-end under realistic conditions.
Ignoring cloud provider SLAs. Many AWS service SLAs commit to 99.99% availability, which translates to roughly 52 minutes of allowable downtime per year. If your RTO for Tier 0 systems is 5 minutes, you need architecture that can withstand a full AWS region outage, not just an individual service disruption. Your cloud provider’s SLA is not your RTO.
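The arithmetic behind the 52-minute figure is worth checking for whatever availability percentage appears in your own contracts. The sketch below assumes a 365-day year and uses hypothetical helper names.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime a provider can incur while still meeting its SLA."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def budget_in_rto_multiples(availability_pct: float, rto_minutes: float) -> float:
    """How many times larger the yearly SLA downtime budget is than the RTO."""
    return allowed_downtime_minutes(availability_pct) / rto_minutes


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% availability allows {allowed_downtime_minutes(pct):.1f} minutes of downtime per year")

print(f"{budget_in_rto_multiples(99.99, 5):.1f}x the 5-minute Tier 0 RTO")
```

At 99.99%, the yearly allowance is about 52.6 minutes, roughly ten times a 5-minute RTO. A provider could spend that entire allowance in a single incident, breach your Tier 0 target many times over, and still be within its SLA.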
Neglecting the human layer. Recovery procedures that require a specific engineer to be online, or a specific key holder to travel to a vault, fail when that person is unavailable. Your BIA must document human dependencies and your BCP must include alternatives. Ensure your enterprise risk management technology practices account for the people, process, and technology dimensions of recovery.
Not testing under stress. A failover drill during quiet market conditions does not validate your RTO during a 10x volume spike. Design stress scenarios into your testing programme. The October 2025 events proved that exchanges fail when stress is highest, which is exactly when they must not fail.
Next Steps: What To Do This Quarter
First, classify every system in your exchange by tier using the framework above. Second, run a formal BIA for each tier to quantify downtime cost and data loss impact at defined intervals.
Third, set RTO and RPO targets that are justified by the BIA and approved by your board. Fourth, audit your current architecture against those targets: do you have the replication, failover, and multi-cloud capability to meet them?
Fifth, schedule your first failover drill if you have not tested recovery in the past 90 days. Sixth, review your cloud provider contracts: what are their committed SLAs, and are they contractually aligned with your RTO/RPO targets? Seventh, update your risk register with system-specific recovery targets, the controls that support them, and the residual risk if targets are missed.
The crypto exchanges that survived 2025’s outages with customer trust intact were the ones that had tested their recovery before they needed it.
The ones that lost trust and market share were the ones that assumed their infrastructure would hold without ever validating that assumption. Use the tools from quantitative risk management to translate downtime probability into financial exposure, and present it to your board in terms they can act on.
References and Further Reading
NIST: SP 800-61 Rev. 3: Incident Response Recommendations (CSF 2.0)
Coincover: How Business Continuity Planning Can Protect Institutional Crypto Assets
Debut Infotech: System Outage for Crypto Exchange: Causes, Consequences, and Prevention
CCN: AWS Outage Knocks Coinbase, Robinhood, and Others Offline (October 2025)
CoinDesk: The Protocol: AWS Outage Halts Some Crypto Apps
AWS: Establishing RPO and RTO Targets for Cloud Applications
IPC: The Financial Impact of Downtime on the Trading Floor: $9 Million/Hour
ESMA: Markets in Crypto-Assets Regulation (MiCA)

Chris Ekai is a Risk Management expert with over 10 years of experience in the field. He has a Master’s (MSc) degree in Risk Management from the University of Portsmouth and is a CPA and Finance professional. He currently works as a Content Manager at Risk Publishing, writing about Enterprise Risk Management, Business Continuity Management and Project Management.
