
Part 2: Simulate Failures

In this part, you'll simulate failures to see how Temporal handles them. This demonstrates why Temporal is particularly useful for building reliable systems.


Systems fail in unpredictable ways. A seemingly harmless deployment can bring down production, a database connection can time out during peak traffic, or a third-party service can decide to have an outage. Despite our best efforts with comprehensive testing and monitoring, systems are inherently unpredictable and complex. Networks fail, servers restart unexpectedly, and dependencies we trust can become unavailable without warning.

Traditional systems aren't equipped to handle these realities. When something fails halfway through a multi-step process, you're left with partial state, inconsistent data, and the complex task of figuring out where things went wrong and how to recover. Most applications either lose progress entirely or require you to build extensive checkpointing and recovery logic.

In this tutorial, you'll see Temporal's durable execution in action by running two tests: crashing a server while it's working and fixing code problems on the fly without stopping your application.

Recover from a server crash

Unlike other solutions, Temporal is designed with failure in mind. In this part of the tutorial, you'll simulate a server crash mid-transaction and watch how Temporal helps you recover from it.

Here's the challenge: Kill your Worker process while money is being transferred. In traditional systems, this would corrupt the transaction or lose data entirely.

What We're Testing

Worker → Crash → Recovery → Success

Before You Start

What's happening behind the scenes?

Unlike many modern applications that require complex leader election processes and external databases to handle failure, Temporal automatically preserves the state of your Workflow even if the server is down. You can test this by stopping the Temporal Service while a Workflow Execution is in progress.

No data is lost when the Temporal Service goes offline. When it comes back online, the work picks up where it left off before the outage. Keep in mind that this example uses a single instance of the service running on a single machine. In a production deployment, the Temporal Service can be deployed as a cluster, spread across several machines for higher availability and increased throughput.

Instructions

Step 1: Start Your Worker

First, stop any running Worker (Ctrl+C) and start a fresh one in Terminal 2.

Worker Status: RUNNING
Workflow Status: WAITING
Terminal 2 - Worker
python run_worker.py
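
If you want to refresh your memory of what this script does, here's a minimal sketch of a run_worker.py for this tutorial. The module and class names (shared, BankingActivities, MoneyTransfer) are assumptions about the sample project, so your file may differ slightly:

import asyncio

from temporalio.client import Client
from temporalio.worker import Worker

# These imports are assumptions based on the sample project's layout.
from activities import BankingActivities
from shared import MONEY_TRANSFER_TASK_QUEUE_NAME
from workflows import MoneyTransfer


async def main() -> None:
    # Connect to the local Temporal Service you started earlier.
    client = await Client.connect("localhost:7233")
    activities = BankingActivities()

    # The Worker polls the Task Queue and runs your Workflow and Activity code.
    worker = Worker(
        client,
        task_queue=MONEY_TRANSFER_TASK_QUEUE_NAME,
        workflows=[MoneyTransfer],
        activities=[activities.withdraw, activities.deposit],
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())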

Step 2: Start the Workflow

Now in Terminal 3, start the Workflow. Check the Web UI - you'll see your Worker busy executing the Workflow and its Activities.

Worker Status: EXECUTING
Workflow Status: RUNNING
Terminal 3 - Workflow
python run_workflow.py
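
For reference, the starter script does little more than ask the Temporal Service to run the Workflow and wait for its result. Here is a minimal sketch, using the same assumed names as before (MoneyTransfer, PaymentDetails, the shared Task Queue constant, and the Workflow Id are all illustrative):

import asyncio

from temporalio.client import Client

# These imports and field names are assumptions about the sample project.
from shared import MONEY_TRANSFER_TASK_QUEUE_NAME, PaymentDetails
from workflows import MoneyTransfer


async def main() -> None:
    client = await Client.connect("localhost:7233")

    data = PaymentDetails(
        source_account="85-150",
        target_account="43-812",
        amount=250,
        reference_id="12345",
    )

    # Starts the Workflow Execution on the Task Queue and waits for the result.
    result = await client.execute_workflow(
        MoneyTransfer.run,
        data,
        id="pay-account-transfer-412",  # illustrative Workflow Id
        task_queue=MONEY_TRANSFER_TASK_QUEUE_NAME,
    )
    print(f"Result: {result}")


if __name__ == "__main__":
    asyncio.run(main())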

Step 3: Simulate the Crash

The moment of truth! Kill your Worker while it's processing the transaction.

Go back to Terminal 2 and kill the Worker with Ctrl+C.

Now jump back to the Web UI and refresh. Your Workflow still shows as "Running"!

That's the magic: the Workflow keeps running because Temporal saved its state, even though you killed the Worker.

Worker Status: CRASHED
Workflow Status: RUNNING

Step 4: Bring Your Worker Back

Restart your Worker in Terminal 2. Watch Terminal 3 - you'll see the Workflow finish up and show the result!

Worker Status: RECOVERED
Workflow Status: COMPLETED
Transaction: SUCCESS
Terminal 2 - Recovery
python run_worker.py
Tip: Try This Challenge

Try killing the Worker at different points during execution. Start the Workflow, kill the Worker during the withdrawal, then restart it. Kill it during the deposit. Each time, notice how Temporal maintains perfect state consistency.

Check the Web UI while the Worker is down and you'll see the Workflow is still "Running" even though no code is executing.

Recover from an unknown error

In this part of the tutorial, you will inject a bug into your production code, watch Temporal retry automatically, then fix the bug while the Workflow is still running. This demo application makes a call to an external service in an Activity. If that call fails due to a bug in your code, the Activity produces an error.

To test this out and see how Temporal responds, you'll simulate a bug in the deposit() Activity.

Live Debugging Flow

Bug → Retry → Fix → Success


Instructions

Step 1: Stop Your Worker

Before we can simulate a failure, we need to stop the current Worker process. This allows us to modify the Activity code safely.

In Terminal 2 (where your Worker is running), stop it with Ctrl+C.

What's happening? You're about to modify Activity code to introduce a deliberate failure. The Worker process needs to restart to pick up code changes, but the Workflow execution will continue running in Temporal's service - this separation between execution state and code is a core Temporal concept.

Step 2: Introduce the Bug

Now we'll intentionally introduce a failure in the deposit Activity to simulate real-world scenarios like network timeouts, database connection issues, or external service failures. This demonstrates how Temporal handles partial failures in multi-step processes.

Find the deposit() method and uncomment the failing line while commenting out the working line:

activities.py

@activity.defn
async def deposit(self, data: PaymentDetails) -> str:
    reference_id = f"{data.reference_id}-deposit"
    try:
        # Comment out this working line:
        # confirmation = await asyncio.to_thread(
        #     self.bank.deposit, data.target_account, data.amount, reference_id
        # )

        # Uncomment this failing line:
        confirmation = await asyncio.to_thread(
            self.bank.deposit_that_fails,
            data.target_account,
            data.amount,
            reference_id,
        )
        return confirmation
    except InvalidAccountError:
        raise
    except Exception:
        activity.logger.exception("Deposit failed")
        raise

Save your changes. You've now created a deliberate failure point in your deposit Activity. This simulates a real-world scenario where external service calls might fail intermittently.
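
If you're curious what the failing call actually does, the sample's banking client exposes a method that always raises an error. Here is a rough, purely illustrative sketch of the idea; the real class name, signature, and exception type in the sample may differ:

class IllustrativeBank:
    """Stand-in for the tutorial's banking client; not the sample's real code."""

    def deposit_that_fails(self, account: str, amount: int, reference_id: str) -> str:
        # Always raises a retryable error, simulating a broken external service,
        # so Temporal keeps retrying the deposit Activity until the code is fixed.
        raise RuntimeError("This deposit has failed.")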

Step 3: Start Worker & Observe Retry Behavior

Now let's see how Temporal handles this failure. When you start your Worker, it will execute the withdraw Activity successfully, but hit the failing deposit Activity. Instead of the entire Workflow failing permanently, Temporal will retry the failed Activity according to your retry policy.

python run_worker.py

Here's what you'll see:

  • The withdraw() Activity completes successfully
  • The deposit() Activity fails and retries automatically

Key observation: Your Workflow isn't stuck or terminated. Temporal automatically retries the failed Activity according to your configured retry policy, while maintaining the overall Workflow state. The successful withdraw Activity doesn't get re-executed - only the failed deposit Activity is retried.
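
The retry behavior you are seeing is governed by the retry policy the Workflow attaches when it schedules the Activity. As a rough illustration of what such a policy looks like in the Python SDK (the interval values and the non-retryable error name below are assumptions, not necessarily what this sample's workflows.py uses):

from datetime import timedelta

from temporalio.common import RetryPolicy

# Passed as the retry_policy= argument when the Workflow schedules an Activity.
retry_policy = RetryPolicy(
    initial_interval=timedelta(seconds=1),   # wait 1 second before the first retry
    backoff_coefficient=2.0,                 # double the wait after each failure
    maximum_interval=timedelta(seconds=30),  # cap the wait between retries
    non_retryable_error_types=["InvalidAccountError"],  # errors not worth retrying
)

Because the bug you introduced raises a retryable error, Temporal keeps scheduling new attempts with increasing back-off until the Activity succeeds or the policy gives up.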

Step 4: Fix the Bug

Here's where Temporal really shines - you can fix bugs in production code while Workflows are still executing. The Workflow state is preserved in Temporal's durable storage, so you can deploy fixes and let the retry mechanism pick up your corrected code.

Go back to activities.py and reverse the comments - comment out the failing line and uncomment the working line:

activities.py

@activity.defn
async def deposit(self, data: PaymentDetails) -> str:
    reference_id = f"{data.reference_id}-deposit"
    try:
        # Uncomment this working line:
        confirmation = await asyncio.to_thread(
            self.bank.deposit, data.target_account, data.amount, reference_id
        )

        # Comment out this failing line:
        # confirmation = await asyncio.to_thread(
        #     self.bank.deposit_that_fails,
        #     data.target_account,
        #     data.amount,
        #     reference_id,
        # )
        return confirmation
    except InvalidAccountError:
        raise
    except Exception:
        activity.logger.exception("Deposit failed")
        raise

Save your changes. You've now restored the working implementation. The key insight here is that you can deploy fixes to Activities while Workflows are still executing - Temporal will pick up your changes on the next retry attempt.

Step 5: Restart Worker

To apply your fix, you need to restart the Worker process so it picks up the code changes. Since the Workflow execution state is stored in Temporal's servers (not in your Worker process), restarting the Worker won't affect the running Workflow.

# Stop the current Worker
Ctrl+C

# Start it again with the fix
python run_worker.py

On the next retry attempt, your fixed deposit() Activity will succeed, and you'll see the completed transaction in Terminal 3:

Transfer complete.
Withdraw: {'amount': 250, 'receiver': '43-812', 'reference_id': '1f35f7c6-4376-4fb8-881a-569dfd64d472', 'sender': '85-150'}
Deposit: {'amount': 250, 'receiver': '43-812', 'reference_id': '1f35f7c6-4376-4fb8-881a-569dfd64d472', 'sender': '85-150'}

Check the Web UI - your Workflow shows as completed. You've just demonstrated Temporal's key differentiator: the ability to fix production bugs in running applications without losing transaction state or progress. This is possible because Temporal stores execution state separately from your application code.

Mission Accomplished. You have just fixed a bug in a running application without losing the state of the Workflow or restarting the transaction.

Advanced Challenge

Try this advanced scenario of compensating transactions; a sketch of the possible changes appears after the list.

  1. Modify the retry policy in workflows.py to only retry 1 time
  2. Force the deposit to fail permanently
  3. Watch the automatic refund execute
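
Here is one way those changes might look in workflows.py. This is a sketch, not the sample's exact code: the names BankingActivities, withdraw, deposit, refund, and PaymentDetails are assumptions based on this tutorial's project, and the timeout values are illustrative.

from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

with workflow.unsafe.imports_passed_through():
    from activities import BankingActivities  # assumed names
    from shared import PaymentDetails


@workflow.defn
class MoneyTransfer:
    @workflow.run
    async def run(self, data: PaymentDetails) -> str:
        # One attempt only: a failing deposit becomes a permanent failure
        # instead of being retried, which lets the compensation path run.
        retry_policy = RetryPolicy(maximum_attempts=1)

        withdraw_result = await workflow.execute_activity_method(
            BankingActivities.withdraw,
            data,
            start_to_close_timeout=timedelta(seconds=5),
            retry_policy=retry_policy,
        )

        try:
            deposit_result = await workflow.execute_activity_method(
                BankingActivities.deposit,
                data,
                start_to_close_timeout=timedelta(seconds=5),
                retry_policy=retry_policy,
            )
        except ActivityError:
            # Compensating transaction: the deposit failed permanently,
            # so put the withdrawn money back into the source account.
            await workflow.execute_activity_method(
                BankingActivities.refund,
                data,
                start_to_close_timeout=timedelta(seconds=5),
                retry_policy=retry_policy,
            )
            raise

        return f"Withdraw: {withdraw_result}, Deposit: {deposit_result}"

With maximum_attempts=1, the failing deposit raises immediately, the refund Activity runs, and the Workflow fails with the original error while the source account is made whole again.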

Knowledge Check

Test your understanding of what you just experienced:

Q: What are four of Temporal's value propositions that you learned about in this tutorial?

Answer:

  1. Temporal automatically maintains the state of your Workflow, despite crashes or even outages of the Temporal Service itself.
  2. Temporal's built-in support for retries and timeouts enables your code to overcome transient and intermittent failures.
  3. Temporal provides full visibility into the state of the Workflow Execution, and its Web UI offers a convenient way to see the details of both current and past executions.
  4. Temporal makes it possible to fix a bug in a Workflow Execution that you've already started. After updating the code and restarting the Worker, the failing Activity is retried using the code containing the bug fix, completes successfully, and execution continues with what comes next.
Q: Why do we use a shared constant for the Task Queue name?

Answer: Because the Task Queue name is specified in two different parts of the code (one that starts the Workflow and one that configures the Worker). If the values differed, the Worker would poll a different Task Queue than the one where the Workflow Execution's Tasks are placed, and the Workflow Execution would not progress.
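
For reference, a minimal sketch of that shared constant; the constant name here is an assumption, so check your own shared.py for the exact one:

# shared.py
MONEY_TRANSFER_TASK_QUEUE_NAME = "TRANSFER_MONEY_TASK_QUEUE"

# Both run_worker.py and run_workflow.py import this constant, so the Worker
# polls exactly the same Task Queue that the Workflow starter targets.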

Q: What do you have to do if you make changes to Activity code for a Workflow that is running?

Answer: Restart the Worker.
