Part 2: Simulate Failures
In this part, you'll simulate failures to see how Temporal handles them. This demonstrates why Temporal is particularly useful for building reliable systems.
Systems fail in unpredictable ways. A seemingly harmless deployment can bring down production, a database connection can time out during peak traffic, or a third-party service can decide to have an outage. Despite our best efforts with comprehensive testing and monitoring, systems are inherently unpredictable and complex. Networks fail, servers restart unexpectedly, and dependencies we trust can become unavailable without warning.
Traditional systems aren't equipped to handle these realities. When something fails halfway through a multi-step process, you're left with partial state, inconsistent data, and the complex task of figuring out where things went wrong and how to recover. Most applications either lose progress entirely or require you to build extensive checkpointing and recovery logic.
In this tutorial, you'll see Temporal's durable execution in action by running two tests: crashing a server while it's working and fixing code problems on the fly without stopping your application.
Recover from a server crash
Unlike other solutions, Temporal is designed with failure in mind. In this part of the tutorial, you'll simulate a server crash mid-transaction and watch Temporal help you recover from it.
Here's the challenge: Kill your Worker process while money is being transferred. In traditional systems, this would corrupt the transaction or lose data entirely.
What's happening behind the scenes?
Unlike many modern applications that require complex leader election processes and external databases to handle failure, Temporal automatically preserves the state of your Workflow even if the server is down. You can test this by stopping the Temporal Service while a Workflow Execution is in progress.
No data is lost when the Temporal Service goes offline. When it comes back online, the work picks up where it left off before the outage. Keep in mind that this example uses a single instance of the service running on a single machine. In a production deployment, the Temporal Service can be deployed as a cluster, spread across several machines for higher availability and increased throughput.
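One way to picture how work can pick up where it left off is a journal of completed steps. The sketch below is a toy model in plain Python, not Temporal's implementation: `run_transfer`, `journal`, `WorkerCrash`, and the step names are hypothetical, and an in-memory list stands in for the durable storage the Temporal Service provides.

```python
# Toy model of resuming from recorded history. Not Temporal's implementation;
# every name here is hypothetical and the journal is an in-memory list.

class WorkerCrash(Exception):
    """Simulates the Worker process dying mid-transfer."""


def run_transfer(journal, executed, crash_before=None):
    """Run withdraw then deposit, skipping steps already in the journal."""
    recorded = dict(journal)  # step name -> result saved by an earlier run
    for step in ("withdraw", "deposit"):
        if step in recorded:
            continue  # completed before the crash; do not run it again
        if step == crash_before:
            raise WorkerCrash(step)
        executed.append(step)  # the side effect happens here
        journal.append((step, f"{step}-ok"))  # record the result
    return [result for _, result in journal]


journal = []   # survives the "crash" because it lives outside the Worker
executed = []  # every side effect that actually ran

try:
    run_transfer(journal, executed, crash_before="deposit")
except WorkerCrash:
    pass  # the Worker is gone; the journal still holds the withdraw result

result = run_transfer(journal, executed)  # a restarted Worker resumes
print(executed)  # ['withdraw', 'deposit']
print(result)    # ['withdraw-ok', 'deposit-ok']
```

The withdrawal runs exactly once even though the transfer function is called twice, because the second call finds its result already recorded.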
Instructions
Step 1: Start Your Worker
First, stop any running Worker (Ctrl+C) and start a fresh one in Terminal 2.
python run_worker.py
go run worker/main.go
mvn compile exec:java -Dexec.mainClass="moneytransferapp.MoneyTransferWorker"
npm run worker
./rr serve
dotnet run --project MoneyTransferWorker
bundle exec ruby worker.rb
Step 2: Start the Workflow
Now in Terminal 3, start the Workflow. Check the Web UI - you'll see your Worker busy executing the Workflow and its Activities.
python run_workflow.py
go run start/main.go
mvn compile exec:java -Dexec.mainClass="moneytransferapp.TransferApp"
npm run client
php src/transfer.php
dotnet run --project MoneyTransferClient
bundle exec ruby starter.rb
Step 3: Simulate the Crash
The moment of truth! Kill your Worker while it's processing the transaction.
Go back to Terminal 2 and kill the Worker with Ctrl+C.
Jump back to the Web UI and refresh. Your Workflow is still showing as "Running"!
That's the magic! The Workflow keeps running because Temporal saved its state, even though we killed the Worker.
Step 4: Bring Your Worker Back
Restart your Worker in Terminal 2. Watch Terminal 3 - you'll see the Workflow finish up and show the result!
python run_worker.py
go run worker/main.go
mvn compile exec:java -Dexec.mainClass="moneytransferapp.MoneyTransferWorker"
npm run worker
./rr serve
dotnet run --project MoneyTransferWorker
bundle exec ruby worker.rb
Try killing the Worker at different points during execution. Start the Workflow, kill the Worker during the withdrawal, then restart it. Kill it during the deposit. Each time, notice how the Workflow resumes from its last recorded step instead of starting over or losing the transfer.
Check the Web UI while the Worker is down and you'll see the Workflow is still "Running" even though no code is executing.
Recover from an unknown error
In this part of the tutorial, you will inject a bug into your production code, watch Temporal retry automatically, then fix the bug while the Workflow is still running. This demo application makes a call to an external service in an Activity. If that call fails due to a bug in your code, the Activity produces an error.
To test this out and see how Temporal responds, you'll simulate a bug in the Deposit Activity function or method.
Instructions
Step 1: Stop Your Worker
Before we can simulate a failure, we need to stop the current Worker process. This allows us to modify the Activity code safely.
In Terminal 2 (where your Worker is running), stop it with Ctrl+C.
What's happening? You're about to modify Activity code to introduce a deliberate failure. The Worker process needs to restart to pick up code changes, but the Workflow Execution will continue running in the Temporal Service - this separation between execution state and code is a core Temporal concept.
Step 2: Introduce the Bug
Now we'll intentionally introduce a failure in the deposit Activity to simulate real-world scenarios like network timeouts, database connection issues, or external service failures. This demonstrates how Temporal handles partial failures in multi-step processes.
Find the deposit() method and uncomment the failing line while commenting out the working line:
activities.py
@activity.defn
async def deposit(self, data: PaymentDetails) -> str:
    reference_id = f"{data.reference_id}-deposit"
    try:
        # Comment out this working line:
        # confirmation = await asyncio.to_thread(
        #     self.bank.deposit, data.target_account, data.amount, reference_id
        # )
        # Uncomment this failing line:
        confirmation = await asyncio.to_thread(
            self.bank.deposit_that_fails,
            data.target_account,
            data.amount,
            reference_id,
        )
        return confirmation
    except InvalidAccountError:
        raise
    except Exception:
        activity.logger.exception("Deposit failed")
        raise
Save your changes. You've now created a deliberate failure point in your deposit Activity. This simulates a real-world scenario where external service calls might fail intermittently.
Find the Deposit() function and uncomment the failing line while commenting out the working line:
activity.go
func Deposit(ctx context.Context, data PaymentDetails) (string, error) {
    log.Printf("Depositing $%d into account %s.\n\n",
        data.Amount,
        data.TargetAccount,
    )
    referenceID := fmt.Sprintf("%s-deposit", data.ReferenceID)
    bank := BankingService{"bank-api.example.com"}
    // Uncomment this failing line:
    confirmation, err := bank.DepositThatFails(data.TargetAccount, data.Amount, referenceID)
    // Comment out this working line:
    // confirmation, err := bank.Deposit(data.TargetAccount, data.Amount, referenceID)
    return confirmation, err
}
Save your changes. You've now created a deliberate failure point in your deposit Activity. This simulates a real-world scenario where external service calls might fail intermittently.
Find the deposit() method and change activityShouldSucceed to false:
AccountActivityImpl.java
public String deposit(PaymentDetails details) {
    // Change this to false to simulate failure:
    boolean activityShouldSucceed = false;
    // ... rest of your method
}
Save your changes. You've now created a deliberate failure point in your deposit Activity. This simulates a real-world scenario where external service calls might fail intermittently.
Find the deposit() function and uncomment the failing line while commenting out the working line:
activities.ts
export async function deposit(details: PaymentDetails): Promise<string> {
  // Comment out this working line:
  // return await bank.deposit(details.targetAccount, details.amount, details.referenceId);
  // Uncomment this failing line:
  return await bank.depositThatFails(details.targetAccount, details.amount, details.referenceId);
}
Save your changes. You've now created a deliberate failure point in your deposit Activity. This simulates a real-world scenario where external service calls might fail intermittently.
Find the deposit() method in BankingActivity.php and uncomment the failing line while commenting out the working line:
BankingActivity.php
#[\Override]
public function deposit(PaymentDetails $data): string
{
    $referenceId = $data->referenceId . "-deposit";
    try {
        // Comment out this working line:
        // $confirmation = $this->bank->deposit(
        //     $data->targetAccount,
        //     $data->amount,
        //     $referenceId,
        // );
        // Uncomment this failing line:
        $confirmation = $this->bank->depositThatFails(
            $data->targetAccount,
            $data->amount,
            $referenceId,
        );
        return $confirmation;
    } catch (InvalidAccount $e) {
        throw $e;
    } catch (\Throwable $e) {
        $this->logger->error("Deposit failed", ['exception' => $e]);
        throw $e;
    }
}
Save your changes. You've now created a deliberate failure point in your deposit Activity. This simulates a real-world scenario where external service calls might fail intermittently.
Find the DepositAsync() method and uncomment the failing line while commenting out the working block:
MoneyTransferWorker/Activities.cs
[Activity]
public static async Task<string> DepositAsync(PaymentDetails details)
{
    var bankService = new BankingService("bank2.example.com");
    Console.WriteLine($"Depositing ${details.Amount} into account {details.TargetAccount}.");
    // Uncomment this failing line:
    return await bankService.DepositThatFailsAsync(details.TargetAccount, details.Amount, details.ReferenceId);
    // Comment out this working block:
    /*
    try
    {
        return await bankService.DepositAsync(details.TargetAccount, details.Amount, details.ReferenceId);
    }
    catch (Exception ex)
    {
        throw new ApplicationFailureException("Deposit failed", ex);
    }
    */
}
Save your changes. You've now created a deliberate failure point in your deposit Activity. This simulates a real-world scenario where external service calls might fail intermittently.
Find the deposit method and uncomment the failing line that causes a divide-by-zero error:
activities.rb
def deposit(details)
  # Uncomment this line to introduce the bug:
  result = 100 / 0 # This will cause a divide-by-zero error
  # Your existing deposit logic here...
end
Save your changes. You've now created a deliberate failure point in your deposit Activity. This simulates a real-world scenario where external service calls might fail intermittently.
Step 3: Start Worker & Observe Retry Behavior
Now let's see how Temporal handles this failure. When you start your Worker, it will execute the withdraw Activity successfully, but hit the failing deposit Activity. Instead of the entire Workflow failing permanently, Temporal will retry the failed Activity according to your retry policy.
python run_worker.py
Here's what you'll see:
- The withdraw() Activity completes successfully
- The deposit() Activity fails and retries automatically
go run worker/main.go
Here's what you'll see:
- The Withdraw() Activity completes successfully
- The Deposit() Activity fails and retries automatically
Make sure your Workflow is still running in the Web UI, then start your Worker:
mvn clean install -Dorg.slf4j.simpleLogger.defaultLogLevel=info 2>/dev/null
mvn compile exec:java -Dexec.mainClass="moneytransferapp.MoneyTransferWorker" -Dorg.slf4j.simpleLogger.defaultLogLevel=warn
Here's what you'll see:
- The withdraw() Activity completes successfully
- The deposit() Activity fails and retries automatically
npm run worker
Here's what you'll see:
- The withdraw() Activity completes successfully
- The deposit() Activity fails and retries automatically
./rr serve
Here's what you'll see:
- The withdraw() Activity completes successfully
- The deposit() Activity fails and retries automatically
Check the Web UI - click on your Workflow to see the failure details and retry attempts.
dotnet run --project MoneyTransferWorker
Here's what you'll see:
- The WithdrawAsync() Activity completes successfully
- The DepositAsync() Activity fails and retries automatically
bundle exec ruby worker.rb
In another terminal, start a new Workflow:
bundle exec ruby starter.rb
Here's what you'll see:
- The withdraw Activity completes successfully
- The deposit Activity fails and retries automatically
Check the Web UI - click on your Workflow to see the failure details and retry attempts.
Key observation: Your Workflow isn't stuck or terminated. Temporal automatically retries the failed Activity according to your configured retry policy, while maintaining the overall Workflow state. The successful withdraw Activity doesn't get re-executed - only the failed deposit Activity is retried.
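The retry policy determines how long Temporal waits between attempts. As a rough illustration, the sketch below computes an exponential backoff schedule in plain Python. The parameter values (1-second initial interval, backoff coefficient of 2.0, 100-second cap) are assumed here to match Temporal's documented defaults, so check your SDK's retry policy reference before relying on them; `backoff_schedule` is a hypothetical helper, not an SDK function.

```python
def backoff_schedule(retries, initial=1.0, coefficient=2.0, maximum=100.0):
    """Return the wait in seconds before each retry (hypothetical helper).

    The interval grows by `coefficient` after every failure and is
    capped at `maximum`. The default values are assumptions.
    """
    return [min(initial * coefficient ** n, maximum) for n in range(retries)]


schedule = backoff_schedule(8)
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 100.0]
```

Under these assumptions the waits grow quickly at first and then level off, which is why a failing deposit Activity retries often in the first few seconds and then less frequently while you work on the fix.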
Step 4: Fix the Bug
Here's where Temporal really shines - you can fix bugs in production code while Workflows are still executing. The Workflow state is preserved in Temporal's durable storage, so you can deploy fixes and let the retry mechanism pick up your corrected code.
Go back to activities.py and reverse the comments - comment out the failing line and uncomment the working line:
activities.py
@activity.defn
async def deposit(self, data: PaymentDetails) -> str:
    reference_id = f"{data.reference_id}-deposit"
    try:
        # Uncomment this working line:
        confirmation = await asyncio.to_thread(
            self.bank.deposit, data.target_account, data.amount, reference_id
        )
        # Comment out this failing line:
        # confirmation = await asyncio.to_thread(
        #     self.bank.deposit_that_fails,
        #     data.target_account,
        #     data.amount,
        #     reference_id,
        # )
        return confirmation
    except InvalidAccountError:
        raise
    except Exception:
        activity.logger.exception("Deposit failed")
        raise
Go back to activity.go and reverse the comments - comment out the failing line and uncomment the working line:
activity.go
func Deposit(ctx context.Context, data PaymentDetails) (string, error) {
    log.Printf("Depositing $%d into account %s.\n\n",
        data.Amount,
        data.TargetAccount,
    )
    referenceID := fmt.Sprintf("%s-deposit", data.ReferenceID)
    bank := BankingService{"bank-api.example.com"}
    // Comment out this failing line:
    // confirmation, err := bank.DepositThatFails(data.TargetAccount, data.Amount, referenceID)
    // Uncomment this working line:
    confirmation, err := bank.Deposit(data.TargetAccount, data.Amount, referenceID)
    return confirmation, err
}
Go back to AccountActivityImpl.java and change activityShouldSucceed back to true:
AccountActivityImpl.java
public String deposit(PaymentDetails details) {
    // Change this back to true to fix the bug:
    boolean activityShouldSucceed = true;
    // ... rest of your method
}
Go back to activities.ts and reverse the comments - comment out the failing line and uncomment the working line:
activities.ts
export async function deposit(details: PaymentDetails): Promise<string> {
  // Uncomment this working line:
  return await bank.deposit(details.targetAccount, details.amount, details.referenceId);
  // Comment out this failing line:
  // return await bank.depositThatFails(details.targetAccount, details.amount, details.referenceId);
}
Go back to BankingActivity.php and reverse the comments - comment out the failing line and uncomment the working line:
BankingActivity.php
#[\Override]
public function deposit(PaymentDetails $data): string
{
    $referenceId = $data->referenceId . "-deposit";
    try {
        // Uncomment this working line:
        $confirmation = $this->bank->deposit(
            $data->targetAccount,
            $data->amount,
            $referenceId,
        );
        // Comment out this failing line:
        // $confirmation = $this->bank->depositThatFails(
        //     $data->targetAccount,
        //     $data->amount,
        //     $referenceId,
        // );
        return $confirmation;
    } catch (InvalidAccount $e) {
        throw $e;
    } catch (\Throwable $e) {
        $this->logger->error("Deposit failed", ['exception' => $e]);
        throw $e;
    }
}
Go back to Activities.cs and reverse the comments - comment out the failing line and uncomment the working block:
MoneyTransferWorker/Activities.cs
[Activity]
public static async Task<string> DepositAsync(PaymentDetails details)
{
    var bankService = new BankingService("bank2.example.com");
    Console.WriteLine($"Depositing ${details.Amount} into account {details.TargetAccount}.");
    // Comment out this failing line:
    // return await bankService.DepositThatFailsAsync(details.TargetAccount, details.Amount, details.ReferenceId);
    // Uncomment this working block:
    try
    {
        return await bankService.DepositAsync(details.TargetAccount, details.Amount, details.ReferenceId);
    }
    catch (Exception ex)
    {
        throw new ApplicationFailureException("Deposit failed", ex);
    }
}
Go back to activities.rb and comment out the failing line:
activities.rb
def deposit(details)
  # Comment out this problematic line:
  # result = 100 / 0 # This will cause a divide-by-zero error
  # Your existing deposit logic here...
end
Save your changes. You've now restored the working implementation. The key insight here is that you can deploy fixes to Activities while Workflows are still executing - Temporal will pick up your changes on the next retry attempt.
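To see why a retry can pick up corrected code, it helps to notice that a pending Activity is identified by its name and input rather than by a copy of the function. The following is a toy model in plain Python, not SDK code: `attempt`, the `task` dictionary, and both deposit functions are hypothetical stand-ins.

```python
# Toy model: the pending task stores only the Activity name and its input,
# so whichever Worker picks it up runs that Worker's current code.

def deposit_with_bug(amount):
    raise RuntimeError("simulated bug")


def deposit_fixed(amount):
    return f"deposited {amount}"


def attempt(task, registry):
    """Run one attempt; on failure the task stays pending for a retry."""
    try:
        task["result"] = registry[task["activity"]](task["input"])
        task["status"] = "completed"
    except Exception as err:
        task["failed_attempts"] += 1
        task["last_error"] = str(err)


task = {"activity": "deposit", "input": 250, "status": "pending", "failed_attempts": 0}

buggy_worker = {"deposit": deposit_with_bug}  # Worker running the buggy code
attempt(task, buggy_worker)
attempt(task, buggy_worker)

fixed_worker = {"deposit": deposit_fixed}  # restarted Worker with the fix
attempt(task, fixed_worker)
print(task["status"], task["failed_attempts"], task["result"])
# completed 2 deposited 250
```

The task itself never changes between attempts. Only the code registered under the name `deposit` does, which is the role the restarted Worker plays in the next step.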
Step 5: Restart Worker
To apply your fix, you need to restart the Worker process so it picks up the code changes. Since the Workflow execution state is stored in Temporal's servers (not in your Worker process), restarting the Worker won't affect the running Workflow.
# Stop the current Worker
Ctrl+C
# Start it again with the fix
python run_worker.py
On the next retry attempt, your fixed deposit() Activity will succeed, and you'll see the completed transaction in Terminal 3:
Transfer complete.
Withdraw: {'amount': 250, 'receiver': '43-812', 'reference_id': '1f35f7c6-4376-4fb8-881a-569dfd64d472', 'sender': '85-150'}
Deposit: {'amount': 250, 'receiver': '43-812', 'reference_id': '1f35f7c6-4376-4fb8-881a-569dfd64d472', 'sender': '85-150'}
# Stop the current Worker
Ctrl+C
# Start it again with the fix
go run worker/main.go
On the next retry attempt, your fixed Deposit() Activity will succeed, and you'll see the completed transaction in your starter terminal:
Transfer complete (transaction IDs: W1779185060, D1779185060)
# Stop the current Worker
Ctrl+C
# Start it again with the fix
mvn clean install -Dorg.slf4j.simpleLogger.defaultLogLevel=info 2>/dev/null
mvn compile exec:java -Dexec.mainClass="moneytransferapp.MoneyTransferWorker" -Dorg.slf4j.simpleLogger.defaultLogLevel=warn
On the next retry attempt, your fixed deposit() Activity will succeed:
Depositing $32 into account 872878204.
[ReferenceId: d3d9bcf0-a897-4326]
[d3d9bcf0-a897-4326] Transaction succeeded.
# Stop the current Worker
Ctrl+C
# Start it again with the fix
npm run worker
On the next retry attempt, your fixed deposit() Activity will succeed, and you'll see the completed transaction in your client terminal:
Transfer complete (transaction IDs: W3436600150, D9270097234)
# Stop the current Worker
Ctrl+C
# Start it again with the fix
./rr serve
On the next retry attempt, your fixed deposit() Activity will succeed, and you'll see the completed transaction:
Result: Transfer complete (transaction IDs: W12345, D12345)
# Stop the current Worker
Ctrl+C
# Start it again with the fix
dotnet run --project MoneyTransferWorker
On the next retry attempt, your fixed DepositAsync() Activity will succeed, and you'll see the completed transaction in your client terminal:
Workflow result: Transfer complete (transaction IDs: W-caa90e06-3a48-406d-86ff-e3e958a280f8, D-1910468b-5951-4f1d-ab51-75da5bba230b)
# Stop the current Worker
Ctrl+C
# Start it again with the fix
bundle exec ruby worker.rb
On the next retry attempt, your fixed deposit Activity will succeed, and you'll see the Workflow complete successfully.
Check the Web UI - your Workflow shows as completed. You've just demonstrated Temporal's key differentiator: the ability to fix production bugs in running applications without losing transaction state or progress. This is possible because Temporal stores execution state separately from your application code.
Mission Accomplished. You have just fixed a bug in a running application without losing the state of the Workflow or restarting the transaction.
Try this advanced scenario of compensating transactions.
- Modify the retry policy in workflows.py to only retry 1 time
- Force the deposit to fail permanently
- Watch the automatic refund execute
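The compensation pattern in this scenario can be sketched in plain Python. This is a simplified illustration, assuming the Workflow refunds the source account once the deposit has used up its attempts; `transfer`, `failing_deposit`, and the `accounts` dictionary are hypothetical and do not come from the sample project.

```python
def failing_deposit(accounts, amount):
    """Stand-in for a deposit that can never succeed."""
    raise RuntimeError("deposit rejected")


def transfer(accounts, amount, max_attempts=1):
    """Withdraw, try the deposit, and refund if every attempt fails."""
    accounts["source"] -= amount  # the withdrawal succeeds
    for _ in range(max_attempts):
        try:
            failing_deposit(accounts, amount)
            return "transfer complete"
        except RuntimeError:
            pass  # this attempt is used up; try again if any remain
    accounts["source"] += amount  # compensate: undo the withdrawal
    return "deposit failed; source account refunded"


accounts = {"source": 1000, "target": 0}
outcome = transfer(accounts, 250)
print(outcome)   # deposit failed; source account refunded
print(accounts)  # {'source': 1000, 'target': 0}
```

Limiting the attempts is what turns an endlessly retried failure into a permanent one, and the permanent failure is what triggers the refund.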
Knowledge Check
Test your understanding of what you just experienced:
Q: What are four of Temporal's value propositions that you learned about in this tutorial?
Answer:
- Temporal automatically maintains the state of your Workflow, despite crashes or even outages of the Temporal Service itself.
- Temporal's built-in support for retries and timeouts enables your code to overcome transient and intermittent failures.
- Temporal provides full visibility into the state of the Workflow Execution, and its Web UI offers a convenient way to see the details of both current and past executions.
- Temporal makes it possible to fix a bug in a Workflow Execution that you've already started. After updating the code and restarting the Worker, the failing Activity is retried using the code containing the bug fix, completes successfully, and execution continues with what comes next.
Q: Why do we use a shared constant for the Task Queue name?
Answer: Because the Task Queue name is specified in two different parts of the code (the first starts the Workflow and the second configures the Worker). If their values differ, the Worker and Temporal Service would not share the same Task Queue, and the Workflow Execution would not progress.
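A small simulation shows what a mismatch looks like. It is a toy model in plain Python: the `queues` dictionary stands in for Task Queues held by the Temporal Service, and `start_workflow`, `poll`, and the constant's value are hypothetical.

```python
from collections import defaultdict, deque

TASK_QUEUE_NAME = "TRANSFER_MONEY_TASK_QUEUE"  # illustrative value

queues = defaultdict(deque)  # stand-in for Task Queues held by the service


def start_workflow(queue_name, workflow_id):
    """Place a task on the named queue, as starting a Workflow would."""
    queues[queue_name].append(workflow_id)


def poll(queue_name):
    """Return the next task on this queue, or None if it is empty."""
    queue = queues[queue_name]
    return queue.popleft() if queue else None


start_workflow(TASK_QUEUE_NAME, "transfer-1")
missed = poll("transfer_money_task_queue")  # mistyped literal: wrong queue
picked_up = poll(TASK_QUEUE_NAME)           # shared constant: same queue
print(missed, picked_up)  # None transfer-1
```

A mistyped name produces no error in this model. The Worker simply waits on an empty queue, which is why importing one constant in both places is the safer habit.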
Q: What do you have to do if you make changes to Activity code for a Workflow that is running?
Answer: Restart the Worker.