Airline Application Proof of Concept
At TechFabric, we always strive to develop better, more efficient, and more robust software. Over years of experience, we have tried new techniques, frameworks, and tools to make software development more efficient and to produce better software. In this blog post, we share our experience creating a proof-of-concept application with Temporal, a framework for building robust and stable applications. We chose an airline ticket booking system as the concept because it sounded challenging, with edge cases such as ticket returns and delayed flights.
Before diving into details, let’s first go through some theory:
Retry Policies
Let's try to break down the approach to writing stable code according to the stages of its evolution:
A naive approach to calling a web service, one we might use as a starting point, is to simply create an HttpClient and call .Send on it. That’s easy, fast, and straightforward.
Later, after running into situations where a server crashed or a database went offline, developers realized that any operation could fail. A developer's next step might be to wrap all the code in a try/catch block, but we'll skip past this and move a little further, to the point where we start using tools like Polly (a .NET resilience and transient-fault-handling library that lets developers express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback). This means we don't have to think too much about what happens if one of the servers suddenly goes down for a couple of minutes: we can be sure that Polly will simply retry the operation within the defined policy. Polly solves a lot of problems and may be sufficient to make most applications stable and fault-tolerant.
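Polly itself is a .NET library. The snippet below is not its API. It is a minimal, language-neutral sketch of what a Retry policy with exponential backoff does. The `retry` helper, its parameters, and the injectable `sleep` hook (used here only so the delays can be inspected) are illustrative assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    initial_delay: float = 0.1,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    attempt = 1
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= max_attempts:
                raise              # retries exhausted: surface the last error
            sleep(delay)           # wait before the next attempt
            delay *= backoff       # 0.1s, 0.2s, 0.4s, ... with the defaults
            attempt += 1


# Usage: a hypothetical flaky call that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server unavailable")
    return "200 OK"


delays = []
result = retry(flaky_fetch, max_attempts=5, initial_delay=0.5, sleep=delays.append)
print(result, delays)  # 200 OK [0.5, 1.0]
```

The caller never sees the two transient failures. This is the property that makes a retry policy attractive for short outages, and also why it stops helping once an outage outlasts the policy.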
Event Sourcing
But what if the service suddenly goes down for a few days? What if an interruption in our storage corrupts data, and we don’t know which records are broken and which are not?
Such situations usually require a specific approach for each case. But some techniques make it easier to restore systems even in such difficult situations. One of them is called “Event Sourcing”. The idea is that every action in the system is defined as an event or a set of events. All events are stored in a dedicated event store and replayed on request to reconstruct the state of a business entity. For example, the shopping cart typically used in such examples is not stored as an object with properties like items and totalPrice. It is stored exclusively as a chain of events such as item_added, item_quantity_changed, and item_removed, and it is reconstructed every time a business process needs it. This technique is difficult to implement and has many disadvantages, but it makes the system flexible to change and makes it easy to restore the chain of events when something goes wrong.
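To make the shopping-cart example concrete, here is a minimal sketch in which the cart is never stored directly and is instead rebuilt by replaying its events. The `Event` and `Cart` types, the field names, and the in-memory list standing in for an event store are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    kind: str             # "item_added", "item_quantity_changed" or "item_removed"
    item: str
    quantity: int = 0
    unit_price: float = 0.0


@dataclass
class Cart:
    items: Dict[str, int] = field(default_factory=dict)
    prices: Dict[str, float] = field(default_factory=dict)

    @property
    def total_price(self) -> float:
        # Derived from the replayed items, never stored on its own.
        return sum(self.prices[name] * qty for name, qty in self.items.items())

    def apply(self, e: Event) -> None:
        if e.kind == "item_added":
            self.items[e.item] = self.items.get(e.item, 0) + e.quantity
            self.prices[e.item] = e.unit_price
        elif e.kind == "item_quantity_changed":
            self.items[e.item] = e.quantity
        elif e.kind == "item_removed":
            self.items.pop(e.item, None)
        else:
            raise ValueError(f"unknown event: {e.kind}")


def rebuild(events: List[Event]) -> Cart:
    """Reconstruct the cart by replaying the event stream from the beginning."""
    cart = Cart()
    for e in events:
        cart.apply(e)
    return cart


stream = [
    Event("item_added", "socks", 2, 5.0),
    Event("item_added", "hat", 1, 20.0),
    Event("item_quantity_changed", "socks", 3),
    Event("item_removed", "hat"),
]
cart = rebuild(stream)
print(cart.items, cart.total_price)  # {'socks': 3} 15.0
```

Because the events are the source of truth, replaying only a prefix of the stream (for example `rebuild(stream[:2])`) gives the cart as it was at that earlier point. That is what makes recovery and auditing easier.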
Combining retry policies with event sourcing to create a durable execution framework
With both of these in mind (Polly, with its retry policies, and Event Sourcing), it becomes clear that the next step in building solid, stable systems is to combine the two approaches: a framework that lets us define the steps, or actions, to perform and ensures that each step completes successfully by retrying it if it fails. Additionally, each step and the details of its execution are stored as an event-sourced stream. This means that at any point in time we can see the current step, see how the previous steps completed, and even replay the whole process from the beginning or from a specific step! That’s what Temporal is, and in Temporal terminology, such a process is a Workflow. Executing Workflow steps (actions) in this way is called durable execution.
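The toy runtime below illustrates this combination. It is not how Temporal is implemented. Each step's result is appended to a history, and when the workflow is run again after a failure, completed steps are replayed from the history instead of being executed again. `DurableContext`, `run_durably`, and the step names are hypothetical.

```python
from typing import Any, Callable, Dict, List


class DurableContext:
    """Toy durable-execution context: step results are appended to an event
    history; on re-execution, recorded steps return their stored result."""

    def __init__(self, history: List[Dict[str, Any]]):
        self.history = history   # the persisted event stream (a plain list here)
        self.position = 0        # how far this run has replayed into the history

    def run(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.history):
            event = self.history[self.position]
            if event["step"] != name:
                raise RuntimeError("non-deterministic workflow code")
            self.position += 1
            return event["result"]        # replay: fn is NOT called again
        result = fn()                     # first time: actually execute the step
        self.history.append({"step": name, "result": result})
        self.position += 1
        return result


def run_durably(workflow: Callable[[DurableContext], Any],
                history: List[Dict[str, Any]], max_attempts: int = 3) -> Any:
    for attempt in range(1, max_attempts + 1):
        try:
            return workflow(DurableContext(history))
        except Exception:
            if attempt == max_attempts:
                raise                     # give up and surface the failure


executed: List[str] = []                  # which step bodies really ran
payment_down = {"first_call": True}


def reserve_seat() -> str:
    executed.append("reserve_seat")
    return "12A"


def charge() -> str:
    executed.append("charge")
    if payment_down["first_call"]:        # simulate one transient outage
        payment_down["first_call"] = False
        raise RuntimeError("payment service down")
    return "payment-1"


def workflow(ctx: DurableContext) -> str:
    seat = ctx.run("reserve_seat", reserve_seat)
    payment = ctx.run("charge", charge)
    return f"{seat}/{payment}"


history: List[Dict[str, Any]] = []
outcome = run_durably(workflow, history)
print(outcome)    # 12A/payment-1
print(executed)   # ['reserve_seat', 'charge', 'charge']
```

The seat was reserved only once even though the workflow ran twice. This is the "continue from where it left off" behavior. In real Temporal, failed Activities are retried individually, and replay from history is used to rebuild Workflow state, for example after a worker restart.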
Changing The Mindset
The concept sounds great, but it introduces a completely new way of building applications.
Here, we would like to share some of our findings when building the flight ticketing system.
The first thing to do is to revisit your approach to architecture and develop the skill of mapping every business process to a Workflow.
Everything is a Workflow
A user can purchase a ticket, go to the airport, hand over the luggage, and register at the gate.
Usually, that maps to a TicketService, a UserService, and so on. But with Temporal, we have to think in terms of processes.
We know that there is such a thing as a flight. It’s scheduled for some time in the future. There should be a plane with a number of seats. People can buy tickets for those seats. At some point, registration starts, and tickets need to be verified. Later, registration closes, and the plane is ready to take off.
When designing Workflows, we start by defining their lifecycle. The most complicated part is understanding when one Workflow stops and another starts, and how to split multiple related business processes into Workflows. Should it all be one big Workflow? Or a Workflow with multiple Child Workflows?
With our airline concept, we decided to take the following approach: one “Flight” Workflow, which starts when tickets go on sale and completes when the plane lands. This turned out to be a very convenient way of tracking tickets, boarding, seats, and luggage: we have one state throughout the whole flight and can track everything in one object.
The Workflow itself is just a class marked with the [Workflow] attribute. The class stores all details in FlightDetailsModel, but for simplicity, in this article we can assume that FlightWorkflow has a state consisting of two things: its status and a collection of tickets. The most counter-intuitive thing for newcomers is that a Workflow can simply keep existing for a long time after it starts. In our case, right inside the constructor, we call Temporal methods to change the `Status` property at given points in the future. We don’t need to hand-code a state machine; we can simply assume that for all that time the Workflow will `live` somewhere in Temporal, changing its status at the defined times. Each status change can have its own rules and validations. From this perspective, Workflows are very close to domain aggregates in the way they encapsulate business logic.
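The sketch below models that idea without Temporal. The Flight Workflow's state is a status plus a list of tickets. Status changes are scheduled up front, and each status enforces its own rules. In Temporal, the scheduled changes would be durable timers (such as `Workflow.DelayAsync` in the .NET SDK), and a simulated clock passed to `advance_to` stands in for them here. The status names and method names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Tuple


class FlightStatus(Enum):
    OPEN_FOR_SALE = 1
    REGISTRATION_OPEN = 2
    REGISTRATION_CLOSED = 3
    DEPARTED = 4
    LANDED = 5


@dataclass
class FlightWorkflow:
    """Hypothetical model of the Flight Workflow state: a status plus tickets."""
    schedule: List[Tuple[datetime, FlightStatus]]
    status: FlightStatus = FlightStatus.OPEN_FOR_SALE
    tickets: List[str] = field(default_factory=list)

    def advance_to(self, now: datetime) -> None:
        # Apply every scheduled transition that is due; status only moves forward.
        for when, new_status in self.schedule:
            if when <= now and new_status.value > self.status.value:
                self.status = new_status

    def buy_ticket(self, ticket_id: str) -> None:
        # Each status has its own rules: tickets are sold only before registration.
        if self.status is not FlightStatus.OPEN_FOR_SALE:
            raise ValueError(f"cannot sell tickets while {self.status.name}")
        self.tickets.append(ticket_id)


departure = datetime(2024, 6, 1, 12, 0)
flight = FlightWorkflow(schedule=[
    (departure - timedelta(days=1), FlightStatus.REGISTRATION_OPEN),
    (departure - timedelta(hours=1), FlightStatus.REGISTRATION_CLOSED),
    (departure, FlightStatus.DEPARTED),
    (departure + timedelta(hours=3), FlightStatus.LANDED),
])
flight.buy_ticket("T-1")
flight.advance_to(departure - timedelta(hours=2))
print(flight.status.name)  # REGISTRATION_OPEN
```

Trying to call `flight.buy_ticket(...)` at this point raises `ValueError`. This is the kind of per-status validation that makes a long-lived Workflow behave like a domain aggregate.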
Note that on the diagram, the state of the Workflow just “exists” in Temporal: we neither know nor care how it is stored. Temporal simply ensures that it’s there, even if the server restarts, and the only way to access that state is to query the Workflow. At the end, however, when the Workflow finishes, we store the data in a database for archiving and reporting, and that is the only place where we interact with the database. Even that is not required; it is there only to show the very common scenario where data is needed later for reporting or analytics.
The diagram also shows “Purchase Workflows”, which communicate with our main “Flight Workflow” via Signals.
The “Purchase Workflow” is much more complicated. It should lock the tickets, get passenger details, process the payment, assign seats, and, if everything goes well, notify the Flight Workflow via a Signal that the tickets were purchased successfully. Moreover, if any step fails, its Activity should be retried, and if it still fails after a couple of retries, all previous steps should be rolled back.
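The following asyncio analogy shows the Signal relationship between the two kinds of Workflows. Each Purchase Workflow finishes its steps and then sends a message to the Flight Workflow, which waits for such messages and updates its own state. Here `asyncio.Queue` merely stands in for Temporal's Signal delivery. `TicketsPurchased`, `flight_workflow`, and `purchase_workflow` are hypothetical names, and the purchase steps are reduced to a placeholder.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class TicketsPurchased:
    """Hypothetical Signal payload sent by a Purchase Workflow."""
    purchase_id: str
    seats: List[str]


async def flight_workflow(signals: asyncio.Queue, expected: int) -> List[str]:
    # A real Flight Workflow runs until landing; this one stops after
    # `expected` Signals so the demo terminates.
    sold: List[str] = []
    for _ in range(expected):
        signal = await signals.get()      # wait until some Purchase Workflow signals
        sold.extend(signal.seats)
    return sold


async def purchase_workflow(purchase_id: str, seats: List[str],
                            flight: asyncio.Queue) -> None:
    # Lock tickets, get passenger details, process payment, assign seats...
    await asyncio.sleep(0)                # placeholder for those Activities
    await flight.put(TicketsPurchased(purchase_id, seats))  # the "Signal"


async def main() -> List[str]:
    signals: asyncio.Queue = asyncio.Queue()
    flight = asyncio.create_task(flight_workflow(signals, expected=2))
    await asyncio.gather(
        purchase_workflow("P-1", ["12A", "12B"], signals),
        purchase_workflow("P-2", ["14C"], signals),
    )
    return await flight


sold = sorted(asyncio.run(main()))
print(sold)  # ['12A', '12B', '14C']
```

The Flight Workflow never calls the Purchase Workflows, and they never touch its state directly. The only coupling is the message, which keeps each Workflow's state private.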
Again, we have to decide on the Workflow's lifetime. The Workflow starts when the user buys a ticket; that much is clear. But when does it end? Should the Workflow complete once the payment succeeds? We decided that it’s easier to handle all possible edge cases if we extend the Purchase Workflow to the point when the flight begins (the plane takes off). That way, we can make seat selection optional and still map seats to tickets when the passenger goes through the gate and selects a seat at that time. Also, if the flight is delayed, we can let people return their tickets simply by rolling back all actions using the saga pattern.
The code for PurchaseWorkflow can be found here.
It introduces a new concept: the Saga pattern. It may look complicated, but the idea is simply to gather all rollback methods into a list and, if an exception happens, execute them in reverse order.
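As a minimal sketch of the Saga idea, assume each step registers its rollback (compensation) right after it succeeds. On an exception, the collected rollbacks run in reverse order. The step names below are hypothetical stand-ins loosely inspired by the methods mentioned in this article, not the actual PurchaseWorkflow code. The final step fails the same way SaveTickets does in the failure demo.

```python
from typing import Callable, List


class Saga:
    """Collects compensations as steps succeed and runs them in reverse on failure."""

    def __init__(self) -> None:
        self._compensations: List[Callable[[], None]] = []

    def add_compensation(self, undo: Callable[[], None]) -> None:
        self._compensations.append(undo)

    def compensate(self) -> None:
        # Last completed step is undone first.
        for undo in reversed(self._compensations):
            undo()
        self._compensations.clear()


log: List[str] = []


def purchase(fail_on_save: bool) -> bool:
    saga = Saga()
    try:
        log.append("hold_money")
        saga.add_compensation(lambda: log.append("release_money"))
        log.append("book_ticket")
        saga.add_compensation(lambda: log.append("cancel_ticket"))
        if fail_on_save:
            raise IOError("database save failed")  # simulated failing last step
        log.append("save_tickets")
        return True
    except Exception:
        saga.compensate()
        return False


ok = purchase(fail_on_save=True)
print(ok, log)  # False ['hold_money', 'book_ticket', 'cancel_ticket', 'release_money']
```

Registering a compensation only after its step has succeeded means a failed step is never "undone" by mistake. In Temporal, the compensations would themselves be Activities, so they get the same retry guarantees.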
Take a look at methods like ConfirmWithdrawal, BookTicket, or HoldMoney. They are easy to understand because they simply call Activities. You can open the PurchaseActivities.cs file for implementation details. Activities look like simple methods, and we write them like regular C# methods. Activities can send Signals to other Workflows (in our case, a Signal to the Flight Workflow notifying it that particular seats have been booked), send data to third-party APIs, or store data in a database. And we, as developers, can write such code without worrying much about possible failures, because Activities are executed using the `durable execution` strategy. Temporal takes care of everything: if an Activity fails, it is retried according to the configured retry policy. The Workflow keeps track of all executed Activities. If, say, an Activity throws an exception, Temporal keeps retrying it, within the configured policy, until it succeeds. Even across code deployments! This means that if buggy code reaches production and we notice it in the Temporal dashboard, we can simply deploy a fix, and execution will continue from where it left off. Some types of exceptions fail the whole Workflow instead; more details can be found in the official GitHub README.
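The toy sketch below illustrates two points from the paragraph above. First, a failing Activity keeps being retried, so a fix "deployed" between attempts lets it succeed. Second, errors marked as non-retryable fail immediately. Temporal's retry policies do support a list of non-retryable error types, but `execute_activity`, `NonRetryableError`, and the implementation registry used to simulate a deployment are hypothetical.

```python
from typing import Callable, Dict, Tuple, Type, TypeVar

T = TypeVar("T")


class NonRetryableError(Exception):
    """Hypothetical marker for errors that should fail the workflow at once."""


def execute_activity(
    activity: Callable[[], T],
    max_attempts: int = 10,
    non_retryable: Tuple[Type[BaseException], ...] = (NonRetryableError,),
) -> T:
    attempt = 0
    while True:
        attempt += 1
        try:
            return activity()
        except non_retryable:
            raise              # never retried: the failure propagates immediately
        except Exception:
            if attempt >= max_attempts:
                raise          # retry budget exhausted
            # A real engine would back off here; the next attempt may run
            # on a worker that already has the fixed code deployed.


attempts = {"n": 0}
implementations: Dict[str, Callable[[], str]] = {}


def save_tickets_v1() -> str:
    attempts["n"] += 1
    if attempts["n"] == 2:
        implementations["save_tickets"] = save_tickets_v2  # "deploy" the fix
    raise KeyError("bug in v1")


def save_tickets_v2() -> str:
    attempts["n"] += 1
    return "saved"


implementations["save_tickets"] = save_tickets_v1
result = execute_activity(lambda: implementations["save_tickets"]())
print(result, attempts["n"])  # saved 3
```

The lambda looks the implementation up on every attempt, which is how this sketch mimics picking up newly deployed code between retries.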
Failure demo
To demonstrate how the Workflow is rolled back when an exception happens, we added an artificial exception to the PurchaseWorkflow. To trigger it, select “Error” as the airport and go through the process of purchasing a ticket. In the final step, right after the tickets are generated, the exception is thrown in the SaveTickets method in PurchaseActivities.cs, simulating a database save that failed for some reason. (Since this is an artificial case, we put the exception in the last step so that as many steps as possible are rolled back.)