FireFall: From the Clouds
We talk to Director of Technical Operations Jeff Berube about powering Red 5's heavyweight MMO with cloud computing.
Over the last few years, cloud computing has become a marvel of the business world. By ordering computing power on demand, IT departments can eliminate the costs of commissioning servers and kitting out data centers. It also provides flexibility, supporting firms like Netflix, Spotify and Pinterest as their user bases grow.
Can cloud computing support a full-fledged MMO while still providing a stable, low-latency experience to gamers? The developer behind sci-fi shooter MMO FireFall believed that it could, and went on to convincingly prove it during an evolving, long-term beta that began in 2012. You can judge the fruits of their endeavour for yourself this weekend, as Red 5 Studios is holding an open beta from February 22nd to 25th.
After hearing that FireFall was powered by Amazon’s AWS cloud platform, I was eager to discover why developer Red 5 Studios took this bold step. In an interview, Director of Technical Operations Jeff Berube explained that all the pieces seemed to fall into place, with cloud services like AWS maturing to the point where the concept became viable. He went on to describe how it has made growing FireFall much easier, whether adding services or expanding into Europe.
Working with a small team with backgrounds in systems, databases and security, Berube is responsible for infrastructure engineering and management at Red 5. Prior to joining the studio in January 2011, he was part of the team at Origin Systems growing Ultima Online. He also spent time at Blizzard Entertainment, designing the infrastructure behind World of Warcraft, which was later used as a platform for StarCraft II.
ZAM: What led to your decision to use Amazon as a hosting platform for all of FireFall, and not just your website?
As far back as my time at Origin, I talked about finding a way to manage infrastructure based upon just two metrics: available CPU and memory in the data center. That was well before technology had caught up with what I was hoping to do. As services such as Amazon’s AWS matured, it looked like the pieces needed were finally becoming available but even I thought it would be difficult, if not impossible, to operate something as complex as an online game in a virtualized environment.
When I started at Red 5 Studios, I explained to Mark that I had this crazy idea to run everything in the cloud but I wasn’t ready to commit to it as a solution until I could prove it would work. As we built up the technology we would use to manage everything, we were surprised at just how well everything came together in the cloud. There are some really great tools available now, tools that had always needed to be created in-house, that made things both cleaner and faster to build.
We started using AWS (Amazon Web Services) for “production” the first month I was here with the launch of some of our web infrastructure. I was able to get a single instance running our full game stack up and running in AWS in late March and, working with the development team, create the first working cluster in May, if I remember correctly.
It wasn’t until we had a couple more months of experience running the game on EC2 (Elastic Compute Cloud) that I was comfortable letting Mark know that we would be able to launch in the cloud.
ZAM: What benefits does cloud hosting have, both for you as a developer and for us as gamers?
As a developer, there are a number of benefits. Probably the one we experience most frequently is that when we need to add new servers, or upgrade the ones we have, we have the ability to do so without a lengthy procurement process. This lets us prototype new architecture, test new software easily in parallel to existing systems, roll out a new feature as soon as it has passed QA, or add additional capacity to an existing service in response to player behavior. We are also able to scale services, both up and down, as the player population in a region changes.
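The population-driven scaling Berube describes can be sketched as a simple sizing rule. This is a hypothetical illustration, not Red 5's actual logic; the players-per-server capacity, headroom factor, and minimum fleet size are all invented numbers:

```python
import math

def desired_servers(concurrent_players, players_per_server=200,
                    headroom=0.25, minimum=2):
    """Pick a server count for the current player population.

    Keeps 25% spare headroom so a login spike does not degrade
    service, and never drops below a small minimum fleet. The
    200-players-per-server figure is an invented example, not a
    real FireFall capacity number.
    """
    needed = concurrent_players * (1 + headroom) / players_per_server
    return max(minimum, math.ceil(needed))
```

With a rule like this, a scheduler can periodically compare the current fleet size against `desired_servers(...)` and request or release instances from the cloud provider accordingly, which is what makes "scaling both up and down" a routine operation rather than a procurement project.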
I remember talking with our lead backend web developer right before we launched the store to the public for the first time. There were about eight minutes before the feature went live, and one of us mentioned it would probably be a good idea to have more than just a small cluster of servers backing it. By the time the first customers could access the feature, we had tripled the size of the application server pool powering it.
Adequate server capacity is an extremely important issue, both for the gamer’s experience and for the developer’s ability to manage player expectations on launch day. With the traditional approach to building online game infrastructure, you trust that player forecasts are accurate and then decide how much it is worth to the company to provide that level of capacity. It is very easy to forecast inaccurately and purchase too much or too little hardware. With too little hardware, the initial experience is ruined for players, and they may never get to see the real value of the experience you are trying to provide. With too much hardware, players are spread too thin, and a lot of money that could have gone into further development is tied up in hardware and data center colocation costs.
Cloud hosting, like AWS, provides for near limitless and rapid scalability, within reason, of course. We pay for exactly what we need and only when we need it. If we find that we have under or overestimated player interest, we can change the size of the server cluster that we use. Working with the development team, we have built a number of features that help us take advantage of the fluid nature of our “data center”.
ZAM: How do you make sure that FireFall performs as well as we expect, being able to connect and play lag-free?
Being able to provide a gameplay experience at least as good as we could provide with dedicated hardware was our top priority while first building and testing the environment in EC2. We definitely had questions early in testing about whether “lag” that was seen was due to the shared nature of the virtualized servers in the cloud or something that we had introduced in game code. (It turned out to be something we were doing.)
There are also certain things we don’t have any visibility into, like network performance outside the actual server, that need to be thought about since we don’t have direct access to those parts of the service. We spent a long time building metrics collection and visualization into both our server infrastructure and the application itself. This is a process that will continue for as long as we are developing the app or infrastructure, of course.
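A minimal sketch of the kind of in-application metrics collection described above might look like the following. The metric name and the nearest-rank percentile method are illustrative assumptions, not Red 5's implementation:

```python
import math
from collections import defaultdict

class MetricsCollector:
    """Minimal in-process metrics store: record named timing samples
    so latency percentiles can be graphed later. A stand-in for a
    real collection/visualization pipeline, not production code."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, name, value_ms):
        """Append one timing sample (milliseconds) under a metric name."""
        self.samples[name].append(value_ms)

    def percentile(self, name, p):
        """Nearest-rank percentile of all samples for a metric;
        crude but adequate for dashboard-style reporting."""
        data = sorted(self.samples[name])
        idx = max(0, math.ceil(p / 100 * len(data)) - 1)
        return data[idx]
```

Instrumenting both the game servers and the infrastructure this way is what lets a team distinguish lag introduced by their own code from variance in the shared, virtualized environment.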
ZAM: Has it all been smooth sailing, or have there been any problems or restrictions you encountered on the way? Likewise, were there any unexpected perks?
As good as things have been for us, there has definitely been a learning curve.
There have been a few instances of cloud instability that we needed to learn to deal with. We were caught up in the multi-day outage that AWS US-East experienced in April 2011. It is also possible to try to bring up an instance type that is simply unavailable because every one is in use (as happened after the outage just mentioned). Finally, there is always the possibility of your server being impacted in some way by other customers’ workloads, though I can’t think of a recent case of that. All in all, these are pretty minor annoyances with proper planning.
We’ve also learned a lot about designing for the cloud. Some “standard” practices aren’t easily done. I mentioned the difficulty of monitoring certain things, like network stability outside of your own instances. Simple failover isn’t easy because there is no concept of a VIP (virtual IP) to pass between servers. (There are ways to get similar functionality in AWS Virtual Private Cloud, but it is more complicated to set up and manage.) Also, by design, there is no built-in data sharing between regions. (This is one reason you hear of websites and the like failing when AWS experiences an outage: it is much more difficult to build infrastructure in separate regions for failover.) As long as you take the nuances of the cloud into account when designing your architecture, it is possible to work around all of these things.
On the plus side, Amazon has built a really great set of tools to help developers build their own infrastructure without a lot of the headaches of managing everything. We don’t utilize everything they offer, by any means. Some services, like RDS (Relational Database Service), provide fully managed MySQL servers, which are fantastic. (We deployed our own database servers for our production sites because we needed capabilities their service doesn’t provide at this time.) Services like S3 (Simple Storage Service, an object store) and ElastiCache (managed memcached) are excellent replacements for services you would probably need to run anyway, and they work so well that it is better to just let Amazon manage them for you.
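The read path that a managed memcached such as ElastiCache typically backs is the cache-aside pattern, which can be sketched as follows. The dict stands in for a real cache client, and the `db_lookup` callable is a hypothetical placeholder for a database query; none of this is Red 5's actual code:

```python
class CacheAside:
    """Cache-aside read path: check the cache first (ElastiCache /
    memcached in production; a plain dict here), and fall back to
    the authoritative database only on a miss."""

    def __init__(self, cache, db_lookup):
        self.cache = cache          # any mapping-like cache client
        self.db_lookup = db_lookup  # hypothetical database query
        self.misses = 0

    def get(self, key):
        value = self.cache.get(key)
        if value is None:
            self.misses += 1
            value = self.db_lookup(key)  # hit the authoritative store
            self.cache[key] = value      # real clients would set a TTL here
        return value
```

Offloading this pattern to a managed service removes the need to run and patch a memcached fleet yourself, which is the trade-off Berube is pointing at.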
ZAM: What’s it been like working with Amazon as a partner? In what way has working with them made FireFall an even better game?
We have been really impressed with everyone we have worked with at Amazon. They are extremely knowledgeable about their area of responsibility and, when appropriate, we have had the opportunity to work directly with the Engineers working on the various services they offer.
In some cases, they have been able to provide us guidance on how to best leverage the platform for the best performance. We have a pretty non-standard use case on AWS but they definitely do all they can to make sure we are getting the very best we can from the service.
ZAM: Did the platform choice help you when expanding into Europe? Will it help if you decide to launch in other areas?
Without a doubt, AWS made expanding our service into Europe much easier. There were no new contracts to sign, no hardware purchases to make, and no servers to physically install. We decided we wanted to extend the service to the region and got right to work to make it happen.
Using the tools we built up in our development testing environment and our US production environment, we were able to quickly build out the required infrastructure. We then made sure that all of our players’ character data was replicating properly and, utilizing Dyn’s Global Traffic Management service, started routing players who were closer to the new European facility to that location, without requiring them to pay to transfer their characters or start again from scratch. (Although some companies see such services as a chance to make more money, Red 5 Studios doesn’t feel it is right to hold your characters hostage.)
Finally, we build every single location in the world to the same standard. This allows us to quickly extend the service to any location where Amazon has services available. Additionally, because we work to ensure that all of your character data is available everywhere and any of our sites can provide you all the services required to power our products, in the case of a disaster we can fail the affected customers to the next closest site and they can pick up immediately from where they were before the problem was encountered.
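The "fail to the next closest site" behavior described above amounts to walking an ordered preference list until a healthy site is found. A hedged sketch, with invented region names and a made-up proximity table rather than Red 5's real topology:

```python
def pick_site(player_region, healthy_sites, proximity):
    """Return the nearest healthy site for a player.

    `proximity` maps a player's home region to candidate sites
    ordered nearest-first; because every site is built to the same
    standard and holds all character data, any of them can serve
    the player. The data below is illustrative, not real topology.
    """
    for site in proximity[player_region]:
        if site in healthy_sites:
            return site
    raise RuntimeError("no healthy site available")

# Hypothetical example: Europe prefers eu-west-1, then US sites.
PROXIMITY = {
    "eu": ["eu-west-1", "us-east-1", "us-west-1"],
    "na": ["us-east-1", "us-west-1", "eu-west-1"],
}
```

If `eu-west-1` is impaired, European players simply land on `us-east-1` and pick up where they left off, since their character data is already replicated there.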
ZAM: Do you expect that other MMOs and online games will follow your lead?
To the best of my knowledge, the original architecture that I designed for World of Warcraft had never been used to power online gaming before we built it. I’m not sure the things we were doing were being done at scale anywhere, actually. It was definitely a departure from anything I had done before. Having spoken over the last couple of years with operations and development people working on a number of big MMOs, I know they are now using a design that is pretty close to what we built in 2004.
I’m sure that, as we continue to prove that games as complex as ours are viable in the cloud, other companies are going to start taking a good look at what they are doing currently and planning to do with their future products. Also, the availability of people with experience working in the cloud, in both operations and development, will help to overcome those companies’ initial fears. I think that last part, the right people with the right knowledge, is what will determine how quickly and widely a similar design is adopted.
Fully leveraging whatever infrastructure you are building requires very close cooperation between teams within the organization. We have made a number of changes to our product to make sure that we can fully utilize the functionality provided by the cloud and manage the limitations it imposes. Some game types may just not be a good fit for the cloud. (I can’t think of any, however.)
On the other hand, if your Operations and Development teams can work together closely and each is willing to do what is best for the product, without egos, it is possible to create a fluid architecture that supports rapid iteration and “limitless” platform expansion capabilities while limiting the cost of service operation.
Also, at a certain scale, it makes sense to evaluate the expertise of your staff and whether it continues to be cost effective to operate in the cloud. Depending on the size of the infrastructure, your team’s ability to manage the required hardware, and the requirements of your product, there may come a point where it makes sense to move to an internally managed solution. If we reach that point, however, we have already committed to the ability to “burst” into the cloud, utilizing public cloud infrastructure in addition to our own, to make sure our players always have all the hardware they need for the very best experience.
Similar topics
» Firefall The Game
» Have you heard about the Firefall MGU?
» Firefall Fest Oct 22-27
» Firefall Fest Item of the day :)
» FireFall Story Introduction