Example communication for server outages - SSG can use it as blueprint

**Nihria** · Aug 19 2020, 03:19 PM

In another game i'm playing (@Smartphone) there was an outage of laughable 4 hours and 40 minutes.
And this was the statement of a Staff member to this outage
Source

On 12th of August at 10:36 UTC the game started to kick players out and did not let new players in. The problems lasted for 4 hours and 40 minutes, which is by far the longest downtime we’ve ever had. I thought it might make sense to write a bit about what happened. I’ll try to keep things on a high level, but some of it will be quite technical.

Background

We are using MongoDB as our database. We are running it in a so-called replica set, where most of the work is done on the primary database, and the changes are replicated to two secondaries for high availability. Some of the changes are so important that we want the primary to wait until the change has been acknowledged by at least one of the secondaries before continuing.

One of our engineers was in the process of updating an index on the collection holding the player data. Indexes are used to speed up queries targeting the collection. MongoDB does not support updating an index directly - instead, we create a new index and then drop the old one. Creating an index takes a long time, so we were doing it in the background. At 10:20 UTC the index creation completed on the primary database. At 10:36 UTC the engineer dropped the old index, and a few seconds later both of our secondary databases became unhealthy. The primary continued to work for a little while longer but eventually became unhealthy too.

The Problem

What the engineer did not realize at the time was that the index creation starts on the secondaries only after the index has been created on the primary. Dropping an index completes immediately, so the old index was dropped from the secondaries while the creation of the new one had just started. When both secondaries became unhealthy, none of the changes could be replicated anymore. Because the primary waits for an acknowledgement from a secondary for the most important changes, the operations were stuck forever and eventually, the primary database ran out of resources.

Over the next hours, we tried various things to restore the normal functionality of our replica set. Unfortunately, it started to look less and less likely to succeed. We could have recovered the database, but we were worried about potential data loss. While we worked to restore the system, the game was online periodically, but the user experience and stability were poor. In the end, we made the decision to restore the replica set from a backup. We take regular backups, but we were fortunate that the latest backup was dated 10:33 UTC, only 3 minutes before the incident occurred.

We re-opened the servers at 15:16 UTC.

Explanation

When the game was restored, we wanted to understand what had caused the problem. Our server reads some non-critical things from the secondary databases for performance reasons so that the primary can focus on more critical things. While trying to recover the server, we worked with the assumption that the secondaries became unhealthy because they no longer had either of the indexes - one was being created, and the other one had just been deleted. On later investigation, it turned out that none of the queries that need the index are accessing the secondaries. Even though dropping the index too soon was a mistake, it should not have caused any visible issues in the game. Instead, we had encountered the following bug in MongoDB: https://jira.mongodb.org/browse/SERVER-21307 78.

Conclusion

We take downtime like this very seriously. The team already practices disaster recovery routinely to prepare us for this kind of situation, but it doesn’t mean that we couldn’t do better. We held a post mortem on the incident and created almost 20 new action points based on the findings.

I can’t promise there won’t be similar outages in the future, but we will do our very best to learn from this one and make sure it will not happen again. Thank you for your understanding and your continued support!

(Yeah, i know, i posted this here before, but then it came to my mind, i should set it to an own thread, so SSG can make it sticky and use it as blueprint for their own meesages to us)

**Polymachos** · Aug 19 2020, 03:34 PM

(asks around) how many could actually grasp this technical explanation?

(raises hand) any more?

Sure, you and I now understand why they had that outage; but I doubt that people not that tech-savvy are interested in such explanations. They just want the servers to be swearing online and running. And they won't forgive it if the dev/tech team admits to an error they made; instead they would rather call them all sorts of incompetent, and repeat that for weeks.

(I see this word used here quite often, and for some reason I always get the impression that those who talk about in-competence, have little to no idea how these systems work by themselves. How ironic)

Which makes me understand why SSG decided not to talk about the technical background of their issues. If I understood it correctly, they currently don't even seem to reside in their own product (which is a program, and server setup) but in whatever their hoster is providing. And calling out your business partner, is a no-no.

Greetings, Polymachos

**cdq1958** · Aug 20 2020, 03:00 PM

Originally Posted by Polymachos

(asks around) how many could actually grasp this technical explanation?

(raises hand) any more?

Sure, you and I now understand why they had that outage; but I doubt that people not that tech-savvy are interested in such explanations. They just want the servers to be swearing online and running. And they won't forgive it if the dev/tech team admits to an error they made; instead they would rather call them all sorts of incompetent, and repeat that for weeks.

(I see this word used here quite often, and for some reason I always get the impression that those who talk about in-competence, have little to no idea how these systems work by themselves. How ironic)

Which makes me understand why SSG decided not to talk about the technical background of their issues. If I understood it correctly, they currently don't even seem to reside in their own product (which is a program, and server setup) but in whatever their hoster is providing. And calling out your business partner, is a no-no.

Greetings, Polymachos

And, people not familiar with how the USA legal system works just don't understand how *much* control lawyers have. Key concept is: "What you don't say can't be used against you in court", so they'll only say what the lawyers let them say. I'll bet that their provider contracts include non-disclosure terms, where the lawyers deem it necessary.

**SavinDwarf** · Aug 20 2020, 03:12 PM

hmmm They were certainly transparent. But admitting to stupidity doesn't seem like a great idea.

1) Who would want to change the data base structure/rules/triggers while the system is active? Seems like a recipe for disaster.

2) MongoDB (which I always though was a non SQL document management type of data base?? but things change) if it forces you to drop and recreate an index rather than make a change is a nightmare. Who would pick that for a gaming system?

But they get 5 stars for honesty!!

**Shirewhistle** · Aug 20 2020, 03:45 PM

They have a publisher. If the publisher broke their contract by not providing the service, the lawyers may not want them to speak out on Twitter, but they will act.

It's tempting to say "None of that is our business" but it is. We paid for this service. They paid for publishing and a platform. If the fault is with a partner, Twitter isn't the right place to call them out, but lawyers will speak loudly about it.

Even if the contract has weasel wording, the service of publishing is currently not being performed. It's dragged on for longer than a month even. This is no longer a growing pain, or unforeseen issue with platform change. It's simple failure to provide agreed upon service. And as a result, they aren't providing us with service because they can't.

They trusted the wrong people.

I think we as a community should stand by SSG and encourage them to move to greener pastures if they need to, even if it means a longer downtime while they find a better partner or develop the internal capacity to self publish. I'd rather that happens, than this constant upsie-downsie. I don't think their new datacenter intends to give them the capacity they need. It's possible they aren't even able to give them that capacity. Or maybe they're just sloppy. In any case, this game deserves better.

Out of this failure, we could be free of Daybreak. I don't trust my credit information with them and it has certainly hurt SSG to be published by DBG over these years as I and others who don't trust them, didn't spend freely. If Daybreak failed, then let's be free of them, and be able to trust SSG again.

I like that communication blueprint, and if unfriendly partners were not involved, I'm sure we would have much more open communications.

**Cordovan** · Aug 20 2020, 04:09 PM

I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.

**StinkyGreene** · Aug 20 2020, 04:15 PM

I moved a lot of stuff to MongoDB in the past decade. I like it.

**Oldwiley** · Aug 20 2020, 04:17 PM

Originally Posted by Cordovan

I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.

Always been a kinda tongue in cheek forum poster who would make an offbeat comment like:

Valued employee (spilt coffee in the no coffee room) or valued contractor (Got a good contract and lawyer or union rep)

But I have to say for all my gripes and snipes over the years, you've never had more respect from me than to come on here at this moment in time and say that. +1 as they say...

Btw the situation is not as funny as the forum

**OghranNasty** · Aug 20 2020, 04:20 PM

Originally Posted by Polymachos

(asks around) how many could actually grasp this technical explanation?

(raises hand) any more?

Sure, you and I now understand why they had that outage; but I doubt that people not that tech-savvy are interested in such explanations. They just want the servers to be swearing online and running. And they won't forgive it if the dev/tech team admits to an error they made; instead they would rather call them all sorts of incompetent, and repeat that for weeks.

(I see this word used here quite often, and for some reason I always get the impression that those who talk about in-competence, have little to no idea how these systems work by themselves. How ironic)

Which makes me understand why SSG decided not to talk about the technical background of their issues. If I understood it correctly, they currently don't even seem to reside in their own product (which is a program, and server setup) but in whatever their hoster is providing. And calling out your business partner, is a no-no.

Greetings, Polymachos

Not agreed.
Even if I didn't grasp the explanation (I do), a statement like tells you that the crew indeed takes this stuff very seriously, that they care and are intent on learning lessons from it to prevent future occurrence.

And that is what I miss in SSG's response.

**Elmagor** · Aug 20 2020, 04:21 PM

Originally Posted by Cordovan

I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.

Ok, without going deep into techincal issues... this problems related to recent 10 days downtime? SSG find source of this problems? They have any progress in fixing them once and for all?

**OghranNasty** · Aug 20 2020, 04:25 PM

Originally Posted by Cordovan

I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.

But it is not the level of technical detail that makes the difference.
It is the level of commitment and care about the community that makes this stand out, and that would make such a big difference to this community.

**Nicname** · Aug 20 2020, 04:26 PM

And THAT is the kind of answer I was hoping for a looooooong time now. That is some straight up statement. And even if I'd like to know specifics, I see your point(s) and respect that.

Originally Posted by Cordovan

I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.

**LotroVidz** · Aug 20 2020, 04:27 PM

Originally Posted by Cordovan

I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it.
Let's say a technical answer along those lines would throw an employee specifically under the bus

SSG would advise saying close to nothing then? On an issue that has been ongoing for more than a month?

Don't you get the point here? Most players don't want you to explain it in such detail, there is a middle-ground. And please stop being scared of your customers, I have heard you say many times on stream that you don't dare say this and that, because you are afraid of some players getting upset.. well look at your forums now.

For a small company like yours, transparency would be a better approach. Maybe you should open your eyes just a little bit more, and you might notice that the majority of this community here WANTS information about what's going on. It's what we want, not what you want.

Ashes of creation devs understand how to communicate with their player base, they even have an active discord with 20k members and regularly talks with their players. They invite random players to ask questions etc, you can't even explain why the servers have been struggling for the past 2 months.

Note, when I say *you* I mostly speak about SSG as a company, not you personally. I don't beleive you personally has handled this good or bad, but I think your company has handled this poorly. You are not the only one who can communicate, and shouldn't be the only one to do so either.

**RagTop1** · Aug 20 2020, 04:28 PM

Originally Posted by OghranNasty

Not agreed.
Even if I didn't grasp the explanation (I do), a statement like tells you that the crew indeed takes this stuff very seriously, that they care and are intent on learning lessons from it to prevent future occurrence.

And that is what I miss in SSG's response.

You're saying, because the SSG crew won't give you all the details of the problems they are experiencing, therefore, that means the SSG crew doesn't take this stuff seriously doesn't care, and isn't intent on learning lessons from it.

What an assinine statement.

**LotroVidz** · Aug 20 2020, 04:29 PM

Originally Posted by RagTop1

You're saying, because the SSG crew won't give you all the details of the problems they are experiencing, therefore, that means the SSG crew doesn't take this stuff seriously doesn't care, and isn't intent on learning lessons from it.

What an assinine statement.

If they say nothing, we can't expect a solution either.. They need to give us information on what they are doing to prevent this from happening again, and some form of ETA. They have had many many weeks to do this! It wasn't long ago the game was down for 1week+, and now it's the same story all over.

**Oldwiley** · Aug 20 2020, 04:34 PM

Originally Posted by LotroVidz

If they say nothing, we can't expect a solution either.. They need to give us information on what they are doing to prevent this from happening again, and some form of ETA. They have had many many weeks to do this! It wasn't long ago the game was down for 1week+, and now it's the same story all over.

**RagTop1** · Aug 20 2020, 04:36 PM

Originally Posted by LotroVidz

If they say nothing, we can't expect a solution either.

Whatever the solution is, it will arrive whether they say something, or not.

Originally Posted by LotroVidz

They need to give us information on what they are doing to prevent this from happening again, and some form of ETA.

No, they don't need to tell you what they are doing... only that they are doing something. An ETA would certainly be nice, but, if it's out of their control (e.g., ISP issues), and the ISP doesn't provide SSG with an ETA, what could they possibly say, other than, "It's beyond our control, and we're working with our partners to get things fixed."?

**LotroVidz** · Aug 20 2020, 04:39 PM

Originally Posted by RagTop1

Whatever the solution is, it will arrive whether they say something, or not.

No, they don't need to tell you what they are doing... only that they are doing something. An ETA would certainly be nice, but, if it's out of their control (e.g., ISP issues), and the ISP doesn't provide SSG with an ETA, what could they possibly say, other than, "It's beyond our control, and we're working with our partners to get things fixed."?

They could say this is related to ISP issues, unfortunately we can't provide an ETA at this time.

"beyond our control" is the most unsmart thing you could say IMO.. Sure maybe after a few days of issues, not 2months! Are u serious?

**LotroVidz** · Aug 20 2020, 04:41 PM

Originally Posted by Oldwiley

Is this some kind of boomer thing, link videos instead of arguing? xD

**RagTop1** · Aug 20 2020, 04:43 PM

Originally Posted by LotroVidz

They could say this is related to ISP issues, unfortunately we can't provide an ETA at this time.

"beyond our control" is the most unsmart thing you could say IMO.. Sure maybe after a few days of issues, not 2months! Are u serious?

Ok...

If they said, "This is related to ISP issues, unfortunately we can't provide an ETA at this time.", everyone would feel better? The problems would be solved faster? The forums would be less-pessimistic, and let SSG do whatever they're going to do to address the issues?

That's a bit naive...

Let's face it: No amount of detail about the problems or solutions is going to make anyone happy, because it will do nothing to make the problems go away.

**Nonde** · Aug 20 2020, 04:44 PM

Originally Posted by Cordovan

. . . Let's say a technical answer along those lines would throw an employee specifically under the bus . . .

If not a one off I'm surprised SSG isn't throwing them under the bus as this is costing you. Have a read of Massively and MMORPG - and you're advertising atm . . .

**LotroVidz** · Aug 20 2020, 04:47 PM

Originally Posted by RagTop1

Ok...

If they said, "This is related to ISP issues, unfortunately we can't provide an ETA at this time." Everyone would feel better? The problems would be solved faster? The forums would be less-pessimistic, and let SSG do whatever they're going to do to address the issues?

That's a bit naive...

Better than saying nothing for like 24h+? I could come up with more detailed and better responses but why would I? Not my problem to convince you or anyone else. It's my personal OPINION. Some people here are happy with an official statement 24+hours to several days after the issues started to appear.

Also, we had these issues just a few weeks ago, and now it's here again.

Remember this is the feedback section.

Face it: Every single detail is better than silence.

**Oldwiley** · Aug 20 2020, 04:47 PM

Originally Posted by LotroVidz

Is this some kind of boomer thing, link videos instead of arguing? xD

Technically am Gen X, but when you've had all those arguments so many times before......

**RagTop1** · Aug 20 2020, 04:49 PM

Originally Posted by LotroVidz

Better than saying nothing for like 24h+? I could come up with more detailed and better responses but why would I? Not my problem to convince you or anyone else. It's my personal OPINION. Some people here are happy with an official statement 24+hours to several days after the issues started to appear.

Also, we had these issues just a few weeks ago, and now it's here again.

Remember this is the feedback section.

**Cordovan** · Aug 20 2020, 04:49 PM

Originally Posted by OghranNasty

But it is not the level of technical detail that makes the difference.
It is the level of commitment and care about the community that makes this stand out, and that would make such a big difference to this community.

If I may pull out the grizzled veteran card once again: This is not my first rodeo. If people don't like me, fine. I may be incompetent for a million different things, but if I am not saying something, there is probably a reason. I assure you that we would like this resolved as quickly as possible. We do care, and that is why we are here.

Thread: Example communication for server outages - SSG can use it as blueprint

Thread Tools

Search Thread

Display

Example communication for server outages - SSG can use it as blueprint

Giving downtime details to the community won't eliminate the problems any faster

Fixes will happen when they happen

Oh, is that all it would take?

Posting Permissions