  1. #1
    Join Date
    May 2015
    Posts
    172

    Example communication for server outages - SSG can use it as blueprint

    In another game I'm playing (on smartphone) there was an outage of a laughable 4 hours and 40 minutes.
    And this was a staff member's statement on that outage:
    Source

    On 12th of August at 10:36 UTC the game started to kick players out and did not let new players in. The problems lasted for 4 hours and 40 minutes, which is by far the longest downtime we’ve ever had. I thought it might make sense to write a bit about what happened. I’ll try to keep things on a high level, but some of it will be quite technical.

    Background

    We are using MongoDB as our database. We are running it in a so-called replica set, where most of the work is done on the primary database, and the changes are replicated to two secondaries for high availability. Some of the changes are so important that we want the primary to wait until the change has been acknowledged by at least one of the secondaries before continuing.
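
To make the acknowledgement rule above concrete, here is a minimal sketch in plain Python. It is a toy model, not MongoDB code: the `Member` class, the `replicated_write` function and the sample changes are all made up for illustration. MongoDB calls the real setting a write concern, and a real primary would wait for the acknowledgement rather than return `False` right away.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical stand-in for one database in the replica set."""
    name: str
    healthy: bool = True
    log: list = field(default_factory=list)


def replicated_write(primary, secondaries, change, required_secondary_acks=1):
    """Apply a change on the primary, then count acknowledgements from secondaries.

    Returns True if enough secondaries acknowledged the change.
    """
    primary.log.append(change)
    acks = 0
    for node in secondaries:
        if node.healthy:
            node.log.append(change)  # a healthy secondary replicates and acknowledges
            acks += 1
    return acks >= required_secondary_acks


primary = Member("primary")
secondaries = [Member("secondary-1"), Member("secondary-2")]

ok_all_healthy = replicated_write(primary, secondaries, "save player 42")

secondaries[0].healthy = False
ok_one_down = replicated_write(primary, secondaries, "save player 43")

secondaries[1].healthy = False
ok_both_down = replicated_write(primary, secondaries, "save player 44")
```

With one secondary down the important write is still acknowledged; with both down it never is, which is the situation the statement goes on to describe.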

    One of our engineers was in the process of updating an index on the collection holding the player data. Indexes are used to speed up queries targeting the collection. MongoDB does not support updating an index directly - instead, we create a new index and then drop the old one. Creating an index takes a long time, so we were doing it in the background. At 10:20 UTC the index creation completed on the primary database. At 10:36 UTC the engineer dropped the old index, and a few seconds later both of our secondary databases became unhealthy. The primary continued to work for a little while longer but eventually became unhealthy too.

    The Problem

    What the engineer did not realize at the time was that the index creation starts on the secondaries only after the index has been created on the primary. Dropping an index completes immediately, so the old index was dropped from the secondaries while the creation of the new one had just started. When both secondaries became unhealthy, none of the changes could be replicated anymore. Because the primary waits for an acknowledgement from a secondary for the most important changes, the operations were stuck forever and eventually, the primary database ran out of resources.
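
The ordering mistake described above can be sketched as a simple guard: only drop the old index once the new one exists on every member. This is a toy model under assumptions. The member names, the index names `old_idx` and `new_idx`, and the function `safe_to_drop_old_index` are hypothetical, and the sets stand in for whatever a real check against each database would return.

```python
def safe_to_drop_old_index(members, new_index):
    """Return True only if every replica set member already has new_index."""
    return all(new_index in indexes for indexes in members.values())


# State at 10:36 UTC as described in the statement: the primary has finished
# building the new index, the secondaries have only just started.
members = {
    "primary": {"old_idx", "new_idx"},
    "secondary-1": {"old_idx"},
    "secondary-2": {"old_idx"},
}
can_drop_at_incident = safe_to_drop_old_index(members, "new_idx")

# Later, once both secondaries have finished their own index builds.
members["secondary-1"].add("new_idx")
members["secondary-2"].add("new_idx")
can_drop_later = safe_to_drop_old_index(members, "new_idx")
```

In this model the guard refuses the drop at the time of the incident and allows it only after both secondaries have caught up.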

    Over the next hours, we tried various things to restore the normal functionality of our replica set. Unfortunately, it started to look less and less likely to succeed. We could have recovered the database, but we were worried about potential data loss. While we worked to restore the system, the game was online periodically, but the user experience and stability were poor. In the end, we made the decision to restore the replica set from a backup. We take regular backups, but we were fortunate that the latest backup was dated 10:33 UTC, only 3 minutes before the incident occurred.

    We re-opened the servers at 15:16 UTC.
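
The times quoted in the statement are consistent with each other, which a few lines of Python can confirm. The year is a placeholder assumption, since the statement only gives the day and month.

```python
from datetime import datetime, timedelta, timezone

YEAR = 2000  # placeholder assumption; the statement does not give the year

incident_start = datetime(YEAR, 8, 12, 10, 36, tzinfo=timezone.utc)
latest_backup = datetime(YEAR, 8, 12, 10, 33, tzinfo=timezone.utc)
downtime = timedelta(hours=4, minutes=40)

reopened = incident_start + downtime          # when the servers came back
backup_gap = incident_start - latest_backup   # changes made in this window are not in the backup

print(reopened.strftime("%H:%M UTC"))  # 15:16 UTC
print(backup_gap)                      # 0:03:00
```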

    Explanation

    When the game was restored, we wanted to understand what had caused the problem. Our server reads some non-critical things from the secondary databases for performance reasons so that the primary can focus on more critical things. While trying to recover the server, we worked with the assumption that the secondaries became unhealthy because they no longer had either of the indexes - one was being created, and the other one had just been deleted. On later investigation, it turned out that none of the queries that need the index are accessing the secondaries. Even though dropping the index too soon was a mistake, it should not have caused any visible issues in the game. Instead, we had encountered the following bug in MongoDB: https://jira.mongodb.org/browse/SERVER-21307.

    Conclusion

    We take downtime like this very seriously. The team already practices disaster recovery routinely to prepare us for this kind of situation, but it doesn’t mean that we couldn’t do better. We held a post mortem on the incident and created almost 20 new action points based on the findings.

    I can’t promise there won’t be similar outages in the future, but we will do our very best to learn from this one and make sure it will not happen again. Thank you for your understanding and your continued support!
    (Yeah, I know, I posted this here before, but then it occurred to me that I should give it its own thread, so SSG can make it sticky and use it as a blueprint for their own messages to us)

  2. #2
    Join Date
    Nov 2011
    Posts
    2,604
    (asks around) how many could actually grasp this technical explanation?

    (raises hand) any more?

    Sure, you and I now understand why they had that outage; but I doubt that people not that tech-savvy are interested in such explanations. They just want the servers to be swearing online and running. And they won't forgive it if the dev/tech team admits to an error they made; instead they would rather call them all sorts of incompetent, and repeat that for weeks.

    (I see this word used here quite often, and for some reason I always get the impression that those who talk about in-competence, have little to no idea how these systems work by themselves. How ironic)

    Which makes me understand why SSG decided not to talk about the technical background of their issues. If I understood it correctly, they currently don't even seem to reside in their own product (which is a program, and server setup) but in whatever their hoster is providing. And calling out your business partner, is a no-no.


    Greetings, Polymachos
    Räuberhöhle auf Belegaer, Breelandsiedlung, Ochsbott, Lange Straße 5. Vorsicht, Fallen!
    Awkward Anomalities Arena in Breeland Homesteads, 6 Long Street, Ersward (Landroval) - Elderslade under attack!

    Scared people tend to follow the flock, no matter which shepherd it has

  3. #3
    cdq1958 is offline Hero Of the Small Folk 2013
    Join Date
    Jun 2010
    Posts
    0
    Quote Originally Posted by Polymachos View Post
    (asks around) how many could actually grasp this technical explanation?

    (raises hand) any more?

    Sure, you and I now understand why they had that outage; but I doubt that people not that tech-savvy are interested in such explanations. They just want the servers to be swearing online and running. And they won't forgive it if the dev/tech team admits to an error they made; instead they would rather call them all sorts of incompetent, and repeat that for weeks.

    (I see this word used here quite often, and for some reason I always get the impression that those who talk about in-competence, have little to no idea how these systems work by themselves. How ironic)

    Which makes me understand why SSG decided not to talk about the technical background of their issues. If I understood it correctly, they currently don't even seem to reside in their own product (which is a program, and server setup) but in whatever their hoster is providing. And calling out your business partner, is a no-no.


    Greetings, Polymachos
    And, people not familiar with how the USA legal system works just don't understand how *much* control lawyers have. Key concept is: "What you don't say can't be used against you in court", so they'll only say what the lawyers let them say. I'll bet that their provider contracts include non-disclosure terms, where the lawyers deem it necessary.
    "No sadder words of tongue or pen are the words: 'Might have been'." -- John Greenleaf Whittier
    "Do or do not. There is no try." -- Yoda
    On planet Earth, there is a try.
    Indeed, in a world and life full of change, the only constant is human nature (A is A, after all :P).
    We old vets need to keep in mind those who come after us.

  4. #4
    Join Date
    Jul 2011
    Posts
    793
    hmmm They were certainly transparent. But admitting to stupidity doesn't seem like a great idea.

    1) Who would want to change the data base structure/rules/triggers while the system is active? Seems like a recipe for disaster.

    2) MongoDB (which I always thought was a non-SQL, document-management type of database?? but things change): if it forces you to drop and recreate an index rather than change it, that is a nightmare. Who would pick that for a gaming system?


    But they get 5 stars for honesty!!
    May the winds of fortune sail you,
    May you sail a gentle sea.
    May it always be the other guy
    Who says, "this drink's on me."

  5. #5
    Join Date
    Nov 2018
    Posts
    229
    They have a publisher. If the publisher broke their contract by not providing the service, the lawyers may not want them to speak out on Twitter, but they will act.

    It's tempting to say "None of that is our business" but it is. We paid for this service. They paid for publishing and a platform. If the fault is with a partner, Twitter isn't the right place to call them out, but lawyers will speak loudly about it.

    Even if the contract has weasel wording, the service of publishing is currently not being performed. It's dragged on for longer than a month even. This is no longer a growing pain, or unforeseen issue with platform change. It's simple failure to provide agreed upon service. And as a result, they aren't providing us with service because they can't.

    They trusted the wrong people.

    I think we as a community should stand by SSG and encourage them to move to greener pastures if they need to, even if it means a longer downtime while they find a better partner or develop the internal capacity to self publish. I'd rather that happens, than this constant upsie-downsie. I don't think their new datacenter intends to give them the capacity they need. It's possible they aren't even able to give them that capacity. Or maybe they're just sloppy. In any case, this game deserves better.

    Out of this failure, we could be free of Daybreak. I don't trust my credit information with them and it has certainly hurt SSG to be published by DBG over these years as I and others who don't trust them, didn't spend freely. If Daybreak failed, then let's be free of them, and be able to trust SSG again.

    I like that communication blueprint, and if unfriendly partners were not involved, I'm sure we would have much more open communications.

  6. #6
    Join Date
    Sep 2016
    Posts
    6,276
    I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.
    Community Manager, Lord of the Rings Online
    Follow LOTRO on: Twitter - Facebook - Twitch - YouTube
    Personal channels (No SSG talk): Twitch Twitter Facebook
    Support: help.standingstonegames.com

  7. #7
    Join Date
    Jan 2011
    Posts
    996
    I moved a lot of stuff to MongoDB in the past decade. I like it.

  8. #8
    Join Date
    Aug 2011
    Posts
    1,291
    Quote Originally Posted by Cordovan View Post
    I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.
    Always been a kinda tongue in cheek forum poster who would make an offbeat comment like:

    Valued employee (spilt coffee in the no coffee room) or valued contractor (Got a good contract and lawyer or union rep)

    But I have to say for all my gripes and snipes over the years, you've never had more respect from me than to come on here at this moment in time and say that. +1 as they say...

    Btw the situation is not as funny as the forum
    "Romper: You have the power to make EM less boring for yourself and everyone else. "
    "Look for your lore. But do not trust to lore, it has forsaken these lands." - Eolore prince of Lorehan

  9. #9
    Join Date
    Jun 2011
    Posts
    1,314
    Quote Originally Posted by Polymachos View Post
    (asks around) how many could actually grasp this technical explanation?

    (raises hand) any more?

    Sure, you and I now understand why they had that outage; but I doubt that people not that tech-savvy are interested in such explanations. They just want the servers to be swearing online and running. And they won't forgive it if the dev/tech team admits to an error they made; instead they would rather call them all sorts of incompetent, and repeat that for weeks.

    (I see this word used here quite often, and for some reason I always get the impression that those who talk about in-competence, have little to no idea how these systems work by themselves. How ironic)

    Which makes me understand why SSG decided not to talk about the technical background of their issues. If I understood it correctly, they currently don't even seem to reside in their own product (which is a program, and server setup) but in whatever their hoster is providing. And calling out your business partner, is a no-no.


    Greetings, Polymachos
    Not agreed.
    Even if I didn't grasp the explanation (I do), a statement like that tells you that the crew indeed takes this stuff very seriously, that they care and are intent on learning lessons from it to prevent future occurrences.

    And that is what I miss in SSG's response.

  10. #10
    Join Date
    Apr 2015
    Posts
    4,112
    Quote Originally Posted by Cordovan View Post
    I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.
    Ok, without going deep into technical issues... are these problems related to the recent 10-day downtime? Has SSG found the source of these problems? Is there any progress in fixing them once and for all?

  11. #11
    Join Date
    Jun 2011
    Posts
    1,314
    Quote Originally Posted by Cordovan View Post
    I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.
    But it is not the level of technical detail that makes the difference.
    It is the level of commitment and care about the community that makes this stand out, and that would make such a big difference to this community.

  12. #12
    Join Date
    Feb 2015
    Posts
    66
    And THAT is the kind of answer I was hoping for a looooooong time now. That is some straight up statement. And even if I'd like to know specifics, I see your point(s) and respect that.

    Quote Originally Posted by Cordovan View Post
    I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it. If that gets me some heat here, so be it; you know where I stand. Besides opening yourself up to all sorts of issues as a business entity, you are then committing to that level of detail every time you have an extended outage. Let's say a technical answer along those lines would throw an employee specifically under the bus, or a valued contractor you intend to do business with long-term. Even if technically accurate, it would be bad precedent to set. Not to mention how it'd be picked apart over the long term. It would not work for us. When I begin to think about what the above would have meant for almost ten years of messaging with this company, my head explodes.
    Warden - Nicl - lvl130
    LM - Telperinor - lvl130
    Burglar - Nicsa - lvl130
    RK - Telpinquar - lvl130

  13. #13
    Join Date
    Feb 2019
    Posts
    1,464
    Quote Originally Posted by Cordovan View Post
    I'm sorry, but even if it were theoretically possible to make that kind of statement, I would advise against it.
    Let's say a technical answer along those lines would throw an employee specifically under the bus
    SSG would advise saying close to nothing then? On an issue that has been ongoing for more than a month?

    Don't you get the point here? Most players don't want you to explain it in such detail, there is a middle-ground. And please stop being scared of your customers, I have heard you say many times on stream that you don't dare say this and that, because you are afraid of some players getting upset.. well look at your forums now.

    For a small company like yours, transparency would be a better approach. Maybe you should open your eyes just a little bit more, and you might notice that the majority of this community here WANTS information about what's going on. It's what we want, not what you want.

    Ashes of Creation devs understand how to communicate with their player base; they even have an active Discord with 20k members and regularly talk with their players. They invite random players to ask questions etc., while you can't even explain why the servers have been struggling for the past 2 months.

    Note, when I say *you* I mostly speak about SSG as a company, not you personally. I don't believe you personally have handled this well or badly, but I think your company has handled this poorly. You are not the only one who can communicate, and shouldn't be the only one to do so either.
    Last edited by LotroVidz; Aug 20 2020 at 03:52 PM.

  14. #14
    Join Date
    Jan 2007
    Posts
    21

    Giving downtime details to the community won't eliminate the problems any faster

    Quote Originally Posted by OghranNasty View Post
    Not agreed.
    Even if I didn't grasp the explanation (I do), a statement like that tells you that the crew indeed takes this stuff very seriously, that they care and are intent on learning lessons from it to prevent future occurrences.

    And that is what I miss in SSG's response.
    You're saying, because the SSG crew won't give you all the details of the problems they are experiencing, therefore, that means the SSG crew doesn't take this stuff seriously, doesn't care, and isn't intent on learning lessons from it.

    What an asinine statement.

  15. #15
    Join Date
    Feb 2019
    Posts
    1,464
    Quote Originally Posted by RagTop1 View Post
    You're saying, because the SSG crew won't give you all the details of the problems they are experiencing, therefore, that means the SSG crew doesn't take this stuff seriously, doesn't care, and isn't intent on learning lessons from it.

    What an asinine statement.
    If they say nothing, we can't expect a solution either.. They need to give us information on what they are doing to prevent this from happening again, and some form of ETA. They have had many many weeks to do this! It wasn't long ago the game was down for 1week+, and now it's the same story all over.

  16. #16
    Join Date
    Aug 2011
    Posts
    1,291
    Quote Originally Posted by LotroVidz View Post
    If they say nothing, we can't expect a solution either.. They need to give us information on what they are doing to prevent this from happening again, and some form of ETA. They have had many many weeks to do this! It wasn't long ago the game was down for 1week+, and now it's the same story all over.
    "Romper: You have the power to make EM less boring for yourself and everyone else. "
    "Look for your lore. But do not trust to lore, it has forsaken these lands." - Eolore prince of Lorehan

  17. #17
    Join Date
    Jan 2007
    Posts
    21

    Fixes will happen when they happen

    Quote Originally Posted by LotroVidz View Post
    If they say nothing, we can't expect a solution either.
    Whatever the solution is, it will arrive whether they say something, or not.

    Quote Originally Posted by LotroVidz View Post
    They need to give us information on what they are doing to prevent this from happening again, and some form of ETA.
    No, they don't need to tell you what they are doing... only that they are doing something. An ETA would certainly be nice, but, if it's out of their control (e.g., ISP issues), and the ISP doesn't provide SSG with an ETA, what could they possibly say, other than, "It's beyond our control, and we're working with our partners to get things fixed."?

  18. #18
    Join Date
    Feb 2019
    Posts
    1,464
    Quote Originally Posted by RagTop1 View Post
    Whatever the solution is, it will arrive whether they say something, or not.



    No, they don't need to tell you what they are doing... only that they are doing something. An ETA would certainly be nice, but, if it's out of their control (e.g., ISP issues), and the ISP doesn't provide SSG with an ETA, what could they possibly say, other than, "It's beyond our control, and we're working with our partners to get things fixed."?
    They could say this is related to ISP issues, unfortunately we can't provide an ETA at this time.

    "beyond our control" is the most unsmart thing you could say IMO.. Sure maybe after a few days of issues, not 2months! Are u serious?

  19. #19
    Join Date
    Feb 2019
    Posts
    1,464
    Quote Originally Posted by Oldwiley View Post
    Is this some kind of boomer thing, link videos instead of arguing? xD

  20. #20
    Join Date
    Jan 2007
    Posts
    21

    Oh, is that all it would take?

    Quote Originally Posted by LotroVidz View Post
    They could say this is related to ISP issues, unfortunately we can't provide an ETA at this time.

    "beyond our control" is the most unsmart thing you could say IMO.. Sure maybe after a few days of issues, not 2months! Are u serious?
    Ok...

    If they said, "This is related to ISP issues, unfortunately we can't provide an ETA at this time.", everyone would feel better? The problems would be solved faster? The forums would be less-pessimistic, and let SSG do whatever they're going to do to address the issues?

    That's a bit naive...

    Let's face it: No amount of detail about the problems or solutions is going to make anyone happy, because it will do nothing to make the problems go away.

  21. #21
    Join Date
    Dec 2007
    Posts
    680
    Quote Originally Posted by Cordovan View Post
    . . . Let's say a technical answer along those lines would throw an employee specifically under the bus . . .
    If not a one off I'm surprised SSG isn't throwing them under the bus as this is costing you. Have a read of Massively and MMORPG - and you're advertising atm . . .

  22. #22
    Join Date
    Feb 2019
    Posts
    1,464
    Quote Originally Posted by RagTop1 View Post
    Ok...

    If they said, "This is related to ISP issues, unfortunately we can't provide an ETA at this time." Everyone would feel better? The problems would be solved faster? The forums would be less-pessimistic, and let SSG do whatever they're going to do to address the issues?

    That's a bit naive...
    Better than saying nothing for like 24h+? I could come up with more detailed and better responses but why would I? Not my problem to convince you or anyone else. It's my personal OPINION. Some people here are happy with an official statement 24+hours to several days after the issues started to appear.

    Also, we had these issues just a few weeks ago, and now it's here again.

    Remember this is the feedback section.

    Face it: Every single detail is better than silence.

  23. #23
    Join Date
    Aug 2011
    Posts
    1,291
    Quote Originally Posted by LotroVidz View Post
    Is this some kind of boomer thing, link videos instead of arguing? xD
    Technically am Gen X, but when you've had all those arguments so many times before......
    "Romper: You have the power to make EM less boring for yourself and everyone else. "
    "Look for your lore. But do not trust to lore, it has forsaken these lands." - Eolore prince of Lorehan

  24. #24
    Join Date
    Jan 2007
    Posts
    21
    Quote Originally Posted by LotroVidz View Post
    Better than saying nothing for like 24h+? I could come up with more detailed and better responses but why would I? Not my problem to convince you or anyone else. It's my personal OPINION. Some people here are happy with an official statement 24+hours to several days after the issues started to appear.

    Also, we had these issues just a few weeks ago, and now it's here again.

    Remember this is the feedback section.

  25. #25
    Join Date
    Sep 2016
    Posts
    6,276
    Quote Originally Posted by OghranNasty View Post
    But it is not the level of technical detail that makes the difference.
    It is the level of commitment and care about the community that makes this stand out, and that would make such a big difference to this community.
    If I may pull out the grizzled veteran card once again: This is not my first rodeo. If people don't like me, fine. I may be incompetent for a million different things, but if I am not saying something, there is probably a reason. I assure you that we would like this resolved as quickly as possible. We do care, and that is why we are here.
    Community Manager, Lord of the Rings Online
    Follow LOTRO on: Twitter - Facebook - Twitch - YouTube
    Personal channels (No SSG talk): Twitch Twitter Facebook
    Support: help.standingstonegames.com

 

 