IT-chef / Director IT in Energy and Utilities, 201 - 500 employees
A redundant site, as pointed out above, is very good and a hygiene factor. However, when disaster does strike, the following helps:
1. Have your data classified by priority of criticality.
2. Define the master application/system.
3. If an application/system is on-prem, know which server it is on.
4. Rank the apps on each server, so that if it goes down completely everything is traceable and you know which application/system holds critical data and has to be started before anything else.
5. Keep a matrix for reporting recovery progress at set times (4 hrs, 8 hrs, etc.), with a named person responsible for each report (who is doing what, when).
Depending on the type of business, or on national security requirements, you naturally also need a communication device that works when all else is down. Also, actually train your staff for this, as it will happen, in varying degrees of severity.
IT-chef / Director IT in Energy and Utilities, 201 - 500 employees
... whoops... also have a system in place for incident reporting; analyze and learn from it to speed up your recovery.
CIO in Software, 501 - 1,000 employees
Hi. First, my assumption is that we are referring to DR and not BCP, which addresses people and operations on top of the DR, which is mostly technical.
My recommendation is to adopt the assumption that not all services are born equal, meaning you don't need to approach the whole ecosystem in the same way. This lets you set priorities both from the solution point of view and from the budget aspect. I believe the solution should be driven by the time within which you wish to recover. Then ask yourself what percentage of normal operation you are willing to accept, as DR does not mean 100% operational with the same SLA as on normal days; again, it does not happen every day, and budget is a big factor. Second, ask yourself which point in time it is acceptable to be back up at (again, not all services are born equal).
Another aspect is your SaaS solutions. Despite what many think, many of them do not have a 100% DR solution: in some cases it depends on the support tier, and in some cases there is no SLA associated with availability in a DR event. Reviewing and signing off on those third-party solutions is therefore extremely important.
Tech-wise, there are many solutions today, including SaaS ones, which I would consider. Unless you are already operating from an Active-Active ecosystem, I would consider not building one by default but exploring third-party providers instead.
Another point, and maybe the most important one: the focus is usually on backup, mirroring data, and so on, but the real service in DR is the restore, or the switch to the mirrored cluster. Therefore the most important tip I have is to test yourself one or two times a year. You can divide the services into two groups and test a different group in each drill. Have a war room, document everything, and hold a lessons-learned review after the drill, as I am sure you will find dependencies that many did not consider in the initial design phase. Define KPIs to tell the story of the DR: how much was automated versus manual, time to recover, knowledge in place.
One last thing that is very important to look at is the "back to normal" phase. It must be planned very well. Confirm that both the DR plan and the back-to-normal plan are aligned with business needs and with customer contracts, in case you have an appendix that addresses availability and service levels. Hope this was useful to you and maybe others.
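To make the drill KPIs above a bit more concrete, here is a minimal Python sketch of how one might summarise a drill. Everything in it is an assumption for illustration: the `DrillResult` fields, the service names, and the numbers are hypothetical, and the per-service target simply stands in for whatever recovery time you agreed with the business.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    service: str
    target_minutes: int   # agreed time-to-recover target (hypothetical values below)
    actual_minutes: int   # time actually measured during the drill
    automated_steps: int  # recovery steps that ran without human action
    manual_steps: int     # recovery steps someone had to perform by hand


def drill_kpis(results):
    """Summarise a drill: automation share, slowest recovery, missed targets."""
    total_steps = sum(r.automated_steps + r.manual_steps for r in results)
    automated = sum(r.automated_steps for r in results)
    missed = [r.service for r in results if r.actual_minutes > r.target_minutes]
    return {
        "automation_pct": round(100 * automated / total_steps, 1) if total_steps else 0.0,
        "slowest_recovery_minutes": max((r.actual_minutes for r in results), default=0),
        "missed_targets": missed,
    }


# Made-up example data for two services tested in one drill.
results = [
    DrillResult("billing", target_minutes=60, actual_minutes=45,
                automated_steps=8, manual_steps=2),
    DrillResult("reporting", target_minutes=240, actual_minutes=300,
                automated_steps=3, manual_steps=7),
]
print(drill_kpis(results))
```

Comparing these numbers from one drill to the next is one simple way to show whether recovery is getting faster and less manual over time.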
CTO in Software, 51 - 200 employees
When we talk about web applications and need a disaster recovery plan, we have lots of options besides doing it manually.
1. The traditional way is to keep the backups locally with you (I really don't prefer this).
2. Automating the backups.
2.1 We generally use AWS S3 for this. We have created bash scripts that back up the complete code and data and put the backup in S3 buckets in different regions, so that we can quickly redeploy from the backup in case of disaster.
2.2 For the code backup we use the code repository together with the CI/CD tool, so we at least have a complete backup of the code from the beginning, plus automated DB backups on RDS.
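As a rough illustration of the multi-region S3 backup in 2.1, here is a minimal sketch. It is written in Python rather than bash and is not the actual script described above. It assumes the AWS CLI is installed and configured; the bucket names, the `/tmp` work directory, and all function names are hypothetical. By default it only builds the commands (dry run), so you can inspect them before anything is executed.

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical buckets, one per region; replace with your own.
BACKUP_BUCKETS = {
    "us-east-1": "example-dr-backups-us-east-1",
    "eu-west-1": "example-dr-backups-eu-west-1",
}


def archive_name(app, now):
    """Timestamped archive name so older backups are not overwritten."""
    return f"{app}-{now.strftime('%Y%m%dT%H%M%SZ')}.tar.gz"


def tar_command(source_dir, archive_path):
    """Command that packs source_dir into a gzip-compressed tarball."""
    return ["tar", "-czf", archive_path, "-C", source_dir, "."]


def upload_commands(archive_path, buckets):
    """One `aws s3 cp` command per region/bucket pair."""
    key = archive_path.rsplit("/", 1)[-1]
    return [
        ["aws", "s3", "cp", archive_path, f"s3://{bucket}/{key}", "--region", region]
        for region, bucket in sorted(buckets.items())
    ]


def run_backup(app, source_dir, work_dir="/tmp", dry_run=True):
    """Build the backup commands; execute them only when dry_run is False."""
    archive = f"{work_dir}/{archive_name(app, datetime.now(timezone.utc))}"
    commands = [tar_command(source_dir, archive)]
    commands += upload_commands(archive, BACKUP_BUCKETS)
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return commands


# Dry run: print what would be executed for a hypothetical app directory.
for cmd in run_backup("shop", "/srv/shop"):
    print(" ".join(cmd))
```

Uploading the same archive to buckets in more than one region means a regional outage does not take the backups down with it; S3 cross-region replication is another way to get a similar effect.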
There are lots of tools for recovery. You can get an idea from this table: https://xebialabs.com/periodic-table-of-devops-tools/
Assistant Director IT Auditor in Education, 10,001+ employees
Test regularly, and also whenever significant changes are made to critical applications and infrastructure.
In a company like yours, it would be important to have your whole site mirrored somewhere outside your main offices.
Consider distributing your data services across at least one site per continent. That way your data services stay available, and they are faster and more responsive for your local customers everywhere.