#DBHangOps 03/19/15 -- Failing over in the moment, Old and weird bugs, and more!
Hello everybody!
Join in #DBHangOps this Thursday, March 19, 2015 at 11:00am Pacific (19:00 GMT), to participate in the discussion about:
- Failing over in the moment
- How do you recognize you need to fail over?
- When is it safe to kill -9 the server?
- Other thoughts?
- Old/Weird bugs!
- GTID for operators -- Have you set it up?
You can check out the event page at https://plus.google.com/events/ch7dvhercc2anl9knnvig02hng4 on Thursday to participate.
As always, you can still watch the #DBHangOps Twitter search, the @DBHangOps Twitter feed, or this blog post to get a link for the Google Hangout on Thursday!
See all of you on Thursday!
Failing over in the moment
- Failover stories?
- Etsy has an active<->active topology, so writes go to both masters
- Historically, most problems are hardware trouble that requires some sort of failover
- Typically set the read_only flag on a machine that's getting moved to a slave role
- Push a configuration to move traffic from one write master to another to manage failover
- In the past, had a custom heartbeat tool to set read_only flags correctly and move customer traffic automatically
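The heartbeat-plus-read_only approach described above could be sketched roughly as follows. This is not Etsy's tool: the class name, the `max_missed` threshold, and the host names are all hypothetical, and the sketch assumes writes are pinned to one master at a time. It only decides when to act and which statements to issue. Running them against real servers is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats (hypothetical helper)."""
    max_missed: int = 3  # assumed threshold, tune for your environment
    missed: int = 0

    def record(self, heartbeat_ok: bool) -> bool:
        """Record one probe result; return True once failover should start."""
        if heartbeat_ok:
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.max_missed


def role_swap_statements(old_master: str, new_master: str) -> List[Tuple[str, str]]:
    """Return (host, sql) steps, demoting the old master before promoting the new one."""
    return [
        (old_master, "SET GLOBAL read_only = ON"),
        (new_master, "SET GLOBAL read_only = OFF"),
    ]
```

Demoting first avoids a window where both sides accept writes. Note that read_only does not block accounts with the SUPER privilege, so it is a guard rail rather than a hard lock.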
Master<->Master Replication
- Master<->Master replication is a pretty common pattern for failing over quickly
- It works best if you pin reads/writes to one side
- Not wholly recommended because of the gotchas that can come up
- Potential for data drift as writes hit both masters in a topology
MySQL 5.6 with GTIDs, plus something like MySQL Fabric, reduces the need for Master<->Master replication
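As a rough illustration of why GTIDs help here: with gtid_mode enabled, a slave can be repointed at a new master without looking up binary log file names and positions. The sketch below only builds the statements. The function name and the host name in the usage line are hypothetical, and it assumes GTIDs are already enabled on every server involved.

```python
import re
from typing import List

# Conservative host name check so the value is safe to embed in SQL text.
_HOST_RE = re.compile(r"^[A-Za-z0-9.-]+$")


def repoint_statements(new_master_host: str, port: int = 3306) -> List[str]:
    """Build statements that repoint a slave using GTID auto-positioning."""
    if not _HOST_RE.match(new_master_host):
        raise ValueError("unexpected characters in host name")
    return [
        "STOP SLAVE",
        "CHANGE MASTER TO MASTER_HOST = '%s', MASTER_PORT = %d, "
        "MASTER_AUTO_POSITION = 1" % (new_master_host, int(port)),
        "START SLAVE",
    ]
```

For example, `repoint_statements("db2.example.com")` yields the three statements to run on each slave that should follow the promoted master.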
When is it safe to kill -9 the server?
- Make a judgement call. If you can't fail over the server, you may wait less time before resorting to kill -9
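One way to encode that judgement call is a grace period: ask for a clean shutdown first and only escalate when the wait runs out. The sketch below is generic process handling, not a MySQL-specific tool. The function name and `grace_seconds` parameter are hypothetical, and it assumes you hold a `subprocess.Popen` handle for the process (in practice you would more likely have a PID from a pid file).

```python
import subprocess


def stop_with_escalation(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM: request a clean shutdown
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL, the equivalent of kill -9
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"
```

How long `grace_seconds` should be is exactly the trade-off discussed above: shorter when the server is the only one that can take traffic, longer when traffic has already moved elsewhere.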
When do you fail over a server?
- Typically for maintenance
- Hardware issues
- In larger shops, this is probably more common
- Some edge case in the application that can't be reverted quickly
Other thoughts?
- Being able to set up delayed replication in MySQL for disaster recovery scenarios
- Folks historically would use tools to do this (e.g. pt-slave-delay)
- In MySQL 5.6+, you can natively do this and have more confidence about the catch-up
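For reference, native delayed replication in MySQL 5.6+ is configured with the MASTER_DELAY option of CHANGE MASTER TO. A minimal sketch that builds the statements to run on the slave is below. The function name is hypothetical and the one-hour delay in the usage note is only an example value.

```python
from typing import List


def delayed_replication_statements(delay_seconds: int) -> List[str]:
    """Build statements that make a slave lag its master by delay_seconds."""
    if isinstance(delay_seconds, bool) or not isinstance(delay_seconds, int):
        raise ValueError("delay_seconds must be an integer")
    if delay_seconds < 0:
        raise ValueError("delay_seconds must not be negative")
    return [
        "STOP SLAVE",
        "CHANGE MASTER TO MASTER_DELAY = %d" % delay_seconds,
        "START SLAVE",
    ]
```

Calling `delayed_replication_statements(3600)` gives the statements for a one-hour delay. Afterwards, the SQL_Delay and SQL_Remaining_Delay columns of SHOW SLAVE STATUS report the configured delay and how long the slave will still wait before applying the next event.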
Old/Weird bugs!
GTIDs for Operators
Other links!