Ask any data centre manager whether they maintain their data centre and you’ll get a reflex “Yes!” and probably a funny look. The majority of companies have an ongoing maintenance schedule, but does it really cover everything it needs to? Sometimes it's a good idea to stop, review your service level agreement and perform some maintenance on the maintenance regime itself!
We all know why ongoing maintenance is needed. After all, it is our job to keep the business running, minimise downtime and ensure that cooling and general energy efficiency are at an optimum. A maintenance regime also gets you ‘closer’ to the equipment. What do I mean by that? It makes you more familiar with it, just as if you owned a bicycle, a classic car or a small boat: by getting your hands dirty you get a better idea of what works, what doesn’t, where an upgrade might be needed and whether there are persistent problem areas.
There is an endless amount of best-practice maintenance information available, covering every distinct piece of equipment and cabling in your data centres, so there is no way we could cover it all here. However, there are key areas that should be part of every regime, regardless of the make-up of your data centre or its size. Here is my take on what those areas should be:
·Uninterruptible Power Supplies (UPS) – For such a critical piece of data centre equipment, the UPS often receives only the most rudimentary maintenance attention. The occasional failover test with an unimportant server simply is not enough. Batteries need to be checked, periodic impedance tests carried out and software updates applied where available. It is also important to check that the configuration of the servers connected to the UPS has not changed in a way that overloads the appliance or leaves the UPS able to power the servers for too short a time. Seemingly simple things like bypass switches should also be checked.
·Air Conditioning Units (ACU) – We’ve just had the hottest July on record, so if you don’t realise by now how important your ACU systems are, then there may be no hope! Inspections and the replacement of consumables should take place regularly, along with reviews of the condenser locations and critical condenser cleaning. All associated internal and external pipework should be checked for integrity.
·Generators – One of the most critical aspects of data centre maintenance and one that is often overlooked, largely because generators sit quietly waiting for an outage. That does not mean they do not need maintaining, though. They should be run routinely. This is not simply a case of saying ‘it works’: fuel analysis and load testing should also be carried out at regular intervals and the results reviewed.
·Fire suppression systems – These should be routinely inspected and tested. That covers the detection systems themselves, the outlets and the fire suppression cylinders, which may need to be removed for pressure vessel stretch testing. Room integrity should also be checked to ensure no service entry ducts have been left open, as should the automatic systems designed to trigger failover, alert teams and shut down servers.
·Cabling – Checks should be conducted to ensure cabling is not under undue tension and that terminators which are regularly manipulated are not showing excessive signs of wear. If you have a colour-coded cable system in place, then you should also check for rogue cable colours. Finally, whilst it is rare, rodents and other unwanted guests can find their way into underfloor cavities, so look for signs of dining and nesting.
·Cleaning – Every area we have talked about above should include a cleaning aspect, and of course we try to keep data centres as dust-free as possible. However, we all have a habit of leaving rubbish around, and however hard we try, dust and dirt are going to work their way into the data centre. A cleaning routine that covers the areas outside those above, including floors, cabinet doors and emptying the bins, should be in place. It is important for fire prevention as much as anything else.
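To make the UPS point above a little more concrete, here is a rough Python sketch of the kind of check I have in mind: add up the power draw of the servers currently attached to a UPS, then flag it if the load is eating into the headroom or the estimated runtime has become too short. Everything in it is an assumption for illustration only: the `Ups` fields, the `review_ups` function, the 80% load limit, the 10-minute minimum runtime and the flat 90% inverter efficiency. The runtime figure is a simple energy-divided-by-load estimate, and real batteries do not discharge that neatly, so treat it as a prompt to consult the manufacturer's runtime curves rather than as an answer.

```python
from dataclasses import dataclass


@dataclass
class Ups:
    """Hypothetical description of one UPS; all values are illustrative."""
    name: str
    rated_watts: float                 # continuous output rating in watts
    battery_wh: float                  # usable battery energy in watt-hours
    inverter_efficiency: float = 0.9   # assumed flat efficiency


def review_ups(ups, server_watts, max_load_fraction=0.8, min_runtime_minutes=10.0):
    """Return a list of warnings for one UPS, given the servers now attached.

    server_watts maps a server name to its measured or nameplate draw in watts.
    The thresholds are assumptions; set them to match your own regime.
    """
    warnings = []
    load = sum(server_watts.values())

    # Overload check: has the attached load crept past the agreed headroom?
    limit = ups.rated_watts * max_load_fraction
    if load > limit:
        warnings.append(
            f"{ups.name}: load {load:.0f} W exceeds {limit:.0f} W headroom limit"
        )

    # Runtime check: crude linear estimate of minutes on battery at this load.
    if load > 0:
        runtime = ups.battery_wh * ups.inverter_efficiency / load * 60
        if runtime < min_runtime_minutes:
            warnings.append(
                f"{ups.name}: estimated runtime {runtime:.0f} min is below "
                f"{min_runtime_minutes:.0f} min"
            )

    return warnings


if __name__ == "__main__":
    ups_a = Ups(name="ups-a", rated_watts=3000, battery_wh=1000)
    attached = {"web1": 400, "db1": 600, "backup1": 1500}
    for warning in review_ups(ups_a, attached):
        print(warning)
```

Run against an up-to-date list of what is actually plugged in, a check like this is a cheap way to notice that the data centre has evolved while the regime has not.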
Part of the nature of maintenance regimes is that they are unique to a business, so many of you will be looking at this list and saying “that point is not relevant for me” or “Chris has left X, Y and Z out”. The whole point of this piece is to get you thinking about the regime itself, not just performing the checklists that were put in place when the DC was handed over. Unless it is very new, the DC will have evolved, but has your regime evolved too? If not, then it is probably time to take a step back and consider whether your data centre is underperforming as a result, or whether the risks of downtime are higher than they should be.