As a Systems Engineer/Administrator you will often get calls from people (or bots) telling you that a particular app/server is not working as expected. Many times you will scratch your head and try to recall "when was this goddamn thing put in production?". I have had moments like this several times, so I have compiled a checklist of what it takes for a server/app to be called "in production".
- Logging
The app should produce informative logs but should not run in verbose/debug mode under normal circumstances. Logrotate should be in place; a 10 GB log file is mostly useless. Logrotate can create nicely timestamped log files and retain them for the last X days.
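A minimal logrotate policy might look like the following (the path, app name, and 14-day retention are assumptions; adjust for your app):

```
# /etc/logrotate.d/myapp  (hypothetical path and app name)
/var/log/myapp/*.log {
    daily
    rotate 14          # keep the last 14 rotated files
    compress
    delaycompress
    missingok
    notifempty
    dateext            # timestamped rotated files, e.g. access.log-20240101
    copytruncate       # only if the app cannot reopen its log file on rotation
}
```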
- Configuration Management
There should be no manual intervention on the servers. Use Puppet, Chef, or anything else, but DO NOT touch your production servers manually. Kill anyone who tries to do so.
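As a sketch of what "no manual changes" looks like in practice, here is a minimal Puppet manifest (the package, file, and service names are hypothetical). Because the desired state is declared here, any manual drift on the box gets reverted on the next agent run:

```puppet
# Hypothetical app managed entirely through Puppet
package { 'myapp':
  ensure => installed,
}

file { '/etc/myapp/myapp.conf':
  ensure  => file,
  owner   => 'root',
  mode    => '0644',
  source  => 'puppet:///modules/myapp/myapp.conf',
  require => Package['myapp'],
}

service { 'myapp':
  ensure    => running,
  enable    => true,
  subscribe => File['/etc/myapp/myapp.conf'],  # restart on config change
}
```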
- Backups
Everything that can be backed up and is worth backing up should be backed up. Also make sure that you have a tested restore strategy: no backup is of any use if you cannot restore from it.
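A sketch of backup-plus-restore-verification, assuming simple tar-based backups (paths are hypothetical; swap in your data directory and backup destination):

```shell
#!/bin/sh
# Back up a directory and *prove* the backup restores cleanly.
backup_and_verify() {
    src="$1"                          # directory to back up
    dest="$2"                         # where the archive lands
    stamp=$(date +%Y%m%d)
    archive="$dest/backup-$stamp.tar.gz"

    tar -czf "$archive" -C "$src" . || return 1

    # Restore into a scratch dir and diff against the source:
    # a backup you have never restored is a backup you do not have.
    scratch=$(mktemp -d)
    tar -xzf "$archive" -C "$scratch" || return 1
    diff -r "$src" "$scratch"
    status=$?
    rm -rf "$scratch"
    return $status
}
```

The point of the `diff` at the end is that the verification happens on every run, not once a year when disaster strikes.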
- Functionality Checks
There should be an appropriate functionality check in place. Please don't just check whether the process is alive; I have seen several instances where the process is up but not serving requests. Do an end-to-end functionality check. Nagios is an easy-to-use tool for monitoring individual apps as well as a cluster.
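A minimal end-to-end check might look like this: fetch a known page and look for expected content, rather than only asking "is the process alive?". The URL and expected string are assumptions; the exit codes follow the Nagios plugin convention (0 = OK, 2 = CRITICAL), so a script like this can be wired into Nagios directly:

```shell
#!/bin/sh
# Hypothetical end-to-end health check for an HTTP app.
check_endpoint() {
    url="$1"
    expected="$2"
    body=$(curl -fsS --max-time 5 "$url") || return 2
    case "$body" in
        *"$expected"*) return 0 ;;   # app answered with real content
        *)             return 2 ;;   # process is up, but not serving what it should
    esac
}
```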
- App Owner
There should be a primary and a secondary owner for each app. At any given point of time, either you should have the capability and authority to debug and roll back the app without breaking the rest of the functionality, or you should have the contact details of those who can. Put these people on your speed dial.
- Access for Troubleshooting
The developers who own the app should have instant access to its logs. Make sure that they do not have to go through an entire chain of command just to get access for troubleshooting. Either give them a user account on the server, or make sure a temporary account with sufficient permissions can be provisioned at very short notice.
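One way to make "very short notice" real is a dry-run helper that prints the provisioning commands for review. The 7-day expiry and the `adm` group (which grants log access on Debian-family systems) are assumptions; adjust to your own policy:

```shell
#!/bin/sh
# Print the commands to provision a short-lived troubleshooting account
# (dry-run -- review the output, then run it as root).
provision_tmp_account() {
    user="$1"
    # GNU date first, BSD date as a fallback
    expires=$(date -d '+7 days' +%Y-%m-%d 2>/dev/null || date -v+7d +%Y-%m-%d)
    echo "useradd --create-home --expiredate $expires --groups adm $user"
}
```

An account that expires on its own is one less stale credential to clean up later.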
- Redundancy and Failover
Redundancy and failover should be in place and should be well tested. Always have two of everything (servers/app instances) and make sure that when one blows up, the other can take the load. Run Netflix's Chaos Monkey, a tool that gets into your infra and randomly starts killing stuff. It is great for testing the resilience of your infrastructure.
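The idea behind Chaos Monkey can be sketched in a few lines: pick a service at random, kill it, and watch whether the survivors absorb the load. The unit names are hypothetical and this assumes systemd; only ever run something like this where you have tested, working failover:

```shell
#!/bin/sh
# Toy chaos-monkey sketch.
random_int() {
    # $RANDOM is a bash-ism; awk works in plain sh too
    awk 'BEGIN { srand(); print int(rand() * 32768) }'
}

pick_victim() {
    set -- $1                         # split the space-separated list
    shift $(( $(random_int) % $# ))   # jump to a random element
    echo "$1"
}

# victim=$(pick_victim "app-a.service app-b.service app-c.service")
# systemctl kill "$victim"            # then verify the cluster still serves traffic
```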
- Security
Secure your machines. Make sure that the firewall is always running and is restrictive. Users who have left the project or the organization should no longer have accounts on your servers. Any directory with permissions 777 should be deleted at the expense of the person who set those permissions.
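Both of those checks are easy to automate. A small audit sketch (the scan root is a parameter so you can target app directories instead of the whole filesystem):

```shell
#!/bin/sh
# List world-writable directories and accounts that can still log in.
find_world_writable() {
    find "$1" -xdev -type d -perm -0777 2>/dev/null
}

list_login_accounts() {
    # accounts with a real shell -- cross-check this against current staff
    grep -vE '(nologin|false)$' /etc/passwd | cut -d: -f1
}
```

Run from cron and mail yourself anything it finds; a 777 directory that survives less than a day does much less damage.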
I could keep this rant going for a while, but I think these are the bare minimum things that should be present in any production environment, be it a huge multinational company or a small startup. None of this is difficult to do; these things just tend to be overlooked, assumed obvious, or put on a to-do list that never gets done.