> is this a viable career path?
Yes. As systems/products become more complex, people skilled in keeping them reliable are highly coveted.
> What qualities/attributes are necessary for these roles that are different from developers?
Nothing in particular other than a general attitude of "getting shit done" and less "that's someone else's problem". DevOps as a culture strives to eliminate this "throw it over the wall to Dev/Ops, it's their problem" mentality.
> And if I’m interested, how do I learn more?
Google's SRE (not RE) resources are a good place for the fundamentals.
The DevOps Handbook also includes bits and pieces on release planning/management, rather than a single monolithic chapter.
I don't know of any good books specifically tied to release engineering, though it's a duty very firmly under the DevOps umbrella.
Explained best in this book - The DevOps Handbook.
It seems that you want to cover DevOps to some degree. I'd look at The DevOps Handbook https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002/ref=sr_1_10?keywords=DevSecOps+Handbook&qid=1554391709&s=gateway&sr=8-10, along with covering the technologies that @wsppan touched on.
> give a vague impression that they all do the same thing.
Lots of tooling does overlap, but each tool has an area where it excels - and some excel at the same area.
So far you have done a good job: it seems most of your stuff is automated to a decent degree and you have identified where your weaknesses are.
You should tackle one thing at a time: identify your largest bottleneck or problem and work to solve that first. In the same vein, only introduce one new tool at a time. Each takes time to learn and to implement correctly, and trying to do too much at once will just cause problems.
You have already identified the weaknesses so focus on solving these, starting with what you think is causing the most issues.
> - One server per environment is obviously not super scalable
Look into HA (high availability) setups. How you do this and how much work it is depends on your application. Typically an application has two parts, work and state. Work (such as processing requests) is easy to scale if it holds no state: just add another server to the environment and load balance between them. For this you need a load balancer (HAProxy or Nginx work well, though there are many others to choose from) and to move any state off the node you want to scale.
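As a rough sketch, an HAProxy fragment that round-robins HTTP traffic between two app nodes might look like this (the names, IPs, and ports are placeholders, and a real config also needs a `defaults`/`global` section):

```
frontend www
    bind *:80
    mode http
    default_backend app

backend app
    mode http
    balance roundrobin
    # 'check' makes HAProxy health-check each node and stop routing to dead ones
    server app1 10.0.0.11:8080 check
    server app2 10.0.0.12:8080 check
```

Adding capacity then becomes "add another `server` line" (or, later, automate that with service discovery).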
There are many forms of state. Most will live in a database, but also pay attention to session state, which is sometimes held in memory on the node - if you have anything like this, you will need to move it into some shared storage, such as your existing database, Redis, or memcached.
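One way to make that move manageable is to put session storage behind a small interface first, then swap the backing store. A minimal sketch (the class names and the Redis client are assumptions, not anything from your stack):

```python
import json


class InMemorySessionStore:
    """Works on a single node only: sessions are lost on restart and
    invisible to any other node behind the load balancer."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, session):
        self._data[session_id] = json.dumps(session)

    def load(self, session_id):
        raw = self._data.get(session_id)
        return json.loads(raw) if raw is not None else None


class RedisSessionStore:
    """Same interface, but the state lives off-node, so every node
    sees the same sessions. Assumes a redis-py style client, e.g.
    redis.Redis(host="redis.internal")."""

    def __init__(self, client):
        self._client = client

    def save(self, session_id, session):
        self._client.set(f"session:{session_id}", json.dumps(session))

    def load(self, session_id):
        raw = self._client.get(f"session:{session_id}")
        return json.loads(raw) if raw is not None else None
```

Once the application only talks to the interface, switching from the in-memory store to the shared one is a one-line change at startup.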
> - No sense of automatic provisioning, we do that "by hand" and write the IPs to a config file per environment
There are loads of tools to help with this.
Terraform for provisioning infrastructure.
Ansible, Chef, SaltStack, or Puppet for provisioning nodes (I recommend starting with Ansible, though any of them will work).
There is nothing wrong with using bash scripts to glue things together or even do provisioning while you learn to use these tools. I would not shy away from them, but do recognize the benefits each tool provides over just bash scripts. Take your time to learn them and stick with what you know and what works for you while you do. Introduce them a little bit at a time rather than trying to convert your entire infrastructure to use them in one go.
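To give a feel for what replacing a provisioning script looks like, here is a hypothetical minimal Ansible playbook (the group name, package, and file are examples only):

```yaml
# playbook.yml - run with: ansible-playbook -i inventory playbook.yml
- hosts: webservers
  become: true
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present

    - name: Ensure nginx is running and starts on boot
      service:
        name: nginx
        state: started
        enabled: true
```

Unlike a bash script, this is declarative and idempotent: you describe the desired state and can re-run it safely, which is the main benefit these tools buy you.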
> - Small amounts of downtime per deploy, even if tests pass
This is easiest if you have an HA setup. You can do it without one, but it involves just as much work and basically follows the same steps as creating an HA setup. In short, with multiple nodes you can upgrade them one at a time; there are always some nodes running either the old or the new version, so everything continues to work.
You can either update nodes in place, or create new ones (if you have automated their provisioning) and delete the old ones once the new ones are up and working (see immutable infrastructure for this pattern, and canary deploys and blue/green deploys for other strategies).
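The rolling-upgrade loop itself is simple; the following sketch shows the shape of it. The `lb`, `deploy`, and `healthy` arguments are placeholders for whatever your load balancer and deploy tooling actually provide:

```python
def rolling_upgrade(nodes, new_version, lb, deploy, healthy):
    """Upgrade nodes one at a time so some nodes are always serving traffic."""
    for node in nodes:
        lb.remove(node)                # drain: stop routing new traffic to it
        deploy(node, new_version)      # install and start the new version
        if not healthy(node):          # smoke-test before re-adding it
            raise RuntimeError(f"{node} failed its health check; rollout stopped")
        lb.add(node)                   # back into rotation on the new version
```

Stopping on the first failed health check is what limits the blast radius: at worst one node is broken, and the rest are still serving the old version.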
> - If tests fail, manual intervention required (no rollback or anything) - though we do usually catch problems somewhere before production
Tests should be run before you deploy. These should run on a build server, or ideally a CI system. Ideally these should not only run before all deployments, but also for all commits to your code base. This way you can spot things failing much sooner and thus fix them when they are cheaper to fix. You also likely want to expand on the number of tests you do and what they cover (though this is always true).
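Wiring tests to every commit can be as small as one CI config file. A hypothetical GitHub Actions example (the test command is a placeholder for whatever your project uses):

```yaml
# .github/workflows/ci.yml - run the test suite on every push and PR
name: ci
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./run_tests.sh   # placeholder for your actual test command
```

Any CI system (Jenkins, GitLab CI, etc.) works the same way in principle: commit triggers build, build runs tests, failures surface before anything is deployed.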
Rollbacks should also be as easy as deploying the old version of the code - no more complex than deploying any other version.
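Concretely, if every deploy is "deploy version X", a rollback is just deploying the previous X. A sketch of that idea (the `apply` callback stands in for whatever actually ships a version, e.g. checking out a git tag or fetching a build artifact):

```python
class Deployer:
    """Tracks which versions were deployed, so rollback is just
    another deploy of the previous version."""

    def __init__(self, apply):
        self._apply = apply     # function that actually ships a version
        self.history = []       # versions in deploy order

    def deploy(self, version):
        self._apply(version)
        self.history.append(version)

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        previous = self.history[-2]
        self.deploy(previous)   # rollback goes through the same code path
        return previous
```

The important property is that rollback reuses the normal deploy path, so it is exercised on every release rather than being a rarely-tested special case.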
> - Bash scripts to do all this get pretty hairy and stay that way
Nothing wrong with some bash scripts, work to keep them in order and replace them with better tooling as you learn/discover it.
I have mentioned a few tools here, but there are many more depending on exactly which problems you need to solve. Tackle each problem one at a time and do your research around the areas you have identified. Learn the tools you think will be helpful before you put them in production (i.e. do some small-scale trials to see if they are fit for purpose). Then slowly roll them out to your infrastructure, using them to control more and more things as you gain confidence in them.
There is no one solution for everything you have described, and as long as you incrementally improve things towards your goal, you will be adding a lot of value to your business.
For now, decide which is the biggest problem you face and focus your efforts on solving it - or at least on making it less of a problem so you can move on to the next biggest one. Quite often you will resolve the same problems in different, hopefully better, ways as you learn more and as your overall infrastructure, development practices and knowledge improve.
Also, The Twelve-Factor App is worth a read, as are Google's SRE book and The DevOps Handbook. The Phoenix Project is also a good read.
These are more about the philosophy of DevOps; they are worth a read but won't solve your immediate issues. Reading around different topics is always a good idea, especially about what others have done to solve the problems you are facing. It will give you different perspectives and point you to good tools for the problems you face.
Tell your manager to read this fucking book: https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002
Do you know and understand the basic principles as described in The DevOps handbook ?
Have you created an automated deployment pipeline following the principles described in that book?
Job titles around DevOps can be misleading, though - some are just mislabeled sysadmin or automation-engineer roles. A lot of people claim the title without fully understanding the meaning behind it, learning the base tooling without really understanding what DevOps is about. Personally, I find the concepts raised in the book above far more important than any one tool you know.
[BOOK] The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations
Define "access to". Do you mean SSH access? Why do you need it? My ideal is to never have to log into a node, as every action you could want to take on one should be automated in some form. There is very little reason you should need to log into a production server, and needing to is a sign that some aspect of your systems needs improving.
It might be nice in staging/dev environments, so you can debug problems more quickly - but if something is slow to debug in staging or dev, it will be slow in production too, so ideally you should never need to log into those environments either. This way you are forced to fix the issues that make problems hard to find and solve without direct access, before they hit production.
On the other hand, you should have control over your environments and the ability to spin up new ones at will. In DevOps the ideal is that the person who writes the code is also responsible for deploying it; having to throw code over a wall to the Ops team to deploy or manage is the complete opposite of what DevOps is about.
You should read The DevOps Handbook and The Phoenix Project, which have some interesting things to say about compliance regulations and how to integrate them into DevOps practices.