Have you ever heard or had this conversation before?
How’s featureX coming along?
Oh it’s done. I pushed it last night.
Done? Last night? Why wasn’t it in this morning’s preview release? What’s the service URL?
Well it’s all checked in, but no-one’s released it.
Oh yes, this is one of those posts.
Over the past several years, one of the biggest issues I’ve struggled with is convincing developers that operating their services is now their problem. Most agree in theory that they ought to be in control of their own services’ destinies. Devops is a cultural change in the way that software is developed and operated. It does not mandate that the developers get the root password to the operations account, that devs boss the ops people around, or vice versa, or that either team continues the same old, same old practices they had previously. It is this last part that I think developers struggle with. In general, I find there are three main barriers to true success, especially inside a corporate environment that previously taught them learned helplessness in the face of rigid organisational hierarchy and bureaucracy.
The first two I won’t go into in detail, other than to say they are: 1. wavering management commitment; and 2. a lack of understanding in IT operations departments of what devops really means (also known as “we’ll continue to hand-crank environments, thanks”). I’ll leave them for another time. The one I’d like to talk about is the corollary of item 2: developers need to build operational understanding of their services. Now, that’s a pretty broad category of course. For example: learning how your services, your frameworks, and your dependencies, like app servers or databases or third-party APIs, all scale under load. There’s stuff about security and attack surfaces in your software, issues of unfamiliar technologies and new architectures, the unlearning of old bad habits ingrained over years or decades, resistance to change, and on and on and on.
The thing I’m going to talk about is the thing I think that developers get wrong right at the very start of their projects. I’ll also provide a practical step which I’ve found recently has been really helping me with developing applications that are easy to deploy and operate in a cloud-native environment.
How many guides to software frameworks and APIs begin like this:
To use MagicFramework-1.0, type the following:

```java
import magicframework.Magic;

public class Main {
    public static void main(String[] args) {
        Magic myMagic = new Magic();
        myMagic.doMagic(args);
    }
}
```

Then, on the command line, type “run magick”:

```
$ run magick
...
magick happened!
```

Congratulations, you have written your first “magic” program! You’re a wizard!
Of course, framework developers have a different set of needs around their code and practices than developers who are writing services relying on those frameworks but meant to be directly deployed into production environments. I only picked this as illustrative of the general issue: getting straight down to the code.
Now there’s nothing wrong with coding like this, per se. Up until the past year I was a strong advocate of that approach, and in most respects I still am (‘big design upfront’ is still massively wasteful to greater or lesser extents). What I mean is that jumping straight in and creating a new class or js file or whatever stems from a lack of understanding as to what constitutes code in a devops practice. So what is code? Everything you do as a software engineer.
The biggest change for developers in a devops environment is that the management of many aspects of the ‘platform’ is now not only up to them but has to be delivered in an automated, repeatable, and secure manner. For many programmers, the choice of the platform was already mandated to them: they were hired because they already had skills in Java Spring running on JBoss app server (or whatever). Maybe they’d not even have experience with the particular platform. This usually didn’t matter. They didn’t have to fully comprehend the details of the run-time environment to deploy their application; just enough so their program worked in the development environment’s version of the application. Normally one of the senior developers or architects had worked with the operations people to create and configure this environment, as a one-off task.
Now, however, developers have to supply not only the code, but the code containerised (e.g. in a Docker image), and with a complete set of ‘scripts’ (in the form of a Kubernetes manifest, a Helm chart, an Ansible playbook, a CloudFormation template, or some combination of these) which describe not only the application run time, but all of the attached service dependencies that it requires. If those attached services aren’t shared among many applications (as, for example, a Kafka cluster might be), then they must also be created in those very same scripts (for example, creating and attaching to the RDS instance the application will use for its database, or defining a persistent shared file storage layer).
And guess what? This stuff can take just as long as, if not longer than, writing the code, which is to say the part which implements the actual “business logic” desired. As an architect, designing and running the technical components of software delivery, I’ve learned from long and bitter experience that what we formerly thought of as “the code” is only about 30% of the effort. And with devops, at least half of the remaining 70% is now also code.
This is a good thing, because it makes building software much more dependable and repeatable, but requires some additional effort on the part of the development team.
We’re all familiar with the mantra:
the last 10% takes 90% of the time;
and, we’ve all been there. We’re “95% done”, just one last bit of effort, couple of hours tops, we’ll just deploy the code, and …
damn, there’s a firewall between my code and the database
… or …
the business changed its mind about that function which means you have to get the data from this other database.
… or …
legal says the user data has to be encrypted securely ‘at rest’ and ‘in flight’.
… or …
the new version of the document database that we upgraded to uses a new connection string format.
This list is pretty much ordinally-sized ω and every deployment and project just adds more to the list. I’m sure everyone has their own to add. The point is the code was not ‘complete’ until it was running in an environment and doing its job.
Reliably.
Securely.
Repeatably.
Can you delete the entire environment and redeploy it? Are there tests which stand up an ephemeral environment, deploy all the services into it, run tests across the entire system, save the test report, and then tear the whole environment down? Did you write the code that does this? Do you rely on someone else to do it for you, or does the whole development team collaborate on this component? How big is that code base compared to the ‘services’ code it tests? How long will it take you to get your code into production? Seconds? Hopefully not hours? Can you do it without an outage?
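To make that lifecycle concrete, here is a minimal sketch of the stand up, deploy, test, report, tear down sequence. Everything in it is hypothetical: the `Environment` interface and its method names are stand-ins for whatever your real provisioning and test tooling does, and `RecordingEnvironment` only records which steps ran so the example can be executed on its own. The one design point worth copying is the `finally` block, which is what lets you promise the environment is removed even when a deployment or a test run blows up.

```java
import java.util.ArrayList;
import java.util.List;

public final class EphemeralEnvironmentRun {

    /** Hypothetical abstraction over your real provisioning and test tooling. */
    interface Environment {
        void standUp() throws Exception;
        void deployServices() throws Exception;
        String runSystemTests() throws Exception; // returns the test report
        void saveReport(String report) throws Exception;
        void tearDown();
    }

    /** Runs the full lifecycle; teardown happens even if an earlier step fails. */
    static void run(Environment env) throws Exception {
        try {
            env.standUp();
            env.deployServices();
            env.saveReport(env.runSystemTests());
        } finally {
            env.tearDown();
        }
    }

    /** Stand-in that only records which steps ran, so the sketch is runnable. */
    static final class RecordingEnvironment implements Environment {
        final List<String> steps = new ArrayList<>();
        private final boolean failOnDeploy;

        RecordingEnvironment(boolean failOnDeploy) {
            this.failOnDeploy = failOnDeploy;
        }

        @Override public void standUp() { steps.add("standUp"); }

        @Override public void deployServices() {
            steps.add("deployServices");
            if (failOnDeploy) {
                throw new IllegalStateException("simulated deployment failure");
            }
        }

        @Override public String runSystemTests() {
            steps.add("runSystemTests");
            return "all tests passed";
        }

        @Override public void saveReport(String report) { steps.add("saveReport"); }

        @Override public void tearDown() { steps.add("tearDown"); }
    }

    public static void main(String[] args) throws Exception {
        RecordingEnvironment env = new RecordingEnvironment(false);
        run(env);
        System.out.println(env.steps); // the steps in the order they ran
    }
}
```

In a real pipeline each method would shell out to, or call the API of, whatever you deploy with; the shape of `run` stays the same.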
I’ve always maintained the definition of ‘done’ was running in production. Now, because of the environment I work in (the highly regulated aviation sector), this means the code I write (or supervise the writing thereof) might not be ‘done’ until long after we wrote the code, even when we do all the above. The code might support a new process, which might require the regulator to approve, and certainly requires changes to the Standard Operating Procedures and training to be rolled out to the affected operational personnel. There is, nevertheless, that golden standard to aspire to. If for the purposes of project delivery we will define ‘done’ as some lesser standard, say, running in a pre-production preview/acceptance environment, then it is even more imperative that we have the ‘code’ fully operationalised.
This is where I think developers go astray. At first, given a business issue to solve, they open their IDE and just code the solution. Sometimes this is still the right approach; where the exact solution has to be discovered, and maybe will drive the adoption of aspects of the technology platform. For the vast majority of cases however, this isn’t it.
What we are then often building is a code solution which ends up with multiple inflexible surfaces on it. These need to be ‘punched through’ in order to operationalise the solution in the target environment, often at the last minute, requiring open-heart surgery on the code so that it can use them properly. Then sometimes major parts of the code base need to be refactored just to achieve this. All those references to attached services need to be injected into the running code from the environment (not configuration or property files). There’s a ton of config which has to be attached to that environment: Deployment; Service; Ingress; Secret; Role and RoleBinding; to name a few specific to a Kubernetes deployment. Maybe you’re using AWS Fargate, maybe on ECS not EKS, maybe GKE on GCP, maybe on both, maybe Openshift, maybe even in the cloud and in your datacentre. Hostnames may have to be created for each service and valid TLS certificates generated for them. Network security rules have to be defined and written, and many more things of that nature have to be considered and created.
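As a small illustration of injecting attached-service references from the environment, here is a minimal sketch. The class name `ServiceConfig` and the variable names `DATABASE_URL` and `KAFKA_BROKERS` are my own assumptions for the example, not a convention any framework mandates. The idea is simply that the code reads where its dependencies live from the process environment at start-up and fails fast with a clear message if something is missing, instead of carrying those values in a property file baked into the build.

```java
import java.util.HashMap;
import java.util.Map;

public final class ServiceConfig {
    private final String databaseUrl;
    private final String kafkaBrokers;

    private ServiceConfig(String databaseUrl, String kafkaBrokers) {
        this.databaseUrl = databaseUrl;
        this.kafkaBrokers = kafkaBrokers;
    }

    /** Production entry point: reads the real process environment. */
    public static ServiceConfig fromEnvironment() {
        return from(System.getenv());
    }

    /** Same logic over any map, which keeps the class easy to test. */
    public static ServiceConfig from(Map<String, String> env) {
        // DATABASE_URL and KAFKA_BROKERS are hypothetical variable names.
        return new ServiceConfig(
                require(env, "DATABASE_URL"),
                require(env, "KAFKA_BROKERS"));
    }

    /** Fails fast at start-up rather than at the first database call. */
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException(
                    "Missing required environment variable: " + name);
        }
        return value;
    }

    public String databaseUrl() { return databaseUrl; }

    public String kafkaBrokers() { return kafkaBrokers; }

    public static void main(String[] args) {
        // Sample values standing in for what the platform would inject.
        Map<String, String> sample = new HashMap<>();
        sample.put("DATABASE_URL", "jdbc:postgresql://db.example.internal:5432/app");
        sample.put("KAFKA_BROKERS", "kafka.example.internal:9092");

        ServiceConfig config = from(sample);
        System.out.println("Configured brokers: " + config.kafkaBrokers());
    }
}
```

In a Kubernetes deployment those variables would typically be populated from the Deployment's container spec, with sensitive values sourced from a Secret, so the same image runs unchanged in every environment.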
And here you were promising at standup just this morning that the code was done and checked in! Now you find yourself working deep into the night, just trying to get the code running reliably, in multiple environments, without hard-coding, just so you don’t have to repeat the experience of that product manager’s sour look, and the blank look on the face of the project manager, at tomorrow morning’s stand up.
All of this needs to be done in your build pipeline; none of it should be manual. Which of these steps will require elevated privileges? Do you need to submit a pull request (or merge) to another team which controls those elevated privs (especially relevant for security and network items that are issued at the Cluster or whole-cloud level)? Will you need help from another team member’s expertise in correctly attaching a particular service? Just how do you make that cert manager pod, running in a different namespace which you don’t control, issue you a certificate which your Ingress will read?
And here’s the advice I wish I could give, from now, to myself three years ago:
This is a reverse of the usual process of ‘write the code, figure out how to deploy it’. I find it has several advantages to agile software development. First, infrastructure and attached services are added into the agile software development methodology. Second, once you’ve got the operational code running, you can, as a programmer, forget about it and just write the business logic undisturbed. Every time you check in and push you get a running version of your code! Third, I find it easier using this method to run a local version in your Docker Desktop or other local run-time. There’s no fiddling about with manual deployments with manually-created dependencies.
In conclusion, by concentrating on your development and deployment pipeline up front, you effectively clear a path to make your code run in production. Those Romans built those dead straight paved roads to their borders for a reason, and that reason wasn’t always apparent while they built them; only afterwards, when they had to march in a hurry to the edges of their world, did that become obvious. As a developer, you’ll grow in ways that benefit you immensely. First, in appreciation of just what it is that our colleagues in IT operations have to do (and they will, hopefully, be learning just what it is that we do as developers). Second, you’ll be broadening your skill base. Learning Kubernetes and Helm is invaluable in learning how to architect, design, build, and run services properly, no matter what language or base platform you use. You’re also building your employability; I don’t think I have to mention that!
Most importantly, I think you will find this invaluable to increasing your productivity as a programmer. Once you get over the hump of learning these techniques and frameworks, you will be able to quickly, reliably, and repeatably, make changes to your code and test it running in a real environment, over and over again.
## links