Three Golden Rules when delivering cloud-based technology

Arron Dougan
5 min read · Oct 24, 2022


I’m currently on a short sabbatical from my role at KPMG UK which is giving me plenty of time to get back into tech blogging. Today I’m writing from Airlie Beach, Australia (the gateway to the Whitsunday Islands) covering an opinionated short article on my three golden rules of delivering cloud-based technology.

Photo by Marvin Meyer on Unsplash

Now “delivering technology” in this context could range from an ad-hoc script pulling some metadata from a cloud environment, to a new feature or even an entirely greenfield application build-out.

No matter the scale of the solution, these three grounding principles remain the same, forming the basis of any design thinking on my part.

Number One – Maintainability

Simply speaking, maintainability is the ease of keeping a tool doing its job.

Aim for “No Ops” – I really love the term “no-ops”, in which a solution lives its life with little or no manual intervention. Often this nirvana will require some forward thinking and investment, but when done right it ensures your team can focus on new features without having to worry about keeping older solutions running. Avoid manual intervention where possible and automate like your life depends on it!

Complexity, the arch enemy of maintainability – I always try to keep solutions as simple as possible. The legendary saying “don’t reinvent the wheel” applies strongly here. Always try to use a PaaS / SaaS service if requirements allow, and opt for well-maintained open source libraries rather than starting from scratch.

Low barrier of entry – junior and newer team members should be able to seamlessly contribute to the product. This can be achieved by solid documentation, contribution guidelines and a backlog of “first-issues” tagged and ready for fresh engineers to cut their teeth on. It’s not ideal to have solutions that can only be maintained by single points of failure – and it will definitely impact your team’s velocity.

Everything as code!! – the best documentation by far is clean code. Ensure all infrastructure is written using Terraform / Bicep. This means engineers can easily reference the topology in a language they understand. To name a few other preferential examples: machine images (Packer / Ansible), policy (YAML / JSON), K8s (Helm) and of course the source code itself! Ideally a solution should be immutable, i.e. easily recreated from scratch. If it can’t be, identify the manual steps and raise some backlog tickets!
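To make the “identify the manual steps” point concrete, here’s a minimal sketch of a drift check: compare the desired state captured in code against what actually exists, and surface anything that was changed by hand so it can be ticketed and codified. The resource attributes below are invented for illustration – a real setup would use `terraform plan` or similar to detect drift.

```python
# Hypothetical desired state, as it would be captured in code.
desired = {
    "vm_size": "Standard_B2s",
    "tags": {"env": "prod", "owner": "platform-team"},
    "public_ip": False,
}

# Hypothetical live state - someone flipped public_ip in the portal.
actual = {
    "vm_size": "Standard_B2s",
    "tags": {"env": "prod", "owner": "platform-team"},
    "public_ip": True,
}

def find_drift(desired: dict, actual: dict) -> list:
    """Return the keys whose live value no longer matches the code."""
    return [key for key in desired if actual.get(key) != desired[key]]

drifted = find_drift(desired, actual)
for key in drifted:
    print(f"Drift detected: {key!r} - raise a ticket to bring it back into code")
```

Anything this kind of check flags is a manual step waiting to be automated.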

Number Two – Security

Almost daily my LinkedIn and Medium feeds are full of new companies falling victim to a cloud-based data breach, usually via social engineering or accidental misconfiguration. No matter what the solution is, in the cloud it pays to keep things secure!

Invest Early in Guardrails – all major cloud providers have a wealth of built-in policies that can safeguard your organisation from serious cloud misconfigurations. Many are available out of the box – get them configured ASAP as a baseline. As your organisation matures it’s worth delivering a way to deploy these cloud policies as code, ensuring you can keep up with newer standards with ease. Check out an Azure-based example below.
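As a very rough sketch of the “policies as code” idea: keep your baseline guardrails as data in version control and diff them against what’s actually assigned. The policy names below are made up, and a real deployment would use Terraform or Bicep against Azure Policy rather than Python – this only shows the shape of the approach.

```python
# Hypothetical baseline guardrails, kept in source control.
BASELINE_POLICIES = {
    "deny-public-storage-accounts",
    "require-resource-tags",
    "allowed-regions-only",
}

def missing_guardrails(assigned: set) -> set:
    """Return baseline policies not yet assigned in the environment."""
    return BASELINE_POLICIES - assigned

# Stand-in for the list of policies actually assigned today.
currently_assigned = {"require-resource-tags"}

gaps = missing_guardrails(currently_assigned)
print(f"{len(gaps)} guardrails still to assign: {sorted(gaps)}")
```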

Security is everyone’s responsibility – annual privileged access e-learning is not enough (if your organisation even has this!). The threat landscape is always changing, with cyber criminals getting smarter by the day. Security should be baked into every engineer’s goals. Encourage cloud security certifications, pair up your threat intelligence function with engineers, and encourage regular reading of threat reports such as the ones tagged below by the NCSC. Learn from where others have fallen short and address any gaps.

Watch for highly permissive accounts – the principle of least privilege is gospel to anyone working in cyber security, however I’ve seen some questionable configurations over my career. Only assign the permissions required by the tool / solution. If a high level of permissions is required, see if you can combine this with a mitigating control such as a conditional access policy. This controls when the credentials can be used, for example from a trusted IP range or device. I’ve thrown in a super cool preview feature from Microsoft below.
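Separately from any vendor feature, least privilege is easy to lint for yourself. Here’s a hedged sketch that scans an AWS-style IAM policy document for wildcard actions or resources – the classic shape of an overly permissive grant. The policy document below is invented for the example.

```python
def overly_permissive(policy: dict) -> list:
    """Return statements that grant wildcard actions or resources."""
    flagged = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # IAM allows a single string or a list for both fields.
        if isinstance(actions, str):
            actions = [actions]
        if isinstance(resources, str):
            resources = [resources]
        if "*" in actions or "*" in resources:
            flagged.append(stmt)
    return flagged

# Invented example policy: one scoped statement, one wide-open one.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::reports/*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}

for stmt in overly_permissive(policy):
    print("Flagged statement:", stmt)
```

A check like this fits nicely into a CI pipeline so wide-open grants never reach production.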

Keep an eye on secrets – if the solution relies on a shared service account, ensure the keys are regularly rotated, particularly when engineers leave, as they can easily hang onto these. A better alternative is credential-free access using AWS IAM roles / Azure Managed Identities. Lastly, you should implement robust secret scanning in your SCM toolsets – an accidental access key within a git repo can cause chaos in the wrong hands.
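To show the core idea behind secret scanning, here’s a minimal sketch that flags strings shaped like AWS access key IDs (which start with `AKIA` followed by 16 uppercase alphanumerics). Real tooling such as gitleaks or GitHub secret scanning covers far more patterns and entropy checks – this only illustrates the mechanism. The key in the diff is AWS’s published documentation example, not a real credential.

```python
import re

# AWS access key IDs look like "AKIA" followed by 16 uppercase letters/digits.
AWS_ACCESS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def scan_for_keys(text: str) -> list:
    """Return anything in the text matching the AWS access key ID shape."""
    return AWS_ACCESS_KEY_RE.findall(text)

# Simulated diff about to be committed (uses AWS's documentation example key).
diff = '''
+AWS_ACCESS_KEY_ID = "AKIAIOSFODNN7EXAMPLE"
+region = "eu-west-1"
'''

hits = scan_for_keys(diff)
for hit in hits:
    print(f"Blocked commit: possible AWS access key {hit[:8]}... found")
```

Wired into a pre-commit hook or CI check, this stops the key before it ever reaches the repo history.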

Secure enough – it’s worth mentioning that security features can come at additional cost and complexity. It’s worth having a quick, standardised way to risk-assess your piece of technology and apply a sensible level of control. Don’t over-secure, or you’ll sacrifice maintainability and in some cases even reliability.

Number Three – Reliability

Reliability is the likelihood of your solution not breaking – and not ruining a user’s day and an on-call engineer’s evening.

Design around entropy – entropy is a measure of disorder and uncertainty. Entropy is highest when making changes to your system, so ensure you have a solid suite of tests so you know when a change breaks or regresses the system. Couple this with a deployment pipeline so you can easily roll back any faulty changes.
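The “test then roll back” loop can be sketched in a few lines. Everything here is invented for illustration – the function names, the fake smoke test and the release list all stand in for a real deployment pipeline and test suite.

```python
def deploy(version: str, releases: list) -> None:
    """Pretend deploy: record the new version as live."""
    releases.append(version)

def rollback(releases: list) -> None:
    """Pretend rollback: drop the most recent release."""
    releases.pop()

def smoke_test(version: str) -> bool:
    """Stand-in for a real post-deploy test suite."""
    return version != "v2-broken"

def safe_deploy(version: str, releases: list) -> str:
    """Deploy a version, run smoke tests, and roll back if they fail."""
    deploy(version, releases)
    if not smoke_test(version):
        rollback(releases)
    return releases[-1]

releases = ["v1"]
current = safe_deploy("v2-broken", releases)
print("Serving:", current)  # the faulty release was rolled back to v1
```

The point isn’t the toy code – it’s that rollback is an automatic branch in the pipeline, not a 2 a.m. decision.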

Monitor / react to key symptoms – unfortunately most systems will have unreliable components, so make sure you are in a position to monitor key indicators of failure. Ideally, combine these with an automated runbook to heal the symptom before it results in an outage.
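An automated runbook can be as simple as a mapping from known symptoms to healing actions, with anything unrecognised escalated to a human. The symptoms and actions below are hypothetical placeholders for real operations (scaling a worker pool, rotating logs, paging on-call).

```python
healed = []  # record of healing actions taken, for the example

# Hypothetical runbook: known symptom -> automated healing action.
RUNBOOK = {
    "queue_depth_high": lambda: healed.append("scaled out workers"),
    "disk_nearly_full": lambda: healed.append("rotated old logs"),
}

def react(symptom: str) -> str:
    """Run the healing action for a known symptom, or escalate to a human."""
    action = RUNBOOK.get(symptom)
    if action is None:
        return "escalate: page the on-call engineer"
    action()
    return f"auto-healed: {healed[-1]}"

print(react("disk_nearly_full"))
print(react("something_new"))
```

Each symptom your monitoring catches repeatedly is a candidate for a new runbook entry.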

Choose reliable components – this may be stating the obvious, but some components and services are simply more reliable than others. Consult the documentation and ensure the availability meets your business requirements.

Alert on vitals – at the end of the day, it’s better that you spot a system failure rather than the end user. Ensure you can detect and alert on key system failures, using health checks, spotting key job failures, etc. Link these up with an alerting mechanism of your choice and ensure engineers know how to react to maximise uptime.
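Here’s a minimal health-check sketch: run each vital check and surface any failure for alerting, so the team notices before the end user does. The checks are invented stand-ins – in reality they would probe a database connection, a queue, a batch job’s last run time and so on.

```python
def check_database() -> bool:
    """Stand-in for a real connectivity probe."""
    return True

def check_payment_job() -> bool:
    """Stand-in for checking last night's batch job - simulated as failed."""
    return False

# Registry of vital checks for this hypothetical system.
CHECKS = {"database": check_database, "payment_job": check_payment_job}

def run_health_checks(checks: dict) -> list:
    """Return the names of failing checks, ready to route to alerting."""
    return [name for name, check in checks.items() if not check()]

failing = run_health_checks(CHECKS)
for name in failing:
    print(f"ALERT: {name} health check failed - notify the on-call channel")
```

The output of a loop like this is exactly what you’d wire into PagerDuty, Teams, Slack or whatever alerting mechanism you’ve chosen.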

Let’s wrap up…

Both business criticality and data sensitivity will naturally define how much you invest in each of these areas. Based on my experience, systems get more critical and sensitive over time, so start thinking about these areas from day one and iterate over time.

That’s a wrap, folks! I hope you’ve enjoyed reading this article. As with everything, this is not an exhaustive list, but it has served me well over years of making cloud design decisions, no matter the scale.


Arron Dougan

Azure focused DevOps and Cloud engineer based in Manchester 🐝☁️