In Conversation With a Network Reliability Engineering Practice Manager
Network Reliability Engineering (NRE) is an offshoot of the famed Site Reliability Engineering (SRE) practice pioneered by Google and adopted by many organisations looking to scale without compromising reliability or the quality of the user and customer experience, typically through extensive use of automation. NRE extends SRE practices and behaviours to the network domain.
NRE Practice Manager
DOI Ambassador Feisal Ismail sat down (virtually!) with Wesley Situ, an NRE practice manager at an American multinational investment bank and financial services holding company, to learn more about his understanding of and experience in implementing NRE in his organisation.
- Why did you consider Network Reliability Engineering as a suitable practice for your organisation and did you have prior experience with NRE?
Well, it started in October 2019 in my previous organisation, which was transforming its IT services and moving to the cloud in a big way. As part of the cloud-first approach, there was a strong move towards embedding the complementary practices of DevSecOps and Site Reliability Engineering. I got my feet wet in SRE practices there.
With my present employer, our operating structure is heavy in terms of numbers. Network infrastructure components are growing rapidly, and unless we fundamentally change the way we run things, it would be extremely challenging to manage them without scaling up our technical resources. We do have a scaling issue, and modernising our workforce and the way we run network operations is the key. This presents a tremendous opportunity for us to assess and review the way we would like to run network operations at scale and at the same time elevate the skillsets of our engineers.
- What unforeseen issues or challenges did you face as part of the transformation to an NRE way of working?
Hands down, transforming culture and mindset is the biggest. It takes time and effort. What we aspire to do is develop our workforce to exhibit elements of the generative culture described in the Westrum model, which is well covered in the DevOps Foundation and SRE courses by DOI. It’s not so easy to shift from a Fail-Never past to a Fail-Often-Learn-Often future.
A case in point is our attempt to put in place blameless post-mortems. A blameless post-mortem (or retrospective) is a post-incident document that helps teams figure out why an incident happened and brainstorm how to improve the process to prevent similar incidents from happening again. It’s easy to conceptualise, but human conditioning over many years makes it difficult to pull off. The mindset of assigning blame for human mistakes is still embedded in the organisational psyche. It’s tough to balance accountability against growth and learning, but it is a balance that we need to learn to strike.
Another challenge is monitoring. The organisation is steeped in very traditional monitoring approaches with little scope for engineering. Engineers are handling issues that they know are a more immediate need and those that are already hurting the organisation. It takes time to move engineers conditioned for firefighting to work on fire prediction.
I also did not anticipate the dearth of talent available in this area. We need people who understand Network Automation and have an affinity for programming. In my experience, it’s quite tough to find these attributes in Asia, but my counterparts in the US and UK found it less of an issue to fill these roles. I am looking to expand our technical capabilities in the Asia-Pacific region to catalyse the transformation. It’s challenging but we are getting there, albeit more slowly than I would have preferred. My immediate task now is to keep the troops focused on what we are doing and planning to achieve with NRE.
Transforming the existing workforce into the NRE ways of working initially faced a roadblock due to the team’s lack of exposure to “novel” practices. This led to initial scepticism and a lack of faith in the path forward. Getting them all on the same page by learning and understanding the common practices and terminologies used helps to grease the wheels quite a bit. We are planning investments in this area.
- Any areas affected that pleasantly surprised you?
Yeah! I underestimated the reactions that I would get from the existing team. Although there was pessimism and scepticism in the beginning, the team was prepared to change and there was not that much resistance! The team was more sensible and open than I thought they would be. The culture of the bank, a people-centric and dynamic organisation where change is viewed as normal, helped immensely.
A top-down approach will not work as part of the NRE implementation. You’ve got to listen to people doing the work and embrace new ideas and be adaptive.
- What would be your advice to anyone exploring implementing SRE or NRE in their organisation?
Be clear about why you are doing it. It is a huge undertaking. Does it even make sense to begin this journey? Is the practice even suitable for the organisation?
Recognise that it is a long, long journey and you must be prepared to be adaptive along the way. Don’t go for immediate results but look for desired behavioural changes and sustainability.
Have the right road map, have a clearly communicated vision and be patient. Good luck!