Systems are built to be production ready, tolerant of failure, self-healing and monitorable.
Building a larger number of smaller systems and running them in cloud environments makes handling failure more important than ever. If a system is self-healing, it must also raise alerts that provide enough information to investigate the problem.
- The book Release It! will be used as the starting point for defining what Production Ready means
- The system should have sufficient monitoring, logging and alerting to be able to understand and report on its health and the health of its dependencies in a consistent way.
- Production monitoring should be available to all!
- Systems should degrade gracefully on disruption, for example by using circuit breakers, bulkheads, caching or partial responses.
- Teams and business owners will need to work together to identify relevant business and technical KPIs for their product.
- Teams should understand the resilience of their services by experimenting with intentional faults
- Dashboards will need to be implemented for the most important KPIs
- Not all risks to production readiness can be analysed in advance, so (in addition to checking KPIs) exploratory testing should be used to expose new information about software behaviour
- Applications should be built with a diversity of stakeholders in mind. Operability and supportability are important in most contexts, but see [https://en.wikipedia.org/wiki/List_of_system_quality_attributes] for a list of other software 'ilities' that may need particular consideration in your context.
- The 1JL Category Service, the Elasticsearch service and apps built on the digital platform have been built with monitoring and self-healing in mind, using Grafana and Prometheus
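
To make the graceful degradation point above more concrete, here is a minimal sketch of a circuit breaker in Python. It is an illustration only, not how any of the services named above are implemented. The class name `CircuitBreaker`, its parameters (`failure_threshold`, `reset_timeout`, `clock`) and the `fallback` callable are all hypothetical names chosen for this example. After a number of consecutive failures the breaker opens and fails fast (or serves a fallback such as a cached response); once the timeout has elapsed it lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are assumptions."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: do not touch the failing dependency at all.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()  # degrade gracefully, e.g. a cached or partial response
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a dependency, for example `breaker.call(fetch_categories, fallback=cached_categories)`, where both function names are placeholders. In a real system the state changes would also emit a metric or alert, so that self-healing does not hide the underlying fault.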
Points for discussion
- Do you understand your failure modes?
- Do you understand how monitoring helps you understand your failure modes?
- How close is the system to its limit?
- Are there circuit breakers?
- Are there rolling deployments, e.g. with Kubernetes?
- Do you have dashboards, golden signals, engineering metrics for traffic and resilience?
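
As a starting point for the questions about limits and golden signals, the sketch below computes the four commonly cited golden signals (latency, traffic, errors, saturation) from a window of request samples. It is a simplified illustration under stated assumptions: the `Request` type, the `golden_signals` function and its parameters (`window_seconds`, `in_flight`, `capacity`) are hypothetical, and a real system would normally take these values from its monitoring stack rather than calculate them by hand.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request; a hypothetical record type for this sketch."""
    duration_ms: float
    ok: bool


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarise a window of requests as the four golden signals."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank 95th percentile, using integer maths to avoid rounding surprises.
        rank = max(1, (95 * total + 99) // 100)
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "traffic_rps": total / window_seconds,           # how much demand there is
        "error_rate": errors / total if total else 0.0,  # fraction of failed requests
        "latency_p95_ms": p95,                           # how slow the slow requests are
        "saturation": in_flight / capacity,              # how close the system is to its limit
    }
```

The `saturation` value is one possible answer to "How close is the system to its limit?": here it is the ratio of in-flight work to an assumed maximum capacity, but which resource is the true limit (threads, connections, memory, queue depth) depends on the system.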