Let's Finally Build Continuous Database Reliability! We Deserve It

Over the past few decades, software delivery has changed significantly. We have become better at shipping software and have created new methodologies and frameworks to improve collaboration among teams. Despite these advancements, the software development life cycle (SDLC) is still far from flawless. Teams spend a considerable amount of time handing over artifacts, and early pipeline checks are inefficient or missing.

At the same time, we want a reliable and robust SDLC. We do not want our deployments to get blocked, our applications to fail or our databases to slow down. We want continuous reliability around deployments, applications and databases. We worked hard to make our CI/CD pipelines fast and learned how to deploy and test applications reliably, but we did not make the same progress with databases. It’s time to get continuous reliability around databases as well.

To do that, developers need to own their databases. Once developers take ownership, they can optimize the pipelines and achieve continuous reliability for databases. Technical leaders need to drive this shift of ownership deliberately. Platform engineers are well placed to lead it by implementing proactive measures that safeguard databases, but they need the right tools and processes. Let’s see how to do it and what we need.

The World Moved on But Databases Stayed Behind

Two decades ago, cloud environments were non-existent, and the majority of software ran on on-premises servers. Applications consisted of a few building blocks, typically one database, a couple of web servers and file storage. In those times, troubleshooting was relatively straightforward because logs and traces were easily accessible when bugs arose.

However, our software delivery capabilities were hindered by inefficient processes. Developers worked on changes that were later encapsulated in changesets and handed over to system engineers for deployment. This segregation meant that developers were not actively involved in the deployment and maintenance phases. When bugs occurred, system engineers had to step in, leading to prolonged remediation processes due to communication barriers.

Recognizing the inefficiency of excluding developers from deployment and maintenance, the industry introduced the concept of DevOps, emphasizing collaboration between developers and system engineers. However, it became evident that collaboration alone did not suffice. DevOps engineers emerged, merging both sets of competencies for smoother development and deployment. These engineers can now develop business code, deploy it and manage cloud infrastructure with infrastructure as code (IaC), supported by tools and processes for efficient operations.

The landscape has evolved significantly since then, with the adoption of microservices, independent databases for each small application and increased inter-service communication complexity. Bug identification has become challenging, given the distributed nature of systems and scattered signals throughout the ecosystem. Although component deployment has accelerated, managing this complexity remains a struggle. Effective solutions to prevent production issues, streamlined debugging processes and scalable teams are still elusive.

How to Build Database Reliability

Platform engineers need to cover three areas to build reliability in the database domain:

Tools and processes that work across the pipeline
Observability and semantic monitoring of the databases
Automated troubleshooting

Let’s go through each of these areas to understand what we need.

Tools and Processes

Various database issues can occur without developers noticing. These include the N+1 queries problem, too few or too many indexes, eager versus lazy loading in ORMs, schema migrations and impedance mismatch, to name a few.

It is crucial to recognize that developers are unable to proactively prevent these problems. They lack effective tools and processes to identify performance issues while developing their applications. Testing databases is often insufficient, as discussed in our article on how to test databases. This inadequacy is due to the limitations of current CI/CD solutions and the testing pyramid, which struggle to detect these issues. Unit and integration tests primarily focus on data correctness and do not address concerns such as the N+1 queries problem, the use of indexes or the performance impact of a common table expression (CTE). While load tests may offer some insights, they run late in the pipeline, near the end of the deployment process, which is too late to give developers fast feedback.
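To make the N+1 queries problem concrete, here is a minimal sketch that counts the SQL statements issued by two ways of loading the same data. It uses Python's built-in sqlite3 module purely for illustration. The authors/books schema and all function names are hypothetical, and hand-written queries stand in for what an ORM with lazy loading would typically generate.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database: 5 authors with 2 books each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE books (
            id INTEGER PRIMARY KEY,
            author_id INTEGER REFERENCES authors(id),
            title TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO authors (id, name) VALUES (?, ?)",
        [(i, f"author-{i}") for i in range(1, 6)],
    )
    conn.executemany(
        "INSERT INTO books (author_id, title) VALUES (?, ?)",
        [(i, f"book-{i}-{j}") for i in range(1, 6) for j in range(2)],
    )
    conn.commit()
    return conn


def count_selects(conn, fn):
    """Run fn(conn) and return (result, number of SELECT statements executed)."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        result = fn(conn)
    finally:
        conn.set_trace_callback(None)
    selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
    return result, len(selects)


def titles_n_plus_one(conn):
    """One query for the authors, then one extra query per author."""
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(conn):
    """The same data fetched with a single JOIN."""
    rows = conn.execute(
        """
        SELECT a.name, b.title
        FROM authors a JOIN books b ON b.author_id = a.id
        ORDER BY a.id, b.id
        """
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


conn = make_db()
lazy_result, lazy_queries = count_selects(conn, titles_n_plus_one)
joined_result, joined_queries = count_selects(conn, titles_joined)
```

Both functions return identical data, so a test that only checks correctness passes either way. The difference shows up only in the statement count: six SELECT statements for the per-author loop versus one for the JOIN, and the gap grows with the number of authors.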

We need robust database guardrails that let developers identify these issues early in the development process and shift checks as far left as possible. These guardrails can flag unused indexes, incorrect configurations, performance concerns and improper object-relational mapping (ORM) settings while developers are writing their code. Because the checks run before any code change is committed, the turnaround time is significantly reduced. This approach empowers developers to take ownership of their databases’ performance, providing them with the necessary tools without hindering their productivity. Since ownership stays within one team, issues are resolved much faster, which leads to higher reliability.
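As a rough illustration of what one such early check could look like, the sketch below asks the database for a query's execution plan and reports full scans before the code is ever committed. It assumes SQLite and its EXPLAIN QUERY PLAN statement only because they ship with Python. The orders table, the index name and the full_table_scans helper are hypothetical, and the plan text is meant for humans, so its exact wording may vary between SQLite versions.

```python
import sqlite3


def full_table_scans(conn, sql, params=()):
    """Return the plan steps in which SQLite scans a whole table or index."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); "detail" is human-readable text.
    return [row[3] for row in plan if row[3].startswith("SCAN")]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
query = "SELECT id, total FROM orders WHERE customer_id = ?"

# Without an index on customer_id, the filter requires a full scan.
scans_before = full_table_scans(conn, query, (42,))

# After adding the index, the same query becomes an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
scans_after = full_table_scans(conn, query, (42,))
```

A check like this could run in a local test suite or a pre-commit hook and fail when scans_before-style results appear for queries that are expected to use an index. A production-grade guardrail would need to understand the target database engine and realistic data volumes, which this sketch does not attempt.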

Observability for Databases

Another dimension that we must address to build reliability is monitoring. Current monitoring solutions fall short: they inundate users with raw data, aggregate signals in ways that obscure problems affecting specific user cohorts, and make it hard to debug and pinpoint issues.

Enabling developers to assume control of their databases requires tools attuned to database-related activities and developers’ workflows. Database monitoring tools should understand schema migrations, maintenance tasks, diverse hosting methods, multi-tenant applications, database extensions, configurations and numerous other facets. Asking developers to take ownership is impractical if monitoring tools overwhelm them with raw data and no explanation of how the system actually works.

Nonetheless, platform engineers can move from mere telemetry and monitoring to understanding and observability, as explained in our article on observability. By integrating database-aware tools into the platform, they make those tools readily available to developers. Once the tools are in place, developers can effectively monitor their databases and gain insights into how they evolve over time.

Automated Troubleshooting

Developers cannot assume ownership of their databases if burdened with labor-intensive and manual tasks. Activities such as setting thresholds, configuring alarms, reviewing dashboards or correlating queries with REST commands can all be automated. Instead of relying on monitoring systems to report generic issues like ‘high CPU usage,’ we need comprehensive narratives like ‘we deployed these changes to production, altering data distribution, leading to the application’s failure to use an index due to an outdated execution plan when executing the query in this particular part of the code’. This detailed account is what we require.

Platform engineers must give developers tools that tell the entire story rather than merely listing symptoms. This approach lets developers address issues more quickly and avoid laborious troubleshooting, such as collecting logs from various sources and using grep to search for correlation IDs. Automation based on our knowledge of databases is essential. Various strategies to enhance database performance, as outlined in our earlier discussions, should be automated within the system. Once these three areas are fortified with database guardrails, developers can once again take charge of their databases. Let’s explore the benefits this approach can yield.

Benefits of the Shift in Ownership

The primary advantage of implementing database guardrails and empowering developers to take ownership of their databases is scalability. This approach removes the constraints that hold teams back, unlocking their full potential and letting them operate at their optimal speed. Because developers no longer need to coordinate with other teams that lack the full context, they can work faster with less communication overhead. Streamlining communication between developers and system engineers was the first step and led to the emergence of DevOps engineers; the objective here is likewise to eliminate dependence on other teams. Developers are no longer reliant on system engineers or database administrators; they can independently manage and maintain their databases.

This results in a significantly accelerated evolution process. With each database now under the ownership of the respective microservice owner, any database issues are promptly addressed and resolved by the owner. There is no need for centralized performance management or maintaining teams of database administrators capable of optimization but unable to keep pace with the speed of development.

Another noteworthy aspect is the reduced bus-factor risk. Because knowledge of each database is shared across the team that owns it rather than concentrated within a single database administrators’ team, concerns about staff turnover or extended vacations are alleviated. Database task handovers can be managed like regular development workstreams, in line with agile principles, and database-related tasks fit into the team’s regular scrum process.

Ultimately, developers taking ownership of their databases minimizes the time required to identify and address database issues. Developers are freed from the burden of slow and mundane tasks. Semantic monitoring lets them identify issues promptly, automated troubleshooting gives them a comprehensive understanding of the problem, and they can fix the issues on their own. This eliminates the need for war rooms or call bridges to decipher the situation.

What is Ahead of Us

Continuous reliability is a must regardless of the company size. The shift of ownership provides a way to achieve it. Database guardrails mark the inception of a new era for developers and databases, but this is just the starting point. With the integration of machine learning (ML), automated troubleshooting can evolve into automated code changes. Similar to static code analysis that identifies common issues in programming languages, tools can generate automated pull requests to address typical problems, leveraging production database data captured automatically. Rather than initiating a ticket for ORM configuration changes, database guardrails can autonomously modify the code, seeking approval as a formality.

As developers take charge of their databases, they can employ CI/CD best practices to enhance the database’s state. The testing pyramid will expand beyond checking business requirements (what the application does) to cover how it does it, ensuring not only correct results but also a correct implementation.

Ultimately, this approach will reduce communication bottlenecks between teams and roles, transitioning from DevOps to DevDbOps. This is the path we must tread to unlock the full potential of developers.

Summary

In recent years, the software landscape has grown significantly more intricate. The proliferation of databases, services, communication channels and moving parts has added complexity. Just as we shifted toward DevOps and adopted CI/CD and infrastructure as code (IaC) to deploy changes faster, we now need database guardrails to empower developers to take ownership of their databases. It falls upon platform engineers to advocate for and implement this approach within their organizations.

Let's Finally Build Continuous Database Reliability! We Deserve It - DevOps.com (2024)