Development Security Bug Bash
Reecetech has been developing software for over 25 years, and web applications for almost half that.
As of 2021, we have over 1000 active microservices running in more than 7500 Kubernetes pods.
Measuring and addressing technical debt has become one of the hardest challenges we face.
With competing priorities, ever-changing products and a growing number of microservices, Reecetech recently found that potential security vulnerabilities were accumulating faster than the team responsible for maintenance could remediate them. Contributing problems included:
- Applications that haven’t been built or deployed for 3+ years.
- Unclear or outright undefined application owners and maintainers.
- Outdated frameworks, libraries, and dependencies with tens of thousands of known security issues (CVEs).
Bug Bash
While a long-term solution is being developed, developers were given a dedicated week to swarm on as many vulnerabilities as possible.
Each software project (repository) had a card/issue created and populated with the list of CVEs reported by Grype.
A points system was created that scaled with each card's CVE counts and ratings, and points were tracked on a dashboard.
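The post does not give the actual point values, so the following is only a minimal sketch of how such a scheme could be scored automatically. The weights in `SEVERITY_POINTS` and the function names `card_points` and `scoreboard` are hypothetical, chosen purely for illustration.

```python
# Hypothetical weights per CVE rating; the real values used in the bug bash
# are not stated in the post.
SEVERITY_POINTS = {"Critical": 10, "High": 5, "Medium": 2, "Low": 1}


def card_points(resolved):
    """Return the points for one card, given resolved CVE counts per rating."""
    return sum(
        SEVERITY_POINTS.get(rating, 0) * count
        for rating, count in resolved.items()
    )


def scoreboard(completed_cards):
    """Sum points per team over (team, resolved_counts) pairs, highest first."""
    totals = {}
    for team, resolved in completed_cards:
        totals[team] = totals.get(team, 0) + card_points(resolved)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Feeding a function like `scoreboard` from the card tracker would remove the need to tally scores by hand.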
Developers self-organised into teams of six, and each team had one dedicated Platform Engineer assigned.
Starting with the cards that had the highest number of Critical and High rated CVEs, each engineer would:
- Assign a card to themselves.
- Check out the code.
- Update the Docker base image.
- Run a build, which triggers a Grype scan and a number of internal audits that test for vulnerabilities and compliance.
- Update any dependencies or framework (Spring Boot/Django) if required.
- Gain peer review.
- Release the updated application through the environments.
- Mark the card as done, at which point points were awarded to their team.
- Repeat.
Results
CVEs resolved (approximate):
- Critical: 50% resolved (over 4000!)
- High: 20% resolved
- Medium: 15% resolved
- Low: 15% resolved
Retro
What worked well?
- Gamifying the work with a points system and competing teams.
- All devs stopping normal work for this common goal.
- Everyone put an awesome effort in and the mood was very positive.
- Having work split into cards by application.
- Mixing platform engineers in with developers.
- Lots of great peer reviews.
- Many apps that hadn’t been touched in years got fresh builds and deployments.
- Working remotely via Slack.
- High number of vulnerabilities resolved.
- Engineers who trialled Snyk as potential future tooling for security scanning mostly had a positive experience, especially as scans could be run locally.
What could have been better?
- Calculating the scores and updating the scoreboard should have been automated.
- Bamboo is clunky and hard to use for CI/CD.
- Bamboo's log output is very noisy and difficult for engineers to search through for scan results.
- Some applications didn’t have a README with information on how to build them.
- Not having APM for pre-production environments made it hard to gain confidence in deployments.
- No pre-commit hooks to do basic linting/tests before pushing code.
- It would have been nice to invest time in scheduled automatic rebuilds of apps that haven’t been touched in X number of days.
- Security scanning (Grype with the required wrappers) couldn’t be run locally. This meant a slow feedback cycle, as everyone had to push and wait for Bamboo to report the results.
- Every app seemed to have its own special way of being built, with no common or inherited build patterns.
- Every application needed manual verification that the build and deploy actually worked (monitoring and observability improvements are needed to increase confidence).
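One way to ease the log-searching problem noted above would be to condense the scan report into a single summary line. The sketch below assumes the report comes from Grype's JSON output (`-o json`), where each entry under `matches` carries a `vulnerability.severity` field. The function names are hypothetical, and this is not the internal wrapper described in the post.

```python
import json
from collections import Counter

# Display order for the summary line.
SEVERITY_ORDER = ["Critical", "High", "Medium", "Low", "Negligible", "Unknown"]


def summarise_grype(report_json):
    """Count matches per severity in a Grype JSON report (given as a string)."""
    report = json.loads(report_json)
    counts = Counter()
    for match in report.get("matches", []):
        # Fall back to "Unknown" if a match carries no severity.
        severity = match.get("vulnerability", {}).get("severity", "Unknown")
        counts[severity] += 1
    return dict(counts)


def format_summary(counts):
    """Render the counts as one short, easily searched line."""
    return " ".join(f"{name}={counts.get(name, 0)}" for name in SEVERITY_ORDER)
```

Printing the result of `format_summary` at the end of a build would give engineers one predictable line to search for.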
Moving forward
Shifting Tooling Left
We are working on new ways to empower and influence developers to shift left with security.
While our current security tooling has mostly sufficient coverage, it has slow feedback loops where developers need to push their code to CI/CD and wait for security tests to run before they know if there are any vulnerabilities.
While we want to keep security testing in CI/CD, we want to empower developers to test their code before it is pushed.
To enable this, we are looking to move from a custom Grype wrapper to Snyk, which provides Docker image scanning, open source dependency scanning, and SAST.
Application Ownership
Each application has been updated with a MAINTAINER label within the Dockerfile. CI checks that this label exists and is set to a valid development team.
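The post does not show how the CI check is implemented. As a rough sketch, and assuming the label is written as a `LABEL maintainer="<team>"` instruction, a check could look like the following. The team names in `VALID_TEAMS` and the function names are hypothetical.

```python
import re

# Hypothetical team names; a real check would load these from wherever
# development teams are registered.
VALID_TEAMS = {"payments", "platform", "inventory"}

# Assumed form: LABEL maintainer="payments" (key matched case-insensitively).
_LABEL_RE = re.compile(
    r'^\s*LABEL\s+.*?\bmaintainer\s*=\s*"?([^"\s]+)"?',
    re.IGNORECASE | re.MULTILINE,
)


def find_maintainer(dockerfile_text):
    """Return the maintainer label value, or None if no label is present."""
    match = _LABEL_RE.search(dockerfile_text)
    return match.group(1) if match else None


def check_maintainer(dockerfile_text, valid_teams=VALID_TEAMS):
    """Return (ok, detail): ok is True only for a label naming a known team."""
    team = find_maintainer(dockerfile_text)
    if team is None:
        return False, "no maintainer label found"
    if team not in valid_teams:
        return False, f"unknown team: {team}"
    return True, team
```

A CI step could fail the build whenever `check_maintainer` returns `False` and print the detail string as the reason.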
Defining SLOs and Measuring SLIs
We want to consider setting SLOs and SLIs for the applications we develop or deploy, to define what good looks like and ensure alignment between the business and engineering.
Without Service Level Indicators or Service Level Objectives defined, how do we know whether it’s acceptable to have 1 or 100 critical vulnerabilities in a given application?
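To make the question concrete, a vulnerability objective could be expressed as a maximum number of open CVEs per rating, with the current open count as the indicator. The thresholds below are invented for illustration and are not values the team has agreed on. `slo_breaches` is a hypothetical helper.

```python
# Hypothetical objectives: maximum open CVEs tolerated per rating for one
# application. These numbers are illustrative only.
SLO_MAX_OPEN = {"Critical": 0, "High": 5, "Medium": 20}


def slo_breaches(open_cves, objectives=SLO_MAX_OPEN):
    """Return {rating: (actual, allowed)} for each rating over its objective."""
    breaches = {}
    for rating, allowed in objectives.items():
        actual = open_cves.get(rating, 0)
        if actual > allowed:
            breaches[rating] = (actual, allowed)
    return breaches
```

An empty result means the application is within its objectives. Anything else lists which ratings need attention and by how much.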
Increasing Build Frequency
A number of applications had not been built for over 3 years, making the effort required to resolve security issues significant.
We want to look into automating builds of all software we develop that has not been built for a given number of days, and alerting the owners if the build fails.
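The selection step of such a scheduler might look like the sketch below. It assumes a mapping from application name to last successful build time is available from the CI system. That mapping and the name `stale_apps` are hypothetical, and triggering the rebuild and alerting owners are left out.

```python
from datetime import datetime, timedelta, timezone


def stale_apps(last_built, max_age_days, now=None):
    """Return sorted names of apps last built more than max_age_days ago.

    last_built maps app name -> timezone-aware datetime of the last build.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return sorted(app for app, built_at in last_built.items() if built_at < cutoff)
```

A scheduled job could call this daily and queue a rebuild for each name returned.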
Enabling Renovate Auto-Remediation
Renovate is currently used to create automated PRs for vulnerabilities in our applications.
We have begun taking this to the next level by enabling automatic merging of these PRs when the build passes and has sufficient coverage.
Developer Security Training
We are looking to introduce an external secure development training solution such as application.security or PagerDuty security training to increase developer awareness and understanding of security vulnerabilities.
Introducing Security Champions
We are planning to start a security champions program that enlists security-minded engineers to advocate for security best practices.
Once trained, the security champions will become the voice for security within their teams and work together to identify improvements to culture, processes and tooling.