How OpenWater uses Raygun to reduce error noise by 99%
Posted Oct 19, 2017 | 9 min. (1783 words)
Note: This article was last updated October 2019.
Our customers are always using Raygun in unique ways. In this article, we’ll look at the workflow processes of OpenWater, the preferred Awards Management Software of giants like The Disney Company and Kellogg’s, and how they use Raygun on a daily basis to reduce error noise by 99%.
But that’s not all. OpenWater has also reduced their error diagnosis time from 10 minutes per bug to just a few seconds, all while significantly improving their customer experience over just six months.
We’ll show you step-by-step how Co-Founder and Director of Technology Kunal Johar configured Raygun Crash Reporting, and which tools he recommends for anyone wanting to replicate his results.
(You can read the full case study here.)
How Kunal identified error noise using Raygun Crash Reporting
Before Raygun Crash Reporting, OpenWater relied on traditional logging tools for error management.
When OpenWater moved their data to the cloud, logging tools became impossible to scale to the needs of a growing number of servers. Kunal chose Raygun Crash Reporting as their centralized error monitoring solution to manage growth.
Once data was flowing from OpenWater’s applications into the dashboard, Kunal realized most of the errors taking up valuable support time were, in fact, user-generated or outside of their control (like a faulty browser plugin).
“If a client experiences a broken link, from their point of view, it’s broken. But from our point of view, it was the email tool breaking the link.”
Kunal Johar - Director of Technology at OpenWater
Rather than ignoring the bulk of active errors coming through in the Crash Reporting dashboard, Kunal saw an opportunity to provide better and more efficient support.
“If we could detect the ‘why’ we could compensate for our customer’s technology toolchain,” he says.
Kunal implemented a company-wide initiative called the Raygun Error Zero Initiative as a solution, which aims to reduce error noise and improve response times.
The development team uses Raygun’s tagging feature and the Crash Reporting dashboard to reduce and sort 25,000 errors a month. The tagging is designed so only the most critical errors land in Raygun. Now, there are only two or three errors that need action per week.
Here’s how they do it.
The Raygun Error Zero Initiative: how to use the custom tagging feature to reduce error noise from actions outside of your control
The initiative’s success relies on two main areas of Raygun Crash Reporting:
- Custom tagging
- The Raygun Crash Reporting dashboard
First, however, if you’d like to try the Error Zero Initiative in your company, Kunal recommends that you set up your unit tests, then implement the initiative to proactively keep error noise to a minimum.
1. Set up unique codes for exceptions outside of your control using Custom Tags
Tags are an important way of helping your team filter errors. They are essentially labels you give to exceptions in your code.
Kunal and his team give each exception a nine-character alphanumeric code, so when the application throws the exception, it immediately has a classification and can either be marked as ‘Ignored’ or, if it is major, attended to ASAP.
The code acts as a classification system as such:
A20 - 001 - 002
User-generated error 20 - class one - line 2
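A code in this shape can be split into its parts programmatically. The following is a minimal sketch, assuming the segment meanings implied by the single example above (a category letter plus error number, a class, and a line); the function name `parseClassificationCode` and the returned field names are hypothetical, not OpenWater’s actual implementation.

```javascript
// Hypothetical parser for codes shaped like "A20 - 001 - 002".
// Segment meanings are inferred from the one example in the article.
const CODE_PATTERN = /^([A-Z])(\d{2})\s*-\s*(\d{3})\s*-\s*(\d{3})$/;

function parseClassificationCode(code) {
  const match = CODE_PATTERN.exec(code.trim());
  if (!match) {
    throw new Error(`Unrecognized classification code: ${code}`);
  }
  const [, category, errorNumber, errorClass, line] = match;
  return {
    category,                          // "A" marks a user-generated error in the example
    errorNumber: Number(errorNumber),
    errorClass: Number(errorClass),
    line: Number(line),
    // Normalized form without spaces, convenient as a tag value.
    tag: `${category}${errorNumber}-${errorClass}-${line}`,
  };
}
```

Rejecting malformed codes early helps keep tags consistent, since a typo in a code would otherwise create a tag that matches nothing in your documentation.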
How to set up Tags in Raygun Crash Reporting
Sending custom data in the form of a tag in Raygun is language-specific, so head to the Raygun documentation for details. Here’s an example in JavaScript:
On initialization:
// V2
rg4js('withTags', ['tag1', 'tag2']);
// V1
Raygun.init('apikey').withTags(array)
On send:
rg4js('send', {
  error: e,
  tags: ['tag3']
});
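To attach a classification code as a tag at send time, a small wrapper can help. This is a sketch only: `reportWithCode` is a hypothetical helper, and the `send` argument stands in for whatever function forwards the payload to Raygun (in the browser, that could be a function that calls `rg4js('send', payload)` as shown above).

```javascript
// Hypothetical helper that adds a classification code to the tags on send.
// `send` is injected so the helper can be exercised without the Raygun script.
function reportWithCode(send, error, classificationCode, extraTags = []) {
  // Strip spaces so "A20 - 001 - 002" and "A20-001-002" yield the same tag.
  const codeTag = classificationCode.replace(/\s+/g, '');
  const tags = [codeTag, ...extraTags];
  send({ error, tags });
  return tags;
}

// Example wiring in the browser (assumes the Raygun4JS snippet is loaded):
//   try { riskyOperation(); }
//   catch (e) { reportWithCode((p) => rg4js('send', p), e, 'A20 - 001 - 002'); }
```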
As you create exception labels, keep a record of the codes you use in a documentation manual for future reference (and to keep tags consistent across your codebase.)
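One way to keep that record close to the code is a small lookup table. The sketch below is illustrative only: the registry entries, their descriptions, and the `describeCode` helper are all assumptions made for this example.

```javascript
// Hypothetical registry of classification codes. Entries are illustrative.
const CODE_REGISTRY = new Map([
  ['A20-001-002', { description: 'Link broken by an email tool', userGenerated: true }],
  ['B01-001-001', { description: 'Unhandled server exception', userGenerated: false }],
]);

function describeCode(code) {
  // Normalize spacing before the lookup so both written forms match.
  const entry = CODE_REGISTRY.get(code.replace(/\s+/g, ''));
  // Flag unknown codes instead of failing, so they can be documented later.
  return entry ?? { description: 'Undocumented code', userGenerated: false, undocumented: true };
}
```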
2. Sort errors by ‘Active,’ ‘Ignore’ or ‘Permanently ignore’ in your Crash Reporting dashboard
OpenWater’s development team uses the Crash Reporting dashboard to sort the exceptions landing in the ‘Active’ tab.
When an exception lands in the Crash Reporting dashboard, you can see the attached classification code come through in the ‘Active’ error tab. Click on the error message to open the ‘Error details’ page. The code will show on the ‘Error Id’.
Then, cross-check the nine-character classification code with your documentation.
Is it a user-generated error? Change the error status to ‘Ignore,’ which will send the error to the ‘Ignore’ tab. Or, if you are confident the error doesn’t need action, send it to the ‘Permanently ignore’ tab.
Perhaps the error is more complex but isn’t affecting too many users. In this case, you can leave it in the ‘Active’ tab for a fix later. However, if the error is affecting many users, assign a team member to fix the error in your project management tool.
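The decision steps above can be sketched as a small function. The input field names and the threshold for “many users” are assumptions chosen for illustration; the returned labels mirror the dashboard statuses described in this section.

```javascript
// Hypothetical triage rule mirroring the steps described above.
const MANY_USERS_THRESHOLD = 50; // assumed cut-off, tune for your own traffic

function triage({ userGenerated, needsNoAction, affectedUsers }) {
  if (needsNoAction) return 'Permanently ignore';
  if (userGenerated) return 'Ignore';
  if (affectedUsers >= MANY_USERS_THRESHOLD) return 'Active: assign for fix';
  return 'Active: fix later';
}
```

Status changes themselves still happen in the dashboard; a rule like this is only useful as shared documentation of how the team decides.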
Finally, sort the error details by ‘User,’ ‘Count’ and ‘Last seen’ in the Crash Reporting Dashboard to assess priorities.
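If you export error groups for offline review, the same ordering can be reproduced in code. This is a minimal sketch; the field names `users`, `count`, and `lastSeen` are assumptions for the example and not the shape of any Raygun API response.

```javascript
// Sorts error groups by affected users, then occurrence count, then recency.
// Returns a new array and leaves the input untouched.
function sortByPriority(errorGroups) {
  return [...errorGroups].sort(
    (a, b) =>
      b.users - a.users ||
      b.count - a.count ||
      new Date(b.lastSeen) - new Date(a.lastSeen)
  );
}
```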
3. Provide customers with the antidote by writing friendlier error messages
With your tags in place, you now know which errors are important, which can wait, and which to ignore.
The next step Kunal takes is to put the power into his customers’ hands by writing more useful error messages. Peep Leija shows some good examples in his article here.
“We wanted to isolate those so we can focus on things that are truly within our control, and if we can show a friendlier error to the user, we should,” Kunal explained.
By giving his customers the antidote to the error, Kunal and his team reduced incoming error noise by 99% and provided his customers with a better experience around malfunctions.
Before, customers would be confused by technical language, assume something on the website was broken and contact support.
For example, a customer could receive the following exception label:
‘FormatException. Input string not in the correct format.’
Rewritten to read:
‘Please enter the correct email address.’
Before, when customers experienced an error, they would see a technical description:
‘System data entity infrastructure exception.’
Kunal rewrote his error messages in a more helpful format:
‘We apologize you’ve run into this problem. OpenWater has sent the error with error code ABCD. We have notified the OpenWater team. If you would like more urgent attention, please send us this error code using this link.’
Writing more explicit error messages gives the power to the user and helps your development team distinguish between user-generated exceptions, which can cause a lot of noise, and real errors, which can get buried among minor malfunctions.
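A simple lookup from technical exception names to friendlier text is one way to implement this. The sketch below is an assumption about how such a mapping might look, not OpenWater’s code; the messages echo the examples above, and `friendlyMessage` is a hypothetical helper.

```javascript
// Hypothetical mapping from exception names to user-facing messages.
const FRIENDLY_MESSAGES = new Map([
  ['FormatException', 'Please enter the correct email address.'],
]);

function friendlyMessage(exceptionName, errorCode) {
  const known = FRIENDLY_MESSAGES.get(exceptionName);
  if (known) return known;
  // Fallback for errors without a specific message: apologize and give the
  // user an error code they can pass on to support.
  return (
    `We apologize you've run into this problem. We have notified the team. ` +
    `If you would like more urgent attention, please send us error code ${errorCode}.`
  );
}
```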
4. Prioritize your development time by addressing the top 10% of recurring errors
Rewriting every error doesn’t make a great deal of sense. Prioritizing time is important on any dev team, and one way Kunal ensures team focus is to only react to things that are recurring.
Kunal’s team only surfaces the top 10% of the most frequent errors, and he empowers the team to raise common exceptions in Slack. It’s as easy as writing a message to your team chat room, he says:
“Using this method, we can say ‘Okay, this error string incorrect format has come up 100 times this week. What if we can make that into a friendlier exception and deal with it?’”
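Picking out the most frequent 10% of errors is a small calculation. Here is a minimal sketch, assuming each error group is an object with a `count` field (a hypothetical shape chosen for this example):

```javascript
// Returns the most frequent 10% of error groups, and at least one group
// whenever the input is non-empty.
function topTenPercent(errorGroups) {
  if (errorGroups.length === 0) return [];
  const sorted = [...errorGroups].sort((a, b) => b.count - a.count);
  const keep = Math.max(1, Math.ceil(sorted.length / 10));
  return sorted.slice(0, keep);
}
```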
This method could also easily be a distraction to your team, though, so it’s important to manage error resolution time. Kunal offers some valuable advice on prioritizing your team’s time below.
How to allocate the right amount of time to error resolution
Kunal recommends allocating half your error resolution time to fixing errors and half the time to writing more useful error messages. You should see improvements in as little as two months. Typically, some errors will be harder to handle, so Kunal starts with what he can manage at the time. He then recommends temporarily ignoring the more complex errors and returning to them within six months.
Kunal plays a mentorship role to technology startups and advises that it’s better to have customers and bugs, but if you can’t budget at least 30% toward bug fixing, you’ll end up spending 50% toward error handling.
“In the beginning, at a minimum, 30% of your time should be spent on remediation. Once you get below that number, start spending time on proactive work. If you have a big development budget, do proactive work at the beginning.”
How to do this in Raygun Crash Reporting
Sort the Crash Reporting dashboard by the number of affected users (by clicking the ‘Users’ column header) to make sure the error is only affecting a handful of people on a minor page.
Then, change the status of the error from ‘Active’ to ‘Ignore.’
5. Check your dashboard proactively to ensure your application is healthy
Now that you have your tagging set up and have assigned priorities, you can do quick check-ins to ensure your application is healthy.
All you need to do is regularly check in with your Crash Reporting dashboard to watch for recent activity and error count, especially error spikes.
You can also quickly search for a particular tag: enter the tag into the search bar in the sidebar of the Crash Reporting dashboard and hit the return key.
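What counts as a spike is a judgment call. One rough way to express it, sketched here under the assumption that you track daily error counts yourself, is to compare the latest day against the average of the days before it. The factor of 3 is an arbitrary choice for illustration.

```javascript
// Flags a spike when the latest daily count exceeds `factor` times the
// average of the preceding days. Needs at least two days of data.
function isSpike(dailyCounts, factor = 3) {
  if (dailyCounts.length < 2) return false;
  const history = dailyCounts.slice(0, -1);
  const latest = dailyCounts[dailyCounts.length - 1];
  const average = history.reduce((sum, n) => sum + n, 0) / history.length;
  return latest > average * factor;
}
```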
Which tools does Kunal recommend in his developer toolkit?
Finally, Kunal recommends the following developer toolkit to help make any software successful:
- Jira Software - for managing projects
- GitHub - repository management
- Slack - team communication and raising issues
- Intercom - a centralized customer service management platform
- Stripe - to receive and organize payments
- Raygun - for crash and error monitoring
How well does the Raygun Error Zero Initiative work to reduce error noise?
Kunal has been using this system to manage OpenWater’s error noise from browser plugins and user-generated errors for over a year. Over the course of the next 12 months, Kunal has plans to continue and evolve the program. He recognizes eliminating all errors is not the actual goal, but it helps to motivate the team to manage errors.
“We are never going to get to zero errors, but the point is, do we have a manageable number we can look through?”
As a result of the Raygun Error Zero Initiative, OpenWater is able to:
- Reduce error noise by 99%
- Reduce error diagnosis time from 30 - 40 minutes per error to just a few seconds
- Provide human-friendly messages when errors occur to let casual users troubleshoot and rectify common problems on their own
Raygun wishes to thank Kunal and his team for their generosity in providing information for this blog post. You can find out more about OpenWater and their company by visiting their website.