LinkedIn’s engineering organization has released two tools as open source projects to help businesses respond when their applications fail. The tools automatically contact engineers to deal with issues that arise in those applications.
Iris, named after the Greek messenger goddess, notifies users of alerts generated by company systems. For example, Iris can be set up to contact an on-call engineer and notify the site reliability organization if a production server goes down. If users don’t respond to a first notification, Iris can be configured to send subsequent messages until it receives a response.
Who the system contacts is driven by Oncall, another project that was released today. That service lets companies lay out a schedule of who is responsible for dealing with issues when they arise. Users enter their schedule in a calendar, and Iris uses that information to route notifications to the right people.
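To make the calendar-driven lookup concrete, here is a minimal sketch of how a schedule could be resolved to a person at alert time. It is not Oncall's actual data model or API: the `Shift` class, the `who_is_on_call` function, the names, and the dates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Shift:
    """One calendar entry (hypothetical): who is on call for a time window."""
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def who_is_on_call(schedule: list[Shift], at: datetime) -> Optional[str]:
    """Return the engineer whose shift covers `at`, or None if there is a gap."""
    for shift in schedule:
        if shift.start <= at < shift.end:
            return shift.engineer
    return None


# Made-up weekly rotation for illustration only.
schedule = [
    Shift("alice", datetime(2017, 6, 19), datetime(2017, 6, 26)),
    Shift("bob", datetime(2017, 6, 26), datetime(2017, 7, 3)),
]
print(who_is_on_call(schedule, datetime(2017, 6, 27, 3, 0)))  # bob
```

Treating each shift as a half-open interval means a handoff moment belongs to exactly one engineer, and a `None` result makes gaps in coverage visible rather than silently dropping the alert.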
These projects are designed to make it easier for companies to automate the process of notifying their engineers of outages and other issues. LinkedIn created Iris and Oncall as part of the company’s move towards increasingly automatic notifications. Prior to the system’s implementation, the company’s Network Operations Center engineers notified on-call engineers manually when issues arose.
Deploying the systems inside a company is supposed to be fairly easy, according to the LinkedIn engineers who helped build them.
“Different companies might use different deployment tools, but Docker is a good example: You can get Iris and Oncall running in four commands total,” said Daniel Wang, a site reliability engineer at LinkedIn. “So, we’ve tried to make that process as easy as possible.”
Once Iris and Oncall are deployed, users have to set up information about the engineers who need to be part of the system. Wang said that LinkedIn uses LDAP, but other companies could set the system up to use other modes of authentication.
Once Iris and Oncall are running, different users can also set up different methods of notification. For example, some Slack addicts might prefer to get their high-priority notices sent there, while other engineers might prefer an old-fashioned phone call. All told, Iris supports contacting people through email, Slack message, SMS, and phone call.
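One way to picture per-user notification preferences is as a small lookup with a fallback. The sketch below is an assumption-laden illustration, not Iris's configuration format: the four channel names come from the article, while the `high`/`low` priority labels, the defaults, and the `pick_mode` function are hypothetical.

```python
# The four channels are the ones named in the article; the priority labels,
# default mapping, and function name are hypothetical.
SUPPORTED_MODES = {"email", "slack", "sms", "call"}
DEFAULT_MODES = {"high": "call", "low": "email"}


def pick_mode(user_prefs: dict[str, str], priority: str) -> str:
    """Choose a channel: the user's preference if set, else the default."""
    mode = user_prefs.get(priority, DEFAULT_MODES[priority])
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    return mode


slack_fan = {"high": "slack"}          # prefers Slack for urgent notices
print(pick_mode(slack_fan, "high"))    # slack
print(pick_mode({}, "high"))           # call
```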
Iris can also be set up to batch notifications to a user if multiple notices come in at once. During periods of high activity, users can also set the system up to de-escalate notifications, so they get a phone call for the first problem, and subsequent notices come in via SMS, email, or Slack.
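The de-escalation behavior described above can be sketched as a simple rule: use a loud channel for the first alert, then a quieter one for alerts that follow shortly after. This is an illustration under stated assumptions rather than Iris's implementation; the `Deescalator` class, the ten-minute window, and the choice of `call` and `sms` are all hypothetical.

```python
from datetime import datetime, timedelta


class Deescalator:
    """Hypothetical rule: loud channel first, quieter channel for follow-ups."""

    def __init__(self, window: timedelta,
                 first_mode: str = "call", later_mode: str = "sms"):
        self.window = window
        self.first_mode = first_mode
        self.later_mode = later_mode
        self._last_alert: dict[str, datetime] = {}  # user -> time of last alert

    def mode_for(self, user: str, now: datetime) -> str:
        """Return the channel to use for an alert sent to `user` at `now`."""
        last = self._last_alert.get(user)
        self._last_alert[user] = now
        if last is not None and now - last <= self.window:
            return self.later_mode  # still inside a burst: stay quiet
        return self.first_mode      # first alert, or the burst has ended


d = Deescalator(window=timedelta(minutes=10))
t0 = datetime(2017, 6, 20, 2, 0)
print(d.mode_for("alice", t0))                          # call
print(d.mode_for("alice", t0 + timedelta(minutes=3)))   # sms
print(d.mode_for("alice", t0 + timedelta(minutes=30)))  # call
```

In this sketch the window slides with each alert, so a steady stream of notices stays on the quieter channel until there has been a full quiet period.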
LinkedIn also recommends that users batch alerts at the source. The company found out the hard way that its switch from human-driven notifications to automated ones meant that engineers lost some of the curation of alerts that the people relaying them had provided, so it’s necessary to set that up programmatically to some degree.
These open source tools could replace other paging options that businesses pay for, as well as manual processes companies have already put in place. LinkedIn’s roadmap for the tools leans heavily on better out-of-the-box integrations with identity services, as well as support for mobile apps, so that users can respond to their alerts from a smartphone.