All of this allows us to keep response times very low – on average 6ms and in the 99th percentile between 10-20ms:
Last Thursday however, our fast response times quickly deteriorated:
And consequently we were only able to serve 1/6th of the requests from our clients:
While this alone was bad enough, less than an hour later we saw the same slowdown in response times, but this time with a more serious impact on our application: it was no longer able to communicate with Redis properly and was throwing exceptions in the process. This effectively took our ad server offline:
Once we saw that we were down, we were able to restore access within 15 minutes by spinning up new EC2 instances of our ad servers and Redis instance. The data stored in Redis is quickly restored from our primary MySQL database. In this way we keep loss of state to a minimum: just the changes on Redis that haven’t been pulled back into MySQL.
So what was it that caused our response times to slow down, and what caused the ad servers to lose their connection with Redis?
In the Redis logs, during the initial slowdown we saw this:
 31 May 02:16:14.084 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting
We have Redis configured to use AOF (append-only file) persistence because we were concerned about the durability of the data stored there. The Redis docs contain a great overview of Redis persistence models, but in summary, by using AOF you can keep data loss to a minimum (by default you would lose up to 1 second’s worth of data). When the fsync of the AOF file takes too long, Redis will compensate by slowing down to ensure data is not lost.
The message in the log above indicates one of two likely causes:
We can easily rule out the first possibility. Our Redis instance only handles about 2000 transactions per second on an m1.large EC2 instance. We saw no difference in the number of transactions at any time before the slowdown occurred.
That leaves the second option. On EC2 we run the risk of having noisy neighbors (an issue where other ‘nearby’ instances are consuming significant resources). Unfortunately we don’t have the insight today to be able to determine if this is the root cause but we’ve seen reports from others with similar issues.
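One way to spot this kind of slowdown sooner is to scan the Redis log for the warning quoted above. The following is a minimal sketch, not something we run in production: the function name `slow_fsync_times` is hypothetical, and the timestamp pattern is assumed from the single log line shown earlier.

```python
import re

# Pattern for the slow-fsync warning. The timestamp layout ("31 May 02:16:14.084")
# is an assumption based on the one log line quoted in this post.
AOF_SLOW = re.compile(
    r"(?P<ts>\d{1,2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3}) \* "
    r"Asynchronous AOF fsync is taking too long"
)

def slow_fsync_times(lines):
    """Return the timestamp string of every slow-fsync warning found in `lines`."""
    hits = []
    for line in lines:
        match = AOF_SLOW.search(line)
        if match:
            hits.append(match.group("ts"))
    return hits

# First entry is the warning from this post; the second is a made-up non-matching line.
sample = [
    " 31 May 02:16:14.084 * Asynchronous AOF fsync is taking too long (disk is busy?). "
    "Writing the AOF buffer without waiting",
    "an unrelated line that should be ignored",
]
warnings = slow_fsync_times(sample)
```

Counting these warnings per minute and alerting on a threshold would be one way to use the result; the threshold itself would have to be tuned to your own workload.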
What happened during the second slowdown which led to complete application meltdown?
From our application logs we saw this:
com.twitter.finagle.redis.ServerError: ERR operation not permitted
This indicates that our adserver is somehow not able to authenticate with the Redis server. Now, the Redis security docs don’t strongly recommend using authentication as a sole means of protection but rather recommend using network level security. Unfortunately, timing is everything! We’re in the final stages of moving off of Heroku, and until that’s complete we’re temporarily relying on Redis AUTH.
So we have two problems now:
Our solution to these problems was simple: if something is causing us pain, stop doing it!
In the past few months we improved our application so that we don’t require the same amount of durability, but we had nevertheless kept this option enabled. So the quickest solution turned out to be simply turning AOF off. We can recreate nearly all the data in Redis from our MySQL datastore, the only exception being daily spend information, but we have the Redis RDB snapshots every minute for that.
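In configuration terms, the change amounts to something like the fragment below. This is a sketch rather than our actual config file: `appendonly` and `save` are standard redis.conf directives, but the specific `save` thresholds are an assumption chosen to match "snapshots every minute".

```
# Sketch only; the values are assumptions, not our production settings.

# Turn the append-only file off.
appendonly no

# Write an RDB snapshot if at least 1 key changed in the last 60 seconds.
save 60 1
```

Note that redis.conf does not support trailing comments on a directive line, which is why each comment sits on its own line.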
Furthermore, we can remove the AUTH issues by not using AUTH. This means moving applications that require access to Redis inside EC2 and setting up network level security using EC2 security groups. We’re in the process of doing so now.
The incident also reinforced another thing we’ve found invaluable: having visibility into your application metrics helps detect and pinpoint failures. We had a lot of good data on our application behavior that helped, such as our ad server and Redis performance metrics, but we still found ourselves needing data that we weren’t yet capturing, such as disk I/O performance.
Stepping back, I think the biggest thing we learned is to routinely question earlier assumptions made when building our application. Over time, your application and infrastructure build up cruft that was necessary at one point and may no longer be needed. Eliminating variability in your application configuration reduces the search space for diagnosing application problems, and coupled with more data helps isolate root causes faster.
If you’re interested in building high performance services with Scala, Finagle, Redis, etc. and working with a team who respects the time and effort it takes to build solid infrastructure, Sharethrough is hiring.
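As an illustration of the kind of metric we watch, here is a minimal sketch of computing an average and a 99th percentile from raw latency samples. The helper names and the sample data are made up for the example, and the nearest-rank method is just one common way to define a percentile; it is not necessarily how our monitoring computes it.

```python
import math

def mean(samples):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(samples) / len(samples)

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up latencies in milliseconds: mostly fast, a few slower, one outlier.
latencies_ms = [5] * 95 + [8] * 4 + [40]

avg_ms = mean(latencies_ms)
p99_ms = percentile(latencies_ms, 99)
max_ms = percentile(latencies_ms, 100)
```

Tracking a high percentile alongside the average matters because a handful of slow requests barely moves the mean but shows up clearly in the tail.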
A number of industry leading publishers, including PEOPLE, Serious Eats and Forbes, are already using the Sharethrough mobile platform to power native ads on their mobile websites.
Pepsi ®, one of our mobile launch brand partners, shared some thoughts on why Sharethrough Mobile Sponsored Stories is important to them:
“At Pepsi, we strongly believe in creating advertising that is genuinely entertaining and promoting our content through ways that are respectful to our audience,” said Josh Nafman, Sr. Digital Brand Manager, Pepsi ®. “Sharethrough’s mobile platform will help Pepsi reach audiences in a way that upholds our values of engagement and compelling advertising experiences. We’re excited for this new channel to connect with our fans on their mobile devices.”
We are also working with leading brands including Cruzan® Rum, Sauza® Tequila, and Pepsi. Both Cruzan Rum and Sauza Tequila will be using the Sharethrough mobile platform to reach engaged audiences for their entertaining video campaigns, including Sauza Tequila’s irresistible “Make It With a Lifeguard” video series. “With Mobile Internet usage on the rise, Sauza’s excited to be one of the first brands to test Sharethrough’s Native advertising distribution method specifically in driving consumers to watch our ‘Make It With a Lifeguard’ videos,” said Lindsey Lewis, Brand Manager, Jim Beam Inc. “We are thrilled to provide engaging content whenever and wherever our fans want.”
So how does it work?
Sharethrough Mobile Sponsored Stories allow brands to promote their original content, such as videos, articles, posts, reviews and more, across mobile sites in a scalable way that respects and enhances the user experience of each site. Mobile Sponsored Stories appear as part of the stream of content within a publisher’s mobile site experience and are automatically updated to match the look-and-feel of the organic site content. This allows advertisers to promote their content in placements across the mobile web that feel native to each site they appear on.
How do you make ads look and feel native?
Real Time Templating (RTT), a technology that Sharethrough developed, allows our advertising platform to identify the style elements of any webpage and match each ad to those attributes in real time. Even better, we are able to match any future updates to the style template, so even if a publisher changes their design, your ad will stay consistent with the new style elements.
Are these ads effective?
In a word, yes. The Sharethrough solution was built to work in a “scroll-centric” environment, so it does not interrupt mobile site usage – which according to Forrester is users’ most frequently requested mobile ad feature. Our mobile ads are ideally suited to the smaller screen size of mobile devices, where there is less real estate for content and advertising to run adjacently. Sharethrough Mobile Sponsored Stories enable brand messages to stand out among the site content, with up to one-third of the entire screen dedicated to brand content.
The launch of Mobile Sponsored Stories marks a big leap for Sharethrough and the native advertising industry at large. Our goal is to be the leading provider of native ad products across all platforms and to create technology that makes advertising better – today we feel just a little closer to that vision. If you’d like to learn more about our mobile advertising solutions, please click here.
To see full results take a look at the infographic below.
“Yahoo’s undeniable strength is content. From Sports to News to Entertainment, Yahoo has far exceeded any other Silicon Valley company’s efforts to build loyal audiences through original content. There could not be a better time for Yahoo to go native with content-driven advertising. Brands are investing unprecedented budgets in their own original content and publishing efforts, and are in deep need of new mediums to promote this branded content in native ways.”
OK, enough with the back-patting. The main reason for this post is to welcome Yahoo to the native party. Their inclusion is a large step for the entire digital advertising industry and a very important stamp of approval from one of the largest media companies. We look forward to additional publishers and media heavyweights following their lead.
From the first-ever Native Advertising Summit, to an increase in branded custom content, to today’s announcement, 2013 has been a big year for native. To reflect that growth, we plan on rolling out another version of the Native Adscape, which offers the only all-inclusive view of companies, technologies, and publishers involved in the native movement.
Until next time, stay native, friends.