mashraqi


[ This is my personal blog, so all opinions expressed here are mine. I am a product, scalability, operations and monetization advisor, currently employed as Director of Business Operations & Technical Strategy for a top 50 website that delivers billions of page views per month. I was a panelist for the Scaling Up or Out keynote at the MySQL Conference and speak regularly at conferences and user groups. ]
Farhan "Frank" Mashraqi

Wednesday, June 25, 2008

Harnessing Explosive Growth: Infrastructure Strategies and Tactics

The session I am now waiting for is Harnessing Explosive Growth: Infrastructure Strategies and Tactics. The official description of the session is:

What worked in the garage can rarely be scaled effectively for the boardroom. This panel will bring together some of the biggest names in web infrastructure to share their thoughts, insights and tactics for harnessing explosive growth, with a focus that goes beyond simply which technologies are available but how to best deploy them. This panel is not to be missed.
Panelists include:
How much of scalability is architecture, and how much is throwing servers at the problem?

SJ: 99% architecture
JH: Product is what drives architecture. We have more than 10,000 servers. For chat, they built a separate network. www.facebook.com/eblog.

JR: The biggest consideration is how their servers work with Facebook.

When's it broken? When did you know your first architecture was broken?
AG: First million users. Then they started focusing on caching and sharding.

JB: eBay had a few catastrophic issues near 2000. People were very forgiving of availability issues. They give refunds when there is downtime.

JR: They moved a lot of their technology to EDGE.

SJ: Since Meebo is just one page, JavaScript delivery is a major issue for them. They are not generating pages. Dynamic loading and background computation are really important.

RP: Applications and infrastructure are going to break. If you look hard enough, you can find where the scalability issues are.

JR: They are very metrics-driven. They don't often see outages. They have a number of Facebook's ops folks on their IM lists and they talk continuously.

JH: We work through problems together.

JH: They have had to turn off some applications because other applications were being affected.

JB: At eBay they have created a central application logging system. They can flag and identify problems really quickly. If you don't have it, you're shooting in the dark.

Rolling your own stuff? Off-the-shelf vs. custom:
Did you roll your own? Do you regret it?

SJ: For a cash-constrained startup, off-the-shelf can work. But at scale, you'd want to build it yourself. Open source is awesome. No one can scale your system as well as you can. Off-the-shelf can be bulky. You have to get your hands dirty.

JR: We built our own caching backend. Invest time in the core stuff; don't focus on anything that's not core.

What else needs to scale as you grow? What non-technology things did you have to scale?

JB: Need to scale out your business as well as technology.

AG: Building anti-spam features into the product that are scalable.

JH: Make the community part of the translation process as you grow.

JR: They introduced user moderation for photos. It's hard to tell what's porn and what's not. About a dozen people look at photos full-time to hunt down porn.

How should we handle the fallout? If you were Twitter what would you have done last month?
JB: You have to be transparent. Tell them what's going on. Set up message boards for communication. You've got to communicate.

AG: Setting realistic timelines is very important.

JH: We do it often. We roll out in small chunks, and if things don't look right, we roll back.

JB: You cannot operate a large system without the ability to turn things on and off.

JR: If you have an aggressive competitor, you don't have the luxury of downtime.

RP: You can't roll out something that can't be rolled back.
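One way to picture the "turn things on and off" and "roll back" advice from the panel is a percentage-based feature flag. The sketch below is only an illustration, not how any of the panelists' sites actually do it: the `FeatureFlags` class, the `new_chat` feature name and the bucketing of integer user IDs by `user_id % 100` are all assumptions made up for this example.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store; a real one would be shared across servers."""

    def __init__(self):
        # feature name -> percent of users (0-100) who see the feature
        self._rollout = {}

    def set_rollout(self, feature, percent):
        """Expose a feature to roughly `percent` of users."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[feature] = percent

    def kill(self, feature):
        """Roll back: turn the feature off for everyone."""
        self._rollout[feature] = 0

    def is_enabled(self, feature, user_id):
        """Unknown features are off; users are bucketed by user_id % 100."""
        percent = self._rollout.get(feature, 0)
        return user_id % 100 < percent


flags = FeatureFlags()
flags.set_rollout("new_chat", 10)                    # small chunk: 10% of users
early_user = flags.is_enabled("new_chat", 1205)      # bucket 5, inside the 10%
late_user = flags.is_enabled("new_chat", 1250)       # bucket 50, outside it
flags.kill("new_chat")                               # things don't look right
after_rollback = flags.is_enabled("new_chat", 1205)
```

Rolling back here is just setting the percentage to zero, which is what makes small, reversible rollouts cheap.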




Monday, June 09, 2008

RockYou! raises $35 million

Following Slide, RockYou! has raised an impressive $35 million in Series C funding led by venture capital firm DCM. RockYou! now claims that its reach is bigger than Slide's.



Saturday, May 31, 2008

Animoto - Kick Ass Slide Show Producer Experiencing Explosive Growth

Animoto is experiencing YouTube- and Facebook-style growth. The difference is that Animoto actually has a proper business model!

Move over, Slide and RockYou!: here comes Animoto with a totally kick-ass slide show experience. They take your photos and your music and work magic to create slide shows that "feel" the music.

To sign up and save $5, click on the graphic above with code: oxqeiaug

Animoto, which is currently experiencing explosive growth and went from 25,000 users to 250,000 users in just three days, runs on Amazon EC2 and uses S3 for storage.

A demo of an Animoto-produced slide show is below:




At Startup School 2008, Jeff Bezos, Amazon CEO, talked about Animoto. The video of his talk follows:


According to RightScale, which Animoto uses to manage its EC2 instances, Animoto has been adding 40 EC2 instances per minute during peak time.

The upshot is that there are a lot of moving parts! Each one of the subsystems consists of many servers and everything needs to scale-up as the load increases. What Animoto CTO Stevie Clifton did really well is to connect all the operations using queues, many of them in SQS. One queue contains work items that list photo URLs to fetch from other sites, such as Facebook, Flickr, etc., and that is processed by one array of worker instances. Another queue has the list of render jobs and each work item in there points to the set of photos sitting at the ready in S3 and at the music files also on S3. All of these queues are held in Amazon SQS and the arrays of worker instances are managed by RightScale. This allows the monitoring part of our service to detect when the queue gets too large and more instances need to be launched. What’s nice about using queues is that it decouples the various parts of the site, so if the renderers get backlogged the queue simply builds up and users have to wait a little longer for their video to be produced. Waiting is not good, but dropping requests on the floor is much worse!
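To make the queue idea in the quote a bit more concrete, here is a minimal sketch of a backlog that builds up while a monitor decides how many workers the array should have. It uses an in-memory `deque` as a stand-in for SQS and makes no SQS or RightScale calls. The `JOBS_PER_INSTANCE` and `MAX_INSTANCES` values and both function names are hypothetical, chosen only for illustration.

```python
import math
from collections import deque

# Hypothetical tuning knobs, not Animoto's or RightScale's real values.
JOBS_PER_INSTANCE = 5    # backlog one worker is allowed to carry
MAX_INSTANCES = 100      # hard cap on the size of the worker array


def desired_instances(queue_depth, current):
    """Return how many workers the array should have for this backlog.

    This sketch only scales up; shrinking the array is left out.
    """
    needed = math.ceil(queue_depth / JOBS_PER_INSTANCE)
    return min(MAX_INSTANCES, max(current, needed))


def drain(queue, workers):
    """Each worker takes one job; everything else simply waits in the queue."""
    done = []
    for _ in range(min(workers, len(queue))):
        done.append(queue.popleft())
    return done


render_queue = deque(f"render-job-{i}" for i in range(23))
workers = 2
finished = drain(render_queue, workers)   # 2 jobs rendered, 21 still queued
workers = desired_instances(len(render_queue), workers)
```

Nothing is dropped when the renderers fall behind: the 21 unprocessed jobs stay in the queue, and the monitor asks for more workers based on its depth.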

The RightScale blog also touches on some of the important lessons learned from Animoto's growth:

First of all, when you scale 10x and then 10x again to run on thousands of servers every little problem turns into a large one. That insignificant error rate of 0.1% gets multiplied by 1000x per second and you end up with an error a second, and actually, the error rate typically increases in itself too because of the added load on the system. So suddenly it’s not something you can ignore anymore. An example for this was having exponential backoff for uploads to S3 when using curl, but forgetting that the 5th retry exceeds the S3 connection timeout. Normally, this happens only once in a blue moon, but when tens of uploader instances are banging hard on one S3 bucket the S3 error rate goes up a bit and suddenly uploads are failing left and right. Once we changed this to a constant retry timeout it all went smooth again.
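The retry problem in the quote is easy to see with a little arithmetic. The sketch below compares the wait before each retry under exponential backoff with a constant delay, against a connection timeout. The numbers (1 second base, doubling, 15 second timeout, 3 second constant delay) are assumptions picked so that the fifth retry is the one that goes over, as in the quote; they are not the values Animoto or S3 actually used.

```python
def exponential_delays(base, factor, retries):
    """Wait before each retry: base, base*factor, base*factor**2, ..."""
    return [base * factor ** i for i in range(retries)]


def constant_delays(delay, retries):
    """Same wait before every retry."""
    return [delay] * retries


def first_retry_exceeding(delays, timeout):
    """1-based number of the first retry whose wait exceeds timeout, or None."""
    for n, wait in enumerate(delays, start=1):
        if wait > timeout:
            return n
    return None


TIMEOUT = 15  # seconds; assumed connection timeout for this illustration

exp = exponential_delays(base=1, factor=2, retries=5)   # waits of 1, 2, 4, 8, 16
const = constant_delays(delay=3, retries=5)             # waits of 3, 3, 3, 3, 3

bad_retry = first_retry_exceeding(exp, TIMEOUT)
safe = first_retry_exceeding(const, TIMEOUT)
```

With these assumed numbers the first four retries are fine and the fifth waits longer than the timeout, which is exactly the kind of bug that stays hidden until the error rate rises enough for fifth retries to become common.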

References:
Animoto's Facebook scale-up by RightScale Blog
Animoto Scaling through Viral Growth by Amazon Web Services Blog



    © 2006 The Mashraqi's.