[ This is my personal blog, so all opinions expressed here are mine. I am a product, scalability, operations and monetization advisor, currently employed as Director of Business Operations & Technical Strategy for a top 50 website that delivers billions of page views per month. I was a panelist for the Scaling Up or Out keynote at the MySQL Conference and speak regularly at conferences and user groups. ]
Amazon's S3 service went down today and more than 7 hours later, it is still down. The service initially went down around 12:00PM EST and my latest check shows troubles continuing for sites that depend on S3.
“Funny how Amazon doesn't use S3 to store any assets for amazon.com” (tweet by @gruber)
Smugmug, a popular photo sharing site with more than 600 TB of data stored on S3, has been inaccessible, along with several "Web 2.0" startup sites. Because of the astounding amount of data Smugmug stores on S3, it can definitely be considered a poster child for the service. Last time I checked Smugmug, at 11:10 PM EST, it was still inaccessible.
“As a consequence, a variety of businesses such as Twitter, digital photo sharing Web site, SmugMug and The Huffington Post all had issues. Twitterers were claiming their avatar images could not be displayed. The Huffington Post was also unable to display images to its stories, while SmugMug could not offer any service at all.” (ComputerWorld)
This is not good for Amazon and for startups that are looking to count on Amazon's "redundant" S3 platform. Dr. Werner Vogels is probably pretty upset right now.
If you're a rock star web developer with expertise in Rails/Java/PHP and MySQL, I have several exciting opportunities available in New York.
I've posted the job description for web developer and graphic designers at my MySQL blog.
About the company: Give Real is a well-funded startup in the midst of an exciting period of growth and success. Our technology uses a patent pending platform that combines the ubiquity of credit card transactions and the power of social networks to create a new gifting experience.
I was recently introduced to Summize by Dan White of CafeMom and have since loved the service.
“The deal started with a conversation with Fred Wilson about how conversational search can evolve into navigation, about how important navigation becomes for UGC as you go mainstream — it concluded with the deal that was announced this morning. Betaworks is now a twitter shareholder, and excited to be one.” (John Borthwick, partner at Betaworks)
Congratulations to the Summize, Twitter and Betaworks teams on making a perfect acquisition. Also congrats to John Borthwick, who is a partner at Betaworks.
Marketing Sherpa's latest chart of the week is about the metrics that are underused by search marketers. The data was collected by asking Marketing Sherpa members about 'the most underused metrics in search'.
“Marketers with short, impulse-buy sales cycles were quite adamant that immediate sales should be tied to keywords when figuring out conversion...With prices rising steadily, marketers who evaluate search against tangible KPIs will be the ones who will optimize and balance their spending.”
The slides from my second memcached webinar are embedded below. If you want you can also watch the memcached webinar on demand (includes sound).
I would like to know what else you would like to know about in my next webinar on memcached. Please leave a comment or if you prefer, email me at fmashraqi [at] yahoo dot com.
A big thanks to everyone who attended. The recording of webinar (with sound) is now available on MySQL website. Also thanks to Jimmy Guerrero, Rich Taylor, Alex Roedling, Edwin DeSouza and Sun/MySQL for inviting me for the second time to give this webinar.
If you registered previously for the memcached webinar, you can simply login to access the Webex recording. Otherwise, you will have to register for the on-demand webinar.
After watching/listening to the webinar, in case you have any questions or suggestions for my next webinar, please feel free to email them to me at fmashraqi [at] yahoo dot com.
The description of the webinar follows:
Memcached for MySQL: Advanced Use Cases
Join us for this in-depth technical webinar where memcached guru Frank Mashraqi of Fotolog will demonstrate several use cases on how to leverage memcached to increase the performance and scalability of MySQL driven web sites and applications. Memcached is the open source distributed memory caching system used by some of the biggest websites in the world, like YouTube, Facebook, LiveJournal and Wikipedia. Use cases explored include: non-deterministic caches, deterministic caches, a replacement/add-on for file system caches and more. We will also provide an overview of memcached production support available with a MySQL Enterprise subscription. The presentation will conclude with a question and answer period where you can "ask the expert" about the technical details of memcached. Attendees will also receive a technical white paper.
WHO:
Farhan "Frank" Mashraqi, Director of Business Operations and Technical Strategy - Fotolog Inc
Jimmy Guerrero, Sr Product Marketing Manager, Sun Microsystems – Database Group
WHAT:
Memcached for MySQL: Advanced Use Cases web presentation.
Next session is High-performance Ajax Applications by Julien Lecomte (Yahoo!).
Plan for performance from day 1
work closely with designers and product managers
understand design rationale
explain the tradeoffs between design and performance
offer alternatives and show what is possible (prototypes)
as a last resort, simplify design
Engineering high performance: a few basic rules:
don't do anything unnecessary
less is more
break the rules
work on improving perceived performance
users can generally deal with a little bit of discomfort if they can see something is happening.
can't compromise security, but some other things can be compromised
in general avoid presentational markup
Measuring performance:
test performance using a setup similar to your user's environment
profile your code during development
automate profiling/performance testing
keep historical records of how features perform
consider keeping some (a small amount of) profiling code in production
Yahoo!'s exceptional performance rules
make fewer http requests
use a content delivery network
Asset optimization:
minify CSS and JS files
combine CSS and JS files
optimize image assets
Reduce unminified code size:
loading and parsing HTML, CSS and JS code is costly
be concise and write less code
make good use of javascript features
consider splitting your large JS files into smaller files (bundles) when the parsing and compilation of the script takes an excessive amount of time
load code (HTML, CSS and JS) on demand
Optimize initial rendering (1/4) misc tips
consider rendering the first view on the server. (server should generate the markup)
this will speed up the initial rendering.
close your HTML tags to speed up parsing
consider flushing the apache buffer very early on
load only essential assets/load assets on a delay or on demand.
don't always wait for onload
most DOM operations can be accomplished before the onload event has fired
post load script loading:
a well designed site should be fully functional even without the JS enabled
therefore you may be able to load scripts on a delay
conditional preloading:
preload assets that you know user is likely to need very shortly
however, one must be smart about when the preloading takes place. Otherwise the preloading may actually worsen the user experience.
Part 3: High Performance Javascript:
a scope chain lookup is performed in JS every time a variable is accessed.
declare variables with the var keyword, use them in the same scope whenever possible, and avoid globals at all costs.
never use the with keyword, as it prevents the compiler from generating code for fast access to local variables.
cache the results of expensive lookups in local variables.
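A minimal sketch of the "cache expensive lookups in local variables" advice, written in TypeScript. The appSettings object and both function names are made up for illustration, and the size of the gain depends heavily on the engine (modern engines optimize much of this away), so treat it as a pattern rather than a guaranteed speedup.

```typescript
// Hypothetical settings object standing in for any global or deeply nested data.
const appSettings = { pricing: { taxRate: 0.08 } };

// Repeats the nested lookup (and the length lookup) on every iteration.
function totalWithTaxSlow(prices: number[]): number {
  let total = 0;
  for (let i = 0; i < prices.length; i++) {
    total += prices[i] * (1 + appSettings.pricing.taxRate);
  }
  return total;
}

// Looks everything up once and keeps the results in local variables.
function totalWithTaxCached(prices: number[]): number {
  const taxMultiplier = 1 + appSettings.pricing.taxRate; // cached lookup
  const count = prices.length;                            // cached length
  let total = 0;
  for (let i = 0; i < count; i++) {
    total += prices[i] * taxMultiplier;
  }
  return total;
}
```

Both functions return the same total; only the number of lookups inside the loop differs.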
The prototype chain:
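accessing a member found on the object itself is about 25% faster than accessing one found further up the prototype chain
optimize object instantiation:
if you need to create many objects, consider adding members to the prototype instead of adding them to each new object in the constructor.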
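To make the object instantiation tip concrete, here is a small TypeScript sketch. The names makePointPerInstance and PointShared are invented for this example. It contrasts giving every object its own copy of a function with sharing one function through the prototype (which is where TypeScript class methods live).

```typescript
// Per-instance member: every call allocates a fresh function object.
function makePointPerInstance(x: number, y: number) {
  return {
    x: x,
    y: y,
    lengthSquared: function (): number {
      return x * x + y * y;
    },
  };
}

// Prototype member: one function shared by all instances.
class PointShared {
  constructor(public x: number, public y: number) {}

  lengthSquared(): number {
    return this.x * this.x + this.y * this.y;
  }
}

const perInstanceA = makePointPerInstance(1, 2);
const perInstanceB = makePointPerInstance(3, 4);
const sharedA = new PointShared(1, 2);
const sharedB = new PointShared(3, 4);

// Each per-instance object carries its own function; class instances share one.
const perInstanceShares: boolean = perInstanceA.lengthSquared === perInstanceB.lengthSquared;
const prototypeShares: boolean = sharedA.lengthSquared === sharedB.lengthSquared;
```

Here perInstanceShares is false and prototypeShares is true. The tradeoff is the one noted above: the shared version creates objects more cheaply, while the per-instance version avoids a step up the prototype chain on each access.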
Don't use eval
eval is evil
the string passed to eval (and its relatives: the Function constructor and the setTimeout and setInterval functions) needs to be compiled and interpreted (extremely slow).
optimize string concatenation:
on IE, concatenating two strings causes a new string to be allocated and the two original strings to be copied.
therefore, it is much faster on IE to append strings to an array and then use Array.join
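The array-and-join pattern looks roughly like this in TypeScript. The function name renderListItems is made up, and the example does no HTML escaping, so it is only a sketch of the concatenation technique itself.

```typescript
// Collect the pieces in an array, then join once at the end,
// instead of growing one string with repeated += operations.
function renderListItems(labels: string[]): string {
  const parts: string[] = [];
  for (let i = 0; i < labels.length; i++) {
    parts.push("<li>" + labels[i] + "</li>");
  }
  return "<ul>" + parts.join("") + "</ul>";
}
```

On current browsers plain concatenation is usually just as fast, so this mostly matters for the older IE versions the talk was referring to.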
optimize regular expressions.
don't use the RegExp constructor unless your regular expression is assembled at runtime. Instead, use regular expression literals.
use the test method if all you want to do is test for a pattern (the exec method carries a small performance penalty).
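A short TypeScript sketch of both regular expression tips. The ZIP code pattern and the function names are illustrative choices, not something from the talk. The literal is used where the pattern is fixed, and the RegExp constructor only where the pattern really is assembled at runtime.

```typescript
// Fixed pattern: use a literal, created once and reused.
const zipCodePattern = /^\d{5}(-\d{4})?$/;

// test() answers "does it match?" without building a match array as exec() would.
function isZipCode(input: string): boolean {
  return zipCodePattern.test(input);
}

// Pattern assembled at runtime: the one case where the constructor is needed.
function containsWord(text: string, word: string): boolean {
  // Escape regex metacharacters in the caller-supplied word.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("\\b" + escaped + "\\b", "i").test(text);
}
```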
Caching
caching can be justified when there is a high cost associated with getting/accessing the data and when the data doesn't change over time.
increases memory consumption (tradeoff)
memoization
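Memoization, as mentioned above, can be sketched in a few lines of TypeScript. The helper below handles only single-argument numeric functions, and slowSquare is a stand-in for any expensive computation. Both names are invented for the example.

```typescript
// Wraps a one-argument function and remembers results by argument.
// The cache grows with every new argument: that is the memory tradeoff.
function memoize(fn: (n: number) => number): (n: number) => number {
  const cache: Record<number, number> = {};
  return function (n: number): number {
    if (!(n in cache)) {
      cache[n] = fn(n);
    }
    return cache[n];
  };
}

// Stand-in for an expensive computation; counts how often it really runs.
let slowSquareCalls = 0;
function slowSquare(n: number): number {
  slowSquareCalls++;
  return n * n;
}

const fastSquare = memoize(slowSquare);
```

Calling fastSquare(12) twice runs slowSquare only once; the second call is served from the cache. This is only safe when the underlying result does not change over time, which is the condition stated above.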
long-running JavaScript processes (longer than 300 ms):
the entire browser UI is frozen.
to maintain a decent user experience make sure that JS threads never take more than 300 ms.
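One common way to stay under a budget like this is to split the work into chunks and yield to the browser between them. The TypeScript sketch below is an assumption about how that could look, not something shown in the talk. The function and type names are invented, and the yielding mechanism is passed in as a parameter (in a browser it would typically wrap setTimeout with a delay of 0).

```typescript
// A function that arranges for `next` to run later, after the UI had a chance to update.
type YieldToBrowser = (next: () => void) => void;

// Processes `items` in chunks of `chunkSize`, yielding between chunks.
function processInChunks<T>(
  items: T[],
  chunkSize: number,
  handle: (item: T) => void,
  yieldToBrowser: YieldToBrowser,
  done: () => void
): void {
  let index = 0;

  function runChunk(): void {
    const end = Math.min(index + chunkSize, items.length);
    for (; index < end; index++) {
      handle(items[index]);
    }
    if (index < items.length) {
      yieldToBrowser(runChunk); // give the UI a turn before continuing
    } else {
      done();
    }
  }

  runChunk();
}
```

The chunk size has to be tuned so that a single chunk stays well below the 300 ms mark on the slowest machines you care about.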
misc:
function calls have overhead associated with them
consider using primitive operations since they are often faster than the corresponding function calls
if possible, avoid using try..catch in performance critical sections
if possible, avoid for...in in performance critical sections
branch outside, not inside, whenever the branching condition does not change
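The last tip, "branch outside, not inside", is easiest to see with a loop. In this TypeScript sketch (function names and the inch-to-centimeter conversion are made up for illustration), the condition never changes during the loop, so it can be decided once before the loop starts.

```typescript
// Branch inside: the same unchanging condition is checked on every iteration.
function convertBranchInside(values: number[], toCentimeters: boolean): number[] {
  const result: number[] = [];
  for (let i = 0; i < values.length; i++) {
    if (toCentimeters) {
      result.push(values[i] * 2.54);
    } else {
      result.push(values[i]);
    }
  }
  return result;
}

// Branch outside: decide once, then run a loop body with no condition in it.
function convertBranchOutside(values: number[], toCentimeters: boolean): number[] {
  const factor = toCentimeters ? 2.54 : 1;
  const result: number[] = [];
  for (let i = 0; i < values.length; i++) {
    result.push(values[i] * factor);
  }
  return result;
}
```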
Next session is Stress, Load and Performance Testing in Quality Assurance by Goranka Bjedov of Google.
I have been wanting to hear Goranka for some time now as her sessions usually end up becoming the highlight of the event. For the record, she passionately hates PowerPoint (I don't blame her).
I couldn't find a video of her Velocity talk but here is a video from her previous talk that's equally interesting.
Goranka spends all her time doing performance testing at Google. She tests AdWords and AdSense, and hates any kind of presentation tool.
focus on the backend. Steve Souders is the client side performance guy.
she works on the servers.
what are the bottlenecks?
QA people should be able to tell you what to expect.
2 purposes: what is going on in application? and monitor application for changes.
a small code change can cause tremendous performance decline.
first thing is if a mistake is made, everyone should know right away. Finding it later can cost a lot
80% of the performance problems can be worked out with one front end and one backend as long as you have the right database.
figure out what is happening with important transactions
if you don't know what important transactions are, make a guess. it's better than nothing. don't be paralyzed, then worry about perfection
there is no such thing as perfection in performance testing. she cannot guarantee exact results. All tests are run as statistical tests and run 5 times or so.
big proponent of open source tools: JMeter, Grinder and FunkLoad. In a Windows environment look at OpenSTA.
Vendor tools do a reasonable job (and solve the problem of having too much money).
Open source tools do exactly the same job. they are not completely free, as time is required.
if you're not willing to spend time on an open source tool, then why even spend half a million dollars?
she is happy that open source tools don't have monitoring built into them.
monitoring is absolutely essential and must be done separately.
for the majority of things you can troubleshoot and benchmark within 3 weeks.
check google blog for her posts: open source performance testing tools
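Treating performance tests as statistical tests that are "run 5 times or so" implies summarizing the runs rather than trusting any single number. The TypeScript sketch below shows one way to do that. The summarizeRuns name, the RunSummary shape and the choice of mean plus sample standard deviation are my assumptions, not something prescribed in the session.

```typescript
interface RunSummary {
  mean: number;
  stdDev: number;
  min: number;
  max: number;
}

// Summarizes repeated measurements of the same test (for example latency in ms).
function summarizeRuns(measurements: number[]): RunSummary {
  if (measurements.length === 0) {
    throw new Error("at least one run is required");
  }
  let sum = 0;
  let min = measurements[0];
  let max = measurements[0];
  for (const value of measurements) {
    sum += value;
    if (value < min) { min = value; }
    if (value > max) { max = value; }
  }
  const mean = sum / measurements.length;

  let squaredDiffs = 0;
  for (const value of measurements) {
    squaredDiffs += (value - mean) * (value - mean);
  }
  // Sample standard deviation (divides by n - 1); reported as 0 for a single run.
  const stdDev = measurements.length > 1
    ? Math.sqrt(squaredDiffs / (measurements.length - 1))
    : 0;

  return { mean: mean, stdDev: stdDev, min: min, max: max };
}
```

For five hypothetical runs of 100, 110, 90, 105 and 95 ms this gives a mean of 100 ms with a standard deviation of roughly 7.9 ms, which is a more honest result to compare across builds than any one run.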
Next session is Clouds Are No Substitute for Competence by Javier Soltero of Hyperic.
The promise of cloud computing:
Cloud computing is the next big thing: Because it is green, easy, scalable, available and disposable.
Cloud computing adds complexity:
clouds allow you to run your applications, but mask the performance of the infrastructure powering them. NYT is not going to stop their own infrastructure just because they had success with one project on EC2.
when a problem happens, where is the source of the problem? cloud or your own app.
cloud, by definition is always available and the status is always green.
how quickly can I provision new servers?
what is the throughput in the regions I use?
what latency am I getting for my messages?
How do you answer?
"is it my application, or is it the cloud?"
Hyperic is introducing CloudStatus.com, which shows the performance, availability and health of Amazon's Web services. On CloudStatus.com, you can monitor EC2, S3, SQS, SDB and FPS (the 5 most popular and critical services of AWS). You can look at performance metrics such as deployment latency. They are firing up Amazon instances and monitoring response times.
After the break, the next session is Energy Efficient Operations: Some Challenges and Opportunities. Luiz Barroso from Google is the presenter. I arrived a couple of minutes late as I had to pick up my charger.
Server electricity usage in perspective:
worldwide electricity usage of servers is around 1% of total electricity consumption.
usage doubled between 2000 and 2005
could increase by 40%-76% by 2010.
PC energy consumption likely higher:
installed base for servers in 2005 - 27M
installed base for PCs in 2005: 870M
Measuring computing energy efficiency
harder for computers than for refrigerators
efficiency = work done / energy used = computing speed / power
biggest thing you can do for energy efficiency is write fast code. it can have really big impact.
from measurement standpoint, it is useful to break down the energy efficiency/budget equation
breaking it down:
efficiency = (work done / energy used in chips) * (energy used in chips / energy provided to servers) * (energy provided to servers / energy entering the building)
first: computing efficiency
second: server efficiency
third: datacenter efficiency or 1/PUE (power usage effectiveness)
Energy efficiency opportunities:
datacenter energy efficiency
LBNL survey of 24 facilities shows avg PUE of 1.83
underutilized data centers
wasted power provisioning investment
makes cooling and power distribution less efficient
server energy efficiency
typical server power supplies dissipate 25% of total energy
DC-to-DC voltage regulators can lose another 25%
computing efficiency
servers have poor energy efficiency in their most common usage range
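Plugging the figures quoted above into the three-factor breakdown gives a feel for how the losses stack up. This TypeScript sketch assumes the two 25% server losses compound multiplicatively (the notes do not say whether the second 25% is of the total or of what remains), and all constant and function names are my own.

```typescript
// Figures quoted in the notes above.
const averagePue = 1.83;           // LBNL survey average
const powerSupplyLoss = 0.25;      // typical server power supply
const voltageRegulatorLoss = 0.25; // DC-to-DC voltage regulators

// Datacenter efficiency is 1 / PUE: energy provided to servers / energy entering the building.
const datacenterEfficiency = 1 / averagePue; // about 0.55

// Assumption: the two server-side losses compound.
const serverEfficiency = (1 - powerSupplyLoss) * (1 - voltageRegulatorLoss); // 0.5625

// Fraction of the energy entering the building that actually reaches the chips.
const fractionReachingChips = datacenterEfficiency * serverEfficiency; // about 0.31

// Overall efficiency = computing efficiency * server efficiency * datacenter efficiency.
function overallEfficiency(workPerJouleInChips: number): number {
  return workPerJouleInChips * fractionReachingChips;
}
```

Under these assumptions less than a third of the energy entering the building reaches the chips, before computing efficiency is even considered.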
Plan for today:
datacenter efficiency
the power provisioning efficiency: What can you achieve if you utilize all energy in your data center.
two key energy related costs:
10 year energy costs ($9/watt)
cost of building a datacenter ($10-22/watt)
Facility costs are as important as energy consumption costs
Datacenter buildout can be larger than energy itself.
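A quick back-of-the-envelope version of the cost comparison, in TypeScript. The per-watt figures come from the notes above. The function names and the idea of dividing build cost by utilization (to show what underused provisioned power costs) are my own illustration.

```typescript
// Per-watt figures quoted in the notes above (US dollars).
const tenYearEnergyCostPerWatt = 9;
const buildCostPerWattLow = 10;
const buildCostPerWattHigh = 22;

// Ten-year energy cost and build cost range for a given provisioned capacity.
function facilityCosts(provisionedWatts: number) {
  return {
    energy: provisionedWatts * tenYearEnergyCostPerWatt,
    buildLow: provisionedWatts * buildCostPerWattLow,
    buildHigh: provisionedWatts * buildCostPerWattHigh,
  };
}

// Build cost per watt actually used, when only part of the provisioned power is drawn.
function buildCostPerUsedWatt(buildCostPerWatt: number, utilization: number): number {
  return buildCostPerWatt / utilization;
}
```

For a hypothetical 1 MW facility, ten years of energy comes to $9M against $10M to $22M for the build-out. At 50% utilization of provisioned power, a $10/watt facility effectively costs $20 per watt used, which is why the provisioning playbook matters.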
Efficiency provisioning playbook:
consolidate workloads into the minimum number of machines needed for peak usage requirements
smart scheduling or virtualization help here
measure actual power usage of devices
nameplates lie!
study activity trends and investigate the oversubscription potential
the subject of our ISCA 07 article
Six month power monitoring study at Google (ISCA 07)
Basic setup
model based power monitoring scheme
measure usage statistics at rack, PDU and cluster levels
4 different workloads over 5k servers
More servers leads to higher oversubscription potential.
Safely oversubscribing power
oversubscribe at the datacenter level, not at the server or rack levels
profile power usage of applications: learn what to expect
mix workloads
manage overload
provision a sizeable 'best effort' workload; victimize it first
use applications with QoS stack
good news: time constants to react are long
Energy-proportional computing: (An article was published in December of last year)
look at datacenter as a device you have to lower power for
he calls the datacenter: a land-held
CPU activity distribution over six months (graph)
real production systems don't run full blast all the time.
systems run 10% to 50% of their full capacity most of the time.
fraction of time these servers are doing nothing is very small.
A datacenter and a laptop are indeed different
Characteristics of well designed internet services:
high performance and high availability require
load balancing and wide data distribution -> no useful idle intervals, lots of low activity intervals
example: Google file system:
replicas distributed across multiple machines
reads are load balanced across replicas; writes need to reach all of them.
Key implications:
sleep or power-down strategies are much less useful in servers
focus on energy efficiency at peak performance is misguided
Power varies with the amount of activity in servers. Even when a machine is completely idle, it still uses roughly half of its peak power. At 1/3 of peak load, power efficiency is halved.
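The "halved at 1/3 of peak" figure falls out of a simple model. The TypeScript sketch below assumes power grows linearly from idle to peak, which is my simplification and not something stated in the talk. The function names are invented.

```typescript
// Power drawn at a given utilization (0..1), as a fraction of peak power.
// Assumption: linear growth from idlePowerFraction at 0% load to 1.0 at 100% load.
function relativePower(utilization: number, idlePowerFraction: number): number {
  return idlePowerFraction + (1 - idlePowerFraction) * utilization;
}

// Work done per unit of power, relative to running at full load.
// Not meaningful at utilization 0 on a machine that draws no idle power (0 / 0).
function relativeEfficiency(utilization: number, idlePowerFraction: number): number {
  return utilization / relativePower(utilization, idlePowerFraction);
}
```

With idle power at half of peak, a server at 1/3 load draws about 2/3 of peak power, so relativeEfficiency(1 / 3, 0.5) comes out at 0.5. With an idle fraction of 0 (a perfectly energy-proportional machine) the efficiency stays at 1 at any non-zero load, which is the goal described in the next section.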
Energy-proportional computing: (the idea)
no work, no power consumed
some work, some power consumed
lots of work, lots of power consumed
That would be the end of power management software.
What if we could build machines with a wide activity range? He shows a graph.
Estimated impact of energy proportionality is quite huge based on another graph.
Conclusion:
write fast code!
the software engineer's biggest contribution to energy efficiency
Last session before the break is "Innovation That Drives Opportunity for the Web Infrastructure" by John Fowler (Sun Microsystems). John is responsible for hardware at Sun.
Applications are built in different ways.
Three things Sun is working on:
Computing
Open Storage
focusing on $/performance
Networking
huge bandwidth
He is talking about Web 2.0 architectures. The software running today wasn't there 10 years ago. Almost everyone is horizontally scaled which brings up a host of technology issues.
Sun's Web 2.0 kit: a set of performance and benchmarking applications. Sun will be open sourcing this and other tools. The tools tested are web/app server, cache layers, database and storage.
It's driving you crazy:
power, heat, space
scale
understanding the infrastructure
performance
Compute:
relatively straightforward
clock rates not going up. everyone scaling horizontally
lower memory latency
how can you have a high degree of concurrency
Cores and threads are on the move. Sun is working on 16 cores per socket. Future is higher and higher degrees of concurrency.
Open Storage (Servers + Storage + Open Solaris) :
built on OpenSolaris
performance of ZFS and SSDs
cost efficiency of volume hardware
scale easily
millions of files
gigabytes / sec
management simplicity
analytics with DTrace
diagnostics with FMA
Why Applications Don't Perform:
Waiting for DATA
Future: Enterprise SSD:
up to 5,000 - 8,000 write IOPS
up to 30,000 to 40,000 read IOPS
32 GB
$ per IOPS $0.08 compared to $2.43 traditionally.
New generation of flash is quite reliable and has no moving parts. Power consumption of SSDs is 2 watts compared to 13 watts for traditional HDD.
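Taking the quoted figures at face value, the ratios work out as follows. This is plain arithmetic in TypeScript. The constant names and the costForIops helper are mine, and the 10,000 IOPS target in the note below is a hypothetical workload.

```typescript
// Figures quoted in the notes above.
const ssdDollarsPerIops = 0.08;
const hddDollarsPerIops = 2.43;
const ssdWatts = 2;
const hddWatts = 13;

// How many times cheaper per IOPS, and how many times less power per device.
const iopsCostRatio = hddDollarsPerIops / ssdDollarsPerIops; // about 30x
const powerRatio = hddWatts / ssdWatts;                      // 6.5x

// Cost of buying a target number of IOPS at a given price per IOPS.
function costForIops(targetIops: number, dollarsPerIops: number): number {
  return targetIops * dollarsPerIops;
}
```

For a hypothetical 10,000 IOPS workload that is roughly $800 of SSD against $24,300 of traditional disk. Capacity is a separate question, since each SSD here holds only 32 GB.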
ZFS Hybrid Storage Pool Model:
High Performance Read and Write Cache Pool
ZFS combines main memory and SSDs for read caching
Notes from Velocity Conference continue: Next up is a Keynote by Artur Bergman (Wikia). Wikia runs 7000 Wikis and has 400 million page views per month.
Google, Yahoo and Amazon are what people rely on
Friendster.com, Twitter and boo.com have serious reliability problems
Value of performance/ reliability
brand value (they rely on you)
more page views (fixed amount of time + faster site)
Match user expectations:
World of Warcraft:
$520 million in profit last year
99% reliable
down every week, scheduled
server crash
"We pay them money, so we have to accept the downtime."
Operations:
efficient use of resources
end user performance
reliability
bad operations waste R&D and cost of sale money
Business
cost per pageview. How many actually know this?
cost per page
revenue CPM - cost per pageview
Gross margin
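The business metrics above fit together with a little arithmetic. The main thing to watch is units: CPM is revenue per 1,000 pageviews, so it has to be divided by 1,000 before cost per pageview can be subtracted from it. In this TypeScript sketch the function names are mine and the dollar figures in the note below are hypothetical. Only the 400 million pageviews per month comes from the talk.

```typescript
// CPM is revenue per 1,000 pageviews.
function revenuePerPageview(revenueCpm: number): number {
  return revenueCpm / 1000;
}

// Operations cost spread over the pageviews served in the same period.
function costPerPageview(monthlyOperationsCost: number, monthlyPageviews: number): number {
  return monthlyOperationsCost / monthlyPageviews;
}

// Gross margin as a fraction of revenue, counting only operations cost.
function grossMargin(revenueCpm: number, costPerView: number): number {
  const revenue = revenuePerPageview(revenueCpm);
  return (revenue - costPerView) / revenue;
}
```

With a hypothetical $200,000 monthly operations bill over 400 million pageviews, cost per pageview is $0.0005. At a hypothetical $2 revenue CPM that leaves a gross margin of 75%.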
Reliability example:
20% of wikia pages
200ms -> 15s to load
35% reduction of page views out of the slow pages
15% reduction of the fast pages
slow pages made people abandon the site.
Happy users - Lower cost
for the users: service industry
for the business: cost per page view, lower capex.
VCs love to give money to IBM, HP, Dell
better for the environment.
May performance project:
50% cpu usage cost
delay investment 6 months
3 engineers - 4 weeks
Cache miss
300ms -> 190 ms
Perception
Ads
Ads are slow
Load ads after content load
Dramatic change:
significant % increase in pageviews
We lose money
but edits increase
Stay in loose and simple area, stay away from the complexity.
Next up is Scott Ruthfield (WhitePages.com) talking about Jiffy: Open Source Performance Measurement and Instrumentation. WhitePages.com is a people search power. They own 411.com. They have data on 180 million people, doing 2 billion searches / year and 500 searches per second during peak. A top-50 comScore site.
Very important performance lessons: Scott says "Slow is bad." Customers just don't want you to be slow. "We're slow." A detailed analysis on their end revealed that the slowness wasn't their fault.
Reflection:
YSMV: Your slowness may vary
YCMWYCM: You can't manage what you can't measure
Jiffy:
means: Small unit of time; tick between system clock interrupts
is an end-to-end system for measuring and reporting on page load activity
Four goals:
real data at scale: what are customers seeing?
measure anything.
real/near time reporting
~0 impact on page performance
he says "it works"
What are the components of Jiffy?:
jiffy.js - library for instrumenting your pages and reporting measurements
apache (httpd.conf) config - receive and log measurements
Better late than never. At Velocity Conference I took a lot of notes and didn't get to publish them earlier. Now that I have some time on my hands, I am going to go ahead and publish them.
At Velocity, there will be two product launches. Vik Chaudhary (Keynote Systems, Inc.) and Abelardo Gonzalez (Keynote Systems) are on stage now.
The first product is KITE (Keynote Internet Testing Environment). Keynote has been in business for 12 years. Providing a single performance testing environment for everyone (web developers, QA and system administrators) has always been a challenge for both startups and major players. This presentation sounds very promising.
Keynote collects 200 million internet measurements every day.
With KITE, you can test from the desktop to the internet cloud.
Keynote built it to provide a single performance testing environment for web developers, QA and IT operations team.
Sites are making upgrades every single day. Customers want to be able to test it from multiple locations
KITE is ideal for 3 things
recording, scripting and playback of web transactions
instant web tests from desktop
scheduled testing for higher productivity
KITE Web Performance Engine allows for:
performance analysis for multi-page transactions, not just single pages
Javascript programmability for scripting and DOM analysis
testing from the desktop, Last Mile and the Internet Cloud
Native IE integration, which in turn allows for easy analysis of AJAX, Flash and Javascript
Recording of test scripts, which can be played back in "burst" mode and shared
Next is a demo by Abelardo Gonzalez.
You get a script repository that you can save for performance benchmarks.
There is a record button and a free global test button.
They will be testing iGoogle.
You start a session then as you take actions, they are recorded in the console.
KITE helps with three main things:
user experience time
network time
cache network time (for repeat visitors)
You can drill down very deep into each page and each performance metric. Tests from the Global Network are also possible, where you can replay exactly the steps you made from your desktop from all supported areas (e.g. Atlanta).
Just found on Slashdot that Google is attempting to patent 'FriendRank', an idea that Jeremy Zawodny came up with in, get this, 2004!
A computer-implemented method for displaying advertisements to members of a network comprises identifying one or more communities of members, identifying one or more influencers in the one or more communities, and placing one or more advertisements at the profiles of one or more members in the identified one or more communities. Way to go Google!
Today at 1PM EST I will be presenting the second part of Memcached Webinar for Sun/MySQL. Like the first memcached webinar, this will focus on memcached use cases but it will be more technical.