Digg Architecture | High Scalability

Digg Architecture

Todd Hoff's picture

Traffic generated by Digg's more than 1.2 million famously info-hungry users can drive an unsuspecting website head-on into its CPU, memory, and bandwidth limits. How does Digg handle all this load?

Site: http://digg.com

Information Sources

  • How Digg.com uses the LAMP stack to scale upward
  • Digg PHP's Scalability and Performance

Platform

  • MySQL
  • Linux
  • PHP
  • Lucene
  • APC PHP Accelerator
  • MCache

The Stats

  • Started in late 2004 with a single Linux server running Apache 1.3, PHP 4, and MySQL 4.0 using the default MyISAM storage engine.
  • Over 1.2 million users.
  • Over 200 million page views per month
  • 100 servers hosted in multiple data centers.
    - 20 database servers
    - 30 Web servers
    - A few search servers running Lucene.
    - The rest are used for redundancy.
  • 30GB of data.
  • None of the scaling challenges Digg faced had anything to do with PHP. The biggest issues were database related.
  • The lightweight nature of PHP allowed them to move processing tasks from the database to PHP in order to improve scaling. eBay does this in a radical way: they moved nearly all work out of the database and into applications, including joins, an operation we normally think of as the job of the database.
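
As a rough illustration of moving a join out of the database and into the application, here is a minimal sketch of a hash join done in application code. The table contents, field names, and the function `join_stories_with_users` are all hypothetical; neither Digg's nor eBay's actual code is shown in the sources, and the sketch is in Python rather than PHP.

```python
# Hypothetical rows, as two separate single-table queries might return them.
stories = [
    {"id": 1, "title": "Scaling LAMP", "user_id": 10},
    {"id": 2, "title": "Sharding 101", "user_id": 11},
    {"id": 3, "title": "Caching tips", "user_id": 10},
]
users = [
    {"id": 10, "name": "alice"},
    {"id": 11, "name": "bob"},
]


def join_stories_with_users(stories, users):
    """Inner hash join performed in the application instead of in SQL."""
    # Build side: index the smaller table by its key.
    users_by_id = {user["id"]: user for user in users}
    joined = []
    # Probe side: look up each story's submitter in the index.
    for story in stories:
        user = users_by_id.get(story["user_id"])
        if user is not None:  # inner-join semantics: skip unmatched rows
            joined.append({"title": story["title"], "submitter": user["name"]})
    return joined


result = join_stories_with_users(stories, users)
```

The database then only serves simple single-table lookups, which are easier to cache and to spread across servers; the cost is that the application tier does the matching work.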

What's Inside

  • Load balancer in the front that sends queries to PHP servers.
  • Uses a MySQL master-slave setup.
    - Transaction-heavy servers use the InnoDB storage engine.
    - OLAP-heavy servers use the MyISAM storage engine.
    - They did not notice a performance degradation moving from MySQL 4.1 to version 5.
  • Memcached is used for caching.
  • Sharding is used to break the database into several smaller ones.
  • Digg's usage pattern makes it easier for them to scale. Most people just view the front page and leave. Thus 98% of Digg's database accesses are reads. With this balance of operations they don't have to worry about the complex work of architecting for writes, which makes it a lot easier for them to scale.
  • They had problems with their storage system telling them writes were on disk when they really weren't. Controllers do this to improve the appearance of their performance, but it leaves a giant data integrity hole in failure scenarios. This is a fairly common problem and can be hard to fix, depending on your hardware setup.
  • To lighten their database load they used the APC PHP accelerator and MCache.
  • You can configure PHP not to parse and compile on each load using a combination of Apache 2’s worker threads, FastCGI, and a PHP accelerator. On a page's first load the PHP code is compiled, so any subsequent page loads are very fast.
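
The way sharding, master-slave replication, and caching fit together for a read-heavy workload can be sketched roughly as follows. This is a toy in-memory model, not Digg's implementation: the class `ShardedStore`, its methods, the CRC32-based shard choice, and the round-robin replica selection are all assumptions made for illustration, and plain dictionaries stand in for MySQL servers and the cache.

```python
import zlib


class ShardedStore:
    """Toy stand-in for sharded master-slave databases behind a cache."""

    def __init__(self, num_shards, replicas_per_shard):
        # Each shard has one master dict and several replica dicts.
        self.masters = [{} for _ in range(num_shards)]
        self.replicas = [
            [{} for _ in range(replicas_per_shard)] for _ in range(num_shards)
        ]
        self.cache = {}  # stands in for a memcached-style cache
        self.replica_reads = 0  # counts reads that reached a replica
        self._next_replica = 0

    def _shard_for(self, key):
        # Stable hash so the same key always maps to the same shard.
        return zlib.crc32(key.encode("utf-8")) % len(self.masters)

    def write(self, key, value):
        shard = self._shard_for(key)
        self.masters[shard][key] = value  # writes always go to the master
        # Real replication is asynchronous; here it is copied immediately.
        for replica in self.replicas[shard]:
            replica[key] = value
        self.cache.pop(key, None)  # invalidate any stale cached value

    def read(self, key):
        if key in self.cache:  # cache hit: no database access at all
            return self.cache[key]
        replicas = self.replicas[self._shard_for(key)]
        replica = replicas[self._next_replica % len(replicas)]  # round-robin
        self._next_replica += 1
        self.replica_reads += 1
        value = replica.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value


store = ShardedStore(num_shards=4, replicas_per_shard=2)
store.write("story:42", "Digg Architecture")
first = store.read("story:42")   # cache miss, served by a replica
second = store.read("story:42")  # cache hit, no replica touched
```

With reads making up the vast majority of accesses, most requests end at the cache or a replica, and each master only handles the small write stream for its own shard.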

Lessons Learned

  • Tune MySQL through your database engine selection. Use InnoDB when you need transactions and MyISAM when you don't. For example, tables that are transactional (InnoDB) on the master can use MyISAM on read-only slaves.
  • At some point in their growth curve they were unable to grow by adding RAM, so they had to grow through architecture.
  • People often complain Digg is slow. This is perhaps due to their large JavaScript libraries rather than their backend architecture.
  • One way they scale is by being careful about which applications they deploy on their system. They are careful not to release applications which use too much CPU. Clearly Digg has a pretty standard LAMP architecture, but I thought this was an interesting point. Engineers often have a bunch of cool features they want to release, but those features can kill an infrastructure if that infrastructure doesn't grow along with the features. So push back until your system can handle the new features. This goes to capacity planning, something Flickr emphasizes in their scaling process.
  • You have to wonder whether, by limiting new features to match their infrastructure, Digg might lose ground to other faster-moving social bookmarking services. Perhaps if the infrastructure were more easily scaled they could add features faster, which would help them compete better. On the other hand, just adding features because you can doesn't make a lot of sense either.
  • The data layer is where most scaling and performance problems are to be found, and these are not language specific. You'll hit them using Java, PHP, Ruby, or insert your favorite language here.
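
To put the capacity-planning point in numbers, the figures quoted in The Stats and What's Inside can be turned into rough average rates. This is a back-of-the-envelope sketch only: it assumes a 30-day month and perfectly even traffic, treats "over 200 million" as exactly 200 million, and says nothing about peak load, which is what capacity planning actually has to cover.

```python
PAGE_VIEWS_PER_MONTH = 200_000_000  # "over 200 million page views per month"
WEB_SERVERS = 30                    # "30 Web servers"
READ_FRACTION = 0.98                # "98% of Digg's database accesses are reads"
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month

# Average site-wide page views per second.
avg_views_per_second = PAGE_VIEWS_PER_MONTH / SECONDS_PER_MONTH
# Average page views per second handled by each web server.
avg_views_per_web_server = avg_views_per_second / WEB_SERVERS
# Database reads for every database write.
reads_per_write = READ_FRACTION / (1 - READ_FRACTION)

print(f"{avg_views_per_second:.1f} page views/s site-wide on average")
print(f"{avg_views_per_web_server:.2f} page views/s per web server on average")
print(f"about {reads_per_write:.0f} reads per write")
```

Under these assumptions the averages come to roughly 77 page views per second site-wide, about 2.6 per web server, and about 49 reads for every write. Averages like these understate bursts, so they are a floor for planning rather than a target.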

Related Articles

  • Live Journal Architecture
  • Flickr Architecture
  • An Unorthodox Approach to Database Design : The Coming of the Shard
  • Ebay Architecture

Comments

    30!?

    30 Gigabytes of data what?
    30 GB of Bandwidth?
    30 GB worth of databases?(i think not...)

    30 GB?

    Yeh, how come digg only have 30 GB of data? If this report is authentic, I am highly amused.

    30gb is database data is

    30gb is database data is HUUUUUGE !

    i have been blogging for nearly 4 years and i've used only some 15 mb in database data.

    30 gb in database data is colossal.

    30gb db

    30 gb database isn't much if they track just a little historic data about their 1.2 million users.

    I am not convinced. Digg

    I am not convinced. Digg would need much more than the 30GB data (whatever) shown here.

    You're wondering why "only"

    You're wondering why "only" 30GB? Databases are text. A character is a byte. Do the math.

    nothing impressive

    Maybe I will be alone, but for me there is nothing to be proud. It looks standard US company where boxes are cheaper then people, so its easier to buy 100 servers then to pay 10 people to play with EXPLAIN command.

    results IS impressive

    You may come from a time when you did have to wring every last performance gain from a piece of hardware. However, it is (somewhat) true that the cost/benefit analysis is moving more toward just throwing more hardware at a problem. Engineers, especially good ones, aren't cheap; hardware is. So while you may not be impressed, I find knowing when to spend less money to accomplish the same task "smart".

    Now, that doesn't mean that proper database design should be thrown out the window. I am actually somewhat ADD when it comes to relations, constraints, indexing, etc. But there comes a time when it just makes sense to do the cost effective thing, and I would hope that is what digg does.

    My subject is not impressive

    Wow, I didn't realize how bad my last subject was till just now.

    One last thing, it actually takes quite a bit of effort to be able to scale horizontally. Just because they use a lot of boxes doesn't mean that they didn't spend a lot of time and effort in the overall system architecture and database design. That is currently what I am designing my system to be able to do, and it isn't so I can not worry about optimizing my code, its so that I can capacity plan and buy resources to handle increasing load incrementally and cost effectively. I find being able to do that IS impressive.

    30 GB

    Maybe the 30 GB he's referring to is the space needed on each cluster node. OS + Application, etc.

    re: nothing impressive

    I'll be the first to disagree. First, without the right people designing the overall architecture, throwing boxes at a problem won't do much other than increasing your electricity bill. I've seen this first hand with clients. In fact, I've seen "solutions" where clusters were used that actually decreased performance because the design counter-indicated the use of a cluster (in other words, it wasn't a scalable design).

    Personally, I'd rather have a select few architects and administrators and many powerful machines, than be top heavy with staff and not have enough server resources. :)

    --
    Dustin Puryear
    Author, Best Practices for Managing Linux and UNIX Servers
    http://www.puryear-it.com

    What load balancing

    What load balancing solution/product is Digg using?

    Digg scalable but inefficient?

    sure they can add servers cheaply, but 100 servers for 200M views per month? this compared to 1.1B views per month for 2 boxes at plenty of fish, digg sux?
