Visual Effect of Tcmalloc

Over a couple of years, as more and more data came in, our application instances gradually drove CPU usage from 50% up to 100% on each physical server. After switching to tcmalloc (http://goog-perftools.sourceforge.net/doc/tcmalloc.html) by setting

LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so"

The physical server is now running at only 20% CPU!

[Screenshot, 2015-10-15: CPU usage of the physical server]

We have 8 processes running on the server, with 100+ threads in total. Our C++ code uses STL containers and strings, so there is a lot of memory allocation, and all threads compete for the global allocator lock, hence the high CPU usage. With tcmalloc and its thread-local caches there is little contention among threads for memory allocation: the small, frequent allocations are served lock-free from per-thread caches, so there is no lock/CPU contention and CPU usage stays low.

Database Availability

In many cases, certain databases, whether traditional RDBMS or NoSQL, are the center of a company's product, for example an e-commerce service or an ads system. Keeping these DBs available is no joke; there have been outages at many famous companies, and I believe some of those outages were caused by DB issues, either availability or performance (the two are connected anyway). Here are some scenarios:

  1. DBs that have to be always available no matter what: switch failure, network link failure, cut fiber, power outage, or anything else. These DBs are usually so critical because too many components depend on them; with that said, they usually see a lot of reads from those components and few writes, and the readers must always be able to read the data. In this case we definitely need multiple data centers to support the DB, and in each data center we need a primary write server, a secondary write server, and multiple read servers; the number of read servers depends on the amount of read traffic. One data center holds the "active" primary/secondary write servers, and the other data centers hold "standby" primary/secondary write servers. When the "active" primary write server fails, the "active" secondary write server takes over; when both "active" primary and secondary write servers are gone, another data center takes over and becomes "active". All the read servers across all data centers are synced from the "active" primary write server. Because this configuration requires a lot of servers it is expensive, so such DBs should store small data, such as configuration key-values, lookup tables, metadata, etc. (A minimal routing sketch follows this list.)
  2. DBs that should be very highly available. These DBs store critical data for business workflows and the data is large, so usually there are physical partitions. Of course we again need multiple data centers; then for each partition we can set up a primary server in one data center, a standby server in another, and a backup server (so we can recover even if the primary and standby both fail) in a third data center. The sync method between primary and standby and the sync method between primary and backup should be different, so they won't fail in the same way at the same time.
  3. For less critical data, a primary server in one data center and a standby server in another data center should be enough; even primary/standby in the same data center but on a different network/power feed/floor would do.
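
To make the read/write split in scenario 1 a bit more concrete, here is a minimal, hypothetical routing sketch (class and endpoint names are made up; a real deployment would rely on the database's own replication and failover tooling plus health checks, which are not shown):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of scenario 1: writes go to the "active" primary,
// fail over to the "active" secondary, then to another data center; reads are
// spread across all read replicas in all data centers.
public class DbEndpointRouter {
    private final List<String> writeCandidates;  // ordered: active primary, active secondary, standby DC...
    private final List<String> readReplicas;     // all read servers across data centers
    private final AtomicInteger activeWriter = new AtomicInteger(0);

    public DbEndpointRouter(List<String> writeCandidates, List<String> readReplicas) {
        this.writeCandidates = writeCandidates;
        this.readReplicas = readReplicas;
    }

    public String writerEndpoint() {
        return writeCandidates.get(activeWriter.get());
    }

    // Called by a (not shown) health checker when the current writer fails.
    public void failOverWriter() {
        activeWriter.updateAndGet(i -> Math.min(i + 1, writeCandidates.size() - 1));
    }

    public String readerEndpoint() {
        return readReplicas.get(ThreadLocalRandom.current().nextInt(readReplicas.size()));
    }
}
```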

The underlying physical storage media is not discussed here. HBase and the like keep their own storage as local files, so they are isolated. For Oracle and the like, if shared storage such as a SAN is used, that is a different failure domain to design for.

Oracle, MySQL, Teradata, Hadoop, HBase, Cassandra, Redis, MongoDB

Here are my reading notes on traditional DBs and NoSQL DBs; they also amount to a selection process between RDBMS and NoSQL DBs:

  1. RDBMS: to be selected to store the absolute source of truth, with absolutely no data loss in any scenario (switch failure, data center failure); if it is going to hold more critical data, then Oracle with e.g. Active Data Guard is preferred over MySQL.
  2. Teradata or Hadoop: good for BI and analytics reports
  3. HBase: fast for random access, and even faster if the data is in cache (write then read right away); it is native to Hadoop for effective data processing with MapReduce; storage files are organized by column families in append mode with (minor and major) compactions; it can hold a huge amount of data with consistency; the HMaster can be a SPOF unless that is addressed
  4. Cassandra: good for lots of writes but few reads; it supports active-active replication so it is highly available; it is also column-family based and supports CQL; pretty high performance among NoSQL DBs, see http://blog.endpoint.com/2015/04/new-nosql-benchmark-cassandra-mongodb.html
  5. Redis: disk-backed in-memory cache/database, see http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  6. MongoDB: good for lots of reads but few writes; SQL-like queries; the master-slave architecture might be a SPOF

Another View of Capacity Planning and Disaster Recovery

When we build a back-end service such as a search engine, we need to provision enough hardware to support the peak traffic. Usually there are multiple data centers handling global traffic; when one data center gets knocked out for whatever reason, the other data centers must be able to pick up the slack and satisfy all the requests from all clients.

If there are two data centers in total, we have to provision enough hardware in each data center so that either one can carry all the traffic in case the other goes offline. That means we have 2x capacity when both data centers are healthy, which can be expensive: the hardware costs millions and millions, and half of it sits idle.

Here is another view of capacity planning and disaster recovery. The back-end service usually has many different clients, and some clients are less important than others, for example certain research batch jobs, cross-sell widgets, or email campaigns, versus the primary business flow.

A disaster, e.g. a cable cut or power outage, usually lasts a couple of hours and is few and far between. If there are enough non-critical clients, we can do the following, for example:

  • turn off the secondary widgets
  • tell the low-priority batch jobs to come back at another time
  • return a busy/error code to non-critical clients

In this way, we don't have to waste capex on 2x capacity; roughly 0.5x + 0.5x, combined with shedding the non-critical traffic during a disaster, would be enough for the business.
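
As one way to implement "return a busy/error code to non-critical clients", a simple admission check could look like the hypothetical sketch below (the client tiers and the disaster flag are made up for illustration):

```java
// Hypothetical load-shedding check: during a disaster (one data center down),
// only critical client tiers are admitted; everyone else gets a "busy" answer.
public class AdmissionControl {
    public enum Tier { CRITICAL, BATCH, WIDGET, EMAIL_CAMPAIGN }

    private volatile boolean disasterMode = false;  // flipped by operations/monitoring

    public void setDisasterMode(boolean on) { this.disasterMode = on; }

    /** Returns true if the request should be served, false if the caller should get a 503 / "try later". */
    public boolean admit(Tier clientTier) {
        if (!disasterMode) {
            return true;                    // normal operation: serve everybody
        }
        return clientTier == Tier.CRITICAL; // shed all non-critical traffic
    }
}
```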

Protobuf and a Design Concept

In many cases we need to pass data around: from file into RAM, from server to server, from raw data into class objects. So we need a formal format for the data, plus serialization/deserialization utilities to manipulate it. A text format is easier to debug, but a binary format is superior: there is no escaping of special characters and it is faster.

The design concept is to separate data from code logic (classes): the data is passed around as a raw binary file or byte array, and the classes do not hold data but manipulate it. Preferably we don't create and pass temporary objects between classes (many small objects hurt performance, because object creation needs many memory alloc/free operations that require locking); instead we just pass a triplet (address, offset, length) to the classes. Another desired capability is partial parsing of a byte array: when we only need one piece of a value (identified by offset/length) in a very large raw array, we shouldn't have to deserialize the entire array.
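
Here is a minimal sketch of the (address, offset, length) idea in Java, where the "address" is the backing byte array; the class name is hypothetical, and a real implementation would add bounds checks and typed accessors:

```java
// Hypothetical zero-copy view over a large byte array: classes receive this
// view instead of copies or temporary objects, and only decode the piece they need.
public final class ByteRange {
    private final byte[] buf;   // the raw data, loaded once from file or network
    private final int offset;   // where this field/record starts
    private final int length;   // how many bytes belong to it

    public ByteRange(byte[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    // Narrow the view further without copying, e.g. to a single field inside a record.
    public ByteRange slice(int relOffset, int len) {
        return new ByteRange(buf, offset + relOffset, len);
    }

    // Decode only when (and if) the value is actually needed.
    public String asString() {
        return new String(buf, offset, length, java.nio.charset.StandardCharsets.UTF_8);
    }
}
```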

Google protobuf seems to fit the bill: it handles backward-compatible versions and is easy to program with. But the partial/delimited parsing capability seems to differ across programming languages; see a discussion here. To sum up the different APIs:

  • Java: void writeDelimitedTo(OutputStream output)
  • C++: CodedOutputStream::WriteVarint32
  • C#: SerializeWithLengthPrefix

But in the same post above, the author of protobuf did provide a C++ version of the writeDelimitedTo method.
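
On the Java side, writeDelimitedTo and its counterpart parseDelimitedFrom handle the length prefix for you; the sketch below assumes a generated message class named MyMessage with an id field (both names are hypothetical, produced by protoc from a .proto definition):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;

// Assumes MyMessage is generated by protoc from a .proto file (name is hypothetical).
public class DelimitedExample {
    public static void main(String[] args) throws Exception {
        // Write several length-prefixed messages into one stream/file.
        try (FileOutputStream out = new FileOutputStream("messages.bin")) {
            MyMessage.newBuilder().setId(1).build().writeDelimitedTo(out);
            MyMessage.newBuilder().setId(2).build().writeDelimitedTo(out);
        }

        // Read them back one at a time; parseDelimitedFrom returns null at end of stream.
        try (FileInputStream in = new FileInputStream("messages.bin")) {
            MyMessage msg;
            while ((msg = MyMessage.parseDelimitedFrom(in)) != null) {
                System.out.println(msg.getId());
            }
        }
    }
}
```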

Google has released proto3 (https://github.com/google/protobuf/releases); I need to check whether the APIs are unified across these languages.

Messaging System Comparison: Kafka vs. RabbitMQ vs. ActiveMQ

It seems Kafka has caught up quickly these days, and it is an outright winner against many other existing messaging systems such as ActiveMQ; it must be doing something right. Here is a blog with some performance graphs for Kafka, RabbitMQ and ActiveMQ: Kafka is faster and has much higher throughput. Here is another blog comparing several messaging systems on many features; Kafka does not seem to lose to any of them in terms of richness of features.

That matches what I see: many companies and projects are adopting Kafka, good for its initiators. However, there are three areas where I would be cautious: routing, persistence, and message storms.

Routing: Kafka depends on files to transfer the messages. If there is a large number of subscribers, the same message file needs to be copied to many subscribers many, many times, and there is inefficiency there. For example, if the same file needs to go to 10 servers on the same rack, the file goes through the rack switch 10 times; if 10 could be reduced to 1, that would be much more efficient. Modern network switches are very capable and sophisticated; they should be able to help with this routing problem.

Persistence: Kafka uses files for persistence and transfer, so all Kafka brokers, publishers and subscribers need disk space dedicated to Kafka. Disk speed and disk failure become two issues: 1. subscriber servers usually run mission-critical applications, and Kafka competes not only for CPU (Kafka CPU usage can be capped) but also for disk (which is not easy to cap); we could use two disks per server, but that is extra cost. 2. The disk failure rate is not that low; if we have 10,000 servers in the cluster, many disks will fail over time, and Kafka will not work there. However, the critical applications running on subscriber servers are often required to have high availability, high throughput, and low latency, and many such applications can run without a disk, so a disk/Kafka failure damages availability. If a Kafka subscriber could work without a disk, that option would find its uses in many projects.

Message storm: Kafka will go as fast as it can, which is great; however, it can easily saturate network links, so throttling should be put in many places to prevent a message storm from happening, especially when bootstrapping publishers or subscribers.
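
As one example of putting throttling in place, a subscriber can pace itself on the client side. Below is a minimal, hypothetical sketch using the Java consumer API (topic name and byte budget are made up; newer Kafka versions also offer broker-side quotas, which are usually the cleaner option):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PacedConsumer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "paced-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        long maxBytesPerSecond = 5_000_000L;  // crude per-consumer budget (made-up number)

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                long windowStart = System.currentTimeMillis();
                long bytes = 0;
                ConsumerRecords<String, String> records = consumer.poll(200);
                for (ConsumerRecord<String, String> rec : records) {
                    bytes += rec.value() == null ? 0 : rec.value().length();  // rough size estimate
                    // ... process the record ...
                }
                // If we consumed faster than the budget, sleep off the rest of the one-second window.
                long elapsed = System.currentTimeMillis() - windowStart;
                if (bytes >= maxBytesPerSecond && elapsed < 1000) {
                    Thread.sleep(1000 - elapsed);
                }
            }
        }
    }
}
```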

Solr or ElasticSearch

This Solr vs ElasticSearch page has a pretty extensive comparison between them; the following are my takeaways:

  • It seems ElasticSearch is catching on these days; more people are choosing it over Solr
  • Compared with Solr, ES is easier to set up and easier to scale
  • Both support document-by-document updates, which are easy to use
  • Both support some kind of partial document update, which is a good feature to have, but I am concerned about index fragmentation
  • I don't know whether they support index updates for a set of documents, that is, updating many docs in the same call (see the sketch after this list)
  • Managing schema changes, index definition changes, etc. is not trivial
  • It seems neither supports full indexing and rolling restarts in the search grid well; we still need to manage synchronization between the full index and delta indexes, and among the delta indexes
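
On the "many docs in the same call" point, Elasticsearch's _bulk endpoint does accept update actions with partial docs; here is a minimal sketch over plain HTTP (host, index, type, and field names are made up, and I have not checked what the Solr equivalent looks like):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EsBulkUpdate {
    public static void main(String[] args) throws Exception {
        // Bulk body: one action line plus one partial-doc line per document, newline-delimited.
        String body =
            "{\"update\":{\"_index\":\"products\",\"_type\":\"doc\",\"_id\":\"1\"}}\n" +
            "{\"doc\":{\"price\":9.99}}\n" +
            "{\"update\":{\"_index\":\"products\",\"_type\":\"doc\",\"_id\":\"2\"}}\n" +
            "{\"doc\":{\"price\":19.99}}\n";

        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://localhost:9200/_bulk").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-ndjson");  // newline-delimited JSON
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```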

HTTPS or HTTP

It is common sense that a website should be secure, and nowadays many web pages run over HTTPS, e.g. Google. That is definitely desirable given that many emails, passwords, and even credit card numbers have been stolen from many different organizations, from social media to financial institutions to hospitals and even government agencies.

However, HTTPS has extra cost compared with HTTP. HTTPS needs multiple handshakes between client and server for each connection, it needs to encrypt the content, and certificates have to be validated against a Certificate Authority, so there is plenty of overhead. Even with newer techniques that reduce handshake round trips, it is still more complex and slower than HTTP.

So, if a certain layer of traffic happens within the company firewall and travels over a secured network, that internal traffic should use straight TCP/HTTP: it is faster, simpler, and easier to support, for example when doing load balancing. In this situation it is more important for the company to authenticate user logins and make sure all accesses are authorized.

Also, with straight TCP/HTTP for internal traffic, the code is much simpler, and testing and maintenance are easier.

How to Scan NoSQL Storage, e.g. HBase

NoSQL storage such as HBase usually does not provide the capability to index columns, so it can be expensive to scan the data. HBase provides a scan API and utilities, but a scan operation can turn into a full-table scan that goes through all rows; Hadoop/HBase is perfectly suited for that, but fast scans are still desirable in certain use cases. A few options we considered:

  • HBase HFiles are stored with row keys sorted, so it is efficient to scan by row key; but: 1. it is the one and only "indexed" field; 2. it is tricky to play with row keys, and designing the row key around one scan use case is not recommended: what if the use case changes later?
  • Perform the scan outside HBase in storage that is efficient for scanning, such as MySQL, and then read rows from HBase by row key. New columns and new indexes can easily be added to MySQL to accommodate new or changed use cases;
  • Create a column family within HBase just for scanning, and perform a time-range scan on this column family, if the use case is a time-range scan (see the sketch after this list). HFiles for different column families are stored in different files, so if we keep the scan CF small, its HFiles will be small and the scan will be fast. Also, if the time range is recent, the CF data is likely (configurable!) still in the memstore, which means no disk reads at all and very fast scans; even if the CF has been flushed to disk, between major compactions (see http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/) HFiles are created in time sequence, so only the newer HFiles need to be read for the scan, which is still quite efficient;
  • The above idea can also be implemented as a separate HBase table, or even a separate HBase cluster; the design decision depends on the access pattern and the failure domain. For example, with a separate HBase scan table, errors on that table can be isolated from the main data table, in case some scan clients could damage the HBase table.
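
A minimal sketch of the time-range scan over a dedicated scan column family, using the standard HBase client Scan API (the table handle is assumed to be opened elsewhere, and the column family name is hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {
    // Scans only the small "scan" column family for cells written in [startMs, endMs).
    public static void scanRecent(Table table, long startMs, long endMs) throws IOException {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("scan_cf"));  // hypothetical dedicated scan CF
        scan.setTimeRange(startMs, endMs);         // only touches memstore/HFiles in this range
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
                // ... collect row keys here, then fetch full rows from the main CF/table by row key ...
                System.out.println(Bytes.toString(row.getRow()));
            }
        }
    }
}
```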

Software Testing

I know so little about testing that I don't feel qualified to summarize anything about it, so a wiki page will help: http://en.wikipedia.org/wiki/Software_testing. But anyway, I will still describe how I see testing (and release) and which levels we should have:

  1. unit test: all engineers should write unit tests and evaluate the test coverage
  2. regression: let's say this is regression testing for "this" component or library of the development team
  3. API integration: testing the interfaces with clients and other partner teams
  4. LnP test: load and performance, stress test; LnP testing is essential but tricky, as it is difficult to reproduce the production environment
  5. burn-in: the first stage of release; the new code is already released into production, but into a burn-in pool for testing with replicated traffic
  6. smoke test, sanity testing: the new code is released to some production servers, and sanity testing is done against those servers
  7. rolling activation: roll out the new code gradually; the activation process may stop if system health deteriorates
  8. A/B test: for business validation; this is tricky as well, because an A/B test needs enough traffic to produce a meaningful result, so the number of concurrent A/B tests becomes a challenge

In an agile environment, ideally we can have frequent check-ins, frequent releases, and quick fixes, so all these tests need to fit together to make sure the new code in each quick iteration does "no harm" to the entire system.