Search Engine Relevance

Relevance is such a broad topic that I will group the bullet points into three buckets: query understanding and query rewrite, query evaluation and ranking, and relevance feedback.

Query understanding and query rewrite:

  • we want to understand the intent of the query, so that we can find the most relevant results
  • the query should be tokenized with the same tokenization algorithm used by the index, and also processed for plurals, -ing forms, synonyms, acronyms, etc
  • annotate the query with location, time etc
  • add query demand, such as category demand histogram, token weight
  • attach boolean constraints such as category, price range
  • identify facet and prepare for facet search, eg: Color=White
  • in-session and personalized search relevance parameters, eg: interests, budget buyer
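
As a sketch of the tokenize/normalize steps above (in Python, with a made-up synonym table and a naive plural rule; a real engine would reuse the index's actual analyzer and a proper stemmer):

```python
import re

# toy synonym/acronym table (illustrative only)
SYNONYMS = {"tv": "television", "phones": "phone", "cases": "case"}

def rewrite_query(query):
    """Tokenize and normalize a query the same way the index does."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())   # match index tokenization
    normalized = []
    for t in tokens:
        t = SYNONYMS.get(t, t)            # synonym / acronym expansion
        if t.endswith("s") and len(t) > 3:
            t = t[:-1]                    # naive plural stripping (a real engine uses a stemmer)
        normalized.append(t)
    return normalized
```

The key point is that the rewrite must produce exactly the token forms the index stores, otherwise nothing matches.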

Query evaluation and ranking:

  • evaluate/execute the query tree which usually has AND/OR/NOT nodes with CONTAINS, GREATER_THAN, LESS_THAN, PHRASE, EXISTS etc operators
  • collect the results in a heap to rank them
  • depending on how complex the search engine is, there can be multiple ranking rounds
  • the first round is cheap and quick sorting applied to all results; the sort parameters are static relevance indicators, eg static quality score, click-through rate, product review score, seller credentials
  • the second round can be a simple machine-learned formula with a few parameters, applied to the top N results from the first round
  • the third round is extensive and can be a bit expensive, such as a probabilistic decision forest; it is applied to the top few pages of results
  • the last round of ranking involves business logic, such as diversification, or category or format mixing (based on rules or stats or both), so we can handle problems such as: too many similar items, too many newly listed items, too many items from the same seller, too many/few top categories, etc
  • we can separate machine learning data into different sectors, such as geo location, user segment, query segment, category etc; this means we need a lot of data
  • the above strategy works well for popular, frequent queries; for tail queries the sample size is too small for machine learning, and an old-fashioned relevance model works better, such as the Vector Space Model, a probabilistic model, or a language model; a regular tf-idf formula plus weights for some boolean constraints (category, price range) can work; or the bi-grams of a long query can “vote” for the top categories
  • pagination raises two problems and needs special consideration: 1. how to preserve results across multi-round ranking; 2. a large page number can blow up internal memory usage!
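
The multi-round flow above can be sketched roughly like this (Python; the field names, weights, and stubbed third-round scorer are all made up for illustration):

```python
import heapq

def expensive_model_score(doc):
    # placeholder for e.g. a decision-forest prediction
    return doc["static_score"] * (1.0 + doc["ctr"])

def rank(results, n_round2=1000, n_round3=200):
    """Multi-round ranking sketch: a cheap static sort over everything,
    then progressively more expensive scorers on fewer candidates."""
    # round 1: cheap static score over all results (heap keeps top N)
    round1 = heapq.nlargest(n_round2, results, key=lambda d: d["static_score"])
    # round 2: simple learned formula with a few parameters (weights are made up)
    round2 = sorted(round1,
                    key=lambda d: 0.6 * d["static_score"] + 0.4 * d["ctr"],
                    reverse=True)[:n_round3]
    # round 3: expensive model on the top few pages only
    return sorted(round2, key=expensive_model_score, reverse=True)
```

Each round only sees the survivors of the previous one, which is what keeps the expensive models affordable.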

Relevance feedback

  • user behavior data (minus identification information!) should be collected at runtime, and data-mined later, such as page views, click through rates, conversion rates, etc.
  • explicit user feedback “is this page helpful?” can provide different view points.
  • human judges are still essential to evaluate the relevance of existing and new algorithms.
  • time-decay (recency) should be applied to many numbers.
  • feedback data is huge, and injecting all of it into the search index can drain a lot of resources, so update frequency can vary:
    • views and transactions should be real-time;
    • decayed counts can be updated daily;
    • user trustworthiness and forward estimates of shipping and handling time can be every 24 hours, or on demand.
    • aggregated query demand and category demand for different regions/use-cases can be 24 hours, or on-demand, such as “iPhone X” product launch.
    • offline computed document clusters, related query recommendations etc can be 24 hours.
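
The time decay mentioned above can be as simple as an exponential half-life (the 30-day half-life below is an arbitrary example):

```python
def decayed_count(count, age_days, half_life_days=30.0):
    """Exponential time decay: a count loses half its weight per half-life."""
    return count * 0.5 ** (age_days / half_life_days)

def daily_update(decayed, todays_count, half_life_days=30.0):
    """Fold one day's new events into a running decayed counter."""
    return decayed * 0.5 ** (1.0 / half_life_days) + todays_count
```

The running-counter form means we never need to keep raw per-day history around, only one number per metric.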

 

Query Demand for Ranking

When a user types a query, for example “iphone 6s”, it helps if we can detect the intent based on historical demand. The above query matches “iphone 6s white 64gb verizon” better than “pink case for iphone 6s”; we can use the title-match score as a feature in the machine-learned ranking model (eg a decision forest).

A simple Naive Bayes based log-likelihood ratio can help, see https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Document_classification . The NB model can be trained with user data; for example, the NB model for the query “iphone 6s” can be trained with buyer-clicked items as the positive class and displayed-but-not-clicked items as the negative class. This will help it learn “64gb” and “case” in the above example.
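
A minimal per-query NB sketch (Python, with whitespace tokenization and Laplace smoothing; a real model would share the index tokenizer and train on far more data):

```python
import math
from collections import Counter

def train_nb(clicked_titles, unclicked_titles):
    """Per-query Naive Bayes: clicked titles form the positive class,
    displayed-but-not-clicked titles the negative class. Returns a scorer
    that gives the log-likelihood ratio for a candidate title."""
    pos = Counter(t for title in clicked_titles for t in title.split())
    neg = Counter(t for title in unclicked_titles for t in title.split())
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    v = len(set(pos) | set(neg))          # vocabulary size for Laplace smoothing
    def llr(title):
        score = 0.0
        for t in title.split():
            p = (pos[t] + 1) / (n_pos + v)   # P(token | clicked)
            q = (neg[t] + 1) / (n_neg + v)   # P(token | not clicked)
            score += math.log(p / q)
        return score
    return llr
```

Tokens like “64gb” get a positive log-ratio and “case” a negative one, which is exactly the title-match signal we want to feed into ranking.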

But how about “new iphone 6s with case”? In this case, category demand can help: with the query “iphone 6s”, the buyer is looking for the Phone category rather than the Accessory category. We can collect the top categories for the query, and use the match score as a feature inside the ranking model.

To collect the title-matching data and category-demand data, we need to capture the full context of the query: displayed items (title, category, price, etc), clicks, and no-clicks. That is a lot of data, but that is what machine learning, or Big Data, is for.

There are many sparse queries, such as long queries. They might not have enough data to train a reliable NB model. We can cluster the queries based on user behavior, or we can use the bi-grams of a long query to “vote” for, for example, the top categories.
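
The bi-gram voting idea can be sketched like this, assuming a hypothetical mined table that maps bi-grams to their historical top categories:

```python
from collections import Counter

def vote_categories(query, bigram_categories, top_n=3):
    """Each bi-gram of a long query 'votes' for the categories it was
    historically associated with; the top vote-getters win."""
    tokens = query.split()
    votes = Counter()
    for a, b in zip(tokens, tokens[1:]):
        for cat in bigram_categories.get((a, b), []):
            votes[cat] += 1
    return [cat for cat, _ in votes.most_common(top_n)]
```

Because the bi-gram table is aggregated across many queries, it still has signal even when the exact long query was never seen before.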

 

Regression Model for eCommerce Relevance

Nowadays, all sophisticated eCommerce search engines harvest user data to guide relevance ranking. First of all, we need a lot of user behavior data; then we can mine the data and build machine learning models offline, and predict user intent at runtime with those models.

Let’s say we will build decision trees (in practice, a gradient-boosted decision forest) for ranking. We can label the training products with Sale/NoSale labels, then use the decision trees to classify recalled products into the Sale/NoSale buckets with a probability, and rank the results by the probability of Sale.

The above will work. However, it will skew towards lower-priced accessories, as they are more likely to sell than their main products. To compensate, we can use decision trees to regress the sale price, and rank by the predicted sale price from the trees. Basically, we just need to label the products with log(sale price), or 0 if there was no sale, and then build a regression model.
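
A tiny sketch of the labeling and ranking step (the `predict` callable stands in for a trained forest, which is out of scope here; field names are made up):

```python
import math

def regression_label(sale_price):
    """Training label: log(sale price) for a sale, 0 for no sale. The log
    damps the price skew so a few very expensive items don't dominate."""
    return math.log(sale_price) if sale_price and sale_price > 0 else 0.0

def rank_by_model(products, predict):
    """Rank recalled products by predicted log(sale price) rather than
    P(sale), so cheap accessories don't automatically out-rank phones."""
    return sorted(products, key=predict, reverse=True)
```

Under this labeling, a $10 case that sells 20% of the time (expected label about 0.46) loses to a $600 phone that sells 10% of the time (about 0.64), which is the compensation we wanted.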

The decision tree will learn from many factors, the factors can have these types:

  1. Listings — for example, eBay, Craigslist, and Amazon 3rd party all have “listings”. Listing history metrics can be machine learning factors, such as sold quantity, total sales, views and impressions, views over impressions, etc. However, old listings will accumulate metrics and out-rank new listings, so we need to decay a listing’s metrics regularly over time, so the system is fair to new listings.
  2. Similar Listings — this is important for a site that doesn’t have “products”, such as Craigslist. Many characteristics of similar listings — which a simple title tf-idf formula can retrieve — can be used for machine learning, such as average or median sale price, sales, views, impressions, VoI, etc.
  3. Products. If it is a product-based system, eg Amazon, or a listing-based system with a catalog of products, eg eBay, then product condition, color, average or median sale price, and sales rank are all important factors. Also, old releases of products will accumulate metrics and out-rank new releases, so we need to decay or use sliding windows.
  4. Seller. Seller trustworthiness, such as rating and feedback scores, and also shipping and return policies, etc, all matter.

We need to feed these (listing + similar + product + seller) factors from the marketplace into the models and tune the models to learn the correct factors. To do all this, we need a huge amount of user behavior data to get a good reading of user intent.

User behaviors can be very different by country (UK vs US), by region (Europe vs Asia), by season (summer vs winter) and by query group (Star Wars toys vs Star Wars collectibles). We can throw all these factors into machine learning training, but we need to review the most important factors from the learned model to see if they make sense. Otherwise, it makes sense to manually slice the training data into major business use cases, such as regions, Amazon products vs 3rd party products, eBay auction vs fixed-price listings, top-level categories, search vs merchandising, and maybe major query clusters, then build different models for specific use cases. The slices cannot be too small, otherwise the training set will be too sparse; that is why we need a huge amount of data (so Google wins).

After relevance ranking, we might need to shuffle the ranked results into several buckets, based on certain business rules. For example, the ratio between established listings and newly listed ones, the condition mix (eg new vs used), Amazon products vs 3rd party products, eBay auction vs fixed-price listings, etc. Many of these need to be collected as demand data offline and fed into the relevance model at query time. See https://suniphrase.wordpress.com/2015/01/15/dynamic-mingling-of-ranked-results/
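
One simple mingling rule, limiting consecutive results from the same bucket, can be sketched as follows (a made-up greedy algorithm for illustration, not necessarily what the linked post describes):

```python
def mingle(ranked, key, max_run=2):
    """Re-shuffle a ranked list so no more than `max_run` consecutive
    results share the same bucket (e.g. seller, category, or format)."""
    out, deferred = [], []
    for item in ranked:
        # count how many of the last `max_run` results share this bucket
        if sum(1 for x in out[-max_run:] if key(x) == key(item)) >= max_run:
            deferred.append(item)          # push down the over-represented bucket
            continue
        out.append(item)
        # re-admit deferred items whenever the run is broken
        i = 0
        while i < len(deferred):
            if sum(1 for x in out[-max_run:] if key(x) == key(deferred[i])) < max_run:
                out.append(deferred.pop(i))
                i = 0
            else:
                i += 1
    return out + deferred                  # anything still deferred goes last
```

The greedy pass keeps the relevance order as intact as possible while breaking up long runs from one seller or category.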

And personalization can be part of relevance ranking; for example, a buyer’s recent click history can be an input to the relevance model.

Doing all of the above is costly in resources; it takes a lot of hardware capacity to retrieve/compute the factors and run them through the decision forest. We need to consider how to sort inverted lists for a given use case, physical document tiers, and multi-round ranking.

Signals and Signal Handling in C Multi-Threads

A refresher for myself, copied and pasted from online sources:


Signals are the system’s way of informing a process about various events. There are two types of signals, synchronous and asynchronous.

Synchronous signals are a result of a program action. Two examples are:

  1. SIGFPE, floating-point exception, is returned when the program tries to do some illegal mathematical operation such as dividing by zero.
  2. SIGSEGV, segmentation violation, is returned when the program tries to access an area of memory outside the area it can legally access.

Asynchronous signals are independent of the program. For example, the signal sent when the user gives the kill command.


So which thread will get the signal? If it is a synchronous signal, the signal is delivered to the thread that generated it. Synchronous signals are commonly managed by having an appropriate signal handler set up in each thread to handle any that aren’t masked.


Use signal() to set up a signal handler—nice and simple, but not very robust.


If it is an asynchronous signal, it could go to any of the threads that haven’t masked out that signal.


So a common way of handling asynchronous signals in a multi-threaded program is to mask signals in all the threads, and then create a separate thread (or threads) whose sole purpose is to catch signals and handle them. The signal-handler thread catches signals by calling the function sigwait() with details of the signals it wishes to wait for.
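
The same mask-then-sigwait pattern can be sketched in Python, whose signal module wraps pthread_sigmask(2) and sigwait(3) (POSIX only; threads inherit the creating thread's mask):

```python
import os
import signal
import threading

def run_with_signal_thread():
    # block SIGUSR1 in the main thread; new threads inherit this mask,
    # so no ordinary thread will ever receive it
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    caught = []

    def signal_thread():
        # the dedicated handler thread waits synchronously with sigwait()
        signo = signal.sigwait({signal.SIGUSR1})
        caught.append(signo)

    t = threading.Thread(target=signal_thread)
    t.start()
    os.kill(os.getpid(), signal.SIGUSR1)   # deliver an asynchronous signal
    t.join()                               # sigwait() consumes the pending signal
    return caught
```

Because the signal is blocked everywhere, it stays pending until the handler thread fetches it, so there is no race between sending and waiting.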

Memory Management in our Search Engine

Here is how we manage memory in the query nodes of our search engine, for two memory use cases: index memory and query memory.

Index Memory

We load the entire index into main memory for performance. The index memory is managed in a way that is very similar to a buddy system http://www.memorymanagement.org/mmref/alloc.html#mmref-alloc-buddy, especially in the sense that both use an array of free lists, and both use the heads of the free lists to achieve locality. However, there are major differences, as we design and tune our memory system to perform well for a specific use case, namely search engine indexing.

There are 3 major types of objects in the index that consume a lot of memory: #1 string table (dictionary, lexicon); #2 document data (aka column group, aka forward index); #3 posting lists (aka inverted index). We use the following memory management technique for #1 and #2.

Initialization:

  1. a big consecutive memory buffer is allocated from the OS; more such buffers can be allocated later;
  2. an array of “free lists” (see details below) is created for this buffer; each free list is a linked list of free memory blocks of the same size; initially all free lists are empty;
  3. a “free pointer” points to a position in the buffer; all memory after this position in the buffer is free; the pointer is initialized to the beginning of the buffer.

Allocation:

  1. pre-parse the data into a temp object to figure out the object size, then round the actual object size up to the “allocation unit” boundary, eg 8 bytes; this is the size of the memory block to allocate;
  2. if the free list for this allocation size is not empty, take and return the first memory block in the free list, adjusting the head of the free list; if it is empty, check and allocate from the next closest bigger free list; if all are empty, allocate a block from the “free pointer” and adjust the free pointer accordingly;
  3. if all free lists and the free pointer of this big memory buffer are exhausted, check the other big memory buffers; if none is available, create a new memory buffer from the OS; if the number-of-buffers limit is reached, return an out-of-memory error;
  4. copy the temp object into the actual allocated memory block.

Deallocation:

  1. just add the freed block to the head of the free list for its size; we insert at the head for locality;
  2. the linked list uses in-place memory to store the “next” pointer, so we don’t need any extra memory for the list.
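
A toy model of the scheme (Python, sizes in allocation units; the real system stores the “next” pointers in place and also falls back to the next-bigger free lists, which this sketch omits):

```python
class IndexAllocator:
    """Toy model of the index allocator: one big buffer, size-keyed free
    lists, and a bump 'free pointer'. Offsets stand in for addresses."""
    def __init__(self, capacity_units=1 << 20):
        self.capacity = capacity_units
        self.free_ptr = 0            # everything after this offset is free
        self.free_lists = {}         # block size -> list of free block offsets

    def alloc(self, size_units):
        # 1. try the exact-size free list first
        blocks = self.free_lists.get(size_units)
        if blocks:
            return blocks.pop(0)     # take the head for locality
        # 2. otherwise bump-allocate from the free pointer
        if self.free_ptr + size_units > self.capacity:
            raise MemoryError("buffer exhausted")  # real system opens a new buffer
        offset, self.free_ptr = self.free_ptr, self.free_ptr + size_units
        return offset

    def free(self, offset, size_units):
        # push onto the head of the size's free list; no splitting, no coalescing
        self.free_lists.setdefault(size_units, []).insert(0, offset)
```

Because document and word sizes cluster tightly, recycled blocks of the exact size are almost always available, which is why skipping the buddy-style splitting works here.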

Here are some details about actual implementation:

  • the number of free lists per buffer is limited; it is an array: uint32 free_lists[2^16]; the position in this array does double duty as the memory block size for that free list, so the min block size is one “allocation unit” and the max block size is 2^16 allocation units;
  • the system can have up to 2^12 buffers, and each buffer has 2^20 allocation units, so the “address” can be encoded into a uint32;
  • the unit size is configurable; it can be configured to be 2^x bytes (so 1, 2, 4, 8, …); by default it is 8 bytes, so the actual addressable space is 2^12 * 2^20 * 2^x bytes, 32GB by default;
  • we use 2^x bytes (rather than 1, 2, 3, 4, 5, … bytes) as the allocation unit and bit-shifting to calculate the physical address, as bit-shifting is faster than multiplying.
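
The address packing can be illustrated like this (field widths per the defaults above; function names are mine):

```python
UNIT_SHIFT = 3     # 2^3 = 8-byte allocation unit (configurable)
OFFSET_BITS = 20   # 2^20 allocation units per buffer
BUFFER_BITS = 12   # up to 2^12 buffers

def encode(buffer_id, unit_offset):
    """Pack (buffer id, unit offset) into one uint32-style address."""
    assert buffer_id < (1 << BUFFER_BITS) and unit_offset < (1 << OFFSET_BITS)
    return (buffer_id << OFFSET_BITS) | unit_offset

def decode(addr):
    """Unpack an address back into (buffer id, unit offset)."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)

def byte_offset(addr):
    # bit shift instead of multiply to get the byte position inside the buffer
    return (addr & ((1 << OFFSET_BITS) - 1)) << UNIT_SHIFT
```

With 12 + 20 bits the whole address fits in a uint32, and the power-of-two unit size turns the byte-offset multiply into a shift.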

Here are major differences from other generic memory management systems:

  • it is implemented for single-threaded usage; as a result, there is no lock anywhere, so it is super fast; it is tailored for the single indexing thread in the indexing component; in a multi-threaded environment, each thread has to have its own memory management instance, and of course one thread can have multiple such instances
  • a typical buddy system will repeatedly split a big free block into two halves until it finds a proper size; that is to deal with fragmentation for generic use cases, which can have any allocation/deallocation pattern; this system does not split halves, it just allocates from the “free pointer”, and it dynamically and automatically adjusts to the memory allocation characteristics of the data, eg document sizes are closely bundled, as are word sizes; as a result, unusable fragments and gaps are reduced greatly. An analogy can be made to a (document or word) object pool, but with no type and no fixed size.

For the posting list use case (#3 above, aka inverted index), we need to either create a new posting list, or grow an existing posting list to append new logical doc ids; we allocate memory as follows:

  • use malloc or the mmap system call directly to allocate memory from the OS;
  • call malloc if the size is under a certain configurable threshold; otherwise, call mmap to allocate an anonymous block of memory;
  • if needed, re-allocate and grow the posting list, growing with 10% extra room.
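
The growth policy can be sketched as follows (the 1MB mmap threshold is a made-up example; the real threshold is configurable per the above):

```python
def grow(current_capacity, needed, mmap_threshold=1 << 20):
    """Return (new_capacity, use_mmap): grow with ~10% headroom, and use
    anonymous mmap for large blocks so frees go straight back to the OS."""
    if needed <= current_capacity:
        return current_capacity, current_capacity > mmap_threshold
    new_capacity = int(needed * 1.1)   # 10% extra room for future appends
    return new_capacity, new_capacity > mmap_threshold
```

The headroom amortizes repeated appends, and routing big blocks through mmap is what avoids the fragmentation effect described below.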

When we use mmap and munmap, the memory is returned to the OS, so it won’t show memory fragmentation effects; with malloc/free, the memory will most likely stay within the process space.

Query Memory

To manage query time memory:

  • we use object pools for query objects, query responses, sort heaps, histograms, etc;
  • the same malloc/mmap trick above is applied, but the mmap thresholds are defined per use case;
  • the objects have a lot of internal buffers and arrays; we dynamically adjust (grow and shrink) each array size to be 1.6x of the runtime average size for its type. As a result, most queries require no memory allocation at all and run very fast; the buffer size self-adjustment is implemented with ratios (eg 1.6x) and some min/max limits.
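
The self-adjusting buffer can be sketched like this (only the 1.6x ratio and the min/max limits come from the description above; the 0.9/0.1 moving average and the shrink-at-half rule are my own illustrative choices):

```python
class SelfAdjustingBuffer:
    """Per-type buffer that tracks the running average request size and
    resizes toward ~1.6x of it, within min/max limits, so most queries
    reuse the buffer without any allocation."""
    def __init__(self, min_size=64, max_size=1 << 20, ratio=1.6):
        self.min_size, self.max_size, self.ratio = min_size, max_size, ratio
        self.capacity = min_size
        self.avg = float(min_size)

    def reserve(self, needed):
        # exponential moving average of request sizes
        self.avg = 0.9 * self.avg + 0.1 * needed
        target = min(self.max_size, max(self.min_size, int(self.avg * self.ratio)))
        if needed > self.capacity or target < self.capacity // 2:
            # grow to fit, or shrink when the buffer is far oversized
            self.capacity = max(target, needed)
        return self.capacity
```

A rare huge query grows the buffer once; a stream of normal queries then shrinks it back, so steady-state traffic allocates nothing.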

The best solution would still be a dedicated memory allocator, as we still observe fragmentation (memory grows over time) and slower queries (a big query has to allocate memory every time). However, we have regular code releases and full index pushes, both of which require restarts, and restarts reset the memory fragmentation of the process. Big, slower queries (as long as they are reasonable) are still fast enough (in the 100ms range) for their callers, as we have enough objects and async threads in the query evaluation pipeline.

In short, in our case, it performs well enough to just use malloc/mmap with self-adjusting buffers.

Oracle “Snapshot Too Old”

I have seen the Oracle “Snapshot Too Old” error too many times; here are some lessons learned.

Oracle maintains rollback segments so it can roll back failed transactions. A rollback segment is also used to provide read consistency for long-running SELECT sessions; if a read session runs too long, older versions of the data are no longer available in the rollback segment, and the ORA-01555 “snapshot too old” error occurs.

It is possible to increase UNDO_RETENTION to keep the old versions longer; however, it is better to tune the SELECT SQLs so they run faster, eg reduce the number of columns to retrieve, reduce the number of rows to return, create better indexes for repeated SQLs, execute the SQL during off-peak hours, consume the results as soon as they are ready by reducing network delay, co-locate the app and the data, etc.

If the error is triggered by LOBs, it shows up as ORA-22924 “snapshot too old”. LOB rollback data is kept within the LOB data space itself, so PCTVERSION or RETENTION can be tuned for LOB rollback size or retention period. But, as above, the first option is to improve SQL speed.

Learning HugePages

We are using dedicated bare-metal machines with 76GB RAM to run our application; in most cases, 60+GB can be consumed by one process, so memory management is critical. The most important factor is still good architectural design and good implementation in code, eg specific memory management for a specific use case to reduce memory alloc/free as much as possible; however, it helps if the OS supports system-level features to speed up memory operations.

We have recently been tuning huge pages; here is what I learned from a few web searches. From https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

  • Users can use the huge page support in Linux kernel by either using the mmap system call or standard SYSV shared memory system calls (shmget, shmat).
  • Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes.
  • Huge pages cannot be swapped out under memory pressure.

All the above features are good for dedicated hardware running a dedicated application: huge pages and their TLB entries speed up virtual memory addressing, they won’t fragment the heap, and they won’t be swapped (we load everything into memory anyway).

This, http://www.pythian.com/blog/performance-tuning-hugepages-in-linux/, is a good document on diagnosing performance and tuning huge pages for Oracle on Linux.

And, http://dev.mysql.com/doc/refman/5.0/en/large-page-support.html, huge page support for MySQL.

Update: if we have a good memory management design and a good customized memory allocator, huge pages don’t help much, because a good memory allocator will eliminate most malloc/free calls and fragmentation; in that case, huge pages add complexity instead.

Discrimination in Online Marketplace

Discrimination does happen in any online marketplace; for example, http://www.tnooz.com/articles/discrimination-on-Airbnb-do-whites-make-more-than-blacks/ reports:

… a recently released study from the Harvard Business School that found that black hosts on Airbnb were charging 12% less than white hosts for similar listings.

Of course, the study is debatable; for example, the average charge might be lower, but if the booking rate is higher, then the total charge can come out equal or even higher.

Nonetheless, discrimination is a real problem online. It shows on the buyer side as well; for example, a prospective black renter is more likely to be rejected by an Airbnb host than a prospective white renter (a study by the same authors as above, Benjamin Edelman and Michael Luca).

How to curb discrimination? The study suggests “withholding information from market participants”, such as full name and profile photo. I doubt how effective that would be; also, it is not too difficult to find this information in today’s online world.

I lean more towards “using empirical and relevant criteria for determining a guest’s and host’s trustworthiness”. Eg feedback scores can definitely help; however, they suffer from “Reputation Inflation” http://econweb.tamu.edu/common/files/workshops/Theory%20and%20Experimental%20Economics/2015_3_5_John_Horton.pdf , because of the cost of giving bad feedback.

Besides feedback, John Horton proposes: “In response, the marketplace encouraged buyers to additionally give private feedback. This private feedback was substantially more candid and more predictive of future”

For example, the marketplace can give a simple NPS https://en.wikipedia.org/wiki/Net_Promoter yes/no survey (“would you recommend our company/product/service to a friend or colleague?”), and then present the aggregated NPS to the user, such as “85% of people would”; that will help to address discrimination even when names/photos are presented.

jemalloc vs tcmalloc vs dlmalloc

“jemalloc” seems to be the newer allocator for C/C++ applications; it is definitely more sophisticated than eg tcmalloc: https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919

My application has an issue with process memory size increasing over time (a week) with tcmalloc, similar to what this page https://www.quora.com/Is-tcmalloc-stable-enough-for-production-use claims:

What ultimately forced us off TCMalloc in production was not stability, but its inconsistent memory economy: a heap containing a given number of allocated bytes would spread across an ever larger physical footprint over time. Multiple TCMalloc back-ends (search, ads, feed, etc.) would suffer 512MB-1GB of heap bloat per day, which meant it would be hitting swap within a few days.

So I tried jemalloc, compared it with tcmalloc, also dlmalloc. Here are my observations for jemalloc:

  • The good: my process RSS (resident set size) is noticeably smaller than with tcmalloc (during the two days of testing); this is consistent with jemalloc’s claim to minimize the active page set
  • The neutral: CPU% (down from 15% with dlmalloc to 2% per process) is as low as with tcmalloc (from 15% down to 1% per process), so both jemalloc and tcmalloc minimize lock contention
  • The bad: with jemalloc, the process total size (swap) runs away; it hits the swap limit within two days, and I am not able to find a way to contain the swap growth

My application is a unique beast: it runs many short-lived threads, creating/destroying tens of worker threads every minute; the threads do not use or share object pools or memory buffers; every thread does its own malloc/new calls and pushes the allocated objects into STL containers with boost shared-pointer wrappers. My guess is that the short-lived threads (or shared pointers) throw off jemalloc, so the swap size is unreleased/uncontained. I tried the jemalloc config “narenas” (number of arenas) with these values: the default (4*nCPUs = 96 in my case), 12, and 4; none of them helped, and I was not able to find other meaningful configs to cope with the problem.

Still, jemalloc is more sophisticated than tcmalloc; it has more community activity and more configs/hooks/stats than tcmalloc (http://www.canonware.com/download/jemalloc/jemalloc-latest/doc/jemalloc.html#mallctl_namespace vs http://google-perftools.googlecode.com/svn/trunk/doc/tcmalloc.html#Garbage_Collection). On the other hand, tcmalloc is more lightweight and works great out of the box.

So, in the case of a typical server or daemon that uses long-lived worker threads, jemalloc is a good choice; in that case, I believe jemalloc can be tuned to maintain a smaller RSS/swap than tcmalloc. It can even improve MySQL performance: https://www.percona.com/blog/2013/03/08/mysql-performance-impact-of-memory-allocators-part-2/

But for me, I am staying with tcmalloc (and weekly restarts).

Here are the mem and cpu graphs for each of jemalloc, tcmalloc, dlmalloc from my tests.

jemalloc running for 22 hours: CPU is low, but total memory size (swap) is growing uncontained (note: the graph does not show RSS, which is quite small and very steady):

jemalloc-22hours

tcmalloc running for 5 days: CPU is the lowest, and total memory size is growing at a much slower pace:

tcmalloc-5days

dlmalloc running for 6 days: CPU is through the roof (note: many processes run on the same physical server), and memory growth is similar to tcmalloc (kudos to tcmalloc):

dlmalloc-6days