Well, that’s a misleading title. We actually quadrupled the performance of our brand monitoring alert system that uses Elasticsearch’s Percolator, but that would have been a much longer title.
Buzzilla has two main products. The first is Webhose.io, which gives businesses worldwide access to structured data from the open web, and the second is the leading brand monitoring system in Israel.
The brand monitoring system
Although Israel is a small country, Israelis usually create complex queries that put a lot of stress on our servers (this could also be attributed to the complexities of the Hebrew language, but that’s for another post). One of the most popular features of the system is its ability to send push notifications (usually by email) when a post matches a Boolean query.
As I mentioned, we use the Elasticsearch Percolator to register our queries (about 3,500 of them) and run each post we crawl against them. We run about 1 million posts a day against those queries, and when a post matches, it is sent to the relevant client. The system is distributed and uses RabbitMQ to pull posts from our crawlers’ queue.
We made some optimizations in the past, where we didn’t run the Boolean query against a document if we knew beforehand it wouldn’t match. We did that by comparing some properties of the query and the document. For example, if the language didn’t match, there was no need to check the rest of the query.
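A minimal Python sketch of this kind of cheap pre-filter might look like the following. The `Alert` and `Post` classes and their `language` field are hypothetical names chosen for illustration; they are not the actual system's data model.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    query: str      # the Boolean query text
    language: str   # language this alert is restricted to (hypothetical field)


@dataclass
class Post:
    text: str
    language: str   # language detected for the post (hypothetical field)


def candidate_alerts(post, alerts):
    """Keep only the alerts whose cheap properties are compatible with the post."""
    return [alert for alert in alerts if alert.language == post.language]


alerts = [Alert("quick fox", "en"), Alert("quick fox", "he")]
post = Post("The quick brown fox jumps over the lazy dog", "en")

# Only the surviving alerts need the expensive Boolean query check.
survivors = candidate_alerts(post, alerts)
print([alert.language for alert in survivors])
```

Only the alerts that survive this comparison go on to the expensive Boolean match.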
The problem we faced
With our old configuration, we were able to run about 30 documents per minute against all of our queries per server (and a strong server at that). As the volume of crawled data and the number of queries grew, we began to have trouble keeping up, at times causing delays of a few hours between crawl time and alert match. We found ourselves adding more and more hardware to try to solve the problem.
What did the trick was creating a pre-percolation process that concatenates multiple posts and runs the queries against the concatenated string (you must, of course, remove the Boolean NOT clauses from the queries first; I will explain why later on). If there is no match, great: you just saved the time of checking each individual post. If there is a match, bummer: you wasted time checking the concatenated string. Fortunately, the former is much more frequent than the latter.
So now I will explain why it worked. Let’s take two phrases, or posts, as an example:
- First post: “The quick brown fox jumps over the lazy dog”
- Second post: “This is a quick example since I’m lazy”
The combined text would be: “The quick brown fox jumps over the lazy dog This is a quick example since I’m lazy”
It’s obvious that a query that didn’t match the combined text wouldn’t match any of the individual posts either. So by running the query once against a long chunk of text, we didn’t need to run it against two shorter chunks of text. If, on the other hand, it did match, we would then need to run it against each post to see which post it matched. But even then we know which queries matched, so we don’t have to run all the queries again on each post.
So why is running a query against a large chunk of text faster than running it against two short chunks of text? That’s because we run the query against an index, and the index of the concatenated text is smaller than the indexes of the individual posts combined:
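The two-stage idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the real implementation: `matches` is a toy stand-in for the percolator that only understands "all of these terms must appear", and the query names are made up.

```python
import re


def tokens(text):
    """Lowercase word set of a text (toy analyzer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def matches(required_terms, text):
    """Toy stand-in for the percolator: every required term must appear."""
    return set(required_terms) <= tokens(text)


def percolate_batch(queries, posts):
    """Return {query_name: [indexes of matching posts]} using a concatenated pre-check."""
    combined = " ".join(posts)
    results = {}
    for name, terms in queries.items():
        if not matches(terms, combined):
            continue  # no post in the batch can match, so skip the per-post checks
        hits = [i for i, post in enumerate(posts) if matches(terms, post)]
        if hits:
            results[name] = hits
    return results


posts = [
    "The quick brown fox jumps over the lazy dog",
    "This is a quick example since I'm lazy",
]
queries = {
    "fox_alert": ["quick", "fox"],      # matches the first post
    "cat_alert": ["quick", "cat"],      # rejected by the pre-check alone
    "mixed_alert": ["fox", "example"],  # matches only the concatenation
}
print(percolate_batch(queries, posts))
```

Note the `mixed_alert` case: its terms are spread across both posts, so it passes the pre-check but matches neither post. The pre-check can only rule queries out; the per-post run is still what decides a real match.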
SizeOfIndex(Post A + Post B) < SizeOfIndex(Post A) + SizeOfIndex(Post B)
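One rough way to see this is to count distinct terms, assuming (as a simplification) that index size grows with the number of distinct terms. A real Lucene index also stores positions and other data, so treat this only as an intuition for the two example posts:

```python
import re


def unique_terms(text):
    """Distinct lowercase words in a text (toy analyzer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


post_a = "The quick brown fox jumps over the lazy dog"
post_b = "This is a quick example since I'm lazy"

terms_a = unique_terms(post_a)
terms_b = unique_terms(post_b)
terms_combined = unique_terms(post_a + " " + post_b)

# Terms shared by both posts are stored once in the combined index.
print(len(terms_a), len(terms_b), len(terms_combined))
print(sorted(terms_a & terms_b))
```

The inequality is strict here only because the posts share the words "quick" and "lazy". The more vocabulary the posts in a batch have in common, the bigger the saving.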
Why stop at two posts combined? Why not 100? You can, and of course should, concatenate more than two posts. But be careful: once a query matches the concatenated text, you have actually wasted resources, because you now need to run the query against each post (or do a binary search over the batch). You want to reach a balance point where the chance of not matching is still much greater than the chance of matching; at that point your system is optimized.
I mentioned earlier that you must remove the Boolean NOT clause of the query. If you don’t remove it, you might miss relevant posts. Take the query “quick -example” and run it against the concatenated text above. It won’t match, because the keyword “example” exists in the text, but it should have matched, since the first post matches the query.
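To get a feel for where that balance point might sit, here is a small back-of-the-envelope model in Python. It rests on assumptions that are not from our measurements: every post matches a given query independently with probability `p_match`, a run against the concatenated text costs the same as a run against a single post, and a matching batch is re-checked post by post rather than by binary search.

```python
def expected_checks_per_post(batch_size, p_match):
    """Expected query runs per post for one query under the simplified model."""
    p_batch_matches = 1 - (1 - p_match) ** batch_size
    # One run on the concatenated text, plus batch_size runs if it matched.
    return (1 + batch_size * p_batch_matches) / batch_size


def best_batch_size(p_match, max_size=200):
    """Batch size with the lowest expected cost per post, searched up to max_size."""
    return min(
        range(1, max_size + 1),
        key=lambda n: expected_checks_per_post(n, p_match),
    )


p = 0.01  # assumed chance that a single post matches a given query
n = best_batch_size(p)
print(n, round(expected_checks_per_post(n, p), 3))
```

In this model a batch of one is strictly worse than no batching at all (you pay for the pre-check and gain nothing), and very large batches almost always match, so the sweet spot lies in between and moves toward larger batches as matches become rarer.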
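The same example as a Python sketch, again with a toy matcher standing in for the percolator (the `required` and `excluded` lists are a hypothetical way of representing “quick -example”):

```python
import re


def tokens(text):
    """Lowercase word set of a text (toy analyzer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def matches(required, excluded, text):
    """Toy Boolean match: all required terms present, no excluded term present."""
    words = tokens(text)
    return set(required) <= words and not (set(excluded) & words)


posts = [
    "The quick brown fox jumps over the lazy dog",
    "This is a quick example since I'm lazy",
]
combined = " ".join(posts)
required, excluded = ["quick"], ["example"]  # the query "quick -example"

# With the NOT clause kept, the pre-check wrongly rejects the whole batch.
full_on_combined = matches(required, excluded, combined)

# With the NOT clause stripped, the pre-check passes...
stripped_on_combined = matches(required, [], combined)

# ...and the full query, run per post, finds the real match.
per_post = [matches(required, excluded, post) for post in posts]

print(full_on_combined, stripped_on_combined, per_post)
```

So the NOT clause is dropped only for the pre-check on the concatenated text. The per-post run still uses the full query.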
That’s it. The solution takes more memory as we are now running two percolators (pre-percolator and the actual alert percolator), but it’s 4 times faster! Hooray!