The Alexa Web Search web service allows developers to build customized search engines against the massive data that Alexa crawls every night. One of the features of their web service allows users to query the Alexa search index and get Million Search Results (MSR) back as output. Developers can run queries that return up to 10 million results (Baun 2011).
This application is currently in production at Amazon.com and is code-named GrepTheWeb because it can "grep" (a popular UNIX command-line utility for searching patterns) the actual web documents. GrepTheWeb allows developers to run specialized searches, such as selecting documents that contain a particular HTML tag or META tag, finding documents with particular punctuation, or searching for mathematical equations ("f(x) = Σx + W"), source code, e-mail addresses, or other patterns such as "dis-integration of life". While the functionality is impressive, for us the way it was built is even more so. In the next section, we will zoom in to see different levels of the architecture of GrepTheWeb.
Figure 1 shows a high-level depiction of the architecture. The output of the Million Search Results Service, a sorted list of links gzipped (compressed using the UNIX gzip utility) into a single file, is given to GrepTheWeb as input. GrepTheWeb takes a regular expression as a second input. Several factors could combine to make the processing take a lot of time:
Regular expressions could be complex
Datasets could be large, even hundreds of terabytes
Unknown request patterns, e.g., any number of people can access the application at any given point in time
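As a rough illustration of the two inputs described above (a gzipped list of links and a regular expression), the following Python sketch filters a list of links by matching a pattern against each document. It is a single-machine toy, not the GrepTheWeb implementation: the one-link-per-line file layout, the function names read_links and grep_documents, the fetch callback, and the in-memory FAKE_WEB documents are all assumptions made for the example.

```python
import gzip
import io
import re
from typing import BinaryIO, Callable, Iterable, Iterator

# Hypothetical stand-in for real web documents, keyed by link.
FAKE_WEB = {
    "http://example.com/a": "<html><meta name='author' content='x'></html>",
    "http://example.com/b": "Contact: someone@example.com",
    "http://example.com/c": "nothing of interest",
}


def make_input(links: Iterable[str]) -> BinaryIO:
    """Build an in-memory gzipped file with one link per line (assumed layout)."""
    buf = io.BytesIO()
    with gzip.open(buf, "wt", encoding="utf-8") as out:
        out.write("\n".join(links) + "\n")
    buf.seek(0)
    return buf


def read_links(gz_stream: BinaryIO) -> Iterator[str]:
    """Yield each non-empty line of a gzip-compressed stream as a link."""
    with gzip.open(gz_stream, "rt", encoding="utf-8") as lines:
        for line in lines:
            link = line.strip()
            if link:
                yield link


def grep_documents(
    links: Iterable[str], pattern: str, fetch: Callable[[str], str]
) -> Iterator[str]:
    """Yield the links whose document text contains a match for the pattern."""
    regex = re.compile(pattern)
    for link in links:
        if regex.search(fetch(link)):
            yield link


if __name__ == "__main__":
    first_input = make_input(sorted(FAKE_WEB))   # gzipped, sorted list of links
    second_input = r"[\w.+-]+@[\w-]+\.[\w.]+"    # simplistic e-mail pattern
    for hit in grep_documents(read_links(first_input), second_input, FAKE_WEB.__getitem__):
        print(hit)
```

Every document here is scanned serially in one process, so the running time grows with the complexity of the pattern, the size of the dataset, and the number of concurrent requests, which are the three factors listed above.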
Hence, the design goals of GrepTheWeb included scaling in all dimensions (more powerful pattern-matching languages, more concurrent users of common datasets, larger datasets, better result quality) while keeping the costs of processing down. To get a ...