Wednesday, January 18, 2017

Deploying Solr's New Parallel Executor

In an a recent blog we explored Solr's new parallel batch capabilities. This blog expands on parallel batch and introduces Solr's new parallel executor. Solr's parallel executor allows Streaming Expressions to be stored in a Solr Cloud collection where they can be streamed to worker nodes and executed in parallel.


The executor Function


The new executor function performs the compilation and execution of Streaming Expressions on worker nodes. The executor function has an internal thread pool that executes Streaming Expressions in parallel within a single worker node. The queue of streaming expressions can also be partitioned across a cluster of worker nodes providing a second level of parallelism.


Deploying a General Purpose Work Queue


The executor function can be used with the daemon and topic functions to deploy a general purpose work queue. An example of this expression construct is below:

daemon(id="daemon1",
               executor(threads=5,
                               topic(checkpointCollection,
                                         storedExpressions,
                                         id="topic1",
                                         initialCheckpoint=0,
                                         q="*:*",
                                         fl="id, expr_s")))

Let's break down the expression above starting with the topic function.

The topic function subscribes to a query and provides one-time delivery of documents that match the query. In the example, the topic function is subscribed to a collection of stored Streaming Expressions.

The executor function wraps the topic and for each tuple it compiles and runs the expression in the expr_s field. The executor has an internal thread pool and each expression is compiled and run in its own thread. The threads parameter controls the size of the thread pool.

The daemon function wraps the executor and calls it at intervals using an internal thread. This will cause the executor to iterate over the topic and execute all the Streaming Expressions in the work queue in batches.

The daemon function will continue to run at intervals when the queue is empty. As new tasks are indexed into the queue they will automatically be read by the topic and executed.

Prioritizing Tasks with the priority Function  


In the example above, the executor will run tasks in the order that they are emitted by the topic. Topic's emit tuples ordered by Solr's internal _version_ number. This behaves similar to a FIFO queue (but without strict FIFO enforcement). But the topic alone doesn't have any concept of task prioritization.

The priority function can be used to allow higher priority tasks to be scheduled ahead of lower priority tasks. The priority function wraps two topics. The first topic is the higher priority queue and the second topic is the lower priority queue. The priority function will only emit a lower priority task when there are no higher priority tasks in the queue.

daemon(id="daemon1",
               executor(threads=5,
                               priority(topic(checkpointCollection,
                                                      highPriorityTasks,
                                                      id="high",
                                                      initialCheckpoint=0,
                                                      q="*:*",
                                                     fl="id, expr_s"),
                                             topic(checkpointCollection,
                                                       lowPriorityTasks,
                                                       id="low",
                                                       initialCheckpoint=0,
                                                       q="*:*",
                                                      fl="id, expr_s")))


Deploying a Parallel Work Queue


The parallel function can be used to partition tasks across a worker collection. This provides parallel execution within a single worker and across a cluster of workers. The syntax for deploying a parallel work queue is below:

parallel(workerCollection,
              workers=6,
              sort="DaemonOp asc",
              daemon(id="daemon1",
                             executor(threads=5,
                                             topic(checkpointCollection,
                                                       storedExpressions,
                                                       id="topic1",
                                                       initialCheckpoint=0,
                                                       q="*:*",
                                                       fl="id, expr_s",
                                                       partitionKeys="id"))))

In the example above the parallel function sends daemons to 6 workers. Each worker executes a partition of the work queue.

Deploying Replicas to Increase Cluster Capacity


Expressions run by the executor search Solr Cloud collections to retrieve data. These searches will be spread across all of the replicas in the Solr Cloud collections. As the number of workers executing expressions increases, more replicas can be added to the Solr Cloud collections to increase the capacity of the entire system.

Time Series Cross-correlation and Lagged Regression With Solr Streaming Expresssions

One of the more interesting capabilities in Solr's new statistical library is cross-correlation . But before diving into cross-correlat...