Tuesday, April 7, 2015

The Streaming API (Solrj.io): The Basics

This is the first blog in a series about the new Streaming API for SolrCloud, otherwise known as Solrj.io.

What is Solrj.io?

Solrj.io is a new Solrj package that provides the programming API for SolrCloud's new parallel computing framework.

Solrj.io is a Java API located in the org.apache.solr.client.solrj.io package.

The Base Abstractions

Solrj.io has two important base abstractions:
Tuple: A Tuple is simply a thin wrapper around a Map of key/value pairs. In its most basic form, a Tuple represents a single record from a search result set.

TupleStream: The TupleStream abstracts a search result set as a stream of Tuples. The TupleStream provides a simple interface for opening, reading and closing streams of Tuples.
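
To make the two abstractions concrete, here is a minimal sketch of the idea in plain Java. The names SimpleTuple, SimpleTupleStream and ListStream are hypothetical stand-ins invented for this illustration; they are not the real Solrj.io classes, and the real classes may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins; NOT the real Solrj.io classes.
class TupleSketch {

    // A thin wrapper around a Map; one instance stands for one record.
    static class SimpleTuple {
        final Map<String, Object> fields;
        final boolean EOF; // true only for the marker that ends a stream

        SimpleTuple(Map<String, Object> fields, boolean eof) {
            this.fields = fields;
            this.EOF = eof;
        }

        String getString(String key) {
            Object value = fields.get(key);
            return value == null ? null : value.toString();
        }
    }

    // Open, read and close: the whole contract of a stream of tuples.
    interface SimpleTupleStream {
        void open();
        SimpleTuple read();
        void close();
    }

    // An in-memory stream, used here only to exercise the interface.
    static class ListStream implements SimpleTupleStream {
        private final List<Map<String, Object>> records;
        private Iterator<Map<String, Object>> it;

        ListStream(List<Map<String, Object>> records) {
            this.records = records;
        }

        public void open() { it = records.iterator(); }

        public SimpleTuple read() {
            if (it.hasNext()) {
                return new SimpleTuple(it.next(), false);
            }
            return new SimpleTuple(new HashMap<String, Object>(), true);
        }

        public void close() { it = null; }
    }

    // Drains a stream and collects one field from every tuple before EOF.
    static List<String> readAll(SimpleTupleStream stream, String field) {
        List<String> out = new ArrayList<>();
        stream.open();
        try {
            while (true) {
                SimpleTuple tuple = stream.read();
                if (tuple.EOF) {
                    break;
                }
                out.add(tuple.getString(field));
            }
        } finally {
            stream.close();
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new HashMap<>();
        first.put("fieldA", "x");
        Map<String, Object> second = new HashMap<>();
        second.put("fieldA", "y");
        System.out.println(readAll(new ListStream(Arrays.asList(first, second)), "fieldA"));
    }
}
```

The point of the sketch is the shape of the contract: a caller only ever needs open, read and close, and it stops reading when it sees the end-of-stream marker.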

The Base Implementations

Solrj.io has two base TupleStream implementations:
SolrStream: Abstracts a search result set from a single Solr instance as a stream of Tuples. 
CloudSolrStream: Abstracts a search result set from a SolrCloud collection as a stream of Tuples. 
CloudSolrStream is a SolrCloud smart client. You provide CloudSolrStream with a zkHost and a collection name and it will automatically pick a replica from each shard to perform the query.

CloudSolrStream queries each shard and performs a streaming merge of the results based on an internal Comparator. This streaming merge allows CloudSolrStream to merge very large result sets from the shards with very little memory.
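
To see why a streaming merge needs so little memory, here is a minimal sketch of one common way to merge already-sorted sources with a Comparator, using a priority queue. This is a generic illustration, not CloudSolrStream's actual implementation; the class StreamingMergeSketch and its methods are hypothetical names, and plain Integer iterators stand in for the per-shard result streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: a generic k-way merge, not CloudSolrStream's internals.
class StreamingMergeSketch {

    // The next unread value of one sorted source, plus the remainder of that source.
    static class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    // Lazily merges sources that are each already sorted by cmp.
    // Only one pending value per source is held in the queue at any time.
    static <T> Iterator<T> merge(List<Iterator<T>> sources, Comparator<T> cmp) {
        Comparator<Head<T>> byValue = (x, y) -> cmp.compare(x.value, y.value);
        PriorityQueue<Head<T>> heap = new PriorityQueue<>(byValue);
        for (Iterator<T> source : sources) {
            if (source.hasNext()) {
                heap.add(new Head<>(source.next(), source));
            }
        }
        return new Iterator<T>() {
            @Override
            public boolean hasNext() {
                return !heap.isEmpty();
            }

            @Override
            public T next() {
                // Take the smallest pending value, then refill from the same source.
                Head<T> head = heap.remove();
                if (head.rest.hasNext()) {
                    heap.add(new Head<>(head.rest.next(), head.rest));
                }
                return head.value;
            }
        };
    }

    // Convenience for the demo: drains the merged iterator into a list.
    static <T> List<T> mergeToList(List<Iterator<T>> sources, Comparator<T> cmp) {
        List<T> out = new ArrayList<>();
        Iterator<T> merged = merge(sources, cmp);
        while (merged.hasNext()) {
            out.add(merged.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> shards = Arrays.asList(
            Arrays.asList(1, 4, 7).iterator(),
            Arrays.asList(2, 5).iterator(),
            Arrays.asList(3, 6).iterator());
        System.out.println(mergeToList(shards, Comparator.<Integer>naturalOrder()));
    }
}
```

In this sketch the memory held by the merge grows with the number of sources, not with the size of the result set, which is the property the paragraph above describes. It only works if every source is already sorted by the same Comparator.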

A Simple Example
import org.apache.solr.client.solrj.io.*;
import java.io.IOException;
import java.util.*;

public class StreamingClient {

   public static void main(String args[]) throws IOException {
      String zkHost = args[0];
      String collection = args[1];

      Map<String, String> props = new HashMap<>();
      props.put("q", "*:*");
      props.put("qt", "/export");
      props.put("sort", "fieldA asc");
      props.put("fl", "fieldA,fieldB,fieldC");
      
      CloudSolrStream cstream = new CloudSolrStream(zkHost, 
                                                    collection, 
                                                    props);
      try {
       
        cstream.open();
        while(true) {
          
          Tuple tuple = cstream.read();
          if(tuple.EOF) {
             break;
          }

          String fieldA = tuple.getString("fieldA");
          String fieldB = tuple.getString("fieldB");
          String fieldC = tuple.getString("fieldC");
          System.out.println(fieldA + ", " + fieldB + ", " + fieldC);
        }
      
      } finally {
       cstream.close();
      }
   }
}

The example above does the following things:
  1. Creates a map with query parameters. Notice the "qt" parameter is set to "/export". This will cause the search to be handled by the /export handler, which is designed to sort and export entire result sets.
  2. Constructs a CloudSolrStream passing it the zkHost, collection and query parameters.
  3. Opens the stream and reads the Tuples until a Tuple with the EOF property set to true is encountered. This EOF Tuple signifies the end of the stream.
  4. For each Tuple, fieldA, fieldB and fieldC are read. These fields are present because they are specified in the field list (fl) query parameter.
  5. The Tuples will be returned in ascending order of fieldA. This is specified by the sort parameter.

Streamed Results

It's important to understand that the Tuples in the example are streamed: they are read one at a time rather than loaded into memory as a complete result set. This means that the example can handle result sets with millions of documents without running into memory issues.

TupleStream Decorators 

The code sample shows a simple iteration of Tuples. What if we want to do something more exciting with the stream of Tuples?

TupleStream Decorators can be used to perform operations on TupleStreams. A TupleStream Decorator wraps one or more TupleStreams and performs operations on the Tuples as they are read. 
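
As a rough illustration of the wrapping idea, here is a minimal sketch in plain Java. Everything in it is a hypothetical stand-in invented for this example (RecordStream, ListStream, FilterStream, CountingStream); these are not the real Solrj.io decorators. The sketch also assumes, purely for illustration, that a record containing the key "EOF" marks the end of the stream.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; NOT the real Solrj.io decorators.
class DecoratorSketch {

    // Assumption of this sketch: a record containing the key "EOF" ends the stream.
    static final Map<String, Object> EOF =
        Collections.<String, Object>singletonMap("EOF", Boolean.TRUE);

    interface RecordStream {
        void open();
        Map<String, Object> read();
        void close();
    }

    // In-memory source so the example runs without a Solr cluster.
    static class ListStream implements RecordStream {
        private final List<Map<String, Object>> records;
        private Iterator<Map<String, Object>> it;

        ListStream(List<Map<String, Object>> records) { this.records = records; }

        public void open() { it = records.iterator(); }
        public Map<String, Object> read() { return it.hasNext() ? it.next() : EOF; }
        public void close() { it = null; }
    }

    // Decorator that transforms the stream: drops records whose field is not the wanted value.
    static class FilterStream implements RecordStream {
        private final RecordStream inner;
        private final String field;
        private final Object wanted;

        FilterStream(RecordStream inner, String field, Object wanted) {
            this.inner = inner;
            this.field = field;
            this.wanted = wanted;
        }

        public void open() { inner.open(); }

        public Map<String, Object> read() {
            while (true) {
                Map<String, Object> record = inner.read();
                if (record.containsKey("EOF") || wanted.equals(record.get(field))) {
                    return record;
                }
            }
        }

        public void close() { inner.close(); }
    }

    // Decorator that gathers a metric: counts records as they pass through unchanged.
    static class CountingStream implements RecordStream {
        private final RecordStream inner;
        private long count;

        CountingStream(RecordStream inner) { this.inner = inner; }

        public void open() { count = 0; inner.open(); }

        public Map<String, Object> read() {
            Map<String, Object> record = inner.read();
            if (!record.containsKey("EOF")) {
                count++;
            }
            return record;
        }

        public void close() { inner.close(); }

        long getCount() { return count; }
    }

    // Stacks the two decorators on a source and drains the result.
    static long countMatching(List<Map<String, Object>> records, String field, Object wanted) {
        CountingStream counter =
            new CountingStream(new FilterStream(new ListStream(records), field, wanted));
        counter.open();
        try {
            while (!counter.read().containsKey("EOF")) {
                // Nothing to do per record; the decorators do the work as records are read.
            }
        } finally {
            counter.close();
        }
        return counter.getCount();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = Arrays.asList(
            Collections.<String, Object>singletonMap("fieldA", "x"),
            Collections.<String, Object>singletonMap("fieldA", "y"),
            Collections.<String, Object>singletonMap("fieldA", "x"));
        System.out.println(countMatching(records, "fieldA", "x"));
    }
}
```

Because each decorator exposes the same open, read and close contract as the stream it wraps, decorators can be stacked in any order, and the caller's read loop does not change.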

In the next blog we'll see how TupleStream Decorators can be used to both transform TupleStreams and gather metrics on TupleStreams.

