Tuesday, May 30, 2017

Statistical programming with Solr Streaming Expressions

In the previous blog we explored the new timeseries function and introduced the syntax for math expressions. In this blog we'll dive deeper into math expressions and explore the statistical programming functions rolling out in the next release.

Let's first learn how the statistical expressions work and then look at how we can perform statistical analysis on retrieved result sets.

Array Math


The statistical functions create, manipulate and perform math on arrays. One of the basic things that we can do is create an array with the array function:

array(2, 3, 4, 3, 6)

The array function simply returns an array of numbers. If we send the array function above to Solr's stream handler it responds with:

{ "result-set": { "docs": [ { "return-value": [ 2, 3, 4, 3, 6 ] }, { "EOF": true, "RESPONSE_TIME": 1 } ] } }

Notice that the stream handler returns a single Tuple with the return-value field pointing to the array. This is how Solr responds when given a statistical function to evaluate.
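As a minimal sketch of how a client might handle this response, the Python below builds a request URL for the stream handler and pulls the return-value out of the reply. The host, port and collection name are placeholder assumptions, and the helper names stream_url and return_value are hypothetical. The response body is the one shown above.

```python
import json
from urllib.parse import urlencode


def stream_url(base_url, collection, expr):
    """Build a GET URL for the /stream handler of one collection."""
    return f"{base_url}/{collection}/stream?{urlencode({'expr': expr})}"


def return_value(response_text):
    """Return the return-value of the first tuple that is not the EOF tuple."""
    docs = json.loads(response_text)["result-set"]["docs"]
    for doc in docs:
        if not doc.get("EOF"):
            return doc["return-value"]
    raise ValueError("no return-value tuple in response")


# Placeholder host and collection; adjust for a real installation.
url = stream_url("http://localhost:8983/solr", "collection1",
                 "array(2, 3, 4, 3, 6)")

# The response shown above, parsed without contacting a server.
sample = ('{"result-set": {"docs": [{"return-value": [2, 3, 4, 3, 6]}, '
          '{"EOF": true, "RESPONSE_TIME": 1}]}}')
values = return_value(sample)
```

The EOF tuple carries only bookkeeping fields, so the sketch skips it and returns the first real tuple's value.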

This is a new behavior for Solr. In the past the stream handler always returned streams of Tuples. Now the stream handler can directly perform mathematical functions.

Let's explore a few more of the new array math functions. We can manipulate arrays in different ways. For example we can reverse the array like this:

rev(array(2, 3, 4, 3, 6))

Solr returns the following from this expression:

{ "result-set": { "docs": [ { "return-value": [ 6, 3, 4, 3, 2 ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

We can describe the array:

describe(array(2, 3, 4, 3, 6))

{ "result-set": { "docs": [ { "return-value": { "sumsq": 74, "max": 6, "var": 2.3000000000000003, "geometricMean": 3.365865436338599, "sum": 18, "kurtosis": 1.4555765595463175, "N": 5, "min": 2, "mean": 3.6, "popVar": 1.8400000000000003, "skewness": 1.1180799331493778, "stdev": 1.5165750888103102 } }, { "EOF": true, "RESPONSE_TIME": 31 } ] } }

Now we see our first bit of statistics. The describe function provides descriptive statistics for the array.
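To make the fields in that output concrete, here is a rough Python sketch that recomputes them for the same array. It is not Solr's implementation. The skewness and kurtosis lines use the bias-corrected sample formulas, which is an assumption on my part, chosen because they reproduce the values above. The function name describe simply mirrors the expression.

```python
import math


def describe(values):
    """Descriptive statistics for a list of positive numbers (needs N >= 4)."""
    n = len(values)
    total = sum(values)
    mean = total / n
    devs = [v - mean for v in values]
    sum_sq_dev = sum(d * d for d in devs)
    var = sum_sq_dev / (n - 1)       # sample variance
    pop_var = sum_sq_dev / n         # population variance
    stdev = math.sqrt(var)
    # Bias-corrected sample skewness and kurtosis (assumed formulas).
    skewness = n / ((n - 1) * (n - 2)) * sum((d / stdev) ** 3 for d in devs)
    kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
                * sum((d / stdev) ** 4 for d in devs)
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {
        "N": n,
        "sum": total,
        "sumsq": sum(v * v for v in values),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "var": var,
        "popVar": pop_var,
        "stdev": stdev,
        # Geometric mean via logs, so every value must be positive.
        "geometricMean": math.exp(sum(math.log(v) for v in values) / n),
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


stats = describe([2, 3, 4, 3, 6])
```

Note that var divides by N - 1 while popVar divides by N, which is why the two differ in the output above.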

We can correlate arrays:

corr(array(2, 3, 4, 3, 6),
       array(-2, -3, -4, -3, -6))

This returns:

{ "result-set": { "docs": [ { "return-value": -1 }, { "EOF": true, "RESPONSE_TIME": 2 } ] } }


The corr function computes the Pearson product-moment correlation of the two arrays. In this case the arrays are perfectly negatively correlated.
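For readers who want to see the calculation itself, here is a small Python sketch of the Pearson correlation applied to the same two arrays. The function name pearson is my own. This is the textbook formula, not a copy of what Solr runs internally.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("arrays must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross products and sums of squares of the deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson([2, 3, 4, 3, 6], [-2, -3, -4, -3, -6])
```

Because the second array is the first multiplied by -1, every deviation flips sign and the result lands at -1.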

We can perform a simple regression on the arrays:

regress(array(2, 3, 4, 3, 6),
             array(-2, -3, -4, -3, -6))

{ "result-set": { "docs": [ { "return-value": { "significance": 0, "totalSumSquares": 9.2, "R": -1, "meanSquareError": 0, "intercept": 0, "slopeConfidenceInterval": 0, "regressionSumSquares": 9.2, "slope": -1, "interceptStdErr": 0, "N": 5 } }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }
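The main fields of that result can be reproduced with ordinary least squares. The Python sketch below assumes the first array is the independent variable x and the second is the dependent variable y. It covers only a subset of the fields, and the function name regress simply mirrors the expression.

```python
import math


def regress(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    regression_ss = slope * sxy                # variation explained by the line
    error_ss = max(0.0, syy - regression_ss)   # residual variation
    return {
        "N": n,
        "slope": slope,
        "intercept": intercept,
        "R": sxy / math.sqrt(sxx * syy),
        "totalSumSquares": syy,
        "regressionSumSquares": regression_ss,
        "meanSquareError": error_ss / (n - 2),
    }


fit = regress([2, 3, 4, 3, 6], [-2, -3, -4, -3, -6])
```

Since y is exactly -x here, the line fits perfectly: the regression sum of squares equals the total sum of squares and the mean square error is zero.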


All statistical functions in the initial release are backed by Apache Commons Math. The initial release includes a core group of functions that support:

  • Rank transformations
  • Histograms
  • Percentiles
  • Simple regression and predict functions
  • One way ANOVA
  • Correlation
  • Covariance
  • Descriptive statistics
  • Convolution
  • Finding the delay in signals/time series
  • Lagged regression
  • Moving averages
  • Sequence generation
  • Calculating Euclidean distance between arrays
  • Data normalization and scaling
  • Array creation and manipulation functions

Statistical functions can be applied to:

  1.  Time series result sets
  2.  Random sampling result sets
  3.  SQL result sets (Solr's Internal Parallel SQL)
  4.  JDBC result sets (External JDBC Sources)
  5.  K-Nearest Neighbor result sets
  6.  Graph Expression result sets
  7.  Search result sets
  8.  Faceted aggregation result sets
  9.  MapReduce result sets 


Array Math on Solr Result Sets


Let's now explore how we can apply statistical functions on Solr result sets. In the example below we'll correlate arrays of moving averages for two stocks:

let(stockA = sql(stocks, stmt="select closing_price from price_data where ticker='aaa' and ..."),
      stockB = sql(stocks, stmt="select closing_price from price_data where ticker='bbb' and ..."),
      pricesA = col(stockA, closing_price),
      pricesB = col(stockB, closing_price),
      movingA = movingAvg(pricesA, 30),
      movingB = movingAvg(pricesB, 30),
      tuple(correlation=corr(movingA, movingB)))

Let's break down how this expression works:

1) The let expression is setting variables and then returning a single output tuple.

2) The first two variables stockA and stockB contain result sets from sql expressions. The sql expressions return tuples with the closing prices for stock tickers aaa and bbb.

3) The next two variables pricesA and pricesB are created by the col function. The col function creates a numeric array from a list of Tuples. In this example pricesA contains the closing prices for stockA and pricesB contains the closing prices for stockB.

4) The next two variables movingA and movingB are created by the movingAvg function. In this example movingA and movingB contain arrays with the moving averages calculated from the pricesA and pricesB arrays.

5) In the final step we output a single Tuple containing the correlation of the movingA and movingB arrays. The correlation is computed using the corr function.
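The five steps above can be sketched in Python to show the data flow. The prices are made up for illustration and the window is 3 rather than 30 to keep the example small. I am also assuming that the moving average emits one value per full window, so the output is shorter than the input by window - 1 values. The function names mirror the expressions but are otherwise my own.

```python
import math


def col(tuples, field):
    """Pull one numeric column out of a list of tuples (dicts here)."""
    return [t[field] for t in tuples]


def moving_avg(values, window):
    """Simple moving average, one output value per full window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]  # slide the window by one
        averages.append(running / window)
    return averages


def corr(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical result sets standing in for the two sql expressions.
stock_a = [{"closing_price": p} for p in [10, 11, 12, 13, 14, 15]]
stock_b = [{"closing_price": p} for p in [30, 28, 26, 24, 22, 20]]

prices_a = col(stock_a, "closing_price")
prices_b = col(stock_b, "closing_price")
moving_a = moving_avg(prices_a, 3)
moving_b = moving_avg(prices_b, 3)
output = {"correlation": corr(moving_a, moving_b)}
```

One price series rises steadily while the other falls steadily, so the moving averages are perfectly negatively correlated in this toy data.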

Monday, May 1, 2017

Exploring Solr's New Time Series and Math Expressions

In Solr 6.6 the Streaming Expression library has added support for time series and math expressions. This blog will walk through an example of how to use these exciting features.


Time Series


Time series aggregations are supported through the timeseries Streaming Expression. The timeseries expression uses the JSON Facet API under the covers, so the syntax will be familiar if you've used Solr's date range syntax.

Here is the basic syntax:

timeseries(collection, 
                 field="test_dt", 
                 q="*:*",
                 start="2012-05-01T00:00:00Z",
                 end="2012-06-30T23:59:59Z",
                 gap="+1MONTH", 
                 count(*))

When sent to Solr this expression will return results that look like this:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Solr takes care of the date math and builds the time range buckets automatically. It also fills any gaps in the range with buckets whose aggregation values are zero. Any Solr query can be used to select the records.
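To illustrate what gap filling means, here is a conceptual Python sketch that counts timestamps into monthly buckets and keeps empty buckets with a count of zero. It is not how Solr computes the buckets. The timestamps are invented, and the sketch assumes the start date falls on a day of the month that exists in every month, such as the 1st.

```python
from datetime import datetime


def add_month(moment):
    """Advance a datetime by one calendar month (day must exist in all months)."""
    year = moment.year + moment.month // 12
    month = moment.month % 12 + 1
    return moment.replace(year=year, month=month)


def count_by_month(timestamps, start, end):
    """Count timestamps per monthly bucket, keeping empty buckets as zero."""
    starts = []
    current = start
    while current <= end:
        starts.append(current)
        current = add_month(current)
    counts = {s: 0 for s in starts}  # every bucket exists before counting
    for ts in timestamps:
        if start <= ts <= end:
            bucket = max(s for s in starts if s <= ts)
            counts[bucket] += 1
    return counts


# Invented records: two in May, none in June, one in July.
records = [
    datetime(2012, 5, 3, 10, 0, 0),
    datetime(2012, 5, 20, 8, 30, 0),
    datetime(2012, 7, 14, 12, 0, 0),
]
counts = count_by_month(records,
                        datetime(2012, 5, 1),
                        datetime(2012, 7, 31, 23, 59, 59))
```

The June bucket appears in the result with a count of zero even though no record falls inside it, which keeps the series evenly spaced for later math.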

The supported aggregations are: count(*), sum(field), avg(field), min(field), max(field).

The timeseries function is quite powerful on its own, but it grows in power when combined with math expressions.


Math Expressions


In Solr 6.6 the Streaming Expression library also adds math expressions. This is a larger topic than one blog can cover, but I'll hit some of the highlights by slowly building up a math expression.


Let and Get


The fun begins with the let and get expressions. let is used to assign tuple streams to variables and get is used to retrieve the stream later in the expression. Here is the most basic example:


let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      get(a))

In the example above the timeseries expression is being set to the variable a. Then the get expression is used to turn the variable a back into a stream.

The let expression allows you to set any number of variables, and assign a single Streaming Expression to run the program logic. The expression that runs the program logic has access to the variables. The basic structure of let is:

let(a=expr,
     b=expr,
     c=expr,
     expr)

The first three name/value pairs are setting variables and the final expression is the program logic that will use the variables.

If we send the let expression with the timeseries to Solr it responds with:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

This is the exact same response we would get if we sent the timeseries expression alone. That's because all we did was assign the expression to a variable and use get to stream out the results.

Implementation Note: Under the covers the let expression sets each variable by executing its expression and adding the resulting tuples to a list. It then maps the variable name to the list in memory so that the list can be retrieved by name. In other words, streams are converted to in-memory lists of tuples.


The Select Expression


The select expression has been around for a long time, but it now plays a central role in math expressions. The select expression wraps another expression and applies a list of Stream Evaluators to each tuple. Stream Evaluators perform operations on the tuples. 

The Streaming Expression library now includes a base set of numeric evaluators for performing math on tuples. Here is an example of select in action:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      get(b))

In the example above we've set a timeseries to variable a.

Then we are doing something really interesting with variable b. We are transforming the timeseries tuples stored in variable a with the select expression. 

The select expression reads all the tuples from the get(a) expression and applies the mult Stream Evaluator to each one. The mult evaluator multiplies the value in the count(*) field by -1 and assigns the result to the field negativeCount. The select expression also outputs the test_dt field from each tuple.

The transformed tuples are then assigned to variable b.

Then get(b) is used to output the transformed tuples. If you send this expression to Solr it outputs:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Implementation Note: The get expression creates new tuples when it streams tuples from a variable. So you never have to worry about side effects. In the example above variable a was unchanged when the tuples were transformed and assigned to variable b.
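The behavior described in the two implementation notes can be modeled in a few lines of Python. This is only a mental model under my own assumptions: variables are a dict that maps names to lists of tuples, tuples are plain dicts, and select builds a fresh dict for every output tuple. The helper signature is hypothetical and does not reflect Solr's Java classes.

```python
def select(tuples, passthrough, computed):
    """Return new tuples with the passthrough fields plus computed fields.

    computed maps an output field name to a function of the input tuple.
    """
    result = []
    for t in tuples:
        new_tuple = {field: t[field] for field in passthrough}
        for name, evaluator in computed.items():
            new_tuple[name] = evaluator(t)
        result.append(new_tuple)  # the input tuple is never modified
    return result


# let(a=..., b=select(get(a), ...)) modeled as a name-to-list mapping.
variables = {}
variables["a"] = [
    {"test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007},
    {"test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994},
]
variables["b"] = select(
    variables["a"],
    passthrough=["test_dt"],
    computed={"negativeCount": lambda t: -1 * t["count(*)"]},
)
```

After this runs, variables["a"] still holds the original counts and variables["b"] holds the transformed tuples, which is the no-side-effects behavior the note describes.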



The Tuple Expression


The basic data structure of Streaming Expressions is a Tuple. A Tuple is a set of name/value pairs. In the 6.6 release of Solr there is a tuple expression which allows you to create your own output tuple. Here is the sample syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      tuple(seriesA=a,
               seriesB=b))

The example above defines an output tuple with two fields, seriesA and seriesB, each of which has been assigned a variable. Remember that variables a and b are pointers to lists of tuples. This is exactly how they will be output by the tuple expression.

If you send the expression above to Solr it will respond with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ] }, { "EOF": true, "RESPONSE_TIME": 7 } ] } }

Now we have both the original time series and the transformed time series in the output.

The Col Evaluator


Lists of tuples are nice, but for performing many math operations what we need are columns of numbers. There is a special evaluator called col which can be used to pull out a column of numbers from a list of tuples.

Here is the basic syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      c=col(a, count(*)),
      d=col(b, negativeCount),
      tuple(seriesA=a,
               seriesB=b,
               columnC=c,
               columnD=d))

Now we have two new variables c and d, both pointing to a col expression. The col expression takes two parameters. The first parameter is a variable pointing to a list of tuples. The second parameter is the field to pull the column data from.

Also notice that there are two new fields in the output tuple that output the columns. If you send this expression to Solr it responds with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ], "columnC": [ 247007, 247994 ], "columnD": [ -247007, -247994 ] }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }

Now the columns appear in the output.

Performing Math on Columns


We've seen already that there are numeric Stream Evaluators that work on tuples in the select expression.

Some numeric evaluators also work on columns. An example of this is the corr evaluator which performs the Pearson product-moment correlation calculation on two columns of numbers.

Here is the sample syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      c=col(a, count(*)),
      d=col(b, negativeCount),
      tuple(seriesA=a,
               seriesB=b,
               columnC=c,
               columnD=d,
               correlation=corr(c, d)))

Notice that the tuple now has a new field called correlation with the output of the corr function set to it. If you send this to Solr it responds with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ], "columnC": [ 247007, 247994 ], "columnD": [ -247007, -247994 ], "correlation": -1 }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }


Opening the Door to the Wider World of Mathematics


The syntax described in this blog opens the door to more sophisticated mathematics. For example the corr function can be used as a building block for cross-correlation, auto-correlation and auto-regression functions. Apache Commons Math includes machine learning algorithms such as clustering and regression and data transformations such as Fourier transforms that work on columns of numbers.

In the near future the Streaming Expressions math library will include these functions and many more.
