Monday, November 27, 2017

Feature Scaling with Solr Streaming Expressions

Before performing machine learning operations it's often important to scale the feature vectors so they can be compared at the same scale. In Solr 7.2 the Streaming Expression statistical function library provides a rich set of feature scaling functions that work on both vectors and matrices.

This blog will describe the different feature scaling functions and provide examples to show how they differ from each other.

Min/Max Scaling

The minMaxScale function scales a vector or matrix between a min and max value. By default it will scale between 0 and 1 if min/max values are not provided. Min/Max scaling is useful when comparing time series of different scales in machine learning algorithms such as k-means clustering.

Below is the sample code for scaling a matrix:

let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=minMaxScale(c))

The expression above creates two arrays at different scales. The arrays are then added to a matrix and scaled with the minMaxScale function.

Solr responds with the vectors of the scaled matrix:

{ "result-set": { "docs": [ { "d": [ [ 0, 0.3333333333333333, 0.6666666666666666, 1 ], [ 0, 0.3333333333333333, 0.6666666666666666, 1 ] ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Notice that once the vectors are brought to the same scale, they are identical.
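To see the math behind minMaxScale, here is a minimal Python sketch (illustrative only, not Solr code; the function name is my own). Each element is scaled with (x - min) / (max - min), which maps the vector onto the range 0 to 1:

```python
# Sketch of min/max scaling: map each vector onto [0, 1].
def min_max_scale(vector):
    lo, hi = min(vector), max(vector)
    return [(x - lo) / (hi - lo) for x in vector]

a = [20, 30, 40, 50]
b = [200, 300, 400, 500]
scaled = [min_max_scale(row) for row in [a, b]]
print(scaled)  # both rows scale to [0.0, 0.333..., 0.666..., 1.0]
```

Because the formula divides out each vector's range, the two rows land on identical values, just as in the Solr response above.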

Standardizing

Standardizing scales a vector so that it has a mean of 0 and a standard deviation of 1. Standardization can be used with machine learning algorithms, such as SVM, that perform better when the data has a normal distribution.  

Standardization does lose information about the data if the underlying vectors don't fit a normal distribution, so use it with care.

Here is an example of how to standardize a matrix:

let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=standardize(c))

Solr responds with the vectors of the standardized matrix:

{ "result-set": { "docs": [ { "d": [ [ -1.161895003862225, -0.3872983346207417, 0.3872983346207417, 1.161895003862225 ], [ -1.1618950038622249, -0.38729833462074165, 0.38729833462074165, 1.1618950038622249 ] ] }, { "EOF": true, "RESPONSE_TIME": 17 } ] } }
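The standardization formula is (x - mean) / stddev. A small Python sketch (illustrative only, not Solr code) reproduces the values above, assuming the sample standard deviation (n - 1 denominator), which matches the Solr output:

```python
# Sketch of standardization: scale a vector to mean 0, stddev 1.
import math

def standardize(vector):
    n = len(vector)
    mean = sum(vector) / n
    # Sample standard deviation (n - 1 denominator).
    std = math.sqrt(sum((x - mean) ** 2 for x in vector) / (n - 1))
    return [(x - mean) / std for x in vector]

print(standardize([20, 30, 40, 50]))
# [-1.161895..., -0.387298..., 0.387298..., 1.161895...]
```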

Unitizing

Unitizing scales vectors to a magnitude of 1. A vector with a magnitude of 1 is known as a unit vector.  Unit vectors are preferred when the vector math deals with vector direction rather than magnitude.

let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=unitize(c))

Solr responds with the vectors of the unitized matrix:

{ "result-set": { "docs": [ { "d": [ [ 0.2721655269759087, 0.40824829046386296, 0.5443310539518174, 0.6804138174397716 ], [ 0.2721655269759087, 0.4082482904638631, 0.5443310539518174, 0.6804138174397717 ] ] }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }
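Unitizing divides each element by the vector's Euclidean norm. Here is a Python sketch of the computation (illustrative only, not Solr code):

```python
# Sketch of unitizing: divide by the Euclidean norm to get magnitude 1.
import math

def unitize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

u = unitize([20, 30, 40, 50])
print(u)  # [0.272165..., 0.408248..., 0.544331..., 0.680413...]
```

Notice that the two input vectors point in the same direction, so after unitizing they are (up to rounding) the same unit vector.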

Normalized Sum

The final feature scaling function is the normalizeSum function, which scales a vector so that it sums to a specific value. By default it scales the vector so that it sums to 1. This technique is useful when you want to convert vectors of raw counts to vectors of probabilities.


Below is the sample code for applying the normalizeSum function:

let(a=array(20, 30, 40, 50),
    b=array(200, 300, 400, 500),
    c=matrix(a, b),
    d=normalizeSum(c))

Solr responds with the vectors scaled to a sum of 1:

{ "result-set": { "docs": [ { "d": [ [ 0.14285714285714285, 0.21428571428571427, 0.2857142857142857, 0.35714285714285715 ], [ 0.14285714285714285, 0.21428571428571427, 0.2857142857142857, 0.35714285714285715 ] ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }
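The computation simply divides each element by the vector's sum. A Python sketch (illustrative only, not Solr code):

```python
# Sketch of normalizeSum: divide by the total so the vector sums to 1.
def normalize_sum(vector):
    total = sum(vector)
    return [x / total for x in vector]

p = normalize_sum([20, 30, 40, 50])
print(p)  # [0.142857..., 0.214285..., 0.285714..., 0.357142...]
```

The result reads like a probability vector: for example, 20 out of a total of 140 raw counts becomes a probability of 1/7, or about 0.1428.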

Sunday, November 5, 2017

An Introduction to Markov Chains in Solr 7.2

This blog introduces Solr's new Markov Chain implementation coming in Solr 7.2.  We'll first cover how Markov Chains work and then show how they are supported through the Streaming Expression statistical library.

Markov Chain

A Markov Chain uses probabilities to model the state transitions of a process. A simple example taken from the Markov Chain wiki page will help illustrate this.

In this example we'll be modeling the state transitions of the stock market. In our model the stock market can have three possible states:
  • Bull
  • Bear
  • Stagnant
There are two important characteristics of a Markov Process:

1) The process can only be in one state at a time.

2) The probability of transitioning between states is based only on the current state. This is known as "memorylessness". 

Below is a state diagram of the probabilities of transferring between the different states.



In the diagram above, the lines show the probabilities of transitioning between the different states. For example, there is a .075 probability of transitioning from a Bull market to a Bear market. There is a .025 probability of transitioning from a Bull market to a Stagnant market. There is a .9 probability of a Bull market transitioning to another Bull market.

The state transition probabilities in this example can be captured in a 3x3 matrix called a transition matrix. The transition matrix for this example is:
                    
                   Bull   Bear    Stagnant
Bull         |     .9     .075    .025    |
Bear         |     .15    .8      .05     |
Stagnant     |     .25    .25     .5      |

Notice each state has a row in the matrix. The values in the columns hold the transition probabilities for each state. 

For example row 0, column 0 is the probability of the Bull market transitioning to another Bull market. Row 1, column 0 is the probability of the Bear market transitioning to a Bull market.

A Markov Chain uses the transition matrix to model and simulate the transitions in the process. A code example will make this clearer.

Working with Matrices

In Solr 7.2 support for matrices has been added to the Streaming Expression statistical function library. Below is the expression for creating the example transition matrix:

let(a=array(.9, .075, .025),
    b=array(.15, .8, .05),
    c=array(.25, .25, .5),
    d=matrix(a, b, c))
  
In the expression above the rows of the matrix are created as numeric arrays and set to variables a, b and c. Then the arrays are passed to the matrix function to instantiate the matrix.

If we send this expression to Solr's /stream handler it responds with:

{ "result-set": { "docs": [ { "d": [ [ 0.9, 0.075, 0.025 ], [ 0.15, 0.8, 0.05 ], [ 0.25, 0.25, 0.5 ] ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Markov Chain Simulations

Once the transition matrix is created it's very easy to create a Markov Chain and simulate the process. Here is a sample expression:

let(a=array(.9, .075, .025),
    b=array(.15, .8, .05),
    c=array(.25, .25, .5),
    d=matrix(a, b, c),
    e=markovChain(d),
    f=sample(e, 5))

In the expression above the transition matrix is created and then passed as a parameter to the markovChain function. The markovChain function returns a Markov Chain for the specific transition matrix.

The Markov Chain can then be sampled using the sample function. In the example above 5 samples are taken from the Markov Chain. The samples represent the state that the process is in. This transition matrix has three states so each sample will either be 0 (Bull), 1 (Bear) or 2 (Stagnant).

Each time the Markov Chain is sampled it returns the next state of the process based on the transition probabilities of its current state.

If we send this expression to the /stream handler it may respond with:

{ "result-set": { "docs": [ { "f": [ 0, 0, 0, 2, 2 ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Notice that 5 samples were returned: 0, 0, 0, 2, 2. This corresponds to three consecutive Bull states followed by two Stagnant states.
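The sampling step can be sketched in a few lines of Python (illustrative only, not Solr code; the function name, seed, and starting state are my own assumptions). At each step the next state is drawn using the probabilities in the current state's row of the transition matrix:

```python
# Sketch of Markov Chain sampling from a transition matrix.
import random

def sample_chain(transition, steps, start=0, seed=42):
    rng = random.Random(seed)
    state = start
    states = []
    for _ in range(steps):
        # Draw the next state from the current state's row of probabilities.
        state = rng.choices(range(len(transition)), weights=transition[state])[0]
        states.append(state)
    return states

transition = [[.9, .075, .025],
              [.15, .8,  .05],
              [.25, .25, .5]]
print(sample_chain(transition, 5))
```

Because the draws are random, each run produces a different sequence of 0s (Bull), 1s (Bear) and 2s (Stagnant), just as Solr's sample function does.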

Finding the Long Term Average of the States

By increasing the number of samples we can determine how much time the process will spend in each state over the long term. Here is an example expression:

let(a=array(.9,  .075, .025),
    b=array(.15, .8,   .05),
    c=array(.25, .25,  .5),
    d=matrix(a, b, c),
    e=sample(markovChain(d), 200000),
    f=freqTable(e))

Notice that now instead of 5 samples we are taking 200,000 samples. Then we are creating a frequency table from the simulation array using the freqTable function. This will tell us the percentage of time spent in each state.

If we send this expression to the /stream handler it responds with the breakdown of the frequency table:

{ "result-set": { "docs": [ { "f": [ { "pct": 0.62636, "count": 125272, "cumFreq": 125272, "cumPct": 0.62636, "value": 0 }, { "pct": 0.310705, "count": 62141, "cumFreq": 187413, "cumPct": 0.937065, "value": 1 }, { "pct": 0.062935, "count": 12587, "cumFreq": 200000, "cumPct": 1, "value": 2 } ] }, { "EOF": true, "RESPONSE_TIME": 56 } ] } }

Notice in the response above that there are three tuples returned in the frequency table, one for each state (Bull, Bear, Stagnant).

The value field in each tuple is the numeric mapping of the state (0=Bull, 1=Bear, 2=Stagnant).

The pct field in each tuple is the percentage of time the value appears in the sample set. We can see that the process is in a Bull market 62.6% of the time, Bear market 31% and Stagnant 6.3% of the time.
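The simulated percentages are close to the chain's exact long-term behavior, its stationary distribution, which can be computed directly by repeatedly multiplying a state vector by the transition matrix until it stops changing. A Python sketch (illustrative only, not Solr code):

```python
# Sketch of finding the stationary distribution by power iteration:
# repeatedly apply pi = pi * P until pi converges.
transition = [[.9, .075, .025],
              [.15, .8,  .05],
              [.25, .25, .5]]

pi = [1.0, 0.0, 0.0]  # start entirely in the Bull state
for _ in range(1000):
    pi = [sum(pi[i] * transition[i][j] for i in range(3)) for j in range(3)]

print(pi)  # approximately [0.625, 0.3125, 0.0625]
```

The exact long-term proportions are 62.5% Bull, 31.25% Bear and 6.25% Stagnant, which agrees closely with the 62.6%, 31% and 6.3% observed in the 200,000-sample simulation.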

Solr temporal graph queries for event correlation, root cause analysis and temporal anomaly detection

Temporal graph queries will be available in the 8.9 release of Apache Solr. Temporal graph queries are designed for key log analytics use ...