Wednesday, July 12, 2017

Detrending Time Series Data With Linear Regression in Solr 7

Often when working with time series data there is a linear trend present in the data. For example if a stock price has been gradually rising over a period of months you'll see a positive slope in the time series data. This slope over time is the trend. Before performing statistical analysis on the time series data it's often necessary to remove the trend.

Why is a trend problematic? Consider an example where you want to correlate two time series that are trending on a similar slope. Because they both have a similar slope they will appear to be correlated. But in reality they may be trending for entirely different reasons. To tell if the two time series are actually correlated you would need to first remove the trends and then perform the correlation on the detrended data. 

Linear Regression 


Linear regression is a statistical tool used to measure the linear relationship between two variables. For example you could use linear regression to determine if there is a linear relationship between age and medical costs. If a linear relationship is found you can use linear regression to predict the value of a dependent variable based on the value of an independent variable.

Linear regression can also be used to remove a linear trend from a time series.

Removing a Linear Trend from a Time Series 


We can remove a linear trend from a time series using the following technique:

  1. Regress the dependent variable over a time sequence. For example if we have 12 months of time series observations the time sequence would be expressed as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
  2. Use the regression analysis to predict a dependent value at each time interval. Then subtract the prediction from the actual value. The difference between actual and predicted value is known as the residual. The residuals array is the time series with the trend removed. You can now perform statistical analysis on the residuals.
Sounds complicated, but an example will make this more clear and Solr makes this all very easy to do.

Example: Exploring the linear relationship between marketing spend and site usage.


In this example we want explore the linear relationship between marketing spend and website usage. The motivation for this is to determine if higher marketing spend causes higher website usage. 

Website usage has been trending upwards for over a year. We have been varying the marketing spend throughout the year to experiment with how different levels of marketing spend impacts website usage. 

Now we want to regress the marketing spend and the website usage to build a simple model of how usage is impacted by marketing spend. But before we can build this model we must remove the trend from the website usage or the cumulative effect of the trend will mask the relationship between marketing spend and website usage.

Here is the streaming expression:

let(a=timeseries(logs,  
                           q="rec_type:page_view",  
                           field="rec_time", 
                           start="2016-01-01T00:00:00Z", 
                           end="2016-12-31T00:00:00Z", 
                           gap="+1MONTH",  
                           count(*)),
     b=jdbc(connection="jdbc:mysql://...", 
                  sql="select marketing_expense from monthly_expenses where ..."),
     c=col(a, count(*)),
     d=col(b, marketing_expense),
     e=sequence(length(c), 1, 1),
     f=regress(e, c),
     g=residuals(f, e, c),
     h=regress(d, g),
     tuple(regression=h))     

Let's break down what this expression is doing:

  1. The let expression is setting the variables a, b, c, d, e, f, g, h and returning a single result tuple.
  2. Variable a is holding the result tuples from a timeseries function that is querying the logs for monthly usage counts. 
  3. Variable b is holding the result tuples from a jdbc function which is querying an external database for monthly marketing expenses.
  4. Variable c is holding the output from a col function which returns the values in the count(*) field from the tuples stored in variable a. This is an array containing the monthly usage counts.
  5. Variable d is holding the output from a col function which returns the values in the marketing_expense field from the tuples stored in variable bThis is an array containing the monthly marketing expenses.
  6. Variable e holds the output of the sequence function which returns an array of numbers the same length as the array in variable c. The sequence starts from 1 and has a stride of 1. 
  7. Variable f holds the output of the regress function which returns a regression result. The regression is performed with the sequence in variable e as the independent variable and monthly usage counts in variable c as the dependent variable.
  8. Variable g holds the output of the residuals function which returns the residuals from applying the regression result to the data sets in variables e and c. The residuals are the monthly usage counts with the trend removed.
  9. Variable h holds the output of the regress function which returns a regression result. The regression is being performed with the marketing expenses (variable d) as the independent variable. The residuals from the monthly usage regression (variable g) are the dependent variable.  This regression result will describe the linear relationship between marketing expenses and site usage.
  10. The output tuple is returning the regression result.
     

Sunday, July 9, 2017

One-way ANOVA and Rank Transformation with Solr's Streaming Expressions

In the previous blog we explored the use of random sampling and histograms to pick a threshold for point-wise anomaly detection. Point-wise anomaly detection is a good place to start, but alerting based on a single anomalous point may lead to false alarms. What we need is a statistical technique that can help confirm that the problem goes beyond a single point.

Spotting Differences In Sets of Data


The specific example in the last blog dealt with finding individual log records with unusually high response times. In this blog we'll be looking for sets of log records with unusually high response times.

One approach to doing this is to compare the means of response times between different sets of data. For this we'll use a statistical approach called One-way Anova.

One-way ANOVA (Analysis of Variance)


The Streaming Expression statistical library includes the anova function. The anova function is used to determine if the difference in means between two or more sample sets is statistically significant.

In the example below we'll use ANOVA to compare two samples of data:

  1. A sample taken from a known period of normal response times.
  2. A sample taken before and after the point-wise anomaly.
If the difference in means between the two sets is statistically significant we have evidence that the data around the anomalous data point is also unusual.


Accounting For Outliers


We already know that sample #2 has at least one outlier point. A few large outliers could skew the mean of a sample #2 and bias the ANOVA calculation.

In order to determine if sample set #2 as a whole has a higher mean then sample #1 we need a way to decrease the effect of outliers on the ANOVA calculation.

Rank Transformation


One approach for smoothing outliers is to first rank transform the data sets before running the ANOVA. Rank transformation transforms each value in the data to an ordinal ranking.

The Streaming Expression function library includes the rank function which performs the rank transformation.

In order to compare the data sets following the rank transform, we'll need to perform the rank transformation on both sets of data as if they were one contiguous data set. Streaming Expressions provides array manipulation functions that will allow us do this.


The Streaming Expression


In the expression below we'll perform the ANOVA:

let(a=random(logs,
                       q="rec_time:[2017-05 TO 2017-06]",
                       fq="file_name:index.html",
                       fl="response_time",
                       rows="7000"),
     b=random(logs,
                       q="rec_time:[NOW-10MINUTES TO NOW]",
                       fq="file_name:index.html",
                       fl="response_time",
                       rows="7000"),
     c=col(a, response_time),
     d=col(b, response_time),
     e=addAll(c, d),
     f=rank(e),
     g=copyOfRange(f, 0, length(c)),
     h=copyOfRange(f, length(c), length(f)),
     i=anova(g, h),
     tuple(results=i))

Let's break down what this expression is doing:

  1. The let expression is setting the variables a, b, c, d, e, f, g, h, i and returning a single response tuple.
  2. The variable a holds the tuples from a random sample of response times from a period of normal response times (sample set #1).
  3. The variable b holds the tuples from a random sample of response times before and after the anomalous data point (sample set #2).
  4. Variables c and d hold results of the col function which returns a column of numbers from a list of tuples.  Sample set #1 is in variable c. Sample set #2 is in variable d.
  5. Variable e holds the result of the addAll function which is returning a single array containing the contents of variables c and d.
  6. Variable f holds the results of the rank function which performs the rank transformation on variable e
  7. Variables g and hold the values of copyOfRange functions. The copyOfRange function is used to separate the single rank transformed array back into two data sets. Variable g holds the rank transformed values of sample set #1. Variable h holds the rank transformed values of sample set #2.
  8. Variable i holds the result of the anova function which is performing the ANOVA on variable g and h.
  9. The response tuple has a single field called results that contains the results of the ANOVA on the the rank transformed data sets.

Interpreting the ANOVA p-value


The response from the Streaming Expression above looks like this:

{ "result-set": { "docs": [ { "results": { "p-value": 0.0008137581457111631, "f-ratio": 38.4 } }, { "EOF": true, "RESPONSE_TIME": 789 } ] } }


The p-value of 0.0008 is the percentage chance that there is NOT a statistically significant difference in the means between the two sample sets.

Based on this p-value we can say with a very high level of confidence that there is a statistically significant difference in the means between the two sample sets.

Time Series Cross-correlation and Lagged Regression With Solr Streaming Expresssions

One of the more interesting capabilities in Solr's new statistical library is cross-correlation . But before diving into cross-correlat...