Why is a trend problematic? Consider an example where you want to correlate two time series that are trending with similar slopes. Because their slopes are similar, the two series will appear to be correlated, but in reality they may be trending for entirely different reasons. To tell whether the two time series are actually correlated, you first need to remove the trends and then perform the correlation on the detrended data.
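
As a rough illustration of this effect, the Python sketch below builds two made-up series that both rise steadily but whose month-to-month fluctuations are unrelated. The data, the helper names (`fit_line`, `detrend`, `pearson`) and the choice of Python are assumptions for illustration only; none of this is part of Solr.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def detrend(y):
    """Subtract the fitted linear trend over the time sequence 1..n."""
    t = list(range(1, len(y) + 1))
    slope, intercept = fit_line(t, y)
    return [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: 12 months, two different upward slopes,
# with small fluctuations that have nothing to do with each other.
months = range(1, 13)
wiggle_a = [1, -1] * 6
wiggle_b = [1, 1, -1, -1] * 3
series_a = [10 * t + w for t, w in zip(months, wiggle_a)]
series_b = [5 * t + w for t, w in zip(months, wiggle_b)]

raw_corr = pearson(series_a, series_b)
detrended_corr = pearson(detrend(series_a), detrend(series_b))

print("correlation of raw series:      ", round(raw_corr, 3))
print("correlation of detrended series:", round(detrended_corr, 3))
```

With this made-up data the raw series look almost perfectly correlated, while the detrended series show a correlation close to zero.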

### Linear Regression

Linear regression is a statistical tool used to measure the linear relationship between two variables. For example, you could use linear regression to determine if there is a linear relationship between **age** and **medical costs**. If a linear relationship is found, you can use linear regression to predict the value of a dependent variable based on the value of an independent variable.
Linear regression can also be used to remove a linear trend from a time series.
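
To make the idea of fitting and predicting concrete, here is a minimal Python sketch of simple linear regression using the age and medical costs example. The numbers are invented for illustration, and `fit_line` and `predict` are hypothetical helper names, not Solr functions.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x_value):
    """Predict the dependent variable for one value of the independent variable."""
    return intercept + slope * x_value


# Hypothetical observations: age is the independent variable,
# yearly medical cost is the dependent variable.
ages = [25, 35, 45, 55, 65]
costs = [1200, 1800, 2300, 2900, 3300]

slope, intercept = fit_line(ages, costs)
predicted_cost_at_50 = predict(slope, intercept, 50)

print("slope:", slope)
print("intercept:", intercept)
print("predicted cost at age 50:", predicted_cost_at_50)
```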

### Removing a Linear Trend from a Time Series

We can remove a linear trend from a time series using the following technique:

- Regress the **dependent** variable over a **time sequence**. For example, if we have 12 months of time series observations, the time sequence would be expressed as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
- Use the regression analysis to predict a dependent value at each time interval. Then subtract the **prediction** from the **actual** value. The difference between the actual and predicted value is known as the **residual**. The residuals array is the time series with the trend removed. You can now perform statistical analysis on the residuals.
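
The two steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic under assumed data: the `usage` values are made up, and the variable names are not taken from Solr.

```python
# Hypothetical monthly observations (the dependent variable).
usage = [100, 112, 119, 135, 138, 151, 165, 170, 178, 195, 199, 214]

# Step 1: regress the dependent variable over the time sequence 1..12.
time_seq = list(range(1, len(usage) + 1))
n = len(usage)
mean_t = sum(time_seq) / n
mean_u = sum(usage) / n
slope = (
    sum((t - mean_t) * (u - mean_u) for t, u in zip(time_seq, usage))
    / sum((t - mean_t) ** 2 for t in time_seq)
)
intercept = mean_u - slope * mean_t

# Step 2: predict a value at each time interval and subtract it
# from the actual value. The differences are the residuals.
predicted = [intercept + slope * t for t in time_seq]
residuals = [actual - pred for actual, pred in zip(usage, predicted)]

print("fitted trend per month:", round(slope, 3))
print("residuals:", [round(r, 2) for r in residuals])
```

A quick sanity check on any least squares detrending is that the residuals sum to approximately zero and no longer have a linear relationship with the time sequence.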

This sounds complicated, but an example will make it clearer, and Solr makes it all very easy to do.

### Example: Exploring the linear relationship between **marketing spend** and **site usage**

In this example we want to explore the linear relationship between **marketing spend** and **website usage**. The motivation for this is to determine if higher marketing spend causes higher website usage.
Website usage has been trending upwards for over a year. We have been varying the marketing spend throughout the year to experiment with how different levels of marketing spend impact website usage.

Now we want to regress the marketing spend and the website usage to build a simple model of how usage is impacted by marketing spend. But before we can build this model, we must remove the trend from the website usage; otherwise the cumulative effect of the trend will mask the relationship between marketing spend and website usage.

Here is the streaming expression:

    let(a=timeseries(logs,
                     q="rec_type:page_view",
                     field="rec_time",
                     start="2016-01-01T00:00:00Z",
                     end="2016-12-31T00:00:00Z",
                     gap="+1MONTH",
                     count(*)),
        b=jdbc(connection="jdbc:mysql://...",
               sql="select marketing_expense from monthly_expenses where ..."),
        c=col(a, count(*)),
        d=col(b, marketing_expense),
        e=sequence(length(c), 1, 1),
        f=regress(e, c),
        g=residuals(f, e, c),
        h=regress(d, g),
        tuple(regression=h))

Let's break down what this expression is doing:

- The *let* expression is setting the variables *a, b, c, d, e, f, g, h* and returning a single result tuple.
- Variable *a* is holding the result tuples from a *timeseries* function that is querying the logs for monthly usage counts.
- Variable *b* is holding the result tuples from a *jdbc* function which is querying an external database for monthly marketing expenses.
- Variable *c* is holding the output from a *col* function which returns the values in the *count(\*)* field from the tuples stored in variable *a*. This is an array containing the monthly usage counts.
- Variable *d* is holding the output from a *col* function which returns the values in the *marketing_expense* field from the tuples stored in variable *b*. This is an array containing the monthly marketing expenses.
- Variable *e* holds the output of the *sequence* function which returns an array of numbers the same length as the array in variable *c*. The sequence starts from 1 and has a stride of 1.
- Variable *f* holds the output of the *regress* function which returns a regression result. The regression is performed with the sequence in variable *e* as the independent variable and the monthly usage counts in variable *c* as the dependent variable.
- Variable *g* holds the output of the *residuals* function which returns the residuals from applying the regression result to the data sets in variables *e* and *c*. **The residuals are the monthly usage counts with the trend removed**.
- Variable *h* holds the output of the *regress* function which returns a regression result. The regression is being performed with the *marketing expenses* (variable *d*) as the independent variable. The *residuals* from the monthly usage regression (variable *g*) are the dependent variable. This regression result will describe the linear relationship between marketing expenses and site usage.
- The output tuple is returning the regression result.
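
For readers who want to trace the arithmetic outside of Solr, the Python sketch below mirrors variables *c* through *h*. It is an analogy only: the `regress` and `residuals` functions here are plain Python stand-ins written for this sketch, not the Solr implementations, and the data is synthetic. Usage is built as a steady trend of 10 per month plus 2 page views per unit of marketing spend, and the spend values are chosen so that they do not trend over time.

```python
def regress(x, y):
    """Stand-in for a simple linear regression: returns slope and intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def residuals(model, x, y):
    """Actual value minus the value predicted by the model, for each x."""
    return [
        yi - (model["intercept"] + model["slope"] * xi)
        for xi, yi in zip(x, y)
    ]


# d: synthetic monthly marketing expenses (stands in for the jdbc result).
d = [10, 30, 20, 40, 10, 30, 30, 10, 40, 20, 30, 10]

# c: synthetic monthly usage counts (stands in for the timeseries result),
# built as base 100 + trend of 10 per month + 2 per unit of spend.
c = [100 + 10 * t + 2 * spend for t, spend in enumerate(d, start=1)]

e = list(range(1, len(c) + 1))  # time sequence 1..12
f = regress(e, c)               # trend of usage over time
g = residuals(f, e, c)          # usage with the trend removed
h = regress(d, g)               # detrended usage regressed on spend

print("trend per month:", round(f["slope"], 6))
print("usage per unit of spend:", round(h["slope"], 6))
```

Because the synthetic spend has no trend of its own, the second regression recovers the effect of spend that was built into the data. Real marketing data will be noisier, and if spend itself trends over time, part of its effect is removed along with the usage trend.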