Regression

FIZ353 - Numerical Analysis | 27/11/2020

Emre S. Tasci emre.tasci@hacettepe.edu.tr

(Case from our "Least Squares Method & Error Estimation" lecture) Data:

$i$ $x_i$ $y_i$
1 10 25
2 20 70
3 30 380
4 40 550
5 50 610
6 60 1220
7 70 830
8 80 1450

Let's try to fit it to a linear model, $y = ax+b$. It doesn't matter whether we use the least-squares method or a minimizer: for this model both yield the same best answer (we'll verify this with a minimizer a bit further below).

In [1]:
import numpy as np

import seaborn as sns
sns.set_theme()

data = np.array([range(10,90,10),[25,70,380,550,610,1220,830,1450]]).T
x = data[:,0]
y = data[:,1]
print(data)
[[  10   25]
 [  20   70]
 [  30  380]
 [  40  550]
 [  50  610]
 [  60 1220]
 [  70  830]
 [  80 1450]]
In [2]:
A = np.vstack([x,np.ones(len(x))]).T
a,b = np.linalg.lstsq(A,y,rcond=None)[0]
print("a: {:.5f}\tb: {:.5f}".format(a,b))
a: 19.47024	b: -234.28571

While we are at it, let's plot it:

In [3]:
import matplotlib.pyplot as plt

xx = np.linspace(0,80,100)
yy = a*xx + b
plt.plot(xx,yy,"b-",x,y,"ko",markerfacecolor="k")
plt.show()

And here's the error (sum of the squares of the estimate residuals ($S_r$)):

$$S_r = \sum_{i}{e_i^2}=\sum_{i}\left(y_i-a_0-a_1 x_i\right)^2$$
In [4]:
t = a*x + b
e = y-t
S_r = np.sum(e**2)
print(S_r)
216118.1547619048

and here's how to do the same thing (albeit more systematically ;) using functions:

In [5]:
def fun_lin(alpha, beta, x):
    return alpha*x + beta

def err_lin(params):
    e = y - fun_lin(params[0],params[1],x)
    return np.sum(e**2)
err_ls = err_lin([a,b])
print("Least-square sum of squares error: {:10.2f}".format(err_ls))
Least-square sum of squares error:  216118.15
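
As a quick sanity check of the claim above, we can also hand err_lin to a general-purpose minimizer. A minimal sketch (starting from an arbitrary initial guess); it should land on practically the same coefficients as np.linalg.lstsq, up to numerical tolerance:

from scipy.optimize import minimize

# Minimizing the same sum-of-squares error should reproduce the
# least-squares coefficients found above (a, b), within tolerance:
res_lin = minimize(err_lin, x0=[1.0, 0.0])
a_chk, b_chk = res_lin.x
print("a: {:.5f}\tb: {:.5f}".format(a_chk, b_chk))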

Same yet different

Instead of fitting the given data to a linear model, let's once again fit them to a power model (just copy/pasting from lecture notes #3):

Example: Fit the data to the power model (Chapra, 14.6)
Data:

$i$ $x_i$ $y_i$
1 10 25
2 20 70
3 30 380
4 40 550
5 50 610
6 60 1220
7 70 830
8 80 1450

Find the optimum $\alpha$ and $\beta$ for the best fit of $y=\alpha x^\beta$ for the given data.

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data1 = pd.DataFrame({'i':np.arange(1,9),'x':np.arange(10,90,10),
                      'y':[25,70,380,550,610,1220,830,1450]})
data1.set_index('i', inplace=True)
In [7]:
data1
Out[7]:
x y
i
1 10 25
2 20 70
3 30 380
4 40 550
5 50 610
6 60 1220
7 70 830
8 80 1450
In [8]:
plt.plot(data1.x,data1.y,"o")
plt.show()

Least-squares, the proper way:

We can linearize it by taking the logarithm of both sides:

$$\log{y}=\log{\alpha} + \beta\log{x}$$

and as the least square fit for a linear model given as $y' = a_0 + a_1 x'$ is:

$$a_1 = \frac{n\sum{x_i' y_i'} - \sum{x_i'}\sum{y_i'}}{n\sum{x_i'^2}-\left(\sum{x_i'}\right)^2}$$

$$a_0 = \bar{y}' - a_1\bar{x}'$$

(For derivations, refer to FIZ219 Lecture Notes #5)

and since $x_i' = \log{x_i},\;y_i' = \log{y_i}$:

In [9]:
n = data1.shape[0]
xp = np.log(data1.x)
yp = np.log(data1.y)

a1 = (n*np.sum(xp*yp)-np.sum(xp)*np.sum(yp)) / (n*np.sum(xp**2) - np.sum(xp)**2)
a0 = np.mean(yp) - a1*np.mean(xp)

print("a0: {:7.4f}\na1: {:7.4f}".format(a0,a1))
a0: -1.2941
a1:  1.9842

as $a_0 = \log{\alpha}\rightarrow \alpha = e^{a_0}$ and $a_1 x' = \beta\log{x}\rightarrow \beta = a_1$

In [10]:
alpha = np.exp(a0)
beta = a1
print("alpha: {:7.4f}\nbeta:  {:7.4f}".format(alpha,beta))
alpha:  0.2741
beta:   1.9842
In [11]:
def fun(alpha, beta, x):
    return alpha*x**beta
In [12]:
xx = np.linspace(0,80,100);
yy = fun(alpha,beta,xx)
plt.plot(data1.x,data1.y,"or",xx,yy,"-b")
plt.show()

Minimizing the error function:

In [13]:
def fun_pow(alpha, beta, x):
    return alpha*x**beta

x = data1.x
y = data1.y
def err(params):
    e = y - fun_pow(params[0],params[1],x)
    return np.sum(e**2)

from scipy.optimize import minimize

res = minimize(err,[0.274,1.98])
print(res)
alpha2,beta2 = res.x
      fun: 222604.8484396939
 hess_inv: array([[ 8.34770117e-09, -4.45042043e-09],
       [-4.45042043e-09,  7.67927578e-09]])
      jac: array([-0.00390625,  0.05664062])
  message: 'Desired error not necessarily achieved due to precision loss.'
     nfev: 214
      nit: 33
     njev: 68
   status: 2
  success: False
        x: array([2.53844123, 1.4358492 ])
In [14]:
xx = np.linspace(0,80,100);
yy2 = fun(alpha2,beta2,xx)
plt.plot(data1.x,data1.y,"or",xx,yy2,"-k")
plt.show()

Which one is better?

In [15]:
err_ls = err([alpha,beta])
err_min = err([alpha2,beta2])
print("Least-square sum of squares error: {:10.2f}".format(err_ls))
print("   Minimizer sum of squares error: {:10.2f}".format(err_min))
Least-square sum of squares error:  345713.59
   Minimizer sum of squares error:  222604.85
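
Why does the "proper" least-squares approach lose here? Because it minimizes the error of the linearized (log-log) data, not of $y$ itself. A minimal sketch to illustrate: compare the sum of squared errors in log space, where the roles should reverse:

def err_log(params):
    # error of the linearized model: log(y) vs log(alpha) + beta*log(x)
    e = np.log(y) - np.log(fun_pow(params[0], params[1], x))
    return np.sum(e**2)

print("LS (log space):        {:8.4f}".format(err_log([alpha, beta])))
print("Minimizer (log space): {:8.4f}".format(err_log([alpha2, beta2])))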

Let's plot the two side by side:

In [16]:
xx = np.linspace(0,80,100);
yy_ls = fun(alpha,beta,xx)
yy_min = fun(alpha2,beta2,xx)
# Blue for least-squares, Black for minimizer
plt.plot(data1.x,data1.y,"or",xx,yy_ls,"-b",xx,yy_min,"-k")
plt.legend(["data","least-squares","minimizer"])
plt.show()

Really, which one is better?

In addition to the findings of our 3rd lecture, today we performed an even simpler operation, namely fitting the data to a linear model. Let's put all three together:

In [17]:
xx = np.linspace(0,80,100);
yy_ls_pow = fun(alpha,beta,xx)
yy_min_pow = fun(alpha2,beta2,xx)
yy_ls_lin = fun_lin(a,b,xx)

# Blue for least-squares, Black for minimizer
plt.plot(data1.x,data1.y,"or",xx,yy_ls_pow,"-b",\
         xx,yy_min_pow,"-k",\
         xx,yy_ls_lin,"-m")
plt.legend(["data","least-squares (power)",\
            "minimizer (power)","least-squares (linear)"])
plt.show()

and here is a table of the errors:

Method Error ($S_r$)
LS (power) 345713.59
Minimizer (power) 222604.85
LS (linear) 216118.15

So, we should take the linear least-squares fit as it yields the closest results... or should we? (Numerically it is indeed the best, as it has the lowest error.)

Now what would you say if I told you this was some kind of "force vs. velocity" data -- would you change your mind then?

Here, let's make the graph the proper way, with a title and axis labels:

In [18]:
plt.plot(data1.x,data1.y,"or",xx,yy_ls_pow,"-b",\
         xx,yy_min_pow,"-k",\
         xx,yy_ls_lin,"-m")
plt.legend(["data","least-squares (power)",\
            "minimizer (power)","least-squares (linear)"])
plt.title ("(Some kind of) Force vs. Velocity")
plt.xlabel("v (m/s)")
plt.ylabel("F (N)")
plt.show()

Even though the linear model produces a better fit, the bothersome thing is its behaviour at small velocities: the fit carries the response into negative forces, which doesn't make much sense (can you think of a case that behaves like this? Downwards for low velocities yet upwards for high velocities? Non-Newtonian fluids? Not very likely).
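
To make this concrete, here is a quick sketch (using the $a$ and $b$ found above) locating where the linear fit turns negative:

# The linear fit y = a*x + b is negative for all v below its x-intercept -b/a
print("Linear fit predicts F < 0 for v < {:.1f} m/s".format(-b/a))
print("e.g. F(5 m/s) = {:.1f} N".format(fun_lin(a, b, 5)))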

Therefore, even if it's not the best fit numerically, realizing that we are actually dealing with forces and velocities -- physical quantities, not some mathematical toy -- it hopefully makes much more sense to choose the power model over the linear model.

So our equation looks something like this:

$$F = \alpha v^\beta$$

with $(\alpha,\beta)$ being equal to:

method $\alpha$ $\beta$
LS 0.2741 1.9842
Minimizer 2.5384 1.4358

Still, are we insisting on taking the minimizer's results (because it yielded a better fit)?

In physics, power relations are usually (and, actually, interestingly) integers. The LS $\beta$ of 1.9842 looks suspiciously close to a clean 2, whereas the minimizer's 1.4358 doesn't look like... anything in particular.
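
As a side sketch (not part of the original solution), we can fix $\beta = 2$ and fit only $\alpha$; for the one-parameter model $y = \alpha x^2$ the least-squares solution is simply $\alpha = \sum{x_i^2 y_i} / \sum{x_i^4}$:

x_arr = np.asarray(data1.x, dtype=float)
y_arr = np.asarray(data1.y, dtype=float)

# Least-squares alpha for the one-parameter model y = alpha * x**2
alpha_fixed = np.sum(x_arr**2 * y_arr) / np.sum(x_arr**4)
print("alpha (beta fixed at 2): {:.4f}".format(alpha_fixed))
print("S_r (beta fixed at 2):   {:.2f}".format(np.sum((y_arr - alpha_fixed*x_arr**2)**2)))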

So, to cut a long story short, that "some kind of force" was actually the Drag Force $\vec{D}$, defined as:

$$\vec{D} = \frac{1}{2}C\rho A v^2$$

with $C$ being the drag coefficient (empirically determined), $\rho$ the density of the medium, and $A$ the effective cross-sectional area of the body.

Moral of the story: We are neither mathematicians nor computers -- we are humans and physicists! Always eyeball the model and, more importantly, use your heads! 8)

Over regression

If you have $n$ data points, you can perfectly fit a polynomial of $(n-1)$th order:

Once again, let's check our good old data:

$i$ $x_i$ $y_i$
1 10 25
2 20 70
3 30 380
4 40 550
5 50 610
6 60 1220
7 70 830
8 80 1450
In [19]:
data = np.array([range(10,90,10),[25,70,380,550,610,1220,830,1450]]).T
x = data[:,0]
y = data[:,1]

n = 7

In [20]:
p = np.polyfit(x,y,len(x)-1)
print(p)
[ 2.31051587e-07 -6.89097222e-05  8.34131944e-03 -5.27253472e-01
  1.86274861e+01 -3.63810556e+02  3.60735714e+03 -1.37900000e+04]
In [21]:
xx = np.linspace(10,80,100)
yy = np.zeros(len(xx))
n = len(x)
for k in range(n):
    yy += p[k]*xx**(n-k-1)

# we could as well have used the poly1d function
# to functionalize the polynomial 8)
f = np.poly1d(p)
print(f(x))

plt.plot(xx,yy,"-b",x,y,"ok",xx,f(xx),"-r")
plt.show()
[  25.           70.          380.          550.          610.
 1220.          830.         1449.99999999]
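
A side sketch (not from the lecture): the 7th-order polynomial is tuned to pass exactly through the eight points, so it is worth checking how it behaves just outside the measured range, where such high-order fits typically drift away from the data trend very quickly:

# Evaluate the interpolating polynomial slightly outside [10, 80];
# compare these values against the overall trend of the data.
for xe in (5, 85, 90):
    print("f({:2d}) = {:12.2f}".format(xe, f(xe)))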

Order = 6 ... 3

In [22]:
for s in np.arange(2,6):
    print("Order: {:d}".format(len(x)-s))
    p = np.polyfit(x,y,len(x)-s)
    print(p)

    xx = np.linspace(10,80,100)
    yy = np.zeros(len(xx))

    f = np.poly1d(p)
    print(f(x))

    plt.plot(xx,yy,"-b",x,y,"ok",xx,f(xx),"-r")
    plt.show()
Order: 6
[ 3.87152778e-06 -1.02071314e-03  1.05383547e-01 -5.39411531e+00
  1.42334687e+02 -1.78109710e+03  8.04437500e+03]
[  28.39306527   46.24854312  451.25437063  431.24271562  728.75728438
 1148.74562937  853.75145688 1446.60693473]
Order: 5
[ 2.45993590e-05 -5.13097319e-03  3.86779575e-01 -1.27692745e+01
  1.96127331e+02 -1.01500000e+03]
[   7.27564103  151.83566434  261.19755245  536.82983683  834.34440559
  958.68881119  959.33857809 1425.48951049]
Order: 4
[ 4.03882576e-04 -7.24084596e-02  4.38877841e+00 -8.31222493e+01
  4.92589286e+02]
[  31.875        71.00919913  320.93885281  589.54274892  781.63149351
  898.94751082 1040.16504329 1400.89015152]
Order: 3
[ 2.90404040e-04 -2.00216450e-03  1.76176046e+01 -1.92857143e+02]
[ -16.59090909  161.01731602  341.70995671  527.22943723  719.31818182
  919.71861472 1130.17316017 1352.42424242]
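
A small complementary sketch (assumption: using the same x, y arrays as above): the sum of squared residuals $S_r$ shrinks monotonically as the order increases (reaching essentially zero at order $n-1$), which is exactly why a lower $S_r$ alone does not mean a better model:

# S_r as a function of polynomial order -- it can only decrease with order.
for order in range(1, len(x)):
    p_k = np.polyfit(x, y, order)
    S_r_k = np.sum((y - np.poly1d(p_k)(x))**2)
    print("order {:d}: S_r = {:14.2f}".format(order, S_r_k))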

A real case: Infrared Spectrum

You can find the infrared data for Silica (Si-O) ("Silica.csv") in our course page's supplementary data. Let's parse it:

In [23]:
import pandas as pd
data_IR = pd.read_csv("data/Silica.csv",header=None)
data_IR.columns = ["Wavenumber (cm-1)","Absorbance"]
print(data_IR)
      Wavenumber (cm-1)  Absorbance
0                4000.0     0.03502
1                3999.0     0.03502
2                3998.0     0.03502
3                3997.0     0.03502
4                3996.0     0.03502
...                 ...         ...
3546              454.0     0.41889
3547              453.0     0.41889
3548              452.0     0.41889
3549              451.0     0.41889
3550              450.0     0.41889

[3551 rows x 2 columns]
In [24]:
import seaborn as sns
sns.set_theme()

plt1 = sns.relplot(data=data_IR,x="Wavenumber (cm-1)",\
                  y="Absorbance",kind="line")
aux = plt1.set_axis_labels("Wavenumber ($cm^{-1}$)","Absorbance")
In [25]:
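# wavenumber nu [cm^-1] -> wavelength: (1/nu) [cm] * 1e-2 [m/cm] * 1e6 [um/m]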
data_IR["Wavelength (um)"] = 1/data_IR["Wavenumber (cm-1)"]*1E-2*1E6
print(data_IR)
      Wavenumber (cm-1)  Absorbance  Wavelength (um)
0                4000.0     0.03502         2.500000
1                3999.0     0.03502         2.500625
2                3998.0     0.03502         2.501251
3                3997.0     0.03502         2.501876
4                3996.0     0.03502         2.502503
...                 ...         ...              ...
3546              454.0     0.41889        22.026432
3547              453.0     0.41889        22.075055
3548              452.0     0.41889        22.123894
3549              451.0     0.41889        22.172949
3550              450.0     0.41889        22.222222

[3551 rows x 3 columns]
In [26]:
filter1 = (data_IR.iloc[:,0] >=900) & (data_IR.iloc[:,0] <= 1500)
data_IR_filtered = data_IR[filter1]

plt1 = sns.relplot(data=data_IR_filtered,x="Wavenumber (cm-1)",\
                  y="Absorbance",kind="line")
aux = plt1.set_axis_labels("Wavenumber ($cm^{-1}$)","Absorbance")

Let's try to fit a Gaussian to it! 8)

In [27]:
def Gauss(x,A,mu,sigma):
    y = A*np.exp(-(x-mu)**2/(2*sigma**2))
    return y
In [28]:
data_IR_x = data_IR_filtered.iloc[:,0]
data_IR_y = data_IR_filtered.iloc[:,1]
In [38]:
x = data_IR_x
y_0 = Gauss(x,1.30,1150,200)
plt.plot(x,y_0,"b-",data_IR_x,data_IR_y,"r-")
plt.show()
In [30]:
y_max = np.max(data_IR_y)
i_ymax = np.argmax(data_IR_y)
print(i_ymax,y_max)
x_ymax = data_IR_x.iloc[i_ymax]
print(x_ymax,y_max)
392 1.33854
1108.0 1.33854
In [31]:
x = data_IR_x
y_1 = Gauss(x,y_max,x_ymax,100)
plt.plot(x,y_1,"b-",data_IR_x,data_IR_y,"r-")
plt.show()
$$\langle x \rangle = \frac{\sum_{i}{x_i \,p(x_i)}}{\sum_{i}{p(x_i)}}\qquad \langle \xi \rangle = \frac{\sum_{i}{\xi_i \,p(x_i)}}{\sum_{i}{p(x_i)}}$$

$$\langle x^2 \rangle = \frac{\sum_{i}{x_i^2 \,p(x_i)}}{\sum_{i}{p(x_i)}}\qquad \sigma^2 = \langle x^2\rangle-\langle x\rangle^2$$

$$\langle x \rangle = \frac{\langle\psi|x|\psi\rangle}{\langle\psi|\psi\rangle}$$
In [41]:
# treat the absorbance as an (unnormalized) distribution p(x):
N = np.sum(data_IR_y)                       # sum of p(x_i)
mu_x = np.sum(data_IR_x*data_IR_y)/N        # <x>   -> peak center
mu_x2 = np.sum(data_IR_x**2*data_IR_y)/N    # <x^2>
sigma = np.sqrt(mu_x2 - mu_x**2)            # sigma^2 = <x^2> - <x>^2
y_max_opt = y_max                           # keep the observed maximum as the amplitude
print(mu_x,sigma)

x = data_IR_x
y_2 = Gauss(x,y_max_opt,mu_x,sigma)
N2 = np.sum(y_2)       # "area" (sum) of the trial Gaussian
print(N/N2)
y_2 *= N/N2            # rescale so the areas match
plt.plot(x,y_2,"b-",data_IR_x,data_IR_y,"r-")
plt.show()
1137.1957113048513 81.50784474177878
0.8106737388165177
In [33]:
from scipy import optimize
popt,_=optimize.curve_fit(Gauss,data_IR_x,data_IR_y,p0=[y_max,x_ymax,sigma])
print(popt)
[   1.23532539 1123.47082881   69.67020882]
In [34]:
x = data_IR_x
y_3 = Gauss(x,popt[0],popt[1],popt[2])
plt.plot(x,y_3,"b-",data_IR_x,data_IR_y,"r-")
plt.show()

How good are we?

Coefficient of determination ($r^2$)

We have already met $r^2$ in our lecture on least squares: it compares the spread of the data about its average value with the residuals of the fit.

The data points' distances from the average value lead to the sum of the squares of the data residuals ($S_t$), which is defined as:

$$S_t = \sum_{i}{\left(y_i - \bar{y}\right)^2}$$

whereas the sum of the squares of the estimate residuals ($S_r$) is calculated from the difference between the model estimation and the data:

$$S_r = \sum_{i}{e_i^2}=\sum_{i}\left(y_i-t_i\right)^2$$

And here they are, visualized:

standard_errors.png (a) $S_t$, (b) $S_r$

(Source: Chapra)

Using these two quantities, the coefficient of determination ($r^2$) is calculated as:

$$r^2 = \frac{S_t-S_r}{S_t}$$

A result of 1 (hence $S_r = 0$) indicates a perfect fit; $r^2 = 0$ means we might just as well have picked the average value; and a negative $r^2$ indicates that even picking the average value would have been better than this fit!

In [35]:
def r2(y,t):
    # y: true data
    # t: model data
    mean = np.mean(y)
    S_t = np.sum((y-mean)**2)
    S_r = np.sum((y-t)**2)
    r2 = (S_t - S_r)/S_t
    return r2
In [42]:
print(r2(data_IR_y,y_0))
print(r2(data_IR_y,y_1))
print(r2(data_IR_y,y_2))
print(r2(data_IR_y,y_3))
-1.0881294117966585
0.634989743541739
0.9273085327036094
0.9702674162337274

fig4.png (Source: Cappeletti et al.)

Decompositions like the one in the figure above suggest that this band is actually a superposition of more than one peak, so let's try fitting a sum of two Gaussians:

In [37]:
def Gauss2(x,A1,mu1,sigma1,A2,mu2,sigma2):
    y = A1*np.exp(-(x-mu1)**2/(2*sigma1**2))+\
        A2*np.exp(-(x-mu2)**2/(2*sigma2**2))
    return y

popt,_=optimize.curve_fit(Gauss2,data_IR_x,data_IR_y,\
                          p0=[1.33,1100,200,0.7,1250,100])
print(popt)
y_4 = Gauss2(x,popt[0],popt[1],popt[2],popt[3],popt[4],popt[5])
plt.plot(x,y_4,"b-",data_IR_x,data_IR_y,"r-",\
        x,Gauss(x,popt[0],popt[1],popt[2]),"g:",\
        x,Gauss(x,popt[3],popt[4],popt[5]),"m:")
plt.show()
print(r2(data_IR_y,y_4))
[ 5.21359229e-01  1.20841798e+03 -4.29075410e+01  1.27314538e+00
  1.10288140e+03  4.92447274e+01]
0.9964479074386704
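
Side note (a small sketch): the fitted sigma of the first Gaussian came out negative above; since sigma only enters the model as sigma**2, its sign is irrelevant and we may simply take the absolute value:

# sigma appears only squared in Gauss2, so a negative fitted value
# is equivalent to its absolute value:
popt[2] = np.abs(popt[2])
print(popt)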