forked from apache/systemds
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathGLM.tex
431 lines (394 loc) · 21.5 KB
/
GLM.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Generalized Linear Models (GLM)}
\label{sec:GLM}
\noindent{\bf Description}
\smallskip
Generalized Linear Models~\cite{Gill2000:GLM,McCullagh1989:GLM,Nelder1972:GLM}
extend the methodology of linear and logistic regression to a variety of
distributions commonly assumed as noise effects in the response variable.
As before, we are given a collection
of records $(x_1, y_1)$, \ldots, $(x_n, y_n)$ where $x_i$ is a numerical vector of
explanatory (feature) variables of size~\mbox{$\dim x_i = m$}, and $y_i$ is the
response (dependent) variable observed for this vector. GLMs assume that some
linear combination of the features in~$x_i$ determines the \emph{mean}~$\mu_i$
of~$y_i$, while the observed $y_i$ is a random outcome of a noise distribution
$\Prob[y\mid \mu_i]\,$\footnote{$\Prob[y\mid \mu_i]$ is given by a density function
if $y$ is continuous.}
with that mean~$\mu_i$:
\begin{equation*}
x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}
\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
\end{equation*}
In linear regression the response mean $\mu_i$ \emph{equals} some linear combination
over~$x_i$, denoted above by~$\eta_i$.
In logistic regression with $y\in\{0, 1\}$ (Bernoulli) the mean of~$y$ is the same
as $\Prob[y=1]$ and equals $1/(1+e^{-\eta_i})$, the logistic function of~$\eta_i$.
In GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone function
called the \emph{link function}: $\eta_i = g(\mu_i)$. The unknown linear combination
parameters $\beta_j$ are assumed to be the same for all records.
The goal of the regression is to estimate the parameters~$\beta_j$ from the observed
data. Once the~$\beta_j$'s are accurately estimated, we can make predictions
about~$y$ for a new feature vector~$x$. To do so, compute $\eta$ from~$x$ and use
the inverted link function $\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of~$y$;
then use the distribution $\Prob[y\mid \mu]$ to make predictions about~$y$.
Both $g(\mu)$ and $\Prob[y\mid \mu]$ are user-provided. Our GLM script supports
a standard set of distributions and link functions, see below for details.
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}GLM.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} B=}path/file
{\tt{} fmt=}format
{\tt{} O=}path/file
{\tt{} Log=}path/file
{\tt{} dfam=}int
{\tt{} vpow=}double
{\tt{} link=}int
{\tt{} lpow=}double
{\tt{} yneg=}double
{\tt{} icpt=}int
{\tt{} reg=}double
{\tt{} tol=}double
{\tt{} disp=}double
{\tt{} moi=}int
{\tt{} mii=}int
}
\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the matrix of feature vectors; each row constitutes
an example.
\item[{\tt Y}:]
Location to read the response matrix, which may have 1 or 2 columns
\item[{\tt B}:]
Location to store the estimated regression parameters (the $\beta_j$'s), with the
intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in SystemML Language Reference for details.
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location to write certain summary statistics described in Table~\ref{table:GLM:stats},
by default it is standard output.
\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
Location to store iteration-specific variables for monitoring and debugging purposes,
see Table~\ref{table:GLM:log} for details.
\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
Distribution family code to specify $\Prob[y\mid \mu]$, see Table~\ref{table:commonGLMs}:\\
{\tt 1} = power distributions with $\Var(y) = \mu^{\alpha}$;
{\tt 2} = binomial or Bernoulli
\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
When {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
dependence of the variance of~$y$ on its mean. In particular, use:\\
{\tt 0.0} = Gaussian,
{\tt 1.0} = Poisson,
{\tt 2.0} = Gamma,
{\tt 3.0} = inverse Gaussian
\item[{\tt link}:] (default:\mbox{ }{\tt 0})
Link function code to determine the link function~$\eta = g(\mu)$:\\
{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
{\tt 1} = power functions,
{\tt 2} = logit,
{\tt 3} = probit,
{\tt 4} = cloglog,
{\tt 5} = cauchit
\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
When {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$. Common power links:\\
{\tt -2.0} = $1/\mu^2$,
{\tt -1.0} = reciprocal,
{\tt 0.0} = log,
{\tt 0.5} = sqrt,
{\tt 1.0} = identity
\item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0})
When {\tt dfam=2} and the response matrix $Y$ has 1~column,
this specifies the $y$-value used for Bernoulli ``No'' label.
All other $y$-values are treated as the ``Yes'' label.
For example, {\tt yneg=-1.0} may be used when $y\in\{-1, 1\}$;
either {\tt yneg=1.0} or {\tt yneg=2.0} may be used when $y\in\{1, 2\}$.
\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
Intercept and shifting/rescaling of the features in~$X$:\\
{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0})
L2-regularization parameter (lambda)
\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations
when the deviance changes by less than this factor; see below for details
\item[{\tt disp}:] (default:\mbox{ }{\tt 0.0})
Dispersion parameter, or {\tt 0.0} to estimate it from data
\item[{\tt moi}:] (default:\mbox{ }{\tt 200})
Maximum number of outer (Fisher scoring) iterations
\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
limit provided
\end{Description}
\begin{table}[t]\small\centerline{%
\begin{tabular}{|ll|}
\hline
Name & Meaning \\
\hline
{\tt TERMINATION\_CODE} & A positive integer indicating success/failure as follows: \\
& $1 = {}$Converged successfully;
$2 = {}$Maximum \# of iterations reached; \\
& $3 = {}$Input ({\tt X}, {\tt Y}) out of range;
$4 = {}$Distribution/link not supported \\
{\tt BETA\_MIN} & Smallest beta value (regression coefficient), excluding the intercept \\
{\tt BETA\_MIN\_INDEX} & Column index for the smallest beta value \\
{\tt BETA\_MAX} & Largest beta value (regression coefficient), excluding the intercept \\
{\tt BETA\_MAX\_INDEX} & Column index for the largest beta value \\
{\tt INTERCEPT} & Intercept value, or NaN if there is no intercept (if {\tt icpt=0}) \\
{\tt DISPERSION} & Dispersion used to scale deviance, provided in {\tt disp} input argument \\
& or estimated (same as {\tt DISPERSION\_EST}) if {\tt disp} argument is${} \leq 0$ \\
{\tt DISPERSION\_EST} & Dispersion estimated from the dataset \\
{\tt DEVIANCE\_UNSCALED} & Deviance from the saturated model, assuming dispersion${} = 1.0$ \\
{\tt DEVIANCE\_SCALED} & Deviance from the saturated model, scaled by {\tt DISPERSION} value \\
\hline
\end{tabular}}
\caption{Besides~$\beta$, GLM regression script computes a few summary statistics listed above.
They are provided in CSV format, one comma-separated name-value pair per each line.}
\label{table:GLM:stats}
\end{table}
\begin{table}[t]\small\centerline{%
\begin{tabular}{|ll|}
\hline
Name & Meaning \\
\hline
{\tt NUM\_CG\_ITERS} & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\
{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\
{\tt POINT\_STEP\_NORM} & L2-norm of iteration step from old point ($\beta$-vector) to new point \\
{\tt OBJECTIVE} & The loss function we minimize (negative partial log-likelihood) \\
{\tt OBJ\_DROP\_REAL} & Reduction in the objective during this iteration, actual value \\
{\tt OBJ\_DROP\_PRED} & Reduction in the objective predicted by a quadratic approximation \\
{\tt OBJ\_DROP\_RATIO} & Actual-to-predicted reduction ratio, used to update the trust region \\
{\tt GRADIENT\_NORM} & L2-norm of the loss function gradient (omitted if point is rejected) \\
{\tt LINEAR\_TERM\_MIN} & The minimum value of $X \pxp \beta$, used to check for overflows \\
{\tt LINEAR\_TERM\_MAX} & The maximum value of $X \pxp \beta$, used to check for overflows \\
{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\
{\tt TRUST\_DELTA} & Updated trust region size, the ``delta'' \\
\hline
\end{tabular}}
\caption{
The {\tt Log} file for GLM regression contains the above \mbox{per-}iteration
variables in CSV format, each line containing triple (Name, Iteration\#, Value) with Iteration\#
being~0 for initial values.}
\label{table:GLM:log}
\end{table}
\begin{table}[t]\hfil
\begin{tabular}{|ccccccc|}
\hline
\multicolumn{4}{|c}{INPUT PARAMETERS} & Distribution & Link & Cano- \\
{\tt dfam} & {\tt vpow} & {\tt link} & {\tt\ lpow} & family & function & nical?\\
\hline
{\tt 1} & {\tt 0.0} & {\tt 1} & {\tt -1.0} & Gaussian & inverse & \\
{\tt 1} & {\tt 0.0} & {\tt 1} & {\tt\ 0.0} & Gaussian & log & \\
{\tt 1} & {\tt 0.0} & {\tt 1} & {\tt\ 1.0} & Gaussian & identity & Yes \\
{\tt 1} & {\tt 1.0} & {\tt 1} & {\tt\ 0.0} & Poisson & log & Yes \\
{\tt 1} & {\tt 1.0} & {\tt 1} & {\tt\ 0.5} & Poisson & sq.root & \\
{\tt 1} & {\tt 1.0} & {\tt 1} & {\tt\ 1.0} & Poisson & identity & \\
{\tt 1} & {\tt 2.0} & {\tt 1} & {\tt -1.0} & Gamma & inverse & Yes \\
{\tt 1} & {\tt 2.0} & {\tt 1} & {\tt\ 0.0} & Gamma & log & \\
{\tt 1} & {\tt 2.0} & {\tt 1} & {\tt\ 1.0} & Gamma & identity & \\
{\tt 1} & {\tt 3.0} & {\tt 1} & {\tt -2.0} & Inverse Gauss & $1/\mu^2$ & Yes \\
{\tt 1} & {\tt 3.0} & {\tt 1} & {\tt -1.0} & Inverse Gauss & inverse & \\
{\tt 1} & {\tt 3.0} & {\tt 1} & {\tt\ 0.0} & Inverse Gauss & log & \\
{\tt 1} & {\tt 3.0} & {\tt 1} & {\tt\ 1.0} & Inverse Gauss & identity & \\
\hline
{\tt 2} & {\tt *} & {\tt 1} & {\tt\ 0.0} & Binomial & log & \\
{\tt 2} & {\tt *} & {\tt 1} & {\tt\ 0.5} & Binomial & sq.root & \\
{\tt 2} & {\tt *} & {\tt 2} & {\tt\ *} & Binomial & logit & Yes \\
{\tt 2} & {\tt *} & {\tt 3} & {\tt\ *} & Binomial & probit & \\
{\tt 2} & {\tt *} & {\tt 4} & {\tt\ *} & Binomial & cloglog & \\
{\tt 2} & {\tt *} & {\tt 5} & {\tt\ *} & Binomial & cauchit & \\
\hline
\end{tabular}\hfil
\caption{Common GLM distribution families and link functions.
(Here ``{\tt *}'' stands for ``any value.'')}
\label{table:commonGLMs}
\end{table}
\noindent{\bf Details}
\smallskip
In GLM, the noise distribution $\Prob[y\mid \mu]$ of the response variable~$y$
given its mean~$\mu$ is restricted to have the \emph{exponential family} form
\begin{equation}
Y \sim\, \Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a}
+ c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = \E(Y) = b'(\theta).
\label{eqn:GLM}
\end{equation}
Changing the mean in such a distribution simply multiplies all \mbox{$\Prob[y\mid \mu]$}
by~$e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so that they again integrate to~1.
Parameter $\theta$ is called \emph{canonical}, and the function $\theta = b'^{\,-1}(\mu)$
that relates it to the mean is called the~\emph{canonical link}; constant~$a$ is called
\emph{dispersion} and rescales the variance of~$y$. Many common distributions can be put
into this form, see Table~\ref{table:commonGLMs}. The canonical parameter~$\theta$
is often chosen to coincide with~$\eta$, the linear combination of the regression features;
other choices for~$\eta$ are possible too.
Rather than specifying the canonical link, GLM distributions are commonly defined
by their variance $\Var(y)$ as the function of the mean~$\mu$. It can be shown
from Eq.~(\ref{eqn:GLM}) that $\Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$.
For example, for the Bernoulli distribution $\Var(y) = \mu(1-\mu)$, for the Poisson
distribution \mbox{$\Var(y) = \mu$}, and for the Gaussian distribution
$\Var(y) = a\cdot 1 = \sigma^2$.
It turns out that for many common distributions $\Var(y) = a\mu^q$, a power function.
We support all distributions where $\Var(y) = a\mu^q$, as well as the Bernoulli and
the binomial distributions.
For distributions with $\Var(y) = a\mu^q$ the canonical link is also a power function,
namely $\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose canonical link is
$\theta = \log\mu$. We support all power link functions in the form $\eta = \mu^s$,
dropping any constant factor, with $\eta = \log\mu$ for $s=0$. The binomial distribution
has its own family of link functions, which includes logit (the canonical link),
probit, cloglog, and cauchit (see Table~\ref{table:binomial_links}); we support these
only for the binomial and Bernoulli distributions. Links and distributions are specified
via four input parameters: {\tt dfam}, {\tt vpow}, {\tt link}, and {\tt lpow} (see
Table~\ref{table:commonGLMs}).
\begin{table}[t]\hfil
\begin{tabular}{|cc|cc|}
\hline
Name & Link function & Name & Link function \\
\hline
Logit & $\displaystyle \eta = 1 / \big(1 + e^{-\mu}\big)^{\mathstrut}$ &
Cloglog & $\displaystyle \eta = \log \big(\!- \log(1 - \mu)\big)^{\mathstrut}$ \\
Probit & $\displaystyle \mu = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut}
\!\!\!\!\! e^{-\frac{t^2}{2}} dt$ &
Cauchit & $\displaystyle \eta = \tan\pi(\mu - 1/2)$ \\
\hline
\end{tabular}\hfil
\caption{The supported non-power link functions for the Bernoulli and the binomial
distributions. (Here $\mu$~is the Bernoulli mean.)}
\label{table:binomial_links}
\end{table}
The observed response values are provided to the regression script as matrix~$Y$
having 1 or 2 columns. If a power distribution family is selected ({\tt dfam=1}),
matrix $Y$ must have 1~column that provides $y_i$ for each~$x_i$ in the corresponding
row of matrix~$X$. When {\tt dfam=2} and $Y$ has 1~column, we assume the Bernoulli
distribution for $y_i\in\{y_{\mathrm{neg}}, y_{\mathrm{pos}}\}$ with $y_{\mathrm{neg}}$
from the input parameter {\tt yneg} and with $y_{\mathrm{pos}} \neq y_{\mathrm{neg}}$.
When {\tt dfam=2} and $Y$ has 2~columns, we assume the
binomial distribution; for each row~$i$ in~$X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide
the positive and the negative binomial counts respectively. Internally we convert
the 1-column Bernoulli into the 2-column binomial with 0-versus-1 counts.
We estimate the regression parameters via L2-regularized negative log-likelihood
minimization:
\begin{equation*}
f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big)
\,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min
\end{equation*}
where $\theta_i$ and $b(\theta_i)$ are from~(\ref{eqn:GLM}); note that $a$
and $c(y, a)$ are constant w.r.t.~$\beta$ and can be ignored here.
The canonical parameter $\theta_i$ depends on both $\beta$ and~$x_i$:
\begin{equation*}
\theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\,
\big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right)
\end{equation*}
The user-provided (via {\tt reg}) regularization coefficient $\lambda\geq 0$ can be used
to mitigate overfitting and degeneracy in the data. Note that the intercept is never
regularized.
Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring approximation
to the difference $\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$,
recomputed at each iteration:
\begin{gather*}
\varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z,
\,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!\diag(w) X \,+\, \lambda I\\
\textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta,
\,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\
\forall\,i = 1\ldots n: \,\,\,\,
w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1}
\!\!\!\!\!\!,\,\,\,\,\,\,\,\,\,
u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1}
\!\!\!\!\!\!.\,\,\,\,
\end{gather*}
Here $v(\mu_i)=\Var(y_i)/a$, the variance of $y_i$ as the function of the mean, and
$g'(\mu_i) = d \eta_i/d \mu_i$ is the link function derivative. The Fisher scoring
approximation is minimized by trust-region conjugate gradient iterations (called the
\emph{inner} iterations, with the Fisher scoring iterations as the \emph{outer}
iterations), which approximately solve the following problem:
\begin{equation*}
1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\,
\|z\|_2 \leq \delta
\end{equation*}
The conjugate gradient algorithm closely follows Algorithm~7.2 on page~171
of~\cite{Nocedal2006:Optimization}.
The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$
and updated as described in~\cite{Nocedal2006:Optimization}.
The user can specify the maximum number of the outer and the inner iterations with
input parameters {\tt moi} and {\tt mii}, respectively. The Fisher scoring algorithm
terminates successfully if $2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}\eps$
where $\eps > 0$ is a tolerance supplied by the user via {\tt tol}, and $D_1(\beta)$ is
the unit-dispersion deviance estimated as
\begin{equation*}
D_1(\beta) \,\,=\,\, 2 \cdot \big(\Prob[Y \mid \!
\begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1]
\,\,-\,\,\Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)
\end{equation*}
The deviance estimate is also produced as part of the output. Once the Fisher scoring
algorithm terminates, if requested by the user, we estimate the dispersion~$a$ from
Eq.~\ref{eqn:GLM} using Pearson residuals
\begin{equation}
\hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)}
\label{eqn:dispersion}
\end{equation}
and use it to adjust our deviance estimate: $D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$.
If input argument {\tt disp} is {\tt 0.0} we estimate $\hat{a}$, otherwise we use its
value as~$a$. Note that in~(\ref{eqn:dispersion}) $m$~counts the intercept
($m \leftarrow m+1$) if it is present.
\smallskip
\noindent{\bf Returns}
\smallskip
The estimated regression parameters (the $\hat{\beta}_j$'s) are populated into
a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
input argument. What this matrix contains, and its size, depends on the input
argument {\tt icpt}, which specifies the user's intercept and rescaling choice:
\begin{Description}
\item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with
$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$.
\item[{\tt icpt=1}:] There is intercept, but no shifting/rescaling of~$X$; matrix~$B$
has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$,
and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient.
\item[{\tt icpt=2}:] There is intercept, and the features in~$X$ are shifted to
mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of
the~$\hat{\beta}_j$'s, one for the original features and another for the
shifted/rescaled features. Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with
$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled
features, in the above format. Note that $B[\cdot, 2]$ are iteratively estimated
and $B[\cdot, 1]$ are obtained from $B[\cdot, 2]$ by complementary shifting and
rescaling.
\end{Description}
Our script also estimates the dispersion $\hat{a}$ (or takes it from the user's input)
and the deviances $D_1(\hat{\beta})$ and $D_{\hat{a}}(\hat{\beta})$, see
Table~\ref{table:GLM:stats} for details. A log file with variables monitoring
progress through the iterations can also be made available, see Table~\ref{table:GLM:log}.
\smallskip
\noindent{\bf Examples}
\smallskip
{\hangindent=\parindent\noindent\tt
\hml -f GLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
B=/user/biadmin/B.mtx fmt=csv dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.01 tol=0.00000001
disp=1.0 moi=100 mii=10 O=/user/biadmin/stats.csv Log=/user/biadmin/log.csv
}
\smallskip
\noindent{\bf See Also}
\smallskip
In case of binary classification problems, consider using L2-SVM or binary logistic
regression; for multiclass classification, use multiclass~SVM or multinomial logistic
regression. For the special cases of linear regression and logistic regression, it
may be more efficient to use the corresponding specialized scripts instead of~GLM.