reset website

csinva committed Sep 19, 2018
1 parent 79fc0b5 commit 0af734e
Showing 20 changed files with 241 additions and 119 deletions.
2 changes: 1 addition & 1 deletion _notes/ai/decisions.md
@@ -210,7 +210,7 @@ typora-copy-images-to: ./assets/ai
- *reinforcement learning* - use observed rewards to learn optimal policy for the environment
- in ch 17, agent had model of environment (P(s'|s, a) and R(s))
- 2 problems
- *passive* - given $\pi$, learn $U^\pi (s)$
- *active* - *explore* states to find utilities and *exploit* to get highest reward
- 2 model types, 3 agent designs
- model-based: can predict next state/reward before taking action (for MDP, requires learning $P(s'|s,a)$)
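A toy sketch (mine, not from the notes; the chain MDP, policy, and constants are made up) of the *passive* problem above: TD(0) estimates $U^\pi$ from observed transitions, model-free.

```python
# toy 4-state chain MDP: states 0..3, state 3 terminal; the fixed policy always
# moves right, and TD(0) estimates U^pi(s) without ever learning P(s'|s,a)
gamma, alpha = 0.9, 0.1
U = [0.0] * 4

def step(s):
    """Made-up environment: deterministic move right; reward 1 on reaching the terminal state."""
    s_next = s + 1
    return s_next, (1.0 if s_next == 3 else 0.0)

for episode in range(1000):
    s = 0
    while s != 3:
        s_next, r = step(s)
        U[s] += alpha * (r + gamma * U[s_next] - U[s])  # TD(0): U(s) += a*(r + g*U(s') - U(s))
        s = s_next

print([round(u, 2) for u in U])  # approx [0.81, 0.9, 1.0, 0.0]
```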
2 changes: 1 addition & 1 deletion _notes/ai/logic.md
@@ -174,7 +174,7 @@ typora-copy-images-to: ./assets/logic
- forward chaining: start w/ atomic sentences + apply modus ponens until no new inferences can be made
- *first-order definite clauses* - (remember this is a type of Horn clause)
- *Datalog* - language restricted to first-order definite clauses with no function symbols
- simple forward-chaining: FOL-FC-ASK - may not terminate if not entailed
1. *pattern matching* is expensive
2. rechecks every rule
3. generates irrelevant facts
2 changes: 1 addition & 1 deletion _notes/ai/search.md
@@ -124,7 +124,7 @@ From "Artificial Intelligence" Russell & Norvig 3rd Edition
- solution to original problem still solves relaxed problem
- cost of optimal solution to a relaxed problem is an admissible heuristic for the original problem
- also is consistent
- when there are several good heuristics, pick $h(n) = \max[h_1(n), ..., h_m(n)]$ for each node
- *pattern database* - heuristic stores exact solution cost for every possible subproblem instance
- *disjoint pattern database* - break into independent possible subproblems
- can learn heuristic by solving lots of problems using useful features
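A minimal sketch (my own illustration, with the standard 8-puzzle heuristics as the example) of combining admissible heuristics via max, as in the bullet above; the function names are made up.

```python
def misplaced_tiles(state, goal):
    """Admissible heuristic: count of tiles out of place (ignores the blank, 0)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal):
    """Admissible heuristic: sum of Manhattan distances of tiles to their goal cells (3x3 board)."""
    total = 0
    for tile in range(1, 9):
        i, j = divmod(state.index(tile), 3)
        gi, gj = divmod(goal.index(tile), 3)
        total += abs(i - gi) + abs(j - gj)
    return total

def h_combined(state, goal):
    # the max of admissible heuristics is still admissible and dominates each one
    return max(misplaced_tiles(state, goal), manhattan(state, goal))

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
state = (1, 2, 3, 4, 5, 6, 0, 7, 8)
print(misplaced_tiles(state, goal), manhattan(state, goal), h_combined(state, goal))
```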
2 changes: 1 addition & 1 deletion _notes/blog/interpretability.md
@@ -7,7 +7,7 @@ category: blog


# interpretability
**chandan singh**
*last updated jul 20, 2018*

---
17 changes: 10 additions & 7 deletions _notes/math/linear_algebra.md
@@ -14,7 +14,7 @@ typora-copy-images-to: ./assets/linear_algebra

## notation

- $x \preceq y$ - these are vectors and x is less than y elementwise
- $X \preceq Y$ - matrices, $Y-X$ is PSD
- $v^TXv \leq v^TYv \:\: \forall v$

@@ -26,7 +26,7 @@ typora-copy-images-to: ./assets/linear_algebra
- gives angle back
- linear
1. superposition $f(x+y) = f(x)+f(y) $
2. proportionality $f(k\cdot x) = k \cdot f(x)$
- vector space
1. closed under addition
2. contains identity
@@ -57,6 +57,7 @@ typora-copy-images-to: ./assets/linear_algebra
- if rank(A) = n, then must use $A^T A$
- inversion of matrix is $\approx O(n^3)$
- inverse of psd symmetric matrix is also psd and symmetric
- if A, B invertible $(AB)^{-1} = B^{-1} A^{-1}$
- *orthogonal complement* - the set of all vectors orthogonal to a given subspace
- define R(A) to be *range space* of A (column space) and N(A) to be *null space* of A
- R(A) and $N(A^T)$ are orthogonal complements (similarly $R(A^T)$ and N(A))
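A quick numerical check (my own sketch, random matrix made up) that $R(A) \perp N(A^T)$: vectors in the null space of $A^T$ are orthogonal to every column of A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))             # tall matrix, R(A) is a 3-dim subspace of R^5

# null space of A^T via SVD: right-singular vectors with (numerically) zero singular values
U, s, Vt = np.linalg.svd(A.T)
rank = int(np.sum(s > 1e-10))
null_AT = Vt[rank:].T                       # columns span N(A^T), a subspace of R^5

print(np.allclose(A.T @ null_AT, 0))        # True: every column of A is orthogonal to N(A^T)
```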
@@ -147,17 +148,17 @@ typora-copy-images-to: ./assets/linear_algebra
- expressions when $A \in \mathbb{S}$
- $\det(A) = \prod_i \lambda_i$
- $\text{tr}(A) = \sum_i \lambda_i$
- $||A||_2 = \max | \lambda_i |$
- $||A||_F = \sqrt{\sum \lambda_i^2}$
- $\lambda_{max} (A) = \sup_{x \neq 0} \frac{x^T A x}{x^T x}$
- $\lambda_{min} (A) = \inf_{x \neq 0} \frac{x^T A x}{x^T x}$
- *defective matrices* - lack a full set of linearly independent eigenvectors
- *positive semi-definite*: $A \in R^{n \times n}$
- basically these are always *symmetric* $A=A^T$
- all eigenvalues are nonnegative
- if $\forall x \in R^n, x^TAx \geq 0$ then A is positive semi definite (PSD)
- like it curves up
- Note: $x^TAx = \sum_{i, j} x_iA_{i, j} x_j$
- if $\forall x \in R^n, x^TAx > 0$ then A is positive definite (PD)
- PD $\to$ full rank, invertible
- PSD + symmetric $\implies$ can be written as *Gram matrix* $G = X^T X $
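A numerical sanity check (my own sketch; the matrix is random) of the eigenvalue identities listed above, for a symmetric PSD Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
A = X.T @ X                                   # Gram matrix: symmetric PSD, 4x4
lam = np.linalg.eigvalsh(A)                   # real eigenvalues, ascending

print(np.isclose(np.linalg.det(A), np.prod(lam)))                     # det(A) = prod(lambda_i)
print(np.isclose(np.trace(A), np.sum(lam)))                           # tr(A) = sum(lambda_i)
print(np.isclose(np.linalg.norm(A, 2), np.max(np.abs(lam))))          # ||A||_2 = max |lambda_i|
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(lam**2))))  # ||A||_F = sqrt(sum lambda_i^2)

# the Rayleigh quotient is bounded by lambda_min and lambda_max
v = rng.standard_normal(4)
rq = v @ A @ v / (v @ v)
print(lam[0] - 1e-9 <= rq <= lam[-1] + 1e-9)
```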
@@ -196,6 +197,8 @@ typora-copy-images-to: ./assets/linear_algebra
- columns of V (pxp) are eigenvectors of $X^TX$
- r singular values on diagonal of $\Sigma$ (nxp) - square roots of nonzero eigenvalues of both $XX^T$ and $X^TX$
- like rotating, scaling, and rotating back
- SVD ex. $A=UDV^T \implies A^{-1} = VD^{-1} U^T$
- $X = \sum_i \sigma_i u_i v_i^T$
- properties
1. for PD matrices, $\Sigma=\Lambda$, $U\Sigma V^T = Q \Lambda Q^T$
- for other symmetric matrices, any negative eigenvalues in $\Lambda$ become positive in $\Sigma$
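A quick check (my own sketch, random square matrix) of the SVD facts above: the rank-1 sum $X = \sum_i \sigma_i u_i v_i^T$ reconstructs the matrix, and $A^{-1} = VD^{-1}U^T$ when A is invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))               # generically invertible
U, s, Vt = np.linalg.svd(A)

# X = sum_i sigma_i u_i v_i^T
recon = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(recon, A))

# A = U D V^T  =>  A^{-1} = V D^{-1} U^T  (assumes all singular values are nonzero)
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(A_inv, np.linalg.inv(A)))
```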
10 changes: 5 additions & 5 deletions _notes/ml/classification.md
@@ -36,7 +36,7 @@ category: ml
| Logistic regression | $\theta^T\theta + C \sum_i \log[1+\exp(-y_i \cdot \theta^T x_i)]$ |


- svm, perceptron use +1/-1, logistic use 1/0
- *perceptron* - tries to find separating hyperplane
- whenever misclassified, update w
- can add in delta term to maximize margin
@@ -167,7 +167,7 @@ category: ml
1. at test time, can't just store w - have to store support vectors
- ![](assets/classification/svm_margin.png)
- $\hat{y} =\begin{cases} 1 &\text{if } w^Tx +b \geq 0 \\ -1 &\text{otherwise}\end{cases}$
- $\hat{\theta} = \text{argmin} \:\frac{1}{2} \vert \vert \theta\vert \vert ^2 \\s.t. \: y^{(i)}(\theta^Tx^{(i)}+b)\geq1, i = 1,...,m$
- *functional margin* $\gamma^{(i)} = y^{(i)} (\theta^T x +b)$
- limit the size of $(\theta, b)$ so we can't arbitrarily increase functional margin
- the functional margin $\hat{\gamma}$ of a training set is the smallest functional margin over the training examples
@@ -197,7 +197,7 @@ category: ml
- scale before applying
- fill in missing values
- start with RBF
- valid kernel: kernel matrix is PSD

# generative

@@ -222,7 +222,7 @@ category: ml
## naive bayes classifier

- assume multinomial Y
- with clever tricks, can produce $P(Y^i=1|x, \eta)$ again as a softmax
- let $y_1,...y_l$ be the classes of Y
- want Posterior $P(Y\vert X) = \frac{P(X\vert Y)P(Y)}{P(X)}$
- MAP rule - maximum a posteriori rule (see the sketch below)
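A tiny numeric illustration (mine; all counts, priors, and parameters are made up) of the MAP rule with a multinomial bag-of-words class-conditional: the posterior is proportional to $P(X|Y)P(Y)$.

```python
import numpy as np

# two classes, 3-word vocabulary; multinomial class-conditionals and a class prior
log_prior = np.log(np.array([0.6, 0.4]))              # P(Y)
theta = np.array([[0.7, 0.2, 0.1],                    # P(word | Y=0)
                  [0.2, 0.3, 0.5]])                   # P(word | Y=1)
x = np.array([3, 0, 2])                               # word counts in one document

# log P(Y=k | x) = log P(Y=k) + sum_j x_j log theta_{k,j} + const (multinomial coefficient drops out)
log_post = log_prior + x @ np.log(theta).T
post = np.exp(log_post - np.max(log_post))
post /= post.sum()
print(post, "MAP class:", int(np.argmax(post)))
```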
@@ -304,6 +304,6 @@ category: ml

## multinomial

- $L(\theta) = P(x_1,...,x_n\vert \theta_1,...,\theta_p) = \prod_i^n P(d_i\vert \theta_1,...\theta_p)=\prod_i^n \text{factorials} \cdot \theta_1^{x_1} \cdots \theta_p^{x_p}$ - ignore the factorials because they are always the same
- require $\sum \theta_i = 1$
- $\implies \theta_i = \frac{\sum_{j=1}^n x_{ij}}{N}$ where N is the total number of words in all docs (see the sketch below)
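A worked sketch of the multinomial MLE above (my own toy word counts): each $\theta_i$ is just the pooled count of word i divided by the total word count.

```python
import numpy as np

docs = np.array([[2, 0, 1],            # word counts per document, vocab size 3 (made up)
                 [1, 3, 0],
                 [0, 1, 2]])
counts = docs.sum(axis=0)              # sum_j x_ij for each word i
theta_hat = counts / counts.sum()      # divide by N = total number of words in all docs
print(theta_hat)                       # [0.3, 0.4, 0.3]
```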
8 changes: 6 additions & 2 deletions _notes/ml/unsupervised.md
@@ -86,9 +86,13 @@ graph LR;
- latent variable Z has multinomial distr.
- *mixing proportions*: $P(Z^i=1|x, \xi)$
- ex. $ \frac{e^{\xi_i^Tx}}{\sum_je^{\xi_j^Tx}}$
- *mixture components*: $p(y|Z^i=1, x, \theta_i)$ ~ different choices
- ex. mixture of linear regressions
- $p(y| x, \theta) = \sum_i \underbrace{\pi_i (x, \xi)}_{\text{mixing prop.}} \cdot \underbrace{\mathcal{N}(y|\beta_i^Tx, \sigma_i^2)}_{\text{mixture comp.}}$
- ex. mixtures of logistic regressions
- $p(y|x, \theta_i) = \underbrace{\pi_i (x, \xi)}_{\text{mixing prop.}} \cdot \underbrace{\mu(\theta_i^Tx)^y\cdot[1-\mu(\theta_i^Tx)]^{1-y}}_{\text{mixture comp.}}$ where $\mu$ is the logistic function
- also, nonlinear optimization for this (including EM); see the sketch below
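A minimal sketch (mine; all parameters and inputs are made up) of evaluating the mixture-of-linear-regressions density above: softmax gating for the mixing proportions, Gaussian mixture components.

```python
import numpy as np

def mixture_density(y, x, xi, beta, sigma):
    """p(y|x) = sum_i pi_i(x, xi) * N(y | beta_i^T x, sigma_i^2) with softmax gating."""
    logits = xi @ x
    pis = np.exp(logits - logits.max())
    pis /= pis.sum()                                   # mixing proportions pi_i(x, xi)
    means = beta @ x                                   # per-component means beta_i^T x
    comps = np.exp(-(y - means) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return float(np.sum(pis * comps))

x = np.array([1.0, 2.0])                               # made-up feature vector
xi = np.array([[0.5, -0.2], [-0.3, 0.4]])              # made-up gating parameters
beta = np.array([[1.0, 0.5], [-1.0, 2.0]])             # made-up regression weights
sigma = np.array([0.5, 1.0])                           # per-component noise std
print(mixture_density(1.5, x, xi, beta, sigma))
```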

## spectral clustering

- use the spectrum (eigenvalues) of the similarity matrix of the data to perform dim. reduction before clustering in fewer dimensions
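A minimal spectral-clustering sketch (my own toy data and parameters): build an RBF similarity matrix, form the graph Laplacian, and split on the sign of its second eigenvector; for k > 2 clusters you would instead run k-means on the first k eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal([0, 0], 0.3, (20, 2)),     # two made-up blobs
                 rng.normal([2, 0], 0.3, (20, 2))])

d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 2.0)                      # similarity matrix, RBF with sigma = 1
L = np.diag(W.sum(axis=1)) - W             # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)

labels = (eigvecs[:, 1] > 0).astype(int)   # sign of the Fiedler vector gives the 2-way cut
print(labels)                              # first 20 points in one cluster, last 20 in the other
```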
82 changes: 79 additions & 3 deletions _notes/neuro/neural_comp.md
@@ -50,7 +50,9 @@ category: neuro
- neocognitron fukushima 1980
- david marr: theory, representation, implementation

# neuron models

## circuit-modelling basics

- membrane has capacitance $C_m$
- force for diffusion, force for drift
@@ -62,7 +64,7 @@ category: neuro
- $C_m \propto D$
- axial resistance $R_A \propto 1/D^2$ (not same as membrane leak), thus bigger axons actually charge faster
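A toy RC-membrane sketch (mine; the parameter values are made up for illustration): forward-Euler integration of $C_m \frac{dV}{dt} = -(V - E_L)/R_m + I$, showing charging toward $E_L + I R_m$ with time constant $\tau = R_m C_m$.

```python
C_m, R_m, E_L = 1.0, 10.0, -70.0      # nF, MOhm, mV (illustrative values)
I = 2.0                                # nA current step
dt, T = 0.1, 100.0                     # ms
V = E_L
trace = []
for _ in range(int(T / dt)):
    dVdt = (-(V - E_L) / R_m + I) / C_m
    V += dt * dVdt                     # forward-Euler step
    trace.append(V)

print(round(trace[-1], 2))             # approaches E_L + I*R_m = -50 mV; tau = R_m*C_m = 10 ms
```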

## action potentials

- channel/receptor types
- ionotropic: $G_{ion}$ = f(molecules outside)
@@ -74,7 +76,7 @@ category: neuro
- hair cell
- voltage-gated (active - provide gain; might not require active ATP, other channels are all passive)

## physics of computation

- based on carver mead: drift and diffusion are at the heart of everything
- different things related by the **Boltzmann distr.** (ex. distr of air molecules vs elevation. Subject to gravity and diffusion upwards since they're colliding)
@@ -101,3 +103,77 @@ category: neuro

# supervised learning

- see machine learning course
- NETtalk was a major breakthrough (text -> speech audio), Sejnowski & Rosenberg 1987
- people looked for world-centric receptive fields (so neurons responded to things not relative to retina but relative to body) but didn't find them
- however, they did find gain fields: (Zipser & Anderson, 1987)
- gain changes based on what retina is pointing at
- trained nn to go from pixels to head-centered coordinate frame
- yielded gain fields
- pouget et al. found this can be accounted for with 2 population vectors: one for retinal position, one for eye position, which are then added
- support vector networks (vapnik et al.) - svms were originally inspired by nns
- dendritic nonlinearities (hausser & mel 03)
- example to think about neurons doing this: $u = w_1 x_1 + w_2x_2 + w_{12}x_1x_2$
- $y=\sigma(u)$
- sometimes called a sigma-pi unit since it's a sum of products
- exponential number of params...**could be fixed w/ kernel trick?**
- could also incorporate geometry constraint...
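A sketch of the sigma-pi unit above (weights and inputs made up): the pairwise $w_{12}x_1x_2$ term is what a multiplicative dendritic nonlinearity adds on top of a plain linear unit.

```python
import numpy as np

def sigma(u):
    """Logistic squashing function."""
    return 1.0 / (1.0 + np.exp(-u))

w = np.array([0.8, -0.5])      # made-up linear weights
w12 = 1.2                      # made-up pairwise (multiplicative) weight
x = np.array([1.0, 0.5])

u = w @ x + w12 * x[0] * x[1]  # u = w1*x1 + w2*x2 + w12*x1*x2
print(sigma(u))                # ~0.76
```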

# unsupervised learning

- born w/ extremely strong priors on weights in different areas
- barlow 1961, attneave 1954: efficient coding hypothesis = redundancy reduction hypothesis
- representation: compression / usefulness
- easier to store prior probabilities (because inputs are independent)
- redlich 93: redundancy reduction for unsupervised learning (text ex. learns words from text w/out spaces)

## hebbian learning and pca

- pca can also be thought of as a tool for decorrelation (in the pc basis, coordinates are uncorrelated)
- hebbian learning = fire together, wire together: $\Delta w_{ab} \propto \langle a, b\rangle$; note: $\langle a, b\rangle$ is the correlation of a and b (average over time)
- linear hebbian learning (perceptron with linear output)
- $\dot{w}_i \propto \langle y, x_i\rangle \propto \sum_j w_j \langle x_j, x_i\rangle$ since weights change relatively slowly
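A sketch of the Hebbian/PCA connection above (my own toy data): plain Hebbian updates on a linear unit grow along the top principal component; here I use Oja's rule, a normalized Hebbian variant not covered in these notes, so the weight stays bounded and converges to the first PC direction.

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.5], [1.5, 1.0]])           # made-up data covariance
X = rng.multivariate_normal([0, 0], C, size=5000)

w = rng.standard_normal(2)
eta = 0.01
for x in X:
    y = w @ x                                     # linear unit output
    w += eta * y * (x - y * w)                    # Oja: Hebbian term y*x minus decay y^2*w

w /= np.linalg.norm(w)
top_pc = np.linalg.eigh(C)[1][:, -1]              # eigenvector of the largest eigenvalue
print(np.abs(w @ top_pc))                         # close to 1: w aligns with the first PC
```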

# sparse, distributed coding



# self-organizing maps



# manifold learning



# reinforcement learning



# recurrent networks



# probabilistic models + inference



# boltzmann machines



# ica



# dynamical models



# neural coding



# high-dimensional computing

59 changes: 33 additions & 26 deletions _notes/ref/prelim/prelim.md
@@ -18,7 +18,6 @@
- decision theory
- vpi
- mdps
- **pomdps**
- rl
- graphical models
- independence / factorization
@@ -43,30 +42,38 @@
- nearest neighbor
- neural nets (usually)

# misc facts

- bayes estimator minimizes **risk** (expected loss, varies with loss func)
- can't use implies with there exists (trivially true)
- left-singular is left vector in SVD
- averaging multiple trees (decreases var, doesn't increase bias)
- for alpha-beta pruning, usually pick correct soln first
- kd tree: binary split on different dim each time
- graphical models
- uniform distr can be represented by any bayes net
- bayes net can't have loops
- likelihood weighting only takes upstream evidence into account when sampling (unlike Gibbs)

# todo

- memorize new cheat sheet (logic, planning)
- practice problems
- 188 finals / discussions
- 189 finals / discussions
- russell qs
- **do exam-prep** - 7 q2
- review topics
- approximate q-learning?
- rl eqs
- conditional independencies
- forward-backward algo
- ac-3 / backtracking
- kernels
- svms duality
- conditional independencies - remember bayes ball algo
- bounce through all non-shaded unless shaded is base of v
- can bounce up farther from base of v
- Gibbs sampling: need to resample (using proportionality) keeping all other vars constant
- state space keeps track of things that change
- filtering, prediction, smoothing, mle
- viterbi: $m_t[x_t] = P(e_t|x_t) \max_{x_{t-1}} P(x_t|x_{t-1}) \, m_{t-1}[x_{t-1}]$ (see the sketch after this list)
- qda/svm with quadratic kernel can represent hyperbola boundary
- As the number of neighbors k gets very large and the number of training points goes to infinity, the probability of error for a k-nearest-neighbor classifier will tend to approach the probability of error for a MAP classifier that knew the true underlying distributions.
- svm
- soft margin SVM effectively optimizes hinge loss plus ridge regularization
- if a training point has positive slack, it is still a support vector (it lies inside the margin or is misclassified)
- hard-margin and soft-margin SVMs can differ even for separable data (same when C=$\infty$)
- remember to justify convexity!
- $x\sim N(\mu, \Sigma) \implies v^Tx \sim N(v^T\mu, v^T\Sigma v)$
- don't write $\sqrt{\Sigma}$ write $\Sigma^{1/2}$
- logistic reg: $P(Y=y|x, \theta) = \mu(\theta^Tx)^{y} \cdot (1-\mu(\theta^Tx))^{1-y}$ where $\mu$ is the logistic function
- csp
- Any node can be backtracked on up until a cutset has been assigned. Note that B’s values in the first part have no effect on the rest of the CSP after A has been assigned. However, because of the way that backtracking search is run, B would still be re-assigned before A if there was no consistent solution for a given value of A
- tree-structured csp algo (has backward pass for arc consistency / forward pass for assigning variables - faster)
- when enforcing arc consistency before every value assignment, we will only be guaranteed that we won’t need to backtrack when the remaining variables left to be assigned form a tree structure
- $MEU(B) = P(B=0) MEU(B=0) + P(B=1)MEU(B=1)$
- note we calculate MEU with respect to an assignment of a variable
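A Viterbi sketch (mine; the 2-state HMM and observation sequence are made up) implementing the recursion noted in the list above, $m_t[x_t] = P(e_t|x_t)\max_{x_{t-1}} P(x_t|x_{t-1})\, m_{t-1}[x_{t-1}]$, plus the backtracking pass.

```python
import numpy as np

P_init = np.array([0.6, 0.4])                       # P(x_0)
P_trans = np.array([[0.7, 0.3], [0.4, 0.6]])        # P(x_t = j | x_{t-1} = i) in row i, col j
P_emit = np.array([[0.9, 0.1], [0.2, 0.8]])         # P(e_t | x_t)
obs = [0, 0, 1, 1]

m = P_emit[:, obs[0]] * P_init                      # m_0[x_0]
back = []
for e in obs[1:]:
    scores = P_trans * m[:, None]                   # scores[i, j] = P(x_t=j | x_{t-1}=i) * m[i]
    back.append(scores.argmax(axis=0))              # best previous state for each current state
    m = P_emit[:, e] * scores.max(axis=0)           # Viterbi recursion

# backtrack the most likely state sequence
path = [int(m.argmax())]
for b in reversed(back):
    path.append(int(b[path[-1]]))
path.reverse()
print(path)                                         # [0, 0, 1, 1] for these made-up numbers
```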
Binary file modified _notes/ref/prelim/prelim_cheat_sheet_full.pdf
File renamed without changes.