reset website

csinva committed Sep 19, 2018
1 parent 79fc0b5 commit 0af734e
Showing 20 changed files with 241 additions and 119 deletions.
2 changes: 1 addition & 1 deletion _notes/ai/decisions.md
@@ -210,7 +210,7 @@ typora-copy-images-to: ./assets/ai
- *reinforcement learning* - use observed rewards to learn optimal policy for the environment
- in ch 17, agent had model of environment (P(s'|s, a) and R(s))
- 2 problems
- *passive* - given $\pi$, learn $U^\pi (s)$
- *active* - *explore* states to find utilities and *exploit* to get highest reward
- 2 model types, 3 agent designs
- model-based: can predict next state/reward before taking action (for MDP, requires learning $P(s'|s,a)$)
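A toy sketch (mine, not from the notes; the chain MDP, policy, and constants are made up) of the *passive* problem above: TD(0) estimates $U^\pi$ from observed transitions, model-free.

```python
# toy 4-state chain MDP: states 0..3, state 3 terminal; the fixed policy always
# moves right, and TD(0) estimates U^pi(s) without ever learning P(s'|s,a)
gamma, alpha = 0.9, 0.1
U = [0.0] * 4

def step(s):
    """Made-up environment: deterministic move right; reward 1 on reaching the terminal state."""
    s_next = s + 1
    return s_next, (1.0 if s_next == 3 else 0.0)

for episode in range(1000):
    s = 0
    while s != 3:
        s_next, r = step(s)
        U[s] += alpha * (r + gamma * U[s_next] - U[s])  # TD(0): U(s) += a*(r + g*U(s') - U(s))
        s = s_next

print([round(u, 2) for u in U])  # approx [0.81, 0.9, 1.0, 0.0]
```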
2 changes: 1 addition & 1 deletion _notes/ai/logic.md
@@ -174,7 +174,7 @@ typora-copy-images-to: ./assets/logic
- forward chaining: start w/ atomic sentences + apply modus ponens until no new inferences can be made
- *first-order definite clauses* - (remember this is a type of Horn clause)
- *Datalog* - language restricted to first-order definite clauses with no function symbols
- simple forward-chaining: FOL-FC-ASK - may not terminate if not entailed
1. *pattern matching* is expensive
2. rechecks every rule
3. generates irrelevant facts
2 changes: 1 addition & 1 deletion _notes/ai/search.md
@@ -124,7 +124,7 @@ From "Artificial Intelligence" Russell & Norvig 3rd Edition
- solution to original problem still solves relaxed problem
- cost of optimal solution to a relaxed problem is an admissible heuristic for the original problem
- also is consistent
- when there are several good heuristics, pick $h(n) = \max[h_1(n), ..., h_m(n)]$ for each node
- *pattern database* - heuristic stores exact solution cost for every possible subproblem instance
- *disjoint pattern database* - break into independent possible subproblems
- can learn heuristic by solving lots of problems using useful features
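A minimal sketch (my own illustration, with the standard 8-puzzle heuristics as the example) of combining admissible heuristics via max, as in the bullet above; the function names are made up.

```python
def misplaced_tiles(state, goal):
    """Admissible heuristic: count of tiles out of place (ignores the blank, 0)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal):
    """Admissible heuristic: sum of Manhattan distances of tiles to their goal cells (3x3 board)."""
    total = 0
    for tile in range(1, 9):
        i, j = divmod(state.index(tile), 3)
        gi, gj = divmod(goal.index(tile), 3)
        total += abs(i - gi) + abs(j - gj)
    return total

def h_combined(state, goal):
    # the max of admissible heuristics is still admissible and dominates each one
    return max(misplaced_tiles(state, goal), manhattan(state, goal))

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
state = (1, 2, 3, 4, 5, 6, 0, 7, 8)
print(misplaced_tiles(state, goal), manhattan(state, goal), h_combined(state, goal))
```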
2 changes: 1 addition & 1 deletion _notes/blog/interpretability.md
@@ -7,7 +7,7 @@ category: blog


# interpretability
**chandan singh**
*last updated jul 20, 2018*

---
17 changes: 10 additions & 7 deletions _notes/math/linear_algebra.md
@@ -14,7 +14,7 @@ typora-copy-images-to: ./assets/linear_algebra

## notation

- $x \preceq y$ - these are vectors and x is less than y elementwise
- $X \preceq Y$ - matrices, $Y-X$ is PSD
- $v^TXv \leq v^TYv \:\: \forall v$

@@ -26,7 +26,7 @@ typora-copy-images-to: ./assets/linear_algebra
- gives angle back
- linear
1. superposition $f(x+y) = f(x)+f(y) $
2. proportionality $f(k\cdot x) = k \cdot f(x)$
- vector space
1. closed under addition
2. contains identity
@@ -57,6 +57,7 @@ typora-copy-images-to: ./assets/linear_algebra
- if rank(A) = n, then must use $A^T A$
- inversion of matrix is $\approx O(n^3)$
- inverse of psd symmetric matrix is also psd and symmetric
- if A, B invertible $(AB)^{-1} = B^{-1} A^{-1}$
- *orthogonal complement* - the set of all vectors orthogonal to a given subspace
- define R(A) to be *range space* of A (column space) and N(A) to be *null space* of A
- R(A) and $N(A^T)$ are orthogonal complements (similarly $R(A^T)$ and N(A))
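A quick numerical check (my own sketch, random matrix made up) that $R(A) \perp N(A^T)$: vectors in the null space of $A^T$ are orthogonal to every column of A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))             # tall matrix, R(A) is a 3-dim subspace of R^5

# null space of A^T via SVD: right-singular vectors with (numerically) zero singular values
U, s, Vt = np.linalg.svd(A.T)
rank = int(np.sum(s > 1e-10))
null_AT = Vt[rank:].T                       # columns span N(A^T), a subspace of R^5

print(np.allclose(A.T @ null_AT, 0))        # True: every column of A is orthogonal to N(A^T)
```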
@@ -147,17 +148,17 @@ typora-copy-images-to: ./assets/linear_algebra
- expressions when $A \in \mathbb{S}$
- $\det(A) = \prod_i \lambda_i$
- $\text{tr}(A) = \sum_i \lambda_i$
- $||A||_2 = \max | \lambda_i |$
- $||A||_F = \sqrt{\sum \lambda_i^2}$
- $\lambda_{max} (A) = \sup_{x \neq 0} \frac{x^T A x}{x^T x}$
- $\lambda_{min} (A) = \inf_{x \neq 0} \frac{x^T A x}{x^T x}$
- *defective matrices* - lack a full set of linearly independent eigenvectors
- *positive semi-definite*: $A \in R^{n \times n}$
- basically these are always *symmetric* $A=A^T$
- all eigenvalues are nonnegative
- if $\forall x \in R^n, x^TAx \geq 0$ then A is positive semi definite (PSD)
- like it curves up
- Note: $x^TAx = \sum_{i, j} x_iA_{i, j} x_j$
- if $\forall x \in R^n, x^TAx > 0$ then A is positive definite (PD)
- PD $\to$ full rank, invertible
- PSD + symmetric $\implies$ can be written as *Gram matrix* $G = X^T X $
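A numerical sanity check (my own sketch; the matrix is random) of the eigenvalue identities listed above, for a symmetric PSD Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
A = X.T @ X                                   # Gram matrix: symmetric PSD, 4x4
lam = np.linalg.eigvalsh(A)                   # real eigenvalues, ascending

print(np.isclose(np.linalg.det(A), np.prod(lam)))                     # det(A) = prod(lambda_i)
print(np.isclose(np.trace(A), np.sum(lam)))                           # tr(A) = sum(lambda_i)
print(np.isclose(np.linalg.norm(A, 2), np.max(np.abs(lam))))          # ||A||_2 = max |lambda_i|
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(lam**2))))  # ||A||_F = sqrt(sum lambda_i^2)

# the Rayleigh quotient is bounded by lambda_min and lambda_max
v = rng.standard_normal(4)
rq = v @ A @ v / (v @ v)
print(lam[0] - 1e-9 <= rq <= lam[-1] + 1e-9)
```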
@@ -196,6 +197,8 @@ typora-copy-images-to: ./assets/linear_algebra
- columns of V (pxp) are eigenvectors of $X^TX$
- r singular values on diagonal of $\Sigma$ (nxp) - square roots of nonzero eigenvalues of both $XX^T$ and $X^TX$
- like rotating, scaling, and rotating back
- SVD ex. $A=UDV^T \implies A^{-1} = VD^{-1} U^T$
- $X = \sum_i \sigma_i u_i v_i^T$
- properties
1. for PD matrices, $\Sigma=\Lambda$, $U\Sigma V^T = Q \Lambda Q^T$
- for other symmetric matrices, any negative eigenvalues in $\Lambda$ become positive in $\Sigma$
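A quick check (my own sketch, random square matrix) of the SVD facts above: the rank-1 sum $X = \sum_i \sigma_i u_i v_i^T$ reconstructs the matrix, and $A^{-1} = VD^{-1}U^T$ when A is invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))               # generically invertible
U, s, Vt = np.linalg.svd(A)

# X = sum_i sigma_i u_i v_i^T
recon = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(recon, A))

# A = U D V^T  =>  A^{-1} = V D^{-1} U^T  (assumes all singular values are nonzero)
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(A_inv, np.linalg.inv(A)))
```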
10 changes: 5 additions & 5 deletions _notes/ml/classification.md
@@ -36,7 +36,7 @@ category: ml
| Logistic regression | $\theta^T\theta + C \sum_i \log[1+\exp(-y_i \cdot \theta^T x_i)]$ |


- svm, perceptron use +1/-1, logistic use 1/0
- *perceptron* - tries to find separating hyperplane
- whenever misclassified, update w
- can add in delta term to maximize margin
@@ -167,7 +167,7 @@ category: ml
1. at test time, can't just store w - have to store support vectors
- ![](assets/classification/svm_margin.png)
- $\hat{y} =\begin{cases} 1 &\text{if } w^Tx +b \geq 0 \\ -1 &\text{otherwise}\end{cases}$
- $\hat{\theta} = \text{argmin} \:\frac{1}{2} \vert \vert \theta\vert \vert ^2 \\s.t. \: y^{(i)}(\theta^Tx^{(i)}+b)\geq1, i = 1,...,m$
- *functional margin* $\gamma^{(i)} = y^{(i)} (\theta^T x +b)$
- limit the size of $(\theta, b)$ so we can't arbitrarily increase functional margin
- the functional margin $\hat{\gamma}$ of a training set is the smallest functional margin over the training examples
@@ -197,7 +197,7 @@ category: ml
- scale before applying
- fill in missing values
- start with RBF
- valid kernel: kernel matrix is PSD

# generative

@@ -222,7 +222,7 @@ category: ml
## naive bayes classifier

- assume multinomial Y
- with clever tricks, can produce $P(Y^i=1|x, \eta)$ again as a softmax
- let $y_1,...y_l$ be the classes of Y
- want Posterior $P(Y\vert X) = \frac{P(X\vert Y)P(Y)}{P(X)}$
- MAP rule - maximum a posteriori rule (see the sketch below)
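A tiny numeric illustration (mine; all counts, priors, and parameters are made up) of the MAP rule with a multinomial bag-of-words class-conditional: the posterior is proportional to $P(X|Y)P(Y)$.

```python
import numpy as np

# two classes, 3-word vocabulary; multinomial class-conditionals and a class prior
log_prior = np.log(np.array([0.6, 0.4]))              # P(Y)
theta = np.array([[0.7, 0.2, 0.1],                    # P(word | Y=0)
                  [0.2, 0.3, 0.5]])                   # P(word | Y=1)
x = np.array([3, 0, 2])                               # word counts in one document

# log P(Y=k | x) = log P(Y=k) + sum_j x_j log theta_{k,j} + const (multinomial coefficient drops out)
log_post = log_prior + x @ np.log(theta).T
post = np.exp(log_post - np.max(log_post))
post /= post.sum()
print(post, "MAP class:", int(np.argmax(post)))
```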
@@ -304,6 +304,6 @@ category: ml

## multinomial

- $L(\theta) = P(x_1,...,x_n\vert \theta_1,...,\theta_p) = \prod_i^n P(d_i\vert \theta_1,...\theta_p)=\prod_i^n \text{factorials} \cdot \theta_1^{x_1} \cdots \theta_p^{x_p}$ - ignore the factorials because they are always the same
- require $\sum \theta_i = 1$
- $\implies \theta_i = \frac{\sum_{j=1}^n x_{ij}}{N}$ where N is the total number of words in all docs (see the sketch below)
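A worked sketch of the multinomial MLE above (my own toy word counts): each $\theta_i$ is just the pooled count of word i divided by the total word count.

```python
import numpy as np

docs = np.array([[2, 0, 1],            # word counts per document, vocab size 3 (made up)
                 [1, 3, 0],
                 [0, 1, 2]])
counts = docs.sum(axis=0)              # sum_j x_ij for each word i
theta_hat = counts / counts.sum()      # divide by N = total number of words in all docs
print(theta_hat)                       # [0.3, 0.4, 0.3]
```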
8 changes: 6 additions & 2 deletions _notes/ml/unsupervised.md
@@ -86,9 +86,13 @@ graph LR;
- latent variable Z has multinomial distr.
- *mixing proportions*: $P(Z^i=1|x, \xi)$
- ex. $ \frac{e^{\xi_i^Tx}}{\sum_je^{\xi_j^Tx}}$
- *mixture components*: $p(y|Z^i=1, x, \theta_i)$ ~ different choices
- ex. mixture of linear regressions
- $p(y| x, \theta) = \sum_i \underbrace{\pi_i (x, \xi)}_{\text{mixing prop.}} \cdot \underbrace{\mathcal{N}(y|\beta_i^Tx, \sigma_i^2)}_{\text{mixture comp.}}$
- ex. mixtures of logistic regressions
- $p(y|x, \theta_i) = \underbrace{\pi_i (x, \xi)}_{\text{mixing prop.}} \cdot \underbrace{\mu(\theta_i^Tx)^y\cdot[1-\mu(\theta_i^Tx)]^{1-y}}_{\text{mixture comp.}}$ where $\mu$ is the logistic function
- also, nonlinear optimization for this (including EM); see the sketch below
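A minimal sketch (mine; all parameters and inputs are made up) of evaluating the mixture-of-linear-regressions density above: softmax gating for the mixing proportions, Gaussian mixture components.

```python
import numpy as np

def mixture_density(y, x, xi, beta, sigma):
    """p(y|x) = sum_i pi_i(x, xi) * N(y | beta_i^T x, sigma_i^2) with softmax gating."""
    logits = xi @ x
    pis = np.exp(logits - logits.max())
    pis /= pis.sum()                                   # mixing proportions pi_i(x, xi)
    means = beta @ x                                   # per-component means beta_i^T x
    comps = np.exp(-(y - means) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return float(np.sum(pis * comps))

x = np.array([1.0, 2.0])                               # made-up feature vector
xi = np.array([[0.5, -0.2], [-0.3, 0.4]])              # made-up gating parameters
beta = np.array([[1.0, 0.5], [-1.0, 2.0]])             # made-up regression weights
sigma = np.array([0.5, 1.0])                           # per-component noise std
print(mixture_density(1.5, x, xi, beta, sigma))
```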

## spectral clustering

- use the spectrum (eigenvalues) of the similarity matrix of the data to perform dim. reduction before clustering in fewer dimensions
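A minimal spectral-clustering sketch (my own toy data and parameters): build an RBF similarity matrix, form the graph Laplacian, and split on the sign of its second eigenvector; for k > 2 clusters you would instead run k-means on the first k eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal([0, 0], 0.3, (20, 2)),     # two made-up blobs
                 rng.normal([2, 0], 0.3, (20, 2))])

d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 2.0)                      # similarity matrix, RBF with sigma = 1
L = np.diag(W.sum(axis=1)) - W             # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)

labels = (eigvecs[:, 1] > 0).astype(int)   # sign of the Fiedler vector gives the 2-way cut
print(labels)                              # first 20 points in one cluster, last 20 in the other
```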
82 changes: 79 additions & 3 deletions _notes/neuro/neural_comp.md
@@ -50,7 +50,9 @@ category: neuro
- neocognitron fukushima 1980
- david marr: theory, representation, implementation

# neuron models

## circuit-modelling basics

- membrane has capacitance $C_m$
- force for diffusion, force for drift
@@ -62,7 +64,7 @@ category: neuro
- $C_m \propto D$
- axial resistance $R_A \propto 1/D^2$ (not same as membrane leak), thus bigger axons actually charge faster
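A toy RC-membrane sketch (mine; the parameter values are made up for illustration): forward-Euler integration of $C_m \frac{dV}{dt} = -(V - E_L)/R_m + I$, showing charging toward $E_L + I R_m$ with time constant $\tau = R_m C_m$.

```python
C_m, R_m, E_L = 1.0, 10.0, -70.0      # nF, MOhm, mV (illustrative values)
I = 2.0                                # nA current step
dt, T = 0.1, 100.0                     # ms
V = E_L
trace = []
for _ in range(int(T / dt)):
    dVdt = (-(V - E_L) / R_m + I) / C_m
    V += dt * dVdt                     # forward-Euler step
    trace.append(V)

print(round(trace[-1], 2))             # approaches E_L + I*R_m = -50 mV; tau = R_m*C_m = 10 ms
```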

## action potentials

- channel/receptor types
- ionotropic: $G_{ion}$ = f(molecules outside)
@@ -74,7 +76,7 @@ category: neuro
- hair cell
- voltage-gated (active - provide gain; might not require active ATP, other channels are all passive)

## physics of computation

- based on carver mead: drift and diffusion are at the heart of everything
- different things related by the **Boltzmann distr.** (ex. distr of air molecules vs elevation. Subject to gravity and diffusion upwards since they're colliding)
@@ -101,3 +103,77 @@ category: neuro

# supervised learning

- see machine learning course
- NETtalk was a major breakthrough (text -> speech audio), Sejnowski & Rosenberg 1987
- people looked for world-centric receptive fields (so neurons responded to things not relative to retina but relative to body) but didn't find them
- however, they did find gain fields: (Zipser & Anderson, 1987)
- gain changes based on what retina is pointing at
- trained nn to go from pixels to head-centered coordinate frame
- yielded gain fields
- pouget et al. found this can be accounted for with 2 population vectors: one for retinal position, one for eye position, which are then added
- support vector networks (vapnik et al.) - svms were originally inspired by nns
- dendritic nonlinearities (hausser & mel 03)
- example to think about neurons doing this: $u = w_1 x_1 + w_2x_2 + w_{12}x_1x_2$
- $y=\sigma(u)$
- sometimes called a sigma-pi unit since it's a sum of products
- exponential number of params...**could be fixed w/ kernel trick?**
- could also incorporate geometry constraint...
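A sketch of the sigma-pi unit above (weights and inputs made up): the pairwise $w_{12}x_1x_2$ term is what a multiplicative dendritic nonlinearity adds on top of a plain linear unit.

```python
import numpy as np

def sigma(u):
    """Logistic squashing function."""
    return 1.0 / (1.0 + np.exp(-u))

w = np.array([0.8, -0.5])      # made-up linear weights
w12 = 1.2                      # made-up pairwise (multiplicative) weight
x = np.array([1.0, 0.5])

u = w @ x + w12 * x[0] * x[1]  # u = w1*x1 + w2*x2 + w12*x1*x2
print(sigma(u))                # ~0.76
```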

# unsupervised learning

- born w/ extremely strong priors on weights in different areas
- barlow 1961, attneave 1954: efficient coding hypothesis = redundancy reduction hypothesis
- representation: compression / usefulness
- easier to store prior probabilities (because inputs are independent)
- redlich 93: redundancy reduction for unsupervised learning (text ex. learns words from text w/out spaces)

## hebbian learning and pca

- pca can also be thought of as a tool for decorrelation (in the pc basis, coordinates are uncorrelated)
- hebbian learning = fire together, wire together: $\Delta w_{ab} \propto \langle a, b\rangle$; note: $\langle a, b\rangle$ is the correlation of a and b (average over time)
- linear hebbian learning (perceptron with linear output)
- $\dot{w}_i \propto \langle y, x_i\rangle \propto \sum_j w_j \langle x_j, x_i\rangle$ since weights change relatively slowly
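A sketch of the Hebbian/PCA connection above (my own toy data): plain Hebbian updates on a linear unit grow along the top principal component; here I use Oja's rule, a normalized Hebbian variant not covered in these notes, so the weight stays bounded and converges to the first PC direction.

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.5], [1.5, 1.0]])           # made-up data covariance
X = rng.multivariate_normal([0, 0], C, size=5000)

w = rng.standard_normal(2)
eta = 0.01
for x in X:
    y = w @ x                                     # linear unit output
    w += eta * y * (x - y * w)                    # Oja: Hebbian term y*x minus decay y^2*w

w /= np.linalg.norm(w)
top_pc = np.linalg.eigh(C)[1][:, -1]              # eigenvector of the largest eigenvalue
print(np.abs(w @ top_pc))                         # close to 1: w aligns with the first PC
```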

# sparse, distributed coding



# self-organizing maps



# manifold learning



# reinforcement learning



# recurrent networks



# probabilistic models + inference



# boltzmann machines



# ica



# dynamical models



# neural coding



# high-dimensional computing

59 changes: 33 additions & 26 deletions _notes/ref/prelim/prelim.md
@@ -18,7 +18,6 @@
- decision theory
- vpi
- mdps
- **pomdps**
- rl
- graphical models
- independence / factorization
@@ -43,30 +42,38 @@
- nearest neighbor
- neural nets (usually)

# misc facts

- bayes estimator minimizes **risk** (expected loss, varies with loss func)
- can't use implies with there exists (trivially true)
- left-singular is left vector in SVD
- averaging multiple trees (decreases var, doesn't increase bias)
- for alpha-beta pruning, usually pick correct soln first
- kd tree: binary split on different dim each time
- graphical models
- uniform distr can be represented by any bayes net
- bayes net can't have loops
- likelihood weighting only takes upstream evidence into account when sampling (unlike Gibbs)

# todo

- memorize new cheat sheet (logic, planning)
- practice problems
- 188 finals / discussions
- 189 finals / discussions
- russell qs
- **do exam-prep** - 7 q2
- review topics
- approximate q-learning?
- rl eqs
- conditional independencies
- forward-backward algo
- ac-3 / backtracking
- kernels
- svms duality
- conditional independencies - remember bayes ball algo
- bounce through all non-shaded unless shaded is base of v
- can bounce up farther from base of v
- Gibbs sampling: need to resample (using proportionality) keeping all other vars constant
- state space keeps track of things that change
- filtering, prediction, smoothing, mle
- viterbi: $m_t[x_t] = P(e_t|x_t) \max_{x_{t-1}} P(x_t|x_{t-1}) \, m_{t-1}[x_{t-1}]$ (see the sketch after this list)
- qda/svm with quadratic kernel can represent hyperbola boundary
- As the number of neighbors k gets very large and the number of training points goes to infinity, the probability of error for a k-nearest-neighbor classifier will tend to approach the probability of error for a MAP classifier that knew the true underlying distributions.
- svm
- soft margin SVM effectively optimizes hinge loss plus ridge regularization
- if a training point has positive slack, it is still a support vector (it lies inside the margin or is misclassified)
- hard-margin and soft-margin SVMs can differ even for separable data (same when C=$\infty$)
- remember to justify convexity!
- $x\sim N(\mu, \Sigma) \implies v^Tx \sim N(v^T\mu, v^T\Sigma v)$
- don't write $\sqrt{\Sigma}$ write $\Sigma^{1/2}$
- logistic reg: $P(Y=y|x, \theta) = \mu(\theta^Tx)^{y} \cdot (1-\mu(\theta^Tx))^{1-y}$ where $\mu$ is the logistic function
- csp
- Any node can be backtracked on up until a cutset has been assigned. Note that B’s values in the first part have no effect on the rest of the CSP after A has been assigned. However, because of the way that backtracking search is run, B would still be re-assigned before A if there was no consistent solution for a given value of A
- tree-structured csp algo (has backward pass for arc consistency / forward pass for assigning variables - faster)
- when enforcing arc consistency before every value assignment, we will only be guaranteed that we won’t need to backtrack when the remaining variables left to be assigned form a tree structure
- $MEU(B) = P(B=0) MEU(B=0) + P(B=1)MEU(B=1)$
- note we calculate MEU with respect to an assignment of a variable
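A Viterbi sketch (mine; the 2-state HMM and observation sequence are made up) implementing the recursion noted in the list above, $m_t[x_t] = P(e_t|x_t)\max_{x_{t-1}} P(x_t|x_{t-1})\, m_{t-1}[x_{t-1}]$, plus the backtracking pass.

```python
import numpy as np

P_init = np.array([0.6, 0.4])                       # P(x_0)
P_trans = np.array([[0.7, 0.3], [0.4, 0.6]])        # P(x_t = j | x_{t-1} = i) in row i, col j
P_emit = np.array([[0.9, 0.1], [0.2, 0.8]])         # P(e_t | x_t)
obs = [0, 0, 1, 1]

m = P_emit[:, obs[0]] * P_init                      # m_0[x_0]
back = []
for e in obs[1:]:
    scores = P_trans * m[:, None]                   # scores[i, j] = P(x_t=j | x_{t-1}=i) * m[i]
    back.append(scores.argmax(axis=0))              # best previous state for each current state
    m = P_emit[:, e] * scores.max(axis=0)           # Viterbi recursion

# backtrack the most likely state sequence
path = [int(m.argmax())]
for b in reversed(back):
    path.append(int(b[path[-1]]))
path.reverse()
print(path)                                         # [0, 0, 1, 1] for these made-up numbers
```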
Binary file modified _notes/ref/prelim/prelim_cheat_sheet_full.pdf
File renamed without changes.