forked from stanfordnlp/CoreNLP
Commit: Merge remote-tracking branch 'origin/master'
Showing 83 changed files with 316,940 additions and 1,336 deletions.

@@ -0,0 +1,119 @@
Architectural Overview of edu.stanford.nlp.loglinear:

The goal of this package is to provide fast, general structured log-linear modelling that's easy to use and to extend.

The package is broken into three parts: model, inference, and learning.

Model contains all of the basic storage elements, as well as the means to serialize and deserialize them for both
storage and network transit. Inference depends on model, and provides an implementation of the clique tree message
passing algorithm for efficient exact inference in tree-structured graphs. Learning depends on inference and model, and
provides a simple interface to efficient multithreaded batch learning, with an implementation of AdaGrad guarded by
backtracking line search.

We will go over model, then inference, then learning.

#####################################################

Model module overview:

#####################################################

***
ConcatVector:

The key to the speed of loglinear is the ConcatVector class. ConcatVector provides a useful abstraction for NLP machine
learning: a concatenation of vectors, treated as a single vector. The basic idea is to have each feature output a vector,
and to store those vectors in a ConcatVector (a 'concatenated vector'). The dot product of two ConcatVectors is the sum
of the dot products of their concatenated components, taken in sequence. To write that out explicitly, if a feature
ConcatVector f is composed of a number of vectors f_i, and a weight ConcatVector w is composed of a number of vectors
w_i, then dot(f,w) is:

\sum_i dot(f_i, w_i)

This gives us two key advantages over a regular vector: each component can be individually tuned for sparsity, and each
component has an isolated namespace, so an individual feature vector can grow after training begins (say, after
discovering a new word in a one-hot feature vector) and the weight vector will behave appropriately without hassle.
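
For concreteness, here is a small sketch of what that looks like in code, using the component setters shown in the
quickstart later in this commit. The name of the dot product method (dotProduct) is an assumption for illustration, not
a confirmed part of the API.

-------------
// Feature vector f: a dense component 0 and a one-hot sparse component 1
ConcatVector f = new ConcatVector(2);
f.setDenseComponent(0, new double[]{0.5, 1.0});
f.setSparseComponent(1, 3, 1.0);

// Weight vector w with matching components
ConcatVector w = new ConcatVector(2);
w.setDenseComponent(0, new double[]{2.0, -1.0});
w.setSparseComponent(1, 3, 0.25);

// dot(f, w) = dot(f_0, w_0) + dot(f_1, w_1)
//           = (0.5*2.0 + 1.0*-1.0) + (1.0*0.25) = 0.25
double score = f.dotProduct(w);  // method name assumed
-------------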

***
NDArray

We have a basic NDArray, which provides a standard iterator over possible assignments (which creates a lot of int[]
arrays on the heap) and a more elaborate iterator that saves GC by mutating a single array in place. You'll see the
latter used throughout the code in hot loops marked by an "//OPTIMIZATION" comment.

***
ConcatVectorTable

ConcatVectorTable is a subclass of NDArray that we use to store factor tables for the log-linear graphical model, where
each element of the table represents the features for one joint assignment to the variables the factor is associated
with. To get a factor like the ones you learned about in CS 228, each element of the table is dot-producted with the
weights. We don't do this at construction time, so that a single set of GraphicalModel objects can be reused throughout
training.

***
GraphicalModel

GraphicalModel is a deliberately stripped-down implementation of a graphical model. It holds factors, each represented
by a list of neighbor indices and a ConcatVectorTable of features. All downstream annotations on the model (like
observations for inference, or labels for training) deliberately go into a HashMap. This maintains easy backwards
compatibility with previously serialized versions as features change, and makes life more convenient for downstream
algorithms that may be passing GraphicalModel objects across module or network boundaries and don't want to create tons
of little 'ride-along' objects that add annotations to the GraphicalModel.
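
For example, the quickstart later in this commit attaches training labels through exactly this per-variable metadata
map:

-------------
GraphicalModel model = new GraphicalModel();

// Any downstream annotation rides along in the metadata HashMap; here, a training label for variable 2
model.getVariableMetadataByReference(2).put(LogLikelihoodFunction.VARIABLE_TRAINING_VALUE, "1");
-------------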

#####################################################

Inference module overview:

#####################################################

***
TableFactor

This is the traditional 'factor' datatype that you're used to hearing about from Daphne Koller in CS 228 and
"Probabilistic Graphical Models". It's a subclass of NDArray, and has fast implementations of the product and
marginalization operations. It's the key building block for inference.

***
CliqueTree

This object takes a GraphicalModel at creation and provides high-speed tree-shaped message passing inference for both
exact marginals and exact MAP estimates. It exists as a new object for each GraphicalModel, rather than as a static call
per model, to allow caching of messages when repeated marginals are needed on models that change only slightly.
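
Usage is covered in QUICKSTART.txt; in brief:

-------------
CliqueTree tree = new CliqueTree(model, weights);

int[] mapAssignment = tree.calculateMAP();
double[][] marginalProbabilities = tree.calculateMarginals();
-------------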

#####################################################

Learning module overview:

#####################################################

***
AbstractDifferentiableFunction

This follows the Optimize.jl package convention of providing both gradient and function value in a single return value.
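
As a purely hypothetical illustration of that convention (the real class's signature is not shown in this commit), a
function shaped this way might bundle the two quantities like so, which lets implementations share the work that the
value and gradient computations have in common:

-------------
// Hypothetical sketch only; not the actual AbstractDifferentiableFunction API
class ValueAndGradient {
  final double value;            // function value at the current weights
  final ConcatVector gradient;   // gradient at the current weights

  ValueAndGradient(double value, ConcatVector gradient) {
    this.value = value;
    this.gradient = gradient;
  }
}
-------------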

***
LogLikelihoodFunction

An implementation of AbstractDifferentiableFunction for calculating the log-likelihood of a log-linear model as given by
a GraphicalModel.

***
AbstractOnlineOptimizer

This is the basic interface for online optimizers to follow. It is sketched out right now, but no implementations have
been made yet.

***
AbstractBatchOptimizer

There is a fair amount of redundant complexity involved in writing an optimizer that needs to calculate the gradient
over the entire batch of examples at every update step. The work must be carefully balanced between threads so that the
window between the first thread finishing and the last one (during which CPU utilization is far below 100%) is
minimized. This is managed by roughly estimating the amount of work each item represents, and then updating those
estimates perceptron-style once the system is running, based on the CPU time used by each thread. We also implement a
convenience function here that lets the user interrupt training early if they are happy with convergence so far, since
that involves some tricky Java threading to make work.
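
A rough sketch of that balancing idea follows. This is illustrative Java only, not the optimizer's actual
implementation, and all of the names in it are made up:

-------------
// Each example carries an estimated cost. After a pass, compare each thread's observed CPU time with
// the total estimated cost of the bucket of examples it was given, and spread the error back over
// that bucket's estimates (an error-driven, perceptron-style update).
static void rebalanceEstimates(double[] estimatedCost, int[][] bucketsByThread, double[] threadCpuTime) {
  double learningRate = 0.1;  // made-up constant
  for (int t = 0; t < bucketsByThread.length; t++) {
    if (bucketsByThread[t].length == 0) continue;

    double predicted = 0.0;
    for (int i : bucketsByThread[t]) {
      predicted += estimatedCost[i];
    }

    // If the thread ran longer than predicted, its examples were underestimated, and vice versa
    double error = threadCpuTime[t] - predicted;
    double perExampleAdjustment = learningRate * error / bucketsByThread[t].length;
    for (int i : bucketsByThread[t]) {
      estimatedCost[i] += perExampleAdjustment;
    }
  }
}
-------------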

***
BacktrackingAdaGradOptimizer

This subclasses AbstractBatchOptimizer, and implements a simple AdaGrad gradient descent guarded by backtracking line
search to maximize an AbstractDifferentiableFunction.

@@ -0,0 +1,22 @@

Optimization for loglinear was done last, and was driven by realistic benchmarks wherever possible.

There are two major benchmarks for loglinear so far:

- Training a CoNLL linear chain model with a mixture of dense and sparse features:
  edu.stanford.nlp.loglinear.learning.CoNLLBenchmark
- A bunch of microbenchmarks for ConcatVector:
  edu.stanford.nlp.loglinear.model.ConcatVectorBenchmark

The general findings so far have been:

- The JNI doesn't speed up ConcatVector operations, even with AVX assembly, since the JIT seems to vectorize
  automatically, and occasionally will give the C code a copy of the arrays to mutate, which is absurdly slow.
- The vast majority of training time is spent on feature vector related operations. Message passing takes less than 5%
  of total time. The rest of the time is split evenly between the initial dot product with the weights to get factor
  values and the final summation of feature vectors to get the derivative of the log-likelihood.
- Huge heap wastage occurs if you let the featurizing code run once and then keep the resulting ConcatVectors around
  uselessly, which leads to general slowdowns and long GC waits. Keeping the featurizing code as thunks instead is much
  faster.
- Making ConcatVector copy-on-write is a very valuable way to keep the GC from working too hard. A very common case is
  a cloned vector that is only ever read from, so optimizing for that yielded a 50% drop in GC load (see the sketch
  after this list).
- A non-trivial fraction of time is wasted on poorly balanced work queues for the different threads during batch
  gradient descent. Balancing more carefully yielded a 10% speedup.
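
A sketch of the copy-on-write idea mentioned above, kept deliberately generic since it does not reproduce the real
ConcatVector internals:

-------------
// Hypothetical sketch: a clone shares its component arrays with the original, and a component is
// only physically copied the first time either side writes to it. A clone that is only ever read
// from therefore never triggers a copy at all.
class CowVectorSketch {
  private double[][] components;
  private boolean[] copyOnWrite;

  CowVectorSketch(int numComponents) {
    components = new double[numComponents][];
    copyOnWrite = new boolean[numComponents];
  }

  CowVectorSketch lazyClone() {
    CowVectorSketch clone = new CowVectorSketch(components.length);
    for (int i = 0; i < components.length; i++) {
      clone.components[i] = components[i];  // share the underlying storage
      clone.copyOnWrite[i] = true;
      copyOnWrite[i] = true;                // both sides must now copy before writing
    }
    return clone;
  }

  void set(int component, int index, double value) {
    if (components[component] == null) {
      components[component] = new double[index + 1];
      copyOnWrite[component] = false;
    } else if (copyOnWrite[component] || components[component].length <= index) {
      // Copy (and grow if needed) lazily, only when this component is actually written to
      int newLength = Math.max(components[component].length, index + 1);
      components[component] = java.util.Arrays.copyOf(components[component], newLength);
      copyOnWrite[component] = false;
    }
    components[component][index] = value;
  }
}
-------------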

@@ -0,0 +1,147 @@

loglinear package quickstart:

First, read the ConcatVector section in ARCH.txt.

To jump straight into working code, go read generateSentenceModel() in edu.stanford.nlp.loglinear.learning.CoNLLBenchmark.

#####################################################

Creating and featurizing a GraphicalModel

#####################################################

To construct a GraphicalModel, which you'll need for training and inference, do the following:

-------------
GraphicalModel model = new GraphicalModel();
-------------

Now, to add a factor to the model, you'll need to know two things: which variables are the neighbors of this factor, and
how many states each of those variables has. As an example, let's add a factor between variable 2 and variable 7, where
variable 2 has 3 states and variable 7 has 2 states.

-------------
int[] neighbors = new int[]{2,7};
int[] neighborSizes = new int[]{3,2}; // This must appear in the same order as neighbors

model.addFactor(neighbors, neighborSizes, (int[] assignment) -> {
    // TODO: In the next paragraph we'll discuss featurizing each possible assignment to neighbors
    return new ConcatVector(0);
});
-------------

You'll also need to know how to featurize your new factors. That happens inside the closure you pass into addFactor().
The closure takes as an argument an assignment to the factor's variables (in the same order as neighbors and
neighborSizes) and is expected to return a ConcatVector of features.

Make sure your closures are idempotent! The system actually stores these closures, rather than their resulting
ConcatVectors. This is an optimization. The GC in modern JVMs is tuned to collect large numbers of young objects, so
creating new ConcatVectors on every gradient calculation and then immediately disposing of them turns out to increase
speed and dramatically decrease heap footprint and GC load.

Here's how to create a feature closure:

-------------
int[] neighbors = new int[]{2,7};
int[] neighborSizes = new int[]{3,2}; // This must appear in the same order as neighbors

model.addFactor(neighbors, neighborSizes, (int[] assignment) -> {

    // This is how assignment[] is structured

    int variable2Assignment = assignment[0];
    int variable7Assignment = assignment[1];

    // Create a new ConcatVector with 2 segments:

    ConcatVector features = new ConcatVector(2);

    // Add a dense feature as feature 0, of length 2
    // (Dense features in ConcatVectors are mostly used for embeddings)

    features.setDenseComponent(0, new double[]{
        variable2Assignment * 2 + variable7Assignment,
        variable2Assignment + variable7Assignment * 2,
    });

    // Add a sparse (one-hot) feature as feature 1

    int sparseIndex = variable2Assignment;
    double sparseValue = 1.0;
    features.setSparseComponent(1, sparseIndex, sparseValue);

    // Return our feature set to complete the closure

    return features;
});
-------------

And that's all there is to it. Just repeat this several times to populate a GraphicalModel for whatever problem you've
got.

#####################################################

Training a set of weights

#####################################################

Assuming you've got a bunch of GraphicalModel objects, there's not much you need to do to train a system. First, you
need to provide labels for your variables in a form the training system understands. To do this, you get the
HashMap<String,String> object that holds each variable's metadata, and put in labels that the LogLikelihood system
understands. You must label every variable that is mentioned in any factor in your model.

-------------
GraphicalModel model = new GraphicalModel();

// Omitted model construction, see previous section

// Tell LogLikelihood to treat variable 2 as having the assignment "1" in training labels

model.getVariableMetadataByReference(2).put(LogLikelihoodFunction.VARIABLE_TRAINING_VALUE, "1");

// Tell LogLikelihood to treat variable 7 as having the assignment "0" in training labels

model.getVariableMetadataByReference(7).put(LogLikelihoodFunction.VARIABLE_TRAINING_VALUE, "0");
-------------

Once you've got an array of labeled models, training is pretty straightforward. We create an optimizer, pass it a
LogLikelihoodFunction as the function to optimize, and the array of models as the data to optimize over. That returns
us the optimal set of weights.

-------------
GraphicalModel[] trainingSet = // omitted dataset construction;

// Create the optimizer we will use

AbstractBatchOptimizer opt = new BacktrackingAdaGradOptimizer();

// Call the optimizer with a dataset, a function to optimize, initial weights, and an l2 regularization constant

ConcatVector weights = opt.optimize(trainingSet, new LogLikelihoodFunction(), new ConcatVector(0), 0.1);
-------------

We can then use these weights for inference.

#####################################################

Inference

#####################################################

Inference is easy once we have a set of weights we want to use. We simply create a CliqueTree from the model we want to
run inference on and the weights we want to use, and then ask it for inference results.

-------------
GraphicalModel model = // see previous section;
ConcatVector weights = // see first section;

CliqueTree tree = new CliqueTree(model, weights);

int[] mapAssignment = tree.calculateMAP();
double[][] marginalProbabilities = tree.calculateMarginals();
-------------

The MAP assignment comes back as an array of assignments, where the assignment for variable 0 is at index 0, variable 1
is at index 1, and so forth. The marginalProbabilities array is organized the same way, except that instead of an int
assignment there is an array of doubles, one for each possible assignment of the variable at that index, giving the
global marginal probabilities.
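
For example, with a model containing the variable 2 (3 states) and variable 7 (2 states) from the first section:

-------------
int variable2MAP = mapAssignment[2];  // one of {0, 1, 2}, since variable 2 has 3 states

double probVariable7Is1 = marginalProbabilities[7][1];  // marginal probability that variable 7 takes value 1
-------------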

@@ -0,0 +1,9 @@

For an explanation of how everything fits together, see ARCH.txt

For a quick runnable object, go run edu.stanford.nlp.loglinear.learning.CoNLLBenchmark in core's test package.

For a tutorial, see QUICKSTART.txt

For a brief overview of testing for this package (which was quite thorough), see TESTING.txt

Look at OPTIMIZATION.txt for an overview of what exists, what's been done, and what we know about further optimizations.

@@ -0,0 +1,16 @@

The testing for the loglinear package uses functional invariants over randomly generated inputs, which in general yields
much more durable software.

This is aided by the JUnit port of Quickcheck. The general tactic for testing is to randomly generate inputs, then use
the slow definitional approach on tractably small inputs and test that the output of our algorithms always matches
exactly. The GitHub for the Quickcheck port is https://github.com/pholser/junit-quickcheck. The dependencies for that
are in the lib/ folder in test1.

Some of the general testing approaches are listed below:

message passing -> tested against brute force factor multiplication and marginalization
partition function -> tested against brute force multiplication and summation
log likelihood gradient -> tested against definition of derivative
optimization -> tested by making thousands of random perturbations around the function to check if any values are better
concatVector -> tested against a non-sparse version
table factor -> tested against functional invariants of results, and re-implementations using different algorithms
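
As a concrete illustration of the slow-definitional tactic, here is a sketch of what one such test might look like. The
real tests use junit-quickcheck generators rather than a hand-rolled Random, and the dotProduct method name is an
assumption for illustration:

-------------
import java.util.Random;

import org.junit.Test;
import static org.junit.Assert.assertEquals;

import edu.stanford.nlp.loglinear.model.ConcatVector;

public class ConcatVectorDefinitionalTest {
  @Test
  public void dotProductMatchesDefinitionalSum() {
    Random r = new Random(42);
    for (int trial = 0; trial < 1000; trial++) {
      // Randomly generate a small, tractable input
      double[] a = new double[5];
      double[] b = new double[5];
      for (int i = 0; i < 5; i++) {
        a[i] = r.nextDouble();
        b[i] = r.nextDouble();
      }

      ConcatVector va = new ConcatVector(1);
      va.setDenseComponent(0, a);
      ConcatVector vb = new ConcatVector(1);
      vb.setDenseComponent(0, b);

      // The slow, definitional version of the same quantity
      double definitional = 0.0;
      for (int i = 0; i < 5; i++) {
        definitional += a[i] * b[i];
      }

      assertEquals(definitional, va.dotProduct(vb), 1e-9);  // method name assumed
    }
  }
}
-------------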

10 changes: 6 additions & 4 deletions
itest/src/edu/stanford/nlp/pipeline/ProtobufAnnotationSerializerSlowITest.java
Large diffs are not rendered by default.