forked from stanfordnlp/CoreNLP
Commit: Merge remote-tracking branch 'origin/master'
Showing 83 changed files with 316,940 additions and 1,336 deletions.

@@ -0,0 +1,119 @@
Architectural Overview of edu.stanford.nlp.loglinear:

The goal of this package is to provide fast, general structured log-linear modelling that's easy to use and to extend.

The package is broken into three parts: model, inference, and learning.

Model contains all of the basic storage elements, as well as the means to serialize and deserialize them for both
storage and network transit. Inference depends on model, and provides an implementation of the clique tree message
passing algorithm for efficient exact inference in tree-structured graphs. Learning depends on inference and model, and
provides a simple interface to efficient multithreaded batch learning, with an implementation of AdaGrad guarded by
backtracking line search.

We will go over model, then inference, then learning.

#####################################################

Model module overview:

#####################################################

***
ConcatVector:

The key to the speed of loglinear is the ConcatVector class. ConcatVector provides a useful abstraction for NLP machine
learning: a concatenation of vectors, treated as a single vector. The basic idea is to have each feature output a vector,
and to store those vectors in a ConcatVector (a 'concatenated vector'). The dot product of two ConcatVectors is the sum
of the dot products of their concatenated components, taken in sequence. To write that out explicitly, if a feature
ConcatVector f is composed of a number of vectors f_i, and a weight ConcatVector w is composed of a number of vectors
w_i, then dot(f,w) is:

\sum_i dot(f_i, w_i)

This gives us two key advantages over a regular vector: each component can be individually tuned for sparsity, and each
component has an isolated namespace, so an individual feature vector can grow after training begins (say, after
discovering a new word in a one-hot feature vector) and the weight vector will behave appropriately without hassle.
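
For concreteness, here is a small sketch of what that looks like in code, using the component setters shown in the
quickstart later in this commit. The name of the dot product method (dotProduct) is an assumption for illustration, not
a confirmed part of the API.

-------------
// Feature vector f: a dense component 0 and a one-hot sparse component 1
ConcatVector f = new ConcatVector(2);
f.setDenseComponent(0, new double[]{0.5, 1.0});
f.setSparseComponent(1, 3, 1.0);

// Weight vector w with matching components
ConcatVector w = new ConcatVector(2);
w.setDenseComponent(0, new double[]{2.0, -1.0});
w.setSparseComponent(1, 3, 0.25);

// dot(f, w) = dot(f_0, w_0) + dot(f_1, w_1)
//           = (0.5*2.0 + 1.0*-1.0) + (1.0*0.25) = 0.25
double score = f.dotProduct(w);  // method name assumed
-------------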

***
NDArray

We have a basic NDArray, which provides a standard iterator over possible assignments (which creates a lot of int[]
arrays on the heap) and a more elaborate iterator that saves GC by mutating a single array in place. You'll see the
latter used throughout the code in hot loops marked by an "//OPTIMIZATION" comment.

***
ConcatVectorTable

ConcatVectorTable is a subclass of NDArray that we use to store factor tables for the log-linear graphical model, where
each element of the table represents the features for one joint assignment to the variables the factor is associated
with. To get a factor like the ones you learned about in CS 228, each element of the table is dot-producted with the
weights. We don't do this at construction time, so that a single set of GraphicalModel objects can be reused throughout
training.

***
GraphicalModel

GraphicalModel is a deliberately stripped-down implementation of a graphical model. It holds factors, each represented
by a list of neighbor indices and a ConcatVectorTable of features. All downstream annotations on the model (like
observations for inference, or labels for training) deliberately go into a HashMap. This maintains easy backwards
compatibility with previously serialized versions as features change, and makes life more convenient for downstream
algorithms that may be passing GraphicalModel objects across module or network boundaries and don't want to create tons
of little 'ride-along' objects that add annotations to the GraphicalModel.
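
For example, the quickstart later in this commit attaches training labels through exactly this per-variable metadata
map:

-------------
GraphicalModel model = new GraphicalModel();

// Any downstream annotation rides along in the metadata HashMap; here, a training label for variable 2
model.getVariableMetadataByReference(2).put(LogLikelihoodFunction.VARIABLE_TRAINING_VALUE, "1");
-------------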

#####################################################

Inference module overview:

#####################################################

***
TableFactor

This is the traditional 'factor' datatype that you're used to hearing about from Daphne Koller in CS 228 and
"Probabilistic Graphical Models". It's a subclass of NDArray, and has fast implementations of the product and
marginalization operations. It's the key building block for inference.

***
CliqueTree

This object takes a GraphicalModel at creation and provides high-speed tree-shaped message passing inference for both
exact marginals and exact MAP estimates. It exists as a new object for each GraphicalModel, rather than as a static call
per model, to allow caching of messages when repeated marginals are needed on models that change only slightly.
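
Usage is covered in QUICKSTART.txt; in brief:

-------------
CliqueTree tree = new CliqueTree(model, weights);

int[] mapAssignment = tree.calculateMAP();
double[][] marginalProbabilities = tree.calculateMarginals();
-------------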

#####################################################

Learning module overview:

#####################################################

***
AbstractDifferentiableFunction

This follows the Optimize.jl package convention of providing both gradient and function value in a single return value.
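
As a purely hypothetical illustration of that convention (the real class's signature is not shown in this commit), a
function shaped this way might bundle the two quantities like so, which lets implementations share the work that the
value and gradient computations have in common:

-------------
// Hypothetical sketch only; not the actual AbstractDifferentiableFunction API
class ValueAndGradient {
  final double value;            // function value at the current weights
  final ConcatVector gradient;   // gradient at the current weights

  ValueAndGradient(double value, ConcatVector gradient) {
    this.value = value;
    this.gradient = gradient;
  }
}
-------------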

***
LogLikelihoodFunction

An implementation of AbstractDifferentiableFunction for calculating the log-likelihood of a log-linear model as given by
a GraphicalModel.

***
AbstractOnlineOptimizer

This is the basic interface for online optimizers to follow. It is sketched out right now, but no implementations have
been made yet.

***
AbstractBatchOptimizer

There is a fair amount of redundant complexity involved in writing an optimizer that needs to calculate the gradient
over the entire batch of examples at every update step. The work must be carefully balanced between threads so that the
window between the first thread finishing and the last one (during which CPU utilization is far below 100%) is
minimized. This is managed by roughly estimating the amount of work each item represents, and then updating those
estimates perceptron-style once the system is running, based on the CPU time used by each thread. We also implement a
convenience function here that lets the user interrupt training early if they are happy with convergence so far, since
that involves some tricky Java threading to make work.
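
A rough sketch of that balancing idea follows. This is illustrative Java only, not the optimizer's actual
implementation, and all of the names in it are made up:

-------------
// Each example carries an estimated cost. After a pass, compare each thread's observed CPU time with
// the total estimated cost of the bucket of examples it was given, and spread the error back over
// that bucket's estimates (an error-driven, perceptron-style update).
static void rebalanceEstimates(double[] estimatedCost, int[][] bucketsByThread, double[] threadCpuTime) {
  double learningRate = 0.1;  // made-up constant
  for (int t = 0; t < bucketsByThread.length; t++) {
    if (bucketsByThread[t].length == 0) continue;

    double predicted = 0.0;
    for (int i : bucketsByThread[t]) {
      predicted += estimatedCost[i];
    }

    // If the thread ran longer than predicted, its examples were underestimated, and vice versa
    double error = threadCpuTime[t] - predicted;
    double perExampleAdjustment = learningRate * error / bucketsByThread[t].length;
    for (int i : bucketsByThread[t]) {
      estimatedCost[i] += perExampleAdjustment;
    }
  }
}
-------------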

***
BacktrackingAdaGradOptimizer

This subclasses AbstractBatchOptimizer, and implements a simple AdaGrad gradient descent guarded by backtracking line
search to maximize an AbstractDifferentiableFunction.

@@ -0,0 +1,22 @@

Optimization for loglinear was done last, and was driven by realistic benchmarks wherever possible.

There are two major benchmarks for loglinear so far:

- Training a CoNLL linear chain model with a mixture of dense and sparse features:
  edu.stanford.nlp.loglinear.learning.CoNLLBenchmark
- A bunch of microbenchmarks for ConcatVector:
  edu.stanford.nlp.loglinear.model.ConcatVectorBenchmark

The general findings so far have been:

- The JNI doesn't speed up ConcatVector operations, even with AVX assembly, since the JIT seems to vectorize
  automatically, and occasionally will give the C code a copy of the arrays to mutate, which is absurdly slow.
- The vast majority of training time is spent on feature vector related operations. Message passing takes less than 5%
  of total time. The rest of the time is split evenly between the initial dot product with the weights to get factor
  values and the final summation of feature vectors to get the derivative of the log-likelihood.
- Huge heap wastage occurs if you let the featurizing code run once and then keep the resulting ConcatVectors around
  uselessly, which leads to general slowdowns and long GC waits. Keeping the featurizing code as thunks instead is much
  faster.
- Making ConcatVector copy-on-write is a very valuable way to keep the GC from working too hard. A very common case is
  a cloned vector that is only ever read from, so optimizing for that yielded a 50% drop in GC load (see the sketch
  after this list).
- A non-trivial fraction of time is wasted on poorly balanced work queues for the different threads during batch
  gradient descent. Balancing more carefully yielded a 10% speedup.
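
A sketch of the copy-on-write idea mentioned above, kept deliberately generic since it does not reproduce the real
ConcatVector internals:

-------------
// Hypothetical sketch: a clone shares its component arrays with the original, and a component is
// only physically copied the first time either side writes to it. A clone that is only ever read
// from therefore never triggers a copy at all.
class CowVectorSketch {
  private double[][] components;
  private boolean[] copyOnWrite;

  CowVectorSketch(int numComponents) {
    components = new double[numComponents][];
    copyOnWrite = new boolean[numComponents];
  }

  CowVectorSketch lazyClone() {
    CowVectorSketch clone = new CowVectorSketch(components.length);
    for (int i = 0; i < components.length; i++) {
      clone.components[i] = components[i];  // share the underlying storage
      clone.copyOnWrite[i] = true;
      copyOnWrite[i] = true;                // both sides must now copy before writing
    }
    return clone;
  }

  void set(int component, int index, double value) {
    if (components[component] == null) {
      components[component] = new double[index + 1];
      copyOnWrite[component] = false;
    } else if (copyOnWrite[component] || components[component].length <= index) {
      // Copy (and grow if needed) lazily, only when this component is actually written to
      int newLength = Math.max(components[component].length, index + 1);
      components[component] = java.util.Arrays.copyOf(components[component], newLength);
      copyOnWrite[component] = false;
    }
    components[component][index] = value;
  }
}
-------------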

@@ -0,0 +1,147 @@

loglinear package quickstart:

First, read the ConcatVector section in ARCH.txt.

To jump straight into working code, go read generateSentenceModel() in edu.stanford.nlp.loglinear.learning.CoNLLBenchmark.

#####################################################

Creating and featurizing a GraphicalModel

#####################################################

To construct a GraphicalModel, which you'll need for training and inference, do the following:

-------------
GraphicalModel model = new GraphicalModel();
-------------

Now, to add a factor to the model, you'll need to know two things: which variables are the neighbors of this factor, and
how many states each of those variables has. As an example, let's add a factor between variable 2 and variable 7, where
variable 2 has 3 states and variable 7 has 2 states.

-------------
int[] neighbors = new int[]{2,7};
int[] neighborSizes = new int[]{3,2}; // This must appear in the same order as neighbors

model.addFactor(neighbors, neighborSizes, (int[] assignment) -> {
    // TODO: In the next paragraph we'll discuss featurizing each possible assignment to neighbors
    return new ConcatVector(0);
});
-------------

You'll also need to know how to featurize your new factors. That happens inside the closure you pass into addFactor().
The closure takes as an argument an assignment to the factor's variables (in the same order as neighbors and
neighborSizes) and is expected to return a ConcatVector of features.

Make sure your closures are idempotent! The system actually stores these closures, rather than their resulting
ConcatVectors. This is an optimization. The GC in modern JVMs is tuned to collect large numbers of young objects, so
creating new ConcatVectors on every gradient calculation and then immediately disposing of them turns out to increase
speed and dramatically decrease heap footprint and GC load.

Here's how to create a feature closure:

-------------
int[] neighbors = new int[]{2,7};
int[] neighborSizes = new int[]{3,2}; // This must appear in the same order as neighbors

model.addFactor(neighbors, neighborSizes, (int[] assignment) -> {

    // This is how assignment[] is structured

    int variable2Assignment = assignment[0];
    int variable7Assignment = assignment[1];

    // Create a new ConcatVector with 2 segments:

    ConcatVector features = new ConcatVector(2);

    // Add a dense feature as feature 0, of length 2
    // (Dense features in ConcatVectors are mostly used for embeddings)

    features.setDenseComponent(0, new double[]{
        variable2Assignment * 2 + variable7Assignment,
        variable2Assignment + variable7Assignment * 2,
    });

    // Add a sparse (one-hot) feature as feature 1

    int sparseIndex = variable2Assignment;
    double sparseValue = 1.0;
    features.setSparseComponent(1, sparseIndex, sparseValue);

    // Return our feature set to complete the closure

    return features;
});
-------------

And that's all there is to it. Just repeat this several times to populate a GraphicalModel for whatever problem you've
got.

#####################################################

Training a set of weights

#####################################################

Assuming you've got a bunch of GraphicalModel objects, there's not much you need to do to train a system. First, you
need to provide labels for your variables in a form the training system understands. To do this, you get the
HashMap<String,String> object that holds each variable's metadata, and put in labels that the LogLikelihood system
understands. You must label every variable that is mentioned in any factor in your model.

-------------
GraphicalModel model = new GraphicalModel();

// Omitted model construction, see previous section

// Tell LogLikelihood to treat variable 2 as having the assignment "1" in training labels

model.getVariableMetadataByReference(2).put(LogLikelihoodFunction.VARIABLE_TRAINING_VALUE, "1");

// Tell LogLikelihood to treat variable 7 as having the assignment "0" in training labels

model.getVariableMetadataByReference(7).put(LogLikelihoodFunction.VARIABLE_TRAINING_VALUE, "0");
-------------

Once you've got an array of labeled models, training is pretty straightforward. We create an optimizer, pass it a
LogLikelihoodFunction as the function to optimize, and the array of models as the data to optimize over. That returns
us the optimal set of weights.

-------------
GraphicalModel[] trainingSet = // omitted dataset construction;

// Create the optimizer we will use

AbstractBatchOptimizer opt = new BacktrackingAdaGradOptimizer();

// Call the optimizer with a dataset, a function to optimize, initial weights, and an l2 regularization constant

ConcatVector weights = opt.optimize(trainingSet, new LogLikelihoodFunction(), new ConcatVector(0), 0.1);
-------------

We can then use these weights for inference.

#####################################################

Inference

#####################################################

Inference is easy once we have a set of weights we want to use. We simply create a CliqueTree from the model we want to
run inference on and the weights we want to use, and then ask it for inference results.

-------------
GraphicalModel model = // see previous section;
ConcatVector weights = // see first section;

CliqueTree tree = new CliqueTree(model, weights);

int[] mapAssignment = tree.calculateMAP();
double[][] marginalProbabilities = tree.calculateMarginals();
-------------

The MAP assignment comes back as an array of assignments, where the assignment for variable 0 is at index 0, variable 1
is at index 1, and so forth. The marginalProbabilities array is organized the same way, except that instead of an int
assignment there is an array of doubles, one for each possible assignment of the variable at that index, giving the
global marginal probabilities.
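
For example, with a model containing the variable 2 (3 states) and variable 7 (2 states) from the first section:

-------------
int variable2MAP = mapAssignment[2];  // one of {0, 1, 2}, since variable 2 has 3 states

double probVariable7Is1 = marginalProbabilities[7][1];  // marginal probability that variable 7 takes value 1
-------------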

@@ -0,0 +1,9 @@

For an explanation of how everything fits together, see ARCH.txt

For a quick runnable object, go run edu.stanford.nlp.loglinear.learning.CoNLLBenchmark in core's test package.

For a tutorial, see QUICKSTART.txt

For a brief overview of testing for this package (which was quite thorough), see TESTING.txt

Look at OPTIMIZATION.txt for an overview of what exists, what's been done, and what we know about further optimizations.

@@ -0,0 +1,16 @@

The testing for the loglinear package uses functional invariants over randomly generated inputs, which in general yields
much more durable software.

This is aided by the JUnit port of Quickcheck. The general tactic for testing is to randomly generate inputs, then use
the slow definitional approach on tractably small inputs and test that the output of our algorithms always matches
exactly. The GitHub for the Quickcheck port is https://github.com/pholser/junit-quickcheck. The dependencies for that
are in the lib/ folder in test1.

Some of the general testing approaches are listed below:

message passing -> tested against brute force factor multiplication and marginalization
partition function -> tested against brute force multiplication and summation
log likelihood gradient -> tested against definition of derivative
optimization -> tested by making thousands of random perturbations around the function to check if any values are better
concatVector -> tested against a non-sparse version
table factor -> tested against functional invariants of results, and re-implementations using different algorithms
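
As a concrete illustration of the slow-definitional tactic, here is a sketch of what one such test might look like. The
real tests use junit-quickcheck generators rather than a hand-rolled Random, and the dotProduct method name is an
assumption for illustration:

-------------
import java.util.Random;

import org.junit.Test;
import static org.junit.Assert.assertEquals;

import edu.stanford.nlp.loglinear.model.ConcatVector;

public class ConcatVectorDefinitionalTest {
  @Test
  public void dotProductMatchesDefinitionalSum() {
    Random r = new Random(42);
    for (int trial = 0; trial < 1000; trial++) {
      // Randomly generate a small, tractable input
      double[] a = new double[5];
      double[] b = new double[5];
      for (int i = 0; i < 5; i++) {
        a[i] = r.nextDouble();
        b[i] = r.nextDouble();
      }

      ConcatVector va = new ConcatVector(1);
      va.setDenseComponent(0, a);
      ConcatVector vb = new ConcatVector(1);
      vb.setDenseComponent(0, b);

      // The slow, definitional version of the same quantity
      double definitional = 0.0;
      for (int i = 0; i < 5; i++) {
        definitional += a[i] * b[i];
      }

      assertEquals(definitional, va.dotProduct(vb), 1e-9);  // method name assumed
    }
  }
}
-------------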

10 changes: 6 additions & 4 deletions
itest/src/edu/stanford/nlp/pipeline/ProtobufAnnotationSerializerSlowITest.java
Large diffs are not rendered by default.