rss.xml

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recursively Boosted</title><link>https://dpkatz.github.io/</link><description>Functional Programming, Machine Learning, and anything else that catches my interest...</description><atom:link href="https://dpkatz.github.io/rss.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2018 &lt;a href="mailto:dpkatz.bc@gmail.com"&gt;Dan Katz&lt;/a&gt; </copyright><lastBuildDate>Fri, 22 Jun 2018 21:00:10 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Using LightGBM from Haskell</title><link>https://dpkatz.github.io/posts/using-lightgbm-from-haskell/</link><dc:creator>Dan Katz</dc:creator><description>&lt;div id="outline-container-orgf62600a" class="outline-2"&gt;
&lt;h2 id="orgf62600a"&gt;Summary&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-orgf62600a"&gt;
&lt;blockquote&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;Easily run LightGBM from Haskell&lt;/li&gt;
&lt;li&gt;Improved parameter value safety thanks to refined types&lt;/li&gt;
&lt;li&gt;Improved parameter grouping safety thanks to algebraic data types&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org632b75d" class="outline-2"&gt;
&lt;h2 id="org632b75d"&gt;Introduction&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org632b75d"&gt;
&lt;p&gt;
One of the most popular recent approaches to tree boosting as a
machine learning approach is &lt;a href="https://github.com/Microsoft/LightGBM"&gt;LightGBM&lt;/a&gt;.  It's fast, performs well on
various sorts of regression and classification problems, and has an
open-source implementation under active development which includes
APIs in Python and R as well as a community-provided &lt;a href="https://github.com/Allardvm/LightGBM.jl"&gt;Julia wrapper&lt;/a&gt;.
It does not, however, do anything to make life easy for the average
Haskell programmer.  Let's change that.
&lt;/p&gt;

&lt;p&gt;
I've created a &lt;a href="https://github.com/dpkatz/HaskellGBM"&gt;Haskell wrapper around LightGBM&lt;/a&gt; which makes it easy to
call the library from a Haskell program.  Beyond simply making it
possible to use LightGBM from Haskell, however, I've tried to make
this interface to the library relatively safe.  In partcular:
&lt;/p&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;Many LightGBM parameters are supposed to be limited to defined
sets of values (e.g. fractions in the range \([0, 1]\), numbers in
the range \([1, 2)\), or strictly positive integers).  In this
Haskell interface I use refinement types to ensure that these
limitations are honored - at compile time when possible, at run
time when necessary.&lt;/li&gt;
&lt;li&gt;Many LightGBM parameters are only supposed to be used in a subset
of cases (e.g. for a particular objective or with a particular
metric function).  In this Haskell interface I use algebraic data
types to try to ensure that combinations of parameters that don't
make sense cannot be represented.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
I hope that these decisions make the library safer and easier to use
than it would otherwise be.
&lt;/p&gt;

&lt;p&gt;
So let's take a look at how to put the library to use.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org2973dfe" class="outline-2"&gt;
&lt;h2 id="org2973dfe"&gt;How to use the library&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org2973dfe"&gt;
&lt;p&gt;
We'll apply LightGBM to the venerable Kaggle problem of predicting &lt;a href="https://www.kaggle.com/c/titanic"&gt;who
will survive the sinking of the Titanic&lt;/a&gt;.  The code and data for this
exercise can be found in the &lt;a href="https://github.com/dpkatz/HaskellGBM/tree/master/examples/titanic"&gt;Titanic example subdirectory of the
GitHub repo&lt;/a&gt;.&lt;sup&gt;&lt;a id="fnr.1" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.1"&gt;1&lt;/a&gt;&lt;/sup&gt;
&lt;/p&gt;
&lt;/div&gt;


&lt;div id="outline-container-org01289b7" class="outline-3"&gt;
&lt;h3 id="org01289b7"&gt;Data Preparation&lt;/h3&gt;
&lt;div class="outline-text-3" id="text-org01289b7"&gt;
&lt;p&gt;
We start by taking the &lt;a href="https://www.kaggle.com/c/titanic/data"&gt;Kaggle training data&lt;/a&gt; and splitting it into two
parts - one for training and one for validation.&lt;sup&gt;&lt;a id="fnr.2" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.2"&gt;2&lt;/a&gt;&lt;/sup&gt; In the &lt;a href="https://github.com/dpkatz/HaskellGBM/tree/master/examples/titanic"&gt;repo&lt;/a&gt;,
you'll find these files named &lt;code&gt;train_part.csv&lt;/code&gt; and
&lt;code&gt;validate_part.csv&lt;/code&gt;.  
&lt;/p&gt;

&lt;p&gt;
Here's a sample of the first few rows of the training data:
&lt;/p&gt;
&lt;table border="2" cellspacing="0" cellpadding="6" rules="all" frame="border"&gt;


&lt;colgroup&gt;
&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-left"&gt;

&lt;col class="org-left"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-left"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-left"&gt;

&lt;col class="org-left"&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th scope="col" class="org-right"&gt;PassengerId&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Survived&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Pclass&lt;/th&gt;
&lt;th scope="col" class="org-left"&gt;Name&lt;/th&gt;
&lt;th scope="col" class="org-left"&gt;Sex&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Age&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;SibSp&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Parch&lt;/th&gt;
&lt;th scope="col" class="org-left"&gt;Ticket&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Fare&lt;/th&gt;
&lt;th scope="col" class="org-left"&gt;Cabin&lt;/th&gt;
&lt;th scope="col" class="org-left"&gt;Embarked&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-right"&gt;3&lt;/td&gt;
&lt;td class="org-left"&gt;Braund, Mr. Owen Harris&lt;/td&gt;
&lt;td class="org-left"&gt;male&lt;/td&gt;
&lt;td class="org-right"&gt;22&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-left"&gt;A/5 21171&lt;/td&gt;
&lt;td class="org-right"&gt;7.25&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;/td&gt;
&lt;td class="org-left"&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-right"&gt;2&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-left"&gt;Cumings, Mrs. John Bradley (Florence Briggs Thayer)&lt;/td&gt;
&lt;td class="org-left"&gt;female&lt;/td&gt;
&lt;td class="org-right"&gt;38&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-left"&gt;PC 17599&lt;/td&gt;
&lt;td class="org-right"&gt;71.2833&lt;/td&gt;
&lt;td class="org-left"&gt;C85&lt;/td&gt;
&lt;td class="org-left"&gt;C&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-right"&gt;3&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;3&lt;/td&gt;
&lt;td class="org-left"&gt;Heikkinen, Miss. Laina&lt;/td&gt;
&lt;td class="org-left"&gt;female&lt;/td&gt;
&lt;td class="org-right"&gt;26&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-left"&gt;STON/O2. 3101282&lt;/td&gt;
&lt;td class="org-right"&gt;7.925&lt;/td&gt;
&lt;td class="org-left"&gt; &lt;/td&gt;
&lt;td class="org-left"&gt;S&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;
The next step is to note that LightGBM requires that categorical
features be encoded as integer values, so we'll need to convert some
of the columns (e.g. &lt;code&gt;Sex&lt;/code&gt;, &lt;code&gt;Embarked&lt;/code&gt;) into integers.  In my case, I
wrote some code using Haskell's &lt;a href="http://hackage.haskell.org/package/cassava"&gt;Cassava CSV library&lt;/a&gt; to do the
conversion,&lt;sup&gt;&lt;a id="fnr.3" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.3"&gt;3&lt;/a&gt;&lt;/sup&gt; but any approach you want to use is fine.
&lt;/p&gt;

&lt;p&gt;
We should also get rid of any columns which will induce spurious
correlations or which have so little data that we can't use them.  In
our case, that would include &lt;code&gt;Name&lt;/code&gt; and &lt;code&gt;Ticket&lt;/code&gt; for probable
irrelevance, and &lt;code&gt;Cabin&lt;/code&gt; for lack of a reasonable amount of
data.&lt;sup&gt;&lt;a id="fnr.4" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.4"&gt;4&lt;/a&gt;&lt;/sup&gt; We end up with a filtered dataset that looks like this:
&lt;/p&gt;

&lt;table border="2" cellspacing="0" cellpadding="6" rules="all" frame="border"&gt;


&lt;colgroup&gt;
&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;

&lt;col class="org-right"&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th scope="col" class="org-right"&gt;Survived&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;PassengerId&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Pclass&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Sex&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Age&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;SibSp&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Parch&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Fare&lt;/th&gt;
&lt;th scope="col" class="org-right"&gt;Embarked&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;3&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;22.0&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-right"&gt;7.25&lt;/td&gt;
&lt;td class="org-right"&gt;2&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;2&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-right"&gt;38.0&lt;/td&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-right"&gt;71.2833&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class="org-right"&gt;1&lt;/td&gt;
&lt;td class="org-right"&gt;3&lt;/td&gt;
&lt;td class="org-right"&gt;3&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-right"&gt;26.0&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-right"&gt;0&lt;/td&gt;
&lt;td class="org-right"&gt;7.925&lt;/td&gt;
&lt;td class="org-right"&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;
Note that our dependent variable (a.k.a. what we're trying to predict)
is in the 0th column - this is the default data format expected by
LightGBM.  
&lt;/p&gt;

&lt;p&gt;
Also note that we still have the &lt;code&gt;PassengerId&lt;/code&gt; field, even though
that's in no way relevant to the problem.  We keep it around since the
final submission file format for Kaggle asks for it.  To keep it from
influencing the learning algorithm, we'll use a parameter to ask
LightGBM to ignore it.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-orgf218eea" class="outline-3"&gt;
&lt;h3 id="orgf218eea"&gt;Training the booster&lt;/h3&gt;
&lt;div class="outline-text-3" id="text-orgf218eea"&gt;
&lt;p&gt;
Training the booster is pretty straightforward - feed the training
data into the training algorithm, give a few hyperparameters, and let
it rip.  The gory details are in the &lt;a href="https://github.com/dpkatz/HaskellGBM/blob/master/examples/titanic/Main.hs"&gt;example code&lt;/a&gt;, but here are the
highlights.  
&lt;/p&gt;
&lt;/div&gt;

&lt;div id="outline-container-orga89e5e3" class="outline-4"&gt;
&lt;h4 id="orga89e5e3"&gt;Imports&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-orga89e5e3"&gt;
&lt;p&gt;
First, some imports to get the LightGBM library and CSV filtering
logic:
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kr"&gt;import&lt;/span&gt; &lt;span class="k"&gt;qualified&lt;/span&gt; &lt;span class="nn"&gt;LightGBM&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;LGBM&lt;/span&gt;
&lt;span class="kr"&gt;import&lt;/span&gt; &lt;span class="k"&gt;qualified&lt;/span&gt; &lt;span class="nn"&gt;LightGBM.DataSet&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;DS&lt;/span&gt;
&lt;span class="kr"&gt;import&lt;/span&gt; &lt;span class="k"&gt;qualified&lt;/span&gt; &lt;span class="nn"&gt;LightGBM.Parameters&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;
&lt;span class="kr"&gt;import&lt;/span&gt;           &lt;span class="nn"&gt;ConvertData&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;csvFilter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;testFilter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-orgcd03c1d" class="outline-4"&gt;
&lt;h4 id="orgcd03c1d"&gt;Training Parameters&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-orgcd03c1d"&gt;
&lt;p&gt;
Next, the parameters that control the training of the booster:
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;trainParams&lt;/span&gt; &lt;span class="ow"&gt;::&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;trainParams&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;Objective&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;BinaryClassification&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;Metric&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;BinaryLogloss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;AUC&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;TrainingMetric&lt;/span&gt; &lt;span class="kt"&gt;True&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;LearningRate&lt;/span&gt; &lt;span class="o"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;refineTH&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;NumLeaves&lt;/span&gt; &lt;span class="o"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;refineTH&lt;/span&gt; &lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;FeatureFraction&lt;/span&gt; &lt;span class="o"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;refineTH&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;BaggingFreq&lt;/span&gt; &lt;span class="o"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;refineTH&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;BaggingFraction&lt;/span&gt; &lt;span class="o"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;refineTH&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;MinDataInLeaf&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;MinSumHessianInLeaf&lt;/span&gt; &lt;span class="o"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;refineTH&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;IsSparse&lt;/span&gt; &lt;span class="kt"&gt;True&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;LabelColumn&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;ColName&lt;/span&gt; &lt;span class="s"&gt;"Survived"&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;IgnoreColumns&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;ColName&lt;/span&gt; &lt;span class="s"&gt;"PassengerId"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;CategoricalFeatures&lt;/span&gt;
      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;ColName&lt;/span&gt; &lt;span class="s"&gt;"Pclass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;ColName&lt;/span&gt; &lt;span class="s"&gt;"Sex"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;ColName&lt;/span&gt; &lt;span class="s"&gt;"Embarked"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;
There are several things to note here:  
&lt;/p&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;I use the &lt;a href="https://hackage.haskell.org/package/refined"&gt;refined&lt;/a&gt; library to limit some of the parameters to their
proper domains.  For example, the &lt;code&gt;FeatureFraction&lt;/code&gt; should be in
the range \((0, 1]\), and by using a refinement of the &lt;code&gt;Double&lt;/code&gt; type
I can ensure that it's so at compile time (with that &lt;code&gt;$$(refineTH
    ...)&lt;/code&gt; Template Haskell stuff).&lt;sup&gt;&lt;a id="fnr.5" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.5"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;LightGBM multi-parameters are converted into lists (e.g. the
&lt;code&gt;Metric&lt;/code&gt; parameter).&lt;/li&gt;
&lt;li&gt;LightGBM enumerated parameters are turned into equivalent sum
types (e.g. the &lt;code&gt;Objective&lt;/code&gt; parameter).&lt;/li&gt;
&lt;li&gt;Column selection is based on a sum type rather than a string
prefix as is done in the standard LightGBM parameters (e.g. in the
&lt;code&gt;LabelColumn&lt;/code&gt; parameter).&lt;/li&gt;
&lt;li&gt;We can select which column contains the "labels" (the dependent
quantity being predicted) with the &lt;code&gt;LabelColumn&lt;/code&gt; parameter.&lt;/li&gt;
&lt;li&gt;We can ignore some columns that we might be carrying along just
for reporting purposes using the &lt;code&gt;IgnoreColumns&lt;/code&gt; parameter.&lt;/li&gt;
&lt;li&gt;Categorical features are encoded as integers, so we have to signal
explicitly to LightGBM whether a feature is categorical (i.e. it's
just an enum of a finite set of values) or not (i.e. it's a
numerical value of some sort).  We do this with the
&lt;code&gt;CategoricalFeatures&lt;/code&gt; paremeter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
More generally, note that the parameters module does some parameter
bundling to ensure that nonsensical combinations of parameters don't
occur.  For instance, the &lt;code&gt;NumClasses&lt;/code&gt; parameter can only be set with
the &lt;code&gt;MultiClass&lt;/code&gt; application.  This is a break from the flat parameter
space of the underlying LightGBM library where ensuring parameter
coherence is up to the user.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org465a63d" class="outline-4"&gt;
&lt;h4 id="org465a63d"&gt;Loading Data&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-org465a63d"&gt;
&lt;p&gt;
The library provides a simple interface to load data from a CSV file
with an optional header into a &lt;code&gt;DataSet&lt;/code&gt; for use with the algorithm.
In our case, all of the files have headers so a simple helper function
is in order.
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;loadData&lt;/span&gt; &lt;span class="ow"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;FilePath&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;DS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;DataSet&lt;/span&gt;
&lt;span class="nf"&gt;loadData&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;DS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readCsvFile&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;DS&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;HasHeader&lt;/span&gt; &lt;span class="kt"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org6706bf8" class="outline-4"&gt;
&lt;h4 id="org6706bf8"&gt;Training&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-org6706bf8"&gt;
&lt;p&gt;
I create a couple of temporary files to hold the filtered data (I'm
doing the filtering inline - I could also have filtered the data
out-of-band, saved them, and then fed them in directly).
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;trainModel&lt;/span&gt; &lt;span class="ow"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;IO&lt;/span&gt; &lt;span class="kt"&gt;LGBM&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;Model&lt;/span&gt;
&lt;span class="nf"&gt;trainModel&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt;
  &lt;span class="kt"&gt;TMP&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withSystemTempFile&lt;/span&gt; &lt;span class="s"&gt;"filtered_train"&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt; &lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;trainFile&lt;/span&gt; &lt;span class="n"&gt;trainHandle&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kr"&gt;do&lt;/span&gt;
    &lt;span class="kr"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;csvFilter&lt;/span&gt; &lt;span class="s"&gt;"train_part.csv"&lt;/span&gt; &lt;span class="n"&gt;trainHandle&lt;/span&gt;
    &lt;span class="n"&gt;hClose&lt;/span&gt; &lt;span class="n"&gt;trainHandle&lt;/span&gt;
    &lt;span class="kt"&gt;TMP&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withSystemTempFile&lt;/span&gt; &lt;span class="s"&gt;"filtered_val"&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt; &lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;valFile&lt;/span&gt; &lt;span class="n"&gt;valHandle&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kr"&gt;do&lt;/span&gt;
      &lt;span class="kr"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;csvFilter&lt;/span&gt; &lt;span class="s"&gt;"validate_part.csv"&lt;/span&gt; &lt;span class="n"&gt;valHandle&lt;/span&gt;
      &lt;span class="n"&gt;hClose&lt;/span&gt; &lt;span class="n"&gt;valHandle&lt;/span&gt;
      &lt;span class="kr"&gt;let&lt;/span&gt; &lt;span class="n"&gt;trainingData&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loadData&lt;/span&gt; &lt;span class="n"&gt;trainFile&lt;/span&gt;
	  &lt;span class="n"&gt;validationData&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loadData&lt;/span&gt; &lt;span class="n"&gt;valFile&lt;/span&gt;
	  &lt;span class="n"&gt;predictionFile&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"LightGBM_predict_result.txt"&lt;/span&gt;
	  &lt;span class="n"&gt;modelName&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"LightGBM_model.txt"&lt;/span&gt;
      &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt;
	&lt;span class="kt"&gt;LGBM&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trainNewModel&lt;/span&gt; &lt;span class="n"&gt;modelName&lt;/span&gt; &lt;span class="n"&gt;trainParams&lt;/span&gt; &lt;span class="n"&gt;trainingData&lt;/span&gt; &lt;span class="n"&gt;validationData&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
      &lt;span class="kr"&gt;case&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="kr"&gt;of&lt;/span&gt;
	&lt;span class="kt"&gt;Left&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="ne"&gt;error&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt; &lt;span class="s"&gt;"Error training model:  "&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
	&lt;span class="kt"&gt;Right&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kr"&gt;do&lt;/span&gt;
	  &lt;span class="kr"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="kt"&gt;LGBM&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeModelFile&lt;/span&gt; &lt;span class="n"&gt;modelFile&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
	  &lt;span class="c1"&gt;-- [... a bit of self validation code elided here ...]&lt;/span&gt;
	  &lt;span class="n"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;
Note how we use the training data and the validation data in the the
training cycle.
&lt;/p&gt;

&lt;p&gt;
The effect of this code is to train a model, write the model out to
the &lt;code&gt;modelName&lt;/code&gt; file for future use, and return the model for
immediate use (or return an error-log in case there was an error
during training).
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-orgc0f984f" class="outline-4"&gt;
&lt;h4 id="orgc0f984f"&gt;Predicting&lt;/h4&gt;
&lt;div class="outline-text-4" id="text-orgc0f984f"&gt;
&lt;p&gt;
Now that we have the model, we can use it to predict the fate of other
passengers.  Here we go:
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt; &lt;span class="ow"&gt;::&lt;/span&gt; &lt;span class="kt"&gt;IO&lt;/span&gt; &lt;span class="nb"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt; &lt;span class="ow"&gt;=&lt;/span&gt; &lt;span class="kr"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;cwd&lt;/span&gt; &lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="kt"&gt;SD&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getCurrentDirectory&lt;/span&gt;
  &lt;span class="kt"&gt;SD&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withCurrentDirectory&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cwd&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;/&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"examples"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;/&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"titanic"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;do&lt;/span&gt;
	&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;trainModel&lt;/span&gt;

	&lt;span class="kt"&gt;TMP&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withSystemTempFile&lt;/span&gt; &lt;span class="s"&gt;"filtered_test"&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt; &lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;testFile&lt;/span&gt; &lt;span class="n"&gt;testHandle&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kr"&gt;do&lt;/span&gt;
	  &lt;span class="kr"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;testFilter&lt;/span&gt; &lt;span class="s"&gt;"test.csv"&lt;/span&gt; &lt;span class="n"&gt;testHandle&lt;/span&gt;
	  &lt;span class="n"&gt;hClose&lt;/span&gt; &lt;span class="n"&gt;testHandle&lt;/span&gt;
	  &lt;span class="kt"&gt;TMP&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withSystemTempFile&lt;/span&gt; &lt;span class="s"&gt;"predictions"&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt; &lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;predFile&lt;/span&gt; &lt;span class="n"&gt;predHandle&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kr"&gt;do&lt;/span&gt;
	    &lt;span class="n"&gt;hClose&lt;/span&gt; &lt;span class="n"&gt;predHandle&lt;/span&gt;

	    &lt;span class="n"&gt;predResults&lt;/span&gt; &lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="kt"&gt;LGBM&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="kt"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loadData&lt;/span&gt; &lt;span class="n"&gt;testFile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
	    &lt;span class="kr"&gt;case&lt;/span&gt; &lt;span class="n"&gt;predResults&lt;/span&gt; &lt;span class="kr"&gt;of&lt;/span&gt;
	      &lt;span class="kt"&gt;Left&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="ne"&gt;error&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt; &lt;span class="s"&gt;"Error predicting final results:  "&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
	      &lt;span class="kt"&gt;Right&lt;/span&gt; &lt;span class="n"&gt;predValues&lt;/span&gt; &lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kr"&gt;do&lt;/span&gt;
		&lt;span class="kt"&gt;LGBM&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeCsvFile&lt;/span&gt; &lt;span class="n"&gt;predFile&lt;/span&gt; &lt;span class="n"&gt;predValues&lt;/span&gt;

	  &lt;span class="c1"&gt;-- [... some code to report the output in Kaggle format elided ...]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;
The model output will go to the &lt;code&gt;predFile&lt;/code&gt; where it can be used for
further processing (e.g. massaging into the proper format for
submitting to Kaggle.).
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org2c32b9e" class="outline-2"&gt;
&lt;h2 id="org2c32b9e"&gt;Caveats&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org2c32b9e"&gt;
&lt;p&gt;
This interface to the LightGBM library is fundamentally a wrapper
around the command-line interface to LightGBM, which makes it rather
heavily embedded in the &lt;code&gt;IO&lt;/code&gt; type and heavily dependent on the file
system.  The file system dependence is not particularly bad - data
sets and models in the machine learning space are typically large
enough that you'd want to have them persisted to disk anyway - but it
perhaps gives an odd feel to the wrapper API.  Most wrappers around
LightGBM use foreign function calls to the C API and pass data
structures in directly (e.g. as Pandas or R data frames); I might do
something like that in the future if it looks like it would help
matters.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

&lt;div id="outline-container-org39742b3" class="outline-2"&gt;
&lt;h2 id="org39742b3"&gt;Future directions?&lt;/h2&gt;
&lt;div class="outline-text-2" id="text-org39742b3"&gt;
&lt;p&gt;
The wrapper presented here is still very rudimentary, and many
improvements should be made.  For example:
&lt;/p&gt;
&lt;ul class="org-ul"&gt;
&lt;li&gt;Add the library to Hackage&lt;/li&gt;
&lt;li&gt;Grid search for parameter tuning&lt;/li&gt;
&lt;li&gt;Cross-validation support&lt;/li&gt;
&lt;li&gt;Better validation metrics&lt;/li&gt;
&lt;li&gt;Using the C API via the Haskell FFI rather than wrapping the
command line interface&lt;/li&gt;
&lt;li&gt;&lt;b&gt;DONE&lt;/b&gt; Provide a &lt;a href="http://hackage.haskell.org/package/Frames"&gt;data frame&lt;/a&gt; interface to DataSet allowing us to
use a data frame directly as input and extract a dataframe as
output.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id="footnotes"&gt;
&lt;h2 class="footnotes"&gt;Footnotes: &lt;/h2&gt;
&lt;div id="text-footnotes"&gt;

&lt;div class="footdef"&gt;&lt;sup&gt;&lt;a id="fn.1" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.1"&gt;1&lt;/a&gt;&lt;/sup&gt; &lt;div class="footpara"&gt;&lt;p class="footpara"&gt;
Please note that this article is only meant to introduce you to
the Haskell wrapper for LightGBM in the simplest way possible! The
approach taken in this article to the actual machine learning problem
at hand is intentionally naive to make sure that the exposition of the
library doesn't get lost in the complexity of handling a machine
learning problem properly.  Do not take this approach as a tutorial in
how to approach machine learning in general!  A few of the most
egregious simplifications will be called out in the footnotes.
&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class="footdef"&gt;&lt;sup&gt;&lt;a id="fn.2" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.2"&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;div class="footpara"&gt;&lt;p class="footpara"&gt;
This "holdout" approach is not a particularly good validation
method, but it's simple to implement.  Some high-level interfaces to
LightGBM provide support for cross-validation, and I might supply that
too eventually.
&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class="footdef"&gt;&lt;sup&gt;&lt;a id="fn.3" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.3"&gt;3&lt;/a&gt;&lt;/sup&gt; &lt;div class="footpara"&gt;&lt;p class="footpara"&gt;
Take a look at &lt;code&gt;ConvertData.hs&lt;/code&gt; in the &lt;a href="https://github.com/dpkatz/HaskellGBM/blob/master/examples/titanic/ConvertData.hs"&gt;repo&lt;/a&gt; if you're
interested.
&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class="footdef"&gt;&lt;sup&gt;&lt;a id="fn.4" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.4"&gt;4&lt;/a&gt;&lt;/sup&gt; &lt;div class="footpara"&gt;&lt;p class="footpara"&gt;
I leave open the possibility of engineering features on the
basis of these columns (e.g. using titles from names as a proxy for
social class, or correlating cabin numbers to particular locations on
the ship); I'm just saying that leaving these columns as they are
doesn't give us any useful information for a feature.
&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class="footdef"&gt;&lt;sup&gt;&lt;a id="fn.5" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.5"&gt;5&lt;/a&gt;&lt;/sup&gt; &lt;div class="footpara"&gt;&lt;p class="footpara"&gt;
Note that to use the Template Haskell approach we need to
enable the &lt;code&gt;TemplateHaskell&lt;/code&gt; extension using a file pragma (as is done
in the example files) or similar approach.  You always have the option
of using &lt;code&gt;refine&lt;/code&gt; instead of &lt;code&gt;refineTH&lt;/code&gt; in which case Template Haskell
is not needed and the bounds checks will happen at runtime rather than
compile time.
&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;


&lt;/div&gt;
&lt;/div&gt;</description><category>boosting</category><category>Haskell</category><category>LightGBM</category><category>Machine Learning</category><guid>https://dpkatz.github.io/posts/using-lightgbm-from-haskell/</guid><pubDate>Thu, 31 May 2018 15:22:08 GMT</pubDate></item><item><title>A simple beginning</title><link>https://dpkatz.github.io/posts/a-simple-beginning/</link><dc:creator>Dan Katz</dc:creator><description>&lt;p&gt;
It all started with a simple classification problem: Given a document,
identify what company it was about.  Here are 10&lt;sup&gt;6&lt;/sup&gt; documents we know
how to classify, and here are 10&lt;sup&gt;5&lt;/sup&gt; we don't.  A standard undergrad
homework exercise in machine learning, they said.
&lt;/p&gt;

&lt;p&gt;
Little did I know I was headed down the rabbit hole…
&lt;/p&gt;

&lt;p&gt;
A year later, I feel like I'm just beginning to get my head around all
this and I'm going to try to get it down in a form that makes some
sense to me.
&lt;/p&gt;</description><category>blog</category><guid>https://dpkatz.github.io/posts/a-simple-beginning/</guid><pubDate>Sat, 14 Apr 2018 20:20:56 GMT</pubDate></item></channel></rss>