-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathrss.xml
570 lines (507 loc) · 37.9 KB
/
rss.xml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recursively Boosted</title><link>https://dpkatz.github.io/</link><description>Functional Programming, Machine Learning, and anything else that catches my interest...</description><atom:link href="https://dpkatz.github.io/rss.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2018 <a href="mailto:[email protected]">Dan Katz</a> </copyright><lastBuildDate>Fri, 22 Jun 2018 21:00:10 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Using LightGBM from Haskell</title><link>https://dpkatz.github.io/posts/using-lightgbm-from-haskell/</link><dc:creator>Dan Katz</dc:creator><description><div id="outline-container-orgf62600a" class="outline-2">
<h2 id="orgf62600a">Summary</h2>
<div class="outline-text-2" id="text-orgf62600a">
<blockquote>
<ul class="org-ul">
<li>Easily run LightGBM from Haskell</li>
<li>Improved parameter value safety thanks to refined types</li>
<li>Improved parameter grouping safety thanks to algebraic data types</li>
</ul>
</blockquote>
</div>
</div>
<div id="outline-container-org632b75d" class="outline-2">
<h2 id="org632b75d">Introduction</h2>
<div class="outline-text-2" id="text-org632b75d">
<p>
One of the most popular recent approaches to tree boosting as a
machine learning approach is <a href="https://github.com/Microsoft/LightGBM">LightGBM</a>. It's fast, performs well on
various sorts of regression and classification problems, and has an
open-source implementation under active development which includes
APIs in Python and R as well as a community-provided <a href="https://github.com/Allardvm/LightGBM.jl">Julia wrapper</a>.
It does not, however, do anything to make life easy for the average
Haskell programmer. Let's change that.
</p>
<p>
I've created a <a href="https://github.com/dpkatz/HaskellGBM">Haskell wrapper around LightGBM</a> which makes it easy to
call the library from a Haskell program. Beyond simply making it
possible to use LightGBM from Haskell, however, I've tried to make
this interface to the library relatively safe. In partcular:
</p>
<ul class="org-ul">
<li>Many LightGBM parameters are supposed to be limited to defined
sets of values (e.g. fractions in the range \([0, 1]\), numbers in
the range \([1, 2)\), or strictly positive integers). In this
Haskell interface I use refinement types to ensure that these
limitations are honored - at compile time when possible, at run
time when necessary.</li>
<li>Many LightGBM parameters are only supposed to be used in a subset
of cases (e.g. for a particular objective or with a particular
metric function). In this Haskell interface I use algebraic data
types to try to ensure that combinations of parameters that don't
make sense cannot be represented.</li>
</ul>
<p>
I hope that these decisions make the library safer and easier to use
than it would otherwise be.
</p>
<p>
So let's take a look at how to put the library to use.
</p>
</div>
</div>
<div id="outline-container-org2973dfe" class="outline-2">
<h2 id="org2973dfe">How to use the library</h2>
<div class="outline-text-2" id="text-org2973dfe">
<p>
We'll apply LightGBM to the venerable Kaggle problem of predicting <a href="https://www.kaggle.com/c/titanic">who
will survive the sinking of the Titanic</a>. The code and data for this
exercise can be found in the <a href="https://github.com/dpkatz/HaskellGBM/tree/master/examples/titanic">Titanic example subdirectory of the
GitHub repo</a>.<sup><a id="fnr.1" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.1">1</a></sup>
</p>
</div>
<div id="outline-container-org01289b7" class="outline-3">
<h3 id="org01289b7">Data Preparation</h3>
<div class="outline-text-3" id="text-org01289b7">
<p>
We start by taking the <a href="https://www.kaggle.com/c/titanic/data">Kaggle training data</a> and splitting it into two
parts - one for training and one for validation.<sup><a id="fnr.2" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.2">2</a></sup> In the <a href="https://github.com/dpkatz/HaskellGBM/tree/master/examples/titanic">repo</a>,
you'll find these files named <code>train_part.csv</code> and
<code>validate_part.csv</code>.
</p>
<p>
Here's a sample of the first few rows of the training data:
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="all" frame="border">
<colgroup>
<col class="org-right">
<col class="org-right">
<col class="org-right">
<col class="org-left">
<col class="org-left">
<col class="org-right">
<col class="org-right">
<col class="org-right">
<col class="org-left">
<col class="org-right">
<col class="org-left">
<col class="org-left">
</colgroup>
<thead>
<tr>
<th scope="col" class="org-right">PassengerId</th>
<th scope="col" class="org-right">Survived</th>
<th scope="col" class="org-right">Pclass</th>
<th scope="col" class="org-left">Name</th>
<th scope="col" class="org-left">Sex</th>
<th scope="col" class="org-right">Age</th>
<th scope="col" class="org-right">SibSp</th>
<th scope="col" class="org-right">Parch</th>
<th scope="col" class="org-left">Ticket</th>
<th scope="col" class="org-right">Fare</th>
<th scope="col" class="org-left">Cabin</th>
<th scope="col" class="org-left">Embarked</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-right">1</td>
<td class="org-right">0</td>
<td class="org-right">3</td>
<td class="org-left">Braund, Mr. Owen Harris</td>
<td class="org-left">male</td>
<td class="org-right">22</td>
<td class="org-right">1</td>
<td class="org-right">0</td>
<td class="org-left">A/5 21171</td>
<td class="org-right">7.25</td>
<td class="org-left"> </td>
<td class="org-left">S</td>
</tr>
<tr>
<td class="org-right">2</td>
<td class="org-right">1</td>
<td class="org-right">1</td>
<td class="org-left">Cumings, Mrs. John Bradley (Florence Briggs Thayer)</td>
<td class="org-left">female</td>
<td class="org-right">38</td>
<td class="org-right">1</td>
<td class="org-right">0</td>
<td class="org-left">PC 17599</td>
<td class="org-right">71.2833</td>
<td class="org-left">C85</td>
<td class="org-left">C</td>
</tr>
<tr>
<td class="org-right">3</td>
<td class="org-right">1</td>
<td class="org-right">3</td>
<td class="org-left">Heikkinen, Miss. Laina</td>
<td class="org-left">female</td>
<td class="org-right">26</td>
<td class="org-right">0</td>
<td class="org-right">0</td>
<td class="org-left">STON/O2. 3101282</td>
<td class="org-right">7.925</td>
<td class="org-left"> </td>
<td class="org-left">S</td>
</tr>
</tbody>
</table>
<p>
The next step is to note that LightGBM requires that categorical
features be encoded as integer values, so we'll need to convert some
of the columns (e.g. <code>Sex</code>, <code>Embarked</code>) into integers. In my case, I
wrote some code using Haskell's <a href="http://hackage.haskell.org/package/cassava">Cassava CSV library</a> to do the
conversion,<sup><a id="fnr.3" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.3">3</a></sup> but any approach you want to use is fine.
</p>
<p>
We should also get rid of any columns which will induce spurious
correlations or which have so little data that we can't use them. In
our case, that would include <code>Name</code> and <code>Ticket</code> for probable
irrelevance, and <code>Cabin</code> for lack of a reasonable amount of
data.<sup><a id="fnr.4" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.4">4</a></sup> We end up with a filtered dataset that looks like this:
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="all" frame="border">
<colgroup>
<col class="org-right">
<col class="org-right">
<col class="org-right">
<col class="org-right">
<col class="org-right">
<col class="org-right">
<col class="org-right">
<col class="org-right">
<col class="org-right">
</colgroup>
<thead>
<tr>
<th scope="col" class="org-right">Survived</th>
<th scope="col" class="org-right">PassengerId</th>
<th scope="col" class="org-right">Pclass</th>
<th scope="col" class="org-right">Sex</th>
<th scope="col" class="org-right">Age</th>
<th scope="col" class="org-right">SibSp</th>
<th scope="col" class="org-right">Parch</th>
<th scope="col" class="org-right">Fare</th>
<th scope="col" class="org-right">Embarked</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-right">0</td>
<td class="org-right">1</td>
<td class="org-right">3</td>
<td class="org-right">1</td>
<td class="org-right">22.0</td>
<td class="org-right">1</td>
<td class="org-right">0</td>
<td class="org-right">7.25</td>
<td class="org-right">2</td>
</tr>
<tr>
<td class="org-right">1</td>
<td class="org-right">2</td>
<td class="org-right">1</td>
<td class="org-right">0</td>
<td class="org-right">38.0</td>
<td class="org-right">1</td>
<td class="org-right">0</td>
<td class="org-right">71.2833</td>
<td class="org-right">0</td>
</tr>
<tr>
<td class="org-right">1</td>
<td class="org-right">3</td>
<td class="org-right">3</td>
<td class="org-right">0</td>
<td class="org-right">26.0</td>
<td class="org-right">0</td>
<td class="org-right">0</td>
<td class="org-right">7.925</td>
<td class="org-right">2</td>
</tr>
</tbody>
</table>
<p>
Note that our dependent variable (a.k.a. what we're trying to predict)
is in the 0th column - this is the default data format expected by
LightGBM.
</p>
<p>
Also note that we still have the <code>PassengerId</code> field, even though
that's in no way relevant to the problem. We keep it around since the
final submission file format for Kaggle asks for it. To keep it from
influencing the learning algorithm, we'll use a parameter to ask
LightGBM to ignore it.
</p>
</div>
</div>
<div id="outline-container-orgf218eea" class="outline-3">
<h3 id="orgf218eea">Training the booster</h3>
<div class="outline-text-3" id="text-orgf218eea">
<p>
Training the booster is pretty straightforward - feed the training
data into the training algorithm, give a few hyperparameters, and let
it rip. The gory details are in the <a href="https://github.com/dpkatz/HaskellGBM/blob/master/examples/titanic/Main.hs">example code</a>, but here are the
highlights.
</p>
</div>
<div id="outline-container-orga89e5e3" class="outline-4">
<h4 id="orga89e5e3">Imports</h4>
<div class="outline-text-4" id="text-orga89e5e3">
<p>
First, some imports to get the LightGBM library and CSV filtering
logic:
</p>
<div class="highlight"><pre><span></span><span class="kr">import</span> <span class="k">qualified</span> <span class="nn">LightGBM</span> <span class="k">as</span> <span class="n">LGBM</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">LightGBM.DataSet</span> <span class="k">as</span> <span class="n">DS</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">LightGBM.Parameters</span> <span class="k">as</span> <span class="n">P</span>
<span class="kr">import</span> <span class="nn">ConvertData</span> <span class="p">(</span><span class="nf">csvFilter</span><span class="p">,</span> <span class="nf">testFilter</span><span class="p">)</span>
</pre></div>
</div>
</div>
<div id="outline-container-orgcd03c1d" class="outline-4">
<h4 id="orgcd03c1d">Training Parameters</h4>
<div class="outline-text-4" id="text-orgcd03c1d">
<p>
Next, the parameters that control the training of the booster:
</p>
<div class="highlight"><pre><span></span><span class="nf">trainParams</span> <span class="ow">::</span> <span class="p">[</span><span class="kt">P</span><span class="o">.</span><span class="kt">Param</span><span class="p">]</span>
<span class="nf">trainParams</span> <span class="ow">=</span>
<span class="p">[</span> <span class="kt">P</span><span class="o">.</span><span class="kt">Objective</span> <span class="kt">P</span><span class="o">.</span><span class="kt">BinaryClassification</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">Metric</span> <span class="p">[</span><span class="kt">P</span><span class="o">.</span><span class="kt">BinaryLogloss</span><span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">AUC</span><span class="p">]</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">TrainingMetric</span> <span class="kt">True</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">LearningRate</span> <span class="o">$$</span><span class="p">(</span><span class="n">refineTH</span> <span class="mf">0.1</span><span class="p">)</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">NumLeaves</span> <span class="o">$$</span><span class="p">(</span><span class="n">refineTH</span> <span class="mi">63</span><span class="p">)</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">FeatureFraction</span> <span class="o">$$</span><span class="p">(</span><span class="n">refineTH</span> <span class="mf">0.8</span><span class="p">)</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">BaggingFreq</span> <span class="o">$$</span><span class="p">(</span><span class="n">refineTH</span> <span class="mi">5</span><span class="p">)</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">BaggingFraction</span> <span class="o">$$</span><span class="p">(</span><span class="n">refineTH</span> <span class="mf">0.8</span><span class="p">)</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">MinDataInLeaf</span> <span class="mi">50</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">MinSumHessianInLeaf</span> <span class="o">$$</span><span class="p">(</span><span class="n">refineTH</span> <span class="mf">5.0</span><span class="p">)</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">IsSparse</span> <span class="kt">True</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">LabelColumn</span> <span class="o">$</span> <span class="kt">P</span><span class="o">.</span><span class="kt">ColName</span> <span class="s">"Survived"</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">IgnoreColumns</span> <span class="p">[</span><span class="kt">P</span><span class="o">.</span><span class="kt">ColName</span> <span class="s">"PassengerId"</span><span class="p">]</span>
<span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">CategoricalFeatures</span>
<span class="p">[</span><span class="kt">P</span><span class="o">.</span><span class="kt">ColName</span> <span class="s">"Pclass"</span><span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">ColName</span> <span class="s">"Sex"</span><span class="p">,</span> <span class="kt">P</span><span class="o">.</span><span class="kt">ColName</span> <span class="s">"Embarked"</span><span class="p">]</span>
<span class="p">]</span>
</pre></div>
<p>
There are several things to note here:
</p>
<ul class="org-ul">
<li>I use the <a href="https://hackage.haskell.org/package/refined">refined</a> library to limit some of the parameters to their
proper domains. For example, the <code>FeatureFraction</code> should be in
the range \((0, 1]\), and by using a refinement of the <code>Double</code> type
I can ensure that it's so at compile time (with that <code>$$(refineTH
...)</code> Template Haskell stuff).<sup><a id="fnr.5" class="footref" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fn.5">5</a></sup></li>
<li>LightGBM multi-parameters are converted into lists (e.g. the
<code>Metric</code> parameter).</li>
<li>LightGBM enumerated parameters are turned into equivalent sum
types (e.g. the <code>Objective</code> parameter).</li>
<li>Column selection is based on a sum type rather than a string
prefix as is done in the standard LightGBM parameters (e.g. in the
<code>LabelColumn</code> parameter).</li>
<li>We can select which column contains the "labels" (the dependent
quantity being predicted) with the <code>LabelColumn</code> parameter.</li>
<li>We can ignore some columns that we might be carrying along just
for reporting purposes using the <code>IgnoreColumns</code> parameter.</li>
<li>Categorical features are encoded as integers, so we have to signal
explicitly to LightGBM whether a feature is categorical (i.e. it's
just an enum of a finite set of values) or not (i.e. it's a
numerical value of some sort). We do this with the
<code>CategoricalFeatures</code> paremeter.</li>
</ul>
<p>
More generally, note that the parameters module does some parameter
bundling to ensure that nonsensical combinations of parameters don't
occur. For instance, the <code>NumClasses</code> parameter can only be set with
the <code>MultiClass</code> application. This is a break from the flat parameter
space of the underlying LightGBM library where ensuring parameter
coherence is up to the user.
</p>
</div>
</div>
<div id="outline-container-org465a63d" class="outline-4">
<h4 id="org465a63d">Loading Data</h4>
<div class="outline-text-4" id="text-org465a63d">
<p>
The library provides a simple interface to load data from a CSV file
with an optional header into a <code>DataSet</code> for use with the algorithm.
In our case, all of the files have headers so a simple helper function
is in order.
</p>
<div class="highlight"><pre><span></span><span class="nf">loadData</span> <span class="ow">::</span> <span class="kt">FilePath</span> <span class="ow">-&gt;</span> <span class="kt">DS</span><span class="o">.</span><span class="kt">DataSet</span>
<span class="nf">loadData</span> <span class="ow">=</span> <span class="kt">DS</span><span class="o">.</span><span class="n">readCsvFile</span> <span class="p">(</span><span class="kt">DS</span><span class="o">.</span><span class="kt">HasHeader</span> <span class="kt">True</span><span class="p">)</span>
</pre></div>
</div>
</div>
<div id="outline-container-org6706bf8" class="outline-4">
<h4 id="org6706bf8">Training</h4>
<div class="outline-text-4" id="text-org6706bf8">
<p>
I create a couple of temporary files to hold the filtered data (I'm
doing the filtering inline - I could also have filtered the data
out-of-band, saved them, and then fed them in directly).
</p>
<div class="highlight"><pre><span></span><span class="nf">trainModel</span> <span class="ow">::</span> <span class="kt">IO</span> <span class="kt">LGBM</span><span class="o">.</span><span class="kt">Model</span>
<span class="nf">trainModel</span> <span class="ow">=</span>
<span class="kt">TMP</span><span class="o">.</span><span class="n">withSystemTempFile</span> <span class="s">"filtered_train"</span> <span class="o">$</span> <span class="nf">\</span><span class="n">trainFile</span> <span class="n">trainHandle</span> <span class="ow">-&gt;</span> <span class="kr">do</span>
<span class="kr">_</span> <span class="ow">&lt;-</span> <span class="n">csvFilter</span> <span class="s">"train_part.csv"</span> <span class="n">trainHandle</span>
<span class="n">hClose</span> <span class="n">trainHandle</span>
<span class="kt">TMP</span><span class="o">.</span><span class="n">withSystemTempFile</span> <span class="s">"filtered_val"</span> <span class="o">$</span> <span class="nf">\</span><span class="n">valFile</span> <span class="n">valHandle</span> <span class="ow">-&gt;</span> <span class="kr">do</span>
<span class="kr">_</span> <span class="ow">&lt;-</span> <span class="n">csvFilter</span> <span class="s">"validate_part.csv"</span> <span class="n">valHandle</span>
<span class="n">hClose</span> <span class="n">valHandle</span>
<span class="kr">let</span> <span class="n">trainingData</span> <span class="ow">=</span> <span class="n">loadData</span> <span class="n">trainFile</span>
<span class="n">validationData</span> <span class="ow">=</span> <span class="n">loadData</span> <span class="n">valFile</span>
<span class="n">predictionFile</span> <span class="ow">=</span> <span class="s">"LightGBM_predict_result.txt"</span>
<span class="n">modelName</span> <span class="ow">=</span> <span class="s">"LightGBM_model.txt"</span>
<span class="n">model</span> <span class="ow">&lt;-</span>
<span class="kt">LGBM</span><span class="o">.</span><span class="n">trainNewModel</span> <span class="n">modelName</span> <span class="n">trainParams</span> <span class="n">trainingData</span> <span class="n">validationData</span> <span class="mi">100</span>
<span class="kr">case</span> <span class="n">model</span> <span class="kr">of</span>
<span class="kt">Left</span> <span class="n">e</span> <span class="ow">-&gt;</span> <span class="ne">error</span> <span class="o">$</span> <span class="s">"Error training model: "</span> <span class="o">++</span> <span class="n">show</span> <span class="n">e</span>
<span class="kt">Right</span> <span class="n">m</span> <span class="ow">-&gt;</span> <span class="kr">do</span>
<span class="kr">_</span> <span class="ow">&lt;-</span> <span class="kt">LGBM</span><span class="o">.</span><span class="n">writeModelFile</span> <span class="n">modelFile</span> <span class="n">m</span>
<span class="c1">-- [... a bit of self validation code elided here ...]</span>
<span class="n">return</span> <span class="n">m</span>
</pre></div>
<p>
Note how we use the training data and the validation data in the the
training cycle.
</p>
<p>
The effect of this code is to train a model, write the model out to
the <code>modelName</code> file for future use, and return the model for
immediate use (or return an error-log in case there was an error
during training).
</p>
</div>
</div>
<div id="outline-container-orgc0f984f" class="outline-4">
<h4 id="orgc0f984f">Predicting</h4>
<div class="outline-text-4" id="text-orgc0f984f">
<p>
Now that we have the model, we can use it to predict the fate of other
passengers. Here we go:
</p>
<div class="highlight"><pre><span></span><span class="nf">main</span> <span class="ow">::</span> <span class="kt">IO</span> <span class="nb">()</span>
<span class="nf">main</span> <span class="ow">=</span> <span class="kr">do</span>
<span class="n">cwd</span> <span class="ow">&lt;-</span> <span class="kt">SD</span><span class="o">.</span><span class="n">getCurrentDirectory</span>
<span class="kt">SD</span><span class="o">.</span><span class="n">withCurrentDirectory</span>
<span class="p">(</span><span class="n">cwd</span> <span class="o">&lt;/&gt;</span> <span class="s">"examples"</span> <span class="o">&lt;/&gt;</span> <span class="s">"titanic"</span><span class="p">)</span>
<span class="p">(</span><span class="kr">do</span>
<span class="n">m</span> <span class="ow">&lt;-</span> <span class="n">trainModel</span>
<span class="kt">TMP</span><span class="o">.</span><span class="n">withSystemTempFile</span> <span class="s">"filtered_test"</span> <span class="o">$</span> <span class="nf">\</span><span class="n">testFile</span> <span class="n">testHandle</span> <span class="ow">-&gt;</span> <span class="kr">do</span>
<span class="kr">_</span> <span class="ow">&lt;-</span> <span class="n">testFilter</span> <span class="s">"test.csv"</span> <span class="n">testHandle</span>
<span class="n">hClose</span> <span class="n">testHandle</span>
<span class="kt">TMP</span><span class="o">.</span><span class="n">withSystemTempFile</span> <span class="s">"predictions"</span> <span class="o">$</span> <span class="nf">\</span><span class="n">predFile</span> <span class="n">predHandle</span> <span class="ow">-&gt;</span> <span class="kr">do</span>
<span class="n">hClose</span> <span class="n">predHandle</span>
<span class="n">predResults</span> <span class="ow">&lt;-</span> <span class="kt">LGBM</span><span class="o">.</span><span class="n">predict</span> <span class="n">m</span> <span class="kt">[]</span> <span class="p">(</span><span class="n">loadData</span> <span class="n">testFile</span><span class="p">)</span>
<span class="kr">case</span> <span class="n">predResults</span> <span class="kr">of</span>
<span class="kt">Left</span> <span class="n">e</span> <span class="ow">-&gt;</span> <span class="ne">error</span> <span class="o">$</span> <span class="s">"Error predicting final results: "</span> <span class="o">++</span> <span class="n">show</span> <span class="n">e</span>
<span class="kt">Right</span> <span class="n">predValues</span> <span class="ow">-&gt;</span> <span class="kr">do</span>
<span class="kt">LGBM</span><span class="o">.</span><span class="n">writeCsvFile</span> <span class="n">predFile</span> <span class="n">predValues</span>
<span class="c1">-- [... some code to report the output in Kaggle format elided ...]</span>
<span class="p">)</span>
</pre></div>
<p>
The model output will go to the <code>predFile</code> where it can be used for
further processing (e.g. massaging into the proper format for
submitting to Kaggle.).
</p>
</div>
</div>
</div>
</div>
<div id="outline-container-org2c32b9e" class="outline-2">
<h2 id="org2c32b9e">Caveats</h2>
<div class="outline-text-2" id="text-org2c32b9e">
<p>
This interface to the LightGBM library is fundamentally a wrapper
around the command-line interface to LightGBM, which makes it rather
heavily embedded in the <code>IO</code> type and heavily dependent on the file
system. The file system dependence is not particularly bad - data
sets and models in the machine learning space are typically large
enough that you'd want to have them persisted to disk anyway - but it
perhaps gives an odd feel to the wrapper API. Most wrappers around
LightGBM use foreign function calls to the C API and pass data
structures in directly (e.g. as Pandas or R data frames); I might do
something like that in the future if it looks like it would help
matters.
</p>
</div>
</div>
<div id="outline-container-org39742b3" class="outline-2">
<h2 id="org39742b3">Future directions?</h2>
<div class="outline-text-2" id="text-org39742b3">
<p>
The wrapper presented here is still very rudimentary, and many
improvements should be made. For example:
</p>
<ul class="org-ul">
<li>Add the library to Hackage</li>
<li>Grid search for parameter tuning</li>
<li>Cross-validation support</li>
<li>Better validation metrics</li>
<li>Using the C API via the Haskell FFI rather than wrapping the
command line interface</li>
<li><b>DONE</b> Provide a <a href="http://hackage.haskell.org/package/Frames">data frame</a> interface to DataSet allowing us to
use a data frame directly as input and extract a dataframe as
output.</li>
</ul>
</div>
</div>
<div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">
<div class="footdef"><sup><a id="fn.1" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.1">1</a></sup> <div class="footpara"><p class="footpara">
Please note that this article is only meant to introduce you to
the Haskell wrapper for LightGBM in the simplest way possible! The
approach taken in this article to the actual machine learning problem
at hand is intentionally naive to make sure that the exposition of the
library doesn't get lost in the complexity of handling a machine
learning problem properly. Do not take this approach as a tutorial in
how to approach machine learning in general! A few of the most
egregious simplifications will be called out in the footnotes.
</p></div></div>
<div class="footdef"><sup><a id="fn.2" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.2">2</a></sup> <div class="footpara"><p class="footpara">
This "holdout" approach is not a particularly good validation
method, but it's simple to implement. Some high-level interfaces to
LightGBM provide support for cross-validation, and I might supply that
too eventually.
</p></div></div>
<div class="footdef"><sup><a id="fn.3" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.3">3</a></sup> <div class="footpara"><p class="footpara">
Take a look at <code>ConvertData.hs</code> in the <a href="https://github.com/dpkatz/HaskellGBM/blob/master/examples/titanic/ConvertData.hs">repo</a> if you're
interested.
</p></div></div>
<div class="footdef"><sup><a id="fn.4" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.4">4</a></sup> <div class="footpara"><p class="footpara">
I leave open the possibility of engineering features on the
basis of these columns (e.g. using titles from names as a proxy for
social class, or correlating cabin numbers to particular locations on
the ship); I'm just saying that leaving these columns as they are
doesn't give us any useful information for a feature.
</p></div></div>
<div class="footdef"><sup><a id="fn.5" class="footnum" href="https://dpkatz.github.io/posts/using-lightgbm-from-haskell/#fnr.5">5</a></sup> <div class="footpara"><p class="footpara">
Note that to use the Template Haskell approach we need to
enable the <code>TemplateHaskell</code> extension using a file pragma (as is done
in the example files) or similar approach. You always have the option
of using <code>refine</code> instead of <code>refineTH</code> in which case Template Haskell
is not needed and the bounds checks will happen at runtime rather than
compile time.
</p></div></div>
</div>
</div></description><category>boosting</category><category>Haskell</category><category>LightGBM</category><category>Machine Learning</category><guid>https://dpkatz.github.io/posts/using-lightgbm-from-haskell/</guid><pubDate>Thu, 31 May 2018 15:22:08 GMT</pubDate></item><item><title>A simple beginning</title><link>https://dpkatz.github.io/posts/a-simple-beginning/</link><dc:creator>Dan Katz</dc:creator><description><p>
It all started with a simple classification problem: Given a document,
identify what company it was about. Here are 10<sup>6</sup> documents we know
how to classify, and here are 10<sup>5</sup> we don't. A standard undergrad
homework exercise in machine learning, they said.
</p>
<p>
Little did I know I was headed down the rabbit hole…
</p>
<p>
A year later, I feel like I'm just beginning to get my head around all
this and I'm going to try to get it down in a form that makes some
sense to me.
</p></description><category>blog</category><guid>https://dpkatz.github.io/posts/a-simple-beginning/</guid><pubDate>Sat, 14 Apr 2018 20:20:56 GMT</pubDate></item></channel></rss>