Mondrian Forests #10

Open
wants to merge 60 commits into
base: main
6023dd1
Update ClassifierOutput docstring
MarcoDiFrancesco Apr 11, 2024
feba8a0
Add RegressionOutput to common
MarcoDiFrancesco Apr 11, 2024
c13d3c6
Merge branch 'online-ml:main' into main
MarcoDiFrancesco Apr 11, 2024
308a082
Add boilerplate code for mondrian forest
MarcoDiFrancesco Apr 12, 2024
3ba0e3a
Add keystroke dataset
MarcoDiFrancesco Apr 12, 2024
2f9e03d
Add all functions calls with unimplemented errors
MarcoDiFrancesco Apr 15, 2024
7b63db5
Add predict steps to be refactored
MarcoDiFrancesco Apr 15, 2024
d5bb6db
Add get features function
MarcoDiFrancesco Apr 16, 2024
b5b7ec4
Add Array library
MarcoDiFrancesco Apr 16, 2024
d613df2
Add randomization for cache tests
MarcoDiFrancesco Apr 16, 2024
2174472
Disable test github actions and enable only check
MarcoDiFrancesco Apr 17, 2024
1c91530
Remove verbose from build and test
MarcoDiFrancesco Apr 17, 2024
44cfba4
Add Stats struct and impl
MarcoDiFrancesco Apr 18, 2024
4c6ebe4
Add rust caching in actions
MarcoDiFrancesco Apr 18, 2024
1ccabc4
Split MondrianTree and MondrianForest
MarcoDiFrancesco Apr 22, 2024
ac71b06
Refactor to use Tree Vector indicies instead of pointers
MarcoDiFrancesco Apr 23, 2024
8aad4ed
Change actions cargo.lock to cargo.toml
MarcoDiFrancesco Apr 23, 2024
8c91dd8
Add print function for MondrianTree
MarcoDiFrancesco Apr 23, 2024
6b38849
Adding print functions to mondriantree and node
MarcoDiFrancesco Apr 23, 2024
107354a
Implement and test predict_proba
MarcoDiFrancesco Apr 24, 2024
4385fe8
Add unit test for predict_proba
MarcoDiFrancesco Apr 24, 2024
49d4e3e
Add final implementation of inference (predict_proba)
MarcoDiFrancesco Apr 24, 2024
a16d3e7
Add random distribution to extend mondrian block
MarcoDiFrancesco Apr 25, 2024
de5d67a
Add full extend_mondrian_block implementation
MarcoDiFrancesco Apr 25, 2024
667d35e
Add synthetic dataset and tree integrity tests
MarcoDiFrancesco Apr 25, 2024
f79864d
Fix pointer of grandpa on extend_mondrian_block
MarcoDiFrancesco Apr 26, 2024
989c176
Add recursive repr mondrian forest
MarcoDiFrancesco Apr 26, 2024
75e5feb
Add score function
MarcoDiFrancesco Apr 29, 2024
da4a00a
Remove debug statements
MarcoDiFrancesco Apr 30, 2024
717161f
Adjust code to River behaviour
MarcoDiFrancesco Apr 30, 2024
a9ca4bc
Adapt _go_downwards from River
MarcoDiFrancesco May 3, 2024
ccc9b1d
Update function names from nel215 to River
MarcoDiFrancesco May 3, 2024
30fb86b
Comment debug prints
MarcoDiFrancesco May 3, 2024
a619415
Remove unused imports
MarcoDiFrancesco May 3, 2024
da23d14
Add synthetic dataset download
MarcoDiFrancesco May 3, 2024
85030ad
Rename MondrianForest to MondrianForestClassifier
MarcoDiFrancesco May 6, 2024
c4753f1
Update readme with classification run instructions
MarcoDiFrancesco May 6, 2024
a08f922
Add update_leaf flag to create_leaf
MarcoDiFrancesco May 13, 2024
a00cfe5
Fix mondrian forest classifier test
MarcoDiFrancesco May 13, 2024
4d9ef48
Remove create_leaf flag
MarcoDiFrancesco May 20, 2024
0217db2
Add create leafs when reaching a leaf
MarcoDiFrancesco May 24, 2024
1e5a874
Add assert to check for NaN probability
MarcoDiFrancesco May 24, 2024
6971c21
Revert removal of split_time
MarcoDiFrancesco May 24, 2024
782d1f2
Add test cases
MarcoDiFrancesco May 29, 2024
a5bd895
Remove unused `child_is_on_edge_parent` test case
MarcoDiFrancesco May 29, 2024
3544c28
Add debug statement for overwriting variance aware estimation
MarcoDiFrancesco May 29, 2024
9083d8e
Add synthetic regression target boilerplate
MarcoDiFrancesco Jun 4, 2024
43cce28
Add Classification and Regression division of MF
MarcoDiFrancesco Jun 7, 2024
e58638b
Add regression task and parent_has_finite_values test
MarcoDiFrancesco Jun 11, 2024
fed6daf
Fix child_inside_parent test
MarcoDiFrancesco Jun 11, 2024
760de79
Remove prints in excess
MarcoDiFrancesco Jun 11, 2024
54bb202
Add regression metrics
MarcoDiFrancesco Jun 12, 2024
0d74d3f
Fix test keystroke dataset
MarcoDiFrancesco Jun 12, 2024
c60b381
Change description of synthetic dataset
MarcoDiFrancesco Jun 12, 2024
ec2109a
Add baseline comparison for regression
MarcoDiFrancesco Jun 24, 2024
b77ba69
Add machine degradation dataset
MarcoDiFrancesco Jul 9, 2024
a6c1b8b
Add genesis demostrator dataset
MarcoDiFrancesco Jul 10, 2024
4a4b9f5
Update machine degradation with redirect
MarcoDiFrancesco Jul 10, 2024
23c109e
Update src/datasets/synthetic_regression.rs
smastelini Jul 29, 2024
38e64ee
Update src/datasets/synthetic.rs
smastelini Jul 29, 2024
Add boilerplate code for mondrian forest
MarcoDiFrancesco committed Apr 12, 2024
commit 308a082f8cb05eb051d7b1baab64cb191bf3183b
4 changes: 4 additions & 0 deletions Cargo.toml
@@ -29,6 +29,10 @@ opt-level = 3
name = "credit_card"
path = "examples/anomaly_detection/credit_card.rs"

[[example]]
name = "credit_card_clf"
path = "examples/classification/credit_card.rs"

[[bench]]
name = "hst"
harness = false
57 changes: 57 additions & 0 deletions examples/classification/credit_card.rs
@@ -0,0 +1,57 @@
use light_river::classification::mondrian_tree::MondrianTree;
use light_river::common::ClassifierOutput;
use light_river::common::ClassifierTarget;
use light_river::datasets::credit_card::CreditCard;
use light_river::metrics::rocauc::ROCAUC;
use light_river::metrics::traits::ClassificationMetric;
use light_river::stream::data_stream::DataStream;
use light_river::stream::iter_csv::IterCsv;
use std::fs::File;
use std::time::Instant;

fn main() {
let now = Instant::now();

let window_size: usize = 1000;
let n_trees: usize = 1;
let height: usize = 4;

// TODO: Check whether this is still needed for classification or was only useful for anomaly detection
let pos_val_metric = ClassifierTarget::from("1".to_string());
let pos_val_tree = pos_val_metric.clone();

let features = vec!["F1".to_string(), "F2".to_string(), "F3".to_string()];

// INITIALIZATION
let mut mt: MondrianTree<f32> =
MondrianTree::new(window_size, n_trees, height, features, pos_val_tree);

// DEBUG: remove it
let mut counter = 0;

// LOOP
let transactions: IterCsv<f32, File> = CreditCard::load_credit_card_transactions().unwrap();
for transaction in transactions {
let data = transaction.unwrap();
// println!("Data: {data}");
let observation = data.get_observation();
// println!("Observation: {:?}", observation);
let label = data.to_classifier_target("Class").unwrap();
// let score = mt.update(&observation, true, true).unwrap();

// Label: not sure why we need it
// println!("Label: {:?}", label);
// println!("Score: {:?}", score);
// println!("");

counter += 1;
if counter > 10 {
break;
}
// roc_auc.update(&score, &label, Some(1.));
}

let elapsed_time = now.elapsed();
println!("Took {}ms", elapsed_time.as_millis());
// println!("ROCAUC: {:.2}%", roc_auc.get() * (100.0 as f32));
}
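
The loop in this example is set up for test-then-train (prequential) evaluation: score each sample first, then let the model learn from it. The scoring and metric calls are still commented out in this commit, so the following is only a minimal sketch of that pattern. `MajorityClass` and `prequential_accuracy` are hypothetical stand-ins, not part of `light_river`, and plain accuracy is used here instead of `ROCAUC` to keep the sketch self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in model: predicts the most frequent label seen so far.
#[derive(Default)]
struct MajorityClass {
    counts: HashMap<String, usize>,
}

impl MajorityClass {
    /// Returns `None` until at least one label has been learned.
    fn predict_one(&self) -> Option<String> {
        self.counts
            .iter()
            // Highest count wins; ties go to the lexicographically smallest label
            // so the result does not depend on HashMap iteration order.
            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
            .map(|(label, _)| label.clone())
    }

    fn learn_one(&mut self, label: &str) {
        *self.counts.entry(label.to_string()).or_insert(0) += 1;
    }
}

/// Test-then-train loop: predict on each sample before learning from it.
/// Returns the accuracy over the samples for which a prediction was available.
fn prequential_accuracy(labels: &[&str]) -> Option<f64> {
    let mut model = MajorityClass::default();
    let (mut correct, mut scored) = (0usize, 0usize);
    for &label in labels {
        // 1. Score first, so the model never sees the label it is evaluated on.
        if let Some(pred) = model.predict_one() {
            scored += 1;
            if pred == label {
                correct += 1;
            }
        }
        // 2. Only then update the model with the true label.
        model.learn_one(label);
    }
    if scored == 0 {
        None
    } else {
        Some(correct as f64 / scored as f64)
    }
}
```

In the example above, the same ordering would mean calling `score_one` and updating the metric before calling `learn_one` on each transaction.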
1 change: 1 addition & 0 deletions src/classification/mod.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
pub mod mondrian_tree;
186 changes: 186 additions & 0 deletions src/classification/mondrian_tree.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,186 @@
use num::pow::Pow;
use rand::prelude::*;

use num::{Float, FromPrimitive};
use std::cell::RefCell;
use std::collections::HashMap;
use std::convert::TryFrom;
use std::env::consts;
use std::iter::FlatMap;
use std::mem;
use std::ops::{AddAssign, DivAssign, MulAssign, SubAssign};
use std::rc::Rc;

use crate::common::{ClassifierOutput, ClassifierTarget, Observation};

trait FType:
Float + FromPrimitive + AddAssign + SubAssign + MulAssign + DivAssign + std::fmt::Debug
{
}
impl<T> FType for T where
T: Float + FromPrimitive + AddAssign + SubAssign + MulAssign + DivAssign + std::fmt::Debug
{
}

/// Stats associated with one node.
/// Each vector is indexed by label.
struct Stats {
sum: Vec<f64>,
sq_sum: Vec<f64>,
count: Vec<i32>,
}
impl Stats {
fn create_result() {
unimplemented!()
}
fn add() {
unimplemented!()
}
fn merge() {
unimplemented!()
}
fn predict_proba() {
unimplemented!()
}
}
#[derive(Debug, Copy, Clone)]
struct Node<F> {
parent: Option<usize>,
tau: F, // Time parameter (?)
is_leaf: bool,
min_list: [F; 2], // Lists representing the minimum and maximum values of the data points contained in the current node
max_list: [F; 2],
delta: F, // Dimension in which a split occurs (?)
xi: F, // Split point along the dimension specified by delta
left: Option<usize>,
right: Option<usize>,
// stats: Stats, // Ignoring stats for now since it should be a fixed-size array, vector should not work since we are using fixed-size arrays in Trees, but try it out and see what comes out
}
impl<F: FType> Node<F> {
// pub fn update_leaf(&mut self, x: F, label: F) {
pub fn update_leaf(&mut self) {
unimplemented!()
}
pub fn update_internal(&mut self) {
unimplemented!()
}
pub fn get_parent_tau(&self) -> f64 {
unimplemented!()
// match self.parent {
// Some(ref parent) => parent.borrow().tau,
// None => 0.0,
// }
}
}

struct Trees<F: FType> {
nodes: Vec<Node<F>>,
}
impl<F: FType> Trees<F> {
fn new(
n_trees: usize,
height: usize,
features: &Vec<String>,
rng: &mut ThreadRng,
n_nodes: usize,
) -> Self {
if n_trees != 1 {
unimplemented!();
}

let node_default = Node::<F> {
parent: None,
tau: F::from_f64(0.33).unwrap(),
is_leaf: false,
min_list: [F::from_f64(0.1).unwrap(), F::from_f64(0.2).unwrap()],
max_list: [F::from_f64(0.3).unwrap(), F::from_f64(0.4).unwrap()],
delta: F::from_f64(0.123).unwrap(),
xi: F::from_f64(0.456).unwrap(),
left: None,
right: None,
// stats: Stats::new,
};
let mut nodes = vec![node_default; n_nodes];

for i in 0..n_nodes {
let left_idx = 2 * i + 1;
let right_idx = 2 * i + 2;

if (left_idx < n_nodes) && (right_idx < n_nodes) {
nodes[i].left = Some(left_idx);
nodes[i].right = Some(right_idx);
nodes[left_idx].parent = Some(i);
nodes[right_idx].parent = Some(i);
} else {
nodes[i].is_leaf = true;
}
}

Trees { nodes }
}
}

pub struct MondrianTree<F: FType> {
window_size: usize,
n_trees: usize,
height: usize,
features: Vec<String>,
rng: ThreadRng,
n_nodes: usize,
trees: Trees<F>,
first_learn: bool,
pos_val: ClassifierTarget,
}
impl<F: FType> MondrianTree<F> {
pub fn new(
window_size: usize,
n_trees: usize,
height: usize,
features: Vec<String>,
pos_val: ClassifierTarget,
) -> Self {
let features_clone = features.clone();
let mut rng = rand::thread_rng();
// Number of nodes in a complete binary tree: 2^height - 1
let n_nodes = usize::pow(2, height.try_into().unwrap()) - 1;
let mut trees = Trees::new(n_trees, height, &features, &mut rng, n_nodes);
MondrianTree::<F> {
window_size,
n_trees,
height,
features: features_clone,
rng,
n_nodes,
trees,
first_learn: false,
pos_val,
}
}

pub fn update(
&mut self,
observation: &Observation<F>,
do_score: bool,
do_update: bool,
) -> Option<ClassifierOutput<F>> {
if do_score {
let score: F = F::from(1234.0).unwrap();
return Some(ClassifierOutput::Probabilities(HashMap::from([(
ClassifierTarget::from(self.pos_val.clone()),
score,
)])));
}
return None;
}
pub fn learn_one(&mut self, observation: &Observation<F>) {
self.update(observation, false, true);
}
pub fn score_one(&mut self, observation: &Observation<F>) -> Option<ClassifierOutput<F>> {
self.update(observation, true, false)
}
fn max_score(&self) -> F {
F::from(self.n_trees).unwrap()
* F::from(self.window_size).unwrap()
* (F::from(2.).unwrap().powi(self.height as i32 + 1) - F::one())
}
}
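
`Trees::new` above lays out a complete binary tree in a flat `Vec`, linking nodes by index rather than by pointer: node `i` gets children `2 * i + 1` and `2 * i + 2`, and a tree of the given height has `2^height - 1` nodes. The sketch below isolates just that index arithmetic. `Links`, `build_links` and `n_nodes_for_height` are illustrative names that do not appear in this PR, and the bounding-box, `tau`, `delta` and `xi` fields are deliberately left out.

```rust
/// Number of nodes in a complete binary tree with `height` levels.
fn n_nodes_for_height(height: u32) -> usize {
    2usize.pow(height) - 1
}

/// Structural part of a node: indices into the backing Vec instead of pointers.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Links {
    parent: Option<usize>,
    left: Option<usize>,
    right: Option<usize>,
    is_leaf: bool,
}

/// Builds the parent/child links for a complete binary tree stored in level order.
fn build_links(n_nodes: usize) -> Vec<Links> {
    let unlinked = Links {
        parent: None,
        left: None,
        right: None,
        is_leaf: false,
    };
    let mut nodes = vec![unlinked; n_nodes];
    for i in 0..n_nodes {
        let left_idx = 2 * i + 1;
        let right_idx = 2 * i + 2;
        // right_idx < n_nodes implies left_idx < n_nodes, so one check is enough.
        if right_idx < n_nodes {
            nodes[i].left = Some(left_idx);
            nodes[i].right = Some(right_idx);
            nodes[left_idx].parent = Some(i);
            nodes[right_idx].parent = Some(i);
        } else {
            // No room for two children: this node sits on the last level.
            nodes[i].is_leaf = true;
        }
    }
    nodes
}
```

With `height = 4`, as in the credit card example, this gives 15 nodes, of which indices 7 to 14 are leaves. A fixed layout like this suits a pre-allocated tree; a tree that grows by inserting new parents would need a different scheme for assigning indices.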
1 change: 1 addition & 0 deletions src/lib.rs
@@ -1,4 +1,5 @@
pub mod anomaly;
pub mod classification;
pub mod common;
pub mod datasets;
pub mod metrics;
1 change: 1 addition & 0 deletions src/stream/data_stream.rs
@@ -80,6 +80,7 @@ impl<F: Float + std::fmt::Display + std::str::FromStr> Data<F> {
}
}

#[derive(Debug)]
pub enum DataStream<F: Float + std::str::FromStr> {
X(HashMap<String, Data<F>>),
XY(HashMap<String, Data<F>>, HashMap<String, Data<F>>),