Skip to content

Lightweight and Scalable framework that combines mainstream algorithms of Click-Through-Rate prediction Based Machine Learning, Deep Learning and Philosophy of Parameter Server.

License

Notifications You must be signed in to change notification settings

ChironJ/LightCTR

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Alt text -w135

LightCTR Overview

LightCTR is a lightweight and scalable framework that combines mainstream algorithms of Click-Through-Rate prediction Based Machine Learning, Deep Learning and Philosophy of Parameter Server. The library is suitable for sparse data and designed for Large-scale Distributed Model Training.

Meanwhile, LightCTR is also an open source project that oriented to code readers. The clear execute logic will be of significance to leaners on Machine-Learning-related field.

用于群体发现

群体发现作为点击率预估中用户画像的重要指标之一,LightCTR提供了相关的算法支持。可将离散id特征与连续特征组合输入LightCTR::GBM预先找群体,训练得到的每个叶子节点代表一个用户群,再使用LightCTR::LRLightCTR::MLP对GBDT建立的低维0/1群体特征做高层分类。 当id特征过多使输入成为高维稀疏数据,便不再适合树模型,可通过LightCTR::FMLightCTR::FFM将每个群体的类别onehot特征映射在低维空间中,向量内积即表示特征交叉的权重。FM相比LR引入一定的非线性,降低了数据稀疏下的过拟合的风险,或更进一步采用LightCTR::NFM结合DNN进行高维特征组合,达到更好的效果。

用于行为序列

用户点击内容序列等用户的行为序列可通过LightCTR::Embedding得到行为的低维隐向量表示;隐向量数据经过平滑处理后,按时序输入循环神经网络LightCTR::LSTM,从用户行为序列中训练得到基础特征的表示;再基于变分自编码器LightCTR::VAE实现特征组合衍生,增强特征的表达能力;最后将特征表达输入LightCTR::Softmax分类器,利用样本标签训练生成用户评估,或导入LightCTR::GMM进行无监督聚类,作为用户画像的人群特征。

用于内容分析

LightCTR可通过分析用户评论、兴趣得到推荐信息,作为点击率预估的参考。可使用LightCTR::Embedding预先训练含语意特征的词向量表,将切分后的词向量文本按时序输入LightCTR::Attention_LSTM训练得到句子的情感倾向,并在Attention层获得每个词对分类结果的重要性。 另一种方案是将由文本词向量构成的矩阵,输入卷积神经网络LightCTR::CNN用于提取句子局部相关性特征作为文本分类依据。当文本缺乏分类标签时,可使用LightCTR::PLSA无监督的获取文章主题分布,作为是否向用户推荐此文章的依据。

分层模型融合

为了兼顾点击率预估的性能与效果,可使用不同模型逐层预测,如第一层采用在线学习、并引入稀疏性解的简单模型LightCTR::FTRL_LR,第二层采用LightCTR::MLPLightCTR::NFM等复杂模型进行精细预测。

多机多线程并行计算

基于参数服务器与异步梯度下降理论,LightCTR支持可扩展性的模型参数集群训练。集群分为Master, ParamServer与Worker三种角色,一个集群有一个Master负责集群启动与运行状态的维护,大规模模型参数以DHT散布在多个参数服务器上,与多个负责模型梯度运算的Worker协同,每轮训练都先从ParamServer拉取(Pull)一个样本Batch的参数,运算得到的参数梯度推送(Push)到ParamServer进行梯度汇总与参数更新。LightCTR分布式集群采取心跳监控、消息重传等容错方式。

List of Implemented Algorithms

  • Factorization Machine
  • Field-aware Factorization Machine
  • Wide&Deep Neural Factorization Machine
  • Gradient Boosting Tree Model
  • Gaussian Mixture Clustering Model
  • Topic Model PLSA
  • Embedding Model
  • Multi-Layer Perception
  • Convolution Neural Network
  • Attention-based Recurrent Neural Network
  • Variational AutoEncoder

Features

  • Optimizer implemented by Mini-Batch GD, Negative sampling, Adagrad, FTRL, RMSprop, Adam, etc
  • Regularization: L1, L2, Dropout, Partially connected
  • Template-customized Activation and Loss function
  • Evaluate methods including F1, AUC
  • Support distributed model training based on Parameter Server and Async-SGD
  • Shared parameters Key-Value pairs store in physical nodes by DHT
  • Lock-free Stand-alone Multi-threaded training and Vectorized compiling

Quick Start

  • LightCTR depends on C++11 and ZeroMQ only
  • Change configuration (e.g. Learning Rate, Data source) in main.cpp
  • run ./build.sh to start training task
  • Current CI Status: Build Status

Welcome to Contribute

  • Welcome everyone interested in machine learning or distributed system to contribute code, create issues or pull requests of new features.
  • LightCTR is released under the Apache License, Version 2.0.

Community

About

Lightweight and Scalable framework that combines mainstream algorithms of Click-Through-Rate prediction Based Machine Learning, Deep Learning and Philosophy of Parameter Server.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 89.9%
  • C 8.6%
  • Other 1.5%