Reloading jiebaR through the parallel package causes a worker() timestamp mismatch #20

Open
ioflows opened this issue Dec 26, 2015 · 4 comments

ioflows commented Dec 26, 2015

# init jieba
library(jiebaR)
seg_local=worker()
# init cluster
library(parallel)
cl=makeCluster(3)
# init args and functions
args=c('abc def','abd efg','ah gs fhg')
get_seg_local=function(d) segment(d,seg_local)
get_seg_remote=function(d) segment(d,seg_remote)

clusterEvalQ(cl,library(jiebaR))
# ======================
# Define worker() locally and export it
# ======================
clusterExport(cl,'seg_local')
# clusterExport(cl,'get_seg_local')
parLapply(cl,args,get_seg_local)
# Error in checkForRemoteErrors(val) : 
#   3 nodes produced errors; first error: Please create a new worker after jiebaR is reloaded.

# ========================
# Define worker() remotely, on the cluster nodes
# ========================
clusterCall(cl,function(){
    seg_remote=worker()
})
parLapply(cl,args,get_seg_remote)
# Error in checkForRemoteErrors(val) : 
#   3 nodes produced errors; first error: object 'seg_remote' not found

The error for the locally declared worker is mainly caused by a timestamp mismatch; see line 42 of:

https://github.com/qinwf/jiebaR/blob/master/R/segment.R

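Of course, the error in the second approach is not a jiebaR problem (I searched through a lot of related material myself but never found a solution). I would like to ask whether there is a better way to use jiebaR in parallel computation. Thanks!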

qinwf (Owner) commented Dec 26, 2015

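Hi, I am rather busy these few days, so here is a brief reply first, with more detail to follow. Parallelism has to deal with many problems, and one of them is the data race. What you are running into is more or less that problem.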

qinwf (Owner) commented Dec 26, 2015

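Besides coarse-grained parallelism at the R level, I have been implementing parallel segmentation with the C++11 concurrency facilities. However, gcc 4.9 in Rtools on Windows has a bug: DLL loading fails on 64-bit, so this feature has not been formally added to the main branch. It will probably have to wait until Rtools gcc 4.9 is stable on Windows.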

qinwf (Owner) commented Dec 26, 2015

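You can try creating a new cutter inside each child process of the cluster. That may avoid the timestamp problem, and it also avoids data races. For example: 3 child processes, 3 cutters.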
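A minimal sketch of that suggestion, assuming jiebaR is installed on the machine that hosts the nodes. `worker()` and `segment()` are the calls already used in this thread; `seg_on_node`, `segment_in_parallel` and `n_nodes` are hypothetical names chosen for illustration, and the trailing `NULL` is an added precaution rather than anything jiebaR requires.

```r
library(parallel)

# Runs on a node; `seg_remote` is looked up in that node's global environment.
seg_on_node <- function(d) segment(d, seg_remote)

# Hypothetical helper: build one cutter per node, on the node itself,
# so no cutter created on the master is ever exported.
segment_in_parallel <- function(texts, n_nodes = 3) {
  cl <- makeCluster(n_nodes)
  on.exit(stopCluster(cl), add = TRUE)
  clusterEvalQ(cl, {
    library(jiebaR)
    seg_remote <- worker()
    NULL  # return NULL so the cutter is not serialized back to the master
  })
  parLapply(cl, texts, seg_on_node)
}

# Example call (needs jiebaR):
# segment_in_parallel(c("abc def", "abd efg", "ah gs fhg"))
```

Because each cutter is created after jiebaR is loaded in the same process that uses it, the "create a new worker after jiebaR is reloaded" check should not be triggered.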

ioflows (Author) commented Dec 26, 2015

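Thank you for the answer. My second approach was in fact also meant to create a new cutter in each child process, but I wrote it incorrectly: it should not have been placed inside an anonymous function. :-(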

clusterCall(cl,function(){
    seg_remote=worker()
})

should be changed to

clusterEvalQ(cl,{seg_remote=worker()})
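The difference comes down to scoping: an assignment inside a function passed to `clusterCall` is local to that call, while `clusterEvalQ` evaluates its expression in each node's global environment. The sketch below shows this with base R only; the variable names (`x_local`, `x_global`, `x_assigned`) are placeholders standing in for `seg_remote`, and the `assign()` variant is one possible alternative, not something proposed in this thread.

```r
library(parallel)

cl <- makeCluster(2)

# Assignment inside the function body is local to the call and is discarded.
clusterCall(cl, function() { x_local <- 1; NULL })
found_local <- unlist(clusterEvalQ(cl, exists("x_local")))

# clusterEvalQ evaluates the expression in each node's global environment.
clusterEvalQ(cl, { x_global <- 1; NULL })
found_global <- unlist(clusterEvalQ(cl, exists("x_global")))

# An explicit assign() into globalenv() also persists from inside a function.
clusterCall(cl, function() { assign("x_assigned", 1, envir = globalenv()); NULL })
found_assigned <- unlist(clusterEvalQ(cl, exists("x_assigned")))

stopCluster(cl)
```

Here `found_local` is `FALSE` on every node, which matches the "object 'seg_remote' not found" error above, while `found_global` and `found_assigned` are `TRUE`.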

Looking forward to jiebaR's new features!
