{
"date": {
"ru": "18 октября",
"en": "October 18",
"zh": "10月18日"
},
"time_utc": "2024-10-18 09:00",
"weekday": 4,
"issue_id": 176,
"home_page_url": "https://huggingface.co/papers?date=2024-10-18",
"papers": [
{
"id": "https://huggingface.co/papers/2410.13720",
"title": "Movie Gen: A Cast of Media Foundation Models",
"url": "https://huggingface.co/papers/2410.13720",
"abstract": "We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.",
"score": 88,
"issue_id": 147,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "086b8ff148ce7df3",
"authors": [
"Adam Polyak",
"Amit Zohar",
"Andrew Brown",
"Andros Tjandra",
"Animesh Sinha",
"Ann Lee",
"Apoorv Vyas",
"Bowen Shi",
"Chih-Yao Ma",
"Ching-Yao Chuang",
"David Yan",
"Dhruv Choudhary",
"Dingkang Wang",
"Geet Sethi",
"Guan Pang",
"Haoyu Ma",
"Ishan Misra",
"Ji Hou",
"Jialiang Wang",
"Kiran Jagadeesh",
"Kunpeng Li",
"Luxin Zhang",
"Mannat Singh",
"Mary Williamson",
"Matt Le",
"Matthew Yu",
"Mitesh Kumar Singh",
"Peizhao Zhang",
"Peter Vajda",
"Quentin Duval",
"Rohit Girdhar",
"Roshan Sumbaly",
"Sai Saketh Rambhatla",
"Sam Tsai",
"Samaneh Azadi",
"Samyak Datta",
"Sanyuan Chen",
"Sean Bell",
"Sharadh Ramaswamy",
"Shelly Sheynin",
"Siddharth Bhattacharya",
"Simran Motwani",
"Tao Xu",
"Tianhe Li",
"Tingbo Hou",
"Wei-Ning Hsu",
"Xi Yin",
"Xiaoliang Dai",
"Yaniv Taigman",
"Yaqiao Luo",
"Yen-Cheng Liu",
"Yi-Chiao Wu",
"Yue Zhao",
"Yuval Kirstain",
"Zecheng He",
"Zijian He",
"Albert Pumarola",
"Ali Thabet",
"Artsiom Sanakoyeu",
"Arun Mallya",
"Baishan Guo",
"Boris Araya",
"Breena Kerr",
"Carleigh Wood",
"Ce Liu",
"Cen Peng",
"Dimitry Vengertsev",
"Edgar Schonfeld",
"Elliot Blanchard",
"Felix Juefei-Xu",
"Fraylie Nord",
"Jeff Liang",
"John Hoffman",
"Jonas Kohler",
"Kaolin Fire",
"Karthik Sivakumar",
"Lawrence Chen",
"Licheng Yu",
"Luya Gao",
"Markos Georgopoulos",
"Rashel Moritz",
"Sara K. Sampson",
"Shikai Li",
"Simone Parmeggiani",
"Steve Fine",
"Tara Fowler",
"Vladan Petrovic",
"Yuming Du"
],
"affiliations": [
"Meta"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13720.jpg",
"data": {
"categories": [
"#diffusion",
"#synthetic",
"#inference",
"#video",
"#optimization",
"#multimodal",
"#data",
"#training",
"#open_source",
"#audio",
"#architecture"
],
"emoji": "🎬",
"ru": {
"title": "MovieGen: Революция в генерации мультимедиа",
"desc": "MovieGen - это набор фундаментальных моделей, генерирующих высококачественные видео в формате 1080p HD с синхронизированным аудио. Модели устанавливают новый уровень качества в нескольких задачах, включая синтез видео по тексту, персонализацию видео и генерацию аудио. Крупнейшая модель имеет 30 миллиардов параметров и может генерировать 16-секундные видео. Авторы представляют ряд технических инноваций в архитектуре, обучении и оптимизации моделей генерации мультимедиа."
},
"en": {
"title": "Revolutionizing Video Creation with Movie Gen",
"desc": "Movie Gen introduces advanced foundation models capable of generating high-quality videos with synchronized audio, offering new capabilities in video editing and personalization. The models excel in tasks like text-to-video synthesis and video-to-audio generation, setting a new benchmark in the field. With a 30 billion parameter transformer, the system can produce 16-second videos at 16 frames per second, showcasing significant technical innovations. These advancements aim to push forward the research and development of large-scale media generation models."
},
"zh": {
"title": "Movie Gen:引领高清视频生成新标准",
"desc": "这篇论文介绍了一个名为Movie Gen的基础模型集,可以生成高质量的1080p高清视频,并支持不同的宽高比和同步音频。该模型还具备精确的指令视频编辑和基于用户图像生成个性化视频的能力。Movie Gen在多项任务上设立了新的技术标准,包括文本到视频合成、视频个性化、视频编辑、视频到音频生成和文本到音频生成。通过多项技术创新和简化,该模型在架构、潜在空间、训练目标、数据策划等方面取得了突破,推动了大规模媒体生成模型的进步。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13754",
"title": "MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures",
"url": "https://huggingface.co/papers/2410.13754",
"abstract": "Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any real-world benchmark designed to optimize and standardize evaluations across input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions and the model rankings correlate strongly with that of crowd-sourced real-world evaluations (up to 0.98). We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.",
"score": 74,
"issue_id": 148,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "82517ad6fbb54273",
"authors": [
"Jinjie Ni",
"Yifan Song",
"Deepanway Ghosal",
"Bo Li",
"David Junhao Zhang",
"Xiang Yue",
"Fuzhao Xue",
"Zian Zheng",
"Kaichen Zhang",
"Mahir Shah",
"Kabir Jain",
"Yang You",
"Michael Shieh"
],
"affiliations": [
"Carnegie Mellon University",
"Independent Researcher",
"Nanyang Technological University",
"National University of Singapore",
"Peking University",
"University of Waterloo"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13754.jpg",
"data": {
"categories": [
"#benchmark",
"#optimization",
"#multimodal",
"#survey",
"#alignment"
],
"emoji": "🎭",
"ru": {
"title": "MixEval-X: Универсальный бенчмарк для оценки многомодальных моделей ИИ",
"desc": "Статья представляет MixEval-X - первый многомодальный бенчмарк для оценки моделей ИИ в реальных задачах. Авторы предлагают новые методы для создания репрезентативных наборов тестов, охватывающих различные модальности ввода и вывода. Бенчмарк решает проблемы несогласованности стандартов оценки и различных видов смещений в существующих методах. Результаты показывают высокую корреляцию с оценками краудсорсинга в реальных сценариях использования."
},
"en": {
"title": "MixEval-X: Bridging the Gap Between AI Benchmarks and Real-World Performance",
"desc": "The paper introduces MixEval-X, a benchmark designed to standardize evaluations across different input and output modalities in AI models. It addresses issues of inconsistent evaluation standards and biases in current methods by proposing a multi-modal benchmark mixture and adaptation-rectification pipelines. These pipelines help align benchmark samples with real-world task distributions, ensuring that evaluations are more representative of real-world scenarios. The approach shows strong correlation with real-world evaluations, providing valuable insights for improving multi-modal evaluations and guiding future research."
},
"zh": {
"title": "MixEval-X:优化多模态评估的全新基准",
"desc": "这篇论文讨论了AI模型在处理多种信号时需要可靠的评估方法。当前评估存在标准不一致和偏差问题。为此,作者提出了MixEval-X,一个用于优化和标准化多模态评估的基准。通过这种方法,评估结果更贴近真实世界的任务分布。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.12784",
"title": "JudgeBench: A Benchmark for Evaluating LLM-based Judges",
"url": "https://huggingface.co/papers/2410.12784",
"abstract": "LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench .",
"score": 42,
"issue_id": 160,
"pub_date": "2024-10-16",
"pub_date_card": {
"ru": "16 октября",
"en": "October 16",
"zh": "10月16日"
},
"hash": "a81030e9f379736a",
"authors": [
"Sijun Tan",
"Siyuan Zhuang",
"Kyle Montgomery",
"William Y. Tang",
"Alejandro Cuadron",
"Chenguang Wang",
"Raluca Ada Popa",
"Ion Stoica"
],
"affiliations": [
"UC Berkeley",
"Washington University in St. Louis"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.12784.jpg",
"data": {
"categories": [
"#reasoning",
"#benchmark",
"#math",
"#plp",
"#data",
"#training",
"#dataset",
"#open_source",
"#architecture",
"#alignment"
],
"emoji": "⚖️",
"ru": {
"title": "JudgeBench: Новый стандарт для оценки ИИ-судей",
"desc": "Статья представляет новую систему оценки судей на основе больших языковых моделей (LLM). Авторы предлагают JudgeBench - набор тестов для оценки LLM-судей в сложных задачах, охватывающих знания, рассуждения, математику и программирование. JudgeBench использует новый подход к созданию сложных пар ответов с метками предпочтений, отражающими объективную правильность. Результаты показывают, что JudgeBench представляет значительно большую сложность, чем предыдущие тесты, с многими сильными моделями, работающими лишь немного лучше, чем случайное угадывание."
},
"en": {
"title": "JudgeBench: Raising the Bar for LLM Evaluation",
"desc": "The paper introduces JudgeBench, a new benchmark designed to evaluate the reliability of LLM-based judges, which are used to assess and improve machine learning models. Unlike existing benchmarks that focus on alignment with human preferences, JudgeBench emphasizes objective correctness in challenging tasks like reasoning and coding. The framework converts difficult datasets into response pairs with preference labels, providing a more rigorous test for LLM-based judges. Results show that even advanced models struggle with JudgeBench, highlighting its effectiveness in assessing the capabilities of these judges."
},
"zh": {
"title": "JudgeBench:评估大语言模型评判者的新基准",
"desc": "这篇论文提出了一种新的评估框架,用于客观地评估基于大语言模型(LLM)的评判者。研究者开发了一个名为JudgeBench的基准,用于评估这些评判者在知识、推理、数学和编程等方面的挑战性响应对。JudgeBench通过一个新颖的流程,将现有的困难数据集转换为具有偏好标签的挑战性响应对,以反映客观的正确性。研究结果表明,JudgeBench比以往的基准更具挑战性,许多强大的模型在此基准上的表现仅略优于随机猜测。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13863",
"title": "Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens",
"url": "https://huggingface.co/papers/2410.13863",
"abstract": "Scaling up autoregressive models in vision has not proven as beneficial as in large language models. In this work, we investigate this scaling problem in the context of text-to-image generation, focusing on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed raster order using BERT- or GPT-like transformer architectures. Our empirical results show that, while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends. Models based on continuous tokens achieve significantly better visual quality than those using discrete tokens. Furthermore, the generation order and attention mechanisms significantly affect the GenEval score: random-order models achieve notably better GenEval scores compared to raster-order models. Inspired by these findings, we train Fluid, a random-order autoregressive model on continuous tokens. Fluid 10.5B model achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K, and 0.69 overall score on the GenEval benchmark. We hope our findings and results will encourage future efforts to further bridge the scaling gap between vision and language models.",
"score": 35,
"issue_id": 160,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "3fa9a449112a391b",
"authors": [
"Lijie Fan",
"Tianhong Li",
"Siyang Qin",
"Yuanzhen Li",
"Chen Sun",
"Michael Rubinstein",
"Deqing Sun",
"Kaiming He",
"Yonglong Tian"
],
"affiliations": [
"Google DeepMind",
"MIT"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13863.jpg",
"data": {
"categories": [
"#diffusion",
"#benchmark",
"#cv",
"#optimization",
"#games",
"#architecture"
],
"emoji": "🖼️",
"ru": {
"title": "Новый подход к масштабированию моделей генерации изображений",
"desc": "Исследователи изучили проблему масштабирования авторегрессионных моделей в контексте генерации изображений по тексту. Они сравнили модели с дискретными и непрерывными токенами, а также с различным порядком генерации и архитектурами трансформеров. Результаты показали, что модели с непрерывными токенами достигают лучшего визуального качества, а случайный порядок генерации улучшает показатели GenEval. На основе этих выводов была разработана модель Fluid, достигшая нового уровня производительности в нескольких бенчмарках."
},
"en": {
"title": "Bridging the Gap: Continuous Tokens and Random Order in Vision Models",
"desc": "This paper explores the challenges of scaling autoregressive models for text-to-image generation, focusing on the use of discrete versus continuous tokens and the order of token generation. The study finds that models using continuous tokens produce higher visual quality images compared to those using discrete tokens. Additionally, models that generate tokens in a random order outperform those using a fixed raster order in terms of GenEval scores. The authors introduce Fluid, a random-order autoregressive model with continuous tokens, which sets new benchmarks in zero-shot FID and GenEval scores, suggesting a promising direction for future research in bridging the gap between vision and language models."
},
"zh": {
"title": "突破视觉与语言模型扩展的界限",
"desc": "这篇论文研究了在图像生成中自回归模型的扩展问题,特别关注使用离散或连续的标记,以及标记生成的顺序。研究发现,使用连续标记的模型在视觉质量上明显优于使用离散标记的模型。生成顺序和注意力机制对GenEval评分有显著影响,随机顺序的模型在GenEval评分上表现更好。基于这些发现,作者训练了Fluid模型,在MS-COCO 30K数据集上取得了新的零样本FID记录。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13268",
"title": "Roadmap towards Superhuman Speech Understanding using Large Language Models",
"url": "https://huggingface.co/papers/2410.13268",
"abstract": "The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, aiming to create general foundation models capable of processing both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs, which preserves non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, SAGI Bechmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in using abstract acoustic knowledge and completeness of capability. Our findings reveal gaps in handling paralinguistic cues and abstract acoustic knowledge, and we offer future directions. This paper outlines a roadmap for advancing speech LLMs, introduces a benchmark for evaluation, and provides key insights into their current limitations and potential.",
"score": 33,
"issue_id": 153,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "929ec80dcb105705",
"authors": [
"Fan Bu",
"Yuhao Zhang",
"Xidong Wang",
"Benyou Wang",
"Qun Liu",
"Haizhou Li"
],
"affiliations": [
"Noahs Ark Lab, Huawei",
"The Chinese University of Hong Kong, Shenzhen"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13268.jpg",
"data": {
"categories": [
"#science",
"#benchmark",
"#agi",
"#multimodal",
"#survey",
"#audio",
"#architecture"
],
"emoji": "🗣️",
"ru": {
"title": "Дорожная карта для речевых LLM: от распознавания речи к сверхчеловеческим моделям",
"desc": "В статье предлагается дорожная карта из пяти уровней для развития речевых языковых моделей (LLM), способных обрабатывать как текстовые, так и нетекстовые входные данные. Авторы разработали бенчмарк SAGI для стандартизации оценки различных задач на этих уровнях. Исследование выявило пробелы в обработке паралингвистических сигналов и абстрактных акустических знаний. Статья предлагает направления для будущих исследований в области речевых LLM."
},
"en": {
"title": "Bridging Text and Sound: The Future of Speech Language Models",
"desc": "This paper explores the integration of speech and audio data into large language models (LLMs) to create versatile models that can handle both text and non-text inputs. It introduces a five-level roadmap for developing speech LLMs, from basic automatic speech recognition (ASR) to advanced models that incorporate non-semantic information and abstract acoustic knowledge. The authors also present the SAGI Benchmark, which evaluates these models across various tasks and highlights challenges in processing paralinguistic cues and abstract acoustic knowledge. The paper provides insights into the current limitations of speech LLMs and suggests future research directions to enhance their capabilities."
},
"zh": {
"title": "语音大模型的未来:从基础到超人",
"desc": "这篇论文探讨了将语音和音频数据整合到大型语言模型中的可能性,旨在创建能够处理文本和非文本输入的通用基础模型。研究提出了一个五级路线图,从基本的自动语音识别到能够处理复杂任务的超人模型。作者还设计了一个名为SAGI的基准,用于标准化这些五个级别中各种任务的关键方面。研究发现当前模型在处理副语言线索和抽象声学知识方面存在不足,并提供了未来的发展方向。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13757",
"title": "MobA: A Two-Level Agent System for Efficient Mobile Task Automation",
"url": "https://huggingface.co/papers/2410.13757",
"abstract": "Current mobile assistants are limited by dependence on system APIs or struggle with complex user instructions and diverse interfaces due to restricted comprehension and decision-making abilities. To address these challenges, we propose MobA, a novel Mobile phone Agent powered by multimodal large language models that enhances comprehension and planning capabilities through a sophisticated two-level agent architecture. The high-level Global Agent (GA) is responsible for understanding user commands, tracking history memories, and planning tasks. The low-level Local Agent (LA) predicts detailed actions in the form of function calls, guided by sub-tasks and memory from the GA. Integrating a Reflection Module allows for efficient task completion and enables the system to handle previously unseen complex tasks. MobA demonstrates significant improvements in task execution efficiency and completion rate in real-life evaluations, underscoring the potential of MLLM-empowered mobile assistants.",
"score": 31,
"issue_id": 150,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "a4a73fb090d1a0ae",
"authors": [
"Zichen Zhu",
"Hao Tang",
"Yansi Li",
"Kunyao Lan",
"Yixuan Jiang",
"Hao Zhou",
"Yixiao Wang",
"Situo Zhang",
"Liangtai Sun",
"Lu Chen",
"Kai Yu"
],
"affiliations": [
"Shanghai Jiao Tong University, China"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13757.jpg",
"data": {
"categories": [
"#reasoning",
"#agi",
"#multimodal",
"#agents",
"#architecture",
"#alignment"
],
"emoji": "📱",
"ru": {
"title": "MobA: Умный мобильный помощник нового поколения",
"desc": "MobA - это новый мобильный агент, основанный на мультимодальных больших языковых моделях. Он использует двухуровневую архитектуру с глобальным агентом для понимания команд и планирования, и локальным агентом для выполнения конкретных действий. Система включает модуль рефлексии для эффективного выполнения задач и обработки новых сложных заданий. MobA показывает значительное улучшение эффективности и уровня выполнения задач в реальных условиях."
},
"en": {
"title": "Revolutionizing Mobile Assistance with Multimodal Intelligence",
"desc": "The paper introduces MobA, a mobile assistant that uses multimodal large language models to improve understanding and task planning. It features a two-level agent architecture with a Global Agent for command comprehension and task planning, and a Local Agent for executing detailed actions. A Reflection Module is integrated to enhance the system's ability to handle complex and novel tasks. Real-life tests show MobA's improved efficiency and success in completing tasks, highlighting the potential of advanced language models in mobile assistants."
},
"zh": {
"title": "多模态大语言模型助力移动助手新突破",
"desc": "当前的移动助手由于对系统API的依赖和对复杂用户指令的理解能力有限,难以处理多样化的界面。为了解决这些问题,我们提出了MobA,这是一种由多模态大语言模型驱动的移动代理,通过复杂的双层代理架构增强理解和规划能力。高层的全局代理负责理解用户命令、跟踪历史记忆和规划任务,而低层的本地代理则根据子任务和全局代理的记忆预测详细的动作。通过集成反思模块,系统能够高效完成任务,并处理以前未见过的复杂任务。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.12705",
"title": "WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines",
"url": "https://huggingface.co/papers/2410.12705",
"abstract": "Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.",
"score": 29,
"issue_id": 157,
"pub_date": "2024-10-16",
"pub_date_card": {
"ru": "16 октября",
"en": "October 16",
"zh": "10月16日"
},
"hash": "6829d8490ef2d294",
"authors": [
"Genta Indra Winata",
"Frederikus Hudi",
"Patrick Amadeus Irawan",
"David Anugraha",
"Rifki Afina Putri",
"Yutong Wang",
"Adam Nohejl",
"Ubaidillah Ariq Prathama",
"Nedjma Ousidhoum",
"Afifa Amriani",
"Anar Rzayev",
"Anirban Das",
"Ashmari Pramodya",
"Aulia Adila",
"Bryan Wilie",
"Candy Olivia Mawalim",
"Ching Lam Cheng",
"Daud Abolade",
"Emmanuele Chersoni",
"Enrico Santus",
"Fariz Ikhwantri",
"Garry Kuwanto",
"Hanyang Zhao",
"Haryo Akbarianto Wibowo",
"Holy Lovenia",
"Jan Christian Blaise Cruz",
"Jan Wira Gotama Putra",
"Junho Myung",
"Lucky Susanto",
"Maria Angelica Riera Machin",
"Marina Zhukova",
"Michael Anugraha",
"Muhammad Farid Adilazuarda",
"Natasha Santosa",
"Peerat Limkonchotiwat",
"Raj Dabre",
"Rio Alexander Audino",
"Samuel Cahyawijaya",
"Shi-Xiong Zhang",
"Stephanie Yulia Salim",
"Yi Zhou",
"Yinxuan Gui",
"David Ifeoluwa Adelani",
"En-Shiun Annie Lee",
"Shogo Okada",
"Ayu Purwarianti",
"Alham Fikri Aji",
"Taro Watanabe",
"Derry Tanti Wijaya",
"Alice Oh",
"Chong-Wah Ngo"
],
"affiliations": [
"AI Singapore",
"Boston University",
"Capital One",
"Cardiff University",
"Cohere",
"Columbia University",
"HK PolyU",
"HKUST",
"ITB",
"Independent",
"JAIST",
"KAIST",
"MBZUAI",
"MILA",
"Masakhane",
"McGill",
"Monash University",
"NAIST",
"NICT",
"Ontario Tech",
"SEACrowd",
"SMU",
"Tokyo Tech",
"UCSB",
"University of Lagos",
"UofT"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.12705.jpg",
"data": {
"categories": [
"#benchmark",
"#multilingual",
"#cv",
"#graphs",
"#multimodal",
"#dataset",
"#open_source",
"#games",
"#low_resource"
],
"emoji": "🌎",
"ru": {
"title": "WorldCuisines: Глобальный тест на кулинарную эрудицию для ИИ",
"desc": "Статья представляет новый бенчмарк WorldCuisines для оценки понимания культурно-специфических знаний моделями компьютерного зрения и обработки естественного языка. Бенчмарк включает набор данных для визуального ответа на вопросы (VQA) с парами текст-изображение на 30 языках и диалектах, охватывающих 9 языковых семей. Авторы предоставляют наборы данных для оценки в двух размерах, а также тренировочный набор из 1 миллиона примеров. Результаты показывают, что модели лучше справляются с правильным контекстом местоположения, но испытывают трудности с состязательными контекстами и прогнозированием конкретных региональных кухонь и языков."
},
"en": {
"title": "\"WorldCuisines: Bridging Cultural Gaps in Vision Language Models\"",
"desc": "The paper introduces WorldCuisines, a large-scale benchmark designed to test Vision Language Models (VLMs) on their ability to understand culture-specific knowledge across multiple languages and dialects. This benchmark includes a Visual Question Answering (VQA) dataset with over 1 million text-image pairs, making it the largest of its kind for multicultural contexts. The study reveals that while VLMs can perform well when given correct location context, they face challenges with adversarial contexts and accurately predicting regional cuisines and languages. To aid further research, the authors provide a comprehensive knowledge base with annotated food entries and images."
},
"zh": {
"title": "跨文化视觉语言理解的新基准",
"desc": "这篇论文介绍了一个名为WorldCuisines的大规模基准,用于评估视觉语言模型在多语言和多文化背景下的理解能力。该基准包括一个视觉问答数据集,涵盖30种语言和方言,涉及9个语言家族,拥有超过100万个数据点。研究发现,视觉语言模型在正确的地理背景下表现较好,但在对抗性背景和预测特定地区的菜肴和语言时表现较差。为了支持未来的研究,作者还发布了一个包含注释食品条目和图像的知识库。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13824",
"title": "Harnessing Webpage UIs for Text-Rich Visual Understanding",
"url": "https://huggingface.co/papers/2410.13824",
"abstract": "Text-rich visual understanding-the ability to process environments where dense textual content is integrated with visuals-is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs are able to process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks-achieving up to a 48\\% improvement on VisualWebBench and a 19.1\\% boost in action accuracy on a web agent dataset Mind2Web-but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.",
"score": 29,
"issue_id": 146,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "7d1ade016ff53a03",
"authors": [
"Junpeng Liu",
"Tianyue Ou",
"Yifan Song",
"Yuxiao Qu",
"Wai Lam",
"Chenyan Xiong",
"Wenhu Chen",
"Graham Neubig",
"Xiang Yue"
],
"affiliations": [
"Carnegie Mellon University",
"Peking University",
"The Chinese University of Hong Kong",
"University of Waterloo"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13824.jpg",
"data": {
"categories": [
"#science",
"#synthetic",
"#benchmark",
"#cv",
"#graphs",
"#optimization",
"#multimodal",
"#data",
"#training",
"#dataset",
"#transfer_learning",
"#games",
"#architecture"
],
"emoji": "🌐",
"ru": {
"title": "Синтез веб-данных для универсального мультимодального понимания",
"desc": "В статье представлен новый подход к улучшению понимания визуального контекста с текстом для мультимодальных больших языковых моделей. Авторы предлагают синтезировать инструкции из веб-интерфейсов с помощью текстовых языковых моделей. Создан датасет MultiUI, содержащий 7,3 миллиона образцов из 1 миллиона веб-сайтов. Модели, обученные на MultiUI, показывают значительное улучшение в задачах веб-интерфейсов и обобщают свои способности на другие домены."
},
"en": {
"title": "Unlocking Text-Rich Visual Understanding with Web UI Data",
"desc": "The paper introduces a method to improve multimodal large language models (MLLMs) by synthesizing instructions from webpage UIs using text-based large language models (LLMs). These models, despite not having direct visual input, can process structured text from webpage accessibility trees and are trained with UI screenshots. The authors present MultiUI, a dataset with 7.3 million samples from 1 million websites, which helps models excel in web UI tasks and generalize to other domains like document understanding and OCR. The study demonstrates that web UI data can significantly enhance text-rich visual understanding across various applications."
},
"zh": {
"title": "网页UI数据:提升多模态视觉理解的关键",
"desc": "这篇论文提出了一种方法,通过网页的可访问性树生成多模态指令,来增强多模态大语言模型的文本丰富视觉理解能力。研究中使用了一个名为MultiUI的数据集,包含了来自100万个网站的730万样本,用于训练多模态模型。实验结果表明,使用MultiUI训练的模型在网页UI任务中表现优异,并且在非网页UI任务和其他领域如文档理解、OCR和图表解释中也有良好的泛化能力。这表明网页UI数据在提升多种场景下的文本丰富视觉理解方面具有广泛的应用潜力。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13848",
"title": "Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation",
"url": "https://huggingface.co/papers/2410.13848",
"abstract": "In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.",
"score": 28,
"issue_id": 148,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "8b28045f373976ba",
"authors": [
"Chengyue Wu",
"Xiaokang Chen",
"Zhiyu Wu",
"Yiyang Ma",
"Xingchao Liu",
"Zizheng Pan",
"Wen Liu",
"Zhenda Xie",
"Xingkai Yu",
"Chong Ruan",
"Ping Luo"
],
"affiliations": [
"DeepSeek-AI",
"Peking University",
"The University of Hong Kong"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13848.jpg",
"data": {
"categories": [
"#optimization",
"#multimodal",
"#interpretability",
"#games",
"#architecture"
],
"emoji": "🔀",
"ru": {
"title": "Janus: единая модель для мультимодального понимания и генерации",
"desc": "Статья представляет Janus - новую авторегрессивную модель для мультимодального понимания и генерации. В отличие от предыдущих подходов, Janus использует отдельные визуальные энкодеры для задач понимания и генерации, что позволяет оптимизировать работу модели. Единая архитектура трансформера обрабатывает данные от обоих энкодеров. Эксперименты показывают, что Janus превосходит предыдущие унифицированные модели и не уступает специализированным моделям для конкретных задач."
},
"en": {
"title": "Janus: A New Era in Multimodal Intelligence",
"desc": "The paper introduces Janus, a new framework that improves how machines understand and create content using different types of data, like images and text. Unlike previous models that used one visual encoder for both understanding and generating, Janus separates these tasks into different pathways, which helps improve performance. By using a single transformer architecture, Janus allows each task to choose the best way to process information, making it more flexible and effective. Experiments show that Janus not only outperforms previous models but also competes well with models designed for specific tasks."
},
"zh": {
"title": "Janus:多模态理解与生成的全新统一框架",
"desc": "这篇论文介绍了Janus,一个统一多模态理解和生成的自回归框架。以往的研究通常使用单一的视觉编码器来处理这两项任务,但由于多模态理解和生成所需的信息粒度不同,这种方法可能导致性能不佳。为了解决这个问题,Janus将视觉编码解耦为独立的路径,同时仍然使用统一的Transformer架构进行处理。实验表明,Janus不仅超越了之前的统一模型,还能匹敌或超过特定任务模型的表现。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13830",
"title": "DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control",
"url": "https://huggingface.co/papers/2410.13830",
"abstract": "Recent advances in customized video generation have enabled users to create videos tailored to both specific subjects and motion trajectories. However, existing methods often require complicated test-time fine-tuning and struggle with balancing subject learning and motion control, limiting their real-world applications. In this paper, we present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory, guided by a single image and a bounding box sequence, respectively, and without the need for test-time fine-tuning. Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning, and devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks derived from bounding boxes. While these two components achieve their intended functions, we empirically observe that motion control tends to dominate over subject learning. To address this, we propose two key designs: 1) the masked reference attention, which integrates a blended latent mask modeling scheme into reference attention to enhance subject representations at the desired positions, and 2) a reweighted diffusion loss, which differentiates the contributions of regions inside and outside the bounding boxes to ensure a balance between subject and motion control. Extensive experimental results on a newly curated dataset demonstrate that DreamVideo-2 outperforms state-of-the-art methods in both subject customization and motion control. The dataset, code, and models will be made publicly available.",
"score": 23,
"issue_id": 146,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "67dc892195cd59d6",
"authors": [
"Yujie Wei",
"Shiwei Zhang",
"Hangjie Yuan",
"Xiang Wang",
"Haonan Qiu",
"Rui Zhao",
"Yutong Feng",
"Feng Liu",
"Zhizhong Huang",
"Jiaxin Ye",
"Yingya Zhang",
"Hongming Shan"
],
"affiliations": [
"Alibaba Group",
"Fudan University",
"Michigan State University",
"Nanyang Technological University"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13830.jpg",
"data": {
"categories": [
"#diffusion",
"#video",
"#training",
"#dataset",
"#open_source",
"#games",
"#architecture"
],
"emoji": "🎬",
"ru": {
"title": "Создание персонализированных видео одним щелчком",
"desc": "DreamVideo-2 - это новая система для создания персонализированных видео без дополнительного обучения. Она использует одно изображение и последовательность ограничивающих рамок для генерации видео с заданным объектом и траекторией движения. Система вводит референсное внимание и маскированный модуль движения для балансировки между сохранением объекта и контролем движения. Эксперименты показывают превосходство DreamVideo-2 над современными методами в области персонализации объектов и контроля движения."
},
"en": {
"title": "Effortless Video Customization with DreamVideo-2",
"desc": "DreamVideo-2 is a new framework for creating customized videos without needing complex adjustments during testing. It uses a single image and a sequence of bounding boxes to guide video generation, focusing on both the subject and its motion. The framework introduces reference attention and a mask-guided motion module to improve subject learning and motion control. To balance these aspects, it employs masked reference attention and reweighted diffusion loss, achieving superior results compared to existing methods."
},
"zh": {
"title": "DreamVideo-2:无微调的视频定制新突破",
"desc": "这篇论文介绍了一种名为DreamVideo-2的视频定制框架,可以在不需要测试时微调的情况下生成特定主题和运动轨迹的视频。该方法通过引入参考注意力和掩码引导运动模块,实现了对主题学习和运动控制的平衡。研究发现,运动控制往往会压倒主题学习,因此提出了掩码参考注意力和重加权扩散损失来解决这一问题。实验结果表明,DreamVideo-2在主题定制和运动控制方面优于现有方法。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.11842",
"title": "MoH: Multi-Head Attention as Mixture-of-Head Attention",
"url": "https://huggingface.co/papers/2410.11842",
"abstract": "In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.",
"score": 20,
"issue_id": 148,
"pub_date": "2024-10-15",
"pub_date_card": {
"ru": "15 октября",
"en": "October 15",
"zh": "10月15日"
},
"hash": "4a94e557d3f7a79a",
"authors": [
"Peng Jin",
"Bo Zhu",
"Li Yuan",
"Shuicheng Yan"
],
"affiliations": [
"Kunlun 2050 Research & Skywork AI, Singapore",
"Peng Cheng Laboratory, Shenzhen, China",
"Rabbitpre Intelligence, Shenzhen, China",
"School of Electronic and Computer Engineering, Peking University, Shenzhen, China"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.11842.jpg",
"data": {
"categories": [
"#small_models",
"#inference",
"#optimization",
"#training",
"#transfer_learning",
"#architecture"
],
"emoji": "🧠",
"ru": {
"title": "Смешивание голов внимания для повышения эффективности трансформеров",
"desc": "Исследователи предлагают новый механизм внимания Mixture-of-Head (MoH), который улучшает эффективность многоголового внимания в трансформерах. MoH позволяет каждому токену выбирать подходящие головы внимания, повышая эффективность вывода без ущерба для точности. Эксперименты на ViT, DiT и языковых моделях показывают, что MoH превосходит стандартное многоголовое внимание, используя лишь 50-90% голов. Продолжительная настройка предобученных моделей, таких как LLaMA3-8B, с использованием MoH также демонстрирует значительное улучшение производительности."
},
"en": {
"title": "\"MoH: Elevating Attention Efficiency and Accuracy\"",
"desc": "This paper introduces Mixture-of-Head attention (MoH), an enhancement to the multi-head attention mechanism in Transformer models, aimed at improving efficiency and accuracy. MoH treats attention heads as experts, allowing each token to select the most relevant heads, which boosts inference efficiency without increasing parameters. By replacing the standard summation with a weighted summation, MoH adds flexibility and unlocks additional performance potential. Experiments show that MoH outperforms traditional multi-head attention, achieving higher accuracy with fewer attention heads, and can be applied to pre-trained models like LLaMA3-8B for further improvements."
},
"zh": {
"title": "混合头注意力:高效的Transformer新选择",
"desc": "这项研究改进了Transformer模型中的多头注意力机制,提高了效率,同时保持或超过了之前的准确性。研究表明,多头注意力可以用求和形式表示,并提出了混合头注意力(MoH)架构,将注意力头视为专家。MoH允许每个标记选择合适的注意力头,提高推理效率而不增加参数数量。实验表明,MoH在使用较少注意力头的情况下,性能优于传统多头注意力。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13085",
"title": "MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models",
"url": "https://huggingface.co/papers/2410.13085",
"abstract": "Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the amount of high-quality data and distribution shifts between training data and deployment data limit the application of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches are not sufficiently general to different medical domains and can potentially cause misalignment issues, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive retrieved contexts selection method, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (involving radiology, ophthalmology, pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in the factual accuracy of Med-LVLMs. Our data and code are available in https://github.com/richard-peng-xia/MMed-RAG.",
"score": 20,
"issue_id": 146,
"pub_date": "2024-10-16",
"pub_date_card": {
"ru": "16 октября",
"en": "October 16",
"zh": "10月16日"
},
"hash": "8ef96c4ea4d54ffd",
"authors": [
"Peng Xia",
"Kangyu Zhu",
"Haoran Li",
"Tianze Wang",
"Weijia Shi",
"Sheng Wang",
"Linjun Zhang",
"James Zou",
"Huaxiu Yao"
],
"affiliations": [
"Brown University",
"PloyU",
"Rutgers University",
"Stanford University",
"UNC-Chapel Hill",
"University of Washington"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13085.jpg",
"data": {
"categories": [
"#rag",
"#hallucinations",
"#benchmark",
"#cv",
"#multimodal",
"#healthcare",
"#data",
"#training",
"#dataset",
"#open_source",
"#alignment"
],
"emoji": "🏥",
"ru": {
"title": "MMed-RAG: Повышение точности медицинских AI-диагнозов",
"desc": "В статье представлена система MMed-RAG, разработанная для повышения фактической точности медицинских крупномасштабных моделей зрения и языка (Med-LVLMs). Система использует механизм поиска с учетом домена, адаптивный метод выбора найденных контекстов и стратегию дообучения на основе RAG. Эксперименты на пяти медицинских наборах данных показали среднее улучшение фактической точности Med-LVLMs на 43.8%. MMed-RAG решает проблемы галлюцинаций и несоответствий, часто встречающихся в существующих моделях."
},
"en": {
"title": "Enhancing Medical AI: MMed-RAG Boosts Diagnostic Accuracy",
"desc": "The paper discusses the development of MMed-RAG, a new system designed to improve the accuracy of Medical Large Vision-Language Models (Med-LVLMs) by addressing issues of factual hallucination. MMed-RAG uses a domain-aware retrieval mechanism and an adaptive context selection method to enhance the reliability of retrieval-augmented generation (RAG) processes. This approach ensures better alignment between the model's outputs and the ground truth across various medical domains. Experimental results show that MMed-RAG significantly boosts the factual accuracy of Med-LVLMs by 43.8% on average across multiple medical datasets."
},
"zh": {
"title": "MMed-RAG:提升医学AI模型准确性的多模态解决方案",
"desc": "这篇论文介绍了一种新的多模态检索增强生成系统,称为MMed-RAG,旨在提高医学大规模视觉语言模型的准确性。该系统通过引入领域感知的检索机制、自适应的检索上下文选择方法,以及可证明的偏好微调策略来增强模型的可靠性。实验结果表明,MMed-RAG在五个医学数据集上的表现显著提高了43.8%的事实准确性。此研究为医学诊断工具的开发提供了新的思路和方法。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13804",
"title": "BenTo: Benchmark Task Reduction with In-Context Transferability",
"url": "https://huggingface.co/papers/2410.13804",
"abstract": "Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4% difference to the evaluation on the original benchmark. Compared to prior works, our method is training-free, gradient-free, and highly efficient requiring ICL only.",
"score": 19,
"issue_id": 147,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "e8177fd577296e7e",
"authors": [
"Hongyu Zhao",
"Ming Li",
"Lichao Sun",
"Tianyi Zhou"
],
"affiliations": [
"Lehigh University",
"University of Maryland, College Park"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13804.jpg",
"data": {
"categories": [
"#optimization",
"#training",
"#transfer_learning",
"#benchmark"
],
"emoji": "🎯",
"ru": {
"title": "Эффективная оценка языковых моделей: меньше задач, та же точность",
"desc": "Статья исследует методы эффективного сокращения количества задач для оценки больших языковых моделей (LLM) без ущерба для качества оценки. Авторы предлагают метрику для оценки переносимости между задачами с помощью обучения в контексте (ICL). Анализируя попарную переносимость, можно сократить набор задач в современных бенчмарках LLM до 5% с разницей менее 4% по сравнению с оценкой на полном наборе. Метод не требует дополнительного обучения и градиентов, что делает его высокоэффективным."
},
"en": {
"title": "Efficient LLM Evaluation: Less is More",
"desc": "This paper explores a method to make evaluating large language models (LLMs) more efficient by reducing the number of tasks needed for benchmarking. It introduces a way to identify the most important tasks using task transferability and relevance, optimizing a facility location function. The authors propose a metric to estimate how well tasks transfer to each other using in-context learning, which helps in selecting a smaller set of tasks. Their approach can cut down the tasks in benchmarks like MMLU or FLAN to just 5% of the original, with minimal impact on evaluation quality, and it doesn't require any training or gradients."
},
"zh": {
"title": "高效评估:减少任务,保持质量",
"desc": "这篇论文研究如何在不影响评估质量的情况下,减少用于评估大型语言模型的任务数量。研究表明,任务的可转移性和相关性是识别最具代表性任务子集的关键。通过优化设施位置函数,我们提出了一种高效的指标来估计任务之间的可转移性。分析任务间的可转移性,可以将现代LLM基准中的任务减少到5%,而评估结果仅有不到4%的差异。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13785",
"title": "PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment",
"url": "https://huggingface.co/papers/2410.13785",
"abstract": "Alignment of large language models (LLMs) involves training models on preference-contrastive output pairs to adjust their responses according to human preferences. To obtain such contrastive pairs, traditional methods like RLHF and RLAIF rely on limited contrasting patterns, such as varying model variants or decoding temperatures. This singularity leads to two issues: (1) alignment is not comprehensive; and thereby (2) models are susceptible to jailbreaking attacks. To address these issues, we investigate how to construct more comprehensive and diversified contrasting patterns to enhance preference data (RQ1) and verify the impact of the diversification of contrasting patterns on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels, introducing six contrasting strategies that do not require additional feedback labeling procedures. Regarding RQ2, we conduct thorough experiments demonstrating that PopAlign significantly outperforms existing methods, leading to more comprehensive alignment.",
"score": 18,
"issue_id": 147,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "d458841995668004",
"authors": [
"Zekun Moore Wang",
"Shawn Wang",
"Kang Zhu",
"Jiaheng Liu",
"Ke Xu",
"Jie Fu",
"Wangchunshu Zhou",
"Wenhao Huang"
],
"affiliations": [
"201.AI",
"AIWaves",
"Beihang University",
"HKUST",
"Tsinghua University"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13785.jpg",
"data": {
"categories": [
"#rlhf",
"#training",
"#security",
"#architecture",
"#alignment"
],
"emoji": "🎯",
"ru": {
"title": "PopAlign: Комплексное выравнивание языковых моделей через разнообразные контрастные паттерны",
"desc": "Статья представляет новый подход к выравниванию больших языковых моделей под названием PopAlign. Этот метод использует разнообразные контрастные паттерны на уровнях промпта, модели и пайплайна для улучшения данных о предпочтениях. PopAlign решает проблемы ограниченности традиционных методов, таких как RLHF и RLAIF, и повышает устойчивость моделей к атакам типа jailbreaking. Эксперименты показывают, что PopAlign значительно превосходит существующие методы, обеспечивая более комплексное выравнивание."
},
"en": {
"title": "PopAlign: Diversifying Patterns for Better Model Alignment",
"desc": "The paper discusses improving the alignment of large language models (LLMs) by using a new framework called PopAlign. Traditional methods like RLHF and RLAIF have limitations due to their reliance on limited contrasting patterns, which can make models vulnerable to jailbreaking attacks. PopAlign introduces diversified contrasting patterns across different levels, such as prompt, model, and pipeline, without needing extra feedback labeling. Experiments show that PopAlign enhances model alignment more effectively than existing methods, making models more robust and comprehensive."
},
"zh": {
"title": "PopAlign:多样化对比策略提升模型对齐",
"desc": "这篇论文研究了如何通过对比模式来更好地调整大型语言模型的输出,使其更符合人类的偏好。传统方法如RLHF和RLAIF在对比模式上存在局限性,导致模型对攻击的脆弱性。为了解决这个问题,作者提出了PopAlign框架,通过在提示、模型和流程层面引入多样化的对比策略来增强偏好数据。实验结果表明,PopAlign显著优于现有方法,实现了更全面的模型对齐。"
}
}
},
{
"id": "https://huggingface.co/papers/2410.13639",
"title": "A Comparative Study on Reasoning Patterns of OpenAI's o1 Model",
"url": "https://huggingface.co/papers/2410.13639",
"abstract": "Enabling Large Language Models (LLMs) to handle a wider range of complex tasks (e.g., coding, math) has drawn great attention from many researchers. As LLMs continue to evolve, merely increasing the number of model parameters yields diminishing performance improvements and heavy computational costs. Recently, OpenAI's o1 model has shown that inference strategies (i.e., Test-time Compute methods) can also significantly enhance the reasoning capabilities of LLMs. However, the mechanisms behind these methods are still unexplored. In our work, to investigate the reasoning patterns of o1, we compare o1 with existing Test-time Compute methods (BoN, Step-wise BoN, Agent Workflow, and Self-Refine) by using OpenAI's GPT-4o as a backbone on general reasoning benchmarks in three domains (i.e., math, coding, commonsense reasoning). Specifically, first, our experiments show that the o1 model has achieved the best performance on most datasets. Second, as for the methods of searching diverse responses (e.g., BoN), we find the reward models' capability and the search space both limit the upper boundary of these methods. Third, as for the methods that break the problem into many sub-problems, the Agent Workflow has achieved better performance than Step-wise BoN due to the domain-specific system prompt for planning better reasoning processes. Fourth, it is worth mentioning that we have summarized six reasoning patterns of o1, and provided a detailed analysis on several reasoning benchmarks.",
"score": 16,
"issue_id": 162,
"pub_date": "2024-10-17",
"pub_date_card": {
"ru": "17 октября",
"en": "October 17",
"zh": "10月17日"
},
"hash": "39b4ff88e70ccaf2",
"authors": [
"Siwei Wu",
"Zhongyuan Peng",
"Xinrun Du",
"Tuney Zheng",
"Minghao Liu",
"Jialong Wu",
"Jiachen Ma",
"Yizhi Li",
"Jian Yang",
"Wangchunshu Zhou",
"Qunshu Lin",
"Junbo Zhao",
"Zhaoxiang Zhang",
"Wenhao Huang",
"Ge Zhang",
"Chenghua Lin",
"J. H. Liu"
],
"affiliations": [
"2077AI",
"Abaka AI",
"M-A-P",
"OpenO1 Team",
"University of Chinese Academy of Sciences",
"University of Manchester",
"Zhejiang University"
],
"pdf_title_img": "assets\\pdf\\title_img\\2410.13639.jpg",
"data": {
"categories": [
"#reasoning",
"#rl",
"#benchmark",
"#inference",
"#optimization",
"#math",
"#plp"
],
"emoji": "🧠",
"ru": {