Commit

release branch 0.2.0
houbb committed Mar 25, 2021
1 parent dbf0a08 commit 1046d84
Showing 7 changed files with 117 additions and 37 deletions.
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -106,4 +106,12 @@
| Type | Change | Time | Notes |
|:---|:---|:---|:---|
| O | Optimize the nlp-common base package dependency | 2021-03-25 22:18:57 | |
| D | Remove the Appender interface and its implementation classes | 2021-03-25 22:18:57 | |
| D | Remove the Appender interface and its implementation classes | 2021-03-25 22:18:57 | |

# release_0.2.0

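| Type | Change | Time | Notes |
|:---|:---|:---|:---|
| O | Optimize the performance of the basic utility-class methods | 2021-3-25 21:26:17 | |
| D | Deprecate the initial (shengmu) / final (yunmu) methods | 2021-03-25 22:18:57 | |
| T | Add benchmark tests against pinyin4j v2.5.1 | 2021-03-25 22:18:57 | |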
27 changes: 16 additions & 11 deletions README.md
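@@ -39,9 +39,11 @@

- Supports retrieving initial (shengmu) and final (yunmu) information

### v0.1.6 main changes
### v0.2.0 main changes

- Dependency version optimization
- Performance optimization

- The bootstrap class adds a method for customizing segmentation

# Quick start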

@@ -55,7 +57,7 @@ jdk 1.7+
<dependency>
<groupId>com.github.houbb</groupId>
<artifactId>pinyin</artifactId>
<version>0.1.6</version>
<version>0.2.0</version>
</dependency>
```

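@@ -244,21 +246,24 @@ Assert.assertEquals("wǒ ài chóng qìng huǒ guō", pinyin);

All benchmarks were run after a warm-up pass; the figures are for reference only.

The pinyin4j version compared against is v2.5.1.

## Single-character conversion

| Function | Iterations | Input | Time |
|:---|:---|:---|:---|
| `Pinyin4j toHanyuPinyinStringArray()` | 1,000,000 | One character picked at random from the same text | 621 ms |
| `pinyin toPinyin()` | 1,000,000 | One character picked at random from the same text | 317 ms |
| `Pinyin4j toHanyuPinyinStringArray()` | 1,000,000 | One character picked at random from the same text | 650 ms |
| `pinyin toPinyin()` | 1,000,000 | One character picked at random from the same text | 410 ms |

## String conversion

| Function | Iterations | Input | Time |
|:---|:---|:---|:---|
| `Pinyin4j toHanyuPinyinString()` | 10,000 | Same long text | 33002 ms |
| `pinyin toPinyin()` | 10,000 | Same long text | 17975 ms |
| `Pinyin4j toHanyuPinyinString()` | 10,000 | Same long text | 26324 ms |
| `pinyin toPinyin()` | 10,000 | Same long text | 16260 ms |
| `pinyin toPinyin()` | 10,000 | Same long text, chars segmentation mode | 14804 ms |

Moreover, Pinyin4j's Chinese string conversion does not support word segmentation; even with segmentation enabled, this project is still roughly twice as fast as pinyin4j.
pinyin4j's Chinese string conversion does not support word segmentation; with segmentation enabled, this project is roughly twice as fast as pinyin4j.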
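
To illustrate why word segmentation matters for polyphonic characters such as 重 (zhòng / chóng), here is a minimal, self-contained sketch. It is not this project's implementation: the class `SegmentSketch`, its two tiny dictionaries, and the forward-maximum-matching strategy are all assumptions made for illustration (Java 8+ for `String.join`).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SegmentSketch {

    // Hypothetical word-level dictionary: known phrases override per-character defaults.
    private static final Map<String, String> WORDS = new HashMap<>();
    // Hypothetical character-level dictionary: only the default reading of each character.
    private static final Map<Character, String> CHARS = new HashMap<>();
    private static final int MAX_WORD_LENGTH = 2;

    static {
        WORDS.put("重庆", "chóng qìng");
        CHARS.put('重', "zhòng");
        CHARS.put('庆', "qìng");
        CHARS.put('火', "huǒ");
        CHARS.put('锅', "guō");
    }

    /** Character-by-character conversion: every character gets its default reading. */
    static String byChars(String text) {
        List<String> parts = new ArrayList<>();
        for (char c : text.toCharArray()) {
            String pinyin = CHARS.get(c);
            // Unknown characters are passed through unchanged.
            parts.add(pinyin != null ? pinyin : String.valueOf(c));
        }
        return String.join(" ", parts);
    }

    /** Forward maximum matching: try the longest known word first, else fall back to one character. */
    static String bySegments(String text) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int matched = 0;
            for (int len = Math.min(MAX_WORD_LENGTH, text.length() - i); len > 1; len--) {
                String wordPinyin = WORDS.get(text.substring(i, i + len));
                if (wordPinyin != null) {
                    parts.add(wordPinyin);
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                parts.add(byChars(text.substring(i, i + 1)));
                matched = 1;
            }
            i += matched;
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(byChars("重庆火锅"));    // zhòng qìng huǒ guō
        System.out.println(bySegments("重庆火锅")); // chóng qìng huǒ guō
    }
}
```

The per-character path picks the default reading of 重, while the segmented path recognizes 重庆 as a word and picks the correct one. The extra dictionary lookups are the kind of work a segmenting converter does on top of a plain per-character one.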

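# Acknowledgements

@@ -270,6 +275,8 @@ Assert.assertEquals("wǒ ài chóng qìng huǒ guō", pinyin);

- [x] Support for the keyboard-input pinyin style

- [x] Bootstrap class exposes custom segmentation configuration

- [ ] Return a list of homophones

- [ ] Near-homophone detection
@@ -278,6 +285,4 @@ Assert.assertEquals("wǒ ài chóng qìng huǒ guō", pinyin);

- [ ] Pinyin-to-Chinese conversion

- [ ] Performance optimization

- [ ] Expose custom segmentation configuration
- [ ] Performance optimization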
4 changes: 2 additions & 2 deletions pom.xml
@@ -6,7 +6,7 @@

<groupId>com.github.houbb</groupId>
<artifactId>pinyin</artifactId>
<version>0.1.6</version>
<version>0.2.0</version>

<properties>
<!--============================== All PLUGINS START ==============================-->
@@ -60,7 +60,7 @@
<dependency>
<groupId>com.belerweb</groupId>
<artifactId>pinyin4j</artifactId>
<version>2.5.0</version>
<version>2.5.1</version>
<optional>true</optional>
<scope>test</scope>
</dependency>
4 changes: 2 additions & 2 deletions release.bat
@@ -10,9 +10,9 @@ ECHO "============================= RELEASE START..."

:: Version information (must be set manually)
:::: Old version name
SET version=0.1.6
SET version=0.2.0
:::: New version name
SET newVersion=0.1.7
SET newVersion=0.2.1
:::: Group name
SET groupName=com.github.houbb
:::: Project name
24 changes: 19 additions & 5 deletions src/main/java/com/github/houbb/pinyin/bs/PinyinBs.java
@@ -11,6 +11,7 @@
import com.github.houbb.pinyin.support.chinese.PinyinChineses;
import com.github.houbb.pinyin.support.data.PinyinData;
import com.github.houbb.pinyin.support.segment.DefaultPinyinSegment;
import com.github.houbb.pinyin.support.segment.PinyinSegments;
import com.github.houbb.pinyin.support.style.PinyinToneStyles;
import com.github.houbb.pinyin.support.tone.DefaultPinyinTone;

@@ -29,25 +30,25 @@ private PinyinBs(){}
* Default segmenter
* @since 0.0.1
*/
private IPinyinSegment pinyinSegment = Instances.singleton(DefaultPinyinSegment.class);
private IPinyinSegment pinyinSegment = PinyinSegments.defaults();

/**
* Chinese-character service class
* @since 0.0.1
*/
private IPinyinChinese pinyinChinese = PinyinChineses.simple();
private final IPinyinChinese pinyinChinese = PinyinChineses.simple();

/**
* Tone mapping
* @since 0.0.1
*/
private IPinyinTone pinyinTone = Instances.singleton(DefaultPinyinTone.class);
private final IPinyinTone pinyinTone = Instances.singleton(DefaultPinyinTone.class);

/**
* Pinyin data implementation
* @since 0.1.1
*/
private IPinyinData data = Instances.singleton(PinyinData.class);
private final IPinyinData data = Instances.singleton(PinyinData.class);

/**
* Pinyin style
@@ -59,7 +60,7 @@ private PinyinBs(){}
* Default core implementation
* @since 0.1.1
*/
private IPinyin pinyin = Instances.singleton(Pinyin.class);
private final IPinyin pinyin = Instances.singleton(Pinyin.class);

/**
* Connector string
@@ -100,6 +101,19 @@ public PinyinBs connector(String connector) {
return this;
}

/**
* Set a custom segmenter
* @param pinyinSegment the pinyin segmenter implementation
* @return this instance, for chaining
* @since 0.2.0
*/
public PinyinBs segment(IPinyinSegment pinyinSegment) {
ArgUtil.notNull(pinyinSegment, "segment");

this.pinyinSegment = pinyinSegment;
return this;
}

/**
* Convert to a pinyin string
* @param string the input string
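
The new `segment(...)` method follows the fluent-setter pattern: validate the argument, store it, and return `this`. The sketch below shows that pattern in isolation. It is self-contained and does not use the project's classes: `Segmenter`, `Bootstrap`, `CHARS`, and `WHOLE` are hypothetical stand-ins, and `Objects.requireNonNull` is used here in place of the project's `ArgUtil.notNull`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

final class FluentSegmentSketch {

    /** Hypothetical stand-in for a segmenter strategy interface. */
    interface Segmenter {
        List<String> segment(String text);
    }

    /** Hypothetical strategy: one token per character. */
    static final Segmenter CHARS = text -> {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) {
            tokens.add(String.valueOf(c));
        }
        return tokens;
    };

    /** Hypothetical strategy: the whole text as a single token. */
    static final Segmenter WHOLE = text -> Collections.singletonList(text);

    /** Hypothetical bootstrap class whose segmenter can be swapped by the caller. */
    static final class Bootstrap {
        private Segmenter segmenter = CHARS;

        private Bootstrap() {}

        static Bootstrap newInstance() {
            return new Bootstrap();
        }

        /** Rejects null, stores the strategy, and returns this for chaining. */
        Bootstrap segment(Segmenter segmenter) {
            this.segmenter = Objects.requireNonNull(segmenter, "segment");
            return this;
        }

        List<String> tokens(String text) {
            return segmenter.segment(text);
        }
    }

    public static void main(String[] args) {
        System.out.println(Bootstrap.newInstance().tokens("ab"));
        System.out.println(Bootstrap.newInstance().segment(WHOLE).tokens("ab"));
    }
}
```

Because the setter mutates the instance it is called on, a configured instance is best created once and reused, which is what the benchmark in this commit does with `PinyinBs.newInstance().segment(PinyinSegments.chars())`.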
26 changes: 21 additions & 5 deletions src/main/java/com/github/houbb/pinyin/util/PinyinHelper.java
@@ -19,14 +19,22 @@ public final class PinyinHelper {

private PinyinHelper(){}

/**
* Default instance
*
* Avoids creating a new object on every call to the most basic methods
* @since 0.2.0
*/
private static final PinyinBs PINYIN_BS_DEFAULT = PinyinBs.newInstance();

/**
* Convert to pinyin
* @param string the original text
* @return the result
* @since 0.0.1
*/
public static String toPinyin(final String string) {
return toPinyin(string, PinyinStyleEnum.DEFAULT);
return PINYIN_BS_DEFAULT.toPinyin(string);
}

/**
@@ -72,7 +80,7 @@ public static String toPinyin(final String string,
* @since 0.1.1
*/
public static List<String> toPinyinList(final char chinese) {
return PinyinBs.newInstance().toPinyinList(chinese);
return PINYIN_BS_DEFAULT.toPinyinList(chinese);
}

/**
@@ -95,7 +103,7 @@ public static List<String> toPinyinList(final char chinese, final PinyinStyleEnu
* @since 0.0.8
*/
public static boolean hasSamePinyin(final char chineseOne, final char chineseTwo) {
return PinyinBs.newInstance().hasSamePinyin(chineseOne, chineseTwo);
return PINYIN_BS_DEFAULT.hasSamePinyin(chineseOne, chineseTwo);
}

/**
@@ -106,7 +114,7 @@ public static boolean hasSamePinyin(final char chineseOne, final char chineseTwo
* @since 0.1.1
*/
public static List<Integer> toneNumList(final String chinese) {
return PinyinBs.newInstance().toneNumList(chinese);
return PINYIN_BS_DEFAULT.toneNumList(chinese);
}

/**
@@ -117,26 +125,34 @@ public static List<Integer> toneNumList(final String chinese) {
* @since 0.1.1
*/
public static List<Integer> toneNumList(final char chinese) {
return PinyinBs.newInstance().toneNumList(chinese);
return PINYIN_BS_DEFAULT.toneNumList(chinese);
}

/**
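* Convert to a list of initials (shengmu)
*
* Not intended to be exposed for now; may be removed in a later release
* @param chinese the Chinese text
* @return the result
* @since 0.1.1
* @deprecated deprecated since 0.2.0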
*/
@Deprecated
public static List<String> shengMuList(final String chinese) {
final IPinyinToneStyle pinyinTone = PinyinToneStyles.getTone(PinyinStyleEnum.NORMAL);
return PinyinBs.newInstance().style(pinyinTone).shengMuList(chinese);
}

/**
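* Convert to a list of finals (yunmu)
*
* Not intended to be exposed for now; may be removed in a later release
* @param chinese the Chinese text
* @return the result
* @since 0.1.1
* @deprecated deprecated since 0.2.0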
*/
@Deprecated
public static List<String> yunMuList(final String chinese) {
final IPinyinToneStyle pinyinTone = PinyinToneStyles.getTone(PinyinStyleEnum.NORMAL);
return PinyinBs.newInstance().style(pinyinTone).yunMuList(chinese);
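
The `PinyinHelper` change above swaps a per-call `PinyinBs.newInstance()` for a shared `PINYIN_BS_DEFAULT`, while the two deprecated methods, which call `style(...)`, still build a fresh instance. The sketch below shows that split in isolation. `Engine` and its `created` counter are hypothetical, not project code, and the stated reason for the split (avoiding configuration leaking into a shared mutable instance) is an assumption rather than something the commit states.

```java
final class SharedDefaultSketch {

    /** Hypothetical mutable, fluent engine that counts how many instances were built. */
    static final class Engine {
        static int created = 0;
        private String connector = " ";

        private Engine() {
            created++;
        }

        static Engine newInstance() {
            return new Engine();
        }

        Engine connector(String connector) {
            this.connector = connector;
            return this;
        }

        String join(String left, String right) {
            return left + connector + right;
        }
    }

    // Built once and never reconfigured, so every default call can share it.
    private static final Engine DEFAULT = Engine.newInstance();

    /** Default path: reuses the shared instance, no allocation per call. */
    static String joinDefault(String left, String right) {
        return DEFAULT.join(left, right);
    }

    /** Customized path: builds a fresh instance so the shared one is never mutated. */
    static String joinWith(String connector, String left, String right) {
        return Engine.newInstance().connector(connector).join(left, right);
    }

    public static void main(String[] args) {
        System.out.println(joinDefault("chóng", "qìng"));
        System.out.println(joinWith("-", "chóng", "qìng"));
        System.out.println("instances created: " + Engine.created);
    }
}
```

Sharing is only safe in this sketch because nothing mutates `DEFAULT` after construction. The sketch makes no claim about thread safety or about whether the real `PinyinBs` is safe to share across threads.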
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
package com.github.houbb.pinyin.test.benchmark;

import com.github.houbb.heaven.util.io.StreamUtil;
import com.github.houbb.pinyin.bs.PinyinBs;
import com.github.houbb.pinyin.support.segment.PinyinSegments;
import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinCaseType;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
@@ -36,36 +38,40 @@ public class BenchmarkTest {
private static final int SINGLE_TIMES = 1000000;

/**
* pinyin4j: Pinyin4j cost: 33002 ms
* pinyin4j: Pinyin4j cost: 26324
*
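* Pinyin4j has no Chinese word segmentation
* Pinyin4j does have Chinese word segmentation, but the formatting of the resulting pinyin is not very good.
*
* 重庆火锅: chong qing huoguo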
* @since 0.2.0
*/
@Test
public void pinyin4jNoSegmentTest() throws BadHanyuPinyinOutputFormatCombination {
public void pinyin4jTest() throws BadHanyuPinyinOutputFormatCombination {
// Create the Hanyu Pinyin output format
HanyuPinyinOutputFormat defaultFormat = new HanyuPinyinOutputFormat();
// Output settings: letter case and tone style
defaultFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE);
defaultFormat.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
// As the output shows, word segmentation is not supported.
// zhong qing huo guo
String result = PinyinHelper.toHanyuPinyinString("重庆火锅", defaultFormat, " ");
// chong qing huoguo
String result = PinyinHelper.toHanYuPinyinString("重庆火锅", defaultFormat, " ", false);
System.out.println(result);

// Timed run
final String text = getText();
long startTime = System.currentTimeMillis();
for(int i = 0; i < TIMES; i++) {
PinyinHelper.toHanyuPinyinString(text, defaultFormat, " ");
PinyinHelper.toHanYuPinyinString(text, defaultFormat, " ", false);
}
long endTime = System.currentTimeMillis();
System.out.println("Pinyin4j cost: " + (endTime-startTime));
}

/**
* Warm-up time: 565 ms
* 10,000 iterations cost: 17975 ms
* Pinyin cost: 16260
*
* 重庆火锅: chóng qìng huǒ guō
* Test of this framework with word segmentation enabled.
*/
@Test
@@ -85,18 +91,48 @@ public void pinyinWithSegmentTest() {
System.out.println("Pinyin cost: " + (endTime-startTime));
}

/**
* chóng qìng huǒ guō
* Pinyin cost: 14804
*
* 重庆火锅: chóng qìng huǒ guō
*
* Test using the explicitly specified chars segmenter
*
* @since 0.2.0
*/
@Test
public void pinyinWithCharSegmentTest() {
// Warm-up
String result = com.github.houbb.pinyin.util.PinyinHelper.toPinyin("重庆火锅");
System.out.println(result);

// Timed run
final String text = getText();
long startTime = System.currentTimeMillis();
PinyinBs pinyinBs = PinyinBs.newInstance().segment(PinyinSegments.chars());
for(int i = 0; i < TIMES; i++) {
pinyinBs.toPinyin(text);
}
long endTime = System.currentTimeMillis();

System.out.println("Pinyin cost: " + (endTime-startTime));
}

/**
* Get the text content
* @return the content
* @since 0.0.1
*/
private String getText() {
return StreamUtil.getFileContent("永远的夏娃.txt");
return StreamUtil.getFileContent("back/永远的夏娃.txt");
}

/**
* Timing for a single character
* 621ms
* [zhong4, chong2]
*
* Pinyin4j cost: 650
*/
@Test
public void pinyin4jCharTest() {
@@ -119,8 +155,9 @@ public void pinyin4jCharTest() {
}

/**
* Timing for a single character
* 317 ms
* [zhòng, chóng, tóng]
*
* Pinyin cost: 410
*/
@Test
public void pinyinCharTest() {
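
The benchmark tests in this commit share one shape: a warm-up call, then a loop timed with wall-clock timestamps. A minimal, self-contained sketch of that shape follows. `TimingSketch`, `timeMillis`, and the `sink` field are hypothetical names; the sketch uses `System.nanoTime()` where the tests use `System.currentTimeMillis()`, and it times a stand-in function rather than either pinyin library.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

final class TimingSketch {

    // Receives a checksum of the results so the measured calls are not trivially dead code.
    static volatile long sink;

    /**
     * Runs fn warmupRuns times untimed, then timedRuns times timed,
     * and returns the elapsed time of the timed loop in milliseconds.
     */
    static long timeMillis(Function<String, String> fn, String input, int warmupRuns, int timedRuns) {
        long checksum = 0;
        for (int i = 0; i < warmupRuns; i++) {
            checksum += fn.apply(input).length();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            checksum += fn.apply(input).length();
        }
        long elapsedNanos = System.nanoTime() - start;
        sink = checksum;
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
    }

    public static void main(String[] args) {
        // String::toUpperCase is only a placeholder workload.
        long cost = timeMillis(String::toUpperCase, "benchmark input", 1_000, 100_000);
        System.out.println("cost: " + cost + " ms");
    }
}
```

Hand-rolled loops like this give rough, machine-dependent numbers, which is consistent with the README presenting its figures as reference values. A dedicated harness such as JMH handles JIT warm-up and dead-code elimination more rigorously.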
