[GoCN Cool Go Recommendation] gse: a high-performance multilingual NLP and word segmentation library for Go

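What is gse?

gse is a high-performance multilingual NLP and word segmentation library for Go. It supports English, Chinese, Japanese and other languages, and can be integrated with Elasticsearch and bleve. gse is a Go implementation of the jieba segmenter that also aims to add NLP features and more attributes.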

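Features

Multiple segmentation modes: normal, search engine, full and precise modes, plus HMM-based segmentation
Custom dictionaries, embedded dictionaries, part-of-speech tagging, stop words, and trimming and analysis of segmentation results
Multilingual support: English, Chinese, Japanese and others
Traditional Chinese support
NLP and TensorFlow support (in progress)
Named entity recognition (in progress)
Integration with Elasticsearch and bleve
Can run as a JSON RPC service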

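Algorithms

The dictionary trie is implemented as a Double-Array Trie.
The segmenter uses a frequency-based shortest-path algorithm with dynamic programming, together with DAG and HMM segmentation.
HMM segmentation is supported and uses the Viterbi algorithm.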
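
To make the DAG plus dynamic programming idea concrete, here is a minimal sketch that uses only the standard library. It is not gse's implementation: the toy dictionary, its frequencies, and the names buildDAG and segment are all made up for illustration, and a plain map stands in for the Double-Array Trie.

```go
package main

import (
	"fmt"
	"math"
)

// dict is a toy dictionary with made-up frequencies (not gse's data).
var dict = map[string]float64{
	"联邦": 50, "政府": 60, "联邦政府": 20, "共和": 10, "国": 30, "共和国": 40,
}

// buildDAG returns, for each rune index i, every end index j (exclusive)
// such that runes[i:j] is a dictionary word. A single rune is always
// allowed, so every position has at least one outgoing edge.
func buildDAG(runes []rune) [][]int {
	dag := make([][]int, len(runes))
	for i := range runes {
		dag[i] = []int{i + 1}
		for j := i + 2; j <= len(runes); j++ {
			if _, ok := dict[string(runes[i:j])]; ok {
				dag[i] = append(dag[i], j)
			}
		}
	}
	return dag
}

// segment picks the path through the DAG with the highest total
// log-probability, filling the table from right to left.
func segment(text string) []string {
	runes := []rune(text)
	n := len(runes)
	dag := buildDAG(runes)

	total := 0.0
	for _, f := range dict {
		total += f
	}

	best := make([]float64, n+1) // best[i]: best score for runes[i:]
	next := make([]int, n+1)     // next[i]: end index of the word chosen at i
	for i := n - 1; i >= 0; i-- {
		best[i] = math.Inf(-1)
		for _, j := range dag[i] {
			freq := dict[string(runes[i:j])]
			if freq == 0 {
				freq = 1 // unknown single rune: treat as a count of one
			}
			if score := math.Log(freq/total) + best[j]; score > best[i] {
				best[i], next[i] = score, j
			}
		}
	}

	var words []string
	for i := 0; i < n; i = next[i] {
		words = append(words, string(runes[i:next[i]]))
	}
	return words
}

func main() {
	fmt.Println(segment("联邦共和国联邦政府"))
}
```

With this toy dictionary the program prints [联邦 共和国 联邦政府]: the single word 联邦政府 scores higher than 联邦 followed by 政府, so the longer entry wins. A real segmenter would add an HMM pass for words missing from the dictionary.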

Segmentation speed

Single thread: 9.2 MB/s
Concurrent goroutines: 26.8 MB/s
HMM mode, single thread: 3.2 MB/s (dual-core, 4-thread MacBook Pro)
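
The gap between the single-thread and goroutine figures comes from segmenting independent pieces of text in parallel. The sketch below shows one common fan-out pattern with a fixed pool of workers. It is not gse's benchmark code: segmentFunc is a hypothetical stand-in that only splits on whitespace, and swapping in a real segmenter assumes it is safe to share across goroutines, which should be checked against the gse documentation.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// segmentFunc stands in for a real segmenter call (hypothetical);
// here it only splits on whitespace.
func segmentFunc(line string) []string {
	return strings.Fields(line)
}

// segmentAll segments every line using a fixed number of worker
// goroutines. Results keep the order of the input lines.
func segmentAll(lines []string, workers int) [][]string {
	out := make([][]string, len(lines))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handed to exactly one worker,
				// so writes to out never overlap.
				out[i] = segmentFunc(lines[i])
			}
		}()
	}

	for i := range lines {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	lines := []string{"hello world", "to be or not to be"}
	fmt.Println(segmentAll(lines, 4))
}
```

Throughput in MB/s can then be estimated by dividing the total number of input bytes by the elapsed wall-clock time.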

Quick start

package main

import (
	"fmt"
	"regexp"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	seg    gse.Segmenter
	posSeg pos.Segmenter

	// Note: this name shadows Go's builtin new within the package.
	new, _ = gse.New("zh,testdata/test_dict3.txt", "alpha")

	text = "你好世界, Hello world, Helloworld."
)

func main() {
	// Load the default dictionary.
	seg.LoadDict()
	// Load the default embedded dictionary.
	// seg.LoadDictEmbed()
	//
	// Load the Simplified Chinese dictionary.
	// seg.LoadDict("zh_s")
	// seg.LoadDictEmbed("zh_s")
	//
	// Load the Traditional Chinese dictionary.
	// seg.LoadDict("zh_t")
	//
	// Load the Japanese dictionary.
	// seg.LoadDict("jp")
	//
	// Load a dictionary from a file path.
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	cut()

	segCut()
}

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)
	fmt.Println("analyze: ", new.Analyze(hmm, text))

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)

	reg := regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)
	text1 := `헬로월드 헬로 서울, 2021年09月10日, 3.14`
	hmm = seg.CutDAG(text1, reg)
	fmt.Println("Cut with hmm and regexp: ", hmm, hmm[0], hmm[6])
}

func analyzeAndTrim(cut []string) {
	a := seg.Analyze(cut, "")
	fmt.Println("analyze the segment: ", a)

	cut = seg.Trim(cut)
	fmt.Println("cut all: ", cut)

	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))
}

func cutPos() {
	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)

	posSeg.WithGse(seg)
	po = posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Text to segment.
	tb := []byte("山达尔星联邦共和国联邦政府")

	// Print the segmentation result.
	fmt.Println("segmentation result, as a string, using search mode: ", seg.String(string(tb), true))
	fmt.Println("segmentation result, as a slice: ", seg.Slice(string(tb)))

	segments := seg.Segment(tb)
	// Format the result in normal mode.
	fmt.Println(gse.ToString(segments))

	segments1 := seg.Segment([]byte(text))
	// Format the result in search mode.
	fmt.Println(gse.ToString(segments1, true))
}

Output:

cut use hmm:  [你好世界,helloworld,helloworld.]
cut search use hmm:  [你好世界,helloworld,helloworld.]
analyze:  [{0 6 0 0  你好 725 l} {6 12 1 0  世界 34387 n} {25 27 2 0  , 0 } {27 32 3 0  hell 0 } {26 27 4 0    0 } {32 37 5 0  world 0 } {12 14 6 0  ,  0 } {27 37 7 0  helloworld 0 } {37 38 8 0  . 0 }]
cut all:  [你好 世界 ,   h e l l o   w o r l d ,   h e l l o w o r l d .]
Cut with hmm and regexp:  [헬로월드   헬로   서울 ,  2021年 09月 10日 ,  3.14] 헬로월드 2021年
segmentation result, as a string, using search mode:  山/n 达尔/nrt 星/n 联邦/n 共和/nz 国/zg 共和国/ns 联邦/n 政府/n 联邦政府/nt 
segmentation result, as a slice:  [山 达尔 星 联邦 共和国 联邦政府]
山/n 达尔/nrt 星/n 联邦/n 共和国/ns 联邦政府/nt 
你好/l 世界/n ,/x  /x hello/x  /x world/x ,/x  /x helloworld/x ./x
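
The "Cut with hmm and regexp" line depends on which spans the regular expression recognizes. The sketch below applies the pattern on its own with Go's standard regexp package, assuming each alternative uses a `+` quantifier. It only shows what the pattern matches; how gse combines those matches with dictionary lookup inside CutDAG is not covered, and matchTokens is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenRe keeps dates (年/月/日), Latin and Hangul runs, decimals and
// ASCII alphanumerics together as single matches.
var tokenRe = regexp.MustCompile(
	`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)

// matchTokens returns every non-overlapping match, left to right.
func matchTokens(s string) []string {
	return tokenRe.FindAllString(s, -1)
}

func main() {
	for _, tok := range matchTokens("헬로월드 헬로 서울, 2021年09月10日, 3.14") {
		fmt.Printf("%q\n", tok)
	}
}
```

On this input the pattern yields seven matches: 헬로월드, 헬로, 서울, 2021年, 09月, 10日 and 3.14. The spaces and commas between them are not matched, so in the gse output above they appear as separate tokens.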

For more usage, see the official examples on GitHub.

References

https://github.com/go-ego/gse/blob/master/README_zh.md

