资讯详情

lucence fst初探

我们知道FST它可以节省大量的内存,但文章很少讨论其内部存储结构。本文构建了一个好的内存FST以数据为例,分析内部是如何存储的,以及如何使用存储结构相应的值。

分析程序:

package com.example;  import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefBuilder; import org.apache.lucene.util.IntsRefBuilder; import org.apache.lucene.util.fst.Builder; import org.apache.lucene.util.fst.FST; import org.apache.lucene.util.fst.PositiveIntOutputs; import org.apache.lucene.util.fst.Util;  public class App {      public static void main(String[] args) {         try {             String inputValues[] = {"cat", "deep", "do", "dog", "dogs"};             long outputValues[] = {5, 7, 17, 18, 21};             PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();             Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);             BytesRefBuilder scratchBytes = new BytesRefBuilder ();             IntsRefBuilder scratchInts = new IntsRefBuilder();             for (int i = 0; i < inputValues.length; i  ) {                 scratchBytes.copyChars(inputValues[i]);                 BytesRef bytesRef = scratchBytes.get();                 builder.add(Util.toIntsRef(bytesRef, scratchInts), outputValues[i]);             }             FST<Long> fst = builder.finish();             Long value = Util.get(fst, new BytesRef("dog"));             System.out.println(value); // 18         } catch (Exception e) {             e.printStackTrace();         }     } }

使用这段elasticsearch为7.13.4,在pom添加以下依赖文件:

<dependency>     <groupId>org.elasticsearch</groupId>     <artifactId>elasticsearch</artifactId>     <version>7.13.4</version> </dependency>

该代码在执行过程中将被视为FST在内存中使用28字节的数据byte存储数组。FST本质上是一种图形,数组也是图形中常见的存储方法。分析如下:

0 = 0    | 1 = 116  |'t' 2 = 15   | 3 = 97   | 4 = 6    | 5 = 112  |'p' 6 = 15   | 7 = 101  |'e' 8 = 6    | 9 = 3    | 10 = 115 |'s' 11 = 31  |  12 = 1   |arc.output 13 = 103 |arc.label, 'g' 14 = 23  |14 is pre's arc.nextArc, 23 is new arc.flag: final_arc, no_target_size          | 15 = 10  |arc.output 16 = 111 |'o',arc.label 17 = 23  |arc.flag: final_arc, no_target_size          | 18 = 8   |arc.target 19 = 101 |'e', arg.label 20 = 0   |20 is pre's arc.nextArc and arc.target, 0 is arg.flag: no output  21 = 7   |arc.output, final output 22 = 100 |'d', arc.label 23 = 22  |23 is pre's arc.nextArc, 22 is arc.flag, no target size, use position value as target value  24 = 4   |4 is arg.target, no use? 25 = 5   |arc.output 26 = 99  |'c', arc.label 27 = 16  |27 is arg.target, 16 is arc.flag

0 = 0表示0号byte存储0,存储倒序,即起点在27号byte。以上分析了搜索"dog"看注释基本能理解过程。dog的值18是从d、o积累了和g的值。

标签: 荔波智能电容器fst

锐单商城拥有海量元器件数据手册IC替代型号,打造 电子元器件IC百科大全!

锐单商城 - 一站式电子元器件采购平台