我们知道FST它可以节省大量的内存,但文章很少讨论其内部存储结构。本文构建了一个好的内存FST以数据为例,分析内部是如何存储的,以及如何使用存储结构相应的值。
分析程序:
package com.example; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefBuilder; import org.apache.lucene.util.IntsRefBuilder; import org.apache.lucene.util.fst.Builder; import org.apache.lucene.util.fst.FST; import org.apache.lucene.util.fst.PositiveIntOutputs; import org.apache.lucene.util.fst.Util; public class App { public static void main(String[] args) { try { String inputValues[] = {"cat", "deep", "do", "dog", "dogs"}; long outputValues[] = {5, 7, 17, 18, 21}; PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(); Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs); BytesRefBuilder scratchBytes = new BytesRefBuilder (); IntsRefBuilder scratchInts = new IntsRefBuilder(); for (int i = 0; i < inputValues.length; i ) { scratchBytes.copyChars(inputValues[i]); BytesRef bytesRef = scratchBytes.get(); builder.add(Util.toIntsRef(bytesRef, scratchInts), outputValues[i]); } FST<Long> fst = builder.finish(); Long value = Util.get(fst, new BytesRef("dog")); System.out.println(value); // 18 } catch (Exception e) { e.printStackTrace(); } } }
使用这段elasticsearch为7.13.4,在pom添加以下依赖文件:
<dependency> <groupId>org.elasticsearch</groupId> <artifactId>elasticsearch</artifactId> <version>7.13.4</version> </dependency>
该代码在执行过程中将被视为FST在内存中使用28字节的数据byte存储数组。FST本质上是一种图形,数组也是图形中常见的存储方法。分析如下:
0 = 0 | 1 = 116 |'t' 2 = 15 | 3 = 97 | 4 = 6 | 5 = 112 |'p' 6 = 15 | 7 = 101 |'e' 8 = 6 | 9 = 3 | 10 = 115 |'s' 11 = 31 | 12 = 1 |arc.output 13 = 103 |arc.label, 'g' 14 = 23 |14 is pre's arc.nextArc, 23 is new arc.flag: final_arc, no_target_size | 15 = 10 |arc.output 16 = 111 |'o',arc.label 17 = 23 |arc.flag: final_arc, no_target_size | 18 = 8 |arc.target 19 = 101 |'e', arg.label 20 = 0 |20 is pre's arc.nextArc and arc.target, 0 is arg.flag: no output 21 = 7 |arc.output, final output 22 = 100 |'d', arc.label 23 = 22 |23 is pre's arc.nextArc, 22 is arc.flag, no target size, use position value as target value 24 = 4 |4 is arg.target, no use? 25 = 5 |arc.output 26 = 99 |'c', arc.label 27 = 16 |27 is arg.target, 16 is arc.flag
0 = 0表示0号byte存储0,存储倒序,即起点在27号byte。以上分析了搜索"dog"看注释基本能理解过程。dog的值18是从d、o积累了和g的值。