# fesen-minhash

**Repository Path**: mirrors_codelibs/fesen-minhash

## Basic Information

- **Project Name**: fesen-minhash
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-08-08
- **Last Updated**: 2026-01-03

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Fesen MinHash Plugin
[![Java CI with Maven](https://github.com/codelibs/fesen-minhash/actions/workflows/maven.yml/badge.svg)](https://github.com/codelibs/fesen-minhash/actions/workflows/maven.yml)
=======================

## Overview

The MinHash plugin provides the b-bit MinHash algorithm for Elasticsearch.
Using the field type and the token filter provided by this plugin, you can add a minhash value to your documents.

## Version

[Versions in Maven Repository](https://repo1.maven.org/maven2/org/codelibs/fesen-minhash/)

### Issues/Questions

Please file an [issue](https://github.com/codelibs/fesen-minhash/issues "issue").

## Installation

    $ $ES_HOME/bin/fesen-plugin install org.codelibs:fesen-minhash:7.14.0

## Getting Started

### Add MinHash Analyzer

First, add a minhash analyzer when creating your index:

    $ curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
      "settings":{
        "index":{
          "analysis":{
            "analyzer":{
              "minhash_analyzer":{
                "type":"custom",
                "tokenizer":"standard",
                "filter":["minhash"]
              }
            }
          }
        }
      }
    }'

You are free to change the tokenizer/char\_filter/filter settings, but the minhash filter must be the last filter in the chain.

### Add MinHash field

Put a minhash field into the index mapping:

    $ curl -XPUT "localhost:9200/my_index/_mapping" -H 'Content-Type: application/json' -d '{
      "properties":{
        "message":{
          "type":"text",
          "copy_to":"minhash_value"
        },
        "minhash_value":{
          "type":"minhash",
          "store":true,
          "minhash_analyzer":"minhash_analyzer"
        }
      }
    }'

The minhash field type is a binary type.
The above example calculates a minhash value from the message field and stores it in the minhash\_value field.

## Get MinHash Value

Add the following document:

    $ curl -XPUT "localhost:9200/my_index/_doc/1" -H 'Content-Type: application/json' -d '{
      "message":"Fess is Java based full text search server provided as OSS product."
    }'

The minhash value is calculated automatically when the document is added.
You can check it as follows:

    $ curl -XGET "localhost:9200/my_index/_doc/1?pretty&stored_fields=minhash_value,_source"

The response is:

    {
      "_index" : "my_index",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "found" : true,
      "_source":{
        "message":"Fess is Java based full text search server provided as OSS product."
      },
      "fields" : {
        "minhash_value" : [ "KV5rsUfZpcZdVojpG8mHLA==" ]
      }
    }

## References

### Change the number of bits and hashes

To change the number of bits and hashes, specify them in the token filter settings:

    $ curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
      "settings":{
        "index":{
          "analysis":{
            "analyzer":{
              "minhash_analyzer":{
                "type":"custom",
                "tokenizer":"standard",
                "filter":["my_minhash"]
              }
            },
            "filter":{
              "my_minhash":{
                "type":"minhash",
                "seed":100,
                "bit":2,
                "size":32
              }
            }
          }
        }
      }
    }'

The settings above set the number of bits to 2, the number of hashes to 32, and the hash seed to 100.
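
As a rough illustration of how a stored value might be used on the client side, the following Python sketch decodes the base64 string from the response shown earlier and compares two signatures bit by bit. It is not part of the plugin: the helper names `decode_minhash` and `matching_bit_ratio` and the second signature are hypothetical, and the sketch assumes that the signatures being compared were produced with the same seed, bit, and size settings.

```python
import base64


def decode_minhash(value: str) -> bytes:
    """Decode the base64 string returned in the stored minhash_value field."""
    return base64.b64decode(value)


def matching_bit_ratio(a: bytes, b: bytes) -> float:
    """Return the fraction of bit positions at which two signatures agree."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    if not a:
        raise ValueError("signatures must not be empty")
    # XOR leaves a 1 bit wherever the two signatures differ.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (len(a) * 8)


# Value taken from the example response in this README.
stored = decode_minhash("KV5rsUfZpcZdVojpG8mHLA==")

# Hypothetical second signature: the same value with its first byte inverted.
other = bytes([stored[0] ^ 0xFF]) + stored[1:]

print(len(stored) * 8)                     # signature length in bits: 128
print(matching_bit_ratio(stored, stored))  # 1.0
print(matching_bit_ratio(stored, other))   # 0.9375 (8 of 128 bits differ)
```

The sample value decodes to 16 bytes (128 bits). The ratio is only a rough similarity signal: assuming one bit is kept per hash, unrelated documents would still be expected to agree on about half of the bits, so it should not be read directly as a Jaccard similarity.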