How does Tantivy apply different tokenizers for different languages?

The site's search function is built using tantivy and tantivy-jieba. Tantivy is a high-performance full-text search engine library written in Rust, inspired by Apache Lucene. It supports BM25 scoring, natural language queries, phrase searches, faceted retrieval, and various field types (including text, numeric, date, IP, and JSON), along with multilingual tokenization support (including Chinese, Japanese, and Korean). It features extremely fast indexing and query speeds, millisecond-level startup times, and memory mapping (mmap) support.

After adding multilingual translation, the search results began including many entries in other languages. I recently finished separating searches by language. The fix was twofold: make a search in the current language return only articles in that language, and apply a different tokenizer per language, for example tantivy-jieba for Chinese, lindera for Japanese, and the default tokenizer for English and everything else. This resolved both the mixed-language results and the poor matching caused by tokenizer mismatches.
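
To see why tokenizer choice matters, you can run the same sentence through two analyzers and compare the tokens. A minimal sketch, assuming tantivy and tantivy-jieba as dependencies (the helper and sample sentence are illustrative, not from the site's code):

use tantivy::tokenizer::{SimpleTokenizer, TextAnalyzer};
use tantivy_jieba::JiebaTokenizer;

fn print_tokens(mut analyzer: TextAnalyzer, text: &str) {
    let mut stream = analyzer.token_stream(text);
    while stream.advance() {
        print!("[{}] ", stream.token().text);
    }
    println!();
}

fn main() {
    // SimpleTokenizer only splits on whitespace and punctuation, so a Chinese
    // sentence without spaces comes back as one giant token; jieba segments
    // it into individual searchable words.
    print_tokens(TextAnalyzer::from(SimpleTokenizer::default()), "全文搜索引擎非常快");
    print_tokens(TextAnalyzer::from(JiebaTokenizer {}), "全文搜索引擎非常快");
}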

I originally considered qdrant for semantic search, but since the embeddings are computed locally, forwarding requests back to the embedding host would be too slow, and both the initialization time and the success rate are uncertain. It might still work inside a WeChat official account, though; I'll see if I can finish implementing it in the next couple of days.

This article was written by hand to reduce AI-detection rates, so the readability is merely adequate. I plan to delete all of my previously AI-generated articles soon and see whether Bing indexing recovers.

1. Building the Index

// Imports assumed below; the exact lindera paths can differ slightly between versions.
use lindera::dictionary::{load_embedded_dictionary, DictionaryKind};
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera_tantivy::tokenizer::LinderaTokenizer;
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions, STORED, STRING};
use tantivy::tokenizer::{LowerCaser, RemoveLongFilter, SimpleTokenizer, Stemmer, TextAnalyzer};
use tantivy::{Index, TantivyDocument};
use tantivy_jieba::JiebaTokenizer;

pub async fn build_search_index() -> anyhow::Result<Index> {
    // Set up text options for each language, each bound to a different tokenizer
    let en_text_options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer("en")
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    let zh_text_options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer("jieba")
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    let ja_text_options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer("lindera")
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    // Define the schema for the index
    let mut schema_builder = Schema::builder();
    // Apply the corresponding tokenizer options to each field
    let title_en_field = schema_builder.add_text_field("title_en", en_text_options.clone());
    let content_en_field = schema_builder.add_text_field("content_en", en_text_options);
    let title_zh_field = schema_builder.add_text_field("title_zh", zh_text_options.clone());
    let content_zh_field = schema_builder.add_text_field("content_zh", zh_text_options);
    let title_ja_field = schema_builder.add_text_field("title_ja", ja_text_options.clone());
    let content_ja_field = schema_builder.add_text_field("content_ja", ja_text_options);
    // The language code is indexed raw (untokenized) so a TermQuery can filter on it exactly
    let lang_field = schema_builder.add_text_field("lang", STRING | STORED);
    // ... other fields (canonical, lastmod_str, and so on)
    let schema = schema_builder.build();

    // Create the index in memory
    let index = Index::create_in_ram(schema);

    // Register a tokenizer for each language.
    // English: split on word boundaries, lowercase, then stem.
    let en_analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(LowerCaser)
        .filter(Stemmer::new(tantivy::tokenizer::Language::English))
        .build();
    index.tokenizers().register("en", en_analyzer);
    // Japanese: lindera with the embedded IPADIC dictionary.
    let dictionary = load_embedded_dictionary(DictionaryKind::IPADIC)?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let lindera_analyzer = TextAnalyzer::from(LinderaTokenizer::from_segmenter(segmenter));
    index.tokenizers().register("lindera", lindera_analyzer);
    // Chinese: jieba segmentation, dropping pathological tokens longer than 40 characters.
    let jieba_analyzer = TextAnalyzer::builder(JiebaTokenizer {})
        .filter(RemoveLongFilter::limit(40))
        .build();
    index.tokenizers().register("jieba", jieba_analyzer);
    // Create the writer (the argument caps the indexing buffer, in bytes)
    let mut index_writer = index.writer(50_000_000)?;

    let all_articles = your_articles; // load your articles from wherever they are stored

    for article in all_articles {
        let mut doc = TantivyDocument::new();
        doc.add_text(lang_field, &article.lang);

        // Route each article to the field set whose tokenizer matches its language;
        // simplified vs. traditional Chinese is distinguished later by filtering on lang_field.
        match article.lang.as_str() {
            "zh-CN" | "zh-TW" => {
                doc.add_text(title_zh_field, &article.title);
                doc.add_text(content_zh_field, &article.md);
            }
            "ja" => {
                doc.add_text(title_ja_field, &article.title);
                doc.add_text(content_ja_field, &article.md);
            }
            _ => {
                doc.add_text(title_en_field, &article.title);
                doc.add_text(content_en_field, &article.md);
            }
        }

        index_writer.add_document(doc)?;
    }

    index_writer.commit()?;
    index_writer.wait_merging_threads()?;

    Ok(index)
}
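
The handler in the next section reads the built index out of an in-memory moka cache under the key "primary_index". The cache itself never appears in the article; here is a minimal sketch of how SEARCH_INDEX_CACHE might be declared, where the capacity and the warm_search_index helper are my assumptions:

use moka::sync::Cache;
use once_cell::sync::Lazy;
use tantivy::Index;

// tantivy's Index is cheap to clone (its internals are reference-counted),
// so it can live in a moka cache and be handed out by value.
pub static SEARCH_INDEX_CACHE: Lazy<Cache<String, Index>> = Lazy::new(|| Cache::new(1));

// Hypothetical helper: build the index once and stash it under the key
// that search_handler looks up.
pub async fn warm_search_index() -> anyhow::Result<()> {
    let index = build_search_index().await?;
    SEARCH_INDEX_CACHE.insert("primary_index".to_string(), index);
    Ok(())
}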

2. Searching the Index

It would be better to match the request's language first and then search only the corresponding language-specific fields, but I never refactored this after getting it working.
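A minimal sketch of what that refactor might look like, reusing the field handles and query.lang from the handler below (all names as in that code):

    // Pick only the fields for the request's language, so the query string is
    // parsed with the right tokenizer and never scans foreign-language fields.
    let default_fields = match query.lang.as_deref() {
        Some("zh-CN") | Some("zh-TW") => vec![title_zh_f, content_zh_f],
        Some("ja") => vec![title_ja_f, content_ja_f],
        _ => vec![title_en_f, content_en_f],
    };
    let query_parser = QueryParser::for_index(&index, default_fields);

As it stands, though, the handler searches across all six fields: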

use chrono::DateTime;
use chrono_tz::Asia::Shanghai;
use tantivy::collector::TopDocs;
use tantivy::query::{BooleanQuery, Occur, QueryParser, TermQuery};
use tantivy::schema::IndexRecordOption;
use tantivy::{DocAddress, Score, TantivyDocument, Term};

#[server]
pub async fn search_handler(query: SearchQuery) -> Result<String, ServerFnError> {
    // I used moka to cache in memory since there aren't many articles anyway.
    let index = SEARCH_INDEX_CACHE.get("primary_index").ok_or_else(|| {
        ServerFnErrorErr::ServerError("Search index not found in cache.".to_string())
    })?;
    // Look up the field handles from the schema
    let schema = index.schema();
    let title_en_f = schema.get_field("title_en").unwrap();
    let content_en_f = schema.get_field("content_en").unwrap();
    let title_zh_f = schema.get_field("title_zh").unwrap();
    let content_zh_f = schema.get_field("content_zh").unwrap();
    let title_ja_f = schema.get_field("title_ja").unwrap();
    let content_ja_f = schema.get_field("content_ja").unwrap();
    let canonical_f = schema.get_field("canonical").unwrap();
    let lastmod_str_f = schema.get_field("lastmod_str").unwrap(); // field name assumed from the variable
    let lang_f = schema.get_field("lang").unwrap();

    let reader = index.reader()?;
    let searcher = reader.searcher();
    // Collect subqueries with Occur::Must: a hit has to satisfy every condition in this Vec
    let mut queries: Vec<(Occur, Box<dyn tantivy::query::Query>)> = Vec::new();

    let query_parser = QueryParser::for_index(
        &index,
        vec![
            title_en_f,
            content_en_f,
            title_zh_f,
            content_zh_f,
            title_ja_f,
            content_ja_f,
        ],
    );
    // The user's query
    let user_query = query_parser.parse_query(&query.q)?;
    queries.push((Occur::Must, user_query));
    // Filter by language so each locale only sees its own articles
    if let Some(lang_code) = &query.lang {
        let lang_term = Term::from_field_text(lang_f, lang_code);
        let lang_query = Box::new(TermQuery::new(lang_term, IndexRecordOption::Basic));
        queries.push((Occur::Must, lang_query));
    }
    // ... other filters

    let final_query = BooleanQuery::new(queries);

    let hits: Vec<Hit> = match query.sort {
        SortStrategy::Relevance => {
            let top_docs = TopDocs::with_limit(query.limit);
            let search_results: Vec<(Score, DocAddress)> =
                searcher.search(&final_query, &top_docs)?;
            // Convert from Vec<(Score, DocAddress)>
            search_results
                .into_iter()
                .filter_map(|(score, doc_address)| {
                    let doc = searcher.doc::<TantivyDocument>(doc_address).ok()?;
                    let title = doc
                        .get_first(title_en_f)
                        .or_else(|| doc.get_first(title_zh_f))
                        .or_else(|| doc.get_first(title_ja_f))
                        .and_then(|v| v.as_str())
                        .unwrap_or("")
                        .to_string();

                    let formatted_lastmod =
                        match DateTime::parse_from_rfc3339(doc.get_first(lastmod_str_f)?.as_str()?)
                        {
                            Ok(dt) => {
                                let china_dt = dt.with_timezone(&Shanghai);
                                china_dt.format("%Y-%m-%d").to_string()
                            }
                            Err(_) => doc.get_first(lastmod_str_f)?.as_str()?.to_string(),
                        };
                    Some(Hit {
                        title,
                        canonical: doc.get_first(canonical_f)?.as_str()?.to_string(),
                        lastmod: formatted_lastmod,
                        score,
                    })
                })
                .collect()
        }
        // Other sort strategies elided; I mainly sort by last-modified time.
    };

    serde_json::to_string(&hits).map_err(|e| ServerFnError::ServerError(e.to_string()))
}
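
The handler leans on a few supporting types that never appear in the article. From the way they are used above, they presumably look roughly like this (every name and field here is inferred, not confirmed):

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
pub enum SortStrategy {
    Relevance,
    // ... other strategies, e.g. by last-modified time
}

#[derive(Serialize, Deserialize)]
pub struct SearchQuery {
    pub q: String,            // raw query string
    pub lang: Option<String>, // e.g. "zh-CN", "zh-TW", "ja"
    pub sort: SortStrategy,
    pub limit: usize,
}

#[derive(Serialize)]
pub struct Hit {
    pub title: String,
    pub canonical: String,
    pub lastmod: String,
    pub score: tantivy::Score, // Score is an alias for f32
}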

3. Afterword

Tantivy delivers excellent search performance. Although it doesn't support semantic search yet, its speed and accuracy are impressive. Many vector databases also use Tantivy indexes for full-text search.

For more detailed usage of Tantivy, refer to the official examples (https://github.com/quickwit-oss/tantivy/tree/main/examples), which include 20 comprehensive search examples, each thoroughly explained.
