Classifier Intelligence Engine

The AI Layer That Makes Sense Unbeatable

18 specialized models. 5 detection layers. Every model is self-hosted on private infrastructure, so your data never reaches a third-party AI provider.

4A · Primary NER
4B · GLiNER
4C · Spatial Analysis
4D · Lang Detection
Framework

By The Numbers

An ensemble built for precision

18
Total Models in Stack
7
Currently Deployed
14
Fully Trainable
0.9963
Peak F1 Score

Model Stack

18 models. 5 detection layers.

Each sub-layer targets a specific challenge. Together, they form an ensemble no single model can match.

4A — Primary NER: ModernBERT & DeBERTa models for high-accuracy Named Entity Recognition

4A · In Use · Trainable

joneauxedgar/pasteproof-pii-detector-v2

ModernBERT-base (149M)

ModernBERT-base with 149M parameters. Covers 27 PII types across PCI/HIPAA/GDPR frameworks. Trained on 150K synthetic examples with BIO tagging.

F1 0.970 (held-out) · ~120ms GPU

4A · In Use · Trainable

llm-semantic-router/mmbert32k-pii-detector-merged

ModernBERT 307M + YaRN (32K context)

307M parameter ModernBERT with YaRN extending context to 32K tokens. Unmatched open-source coverage for long-form legal and compliance documents.

F1 0.969 (reported) · ~400ms GPU

4A · Coming Soon · Trainable

OpenMed PII Family (FR/DE/IT variants)

ModernBERT-large (395M)

ModernBERT-large (395M) covering 55+ EU-localized PII types including French NSS and Italian Codice Fiscale. Multilingual healthcare focus.

F1 > 0.960 (reported) · ~180ms GPU

4A · In Use · Trainable

iiiorg/piiranha-v1-detect-personal-information

DeBERTa-v3-base

DeBERTa-v3-base with 99.44% binary accuracy. The fastest deployed model in the 4A sub-layer at ~25ms GPU. Ideal as a high-speed first-pass gate across 6 languages.

99.44% binary accuracy · F1-macro 0.931 · ~25ms GPU

4A · Coming Soon · Trainable

exdsgift/NerGuard-0.3B

mDeBERTa-v3-base (0.3B)

mDeBERTa-v3-base with the highest throughput in the 4A sub-layer. 33ms median latency across 8 EU languages. Ideal for real-time pipelines.

F1-macro 0.9963 (in-dist) · 33ms GPU

4A · In Use · Trainable

lakshyakh93/deberta_finetuned_pii

DeBERTa-v3-base

DeBERTa-v3-base fine-tuned on ai4privacy/pii-masking-300k. Serves as a warm fallback and general-purpose baseline in the 4A ensemble.

F1 ~0.920 (est.) · ~30ms GPU

Technical Comparison

Every model. Every metric.

| Model | Layer | Architecture | Top Metric | Context Window | Best For | Trainable | Latency |
|---|---|---|---|---|---|---|---|
| joneauxedgar/pasteproof-pii-detector-v2 (In Use) | 4A | ModernBERT-base (149M) | F1 0.970 (held-out) | 8,192 tokens | Long compliance docs, leakage prevention, intentional variation coverage | Yes | ~120ms GPU |
| llm-semantic-router/mmbert32k-pii-detector-merged (In Use) | 4A | ModernBERT 307M + YaRN (32K context) | F1 0.969 (reported) | 32,768 tokens | Extreme-length documents, legal contracts, batch reports with dense PII | Yes | ~400ms GPU |
| OpenMed PII Family (FR/DE/IT variants) | 4A | ModernBERT-large (395M) | F1 > 0.960 (reported) | 8,192 tokens | EU multilingual (FR/DE/IT), GDPR localized entity formats, healthcare records | Yes | ~180ms GPU |
| iiiorg/piiranha-v1-detect-personal-information (In Use) | 4A | DeBERTa-v3-base | 99.44% binary accuracy · F1-macro 0.931 | 512 tokens (sub-256 optimal) | High-speed short-segment screening, multilingual real-time API, binary PII gate | Yes | ~25ms GPU |
| exdsgift/NerGuard-0.3B | 4A | mDeBERTa-v3-base (0.3B) | F1-macro 0.9963 (in-dist) | 512 tokens | Ultra-low latency EU multilingual, high-throughput real-time pipelines | Yes | 33ms GPU |
| lakshyakh93/deberta_finetuned_pii (In Use) | 4A | DeBERTa-v3-base | F1 ~0.920 (est.) | 512 tokens | General-purpose PII baseline, interpretable benchmark | Yes | ~30ms GPU |
| knowledgator/gliner-pii-large-v1.0 | 4B | GLiNER-large (bi-encoder) | F1 0.833 · Precision 0.874 | 512 tokens | Minimizing false positives, broadest entity coverage, production compliance audits | Yes | ~45ms GPU |
| nvidia/gliner-PII-0.1 (In Use) | 4B | GLiNER DeBERTa (570M) | Strict F1 0.870 | 512 tokens | Enterprise compliance, healthcare / finance / legal tri-domain | Yes | ~60ms GPU |
| gretelai/gretel-gliner-bi-large-v1.0 | 4B | GLiNER-large (bidirectional) | F1 0.950 (internal bench) | 512 tokens | Dual PII+PHI detection in one pass, HIPAA + GDPR simultaneously | Yes | ~50ms GPU |
| OvermindLab/nerpa | 4B | GLiNER2 (unified NER + structured extraction) | Micro-Precision 0.930 | 512 tokens | Disambiguation of overlapping entity types, beats AWS Comprehend | Yes | ~55ms GPU |
| urchade/gliner_small-v2.1 (In Use) | 4B | GLiNER-small (DeBERTa-v3-small encoder) | F1 ~0.850 (general NER) | 512 tokens | Zero-shot custom entities, prototyping new PII types, ultra-fast inference | Yes | ~15ms GPU |
| Surya OCR (Datalab) | 4C | Detection + segmentation models | Precision 0.99 · Recall 0.96 (table detection) | Full page canvas | Scanned document pre-processing, reading order correction, bounding box extraction | Partial | ~620ms/page GPU |
| nielsr/layoutlmv3-finetuned-cord | 4C | LayoutLMv3-base (multimodal) | F1 0.9638 (CORD) | 512 tokens + image patches | Receipt & invoice spatial PII extraction, structured financial document parsing | Yes | ~200ms GPU (incl. OCR) |
| nielsr/layoutlmv3-finetuned-funsd | 4C | LayoutLMv3-base (multimodal) | F1 0.9078 (FUNSD) | 512 tokens + image patches | Scanned form key-value extraction, insurance/government forms | Yes | ~200ms GPU (incl. OCR) |
| parthesh111/layoutlmv3-finetune-bioes-new | 4C | LayoutLMv3-base + PaddleOCR | F1 ~0.920 (medical lab reports) | 512 tokens + image patches | Scanned medical lab report de-identification, HIPAA PHI spatial extraction | Yes | ~250ms GPU (incl. OCR) |
| fast-langdetect (FastText lite) | 4D | FastText (bag-of-n-grams classifier) | ~98% top-1 accuracy (common langs) | Sentence/paragraph | Edge-level language routing, <1ms CPU, GPU-free classification | Limited | <1ms CPU |
| cis-lmu/glotlid (V3) | 4D | FastText-based (character n-grams) | 2,102 language labels (coverage) | Sentence/paragraph | Low-resource & obscure dialect routing, preventing metadata leakage | Limited | <2ms CPU |
| Microsoft Presidio (In Use) | Framework | Rule-based + spaCy NER + custom recognizers | 99%+ structured · ~80% names (accuracy) | Unlimited (chunked internally) | Orchestration layer, rule-based PII (regex), plugging in any model above | Customizable | ~10–50ms CPU |

Why Segmento Sense

Built different. By design.

These aren't marketing checkboxes. They are deliberate architectural decisions that took 18 months to get right.

01

You Always Know Why — Not Just What

Most PII tools hand you a list of findings and leave you guessing. Sense shows you which model flagged each entity, which rule triggered it, and the exact text span that caused it. Transparency isn't a feature — it's the foundation.
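To make the idea concrete, here is a minimal sketch of what a traceable finding could look like. The `Finding` record, its field names, and the `explain` helper are hypothetical illustrations, not Sense's actual output schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical record tying one detected entity back to its source."""
    entity_type: str      # e.g. "EMAIL"
    start: int            # character offset where the span begins
    end: int              # character offset one past the span's last character
    model: str            # which model or framework flagged the entity
    rule: Optional[str]   # which rule fired, if the detection was rule-based
    score: float          # confidence reported by the detector


def explain(text: str, finding: Finding) -> str:
    """Render a one-line explanation: what was found, where, and by whom."""
    span = text[finding.start:finding.end]
    source = finding.model
    if finding.rule is not None:
        source = f"{finding.model} / rule {finding.rule}"
    return (
        f"{finding.entity_type} '{span}' at [{finding.start}:{finding.end}] "
        f"flagged by {source} (score {finding.score:.2f})"
    )


text = "Contact jane@example.com today"
finding = Finding("EMAIL", 8, 24, "Microsoft Presidio", "email_regex", 0.99)
print(explain(text, finding))
```

Keeping character offsets rather than only the matched string means a reviewer can jump to the exact location in the source document.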

02

You Control Precision vs. Recall

A single model is a single threshold. Our Consensus Engine lets you set a confidence threshold with a slider: a low threshold flags aggressively (maximizing recall for regulated environments), while a high threshold flags conservatively (minimizing false positives for high-throughput pipelines). You choose the tradeoff.
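The tradeoff can be illustrated with a small sketch. The page does not say how the Consensus Engine combines model outputs, so the averaging rule below (sum of scores divided by the number of models in the ensemble) is an assumption chosen for simplicity, and `consensus` is a hypothetical function name.

```python
from collections import defaultdict


def consensus(votes, n_models, threshold):
    """Keep spans whose ensemble-averaged score meets the threshold.

    votes: list of ((start, end), score) pairs, one per model that flagged a span.
    n_models: total number of models consulted; a model that did not flag
              a span implicitly contributes a score of 0.
    threshold: the slider position, between 0 and 1.
    """
    totals = defaultdict(float)
    for span, score in votes:
        totals[span] += score
    return sorted(
        span for span, total in totals.items() if total / n_models >= threshold
    )


# Three models agree on span (0, 8); only one flags span (20, 31).
votes = [((0, 8), 0.9), ((0, 8), 0.8), ((0, 8), 0.95), ((20, 31), 0.6)]

aggressive = consensus(votes, n_models=3, threshold=0.1)    # favors recall
conservative = consensus(votes, n_models=3, threshold=0.5)  # favors precision
print(aggressive, conservative)
```

With the low threshold both spans are kept; with the high threshold only the span that all three models agree on survives.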

03

Works Completely Offline — No Cloud Required

Banks, hospitals, and defense contractors cannot use SaaS tools. Sense runs all 18 models on your own private infrastructure with zero external API calls. Air-gap deployments are fully supported. Your data never crosses a network boundary it shouldn't.

04

Replace Real PII With Valid Synthetic Data

Redacting PII for test environments usually breaks data integrity. Sense replaces real PII with structurally valid synthetic data: payment card numbers that pass Luhn checks, IBANs with correct checksums, email addresses that match real domains. Your test data stays usable.
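As an illustration of what "structurally valid" means, the sketch below computes the two checksums mentioned: the Luhn check digit (the scheme used by payment card numbers) and the IBAN mod-97 check digits. The function names are hypothetical and this is not Sense's generator; it only shows the standard checksum arithmetic.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Return the digit that makes partial + digit pass the Luhn check."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Validate a full number, check digit included."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_check_digits(country: str, bban: str) -> str:
    """Compute the two IBAN check digits using the mod-97 scheme."""
    rearranged = bban + country + "00"
    # Letters map to numbers: A=10 ... Z=35, which is what base-36 parsing gives.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return f"{98 - int(numeric) % 97:02d}"


def synthetic_card_number(rng: random.Random, length: int = 16) -> str:
    """Random digits plus a correct Luhn check digit (no real issuer prefix)."""
    partial = "".join(str(rng.randrange(10)) for _ in range(length - 1))
    return partial + luhn_check_digit(partial)


card = synthetic_card_number(random.Random(0))
print(card, luhn_valid(card))
print("GB" + iban_check_digits("GB", "WEST12345698765432") + "WEST12345698765432")
```

A production generator would also need realistic issuer prefixes and country-specific BBAN formats, which this sketch leaves out.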

05

Generate SQL & Python Fix Scripts Automatically

Sense bridges the gap between the Security team that finds PII and the Data Engineering team that fixes it. One click generates a ready-to-run remediation script for the exact columns and tables flagged. No more "we found 3,000 PII fields" without a path forward.
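A rough sketch of how such a script could be assembled from flagged columns follows. The input shape, the `remediation_sql` name, and the choice to overwrite values with a `[REDACTED]` literal are all assumptions made for illustration; the page does not describe the scripts Sense actually emits.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def remediation_sql(findings) -> str:
    """Build one UPDATE statement per flagged (table, column, entity_type)."""
    lines = []
    for table, column, entity_type in findings:
        t, c = quote_ident(table), quote_ident(column)
        lines.append(f"-- {entity_type} detected in {table}.{column}")
        # Hypothetical policy: mask every non-null value in the flagged column.
        lines.append(f"UPDATE {t} SET {c} = '[REDACTED]' WHERE {c} IS NOT NULL;")
    return "\n".join(lines)


script = remediation_sql([
    ("users", "email", "EMAIL"),
    ("orders", "card_number", "CREDIT_CARD"),
])
print(script)
```

Generated statements like these are destructive, so in practice they would be reviewed and run inside a transaction against a backup or staging copy first.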

The Industry Reality

Every major vendor sends your data to the cloud.
We built Sense to be the exception.

Whether it's a cloud platform charging per GB, an enterprise DLP tool requiring weeks of setup, or an AI-first SaaS that runs your sensitive documents through external model APIs — the industry's default is to move your data.

Segmento Sense runs all 18 models on your private infrastructure. Zero third-party AI access. Zero data egress. Zero compliance risk from vendor-side breaches. Your PII stays where it belongs: with you.

Cloud Platforms
AWS Macie, Google Cloud DLP, Microsoft Purview
Data leaves your perimeter
Enterprise DLP
Symantec, Trellix, Spirion, Digital Guardian
Complex, expensive, opaque
AI-First SaaS Tools
Nightfall, Private AI, BigID, Varonis
Training on your sensitive data
vs Segmento Sense
Segmento Sense
18 self-hosted models · Full offline capability · Zero third-party AI
Your data stays with you
See It In Action

Ready to see the AI engine work for your data?

Upload a document. Watch all 18 models work in concert. See exactly which model flagged what entity, and why — in real time.