Terminology System - SRT Translator¶
Overview¶
The SRT Translator uses a terminology system that combines Do Not Translate (DNT) terms with an AI-generated termbase. Together they keep translations consistent while preserving important technical and brand terms.
Key Features¶
🚫 Do Not Translate (DNT) Terms¶
- Hard-Preserve Terms: Always kept in original language (acronyms, tech codes, product names)
- Soft-Preserve Terms: Kept unless overridden by termbase
- Automatic Filtering: Numeric and number-like terms are automatically filtered out
- Smart Detection: Identifies acronyms, CamelCase, and technical patterns
📚 AI-Generated Termbase¶
- Script Validation: Ensures translations use proper writing systems for each language
- Identity Prevention: Rejects translations that are identical to source terms
- Context-Aware: Analyzes transcript content to identify key terminology
🔄 Precedence System¶
- Termbase → DNT: Termbase translations take precedence over soft-preserve DNT terms
- Hard-Preserve Exception: Acronyms and tech codes always remain untranslated
- Conflict Resolution: Automatic handling of overlapping terms
How It Works¶
1. DNT Generation¶
Transcript Analysis → AI Identifies Terms → Hard-Preserve Filtering → Final DNT List
Example DNT Terms:
- ✅ Hard-Preserve: API, GPU, NASA, Adobe Premiere
- ❌ Filtered Out: 300ms, 6.7, 2024, $99.99
2. Termbase Generation¶
Transcript Analysis → AI Extracts Key Terms → Script Validation → Final Termbase
Termbase Format:
{
"zh-Hans": {
"machine learning": "机器学习",
"cloud computing": "云计算"
},
"es": {
"machine learning": "aprendizaje automático"
}
}
The termbase uses a simple {language: {source: target}} structure.
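As a minimal sketch of how this structure can be consumed, the snippet below parses a termbase, drops identity pairs (the "Identity Prevention" rule), and looks up a term for one language. The helper names `drop_identity_pairs` and `lookup` are hypothetical and not part of the SRT Translator API, and the case-insensitive comparison is an assumption.

```python
import json

# Sample termbase in the {language: {source: target}} shape shown above.
# The "software" entry is an invented identity pair, added only for illustration.
TERMBASE_JSON = """
{
  "zh-Hans": {"machine learning": "机器学习", "cloud computing": "云计算"},
  "es": {"machine learning": "aprendizaje automático", "software": "software"}
}
"""


def drop_identity_pairs(termbase):
    """Return a copy without entries whose target equals the source term.

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    return {
        lang: {
            src: tgt
            for src, tgt in terms.items()
            if src.strip().lower() != tgt.strip().lower()
        }
        for lang, terms in termbase.items()
    }


def lookup(termbase, lang, source_term):
    """Return the target term for a language, or None when there is no entry."""
    return termbase.get(lang, {}).get(source_term)


termbase = drop_identity_pairs(json.loads(TERMBASE_JSON))
```

With this data, `lookup(termbase, "zh-Hans", "cloud computing")` returns the Chinese entry, while the Spanish "software" entry is removed because source and target are identical.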
3. Script Validation¶
The system validates that translations use the correct writing system:
- Chinese (Simplified): Must contain CJK characters (一, 二, 三...)
- Japanese: Must contain Hiragana (あ, い, う...), Katakana (ア, イ, ウ...), or CJK
- Arabic: Must contain Arabic script (ا, ب, ت...)
- Latin Languages: No script restrictions (default behavior)
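The check described above can be approximated with code point ranges. This is a sketch under stated assumptions: `SCRIPT_RANGES` and `contains_script` are hypothetical names, and the ranges cover only the main Unicode blocks for each script, not every extension block.

```python
# Approximate code point ranges per block name.
# Assumption: a small hand-picked subset, not the full Unicode block tables.
SCRIPT_RANGES = {
    "CJK": [(0x4E00, 0x9FFF), (0x3400, 0x4DBF)],
    "Hiragana": [(0x3040, 0x309F)],
    "Katakana": [(0x30A0, 0x30FF)],
    "Arabic": [(0x0600, 0x06FF)],
}


def contains_script(text, blocks):
    """Return True if any character of text falls in one of the named blocks.

    An empty block list means "no restriction", which matches the default
    behavior described for Latin-script languages.
    """
    if not blocks:
        return True
    ranges = [r for block in blocks for r in SCRIPT_RANGES[block]]
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in ranges)
```

For example, `contains_script("机器学习", ["CJK"])` is true, while `contains_script("machine learning", ["CJK"])` is false, so an untranslated English string would be rejected for zh-Hans.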
4. Precedence Application¶
Input: DNT=["machine learning", "API"], Termbase={"machine learning": "机器学习"}
Result: Effective DNT=["API"] (machine learning overridden by termbase)
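The precedence rule can be sketched as a single filter. This is an illustration, not the project's `build_effective_dnt`: the function `sketch_effective_dnt` is hypothetical, it takes the termbase for one target language only, and the hard-preserve test is reduced to an assumed all-caps acronym pattern.

```python
import re


def is_hard_preserve(term):
    """Assumed heuristic: all-caps acronyms or codes such as API, GPU, MP3."""
    return re.fullmatch(r"[A-Z][A-Z0-9]+", term) is not None


def sketch_effective_dnt(dnt_terms, termbase_for_lang):
    """Drop soft-preserve DNT terms that the termbase translates.

    Hard-preserve terms stay in the list even when the termbase has an
    entry for them, mirroring the "Hard-Preserve Exception" rule.
    """
    overridden = {src.lower() for src in termbase_for_lang}
    return [
        term
        for term in dnt_terms
        if is_hard_preserve(term) or term.lower() not in overridden
    ]


effective = sketch_effective_dnt(
    ["machine learning", "API"],
    {"machine learning": "机器学习"},
)
# effective == ["API"]
```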
Configuration¶
Language Script Configuration¶
The system automatically detects script requirements from languages.json:
{
"zh-Hans": {
"script": "cjk",
"script_blocks": ["CJK"]
},
"ja": {
"script": "japanese",
"script_blocks": ["Hiragana", "Katakana", "CJK"]
}
}
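A configuration in this shape could be read as follows. The sketch is hedged: the JSON is inlined to keep the example self-contained (a real loader would read languages.json from disk), `load_language_config` and `required_blocks` are hypothetical names, and treating a missing entry as "no restriction" is an assumption based on the Latin default described earlier.

```python
import json
from functools import lru_cache

# Inlined copy of the configuration shown above, for illustration only.
LANGUAGES_JSON = """
{
  "zh-Hans": {"script": "cjk", "script_blocks": ["CJK"]},
  "ja": {"script": "japanese", "script_blocks": ["Hiragana", "Katakana", "CJK"]}
}
"""


@lru_cache(maxsize=None)
def load_language_config():
    """Parse the configuration once; later calls reuse the cached dict."""
    return json.loads(LANGUAGES_JSON)


def required_blocks(lang):
    """Return the script blocks required for a language.

    Languages without an entry get an empty tuple, meaning no restriction.
    """
    entry = load_language_config().get(lang, {})
    return tuple(entry.get("script_blocks", ()))
```

Here `required_blocks("ja")` yields the three Japanese blocks, and `required_blocks("es")` yields an empty tuple because Spanish has no entry.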
DNT Filtering Rules¶
- Always Keep: Acronyms (API, GPU, NASA)
- Always Keep: Tech product names (Adobe Premiere, Vivaldi)
- Always Keep: CamelCase terms (MachineLearning, DeepNeural)
- Filter Out: Pure numbers (300, 6.7, 2024)
- Filter Out: Number-like patterns (300ms, $99.99, 6.7%)
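The rules above can be approximated with a few regular expressions. This is a rough sketch, not the project's filter: the patterns and the `classify` function are assumptions, and product names cannot be detected by pattern alone, so they are passed in through a hypothetical `known_names` set.

```python
import re

# Numbers with optional currency prefix and unit or percent suffix: 300, 6.7, 300ms, $99.99
NUMBER_LIKE = re.compile(r"[$€£]?\d+(?:[.,]\d+)*\s?[A-Za-z%]*")
# Two or more capitals, optionally followed by digits: API, NASA, MP3
ACRONYM = re.compile(r"[A-Z]{2,}[0-9]*")
# Two or more capitalized segments with no separator: MachineLearning
CAMEL_CASE = re.compile(r"(?:[A-Z][a-z0-9]+){2,}")


def classify(term, known_names=frozenset()):
    """Return "filter", "keep", or "review" for a candidate DNT term."""
    t = term.strip()
    if NUMBER_LIKE.fullmatch(t):
        return "filter"
    if t in known_names or ACRONYM.fullmatch(t) or CAMEL_CASE.fullmatch(t):
        return "keep"
    # Anything else is left for the AI step or a human reviewer.
    return "review"
```

Checking numeric patterns first keeps a term such as `300ms` from being mistaken for a technical code. A term like `Adobe Premiere` is kept only when it appears in `known_names`.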
Usage Examples¶
GUI Usage¶
- Generate Translation Settings: AI analyzes your transcript and creates DNT + termbase
- Edit Settings: Review and modify AI-generated terminology
- Translate: System automatically applies terminology rules
CLI Usage¶
# Run CLI with specific tone (casual, neutral, or formal)
srtx-cli --tone formal
# Run with debug logging
srtx-cli --debug
# Run with report generation (html, md, both, or none)
srtx-cli --report both
Note: The CLI reads DNT terms and termbase from configuration files in the project directory. See the CLI documentation for configuration details.
Programmatic Usage¶
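from srt_translator.core.terminology_utils import build_effective_dnt
from srt_translator.core.translator.term_handler import TermHandler
# Example inputs (illustrative values; the termbase shape is assumed to
# follow the {language: {source: target}} format shown under Termbase Format)
dnt_terms = ["machine learning", "API"]
termbase = {"es": {"machine learning": "aprendizaje automático"}}
# Build effective DNT with precedence
effective_dnt = build_effective_dnt(dnt_terms, termbase)
# Alternatively, create a term handler for translation and query it
handler = TermHandler(dnt_terms, termbase, target_lang="es")
effective_dnt = handler.get_effective_dnt()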
Quality Metrics¶
DNT Preservation¶
- Hard-Preserve: Always preserved (acronyms, tech codes)
- Soft-Preserve: Preserved unless termbase provides translation
- Measurement: Reported in evaluation artifacts
Termbase Usage¶
- Script Validation: Translations must use correct writing system
- Identity Filtering: Identical source/target pairs are rejected
- Coverage: Reported per-language in evaluation
Script Validation¶
- Non-Latin scripts: Must contain appropriate Unicode blocks (CJK, Arabic, etc.)
- Latin scripts: No restrictions (default behavior)
Troubleshooting¶
Common Issues¶
DNT Terms Not Preserved
- Check if terms are in termbase (they override soft-preserve DNT)
- Verify terms meet hard-preserve criteria
- Check for numeric filtering
Script Validation Failures
- Ensure language has proper script configuration in languages.json
- Check if AI is generating wrong-script translations
- Verify Unicode block definitions
Low Termbase Acceptance
- Check confidence thresholds (default: 0.90)
- Review script validation results
- Verify transcript content quality
Debug Information¶
The system provides detailed logging for troubleshooting:
- DNT filtering results
- Termbase acceptance/rejection reasons
- Script validation details
- Precedence application statistics
Best Practices¶
For Content Creators¶
- Review AI-generated terminology before translation
- Add important brand names to DNT if not detected
- Verify technical terms in termbase for accuracy
- Test with small samples before full translation
For Developers¶
- Use script validation for all AI-generated content
- Implement confidence thresholds for quality control
- Log precedence decisions for debugging
- Cache language configurations for performance
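Two of the practices above, confidence thresholds and logged decisions, can be sketched together. The `Candidate` record and `accept_candidates` function are hypothetical; how the SRT Translator actually represents AI-proposed terms and their confidence scores is not documented here, so the shape below is an assumption. The 0.90 default mirrors the threshold mentioned under Troubleshooting.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("terminology_sketch")


@dataclass
class Candidate:
    """Assumed shape of one AI-proposed termbase entry."""
    source: str
    target: str
    confidence: float


def accept_candidates(candidates, threshold=0.90):
    """Keep candidates at or above the threshold and log each rejection."""
    accepted = {}
    for c in candidates:
        if c.confidence < threshold:
            logger.debug(
                "rejected %r: confidence %.2f below %.2f",
                c.source, c.confidence, threshold,
            )
            continue
        if c.source == c.target:
            logger.debug("rejected %r: target identical to source", c.source)
            continue
        accepted[c.source] = c.target
    return accepted


proposals = [
    Candidate("machine learning", "aprendizaje automático", 0.97),
    Candidate("cloud computing", "computación en la nube", 0.72),
    Candidate("software", "software", 0.99),
]
accepted = accept_candidates(proposals)
```

Only the first proposal survives: the second falls below the threshold and the third is an identity pair. Running with debug logging enabled shows the reason for each rejection.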
Technical Details¶
Architecture¶
- Terminology Utils: Core filtering and validation logic
- Language Config: Script and language metadata
- Term Handler: DNT replacement and restoration
- AI Config: AI-powered terminology generation
Performance¶
- Lazy Loading: Language configurations loaded on demand
- Caching: Script specifications cached in memory
- Efficient Validation: Unicode block lookups optimized
- Batch Processing: Multiple terms processed simultaneously
Extensibility¶
- Custom Scripts: Add new writing systems via Unicode blocks
- Filtering Rules: Extend DNT filtering logic
- Validation Methods: Custom script validation functions
- Precedence Rules: Modify termbase/DNT priority
Future Enhancements¶
- Multi-script Support: Languages with mixed writing systems
- Context-Aware DNT: Dynamic DNT based on content type
- Quality Learning: Improve filtering based on user feedback
- Script Detection: Automatic script identification for unknown text