
Lexer Architecture

Status: Active (Dual-Lexer)
Target: Single ASTDB-based lexer for the stable beta release
Last Updated: 2026-01-25

The Janus compiler maintains two separate lexers that serve different architectural purposes. This document explains why they exist, their differences, and the planned unification strategy.

1. janus_tokenizer (Traditional Parser Path)

Location: compiler/libjanus/janus_tokenizer.zig

Used By:

  • janus_parser.zig (recursive descent parser)
  • All E2E compilation tests
  • LLVM codegen pipeline

Pipeline:

Source → Tokenizer → Parser → AST → QTJIR → LLVM IR → Executable
| Aspect | Description |
| --- | --- |
| Token Storage | `Token` struct with raw `lexeme: []const u8` slices |
| Memory Model | Tokens own source slices directly |
| Trivia Handling | Not captured (whitespace/comments discarded) |
| Incremental Support | No |
| Profile Support | `:min`, `:go`, `:sovereign` keywords |

Token Structure:

```zig
pub const Token = struct {
    type: TokenType,     // Direct enum (126 variants)
    lexeme: []const u8,  // Raw source slice
    span: SourceSpan,    // Start/end position
};
```
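The slice-based storage model can be illustrated with a minimal sketch (type and field names here are illustrative, not the actual janus_tokenizer definitions): a token's lexeme is just a view into the source buffer, so no string data is copied, and tokens are valid only while the source outlives them.

```zig
const std = @import("std");

const SourceSpan = struct { start: u32, end: u32 };

// Illustrative mirror of the slice-based token: the lexeme borrows
// directly from the source buffer, nothing is copied or interned.
const SliceToken = struct {
    lexeme: []const u8,
    span: SourceSpan,
};

pub fn main() void {
    const source = "let answer = 42";
    const span = SourceSpan{ .start = 4, .end = 10 };
    const tok = SliceToken{
        .lexeme = source[span.start..span.end],
        .span = span,
    };
    // The token stays valid only while `source` is alive.
    std.debug.print("{s}\n", .{tok.lexeme}); // "answer"
}
```

This is why the traditional path is fast but memory-bound to the source buffer: freeing or mutating the source invalidates every token.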

2. RegionLexer (ASTDB Path)

Location: compiler/astdb/lexer.zig

Used By:

  • region.zig (ASTDB region-based parsing)
  • semantic_analyzer.zig (type checking)
  • LSP server (incremental updates)

Pipeline:

Source → RegionLexer → ASTDB Snapshot → Columnar Queries → Semantic Analysis
| Aspect | Description |
| --- | --- |
| Token Storage | `Token` struct with `str: ?StrId` (interned string ID) |
| Memory Model | String interning via `StrInterner` (deduplication) |
| Trivia Handling | Separate `Trivia` array with `trivia_lo`/`trivia_hi` indices |
| Incremental Support | Yes (region boundaries: `start_pos`, `end_pos`) |
| Designed For | Columnar database queries, incremental parsing |
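The incremental support can be sketched as a simple overlap test (function and variable names here are illustrative): after an edit, only tokens whose spans intersect the edited byte range fall inside the dirty region and need re-lexing; tokens outside it can be reused from the previous snapshot.

```zig
const std = @import("std");

const SourceSpan = struct { start: u32, end: u32 };

// A token must be re-lexed iff its span overlaps the edited range
// [edit_start, edit_end). Tokens outside the region are reused.
fn needsRelex(span: SourceSpan, edit_start: u32, edit_end: u32) bool {
    return span.start < edit_end and span.end > edit_start;
}

pub fn main() void {
    const edit_start: u32 = 10;
    const edit_end: u32 = 14;
    const spans = [_]SourceSpan{
        .{ .start = 0, .end = 5 },   // before the edit: reused
        .{ .start = 8, .end = 12 },  // overlaps the edit: re-lexed
        .{ .start = 20, .end = 25 }, // after the edit: reused
    };
    for (spans) |s| {
        std.debug.print("{}\n", .{needsRelex(s, edit_start, edit_end)});
    }
}
```

The real RegionLexer additionally has to re-align region boundaries after the edit shifts byte offsets; this sketch only shows the dirty-region test.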

Token Structure:

```zig
pub const Token = struct {
    kind: TokenKind,  // Enum with 220+ variants
    str: ?StrId,      // Interned string (null for punctuation)
    span: SourceSpan, // Byte offsets + line/column
    trivia_lo: u32,   // Index into trivia array
    trivia_hi: u32,   // Exclusive end of trivia
};
```
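The deduplication behind `str: ?StrId` can be sketched as follows (a minimal stand-in, not the actual `StrInterner` API): every distinct lexeme is stored once and assigned a stable ID, so repeated identifiers cost one integer per token instead of one string.

```zig
const std = @import("std");

const StrId = u32;

// Minimal interner sketch: identical lexemes map to the same id.
// Punctuation tokens skip the interner entirely (str == null).
const Interner = struct {
    map: std.StringHashMap(StrId),
    next: StrId = 0,

    fn intern(self: *Interner, s: []const u8) !StrId {
        const gop = try self.map.getOrPut(s);
        if (!gop.found_existing) {
            gop.value_ptr.* = self.next;
            self.next += 1;
        }
        return gop.value_ptr.*;
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    var interner = Interner{ .map = std.StringHashMap(StrId).init(gpa.allocator()) };
    defer interner.map.deinit();

    const a = try interner.intern("answer");
    const b = try interner.intern("answer"); // deduplicated: same id as `a`
    std.debug.print("{} {}\n", .{ a, b });
}
```

Interning is what makes columnar queries cheap: comparing two identifiers is an integer comparison, and the string bytes live in one deduplicated pool.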
| Aspect | janus_tokenizer | RegionLexer |
| --- | --- | --- |
| Origin | Traditional compiler front-end | ASTDB columnar database |
| Memory Model | Token owns lexeme slices | String interning (deduplication) |
| Trivia | Discarded | Preserved separately |
| Incremental | No | Yes (region boundaries) |
| Use Case | AST generation, compilation | Semantic queries, LSP |
| Optimization | Speed | Memory efficiency |
How we got here:

  1. janus_tokenizer was built first for the traditional compilation pipeline
  2. RegionLexer was added later for ASTDB's incremental parsing requirements
  3. Different consumers evolved with different data model expectations

Why they remain separate:

  1. Different Storage Models: Raw slices vs. interned strings
  2. Different Consumers: The parser expects a sequential token stream; ASTDB expects columnar data
  3. Different Optimization Goals: Speed vs. memory efficiency plus incrementality

Both lexers support the same token types:

  • All operators (arithmetic, logical, bitwise, comparison)
  • All keywords (:min, :go, :sovereign profiles)
  • Numeric literals: decimal, hex (0xFF), binary (0b1010), octal (0o777)
  • String literals, identifiers, punctuation

This consistency is actively maintained: any new token type must be added to both lexers.

Target: The stable beta release should use a single ASTDB-based lexer.

Phase 1: Create a thin adapter that converts RegionLexer output to the janus_tokenizer format:

Source → RegionLexer → Adapter → Parser (unchanged)
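A minimal sketch of the adapter idea, using illustrative stand-in types (the real token shapes live in janus_tokenizer.zig and astdb/core.zig, and the real adapter must also map the 220+ `TokenKind` variants onto the 126 legacy `TokenType` variants): resolve the interned `StrId` back to a raw slice so the unchanged parser keeps receiving lexemes.

```zig
const std = @import("std");

// Illustrative stand-ins for both token shapes (not the real definitions).
const SourceSpan = struct { start: u32, end: u32 };
const StrId = u32;
const RegionToken = struct { kind: u16, str: ?StrId, span: SourceSpan };
const LegacyToken = struct { tag: u16, lexeme: []const u8, span: SourceSpan };

// Phase 1 adapter: recover a raw lexeme slice for each ASTDB token.
// Interned tokens resolve through the string pool; punctuation
// (str == null) is sliced straight out of the source buffer.
fn adapt(tok: RegionToken, strings: []const []const u8, source: []const u8) LegacyToken {
    const lexeme = if (tok.str) |id|
        strings[id]
    else
        source[tok.span.start..tok.span.end];
    // A real adapter would also map kind -> legacy token type here.
    return .{ .tag = tok.kind, .lexeme = lexeme, .span = tok.span };
}

pub fn main() void {
    const source = "x + 1";
    const strings = [_][]const u8{"x"};
    const plus = adapt(
        .{ .kind = 1, .str = null, .span = .{ .start = 2, .end = 3 } },
        &strings,
        source,
    );
    std.debug.print("{s}\n", .{plus.lexeme}); // "+"
}
```

The adapter keeps the parser untouched during the transition, at the cost of materializing lexeme slices that the ASTDB path would otherwise never need.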

Phase 2: Modify the parser to consume ASTDB tokens directly:

Source → RegionLexer → ASTDB Snapshot → Parser → AST

Phase 3: Remove janus_tokenizer once all consumers have migrated to the ASTDB path.

Benefits:

  • Single source of truth for tokenization
  • Automatic incremental parsing for all paths
  • Memory-efficient string interning everywhere
  • Trivia preservation for formatting tools

Risks:

  • The parser expects raw lexemes, while ASTDB uses interned IDs
  • Performance regression risk during the transition
  • Test coverage must be maintained across both paths during migration
| File | Purpose |
| --- | --- |
| compiler/libjanus/janus_tokenizer.zig | Traditional tokenizer |
| compiler/astdb/lexer.zig | ASTDB region-based lexer |
| compiler/astdb/core.zig | ASTDB token/trivia definitions |
| compiler/libjanus/janus_parser.zig | Uses janus_tokenizer |
| compiler/semantic_analyzer.zig | Uses RegionLexer |

When adding new token support:

  1. Update both lexers for consistency
  2. Test with E2E tests (uses janus_tokenizer path)
  3. Test with semantic analysis (uses RegionLexer path)
  4. Document any divergence in this file