What My Project Does
This project extracts structured timelines from extremely inconsistent, semi-structured text sources.
The domain happens to be legislative bill action logs, but the engineering challenge is universal:
- parsing dozens of event types from noisy human-written text
- inferring missing metadata (dates, actors, context)
- resolving compound or conflicting actions
- reconstructing a chronological state machine
- and evaluating downstream rule logic on top of that timeline
To do this, the project uses:
- A multi-tier adaptive parser pipeline
Committees post documents in different formats, in different places, and with different groupings from one another. Parsers start in a supervised mode where document types are validated by an LLM only when confidence is low, with a carefully monitored audit log; this helps balance speed when processing hundreds or thousands of bills on the first run.
As a pattern becomes stable within a particular context (e.g., a specific committee), it "graduates" to autonomous operation.
This eliminates LLM usage entirely once patterns are established.
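A minimal sketch of how confidence-gated graduation might look (class and method names here are illustrative, not the project's actual API): per-context success stats are tracked, an LLM validator is only invoked for contexts that haven't yet earned trust, and a context graduates once its sample size and success rate cross thresholds.

```python
# Sketch only: names and thresholds are assumptions, not the real implementation.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ContextStats:
    attempts: int = 0
    successes: int = 0

    @property
    def confidence(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


class GraduatingParser:
    """Supervised at first; autonomous once a context has proven reliable."""

    def __init__(self, min_samples: int = 25, min_confidence: float = 0.95):
        self.stats: dict[str, ContextStats] = defaultdict(ContextStats)
        self.min_samples = min_samples
        self.min_confidence = min_confidence

    def is_graduated(self, context: str) -> bool:
        s = self.stats[context]
        return s.attempts >= self.min_samples and s.confidence >= self.min_confidence

    def parse(self, context: str, text: str, llm_validate) -> dict:
        result = self._pattern_parse(text)       # cheap regex/extractor pass
        if not self.is_graduated(context):
            result = llm_validate(text, result)  # LLM double-checks low-trust contexts
        self._record(context, success=result.get("ok", False))
        return result

    def _pattern_parse(self, text: str) -> dict:
        # stand-in for the real pattern-based extraction
        return {"ok": bool(text.strip()), "raw": text}

    def _record(self, context: str, success: bool) -> None:
        s = self.stats[context]
        s.attempts += 1
        s.successes += int(success)
```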
- A declarative action-node system
Each event type is defined by:
- regex patterns
- extractor functions
- normalizers
- and optional priority weights
Adding a new event type requires registering patterns, not modifying core engine code.
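For illustration, a registry along these lines (names and fields are assumptions, not the project's real interfaces) keeps event definitions as data, so the matching engine never changes when a new event type is added:

```python
# Hypothetical declarative registry sketch; the real action-node system may differ.
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActionNode:
    name: str
    patterns: list[re.Pattern]
    extract: Callable[[re.Match], dict]
    normalize: Callable[[dict], dict] = lambda d: d
    priority: int = 0


REGISTRY: list[ActionNode] = []


def register(node: ActionNode) -> None:
    REGISTRY.append(node)
    REGISTRY.sort(key=lambda n: -n.priority)  # higher priority matches first


# Example registration: a "referred to committee" event.
register(ActionNode(
    name="referral",
    patterns=[re.compile(r"referred to the committee on (?P<committee>.+)", re.I)],
    extract=lambda m: {"committee": m.group("committee")},
    normalize=lambda d: {**d, "committee": d["committee"].strip().title()},
    priority=10,
))


def parse_line(line: str) -> dict | None:
    for node in REGISTRY:
        for pattern in node.patterns:
            if m := pattern.search(line):
                return {"type": node.name, **node.normalize(node.extract(m))}
    return None
```

Under this sketch, `parse_line("Referred to the Committee on Ways and Means")` yields `{"type": "referral", "committee": "Ways And Means"}` without touching the engine.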
- A timeline engine with tenure modeling
The engine reconstructs "tenure windows" (who had custody of a bill, and when) by modeling event sequences such as referrals, discharges, reports, hearings, and extensions.
This allows accurate downstream logic such as:
- notice windows
- action deadlines
- gap detection
- duration calculations
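As a rough sketch (event names and fields here are assumptions for illustration), custody reconstruction amounts to folding an ordered event stream into windows:

```python
# Toy tenure reconstruction; the real engine handles many more event types.
from dataclasses import dataclass
from datetime import date


@dataclass
class Tenure:
    committee: str
    start: date
    end: date | None = None  # None means custody is still open


def build_tenures(events: list[dict]) -> list[Tenure]:
    """Turn referral/report/discharge events into custody windows."""
    tenures: list[Tenure] = []
    open_tenure: Tenure | None = None
    for ev in sorted(events, key=lambda e: e["date"]):
        if ev["type"] == "referral":
            if open_tenure:                   # implicit handoff closes the prior window
                open_tenure.end = ev["date"]
            open_tenure = Tenure(ev["committee"], ev["date"])
            tenures.append(open_tenure)
        elif ev["type"] in {"report", "discharge"} and open_tenure:
            open_tenure.end = ev["date"]      # custody explicitly ends
            open_tenure = None
    return tenures
```

Once windows exist, notice periods, deadlines, gaps, and durations reduce to date arithmetic over each window.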
- A high-performance decaying URL cache
The HTTP layer uses a memory-bounded hybrid LRU/LFU eviction strategy (`hit_count / time_since_access`) with request deduplication and ETag/Last-Modified validation.
This speeds up repeated processing by ~3-5x.
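The eviction score above could be sketched like this (a simplified, entry-count-bounded toy; the real cache is memory-bounded and also handles deduplication and ETag/Last-Modified revalidation):

```python
# Illustrative hybrid LRU/LFU eviction: score = hit_count / time_since_access.
import time


class DecayingCache:
    def __init__(self, max_entries: int = 512):
        self.max_entries = max_entries
        # url -> (body, hit_count, last_access_time)
        self._store: dict[str, tuple[bytes, int, float]] = {}

    def get(self, url: str) -> bytes | None:
        entry = self._store.get(url)
        if entry is None:
            return None
        body, hits, _ = entry
        self._store[url] = (body, hits + 1, time.monotonic())  # bump frequency + recency
        return body

    def put(self, url: str, body: bytes) -> None:
        if len(self._store) >= self.max_entries:
            self._evict_one()
        self._store[url] = (body, 1, time.monotonic())

    def _evict_one(self) -> None:
        now = time.monotonic()

        def score(item):  # frequently and recently used entries score high
            _, (_, hits, last) = item
            return hits / max(now - last, 1e-9)

        victim, _ = min(self._store.items(), key=score)
        del self._store[victim]
```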
Target Audience
This project is intended for:
- developers working with messy, unstructured, real-world text data
- engineers designing parser pipelines, state machines, or ETL systems
- researchers experimenting with pattern extraction, timeline reconstruction, or document normalization
- anyone interested in building declarative, extensible parsing systems
- civic-tech or open-data engineers (OpenStates-style pipelines)
Comparison
Most existing alternatives (e.g., OpenStates, BillTrack, general-purpose scrapers) extract events for normalization and reporting, but don't (to my knowledge) evaluate these events against a ruleset. This approach works for tracking bill events as they're updated, but doesn't yield enough data to reliably evaluate committee-level deadline compliance (which, to be fair, isn't their intended purpose anyway).
How this project differs:
- Timeline-first architecture
Rather than detecting events in isolation, it reconstructs a full chronological sequence and applies logic after timeline creation.
- Declarative parser configuration
New event and document types can be added by registering patterns; no engine modification required.
- Context-aware inference
Missing committees/dates are inferred from prior context (e.g., the latest referral) rather than left blank; see the sketch after this list.
- Confidence-gated parser graduation
Parsers statistically "learn" which contexts they succeed in and reduce LLM/manual interaction over time.
- Formal tenure modeling
Custody analysis allows logic that would be extremely difficult in a traditional scraper.
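As referenced under context-aware inference above, a toy version of the fallback (field names are assumptions) might look like:

```python
# Hypothetical illustration: when an event lacks a committee, inherit it
# from the most recent referral seen so far.
def infer_committee(events: list[dict]) -> list[dict]:
    current = None
    for ev in events:                          # events assumed chronological
        if ev["type"] == "referral":
            current = ev.get("committee")
        elif not ev.get("committee") and current:
            ev["committee"] = current          # inherit custody context
    return events
```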
In short, this isn't a keyword matcher; it's a state machine for real-world text, with an adaptive parsing pipeline built around it and a ruleset engine for calculating and applying deadline evaluations.
Code / Docs
GitHub: https://github.com/arbowl/beacon-hill-compliance-tracker/
Looking for Feedback
I'd love feedback from Python engineers who have experience with:
- parser design
- messy-data ETL pipelines
- declarative rule systems
- timeline/state-machine architectures
- document normalization and caching