Real Record — Open Civic Data
Descriptive Documentation & Data Dictionary
Dataset: Real Record — Open Civic Data: Local Government Transcripts and Fiscal Records
Publisher: Real Housing Reform Initiative (RHRI) — realrecord.org
License: CC BY 4.0
S3 Bucket: s3://real-record-civic-data/
Transcript + Fiscal Coverage: City of Bellingham, WA; Whatcom County, WA (incl. Ferndale, Lynden); King County, WA (in active pipeline)
Legislative Coverage: City of Bellingham + Whatcom County + Washington State (expands automatically with corpus)
Current Size: ~3 GB (text only)
Contact: brian@realhousingreform.org
Overview
Real Record is open government transparency infrastructure — a structured, cross-linked civic-data corpus designed to make the full record of local government (what officials said, what was authorized, what was spent, and what was enacted) permanently accessible to researchers, journalists, and residents. It links four layers:
| Layer | What it contains | Primary use |
|---|---|---|
| Transcripts | Speaker-attributed meeting speech | Who said what, when |
| Fiscal joins | Budget, revenue, levy, and fee data | What was authorized and spent |
| Briefings | Machine-assisted meeting summaries | Rapid discovery and search |
| Legislative joins | Ordinances, resolutions, contract awards, grant acceptances, and state bills with meeting cross-references | What was enacted, what it cost, and when it was discussed |
All four layers share a common meeting_id key and can be joined across layers
using jurisdiction + fiscal_year for financial data, or bill_id for legislative linkages.
Architecture & Scalability
Real Record is designed for systematic statewide expansion, not one-off jurisdiction coverage.
Auto-connect on import: When data is imported for a new jurisdiction — whether transcripts, fiscal documents, legislative records, or meeting packets — the pipeline automatically resolves and links all entities across all four layers. New jurisdictions, speakers, funds, and legislative items receive unique system identifiers on first encounter. No manual schema changes or configuration are required to add a jurisdiction to the corpus.
Washington State coverage universe:
RHRI maintains a WA Expansion Tracker covering all 281 incorporated Washington State cities and towns, organized into four tiers by population:
| Tier | Population | Cities | Example |
|---|---|---|---|
| 1 — Priority | 100,000+ | 9 | Seattle, Spokane, Tacoma, Vancouver, Bellevue |
| 2 — High | 25,000–100,000 | 36 | Bellingham, Kirkland, Yakima, Kennewick |
| 3 — Medium | 5,000–25,000 | 77 | Port Angeles, Pullman, Ellensburg |
| 4 — Low | <5,000 | 159 | Smaller cities and towns statewide |
As of May 2026: Bellingham and Whatcom County are active in the pipeline. Port Angeles (Clallam County) is in active research. The remaining 278 jurisdictions are prioritized and tracked.
Two-track expansion model:
Track A — Platform-standardized scraping: Washington State's public meeting ecosystem is dominated by a small number of commercial platforms — CivicPlus (~57 WA cities) and Granicus (~20 WA cities) account for the majority of Tier 1–3 jurisdictions. A single scraper built for each platform automatically reaches dozens of cities. Whatcom County and Bellingham were chosen as the founding test bed because their records were well-structured and scrapable, enabling rapid pipeline development and validation before scaling to larger jurisdictions.
Track B — Underserved jurisdictions (separate project and funding): Smaller cities and rural communities that lack the infrastructure to record, store, or publish their public meetings. For these jurisdictions, RHRI operates a parallel capacity-building program — providing the recording infrastructure and data collection directly, then ingesting the resulting records into the Real Record corpus under the same schema. These communities gain a permanent civic record for the first time; the corpus gains geographic coverage that no amount of web scraping could reach.
Both tracks produce structurally identical records — the same four layers, the same partition scheme, the same identifiers — so researchers can query across Track A and Track B jurisdictions without distinguishing the collection method.
Coverage at scale: At full Washington State coverage (39 counties + 281 cities), the corpus
is projected to reach ~70 GB. The Hive partition scheme (jurisdiction/year) and Parquet format
support efficient filtered reads at that scale without requiring users to download the full dataset.
Dataset Layers
Layer 1: Transcript Segments
Segment-level (sentence or speaker-turn) speech records from official public meetings.
Source: Official audio recordings (MP3) and closed-caption VTT files published by the City of Bellingham and Whatcom County. Transcription via Azure Speech-to-Text; speaker attribution via roster matching and heuristic classification.
Files: s3://real-record-civic-data/data/parquet/layer=transcripts/
Schema: segments
| Field | Type | Description |
|---|---|---|
segment_id |
string | Unique segment identifier (UUID) |
meeting_id |
string | Canonical meeting identifier (see Meeting ID Format) |
jurisdiction |
string | Jurisdiction slug (e.g., bellingham-wa, whatcom-county-wa) |
meeting_date |
date | Date of the meeting (YYYY-MM-DD) |
meeting_body |
string | Body code (e.g., BEL-CON, WHA-CON, BEL-PLN) |
segment_index |
integer | Ordinal position of segment within meeting |
start_time_sec |
float | Start offset in seconds from meeting audio start |
end_time_sec |
float | End offset in seconds from meeting audio start |
speaker_id |
string | Speaker identifier (UUID); NULL if unresolved |
speaker_name |
string | Resolved speaker display name; NULL if unresolved |
speaker_role |
string | elected_official, staff, public_commenter, unknown |
speaker_confidence |
float | Resolution confidence score (0.0–1.0) |
is_public_comment |
boolean | True if segment is part of a public comment period |
text |
string | Full text of the segment |
word_count |
integer | Word count of segment text |
source |
string | speech_to_text or vtt_caption |
collection_track |
string | digital_scrape (automated from public portal) or rhri_recorded (RHRI provided the recording for a jurisdiction without meeting infrastructure) |
Schema: speakers
| Field | Type | Description |
|---|---|---|
speaker_id |
string | UUID |
jurisdiction |
string | Jurisdiction slug |
display_name |
string | Full name as appears on official rosters |
role |
string | elected_official, staff, public_commenter |
body_codes |
string | Pipe-separated body codes for this speaker |
term_start |
date | Start of term (elected officials) |
term_end |
date | End of term; NULL if current |
Speaker Identification Methodology
Known participants (elected officials and staff) are pre-loaded from official rosters
before each meeting. Every city council member, county councilmember, planning commissioner,
and recurring staff presenter has a confirmed speaker_id. When the same person appears across
meetings, their segments are attributed to the same persistent ID — enabling longitudinal
analysis of individual officials' positions, participation rates, and topic focus over time.
Public commenters are identified by first name and last initial only (display_name
format: "Jane D."). Most public commenters self-introduce at the start of their remarks —
"My name is Jane D., I'm a resident of the Sunnyland neighborhood" — and the pipeline
extracts this pattern to log participation without exposing full names. Commenters who do
not self-identify are logged as public_commenter by role with no name attached.
Voices from the Meeting
Every meeting in the corpus includes a participation log — a structured record of who showed up and spoke, separate from the full transcript. This serves two purposes:
-
Public accountability: Residents who take the time to speak at a public meeting deserve to be acknowledged. The participation log gives them a named record that their voice was heard and logged, even if the briefing summary condenses or omits their individual comment.
-
Civic signal: Raw headcount and commenter volume are themselves data. Some meetings attract 3 speakers; others — like the October and November 2024 Whatcom County Council sessions on the jail bond — drew 4+ hours of continuous public comment. That disparity is part of the story.
Schema: meeting_participants
| Field | Type | Description |
|---|---|---|
participant_id |
string | UUID |
meeting_id |
string | FK → meetings |
jurisdiction |
string | Jurisdiction slug |
display_name |
string | First name + last initial ("Jane D.") if self-identified; NULL otherwise |
comment_position |
integer | Order in which they spoke during public comment |
comment_topics |
string | Pipe-separated topic tags extracted from their remarks |
comment_duration_sec |
integer | Approximate speaking time in seconds |
self_identified |
boolean | True if the commenter stated their own name |
neighborhood |
string | Self-stated neighborhood or city of residence if mentioned |
Meeting ID Format
Meeting IDs follow a hierarchical slug pattern that encodes geography and meeting body:
{JURISDICTION}-{BODY}-{YYYY-MM-DD}[-{SUFFIX}]
The jurisdiction segment is derived from the full geographic hierarchy — county + city/entity + body — so IDs remain globally unique as coverage expands:
| Segment | Derivation | Example |
|---|---|---|
| Jurisdiction slug | County + city/entity abbreviation | BEL (Bellingham, Whatcom Co.) · WHA (Whatcom County) · KCO (King County) |
| Body code | Committee or board abbreviation | CON (Council) · PLN (Planning) · CTW (Committee of the Whole) |
| Date | ISO 8601 date | 2026-04-28 |
| Suffix (optional) | Distinguishes same-day sessions | -A, -B, -EXS (executive session) |
Full example: BEL-CON-2026-04-28 = Bellingham City Council, April 28 2026.
Namespace capacity: Because each jurisdiction slug encodes a unique county + city
combination, and each body code is scoped within that jurisdiction, the ID space is
effectively unbounded for Washington State coverage (39 counties × 281 cities ×
multiple bodies per city × 365 dates per year). Even at full multi-state expansion,
a state prefix (WA-, OR-, ID-) can be prepended to the jurisdiction slug
without any changes to the schema — existing IDs remain stable and new states occupy
their own prefix namespace.
Meeting Body Codes
| Code | Body | Jurisdiction |
|---|---|---|
BEL-CON |
Bellingham City Council | City of Bellingham |
BEL-PLN |
Bellingham Planning Commission | City of Bellingham |
BEL-BFC |
Budget & Finance Committee | City of Bellingham |
WHA-CON |
Whatcom County Council | Whatcom County |
WHA-PLN |
Whatcom County Planning Commission | Whatcom County |
WHA-CTW |
Committee of the Whole | Whatcom County |
FER-CON |
Ferndale City Council | City of Ferndale |
LYN-CON |
Lynden City Council | City of Lynden |
Layer 2: Fiscal Joins
Fund-level revenue and expenditure history, tax levy authorizations, and ballot-measure linkages. Structured to enable promise-vs-delivery analysis across budget cycles.
Source: City of Bellingham Adopted Budget Books (2005–present), Whatcom County Adopted Budget Books, Comprehensive Annual Financial Reports (CAFRs) 1999–2023, Whatcom County Assessor Tax Books, and WA Department of Revenue levy data.
Files: s3://real-record-civic-data/data/parquet/layer=fiscal/
Schema: funds
| Field | Type | Description |
|---|---|---|
fund_id |
string | UUID |
jurisdiction |
string | Jurisdiction slug |
fund_code |
string | Official fund number (e.g., 120, 411) |
fund_name |
string | Official fund name |
fund_type |
string | general, special_revenue, capital, debt_service, enterprise |
rcw_authority |
string | Washington RCW citation (if applicable) |
description |
string | Fund purpose summary |
Schema: revenue_collections
| Field | Type | Description |
|---|---|---|
collection_id |
string | UUID |
fund_id |
string | FK → funds |
jurisdiction |
string | Jurisdiction slug |
fiscal_year |
integer | Fiscal year (YYYY) |
revenue_source_id |
string | FK → revenue_sources |
amount_usd |
decimal | Collected revenue in USD |
source_document |
string | Source document reference |
confidence |
string | high (CAFR-derived), medium (budget book), low (estimate) |
Schema: revenue_sources
| Field | Type | Description |
|---|---|---|
source_id |
string | Slug identifier |
source_name |
string | Display name |
source_type |
string | Revenue category — see Source Types below |
jurisdiction |
string | Jurisdiction slug |
rcw_authority |
string | Washington RCW citation |
Source Types:
| Value | Description | Example |
|---|---|---|
property_tax |
Regular and excess property tax levies | General Fund levy, Greenways levy |
sales_tax |
State-shared and local option sales tax | 0.1% Public Safety Sales Tax, Annexation Sales Tax |
bo_tax |
Business & Occupation tax on gross receipts | Bellingham B&O on manufacturing, retail |
utility_tax |
Occupation taxes on utility gross revenues | Bellingham BMC 6.06 (water 18.25%, wastewater 11.5%) |
utility_rate |
User rates for water, sewer, stormwater services | Monthly service charges, consumption tiers |
impact_fee |
One-time development impact fees (RCW 82.02.050) | Transportation, park, school, fire impact fees |
connection_fee |
System development charges for utility connections | Water/sewer connection charges |
permit_fee |
Building, land use, and development permit fees | Building permits, subdivision fees |
grant |
Federal, state, and foundation grants | CDBG, ecology grants |
bond |
General obligation or revenue bond proceeds | Capital project bonds |
intergovernmental |
State-shared revenues and interlocal transfers | Motor vehicle fuel tax, liquor tax, ILA payments |
other |
Miscellaneous revenues not categorized above | Donations, miscellaneous receipts |
Schema: budget_line_items
| Field | Type | Description |
|---|---|---|
line_id |
string | UUID |
fund_id |
string | FK → funds |
fiscal_year |
integer | Fiscal year |
category |
string | personnel, operations, capital, debt, transfers, cash_balance |
subcategory |
string | Department or program label |
amount_usd |
decimal | Budgeted or actual amount in USD |
amount_type |
string | adopted, amended, actual |
source_document |
string | Source document reference |
Schema: ballot_measures
| Field | Type | Description |
|---|---|---|
measure_id |
string | Slug identifier |
jurisdiction |
string | Jurisdiction slug |
measure_name |
string | Official ballot measure name |
election_date |
date | Election date |
passed |
boolean | Passed / failed |
yes_pct |
float | Percent yes votes |
authorized_amount_usd |
decimal | Authorized levy or bond amount |
authorized_rate |
float | Tax rate (per $1,000 AV) if applicable |
rcw_authority |
string | Washington RCW citation |
ordinance |
string | Enacting ordinance reference |
description |
string | Summary of what was authorized |
Schema: levy_authorizations
| Field | Type | Description |
|---|---|---|
levy_id |
string | Slug (e.g., bellingham-greenways-v) |
measure_id |
string | FK → ballot_measures |
fund_id |
string | FK → funds |
jurisdiction |
string | Jurisdiction slug |
levy_name |
string | Official levy name |
levy_type |
string | regular, excess, bond, special |
start_year |
integer | First collection year |
end_year |
integer | Final collection year (NULL if ongoing) |
authorized_rate |
float | Authorized rate per $1,000 AV |
Layer 3: Written Briefings
Machine-assisted meeting summaries produced by the RHRI briefing pipeline. Each meeting may have up to three briefing tiers.
Source: Derived from Layer 1 transcripts plus agenda PDFs and supporting documents. Produced using a reproducible pipeline (Claude claude-sonnet-4-5 via Anthropic API) with structured prompts versioned in the RHRI pipeline repository.
Files: s3://real-record-civic-data/data/markdown/layer=briefings/
s3://real-record-civic-data/data/parquet/layer=briefings/
Briefing Tiers
| Tier | Name | Description | Typical length |
|---|---|---|---|
basic |
Basic Briefing | Key decisions, votes, and action items | 300–500 words |
supplemental |
Supplemental Briefing | Topic-tagged summaries with fiscal context | 600–1,000 words |
narrative |
Full Narrative | Complete chronological account with public comment depth | 1,500–4,000 words |
Schema: briefings
| Field | Type | Description |
|---|---|---|
briefing_id |
string | Slug (matches meeting_id + tier) |
meeting_id |
string | FK → meetings |
jurisdiction |
string | Jurisdiction slug |
meeting_date |
date | Meeting date |
tier |
string | basic, supplemental, narrative |
title |
string | Meeting title |
executive_summary |
string | 2–3 sentence summary |
full_text |
string | Complete briefing text (Markdown) |
topics |
string | Pipe-separated topic tags |
fiscal_links |
string | Pipe-separated fund/levy/source IDs mentioned |
pipeline_version |
string | RHRI pipeline version used |
model |
string | LLM model used |
generated_at |
timestamp | Generation timestamp |
Layer 4: Legislative Joins
Structured tracking of ordinances, resolutions, and state legislation (RCW-cited bills) cross-referenced to the meetings where they were discussed and/or enacted.
Source: Official legislative records from jurisdiction clerk websites, WA State Legislature (app.leg.wa.gov), and ordinance/resolution text extracted from meeting packet PDFs. Cross-references are derived from briefing pipeline analysis and verified against meeting minutes.
Live tracker: realhousingreform.org/legislation
Files: s3://real-record-civic-data/data/parquet/layer=legislative/
Schema: bills
Covers all action items tracked through the RHRI Legislation Tracker at realhousingreform.org/legislation. As of May 2026: 908 items tracked, 555 approved, across 3 jurisdictions including 42 state bills (City of Bellingham, Whatcom County, Washington State).
The legislative layer automatically mirrors transcript and fiscal coverage — as new jurisdictions are added to the Real Record corpus, their legislative records are ingested in parallel. No separate configuration is required per jurisdiction.
Local items use a YYYY-NNN numbering system matching the jurisdiction's official clerk numbering.
State bills use standard WA legislative notation (e.g., SB 602, HB 1234).
| Field | Type | Description |
|---|---|---|
bill_id |
string | Slug identifier (e.g., bel-2024-898, wa-sb-602-2026) |
item_number |
string | Official item number as assigned by clerk (2024-898, SB 602) |
jurisdiction |
string | Jurisdiction slug (bellingham-wa, whatcom-county-wa, washington-state) |
bill_type |
string | Category — see Bill Types below |
title |
string | Official title as it appears on the agenda |
status |
string | final_vote_approved, committee_discussion, first_reading, tabled, failed, pending |
introduced_date |
date | Date first appearing on an agenda |
last_discussed_date |
date | Most recent meeting date where the item appeared |
passed_date |
date | Date of final vote; NULL if not yet passed |
vote_tally |
string | Vote count (e.g., 7-0, 6-0-1, 3-0); NULL if no vote taken |
dollar_amount |
decimal | Associated dollar amount (contract value, grant amount, budget impact); NULL if not applicable |
mention_count |
integer | Number of meeting segments in which this item is referenced |
rcw_authority |
string | Authorizing RCW citation(s) if state-law-enabled |
fund_links |
string | Pipe-separated fund IDs affected by this item |
levy_links |
string | Pipe-separated levy IDs authorized by this item |
description |
string | Plain-language summary of what the item does |
source_url |
string | URL to official text or agenda packet |
Bill Types:
| Value | Description | Example |
|---|---|---|
ordinance |
Local law or code amendment requiring council vote | Housing Action Plan Funding Allocation |
resolution |
Formal council position, policy adoption, or appointment | Rapid Transit Locally Preferred Alternative |
contract_award |
Procurement decisions — services, construction, or interlocal agreements | Pacific Street Fleet Building Contract ($808K) |
grant_acceptance |
Federal, state, or foundation grants received by the jurisdiction | Nooksack Riparian Resilience Project Grant ($2.2M) |
state_bill |
Washington State Legislature bills tracked for local impact | SB 602 Automatic License Plate Readers, SB 5184 Parking Reform |
discussion_item |
Agenda items presented for information without a vote | Downtown Vacancy Presentation, 2025 Budget Presentation |
Schema: legislative_cross_references
Links each bill or action item to the meetings where it appeared. Each row is one (bill, meeting) pair — a single bill may appear in multiple meetings as it moves through readings, public hearings, and final votes.
| Field | Type | Description |
|---|---|---|
cross_ref_id |
string | UUID |
bill_id |
string | FK → bills |
meeting_id |
string | FK → meetings (transcript layer) |
jurisdiction |
string | Jurisdiction slug |
reference_type |
string | first_reading, committee_discussion, public_hearing, final_vote, discussion_only |
segment_ids |
string | Pipe-separated segment IDs where the bill is mentioned |
agenda_item |
string | Agenda item number or title (e.g., "Agenda Item 7b") |
vote_result |
string | approved, failed, tabled, withdrawn; NULL if no vote |
vote_tally |
string | Vote count (e.g., 7-0, 6-0-1) |
dollar_amount |
decimal | Dollar amount if this action involved an expenditure or grant |
Cross-Layer Join: Legislation → Fiscal → Transcripts
The legislative layer is the connective tissue between the fiscal and transcript layers. A ballot measure (fiscal layer) → enacting ordinance (legislative layer) → council discussion (transcript layer) forms a complete accountability chain: what voters authorized, what council passed, and what officials said about it before the vote.
# Example: Find every council segment discussing the utility occupation tax ordinance
import pyarrow.dataset as ds
cross_refs = ds.dataset("s3://real-record-civic-data/data/parquet/layer=legislative/",
format="parquet", partitioning="hive", filesystem=fs)
# Get all meeting IDs where BMC 6.06 (utility occupation tax) was discussed
bmc_606_refs = cross_refs.to_table(
filter=ds.field("bill_id") == "bel-bmc-6.06-utility-occupation-tax"
).to_pandas()
# Join to transcript segments
meeting_ids = bmc_606_refs["meeting_id"].tolist()
segments = transcript_dataset.to_table(
filter=(
(ds.field("jurisdiction") == "bellingham-wa") &
(ds.field("meeting_id").isin(meeting_ids))
)
).to_pandas()
print(f"Found {len(segments)} segments across {len(meeting_ids)} meetings")
Topic Taxonomy: The 7 Policy Tenants
Every decision made by local government — every vote, ordinance, expenditure, levy, and policy outcome — connects to one or more of seven Policy Tenants through its subtopics. This is the organizing premise of the Real Record topic taxonomy: the seven tenants are comprehensive enough that no act of local government falls outside them.
A stormwater fee increase connects through Environment (water management) and Taxes (rate structure). A jail bond measure connects through Safety (corrections capacity) and Governance (ballot authorization). A housing ordinance may connect through Housing (zoning), Economy (development impact), and Taxes (impact fees). The multi-tenant connections are the story — they reveal which policy domains actually drive decisions, and which tenant combinations recur across years and jurisdictions.
Real Record tracks approximately 1,000 subtopics organized under the seven tenants. Every meeting segment, briefing, ordinance, and fiscal record in the corpus is tagged to the subtopics that appear in it — and through those subtopics, to one or more tenants. This makes the taxonomy a navigation layer across all four data layers simultaneously.
The tenants are the organizing spine of the Real Briefings Topic Hub: users can filter by tenant, browse subtopics ranked by mention count, and click through to every meeting, vote, and fund page where that topic appeared.
| Tenant | What it covers | Example subtopics |
|---|---|---|
| Housing | Land use, density, affordability, zoning, ADUs | affordable-housing, ADUs, middle-housing, urban-growth-areas, Housing Policy |
| Environment | Water, land, climate, natural resources | flood-mitigation, stormwater-management, flood-response, water-rights-adjudication, Lake Whatcom Watershed |
| Taxes | Revenue policy, levies, rates, tax structure | property-tax, sales-tax, utility-tax, B&O-tax, levy-authorizations, tax-exemptions |
| Economy | Local economic development, business, infrastructure | budget-finance, transportation-policy, economic-development, capital-projects |
| Safety | Public safety, emergency services, corrections | public-safety, jail-capacity, behavioral-health, emergency-management, fire-districts |
| Social | Community services, health, parks, engagement | behavioral-health, parks-recreation, community-engagement, library, human-services |
| Governance | Planning process, elections, administration, land-use law | comprehensive-plan, growth-management-act, urban-growth-areas, committee-appointments, PROS-plan |
How a topic connects across layers
Each topic tag is a cross-layer connector. When a user clicks a topic — say, flood-mitigation — they can reach:
- Transcript segments: Every speaker turn where the topic appeared, across all meetings and bodies, with speaker name and meeting date
- Briefings: Meeting summaries where the topic was flagged — ranked by recency or mention count
- Legislative joins: Ordinances, resolutions, and state bills cross-referenced to that topic (e.g., stormwater ordinance, flood-zone overlay amendment)
- Fiscal joins: Fund pages and revenue sources linked to the topic (e.g., Stormwater Utility Fund 480, Flood Control Zone District levies)
This four-layer cross-link is what distinguishes Real Record from a simple transcript archive. A topic isn't just a keyword — it's a thread connecting what residents said at a public hearing, what the council voted on, what it cost, and where the money came from.
Scale of the taxonomy
The current Whatcom County / Bellingham corpus contains approximately 1,000 distinct topic tags across the seven tenants. As additional jurisdictions are added, topics auto-resolve: a topic already in the taxonomy gains new data from the new jurisdiction's meetings automatically. Novel topics from new jurisdictions are added to the appropriate tenant during onboarding review.
The full browsable taxonomy is live at realhousingreform.org/topics (migrating to realrecord.org/topics).
Hot Topics: What's Being Discussed — and What Isn't
Real Record includes a topic-frequency layer that tracks how often specific subjects appear across meeting transcripts, over time and across jurisdictions. This powers a "trending topics" surface that shows researchers and residents what their government is actually spending time talking about.
What the data shows: Topic frequency is computed at the segment level — each speaker turn that references a tracked topic is counted. Frequency is normalized by meeting volume so jurisdictions with more meetings don't artificially dominate. Topics can be tracked monthly, quarterly, or annually.
The negative space is as important as the signal. An absence of discussion is a finding. For example: analysis of Bellingham City Council transcripts (2019–2026) shows that terms like jobs, employment, wages, and economic development appear in fewer than 0.3% of all segments — a near-total absence compared to housing, utilities, public safety, and budget topics that dominate the agenda. That gap tells a story about what local government is and isn't prioritizing, and it is only visible when you can measure the full record.
Trending topics surface issues gaining momentum — rising mention counts week-over-week — before they reach mainstream coverage. The jail bond / public safety sales tax discussion in Whatcom County, for example, was detectable as a trending topic in transcript data months before it became a front-page story.
Schema: topic_frequencies
| Field | Type | Description |
|---|---|---|
topic_id |
string | Slug identifier (matches topics table) |
jurisdiction |
string | Jurisdiction slug |
meeting_body |
string | Body code; NULL for jurisdiction-wide aggregate |
period |
string | YYYY-MM (monthly) or YYYY-Q1 (quarterly) |
mention_count |
integer | Number of segments referencing this topic in the period |
meeting_count |
integer | Number of distinct meetings in which the topic appeared |
speaker_count |
integer | Number of distinct speakers who mentioned the topic |
public_commenter_count |
integer | Number of public commenters (not officials) who mentioned the topic |
trend_delta |
float | Change in normalized frequency vs. prior period (positive = rising) |
Note on the public_commenter_count field: Separating resident-driven topic frequency from council-driven frequency reveals whether an issue is being pushed by the community or by officials — a meaningful distinction for accountability research.
Known Limitations & Data Caveats
-
Speaker attribution confidence: Approximately 65–75% of segments are attributed to named speakers. Unattributed segments default to
speaker_role=unknown. Public commenters are classified by role (not name) unless they self-identify. -
Transcription accuracy: Speech-to-text accuracy varies by audio quality. Proper nouns (parcel numbers, ordinance codes, proper names) are most susceptible to transcription errors. VTT-sourced segments (from official captions) have lower error rates than STT-derived segments.
-
Fiscal year conventions: Bellingham uses a biennial budget (two-year cycle); individual fiscal year figures are extracted from midpoint amendments and CAFR actuals. Whatcom County uses an annual budget. Revenue figures labeled
mediumconfidence are interpolated from biennial totals. -
Briefing completeness: Briefings are not produced for meetings where no transcript is available (audio not published, or transcription failed). Approximately 8% of meeting records have no associated briefing.
-
Coverage dates: Transcript coverage varies by body. Bellingham City Council and Whatcom County Council have the deepest coverage (2019–present). Planning Commission coverage begins 2024 for Bellingham and 2025 for Whatcom County.
Accessing the Data
From S3 (public read-only)
# List top-level structure
aws s3 ls s3://real-record-civic-data/ --no-sign-request
# List transcript partitions for Bellingham
aws s3 ls s3://real-record-civic-data/data/parquet/layer=transcripts/jurisdiction=bellingham-wa/ --no-sign-request
# Download all 2025 Bellingham Council transcripts
aws s3 sync s3://real-record-civic-data/data/parquet/layer=transcripts/jurisdiction=bellingham-wa/year=2025/ \
./local-data/ --no-sign-request
With Pandas / PyArrow
import pandas as pd
import pyarrow.dataset as ds
# Load Bellingham transcript segments for 2025
dataset = ds.dataset(
"s3://real-record-civic-data/data/parquet/layer=transcripts/",
format="parquet",
partitioning="hive",
)
df = dataset.to_table(
filter=(
(ds.field("jurisdiction") == "bellingham-wa") &
(ds.field("year") == 2025)
)
).to_pandas()
With Amazon Athena
-- Create external table (run once)
CREATE EXTERNAL TABLE real_record_segments (
segment_id STRING,
meeting_id STRING,
meeting_date DATE,
meeting_body STRING,
speaker_name STRING,
speaker_role STRING,
is_public_comment BOOLEAN,
text STRING,
word_count INT
)
PARTITIONED BY (jurisdiction STRING, year INT)
STORED AS PARQUET
LOCATION 's3://real-record-civic-data/data/parquet/layer=transcripts/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');
MSCK REPAIR TABLE real_record_segments;
-- Example: find all public comments mentioning housing in 2025
SELECT meeting_id, meeting_date, speaker_name, text
FROM real_record_segments
WHERE year = 2025
AND jurisdiction = 'bellingham-wa'
AND is_public_comment = true
AND LOWER(text) LIKE '%housing%'
ORDER BY meeting_date, segment_index;
File Formats
| Format | Layer(s) | Use case |
|---|---|---|
| Parquet (Snappy) | transcripts, fiscal, briefings | Analytics, Athena, Spark, Pandas |
| JSON-Lines | transcripts, briefings | Streaming pipelines, LLM ingestion |
| Markdown | briefings | Human reading, documentation, RAG |
Provenance & Pipeline
The full data pipeline is open source and documented at: realrecord.org/methodology (this page).
Source documents (original audio, PDF agendas, budget books) are not redistributed
here due to size but are available via the originating jurisdictions' public records.
Accession metadata (source URLs, download dates, document hashes) is included in
the metadata/ prefix of the S3 bucket.
Citation
Real Housing Reform Initiative. (2024–present). Real Record — Open Civic Data:
Local Government Transcripts and Fiscal Records [Data set]. AWS Open Data.
https://registry.opendata.aws/real-record-civic-data/
Contact & Updates
- Primary contact: brian@realhousingreform.org
- Dataset home: https://registry.opendata.aws/real-record-civic-data/ (live after launch)
- Methodology & pipeline: https://realrecord.org/methodology
- Data archive: https://realrecord.org (migration in progress)
Last updated: 2026-05-13


