Health brand strategy depends on knowing what competitors are doing: pricing pages, indication updates, new patient resources, formulary changes. The strategy team was checking sites by hand, which meant they checked rarely and inconsistently. I built a Python service that does it every day, automatically, and emails the right people when something has actually changed.
The problem
Strategy work is competitive intelligence work. The team needed to know when an indication was added to a competitor's product page, when a new copay assistance program launched, when a label update went live. These changes drive client recommendations and pitch updates.
The existing process was a spreadsheet of URLs and a recurring calendar event. Sites were checked sporadically, often well after a change went live. Small updates were missed entirely. There was no diff capability, so reviewers had to remember what a page looked like last time, which they couldn't.
The brief I gave myself was simple: check every site every day, store snapshots, surface meaningful diffs to the people who can act on them. Engineering should run it. The brand teams should consume it.
The architecture
Site-Watch is a Python scheduler that walks each site listed in config.json, captures every page as plain text, compares it against the most recent archived snapshot, and only does the expensive work (screenshots, visual diffs, report generation, email) when something has actually changed. The output is a self-contained HTML report per host, uploaded over SFTP to a public web server, with an emailed summary linking to it. Sites are grouped by audience, and each group has its own recipient list, so the right team gets the right alerts.
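The change-gated flow described above can be sketched roughly as follows. This is an illustration rather than the actual Site-Watch code: process_page, PageResult, and the on_change callback are hypothetical names, and treating a never-seen page as a silent baseline is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PageResult:
    url: str
    changed: bool


def process_page(
    url: str,
    new_text: str,
    previous_text: Optional[str],
    on_change: Callable[[str, str, str], None],
) -> PageResult:
    """Run the cheap text comparison first and call the expensive step
    (screenshots, visual diff, report, email) only when the text differs."""
    if previous_text is None:
        # Assumption: a page seen for the first time is stored as a
        # baseline and does not trigger an alert.
        return PageResult(url, changed=False)
    if new_text == previous_text:
        return PageResult(url, changed=False)
    on_change(url, previous_text, new_text)
    return PageResult(url, changed=True)
```

Passing the expensive work in as a callable keeps the gate itself trivial to test without a browser or a mail server.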
Capture
For each site, the crawler starts at a root URL and follows internal links, skipping any route that matches the exclude_pages pattern list. requests fetches each page and BeautifulSoup's get_text() extracts the plain text, which drops markup, scripts, and styles as a side effect. PDFs go through PyMuPDF, which extracts the text of each page. Snapshots land in archive/<host>/<date>/<timestamp>/<page-slug>.txt, so the snapshot store is just the filesystem, organised by host and run.
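A minimal sketch of those capture pieces might look like the following. The function names are hypothetical, and several details are assumptions the post does not specify: exclude_pages is treated as glob-style patterns matched against the URL path, the date and timestamp folders use %Y-%m-%d and %H%M%S, and the slug is the path with non-alphanumeric runs collapsed to hyphens.

```python
import re
from datetime import datetime
from fnmatch import fnmatch
from pathlib import Path
from urllib.parse import urlparse


def is_crawlable(url: str, root: str, exclude_pages: list[str]) -> bool:
    """Keep only same-host links whose path matches no exclude pattern."""
    target, start = urlparse(url), urlparse(root)
    if target.netloc != start.netloc:
        return False
    return not any(fnmatch(target.path, pattern) for pattern in exclude_pages)


def extract_text(html: str) -> str:
    """Plain text of a page via BeautifulSoup's get_text()."""
    # Imported here so the path helpers work without beautifulsoup4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)


def snapshot_path(archive: Path, url: str, run_time: datetime) -> Path:
    """Build archive/<host>/<date>/<timestamp>/<page-slug>.txt for a URL."""
    parsed = urlparse(url)
    # Assumed slug rule: collapse non-alphanumeric runs, fall back to "index".
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parsed.path).strip("-") or "index"
    return (
        archive
        / parsed.netloc
        / run_time.strftime("%Y-%m-%d")
        / run_time.strftime("%H%M%S")
        / f"{slug}.txt"
    )
```

Zero-padded folder names are a deliberate choice in this sketch: they make alphabetical order match chronological order, which keeps "find the previous snapshot" a simple sort.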
Diff
Plain text gets compared with diff-match-patch, which produces patches between the old and new snapshot. The diff is rendered into the HTML report with <ins> and <del> highlighting on the changed words. The semantic-cleanup pass collapses a one-word edit into a one-word highlight rather than reporting an entire paragraph as a delete + insert.
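As a rough illustration of that step, the sketch below computes a diff with diff-match-patch, runs its semantic cleanup, and wraps the changed runs in <ins> and <del>. The function names text_diff and render_diff are hypothetical, and the real report markup is likely richer than this.

```python
from html import escape

# Operation codes used in diff-match-patch (op, text) tuples.
DELETE, EQUAL, INSERT = -1, 0, 1


def text_diff(old: str, new: str) -> list[tuple[int, str]]:
    """Character-level diff, tidied into human-sized chunks."""
    # Imported here so render_diff can be used without the dependency.
    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    diffs = dmp.diff_main(old, new)
    dmp.diff_cleanupSemantic(diffs)  # rewrites the list in place
    return diffs


def render_diff(diffs: list[tuple[int, str]]) -> str:
    """Wrap inserted and deleted runs in <ins>/<del>, escaping the text."""
    parts = []
    for op, text in diffs:
        safe = escape(text)
        if op == INSERT:
            parts.append(f"<ins>{safe}</ins>")
        elif op == DELETE:
            parts.append(f"<del>{safe}</del>")
        else:
            parts.append(safe)
    return "".join(parts)
```

Escaping before wrapping matters here because the snapshot text comes from third-party pages and ends up inside a report that people open in a browser.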
Screenshots and visual diff
When the text diff is non-empty, Playwright opens a headless Chromium against the page and grabs a full-page screenshot. Pillow + NumPy then generate a three-panel image (old, new, and a red-highlighted overlay of the changed pixels) so a reviewer can spot the change visually in seconds. This only runs on pages that actually changed, which is the difference between a fast daily run and an expensive one.
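The three-panel image can be approximated with a few lines of Pillow and NumPy. This is a sketch under stated assumptions, not the production code: three_panel and its threshold parameter are hypothetical, and padding both screenshots onto a shared white canvas is one way (of several) to handle pages whose height changed.

```python
import numpy as np
from PIL import Image


def three_panel(old: Image.Image, new: Image.Image, threshold: int = 0) -> Image.Image:
    """Return old | new | overlay, with changed pixels painted red."""
    # Full-page screenshots can differ in size, so pad both onto a
    # shared white canvas before comparing pixel by pixel.
    width = max(old.width, new.width)
    height = max(old.height, new.height)
    panels = []
    for img in (old, new):
        canvas = Image.new("RGB", (width, height), "white")
        canvas.paste(img.convert("RGB"), (0, 0))
        panels.append(canvas)

    # int16 avoids uint8 wrap-around when subtracting channel values.
    a = np.asarray(panels[0], dtype=np.int16)
    b = np.asarray(panels[1], dtype=np.int16)
    # A pixel counts as changed if any channel moved by more than threshold.
    changed = np.abs(a - b).max(axis=2) > threshold

    overlay = np.array(panels[1], dtype=np.uint8)  # writable copy of "new"
    overlay[changed] = (255, 0, 0)

    sheet = Image.new("RGB", (width * 3, height), "white")
    for i, panel in enumerate([*panels, Image.fromarray(overlay)]):
        sheet.paste(panel, (i * width, 0))
    return sheet
```

A non-zero threshold would be one way to ignore anti-aliasing or compression jitter between two otherwise identical renders.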
Report and delivery
For each host with changes, the run generates an index.html with a side-by-side text diff per page and a "Compare Screenshots" modal showing the visual diff. Style and script assets are bundled in so the report works as a standalone artifact. paramiko uploads the bundle over SFTP to a public web server, and yagmail sends a per-group email containing the site name, the change count, a "View report" button linking to the public URL, and the list of changed pages.
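To make the email step concrete, here is a small sketch of how the summary could be assembled and handed to yagmail. The names build_summary and send_summary, the subject wording, and the bare-bones HTML are all assumptions for illustration; the SFTP upload is left out.

```python
from html import escape


def build_summary(site: str, report_url: str, changed_pages: list[str]) -> tuple[str, str]:
    """Return (subject, html_body) for one host's change email."""
    count = len(changed_pages)
    noun = "page" if count == 1 else "pages"
    subject = f"{site}: {count} {noun} changed"
    items = "".join(f"<li>{escape(page)}</li>" for page in changed_pages)
    body = (
        f"<h2>{escape(site)}</h2>"
        f"<p>{count} {noun} changed since the last snapshot.</p>"
        f'<p><a href="{escape(report_url, quote=True)}">View report</a></p>'
        f"<ul>{items}</ul>"
    )
    return subject, body


def send_summary(smtp, recipients: list[str], subject: str, body: str) -> None:
    """Send the summary; smtp is expected to be a yagmail.SMTP instance
    (or any object with a compatible send method)."""
    smtp.send(to=recipients, subject=subject, contents=body)
```

Keeping the message construction separate from the sending call means the wording can be checked in a unit test without touching a mail server.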
Key decisions
Snapshot text, not HTML
Comparing raw HTML would surface every cache-buster, every analytics nonce, every dynamic ID as a change. BeautifulSoup's text extraction removes almost all of that as a side effect, so the diff is signal: copy changes, indication updates, new sections. The signal-to-noise ratio is dramatically better than with a byte-level diff, and there's no custom strip-rule list to maintain.
Screenshot only on change
Capturing every page every day through Playwright across hundreds of URLs is expensive in time and CPU. Snapshotting text is cheap. Triggering the headless browser only when the text diff is non-empty keeps the daily run lightweight, so the report email arrives within minutes of the schedule firing.
Filesystem snapshots, SFTP delivery
A database for snapshot history would be over-engineered. The natural unit is "the text of page X on date Y", which is exactly what the filesystem is shaped for: one file per page per timestamp, archived under host. The reports are static HTML, so any web server can host them. SFTP is the lowest-friction delivery to whatever box is already serving the team's other reports.
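With that layout, "what did this page say last time?" is a directory listing rather than a query. A possible lookup is sketched below; latest_snapshot is a hypothetical helper, and it assumes the date and timestamp folder names are zero-padded so that sorting them as strings also sorts them chronologically.

```python
from pathlib import Path
from typing import Optional


def latest_snapshot(archive: Path, host: str, slug: str) -> Optional[Path]:
    """Newest archived copy of a page, or None if it has never been seen."""
    host_dir = archive / host
    if not host_dir.is_dir():
        return None
    # Layout: <host>/<date>/<timestamp>/<page-slug>.txt
    candidates = host_dir.glob(f"*/*/{slug}.txt")
    # Assumption: zero-padded folder names, so comparing path parts
    # lexicographically picks the most recent run.
    return max(candidates, key=lambda p: p.parts, default=None)
```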
Per-group config
config.json groups sites by audience (DTC competitors, HCP competitors, personal sites) and pairs each group with its own email recipient list. The DTC strategist sees DTC moves without being on the HCP firehose, and vice versa. Adding a competitor is a JSON edit, not a code change.
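The real config.json is not reproduced in this post, so the shape below is purely illustrative: the key names (groups, recipients, sites, url), the example hosts, and the recipients_by_site helper are all assumptions. It shows how a grouped config can be flattened into a routing table from site to recipients.

```python
import json

# Hypothetical config shape; the actual keys in config.json may differ.
EXAMPLE_CONFIG = """
{
  "groups": [
    {
      "name": "DTC competitors",
      "recipients": ["dtc-strategy@example.com"],
      "sites": [
        {"url": "https://brand-a.example.com/", "exclude_pages": ["/careers/*"]}
      ]
    },
    {
      "name": "HCP competitors",
      "recipients": ["hcp-strategy@example.com"],
      "sites": [
        {"url": "https://brand-b-hcp.example.com/", "exclude_pages": []}
      ]
    }
  ]
}
"""


def recipients_by_site(config_text: str) -> dict[str, list[str]]:
    """Map each tracked site URL to the recipient list of its group."""
    config = json.loads(config_text)
    routing: dict[str, list[str]] = {}
    for group in config["groups"]:
        for site in group["sites"]:
            routing[site["url"]] = list(group["recipients"])
    return routing
```

In this shape, tracking a new competitor means appending one object to a group's sites list, and it inherits that group's recipients automatically.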
Outcomes
Site-Watch has become the kind of tool nobody mentions but everyone uses. It catches competitor moves the team would otherwise miss and surfaces them as a daily email with a one-click link to the full diff. Because the expensive work only fires on real changes, a quiet day costs little more than a round of HTTP fetches and text comparisons.
What I'd do differently
I'd add a noise filter for session-bound URL parameters and dynamic IDs that survive the text strip on some sites. Today these can produce a noisy email on an otherwise quiet competitor day. A small per-host allow/deny list of tokens to normalise before diffing would collapse those into the actual signal.
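One possible shape for that filter, which does not exist in Site-Watch today, is a per-host list of regex rules applied to both snapshots before they are diffed. Everything here is hypothetical: the NOISE_RULES table, the example patterns, and the normalise function.

```python
import re

# Hypothetical per-host rules: (regex, replacement) pairs applied in order.
NOISE_RULES: dict[str, list[tuple[str, str]]] = {
    "brand-a.example.com": [
        # Keep the parameter name but blank out the session value.
        (r"([?&](?:sessionid|sid)=)[A-Za-z0-9]+", r"\1<id>"),
        # Collapse generated reference IDs such as ref-0a1b2c3d.
        (r"\bref-[0-9a-f]{8}\b", "ref-<id>"),
    ],
}


def normalise(host: str, text: str) -> str:
    """Replace known-volatile tokens so they never reach the diff."""
    for pattern, replacement in NOISE_RULES.get(host, []):
        text = re.sub(pattern, replacement, text)
    return text
```

Two snapshots that differ only in those tokens normalise to the same string, so the text diff comes back empty and neither the browser nor the email fires.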
I'd also build a small admin UI for the URL list. Today the config sits in the repo, so adding a new competitor means a PR. A tiny self-service page that updates config.json and reloads would put the team in direct control of what's tracked, without an engineer in the loop.