Collect Data From Any Website With AI + Python
Turn public web pages into clean spreadsheets and JSON. AI writes the scraper; you learn the workflow, the etiquette, and what to do when the page fights back.
Every data site starts with collection. Prices, listings, statistics, schedules — the web is full of public data trapped in page layouts, and a 40-line Python script can liberate it into a spreadsheet. AI writes that script in one shot most of the time. This tutorial covers the workflow plus the two things AI won't volunteer: etiquette and fragility.
Step 0 — The rules of the road
Scrape public pages, gently. Check the site's robots.txt and terms; prefer an official API or downloadable dataset when one exists (ask the AI: 'does [source] offer an API or data export?' Government and reference sites usually do, and it is almost always less work than scraping). Add a delay between requests, identify your script honestly, and never scrape personal data or anything behind a login. A polite scraper requests pages more slowly than a human clicking around.
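Here is a minimal sketch of the robots.txt check, using Python's standard-library urllib.robotparser. The user-agent string, the 2-second default, and the sample robots.txt are all made up for illustration; in a real run you would download the site's own robots.txt first and pass its text in.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration; replace them with your own.
USER_AGENT = "book-price-tutorial-bot/0.1 (contact: you@example.com)"
DEFAULT_DELAY_SECONDS = 2.0


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules let USER_AGENT fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


def chosen_delay(parser: RobotFileParser) -> float:
    """Use the site's Crawl-delay when it is longer than our own default."""
    site_delay = parser.crawl_delay(USER_AGENT) or 0
    return max(DEFAULT_DELAY_SECONDS, float(site_delay))


def polite_pause(parser: RobotFileParser) -> None:
    """Sleep between requests so the site sees slow, steady traffic."""
    time.sleep(chosen_delay(parser))


# A made-up robots.txt, inlined so the example runs without network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""

robots = build_robots(SAMPLE_ROBOTS)
print(is_allowed(robots, "https://example.com/books/page-1"))
print(is_allowed(robots, "https://example.com/account/orders"))
print(chosen_delay(robots))
```

Call is_allowed before every fetch and polite_pause after it. robots.txt does not cover everything in a site's terms, so read those as well.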
Step 1 — Show the AI the page, not the URL
The AI can't browse your target reliably, but it doesn't need to. On the page with your data: right-click the item you want → Inspect. The highlighted HTML is the structure. Copy the surrounding chunk (right-click the element in the inspector → Copy → Copy outerHTML).
Write a Python script using requests and BeautifulSoup that scrapes [the table of historical book prices] from [URL]. Here is the actual HTML of one item, copied from the page: [paste]. Extract: [title, price, date]. Requirements: 2-second delay between requests, a custom User-Agent identifying the script, save results to data.csv AND data.json, print progress, and handle missing fields without crashing.
pip install requests beautifulsoup4
python3 scraper.py   # → data.csv, data.json
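For a sense of what the AI typically hands back, here is a rough sketch of the extraction step alone. The HTML, the class names (book, title, price, date), and the helper names are invented for illustration; your pasted outerHTML will dictate the real selectors. The second item deliberately has no price, to show the "handle missing fields without crashing" requirement.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the chunk you copied from the inspector.
SAMPLE_HTML = """
<ul>
  <li class="book"><span class="title">Dune</span>
      <span class="price">$9.99</span><span class="date">2024-01-05</span></li>
  <li class="book"><span class="title">Emma</span>
      <span class="date">2024-01-06</span></li>
</ul>
"""

FIELDS = ("title", "price", "date")


def text_or_none(item, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = item.select_one(f".{css_class}")
    return node.get_text(strip=True) if node is not None else None


def extract_rows(html):
    """Turn every li.book element into a dict with one key per field."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {field: text_or_none(item, field) for field in FIELDS}
        for item in soup.select("li.book")
    ]


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

In the full script, the html argument would come from requests.get(url, headers=..., timeout=...).text rather than an inline string.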
Step 2 — Validate before you trust
Open the CSV. Spot-check ten rows against the live page. The classic failures: only the first page scraped, prices captured with currency symbols breaking your numbers, and dates in mixed formats. Each is one follow-up prompt — 'also handle pagination, the next-page link looks like this: [paste HTML]'.
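Two of those failures, currency symbols and mixed date formats, can be handled with a small cleaning pass. The sketch below assumes prices use a dot as the decimal separator and that slash dates are day/month/year; both are assumptions you should check against the live page, and the list of date formats is only a starting point.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed formats; extend this tuple with whatever the target site uses.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%b %d, %Y")


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Strip currency symbols and thousands separators; None if unusable."""
    if not raw:
        return None
    cleaned = re.sub(r"[^\d.]", "", raw)  # keep only digits and the decimal point
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def parse_date(raw: Optional[str]) -> Optional[date]:
    """Try each known format in turn; None if none of them match."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


print(parse_price("$1,299.00"))
print(parse_date("March 3, 2024"))
print(parse_date("05/01/2024"))
```

Returning None rather than raising keeps one odd row from killing the run, and a column full of None in the CSV is an easy thing to spot during the ten-row check.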
Step 3 — When the page comes back empty
If your script gets HTML but no data, the site likely loads content with JavaScript after the page loads — requests never sees it. Two escalations, in order: First, check whether the data arrives via a hidden API: in the browser, Inspect → Network tab → reload → filter by Fetch/XHR → click entries until you see your data as JSON. If you find it, scraping that endpoint directly is cleaner than parsing HTML. Second, if there's no findable endpoint, ask the AI to rewrite the scraper with Playwright, which drives a real browser.
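If the Network tab does reveal a JSON endpoint, the scraper shrinks to a download plus a dictionary lookup. This sketch uses the standard library's urllib so it runs with no extra installs; with requests the equivalent fetch is requests.get(url, headers=..., timeout=30).json(). The payload shape (an "items" list with title, price, and date keys) and the user-agent string are invented for illustration, so match them to what you actually see in the response.

```python
import json
import urllib.request

# Hypothetical identifier; use one that honestly describes your script.
USER_AGENT = "book-price-tutorial-bot/0.1 (contact: you@example.com)"


def fetch_json(url: str) -> dict:
    """Download and decode a JSON endpoint found in the Network tab."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def rows_from_payload(payload: dict) -> list:
    """Pick the wanted fields out of each record; missing keys become None."""
    return [
        {
            "title": item.get("title"),
            "price": item.get("price"),
            "date": item.get("date"),
        }
        for item in payload.get("items", [])
    ]


# Made-up payload, inlined so the example runs without network access.
SAMPLE = json.loads(
    '{"items": [{"title": "Dune", "price": 9.99, "date": "2024-01-05"},'
    ' {"title": "Emma"}]}'
)
rows = rows_from_payload(SAMPLE)
print(rows)
```

The same etiquette applies to the endpoint as to the pages: keep the delay, keep the honest User-Agent, and stay away from anything that needs a login.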
Step 4 — From one-off script to dataset pipeline
A scrape is a snapshot; a dataset is snapshots over time. Schedule the script (cron on a server, or a scheduled task on your machine) to run daily, appending to dated files. Within a month you own a time series nobody else has — price history, availability trends, ranking movement. Proprietary accumulated data is the strongest moat a small site can build, and it's exactly how the pricing tracker on this site works.
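One way to read "dated files" is one CSV per run, named after the day it was taken. The sketch below takes that approach; the prices- filename prefix, the field list, and the cron line in the comment are hypothetical, so adapt them to your own setup.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path

FIELDS = ["title", "price", "date"]

# Hypothetical crontab entry to run the scraper daily at 06:00:
#   0 6 * * * /usr/bin/python3 /home/you/scraper.py


def snapshot_path(base_dir, day):
    """Build a per-day filename such as data/prices-2024-01-05.csv."""
    return Path(base_dir) / f"prices-{day.isoformat()}.csv"


def write_snapshot(rows, base_dir, day=None):
    """Write one day's rows to their own file and return the path."""
    day = day or date.today()
    path = snapshot_path(base_dir, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        # Missing keys are written as empty cells; extra keys are ignored.
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return path


sample_rows = [
    {"title": "Dune", "price": "9.99", "date": "2024-01-05"},
    {"title": "Emma"},
]
with tempfile.TemporaryDirectory() as tmp:
    saved = write_snapshot(sample_rows, tmp, date(2024, 1, 5))
    print(saved.name)
```

ISO dates in the filename sort correctly in any file listing, which makes it simple to stitch the snapshots into a time series later.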
Step 5 — Expect breakage, design for it
Sites redesign and scrapers die; this is normal, not failure. Make the script email or log loudly when it extracts zero rows instead of silently writing empty files. When it breaks: copy the new HTML structure from the inspector, paste it into a chat with your script, and ask for the updated selectors. Repair time with AI is often about five minutes.
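A sketch of the "fail loudly" guard, using only the standard library's logging module. The EmptyScrapeError and require_rows names are hypothetical; the idea is simply to check the row count before any file gets written.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("scraper")


class EmptyScrapeError(RuntimeError):
    """Raised when a run extracts fewer rows than expected."""


def require_rows(rows, minimum=1):
    """Return rows unchanged, or log an error and raise if there are too few."""
    if len(rows) < minimum:
        message = f"extracted {len(rows)} rows, expected at least {minimum}"
        log.error(message)
        raise EmptyScrapeError(message)
    log.info("extracted %d rows", len(rows))
    return rows


# Demonstration: an empty result is reported instead of being saved.
try:
    require_rows([])
except EmptyScrapeError as exc:
    print(f"scrape failed: {exc}")
```

In a scheduled run, let the exception propagate (or call sys.exit(1)) so the job ends with a non-zero status. Raising minimum toward the usual row count also catches partial breakage, such as pagination that quietly stops working.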
Keep going
Need somewhere to put it live? See where to host AI-built sites. Compare tool costs on the pricing tracker (or stick to the free options), then pick your next build.