Web scraping for automation – ethical data extraction
Learn when to scrape, how to do it ethically, and how to integrate scraped data into the automation systems you built on Days 1-15. No coding required – just no-code tools and responsible techniques.
🔗 Knowledge graph – Day 16 expands every system with external data
- Day 1 – Prompts to analyze scraped data
- Day 2 – Zapier Webhooks for scraped data
- Day 3 – Make HTTP scraper modules
- Day 4 – OpenAI to process scraped text
- Day 5 – Enrich leads with scraped data
- Day 6 – Competitor monitoring use cases
- Day 7 – 3 builds with scraping
- Day 8 – Qualify leads from scraped lists
- Day 9 – Personalize with scraped data
- Day 10 – Content from scraped trends
- Day 11 – Scrape KB for support
- Day 12 – Niche-specific scraping (listings, agencies)
- Day 13 – Scrape competitor workflows
- Day 14 – CRM enrichment via scraping
- Day 15 – API security for scraping endpoints
- Day 16 – Web scraping fundamentals
🕷️ What is web scraping?
📌 Extracting data from websites without an API
Web scraping is the automated process of collecting publicly available information from websites. When a site doesn't provide an API (or you need data not available via API), scraping is the alternative.
Examples: Competitor prices, real estate listings, job postings, news articles, social media profiles (public).
⚖️ Ethics & legality – scrape responsibly
What's ILLEGAL / UNETHICAL
- ❌ Scraping login-protected content (requires auth)
- ❌ Ignoring robots.txt (website's scraping rules)
- ❌ Scraping personal data without consent (GDPR violation)
- ❌ Overloading servers (DDoS-like behavior)
- ❌ Selling scraped data as your own
- ❌ Scraping copyrighted content for commercial use
What's LEGAL / ETHICAL
- ✅ Publicly available data
- ✅ Respecting robots.txt and rate limits
- ✅ Adding delays between requests (be polite)
- ✅ Identifying your bot (User-Agent string)
- ✅ Using data for personal/educational use
- ✅ Checking terms of service (some allow scraping)
Always check a site's robots.txt first (e.g., example.com/robots.txt) – it tells you what's allowed. Disobeying it can get your IP banned, or worse, lead to legal action.
🛠️ No-code scraping tools (integration with Days 2-3)
- Zapier Webhooks – GET requests to public APIs that return HTML
- Make HTTP module – GET HTML, parse with Text parser
- PhantomBuster – pre-built scrapers for social and sales
- Import.io – point-and-click scraping to Sheets
- Octoparse – visual scraper, export to API
- Browse AI – monitor websites for changes
- Airtable – store scraped data
- Google Sheets – IMPORTXML and IMPORTHTML functions
📊 Google Sheets built-in scraping (IMPORTXML, IMPORTHTML)
You already used Google Sheets in Day 2. Now use it to scrape!
IMPORTHTML – tables and lists
Use case: Scrape real estate listings, job boards, price tables.
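A minimal IMPORTHTML sketch – the URL is a placeholder, and the last argument picks which table (or list) on the page to import, counting from 1:

```
=IMPORTHTML("https://example.com/listings", "table", 1)
```

Paste it into any cell; the imported table spills into the cells below and to the right, and Sheets refreshes it periodically on its own.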
IMPORTXML – any data with XPath
Use case: Extract specific elements (titles, prices, reviews).
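A minimal IMPORTXML sketch – both the URL and the XPath are illustrative, so inspect the target page's HTML and adjust the path to match its real structure:

```
=IMPORTXML("https://example.com/product", "//h1/text()")
```

Here `//h1/text()` grabs the text of every H1 heading on the page; swapping in a path like `//span[@class='price']` would target prices instead, if the site uses that class.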
Combine with Day 2 automation
Zapier or Make can trigger when new data appears in these cells, then process it (e.g., Day 8 qualification, Day 10 content generation).
⚙️ Make.com HTTP + Text parser scraping
HTTP module – GET the page
Text parser – extract data with regex
<h1>(.*?)</h1> // Extracts all H1 content
Aggregator – handle multiple items
Use iterator + aggregator to process lists (e.g., all products on a page).
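For instance, a Text parser pattern that captures every dollar price in the raw HTML (the pattern is illustrative – adapt it to the site's actual markup):

```
\$([0-9][0-9,]*(?:\.[0-9]{2})?)
```

With "Global match" enabled, the parser outputs one bundle per match, which the iterator then feeds into the aggregator one product at a time.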
Apply Day 15 security
Add delays between requests, respect rate limits, use rotating user-agents.
🤖 Scraping + AI – extract meaning from chaos
Clean scraped text
Use OpenAI to remove HTML, format, and extract key info.
Summarize articles
Scrape news, use Day 1 prompts to summarize.
Extract entities
From scraped text, get names, dates, prices, locations.
Generate content
Scrape trends → Day 10 generates blog posts.
🔄 Apply scraping to every previous day
Day 5/8 – Lead qualifier
Scrape LinkedIn company pages to enrich leads (public data).
Day 9 – Sales assistant
Scrape competitor pricing to personalize follow-ups.
Day 10 – Content engine
Scrape trending topics to generate relevant content.
Day 11 – Support router
Scrape FAQ pages to build knowledge base.
Day 12 – Niche
Real estate: scrape new listings. Agencies: scrape job postings.
Day 14 – CRM
Enrich contacts with scraped company data.
8 hands-on practice exercises
📊 Exercise 1: Google Sheets scrape
Use IMPORTHTML to scrape a table from any public website. Save to sheet.
⚙️ Exercise 2: Make.com HTTP scrape
Use HTTP module to get a webpage. Use Text parser to extract all links.
🤖 Exercise 3: AI cleaning
Take scraped HTML, use OpenAI to extract clean text. Compare results.
🏠 Exercise 4: Real estate scrape
Scrape 5 property listings from a public real estate site. Extract price, address, bedrooms.
📈 Exercise 5: Competitor monitoring
Set up a weekly scraper that checks competitor prices and emails changes.
📝 Exercise 6: Content inspiration
Scrape 10 headlines from news sites. Use Day 10 to generate blog topics.
🔍 Exercise 7: robots.txt check
For 3 sites you want to scrape, check robots.txt. Document allowed/disallowed paths.
🔄 Exercise 8: Enrich CRM
Take 5 companies from your Day 14 CRM. Scrape their LinkedIn "About" page (public) and update notes.
🤖 Understanding robots.txt – your scraping rulebook
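robots.txt is a plain-text file at the site root. Here's an illustrative example (the paths and delay are made up – every site's file is different):

```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 10
```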
What it means:
- User-agent: * – Applies to all bots
- Disallow: Paths you cannot scrape
- Allow: Paths you can scrape (overrides disallow)
- Crawl-delay: Wait 10 seconds between requests (respect it!)
📄 Client proposal – Competitor monitoring service
📊 Competitor Price Monitoring – Service Overview
What I'll build:
- ✅ Automated scraper that checks competitor websites daily
- ✅ Extracts prices, new products, stock status
- ✅ Logs to Google Sheets with change history
- ✅ Sends weekly report with insights
- ✅ Alerts when competitors change prices on key items
Tech used: Make.com, Google Sheets, OpenAI for analysis
Ethical compliance: Respects robots.txt, adds delays, public data only
Investment: $1,800 setup + $300/mo
ROI: Price optimization can increase margins by 5-10%
📚 Resources
Day 16: You're now an ethical web scraping specialist
✔ Understand legality and ethics of scraping
✔ Can use Google Sheets and Make.com to scrape
✔ Combine scraping with AI for data enrichment
✔ Apply scraping to all previous days
✔ 8 hands-on practice exercises
✔ Client-ready competitor monitoring service