An automated web scraper that extracts tech events and internships from Meetup and GradConnection, stores them in a Supabase database, and sends AI-generated newsletter summaries via email using Playwright, AgentQL, and Google Gemini AI.
- Event Scraping: Extracts tech events from Meetup.com in the Canterbury, Australia area
- Internship Scraping: Extracts engineering/software internships from GradConnection Melbourne
- AI-Powered Extraction: Uses AgentQL for intelligent element detection
- Database Storage: Automatically stores scraped data in Supabase with auto-incrementing IDs
- Browser Automation: Powered by Playwright for reliable web scraping
- Email Newsletters: Generates AI-powered summaries using Google Gemini and sends via Gmail
- Smart Summarization: Converts scraped data into friendly newsletter format with links
- Python 3.13+
- AgentQL API key (sign up here)
- Supabase account and project
- Google Gemini API key (get it here)
- Gmail account with App Password enabled
- Clone the repository
git clone https://github.com/Aarav261/Cracking_networking.git
cd advanced_scraper
- Create and activate virtual environment
python -m venv advanced_scraper
source advanced_scraper/bin/activate # On macOS/Linux
# advanced_scraper\Scripts\activate # On Windows
- Install dependencies
pip install playwright agentql supabase python-dotenv google-generativeai
playwright install chromium
- Set up AgentQL
agentql init
- Configure environment variables
Create a .env file in the project root:
AGENT_API_KEY=your_agentql_api_key
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key
GMAIL_USER=your_email@gmail.com
GMAIL_PASSWORD=your_gmail_app_password
GEMINI_API=your_gemini_api_key
Setting up Gmail App Password:
- Enable 2-Factor Authentication on your Google account
- Go to Google Account > Security > 2-Step Verification > App passwords
- Generate a new app password for "Mail"
- Use this 16-character password in your .env file
- Set up Supabase Database
Create two tables in your Supabase project:
Events table:
CREATE TABLE "Events" (
id bigserial PRIMARY KEY,
title text,
link text,
date text,
description text,
location text
);
Internships table:
CREATE TABLE "Internships" (
id bigserial PRIMARY KEY,
name text,
link text,
company text,
location text
);
python main.py
This will:
- Extract tech event links from Meetup.com
- Visit each event page and extract detailed information
- Extract internship listings from GradConnection
- Store all data in Supabase database
- Generate an AI-powered newsletter summary using Google Gemini
- Send the newsletter to the configured email address
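The steps above can be sketched as a single function that receives each stage as a callable. This is only an illustration of the ordering, not the actual contents of main.py; `run_pipeline` and its parameter names are hypothetical.

```python
from typing import Callable, Iterable


def run_pipeline(
    scrape_events: Callable[[], Iterable[dict]],
    scrape_internships: Callable[[], Iterable[dict]],
    store_event: Callable[[dict], None],
    store_internship: Callable[[dict], None],
    summarize: Callable[[list, list], str],
    send: Callable[[str], None],
) -> str:
    """Run the stages in the order listed above and return the newsletter text.

    Hypothetical helper: every stage is passed in, so the real scrapers,
    database functions, and mailer can be swapped for fakes when testing.
    """
    events = list(scrape_events())            # Meetup links, then event details
    internships = list(scrape_internships())  # GradConnection listings
    for event in events:                      # store everything before summarizing
        store_event(event)
    for internship in internships:
        store_internship(internship)
    newsletter = summarize(events, internships)  # AI-generated summary
    send(newsletter)                             # email to the configured address
    return newsletter
```

Passing the stages in as arguments is a design choice for this sketch: it lets the ordering be checked without a browser, a database, or network access.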
Extract events:
from Event_scraper import extract_meetup_links, extract_meetup_events
url = "https://www.meetup.com/en-AU/find/?keywords=tech&location=au--Canterbury"
links = extract_meetup_links(url)
events = extract_meetup_events(links, Limit_links=5)
Extract internships:
from internship_scraper import extract_internship_links
url = "https://au.gradconnection.com/internships/engineering-software/melbourne/"
internships = extract_internship_links(url)
Store data in database:
from database import insert_event, insert_internship
insert_event({"title": "Tech Meetup", "link": "...", "date": "..."})
insert_internship({"name": "Software Intern", "company": "...", "location": "..."})
Send email:
import os  # needed for os.getenv below

from gmail import send_email

send_email(
    subject="Weekly Newsletter",
    body="Your newsletter content here",
    to_email="recipient@example.com",
    email=os.getenv("GMAIL_USER"),
    password=os.getenv("GMAIL_PASSWORD")
)
advanced_scraper/
├── main.py                 # Main orchestration script
├── Event_scraper.py        # Meetup event scraper with AgentQL
├── internship_scraper.py   # GradConnection internship scraper
├── database.py             # Supabase database operations
├── gmail.py                # Email sending functionality
├── lit_test.py             # Streamlit test file
├── .env                    # Environment variables (not in repo)
├── README.md               # This file
└── advanced_scraper/       # Virtual environment
In main.py, modify the LIMIT_LINKS variable:
LIMIT_LINKS = 2 # Change to desired number of items to scrape
Meetup events (main.py):
URL_OF_MEETUP_LISTING_PAGE = "https://www.meetup.com/en-AU/find/?keywords=tech&location=au--Canterbury&source=EVENTS&distance=tenMiles"
Internships (main.py):
URL_OF_INTERNSHIP_LISTING_PAGE = "https://au.gradconnection.com/internships/engineering-software/melbourne/"
Modify the system instruction in main.py:
config=types.GenerateContentConfig(
system_instruction="Your custom instructions here"
)
Uses AgentQL queries to extract event data from Meetup:
- extract_meetup_links(): Gets event URLs from listing page
- extract_meetup_events(): Extracts detailed info (title, date, description, location) from each event
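Links collected from a listing page are often relative, repeated, or more numerous than needed. The following is a minimal sketch of cleaning them up before visiting each event page. `normalize_links` is a hypothetical helper and is not part of Event_scraper.py; its `limit` argument only mirrors the idea behind `Limit_links`.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url, hrefs, limit=None):
    """Return absolute, de-duplicated links in first-seen order.

    Hypothetical helper: resolves relative hrefs against the listing page URL,
    drops "#fragment" suffixes so the same page is not visited twice, and
    stops once `limit` links have been collected.
    """
    seen = set()
    links = []
    for href in hrefs:
        if limit is not None and len(links) >= limit:
            break
        if not href:
            continue  # skip empty or missing href values
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute in seen:
            continue
        seen.add(absolute)
        links.append(absolute)
    return links
```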
Extracts internship listings with:
- extract_internship_links(): Scrapes internship name, link, company, and location
Manages Supabase operations:
- insert_event(): Inserts events with auto-incrementing IDs
- insert_internship(): Inserts internships with auto-incrementing IDs
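As a rough illustration of what an insert might look like with the supabase Python client, the sketch below keeps only the columns defined in the Events table and leaves `id` to the database. The names `build_event_row` and `insert_event_sketch` are hypothetical, and the real database.py may be organized differently.

```python
import os

# Columns from the Events table definition; `id` is omitted on purpose
# because bigserial assigns it automatically.
EVENT_COLUMNS = ("title", "link", "date", "description", "location")


def build_event_row(scraped: dict) -> dict:
    """Keep only known columns and drop missing values (hypothetical helper)."""
    return {col: scraped[col] for col in EVENT_COLUMNS if scraped.get(col) is not None}


def insert_event_sketch(scraped: dict):
    """Insert one event row; assumes SUPABASE_URL and SUPABASE_KEY are set."""
    # Imported here so build_event_row can be used without the package installed.
    from supabase import create_client

    client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
    # The table was created with a quoted name, so "Events" is case-sensitive.
    return client.table("Events").insert(build_event_row(scraped)).execute()
```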
Handles email delivery:
- send_email(): Sends plain text emails via Gmail SMTP
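A plain text email over Gmail SMTP can be sketched with the standard library alone. The version below assumes an SSL connection to smtp.gmail.com on port 465 and reuses the parameter names from the usage example; `build_message` and `send_email_sketch` are illustrative names, not the functions in gmail.py.

```python
import smtplib
from email.message import EmailMessage


def build_message(subject, body, to_email, from_email):
    """Build a plain text message (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = from_email
    msg["To"] = to_email
    msg.set_content(body)  # text/plain by default
    return msg


def send_email_sketch(subject, body, to_email, email, password):
    """Send the message through Gmail; `password` should be an App Password."""
    msg = build_message(subject, body, to_email, email)
    # Assumption: implicit TLS on port 465. STARTTLS on port 587 is the
    # other common way to reach Gmail's SMTP server.
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(email, password)
        server.send_message(msg)
```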
Orchestrates the entire workflow:
- Scrapes events and internships
- Stores data in Supabase
- Generates AI summary with Google Gemini
- Emails newsletter to recipient
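The summary step can be sketched in two parts: turning the scraped rows into a text prompt, then passing that prompt to Gemini. This assumes the google-genai SDK (the one that provides `types.GenerateContentConfig`), and the model name, the system instruction text, and both function names are placeholders rather than what main.py actually uses.

```python
import os


def build_prompt(events, internships):
    """Flatten scraped rows into a text prompt (hypothetical helper)."""
    lines = ["Events:"]
    for event in events:
        lines.append(
            f"- {event.get('title', 'Untitled')} "
            f"({event.get('date', 'date TBC')}): {event.get('link', '')}"
        )
    lines.append("Internships:")
    for job in internships:
        lines.append(
            f"- {job.get('name', 'Unnamed')} at "
            f"{job.get('company', 'unknown company')}: {job.get('link', '')}"
        )
    return "\n".join(lines)


def summarize_sketch(events, internships, model="gemini-2.0-flash"):
    """Ask Gemini for a newsletter; assumes GEMINI_API is set in the environment."""
    # Imported here so build_prompt works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API"])
    response = client.models.generate_content(
        model=model,  # placeholder model name
        contents=build_prompt(events, internships),
        config=types.GenerateContentConfig(
            system_instruction="Write a friendly newsletter and keep every link.",
        ),
    )
    return response.text
```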
- Playwright: Browser automation and web scraping
- AgentQL: AI-powered web element detection and querying
- Supabase: PostgreSQL database and backend services
- Google Gemini AI: AI-powered newsletter generation
- Python-dotenv: Environment variable management
- smtplib: Email sending via Gmail
Import errors:
rm -rf __pycache__
rm -rf .pytest_cache
AgentQL API errors:
agentql init
Playwright browser errors:
playwright install --with-deps chromium
Gmail authentication errors:
- Ensure 2-Factor Authentication is enabled
- Use App Password, not your regular Gmail password
- Check that "Less secure app access" is not blocking the connection
Gemini API errors:
- Verify your API key is valid and has credits
- Check the API quota limits
Database insertion errors:
- Verify Supabase URL and key are correct
- Check that tables exist with correct schema
- Ensure network connectivity to Supabase
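Several of the errors above can come down to a missing or empty value in .env. A small check like the following, run before the scraper starts, may help narrow things down. `missing_env_vars` is a hypothetical helper; the variable names are the ones listed in the .env example.

```python
import os

REQUIRED_VARS = (
    "AGENT_API_KEY",
    "SUPABASE_URL",
    "SUPABASE_KEY",
    "GMAIL_USER",
    "GMAIL_PASSWORD",
    "GEMINI_API",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

One way to use it is to call `load_dotenv()` from python-dotenv first, then stop with a clear message if `missing_env_vars()` returns anything.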
- Add HTML email templates
- Support multiple newsletter recipients
- Add scheduling/cron job support
- Implement error handling and retry logic
- Add unit tests
- Create web dashboard with Streamlit
- Add more event and job board sources
This project is part of the Cracking_networking repository.
Feel free to submit issues or pull requests to improve the scraper functionality.
Built with ❤️ for automating networking opportunities