Full Guide to AI Web Scraping With DeepSeek
Traditional scraping tools break easily due to advanced anti-bot measures, complex JavaScript-driven structures, and layouts that change every day.
But now, AI models like DeepSeek make scraping smarter, more flexible, and easier to do on a large scale.
Instead of using fragile CSS selectors, you can use LLMs (Large Language Models) like DeepSeek and tell them what data you want in simple English.
You don’t need to figure out how the website’s code is built or write complicated rules to get the information.
This guide shows you how to build a scraping setup that uses:
- DeepSeek to cleverly pull out the data you need.
- Crawl4AI or your own custom crawler to gather web pages.
- Scrape.do to get around website blocks and deal with JavaScript.
Setting Up the Environment
To get started, you’ll need a few tools and API keys.
Make sure you have the following for running the code in this guide:
- A DeepSeek API key: If you plan to host the model yourself, you’ll use your local DeepSeek model instead.
- A Scrape.do API key.
- A Crawl4AI key, only if you plan to use a hosted Crawl4AI service for headless crawling (the open-source library itself doesn't require one).
Install Required Python Packages
To handle tasks like crawling websites, defining data structures, managing secret keys, and communicating with online services, you will need to install a few essential Python libraries.
Run this in your project environment:
pip install crawl4ai pydantic deepseek requests python-dotenv
You have two main options for running DeepSeek:
- Locally on your own machine using frameworks like vLLM, LM Studio, or Hugging Face Transformers.
- Or via community-hosted endpoints such as Hugging Face Spaces or Groq demos.
These are public or demo APIs created by third parties, not officially maintained by DeepSeek.
If you choose to run DeepSeek on your own machine, you will need to download the model files.
You can do this by cloning the repository from GitHub, or by setting it up through a visual interface like LM Studio, which simplifies local model management and supports models like DeepSeek, Mistral, and LLaMA.
Additionally, you can use the lmstudio-python SDK to integrate locally running models directly into your Python scripts.
To clone DeepSeek:
# clone the DeepSeek repository
git clone https://github.com/deepseek-ai/DeepSeek-LLM.git
# for local model management
pip install lmstudio
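As a quick illustration of the lmstudio-python SDK mentioned above, here is a minimal sketch of calling a locally loaded model. The model identifier below is a placeholder; substitute the exact name of the DeepSeek build you downloaded in LM Studio.
import lmstudio as lms

# Connect to the local LM Studio server and load a model.
# "deepseek-r1-distill-qwen-7b" is a placeholder identifier; use the exact
# name of the DeepSeek model you downloaded in LM Studio.
model = lms.llm("deepseek-r1-distill-qwen-7b")

# Send a simple prompt and print the response text.
result = model.respond("Extract the product name from: 'Gaming Chair XYZ - $299.99'")
print(result)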
Setting API Keys
Your scripts will need API keys to access services like DeepSeek, Scrape.do, and potentially Crawl4AI. You can set these as environment variables in your terminal.
export DEEPSEEK_API_KEY="your_deepseek_key"
export SCRAPEDO_API_KEY="your_scrapedo_key"
export CRAWL4AI_API_KEY="your_crawl4ai_key" # Usually unnecessary; the open-source Crawl4AI library does not require an API key
As an alternative, you can create a file named .env in your project's main folder. List your keys in this file, and then use the python-dotenv library in your script to load them.
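For example, a minimal .env file might look like this:
DEEPSEEK_API_KEY=your_deepseek_key
SCRAPEDO_API_KEY=your_scrapedo_key
And you can load it at the top of your script:
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory
DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY")
SCRAPEDO_API_KEY = os.getenv("SCRAPEDO_API_KEY")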
Crawling: Collecting Your Target Pages
Before DeepSeek can work its magic, you first need to gather the web pages you want to extract data from.
This is the crawling phase. There are a couple of strong options available, depending on what you need; in this guide we use Crawl4AI for the job.
Using Crawl4AI
Crawl4AI uses a headless Chromium browser to fetch and render pages.
What It Offers
- It has built-in browser control using Chromium.
- It captures the complete page structure (DOM) for websites that change content dynamically.
- It works with DeepSeek, GPT-4, Claude, and other LLMs.
- It can extract data based on a predefined structure using Pydantic.
- It can work with Markdown, HTML, or plain text.
This method is well suited to modern applications with dynamic tables, pop-up windows, or pages that load more content as you scroll.
We can continue with the code below.
But before running it, make sure to run the crawl4ai-setup command. This will:
- Install or update the required Playwright browsers (such as Chromium or Firefox)
- Perform OS-level checks (like detecting missing libraries on Linux)
- Ensure your environment is fully prepared for headless crawling
Optionally, you can run the crawl4ai-doctor command to perform a full diagnostic and verify that everything is working correctly.
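For reference, both commands are run from the terminal:
crawl4ai-setup    # installs/updates Playwright browsers and checks OS dependencies
crawl4ai-doctor   # optional: full diagnostic of your crawling environment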
import asyncio
import os
import json
import re
from typing import List
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Load the API key from the environment (set DEEPSEEK_API_KEY before running)
DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY")
if not DEEPSEEK_API_KEY:
print("Warning: DEEPSEEK_API_KEY not found. LLM extraction will be skipped.")
# Define the schema for individual products
class ProductItem(BaseModel):
title: str
price: str
rating: str
link: str
# Define the schema for the complete response (list of products)
class ProductList(BaseModel):
products: List[ProductItem]
# Improved extraction prompt
extraction_prompt = """
You are an AI that extracts product listings from an Amazon search results page.
Analyze the content and extract all visible product listings.
For each product, extract:
- title: The product name/title
- price: The price (include currency symbol, use "N/A" if not found)
- rating: Star rating (use "N/A" if not found)
- link: The product URL (make it a complete URL starting with https://www.amazon.com)
Return the results as a JSON object with a "products" key containing an array of product objects.
Only include products that have at least a title and are clearly product listings.
Example format:
{
"products": [
{
"title": "Gaming Chair XYZ",
"price": "$299.99",
"rating": "4.5 stars",
"link": "https://www.amazon.com/dp/B123456789"
}
]
}
"""
async def scrape_amazon_with_llm(search_query="gaming+chairs"):
"""
Scrape Amazon product listings using LLM extraction.
Assumes DEEPSEEK_API_KEY is available (checked by the caller).
"""
url = f"https://www.amazon.com/s?k={search_query}"
try:
llm_config = LLMConfig(
            provider="deepseek/deepseek-chat",  # DeepSeek chat model, matching the DeepSeek API key below
api_token=DEEPSEEK_API_KEY,
)
extraction_strategy = LLMExtractionStrategy(
llm_config=llm_config,
schema=ProductList.model_json_schema(),
extraction_type="schema",
instruction=extraction_prompt,
chunk_token_threshold=2000,
overlap_rate=0.1,
apply_chunking=True,
input_format="markdown",
extra_args={"temperature": 0.1, "max_tokens": 2000}
)
crawl_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
cache_mode=CacheMode.BYPASS
)
browser_config = BrowserConfig(
headless=True,
browser_type="chromium"
)
print(f"Scraping Amazon with LLM extraction for '{search_query}'...")
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url=url,
config=crawl_config,
js_code=[
"await new Promise(resolve => setTimeout(resolve, 3000));",
"window.scrollTo(0, document.body.scrollHeight/2);",
"await new Promise(resolve => setTimeout(resolve, 2000));"
],
wait_for="css:[data-component-type='s-search-result']",
page_timeout=30000
)
if result.success and result.extracted_content:
try:
extracted_data = json.loads(result.extracted_content)
if 'products' in extracted_data:
products = extracted_data['products']
print(f"Successfully extracted {len(products)} products using LLM.")
extraction_strategy.show_usage()
return products
else:
print("No products found in LLM extracted data.")
return None
except json.JSONDecodeError as e:
print(f"Failed to parse LLM extracted JSON: {e}")
print("Raw extracted content (first 500 chars):", str(result.extracted_content)[:500])
return None
else:
print("LLM extraction failed to retrieve or extract content.")
if result.error_message:
print(f"Error during LLM crawl: {result.error_message}")
return None
except Exception as e:
print(f"An error occurred during LLM extraction: {str(e)}")
return None
def extract_products_from_markdown(markdown_content):
products = []
lines = markdown_content.split('\n')
current_product = {}
for line in lines:
line = line.strip()
if line.startswith('## ') or line.startswith('### '):
if current_product.get('title'):
products.append(current_product)
current_product = {}
current_product['title'] = line.replace('## ', '').replace('### ', '').strip()
elif '$' in line and 'price' not in current_product:
price_match = re.search(r'\$[\d,]+\.?\d*', line)
if price_match:
current_product['price'] = price_match.group()
elif ('star' in line.lower() or 'rating' in line.lower()) and 'rating' not in current_product:
current_product['rating'] = line.strip()
elif (line.startswith('http') or 'amazon.com' in line) and 'link' not in current_product:
            if 'amazon.com/dp/' in line or re.search(r'amazon\.com/.+/dp/', line) or '/gp/product/' in line:
url_match = re.search(r'(https://www\.amazon\.com[^ ]+)', line)
if url_match:
current_product['link'] = url_match.group(1)
if current_product.get('title'):
products.append(current_product)
for product in products:
product.setdefault('title', 'N/A')
product.setdefault('price', 'N/A')
product.setdefault('rating', 'N/A')
product.setdefault('link', 'N/A')
return products
async def scrape_amazon_simple(search_query="gaming+chairs"):
url = f"https://www.amazon.com/s?k={search_query}"
try:
browser_config = BrowserConfig(headless=True, browser_type="chromium")
print(f"Scraping Amazon (simple extraction) for '{search_query}'...")
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url=url,
js_code=[
"await new Promise(resolve => setTimeout(resolve, 3000));",
"window.scrollTo(0, document.body.scrollHeight/2);",
"await new Promise(resolve => setTimeout(resolve, 2000));"
],
wait_for="css:[data-component-type='s-search-result']",
page_timeout=30000
)
if result.success and result.markdown:
print("Successfully retrieved content for simple extraction.")
products = extract_products_from_markdown(result.markdown)
print(f"Successfully extracted {len(products)} products using simple parsing.")
return products
else:
print("Failed to retrieve content for simple extraction.")
if result.error_message:
print(f"Error during simple crawl: {result.error_message}")
return None
except Exception as e:
print(f"An error occurred during simple extraction: {str(e)}")
return None
def display_products(products, method_name=""):
if not products:
print(f"No products to display from {method_name} method.")
return
print(f"\n{'='*60}")
print(f"FOUND {len(products)} PRODUCTS (using {method_name})")
print(f"{'='*60}")
for i, product_data in enumerate(products, 1):
product = product_data.model_dump() if isinstance(product_data, BaseModel) else product_data
print(f"\n{i}. {product.get('title', 'N/A')}")
print(f" Price: {product.get('price', 'N/A')}")
print(f" Rating: {product.get('rating', 'N/A')}")
print(f" Link: {product.get('link', 'N/A')}")
print("-" * 50)
# --- Function to save individual product Markdown files ---
def save_products_as_markdown_files(products, base_filename_prefix="product"):
"""
Save each product as an individual Markdown (.md) file in 'crawled_pages'.
"""
if not products:
return
output_dir = "crawled_pages"
os.makedirs(output_dir, exist_ok=True)
print(f"\nSaving individual product Markdown files to '{output_dir}/'...")
count_saved = 0
for i, product_data in enumerate(products, 1):
product = product_data.model_dump() if isinstance(product_data, BaseModel) else product_data
title = product.get('title', f'product_{i:02d}').replace('/', '_').replace('\\', '_') # Sanitize title for filename
# Further sanitize by removing other problematic characters for filenames
sanitized_title_segment = re.sub(r'[^\w\s-]', '', title.lower().replace(' ', '_'))[:50] # Limit length
md_content = f"""## {product.get('title', 'N/A')}
**Price**: {product.get('price', 'N/A')}
**Rating**: {product.get('rating', 'N/A')}
**Link**: [{product.get('link', 'N/A')}]({product.get('link', 'N/A')})
---
*Extracted from: Amazon Search*
*Search Query used in script: [Not directly available here, defined in main]*
*Source URL: {product.get('link', 'N/A')}*
"""
# Use a combination of base prefix and a sanitized title segment or index for filename
filename = f"{output_dir}/{base_filename_prefix}_{i:02d}_{sanitized_title_segment}.md"
try:
with open(filename, "w", encoding="utf-8") as f:
f.write(md_content)
count_saved +=1
except Exception as e:
print(f"Could not save file {filename}: {e}")
print(f"Saved {count_saved} Markdown files to {output_dir}/")
# --- End of save_products_as_markdown_files ---
async def main():
search_query = "gaming+chairs"
print("=== Amazon Product Scraper ===\n")
products = None
extraction_method_used = ""
if DEEPSEEK_API_KEY:
print("Attempting LLM extraction...")
products = await scrape_amazon_with_llm(search_query)
extraction_method_used = "LLM"
if not products:
if DEEPSEEK_API_KEY and extraction_method_used == "LLM":
print("\nLLM extraction did not yield results, falling back to simple extraction...\n")
else:
print("API key not available or LLM skipped, proceeding with simple extraction...\n")
products = await scrape_amazon_simple(search_query)
extraction_method_used = "Simple"
if products:
display_products(products, extraction_method_used)
# Define base filename for outputs
base_output_filename = f"amazon_{search_query.replace('+', '_')}_{extraction_method_used.lower()}_products"
# Save main JSON file
json_filename = f"{base_output_filename}.json"
products_to_save = [p.model_dump() if isinstance(p, BaseModel) else p for p in products]
with open(json_filename, "w", encoding="utf-8") as f:
json.dump(products_to_save, f, indent=2, ensure_ascii=False)
print(f"\nResults saved to {json_filename}")
# Save individual Markdown files
save_products_as_markdown_files(products_to_save, base_filename_prefix=base_output_filename)
else:
print("\nBoth LLM and simple extraction methods failed to find products.")
if __name__ == "__main__":
asyncio.run(main())
The output will be structured JSON and Markdown files located in the project directory.
Parsing with DeepSeek
After you have crawled the content, the next part is to pull out organized data using an LLM.
DeepSeek can take the raw content from a page, especially if it is in Markdown format, and return the specific pieces of information you want. You do not need to do any manual parsing or write regular expressions.
Here is an example of how to ask DeepSeek to get product names and prices:
import os
from deepseek import DeepSeek # Assuming this is your DeepSeek library/SDK
# --- Configuration ---
# IMPORTANT: Replace "your_deepseek_key" with your actual DeepSeek API key.
# Consider using environment variables for API keys in production.
DEEPSEEK_API_KEY = "your_deepseek_key"
DEEPSEEK_MODEL = "deepseek-chat" # Or your preferred DeepSeek model
TEMPERATURE = 0.3 # Adjust for desired creativity/factuality (0.0-1.0)
MARKDOWN_DIR = "crawled_pages"
SYSTEM_PROMPT = "You are an expert AI assistant. Your task is to accurately extract specific information from the provided text, which is the content of a product markdown page."
def extract_product_info(client, file_content, filename):
"""
Uses DeepSeek to extract product name and price from the file content.
"""
# You can customize this prompt further based on the exact structure of your .md files
# For example, if you know the title is always a H2 heading (##) and price is bolded.
user_prompt = f"""
From the following content of the file '{filename}', please extract:
1. The product name (usually the main title or heading).
2. The product price (including the currency symbol).
Return the information in a structured format, like:
Product Name: [The extracted product name]
Price: [The extracted price]
If information is missing, indicate "N/A".
Markdown Content:
---
{file_content}
---
"""
try:
response = client.chat.completions.create(
model=DEEPSEEK_MODEL,
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_prompt}
],
temperature=TEMPERATURE
)
return response.choices[0].message.content
except Exception as e:
print(f"Error processing file {filename} with DeepSeek: {e}")
return None
if __name__ == "__main__":
if DEEPSEEK_API_KEY == "your_deepseek_key":
print("Please replace 'your_deepseek_key' with your actual DeepSeek API key in the script.")
exit()
if not os.path.isdir(MARKDOWN_DIR):
print(f"Error: Directory '{MARKDOWN_DIR}' not found. Please ensure the crawled pages are in this directory.")
exit()
# Initialize the DeepSeek client
# Based on your snippet, the DeepSeek class might take model and temperature
# in its constructor. Adjust if your DeepSeek client initialization differs.
# If these are set in the constructor and used as defaults, you might not
# need to pass them in `completions.create()` unless you want to override.
deepseek_client = DeepSeek(
api_key=DEEPSEEK_API_KEY
# model=DEEPSEEK_MODEL, # Potentially set here if your SDK supports it as default
# temperature=TEMPERATURE # Potentially set here
)
print(f"Processing Markdown files from '{MARKDOWN_DIR}'...\n")
markdown_files_processed = 0
for filename in os.listdir(MARKDOWN_DIR):
if filename.endswith(".md"):
file_path = os.path.join(MARKDOWN_DIR, filename)
print(f"--- Reading file: {file_path} ---")
try:
with open(file_path, "r", encoding="utf-8") as file:
content = file.read()
if content.strip(): # Ensure content is not empty
extracted_info = extract_product_info(deepseek_client, content, filename)
if extracted_info:
print(f"Extracted Information for {filename}:\n{extracted_info}\n")
else:
print(f"Could not extract information for {filename}.\n")
markdown_files_processed += 1
else:
print(f"File {filename} is empty. Skipping.\n")
except Exception as e:
print(f"Error reading or processing file {file_path}: {e}\n")
if markdown_files_processed == 0:
print(f"No Markdown files were found or processed in '{MARKDOWN_DIR}'.")
else:
print(f"Finished processing {markdown_files_processed} Markdown file(s).")
This script loops through all Markdown files in the crawled_pages/ directory and extracts structured product data using DeepSeek.
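Because the prompt asks DeepSeek to reply in a fixed "Product Name: … / Price: …" layout, you can optionally convert each reply into a dictionary with a small helper like this sketch (an assumption-based parser, not part of the script above):
import re

def parse_reply(reply_text):
    """Turn the 'Product Name: ... / Price: ...' reply format into a dictionary."""
    name_match = re.search(r"Product Name:\s*(.+)", reply_text)
    price_match = re.search(r"Price:\s*(.+)", reply_text)
    return {
        "name": name_match.group(1).strip() if name_match else "N/A",
        "price": price_match.group(1).strip() if price_match else "N/A",
    }

# Example:
# parse_reply("Product Name: Gaming Chair XYZ\nPrice: $299.99")
# -> {"name": "Gaming Chair XYZ", "price": "$299.99"}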
Export and Automate
Once you have successfully crawled a website and extracted data, scaling the process up is straightforward. You can save the structured information, schedule regular crawls, and connect the data to your own systems or databases.
Export
The information you get from DeepSeek is already organized, so you can save it in any format you prefer.
Here is how to save the extracted items into a .json file:
import json
with open("output.json", "w") as f:
json.dump(data, f, indent=2)
You can also save the data into a CSV file:
import csv
with open("output.csv", "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=data[0].keys())
writer.writeheader()
writer.writerows(data)
If you want to send the results to a database like MongoDB or Postgres, you can use common Python tools such as pymongo or psycopg2, as sketched below.
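For instance, a minimal sketch of writing the extracted list of dictionaries into MongoDB with pymongo might look like this (the connection string, database, and collection names are placeholders for your own setup):
from pymongo import MongoClient

# 'data' is the list of product dictionaries produced earlier; a placeholder example:
data = [{"title": "Gaming Chair XYZ", "price": "$299.99", "rating": "4.5 stars", "link": "N/A"}]

# Connect to a local MongoDB instance; adjust the URI for your deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["scraping"]["products"]  # placeholder database/collection names

if data:
    collection.insert_many(data)
    print(f"Inserted {len(data)} documents.")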
Setup Scheduled Tasks
You can make this scraper run automatically on a schedule using different tools. Some common ones are:
- cron, which is used on Linux and macOS systems.
- Task Scheduler, which is available on Windows.
- APScheduler, a library you can use in your Python code.
- GitHub Actions or cloud functions, which are good for serverless automation.
For example, here is how you could run the script every day at 9 AM using cron:
0 9 * * * /usr/bin/python3 /home/user/deepseek_scraper/run_crawler.py >> /home/user/deepseek_scraper/logs/crawler.log 2>&1
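If you prefer to keep scheduling inside Python, roughly the same daily 9 AM job can be expressed with APScheduler. This sketch assumes your crawl logic is wrapped in a run_crawler() function:
from apscheduler.schedulers.blocking import BlockingScheduler

def run_crawler():
    # Placeholder: call your crawling/extraction entry point here,
    # e.g. asyncio.run(main()) from the script above.
    pass

scheduler = BlockingScheduler()
scheduler.add_job(run_crawler, "cron", hour=9, minute=0)  # every day at 09:00
scheduler.start()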
Local DeepSeek Web Scraping Setup
If you prefer to run DeepSeek on your own computer, you can set it up with a local inference stack such as vLLM, LM Studio, or Hugging Face Transformers. This gives you a complete LLM scraping system that does not need to connect to any outside services.
Here are the general steps:
- Get a model like DeepSeek LLM or DeepSeek Coder by cloning its repository.
- Create an inference server using tools such as Hugging Face Transformers or vLLM.
- For quick responses, run the model on a GPU.
- Point your scraper at the local model's URL instead of the hosted service's endpoint (see the sketch after the code below).
from deepseek import DeepSeek
deepseek = DeepSeek(
api_key="your_deepseek_key",
model="deepseek-chat",
temperature=0.5
)
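The snippet above uses the hosted configuration. If you serve the model locally behind an OpenAI-compatible server (both vLLM and LM Studio expose one), you can point any OpenAI-style client at that URL instead; a minimal sketch, assuming a local server on port 1234 and a model name reported by that server:
from openai import OpenAI

# Point the client at the local server (LM Studio defaults to port 1234,
# vLLM's OpenAI-compatible server defaults to port 8000; adjust as needed).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-chat",  # placeholder: use the model name your local server lists
    messages=[{"role": "user", "content": "Extract the price from: 'Gaming Chair XYZ - $299.99'"}],
    temperature=0.3,
)
print(response.choices[0].message.content)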
There are several reasons why you might want to host DeepSeek yourself:
- You will not have any limits on API calls.
- It can be cheaper if you use it a lot.
- You have full control over your data, which improves privacy.
- With capable hardware, such as a good GPU, you can get faster results.
- It works even if you are offline or behind a firewall.
- You will not experience delays caused by remote APIs.
This is particularly important if you are scraping many thousands of pages or creating datasets for your own company’s use.
Conclusion
Scraping information from the web used to be a difficult task.
It involved a lot of regular expressions, selectors that broke easily, and headaches from anti-bot measures. Thanks to AI and Scrape.do, that is no longer the case.
By using these tools together:
- Crawl4AI or a custom crawler to find links.
- Scrape.do to get past blocks and render pages.
- DeepSeek to intelligently pull out content.
- Local inference (or a fast hosted provider such as Groq) for speed.
You now have a strong, scalable, and smart scraping system that can adjust to any website.