How to Build an Advanced BrightData Web Scraper with Google Gemini for AI-Powered Data Extraction

In this tutorial, we walk you through building an enhanced web scraping tool that pairs BrightData's powerful proxy network with Google's Gemini API for intelligent data extraction. You will learn how to structure your Python project, install and import the required libraries, and encapsulate the scraping logic inside a clean, reusable BrightDataScraper class. Whether you are targeting Amazon product pages, bestseller lists, or LinkedIn profiles, the scraper's modular methods show how to configure scraping parameters, handle errors gracefully, and return structured JSON results. An optional ReAct-style AI agent integration also shows you how to combine LLM reasoning with live scraping, letting you issue natural-language queries that analyze data on the fly.
!pip install langchain-brightdata langchain-google-genai langgraph langchain-core google-generativeai
We install all the core libraries needed for this tutorial in one step: langchain-brightdata for BrightData web scraping, langchain-google-genai and google-generativeai for Google's Gemini, langgraph for agent orchestration, and langchain-core for LangChain's core abstractions.
import os
import json
from typing import Dict, Any, Optional
from langchain_brightdata import BrightDataWebScraperAPI
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent
These imports set up your environment and core functionality: os and json handle system operations and data serialization, while typing provides structured type hints. You then bring in BrightDataWebScraperAPI for BrightData scraping, ChatGoogleGenerativeAI to interact with Google's Gemini LLM, and create_react_agent to orchestrate these components into a ReAct-style agent.
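Since os is already imported, here is a minimal sketch of loading the two API keys from environment variables instead of hardcoding them later on; the variable names BRIGHT_DATA_API_KEY and GOOGLE_API_KEY are our own convention, not something the libraries require.

# A minimal sketch: read credentials from the environment rather than hardcoding them.
# The variable names below are assumptions, not required by BrightData or Gemini.
bright_data_key = os.environ.get("BRIGHT_DATA_API_KEY")
google_key = os.environ.get("GOOGLE_API_KEY")

if not bright_data_key:
    raise RuntimeError("Set BRIGHT_DATA_API_KEY before running the scraper")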
class BrightDataScraper:
    """Enhanced web scraper using BrightData API"""

    def __init__(self, api_key: str, google_api_key: Optional[str] = None):
        """Initialize scraper with API keys"""
        self.api_key = api_key
        self.scraper = BrightDataWebScraperAPI(bright_data_api_key=api_key)
        if google_api_key:
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-2.0-flash",
                google_api_key=google_api_key
            )
            self.agent = create_react_agent(self.llm, [self.scraper])

    def scrape_amazon_product(self, url: str, zipcode: str = "10001") -> Dict[str, Any]:
        """Scrape Amazon product data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product",
                "zipcode": zipcode
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_amazon_bestsellers(self, region: str = "in") -> Dict[str, Any]:
        """Scrape Amazon bestsellers"""
        try:
            url = f"https://www.amazon.{region}/gp/bestsellers/"
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_linkedin_profile(self, url: str) -> Dict[str, Any]:
        """Scrape LinkedIn profile data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "linkedin_person_profile"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def run_agent_query(self, query: str) -> None:
        """Run AI agent with natural language query"""
        if not hasattr(self, 'agent'):
            print("Error: Google API key required for agent functionality")
            return
        try:
            for step in self.agent.stream(
                {"messages": query},
                stream_mode="values"
            ):
                step["messages"][-1].pretty_print()
        except Exception as e:
            print(f"Agent error: {e}")

    def print_results(self, results: Dict[str, Any], title: str = "Results") -> None:
        """Pretty print results"""
        print(f"\n{'='*50}")
        print(f"{title}")
        print(f"{'='*50}")
        if results["success"]:
            print(json.dumps(results["data"], indent=2, ensure_ascii=False))
        else:
            print(f"Error: {results['error']}")
        print()
The BrightDataScraper class wraps all of the BrightData scraping logic, plus optional Gemini-powered intelligence, behind a single reusable interface. Its methods let you fetch Amazon product details, bestseller lists, and LinkedIn profiles with the API calls, error handling, and JSON formatting taken care of, and they support natural-language "agent" queries when a Google API key is provided. The print_results helper ensures your output is always neatly formatted for inspection.
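For a quick sanity check before wiring everything into main(), here is a minimal usage sketch in scrape-only mode, assuming you substitute a valid BrightData key; the product URL is simply the one reused later in the walkthrough.

# A minimal usage sketch: no Google key, so the ReAct agent is never created.
scraper = BrightDataScraper(api_key="Use Your Own API Key")
result = scraper.scrape_amazon_product("https://www.amazon.com/dp/B08L5TNJHG", zipcode="10001")

if result["success"]:
    scraper.print_results(result, "Quick Product Check")
else:
    print("Scrape failed:", result["error"])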
def main():
    """Main execution function"""
    BRIGHT_DATA_API_KEY = "Use Your Own API Key"
    GOOGLE_API_KEY = "Use Your Own API Key"

    scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY)

    print("🛍️ Scraping Amazon India Bestsellers...")
    bestsellers = scraper.scrape_amazon_bestsellers("in")
    scraper.print_results(bestsellers, "Amazon India Bestsellers")

    print("📦 Scraping Amazon Product...")
    product_url = "https://www.amazon.com/dp/B08L5TNJHG"
    product_data = scraper.scrape_amazon_product(product_url, "10001")
    scraper.print_results(product_data, "Amazon Product Data")

    print("👤 Scraping LinkedIn Profile...")
    linkedin_url = "https://www.linkedin.com/in/satyanadella/"
    linkedin_data = scraper.scrape_linkedin_profile(linkedin_url)
    scraper.print_results(linkedin_data, "LinkedIn Profile Data")

    print("🤖 Running AI Agent Query...")
    agent_query = """
    Scrape Amazon product data for https://www.amazon.com/dp/B0D2Q9397Y?th=1
    in New York (zipcode 10001) and summarize the key product details.
    """
    scraper.run_agent_query(agent_query)
The main() function ties everything together by setting the BrightData and Google API keys, instantiating BrightDataScraper, and then exercising each feature: it scrapes the Amazon India bestsellers list, fetches the details of a specific product, retrieves a LinkedIn profile, and finally runs a natural-language agent query, printing neatly formatted results after each step.
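Because every scraper method returns a plain dict with a success flag, persisting results for later analysis is straightforward. Here is a small sketch that reuses the json and typing imports from above; the output filename is arbitrary and only successful results are written.

# A small sketch of saving a successful result to disk for later analysis.
def save_result(result: Dict[str, Any], path: str = "scrape_output.json") -> bool:
    """Write result["data"] to a JSON file; skip results that reported an error."""
    if not result.get("success"):
        return False
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result["data"], f, indent=2, ensure_ascii=False)
    return True

# Example: save_result(scraper.scrape_amazon_bestsellers("in"))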
if __name__ == "__main__":
    print("Installing required packages...")
    os.system("pip install -q langchain-brightdata langchain-google-genai langgraph")
    os.environ["BRIGHT_DATA_API_KEY"] = "Use Your Own API Key"
    main()
Finally, this entry-point block ensures that, when the file is run as a standalone script, the required scraping libraries are installed quietly and the API key is set in the environment, after which the main function executes the full scraping and agent workflow.
In conclusion, by the end of this tutorial you will have a ready-to-use Python script that automates tedious data collection tasks, abstracts away low-level API details, and optionally taps AI to handle advanced queries. You can extend this foundation by adding support for other dataset types, integrating additional LLMs, or deploying the scraper as part of a larger data pipeline or web service. With these building blocks in place, you are now equipped to gather, analyze, and present web data more efficiently, whether for market research, competitive intelligence, or custom AI-powered applications.
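As one example of adding support for other datasets, here is a hedged sketch of a generic helper you could paste into BrightDataScraper; it simply reuses the same invoke and error-handling pattern as the existing methods, and which dataset_type values BrightData actually accepts should be confirmed against its documentation.

# A sketch of a generic method for BrightDataScraper, reusing the shared invoke/error pattern.
# Valid dataset_type values are an assumption the caller must verify against BrightData's docs.
def scrape_dataset(self, url: str, dataset_type: str, **extra: Any) -> Dict[str, Any]:
    """Scrape an arbitrary BrightData dataset and return the usual success/error dict."""
    try:
        payload = {"url": url, "dataset_type": dataset_type, **extra}
        results = self.scraper.invoke(payload)
        return {"success": True, "data": results}
    except Exception as e:
        return {"success": False, "error": str(e)}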
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
