Python Scrapy

## Python Scrapy Library

Scrapy is a powerful Python web crawling framework designed for scraping web pages and extracting information. It is often used for data mining, information processing, and archiving historical data.

Scrapy comes with many built-in features, including request handling, status tracking, error management, and request rate limiting, which makes it well suited to efficient, large-scale scraping tasks.

Unlike simpler crawling libraries, Scrapy is a full-featured crawling framework that offers high scalability and flexibility, suitable for complex and large-scale web scraping projects.

Scrapy official website: [https://scrapy.org/](https://scrapy.org/).

[Figure: Scrapy architecture diagram (green lines indicate data flow)]

Scrapy operates based on several core components:

* **Spider**: The crawler class, which defines how to extract data from web pages and how to follow links.
* **Item**: Defines and stores the scraped data. Equivalent to a data model.
* **Pipeline**: Processes the scraped data, commonly for cleaning, storing, or other post-processing.
* **Middleware**: Handles requests and responses, allowing you to set proxies, manage cookies, configure user agents, and more.
* **Settings**: Configures the Scrapy project, such as request delays and the number of concurrent requests.

### Installing Scrapy

Before using Scrapy, you need to install it. We use pip:

```bash
pip install scrapy
```

### Scrapy Project Structure

A Scrapy project is a structured directory containing multiple folders and modules that help you organize your crawler code.

Scrapy uses command-line tools to create and manage crawler projects.
You can create a new Scrapy project with the following command:

```bash
scrapy startproject myproject
```

This creates a project named `myproject`, with a structure roughly as follows:

    myproject/
        scrapy.cfg            # Project configuration file
        myproject/            # Source code folder
            __init__.py       # Empty placeholder file
            items.py          # Defines the structure of scraped data
            middlewares.py    # Defines middleware
            pipelines.py      # Defines data processing pipeline
            settings.py       # Project settings file
            spiders/          # Folder for storing crawler code
                __init__.py   # Empty placeholder file
                myspider.py   # Custom crawler code

* * *

Here's a basic Scrapy crawler example demonstrating how to scrape data from a webpage.

We create a crawler project:

```bash
scrapy startproject tutorial_test_spiders
```

If successful, the above command prints output ending like this:

    templates/project', created in:
        /Users//-test/tutorial_test_spiders

    You can start your first spider with:
        cd tutorial_test_spiders
        scrapy genspider example example.com

[Figure: generated project structure]

Then enter this directory:

```bash
cd tutorial_test_spiders
```

Next, use the `scrapy genspider` command to create a crawler:

```bash
scrapy genspider douban_spider movie.douban.com
```

[Figure: resulting directory structure]

In the `tutorial_test_spiders/spiders` directory, a file named `douban_spider.py` is generated, with the following code:

## Example

```python
import scrapy


class DoubanSpider(scrapy.Spider):
    name = "douban_spider"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com"]

    def parse(self, response):
        pass
```

**Code Explanation:**

* **`name`**: Defines the crawler's name, which must be unique within the project.
* **`allowed_domains`**: Restricts the domains the crawler can access, preventing it from scraping pages on other domains.
* **`start_urls`**: Specifies the initial URLs where the crawler begins scraping.
* **`parse`**: The `parse` method is the core of every crawler, responsible for processing responses and extracting data. It receives a `response` object representing the page content returned by the server.

### Writing Crawler Code

Before writing crawler code, keep the following points in mind:

* Websites like Douban may detect crawler behavior. It's recommended to set `USER_AGENT` and `DOWNLOAD_DELAY` to simulate normal user activity.
* When scraping data, comply with the target website's `robots.txt` rules and avoid putting excessive strain on the server.
* Frequent scraping may trigger IP blocking.

### Modifying `settings.py` Configuration

Add the following configurations to `settings.py` to simulate browser requests and bypass anti-crawling mechanisms:

```python
# Set User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

# Do not obey robots.txt rules (set to True to follow the robots.txt advice above)
ROBOTSTXT_OBEY = False

# Set download delay to avoid overly rapid requests
DOWNLOAD_DELAY = 2

# Enable automatic throttling extension
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 5
```

In the crawler code, add custom request headers (such as `User-Agent` and `Referer`) to further simulate browser behavior.
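
The `robots.txt` guidance above can be checked offline before a crawl. The following is a minimal sketch using Python's standard `urllib.robotparser`; the rules in `SAMPLE_ROBOTS_TXT`, the `my-crawler` agent name, and the `example.com` URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL.
allowed = parser.can_fetch("my-crawler", "https://example.com/top250")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/page")

# Crawl-delay, if the file declares one, is a hint for choosing a download delay.
delay = parser.crawl_delay("my-crawler")

print(allowed, blocked, delay)
```

In a Scrapy project, setting `ROBOTSTXT_OBEY = True` makes the framework apply a site's rules automatically, so a manual check like this is mainly useful for inspecting what a site allows.
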
Open the `douban_spider.py` file and modify its contents as follows:

## Example

```python
import scrapy


class DoubanSpider(scrapy.Spider):
    name = "douban_spider"
    start_urls = [
        'https://movie.douban.com/top250',
    ]

    def start_requests(self):
        # Send browser-like headers with the initial requests
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Referer': 'https://movie.douban.com/',
        }
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers, callback=self.parse)

    def parse(self, response):
        # Each movie entry sits in its own div.item block
        for movie in response.css('div.item'):
            yield {
                'title': movie.css('span.title::text').get(),
                'rating': movie.css('span.rating_num::text').get(),
                'quote': movie.css('span.inq::text').get(),
            }

        # Handle pagination
        next_page = response.css('span.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

**Code Explanation:**

1. **`name = "douban_spider"`**: Defines the crawler's name.
2. **`start_urls`**: Specifies the initial URL where the crawler starts scraping (the Douban Movie Top 250 page).
3. **`parse` method**:
    * Uses CSS selectors to extract each movie's title, rating, and one-line quote.
    * `span.title::text`: Extracts the movie title.
    * `span.rating_num::text`: Extracts the movie rating.
    * `span.inq::text`: Extracts the movie's one-line quote.
4. **Pagination Handling**:
    * Uses `span.next a::attr(href)` to extract the link to the next page.
    * If a next page exists, uses `response.follow` to continue scraping.

Run the following command in the terminal to start the crawler:

```bash
scrapy crawl douban_spider -o douban_movies.csv
```

This launches the crawler and saves the extracted data into the `douban_movies.csv` file.

Note: The above content is for educational purposes only. When scraping data, please adhere to the target website's `robots.txt` rules.

* * *

## Common Methods

### 1. **Crawler Methods**

| Method Name | Description | Example |
| --- | --- | --- |
| `start_requests()` | Generates initial requests, allowing customization of request headers, methods, etc. | `yield scrapy.Request(url, callback=self.parse)` |
| `parse(response)` | Processes the response and extracts data; the core method of the crawler. | `yield {'title': response.css('h1::text').get()}` |
| `response.follow(url, callback)` | Builds a new request and resolves relative URLs automatically; useful for pagination or link navigation. | `yield response.follow(next_page, callback=self.parse)` |
| `closed(reason)` | Called when the crawler shuts down; used for resource cleanup or logging. | `def closed(self, reason): print('Spider closed:', reason)` |
| `log(message)` | Logs informational messages. | `self.log('This is a log message')` |

* * *

### 2. **Data Extraction Methods**

| Method Name | Description | Example |
| --- | --- | --- |
| `response.css(selector)` | Uses CSS selectors to extract data. | `title = response.css('h1::text').get()` |
| `response.xpath(selector)` | Uses XPath selectors to extract data. | `title = response.xpath('//h1/text()').get()` |
| `get()` | Extracts the first matching result from a `SelectorList` (string). | `title = response.css('h1::text').get()` |
| `getall()` | Extracts all matching results from a `SelectorList` (list). | `titles = response.css('h1::text').getall()` |
| `attrib` | Retrieves attributes of the current node. | `link = response.css('a').attrib['href']` |

* * *

### 3. **Request and Response Methods**

| Method Name | Description | Example |
| --- | --- | --- |
| `scrapy.Request(url, callback, method, headers, meta)` | Creates a new request. | `yield scrapy.Request(url, callback=self.parse, headers=headers)` |
| `response.url` | Gets the current response's URL. | `current_url = response.url` |
| `response.status` | Gets the response's status code. | `if response.status == 200: print('Success')` |
| `response.meta` | Gets extra data passed along with the request. | `value = response.meta.get('key')` |
| `response.headers` | Gets the response's header information. | `content_type = response.headers.get('Content-Type')` |

* * *

### 4. **Middleware and Pipeline Methods**

| Method Name | Description | Example |
| --- | --- | --- |
| `process_request(request, spider)` | Processes requests before they are sent (downloader middleware). | `request.headers['User-Agent'] = 'Mozilla/5.0'` |
| `process_response(request, response, spider)` | Processes responses after they return (downloader middleware). | `if response.status == 403: return request.replace(dont_filter=True)` |
| `process_item(item, spider)` | Processes extracted data (pipeline). | `if item['price'] < 0: raise DropItem('Invalid price')` |
| `open_spider(spider)` | Called when the crawler starts (pipeline). | `def open_spider(self, spider): self.file = open('items.json', 'w')` |
| `close_spider(spider)` | Called when the crawler shuts down (pipeline). | `def close_spider(self, spider): self.file.close()` |

* * *

### 5. **Tools and Extension Methods**

| Command | Description | Example |
| --- | --- | --- |
| `scrapy shell` | Launches an interactive shell for debugging and testing selectors. | `scrapy shell 'http://example.com'` |
| `scrapy crawl <spider_name>` | Runs the specified crawler. | `scrapy crawl myspider -o output.json` |
| `scrapy check` | Runs the contract checks defined in the spiders' docstrings. | `scrapy check` |
| `scrapy fetch` | Downloads content from a specified URL. | `scrapy fetch 'http://example.com'` |
| `scrapy view` | Opens the page as downloaded by Scrapy in a browser. | `scrapy view 'http://example.com'` |

* * *

### 6. **Common Settings (`settings.py`)**

| Setting Item | Description | Example |
| --- | --- | --- |
| `USER_AGENT` | Sets the User-Agent in request headers. | `USER_AGENT = 'Mozilla/5.0'` |
| `ROBOTSTXT_OBEY` | Determines whether to obey `robots.txt` rules. | `ROBOTSTXT_OBEY = False` |
| `DOWNLOAD_DELAY` | Sets the download delay to avoid overly rapid requests. | `DOWNLOAD_DELAY = 2` |
| `CONCURRENT_REQUESTS` | Sets the number of concurrent requests. | `CONCURRENT_REQUESTS = 16` |
| `ITEM_PIPELINES` | Enables pipelines. | `ITEM_PIPELINES = {'myproject.pipelines.MyPipeline': 300}` |
| `AUTOTHROTTLE_ENABLED` | Enables the automatic throttling extension. | `AUTOTHROTTLE_ENABLED = True` |

* * *

### 7. **Other Common Methods**

| Method Name | Description | Example |
| --- | --- | --- |
| `response.follow_all(l
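
The pipeline hooks listed in section 4 (`open_spider`, `process_item`, `close_spider`) are easier to follow when seen together in one class. The following is a minimal sketch, not code from a real project: the class name `JsonLinesPipeline`, the `items.jsonl` file name, and the "drop items without a title" rule are all assumptions chosen for illustration. The `DropItem` fallback exists only so the sketch also runs where Scrapy is not installed.

```python
import json
import os
import tempfile

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        """Placeholder for scrapy.exceptions.DropItem."""


class JsonLinesPipeline:
    """Hypothetical pipeline: drop items without a title, write the rest as JSON lines."""

    def __init__(self, path="items.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the crawler starts: acquire resources here.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every scraped item: validate, store, then pass it on.
        if not item.get("title"):
            raise DropItem("Missing title")
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the crawler shuts down: release resources here.
        self.file.close()


# Drive the hooks by hand, the way Scrapy would during a crawl.
path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
kept = pipeline.process_item({"title": "Example", "rating": "9.0"}, spider=None)
try:
    pipeline.process_item({"title": None}, spider=None)
    dropped = False
except DropItem:
    dropped = True
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

In a project, a class like this would live in `pipelines.py` and be enabled through `ITEM_PIPELINES`, in the same way as the `MyPipeline` entry in the settings table.
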
← Docker Update CommandVscode Shortcut Keys β†’