(Repost) Scrapy: A Little Deeper
Published: 2019-06-12


Scrapy: A Little Deeper

I'm finding Scrapy more and more convenient, so below I continue my Scrapy notes.

  1. Scrapy is built on the Twisted framework. Once I've sorted out PyBrain, I'll take the chance to dig deeper into Twisted.

    Twisted is an event-driven networking engine written in Python and licensed under the open source MIT license.

1. The previous post left out quite a few notes; filling them in now.

* `scrapy startproject xxx` creates a new project named xxx
* `scrapy crawl xxx` starts the crawl; must be run inside a project
* `scrapy shell url` opens url in the Scrapy shell; extremely handy
* `scrapy runspider <spider_file.py>` runs a spider without a project
  1. `scrapy crawl xxx -a category=xxx` passes an argument to the spider (had I known this earlier, my JD crawler wouldn't have turned into such a mess, sigh). Pick it up in the spider's `__init__`, e.g. `def __init__(self, category=None):`; see the sketch below.
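
    A minimal sketch of how the `-a` argument reaches the spider; the spider name and URL pattern here are hypothetical:

    # Minimal sketch (hypothetical spider and URL): how `-a category=...` arrives.
    from scrapy.spider import BaseSpider

    class JDSpider(BaseSpider):
        name = 'jd'

        def __init__(self, category=None, *args, **kwargs):
            super(JDSpider, self).__init__(*args, **kwargs)
            # The command-line value arrives here as a string.
            self.start_urls = ['http://www.example.com/%s' % category]

    Run it with `scrapy crawl jd -a category=books`.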

  2. The first Request objects are generated by the `make_requests_from_url` function, with `callback=self.parse`; roughly as in the paraphrased sketch below.
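
    A paraphrased sketch of the default behavior (not the exact Scrapy source):

    # Paraphrased sketch of BaseSpider's defaults, not the exact source.
    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    class SketchSpider(BaseSpider):
        name = 'sketch'

        def start_requests(self):
            for url in self.start_urls:
                yield self.make_requests_from_url(url)

        def make_requests_from_url(self, url):
            # With no explicit callback, Scrapy falls back to self.parse.
            return Request(url, dont_filter=True)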

  3. Besides BaseSpider there are many spiders you can subclass directly, e.g. `class scrapy.contrib.spiders.CrawlSpider`:

    This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it’s generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

    Compared with BaseSpider it adds a `rules` attribute; through these Rules we choose which kinds of URLs get crawled. Example code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field

    class MyItem(Item):
        # A bare Item has no fields and raises KeyError on assignment,
        # so declare the fields that parse_item fills in.
        id = Field()
        name = Field()
        description = Field()

    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
            # Extract links matching 'item.php' and parse them with the spider's method parse_item.
            Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('Hi, this is an item page! %s' % response.url)
            hxs = HtmlXPathSelector(response)
            item = MyItem()
            item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
            item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
            return item

    XMLFeedSpider:

    from scrapy import log
    from scrapy.contrib.spiders import XMLFeedSpider
    from myproject.items import TestItem

    class MySpider(XMLFeedSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/feed.xml']
        iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
        itertag = 'item'

        def parse_node(self, response, node):
            log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))
            # TestItem is assumed to declare the id/name/description fields.
            item = TestItem()
            item['id'] = node.select('@id').extract()
            item['name'] = node.select('name').extract()
            item['description'] = node.select('description').extract()
            return item

    There are also CSVFeedSpider, SitemapSpider, and other spiders tailored to different needs; see scrapy.contrib.spiders. A SitemapSpider example is sketched below.
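
    A minimal SitemapSpider sketch; the sitemap URL and callback names are hypothetical:

    from scrapy.contrib.spiders import SitemapSpider

    class MySitemapSpider(SitemapSpider):
        name = 'sitemap_example'
        sitemap_urls = ['http://www.example.com/sitemap.xml']
        # Rules are tried in order, so the more specific pattern comes first.
        sitemap_rules = [
            ('/product/', 'parse_product'),
            ('', 'parse_other'),
        ]

        def parse_product(self, response):
            self.log('Product page: %s' % response.url)

        def parse_other(self, response):
            self.log('Other page: %s' % response.url)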

  4. Scrapy also provides a server version, scrapyd, which makes it easy to upload and manage crawl jobs; see the scheduling sketch below.
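
    Once scrapyd is running (it listens on port 6800 by default), jobs can be scheduled through its JSON API. A minimal sketch; the project and spider names are hypothetical:

    # Minimal sketch: schedule a crawl via scrapyd's schedule.json endpoint.
    # 'myproject' and 'example.com' are hypothetical project/spider names.
    import urllib
    import urllib2

    data = urllib.urlencode({'project': 'myproject', 'spider': 'example.com'})
    response = urllib2.urlopen('http://localhost:6800/schedule.json', data)
    print response.read()  # e.g. {"status": "ok", "jobid": "..."}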

Reposted from: https://www.cnblogs.com/wuxinqiu/p/3854590.html
