python - How can I change the user agent in a Scrapy spider?


I wrote a spider that gets my IP from http://ip.42.pl/raw via a proxy. It is my first spider. I want to change the user agent. I got the information from this tutorial: http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu

I completed all the steps from the tutorial, and this is my code.

settings.py

    BOT_NAME = 'checkip'

    SPIDER_MODULES = ['checkip.spiders']
    NEWSPIDER_MODULE = 'checkip.spiders'

    USER_AGENT_LIST = [
        'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3',
        'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
        'Mozilla/5.0 (Linux; U; Android 4.0.3; de-ch; HTC Sensation Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
        'Mozilla/5.0 (Linux; U; Android 2.3; en-us) AppleWebKit/999+ (KHTML, like Gecko) Safari/999.9',
        'Mozilla/5.0 (Linux; U; Android 2.3.5; zh-cn; HTC_IncredibleS_S710e Build/GRJ90) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    ]

    HTTP_PROXY = 'http://127.0.0.1:8123'

    DOWNLOADER_MIDDLEWARES = {
        'checkip.middlewares.RandomUserAgentMiddleware': 400,
        'checkip.middlewares.ProxyMiddleware': 410,
        'checkip.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }

middleware.py

    import random
    from scrapy.conf import settings
    from scrapy import log


    class RandomUserAgentMiddleware(object):
        def process_request(self, request, spider):
            ua = random.choice(settings.get('USER_AGENT_LIST'))
            if ua:
                request.headers.setdefault('User-Agent', ua)
                # This is just to check which user agent is being used for the request
                spider.log(
                    u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                    level=log.DEBUG
                )


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            request.meta['proxy'] = settings.get('HTTP_PROXY')

checkip.py

    import time
    from scrapy.spider import Spider
    from scrapy.http import Request


    class CheckIpSpider(Spider):
        name = 'checkip'
        allowed_domains = ["ip.42.pl"]
        url = "http://ip.42.pl/raw"

        def start_requests(self):
            yield Request(self.url, callback=self.parse)

        def parse(self, response):
            now = time.strftime("%c")
            ip = now + "-" + response.body + "\n"
            with open('ips.txt', 'a') as f:
                f.write(ip)

This is the information returned for the user agent:

    2015-10-30 22:24:20+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2015-10-30 22:24:20+0200 [checkip] DEBUG: User-Agent: Scrapy/0.24.4 (+http://scrapy.org) <GET http://ip.42.pl/raw>

User-Agent: Scrapy/0.24.4 (+http://scrapy.org)

When I manually add the header to the request, everything works correctly.

    def start_requests(self):
        yield Request(self.url, callback=self.parse, headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})

This is the result returned in the console:

    2015-10-30 22:50:32+0200 [checkip] DEBUG: User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3 <GET http://ip.42.pl/raw>

How can I use USER_AGENT_LIST in my spider?

If you don't need a random user agent, you can just put USER_AGENT in your settings file, like:

settings.py:

    ...
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
    ...

In that case there is no need for the middleware. If you do want to select a user agent randomly, first make sure from the Scrapy logs that RandomUserAgentMiddleware is actually being used. You should see something like this in your logs:

    Enabled downloader middlewares:
    [
        ...
        'checkip.middlewares.RandomUserAgentMiddleware',
        ...
    ]

Also check that checkip.middlewares is the correct path to that middleware.
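A related thing to check, offered as a guess rather than a confirmed diagnosis: the settings in the question disable 'checkip.contrib.downloadermiddleware.useragent.UserAgentMiddleware', but the built-in user-agent middleware lives in the scrapy package, not in the project. If the built-in one stays enabled and runs first, it fills in the default Scrapy/0.24.4 header, and a later setdefault() call leaves that value alone. A sketch of the corrected entry, assuming the project module really is named middlewares.py:

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    'checkip.middlewares.RandomUserAgentMiddleware': 400,
    'checkip.middlewares.ProxyMiddleware': 410,
    # Built-in user-agent middleware, disabled by mapping it to None.
    # Path for Scrapy 0.24:
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    # Path for Scrapy 1.0 and later (use this one instead after upgrading):
    # 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```

With the built-in middleware disabled, the "Enabled downloader middlewares" log line should no longer list UserAgentMiddleware.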

It may also be that the settings are being loaded incorrectly in the middleware. I would recommend using the from_crawler method to load them:

    class RandomUserAgentMiddleware(object):
        def __init__(self, settings):
            self.settings = settings

        @classmethod
        def from_crawler(cls, crawler):
            # The crawler carries the project settings
            settings = crawler.settings
            # __init__ takes only the settings, so pass nothing else
            o = cls(settings)
            return o

Now use self.settings.get('USER_AGENT_LIST') inside the process_request method to get what you want.

Also, please update your Scrapy version. It looks like you are using 0.24, while Scrapy has already passed 1.0.
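Putting these pieces together, a minimal sketch of the whole middleware might look like the following. It assumes USER_AGENT_LIST is defined in settings.py as in the question. The constructor argument name user_agents is my own choice, and assigning the header directly instead of calling setdefault() is a deliberate change, so that a User-Agent already set by another middleware gets replaced.

```python
import random


class RandomUserAgentMiddleware(object):
    """Pick a random User-Agent for every outgoing request."""

    def __init__(self, user_agents):
        # `user_agents` is a hypothetical argument name: a list of UA strings.
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the crawler, which carries the project
        # settings; USER_AGENT_LIST is the custom setting from the question.
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        if not self.user_agents:
            return None  # nothing configured, leave the request untouched
        # Plain assignment overwrites any User-Agent that is already present,
        # whereas setdefault() would keep the existing value.
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request
```

Returning None from process_request tells Scrapy to carry on with the remaining middlewares and the download itself.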

