python - How can I change USER_AGENT in a Scrapy spider?
I wrote a spider that gets my IP from http://ip.42.pl/raw via a proxy.
It's my first spider. I want to change the user agent. I followed this tutorial: http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu
I completed the steps from the tutorial, and this is my code.
settings.py
    BOT_NAME = 'checkip'
    SPIDER_MODULES = ['checkip.spiders']
    NEWSPIDER_MODULE = 'checkip.spiders'
    USER_AGENT_LIST = [
        'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3',
        'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
        'Mozilla/5.0 (Linux; U; Android 4.0.3; de-ch; HTC Sensation Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
        'Mozilla/5.0 (Linux; U; Android 2.3; en-us) AppleWebKit/999+ (KHTML, like Gecko) Safari/999.9',
        'Mozilla/5.0 (Linux; U; Android 2.3.5; zh-cn; HTC_IncredibleS_S710e Build/GRJ90) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    ]
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
        'checkip.middlewares.RandomUserAgentMiddleware': 400,
        'checkip.middlewares.ProxyMiddleware': 410,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }
middlewares.py
    import random
    from scrapy.conf import settings
    from scrapy import log

    class RandomUserAgentMiddleware(object):
        def process_request(self, request, spider):
            ua = random.choice(settings.get('USER_AGENT_LIST'))
            if ua:
                request.headers.setdefault('User-Agent', ua)
            # this is to check which user agent is being used for the request
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            request.meta['proxy'] = settings.get('HTTP_PROXY')
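A side note on the setdefault call above: it only fills in the User-Agent when no value is present yet, so any header set earlier in the chain wins. That is also why the built-in user-agent middleware is disabled in settings.py. A minimal sketch of that behavior, with a plain dict standing in for Scrapy's request.headers and placeholder values in USER_AGENT_LIST:

```python
import random

# Placeholder values, not real user-agent strings
USER_AGENT_LIST = ['agent-a', 'agent-b', 'agent-c']

# No User-Agent yet: setdefault picks one from the list
headers = {}
headers.setdefault('User-Agent', random.choice(USER_AGENT_LIST))
print(headers['User-Agent'] in USER_AGENT_LIST)  # True

# User-Agent already set: setdefault leaves it untouched
headers = {'User-Agent': 'already-set'}
headers.setdefault('User-Agent', random.choice(USER_AGENT_LIST))
print(headers['User-Agent'])  # already-set
```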
checkip.py
    import time
    from scrapy.spider import Spider
    from scrapy.http import Request

    class CheckIpSpider(Spider):
        name = 'checkip'
        allowed_domains = ["ip.42.pl"]
        url = "http://ip.42.pl/raw"

        def start_requests(self):
            yield Request(self.url, callback=self.parse)

        def parse(self, response):
            now = time.strftime("%c")
            ip = now + "-" + response.body + "\n"
            with open('ips.txt', 'a') as f:
                f.write(ip)
This is the information returned about the USER_AGENT:
    2015-10-30 22:24:20+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2015-10-30 22:24:20+0200 [checkip] DEBUG: User-Agent: Scrapy/0.24.4 (+http://scrapy.org) <GET http://ip.42.pl/raw>

So the request still goes out with User-Agent: Scrapy/0.24.4 (+http://scrapy.org).
When I manually add the header to the request, it works correctly.
    def start_requests(self):
        yield Request(self.url, callback=self.parse, headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})
This is the result returned in the console:
    2015-10-30 22:50:32+0200 [checkip] DEBUG: User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3 <GET http://ip.42.pl/raw>
How can I use the USER_AGENT_LIST in my spider?
If you don't need a random USER_AGENT, you can simply put the USER_AGENT in your settings file, like this:

settings.py:
    ...
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
    ...
There is no need for a middleware in that case. If you do want to randomly select a USER_AGENT, first make sure RandomUserAgentMiddleware is actually being used; you should see it listed in the Scrapy logs:

    Enabled downloader middlewares:
    [ ...
      'checkip.middlewares.RandomUserAgentMiddleware',
      ... ]

Also check that checkip.middlewares is the correct path to your middleware.
Now, maybe the settings are being loaded incorrectly inside the middleware, so I recommend using the from_crawler method to load them:

    class RandomUserAgentMiddleware(object):
        def __init__(self, settings):
            self.settings = settings

        @classmethod
        def from_crawler(cls, crawler):
            settings = crawler.settings
            o = cls(settings)
            return o

Now use self.settings.get('USER_AGENT_LIST') to get what you want inside the process_request method.
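The wiring above can be sketched without a running crawler. In the sketch below, FakeCrawler, FakeSettings, and FakeRequest are hypothetical stand-ins (not Scrapy APIs) used only to show how from_crawler passes the settings through to process_request:

```python
import random

class FakeSettings(dict):
    """Stand-in for crawler.settings; like the real one, it exposes .get()."""
    pass

class FakeCrawler(object):
    """Stand-in for the Scrapy crawler object passed to from_crawler."""
    def __init__(self, settings):
        self.settings = settings

class FakeRequest(object):
    """Stand-in for a Scrapy Request with a headers mapping."""
    def __init__(self):
        self.headers = {}

class RandomUserAgentMiddleware(object):
    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler, giving access to settings
        return cls(crawler.settings)

    def process_request(self, request, spider):
        ua = random.choice(self.settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)

crawler = FakeCrawler(FakeSettings(USER_AGENT_LIST=['ua-1', 'ua-2']))
mw = RandomUserAgentMiddleware.from_crawler(crawler)
req = FakeRequest()
mw.process_request(req, spider=None)
print(req.headers['User-Agent'])  # one of 'ua-1', 'ua-2'
```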
Also, please update your Scrapy version; it looks like you are using 0.24, while 1.0 has already been released.