python - Scrapy SgmlLinkExtractor rule changes urls
I am trying to build a crawl spider using Scrapy, and I have a rule with a regex for the links I intend my crawler to follow:

    rule = (Rule(SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29'), 'parse_loly', follow=True),)

I have checked the regex and it matches the links I want the spider to crawl. But when I ran the program, the spider did not scrape any items, and with a little investigation I found that even though the regex matches the links correctly, by the time the spider goes to crawl a page the URL is no longer the one it should be.
For example, the spider should match and crawl URLs such as:
    http://www.cartelera.com.uy/apeliculafunciones.aspx?7242&&CINE&OBRA&1&29
    http://www.cartelera.com.uy/apeliculafunciones.aspx?10704&&CINE&OBRA&1&29
    http://www.cartelera.com.uy/apeliculafunciones.aspx?10697&&CINE&OBRA&1&29

but instead Scrapy ends up crawling URLs like:

    http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&29=&7242=&CINE=&OBRA=
    http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&29=&10704=&CINE=&OBRA=
    http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&29=&10697=&CINE=&OBRA=

These URLs lead to error pages, so obviously no items are found.
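As far as I can tell, the rewriting can be reproduced outside the spider with the canonicalize_url helper from w3lib, which Scrapy depends on. A minimal sketch, not from the original post, and the exact output may vary by version:

    # Hypothetical repro of the URL rewriting; not part of the original spider.
    from w3lib.url import canonicalize_url

    url = 'http://www.cartelera.com.uy/apeliculafunciones.aspx?7242&&CINE&OBRA&1&29'
    # The bare query tokens are parsed as keys with empty values and then
    # sorted alphabetically, which mangles this positional-style query string.
    print(canonicalize_url(url))
    # prints something like:
    # http://www.cartelera.com.uy/apeliculafunciones.aspx?1=&29=&7242=&CINE=&OBRA=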
There is no error message. Why is this happening, and how do I avoid it? Here is my spider:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class MySpider(CrawlSpider):
        name = 'cartelera'
        allowed_domains = ['www.cartelera.com.uy']
        start_urls = ['http://www2.cartelera.com.uy/apeliculafunciones.aspx?,,PELICULAS,OBRA,0,26']

        rules = (Rule(SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29'), 'parse_loly', follow=True),)

        # simple debugging method to check the crawled output urls
        def parse_loly(self, response):
            print '%s' % response.url
            return
Those URLs are not standard; you can work around this by telling SgmlLinkExtractor not to canonicalize them:

    SgmlLinkExtractor(allow='http...', canonicalize=False)

Here's how you can check it with the Scrapy shell:
    $ scrapy shell "http://www2.cartelera.com.uy/apeliculafunciones.aspx?,,PELICULAS,OBRA,0,26"
    ...
    >>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    >>> lx = SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29', canonicalize=False)
    >>> for link in lx.extract_links(response):
    ...     print link
    ...
    Link(url='http://www.cartelera.com.uy/apeliculafunciones.aspx?10697&&CINE&OBRA&-1&29', text=u'', fragment='', nofollow=False)
    Link(url='http://www.cartelera.com.uy/apeliculafunciones.aspx?10697&&CINE&OBRA&1&29#titulosalas', text=u'aqu\xed', fragment='', nofollow=False)
    Link(url='http://www.cartelera.com.uy/apeliculafunciones.aspx?10795&&CINE&OBRA&1&29', text=u'', fragment='', nofollow=False)
    ...
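Applied to the spider in the question, this just means adding canonicalize=False to the extractor inside the rule. A sketch of the adjusted rule, untested, using the same regex as above:

    # Assumed adjustment of the question's rule; only canonicalize=False is new.
    rules = (Rule(SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29',
                                    canonicalize=False),
                  'parse_loly', follow=True),)

With canonicalization disabled, the extracted links keep the site's original positional-style query strings, so the spider requests the pages exactly as they are linked.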