python - Scrapy SgmlLinkExtractor rule changes urls -


I am trying to create a web crawler spider using Scrapy, and I have the following rule for the links I intend my crawler to follow, written as a regex:

  rules = (Rule(SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29'), 'parse_loly', follow=True),)

I checked the regex and it works, and these are the links I want the spider to follow. But when I ran the program, the spider did not scrape any items, and with a little investigation I found out that whenever the regex matches a link correctly, the URL gets modified by the time the spider goes to crawl the page.

For example, the spider should match and crawl URLs such as:

  http://www.cartelera.com.uy/apeliculafunciones.aspx?7242&&CINE&OBRA&1&29
  http://www.cartelera.com.uy/apeliculafunciones.aspx?10704&&CINE&OBRA&1&29
  http://www.cartelera.com.uy/apeliculafunciones.aspx?10697&&CINE&OBRA&1&29

but instead Scrapy crawls URLs like:

  http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&29=&7242=&CINE=&OBRA=
  http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&29=&10704=&CINE=&OBRA=
  http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&29=&10697=&CINE=&OBRA=

These URLs lead to error pages, so obviously no items are found.

There is no error message. Why is this happening, and how do I avoid it? Here is my spider:

  class MySpider(CrawlSpider):
      name = 'cartelera'
      allowed_domains = ['www.cartelera.com.uy']
      start_urls = ['http://www2.cartelera.com.uy/apeliculafunciones.aspx?,,PELICULAS,OBRA,0,26']
      rules = (Rule(SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29'), 'parse_loly', follow=True),)

      # simple debugging method to check the crawled urls
      def parse_loly(self, response):
          print '%s' % response.url
          return

Those URLs are non-standard, but you could probably overcome this by telling the SgmlLinkExtractor not to canonicalize them:

  SgmlLinkExtractor(allow='http...', canonicalize=False)
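To see what the default canonicalization does to these keyless query strings, you can run one of the good URLs through the helper yourself. A minimal sketch, assuming scrapy.utils.url.canonicalize_url, the function the link extractor applies by default in this generation of Scrapy (later versions import it from w3lib.url):

  >>> from scrapy.utils.url import canonicalize_url
  >>> # the bare tokens are parsed as blank-valued parameters, sorted,
  >>> # and re-serialized with trailing '=' signs - producing the broken URLs above
  >>> canonicalize_url('http://www.cartelera.com.uy/apeliculafunciones.aspx?7242&&CINE&OBRA&1&29')
  'http://www.cartelera.com.uy/apeliculafunciones.aspx?1=&29=&7242=&CINE=&OBRA='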

Here's how it works from the Scrapy shell:

  $ scrapy shell "http://www2.cartelera.com.uy/apeliculafunciones.aspx?,,PELICULAS,OBRA,0,26"
  ...
  >>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
  >>> s = SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29', canonicalize=False)
  >>> for link in s.extract_links(response):
  ...     print link
  ...
  Link(url='http://www.cartelera.com.uy/apeliculafunciones.aspx?10697&&CINE&OBRA&-1&29', text=u'', fragment='', nofollow=False)
  Link(url='http://www.cartelera.com.uy/apeliculafunciones.aspx?10697&&CINE&OBRA&-1&29#titulosalas', text=u'aqu\xed', fragment='', nofollow=False)
  Link(url='http://www.cartelera.com.uy/apeliculafunciones.aspx?10795&&CINE&OBRA&1&29', text=u'', fragment='', nofollow=False)
  ...
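Applied to the spider from the question, the fix should just be the extra keyword argument on the rule's extractor. A sketch, reusing the original class otherwise unchanged:

  from scrapy.contrib.spiders import CrawlSpider, Rule
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

  class MySpider(CrawlSpider):
      name = 'cartelera'
      allowed_domains = ['www.cartelera.com.uy']
      start_urls = ['http://www2.cartelera.com.uy/apeliculafunciones.aspx?,,PELICULAS,OBRA,0,26']
      # same rule as before, but with canonicalization disabled so the
      # extracted urls are requested exactly as they appear on the page
      rules = (Rule(SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29',
                                      canonicalize=False),
                    'parse_loly', follow=True),)

      def parse_loly(self, response):
          print '%s' % response.url
          return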
