python - Scrapy SgmlLinkExtractor rule changes urls
I am trying to build a crawl spider using Scrapy, and I have a rule with a regex for the links I intend my crawler to follow:

    rule = (Rule(SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29'), 'parse_loly', follow=True),)

I have checked the regex and it matches the links I want the spider to crawl. But when I ran the program, the spider did not scrape any items, and with a little investigation I found that even though the regex matches the links correctly, by the time the spider goes to crawl a page the URL is no longer the one it should be.
For example, the spider should match and crawl URLs such as:
    http://www.cartelera.com.uy/apeliculafunciones.aspx?7242&&CINE&OBRA&1&29
    http://www.cartelera.com.uy/apeliculafunciones.aspx?10704&&CINE&OBRA&1&29
    http://www.cartelera.com.uy/apeliculafunciones.aspx?10697&&CINE&OBRA&1&29

but instead Scrapy ends up crawling URLs like:

    http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&29=&7242=&CINE=&OBRA=
    http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&29=&10704=&CINE=&OBRA=
    http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&29=&10697=&CINE=&OBRA=

These URLs lead to error pages, so obviously no items are found.
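As far as I can tell, the rewriting can be reproduced outside the spider with the canonicalize_url helper from w3lib, which Scrapy depends on. A minimal sketch, not from the original post, and the exact output may vary by version:

    # Hypothetical repro of the URL rewriting; not part of the original spider.
    from w3lib.url import canonicalize_url

    url = 'http://www.cartelera.com.uy/apeliculafunciones.aspx?7242&&CINE&OBRA&1&29'
    # The bare query tokens are parsed as keys with empty values and then
    # sorted alphabetically, which mangles this positional-style query string.
    print(canonicalize_url(url))
    # prints something like:
    # http://www.cartelera.com.uy/apeliculafunciones.aspx?1=&29=&7242=&CINE=&OBRA=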
There is no error message. Why is this happening, and how do I avoid it? Here is my spider:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class MySpider(CrawlSpider):
        name = 'cartelera'
        allowed_domains = ['www.cartelera.com.uy']
        start_urls = ['http://www2.cartelera.com.uy/apeliculafunciones.aspx?,,PELICULAS,OBRA,0,26']

        rules = (Rule(SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29'), 'parse_loly', follow=True),)

        # simple debugging method to check the crawled output urls
        def parse_loly(self, response):
            print '%s' % response.url
            return
Those URLs are not standard; you can work around this by telling SgmlLinkExtractor not to canonicalize them:

    SgmlLinkExtractor(allow='http...', canonicalize=False)

Here's how you can check it with the Scrapy shell:
    $ scrapy shell "http://www2.cartelera.com.uy/apeliculafunciones.aspx?,,PELICULAS,OBRA,0,26"
    ...
    >>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    >>> lx = SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29', canonicalize=False)
    >>> for link in lx.extract_links(response):
    ...     print link
    ...
    Link(url='http://www.cartelera.com.uy/apeliculafunciones.aspx?10697&&CINE&OBRA&-1&29', text=u'', fragment='', nofollow=False)
    Link(url='http://www.cartelera.com.uy/apeliculafunciones.aspx?10697&&CINE&OBRA&1&29#titulosalas', text=u'aqu\xed', fragment='', nofollow=False)
    Link(url='http://www.cartelera.com.uy/apeliculafunciones.aspx?10795&&CINE&OBRA&1&29', text=u'', fragment='', nofollow=False)
    ...
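Applied to the spider in the question, this just means adding canonicalize=False to the extractor inside the rule. A sketch of the adjusted rule, untested, using the same regex as above:

    # Assumed adjustment of the question's rule; only canonicalize=False is new.
    rules = (Rule(SgmlLinkExtractor(allow='http:\/\/www\.cartelera\.com\.uy\/apeliculafunciones\.aspx\?[^"]*1&29',
                                    canonicalize=False),
                  'parse_loly', follow=True),)

With canonicalization disabled, the extracted links keep the site's original positional-style query strings, so the spider requests the pages exactly as they are linked.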