Scrapy: output multiple item elements with XPath to a single CSV?


I am attempting to scrape items from a page containing various HTML elements and a series of nested tables.

I have some code working that scrapes table X with class="classa" and outputs the table elements as a series of items, such as company address, phone number, website address, etc.

I would like to add some extra items to the list I'm outputting, but the other items to be scraped aren't located within the same table, and some aren't located in a table at all, e.g. an <h1> tag in another part of the page.

How is it possible to add other items to my output, using an XPath filter, and have them appear in the same array / output structure? I noticed that if I scrape extra items from another table (even when that table has the exact same class name and id), the other items are written on different lines in the CSV, so the CSV structure is not kept intact :(

I'm sure there must be a way for items to remain unified in the CSV output, even if they are scraped from different areas of a page? Hopefully it's just a simple fix...

----- html example page being scraped -----

    <html>
    <head></head>
    <body>

    <!-- huge amount of other HTML and tables not scraped -->

    <h2>heading scraped - company name</h2>
    <p>company description</p>

    <table cellspacing="0" class="contenttable company-details">
    <tr>
      <th>item code</th>
      <td>it123</td>
    </tr>
    <tr>
      <th>listing date</th>
      <td>12 september, 2011</td>
    </tr>
    <tr>
      <th>internet address</th>
      <td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td>
    </tr>
    <tr>
      <th>office address</th>
      <td>123 example street</td>
    </tr>
    <tr>
      <th>office telephone</th>
      <td>(01) 1234 5678</td>
    </tr>
    </table>

    <table cellspacing="0" class="contenttable" id="staff">
    <tr><th>management names</th></tr>
    <tr>
      <td>
      mr john citizen (ceo)<br/>mrs mary doe (director)<br/>dr j. watson (manager)<br/>
      </td>
    </tr>
    </table>

    <table cellspacing="0" class="contenttable company-details">
    <tr>
      <th>contact person</th>
      <td>
      mr john citizen<br/>
      </td>
    </tr>
    <tr>
      <th class=principal>company mission</th>
      <td>acme corp retail sales company.</td>
    </tr>
    </table>

    </body>
    </html>

---- scrapy code example ----

    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from my.items import MyItem

    class MySpider(Spider):
        name = "my"
        allowed_domains = ["website.com"]
        start_urls = ["http://www.website.com/abc"]

        def parse(self, response):
            sel = Selector(response)
            sites = sel.xpath('//table[@class="contenttable company-details"]')
            items = []

            for site in sites:
                item = MyItem()
                item['company_name'] = site.xpath('.//h1//text()').extract()
                item['item_code'] = site.xpath('.//th[text()="item code"]/following-sibling::td//text()').extract()
                item['listing_date'] = site.xpath('.//th[text()="listing date"]/following-sibling::td//text()').extract()
                item['website_url'] = site.xpath('.//th[text()="internet address"]/following-sibling::td//text()').extract()
                item['office_address'] = site.xpath('.//th[text()="office address"]/following-sibling::td//text()').extract()
                item['office_phone'] = site.xpath('.//th[text()="office telephone"]/following-sibling::td//text()').extract()
                item['company_mission'] = site.xpath('//th[text()="company mission"]/following-sibling::td//text()').extract()
                yield item

Outputting to CSV:

    scrapy crawl my -o items.csv -t csv

With the example code above, the [company mission] item appears on a different line in the CSV from the other items (I'm guessing because it's in a different table), even though it has the same class name and id. Additionally, I'm unsure how to scrape the <h1> field, since it falls outside the table structure matched by my current XPath sites filter?

I could expand the sites XPath filter to include more content, but won't that be less efficient and defeat the point of filtering?

Here's an example of the debug log, where you can see the company mission is being processed twice for some reason, and the first loop is empty, which must be why it's output on a new line in the CSV, but why??

    {'item_code': [u'abc'],
     'listing_date': [u'1 january, 2000'],
     'office_address': [u'level 1, street, sydney, nsw, australia, 2000'],
     'office_fax': [u'(02) 1234 5678'],
     'office_phone': [u'(02) 1234 5678'],
     'company_mission': [],
     'website_url': [u'http://www.company.com']}
    2014-02-06 16:32:13+1000 [my] DEBUG: Scraped from <200 http://www.website.com/code=abc>
    {'item_code': [],
     'listing_date': [],
     'office_address': [],
     'office_fax': [],
     'office_phone': [],
     'company_mission': [u'the comapany involved in retail, food , beverage, wholesale services.'],
     'website_url': []}

The other thing I'm baffled by is why the items are spat out in the CSV in a different order from the items on the HTML page and from the order I have defined in the spider's config file. Does Scrapy run asynchronously, returning items in whatever order it pleases?

I understand you want to scrape 1 item per page, but //table[@class="contenttable company-details"] matches 2 table elements in your HTML content, so "for site in sites:" will run twice, creating 2 items.

And for each table, XPath expressions are applied within the current table only if they are relative, like .//th[text()="item code"]. Absolute XPath expressions, such as //th[text()="company mission"], look for elements starting from the root element of the HTML document.

Your sample output shows "company_mission" filled in only once, while it should be filled in twice: because you're using an absolute XPath expression for it, it should indeed have appeared in both items. I'm not sure the output matches the current spider code in the question.

So, for the first iteration of the loop, the current table is:

    <table cellspacing="0" class="contenttable company-details">
    <tr>
      <th>item code</th>
      <td>it123</td>
    </tr>
    <tr>
      <th>listing date</th>
      <td>12 september, 2011</td>
    </tr>
    <tr>
      <th>internet address</th>
      <td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td>
    </tr>
    <tr>
      <th>office address</th>
      <td>123 example street</td>
    </tr>
    <tr>
      <th>office telephone</th>
      <td>(01) 1234 5678</td>
    </tr>
    </table>

in which you can scrape:

  • item code
  • listing date
  • internet address --> website url
  • office address
  • office telephone

and, because you're using an absolute XPath expression, //th[text()="company mission"]/following-sibling::td//text() will look anywhere in the document, not only in the first <table cellspacing="0" class="contenttable company-details">.

These extracted fields go into an item of their own.

Then comes the 2nd table matching your XPath for sites:

    <table cellspacing="0" class="contenttable company-details">
    <tr>
      <th>contact person</th>
      <td>
      mr john citizen<br/>
      </td>
    </tr>
    <tr>
      <th class=principal>company mission</th>
      <td>acme corp retail sales company.</td>
    </tr>
    </table>

for which a new MyItem() is instantiated. Here, no XPath expression matches except the absolute XPath for "company mission", so at the end of the loop iteration you've got an item containing only "company mission".

If you're sure you can expect one and only one item per page, you can use longer XPaths like //table[@class="contenttable company-details"]//th[text()="item code"]/following-sibling::td//text() for each field you want, so that they match in either the 1st or the 2nd table,

and use only 1 MyItem() instance.

Also, you can try CSS selectors, which are shorter to read and write and easier to maintain:

  • "company_name" <-- sel.css('h2::text')
  • "item_code" <-- sel.css('table.company-details th:contains("item code") + td::text')
  • "listing_date" <-- sel.css('table.company-details th:contains("listing date") + td::text')
  • etc.

Note that :contains() is available in Scrapy via cssselect underneath, but it's not standard (it was removed from the CSS specs, but it is handy), and that the ::text pseudo-element selector is also a non-standard Scrapy extension, and also handy.

