Scrapy: output multiple item elements with XPath to a single CSV?
I am attempting to scrape items from a page containing various HTML elements and a series of nested tables.
I have some code working that scrapes a table with class="classa" and outputs the table elements as a series of items, such as company address, phone number, website address, etc.
I would like to add some extra items to the list I am outputting, but the other items to be scraped aren't located within the same table, and some aren't located in a table at all, e.g. an <h1> tag in another part of the page.
How is it possible to add other items to my output, using an XPath filter, and have them appear in the same array / output structure? I noticed that if I scrape table items from another table (even when that table has the exact same class name and id), the other items are written on different lines in the CSV, so the CSV structure is not kept intact :(
I'm sure there must be a way for items to remain unified in the CSV output, even if they are scraped from different areas of a page? Hopefully it's just a simple fix...
----- html example page being scraped -----
<html>
<head></head>
<body>
<!-- huge amount of other html and tables not scraped -->
<h2>heading scraped - company name</h2>
<p>company description</p>
<table cellspacing="0" class="contenttable company-details">
  <tr> <th>item code</th> <td>it123</td> </tr>
  <tr> <th>listing date</th> <td>12 september, 2011</td> </tr>
  <tr> <th>internet address</th> <td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td> </tr>
  <tr> <th>office address</th> <td>123 example street</td> </tr>
  <tr> <th>office telephone</th> <td>(01) 1234 5678</td> </tr>
</table>
<table cellspacing="0" class="contenttable" id="staff">
  <tr><th>management names</th></tr>
  <tr> <td> mr john citizen (ceo)<br/>mrs mary doe (director)<br/>dr j. watson (manager)<br/> </td> </tr>
</table>
<table cellspacing="0" class="contenttable company-details">
  <tr> <th>contact person</th> <td> mr john citizen<br/> </td> </tr>
  <tr> <th class=principal>company mission</th> <td>acme corp retail sales company.</td> </tr>
</table>
</body>
</html>
---- scrapy code example ----
# Import paths as used in the question (older Scrapy releases);
# newer releases expose the same classes as scrapy.Spider and scrapy.Selector.
from scrapy.spider import Spider
from scrapy.selector import Selector
from my.items import MyItem


class MySpider(Spider):
    name = "my"
    allowed_domains = ["website.com"]
    start_urls = ["http://www.website.com/abc"]

    def parse(self, response):
        sel = Selector(response)
        # Select the company-details table(s) on the page.
        sites = sel.xpath('//table[@class="contenttable company-details"]')
        for site in sites:
            item = MyItem()
            item['company_name'] = site.xpath('.//h1//text()').extract()
            item['item_code'] = site.xpath('.//th[text()="item code"]/following-sibling::td//text()').extract()
            item['listing_date'] = site.xpath('.//th[text()="listing date"]/following-sibling::td//text()').extract()
            item['website_url'] = site.xpath('.//th[text()="internet address"]/following-sibling::td//text()').extract()
            item['office_address'] = site.xpath('.//th[text()="office address"]/following-sibling::td//text()').extract()
            item['office_phone'] = site.xpath('.//th[text()="office telephone"]/following-sibling::td//text()').extract()
            item['company_mission'] = site.xpath('//th[text()="company mission"]/following-sibling::td//text()').extract()
            yield item
Outputting to CSV:
scrapy crawl my -o items.csv -t csv
With the example code above, the [company mission] item appears on a different line in the CSV from the other items (I'm guessing because it is in a different table), even though that table has the same class name and id. Additionally, I'm unsure how to scrape the <h1> field, since it falls outside the table structure matched by my current XPath sites filter.
I could expand the sites XPath filter to include more content, but won't that be less efficient and defeat the point of filtering?
Here's an example of the debug log, where you can see the company mission being processed twice for some reason, and the first loop is empty, which must be why it is output onto a new line in the CSV. But why??
{'item_code': [u'abc'],
 'listing_date': [u'1 january, 2000'],
 'office_address': [u'level 1, street, sydney, nsw, australia, 2000'],
 'office_fax': [u'(02) 1234 5678'],
 'office_phone': [u'(02) 1234 5678'],
 'company_mission': [],
 'website_url': [u'http://www.company.com']}
2014-02-06 16:32:13+1000 [my] debug: scraped <200 http://www.website.com/code=abc>
{'item_code': [],
 'listing_date': [],
 'office_address': [],
 'office_fax': [],
 'office_phone': [],
 'company_mission': [u'the comapany involved in retail, food , beverage, wholesale services.'],
 'website_url': []}
The other thing I am baffled by is why the items are spat out in the CSV in a different order from the items on the HTML page and from the order I have defined in the spider's config file. Does Scrapy run asynchronously, returning items in whatever order it pleases?
As I understand it, you want to scrape 1 item per page. The XPath
//table[@class="contenttable company-details"]
matches 2 table elements in your HTML content, so the loop "for site in sites:" runs twice, creating 2 items.
For each table, XPath expressions are applied within the current table only if they are relative, such as
.//th[text()="item code"]
Absolute XPath expressions, such as
//th[text()="company mission"]
look for elements starting from the root element of the HTML document, whichever table the loop is currently on.
Your sample output shows a non-empty "company_mission" only once. Because you're using an absolute XPath expression for it, it should indeed have been filled in both items. I'm not sure the output matches the current spider code in your question.
So, in the first iteration of the loop, you are working on this table:
<table cellspacing="0" class="contenttable company-details">
  <tr> <th>item code</th> <td>it123</td> </tr>
  <tr> <th>listing date</th> <td>12 september, 2011</td> </tr>
  <tr> <th>internet address</th> <td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td> </tr>
  <tr> <th>office address</th> <td>123 example street</td> </tr>
  <tr> <th>office telephone</th> <td>(01) 1234 5678</td> </tr>
</table>
in which you can scrape:
- item code
- listing date
- internet address --> website url
- office address
- office telephone
And because you're using an absolute XPath expression for the mission,
//th[text()="company mission"]/following-sibling::td//text()
it is looked up anywhere in the document, not only in the first <table cellspacing="0" class="contenttable company-details">.
These extracted fields go into an item of their own.
Then comes the 2nd table matching your XPath for sites:
<table cellspacing="0" class="contenttable company-details">
  <tr> <th>contact person</th> <td> mr john citizen<br/> </td> </tr>
  <tr> <th class=principal>company mission</th> <td>acme corp retail sales company.</td> </tr>
</table>
For this table a new MyItem() is instantiated, and here no XPath expression matches except the absolute XPath for "company mission", so at the end of this loop iteration you've got an item with only "company mission".
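If you're sure you expect one and only one item per page, you can use longer XPaths such as
//table[@class="contenttable company-details"]//th[text()="item code"]/following-sibling::td//text()
for each field you want, so that each one matches in either the 1st or the 2nd table, and use only one MyItem() instance, created outside any loop over the tables.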
Also, you can try CSS selectors, which are shorter to read and write, and easier to maintain:
- "company_name" <-- sel.css('h2::text')
- "item_code" <-- sel.css('table.company-details th:contains("item code") + td::text')
- "listing_date" <-- sel.css('table.company-details th:contains("listing date") + td::text')
- etc.
Note that :contains() is available in Scrapy via cssselect underneath, but it's not standard (it was removed from the CSS specs, but is handy), and the ::text pseudo-element selector is also non-standard, a Scrapy extension, but also handy.