python - How to output parsed HTML into a file?
(Updated) After some help, I have the following code. I can output a CSV file, but I can't seem to get the CSV to have the proper number of columns:
    from bs4 import BeautifulSoup
    import csv

    soup = BeautifulSoup(html_doc)
    # Python 2: the csv module expects a file opened in binary mode
    outfile = csv.writer(open('outputrows.csv', 'wb'), delimiter='\t')

    #def get_movie_info(imdb):
    tbl = soup.find('table')
    rows = tbl.find_all('tr')
    list = []
    for row in rows:
        cols = row.find_all('td')
        for col in cols:
            if col.has_attr('class') and col['class'][0] == 'title':
                spans = col.find_all('span')
                for span in spans:
                    if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                        id = span.get('data-tconst')
                        list.append(id)
            elif col.has_attr('class') and col['class'][0] == 'number':
                rank = col.text
                list.append(rank)
            elif col.has_attr('class') and col['class'][0] == 'image':
                hrefs = col.find_all('a')
                for href in hrefs:
                    moviename = href.get('title')
                    list.append(moviename)

    outfile.writerows(list)
    print list
The problem is that it outputs in this format, one column of data:
    1.
    The Shawshank Redemption (1994)
    tt0111161
    2.
    The Dark Knight (2008)
    tt0468569
    3.
    Inception (2010)
    tt1375666
when I want 3 columns of data, as shown below:
    1.    The Shawshank Redemption (1994)    tt0111161
    2.    The Dark Knight (2008)    tt0468569
    3.    Inception (2010)    tt1375666
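The gap between the two layouts comes down to the shape of the data handed to the csv writer: `writerows` expects an iterable of rows, where each row is itself a sequence of fields. A rough sketch of that idea, assuming Python 3 and using only the standard library, is below. The `flat` list and the helpers `chunk` and `to_tsv` are hypothetical names made up for illustration, not part of the code in the question.

```python
import csv
import io

# Hypothetical flat list of values, one entry per scraped field.
flat = ["1.", "The Shawshank Redemption (1994)", "tt0111161",
        "2.", "The Dark Knight (2008)", "tt0468569"]

def chunk(values, size):
    """Group a flat list into consecutive rows of `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def to_tsv(rows):
    """Write rows to an in-memory tab-delimited file and return the text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()

# One value per row: every string is wrapped in its own one-item row.
one_column = to_tsv([[value] for value in flat])

# Three values per row: the flat list is regrouped in threes first.
three_columns = to_tsv(chunk(flat, 3))
print(three_columns)
```

Regrouping after the fact only works if every movie contributes exactly three values in a fixed order, so building one tuple per table row while parsing is likely the more robust route.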
Sample HTML code:
    <tr class="odd detailed">
      <td class="number"> 48. </td>
      <td class="image">
        <a href="/title/tt0082971/" title="Raiders of the Lost Ark (1981)">
          <img alt="Raiders of the Lost Ark (1981)" height="74" src="http://ia.media-imdb.com/images/m/mv5bmja0odezmtc1nl5bml5banbnxkftztcwodm2mjaxna@@._v1._sx54_cr0,0,54,74_.jpg" title="Raiders of the Lost Ark (1981)" width="54"/>
        </a>
      </td>
      <td class="title">
        <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0082971"> </span>
        <a href="/title/tt0082971/"> Raiders of the Lost Ark </a>
        <span class="year_type"> (1981) </span>
        <br/>
Can you please try this (not an optimized solution, but it should do the job):
    from bs4 import BeautifulSoup
    import csv

    soup = BeautifulSoup(html_doc)

    def get_movie_info():
        tbl = soup.find('table')
        rows = tbl.find_all('tr')
        for row in rows:
            # Reset the three fields for every table row
            (imagetitle, datatconst, number) = ('', '', '')
            cols = row.find_all('td')
            for col in cols:
                if col.has_attr('class') and col['class'][0] == 'image':
                    href = col.find('a')
                    imagetitle = href.get('title')
                elif col.has_attr('class') and col['class'][0] == 'title':
                    span = col.find('span')
                    if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                        datatconst = span.get('data-tconst')
                elif col.has_attr('class') and col['class'][0] == 'number':
                    number = col.text
            # One tuple per table row becomes one CSV row with three columns
            yield (imagetitle, datatconst, number)

    #################################################

    # Python 2: the csv module expects a file opened in binary mode
    outfile = csv.writer(open('outputrows.csv', 'wb'), delimiter='\t')
    for row in get_movie_info():
        outfile.writerow(row)
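For anyone on Python 3, the same one-tuple-per-row idea might look roughly like the sketch below. It is an adaptation under stated assumptions, not the answer's exact code: `SAMPLE_HTML` is a trimmed-down, hypothetical stand-in for the page, `write_rows` is a made-up helper, the built-in `html.parser` is chosen as the parser, and the columns are emitted in the order the question asks for (number, title, id).

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the page; only the
# attributes read below are kept.
SAMPLE_HTML = """
<table>
<tr class="odd detailed">
  <td class="number"> 48. </td>
  <td class="image">
    <a href="/title/tt0082971/" title="Raiders of the Lost Ark (1981)"></a>
  </td>
  <td class="title">
    <span class="wlb_wrapper" data-tconst="tt0082971"> </span>
  </td>
</tr>
</table>
"""

def get_movie_info(html_doc):
    """Yield one (number, title, tconst) tuple per table row."""
    soup = BeautifulSoup(html_doc, "html.parser")
    for row in soup.find_all("tr"):
        number_td = row.find("td", class_="number")
        link = row.select_one("td.image a")
        span = row.select_one("td.title span.wlb_wrapper")
        if not (number_td and link and span):
            continue  # skip header rows or rows missing a field
        yield (number_td.get_text(strip=True),
               link.get("title"),
               span.get("data-tconst"))

def write_rows(rows, path):
    """Write the tuples as tab-delimited rows."""
    # Python 3: open in text mode with newline="" for the csv module.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

rows = list(get_movie_info(SAMPLE_HTML))
write_rows(rows, "outputrows.csv")
```

Because each yielded tuple already holds the three fields for one movie, `writerows` can consume the rows directly and no regrouping is needed.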