python - How to output parsed HTML into a file?

- February 15, 2013

(Updated) After some help, I have the following code. I can output a CSV file, but I can't seem to get the CSV to have the proper number of columns:

    soup = BeautifulSoup(html_doc)
    import csv
    outfile = csv.writer(open('outputrows.csv', 'wb'), delimiter='\t')

    #def get_movie_info(imdb):
    tbl = soup.find('table')
    rows = tbl.find_all('tr')
    list = []
    for row in rows:
        cols = row.find_all('td')
        for col in cols:
            if col.has_attr('class') and col['class'][0] == 'title':
                spans = col.find_all('span')
                for span in spans:
                    if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                        id = span.get('data-tconst')
                        list.append(id)
            elif col.has_attr('class') and col['class'][0] == 'number':
                rank = col.text
                list.append(rank)
            elif col.has_attr('class') and col['class'][0] == 'image':
                hrefs = col.find_all('a')
                for href in hrefs:
                    moviename = href.get('title')
                    list.append(moviename)

    outfile.writerows(list)

    print list

The problem is that it outputs in this format, with one column of data:

    1.
    The Shawshank Redemption (1994)
    tt0111161
    2.
    The Dark Knight (2008)
    tt0468569
    3.
    Inception (2010)
    tt1375666

when I want 3 columns of data, as shown below:

    1.   The Shawshank Redemption (1994)   tt0111161
    2.   The Dark Knight (2008)            tt0468569
    3.   Inception (2010)                  tt1375666
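The difference between the two shapes above comes down to what the csv writer receives: `writerows` expects an iterable of rows, and each row must itself be a sequence of fields. The following is only a minimal Python 3 sketch of that idea, not the code from the question. The names `flat`, `rows` and `buffer` are made up for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Flat list in the order the values were appended: rank, title, id, rank, ...
flat = [
    "1.", "The Shawshank Redemption (1994)", "tt0111161",
    "2.", "The Dark Knight (2008)", "tt0468569",
    "3.", "Inception (2010)", "tt1375666",
]

# Group every three consecutive values into one row.
rows = [flat[i:i + 3] for i in range(0, len(flat), 3)]

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)  # one list per row, one field per column

tsv_text = buffer.getvalue()
print(tsv_text)
```

Grouping after the fact only works if every table row contributes exactly three values; building one row per `tr` while parsing avoids that assumption.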

Sample HTML code:

    <tr class="odd detailed">
      <td class="number">
        48.
      </td>
      <td class="image">
        <a href="/title/tt0082971/" title="Raiders of the Lost Ark (1981)">
          <img alt="Raiders of the Lost Ark (1981)" height="74" src="http://ia.media-imdb.com/images/m/mv5bmja0odezmtc1nl5bml5banbnxkftztcwodm2mjaxna@@._v1._sx54_cr0,0,54,74_.jpg" title="Raiders of the Lost Ark (1981)" width="54"/>
        </a>
      </td>
      <td class="title">
        <span class="wlb_wrapper" data-caller-name="search" data-size="small" data-tconst="tt0082971">
        </span>
        <a href="/title/tt0082971/">
          Raiders of the Lost Ark
        </a>
        <span class="year_type">
          (1981)
        </span>
        <br/>

Can you please try this (not an optimized solution, but it should do the job):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc)

    def get_movie_info():
        tbl = soup.find('table')
        rows = tbl.find_all('tr')
        for row in rows:
            # Reset the three fields for every table row.
            (imagetitle, datatconst, number) = ('', '', '')
            cols = row.find_all('td')
            for col in cols:
                if col.has_attr('class') and col['class'][0] == 'image':
                    href = col.find('a')
                    imagetitle = href.get('title')
                elif col.has_attr('class') and col['class'][0] == 'title':
                    span = col.find('span')
                    if span.has_attr('class') and span['class'][0] == 'wlb_wrapper':
                        datatconst = span.get('data-tconst')
                elif col.has_attr('class') and col['class'][0] == 'number':
                    number = col.text

            # One tuple per table row, so each writerow() call emits three columns.
            yield (imagetitle, datatconst, number)

    #################################################
    import csv
    # 'wb' is the Python 2 file mode, as in the question.
    outfile = csv.writer(open('outputrows.csv', 'wb'), delimiter='\t')
    for row in get_movie_info():
        outfile.writerow(row)
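The `open('outputrows.csv', 'wb')` call above is Python 2 usage. Under Python 3 the csv module expects a text-mode file opened with `newline=''`. The following is a minimal sketch of the same generator-plus-`writerow` pattern, assuming Python 3. `fake_movie_info()` is a hypothetical stand-in for `get_movie_info()` with hard-coded sample tuples, `write_rows()` is an illustrative helper, and the output goes to a temporary directory rather than the working directory.

```python
import csv
import os
import tempfile


def fake_movie_info():
    """Hypothetical stand-in for get_movie_info(): yields one tuple per table row."""
    yield ("The Shawshank Redemption (1994)", "tt0111161", "1.")
    yield ("The Dark Knight (2008)", "tt0468569", "2.")
    yield ("Inception (2010)", "tt1375666", "3.")


def write_rows(path, rows):
    """Write each tuple as one tab-delimited line."""
    # Python 3: text mode with newline='' so the csv module controls line endings.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for row in rows:
            writer.writerow(row)


out_path = os.path.join(tempfile.mkdtemp(), "outputrows.csv")
write_rows(out_path, fake_movie_info())

# Read the file back to confirm every line carries three fields.
with open(out_path, newline="", encoding="utf-8") as handle:
    written = list(csv.reader(handle, delimiter="\t"))
print(written)
```

The `with` block also closes the file deterministically, which the one-line `csv.writer(open(...))` form leaves to the interpreter.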
