python - Extract the text from `p` within `div` with BeautifulSoup

- March 15, 2013

I am new to web scraping with Python, and I am having a hard time extracting nested text within HTML (a p within a div, to be exact). Here is what I have got so far:

    from bs4 import BeautifulSoup
    import urllib

    # Python 2 API; on Python 3 the equivalent call is urllib.request.urlopen
    url = urllib.urlopen('http://meinparlament.diepresse.com/')
    content = url.read()
    soup = BeautifulSoup(content, 'lxml')

This works fine:

    # Print the target of every link whose title is 'zur antwort'
    links = soup.findAll('a', {'title': 'zur antwort'})
    for link in links:
        print(link['href'])

This extraction also works fine:

    # Print every div with class "content-question", markup included
    table = soup.findAll('div', attrs={"class": "content-question"})
    for x in table:
        print(x)

This is the output:

    <div class="content-question">
    <p>[...] die verhandlungen über die mögliche visabefreiung für türkische staatsbürger per ende ju... <a href="http://meinparlament.diepresse.com/frage/10144/" title="zur antwort">mehr »</a>
    </p>
    </div>

Now I want to extract the text between <p> and </p>. This is the code I use:

    table = soup.findAll('div', attrs={"class": "content-question"})
    for x in table:
        print(x['p'])

However, Python raises a KeyError.
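The KeyError comes from how BeautifulSoup treats subscripting: x['p'] looks up an HTML attribute named p on the div, not a child element, and the div only carries a class attribute. A minimal offline sketch of the difference follows. The hard-coded snippet, its sample text, and the built-in 'html.parser' backend are assumptions made so the example runs without network access or lxml.

```python
from bs4 import BeautifulSoup

# Hard-coded stand-in for the downloaded page (assumption for illustration).
html = """
<div class="content-question">
<p>[...] sample question text ... <a href="http://meinparlament.diepresse.com/frage/10144/" title="zur antwort">mehr</a></p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', attrs={'class': 'content-question'})

# Subscripting a tag reads one of its attributes; "class" exists on the div.
div_class = div['class']
print(div_class)

# The div has no attribute called "p", so this lookup raises KeyError.
try:
    div['p']
    raised = False
except KeyError:
    raised = True
print(raised)

# Child elements are reached with find() (div.p is a shortcut for the same call).
p_text = div.find('p').get_text()
print(p_text)
```

The same distinction explains why link['href'] works in the earlier loop: href is an attribute of the a tag.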

The following code finds and prints the text of each p element in the divs with class "content-question":

    from bs4 import BeautifulSoup
    import urllib

    # Python 2 API; on Python 3 the equivalent call is urllib.request.urlopen
    url = urllib.urlopen('http://meinparlament.diepresse.com/')
    content = url.read()
    soup = BeautifulSoup(content, 'lxml')

    table = soup.findAll('div', attrs={"class": "content-question"})
    for x in table:
        # find('p') returns the first p child; .text is its text content
        print(x.find('p').text)

    # Another way to retrieve the tables:
    # table = soup.select('div[class="content-question"]')

The following is the printed text of the first p element in table:

[...] die verhandlungen über die mögliche visabefreiung für türkische staatsbürger per ende juni sind noch nicht abgeschlossen, sodass nicht mit sicherheit gesagt werden kann, ob es zu diesem zeitpunkt bereits zu einer visabefreiung kommt. auch die genauen modalitäten einer solchen visaliberalisierung sind noch nicht ausverhandelt. prinzipiell ist es jedoch so, dass visaerleichterungen bzw. -liberalisierungen eine frage von reziprozität sind, d.h. dass diese für beide staaten gelten müssten. [...]
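If the text and the link target are needed together, one possible approach is to select the p elements with a CSS selector and collect both values per question. This is only a sketch: the function name extract_questions, the hard-coded sample markup, and the 'html.parser' backend are assumptions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hard-coded sample markup modelled on the page structure (assumption).
html = """
<div class="content-question">
<p>[...] first sample question ...
   <a href="http://meinparlament.diepresse.com/frage/10144/" title="zur antwort">mehr</a>
</p>
</div>
<div class="content-question">
<p>[...] second sample question ...</p>
</div>
"""

def extract_questions(markup):
    """Return a list of (text, href) pairs, one per question paragraph."""
    soup = BeautifulSoup(markup, 'html.parser')
    results = []
    # CSS selector: every p inside a div with class "content-question".
    for p in soup.select('div.content-question p'):
        link = p.find('a', attrs={'title': 'zur antwort'})
        # Collapse runs of whitespace and newlines into single spaces.
        text = ' '.join(p.get_text().split())
        # href is None when the paragraph contains no matching link.
        results.append((text, link['href'] if link else None))
    return results

questions = extract_questions(html)
for text, href in questions:
    print(text, href)
```

Note that get_text() includes the text of nested tags, so the link label ends up in the extracted string unless it is removed separately.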
