Python 100 project #15: Web Scraping with BeautifulSoup

This time, I used the Python library BeautifulSoup to collect Dr. Who transcripts from this site.

Each transcript is stored in its own .htm file under the base URL. Visiting all of these URLs by hand would be tedious, so it seemed like a good chance to learn a web scraping tool.
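As a rough sketch of how the per-episode URLs could be gathered, the snippet below pulls every link ending in .htm out of an index page and joins it to the base URL. The base URL, the sample index markup, and the function name `transcript_urls` are all made up for illustration; the real site may be organised differently.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL, standing in for the real site.
BASE_URL = "https://example.com/transcripts/"

# Hypothetical index page; the real one may look quite different.
INDEX_HTML = """
<ul>
  <li><a href="1-1.htm">Episode 1</a></li>
  <li><a href="1-2.htm">Episode 2</a></li>
  <li><a href="about.html">About</a></li>
</ul>
"""


def transcript_urls(index_html, base_url=BASE_URL):
    """Return absolute URLs for every link that points to a .htm file."""
    soup = BeautifulSoup(index_html, "html.parser")
    urls = []
    for link in soup.find_all("a", href=True):
        href = link["href"]
        # Keep only .htm pages, which is where the transcripts are assumed to live.
        if href.lower().endswith(".htm"):
            urls.append(urljoin(base_url, href))
    return urls


for url in transcript_urls(INDEX_HTML):
    print(url)
```

Each URL in the list could then be downloaded in a loop, for example with `urllib.request.urlopen`, ideally with a short pause between requests to be polite to the server.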

The transcripts on this site are somewhat difficult to parse, though: nothing in the markup distinguishes dialogue from scene descriptions, which appear inside parentheses rather than in their own tags. So I used BeautifulSoup only for the initial text extraction and relied mainly on regular expressions for the rest.
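To make the idea concrete, here is a minimal sketch of that split: BeautifulSoup strips the tags, then regular expressions separate the parenthesised scene descriptions from the spoken lines. The sample HTML, the `SPEAKER: words` line format, and the names `split_transcript`, `PAREN`, and `SPEAKER` are assumptions for illustration, not taken from the site.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page; the real transcripts may be laid out differently.
SAMPLE_HTML = """
<html><body>
<p>[Tardis]</p>
<p>DOCTOR: Hello there. (He flips a switch.) Hold on tight!<br>
ROSE: Where are we going?<br>
(The Tardis shakes.)</p>
</body></html>
"""

# Text inside one pair of parentheses (no nesting) is treated as a scene description.
PAREN = re.compile(r"\(([^()]*)\)")
# Assumed dialogue format: an upper-case speaker name, a colon, then the words.
SPEAKER = re.compile(r"^([A-Z][A-Z0-9 .'\-]*):\s*(.*)$")


def split_transcript(html):
    """Return (dialogue, directions) extracted from one transcript page."""
    # Step 1: let BeautifulSoup drop the tags and hand back plain text.
    text = BeautifulSoup(html, "html.parser").get_text("\n")

    # Step 2: collect the parenthesised descriptions, then remove them from the text.
    directions = [found.strip() for found in PAREN.findall(text)]
    spoken = PAREN.sub(" ", text)

    # Step 3: keep only the lines that look like "SPEAKER: words".
    dialogue = []
    for line in spoken.splitlines():
        match = SPEAKER.match(line.strip())
        if match:
            speaker, words = match.groups()
            dialogue.append((speaker, " ".join(words.split())))
    return dialogue, directions


dialogue, directions = split_transcript(SAMPLE_HTML)
for speaker, words in dialogue:
    print(f"{speaker}: {words}")
for direction in directions:
    print(f"({direction})")
```

This simple version assumes each speech fits on one line and that parentheses are never nested; a speech that wraps across several lines would need extra handling.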

 

Output Example:

 

Here is the code: