Python 100 project #15: Web Scraping with BeautifulSoup

This time, I used the Python library BeautifulSoup to collect Doctor Who transcripts from this site.

Each transcript is stored in its own .htm file under the base URL. Visiting all of these URLs by hand is a tedious job, so this was a good chance to learn a web scraping tool.
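
As a rough sketch of how the per-episode URLs could be generated instead of typed by hand: the snippet below assumes that every page follows the '<season>-<episode>.htm' naming seen in the one URL used later in this post. That pattern, the helper name episode_urls, and the episode count are my assumptions, not something the site documents, so check a few generated URLs in a browser first.

```python
from urllib.parse import urljoin

# Base URL taken from the transcript address used in this post.
BASE_URL = 'http://www.chakoteya.net/DoctorWho/'


def episode_urls(season, episode_count, base_url=BASE_URL):
    """Build transcript URLs, assuming pages are named '<season>-<episode>.htm'."""
    return [urljoin(base_url, f'{season}-{n}.htm')
            for n in range(1, episode_count + 1)]


if __name__ == '__main__':
    # episode_count=3 is an arbitrary illustrative value.
    for url in episode_urls(27, 3):
        print(url)
```

Each generated URL can then be passed to the scraping function one at a time, ideally with a short pause between requests to be polite to the server.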

The transcripts on this site are somewhat difficult to parse, though: there is no clear distinction between dialogue and scene descriptions, because the descriptions are marked only by parentheses and not by HTML tags. So I used BeautifulSoup only for the initial text extraction and did most of the parsing with regular expressions.
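
To illustrate the idea of telling scene descriptions apart by their brackets, here is a minimal sketch that removes parenthesised or square-bracketed asides with a single regular expression. It is not the approach my code takes (that one works token by token), the sample string is made up, and the pattern does not handle nested brackets.

```python
import re

# Matches one parenthesised or bracketed aside, e.g. "(Rose runs.)" or "[Shop]".
NARRATION = re.compile(r'\([^)]*\)|\[[^\]]*\]')


def strip_narration(text):
    """Remove bracketed asides and collapse the leftover whitespace."""
    without_asides = NARRATION.sub(' ', text)
    return ' '.join(without_asides.split())


# Made-up sample in the style of the transcripts.
sample = "ROSE: Bye! (Rose leaves the flat.) JACKIE: See you later! [Shop]"
print(strip_narration(sample))
```

Only the dialogue with its speaker labels is left, which is much easier to split afterwards.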

 

Output Example:

ROSE said: Bye!
JACKIE said: See you later!
TANNOY said: This is a customer announcement. The store will be closing in five minutes. Thank you.
GUARD said: Oi!
ROSE said: Wilson? Wilson, I've got the lottery money. Wilson, are you there?
ROSE said: I can't hang about 'cos they're closing the shop. Wilson! Oh, come on.
ROSE said: Hello? Hello, Wilson, it's Rose. Hello? Wilson?
...

 

Here is the code:

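from bs4 import BeautifulSoup
from requests import get


def get_transcripts(url):
    """Download one transcript page and return (names, speeches)."""
    resp = get(url, timeout=10)
    resp.raise_for_status()  # stop early on HTTP errors such as 404
    soup = BeautifulSoup(resp.content, 'lxml')  # requires the lxml package

    # Collect every text node inside the first <td>, where the transcript is expected.
    # 'string=' is the current name of the older 'text=' argument.
    text_nodes = soup.td.find_all(string=True)

    result_str = ' '.join(text_nodes)

    return gen_speeches(result_str)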


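import re


def gen_speeches(raw_str):
    """Split transcript text into parallel lists of speaker names and speeches."""
    tokens = raw_str.split()

    # A speaker label is a run of letters followed by a colon, e.g. "ROSE:".
    # The range must be A-Z: 'A-z' also matches '[', ']', '^', '_' and similar.
    name_pat = re.compile(r'[a-zA-Z]+:')
    special_chr = set('{}[]()<>')  # characters that mark narration

    names = []
    speeches = []
    speech = ""
    narration = True  # ignore everything before the first speaker label
    for token in tokens:
        if name_pat.fullmatch(token):
            # A new speaker starts: close the previous speech, if any.
            narration = False
            names.append(token[:-1])  # drop the trailing colon
            if len(speech) > 0:
                speeches.append(speech.strip())
                speech = ""
        elif any((c in special_chr) for c in token):
            # Narration starts: close the current speech and skip until the next label.
            narration = True
            if len(speech) > 0:
                speeches.append(speech.strip())
                speech = ""
        elif not narration:
            speech += ' ' + token

    # Flush the last speech, which no later label or aside would close.
    if len(speech) > 0:
        speeches.append(speech.strip())

    return names, speeches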


if __name__ == '__main__':

    url = 'http://www.chakoteya.net/DoctorWho/27-1.htm'
    
    names, speeches = get_transcripts(url)
    
    for name, speech in zip(names, speeches):
        print(f'{name} said: {speech}')
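
One thing to watch with the two parallel lists: zip pairs them by position, so if a speaker label is followed only by narration, that name has no speech and every later pair shifts by one. A possible alternative, sketched below under the assumption that the narration has already been removed from the text, is to split on the speaker labels and keep each name next to its own speech. The function name pair_speeches and the sample string are hypothetical.

```python
import re

# A speaker label: a run of letters followed by a colon, e.g. "ROSE:".
SPEAKER = re.compile(r'\b([A-Za-z]+):')


def pair_speeches(text):
    """Return (name, speech) tuples so each speech stays with its speaker."""
    # With one capture group, split() yields
    # [preamble, name1, speech1, name2, speech2, ...].
    parts = SPEAKER.split(text)
    names, speeches = parts[1::2], parts[2::2]
    return [(name, speech.strip()) for name, speech in zip(names, speeches)]


# Made-up sample built from the output example in this post.
sample = "ROSE: Bye! JACKIE: See you later! GUARD: Oi!"
for name, speech in pair_speeches(sample):
    print(f'{name} said: {speech}')
```

A speaker with nothing to say then shows up as an empty string instead of silently stealing the next character's line. Like the original pattern, this still treats any word ending in a colon as a speaker label.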