Python 100 project #45: PyBites “Code Challenge 03 – PyBites Blog Tag Analysis”

This is day 3 of the PyBites Code Challenge. The solution uses only the standard library and needs no additional packages. It is good practice to solve a task with the standard library when I would usually reach for external libraries (e.g. HTML tag analysis).

Output:

$ python3 tags.py 
* Top 10 tags:
python               10
learning             7
tips                 6
tricks               5
cleancode            5
github               4
pythonic             4
collections          4
beginners            4
virtualenv           4

* Similar tags:
game                 games
generators           generator
best practices       best-practices
challenges           challenge
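
To see why these pairs are reported, here is a minimal sketch that prints the SequenceMatcher ratio for each pair from the output above and compares it with the SIMILAR threshold (0.87) used in the code further down. The helper name similarity and the hard-coded pairs list are only for illustration and are not part of the challenge code.

```python
from difflib import SequenceMatcher

SIMILAR = 0.87  # same threshold as in the script in this post

# Pairs copied from the "Similar tags" output above
pairs = [
    ("game", "games"),
    ("generators", "generator"),
    ("best practices", "best-practices"),
    ("challenges", "challenge"),
]


def similarity(a, b):
    """Illustrative helper: ratio is 2 * matches / (len(a) + len(b))."""
    return SequenceMatcher(None, a, b).ratio()


for a, b in pairs:
    print('{:<20} {:<20} {:.3f}'.format(a, b, similarity(a, b)))
```

For example, "game" and "games" share 4 characters out of 9 in total, so the ratio is 2 * 4 / 9, roughly 0.889, which is just above the threshold. By the same formula "python" and "pythonic" score 2 * 6 / 14, roughly 0.857, which would explain why that pair is missing from the list even though both tags are in the top 10.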

This is the code:

from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
import re

IDENTICAL = 1.0
TOP_NUMBER = 10
RSS_FEED = 'rss.xml'
SIMILAR = 0.87
TAG_HTML = re.compile(r'<category>(.*?)</category>')


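def get_tags():
    """Find all tags (TAG_HTML) in RSS_FEED.
    Dashes are kept as-is, so variants such as 'best-practices' and
    'best practices' stay distinct and show up as similar tags."""
    with open(RSS_FEED, "r") as f:
        html_source = f.read()
    # TAG_HTML is already compiled, so call findall on the pattern itself
    return TAG_HTML.findall(html_source)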


def get_top_tags(tags):
    """Get the TOP_NUMBER of most common tags
    Hint: use most_common method of Counter (already imported)"""
    return Counter(tags).most_common(TOP_NUMBER)


def get_similarities(tags):
    """Find pairs of tags with a similarity ratio of >= SIMILAR
    Hint 1: compare each tag, use for in for, or combinations from itertools (already imported)
    Hint 2: use SequenceMatcher (imported) to calculate the similarity ratio
    Bonus: for performance gain compare the first char of each tag in pair and continue if not the same"""
    # Deduplicate first so each distinct pair of tags is compared only once
    set_of_tags = set(tags)
    similarities = []
    for a, b in combinations(set_of_tags, 2):
        if SequenceMatcher(None, a, b).ratio() >= SIMILAR:
            similarities.append((a, b))
    return similarities


if __name__ == "__main__":
    tags = get_tags()
    top_tags = get_top_tags(tags)
    print('* Top {} tags:'.format(TOP_NUMBER))
    for tag, count in top_tags:
        print('{:<20} {}'.format(tag, count))
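    # Keep the list of pairs; dict() would drop pairs that share a first tag
    similar_tags = get_similarities(tags)
    print()
    print('* Similar tags:')
    # Pair order follows set iteration, so the singular form is not always first
    for tag_a, tag_b in similar_tags:
        print('{:<20} {}'.format(tag_a, tag_b))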
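
The "Bonus" hint in the get_similarities docstring is not implemented above. Below is a minimal sketch of how it could look, assuming it is acceptable to miss similar tags that differ in their first character (all four pairs in the output above share it). The function name get_similarities_sorted and the sample list are hypothetical; sorting the unique tags is an extra choice made here so that the pair order is the same on every run.

```python
from difflib import SequenceMatcher
from itertools import combinations

SIMILAR = 0.87  # same threshold as in the script above


def get_similarities_sorted(tags):
    """Hypothetical variant of get_similarities with the first-char shortcut."""
    similarities = []
    # sorted() gives a stable order; a plain set iterates in arbitrary order
    for a, b in combinations(sorted(set(tags)), 2):
        # Bonus hint: skip the expensive ratio() when the first chars differ.
        # Slicing with [:1] also copes with an empty tag string.
        if a[:1] != b[:1]:
            continue
        if SequenceMatcher(None, a, b).ratio() >= SIMILAR:
            similarities.append((a, b))
    return similarities


# Made-up sample data, not taken from the real rss.xml
sample = ["games", "game", "generator", "generators", "python", "game"]
print(get_similarities_sorted(sample))
# prints [('game', 'games'), ('generator', 'generators')]
```

Because the input is sorted, the shorter form of each pair tends to come first, which makes the output easier to read than the mixed order in the run shown at the top.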