This is day 3 of the PyBites Code Challenge. The solution uses only the Python standard library and requires no additional packages. It's good practice to solve tasks with the standard library when I would usually reach for external libraries (e.g., HTML tag analysis).
Output:
$ python3 tags.py
* Top 10 tags:
python               10
learning             7
tips                 6
tricks               5
cleancode            5
github               4
pythonic             4
collections          4
beginners            4
virtualenv           4

* Similar tags:
game                 games
generators           generator
best practices       best-practices
challenges           challenge
This is the code:
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
import re

IDENTICAL = 1.0
TOP_NUMBER = 10
RSS_FEED = 'rss.xml'
SIMILAR = 0.87
TAG_HTML = re.compile(r'<category>(.*?)</category>')


def get_tags():
    """Find all tags (TAG_HTML) in RSS_FEED.
    Replace dash with whitespace.
    Hint: use TAG_HTML.findall"""
    with open(RSS_FEED, "r") as f:
        html_source = f.read()
    # Note: dashes are not replaced here, which is why 'best-practices' and
    # 'best practices' show up as a similar pair in the output above.
    return TAG_HTML.findall(html_source)


def get_top_tags(tags):
    """Get the TOP_NUMBER of most common tags
    Hint: use most_common method of Counter (already imported)"""
    return Counter(tags).most_common(TOP_NUMBER)


def get_similarities(tags):
    """Find set of tags pairs with similarity ratio of > SIMILAR
    Hint 1: compare each tag, use for in for, or product from itertools (already imported)
    Hint 2: use SequenceMatcher (imported) to calculate the similarity ratio
    Bonus: for performance gain compare the first char of each tag in pair and continue if not the same"""
    # Deduplicate first so each unordered pair of distinct tags is compared once.
    set_of_tags = set(tags)
    similarities = []
    for a, b in combinations(set_of_tags, 2):
        if SequenceMatcher(None, a, b).ratio() >= SIMILAR:
            similarities.append((a, b))
    return similarities


if __name__ == "__main__":
    tags = get_tags()
    top_tags = get_top_tags(tags)
    print('* Top {} tags:'.format(TOP_NUMBER))
    for tag, count in top_tags:
        print('{:<20} {}'.format(tag, count))

    # Iterate over the pairs directly: converting them to a dict would
    # silently drop pairs that share the same first tag.
    similar_tags = get_similarities(tags)
    print()
    print('* Similar tags:')
    # Set iteration order is arbitrary, so the singular form is not
    # guaranteed to come first in each pair.
    for singular, plural in similar_tags:
        print('{:<20} {}'.format(singular, plural))
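The bonus hint in the get_similarities docstring (skip a pair when the first characters differ) is not implemented in the code above. Here is a minimal sketch of what that shortcut could look like. The function name similar_pairs, its threshold parameter, and the sample list are assumptions made for illustration and are not part of the original solution. Sorting the unique tags is also an addition, so that the result order is deterministic.

```python
from difflib import SequenceMatcher
from itertools import combinations

SIMILAR = 0.87  # same threshold as in the script above


def similar_pairs(tags, threshold=SIMILAR):
    """Return (a, b) pairs whose similarity ratio is >= threshold.

    Hypothetical helper: a variant of get_similarities with the
    first-character shortcut from the docstring's bonus hint.
    """
    pairs = []
    # Sort the unique tags so the pairs come out in a stable order.
    for a, b in combinations(sorted(set(tags)), 2):
        # Cheap pre-check: skip the expensive ratio() call when the
        # first characters differ (slicing also tolerates empty tags).
        if a[:1] != b[:1]:
            continue
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical sample data, loosely based on the tags in the output above.
sample = ["game", "games", "generator", "generators", "python", "github"]
for a, b in similar_pairs(sample):
    print('{:<20} {}'.format(a, b))
```

The shortcut trades accuracy for speed: two tags that differ only in their first character would never be reported, so it is worth using only when such pairs are not expected in the feed.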