🔎 A search tool that crawls the ‘webscraping’ website to provide country search functionality!

A Python search tool that crawls this example scraping website with the BeautifulSoup library and builds an inverted index of every word occurrence, so that the user can find the pages containing a search term.

➰ Project Duration

April 2020 - May 2020

🎨 Features

  • Crawls all the pages of the website
  • Tokenizes the parsed pages, removing styling elements and stripping punctuation and whitespace
  • Creates the inverted index for the whole website
  • Prints the inverted list for a certain word
  • Finds pages containing the search terms
  • Computes a score for each page when processing a search query
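The tokenizing, indexing, and searching steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the sample page paths, and the scoring rule (the sum of each query word's occurrence count on a page) are all assumptions, since the scoring method is not described here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens,
    which drops punctuation and whitespace."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return dict(index)


def find(index, query):
    """Return (url, score) pairs for pages that contain every query word.

    Assumed scoring rule: the sum of the query words' counts on the page.
    """
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    common = set(postings[0]).intersection(*postings[1:])
    scored = [(url, sum(p[url] for p in postings)) for url in common]
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical page texts standing in for crawled pages.
pages = {
    "/view/Andorra": "Andorra is in Europe. Capital: Andorra la Vella",
    "/view/Albania": "Albania is in Europe. Capital: Tirana",
}
index = build_index(pages)
print(index["andorra"])
print(find(index, "andorra europe"))
```

Storing a count per page, rather than only the set of pages, is what makes a frequency-based score possible at query time.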

🐾 Usage Examples

Command | Explanation | Syntax
Build | Crawls the website, builds the inverted index, and saves the resulting index to the file system | build
Load | Loads the existing index from the file system | load
Print | Prints the inverted index for a particular word | print <word>
Find | Finds a query phrase in the inverted index and returns a list of all pages containing it | find <word1> <word2> <word n>
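As a rough sketch of how the load, print, and find commands could be wired together, the example below saves an index as JSON, reloads it, and dispatches command lines against it. The JSON format, the index.json file name, and the helper names are assumptions made for illustration; the project may store and dispatch differently.

```python
import json
import os
import tempfile


def save_index(index, path):
    """Write the inverted index to disk as JSON (assumed storage format)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(path):
    """Read a previously saved inverted index back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def handle(line, index):
    """Run a 'print' or 'find' command and return the lines to display."""
    parts = line.strip().lower().split()
    if not parts:
        return ["Please enter a command."]
    command, args = parts[0], parts[1:]
    if command == "print" and len(args) == 1:
        postings = index.get(args[0])
        if postings is None:
            return [f"'{args[0]}' is not in the index."]
        return [f"{url}: {count}" for url, count in sorted(postings.items())]
    if command == "find" and args:
        # Keep only the pages that contain every query word.
        hits = set(index.get(args[0], {}))
        for word in args[1:]:
            hits &= set(index.get(word, {}))
        return sorted(hits) or ["No match."]
    return [f"Unknown or incomplete command: {line.strip()}"]


# Hypothetical index; a real one would come from the build command.
demo_index = {
    "andorra": {"/view/Andorra": 2},
    "europe": {"/view/Albania": 1, "/view/Andorra": 1},
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    save_index(demo_index, path)
    restored = load_index(path)

print(handle("print andorra", restored))
print(handle("find andorra europe", restored))
print(handle("find atlantis", restored))
```

Returning the output lines instead of printing inside handle keeps the dispatcher easy to test separately from the interactive menu loop.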

- Command Line menu

[Screenshot: command-line menu]

1. Built inverted index

[Screenshot: built inverted index]

2. Print

[Screenshot: output of the print command]

3. Find

  • No match (when the search term doesn’t exist):
    [Screenshot: no match found]

  • Successful search:
    [Screenshots: successful search results]

📚 Stack / Development Environment

  • Python
  • Requests - A simple Python library for sending HTTP requests.
  • BeautifulSoup - A Python library that parses HTML and XML files into a parse tree and lets you navigate, search, and modify it.
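To show how the two libraries above might work together in the crawl step, here is a hedged sketch of a breadth-first crawler. Every name in it (fetch_html, extract_links, crawl, max_pages) is hypothetical, and restricting the crawl to the start URL's host is an assumption. Requests and BeautifulSoup are imported inside the helpers, and the fetch and link functions are passed in as parameters, so the traversal logic can be tried offline with stand-ins.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def fetch_html(url):
    """Download a page with Requests and return its HTML text."""
    import requests  # imported lazily so the sketch loads without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> found by BeautifulSoup."""
    from bs4 import BeautifulSoup  # imported lazily, as above

    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl(start_url, fetch=fetch_html, links=extract_links, max_pages=50):
    """Breadth-first crawl limited to the start URL's host.

    Returns a dict mapping each visited URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        pages[url] = fetch(url)
        for link in links(pages[url], url):
            link = link.split("#")[0]  # ignore in-page fragments
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A polite crawler would normally also pause between requests and respect the site's robots.txt; both are left out here to keep the sketch short.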

⚒ Installation / How to Run

Install the following libraries with pip:

pip install requests
pip install beautifulsoup4

urllib is part of the Python standard library, so it does not need to be installed with pip.

Running the client:

  • python3 "Crawl and Search.py"

📜 License

This project is licensed under the terms of the MIT license.

You can check out the full license here
