This project analyzes domain data from the Common Crawl corpus. Common Crawl is a non-profit organization that maintains petabytes of web content and makes it publicly available for research and educational purposes.
Common Crawl crawls, archives, and analyzes content from public websites.
- `data/`: Raw and processed data
- `notebooks/`: Analysis notebooks
- `src/`: Source code
- Install requirements:

  ```shell
  pip install -r requirements.txt
  ```

- Run the Jupyter notebook:

  ```shell
  jupyter notebook notebooks/analysis.ipynb
  ```
[Add usage instructions here]
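Until usage instructions are written, here is a minimal sketch of the kind of domain analysis this project performs: counting how often each hostname appears in a list of crawled URLs. The `domain_counts` helper and the sample URLs are illustrative assumptions, not code from this repository.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    """Count occurrences of each hostname across a list of URLs.

    Hostnames are lowercased so that casing differences collapse
    into a single entry; URLs with no parseable host are skipped.
    """
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host:
            counts[host] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical sample data; real input would come from files in data/.
    sample = [
        "https://example.com/page1",
        "https://example.com/page2",
        "http://commoncrawl.org/about",
    ]
    for host, n in domain_counts(sample).most_common():
        print(host, n)
```

In practice the URL list would be read from the crawl extracts under `data/`, but the counting logic stays the same.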