# Website Navigation Crawler

A Python script that crawls a website, follows links within configurable depth and page limits, and stores the resulting navigation structure in JSON format.

## Features

- **Complete Site Mapping**: Crawls all pages within a specified depth and page limit
- **Navigation Detection**: Automatically identifies navigation links vs content links
- **Comprehensive Data Extraction**: Captures titles, descriptions, headings, images, forms, and meta tags
- **Respectful Crawling**: Configurable delays between requests to be server-friendly
- **JSON Output**: Structured data output for easy analysis and processing
- **Statistics**: Provides detailed statistics about the crawled site

## Installation

1. Install Python dependencies:
```bash
pip install -r requirements.txt
```

2. Optionally, make the script executable (not needed when running it with `python`):
```bash
chmod +x site_crawler.py
```

## Usage

### Command Line Usage

Basic usage:
```bash
python site_crawler.py https://example.com
```

Advanced usage with options:
```bash
python site_crawler.py https://example.com \
    --max-depth 3 \
    --max-pages 100 \
    --delay 1.0 \
    --output my_site_navigation.json
```

### Command Line Options

- `url`: Base URL to start crawling (required)
- `--max-depth`: Maximum crawl depth (default: 3)
- `--max-pages`: Maximum number of pages to crawl (default: 100)
- `--delay`: Delay between requests in seconds (default: 1.0)
- `--output`: Output JSON file name (default: navigation_structure.json)
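
The options above map naturally onto an `argparse` parser. The following is a minimal sketch of how such a parser could look, not the script's actual code: the function name `build_parser` and the help strings are assumptions, while the flag names and defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Crawl a website and map its navigation."
    )
    parser.add_argument("url", help="Base URL to start crawling")
    parser.add_argument("--max-depth", type=int, default=3,
                        help="Maximum crawl depth")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum number of pages to crawl")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="Delay between requests in seconds")
    parser.add_argument("--output", default="navigation_structure.json",
                        help="Output JSON file name")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list instead of sys.argv for demonstration.
    args = build_parser().parse_args(["https://example.com", "--max-depth", "2"])
    print(args.url, args.max_depth, args.max_pages, args.delay, args.output)
```

Note that `argparse` converts dashes to underscores, so `--max-depth` is read back as `args.max_depth`.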

### Programmatic Usage

```python
from site_crawler import WebsiteCrawler

# Create crawler instance
crawler = WebsiteCrawler(
    base_url="https://example.com",
    max_depth=2,
    max_pages=50,
    delay=0.5,
    output_file="site_navigation.json"
)

# Start crawling
navigation_structure = crawler.crawl()

# Access results
print(f"Found {len(navigation_structure['pages'])} pages")
print(f"Main menu items: {len(navigation_structure['navigation']['main_menu'])}")
```

## Output Structure

The crawler generates a JSON file with the following structure (`[...]` and `{...}` mark elided content):

```json
{
  "site_info": {
    "base_url": "https://example.com",
    "domain": "example.com",
    "total_pages": 25,
    "crawl_date": "2024-01-15 14:30:00",
    "crawl_settings": {
      "max_depth": 3,
      "max_pages": 100,
      "delay": 1.0
    }
  },
  "pages": {
    "https://example.com/": {
      "title": "Home Page",
      "description": "Welcome to our website",
      "h1": ["Main Heading"],
      "h2": ["Section Heading"],
      "navigation_links": [...],
      "content_links": [...],
      "external_links": [...],
      "images": 5,
      "forms": 1,
      "word_count": 1500,
      "meta_tags": {...}
    }
  },
  "navigation": {
    "main_menu": [
      {
        "url": "https://example.com/about",
        "frequency": 15,
        "title": "About Us"
      }
    ],
    "sitemap": [...]
  },
  "statistics": {
    "total_links": 150,
    "images": 45,
    "forms": 8,
    "avg_word_count": 1200
  }
}
```

## What Gets Extracted

### Page Information
- **Title**: Page title from `<title>` tag
- **Description**: Meta description
- **Keywords**: Meta keywords
- **Headings**: All H1, H2, H3 headings
- **Word Count**: Approximate word count of page content

### Links
- **Navigation Links**: Links in navigation menus, headers, footers
- **Content Links**: Internal links within page content
- **External Links**: Links to other domains
- **Link Text**: The visible text of each link
- **Link Titles**: Title attributes of links
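
To illustrate the internal/external split, here is a minimal sketch that resolves an `href` against the page it appears on and compares domains. The helper name `classify_link` and the `"skipped"` category for non-HTTP schemes are assumptions made for this example; the script's own rules may differ (for instance in how it treats `www.` prefixes or ports).

```python
from typing import Tuple
from urllib.parse import urldefrag, urljoin, urlparse


def classify_link(page_url: str, href: str, base_domain: str) -> Tuple[str, str]:
    """Return (absolute_url, kind); kind is 'internal', 'external', or 'skipped'.

    Hypothetical helper: compares the full network location, so
    'www.example.com' and 'example.com' count as different domains here.
    """
    # Resolve relative hrefs and drop the #fragment so duplicates collapse.
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return absolute, "skipped"  # mailto:, tel:, javascript:, ...
    if parsed.netloc.lower() == base_domain.lower():
        return absolute, "internal"
    return absolute, "external"


if __name__ == "__main__":
    page = "https://example.com/blog/"
    for href in ("../about#team", "https://other.org/x", "mailto:hi@example.com"):
        print(classify_link(page, href, "example.com"))
```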

### Media
- **Images**: All images with src, alt, and title attributes
- **Forms**: Form actions, methods, and input fields

### Meta Data
- **Meta Tags**: All meta tags with names and content
- **Open Graph**: Social media meta tags
- **Twitter Cards**: Twitter-specific meta tags

## Navigation Detection

The crawler automatically identifies navigation links by:

1. **HTML Structure**: Looking for links within `<nav>`, `<header>`, `<footer>` elements
2. **CSS Classes**: Checking for navigation-related class names
3. **Link Text**: Analyzing link text for common navigation words
4. **Frequency**: Counting how often each link appears across pages
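
The structural, class-based, and frequency heuristics above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script's implementation: the class `LinkClassifier`, the function `frequent_links`, the hint list `NAV_CLASS_HINTS`, and the 50% threshold are all hypothetical, and the link-text heuristic is omitted for brevity.

```python
from collections import Counter
from html.parser import HTMLParser

NAV_TAGS = {"nav", "header", "footer"}
NAV_CLASS_HINTS = ("nav", "menu")  # assumed hints; crude substring match


class LinkClassifier(HTMLParser):
    """Split a page's links into navigation and content links."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0  # how many nav/header/footer elements are open
        self.navigation_links = []
        self.content_links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in NAV_TAGS:
            self.nav_depth += 1
        elif tag == "a" and attr_map.get("href"):
            classes = (attr_map.get("class") or "").lower()
            in_nav = self.nav_depth > 0 or any(h in classes for h in NAV_CLASS_HINTS)
            target = self.navigation_links if in_nav else self.content_links
            target.append(attr_map["href"])

    def handle_endtag(self, tag):
        if tag in NAV_TAGS and self.nav_depth > 0:
            self.nav_depth -= 1


def frequent_links(per_page_links, min_share=0.5):
    """Return links present on at least min_share of the pages."""
    counts = Counter(url for links in per_page_links for url in set(links))
    threshold = min_share * len(per_page_links)
    return [url for url, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    html = (
        '<header><a href="/about">About</a></header>'
        '<p><a href="/post">Post</a> '
        '<a class="main-menu" href="/contact">Contact</a></p>'
        '<footer><a href="/privacy">Privacy</a></footer>'
    )
    parser = LinkClassifier()
    parser.feed(html)
    print(parser.navigation_links, parser.content_links)
```

Tracking a depth counter rather than a boolean keeps the classification correct when navigation containers are nested, such as a `<nav>` inside a `<header>`.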

## Best Practices

### For Crawling Your Own Sites
- Use shorter delays (0.5-1.0 seconds) for local/development sites
- Set appropriate max_depth based on your site structure
- Monitor the crawl progress and adjust settings as needed

### For Crawling External Sites
- Use longer delays (2.0+ seconds) to be respectful
- Check the site's robots.txt file
- Consider the site's terms of service
- Limit the number of pages crawled
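
A robots.txt check can be done with Python's standard `urllib.robotparser` before starting a crawl. The sketch below parses an in-memory sample so it runs offline; the helper name `build_robot_rules`, the user agent `MyCrawler`, and the sample rules are assumptions for illustration. Whether the crawler script performs this check itself is not stated here, so treat it as a manual pre-flight step.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Hypothetical robots.txt content for demonstration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

if __name__ == "__main__":
    rules = build_robot_rules(SAMPLE_ROBOTS)
    for url in ("https://example.com/about", "https://example.com/private/page"):
        print(url, rules.can_fetch("MyCrawler", url))
    # A Crawl-delay value, if present, is a reasonable floor for --delay.
    print("crawl delay:", rules.crawl_delay("MyCrawler"))
```

Against a live site, calling `set_url()` with the site's robots.txt URL followed by `read()` fetches and parses the file in one step.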

### Performance Tips
- Start with a small max_depth and max_pages to test
- Use the delay parameter to avoid overwhelming servers
- Monitor memory usage for large sites
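
To see how `max_depth` and `max_pages` bound a crawl, the sketch below runs a breadth-first traversal over an in-memory link graph instead of a real site. It assumes a breadth-first visiting order, which may not match the script's actual strategy; the function `bounded_crawl_order` and the sample graph are hypothetical.

```python
from collections import deque


def bounded_crawl_order(link_graph, start, max_depth, max_pages):
    """Return {url: depth} for pages visited breadth-first within both limits."""
    visited = {}
    queue = deque([(start, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue  # already reached via a shorter or equal path
        visited[url] = depth
        if depth < max_depth:  # pages at max_depth are visited but not expanded
            for target in link_graph.get(url, []):
                if target not in visited:
                    queue.append((target, depth + 1))
    return visited


if __name__ == "__main__":
    # Hypothetical site: each key links to the URLs in its list.
    graph = {
        "/": ["/a", "/b"],
        "/a": ["/a/1", "/a/2"],
        "/b": ["/b/1"],
        "/a/1": ["/a/1/x"],
    }
    print(bounded_crawl_order(graph, "/", max_depth=1, max_pages=100))
    print(bounded_crawl_order(graph, "/", max_depth=2, max_pages=4))
```

In this model the number of reachable pages can grow quickly with each extra level of depth, which is why testing with small limits first is worthwhile.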

## Example Use Cases

### 1. Site Audit
```bash
python site_crawler.py https://mysite.com --max-depth 4 --max-pages 200
```

### 2. Navigation Analysis
```python
import json

# Analyze navigation structure
with open('navigation_structure.json', 'r') as f:
    data = json.load(f)

for item in data['navigation']['main_menu']:
    print(f"{item['title']}: {item['frequency']} occurrences")
```

### 3. Content Analysis
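```python
import json

# Load the crawl output (default file name)
with open('navigation_structure.json', 'r') as f:
    data = json.load(f)
# Find pages with most content
pages = [(url, info['word_count']) for url, info in data['pages'].items()]
pages.sort(key=lambda x: x[1], reverse=True)
for url, word_count in pages[:10]:
    print(f"{url}: {word_count} words")
```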

### 4. Broken Link Detection
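```python
import json

with open('navigation_structure.json', 'r') as f:
    data = json.load(f)
# Flag uncrawled link targets; some may just lie beyond max_depth or max_pages
for url, page_info in data['pages'].items():
    for link in page_info['content_links']:
        if link['url'] not in data['pages']:
            print(f"Potential broken link: {link['url']} on {url}")
```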

## Troubleshooting

### Common Issues

1. **Connection Errors**: Check that the site is reachable and that your internet connection is working
2. **Rate Limiting**: Increase the delay parameter
3. **Memory Issues**: Reduce max_pages or max_depth
4. **Permission Errors**: Check file write permissions for the output directory

### Debug Mode

Enable debug logging by modifying the logging level in the script:
```python
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
```

## License

This script is provided as-is for educational and development purposes. Please respect website terms of service and robots.txt files when crawling external sites. 