# Image Scraper

A Python script that crawls a website and downloads the images found on every page it visits. The scraper runs in a virtual environment and saves the collected images to an output folder (`scraped_images` by default).

## Features

- 🕸️ **Full Website Crawling**: Automatically discovers and scrapes all pages on a website
- 🖼️ **Comprehensive Image Detection**: Finds images in `<img>` tags, CSS backgrounds, and lazy-loaded content
- 🛡️ **Duplicate Prevention**: Uses content hashing to avoid downloading duplicate images
- 🤖 **Robots.txt Compliance**: Respects website robots.txt files (can be disabled)
- ⏱️ **Rate Limiting**: Configurable delays between requests to be respectful to servers
- 📁 **Organized Storage**: Saves images to one output folder with clean, conflict-free filenames
- 🔍 **Multiple Image Formats**: Supports JPG, PNG, GIF, WebP, SVG, ICO, and more
- 📊 **Detailed Reporting**: Provides comprehensive statistics after scraping
- 🎛️ **Flexible Configuration**: Many command-line options for customization

## Quick Start

### 1. Setup (One-time only)

```bash
# Make setup script executable and run it
chmod +x setup.sh
./setup.sh
```

This will:
- Create a Python virtual environment
- Install all required dependencies
- Make the scraper ready to use

### 2. Run the Scraper

#### Easy Method (Recommended)
```bash
# Make run script executable
chmod +x run_scraper.sh

# Scrape a website
./run_scraper.sh https://example.com
```

#### Manual Method
```bash
# Activate virtual environment
source venv/bin/activate

# Run the scraper
python image_scraper.py https://example.com
```

## Usage Examples

### Basic Usage
```bash
# Scrape all images from a website
./run_scraper.sh https://example.com
```

### Custom Output Directory
```bash
# Save images to a specific folder
./run_scraper.sh https://example.com -o my_website_images
```

### Rate Limiting
```bash
# Add 2-second delay between requests (be nice to servers!)
./run_scraper.sh https://example.com --delay 2.0
```
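
A fixed delay like this can be enforced with a small helper that sleeps only for the time still remaining since the previous request. The sketch below is an assumption about how `--delay` might be applied, not the scraper's actual code; `RateLimiter` is a hypothetical name.

```python
import time


class RateLimiter:
    """Hypothetical helper: keeps at least `delay` seconds between calls."""

    def __init__(self, delay: float):
        self.delay = delay
        self._last = None  # monotonic timestamp of the previous request

    def wait(self) -> float:
        """Sleep for whatever is left of the delay; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.delay - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


limiter = RateLimiter(delay=0.1)
for page in ("page-1", "page-2", "page-3"):
    limiter.wait()  # the first call returns immediately
    print("fetching", page)
```

Measuring from the previous request, rather than always sleeping the full delay, means time already spent parsing a page counts toward the wait.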

### Limit Pages
```bash
# Only scrape first 50 pages
./run_scraper.sh https://example.com --max-pages 50
```
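
To illustrate what a page limit means for a crawl, here is a minimal breadth-first sketch that follows only same-host links and stops after `max_pages` pages. It is an assumption about the crawl order, not the script's actual logic; `crawl`, `is_internal`, and the in-memory `site` dictionary are hypothetical stand-ins so the example runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def is_internal(url: str, start_url: str) -> bool:
    """Treat a link as internal when it shares the start URL's host."""
    return urlparse(url).netloc == urlparse(start_url).netloc


def crawl(start_url, get_links, max_pages=None):
    """Breadth-first crawl; `get_links(page_url)` returns that page's hrefs."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        if max_pages is not None and len(visited) >= max_pages:
            break
        page = queue.popleft()
        visited.append(page)
        for href in get_links(page):
            # Resolve relative links and drop "#fragment" suffixes.
            url, _fragment = urldefrag(urljoin(page, href))
            if url not in seen and is_internal(url, start_url):
                seen.add(url)
                queue.append(url)
    return visited


# Hypothetical site map standing in for real HTTP requests.
site = {
    "https://example.com/": ["/about", "/blog#top", "https://other.org/"],
    "https://example.com/about": ["/", "/team"],
    "https://example.com/blog": [],
    "https://example.com/team": [],
}
pages = crawl("https://example.com/", lambda u: site.get(u, []), max_pages=3)
print(pages)
```

With `max_pages=3` the crawl visits the start page, `/about`, and `/blog`, and never reaches `/team`; the external `other.org` link is skipped regardless of the limit.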

### Verbose Output
```bash
# See detailed logging
./run_scraper.sh https://example.com -v
```

### Ignore Robots.txt
```bash
# Ignore robots.txt restrictions (use responsibly!)
./run_scraper.sh https://example.com --ignore-robots
```
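
For context on what this flag bypasses, the sketch below checks URLs against robots.txt rules with the standard library's `urllib.robotparser`. Whether the scraper uses this module is an assumption; `build_robots_checker` and the `ImageScraper` agent name are hypothetical, and the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="ImageScraper", ignore_robots=False):
    """Return a function that tells whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        # With ignore_robots set, every URL is allowed (use responsibly).
        return ignore_robots or parser.can_fetch(user_agent, url)

    return allowed


ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

allowed = build_robots_checker(ROBOTS_TXT)
print(allowed("https://example.com/images/a.jpg"))
print(allowed("https://example.com/private/b.jpg"))
```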

### Combined Options
```bash
# Comprehensive scraping with custom settings
./run_scraper.sh https://example.com -o website_images --delay 1.5 --max-pages 100 -v
```

## Command Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `url` | Website URL to scrape (required) | - |
| `-o, --output` | Output directory for images | `scraped_images` |
| `-d, --delay` | Delay between requests (seconds) | `1.0` |
| `-m, --max-pages` | Maximum number of pages to crawl | unlimited |
| `--ignore-robots` | Ignore robots.txt restrictions | `False` |
| `-v, --verbose` | Enable verbose logging | `False` |
| `-h, --help` | Show help message | - |
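
The table above maps naturally onto an `argparse` parser. The following is a sketch that assumes the script uses `argparse` and represents "unlimited" as `None`; neither detail is confirmed here, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Crawl a website and download its images.")
    parser.add_argument("url", help="Website URL to scrape")
    parser.add_argument("-o", "--output", default="scraped_images",
                        help="Output directory for images")
    parser.add_argument("-d", "--delay", type=float, default=1.0,
                        help="Delay between requests (seconds)")
    parser.add_argument("-m", "--max-pages", type=int, default=None,
                        help="Maximum pages to crawl (default: unlimited)")
    parser.add_argument("--ignore-robots", action="store_true",
                        help="Ignore robots.txt restrictions")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    return parser


# Same arguments as the "Combined Options" example.
args = build_parser().parse_args(
    ["https://example.com", "-o", "website_images", "--delay", "1.5",
     "--max-pages", "100", "-v"]
)
print(args.output, args.delay, args.max_pages, args.verbose, args.ignore_robots)
```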

## How It Works

1. **Page Discovery**: Starts from the provided URL and crawls all internal links
2. **Image Detection**: On each page, finds images from:
   - `<img src="...">` tags
   - `<img data-src="...">` tags (lazy loading)
   - CSS `background-image` properties
3. **Smart Downloading**:
   - Validates image content types
   - Generates clean filenames
   - Prevents duplicate downloads using content hashing
   - Handles filename conflicts automatically
4. **Organization**: Saves every image to the output folder under a clean, non-conflicting filename
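
The duplicate check in step 3 can be pictured as follows. This is a minimal sketch, assuming SHA-256 over the downloaded bytes; the hash function the script actually uses is not stated here, and `DuplicateFilter` is a hypothetical name.

```python
import hashlib


class DuplicateFilter:
    """Remembers digests of image bytes that have already been saved."""

    def __init__(self):
        self._seen = set()

    def is_new(self, content: bytes) -> bool:
        """Return True the first time this exact content is seen."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


dedupe = DuplicateFilter()
print(dedupe.is_new(b"logo bytes"))    # first download of this content
print(dedupe.is_new(b"logo bytes"))    # same bytes served from another URL
print(dedupe.is_new(b"banner bytes"))  # different content
```

Hashing the content rather than the URL catches the same image served from several paths, at the cost of downloading the bytes before the duplicate can be detected.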

## Output Structure

```
scraped_images/
├── image_001.jpg
├── logo.png
├── banner_1.gif
├── photo_abc123.webp
└── ...
```
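
Names such as `banner_1.gif` suggest a counter suffix for files that would otherwise collide. The sketch below shows one way to derive a clean filename from a URL and pick a free path; it is an assumption about the naming scheme, and `clean_filename` and `unique_path` are hypothetical helpers.

```python
import re
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlparse


def clean_filename(image_url: str, default: str = "image") -> str:
    """Take the last path segment of the URL and replace unsafe characters."""
    name = Path(unquote(urlparse(image_url).path)).name
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    return name or default


def unique_path(directory: Path, filename: str) -> Path:
    """Append _1, _2, ... to the stem until the path is unused."""
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    first = unique_path(folder, clean_filename("https://example.com/img/banner.gif?v=2"))
    first.write_bytes(b"GIF89a")
    second = unique_path(folder, clean_filename("https://example.com/ads/banner.gif"))
    print(first.name, second.name)  # prints: banner.gif banner_1.gif
```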

## Requirements

- Python 3.7 or higher
- Internet connection
- Sufficient disk space for images

## Dependencies

The script uses these Python packages (installed automatically by `setup.sh`):
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `lxml` - Fast XML/HTML parser
- `Pillow` - Image processing utilities
- `urllib3` - Low-level HTTP client (connection pooling, retries) used by `requests`

## Ethical Usage

Please use this tool responsibly:

- ✅ **Do**: Respect website terms of service
- ✅ **Do**: Use reasonable delays between requests
- ✅ **Do**: Check the site's robots.txt before scraping
- ✅ **Do**: Only scrape content you have permission to access
- ❌ **Don't**: Overload servers with rapid requests
- ❌ **Don't**: Scrape copyrighted content without permission
- ❌ **Don't**: Ignore robots.txt without good reason

## Troubleshooting

### Setup Issues
```bash
# If setup fails, try manual installation:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### Permission Errors
```bash
# Make scripts executable:
chmod +x setup.sh run_scraper.sh image_scraper.py
```

### Network Issues
- Check your internet connection
- Some websites may block scrapers
- Try increasing the delay with the `--delay` option
- Check if the website requires special headers or authentication

### Out of Space
- The scraper can download many images
- Monitor disk space, especially for large websites
- Use `--max-pages` to limit crawling scope

## Advanced Usage

### Custom User Agent
The scraper sends a standard browser user agent. To change it, edit the `User-Agent` header in the script.

### Filtering Images
To filter images by size, type, or other criteria, modify the `_download_image` method in the script.
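
As one possible shape for such a filter, the sketch below keeps an image only when its content type is on an allow-list and its payload exceeds a minimum size. The function name, the allow-list, and the 5 KB threshold are all hypothetical; how `_download_image` is structured internally is not shown in this README, so treat this as a predicate you might call from it.

```python
# Hypothetical settings; adjust to your needs.
ALLOWED_TYPES = {"image/jpeg", "image/png", "image/webp"}
MIN_BYTES = 5 * 1024  # skip tiny icons and tracking pixels


def should_keep(content_type: str, content: bytes) -> bool:
    """Decide whether a downloaded image passes the type and size filters."""
    # Drop parameters such as "; charset=binary" before comparing.
    mime = content_type.split(";")[0].strip().lower()
    return mime in ALLOWED_TYPES and len(content) >= MIN_BYTES


print(should_keep("image/png", b"x" * 10_000))
print(should_keep("image/gif", b"x" * 10_000))
print(should_keep("image/png", b"x" * 100))
```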

### Custom Parsing
For websites with unusual image loading patterns, you may need to modify the `_extract_images_from_page` method.
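
As a starting point for such changes, here is a sketch that collects image URLs from the three sources listed under "How It Works": `src`, `data-src`, and inline CSS `background-image`. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra packages, it only looks at inline `style` attributes, and it does not reflect the actual body of `_extract_images_from_page`. `ImageCollector` and `extract_image_urls` are hypothetical names.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches url(...) inside a background or background-image declaration.
CSS_URL_RE = re.compile(
    r"background(?:-image)?\s*:[^;]*?url\(\s*['\"]?([^'\")]+)['\"]?\s*\)",
    re.IGNORECASE,
)


class ImageCollector(HTMLParser):
    """Gathers absolute image URLs in document order, without repeats."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.urls = []

    def _add(self, raw):
        if raw and raw.strip():
            url = urljoin(self.page_url, raw.strip())
            if url not in self.urls:
                self.urls.append(url)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img":
            self._add(attributes.get("src"))
            self._add(attributes.get("data-src"))  # lazy-loaded images
        for match in CSS_URL_RE.findall(attributes.get("style") or ""):
            self._add(match)


def extract_image_urls(html: str, page_url: str) -> list:
    collector = ImageCollector(page_url)
    collector.feed(html)
    collector.close()
    return collector.urls


sample_html = """
<img src="/logo.png">
<img src="" data-src="photos/cat.webp">
<div style="background-image: url('/img/hero.jpg')"></div>
"""
for url in extract_image_urls(sample_html, "https://example.com/gallery/"):
    print(url)
```

A site that stores image URLs in other attributes could be handled by adding another `_add(...)` call for that attribute in `handle_starttag`.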

## License

This tool is provided as-is for educational and legitimate scraping purposes. Users are responsible for complying with website terms of service and applicable laws.

## Support

If you encounter issues:
1. Check this README for common solutions
2. Ensure all dependencies are properly installed
3. Try running with the `-v` flag for detailed logging
4. Confirm the website loads in your browser first