# ILGA Scraper Pipeline

This project provides an automated pipeline that scrapes Illinois General Assembly (ILGA) data on legislators, bills, votes, and committee memberships and keeps it up to date.

---

## Table of Contents
- [Setup](#setup)
- [Legislators Pipeline](#legislators-pipeline)
- [Bills Pipeline](#bills-pipeline)
- [Votes Pipeline](#votes-pipeline)
- [Committees Pipeline](#committees-pipeline)
- [Troubleshooting & Updates](#troubleshooting--updates)
- [Contributing](#contributing)
- [License](#license)

---

## Setup

1. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
2. **Configure your environment:**
   - Edit `/app/config.py` to set `CURRENT_SESSION` and other settings.
   - Ensure your `.env` file in `/app` (or `/apps` for some scripts) has correct database credentials:
     ```
     DB_HOST=your_host
     DB_NAME=your_db
     DB_USER=your_user
     DB_PASS=your_password
     CURRENT_SESSION=104
     ```
3. **Database:**
   - Ensure your MySQL database is set up with the required tables (`bill_listings`, `bill_listings_exp`, etc.).
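
Before running any script, it can help to confirm that the `.env` file defines every key listed above. The following is a minimal sketch, not part of the project: `parse_env_text` and `missing_keys` are hypothetical helper names, and the parser assumes simple `KEY=VALUE` lines.

```python
# Hypothetical helpers for sanity-checking a .env file; not part of the repo.
REQUIRED_KEYS = ("DB_HOST", "DB_NAME", "DB_USER", "DB_PASS", "CURRENT_SESSION")


def parse_env_text(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(parse_env_text(open("app/.env").read()))` would list any credentials still to be filled in, assuming the file lives at `app/.env`.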

---

## Legislators Pipeline

1. **Download member XMLs**
   - URLs are set in `config.py` as `SENATE_MEMBERS_URL` and `HOUSE_MEMBERS_URL`.
2. **Parse and store legislator data**
   - Use your custom script to parse the XMLs and populate your database.
3. **Update**
   - If a new session starts, update `CURRENT_SESSION` in `config.py`.
   - Rerun the import script to refresh legislator data.

### Troubleshooting Legislators
- If a legislator is missing, check the source XML and the `CURRENT_SESSION` value in `config.py`.
- If data is outdated, verify the XML URLs and rerun the import.

---

## Bills Pipeline

### 1. `session_analyzer.py`
- **Purpose:** Analyzes session pages and writes bill counts to `session_config.json`.
- **How to run:**
  ```bash
  python app/bills/session_analyzer.py
  ```

### 2. `bill_listing_collecter.py`
- **Purpose:** Compiles all bill listing links for each session/category, populates `bill_listings`.
- **How to run:**
  ```bash
  python app/bills/bill_listing_collecter.py
  ```

### 3. `bill_listing_expander.py`
- **Purpose:** Expands each listing into individual bills, populates `bill_listings_exp` with bill URLs and basic info.
- **How to run:**
  ```bash
  python app/bills/bill_listing_expander.py
  ```

### 4. `bill_listing_append.py`
- **Purpose:**
  - For each bill, checks the FTP last-modified date for the bill's HTML file (e.g., `10400SB0001.html`).
  - If the FTP file is newer than the database's `updated_at`, scrapes the bill page and updates `bill_listings_exp`.
  - If not, skips the bill (already up to date).
  - Uses parallel scraping for speed.
- **How to run:**
  ```bash
  python app/bills/bill_listing_append.py --limit 100
  ```
  (Adjust `--limit` as needed.)

#### How the FTP Date Check Works
- The script reads `CURRENT_SESSION` from `config.py`.
- It parses the FTP directory (e.g., [ILGA FTP HTML](https://ilga.gov/ftp/legislation/104/BillStatus/HTML/)) for all bill HTML files and their last-modified dates.
- For each bill, it constructs the filename (e.g., `10400SB0001.html` for SB0001 in session 104).
- If the FTP file's last-modified date is **newer** than the `updated_at` in your database, the bill is scraped and updated. Otherwise, it is skipped.
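
The filename construction and date comparison described above can be sketched as follows. This is an illustration rather than the code in `bill_listing_append.py`: the function names are hypothetical, the `00` segment is copied from the `10400SB0001.html` example without knowing what it encodes, and treating a missing `updated_at` as "needs scraping" is an assumption.

```python
from datetime import datetime


def bill_html_filename(session, bill_number):
    """Build the FTP filename, e.g. (104, "SB0001") -> "10400SB0001.html".

    The "00" between the session and the bill number is taken from the
    README example; its meaning is an assumption left undocumented here.
    """
    return f"{session}00{bill_number}.html"


def needs_scrape(ftp_modified, updated_at):
    """Return True when the FTP file is strictly newer than the database row.

    Assumption: a row with no updated_at has never been scraped.
    """
    if updated_at is None:
        return True
    return ftp_modified > updated_at
```

Both timestamps should be in the same timezone (or both naive) for the comparison to be meaningful.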

---

## Votes Pipeline

### Features
- Scrapes roll call vote data for all bills from the ILGA website.
- Downloads and parses roll call PDFs to extract vote totals and individual legislator votes.
- Populates the `votes` and `vote_details` tables in the database.
- Uses robust name-matching logic, including support for a manual mapping file (`manual_legislator_map.txt`) to handle name discrepancies between roll call PDFs and the database.
- Skips already-processed vote events, but will backfill missing `vote_details` for existing votes if they were previously incomplete.
- Logs unmatched legislator names for review and correction.
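
As a rough illustration of how a manual mapping file can sit in front of normalized name matching, consider the sketch below. It is not the scraper's actual logic: the `PDF name = Database name` line format, the function names, and the sample names in the usage note are all assumptions made for the example.

```python
def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name)
    return " ".join(cleaned.lower().split())


def load_manual_map(text):
    """Parse lines of the assumed form 'PDF name = Database name'."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        pdf_name, _, db_name = line.partition("=")
        mapping[normalize(pdf_name)] = db_name.strip()
    return mapping


def match_legislator(pdf_name, legislators, manual_map):
    """Return the legislator id for a roll-call name, or None if unmatched.

    legislators maps database names to ids; the manual map is consulted
    first, then the (possibly remapped) name is compared after normalizing.
    """
    target = manual_map.get(normalize(pdf_name), pdf_name)
    by_normalized = {normalize(name): lid for name, lid in legislators.items()}
    return by_normalized.get(normalize(target))
```

With `legislators = {"Jane Q. Doe": 1}` and a mapping line `Doe, J = Jane Q. Doe`, the call `match_legislator("Doe, J", legislators, manual_map)` resolves to `1`; a name that returns `None` is one that would be logged for review.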

### Table Structure

#### votes
- id (INT, PK)
- bill_id (VARCHAR)
- chamber (VARCHAR)
- vote_date (TIMESTAMP)
- vote_type (VARCHAR)
- vote_result (VARCHAR)
- yea_count (INT)
- nay_count (INT)
- present_count (INT)
- not_voting_count (INT)
- excused_count (INT)
- session (INT)
- created_at (TIMESTAMP)

#### vote_details
- id (INT, PK)
- vote_id (FK to votes)
- legislator_id (FK to legislators)
- vote_value (VARCHAR)
- created_at (TIMESTAMP)
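
Because `vote_details.vote_id` references `votes.id`, votes that still need their details backfilled can be found with a `LEFT JOIN`. The sketch below shows one way to express that query; it uses an in-memory `sqlite3` database as a stand-in for MySQL, creates only a subset of the columns, and is not taken from `votes_scraper.py`.

```python
import sqlite3

# Votes that have no rows at all in vote_details.
FIND_VOTES_MISSING_DETAILS = """
SELECT v.id
FROM votes AS v
LEFT JOIN vote_details AS d ON d.vote_id = v.id
WHERE d.id IS NULL
ORDER BY v.id
"""


def votes_missing_details(conn):
    """Return ids of votes with no matching vote_details rows."""
    return [row[0] for row in conn.execute(FIND_VOTES_MISSING_DETAILS)]


def demo():
    """Build a tiny stand-in schema and run the query against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE votes (id INTEGER PRIMARY KEY, bill_id TEXT);
        CREATE TABLE vote_details (
            id INTEGER PRIMARY KEY,
            vote_id INTEGER REFERENCES votes(id),
            legislator_id INTEGER,
            vote_value TEXT
        );
        INSERT INTO votes (id, bill_id) VALUES (1, 'SB0001'), (2, 'SB0002');
        INSERT INTO vote_details (vote_id, legislator_id, vote_value)
            VALUES (1, 10, 'Y');
    """)
    return votes_missing_details(conn)
```

In `demo()`, vote 1 has a detail row and vote 2 does not, so only vote 2 is reported as needing a backfill.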

### Usage
Run the votes scraper to populate `votes` and `vote_details`:
```bash
python app/votes/votes_scraper.py
```
- The script will log progress and any unmatched legislator names.
- To improve name matching, update `app/votes/manual_legislator_map.txt` as needed.

### Troubleshooting Votes
- **Unmatched legislator names:**
  - Check the logs for warnings about unmatched names.
  - Update the manual mapping file or improve the name-matching logic as needed.
- **Database connection issues:**
  - Verify your `.env` file is correct and the database is accessible.

---

## Committees Pipeline

### Features
- Loads all Senate and House committee memberships from official ILGA XML files.
- Parses XML files to extract committee names, codes, and all members (with party and role).
- Populates the `committees` and `committee_members` tables in the database.
- Uses name-matching logic to link committee members to legislators in the database.
- Logs unmatched names for further review.

### Table Structure

#### committees
- id (INT, PK)
- name (VARCHAR)
- chamber (VARCHAR)
- code (VARCHAR)

#### committee_members
- id (INT, PK)
- committee_id (FK to committees)
- legislator_id (FK to legislators)
- role (VARCHAR)

### Usage
Run the committees loader to populate `committees` and `committee_members`:
```bash
python app/committees/committees_loader.py
```
- The script will download and parse the latest Senate and House committee XML files.
- Logs will show the number of parsed committees and any unmatched names.

### Troubleshooting Committees
- **Unmatched legislator names:**
  - Check the logs for warnings about unmatched names.
  - Update the manual mapping file or improve the name-matching logic as needed.
- **XML parsing errors:**
  - Ensure the committee files are accessible and valid XML.
  - The script handles BOM and encoding issues. If errors persist, check the logs for the first 200 characters of the file to see what was actually received.
- **Database connection issues:**
  - Verify your `.env` file is correct and the database is accessible.
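
For reference, BOM-tolerant parsing with a 200-character preview on failure might look like the sketch below. It is an assumption about the general approach, not the loader's actual code, and `parse_committee_xml` is a hypothetical name.

```python
import codecs
import xml.etree.ElementTree as ET


def parse_committee_xml(raw_bytes):
    """Parse XML bytes, tolerating a UTF-8 BOM and leading whitespace.

    On failure, raise ValueError carrying the first 200 characters so the
    log shows what was received (an HTML error page, for instance).
    """
    data = raw_bytes
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    data = data.lstrip()  # whitespace before the XML declaration is invalid
    try:
        return ET.fromstring(data)
    except ET.ParseError as exc:
        preview = data[:200].decode("utf-8", errors="replace")
        raise ValueError(f"Invalid XML ({exc}); starts with: {preview!r}") from exc
```

Keeping the payload as bytes lets the parser honor whatever encoding the XML declaration names.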

---

## Troubleshooting & Updates

- **If a bill or legislator is missing:**
  - Check the session and bill number formatting.
  - Ensure the bill or legislator exists in the source XML/FTP for the current session.
- **If data is outdated:**
  - Verify the FTP directory or XML URLs and rerun the relevant script.
  - Check that `CURRENT_SESSION` in `config.py` is correct.
- **If scraping fails:**
  - Check for changes in the ILGA site's HTML structure or XML format.
  - Review debug output and update selectors/parsers as needed.
- **If the script is slow:**
  - The append script uses parallel scraping (default 8 threads). You can increase or decrease this in the code for your environment.
- **If you need to force a refresh:**
  - You can manually update the `updated_at` field in your database to an older date, or temporarily disable the FTP date check in the script.
- **If you add new fields or change the scraping logic:**
  - Update the relevant scripts and database schema.
  - Document any changes in this README for future reference.
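
The parallel scraping with a tunable thread count mentioned above can be sketched with `concurrent.futures`. This is a generic pattern under stated assumptions, not the append script's implementation: `scrape_all` and the `scrape_one` callable are hypothetical, and the default of 8 workers mirrors the default noted in this README.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(bill_urls, scrape_one, max_workers=8):
    """Run scrape_one(url) across a thread pool.

    Returns (results, errors): two dicts keyed by URL, so one failing
    bill does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in bill_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = exc
    return results, errors
```

Lowering `max_workers` is the gentler choice if the remote site starts throttling or timing out requests.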

---

## Contributing
Pull requests and suggestions are welcome!

## License
MIT