
split extraction and parsing into 2 stages + better lecturer shorthand parsing + better documentation

Noah Vogt, 1 month ago
Parent
Current commit
4cd3c683bd
8 files changed, with 366 insertions and 35 deletions
  1. README.md (+44 -6)
  2. config/constants.py (+14 -7)
  3. extract_lecturer_shorthands_pdf.py (+193 -0)
  4. parse/__init__.py (+1 -1)
  5. parse/models.py (+24 -3)
  6. parse/parse_modules.py (+62 -7)
  7. parse/table_extraction.py (+2 -2)
  8. parse_class_pdf.py (+26 -9)

+ 44 - 6
README.md

@@ -4,22 +4,57 @@ Provides backend tooling for [modulplaner](https://codeberg.org/Modulplaner/modu
 
 Because the original repo only contains frontend code and data updates were slow and opaque, I created this repo as a solution.
 
+## Installation
+
+You need to install
+
+- [python3](https://www.python.org)
+- the python dependencies in `requirements.txt`
+
+For some simple commands mentioned in this documentation, it is also recommended to install and set up
+
+- [jq](https://github.com/jqlang/jq)
+- a working POSIX shell environment
+
 ## Basic Usage
 
-After installing the [python3](https://www.python.org/) dependencies in `requirements.txt`, execute `parse_class_pdf.py` to parse a class timetable PDF.
+This section is organized by the scripts this repository provides.
+
+### parse_class_pdf.py
 
-```bash
-./parse_class_pdf.py [-h] [-l LECTURERS] [-i INPUT] [-o OUTPUT] [lecturers_pos]
+Execute the following to parse a class timetable PDF into the `classes.json` file needed by the frontend.
+
+```sh
+./parse_class_pdf.py [-h] [-l LECTURERS] [-i INPUT] [-o OUTPUT] [--save-intermediate SAVE_INTERMEDIATE] [--load-intermediate LOAD_INTERMEDIATE]
 ```
 
-### Arguments
+#### Arguments
 
 - `-i`, `--input`: Path to the input PDF file. Defaults to `klassen.pdf`.
 - `-o`, `--output`: Path to the output JSON file. Defaults to `classes.json`.
-- `-l`, `--lecturers` or `lecturers_pos`: Path to the `lecturers.json` file. If provided, it is used to validate lecturer shorthands during parsing.
+- `-l`, `--lecturers`: Path to the `lecturers.json` file. If provided, it is used to validate lecturer shorthands during parsing.
+- `--save-intermediate`: Path to save the intermediate extraction data (pickle format) and exit. Useful for skipping the slow extraction stage in subsequent runs.
+- `--load-intermediate`: Path to load the intermediate extraction data from (pickle format) and skip extraction.
 
 The default values for input and output files are defined in `config/constants.py`.
 
+#### Faster Development Cycle
+
+Since the PDF extraction takes a significant amount of time, you can split the process into two stages:
+
+1.  **Stage 1 (Extraction):** Run once and save the result: `./parse_class_pdf.py --save-intermediate data.pkl`
+2.  **Stage 2 (Parsing):** Load the saved data and iterate on the parsing logic: `./parse_class_pdf.py --load-intermediate data.pkl --output classes.json`
+
+### extract_lecturer_shorthands_pdf.py
+
+Use this script to parse a lecturer shorthand PDF into the `lecturers.json` file needed by the frontend. Note that if you don't merge the script output with your previous `lecturers.json` file, the view of previous semesters may break. You can easily do that using `jq`:
+
+```sh
+jq -s 'add | unique' previous_lecturers.json script_output.json > merged.json
+```
+
+For more information, list the CLI arguments with `./extract_lecturer_shorthands_pdf.py -h`.
+
 ## Project Roadmap
 
 Currently I am working on refining the core data generation. In the future, I can see myself also working on:
@@ -38,7 +73,10 @@ Currently I am working on refining the core data generation. In the future, I ca
 - The redundant class name in the class pdf cells sometimes gets mixed up with the module shorthand, which is especially annoying when part of the class name gets cut off too (is handled)
 - missing degree programs in the text above the table need to be guessed via ugly heuristics
 - there is a class called `alle` which is degree program agnostic
-- degree_program's `Kontext BWL`, `Kontext Kommunikation`, `Kontext GSW` have mixed classes, which arises the need the have a table to differentiate the modules
+- degree_program's `Kontext BWL`, `Kontext Kommunikation`, `Kontext GSW` have mixed classes, which creates the need for a table that differentiates the modules based on their shorthands
+- some lecturer shorthands present in the class timetable pdfs are missing altogether from both the lecturer shorthands pdf and the lecturer timetable pdf
+- the lecturer shorthands pdf and the lecturer timetable pdf use different shorthands for the same lecturer (same full name)
+- some timeslots for modules that belong to the same class appear in the class timetable pdf but not in the lecturer pdf
 
 ## Problems in the Frontend Data Formats
 - there seem to be `teaching_type`'s defined that may not ever be found in class pdf's
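
The two-stage flow described under "Faster Development Cycle" in the README can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the repository's code: `slow_extract`, `parse`, and `run` are hypothetical stand-ins for the real extraction and parsing steps, and the sample data is made up. Since `pickle.load` can execute arbitrary code, it seems wise to load only intermediate files you created yourself.

```python
import pickle


def slow_extract(pdf_path: str) -> list[dict]:
    """Hypothetical stand-in for the slow PDF extraction stage."""
    return [{"page": 1, "source": pdf_path, "cells": ["mod\nabcdef\nroom"]}]


def parse(extraction_data: list[dict]) -> list[str]:
    """Hypothetical stand-in for the fast parsing stage: first line of each cell."""
    return [cell.split("\n")[0] for page in extraction_data for cell in page["cells"]]


def run(pdf_path, save_intermediate=None, load_intermediate=None):
    """Loading skips extraction; saving stops right after extraction."""
    if load_intermediate:
        with open(load_intermediate, "rb") as f:
            extraction_data = pickle.load(f)  # only load files you trust
    else:
        extraction_data = slow_extract(pdf_path)
        if save_intermediate:
            with open(save_intermediate, "wb") as f:
                pickle.dump(extraction_data, f)
            return None  # stage 1 ends here, nothing is parsed
    return parse(extraction_data)
```

Calling `run("klassen.pdf", save_intermediate="data.pkl")` once and then `run("klassen.pdf", load_intermediate="data.pkl")` repeatedly mirrors the two README commands: the expensive step runs once, and only the parsing step is repeated while iterating.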

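For readers without `jq`, the merge step from the README (`jq -s 'add | unique'`) can be approximated in Python. This is only a sketch under assumptions: `merge_unique` is a hypothetical helper, the sample entries are invented, and the resulting order is not guaranteed to match jq's own sort order. The field names `short`, `surname`, and `firstname` follow the `Lecturer` model in this commit.

```python
import json


def merge_unique(*lecturer_lists: list[dict]) -> list[dict]:
    """Concatenate the lists and drop exact duplicate entries."""
    seen: dict[str, dict] = {}
    for lecturers in lecturer_lists:
        for entry in lecturers:
            # canonical JSON text serves as a hashable identity for the entry
            key = json.dumps(entry, sort_keys=True, ensure_ascii=False)
            seen.setdefault(key, entry)
    # sort for a stable result; jq's ordering rules differ in detail
    return [seen[key] for key in sorted(seen)]


# made-up example data
previous = [{"short": "abcd", "surname": "Muster", "firstname": "Anna"}]
current = [
    {"short": "abcd", "surname": "Muster", "firstname": "Anna"},
    {"short": "efgh", "surname": "Beispiel", "firstname": "Ben"},
]
merged = merge_unique(previous, current)
```

Only entries that are identical in every field are collapsed, so a lecturer whose shorthand changed between semesters stays in the result under both shorthands.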
+ 14 - 7
config/constants.py

@@ -1,12 +1,6 @@
 CLASS_PDF_INPUT_FILE = "klassen.pdf"
-
 CLASSES_JSON_OUTPUT_FILE = "classes.json"
-
-TOLERANCE = 3
-
-LECTURER_SHORTHAND_SIZE = 6
-
-TABLE_SETTINGS = {
+CLASS_PDF_TABLE_SETTINGS = {
     "vertical_strategy": "lines",
     "horizontal_strategy": "lines",
     "snap_tolerance": 3,
@@ -14,6 +8,19 @@ TABLE_SETTINGS = {
     "edge_min_length": 3,
 }
 
+TOLERANCE = 3
+
+LECTURER_SHORTHAND_SIZE = 6
+LECTURER_SHORTHAND_PDF_PDF_INPUT_FILE = "lecturer_shorthands.pdf"
+LECTURER_SHORTHAND_JSON_OUTPUT_FILE = "lecturers.json"
+LECTURER_SHORTHAND_PDF_TABLE_SETTINGS = {
+    "vertical_strategy": "text",
+    "horizontal_strategy": "text",
+    "snap_tolerance": 5,
+    "intersection_x_tolerance": 15,
+}
+LECTURER_SHORTHAND_PDF_ROW_SKIP_VALUES = ["Name Nachname Vorname", "vak"]
+
 ALLOWED_TIMESLOTS = [
     ("8:15", "9:00"),
     ("9:15", "10:00"),

+ 193 - 0
extract_lecturer_shorthands_pdf.py

@@ -0,0 +1,193 @@
+#!/usr/bin/env python3
+
+import logging
+from argparse import ArgumentParser
+
+import pdfplumber
+from pdfplumber.table import Table
+from pydantic import TypeAdapter
+
+from config import (
+    LECTURER_SHORTHAND_PDF_TABLE_SETTINGS,
+    LECTURER_SHORTHAND_PDF_PDF_INPUT_FILE,
+    LECTURER_SHORTHAND_JSON_OUTPUT_FILE,
+)
+from parse import RawLecturer, Lecturer
+
+
+def extract_rows_from_lecturer_shorthand_pdf(input_file) -> list[RawLecturer]:
+    lecturers: list[RawLecturer] = []
+
+    with pdfplumber.open(input_file) as pdf:
+        # find the X coordinates of "Nachname" and "Vorname" on the first page
+        # to use as fixed separators for all rows. This assumes they do not
+        # deviate their x values on subsequent pages.
+        first_page = pdf.pages[0]
+        nachname_rects = first_page.search("Nachname")
+        vorname_rects = first_page.search("Vorname")
+
+        sep_x_1 = 0
+        sep_x_2 = 0
+
+        if nachname_rects and vorname_rects:
+            # Subtract 2 pixels to ensure the start of the letter is caught
+            # even if it drifts slightly left.
+            sep_x_1 = nachname_rects[0]["x0"] - 2
+            sep_x_2 = vorname_rects[0]["x0"] - 2
+            logging.debug(
+                "calculated separators: %d (Nachname), %d (Vorname)", sep_x_1, sep_x_2
+            )
+        else:
+            raise RuntimeError("Could not find headers for separator calculation")
+
+        lines_y1: list = []
+        min_line_y1 = 0
+        max_line_y1 = 0
+
+        for page_index, page in enumerate(pdf.pages):
+            # Remove top header and bottom footer based on first / last line.
+            # Assumes the header and footer positions do not go beyond these
+            # values on subsequent pages.
+            if page_index == 0:
+                for line in page.lines:
+                    lines_y1.append(line.get("y1"))
+                if lines_y1:
+                    min_line_y1 = min(lines_y1)
+                    max_line_y1 = max(lines_y1)
+
+            # guard against empty lines list if page has no lines
+            if not lines_y1:
+                logging.warning("First page has no lines")
+                crop_box = (0, 0, page.width, page.height)
+            else:
+                crop_box = (0, min_line_y1, page.width, max_line_y1)
+
+            cropped_page = page.crop(crop_box)
+
+            found_tables: list[Table] = cropped_page.find_tables(
+                LECTURER_SHORTHAND_PDF_TABLE_SETTINGS
+            )
+
+            if len(found_tables) != 1:
+                raise RuntimeError(
+                    "Did not find exactly 1 table in the lecturer shorthands pdf"
+                    + f" on page {page_index + 1}"
+                )
+
+            table: Table = found_tables[0]
+
+            for row_index, row in enumerate(table.rows):
+                if row is None:
+                    logging.debug("None table row found")
+                    continue
+
+                valid_cells = [cell for cell in row.cells if cell is not None]
+
+                if not valid_cells:
+                    continue
+
+                row_top = valid_cells[0][1]
+                row_bottom = valid_cells[0][3]
+                row_left = valid_cells[0][0]
+                row_right = valid_cells[-1][2]
+
+                row_bbox = (row_left, row_top, row_right, row_bottom)
+
+                logging.debug("row %d dimensions: %s", row_index, row_bbox)
+
+                # column 1: From start of row -> Nachname separator
+                col1_bbox = (row_left, row_top, sep_x_1, row_bottom)
+                # column 2: From Nachname separator -> Vorname separator
+                col2_bbox = (sep_x_1, row_top, sep_x_2, row_bottom)
+                # column 3: From Vorname separator -> End of row
+                col3_bbox = (sep_x_2, row_top, row_right, row_bottom)
+
+                logging.debug("col 1 bbox: %s", col1_bbox)
+                logging.debug("col 2 bbox: %s", col2_bbox)
+                logging.debug("col 3 bbox: %s", col3_bbox)
+
+                row_text: str = cropped_page.crop(row_bbox).extract_text()
+                logging.debug("row text: %s", row_text)
+                col1_text = cropped_page.crop(col1_bbox).extract_text()
+                logging.debug("col 1 text: %s", col1_text)
+                col2_text = cropped_page.crop(col2_bbox).extract_text()
+                logging.debug("col 2 text: %s", col2_text)
+                col3_text = cropped_page.crop(col3_bbox).extract_text()
+                logging.debug("col 3 text: %s", col3_text)
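+                # RawLecturer fields are (shorthand, firstname, surname):
+                # column 2 holds the surname, column 3 the first name.
+                lecturers.append(
+                    RawLecturer(
+                        shorthand=col1_text,
+                        firstname=col3_text,
+                        surname=col2_text,
+                    )
+                )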
+
+    return lecturers
+
+
+def is_table_header_row(raw_lecturer: RawLecturer) -> bool:
+    return (
+        raw_lecturer.shorthand == "Name"
+        and raw_lecturer.surname == "Nachname"
+        and raw_lecturer.firstname == "Vorname"
+    )
+
+
+def is_vak_example_row(raw_lecturer: RawLecturer) -> bool:
+    return (
+        raw_lecturer.shorthand == "vak"
+        and raw_lecturer.surname == ""
+        and raw_lecturer.firstname == ""
+    )
+
+
+def get_lecturers_json(lecturers: list[Lecturer]) -> str:
+    """
+    Serializes a list of Lecturer objects into a compact JSON string.
+    """
+    adapter = TypeAdapter(list[Lecturer])
+    return adapter.dump_json(lecturers).decode("utf-8")
+
+
+def parse_lecturers(raw_lecturers: list[RawLecturer]) -> list[Lecturer]:
+    lecturers: list[Lecturer] = []
+    for raw_lecturer in raw_lecturers:
+        if is_table_header_row(raw_lecturer) or is_vak_example_row(raw_lecturer):
+            logging.debug("skipping raw lecturer: %s", raw_lecturer)
+        else:
+            new_lecturer: Lecturer = Lecturer(
+                short=raw_lecturer.shorthand,
+                surname=raw_lecturer.surname,
+                firstname=raw_lecturer.firstname,
+            )
+            if new_lecturer in lecturers:
+                logging.debug("skipped over duplicate lecturer: %s", new_lecturer)
+            else:
+                lecturers.append(new_lecturer)
+    return lecturers
+
+
+def main() -> None:
+    parser = ArgumentParser(description="Parse lecturer shorthand PDF to JSON.")
+    parser.add_argument(
+        "-i",
+        "--input",
+        help="Path to the input PDF file",
+        default=LECTURER_SHORTHAND_PDF_PDF_INPUT_FILE,
+    )
+    parser.add_argument(
+        "-o",
+        "--output",
+        help="Path to the output JSON file",
+        default=LECTURER_SHORTHAND_JSON_OUTPUT_FILE,
+    )
+    args = parser.parse_args()
+
+    logging.basicConfig(level=logging.INFO)
+
+    raw_lecturers: list[RawLecturer] = extract_rows_from_lecturer_shorthand_pdf(
+        args.input
+    )
+    lecturers: list[Lecturer] = parse_lecturers(raw_lecturers)
+    json_output: str = get_lecturers_json(lecturers)
+
+    with open(args.output, "w", encoding="utf-8") as f:
+        f.write(json_output)
+
+
+if __name__ == "__main__":
+    main()
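
The column splitting in `extract_rows_from_lecturer_shorthand_pdf` boils down to cutting each row's bounding box at two fixed x positions. A minimal sketch of that geometry, without pdfplumber, might look like this. `BBox` and `split_row_into_columns` are hypothetical names, and the ordering check is an addition that the script itself does not perform.

```python
BBox = tuple[float, float, float, float]  # (x0, top, x1, bottom)


def split_row_into_columns(
    row_bbox: BBox, sep_x_1: float, sep_x_2: float
) -> tuple[BBox, BBox, BBox]:
    """Cut a row into shorthand, surname, and first-name boxes at two x separators."""
    row_left, row_top, row_right, row_bottom = row_bbox
    # assumption: both separators lie inside the row, left to right
    if not row_left < sep_x_1 < sep_x_2 < row_right:
        raise ValueError("separators must lie inside the row, in order")
    shorthand = (row_left, row_top, sep_x_1, row_bottom)
    surname = (sep_x_1, row_top, sep_x_2, row_bottom)
    firstname = (sep_x_2, row_top, row_right, row_bottom)
    return shorthand, surname, firstname
```

Each returned box shares the row's top and bottom, so cropping the page to a box and extracting its text yields exactly one cell per column.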

+ 1 - 1
parse/__init__.py

@@ -4,4 +4,4 @@ from .parse_modules import (
     get_modules_for_class_json,
     deduplicate_modules,
 )
-from .models import ClassPdfExtractionPageData, ClassJsonModule
+from .models import ClassPdfExtractionPageData, ClassJsonModule, RawLecturer, Lecturer

+ 24 - 3
parse/models.py

@@ -1,6 +1,6 @@
 from enum import Enum, unique
 from dataclasses import dataclass
-from typing import Annotated, TYPE_CHECKING
+from typing import Annotated
 
 from pydantic import BaseModel, PlainSerializer, Field, ConfigDict
 
@@ -142,9 +142,24 @@ class TeachingType(Enum):
 
 
 @dataclass
-class Teacher:
+class RawLecturer:
+    """
+    Basic representation of an extracted lecturer from a pdf that needs to be parsed.
+    """
+
     shorthand: str
-    full_name: str
+    firstname: str
+    surname: str
+
+
+class Lecturer(BaseModel):
+    """
+    JSON-serializable representation of a parsed lecturer ready to be exported.
+    """
+
+    short: str
+    surname: str
+    firstname: str
 
 
 # tells pydantic to use the index field for the special Weekday Enum
@@ -182,3 +197,9 @@ class ClassJsonModule(BaseModel):
 class ClassPdfExtractionPageData:
     raw_extracted_modules: list[RawExtractedModule]
     page_metadata: PageMetadata
+
+
+@dataclass
+class StartsWithMatch:
+    shorthand_found: str
+    num_of_matches: int

+ 62 - 7
parse/parse_modules.py

@@ -12,6 +12,7 @@ from .models import (
     DegreeProgram,
     TeachingType,
     Weekday,
+    StartsWithMatch,
 )
 
 
@@ -21,10 +22,14 @@ def get_modules_for_class_json(
     degree_program: DegreeProgram,
     valid_lecturer_shorthands: list[str] | None = None,
 ) -> list[ClassJsonModule]:
+    """
+    Parses the Raw Extracted Modules from the class pdf into the format to
+    export them to the classes.json file.
+    """
     output_modules: list[ClassJsonModule] = []
 
     for input_module in modules:
-        parsed_data: ParsedModuleCellTextData = parse_module_cell_text(
+        parsed_data: ParsedModuleCellTextData = parse_module_class_pdf_cell_text(
             input_module.text, class_name, degree_program, valid_lecturer_shorthands
         )
 
@@ -88,12 +93,15 @@ def parse_mixed_degree_programs(
     return degree_program
 
 
-def parse_module_cell_text(
+def parse_module_class_pdf_cell_text(
     text: str,
     class_name: str,
     degree_program: DegreeProgram,
     valid_lecturer_shorthands: list[str] | None = None,
 ) -> ParsedModuleCellTextData:
+    """
+    Parse a single class pdf module cell text.
+    """
     lines = text.split("\n")
     logging.debug("Parsing module cell text: \n%s", text)
     if len(lines) != 3 and len(lines) != 2:
@@ -123,24 +131,64 @@ def parse_module_cell_text(
 def get_lecturer_shortnames(
     second_line: str, valid_lecturer_shorthands: list[str] | None = None
 ) -> list[str]:
+    """
+    Get the lecturer shorthand based on the second class pdf cell line.
+    You can provide a list of valid lecturer shorthands for more accurate parsing.
+    """
     lecturer_shorthands: list[str] = []
     words = second_line.split(" ")
     if valid_lecturer_shorthands is None:
         for word in words:
             if len(word) == LECTURER_SHORTHAND_SIZE:
                 lecturer_shorthands.append(word)
+            else:
+                logging.warning("Could not get lecturer shorthand from word: %s", word)
     else:
         for word in words:
-            if word in valid_lecturer_shorthands or (
-                len(word) == LECTURER_SHORTHAND_SIZE and shorthand.startswith(word)
-                for shorthand in valid_lecturer_shorthands
-            ):
-                lecturer_shorthands.append(word)
+            exact_starts_with_match = matches_startswith(
+                word, valid_lecturer_shorthands
+            )
+            minus_last_char_starts_with_match = matches_startswith(
+                word[:-1], valid_lecturer_shorthands
+            )
 
+            if word in valid_lecturer_shorthands:
+                lecturer_shorthands.append(word)
+            elif is_valid_starts_with_match(exact_starts_with_match):
+                lecturer_shorthands.append(exact_starts_with_match.shorthand_found)
+            elif is_valid_starts_with_match(minus_last_char_starts_with_match):
+                lecturer_shorthands.append(
+                    minus_last_char_starts_with_match.shorthand_found
+                )
+            else:
+                logging.warning("Could not get lecturer shorthand from word: %s", word)
     return lecturer_shorthands
 
 
+def is_valid_starts_with_match(exact_starts_with_match: StartsWithMatch) -> bool:
+    return (
+        exact_starts_with_match.shorthand_found != ""
+        and exact_starts_with_match.num_of_matches == 1
+    )
+
+
+def matches_startswith(
+    word: str, valid_lecturer_shorthands: list[str]
+) -> StartsWithMatch:
+    shorthand_with_start: str = ""
+    # count the matches to make sure the matching is unambiguous
+    num_of_startwith_matches: int = 0
+    for shorthand in valid_lecturer_shorthands:
+        if shorthand.startswith(word):
+            shorthand_with_start = shorthand
+            num_of_startwith_matches += 1
+    return StartsWithMatch(shorthand_with_start, num_of_startwith_matches)
+
+
 def get_module_shorthand(first_line: str, class_name: str) -> str:
+    """
+    Get the module shorthand based on the first class pdf cell line.
+    """
     words = first_line.split(" ")
     if len(words) < 1:
         raise RuntimeError("Cannot extract module shorthand")
@@ -162,18 +210,25 @@ def get_id(
     start_seconds: int,
     end_seconds: int,
 ) -> str:
+    """Calculate the json id of a module."""
     return (
         f"{class_name}-{module_shorthand}-{weekday.index}-{start_seconds}-{end_seconds}"
     )
 
 
 def get_teaching_type(third_line: str) -> TeachingType:
+    """
+    Get the teaching type based on the third class pdf cell line.
+    """
     if "Online" in third_line:
         return TeachingType.ONLINE
     return TeachingType.ON_SITE
 
 
 def get_rooms(third_line: str) -> list[str]:
+    """
+    Get the rooms based on the third class pdf cell line.
+    """
     if "DSMixe" in third_line:
         return []
 

+ 2 - 2
parse/table_extraction.py

@@ -2,7 +2,7 @@ import logging
 from pdfplumber.page import Page
 import pdfplumber
 
-from config import TABLE_SETTINGS, ALLOWED_TIMESLOTS
+from config import CLASS_PDF_TABLE_SETTINGS, ALLOWED_TIMESLOTS
 from .models import (
     Weekday,
     TimeSlot,
@@ -120,7 +120,7 @@ def extract_data_from_class_pdf(
             for day in Weekday:
                 weekday_areas[day] = Area(0, 0, 0, 0)
 
-            found_tables = page.find_tables(TABLE_SETTINGS)
+            found_tables = page.find_tables(CLASS_PDF_TABLE_SETTINGS)
             logging.debug(
                 "amount of tables found on page %d: %d",
                 page_index + 1,

+ 26 - 9
parse_class_pdf.py

@@ -1,7 +1,8 @@
 #!/usr/bin/env python3
 
 import logging
-import argparse
+from argparse import ArgumentParser
+import pickle
 import json
 
 from parse import (
@@ -22,6 +23,7 @@ def get_valid_lecturers(file_path: str) -> list[str]:
     """
     valid_lecturers: list[str] = []
     try:
+        print(f"READING: '{file_path}'")
         with open(file_path, "r", encoding="utf-8") as f:
             data = json.load(f)
             if isinstance(data, list):
@@ -37,7 +39,7 @@ def get_valid_lecturers(file_path: str) -> list[str]:
 
 
 def main() -> None:
-    parser = argparse.ArgumentParser(description="Parse class PDF to JSON.")
+    parser = ArgumentParser(description="Parse class PDF to JSON.")
     parser.add_argument(
         "-l", "--lecturers", help="Path to the lecturers.json file", default=None
     )
@@ -51,14 +53,18 @@ def main() -> None:
         default=CLASSES_JSON_OUTPUT_FILE,
     )
     parser.add_argument(
-        "lecturers_pos",
-        nargs="?",
-        help="Path to the lecturers.json file (positional)",
+        "--save-intermediate",
+        help="Path to save the intermediate extraction data (pickle format) and exit",
+        default=None,
+    )
+    parser.add_argument(
+        "--load-intermediate",
+        help="Path to load the intermediate extraction data from (pickle format) and skip extraction",
         default=None,
     )
 
     args = parser.parse_args()
-    lecturers_file = args.lecturers or args.lecturers_pos
+    lecturers_file = args.lecturers
 
     logging.basicConfig(level=logging.DEBUG)
 
@@ -66,9 +72,20 @@ def main() -> None:
     if lecturers_file:
         valid_lecturer_shorthands = get_valid_lecturers(lecturers_file)
 
-    extraction_data: list[ClassPdfExtractionPageData] = extract_data_from_class_pdf(
-        args.input
-    )
+    extraction_data: list[ClassPdfExtractionPageData]
+
+    if args.load_intermediate:
+        logging.info("Loading intermediate data from %s", args.load_intermediate)
+        with open(args.load_intermediate, "rb") as f:
+            extraction_data = pickle.load(f)
+    else:
+        extraction_data = extract_data_from_class_pdf(args.input)
+        if args.save_intermediate:
+            logging.info("Saving intermediate data to %s", args.save_intermediate)
+            with open(args.save_intermediate, "wb") as f:
+                pickle.dump(extraction_data, f)
+            return
+
     parsed_modules: list[ClassJsonModule] = [
         module
         for data in extraction_data