Metadata-Version: 2.4
Name: build-corpus
Version: 0.1.0
Summary: Convert DOCX to Markdown with tables, images, and KaTeX-readable Word equations.
Author: LIFE AI
License-Expression: Apache-2.0
Keywords: docx,markdown,omml,katex,word,converter
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: omml2latex>=0.1.1
Provides-Extra: s3
Requires-Dist: boto3>=1.34; extra == "s3"
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# Build Corpus

Build Corpus converts `.docx` files to Markdown while preserving the pieces that usually break in generic converters:

- Word OMML equations as KaTeX-readable TeX
- embedded images as local assets, base64 data URIs, or S3/R2-hosted URLs
- Markdown tables for simple Word tables
- HTML table fallback for complex tables
- headings, lists, links, bold, italic, inline code, and code-style paragraphs
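
The split between Markdown tables and the HTML fallback can be pictured with a small Python sketch. The criterion used here (a rectangular grid with no line breaks inside cells) and all function names are assumptions made for illustration; they are not Build Corpus's actual rules or API.

```python
from html import escape


def is_simple(rows):
    """Assumed criterion: non-empty rectangular grid, no newlines in cells."""
    if not rows or not rows[0]:
        return False
    width = len(rows[0])
    return all(
        len(row) == width and all("\n" not in cell for cell in row)
        for row in rows
    )


def to_markdown(rows):
    """Render rows as a pipe table, treating the first row as the header."""
    def line(cells):
        # Escape literal pipes so they do not split the cell.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in body)])


def to_html(rows):
    """Render rows as an HTML table, keeping in-cell line breaks as <br>."""
    out = ["<table>"]
    for row in rows:
        cells = ""
        for cell in row:
            text = escape(cell).replace("\n", "<br>")
            cells += "<td>" + text + "</td>"
        out.append("  <tr>" + cells + "</tr>")
    out.append("</table>")
    return "\n".join(out)


def render_table(rows):
    """Pick Markdown for simple tables and HTML for everything else."""
    return to_markdown(rows) if is_simple(rows) else to_html(rows)
```

A real converter would also have to account for merged cells and nested tables, which this sketch does not model.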

## Install

Build Corpus is a Python package and requires Python 3.10 or later:

```powershell
pip install build-corpus
```

The npm package is a convenience wrapper around the Python CLI:

```powershell
npm install -g build-corpus
```

For S3/R2 image upload support:

```powershell
pip install "build-corpus[s3]"
```

## Basic Usage

Convert a single document:

```powershell
build-corpus input.docx --out out
```

Convert every `.docx` in a folder:

```powershell
build-corpus ./word-files --out ./markdown
```

Write Markdown beside each source document:

```powershell
build-corpus ./word-files --out-same-dir
```

## Image Modes

Local asset files, the default:

```powershell
build-corpus input.docx --images assets
```

Single-file Markdown with base64 image data URIs:

```powershell
build-corpus input.docx --images base64
```
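
A base64 data URI embeds the image bytes directly in the Markdown link, which is why this mode yields a single self-contained file. The following is a minimal Python sketch of that encoding; `image_data_uri` and `markdown_image` are hypothetical helpers, not part of the Build Corpus API.

```python
import base64
import mimetypes


def image_data_uri(data: bytes, filename: str) -> str:
    """Encode image bytes as a data URI, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def markdown_image(alt: str, data: bytes, filename: str) -> str:
    """Build a Markdown image link whose target is the embedded data URI."""
    return f"![{alt}]({image_data_uri(data, filename)})"
```

Base64 output is about a third larger than the raw bytes, so this mode trades file size for portability.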

Upload images to S3-compatible storage and write public URLs:

```powershell
build-corpus input.docx --images s3 --config examples\build-corpus.config.example.json
```

Cloudflare R2 uses the same `s3` mode. Set `endpoint_url` to your account's R2 endpoint, replacing `ACCOUNT_ID` with your own account ID:

```text
https://ACCOUNT_ID.r2.cloudflarestorage.com
```
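
In `s3` mode the Markdown ends up pointing at a public URL for each uploaded image. As a rough sketch, assuming the URL is formed by joining a public base URL, an optional key prefix, and the object key, the construction could look like this in Python. The helper name and the joining rule are assumptions, not the tool's documented behavior.

```python
from urllib.parse import quote


def public_image_url(base_url: str, prefix: str, key: str) -> str:
    """Join base URL, prefix, and object key into one public URL (assumed rule)."""
    # Drop empty parts and stray slashes so the result has single separators.
    parts = [p.strip("/") for p in (prefix, key) if p and p.strip("/")]
    # quote() percent-encodes spaces and similar characters but keeps "/".
    path = "/".join(quote(p) for p in parts)
    return f"{base_url.rstrip('/')}/{path}"
```

For example, `public_image_url("https://assets.example.com", "knowledge-base", "doc/image1.png")` yields `https://assets.example.com/knowledge-base/doc/image1.png`.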

## Config

Copy `examples/build-corpus.config.example.json` and edit it for your environment.

```json
{
  "conversion": {
    "equations": "tex",
    "images": "s3"
  },
  "output": {
    "out": "out",
    "out_same_dir": false
  },
  "s3": {
    "bucket": "build-corpus-assets",
    "public_base_url": "https://assets.example.com",
    "prefix": "knowledge-base",
    "endpoint_url": "https://ACCOUNT_ID.r2.cloudflarestorage.com",
    "region_name": "auto",
    "access_key_id": "%R2_ACCESS_KEY_ID%",
    "secret_access_key": "%R2_SECRET_ACCESS_KEY%"
  }
}
```

Build Corpus expands environment variables in JSON string values (for example, `%R2_ACCESS_KEY_ID%`), so credentials can stay out of the committed config file.
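
To illustrate the idea, here is a minimal Python sketch that expands `%NAME%` placeholders in every string of a parsed config. The `%NAME%` syntax is taken from the example config above; the function name, the recursive walk, and the choice to leave unset variables untouched are assumptions rather than the tool's confirmed behavior.

```python
import os
import re

# Matches %NAME% placeholders as used in the example config.
_PERCENT_VAR = re.compile(r"%([A-Za-z_][A-Za-z0-9_]*)%")


def expand_env(value, env=None):
    """Recursively expand %NAME% placeholders in strings within a config."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # Assumption: an unset variable leaves the placeholder unchanged.
        return _PERCENT_VAR.sub(
            lambda m: env.get(m.group(1), m.group(0)), value
        )
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    return value  # numbers, booleans, and null pass through untouched
```

Calling `expand_env(json.load(f))` on the loaded config would then substitute the credentials from the current environment.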

### Output Placement

There are two output modes.

Write all converted Markdown into one output tree:

```json
{
  "output": {
    "out": "./markdown",
    "out_same_dir": false
  }
}
```

Write each `.md`, asset folder, and report beside the source `.docx`:

```json
{
  "output": {
    "out_same_dir": true
  }
}
```

Setting `out_same_dir` to `true` is equivalent to passing the flag on the command line:

```powershell
build-corpus ./word-files --out-same-dir
```
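
The two placement modes can be summarized with a short Python sketch. It assumes that the output tree mirrors the folder layout of the input, which the text above does not confirm; `markdown_path` and its parameters are hypothetical names used only for illustration.

```python
from pathlib import Path


def markdown_path(source, input_root, out=None, out_same_dir=False):
    """Return where the .md for `source` would be written under each mode."""
    source = Path(source)
    if out_same_dir:
        # Same-dir mode: the Markdown file sits next to the source .docx.
        return source.with_suffix(".md")
    if out is None:
        raise ValueError("either out or out_same_dir must be set")
    # Tree mode (assumed): keep the path relative to the input root.
    relative = source.relative_to(input_root)
    return Path(out) / relative.with_suffix(".md")
```

Under these assumptions, `word-files/a/b.docx` maps to `markdown/a/b.md` in tree mode and to `word-files/a/b.md` in same-dir mode.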

## Equations

The default equation mode is parseable TeX:

```powershell
build-corpus input.docx --equations tex
```

Image mode is intended only for visual debugging:

```powershell
build-corpus input.docx --equations image
```

## Validation

The package includes a KaTeX validator for emitted Markdown math:

```powershell
build-corpus-katex out
```

## Repeatable Test Wrappers

Run a single known DOCX through conversion plus validators:

```powershell
.\scripts\run-smoke.ps1 -Docx ".\fixtures\sample.docx" -Out ".tmp\smoke" -Images assets
```

Run a whole folder of documents as a corpus:

```powershell
.\scripts\run-corpus.ps1 -Source ".\fixtures\wordtest" -Out ".tmp\wordtest" -Images base64
```

Collect a corpus of publicly available DOCX files for regression testing, then convert it:

```powershell
python .\tools\collect_online_docx_corpus.py --out ".tmp\online-docx\source-docx" --target 50
.\scripts\run-corpus.ps1 -Source ".tmp\online-docx\source-docx" -Out ".tmp\online-docx\markdown"
```

## Failed Documents

If a document does not convert correctly, open an issue with:

- the `.docx` file if it is safe to share
- the generated `.md`
- the `export-report.json`
- the command and config used
- a screenshot of the expected Word output if layout is the issue

For confidential files, strip or replace sensitive content before sharing. The useful part is the broken DOCX structure, not the private text.
