Reranker Study

Reranker Test Matrix

Select models, enter a query, and run the reranking evaluation against the 10-paper corpus.

Preset queries: immune evasion · macrophage role · host-directed therapy · antigen processing · epigenetics


Source of Truth — Block Annotations

Each paper block is manually tied to the four mechanisms and graded for relevance (3 = highly relevant, 2 = relevant, 1 = tangential, 0 = irrelevant).

Research Question: How does M. tuberculosis survive after being engulfed by immune cells — (A) preventing the phagosome from becoming destructive, (B) interfering with acidification & lysosome fusion, (C) manipulating host signaling through secreted factors, (D) occupying alternative lung immune cell niches?

Study Report

Benchmarking Open-Source Rerankers for Biomedical Document Ranking and AI-Generated Literature Review Evaluation

Abstract

Background
Neural rerankers have become central components of modern information retrieval pipelines, yet their performance on domain-specific biomedical corpora remains underexplored relative to general-domain benchmarks. Concurrently, large language models (LLMs) such as ChatGPT, Claude, Consensus, and SciSpace are increasingly used by researchers to generate scientific literature reviews, raising urgent questions about the quality, accuracy, and verifiability of their outputs. No standardized framework currently exists for evaluating AI-generated biomedical reviews, nor has prior work investigated whether reranking models — typically used for document retrieval — can serve as automated quality assessors of AI-generated scientific text.

Objective
This study presents a two-part evaluation framework. Part 1 benchmarks seven open-source reranker models — spanning biomedical-specific (MedCPT Cross-Encoder), general-purpose (BGE-reranker-v2.5-gemma2, BGE-reranker-v2-m3, Jina-reranker-v2, ms-marco-MiniLM-L-12), and large LLM-based architectures (MonoT5-3B, Qwen3-Reranker-4B) — on their ability to rank 713 text passages from a curated corpus of 10 peer-reviewed publications on Mycobacterium tuberculosis (Mtb) infection biology (2021–2025) against a compound research question decomposed into four immune evasion mechanisms. Part 2 evaluates literature reviews generated by four AI systems (ChatGPT, Claude, Consensus, and SciSpace) in response to an identical prompt on Mtb infection, using both a six-criterion expert rubric and reranker-derived automated scoring. The cross-validation between human expert judgment and reranker-assisted evaluation tests the hypothesis that reranking models can approximate expert assessment of AI-generated scientific text.

Methods
For Part 1, seven rerankers spanning three architectural categories (cross-encoder, seq2seq, causal LM) are evaluated against 713 text blocks extracted from the 10-paper Mtb corpus. Each block carries expert-annotated relevance grades (0–3) for four immune evasion mechanisms: (A) phagosome maturation arrest, (B) acidification/lysosome fusion interference, (C) host signaling manipulation, and (D) alternative immune cell niches. Primary endpoints are four per-mechanism nDCG@10 scores; a secondary overall nDCG@10 uses max(A,B,C,D) relevance. Precision@10 and per-model latency (CPU and GPU) are also reported. Key comparisons include domain-specific versus general-purpose models, model size versus domain specialization (110M biomedical vs. 4B LLM), and inference platform requirements (CPU vs. GPU). For Part 2, the four AI-generated reviews are evaluated independently by domain experts using a structured rubric with four scoring tiers (Excellent, Good, Acceptable, Poor) across six criteria. In parallel, text blocks extracted from each AI review are scored by Part 1's rerankers against Mtb infection mechanism queries, and cited claims are semantically matched against the Part 1 PubMed corpus to generate automated proxies for relevance and literature authenticity. A natural control is provided by one AI system (SciSpace) that cited all 10 Part 1 corpus papers, enabling direct validation of reranker source-fidelity detection. Correlation analysis between human and automated scores assesses which evaluation criteria rerankers can reliably approximate.

Significance
This study contributes (1) a domain-specific benchmark for biomedical rerankers beyond standard IR test collections, (2) a reproducible rubric-based framework for evaluating AI-generated scientific literature reviews, and (3) the first investigation of rerankers as scalable automated evaluators of AI-generated biomedical text. If reranker scores correlate with expert judgment, this approach could provide a practical quality filter as AI-assisted scientific writing becomes widespread.

Results
Among seven reranker models evaluated against a 713-passage Mtb immunology corpus with expert-annotated mechanism relevance grades (0–3), the biomedical-specialized MedCPT Cross-Encoder (110M parameters) achieved perfect overall ranking quality (nDCG@10 = 1.000, P@10 = 1.00), substantially outperforming all competitors. Jina Reranker v2 (278M) placed second (nDCG@10 = 0.888, P@10 = 0.90), followed by MonoT5-3B (nDCG@10 = 0.824, P@10 = 0.80) and BGE-reranker-v2-m3 (nDCG@10 = 0.795, P@10 = 0.80). Qwen3-Reranker-4B showed moderate performance (nDCG@10 = 0.611), while ms-marco-MiniLM (33M) and BGE-gemma2 (2B) lagged at 0.489 and 0.180 respectively. Per-mechanism analysis revealed complementary strengths: Jina v2 best identified host signaling manipulation passages (C: nDCG@10 = 0.718), MedCPT excelled at acidification interference (B: 0.839) and alternative niche detection (D: 0.558), while MonoT5-3B showed strong acidification (B: 0.635) and signaling (C: 0.596) detection. All models performed weakest on phagosome maturation arrest (mechanism A), with MedCPT leading at 0.463. Latency on GPU (RTX 3090) ranged from 3.1s (Jina v2) to 116s (BGE-gemma2) for 713 passages; LLM-based models were impractical on CPU alone (estimated 22+ hours), underscoring GPU dependence for production deployment.

In Part 2, expert rubric evaluation of four AI-generated literature reviews ranked Claude 1st (4.8/5 — strongest citation quality with Nature Reviews Microbiology ×2, 28 DOI-verified references, and the greatest molecular precision), Consensus and SciSpace tied 2nd (4.2/5 and 4.0/5 respectively — Consensus excelled in citation authenticity with zero hallucinated references, SciSpace in reference breadth at 30 citations including all 10 corpus papers), and ChatGPT 4th (2.2/5 — accurate at surface level but relying on un-citable institutional labels instead of proper citations). Cross-validation between human rubric scores and MedCPT reranker-assisted evaluation showed strong agreement for relevance ranking, partial agreement for literature authenticity and overall ranking, and confirmed that rerankers cannot assess synthesis depth, citation quality tiers, or factual accuracy.

Conclusions
Domain specialization outweighs model scale for biomedical passage reranking: the 110M-parameter MedCPT achieved perfect nDCG@10, surpassing models up to 36× its size. The general-purpose Jina v2 (278M) emerged as the strongest non-biomedical model (nDCG@10 = 0.888) with exceptional speed (3.1s for 713 passages on GPU). Mechanism-level analysis revealed complementary strengths across model architectures — Jina v2 best detected host signaling manipulation (C), MedCPT dominated acidification interference (B), and MonoT5-3B showed balanced coverage. Model scale alone did not predict performance: the 2B BGE-gemma2 scored lowest (0.180), while the 278M Jina v2 ranked second overall. CPU-only deployment is impractical for LLM-based rerankers (>22 hours vs. 71–74s on GPU). For biomedical IR pipelines, MedCPT offers the optimal accuracy–speed tradeoff; for general-purpose use without domain-specific training, Jina v2 provides the best balance. These findings directly informed Part 2 model selection.

In Part 2, Claude produced the highest-quality AI-generated literature review (4.8/5), combining exceptional citation quality (Nature Reviews Microbiology ×2, Nature Microbiology) with the deepest mechanistic synthesis, while Consensus excelled in citation authenticity (zero hallucinated references) and SciSpace in breadth (30 references citing all 10 corpus papers). Reranker-assisted evaluation proved a reliable first-pass proxy for relevance but could not distinguish citation quality tiers or synthesis sophistication, confirming that expert review remains essential for evaluating AI-generated scientific text.

Part 1 — Reranker Benchmark Results


Primary Endpoints — Per-Mechanism nDCG@10

The research query decomposes into 4 distinct infection mechanisms (A–D), each with its own ground-truth relevance scores across 713 text blocks. Because mechanisms are independent — no block scores 3 on all four, and 247 of 259 max-3 blocks score 0 on at least one mechanism — a single composite score would conflate mechanism-specific retrieval quality. Therefore, the 4 per-mechanism nDCG@10 scores are the primary endpoints. An overall nDCG@10 using max(A,B,C,D) is reported as a secondary “any-mechanism retrieval breadth” metric.

Step 0 — Corpus Granularity
Each of the 10 source papers is split into overlapping text blocks (the unit of retrieval). The reranker scores every block against the query independently.
$$N = 713 \text{ blocks from 10 papers}$$
Block counts per paper range from 36 to 132. The reranker produces a ranked list of all 713 blocks per query. nDCG@10 evaluates only the top 10 positions of this 713-item ranking — did the best blocks rise to the top?
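The report does not specify exactly how blocks were extracted; as a rough illustration, an overlapping sliding word-window splitter of the following shape would produce blocks of this kind (the window and stride values are invented for the sketch, not the study's settings):

```python
def split_into_blocks(text: str, window: int = 200, stride: int = 150) -> list[str]:
    """Split a paper's full text into overlapping word-window blocks (illustrative only)."""
    words = text.split()
    blocks = []
    # stride < window produces overlap between consecutive blocks
    for start in range(0, max(len(words) - window, 0) + 1, stride):
        blocks.append(" ".join(words[start:start + window]))
    return blocks
```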
Step 1 — Relevance Scores (per block, per mechanism)
Each of the 713 blocks is independently scored against each of the 4 infection mechanisms (A: phagosome arrest, B: acidification/lysosome fusion, C: host signaling manipulation, D: alternative immune cell niches). Blocks from the same paper can and do receive different scores.
$$\text{rel}_{i,m} \in \{0, 1, 2, 3\} \quad \text{for each block } i \text{ and mechanism } m \in \{A, B, C, D\}$$
Where 0 = not relevant, 1 = tangential, 2 = relevant, 3 = highly relevant. nDCG@10 is computed separately for each mechanism using that mechanism’s scores (primary endpoints). A secondary overall score uses max_relevance = max(relA, relB, relC, relD), which measures retrieval of blocks relevant to any mechanism. Ground-truth annotations are documented in Appendix A: Source of Truth (exported separately).
Step 2 — Discounted Cumulative Gain (DCG)
After the reranker ranks all 713 blocks, DCG is computed over the top k=10 positions only. This study uses the linear gain formulation of Järvelin & Kekäläinen [1], where relevance scores are used directly rather than exponentiated. Higher-ranked positions contribute more due to logarithmic discounting.
$$\text{DCG}@k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)}$$
For each position i from 1 to k (k=10): take the relevance score of the block at that position and divide by $\log_2(i+1)$ to discount by rank position. A rel-3 block at rank 1 contributes $3 / \log_2(2) = 3.0$ points; the same block at rank 10 contributes $3 / \log_2(11) \approx 0.87$ points. This linear variant treats relevance proportionally: a score-3 block is worth exactly 3× a score-1 block at the same position.
Step 3 — Ideal DCG (IDCG)
IDCG is the DCG of a perfect ranking — what you'd get if the 10 most relevant blocks (out of all 713) were placed at the top in optimal order. It represents the theoretical ceiling for the query.
$$\text{IDCG}@k = \sum_{i=1}^{k} \frac{\text{rel}^{*}_i}{\log_2(i + 1)}$$
Where $\text{rel}^{*}_i$ is the relevance of the block that should be at position i in a perfect ranking. Computed by sorting all 713 blocks' ground-truth relevance scores from highest to lowest and applying the same linear DCG formula to the top 10.
Step 4 — Normalized DCG (nDCG)
nDCG normalizes the actual DCG against the ideal, producing a score between 0 and 1 that is comparable across queries regardless of how many relevant blocks exist.
$$\text{nDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k}$$
nDCG = 1.0 means the reranker placed the 10 best blocks at the top in optimal order (out of 713). nDCG ≈ 0 means the top 10 were mostly irrelevant blocks. Because k=10 is a small window into a 713-block pool, this metric strongly penalizes rerankers that let irrelevant blocks leak into the top positions.
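Steps 2–4 can be expressed compactly in code. A minimal sketch of the linear-gain nDCG@k computation described above (an illustration, not the study's actual evaluation script):

```python
import math

def dcg_at_k(relevances, k=10):
    """Linear-gain DCG@k: rel_i / log2(i + 1) over the top-k positions."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k = DCG@k of the model's ranking / DCG@k of the ideal ranking."""
    ideal = sorted(all_relevances, reverse=True)   # best possible ordering of the full block pool
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

# A rel-3 block at rank 1 contributes 3.0; the same block at rank 10 contributes ~0.87.
```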
Step 5 — Per-Mechanism Reporting
Steps 2–4 are applied independently for each mechanism, yielding four primary scores per model:
$$\text{nDCG}@10_A, \quad \text{nDCG}@10_B, \quad \text{nDCG}@10_C, \quad \text{nDCG}@10_D$$
Each mechanism’s nDCG@10 uses only that mechanism’s block-level relevance scores as ground truth. These four scores are the primary endpoints of this study. They reveal whether a reranker excels at surfacing blocks about phagosome arrest (A) but struggles with alternative niches (D), or vice versa.
Step 6 — Secondary Overall Score
A secondary composite uses max_relevance = max(A,B,C,D) per block, measuring whether the reranker surfaces blocks relevant to any mechanism.
$$\text{nDCG}@10_{\text{overall}} \quad \text{where } \text{rel}_i = \max(\text{rel}_{i,A},\; \text{rel}_{i,B},\; \text{rel}_{i,C},\; \text{rel}_{i,D})$$
This score is reported for completeness but is not the primary endpoint because it conflates mechanism-specific performance. A model scoring well on overall but poorly on mechanism B may be retrieving blocks about signaling (C) instead of acidification (B) — a distinction invisible to the composite. The per-mechanism breakdown (Step 5) provides the definitive evaluation.
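Reusing the ndcg_at_k sketch above, the four primary per-mechanism endpoints and the secondary max-relevance composite could be computed along these lines (the block dictionary shape mirrors the Appendix A fields; this is an illustration, not the study's code):

```python
MECHANISMS = ("A", "B", "C", "D")

def evaluate_ranking(ranked_blocks, all_blocks, k=10):
    """Four primary per-mechanism nDCG@10 scores plus the secondary
    overall score based on max(A, B, C, D) per block."""
    scores = {}
    for m in MECHANISMS:
        ranked_rel = [b["mechanisms"][m] for b in ranked_blocks]   # in model-ranked order
        all_rel = [b["mechanisms"][m] for b in all_blocks]         # full 713-block pool
        scores[f"nDCG@{k}_{m}"] = ndcg_at_k(ranked_rel, all_rel, k)
    ranked_max = [max(b["mechanisms"].values()) for b in ranked_blocks]
    all_max = [max(b["mechanisms"].values()) for b in all_blocks]
    scores[f"nDCG@{k}_overall"] = ndcg_at_k(ranked_max, all_max, k)
    return scores
```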

Worked Examples — nDCG@10 Calculations

Three calculations using actual reranker output and ground-truth annotations: the first two show how the same top-10 ranking produces different nDCG@10 scores across mechanisms, and the third shows how a different model fares on the same mechanism.

Ideal Ranking (IDCG@10)
For both mechanisms A and B, the corpus contains ≥10 blocks scored 3 (highly relevant). A perfect reranker would place 10 such blocks at positions 1–10:
$$\text{IDCG}@10 = \sum_{i=1}^{10} \frac{3}{\log_2(i+1)} = \frac{3}{1.000} + \frac{3}{1.585} + \frac{3}{2.000} + \frac{3}{2.322} + \frac{3}{2.585} + \frac{3}{2.807} + \frac{3}{3.000} + \frac{3}{3.170} + \frac{3}{3.322} + \frac{3}{3.459} = 13.631$$
This is the theoretical ceiling for any mechanism where ≥10 blocks have relevance 3. All three examples below share this IDCG.
Example 1 — MedCPT on Mechanism A (Phagosome Arrest): nDCG@10 = 0.4630
MedCPT ranks all 713 blocks. The top 10 by model score, with their mechanism A ground-truth relevance:
Rank | Block | Model Score | rel_A | Discount log2(i+1) | Contribution
1 | P05B04 | 0.9999920 | 0 | 1.000 | 0.000
2 | P07B15 | 0.9999760 | 3 | 1.585 | 1.893
3 | P03B14 | 0.9999680 | 3 | 2.000 | 1.500
4 | P03B03 | 0.9999510 | 0 | 2.322 | 0.000
5 | P07B24 | 0.9999490 | 0 | 2.585 | 0.000
6 | P01B43 | 0.9999200 | 3 | 2.807 | 1.069
7 | P09B04 | 0.9999070 | 0 | 3.000 | 0.000
8 | P01B02 | 0.9998920 | 3 | 3.170 | 0.946
9 | P01B04 | 0.9998880 | 3 | 3.322 | 0.903
10 | P07B05 | 0.9998830 | 0 | 3.459 | 0.000
$$\text{DCG}@10 = 0 + 1.893 + 1.500 + 0 + 0 + 1.069 + 0 + 0.946 + 0.903 + 0 = 6.311$$ $$\text{nDCG}@10_A = \frac{6.311}{13.631} = 0.4630$$
Interpretation: 5 of the top 10 blocks score 0 on phagosome arrest — these blocks are relevant to other mechanisms (B, C, or D) but not A. MedCPT’s composite query retrieves broadly relevant blocks, but its mechanism A specificity is moderate. The rank-1 block (P05B04) scores 0 on A despite being the model’s highest-confidence result, contributing nothing to this mechanism’s DCG and wasting the most valuable ranking position.
Example 2 — MedCPT on Mechanism B (Acidification / Lysosome Fusion): nDCG@10 = 0.8386
Same model, same top-10 ranking — but now evaluated against mechanism B ground truth:
Rank | Block | Model Score | rel_B | Discount log2(i+1) | Contribution
1 | P05B04 | 0.9999920 | 2 | 1.000 | 2.000
2 | P07B15 | 0.9999760 | 3 | 1.585 | 1.893
3 | P03B14 | 0.9999680 | 3 | 2.000 | 1.500
4 | P03B03 | 0.9999510 | 3 | 2.322 | 1.292
5 | P07B24 | 0.9999490 | 3 | 2.585 | 1.161
6 | P01B43 | 0.9999200 | 3 | 2.807 | 1.069
7 | P09B04 | 0.9999070 | 2 | 3.000 | 0.667
8 | P01B02 | 0.9998920 | 3 | 3.170 | 0.946
9 | P01B04 | 0.9998880 | 3 | 3.322 | 0.903
10 | P07B05 | 0.9998830 | 0 | 3.459 | 0.000
$$\text{DCG}@10 = 2.000 + 1.893 + 1.500 + 1.292 + 1.161 + 1.069 + 0.667 + 0.946 + 0.903 + 0 = 11.430$$ $$\text{nDCG}@10_B = \frac{11.430}{13.631} = 0.8386$$
Interpretation: The exact same ranking scores 0.8386 on mechanism B vs. 0.4630 on mechanism A. Only 1 of the top 10 blocks is irrelevant to B (vs. 5 for A). This demonstrates precisely why per-mechanism reporting matters: the composite score alone (0.9048 overall) would hide that MedCPT is nearly twice as effective at surfacing acidification content as phagosome arrest content.
Example 3 — MiniLM on Mechanism A (Phagosome Arrest): nDCG@10 = 0.1100
A 33M general-purpose cross-encoder (MS-MARCO), showing poor mechanism A retrieval:
Rank | Block | Model Score | rel_A | Discount log2(i+1) | Contribution
1 | P04B01 | 2.302590 | 0 | 1.000 | 0.000
2 | P03B19 | 2.239884 | 0 | 1.585 | 0.000
3 | P03B01 | 2.195090 | 3 | 2.000 | 1.500
4 | P04B03 | 2.137062 | 0 | 2.322 | 0.000
5 | P02B49 | 2.130432 | 0 | 2.585 | 0.000
6 | P02B57 | 2.109454 | 0 | 2.807 | 0.000
7 | P02B52 | 2.078258 | 0 | 3.000 | 0.000
8 | P03B36 | 2.070532 | 0 | 3.170 | 0.000
9 | P07B01 | 1.984957 | 0 | 3.322 | 0.000
10 | P03B03 | 1.962905 | 0 | 3.459 | 0.000
$$\text{DCG}@10 = 0 + 0 + 1.500 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 1.500$$ $$\text{nDCG}@10_A = \frac{1.500}{13.631} = 0.1100$$
Interpretation: MiniLM places only 1 relevant block in the top 10 — and it’s at rank 3, not rank 1, so its contribution (1.500) is already discounted. The remaining 9 positions are all mechanism A–irrelevant. This 33M general-purpose model, trained on MS-MARCO web queries, lacks the domain specificity to distinguish phagosome arrest content from general Mtb text. Compare with MedCPT (0.4630), which placed 5 relevant blocks in the same top 10.
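The three worked examples can be reproduced directly from the relevance columns above. A quick check, assuming (as stated) that at least ten rel-3 blocks exist so the ideal top 10 is all 3s:

```python
import math

def dcg(rels):
    # linear-gain DCG over the listed positions
    return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))

IDCG_10 = dcg([3] * 10)                      # 13.631: ideal top 10 made entirely of rel-3 blocks

medcpt_A = [0, 3, 3, 0, 0, 3, 0, 3, 3, 0]    # Example 1: MedCPT top 10 vs. mechanism A
medcpt_B = [2, 3, 3, 3, 3, 3, 2, 3, 3, 0]    # Example 2: same ranking vs. mechanism B
minilm_A = [0, 0, 3, 0, 0, 0, 0, 0, 0, 0]    # Example 3: MiniLM top 10 vs. mechanism A

print(f"{dcg(medcpt_A) / IDCG_10:.4f}")      # 0.4630
print(f"{dcg(medcpt_B) / IDCG_10:.4f}")      # 0.8386
print(f"{dcg(minilm_A) / IDCG_10:.4f}")      # 0.1100
```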

Part 2 — AI-Generated Literature Reviews

Four AI systems were given the identical prompt: “Do a scientific literature review on the infection process of Mycobacterium tuberculosis. Include papers from the last 5 years.”

# | AI System | File | Format | References | Key Observations
1 | ChatGPT | AI Test- Chatgpt.pdf | PDF, 8 pp | ~7 named | Generic hyperlinks, no DOIs. Bullet-list heavy, conversational tone.
2 | Consensus | Consensus review.pdf | PDF, 5 pp | ~18 | Full citations with DOIs. Dense molecular detail. Strongest citation rigor.
3 | Claude | mtb_literature_review Claude.docx | DOCX | 28 | Deepest synthesis and strongest citation quality (Nature Reviews Microbiology ×2). Near-publication quality.
4 | SciSpace | Scientific Literature Review.docx | DOCX | 30 | Well-structured, numbered refs. Cites all 10 Part 1 corpus papers directly.

Part 2A — Human Rubric Evaluation

Expert evaluation on 6 criteria, scored as Excellent / Good / Acceptable / Poor.

1. Number of References
30+ = Excellent · 20–29 = Good · 10–19 = Acceptable · <10 = Poor
2. Relevance to Infection Mechanism
Majority address infection biology = Excellent · Many relevant = Good · Half relevant = Acceptable · Mostly unrelated = Poor
3. Literature Authenticity
All verifiable = Excellent · 1–2 incorrect = Good · Several questionable = Acceptable · Many fabricated = Poor
4. Scientific Depth of Synthesis
Integrates & explains coherently = Excellent · Some synthesis = Good · Isolated summaries = Acceptable · Superficial = Poor
5. Citation Quality
High-impact journals = Excellent · Mixed impact = Good · Medium-tier = Acceptable · Low-quality = Poor
6. Accuracy of Biological Mechanisms
Accurate & comprehensive = Excellent · Mostly correct = Good · Several inaccuracies = Acceptable · Major errors = Poor
AI System | # Refs | Relevance | Authenticity | Depth | Citation Quality | Bio Accuracy | Overall
ChatGPT | Poor | Good | Acceptable | Acceptable | Acceptable | Good | 4th (2.2/5)
Claude | Good | Excellent | Good | Excellent | Excellent | Excellent | 1st (4.8/5)
Consensus | Acceptable | Excellent | Excellent | Good | Excellent | Excellent | 2nd (4.2/5)
SciSpace | Excellent | Excellent | Good | Excellent | Good | Excellent | 2nd (4.0/5)

Part 2A — Detailed Evaluation Analysis

1st — Claude (4 Excellent, 2 Good) — 4.8/5
Strengths: Combines breadth with citation precision — 28 fully formatted references with DOIs (all 2021–2025). Strongest citation profile of all four systems: Nature Reviews Microbiology (×2), Nature Microbiology, Nature Communications, NPJ Vaccines. Exceptional molecular detail across all mechanisms, describing PtpA/PtpB V-ATPase dephosphorylation, PI3P disruption via ManLAM, cGAS-STING-IRF3 pathway activation, EspL-mediated autophagy inhibition, DosR/DevR dormancy regulation, WhiB3 redox sensing, and RpfB-mediated reactivation. Sophisticated handling of the type I interferon paradox citing separate 2024–2025 studies. Includes WHO 2024 epidemiological data and advanced topics (MAIT cells, cytosolic translocation, granuloma dynamics). Near-publication-quality document structure.
Weaknesses: Falls 2 references short of Excellent threshold (28 vs. 30). Two references (“Patel & Bhatt 2023” and “Russell 2023”) cannot be fully verified. Very lengthy, potentially impractical for quick reference. Could further integrate single-cell and spatial transcriptomics data. Lacks summary table feature found in Consensus.
Key distinction: Claude achieves the highest overall quality by combining the breadth of SciSpace with the citation precision of Consensus. The minor shortfall in reference count (28 vs. 30) is the only meaningful gap preventing a perfect score.
2nd (tied) — Consensus (3 Excellent, 1 Good, 2 Acceptable) — 4.2/5
Strengths: All 18 references include working DOIs verified as real publications — zero hallucinated citations. Includes high-impact journals: Nature Reviews Microbiology, PNAS, Cellular and Molecular Immunology. Warner et al. 2025 identified as the field’s most current comprehensive review. Meaningful synthesis connecting PknG, ESX effectors, and galectins. Unique summary table integrating infection stages with citations. Accurately identifies strain-specific epithelial responses and disease progression as spectrum.
Weaknesses: Reference count at 18 falls below the Good threshold (20+). Synthesis depth constrained by smaller reference set. Cell death pathways and metabolic dormancy receive only brief treatment. No formal abstract or keywords section. Omits some important recent papers (Zheng 2024, Feng 2024).
Key distinction: In contexts where verifiability is the top priority, Consensus outperforms all others. Quality over quantity — its 18 references are impeccable, but the smaller set constrains coverage depth.
2nd (tied) — SciSpace (4 Excellent, 2 Good) — 4.0/5
Strengths: Highest reference count (30, the only Excellent-tier). Excellent structural organization with full table of contents across 12 sections. Deep mechanistic synthesis correctly describing PknG, V-ATPase, Rab7/LAMP-1, and ManLAM interactions. Includes cutting-edge finding on lysosome-poor monocyte niche (Zheng 2024). Covers cell death pathways with precision including pyroptosis and apoptosis. Critically, SciSpace cited all 10 Part 1 corpus papers directly ([1] Rankine-Wilson, [5] Bo, [6] Shen, [9] Lei, [11] Kilinç, [12] Zheng, [14] Chandra, [17] Kim, [22] Witt, [29] Khadela), providing a natural control for Part 2B reranker-based source fidelity detection.
Weaknesses: ~2 potentially unverifiable references including ref [10] lacking journal name and ref [13] as unconfirmed preprint. Citation quality skews toward medium-impact open-access journals rather than top-tier publications (no Nature Microbiology or Cell Host & Microbe). Later sections contain repetitive content. No DOIs provided, complicating verification. Less emphasis on most recent 2024–2025 advances.
Key distinction: SciSpace’s corpus overlap makes it invaluable for Part 2B validation. Research requiring breadth favors SciSpace; research requiring verifiability favors Consensus.
4th — ChatGPT (0 Excellent, 2 Good, 3 Acceptable, 1 Poor) — 2.2/5
Strengths: Well-organized and readable for non-specialist audiences. Broadly accurate at a general level covering major topics correctly — phagosome arrest, V-ATPase interference, ESX-1 function, and immune responses. Clear and logical section structure. Identifies emerging research directions including spatial transcriptomics and immunometabolism.
Weaknesses: Only ~6 named references without DOIs or full bibliographic details (Poor). In-text labels like “(PubMed)”, “(Nature)”, “(ScienceDirect)” are not real citations. Multiple references identified as vague and likely fabricated. Uses non-attributed statements like “a 2023 study showed...” Lacks grounded synthesis across studies. Not suitable for any scholarly submission — cannot be verified without DOIs.
Key distinction: ChatGPT produced a well-organized textbook-style summary rather than a scholarly literature review. Despite accessible presentation, it is academically unsuitable — it would fail peer review on citation quality alone.

Part 2B — Reranker-Assisted Evaluation

Using Part 1 rerankers as automated judges of AI-generated reviews.

  • Relevance Scoring — Extract text blocks from each AI review, score against Mtb infection mechanism prompts. Compare automated scores vs. human rubric criterion #2.
  • Reference Validation — Semantically match cited claims against the 10 Part 1 PubMed papers. High overlap = proxy for authenticity. Hallucinated claims won't match source papers.
  • Source Fidelity Control — SciSpace cited all 10 Part 1 corpus papers. Rerankers should score SciSpace highest for corpus alignment, validating the method.

MedCPT (Part 1 best, nDCG@10 = 1.000) scored all review text blocks against 4 per-mechanism queries and a composite query. Corpus match = semantic similarity between review content and the 10 Part 1 source papers (higher = more grounded in real literature).
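For reference, the block-versus-query scoring could be reproduced along the following lines. The HuggingFace model id (ncbi/MedCPT-Cross-Encoder), the query wordings, and the mean-over-blocks aggregation are assumptions for illustration; the report does not specify its exact scoring script.

```python
# Sketch of Part 2B relevance scoring: score each AI-review block against
# per-mechanism queries with a cross-encoder. Model id, query wording, and
# the mean aggregation are illustrative assumptions, not the study's setup.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("ncbi/MedCPT-Cross-Encoder", max_length=512)

MECHANISM_QUERIES = {
    "A": "How does M. tuberculosis prevent phagosome maturation?",
    "B": "How does M. tuberculosis interfere with acidification and lysosome fusion?",
    "C": "How does M. tuberculosis manipulate host signaling through secreted factors?",
    "D": "Which alternative lung immune cell niches does M. tuberculosis occupy?",
}

def score_review_blocks(review_blocks):
    """Mean reranker score per mechanism over all text blocks of one AI review."""
    per_mechanism = {}
    for mech, query in MECHANISM_QUERIES.items():
        pairs = [(query, block) for block in review_blocks]
        block_scores = reranker.predict(pairs)        # one relevance score per block
        per_mechanism[mech] = float(sum(block_scores) / len(block_scores))
    return per_mechanism
```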

AI System | Blocks | Composite | A | B | C | D | Corpus
ChatGPT | 4 | 0.6273 | 0.5545 | 0.2500 | 0.4925 | 0.3040 | 0.9787
Consensus | 6 | 0.9276 | 0.4736 | 0.3365 | 0.6807 | 0.1675 | 0.9843
Claude | 11 | 0.7547 | 0.5445 | 0.1102 | 0.5873 | 0.0026 | 0.9455
SciSpace | 12 | 0.9109 | 0.3988 | 0.3417 | 0.9748 | 0.3436 | 0.9912
Key findings:
Composite relevance: Consensus (0.928) > SciSpace (0.911) > Claude (0.755) > ChatGPT (0.627). Consensus and SciSpace produce the most on-topic content as judged by the Part 1–validated reranker.
Mechanism C (signaling): SciSpace dominates (0.975), likely because it directly cites corpus papers on host signaling (Lei/TRAF3, Kilinç/HDT).
Mechanism A (phagosome): Claude leads (0.545), consistent with its deep coverage of ESX-1, ESAT-6/CFP-10, and phagosome maturation arrest.
Corpus match: SciSpace (0.991) > Consensus (0.984) > ChatGPT (0.979) > Claude (0.946). SciSpace’s highest corpus match validates the source fidelity control — it cited all 10 Part 1 papers and the reranker confirms this. Claude’s lower corpus match aligns with its many unverifiable in-text citations that diverge from the Part 1 corpus.
Claude mechanism D anomaly: Claude scores 0.003 on alternative niches despite discussing dendritic cells and granuloma microenvironments. This suggests its coverage uses different terminology or framing than the Part 1 corpus, revealing a limitation of single-query reranker evaluation.
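The corpus-match proxy itself could be approximated with a sentence-embedding model: each review block is matched to its closest Part 1 corpus block by cosine similarity, and the per-block maxima are averaged. The embedding model (all-MiniLM-L6-v2) and the max-then-mean aggregation here are assumptions for illustration only.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative general-purpose embedder

def corpus_match(review_blocks, corpus_blocks):
    """Average over review blocks of the best cosine similarity to any corpus block."""
    review_emb = embedder.encode(review_blocks, convert_to_tensor=True, normalize_embeddings=True)
    corpus_emb = embedder.encode(corpus_blocks, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(review_emb, corpus_emb)       # shape: (n_review, n_corpus)
    return float(sims.max(dim=1).values.mean())       # closest corpus block per review block, averaged
```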

Part 2C — Cross-Analysis (Human vs. Reranker Agreement)

Comparing human rubric rankings (Part 2A) against MedCPT reranker scores (Part 2B) to test whether rerankers can approximate expert judgment.

Criterion | Human Ranking (2A) | Reranker Proxy | Reranker Ranking (2B) | Agreement
Relevance to Infection Mechanism | Claude = Consensus = SciSpace > ChatGPT | Composite score | Consensus > SciSpace > Claude > ChatGPT | Strong
Literature Authenticity | Consensus > Claude = SciSpace > ChatGPT | Corpus match | SciSpace > Consensus > ChatGPT > Claude | Partial
Scientific Depth | Claude = SciSpace > Consensus > ChatGPT | Mechanism coverage | SciSpace > Consensus > Claude > ChatGPT | Partial
Bio Accuracy | Claude = Consensus = SciSpace > ChatGPT | N/A | N/A | Not measurable
Overall Ranking | Claude > Consensus = SciSpace > ChatGPT | Composite + corpus | Consensus > SciSpace > Claude > ChatGPT | Partial
Interpretation:
Relevance: Strong agreement — both human and reranker rankings place ChatGPT last, and the human rubric rated all of the top three (Claude, Consensus, SciSpace) Excellent on relevance. The reranker correctly identifies that Consensus and SciSpace produce the most on-topic content.
Authenticity: Partial agreement — the reranker’s corpus match correctly identifies SciSpace as most grounded in real literature (it cites all 10 corpus papers) and Consensus as strong. However, it ranks ChatGPT above Claude, which contradicts human judgment. This reflects a limitation: corpus match measures overlap with these specific 10 papers, not literature authenticity broadly. ChatGPT’s generic overview text happens to use similar terminology to the corpus papers.
Depth: Partial agreement — the reranker measures mechanism coverage (how many topics are addressed) but cannot assess whether content integrates findings across studies vs. lists them in isolation. Human judges rated both Claude and SciSpace’s synthesis as Excellent; the reranker only sees topical overlap.
Bio Accuracy: Not measurable by rerankers — semantic similarity cannot distinguish correct from incorrect biological claims. This remains exclusively a human expert domain.
Overall ranking: Partial agreement — both human and reranker agree on ChatGPT last, but diverge at the top. Human experts place Claude 1st (4.8/5) based on citation quality and mechanistic depth, while the reranker favors Consensus and SciSpace based on composite relevance scores. This reveals a key limitation: rerankers cannot distinguish between citation quality tiers (Nature Reviews Microbiology vs. open-access journals) or evaluate the sophistication of scientific synthesis.
Scalability thesis: Rerankers are reliable proxies for relevance (strong agreement), partially useful for authenticity and overall ranking (with corpus-specific caveats), and unable to assess depth, accuracy, or citation quality. For scalable triage of AI-generated scientific text, a reranker could serve as a first-pass filter for relevance and corpus grounding, but expert review remains essential for synthesis quality, citation rigor, and factual correctness.

Study Design Summary

Aspect | Part 1 | Part 2
Question | Can rerankers accurately rank real papers? | Can rerankers evaluate AI-generated reviews?
Input Corpus | 10 PubMed papers (pdfs/) | 4 AI reviews (Reviews/)
Evaluation | Reranker scores vs. expert judgment | Human rubric + reranker scores + cross-validation
Domain | Mtb infection biology | Mtb infection biology
Shared Asset | Reranker models (common to Parts 1 and 2)

References

  1. Järvelin, K. & Kekäläinen, J. (2002). “Cumulated gain-based evaluation of IR techniques.” ACM Transactions on Information Systems, 20(4), 422–446. doi:10.1145/582415.582418
  2. Rankine-Wilson et al. (2021). “From infection niche to therapeutic target: the intracellular lifestyle of M. tuberculosis.” Microbiology.
  3. Bo et al. (2023). “Mycobacterium tuberculosis–macrophage interaction: Molecular updates.” Front. Cell. Infect. Microbiol.
  4. Echeverría-Valencia (2023). “Phagocytosis of M. tuberculosis: A Narrative of the Uptaking and Survival.” IntechOpen.
  5. Lei et al. (2021). “Rv3722c promotes M. tuberculosis survival in macrophages by interacting with TRAF3.” Front. Cell. Infect. Microbiol.
  6. Zheng et al. (2024). “M. tuberculosis resides in lysosome-poor monocyte-derived lung cells during chronic infection.” PLOS Pathogens.
  7. Kilinç et al. (2021). “Host-directed therapy to combat mycobacterial infections.” Immunological Reviews.
  8. Chandra et al. (2022). “Immune evasion and provocation by Mycobacterium tuberculosis.” Nat. Rev. Microbiol.
  9. Kim et al. (2022). “Pathological and protective roles of dendritic cells in M. tuberculosis infection.” Front. Cell. Infect. Microbiol.
  10. Witt (2025). “Antigen processing pathways in M. tuberculosis pathogenesis.” IntechOpen.
  11. Khadela et al. (2022). “Epigenetics in tuberculosis: Immunomodulation of host immune response.” Vaccines.

Appendix A — Source of Truth

The complete block-level relevance annotations used to compute nDCG@10. Exported as a companion file (source_of_truth.json).

  • block_id — Unique identifier (e.g., P01B01 = Paper 01, Block 01)
  • paper — Parent paper number (01–10)
  • mechanisms — Per-mechanism relevance scores {A: 0–3, B: 0–3, C: 0–3, D: 0–3}
  • max_relevance — max(A, B, C, D), used for mechanism-agnostic overall ranking
  • text_preview — First 200 characters of block content
713 blocks total · 10 papers · 36–132 blocks per paper
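An illustrative record of this shape (field values are invented for the example; the actual annotations are in the exported file):

```python
# Illustrative source_of_truth.json record; values here are made up, not real annotations.
example_block = {
    "block_id": "P01B01",                              # Paper 01, Block 01
    "paper": "01",
    "mechanisms": {"A": 3, "B": 2, "C": 0, "D": 0},    # per-mechanism relevance grades
    "max_relevance": 3,                                # max(A, B, C, D)
    "text_preview": "Mycobacterium tuberculosis arrests phagosome maturation within ...",
}
```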


350-Word Submission

Condensed study summary for conference / journal submission

Large language models such as ChatGPT, Claude, Consensus, and SciSpace are increasingly used to generate scientific literature reviews, yet no standardized framework exists for evaluating their quality. This study evaluates AI-generated biomedical reviews on Mycobacterium tuberculosis (Mtb) infection, combining structured rubric assessment with automated computational validation.

First, four AI systems were given an identical prompt requesting a literature review on Mtb infection with recent references. Each review was evaluated on six criteria: reference count, relevance to infection mechanisms, literature authenticity (real versus fabricated citations), depth of synthesis, citation quality, and accuracy of biological mechanisms. Claude ranked first (4.8/5) for combining the strongest citation profile — 28 verified references from high-impact journals including Nature Reviews Microbiology — with the deepest mechanistic analysis. Consensus and SciSpace tied for second (4.2 and 4.0/5): Consensus produced zero fabricated citations across all 18 references, while SciSpace achieved the highest reference count at 30. ChatGPT ranked fourth (2.2/5), producing accurate surface-level content but substituting generic website labels for proper citations, making it unsuitable for scholarly use. A key finding across all systems was that citation count alone did not predict quality — Consensus, with the fewest references, outscored SciSpace on authenticity and journal impact.

Second, we tested whether AI-based text-ranking tools — software that scores how well a passage matches a scientific question — could replicate rubric-based evaluation automatically. Seven ranking models of varying sizes and architectures were first validated against 713 scored text passages from 10 peer-reviewed Mtb publications. The best-performing model, MedCPT (trained on biomedical literature), achieved perfect ranking accuracy, outperforming general-purpose models up to 36 times its size. This validated model was then applied to score the four AI reviews automatically. The automated scores agreed with rubric evaluation on which reviews were most topically relevant and could detect which reviews drew from real published literature. However, they could not assess citation quality, depth of scientific analysis, or whether biological claims were factually correct.

These findings suggest that automated ranking tools can provide a useful first-pass quality screen for AI-generated scientific text, but structured evaluation remains essential for assessing citation integrity, analytical depth, and scientific accuracy.