Abstract
Background
Neural rerankers have become central components of modern information retrieval pipelines, yet their performance on domain-specific biomedical corpora remains underexplored relative to general-domain benchmarks. Concurrently, large language models (LLMs) such as ChatGPT, Claude, Consensus, and SciSpace are increasingly used by researchers to generate scientific literature reviews, raising urgent questions about the quality, accuracy, and verifiability of their outputs. No standardized framework currently exists for evaluating AI-generated biomedical reviews, nor has prior work investigated whether reranking models — typically used for document retrieval — can serve as automated quality assessors of AI-generated scientific text.
Objective
This study presents a two-part evaluation framework. Part 1 benchmarks seven open-source reranker models — spanning biomedical-specific (MedCPT Cross-Encoder), general-purpose (BGE-reranker-v2.5-gemma2, BGE-reranker-v2-m3, Jina-reranker-v2, ms-marco-MiniLM-L-12), and large LLM-based architectures (MonoT5-3B, Qwen3-Reranker-4B) — on their ability to rank 713 text passages from a curated corpus of 10 peer-reviewed publications on Mycobacterium tuberculosis (Mtb) infection biology (2021–2025) against a compound research question decomposed into four immune evasion mechanisms. Part 2 evaluates literature reviews generated by four AI systems (ChatGPT, Claude, Consensus, and SciSpace) in response to an identical prompt on Mtb infection, using both a six-criterion expert rubric and reranker-derived automated scoring. The cross-validation between human expert judgment and reranker-assisted evaluation tests the hypothesis that reranking models can approximate expert assessment of AI-generated scientific text.
Methods
For Part 1, seven rerankers spanning three architectural categories (cross-encoder, seq2seq, causal LM) are evaluated against 713 text blocks extracted from the 10-paper Mtb corpus. Each block carries expert-annotated relevance grades (0–3) for four immune evasion mechanisms: (A) phagosome maturation arrest, (B) acidification/lysosome fusion interference, (C) host signaling manipulation, and (D) alternative immune cell niches. Primary endpoints are four per-mechanism nDCG@10 scores; a secondary overall nDCG@10 uses max(A,B,C,D) relevance. Precision@10 and per-model latency (CPU and GPU) are also reported. Key comparisons include domain-specific versus general-purpose models, model size versus domain specialization (110M biomedical vs. 4B LLM), and inference platform requirements (CPU vs. GPU). For Part 2, the four AI-generated reviews are evaluated independently by domain experts using a structured rubric with four scoring tiers (Excellent, Good, Acceptable, Poor) across six criteria. In parallel, text blocks extracted from each AI review are scored by Part 1's rerankers against Mtb infection mechanism queries, and cited claims are semantically matched against the Part 1 PubMed corpus to generate automated proxies for relevance and literature authenticity. A natural control is provided by one AI system (SciSpace) that cited all 10 Part 1 corpus papers, enabling direct validation of reranker source-fidelity detection. Correlation analysis between human and automated scores assesses which evaluation criteria rerankers can reliably approximate.
Significance
This study contributes (1) a domain-specific benchmark for biomedical rerankers beyond standard IR test collections, (2) a reproducible rubric-based framework for evaluating AI-generated scientific literature reviews, and (3) the first investigation of rerankers as scalable automated evaluators of AI-generated biomedical text. If reranker scores correlate with expert judgment, this approach could provide a practical quality filter as AI-assisted scientific writing becomes widespread.
Results
Among seven reranker models evaluated against a 713-passage Mtb immunology corpus with expert-annotated mechanism relevance grades (0–3), the biomedical-specialized MedCPT Cross-Encoder (110M parameters) achieved perfect overall ranking quality (nDCG@10 = 1.000, P@10 = 1.00), substantially outperforming all competitors. Jina Reranker v2 (278M) placed second (nDCG@10 = 0.888, P@10 = 0.90), followed by MonoT5-3B (nDCG@10 = 0.824, P@10 = 0.80) and BGE-reranker-v2-m3 (nDCG@10 = 0.795, P@10 = 0.80). Qwen3-Reranker-4B showed moderate performance (nDCG@10 = 0.611), while ms-marco-MiniLM (33M) and BGE-gemma2 (2B) lagged at 0.489 and 0.180, respectively. Per-mechanism analysis revealed complementary strengths: Jina v2 best identified host signaling manipulation passages (C: nDCG@10 = 0.718), MedCPT excelled at acidification interference (B: 0.839) and alternative niche detection (D: 0.558), while MonoT5-3B showed strong acidification (B: 0.635) and signaling (C: 0.596) detection. All models performed weakest on phagosome maturation arrest (mechanism A), with MedCPT leading at 0.463. Latency on GPU (RTX 3090) ranged from 3.1s (Jina v2) to 116s (BGE-gemma2) for 713 passages; LLM-based models were impractical on CPU alone (estimated 22+ hours), underscoring GPU dependence for production deployment. In Part 2, expert rubric evaluation of four AI-generated literature reviews ranked Claude 1st (4.8/5 — strongest citation quality with Nature Reviews Microbiology ×2, 28 DOI-verified references, and the finest molecular precision), Consensus and SciSpace tied 2nd (4.2/5 and 4.0/5, respectively — Consensus excelled in citation authenticity with zero hallucinated references, SciSpace in reference breadth at 30 citations including all 10 corpus papers), and ChatGPT 4th (2.2/5 — accurate at a surface level but relying on un-citable institutional labels instead of proper citations).
Cross-validation between human rubric scores and MedCPT reranker-assisted evaluation showed strong agreement for relevance ranking, partial agreement for literature authenticity and overall ranking, and confirmed that rerankers cannot assess synthesis depth, citation quality tiers, or factual accuracy.
Conclusions
Domain specialization outweighs model scale for biomedical passage reranking: the 110M-parameter MedCPT achieved perfect nDCG@10, surpassing models up to 36× its size. The general-purpose Jina v2 (278M) emerged as the strongest non-biomedical model (nDCG@10 = 0.888) with exceptional speed (3.1s for 713 passages on GPU). Mechanism-level analysis revealed complementary strengths across model architectures — Jina v2 best detected host signaling manipulation (C), MedCPT dominated acidification interference (B), and MonoT5-3B showed balanced coverage. Model scale alone did not predict performance: the 2B BGE-gemma2 scored lowest (0.180), while the 278M Jina v2 ranked second overall. CPU-only deployment is impractical for LLM-based rerankers (>22 hours vs. 71–74s on GPU). For biomedical IR pipelines, MedCPT offers the optimal accuracy–speed tradeoff; for general-purpose use without domain-specific training, Jina v2 provides the best balance. These findings directly informed Part 2 model selection. In Part 2, Claude produced the highest-quality AI-generated literature review (4.8/5), combining exceptional citation quality (Nature Reviews Microbiology ×2, Nature Microbiology) with the deepest mechanistic synthesis, while Consensus excelled in citation authenticity (zero hallucinated references) and SciSpace in breadth (30 references citing all 10 corpus papers). Reranker-assisted evaluation proved a reliable first-pass proxy for relevance but could not distinguish citation quality tiers or synthesis sophistication, confirming that expert review remains essential for evaluating AI-generated scientific text.
Primary Endpoints — Per-Mechanism nDCG@10
The research query decomposes into four distinct infection mechanisms (A–D), each with its own ground-truth relevance scores across 713 text blocks. Because mechanisms are independent — no block scores 3 on all four, and 247 of 259 max-3 blocks score 0 on at least one mechanism — a single composite score would conflate mechanism-specific retrieval quality. Therefore, the four per-mechanism nDCG@10 scores are the primary endpoints. An overall nDCG@10 using max(A,B,C,D) is reported as a secondary “any-mechanism retrieval breadth” metric.
Step 0 — Corpus Granularity
Each of the 10 source papers is split into overlapping text blocks (the unit of retrieval). The reranker scores every block against the query independently.
$$N = 713 \text{ blocks from 10 papers}$$
Block counts per paper range from 36 to 132. The reranker produces a ranked list of all 713 blocks per query. nDCG@10 evaluates only the top 10 positions of this 713-item ranking — did the best blocks rise to the top?
Step 1 — Relevance Scores (per block, per mechanism)
Each of the 713 blocks is independently scored against each of the 4 infection mechanisms (A: phagosome arrest, B: acidification/lysosome fusion, C: host signaling manipulation, D: alternative immune cell niches). Blocks from the same paper can and do receive different scores.
$$\text{rel}_{i,m} \in \{0, 1, 2, 3\} \quad \text{for each block } i \text{ and mechanism } m \in \{A, B, C, D\}$$
Where 0 = not relevant, 1 = tangential, 2 = relevant, 3 = highly relevant. nDCG@10 is computed separately for each mechanism using that mechanism’s scores (primary endpoints). A secondary overall score uses max_relevance = max(relA, relB, relC, relD), which measures retrieval of blocks relevant to any mechanism. Ground-truth annotations are documented in Appendix A: Source of Truth (exported separately).
Step 2 — Discounted Cumulative Gain (DCG)
After the reranker ranks all 713 blocks, DCG is computed over the top k=10 positions only. This study uses the linear gain formulation of Järvelin & Kekäläinen [1], where relevance scores are used directly rather than exponentiated. Higher-ranked positions contribute more due to logarithmic discounting.
$$\text{DCG}@k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)}$$
For each position i from 1 to k (k=10): take the relevance score of the block at that position and divide by $\log_2(i+1)$ to discount by rank position. A rel-3 block at rank 1 contributes $3 / \log_2(2) = 3.0$ points; the same block at rank 10 contributes $3 / \log_2(11) \approx 0.87$ points. This linear variant treats relevance proportionally: a score-3 block is worth exactly 3× a score-1 block at the same position.
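The linear-gain DCG described above can be written in a few lines; this is a minimal sketch (the function name `dcg_at_k` is ours, not taken from the study's codebase):

```python
import math

def dcg_at_k(rels, k=10):
    """Linear-gain DCG@k (Jarvelin & Kekalainen): sum of rel_i / log2(i + 1)
    over the top-k positions, where rels is the list of ground-truth grades
    in model-ranked order."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

# A rel-3 block at rank 1 contributes 3 / log2(2) = 3.0 points;
# the same block at rank 10 contributes 3 / log2(11), roughly 0.87.
```

With this formulation, moving a relevant block from rank 10 to rank 1 changes its contribution only through the positional discount, and a grade-3 block is worth exactly three times a grade-1 block at the same position.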
Step 3 — Ideal DCG (IDCG)
IDCG is the DCG of a perfect ranking — what you'd get if the 10 most relevant blocks (out of all 713) were placed at the top in optimal order. It represents the theoretical ceiling for the query.
$$\text{IDCG}@k = \sum_{i=1}^{k} \frac{\text{rel}^{*}_i}{\log_2(i + 1)}$$
Where $\text{rel}^{*}_i$ is the relevance of the block that should be at position i in a perfect ranking. Computed by sorting all 713 blocks' ground-truth relevance scores from highest to lowest and applying the same linear DCG formula to the top 10.
Step 4 — Normalized DCG (nDCG)
nDCG normalizes the actual DCG against the ideal, producing a score between 0 and 1 that is comparable across queries regardless of how many relevant blocks exist.
$$\text{nDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k}$$
nDCG = 1.0 means the reranker placed the 10 best blocks at the top in optimal order (out of 713). nDCG ≈ 0 means the top 10 were mostly irrelevant blocks. Because k=10 is a small window into a 713-block pool, this metric strongly penalizes rerankers that let irrelevant blocks leak into the top positions.
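Steps 2–4 combine into one short function. The sketch below is consistent with the definitions above — the ideal ranking is obtained by sorting every block's ground-truth grade in descending order — but is not the study's actual implementation:

```python
import math

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """nDCG@k = DCG@k of the model's ranking / DCG@k of the ideal ranking.

    ranked_rels: ground-truth grades of blocks in model-ranked order.
    all_rels:    ground-truth grades of every block in the pool (e.g. all 713),
                 used to construct the ideal top-k by sorting descending.
    """
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))

    idcg = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0 regardless of how many relevant blocks the pool contains, which is what makes the metric comparable across mechanisms and queries.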
Step 5 — Per-Mechanism Reporting
Steps 2–4 are applied independently for each mechanism, yielding four primary scores per model:
$$\text{nDCG}@10_A, \quad \text{nDCG}@10_B, \quad \text{nDCG}@10_C, \quad \text{nDCG}@10_D$$
Each mechanism’s nDCG@10 uses only that mechanism’s block-level relevance scores as ground truth. These four scores are the primary endpoints of this study. They reveal whether a reranker excels at surfacing blocks about phagosome arrest (A) but struggles with alternative niches (D), or vice versa.
Step 6 — Secondary Overall Score
A secondary composite uses max_relevance = max(A,B,C,D) per block, measuring whether the reranker surfaces blocks relevant to any mechanism.
$$\text{nDCG}@10_{\text{overall}} \quad \text{where } \text{rel}_i = \max(\text{rel}_{i,A},\; \text{rel}_{i,B},\; \text{rel}_{i,C},\; \text{rel}_{i,D})$$
This score is reported for completeness but is not the primary endpoint because it conflates mechanism-specific performance. A model scoring well on overall but poorly on mechanism B may be retrieving blocks about signaling (C) instead of acidification (B) — a distinction invisible to the composite. The per-mechanism breakdown in the summary cards above provides the definitive evaluation.
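The secondary composite is simply an element-wise max over the four mechanism grades. A toy illustration (the grades here are invented for the example, not real corpus annotations):

```python
# Hypothetical per-block grades for mechanisms A-D (not real corpus data).
block_rels = [
    {"A": 3, "B": 0, "C": 1, "D": 0},  # strong on phagosome arrest only
    {"A": 0, "B": 2, "C": 0, "D": 0},  # moderate on acidification only
    {"A": 0, "B": 0, "C": 0, "D": 0},  # irrelevant to all four mechanisms
]

# max_relevance per block: "relevant to ANY mechanism".
max_relevance = [max(r.values()) for r in block_rels]
```

Note how the composite keeps only the best grade per block — exactly why a high overall score can mask weakness on an individual mechanism.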
Worked Examples — nDCG@10 Calculations
Three calculations using actual reranker output and ground-truth annotations, demonstrating how the same top-10 ranking produces different nDCG@10 scores across mechanisms.
Ideal Ranking (IDCG@10)
For both mechanisms A and B, the corpus contains ≥10 blocks scored 3 (highly relevant). A perfect reranker would place 10 such blocks at positions 1–10:
$$\text{IDCG}@10 = \sum_{i=1}^{10} \frac{3}{\log_2(i+1)} = \frac{3}{1.000} + \frac{3}{1.585} + \frac{3}{2.000} + \frac{3}{2.322} + \frac{3}{2.585} + \frac{3}{2.807} + \frac{3}{3.000} + \frac{3}{3.170} + \frac{3}{3.322} + \frac{3}{3.459} = 13.631$$
This is the theoretical ceiling for any mechanism where ≥10 blocks have relevance 3. All three examples below share this IDCG.
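The ceiling value can be checked directly:

```python
import math

# Ideal DCG@10 when at least ten blocks carry grade 3:
# ten grade-3 blocks occupy positions 1..10.
idcg_10 = sum(3 / math.log2(i + 1) for i in range(1, 11))
# idcg_10 rounds to 13.631, matching the figure above.
```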
Example 1 — MedCPT on Mechanism A (Phagosome Arrest): nDCG@10 = 0.4630
MedCPT ranks all 713 blocks. The top 10 by model score, with their mechanism A ground-truth relevance:
| Rank | Block | Model Score | relA | Discount log2(i+1) | Contribution |
|------|-------|-------------|------|--------------------|--------------|
| 1 | P05B04 | 0.9999920 | 0 | 1.000 | 0.000 |
| 2 | P07B15 | 0.9999760 | 3 | 1.585 | 1.893 |
| 3 | P03B14 | 0.9999680 | 3 | 2.000 | 1.500 |
| 4 | P03B03 | 0.9999510 | 0 | 2.322 | 0.000 |
| 5 | P07B24 | 0.9999490 | 0 | 2.585 | 0.000 |
| 6 | P01B43 | 0.9999200 | 3 | 2.807 | 1.069 |
| 7 | P09B04 | 0.9999070 | 0 | 3.000 | 0.000 |
| 8 | P01B02 | 0.9998920 | 3 | 3.170 | 0.946 |
| 9 | P01B04 | 0.9998880 | 3 | 3.322 | 0.903 |
| 10 | P07B05 | 0.9998830 | 0 | 3.459 | 0.000 |
$$\text{DCG}@10 = 0 + 1.893 + 1.500 + 0 + 0 + 1.069 + 0 + 0.946 + 0.903 + 0 = 6.311$$
$$\text{nDCG}@10_A = \frac{6.311}{13.631} = 0.4630$$
Interpretation: 5 of the top 10 blocks score 0 on phagosome arrest — these blocks are relevant to other mechanisms (B, C, or D) but not A. MedCPT’s composite query retrieves broadly relevant blocks, but its mechanism A specificity is moderate. The rank-1 block (P05B04) scores 0 on A despite being the model’s highest-confidence result, contributing nothing to this mechanism’s DCG and wasting the most valuable ranking position.
Example 2 — MedCPT on Mechanism B (Acidification / Lysosome Fusion): nDCG@10 = 0.8386
Same model, same top-10 ranking — but now evaluated against mechanism B ground truth:
| Rank | Block | Model Score | relB | Discount log2(i+1) | Contribution |
|------|-------|-------------|------|--------------------|--------------|
| 1 | P05B04 | 0.9999920 | 2 | 1.000 | 2.000 |
| 2 | P07B15 | 0.9999760 | 3 | 1.585 | 1.893 |
| 3 | P03B14 | 0.9999680 | 3 | 2.000 | 1.500 |
| 4 | P03B03 | 0.9999510 | 3 | 2.322 | 1.292 |
| 5 | P07B24 | 0.9999490 | 3 | 2.585 | 1.161 |
| 6 | P01B43 | 0.9999200 | 3 | 2.807 | 1.069 |
| 7 | P09B04 | 0.9999070 | 2 | 3.000 | 0.667 |
| 8 | P01B02 | 0.9998920 | 3 | 3.170 | 0.946 |
| 9 | P01B04 | 0.9998880 | 3 | 3.322 | 0.903 |
| 10 | P07B05 | 0.9998830 | 0 | 3.459 | 0.000 |
$$\text{DCG}@10 = 2.000 + 1.893 + 1.500 + 1.292 + 1.161 + 1.069 + 0.667 + 0.946 + 0.903 + 0 = 11.430$$
$$\text{nDCG}@10_B = \frac{11.430}{13.631} = 0.8386$$
Interpretation: The exact same ranking scores 0.8386 on mechanism B vs. 0.4630 on mechanism A. Only 1 of the top 10 blocks is irrelevant to B (vs. 5 for A). This demonstrates precisely why per-mechanism reporting matters: an overall composite score alone would hide that MedCPT is nearly twice as effective at surfacing acidification content as phagosome arrest content.
Example 3 — MiniLM on Mechanism A (Phagosome Arrest): nDCG@10 = 0.1100
A 33M general-purpose cross-encoder (MS-MARCO), showing poor mechanism A retrieval:
| Rank | Block | Model Score | relA | Discount log2(i+1) | Contribution |
|------|-------|-------------|------|--------------------|--------------|
| 1 | P04B01 | 2.302590 | 0 | 1.000 | 0.000 |
| 2 | P03B19 | 2.239884 | 0 | 1.585 | 0.000 |
| 3 | P03B01 | 2.195090 | 3 | 2.000 | 1.500 |
| 4 | P04B03 | 2.137062 | 0 | 2.322 | 0.000 |
| 5 | P02B49 | 2.130432 | 0 | 2.585 | 0.000 |
| 6 | P02B57 | 2.109454 | 0 | 2.807 | 0.000 |
| 7 | P02B52 | 2.078258 | 0 | 3.000 | 0.000 |
| 8 | P03B36 | 2.070532 | 0 | 3.170 | 0.000 |
| 9 | P07B01 | 1.984957 | 0 | 3.322 | 0.000 |
| 10 | P03B03 | 1.962905 | 0 | 3.459 | 0.000 |
$$\text{DCG}@10 = 0 + 0 + 1.500 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 1.500$$
$$\text{nDCG}@10_A = \frac{1.500}{13.631} = 0.1100$$
Interpretation: MiniLM places only 1 relevant block in the top 10 — and it’s at rank 3, not rank 1, so its contribution (1.500) is already discounted. The remaining 9 positions are all mechanism A–irrelevant. This 33M general-purpose model, trained on MS-MARCO web queries, lacks the domain specificity to distinguish phagosome arrest content from general Mtb text. Compare with MedCPT (0.4630), which placed 5 relevant blocks in its own top 10.
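All three worked examples can be reproduced mechanically from the relevance columns in the tables above. A verification sketch, using the shared ten-grade-3 ideal as IDCG:

```python
import math

def ndcg_at_10(rels):
    """nDCG@10 against the shared ideal of ten grade-3 blocks at ranks 1-10."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    idcg = sum(3 / math.log2(i + 1) for i in range(1, 11))
    return dcg / idcg

# Relevance columns copied from the three worked-example tables.
medcpt_rel_a = [0, 3, 3, 0, 0, 3, 0, 3, 3, 0]  # Example 1: MedCPT, mechanism A
medcpt_rel_b = [2, 3, 3, 3, 3, 3, 2, 3, 3, 0]  # Example 2: MedCPT, mechanism B
minilm_rel_a = [0, 0, 3, 0, 0, 0, 0, 0, 0, 0]  # Example 3: MiniLM, mechanism A
```

Evaluating these reproduces the reported 0.4630, 0.8386, and 0.1100 to within rounding of the tabulated three-decimal contributions.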
Part 2A — Detailed Evaluation Analysis
1st — Claude (4 Excellent, 2 Good) — 4.8/5
Strengths: Combines breadth with citation precision — 28 fully formatted references with DOIs (all 2021–2025). Strongest citation profile of all four systems: Nature Reviews Microbiology (×2), Nature Microbiology, Nature Communications, NPJ Vaccines. Exceptional molecular detail across all mechanisms, describing PtpA/PtpB V-ATPase dephosphorylation, PI3P disruption via ManLAM, cGAS-STING-IRF3 pathway activation, EspL-mediated autophagy inhibition, DosR/DevR dormancy regulation, WhiB3 redox sensing, and RpfB-mediated reactivation. Sophisticated handling of the type I interferon paradox citing separate 2024–2025 studies. Includes WHO 2024 epidemiological data and advanced topics (MAIT cells, cytosolic translocation, granuloma dynamics). Near-publication-quality document structure.
Weaknesses: Falls 2 references short of Excellent threshold (28 vs. 30). Two references (“Patel & Bhatt 2023” and “Russell 2023”) cannot be fully verified. Very lengthy, potentially impractical for quick reference. Could further integrate single-cell and spatial transcriptomics data. Lacks summary table feature found in Consensus.
Key distinction: Claude achieves the highest overall quality by combining the breadth of SciSpace with the citation precision of Consensus. The minor shortfall in reference count (28 vs. 30) is the only meaningful gap preventing a perfect score.
2nd (tied) — Consensus (3 Excellent, 1 Good, 2 Acceptable) — 4.2/5
Strengths: All 18 references include working DOIs verified as real publications — zero hallucinated citations. Includes high-impact journals: Nature Reviews Microbiology, PNAS, Cellular and Molecular Immunology. Warner et al. 2025 identified as the field’s most current comprehensive review. Meaningful synthesis connecting PknG, ESX effectors, and galectins. Unique summary table integrating infection stages with citations. Accurately identifies strain-specific epithelial responses and disease progression as spectrum.
Weaknesses: Reference count at 18 falls below the Good threshold (20+). Synthesis depth constrained by smaller reference set. Cell death pathways and metabolic dormancy receive only brief treatment. No formal abstract or keywords section. Omits some important recent papers (Zheng 2024, Feng 2024).
Key distinction: In contexts where verifiability is the top priority, Consensus outperforms all others. Quality over quantity — its 18 references are impeccable, but the smaller set constrains coverage depth.
2nd (tied) — SciSpace (4 Excellent, 2 Good) — 4.0/5
Strengths: Highest reference count (30, the only Excellent-tier). Excellent structural organization with full table of contents across 12 sections. Deep mechanistic synthesis correctly describing PknG, V-ATPase, Rab7/LAMP-1, and ManLAM interactions. Includes cutting-edge finding on lysosome-poor monocyte niche (Zheng 2024). Covers cell death pathways with precision including pyroptosis and apoptosis. Critically, SciSpace cited all 10 Part 1 corpus papers directly ([1] Rankine-Wilson, [5] Bo, [6] Shen, [9] Lei, [11] Kilinç, [12] Zheng, [14] Chandra, [17] Kim, [22] Witt, [29] Khadela), providing a natural control for Part 2B reranker-based source fidelity detection.
Weaknesses: ~2 potentially unverifiable references including ref [10] lacking journal name and ref [13] as unconfirmed preprint. Citation quality skews toward medium-impact open-access journals rather than top-tier publications (no Nature Microbiology or Cell Host & Microbe). Later sections contain repetitive content. No DOIs provided, complicating verification. Less emphasis on most recent 2024–2025 advances.
Key distinction: SciSpace’s corpus overlap makes it invaluable for Part 2B validation. Research requiring breadth favors SciSpace; research requiring verifiability favors Consensus.
4th — ChatGPT (0 Excellent, 2 Good, 3 Acceptable, 1 Poor) — 2.2/5
Strengths: Well-organized and readable for non-specialist audiences. Broadly accurate at a general level covering major topics correctly — phagosome arrest, V-ATPase interference, ESX-1 function, and immune responses. Clear and logical section structure. Identifies emerging research directions including spatial transcriptomics and immunometabolism.
Weaknesses: Only ~6 named references without DOIs or full bibliographic details (Poor). In-text labels like “(PubMed)”, “(Nature)”, “(ScienceDirect)” are not real citations. Multiple references identified as vague and likely fabricated. Uses non-attributed statements like “a 2023 study showed...” Lacks grounded synthesis across studies. Not suitable for any scholarly submission — cannot be verified without DOIs.
Key distinction: ChatGPT produced a well-organized textbook-style summary rather than a scholarly literature review. Despite accessible presentation, it is academically unsuitable — it would fail peer review on citation quality alone.
Part 2C — Cross-Analysis (Human vs. Reranker Agreement)
Comparing human rubric rankings (Part 2A) against MedCPT reranker scores (Part 2B) to test whether rerankers can approximate expert judgment.
| Criterion | Human Ranking (2A) | Reranker Proxy | Reranker Ranking (2B) | Agreement |
|-----------|--------------------|----------------|------------------------|-----------|
| Relevance to Infection Mechanism | Claude = Consensus = SciSpace > ChatGPT | Composite score | Consensus > SciSpace > Claude > ChatGPT | Strong |
| Literature Authenticity | Consensus > Claude = SciSpace > ChatGPT | Corpus match | SciSpace > Consensus > ChatGPT > Claude | Partial |
| Scientific Depth | Claude = SciSpace > Consensus > ChatGPT | Mechanism coverage | SciSpace > Consensus > Claude > ChatGPT | Partial |
| Bio Accuracy | Claude = Consensus = SciSpace > ChatGPT | N/A | — | Not measurable |
| Overall Ranking | Claude > Consensus = SciSpace > ChatGPT | Composite + corpus | Consensus > SciSpace > Claude > ChatGPT | Partial |
Interpretation:
• Relevance: Strong agreement — both human and reranker place ChatGPT last and the top three (Claude, Consensus, SciSpace) as all Excellent. The reranker correctly identifies that Consensus and SciSpace produce the most on-topic content.
• Authenticity: Partial agreement — the reranker’s corpus match correctly identifies SciSpace as most grounded in real literature (it cites all 10 corpus papers) and Consensus as strong. However, it ranks ChatGPT above Claude, which contradicts human judgment. This reflects a limitation: corpus match measures overlap with these specific 10 papers, not literature authenticity broadly. ChatGPT’s generic overview text happens to use similar terminology to the corpus papers.
• Depth: Partial agreement — the reranker measures mechanism coverage (how many topics are addressed) but cannot assess whether content integrates findings across studies vs. lists them in isolation. Human judges rated both Claude and SciSpace’s synthesis as Excellent; the reranker only sees topical overlap.
• Bio Accuracy: Not measurable by rerankers — semantic similarity cannot distinguish correct from incorrect biological claims. This remains exclusively a human expert domain.
• Overall ranking: Partial agreement — both human and reranker agree on ChatGPT last, but diverge at the top. Human experts place Claude 1st (4.8/5) based on citation quality and mechanistic depth, while the reranker favors Consensus and SciSpace based on composite relevance scores. This reveals a key limitation: rerankers cannot distinguish between citation quality tiers (Nature Reviews Microbiology vs. open-access journals) or evaluate the sophistication of scientific synthesis.
• Scalability thesis: Rerankers are reliable proxies for relevance (strong agreement), partially useful for authenticity and overall ranking (with corpus-specific caveats), and unable to assess depth, accuracy, or citation quality. For scalable triage of AI-generated scientific text, a reranker could serve as a first-pass filter for relevance and corpus grounding, but expert review remains essential for synthesis quality, citation rigor, and factual correctness.