Rebecca Schnall, Hui Lin, Maeve Brin, Jean Jimenez, Amy K Johnson, Mirjam-Colette Kempf, Nan Liu
Objective: This study examined whether attendance at online digital health research appointments in the American Women Assessing Risk Epidemiologically (AWARE) study was associated with (1) participant age, (2) scheduling factors (time of day, day of week, month), (3) appointment confirmation, and (4) HIV behavioral risk factors.
Materials and methods: We analyzed scheduling and eligibility screening data from AWARE, a 24-month U.S.-based longitudinal digital cohort of cisgender women at elevated likelihood of HIV seroconversion. Participant demographic and behavioral data were merged with the study team's Outlook calendar. Chi-square tests and logistic regression models assessed associations between appointment attendance and participant characteristics and scheduling factors.
Results: Women aged ≥50 years had higher odds of missing baseline visits than those aged 20-29 years (44.7% vs 32.3%). Appointments scheduled at 2:00 pm (45.7%), 4:00 pm (45.2%), and 8:00 am (40.2%) had higher no-show rates than those at other times. No-show rates were lowest on Fridays (30.2%) and during March (27.7%) and June (25.2%). Appointments confirmed 24 hours in advance had significantly fewer no-shows than unconfirmed appointments (19.0% vs 51.6%). Histories of being physically hurt (44.2% vs 32.1%), being forced into sexual activity (41.8% vs 34.1%), and incarceration (39.3% vs 33.4%) were also associated with higher no-show rates. Similar patterns were observed for rescheduled visits.
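A comparison such as the no-show rate by confirmation status can be illustrated with a Pearson chi-square test on a 2x2 table. The sketch below is not the authors' analysis: the abstract reports only percentages, so the counts are hypothetical values chosen to reproduce 19.0% and 51.6%, and the function names are illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-square tail probability has a closed form.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def odds_ratio(a, b, c, d):
    """Odds of the first-column outcome in row 1 relative to row 2."""
    return (a * d) / (b * c)


# HYPOTHETICAL counts, chosen only so the rates match the reported percentages.
confirmed_no_show, confirmed_attended = 19, 81        # 19 / 100 = 19.0%
unconfirmed_no_show, unconfirmed_attended = 129, 121  # 129 / 250 = 51.6%

table = (confirmed_no_show, confirmed_attended,
         unconfirmed_no_show, unconfirmed_attended)
stat, p_value = chi_square_2x2(*table)
print(f"chi-square = {stat:.2f}, p = {p_value:.2g}")
print(f"odds ratio of no-show (confirmed vs unconfirmed) = {odds_ratio(*table):.2f}")
```

The test statistic depends on the real cell counts, so the numbers printed here describe the made-up table only, not the AWARE data.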
Conclusion: Attendance in digital research was influenced by age, scheduling, and structural vulnerabilities. Incorporating digital access support into study design and grant budgets may reduce disparities, improve retention, and enhance efficiency.
Optimizing participation in digital health studies: understanding appointment attendance. Journal of the American Medical Informatics Association. Published 2026-04-25. doi:10.1093/jamia/ocag055. https://doi.org/10.1093/jamia/ocag055
Amy C Justice, Benjamin McMahon, Daniel A Jacobson, Kelly Cho, Anuj J Kapadia, Samuel M Aguayo, Zeynep H Gümüş, Ioana Danciu, Jean C Beckham, Nathan A Kimbrel, Silvia Crivelli, Eilis A Boudreau, Pat Finley, Alex K Bryant, Michael Green, Shinjae Yoo, Jacob Joseph, Peter Reaven, Jin Zhou, Shiuh-Wen Luoh, Ravi Madduri, Ayman Fanous, Khushbu Agarwal, Harshini Mukundan, Sumitra Muralidhar
Objective: Phase II of MVP-CHAMPION, a federal collaboration between the Veterans Affairs Healthcare System (VA) and the Department of Energy (DoE), leveraged large-scale clinical, geo-spatial, and genetic data with state-of-the-art artificial intelligence (AI) and high-performance computing (HPC) to improve value in healthcare.
Materials and methods: Eight clinical priority projects for which AI was a critical missing capability were initiated to address: lung cancer screening (MVP 061), suicide risk screening (MVP 062), cardiovascular risk in obstructive sleep apnea (MVP 063), checkpoint inhibitor toxicity (MVP 064), heart failure (MVP 065), renal complications in diabetes (MVP 066), post-COVID-19 sequelae (MVP 067), and antipsychotic medication toxicity (MVP 068).
Results: Building on a strong regulatory and administrative foundation, we developed multimorbidity-aware analytic frameworks, reusable computational tools, and analytic pipelines. These greatly facilitated the identification of novel risk factors, including genetic variants, and the specification of more discriminating prediction models. The novel genetic risk factors are informing the development and repurposing of medications, and the more discriminating prediction models promise to improve healthcare value.
Discussion: The research foundation developed in Phase I and extended in Phase II of MVP-CHAMPION has supported an unprecedented federal collaboration and yielded significant scientific advances. Our clinical findings are poised for near-term application, while advances in machine learning and high-performance computing may accelerate the broader adoption of artificial intelligence in healthcare.
Conclusion: This maturing VA-DoE federal collaboration is poised to transform the future of Veterans' healthcare and the broader national landscape of precision health.
Increasing value in the Veterans Affairs Healthcare System (VA) with precision health: a continuing landmark collaboration with the Department of Energy. Journal of the American Medical Informatics Association. Published 2026-04-25. doi:10.1093/jamia/ocag062. https://doi.org/10.1093/jamia/ocag062
Fatima Sayed, Albert Park, Patrick S Sullivan, Alexis Jordan, Soyeon Kwon, Yaorong Ge
Objective: This review aims to identify the contribution of user experience features and underlying technical features to sustained engagement in unguided chatbots for improving health-related behaviors.
Materials and methods: Following PRISMA-2020 guidelines, we conducted a systematic review, searching PubMed, ACM, APA PsycINFO, Cochrane, Web of Science, and IEEE Xplore from June to September 2022, with an updated search in April 2025. Data were analyzed using Synthesis without Meta-Analysis (SWiM) to understand the relationship between overall user engagement and individual experience metrics.
Results: Customizable avatars and flexible input interactions may enhance overall user engagement. Conversely, pre-scripted content that lacks personalization and emotional support negatively impacts user satisfaction and adherence to health interventions. Other features contributing to sustained engagement are in-app technical assistance, user learning features, and crisis support systems. In the SWiM analysis, a strong positive correlation (r = 0.808, n = 16) was observed between user satisfaction and engagement, specifically for the satisfaction dimensions of need fulfillment (r = 0.872, n = 6), willingness to recommend the chatbot (r = 0.817, n = 4), and user enjoyment (r = 0.971, n = 3). The limited application of large language models and retrieval-augmented generation techniques may constrain the quality of support available to users and overall sustained engagement.
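To give a sense of how much precision correlations based on so few studies carry, the sketch below computes an approximate 95% confidence interval for a Pearson r using the Fisher z-transformation. This is an illustration added here, not a method described in the review; it assumes independent observations and approximate bivariate normality, and the function name is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("the Fisher interval is undefined for n <= 3")
    z = math.atanh(r)            # Fisher z-transform of r
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# r and n values as reported in the abstract.
for label, r, n in [("satisfaction vs engagement", 0.808, 16),
                    ("willingness to recommend", 0.817, 4)]:
    low, high = fisher_ci(r, n)
    print(f"{label}: r = {r}, n = {n}, 95% CI ({low:.2f}, {high:.2f})")
```

For the user-enjoyment correlation (n = 3) the interval cannot be computed at all under this approximation, which is consistent with the authors' call for further validation of individual satisfaction features.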
Conclusion: Effective unguided chatbot design requires an emphasis on interactive educational elements, in-app technical assistance and crisis support, and personalized content. This can be achieved with high context awareness, input understanding, and quality content generation. Our findings suggest that user satisfaction is a primary driver of sustained engagement, though further research is needed to validate individual user satisfaction features for sustained engagement.
Identifying key user experience and technical features for sustained use of unguided chatbots for health-related behavior change: a systematic review. Journal of the American Medical Informatics Association. Published 2026-04-25. doi:10.1093/jamia/ocag044. https://doi.org/10.1093/jamia/ocag044
Eric Bressman, Sae-Hwan Park, S Ryan Greysen, Jinbo Chen
Objective: Traditional readmission risk models relying on static discharge data have limited predictive performance and fail to capture patients' recovery trajectories after hospitalization. We sought to identify optimal modeling parameters for dynamically predicting readmission risk using post-discharge step-count data from remote monitoring devices.
Methods: We combined data for adults aged 55+ from 2 studies that collected longitudinal activity data after discharge. We constructed a patient-day dataset incorporating static demographic and clinical variables and dynamic activity features aggregated over retrospective windows of 3, 5, 7, or 10 days. Models predicted readmission or death over prospective horizons of 3, 5, 7, or 10 days, within follow-up periods of 30-180 days. Logistic regression and LightGBM models were trained using 5-fold cross-validation on an 80:20 patient-level split.
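The patient-day construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: days are numbered from 1 after discharge, activity features are limited to the mean and minimum step count, a day is dropped when the outcome over the full horizon was not observed, and all names (`build_patient_days`, `patient_level_split`) are hypothetical.

```python
import random
from statistics import mean


def build_patient_days(daily_steps, event_day, window, horizon):
    """One row per prediction day t: features from days t-window+1..t,
    label = 1 if readmission/death falls in days t+1..t+horizon."""
    follow_up = len(daily_steps)
    rows = []
    for t in range(window, follow_up + 1):
        if event_day is not None and t >= event_day:
            break  # no predictions on or after the event day
        if event_day is None and t + horizon > follow_up:
            break  # outcome over the full horizon was not observed
        recent = daily_steps[t - window:t]
        label = int(event_day is not None and event_day <= t + horizon)
        rows.append({"day": t, "mean_steps": mean(recent),
                     "min_steps": min(recent), "label": label})
    return rows


def patient_level_split(patient_ids, test_fraction=0.2, seed=0):
    """Split by patient so that no patient contributes rows to both sets."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])


# Hypothetical patient: declining step counts, event on day 9 after discharge.
steps = [1200, 1500, 1800, 900, 700, 400, 300, 250]
rows = build_patient_days(steps, event_day=9, window=3, horizon=3)
for row in rows:
    print(row)

train_ids, test_ids = patient_level_split([f"p{i}" for i in range(10)])
print(len(train_ids), "training patients,", len(test_ids), "test patients")
```

Splitting at the patient level rather than the row level matters because consecutive patient-days from one person are highly correlated; a row-level split would leak information between training and test sets.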
Results: Among 215 participants, LightGBM outperformed logistic regression across all configurations (mean AUC 0.82 vs 0.76). Performance improved with longer prospective horizons but was insensitive to retrospective window length. The LightGBM model was well-calibrated (Hosmer-Lemeshow χ² = 2.46, P = .96), whereas logistic regression showed miscalibration (χ² = 51.8, P < .001). In feature-importance analyses, LightGBM ranked static (length of stay, vitals, BMI) and activity (recent steps, distance) features highly, whereas logistic regression emphasized activity variables.
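The Hosmer-Lemeshow statistics above can be read against a chi-square distribution. The sketch below assumes the conventional 10 risk groups (8 degrees of freedom), which the abstract does not state; under that assumption the reported P values are reproduced approximately. The grouping function is a generic textbook version with hypothetical names, not the authors' code.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow statistic over equal-size groups ordered by predicted risk.

    Assumes every group has an expected event count strictly between 0 and its size.
    """
    pairs = sorted(zip(y_prob, y_true), key=lambda pair: pair[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        size = len(chunk)
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat, chi2_sf_even_df(stat, groups - 2)


# Reported statistics, evaluated under the ASSUMED 8 degrees of freedom.
print(f"P(chi-square >= 2.46) = {chi2_sf_even_df(2.46, 8):.2f}")
print(f"P(chi-square >= 51.8) = {chi2_sf_even_df(51.8, 8):.1e}")
```

A large P value here means the test found no evidence of miscalibration; it does not by itself prove good calibration, especially with 215 participants.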
Discussion: Prediction performance was impacted by horizon length and training window, with minimal effect of retrospective window. LightGBM achieved better discrimination and calibration, supporting flexible, non-parametric methods for post-discharge risk prediction.
Conclusion: Post-discharge step count data enhance dynamic readmission risk prediction. Optimizing temporal windows and model type improves discrimination and calibration.
Optimizing temporal windows for wearable-augmented post-discharge risk prediction: a methods study. Journal of the American Medical Informatics Association. Published 2026-04-25. doi:10.1093/jamia/ocag057. https://doi.org/10.1093/jamia/ocag057
Seungjun Kim, Yiliang Zhou, Yawen Guo, Changrui Xiao, Kai Zheng
Objectives: Patients with rare diseases often face long delays before receiving a diagnosis. Using electronic health records for automated phenotyping and diagnosis of rare diseases is a promising approach but can be challenging because critical information is often recorded in unstructured notes rather than structured fields. This systematic review synthesizes the current literature applying natural language processing (NLP) and large language models (LLMs) for rare disease phenotyping and diagnosis from clinical text.
Materials and methods: A systematic search was conducted in PubMed, ACM Digital Library, and IEEE Xplore. Two reviewers independently screened papers and extracted data. Methodological rigor and quality of the studies were evaluated using the MI-CLAIM framework.
Results: The search resulted in 135 studies; 27 of them met the inclusion criteria. Methods used spanned rule-based systems, classical ML/DL models, transformer architectures, and LLMs. Transformer- and LLM-based approaches outperformed earlier methods in entity recognition, phenotype extraction, and diagnostic ranking. Several studies demonstrated clinical impact, such as increased genetic testing and identification of undiagnosed cases. However, most studies relied on retrospective and single-center datasets. Reporting of preprocessing, evaluation, and reproducibility was largely inconsistent, and interpretability, fairness, and privacy were rarely addressed.
Discussion: Natural language processing and LLMs show strong potential to accelerate rare disease diagnosis. However, heterogeneity in methods and metrics hinders cross-study comparability. Data scarcity, lack of generalization, and limited transparency remain significant challenges.
Conclusions: Natural language processing/LLM methods can support timely diagnosis of rare diseases using unstructured clinical text. Future research should prioritize multicenter studies, standardized evaluation frameworks, transparency, and fairness safeguards to enable reliable, equitable deployment.
Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review. Journal of the American Medical Informatics Association. Published 2026-04-16. doi:10.1093/jamia/ocag045. https://doi.org/10.1093/jamia/ocag045
Dylan Owens, Jing Cao, Mehak Gupta, Danh Nguyen, Eric Peterson, Ann Marie Navar
Objective: Computable phenotypes derived from electronic health records (EHRs) are central to clinical research and quality reporting. Although large language models (LLMs) can extract clinically rich information from unstructured notes, routine application to all patients is computationally expensive. We evaluated whether uncertainty-guided selective use of LLMs can improve phenotyping accuracy while preserving scalability.
Materials and methods: We developed a selective augmentation framework integrating structured and unstructured EHR data using uncertainty-guided triage. An ensemble of heterogeneous classifiers trained on structured data generated probabilistic phenotype predictions and uncertainty measures to identify patients at elevated risk of misclassification. Only flagged patients underwent LLM-based analysis of unstructured clinical notes using retrieval-augmented generation. LLM-derived outputs were incorporated as additional predictors in a final probabilistic model. Performance was evaluated for two registry-based phenotypes: diabetes mellitus and peripheral arterial disease (PAD), using internal cross-registry and external validation cohorts.
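The triage step described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the abstract does not specify the uncertainty measures or thresholds, so the entropy of the mean probability, the spread across ensemble members, both cutoffs, and the `llm_review` callable are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 1.0 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def triage(ensemble_probs, entropy_cutoff=0.8, spread_cutoff=0.15):
    """Flag patients whose structured-data prediction is uncertain.

    ensemble_probs maps patient id -> list of probabilities, one per classifier.
    Both cutoffs are HYPOTHETICAL values chosen for illustration.
    """
    flagged = []
    for patient_id, probs in ensemble_probs.items():
        ambiguous = binary_entropy(mean(probs)) >= entropy_cutoff
        disagreeing = pstdev(probs) >= spread_cutoff
        if ambiguous or disagreeing:
            flagged.append(patient_id)
    return flagged


def selective_augment(ensemble_probs, llm_review):
    """Call the (hypothetical) LLM note review only for flagged patients."""
    flagged = set(triage(ensemble_probs))
    features = {}
    for patient_id, probs in ensemble_probs.items():
        features[patient_id] = {
            "structured_prob": mean(probs),
            "llm_signal": llm_review(patient_id) if patient_id in flagged else None,
        }
    return features


# Hypothetical ensemble outputs for four patients.
probs = {
    "A": [0.95, 0.97, 0.93],  # confident positive
    "B": [0.45, 0.55, 0.50],  # near the decision boundary
    "C": [0.10, 0.60, 0.20],  # classifiers disagree
    "D": [0.02, 0.05, 0.03],  # confident negative
}
print("flagged for LLM review:", triage(probs))
```

In a full pipeline the `llm_signal` values would feed a final model alongside the structured-data probability; here they are simply carried through so the selective call pattern is visible.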
Results: For diabetes mellitus, selective augmentation improved sensitivity in the internal validation cohort from 0.81 to 0.90 without loss of specificity (0.92). More than 70% of triage-flagged patients represented misclassifications by structured data alone. For PAD, selective augmentation markedly increased sensitivity from 0.18 to 0.97 while maintaining high specificity (0.99), requiring LLM analysis for only 10% of patients.
Discussion: Uncertainty-guided triage efficiently concentrated LLM use on patients most likely to benefit, improving case identification, particularly for phenotypes poorly captured by structured data, while minimizing computational burden.
Conclusion: Selective, uncertainty-guided integration of LLMs enables scalable, interpretable, and accurate EHR-based phenotyping, offering a practical alternative to universal LLM deployment in real-world informatics workflows.
Targeted use of large language models for EHR-based computable phenotyping. Journal of the American Medical Informatics Association. Published 2026-04-16. doi:10.1093/jamia/ocag051.
Objectives: To examine real-world barriers to implementing federated learning in healthcare and highlight the organizational, regulatory, and socio-technical factors often overlooked in technical research.
Materials and methods: Insights were derived from a 3-year implementation of a Nordic-Baltic federated health data network involving 5 countries and 9 institutions, incorporating legal, organizational, and cross-disciplinary perspectives.
Results: Structural challenges included coordination burdens, divergent interpretations of privacy and risk, epistemological gaps between disciplines, and the absence of legal frameworks for multi-country distributed learning in Europe. These constraints limited progress despite the availability of robust technical solutions.
Discussion: Technical privacy measures alone cannot replace trust-building, governance development, and cross-disciplinary translation work. Federated learning is better understood as a socio-technical collaboration model than as a purely technical architecture.
Conclusion: Pre-implementation planning, tiered participation models, and strengthened governance are essential to support equitable, sustainable, and clinically impactful adoption of federated learning in healthcare.
Federated learning's uncomfortable truth: why human networks matter more than neural networks. Laura-Maria Peltonen, Taridzo Chomutare. Journal of the American Medical Informatics Association. Published 2026-04-15. doi:10.1093/jamia/ocag047
Objective: Home healthcare (HHC) clinical notes contain critical infection indicators that clinicians need in structured "indicator + context" pairs. Data sparsity and limited computing resources hinder automated extraction in decentralized HHC settings. This study developed and evaluated a resource-efficient pipeline using instruction-tuned, moderate-sized large language models (LLMs) to address these barriers. To address the data sparsity challenge, we also assessed the impact of a targeted LLM-based data augmentation strategy.
Materials and methods: An expert-defined schema of 26 infection indicator categories was developed. We expanded the training set using a 3-stage workflow: targeted annotation, context mutation, and synthetic generation. We adapted 2 moderate-sized models (Gemma-12B and Qwen-14B) via Quantized Low-Rank Adaptation (QLoRA). We compared them to a larger-sized, prompted model and a smaller-sized, fully fine-tuned LLM. We evaluated all models on a held-out test set using partial micro-averaged F1 score, output reliability metrics, and qualitative error analysis.
Results: Instruction-tuned moderate-sized LLMs outperformed both baselines. The top-performing model, augmented Gemma-12B, achieved a partial micro-averaged F1 score of 0.879. LLM-based data augmentation enhanced overall performance, improving the identification of rare indicators and the interpretation of negations. The best model maintained a partial F1 score above 0.750 across all indicator categories. It also showed high format adherence, confirming its ability to generate reliable structured outputs.
Discussion: Instruction-tuning moderate-sized LLMs with QLoRA and targeted data augmentation enables high-accuracy extraction of infection indicators from HHC notes.
Conclusion: This resource-efficient pipeline provides a scalable foundation for automated infection surveillance in healthcare settings with limited resources.
Automating infection indicator extraction in home healthcare through instruction-tuned large language models. Zidu Xu, Jiyoun Song, Shuang Zhou, Danielle Scharp, Mollie Hobensack, Yan Hu, Jingjing Shang, Maxim Topaz. Journal of the American Medical Informatics Association. Published 2026-04-13. doi:10.1093/jamia/ocag040
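The evaluation metric named in the infection-indicator abstract, a partial micro-averaged F1 over "indicator + context" pairs, could be computed along these lines. The abstract does not define "partial", so this sketch assumes a prediction counts as a hit when its category equals a gold category and one context string contains the other; `partial_match`, `micro_f1`, and the sample notes are hypothetical.

```python
def partial_match(pred, gold):
    """Assumed rule: same category, and one context string contains the other."""
    (p_cat, p_ctx), (g_cat, g_ctx) = pred, gold
    p_ctx, g_ctx = p_ctx.lower().strip(), g_ctx.lower().strip()
    if not p_ctx or not g_ctx:
        # An empty context only matches another empty context.
        return p_cat == g_cat and p_ctx == g_ctx
    return p_cat == g_cat and (p_ctx in g_ctx or g_ctx in p_ctx)

def micro_f1(predictions, references):
    """Pool true/false positives and false negatives over all notes, then score."""
    tp = fp = fn = 0
    for preds, golds in zip(predictions, references):
        unmatched = list(golds)
        for pred in preds:
            hit = next((g for g in unmatched if partial_match(pred, g)), None)
            if hit is None:
                fp += 1
            else:
                tp += 1
                unmatched.remove(hit)  # each gold pair can be matched once
        fn += len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two toy notes: (category, context) pairs predicted by a model vs. annotated.
predictions = [
    [("fever", "temp 101.2"), ("wound", "redness")],
    [("cough", "productive cough")],
]
references = [
    [("fever", "temp 101.2 F this morning"), ("wound", "no drainage")],
    [("cough", "productive cough"), ("fever", "denies fever")],
]
score = micro_f1(predictions, references)
print(round(score, 4))  # 0.5714
```

Micro-averaging pools counts across all notes and categories before computing precision and recall, so frequent indicator categories weigh more than rare ones; that is one reason a per-category breakdown, such as the one in the Results, is reported alongside it.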
Howard R Strasberg, Edward P Hoffer, Ross Koppel, Kevin B Johnson, William M Tierney, Geoffrey W Rutledge, Elmer V Bernstam
Objectives: We report on findings from a meeting convened by the American College of Medical Informatics (ACMI) to characterize aspects of the patient experience that could be improved using informatics.
Materials and methods: The American College of Medical Informatics fellows were invited to share their experiences as patients and suggest informatics approaches that may improve the patient experience.
Results: We identified 4 themes: (1) getting the right care, (2) data sharing and data interoperability, (3) guiding low-cost evaluations, and (4) predictive analytics.
Discussion: Despite widespread adoption of health IT, patient experiences remain far from optimal.
Conclusion: The American College of Medical Informatics fellows identified informatics approaches, applications, and research areas that have the potential to improve patient experiences with health care systems.
Opportunities for informatics to improve patient experiences: observations and reflections of ACMI fellows. Howard R Strasberg, Edward P Hoffer, Ross Koppel, Kevin B Johnson, William M Tierney, Geoffrey W Rutledge, Elmer V Bernstam. Journal of the American Medical Informatics Association. Published 2026-04-13. doi:10.1093/jamia/ocag046
Alicia Lu, Velandai Srikanth, Sarah Westworth, Yue-Guang Baey, Chris Moran, Richard Beare, Kristy Siostrom, Nadine Andrew, Taya Collyer
Objectives: Leveraging routine electronic health records (EHRs) for dementia detection is a growing field, but the quality and clinical utility of existing models are unclear. This systematic review aimed to evaluate the performance, methodological quality, and risk of bias of EHR-based dementia prediction models.
Materials and methods: We systematically searched Medline, EMBASE, Scopus, IEEE Xplore, and ACM from database inception until July 2024. All studies and grey literature describing development or validation of probabilistic prediction models using EHR data for dementia detection were included. Risk of bias was assessed using PROBAST.
Results: Fifty-six studies (434 prediction models, 155 external validations) were included. Most models were prognostic (66%), used US data (71%), and relied solely on structured data; 47 (11%) were externally validated. Modeled outcomes were extremely heterogeneous: gold-standard clinical criteria were used in 17 models (4%), with others reliant on diagnostic codes for case ascertainment. Discriminative metrics were frequently reported (82% of models), but calibration was rarely assessed (16%). All models were judged high risk of bias, driven by poor outcome definition, inadequate handling of missing data, and potential overfitting.
Discussion: Our review highlights significant issues with methodological rigor and reporting transparency in existing EHR dementia prediction models. Ambiguous outcomes, flawed case ascertainment, and incomplete performance reporting all limit clinical usefulness. Overall, model performance was difficult to assess and compare across studies due to incomplete reporting.
Conclusion: Electronic health record-based dementia prediction is still in its infancy. Methodological rigor and interdisciplinary collaboration are essential to meet clinical needs and achieve real-world impact.
Electronic health record-based prediction models for dementia detection: a systematic review of model performance and quality. Alicia Lu, Velandai Srikanth, Sarah Westworth, Yue-Guang Baey, Chris Moran, Richard Beare, Kristy Siostrom, Nadine Andrew, Taya Collyer. Journal of the American Medical Informatics Association. Published 2026-04-09. doi:10.1093/jamia/ocag048
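As a quick arithmetic check on the dementia review's Results, the two percentages that are reported together with raw counts can be reproduced from the stated total of 434 models. The helper `pct` is hypothetical, and the sketch assumes the percentages were rounded to the nearest whole number with the 434 models as the denominator.

```python
def pct(count, total):
    """Percentage of total, rounded to the nearest whole number."""
    return round(100 * count / total)

TOTAL_MODELS = 434  # prediction models included in the review

externally_validated = pct(47, TOTAL_MODELS)  # 47 models with external validation
gold_standard = pct(17, TOTAL_MODELS)         # 17 models using clinical criteria

print(externally_validated)  # 11
print(gold_standard)         # 4
```

Both values agree with the abstract. The 155 external validations exceed the 47 externally validated models, which would be consistent with some models having been validated more than once, although the abstract does not say so explicitly.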