Generative Artificial Intelligence (GenAI) in cardiac surgery refers to the integration of advanced computational models, such as Large Language Models (LLMs), to automate and enhance clinical decision-making, preoperative risk assessment, and surgical education. In surgical training, it functions as a personalized pedagogical tool that supports learning activities ranging from information acquisition and clinical inquiry to procedural practice, while requiring rigorous human oversight to ensure patient safety and clinical accuracy.
(1) Background: GenAI is increasingly integrated into health professions education, offering new opportunities for learning; however, its specific application and pedagogical mapping in high-stakes fields such as cardiac surgery remain underexplored. This systematic review investigates how GenAI is used in cardiac surgery and surgical education, aligning these uses with Laurillard’s six learning types.
(2) Methods: Following the PRISMA 2020 guidelines, we searched the Web of Science Core Collection for studies on GenAI in cardiac surgery; 42 studies met the inclusion criteria. Study quality was appraised using the Medical Education Research Study Quality Instrument (MERSQI).
(3) Results: GenAI applications most frequently supported clinical inquiry (93.8%) and practice (68.8%), with growing efficiency reported across commercial and open-source models (including ChatGPT-4o, Gemini AI, and emerging reasoning architectures such as DeepSeek) for knowledge acquisition and medical production. While GenAI significantly improves individualized learning and preoperative assessment workflows, its practical role in discussion and collaboration remains heavily underutilized, indicating a distinct shift toward individualized, solo professional workflows.
(4) Conclusions: GenAI provides a transformative and scalable approach to cardiac surgical training by offering personalized and accessible knowledge retrieval. However, clinical educators and governance bodies must deliberately balance these immediate productivity benefits with long-term concerns regarding structural “hallucinations,” data verifiability, and the preservation of collaborative competencies within modern multidisciplinary Heart Teams.
Medical education has long relied on traditional modalities such as didactic lectures, cadaveric dissection, and apprenticeship-based clinical instruction to train future healthcare professionals
[1]. While these methods have produced generations of competent clinicians, they are increasingly challenged by constraints of accessibility, ethics, cost, and scalability
[2,3]. For example, cadaver-based anatomy, once considered the gold standard of anatomical training, faces shortages of specimens in many regions, along with cultural and ethical sensitivities that restrict its use
[4]. Similarly, the apprenticeship paradigm in surgery, often summarized as “see one, do one, teach one,” exposes patients to risks and is constrained by reduced clinical availability, patient safety imperatives, and time limitations
[5,6].
Generative artificial intelligence (GenAI), particularly large language models (LLMs), has transitioned from a general-purpose technology into a practical tool for healthcare contexts, enabling rapid synthesis of information, drafting of clinical text, and interactive question answering
[7]. This rapid diffusion has been accompanied by persistent concerns regarding factual reliability, hallucinations, privacy, accountability, and the risk of inappropriate overreliance—concerns that intensify in high-stakes environments such as surgery and perioperative care
[8]. Within this landscape, cardiothoracic domains represent a particularly consequential test case: cardiac and thoracic surgery routinely demand time-sensitive decisions based on complex, multimodal clinical information, frequently under guideline constraints and multidisciplinary coordination
[9].
Cardiac surgery decision-making commonly requires integrating patient comorbidities, imaging and angiographic findings, procedural feasibility, operative risk, and evolving evidence-based recommendations
[10]. In this setting, early evaluations have explored whether LLMs can approximate expert reasoning in structured scenarios. For example, in coronary revascularization decision-making, LLM outputs were compared with multidisciplinary Heart Team recommendations, and measurable concordance was reported—while also revealing that performance and alignment depend strongly on the completeness and structure of the case context provided to the model
[11]. Similar “alignment” questions are now being asked in adjacent cardiothoracic decision spaces; for instance, AI-driven recommendations have been compared with Heart Team decisions for multivessel coronary artery disease, again emphasizing potential utility while underscoring the need to interpret outputs within established clinical governance structures
[12].
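The notion of concordance between model outputs and Heart Team recommendations can be made concrete with a short calculation. The sketch below is illustrative only: the cited studies are described here as reporting "measurable concordance" without a named statistic, so the use of raw percent agreement and Cohen's kappa is an assumption, and the case labels are invented for demonstration rather than drawn from any study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of cases in which both raters chose the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty case list.")
    return sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters labelled independently at their own base rates.
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in set(rater_a) | set(rater_b)
    ) / (n * n)
    if p_expected == 1.0:
        return 1.0  # Both raters used one identical label throughout.
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for six cases (not data from any cited study).
heart_team = ["CABG", "PCI", "CABG", "medical", "PCI", "CABG"]
llm_output = ["CABG", "PCI", "PCI", "medical", "PCI", "CABG"]

print(f"agreement = {percent_agreement(heart_team, llm_output):.3f}")
print(f"kappa     = {cohens_kappa(heart_team, llm_output):.3f}")
```

Reporting a chance-corrected statistic alongside raw agreement matters when one treatment option dominates the case mix, because high raw agreement can then arise from base rates alone.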
A parallel stream of work has focused on improving the traceability and verifiability of LLM-based outputs by incorporating retrieval-augmented methods
[13]. In cardiology guideline extraction tasks, a multi-query, multimodal retrieval-augmented pipeline was assessed on vignette-based questions and demonstrated improved accuracy relative to general-purpose chat models, while also returning traceable references to support point-of-care verification
[14]. While these tools show promise for established clinicians, their impact on the formative development of surgical trainees remains a critical area of inquiry
[15]. These design priorities—accuracy, transparency, and auditability—are especially relevant to cardiac surgical pathways, where guideline updates are frequent and where the consequences of confidently delivered misinformation can be severe
[16]. In other words, while generic LLM chat interfaces may be appealing, clinically meaningful adoption in cardiac surgery is likely to depend on architectures and governance models that support explicit sourcing and human oversight
[17].
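To illustrate what "explicit sourcing" means in practice, the following minimal sketch retrieves passages for several phrasings of one question and returns each hit together with its source identifier. It is a toy under stated assumptions, not the pipeline evaluated in the cited work: simple keyword overlap stands in for the embedding-based and multimodal retrieval such systems typically use, and the `Passage` type, the source identifiers, and the placeholder texts are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str  # Hypothetical citation key shown to the clinician.
    text: str


def tokenize(text):
    """Lower-case word set; a crude stand-in for a real embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(queries, corpus, top_k=2):
    """Score each passage by its best token overlap across all query phrasings."""
    scores = {}
    for passage in corpus:
        passage_tokens = tokenize(passage.text)
        scores[passage] = max(len(tokenize(q) & passage_tokens) for q in queries)
    ranked = sorted(corpus, key=lambda p: scores[p], reverse=True)
    # Keep only passages that actually matched, so every answer stays traceable.
    return [(p, scores[p]) for p in ranked[:top_k] if scores[p] > 0]


# Placeholder corpus; texts are illustrative and carry no clinical content.
corpus = [
    Passage("GUIDE-A sec. 1", "Placeholder text on anticoagulation after valve replacement."),
    Passage("GUIDE-B sec. 4", "Placeholder text on revascularization strategy in multivessel disease."),
    Passage("GUIDE-C sec. 2", "Placeholder text on perioperative renal protection."),
]
queries = [
    "revascularization in multivessel coronary disease",
    "multivessel disease strategy",
]

hits = retrieve(queries, corpus)
for passage, score in hits:
    print(f"[{passage.source_id}] score={score}: {passage.text}")
```

Returning the source identifier with every passage is the design choice that supports point-of-care verification: a reader can check the cited section rather than trust the generated summary.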
Beyond conversational decision support, LLMs and GenAI are also being explored in perioperative risk assessment and prevention-oriented modeling
[18]. For instance, an LLM-enabled approach has been proposed for preoperative prevention of cardiopulmonary bypass-associated acute kidney injury (CPB-AKI), illustrating how language-based representations can be integrated into structured predictive pipelines for risk mitigation planning
[19]. Along similar lines of “structured evaluation,” multimodal and unimodal LLMs have been compared with human clinical experts in complex cardiovascular emergencies such as aortic dissection management, reflecting the field’s growing emphasis on benchmarking LLM performance against multicenter expert reasoning
[20]. In valvular disease contexts, proof-of-concept work has also explored AI-automated operative risk stratification for severe aortic stenosis, indicating expanding interest in model-supported planning and risk framing in procedural heart disease
[21].
At the same time, evidence suggests that LLM performance can be uneven and context-dependent, with stronger results typically observed when tasks are anchored to explicit guideline statements and weaker reliability in open-ended or highly contextual scenarios. In thoracic trauma management, for example, ChatGPT-4o was evaluated against guideline-based questions using specialist scoring, with favorable average performance but continued emphasis on oversight and careful characterization of failure modes
[22]. Patient-facing use cases further illustrate this tension between accessibility and safety. In thoracic oncology, ChatGPT responses to common patient questions about lung cancer surgery have been clinically evaluated for accuracy and relevance, underscoring both the potential to support patient education and the necessity of clinician review for perioperative counseling
[23]. Together, these studies highlight that “patient education” may be one of the most visible adoption pathways, but also one of the most safety-sensitive.
In parallel to clinical workflows, GenAI is actively reshaping the educational environment in which future cardiac and thoracic surgeons are trained
[7]. Surgical education relies on progressive responsibility, case-based reasoning, simulation, supervised decision-making, and the development of tacit knowledge through feedback and team-based socialization
[24]. GenAI tools could plausibly support trainees by generating explanations, providing structured rehearsal cases, creating question banks, or scaffolding reflective feedback
[25]. Yet these same tools may also change how clinical reasoning is developed, how learners engage with uncertainty, and how collaborative competencies are cultivated
[26]. A recent systematic review in Medical Education emphasized that a major gap in existing syntheses is the limited understanding of how learners interact with GenAI and how these interactions map onto different learning activity types. Using Laurillard’s framework, Pham and colleagues showed that GenAI use in health professional education most frequently aligns with acquisition, inquiry, practice, and production, whereas discussion and collaboration are less commonly supported—suggesting a shift toward individualized learning workflows and raising questions about how to deliberately design GenAI-enhanced learning to preserve collaborative learning goals
[27].
To critically evaluate the structural impact of these emerging technologies on medical and surgical training, a robust pedagogical framework is required. Laurillard’s Conversational Framework
[28] offers a comprehensive lens for this purpose, conceptualizing learning as a series of formal, multi-directional dialogues between teachers, learners, and learning environments
[29]. The framework delineates six distinct learning-activity types that constitute a holistic educational experience: (1) Acquisition, where learners absorb information through lectures or texts; (2) Inquiry, which involves active exploration of resources and literature; (3) Practice, where learners apply knowledge in simulated settings and receive operational feedback; (4) Production, involving the articulation of conceptual understandings through outputs like essays or projects; (5) Discussion, which fosters peer-to-peer or teacher-student debates; and (6) Collaboration, where learners work collectively toward a shared output or goal
[28]. By applying this lens to GenAI in cardiac surgery, we can critically assess whether these tools facilitate dialogic clinical reasoning or, conversely, isolate the learner within individualized, non-collaborative workflows.
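A learning-activity map of this kind reduces to a simple tally once each study has been coded against the six types. The sketch below assumes that a study may be coded to more than one learning type and that coverage is reported as the percentage of studies supporting each type; the study identifiers and codings are hypothetical and do not reproduce the review's data.

```python
from collections import Counter

# Laurillard's six learning-activity types, in the order given in the text.
LEARNING_TYPES = (
    "acquisition",
    "inquiry",
    "practice",
    "production",
    "discussion",
    "collaboration",
)


def learning_type_coverage(coded_studies):
    """Return the percentage of studies coded to each learning type."""
    n_studies = len(coded_studies)
    if n_studies == 0:
        raise ValueError("At least one coded study is required.")
    counts = Counter()
    for study_id, types in coded_studies.items():
        unknown = set(types) - set(LEARNING_TYPES)
        if unknown:
            raise ValueError(f"{study_id}: unknown learning types {sorted(unknown)}")
        counts.update(set(types))  # Count each type at most once per study.
    return {t: round(100 * counts[t] / n_studies, 1) for t in LEARNING_TYPES}


# Hypothetical coding of four studies (illustrative only).
coded = {
    "study_01": {"acquisition", "inquiry"},
    "study_02": {"inquiry", "practice"},
    "study_03": {"inquiry", "production"},
    "study_04": {"practice"},
}

coverage = learning_type_coverage(coded)
for learning_type, pct in coverage.items():
    print(f"{learning_type:<13} {pct:5.1f}%")
```

Because studies can support several types at once, the percentages need not sum to 100; types with zero coverage remain in the output so that underrepresented modes stay visible.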
Within health professions education, contemporary learning theories emphasize that clinical competence is inherently dialogic and socially embedded
[30]. The recent wave of GenAI integration primarily acts as a disruptor to this social balance. Scholars note that while AI tools dramatically scale up individualized learning behaviors—serving as highly responsive personal tutors for acquisition and inquiry—they may inadvertently induce an “educational silo effect,” decoupling the trainee from the collaborative and peer-supported environments critical for high-stakes domains like cardiac surgery
[31]. Therefore, evaluating GenAI tools through Laurillard’s six learning types is essential to ensure that digital adoption scaffolds, rather than dismantles, the collaborative competencies required in modern clinical teams.
While general syntheses exist, the high-stakes, multimodal, and technically demanding nature of cardiac surgery necessitates a domain-specific learning-activity map to ensure clinical safety and educational efficacy.
Notably, GenAI-related discussions are also appearing in perioperative and intraoperative practice domains where behavioral adoption and workflow integration are central
[32,33]. For example, a literature review on intraoperative coronary graft verification combined conventional synthesis with AI-driven insights under human supervision, reflecting how GenAI is increasingly used not only as an object of evaluation but also as a meta-analytic or analytic aid within surgical scholarship
[34]. While such approaches may increase efficiency, they further strengthen the case for transparent methods and clear reporting standards regarding where and how GenAI contributes to evidence synthesis.
Despite rapidly expanding publications across medicine, the evidence base specific to cardiac surgery remains fragmented
[35]. Consequently, a focused synthesis is needed to clarify: (i) current GenAI/LLM uses in cardiac surgical care and training, (ii) use cases with empirical support and dominant metrics, (iii) recurring safety and governance concerns, and (iv) urgent gaps in dialogic and collaborative learning.
Accordingly, the present systematic review aims to characterize GenAI/LLM applications in cardiac surgery and surgical education across the 42 identified studies and to summarize evaluation approaches and reported outcomes (e.g., decision concordance, guideline adherence, vignette accuracy, specialist rubric scores, and education-related outcomes). In addition, educational applications are mapped to Laurillard’s learning-activity categories to produce a learning-activity map that highlights underrepresented learning modes and informs more pedagogically grounded and safer integration strategies for cardiac surgical training and practice
[27,36].