
IMDRF:
ML Practices for Medical Devices:
Guiding Principles
On January 29, 2025, IMDRF released a significant document titled “Good Machine Learning Practice for Medical Device Development: Guiding Principles.” This publication outlines ten guiding principles designed to promote the development of safe, effective, and high-quality medical devices that incorporate artificial intelligence (AI) and machine learning (ML) technologies. These principles are intended to be applied across the entire lifecycle of AI/ML-enabled medical devices.
IMDRF’s latest publication provides a comprehensive framework for the development and deployment of AI/ML-enabled medical devices. By adhering to these guiding principles, stakeholders can ensure that such devices are safe, effective, and aligned with the best practices in medical device development, thus achieving:
✔️ Regulatory approval compliance
✔️ Clinical reliability and safety
✔️ Better generalization across patient populations
✔️ Improved trust and adoption among healthcare professionals
Let’s walk through the ten guiding principles for Good Machine Learning Practice (GMLP).
1. Intended use and multi-disciplinary expertise
A clear understanding of a medical device’s intended use within clinical workflows, combined with multi-disciplinary expertise, ensures that AI-enabled medical devices address clinically meaningful needs while optimizing safety and effectiveness across the total product life cycle.
- AI/ML-enabled medical devices require input from diverse disciplines, including data science, software engineering, regulatory affairs, clinical medicine, and cybersecurity.
- Collaboration across these domains ensures that models are designed with clinical applicability, regulatory compliance, and safety considerations from the outset.
- Engaging clinicians early in development helps align AI models with real-world medical workflows.
💡 “AI in healthcare is not just a software problem—it’s a clinical, ethical, and regulatory challenge. A truly impactful AI solution integrates expertise from engineers, clinicians, and regulators from day one.”
2. Software engineering and security best practices
Robust software engineering, usability, cybersecurity, and risk management practices should be implemented throughout the product life cycle to ensure the reliability, safety, traceability, and ethical handling of patient data.
- Software reliability is crucial for AI in healthcare, where errors can have serious consequences.
- Adhering to industry standards like ISO 13485 (Quality Management for Medical Devices) and IEC 62304 (Software Lifecycle Processes) ensures high-quality software development.
- Security measures such as encryption, authentication protocols, and real-time threat monitoring help mitigate risks of cyberattacks and unauthorized data access.
🔐 “Security is patient safety. In an era of increasing healthcare cyber threats, strong encryption, access controls, and rigorous software validation should be non-negotiable.”
3. Representative clinical evaluation datasets
Datasets used in clinical evaluation must sufficiently represent the intended patient population and use environment to ensure generalizability, mitigate bias, and identify performance limitations across subgroups.
- AI models must generalize well across diverse populations to prevent biased or inaccurate predictions.
- Data collection should include age, gender, ethnicity, and comorbidities that reflect real-world populations.
- Regulatory bodies may require demographic balance and fairness audits to prevent algorithmic bias.
⚖️ “Bias in AI leads to bias in care. A model trained on homogeneous data will fail in diverse clinical settings, potentially harming underrepresented groups. Fairness is not an afterthought—it’s a requirement.”
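As a concrete illustration (not from the IMDRF document itself), a basic subgroup performance audit needs only a few lines of Python. The group labels and toy predictions below are hypothetical; in practice the groups would be demographic attributes such as age band, sex, or ethnicity:

```python
from collections import defaultdict

def subgroup_accuracy(y_true, y_pred, groups):
    """Report accuracy per subgroup to surface performance gaps
    that an aggregate metric would hide."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}

# Toy example: the model looks fine overall but is weak on subgroup "B"
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(subgroup_accuracy(y_true, y_pred, groups))  # {'A': 1.0, 'B': 0.25}
```

A fairness audit would extend this from accuracy to clinically relevant metrics (sensitivity, specificity) per subgroup, with pre-specified acceptance thresholds.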
4. Independence of training and test datasets
Training and test datasets must be kept independent of each other to prevent data leakage, ensuring robust validation and external testing proportional to the device’s risk.
- Avoiding data leakage between training and test sets ensures that performance metrics accurately reflect a model’s ability to generalize to unseen cases.
- Cross-validation techniques, such as k-fold validation and independent test sets, help prevent overfitting.
- Proper dataset partitioning practices help improve reliability in real-world clinical settings.
📊 “AI models can easily fool us with artificially inflated accuracy if the same data leaks between training and testing. True generalization is tested in unseen, real-world data—not in artificially curated test sets.”
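A common leakage safeguard in medical data is to split at the patient level, so that no patient contributes records to both training and test sets (two scans of the same patient in different splits is a classic leak). A minimal sketch, assuming each record carries a hypothetical `patient_id` field:

```python
import random
from collections import defaultdict

def split_by_patient(records, test_fraction=0.2, seed=42):
    """Partition records so that all data from one patient lands in a
    single split, preventing patient-level leakage between sets."""
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec["patient_id"]].append(rec)

    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_test = max(1, int(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [r for pid in patient_ids if pid not in test_ids for r in by_patient[pid]]
    test = [r for pid in test_ids for r in by_patient[pid]]
    return train, test

# Example: three scans each for five patients
records = [{"patient_id": p, "scan": s} for p in range(5) for s in range(3)]
train, test = split_by_patient(records)
# No patient appears on both sides of the split
assert not {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
```

The same grouping idea extends to cross-validation (e.g., group-aware k-fold), and to splitting by site or device when those are sources of correlation.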
5. Fit-for-purpose reference standards
Reference standards are selected based on clinical relevance, broad consensus, and suitability for the intended use, ensuring model robustness and generalizability across patient populations.
- Reference datasets, or ground truth data, should be sourced from high-quality, validated sources.
- Annotations should be verified by multiple expert radiologists, pathologists, or other relevant specialists to minimize errors.
- Data curation should follow industry best practices for reproducibility and transparency.
🛠️ “High-quality ground truth is the foundation of trustworthy AI. Relying on poorly labeled data is like building a skyscraper on sand—it’s a disaster waiting to happen.”
6. Tailored model choice and design
Model selection and design are aligned with available data and intended use to mitigate risks like overfitting and performance degradation while ensuring clinical effectiveness and safety across patient populations and use conditions.
- AI model architecture should match the nature of the problem—e.g., convolutional neural networks (CNNs) for imaging, recurrent neural networks (RNNs) for time-series data.
- The model should be explainable, interpretable, and its complexity should not exceed what is necessary for clinical utility.
- AI models should be tested under edge cases (e.g., rare diseases) to ensure robustness.
🎯 “Tailor model to use case. The best AI models are those that balance complexity with explainability, performance with practicality.”
7. Human-AI interaction assessment
Device performance should be evaluated within its intended clinical environment, considering human factors such as user expertise, understanding of model outputs, potential overreliance, and foreseeable misuse to optimize safety and effectiveness in real-world use.
- AI should augment, not replace, human decision-making in medical environments.
- Performance metrics should evaluate how well clinicians and AI work together, not just AI in isolation.
- User interfaces should be designed to improve workflow integration, minimizing cognitive load and false alarms.
🤝 “AI is a tool, not a replacement for clinicians. The best AI models enhance human decision-making, reducing workload while improving accuracy and confidence.”
8. Clinically relevant performance testing
Testing methodologies should ensure that device performance is assessed under real-world clinical conditions, independent of the training dataset, accounting for patient subgroups, clinical workflows, and potential confounding factors.
- AI models should be tested in real-world hospital environments, not just in lab simulations.
- Performance validation should include different imaging equipment, scanner resolutions, and real-life patient variability.
- Stress testing should assess the AI’s resilience to poor-quality data, missing values, or uncommon presentations.
🏥 “An AI model that performs well in a controlled lab setting but fails in the chaotic reality of a hospital is not clinically useful. Real-world testing is a valuable claim to be able to make; it may eventually become a mandatory step before deployment, and could set the next bar for claiming broad applicability.”
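The stress-testing idea above can be sketched as a simple harness (an illustration, not a prescribed IMDRF method): feed the model degraded copies of each case with randomly masked fields and check that it still returns a well-formed output instead of crashing. The `robust_model` below is a hypothetical stand-in for a real predictor:

```python
import random

def stress_test(predict, cases, dropout=0.3, seed=0):
    """Feed degraded copies of each case (randomly masked fields) to the
    model and count cases where it crashes or returns an invalid score.
    `predict` is any callable mapping a feature dict to a float in [0, 1]."""
    rng = random.Random(seed)
    failures = 0
    for case in cases:
        degraded = {k: (None if rng.random() < dropout else v)
                    for k, v in case.items()}
        try:
            score = predict(degraded)
            if score is None or not 0.0 <= score <= 1.0:
                failures += 1
        except Exception:
            failures += 1
    return failures

# Hypothetical model that tolerates missing inputs by ignoring them
def robust_model(features):
    known = [v for v in features.values() if v is not None]
    return sum(known) / len(known) if known else 0.5  # neutral fallback

cases = [{"hr": 0.6, "bp": 0.4, "spo2": 0.9} for _ in range(50)]
assert stress_test(robust_model, cases) == 0
```

Real stress suites would also vary image quality, scanner vendor, and rare presentations, but the pattern is the same: perturb inputs systematically and verify graceful behavior.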
9. Clear and essential user information
Users should receive transparent and contextually relevant information on intended use, benefits, risks, performance, limitations, acceptable inputs, clinical workflow integration, updates, and a means to report concerns.
- Transparent documentation should accompany AI devices, detailing intended use, limitations, and confidence scores.
- Explainability tools, such as heatmaps for imaging AI or SHAP/LIME for tabular data, could, for instance, be included in the documentation to aid clinicians in decision-making. (A great resource and place to start looking into this: Explainable AI, LIME & SHAP for Model Interpretability.)
- AI predictions should be presented with uncertainty estimates, ensuring users understand the reliability of outputs.
📖 “AI without transparency is a black box of risk. Clinicians need to understand how AI reaches its conclusions—otherwise, they won’t trust or use it.”
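One lightweight way to attach uncertainty to a prediction, sketched here as an assumption rather than a method from the IMDRF document, is to summarize the spread across an ensemble of model seeds or checkpoints and flag high-disagreement cases for manual review. The `threshold` value is hypothetical and would need clinical calibration:

```python
import statistics

def present_prediction(ensemble_probs, threshold=0.15):
    """Summarize ensemble outputs for one case as a mean score plus an
    uncertainty band, flagging cases the clinician should review.
    ensemble_probs: probabilities for the same case from several model
    seeds/checkpoints (a simple proxy for predictive uncertainty)."""
    mean = statistics.mean(ensemble_probs)
    spread = statistics.stdev(ensemble_probs)
    flag = "REVIEW" if spread > threshold else "OK"
    return {"score": round(mean, 3), "uncertainty": round(spread, 3), "flag": flag}

print(present_prediction([0.82, 0.78, 0.85, 0.80]))  # models agree -> "OK"
print(present_prediction([0.30, 0.75, 0.55, 0.90]))  # models disagree -> "REVIEW"
```

Surfacing the flag alongside the score, rather than the bare probability, is what turns a model output into information a clinician can act on.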
10. Ongoing performance monitoring and retraining risk management
Deployed models should be continuously monitored for performance in real-world settings, with safeguards against overfitting, dataset drift, and unintended bias during retraining to ensure ongoing safety and effectiveness.
- AI models should be continuously monitored post-deployment to detect drift in performance due to changing patient demographics or new medical practices.
- Regulatory frameworks should define how AI models can be updated while maintaining safety and efficacy.
- Manufacturers should establish a post-market surveillance system to detect unexpected adverse events or degradation in performance.
📡 “AI models are not ‘fire and forget’—they must evolve with changing patient populations, medical practices, and technologies. Continuous monitoring ensures AI stays reliable over time.”
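Drift detection as described above is often operationalized with a distribution-comparison statistic. As one illustration (the IMDRF document does not mandate a specific metric), the Population Stability Index compares the score distribution at deployment time against live production scores; a PSI above roughly 0.2 is a common "investigate drift" signal:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline score distribution and live production
    scores; larger values indicate a bigger distribution shift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def histogram(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Smooth empty bins so the log ratio stays finite
        return [(c + 0.5) / (len(xs) + 0.5 * bins) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]                 # scores at deployment
shifted = [min(1.0, i / 100 + 0.3) for i in range(100)]  # drifted population
assert population_stability_index(baseline, baseline) == 0.0
assert population_stability_index(baseline, shifted) > 0.2
```

In a monitoring pipeline this would run on a schedule over recent production scores, with alerts feeding the manufacturer's post-market surveillance system.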
International Medical Device Regulators Forum (IMDRF)
The IMDRF is a voluntary group of medical device regulators from around the world, working together to accelerate international medical device regulatory harmonization and convergence. Established in 2011, IMDRF builds on the foundational work of the Global Harmonization Task Force (GHTF) and aims to strategically promote an efficient and effective regulatory model for medical devices that is responsive to emerging challenges while protecting and maximizing public health and safety.