Summarizing https://arxiv.org/pdf/2305.15324

Here's my try:

The proposed workflow for training and deploying AI models embeds the results of model evaluations for extreme risks into key safety and governance processes. This includes dangerous capability evaluations and alignment evaluations to identify misuse and misalignment risks arising from AI development. The evaluations cover dangerous capabilities such as cyber-offense, deception, persuasion & manipulation, and political strategy. The evaluation results feed into risk assessment processes, which inform (or bind) important decisions around model training, deployment, and security. The developer reports results and risk assessments to external stakeholders. Three sources of model evaluations feed into this process: internal model evaluations, external research access, and independent external safety evaluations.
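To make that workflow concrete, here's a rough sketch of the shape of it. Everything here (the class names, the 0-1 score scale, the "capability plus misalignment" rule, the threshold) is my own illustrative assumption, not anything specified in the paper:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class EvalSource(Enum):
    INTERNAL = "internal model evaluation"
    EXTERNAL_RESEARCH = "external research access"
    INDEPENDENT = "independent external safety evaluation"


@dataclass
class EvalResult:
    capability: str          # e.g. "cyber-offense", "deception", "persuasion & manipulation"
    source: EvalSource
    score: float             # 0.0 (no capability) .. 1.0 (strong capability); illustrative scale
    misalignment_flag: bool  # did alignment evals surface concerning propensities?


@dataclass
class RiskAssessment:
    results: List[EvalResult]

    def extreme_risk(self, capability_threshold: float = 0.8) -> bool:
        # Illustrative rule (mine, not the paper's): flag extreme risk when any
        # dangerous capability is both strong and paired with misalignment evidence.
        return any(
            r.score >= capability_threshold and r.misalignment_flag
            for r in self.results
        )
```

A real process would obviously be far richer (qualitative red-teaming, structured access for auditors, reporting to regulators and other stakeholders), but the shape is the same: results from multiple sources aggregated into an assessment that informs or binds the decisions.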

The first line of defence is to avoid training models that have sufficient dangerous capabilities and misalignment to pose extreme risk. Sufficiently concerning evaluation results should warrant delaying a scheduled training run or pausing an existing one. Before a frontier training run, developers have the opportunity to study weaker models that might provide early warning signs. These models come from two sources: (1) previous training runs, and (2) experimental models leading up to the new training run. Developers should evaluate these models and identify potential risks before proceeding with the new training run.
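Building on the sketch above, the "delay or pause" idea could be pictured as a simple gate over assessments of those precursor models (again, the decision logic and names are my own illustration, not the paper's procedure):

```python
from typing import List


def pretraining_gate(precursor_assessments: List["RiskAssessment"]) -> str:
    """Decide whether a scheduled frontier training run should proceed, based on
    evaluations of weaker precursor models (previous runs and experimental models
    leading up to the new run). Uses RiskAssessment from the earlier sketch."""
    if any(a.extreme_risk() for a in precursor_assessments):
        return "pause"    # delay the scheduled run, or pause one already underway
    return "proceed"      # first line of defence passed
```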

If a model has already been trained, developers can still monitor its performance and detect concerning trends. They can also conduct periodic evaluations of the model's capabilities and misalignment risks through external research access and independent safety evaluations. If such evaluations reveal extreme risk, developers should take appropriate action, which could include halting deployment or modifying and retraining the model.
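That periodic re-evaluation could be wired up as a simple loop. The callables here are hypothetical hooks standing in for whatever eval harness and escalation process a developer actually has; only the loop structure is the point. It reuses `EvalResult` and `RiskAssessment` from the first sketch:

```python
import time
from typing import Callable, List


def post_training_monitoring(
    run_evals: Callable[[], List["EvalResult"]],            # developer's own eval harness (hypothetical hook)
    on_extreme_risk: Callable[["RiskAssessment"], None],    # e.g. halt/restrict deployment, notify stakeholders
    interval_seconds: int = 7 * 24 * 3600,                  # illustrative weekly cadence
) -> None:
    """Periodically re-run dangerous-capability and alignment evaluations on an
    already-trained model and escalate if they indicate extreme risk."""
    while True:
        assessment = RiskAssessment(results=run_evals())
        if assessment.extreme_risk():
            on_extreme_risk(assessment)
            break
        time.sleep(interval_seconds)
```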

In addition to these evaluations, developers should also consider implementing additional safeguards, such as regular audits, monitoring, and testing, to ensure that their models remain safe and aligned with their intended purpose. These measures can help identify potential issues before they become critical and mitigate the impact of any incidents that do occur.
