Migrating to angr 7

Last updated 7 months ago

The release of angr 7 introduces several departures from long-standing angr-isms. While the community has created a compatibility layer to give external code written for angr 6 a good chance of working on angr 7, the best thing to do is to port it to the new version. This document serves as a guide for this.

SimuVEX is gone

angr versions up through angr 6 split the program analysis into two modules: simuvex, which was responsible for analyzing the effects of a single piece of code (whether a basic block or a SimProcedure) on a program state, and angr, which aggregated analyses of these basic blocks into program-level analysis such as control-flow recovery, symbolic execution, and so forth. In theory, this would encourage for the encapsulation of block-level analyses, and allow other program analysis frameworks to build upon simuvex for their needs. In practice, no one (to our knowledge) used simuvex without angr, and the separation introduced frustrating limitations (such as not being able to reference the history of a state from a SimInspect breakpoint) and duplication of code (such as the need to synchronize data from state.scratch into path.history).

Realizing that SimuVEX wasn't a usable independent package, we brainstormed about merging it into angr and further noticed that this would allow us to address the frustrations resulting from their separation.

All of the SimuVEX concepts (SimStates, SimProcedures, calling conventions, types, etc) have been migrated into angr. The migration guide for common classes is bellow:

Before

After

simuvex.SimState

angr.SimState

simuvex.SimProcedure

angr.SimProcedure

simuvex.SimEngine

angr.SimEngine

simuvex.SimCC

angr.SimCC

And for common modules:

Before

After

simuvex.s_cc

angr.calling_conventions

simuvex.s_state

angr.sim_state

simuvex.s_procedure

angr.sim_procedure

simuvex.plugins

angr.state_plugins

simuvex.engines

angr.engines

simuvex.concretization_strategies

angr.concretization_strategies

Additionally, simuvex.SimProcedures has been renamed to angr.SIM_PROCEDURES, since it is a global variable and not a class. There have been some other changes to its semantics, see the section on SimProcedures for details.

Removal of angr.Path

In angr, a Path object maintained references to a SimState and its history. The fact that the history was separated from the state caused a lot of headaches when trying to analyze states inside a breakpoint, and caused overhead in synchronizing data from the state to its history.

In the new model, a state's history is maintained in a SimState plugin: state.history. Since the path would now simply point to the state, we got rid of it. The mapping of concepts is roughly as follows:

Before

After

path

state

path.state

state

path.history

state.history

path.callstack

state.callstack

path.trace

state.history.descriptions

path.addr_trace

state.history.bbl_addrs

path.jumpkinds

state.history.jumpkinds

path.guards

state.history.guards

path.actions

state.history.actions

path.events

state.history.events

path.recent_actions

state.history.recent_actions

An important behavior change about path.actions and path.recent_actions - actions are no longer tracked by default. If you would like them to be tracked again, please add angr.options.refs to your state.

Path Group -> Simulation Manager

Since there are no paths, there cannot be a path group. Instead, we have a Simulation Manager now (we recommend using the abbreviation "simgr" in places you were previously using "pg"), which is exactly the same as a path group except it holds states instead of paths. You can make one with project.factory.simulation_manager(...).

Errored Paths

Before, error resilience was handled at the path level, where stepping a path that caused an error would return a subclass of Path called ErroredPath, and these paths would be put in the errored stash of a path group. Now, error resilience is handled at the simulation manager level, and any state that throws an error during stepping will be wrapped in an ErrorRecord object, which is not a subclass of SimState, and put into the errored list attribute of the simulation manager, which is not a stash.

An ErrorRecord object has attributes for .state (the initial state that caused the error), .error (the error that was thrown), and .traceback (the traceback from the error). To debug these errors you can call .debug().

These changes are because we were uncomfortable making a subclass of SimState, and the ErrorRecord class then has sufficiently different semantics from a normal state that it cannot be placed in a stash.

Changes to SimProcedures

The most noticeable difference from the old version to the new version is that the catalog of built-in simprocedures are no longer organized strictly according to which library they live in. Now, they are organized according to which standards they conform to, which helps with re-using procedures between different libraries. For instance, the old SimProcedures['libc.so.6'] has been split up between SIM_PROCEDURES['libc'], SIM_PROCEDURES['posix'], and SIM_PROCEDURES['glibc'], depending on what specifications each function conforms to. This allows us to reuse the libc catalog in msvcrt.dll and the MUSL libc, for example.

In order to group SimProcedures together by libraries, we have introduced a new abstraction called the SimLibrary, the definitions for which are stored in angr.procedures.definitions. Each SimLibrary object stores information about a single shared library, and can contain SimProcedure implementations, calling convention information, and type information. SimLibraries are scraped from the filesystem at import time, just like SimProcedures, and placed into angr.SIM_LIBRARIES.

Syscalls are now categorized through a subclass of SimLibrary called SimSyscallLibrary. The API for managing syscalls through SimOS has been changed - check the API docs for the SimUserspace class.

One important implication of this change is that if you previously used a trick where you changed one of the SimProcedures present in the SimProcedures dict in order to change which SimProcedures would be used to hook over library functions by default, this will no longer work. Instead of SimProcedures[lib][func_name] = proc, you now need to say SIM_LIBRARIES[lib].add(func_name, proc). But really you should just be using hook_symbol anyway.

Changes to hooking

The Hook class is gone. Instead, we now can hook with individual instances of SimProcedure objects, as opposed to just the classes. A shallow copy of the SimProcedure will be made at runtime to preserve thread safety.

So, previously, where you would have done project.hook(addr, Hook(proc, ...)) or project.hook(addr, proc), you can now do project.hook(addr, proc(...)). In order to use simple functions as hooks, you can either say project.hook(addr, func) or decorate the declaration of your function with @project.hook(addr).

Having simprocedures as instances and letting them have access to the project cleans up a lot of other hacks that were present in the codebase, mostly related to the self.call(...) SimProcedure continuation system. It is no longer required to set IS_FUNCTION = True if you intend to use self.call() while writing a SimProcedure, and each call-return target you use will have a unique address associated with it. These addresses will be allocated lazily, which does have the side effect of making address allocation nondeterministic, sometimes based on dictionary-iteration order.

Changes to loading

The hook_symbol method will no longer attempt to redo relocations for the given symbol, instead just hooking directly over the address of the symbol in whatever library it comes from. This speeds up loading substancially and ensures more consistent behavior for when mixing and matching native library code and SimProcedure summaries.

The angr externs object has been moved into CLE, which will ALWAYS make sure that every dependency is resolved to something, never left unrelocated. Similarly, CLE provides the "kernel object" used to provide addresses for syscalls now.

Before

After

project._extern_obj

loader.extern_object

project._syscall_obj

loader.kernel_object

Several properties and methods have been renamed in CLE in order to maintain a more consistent and explicit API. The most common changes are listed below:

Before

After

loader.whats_at()

loader.describe_addr

loader.addr_belongs_to_object()

loader.find_object_containing()

loader.find_symbol_name()

loader.find_symbol().name

whatever the hell you were doing before to look up a symbol

loader.find_symbol(name or addr)

loader.find_module_name()

loader.find_object_containing().provides

loader.find_symbol_got_entry()

loader.find_relevant_relocations()

loader.main_bin

loader.main_object

anything.get_min_addr()

anything.min_addr

symbol.addr

symbol.linked_addr

Changes to the solver interface

We cleaned up the menagerie of functions present on state.solver (if you're still referring to it as state.se you should stop) and simplified it into a cleaner interface:

  • solver.eval(expression) will give you one possible solution to the given expression.

  • solver.eval_one(expression) will give you the solution to the given expression, or throw an error if more than one solution is possible.

  • solver.eval_upto(expression, n) will give you up to n solutions to the given expression, returning fewer than n if fewer than n are possible.

  • solver.eval_atleast(expression, n) will give you n solutions to the given expression, throwing an error if fewer than n are possible.

  • solver.eval_exact(expression, n) will give you n solutions to the given expression, throwing an error if fewer or more than are possible.

  • solver.min(expression) will give you the minimum possible solution to the given expression.

  • solver.max(expression) will give you the maximum possible solution to the given expression.

Additionally, all of these methods can take the following keyword arguments:

  • extra_constraints can be passed as a tuple of constraints.

    These constraints will be taken into account for this evaluation, but will not be added to the state.

  • cast_to can be passed a data type to cast the result to.

    Currently, this can only be str, which will cause the method to return the byte representation of the underlying data.

    For example, state.solver.eval(state.solver.BVV(0x41424344, 32, cast_to=str) will return "ABCD".