While working on my M.S. thesis, I got to know Open3D. To me, it is basically the Swiss army knife for 3D data acquired from reality.
It offers implementations of performant algorithms, Python support to quickly change your scripts or adjust parameters to improve your results, and a visualizer to see them.
Unfortunately, this very visualizer is not very interactive. Often, I would like to pick objects in the scene and drag them or modify them. Sadly, in Python, it is not possible. But it is in C++, and I will comment on how you can implement that.
But first, I suggest you download my code, as I will refer to it. It combines the routines of the following sections to allow picking and deleting voxels.
Interact with the Mouse
Usually, when I want to do something with Open3D, I look at the examples on their Read the Docs documentation. However, they also offer a Doxygen-based one for the C++ part of the library.
There, I noticed that the open3d::visualizer::Visualizer class contains some protected virtual methods, like MouseButtonCallback, MouseMoveCallback, KeyPressCallback, and so on… 
So, the first trick is to extend the Visualizer class and create our own visualizer.
However, these callbacks are not enough. For example, they tell that a mouse button has been pressed and which one, but they do not tell you which position that happened at.
Now, the GLFW library is at the heart of the Open3D visualizer. These methods are called directly by that library when one of these events happens.
Thus, the second trick is calling GLFW functions. In particular, glfwGetCursorPos is very helpful to get the mouse position for click events.
And that’s it: we can now intercept mouse events and get all the information we need.
Please, notice that we must pass the events also to the parent class, or we will break stuff like camera movements.
void CustomVisualizer::MouseButtonCallback(GLFWwindow *window, int button, int action, int mods)
{
	// Our code here
	double xpos, ypos;
	glfwGetCursorPos(window, &xpos, &ypos);
	printf("Pressed button %d at (%f, %f)\n", button, xpos, ypos);
	// Then, at the end
	open3d::visualization::Visualizer::MouseButtonCallback(window, button, action, mods);
}
Actually, this is the only part that requires C++ because these callbacks are not exposed to the Python bindings. If they were, then it would be possible to port my example also to that language.
From 2D to 3D
For picking, we are interested in 3D coordinates rather than 2D ones. But when we render 3D, we perform a projection: an irreversible operation, from which we cannot recover a single set of 3D coordinates.
Instead, its “reverse” operation produces a ray, a line starting from the camera, which we can intersect with our geometries.
While cameras are a common concept in 3D software, OpenGL does not know them. Heck, it does not even know your screen size and ratio!
It uses clip coordinates: for OpenGL, the renderable area is a 2×2×2 cube, whose coordinates are in rages [-1; 1]×[-1; 1]×[-1; 1], and the y-axis is opposite to the window y. Therefore, we have to convert the 2D point to this coordinate system first:
[x, y] ⇒ [2 * x / width - 1, 1 - 2 * y / height, 1, 1].
We set z = 1 to take that point on the near plane of the camera. We also need a fourth coordinate, set to 1; more on that later.
The camera, the projection, and any other transformation of the model are all expressed through 4×4 matrices. Eventually, they are combined in a single matrix, which converts vertex coordinates to clip space coordinates.
We have to do the inverse operation to get an almost ready-to-use vector with the coordinates we were looking for.
This vector is another 4D vector, but this time we cannot ignore the fourth coordinate. Computer graphics often involves homogeneous coordinates, in which an additional coordinate is added: 2D becomes 3D and 3D becomes 4D. This allows including translations in transformation matrices.
A normalized homogeneous vector has 1 as the last component and can be transformed into Cartesian coordinates by removing it. Otherwise, you have to normalize it first. In other words:
[x, y, z, w] ⇒ [x / w, y / w, z / w].
Provided that we are using perspective projection, our ray goes from the camera’s “eye” to the just-found point. In some cases, Open3D may use an orthogonal camera, but this is out of this post’s scope.
The VoxelPicking::Unproject method on the code I linked above is an implementation of this procedure.
From Ray to Voxel
This is a kind of task I usually do not like to implement. It is too easy to overlook something or to forget about some specific cases.
To be sure this was not the case, I referred to A Fast Voxel Traversal Algorithm for Ray Tracing, by John Amanatides and Andrew Woo.
The proposed algorithm is combined of two parts: an initialization and a loop that continues until the ray intercepts a voxel or exits from the grid.
Initialization
The initialization was a bit troublesome to get right because it connects three spaces: world space (R3), ray parameter space (R), and grid space (N3).
We start by finding the point at which the ray enters the grid. Again, I preferred not reinventing the wheel and adapted Andrew Kensler’s algorithm. You can find it in the VoxelPicking::IntersectAABB method.
Then, we convert this point to grid coordinates, with open3d::geometry::VoxelGrid::GetVoxel. With the same method, we also find the grid size.
Next, we initialize the step vector: it tells whether the ray is growing or decreasing in a direction. Basically, it is the sign vector of the ray direction. To be more precise, this is correct only if the grid and the world have the same orientation. And we assume this is always the case with Open3D.
Later, we compute tDelta: the size of a voxel, in ray parameter units. In other words, if we are at a point, how much do we have to increase the ray parameter to reach the same point in the next voxel?
Finally, we need to initialize tMax. It is the ray parameter value at which we pass to the next grid row, column, or depth. This value converted to world coordinates is on one edge between two rows/columns/depths. Very likely, it is not one of the 8 voxel vertices.
However, tMax is a misleading name, in my opinion, so I preferred calling it tNext.
The Loop
Amanatides and Woo give C-like code for the loop, which I prefer to rewrite in this way:
while (voxel not found) {
	find the direction that has the smallest increase of ray parameter;
	go to the next voxel in that direction;
	update tNext;
}
or, in pseudocode:
while (!isThereAVoxel(coords) && isInsideGrid(coords)) {
	dim = argmin(tNext);
	coords[dim] += step[dim];
	tNext[dim] += tDelta[dim];
}
Each component of tDelta and tNext is focused on its direction. If we sum tDelta[d] to tNext[d], we will be at the next border in the dimension d. But by doing so, we might also skip several voxels in other components! Taking the minimum value from tNext prevents us from doing so and make us walk from a voxel to an adjacent one.
VoxelPicking::GetVoxel is an implementation of what I explained in this section.
What About Other Geometries?
Voxel grids are indeed convenient structures. Thanks to their uniformity, many operations have O(1) time.
With triangle meshes, for instance, you need to intersect the ray with the various planes.
With point clouds, you might need to reformulate the problem. For example, you may be interested in the nearest point within a certain radius from the ray.
And in both cases, you may want to accelerate the problem resolution with tree data structures like octrees or BVH. Open3D offers them, but its implementation is not very powerful. Another library, like Geogram, might get you covered.
In some cases, you may avoid using rays. For example, if you have only a bunch of points, you can project them manually and query 2D coordinates.