How can you best exploit the information in the protein structures we already know in order to solve the (many more) that we don't? That is the question our group tries to answer, and this talk will show how we apply community clustering to one use case: aiding protein structure solution.
The study of proteins is relevant to many fields, ranging from medicine to industrial processes. If a cell were a factory, proteins would be the workers, performing very specific tasks in a production chain in which a failure at any step might affect the others. There are a number of biochemical techniques that we can apply to understand the function of proteins, but to get a really accurate description of their mechanisms we need information down to the atomic level. Proteins are composed of different combinations of 20 amino acids, which form chains that fold into an intricate and functional structure. We know that the final shape of a protein depends on its sequence, but we can't (currently) predict the outcome of a given sequence, so we need experimental techniques to obtain this information.
We can't directly look at a protein with a microscope and see its three-dimensional atomic arrangement. There are biophysical techniques that give us low-resolution shapes or surfaces, but the most widely used method that provides atomic detail is X-ray diffraction. In microscopy, after illuminating your sample you can, with the appropriate lenses, transform the scattered rays back into an image. In X-ray diffraction we only record the diffraction pattern, that is, the intensities of the Fourier transform of our protein's structure, and we lose the phase information of the diffracted rays. This is known as the phase problem, and it is a central issue in structure solution because, compared to the intensities of the diffraction pattern, the phases carry much more information about the protein's structure. Phases can be computed from a model made of atomic coordinates. So, if there is a known structure that is expected to be similar (in terms of its coordinates) to the unknown one, it can be used to provide initial estimates of the phases, which are then improved until the unknown structure is solved.
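To make the phase problem concrete, here is a minimal one-dimensional sketch using only the Python standard library. The toy "density" values and all function names are assumptions chosen for illustration; real crystallographic data are three-dimensional and far larger. It shows that two different densities can give exactly the same amplitudes, and that amplitudes only become a map again once phases are supplied.

```python
import cmath


def dft(values):
    """Discrete Fourier transform of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * h * x / n) for x, v in enumerate(values))
        for h in range(n)
    ]


def inverse_dft(coeffs):
    """Inverse discrete Fourier transform."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * h * x / n) for h, c in enumerate(coeffs)) / n
        for x in range(n)
    ]


# Toy one-dimensional "electron density" (hypothetical values).
density = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 2.0, 0.0]
# The same "molecule" moved by three grid points (circular shift).
shifted = density[-3:] + density[:-3]

structure_factors = dft(density)
amplitudes = [abs(f) for f in structure_factors]       # recoverable from measured intensities
phases = [cmath.phase(f) for f in structure_factors]   # lost in the experiment

# Both densities give the same amplitudes: the pattern alone is ambiguous.
shifted_amplitudes = [abs(f) for f in dft(shifted)]
same_pattern = all(abs(a - b) < 1e-9 for a, b in zip(amplitudes, shifted_amplitudes))

# Amplitudes combined with the right phases give back the density.
recovered = [
    z.real for z in inverse_dft([cmath.rect(a, p) for a, p in zip(amplitudes, phases)])
]

# Amplitudes with every phase set to zero give a different map.
no_phase_map = [z.real for z in inverse_dft(amplitudes)]

print("same amplitudes for both densities:", same_pattern)
print("recovered with phases:", [round(v, 3) for v in recovered])
print("map with zero phases: ", [round(v, 3) for v in no_phase_map])
```

In this sketch the phases come from the true density itself. In a real case they would be calculated from a similar model, so they would only be approximate and would need to be improved.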
Proteins do not adopt random shapes. Forces act on them and, depending on the chemical properties of the amino acids and their surrounding context, they arrange themselves in particular forms. The first level of structure formed is what we call secondary structure, which consists of alpha helices and beta strands. These are very general and are present in all kinds of proteins. Like letters in different books, which are general if looked at independently but can acquire quite a different meaning in 'El Quijote' than in '50 Shades of Grey', these structure fragments alone have a different meaning than when set in a context with other structure elements. Yet even combinations of a few elements can still be general and frequent and make sense on their own, such as the phrase 'Once upon a time' in so many tales. In the protein case, particular small combinations of alpha helices and beta strands should also appear more frequently and can be studied in their different contexts. In a way, we could say that protein structures look quite different from one another from a top-down view, but look much more similar from a bottom-up view, when we go down to small pieces.
In our group, we try to exploit these properties of protein structures to solve two types of problems. One is a search problem: finding the best way to break down a larger model of a protein similar to an unknown one, so that it can be refined to get the correct phases and solve the structure. The other is more of a prediction problem, in which we want to find structural units in the database of solved structures in order to help us interpret them and extract new information from this vast amount of data. To solve both problems we need a numerical description of our system that can describe its geometrical features accurately. We have developed a description of secondary structure fragments based on what we call characteristic vectors, which we also use to represent the relations between such elements. This description can be implemented as a network graph, which allows us to apply community clustering algorithms that evaluate all the relations between the elements of the structure simultaneously and find its communities.
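As a rough illustration of how a vector-based description can turn geometry into numbers, the sketch below reduces each fragment to a single vector and describes a pair of fragments by an angle and a distance. The definition used here (a vector joining the centroids of the two halves of a fragment) is a simplifying assumption made for this example, not necessarily the definition of characteristic vectors used in our work. The coordinates and function names are likewise hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def fragment_vector(points):
    """Assumed simplification: join the centroids of the two halves of a fragment.

    Returns the (start, end) points of the vector.
    """
    half = len(points) // 2
    return centroid(points[:half]), centroid(points[half:])


def angle_between(u, v):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    cosine = dot / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def describe_pair(fragment_a, fragment_b):
    """Describe the relation between two fragments as (angle, distance)."""
    a_start, a_end = fragment_vector(fragment_a)
    b_start, b_end = fragment_vector(fragment_b)
    direction_a = tuple(e - s for s, e in zip(a_start, a_end))
    direction_b = tuple(e - s for s, e in zip(b_start, b_end))
    midpoint_a = centroid([a_start, a_end])
    midpoint_b = centroid([b_start, b_end])
    return angle_between(direction_a, direction_b), math.dist(midpoint_a, midpoint_b)


# Two hypothetical straight fragments: one along x, one along y and 10 units higher in z.
fragment_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
fragment_b = [(0.0, 0.0, 10.0), (0.0, 1.5, 10.0), (0.0, 3.0, 10.0), (0.0, 4.5, 10.0)]

angle, distance = describe_pair(fragment_a, fragment_b)
print(f"angle = {angle:.1f} degrees, distance = {distance:.2f}")
```

Pairwise values like these could then serve as edge attributes in a graph whose nodes are the fragments.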
From the technical point of view, the language we use for our development is Python, and our graphs are implemented with python-igraph. There are a number of community clustering algorithms available, and deciding which one is best for our case, and which metrics to use to describe our structures, has been an interesting process that we want to share with you. In this talk I will explain our research and development on this topic and link the general descriptions to our current, successful use case, possibly leaving open questions about what more we can do!