Chintan

Clustering Pharmaceutical Drugs by Side Effects

Updated: Nov 10, 2018


For most chemists, it is common knowledge that drugs with similar chemical structures possess similar characteristics and are used to treat similar conditions. In the past, I have come across many examples where drugs with similar chemical structures also had similar side effects. Since abundant data is available on the side effects of most drugs, I was curious whether drugs could be classified based on their side effect data alone. I also wanted to find drugs that are classified as having distinct indications but have similar side effects. An interesting case can be seen in the figure below.



Motivation: A Side-by-Side Comparison Example


Potential Benefits of Clustering Drugs by Side Effects


In theory, such an approach has wide-ranging implications, and to the best of my knowledge, such clustering of pharmaceutical drugs has not been carried out before. Some of the obvious benefits are as follows: 1) Identify drugs that are distinct from others in the same class (say, a new variety of Anti-Diabetes drugs); 2) Assess intellectual property claims for a new drug by quantifying any additional benefits with regard to side effects; 3) Classify drugs (or mixtures) more quickly based on side effect data from in vivo studies (on, say, mice); 4) Give regulatory agencies such as the FDA a tool to gauge a new drug candidate.

Data Collection and Methodology



To get started, I scraped data for the top 450 drugs from 'Drugs.com' using the Selenium module in Python. While larger data-sets could be obtained from the SIDER database and the FDA's open API (openFDA), 'Drugs.com' provided a quick list of the top 450 drugs for this 1-week project. For similar reasons, I avoided using data from other popular websites such as 'Webmd.com' and 'Drugsdb.com'.

With Natural Language Processing (NLP) libraries such as nltk and tools such as Word2Vec, it is easy to turn text data into a numeric matrix with one row vector per drug, for example with scikit-learn's CountVectorizer or TfidfVectorizer (TF-IDF: Term Frequency–Inverse Document Frequency). The matrix (with or without bigrams) can then be subjected to unsupervised (clustering) or supervised (classification) learning methods. In this project, I focus on clustering methods such as K-means, Hierarchical, Mean-shift, Spectral, Affinity Propagation and DBSCAN to cluster drugs by their side effects. Note that most of these clustering methods are available in the scikit-learn module and combine well with the pandas module in Python.

The results are discussed below as two separate cases: A) Limiting the data-set to 7 major classes of drugs; B) Carrying out clustering across all classes of drugs.
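As a rough illustration of the vectorize-then-cluster step described above, here is a minimal sketch. It is not the original project code: the drug names and side-effect strings are invented placeholders, and the helper names `vectorize` and `cluster_kmeans` are hypothetical. It assumes scikit-learn is installed and uses TF-IDF with optional bigrams followed by K-means.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: invented drug names and side-effect text,
# standing in for the scraped descriptions.
side_effects = {
    "drug_a": "drowsiness dizziness confusion slowed breathing constipation",
    "drug_b": "drowsiness sedation confusion dizziness slowed breathing",
    "drug_c": "low blood sugar weight gain nausea diarrhea",
    "drug_d": "low blood sugar nausea diarrhea stomach upset",
}


def vectorize(texts, use_bigrams=True):
    """Turn side-effect text into a TF-IDF matrix with one row per drug."""
    ngram_range = (1, 2) if use_bigrams else (1, 1)
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=ngram_range)
    return vectorizer, vectorizer.fit_transform(texts)


def cluster_kmeans(matrix, n_clusters, seed=0):
    """Assign each row of the matrix to one of n_clusters K-means clusters."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(matrix)


names = list(side_effects)
vectorizer, X = vectorize(list(side_effects.values()))
labels = cluster_kmeans(X, n_clusters=2)

for name, label in zip(names, labels):
    print(f"{name}: cluster {label}")
```

Swapping `KMeans` for another scikit-learn estimator such as `DBSCAN` or `AffinityPropagation` keeps the same `fit_predict` pattern, which makes it straightforward to compare several clustering methods on one matrix.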


Visualizing the Results of Clustering (n=7) with Singular-Value Decomposition (SVD): See 2D plot on left and clusters/classes on the right. Each dot represents a distinct medication

Similar Drugs Cluster Together: Mind-Altering Medications


A) By limiting the data-set to 7 major classes of drugs, one can measure the efficacy of clustering methods relative to established classes of drugs. When the clusters were plotted on a 2-D graph (axes generated with Singular-Value Decomposition (SVD)), it was easy to notice that different clusters occupied different pockets, albeit with some overlap. Clustering, an unsupervised learning technique, was carried out on unlabeled data first; when I then labeled the results with the known drug classes, I noticed some interesting trends.

Firstly, all the mind-altering drugs such as Psychotropics and Opioid-based Painkillers occupied the right portion of the graph, whereas drugs that interact with the Central Nervous System (CNS) such as Anti-hypertensive and Non-steroidal Anti-inflammatory drugs occupied the middle portion of the graph. Drugs that do not interact with the CNS, such as Anti-Diabetes drugs and Monoclonal Antibodies, occupied the left portion of the graph. Thus, the X-axis of the graph obtained by SVD could be thought of as 'Degree of Effect on the Central Nervous System'. The Y-axis, on the other hand, appeared to indicate the 'uniqueness' of side effect descriptors.

Secondly, the clustering (unsupervised) itself was at least partially successful. It was interesting to note that one of the clusters consisted solely of Anti-Diabetes drugs, whereas Psychotropics and Opioid-based Painkillers were found to be clustered together. A certain class of Anti-hypertensive drugs clustered along with Painkillers and Sedatives, whereas Antibiotics, given their distinct chemical structures, were spread out among various clusters. The results of the Hierarchical clustering technique above suggest that clustering drugs based on their side effects is a viable way to approach the classification of drugs.
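The comparison described above (cluster first on unlabeled text, then attach the known classes) can be sketched roughly as follows. This is an illustrative sketch, not the original analysis: the six drugs, their class labels and their side-effect strings are invented, and the choice of Ward linkage with three clusters is an assumption made for the toy data.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: invented names, classes and side-effect text.
drugs = pd.DataFrame({
    "drug": ["diab_1", "diab_2", "psych_1", "opioid_1", "hyper_1", "hyper_2"],
    "known_class": ["Anti-Diabetes", "Anti-Diabetes", "Psychotropic",
                    "Opioid Painkiller", "Anti-hypertensive", "Anti-hypertensive"],
    "side_effects": [
        "low blood sugar nausea diarrhea weight gain",
        "low blood sugar nausea diarrhea stomach upset",
        "drowsiness confusion dizziness dry mouth sedation",
        "drowsiness confusion dizziness constipation sedation",
        "cough headache fatigue swelling ankles",
        "cough headache fatigue swelling flushing",
    ],
})

# Step 1: cluster on the text alone; the known classes are not used here.
counts = CountVectorizer(stop_words="english").fit_transform(drugs["side_effects"])
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
drugs["cluster"] = model.fit_predict(counts.toarray())  # needs a dense array

# Step 2: only now compare the clusters with the established classes.
table = pd.crosstab(drugs["cluster"], drugs["known_class"])

# Share of each cluster taken up by its most common class (1.0 = one class only).
purity = table.max(axis=1) / table.sum(axis=1)

print(table)
print(purity)
```

A cross-tabulation like `table` makes it easy to spot clusters that contain a single class and clusters that mix closely related classes.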


Visualizing More Classes: See major class names on the top-right. See a representation of all classes on the bottom-right.


B) When clustering was carried out across all classes of drugs, a similar distribution of drugs could be seen on the 2-D graph (axes obtained with SVD). The following three groups were distinctly represented on the graph:

  1. Mind-altering Medications: Opioid-based Painkillers and Sedatives, Psychotropics, Anti-psychotic, Benzodiazepines, Non-Benzodiazepine Anesthetics, Tricyclic Antidepressants

  2. Central Nervous System (CNS) Stimulants/Depressants: Anti-hypertensive, Anti-histamine, Analgesic, Non-steroidal Anti-Inflammatory, Anti-Asthmatic, Anti-Arrhythmic

  3. Other: Antibiotics, Monoclonal Antibodies, Anti-viral, Anti-Diabetic
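For readers who want to reproduce this kind of 2-D view, the SVD projection can be sketched as below. This is a hedged example rather than the code behind the figures: the drug names and side-effect text are invented, and `project_2d` is a hypothetical helper. It uses scikit-learn's `TruncatedSVD`, which works directly on the sparse TF-IDF matrix.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: invented names and side-effect text.
drugs = {
    "sedative_1": "drowsiness confusion dizziness sedation memory problems",
    "opioid_1": "drowsiness confusion dizziness constipation nausea",
    "antihistamine_1": "drowsiness dizziness dry mouth headache",
    "nsaid_1": "headache dizziness stomach pain nausea",
    "antibiotic_1": "nausea diarrhea rash stomach pain",
    "antidiabetic_1": "nausea diarrhea low blood sugar",
}


def project_2d(texts, seed=0):
    """Project TF-IDF vectors of side-effect text onto two SVD components."""
    matrix = TfidfVectorizer().fit_transform(texts)
    svd = TruncatedSVD(n_components=2, random_state=seed)
    return svd.fit_transform(matrix)


coords = project_2d(list(drugs.values()))

# One (x, y) point per drug, ready to pass to a scatter plot.
for name, (x, y) in zip(drugs, np.round(coords, 2)):
    print(f"{name}: x={x}, y={y}")
```

The sign and scale of each SVD axis are arbitrary, so an interpretation such as "left versus right" depends on the particular fit and should be read from the plotted points rather than assumed in advance.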


Mind-altering medications have similar side effects: they form a group on the right side of the 2D SVD plot

CNS stimulants/depressants have similar side effects: they form a group in the middle portion of the 2D SVD plot

All other drugs (those that do not interact with the nervous system) form a group on the left side of the 2D SVD plot


Key Take-away: The initial analysis suggests that side effect descriptions alone allow for a new and efficient classification technique. While a thorough analysis was beyond the scope of this short project, I was able to cluster several classes cleanly. In some cases, the clusters consisted of just one class or of two closely related classes. If this analysis were carried out on the entire data-set of clinically approved drugs, for example via the FDA's open API (openFDA), it may be possible to classify drugs reliably based on their side effects. Such an approach would have numerous benefits, as highlighted above.


(Note that the above results were obtained with automated text analysis of side effect data alone!)
