As I write this blog post, it is my understanding that the following areas are at the forefront of AI research:

  1. Understanding human speech

  2. Competing in strategic game systems (such as Go)

  3. Autonomous cars

  4. Intelligent routing in content delivery networks

While the above list is not exhaustive, and while the above capabilities are useful for supplementing human abilities, they do not inherently mimic humans and their innate preferences. Even as AI research expands in various directions, I couldn't find a project that tries to assign numeric values to images based on their inherent 'order'. While individual preferences are limitless, my hypothesis was that humans typically favor certain 'orderly' arrangements over others. Most humans would prefer an orderly room over a messy one, a neatly folded piece of clothing over a crumpled one, and symmetrical architecture over asymmetrical architecture. Let me illustrate this with a few examples below:


Example 1: A ‘stacked’ set of matches versus a ‘random’ bunch of matches

Example 2: A ‘messy’ room versus an ‘organized’ room

If this were indeed the case, then an algorithm that can classify images as such would further true AI capabilities. In this regard, the very idea of randomness needs a closer look:


"What is randomness and where does it come from? We take for granted the randomness in our reality. Randomness cannot be understood in mathematical terms. My opinion is that randomness is a manifestation of complex information processing. If perhaps what is perceived as randomness is just an exceedingly complex computation, then can it be possible to discover some accidental hidden structure in that randomness?" - Anonymous



But before we try to build an AI that scores images the way humans do, we also need to understand such 'randomness' in mathematical terms. In statistical mechanics, concrete efforts have paved the way to understanding the behavior of ensemble systems over time, and ultimately, the 'entropy' of a system has been linked with time itself through the second law of thermodynamics. Entropy is intricately tied to physics, chemistry, and biology, and while a precise explanation of the underlying ideas is beyond the scope of this project, I'd like to point the reader towards the following quotes by Boltzmann and Lehninger:


“The general struggle for existence of animate beings is not a struggle for raw materials … nor for energy which exists in plenty in any body in the form of heat, but a struggle for [negative] entropy.” – Ludwig Boltzmann (Physicist)


"Living organisms preserve their internal order by taking from their surroundings free energy, in the form of nutrients or sunlight, and returning to their surroundings an equal amount of energy as heat and [positive] entropy." - Albert Lehninger (Biologist)


While everything so far is interesting, how do we capture symmetry algorithmically? That, of course, is the crux of the problem. From a strictly geometrical point of view, we could try to capture the various Euclidean symmetries (reflectional, rotational, translational, roto-reflectional, helical, double-rotation) as well as non-isometric ones, and then try to account for distortions due to the location of the viewer. But doing that would require building an algorithm from the ground up, which would take at least a few years, if not decades. I wondered if there was a shortcut to developing our AI. After all, that is the whole point of having AI think for itself!


Among the most important requirements for building AI software are large amounts of training data and the extensive computing resources needed to train on them. This naturally turned my attention to pre-trained models that can classify images. For example, VGG16 is a deep convolutional network for object recognition with a top-5 accuracy of over 90%.


The VGG16 architecture

Essentially, to put it crudely, VGG16 identifies patterns and shapes within images and assigns them a label. While AI tends to be a black box that can't easily be opened up to understand the underlying logic, my hypothesis was that models such as VGG16 can be used to extract the underlying order in day-to-day objects. Retraining VGG16 itself, however, would require large labeled datasets and enormous computing resources, and would also destroy VGG16's ability to identify patterns correctly. Instead, by adding additional layers on top of VGG16 and leaving the original weights untouched, one can train the additional layers alone to get the desired output.

While it’d be straight-forward to add additional layers on top of VGG16, what kind of images need to be used to train this new combination of neural network layers? Ideally, I’d want the machine to label a completely random image consisting of random pixels to be labeled as zero. And I’d also the want the AI system to label a perfectly structured and neat image to be labeled as 10. However, it remains to be seen what AI’s definition of structure is!


Tools used for building this prototype (over 2 weeks)

So, I gathered a set of 'clean' images with definite shapes and introduced random distortions. The distortions were created using a random number generator and were applied to random parts of the image. Importantly, the degree to which an image was distorted could be controlled, and a set of distorted images was created for every 'real' image taken from the ImageNet dataset.
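
Here is a hypothetical sketch of that distortion step; the patch size and the uniform-noise source are assumptions for illustration:

```python
# Overwrite randomly chosen patches of an image with random pixels.
# 'level' (0.0-1.0) controls roughly what fraction of the image is destroyed.
import numpy as np

def distort(image, level, patch=16, seed=None):
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = image.shape[:2]
    n_patches = int(level * h * w / patch**2)   # more patches = more disorder
    for _ in range(n_patches):
        y = rng.integers(0, h - patch + 1)
        x = rng.integers(0, w - patch + 1)
        out[y:y+patch, x:x+patch] = rng.integers(0, 256, (patch, patch, 3), dtype=np.uint8)
    return out

# e.g. levels 0.0, 0.25, 0.5, 0.75, 1.0 paired with target scores 10, 7.5, 5, 2.5, 0
```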



In the above slide, the image of an iPod is distorted to varying degrees, and the model was trained to assign a value of 10 to the 'real' iPod image versus a value of 0 to a completely distorted one. To add an element of symmetry, a collection of over 400 abstract images was also used in the scoring process for situations in which two images being compared are otherwise similar. When a binary output was mapped, the model distinguished 'real' from 'distorted' images with over 86% accuracy.



Note the higher scores for ordered images on the right, except for the last case on the bottom-right (checkered squares), wherein the model finds the image on the left to be more ordered. It appears that the algorithm finds groupings of colors to be more symmetric, but does not recognize the C2 symmetry that humans would quickly identify. While a large dataset of ordered vs. disordered images was not available for this work, excellent results were obtained for the vast majority of straightforward comparisons. The model was trained on GPU-equipped EC2 instances on Amazon AWS, and the resulting Keras model was used to build a Flask app.


To the best of my knowledge, the above work is a first-of-its-kind proof-of-concept AI.


For most chemists, it is common knowledge that drugs with similar chemical structure possess similar characteristics and are used to treat similar conditions. In the past, I had come across many examples where drugs with similar chemical structure also had similar side effects. As abundant data is available on the side effects of most drugs, I was curious to know whether one could classify drugs based on their side effect data alone. I also wanted to identify drugs that are classified under distinct indications but have similar side effects. An interesting case can be seen in the figure below.



Motivation: A Side-by-Side Comparison Example


Potential Benefits of Clustering Drugs by Side Effects


In theory, such an approach has wide-ranging implications, and to the best of my knowledge, such clustering of pharmaceutical drugs has not been carried out before. Some of the obvious benefits of such an approach are as follows:

  1. Identifying drugs that are distinct from others in the same class (say, a new variety of anti-Diabetes drugs)

  2. Assessing intellectual property claims for a new drug to quantify any additional benefits with regard to side effects

  3. Quicker classification of drugs (or mixtures) based on side effect data from in vivo studies (on, say, mice)

  4. Providing a tool for regulatory agencies such as the FDA to gauge a new drug candidate

Data Collection and Methodology



To get started, I scraped data for the top 450 drugs from 'Drugs.com' using the Selenium module in Python. While larger datasets could be obtained from the SIDER database and the FDA's openFDA API, 'Drugs.com' provided a quick list of the top 450 drugs for this 1-week project. For similar reasons, I avoided using data from other popular websites such as 'Webmd.com' and 'Drugsdb.com'. With Natural Language Processing (NLP) libraries such as nltk and tools such as Word2Vec, it is easy to break text data down into a document-term matrix using scikit-learn tools such as CountVectorizer or TF-IDF (Term Frequency-Inverse Document Frequency). The matrix (with or without bigrams) can then be subjected to unsupervised (clustering) or supervised (classification) learning methods. In this project, I focus on clustering methods such as K-means, Hierarchical, Mean-shift, Spectral, Affinity Propagation, and DBSCAN to cluster drugs by their side effects. Note that most of these clustering methods are available in the scikit-learn module and can be applied in combination with the pandas module in Python. The results are discussed below as two separate cases: A) limiting the dataset to 7 major classes of drugs; B) carrying out clustering across all classes of drugs.
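
As a rough sketch, the vectorize-then-cluster pipeline might look like the following (the file name, column name, and n_clusters=7 are placeholder assumptions for illustration):

```python
# Hypothetical sketch: TF-IDF vectorization of side effect text, then K-means.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

df = pd.read_csv('drug_side_effects.csv')       # scraped dataset (placeholder name)
vec = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))  # unigrams + bigrams
X = vec.fit_transform(df['side_effects'])       # sparse document-term matrix

labels = KMeans(n_clusters=7, random_state=0).fit_predict(X)
df['cluster'] = labels
```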


Visualizing the Results of Clustering (n=7) with Singular-Value Decomposition (SVD): See 2D plot on left and clusters/classes on the right. Each dot represents a distinct medication

Similar Drugs Cluster Together: Mind-Altering Medications


A) By limiting the dataset to 7 major classes of drugs, one can measure the efficacy of clustering methods relative to established classes of drugs. When the clusters were plotted on a 2-D graph (axes generated with Singular-Value Decomposition (SVD)), it was easy to notice that different clusters occupied different pockets, albeit with some overlap. When the results of my clustering methods were labeled (clustering, an unsupervised learning technique, was carried out on the unlabeled data first), I noticed some interesting trends. Firstly, all the mind-altering drugs, such as Psychotropics and Opioid-based Painkillers, occupied the right portion of the graph, whereas drugs that interact with the Central Nervous System (CNS), such as Anti-hypertensives and Non-steroidal Anti-inflammatories, occupied the middle portion of the graph. Drugs that do not interact with the CNS, such as Anti-Diabetes drugs and Monoclonal Antibodies, occupied the left portion of the graph. Thus, the X-axis of the graph obtained by SVD could be thought of as the 'degree of effect on the Central Nervous System', while the Y-axis appeared to indicate the 'uniqueness' of the side effect descriptors. Secondly, the clustering (unsupervised) itself was at least partially successful. It was interesting to note that one of the clusters consisted solely of Anti-Diabetes drugs, whereas Psychotropics and Opioid-based Painkillers were clustered together. A certain class of Anti-hypertensive drugs clustered along with Painkillers and Sedatives, whereas Antibiotics, given their distinct chemical structures, were spread out among various clusters. Based on the results of the above Hierarchical clustering technique, one can conclude that clustering drugs by their side effects is a viable way to approach drug classification.
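
For reference, here is a minimal sketch of the 2-D projection used in these plots, reusing the TF-IDF matrix X and cluster labels from the previous snippet; TruncatedSVD is a natural choice here because it operates directly on sparse matrices:

```python
# Project the sparse document-term matrix onto two SVD axes and plot the clusters.
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=12)
plt.xlabel("SVD axis 1 (~ degree of effect on the CNS)")
plt.ylabel("SVD axis 2 (~ uniqueness of side effect descriptors)")
plt.show()
```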


Visualizing More Classes: See major class names on the top-right. See a representation of all classes on the bottom-right.


B) When clustering was carried out across all classes of drugs, a similar distribution of drugs could be seen on the 2-D graph (axes obtained with SVD). The following three groups could be seen represented distinctly on the graph:

  1. Mind-altering Medications: Opioid-based Painkillers and Sedatives, Psychotropics, Anti-psychotic, Benzodiazepines, Non-Benzodiazepine Anesthetics, Tricyclic Antidepressants

  2. Central Nervous System (CNS) Stimulants/Depressants: Anti-hypertensive, Anti-histamine, Analgesic, Non-steroidal Anti-Inflammatory, Anti-Asthmatic, Anti-Arrhythmic

  3. Other: Antibiotics, Monoclonal Antibodies, Anti-viral, Anti-Diabetic


Mind-altering medications have similar side-effects!: They form a group on the right-side of the 2D SVD plot

CNS stimulants/depressants have similar side-effects!: They form a group in the middle portion of the 2D SVD plot

All other drugs (that do not interact with the nervous system) form a group on the left-side of the 2D SVD plot


Key Take-away: The initial analysis suggests that side effect descriptions alone allow for a new and efficient classification technique. While a thorough analysis was beyond the scope of this short project, I was able to cluster several classes efficiently. In some cases, the clusters consisted of just one class or of two closely related classes. If this analysis were carried out on the entire dataset of all clinically approved drugs using the FDA's openFDA API, it may well be possible to classify drugs reliably by their side effects. Such an approach would have the numerous benefits highlighted above.


(Note that the above results were obtained with automated text analysis of side effect data alone!)



I have always been fascinated by financial modeling, as the stakes are usually huge! While people tend to vilify banks, the truth remains that the world is currently facing a savings glut, and banks are under immense pressure to lend (read: low interest rates). In this project blog, I look at a peer-to-peer lending company called LendingClub. LendingClub's platform enables borrowers to obtain loans, and investors to purchase notes backed by payments made on those loans. Since it started in 2006, it has grown exponentially to become the largest peer-to-peer lending platform of its kind. Personally, I found it interesting because it's a great example of a company earning profits by playing within the rules of our free-market economic system. It's also worth noting that LendingClub grew rapidly in the aftermath of the 2008 financial crisis.


Timeline of Loans Funded by LendingClub

Interest Rates versus Loan Grades


US States by Mean Interest Rates


LendingClub's data is available publicly so that investors can judge whether it's doing a good job with its portfolio. As a result, I expected the data to be a good fit for modeling, considering that the dataset (2007-2015) consisted of about $13.1 billion in loans. During this period, the average loan amount was $14,742, and the total number of loans funded was 887,449. As the dataset is larger than usual, one should realize that analyzing it directly in a Jupyter Notebook (Python) is likely to crash the computer. One needs some experience to recognize the boundary between ordinary data and Big Data, and with an array of 887,449 rows and 135 features, we are beginning to enter that territory. To deal with the size, the data was cleaned and then inserted into a PostgreSQL database on my Amazon AWS server. This way, I could query the data remotely using Python's SQLAlchemy module.
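
A sketch of that remote-query step follows; the host, credentials, table name, and date filter are placeholders, while the column names follow the public LendingClub schema:

```python
# Query the cleaned loan data remotely from the AWS-hosted PostgreSQL database.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@my-aws-host:5432/lendingclub')
df = pd.read_sql_query(
    """SELECT loan_amnt, int_rate, grade, home_ownership, loan_status
       FROM loans
       WHERE issue_d >= '2012-01-01'""",
    engine,
)
```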


A few obvious trends could be identified: a) Lending standards have become looser since late 2014 (read: lower interest rates), and total loan volumes have increased every year since 2007; b) With a few exceptions, lending standards are generally uniform across all 50 states; c) The default rate is strongly correlated with the grade of the loan being funded, which in turn is correlated with credit score, debt-to-income ratio, and number of delinquencies; d) Homeowners are much less likely to default than renters.
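
As a hedged illustration of how such trends can be checked with pandas (the loan_status values and the derived default flag are assumptions based on the public LendingClub schema):

```python
# Flag loans that went bad, then check trends (c) and (d) from above.
df['defaulted'] = df['loan_status'].isin(['Charged Off', 'Default'])

print(df.groupby('grade')['defaulted'].mean().sort_index())   # default rate by grade
print(df.groupby('home_ownership')['defaulted'].mean())       # homeowners vs. renters
```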


After data exploration, I wanted to develop a model that could predict the probability that a borrower would stop making payments. As classification models typically produce a binary output, the problem at hand required the calculation of probabilities with different models. With probabilities in hand, I could then use different thresholds to loosen or tighten the lending standards and observe the effect on profits/revenues. Yet, this approach still required that I select the best model. One could look at the Accuracy Score, Recall, Precision, Sensitivity, True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV), etc. However, the best way to compare models that output probabilities for a binary outcome is to examine their ROC (Receiver Operating Characteristic) curves and the corresponding AUC (Area Under the Curve) values.
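
A sketch of how that comparison can be set up with scikit-learn, assuming pre-split X_train/X_test/y_train/y_test arrays; only two of the models are shown, with scaling applied inside a pipeline for the scale-sensitive one:

```python
# Compare models by the AUC of their predicted default probabilities.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'Gradient Boosting': GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]   # probability of default
    print(name, roc_auc_score(y_test, probs))
```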


ROC Curves for the Loan Data


As seen in the table above, the best AUC value of 0.68 was obtained for Logistic Regression, while Gradient Boosting and Linear SVC were close in performance. Note that the above values were obtained after tweaking many different parameters for each model, and also that relevant scaling (normalization) had to be applied for all features for kNN, Logistic Regression, Linear SVC and SVC.


Next, I wanted to pickle my Logistic Regression model and develop a Flask app to evaluate a borrower's eligibility for a loan. But I still had not set an optimum threshold value for my model. It was easy to see that a high threshold would reduce the default rate but decrease the total volume of loans funded; similarly, a low threshold would increase the default rate but increase the total volume of loans funded. At this point, one could get into further microeconomic complexities for this dataset. By playing with thresholds and the corresponding recall values, I came to appreciate the kind of modeling that must have enabled LendingClub's success. There is an inherent limitation in predicting any individual's chance of default, but by creating bundles of low-risk and high-risk borrowers, LendingClub mitigates risk by charging higher interest rates to high-risk borrowers. Thus, any attempt at optimizing profits/revenues is likely to be a tight fit against LendingClub's own models.
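
A minimal sketch of the pickling and thresholding steps; the file name and the 0.30 cutoff are illustrative assumptions:

```python
# Persist the trained model for the Flask app, then apply a lending threshold.
import pickle

with open('logreg_model.pkl', 'wb') as f:
    pickle.dump(model, f)                 # loaded later inside the Flask app

probs = model.predict_proba(X_test)[:, 1] # predicted probability of default
threshold = 0.30                          # stricter cutoff = fewer loans, fewer defaults
approve = probs < threshold               # fund only the lower-risk applicants
```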


A Snapshot of a Flask-based Tool to Re-evaluate a Customer's Chance of Getting Funded


  • Define Thresholds for Lending Standards

  • Maximize Recall for Defaulters (Ideal range: 0.29-0.70)

  • Maximize F-1 for Regular Borrowers (correspondingly falls from 0.89 to 0.68 over that recall range)

For example, by setting a stricter lending standard with 29% Recall from our Logistic Regression model, the initial analysis indicates significant savings. However, these values should be taken with a grain of salt, because any change in the default rate or market conditions would significantly sway the results one way or the other:

  • Savings from Decreased Defaults (~29% Recall): $588 million

  • Loss from Stricter Lending (~29% Recall): $578 million

  • Net Increase in Profits (~29% Recall): $10 million

In conclusion, with specific values of recall/threshold, one can estimate savings for LendingClub. However, as the models are a tight fit, any increase in the overall rate of default is likely to lead to significant losses. One would need to address a wide variety of economic preferences to be able to estimate precise savings or losses with the given amount of data.
