Diego Klabjan, Xiaofeng Zhu
We propose incremental (re)training of a neural network model to cope with a continuous flow of new data in inference during model serving. As such, this is a life-long learning process. We address two challenges of life-long retraining: catastrophic forgetting and efficient retraining. If we combine all past and new data it can easily become intractable to retrain the neural network model. On the other hand, if the model is retrained using only new data, it can easily suffer catastrophic forgetting and thus it is paramount to strike the right balance. Moreover, if we retrain all weights of the model every time new data is collected, retraining tends to require too many computing resources. To solve these two issues, we propose a novel retraining model that can select important samples and important weights utilizing multi-armed bandits. To further address forgetting, we propose a new regularization term focusing on synapse and neuron importance. We analyze multiple datasets to document the outcome of the proposed retraining methods. Various experiments demonstrate that our retraining methodologies mitigate the catastrophic forgetting problem while boosting model performance.
Taejong Joo, Diego Klabjan
Distribution shifts pose significant challenges for model calibration and model selection tasks in the unsupervised domain adaptation problem -- a scenario where the goal is to perform well in a distribution shifted domain without labels. In this work, we tackle difficulties coming from distribution shifts by developing a novel importance weighted group accuracy estimator. Specifically, we present a new perspective of addressing the model calibration and model selection tasks by estimating the group accuracy. Then, we formulate an optimization problem for finding an importance weight that leads to an accurate group accuracy estimation with theoretical analyses. Our extensive experiments show that our approach improves state-of-the-art performances by 22% in the model calibration task and 14% in the model selection task.
Papis Wongchaisuwat, Diego Klabjan, John O. McGinnis
Patent lawsuits are costly and time-consuming. An ability to forecast a patent litigation and time to litigation allows companies to better allocate budget and time in managing their patent portfolios. We develop predictive models for estimating the likelihood of litigation for patents and the expected time to litigation based on both textual and non-textual features. Our work focuses on improving the state-of-the-art by relying on a different set of features and employing more sophisticated algorithms with more realistic data. The rate of patent litigations is very low, which consequently makes the problem difficult. The initial model for predicting the likelihood is further modified to capture a time-to-litigation perspective.
Eun Hee Ko, Diego Klabjan
An increasing number of people are using online social networking services (SNSs), and a significant amount of information related to experiences in consumption is shared in this new media form. Text mining is an emerging technique for mining useful information from the web. We aim at discovering in particular tweets semantic patterns in consumers' discussions on social media. Specifically, the purposes of this study are twofold: 1) finding similarity and dissimilarity between two sets of textual documents that include consumers' sentiment polarities, two forms of positive vs. negative opinions and 2) driving actual content from the textual data that has a semantic trend. The considered tweets include consumers opinions on US retail companies (e.g., Amazon, Walmart). Cosine similarity and K-means clustering methods are used to achieve the former goal, and Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, is used for the latter purpose. This is the first study which discover semantic properties of textual data in consumption context beyond sentiment analysis. In addition to major findings, we apply LDA (Latent Dirichlet Allocations) to the same data and drew latent topics that represent consumers' positive opinions and negative opinions on social media.
Matthew Dixon, Diego Klabjan, Jin Hoon Bang
Deep neural networks (DNNs) are powerful types of artificial neural networks (ANNs) that use several hidden layers. They have recently gained considerable attention in the speech transcription and image recognition community (Krizhevsky et al., 2012) for their superior predictive properties including robustness to overfitting. However their application to algorithmic trading has not been previously researched, partly because of their computational complexity. This paper describes the application of DNNs to predicting financial market movement directions. In particular we describe the configuration and training approach and then demonstrate their application to backtesting a simple trading strategy over 43 different Commodity and FX future mid-prices at 5-minute intervals. All results in this paper are generated using a C++ implementation on the Intel Xeon Phi co-processor which is 11.4x faster than the serial version and a Python strategy backtesting environment both of which are available as open source code written by the authors.
Anders Drachen, Matthew Yancey, John Maguire, Derrek Chu, Iris Yuhui Wang, Tobias Mahlmann, Matthias Schubert, Diego Klabjan
Multiplayer Online Battle Arena (MOBA) games are among the most played digital games in the world. In these games, teams of players fight against each other in arena environments, and the gameplay is focused on tactical combat. Mastering MOBAs requires extensive practice, as is exemplified in the popular MOBA Defence of the Ancients 2 (DotA 2). In this paper, we present three data-driven measures of spatio-temporal behavior in DotA 2: 1) Zone changes; 2) Distribution of team members and: 3) Time series clustering via a fuzzy approach. We present a method for obtaining accurate positional data from DotA 2. We investigate how behavior varies across these measures as a function of the skill level of teams, using four tiers from novice to professional players. Results indicate that spatio-temporal behavior of MOBA teams is related to team skill, with professional teams having smaller within-team distances and conducting more zone changes than amateur teams. The temporal distribution of the within-team distances of professional and high-skilled teams also generally follows patterns distinct from lower skill ranks.
Nikhil Byanna, Diego Klabjan
How does one objectively measure the performance of an individual offensive lineman in the NFL? The existing literature proposes various measures that rely on subjective assessments of game film, but has yet to develop an objective methodology to evaluate performance. Using a variety of statistics related to an offensive lineman's performance, we develop a framework to objectively analyze the overall performance of an individual offensive lineman and determine specific linemen who are overvalued or undervalued relative to their salary. We identify eight players across the 2013-2014 and 2014-2015 NFL seasons that are considered to be overvalued or undervalued and corroborate the results with existing metrics that are based on subjective evaluation. To the best of our knowledge, the techniques set forth in this work have not been utilized in previous works to evaluate the performance of NFL players at any position, including offensive linemen.
Alexander Lavin, Diego Klabjan
Investigations have been performed into using clustering methods in data mining time-series data from smart meters. The problem is to identify patterns and trends in energy usage profiles of commercial and industrial customers over 24-hour periods, and group similar profiles. We tested our method on energy usage data provided by several U.S. power utilities. The results show accurate grouping of accounts similar in their energy usage patterns, and potential for the method to be utilized in energy efficiency programs.
Anders Drachen, Joseph Riley, Shawna Baskin, Diego Klabjan
The in-game economies of massively multi-player online games (MMOGs) are complex systems that have to be carefully designed and managed. This paper presents the results of an analysis of auction house data from the MMOG Glitch, across a 14 month time period, the entire lifetime of the game. The data comprise almost 3 million data points, over 20,000 unique players and more than 650 products. Furthermore, an interactive visualization, based on Sankey flow diagrams, is presented which shows the proportion of the different clusters across each time bin, as well as the flow of players between clusters. The diagram allows evaluation of migration of players between clusters as a function of time, as well as churn analysis. The presented work provides a template analysis and visualization model for progression-based or temporal-based analysis of player behavior broadly applicable to games.
Taeheon Jeong, Diego Klabjan, Justin Starren
The behaviors of patients with depression are usually difficult to predict because the patients demonstrate the symptoms of a depressive episode without a warning at unexpected times. The goal of this research is to build algorithms that detect signals of such unusual moments so that doctors can be proactive in approaching already diagnosed patients before they fall in depression. Each patient is equipped with a smartphone with the capability to track its sensors. We first find the home location of a patient, which is then augmented with other sensor data to identify sleep patterns and select communication patterns. The algorithms require two to three weeks of training data to build standard patterns, which are considered normal behaviors; and then, the methods identify any anomalies in day-to-day data readings of sensors. Four smartphone sensors, including the accelerometer, the gyroscope, the location probe and the communication log probe are used for anomaly detection in sleeping and communication patterns.
Jaehoon Koo, Diego Klabjan, Jean Utke
Most recent machine learning research focuses on developing new classifiers for the sake of improving classification accuracy. With many well-performing state-of-the-art classifiers available, there is a growing need for understanding interpretability of a classifier necessitated by practical purposes such as to find the best diet recommendation for a diabetes patient. Inverse classification is a post modeling process to find changes in input features of samples to alter the initially predicted class. It is useful in many business applications to determine how to adjust a sample input data such that the classifier predicts it to be in a desired class. In real world applications, a budget on perturbations of samples corresponding to customers or patients is usually considered, and in this setting, the number of successfully perturbed samples is key to increase benefits. In this study, we propose a new framework to solve inverse classification that maximizes the number of perturbed samples subject to a per-feature-budget limits and favorable classification classes of the perturbed samples. We design algorithms to solve this optimization problem based on gradient methods, stochastic processes, Lagrangian relaxations, and the Gumbel trick. In experiments, we find that our algorithms based on stochastic processes exhibit an excellent performance in different budget settings and they scale well.
Xingyu Wang, Lida Zhang, Diego Klabjan
Certain type of documents such as tweets are collected by specifying a set of keywords. As topics of interest change with time it is beneficial to adjust keywords dynamically. The challenge is that these need to be specified ahead of knowing the forthcoming documents and the underlying topics. The future topics should mimic past topics of interest yet there should be some novelty in them. We develop a keyword-based topic model that dynamically selects a subset of keywords to be used to collect future documents. The generative process first selects keywords and then the underlying documents based on the specified keywords. The model is trained by using a variational lower bound and stochastic gradient optimization. The inference consists of finding a subset of keywords where given a subset the model predicts the underlying topic-word matrix for the unknown forthcoming documents. We compare the keyword topic model against a benchmark model using viral predictions of tweets combined with a topic model. The keyword-based topic model outperforms this sophisticated baseline model by 67%.
Cheolmin Kim, Youngseok Kim, Diego Klabjan
Power iteration has been generalized to solve many interesting problems in machine learning and statistics. Despite its striking success, theoretical understanding of when and how such an algorithm enjoys good convergence property is limited. In this work, we introduce a new class of optimization problems called scale invariant problems and prove that they can be efficiently solved by scale invariant power iteration (SCI-PI) with a generalized convergence guarantee of power iteration. By deriving that a stationary point is an eigenvector of the Hessian evaluated at the point, we show that scale invariant problems indeed resemble the leading eigenvector problem near a local optimum. Also, based on a novel reformulation, we geometrically derive SCI-PI which has a general form of power iteration. The convergence analysis shows that SCI-PI attains local linear convergence with a rate being proportional to the top two eigenvalues of the Hessian at the optimum. Moreover, we discuss some extended settings of scale invariant problems and provide similar convergence results for them. In numerical experiments, we introduce applications to independent component analysis, Gaussian mixtures, and non-negative matrix factorization. Experimental results demonstrate that SCI-PI is competitive to state-of-the-art benchmark algorithms and often yield better solutions.
Xin Qian, Diego Klabjan
Neural network pruning techniques reduce the number of parameters without compromising predicting ability of a network. Many algorithms have been developed for pruning both over-parameterized fully-connected networks (FCNs) and convolutional neural networks (CNNs), but analytical studies of capabilities and compression ratios of such pruned sub-networks are lacking. We theoretically study the performance of two pruning techniques (random and magnitude-based) on FCNs and CNNs. Given a target network {whose weights are independently sampled from appropriate distributions}, we provide a universal approach to bound the gap between a pruned and the target network in a probabilistic sense. The results establish that there exist pruned networks with expressive power within any specified bound from the target network.
Xiaoyi Liu, Diego Klabjan, Patrick NBless
Automatic data extraction from charts is challenging for two reasons: there exist many relations among objects in a chart, which is not a common consideration in general computer vision problems; and different types of charts may not be processed by the same model. To address these problems, we propose a framework of a single deep neural network, which consists of object detection, text recognition and object matching modules. The framework handles both bar and pie charts, and it may also be extended to other types of charts by slight revisions and by augmenting the training data. Our model performs successfully on 79.4% of test simulated bar charts and 88.0% of test simulated pie charts, while for charts outside of the training domain it degrades for 57.5% and 62.3%, respectively.
Alexander Stec, Diego Klabjan, Jean Utke
There are classification tasks that take as inputs groups of images rather than single images. In order to address such situations, we introduce a nested multi-instance deep network. The approach is generic in that it is applicable to general data instances, not just images. The network has several convolutional neural networks grouped together at different stages. This primarily differs from other previous works in that we organize instances into relevant groups that are treated differently. We also introduce a method to replace instances that are missing which successfully creates neutral input instances and consistently outperforms standard fill-in methods in real world use cases. In addition, we propose a method for manual dropout when a whole group of instances is missing that allows us to use richer training data and obtain higher accuracy at the end of training. With specific pretraining, we find that the model works to great effect on our real world and pub-lic datasets in comparison to baseline methods, justifying the different treatment among groups of instances.
Mengfan Xu, Diego Klabjan
Multi-armed Bandit motivates methods with provable upper bounds on regret and also the counterpart lower bounds have been extensively studied in this context. Recently, Multi-agent Multi-armed Bandit has gained significant traction in various domains, where individual clients face bandit problems in a distributed manner and the objective is the overall system performance, typically measured by regret. While efficient algorithms with regret upper bounds have emerged, limited attention has been given to the corresponding regret lower bounds, except for a recent lower bound for adversarial settings, which, however, has a gap with let known upper bounds. To this end, we herein provide the first comprehensive study on regret lower bounds across different settings and establish their tightness. Specifically, when the graphs exhibit good connectivity properties and the rewards are stochastically distributed, we demonstrate a lower bound of order $O(\log T)$ for instance-dependent bounds and $\sqrt{T}$ for mean-gap independent bounds which are tight. Assuming adversarial rewards, we establish a lower bound $O(T^{\frac{2}{3}})$ for connected graphs, thereby bridging the gap between the lower and upper bound in the prior work. We also show a linear regret lower bound when the graph is disconnected. While previous works have explored these settings with upper bounds, we provide a thorough study on tight lower bounds.
Shuyang Wang, Diego Klabjan
Jul 27, 2023·q-fin.TR·PDF We propose an ensemble method to improve the generalization performance of trading strategies trained by deep reinforcement learning algorithms in a highly stochastic environment of intraday cryptocurrency portfolio trading. We adopt a model selection method that evaluates on multiple validation periods, and propose a novel mixture distribution policy to effectively ensemble the selected models. We provide a distributional view of the out-of-sample performance on granular test periods to demonstrate the robustness of the strategies in evolving market conditions, and retrain the models periodically to address non-stationarity of financial data. Our proposed ensemble method improves the out-of-sample performance compared with the benchmarks of a deep reinforcement learning strategy and a passive investment strategy.
Baiyang Wang, Diego Klabjan
We consider the problem of modeling temporal textual data taking endogenous and exogenous processes into account. Such text documents arise in real world applications, including job advertisements and economic news articles, which are influenced by the fluctuations of the general economy. We propose a hierarchical Bayesian topic model which imposes a "group-correlated" hierarchical structure on the evolution of topics over time incorporating both processes, and show that this model can be estimated from Markov chain Monte Carlo sampling methods. We further demonstrate that this model captures the intrinsic relationships between the topic distribution and the time-dependent factors, and compare its performance with latent Dirichlet allocation (LDA) and two other related models. The model is applied to two collections of documents to illustrate its empirical performance: online job advertisements from DirectEmployers Association and journalists' postings on BusinessInsider.com.
Papis Wongchaisuwat, Diego Klabjan, Siddhartha R. Jonnalagadda
Community-based Question Answering (CQA) sites play an important role in addressing health information needs. However, a significant number of posted questions remain unanswered. Automatically answering the posted questions can provide a useful source of information for online health communities. In this study, we developed an algorithm to automatically answer health-related questions based on past questions and answers (QA). We also aimed to understand information embedded within online health content that are good features in identifying valid answers. Our proposed algorithm uses information retrieval techniques to identify candidate answers from resolved QA. In order to rank these candidates, we implemented a semi-supervised leaning algorithm that extracts the best answer to a question. We assessed this approach on a curated corpus from Yahoo! Answers and compared against a rule-based string similarity baseline. On our dataset, the semi-supervised learning algorithm has an accuracy of 86.2%. UMLS-based (health-related) features used in the model enhance the algorithm's performance by proximately 8 %. A reasonably high rate of accuracy is obtained given that the data is considerably noisy. Important features distinguishing a valid answer from an invalid answer include text length, number of stop words contained in a test question, a distance between the test question and other questions in the corpus as well as a number of overlapping health-related terms between questions. Overall, our automated QA system based on historical QA pairs is shown to be effective according to the data set in this case study. It is developed for general use in the health care domain which can also be applied to other CQA sites.