Metadata-Version: 2.1
Name: pythresh
Version: 0.1.0
Summary: A Python Toolbox for Outlier Detection Thresholding
Home-page: https://github.com/KulikDM/pythresh
Author: D Kulik
License: UNKNOWN
Download-URL: https://github.com/KulikDM/pythresh/archive/master.zip
Keywords: outlier detection,anomaly detection,outlier ensembles,thresholding,cutoff
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Description-Content-Type: text/x-rst
License-File: LICENSE

Python Outlier Detection Thresholding (PyThresh)
================================================

PyThresh is a comprehensive and scalable **Python toolkit** for **thresholding detected possible outliers** in univariate/multivariate data. It has been written to work in tandem with PyOD and uses similar syntax and data structures, but it is not limited to that library. PyThresh thresholds the scores generated by an outlier detector without the need to set a contamination level or to guess the number of outliers in the dataset beforehand. These non-parametric methods reduce the user's input and guesswork by relying on statistics to threshold the outlier scores.

For thresholding to work correctly, the scores must follow one rule: the higher the score, the higher the probability that the sample is an outlier in the dataset. All threshold functions return a binary array in which 0 values represent inliers and 1 values represent outliers.

PyThresh includes more than 30 thresholding algorithms. These algorithms range from simple statistical analysis, such as the Z-score, to more complex mathematical methods that involve graph theory and topology.
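
To make the score contract concrete, here is a minimal sketch of a Z-score style thresholder written with NumPy only. It is not PyThresh's implementation: the function name ``zscore_threshold`` and the cutoff of one standard deviation above the mean are assumptions chosen for illustration. It only shows the expected input (higher score means more outlying) and output (a binary array of 0 for inliers and 1 for outliers).

```python
import numpy as np


def zscore_threshold(scores, cutoff=1.0):
    """Label scores as inliers (0) or outliers (1) with a simple Z-score rule.

    Hypothetical helper for illustration only; `cutoff` is an assumed default.
    """
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # All scores are identical, so nothing stands out as an outlier.
        return np.zeros(scores.shape, dtype=int)
    z = (scores - scores.mean()) / std
    # Higher scores are more outlying, so only the upper tail is flagged.
    return (z > cutoff).astype(int)


# Five similar scores and one that is clearly larger than the rest.
decision_scores = np.array([0.10, 0.12, 0.11, 0.13, 0.09, 0.95])
labels = zscore_threshold(decision_scores)
```

If a detector reports scores in the opposite orientation (lower means more abnormal), negate them before thresholding so that the rule above still holds.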


**Outlier Detection Thresholding with 7 Lines of Code**\ :


.. code-block:: python


    # train the KNN detector
    from pyod.models.knn import KNN
    from pythresh.thresholds.dsn import DSN

    # X_train is assumed to be an existing array of shape (n_samples, n_features)
    clf = KNN()
    clf.fit(X_train)

    # get outlier scores
    decision_scores = clf.decision_scores_  # raw outlier scores on the train data

    # get outlier labels (0: inlier, 1: outlier)
    thres = DSN()
    labels = thres.eval(decision_scores)
    

Installation
^^^^^^^^^^^^

It is recommended to use **pip** or **conda** for installation:

.. code-block:: bash

   pip install pythresh            # normal install
   pip install --upgrade pythresh  # or update if needed

.. code-block:: bash

   conda install -c conda-forge pythresh

Alternatively, you can clone the repository and install it from source:

.. code-block:: bash

   git clone https://github.com/KulikDM/pythresh
   cd pythresh
   pip install .


**Required Dependencies**\ :


* matplotlib
* numpy>=1.13
* scipy>=1.3.1
* scikit_learn>=0.20.0
* six
* pyod


API Cheatsheet
^^^^^^^^^^^^^^


* **eval(score)**\ : evaluate the outlier scores and return binary labels (0 for inliers, 1 for outliers).

Key Attributes of a fitted model:


* **thresh_**\ : the threshold value that separates inliers from outliers. All values above this threshold are considered outliers. Note that the threshold value is derived from normalized scores.
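
Because the threshold is derived from normalized scores, it cannot be compared directly with raw detector scores. The sketch below shows one way to map it back to the raw scale. It assumes min-max normalization to the range [0, 1], which is an assumption and not something stated here; the helper names ``min_max_normalize`` and ``to_raw_scale`` are hypothetical, and the value ``0.5`` stands in for a threshold read from ``thresh_``.

```python
import numpy as np


def min_max_normalize(scores):
    """Scale scores to [0, 1] (assumed normalization, for illustration)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def to_raw_scale(normalized_threshold, raw_scores):
    """Invert the min-max scaling for a single threshold value."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    lo, hi = raw_scores.min(), raw_scores.max()
    return lo + normalized_threshold * (hi - lo)


raw_scores = np.array([2.0, 3.0, 4.0, 12.0])
normalized = min_max_normalize(raw_scores)

normalized_threshold = 0.5  # stand-in for a value read from thresh_
raw_threshold = to_raw_scale(normalized_threshold, raw_scores)

# Thresholding on either scale gives the same labels.
labels_from_normalized = (normalized > normalized_threshold).astype(int)
labels_from_raw = (raw_scores > raw_threshold).astype(int)
```

Under this assumption the two label arrays agree, because min-max scaling preserves the ordering of the scores.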

Implemented Algorithms
^^^^^^^^^^^^^^^^^^^^^^

**(i) Individual Thresholding Algorithms**\ :

======== ==================================================== ==============================================================================
Abbr     Description                                          Parameters
======== ==================================================== ==============================================================================
AUCP     Area Under Curve Percentage thresholder              None
BOOT     Bootstrapping thresholder                            None
CHAU     Chauvenet's criterion thresholder                    method: ['mean', 'median', default='gmean']
CLF      Trained Classifier thresholder                       None
DSN      Distance Shift from Normal thresholder               metric: ['JS': Jensen-Shannon, 'WS': Wasserstein, 'ENG': Energy,
                                                              'BHT': Bhattacharyya, 'HLL': Hellinger, 'HI': Histogram intersection,
                                                              default='LK': Lukaszyk–Karmowski metric for normal distributions,
                                                              'LP': Levy-Prokhorov, 'MAH': Mahalanobis, 'TMT': Tanimoto,
                                                              'RES': Studentized residual distance]
EB       Elliptical Boundary thresholder                      None
FGD      Fixed Gradient Descent thresholder                   None
FWFM     Full Width at Full Minimum thresholder               None
GESD     Generalized Extreme Studentized Deviate thresholder  max_outliers: int, default=None; alpha: float, default=0.05
GF       Gaussian Filter thresholder                          None
HIST     Histogram based thresholders                         n_bins: int, default=None; method: [default='otsu', 'yen', 'isodata', 'li',
                                                              'minimum', 'triangle']
IQR      Inter-Quartile Region thresholder                    None
KMEANS   KMEANS clustering thresholder                        None
MAD      Median Absolute Deviation thresholder                None
MCST     Monte Carlo Shapiro Tests thresholder                None
MOLL     Friedrichs' mollifier thresholder                    None
MTT      Modified Thompson Tau test thresholder               strictness: [1, 2, 3, default=4, 5]
QMCD     Quasi-Monte Carlo Discrepancy thresholder            method: ['CD', default='WD', 'MD', 'L2-star']; lim: ['Q', default='P']
REGR     Regression based thresholder                         method: [default='siegel', 'theil']
SHIFT    Mean Shift clustering thresholder                    None
WIND     Topological Winding number thresholder               None
YJ       Yeo-Johnson transformation thresholder               None
ZSCORE   ZSCORE thresholder                                   None
======== ==================================================== ==============================================================================

Implementations & Benchmarks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**A comparison of the implemented models and their general implementation** is made available below.

For Jupyter Notebooks, please navigate to **"/notebooks/"**.




