Klib
Examples¶
Missing Value Plot¶
klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience, in the examples section, or on YouTube (Data Professor).
This plot visualizes the missing values in a dataset. At the top it shows the aggregate for each column using a relative scale and absolute missing-value annotations, while on the right, summary statistics and individual row results are displayed. The plot gives a quick overview of the structure of missing values and how they relate to each other in a dataset, making it easy to determine which columns and rows to investigate or drop.
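The numbers behind such a plot can be approximated with plain pandas. The following is a minimal sketch, not klib's implementation; the toy dataframe and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": ["x", "y", None, "z"],
        "c": [10, 20, 30, 40],
    }
)

# Per-column aggregate: absolute count and relative share of missing values.
mv_per_col = df.isna().sum()
mv_ratio_per_col = df.isna().mean()

# Per-row result: number of missing values in each row.
mv_per_row = df.isna().sum(axis=1)

print(mv_per_col.to_dict())
print(mv_ratio_per_col.to_dict())
print(mv_per_row.tolist())
```

Columns with a high missing-value ratio and rows with many missing entries are the natural candidates to investigate or drop.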
Correlation Plots¶
This plot visualizes the correlation between different features. Settings include the possibility to display only positive, negative, high or low correlations as well as to specify an additional threshold. This works for Pearson, Spearman and Kendall correlation. Annotations and development settings can optionally be turned on or off.
Further, if a column is specified, either by name or by passing in a separate target list or pd.Series, the plot shows the correlation of all features with the specified target.
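The underlying computation can be sketched with pandas alone. This is an illustration under assumptions, not klib's code: the data, the column names and the 0.9 threshold are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; "target" plays the role of the specified target column.
df = pd.DataFrame(
    {
        "x1": [1, 2, 3, 4, 5],
        "x2": [5, 4, 3, 2, 1],
        "x3": [2, 1, 4, 3, 5],
        "target": [2, 4, 6, 8, 10],
    }
)

# Pearson is the pandas default; "spearman" or "kendall" can be passed instead
# (Kendall requires SciPy to be installed).
corr = df.corr(method="pearson")

# Correlation of every feature with the target column.
target_corr = corr["target"].drop("target")

# Keep only positive correlations, or only those above an absolute threshold.
positive_only = target_corr[target_corr > 0]
strong = target_corr[target_corr.abs() >= 0.9]

print(target_corr.round(2).to_dict())
print(list(positive_only.index))
print(list(strong.index))
```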
Numerical Data Distribution Plot¶
Categorical Data Plot¶
This section shows an example of categorical data visualization. The function displays the top and/or bottom values in each column, ranked by frequency. It also gives an idea of the distribution of the values in the dataset. This plot comes in very handy during data analysis when considering changing datatypes to “category” or when planning to combine less frequent values into a separate category before applying one-hot encoding or similar functions.
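The frequency counts behind such a plot, and the follow-up step of pooling rare values before one-hot encoding, can be sketched in pandas. The values below are hypothetical and the choice of keeping the top 2 values is an assumption made for the example.

```python
import pandas as pd

# Hypothetical categorical column.
s = pd.Series(["AA", "AA", "AA", "UA", "UA", "DL", "B6", "WN"], name="carrier")

# Frequency of each value, most frequent first.
counts = s.value_counts()
top = counts.head(2)
bottom = counts.tail(2)

# Share of all rows covered by the top values.
top_share = top.sum() / counts.sum()

# Combine less frequent values into a separate category, then one-hot encode.
pooled = s.where(s.isin(top.index), other="other").astype("category")
dummies = pd.get_dummies(pooled)

print(top.to_dict())
print(top_share)
print(sorted(dummies.columns))
```

Pooling first keeps the one-hot encoded frame at three columns instead of five.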
Data Cleaning and Aggregation¶
This section describes the data cleaning and aggregation capabilities of klib. The functions have been shown to yield great results, even with dataframes as large as 20 GB, drastically reducing their size and dimensions and therefore speeding up further calculations and reducing the time to save and load the data.
For demonstration purposes, we apply the functions to a dataset of US flight data, which has an initial size of about 51 MB.
klib.data_cleaning()¶
Applying klib.data_cleaning() reduces the size by about 44 MB (-85.2%). This is achieved by dropping empty and single-valued columns as well as empty and duplicate rows (neither found in this example). Additionally, the optimal data types are inferred and applied, which also increases memory efficiency. This kind of reduction is not uncommon. For larger datasets the reduction in size often surpasses 90%.
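The steps named above can be approximated in plain pandas. This is a simplified sketch of the idea, not the code of klib.data_cleaning(); the helper name clean_sketch and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api import types as pdt


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop uninformative rows/columns, shrink dtypes."""
    out = df.dropna(axis=1, how="all")               # drop empty columns
    out = out.loc[:, out.nunique(dropna=False) > 1]  # drop single-valued columns
    out = out.dropna(axis=0, how="all")              # drop empty rows
    out = out.drop_duplicates().reset_index(drop=True)
    for col in out.columns:
        if pdt.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pdt.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pdt.is_object_dtype(out[col]) or pdt.is_string_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out


# Hypothetical toy data with one empty column, one constant column
# and one duplicated row.
df = pd.DataFrame(
    {
        "flight_id": range(1000),
        "carrier": ["AA", "UA", "DL", "WN"] * 250,
        "delay": np.arange(1000) / 4.0,
        "empty": np.nan,
        "constant": 1,
    }
)
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

cleaned = clean_sketch(df)
before = df.memory_usage(deep=True).sum()
after = cleaned.memory_usage(deep=True).sum()
print(list(cleaned.columns), len(cleaned))
print(f"memory reduced by {1 - after / before:.1%}")
```

Most of the saving in this toy case comes from the dtype changes: the string column becomes a category and the integer column is downcast.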
klib.pool_duplicate_subsets()¶
Further, klib.pool_duplicate_subsets() can be applied, which ultimately reduces the dataset to only 3.8 MB (from 51 MB originally). This is a reduction of roughly 92.5%.
This function “pools” columns together based on several settings. Specifically, the pooling is achieved by finding duplicates in subsets of the data and encoding the largest possible subset with sufficient duplicates as integers. These are then added to the original data, which allows dropping the previously identified and now encoded columns. While the encoding itself does not lead to a loss of information, some details might get lost in the aggregation step. Although this is unlikely, it is advisable to explicitly exclude features that provide sufficient informational content by themselves, as well as the target column, by using the “exclude” setting.
As can be seen in cat_plot(), the “carrier” column is made up of a few very frequent values - the top 4 values make up roughly 75% - while in “tailnum” the top 4 values barely make up 2%. This allows “carrier” and similar columns to be bundled and encoded, while “tailnum” remains in the dataset. Using this procedure, 56006 duplicate rows are identified in the subset, i.e., 56006 rows in 10 columns are encoded into a single column of dtype integer, greatly reducing the memory footprint and the number of columns, which should speed up model training.
All of these functions were run with their relatively “soft” default settings. Many parameters are available, allowing more restrictive data cleaning where needed.
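The encoding step can be illustrated with a small pandas sketch. It assumes the subset of columns has already been chosen and leaves out the search for the largest suitable subset, so it is not the algorithm of klib.pool_duplicate_subsets(); the data and the column name pooled_code are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "carrier" and "origin" repeat a lot, "tailnum" does not.
df = pd.DataFrame(
    {
        "carrier": ["AA", "AA", "UA", "UA", "AA", "UA"],
        "origin": ["JFK", "JFK", "EWR", "EWR", "JFK", "EWR"],
        "tailnum": ["N1", "N2", "N3", "N4", "N5", "N6"],
    }
)

subset = ["carrier", "origin"]  # assumed to be given; the search is omitted

# Share of rows whose value combination occurs more than once.
dup_ratio_subset = df.duplicated(subset=subset, keep=False).mean()
dup_ratio_tailnum = df.duplicated(subset=["tailnum"], keep=False).mean()

# Encode each distinct combination as one integer and drop the originals.
pooled = df.drop(columns=subset)
pooled["pooled_code"] = df.groupby(subset).ngroup()

print(dup_ratio_subset, dup_ratio_tailnum)
print(pooled)
```

Here two columns collapse into a single integer column, while "tailnum", which has no repeated values, stays in the dataset unchanged.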
Furthermore, the function klib.mv_col_handling() provides a sophisticated selection mechanism for columns with relatively many missing values. Instead of just dropping the data, these columns are converted into binary features (empty or not) and checked for correlations among each other, with other features and, afterwards, with the label before a decision on omitting them is made.
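A heavily simplified sketch of this idea follows. It only checks the correlation of each binary indicator with the label and skips the checks against other features, so it should not be read as the logic of klib.mv_col_handling(); the data and both thresholds are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two sparse columns and a binary label.
nan = np.nan
df = pd.DataFrame(
    {
        "sparse_a": [nan, nan, 3.0, nan, 5.0, nan, nan, 8.0],
        "sparse_b": [1.0, nan, 3.0, nan, nan, nan, nan, nan],
        "dense": [1, 2, 3, 4, 5, 6, 7, 8],
        "label": [0, 0, 1, 0, 1, 0, 0, 1],
    }
)

mv_threshold = 0.5    # assumed: columns with more missing values are candidates
corr_threshold = 0.3  # assumed: minimum |correlation| with the label to keep

mv_ratio = df.drop(columns="label").isna().mean()
candidates = mv_ratio[mv_ratio > mv_threshold].index

dropped = []
for col in candidates:
    # Binary feature: 1 where a value is present, 0 where it is missing.
    indicator = df[col].notna().astype(int)
    if abs(indicator.corr(df["label"])) < corr_threshold:
        dropped.append(col)

result = df.drop(columns=dropped)
print(dropped)
print(list(result.columns))
```

In this toy case the presence of a value in "sparse_a" lines up with the label, so the column is kept, while "sparse_b" carries little signal and is dropped.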