Friday, 17 May 2019

Top Python Libraries for Data Science

Data science is one of today's hottest topics. Everybody wants to learn it and become a data scientist or analyst. But getting into data science on your own is a little difficult for most of us, and I felt the same in my early days. So I decided to list everything systematically so that others can find it easily in one place.


Python is the heart of data science. Python is an object-oriented, high-level programming language with dynamic semantics, used widely for web and app development. Python also supports modules and packages, which means that programs can be designed in a modular style and code can be reused across a variety of projects.
As with many other programming languages, the available libraries drive much of Python’s success: well over 100,000 of them in the Python Package Index (PyPI), and growing constantly.
Data science, currently a major trend in the market, runs as well as it does largely because of the libraries available in Python. Data science covers AI (artificial intelligence), ML (machine learning), DL (deep learning), data visualization and much more.

Here I list the Python libraries needed for data science, by category:

Overall

Pandas -  data analysis and manipulation





It is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

Pandas is well suited for many different kinds of data:
  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered time series data
  • Arbitrary matrix data with row and column labels
  • Any other statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases.
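
A minimal sketch of the two structures, assuming pandas is installed. The fruit names, prices and quantities are made-up illustration data, not from any real dataset:

```python
import pandas as pd

# Series: a one-dimensional labeled array (hypothetical price list).
prices = pd.Series([1.5, 2.0, 0.5], index=["apple", "banana", "cherry"])

# DataFrame: a two-dimensional table whose columns can have different types.
orders = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "cherry"],
    "quantity": [3, 2, 1, 10],
})

# Look up each row's price by label, then aggregate the cost per fruit.
orders["cost"] = orders["quantity"] * orders["fruit"].map(prices)
total_by_fruit = orders.groupby("fruit")["cost"].sum()
print(total_by_fruit)
```

The Series acts as a labeled lookup table here, while the DataFrame holds the tabular data that gets grouped and summed.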



NumPy - scientific calculation







NumPy is the fundamental package for scientific calculation in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

NumPy is memory efficient, meaning it can handle large amounts of data more easily than plain Python lists. Besides, NumPy is very convenient to work with, especially for matrix multiplication and reshaping. On top of that, NumPy is fast. In fact, scikit-learn (another Python package) builds on NumPy arrays for its computations, and TensorFlow accepts and returns them.
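
As a rough illustration of the reshaping and matrix multiplication mentioned above, here is a small sketch assuming NumPy is installed. The values are arbitrary:

```python
import numpy as np

values = np.arange(6)               # one-dimensional array: 0, 1, 2, 3, 4, 5
matrix = values.reshape(2, 3)       # same data viewed as 2 rows x 3 columns

# Matrix multiplication of the 2x3 matrix with its 3x2 transpose.
product = matrix @ matrix.T

# Vectorized statistics along an axis, with no explicit Python loop.
column_means = matrix.mean(axis=0)

print(product)
print(column_means)
```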

SciPy - extension of NumPy


SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data. With SciPy an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.


Python SciPy has modules for the following tasks:
  • Optimization
  • Linear algebra
  • Integration
  • Interpolation
  • Special functions
  • FFT
  • Signal and Image processing
  • ODE solvers
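
Two of the modules listed above, integration and optimization, can be sketched as follows, assuming SciPy and NumPy are installed. The functions being integrated and minimized are toy examples chosen for illustration:

```python
import numpy as np
from scipy import integrate, optimize

# Integration: numerically integrate sin(x) from 0 to pi (exact answer is 2).
area, error_estimate = integrate.quad(np.sin, 0, np.pi)

# Optimization: find the x that minimizes the parabola (x - 3)^2 + 1.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)

print(area)       # close to 2
print(result.x)   # close to 3
```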

statsmodels - statistical analysis

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration. An extensive list of result statistics is available for each estimator. The results are tested against existing statistical packages to ensure that they are correct.
statsmodels is built on top of NumPy and SciPy.
It also uses pandas for data handling and Patsy for an R-like formula interface. It takes its graphics functions from matplotlib. It is known to provide the statistical background for other Python packages.
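
The R-like formula interface can be sketched like this, assuming statsmodels, pandas and NumPy are installed. The data is synthetic (a perfect line y = 2x + 1), so the fitted coefficients are known in advance:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data lying exactly on the line y = 2x + 1.
x = np.arange(10, dtype=float)
data = pd.DataFrame({"x": x, "y": 2.0 * x + 1.0})

# "y ~ x" is the formula: regress y on x, with an intercept added automatically.
model = smf.ols("y ~ x", data=data).fit()

intercept = model.params["Intercept"]
slope = model.params["x"]
print(intercept, slope)   # approximately 1.0 and 2.0
```

On real, noisy data, `model.summary()` prints the extensive result statistics mentioned above.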


  • Machine learning 




Machine learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one could come across. As the name suggests, it gives the computer something that makes it more similar to humans: the ability to learn on its own. Machine learning is actively being used today, perhaps in many more places than one would expect.



Scikit-learn - machine learning




Scikit-learn is a free software machine learning library for the Python programming language. A learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional entry, it is said to have several attributes or features.
It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN.

Learning problems fall into a few categories:
  1. supervised learning: classification/regression
  2. unsupervised learning
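
Both categories can be sketched with scikit-learn's bundled iris dataset, assuming scikit-learn is installed. The choice of logistic regression and k-means is only for illustration, and any other estimator follows the same fit/predict pattern:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

features, labels = load_iris(return_X_y=True)

# Supervised learning: train on labeled samples, then score on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)

# Unsupervised learning: group the same samples without using the labels.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(features)

print(accuracy)
print(cluster_ids[:10])
```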

XGBoost/LightGBM/CatBoost 



CatBoost lets you pass the indices of the categorical columns. Those columns whose number of distinct values is less than or equal to one_hot_max_size are then one-hot encoded.

Similar to CatBoost, LightGBM can also handle categorical features when given their feature names. It does not convert them to one-hot encoding and is much faster than one-hot encoding. LGBM uses a special algorithm to find the split value of categorical features.

Unlike CatBoost or LGBM, XGBoost (at the time of writing) cannot handle categorical features by itself. It only accepts numerical values, similar to Random Forest. Therefore one has to apply an encoding such as label encoding, mean encoding or one-hot encoding before supplying categorical data to XGBoost.
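
The encodings mentioned above can be prepared with pandas before the data reaches the booster. This is a sketch on a made-up two-column table, not part of any of the three libraries:

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
frame = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": [1, 2, 3, 4],
})

# Label encoding: each category becomes an integer code
# (categories are ordered alphabetically: blue=0, green=1, red=2).
label_encoded = frame["color"].astype("category").cat.codes

# One-hot encoding: each category becomes its own 0/1 column.
one_hot = pd.get_dummies(frame, columns=["color"], dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids that at the cost of more columns.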


ELI5



ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions. It provides support for the following machine learning frameworks and packages:



  1. scikit-learn
  2. XGBoost
  3. LightGBM
  4. lightning
  5. sklearn-crfsuite




  • Deep Learning


Deep learning is part of a broader family of machine learning methods based on artificial neural networks, algorithms inspired by the structure and function of the brain. Learning can be supervised, semi-supervised or unsupervised.
Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. It is the key to voice control in consumer devices like phones, tablets, TVs, and hands-free speakers.




TensorFlow


TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.

Google's TensorFlow is currently one of the best-known deep learning libraries in the world. Google uses machine learning across its products to improve search, translation, image captioning and recommendations.

In TensorFlow, all computations involve tensors. A tensor is an n-dimensional array, a generalization of vectors and matrices, that can represent all types of data. All values in a tensor have the same data type, and the tensor has a known (or partially known) shape. The shape is the size of the tensor along each of its dimensions.
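
A small sketch of tensors, shapes and data types, assuming TensorFlow 2.x with its default eager execution (in 1.x the same values would need a session to evaluate):

```python
import tensorflow as tf

scalar = tf.constant(3.0)                 # rank 0: a single number
vector = tf.constant([1.0, 2.0, 3.0])     # rank 1: shape (3,)
matrix = tf.constant([[1, 2], [3, 4]])    # rank 2: shape (2, 2)

# Every element of a tensor shares one data type.
print(vector.shape, vector.dtype)   # shape (3,), float32
print(matrix.shape, matrix.dtype)   # shape (2, 2), int32

# Operations on tensors produce new tensors.
doubled = matrix * 2
print(doubled.numpy())
```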



PyTorch



It’s a Python-based scientific computing package targeted at two sets of audiences:
  • a replacement for NumPy that uses the power of GPUs
  • a deep learning research platform that provides maximum flexibility and speed


PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. A few other advantages of using PyTorch are its multi-GPU support, custom data loaders and simplified preprocessors. Since its public release in early 2017, many researchers have adopted it as a go-to library because it makes building novel and even extremely complex graphs easy. Having said that, it may still take some time before PyTorch is adopted by the majority of data science practitioners, given its new and “under construction” status.
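
The "NumPy replacement on GPUs" idea can be sketched as follows, assuming PyTorch and NumPy are installed. The code falls back to the CPU when no CUDA GPU is available, so it runs on any machine:

```python
import numpy as np
import torch

# Start from an ordinary NumPy array and wrap it as a tensor.
array = np.array([[1.0, 2.0], [3.0, 4.0]])
tensor = torch.from_numpy(array)

# Pick the GPU if one is present, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
on_device = tensor.to(device)

# Matrix multiplication runs on whichever device holds the tensor.
product = on_device @ on_device

# Bring the result back to NumPy for use with other libraries.
back_to_numpy = product.cpu().numpy()
print(back_to_numpy)
```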





Keras 




Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation.

Keras is compatible with: Python 2.7-3.6.

The core data structure of Keras is a model, a way to organize layers. The simplest type of model is the Sequential model, a linear stack of layers. For more complex architectures, you should use the Keras functional API, which lets you build arbitrary graphs of layers.
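
The two ways of building a model can be sketched side by side. This assumes the Keras that ships with TensorFlow 2.x (`from tensorflow import keras`) rather than the standalone package. The layer sizes are arbitrary:

```python
import numpy as np
from tensorflow import keras

# Sequential model: a linear stack of layers.
sequential_model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])

# Functional API: layers are called on tensors, so the same pieces
# can be wired into arbitrary graphs (here, the same simple chain).
inputs = keras.Input(shape=(4,))
hidden = keras.layers.Dense(8, activation="relu")(inputs)
outputs = keras.layers.Dense(1)(hidden)
functional_model = keras.Model(inputs=inputs, outputs=outputs)

# Run a dummy batch of 2 samples with 4 features through both models.
batch = np.zeros((2, 4), dtype="float32")
sequential_output_shape = tuple(sequential_model(batch).shape)
functional_output_shape = tuple(functional_model(batch).shape)
print(sequential_output_shape, functional_output_shape)
```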

  • Visualization




Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.

As the “age of Big Data” kicks into high gear, visualization is an increasingly important tool for making sense of the trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form that is easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise from data and highlighting the useful information.

matplotlib


Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail gallery in the Matplotlib documentation. For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
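
A short sketch of the object-oriented interface, assuming Matplotlib and NumPy are installed. The non-interactive "Agg" backend and the in-memory buffer are choices made here so that the script runs without a display. In a notebook you would simply show the figure:

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no GUI toolkit required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Object-oriented interface: work with explicit Figure and Axes objects.
fig, ax = plt.subplots()
line, = ax.plot(x, np.sin(x), linestyle="--", label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("A simple line plot")
ax.legend()

# Save the figure as PNG bytes (a file name would work equally well).
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```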

seaborn




Seaborn is a Python library for making statistical graphics. It is built on top of matplotlib and closely integrated with pandas data structures, and it provides a high-level interface for drawing attractive and informative statistical graphics.

Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.
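
The dataset-oriented style can be sketched like this, assuming seaborn, pandas and Matplotlib are installed. The small table is made-up illustration data, and the "Agg" backend is used only so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no GUI toolkit required
import pandas as pd
import seaborn as sns

# Hypothetical measurements in two groups.
data = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 6.0],
})

# Pass the whole DataFrame and name the columns; seaborn maps them to
# the axes and aggregates the values per group (the mean, by default).
ax = sns.barplot(data=data, x="group", y="value")
ax.set_title("Mean value per group")
```

Note that the code never computes the group means itself. That aggregation is what the paragraph above calls the internal semantic mapping and statistical aggregation.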



plotly




Plotly is a company that makes visualization tools including a Python API library. (Plotly also makes Dash, a framework for building interactive web-based applications with Python code.) The plotly Python library works well in a Jupyter Notebook, and images can be touched up in the online plotly editor. In the plotly versions current at the time of writing, a graph made through the library's online mode is published to Plotly's hosting service, which makes sharing visualizations easy.


Plotly (the Python library) uses declarative programming, which means we write code describing what we want to make rather than how to make it. We provide the basic framework and end goals and let plotly figure out the implementation details. In practice, this means less effort spent building up a figure, allowing us to focus on what to present and how to interpret it.




Thank you for reading the blog.
Please leave your feedback in the comment section below. If you like it, I will post more blogs on data science, as this is only the start and there is a long way to go.
I will also try to address any suggestions. You can comment if you need a blog on any particular related topic.
If you like this blog, please share it.



