pandas – get started with examples

This is to get started with pandas and try a few concrete examples. pandas is a Python-based library that helps in reading, transforming, cleaning and analyzing data. It is built on the NumPy package.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

https://pandas.pydata.org


The key data structure in pandas is the DataFrame. It represents tabular data as rows of observations and columns of features.

Download or fork the entire Jupyter notebook from my GitHub to play around: https://github.com/sandeep-mewara/python-examples

pandas basics include:

  • Series
  • DataFrames
    • Create
      • from list of tuples
      • from a dictionary
      • from a CSV
      • from a built-in dataset (e.g. from sklearn.datasets)
    • Data retrieval
    • Modifying data
    • Group by operation
    • Custom Functions – apply method
    • Pre-Processing
      • drop, mean, mode
      • ordinal feature
      • nominal feature
    • Reshaping
      • CrossTab
      • Merge
      • Melt
      • Pivot

# .info(), .head() and .sample() are handy methods to call first on a DataFrame to get a high-level overview
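
A quick sketch of that first look, assuming a small made-up table (the column names and values are placeholders, not data from the notebook):

```python
import pandas as pd

# Hypothetical toy data; the column names are placeholders.
df = pd.DataFrame({"driver": ["A", "B", "C", "D"], "trips": [12, 7, 9, 15]})

first_two = df.head(2)                       # first two rows
one_random = df.sample(n=1, random_state=0)  # one random row, seeded for repeatability
df.info()                                    # prints column dtypes and non-null counts
```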

# an index need not be unique, so a label lookup can return multiple rows
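
To see this in action, here is a minimal sketch with a hypothetical Series whose index repeats a label:

```python
import pandas as pd

# Hypothetical Series whose index contains the label "a" twice.
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

print(s.index.is_unique)  # False
repeated = s.loc["a"]     # a Series holding both matching values
single = s.loc["b"]       # a single value, because "b" occurs once
```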

# boolean indexing (masking) selects the rows that satisfy a condition

# .isin() is useful when building a boolean mask

# .where() keeps the shape of the original table; entries that fail the condition become NaN
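
The three selection notes above can be sketched together. This assumes a made-up fares table; the city names and numbers are placeholders:

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook.
df = pd.DataFrame(
    {"city": ["Pune", "Delhi", "Pune", "Goa"], "fare": [120, 80, 200, 150]}
)

# Boolean indexing: keep only the rows where the condition is True.
expensive = df[df["fare"] > 100]

# .isin() builds the mask from a list of allowed values.
in_cities = df[df["city"].isin(["Pune", "Goa"])]

# .where() keeps all four positions; the failing one becomes NaN.
fares_where = df["fare"].where(df["fare"] > 100)
```

Note the difference: the masked results have fewer rows than `df`, while the `.where()` result is still four entries long.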

# column names and the index can be set explicitly if needed

# to modify the table itself instead of getting back a modified copy, pass inplace=True (on methods that support it)
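
A small sketch of both points, assuming a hypothetical two-row table built from a list of tuples:

```python
import pandas as pd

# Hypothetical data built from a list of tuples; columns default to 0 and 1.
df = pd.DataFrame([("Asha", 82), ("Ravi", 67)])
df.columns = ["name", "marks"]   # set the column names

by_name = df.set_index("name")   # returns a new DataFrame; df is unchanged

# With inplace=True the method changes df itself and returns None.
result = df.sort_values("marks", inplace=True)
```

Reassigning, as in `df = df.sort_values("marks")`, gives the same table and is often easier to read.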

# aggregate operations can be applied on a groupby object
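
For instance, a minimal sketch with made-up student marks (names and numbers are placeholders):

```python
import pandas as pd

# Hypothetical marks table, not the notebook's data.
df = pd.DataFrame({
    "student": ["Asha", "Asha", "Ravi", "Ravi"],
    "subject": ["math", "physics", "math", "physics"],
    "marks": [82, 74, 67, 91],
})

# One aggregate per group.
mean_by_student = df.groupby("student")["marks"].mean()

# Several aggregates at once via .agg().
summary = df.groupby("subject")["marks"].agg(["min", "max"])
```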

# dropna(), or filling gaps with the mean() or mode() via fillna(), are handy ways to pre-process missing data
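
A minimal sketch of both approaches, assuming a hypothetical table with one missing value in each column:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; only the first row is complete.
df = pd.DataFrame({"age": [25.0, np.nan, 35.0], "city": ["Pune", "Pune", None]})

# Option 1: drop every row that has a missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the mean and categorical gaps with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```

Which option suits a dataset depends on how much data is missing and why, so treat this as a starting point rather than a rule.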

Key learnings …

Examples notebook includes:

  • Uber taxi drivers
  • Apple stock price
  • Day or Night
  • Students marks
  • Balance Calculator

# .describe() is a handy method to get the statistical summary of numerical columns

# one-hot encoding is really helpful for nominal features (categories that have no natural order)
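
A sketch of the difference between ordinal and nominal features, assuming a made-up table. The size-to-number mapping is an assumption for illustration, and pd.get_dummies is one of several ways to one-hot encode:

```python
import pandas as pd

# Hypothetical data: "size" has a natural order, "color" does not.
df = pd.DataFrame({"size": ["S", "L", "M"], "color": ["red", "blue", "red"]})

# Ordinal feature: map categories to integers that preserve the order.
df["size_code"] = df["size"].map({"S": 0, "M": 1, "L": 2})

# Nominal feature: one indicator column per category.
encoded = pd.get_dummies(df, columns=["color"])
print(list(encoded.columns))
```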

# converting columns to the right data type helps
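
For example, values read from a CSV often arrive as strings. A minimal sketch with made-up values (not real stock prices):

```python
import pandas as pd

# Hypothetical columns that were read in as plain strings.
df = pd.DataFrame({"date": ["2020-01-02", "2020-01-03"], "close": ["10.5", "11.0"]})

df["date"] = pd.to_datetime(df["date"])   # strings -> datetime
df["close"] = df["close"].astype(float)   # strings -> float

total = df["close"].sum()                 # arithmetic now works as expected
years = df["date"].dt.year.tolist()       # date parts are available through .dt
```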

# converting data into meaningful numbers helps with analysis

# groupby is a powerful tool for analyzing DataFrames

Key learnings …

Cheat sheet

Credit: Pandas website

The cheat sheet PDF can be downloaded from the pandas website.
For more details about pandas, see the official documentation.

Keep learning!

Python as statistics workbench

While reading up on AI/ML (Artificial Intelligence/Machine Learning), I came across a discussion on whether Python can be used as a “statistics workbench” to replace R, SPSS, etc. Several knowledgeable people shared their views there on the languages used for statistical problems, specifically R (read about R here).

Discussion here: https://stats.stackexchange.com/questions/1595/python-as-a-statistics-workbench

For quick reference, I will quote a few of the latest thoughts from there that are in favor of Python and how it has evolved. I too concur with most of them:

1. Python has easily the most intuitive syntax of any programming language. This makes for extremely fast development time.

2. Python is performant. It opens large datasets reliably.

3. The packages in Python are fast catching up to R’s packages. Python usage has increased tremendously in the last few years.

4. Readability is one of the most important qualities good code can possess, and Python is one of the most readable languages.

5. Python has extremely well-thought-out IDEs now: PyCharm & Visual Studio Code.

https://stats.stackexchange.com/a/457753

Overall, Python is a general-purpose language with an easy-to-understand syntax, which makes it relatively easy for most programmers to learn and adopt. R was developed with statisticians in mind, so it has many features around data visualization and is currently a tad ahead.

A little research …

Recently, DataCamp also published an article comparing R and Python for data analysis. It compares the two nicely on various parameters; I am picking just a couple of them here:

The final analysis in the article puts R ahead for data analysis, but sees Python as having the potential to catch up quickly and easily.

My thoughts …

My intent was to understand which programming language serves as an essential tool for demonstrating AI/ML capabilities. Looking at the options, Python seems good enough for me to start with as an AI/ML tool, and probably to conquer the field with.

Ammunition needed …

There are many Python-based libraries and packages that are commonly used for statistical work. Below are a few that will help in our data analysis exploration going forward:

  • scipy – Python-based ecosystem of open-source software for mathematics, science, and engineering.
    • cookbook – a collection of user-contributed recipes, covering many statistical facilities
    • numpy – base N-dimensional array package. A handful of examples are listed here
    • pandas – a fast, powerful, flexible and easy to use data analysis and manipulation tool
    • matplotlib – a comprehensive library for creating static, animated, and interactive visualizations
  • scikit-learn – simple and efficient machine learning tools for predictive data analysis
  • keras – API for deep learning
  • tensorflow – API to develop and train ML models

Since I am a programmer, I may be biased here. But it seems Python can do, and does, everything needed to start the AI/ML journey.

Happy learning!

NumPy – Basics & Examples

This is to get started with NumPy and try a few concrete examples. NumPy (Numerical Python) is a package for numerical computation, designed for efficient work on large data sets.

The entire Jupyter notebook can be downloaded or forked from my GitHub to play around: https://github.com/sandeep-mewara/python-examples


Reference: https://numpy.org/learn/

NumPy basics include:

  • Initialize Matrix via
    • List
    • NULL Matrix
    • IDENTITY Matrix
    • ONES Matrix
  • Matrix Transpose
  • Matrix Indexing
  • Simulation
  • Basic CSV file operations
  • Matrix Broadcasting
  • Basic Image Processing

# a matrix in plain Python is a list of lists
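
A short sketch of how that nested list becomes a NumPy array, alongside the other initializers from the list above (the values are arbitrary):

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]   # plain Python: a list of lists
m = np.array(nested)              # 2 x 3 NumPy array built from the list

zeros = np.zeros((2, 3))          # null matrix
identity = np.eye(3)              # identity matrix
ones = np.ones((2, 3))            # ones matrix

t = m.T                           # transpose: 3 x 2
last_of_second_row = m[1, -1]     # indexing by row and column
```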

# arrays are compatible for broadcasting when, comparing shapes from the trailing dimension backwards, each pair of dimensions is either equal or one of them is 1
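
A minimal sketch of the rule with arbitrary small arrays, including one shape that does not broadcast:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])     # shape (3,): trailing dimension matches
col = np.array([[100], [200]])   # shape (2, 1): trailing dimension is 1

plus_row = a + row   # row is added to every row of a
plus_col = a + col   # col is added to every column of a

try:
    a + np.array([1, 2])         # shape (2,) against a trailing 3: not compatible
    compatible = True
except ValueError:
    compatible = False
```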

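# when an image is read as an array of numbers, the pixel values are often floats between 0 and 1 (for example, PNG files read with matplotlib); other formats may give integers from 0 to 255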

Key learnings …

Examples notebook includes:

  • Random walk simulation
  • Triangle simulation
  • Random Number
  • Correlation coefficient
  • Mean/Variance of crude oil

# masking returns all the values that satisfy the condition

# cumsum() is a handy function for cumulative sum

# there are handy methods for random number generation
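
The three notes above come together in a simple random walk. This is a minimal sketch assuming a one-dimensional walk with steps of +1 or -1; the notebook's own simulation may be set up differently, and the seed and step count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=42)      # seeded generator for repeatable runs

steps = rng.choice([-1, 1], size=1000)    # random steps of -1 or +1
walk = np.cumsum(steps)                   # position after each step

above_start = walk[walk > 0]              # mask: positions above the starting point
print(walk[-1], len(above_start))
```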

Key learnings …

For learning more about NumPy, look here: https://numpy.org/doc/stable/

Keep learning!

Python – Basics & Examples

This is to get started with Python and try a few concrete examples. It should help beginners learn, and others do a quick revision, without going too deep.

The entire Jupyter notebook can be downloaded or forked from my GitHub to look at or play around with: https://github.com/sandeep-mewara/python-examples

I started Python programming using the Jupyter Notebook web application. Later, I moved to Visual Studio Code, which felt much more user-friendly.

A guide on how to set up VS Code for Python is here.

Python basics include:

  • Variables
  • Conditional statements
  • String manipulations
  • Type conversion
  • Formatting strings
  • Data Structure – List, Tuple
  • Functions
  • List comprehension
  • Zip & Pack

# items are indexed by integers, starting from 0.

# % is the string format operator, and %d, %s, %f are format sequences for integers, strings and floats
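
A one-line sketch with made-up values:

```python
name, marks, average = "Asha", 82, 78.456

# %s takes a string, %d an integer, %.1f a float rounded to one decimal place.
line = "%s scored %d, class average %.1f" % (name, marks, average)
print(line)  # Asha scored 82, class average 78.5
```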

# negative index is used to access list elements from the end

# [start:end:step] returns a new list from index start up to, but not including, end, with a default step of 1
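
A quick sketch of negative indexes and slices on an arbitrary list. The palindrome check at the end is one common way to do it and not necessarily the notebook's:

```python
nums = [10, 20, 30, 40, 50]

last = nums[-1]           # 50: negative indexes count from the end
middle = nums[1:4]        # [20, 30, 40]: index 4 is not included
every_other = nums[::2]   # [10, 30, 50]: step of 2
backwards = nums[::-1]    # [50, 40, 30, 20, 10]: step of -1 reverses

# Strings slice the same way, which gives a compact palindrome check.
word = "level"
is_palindrome = word == word[::-1]
```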

# zip pairs up two lists element by element; wrapping it in list() gives a list of tuples
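
A minimal sketch with made-up names and marks, including the reverse step of unpacking the pairs again:

```python
names = ["Asha", "Ravi", "Meera"]
marks = [82, 67, 91]

# zip() is lazy in Python 3, so list() is used to materialize the pairs.
pairs = list(zip(names, marks))

# The * operator unpacks the pairs, so zip can split them back into two tuples.
names_again, marks_again = zip(*pairs)
```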

Key learnings …

Examples notebook includes:

  • Palindrome
  • Sum of Squares
  • Sort students marks list
  • Format students marks list
  • Word Frequency

# sometimes anonymous functions (lambda) are enough
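
For example, a lambda works well as a throwaway sort key. A small sketch with hypothetical student data:

```python
students = [("Asha", 82), ("Ravi", 67), ("Meera", 91)]

# Sort by the second item of each tuple (the marks), highest first.
by_marks = sorted(students, key=lambda student: student[1], reverse=True)

# Sum of squares of 1, 2 and 3 using a lambda with map().
sum_of_squares = sum(map(lambda x: x * x, range(1, 4)))
```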

# storing data in a dictionary as key-value pairs helps
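
Word frequency is a natural fit for a dictionary. A minimal sketch on an arbitrary sentence, with each word as the key and its count as the value:

```python
text = "the quick brown fox jumps over the lazy dog the end"

freq = {}
for word in text.split():
    # .get() returns 0 the first time a word is seen.
    freq[word] = freq.get(word, 0) + 1

print(freq["the"])  # 3
```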

Key learnings …

Keep learning!