{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Python Data Science packages to manipulate and visualize data\n", "\n", "In this Jupyter notebook we will:\n", "- Go over several popular Python packages used for Data Science\n", "- Work through an example of analyzing avocado prices using these popular Python Data Science packages\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: An overview of popular Python Data Science packages\n", "\n", "Let's very briefly discuss several popular Python Data Science packages. The packages we will discuss are:\n", "- NumPy\n", "- pandas\n", "- Matplotlib\n", "- seaborn\n", "\n", "We can discuss additional Python packages, particularly for modeling and prediction, later in the workshop.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1.1: NumPy\n", "\n", "[NumPy](https://numpy.org/) is a library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. In many ways, its functionality is similar to MATLAB's basic functionality.\n", "\n", "The core data structure of NumPy is the `ndarray`. ndarrays are similar to Python lists, but all elements in an ndarray must be of the same type; e.g., all numbers or all strings.\n", "\n", "Let's create a few ndarrays below!\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 2 3 4 5 6 7 8 9]\n" ] } ], "source": [ "import numpy as np\n", "\n", "# Create a one-dimensional ndarray from a Python list\n", "x = np.array([1, 2, 3])\n", "\n", "# np.arange(10) returns the integers 0 through 9 as an ndarray\n", "print(np.arange(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1.2: pandas\n", "\n", "[pandas](https://pandas.pydata.org/) is a package for data manipulation and analysis that has two main data structures:\n", "\n", "1. `Series`: One-dimensional ndarray with an index for each value.
A Series is similar to a named vector in R.\n", "\n", "2. `DataFrame`: Two-dimensional, size-mutable, potentially heterogeneous tabular data. A DataFrame is similar to an R data frame. DataFrames can also be thought of as multiple Series of the same length with the same index, or as multiple ndarrays with the same index.\n", "\n", "Here are some documents that show translations between the Data 8 `datascience` package and pandas:\n", "- [Google spreadsheet I created](https://docs.google.com/spreadsheets/d/1GeghI6Md4QjJcugEEa4a_N_jQNGZRdxqFrynvJgq1CM/edit#gid=0)\n", "- [babypandas documentation](https://pypi.org/project/babypandas/)\n", "\n", "\n", "Let's load our avocado data as a DataFrame and look at the first three rows using the `df.head(3)` method.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 12/27/2015 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |\n",
"| 1 | 12/20/2015 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |\n",
"| 2 | 12/13/2015 | 0.93 | 118220.22 | 794.70 | 109149.67 | 130.50 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |\n",
"