{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Tom Augspurger Dplyr/Pandas comparison (copy of 2016-01-01)\n", "\n", "### See the original result there\n", "http://nbviewer.ipython.org/urls/gist.githubusercontent.com/TomAugspurger/6e052140eaa5fdb6e8c0/raw/627b77addb4bcfc39ab6be6d85cb461e956fb3a3/dplyr_pandas.ipynb\n", "\n", "### To reproduce on your WinPython, you'll need flights.csv in this directory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook compares [pandas](http://pandas.pydata.org)\n", "and [dplyr](http://cran.r-project.org/web/packages/dplyr/index.html).\n", "The comparison is just on syntax (verbiage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition.\n", "\n", "We'll work through the [introductory dplyr vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html) to analyze some flight data.\n", "\n", "I'm working on a better layout to show the two packages side by side.\n", "But for now I'm just putting the ``dplyr`` code in a comment above each Python call.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using R to get flights.csv\n", "\n", "Un-comment the next cell only if you have installed R and want to get the flights example from the source.\n", "\n", "To install R on your WinPython:\n", "[how to install R](installing_R.ipynb)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#%load_ext rpy2.ipython\n", "#%R install.packages(\"nycflights13\", repos='http://cran.us.r-project.org')\n", "#%R library(nycflights13)\n", "#%R write.csv(flights, \"flights.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using an internet download to get flights.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Downloading and unzipping a 
file, without using R:\n", "# source: http://stackoverflow.com/a/34863053/3140336\n", "import io\n", "from zipfile import ZipFile\n", "import requests\n", "\n", "def get_zip(file_url):\n", "    # Download the archive and return a file object for its single member\n", "    response = requests.get(file_url)\n", "    response.raise_for_status()\n", "    zipfile = ZipFile(io.BytesIO(response.content))\n", "    zip_names = zipfile.namelist()\n", "    if len(zip_names) != 1:\n", "        raise ValueError(\"expected exactly one file in the archive\")\n", "    return zipfile.open(zip_names[0])\n", "\n", "url = r'https://github.com/winpython/winpython_afterdoc/raw/master/examples/nycflights13_datas/flights.zip'\n", "with io.open(\"flights.csv\", 'wb') as f:\n", "    f.write(get_zip(url).read())\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Some prep work to get the data from R and into pandas\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "#%load_ext rpy2.ipython\n", "\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "pd.set_option(\"display.max_rows\", 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data: nycflights13" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "flights = pd.read_csv(\"flights.csv\", index_col=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# dim(flights) <--- The R code\n", "flights.shape # <--- The Python code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# head(flights)\n", "flights.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Single table verbs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``dplyr`` has a small set of nicely defined verbs. I've listed their closest pandas verbs.\n", "\n", "\n", "
| dplyr | pandas |\n",
"| --- | --- |\n",
"| ``filter()`` (and ``slice()``) | ``query()`` (and ``loc[]``, ``iloc[]``) |\n",
"| ``arrange()`` | ``sort_values()`` and ``sort_index()`` |\n",
"| ``select()`` (and ``rename()``) | ``__getitem__`` (and ``rename()``) |\n",
"| ``distinct()`` | ``drop_duplicates()`` |\n",
"| ``mutate()`` (and ``transmute()``) | ``assign`` |\n",
"| ``summarise()`` | None |\n",
"| ``sample_n()`` and ``sample_frac()`` | ``sample`` |\n",
"| ``%>%`` | ``pipe`` |\n",
"\n",
"