{ "metadata": { "name": "9932_03_03" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Chapter 3, example 3\n", "====================\n", "\n", "In this example, we will download and analyze some data about a large number of cities around the world and their populations. This data has been created by MaxMind and is available for free at http://www.maxmind.com.\n", "\n", "We first download the Zip file and extract it into a folder. The Zip file is about 40 MB, so downloading it may take a while." ] }, { "cell_type": "code", "collapsed": true, "input": [ "import urllib2, zipfile" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "url = 'http://ipython.rossant.net/'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "filename = 'cities.zip'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "downloaded = urllib2.urlopen(url + filename)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "folder = 'data'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "mkdir data" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "with open(filename, 'wb') as f:\n", "    f.write(downloaded.read())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "with zipfile.ZipFile(filename) as zf:\n", "    zf.extractall(folder)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"Now, we're going to load the CSV file that has been extracted with Pandas. The `read_csv` function of Pandas can open any CSV file." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "filename = 'data/worldcitiespop.txt'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.read_csv(filename)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's explore the newly created data object." ] }, { "cell_type": "code", "collapsed": false, "input": [ "type(data)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 12, "text": [ "pandas.core.frame.DataFrame" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data object is a DataFrame, a Pandas type consisting of a two-dimensional labeled data structure with columns of potentially different types (like a Excel spreadsheet). Like a NumPy array, the shape attribute returns the shape of the table. But unlike NumPy, the DataFrame object has a richer structure, and in particular the keys methods returns the names of the different columns." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.shape, data.keys()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 13, "text": [ "((3173958, 7),\n", " Index([Country, City, AccentCity, Region, Population, Latitude, Longitude], dtype=object))" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that data has more than 3 million lines, and seven columns including the country, city, population and GPS coordinates of each city. 
The `head` and `tail` methods give a quick look at the beginning and the end of the table, respectively." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.tail()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n",
"| | Country | City | AccentCity | Region | Population | Latitude | Longitude |\n",
"|---|---|---|---|---|---|---|---|\n",
"| 3173953 | zw | zimre park | Zimre Park | 4 | NaN | -17.866111 | 31.213611 |\n",
"| 3173954 | zw | ziyakamanas | Ziyakamanas | 0 | NaN | -18.216667 | 27.950000 |\n",
"| 3173955 | zw | zizalisari | Zizalisari | 4 | NaN | -17.758889 | 31.010556 |\n",
"| 3173956 | zw | zuzumba | Zuzumba | 6 | NaN | -20.033333 | 27.933333 |\n",
"| 3173957 | zw | zvishavane | Zvishavane | 7 | 79876 | -20.333333 | 30.033333 |\n", "