{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Operações com *DataFrames*\n", "\n", "Como dissemos anterioremente, o *DataFrame* é a segunda estrutura basilar do *pandas*. Um *DataFrame*:\n", "- é uma tabela, ou seja, é bidimensional;\n", "- tem cada coluna formada como uma *Series* do *pandas*;\n", "- pode ter *Series* contendo tipos de dado diferentes." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Criando um *DataFrame*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "O método padrão para criarmos um *DataFrame* é através de uma função com mesmo nome.\n", "\n", "```python\n", "df_exemplo = pd.DataFrame(dados_de_interesse, index = indice_de_interesse, \n", " columns = colunas_de_interesse)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Ao criar um *DataFrame*, podemos informar\n", "- `index`: rótulos para as linhas (atributos *index* das *Series*).\n", "- `columns`: rótulos para as colunas (atributos *name* das *Series*)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "No _template_, `dados_de_interesse` pode ser\n", "\n", "* um dicionário de:\n", " * *arrays* unidimensionais do *numpy*;\n", " * listas;\n", " * dicionários;\n", " * *Series* do *pandas*.\n", "* um *array* bidimensional do *numpy*;\n", "* uma *Series* do *Pandas*;\n", "* outro *DataFrame*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Criando um *DataFrame* a partir de dicionários de *Series*\n", "\n", "Neste método de criação, as *Series* do dicionário não precisam possuir o mesmo número de elementos. O *index* do *DataFrame* será dado pela **união** dos *index* de todas as *Series* contidas no dicionário." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Exemplo:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "serie_Idade = pd.Series({'Ana':20, 'João': 19, 'Maria': 21, 'Pedro': 22}, name=\"Idade\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "serie_Peso = pd.Series({'Ana':55, 'João': 80, 'Maria': 62, 'Pedro': 67, 'Túlio': 73}, name=\"Peso\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "serie_Altura = pd.Series({'Ana':162, 'João': 178, 'Maria': 162, 'Pedro': 165, 'Túlio': 171}, name=\"Altura\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "dicionario_series_exemplo = {'Idade': serie_Idade, 'Peso': serie_Peso, 'Altura': serie_Altura}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df_dict_series = pd.DataFrame(dicionario_series_exemplo)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana20.055162
João19.080178
Maria21.062162
Pedro22.067165
TúlioNaN73171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20.0 55 162\n", "João 19.0 80 178\n", "Maria 21.0 62 162\n", "Pedro 22.0 67 165\n", "Túlio NaN 73 171" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare este resultado com a criação de uma planilha pelos métodos usuais. Veja que há muita flexibilidade para criarmos ou modificarmos uma tabela.\n", "\n", "Vejamos exemplos sobre como acessar intervalos de dados na tabela." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana2055162
Maria2162162
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20 55 162\n", "Maria 21 62 162" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(dicionario_series_exemplo, index=['Ana','Maria'])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PesoAltura
Ana55162
Maria62162
\n", "
" ], "text/plain": [ " Peso Altura\n", "Ana 55 162\n", "Maria 62 162" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(dicionario_series_exemplo, index=['Ana','Maria'], columns=['Peso','Altura'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Neste exemplo, adicionamos a coluna `IMC`, ainda sem valores calculados." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PesoAlturaIMC
Ana55.0162.0NaN
Maria62.0162.0NaN
PaulaNaNNaNNaN
\n", "
" ], "text/plain": [ " Peso Altura IMC\n", "Ana 55.0 162.0 NaN\n", "Maria 62.0 162.0 NaN\n", "Paula NaN NaN NaN" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(dicionario_series_exemplo, index=['Ana','Maria','Paula'], \n", " columns=['Peso','Altura','IMC'])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df_exemplo_IMC = pd.DataFrame(dicionario_series_exemplo, \n", " columns=['Peso','Altura','IMC'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Agora, mostramos como os valores do IMC podem ser calculados diretamente por computação vetorizada sobre as *Series*." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "df_exemplo_IMC['IMC']=round(df_exemplo_IMC['Peso']/(df_exemplo_IMC['Altura']/100)**2,2)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PesoAlturaIMC
Ana5516220.96
João8017825.25
Maria6216223.62
Pedro6716524.61
Túlio7317124.96
\n", "
" ], "text/plain": [ " Peso Altura IMC\n", "Ana 55 162 20.96\n", "João 80 178 25.25\n", "Maria 62 162 23.62\n", "Pedro 67 165 24.61\n", "Túlio 73 171 24.96" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo_IMC" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Criando um *DataFrame* a partir de dicionários de listas ou *arrays* do *numpy*:\n", "\n", "Neste método de criação, os *arrays* ou as listas **devem** possuir o mesmo comprimento. Se o *index* não for informado, o *index* será dado de forma similar ao do objeto tipo *Series*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Exemplo com dicionário de listas:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "dicionario_lista_exemplo = {'Idade': [20,19,21,22,20],\n", " 'Peso': [55,80,62,67,73],\n", " 'Altura': [162,178,162,165,171]}" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
02055162
11980178
22162162
32267165
42073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "0 20 55 162\n", "1 19 80 178\n", "2 21 62 162\n", "3 22 67 165\n", "4 20 73 171" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(dicionario_lista_exemplo)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Mais exemplos:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana2055162
João1980178
Maria2162162
Pedro2267165
Túlio2073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20 55 162\n", "João 19 80 178\n", "Maria 21 62 162\n", "Pedro 22 67 165\n", "Túlio 20 73 171" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(dicionario_lista_exemplo, index=['Ana','João','Maria','Pedro','Túlio'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Exemplos com dicionário de *arrays* do *numpy*:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "dicionario_array_exemplo = {'Idade': np.array([20,19,21,22,20]),\n", " 'Peso': np.array([55,80,62,67,73]),\n", " 'Altura': np.array([162,178,162,165,171])}" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
02055162
11980178
22162162
32267165
42073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "0 20 55 162\n", "1 19 80 178\n", "2 21 62 162\n", "3 22 67 165\n", "4 20 73 171" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(dicionario_array_exemplo)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Mais exemplos:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana2055162
João1980178
Maria2162162
Pedro2267165
Túlio2073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20 55 162\n", "João 19 80 178\n", "Maria 21 62 162\n", "Pedro 22 67 165\n", "Túlio 20 73 171" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(dicionario_array_exemplo, index=['Ana','João','Maria','Pedro','Túlio'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Criando um *DataFrame* a partir de uma *Series* do *pandas*\n", "\n", "Neste caso, o *DataFrame* terá o mesmo *index* que a *Series* do *pandas* e apenas uma coluna." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "series_exemplo = pd.Series({'Ana':20, 'João': 19, 'Maria': 21, 'Pedro': 22, 'Túlio': 20})" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
Ana20
João19
Maria21
Pedro22
Túlio20
\n", "
" ], "text/plain": [ " 0\n", "Ana 20\n", "João 19\n", "Maria 21\n", "Pedro 22\n", "Túlio 20" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(series_exemplo)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Caso a *Series* possua um atributo `name` especificado, este será o nome da coluna do *DataFrame*." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "series_exemplo_Idade = pd.Series({'Ana':20, 'João': 19, 'Maria': 21, 'Pedro': 22, 'Túlio': 20}, name=\"Idade\")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Idade
Ana20
João19
Maria21
Pedro22
Túlio20
\n", "
" ], "text/plain": [ " Idade\n", "Ana 20\n", "João 19\n", "Maria 21\n", "Pedro 22\n", "Túlio 20" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(series_exemplo_Idade)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Criando um *DataFrame* a partir de lista de *Series* do *pandas*\n", "\n", "Neste caso, a entrada dos dados da lista no *DataFrame* será feita por linha." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AnaJoãoMariaPedroTúlio
Peso55.080.062.067.073.0
Altura162.0178.0162.0165.0171.0
Idade20.019.021.022.0NaN
\n", "
" ], "text/plain": [ " Ana João Maria Pedro Túlio\n", "Peso 55.0 80.0 62.0 67.0 73.0\n", "Altura 162.0 178.0 162.0 165.0 171.0\n", "Idade 20.0 19.0 21.0 22.0 NaN" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame([serie_Peso, serie_Altura, serie_Idade])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Podemos corrigir a orientação usando o método `transpose`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PesoAlturaIdade
Ana55.0162.020.0
João80.0178.019.0
Maria62.0162.021.0
Pedro67.0165.022.0
Túlio73.0171.0NaN
\n", "
" ], "text/plain": [ " Peso Altura Idade\n", "Ana 55.0 162.0 20.0\n", "João 80.0 178.0 19.0\n", "Maria 62.0 162.0 21.0\n", "Pedro 67.0 165.0 22.0\n", "Túlio 73.0 171.0 NaN" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame([serie_Peso, serie_Altura, serie_Idade]).transpose()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Criando um *DataFrame* a partir de arquivos\n", "\n", "Para criar um *DataFrame* a partir de um arquivo, precisamos de funções do tipo `pd.read_FORMATO`, onde `FORMATO` indica o formato a ser importado sob o pressuposto de que a biblioteca *pandas* foi devidamente importada com `pd`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Os formatos mais comuns são: \n", "\n", "* *csv* (comma-separated values), \n", "* *xls* ou *xlsx* (formatos do Microsoft Excel),\n", "* *hdf5* (comumente utilizado em *big data*), \n", "* *json* (comumente utilizado em páginas da internet)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "As funções para leitura correspondentes são: \n", "* `pd.read_csv`, \n", "* `pd.read_excel`, \n", "* `pd.read_hdf`, \n", "* `pd.read_json`, \n", "\n", "respectivamente." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "De todas elas, a função mais utilizada é `read_csv`. Ela possui vários argumentos. Vejamos os mais utilizados:\n", "\n", "* `file_path_or_buffer`: o endereço do arquivo a ser lido. Pode ser um endereço da internet.\n", "* `sep`: o separador entre as entradas de dados. O separador padrão é `,`.\n", "* `index_col`: a coluna que deve ser usada para formar o *index*. O padrão é `None`. Porém pode ser alterado para outro. Um separador comumente encontrado é o `\\t` (TAB).\n", "* `names`: nomes das colunas a serem usadas. O padrão é `None`.\n", "* `header`: número da linha que servirá como nome para as colunas. O padrão é `infer` (ou seja, tenta deduzir automaticamente). Se os nomes das colunas forem passados através do `names`, então `header` será automaticamente considerado como `None`. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Exemplo:** considere o arquivo `data/exemplo_data.csv` contendo:\n", "\n", "```\n", ",coluna_1,coluna_2\n", "2020-01-01,-0.4160923582996922,1.8103644347460834\n", "2020-01-02,-0.1379696602473578,2.5785204825192785\n", "2020-01-03,0.5758273450544708,0.06086648807755068\n", "2020-01-04,-0.017367186564883633,1.2995865328684455\n", "2020-01-05,1.3842792448510655,-0.3817320973859929\n", "2020-01-06,0.5497056238566345,-1.308789022968975\n", "2020-01-07,-0.2822962331437976,-1.6889791765925102\n", "2020-01-08,-0.9897300598660013,-0.028120707936426497\n", "2020-01-09,0.27558240737928663,-0.1776585993494299\n", "2020-01-10,0.6851316082235455,0.5025348904591399\n", "``` \n", "\n", "Para ler o arquivo acima basta fazer:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df_exemplo_0 = pd.read_csv('data/exemplo_data.csv')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0coluna_1coluna_2
02020-01-01-0.4160921.810364
12020-01-02-0.1379702.578520
22020-01-030.5758270.060866
32020-01-04-0.0173671.299587
42020-01-051.384279-0.381732
52020-01-060.549706-1.308789
62020-01-07-0.282296-1.688979
72020-01-08-0.989730-0.028121
82020-01-090.275582-0.177659
92020-01-100.6851320.502535
\n", "
" ], "text/plain": [ " Unnamed: 0 coluna_1 coluna_2\n", "0 2020-01-01 -0.416092 1.810364\n", "1 2020-01-02 -0.137970 2.578520\n", "2 2020-01-03 0.575827 0.060866\n", "3 2020-01-04 -0.017367 1.299587\n", "4 2020-01-05 1.384279 -0.381732\n", "5 2020-01-06 0.549706 -1.308789\n", "6 2020-01-07 -0.282296 -1.688979\n", "7 2020-01-08 -0.989730 -0.028121\n", "8 2020-01-09 0.275582 -0.177659\n", "9 2020-01-10 0.685132 0.502535" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo_0" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "No exemplo anterior, as colunas receberam nomes corretamentes exceto pela primeira coluna que gostaríamos de considerar como *index*. Neste caso fazemos:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "df_exemplo = pd.read_csv('data/exemplo_data.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2
2020-01-01-0.4160921.810364
2020-01-02-0.1379702.578520
2020-01-030.5758270.060866
2020-01-04-0.0173671.299587
2020-01-051.384279-0.381732
2020-01-060.549706-1.308789
2020-01-07-0.282296-1.688979
2020-01-08-0.989730-0.028121
2020-01-090.275582-0.177659
2020-01-100.6851320.502535
\n", "
" ], "text/plain": [ " coluna_1 coluna_2\n", "2020-01-01 -0.416092 1.810364\n", "2020-01-02 -0.137970 2.578520\n", "2020-01-03 0.575827 0.060866\n", "2020-01-04 -0.017367 1.299587\n", "2020-01-05 1.384279 -0.381732\n", "2020-01-06 0.549706 -1.308789\n", "2020-01-07 -0.282296 -1.688979\n", "2020-01-08 -0.989730 -0.028121\n", "2020-01-09 0.275582 -0.177659\n", "2020-01-10 0.685132 0.502535" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### O método *head* do *DataFrame*\n", "\n", "O método `head`, sem argumento, permite que visualizemos as 5 primeiras linhas do *DataFrame*." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2
2020-01-01-0.4160921.810364
2020-01-02-0.1379702.578520
2020-01-030.5758270.060866
2020-01-04-0.0173671.299587
2020-01-051.384279-0.381732
\n", "
" ], "text/plain": [ " coluna_1 coluna_2\n", "2020-01-01 -0.416092 1.810364\n", "2020-01-02 -0.137970 2.578520\n", "2020-01-03 0.575827 0.060866\n", "2020-01-04 -0.017367 1.299587\n", "2020-01-05 1.384279 -0.381732" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Se for passado um argumento com valor `n`, as `n` primeiras linhas são impressas." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2
2020-01-01-0.4160921.810364
2020-01-02-0.1379702.578520
\n", "
" ], "text/plain": [ " coluna_1 coluna_2\n", "2020-01-01 -0.416092 1.810364\n", "2020-01-02 -0.137970 2.578520" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.head(2)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2
2020-01-01-0.4160921.810364
2020-01-02-0.1379702.578520
2020-01-030.5758270.060866
2020-01-04-0.0173671.299587
2020-01-051.384279-0.381732
2020-01-060.549706-1.308789
2020-01-07-0.282296-1.688979
\n", "
" ], "text/plain": [ " coluna_1 coluna_2\n", "2020-01-01 -0.416092 1.810364\n", "2020-01-02 -0.137970 2.578520\n", "2020-01-03 0.575827 0.060866\n", "2020-01-04 -0.017367 1.299587\n", "2020-01-05 1.384279 -0.381732\n", "2020-01-06 0.549706 -1.308789\n", "2020-01-07 -0.282296 -1.688979" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.head(7)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### O método `tail` do *DataFrame*\n", "\n", "O método `tail`, sem argumento, retorna as últimas 5 linhas do *DataFrame*." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2
2020-01-060.549706-1.308789
2020-01-07-0.282296-1.688979
2020-01-08-0.989730-0.028121
2020-01-090.275582-0.177659
2020-01-100.6851320.502535
\n", "
" ], "text/plain": [ " coluna_1 coluna_2\n", "2020-01-06 0.549706 -1.308789\n", "2020-01-07 -0.282296 -1.688979\n", "2020-01-08 -0.989730 -0.028121\n", "2020-01-09 0.275582 -0.177659\n", "2020-01-10 0.685132 0.502535" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.tail()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Se for passado um argumento com valor `n`, as `n` últimas linhas são impressas." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2
2020-01-090.275582-0.177659
2020-01-100.6851320.502535
\n", "
" ], "text/plain": [ " coluna_1 coluna_2\n", "2020-01-09 0.275582 -0.177659\n", "2020-01-10 0.685132 0.502535" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.tail(2)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2
2020-01-04-0.0173671.299587
2020-01-051.384279-0.381732
2020-01-060.549706-1.308789
2020-01-07-0.282296-1.688979
2020-01-08-0.989730-0.028121
2020-01-090.275582-0.177659
2020-01-100.6851320.502535
\n", "
" ], "text/plain": [ " coluna_1 coluna_2\n", "2020-01-04 -0.017367 1.299587\n", "2020-01-05 1.384279 -0.381732\n", "2020-01-06 0.549706 -1.308789\n", "2020-01-07 -0.282296 -1.688979\n", "2020-01-08 -0.989730 -0.028121\n", "2020-01-09 0.275582 -0.177659\n", "2020-01-10 0.685132 0.502535" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.tail(7)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Atributos de *Series* e *DataFrames*\n", "\n", "Atributos comumente usados para *Series* e *DataFrames* são:\n", "\n", "* `shape`: fornece as dimensões do objeto em questão (*Series* ou *DataFrame*) em formato consistente com o atributo `shape` de um *array* do *numpy*.\n", "* `index`: fornece o índice do objeto. No caso do *DataFrame* são os rótulos das linhas.\n", "* `columns`: fornece as colunas (apenas disponível para *DataFrames*) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Exemplo:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10, 2)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.shape" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "serie_1 = pd.Series([1,2,3,4,5])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5,)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_1.shape" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05',\n", " '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10'],\n", " dtype='object')" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.index" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=5, step=1)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_1.index" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['coluna_1', 'coluna_2'], dtype='object')" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.columns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Se quisermos obter os dados contidos nos *index* ou nas *Series* podemos utilizar a propriedade `.array`." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "[0, 1, 2, 3, 4]\n", "Length: 5, dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_1.index.array" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "['coluna_1', 'coluna_2']\n", "Length: 2, dtype: object" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.columns.array" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Se o interesse for obter os dados como um `array` do *numpy*, devemos utilizar o método `.to_numpy()`.\n", "\n", "Exemplo:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4])" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_1.index.to_numpy()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['coluna_1', 'coluna_2'], dtype=object)" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.columns.to_numpy()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "O método `.to_numpy()` também está disponível em *DataFrames*:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-0.41609236, 1.81036443],\n", " [-0.13796966, 2.57852048],\n", " [ 0.57582735, 0.06086649],\n", " [-0.01736719, 1.29958653],\n", " [ 1.38427924, -0.3817321 ],\n", " [ 0.54970562, -1.30878902],\n", " [-0.28229623, -1.68897918],\n", " [-0.98973006, -0.02812071],\n", " [ 0.27558241, -0.1776586 ],\n", " [ 0.68513161, 0.50253489]])" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.to_numpy()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A função do *numpy* `asarray()` é compatível com *index*, *columns* e *DataFrames* do *pandas*:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',\n", " '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',\n", " '2020-01-09', '2020-01-10'], dtype=object)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.asarray(df_exemplo.index)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['coluna_1', 'coluna_2'], dtype=object)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.asarray(df_exemplo.columns)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-0.41609236, 1.81036443],\n", " [-0.13796966, 2.57852048],\n", " [ 0.57582735, 0.06086649],\n", " [-0.01736719, 1.29958653],\n", " [ 1.38427924, -0.3817321 ],\n", " [ 0.54970562, -1.30878902],\n", " [-0.28229623, -1.68897918],\n", " [-0.98973006, -0.02812071],\n", " [ 0.27558241, -0.1776586 ],\n", " [ 0.68513161, 0.50253489]])" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.asarray(df_exemplo)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Informações sobre as colunas de um *DataFrame*\n", "\n", "Para obtermos uma breve descrição sobre as colunas de um *DataFrame* utilizamos o método `info`.\n", "\n", "Exemplo:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 10 entries, 2020-01-01 to 2020-01-10\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 coluna_1 10 non-null float64\n", " 1 coluna_2 10 non-null float64\n", "dtypes: float64(2)\n", "memory usage: 240.0+ bytes\n" ] } ], "source": [ "df_exemplo.info()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Criando arquivos a partir de *DataFrames*\n", "\n", "Para criar arquivos a partir de *DataFrames*, basta utilizar os métodos do tipo `pd.to_FORMATO`, onde `FORMATO` indica o formato a ser exportado e supondo que a biblioteca *pandas* foi importada com `pd`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Com relação aos tipos de arquivo anteriores, os métodos para exportação correspondentes são:\n", "* `.to_csv` ('endereço_do_arquivo'), \n", "* `.to_excel` ('endereço_do_arquivo'), \n", "* `.to_hdf` ('endereço_do_arquivo'), \n", "* `.to_json`('endereço_do_arquivo'), \n", "\n", "onde `endereço_do_arquivo` é uma `str` que contém o endereço do arquivo a ser exportado." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Exemplo:\n", " \n", "Para exportar para o arquivo `exemplo_novo.csv`, utilizaremos o método `.to_csv` ao *DataFrame* `df_exemplo`:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "df_exemplo.to_csv('data/exemplo_novo.csv')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exemplo COVID-19 PB" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Dados diários de COVID-19 do estado da Paraíba:\n", "\n", "*Fonte: https://superset.plataformatarget.com.br/superset/dashboard/microdados/*" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "dados_covid_PB = pd.read_csv('https://superset.plataformatarget.com.br/superset/explore_json/?form_data=%7B%22slice_id%22%3A1550%7D&csv=true', \n", " sep=',', index_col=0)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 331 entries, 2021-02-09 to 2020-03-16\n", "Data columns (total 7 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 casosAcumulados 331 non-null int64 \n", " 1 casosNovos 331 non-null int64 \n", " 2 descartados 331 non-null int64 \n", " 3 recuperados 331 non-null int64 \n", " 4 obitosAcumulados 331 non-null int64 \n", " 5 obitosNovos 331 non-null int64 \n", " 6 Letalidade 331 non-null float64\n", "dtypes: float64(1), int64(6)\n", "memory usage: 20.7+ KB\n" ] } ], "source": [ "dados_covid_PB.info()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovosLetalidade
data
2021-02-091997069712373691527794171130.0209
2021-02-081987356112371891518794158120.0209
2021-02-071981246642370721515354146110.0209
2021-02-061974609182367741501754135120.0209
2021-02-0519654210602362161501694123130.0210
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "data \n", "2021-02-09 199706 971 237369 152779 \n", "2021-02-08 198735 611 237189 151879 \n", "2021-02-07 198124 664 237072 151535 \n", "2021-02-06 197460 918 236774 150175 \n", "2021-02-05 196542 1060 236216 150169 \n", "\n", " obitosAcumulados obitosNovos Letalidade \n", "data \n", "2021-02-09 4171 13 0.0209 \n", "2021-02-08 4158 12 0.0209 \n", "2021-02-07 4146 11 0.0209 \n", "2021-02-06 4135 12 0.0209 \n", "2021-02-05 4123 13 0.0210 " ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.head()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovosLetalidade
data
2020-03-200000000.0
2020-03-190000000.0
2020-03-180000000.0
2020-03-170000000.0
2020-03-160000000.0
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "data \n", "2020-03-20 0 0 0 0 \n", "2020-03-19 0 0 0 0 \n", "2020-03-18 0 0 0 0 \n", "2020-03-17 0 0 0 0 \n", "2020-03-16 0 0 0 0 \n", "\n", " obitosAcumulados obitosNovos Letalidade \n", "data \n", "2020-03-20 0 0 0.0 \n", "2020-03-19 0 0 0.0 \n", "2020-03-18 0 0 0.0 \n", "2020-03-17 0 0 0.0 \n", "2020-03-16 0 0 0.0 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.tail()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "dados_covid_PB['estado'] = 'PB'" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovosLetalidadeestado
data
2021-02-091997069712373691527794171130.0209PB
2021-02-081987356112371891518794158120.0209PB
2021-02-071981246642370721515354146110.0209PB
2021-02-061974609182367741501754135120.0209PB
2021-02-0519654210602362161501694123130.0210PB
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "data \n", "2021-02-09 199706 971 237369 152779 \n", "2021-02-08 198735 611 237189 151879 \n", "2021-02-07 198124 664 237072 151535 \n", "2021-02-06 197460 918 236774 150175 \n", "2021-02-05 196542 1060 236216 150169 \n", "\n", " obitosAcumulados obitosNovos Letalidade estado \n", "data \n", "2021-02-09 4171 13 0.0209 PB \n", "2021-02-08 4158 12 0.0209 PB \n", "2021-02-07 4146 11 0.0209 PB \n", "2021-02-06 4135 12 0.0209 PB \n", "2021-02-05 4123 13 0.0210 PB " ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.head()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "dados_covid_PB.to_csv('data/dadoscovidpb.csv')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Índices dos valores máximos ou mínimos\n", "\n", "Os métodos `idxmin()` e `idxmax()` retornam o *index* cuja entrada fornece o valor mínimo ou máximo da *Series* ou *DataFrame*. Se houver múltiplas ocorrências de mínimos ou máximos, o método retorna a primeira ocorrência.\n", "\n", "Vamos recriar um *DataFrame* genérico." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "serie_Idade = pd.Series({'Ana':20, 'João': 19, 'Maria': 21, 'Pedro': 22, 'Túlio': 20}, name=\"Idade\")\n", "serie_Peso = pd.Series({'Ana':55, 'João': 80, 'Maria': 62, 'Pedro': 67, 'Túlio': 73}, name=\"Peso\")\n", "serie_Altura = pd.Series({'Ana':162, 'João': 178, 'Maria': 162, 'Pedro': 165, 'Túlio': 171}, name=\"Altura\")" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "dicionario_series_exemplo = {'Idade': serie_Idade, 'Peso': serie_Peso, 'Altura': serie_Altura}" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "df_dict_series = pd.DataFrame(dicionario_series_exemplo)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana2055162
João1980178
Maria2162162
Pedro2267165
Túlio2073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20 55 162\n", "João 19 80 178\n", "Maria 21 62 162\n", "Pedro 22 67 165\n", "Túlio 20 73 171" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assim, podemos localizar quem possui menores idade, peso e altura." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "Idade João\n", "Peso Ana\n", "Altura Ana\n", "dtype: object" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.idxmin()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "De igual forma, localizamos quem possui maiores idade, peso e altura." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Idade Pedro\n", "Peso João\n", "Altura João\n", "dtype: object" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.idxmax()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Exemplo:** Aplicaremos as funções `idxmin()` e `idxmax()` aos dados do arquivo `data/exemplo_data.csv` para localizar entradas de interesse." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2
2020-01-01-0.4160921.810364
2020-01-02-0.1379702.578520
2020-01-030.5758270.060866
2020-01-04-0.0173671.299587
2020-01-051.384279-0.381732
2020-01-060.549706-1.308789
2020-01-07-0.282296-1.688979
2020-01-08-0.989730-0.028121
2020-01-090.275582-0.177659
2020-01-100.6851320.502535
\n", "
" ], "text/plain": [ " coluna_1 coluna_2\n", "2020-01-01 -0.416092 1.810364\n", "2020-01-02 -0.137970 2.578520\n", "2020-01-03 0.575827 0.060866\n", "2020-01-04 -0.017367 1.299587\n", "2020-01-05 1.384279 -0.381732\n", "2020-01-06 0.549706 -1.308789\n", "2020-01-07 -0.282296 -1.688979\n", "2020-01-08 -0.989730 -0.028121\n", "2020-01-09 0.275582 -0.177659\n", "2020-01-10 0.685132 0.502535" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo = pd.read_csv('data/exemplo_data.csv', index_col=0); df_exemplo" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df_exemplo = pd.DataFrame(df_exemplo, columns=['coluna_1','coluna_2','coluna_3'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inserimos uma terceira coluna com dados fictícios." ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2coluna_3
2020-01-01-0.4160921.8103641.0
2020-01-02-0.1379702.5785202.0
2020-01-030.5758270.0608663.0
2020-01-04-0.0173671.2995874.0
2020-01-051.384279-0.3817325.0
2020-01-060.549706-1.3087896.0
2020-01-07-0.282296-1.6889797.0
2020-01-08-0.989730-0.0281218.0
2020-01-090.275582-0.177659NaN
2020-01-100.6851320.502535NaN
\n", "
" ], "text/plain": [ " coluna_1 coluna_2 coluna_3\n", "2020-01-01 -0.416092 1.810364 1.0\n", "2020-01-02 -0.137970 2.578520 2.0\n", "2020-01-03 0.575827 0.060866 3.0\n", "2020-01-04 -0.017367 1.299587 4.0\n", "2020-01-05 1.384279 -0.381732 5.0\n", "2020-01-06 0.549706 -1.308789 6.0\n", "2020-01-07 -0.282296 -1.688979 7.0\n", "2020-01-08 -0.989730 -0.028121 8.0\n", "2020-01-09 0.275582 -0.177659 NaN\n", "2020-01-10 0.685132 0.502535 NaN" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo['coluna_3'] = pd.Series([1,2,3,4,5,6,7,8,np.nan,np.nan],index=df_exemplo.index)\n", "df_exemplo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Os *index* correspondentes aos menores e maiores valores são datas, evidentemente." ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "coluna_1 2020-01-08\n", "coluna_2 2020-01-07\n", "coluna_3 2020-01-01\n", "dtype: object" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.idxmin()" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "coluna_1 2020-01-05\n", "coluna_2 2020-01-02\n", "coluna_3 2020-01-08\n", "dtype: object" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.idxmax()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Reindexação de *DataFrames*\n", "\n", "No *pandas*, o método `reindex`\n", "\n", "* reordena o *DataFrame* de acordo com o conjunto de rótulos inserido como argumento;\n", "* insere valores faltantes caso um rótulo do novo *index* não tenha valor atribuído no conjunto de dados;\n", "* remove valores correspondentes a rótulos que não estão presentes no novo *index*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Exemplos:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AlturaPesoIMC
VictorNaNNaNNaN
Túlio171.073.0NaN
Pedro165.067.0NaN
João178.080.0NaN
\n", "
" ], "text/plain": [ " Altura Peso IMC\n", "Victor NaN NaN NaN\n", "Túlio 171.0 73.0 NaN\n", "Pedro 165.0 67.0 NaN\n", "João 178.0 80.0 NaN" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.reindex(index=['Victor', 'Túlio', 'Pedro', 'João'], columns=['Altura','Peso','IMC'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Remoção de linhas ou colunas de um *DataFrame*\n", "\n", "Para remover linhas ou colunas de um *DataFrame* do *pandas* podemos utilizar o método `drop`. O argumento `axis` identifica o eixo de remoção: `axis=0`, que é o padrão, indica a remoção de uma ou mais linhas; `axis=1` indica a remoção de uma ou mais colunas." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Nos exemplos que segue, note que novos *DataFrames* são obtidos a partir de `df_dict_series` sem que este seja sobrescrito." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana2055162
João1980178
Maria2162162
Pedro2267165
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20 55 162\n", "João 19 80 178\n", "Maria 21 62 162\n", "Pedro 22 67 165" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.drop('Túlio') # axis=0 implícito " ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
João1980178
Pedro2267165
Túlio2073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "João 19 80 178\n", "Pedro 22 67 165\n", "Túlio 20 73 171" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.drop(['Ana','Maria'], axis=0)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PesoAltura
Ana55162
João80178
Maria62162
Pedro67165
Túlio73171
\n", "
" ], "text/plain": [ " Peso Altura\n", "Ana 55 162\n", "João 80 178\n", "Maria 62 162\n", "Pedro 67 165\n", "Túlio 73 171" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.drop(['Idade'], axis=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Renomeando *index* e *columns*\n", "\n", "O método `rename` retorna uma cópia na qual o *index* (no caso de *Series* e *DataFrames*) e *columns* (no caso de *DataFrames*) foram renomeados. O método aceita como entrada um dicionário, uma *Series* do *pandas* ou uma função." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Exemplo:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "serie_exemplo = pd.Series([1,2,3], index=['a','b','c'])" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 2\n", "c 3\n", "dtype: int64" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "abacaxi 1\n", "banana 2\n", "cebola 3\n", "dtype: int64" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo.rename({'a':'abacaxi', 'b':'banana', 'c': 'cebola'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exemplo:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana2055162
João1980178
Maria2162162
Pedro2267165
Túlio2073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20 55 162\n", "João 19 80 178\n", "Maria 21 62 162\n", "Pedro 22 67 165\n", "Túlio 20 73 171" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IPA
a2055162
j1980178
m2162162
p2267165
t2073171
\n", "
" ], "text/plain": [ " I P A\n", "a 20 55 162\n", "j 19 80 178\n", "m 21 62 162\n", "p 22 67 165\n", "t 20 73 171" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.rename(index = {'Ana':'a', 'João':'j', 'Maria':'m', 'Pedro':'p','Túlio':'t'},\n", " columns = {'Idade':'I', 'Peso':'P','Altura':'A'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No próximo exemplo, usamos uma *Series* para renomear os rótulos." ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "indice_novo = pd.Series({'Ana':'a', 'João':'j', 'Maria':'m', 'Pedro':'p','Túlio':'t'})" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
a2055162
j1980178
m2162162
p2267165
t2073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "a 20 55 162\n", "j 19 80 178\n", "m 21 62 162\n", "p 22 67 165\n", "t 20 73 171" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.rename(index = indice_novo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Neste exemplo, usamos a função `str.upper` (altera a `str` para \"todas maiúsculas\") para renomear colunas." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDADEPESOALTURA
Ana2055162
João1980178
Maria2162162
Pedro2267165
Túlio2073171
\n", "
" ], "text/plain": [ " IDADE PESO ALTURA\n", "Ana 20 55 162\n", "João 19 80 178\n", "Maria 21 62 162\n", "Pedro 22 67 165\n", "Túlio 20 73 171" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.rename(columns=str.upper)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Ordenando *Series* e *DataFrames*\n", "\n", "É possível ordenar ambos pelos rótulos do *index* (para tanto é necessário que eles sejam ordenáveis) ou por valores nas colunas. \n", "\n", "O método `sort_index` ordena a *Series* ou o *DataFrame* pelo *index*. O método `sort_values` ordena a *Series* ou o *DataFrame* pelos valores (escolhendo uma ou mais colunas no caso de *DataFrames*). No caso do *DataFrame*, o argumento `by` é necessário para indicar qual(is) coluna(s) será(ão) utilizada(s) como base para a ordenação." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Exemplos:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Maria 21\n", "Pedro 22\n", "Túlio 20\n", "João 19\n", "Ana 20\n", "dtype: int64" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_desordenada = pd.Series({'Maria': 21, 'Pedro': 22, 'Túlio': 20, 'João': 19, 'Ana':20}); \n", "serie_desordenada" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ana 20\n", "João 19\n", "Maria 21\n", "Pedro 22\n", "Túlio 20\n", "dtype: int64" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_desordenada.sort_index() # ordenação alfabética" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Mais exemplos:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "df_desordenado = df_dict_series.reindex(index=['Pedro','Maria','Ana','Túlio','João'])" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Pedro2267165
Maria2162162
Ana2055162
Túlio2073171
João1980178
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Pedro 22 67 165\n", "Maria 21 62 162\n", "Ana 20 55 162\n", "Túlio 20 73 171\n", "João 19 80 178" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_desordenado" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana2055162
João1980178
Maria2162162
Pedro2267165
Túlio2073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20 55 162\n", "João 19 80 178\n", "Maria 21 62 162\n", "Pedro 22 67 165\n", "Túlio 20 73 171" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_desordenado.sort_index()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Mais exemplos:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "João 19\n", "Túlio 20\n", "Ana 20\n", "Maria 21\n", "Pedro 22\n", "dtype: int64" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_desordenada.sort_values()" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Maria2162162
Ana2055162
Pedro2267165
Túlio2073171
João1980178
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Maria 21 62 162\n", "Ana 20 55 162\n", "Pedro 22 67 165\n", "Túlio 20 73 171\n", "João 19 80 178" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_desordenado.sort_values(by=['Altura']) # ordena por 'Altura'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "No caso de \"empate\", podemos utilizar outra coluna para desempatar." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana2055162
Maria2162162
Pedro2267165
Túlio2073171
João1980178
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20 55 162\n", "Maria 21 62 162\n", "Pedro 22 67 165\n", "Túlio 20 73 171\n", "João 19 80 178" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_desordenado.sort_values(by=['Altura','Peso']) # usa a coluna 'Peso' para desempatar" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Os métodos `sort_index` e `sort_values` admitem o argumento opcional `ascending`, que permite inverter a ordenação caso tenha valor `False`." ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Túlio2073171
Pedro2267165
Maria2162162
João1980178
Ana2055162
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Túlio 20 73 171\n", "Pedro 22 67 165\n", "Maria 21 62 162\n", "João 19 80 178\n", "Ana 20 55 162" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_desordenado.sort_index(ascending=False)" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Pedro2267165
Maria2162162
Ana2055162
Túlio2073171
João1980178
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Pedro 22 67 165\n", "Maria 21 62 162\n", "Ana 20 55 162\n", "Túlio 20 73 171\n", "João 19 80 178" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_desordenado.sort_values(by=['Idade'], ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Comparando *Series* e *DataFrames*\n", "\n", "*Series* e *DataFrames* possuem os seguintes métodos de comparação lógica: \n", "\n", "- `eq` (igual);\n", "- `ne` (diferente);\n", "- `lt` (menor do que);\n", "- `gt` (maior do que);\n", "- `le` (menor ou igual a); \n", "- `ge` (maior ou igual a)\n", "\n", "que permitem a utilização dos operadores binários `==`, `!=`, `<`, `>`, `<=` e `>=`, respectivamente. As comparações são realizadas em cada entrada da *Series* ou do *DataFrame*.\n", "\n", "**Observação**: Para que esses métodos sejam aplicados, todos os objetos presentes nas colunas do *DataFrame* devem ser de mesma natureza. Por exemplo, se um *DataFrame* possui algumas colunas numéricas e outras com *strings*, ao realizar uma comparação do tipo `> 1`, um erro ocorrerá, pois o *pandas* tentará comparar objetos do tipo `int` com objetos do tipo `str`, assim gerando uma incompatibilidade.\n", "\n", "Exemplos:" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 2\n", "c 3\n", "dtype: int64" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "a False\n", "b True\n", "c False\n", "dtype: bool" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo == 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "De outra forma:" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a False\n", "b True\n", "c False\n", "dtype: bool" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo.eq(2)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a False\n", "b True\n", "c True\n", "dtype: bool" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo > 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ou, na forma funcional:" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a False\n", "b True\n", "c True\n", "dtype: bool" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo.gt(1)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2coluna_3
2020-01-01FalseTrueFalse
2020-01-02FalseTrueTrue
2020-01-03FalseFalseTrue
2020-01-04FalseTrueTrue
2020-01-05TrueFalseTrue
2020-01-06FalseFalseTrue
2020-01-07FalseFalseTrue
2020-01-08FalseFalseTrue
2020-01-09FalseFalseFalse
2020-01-10FalseFalseFalse
\n", "
" ], "text/plain": [ " coluna_1 coluna_2 coluna_3\n", "2020-01-01 False True False\n", "2020-01-02 False True True\n", "2020-01-03 False False True\n", "2020-01-04 False True True\n", "2020-01-05 True False True\n", "2020-01-06 False False True\n", "2020-01-07 False False True\n", "2020-01-08 False False True\n", "2020-01-09 False False False\n", "2020-01-10 False False False" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo > 1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Importante:** Ao comparar *np.nan*, o resultado tipicamente é falso:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan == np.nan" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan > np.nan" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan >= np.nan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Só é verdadeiro para indicar que é diferente:" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan != np.nan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Neste sentido, podemos ter tabelas iguais sem que a comparação usual funcione:" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# 'copy', como o nome sugere, gera uma cópia do DataFrame\n", "df_exemplo_2 = df_exemplo.copy() " ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
coluna_1coluna_2coluna_3
2020-01-01TrueTrueTrue
2020-01-02TrueTrueTrue
2020-01-03TrueTrueTrue
2020-01-04TrueTrueTrue
2020-01-05TrueTrueTrue
2020-01-06TrueTrueTrue
2020-01-07TrueTrueTrue
2020-01-08TrueTrueTrue
2020-01-09TrueTrueFalse
2020-01-10TrueTrueFalse
\n", "
" ], "text/plain": [ " coluna_1 coluna_2 coluna_3\n", "2020-01-01 True True True\n", "2020-01-02 True True True\n", "2020-01-03 True True True\n", "2020-01-04 True True True\n", "2020-01-05 True True True\n", "2020-01-06 True True True\n", "2020-01-07 True True True\n", "2020-01-08 True True True\n", "2020-01-09 True True False\n", "2020-01-10 True True False" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df_exemplo == df_exemplo_2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "O motivo de haver entradas como `False` ainda que `df_exemplo_2` seja uma cópia exata de `df_exemplo` é a presença do `np.nan`. Neste caso, devemos utilizar o método `equals` para realizar a comparação." ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.equals(df_exemplo_2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Os métodos `any`, `all` e a propriedade `empty`\n", "\n", "O método `any` é aplicado a entradas booleanas (verdadeiras ou falsas) e retorna *verdadeiro* se existir alguma entrada verdadeira, ou *falso*, se todas forem falsas. O método `all` é aplicado a entradas booleanas e retorna *verdadeiro* se todas as entradas forem verdadeiras, ou *falso*, se houver pelo menos uma entrada falsa. A propriedade `empty` retorna *verdadeiro* se a *Series* ou o *DataFrame* estiver vazio, ou *falso* caso contrário." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Exemplos:" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 2\n", "c 3\n", "dtype: int64" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a -1\n", "b 0\n", "c 1\n", "dtype: int64" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo_2 = serie_exemplo-2; \n", "serie_exemplo_2" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(serie_exemplo_2 > 0).any()" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(serie_exemplo > 1).all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Este exemplo reproduz um valor `False` único." ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df_exemplo == df_exemplo_2).all().all()" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "serie_exemplo.empty" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Mais exemplos:" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "coluna_1 True\n", "coluna_2 True\n", "coluna_3 True\n", "dtype: bool" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df_exemplo == df_exemplo_2).any()" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_exemplo.empty" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [], "source": [ "df_vazio = pd.DataFrame()" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_vazio.empty" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecionando colunas de um *DataFrame*\n", "\n", "Para selecionar colunas de um *DataFrame*, basta aplicar *colchetes* a uma lista contendo os nomes das colunas de interesse." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "No exemplo abaixo, temos um *DataFrame* contendo as colunas `'Idade'`, `'Peso'` e `'Altura'`. Selecionaremos `'Peso'` e `'Altura'`, apenas." ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PesoAltura
Ana55162
João80178
Maria62162
Pedro67165
Túlio73171
\n", "
" ], "text/plain": [ " Peso Altura\n", "Ana 55 162\n", "João 80 178\n", "Maria 62 162\n", "Pedro 67 165\n", "Túlio 73 171" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series[['Peso','Altura']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Se quisermos selecionar apenas uma coluna, não há necessidade de inserir uma lista. Basta utilizar o nome da coluna:" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ana 55\n", "João 80\n", "Maria 62\n", "Pedro 67\n", "Túlio 73\n", "Name: Peso, dtype: int64" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series['Peso']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Para remover colunas, podemos utilizar o método `drop`." ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Idade
Ana20
João19
Maria21
Pedro22
Túlio20
\n", "
" ], "text/plain": [ " Idade\n", "Ana 20\n", "João 19\n", "Maria 21\n", "Pedro 22\n", "Túlio 20" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.drop(['Peso','Altura'], axis=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Criando novas colunas a partir de colunas existentes\n", "\n", "Um método eficiente para criar novas colunas a partir de colunas já existentes é `eval`. Neste método, podemos utilizar como argumento uma *string* contendo uma expressão matemática envolvendo nomes de colunas do *DataFrame*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Como exemplo, vamos ver como calcular o IMC no *DataFrame* anterior:" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ana 20.957171\n", "João 25.249337\n", "Maria 23.624447\n", "Pedro 24.609734\n", "Túlio 24.964946\n", "dtype: float64" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.eval('Peso/(Altura/100)**2')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Se quisermos obter um *DataFrame* contendo o IMC como uma nova coluna, podemos utilizar o método `assign` (sem modificar o *DataFrame* original):" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAlturaIMC
Ana205516220.96
João198017825.25
Maria216216223.62
Pedro226716524.61
Túlio207317124.96
\n", "
" ], "text/plain": [ " Idade Peso Altura IMC\n", "Ana 20 55 162 20.96\n", "João 19 80 178 25.25\n", "Maria 21 62 162 23.62\n", "Pedro 22 67 165 24.61\n", "Túlio 20 73 171 24.96" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series.assign(IMC=round(df_dict_series.eval('Peso/(Altura/100)**2'),2))" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAltura
Ana2055162
João1980178
Maria2162162
Pedro2267165
Túlio2073171
\n", "
" ], "text/plain": [ " Idade Peso Altura\n", "Ana 20 55 162\n", "João 19 80 178\n", "Maria 21 62 162\n", "Pedro 22 67 165\n", "Túlio 20 73 171" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series # não modificado" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Se quisermos modificar o *DataFrame* para incluir a coluna IMC fazemos:" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [], "source": [ "df_dict_series['IMC']=round(df_dict_series.eval('Peso/(Altura/100)**2'),2)" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAlturaIMC
Ana205516220.96
João198017825.25
Maria216216223.62
Pedro226716524.61
Túlio207317124.96
\n", "
" ], "text/plain": [ " Idade Peso Altura IMC\n", "Ana 20 55 162 20.96\n", "João 19 80 178 25.25\n", "Maria 21 62 162 23.62\n", "Pedro 22 67 165 24.61\n", "Túlio 20 73 171 24.96" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series # modificado \"in-place\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecionando linhas de um *DataFrame*:\n", "\n", "Podemos selecionar linhas de um *DataFrame* de diversas formas diferentes. Veremos agora algumas dessas formas.\n", "\n", "Diferentemente da forma de selecionar colunas, para selecionar diretamente linhas de um *DataFrame* devemos utilizar o método `loc` (fornecendo o *index*, isto é, o rótulo da linha) ou o `iloc` (fornecendo a posição da linha):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Trabalharemos a seguir com um banco de dados atualizado sobre a COVID-19. Para tanto, importaremos o módulo `datetime` que nos auxiliará com datas." ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [], "source": [ "import datetime" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "dados_covid_PB = pd.read_csv('https://superset.plataformatarget.com.br/superset/explore_json/?form_data=%7B%22slice_id%22%3A1550%7D&csv=true', \n", " sep=',', index_col=0)\n", "\n", "# busca o banco na data D-1, visto que a atualização\n", "# ocorre em D\n", "ontem = (datetime.date.today() - datetime.timedelta(days=1)).strftime('%Y-%m-%d') " ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovosLetalidade
data
2021-02-091997069712373691527794171130.0209
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "data \n", "2021-02-09 199706 971 237369 152779 \n", "\n", " obitosAcumulados obitosNovos Letalidade \n", "data \n", "2021-02-09 4171 13 0.0209 " ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.head(1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Podemos ver as informações de um único dia como argumento. Para tanto, excluímos a coluna `'Letalidade'` (valor não inteiro) e convertemos o restante para valores inteiros:" ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "casosAcumulados 199706\n", "casosNovos 971\n", "descartados 237369\n", "recuperados 152779\n", "obitosAcumulados 4171\n", "obitosNovos 13\n", "Name: 2021-02-09, dtype: int64" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.loc[ontem].drop('Letalidade').astype('int')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Podemos selecionar um intervalo de datas como argumento (excluindo letalidade):" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovos
2021-02-011925981014234215149235406812
2021-02-02193465867234366149242408214
2021-02-031945191054234902149248409614
2021-02-04195482963235319149792411014
2021-02-051965421060236216150169412313
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "2021-02-01 192598 1014 234215 149235 \n", "2021-02-02 193465 867 234366 149242 \n", "2021-02-03 194519 1054 234902 149248 \n", "2021-02-04 195482 963 235319 149792 \n", "2021-02-05 196542 1060 236216 150169 \n", "\n", " obitosAcumulados obitosNovos \n", "2021-02-01 4068 12 \n", "2021-02-02 4082 14 \n", "2021-02-03 4096 14 \n", "2021-02-04 4110 14 \n", "2021-02-05 4123 13 " ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.index = pd.to_datetime(dados_covid_PB.index) # Convertendo o index de string para data\n", "dados_covid_PB.loc[pd.date_range('2021-02-01',periods=5,freq=\"D\")].drop('Letalidade',axis=1) \n", " #função pd.date_range é muito útil para criar índices a partir de datas." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Podemos colocar uma lista como argumento:" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovosLetalidade
2021-01-01167062578219890127388368080.0220
2021-02-0119259810142342151492354068120.0211
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "2021-01-01 167062 578 219890 127388 \n", "2021-02-01 192598 1014 234215 149235 \n", "\n", " obitosAcumulados obitosNovos Letalidade \n", "2021-01-01 3680 8 0.0220 \n", "2021-02-01 4068 12 0.0211 " ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.loc[pd.to_datetime(['2021-01-01','2021-02-01'])]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Vamos agora examinar os dados da posição 100 (novamente excluindo a coluna letalidade e convertendo para inteiro):" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "casosAcumulados 133220\n", "casosNovos 71\n", "descartados 185162\n", "recuperados 108737\n", "obitosAcumulados 3107\n", "obitosNovos 6\n", "Name: 2020-11-01 00:00:00, dtype: int64" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.iloc[100].drop('Letalidade').astype('int')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Podemos colocar um intervalo como argumento:" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovos
data
2020-12-21159639203213617123430355213
2020-12-20159436327213356123417353910
2020-12-1915910966021314912329535296
2020-12-181584491053212439122935352316
2020-12-171573961274210628122219350720
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "data \n", "2020-12-21 159639 203 213617 123430 \n", "2020-12-20 159436 327 213356 123417 \n", "2020-12-19 159109 660 213149 123295 \n", "2020-12-18 158449 1053 212439 122935 \n", "2020-12-17 157396 1274 210628 122219 \n", "\n", " obitosAcumulados obitosNovos \n", "data \n", "2020-12-21 3552 13 \n", "2020-12-20 3539 10 \n", "2020-12-19 3529 6 \n", "2020-12-18 3523 16 \n", "2020-12-17 3507 20 " ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.iloc[50:55].drop('Letalidade', axis=1).astype('int') " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecionando colunas pelos métodos *loc* e *iloc*\n", "\n", "Podemos selecionar colunas utilizando os métodos `loc` e `iloc` utilizando um argumento adicional." ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosNovosobitosNovos
data
2021-02-0997113
2021-02-0861112
2021-02-0766411
2021-02-0691812
2021-02-05106013
.........
2020-03-2000
2020-03-1900
2020-03-1800
2020-03-1700
2020-03-1600
\n", "

331 rows × 2 columns

\n", "
" ], "text/plain": [ " casosNovos obitosNovos\n", "data \n", "2021-02-09 971 13\n", "2021-02-08 611 12\n", "2021-02-07 664 11\n", "2021-02-06 918 12\n", "2021-02-05 1060 13\n", "... ... ...\n", "2020-03-20 0 0\n", "2020-03-19 0 0\n", "2020-03-18 0 0\n", "2020-03-17 0 0\n", "2020-03-16 0 0\n", "\n", "[331 rows x 2 columns]" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.loc[:,['casosNovos','obitosNovos']]" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
obitosAcumuladosobitosNovos
data
2021-02-09417113
2021-02-08415812
2021-02-07414611
2021-02-06413512
2021-02-05412313
.........
2020-03-2000
2020-03-1900
2020-03-1800
2020-03-1700
2020-03-1600
\n", "

331 rows × 2 columns

\n", "
" ], "text/plain": [ " obitosAcumulados obitosNovos\n", "data \n", "2021-02-09 4171 13\n", "2021-02-08 4158 12\n", "2021-02-07 4146 11\n", "2021-02-06 4135 12\n", "2021-02-05 4123 13\n", "... ... ...\n", "2020-03-20 0 0\n", "2020-03-19 0 0\n", "2020-03-18 0 0\n", "2020-03-17 0 0\n", "2020-03-16 0 0\n", "\n", "[331 rows x 2 columns]" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.iloc[:,4:6] # fatiamento na coluna" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecionando linhas e colunas específicas pelos métodos *loc* e *iloc*:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usando o mesmo princípio de *fatiamento* aplicado a *arrays* do numpy, podemos selecionar linhas e colunas em um intervalo específico de forma a obter uma subtabela." ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
obitosAcumuladosobitosNovos
data
2020-11-0631468
2020-11-0531388
2020-11-04313011
2020-11-03311911
2020-11-0231081
\n", "
" ], "text/plain": [ " obitosAcumulados obitosNovos\n", "data \n", "2020-11-06 3146 8\n", "2020-11-05 3138 8\n", "2020-11-04 3130 11\n", "2020-11-03 3119 11\n", "2020-11-02 3108 1" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.iloc[95:100,4:6]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Neste exemplo um pouco mais complexo, buscamos casos novos e óbitos novos em um período específico e ordenamos a tabela da data mais recente para a mais antiga." ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosNovosobitosNovos
2020-04-1060
2020-04-09244
2020-04-08143
2020-04-0750
2020-04-0610
\n", "
" ], "text/plain": [ " casosNovos obitosNovos\n", "2020-04-10 6 0\n", "2020-04-09 24 4\n", "2020-04-08 14 3\n", "2020-04-07 5 0\n", "2020-04-06 1 0" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.loc[pd.date_range('2020-04-06','2020-04-10'),['casosNovos','obitosNovos']].sort_index(ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Suponha que o peso de Ana foi medido corretamente, mas registrado de maneira errônea no *DataFrame* `df_dict_series` como 55." ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAlturaIMC
Ana205516220.96
João198017825.25
Maria216216223.62
Pedro226716524.61
Túlio207317124.96
\n", "
" ], "text/plain": [ " Idade Peso Altura IMC\n", "Ana 20 55 162 20.96\n", "João 19 80 178 25.25\n", "Maria 21 62 162 23.62\n", "Pedro 22 67 165 24.61\n", "Túlio 20 73 171 24.96" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Supondo que, na realidade, o valor é 65, alteramos a entrada específica com um simples `loc`. Em seguida, atualizamos a tabela." ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "df_dict_series.loc['Ana','Peso'] = 65\n", "\n", "df_dict_series = df_dict_series.assign(IMC=round(df_dict_series.eval('Peso/(Altura/100)**2'),2)) # O IMC mudou" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAlturaIMC
Ana206516224.77
João198017825.25
Maria216216223.62
Pedro226716524.61
Túlio207317124.96
\n", "
" ], "text/plain": [ " Idade Peso Altura IMC\n", "Ana 20 65 162 24.77\n", "João 19 80 178 25.25\n", "Maria 21 62 162 23.62\n", "Pedro 22 67 165 24.61\n", "Túlio 20 73 171 24.96" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecionando linha através de critérios lógicos ou funções\n", "\n", "Vamos selecionar quais os dias em que houve mais de 40 mortes registradas:" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovosLetalidade
data
2020-07-2168844116478605250281558410.0226
2020-07-1563939147774399236951383410.0216
2020-07-0249536136148272163491044420.0211
2020-06-304695719004307014930977460.0208
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "data \n", "2020-07-21 68844 1164 78605 25028 \n", "2020-07-15 63939 1477 74399 23695 \n", "2020-07-02 49536 1361 48272 16349 \n", "2020-06-30 46957 1900 43070 14930 \n", "\n", " obitosAcumulados obitosNovos Letalidade \n", "data \n", "2020-07-21 1558 41 0.0226 \n", "2020-07-15 1383 41 0.0216 \n", "2020-07-02 1044 42 0.0211 \n", "2020-06-30 977 46 0.0208 " ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.loc[dados_covid_PB['obitosNovos']>40]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Selecionando os dias com mais de 25 óbitos e mais de 1500 casos novos:" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovosLetalidade
data
2020-08-04857601549106500385541901310.0222
2020-07-3182794168696753359711811260.0219
2020-07-2373104213284047285661618370.0221
2020-07-1059118150469567214811229330.0208
2020-07-0856344154267549199991171260.0208
2020-07-0754802165164933193731145270.0209
2020-06-304695719004307014930977460.0208
2020-06-09224521501206504671534270.0238
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "data \n", "2020-08-04 85760 1549 106500 38554 \n", "2020-07-31 82794 1686 96753 35971 \n", "2020-07-23 73104 2132 84047 28566 \n", "2020-07-10 59118 1504 69567 21481 \n", "2020-07-08 56344 1542 67549 19999 \n", "2020-07-07 54802 1651 64933 19373 \n", "2020-06-30 46957 1900 43070 14930 \n", "2020-06-09 22452 1501 20650 4671 \n", "\n", " obitosAcumulados obitosNovos Letalidade \n", "data \n", "2020-08-04 1901 31 0.0222 \n", "2020-07-31 1811 26 0.0219 \n", "2020-07-23 1618 37 0.0221 \n", "2020-07-10 1229 33 0.0208 \n", "2020-07-08 1171 26 0.0208 \n", "2020-07-07 1145 27 0.0209 \n", "2020-06-30 977 46 0.0208 \n", "2020-06-09 534 27 0.0238 " ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.loc[(dados_covid_PB.obitosNovos > 25) & (dados_covid_PB.casosNovos>1500)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Obs**.: Note que podemos utilizar o nome da coluna como um atributo." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Vamos inserir uma coluna sobrenome no `df_dict_series`:" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAlturaIMCSobrenome
Ana206516224.77Silva
João198017825.25PraDo
Maria216216223.62Sales
Pedro226716524.61MachadO
Túlio207317124.96Coutinho
\n", "
" ], "text/plain": [ " Idade Peso Altura IMC Sobrenome\n", "Ana 20 65 162 24.77 Silva\n", "João 19 80 178 25.25 PraDo\n", "Maria 21 62 162 23.62 Sales\n", "Pedro 22 67 165 24.61 MachadO\n", "Túlio 20 73 171 24.96 Coutinho" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dict_series['Sobrenome'] = ['Silva', 'PraDo', 'Sales', 'MachadO', 'Coutinho']\n", "df_dict_series" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Vamos encontrar as linhas cujo sobrenome termina em \"do\". Para tanto, note que a função abaixo retorna `True` se o final é \"do\" e `False` caso contrário.\n", "\n", "```python\n", "def verifica_final_do(palavra):\n", " return palavra.lower()[-2:] == 'do'\n", "```\n", "**Obs**.: Note que convertemos tudo para minúsculo." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Agora vamos utilizar essa função para alcançar nosso objetivo:" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ana False\n", "João True\n", "Maria False\n", "Pedro True\n", "Túlio False\n", "Name: Sobrenome, dtype: bool" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'map' aplica a função lambda a cada elemento da *Series*\n", "df_dict_series['Sobrenome'].map(lambda palavra: palavra.lower()[-2:]=='do') " ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IdadePesoAlturaIMCSobrenome
João198017825.25PraDo
Pedro226716524.61MachadO
\n", "
" ], "text/plain": [ " Idade Peso Altura IMC Sobrenome\n", "João 19 80 178 25.25 PraDo\n", "Pedro 22 67 165 24.61 MachadO" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# procurando no df inteiro\n", "df_dict_series.loc[df_dict_series['Sobrenome'].map(lambda palavra: palavra.lower()[-2:]=='do')]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Vamos selecionar as linhas do mês 2 (fevereiro) usando `index.month`:" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovosLetalidade
data
2021-02-091997069712373691527794171130.0209
2021-02-081987356112371891518794158120.0209
2021-02-071981246642370721515354146110.0209
2021-02-061974609182367741501754135120.0209
2021-02-0519654210602362161501694123130.0210
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "data \n", "2021-02-09 199706 971 237369 152779 \n", "2021-02-08 198735 611 237189 151879 \n", "2021-02-07 198124 664 237072 151535 \n", "2021-02-06 197460 918 236774 150175 \n", "2021-02-05 196542 1060 236216 150169 \n", "\n", " obitosAcumulados obitosNovos Letalidade \n", "data \n", "2021-02-09 4171 13 0.0209 \n", "2021-02-08 4158 12 0.0209 \n", "2021-02-07 4146 11 0.0209 \n", "2021-02-06 4135 12 0.0209 \n", "2021-02-05 4123 13 0.0210 " ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.loc[dados_covid_PB.index.month==2].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecionando linhas com o método *query*\n", "\n", "Similarmente ao método `eval`, ao utilizarmos `query`, podemos criar expressões lógicas a partir de nomes das colunas do *DataFrame*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Assim, podemos reescrever o código\n", "\n", "```python\n", "dados_covid_PB.loc[(dados_covid_PB.obitosNovos>25) & \n", " (dados_covid_PB.casosNovos>1500)]\n", "```\n", "como" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
casosAcumuladoscasosNovosdescartadosrecuperadosobitosAcumuladosobitosNovosLetalidade
data
2020-08-04857601549106500385541901310.0222
2020-07-3182794168696753359711811260.0219
2020-07-2373104213284047285661618370.0221
2020-07-1059118150469567214811229330.0208
2020-07-0856344154267549199991171260.0208
2020-07-0754802165164933193731145270.0209
2020-06-304695719004307014930977460.0208
2020-06-09224521501206504671534270.0238
\n", "
" ], "text/plain": [ " casosAcumulados casosNovos descartados recuperados \\\n", "data \n", "2020-08-04 85760 1549 106500 38554 \n", "2020-07-31 82794 1686 96753 35971 \n", "2020-07-23 73104 2132 84047 28566 \n", "2020-07-10 59118 1504 69567 21481 \n", "2020-07-08 56344 1542 67549 19999 \n", "2020-07-07 54802 1651 64933 19373 \n", "2020-06-30 46957 1900 43070 14930 \n", "2020-06-09 22452 1501 20650 4671 \n", "\n", " obitosAcumulados obitosNovos Letalidade \n", "data \n", "2020-08-04 1901 31 0.0222 \n", "2020-07-31 1811 26 0.0219 \n", "2020-07-23 1618 37 0.0221 \n", "2020-07-10 1229 33 0.0208 \n", "2020-07-08 1171 26 0.0208 \n", "2020-07-07 1145 27 0.0209 \n", "2020-06-30 977 46 0.0208 \n", "2020-06-09 534 27 0.0238 " ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dados_covid_PB.query('obitosNovos>25 and casosNovos>1500') # note que 'and' é usado em vez de '&'" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "latex_metadata": { "affiliation": "Departamento de Computação Científica / UFPB", "author": "Gustavo Oliveira e Andréa Rocha", "date": "Julho de 2020", "title": "Aula 6A - Manipulação de Dados" }, "rise": { "enable_chalkboard": true, "footer": "UFPB - CI - DCC", "header": "Gustavo Oliveira / Andrea Rocha", "theme": "sky" } }, "nbformat": 4, "nbformat_minor": 4 }