{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Operações com *DataFrames*\n",
"\n",
"Como dissemos anterioremente, o *DataFrame* é a segunda estrutura basilar do *pandas*. Um *DataFrame*:\n",
"- é uma tabela, ou seja, é bidimensional;\n",
"- tem cada coluna formada como uma *Series* do *pandas*;\n",
"- pode ter *Series* contendo tipos de dado diferentes."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Criando um *DataFrame*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"O método padrão para criarmos um *DataFrame* é através de uma função com mesmo nome.\n",
"\n",
"```python\n",
"df_exemplo = pd.DataFrame(dados_de_interesse, index = indice_de_interesse, \n",
" columns = colunas_de_interesse)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Ao criar um *DataFrame*, podemos informar\n",
"- `index`: rótulos para as linhas (atributos *index* das *Series*).\n",
"- `columns`: rótulos para as colunas (atributos *name* das *Series*)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"No _template_, `dados_de_interesse` pode ser\n",
"\n",
"* um dicionário de:\n",
" * *arrays* unidimensionais do *numpy*;\n",
" * listas;\n",
" * dicionários;\n",
" * *Series* do *pandas*.\n",
"* um *array* bidimensional do *numpy*;\n",
"* uma *Series* do *Pandas*;\n",
"* outro *DataFrame*."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Criando um *DataFrame* a partir de dicionários de *Series*\n",
"\n",
"Neste método de criação, as *Series* do dicionário não precisam possuir o mesmo número de elementos. O *index* do *DataFrame* será dado pela **união** dos *index* de todas as *Series* contidas no dicionário."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Exemplo:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"serie_Idade = pd.Series({'Ana':20, 'João': 19, 'Maria': 21, 'Pedro': 22}, name=\"Idade\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"serie_Peso = pd.Series({'Ana':55, 'João': 80, 'Maria': 62, 'Pedro': 67, 'Túlio': 73}, name=\"Peso\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"serie_Altura = pd.Series({'Ana':162, 'João': 178, 'Maria': 162, 'Pedro': 165, 'Túlio': 171}, name=\"Altura\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dicionario_series_exemplo = {'Idade': serie_Idade, 'Peso': serie_Peso, 'Altura': serie_Altura}"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"df_dict_series = pd.DataFrame(dicionario_series_exemplo)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20.0 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19.0 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 21.0 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22.0 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" NaN | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20.0 55 162\n",
"João 19.0 80 178\n",
"Maria 21.0 62 162\n",
"Pedro 22.0 67 165\n",
"Túlio NaN 73 171"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare este resultado com a criação de uma planilha pelos métodos usuais. Veja que há muita flexibilidade para criarmos ou modificarmos uma tabela.\n",
"\n",
"Vejamos exemplos sobre como acessar intervalos de dados na tabela."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20 55 162\n",
"Maria 21 62 162"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(dicionario_series_exemplo, index=['Ana','Maria'])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" Maria | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Peso Altura\n",
"Ana 55 162\n",
"Maria 62 162"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(dicionario_series_exemplo, index=['Ana','Maria'], columns=['Peso','Altura'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Neste exemplo, adicionamos a coluna `IMC`, ainda sem valores calculados."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Peso | \n",
" Altura | \n",
" IMC | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 55.0 | \n",
" 162.0 | \n",
" NaN | \n",
"
\n",
" \n",
" Maria | \n",
" 62.0 | \n",
" 162.0 | \n",
" NaN | \n",
"
\n",
" \n",
" Paula | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Peso Altura IMC\n",
"Ana 55.0 162.0 NaN\n",
"Maria 62.0 162.0 NaN\n",
"Paula NaN NaN NaN"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(dicionario_series_exemplo, index=['Ana','Maria','Paula'], \n",
" columns=['Peso','Altura','IMC'])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df_exemplo_IMC = pd.DataFrame(dicionario_series_exemplo, \n",
" columns=['Peso','Altura','IMC'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Agora, mostramos como os valores do IMC podem ser calculados diretamente por computação vetorizada sobre as *Series*."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"df_exemplo_IMC['IMC']=round(df_exemplo_IMC['Peso']/(df_exemplo_IMC['Altura']/100)**2,2)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Peso | \n",
" Altura | \n",
" IMC | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 55 | \n",
" 162 | \n",
" 20.96 | \n",
"
\n",
" \n",
" João | \n",
" 80 | \n",
" 178 | \n",
" 25.25 | \n",
"
\n",
" \n",
" Maria | \n",
" 62 | \n",
" 162 | \n",
" 23.62 | \n",
"
\n",
" \n",
" Pedro | \n",
" 67 | \n",
" 165 | \n",
" 24.61 | \n",
"
\n",
" \n",
" Túlio | \n",
" 73 | \n",
" 171 | \n",
" 24.96 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Peso Altura IMC\n",
"Ana 55 162 20.96\n",
"João 80 178 25.25\n",
"Maria 62 162 23.62\n",
"Pedro 67 165 24.61\n",
"Túlio 73 171 24.96"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo_IMC"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Criando um *DataFrame* a partir de dicionários de listas ou *arrays* do *numpy*:\n",
"\n",
"Neste método de criação, os *arrays* ou as listas **devem** possuir o mesmo comprimento. Se o *index* não for informado, o *index* será dado de forma similar ao do objeto tipo *Series*."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Exemplo com dicionário de listas:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"dicionario_lista_exemplo = {'Idade': [20,19,21,22,20],\n",
" 'Peso': [55,80,62,67,73],\n",
" 'Altura': [162,178,162,165,171]}"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" 1 | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" 2 | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" 3 | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" 4 | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"0 20 55 162\n",
"1 19 80 178\n",
"2 21 62 162\n",
"3 22 67 165\n",
"4 20 73 171"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(dicionario_lista_exemplo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Mais exemplos:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20 55 162\n",
"João 19 80 178\n",
"Maria 21 62 162\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(dicionario_lista_exemplo, index=['Ana','João','Maria','Pedro','Túlio'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Exemplos com dicionário de *arrays* do *numpy*:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"dicionario_array_exemplo = {'Idade': np.array([20,19,21,22,20]),\n",
" 'Peso': np.array([55,80,62,67,73]),\n",
" 'Altura': np.array([162,178,162,165,171])}"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" 1 | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" 2 | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" 3 | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" 4 | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"0 20 55 162\n",
"1 19 80 178\n",
"2 21 62 162\n",
"3 22 67 165\n",
"4 20 73 171"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(dicionario_array_exemplo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Mais exemplos:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20 55 162\n",
"João 19 80 178\n",
"Maria 21 62 162\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(dicionario_array_exemplo, index=['Ana','João','Maria','Pedro','Túlio'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Criando um *DataFrame* a partir de uma *Series* do *pandas*\n",
"\n",
"Neste caso, o *DataFrame* terá o mesmo *index* que a *Series* do *pandas* e apenas uma coluna."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"series_exemplo = pd.Series({'Ana':20, 'João': 19, 'Maria': 21, 'Pedro': 22, 'Túlio': 20})"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0\n",
"Ana 20\n",
"João 19\n",
"Maria 21\n",
"Pedro 22\n",
"Túlio 20"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(series_exemplo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Caso a *Series* possua um atributo `name` especificado, este será o nome da coluna do *DataFrame*."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"series_exemplo_Idade = pd.Series({'Ana':20, 'João': 19, 'Maria': 21, 'Pedro': 22, 'Túlio': 20}, name=\"Idade\")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade\n",
"Ana 20\n",
"João 19\n",
"Maria 21\n",
"Pedro 22\n",
"Túlio 20"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(series_exemplo_Idade)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Criando um *DataFrame* a partir de lista de *Series* do *pandas*\n",
"\n",
"Neste caso, a entrada dos dados da lista no *DataFrame* será feita por linha."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Ana | \n",
" João | \n",
" Maria | \n",
" Pedro | \n",
" Túlio | \n",
"
\n",
" \n",
" \n",
" \n",
" Peso | \n",
" 55.0 | \n",
" 80.0 | \n",
" 62.0 | \n",
" 67.0 | \n",
" 73.0 | \n",
"
\n",
" \n",
" Altura | \n",
" 162.0 | \n",
" 178.0 | \n",
" 162.0 | \n",
" 165.0 | \n",
" 171.0 | \n",
"
\n",
" \n",
" Idade | \n",
" 20.0 | \n",
" 19.0 | \n",
" 21.0 | \n",
" 22.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Ana João Maria Pedro Túlio\n",
"Peso 55.0 80.0 62.0 67.0 73.0\n",
"Altura 162.0 178.0 162.0 165.0 171.0\n",
"Idade 20.0 19.0 21.0 22.0 NaN"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame([serie_Peso, serie_Altura, serie_Idade])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Podemos corrigir a orientação usando o método `transpose`."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Peso | \n",
" Altura | \n",
" Idade | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 55.0 | \n",
" 162.0 | \n",
" 20.0 | \n",
"
\n",
" \n",
" João | \n",
" 80.0 | \n",
" 178.0 | \n",
" 19.0 | \n",
"
\n",
" \n",
" Maria | \n",
" 62.0 | \n",
" 162.0 | \n",
" 21.0 | \n",
"
\n",
" \n",
" Pedro | \n",
" 67.0 | \n",
" 165.0 | \n",
" 22.0 | \n",
"
\n",
" \n",
" Túlio | \n",
" 73.0 | \n",
" 171.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Peso Altura Idade\n",
"Ana 55.0 162.0 20.0\n",
"João 80.0 178.0 19.0\n",
"Maria 62.0 162.0 21.0\n",
"Pedro 67.0 165.0 22.0\n",
"Túlio 73.0 171.0 NaN"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame([serie_Peso, serie_Altura, serie_Idade]).transpose()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Criando um *DataFrame* a partir de arquivos\n",
"\n",
"Para criar um *DataFrame* a partir de um arquivo, precisamos de funções do tipo `pd.read_FORMATO`, onde `FORMATO` indica o formato a ser importado sob o pressuposto de que a biblioteca *pandas* foi devidamente importada com `pd`."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Os formatos mais comuns são: \n",
"\n",
"* *csv* (comma-separated values), \n",
"* *xls* ou *xlsx* (formatos do Microsoft Excel),\n",
"* *hdf5* (comumente utilizado em *big data*), \n",
"* *json* (comumente utilizado em páginas da internet)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"As funções para leitura correspondentes são: \n",
"* `pd.read_csv`, \n",
"* `pd.read_excel`, \n",
"* `pd.read_hdf`, \n",
"* `pd.read_json`, \n",
"\n",
"respectivamente."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"De todas elas, a função mais utilizada é `read_csv`. Ela possui vários argumentos. Vejamos os mais utilizados:\n",
"\n",
"* `file_path_or_buffer`: o endereço do arquivo a ser lido. Pode ser um endereço da internet.\n",
"* `sep`: o separador entre as entradas de dados. O separador padrão é `,`.\n",
"* `index_col`: a coluna que deve ser usada para formar o *index*. O padrão é `None`. Porém pode ser alterado para outro. Um separador comumente encontrado é o `\\t` (TAB).\n",
"* `names`: nomes das colunas a serem usadas. O padrão é `None`.\n",
"* `header`: número da linha que servirá como nome para as colunas. O padrão é `infer` (ou seja, tenta deduzir automaticamente). Se os nomes das colunas forem passados através do `names`, então `header` será automaticamente considerado como `None`. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Exemplo:** considere o arquivo `data/exemplo_data.csv` contendo:\n",
"\n",
"```\n",
",coluna_1,coluna_2\n",
"2020-01-01,-0.4160923582996922,1.8103644347460834\n",
"2020-01-02,-0.1379696602473578,2.5785204825192785\n",
"2020-01-03,0.5758273450544708,0.06086648807755068\n",
"2020-01-04,-0.017367186564883633,1.2995865328684455\n",
"2020-01-05,1.3842792448510655,-0.3817320973859929\n",
"2020-01-06,0.5497056238566345,-1.308789022968975\n",
"2020-01-07,-0.2822962331437976,-1.6889791765925102\n",
"2020-01-08,-0.9897300598660013,-0.028120707936426497\n",
"2020-01-09,0.27558240737928663,-0.1776585993494299\n",
"2020-01-10,0.6851316082235455,0.5025348904591399\n",
"``` \n",
"\n",
"Para ler o arquivo acima basta fazer:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df_exemplo_0 = pd.read_csv('data/exemplo_data.csv')"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" coluna_1 | \n",
" coluna_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2020-01-01 | \n",
" -0.416092 | \n",
" 1.810364 | \n",
"
\n",
" \n",
" 1 | \n",
" 2020-01-02 | \n",
" -0.137970 | \n",
" 2.578520 | \n",
"
\n",
" \n",
" 2 | \n",
" 2020-01-03 | \n",
" 0.575827 | \n",
" 0.060866 | \n",
"
\n",
" \n",
" 3 | \n",
" 2020-01-04 | \n",
" -0.017367 | \n",
" 1.299587 | \n",
"
\n",
" \n",
" 4 | \n",
" 2020-01-05 | \n",
" 1.384279 | \n",
" -0.381732 | \n",
"
\n",
" \n",
" 5 | \n",
" 2020-01-06 | \n",
" 0.549706 | \n",
" -1.308789 | \n",
"
\n",
" \n",
" 6 | \n",
" 2020-01-07 | \n",
" -0.282296 | \n",
" -1.688979 | \n",
"
\n",
" \n",
" 7 | \n",
" 2020-01-08 | \n",
" -0.989730 | \n",
" -0.028121 | \n",
"
\n",
" \n",
" 8 | \n",
" 2020-01-09 | \n",
" 0.275582 | \n",
" -0.177659 | \n",
"
\n",
" \n",
" 9 | \n",
" 2020-01-10 | \n",
" 0.685132 | \n",
" 0.502535 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 coluna_1 coluna_2\n",
"0 2020-01-01 -0.416092 1.810364\n",
"1 2020-01-02 -0.137970 2.578520\n",
"2 2020-01-03 0.575827 0.060866\n",
"3 2020-01-04 -0.017367 1.299587\n",
"4 2020-01-05 1.384279 -0.381732\n",
"5 2020-01-06 0.549706 -1.308789\n",
"6 2020-01-07 -0.282296 -1.688979\n",
"7 2020-01-08 -0.989730 -0.028121\n",
"8 2020-01-09 0.275582 -0.177659\n",
"9 2020-01-10 0.685132 0.502535"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo_0"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"No exemplo anterior, as colunas receberam nomes corretamentes exceto pela primeira coluna que gostaríamos de considerar como *index*. Neste caso fazemos:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"df_exemplo = pd.read_csv('data/exemplo_data.csv', index_col=0)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-01 | \n",
" -0.416092 | \n",
" 1.810364 | \n",
"
\n",
" \n",
" 2020-01-02 | \n",
" -0.137970 | \n",
" 2.578520 | \n",
"
\n",
" \n",
" 2020-01-03 | \n",
" 0.575827 | \n",
" 0.060866 | \n",
"
\n",
" \n",
" 2020-01-04 | \n",
" -0.017367 | \n",
" 1.299587 | \n",
"
\n",
" \n",
" 2020-01-05 | \n",
" 1.384279 | \n",
" -0.381732 | \n",
"
\n",
" \n",
" 2020-01-06 | \n",
" 0.549706 | \n",
" -1.308789 | \n",
"
\n",
" \n",
" 2020-01-07 | \n",
" -0.282296 | \n",
" -1.688979 | \n",
"
\n",
" \n",
" 2020-01-08 | \n",
" -0.989730 | \n",
" -0.028121 | \n",
"
\n",
" \n",
" 2020-01-09 | \n",
" 0.275582 | \n",
" -0.177659 | \n",
"
\n",
" \n",
" 2020-01-10 | \n",
" 0.685132 | \n",
" 0.502535 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2\n",
"2020-01-01 -0.416092 1.810364\n",
"2020-01-02 -0.137970 2.578520\n",
"2020-01-03 0.575827 0.060866\n",
"2020-01-04 -0.017367 1.299587\n",
"2020-01-05 1.384279 -0.381732\n",
"2020-01-06 0.549706 -1.308789\n",
"2020-01-07 -0.282296 -1.688979\n",
"2020-01-08 -0.989730 -0.028121\n",
"2020-01-09 0.275582 -0.177659\n",
"2020-01-10 0.685132 0.502535"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### O método *head* do *DataFrame*\n",
"\n",
"O método `head`, sem argumento, permite que visualizemos as 5 primeiras linhas do *DataFrame*."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-01 | \n",
" -0.416092 | \n",
" 1.810364 | \n",
"
\n",
" \n",
" 2020-01-02 | \n",
" -0.137970 | \n",
" 2.578520 | \n",
"
\n",
" \n",
" 2020-01-03 | \n",
" 0.575827 | \n",
" 0.060866 | \n",
"
\n",
" \n",
" 2020-01-04 | \n",
" -0.017367 | \n",
" 1.299587 | \n",
"
\n",
" \n",
" 2020-01-05 | \n",
" 1.384279 | \n",
" -0.381732 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2\n",
"2020-01-01 -0.416092 1.810364\n",
"2020-01-02 -0.137970 2.578520\n",
"2020-01-03 0.575827 0.060866\n",
"2020-01-04 -0.017367 1.299587\n",
"2020-01-05 1.384279 -0.381732"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se for passado um argumento com valor `n`, as `n` primeiras linhas são impressas."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-01 | \n",
" -0.416092 | \n",
" 1.810364 | \n",
"
\n",
" \n",
" 2020-01-02 | \n",
" -0.137970 | \n",
" 2.578520 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2\n",
"2020-01-01 -0.416092 1.810364\n",
"2020-01-02 -0.137970 2.578520"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-01 | \n",
" -0.416092 | \n",
" 1.810364 | \n",
"
\n",
" \n",
" 2020-01-02 | \n",
" -0.137970 | \n",
" 2.578520 | \n",
"
\n",
" \n",
" 2020-01-03 | \n",
" 0.575827 | \n",
" 0.060866 | \n",
"
\n",
" \n",
" 2020-01-04 | \n",
" -0.017367 | \n",
" 1.299587 | \n",
"
\n",
" \n",
" 2020-01-05 | \n",
" 1.384279 | \n",
" -0.381732 | \n",
"
\n",
" \n",
" 2020-01-06 | \n",
" 0.549706 | \n",
" -1.308789 | \n",
"
\n",
" \n",
" 2020-01-07 | \n",
" -0.282296 | \n",
" -1.688979 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2\n",
"2020-01-01 -0.416092 1.810364\n",
"2020-01-02 -0.137970 2.578520\n",
"2020-01-03 0.575827 0.060866\n",
"2020-01-04 -0.017367 1.299587\n",
"2020-01-05 1.384279 -0.381732\n",
"2020-01-06 0.549706 -1.308789\n",
"2020-01-07 -0.282296 -1.688979"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.head(7)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### O método `tail` do *DataFrame*\n",
"\n",
"O método `tail`, sem argumento, retorna as últimas 5 linhas do *DataFrame*."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-06 | \n",
" 0.549706 | \n",
" -1.308789 | \n",
"
\n",
" \n",
" 2020-01-07 | \n",
" -0.282296 | \n",
" -1.688979 | \n",
"
\n",
" \n",
" 2020-01-08 | \n",
" -0.989730 | \n",
" -0.028121 | \n",
"
\n",
" \n",
" 2020-01-09 | \n",
" 0.275582 | \n",
" -0.177659 | \n",
"
\n",
" \n",
" 2020-01-10 | \n",
" 0.685132 | \n",
" 0.502535 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2\n",
"2020-01-06 0.549706 -1.308789\n",
"2020-01-07 -0.282296 -1.688979\n",
"2020-01-08 -0.989730 -0.028121\n",
"2020-01-09 0.275582 -0.177659\n",
"2020-01-10 0.685132 0.502535"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se for passado um argumento com valor `n`, as `n` últimas linhas são impressas."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-09 | \n",
" 0.275582 | \n",
" -0.177659 | \n",
"
\n",
" \n",
" 2020-01-10 | \n",
" 0.685132 | \n",
" 0.502535 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2\n",
"2020-01-09 0.275582 -0.177659\n",
"2020-01-10 0.685132 0.502535"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.tail(2)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-04 | \n",
" -0.017367 | \n",
" 1.299587 | \n",
"
\n",
" \n",
" 2020-01-05 | \n",
" 1.384279 | \n",
" -0.381732 | \n",
"
\n",
" \n",
" 2020-01-06 | \n",
" 0.549706 | \n",
" -1.308789 | \n",
"
\n",
" \n",
" 2020-01-07 | \n",
" -0.282296 | \n",
" -1.688979 | \n",
"
\n",
" \n",
" 2020-01-08 | \n",
" -0.989730 | \n",
" -0.028121 | \n",
"
\n",
" \n",
" 2020-01-09 | \n",
" 0.275582 | \n",
" -0.177659 | \n",
"
\n",
" \n",
" 2020-01-10 | \n",
" 0.685132 | \n",
" 0.502535 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2\n",
"2020-01-04 -0.017367 1.299587\n",
"2020-01-05 1.384279 -0.381732\n",
"2020-01-06 0.549706 -1.308789\n",
"2020-01-07 -0.282296 -1.688979\n",
"2020-01-08 -0.989730 -0.028121\n",
"2020-01-09 0.275582 -0.177659\n",
"2020-01-10 0.685132 0.502535"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.tail(7)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Atributos de *Series* e *DataFrames*\n",
"\n",
"Atributos comumente usados para *Series* e *DataFrames* são:\n",
"\n",
"* `shape`: fornece as dimensões do objeto em questão (*Series* ou *DataFrame*) em formato consistente com o atributo `shape` de um *array* do *numpy*.\n",
"* `index`: fornece o índice do objeto. No caso do *DataFrame* são os rótulos das linhas.\n",
"* `columns`: fornece as colunas (apenas disponível para *DataFrames*) "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Exemplo:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10, 2)"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.shape"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"serie_1 = pd.Series([1,2,3,4,5])"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(5,)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_1.shape"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05',\n",
" '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10'],\n",
" dtype='object')"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.index"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=5, step=1)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_1.index"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['coluna_1', 'coluna_2'], dtype='object')"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se quisermos obter os dados contidos nos *index* ou nas *Series* podemos utilizar a propriedade `.array`."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"[0, 1, 2, 3, 4]\n",
"Length: 5, dtype: int64"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_1.index.array"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"['coluna_1', 'coluna_2']\n",
"Length: 2, dtype: object"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.columns.array"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se o interesse for obter os dados como um `array` do *numpy*, devemos utilizar o método `.to_numpy()`.\n",
"\n",
"Exemplo:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 2, 3, 4])"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_1.index.to_numpy()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['coluna_1', 'coluna_2'], dtype=object)"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.columns.to_numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"O método `.to_numpy()` também está disponível em *DataFrames*:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.41609236, 1.81036443],\n",
" [-0.13796966, 2.57852048],\n",
" [ 0.57582735, 0.06086649],\n",
" [-0.01736719, 1.29958653],\n",
" [ 1.38427924, -0.3817321 ],\n",
" [ 0.54970562, -1.30878902],\n",
" [-0.28229623, -1.68897918],\n",
" [-0.98973006, -0.02812071],\n",
" [ 0.27558241, -0.1776586 ],\n",
" [ 0.68513161, 0.50253489]])"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.to_numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A função do *numpy* `asarray()` é compatível com *index*, *columns* e *DataFrames* do *pandas*:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',\n",
" '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',\n",
" '2020-01-09', '2020-01-10'], dtype=object)"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.asarray(df_exemplo.index)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['coluna_1', 'coluna_2'], dtype=object)"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.asarray(df_exemplo.columns)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.41609236, 1.81036443],\n",
" [-0.13796966, 2.57852048],\n",
" [ 0.57582735, 0.06086649],\n",
" [-0.01736719, 1.29958653],\n",
" [ 1.38427924, -0.3817321 ],\n",
" [ 0.54970562, -1.30878902],\n",
" [-0.28229623, -1.68897918],\n",
" [-0.98973006, -0.02812071],\n",
" [ 0.27558241, -0.1776586 ],\n",
" [ 0.68513161, 0.50253489]])"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.asarray(df_exemplo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Informações sobre as colunas de um *DataFrame*\n",
"\n",
"Para obtermos uma breve descrição sobre as colunas de um *DataFrame* utilizamos o método `info`.\n",
"\n",
"Exemplo:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Index: 10 entries, 2020-01-01 to 2020-01-10\n",
"Data columns (total 2 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 coluna_1 10 non-null float64\n",
" 1 coluna_2 10 non-null float64\n",
"dtypes: float64(2)\n",
"memory usage: 240.0+ bytes\n"
]
}
],
"source": [
"df_exemplo.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Criando arquivos a partir de *DataFrames*\n",
"\n",
"Para criar arquivos a partir de *DataFrames*, basta utilizar os métodos do tipo `pd.to_FORMATO`, onde `FORMATO` indica o formato a ser exportado e supondo que a biblioteca *pandas* foi importada com `pd`."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Com relação aos tipos de arquivo anteriores, os métodos para exportação correspondentes são:\n",
"* `.to_csv` ('endereço_do_arquivo'), \n",
"* `.to_excel` ('endereço_do_arquivo'), \n",
"* `.to_hdf` ('endereço_do_arquivo'), \n",
"* `.to_json`('endereço_do_arquivo'), \n",
"\n",
"onde `endereço_do_arquivo` é uma `str` que contém o endereço do arquivo a ser exportado."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Exemplo:\n",
" \n",
"Para exportar para o arquivo `exemplo_novo.csv`, utilizaremos o método `.to_csv` ao *DataFrame* `df_exemplo`:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"df_exemplo.to_csv('data/exemplo_novo.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exemplo COVID-19 PB"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Dados diários de COVID-19 do estado da Paraíba:\n",
"\n",
"*Fonte: https://superset.plataformatarget.com.br/superset/dashboard/microdados/*"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"dados_covid_PB = pd.read_csv('https://superset.plataformatarget.com.br/superset/explore_json/?form_data=%7B%22slice_id%22%3A1550%7D&csv=true', \n",
" sep=',', index_col=0)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Index: 331 entries, 2021-02-09 to 2020-03-16\n",
"Data columns (total 7 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 casosAcumulados 331 non-null int64 \n",
" 1 casosNovos 331 non-null int64 \n",
" 2 descartados 331 non-null int64 \n",
" 3 recuperados 331 non-null int64 \n",
" 4 obitosAcumulados 331 non-null int64 \n",
" 5 obitosNovos 331 non-null int64 \n",
" 6 Letalidade 331 non-null float64\n",
"dtypes: float64(1), int64(6)\n",
"memory usage: 20.7+ KB\n"
]
}
],
"source": [
"dados_covid_PB.info()"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
" Letalidade | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2021-02-09 | \n",
" 199706 | \n",
" 971 | \n",
" 237369 | \n",
" 152779 | \n",
" 4171 | \n",
" 13 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2021-02-08 | \n",
" 198735 | \n",
" 611 | \n",
" 237189 | \n",
" 151879 | \n",
" 4158 | \n",
" 12 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2021-02-07 | \n",
" 198124 | \n",
" 664 | \n",
" 237072 | \n",
" 151535 | \n",
" 4146 | \n",
" 11 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2021-02-06 | \n",
" 197460 | \n",
" 918 | \n",
" 236774 | \n",
" 150175 | \n",
" 4135 | \n",
" 12 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2021-02-05 | \n",
" 196542 | \n",
" 1060 | \n",
" 236216 | \n",
" 150169 | \n",
" 4123 | \n",
" 13 | \n",
" 0.0210 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"data \n",
"2021-02-09 199706 971 237369 152779 \n",
"2021-02-08 198735 611 237189 151879 \n",
"2021-02-07 198124 664 237072 151535 \n",
"2021-02-06 197460 918 236774 150175 \n",
"2021-02-05 196542 1060 236216 150169 \n",
"\n",
" obitosAcumulados obitosNovos Letalidade \n",
"data \n",
"2021-02-09 4171 13 0.0209 \n",
"2021-02-08 4158 12 0.0209 \n",
"2021-02-07 4146 11 0.0209 \n",
"2021-02-06 4135 12 0.0209 \n",
"2021-02-05 4123 13 0.0210 "
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.head()"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
" Letalidade | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-03-20 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2020-03-19 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2020-03-18 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2020-03-17 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2020-03-16 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"data \n",
"2020-03-20 0 0 0 0 \n",
"2020-03-19 0 0 0 0 \n",
"2020-03-18 0 0 0 0 \n",
"2020-03-17 0 0 0 0 \n",
"2020-03-16 0 0 0 0 \n",
"\n",
" obitosAcumulados obitosNovos Letalidade \n",
"data \n",
"2020-03-20 0 0 0.0 \n",
"2020-03-19 0 0 0.0 \n",
"2020-03-18 0 0 0.0 \n",
"2020-03-17 0 0 0.0 \n",
"2020-03-16 0 0 0.0 "
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.tail()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"dados_covid_PB['estado'] = 'PB'"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
" Letalidade | \n",
" estado | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2021-02-09 | \n",
" 199706 | \n",
" 971 | \n",
" 237369 | \n",
" 152779 | \n",
" 4171 | \n",
" 13 | \n",
" 0.0209 | \n",
" PB | \n",
"
\n",
" \n",
" 2021-02-08 | \n",
" 198735 | \n",
" 611 | \n",
" 237189 | \n",
" 151879 | \n",
" 4158 | \n",
" 12 | \n",
" 0.0209 | \n",
" PB | \n",
"
\n",
" \n",
" 2021-02-07 | \n",
" 198124 | \n",
" 664 | \n",
" 237072 | \n",
" 151535 | \n",
" 4146 | \n",
" 11 | \n",
" 0.0209 | \n",
" PB | \n",
"
\n",
" \n",
" 2021-02-06 | \n",
" 197460 | \n",
" 918 | \n",
" 236774 | \n",
" 150175 | \n",
" 4135 | \n",
" 12 | \n",
" 0.0209 | \n",
" PB | \n",
"
\n",
" \n",
" 2021-02-05 | \n",
" 196542 | \n",
" 1060 | \n",
" 236216 | \n",
" 150169 | \n",
" 4123 | \n",
" 13 | \n",
" 0.0210 | \n",
" PB | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"data \n",
"2021-02-09 199706 971 237369 152779 \n",
"2021-02-08 198735 611 237189 151879 \n",
"2021-02-07 198124 664 237072 151535 \n",
"2021-02-06 197460 918 236774 150175 \n",
"2021-02-05 196542 1060 236216 150169 \n",
"\n",
" obitosAcumulados obitosNovos Letalidade estado \n",
"data \n",
"2021-02-09 4171 13 0.0209 PB \n",
"2021-02-08 4158 12 0.0209 PB \n",
"2021-02-07 4146 11 0.0209 PB \n",
"2021-02-06 4135 12 0.0209 PB \n",
"2021-02-05 4123 13 0.0210 PB "
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.head()"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"dados_covid_PB.to_csv('data/dadoscovidpb.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Índices dos valores máximos ou mínimos\n",
"\n",
"Os métodos `idxmin()` e `idxmax()` retornam o *index* cuja entrada fornece o valor mínimo ou máximo da *Series* ou *DataFrame*. Se houver múltiplas ocorrências de mínimos ou máximos, o método retorna a primeira ocorrência.\n",
"\n",
"Vamos recriar um *DataFrame* genérico."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"serie_Idade = pd.Series({'Ana':20, 'João': 19, 'Maria': 21, 'Pedro': 22, 'Túlio': 20}, name=\"Idade\")\n",
"serie_Peso = pd.Series({'Ana':55, 'João': 80, 'Maria': 62, 'Pedro': 67, 'Túlio': 73}, name=\"Peso\")\n",
"serie_Altura = pd.Series({'Ana':162, 'João': 178, 'Maria': 162, 'Pedro': 165, 'Túlio': 171}, name=\"Altura\")"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"dicionario_series_exemplo = {'Idade': serie_Idade, 'Peso': serie_Peso, 'Altura': serie_Altura}"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"df_dict_series = pd.DataFrame(dicionario_series_exemplo)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20 55 162\n",
"João 19 80 178\n",
"Maria 21 62 162\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assim, podemos localizar quem possui menores idade, peso e altura."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Idade João\n",
"Peso Ana\n",
"Altura Ana\n",
"dtype: object"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.idxmin()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"De igual forma, localizamos quem possui maiores idade, peso e altura."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Idade Pedro\n",
"Peso João\n",
"Altura João\n",
"dtype: object"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.idxmax()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Exemplo:** Aplicaremos as funções `idxmin()` e `idxmax()` aos dados do arquivo `data/exemplo_data.csv` para localizar entradas de interesse."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-01 | \n",
" -0.416092 | \n",
" 1.810364 | \n",
"
\n",
" \n",
" 2020-01-02 | \n",
" -0.137970 | \n",
" 2.578520 | \n",
"
\n",
" \n",
" 2020-01-03 | \n",
" 0.575827 | \n",
" 0.060866 | \n",
"
\n",
" \n",
" 2020-01-04 | \n",
" -0.017367 | \n",
" 1.299587 | \n",
"
\n",
" \n",
" 2020-01-05 | \n",
" 1.384279 | \n",
" -0.381732 | \n",
"
\n",
" \n",
" 2020-01-06 | \n",
" 0.549706 | \n",
" -1.308789 | \n",
"
\n",
" \n",
" 2020-01-07 | \n",
" -0.282296 | \n",
" -1.688979 | \n",
"
\n",
" \n",
" 2020-01-08 | \n",
" -0.989730 | \n",
" -0.028121 | \n",
"
\n",
" \n",
" 2020-01-09 | \n",
" 0.275582 | \n",
" -0.177659 | \n",
"
\n",
" \n",
" 2020-01-10 | \n",
" 0.685132 | \n",
" 0.502535 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2\n",
"2020-01-01 -0.416092 1.810364\n",
"2020-01-02 -0.137970 2.578520\n",
"2020-01-03 0.575827 0.060866\n",
"2020-01-04 -0.017367 1.299587\n",
"2020-01-05 1.384279 -0.381732\n",
"2020-01-06 0.549706 -1.308789\n",
"2020-01-07 -0.282296 -1.688979\n",
"2020-01-08 -0.989730 -0.028121\n",
"2020-01-09 0.275582 -0.177659\n",
"2020-01-10 0.685132 0.502535"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo = pd.read_csv('data/exemplo_data.csv', index_col=0); df_exemplo"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df_exemplo = pd.DataFrame(df_exemplo, columns=['coluna_1','coluna_2','coluna_3'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inserimos uma terceira coluna com dados fictícios."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
" coluna_3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-01 | \n",
" -0.416092 | \n",
" 1.810364 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2020-01-02 | \n",
" -0.137970 | \n",
" 2.578520 | \n",
" 2.0 | \n",
"
\n",
" \n",
" 2020-01-03 | \n",
" 0.575827 | \n",
" 0.060866 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 2020-01-04 | \n",
" -0.017367 | \n",
" 1.299587 | \n",
" 4.0 | \n",
"
\n",
" \n",
" 2020-01-05 | \n",
" 1.384279 | \n",
" -0.381732 | \n",
" 5.0 | \n",
"
\n",
" \n",
" 2020-01-06 | \n",
" 0.549706 | \n",
" -1.308789 | \n",
" 6.0 | \n",
"
\n",
" \n",
" 2020-01-07 | \n",
" -0.282296 | \n",
" -1.688979 | \n",
" 7.0 | \n",
"
\n",
" \n",
" 2020-01-08 | \n",
" -0.989730 | \n",
" -0.028121 | \n",
" 8.0 | \n",
"
\n",
" \n",
" 2020-01-09 | \n",
" 0.275582 | \n",
" -0.177659 | \n",
" NaN | \n",
"
\n",
" \n",
" 2020-01-10 | \n",
" 0.685132 | \n",
" 0.502535 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2 coluna_3\n",
"2020-01-01 -0.416092 1.810364 1.0\n",
"2020-01-02 -0.137970 2.578520 2.0\n",
"2020-01-03 0.575827 0.060866 3.0\n",
"2020-01-04 -0.017367 1.299587 4.0\n",
"2020-01-05 1.384279 -0.381732 5.0\n",
"2020-01-06 0.549706 -1.308789 6.0\n",
"2020-01-07 -0.282296 -1.688979 7.0\n",
"2020-01-08 -0.989730 -0.028121 8.0\n",
"2020-01-09 0.275582 -0.177659 NaN\n",
"2020-01-10 0.685132 0.502535 NaN"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo['coluna_3'] = pd.Series([1,2,3,4,5,6,7,8,np.nan,np.nan],index=df_exemplo.index)\n",
"df_exemplo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Os *index* correspondentes aos menores e maiores valores são datas, evidentemente."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"coluna_1 2020-01-08\n",
"coluna_2 2020-01-07\n",
"coluna_3 2020-01-01\n",
"dtype: object"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.idxmin()"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"coluna_1 2020-01-05\n",
"coluna_2 2020-01-02\n",
"coluna_3 2020-01-08\n",
"dtype: object"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.idxmax()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Reindexação de *DataFrames*\n",
"\n",
"No *pandas*, o método `reindex`\n",
"\n",
"* reordena o *DataFrame* de acordo com o conjunto de rótulos inserido como argumento;\n",
"* insere valores faltantes caso um rótulo do novo *index* não tenha valor atribuído no conjunto de dados;\n",
"* remove valores correspondentes a rótulos que não estão presentes no novo *index*."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Exemplos:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Altura | \n",
" Peso | \n",
" IMC | \n",
"
\n",
" \n",
" \n",
" \n",
" Victor | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" Túlio | \n",
" 171.0 | \n",
" 73.0 | \n",
" NaN | \n",
"
\n",
" \n",
" Pedro | \n",
" 165.0 | \n",
" 67.0 | \n",
" NaN | \n",
"
\n",
" \n",
" João | \n",
" 178.0 | \n",
" 80.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Altura Peso IMC\n",
"Victor NaN NaN NaN\n",
"Túlio 171.0 73.0 NaN\n",
"Pedro 165.0 67.0 NaN\n",
"João 178.0 80.0 NaN"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.reindex(index=['Victor', 'Túlio', 'Pedro', 'João'], columns=['Altura','Peso','IMC'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Remoção de linhas ou colunas de um *DataFrame*\n",
"\n",
"Para remover linhas ou colunas de um *DataFrame* do *pandas* podemos utilizar o método `drop`. O argumento `axis` identifica o eixo de remoção: `axis=0`, que é o padrão, indica a remoção de uma ou mais linhas; `axis=1` indica a remoção de uma ou mais colunas."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Nos exemplos que segue, note que novos *DataFrames* são obtidos a partir de `df_dict_series` sem que este seja sobrescrito."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20 55 162\n",
"João 19 80 178\n",
"Maria 21 62 162\n",
"Pedro 22 67 165"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.drop('Túlio') # axis=0 implícito "
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"João 19 80 178\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.drop(['Ana','Maria'], axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Peso Altura\n",
"Ana 55 162\n",
"João 80 178\n",
"Maria 62 162\n",
"Pedro 67 165\n",
"Túlio 73 171"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.drop(['Idade'], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Renomeando *index* e *columns*\n",
"\n",
"O método `rename` retorna uma cópia na qual o *index* (no caso de *Series* e *DataFrames*) e *columns* (no caso de *DataFrames*) foram renomeados. O método aceita como entrada um dicionário, uma *Series* do *pandas* ou uma função."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Exemplo:"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"serie_exemplo = pd.Series([1,2,3], index=['a','b','c'])"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 1\n",
"b 2\n",
"c 3\n",
"dtype: int64"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"abacaxi 1\n",
"banana 2\n",
"cebola 3\n",
"dtype: int64"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo.rename({'a':'abacaxi', 'b':'banana', 'c': 'cebola'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exemplo:"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20 55 162\n",
"João 19 80 178\n",
"Maria 21 62 162\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" I | \n",
" P | \n",
" A | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" j | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" m | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" p | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" t | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" I P A\n",
"a 20 55 162\n",
"j 19 80 178\n",
"m 21 62 162\n",
"p 22 67 165\n",
"t 20 73 171"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.rename(index = {'Ana':'a', 'João':'j', 'Maria':'m', 'Pedro':'p','Túlio':'t'},\n",
" columns = {'Idade':'I', 'Peso':'P','Altura':'A'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No próximo exemplo, usamos uma *Series* para renomear os rótulos."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"indice_novo = pd.Series({'Ana':'a', 'João':'j', 'Maria':'m', 'Pedro':'p','Túlio':'t'})"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" a | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" j | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" m | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" p | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" t | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"a 20 55 162\n",
"j 19 80 178\n",
"m 21 62 162\n",
"p 22 67 165\n",
"t 20 73 171"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.rename(index = indice_novo)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Neste exemplo, usamos a função `str.upper` (altera a `str` para \"todas maiúsculas\") para renomear colunas."
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" IDADE | \n",
" PESO | \n",
" ALTURA | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" IDADE PESO ALTURA\n",
"Ana 20 55 162\n",
"João 19 80 178\n",
"Maria 21 62 162\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.rename(columns=str.upper)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Ordenando *Series* e *DataFrames*\n",
"\n",
"É possível ordenar ambos pelos rótulos do *index* (para tanto é necessário que eles sejam ordenáveis) ou por valores nas colunas. \n",
"\n",
"O método `sort_index` ordena a *Series* ou o *DataFrame* pelo *index*. O método `sort_values` ordena a *Series* ou o *DataFrame* pelos valores (escolhendo uma ou mais colunas no caso de *DataFrames*). No caso do *DataFrame*, o argumento `by` é necessário para indicar qual(is) coluna(s) será(ão) utilizada(s) como base para a ordenação."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Exemplos:"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Maria 21\n",
"Pedro 22\n",
"Túlio 20\n",
"João 19\n",
"Ana 20\n",
"dtype: int64"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_desordenada = pd.Series({'Maria': 21, 'Pedro': 22, 'Túlio': 20, 'João': 19, 'Ana':20}); \n",
"serie_desordenada"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ana 20\n",
"João 19\n",
"Maria 21\n",
"Pedro 22\n",
"Túlio 20\n",
"dtype: int64"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_desordenada.sort_index() # ordenação alfabética"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Mais exemplos:"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [],
"source": [
"df_desordenado = df_dict_series.reindex(index=['Pedro','Maria','Ana','Túlio','João'])"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Pedro 22 67 165\n",
"Maria 21 62 162\n",
"Ana 20 55 162\n",
"Túlio 20 73 171\n",
"João 19 80 178"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_desordenado"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20 55 162\n",
"João 19 80 178\n",
"Maria 21 62 162\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_desordenado.sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Mais exemplos:"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"João 19\n",
"Túlio 20\n",
"Ana 20\n",
"Maria 21\n",
"Pedro 22\n",
"dtype: int64"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_desordenada.sort_values()"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Maria 21 62 162\n",
"Ana 20 55 162\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171\n",
"João 19 80 178"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_desordenado.sort_values(by=['Altura']) # ordena por 'Altura'"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"No caso de \"empate\", podemos utilizar outra coluna para desempatar."
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20 55 162\n",
"Maria 21 62 162\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171\n",
"João 19 80 178"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_desordenado.sort_values(by=['Altura','Peso']) # usa a coluna 'Peso' para desempatar"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Os métodos `sort_index` e `sort_values` admitem o argumento opcional `ascending`, que permite inverter a ordenação caso tenha valor `False`."
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Túlio 20 73 171\n",
"Pedro 22 67 165\n",
"Maria 21 62 162\n",
"João 19 80 178\n",
"Ana 20 55 162"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_desordenado.sort_index(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Pedro 22 67 165\n",
"Maria 21 62 162\n",
"Ana 20 55 162\n",
"Túlio 20 73 171\n",
"João 19 80 178"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_desordenado.sort_values(by=['Idade'], ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Comparando *Series* e *DataFrames*\n",
"\n",
"*Series* e *DataFrames* possuem os seguintes métodos de comparação lógica: \n",
"\n",
"- `eq` (igual);\n",
"- `ne` (diferente);\n",
"- `lt` (menor do que);\n",
"- `gt` (maior do que);\n",
"- `le` (menor ou igual a); \n",
"- `ge` (maior ou igual a)\n",
"\n",
"que permitem a utilização dos operadores binários `==`, `!=`, `<`, `>`, `<=` e `>=`, respectivamente. As comparações são realizadas em cada entrada da *Series* ou do *DataFrame*.\n",
"\n",
"**Observação**: Para que esses métodos sejam aplicados, todos os objetos presentes nas colunas do *DataFrame* devem ser de mesma natureza. Por exemplo, se um *DataFrame* possui algumas colunas numéricas e outras com *strings*, ao realizar uma comparação do tipo `> 1`, um erro ocorrerá, pois o *pandas* tentará comparar objetos do tipo `int` com objetos do tipo `str`, assim gerando uma incompatibilidade.\n",
"\n",
"Exemplos:"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 1\n",
"b 2\n",
"c 3\n",
"dtype: int64"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"a False\n",
"b True\n",
"c False\n",
"dtype: bool"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo == 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"De outra forma:"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a False\n",
"b True\n",
"c False\n",
"dtype: bool"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo.eq(2)"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a False\n",
"b True\n",
"c True\n",
"dtype: bool"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo > 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ou, na forma funcional:"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a False\n",
"b True\n",
"c True\n",
"dtype: bool"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo.gt(1)"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
" coluna_3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-01 | \n",
" False | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 2020-01-02 | \n",
" False | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-03 | \n",
" False | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-04 | \n",
" False | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-05 | \n",
" True | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-06 | \n",
" False | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-07 | \n",
" False | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-08 | \n",
" False | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-09 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 2020-01-10 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2 coluna_3\n",
"2020-01-01 False True False\n",
"2020-01-02 False True True\n",
"2020-01-03 False False True\n",
"2020-01-04 False True True\n",
"2020-01-05 True False True\n",
"2020-01-06 False False True\n",
"2020-01-07 False False True\n",
"2020-01-08 False False True\n",
"2020-01-09 False False False\n",
"2020-01-10 False False False"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo > 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Importante:** Ao comparar *np.nan*, o resultado tipicamente é falso:"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan == np.nan"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan > np.nan"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan >= np.nan"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Só é verdadeiro para indicar que é diferente:"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan != np.nan"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Neste sentido, podemos ter tabelas iguais sem que a comparação usual funcione:"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# 'copy', como o nome sugere, gera uma cópia do DataFrame\n",
"df_exemplo_2 = df_exemplo.copy() "
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" coluna_1 | \n",
" coluna_2 | \n",
" coluna_3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-01-01 | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-02 | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-03 | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-04 | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-05 | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-06 | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-07 | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-08 | \n",
" True | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2020-01-09 | \n",
" True | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 2020-01-10 | \n",
" True | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" coluna_1 coluna_2 coluna_3\n",
"2020-01-01 True True True\n",
"2020-01-02 True True True\n",
"2020-01-03 True True True\n",
"2020-01-04 True True True\n",
"2020-01-05 True True True\n",
"2020-01-06 True True True\n",
"2020-01-07 True True True\n",
"2020-01-08 True True True\n",
"2020-01-09 True True False\n",
"2020-01-10 True True False"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df_exemplo == df_exemplo_2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"O motivo de haver entradas como `False` ainda que `df_exemplo_2` seja uma cópia exata de `df_exemplo` é a presença do `np.nan`. Neste caso, devemos utilizar o método `equals` para realizar a comparação."
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.equals(df_exemplo_2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Os métodos `any`, `all` e a propriedade `empty`\n",
"\n",
"O método `any` é aplicado a entradas booleanas (verdadeiras ou falsas) e retorna *verdadeiro* se existir alguma entrada verdadeira, ou *falso*, se todas forem falsas. O método `all` é aplicado a entradas booleanas e retorna *verdadeiro* se todas as entradas forem verdadeiras, ou *falso*, se houver pelo menos uma entrada falsa. A propriedade `empty` retorna *verdadeiro* se a *Series* ou o *DataFrame* estiver vazio, ou *falso* caso contrário."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Exemplos:"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a 1\n",
"b 2\n",
"c 3\n",
"dtype: int64"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a -1\n",
"b 0\n",
"c 1\n",
"dtype: int64"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo_2 = serie_exemplo-2; \n",
"serie_exemplo_2"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(serie_exemplo_2 > 0).any()"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(serie_exemplo > 1).all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Este exemplo reproduz um valor `False` único."
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df_exemplo == df_exemplo_2).all().all()"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"serie_exemplo.empty"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Mais exemplos:"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"coluna_1 True\n",
"coluna_2 True\n",
"coluna_3 True\n",
"dtype: bool"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df_exemplo == df_exemplo_2).any()"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_exemplo.empty"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [],
"source": [
"df_vazio = pd.DataFrame()"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_vazio.empty"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Selecionando colunas de um *DataFrame*\n",
"\n",
"Para selecionar colunas de um *DataFrame*, basta aplicar *colchetes* a uma lista contendo os nomes das colunas de interesse."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"No exemplo abaixo, temos um *DataFrame* contendo as colunas `'Idade'`, `'Peso'` e `'Altura'`. Selecionaremos `'Peso'` e `'Altura'`, apenas."
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Peso Altura\n",
"Ana 55 162\n",
"João 80 178\n",
"Maria 62 162\n",
"Pedro 67 165\n",
"Túlio 73 171"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series[['Peso','Altura']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se quisermos selecionar apenas uma coluna, não há necessidade de inserir uma lista. Basta utilizar o nome da coluna:"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ana 55\n",
"João 80\n",
"Maria 62\n",
"Pedro 67\n",
"Túlio 73\n",
"Name: Peso, dtype: int64"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series['Peso']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Para remover colunas, podemos utilizar o método `drop`."
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade\n",
"Ana 20\n",
"João 19\n",
"Maria 21\n",
"Pedro 22\n",
"Túlio 20"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.drop(['Peso','Altura'], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Criando novas colunas a partir de colunas existentes\n",
"\n",
"Um método eficiente para criar novas colunas a partir de colunas já existentes é `eval`. Neste método, podemos utilizar como argumento uma *string* contendo uma expressão matemática envolvendo nomes de colunas do *DataFrame*."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Como exemplo, vamos ver como calcular o IMC no *DataFrame* anterior:"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ana 20.957171\n",
"João 25.249337\n",
"Maria 23.624447\n",
"Pedro 24.609734\n",
"Túlio 24.964946\n",
"dtype: float64"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.eval('Peso/(Altura/100)**2')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se quisermos obter um *DataFrame* contendo o IMC como uma nova coluna, podemos utilizar o método `assign` (sem modificar o *DataFrame* original):"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
" IMC | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
" 20.96 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
" 25.25 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
" 23.62 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
" 24.61 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
" 24.96 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura IMC\n",
"Ana 20 55 162 20.96\n",
"João 19 80 178 25.25\n",
"Maria 21 62 162 23.62\n",
"Pedro 22 67 165 24.61\n",
"Túlio 20 73 171 24.96"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series.assign(IMC=round(df_dict_series.eval('Peso/(Altura/100)**2'),2))"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura\n",
"Ana 20 55 162\n",
"João 19 80 178\n",
"Maria 21 62 162\n",
"Pedro 22 67 165\n",
"Túlio 20 73 171"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series # não modificado"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Se quisermos modificar o *DataFrame* para incluir a coluna IMC fazemos:"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [],
"source": [
"df_dict_series['IMC']=round(df_dict_series.eval('Peso/(Altura/100)**2'),2)"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
" IMC | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
" 20.96 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
" 25.25 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
" 23.62 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
" 24.61 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
" 24.96 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura IMC\n",
"Ana 20 55 162 20.96\n",
"João 19 80 178 25.25\n",
"Maria 21 62 162 23.62\n",
"Pedro 22 67 165 24.61\n",
"Túlio 20 73 171 24.96"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series # modificado \"in-place\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Selecionando linhas de um *DataFrame*:\n",
"\n",
"Podemos selecionar linhas de um *DataFrame* de diversas formas diferentes. Veremos agora algumas dessas formas.\n",
"\n",
"Diferentemente da forma de selecionar colunas, para selecionar diretamente linhas de um *DataFrame* devemos utilizar o método `loc` (fornecendo o *index*, isto é, o rótulo da linha) ou o `iloc` (fornecendo a posição da linha):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trabalharemos a seguir com um banco de dados atualizado sobre a COVID-19. Para tanto, importaremos o módulo `datetime` que nos auxiliará com datas."
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
"import datetime"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"dados_covid_PB = pd.read_csv('https://superset.plataformatarget.com.br/superset/explore_json/?form_data=%7B%22slice_id%22%3A1550%7D&csv=true', \n",
" sep=',', index_col=0)\n",
"\n",
"# busca o banco na data D-1, visto que a atualização\n",
"# ocorre em D\n",
"ontem = (datetime.date.today() - datetime.timedelta(days=1)).strftime('%Y-%m-%d') "
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
" Letalidade | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2021-02-09 | \n",
" 199706 | \n",
" 971 | \n",
" 237369 | \n",
" 152779 | \n",
" 4171 | \n",
" 13 | \n",
" 0.0209 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"data \n",
"2021-02-09 199706 971 237369 152779 \n",
"\n",
" obitosAcumulados obitosNovos Letalidade \n",
"data \n",
"2021-02-09 4171 13 0.0209 "
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Podemos ver as informações de um único dia como argumento. Para tanto, excluímos a coluna `'Letalidade'` (valor não inteiro) e convertemos o restante para valores inteiros:"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"casosAcumulados 199706\n",
"casosNovos 971\n",
"descartados 237369\n",
"recuperados 152779\n",
"obitosAcumulados 4171\n",
"obitosNovos 13\n",
"Name: 2021-02-09, dtype: int64"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.loc[ontem].drop('Letalidade').astype('int')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Podemos selecionar um intervalo de datas como argumento (excluindo letalidade):"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
"
\n",
" \n",
" \n",
" \n",
" 2021-02-01 | \n",
" 192598 | \n",
" 1014 | \n",
" 234215 | \n",
" 149235 | \n",
" 4068 | \n",
" 12 | \n",
"
\n",
" \n",
" 2021-02-02 | \n",
" 193465 | \n",
" 867 | \n",
" 234366 | \n",
" 149242 | \n",
" 4082 | \n",
" 14 | \n",
"
\n",
" \n",
" 2021-02-03 | \n",
" 194519 | \n",
" 1054 | \n",
" 234902 | \n",
" 149248 | \n",
" 4096 | \n",
" 14 | \n",
"
\n",
" \n",
" 2021-02-04 | \n",
" 195482 | \n",
" 963 | \n",
" 235319 | \n",
" 149792 | \n",
" 4110 | \n",
" 14 | \n",
"
\n",
" \n",
" 2021-02-05 | \n",
" 196542 | \n",
" 1060 | \n",
" 236216 | \n",
" 150169 | \n",
" 4123 | \n",
" 13 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"2021-02-01 192598 1014 234215 149235 \n",
"2021-02-02 193465 867 234366 149242 \n",
"2021-02-03 194519 1054 234902 149248 \n",
"2021-02-04 195482 963 235319 149792 \n",
"2021-02-05 196542 1060 236216 150169 \n",
"\n",
" obitosAcumulados obitosNovos \n",
"2021-02-01 4068 12 \n",
"2021-02-02 4082 14 \n",
"2021-02-03 4096 14 \n",
"2021-02-04 4110 14 \n",
"2021-02-05 4123 13 "
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.index = pd.to_datetime(dados_covid_PB.index) # Convertendo o index de string para data\n",
"dados_covid_PB.loc[pd.date_range('2021-02-01',periods=5,freq=\"D\")].drop('Letalidade',axis=1) \n",
" #função pd.date_range é muito útil para criar índices a partir de datas."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Podemos colocar uma lista como argumento:"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
" Letalidade | \n",
"
\n",
" \n",
" \n",
" \n",
" 2021-01-01 | \n",
" 167062 | \n",
" 578 | \n",
" 219890 | \n",
" 127388 | \n",
" 3680 | \n",
" 8 | \n",
" 0.0220 | \n",
"
\n",
" \n",
" 2021-02-01 | \n",
" 192598 | \n",
" 1014 | \n",
" 234215 | \n",
" 149235 | \n",
" 4068 | \n",
" 12 | \n",
" 0.0211 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"2021-01-01 167062 578 219890 127388 \n",
"2021-02-01 192598 1014 234215 149235 \n",
"\n",
" obitosAcumulados obitosNovos Letalidade \n",
"2021-01-01 3680 8 0.0220 \n",
"2021-02-01 4068 12 0.0211 "
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.loc[pd.to_datetime(['2021-01-01','2021-02-01'])]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Vamos agora examinar os dados da posição 100 (novamente excluindo a coluna letalidade e convertendo para inteiro):"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"casosAcumulados 133220\n",
"casosNovos 71\n",
"descartados 185162\n",
"recuperados 108737\n",
"obitosAcumulados 3107\n",
"obitosNovos 6\n",
"Name: 2020-11-01 00:00:00, dtype: int64"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.iloc[100].drop('Letalidade').astype('int')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Podemos colocar um intervalo como argumento:"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-12-21 | \n",
" 159639 | \n",
" 203 | \n",
" 213617 | \n",
" 123430 | \n",
" 3552 | \n",
" 13 | \n",
"
\n",
" \n",
" 2020-12-20 | \n",
" 159436 | \n",
" 327 | \n",
" 213356 | \n",
" 123417 | \n",
" 3539 | \n",
" 10 | \n",
"
\n",
" \n",
" 2020-12-19 | \n",
" 159109 | \n",
" 660 | \n",
" 213149 | \n",
" 123295 | \n",
" 3529 | \n",
" 6 | \n",
"
\n",
" \n",
" 2020-12-18 | \n",
" 158449 | \n",
" 1053 | \n",
" 212439 | \n",
" 122935 | \n",
" 3523 | \n",
" 16 | \n",
"
\n",
" \n",
" 2020-12-17 | \n",
" 157396 | \n",
" 1274 | \n",
" 210628 | \n",
" 122219 | \n",
" 3507 | \n",
" 20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"data \n",
"2020-12-21 159639 203 213617 123430 \n",
"2020-12-20 159436 327 213356 123417 \n",
"2020-12-19 159109 660 213149 123295 \n",
"2020-12-18 158449 1053 212439 122935 \n",
"2020-12-17 157396 1274 210628 122219 \n",
"\n",
" obitosAcumulados obitosNovos \n",
"data \n",
"2020-12-21 3552 13 \n",
"2020-12-20 3539 10 \n",
"2020-12-19 3529 6 \n",
"2020-12-18 3523 16 \n",
"2020-12-17 3507 20 "
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.iloc[50:55].drop('Letalidade', axis=1).astype('int') "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Selecionando colunas pelos métodos *loc* e *iloc*\n",
"\n",
"Podemos selecionar colunas utilizando os métodos `loc` e `iloc` utilizando um argumento adicional."
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosNovos | \n",
" obitosNovos | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2021-02-09 | \n",
" 971 | \n",
" 13 | \n",
"
\n",
" \n",
" 2021-02-08 | \n",
" 611 | \n",
" 12 | \n",
"
\n",
" \n",
" 2021-02-07 | \n",
" 664 | \n",
" 11 | \n",
"
\n",
" \n",
" 2021-02-06 | \n",
" 918 | \n",
" 12 | \n",
"
\n",
" \n",
" 2021-02-05 | \n",
" 1060 | \n",
" 13 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 2020-03-20 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-03-19 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-03-18 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-03-17 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-03-16 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
331 rows × 2 columns
\n",
"
"
],
"text/plain": [
" casosNovos obitosNovos\n",
"data \n",
"2021-02-09 971 13\n",
"2021-02-08 611 12\n",
"2021-02-07 664 11\n",
"2021-02-06 918 12\n",
"2021-02-05 1060 13\n",
"... ... ...\n",
"2020-03-20 0 0\n",
"2020-03-19 0 0\n",
"2020-03-18 0 0\n",
"2020-03-17 0 0\n",
"2020-03-16 0 0\n",
"\n",
"[331 rows x 2 columns]"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.loc[:,['casosNovos','obitosNovos']]"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2021-02-09 | \n",
" 4171 | \n",
" 13 | \n",
"
\n",
" \n",
" 2021-02-08 | \n",
" 4158 | \n",
" 12 | \n",
"
\n",
" \n",
" 2021-02-07 | \n",
" 4146 | \n",
" 11 | \n",
"
\n",
" \n",
" 2021-02-06 | \n",
" 4135 | \n",
" 12 | \n",
"
\n",
" \n",
" 2021-02-05 | \n",
" 4123 | \n",
" 13 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 2020-03-20 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-03-19 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-03-18 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-03-17 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-03-16 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
331 rows × 2 columns
\n",
"
"
],
"text/plain": [
" obitosAcumulados obitosNovos\n",
"data \n",
"2021-02-09 4171 13\n",
"2021-02-08 4158 12\n",
"2021-02-07 4146 11\n",
"2021-02-06 4135 12\n",
"2021-02-05 4123 13\n",
"... ... ...\n",
"2020-03-20 0 0\n",
"2020-03-19 0 0\n",
"2020-03-18 0 0\n",
"2020-03-17 0 0\n",
"2020-03-16 0 0\n",
"\n",
"[331 rows x 2 columns]"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.iloc[:,4:6] # fatiamento na coluna"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Selecionando linhas e colunas específicas pelos métodos *loc* e *iloc*:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Usando o mesmo princípio de *fatiamento* aplicado a *arrays* do numpy, podemos selecionar linhas e colunas em um intervalo específico de forma a obter uma subtabela."
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-11-06 | \n",
" 3146 | \n",
" 8 | \n",
"
\n",
" \n",
" 2020-11-05 | \n",
" 3138 | \n",
" 8 | \n",
"
\n",
" \n",
" 2020-11-04 | \n",
" 3130 | \n",
" 11 | \n",
"
\n",
" \n",
" 2020-11-03 | \n",
" 3119 | \n",
" 11 | \n",
"
\n",
" \n",
" 2020-11-02 | \n",
" 3108 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" obitosAcumulados obitosNovos\n",
"data \n",
"2020-11-06 3146 8\n",
"2020-11-05 3138 8\n",
"2020-11-04 3130 11\n",
"2020-11-03 3119 11\n",
"2020-11-02 3108 1"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.iloc[95:100,4:6]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Neste exemplo um pouco mais complexo, buscamos casos novos e óbitos novos em um período específico e ordenamos a tabela da data mais recente para a mais antiga."
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosNovos | \n",
" obitosNovos | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-04-10 | \n",
" 6 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-04-09 | \n",
" 24 | \n",
" 4 | \n",
"
\n",
" \n",
" 2020-04-08 | \n",
" 14 | \n",
" 3 | \n",
"
\n",
" \n",
" 2020-04-07 | \n",
" 5 | \n",
" 0 | \n",
"
\n",
" \n",
" 2020-04-06 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosNovos obitosNovos\n",
"2020-04-10 6 0\n",
"2020-04-09 24 4\n",
"2020-04-08 14 3\n",
"2020-04-07 5 0\n",
"2020-04-06 1 0"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.loc[pd.date_range('2020-04-06','2020-04-10'),['casosNovos','obitosNovos']].sort_index(ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Suponha que o peso de Ana foi medido corretamente, mas registrado de maneira errônea no *DataFrame* `df_dict_series` como 55."
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
" IMC | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 55 | \n",
" 162 | \n",
" 20.96 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
" 25.25 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
" 23.62 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
" 24.61 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
" 24.96 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura IMC\n",
"Ana 20 55 162 20.96\n",
"João 19 80 178 25.25\n",
"Maria 21 62 162 23.62\n",
"Pedro 22 67 165 24.61\n",
"Túlio 20 73 171 24.96"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Supondo que, na realidade, o valor é 65, alteramos a entrada específica com um simples `loc`. Em seguida, atualizamos a tabela."
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"df_dict_series.loc['Ana','Peso'] = 65\n",
"\n",
"df_dict_series = df_dict_series.assign(IMC=round(df_dict_series.eval('Peso/(Altura/100)**2'),2)) # O IMC mudou"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
" IMC | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 65 | \n",
" 162 | \n",
" 24.77 | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
" 25.25 | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
" 23.62 | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
" 24.61 | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
" 24.96 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura IMC\n",
"Ana 20 65 162 24.77\n",
"João 19 80 178 25.25\n",
"Maria 21 62 162 23.62\n",
"Pedro 22 67 165 24.61\n",
"Túlio 20 73 171 24.96"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Selecionando linha através de critérios lógicos ou funções\n",
"\n",
"Vamos selecionar quais os dias em que houve mais de 40 mortes registradas:"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
" Letalidade | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-07-21 | \n",
" 68844 | \n",
" 1164 | \n",
" 78605 | \n",
" 25028 | \n",
" 1558 | \n",
" 41 | \n",
" 0.0226 | \n",
"
\n",
" \n",
" 2020-07-15 | \n",
" 63939 | \n",
" 1477 | \n",
" 74399 | \n",
" 23695 | \n",
" 1383 | \n",
" 41 | \n",
" 0.0216 | \n",
"
\n",
" \n",
" 2020-07-02 | \n",
" 49536 | \n",
" 1361 | \n",
" 48272 | \n",
" 16349 | \n",
" 1044 | \n",
" 42 | \n",
" 0.0211 | \n",
"
\n",
" \n",
" 2020-06-30 | \n",
" 46957 | \n",
" 1900 | \n",
" 43070 | \n",
" 14930 | \n",
" 977 | \n",
" 46 | \n",
" 0.0208 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"data \n",
"2020-07-21 68844 1164 78605 25028 \n",
"2020-07-15 63939 1477 74399 23695 \n",
"2020-07-02 49536 1361 48272 16349 \n",
"2020-06-30 46957 1900 43070 14930 \n",
"\n",
" obitosAcumulados obitosNovos Letalidade \n",
"data \n",
"2020-07-21 1558 41 0.0226 \n",
"2020-07-15 1383 41 0.0216 \n",
"2020-07-02 1044 42 0.0211 \n",
"2020-06-30 977 46 0.0208 "
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.loc[dados_covid_PB['obitosNovos']>40]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Selecionando os dias com mais de 25 óbitos e mais de 1500 casos novos:"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
" Letalidade | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-08-04 | \n",
" 85760 | \n",
" 1549 | \n",
" 106500 | \n",
" 38554 | \n",
" 1901 | \n",
" 31 | \n",
" 0.0222 | \n",
"
\n",
" \n",
" 2020-07-31 | \n",
" 82794 | \n",
" 1686 | \n",
" 96753 | \n",
" 35971 | \n",
" 1811 | \n",
" 26 | \n",
" 0.0219 | \n",
"
\n",
" \n",
" 2020-07-23 | \n",
" 73104 | \n",
" 2132 | \n",
" 84047 | \n",
" 28566 | \n",
" 1618 | \n",
" 37 | \n",
" 0.0221 | \n",
"
\n",
" \n",
" 2020-07-10 | \n",
" 59118 | \n",
" 1504 | \n",
" 69567 | \n",
" 21481 | \n",
" 1229 | \n",
" 33 | \n",
" 0.0208 | \n",
"
\n",
" \n",
" 2020-07-08 | \n",
" 56344 | \n",
" 1542 | \n",
" 67549 | \n",
" 19999 | \n",
" 1171 | \n",
" 26 | \n",
" 0.0208 | \n",
"
\n",
" \n",
" 2020-07-07 | \n",
" 54802 | \n",
" 1651 | \n",
" 64933 | \n",
" 19373 | \n",
" 1145 | \n",
" 27 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2020-06-30 | \n",
" 46957 | \n",
" 1900 | \n",
" 43070 | \n",
" 14930 | \n",
" 977 | \n",
" 46 | \n",
" 0.0208 | \n",
"
\n",
" \n",
" 2020-06-09 | \n",
" 22452 | \n",
" 1501 | \n",
" 20650 | \n",
" 4671 | \n",
" 534 | \n",
" 27 | \n",
" 0.0238 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"data \n",
"2020-08-04 85760 1549 106500 38554 \n",
"2020-07-31 82794 1686 96753 35971 \n",
"2020-07-23 73104 2132 84047 28566 \n",
"2020-07-10 59118 1504 69567 21481 \n",
"2020-07-08 56344 1542 67549 19999 \n",
"2020-07-07 54802 1651 64933 19373 \n",
"2020-06-30 46957 1900 43070 14930 \n",
"2020-06-09 22452 1501 20650 4671 \n",
"\n",
" obitosAcumulados obitosNovos Letalidade \n",
"data \n",
"2020-08-04 1901 31 0.0222 \n",
"2020-07-31 1811 26 0.0219 \n",
"2020-07-23 1618 37 0.0221 \n",
"2020-07-10 1229 33 0.0208 \n",
"2020-07-08 1171 26 0.0208 \n",
"2020-07-07 1145 27 0.0209 \n",
"2020-06-30 977 46 0.0208 \n",
"2020-06-09 534 27 0.0238 "
]
},
"execution_count": 139,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.loc[(dados_covid_PB.obitosNovos > 25) & (dados_covid_PB.casosNovos>1500)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Obs**.: Note que podemos utilizar o nome da coluna como um atributo."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Vamos inserir uma coluna sobrenome no `df_dict_series`:"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
" IMC | \n",
" Sobrenome | \n",
"
\n",
" \n",
" \n",
" \n",
" Ana | \n",
" 20 | \n",
" 65 | \n",
" 162 | \n",
" 24.77 | \n",
" Silva | \n",
"
\n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
" 25.25 | \n",
" PraDo | \n",
"
\n",
" \n",
" Maria | \n",
" 21 | \n",
" 62 | \n",
" 162 | \n",
" 23.62 | \n",
" Sales | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
" 24.61 | \n",
" MachadO | \n",
"
\n",
" \n",
" Túlio | \n",
" 20 | \n",
" 73 | \n",
" 171 | \n",
" 24.96 | \n",
" Coutinho | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura IMC Sobrenome\n",
"Ana 20 65 162 24.77 Silva\n",
"João 19 80 178 25.25 PraDo\n",
"Maria 21 62 162 23.62 Sales\n",
"Pedro 22 67 165 24.61 MachadO\n",
"Túlio 20 73 171 24.96 Coutinho"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dict_series['Sobrenome'] = ['Silva', 'PraDo', 'Sales', 'MachadO', 'Coutinho']\n",
"df_dict_series"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Vamos encontrar as linhas cujo sobrenome termina em \"do\". Para tanto, note que a função abaixo retorna `True` se o final é \"do\" e `False` caso contrário.\n",
"\n",
"```python\n",
"def verifica_final_do(palavra):\n",
" return palavra.lower()[-2:] == 'do'\n",
"```\n",
"**Obs**.: Note que convertemos tudo para minúsculo."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Agora vamos utilizar essa função para alcançar nosso objetivo:"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Ana False\n",
"João True\n",
"Maria False\n",
"Pedro True\n",
"Túlio False\n",
"Name: Sobrenome, dtype: bool"
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 'map' aplica a função lambda a cada elemento da *Series*\n",
"df_dict_series['Sobrenome'].map(lambda palavra: palavra.lower()[-2:]=='do') "
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Idade | \n",
" Peso | \n",
" Altura | \n",
" IMC | \n",
" Sobrenome | \n",
"
\n",
" \n",
" \n",
" \n",
" João | \n",
" 19 | \n",
" 80 | \n",
" 178 | \n",
" 25.25 | \n",
" PraDo | \n",
"
\n",
" \n",
" Pedro | \n",
" 22 | \n",
" 67 | \n",
" 165 | \n",
" 24.61 | \n",
" MachadO | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Idade Peso Altura IMC Sobrenome\n",
"João 19 80 178 25.25 PraDo\n",
"Pedro 22 67 165 24.61 MachadO"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# procurando no df inteiro\n",
"df_dict_series.loc[df_dict_series['Sobrenome'].map(lambda palavra: palavra.lower()[-2:]=='do')]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Vamos selecionar as linhas do mês 2 (fevereiro) usando `index.month`:"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
" Letalidade | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2021-02-09 | \n",
" 199706 | \n",
" 971 | \n",
" 237369 | \n",
" 152779 | \n",
" 4171 | \n",
" 13 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2021-02-08 | \n",
" 198735 | \n",
" 611 | \n",
" 237189 | \n",
" 151879 | \n",
" 4158 | \n",
" 12 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2021-02-07 | \n",
" 198124 | \n",
" 664 | \n",
" 237072 | \n",
" 151535 | \n",
" 4146 | \n",
" 11 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2021-02-06 | \n",
" 197460 | \n",
" 918 | \n",
" 236774 | \n",
" 150175 | \n",
" 4135 | \n",
" 12 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2021-02-05 | \n",
" 196542 | \n",
" 1060 | \n",
" 236216 | \n",
" 150169 | \n",
" 4123 | \n",
" 13 | \n",
" 0.0210 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"data \n",
"2021-02-09 199706 971 237369 152779 \n",
"2021-02-08 198735 611 237189 151879 \n",
"2021-02-07 198124 664 237072 151535 \n",
"2021-02-06 197460 918 236774 150175 \n",
"2021-02-05 196542 1060 236216 150169 \n",
"\n",
" obitosAcumulados obitosNovos Letalidade \n",
"data \n",
"2021-02-09 4171 13 0.0209 \n",
"2021-02-08 4158 12 0.0209 \n",
"2021-02-07 4146 11 0.0209 \n",
"2021-02-06 4135 12 0.0209 \n",
"2021-02-05 4123 13 0.0210 "
]
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.loc[dados_covid_PB.index.month==2].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Selecionando linhas com o método *query*\n",
"\n",
"Similarmente ao método `eval`, ao utilizarmos `query`, podemos criar expressões lógicas a partir de nomes das colunas do *DataFrame*."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Assim, podemos reescrever o código\n",
"\n",
"```python\n",
"dados_covid_PB.loc[(dados_covid_PB.obitosNovos>25) & \n",
" (dados_covid_PB.casosNovos>1500)]\n",
"```\n",
"como"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" casosAcumulados | \n",
" casosNovos | \n",
" descartados | \n",
" recuperados | \n",
" obitosAcumulados | \n",
" obitosNovos | \n",
" Letalidade | \n",
"
\n",
" \n",
" data | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2020-08-04 | \n",
" 85760 | \n",
" 1549 | \n",
" 106500 | \n",
" 38554 | \n",
" 1901 | \n",
" 31 | \n",
" 0.0222 | \n",
"
\n",
" \n",
" 2020-07-31 | \n",
" 82794 | \n",
" 1686 | \n",
" 96753 | \n",
" 35971 | \n",
" 1811 | \n",
" 26 | \n",
" 0.0219 | \n",
"
\n",
" \n",
" 2020-07-23 | \n",
" 73104 | \n",
" 2132 | \n",
" 84047 | \n",
" 28566 | \n",
" 1618 | \n",
" 37 | \n",
" 0.0221 | \n",
"
\n",
" \n",
" 2020-07-10 | \n",
" 59118 | \n",
" 1504 | \n",
" 69567 | \n",
" 21481 | \n",
" 1229 | \n",
" 33 | \n",
" 0.0208 | \n",
"
\n",
" \n",
" 2020-07-08 | \n",
" 56344 | \n",
" 1542 | \n",
" 67549 | \n",
" 19999 | \n",
" 1171 | \n",
" 26 | \n",
" 0.0208 | \n",
"
\n",
" \n",
" 2020-07-07 | \n",
" 54802 | \n",
" 1651 | \n",
" 64933 | \n",
" 19373 | \n",
" 1145 | \n",
" 27 | \n",
" 0.0209 | \n",
"
\n",
" \n",
" 2020-06-30 | \n",
" 46957 | \n",
" 1900 | \n",
" 43070 | \n",
" 14930 | \n",
" 977 | \n",
" 46 | \n",
" 0.0208 | \n",
"
\n",
" \n",
" 2020-06-09 | \n",
" 22452 | \n",
" 1501 | \n",
" 20650 | \n",
" 4671 | \n",
" 534 | \n",
" 27 | \n",
" 0.0238 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" casosAcumulados casosNovos descartados recuperados \\\n",
"data \n",
"2020-08-04 85760 1549 106500 38554 \n",
"2020-07-31 82794 1686 96753 35971 \n",
"2020-07-23 73104 2132 84047 28566 \n",
"2020-07-10 59118 1504 69567 21481 \n",
"2020-07-08 56344 1542 67549 19999 \n",
"2020-07-07 54802 1651 64933 19373 \n",
"2020-06-30 46957 1900 43070 14930 \n",
"2020-06-09 22452 1501 20650 4671 \n",
"\n",
" obitosAcumulados obitosNovos Letalidade \n",
"data \n",
"2020-08-04 1901 31 0.0222 \n",
"2020-07-31 1811 26 0.0219 \n",
"2020-07-23 1618 37 0.0221 \n",
"2020-07-10 1229 33 0.0208 \n",
"2020-07-08 1171 26 0.0208 \n",
"2020-07-07 1145 27 0.0209 \n",
"2020-06-30 977 46 0.0208 \n",
"2020-06-09 534 27 0.0238 "
]
},
"execution_count": 144,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados_covid_PB.query('obitosNovos>25 and casosNovos>1500') # note que 'and' é usado em vez de '&'"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"latex_metadata": {
"affiliation": "Departamento de Computação Científica / UFPB",
"author": "Gustavo Oliveira e Andréa Rocha",
"date": "Julho de 2020",
"title": "Aula 6A - Manipulação de Dados"
},
"rise": {
"enable_chalkboard": true,
"footer": "UFPB - CI - DCC",
"header": "Gustavo Oliveira / Andrea Rocha",
"theme": "sky"
}
},
"nbformat": 4,
"nbformat_minor": 4
}