CNPJ Source¶

Source¶

class publicbr.cnpj._source.CNPJSource(spark_session, file_dir)¶

Class used to extract CNPJ data.

Parameters

spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Root directory where the data will be saved

spark¶

Spark session used in data manipulation

Type: pyspark.sql.SparkSession

raw_dir¶

Path to the directory used to store raw data

Type: str

trusted_dir¶

Path to the directory used to store cleaned data

Type: str

crawler¶

Object used to extract data from the public source

Type: CNPJCrawler

cleaners¶

Dict with the cleaners used to consolidate tables

Type: Dict[Cleaner]

create(download=True, overwrite=True, **kwargs)¶

Wrapper for method execution.

Parameters

download (bool) – Whether the raw files should be downloaded.
overwrite (bool) – Whether existing files should be overwritten.
**kwargs – Optional keyword arguments:

mode (str) – Write mode used when data already exists at the target path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error

n_partitions (int) – Number of DataFrame partitions

partition_col (str) – Column used to partition the DataFrame on writing

key – Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self
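The mode values listed for **kwargs above match the save modes of Spark's DataFrameWriter. As a rough illustration only, the decision they encode can be written in plain Python. The helper name resolve_write_action and its return labels are hypothetical and are not part of publicbr or pyspark.

```python
def resolve_write_action(mode: str, target_exists: bool) -> str:
    """Return the action a writer takes for a given save mode.

    Hypothetical helper that mirrors the mode descriptions above.
    """
    valid = {"append", "overwrite", "ignore", "error", "errorifexists"}
    if mode not in valid:
        raise ValueError(f"unknown mode: {mode!r}")
    if not target_exists:
        return "write"  # nothing to conflict with, so every mode just writes
    if mode == "append":
        return "append"
    if mode == "overwrite":
        return "replace"
    if mode == "ignore":
        return "skip"
    # "error" and "errorifexists" refuse to touch existing data
    raise FileExistsError("data already exists at the target path")
```

Only the case where data already exists distinguishes the modes; on an empty target they all behave the same way.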

extract(overwrite)¶

Extract data from public CNPJ data source, using the CNPJCrawler.

Parameters: overwrite (bool) – Whether existing files should be overwritten.
Returns: returns an instance of the object
Return type: self

transform(**kwargs)¶

Transform raw data extracted from public CNPJ data source.

Parameters

**kwargs –

mode (str) – Write mode used when data already exists at the target path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of DataFrame partitions
partition_col (str) – Column used to partition the DataFrame on writing
key – Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self
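Because create, extract and transform are documented as returning the instance, the calls can be chained. The sketch below shows that pattern with a hypothetical stand-in class (PipelineSketch) that only records calls. It assumes create runs extract when download is True and then forwards the remaining keyword arguments to transform; the real CNPJSource may be organised differently.

```python
class PipelineSketch:
    """Hypothetical stand-in for CNPJSource; it only records calls."""

    def __init__(self):
        self.calls = []

    def extract(self, overwrite):
        self.calls.append(("extract", overwrite))
        return self  # returning self is what allows chaining

    def transform(self, **kwargs):
        self.calls.append(("transform", kwargs))
        return self

    def create(self, download=True, overwrite=True, **kwargs):
        # Assumption: extract() runs only when download is True, and the
        # remaining keyword arguments are forwarded to transform().
        if download:
            self.extract(overwrite)
        return self.transform(**kwargs)


# Skip the download step and pass write settings through **kwargs.
source = PipelineSketch().create(download=False, mode="overwrite", n_partitions=8)
print(source.calls)
```

With the real class, the equivalent call would take a SparkSession and a root directory in the constructor, as listed in the parameters above.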

Crawler¶

class publicbr.cnpj._crawler.CNPJCrawler(save_dir)¶

Class used to extract CNPJ data from the public source.

Parameters: save_dir (str) – Path to where the downloaded data should be stored. The directory is created if it does not already exist.

base_url¶

URL containing all the files to be downloaded

Type: str

save_dir¶

Path to where the downloaded data should be stored

Type: str

files¶

Names of the files to be downloaded

Type: str

download_url(url, save_path) → None¶

Downloads the file at the given URL.

Parameters

url (str) – Full URL of the file, created by joining the base_url and a file name.
save_path (str) – Path of the destination file

Returns

returns an instance of the object

Return type

self

get_data(overwrite) → None¶

Wrapper to download each file in files.

Parameters: overwrite (bool) – Whether existing files should be overwritten.
Returns: returns an instance of the object
Return type: self

run(overwrite=True) → None¶

Wrapper for method execution.

Parameters: overwrite (bool) – Whether existing files should be overwritten.
Returns: returns an instance of the object
Return type: self

unzip() → None¶

Extract data from the downloaded zipped files.

Returns: returns an instance of the object
Return type: self
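The download and unzip steps described for the crawler can be sketched with the standard library alone. The functions below reuse the method names for readability but are hypothetical free functions, not the CNPJCrawler implementation. They assume that base_url ends with a slash, that existing files are skipped when overwrite is False, and that archives are extracted into the same directory.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path
from urllib.parse import urljoin


def download_url(url: str, save_path: Path) -> None:
    """Stream the resource at url into save_path."""
    with urllib.request.urlopen(url) as response, open(save_path, "wb") as out:
        shutil.copyfileobj(response, out)


def get_data(base_url: str, files: list, save_dir: Path, overwrite: bool) -> None:
    """Download every name in files, skipping existing ones unless overwrite is set."""
    save_dir.mkdir(parents=True, exist_ok=True)
    for name in files:
        target = save_dir / name
        if target.exists() and not overwrite:
            continue
        # urljoin keeps the full base path only when base_url ends with "/".
        download_url(urljoin(base_url, name), target)


def unzip(save_dir: Path) -> None:
    """Extract every .zip archive found in save_dir into that same directory."""
    for archive in sorted(save_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(save_dir)
```

Streaming with shutil.copyfileobj avoids loading a whole archive into memory, which matters for large files.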

Consolidation¶

class publicbr.cnpj._consolidation.AuxCleaner(spark_session, file_dir, save_dir)¶

Class used to clean the auxiliary tables that compose the CNPJ data. Currently, they are the following:

CNAE
Municípios
Natureza Jurídica
País
Qualificação de Sócios
Motivo da Situação Cadastral

Parameters

spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored

spark¶

Spark session used in data manipulation

Type: pyspark.sql.SparkSession

file_dir¶

Path to where the raw data is stored.

Type: str

save_dir¶

Path to where the consolidated data should be stored

Type: str

file_ids¶

Names used to identify the auxiliary tables

Type: List[str]

files¶

Names of the raw files

Type: str

clean(mode='error', n_partitions=32, **kwargs) → None¶

Wrapper for method execution.

Parameters

mode (str) – Write mode used when data already exists at the target path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions used in execution
**kwargs – Optional keyword arguments:

partition_col (str) – Column used to partition the DataFrame on writing

key – Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self
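The keyword arguments of clean mix one named setting (partition_col) with free-form options for the writer. A minimal sketch of separating the two is shown below. The helper split_write_kwargs and the example values ("uf", "snappy") are hypothetical; the real clean may process its keyword arguments differently.

```python
def split_write_kwargs(**kwargs):
    """Separate partition_col from the options meant for the writer.

    Hypothetical helper: the real clean() may handle kwargs differently.
    """
    options = dict(kwargs)  # copy so the caller's dict is left untouched
    partition_col = options.pop("partition_col", None)
    return partition_col, options


# "uf" and "compression" are illustrative values only.
partition_col, options = split_write_kwargs(partition_col="uf", compression="snappy")
print(partition_col, options)
```

With pyspark, the two results would typically feed DataFrameWriter.partitionBy and DataFrameWriter.options respectively.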

define_schema(file_id) → str¶

Creates schema used in the file reading.

Parameters: file_id (str) – Name used to identify the auxiliary table
Returns: String specifying the schema of the DataFrame
Return type: str
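Spark readers accept a DDL-formatted string as a schema, which is consistent with define_schema returning a str. The sketch below builds such a string from (name, type) pairs. The helper build_ddl_schema and the column names are hypothetical and do not reflect the actual layout of any auxiliary table.

```python
def build_ddl_schema(columns):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


# Hypothetical columns for a code/description lookup table.
aux_columns = [("code", "STRING"), ("description", "STRING")]
schema = build_ddl_schema(aux_columns)
print(schema)  # code STRING, description STRING
```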

get_files() → List[str]¶

Gets the correct files of the auxiliary tables from the raw directory

Returns: Names of the auxiliary table files
Return type: List[str]
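One plausible way to pick the auxiliary files out of the raw directory is to match each file name against the identifiers in file_ids. The sketch below assumes a case-insensitive substring match. The helper select_aux_files, the free-function form of get_files and the sample names are hypothetical, and the actual matching rule used by AuxCleaner may differ.

```python
import os


def select_aux_files(file_names, file_ids):
    """Keep the file names that contain one of the identifiers (case-insensitive)."""
    wanted = [file_id.lower() for file_id in file_ids]
    return sorted(
        name
        for name in file_names
        if any(file_id in name.lower() for file_id in wanted)
    )


def get_files(raw_dir, file_ids):
    """List raw_dir and return only the auxiliary table files."""
    return select_aux_files(os.listdir(raw_dir), file_ids)


# Illustrative names only; real raw file names are not shown in this page.
names = ["one_cnae.csv", "two_empresas.csv", "three_pais.csv"]
print(select_aux_files(names, ["CNAE", "PAIS"]))
```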

transform_data(df) → pyspark.sql.dataframe.DataFrame¶

Performs the necessary transformations to clean the raw data.

Parameters: df (pyspark.sql.dataframe.DataFrame) – Spark DataFrame of the read raw data
Returns: Spark DataFrame of the consolidated data
Return type: pyspark.sql.dataframe.DataFrame

class publicbr.cnpj._consolidation.EmpresasCleaner(spark_session, file_dir, save_dir)¶

Class used to clean the table containing general information about the company, such as share capital.

Parameters

spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored

spark¶

Spark session used in data manipulation

Type: pyspark.sql.SparkSession

file_dir¶

Path to where the raw data is stored.

Type: str

save_dir¶

Path to where the consolidated data should be stored

Type: str

aux_paths¶

Dict with the path to the auxiliary tables used in cleaning

Type: Dict[str]

int_dir¶

Path to directory of intermediary tables

schema¶

Schema used to read raw data

Type: str

df¶

Spark DataFrame of raw data

Type: pyspark.sql.dataframe.DataFrame

df_cleaned¶

Spark DataFrame of cleaned data

Type: pyspark.sql.dataframe.DataFrame

clean(mode='error', n_partitions=32, **kwargs) → None¶

Wrapper for method execution.

Parameters

mode (str) – Write mode used when data already exists at the target path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions used in execution
**kwargs – Optional keyword arguments:

partition_col (str) – Column used to partition the DataFrame on writing

key – Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self

define_schema() → None¶

Creates schema used in the file reading.

Returns: returns an instance of the object
Return type: self

transform_data() → None¶

Performs the necessary transformations to clean the raw data.

Returns: returns an instance of the object
Return type: self
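The cleaners share the same three methods: define_schema, transform_data and clean. The skeleton below sketches how such a flow could fit together, using plain Python lists in place of Spark DataFrames and leaving out the write step. The class names, the column names, the call order inside clean and the decimal-comma format of the share capital are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from decimal import Decimal


class CleanerSketch(ABC):
    """Hypothetical skeleton of the define_schema / transform_data / clean flow."""

    def __init__(self, rows):
        self.df = rows          # stands in for the raw Spark DataFrame
        self.df_cleaned = None  # filled by transform_data()
        self.schema = None      # filled by define_schema()

    @abstractmethod
    def define_schema(self): ...

    @abstractmethod
    def transform_data(self): ...

    def clean(self):
        # Assumption: clean() runs the steps in this order and returns self.
        self.define_schema()
        self.transform_data()
        return self


class ShareCapitalSketch(CleanerSketch):
    def define_schema(self):
        self.schema = "company_id STRING, share_capital STRING"
        return self

    def transform_data(self):
        # Assumption: amounts arrive as text with a decimal comma.
        self.df_cleaned = [
            (company_id, Decimal(raw.replace(",", ".")))
            for company_id, raw in self.df
        ]
        return self


cleaner = ShareCapitalSketch([("00000000", "1000,50")]).clean()
print(cleaner.df_cleaned)
```

Keeping the raw and cleaned data in separate attributes mirrors the df and df_cleaned attributes documented above.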

class publicbr.cnpj._consolidation.EstabCleaner(spark_session, file_dir, save_dir)¶

Class used to clean the largest dataset, which contains all the information about the company at the moment of registration, such as main economic activity, location and contacts.

Parameters

spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored

spark¶

Spark session used in data manipulation

Type: pyspark.sql.SparkSession

file_dir¶

Path to where the raw data is stored.

Type: str

save_dir¶

Path to where the consolidated data should be stored

Type: str

aux_paths¶

Dict with the path to the auxiliary tables used in cleaning

Type: Dict[str]

int_dir¶

Path to directory of intermediary tables

schema¶

Schema used to read raw data

Type: str

df¶

Spark DataFrame of raw data

Type: pyspark.sql.dataframe.DataFrame

df_cleaned¶

Spark DataFrame of cleaned data

Type: pyspark.sql.dataframe.DataFrame

clean(mode='error', n_partitions=32, **kwargs) → None¶

Wrapper for method execution.

Parameters

mode (str) – Write mode used when data already exists at the target path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions used in execution
**kwargs – Optional keyword arguments:

partition_col (str) – Column used to partition the DataFrame on writing

key – Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self

define_schema() → None¶

Creates schema used in the file reading.

Returns: returns an instance of the object
Return type: self

transform_data() → None¶

Performs the necessary transformations to clean the raw data.

Returns: returns an instance of the object
Return type: self

class publicbr.cnpj._consolidation.SimplesCleaner(spark_session, file_dir, save_dir)¶

Class used to clean the Simples table, which contains data on mostly micro and small companies that opted into the Simples or MEI category.

Parameters

spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored

spark¶

Spark session used in data manipulation

Type: pyspark.sql.SparkSession

file_dir¶

Path to where the raw data is stored.

Type: str

save_dir¶

Path to where the consolidated data should be stored

Type: str

file_path¶

Path to raw data

Type: str

save_path¶

Path to write cleaned data

Type: str

schema¶

Schema used to read raw data

Type: str

df¶

Spark DataFrame of raw data

Type: pyspark.sql.dataframe.DataFrame

df_cleaned¶

Spark DataFrame of cleaned data

Type: pyspark.sql.dataframe.DataFrame

clean(mode='error', n_partitions=32, **kwargs) → None¶

Wrapper for method execution.

Parameters

mode (str) – Write mode used when data already exists at the target path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions used in execution
**kwargs – Optional keyword arguments:

partition_col (str) – Column used to partition the DataFrame on writing

key – Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self

define_schema() → None¶

Creates schema used in the file reading.

Returns: returns an instance of the object
Return type: self

transform_data() → None¶

Performs the necessary transformations to clean the raw data.

Returns: returns an instance of the object
Return type: self

class publicbr.cnpj._consolidation.SociosCleaner(spark_session, file_dir, save_dir)¶

Class used to clean the table containing information about partners.

Parameters

spark_session (pyspark.sql.SparkSession) – Spark Session used to manipulate data
file_dir (str) – Path to where the raw data is stored.
save_dir (str) – Path to where the consolidated data should be stored

spark¶

Spark session used in data manipulation

Type: pyspark.sql.SparkSession

file_dir¶

Path to where the raw data is stored.

Type: str

save_dir¶

Path to where the consolidated data should be stored

Type: str

aux_paths¶

Dict with the path to the auxiliary tables used in cleaning

Type: Dict[str]

int_dir¶

Path to directory of intermediary tables

schema¶

Schema used to read raw data

Type: str

int_path¶

Path to intermediary table written

df¶

Spark DataFrame of raw data

Type: pyspark.sql.dataframe.DataFrame

df_cleaned¶

Spark DataFrame of cleaned data

Type: pyspark.sql.dataframe.DataFrame

clean(mode='error', n_partitions=32, **kwargs) → None¶

Wrapper for method execution.

Parameters

mode (str) – Write mode used when data already exists at the target path:
* append: Append the contents of the DataFrame to the existing data
* overwrite: Overwrite existing data
* ignore: Silently ignore this operation
* error or errorifexists (default): Raise an error
n_partitions (int) – Number of data partitions used in execution
**kwargs – Optional keyword arguments:

partition_col (str) – Column used to partition the DataFrame on writing

key – Other options passed to DataFrameWriter.options

Returns

returns an instance of the object

Return type

self

define_schema() → None¶

Creates schema used in the file reading.

Returns: returns an instance of the object
Return type: self

transform_data() → None¶

Performs the necessary transformations to clean the raw data.

Returns: returns an instance of the object
Return type: self