Photo by Nguyen Dang Hoang Nhu on Unsplash

Databricks, founded by the creators of Apache Spark, is being widely adopted by companies as a unified analytics engine for big data and machine learning. Gartner has named Databricks a Leader in its latest Magic Quadrant for Data Science and Machine Learning Platforms.


Share ideas, contribute to the data community and expand your network.

Photo by "My Life Through A Lens" on Unsplash

English version here.

What you will find on this page

  1. Why contribute?
  2. Rules for submitting an article
  3. Guide to writing an article
  4. How to submit your article?
  5. FAQ

Why contribute? 📝

Data Arena is an independent Medium publication and we are looking for people who want…


Share ideas, contribute to the data community and expand your network.

Photo by "My Life Through A Lens" on Unsplash

Portuguese version here.

What you will find on this page

  1. Why become a contributor?
  2. Submission rules
  3. Guidelines
  4. How to submit your article?
  5. FAQ

Why become a contributor? 📝

Data Arena is an independent Medium publication and we are looking for writers who want to contribute to the data community by proposing…


This article explores Schema Registry compatibility modes and how to evolve schemas according to them.

Photo by ian dooley on Unsplash

People who work with data know how painful it can be when an unexpected change is made to a data source. …


This article explores an approach to merging different schemas using Apache Spark.

Photo by Ricardo Gomez Angel on Unsplash

Imagine that you have to work with a lot of files in your data lake and you discover that they don’t have the same schema. Or, to be more tragic, let’s say you have a process that reads data from a data lake and suddenly it stops working. …


Making Sense of Big Data

In this article, I’ll share a comprehensive example of how to integrate Spark Structured Streaming with Kafka to create a streaming data visualization.

Photo by Markus Spiske on Unsplash

Introduction

Apache Kafka is being widely adopted in modern architectures, providing a more reliable and scalable way to capture and integrate real-time data between systems. …


A brief guide on how to set up a development environment with Spark, Airflow and Jupyter Notebook

Photo by Christopher Gower on Unsplash

Brief context

As Data Engineers, we commonly use Apache Spark and Apache Airflow in our daily routine (if you do not use them yet, you should try them) to overcome typical Data Engineering challenges, like building pipelines to get data from someplace, do a lot of transformations and deliver…


The objective of this article series is to identify hard bounce e-mails using machine learning techniques. Part 1 covered Feature Engineering and Exploratory Analysis. In part 2 we will see how to train an Extreme Gradient Boosting algorithm to identify hard bounce e-mails.

Photo by Thanhy Nguyen on Unsplash

The Dataset

In this article I…


This article series aims to show how to identify hard bounce e-mails using machine learning techniques. In part 1 we will see Feature Engineering and Exploratory Analysis.

Photo by Tiffany Tertipes on Unsplash

What is e-mail hard bounce?

This term is widely used in marketing and refers to bounced e-mail messages, which occur when an e-mail message is rejected by…

Thiago Cordon

Data practitioner, enabling business with data. Editor at https://medium.com/data-arena
