✨All that glitters is not gold: the downfall of trying to use NYC Data✨

By Carla Mandiola

This story begins on November 21, 2023, when I had the epiphany to work with the data about baby names in New York. I emailed the Department of Health and Mental Hygiene asking for it, and not one but two lovely people answered, sending me the link to the information and also an exclusive item:

"Years 2020 and 2021 have not been posted yet, but they are attached."

I felt like the luckiest girl in the world.

I was confident about my data and wanted to show how cultural events affect the names parents choose for their kids. It was a good idea; I had the data; what could go wrong?

First, I saw the data set: 57,582 rows and six columns: Year of Birth, Gender, Ethnicity, Child's First Name, Count, and Rank. Was I going to win a Pulitzer with this project?

Then I saw how many times this Popular Baby Names data set from NYC Open Data has been downloaded: more than 290,000 times. That's a huge number for me, a journalist from Rancagua, a little town in Chile.

It was time to work with the data. First, I wanted to check the ten most popular female names from 2011 to 2019. But the result was weird: two names, Sophia and Isabella, each appeared twice in the ranking, because one version was written in mixed case ("Sophia") and the other in all uppercase ("SOPHIA").

OKAY, THAT'S NOT OKAY.

I had to look at the male names because it's not easy to be a woman, and maybe the error was only in that category.

Yeah, the same thing happened there. They should have capitalized all the names consistently. I could fix that on my own, but there might be more problems with this data.
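For anyone who wants to try that fix, here is a minimal sketch in Python. It is not the code behind this story, and the rows are made up for illustration rather than taken from the real file. The idea is simply to put every name in the same capitalization before adding up the counts, so that "SOPHIA" and "Sophia" land in the same bucket.

```python
from collections import Counter

# Made-up rows for illustration only: (year, name, count).
rows = [
    (2011, "SOPHIA", 120),
    (2014, "Sophia", 100),
    (2011, "ISABELLA", 90),
    (2014, "Isabella", 95),
    (2014, "Emma", 80),
]

def top_names(rows, normalize=False):
    """Sum counts per name and return (name, total) pairs, most common first."""
    totals = Counter()
    for _year, name, count in rows:
        # With normalize=True, "SOPHIA" and "Sophia" become the same key.
        key = name.title() if normalize else name
        totals[key] += count
    return totals.most_common()

raw_ranking = top_names(rows)                    # case variants stay separate
clean_ranking = top_names(rows, normalize=True)  # case variants are merged

print(raw_ranking)
print(clean_ranking)
```

In the raw ranking, Sophia and Isabella each show up twice. After normalizing, each appears once with the combined count. Note that `str.title()` is a blunt tool: it is fine for simple first names but may not suit every spelling.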

That's why I wanted to see the popularity of my name, Carla, and see if there was a trend.

There is a possibility that people no longer wanted to name their children "Carla."

I wouldn't.

But still, the data is incorrect. I decided to check how many names are registered annually in the data set.
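A check like this one can be sketched in a few lines of Python. Again, this is only an illustration with invented rows, not the real data set or my actual analysis: it counts how many distinct names appear in each year, after ignoring capitalization, so a year with suspiciously few names stands out.

```python
from collections import defaultdict

# Invented rows for illustration only: (year, name).
rows = [
    (2011, "SOPHIA"), (2011, "ISABELLA"), (2011, "EMMA"),
    (2014, "Sophia"), (2014, "Isabella"),
    (2014, "Sophia"),  # repeated on purpose
]

def names_per_year(rows):
    """Return {year: number of distinct names}, ignoring capitalization."""
    seen = defaultdict(set)
    for year, name in rows:
        seen[year].add(name.title())
    return {year: len(names) for year, names in sorted(seen.items())}

yearly_counts = names_per_year(rows)
print(yearly_counts)
```

If one year has far fewer names than its neighbors, that is a hint that something is missing for that year.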

In life, you have to know when to stop and when to give up, and although I wanted this to work, the NYC Open Data is incorrect in many aspects:

Some names are lowercase.
Some are capitalized.
Some are not registered.
Some are duplicated.
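The duplicates are the easiest of these to spot automatically. A minimal Python sketch, assuming each record is a tuple of (year, gender, name, count) and using invented rows rather than the real file, counts how often each full row appears and keeps the ones that appear more than once:

```python
from collections import Counter

# Invented rows for illustration only: (year, gender, name, count).
rows = [
    (2014, "FEMALE", "Sophia", 100),
    (2014, "FEMALE", "Sophia", 100),  # exact repeat of the row above
    (2014, "FEMALE", "Emma", 80),
]

def duplicated_rows(rows):
    """Return {row: times_seen} for every row that appears more than once."""
    occurrences = Counter(rows)
    return {row: n for row, n in occurrences.items() if n > 1}

duplicates = duplicated_rows(rows)
print(duplicates)
```

This only catches exact repeats. Rows that differ just in capitalization would need the names normalized first.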

It is a shame that a data set downloaded more than 290,000 times is incorrect. I sent an email to the Department of Health and Mental Hygiene describing my bad experience, and I also filed a formal complaint.

Maybe it was time to give up, but according to a page with the meaning of "Carla", I have the "power and ability to choose my destiny and achieve anything I want in life."

After that dose of self-help, I started looking for a database that did work and came across a package called "babynames," created by Hadley Wickham, the same person who created the ggplot2 and tidyverse packages. If this is not a sign, what is?

Thanks to him, I discovered the information collected by the Social Security Administration, which has compiled the names of people born in the United States since 1926.

With this new database, I could look up how many people with my name were born between 2004 and 2019, and the results lined up with states that have big Latino communities.

Now that I had New York name data I could actually use, I looked up the name "Carla" to see how it evolved over the years and when it was most popular.

The moral of this story: never believe in databases unquestioningly, test them until you get bored, and if nothing else works, keep searching, because someone has probably already had the same problem. 🙃