Empty Cart

News & Blog

What We Say

Data cleansing with Excel

Once again, I turned to Excel for a data cleanup project. It occurred to me that this would make a good blog article. Here's some techniques I use to help me handle data cleanup.

Keeping data together

Occasionally, I want to remove blank or incomplete rows. Or, I need to sort by a specific field (for example CITY) so I can find data that are in the wrong columns. If I sort by City, without a particular key, I could potentially lose the correct order for my data. The solution is to add a column at the beginning of the data. Use this to restore data to the original sort order. With your data sorted correctly, before you start working on data cleanup, in the new column, start to number the cells 1,2,3 and then highlight the three cells and double click the lower right corner. This will autofill the column to the last line that has data in the column to the right.

Before I do any clean up I always click on the upper left corner to select all data, copy it and then paste it into another worksheet. This is my backup data that I can quickly review to ensure I haven't lost anything. I use sorting quite a bit. I typically work a lot with contact data. I want to do things like take all CAPS and convert them to Proper case. I want to normalize data - in other words, make all States two letters instead of spelled out. I have some formulas I use to make telephone numbers the same format. Don't use format cells to build phone numbers. I have found that when you save it as a CSV format, sometimes the data gets converted to a funky serial number instead of a phone number. Most annoying.

Combining Data

Some data has contact names split into first name and last name. If you want to have it as a full name instead, use the Concatenate formula. It is one of my favorite formulas. One of my customers will call me periodically and say "Pat, what is that funky coracucate... you know what I mean...formula." Cracks me up every time.  This function can be used after splitting data apart which can then be sorted, put in the correct order, and then reform a field using the Concatenate function.

Often I will add text to a field using the same formula. To add additional text, you just include the text, enclosed in Double Quotes, as a parameter you pass to the Concatenate formula. For example, say you have a field that describes the source of the data you are importing. I also include this for reference later. In my CRM, ACT, I can import data into a note. So, assume I have a column, and the first cell is L2, that has the names of the original source. I can build another column and put the following formula in the first cell: =concatenate(l2,"- Original data source"). This will build a text field that has the source in column L plus the text "- Original data source" attached.  Copy the formula by double clicking on the lower right corner to auto fill all rows.  Copy the converted data and paste it back in place using Paste Special - Values. Now, I can do a lookup on Original data source and find any records that have that in a note.

Data Cleaning

Other formulas I use often to clean data are =Proper (to turn all upper case text into friendlier looking text) and =Clean which removes unwanted non-printable characters from text. These nasty critters can really mess up a data import. Another similar formula is =Trim which removes extra spaces.

If I am going to create a comma separated file from Excel, I will click on the upper left corner to highlight all data and then do a Find and Replace for any commas. I put in a comma for the find and leave the replace field empty. Don't do this, however, if you have data that has no spaces immediately following a comma.

Zip codes

Zip code fields can be interesting all by themselves. I often run into a situation where the client has exported the data, and Excel, being ever so helpful, has ensured that all zip codes that start with zero no longer do. Sigh. That means I have to find all the rows that are missing the leading zero and add one, tricking Excel into letting me enter a leading zero.

First thing I need to do is find the rows that are only 4 characters in length - this assumes I don't have the full zip codes that have the extra 4 digits after a dash. Assuming this is not the case, what I do is add a column next to the Zip code column. I then enter the following formula in the first cell - =len(xx) where xx is the cell immediately to the left of my column. What this will tell me is the length of the cell. I then highlight all my data, sort by the new column showing lengths and all the ones that are 4 characters are my ones with missing zeros.

You could even use this for the ones that have the extended zip codes. If they have a dash, the correct length would be 10 - if there are any that are 9 you once again have your zip codes with missing zeros. Now that I have those rows, I next add another column and add in the following formula =TEXT(XX, “00000”). The XX is the cell where zip codes with missing zeros start. I drag this formula down to the last occurrence of the bad zip codes. It will add in my leading zero. I know it looks strange but it really works. You probably think this is a too much extra work. Try adding extra zeros to over 5,000 lines. I had to do this one time and I made sure I came up with a solution that was faster than editing 5K lines of data.

That's it for now. Hope you find this helpful.

Patricia Egen Consulting, LLC

803 Creek Overlook, Chattanooga, TN 37415
Main office: 423-875-2652 • Arizona office: 480-788-7504 • Florida office: 754-300-2827