I live in a country whose alphabet goes beyond the English set of characters. It is not your usual A-Z; every so often I encounter characters that don't belong to that set. Reading, writing and pronouncing them is not a serious challenge. Getting scripts to crunch them is. The language is simply not programming friendly.
I use Python scripts to process large datasets. Python is pretty powerful and it gets the job done. Only when it encounters these special characters is it brought to its knees. The error might be familiar to fellow "Parseltongues" (the moniker some give to Python coders). For me, the most common offender is the character "ñ".
The error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 20: ordinal not in range(128)
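For anyone who wants to see this class of error on demand, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on my part (the message above comes from Python 2, where the character is shown as u'\xf1'), and the sample string is only an illustration.

```python
# Trying to squeeze "ñ" into ASCII fails, because ASCII only covers code points 0-127.
place = "Dasmariñas, Cavite"

try:
    place.encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)      # human-readable description of the failure
    position = exc.start    # index of the first character that could not be encoded
    print(message)
    print("offending character:", place[position])
```

The position reported in the message is the index of the offending character within the string, which helps when hunting for it in a long record.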
Data will not be accurate if these characters are not processed properly. The error above is actually a welcome warning, since there are times when a script simply drops the character without showing any error at all. In that case, places like "Dasmariñas, Cavite" and "Biñan, Laguna" become "Dasmarias, Cavite" and "Bian, Laguna". People's names aren't spared either. And the list goes on.
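One way this kind of silent loss can happen (not necessarily what my scripts were doing) is encoding to ASCII while telling Python to ignore anything it cannot represent. A small sketch in Python 3 syntax, using the place names above:

```python
# errors="ignore" drops unencodable characters instead of raising an error.
places = ["Dasmariñas, Cavite", "Biñan, Laguna"]

stripped = [p.encode("ascii", errors="ignore").decode("ascii") for p in places]
print(stripped)  # ['Dasmarias, Cavite', 'Bian, Laguna']
```

No exception, no warning, and the data is quietly wrong.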
Processing these manually is not an option, so I had to find a solution to the dilemma. Encoding text to UTF-8 was suggested. I used to do it one string at a time, using the notation string.encode('utf-8').
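As a rough illustration of that per-string approach, here is a sketch in Python 3 syntax (an assumption; in Python 2 the literal would be written with a u prefix). The variable names are made up for the example.

```python
# Explicitly encode one string to UTF-8 bytes, then decode it back.
name = "Biñan, Laguna"

encoded = name.encode("utf-8")
print(encoded)  # b'Bi\xc3\xb1an, Laguna'  ("ñ" takes two bytes in UTF-8)

decoded = encoded.decode("utf-8")
print(decoded == name)  # True: nothing was lost in the round trip
```

It works, but it has to be repeated everywhere a string leaves the script, which gets tedious on large datasets.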
Then I discovered a suggestion to set UTF-8 as the default encoding for the whole script:
    import sys
    # Python 2 only: reload(sys) brings back setdefaultencoding, which is removed from sys at startup
    reload(sys)
    sys.setdefaultencoding('utf8')
Inserting these lines at the beginning of a Python 2 script sets the default character encoding to UTF-8. I found the solution on this blog: http://www.markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/.
I spent a significant amount of time working around this pain point. So far, this solution has helped me a lot in processing data. I hope it helps someone out there.
As always, your mileage may vary. There are a lot of posts regarding this solution -- caveats, disadvantages and bugs. Read and understand them, and be aware that they exist.
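One alternative that often comes up in those posts is to leave the default alone and state the encoding explicitly wherever text enters or leaves the script. A minimal sketch under that assumption; the helper names and the file name are hypothetical, and it is written for Python 3 (io.open also exists in Python 2):

```python
import io
import os
import tempfile


def write_places(path, places):
    # Encode to UTF-8 only at the point where text is written out.
    with io.open(path, "w", encoding="utf-8") as handle:
        for place in places:
            handle.write(place + "\n")


def read_places(path):
    # Decode from UTF-8 only at the point where text is read in.
    with io.open(path, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Hypothetical file in a temporary directory, just for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "places.txt")
write_places(path, ["Dasmariñas, Cavite", "Biñan, Laguna"])

places = read_places(path)
print(places)
```

Whether this is less work than the three-line fix depends on how many places your scripts read and write text.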