You've got the data, so the hard part is actually done. Now just pick a method and don't worry too much about the details. Go with the easiest route first. For example, treat each emoji as a word: there's no need for your model to treat a smiley face differently from a word like "happy" _until you see how the representation actually matters_. Only then will you know what the right way is. If you pick a complicated method first, you won't know whether it was worth it. If you pick a simple method first, you'll find out quickly if it doesn't work, and you won't have wasted much time.
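To make "each emoji is a word" concrete, here is a minimal sketch of what I mean, not a recommendation of any particular library. The names `TOKEN_RE`, `tokenize` and `bag_of_words` are made up for illustration, and the regex is an assumption: it treats any non-space, non-word character as its own token, so multi-codepoint emoji (skin-tone modifiers, ZWJ sequences) would get split into pieces. That's probably fine for a first pass.

```python
import re
from collections import Counter

# Hypothetical tokenizer: a run of word characters is one token, and any
# other non-space character (emoji, punctuation) is a token by itself.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into word and symbol tokens."""
    return TOKEN_RE.findall(text.lower())


def bag_of_words(text):
    """Count tokens; emoji are counted exactly like ordinary words."""
    return Counter(tokenize(text))
```

Feed the counts into whatever simple model you like. If it later turns out that emoji need special handling, you'll have a baseline to compare against.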

Re: lemmatisation, either grab an existing machine-readable dictionary [1] if you just want all lemmas with their ambiguity, or do simple PoS tagging with something like https://honnibal.wordpress.com/2013/09/11/a-good-part-of-spe... (I love that post for its no-nonsense approach).

[1] http://wiki.apertium.org/wiki/Using_an_lttoolbox_dictionary has some (and will tokenise for you while analysing).